Custom ML Packages

Installation of Custom ML Libraries

While we try to include as many common ML frameworks and versions as we can in ML-Toolkit, we recognize that a custom installation may sometimes be preferable. We recommend using conda-env-mod to install and manage Python packages. Please follow the steps carefully; otherwise you may end up with a faulty installation. The example below shows how to install TensorFlow in your home directory.

Install

Step 1: Unload all modules and start with a clean environment.

module purge

Step 2: Load the anaconda module with desired Python version.

module load anaconda

Step 2A: If the ML application requires CUDA and cuDNN, load the appropriate modules. Be sure to check that the versions you load are compatible with the desired ML package.

module load cuda
module load cudnn

Many machine-learning packages, including PyTorch and TensorFlow, now provide installation pathways that include the full cudatoolkit within the environment, making it unnecessary to load these modules.

Step 3: Create a custom anaconda environment. Make sure the Python version matches the Python version in the anaconda module.

conda-env-mod create -n env_name_here

Step 4: Activate the anaconda environment by loading the modules displayed at the end of step 3.

module load use.own
module load conda-env/env_name_here-py3.8.5

Step 5: Now install the desired ML application. You can install multiple Python packages at this step using either conda or pip.

For TensorFlow, as of 2024, the recommended approach is to use pip; see the TensorFlow GPU install guide.

pip install --ignore-installed 'tensorflow[and-cuda]'

For PyTorch, the recommended approach is to use conda; see the PyTorch website.

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
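After installing PyTorch, a quick check confirms that the package imports and can see the GPU. Run this on a GPU node; torch.cuda.is_available() will report False on a login node without GPUs.

```shell
# Verify the PyTorch installation and GPU visibility.
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
```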

If the installation succeeded, you can now proceed to testing and using the installed application. You must load the environment you created as well as any supporting modules, such as anaconda, whenever you want to use this installation. If your installation did not succeed, please refer to the troubleshooting section below as well as documentation for the desired package you are installing.
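For example, a typical interactive session or job script would reload the environment like this (using the environment name and Python version from the example above; substitute your own):

```shell
# Start clean, then load anaconda and the custom environment.
module purge
module load anaconda
module load use.own
module load conda-env/env_name_here-py3.8.5
```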

Note that loading the modules generated by conda-env-mod behaves differently from running conda create -n env_name_here followed by source activate env_name_here. After running source activate, you may no longer be able to access Python packages from the anaconda or ml-toolkit modules. For this reason, conda-env-mod is the preferred way of using your custom installations.

Testing the Installation

Verify the installation by using a simple import statement, like the one below for TensorFlow:

python -c "import tensorflow as tf; print(tf.__version__);"

A successful import of TensorFlow will print a variety of system and hardware information. This is expected.

If importing the package leads to errors, be sure to verify that all dependencies for the package have been managed and the correct versions installed. Dependency issues between Python packages are the most common cause of errors. For example, in TensorFlow, conflicts with the h5py or numpy versions are common, but upgrading those packages typically solves the problem. Managing dependencies for ML libraries can be non-trivial.
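For example, if the import fails with an h5py or numpy version conflict, upgrading those packages inside the activated environment is usually sufficient. This is a sketch; the exact packages to upgrade depend on the error message you see.

```shell
# Upgrade commonly conflicting dependencies in the active environment.
pip install --upgrade h5py numpy
```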

Next, test using the TensorFlow installation for a GPU run. This uses the matrix multiplication example from the TensorFlow documentation.

# filename: matrixmult.py
import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
tf.debugging.set_log_device_placement(True)

# Place tensors on the CPU
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run on the GPU
c = tf.matmul(a, b)
print(c)

Run the example:

python matrixmult.py

This will produce output similar to:

Num GPUs Available:  3
2022-07-25 10:33:23.358919: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-25 10:33:26.223459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22183 MB memory:  -> device: 0, name: NVIDIA A30, pci bus id: 0000:3b:00.0, compute capability: 8.0
2022-07-25 10:33:26.225495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22183 MB memory:  -> device: 1, name: NVIDIA A30, pci bus id: 0000:af:00.0, compute capability: 8.0
2022-07-25 10:33:26.228514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22183 MB memory:  -> device: 2, name: NVIDIA A30, pci bus id: 0000:d8:00.0, compute capability: 8.0
2022-07-25 10:33:26.933709: I tensorflow/core/common_runtime/eager/execute.cc:1323] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
2022-07-25 10:33:28.181855: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
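The printed tensor is simply the matrix product of a and b, which you can verify without TensorFlow using plain Python:

```python
# Multiply the same 2x3 and 3x2 matrices by hand to check the result.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# c[i][j] is the dot product of row i of a with column j of b.
c = [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]
print(c)  # [[22.0, 28.0], [49.0, 64.0]]
```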

For more details, refer to the TensorFlow User Guide.

Troubleshooting

In most situations, dependency conflicts among Python packages are the cause of errors. If you cannot use a Python package after installing it, follow the steps below to find a workaround.

Unload all modules:

module purge

Clean up PYTHONPATH:

unset PYTHONPATH

Next, load the modules, such as anaconda and your custom environment:

module load anaconda
module load use.own
module load conda-env/env_name_here-py3.8.5

For GPU-enabled applications, you may also need to load the corresponding cuda/ and cudnn/ modules.

Now try running your code again.

A few applications only run on specific versions of Python, such as Python 3.6. Check the documentation of your application if that is the case.
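To see exactly which Python version your loaded environment provides, you can check from within the interpreter itself:

```python
import sys

# Print the interpreter version as major.minor.micro, e.g. 3.8.5.
print(".".join(str(n) for n in sys.version_info[:3]))
```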

If you have installed a newer version of an ml-toolkit package, such as a newer version of PyTorch or TensorFlow, make sure that the ml-toolkit modules are not loaded. In general, RCAC recommends that you do not mix ml-toolkit modules with your custom installations.
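To confirm that no ml-toolkit modules are loaded and that python resolves to your custom environment, inspect the current module list and the interpreter path (a sketch; module names on your cluster may differ):

```shell
# List currently loaded modules and look for ml-toolkit entries.
module list

# Confirm which python binary is first on the PATH.
which python
```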

GPU-enabled ML applications often depend on specific versions of CUDA and cuDNN. For example, TensorFlow 1.5.0 and higher required CUDA 9. Check the application documentation for these dependencies.

TensorBoard

You can visualize data from a TensorFlow session using TensorBoard. For this, you need to save your session summary as described in the TensorBoard User Guide.

Launch TensorBoard:

python -m tensorboard.main --logdir=/path/to/session/logs

When TensorBoard is launched successfully, it will give you the URL for accessing TensorBoard.

<... build related warnings ...>
TensorBoard 0.4.0 at http://gilbreth-a000.rcac.purdue.edu:6006

Follow the printed URL to visualize your model.

Due to firewall rules, the TensorBoard URL may only be accessible from Gilbreth nodes. If you cannot access the URL directly, you can use Firefox in ThinLinc.

For more details, refer to the TensorBoard User Guide.

Back to the Running Jobs section