GPU usage documentation
Selecting a GPU to carry out a task may look like a trivial chore, but it matters more than it seems: we shouldn't be using a GPU that a colleague is already using if there is a free GPU available.
In this documentation we are going to cover how to check which GPUs are being used, who is using each GPU, how to select a specific GPU, and the relationship between batch size and memory usage.
1. NVIDIA SMI tutorial
The NVIDIA System Management Interface (nvidia-smi) is a command line utility that allows us to query the current state of the GPUs and the processes running on them.
The SMI can be accessed by running the command nvidia-smi in a terminal.
Running it, we get the following output:
user@machinelearning2:~$ nvidia-smi
Thu Oct 7 15:56:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:18:00.0 Off | N/A |
| 51% 55C P2 110W / 350W | 2329MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 0% 46C P8 27W / 260W | 158MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 Off | 00000000:86:00.0 Off | N/A |
| 34% 44C P8 36W / 350W | 2MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 0% 38C P8 20W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 27689 C python 2327MiB |
| 1 N/A N/A 7492 C python 155MiB |
+-----------------------------------------------------------------------------+
The first table contains information about the GPUs. In the first column we have, mainly, the GPU number, the model, the temperature and the current power draw. In the next column we find the amount of memory the GPU is currently using. Lastly, we find the volatile GPU utilization. You can check out more information about this in this StackOverflow post.
The second table contains information about the processes currently running on each GPU. If you want more information about a specific process, you can use ps aux. Check the next section to find out how.
Note: the command gpustat is installed on both gea_1 and gea_2 and works in a similar fashion to nvidia-smi. We will focus solely on nvidia-smi in this tutorial, but you can check out gpustat here.
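If you need the same information from a script, nvidia-smi also offers a machine-readable query mode (--query-gpu together with --format=csv are standard nvidia-smi options). The following is a minimal Python sketch under that assumption; the file and function names are ours, not part of any project:

# check_gpus.py - minimal sketch, not part of the template
import subprocess

def gpu_memory_summary():
    """Print index, name and memory usage (MiB) for every visible GPU."""
    query = [
        "nvidia-smi",
        "--query-gpu=index,name,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(query, capture_output=True, text=True, check=True).stdout
    for line in output.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        print(f"GPU {index} ({name}): {used} MiB / {total} MiB in use")

if __name__ == "__main__":
    gpu_memory_summary()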
2. Checking out processes using ps aux
The ps aux command gives us information about the processes that are currently running. If we run the command ps aux | grep PID, we get the information for the process with that PID.
Let's see what happens when we use the command with process 7492, which we saw running in the previous output:
user@machinelearning2:~$ ps aux | grep 7492
elopez 7492 1081 8.4 30075588 11168984 pts/64 Dl+ oct06 18199:29 python train_2terms.py
jcribei+ 14683 0.0 0.0 14364 964 pts/32 S+ 15:57 0:00 grep --color=auto 7492
Here we can see the user running the process (elopez), when the process was started (oct06) and what exactly is being run (python train_2terms.py).
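This lookup can also be scripted. Below is a minimal Python sketch, assuming the PID comes from the nvidia-smi processes table and relying on the standard ps output columns user, lstart and cmd; the helper name is ours:

# whois_pid.py - minimal sketch, not part of the template
import subprocess

def describe_process(pid: int) -> str:
    """Return owner, start time and command line of a process (header line included)."""
    result = subprocess.run(
        ["ps", "-o", "user,lstart,cmd", "-p", str(pid)],
        capture_output=True, text=True, check=True,  # raises if the PID does not exist
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(describe_process(7492))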
3. Criteria for selecting a GPU for training
Now that we have established how to check GPU usage and individual processes, it's time to talk about which GPU to select for training and when.
Put plainly, you should avoid using a GPU that someone else is currently using. If that's not possible, select a GPU that you know has enough free memory to support all of the tasks running on it, or wait for one to free up. Why? When a GPU is already running a process, launching another one on top of it queues more work onto the GPU, slowing down all the processes the GPU is running. Besides that, running multiple experiments on the same GPU can cause other problems (for example, running out of GPU memory) that can end in a crash.
Up to ai-project-template v0.6.0, selecting a GPU was a manual task, but since that version there is a full suite of options for correctly selecting a GPU for your experiment:
- Legacy GPU manual selection: the classical way of selecting a GPU using its ID.
- Assisted manual selection: similar to the manual one, but before launching the experiment the user is presented with the available GPUs and the state of each one, so they can judge which one to use.
- Automatic selection: an algorithm determines which GPU to use based on the available memory and status of the GPUs in the server.
You are free to use any of these three ways to select a GPU; the next section explains how to use each of them.
4. How to select a GPU for training: manual, assisted and auto
Check out the experiment stage from ai-project's dvc.yaml file:
# dvc.yaml
# . . .
run_experiment_mlflow:
  cmd: export MLFLOW_TRACKING_URI="http://10.10.30.58:8999/" &&
    mlflow run . --experiment-name {{cookiecutter.__project_slug}} --no-conda
    -P dataset=configs/datasets/{{cookiecutter.dataset}}.py
    -P model=configs/models/resnet_18.py
    -P runtime=configs/runtimes/runtime.py
    -P scheduler=configs/schedulers/one_cycle_8_epochs.py
    -P gpu=2
  deps:
    - configs
    - results/data/transform/coco_to_mmclassification-{{cookiecutter.dataset}}
  metrics:
    - results/metrics.json:
        cache: false
  plots:
    - results/prc.json:
        cache: false
        x: recall
        y: precision
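Assuming the stage is launched through the standard DVC workflow (the exact launch command is not shown in the excerpt above, so take this as an assumption), it would typically be reproduced from the project root with:

user@machinelearning2:~$ dvc repro run_experiment_mlflow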
Here, the gpu argument indicates which GPU selection mode will be used (a small dispatch sketch follows the list below):
- If the gpu argument isn't present in the command, or if its value is gpu=-2, the assisted mode will be used. This is the default mode for GPU selection.
- If gpu=-1, the automatic mode will be used, so the user doesn't have to select anything.
- Otherwise, if gpu=n, where n is any non-negative integer (0 included), the legacy mode is activated and the GPU whose ID equals n will be used.
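To make the rules above explicit, here is a small Python sketch of the dispatch; this is our own illustration of the behaviour described in this section, not the template's actual implementation:

# gpu_mode.py - illustrative sketch, not the template's code
def resolve_gpu_mode(gpu=None):
    """Map the gpu argument to a (mode, gpu_id) pair."""
    if gpu is None or gpu == -2:   # argument missing or -2: default behaviour
        return ("assisted", None)
    if gpu == -1:                  # -1: let the algorithm pick a GPU
        return ("automatic", None)
    if gpu >= 0:                   # 0, 1, 2, ...: legacy mode, use that exact ID
        return ("legacy", gpu)
    raise ValueError(f"unsupported gpu value: {gpu}")

assert resolve_gpu_mode() == ("assisted", None)
assert resolve_gpu_mode(gpu=-1) == ("automatic", None)
assert resolve_gpu_mode(gpu=2) == ("legacy", 2)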
For example, take a look at the GPUs using the nvidia-smi command:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:18:00.0 Off | N/A |
| 51% 55C P2 110W / 350W | 2329MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:3B:00.0 Off | N/A |
| 0% 46C P8 27W / 260W | 158MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 Off | 00000000:86:00.0 Off | N/A |
| 34% 44C P8 36W / 350W | 2MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:AF:00.0 Off | N/A |
| 0% 38C P8 20W / 260W | 3MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
In legacy mode, gpu=2 would refer to the third GPU in the table (ID 2). Analogously, in assisted mode the user would be presented with this output:
__ __
___ ____ __ _____ / /____ _/ /_
/ _ `/ _ \/ // (___/ __/ _ `/ __/
\_, / .__/\_,_/___/\__/\_,_/\__/
/___/_/
machinelearning2 Tue Feb 1 12:46:49 2022 460.32.03
[0] GeForce RTX 3090 | 53'C, 0 % | 2329 / 24268 MB | userA(2329M)
[1] GeForce RTX 2080 Ti | 37'C, 0 % | 158 / 11019 MB | userB(158M)
[2] GeForce RTX 3090 | 60'C, 0 % | 2 / 24268 MB |
[3] GeForce RTX 2080 Ti | 38'C, 0 % | 3 / 11019 MB |
Select a GPU for operation: |
The IDs and memory figures shown correspond to the ones reported in the nvidia-smi output.
Automatic mode tries to find the GPU with the maximum available memory, available_memory = max_memory - used_memory, and breaks ties by selecting the GPU with the lowest used_memory. We strongly recommend using the assisted mode for fine-grained control over which GPU to use.
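As an illustration of that criterion, here is a minimal Python sketch; it relies on the standard nvidia-smi query options and is our own example, not the template's actual code:

# pick_gpu.py - illustrative sketch of the automatic criterion, not the template's code
import subprocess

def pick_gpu_automatically() -> int:
    """Return the index of the GPU with the most free memory, breaking ties on lowest used memory."""
    query = [
        "nvidia-smi",
        "--query-gpu=index,memory.total,memory.used",
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(query, capture_output=True, text=True, check=True).stdout
    gpus = []
    for line in output.strip().splitlines():
        index, total, used = [int(field) for field in line.split(",")]
        gpus.append({"index": index, "free": total - used, "used": used})
    # Most available memory first; lowest used memory breaks ties.
    best = max(gpus, key=lambda gpu: (gpu["free"], -gpu["used"]))
    return best["index"]

if __name__ == "__main__":
    print(pick_gpu_automatically())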
5. Relation between batch size, GPU memory and training speed
Keep this in mind at all times:
The larger the batch size, the higher the memory consumption and GPU utilization, and the faster the training.
When setting up your training, remember to maximize your batch size in order to speed up training and, at the same time, reduce the number of jobs queuing up around the GPUs.
NOTE: using a very large batch size is not always good, as it can reduce the regularizing effect of training and hurt generalization. You can read more about it here.
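If you are unsure how large a batch your GPU can hold, one practical option is to probe it empirically. The following is a minimal PyTorch sketch, assuming a classification model and a hypothetical make_batch(n) helper that returns n input/target samples; it is our own example, not part of the template:

# probe_batch_size.py - illustrative sketch, not part of the template
import torch
import torch.nn.functional as F

def find_max_batch_size(model, make_batch, device="cuda", start=512):
    """Halve the batch size until one forward/backward pass fits in GPU memory."""
    model = model.to(device)
    batch_size = start
    while batch_size >= 1:
        try:
            inputs, targets = make_batch(batch_size)   # hypothetical helper
            loss = F.cross_entropy(model(inputs.to(device)), targets.to(device))
            loss.backward()
            model.zero_grad(set_to_none=True)
            return batch_size
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise                                  # not an OOM error, re-raise it
            torch.cuda.empty_cache()                   # release the partial allocation
            batch_size //= 2
    raise RuntimeError("even a batch size of 1 does not fit on this GPU")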
6. Recap and conclusions
Summing up:
Before training, check for available GPUs using nvidia-smi. Decide which GPU selection mode to use: aim for the assisted mode for fine-grained control, or the automatic mode for fast and easy deployment. Finally, try to maximize the batch size to train efficiently and reduce queues around the GPUs.