-
Description: low GPU utilization usually indicates inefficient use of the allocated GPUs
-
Criterion:
- nv_util << 1
- (or nv_mem_util << 1)
-
Works for shared jobs with existing metrics:
- yes, if a GPU is allocated to at most one compute job
-
Possible false positives: *
-
Possible cures/workarounds:
- check MPI-rank-GPU affinity/visibility
- check the suitability of the GPU setup/configuration for the given workload
- check the application's internal GPU distribution settings
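The rank-GPU affinity check can be sketched as follows. This is a minimal illustration, not a prescribed procedure: it reads the node-local rank from the environment variables set by common launchers and reports which GPU that rank would get under a round-robin mapping. The round-robin policy, the function names, and `num_gpus=4` are assumptions made for the example.

```python
import os

# Environment variables that common launchers set to the node-local rank.
LOCAL_RANK_VARS = (
    "SLURM_LOCALID",                # Slurm srun
    "OMPI_COMM_WORLD_LOCAL_RANK",   # Open MPI
    "MV2_COMM_WORLD_LOCAL_RANK",    # MVAPICH2
    "MPI_LOCALRANKID",              # Intel MPI
)


def local_rank(env):
    """Return the node-local rank found in env, or None if no variable is set."""
    for name in LOCAL_RANK_VARS:
        value = env.get(name)
        if value is not None:
            return int(value)
    return None


def gpu_for_rank(rank, num_gpus):
    """Map a node-local rank to a GPU index (round-robin, an assumed policy)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return rank % num_gpus


def report_affinity(env, num_gpus):
    """Describe which GPU a rank would use and what it can currently see."""
    rank = local_rank(env)
    visible = env.get("CUDA_VISIBLE_DEVICES", "<unset>")
    if rank is None:
        return f"no local rank found; CUDA_VISIBLE_DEVICES={visible}"
    gpu = gpu_for_rank(rank, num_gpus)
    return f"local rank {rank} -> GPU {gpu}; CUDA_VISIBLE_DEVICES={visible}"


if __name__ == "__main__":
    # num_gpus=4 is a hypothetical node configuration.
    print(report_affinity(os.environ, num_gpus=4))
```

Running this once per rank (for example through the job launcher) shows whether several ranks end up on the same GPU while others stay idle, which is a common cause of low average utilization.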
Input:
* metric acc_utilization in the accelerator scope
* metadata field numAcc
* parameter threshold for a single accelerator
Rule:
load_mean = acc_utilization.mean('all')
load_threshold = job.numAcc * threshold
lowload = load_mean < load_threshold
Output: lowload is True
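
The rule can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the rule engine's actual implementation: `acc_utilization` is taken to be a list of sample series, one per accelerator, and `mean('all')` is interpreted as the mean over all samples of all accelerators. The function name `low_gpu_load` is hypothetical.

```python
from statistics import fmean


def low_gpu_load(acc_utilization, num_acc, threshold):
    """Evaluate the rule on per-accelerator utilization series.

    acc_utilization: list of sample lists, one list per accelerator
    num_acc: number of allocated accelerators (job.numAcc)
    threshold: threshold parameter for a single accelerator
    """
    # Assumed meaning of mean('all'): average over every sample of every accelerator.
    samples = [value for series in acc_utilization for value in series]
    if not samples:
        raise ValueError("no utilization samples")
    load_mean = fmean(samples)
    load_threshold = num_acc * threshold
    return load_mean < load_threshold


if __name__ == "__main__":
    # Hypothetical data: two accelerators, two samples each.
    idle_job = [[5.0, 10.0], [0.0, 5.0]]
    busy_job = [[90.0, 95.0], [85.0, 90.0]]
    print(low_gpu_load(idle_job, num_acc=2, threshold=10.0))
    print(low_gpu_load(busy_job, num_acc=2, threshold=10.0))
```

Because the threshold is scaled by `num_acc` while the mean is taken as defined above, the result depends on how `mean('all')` aggregates across accelerators; the sketch follows the rule as written.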