-
Description: Generally high memory usage, or a steady increase in memory usage over time, can be dangerous if the user isn't aware of it, as it can easily lead to job cancellation due to OOM (out of memory).
-
Criterion:
- high memory usage: the memory used by a process is very high compared to the memory requested by the job, e.g., used >= 0.9 * requested. This should be seen as a warning because of the danger of the job failing by running out of memory.
- systematic increase of memory usage: the memory used by the job grows roughly linearly with time over large stretches of the job. This could also be seen as a warning because of the danger of the job failing by running out of memory (see the sketch after this list).
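A minimal sketch of how the two warning criteria above could be checked for a single job's memory time series. It assumes the usage samples, their timestamps, and the requested memory are already available as plain numbers; the function name, and the 0.5 growth fraction used to flag a "systematic increase", are illustrative assumptions, while the 0.9 factor is taken from the criterion above.

```python
import numpy as np

def check_memory_criteria(mem_used_gb, timestamps_s, requested_gb, high_factor=0.9):
    """Return (high_usage, systematic_increase) warning flags for one job."""
    mem = np.asarray(mem_used_gb, dtype=float)
    t = np.asarray(timestamps_s, dtype=float)

    # Criterion 1: peak usage close to the requested memory.
    high_usage = bool(mem.max() >= high_factor * requested_gb)

    # Criterion 2: usage grows roughly linearly with time. Fit a straight line
    # and flag the job if the fitted growth over the whole run is a substantial
    # fraction of the requested memory (assumed fraction, not from the text).
    slope, _ = np.polyfit(t, mem, 1)      # growth rate per second
    growth = slope * (t[-1] - t[0])       # fitted total growth over the run
    systematic_increase = bool(slope > 0 and growth >= 0.5 * requested_gb)

    return high_usage, systematic_increase
```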
-
Works for shared jobs with existing metrics:
- no, because no job-specific memory usage is recorded right now
-
Possible false positives:
- workload with naturally high memory usage or increase pattern
-
Possible cures/workarounds:
- increasing memory usage:
  - possibly a memory leak:
    - check the code
    - often: try other MPI implementations
- high memory usage:
  - request more memory
-
Missing Data:
- requested memory per job and node
- used memory per job and node
High memory usage
Input:
* Metric mem_used in node scope
* Metadata allocated_memory per node
* Parameter threshold for a node
Rule:
memory_used = mem_used.max('time')
mem_threshold = job.allocated_memory * threshold
highmem_nodes = memory_used > mem_threshold
highmem = highmem_nodes.any('all')
Output: highmem is True
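Translated into plain Python/numpy, the rule could look roughly like the sketch below. It assumes mem_used is delivered as a 2D array of shape (nodes, samples) and allocated_memory is the per-node allocation in the same unit; the function name and the default threshold of 0.9 (taken from the criterion above) are illustrative assumptions, not an existing implementation.

```python
import numpy as np

def high_memory_rule(mem_used, allocated_memory, threshold=0.9):
    """Return True if any node's peak memory use exceeds the threshold."""
    memory_used = mem_used.max(axis=1)            # max over 'time' per node
    mem_threshold = allocated_memory * threshold  # per-node limit
    highmem_nodes = memory_used > mem_threshold   # boolean per node
    return bool(highmem_nodes.any())              # any node over the limit

# Example: two nodes, five samples each, 64 GB allocated per node.
mem_used = np.array([[10.0, 20.0, 30.0, 40.0, 62.0],
                     [12.0, 15.0, 14.0, 16.0, 15.0]])
print(high_memory_rule(mem_used, allocated_memory=64.0))  # True (62 > 0.9 * 64)
```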
Memory Leak
Input:
* Metric mem_used in node scope
* Metadata duration
* Metadata allocated_memory per node
Rule:
memory_slope = mem_used.slope()
memory_slope_avg = memory_slope.avg()
memory_leak = memory_slope_avg * job.duration > job.allocated_memory
Output: memory_leak is True
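A corresponding sketch for the memory-leak rule, under the same assumptions as above: mem_used as a (nodes, samples) array, timestamps in seconds, and duration and allocated_memory taken from the job metadata. A per-node least-squares fit stands in for the slope() operation of the rule language; the function name is an assumption for illustration.

```python
import numpy as np

def memory_leak_rule(mem_used, timestamps, duration, allocated_memory):
    """Return True if the extrapolated memory growth exceeds the allocation."""
    # Least-squares slope of mem_used over time, computed per node.
    slopes = np.array([np.polyfit(timestamps, node, 1)[0] for node in mem_used])
    memory_slope_avg = slopes.mean()              # average slope over nodes
    # Extrapolate the average growth over the full job duration and compare
    # against the per-node allocation, as in the rule above.
    return bool(memory_slope_avg * duration > allocated_memory)

# Example: steady growth of ~0.02 GB/s on both nodes, 64 GB allocated,
# 6 h (21600 s) job: 0.02 * 21600 = 432 GB > 64 GB, so a leak is flagged.
timestamps = np.arange(0, 600, 60, dtype=float)   # first 10 minutes of samples
mem_used = np.vstack([2.0 + 0.02 * timestamps,
                      3.0 + 0.02 * timestamps])
print(memory_leak_rule(mem_used, timestamps, duration=21600.0,
                       allocated_memory=64.0))    # True
```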