-
Description: Generally high memory usage, or a steady increase in memory usage over time, can be dangerous if the user isn't aware of it, as it can easily lead to job cancellation due to OOM (out of memory).
-
Criterion:
- high memory usage: the memory used by a process is very high compared to the memory requested by the job, e.g., used >= 0.9 * requested. This should be seen as a warning because of the danger of the job failing by running out of memory.
- systematic increase of memory usage: the memory used by the job grows roughly linearly with time over large stretches of the job. This could also be seen as a warning because of the danger of the job failing by running out of memory (see the sketch after this list).
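A minimal sketch of how the two warning criteria above could be checked for a single job's memory time series. It assumes the usage samples, their timestamps, and the requested memory are already available as plain numbers; the function name, and the 0.5 growth fraction used to flag a "systematic increase", are illustrative assumptions, while the 0.9 factor is taken from the criterion above.

```python
import numpy as np

def check_memory_criteria(mem_used_gb, timestamps_s, requested_gb, high_factor=0.9):
    """Return (high_usage, systematic_increase) warning flags for one job."""
    mem = np.asarray(mem_used_gb, dtype=float)
    t = np.asarray(timestamps_s, dtype=float)

    # Criterion 1: peak usage close to the requested memory.
    high_usage = bool(mem.max() >= high_factor * requested_gb)

    # Criterion 2: usage grows roughly linearly with time. Fit a straight line
    # and flag the job if the fitted growth over the whole run is a substantial
    # fraction of the requested memory (assumed fraction, not from the text).
    slope, _ = np.polyfit(t, mem, 1)      # growth rate per second
    growth = slope * (t[-1] - t[0])       # fitted total growth over the run
    systematic_increase = bool(slope > 0 and growth >= 0.5 * requested_gb)

    return high_usage, systematic_increase
```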
-
Works for shared jobs with existing metrics:
- no, because no job-specific memory usage is recorded right now
-
Possible false positives:
- workload with naturally high memory usage or increase pattern
-
Possible cures/workarounds:
- increasing memory usage:
  - possibly a memory leak:
    - check the code
    - often: try other MPI implementations
- high memory usage:
  - request more memory
-
Missing Data:
- requested memory per job and node
- used memory per job and node
High memory usage
Input:
* Metric mem_used in node scope
* Metadata allocated_memory per node
* Parameter threshold for a node
Rule:
memory_used = mem_used.max('time')
mem_threshold = job.allocated_memory * threshold
highmem_nodes = memory_used > mem_threshold
highmem = highmem_nodes.any('all')
Output: highmem is True
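Translated into plain Python/numpy, the rule could look roughly like the sketch below. It assumes mem_used is delivered as a 2D array of shape (nodes, samples) and allocated_memory is the per-node allocation in the same unit; the function name and the default threshold of 0.9 (taken from the criterion above) are illustrative assumptions, not an existing implementation.

```python
import numpy as np

def high_memory_rule(mem_used, allocated_memory, threshold=0.9):
    """Return True if any node's peak memory use exceeds the threshold."""
    memory_used = mem_used.max(axis=1)            # max over 'time' per node
    mem_threshold = allocated_memory * threshold  # per-node limit
    highmem_nodes = memory_used > mem_threshold   # boolean per node
    return bool(highmem_nodes.any())              # any node over the limit

# Example: two nodes, five samples each, 64 GB allocated per node.
mem_used = np.array([[10.0, 20.0, 30.0, 40.0, 62.0],
                     [12.0, 15.0, 14.0, 16.0, 15.0]])
print(high_memory_rule(mem_used, allocated_memory=64.0))  # True (62 > 0.9 * 64)
```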
Memory Leak
Input:
* Metric mem_used in node scope
* Metadata duration
* Metadata allocated_memory per node
Rule:
memory_slope = mem_used.slope()
memory_slope_avg = memory_slope.avg()
memory_leak = memory_slope_avg * job.duration > job.allocated_memory
Output: memory_leak is True
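A corresponding sketch for the memory-leak rule, under the same assumptions as above: mem_used as a (nodes, samples) array, timestamps in seconds, and duration and allocated_memory taken from the job metadata. A per-node least-squares fit stands in for the slope() operation of the rule language; the function name is an assumption for illustration.

```python
import numpy as np

def memory_leak_rule(mem_used, timestamps, duration, allocated_memory):
    """Return True if the extrapolated memory growth exceeds the allocation."""
    # Least-squares slope of mem_used over time, computed per node.
    slopes = np.array([np.polyfit(timestamps, node, 1)[0] for node in mem_used])
    memory_slope_avg = slopes.mean()              # average slope over nodes
    # Extrapolate the average growth over the full job duration and compare
    # against the per-node allocation, as in the rule above.
    return bool(memory_slope_avg * duration > allocated_memory)

# Example: steady growth of ~0.02 GB/s on both nodes, 64 GB allocated,
# 6 h (21600 s) job: 0.02 * 21600 = 432 GB > 64 GB, so a leak is flagged.
timestamps = np.arange(0, 600, 60, dtype=float)   # first 10 minutes of samples
mem_used = np.vstack([2.0 + 0.02 * timestamps,
                      3.0 + 0.02 * timestamps])
print(memory_leak_rule(mem_used, timestamps, duration=21600.0,
                       allocated_memory=64.0))    # True
```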