-
Description: a user or work group submits many short jobs to the workload manager. The per-job pre- and post-job overheads then dominate the actual compute time, wasting resources and slowing down the workload manager.
-
Criterion:
- For 100 submitted batch jobs, the total job runtime is << 100 x (pre-job + post-job overhead)
- Multiple jobs of this category are active at any given time, or within a sliding window
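The first criterion can be checked numerically. A minimal sketch, assuming per-job overheads and runtimes are available in seconds; the interpretation of "<<" as a threshold factor (here 0.1) is an assumption:

```python
def short_job_waste(runtimes_s, pre_overhead_s, post_overhead_s, factor=0.1):
    """Flag a batch of jobs whose total runtime is dwarfed by the
    per-job overheads (total runtime << N x (pre + post overhead))."""
    n = len(runtimes_s)
    total_runtime = sum(runtimes_s)
    total_overhead = n * (pre_overhead_s + post_overhead_s)
    # "<<" interpreted here as: runtime below `factor` of the summed overhead
    return total_runtime < factor * total_overhead

# 100 ten-second jobs with 60 s pre- plus 60 s post-job overhead each:
# 1000 s of work against 12000 s of overhead -> flagged
print(short_job_waste([10] * 100, 60, 60))  # True
```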
-
Possible false positives: *
-
Possible cures / solutions:
- Aggregate multiple jobs into one batch job
- Use of job-arrays
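The job-array cure can be sketched as a Slurm batch script. This is an illustrative config fragment, not a drop-in solution: the program name `./process` and the input-file naming scheme are assumptions.

```shell
#!/bin/bash
#SBATCH --job-name=short-tasks
#SBATCH --array=0-99          # one array task per former single job
#SBATCH --time=00:05:00

# With a job array the scheduler handles one submission with 100 tasks
# instead of 100 independent job records.
./process "input_${SLURM_ARRAY_TASK_ID}.dat"
```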
// rule on batch level
{
  "name": "Too Many Jobs",
  "type": "bool",
  "tag": "batchDOS",
  "level": "jobs",
  "metric(SLURM:jobID)": "jobIDs",             // last 100 jobs for the user
  "metric(SLURM:WTIME)": "jobWALLTimes",
  "metric(SLURM:SubmitTime)": "jobSubmitTimes",
  "parameter": ["max_submits_in_period", "observation_period"], // <-- explicit list of used parameters? or can be gathered from code parsing?
  "rule_terms": [
    {"jobs_in_period": "SelectMetricIf(jobSubmitTimes, (Now - $$) < observation_period)"},
    {"dos_alert": "Cardinality($jobs_in_period) > $max_submits_in_period"},
    {"job_submission_rate": "Cardinality($jobs_in_period) / $observation_period"},
    {"max_job_submission_rate": "$max_submits_in_period / $observation_period"}
  ],
  "output": "dos_alert",
  "output_metric": "$job_submission_rate",
  "template": "The number of jobs submitted exceeded $max_job_submission_rate jobs/hour, was $job_submission_rate"
}
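The rule above boils down to a window count. A minimal Python sketch whose names mirror the rule terms; passing `now` explicitly and expressing times in hours are assumptions for illustration:

```python
def evaluate_too_many_jobs(job_submit_times, now,
                           max_submits_in_period, observation_period):
    """Mirror of the rule_terms: select recent submits, compare their
    count against the threshold, and compute the submission rates."""
    # SelectMetricIf(jobSubmitTimes, (Now - $$) < observation_period)
    jobs_in_period = [t for t in job_submit_times
                      if (now - t) < observation_period]
    # Cardinality($jobs_in_period) > $max_submits_in_period
    dos_alert = len(jobs_in_period) > max_submits_in_period
    job_submission_rate = len(jobs_in_period) / observation_period
    max_job_submission_rate = max_submits_in_period / observation_period
    if dos_alert:
        print(f"The number of jobs submitted exceeded "
              f"{max_job_submission_rate} jobs/hour, "
              f"was {job_submission_rate}")
    return dos_alert, job_submission_rate

# 150 submits at hour 9.5, checked at hour 10.0 with a 1-hour window
# and a threshold of 100 submits per window -> alert fires
alert, rate = evaluate_too_many_jobs([9.5] * 150, 10.0, 100, 1.0)
```

Note that the unit of `job_submission_rate` follows directly from the unit of `observation_period`; the template's "jobs/hour" holds only if the period is given in hours.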