* Description: A user or work group submits many short jobs to the workload manager. The pre-job and post-job overheads then outweigh the useful runtime, which wastes resources, and the workload manager itself is slowed down.
* Criterion:
  * Of the last 100 submitted batch jobs, the total job runtime is << 100 x (pre-job + post-job overhead), that is, the average runtime per job is far below the per-job overhead
  * Multiple jobs of this category are active at any time, or within a sliding window
* Possible false positives: *
* Possible cures / solutions:
  * Aggregate multiple jobs into one batch job
  * Use job arrays
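
As a rough illustration of the first cure, the following Python sketch packs a list of short commands into a few larger batch scripts. The helper names `chunk` and `render_batch_script`, the chunk size of 25, and the `./process_item` command are illustrative assumptions and not part of any existing tool. A job array would apply the same grouping with one array task per chunk.

```python
from typing import Iterator, List


def chunk(commands: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` commands."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(commands), size):
        yield commands[start:start + size]


def render_batch_script(commands: List[str], job_name: str) -> str:
    """Build one shell script that runs all commands back to back.

    Only the job name is set in the header; resource requests are
    site specific and deliberately left out of this sketch.
    """
    # "set -e" aborts the script at the first failing command (a design
    # choice for this sketch; drop it if tasks should be independent).
    lines = ["#!/bin/bash", f"#SBATCH --job-name={job_name}", "set -e"]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Hypothetical workload: 100 short tasks become 4 batch jobs of 25 tasks.
tasks = [f"./process_item {i}" for i in range(100)]
scripts = [
    render_batch_script(group, f"agg-{n}")
    for n, group in enumerate(chunk(tasks, 25))
]
```

With this grouping the pre-job and post-job overhead is paid 4 times instead of 100 times; the right chunk size depends on the runtime of the individual tasks and on the site's walltime limits.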
```
// rule on batch level
{
  "name":"Too Many Jobs",
  "type":"bool",
  "tag":"batchDOS",
  "level":"jobs",
  "metric(SLURM:jobID)":"jobIDs", // last 100 jobs for the user
  "metric(SLURM:WTIME)":"jobWALLTimes",
  "metric(SLURM:SubmitTime)":"jobSubmitTimes",
  "parameter": ["max_submits_in_period","observation_period"], // <-- explicit list of used parameters? or can be gathered from code parsing?
  "rule_terms":[
    // jobs submitted within the observation period, counted back from Now
    // (assumes $$ is the submit time of the element being tested)
    {"jobs_in_period": "SelectMetricIf(jobSubmitTimes,(Now - $$) < $observation_period)"},
    {"dos_alert": "Cardinality($jobs_in_period) > $max_submits_in_period"},
    // rates are in Jobs/Hour only if observation_period is given in hours
    {"job_submission_rate": "Cardinality($jobs_in_period)/$observation_period"},
    {"max_job_submission_rate": "$max_submits_in_period/$observation_period"}
  ],
  "output":"dos_alert",
  "output_metric":"$job_submission_rate",
  "template":"The number of jobs submitted exceeded $max_job_submission_rate Jobs/Hour, was $job_submission_rate"
}
```
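
To make the intended semantics of the rule terms concrete, here is a minimal Python sketch of the same evaluation. It assumes that submit times are Unix timestamps in seconds and that `observation_period` is given in seconds; the function name `evaluate_too_many_jobs` and the returned dictionary are hypothetical and do not correspond to an existing rule engine.

```python
from typing import Dict, List, Union


def evaluate_too_many_jobs(
    job_submit_times: List[float],
    now: float,
    observation_period: float,
    max_submits_in_period: int,
) -> Dict[str, Union[bool, float]]:
    """Evaluate the rule terms for one user (all times in seconds)."""
    # jobs_in_period: submissions younger than the observation period
    jobs_in_period = [
        t for t in job_submit_times if 0 <= now - t < observation_period
    ]
    hours = observation_period / 3600.0
    return {
        "dos_alert": len(jobs_in_period) > max_submits_in_period,
        # rates in jobs per hour, matching the wording of the template
        "job_submission_rate": len(jobs_in_period) / hours,
        "max_job_submission_rate": max_submits_in_period / hours,
    }


# Example: 90 submissions within the last 15 minutes, 10 older ones.
now = 1_000_000.0
recent = [now - 10.0 * i for i in range(90)]   # ages 0 s .. 890 s
older = [now - 7200.0 - i for i in range(10)]  # more than 2 h old
result = evaluate_too_many_jobs(recent + older, now, 3600.0, 50)
```

In this example only the 90 recent submissions fall into the one-hour window, so the alert fires against a limit of 50.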
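
The rule definition only counts submissions. The runtime criterion from the list (runtime of 100 jobs far below 100 times the pre-job plus post-job overhead) could be sketched in Python as follows. The per-job overhead estimate and the `factor` that stands in for "<<" are assumptions that would need site-specific calibration, and `is_short_job_pattern` is a hypothetical name.

```python
from typing import Sequence


def is_short_job_pattern(
    runtimes_s: Sequence[float],
    overhead_per_job_s: float,
    factor: float = 0.1,
    window: int = 100,
) -> bool:
    """Return True if the last `window` jobs ran far shorter than their overhead.

    "<<" is approximated as: total runtime < factor * window * overhead.
    """
    recent = list(runtimes_s)[-window:]
    if len(recent) < window:
        return False  # not enough jobs to judge
    return sum(recent) < factor * len(recent) * overhead_per_job_s


# 100 jobs of 2 s each against an assumed 30 s overhead per job:
# 200 s total runtime vs. a threshold of 0.1 * 100 * 30 s = 300 s.
flagged = is_short_job_pattern([2.0] * 100, overhead_per_job_s=30.0)
# 100 jobs of 10 minutes each stay well above the threshold.
not_flagged = is_short_job_pattern([600.0] * 100, overhead_per_job_s=30.0)
```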