Commit 6913245e authored by Alex Wiens

Cleanup README and extend documentation

parent 3a8388e1
# Rule Evaluation of Pathological Job Patterns
Rule evaluation engine for detection of patterns in ClusterCockpit JobArchive measurement files.
The idea is to identify jobs running on an HPC cluster with problematic (performance) behavior.
Once problematic jobs are recognized, advisors can contact the respective users to solve the problems.
For this job recognition, patterns in the background measurement data recorded with ClusterCockpit can be used.
The rules to identify the patterns need to be defined depending on the measured job metrics, the job metadata (e.g. number of allocated cores) and parameters such as threshold values.
ClusterCockpit stores the measurement data of finished jobs in Job Archive files containing metadata and metric data.
For analysis, the data needs to be loaded from the JobArchive, the rule definitions applied to the metric data and the results about matched rules stored.
This software was developed in the context of the [NHR](https://www.nhr-verein.de/) project "Automatic Detection of Pathological Jobs for HPC User Support".
## Summary
* `prule` analyses a JobArchive according to rule definitions.
* `prule.daemon` talks to ClusterCockpit and applies the `prule` analysis to newly finished jobs.
* `prule.summary` creates a user/account based summary of the analysis results.
## Documentation
* [Motivation and rationale](docs/motivation-and-rationale.md)
* [Usage](docs/usage.md)
* [Rule Format](docs/rule-format.md)
* [Rule - available data structures and functions](docs/data-structures-and-functions.md)
* [Build Python package for deployment](docs/build.md)
## Requirements
* [NumPy](https://numpy.org/) (BSD)
* [Pint](https://pint.readthedocs.io/) (BSD)
* [Jinja2](https://jinja.palletsprojects.com/) (BSD)
* [jsonschema](https://github.com/python-jsonschema/jsonschema) (MIT)
## Examples
The `rules.json` file contains example rules.
The `examples` folder contains cluster specifications and example job measurements.
## Rule file validation
`python3 schema/jsonschema-check.py schema/rule.schema.json rules.json`
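The same kind of check can be done programmatically with the `jsonschema` package (already a project dependency). The schema below is a trimmed-down, hypothetical stand-in for `schema/rule.schema.json`, only meant to illustrate the mechanism:

```python
import jsonschema  # already a prule dependency

# Hypothetical, trimmed-down stand-in for schema/rule.schema.json:
# every rule must at least have a "name" and a "tag".
rule_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["name", "tag"],
        "properties": {
            "name": {"type": "string"},
            "tag": {"type": "string"},
        },
    },
}

rules = [{"name": "Low CPU load", "tag": "lowload"}]

# Raises jsonschema.exceptions.ValidationError if a rule is malformed.
jsonschema.validate(instance=rules, schema=rule_schema)
```

For the real check, load `schema/rule.schema.json` and `rules.json` with `json.load` and pass them to `jsonschema.validate`.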
## Rule format suggestion
```
// separately defined input parameters
{
    "load_threshold_factor": 0.9
}
// rule on node level
{
    "name": "Low CPU load",
    "tag": "lowload",
    "parameter": ["load_threshold_factor"],
    "rule_terms": [ // -> array of objects to keep the order of terms (in contrast to an object with all terms)
        {"load_mean": "cpu_load.mean('all')"},
        {"load_threshold": "cores_per_node * load_threshold_factor"},
        {"lowload": "load_mean < load_threshold"},
        {"load_perc": "load_mean / load_max"}
    ],
    "output": "lowload",
    "output_scalar": "load_perc",
    "template": "This job was detected as lowload because the load %{lowload}"
},
// rule on job level, for node-exclusive jobs only
{
    "name": "Low CPU load",
    "tag": "lowload",
    "parameter": ["load_threshold_factor"], // <-- explicit list of used parameters? or can be gathered from code parsing?
    "rule_terms": [
        {"load_mean": "cpu_load.mean('all')"},
        {"load_threshold": "cores_per_node * load_threshold_factor"},
        {"lowload": "or(load_mean < load_threshold)"}, // <-- operator overloading for matrices, vectors
        {"load_perc": "load_mean / load_max"}
    ],
    "output": "lowload",
    "output_scalar": "load_perc",
    "template": "This job was detected as lowload because the load %{lowload}"
}
```
## Execute
### Rule evaluation
`python3 -m prule` will execute the tool.
Since the module has to be found by Python, you might want to add the repository folder to your `PYTHONPATH`, e.g.
```
# Run prule evaluation
PYTHONPATH="." python3 -u -m prule
# Run daemon
PYTHONPATH="." python3 -u -m prule.daemon
```
### Job processing daemon
`python3 -m prule.daemon` starts the daemon.
The daemon does two things in a loop: periodically checking the CC REST API for finished jobs and running the rule evaluation on finished jobs.
In the `config` file for the daemon, one specifies the necessary paths and flags for the behavior.
The daemon can read the input files from the filesystem or request them from CC, store the results in verbose JSON result files and in a sqlite database and pass them as tags and metadata to CC.
The daemon can also be started in a "batch mode" to process a list of jobs listed as `job ids` in a file, without the continuous finished-job check.
## Build package
The following command creates source and wheel packages in the `dist` folder:
```
python3 setup.py sdist bdist_wheel
```
# Build Python package for deployment
This describes how to build a Python package for easier deployment.
## Build package
The following command creates source and wheel packages in the `dist` folder:
```
python3 setup.py sdist bdist_wheel
```
#### `any`
Evaluates to `True` if at least one of the values in the defined set is `True`.
Parameters:
* one of [`all`,`time`,`scope`]
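As an illustration (an assumption about the semantics, mirroring NumPy rather than prule's implementation), `any` over a boolean per-node result behaves like `numpy.any`:

```python
import numpy as np

# One boolean per node: did the node fall below the load threshold?
lowload_nodes = np.array([False, True, False])

# any semantics: True if at least one value in the set is True.
job_lowload = bool(lowload_nodes.any())
print(job_lowload)  # → True
```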
# Rule format
The rule format was designed as a trade-off between expressiveness and ease of implementation.
You can look up the details in [Motivation and Rationale](motivation-and-rationale.md).
The rule definition file is a JSON file containing a list of rule objects. Here is an example:
~~~~
[{
"name":"Low CPU load",
"tag":"lowload",
"parameters": ["lowcpuload_threshold_factor","job_min_duration_seconds","sampling_interval_seconds"],
"metrics": ["cpu_load"],
"requirements": [
"job.exclusive == 1",
"job.duration > job_min_duration_seconds",
"required_metrics_min_samples > job_min_duration_seconds / sampling_interval_seconds"
],
"terms":[
{"load_mean": "cpu_load[cpu_load_pre_cutoff_samples:].mean('all')"},
{"load_threshold": "job.numHwthreads * lowcpuload_threshold_factor"},
{"lowload_nodes": "load_mean < load_threshold"},
{"lowload": "lowload_nodes.any('all')"},
{"load_perc": "1.0 - (load_mean / load_threshold)"}
],
"output":"lowload",
"output_scalar":"load_perc",
"template":"Job ({{ job.jobId }})\nThis job was detected as lowload because the mean cpu load {{ load_mean }} falls below the threshold {{ load_threshold }}."
}]
~~~~
The most interesting part is the `terms` array.
It contains a list of assignments of the form `variable = expression`, each encoded as a key-value pair in a JSON object `{"variable":"expression"}`.
During rule evaluation, the `expressions` are evaluated one by one and the results are assigned to the respective `variable`.
In this implementation, the evaluation is performed using the Python `eval` functionality.
Afterwards, the result (whether or not the rule matched) is stored in the variable named by the `output` attribute.
Additionally, an `output_scalar` variable can exist, containing a scalar value calculated during rule evaluation.
The `template` variable is filled with a helpful text, explaining why the rule matched.
If the rule matched, the `tag` can be used as a signifier for the job.
The expressions in the rule's terms can and need to access the input data.
One form of input data are parameters specified in the parameter file.
The rule definition shall list the used parameters in the `parameters` list, so it can be checked if the required parameters are present before rule evaluation (e.g. a value for `lowcpuload_threshold_factor` is defined).
The same idea is behind the `metrics` list and the `requirements` list, which contains a list of `expressions` that need to evaluate to `True` before rule evaluation is attempted.
Naturally, the rule terms also need access to the metric data of the job measurements in the job archive.
In this example, the `cpu_load` variable represents the cpu load measurement data read from the job archive.
Details about the data structure used for metric data can be found in [Data structures and functions](data-structures-and-functions.md).
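The term-by-term evaluation described above can be sketched as follows. This is a simplified illustration, not prule's actual implementation; the real evaluation environment additionally provides metric objects, unit handling and helper functions:

```python
# Simplified sketch of term-by-term rule evaluation via Python eval.
# Metric input is reduced to a scalar here for brevity.
terms = [
    {"load_threshold": "num_hwthreads * lowcpuload_threshold_factor"},
    {"lowload": "load_mean < load_threshold"},
    {"load_perc": "1.0 - (load_mean / load_threshold)"},
]

# Initial environment: parameters and (here scalar) metric input.
env = {"num_hwthreads": 64, "lowcpuload_threshold_factor": 0.9, "load_mean": 12.0}

for term in terms:
    # Each term is a single-entry JSON object: variable -> expression.
    (variable, expression), = term.items()
    env[variable] = eval(expression, {}, env)  # assign result to the variable

print(env["lowload"], env["load_perc"])
```

Each expression only sees variables that earlier terms (or the input data) have defined, which is why the terms are stored as an ordered array.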
## Debugging
Evaluation of a rule can be investigated by limiting the evaluation to one rule (e.g. `--rule "Low CPU load"`) and turning on output of debug messages (e.g. `--debug`).
Now, the details about the processed input data and the results of each rule term are printed.
# Usage
## Rule evaluation
`python3 -m prule` will execute the tool.
You can execute it without installation by adding the repository folder to your `PYTHONPATH` environment variable:
```
# Run prule evaluation from the repository folder
PYTHONPATH="." python3 -u -m prule
```
### Configuration input files
For evaluation `prule` needs certain files that describe rules, parameters and the specifics of the cluster the jobs were executed on.
* `--rules-file FILE`: describes the rules in a JSON format. See [Rule format](rule-format.md) for details.
* `--parameters-file FILE`: is the companion file for the rules and contains certain tunable parameters for specific rules.
* `--clusters-file FILE`: contains the ClusterCockpit definition for all relevant clusters, e.g. partitions, core count, memory etc.
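For illustration, a parameters file could look like this (the values are hypothetical; the parameter names are those used by the example rule in the rule format documentation):

```
{
    "lowcpuload_threshold_factor": 0.9,
    "job_min_duration_seconds": 600,
    "sampling_interval_seconds": 30
}
```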
### Measurement input files
`prule` can process single or multiple job archives.
A job archive is expected to be stored as a folder with two files: `meta.json` and `data.json`.
With `--job META_PATH DATA_PATH` a single job archive can be specified with the separate paths to the respective files.
With `--job-dir JOB_ARCHIVE_PATH` a single job archive can be specified with one path to the job archive directory.
With `--job-dir-list-file PATH_LIST_FILE` multiple job archive directories can be specified as a text file containing the directory paths.
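The two files of a job archive can be inspected directly with standard JSON tooling. A minimal sketch, using a tiny hypothetical archive created on the fly (the field contents are illustrative, not the full ClusterCockpit schema):

```python
import json
import os
import tempfile

# Sketch: a job archive directory holds meta.json and data.json.
# Create a tiny, hypothetical archive to show the layout.
archive = tempfile.mkdtemp()
with open(os.path.join(archive, "meta.json"), "w") as f:
    json.dump({"jobId": 1234, "numNodes": 2}, f)
with open(os.path.join(archive, "data.json"), "w") as f:
    json.dump({"cpu_load": {}}, f)

# Reading the archive back, as prule does for --job-dir:
with open(os.path.join(archive, "meta.json")) as f:
    meta = json.load(f)
with open(os.path.join(archive, "data.json")) as f:
    data = json.load(f)
print(meta["jobId"])
```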
### Output
The evaluation output is a JSON object with information about the evaluated rules and the results.
Several output options are available.
`--output-auto-file` will simply create a `$JOBID.json` in the respective job archive directory.
`--output-dir OUTPUT_DIR` will store all the `$JOBID.json` files in one folder.
`--output-file OUTPUT_FILE` will store all results in one JSON file containing an array with the result objects.
`--db-path DB_PATH` can be specified to add the results to a sqlite database.
### Debugging
The `--debug` parameter enables detailed output of the rule evaluation: specifics of the input data preprocessing and the result of each rule evaluation step are printed.
## Job processing daemon
`python3 -m prule.daemon` starts the daemon.
The daemon does two things in a loop:
* periodically checking the CC REST API for finished jobs
* and running the rule evaluation on the newly finished jobs.
In the `config` file for the daemon, the necessary paths and flags for the behavior can be specified.
Since the daemon executes the `prule` evaluation, the daemon also needs to know the configuration paths for `prule`.
The daemon can read the input files from the filesystem or request them from CC, store the results in verbose JSON result files and in a sqlite database and pass them as tags and metadata to ClusterCockpit.
By using the `--job-ids-file JOB_IDS_FILE` parameter, the daemon can also be started in a "batch mode" to process a list of jobs listed as `job ids` in a file and without the continuous finished job check.
A detailed description of the `prule.daemon` configuration can be found in the output of `--long-help`.
## Result summary
`python3 -m prule.summary` creates a summary of results stored in the sqlite result database.
The sqlite result database stores job metadata and the evaluation results:
* `eval`: was there an attempt to evaluate the rule
* `match`: did the rule evaluation result in a match
* `error`: did an error occur during rule evaluation
* `scalar`: value of the scalar rule output
For the summary one can use parameters to specify the selection of jobs and users/accounts.
One can summarize according to accounts, users or users with a breakdown into accounts.
It is also possible to restrict the summary to a certain cluster, user or account.
The summary can be computed for a certain number of most recent jobs (per user/account) or for the jobs of a given past timespan (e.g. the last week).
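To illustrate the kind of aggregation such a summary performs, here is a small sqlite sketch. The table layout is an assumption for illustration only, not the actual prule result schema:

```python
import sqlite3

# Hypothetical schema standing in for the prule result database;
# the real table layout may differ.
con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE results (user TEXT, tag TEXT, eval INTEGER, "match" INTEGER, error INTEGER, scalar REAL)'
)
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("alice", "lowload", 1, 1, 0, 0.79),
        ("alice", "lowload", 1, 0, 0, 0.10),
        ("bob", "lowload", 1, 1, 0, 0.55),
    ],
)

# Per-user match count for one rule tag, similar to a per-user summary.
rows = con.execute(
    'SELECT user, SUM("match") FROM results WHERE tag = ? GROUP BY user ORDER BY user',
    ("lowload",),
).fetchall()
print(rows)  # → [('alice', 1), ('bob', 1)]
```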