Commit 6913245e authored by Alex Wiens

Cleanup README and extend documentation

parent 3a8388e1
# Rule Evaluation of Pathological Job Patterns
Rule evaluation engine for detection of patterns in ClusterCockpit JobArchive measurement files.
The idea is to identify jobs running on an HPC cluster with problematic (performance) behavior.
Once problematic jobs are recognized, advisors can contact the respective users to solve the problems.
For this job recognition, patterns in the background measurement data recorded with ClusterCockpit can be used.
The rules to identify the patterns need to be defined depending on the measured job metrics, the job metadata (e.g. number of allocated cores) and parameters such as threshold values.
ClusterCockpit stores the measurement data of finished jobs in Job Archive files containing metadata and metric data.
For analysis, the data needs to be loaded from the JobArchive, the rule definitions applied to the metric data and the results about matched rules stored.
This software was developed in the context of the [NHR](https://www.nhr-verein.de/) project "Automatic Detection of Pathological Jobs for HPC User Support".
## Summary
* `prule` analyses a JobArchive according to rule definitions.
* `prule.daemon` talks to ClusterCockpit and applies the `prule` analysis to newly finished jobs.
* `prule.summary` creates a user/account based summary of the analysis results.
## Documentation
* [Motivation and rationale](docs/motivation-and-rationale.md)
* [Usage](docs/usage.md)
* [Rule Format](docs/rule-format.md)
* [Rule - available data structures and functions](docs/data-structures-and-functions.md)
* [Build Python package for deployment](docs/build.md)
## Requirements
* [NumPy](https://numpy.org/) (BSD)
* [Pint](https://pint.readthedocs.io/) (BSD)
* [Jinja2](https://jinja.palletsprojects.com/) (BSD)
* [jsonschema](https://github.com/python-jsonschema/jsonschema) (MIT)
## Examples
The `rules.json` file contains example rules.
The `examples` folder contains cluster specifications and example job measurements.
## Rule file validation
`python3 schema/jsonschema-check.py schema/rule.schema.json rules.json`
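The same kind of check can be done programmatically with the `jsonschema` package (already a project dependency). The schema below is a trimmed-down, hypothetical stand-in for `schema/rule.schema.json`, only meant to illustrate the mechanism:

```python
import jsonschema  # already a prule dependency

# Hypothetical, trimmed-down stand-in for schema/rule.schema.json:
# every rule must at least have a "name" and a "tag".
rule_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "required": ["name", "tag"],
        "properties": {
            "name": {"type": "string"},
            "tag": {"type": "string"},
        },
    },
}

rules = [{"name": "Low CPU load", "tag": "lowload"}]

# Raises jsonschema.exceptions.ValidationError if a rule is malformed.
jsonschema.validate(instance=rules, schema=rule_schema)
```

For the real check, load `schema/rule.schema.json` and `rules.json` with `json.load` and pass them to `jsonschema.validate`.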
## Rule format suggestion
```
// separately defined input parameters
{
    "load_threshold_factor": 0.9
}
// rule on node level
{
    "name": "Low CPU load",
    "tag": "lowload",
    "parameter": ["load_threshold_factor"],
    "rule_terms": [ // -> array of objects to keep the order of terms (in contrast to an object with all terms)
        {"load_mean": "cpu_load.mean('all')"},
        {"load_threshold": "cores_per_node * load_threshold_factor"},
        {"lowload": "load_mean < load_threshold"},
        {"load_perc": "load_mean / load_max"}
    ],
    "output": "lowload",
    "output_scalar": "load_perc",
    "template": "This job was detected as lowload because the load %{lowload}"
},
// rule on job level, for node-exclusive jobs only
{
    "name": "Low CPU load",
    "tag": "lowload",
    "parameter": ["load_threshold_factor"], // <-- explicit list of used parameters? or can be gathered from code parsing?
    "rule_terms": [
        {"load_mean": "cpu_load.mean('all')"},
        {"load_threshold": "cores_per_node * load_threshold_factor"},
        {"lowload": "or(load_mean < load_threshold)"}, // <-- operator overloading for matrices, vectors
        {"load_perc": "load_mean / load_max"}
    ],
    "output": "lowload",
    "output_scalar": "load_perc",
    "template": "This job was detected as lowload because the load %{lowload}"
}
```
## Execute
### Rule evaluation
`python3 -m prule` will execute the tool.
Since the module has to be found by Python, you might want to add the repository folder to your `PYTHONPATH`, e.g.
```
# Run prule evaluation
PYTHONPATH="." python3 -u -m prule
# Run daemon
PYTHONPATH="." python3 -u -m prule.daemon
```
### Job processing daemon
`python3 -m prule.daemon` starts the daemon.
The daemon does two things in a loop: periodically checking the CC REST API for finished jobs and running the rule evaluation on finished jobs.
In the `config` file for the daemon, one specifies the necessary paths and flags for the behavior.
The daemon can read the input files from the filesystem or request them from CC, store the results in verbose JSON result files and in a sqlite database and pass them as tags and metadata to CC.
The daemon can also be started in a "batch mode" to process a list of jobs listed as `job ids` in a file, without the continuous finished-job check.
## Build package
The following command creates source and wheel packages in the `dist` folder:
```
python3 setup.py sdist bdist_wheel
```
# Build Python package for deployment
This describes how to build a Python package for easier deployment.
## Build package
The following command creates source and wheel packages in the `dist` folder:
```
python3 setup.py sdist bdist_wheel
```
#### `any`
Evaluates to `True` if at least one of the values in the defined set is `True`.
Parameters:
* one of [`all`,`time`,`scope`]
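As an illustration (an assumption about the semantics, mirroring NumPy rather than prule's implementation), `any` over a boolean per-node result behaves like `numpy.any`:

```python
import numpy as np

# One boolean per node: did the node fall below the load threshold?
lowload_nodes = np.array([False, True, False])

# any semantics: True if at least one value in the set is True.
job_lowload = bool(lowload_nodes.any())
print(job_lowload)  # → True
```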
# Rule format
The rule format was designed as a trade-off between expressiveness and ease of implementation.
You can look up the details in [Motivation and Rationale](motivation-and-rationale.md).
The rule definition file is a JSON file containing a list of rule objects. Here is an example:
~~~~
[{
"name":"Low CPU load",
"tag":"lowload",
"parameters": ["lowcpuload_threshold_factor","job_min_duration_seconds","sampling_interval_seconds"],
"metrics": ["cpu_load"],
"requirements": [
"job.exclusive == 1",
"job.duration > job_min_duration_seconds",
"required_metrics_min_samples > job_min_duration_seconds / sampling_interval_seconds"
],
"terms":[
{"load_mean": "cpu_load[cpu_load_pre_cutoff_samples:].mean('all')"},
{"load_threshold": "job.numHwthreads * lowcpuload_threshold_factor"},
{"lowload_nodes": "load_mean < load_threshold"},
{"lowload": "lowload_nodes.any('all')"},
{"load_perc": "1.0 - (load_mean / load_threshold)"}
],
"output":"lowload",
"output_scalar":"load_perc",
"template":"Job ({{ job.jobId }})\nThis job was detected as lowload because the mean cpu load {{ load_mean }} falls below the threshold {{ load_threshold }}."
}]
~~~~
The most interesting part is the `terms` array.
It contains a list of assignments of the form `variable = expression`, each encoded as a key-value pair in a JSON object `{"variable":"expression"}`.
During rule evaluation, the `expressions` are evaluated one by one and the results are assigned to the respective `variable`.
In this implementation, the evaluation is performed using the Python `eval` functionality.
Afterwards, the result (whether or not the rule matched) is stored in the variable named by the `output` attribute.
Additionally, an `output_scalar` variable can exist, containing a scalar value calculated during rule evaluation.
The `template` variable is filled with a helpful text, explaining why the rule matched.
If the rule matched, the `tag` can be used as a signifier for the job.
The expressions in the rule's terms can and need to access the input data.
One form of input data are parameters specified in the parameter file.
The rule definition shall list the used parameters in the `parameters` list, so it can be checked if the required parameters are present before rule evaluation (e.g. a value for `lowcpuload_threshold_factor` is defined).
The same idea is behind the `metrics` list and the `requirements` list, which contains a list of `expressions` that need to evaluate to `True` before rule evaluation is attempted.
Naturally, the rule terms also need access to the metric data of the job measurements in the job archive.
In this example, the `cpu_load` variable represents the cpu load measurement data read from the job archive.
Details about the data structure used for metric data can be found in [Data structures and functions](data-structures-and-functions.md).
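The term-by-term evaluation described above can be sketched as follows. This is a simplified illustration, not prule's actual implementation; the real evaluation environment additionally provides metric objects, unit handling and helper functions:

```python
# Simplified sketch of term-by-term rule evaluation via Python eval.
# Metric input is reduced to a scalar here for brevity.
terms = [
    {"load_threshold": "num_hwthreads * lowcpuload_threshold_factor"},
    {"lowload": "load_mean < load_threshold"},
    {"load_perc": "1.0 - (load_mean / load_threshold)"},
]

# Initial environment: parameters and (here scalar) metric input.
env = {"num_hwthreads": 64, "lowcpuload_threshold_factor": 0.9, "load_mean": 12.0}

for term in terms:
    # Each term is a single-entry JSON object: variable -> expression.
    (variable, expression), = term.items()
    env[variable] = eval(expression, {}, env)  # assign result to the variable

print(env["lowload"], env["load_perc"])
```

Each expression only sees variables that earlier terms (or the input data) have defined, which is why the terms are stored as an ordered array.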
## Debugging
Evaluation of a rule can be investigated by limiting the evaluation to one rule (e.g. `--rule "Low CPU load"`) and turning on output of debug messages (e.g. `--debug`).
Now, the details about the processed input data and the results of each rule term are printed.
# Usage
## Rule evaluation
`python3 -m prule` will execute the tool.
You can execute it without installation by adding the repository folder to your `PYTHONPATH` environment variable:
```
# Run prule evaluation from the repository folder
PYTHONPATH="." python3 -u -m prule
```
### Configuration input files
For evaluation `prule` needs certain files that describe rules, parameters and the specifics of the cluster the jobs were executed on.
* `--rules-file FILE`: describes the rules in a JSON format. See [Rule format](rule-format.md) for details.
* `--parameters-file FILE`: is the companion file for the rules and contains certain tunable parameters for specific rules.
* `--clusters-file FILE`: contains the ClusterCockpit definition for all relevant clusters, e.g. partitions, core count, memory etc.
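For illustration, a parameters file could look like this (the values are hypothetical; the parameter names are those used by the example rule in the rule format documentation):

```
{
    "lowcpuload_threshold_factor": 0.9,
    "job_min_duration_seconds": 600,
    "sampling_interval_seconds": 30
}
```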
### Measurement input files
`prule` can process single or multiple job archives.
A job archive is expected to be stored as a folder with two files: `meta.json` and `data.json`.
With `--job META_PATH DATA_PATH` a single job archive can be specified with the separate paths to the respective files.
With `--job-dir JOB_ARCHIVE_PATH` a single job archive can be specified with one path to the job archive directory.
With `--job-dir-list-file PATH_LIST_FILE` multiple job archive directories can be specified as a text file containing the directory paths.
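The two files of a job archive can be inspected directly with standard JSON tooling. A minimal sketch, using a tiny hypothetical archive created on the fly (the field contents are illustrative, not the full ClusterCockpit schema):

```python
import json
import os
import tempfile

# Sketch: a job archive directory holds meta.json and data.json.
# Create a tiny, hypothetical archive to show the layout.
archive = tempfile.mkdtemp()
with open(os.path.join(archive, "meta.json"), "w") as f:
    json.dump({"jobId": 1234, "numNodes": 2}, f)
with open(os.path.join(archive, "data.json"), "w") as f:
    json.dump({"cpu_load": {}}, f)

# Reading the archive back, as prule does for --job-dir:
with open(os.path.join(archive, "meta.json")) as f:
    meta = json.load(f)
with open(os.path.join(archive, "data.json")) as f:
    data = json.load(f)
print(meta["jobId"])
```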
### Output
The evaluation output is a JSON object with information about the evaluated rules and the results.
Several output options are available.
`--output-auto-file` will simply create a `$JOBID.json` in the respective job archive directory.
`--output-dir OUTPUT_DIR` will store all the `$JOBID.json` files in one folder.
`--output-file OUTPUT_FILE` will store all results in one JSON file containing an array with the result objects.
`--db-path DB_PATH` can be specified to add the results to a sqlite database.
### Debugging
The `--debug` parameter enables detailed output of the rule evaluation: specifics of the input data preprocessing and the result of each rule evaluation step are printed.
## Job processing daemon
`python3 -m prule.daemon` starts the daemon.
The daemon does two things in a loop:
* periodically checking the CC REST API for finished jobs
* and running the rule evaluation on the newly finished jobs.
In the `config` file for the daemon, the necessary paths and flags for the behavior can be specified.
Since the daemon executes the `prule` evaluation, the daemon also needs to know the configuration paths for `prule`.
The daemon can read the input files from the filesystem or request them from CC, store the results in verbose JSON result files and in a sqlite database and pass them as tags and metadata to ClusterCockpit.
By using the `--job-ids-file JOB_IDS_FILE` parameter, the daemon can also be started in a "batch mode" to process a list of jobs listed as `job ids` in a file and without the continuous finished job check.
A detailed description of the `prule.daemon` configuration can be found in the output of `--long-help`.
## Result summary
`python3 -m prule.summary` creates a summary of results stored in the sqlite result database.
The sqlite result database stores job metadata and the evaluation results:
* `eval`: was there an attempt to evaluate the rule
* `match`: did the rule evaluation result in a match
* `error`: did an error occur during rule evaluation
* `scalar`: value of the scalar rule output
For the summary one can use parameters to specify the selection of jobs and users/accounts.
One can summarize according to accounts, users or users with a breakdown into accounts.
It is also possible to restrict the summary to a certain cluster, user or account.
The summary can be computed for a certain number of most recent jobs (per user/account) or for the jobs of a given past timespan (e.g. the last week).
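To illustrate the kind of aggregation such a summary performs, here is a small sqlite sketch. The table layout is an assumption for illustration only, not the actual prule result schema:

```python
import sqlite3

# Hypothetical schema standing in for the prule result database;
# the real table layout may differ.
con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE results (user TEXT, tag TEXT, eval INTEGER, "match" INTEGER, error INTEGER, scalar REAL)'
)
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("alice", "lowload", 1, 1, 0, 0.79),
        ("alice", "lowload", 1, 0, 0, 0.10),
        ("bob", "lowload", 1, 1, 0, 0.55),
    ],
)

# Per-user match count for one rule tag, similar to a per-user summary.
rows = con.execute(
    'SELECT user, SUM("match") FROM results WHERE tag = ? GROUP BY user ORDER BY user',
    ("lowload",),
).fetchall()
print(rows)  # → [('alice', 1), ('bob', 1)]
```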