# Data structures and functions

## Data structures

### Metric time series

Metrics are stored as time series in a two-dimensional matrix. The first dimension differentiates the samples in time, e.g. sample (0,0) at time 0 and sample (1,0) at time 1. The second dimension differentiates samples of different measurement instances at a topology level, e.g. sample (0,0) at core 0 and sample (0,1) at core 1.

The metric time series is stored in a `Data` object. A `Data` object contains the following structures for `N` points in time and `M` different measurement instances:

* timestamp array (N,1)
* time series samples (N,M)
* list of metadata for each column (M)

Unit information about the samples can be stored to check the validity of operations between the samples. In this implementation:

* Pint is used to store unit information with the sample values.
* The metric objects can ideally be used as normal NumPy arrays with a pre-defined list of functions.

### Topological scopes

|Scope|Description|
|--|--|
| thread | unequal cores in case of SMT |
| core | physical cores |
| socket | |
| node | |
| memoryDomain | |
| die | |
| accelerator | in case of multiple GPUs per job |
| job | |

**This list does not necessarily reflect a hierarchical order.**

## Available data

The following data is accessible:

|Data|Example|Description|
|--|--|--|
| metric time series | `cpu_load` | metrics are accessed by their names and stored as `Data` objects |
| parameters | `lowcpuload_threshold_factor` | parameters are accessed by their names and hold the values as defined in `parameters.json` |
| job metadata | `job.numHwthreads` | job metadata is accessible via the attributes of the `job` object as defined in the ClusterCockpit JobArchive job metadata JSON |
| number of allocated threads | `numthreads.nodes` | number of threads at each topology level |
| rule-specific values | `required_metrics_min_samples` | values that are generated anew for each rule evaluation |

### Metrics

Metrics can be accessed by name.
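The `Data` layout described above (timestamp array, an (N, M) sample matrix, and per-column metadata) can be sketched as follows. This is a minimal illustration only; the class and field names are assumptions, not the actual implementation, and the unit handling via Pint is omitted.

```python
import numpy as np

class Data:
    """Illustrative sketch of a metric time series container:
    N points in time, M measurement instances."""

    def __init__(self, timestamps, samples, metadata):
        self.timestamps = np.asarray(timestamps)  # shape (N,)
        self.samples = np.asarray(samples)        # shape (N, M)
        self.metadata = metadata                  # list of length M, one entry per column

# Hypothetical cpu_load metric with N=3 timestamps and M=2 cores:
cpu_load = Data(
    timestamps=[0, 60, 120],
    samples=[[0.9, 0.1],
             [0.8, 0.2],
             [0.7, 0.3]],
    metadata=[{"scope": "core", "id": 0}, {"scope": "core", "id": 1}],
)
print(cpu_load.samples.shape)  # (3, 2)
```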
See the [ClusterCockpit Schemas](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg/schema/schemas) for the metric names.

### Number of allocated threads

`numthreads` contains the number of threads for each unique scope. E.g. for a job with three allocated sockets on two nodes on a system with 64 cores per socket and 2 sockets per node, `numthreads.nodes` might be `[128,64]` and `numthreads.sockets` might be `[64,64,64]`.

### Rule-specific values

Ideally, the input values for each rule evaluation are identical and are not writable during the rule evaluation. This results in the following:

* The input values can be prepared once and used for the evaluation of each rule.
* The order of rule evaluation does not change the evaluation result.

However, there are reasons to prepare rule-specific values, e.g. rules require different sets of metrics and different metrics can have a different number of missing samples. For reasons of implementation, the rule-specific values are writable during rule evaluation. This could be changed, but would result in unnecessary copying of the input data.

#### `required_metrics_min_samples`

Specifies the minimal number of samples over all metrics in the rule-specific subset of required metrics. The intention is to use this value in the requirement checks of the rules. For example, the requirement `required_metrics_min_samples > rule_min_samples` could check if the minimal number of samples, specified by the parameter value `rule_min_samples`, is satisfied by the particular subset of metrics as defined by the rule's `metrics` list.

## Functions

Functions might be pre-defined as global functions or object functions. Additionally, operators are used.

For metrics, accumulation functions need to take the metric scope into account. For this, the scope is given as an additional argument to the function.

Example: For a metric with N samples per M entities (e.g.
10 samples for 15 cores), the mean function can operate on

* `all`: all sample values, generating 1 result value
* `time`: all time values, generating M result values
* `scope`: accumulate the samples for a given scope, e.g. 'socket', and generate samples in the number of sockets, e.g. (N,#sockets)

The idea here is to first accumulate the samples from the measured scope into the desired scope, e.g. from samples per core to samples per node, and afterwards evaluate the values, e.g. compare the node values to some pre-defined factors.

### Operators

`+`, `-`, `*`, `/`, `<`, `>` work as expected for combinations of `Data` objects and/or scalar numbers.

### Data member functions

#### `mean`

Computes the average of all samples in the defined set.

Parameters:

* one of [`all`,`time`,`scope`]

#### `mean_int`

Integrates over time to find the average value in the time dimension for the values of the defined set. This function is necessary to compute the correct average value in case the samples are not measured at a regular time interval.

Parameters:

* one of [`all`,`time`,`scope`]

#### `sum`

Computes the sum of all samples in the defined set.

Parameters:

* one of [`all`,`time`,`scope`]

#### `min`

Computes the minimal value of all samples in the defined set.

Parameters:

* one of [`all`,`time`,`scope`]

#### `max`

Computes the maximal value of all samples in the defined set.

Parameters:

* one of [`all`,`time`,`scope`]

#### `any`

Evaluates to `True` if at least one of the values in the defined set is `True`.

Parameters:

* one of [`all`,`time`,`scope`]

#### `std`

Computes the standard deviation of the samples in the defined set.

Parameters:

* one of [`all`,`time`,`scope`]

#### `slope`

Computes a linear regression over the samples in time and returns the slope.

### Other functions

#### `quantity`

Creates Pint Quantity values.
Parameters:

* value (str or other)
* unit (str, optional)

Example:

```
> quantity('1 GB/s')
<class 'pint.Quantity'> 1.0 gigabyte / second
> quantity(5, 'GB/s')
<class 'pint.Quantity'> 5 gigabyte / second
```

## Rule format

Here is an example rule:

```
{
    "name": "Memory Leak",
    "tag": "memory_leak",
    "parameters": ["memory_leak_slope_threshold"],
    "metrics": ["mem_used"],
    "requirements": ["hasattr(job, \"allocated_memory\")"],
    "terms": [
        {"memory_slope": "mem_used.slope()"},
        {"memory_slope_avg": "memory_slope.mean()"},
        {"memory_leak": "memory_slope_avg * job.duration > job.allocated_memory # allocated_memory is job memory or memory per node?"},
        {"memory_leak_perc": "memory_slope_avg * job.duration / job.allocated_memory"}
    ],
    "output": "memory_leak",
    "output_scalar": "memory_leak_perc",
    "template": "Job ({{ job.jobId }})\nThis job was detected as memory_leak because the increase of memory usage {{ memory_slope_avg }} would reach the allocated memory limit {{ job.allocated_memory }} for the full duration of the job."
}
```

There is also a JSON schema file for rule files.

### Rule attributes

#### Name (String)

The **friendly** name of the rule.

#### Description (String, optional)

A description of the rule.

#### Comment (String, optional)

Additional information about this rule.

#### Tag (String)

The tag that is set if the rule is fulfilled.

#### Parameters (List of strings, optional)

The set of parameters used by this rule. The values from this set are used to check the availability of the used parameters. If parameters are not set, the rule will not be evaluated.

#### Metrics (List of strings, optional)

The set of metrics used by this rule. The values from this set are used to check the availability of the used metrics. If metrics are not available, the rule will not be evaluated.

#### Requirements (List of strings, optional)

A list of expressions that are evaluated to check if the rule should be evaluated. An expression must result in a boolean value.
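The requirement check can be sketched as follows. This is a hedged illustration only; the function name, the use of `eval`, and the sample values are assumptions, not the actual implementation.

```python
def check_requirements(requirements, values):
    """Illustrative sketch: evaluate requirement expressions in order
    against the rule's input values; stop at the first failing one."""
    for expr in requirements:
        if not eval(expr, {}, values):  # each expression must yield a boolean
            return False                # first failure skips the rule
    return True

class Job:
    duration = 7200  # seconds; illustrative job metadata

# Hypothetical input values and requirement expressions:
values = {"job": Job(), "required_metrics_min_samples": 12, "rule_min_samples": 10}
requirements = ["hasattr(job, 'duration')",
                "required_metrics_min_samples > rule_min_samples"]
print(check_requirements(requirements, values))  # True
```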
The expressions have access to all values that can also be used in the rule terms. Results from expression evaluation are not stored in the input data for the rule evaluation. The intention is to check assumptions that must be fulfilled for the evaluation of the rule and to quickly filter out jobs that are not fit for the evaluation of this rule. The requirement expressions are evaluated in the defined order and evaluation stops once the first failing expression is found.

#### Terms (List of objects that contain a single key-value attribute)

The terms that are evaluated to check the job for unwanted behavior. The terms are evaluated in the given order. The key-value pairs essentially specify a statement of the form `key = value` that is evaluated. The result of the `value` expression is stored in the `key` variable. At the end of the `value` string, a comment can be added, prefixed by the hash character `#`. The variables named by the `output` and `output_scalar` attributes have special meaning.

#### Output (String)

Defines the name of a variable that holds the boolean decision whether the rule matched after rule evaluation. The name can be chosen freely, but the variable must be set during rule evaluation.

#### Output_scalar (String, optional)

Defines the name of a variable that holds a scalar value specifying the severity of the matched rule. The meaning of the scalar can be adjusted to the meaning of the rule; it could be a percentage or a specific quantity. The intention of this attribute is to expose a single value that shows the severity of the rule match.

#### Template (String, optional)

Contains a string that is used to explain the detected problem to a user. The string is evaluated by the `jinja2` library and can access the same values as the terms during the evaluation. This way, the template can be filled with a job-specific explanation of why the rule matched.
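The ordered term evaluation described in the attributes above can be sketched as follows. This is a hedged illustration only; the use of `eval`, the comment stripping, and the sample terms are assumptions about the implementation, not the implementation itself.

```python
# Each term "key: expression" is evaluated in order and its result is
# stored under `key`, so later terms can reference earlier results.
terms = [
    {"memory_slope_avg": "0.5  # GB per second, illustrative value"},
    {"memory_leak": "memory_slope_avg * duration > allocated_memory"},
]
values = {"duration": 7200, "allocated_memory": 2048}  # illustrative job values

for term in terms:
    (key, expression), = term.items()           # exactly one key-value pair per term
    expression = expression.split("#", 1)[0]    # drop the optional trailing comment
    values[key] = eval(expression, {}, values)  # result becomes visible to later terms

print(values["memory_leak"])  # True (0.5 * 7200 = 3600 > 2048)
```

Because each result lands in the shared value namespace, the `output` and `output_scalar` variables are simply looked up there after the last term has been evaluated.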