Rule Evaluation of Pathological Job Patterns

Rule evaluation engine for detecting patterns in ClusterCockpit JobArchive measurement files.

The idea is to identify jobs running on an HPC cluster that show problematic (performance) behavior. Once such jobs are recognized, advisors can contact the respective users to resolve the problems. Job recognition relies on patterns in the background measurement data recorded with ClusterCockpit. The rules that identify these patterns are defined in terms of the measured job metrics, the job metadata (e.g. the number of allocated cores), and parameters such as threshold values.

ClusterCockpit stores the measurement data of finished jobs in JobArchive files containing metadata and metric data. For analysis, the data is loaded from the JobArchive, the rule definitions are applied to the metric data, and the results about matched rules are stored.
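
The sketch below illustrates this workflow in general terms; it is not the prule API. It assumes a job directory containing meta.json and data.json files, and the metric and field names used here are assumptions for illustration only.

```python
# Illustrative sketch (not the prule API): load one job from an assumed
# JobArchive layout (meta.json + data.json) and evaluate a single
# hypothetical threshold rule on a metric's node-level time series.
import json
from pathlib import Path
from statistics import mean


def load_job(job_dir: str):
    """Read job metadata and metric data from an assumed JobArchive layout."""
    path = Path(job_dir)
    meta = json.loads((path / "meta.json").read_text())
    data = json.loads((path / "data.json").read_text())
    return meta, data


def low_flops_rule(meta, data, threshold=1.0e9):
    """Hypothetical rule: flag the job if the mean FLOP rate across nodes
    stays below `threshold`. Metric name and structure are assumptions."""
    series = data.get("flops_any", {}).get("node", {}).get("series", [])
    node_means = [mean(s["data"]) for s in series if s.get("data")]
    matched = bool(node_means) and mean(node_means) < threshold
    return {"rule": "low_flops", "matched": matched, "job_id": meta.get("jobId")}


if __name__ == "__main__":
    meta, data = load_job("path/to/job")  # placeholder path
    print(low_flops_rule(meta, data))     # result to be stored for advisors
```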

This software was developed in the context of the NHR project "Automatic Detection of Pathological Jobs for HPC User Support".

Summary

• prule analyses a JobArchive according to rule definitions.
• prule.daemon talks to ClusterCockpit and applies the prule analysis to newly finished jobs.
• prule.summary creates a user/account-based summary of the analysis results.

Documentation

Requirements

License

Copyright 2024, Paderborn Center for Parallel Computing (PC2)

SPDX-License-Identifier: MIT