Skip to content
Snippets Groups Projects
user avatar
af104d56
History
Name Last commit Last update
Material
PatternDocumentation
README.md

Documentation

The "Automatic Detection of Pathological Jobs for HPC User Support" project develops an automated detection engine to identify performance issues (pathological jobs) in High-Performance Computing (HPC) environments. Performance tuning in HPC is challenging, requiring deep expertise in algorithms, processor architecture, and system-level optimizations. Many research groups and users often lack this specialized knowledge, leading to common mistakes like over-parallelization and inefficient memory usage. Traditionally, HPC support staff manually detect these issues through job-specific monitoring systems.

With this project, we aim to reduce the workload on support staff by creating a configurable detection engine that analyzes job-specific metrics and identifies performance anti-patterns (PAPs). These PAPs, defined externally by experts and tailored to specific clusters, enable quick adaptation to different HPC systems. Our detection engine is designed to be monitoring-environment agnostic, supporting frameworks like ClusterCockpit, PIKA, and GEOPM.

We also include a subsystem that automatically determines the applications used in jobs, improving detection accuracy by linking performance patterns to specific applications or domains. When the engine detects a pathological job, it triggers predefined action templates (ATs) to notify users or support staff and provide steps to resolve the issue.

Currently, the project consists of the rule evaluation engine and the rule definitions: Job Pattern Rule Evaluation Engine

Links