8–9 Jul 2024
Fermi National Accelerator Laboratory
America/Chicago timezone

Monitoring HTCondor in HEPCloud's Decision Engine using Prometheus and Grafana

9 Jul 2024, 15:45
15m
One West (Fermi National Accelerator Laboratory)

Kirk Road at Pine Street, Batavia, IL 60510

Speaker

Ilya Baburashvili

Description

HEPCloud Decision Engine is a framework used by science collaborations for efficient and cost-effective provisioning of computing resources. It selects resources using heuristics, then starts Glideins and HTCondor startds on them to run the jobs.

Decision Engine uses Prometheus metrics to track how jobs are running and the overall health of Decision Engine itself. We created metrics to track the performance of de-client commands and to monitor the current number and status of jobs and glideins, as well as the number of cores and the amount of memory in use. We then added these metrics to Grafana dashboards.
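As a rough illustration of what such metrics look like when Prometheus scrapes them, the sketch below renders gauge samples in the Prometheus text exposition format using only the Python standard library. It is not Decision Engine code: the metric names (`de_jobs`, `de_cores_in_use`, `de_memory_in_use_mb`), the job records, and the helper functions are all hypothetical, and a real service would more likely use a Prometheus client library than format the text by hand.

```python
from collections import Counter


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def job_status_samples(jobs):
    """Count jobs per status and return one labelled sample per status."""
    counts = Counter(job["status"] for job in jobs)
    return [({"status": status}, count) for status, count in sorted(counts.items())]


# Hypothetical job records; real data would come from HTCondor queries.
jobs = [
    {"status": "running", "cores": 4, "memory_mb": 8192},
    {"status": "running", "cores": 1, "memory_mb": 2048},
    {"status": "idle", "cores": 8, "memory_mb": 16384},
]
running = [job for job in jobs if job["status"] == "running"]

# Hypothetical metric names, chosen only for this illustration.
text = render_gauge("de_jobs", "Current number of jobs by status.", job_status_samples(jobs))
text += render_gauge(
    "de_cores_in_use", "Cores used by running jobs.",
    [({}, sum(job["cores"] for job in running))],
)
text += render_gauge(
    "de_memory_in_use_mb", "Memory in MB used by running jobs.",
    [({}, sum(job["memory_mb"] for job in running))],
)
print(text)
```

Served over HTTP, text in this format can be scraped by Prometheus, and a Grafana dashboard panel can then query the stored series by metric name and label, for example the job count broken down by `status`.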

Primary author

Ilya Baburashvili

Co-authors

Shreyas Bhat (Fermilab), Bruno Coimbra (Fermilab), Marco Mambelli (Fermilab), Namratha Urs (Fermilab)
