Speaker
Ilya Baburashvili
Description
HEPCloud Decision Engine is a framework that science collaborations use to provision computing resources efficiently and cost-effectively. It selects resources using heuristics and starts Glideins and HTCondor startds to run the jobs.
Decision Engine uses Prometheus metrics to track how jobs are running and to monitor its own overall health. We created metrics that track the performance of de-client commands and report the current number and status of jobs and glideins, as well as the number of cores and the amount of memory in use. The metrics were then added to Grafana dashboards.
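As an illustration of the kind of metric described above, the following minimal sketch models a gauge and renders it in the Prometheus text exposition format. It is not Decision Engine's code: the metric names (de_jobs, de_cores_in_use), the status label, and all values are hypothetical, and the sketch hand-renders the format with the standard library instead of using a Prometheus client library.

```python
from dataclasses import dataclass, field


@dataclass
class Gauge:
    """A value that can go up or down, such as the number of running jobs."""

    name: str
    help_text: str
    label_names: tuple = ()
    samples: dict = field(default_factory=dict)

    def set(self, value, **labels):
        # Key each sample by its label values, in the declared label order.
        key = tuple(labels[n] for n in self.label_names)
        self.samples[key] = float(value)

    def render(self):
        # Emit the HELP and TYPE header lines, then one line per sample.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} gauge",
        ]
        for key, value in sorted(self.samples.items()):
            if key:
                pairs = ",".join(
                    f'{n}="{v}"' for n, v in zip(self.label_names, key)
                )
                lines.append(f"{self.name}{{{pairs}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines)


# Hypothetical metrics; names and values are made up for illustration.
jobs = Gauge("de_jobs", "Current number of jobs by status", ("status",))
jobs.set(120, status="running")
jobs.set(35, status="idle")

cores = Gauge("de_cores_in_use", "Cores currently in use")
cores.set(960)

print(jobs.render())
print(cores.render())
```

A Prometheus server scraping text like this would store one time series per label combination, which a Grafana panel could then plot, for example running versus idle jobs over time.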
Primary author
Ilya Baburashvili
Co-authors
Shreyas Bhat (Fermilab)
Bruno Coimbra (Fermilab)
Marco Mambelli (Fermilab)
Namratha Urs (Fermilab)