Should work to include models in MLPerf, an existing benchmarking suite for the broader supercomputing community
Intensity frontier models: the Exa.TrkX/NuGraph GNNs already use IaaS. Models already exist, but development of larger models is ongoing. Synergies with the ATLAS Exa.TrkX work are planned for SML.
How often do these models have to be retrained?
Particle Transformer: retrain every few months
How long does training take?
Particle Transformer: about one week on a single A100
Exa.TrkX: a couple of weeks on one Perlmutter node (four GPUs)
Neural architecture search is more useful but also more compute-intensive
Challenges for IaaS:
ProtoDUNE maxed out the available bandwidth
Automating the conversion of a model into a service: would this be part of HEP-CCE? Probably. May require a custom backend with its own custom kernels
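A minimal sketch of what "automating the conversion" could look like, assuming a Triton Inference Server deployment target: generate the model-repository layout (version directory plus `config.pbtxt`) from a few pieces of model metadata. All names, shapes, and defaults here are illustrative assumptions, not HEP-CCE code; a custom backend would replace the `backend` field and supply its own kernels.

```python
# Hypothetical helper: build a Triton model-repository entry
# (<root>/<name>/config.pbtxt and <root>/<name>/1/) from metadata.
# Defaults (onnxruntime backend, FP32 tensors, shapes) are assumptions.
from pathlib import Path

CONFIG_TEMPLATE = """name: "{name}"
backend: "{backend}"
max_batch_size: {max_batch}
input [
  {{ name: "{in_name}" data_type: TYPE_FP32 dims: {in_dims} }}
]
output [
  {{ name: "{out_name}" data_type: TYPE_FP32 dims: {out_dims} }}
]
"""

def make_model_repo(root, name, backend="onnxruntime", max_batch=8,
                    in_name="INPUT0", in_dims=(128,),
                    out_name="OUTPUT0", out_dims=(10,)):
    """Create the directory layout Triton expects and return the
    path to the generated config.pbtxt."""
    model_dir = Path(root) / name
    # Triton loads weights from numbered version subdirectories.
    (model_dir / "1").mkdir(parents=True, exist_ok=True)
    config = CONFIG_TEMPLATE.format(
        name=name, backend=backend, max_batch=max_batch,
        in_name=in_name, in_dims=list(in_dims),
        out_name=out_name, out_dims=list(out_dims))
    (model_dir / "config.pbtxt").write_text(config)
    return model_dir / "config.pbtxt"
```

In practice the exported model file (e.g. an ONNX or TorchScript artifact) would be copied into the version directory alongside this config; the point of the sketch is only that the repository layout is mechanical enough to script.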
Action items