Scaling ML meeting

Should work to include models in MLPerf: an existing benchmarking tool for the broader supercomputing community

Intensity frontier models: the Exa.TrkX/NuGraph GNN has already used IaaS. Models already exist, but development of larger models is ongoing. Synergies with the ATLAS Exa.TrkX work are planned for SML.

How often do these models have to be trained: 

  • Particle transformer: retrain every few months

How long does training take:

  • Particle transformer training time: 1 week on one A100

  • Exa.TrkX: a couple of weeks of training on one Perlmutter node (four GPUs); a rough GPU-hour conversion follows this list
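
A rough conversion of these wall times into GPU-hours (illustrative arithmetic based on the approximate numbers above, not measured figures):

    # Rough GPU-hour cost implied by the quoted wall times.
    HOURS_PER_DAY = 24

    # Particle transformer: ~1 week on a single A100.
    particle_transformer = 7 * HOURS_PER_DAY * 1   # ~168 GPU-hours

    # Exa.TrkX: ~2 weeks on one Perlmutter node with four GPUs.
    exatrkx = 14 * HOURS_PER_DAY * 4               # ~1344 GPU-hours

    print(f"Particle transformer: ~{particle_transformer} GPU-hours")
    print(f"Exa.TrkX:             ~{exatrkx} GPU-hours")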

 

Neural architecture search is more useful and more compute-intensive

  • Should do a performance (physics) metric comparison (a search sketch follows this list)
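
A minimal sketch of how such a search with a physics-metric objective could be set up, assuming Optuna as the framework (picking a framework is still an action item below; the search space and the training stub are placeholders):

    # Neural architecture search sketch using Optuna (an assumed choice;
    # selecting a framework is an action item below).
    import optuna

    def train_and_evaluate(n_layers, hidden_dim, lr):
        # Stub standing in for a real train/validate loop that would
        # return the physics performance metric (e.g., tracking efficiency).
        return 1.0 / (1.0 + abs(n_layers - 4) + abs(lr - 1e-3)) + hidden_dim / 1000.0

    def objective(trial):
        # Placeholder search space over depth, width, and learning rate.
        n_layers = trial.suggest_int("n_layers", 2, 8)
        hidden_dim = trial.suggest_categorical("hidden_dim", [64, 128, 256])
        lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
        return train_and_evaluate(n_layers, hidden_dim, lr)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print("Best candidate:", study.best_params)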

Challenges for IaaS:

  • Automatically deciding the resources needed; this should have synergies with the PAW group

  • ProtoDUNE hit the bandwidth limit

  • Automate the conversion of a model into a service: would this be part of HEP-CCE? Probably. A custom backend with its own custom kernel (a serving skeleton follows this list)
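
For reference, the skeleton below shows the service-side shape of such a conversion for Triton Inference Server's Python backend (a sketch only; a custom C++ backend with its own kernels follows the same request/response pattern, and the tensor names here are placeholders):

    # model.py skeleton for a Triton Python-backend model; it lives at
    # model_repository/<model>/1/model.py and runs inside Triton, not
    # standalone. "INPUT0"/"OUTPUT0" must match the model's config.pbtxt.
    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def initialize(self, args):
            # Load weights / build the model here.
            pass

        def execute(self, requests):
            responses = []
            for request in requests:
                inp = pb_utils.get_input_tensor_by_name(request, "INPUT0")
                data = inp.as_numpy()
                # Placeholder inference: identity. Replace with the real
                # model call.
                out = pb_utils.Tensor("OUTPUT0", data)
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[out])
                )
            return responses

        def finalize(self):
            pass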

Action items

  • Summarize the scale of models and data for selected models (a helper sketch follows this list)
    • Tracking: number of model parameters? Tens of GB of training data
    • Particle transformer: 2M parameters in the model; training data size?
  • Find a neural architecture search framework
  • Find insiders of the models to work with SML; the insider has to contribute to scaling the models, and there is funding
  • Add columns to the person-power sheet for individual models
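
For the first action item, a small helper sketch, assuming the models are PyTorch modules and the training data sits in files on disk:

    # Helpers for the "summarize scale of models and data" action item.
    import os
    import torch

    def count_parameters(model: torch.nn.Module) -> int:
        # Trainable parameter count (e.g., ~2M for the particle
        # transformer; still to be filled in for tracking).
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    def dataset_size_gb(path: str) -> float:
        # On-disk training-set size in GB (e.g., tens of GB for the
        # tracking data).
        total = sum(
            os.path.getsize(os.path.join(root, f))
            for root, _, files in os.walk(path)
            for f in files
        )
        return total / 1e9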