

Scaling ML meeting

US/Central

Discussion with HEP-CCE portability people

  • Workflow prep for inference and potentially grid HPO 

 

Scaling ML application candidates

  • Neural-Based Inference (talk to Aishik and portability)
    • Rui: Is Aishik running code with local scripts or PanDA? Local scripts.
    • https://github.com/DeepDriveMD
    • Come up with action items with them
    • Run this on different HPCs and understand how to improve the experience 
    • Doug and Kelly are integrating Globus Compute for job scheduling without logging directly into an HPC. This is integrated into Harvester. The expected authentication challenge is being addressed with a special token. To be further discussed with the NBI people.
    • Already using NERSC
    • Polaris and Aurora allocation?
    • Checkpoint handling is needed (does Harvester have it?)
    • Set up meeting to discuss with them
    • What about Ben Nachman et al.?
    • Follow up with Aishik. Maybe add to the Slack channel?
      • Where is their code?
      • Prepare a Google Doc (Xiangyang)
      • Send email (Xiangyang)
  • Inference as a Service for tracking: see ATLAS upgrade week.
    • Already run at NERSC: load balancing was poor
    • Chicago for Kubernetes
    • Aurora/Polaris
    • Study the node balancing using Kubernetes. 
    • Check if Aurora has Kubernetes available (Rui)
  • Resource constrained ML?
    • Follow up with Lindsey on whether they have people (WH)
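
The checkpoint-handling question raised above (whether Harvester provides it) could be prototyped independently while that is checked. A minimal sketch of atomic, resumable checkpointing for a preemptible HPC job; all names here are hypothetical illustrations, not Harvester's API:

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Atomically write training state (e.g. epoch, metrics) as JSON."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: a preemption never leaves a partial file

def load_checkpoint(path, default=None):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)

# Example: a job resumes from wherever the last completed epoch left off.
ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
state = load_checkpoint(ckpt, default={"epoch": 0})
for epoch in range(state["epoch"], 3):
    state["epoch"] = epoch + 1
    save_checkpoint(ckpt, state)
print(load_checkpoint(ckpt))  # {'epoch': 3}
```

The atomic `os.replace` is the important part for batch systems: a job killed mid-write restarts from the previous complete checkpoint rather than a corrupt one.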

 

FASST RFI: 

  • Already highlighted questions 
  • Develop over the next week
  • Kick up to Salman and Paolo

Questions that are relevant for SML (potential answers follow each question):

How can DOE ensure FASST investments support a competitive hardware ecosystem and maintain American leadership in AI compute, including through DOE's existing AI and high-performance-computing testbeds?

Software support? More intelligent super API (Rui)

How can DOE improve awareness of existing allocation processes for DOE's AI-capable supercomputers and AI testbeds for smaller companies and newer research teams? How should DOE evaluate compute resource allocation strategies for large-scale foundation-model training and/or other AI use cases?

Large allocations for organizations on a rolling basis would facilitate foundation-model R&D.

How can DOE continue to support development of energy-efficient AI hardware, algorithms, and platforms?

How can DOE continue to support the development of AI hardware, algorithms, and platforms tailored for science and engineering applications in cases where the needs of those applications differ from the needs of commodity AI applications?

Funding for the development and scaling of foundation models for science. Foundation models require significant resources, and scaling these models up will be essential.

What are application areas in science, applied energy, and national security that are primed for AI breakthroughs?

How can DOE ensure foundation AI models are effectively developed to realize breakthrough applications, in partnership with industry, academia, and other agencies?

Ease of access and scaling of compute resources

There are minutes attached to this event.
    • 15:00–15:15 Intro (15m)
      Speakers: Paolo Calafiura (LBNL), Walter Hopkins (Argonne National Laboratory)
    • 15:15–15:30 AOB (15m)