Attending: Paolo Calafiura, Salman Habib, Rob Ross, Peter Van Gemmeren, Doug Benjamin, Matthieu Dorier, John Wu, Philippe Canal, Chris Jones, Rob Latham, Jakob Blomer, Liz Sexton-Kennedy, Shane Snyder, Suren Byna
Management:
- asked to put together some slides for use in public presentations of HEP CCE
- Peter has put together a draft that is still a work in progress
Intro to ATLAS simulation and event service
Doug Benjamin
p.3
- predominantly doing simulation on HPC platforms.
- working on "fast chain" which will do event generation + simulation + reconstruction, but not in production yet
p.4
- running primarily at NERSC
- harvester on login node, talks to the slurm scheduler (at NERSC)
- submit some number of compute chunks to the system
  50 jobs at a time
  many submissions queued up; nominally set for six hours, but they don't run that long
- for each submission, a Python process ("pilot2") talks with AthenaMP via Yampl to get the work done
  master-worker model (see the sketch after this list)
- all output for one of these submissions is merged
- everything on the compute node runs in a container
- in the future, only the "payload" (AthenaMP) will run in the container
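A minimal sketch of that on-node master-worker pattern. In the real system pilot2 drives AthenaMP over Yampl; here Python multiprocessing queues stand in for that channel, and the payload and output file names are placeholders.
```python
# Master-worker on a single node: the master doles out event ranges,
# workers "simulate" them and report back an output file per range.
import multiprocessing as mp

def simulate_range(event_range):
    # placeholder payload: pretend to simulate and return an output file name
    start, end = event_range
    return f"simhits.{start}-{end}.pool.root"   # hypothetical naming scheme

def worker(work_q, result_q):
    for event_range in iter(work_q.get, None):  # None is the shutdown signal
        result_q.put((event_range, simulate_range(event_range)))

if __name__ == "__main__":
    n_workers = 8                               # e.g. one per core
    ranges = [(i, i + 49) for i in range(0, 1000, 50)]

    work_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(work_q, result_q)) for _ in range(n_workers)]
    for p in procs:
        p.start()

    for r in ranges:                            # master doles out event ranges
        work_q.put(r)
    for _ in procs:                             # one shutdown signal per worker
        work_q.put(None)

    outputs = [result_q.get() for _ in ranges]  # collect per-range output files
    for p in procs:
        p.join()
    print(f"{len(outputs)} partial outputs ready to be merged")
```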
p.5
- Jumbo job -- group work into appropriately sized chunks
- Normal ATLAS simulation jobs are 1K events (8 cores for 24 hours)
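Rough sizing arithmetic for a jumbo chunk. The 1K-event / 8-core / 24-hour figure is from the slide; the node count, cores per node, and walltime below are assumptions for illustration.
```python
# Back-of-the-envelope chunk sizing (illustrative numbers only).
events_per_grid_job = 1000
grid_cores, grid_hours = 8, 24
core_sec_per_event = grid_cores * grid_hours * 3600 / events_per_grid_job  # ~691 core-s/event

nodes, cores_per_node, walltime_hours = 50, 68, 6   # assumed: a 50-node Cori KNL submission
usable_cores = nodes * cores_per_node
events_per_submission = int(usable_cores * walltime_hours * 3600 / core_sec_per_event)
print(f"~{core_sec_per_event:.0f} core-seconds/event -> "
      f"~{events_per_submission} events per 6 h, 50-node submission")
```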
p.6
- running larger jobs needs new infrastructure
- Harvester - edge of center, connects to ATLAS workload mgmt (PanDA)
- ATLAS Event Service (ES)
events are simulated and written to individual files,
- ...
p.7
- Harvester - mediates
- first use beyond ATLAS recently in ASGC
- deals with security policies, 2-factor auth., etc.
p.8
- Event Service
- allows for work to continue up until the moment the resources are reclaimed
- captures partial work in files as it goes (see the sketch below)
- some work at the end is typically lost because the job is killed mid-calculation
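A sketch of the "capture partial work as you go" idea; the per-event file layout, names, and signal handling below are illustrative, not the real Event Service format.
```python
# Publish each finished event atomically so that, if the job is preempted,
# only the in-flight event is lost.
import json, os, signal, sys

OUTDIR = "/path/to/shared_fs/eventservice_out"   # hypothetical staging area

def handle_preemption(signum, frame):
    # Everything already published below survives the kill.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_preemption)

def simulate(event_id):
    return {"event": event_id, "hits": []}       # placeholder for the Geant4 output

for event_id in range(100000, 100050):           # an assigned event range
    result = simulate(event_id)
    tmp = os.path.join(OUTDIR, f".evt{event_id}.json.tmp")
    final = os.path.join(OUTDIR, f"evt{event_id}.json")
    with open(tmp, "w") as f:
        json.dump(result, f)
    os.rename(tmp, final)                        # atomic publish of each finished event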
p.9
- Yoda
- rank 0 manages things; MPI communication is used to coordinate (see the MPI sketch below)
- works well up to ~250 nodes, then it starts to lose messages (?)
- have they switched away from it?
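A minimal mpi4py sketch of the Yoda pattern: rank 0 hands out event ranges and collects completions, the other ranks run the payload. Names and range sizes are illustrative; the real Yoda logic is richer.
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_WORK, TAG_DONE = 1, 2

if rank == 0:
    ranges = [(i, i + 49) for i in range(0, 5000, 50)]
    active = 0
    for dest in range(1, size):                  # prime every worker with one range
        if ranges:
            comm.send(ranges.pop(), dest=dest, tag=TAG_WORK)
            active += 1
    while active:                                # hand out the rest as completions return
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=TAG_DONE, status=status)
        active -= 1
        if ranges:
            comm.send(ranges.pop(), dest=status.Get_source(), tag=TAG_WORK)
            active += 1
    for dest in range(1, size):                  # tell all workers to stop
        comm.send(None, dest=dest, tag=TAG_WORK)
else:
    while True:
        event_range = comm.recv(source=0, tag=TAG_WORK)
        if event_range is None:
            break
        # ... run AthenaMP / the simulation payload on event_range here ...
        comm.send(event_range, dest=0, tag=TAG_DONE)
```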
p.10
- Geant4 simulations
- when there are comm. problems, nodes are idle
p.11 Cori KNL nodes
- looking at occupancy
- there are other overheads that prevent 100% utilization.
- somewhat broad distribution of event simulation times
p.12
- PanDA
- trying to simulate 1-10M events
- submitting work in chunks. 1K events on grid, much larger on HPC platforms
p.13
- graph of event service simulations
- peak over 30M events/day
p.14
- Pilot2
- Python code stack, launches the payload
- being folded into the standard ATLAS pilot code (work in progress)
p.15
- next-generation event service
- rather than using their own code, looking at the "Ray" toolkit, originally developed for ML (see the Ray sketch below)
- Ray driver on rank 0, actors still running Pilot, Yampl, Master, Worker
- TCP/IP comm. with Ray
- shared FS for I/O and some communication
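A sketch of the Ray layout, under the assumption of one actor per node wrapping the Pilot/AthenaMP work; the class and method names are illustrative, not the actual next-generation event service code.
```python
import ray

ray.init()  # in the real setup the driver runs on rank 0 / the head node

@ray.remote
class NodeActor:
    """Stands in for a per-node actor that would wrap Pilot/Yampl/AthenaMP."""
    def __init__(self, node_id):
        self.node_id = node_id

    def process_range(self, event_range):
        # ... launch the payload for this event range here ...
        start, end = event_range
        return (self.node_id, f"out.{start}-{end}.root")   # hypothetical output name

actors = [NodeActor.remote(i) for i in range(4)]            # e.g. 4 "nodes"
ranges = [(i, i + 49) for i in range(0, 400, 50)]

# round-robin the ranges over the actors and gather the results
futures = [actors[i % len(actors)].process_range.remote(r) for i, r in enumerate(ranges)]
print(ray.get(futures))
```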
p.16-17
- a Redis server is also involved here
- an HTTP server runs as part of the Ray actor
- Pilot2 uses HTTP to talk with the Ray actor (see the sketch below)
- this is apparently how things run on the grid as well
- AthenaMP - the "MP" means "multi-process," but *on a node*, not across nodes. Working towards an "MT" (multi-threaded) version
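A sketch of the pilot-over-HTTP exchange; the endpoint paths, port, and JSON fields are hypothetical stand-ins for whatever the actual PanDA/pilot HTTP interface defines.
```python
import json
import urllib.request

BASE = "http://localhost:8080"   # HTTP server hosted by the Ray actor (assumed port)

def post_json(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

# ask for work, run it, report back
ranges = post_json("/get_event_ranges", {"n": 50})             # hypothetical endpoint
for r in ranges:
    output = f"out.{r['first']}-{r['last']}.root"              # ... run the payload ...
    post_json("/update_event_range", {"range": r, "file": output, "status": "finished"})
```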
---
50-node submissions at NERSC, 128- to 256-node submissions at ALCF.
one harvester manages all the ongoing submissions
it determines the instructions for a submission when the submission starts
in the event service, harvester is more active: it keeps doling out work as things complete
communication is via the shared FS (see the sketch below)
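A sketch of that shared-filesystem hand-off: Harvester drops a JSON work description into a directory the compute job can see, and the on-node side polls for it. Paths and field names are assumptions.
```python
import json, os, time, glob

WORKDIR = "/global/cfs/hep_cce/harvester_workdir"   # hypothetical shared-FS path

def publish_work(job_id, event_ranges):
    """Harvester side: atomically publish a work description."""
    tmp, final = f"{WORKDIR}/.{job_id}.json.tmp", f"{WORKDIR}/{job_id}.json"
    with open(tmp, "w") as f:
        json.dump({"job_id": job_id, "event_ranges": event_ranges}, f)
    os.rename(tmp, final)

def poll_for_work(timeout=300, interval=10):
    """Compute-node side: wait for a work file to appear on the shared FS."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = glob.glob(f"{WORKDIR}/*.json")
        if files:
            with open(files[0]) as f:
                return json.load(f)
        time.sleep(interval)
    return None
```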
CJ: how long does a submission run?
DB: a number of different configurations. trying to get max. throughput. right now 6-hour jobs at NERSC.
there are some flexible scheduling capabilities they are able to take advantage of.
PC:
- Security requirements of HPC platforms lead to some complexities
- Heterogeneous systems: some integration of scheduling from event service down to AthenaMP would be advantageous
- Long term vision, probably not impactful for near term
Suren: What fraction of time in these workflows is I/O time?
DB: Haven't measured it.
PvG: For simulation, I/O is much faster than the computation. But we're writing small chunks that are later merged together, so we're doing I/O twice. May be something to optimize. In principle, simulation is compute dominated as long as we're talking about Geant4 simulation
DB: Fastchain will be different.
RBR: What's the data passing strategy in fastchain?
DB: In memory?
PVG: Not too familiar. One of the points is that we aren't going to write out the intermediate steps. Simulation is faster, no longer justifiable to write that stuff out.
CJ: CMS idea is to just use a local disk for their version of "fast chain". Only the output of reconstruction goes off node. Find this to be efficient.
...but CMS simulation time is much less than ATLAS, comparable to reconstruction time.
PVG: more in a fastchain simulation
CJ: yes, but serially running processes on a node coupled via files on a local FS.
PVG: I think the ATLAS fastchain uses a single large process (or will).
PC: What's the initialization time for jobs?
CJ: Typically jobs target 8 hours single-threaded; startup times are in the minutes, working to get that below a minute.
PVG: Can we discuss file copying?
DB: For HPC, we copy to a "storage element". On the grid write to an object store at CERN, essentially writing little files into the object store.
PVG: p.4: each worker writes small files
DB: On a grid site the pilot writes to Ceph; that kind of overloaded them. On HPC we store outputs on the shared FS, then tar and ship them to be merged elsewhere.
DB: There is in-file metadata and a requirement that events in output files be grouped in a particular way, related to the input file organization, so we cannot just randomize events across output files. Right now we don't have a good way to manage this.
CJ: Because of provenance tracking? We have a similar assumption to work with.
DB: Yes. Don't care about event order within an output file, but we do need ranges to map to specific files.
DB: Merging is done somewhere else, so time for that isn't shown in any of the graphs in this presentation.
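A sketch of a merge step that respects this constraint: partial outputs are grouped by the input file their event range came from, and each group is merged into one output file. The filename convention and the use of ROOT's hadd are assumptions for illustration; the actual ATLAS merge machinery differs.
```python
import glob, os, subprocess
from collections import defaultdict

# group per-range partial outputs by the input file they came from
groups = defaultdict(list)
for path in glob.glob("staging/*.pool.root"):
    # assumed convention: <inputFileKey>.<firstEvt>-<lastEvt>.pool.root
    input_key = os.path.basename(path).split(".")[0]
    groups[input_key].append(path)

os.makedirs("merged", exist_ok=True)
for input_key, parts in groups.items():
    merged = f"merged/{input_key}.merged.pool.root"
    # event order within the merged file does not matter, only the grouping
    subprocess.run(["hadd", "-f", merged, *sorted(parts)], check=True)
```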
PVG: ATLAS has an event sharing capability that Peter can discuss in a future call.
PC: Would be useful to have a multi-node I/O capability.
RR: Something we could help with.
DB: Would work in HPC, skeptical about use in other contexts.
PC: Getting rid of file merging would be relevant in all the environments.
DB: With MT we'll get some of that...but still just on a single node.
CJ: Talk to Dirk on the CMS side for instrumentation. Not joining calls at this time.
RR: Maybe work through the ATLAS steps first, then come back to CMS.