Weekly CCE-IOS tele-conference

US/Central
Peter van Gemmeren (ANL), Rob Ross (ANL)
Description
BlueJeans Link: https://bluejeans.com/102100194

Attending: Peter van Gemmeren, Saba Sehrish, Philippe Canal, Torre Wenaus, Tammy Walton, Doug Benjamin, Liz Sexton-Kennedy, Chris Jones, Suren Byna, Ken Herner, Patrick Gartung, Paolo, RobR, Shane Snyder

**Management.** Defining a policy for publications. A draft exists; expect an email soon.

Q (PVG): Is there going to be a general CCE meeting?

A (Paolo): In discussion. Not sure yet.

 

**Darshan for ROOT I/O.**

Ken Herner: Put some slides together.

Background:

- DUNE uses LArSoft, which is based on Art.

- Event generation -> Geant4 -> detector sim/noise -> recorded

- Each stage runs the same "lar" executable with a different config file (a ".fcl" or "fickle" file).

- For this test everything is in the same "job"

- All the data is in CVMFS (https://docs.nersc.gov/services/cvmfs/)

Darshan:

- Installed Darshan v3.2.1 in the DUNE area, in non-MPI mode

- in a shifter container

- simple bash script to run each stage serially

- copy Darshan files to laptop, darshan-merge, then job summary

- very preliminary!

- Note: Darshan was not compiled with Lustre support

  - so some striping information, etc., is missing

  - but that's fine for now.
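
Sketch (not from the slides): one way the "run each stage serially under Darshan" setup above could be expressed. The `.fcl` names, the library path, and the log directory are hypothetical placeholders. `DARSHAN_ENABLE_NONMPI` and `DARSHAN_LOGPATH` are Darshan runtime environment variables, and `-c`, `-s`, `-o` are the usual art/`lar` options for config, source, and output; whether Ken's script chains the stages this way is an assumption.

```python
import os
import subprocess

# Hypothetical stage configs; the real .fcl names are not in these notes.
STAGES = ["gen.fcl", "g4.fcl", "detsim.fcl"]


def darshan_env(lib_path, log_dir, base_env=None):
    """Return a copy of the environment with Darshan preloaded in non-MPI mode."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path          # path to libdarshan.so (assumed location)
    env["DARSHAN_ENABLE_NONMPI"] = "1"    # required when the job is not an MPI program
    env["DARSHAN_LOGPATH"] = log_dir      # directory that receives the Darshan logs
    return env


def stage_commands(stages):
    """Build one `lar` command per stage, feeding each output to the next stage."""
    commands = []
    previous_output = None
    for fcl in stages:
        output = fcl.replace(".fcl", ".root")
        cmd = ["lar", "-c", fcl, "-o", output]
        if previous_output is not None:
            cmd += ["-s", previous_output]
        commands.append(cmd)
        previous_output = output
    return commands


def run_stages(stages, env):
    """Run the stages one after another, stopping at the first failure."""
    for cmd in stage_commands(stages):
        subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    # Print the commands only; call run_stages() to actually execute them.
    for cmd in stage_commands(STAGES):
        print(" ".join(cmd))
```

Each stage would then leave its own Darshan log in the log directory, to be combined afterwards (darshan-merge, as noted above).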

 

Summary of a synthetic test job:

- Lots of small reads using STDIO

- POSIX accesses are dominated by reads in the 0-10K size range

  - Lots of 8191-byte reads (?)

- Mostly sequential/consecutive operations on the read side of things.

- A "real" production job would show more I/O activity, possibly more small accesses.

- Some question about the accuracy of the write total for the output stage (p.3 of the summary)

  - Going to share the data with Shane to see if he can work out what is going on.

  - Output should have been tens of MB?

Will run something larger and more realistic once the discrepancies in this run have been examined more closely.
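
Sketch (not from the slides): how the terms in the summary above could be computed from a trace of reads. The size buckets follow the style of Darshan's access-size histogram. "Sequential" is taken here to mean a read starting at or after the end of the previous one, and "consecutive" a read starting exactly at that end; both definitions are modeled on Darshan's counters and should be checked against the Darshan documentation. The trace is made up.

```python
from collections import Counter

# Upper bounds (bytes, inclusive) and labels, in the style of Darshan's histogram.
BUCKETS = [
    (100, "0-100"),
    (1024, "100-1K"),
    (10 * 1024, "1K-10K"),
    (100 * 1024, "10K-100K"),
    (1024 * 1024, "100K-1M"),
]


def size_bucket(nbytes):
    """Return the histogram label for an access of nbytes."""
    for upper, label in BUCKETS:
        if nbytes <= upper:
            return label
    return "1M+"


def summarize_reads(reads):
    """Summarize a list of (offset, length) reads given in the order issued."""
    histogram = Counter()
    sequential = 0
    consecutive = 0
    prev_end = None  # offset just past the previous read
    for offset, length in reads:
        histogram[size_bucket(length)] += 1
        if prev_end is not None:
            if offset >= prev_end:
                sequential += 1   # moved forward in the file
            if offset == prev_end:
                consecutive += 1  # started exactly where the last read ended
        prev_end = offset + length
    return {
        "total": len(reads),
        "sequential": sequential,
        "consecutive": consecutive,
        "histogram": dict(histogram),
    }


if __name__ == "__main__":
    # Made-up trace: three back-to-back 8191-byte reads, then a small read further on.
    trace = [(0, 8191), (8191, 8191), (16382, 8191), (40000, 100)]
    print(summarize_reads(trace))
```

With this made-up trace, three of the four reads fall in the 1K-10K bucket, three count as sequential, and two as consecutive.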

 

Doug Benjamin:

- Raythena Scheme on Cori KNL and ANL

- Image of how ATLAS runs the next-generation event service

  - Fine-grained simulation

  - Tested at NERSC and LCRC

  - A component on the edge gets information from PanDA

  - The pilot monitors each node running the Ray actor and the computational payload (inside a container via Shifter / Singularity)

 

- Not yet seeing the I/O behavior inside the container, only the behavior around it.

- Lots of files being opened (will need to filter).

- Have built a new container with Darshan inside it; hopeful that this will now work.

 

- Shane notes that there is a lot of Darshan data from outside the container, but so far we have not seen what is going on inside it. So this new approach is promising.

- This is the first time we've tried mixing instrumentation inside and outside a container like this.

  - Ken's test did succeed...

  - but that was with Shifter rather than Singularity, and run interactively.

 

Doug: A Python script calls a Bash script that starts Singularity; AthenaMP runs inside the container.

Patrick: Plans to experiment with Darshan but has not done so yet.

 

**HDF for Intermediate Results.** Saba and Suren working on this.

Saba:

- Some updates; slides are not uploaded yet because they are still being revised.

- Trying to write data products. Using HighFive API to write to HDF files. Two datasets per data product currently.

- Writing events has been implemented; now working on reading the data back and validating the writing code.

- Looking at H5CPP as an alternative. Have had initial discussions with the author, and have been able to use it as a write path for trivial tests as well.

- Parallelism comes later.
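
Sketch (not from the slides): a pure-Python stand-in for a "two datasets per data product" layout with a write, re-read, and validate loop. The notes do not say what the two datasets hold. This sketch ASSUMES one holds the concatenated serialized bytes and the other the per-event offsets. `ProductColumn` is a hypothetical name, and `pickle` stands in for whatever serialization the real code uses. No HighFive or H5CPP calls are shown.

```python
import pickle


class ProductColumn:
    """In-memory stand-in for the two datasets of one data product (assumed layout)."""

    def __init__(self):
        self.data = bytearray()  # "dataset 1": concatenated serialized products
        self.offsets = [0]       # "dataset 2": offsets[i]..offsets[i+1] bound event i

    def append(self, product):
        """Serialize one event's product and record where it ends."""
        self.data += pickle.dumps(product)
        self.offsets.append(len(self.data))

    def read(self, event_index):
        """Deserialize the product stored for the given event."""
        start = self.offsets[event_index]
        end = self.offsets[event_index + 1]
        return pickle.loads(bytes(self.data[start:end]))

    def __len__(self):
        return len(self.offsets) - 1


if __name__ == "__main__":
    # Made-up events; write them, then re-read and validate.
    events = [{"hits": [1, 2, 3]}, {"hits": []}, {"hits": [42]}]
    column = ProductColumn()
    for event in events:
        column.append(event)
    readback = [column.read(i) for i in range(len(column))]
    print("round trip ok:", readback == events)
```

In an HDF5 file the two members would presumably become two one-dimensional datasets that grow as events are appended.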

Some discussion of next steps. Peter interested in some testing.

 

**Constraints on I/O Discussion.** Ran out of time for discussion on the day of Chris's presentation (about two weeks ago).

Peter has some slides:

- Multi-threading to save memory

- Multi-process does not achieve this; it still uses 1-2 GB per process more than necessary.

- See slides for additional details.

Philippe: Some of the issues are being addressed as part of the RNTuple work.
