Weekly CCE-IOS tele-conference

Timezone: US/Central
Conveners: Peter van Gemmeren (ANL), Rob Ross (ANL)
BlueJeans Link: https://bluejeans.com/102100194

Attending: Peter van Gemmeren, Saba Sehrish, Philippe Canal, Torre Wenaus, Tammy Walton, Doug Benjamin, Liz Sexton-Kennedy, Chris Jones, Suren Byna, Ken Herner, Patrick Gartung, Paolo Calafiura, Rob Ross, Shane Snyder

**Management.** A policy for publications is being defined; there is a draft, and an email will go out soon.

Q (PVG): Is there going to be a general CCE meeting?

A (Paolo): In discussion. Not sure yet.

 

**Darshan for ROOT I/O.**

Ken Herner: Put some slides together.

Background:

- DUNE uses LArSoft, which is based on Art.

- Event generation -> Geant4 -> detector sim/noise -> recorded

- Each stage runs the same "lar" executable with a different config file (a ".fcl" or "fickle" file).

- For this test everything is in the same "job"

- All the data is in CVMFS (https://docs.nersc.gov/services/cvmfs/)

Darshan:

- Installed v3.2.1 in the DUNE area, in non-MPI mode

- in a Shifter container

- simple bash script to run each stage serially (see the sketch after this list)

- copy Darshan files to a laptop, run darshan-merge, then generate the job summary

- very preliminary!

- note: didn't compile with Lustre support

                - so missing some striping info, etc.

                - but that's fine for now.
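
A minimal sketch of that per-stage driver logic, written here in C++ (the actual driver is a simple bash script). It assumes Darshan's documented non-MPI mode (LD_PRELOAD of libdarshan.so plus DARSHAN_ENABLE_NONMPI); the install path and the .fcl file names are placeholders, not the actual DUNE configuration.

```cpp
// Hypothetical per-stage driver using Darshan's non-MPI instrumentation.
// Paths and .fcl names are placeholders, not the actual DUNE setup.
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
    // Preload the Darshan runtime so it can intercept POSIX/STDIO calls,
    // and enable instrumentation of non-MPI executables (Darshan 3.2+).
    setenv("LD_PRELOAD", "/path/to/darshan-3.2.1/lib/libdarshan.so", 1);
    setenv("DARSHAN_ENABLE_NONMPI", "1", 1);

    // Run each stage of the chain serially, each with its own config file.
    const std::vector<std::string> stages = {"gen.fcl", "g4.fcl", "detsim.fcl", "reco.fcl"};
    for (const auto& fcl : stages) {
        const std::string cmd = "lar -c " + fcl;
        std::printf("running: %s\n", cmd.c_str());
        if (std::system(cmd.c_str()) != 0) {
            std::fprintf(stderr, "stage %s failed\n", fcl.c_str());
            return 1;
        }
    }
    // Afterwards: copy the per-process Darshan logs off the system, combine
    // them with darshan-merge, and produce a report with darshan-job-summary.pl.
    return 0;
}
```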

 

Summary of this synthetic test:

- Lots of small reads using STDIO

- POSIX accesses are dominated by 0-10K reads

                - Lots of 8191 byte reads (?)

- Mostly sequential/consecutive operations on the read side of things.

- In a "real" production job, would see more stuff, maybe more little stuff, etc.

- Some question about the accuracy of the output-stage write total (p. 3 of the summary)

                - Going to share data with Shane and see if he can deduce what might be up.

                - Output should have been 10s of MBs?

Will run something larger / more realistic once we have had a closer look at the discrepancies in this run.

 

Doug Benjamin:

- Raythena Scheme on Cori KNL and ANL

- Image of how ATLAS runs the next-generation event service

                - Fine-grained simulation

                - Tested at NERSC and LCRC

                - A component on the edge that gets information from PanDA

                - The pilot does monitoring on each node running the Ray actor and the computational payload (inside a container via Shifter / Singularity)

 

- Not yet seeing the I/O behavior inside the container; do see the behavior around it.

- Lots of files being opened (will need to filter).

- Have built a new container with Darshan inside it; hopeful that this will work now.

 

- Shane notes that there's a lot of Darshan data when looking only outside the container, but we're not (or weren't) seeing what was going on inside the container, so this new approach is promising.

- This is the first time we've tried to mix the inside-/outside-container model like this.

                - But then Ken succeeded...

                - But it was Shifter and not Singularity, and run interactively

 

Doug: a Python script calls a Bash script that starts Singularity; inside that is AthenaMP.

Patrick: plans to experiment with Darshan but has not yet.

 

**HDF for Intermediate Results.** Saba and Suren are working on this.

Saba:

- Some updates; the slides are not complete and have not been uploaded yet.

- Trying to write data products, using the HighFive API to write to HDF5 files; currently two datasets per data product (see the sketch after this list).

- Writing events has been implemented; working on re-reading and validating the writing code.

- Looking at H5CPP as an alternative; have had an initial discussion with the author. It has also worked as a write path for trivial tests.

- Parallelism comes later.
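
As a rough illustration of the write path (not the project's actual code), here is a minimal HighFive sketch that writes two datasets for one data product. The group and dataset names, and the split into a serialized-bytes dataset plus a per-event offsets dataset, are assumptions made for illustration.

```cpp
// Minimal HighFive sketch: one group per data product, two datasets per product.
// Names and layout are illustrative assumptions, not the actual schema.
#include <highfive/H5File.hpp>
#include <highfive/H5Group.hpp>

#include <cstdint>
#include <vector>

int main() {
    // Hypothetical serialized payload for one data product across three events,
    // plus the starting offset of each event within that payload.
    std::vector<char> payload = {'a', 'b', 'c', 'd', 'e', 'f'};
    std::vector<std::uint64_t> offsets = {0, 2, 5};

    // Create (or truncate) the HDF5 output file.
    HighFive::File file("events.h5", HighFive::File::Overwrite);

    // Two datasets for the product: the raw bytes and the per-event offsets.
    HighFive::Group group = file.createGroup("ExampleProduct");
    group.createDataSet("data", payload);
    group.createDataSet("offsets", offsets);
    return 0;
}
```

Reading back for validation would go through file.getDataSet(...).read(...), matching the re-reading step mentioned above.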

Some discussion of next steps. Peter interested in some testing.

 

**Constraints on I/O Discussion.** Follow-up on Chris's presentation from roughly two weeks ago; we ran out of time for discussion that day.

Peter has some slides:

- multi-threading to save memory

- multi-processing doesn't do it; it still uses 1-2 GB/process more than necessary.

- See slides for additional details.

Philippe: Some of the issues are being addressed as part of the RNTuple work.

Timetable:
    • 14:00-14:05  Management News (5m)
      Speakers: Paolo Calafiura (LBNL), Dr Salman Habib (Argonne National Laboratory)
    • 14:05-14:10  Introduction (5m)
      Speakers: Dr Peter van Gemmeren (ANL), Rob Ross (ANL)
    • 14:10-14:15  Update: Darshan for ROOT I/O in HEP workflows on HPC (5m)
      Speakers: Christopher Jones (Fermilab), Doug Benjamin (ANL), Kenneth Herner (Fermilab), Patrick Gartung (Fermilab), Shane Snyder (Argonne National Laboratory)
    • 14:15-14:20  Update: Investigate HDF5 as intermediate event storage for HPC processing (5m)
      Speakers: Kyle Knoepfel (Fermilab), Lisa Goodenough, Dr Peter van Gemmeren (ANL), Saba Sehrish (Fermilab), Suren Byna (LBNL), Tammy Walton (Fermilab)
    • 14:20-14:40  Follow Up: Constraints on I/O from HEP Data Processing (20m)
      Speakers: Christopher Jones (Fermilab), Dr Peter van Gemmeren (ANL), Philippe Canal (Fermilab)