Attendees: Paolo Calafiura, Salman Habib, Rob Ross, Peter van Gemmeren, Shane Snyder, Chris Jones, Doug Benjamin, Jakob Blomer, John Wu, Liz Sexton-Kennedy, Matthieu Dorier, Rob Latham, Saba Sehrish, Suren Byna
Management News: Some slides need to be ready in a couple of weeks; RobR and PeterVG will follow up.
Shane Snyder presenting Darshan
Slide 4:
Darshan: lightweight I/O characterization tool. Deployed at most of the DOE sites, often "on" by default.
Modular, can be extended.
Slide 5:
Works via link-time or runtime instrumentation, depending on whether the executable is statically or dynamically linked.
Focus has been on MPI programs, but this is revisited later in the talk.
Darshan writes out its data at the end of the job (at MPI_Finalize), collapsing records from all processes into a single compressed file per job.
Some simple analysis tools for digging into this data.
Q: Darshan aggregates data from multiple processes; do you still show the individual process behavior?
A: Some data is aggregated across processes (e.g., records for files shared by all ranks), but per-process information is still retained.
Slide 6:
Modular setup. Core library coordinates.
Instrumentation modules target specific libraries or use cases (e.g., HDF5, POSIX)
Self-describing format.
Slide 8:
Cori -- Cray XC40
Enabled by default.
Integrated into Cray software module system.
module list shows the darshan version currently loaded, etc.
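A quick way to confirm the deployment, assuming the usual environment-modules setup on Cori (output varies by site):
  module list 2>&1 | grep -i darshan   # show the darshan module/version currently loaded
  module avail darshan                 # list other available darshan versions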
Slide 9:
Just compile and run -- Darshan is integrated (example below).
Location of darshan logs described on slide.
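A minimal sketch of the compile-and-run workflow, assuming the Cray compiler wrapper and Slurm on Cori (application name and job size are hypothetical):
  cc -o my_io_app my_io_app.c   # Cray wrapper; darshan module already loaded
  srun -n 64 ./my_io_app        # Darshan log written automatically at job end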
Slide 10:
The default link mode recently moved to dynamic linking, though this is not yet in the latest releases/deployments.
This doesn't change how Darshan is used, but instrumentation of dynamically linked executables may require LD_PRELOADing the Darshan library.
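A hedged sketch of the LD_PRELOAD approach for a dynamically linked executable (the library path is a site-specific placeholder):
  export LD_PRELOAD=/path/to/darshan/lib/libdarshan.so   # placeholder; use the site's installed library
  srun -n 64 ./my_io_app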
Slide 12:
darshan-parser produces text output from the binary log as counter tuples, one per line.
A rank of -1 indicates a record aggregated across all ranks (i.e., a shared file).
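Example invocation (the log file name is a placeholder; real names encode user, executable, job id, and date):
  darshan-parser mylog.darshan | less          # dump all counter tuples as text
  darshan-parser mylog.darshan | grep POSIX    # e.g., restrict to POSIX module counters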
Slide 13:
Darshan job summary tool -- generates a PDF summarizing some key statistics.
Might have to load texlive to get it to work.
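Sketch of generating the summary (log name is a placeholder; texlive provides the pdflatex it needs):
  module load texlive
  darshan-job-summary.pl mylog.darshan   # writes a PDF report (e.g., mylog.pdf)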
Slide 15:
Performance _estimates_ -- not entirely accurate due to what is/isn't captured. Take with a grain of salt.
Slide 16:
"other" is time outside of things Darshan observes -- typically majority of this time is "compute" (but could be waiting on anything outside of Darshan purview).
Slide 18:
Timeline of I/O operations, reads on top, writes on the bottom.
Can't literally see the individual operations (in a default Darshan capture) because we aren't tracing. But we can bound things by open file times, etc. This can be enough to get useful insights.
Slide 20:
There _is_ a fine-grained tracing capability (DXT) that can be enabled; at this time it works for both POSIX and MPI-IO (see the example below).
darshan-dxt-parser can be used to look at this output.
Also records Lustre OST information that can help debug situations where a particular OST is problematic.
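A sketch of enabling DXT at run time and dumping the trace, assuming the DXT_ENABLE_IO_TRACE switch and a placeholder log name:
  export DXT_ENABLE_IO_TRACE=1                 # turn on fine-grained tracing for this run
  srun -n 64 ./my_io_app
  darshan-dxt-parser mylog.darshan > dxt.txt   # per-operation read/write trace as text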
Slide 21:
A simple read/write timeline across ranks. This is a subfiling example: groups of processes write, but not all of them.
Slide 23:
New features that are now integrated and may be useful here: non-MPI instrumentation.
Significant refactoring was needed to enable this (i.e., to make MPI optional), including a new way to initialize and to catch the end of the job.
Slide 24:
Right now Darshan has to be built specifically for non-MPI capabilities.
When running, an environment variable has to be set; this avoids capturing every executable that runs and keeps the noise down.
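A sketch of the non-MPI workflow, assuming a Darshan build configured without MPI, the DARSHAN_ENABLE_NONMPI runtime switch, and placeholder paths/program name:
  export LD_PRELOAD=/path/to/nonmpi-darshan/lib/libdarshan.so   # placeholder; non-MPI-capable build
  export DARSHAN_ENABLE_NONMPI=1                                # opt in explicitly to keep the noise down
  ./serial_io_tool                                              # hypothetical non-MPI program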
Q: DougB: When we launch on an HPC system, ATLAS runs a Python pilot that spawns additional work and also monitors things; it is launched inside a container. Wondering where to put the Darshan intercepts.
A: Shane: Good question! We need to work together on this particular bit.
A: Calls to fork() can cause issues, something to look out for.
Q: Do we have a Spark-specific way to follow the sub-processes?
A: Separate the capture from the analysis.
Discussion:
Q: Will Darshan work in a case where executables are stored on a read-only file system? Or compiled with some other (older) compiler?
A: We think so, but need to investigate.
TODO: Plan a hack session to investigate. Shane and Doug leading.
Doug and Torre hopefully presenting next week. Suren on HDF5 the following week.