Weekly CCE-IOS tele-conference (1 Apr 2020)
Chaired by: Ross, Rob; Dr. van Gemmeren, Peter
Attendees: Rob Ross, Peter van Gemmeren, Salman Habib, Chris Jones, Doug Benjamin, Liz Sexton-Kennedy, Jakob Blomer, John Wu, Matthieu Dorier, Philippe Canal, Rob Latham, Saba Sehrish, Shane Snyder, Suren Byna, Torre Wenaus
Philippe Canal: ROOT I/O
Slide 3) HEP data flow: raw (from detector) -> reco -> analysis formats -> images. Last three are typically stored in ROOT
Slide 4) "cling" is a C++ interpreter built into ROOT. It interprets headers, which lets users write C++ objects via serialization
Slide 6) Parallelization in ROOT refers to thread level parallelism.
Slide 9) TFile is the file class in ROOT.
- header
- records
- possible compression
- FS-like structure
- self descriptive
Slide 10) plug-in system that allows for remote access, data in SQL
Slide 11) file is mix of headers plus object data. deleted things might still take up space
- file header -- summary information, pointers to free regions, etc.
- logical record header -- information on the objects in the region, etc.
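The layout described on Slide 11 can be illustrated with a toy model. This is only a sketch under stated assumptions: the `ToyFile` class, the `TOY1` magic, and the header fields are hypothetical and do not reflect ROOT's actual on-disk format. It shows a file header with summary information, per-record headers, and how a deleted record leaves a hole that still takes up space.

```python
import struct

# Hypothetical toy format for illustration; NOT ROOT's real on-disk layout.
FILE_HEADER = struct.Struct("<4sII")   # magic, end-of-data offset, free-region count
RECORD_HEADER = struct.Struct("<I8s")  # payload length, object name (null padded)


class ToyFile:
    def __init__(self):
        self.data = bytearray(FILE_HEADER.size)
        self.free = []    # (offset, length) holes left behind by deletions
        self.index = {}   # name -> (offset, total record length)
        self._write_header()

    def _write_header(self):
        # The file header carries summary information about the whole file.
        FILE_HEADER.pack_into(self.data, 0, b"TOY1", len(self.data), len(self.free))

    def write(self, name, payload):
        offset = len(self.data)
        self.data += RECORD_HEADER.pack(len(payload), name.encode()) + payload
        self.index[name] = (offset, RECORD_HEADER.size + len(payload))
        self._write_header()

    def delete(self, name):
        # The bytes stay in the file; only the free list records the hole.
        self.free.append(self.index.pop(name))
        self._write_header()

    def read(self, name):
        offset, _ = self.index[name]
        length, _ = RECORD_HEADER.unpack_from(self.data, offset)
        start = offset + RECORD_HEADER.size
        return bytes(self.data[start:start + length])
```

In this sketch, deleting a record does not shrink `data`. A real implementation would presumably reuse the free regions for later writes.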
Slide 14) serialization
- lots of features in here that likely make it impossible to use another solution without giving up features (e.g., custom serialization of types, schema evolution)
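As a rough illustration of what schema evolution means here, the sketch below reads records written with an older class layout into the current one. Everything in it is an assumption made for illustration: the JSON encoding, the `version` field, and the `upgrade_v1` helper are hypothetical. ROOT's own mechanism works on C++ class descriptions and is far more general.

```python
import json

# Hypothetical versioned record format, for illustration only.
CURRENT_VERSION = 2


def serialize(track):
    """Write a record tagged with the current schema version."""
    return json.dumps({"version": CURRENT_VERSION, **track}).encode()


def upgrade_v1(fields):
    # Assumed change: version 1 stored px and py separately,
    # version 2 stores them together in a list "p".
    fields = dict(fields)
    fields["p"] = [fields.pop("px"), fields.pop("py")]
    return fields


UPGRADES = {1: upgrade_v1}


def deserialize(blob):
    """Read a record of any known version and return it in the current layout."""
    fields = json.loads(blob.decode())
    version = fields.pop("version")
    while version < CURRENT_VERSION:
        fields = UPGRADES[version](fields)
        version += 1
    return fields
```

Old files stay readable because the reader, not the writer, carries the conversion rules.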
Slide 18) column format
- represented by TTree, or just "tree"
- a TBranch, "branch" is a column
Slide 22) "anatomy of a file" slide speaks to this
- "cluster" is a contiguous area of the file holding an integral number of entries for all the branches (things from different columns)
- so all the data from a set of rows.
- "baskets" are collections of data from a "branch" -- specifically a chunk of data from a single column
- baskets tend to be single writes, ranging from hundreds of bytes to a few MB
- clusters are often tens of MB (but this is customizable)
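The cluster/basket relationship above can be sketched as follows. This is a simplified model with assumed parameters: the `layout` function and its entry-count thresholds are hypothetical, whereas the notes describe real basket and cluster sizes in bytes. The point is only that a cluster covers the same entry range for every branch, while each basket holds data from a single branch.

```python
def layout(columns, entries_per_cluster, entries_per_basket):
    """Return a list of clusters; each cluster is a list of
    (branch, first_entry, last_entry_exclusive) baskets."""
    n_entries = len(next(iter(columns.values())))
    clusters = []
    for c_start in range(0, n_entries, entries_per_cluster):
        c_end = min(c_start + entries_per_cluster, n_entries)
        baskets = []
        # Every branch contributes baskets covering the same entry range,
        # so a cluster holds complete rows.
        for branch in columns:
            for b_start in range(c_start, c_end, entries_per_basket):
                b_end = min(b_start + entries_per_basket, c_end)
                baskets.append((branch, b_start, b_end))
        clusters.append(baskets)
    return clusters
```

For example, two branches of 10 entries with 4 entries per cluster and 2 per basket give three clusters, the last one holding entries 8 and 9 of both branches.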
Slide 32) Fast Merge -- way of combining data from multiple files into a final ROOT file. This is done with threads. Some testing on individual drives (HD and SSD).
Slide 43) Basic MPI-based thing for doing this also (just a quick mention)
Slide 45) TFile WriteCache as a way to do aggregation. Could be a location for inserting code to do smarter writes.
The Fast Merge mechanism could be enhanced to collect baskets and reorganize how they are laid out in the file
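The aggregation idea from Slide 45 can be sketched as a small buffering layer. This is not the TFile WriteCache implementation; the `WriteCache` class below, its `capacity` parameter, and the `flushes` counter are hypothetical names used only to show how many small basket-sized writes could be turned into fewer large ones.

```python
import io


class WriteCache:
    """Buffers small writes and flushes them to the sink as one large write."""

    def __init__(self, sink, capacity):
        self.sink = sink          # any object with a write(bytes) method
        self.capacity = capacity  # flush threshold in bytes
        self.pending = []
        self.pending_bytes = 0
        self.flushes = 0          # number of large writes issued so far

    def write(self, chunk):
        self.pending.append(chunk)
        self.pending_bytes += len(chunk)
        if self.pending_bytes >= self.capacity:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # A smarter policy (e.g. reordering chunks) could be inserted here.
        self.sink.write(b"".join(self.pending))
        self.pending.clear()
        self.pending_bytes = 0
        self.flushes += 1
```

With a `capacity` of 8 bytes, five 3-byte writes followed by a final `flush()` reach the sink as two writes instead of five.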
Jakob Blomer: RNTuple -- evolution of TTree I/O
"the future" -- experimental new I/O subsystem in ROOT
looking at simple(r) event models, want to understand if they can get faster performance at the cost of incompatibility
Slide 47) borrowing from Apache Arrow concepts, for example
thinking about object stores
Slide 48) storage layer knows how to get byte ranges from whatever the back end is
notionally this would be a way for us to store RNTuple data in a Mochi-based service
looking at DAOS, in touch with Intel folks, should be followed up with HPC experts
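The storage-layer idea from Slide 48 can be sketched as a small interface. All names here (`ByteRangeBackend`, `MemoryBackend`, `ObjectStoreBackend`, `object_size`) are assumptions made for illustration and are not the RNTuple, Mochi, or DAOS APIs. The sketch shows that code above the storage layer only asks for byte ranges, regardless of whether the back end is one contiguous blob or a set of fixed-size objects.

```python
from abc import ABC, abstractmethod


class ByteRangeBackend(ABC):
    """Hypothetical storage-layer interface: fetch a byte range."""

    @abstractmethod
    def read_range(self, offset, length):
        ...


class MemoryBackend(ByteRangeBackend):
    """Stand-in for a plain file: one contiguous blob."""

    def __init__(self, blob):
        self.blob = bytes(blob)

    def read_range(self, offset, length):
        return self.blob[offset:offset + length]


class ObjectStoreBackend(ByteRangeBackend):
    """Stand-in for an object store: data split into fixed-size objects."""

    def __init__(self, blob, object_size):
        self.object_size = object_size
        self.objects = {i // object_size: bytes(blob[i:i + object_size])
                        for i in range(0, len(blob), object_size)}

    def read_range(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            # Locate the object holding this offset and copy what it has.
            key, inner = divmod(offset, self.object_size)
            piece = self.objects.get(key, b"")[inner:inner + (end - offset)]
            if not piece:
                break  # past the end of the stored data
            out += piece
            offset += len(piece)
        return bytes(out)
```

Both back ends return identical bytes for the same request, so a reader written against `read_range` would not need to know which one it is using.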
Slide 49) layout is similar to TTree layout
basket -> page
leaf -> column
cluster -> cluster