This is the content of https://docs.google.com/document/d/1EJ1dSSJyL7ojTbBhUV41CK0HW4MobfZimpDn09cdtBY/edit as of 9/14/2019
DUNE COMPUTING MODEL WORKSHOP NOTES
Monday 9 Sep 2019
Present:
In room:
S. Timm, A. McNab, M. Kirby, H. Schellman, C. Brew, F. Wilson, R. Illingworth, E. Lancon, P. Laycock, M. Bishai, T. Junk, P. van Gemmeren, P. Ding, K. Herner, M. Mambelli, D. Benjamin, A. Norman, B. Jayatilaka, S. Fuess, K. Biery, P. DeMar, A. Mazzacane, A. Tiradani, K. Ellis, J. Boyd
On Zoom:
P. Clarke, N. Buchanan, B. Viren, G. Cooper, B. White, T. Wenaus, R. Nandakumar, A. Thea, M. Nebot, S. Calvez, P. Vokac, D. Adams, M. Votava, S. White, A. Steklain
TALKS Monday
M. Kirby--logistics
Discussion--pre-GDB and DUNE both booked in this room Tuesday morning, need to change one of them.
Andrew McNab-- goals of the workshops
Discussion: Fergus--want to know what decisions need to be made and by what mechanism they will be made
Steve--can you speak to external constraints that are driving design of computing model now?
Heidi--need first round of computing model that we can show to local funding agencies by end of calendar 2019
Pete--that’s correct and it needs to be plausible.
Heidi Schellman: DUNE DATA OVERVIEW
Agreement with DAQ is not to write more than 30 PB/yr to offline in total from FD, even as modules are added.
About 1 supernova “candidate” per month, 460TB with 4 modules. 10 hrs to transfer from FD at 100Gb/s.
Discussion--Early in the planning it is worth looking at other big experiments and how accurate their numbers were, to figure how much contingency we should put in.
What is the fraction of analysis
Schellman: comment that we’re currently pretty anarchistic for analysis, production isolated.
M. Bishai: Can we continuously exercise the system to make sure we’re ready for big supernova burst and high analysis demand?
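As a quick check of the supernova transfer estimate above (460 TB at 100 Gb/s), the sketch below converts the candidate size to bits and divides by the link rate. It assumes decimal units (1 TB = 10^12 bytes) and a fully utilised link with no protocol overhead; the function name is illustrative only.

```python
def transfer_hours(size_tb: float, link_gbps: float) -> float:
    """Hours to move size_tb terabytes over a link of link_gbps gigabits/s."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # assumes 100% link utilisation
    return seconds / 3600.0

hours = transfer_hours(460, 100)
print(f"{hours:.1f} h")
```

Under these assumptions the result is just over 10 hours, consistent with the figure quoted in the talk; any real transfer would take longer once overheads and shared bandwidth are included.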
------------------------------------
10:45 discussion “Mapping Requirements to Sites”
Data Volumes
LHCb has 27 PB in storage at Tier-1
ST: Should define data storage based on what type of activities you need to do at the various tiers
HS: Should streaming be a significant part of the models
CB: from CMS--streaming works but wouldn’t want to use it for everything
ST: need to understand the IOPS requirements on any given site. Also which files are hot and accessed a large number of times
AN: 2 separate problems: (1) archival storage of the data and who should have archival/custodial responsibility; (2) how do we analyze data at resources worldwide? Multi-tiered approach with national/regional warehouses from which we stream, or can prime an HPC center for instance.
AM: can we (in theory) put that kind of archival data at FNAL? Yes.
Should we? AN: one copy at FNAL and one elsewhere (is this required to be a DOE “facility”?)
May also want copies that are regionally convenient at regional centers.
Some discussion--can today’s HPSS handle these kinds of reads? BV says yes.
Should users be able to chaotically stage files from tape? General feeling seems to be no.
Mapping The WLCG Tier model
https://docs.google.com/presentation/d/1T68f6UL0xXSTSjUgC2w5txX-GFg0Bxf37SsBodlnRnQ/edit?usp=sharing visible version of model shown. Revised version after conversation with Ian Collier. Invented the D24 and the D8 options
Mike K: What are the support levels at the different types of sites? (Host, center, grid site)
Eric Lançon’s reference to DOMA site definitions is slide 20 here: https://docs.google.com/presentation/d/1F6hLvFT1X_z2Kpf49xgR_PzYrM69H5DXiUlpXxiK7I4/edit?ts=5cc06c1f#slide=id.g58baf64d22_10_169
Is it better to have more sites with 24x7 response (smaller sites like Prague will never have enough people to really cover 24x7), or more sites with multiple copies of datasets in case something goes down off-hours (there was a flood incident even at one WLCG T1 centre and it took ~ 4 months to bring that site back to life)?
Eric: don’t have to think of boxes geographically mapped to a site--can have geographically distributed resources presented as a unified entity
Doug: should consider network bandwidth needs especially transatlantic, due to SKA coming online while DUNE is running
Andrew: What data volume really needs to be on disk? Heidi--Analyzed data low end 3PB/yr, maybe 10PB at top end? Tom--reduction factor will be large. Near detector still an unknown
Mike--how many datasets are live at once--one, two,three? Andrew--what is content of those datasets? Raw data, first reco, etc? See
Mary: what are MC requirements--2-3x processed? Does ML need raw data? ML should be able to train on subset of raw data and/or processed?
Summarising MC requirements - canonical 10:1 of processed MC used for NOvA, assumption is this remains true for DUNE. (HS> But NOvA does onboard zero-suppression - need to compare with potential full data size). What is NOvA zero-suppression level and actual far detector rate that we are comparing to? I.e., # of events we’re talking about for MC and data. HS: Suggestion for a Task Force to estimate MC needs.
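The trade-off raised above between 24x7 coverage at fewer sites and extra dataset replicas at more sites can be made roughly quantitative. The sketch below assumes each site holding a replica is independently available with some probability, which is an idealisation (correlated outages such as network failures are ignored). The availability figures are invented for illustration and are not measured values for any site.

```python
def dataset_availability(site_availability: float, replicas: int) -> float:
    """P(at least one replica reachable), assuming independent site outages."""
    return 1.0 - (1.0 - site_availability) ** replicas

# Illustrative numbers only: one well-covered site vs. several less-covered ones.
for a, n in [(0.99, 1), (0.95, 2), (0.95, 3)]:
    print(f"availability={a}, replicas={n}: {dataset_availability(a, n):.6f}")
```

With these assumed inputs, two replicas at sites that are each up 95% of the time give a higher dataset availability than a single replica at a 99% site. Whether that holds in practice depends on how independent the outages really are.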
CPU architectures:
Heidi updated this table based on the spreadsheet numbers. Need a version for ProtoDUNEs and NP as well. PD should be possible to do now.
Far Detector
| Data Type | Amount/yr | Copies | Num of versions | Total/yr | Disk | Tape | Lifetime on disk | Note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Raw | 30 PB | 2? | 1 | 60 PB | 3 PB | 100% | short | Assume 1 month? |
| Reduced | 0.3 PB | 2+ | 2ish | 1.2 PB | 100% | ? | 6 months? | Likely 2 versions but only one on disk |
| Reco Path A | 1.5 PB | 2+ | 2 | 6 PB | 100% | ? | Always on disk in some form | Assume always keep 2 versions*2 copies. |
| Reco Path B | 1.5 PB | 2+ | 2 | 6 PB | 100% | ? | Merge with type A and store that | Much better to run multiple algorithms at same time but some architectures may require different times? |
| Reco Path Nth... | | | | | | | | |
| MC | 6 PB | 1 | 2 | 6 PB | 100% | | | 10:1 Ratio with raw? But MC has a lot of overhead add factor of 2 for that. |
| Raw like | 3PB | 1? | 1 | | | | | ~10% Raw for ML Training. Is it different to Raw? |
Official numbers spreadsheet https://docs.google.com/spreadsheets/d/17Xtwl3lIT00xOYgZMhMtyLt6ebFjrsraNHXI39y3hSY/edit#gid=1918904950
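For the rows of the Far Detector table where it applies, the Total/yr column appears to be amount x copies x versions. The sketch below reproduces that arithmetic. Treating "2?", "2+" and "2ish" as exactly 2 is an assumption made here. The MC and "Raw like" rows are left out: the MC total listed in the table (6 PB) is not the simple product of its other columns, and the "Raw like" row has no total.

```python
# (amount in PB/yr, copies, versions) copied from the Far Detector table.
# Assumption: "2?", "2+" and "2ish" are all read as 2.
rows = {
    "Raw": (30.0, 2, 1),
    "Reduced": (0.3, 2, 2),
    "Reco Path A": (1.5, 2, 2),
    "Reco Path B": (1.5, 2, 2),
}

def total_pb(amount_pb: float, copies: int, versions: int) -> float:
    """Yearly volume if every version of every copy is kept."""
    return amount_pb * copies * versions

totals = {name: total_pb(*values) for name, values in rows.items()}
for name, total in totals.items():
    print(f"{name}: {total:.1f} PB/yr")
```

These products match the Total/yr entries in the table for the four rows shown (60, 1.2, 6 and 6 PB).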
Can we get quick access to HPCs for hit finding production? SKA and other astro people do get quick access.
Supernovas? General events?
More generally for the centers - what level of DB access, code access.
Discussion of commercial cloud - nice way to get latest version of new hardware for testing. Good for peak/infrequent activities where buying your own is not cost-effective.
Make requirements on bandwidths.
(DRAFT) Summary of interim conclusions stated in the session:
- The custodial requirements for raw data, on tape, can be met by FNAL (one copy of everything) and by DOE lab(s). However, it is desirable that site(s) outside the US also participate in fulfilling these requirements.
- We do not wish to have chaotic access to tape by users: it will be sufficient that data is staged off tape at FNAL in an organised way as part of (re)processing activities. It is not necessary that offsite tape copies are used in (re)processing unless there is a loss of data at FNAL.
- It is not a requirement on the data management system that it can handle tape access at multiple sites, and it may be sufficient for staging to be handled outside (most of) the DM system.
- It will be sufficient for FNAL and the other tape sites to provide access to tape in this controlled manner; for centers or disk sites to provide disk to allow jobs to access data in an efficient way; and for grid sites or cpu sites to provide only CPU and local scratch disk.
- A sufficient number of replicas of files needs to be maintained on disk by DUNE across different sites to minimise disruption to workflows to an acceptable level when sites have failures, outages, or planned downtimes.
- FNAL and the other tape sites must provide 24/7 on call support. It will be sufficient for disk and cpu sites to provide 8/5 working hours support.
- Software should be written to run on CPUs irrespective of CPU features (AVX etc) and on GPUs (for example by using libraries which allow the most performant execution on different platforms.) DUNE should be able to use whatever generations or CPU/GPU types are offered by the sites, by writing flexible software and by matching any software with specific requirements to compatible resources at sites.
- DUNE has some CPU-bound use cases which will be run more efficiently on HPCs, due to their fast interconnects. Whilst DUNE will make use of many HPC resources in an HTC fashion, it may positively request that partners provide HPC.
- DUNE will ensure that it can run workloads on commercial clouds, in case partner countries decide to use commercial clouds, as a way of handling peaks in load, and as a way of evaluating and preparing for new architectures before they are generally available at sites and with a greater degree of profiling access (eg root privileges) than with conventional grid jobs.
- DUNE will pursue a mixed model of moving jobs to data and streaming data across the network to jobs. The balance of this will be based on experience and the network capacity provided by sites and (inter)national networks.
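Two of the interim conclusions above (support level by site type, and matching software with specific requirements to compatible resources) can be sketched as a small data model. This is only an illustration: the site names, the feature labels and the `Site` / `compatible_sites` names are hypothetical and do not correspond to any existing DUNE or WLCG configuration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Site:
    name: str
    role: str                                   # "tape", "disk" or "cpu"
    features: frozenset = field(default_factory=frozenset)

    @property
    def support(self) -> str:
        # Per the draft conclusions: tape sites 24/7 on call, others 8/5.
        return "24/7" if self.role == "tape" else "8/5"

def compatible_sites(required: set, sites: list) -> list:
    """Names of sites offering every feature the job requires."""
    return [s.name for s in sites if required <= s.features]

# Hypothetical sites and feature labels, for illustration only.
sites = [
    Site("site-a", "tape", frozenset({"avx2"})),
    Site("site-b", "disk", frozenset({"avx2", "gpu"})),
    Site("site-c", "cpu", frozenset()),
]
print(compatible_sites(set(), sites))      # job with no special requirements
print(compatible_sites({"gpu"}, sites))    # job that needs a GPU
print([(s.name, s.support) for s in sites])
```

A job that declares no required features can run anywhere, which is the default the conclusions aim for; only jobs with explicit requirements are restricted to matching sites.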
Data Management Technologies session
Steve Timm is preparing a document
Steve: explains SAM
Question about use cases -> are we overspecifying - not now as these are existing examples.
Examples of what is needed to “see”/get a single file. DBs reasonably simple. API probably needs to join together info from many DBs.
File attributes → query run attributes DB?
Datasets → move to snapshots? Rucio supports hierarchies
File location → locations DB
File association with a particular job
Examples of what is needed to put a single file into the DB.
Slide 6 - SAM locate file → Rucio,
SAM list-files → discovery DB/runs DB?
SAM projects → POMS or other job system knows how to talk to them.
Discussion of hierarchical metadata instead of tied to files.
Book-keeping:
Figure out where all the data for a particular trigger record are - supernova → data management
Figure out which files have been processed or not. → workload?
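The point above that an API "probably needs to join together info from many DBs" could look roughly like the sketch below. The three dictionaries are hypothetical in-memory stand-ins for a file-attributes DB, a run-attributes DB and a locations DB. None of the names, keys or values correspond to a real SAM or Rucio API or to real DUNE data.

```python
# Hypothetical stand-ins for three separate services (invented contents).
FILE_ATTRS = {"example_file_001": {"run": 1, "data_tier": "raw"}}
RUN_ATTRS = {1: {"detector": "example-detector"}}
LOCATIONS = {"example_file_001": ["SITE_A_TAPE", "SITE_B_DISK"]}

def describe_file(name: str) -> dict:
    """Join file attributes, run attributes and replica locations for one file."""
    attrs = FILE_ATTRS[name]                       # file-level metadata
    return {
        "name": name,
        **attrs,
        **RUN_ATTRS.get(attrs["run"], {}),         # run-level metadata
        "locations": LOCATIONS.get(name, []),      # where replicas live
    }

print(describe_file("example_file_001"))
```

The design choice illustrated is that callers see one record per file while each underlying store stays simple and independently replaceable.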
Robert Illingworth on Rucio
Currently running hybrid Rucio/SAM system - possibility of mismatch between the two DBs.
Some discussion of object stores (good bad, oh my god not that again…)
Paul Laycock on Rucio at Belle II
Have used DIRAC for quite a while with the LFC file catalog
Having to introduce Rucio while data taking has already started is interesting
Use a Rucio file catalog plugin to DIRAC. Everything talks to the DIRAC plugin, not directly to Rucio.
What was good about Rucio - automation of things. Data lifetime would be very nice. DM services run at BNL, the DIRAC stuff is in Japan.
Chris on DIRAC
Does both file and grid management.
File catalog is used by many DIRAC users - Belle, LHCb,
Marco on GlideinWMS
Works with multiple local batch systems, lots of experience in the field (CMS, OSG, FNAL IF)
Long-term support: moving to support CentOS 8 now; other long-term customers are around (CMS + OSG for instance)
Chris on DIRAC - have tested using local files from the DIRAC catalog and SAM access from a local script.
Torre on PanDA
Works with lots of different facilities and integrated with Rucio
Discussion of the ATLAS event service. Can stream events independently, can keep cores busy even if you get a really long event on one core. Also recover from preemption by recovering individual events. Simulation is ½ of ATLAS usage, works well for this.
Working on Intelligent Data Delivery System with IRIS-HEP. Can join into project through HSF.
HPC use for ML - Fast Simulation, Analysis, Tracking
Assume future machines will support common ML library.
Question about how to get fast turnaround.
PanDA for nightly testing (ART system) - 3000 jobs/month
Plug PanDA onto back-end of DIRAC
Kirby asks about using the event service independently of some of the other PanDA specific parts. Seems to work even with their very high event rates. How does this map onto DUNE? APA? Trigger record?
[Torre - today, the event service operates only within PanDA and trying it would involve using PanDA. Objective of iDDS (which for ATLAS is the next step for the event service) is to support event delivery independent of PanDA. (Developing this in the IRIS-HEP/HSF context is intended to ensure the PanDA independence really happens). Early DUNE involvement in iDDS (with whatever workload manager(s) it chooses) could ensure DUNE use cases and functionality are addressed from the beginning.]
Ask Mary B. to find a volunteer to study PanDA/DUNE
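The event-level recovery idea discussed above (a preemption costs only the event in flight, not the whole job) can be illustrated with a toy model. This is not how PanDA or the ATLAS event service is implemented. It is a minimal single-worker sketch with a hypothetical `process_events` function, where preemptions are injected by attempt number.

```python
from collections import deque

def process_events(event_ids, preempt_at=frozenset()):
    """Process events one at a time; a preempted attempt loses only that event.

    preempt_at holds 0-based attempt numbers at which the worker is
    preempted before the current event finishes.
    """
    pending = deque(event_ids)
    done = []
    attempt = 0
    while pending:
        event = pending.popleft()
        if attempt in preempt_at:
            pending.append(event)   # requeue only the in-flight event
        else:
            done.append(event)      # output is recorded per event
        attempt += 1
    return done, attempt

done, attempts = process_events([1, 2, 3, 4], preempt_at={1})
print(done, attempts)
```

In this toy run every event is eventually processed exactly once, and the single preemption costs one extra attempt rather than a restart of the whole list.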
Anthony Tiradani on HEPCloud: NERSC/Cloud in there - production version has started this year.
Discussion of decision engine.
Has a “replay” capability that allows you to figure out why things happened.
Question about protecting budgets - production workflows only and HEPCloud can help minimize costs.
Similarly, how do you manage HPC allocations? Can monitor burn rates.
What about if an individual user acquires a large allocation - do we have a way to allow that person to use that resource in a dedicated way?
Andrew on VCycle - small agent to make VMs on various systems.
Now working with IRIS in UK. How does this interact with HEPCloud?
Interesting questions about how one can plan use of allocations - profile of expected needs.
Need some policy agreements for DUNE to use this.
SAM as Workflow management Comments:
having a data delivery service that tracks the location, bandwidth, and latency of a file to a grid job would be important - Heidi
It would be good to make sure that we aren’t tied to the file-based structure that we cooked into SAM. Have something that is more flexible and able to deal with “data cells” and object stores - Doug Benjamin
POMS as Workflow Management Comments:
Would the new project/station functionality be part of POMS or the other stuff? - Heidi The discussion at the lab is that this would be separate from POMS. - Marc
Tuesday morning
Andrew shows a slide with possible combos of Production/Workflow WMS and DMS, and asks what people think about what to use and what to put dev effort into in the short/medium term, and if we want to keep options open longer term.
Questions - where does jobsub
As explained below, consensus was that we need more requirements gathering to decide. And the strawman that any of the alternatives would be sufficient given our scale had two exceptions identified: handling object storage with chunks that need to be assembled into events for processing; job-to-job pipelining rather than using storage as the intermediary.
Fergus had asked yesterday about decision process
From the consortium document:
“Technical advisory board
The Consortium Lead, in collaboration with the Consortium, will convene Technical Advisory Boards as needed. A Board will be convened when there is a particular technical issue to address and will be given a charge appropriate to the issue at hand, such as reviewing and recommending solutions to a technical issue. Boards will include the 3 Consortium leads, the Software Architect, relevant subgroup representatives and technical experts chosen appropriately to address the charge. The outcome of a Board will typically be an advisory report delivered to Consortium leadership, the report being public within the Collaboration.”
Mary and Steve and Heidi would like us to come up with requirements informed by the technical decisions we need to make.
Anna Mazzacane asks that we make the detector characterization a use case, not just the final running beast in steady state. Need more data, more processing steps, Mary - protoDUNE may be harder, so is a good test.
Questions about support (short and long term) for the various products
Questions about getting feedback from stakeholders. CISC/CAL/DAQ/physics groups
Is object store an additional use case beyond the file based system?
Andrew Norman - if we’re doing files, any of these systems can do the job.
Steve Timm - what about heterogeneous workflows. Andrew Norman - astro expts do this using.
Multinode pipelines - GPU -> on a different machine doing something else. If they communicate through storage, things are ok… ATLAS doing this.
Do we envision a mixture of workflows. Probably yes.
Very Top Level Requirements questions from Mary B.
-Software Framework interface requirements: for example memory/data format requirements that interface with the WMS - does ROOT meet those requirements, for example?
- Long Term Maintainability
-Heterogeneous workflows: e.g. HTC->HPC->GPU
Data preparation, Pattern recognition, Simulation, Data analysis, tuple creation
Template for Workflow Use Case Description:
https://docs.google.com/document/d/1lYNANqaE6r32u0oTWCcyQ2yfNEO6og3nKtNEDiTt85U/edit#
Need to go through workflows: what are the steps?
- Normal events (< 6 GB) - Based on Brett’s diagram to define “jobs” - Kirby https://indico.fnal.gov/event/21160/session/10/contribution/14/material/slides/4.pdf
- SNB events (> 100 TB) - defer discussion until there’s more detail???
- Simulation - overlays (detector size?) - Tingjun Yang/Wes Ketchum assistance on this - Ken
- Merging outputs - Heidi focus on after reconstruction and potentially SNB and derived samples and explicitly joining divergent paths
- Tuple creation - question about file format (HDF5?) less important compared with access patterns and data volume - Norm Buchanan(???) and Steve Timm - physics impacts should be considered
- Calibration - needs input from CAL group following the Collaboration Meeting
- ML training
- ML inference
- User analysis - Input from DRA
- Event picking - Andrew Norman
- Parameter estimation ala NOvA - Andrew Norman, Steven Calvez, Pengfei Ding
- Unfolding
Define each of the processing stages as individual tasks.
Define the input and output size for each of those tasks.
Data lifetime of the output for each stage.
When the campaign is reprocessed, what will be the impact on the data model and storage
Some workflow diagrams to discuss
https://docs.google.com/presentation/d/1d-qrtHwd5u-D91YDQ9p5kTm_G_l1DxZOBCrGeGyRgAk/edit?usp=sharing
Do we write intermediate steps to disk? Tape? What gets distributed?
Which stages need DB access?
If input data is carried to the output of a stage and if splits and joins between stages are supported then some support for handling duplicate data (same exact object carried along both branches) is needed in the framework.
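The duplicate-data concern above can be illustrated with a minimal join of two branch outputs. The sketch assumes every data object has a stable identifier and that an object carried unchanged along both branches is byte-identical. The `join_branches` name and the object ids are hypothetical and not part of any existing framework.

```python
def join_branches(branch_a: dict, branch_b: dict) -> dict:
    """Merge two branch outputs keyed by object id.

    Objects present in both branches must be identical (carried-through
    input) and are stored once. Conflicting content raises an error.
    """
    merged = dict(branch_a)
    for obj_id, payload in branch_b.items():
        if obj_id in merged and merged[obj_id] != payload:
            raise ValueError(f"conflicting content for {obj_id}")
        merged[obj_id] = payload
    return merged

# Hypothetical example: the same raw object is carried along both reco paths.
raw = {"raw/evt1": b"\x01\x02"}
path_a = {**raw, "recoA/evt1": b"tracks"}
path_b = {**raw, "recoB/evt1": b"showers"}
merged = join_branches(path_a, path_b)
print(sorted(merged))
```

Raising on conflicting content, rather than silently keeping one copy, is a deliberate choice in this sketch: a mismatch would mean one branch modified data that was supposed to be passed through untouched.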
- Object storage/Offline event building
- Distributed data requirements: >= 2 raw data storage on tape with 24/7 support and 10%(?) dedicated disk space for staging + processed data and simulation data (x10 of processed data?) at different sites on disk with local support (see Heidi’s picture from yesterday)
-Network requirements:
-Specific stakeholder requirements: DAQ, Calibration, Physics Analysis
One also needs to think about workflow stages that communicate via storage (i.e. outputs of one are inputs to the next) vs. those that communicate via jobs
“mergeAna” workflow with project.py is a kind of merge workflow used by protoDUNE. It does not use SAM for bookkeeping, but does that with list files. First exercise towards more complicated merge workflow is to incorporate this with SAM and POMS.
Mary asks can we get the experts to summarize the major features of the proposed workflows. Summarize the slides we saw yesterday. MB: Specifications of the software packages proposed that address our requirements at each stage. For example what features of Rucio+file management system (SAM or replacement) reliably allow for offline event building from file fragments (if that is our plan).
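Offline event building from file fragments, as raised above, amounts to grouping fragments by trigger record and checking completeness before processing. The sketch below shows that bookkeeping under stated assumptions: the fragment tuple layout, the part ids and the `build_events` name are hypothetical, and locating the fragments (the job of the data management system) is not modelled.

```python
from collections import defaultdict

def build_events(fragments, expected_parts):
    """Group fragments by trigger record; report complete and incomplete ones.

    fragments: iterable of (trigger_record, part_id, payload) tuples.
    expected_parts: set of part ids every complete record must contain.
    """
    by_record = defaultdict(dict)
    for record, part, payload in fragments:
        by_record[record][part] = payload
    complete = {r: p for r, p in by_record.items() if set(p) == expected_parts}
    # For incomplete records, list which expected parts are still missing.
    incomplete = {r: expected_parts - set(p)
                  for r, p in by_record.items() if set(p) != expected_parts}
    return complete, incomplete

# Hypothetical fragments arriving out of order from different files.
fragments = [(7, "apa1", b"a"), (8, "apa1", b"c"), (7, "apa2", b"b")]
complete, incomplete = build_events(fragments, {"apa1", "apa2"})
print(sorted(complete), incomplete)
```

A real system would also need to decide how long to wait for missing fragments and what to do with records that never complete; those policies are outside this sketch.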
Chris B. Where are we strange?
Heidi question about data formats. Is that a framework not a workflow problem?
Paul L comment to this for late-stage analysis for Belle II: we use ROOT as format, but for analysis python packages are used which untie this dependence: “root_numpy” and “root_pandas” packages convert TTrees to numpy arrays / pandas dataframes respectively, “uproot” is a package increasing in popularity which has its own optimised numpy-like array formats.
MB: some selected event data needs to be stored in a format for 3-D display:
Discussion of whether we would store data to display chosen events. Mary points out that event displays of all selected neutrino events in the far detector will be an important part of validating any ML results and reconstruction.
An example of a 3-D event display is the BEE display (used online for ProtoDUNE); here is an example with simulation:
https://www.phy.bnl.gov/wire-cell/bee/set/23/event/0/
This is very different from colliders and is unique to neutrino experiments: the need to visually assess large numbers of events to validate reconstruction.
Operations discussion (Thursday afternoon)
https://docs.google.com/document/d/1x0ima7MUiMNhwYU0rg37MLM_rFCrcEY9Qfuf76MmGFw/edit?usp=sharing is a link to a document with a list of current things we use and questions about mapping their functionality going forward.