This is the content of https://docs.google.com/document/d/1EJ1dSSJyL7ojTbBhUV41CK0HW4MobfZimpDn09cdtBY/edit as of 9/14/2019

 

DUNE COMPUTING MODEL WORKSHOP NOTES

 

Monday 9 Sep 2019

 

Present:  

 

In room:

 

S. Timm, A. McNab, M. Kirby, H. Schellman, C. Brew, F. Wilson, R. Illingworth, E. Lancon, P. Laycock, M. Bishai, T. Junk, P. van Gemmeren, P. Ding, K. Herner, M. Mambelli, D. Benjamin, A. Norman, B. Jayatilaka, S. Fuess, K. Biery, P. DeMar, A. Mazzacane, A. Tiradani, K. Ellis, J. Boyd

 

On Zoom:

 

P. Clarke, N. Buchanan, B. Viren, G. Cooper, B. White, T. Wenaus, R. Nandakumar, A. Thea, M. Nebot, S. Calvez, P. Vokac, D. Adams, M. Votava, S. White, A. Steklain

 

TALKS Monday

 

M. Kirby--logistics

 

Discussion--pre-GDB and DUNE are both booked in this room Tuesday morning; one of them needs to change.

 

Andrew McNab-- goals of the workshops

 

Discussion: Fergus--wants to know what decisions need to be made and by what mechanism they will be made.

 

Steve--can you speak to the external constraints that are driving the design of the computing model now?

 

Heidi--need first round of computing model that we can show to local funding agencies by end of calendar 2019

 

Pete--that’s correct and it needs to be plausible.

 

Heidi Schellman:  DUNE DATA OVERVIEW

 

Agreement with DAQ is not to write more than 30 PB/yr to offline in total from FD, even as modules are added.

 

About 1 supernova “candidate” per month, ~460 TB with 4 modules; about 10 hrs to transfer from the FD at 100 Gb/s.
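A quick back-of-the-envelope check of that transfer time (a sketch only; it uses the 460 TB and 100 Gb/s figures quoted above and assumes the full link bandwidth is available):

```python
# Rough check of the supernova-candidate transfer time quoted above.
size_bytes = 460e12        # ~460 TB per candidate with 4 modules
link_bps = 100e9           # 100 Gb/s FD-to-offline link, assumed fully available

transfer_hours = size_bytes * 8 / link_bps / 3600
print(f"{transfer_hours:.1f} hours")   # ~10.2 hours, consistent with the ~10 hr estimate
```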

 

Discussion--early in the planning it is worth looking at other big experiments and how accurate their numbers were, to figure out how much contingency we should put in.

 

What is the fraction of analysis (vs. production)?

 

Schellman: comment that analysis is currently pretty anarchic, while production is isolated.

 

M. Bishai: Can we continuously exercise the system to make sure we’re ready for a big supernova burst and high analysis demand?

 

------------------------------------

 

10:45 discussion “Mapping Requirements to Sites”

 

Data Volumes

LHCb has 27 PB in storage at Tier-1.

 

ST: Should define data storage based on what type of activities you need to do at the various tiers

 

HS: Should streaming be a significant part of the models

 

CB: from CMS--streaming works but wouldn’t want to use it for everything

 

ST: need to understand the IOPS requirements at any given site. Also which files are hot and accessed a large fraction of the time.

 

AN: 2 separate problems: (1) archival storage of the data, and who should have archival/custodial responsibility; (2) how do we analyze data at resources worldwide? A multi-tiered approach with national/regional warehouses from which we stream, or which can prime an HPC center, for instance.

 

AM: can we (in theory) put that kind of archival data at FNAL? Yes.

 

Should we? AN: one copy at FNAL and one elsewhere (is this required to be a DOE “facility”?).

May also want copies that are regionally convenient at regional centers.

 

Some discussion--can today’s HPSS handle this kind of read load? BV says yes.

 

Should users be able to chaotically stage files from tape? General feeling seems to be no.


 

Mapping The WLCG Tier model


 

https://docs.google.com/presentation/d/1T68f6UL0xXSTSjUgC2w5txX-GFg0Bxf37SsBodlnRnQ/edit?usp=sharing - visible version of the model shown. Revised version after a conversation with Ian Collier; invented the D24 and D8 options.

 

Mike K: What are the support levels at the different types of sites? (Host, center, grid site)

 

Eric Lançon’s reference to DOMA site definitions is slide 20 here: https://docs.google.com/presentation/d/1F6hLvFT1X_z2Kpf49xgR_PzYrM69H5DXiUlpXxiK7I4/edit?ts=5cc06c1f#slide=id.g58baf64d22_10_169


 

Is it better to have more sites with 24x7 response (smaller sites like Prague will never have enough people to really cover 24x7), or more sites with multiple copies of datasets in case something goes down off-hours? (There was a flood incident even at one WLCG T1 centre, and it took ~4 months to bring that site back to life.)

 

Eric:  don’t have to think of boxes geographically mapped to a site--can have geographically distributed resources presented as a unified entity

 

Doug:  should consider network bandwidth needs especially transatlantic, due to SKA coming online while DUNE is running

 

Andrew: what data volume really needs to be on disk? Heidi--analyzed data, 3 PB/yr at the low end, maybe 10 PB at the top end? Tom--the reduction factor will be large. The near detector is still an unknown.

 

Mike--how many datasets are live at once--one, two, three? Andrew--what is the content of those datasets? Raw data, first reco, etc.? See

 

Mary: what are the MC requirements--2-3x processed? Does ML need raw data? ML should be able to train on a subset of raw and/or processed data?

 

Summarising MC requirements - the canonical 10:1 ratio of processed MC was used for NOvA; the assumption is that this remains true for DUNE. (HS: but NOvA does onboard zero-suppression - need to compare with the potential full data size.) What are the NOvA zero-suppression level and the actual far detector rate that we are comparing to? I.e., the number of events we’re talking about for MC and data. HS: suggestion for a Task Force to estimate MC needs.

 

CPU architectures:

 

Heidi updated this table based on the spreadsheet numbers. Need a version for the ProtoDUNEs and NP as well. PD should be possible to do now.



 

Far Detector

               

Data Type | Amount/yr | Copies | Num of versions | Total/yr | Disk | Tape | Lifetime on disk | Note
Raw | 30 PB | 2? | 1 | 60 PB | 3 PB | 100% | Short | Assume 1 month?
Reduced | 0.3 PB | 2+ | ~2 | 1.2 PB | 100% | ? | 6 months? | Likely 2 versions but only one on disk
Reco Path A | 1.5 PB | 2+ | 2 | 6 PB | 100% | | Always on disk in some form | Assume we always keep 2 versions x 2 copies
Reco Path B | 1.5 PB | 2+ | 2 | 6 PB | 100% | ? | Merge with Path A and store that | Much better to run multiple algorithms at the same time, but some architectures may require different times?
Reco Path Nth... | | | | | | | |
MC | 6 PB | 1 | 2 | 6 PB | 100% | | | 10:1 ratio with raw? But MC has a lot of overhead; add a factor of 2 for that
Raw-like | 3 PB | 1? | 1 | | | | | ~10% of Raw for ML training. Is it different from Raw?

 

Official numbers spreadsheet https://docs.google.com/spreadsheets/d/17Xtwl3lIT00xOYgZMhMtyLt6ebFjrsraNHXI39y3hSY/edit#gid=1918904950
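A quick consistency check of the Total/yr column in the table above (a sketch; it just multiplies Amount/yr by copies and versions; the MC row is quoted directly as 6 PB and does not follow this simple product):

```python
# Total/yr ~= Amount/yr x Copies x Versions, using nominal values from the table above.
rows = {
    # name: (amount_PB_per_yr, copies, versions)
    "Raw":         (30.0, 2, 1),
    "Reduced":     (0.3,  2, 2),
    "Reco Path A": (1.5,  2, 2),
    "Reco Path B": (1.5,  2, 2),
}

for name, (amount, copies, versions) in rows.items():
    print(f"{name:12s} {amount * copies * versions:5.1f} PB/yr")
# -> Raw 60.0, Reduced 1.2, Reco Path A 6.0, Reco Path B 6.0, matching the Total/yr column.
```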

 

Can we get quick access to HPCs for hit-finding production? SKA and other astro people do get quick access.

 

Supernovas?  General events? 

 

More generally for the centers - what level of DB access, code access. 

 

Discussion of commercial cloud - nice way to get latest version of new hardware for testing.  Good for peak/infrequent activities where buying your own is not cost-effective.

 

Make requirements on bandwidths.  

 

(DRAFT) Summary of interim conclusions stated in the session:

 


 

Data Management Technologies session

 

Steve Timm is preparing a document

 

Steve: explains SAM

 

Question about use cases -> are we overspecifying? Not now, as these are existing examples.

 

Examples of what is needed to “see”/get a single file. The DBs are reasonably simple; the API probably needs to join together info from many DBs.

 

File attributes → query run attributes DB? 

Datasets → move to snapshots?  Rucio supports hierarchies

File location → locations DB

File association with a particular job

 

Examples of what is needed to put a single file into the DB.   

 

Slide 6 - SAM locate-file → Rucio (see the replica-lookup sketch after this list)

sam list-files → discovery DB / runs DB?

SAM projects → POMS or another job system knows how to talk to them.
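For reference, a minimal sketch of the Rucio-client analogue of the SAM locate-file lookup mentioned above (the scope and file name are placeholders, and a configured Rucio client with credentials is assumed):

```python
from rucio.client import Client

# Hypothetical DID -- the scope and file name here are placeholders, not real DUNE data.
did = {"scope": "protodune-sp", "name": "np04_raw_run005141_0001.root"}

client = Client()  # picks up server URL and credentials from the local Rucio configuration
for replica in client.list_replicas([did]):
    # Each replica record lists the RSEs holding the file and the corresponding access URLs.
    print(replica["name"], list(replica["rses"].keys()))
```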

 

Discussion of hierarchical metadata instead of tied to files. 

 

Book-keeping:

 

Figure out where all the data for a particular trigger record are - supernova → data management

 

Figure out which files have been processed or not.  → workload?


 

Robert Illingworth on Rucio 

 

Currently running a hybrid Rucio/SAM system - possibility of mismatch between the 2 DBs.

 

Some discussion of object stores (good, bad, oh my god not that again…).

 

Paul Laycock on Rucio at Belle II

 

Have used DIRAC for quite a while with the LFC file catalog

 

Having to introduce Rucio while data taking has already started is interesting

 

Use a Rucio file catalog plugin to DIRAC. Everything talks to the DIRAC plugin, not directly to Rucio.

 

What was good about Rucio - automation of things.  Data lifetime would be very nice. DM services run at BNL, the DIRAC stuff is in Japan.  

 

Chris on DIRAC 

 

Does both file and grid management.  

 

The file catalog is used by many DIRAC users - Belle, LHCb, …

 

Marco on GlideinWMS

 

Works with multiple local batch systems, lots of experience in the field (CMS, OSG, FNAL IF)

 

Long-term support: moving to support CentOS 8 now; other long-term customers are around (CMS + OSG for instance)

 

Chris on DIRAC - have tested using local files from the DIRAC catalog and sam access from a local script. 

 

Torre on PanDA  

 

Works with lots of different facilities and integrated with Rucio

 

Discussion of the ATLAS event service. It can stream events independently and can keep cores busy even if you get a really long event on one core. It also recovers from preemption by recovering individual events. Simulation is ½ of ATLAS usage, and the event service works well for this.

 

Working on the Intelligent Data Delivery System (iDDS) with IRIS-HEP. Can join the project through HSF.

 

HPC use for ML - Fast Simulation, Analysis, Tracking  

Assume future machines will support common ML library. 

 

Question about how to get fast turnaround.  

 

PanDA for nightly testing (ART system) - 3000 jobs/month.

 

Plug PanDA onto the back-end of DIRAC.

 

Kirby asks about using the event service independently of some of the other PanDA-specific parts. It seems to work even with their very high event rates. How does this map onto DUNE? APA? Trigger record?

 

[Torre - today, the event service operates only within PanDA and trying it would involve using PanDA. Objective of iDDS (which for ATLAS is the next step for the event service) is to support event delivery independent of PanDA. (Developing this in the IRIS-HEP/HSF context is intended to ensure the PanDA independence really happens). Early DUNE involvement in iDDS (with whatever workload manager(s) it chooses) could ensure DUNE use cases and functionality are addressed from the beginning.]

 

Ask Mary B. to find a volunteer to study PanDA/DUNE

 

Anthony Tiradani on HEPCloud - NERSC/cloud resources are in there - the production version started this year.

 

Discussion of decision engine.

Has a “replay” capability that allows you to figure out why things happened.

 

Question about protecting budgets - production workflows only, and HEPCloud can help minimize costs.

 

Similarly, how do you manage HPC allocations? Can monitor burn rates.

 

What if an individual user acquires a large allocation - do we have a way to allow that person to use that resource in a dedicated way?

 

Andrew on Vcycle - a small agent to make VMs on various systems.

 

Now working with IRIS in UK. How does this interact with HEPCloud?

 

Interesting questions about how one can plan use of allocations - profile of expected needs. 

 

Need some policy agreements for DUNE to use this. 

 

SAM as Workflow Management Comments:

 

Having a data delivery service that tracks the location, bandwidth, and latency of a file to a grid job would be important - Heidi.

 

We should avoid being tied to the file-based structure that we cooked into SAM; have something that is more flexible and able to deal with “data cells” and object stores - Doug Benjamin.

 

POMS as Workflow Management Comments:

 

Would the new project/station functionality be part of POMS or the other stuff? - Heidi  The discussion at the lab is that this would be separate from POMS. - Marc

 

Tuesday morning

 

Andrew shows a slide with possible combos of Production/Workflow WMS and DMS, and asks what people think about what to use and what to put dev effort into in the short/medium term, and if we want to keep options open longer term. 

 

Questions - where does jobsub fit in?

 

As explained below, the consensus was that we need more requirements gathering to decide. The strawman that any of the alternatives would be sufficient at our scale had two exceptions identified: handling object storage with chunks that need to be assembled into events for processing, and job-to-job pipelining rather than using storage as the intermediary.
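To make the first exception concrete, a hypothetical sketch of assembling a trigger record from chunks that arrive as separate objects (all names and the grouping key are illustrative, not an agreed DUNE interface):

```python
from collections import defaultdict

def build_events(chunks, expected_components):
    """Group object-store chunks into complete trigger records.

    `chunks` yields (trigger_record_id, component_id, payload) tuples, e.g. one
    payload per APA; a record is released only once every expected component
    has arrived.  Purely illustrative -- not an agreed DUNE interface.
    """
    pending = defaultdict(dict)
    for trigger_record_id, component_id, payload in chunks:
        pending[trigger_record_id][component_id] = payload
        if set(pending[trigger_record_id]) == set(expected_components):
            yield trigger_record_id, pending.pop(trigger_record_id)

# Example: two trigger records, each split into per-APA chunks arriving out of order.
stream = [(1, "APA1", b"..."), (2, "APA1", b"..."), (1, "APA2", b"..."), (2, "APA2", b"...")]
for tr_id, record in build_events(stream, expected_components=["APA1", "APA2"]):
    print(tr_id, sorted(record))
```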

 

Fergus had asked yesterday about decision process

 

From the consortium document:

 

“Technical advisory board

 

The Consortium Lead, in collaboration with the Consortium, will convene Technical Advisory Boards as needed. A Board will be convened when there is a particular technical issue to address and will be given a charge appropriate to the issue at hand, such as reviewing and recommending solutions to a technical issue. Boards will include the 3 Consortium leads, the Software Architect, relevant subgroup representatives and technical experts chosen appropriately to address the charge. The outcome of a Board will typically be an advisory report delivered to Consortium leadership, the report being public within the Collaboration.”

 

Mary and Steve and Heidi would like us to come up with requirements informed by the technical decisions we need to make. 

 

Anna Mazzacane asks that we make detector characterization a use case, not just the final running beast in steady state. Need more data, more processing steps. Mary - ProtoDUNE may be harder, so it is a good test.

 

Questions about support (short and long term) for the various products

 

Questions about getting feedback from stakeholders.   CISC/CAL/DAQ/physics groups

 

Is object store an additional use case beyond the file based system? 

 

Andrew Norman - if we’re doing files, any of these systems can do the job.  

 

Steve Timm - what about heterogeneous workflows? Andrew Norman - astro experiments do this using.

 

Multi-node pipelines - GPU -> a different machine doing something else. If they communicate through storage, things are OK… ATLAS is doing this.

 

Do we envision a mixture of workflows? Probably yes.

 

Very Top Level Requirements questions from Mary B. 

 

- Software framework interface requirements: for example, memory/data-format requirements that interface with the WMS - does ROOT meet those requirements, for example?

 

- Long Term Maintainability 

 

- Heterogeneous workflows: e.g. HTC -> HPC -> GPU

Data preparation, Pattern recognition, Simulation, Data analysis, tuple creation

 

Template for Workflow Use Case Description:

https://docs.google.com/document/d/1lYNANqaE6r32u0oTWCcyQ2yfNEO6og3nKtNEDiTt85U/edit#

 

Need to go through the workflows: what are the steps?

 

Define each of the processing stages as individual tasks.

Define the input and output size for each of those tasks.

Data lifetime of the output for each stage.

When the campaign is reprocessed, what will be the impact on the data model and storage? (A sketch of capturing these per-stage answers follows below.)
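A possible way to capture those per-stage answers in machine-readable form (a sketch only; the field names and numbers are invented, not the template's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ProcessingStage:
    """One workflow stage, following the questions listed above (illustrative fields only)."""
    name: str
    input_size_per_unit_gb: float     # e.g. per trigger record
    output_size_per_unit_gb: float
    output_lifetime: str              # e.g. "6 months on disk, archived to tape"
    needs_db_access: bool             # conditions/calibration DB access during the stage
    reprocessing_note: str            # impact on data model/storage when the campaign is redone

# Placeholder example -- the numbers are not agreed estimates.
hit_finding = ProcessingStage(
    name="hit finding",
    input_size_per_unit_gb=6.0,
    output_size_per_unit_gb=0.6,
    output_lifetime="until the next reco pass",
    needs_db_access=True,
    reprocessing_note="old output dropped from disk; keep the tape copy?",
)
print(hit_finding)
```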


 

Some workflow diagrams to discuss

 

https://docs.google.com/presentation/d/1d-qrtHwd5u-D91YDQ9p5kTm_G_l1DxZOBCrGeGyRgAk/edit?usp=sharing

 

Do we write intermediate steps to disk? Tape? What gets distributed?

Which stages need DB access? 

 

If input data is carried to the output of a stage, and if splits and joins between stages are supported, then some support for handling duplicate data (the same exact object carried along both branches) is needed in the framework.
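A minimal illustration of that duplicate-handling requirement, assuming each data product carries a stable unique identifier (a sketch, not a framework design):

```python
def join_branches(*branches):
    """Merge outputs of parallel workflow branches, keeping a carried-along object only once.
    Each object is assumed to expose a stable unique id (here just a dict entry)."""
    seen = set()
    merged = []
    for branch in branches:
        for obj in branch:
            if obj["id"] not in seen:
                seen.add(obj["id"])
                merged.append(obj)
    return merged

# The raw object with id=1 was carried through both branches; it appears once after the join.
branch_a = [{"id": 1, "kind": "raw"}, {"id": 2, "kind": "recoA"}]
branch_b = [{"id": 1, "kind": "raw"}, {"id": 3, "kind": "recoB"}]
print([o["id"] for o in join_branches(branch_a, branch_b)])   # [1, 2, 3]
```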

 

- Object storage/Offline event building

 

- Distributed data requirements: >= 2 copies of the raw data on tape with 24/7 support, and 10%(?) dedicated disk space for staging, plus processed data and simulation data (x10 of processed data?) at different sites on disk with local support (see Heidi’s picture from yesterday)

 

-Network requirements:

 

-Specific stakeholder requirements: DAQ, Calibration, Physics Analysis

 

One also needs to think about workflow stages that communicate via storage (i.e. the outputs of one are the inputs to the next) vs. those that communicate directly between jobs.

 

The “mergeAna” workflow with project.py is a kind of merge workflow used by ProtoDUNE, but it does not use SAM for bookkeeping; it does that with file lists. A first exercise towards a more complicated merge workflow is to incorporate this with SAM and POMS.


 

Mary asks: can we get the experts to summarize the major features of the proposed workflows? Summarize the slides we saw yesterday. MB: specifications of the software packages proposed that address our requirements at each stage. For example, what features of Rucio plus a file management system (SAM or a replacement) reliably allow for offline event building from file fragments (if that is our plan)?

 

Chris B. - where are we strange (i.e., different from other experiments)?

 

Heidi: question about data formats. Is that a framework problem rather than a workflow problem?

Paul L comment on this for late-stage analysis in Belle II: we use ROOT as the format, but for analysis, Python packages are used which untie this dependence: the “root_numpy” and “root_pandas” packages convert TTrees to numpy arrays / pandas dataframes respectively, and “uproot” is a package increasing in popularity which has its own optimised numpy-like array formats.
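For example, a minimal sketch of that untying with uproot (the file, tree, and branch names are placeholders; assumes uproot 4 with pandas installed):

```python
import uproot  # assumes uproot >= 4; the file, tree and branch names below are placeholders

with uproot.open("analysis_tuples.root") as f:
    tree = f["events"]
    # Read selected branches straight into a pandas DataFrame, with no ROOT installation needed.
    df = tree.arrays(["run", "subrun", "event", "nu_energy"], library="pd")

print(df.head())
```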

 

MB: some selected event data needs to be stored in a format for 3-D display:

 

Discussion of whether we would store data to display chosen events. Mary points out that event displays of all selected neutrino events in the far detector will be an important part of validating any ML results and reconstruction.

 

An example of a 3-D event display is the BEE display (used online for ProtoDUNE); here is an example with simulation:

 

https://www.phy.bnl.gov/wire-cell/bee/set/23/event/0/

 

This is very different from colliders and is unique to neutrino experiments: the need to visually assess large numbers of events to validate reconstruction. 

 

Operations discussion (Thursday afternoon)

 


 

https://docs.google.com/document/d/1x0ima7MUiMNhwYU0rg37MLM_rFCrcEY9Qfuf76MmGFw/edit?usp=sharing is a link to a document with a list of current things we use and questions about mapping their functionality going forward.