This is the content of https://docs.google.com/document/d/1EJ1dSSJyL7ojTbBhUV41CK0HW4MobfZimpDn09cdtBY/edit as of 9/14/2019

 

DUNE COMPUTING MODEL WORKSHOP NOTES

 

Monday 9 Sep 2019

 

Present:  

 

In room:

 

S. Timm, A. McNab, M. Kirby, H. Schellman, C. Brew, F. Wilson, R. Illingworth, E. Lancon, P. Laycock, M. Bishai, T. Junk, P. van Gemmeren, P. Ding, K. Herner, M. Mambelli, D. Benjamin, A. Norman, B. Jayatilaka, S. Fuess, K. Biery, P. DeMar, A. Mazzacane, A. Tiradani, K. Ellis, J. Boyd

 

On Zoom:

 

P. Clarke, N. Buchanan, B. Viren, G. Cooper, B. White, T. Wenaus, R. Nandakumar, A. Thea, M. Nebot, S. Calvez, P. Vokac, D. Adams, M. Votava, S. White, A. Steklain

 

TALKS Monday

 

M. Kirby--logistics

 

Discussion--pre-GDB and DUNE are both booked in this room Tuesday morning; one of them needs to change.

 

Andrew McNab-- goals of the workshops

 

Discussion: Fergus--wants to know what decisions need to be made and by what mechanism they will be made.

 

Steve--can you speak to the external constraints that are driving the design of the computing model now?

 

Heidi--need first round of computing model that we can show to local funding agencies by end of calendar 2019

 

Pete--that’s correct and it needs to be plausible.

 

Heidi Schellman:  DUNE DATA OVERVIEW

 

Agreement with DAQ is not to write more than 30 PB/yr to offline in total from FD, even as modules are added.

 

About 1 supernova “candidate” per month, ~460 TB with 4 modules; about 10 hrs to transfer from the FD at 100 Gb/s.
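A quick back-of-the-envelope check of that transfer time (a sketch only; it uses the 460 TB and 100 Gb/s figures quoted above and assumes the full link bandwidth is available):

```python
# Rough check of the supernova-candidate transfer time quoted above.
size_bytes = 460e12        # ~460 TB per candidate with 4 modules
link_bps = 100e9           # 100 Gb/s FD-to-offline link, assumed fully available

transfer_hours = size_bytes * 8 / link_bps / 3600
print(f"{transfer_hours:.1f} hours")   # ~10.2 hours, consistent with the ~10 hr estimate
```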

 

Discussion--early in the planning it is worth looking at other big experiments and how accurate their numbers were, to figure out how much contingency we should put in.

 

What is the fraction of analysis (vs. production)?

 

Schellman: comment that analysis is currently pretty anarchic, while production is isolated.

 

M. Bishai: Can we continuously exercise the system to make sure we’re ready for a big supernova burst and high analysis demand?

 

------------------------------------

 

10:45 discussion “Mapping Requirements to Sites”

 

Data Volumes

LHCb has 27 PB in storage at Tier-1.

 

ST: Should define data storage based on what type of activities you need to do at the various tiers

 

HS: Should streaming be a significant part of the models

 

CB: from CMS--streaming works but wouldn’t want to use it for everything

 

ST: need to understand the IOPS requirements at any given site. Also which files are hot and accessed a large fraction of the time.

 

AN: 2 separate problems: (1) archival storage of the data, and who should have archival/custodial responsibility; (2) how do we analyze data at resources worldwide? A multi-tiered approach with national/regional warehouses from which we stream, or which can prime an HPC center, for instance.

 

AM: can we (in theory) put that kind of archival data at FNAL? Yes.

 

Should we? AN: one copy at FNAL and one elsewhere (is this required to be a DOE “facility”?).

May also want copies that are regionally convenient at regional centers.

 

Some discussion--can today’s HPSS handle this kind of read load? BV says yes.

 

Should users be able to chaotically stage files from tape? General feeling seems to be no.


 

Mapping The WLCG Tier model


 

https://docs.google.com/presentation/d/1T68f6UL0xXSTSjUgC2w5txX-GFg0Bxf37SsBodlnRnQ/edit?usp=sharing - visible version of the model shown. Revised version after a conversation with Ian Collier; invented the D24 and D8 options.

 

Mike K: What are the support levels at the different types of sites? (Host, center, grid site)

 

Eric Lançon’s reference to DOMA site definitions is slide 20 here: https://docs.google.com/presentation/d/1F6hLvFT1X_z2Kpf49xgR_PzYrM69H5DXiUlpXxiK7I4/edit?ts=5cc06c1f#slide=id.g58baf64d22_10_169


 

Is it better to have more sites with 24x7 response (smaller sites like Prague will never have enough people to really cover 24x7), or more sites with multiple copies of datasets in case something goes down off-hours? (There was a flood incident even at one WLCG T1 centre, and it took ~4 months to bring that site back to life.)

 

Eric:  don’t have to think of boxes geographically mapped to a site--can have geographically distributed resources presented as a unified entity

 

Doug:  should consider network bandwidth needs especially transatlantic, due to SKA coming online while DUNE is running

 

Andrew: what data volume really needs to be on disk? Heidi--analyzed data, 3 PB/yr at the low end, maybe 10 PB at the top end? Tom--the reduction factor will be large. The near detector is still an unknown.

 

Mike--how many datasets are live at once--one, two, three? Andrew--what is the content of those datasets? Raw data, first reco, etc.? See

 

Mary: what are the MC requirements--2-3x processed? Does ML need raw data? ML should be able to train on a subset of raw and/or processed data?

 

Summarising MC requirements - the canonical 10:1 ratio of processed MC was used for NOvA; the assumption is that this remains true for DUNE. (HS: but NOvA does onboard zero-suppression - need to compare with the potential full data size.) What are the NOvA zero-suppression level and the actual far detector rate that we are comparing to? I.e., the number of events we’re talking about for MC and data. HS: suggestion for a Task Force to estimate MC needs.

 

CPU architectures:

 

Heidi updated this table based on the spreadsheet numbers. Need a version for the ProtoDUNEs and NP as well. PD should be possible to do now.



 

Far Detector

               

Data Type | Amount/yr | Copies | Num of versions | Total/yr | Disk | Tape | Lifetime on disk | Note
Raw | 30 PB | 2? | 1 | 60 PB | 3 PB | 100% | Short | Assume 1 month?
Reduced | 0.3 PB | 2+ | ~2 | 1.2 PB | 100% | ? | 6 months? | Likely 2 versions but only one on disk
Reco Path A | 1.5 PB | 2+ | 2 | 6 PB | 100% | | Always on disk in some form | Assume we always keep 2 versions x 2 copies
Reco Path B | 1.5 PB | 2+ | 2 | 6 PB | 100% | ? | Merge with Path A and store that | Much better to run multiple algorithms at the same time, but some architectures may require different times?
Reco Path Nth... | | | | | | | |
MC | 6 PB | 1 | 2 | 6 PB | 100% | | | 10:1 ratio with raw? But MC has a lot of overhead; add a factor of 2 for that
Raw-like | 3 PB | 1? | 1 | | | | | ~10% of Raw for ML training. Is it different from Raw?

 

Official numbers spreadsheet https://docs.google.com/spreadsheets/d/17Xtwl3lIT00xOYgZMhMtyLt6ebFjrsraNHXI39y3hSY/edit#gid=1918904950
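A quick consistency check of the Total/yr column in the table above (a sketch; it just multiplies Amount/yr by copies and versions; the MC row is quoted directly as 6 PB and does not follow this simple product):

```python
# Total/yr ~= Amount/yr x Copies x Versions, using nominal values from the table above.
rows = {
    # name: (amount_PB_per_yr, copies, versions)
    "Raw":         (30.0, 2, 1),
    "Reduced":     (0.3,  2, 2),
    "Reco Path A": (1.5,  2, 2),
    "Reco Path B": (1.5,  2, 2),
}

for name, (amount, copies, versions) in rows.items():
    print(f"{name:12s} {amount * copies * versions:5.1f} PB/yr")
# -> Raw 60.0, Reduced 1.2, Reco Path A 6.0, Reco Path B 6.0, matching the Total/yr column.
```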

 

Can we get quick access to HPCs for hit-finding production? SKA and other astro people do get quick access.

 

Supernovas?  General events? 

 

More generally for the centers - what level of DB access, code access. 

 

Discussion of commercial cloud - nice way to get latest version of new hardware for testing.  Good for peak/infrequent activities where buying your own is not cost-effective.

 

Make requirements on bandwidths.  

 

(DRAFT) Summary of interim conclusions stated in the session:

 


 

Data Management Technologies session

 

Steve Timm is preparing a document

 

Steve: explains SAM

 

Question about use cases -> are we overspecifying? Not now, as these are existing examples.

 

Examples of what is needed to “see”/get a single file. The DBs are reasonably simple; the API probably needs to join together info from many DBs.

 

File attributes → query run attributes DB? 

Datasets → move to snapshots?  Rucio supports hierarchies

File location → locations DB

File association with a particular job

 

Examples of what is needed to put a single file into the DB.   

 

Slide 6 - SAM locate-file → Rucio (see the replica-lookup sketch after this list)

sam list-files → discovery DB / runs DB?

SAM projects → POMS or another job system knows how to talk to them.
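For reference, a minimal sketch of the Rucio-client analogue of the SAM locate-file lookup mentioned above (the scope and file name are placeholders, and a configured Rucio client with credentials is assumed):

```python
from rucio.client import Client

# Hypothetical DID -- the scope and file name here are placeholders, not real DUNE data.
did = {"scope": "protodune-sp", "name": "np04_raw_run005141_0001.root"}

client = Client()  # picks up server URL and credentials from the local Rucio configuration
for replica in client.list_replicas([did]):
    # Each replica record lists the RSEs holding the file and the corresponding access URLs.
    print(replica["name"], list(replica["rses"].keys()))
```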

 

Discussion of hierarchical metadata instead of tied to files. 

 

Book-keeping:

 

Figure out where all the data for a particular trigger record are - supernova → data management

 

Figure out which files have been processed or not.  → workload?


 

Robert Illingworth on Rucio 

 

Currently running a hybrid Rucio/SAM system - possibility of mismatch between the 2 DBs.

 

Some discussion of object stores (good, bad, oh my god not that again…).

 

Paul Laycock on Rucio at Belle II

 

Have used DIRAC for quite a while with the LFC file catalog

 

Having to introduce Rucio while data taking has already started is interesting

 

Use a Rucio file catalog plugin to DIRAC. Everything talks to the DIRAC plugin, not directly to Rucio.

 

What was good about Rucio - automation of things.  Data lifetime would be very nice. DM services run at BNL, the DIRAC stuff is in Japan.  

 

Chris on DIRAC 

 

Does both file and grid management.  

 

The file catalog is used by many DIRAC users - Belle, LHCb, …

 

Marco on GlideinWMS

 

Works with multiple local batch systems, lots of experience in the field (CMS, OSG, FNAL IF)

 

Long-term support: moving to support CentOS 8 now; other long-term customers are around (CMS + OSG for instance)

 

Chris on DIRAC - have tested using local files from the DIRAC catalog and sam access from a local script. 

 

Torre on PanDA  

 

Works with lots of different facilities and integrated with Rucio

 

Discussion of the ATLAS event service. It can stream events independently and can keep cores busy even if you get a really long event on one core. It also recovers from preemption by recovering individual events. Simulation is ½ of ATLAS usage, and the event service works well for this.

 

Working on the Intelligent Data Delivery System (iDDS) with IRIS-HEP. Can join the project through HSF.

 

HPC use for ML - Fast Simulation, Analysis, Tracking  

Assume future machines will support common ML library. 

 

Question about how to get fast turnaround.  

 

PanDA for nightly testing (ART system) - 3000 jobs/month.

 

Plug PanDA onto the back-end of DIRAC.

 

Kirby asks about using the event service independently of some of the other PanDA-specific parts. It seems to work even with their very high event rates. How does this map onto DUNE? APA? Trigger record?

 

[Torre - today, the event service operates only within PanDA and trying it would involve using PanDA. Objective of iDDS (which for ATLAS is the next step for the event service) is to support event delivery independent of PanDA. (Developing this in the IRIS-HEP/HSF context is intended to ensure the PanDA independence really happens). Early DUNE involvement in iDDS (with whatever workload manager(s) it chooses) could ensure DUNE use cases and functionality are addressed from the beginning.]

 

Ask Mary B. to find a volunteer to study PanDA/DUNE

 

Anthony Tiradani on HEPCloud - NERSC/cloud resources are in there - the production version started this year.

 

Discussion of decision engine.

Has a “replay” capability that allows you to figure out why things happened.

 

Question about protecting budgets - production workflows only, and HEPCloud can help minimize costs.

 

Similarly, how do you manage HPC allocations? Can monitor burn rates.

 

What if an individual user acquires a large allocation - do we have a way to allow that person to use that resource in a dedicated way?

 

Andrew on Vcycle - a small agent to make VMs on various systems.

 

Now working with IRIS in UK. How does this interact with HEPCloud?

 

Interesting questions about how one can plan use of allocations - profile of expected needs. 

 

Need some policy agreements for DUNE to use this. 

 

SAM as Workflow Management Comments:

 

Having a data delivery service that tracks the location, bandwidth, and latency of a file to a grid job would be important - Heidi.

 

We should avoid being tied to the file-based structure that we cooked into SAM; have something that is more flexible and able to deal with “data cells” and object stores - Doug Benjamin.

 

POMS as Workflow Management Comments:

 

Would the new project/station functionality be part of POMS or the other stuff? - Heidi  The discussion at the lab is that this would be separate from POMS. - Marc

 

Tuesday morning

 

Andrew shows a slide with possible combos of Production/Workflow WMS and DMS, and asks what people think about what to use and what to put dev effort into in the short/medium term, and if we want to keep options open longer term. 

 

Questions - where does jobsub fit in?

 

As explained below, the consensus was that we need more requirements gathering to decide. The strawman that any of the alternatives would be sufficient at our scale had two exceptions identified: handling object storage with chunks that need to be assembled into events for processing, and job-to-job pipelining rather than using storage as the intermediary.
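To make the first exception concrete, a hypothetical sketch of assembling a trigger record from chunks that arrive as separate objects (all names and the grouping key are illustrative, not an agreed DUNE interface):

```python
from collections import defaultdict

def build_events(chunks, expected_components):
    """Group object-store chunks into complete trigger records.

    `chunks` yields (trigger_record_id, component_id, payload) tuples, e.g. one
    payload per APA; a record is released only once every expected component
    has arrived.  Purely illustrative -- not an agreed DUNE interface.
    """
    pending = defaultdict(dict)
    for trigger_record_id, component_id, payload in chunks:
        pending[trigger_record_id][component_id] = payload
        if set(pending[trigger_record_id]) == set(expected_components):
            yield trigger_record_id, pending.pop(trigger_record_id)

# Example: two trigger records, each split into per-APA chunks arriving out of order.
stream = [(1, "APA1", b"..."), (2, "APA1", b"..."), (1, "APA2", b"..."), (2, "APA2", b"...")]
for tr_id, record in build_events(stream, expected_components=["APA1", "APA2"]):
    print(tr_id, sorted(record))
```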

 

Fergus had asked yesterday about decision process

 

From the consortium document:

 

“Technical advisory board

 

The Consortium Lead, in collaboration with the Consortium, will convene Technical Advisory Boards as needed. A Board will be convened when there is a particular technical issue to address and will be given a charge appropriate to the issue at hand, such as reviewing and recommending solutions to a technical issue. Boards will include the 3 Consortium leads, the Software Architect, relevant subgroup representatives and technical experts chosen appropriately to address the charge. The outcome of a Board will typically be an advisory report delivered to Consortium leadership, the report being public within the Collaboration.”

 

Mary and Steve and Heidi would like us to come up with requirements informed by the technical decisions we need to make. 

 

Anna Mazzacane asks that we make detector characterization a use case, not just the final running beast in steady state. Need more data, more processing steps. Mary - ProtoDUNE may be harder, so it is a good test.

 

Questions about support (short and long term) for the various products

 

Questions about getting feedback from stakeholders.   CISC/CAL/DAQ/physics groups

 

Is object store an additional use case beyond the file based system? 

 

Andrew Norman - if we’re doing files, any of these systems can do the job.  

 

Steve Timm - what about heterogeneous workflows? Andrew Norman - astro experiments do this using.

 

Multi-node pipelines - GPU -> a different machine doing something else. If they communicate through storage, things are OK… ATLAS is doing this.

 

Do we envision a mixture of workflows? Probably yes.

 

Very Top Level Requirements questions from Mary B. 

 

- Software framework interface requirements: for example, memory/data-format requirements that interface with the WMS - does ROOT meet those requirements, for example?

 

- Long Term Maintainability 

 

- Heterogeneous workflows: e.g. HTC -> HPC -> GPU

Data preparation, Pattern recognition, Simulation, Data analysis, tuple creation

 

Template for Workflow Use Case Description:

https://docs.google.com/document/d/1lYNANqaE6r32u0oTWCcyQ2yfNEO6og3nKtNEDiTt85U/edit#

 

Need to go through the workflows: what are the steps?

 

Define each of the processing stages as individual tasks.

Define the input and output size for each of those tasks.

Data lifetime of the output for each stage.

When the campaign is reprocessed, what will be the impact on the data model and storage? (A sketch of capturing these per-stage answers follows below.)
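A possible way to capture those per-stage answers in machine-readable form (a sketch only; the field names and numbers are invented, not the template's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ProcessingStage:
    """One workflow stage, following the questions listed above (illustrative fields only)."""
    name: str
    input_size_per_unit_gb: float     # e.g. per trigger record
    output_size_per_unit_gb: float
    output_lifetime: str              # e.g. "6 months on disk, archived to tape"
    needs_db_access: bool             # conditions/calibration DB access during the stage
    reprocessing_note: str            # impact on data model/storage when the campaign is redone

# Placeholder example -- the numbers are not agreed estimates.
hit_finding = ProcessingStage(
    name="hit finding",
    input_size_per_unit_gb=6.0,
    output_size_per_unit_gb=0.6,
    output_lifetime="until the next reco pass",
    needs_db_access=True,
    reprocessing_note="old output dropped from disk; keep the tape copy?",
)
print(hit_finding)
```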


 

Some workflow diagrams to discuss

 

https://docs.google.com/presentation/d/1d-qrtHwd5u-D91YDQ9p5kTm_G_l1DxZOBCrGeGyRgAk/edit?usp=sharing

 

Do we write intermediate steps to disk? Tape? What gets distributed?

Which stages need DB access? 

 

If input data is carried to the output of a stage, and if splits and joins between stages are supported, then some support for handling duplicate data (the same exact object carried along both branches) is needed in the framework.
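A minimal illustration of that duplicate-handling requirement, assuming each data product carries a stable unique identifier (a sketch, not a framework design):

```python
def join_branches(*branches):
    """Merge outputs of parallel workflow branches, keeping a carried-along object only once.
    Each object is assumed to expose a stable unique id (here just a dict entry)."""
    seen = set()
    merged = []
    for branch in branches:
        for obj in branch:
            if obj["id"] not in seen:
                seen.add(obj["id"])
                merged.append(obj)
    return merged

# The raw object with id=1 was carried through both branches; it appears once after the join.
branch_a = [{"id": 1, "kind": "raw"}, {"id": 2, "kind": "recoA"}]
branch_b = [{"id": 1, "kind": "raw"}, {"id": 3, "kind": "recoB"}]
print([o["id"] for o in join_branches(branch_a, branch_b)])   # [1, 2, 3]
```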

 

- Object storage/Offline event building

 

- Distributed data requirements: >= 2 copies of the raw data on tape with 24/7 support, and 10%(?) dedicated disk space for staging, plus processed data and simulation data (x10 of processed data?) at different sites on disk with local support (see Heidi’s picture from yesterday)

 

-Network requirements:

 

-Specific stakeholder requirements: DAQ, Calibration, Physics Analysis

 

One also needs to think about workflow stages that communicate via storage (i.e. the outputs of one are the inputs to the next) vs. those that communicate directly between jobs.

 

The “mergeAna” workflow with project.py is a kind of merge workflow used by ProtoDUNE, but it does not use SAM for bookkeeping; it does that with file lists. A first exercise towards a more complicated merge workflow is to incorporate this with SAM and POMS.


 

Mary asks: can we get the experts to summarize the major features of the proposed workflows? Summarize the slides we saw yesterday. MB: specifications of the software packages proposed that address our requirements at each stage. For example, what features of Rucio plus a file management system (SAM or a replacement) reliably allow for offline event building from file fragments (if that is our plan)?

 

Chris B. - where are we strange (i.e., different from other experiments)?

 

Heidi: question about data formats. Is that a framework problem rather than a workflow problem?

Paul L comment on this for late-stage analysis in Belle II: we use ROOT as the format, but for analysis, Python packages are used which untie this dependence: the “root_numpy” and “root_pandas” packages convert TTrees to numpy arrays / pandas dataframes respectively, and “uproot” is a package increasing in popularity which has its own optimised numpy-like array formats.
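For example, a minimal sketch of that untying with uproot (the file, tree, and branch names are placeholders; assumes uproot 4 with pandas installed):

```python
import uproot  # assumes uproot >= 4; the file, tree and branch names below are placeholders

with uproot.open("analysis_tuples.root") as f:
    tree = f["events"]
    # Read selected branches straight into a pandas DataFrame, with no ROOT installation needed.
    df = tree.arrays(["run", "subrun", "event", "nu_energy"], library="pd")

print(df.head())
```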

 

MB: some selected event data needs to be stored in a format for 3-D display:

 

Discussion of whether we would store data to display chosen events. Mary points out that event displays of all selected neutrino events in the far detector will be an important part of validating any ML results and reconstruction.

 

An example of a 3-D event display is the BEE display (used online for ProtoDUNE); here is an example with simulation:

 

https://www.phy.bnl.gov/wire-cell/bee/set/23/event/0/

 

This is very different from colliders and is unique to neutrino experiments: the need to visually assess large numbers of events to validate reconstruction. 

 

Operations discussion (Thursday afternoon)

 


 

https://docs.google.com/document/d/1x0ima7MUiMNhwYU0rg37MLM_rFCrcEY9Qfuf76MmGFw/edit?usp=sharing is a link to a document with a list of current things we use and questions about mapping their functionality going forward.