
DUNE Computing Model Workshop

US/Central
Fermilab


Andrew McNab (University of Manchester), Heidi Schellman (Oregon State), Michael Kirby (FNAL)
Description

 

This is a follow-up workshop to go over the computing infrastructure aspects of the DUNE computing model.

It meets Monday and Tuesday, Sept 9-10, at Fermilab and is open to members of the HEP community who have registered to attend.

Join Zoom Meeting https://fnal.zoom.us/j/6308404326

On Tuesday the workshop will meet in Wilson Hall, in Curia II (WH2SW).

Venue: Illinois Accelerator Research Center (IARC) at Fermilab

- Illinois Accelerator Research Center (IARC) will appear on Google Maps. Use that for directions.
- There is ample parking directly across from the IARC building.

It is held jointly with several other meetings on identity management and with the Grid Deployment Board meeting:

https://indico.cern.ch/event/739896/ pre-GDB - AuthZ WG (Tuesday)

https://indico.cern.ch/event/739882/ GDB (Wednesday)

https://indico.cern.ch/event/834658/ mini-FIM4R (Thursday)

If you plan to attend this meeting or any of the above, please register for each so the organizers can estimate room capacities.

Live minutes/notes/actions: https://docs.google.com/document/d/1EJ1dSSJyL7ojTbBhUV41CK0HW4MobfZimpDn09cdtBY/edit

Participants
  • Andre Fabiano Steklain Lisboa
  • Andrew McNab
  • Andrew Norman
  • Anna Mazzacane
  • Cheng-Ju Lin
  • Chris Brew
  • Doug Benjamin
  • Eric Lançon
  • Fergus Wilson
  • Heidi Schellman
  • Katy Ellis
  • Kenneth Herner
  • Kevin Retzke
  • Kurt Biery
  • Marco Mambelli
  • Margaret Votava
  • Mary Bishai
  • Michael Kirby
  • Norm Buchanan
  • Paul Laycock
  • Pengfei Ding
  • Peter van Gemmeren
  • Robert Illingworth
  • Sergey Uzunyan
  • Steven Calvez
  • Steven Timm
  • Stuart Fuess
  • Thomas Junk
    • Introduction and welcome

      This is the content of https://docs.google.com/document/d/1EJ1dSSJyL7ojTbBhUV41CK0HW4MobfZimpDn09cdtBY/edit as of 9/14/2019

       

      DUNE COMPUTING MODEL WORKSHOP NOTES

       

      Monday 9 Sep 2019

       

      Present:  

       

      In room:

       

      S. Timm, A. McNab, M. Kirby, H. Schellman, C. Brew, F. Wilson, R. Illingworth, E. Lancon, P. Laycock, M. Bishai, T. Junk, P. van Gemmeren, P. Ding, K. Herner, M. Mambelli, D. Benjamin, A. Norman, B. Jayatilaka, S. Fuess, K. Biery, P. DeMar, A. Mazzacane, A. Tiradani, K. Ellis, J. Boyd

       

      On Zoom:

       

      P. Clarke, N. Buchanan, B. Viren, G. Cooper, B. White, T. Wenaus, R. Nandakumar, A. Thea, M. Nebot, S. Calvez, P. Vokac, D. Adams, M. Votava, S. White, A. Steklain

       

      TALKS Monday

       

      M. Kirby--logistics

       

      Discussion--pre-GDB and DUNE are both booked in this room Tuesday morning; need to change one of them.

       

      Andrew McNab--goals of the workshop

       

      Discussion:  Fergus--wants to know what decisions need to be made and by what mechanism they will be made.

       

      Steve--can you speak to the external constraints that are driving the design of the computing model now?

       

      Heidi--need a first round of the computing model that we can show to local funding agencies by the end of calendar 2019.

       

      Pete--that’s correct and it needs to be plausible.

       

      Heidi Schellman:  DUNE DATA OVERVIEW

       

      Agreement with DAQ is not to write more than 30 PB/yr to offline in total from FD, even as modules are added.

       

      About 1 supernova “candidate” per month, 460TB with 4 modules. 10 hrs to transfer from FD at 100Gb/s.
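
      As a quick back-of-the-envelope check of the quoted transfer time, here is a minimal sketch using only the 460 TB and 100 Gb/s figures above (Python is used purely for illustration):

      # Back-of-the-envelope check of the supernova-candidate transfer time,
      # using the figures quoted above (460 TB readout, 100 Gb/s link from the FD).
      data_bytes = 460e12               # 460 TB
      link_bits_per_s = 100e9           # 100 Gb/s
      transfer_s = data_bytes * 8 / link_bits_per_s
      print(f"{transfer_s / 3600:.1f} hours")   # ~10.2 hours, consistent with the note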

       

      Discussion--Early in the planning it is worth looking at other big experiments and how accurate their numbers were, to figure out how much contingency we should put in.

       

      What is the fraction of resources used for analysis?

      • CMS 50%

      • ATLAS - 5% for T1, 25% for T2

       

      Schellman: comment that we're currently pretty anarchic for analysis, with production isolated.

       

      M. Bishai: Can we continuously exercise the system to make sure we're ready for a big supernova burst and high analysis demand?

       

      ------------------------------------

       

      10:45 discussion “Mapping Requirements to Sites”

       

      Data Volumes

      LHCb has 27 PB in storage at Tier-1

       

      ST: Should define data storage based on what type of activities you need to do at the various tiers

       

      HS: Should streaming be a significant part of the models?

       

      CB: from CMS--streaming works but wouldn’t want to use it for everything

       

      ST: need to understand the IOPS requirements on any given site.  Also which files are hot and accessed a large fraction of the time.

       

      AN: 2 separate problems:  (1) archival storage of the data, and who should have archival/custodial responsibility;  (2) how do we analyze data at resources worldwide? A multi-tiered approach with national/regional warehouses from which we stream, or which can prime an HPC center, for instance.

       

      AM: can we (in theory) put that kind of archival data at FNAL? Yes.

       

      Should we?--AN:  one copy at FNAL and one elsewhere (is this required to be a DOE “facility”?)

      May also want copies that are regionally convenient at regional centers.

       

      Some discussion--can today's HPSS handle this kind of read load?--BV says yes.

       

      Should users be able to chaotically stage files from tape? General feeling seems to be no.


       

      Mapping The WLCG Tier model


       

      https://docs.google.com/presentation/d/1T68f6UL0xXSTSjUgC2w5txX-GFg0Bxf37SsBodlnRnQ/edit?usp=sharing  is a visible version of the model shown. Revised version after conversation with Ian Collier; invented the D24 and D8 options.

       

      Mike K: What are the support levels at the different types of sites? (Host, center, grid site)

       

      Eric Lançon’s reference to DOMA site definitions is slide 20 here: https://docs.google.com/presentation/d/1F6hLvFT1X_z2Kpf49xgR_PzYrM69H5DXiUlpXxiK7I4/edit?ts=5cc06c1f#slide=id.g58baf64d22_10_169


       

      Is it better to have more sites with 24x7 response (smaller sites like Prague will never have enough people to really cover 24x7), or more sites with multiple copies of datasets in case something goes down off-hours (there was a flood incident even at one WLCG T1 centre and it took ~ 4 months to bring that site back to life)?

       

      Eric:  don’t have to think of boxes geographically mapped to a site--can have geographically distributed resources presented as a unified entity

       

      Doug:  should consider network bandwidth needs especially transatlantic, due to SKA coming online while DUNE is running

       

      Andrew:  What data volume really needs to be on disk?  Heidi--Analyzed data low end 3PB/yr, maybe 10PB at top end?  Tom--reduction factor will be large. Near detector still an unknown

       

      Mike--how many datasets are live at once--one, two, three?  Andrew--what is the content of those datasets? Raw data, first reco, etc?  See

       

      Mary: what are MC requirements--2-3x processed?  Does ML need raw data? ML should be able to train on subset of raw data and/or processed?

       

      Summarising MC requirements - canonical 10:1 of processed MC used for NOvA; the assumption is this remains true for DUNE. (HS> But NOvA does onboard zero-suppression - need to compare with the potential full data size.)  What is the NOvA zero-suppression level and the actual far detector rate that we are comparing to? I.e., the # of events we're talking about for MC and data. HS: Suggestion for a Task Force to estimate MC needs.

       

      CPU architectures:

       

      Heidi updated this table based on the spreadsheet numbers.  Need a version for the ProtoDUNEs and NP as well. PD should be possible to do now.



       

      Far Detector

      Data Type        | Amount/yr | Copies | Num of versions | Total/yr | Disk | Tape | Lifetime on disk                 | Note
      Raw              | 30 PB     | 2?     | 1               | 60 PB    | 3 PB | 100% | short                            | Assume 1 month?
      Reduced          | 0.3 PB    | 2+     | 2ish            | 1.2 PB   | 100% | ?    | 6 months?                        | Likely 2 versions but only one on disk
      Reco Path A      | 1.5 PB    | 2+     | 2               | 6 PB     | 100% |      | Always on disk in some form      | Assume we always keep 2 versions * 2 copies
      Reco Path B      | 1.5 PB    | 2+     | 2               | 6 PB     | 100% | ?    | Merge with type A and store that | Much better to run multiple algorithms at the same time, but some architectures may require different times?
      Reco Path Nth... |           |        |                 |          |      |      |                                  |
      MC               | 6 PB      | 1      | 2               | 6 PB     | 100% |      |                                  | 10:1 ratio with raw? But MC has a lot of overhead; add a factor of 2 for that.
      Raw-like         | 3 PB      | 1?     | 1               |          |      |      |                                  | ~10% of Raw, for ML training. Is it different to Raw?

       

      Official numbers spreadsheet https://docs.google.com/spreadsheets/d/17Xtwl3lIT00xOYgZMhMtyLt6ebFjrsraNHXI39y3hSY/edit#gid=1918904950
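
      As a sanity check on the Far Detector table above, here is a minimal sketch that reproduces its Total/yr column from the amount, copy, and version counts; the inputs are the table's own placeholder numbers, not official figures.

      # Rough cross-check of the Far Detector table: total/yr = amount/yr * copies * versions.
      rows = {
          #            (amount_PB_per_yr, copies, versions)
          "Raw":       (30.0, 2, 1),
          "Reduced":   (0.3,  2, 2),
          "Reco A":    (1.5,  2, 2),
          "Reco B":    (1.5,  2, 2),
          # The table quotes 6 PB/yr total for MC directly, so it is treated here as
          # already including copies and versions.
          "MC":        (6.0,  1, 1),
      }
      for name, (amount, copies, versions) in rows.items():
          print(f"{name:8s} {amount * copies * versions:5.1f} PB/yr")
      # Raw 60.0, Reduced 1.2, Reco A 6.0, Reco B 6.0, MC 6.0 -- matching the Total/yr column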

       

      Can we get quick access to HPCs for hit-finding production? SKA and other astro people do get quick access.

       

      Supernovas?  General events? 

       

      More generally for the centers - what level of DB access, code access. 

       

      Discussion of commercial cloud - nice way to get latest version of new hardware for testing.  Good for peak/infrequent activities where buying your own is not cost-effective.

       

      Make requirements on bandwidths.  

       

      (DRAFT) Summary of interim conclusions stated in the session:

       

      • The custodial requirements for raw data, on tape, can be met by FNAL (one copy of everything) and by DOE lab(s). However, it is desirable that site(s) outside the US also participate in fulfilling these requirements.

      • We do not wish to have chaotic access to tape by users: it will be sufficient that data is staged off tape at FNAL in an organised way as part of (re)processing activities. It is not necessary that offsite tape copies are used in (re)processing unless there is a loss of data at FNAL.

      • It is not a requirement on the data management system that it can handle tape access at multiple sites, and it may be sufficient for staging to be handled outside (most of) the DM system.

      • It will be sufficient for FNAL and the other tape sites to provide access to tape in this controlled manner; for centers or disk sites to provide disk to allow jobs to access data in an efficient way; and for grid sites or cpu sites to provide only CPU and local scratch disk. 

      • A sufficient number of replicas of files needs to be maintained on disk by DUNE across different sites to minimise disruption to workflows to an acceptable level when sites have failures, outages, or planned downtimes (a rough availability sketch follows after this list).

      • FNAL and the other tape sites must provide 24/7 on call support. It will be sufficient for disk and cpu sites to provide 8/5 working hours support.

      • Software should be written to run on CPUs irrespective of CPU features (AVX etc) and on GPUs (for example by using libraries which allow the most performant execution on different platforms). DUNE should be able to use whatever generations of CPU/GPU are offered by the sites, by writing flexible software and by matching any software with specific requirements to compatible resources at sites.

      • DUNE has some CPU-bound use cases which will be run more efficiently on HPCs, due to their fast interconnects. Whilst DUNE will make use of many HPC resources in an HTC fashion, it may positively request that partners provide HPC.

      • DUNE will ensure that it can run workloads on commercial clouds, in case partner countries decide to use commercial clouds, as a way of handling peaks in load, and as a way of evaluating and preparing for new architectures before they are generally available at sites and with a greater degree of profiling access (eg root privileges) than with conventional grid jobs.

      • DUNE will pursue a mixed model of moving jobs to data and streaming data across the network to jobs. The balance of this will be based on experience and the network capacity provided by sites and (inter)national networks.
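
      To make "a sufficient number of replicas" slightly more concrete, here is a rough availability estimate under the simplifying assumption that hosting sites fail independently; the 95% per-site uptime is an illustrative figure, not a measurement.

      # Illustrative only: chance that no disk replica of a file is reachable,
      # assuming each hosting site is independently up a fraction p of the time.
      p_site_up = 0.95                  # assumed per-site availability
      for n_replicas in (1, 2, 3):
          p_all_down = (1 - p_site_up) ** n_replicas
          print(f"{n_replicas} replica(s): file unreachable {p_all_down:.2%} of the time")
      # 1 replica: 5.00%, 2 replicas: 0.25%, 3 replicas: 0.01% (under these assumptions)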


       

      Data Management Technologies session

       

      Steve Timm is preparing a document

       

      Steve: explains SAM

       

      Question about use cases -> are we overspecifying? Not now, as these are existing examples.

       

      Examples of what is needed to “see”/get a single file.  The DBs are reasonably simple. An API probably needs to join together info from many DBs.

       

      File attributes → query run attributes DB? 

      Datasets → move to snapshots?  Rucio supports hierarchies

      File location → locations DB

      File association with a particular job

       

      Examples of what is needed to put a single file into the DB.   

       

      Slide 6 - Sam locate file → rucio, 

      sam list-files → discovery DB/runs DB? 

      Sam projects → POMS or other job system knows how to talk to them. 
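
      As a purely hypothetical sketch of the kind of API join described above (per-file metadata from a runs/metadata DB combined with replica locations from Rucio or a locations DB), every function, field, and file name below is invented for illustration:

      # Hypothetical sketch: a thin lookup layer joining per-file metadata with
      # replica locations. The two backend functions are stand-ins for whatever
      # metadata DB and Rucio/locations service DUNE adopts.
      def query_metadata_db(filename):
          # stand-in: would query the runs/metadata database
          return {"run": 5141, "data_tier": "raw", "size_bytes": 8_000_000_000}

      def query_location_db(filename):
          # stand-in: would ask Rucio (or a locations DB) for replica sites
          return ["FNAL_DCACHE", "RAL_ECHO"]

      def locate_file(filename):
          # join metadata and locations into one record for a single file
          record = query_metadata_db(filename)
          record["replicas"] = query_location_db(filename)
          return record

      print(locate_file("np04_raw_run005141_0001.root"))   # invented example name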

       

      Discussion of hierarchical metadata instead of tied to files. 

       

      Book-keeping:

       

      Figure out where all the data for a particular trigger record are - supernova → data management

       

      Figure out which files have been processed or not.  → workload?


       

      Robert Illingworth on Rucio 

       

      Currently running a hybrid Rucio/SAM system - possibility of mismatch between the 2 DBs.

       

      Some discussion of object stores (good, bad, oh my god not that again…)

       

      Paul Laycock on Rucio at Belle II

       

      Have used DIRAC for quite a while with the LFC file catalog

       

      Having to introduce Rucio while data taking has already started is interesting

       

      Use a RUCIO file catalog plugin to DIRAC.  Everything talks to the DIRAC plugin, not directly to Rucio.  

       

      What was good about Rucio - automation of things.  Data lifetime would be very nice. DM services run at BNL, the DIRAC stuff is in Japan.  

       

      Chris on DIRAC 

       

      Does both file and grid management.  

       

      File catalog is used by many DIRAC users - Belle, LHCb, ...

       

      Marco on GlideinWMS

       

      Works with multiple local batch systems, lots of experience in the field (CMS, OSG, FNAL IF)

       

      Long-term support: moving to support CentOS 8 now; other long-term customers are around (CMS + OSG for instance)

       

      Chris on DIRAC - have tested using local files from the DIRAC catalog and sam access from a local script. 

       

      Torre on PanDA  

       

      Works with lots of different facilities and integrated with Rucio

       

      Discussion of the Atlas event service.   Can stream events independently, can keep cores busy even if you get a really long event on one core.  Also recover from preemption by recovering individual events. Simulation is ½ of ATLAS usage, works well for this. 

       

      Working on Intelligent Data Delivery System with IRIS-HEP.   Can join into project through HSF. 

       

      HPC use for ML - Fast Simulation, Analysis, Tracking  

      Assume future machines will support common ML library. 

       

      Question about how to get fast turnaround.  

       

      Panda for nightly testing (ART system) - 3000 jobs/month 

       

      Plug Panda onto back-end of DIRAC

       

      Kirby asks about using the event service independently of some of the other PanDA specific parts.   Seems to work even with their very high event rates. How does this map onto DUNE? APA? Trigger record?

       

      [Torre - today, the event service operates only within PanDA and trying it would involve using PanDA. Objective of iDDS (which for ATLAS is the next step for the event service) is to support event delivery independent of PanDA. (Developing this in the IRIS-HEP/HSF context is intended to ensure the PanDA independence really happens). Early DUNE involvement in iDDS (with whatever workload manager(s) it chooses) could ensure DUNE use cases and functionality are addressed from the beginning.]

       

      Ask Mary B. to find a volunteer to study PanDA/DUNE

       

      Anthony Tiradani on HEPCloud - NERSC/Cloud in there - the production version has started this year.

       

      Discussion of decision engine.

      Has a “replay” capability that allows you to figure out why things happened.

       

      Question about protecting budgets - production workflows only, and HEPCloud can help minimize costs.

       

      Similarly, how do you manage HPC allocations?  Can monitor burn rates.

       

      What if an individual user acquires a large allocation - do we have a way to allow that person to use that resource in a dedicated way?

       

      Andrew on Vcycle - a small agent to make VMs on various systems.

       

      Now working with IRIS in UK. How does this interact with HEPCloud?

       

      Interesting questions about how one can plan use of allocations - profile of expected needs. 

       

      Need some policy agreements for DUNE to use this. 

       

      SAM as Workflow Management Comments:

       

      Having a data delivery service that tracks the location, bandwidth, and latency of a file to a grid job would be important - Heidi
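
      A minimal sketch of what such a service might do when choosing which replica to stream from, given tracked bandwidth and latency; the site names, figures, and the crude cost model are all invented for illustration.

      # Illustrative replica choice for a streaming job: pick the replica with the
      # smallest estimated delivery time for a file of a given size.
      replicas = {
          # site:        (bandwidth_MB_per_s, latency_s)  -- assumed figures
          "FNAL_DCACHE": (200.0, 0.02),
          "RAL_ECHO":    (80.0,  0.10),
          "PRAGUE_DPM":  (40.0,  0.05),
      }

      def estimated_delivery_s(file_size_MB, bandwidth_MB_s, latency_s, n_requests=100):
          # crude model: bulk transfer time plus per-request latency overhead
          return file_size_MB / bandwidth_MB_s + n_requests * latency_s

      file_size_MB = 4000   # assumed 4 GB reco file
      best = min(replicas, key=lambda s: estimated_delivery_s(file_size_MB, *replicas[s]))
      print("stream from:", best)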

       

      We should avoid being tied to the file-based structure that we cooked into SAM; have something that is more flexible and able to deal with “data cells” and object stores - Doug Benjamin

       

      POMS as Workflow Management Comments:

       

      Would the new project/station functionality be part of POMS or the other stuff? - Heidi  The discussion at the lab is that this would be separate from POMS. - Marc

       

      Tuesday morning

       

      Andrew shows a slide with possible combos of Production/Workflow WMS and DMS, and asks what people think about what to use and what to put dev effort into in the short/medium term, and if we want to keep options open longer term. 

       

      Questions - where does jobsub fit in?

       

      As explained below, the consensus was that we need more requirements gathering to decide. Two exceptions were identified to the strawman that any of the alternatives would be sufficient given our scale: handling object storage with chunks that need to be assembled into events for processing, and job-to-job pipelining rather than using storage as the intermediary.

       

      Fergus had asked yesterday about the decision process.

       

      From the consortium document:

       

      “Technical advisory board

       

      The Consortium Lead, in collaboration with the Consortium, will convene Technical Advisory Boards as needed. A Board will be convened when there is a particular technical issue to address and will be given a charge appropriate to the issue at hand, such as reviewing and recommending solutions to a technical issue. Boards will include the 3 Consortium leads, the Software Architect, relevant subgroup representatives and technical experts chosen appropriately to address the charge. The outcome of a Board will typically be an advisory report delivered to Consortium leadership, the report being public within the Collaboration.”

       

      Mary and Steve and Heidi would like us to come up with requirements informed by the technical decisions we need to make. 

       

      Anna Mazzacane asks that we make the detector characterization a use case, not just the final running beast in steady state.  Need more data, more processing steps. Mary - protoDUNE may be harder, so it is a good test.

       

      Questions about support (short and long term) for the various products

       

      Questions about getting feedback from stakeholders.   CISC/CAL/DAQ/physics groups

       

      Is object store an additional use case beyond the file based system? 

       

      Andrew Norman - if we’re doing files, any of these systems can do the job.  

       

      Steve Timm - what about heterogeneous workflows? Andrew Norman - astro experiments do this.

       

      Multinode pipelines - GPU -> on a different machine doing something else.  If they communicate through storage, things are OK… ATLAS is doing this.

       

      Do we envision a mixture of workflows? Probably yes.

       

      Very Top Level Requirements questions from Mary B. 

       

      - Software Framework interface requirements: for example, memory/data format requirements that interface with the WMS - e.g. does ROOT meet those requirements?

       

      - Long Term Maintainability 

       

      - Heterogeneous workflows: e.g. HTC -> HPC -> GPU

      Data preparation, Pattern recognition, Simulation, Data analysis, tuple creation

       

      Template for Workflow Use Case Description:

      https://docs.google.com/document/d/1lYNANqaE6r32u0oTWCcyQ2yfNEO6og3nKtNEDiTt85U/edit#

       

      Need to go through the workflows - what are the steps:

      • Normal events (< 6 GB) - Based on Brett’s diagram to define “jobs” - Kirby https://indico.fnal.gov/event/21160/session/10/contribution/14/material/slides/4.pdf

      • SNB events (> 100 TB) - defer discussion until there’s more detail???

      • Simulation - overlays  (detector size?) - Tingjun Yang/Wes Ketchum assistance on this - Ken

      • Merging outputs - Heidi focus on after reconstruction and potentially SNB and derived samples and explicitly joining divergent paths

      • Tuple creation - question about file format (HDF5?) less important compared with access patterns and data volume - Norm Buchanan(???) and Steve Timm - physics impacts should be considered

      • Calibration -  needs input from CAL group following the Collaboration Meeting

        • Laser calibration

        • Ar39 calibration

        • Neutron source calibration

      • ML training

      • ML inference

      • User analysis -  Input from DRA

      • Event picking -  Andrew Norman

      • Parameter estimation ala NOvA - Andrew Norman, Steven Calvez, Pengfei Ding

      • Unfolding

       

      Define each of the processing stages as individual tasks.

      Define the input and output size for each of those tasks.

      Data lifetime of the output for each stage.

      When the campaign is reprocessed, what will be the impact on the data model and storage?
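
      One possible machine-readable form for the per-stage description asked for above; the field names and example values are assumptions rather than agreed definitions (the 6 GB input echoes the "normal event" size quoted in the workflow list above, the rest is illustrative).

      # Sketch of a per-stage record matching the questions above; names and values
      # are illustrative only.
      from dataclasses import dataclass

      @dataclass
      class ProcessingStage:
          name: str
          input_size_gb: float        # per trigger record / job
          output_size_gb: float
          output_lifetime_days: int   # how long the output stays on disk
          reprocessed: bool           # regenerated when the campaign is reprocessed?

      hit_finding = ProcessingStage(
          name="hit finding",
          input_size_gb=6.0,          # "normal event" size quoted in these notes (< 6 GB)
          output_size_gb=0.6,         # assumed ~10x reduction, for illustration
          output_lifetime_days=180,
          reprocessed=True,
      )
      print(hit_finding)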


       

      Some workflow diagrams to discuss

       

      https://docs.google.com/presentation/d/1d-qrtHwd5u-D91YDQ9p5kTm_G_l1DxZOBCrGeGyRgAk/edit?usp=sharing

       

      Do we write intermediate steps to disk? Tape? What gets distributed?

      Which stages need DB access? 

       

      If input data is carried to the output of a stage and if splits and joins between stages are supported then some support for handling duplicate data (same exact object carried along both branches) is needed in the framework.
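
      To illustrate the duplicate-handling point above: a toy sketch in which a join step recognizes an object carried down both branches by a (hypothetical) unique product ID, so it is stored only once.

      # Toy illustration of de-duplication at a join, keyed on a hypothetical
      # unique product ID carried with each data object.
      def join_branches(branch_a, branch_b):
          merged = {}
          for obj in branch_a + branch_b:
              merged[obj["product_id"]] = obj   # same ID seen twice -> kept once
          return list(merged.values())

      raw_hits = {"product_id": "run1:evt7:rawhits", "payload": "..."}
      branch_a = [raw_hits, {"product_id": "run1:evt7:trackreco", "payload": "..."}]
      branch_b = [raw_hits, {"product_id": "run1:evt7:showerreco", "payload": "..."}]
      print(len(join_branches(branch_a, branch_b)))   # 3 distinct objects, not 4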

       

      - Object storage/Offline event building

       

      - Distributed data requirements: >= 2 copies of raw data stored on tape with 24/7 support, and 10%(?) dedicated disk space for staging, plus processed data and simulation data (x10 of processed data?) at different sites on disk with local support (see Heidi's picture from yesterday)

       

      -Network requirements:

       

      -Specific stakeholder requirements: DAQ, Calibration, Physics Analysis

       

      One also needs to think about workflow stages that communicate via storage (i.e. outputs of one are inputs to the next) vs. those that communicate via jobs.

       

      The “mergeAna” workflow with project.py is a kind of merge workflow used by protoDUNE, but it does not use SAM for bookkeeping; it does that with file lists. A first exercise towards a more complicated merge workflow is to incorporate this with SAM and POMS.


       

      Mary asks whether we can get the experts to summarize the major features of the proposed workflows - summarize the slides we saw yesterday. MB: specifications of the software packages proposed that address our requirements at each stage. For example, what features of RUCIO plus a file management system (SAM or a replacement) reliably allow for offline event building from file fragments (if that is our plan).

       

      Chris B.: Where are we strange?

       

      Heidi: question about data formats. Is that a framework problem rather than a workflow problem?

      Paul L comment on this for late-stage analysis for Belle II: we use ROOT as the format, but for analysis Python packages are used which untie this dependence: the “root_numpy” and “root_pandas” packages convert TTrees to numpy arrays / pandas dataframes respectively, and “uproot” is a package increasing in popularity which has its own optimised numpy-like array formats.
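
      For concreteness, a minimal example of the uproot route mentioned above; the file, tree, and branch names are placeholders, and the call shown uses the current uproot interface (which differs from the uproot syntax available at the time of these notes).

      # Minimal example: read selected branches of a ROOT TTree into a pandas
      # DataFrame with uproot (placeholder file/tree/branch names).
      import uproot

      with uproot.open("analysis_tuples.root") as f:
          df = f["anatree"].arrays(["run", "event", "nu_energy"], library="pd")

      print(df.head())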

       

      MB: some selected event data needs to be stored in a format for 3-D display:

       

      Discussion of whether we would store data to display chosen events. Mary points out that event displays of all selected neutrino events in the far detector will be an important part of validating any ML results and reconstruction.

       

      An example of a 3-D event display is the BEE display (used online for ProtoDUNE); here is an example with simulation:

       

      https://www.phy.bnl.gov/wire-cell/bee/set/23/event/0/

       

      This is very different from colliders and is unique to neutrino experiments: the need to visually assess large numbers of events to validate reconstruction. 

       

      Operations discussion (Thursday afternoon)

       

      • More sites/platforms require that they come with operational/support effort

      • Nova experience: 

        • control room shifts: DAQ/oncall (per institute shares)

        • offline model designed to deal with failures: problems reported at the time, dealt with during working hours

        • service work (per student/postdoc): production, other infrastructure, calibration

          • Downside is turnover of people (~6 months)

        • some services run by FNAL (eg SAM)

      • DUNE will have 24/7/52 shifts for DAQ

        • Can they do the computing shift? ie trigger the on-call process for computing experts

        • Can “Joe/Jane Shifter” recognise the problem?

        • Can we just rely on buffering at SURF until the real expert reads/sees the problem themselves?



       

      https://docs.google.com/document/d/1x0ima7MUiMNhwYU0rg37MLM_rFCrcEY9Qfuf76MmGFw/edit?usp=sharing is a link to a document with a list of current things we use and questions about mapping their functionality going forward. 

      • 1
        Welcome
        Including other meetings this week and logistics
        Speaker: Dr Michael Kirby (FNAL)
        Slides
      • 2
        Goals of the workshop
        Speaker: Andrew McNab (University of Manchester)
        Slides
      • 3
        Requirements from Data Model
        * Data volume, volume per year plan, peak data rates to offline * Job geometries: hours, processors, memory, temp disk * CPU requirements per year plan
        Speaker: Heidi Schellman (Oregon State)
        Slides
    • 10:00
      Coffee/discussion
    • Mapping requirements to sites
      • Mapping on to the WLCG Tier model
      • Disk vs Tape
      • 24/7 vs 8/5 sites
      • CPU architectures: AVX etc; GPUs; ???
      • Site architectures: HTC vs HPC vs commercial cloud
      • Network/transfers between sites
    • 12:30
      Lunch
    • Data management technologies

      SAM, Bookkeeping, RUCIO, DIRAC/RUCIO ???

      • 4
        SAM data management
        Speakers: Heidi Schellman (Oregon State), Dr Steven Timm (Fermilab)
        Slides
      • 5
        Bookkeeping database
        Speaker: Dr Steven Timm (Fermilab)
      • 6
        RUCIO
        Speaker: Dr Robert Illingworth (Fermilab)
        Slides
      • 7
        DIRAC/RUCIO plugin
        Speaker: Dr Paul Laycock (Brookhaven National Laboratory)
        Slides
      • 8
        DIRAC data management?
        Speaker: Christopher Brew
        Slides
      • 9
        Discussion
    • 14:40
      Coffee/discussion
    • Workload management technologies

      GlideinWMS, DIRAC, PanDA, HEPCloud, Vcycle, ...

      • 10
        GlideInWMS
        Speaker: Marco Mambelli (Fermilab)
        Slides
      • 11
        DIRAC workload management
        Speaker: Christopher Brew
      • 12
        PanDA
        Speaker: Dr Torre Wenaus (BNL)
        Slides
      • 13
        HEPCloud
        Speaker: Mr Anthony Tiradani (Fermilab)
        Slides
      • 14
        Vcycle cloud management
        Speaker: Andrew McNab (University of Manchester)
        Slides
      • 15
        Discussion
    • 16:05
      Coffee/discussion
    • Workflow/production technologies

      POMS, SAM, LHCb for comparison, ...

      • 16
        LHCb workflow interface
        Screenshots etc as an example of what else is being done
        Speaker: Andrew McNab (University of Manchester)
        Slides
      • 17
        SAM as workflow management
        Speaker: Dr Kenneth Herner (Fermilab)
        Slides
      • 18
        POMS
        Speaker: Mr Marc Mengel (Fermilab)
        Slides
      • 19
        Discussion
    • Representative workflow designs - Curia II (WH2SW), Fermilab

      Take some important foreseeable workflows (Monte Carlo, reconstruction, user analysis, ...) and work through how we would do them based on the site resources and technologies

      • 20
        Technologies discussion
        Slides
    • 10:30
      Coffee/discussion
    • Computing model doc/slides drafting - Curia II (WH2SW), Fermilab

      Preparation for summaries, writing a document, additional discussion, other ideas as they come up

    • 12:30
      Lunch
    • Joint session with WLCG Authz WG: https://indico.cern.ch/event/739896/
    • WLCG GDB: https://indico.cern.ch/event/739882/
    • Mon/Tue session summaries

      Presentation of summaries of DUNE computing sessions on Monday and Tuesday

    • FIM4R Workshop (all day): https://indico.cern.ch/event/834658/
    • DUNE Computing Ops discussion
      • Tickets model: internal and/or GGUS/OSGHelpDesk
      • Operations meetings: daily? Weekly? Non-COMP ops meetings?
      • Ops shifts: FNAL or remote?
      • User support shifts
      • Staffing implications of the above (as input to grant requests etc)
      • 21
        Introduction/ideas
        Speaker: Andrew McNab (University of Manchester)