LArSoft Coordination Meeting

US/Central
Zoom-only

Description

To connect via Zoom:  Meeting ID 831-443-820

Password distributed with meeting announcement

(See instructions for setting Zoom default to join a meeting with audio and video off: https://larsoft.org/zoom-info/)

PC, Mac, Linux, iOS, Android:  https://fnal.zoom.us/j/831443820

Phone:

H.323:

162.255.37.11 (US West)
162.255.36.11 (US East)
213.19.144.110 (EMEA)
See https://fnal.zoom.us/ for more information
 

At Fermilab:  no in-person presence at the lab for this meeting

 

Erica: Release and project report

  • Re: the move to e20, Tom Junk noted that DUNE has been able to build under e20

  • Clarification: the art metadata change was requested by the SAM project. It was listed as a breaking change that would require updating metadata extractors

    • Erica (a) missed that the "art metadata" was SAM-related, and (b) forgot about that change in art and the discussion at the time about needing to update extractors. Apologies for suggesting that an API was being evaded

 

Ken Herner: Job submission with FIFE tools

  • Outline

    • Transitioning from project.py

    • Job autorelease for resource limits and other best practices

    • Singularity images

  • Transitioning from project.py

  • Can convert from project.py description files to POMS

  • Example XML conversion using Project-py

  • After converting

    • Can then just go with POMS and discard the project.py XML file

    • To change something at that point, clone the campaign

      • Make tweaks to this clone

      • Demonstrated on POMS GUI pages

  • Example monitoring

    • Showed features of POMS monitoring web page

    • Campaign stage page

  • Example POMS config file (from DUNE; all remaining examples are from DUNE; a sketch of the format follows below)

    • Similar to dict format

    • Showed for SAM and non-SAM job

    • Showed a complicated executable section
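
    For reference, a minimal sketch of what such a config file can look like (INI-style; every section name, key, and value below is an illustrative placeholder, not the DUNE example shown, and the key names should be verified against the POMS documentation):

      # my_campaign.cfg (sketch only; all names and values are placeholders)
      [campaign]
      experiment = dune
      poms_role = analysis
      campaign_stage_list = reco

      [campaign_stage reco]
      software_version = v09_00_00
      dataset_or_split_data = my_input_def
      cs_split_type = draining
      # param_overrides takes [keyword, value] pairs appended to the launch
      param_overrides = [["--memory=","3000MB"]]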

  • Running with input dataset

    • "Dataset" specification is just a string or list of strings

    • Make it a SAM dataset by passing it to jobsub as the dataset option

    • Showed how to split large datasets, mapping onto project.py features (see the sketch below)
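
    For illustration, a sketch of driving a submission from a SAM dataset definition (the definition name, dimensions, job count, and job script are hypothetical placeholders):

      # Sketch: create a SAM dataset definition, then hand it to jobsub,
      # which wraps the N worker jobs in a DAG with SAM start/end jobs.
      samweb -e dune create-definition my_input_def "run_number 12345 and data_tier raw"
      jobsub_submit -G dune \
          --dataset_definition=my_input_def \
          -N 10 \
          file:///path/to/my_job_script.sh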

  • Discussed "reasonableness" in submissions

    • Aim for multi-hour run times

    • "Reasonable" file sizes

  • Job recovery options

    • Automatically recover failed jobs. Various criteria possible

  • Proxy upload for analyzers

    • If running as an analyzer, need to occasionally upload a proxy

    • The name must conform to a standard format (see the sketch below)
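
    A minimal sketch of generating such a proxy (the VO and role strings are examples; the required file name and upload destination should be taken from the FIFE/POMS documentation):

      # Sketch: obtain a certificate, then make an analysis VOMS proxy.
      # The VO string is an example; check the standard proxy file name
      # against the FIFE/POMS documentation before uploading.
      kx509
      voms-proxy-init -rfc -noregen -voms dune:/dune/Role=Analysis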

  • Job autorelease (from "held" status) for resource limits

    • Encourage people to use this

    • Now, jobs go into "held" state if they exceed resource requests

    • With POMS, can instead automatically re-submit with increased limits

    • Run time and memory

    • If a job exceeds the increased limits a second time, it will go held (see the sketch below)
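
    At the jobsub level, autorelease can be requested with classad lines along the following pattern (a sketch; the FERMIHTC classad names and units should be verified against the current FIFE documentation):

      # Sketch: opt into autorelease with extra memory (MB) and
      # lifetime (seconds) grace when a job is held for exceeding
      # its requests; verify classad names against the FIFE docs.
      jobsub_submit -G dune \
          --memory=2000MB --expected-lifetime=8h \
          --lines='+FERMIHTC_AutoRelease=True' \
          --lines='+FERMIHTC_GraceMemory=2048' \
          --lines='+FERMIHTC_GraceLifetime=7200' \
          file:///path/to/my_job_script.sh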

  • Singularity and offsite running

    • Will be auto-invoked at all supported sites (covering > 99% of offsite slots)

    • On FermiGrid, jobs still run in a Docker container by default, but will run under Singularity if you specify an image (recommended)

    • Discussed how to do this (see the sketch below)

    • Noted that running offsite will soon become the default, so users will need to opt into running onsite
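
    One commonly used way to specify an image (a sketch; the image path is an example, and the exact options depend on the jobsub version in use):

      # Sketch: explicitly request a Singularity image; the appended
      # requirement restricts the job to slots that support Singularity.
      jobsub_submit -G dune \
          --lines='+SingularityImage="/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest"' \
          --append_condor_requirements='(TARGET.HAS_SINGULARITY=?=true)' \
          file:///path/to/my_job_script.sh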

  • Ken will be available for help and can share DUNE configs/campaigns

 

Discussion (Ken Herner and Marc Mengel responding to questions and comments)

  • Are best practices documented?

  • If the Singularity requirement isn't enforced, often find that ...[missed this]

  • Possible missing feature. What is best practice for pre-staging? And does DUNE do this?

    • Yes. Currently just run a samweb pre-stage session the day before

    • When the --dataset option is included, it gets passed to jobsub, which then makes a DAG

    • The begin-job calls ifdh startProject

      • Does kick off pre-staging

      • Also requires that a sufficient fraction of the dataset be resident on disk

      • The start job will wait for the presence of some number of files (exact number not recalled)

      • Noted that pre-staging in this way may not feed things quickly enough, depending upon resource contention, so prior pre-staging is still recommended

      • Pre-staging in advance is always a best practice (see the sketch below)
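
      A minimal sketch of the manual pre-stage step (the definition name is a placeholder):

        # Sketch: trigger staging of a whole dataset from tape ahead of
        # submission; the manual analogue of what the begin-job's
        # "ifdh startProject" kicks off.
        samweb -e dune prestage-dataset --defname=my_input_def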

  • If using cron submission with a very large dataset, or multiple datasets, that runs over multiple days, it's hard to manage manual pre-staging. Is there a way to deal with this?

    • Can add a stage to the job that runs a pre-staging script (see the sketch below)

      • The pre-stage stage then launches the job

    • DUNE does not do that. Just pre-stages in advance

    • Splitter

      • Can use the project name of the pre-stage job to configure launching and input files
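
    A sketch of such a pre-stage-then-launch wrapper (all names are placeholders; a real version would add error handling):

      #!/bin/bash
      # Sketch: pre-stage the dataset, then launch the processing jobs.
      # Definition name and job script are hypothetical placeholders.
      DEF="my_input_def"
      samweb -e dune prestage-dataset --defname="$DEF"
      jobsub_submit -G dune --dataset_definition="$DEF" \
          file:///path/to/my_job_script.sh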

  • Requested that every definition of "reasonable" be documented somewhere

    • Context / experiment dependent. But needs to be done

  • Are tarballs unwound before executable is invoked?

    • This is a jobsub option, so read its documentation for details on how to configure it

    • On the wiki, it says ...[missed...]

    • Comment that "dataset" option has side effects. Requested that this be documented on the wiki.

      • OK

  • Pre-staging, cases where dag isn't quite sufficient

    • Can submit a DAG, but the FIFE launch does not configure DAGs or do anything that jobsub doesn't do

  • At some point, may have file sets that belong together and cannot be chopped up

    • E.g., from ICEBERG, ended up needing five files open at the same time. How should this be handled?

    • A couple of approaches (see the sketch after this list)

      • First is to make a union dataset, and make sure all files are pre-staged in advance

        • Put one of the five files in the dataset and run the project from that; the input file specifies the other four

      • Second, put all the files in one project; a script consumes files out of that project and provides files to jobs as they become available

      • If people have a specific workflow in mind, can help try to put something together.

      • Grouped file delivery is not something that SAM does, so just need to work around that
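
      A sketch of the first (union-dataset) approach (all definition names are hypothetical placeholders):

        # Sketch: union the "primary" files with their partner files so
        # everything gets pre-staged, but drive the SAM project from the
        # primary files only; each job opens its four partners directly.
        samweb -e dune create-definition iceberg_all \
            "defname: iceberg_primary or defname: iceberg_partners"
        samweb -e dune prestage-dataset --defname=iceberg_all
        jobsub_submit -G dune --dataset_definition=iceberg_primary \
            file:///path/to/my_job_script.sh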

  • Further questions?

    • #poms slack channel in FNAL Computing workspace

      • Not everyone is a member there, and SCD personnel have been discouraged from inviting non-SCD people

    • Open service desk tickets with questions

 

Kyle Knoepfel: Incompatibility in hdf5 and h5cpp product usage

  • Have been distributing both for some time

  • Since hdf5 was updated from 1.10 to 1.12 in January 2021, h5cpp has been incompatible

  • Have an example of the problem from wirecell

  • Until h5cpp supports hdf5 1.12, there are several options:

    1. Keep h5cpp 1.10 and roll back hdf5 to 1.10

    2. Keep hdf5 1.12 and drop h5cpp from the distribution

    3. Keep hdf5 1.12 and h5cpp, and make people aware of the issue

      • For wirecell, this requires changing the JSON file to avoid using h5::high_throughput

  • Not aware of any LArSoft jobs that are affected

  • Seeking feedback from users of HDF5 and h5cpp

  • Our suggestion

    • Option (1): keep h5cpp 1.10 and roll back hdf5 to 1.10 (see the sketch below)
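
    In ups terms, the proposed roll-back might look like the following (a sketch; the version and qualifier strings are illustrative placeholders for the real ones):

      # Sketch: pin the compatible hdf5/h5cpp pair in a ups environment.
      setup hdf5 v1_10_5 -q e20
      setup h5cpp v1_10_4 -q e20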

 

Discussion

  • Tom J: maybe Kurt Biery needs to be in this conversation, since there may be some number of hdf5 1.12 files already in existence

    • Lynn G: spoke to Eric Flumerfelt.

    • Talking to Steven Varga this Friday

    • Could also provide two versions of hdf5

  • Tom J has been writing HDF5 files. Would need to test whether rolling back would be OK

    • Lynn G: both are currently available, so can do that.

  • For sake of moving forward

    • Will continue discussion, proposing option (1) as best bet

    • Might be able to read 1.12 files with 1.10.

    • Discuss at Offline Leads meeting. Might be time for Tom J to test roll-back option by then.
