LArSoft Coordination Meeting

Name: LArSoft Coordination Meeting
Start: 2021-04-20T09:00:00-05:00
End: 2021-04-20T10:30:00-05:00
Location: Zoom-only

Tuesday 20 Apr 2021, 09:00 → 10:30 US/Central


Description

To connect via Zoom: Meeting ID 831-443-820

Password distributed with meeting announcement

(See instructions for setting Zoom default to join a meeting with audio and video off: https://larsoft.org/zoom-info/)

PC, Mac, Linux, iOS, Android: https://fnal.zoom.us/j/831443820

Phone:

US toll: meeting ID 831-443-820
+1 646-558-8656
+1 408-638-0968
International numbers:
https://fnal.zoom.us/zoomconference?m=SvP8nd8sBN4intZiUh6nLkW0-N16p5_b

H.323:

162.255.37.11 (US West)
162.255.36.11 (US East)
213.19.144.110 (EMEA)
See https://fnal.zoom.us/ for more information

At Fermilab: no in-person presence at the lab for this meeting

Support

scisoft-team@fnal.gov


Erica: Release and project report

Re. move to e20: Tom Junk noted that DUNE has been able to build under e20
Clarification: art metadata change was requested by SAM project. Listed as a breaking change that would require updating metadata extractors
- Erica (a) missed that the "art metadata" was SAM related, and (b) forgot about that change in art, and the discussion at the time about needing to update extractors. Apologies for suggesting that an API was being evaded

Ken Herner: Job submission with FIFE tools

Outline
- Transitioning from project.py
- Job autorelease for resource limits and other best practices
- Singularity images
Transitioning from project.py
- Noted that project.py/larbatch is not an SCD-supported tool
- Reasons to move to POMS
  - Usable by any workflow
  - Supported by SCD
  - Campaign monitoring integrated into FIFEMON
  - Automated cron-style submissions and automated recovery launches
  - Multi-stage workflows possible (with automatic dataset creation so that outputs of one stage become inputs to the next)
- Typical usage pattern is to create "campaigns" in a web GUI, then submit from there or from the command line
- Documentation:
  - User doc: https://cdcvs.fnal.gov/redmine/projects/prod_mgmt_db/wiki/POMS_User_Documentation
  - FIFE launch reference: https://cdcvs.fnal.gov/redmine/projects/fife_utils/wiki/Fife_launch_Reference
Can convert from project.py description files to POMS
Example XML conversion using Project-py
- Instructions: https://cdcvs.fnal.gov/redmine/projects/project-py/wiki/Project-py_guide
- Creates a POMS config file from the input project.py project XML file
- Walked through the conversion process for a particular workflow
After converting
- Can then just go with POMS and discard the project.py XML file
- To change something at that point, clone the campaign
  - Make tweaks to this clone
  - Demonstrated on POMS GUI pages
Example monitoring
- Showed features of POMS monitoring web page
- Campaign stage page
Example POMS config file (from DUNE. All remaining examples are from DUNE)
- Similar to dict format
- Showed for SAM and non-SAM job
- Showed a complicated executable section
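As a rough illustration of the "dict-like" layout, the sketch below parses a small INI-style text with Python's standard configparser and turns it into a nested dict. Treating the config as INI-style is an assumption here, and every section and key name in the example is a hypothetical placeholder, not taken from the DUNE configs shown in the talk or from the FIFE launch reference.

```python
import configparser

# Hypothetical INI-style text. Section and key names are illustrative only
# and are NOT copied from a real POMS / FIFE launch configuration.
EXAMPLE_CFG = """
[global]
group = dune
version = v1_0

[submit]
memory = 2000MB
expected-lifetime = 3h

[executable]
name = lar
arg_1 = -c
arg_2 = my_job.fcl
"""


def load_config(text):
    """Parse INI-style text into a plain nested dict: {section: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load_config(EXAMPLE_CFG)
print(config["submit"]["memory"])
```

Each section becomes one inner dict, which is roughly what "similar to dict format" suggests. The real files follow whatever the FIFE launch reference specifies.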
Running with input dataset
- "Dataset" specification is just a string or list of strings
- Make it a SAM dataset by passing it to jobsub as dataset option
- Showed how to split large datasets. Mapping onto project.py features
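The idea of splitting a large dataset into bounded pieces can be sketched as follows. This is a minimal illustration only: the function name, the chunking rule, and the file names are assumptions made for the example, and it does not reproduce the splitting options actually shown in the talk.

```python
def split_dataset(files, max_files_per_submission):
    """Split a list of file names into consecutive chunks of bounded size."""
    if max_files_per_submission < 1:
        raise ValueError("max_files_per_submission must be at least 1")
    return [
        files[i:i + max_files_per_submission]
        for i in range(0, len(files), max_files_per_submission)
    ]


# Hypothetical file names, for illustration only.
all_files = [f"raw_{n:03d}.root" for n in range(10)]
chunks = split_dataset(all_files, 4)
for index, chunk in enumerate(chunks):
    print(index, len(chunk))
```

Each chunk would then be handled as its own submission, so no single submission has to process the whole dataset.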
Discussed "reasonableness" in submissions
- Aim for multi-hour run times
- "Reasonable" file sizes
Job recovery options
- Automatically recover failed jobs. Various criteria possible
Proxy upload for analyzers
- If running as analyzer, need to occasionally upload a proxy
- Name must conform to std format
Job autorelease (from "held" status) for resource limits
- Encourage people to use this
- Currently, jobs go into "held" state if they exceed resource requests
- With POMS, can instead automatically re-submit with increased limits
- Applies to run time and memory
- If a job exceeds its limits a second time, it goes into "held" state
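The retry policy described above can be sketched as a small simulation. This models only the decision logic as recorded in these notes (one automatic retry with increased limits, then held); the class, function, and field names are hypothetical and do not correspond to jobsub or POMS options.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    memory_mb: int
    runtime_s: int


def run_with_autorelease(usage, requested, increased):
    """Return (final_state, attempts) for a job with the given peak usage.

    usage     -- resources the job actually needs
    requested -- limits on the first submission
    increased -- limits applied on the single automatic re-submission
    """
    for attempt, limits in enumerate((requested, increased), start=1):
        fits_memory = usage.memory_mb <= limits.memory_mb
        fits_runtime = usage.runtime_s <= limits.runtime_s
        if fits_memory and fits_runtime:
            return "completed", attempt
    # Limits were exceeded twice, so the job stays held.
    return "held", 2


requested = Limits(memory_mb=2000, runtime_s=3 * 3600)
increased = Limits(memory_mb=4000, runtime_s=6 * 3600)
print(run_with_autorelease(Limits(3000, 2 * 3600), requested, increased))
```

In this example the job needs more memory than first requested but fits within the increased limits, so it completes on the second attempt instead of waiting in "held" state for manual intervention.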
Singularity and offsite running
- Will be auto invoked at all supported sites (> 99% of offsite slots)
- On Fermigrid, still run in a Docker container by default, but will run Singularity if you specify an image (recommended)
- Discussed how to do this
- Noted that running off-site will soon become the default. So need to opt into running on site
Ken will be available for help, can share DUNE configs/campaigns

Discussion (Ken Herner and Marc Mengel responding to questions and comments)

Are best practices documented?
- Yes. Redmine wiki. FIFE wiki. https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Singularity_jobs
If don't enforce Singularity requirement, often find that ...[missed this]
Possible missing feature. What is best practice for pre-staging? And does DUNE do this?
- Yes. Currently just run samweb session the day before.
- When running with --dataset (or including that option), this gets passed to jobsub, which then makes a dag
- The begin-job calls ifdh start project
  - Does kick-off pre-staging
  - Also requires that a sufficient fraction of the dataset be resident on disk
  - Start job will wait for presence of some number of files. Can't recall the number.
  - Noted that pre-staging in this way may not feed things quickly enough, depending upon resource contention. So prior pre-staging is still recommended
  - Pre-staging in advance is always a best practice
If using cron submission w a very large dataset or multiple datasets that runs over multiple days, it's hard to manage manual pre-staging. Is there a way to deal with this?
- Can add a stage to the job that runs pre-staging script
  - Then launches the job
- DUNE does not do that. Just pre-stages in advance
- Splitter
  - Can use project name of pre-stage job to configure launching and input files
Requested that every definition of "reasonable" be documented somewhere
- Context / experiment dependent. But needs to be done
Are tarballs unwound before executable is invoked?
- This is a jobsub option, so need to read documentation for that to get details, how to configure
- On wiki, says ...[missed...]
- Comment that "dataset" option has side effects. Requested that this be documented on the wiki.
  - OK
Pre-staging, cases where dag isn't quite sufficient
- Can submit a dag. But FIFE launch does not configure dags, or do anything that jobsub doesn't do.
At some point, may have filesets that are together, and cannot be chopped up
- E.g., from iceberg, ended up needing five files open at the same time. How should we do this?
- Couple of approaches
  - First is to make a union dataset, and make sure all pre-staged in advance
    - Put one of five in dataset, and run project from that. The input file specifies the other four
  - Second, put all files in one project; a script consumes files out of that project and provides files to all jobs when they become available
  - If people have a specific workflow in mind, can help try to put something together.
  - Grouped file delivery is not something that SAM does, so just need to work around that
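The first approach above (only one file of each group is in the dataset, and that file's job looks up its companions) can be sketched as below. This is an illustration under stated assumptions: the helper names, the in-memory mapping, and the file names are all hypothetical, and nothing here calls SAM or ifdh.

```python
def build_groups(files, group_size):
    """Map the first ("lead") file of each group to its companion files."""
    if len(files) % group_size != 0:
        raise ValueError("file count must be a multiple of group_size")
    groups = {}
    for i in range(0, len(files), group_size):
        lead, *companions = files[i:i + group_size]
        groups[lead] = companions
    return groups


def files_for_job(lead, groups):
    """Return every file a job must open when it is handed one lead file."""
    return [lead] + groups[lead]


# Hypothetical file names: two groups of five files each.
all_files = [f"run1_part{n}.root" for n in range(10)]
groups = build_groups(all_files, 5)

# Only the lead files would go into the dataset that drives the project.
lead_dataset = list(groups)
print(lead_dataset)
print(files_for_job(lead_dataset[0], groups))
```

The companions still have to be pre-staged in advance, as noted above, since only the lead files are delivered through the project.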
Further questions?
- #poms Slack channel in the FNAL Computing workspace
  - Not everyone is a member there, and SCD personnel have been discouraged from inviting non-SCD people
- Open service desk tickets w questions

Kyle Knoepfel: Incompatibility in hdf5 and h5cpp product usage

Have been distributing both for some time
Since January 2021, when hdf5 was updated from 1.10 to 1.12, h5cpp has been incompatible with it
Have a problem example from wirecell.
- h5::high_throughput not supported with mismatched versions
- Not a compile-time error. Run-time error manifesting as abort/segmentation violation
- Submitted as a h5cpp issue: https://github.com/steen-varga/h5cpp/issues/70
Until h5cpp does support hdf5 1.12, several options
1. Keep h5cpp 1.10 and roll back hdf5
2. Keep hdf5 1.12 and drop h5cpp from the distribution
3. Keep both hdf5 and h5cpp, and make people aware of the issue
  - For wirecell, need to make changes to the json file to avoid using h5::high_throughput
Not aware of any LArSoft jobs that are affected
Seeking feedback from users of HDF5 and h5cpp
Our suggestion
- Option (1): keep h5cpp 1.10, and roll back hdf5 to 1.10

Discussion

Tom J: maybe Kurt Biery needs to be in this convo? Since there may be some number of hdf5 1.12 files already in existence.
- Lynn G: spoke to Eric Flumerfelt.
- Talking to Steven Varga this Friday
- Could also provide two versions of hdf5
Tom J has been writing HDF5. Would need to test if rolling back would be ok
- Lynn G: both are currently available, so can do that.
For sake of moving forward
- Will continue discussion, proposing option (1) as best bet
- Might be able to read 1.12 files with 1.10.
- Discuss at Offline Leads meeting. Might be time for Tom J to test roll-back option by then.


- 09:00 → 09:15
  
  Release and project report 15m
  
  Speaker: Erica Snider (Fermilab)
  
  larsoft-coordination-meeting-2021-04-20.pdf
- 09:15 → 09:35
  
  Job submission with FIFE tools 20m
  
  Speaker: Kenneth Herner (Fermilab)
  
  POMS for LArSoft.pdf
- 09:35 → 09:45
  
  Incompatibility in usage of hdf5 and h5cpp products 10m
  
  Speaker: Kyle Knoepfel (Fermilab)
  
  hdf5-2021-04-20.pdf