To connect via Zoom: Meeting ID 831-443-820
Password distributed with meeting announcement
(See instructions for setting Zoom default to join a meeting with audio and video off: https://larsoft.org/zoom-info/)
PC, Mac, Linux, iOS, Android: https://fnal.zoom.us/j/831443820
Phone:
https://fnal.zoom.us/zoomconference?m=SvP8nd8sBN4intZiUh6nLkW0-N16p5_b
H.323:
162.255.37.11 (US West)
162.255.36.11 (US East)
213.19.144.110 (EMEA)
See https://fnal.zoom.us/ for more information
At Fermilab: no in-person presence at the lab for this meeting
Erica: Release and project report
Re. move to e20: Tom Junk noted that DUNE has been able to build under e20
Clarification: art metadata change was requested by SAM project. Listed as a breaking change that would require updating metadata extractors
Erica (a) missed that the "art metadata" was SAM related, and (b) forgot about that change in art, and the discussion at the time about needing to update extractors. Apologies for suggesting that an API was being evaded
Ken Herner: Job submission with FIFE tools
Outline
Transitioning from project.py
Job autorelease for resource limits and other best practices
Singularity images
Transitioning from project.py
Noted that project.py/larbatch is not an SCD-supported tool
Reasons
Usable by any workflow
Supported by SCD
Campaign monitoring integrated into FIFEMON
Automated cron-style submissions and automated recovery launches
Multi-stage workflows possible (with automatic dataset creation so that outputs of one stage become inputs to the next)
Typical usage pattern is to create "campaigns" in a web GUI, then submit there or from the command line
Documentation:
Can convert from project.py description files to POMS
Example XML conversion using Project-py
Instructions: https://cdcvs.fnal.gov/redmine/projects/project-py/wiki/Project-py_guide
Creates a POMS config file from an input project.py project XML file
Walked through the conversion process for a particular workflow
After converting
Can then just go with POMS and discard the project.py XML file
To change something at that point, clone the campaign
Make tweaks to this clone
Demonstrated on POMS GUI pages
Example monitoring
Showed features of POMS monitoring web page
Campaign stage page
Example POMS config file (from DUNE. All remaining examples are from DUNE)
Similar to dict format
Showed for SAM and non-SAM job
Showed a complicated executable section
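A minimal sketch of what a dict-like config of this kind can look like when read programmatically, assuming an INI-style layout. The section names, keys, and values below are hypothetical placeholders for illustration; they are not copied from the DUNE config shown at the meeting.

```python
import configparser

# Hypothetical config text. Section and key names are illustrative
# placeholders, not the actual DUNE/POMS configuration from the talk.
EXAMPLE_CFG = """
[global]
group = dune
version = v1_0

[submit]
memory = 2000MB
expected_lifetime = 3h

[executable]
name = lar
arg_1 = -c
arg_2 = example.fcl
"""


def load_config(text):
    """Parse INI-style text into a plain nested dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


config = load_config(EXAMPLE_CFG)
for section, options in config.items():
    print(section, options)
```

Each section becomes one dictionary, which is why the format reads much like a nested dict.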
Running with input dataset
"Dataset" specification is just a string or list of strings
Make it a SAM dataset by passing it to jobsub as dataset option
Showed how to split large datasets. Mapping onto project.py features
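As a rough illustration of the idea behind splitting a large dataset, the sketch below cuts a file list into fixed-size slices, one per submission. This is a generic chunking example, not the POMS splitting mechanism itself; the function name and the slice size are assumptions made for illustration.

```python
def split_dataset(files, max_files_per_submission):
    """Return consecutive slices of `files`, each at most the given size.

    Generic illustration only; the real splitting is configured in POMS.
    """
    if max_files_per_submission < 1:
        raise ValueError("max_files_per_submission must be at least 1")
    return [
        files[start:start + max_files_per_submission]
        for start in range(0, len(files), max_files_per_submission)
    ]


# Hypothetical file names, used only to show the shape of the result.
example_files = [f"file_{i:03d}.root" for i in range(7)]
slices = split_dataset(example_files, 3)
for index, chunk in enumerate(slices):
    print(index, chunk)
```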
Discussed "reasonableness" in submissions
Aim for multi-hour run times
"Reasonable" file sizes
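One way to turn "aim for multi-hour run times" into a number is to estimate how many input files a job should process. The sketch below assumes you already know the average processing time per file; the three-hour target is a placeholder, since what counts as "reasonable" depends on the experiment.

```python
import math


def files_per_job(seconds_per_file, target_hours=3.0):
    """Estimate how many files one job should process to hit the target run time.

    `target_hours` is an assumed placeholder, not a documented recommendation.
    """
    if seconds_per_file <= 0:
        raise ValueError("seconds_per_file must be positive")
    # Always process at least one file, even if a single file exceeds the target.
    return max(1, math.floor(target_hours * 3600 / seconds_per_file))


print(files_per_job(600))    # ten minutes per file
print(files_per_job(20000))  # a single file already exceeds the target
```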
Job recovery options
Automatically recover failed jobs. Various criteria possible
Proxy upload for analyzers
If running as analyzer, need to occasionally upload a proxy
Name must conform to the standard format
Job autorelease (from "held" status) for resource limits
Encourage people to use this
Now, jobs go into "held" state if they exceed resource requests
With POMS, can instead automatically re-submit with increased limits
Applies to run time and memory
If a job exceeds its limits a second time, it goes into the held state
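The autorelease policy described above can be sketched as a small state model: the first time a job exceeds its limits it is released with larger limits, and the second time it stays held. This is a simulation of the policy only, not how POMS or jobsub implement it; the class, the function, and the 1.5x scale factor are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    memory_mb: int
    lifetime_s: int
    auto_releases: int = 0
    state: str = "running"


def on_limit_exceeded(job, scale=1.5, max_auto_releases=1):
    """Apply the autorelease policy when a job exceeds a resource limit.

    `scale` and `max_auto_releases` are assumed values for illustration.
    """
    if job.auto_releases >= max_auto_releases:
        # Already released once: leave the job held for manual attention.
        job.state = "held"
        return job
    # First offence: raise both limits and put the job back in the queue.
    job.memory_mb = int(job.memory_mb * scale)
    job.lifetime_s = int(job.lifetime_s * scale)
    job.auto_releases += 1
    job.state = "idle"
    return job


job = Job(memory_mb=2000, lifetime_s=3600)
on_limit_exceeded(job)
print(job)
on_limit_exceeded(job)
print(job)
```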
Singularity and offsite running
Will be auto invoked at all supported sites (> 99% of offsite slots)
On Fermigrid, jobs still run in a Docker container by default, but will run Singularity if you specify an image (recommended)
Discussed how to do this
Noted that running off-site will soon become the default. So need to opt into running on site
Ken will be available for help and can share DUNE configs/campaigns
Discussion (Ken Herner and Marc Mengel responding to questions and comments)
Are best practices documented?
Yes. Redmine wiki. FIFE wiki. https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Singularity_jobs
If the Singularity requirement is not enforced, often find that ...[missed this]
Possible missing feature. What is best practice for pre-staging? And does DUNE do this?
Yes. Currently just run samweb session the day before.
When the --dataset option is included, it gets passed to jobsub, which then makes a dag
The begin-job calls ifdh start project
This does kick off pre-staging
Also requires that a sufficient fraction of the dataset be resident on disk
Start job will wait for presence of some number of files. Can't recall the number.
Noted that pre-staging in this way may not feed things quickly enough, depending upon resource contention. So prior pre-staging is still recommended
Pre-staging in advance is always a best practice
If using cron submission with a very large dataset, or with multiple datasets, that runs over multiple days, it's hard to manage manual pre-staging. Is there a way to deal with this?
Can add a stage to the job that runs a pre-staging script
Then launches the job
DUNE does not do that. Just pre-stages in advance
Splitter
Can use project name of pre-stage job to configure launching and input files
Requested that every definition of "reasonable" be documented somewhere
Context / experiment dependent. But needs to be done
Are tarballs unwound before executable is invoked?
This is a jobsub option, so see the jobsub documentation for details and how to configure it
On wiki, says ...[missed...]
Comment that "dataset" option has side effects. Requested that this be documented on the wiki.
OK
Pre-staging, cases where dag isn't quite sufficient
Can submit a dag. But FIFE launch does not configure dags, or do anything that jobsub doesn't do.
At some point, may have filesets that are together, and cannot be chopped up
E.g., from iceberg, ended up needing five files open at the same time. How should we do this?
Couple of approaches
First is to make a union dataset, and make sure all files are pre-staged in advance
Put one of the five in the dataset, and run the project from that. The input file specifies the other four
Second, put all files in one project; a script consumes files out of that project and provides files to all jobs when they become available
If people have a specific workflow in mind, can help try to put something together.
Grouped file delivery is not something that SAM does, so just need to work around that
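The first approach above can be sketched as simple bookkeeping: only one "primary" file per group goes into the dataset that the project runs over, and a lookup supplies the companion files that a job must also open. The function names and file names here are hypothetical, and the sketch does not touch SAM itself.

```python
def build_primary_dataset(groups):
    """Split file groups into a list of primaries and a companion lookup.

    `groups` is a list of lists; the first file of each group is treated as
    the primary that would be declared in the dataset (an assumed convention).
    """
    primaries = []
    companions = {}
    for group in groups:
        if not group:
            raise ValueError("each group needs at least one file")
        primary, *rest = group
        primaries.append(primary)
        companions[primary] = rest
    return primaries, companions


def files_for_job(primary, companions):
    """Return every file a job must open once it is handed `primary`."""
    return [primary] + companions[primary]


# Hypothetical groups of five files that must be opened together.
groups = [
    [f"run1_part{i}.root" for i in range(5)],
    [f"run2_part{i}.root" for i in range(5)],
]
primaries, companions = build_primary_dataset(groups)
print(primaries)
print(files_for_job(primaries[0], companions))
```

All files, primaries and companions alike, would still need to be pre-staged in advance, as noted above.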
Further questions?
#poms slack channel in FNAL Computing workspace
Not everyone is a member there, and SCD personnel have been discouraged from inviting non-SCD people
Open service desk tickets with questions
Kyle Knoepfel: Incompatibility in hdf5 and h5cpp product usage
Have been distributing both for some time
Since hdf5 was updated from 1.10 to 1.12 in January 2021, h5cpp has been incompatible with it
Have a problem example from wirecell.
h5::high_throughput not supported with mismatched versions
Not a compile-time error. Run-time error manifesting as abort/segmentation violation
Submitted as a h5cpp issue: https://github.com/steen-varga/h5cpp/issues/70
Until h5cpp supports hdf5 1.12, there are several options
(1) Keep h5cpp 1.10 and roll back hdf5
(2) Keep hdf5 1.12 and drop h5cpp from the distribution
(3) Keep both hdf5 and h5cpp, and make people aware of the issue
For wirecell, need to make changes to the json file to avoid using h5::high_throughput
Not aware of any LArSoft jobs that are affected
Seeking feedback from users of HDF5 and h5cpp
Our suggestion
Option (1): keep h5cpp 1.10, and roll back hdf5 to 1.10
Discussion
Tom J: Kurt Biery may need to be in this conversation, since there may be some number of hdf5 1.12 files already in existence.
Lynn G: spoke to Eric Flumerfelt.
Talking to Steven Varga this Friday
Could also provide two versions of hdf5
Tom J has been writing HDF5. Would need to test if rolling back would be ok
Lynn G: both are currently available, so can do that.
For sake of moving forward
Will continue discussion, proposing option (1) as best bet
Might be able to read 1.12 files with 1.10.
Discuss at Offline Leads meeting. Might be time for Tom J to test roll-back option by then.