Andrew Norman is inviting you to a scheduled Zoom meeting.
Topic: HEPCloud Facility Board
Time: Jul 27, 2020 03:00 PM Central Time (US and Canada)
Present: M. Livny, B. Bockelman, A. Norman, S. Timm, A. Tiradani, M. Mambelli, M. Acosta
HEPCloud Theta
Users:
CMS, DUNE, other neutrino experiments.
Scope: what functionality are we delivering
All experiments can define a “campaign” and then, at the experiment level, say that this “campaign” should run at ALCF.
Not a job-by-job decision.
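A minimal sketch of what a campaign-level (rather than job-by-job) placement could look like. The `Campaign` and `Job` classes and the `route` helper are hypothetical names used only for illustration; they are not HEPCloud or HTCondor interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Job:
    job_id: int  # hypothetical identifier; a job carries no site preference


@dataclass
class Campaign:
    name: str
    experiment: str
    site: str  # set once, at the experiment level, for the whole campaign
    jobs: List[Job] = field(default_factory=list)


def route(campaign: Campaign) -> List[Tuple[int, str]]:
    """Pair every job with the campaign's site; no per-job decision is made."""
    return [(job.job_id, campaign.site) for job in campaign.jobs]


campaign = Campaign(
    name="demo",
    experiment="CMS",
    site="ALCF",
    jobs=[Job(1), Job(2), Job(3)],
)
assignments = route(campaign)
```

Under these assumptions every job in the campaign lands at the same site, and changing the destination means changing one field on the campaign.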
Data delivery:
For purposes of this demo it is assumed that the right input data needed for the CMS jobs is already staged in to Argonne and the output data will be transferred out of band.
Discussion: CMS already has some hooks about data in their job descriptions. How accurate are they, and how much are they used?
Use the whole block of allocation at once, or part of it, etc.
Would prefer DC work but will have capacity for allocating the whole burst if needed.
Technical
Tony:
Got instructions for replicating the Barcelona setup
Replicating on Wilson @ Fermilab first
Jim and Liz will make contact with people @ Argonne to see if there is a Kubernetes cluster available to work on.
If not (default assumption), presume we are working on a login node @ Argonne.
CMS will provide a small test workflow and then, later, a bigger one.
Miron Q: are we following the Barcelona model?
They (Barcelona) had no schedd on the HPC system.
Barcelona system assumed you had individual jobs
Miron: if scheduling is a campaign, can we delegate some of the schedd activity to Argonne?
Miron: interested in moving a whole bucket of jobs from schedd to schedd.
Tony: wondering how long it would take to develop this. The Barcelona method is available in current HTCondor, and we are in a time crunch to deliver.
Miron: still wonders whether they can really run a schedd. For the fall the Barcelona model may be the right way, but longer term we want to run a schedd on that end if we can.
Have we checked with Argonne re: the networking assumptions? Tony: yes.
In theory there is some network between login and worker nodes, but it is not robust.
Brian B.: likes the idea of keeping both in mind but starting with the Barcelona model and adding capability/capacity to it for larger bulk transfers.
Better to learn how to move sets of jobs
condor_b: submitting condor jobs into SETI@home/BOINC; take a large chunk of jobs and assign it to a location.
Steve—involved with condor_annex? Brian — no
Miron question: do we have authority to run this on the login nodes? Andrew: not yet,
but the division head is tasked with clearing this for us with Argonne.
Madison involvement
Brian—need a standing contact
Propose it is the standing Fermi/Condor meeting (2nd Friday of each month).
Who needs to be there? Best answer at the moment is Jaime Frey (combination of schedd understanding and knowing how the Barcelona system works), but this could change.
Who is the Fermi contact? Tony is the technical contact but Maria Acosta is doing most of the work.
Does Maria have the contacts she needs?
Maria:
So far yes
Got Jaime’s code and is following his instructions on the HTCondor wiki.
Have adapted the code to our needs.
Still trying to submit Slurm jobs remotely.
Have been talking to Jaime constantly
Chirp attributes to ship back, etc.
Meetings on Friday are a good first step—have a conflict but can try to make it.
E-mail is OK
Slack might be nice
When do they (CMS) want to run jobs?
Want to run jobs in this calendar year.