Compute Provisioning Blueprint 2013-11-25
MONDAY
AM Session Part 1
----------------------------------------------------------
Provisioning *by* OSG
-- Opportunistic resources (site says "use my resources as you like") (inter-VO provisioning)
-- Intra-VO provisioning (e.g. Commercial clouds).
-- OSG purchasing from commercial clouds?
-- VO gives money.
-- OSG purchases via group negotiation
M: resource provisioning *decoupled* from resource scheduling
I: cross-OSG VO pool. Independent from OSG VO.
C: HPC, GPU, Cloud, Multicore?
I: What OSG does for VOs? M: Running a factory for them.
C: How does a VO ask for what it wants?
-> OSG Service: -- Policy registration.
-- Resource request processing mechanism.
M: Letting OSG make decisions on the basis of policy is useful to VOs.
This is also where OSG would enact policy.
I: 1/3 of CMS provisioning comes from EGI.
B: Currencies: OSG priority
XD allocation
M: (Qvo, Ded-vo, osg, $money, XRAC) -> (Nvo, Nosg, Ncloud, Nxd)
Whose problem is this function? The VO? or OSG?
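A minimal sketch of the mapping M wrote down, to make the shape of the function concrete. Everything here is an assumption for illustration: the reading of the symbols (Qvo as queued VO demand in slots, Ded-vo as dedicated VO slots, osg as opportunistic slots, XRAC as slots covered by an XD allocation), the fill order, and the names `Inputs`, `Plan`, `provision`, and `cloud_price_per_slot`. The meeting did not settle a policy.

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    q_vo: int     # queued VO demand, in slots (assumed meaning of Qvo)
    ded_vo: int   # dedicated VO slots available (Ded-vo)
    osg: int      # opportunistic OSG slots obtainable at the VO's priority
    money: float  # dollars available for commercial cloud ($money)
    xrac: int     # slots covered by an XD allocation (XRAC)


@dataclass
class Plan:
    n_vo: int     # slots to provision on dedicated VO resources (Nvo)
    n_osg: int    # slots to provision opportunistically on OSG (Nosg)
    n_cloud: int  # slots to buy on a commercial cloud (Ncloud)
    n_xd: int     # slots to charge against the XD allocation (Nxd)


def provision(inp: Inputs, cloud_price_per_slot: float = 1.0) -> Plan:
    """Hypothetical policy: fill demand from the cheapest source first."""
    remaining = inp.q_vo
    n_vo = min(remaining, inp.ded_vo)
    remaining -= n_vo
    n_osg = min(remaining, inp.osg)
    remaining -= n_osg
    n_xd = min(remaining, inp.xrac)
    remaining -= n_xd
    # Cloud is used last and is capped by the money on hand.
    affordable = int(inp.money // cloud_price_per_slot)
    n_cloud = min(remaining, affordable)
    return Plan(n_vo=n_vo, n_osg=n_osg, n_cloud=n_cloud, n_xd=n_xd)
```

Whoever owns this function (VO or OSG) owns the fill order and the price assumption; those are the policy.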
C: Adding value, but taking on risk. Is the tradeoff reasonable?
?? since dedicated VO resources are part of the policy, what about VOs that run dedicated work out-of-band from OSG systems?
L: Currently ready to do allocations. Not ready to do cash money.
M: What if another entity agrees to be the business/legal entity instead of OSG?
M: VOs already delegate proxies.
B: Defining the policy vs. executing the policy.
L: If OSG does the technology and allows VOs to actually do the provisioning.
J: Can we pursue the technology to do policy-based provisioning to VOs, and later make the decision on whether OSG will use that technology to do a centralized provisioning service?
C: What is the benefit of doing this?
M: Are we leaders contributing to science?
Answer: Yes.
Action: Technology group looking into means to do this?
What about who owns OSG opportunistic resources? So that they could be allocated programmatically. Policy-based provisioning. How do we get them?
AM Session Part 2
Google has a separate HTC cluster. Exacycle.
?? Make OSG a research scientist on Exacycle. PaaS
-- Security of VM environments?
-- Demonstrates the dynamic changing of the landscape.
-- NSF cloud?
?? Condor means of translating policy into action. Hard part is how to display the state of the pool in such a way that the user can tell what is happening?
L: Should the resource provider be able to express its access policies? If so, how?
PM Session 1
-- What capabilities does the current system have to express provisioning policies?
-- Accounting vs. Auditing:
-- Issue of sensitivity to tampering when auditing is involved.
M: HIPAA auditing requirements involved tracing access rather than preventing access.
M: Has money and is willing to spend it on EC2. What proof/auditing will be called for by granting agency?
L: Need concept in our allocation system for a billable transaction.
L: This could take the form of actual dollar values (albeit phantom dollars).
C: GAAP: Generally Accepted Accounting Principles -- Agreed upon by all parties.
What about network and disk allocation? Co-scheduling other resources is hard.
M: Work without time estimates makes co-scheduling very difficult.
I: How can data and network be integrated in the policy expression?
M: Circuit breakers against stupidity and misuse.
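One possible reading of "circuit breaker" in a provisioning loop, as a sketch only: stop submitting to a target after a run of consecutive failures until an operator resets it. The class name `CircuitBreaker`, the threshold `max_failures`, and the manual reset are all hypothetical; the meeting did not define the mechanism.

```python
class CircuitBreaker:
    """Hypothetical breaker: trips after max_failures consecutive failures."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    def record(self, success: bool) -> None:
        # A success clears the streak; a failure extends it.
        if success:
            self.failures = 0
        else:
            self.failures += 1

    def allow_submit(self) -> bool:
        # Once tripped, submissions stay blocked until reset() is called.
        return self.failures < self.max_failures

    def reset(self) -> None:
        # Assumed to be an explicit operator action.
        self.failures = 0
```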
What capabilities does the current system have to express provisioning policies?
What constitutes "success" in provisioning?
Discussion of capabilities of APF and GF wrt maintenance, operations, error reporting.
What about chirp from running jobs?
Miron: Wants to run GF for GLOW at Indiana GOC.
Provisioning vs. site-support.
1. black holes, no stderr/stdout returned. Rapid cycling.
2. mismatch between GRAM reportage and actual batch status.
3. all/some jobs fail submission, enter HELD state.
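Item 1 above (black holes: rapid cycling with no stderr/stdout) can be sketched as a simple check over job records. This is an illustration under stated assumptions, not how APF or GWMS detect it: the record layout `(node, wall_seconds, output_bytes)` and the thresholds `max_wall_seconds` and `min_jobs` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (node name, wall-clock seconds, bytes of stdout+stderr)
JobRecord = Tuple[str, float, int]


def find_black_holes(
    records: Iterable[JobRecord],
    max_wall_seconds: float = 60.0,
    min_jobs: int = 5,
) -> List[str]:
    """Flag nodes where every job ended quickly and returned no output."""
    total: Dict[str, int] = defaultdict(int)
    suspicious: Dict[str, int] = defaultdict(int)
    for node, wall_seconds, output_bytes in records:
        total[node] += 1
        if wall_seconds < max_wall_seconds and output_bytes == 0:
            suspicious[node] += 1
    # Require a minimum sample so one unlucky job does not flag a node.
    return sorted(
        node
        for node in total
        if total[node] >= min_jobs and suspicious[node] == total[node]
    )
```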
TUESDAY
AM Session 1
M: How does the provisioning system learn about demand?
How do you represent the current state of workload?
Discussion and graphing of concepts of pressure groups (WMS queues), provisioning groups (APF queues, GWMS Entries) in APF vs. GWMS.
AM Session 2
Policy expression.
ACTION ITEM:
-- document sketching APF/GWMS architecture, with annotations providing common vocabulary/semantics. Dec 13.
-- minutes + photos from meeting to group.
-- set up GWMS Gfactory at Indiana for GLOW.
-- Transfer site support responsibility from GFactory admins to OSG GOC user support. How?
-- black hole detection on Condor batch system?
(RACF)
-- submit glideins via APF.
********************************************************************************
KEY DIAGRAMS
(transcribed from whiteboard photos)
----------------------------------------------------------------------------
Services needed for provisioning system
a) Needed for operations & maintenance (Ops)
b) circuit breakers
c) packaging, deployment?, clients?
d) Accounting [including allocation use]
e) Ability for VO to express its resource provisioning policy. Assistance to validate policy.
g) Interface for VO to express demand (input to #1) [including time estimates?]
h) Ability for OSG (ET) to express a resource provisioning policy. (??)
j) Auditing interface for money spent? What level of scrutiny is needed? Traceability? "GAAP for OSG"
k) Monitor, understand, and verify service is executing correctly for VO, based on #1 and #2 (VO and OSG policy)
l) Ability of provider to express resource provisioning policy
m) interface for " "
Stakeholders/Actors
- Operators (OSG, VO, Resources)
- VO liaison
- tech/admins
- OSG ET
- VO manager
- Resource manager
0) OSG takes ownership of opportunistic cycles
1) Who owns the opportunistic pool in OSG?
2) Find tools to empower person in #1 (technology group?)
3) Introduce concept of allocations(?)
4) Develop definition of successful provisioning
5) Sanity check of running jobs
Policy definition transformations:
Pvo -> T1 -> Posg -> T2 -> Ssys
Pvo : VO policy definition
T1 : Transformation to machine-usable policy config
Posg : OSG policy definition
T2 : Transformation of output of T1 by Posg.
Ssys : Final usage system state definition.
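The chain above can be sketched as two small functions. This is a toy under loud assumptions: the text form of the VO policy (`"osg=200, cloud=50"`), the choice to express OSG policy as per-resource caps, and the names `t1` and `t2` are invented here for illustration; the whiteboard only fixed the order Pvo -> T1 -> Posg -> T2 -> Ssys.

```python
from typing import Dict


def t1(p_vo: str) -> Dict[str, int]:
    """T1: turn a VO policy statement into a machine-usable config.

    Assumed input form: comma-separated 'resource=slots' pairs.
    """
    config: Dict[str, int] = {}
    for item in p_vo.split(","):
        name, _, value = item.partition("=")
        config[name.strip().lower()] = int(value)
    return config


def t2(config: Dict[str, int], p_osg: Dict[str, int]) -> Dict[str, int]:
    """T2: apply OSG policy (assumed here to be per-resource caps) to T1's output.

    Resources that OSG policy does not mention are capped at zero.
    """
    return {name: min(slots, p_osg.get(name, 0)) for name, slots in config.items()}


# Ssys: the final system state definition after both transformations.
s_sys = t2(t1("osg=200, cloud=50"), {"osg": 150, "cloud": 100})
# s_sys == {"osg": 150, "cloud": 50}
```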
---------------------------------------------------------------------------
Important provisioning error/faults/problems:
1) Blackhole nodes
2) Unreliable pilot updates from CE.
3) Job submit fails.
4) Validation failure
5) Resources which stop functioning after 'T' hours. (?)
-) Payload failure [possibly not a valid "provisioning failure"]
Q: Where does "provisioning" end, i.e. at what point can it be considered "successful"?
-----------------------------------------------------------------------------
Action Items
1) Produce a document on provisioning architecture [T + 3 weeks]. Hover
2) Produce a document on having OSG provide access on a commercial cloud. Bauerdick
3) Run experiment of having GLOW provisioning run by GOC. [1 month @ 2FTE]. Quick
4) Write an operations plan which separates provisioning operations and site support. Write a transition plan for moving OSG site operations to elsewhere.
5) Plan for tools to address 5 major issues outlined for "why pilots fail". Bockelman