Reviewers: Jim Bottum <jb@clemson.edu>,
Ian Fisk <ifisk@fnal.gov>, Mark
Neubauer <msn@illinois.edu>, Ewa
Deelman <deelman@isi.edu>
Date: Mar 14, 2014
-----
*** Final ET / Reviewers discussion
- ...
- Dealing with large organizations leads to inherent inertia
- Expanding the user base with the current model (e.g. Software
inheriting orphaned projects) is a challenge
- LHC experiments are probably willing to evolve as directed by
OSG over long timescales
- LHC experiments would favor a model where resources / effort are
given in bursts rather than steadily.
- Infrastructure is too static: it takes too long to set up a new
site.
Goal: set up a new site in 1/2 day, then tear it down.
Either provide a large pool that can be easily tapped into or
become more agile in setting up access to centers.
- Process to decide on new technologies is inefficient. You don't
need more Blueprint meetings. Maybe focused workshops, user
groups, a better planning process, ...
- OSG is missing feedback from users (at all levels). The
presentation gave no good sense of a roadmap coming from user
feedback.
- Campus grid approach is a step in the right direction, but OSG
needs to set the bar high, e.g. this may be a way to deal with the
T3.
- Are the council members able to represent their communities in
full? E.g. WLCG does not take decisions for the sites.
- Whom do you engage on campus, and how? OSG failed to engage CIOs.
Find a CIO who can organize a CIO group and advise on this. There
is nothing wrong with dealing with the scientists directly. CIOs
know how to connect in IT depts. Connection between faculty and IT
is often loose: IT can learn what to do to help faculty better by
dealing with you.
- OSG has strengths in packaging, testing, and deployment. Should
capitalize on these e.g. to connect more campuses through them.
- A lot of focus is on technical aspects and not enough on people
communicating with users. CIOs have resources to allocate 0.5 FTE
for communication. Jim can help with this strategy.
- Provide course material and examples online to spread
knowledge. OSG has no funding for education. OSG runs only a summer
school. A user goes to the OSG web site and cannot easily find
documentation.
- Resources are deployed to run ops, security, make sure that LHC
succeeds. Need to change allocation if you want to evolve beyond
core mission.
- Analyze software stack and identify what is really used (gWMS,
no cert VOs, etc.) and remove unneeded dependencies. Redeploy the
resources from Software.
- You are trying to do too many things yourself and you do not
leverage the community.
- Have we tried to "evangelize" - find 10 people and ask them to
find 10 people each.
- Help scientists and let their competition know. Find
under-served communities / universities: don't forget the little
guys.
- Distribute information by creating a one-evening course with
examples.
- Create a buzz on a few campuses to make it spread.
- Mark's final remarks: Define stakeholder, grid sites. Be more
nimble. Facilitate new ideas instead of doing the Blueprint
meetings.
---
*** Discussion during the talks
** 8:30 Lothar
Ian: What is the target amount of opportunistic resources in OSG?
Miron: We target to "hit the ceiling" of all available resources.
We may find more resources at that point.
Chander: 90M CPU h is the usage: we don't know how much there
is.
Miron: GLOW provided as many opportunistic resources to OSG as
it consumed on OSG: the plumbing works.
Miron: How do we make sure that all the services can support growth
from 60M h per month to, say, 100M h?
Jim Bottum: Does the utilization map to our investors?
Lothar: roughly
Miron: if we had more demand, we might find more resources
Miron: some communities are still not thinking about this order of
magnitude in available resources, although OSG can provide them.
Jim / Ian: Double edged sword: there may be various internal
organizational (campus, DOE, ...) incentives to use resources
owned by them, rather than going to OSG. Sociological issues.
Mark Neubauer: Make the case that OSG is the basis for professional
training in the US. Better metrics than computational hours.
Miron: Google and Facebook are the "identity providers"
Jim: I wish that it was possible to rely on the commercial sector
for that. We might not be there yet to run identity management for
an entire university.
** 9:15 Chander - OSG User Support
Ian: Do people switch among VOs?
Miron: OSG is merging with Engage
Jim: you said that you want to increase the number of supported
projects: have you reached out to Internet2 e.g. through the Net+
service?
Miron: yes, but no deal yet
Mark: You might want to change the way you present number of users
(with the tables of XD projects, OSG-Direct, projects at campuses,
...). Underline that these are only the PIs.
Michael Ernst: what is the feedback from the community?
Chander: Users are satisfied. Metric: anecdotal evidence
Jim: need senior university management to tell local campus
champions that they have to work with the national
cyber-infrastructure as well as focusing on local computing
** 9:41 Rob Gardner - Campus Grid
Jim: it would be useful to have a glossary of acronyms (DHTC, etc.)
Ewa: Do you have a way of deploying components easily?
Rob: yes, they don't have to be exposed to the full complexity.
Mark: to get more users, going to the campuses and doing tutorials
seems very effective to enable technology there.
Need to also discuss under-represented universities.
Mark: what is the "coherent" OSG strategy in outreach to serve the
communities? This seems missing from the agenda.
Lothar: The model started as VO-driven: bring resources and share.
Now evolving a new business model: OSG VO and campus connect. Our
new strategy is not mature yet.
Miron: still need to ask researchers "what would you do with 100,000
CPU hours?" Scientists need to think differently to bring about an
increase in usage
** Break
** 10:30 Rob Quick - Operations
Miron: what is the impact of the government shutdown?
Rob: if FNAL and BNL had gone down, all critical services were
expected to be up. Communication has a separate path from FNAL.
Accounting is cached locally at sites.
Mark: your level of downtime for the critical services in 2 yrs is
commendable. Do you have a disaster recovery plan?
Rob: yes. Should IU have a disaster, we can bring up the critical
services within a week
Miron: distributing the services to various institutions is a
strength, but it increases some risks.
Jim: do you have a service methodology across the 5 sites to tie
them together?
Rob: the process is ITIL-like
Jim: how do you develop senior staff?
Rob: one-on-one with Rob and leadership classes. Left for 1 month
for paternity leave and ops went smoothly.
Mark: XSEDE gives you an environment; OSG works with an existing
environment. What is the symmetry of the XSEDE / OSG relationship?
Chander: See Brian's talk
Ewa: What is the distribution of tickets from users vs. sites?
Rob: Probably close to 40%-60%. Need to check.
** 11:07 Brian Bockelman - Technology
Mike Ernst: Who is providing guidance on where the architecture is
heading and which technologies are considered? What is the input
process?
Brian: Miron is the technical director. John runs the process. Input
from several stakeholders.
Requests from council, area coordinators, blueprint meetings. Not a
formalized process.
Miron: if there is enough pressure from a stakeholder, then we do a
blueprint meeting. We should increase the number of blueprint
meetings (quarterly is too little).
Lothar: The fact that the Blueprint meeting process is not smooth is
holding us back. Bringing experts to the table is an almost unique
strength. We have to fix the process.
Miron: I accept responsibility for setting up 6 (?) blueprint
meetings every year, if someone helps with logistics
Jim: is there a process to gather requests?
Lothar: yes, there is a request system
Michael: it is a good system to track the requests. We need a better
way to help stakeholders formulate requests.
Brian: the system is good for specific requests, not high-level /
general requests. Council meetings may be better to discuss those.
Michael Ernst: we should use workload manager, instead of workflow
manager, for the gWMS / Condor technologies
** 11:35 Tim C. and Tim T. - Software and release management
Mark: is it manageable / too high-maintenance to customize 15M lines
of code, most of which are out of your control?
Tim T.: we have automated processes to retrieve, apply patches, etc.
Mark: It can be a high barrier to change technologies
Brian: we struggle to count packages: is globus 1 or 50? Today we
say 50 with 50 automated downloads, but we update that as 1.
Tim C.: We use the terms "components" or "packages"
Miron: we should have better language / metrics. Number of packages
is high as per reviewers' feedback
Brian: we are decreasing the number of packages we support and
increasing the number supported by the community
Miron: separation of software and release management is working
well. Release management is the interface to operations.
Jim: how do you financially support the taking over of orphaned
software?
Miron: OSG pays for it.
Jim: do the funding agencies understand that you are taking this
over?
Miron: they understand we do the best we can. If their
stakeholders get the job done, the agencies are fine. The ET has
been managing priorities well.
Lothar: we are in this space to support our communities
Mark: is it the users that don't want to overcome the barrier to
update or are there more substantial issues?
Miron: all of the above.
Jim: what is your budget for software?
Chander: 9.6 FTE for technology, including software. 1/3 of our
effort.
** 12:09 Mine Altunay - Security
Mark: where can you push standardization for InCommon? Computing is
an equalizer to allow contributions from small universities. Are we
marginalizing them this way?
Mine: there are standardization efforts
Lothar: this is the 80% coverage of users: maybe it should be 95%.
We have other identity mechanisms for the remaining users.
** Shawn McKee - Networking