Name: Workshop on Monitoring, Logging and Accounting (MLA) in Production Grids
Start: 2009-06-10T09:00:00+02:00
End: 2009-06-10T17:00:00+02:00
Location: HPDC 2009

Workshop on Monitoring, Logging and Accounting (MLA) in Production Grids

Wednesday, 10 June 2009 - 09:00

Wednesday, 10 June 2009
09:00 Welcome - Erwin Laure (KTH)
09:00 - 09:15
09:15 Accounting, Monitoring and Logging in the EGI era - Are we there yet? - Steven Newhouse (CERN)
09:15 - 09:45
As the EGEE series of projects comes to an end and the European e-infrastructure community transitions to a sustainable structure based upon the EGI Blueprint, the provision of European distributed computing infrastructures is set to change. The centrally driven top-down model will continue to gradually evolve into a bottom-up federation of independent interoperating infrastructures built from different software technologies and architectures. The current EGEE infrastructure provides accounting, monitoring and logging across over 300 sites in 50 countries. As the EGI operating model has become clearer, EGEE has started a series of changes to transition to the new EGI structure. This talk will describe these changes and how they relate to and interoperate with other infrastructures - both current and planned - and in doing so provide an assessment of our readiness for EGI.
09:45 Monitoring, Logging, and Accounting on the TeraGrid - David Hart (SDSC)
09:45 - 10:15
The TeraGrid, formed in 2003, has evolved as a federation of heterogeneous HPC and associated resources, and its monitoring, logging, and accounting environment has evolved along with it. TeraGrid monitors and tests, via various tools, all its software and services, including HPC user-level services, current system queues and loads, GridFTP transfers and performance, GRAM job submissions, and bandwidth on the dedicated network between sites. TeraGrid's allocations and accounting system evolved from the policies of the earlier PACI and Supercomputer Centers program and now supports capabilities such as multi-site/multi-resource allocations, user notification, etc. Central accounting is supported by the AMIE protocol, and usage is tracked as job-level records from all sites. Users and staff can view the current monitoring results and accounting for their projects via the user portal as well as other web interfaces.
10:15 A Nordic Resource Sharing Framework - Michael Gronager (NDGF)
10:15 - 10:45
The initial motivation for Grid Computing was sharing of resources, which was interpreted as free use of spare resources. As there is really no such thing as a free lunch (nor free resources), this should be rephrased as the possibility of using spare resources, provided that they are accounted for accordingly and that this use can in turn be used to acquire more computational and storage resources. The Nordic DataGrid Facility is currently investigating how to create a common Nordic market for computational and storage resources. This requires a common unique identity for users, tying the use of resources to projects, and accounting for the use no matter where it takes place. Further, this setup should work across national boundaries and organizational borders. In this presentation we present the initial ideas and preliminary investigations on how to create a common Nordic scene for sharing of computational and storage resources.
10:45 Coffee
10:45 - 11:15
11:15 Accounting Facilities and Policies currently used by DEISA - Johannes Reetz (MPG)
11:15 - 11:45
The DEISA accounting facilities have been in production for more than two years. Their purpose is to provide current and historical information on the consumption of the distributed heterogeneous HPC resources in DEISA from a global perspective. The facilities comprise (1) a data provider, a platform- and site-specific implementation that collects usage records at the batch system level and stores them in the GGF UR format in a local XML database, (2) an accounting information service based on a CGI script which offers a uniform interface to the site-local databases, and (3) a client tool that gathers the accounting information from selected sites and machines. The accounting facilities interface directly or indirectly with the DEISA user administration system and with a centrally administered project and proposal database, and they make use of performance conversion factors determined as a result of the DEISA benchmark activities. A design requirement of the suite of facilities was security and data privacy: the delivered information is aggregated with a granularity that depends on the requester's role (user, project supervisor, administrator of virtual communities or member sites). The talk will present the accounting facilities and policies and discuss experiences and envisaged future improvements.
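The role-dependent granularity described above can be illustrated with a minimal sketch. It is not the DEISA implementation: the record fields, the role names and the `ROLE_VIEW` mapping are assumptions made up for this example, and the records are plain dictionaries rather than GGF UR documents.

```python
from collections import defaultdict

# Hypothetical usage records; the field names are illustrative only
# and do not follow the GGF UR schema.
RECORDS = [
    {"site": "siteA", "project": "p1", "user": "alice", "cpu_hours": 10.0},
    {"site": "siteA", "project": "p1", "user": "bob", "cpu_hours": 5.0},
    {"site": "siteB", "project": "p1", "user": "alice", "cpu_hours": 2.5},
    {"site": "siteB", "project": "p2", "user": "carol", "cpu_hours": 4.0},
]

# Assumed policy: for each role, the field that restricts which records
# are visible, and the fields that remain visible after aggregation.
ROLE_VIEW = {
    "user": ("user", ("site", "project")),
    "project_supervisor": ("project", ("site", "user")),
    "site_admin": ("site", ("project",)),
}


def aggregate(records, role, scope):
    """Sum CPU hours at the granularity allowed for `role`.

    `scope` is the requester's own user name, project or site,
    depending on the role.
    """
    filter_field, group_fields = ROLE_VIEW[role]
    totals = defaultdict(float)
    for rec in records:
        if rec[filter_field] != scope:
            continue  # record lies outside what this requester may see
        key = tuple(rec[field] for field in group_fields)
        totals[key] += rec["cpu_hours"]
    return dict(totals)
```

In this sketch a site administrator calling `aggregate(RECORDS, "site_admin", "siteA")` sees only per-project totals for that site, while a project supervisor sees per-user totals for the project across sites.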
11:45 APEL CPU Accounting in the EGEE/WLCG infrastructure - Cristina Del Cane Novales (STFC)
11:45 - 12:15
APEL (Accounting Processor for Event Logs) collects accounting data from all the sites in the EGEE and WLCG (Worldwide LHC Computing Grid) infrastructures and from other collaborating organisations like OSG (Open Science Grid), NorduGrid, INFN-Grid and GridPP. APEL comprises a client sensor that parses batch system and grid gatekeeper logs at each grid site and generates CPU usage records, and a Centralised Repository at a Grid Operations Centre that receives the records from the sites, processes and summarises them, and makes them available for viewing through the EGEE Accounting Portal. APEL faces some important changes in the near future, replacing the current transport mechanism with the Apache ActiveMQ messaging system and preparing for the possibility of distribution to regional or national instances.
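The two stages named in the abstract, a site-level sensor that turns log lines into CPU usage records and a central step that summarises them, can be sketched as follows. This is an illustration only and not APEL code: the log line format, the `parse_log` and `summarise` functions and the record fields are all invented for the example, and real batch system logs look different.

```python
import re
from collections import Counter

# Hypothetical log format, invented for illustration:
#   2009-06-01 job=1234 user=alice cpu_seconds=3600
LINE = re.compile(
    r"^(?P<month>\d{4}-\d{2})-\d{2} job=(?P<job>\S+) "
    r"user=(?P<user>\S+) cpu_seconds=(?P<cpu>\d+)$"
)


def parse_log(site, lines):
    """Site-side step: turn matching log lines into usage records."""
    records = []
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not job entries
        records.append({
            "site": site,
            "month": match["month"],
            "user": match["user"],
            "cpu_seconds": int(match["cpu"]),
        })
    return records


def summarise(records):
    """Central step: total CPU seconds per (site, month)."""
    totals = Counter()
    for rec in records:
        totals[(rec["site"], rec["month"])] += rec["cpu_seconds"]
    return dict(totals)
```

Records from several sites can be concatenated before calling `summarise`, which mirrors the idea of a central repository receiving records from many sites.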
12:15 Accounting - towards national grid infrastructures - Andrea Guarise (INFN)
12:15 - 12:45
The talk presents the current practices in grid Accounting, focusing on toolkits adopted in production, their functionalities and limitations. Changes to the accounting paradigm and new functionalities due to the forthcoming migration to federations of national grids, possibly involving also non-scientific partners, are highlighted.
12:45 Lunch
12:45 - 14:00
14:00 OSG Localized Monitoring and Customized User Dashboards - Rob Quick (Indiana University)
14:00 - 14:30
The Open Science Grid (OSG) Resource and Service Validation (RSV) and MyOSG information presentation projects seek to provide solutions for several grid fabric monitoring problems, while at the same time providing a bridge between the OSG operations and monitoring infrastructure and the WLCG (Worldwide LHC Computing Grid) infrastructure and personalizing a dashboard based on user preferences. The RSV-based OSG fabric monitoring begins with local resource fabric monitoring, which gives local administrators tools to monitor their status on the OSG without leaving their local monitoring infrastructure. With a set of local grid status probes, the results of which are uploaded to a central collector, a system administrator can monitor and watch for issues in house, while the OSG Operations Center (GOC) can watch from a centralized position. OSG information is consolidated and displayed in the MyOSG presentation layer, which focuses on gathering information from many different sources and allows presentation to be defined using the Universal Widget API (UWA) standards, thus allowing custom views for each member of OSG.
14:30 Enabling cross-grid monitoring with Messaging - James Casey (CERN)
14:30 - 15:00
The WLCG project federates several grid infrastructures, including OSG, EGEE and NDGF. The responsibility to monitor specific pieces of the WLCG is delegated to the relevant infrastructures. In this talk we show how we use enterprise messaging, based on Apache ActiveMQ, to aggregate the monitoring results from these infrastructures into a combined WLCG view. We will also show how messaging fits well into the highly distributed world of grid monitoring, with examples drawn from recent work carried out within the EGEE project.
15:00 Achieving operational security with the help of grid users. - Mine Altunay (FNAL)
15:00 - 15:30
Maintaining operational security is highly dependent on security experts' abilities to monitor the infrastructure, to respond to threats appropriately, and to enforce security policies. Monitoring provides us with the current status of the security infrastructure, helping us to respond to imminent threats. Enforcement of security policies, in turn, depends on our ability to distinguish malicious usage and users on the grid, and therefore on holding users accountable for their activities on the grid. In this talk, we focus on how we use accounting and monitoring tools in order to maintain operational security. The OSG accounting service (Gratia) provides daily reports of users' activities on the grid at job-level detail. Moreover, monitoring services such as Resource and Service Validation (RSV) show us the up-to-date status of grid sites in our infrastructure. By combining the information from these two sources, we generate daily and weekly user activity reports. These reports are examined by the security team to detect unusual usage patterns. Moreover, these reports are sent to users and their VO managers. It is our experience that VO managers carefully examine such reports and are very prompt to indicate suspicious behavior, for example due to stolen credentials. This is a great benefit to the security team since it actively involves our users in maintaining the security of our infrastructure.
15:30 How to Know That We Do not Know? - Miron Livny (Univ. of Wisconsin, Madison)
15:30 - 16:00
Throughout our software stack, we produce a seemingly endless stream of information about events that took place in our production grids. By the very nature of our software tools and hardware capabilities, some of this information is lost due to software bugs, hardware failures or operational outages. Unfortunately, information about these losses is not recorded anywhere and therefore cannot be presented to an end user. In other words, the monitoring or accounting displays we provide give no indication of the quality of the displayed information; all they say is “here is what we know but we have no idea what we do not know!” We will report on a recent effort to address this limitation of our information collection tools in the context of the Condor job logging software.
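One generic way to make such losses visible, offered here only as an illustration and not as the approach reported in the talk or as anything Condor does, is to have the producer number its event records consecutively so that the consumer can report which numbers never arrived. The `find_gaps` and `completeness` helpers below are hypothetical names for this sketch.

```python
def find_gaps(seq_numbers):
    """Return inclusive (first_missing, last_missing) ranges.

    Assumes the producer stamps each record with a consecutive
    integer, so any hole in the received numbers is a lost record.
    """
    seen = sorted(set(seq_numbers))
    gaps = []
    for prev, cur in zip(seen, seen[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def completeness(seq_numbers):
    """Fraction of records received between the lowest and highest
    sequence number seen, or None if nothing was received."""
    seen = set(seq_numbers)
    if not seen:
        return None
    expected = max(seen) - min(seen) + 1
    return len(seen) / expected
```

A display could then show the completeness figure next to the data it presents. Note the limitation of this sketch: records lost after the highest received number cannot be detected this way without an additional signal from the producer.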
16:00 Closure
16:00 - 16:15