Workshop on Monitoring, Logging and Accounting (MLA) in Production Grids

HPDC 2009

Munich, Germany
With the establishment of large-scale multidisciplinary production Grid infrastructures such as EGEE, OSG, DEISA, TeraGrid, or NAREGI, monitoring the state of the infrastructure and accounting for its usage are becoming vital issues in operating these infrastructures. This workshop is the fourth in the series on topics in production Grids that was initiated at HPDC-15 with the workshop on Management of Rights in Production Grids, followed by the workshop on Data Handling in Production Grids at HPDC-16 and the workshop on VO Management in Production Grids at HPDC-17. This workshop will bring together practitioners and researchers on all aspects of monitoring, logging and accounting (MLA) to discuss the capabilities of existing technologies, identify areas where new functionalities are needed, and explore how the latest research results can be integrated into the software stack and procedures of production Grids. Special attention will be devoted to the security and privacy aspects of these services.

Topics include, but are not restricted to:
    • MLA systems
    • Policies for MLA
    • Interoperability
    • Integration of MLA systems at different levels

For logistics and general information on HPDC 2009, please visit the HPDC 2009 homepage.
    • 9:00 AM 9:15 AM
      Welcome 15m
      Speaker: Erwin Laure (KTH)
    • 9:15 AM 9:45 AM
      Accounting, Monitoring and Logging in the EGI era - Are we there yet? 30m
      As the EGEE series of projects comes to an end and the European e-infrastructure community transitions to a sustainable structure based upon the EGI Blueprint, the provision of European distributed computing infrastructures is set to change. The centrally driven top-down model will continue to gradually evolve into a bottom-up federation of independent interoperating infrastructures built from different software technologies and architectures. The current EGEE infrastructure provides accounting, monitoring and logging across more than 300 sites in 50 countries. As the EGI operating model has become clearer, EGEE has started a series of changes to transition to the new EGI structure. This talk will describe these changes and how they relate to and interoperate with other infrastructures - both current and planned - and in doing so provide an assessment of our readiness for EGI.
      Speaker: Steven Newhouse (CERN)
    • 9:45 AM 10:15 AM
      Monitoring, Logging, and Accounting on the TeraGrid 30m
      The TeraGrid, formed in 2003, has evolved as a federation of heterogeneous HPC and associated resources, and its monitoring, logging, and accounting environment has evolved along with it. TeraGrid monitors and tests, via various tools, all its software and services, including HPC user-level services, current system queues and loads, GridFTP transfers and performance, GRAM job submissions, and bandwidth on the dedicated network between sites. TeraGrid's allocations and accounting system evolved from the policies of the earlier PACI and Supercomputer Centers program and now supports capabilities such as multi-site/multi-resource allocations, user notification, etc. Central accounting is supported by the AMIE protocol, and usage is tracked as job-level records from all sites. Users and staff can view the current monitoring results and accounting for their projects via the user portal as well as other web interfaces.
      Speaker: David Hart (SDSC)
    • 10:15 AM 10:45 AM
      A Nordic Resource Sharing Framework 30m
      The initial motivation for Grid computing was the sharing of resources, which was interpreted as free use of spare resources. As there is really no such thing as a free lunch (nor free resources), this should be rephrased as the possibility of using spare resources as long as they are accounted for accordingly and this use can in turn be used for acquiring more computational and storage resources. The Nordic DataGrid Facility is currently investigating how to create a common Nordic market for computational and storage resources. This requires a common unique identity for users, tying the use of resources to projects, and accounting for the use no matter where it takes place. Further, this setup should work across national boundaries and organizational borders. In this presentation we present the initial ideas and preliminary investigations on how to create a common Nordic scene for sharing computational and storage resources.
      Speaker: Michael Gronager (NDGF)
    • 10:45 AM 11:15 AM
      Coffee 30m
    • 11:15 AM 11:45 AM
      Accounting Facilities and Policies currently used by DEISA 30m
      The DEISA accounting facilities have been in production for more than two years. Their purpose is to provide current and historical information on the consumption of the distributed heterogeneous HPC resources in DEISA from a global perspective. The facilities comprise (1) a data provider, a platform- and site-specific implementation that collects usage records at the batch system level and stores them in the GGF UR format in a local XML database, (2) an accounting information service based on a CGI script which offers a uniform interface to the site-local databases, and (3) a client tool that gathers the accounting information from selected sites and machines. The accounting facilities interface directly or indirectly with the DEISA user administration system and with a centrally administered project and proposal database, and they make use of performance conversion factors determined as a result of the DEISA benchmark activities. Security and data privacy were design requirements of the suite of facilities: the delivered information is aggregated with a granularity that depends on the requester's role (user, project supervisor, administrator of virtual communities or member sites). The talk will present the accounting facilities and policies, and will discuss experiences and envisaged future improvements.
      Speaker: Johannes Reetz (MPG)
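      The role-dependent aggregation mentioned in the DEISA abstract can be illustrated with a minimal sketch. It is not the DEISA implementation: the record fields, the role names (`user`, `project_supervisor`, `site_admin`) and the choice of which granularity each role sees are assumptions made here for illustration only.

```python
from collections import defaultdict

# Hypothetical usage records; the field names are assumptions for this sketch,
# not the GGF UR schema.
RECORDS = [
    {"user": "u1", "project": "p1", "site": "s1", "cpu_hours": 10.0},
    {"user": "u2", "project": "p1", "site": "s1", "cpu_hours": 5.0},
    {"user": "u1", "project": "p1", "site": "s2", "cpu_hours": 2.0},
    {"user": "u3", "project": "p2", "site": "s1", "cpu_hours": 7.0},
]

# Assumed policy: for each role, the field used to restrict visible records
# and the key under which the visible records are aggregated.
ROLE_VIEWS = {
    "user": ("user", lambda r: (r["project"], r["site"])),
    "project_supervisor": ("project", lambda r: r["user"]),
    "site_admin": ("site", lambda r: r["project"]),
}


def report(records, role, identity):
    """Return CPU-hour totals at a granularity chosen by the requester's role."""
    if role not in ROLE_VIEWS:
        raise ValueError(f"unknown role: {role}")
    filter_field, group_key = ROLE_VIEWS[role]
    totals = defaultdict(float)
    for record in records:
        # Only records that belong to the requester's scope are visible.
        if record[filter_field] == identity:
            totals[group_key(record)] += record["cpu_hours"]
    return dict(totals)


if __name__ == "__main__":
    print(report(RECORDS, "user", "u1"))
    print(report(RECORDS, "project_supervisor", "p1"))
    print(report(RECORDS, "site_admin", "s1"))
```

      In this sketch a user sees only their own consumption, a project supervisor sees per-user totals within the project, and a site administrator sees per-project totals at the site; a real policy may draw these lines differently.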
    • 11:45 AM 12:15 PM
      APEL CPU Accounting in the EGEE/WLCG infrastructure 30m
      APEL (Accounting Processor for Event Logs) collects accounting data from all the sites in the EGEE and WLCG (Worldwide LHC Computing Grid) infrastructures and from other collaborating organisations like OSG (Open Science Grid), NorduGrid, INFN-Grid and GridPP. APEL comprises a client sensor that parses batch system and grid gatekeeper logs at each grid site and generates CPU usage records, and a centralised repository at a Grid Operations Centre that receives the records from the sites, processes and summarises them, and makes them available for viewing through the EGEE Accounting Portal. APEL faces some important changes in the near future: replacing the current transport mechanism with the Apache ActiveMQ messaging system and preparing for the possibility of distribution to regional or national instances.
      Speaker: Cristina Del Cane Novales (STFC)
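      The two-stage flow described in the APEL abstract (a site-side parser that turns log lines into CPU usage records, and a central step that summarises them) can be sketched as follows. This is not APEL code: the log line format, the `UsageRecord` fields and the per-(site, VO) summary are hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical log format invented for this sketch:
#   "<site> <user> <vo> <cpu_seconds>"
SAMPLE_LOG = """\
SITE-A alice atlas 3600
SITE-A bob cms 1800
this line is malformed
SITE-B alice atlas 7200
SITE-A carol atlas 400
"""


@dataclass(frozen=True)
class UsageRecord:
    site: str
    user: str
    vo: str
    cpu_seconds: int


def parse_log(text):
    """Site side: turn log lines into usage records, skipping malformed lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[3].isdigit():
            continue  # not a usage line in the assumed format
        site, user, vo, cpu = fields
        records.append(UsageRecord(site, user, vo, int(cpu)))
    return records


def summarise(records):
    """Central side: total CPU seconds per (site, VO) pair."""
    totals = defaultdict(int)
    for record in records:
        totals[(record.site, record.vo)] += record.cpu_seconds
    return dict(totals)


if __name__ == "__main__":
    for key, cpu in sorted(summarise(parse_log(SAMPLE_LOG)).items()):
        print(key, cpu)
```

      Real batch system and gatekeeper logs differ per system and need dedicated parsers; the sketch only shows the separation between record generation at the site and summarisation at the repository.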
    • 12:15 PM 12:45 PM
      Accounting - towards national grid infrastructures 30m
      The talk presents the current practices in grid accounting, focusing on toolkits adopted in production, their functionalities and limitations. Changes to the accounting paradigm and new functionalities due to the forthcoming migration to federations of national grids, possibly also involving non-scientific partners, are highlighted.
      Speaker: Andrea Guarise (INFN)
    • 12:45 PM 2:00 PM
      Lunch 1h 15m
    • 2:00 PM 2:30 PM
      OSG Localized Monitoring and Customized User Dashboards 30m
      The Open Science Grid (OSG) Resource and Service Validation (RSV) and MyOSG information presentation projects seek to provide solutions for several grid fabric monitoring problems, while at the same time providing a bridge between the OSG operations and monitoring infrastructure and the WLCG (Worldwide LHC Computing Grid) infrastructure and personalizing a dashboard based on user preferences. The RSV-based OSG fabric monitoring begins with local resource fabric monitoring, which gives local administrators tools to monitor their status on the OSG without leaving their local monitoring infrastructure. With a set of local grid status probes, the results of which are uploaded to a central collector, a system administrator can monitor and watch for issues in house, while the OSG Operations Center (GOC) can watch from a centralized position. OSG information is consolidated and displayed in the MyOSG presentation layer, which focuses on gathering information from many different sources and allows the presentation to be defined using the Universal Widget API (UWA) standards, thus allowing custom views for each member of OSG.
      Speaker: Rob Quick (Indiana University)
    • 2:30 PM 3:00 PM
      Enabling cross-grid monitoring with Messaging 30m
      The WLCG project federates several grid infrastructures, including OSG, EGEE and NDGF. The responsibility to monitor specific pieces of the WLCG is delegated to the relevant infrastructures. In this talk we show how we use enterprise messaging, based on Apache ActiveMQ, to aggregate the monitoring results from these infrastructures into a combined WLCG view. We also will show how messaging fits well into the highly distributed world of grid monitoring with examples drawn from recent work carried out within the EGEE project.
      Speaker: James Casey (CERN)
    • 3:00 PM 3:30 PM
      Achieving operational security with the help of grid users. 30m
      Maintaining operational security is highly dependent on security experts' abilities to monitor the infrastructure, to respond to threats appropriately, and to enforce security policies. Monitoring provides us with the current status of the security infrastructure, helping us to respond to imminent threats. Enforcement of security policies, in turn, depends on our ability to distinguish malicious usage and users on the grid, and therefore on holding users accountable for their activities on the grid. In this talk, we focus on how we use accounting and monitoring tools in order to maintain operational security. The OSG accounting service (Gratia) provides daily reports of users' activities on the grid at job-level detail. Moreover, monitoring services such as Resource and Service Validation (RSV) show us the up-to-date status of grid sites in our infrastructure. By combining the information from these two sources, we generate daily and weekly user activity reports. These reports are examined by the security team to detect unusual usage patterns. Moreover, these reports are sent to users and their VO managers. It is our experience that VO managers carefully examine such reports and are very prompt to indicate suspicious behavior, for example due to stolen credentials. This is a great benefit to the security team since it actively involves our users in maintaining the security of our infrastructure.
      Speaker: Dr Mine Altunay (FNAL)
    • 3:30 PM 4:00 PM
      How to Know That We Do not Know? 30m
      Throughout our software stack, we produce a seemingly endless stream of information about events that took place in our production grids. By the very nature of our software tools and hardware capabilities, some of this information is lost due to software bugs, hardware failures or operational outages. Unfortunately, information about these losses is not recorded anywhere and therefore cannot be presented to an end user. In other words, the monitoring or accounting displays we provide fail to give any indication of the quality of the displayed information; all they say is “here is what we know but we have no idea what we do not know!” We will report on a recent effort to address this limitation of our information collection tools in the context of the Condor job logging software.
      Speaker: Miron Livny (Univ. of Wisconsin, Madison)
    • 4:00 PM 4:15 PM
      Closure 15m