The HEPiX forum brings together IT system support engineers from the High Energy Physics (HEP) laboratories and institutes, such as BNL, CERN, DESY, FNAL, IN2P3, INFN, JLAB, NIKHEF, RAL, SLAC, TRIUMF and others. The HEPiX meetings have been held regularly since 1991 and are an excellent source of information for IT specialists, which is why they also attract substantial participation from non-HEP organizations.
SpamCop is a popular tool on the internet for reporting
"spammers" to their ISPs. Several HEP sites have signed up with
SpamCop as a method of detecting spam. Unfortunately, because
of the way Fermilab processes bounced spam email, it can appear to
SpamCop that Fermilab is an initiator of spam. This has
occurred several times in the past year. To resolve the last
incident, we requested that several sites add Fermilab to their
servers' whitelists. In addition, we adjusted our mail
gateways to eliminate bounced spam messages whenever
possible. We are also looking into improving our spam
filtering systems to minimize any spam that might get
through and subsequently be forwarded to another site. We would
like to discuss compiling a list of other HEP sites
and their email servers, and sharing this list so that we
don't inadvertently block each other's email transmissions.
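As an illustration of how such a shared list might be consulted, the sketch below checks a sending server's address against a list of trusted networks before a blocklist verdict is applied. It is a minimal sketch under stated assumptions: the function name `is_whitelisted`, the list format, and the addresses (taken from the reserved documentation ranges) are hypothetical and do not describe any site's actual mail gateway configuration.

```python
import ipaddress

# Hypothetical shared whitelist of HEP mail server networks.
# These are reserved documentation ranges, not real site addresses.
HEP_MAIL_WHITELIST = [
    "192.0.2.0/24",
    "198.51.100.16/28",
]


def is_whitelisted(sender_ip, whitelist=HEP_MAIL_WHITELIST):
    """Return True if sender_ip falls inside any whitelisted network."""
    addr = ipaddress.ip_address(sender_ip)
    return any(addr in ipaddress.ip_network(net) for net in whitelist)


def should_block(sender_ip, listed_as_spammer):
    """Block only senders that are blocklisted and not on the shared whitelist."""
    return listed_as_spammer and not is_whitelisted(sender_ip)
```

With this arrangement, a whitelisted server that a reporting service has flagged would still be accepted, while flagged servers outside the list would be blocked.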
What is TRAC?
This talk will present Trac, a unique open source tool
combining a wiki, an issue tracker, a Subversion client and
a roadmap manager. More than a tool, Trac is an extensible
framework based on plugins. LAL is currently using this tool
both for software development and system administration.
(LAL / IN2P3)
Scientific Linux Update
In this talk, we will present the status of Scientific
Linux, focusing on relevant changes in the past six months.
Next, we will also present current projects with SL,
focusing on SL 5.x and scientific applications. To conclude,
we will talk about future enhancements.
TWiki at CERN
The Database and Engineering Services (DES) Group of the IT
Department at CERN supports and maintains a CERN TWiki.
This presentation will cover the history of TWiki at CERN,
facts about the system, the technical setup, problems we
face and our plans for finding a solution to them.
Service Level Status - A Real-time status Display for IT
Nowadays, IT departments provide, and people use, many
computing services of an increasingly heterogeneous
nature. There is therefore a growing need for a common
display that groups these different services and reports
their status and availability in a uniform way. At
CERN, this need led to the launch of the SLS project.
Service Level Status Overview (SLS) is a web-based tool that
dynamically shows availability, basic information and
statistics about various IT services, as well as
dependencies between them.
The presentation starts with a short description of the
project, its goals, architecture, and users. Then, the
concepts of subservices, metaservices, dependencies, service
availability etc. are introduced, followed by a
demonstration of the system and an explanation of how to add
a service to SLS. The talk ends with information on how
SLS could be used by other HEP institutes.
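The concepts of subservices, dependencies and service availability named above can be illustrated with a small sketch. It assumes, purely for illustration, that a service's availability is the average of its subservices' availabilities, capped by its least available dependency. The abstract does not say how SLS actually combines these values, so the `Service` class and the aggregation rule are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """Illustrative service node; not the SLS data model."""
    name: str
    availability: float = 100.0  # percent, used only when there are no subservices
    subservices: List["Service"] = field(default_factory=list)
    dependencies: List["Service"] = field(default_factory=list)

    def effective_availability(self) -> float:
        # Assumed rule 1: a service with subservices reports their average.
        if self.subservices:
            total = sum(s.effective_availability() for s in self.subservices)
            own = total / len(self.subservices)
        else:
            own = self.availability
        # Assumed rule 2: a service is never more available than any
        # service it depends on. Dependency cycles are not handled here.
        for dep in self.dependencies:
            own = min(own, dep.effective_availability())
        return own


database = Service("database", availability=80.0)
web = Service(
    "web",
    subservices=[Service("web-node-1", 100.0), Service("web-node-2", 50.0)],
    dependencies=[database],
)
```

Here `web` averages its two nodes and is then compared against the `database` dependency; a real display would also need to decide how to treat services whose status is unknown.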
Managing system history and problem tracking with SVN/Trac
This talk will present LAL's experience in addressing the need to
track system configuration changes and link them with an
issue tracker, using a combination of Subversion and Trac.
(LAL / IN2P3)
Using RT to Manage Installation Workflow
We needed to manage the workflow of installation tasks more
formally, because so many were happening simultaneously
that confusion was resulting. We
modeled the workflow using RT, the Request Tracking system
that we use for user requests. The result is a relatively
lightweight and flexible system that gives planners a
"dashboard" of the status of all active projects, and the
information they need to execute the task.
High Availability Methods at GSI
This presentation gives an overview of the methods used
to ensure the high availability of important services such
as databases, web services and central file servers, among
others. Apart from commercial products for certain systems
(Oracle, Exchange), different open source Linux tools
(heartbeat, drbd, mon) are combined with monitoring and
hardware methods and adapted to our special needs.
Using Quattor to manage a grid (EGEE) Fabric
Deploying grid services means managing a potentially large
number of machines that partially share their configuration.
A tool is needed not only to install but also to maintain such a
configuration. Quattor, developed as part of EDG, is such a
tool. This talk will focus on the LCG/gLite support in Quattor.
(LAL / IN2P3)
Spam - Statistics and Fighting Methods
Scientific Linux Inventory Project (SLIP)
This talk will discuss the effort to provide an inventory of
all Linux machines at Fermilab. We will describe the
motivation for the project, the package we selected, and the
current state of the project.
Networking Dinner at Newport News City Center Marriott
The BNL RHIC/ATLAS Computing Facility (RACF) Central
Analysis/Reconstruction Server (CAS/CRS) Farm is a large
scale computing cluster currently consisting of ~2000
hosts running Scientific Linux. Besides providing for
computation, the CAS/CRS systems' local disk drives are used
by distributed data systems such as dCache, ROOTD and XROOTD to
store considerable amounts of data (presently ~400 TB). The
sheer number of systems in the farm, combined with our
distributed storage model, complicates network installation
This presentation describes the system developed at RACF to
fully automate and simplify management of the PXE
Support of Kerberos 5 Authenticated Environment by TORQUE
TORQUE is a successor of the OpenPBS batch queuing system,
available as an Open Source product. Despite the widespread
usage of TORQUE as a job management system on computational
farms and LHC grid installations, this batch system does not
support any advanced authentication mechanisms.
We show two ways to redesign the existing
source code in order to add Kerberos 5 authentication
support for batch jobs.
The first way uses local server-client RPC connections while
the second one makes use of the Authenticated Remote Control
The described modifications have been successfully deployed
in the local computing infrastructure of the H1
Collaboration at DESY. This provides an identical
environment for batch jobs and user desktop processes.
Planning for Hall D: The Hazards of Fast Tape Drives
The upgrade to Jefferson Lab will require a hardware refresh
of the mass storage system in order to handle the higher
volume of data from new experiments and simulations. The
next generation, higher capacity tape drives are also
significantly faster, a fact that has implications for
almost all parts of the mass storage system. This talk
examines the performance tuning required to make efficient
use of these drives and underscores some of the particular
needs of tape-based storage systems used by most experiments.
Porting to and Running Applications on 64 Bit Platforms
The author describes his recent experience porting software
packages to and running these packages on 64 bit machines
with Solaris and Linux. Issues discussed include code
modification, compiling, operating system requirements, and
performance comparisons with 32 bit machines.
NGF NERSC's Global Filesystem and PDSF
I would like to explain a bit about our global filesystem
and its use on PDSF, and also about how this filesystem can be
extended to other sites/labs. Our filesystem is GPFS, but
the concept can also be extended to Lustre or other cluster
Storage Class : Problematic and Implementation at CCIN2P3
Storage Classes attempt to represent storage use cases for a
given experiment. It is considered harmful to match the
storage classes to a real-life storage system, especially if the
latter is based on path to get the storage configuration of
This presentation aims to define the issues surrounding Storage
Classes, explain one possible solution, which is implemented
at CCIN2P3, and discuss the pros and cons.
This talk will present the current state of the art of
CERN. We will explain our benchmarking procedures, review our
latest results and talk about where we are going from here.
As part of the results review, we will comment on the
current CPU trends and we will talk about the increasingly
important issue of power consumption.
Recent Fabric Management Improvements at CERN
This talk will describe some improvements to the monitoring
and management of the storage and CPU services in the
- use of SMART for disk monitoring
- integration of disk server monitoring and storage system
- transmission of Grid job memory requirements to the local
The Stakkato Intrusions
Over 15 months, from late 2003 until early 2005,
hundreds of supercomputing sites, universities and
companies worldwide were hit by a series of intrusions,
with the perpetrator leapfrogging from site to site using
stolen SSH passwords. These are collectively known as
the Stakkato intrusions, and include the Teragrid
Incident and the Cisco IOS source code theft, both of
which received widespread attention from the media.
This talk will cover case studies of performed intrusions,
an analysis of why Stakkato could be so successful, and
the story of how the suspect was finally tracked down
Network Security Monitoring with Sguil
Most mid- or large-sized organizations conduct some sort of
network monitoring for security purposes. Traditional
Intrusion Detection Systems (IDS) tell only part of the
story, leaving analysts to perform complex and
time-consuming data-mining operations from multiple sources
just to answer simple questions about IDS alerts. This talk
presents a more efficient model that uses the open source
Sguil software to optimize the process for analyst time and
GridX1: A Canadian Computational grid for HEP Applications
GridX1 is a Canadian computational grid which combines the
shared resources of several Canadian research institutes for
the primary purpose of executing HEP applications. With more
than two years of production experience, GridX1 has
demonstrated the successful application of Globus Toolkit
(GT) v.2 cluster gatekeepers managed by a Condor-G resource
brokering system. A novel feature of the project was a
resource brokering interface to the LHC Compute Grid, which
was used during Data Challenge 2 to route ATLAS jobs to the
Canadian resources without having dedicated Compute Elements
at each cluster. Further, independent Condor-G resource
brokers have been implemented to manage the Canadian ATLAS
and BaBar MC production systems. Finally, our recent efforts
have been directed toward building a service-oriented grid
using GT4, including a WS-MDS registry service and WS-GRAM
enabled metaschedulers built upon Condor and GridWay.
(University of Victoria/HEPnet Canada)
GridPP is a UK e-Science project which started in 2001 with
the aim of developing and operating a production Grid for UK
particle physicists. It is aligned with the EGEE
infrastructure and the WLCG Project but also works with
currently running experiments and theorists. GridPP aims to
provide an environment in which all UK particle physicists
can do their analysis, share data, etc., and in which the UK can
contribute to the worldwide collaboration and activities of
their experiments.
The EGEE Grid Infrastructure
The EGEE grid infrastructure is in constant production use
with significant workloads, not only for High Energy Physics
but for many other scientific applications. An overview of
the EGEE project, the infrastructure itself, and how it is
being used will be given. Several applications rely on a
long term infrastructure being in place; the current ideas
of how this may be achieved will be discussed.
Virtual Machines in a Distributed Environment
(University of Florida)
Issues and problems around Grid site management
Grid site reliability and availability are
becoming the biggest outstanding issues in building a
reliable grid service. This is particularly important for
WLCG where specific reliability targets are set. This talk
will outline the scope of the problems that need to be
addressed, and point out potential areas where HEPiX members
can contribute, and will seek input on how we can address
some of the problems.
FermiGrid - Status and Plans
FermiGrid is the Fermilab Campus Grid. This talk will
discuss the current state of FermiGrid and plans for the
Open Science Grid Progress and Vision
This talk will detail recent Open Science Grid progress and
outline the vision for the upcoming year.
Grid Security in WLCG and EGEE
This talk will present the current status, plans and issues
for Grid Security in WLCG and EGEE. This will include
Authentication, Authorization, Policy and Operational Security.
Testing the UK Tier 2 Data Transfer and Storage Infrastructure
When the LHC experiments start taking data next year, the
Tier 2 sites in the UK (and elsewhere) will need to be able
to receive and transmit data at unprecedented rates and
reliabilities. We present the efforts in the UK to test the
disk to disk transfer rates between Tier 2 sites along with
some of the lessons learnt and results obtained.