XCache DevOps Meeting July 16, 2020

Start: 2020-07-16T13:00:00-05:00
End: 2020-07-16T14:10:00-05:00

Thursday 16 Jul 2020, 13:00 → 14:10 US/Central

Derek Weitzel (University of Nebraska - Lincoln), Marian Zvada (US CMS)

Description

Time: 11AM PT/1PM CT/2PM ET/8PM CET
Where: ZOOM.US

Join from PC, Mac, Linux, iOS or Android: https://unl.zoom.us/j/651969661
Meeting is password-protected: ask #xcache slack channel or xcache@opensciencegrid.org

Or iPhone one-tap:
    US: +16699006833,,651969661# or +14086380968,,651969661#
Or Telephone:
    Dial (for higher quality, dial a number based on your current location):
        US: +1 669 900 6833 or +1 408 638 0968 or +1 646 876 9923
    Meeting ID: 651 969 661
    International numbers available: https://unl.zoom.us/zoomconference?m=wxCNSgMZiA-cVKSNowGYlQ

Attendance:  Derek, Andy, Diego, Edgar, Ilija, Matevz, Riccardo, Mat

Note: Derek will be out of civilization next week.

Next week: special topic on using Kubernetes operators for caches


Special topic (throttling cache connections):

  Edgar: Maintenance at Cardiff required their cache to be turned
  off.  As a result, all requests from LIGO jobs in Europe hit the
  Amsterdam (PIC) cache instead, saturating the cache's outbound
  network connection.

  PIC's workaround was to restrict the number of pilots running at
  PIC, but that of course had a limited effect since jobs outside of
  PIC were still accessing the cache.


  Q: Should CVMFS have handled this situation?  It knows to switch
  caches when one cache is too slow.

  A: The European CVMFS only has European caches listed as options;
  there were only two European caches active at the time, so CVMFS
  might have been bouncing back and forth.  (Would need logs from
  many sites to prove or disprove this.)


  Adding more caches (the East Coast caches, like Manhattan,
  possibly Syracuse) would help, but we currently have no software
  solution.


  Andy suggests a minimum hardware configuration for caches open to
  the world -- 10 Gb might not be enough for public caches.

  Andy also says throttling based on the number of connections is very
  wasteful because many clients leave connections open and idle
  -- 20000 clients are connected but only 200 are actually downloading.
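
  One way to read Andy's point is that a throttle should count active
  downloads rather than open connections.  The sketch below is
  illustrative only and is not XRootD or XCache code; the class
  `TransferThrottle` and all of its method names are hypothetical.

```python
import threading


class TransferThrottle:
    """Hypothetical helper: limit concurrent downloads, not open connections."""

    def __init__(self, max_active_transfers):
        # One slot per transfer that may run at the same time.
        self._slots = threading.BoundedSemaphore(max_active_transfers)
        self._lock = threading.Lock()
        self.open_connections = 0  # idle or busy; never limited here
        self.active_transfers = 0  # only these consume a slot

    def connect(self):
        """Record a new connection; idle connections cost nothing."""
        with self._lock:
            self.open_connections += 1

    def try_start_transfer(self):
        """Return True if a transfer slot was free, False if overloaded."""
        if not self._slots.acquire(blocking=False):
            return False
        with self._lock:
            self.active_transfers += 1
        return True

    def finish_transfer(self):
        """Release the slot taken by a completed transfer."""
        with self._lock:
            self.active_transfers -= 1
        self._slots.release()
```

  With this shape, thousands of idle connections never trigger the
  limit; only a request that actually starts moving data can be
  refused.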


  A cache can return a special error code for being overloaded.  The
  client can wait and try again.  In HTTP, the wait is performed
  inside the server, i.e. the connection is still open -- from the
  client's point of view, it's just a slow connection.
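
  The wait-and-retry behaviour on the client side could look roughly
  like the sketch below.  It is a generic illustration, not the XRootD
  client: the exception `CacheOverloaded`, its `wait_seconds` field,
  and the function `fetch_with_retry` are all hypothetical names.

```python
import time


class CacheOverloaded(Exception):
    """Hypothetical error raised when the cache reports it is overloaded."""

    def __init__(self, wait_seconds):
        super().__init__(f"cache overloaded, retry in {wait_seconds}s")
        self.wait_seconds = wait_seconds


def fetch_with_retry(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch(); on an overload error, wait as told and try again.

    `fetch` is any zero-argument callable (an assumption of this sketch).
    The last overload error is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except CacheOverloaded as err:
            if attempt == max_attempts:
                raise
            # The connection is idle while waiting, so the cache is
            # not holding a transfer open for this client.
            sleep(err.wait_seconds)
```

  The `sleep` parameter exists only so the wait can be replaced in
  tests; a real client would keep the default.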



Hotfixed RPM:

  Edgar built a new OSG RPM with the hotfix, tested it successfully
  using the OSG automated tests, and deployed it in one of the
  production caches.  Some European sites are still seeing HTTP issues.

  OSG can promote the RPM and related plugins to "upcoming testing"
  repos, ask people to test, and report the results to Andy.



Development:

  gstream monitoring:

    Ilija noticed that the gstream does not contain information about
    how many bytes were written to disk, and would like that added.

    Also, gstream (UDP) monitoring is not self-contained: packets
    carry no source information.  With Kubernetes networking, the IP
    address can no longer be used as an identifier.  Feature request:
    add the site name and the hostname of the source to each
    gstream packet.

    Ilija would prefer that each gstream packet contain a single JSON
    array of accesses, so that he does not have to unpack multiple JSON
    objects and recombine them into an array himself.  Matevz responded
    that this is not feasible because a packet might contain multiple
    kinds of records interleaved together.
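
    On the collector side, the unpacking Ilija describes might be
    sketched as below.  This assumes the packet payload has already
    been reduced to a string of back-to-back JSON objects, possibly
    separated by whitespace; the actual gstream framing is not described
    in these notes.  The field name `kind` and the sample records are
    made up for illustration.

```python
import json


def unpack_records(payload):
    """Decode back-to-back JSON objects in `payload` into a list."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(payload):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(payload) and payload[pos].isspace():
            pos += 1
        if pos == len(payload):
            break
        record, pos = decoder.raw_decode(payload, pos)
        records.append(record)
    return records


def group_by_kind(records, kind_field="kind"):
    """Split interleaved records into one list per record kind.

    `kind_field` is a hypothetical key; records without it go to "unknown".
    """
    groups = {}
    for record in records:
        groups.setdefault(record.get(kind_field, "unknown"), []).append(record)
    return groups
```

    Grouping after decoding keeps interleaved record kinds apart, which
    is the case Matevz raised against a single array per packet.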



AOB:

- Ilija has a new reporter that understands v3 cinfo files at
  <https://github.com/slateci/XCache/blob/v5.0.0/cacheReporter/reporter.py>

- Riccardo discovered new issues in XRootD 5 and opened GitHub issue
  [#1254](https://github.com/xrootd/xrootd/issues/1254)

- Riccardo discovered authentication issues when using the XRootD
  protocol in Kubernetes.  Networking might be to blame; Andy
  suggests using host networking if possible, both to avoid issues
  from Kubernetes NATting and for a sizable performance boost.


- News
  
  Conveners: Dr Derek Weitzel (University of Nebraska - Lincoln), Mr Marian Zvada (US CMS)
- XRootD development for XCache
  
  General status of development, XRootD releases with new cache features, bug fixes, etc...
  
  Conveners: Mr Andrew Hanushevsky (SLAC National Accelerator Laboratory), Matevz Tadel (UCSD), Wei Yang (SLAC)
- ATLAS XCache deployment
  
  Conveners: Dr Ilija Vukotic (LAL), Wei Yang (SLAC)
- CMS XCache deployment
  
  Conveners: Diego Ciangottini (Universita e INFN, Perugia (IT)), Prof. Frank Wuerthwein (UCSD)
- OSG StashCache deployment
  
  Status of OSG deployment, origin/caches operations, stashcp, usage, etc...
  
  Conveners: Brian Bockelman, Brian Lin (University of Wisconsin-Madison), Dr Derek Weitzel (University of Nebraska - Lincoln), Mr Edgar Fajardo Hernandez (UCSD), John Hicks (Internet2), Mr John Thiltges (University of Nebraska - Lincoln), Mr Marian Zvada (US CMS), Mr Matyas Selmeci (University of Wisconsin-Madison)
- AOB