Time: 11AM PT/1PM CT/2PM ET/8PM CET
Where: ZOOM.US
Join from PC, Mac, Linux, iOS or Android: https://unl.zoom.us/j/651969661
Meeting is password-protected: ask in the #xcache Slack channel or email xcache@opensciencegrid.org
Or iPhone one-tap:
US: +16699006833,,651969661# or +14086380968,,651969661#
Or Telephone:
Dial (for higher quality, dial a number based on your current location):
US: +1 669 900 6833 or +1 408 638 0968 or +1 646 876 9923
Meeting ID: 651 969 661
International numbers available: https://unl.zoom.us/zoomconference?m=wxCNSgMZiA-cVKSNowGYlQ
Attendance: Derek, Andy, Diego, Edgar, Ilija, Matevz, Riccardo, Mat

Note: Derek will be out of civilization next week.

Next week: special topic on using Kubernetes operators for caches.

Special topic (throttling cache connections):

Edgar: Maintenance at Cardiff required their cache to be turned off. As a result, all requests from LIGO jobs in Europe hit the Amsterdam (PIC) cache instead, saturating that cache's outbound network connection. PIC's workaround was to restrict the number of pilots running at PIC, but that had a limited effect since jobs outside of PIC were still accessing the cache.

Q: Should CVMFS have handled this situation? It knows to switch caches when one cache is too slow.

A: The European CVMFS configuration only lists European caches as options; only two European caches were active at the time, so CVMFS may have been bouncing back and forth between them. (Logs from many sites would be needed to prove or disprove this.) Adding additional caches (the East Coast caches, such as Manhattan and possibly Syracuse) would help, but there is no current software solution.

Andy suggests defining a minimum hardware configuration for caches open to the world -- 10 Gb might not be enough for public caches. He also notes that throttling based on the number of connections is very wasteful, because many clients leave connections open and idle: 20,000 clients may be connected while only 200 are actually downloading. A cache can instead return a special error code when it is overloaded, and the client can wait and try again. In HTTP, the wait is performed inside the server, i.e. the connection stays open -- from the client's point of view it is just a slow connection. (A minimal client-side retry sketch is included at the end of these notes.)

Hotfixed RPM:

Edgar built a new OSG RPM with the hotfix, tested it successfully using the OSG automated tests, and deployed it on one of the production caches. Some European sites are still seeing HTTP issues. OSG can promote the RPM and related plugins to the "upcoming testing" repos, ask people to test, and have them report the results to Andy.

Development: gstream monitoring:

Ilija noticed that the gstream does not contain information about how many bytes were written to disk, and would like that added. In addition, gstream (UDP) monitoring is not self-contained and does not carry source information; with Kubernetes networking, the IP address can no longer be used as an identifier. Feature request: add the site name and the hostname of the source to each gstream packet.

Ilija would also like gstream packets to contain a single JSON array of accesses, rather than having to unpack multiple JSON objects and recombine them into an array himself (a parsing sketch is included at the end of these notes). Matevz responded that this is not feasible because a packet may contain multiple kinds of records interleaved together.

AOB:

- Ilija has a new reporter that understands v3 cinfo files: <https://github.com/slateci/XCache/blob/v5.0.0/cacheReporter/reporter.py>
- Riccardo discovered new issues in XRootD 5 and opened GitHub issue [#1254](https://github.com/xrootd/xrootd/issues/1254)
- Riccardo discovered authentication issues when using the XRootD protocol in Kubernetes. Networking might be to blame; Andy suggests using host networking if possible, both to avoid issues from Kubernetes NAT and for a sizable performance boost.
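Retry sketch (re: throttling discussion): a minimal, hypothetical illustration of the overload-and-retry pattern mentioned above, assuming the cache signals overload with HTTP 503 and an optional Retry-After header. The actual error code and header used by a given cache may differ.

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_attempts=5, default_wait=30):
    """Fetch a file from a cache, backing off when the cache reports overload.

    Assumption: the cache returns HTTP 503 (Service Unavailable) when it is
    overloaded, optionally with a Retry-After header giving a wait in seconds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == max_attempts:
                raise
            # Honor the server's suggested wait if present, otherwise use a default.
            wait = int(err.headers.get("Retry-After", default_wait))
            time.sleep(wait)

# Example (hypothetical URL):
# data = fetch_with_backoff("https://cache.example.org:8443/path/to/file")
```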
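Parsing sketch (re: gstream monitoring): a hypothetical illustration of the recombination step Ilija described, assuming the binary monitoring header has already been stripped and the remaining payload is a sequence of newline-separated JSON records; the real g-stream layout and record fields may differ.

```python
import json

def records_from_gstream_payload(payload: bytes) -> list:
    """Combine the individual JSON records of one g-stream payload into a list.

    Assumed layout: newline-separated JSON objects, possibly of different
    record types interleaved in one packet.
    """
    records = []
    for line in payload.decode("utf-8", errors="replace").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip anything that is not a well-formed JSON record.
            continue
    return records

# Example with made-up records:
# payload = b'{"event":"open","lfn":"/store/f1"}\n{"event":"close","lfn":"/store/f1"}\n'
# accesses = records_from_gstream_payload(payload)
```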