Here's my one-sentence synopsis of the problem: "The amount of memory we need is too damn high!"

For protoDUNE we are now, or can be, just squeaking by. After some memory reduction in WCT signal processing and configuring the job to drop all but the data products strictly needed, the protoDUNE signal processing job fluctuates around the 2GB limit, meaning 2-core GRID jobs are needed to be safe.

Here are some of the reasons we've identified for this memory pressure:

- As of now, an art::Event gets populated with waveform data from *all* channels in the detector.
- protoDUNE has 2x the number of channels of MicroBooNE, so it was the first to be impacted, even with its shorter readout. DUNE SP 10kt has 50x the channels of MicroBooNE, so it will really face trouble.
- The early part of the processing chain (noise filtering, signal processing) requires the data as "dense" (non-zero-suppressed) waveforms, and the first step of that is a 2x inflation from short int to float.
- art's ROOT I/O buffers shadow the size of data products get/put from the art::Event. These buffers tend to be held for the long term, in particular during non-I/O times when a module is executing.
- The GRID of course has its nominal 2GB/core limit.
- As of now, our jobs use only 1 core.
- There are many strategies to use multiple cores, and most (but not all!) lead to more problems than they solve.

As you can imagine, fixing any of these things helps. I have some specific ideas below, but we'd really benefit from LArSoft and art experts' ideas. I'm looking forward to finding some good solutions that we can all work on together in some efficient, coordinated way.

It's also worth noting that after signal processing the waveform data is sparse, and the memory usage of the output *can* be much lower. However, some downstream reco will have combinatoric memory usage, and those algorithms will become memory limited on 1-core GRID allocations (even with protoDUNE), so many of these problems have a broader surface.

Some of the possible directions which have been considered:

- Modify art and/or art file-input/output modules to enact a per-APA event loop so that only a single APA's data is in memory at a time.
- Extend this to use art's MT support with per-APA module paths each running in parallel. Thus the job can allocate multiple GRID cores to gain the required RAM but not leave N-1 cores idle. (For protoDUNE either is good; for DUNE FD we probably need both.)
- Use module-local MT to use multiple cores, but then we must worry about how to "back fill" cores which would otherwise be idle when non-MT modules are run.
- Back fill these idle cores by having art support "pipelining" the data, i.e. have multiple APAs' data "in flight" at once.
- Move to a fully parallel architecture like the data-flow graph that Wire-Cell supports, or which art can support given module-level and config updates.

These are listed roughly in increasing level of work and also increasing deviation from the current status quo. I think we will have to walk this whole list as we get into the mid-2020s. But for right now, my bet is on the first two being very achievable. And I'm looking forward to hearing any other ideas!

-Brett.

PS: Oh, and there's one other big thing to worry about with DUNE. Even if we have per-APA processing, a DUNE FD single-APA "event" for a supernova neutrino burst dump will be as much as 1TB (100 s) raw, not counting any ROOT I/O overhead. This is uncomfortably large even for a single file, let alone a single RAM load.
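To put rough numbers behind both the nominal dense data and that SNB dump, here is a quick back-of-the-envelope. The channel counts, readout length and 2 MHz sampling rate are my assumed values, not official ones, so please correct them if they are off:

    // Back-of-the-envelope sizes for dense (non-zero-suppressed) waveform data.
    // All channel counts, readout lengths and the sampling rate are assumptions.
    #include <cstdio>

    int main()
    {
        const double ch_protodune  = 15360;  // assumed: 6 APAs x 2560 channels
        const double ch_per_apa    = 2560;   // assumed channels per APA
        const double ticks_nominal = 6000;   // assumed ~3 ms readout at 2 MHz
        const double rate_hz       = 2e6;    // assumed 2 MHz digitization
        const double GB = 1e9, TB = 1e12;

        // protoDUNE, nominal readout, dense waveforms as held for NF/SP.
        std::printf("protoDUNE dense short: %.2f GB\n",
                    ch_protodune * ticks_nominal * sizeof(short) / GB);
        std::printf("protoDUNE dense float: %.2f GB\n",
                    ch_protodune * ticks_nominal * sizeof(float) / GB);

        // DUNE FD, one APA, ~100 s supernova burst dump, raw shorts only.
        std::printf("FD single-APA 100 s SNB dump: %.2f TB\n",
                    ch_per_apa * rate_hz * 100.0 * sizeof(short) / TB);
        return 0;
    }

With those assumptions the nominal protoDUNE dense data is already a few hundred MB once inflated to float, before counting copies or ROOT I/O buffers, and the single-APA SNB dump indeed comes out around 1 TB.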
So, some kind of streamed or chunked processing scheme is definitely needed. The DUNE FD DAQ is probably part of the solution, by producing data in some optimal form in the first place.
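For what it's worth, here is a toy sketch of what such a chunked scheme looks like in the small. It is plain C++ with made-up names and numbers, not real art/LArSoft/WCT interfaces; the point is only that a single time slice per APA is ever dense in memory, and only sparse output survives each slice:

    // Toy sketch of streamed/chunked processing of a long (e.g. SNB) readout:
    // consume fixed-length time slices so only one dense slice is in RAM at a
    // time.  All names and numbers here are hypothetical stand-ins.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <vector>

    constexpr size_t kChanPerAPA    = 2560;   // assumed channels per APA
    constexpr size_t kTicksPerSlice = 20000;  // assumed 0.01 s slice at 2 MHz

    // Stand-in for "read the next dense slice from the file or DAQ stream".
    std::vector<int16_t> next_slice(size_t /*islice*/)
    {
        std::vector<int16_t> adcs(kChanPerAPA * kTicksPerSlice, 0);
        for (size_t i = 0; i < adcs.size(); i += 997) adcs[i] = 100;  // toy "hits"
        return adcs;
    }

    // Stand-in for NF+SP on one slice: the 2x short->float inflation happens
    // here, dense but transient, and only a zero-suppressed result is returned.
    std::vector<std::pair<size_t, float>> process_slice(const std::vector<int16_t>& adcs)
    {
        std::vector<float> dense(adcs.begin(), adcs.end());       // 2x inflation
        std::vector<std::pair<size_t, float>> sparse;
        for (size_t i = 0; i < dense.size(); ++i)
            if (dense[i] != 0) sparse.emplace_back(i, dense[i]);  // toy ROI finding
        return sparse;                                  // dense buffer freed here
    }

    int main()
    {
        const size_t nslices = 10;  // e.g. 0.1 s worth of a 100 s dump, one APA
        for (size_t islice = 0; islice < nslices; ++islice) {
            auto dense  = next_slice(islice);   // only this slice is dense in RAM
            auto sparse = process_slice(dense);
            std::printf("slice %zu: %zu dense -> %zu sparse samples\n",
                        islice, dense.size(), sparse.size());
        }   // dense and sparse are dropped each iteration; memory stays bounded
        return 0;
    }

The same pattern applies whether the slicing is done by the DAQ, by the file format, or by an input module: peak memory scales with the slice length rather than with the full readout.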