Here's my one-sentence synopsis of the problem: "The amount of memory we need is too damn high!"

For protoDUNE we are now, or can be, just squeaking by. After some memory reduction in WCT signal processing and configuring the job to drop all but the data products strictly needed, the protoDUNE signal processing job fluctuates around the 2GB limit, meaning 2-core GRID jobs are needed to be safe.

Here are some of the reasons we've identified for this memory pressure:

- As of now, an art::Event gets populated with waveform data from *all* channels in the detector.
- protoDUNE has 2x the number of channels of MicroBooNE, so it was the first to be impacted, even with its shorter readout. DUNE SP 10kt has 50x the channels of MicroBooNE, so it will really face trouble.
- The early part of the processing chain (noise filtering, signal processing) requires the data as "dense" (non-zero-suppressed) waveforms, and the first step of that is a 2x inflation from short int to float.
- art's ROOT I/O buffers shadow the size of data products get/put from the art::Event. These buffers tend to be held for the long term, in particular during non-I/O times when a module is executing.
- The GRID of course has its nominal 2GB/core limit.
- As of now, our jobs use only 1 core.
- There are many strategies to use multiple cores, and most (but not all!) lead to more problems than they solve.

As you can imagine, fixing any of these things helps. I have some specific ideas below, but we'd really benefit from LArSoft and art experts' ideas. I'm looking forward to finding some good solutions that we can all work on together in some efficient, coordinated way.

It's also worth noting that after signal processing the waveform data is sparse, and the memory usage of the output *can* be much lower. However, some downstream reco will have combinatoric memory usage, and those algorithms will become memory limited on 1-core GRID allocations (even with protoDUNE), so many of these problems have a broader surface.

Some of the possible directions which have been considered:

- Modify art and/or art file-input/output modules to enact a per-APA event loop so that only a single APA's data is in memory at a time.
- Extend this to use art's MT support with per-APA module paths each running in parallel. Thus the job can allocate multiple GRID cores to gain the required RAM but not leave N-1 cores idle. (For protoDUNE either is good; for DUNE FD we probably need both.)
- Use module-local MT to use multiple cores, but then we must worry about how to "back fill" cores which would otherwise be idle when non-MT modules are run.
- Back fill these idle cores by having art support "pipelining" the data, i.e. have multiple APAs' data "in flight" at once.
- Move to a fully parallel architecture like the data-flow graph that Wire-Cell supports, or which art can support given module-level and config updates.

These are listed roughly in increasing level of work and also increasing deviation from the current status quo. I think we will have to walk this whole list as we get into the mid-2020s. But for right now, my bet is on the first two being very achievable. And I'm looking forward to hearing any other ideas!

-Brett.

PS: Oh, and there's one other big thing to worry about with DUNE. Even if we have per-APA processing, a DUNE FD single-APA "event" for a supernova neutrino burst dump will be as much as 1TB (100 s) raw, not counting any ROOT I/O overhead. This is uncomfortably large even for a single file, let alone a single RAM load.
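To put rough numbers behind both the nominal dense data and that SNB dump, here is a quick back-of-the-envelope. The channel counts, readout length and 2 MHz sampling rate are my assumed values, not official ones, so please correct them if they are off:

    // Back-of-the-envelope sizes for dense (non-zero-suppressed) waveform data.
    // All channel counts, readout lengths and the sampling rate are assumptions.
    #include <cstdio>

    int main()
    {
        const double ch_protodune  = 15360;  // assumed: 6 APAs x 2560 channels
        const double ch_per_apa    = 2560;   // assumed channels per APA
        const double ticks_nominal = 6000;   // assumed ~3 ms readout at 2 MHz
        const double rate_hz       = 2e6;    // assumed 2 MHz digitization
        const double GB = 1e9, TB = 1e12;

        // protoDUNE, nominal readout, dense waveforms as held for NF/SP.
        std::printf("protoDUNE dense short: %.2f GB\n",
                    ch_protodune * ticks_nominal * sizeof(short) / GB);
        std::printf("protoDUNE dense float: %.2f GB\n",
                    ch_protodune * ticks_nominal * sizeof(float) / GB);

        // DUNE FD, one APA, ~100 s supernova burst dump, raw shorts only.
        std::printf("FD single-APA 100 s SNB dump: %.2f TB\n",
                    ch_per_apa * rate_hz * 100.0 * sizeof(short) / TB);
        return 0;
    }

With those assumptions the nominal protoDUNE dense data is already a few hundred MB once inflated to float, before counting copies or ROOT I/O buffers, and the single-APA SNB dump indeed comes out around 1 TB.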
So, some kind of streamed or chunked processing scheme is definitely needed. The DUNE FD DAQ is probably part of the solution, by producing data in some optimal form in the first place.
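For what it's worth, here is a toy sketch of what such a chunked scheme looks like in the small. It is plain C++ with made-up names and numbers, not real art/LArSoft/WCT interfaces; the point is only that a single time slice per APA is ever dense in memory, and only sparse output survives each slice:

    // Toy sketch of streamed/chunked processing of a long (e.g. SNB) readout:
    // consume fixed-length time slices so only one dense slice is in RAM at a
    // time.  All names and numbers here are hypothetical stand-ins.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <utility>
    #include <vector>

    constexpr size_t kChanPerAPA    = 2560;   // assumed channels per APA
    constexpr size_t kTicksPerSlice = 20000;  // assumed 0.01 s slice at 2 MHz

    // Stand-in for "read the next dense slice from the file or DAQ stream".
    std::vector<int16_t> next_slice(size_t /*islice*/)
    {
        std::vector<int16_t> adcs(kChanPerAPA * kTicksPerSlice, 0);
        for (size_t i = 0; i < adcs.size(); i += 997) adcs[i] = 100;  // toy "hits"
        return adcs;
    }

    // Stand-in for NF+SP on one slice: the 2x short->float inflation happens
    // here, dense but transient, and only a zero-suppressed result is returned.
    std::vector<std::pair<size_t, float>> process_slice(const std::vector<int16_t>& adcs)
    {
        std::vector<float> dense(adcs.begin(), adcs.end());       // 2x inflation
        std::vector<std::pair<size_t, float>> sparse;
        for (size_t i = 0; i < dense.size(); ++i)
            if (dense[i] != 0) sparse.emplace_back(i, dense[i]);  // toy ROI finding
        return sparse;                                  // dense buffer freed here
    }

    int main()
    {
        const size_t nslices = 10;  // e.g. 0.1 s worth of a 100 s dump, one APA
        for (size_t islice = 0; islice < nslices; ++islice) {
            auto dense  = next_slice(islice);   // only this slice is dense in RAM
            auto sparse = process_slice(dense);
            std::printf("slice %zu: %zu dense -> %zu sparse samples\n",
                        islice, dense.size(), sparse.size());
        }   // dense and sparse are dropped each iteration; memory stays bounded
        return 0;
    }

The same pattern applies whether the slicing is done by the DAQ, by the file format, or by an input module: peak memory scales with the slice length rather than with the full readout.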