DUNE DAQ Core Software Meeting

US/Central
Kurt Biery (Fermilab), Roland Sipos (CERN)
Description

Zoom:  https://fnal.zoom.us/j/91792149723 (usual password)

    • 09:00 → 09:10
      Placeholder 1 10m

      TBD

      Speakers: Kurt Biery (Fermilab), Roland Sipos (CERN)
    • 09:10 → 09:20
      SourceIDs for TPStreamWriter processes in v5.x 10m

      Plus, any thoughts on how to ensure valid configurations generally.

      Speaker: Kurt Biery (Fermilab)

      In following up on a question from Artur, I noticed that in a system with multiple TPStreamWriter apps, these apps all have the same SourceID.  Compare this with the DataWriter apps, which have distinct SourceIDs.  I presume that this is just a bug in the configurations, but I wanted to check.  And, if it is a configuration bug, how do we find and fix all of the configurations that have it?  And, how do we prevent it from happening in the future?  (A sketch of one possible automated check appears after the listings below.)
      Background information:
      1. Run 32644 wrote out multiple TPStream data files, as shown in the np04-srv-005:/data3/test directory:

      [biery@np04-srv-005 test]$ pwd
      /data3/test
      [biery@np04-srv-005 test]$ dir *32644* | head
      -rw-r--r-- 1 aoranday np-comp 4275140200 Nov  4 06:09 swtest_tp_run032644_0000_tp-stream-writer-apa1_tpw_4_20241104T050713.hdf5
      -rw-r--r-- 1 aoranday np-comp 4255322960 Nov  4 06:08 swtest_tp_run032644_0000_tp-stream-writer-apa2_tpw_4_20241104T050713.hdf5
      -rw-r--r-- 1 aoranday np-comp 4247215528 Nov  4 06:08 swtest_tp_run032644_0000_tp-stream-writer-apa3_tpw_4_20241104T050713.hdf5
      -rw-r--r-- 1 aoranday np-comp 4279945968 Nov  4 06:08 swtest_tp_run032644_0000_tp-stream-writer-apa4_tpw_4_20241104T050713.hdf5
      -rw-r--r-- 1 aoranday np-comp 4291641792 Nov  4 06:11 swtest_tp_run032644_0001_tp-stream-writer-apa1_tpw_4_20241104T050940.hdf5
      -rw-r--r-- 1 aoranday np-comp 4289043592 Nov  4 06:09 swtest_tp_run032644_0001_tp-stream-writer-apa2_tpw_4_20241104T050825.hdf5
      -rw-r--r-- 1 aoranday np-comp 4239198200 Nov  4 06:09 swtest_tp_run032644_0001_tp-stream-writer-apa3_tpw_4_20241104T050824.hdf5
      -rw-r--r-- 1 aoranday np-comp 4271226696 Nov  4 06:09 swtest_tp_run032644_0001_tp-stream-writer-apa4_tpw_4_20241104T050827.hdf5
      -rw-r--r-- 1 aoranday np-comp 4273074968 Nov  4 06:14 swtest_tp_run032644_0002_tp-stream-writer-apa1_tpw_4_20241104T051159.hdf5
      -rw-r--r-- 1 aoranday np-comp 4263665560 Nov  4 06:10 swtest_tp_run032644_0002_tp-stream-writer-apa2_tpw_4_20241104T050946.hdf5
      2. The configuration for that run used different SourceIDs for each of the DataWriter apps, but the same SourceID for all of the TPStreamWriter apps:

      [biery@mac-135043 run32644]$ grep -A 7 'df-0' tmpkswqliw0.data.xml | egrep 'DFApplication|SourceIDConf'
      <obj class="DFApplication" id="df-01">
       <rel name="source_id" class="SourceIDConf" id="srcid-df-01"/>
      <obj class="DFApplication" id="df-02">
       <rel name="source_id" class="SourceIDConf" id="srcid-df-02"/>
      <obj class="DFApplication" id="df-03">
       <rel name="source_id" class="SourceIDConf" id="srcid-df-03"/>
        <ref class="DFApplication" id="df-01"/>
        <ref class="DFApplication" id="df-02"/>
        <ref class="DFApplication" id="df-03"/>
      <obj class="SourceIDConf" id="srcid-df-01">
      <obj class="SourceIDConf" id="srcid-df-02">
      <obj class="SourceIDConf" id="srcid-df-03">
      <obj class="SourceIDConf" id="srcid-tp-stream-writer">
      [biery@mac-135043 run32644]$
      [biery@mac-135043 run32644]$
      [biery@mac-135043 run32644]$ grep -A 1 srcid-df-0 tmpkswqliw0.data.xml | grep -A 1 obj
      <obj class="SourceIDConf" id="srcid-df-01">
       <attr name="sid" type="u32" val="1"/>
      --
      <obj class="SourceIDConf" id="srcid-df-02">
       <attr name="sid" type="u32" val="2"/>
      --
      <obj class="SourceIDConf" id="srcid-df-03">
       <attr name="sid" type="u32" val="3"/>
      [biery@mac-135043 run32644]$ grep -A 7 'tp-stream-writer-apa' tmpkswqliw0.data.xml | grep -A 7 TPStreamWriterApplication | egrep 'SourceIDConf|TPStreamWriterApplication'
        <ref class="TPStreamWriterApplication" id="tp-stream-writer-apa1"/>
        <ref class="TPStreamWriterApplication" id="tp-stream-writer-apa2"/>
        <ref class="TPStreamWriterApplication" id="tp-stream-writer-apa3"/>
        <ref class="TPStreamWriterApplication" id="tp-stream-writer-apa4"/>
      <obj class="TPStreamWriterApplication" id="tp-stream-writer-apa1">
       <rel name="source_id" class="SourceIDConf" id="srcid-tp-stream-writer"/>
      <obj class="TPStreamWriterApplication" id="tp-stream-writer-apa2">
       <rel name="source_id" class="SourceIDConf" id="srcid-tp-stream-writer"/>
      <obj class="TPStreamWriterApplication" id="tp-stream-writer-apa3">
       <rel name="source_id" class="SourceIDConf" id="srcid-tp-stream-writer"/>
      <obj class="TPStreamWriterApplication" id="tp-stream-writer-apa4">
       <rel name="source_id" class="SourceIDConf" id="srcid-tp-stream-writer"/>
      [biery@mac-135043 run32644]$
      [biery@mac-135043 run32644]$
      [biery@mac-135043 run32644]$ grep -A 1 srcid-tp-stream-writer tmpkswqliw0.data.xml | grep -A 1 obj
      <obj class="SourceIDConf" id="srcid-tp-stream-writer">
       <attr name="sid" type="u32" val="4"/>
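
      One way to catch this class of problem automatically would be a small consistency check over the configuration.  Below is a rough sketch (hypothetical code, assuming the OKS XML layout shown above; the function name and command-line handling are illustrative):

      import sys
      import xml.etree.ElementTree as ET
      from collections import defaultdict

      def find_shared_source_ids(oks_xml_path):
          # Group writer applications by the SourceIDConf they reference,
          # and report any SourceIDConf shared by more than one application.
          tree = ET.parse(oks_xml_path)
          apps_by_srcid = defaultdict(list)
          for obj in tree.iter("obj"):
              if obj.get("class") in ("DFApplication", "TPStreamWriterApplication"):
                  for rel in obj.iter("rel"):
                      if rel.get("name") == "source_id":
                          apps_by_srcid[rel.get("id")].append(obj.get("id"))
          return {sid: apps for sid, apps in apps_by_srcid.items() if len(apps) > 1}

      if __name__ == "__main__":
          for sid, apps in find_shared_source_ids(sys.argv[1]).items():
              print(f"SourceIDConf {sid} is shared by: {', '.join(apps)}")

      Running a check along these lines over the existing configurations (and as part of configuration validation) could address both the "find and fix" and the "prevent" questions.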
    • 09:20 → 09:25
      TPStreamWriter data file naming in v5.x 5m

      This is currently not consistent with the way that we name raw data files. Should it be?

      Speaker: Kurt Biery (Fermilab)

      Raw data HDF5 files have filenames that include the dataflow application name, a constant string ("_dw_"), and the index of the DataWriter that created the raw data file.  (Recall that there can be multiple DataWriter modules per DF application.)

      • for example, df-02_dw_0

      The relevant place to look in the code is here.
      In contrast, TPStream data files have filenames that include the application name, a constant string ("_tpw_"), and the source ID of the tp-stream-writer application.

      • for example, tp-stream-writer-apa3_tpw_4

      The relevant place to look in the code is here.
      I would argue that these two naming schemes should be consistent.  Therefore, tp-stream-writer files should have substrings like tp-stream-writer-apa3_tpw_0 in their names (assuming just one TPStreamWriter module per tp-stream-writer application).
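
      To make the proposal concrete, here is a minimal sketch (hypothetical code; the function and argument names are illustrative, not the actual implementation):

      def writer_file_label(app_name: str, writer_index: int) -> str:
          # Proposed scheme: <application name> + "_tpw_" + <module index>,
          # mirroring the raw-data convention <application name> + "_dw_" + <index>.
          return f"{app_name}_tpw_{writer_index}"

      # raw data (existing):   df-02_dw_0
      # TPStream (proposed):   writer_file_label("tp-stream-writer-apa3", 0)
      #                        -> "tp-stream-writer-apa3_tpw_0"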

      Objections?

    • 09:25 → 09:40
      Thoughts on synchronizing TPStream data files 15m

      Currently, there is basically no enforced synchronization. It may happen in some cases, but that should not be expected. Do we want to take some steps to change this situation?

      Speaker: Kurt Biery (Fermilab)

      There have been some questions about how the data is organized in TPStream HDF5 data files...

      Some reminders:

      • The TPStream data files have TimeSlices inside of them (instead of TriggerRecords).
      • Each TimeSlice contains the TriggerPrimitives that fall into a specific 1-second interval.  The TP time that is used to decide which TimeSlice is appropriate is the TriggerPrimitive time_start (DTS 62.5 MHz clock), and the length of the interval is configurable.  (See the sketch after this list.)
      • Each TimeSlice has N Fragments in it, and each of the Fragments contains all of the TPs from a given Source ID.
      • Pause here to look at the first sample TimeSlice printout here
      • There is no synchronization between the startup of TPG in various Readout Apps, nor between the tp-stream-writer apps in a system that has more than one of them. 
        • Typically, the first TimeSlice from all tp-stream-writer files will contain TPs from the same wallclock 1-second interval, but this is not guaranteed
        • The assignment of TimeSlice number 1 is local to the tp-stream-writer, and it may not match what other tp-stream-writer instances are calling TimeSlice #1
      • Pause here to look at the first TimeSlices in a couple of files here
      • The TPs within each Fragment are ordered in time, according to the TPSet in which they were received.
        • Recall that the Readout system sends TPSets to Dataflow. 
        • Inside the TPStreamWriter, the TPSets for each Source ID are ordered in time and stored in the appropriate 1-second bucket. 
        • If a TPSet has data that crosses a 1-second boundary, a copy is made so that the appropriate TPs can be included in each TimeSlice. 
        • When Fragments are created, the data from each TPSet is copied into the Fragment, but no checking is done to ensure that the TPs in TPSet N don't overlap with the TPs in TPSet N+1. 
        • We simply trust that this is correctly handled in creation of the TPSets.
      • TPStream data files are subject to a max_file_size configuration parameter, similar to what is done for raw data files.  This is typically set to ~4 GB.  This page contains a listing of several sample TPStream files:
        • Since the number of TPs from a given APA might be different than other APAs, there can be different numbers of TimeSlices in each 4 GB TPStream file.
        • The samples page has examples of the first TimeSlices in the second TPStream files from different APAs.  Those show that the number of TimeSlices per file can vary significantly.
        • Of course, matching TimeSlices can be found (modulo any mismatch at the start of the different streams), but they may be in different files.  Example on this page.
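
      As a rough sketch of the binning logic described above (hypothetical code; the constant and parameter names are illustrative, and the real logic lives in the TPStreamWriter):

      TICKS_PER_SECOND = 62_500_000   # DTS clock: 62.5 MHz

      def timeslice_index(tp_time_start, slice_length_s=1):
          # Each TP is assigned to the interval that contains its time_start.
          return tp_time_start // (TICKS_PER_SECOND * slice_length_s)

      def bucket_tps(tps, slice_length_s=1):
          # TPs from a TPSet that straddles an interval boundary end up in
          # different buckets (the TPStreamWriter copies the TPSet in that case).
          buckets = {}
          for tp in tps:
              buckets.setdefault(timeslice_index(tp.time_start, slice_length_s), []).append(tp)
          return buckets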
      Is there a desire/need to add some synchronization between different TPStream streams?

      • I can imagine synchronizing the wallclock time of the first TimeSlice in each stream, but beyond that...
    • 09:40 → 09:50
      Update on daqsystemtest regression tests, esp. data-file-checking changes 10m
      Speaker: Kurt Biery (Fermilab)

      Reminders of warnings and errors seen in daqsystemtest regression tests recently:

      1. Occasional complaints about TA fragment size of 360 (in the 3ru_3df_multirun_test or tpstream_writing_test), e.g. here
      2. Occasional complaints about 1 or 2 TP Stream fragments when 3 were expected (in the tpstream_writing_test), e.g. here
      3. A rare instance of a TC fragment that was larger than expected
      4. Several others that haven't been debugged yet
      Let's start with (2)...

      • This only happens in the first TPStream TimeSlice
      • It is a natural consequence of the asynchronous startup of TPG in several Readout Apps.  Most of the time, TPG will start in all Readout Apps within the same wallclock second, but occasionally, some will start up in a different 1-second interval than others.
      • It would be a little silly to loosen the data-quality check on the number of TPStream fragments from [3, 3] to [1, 3] for all TimeSlices.  That would significantly reduce the effectiveness of the fragment-count checking for TimeSlices 2..N.
      • Given this, enhanced fragment-count checking that supports a different allowed range in the number of fragments for different TimeSlices would be useful.

      Based on this, I have implemented record-ordinal-based fragment count checking.  And, I went ahead and implemented record-ordinal-based fragment size checking and TC-type-based fragment count and fragment size checking, which will be useful in other situations.  (A sketch of the idea follows the list below.)

      • An initial implementation of this is available on the integrationtest kbiery/data_file_check_changes branch (PR coming soon).  This branch depends on branches in the hdf5libs repo (kbiery/get_sids_by_fragtyoe_and_detid branch) and the trgdataformats repo (already merged to develop).
      • I used ordinal numbers (e.g. first, second, ..., penultimate, last) for the record-based checking since that seemed to be the right thing to do.  (E.g., there doesn't seem to be a guarantee that the first record in a file would have record number 1.)
      • As part of this, I modified the list of supported integrationtest data-quality-check config parameters, as described here.
      • And, I have made use of this new functionality in initial changes to the tpstream_writing_test (link).
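
      To make the record-ordinal idea concrete, here is a sketch of what such a check could look like (the names below are purely illustrative, not the actual integrationtest config keys; those are described in the link above):

      # Hypothetical spec: the first TimeSlice tolerates asynchronous TPG
      # startup, while all later TimeSlices keep the strict [3, 3] check.
      tpstream_fragment_count_ranges = {
          "first": (1, 3),
          "default": (3, 3),
      }

      def fragment_count_ok(ordinal, n_fragments):
          low, high = tpstream_fragment_count_ranges.get(
              ordinal, tpstream_fragment_count_ranges["default"])
          return low <= n_fragments <= high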

      For (1) [TA fragment size of 360]...

      • I found that this is a result of two underlying TAs being included in the fragment
      • Here is the output from HDF5LIBS_TestDumpRecord for one such TA fragment:

              Trigger_Activity fragment with SourceID Trigger_0x0000044c from subdetector DAQ has size = 360 -----
                      Readout window before = 0, after = 32
                      Number of TAs in this fragment=2, overall number of referenced TPs=2, size of TA data=88
                      First TA type = 1, TA algorithm = 2, number of TPs = 1
                      First TA start time=108365367111959330, end time=108365367111959362, and activity time=0
                      Second TA type = 1, TA algorithm = 2, number of TPs = 1
                      Second TA start time=108365367111959330, end time=108365367111959362, and activity time=0

      • Where does the size of 360 come from?  (The arithmetic is generalized in a sketch after this list.)
        • Fragment header size is 72
        • TA data size is 88
        • TP size is 56
        • 72 + 2 * (88+56) = 360
      • As we see above, the timestamps of the two TAs are within the requested readout window
      • I remembered that TA making is at the per-APA-plane level now.  Since the regression tests use the simple Prescale TAMaker, it's possible that occasionally multiple TAs are created from TPs with the same timestamp.
      • Based on this, I changed the allowed range of TA fragment size to have an upper bound of 360, as shown here.
        • Since there are 3 possible sources of TAs in the tpstream_writing_test, one can imagine that the size of the TA fragment could fluctuate up to 504.  I took the approach that if 360 is rare, 504 will essentially never happen.  We'll see.
        • I also took the approach that multiple TAs in a kRandom trigger readout window will be exceedingly rare.  Again, this could come back to haunt me.
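
      The size arithmetic above generalizes as follows (a small sketch using the byte counts quoted above; it assumes one referenced TP per TA, as in the dump):

      FRAGMENT_HEADER_SIZE = 72   # bytes
      TA_DATA_SIZE         = 88   # bytes per TA
      TP_SIZE              = 56   # bytes per referenced TP (one per TA here)

      def ta_fragment_size(n_tas):
          return FRAGMENT_HEADER_SIZE + n_tas * (TA_DATA_SIZE + TP_SIZE)

      assert ta_fragment_size(2) == 360   # the observed fluctuation
      assert ta_fragment_size(3) == 504   # worst case with 3 TA sources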

      For (3), I found that the upward fluctuation in the size of the TC fragment was because of multiple TCs being included (a kPrescale one and a kRandom one).  After poking around a bit, I noticed that the v5 version of the tpstream_writing_test was merging overlapping TCs (whereas the v4 version did not).  To keep the spirit of this test consistent with what we had before, I added back the configuration parameter that disabled TC merging (link).

      And, lastly, I'll say that I took the opportunity to tighten up some of the fragment-size and fragment-count checking, based on the new ability to do those checks based on the record ordinal number and the trigger type of the record.
    • 09:50 → 10:00
      Open PRs and questions 10m
      Speaker: Kurt Biery (Fermilab)

      Has there been any discussion of a tentative code-complete date for fddaq-v5.3.0?

      I would appreciate reviews of the following PRs:

      • hdf5libs #107 - Modified the calculation of the readout window before and after times in HDF5LIBS_TestDumpRecord so that it can handle negative numbers
      • hdf5libs #106 - Creation of a utility to help recover raw data files that were not cleanly closed
      • hdf5libs #109 and rawdatautils #85 - Convert the creation_timestamp and closing_timestamp HDF5 Attribute data types from string to integer

      What is happening when we see error messages like the following in daq_application log files (for example, here)?

      • dunedaq::ipm::ZmqSubscriber::connect_for_receives(const nlohmann::json&) at /tmp/root/spack-stage/spack-stage-ipm-NB_DEV_241209_A9-y5k3dyytnv6hs54q52azpvt45javb2u4/spack-src/plugins/ZmqSubscriber.cpp:80] An exception occured while calling connect on the ZMQ receive socket: Invalid argument (connection_string: tcp://daq.fnal.gov:*)
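
      One plausible explanation (an assumption to be confirmed, not a diagnosis): ZeroMQ only accepts a wildcard port ("*") for bind, not for connect, so a connection string like tcp://daq.fnal.gov:* would make connect fail with EINVAL ("Invalid argument").  A minimal pyzmq sketch of that hypothesis:

      import zmq

      ctx = zmq.Context()
      sock = ctx.socket(zmq.SUB)
      try:
          # Wildcard ports are only meaningful for bind(); connect() needs a
          # concrete endpoint, so this is expected to raise "Invalid argument".
          sock.connect("tcp://daq.fnal.gov:*")
      except zmq.ZMQError as exc:
          print(exc)
      finally:
          sock.close()
          ctx.term()

      If that is the cause, the interesting question is why the connection string was generated with a wildcard port in the first place.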