v5.2 Release Coordination

US/Central
    • 1
      Working Group Deliverables
      Speakers: Asher Kaboth (Royal Holloway University of London/Rutherford Appleton Laboratory), Joshua Klein (University of Pennsylvania), Kurt Biery (Fermilab), Marco Roda (University of Liverpool), Pierre Lasorak (Imperial College London), Stoyan Trilov
    • 2
      Report on Ongoing Work
      Speakers: Eric Flumerfelt (Fermilab), Kurt Biery (Fermilab), Michal Rigan (University of Sussex), Pierre Lasorak (Imperial College London)

      At the last meeting, we agreed to push on six topics for Tuesday:

      • Application connectivity server (Pierre, Eric)
        • Dynamic ports for applications/controllers
        • integrationtest to manage the Connectivity Service
      • FSM Command sequences in drunc (Pierre)
      • tpgtools unit tests (Alex)
      • Running integration tests from release (Eric, Kurt)
        • an issue was observed when running the minimal_system_quick_test (could not create cache).  This happens when running from a release (e.g. pytest -s $DUNE_DAQ_RELEASE_SOURCE/daqsystemtest/integtest/minimal_system_quick_test.py)
      • Metadata changes in hdf5libs (Kurt)
      • Tag collector update (Eric)
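
One possible explanation for the "could not create cache" message, offered here as an assumption rather than a confirmed diagnosis, is that pytest tries to write its cache directory next to the test files, which fails if the release area is read-only. Under that assumption, the sketch below builds a pytest command line that redirects the cache to a writable directory through pytest's `cache_dir` ini option. The helper name `build_pytest_command` and the choice of a temporary directory are hypothetical and are not part of integrationtest or daqsystemtest.

```python
import os
import tempfile


def build_pytest_command(test_path, cache_dir=None):
    """Return a pytest argv that keeps the cache out of the release area.

    Hypothetical helper for illustration only.
    """
    if cache_dir is None:
        # Assumption: any writable location is acceptable for the cache.
        cache_dir = os.path.join(tempfile.gettempdir(), "pytest_cache")
    # "-o cache_dir=..." overrides the ini option for this invocation only.
    return ["pytest", "-s", "-o", "cache_dir=" + cache_dir, test_path]


if __name__ == "__main__":
    # Fall back to the literal variable name if the environment is not set up.
    release_source = os.environ.get(
        "DUNE_DAQ_RELEASE_SOURCE", "$DUNE_DAQ_RELEASE_SOURCE"
    )
    test = release_source + "/daqsystemtest/integtest/minimal_system_quick_test.py"
    print(" ".join(build_pytest_command(test)))
```

If the cache is not needed at all, pytest's `-p no:cacheprovider` option disables the cache plugin entirely, which may be a simpler workaround.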

       

    • 3
      Questions from Developers and Testers
      Speakers: Eric Flumerfelt (Fermilab), Kurt Biery (Fermilab)

      Note: Please let Eric know if you have questions/comments to share here!

       

      Thursday, 3 October:

      Questions/notes from Kurt:

      • The high CPU usage of the MLT and gunicorn processes in the minimal_system_quick_test has been debugged and discussed.  The action items that I recall are:
        • reduce the rate of retries in IOManager when there is a dangling connection
        • modify our example configuration(s) to have the RTCM send TCs to the TC DataSubscriberModule instead of directly to the TC TriggerDataHandlerModule
        • remember to check for subscriber endpoints that match up with no publisher(s) in our configuration validation tools (when those become available)
        • anything else?
        • the v5.2 software area instructions have been updated to use the 03-Oct nightly build with many/all of the changes that have been discussed
        • the mlt-dec thread is still using 100% of a CPU, though...
          • ELF: Noticed lots of mutex lock/unlock when running perf
        • updated plots of the minimal_system_quick_test and daqsystemtest example-configs configurations are attached to this agenda item
      • It will be great to hear the answers to Michal's questions on Slack about drunc commands:
        1. Is it possible to run a chain of commands with drunc, as was possible with nanorc, i.e. boot conf start 101 enable-trigger in one command/submit?
          • This is supported by drunc already
        2. Will we get back the start_run and stop_run options, each of which executes a chain of commands under the hood?
          • Pierre will look into FSM sequences
      • Some additional notes from Michal:
        • the tpgtools unit tests fail
          • Alex will investigate
        • an issue was observed when running the minimal_system_quick_test (could not create cache).  This happens when running from a release (e.g. pytest -s $DUNE_DAQ_RELEASE_SOURCE/daqsystemtest/integtest/minimal_system_quick_test.py).  Can this be made to work in v5?
          • Eric/Kurt to investigate
      • I'm making progress on the HDF5-related v4-to-v5 ports and will plan on filing PRs once the current in-flight PRs get merged.
        • Planned changes should not interfere with this work; go ahead
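
The proposed validation check for subscriber endpoints that match no publisher can be sketched as follows. This is only an illustration: the data layout (lists of `(application, connection_name)` pairs), the exact-name matching, the function name `find_dangling_subscribers`, and the example application and connection names are all assumptions, and they do not reflect the actual configuration schema or the planned validation tools.

```python
def find_dangling_subscribers(publishers, subscribers):
    """Return the (application, connection) pairs that subscribe to a
    connection name no application publishes.

    Assumption: names are matched exactly; a real check may need to
    handle patterns or wildcards.
    """
    published = {name for _app, name in publishers}
    return sorted(
        (app, name) for app, name in subscribers if name not in published
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    publishers = [("hsi-app", "hsi_events")]
    subscribers = [("trigger-app", "hsi_events"), ("mlt-app", "tc_input")]
    for app, name in find_dangling_subscribers(publishers, subscribers):
        print(f"{app} subscribes to '{name}' but nothing publishes it")
```

A check of this kind would flag the dangling connection at configuration time, before it shows up as retries and CPU load at run time.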

       

      Other Notes:

      • Release Coordinator documentation is needed to ease the process and onboard more release coordinators
      • v4 diffs should be reviewed by developers to identify features for porting
      • A daq-release-preparation Slack channel should be created for announcing PR merges and other current-release development discussion

       

      Tuesday, 1 October:

      Questions from Kurt:

      • Will the configuration "inspector" utility from Alessandro be included in v5.2?  (a link to the code is here)
        • If yes, when will it get merged to the develop branch?
        • This should be merged soon. Alessandro is working on a few more updates
      • I still see occasional instances of "controller" processes taking a while to exit after I exit drunc when using the 30-Sept nightly build.
        • has that been fixed, and I'm doing something wrong?
        • if not, will that behavior get fixed in v5.2, or at a later time?
        • if v5.2, when?
        • Probably not v5.2, but an issue should be made/updated to track this
      • Can we ask Michal to proceed with a fix for TR #0 (Issue here)?
        • Target Thursday for fixes from Trigger group (MLT)
      • Is the code ready for a test of multiple simultaneous DAQ sessions running on the same computer?
        • (I'd appreciate a reminder of how to avoid clashes.)
        • The "application connectivity service" changes will follow the environment variable changes scheduled for Thursday. Running multiple sessions may be available Monday/Tuesday
      • How important is it to port the tardy TP changes in the TPStreamWriter from v4 to v5 (configuration-controlled warning messages; metrics to indicate discarded tardy TPs)?
        • v4-to-v5 porting work is lower priority except where needed for NP02 running. This is associated with v5.2 but should not interfere with higher-priority work
      • [I'm corresponding with Pierre about some of the topics in this bullet, so some of these questions may be obsolete by the time of the meeting...]
        • What are the differences between the available FSM configurations?  It would be great for the Wiki page to have a high-level description of what they can/should be used for.  I can act as scribe if someone tells me what to write.
        • Pierre and I are working on the changes in drunc and rcif to pass the PROD vs. TEST argument to the FSM start command down to the HDF5 file (and from there, into the file-transfer metadata).
        • FSM documentation should be updated. drunc should be documented in docs/*.md files to be captured on readthedocs
        • Additional documentation on "runtime parameters" in drunc should be produced (e.g. how will the disabled_output_test in dfmodules work?)
      • Are the errors and warnings about "Broadcast"s in the controller log files a relatively permanent feature?  If not, will they get addressed for v5.2, or for a later release?
        • log_biery_local-2x3-config_hsi-controller.log:           WARNING  "Broadcast": There is no broadcasting service!                                                              broadcast_sender.py:32
          log_biery_local-2x3-config_hsi-controller.log:           ERROR    "Broadcast": Propagating take_control to children (hsi-01) failed: NOT_EXECUTED_NOT_IMPLEMENTED. See its    broadcast_sender.py:65
        • The first message (a warning) should be fixed by using ERS in drunc (planned for v5.2). The second message should be reduced in severity, as it is informational
      • Are the occasional errors in the daq_app log files about failures to connect to the ConnectivityService a relatively permanent feature?  If not, will they be addressed for v5.2, or for a later release?
        • Failed to lookup time_sync_104 at /getconnection/local-2x3-config connect: Connection refused
        • This is likely a startup issue; changes to infrastructure management in drunc may help. Definitive test: run against a pocket-based connectivity service and confirm that the message does not appear
      • In one of the Slack channels, Alessandro mentioned reorganizing Wiki pages.  Which ones were those, and has that happened yet?
        • Alessandro completed reorganization during the meeting
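
For startup races such as the "Connection refused" lookup failure above, and for the IOManager retry-rate action item noted on Thursday, one common pattern is to retry with an exponentially growing delay rather than at a fixed high rate. The sketch below is generic Python, not drunc, IOManager, or ConnectivityService code; `lookup_with_backoff`, its parameters, and the stand-in `lookup` callable are all hypothetical.

```python
import time


def lookup_with_backoff(lookup, max_attempts=5, initial_delay=0.1,
                        factor=2.0, max_delay=5.0, sleep=time.sleep):
    """Call lookup() until it succeeds, backing off between attempts.

    Only ConnectionRefusedError is retried; the last failure is re-raised.
    The sleep function is injectable so the logic can be tested quickly.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return lookup()
        except ConnectionRefusedError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Grow the delay, but never beyond max_delay.
            delay = min(delay * factor, max_delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_lookup():
        # Stand-in for a service that is not yet up on the first two tries.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionRefusedError("service not ready")
        return "connection-info"

    print(lookup_with_backoff(flaky_lookup))
```

Capping the delay keeps the worst-case wait bounded, while the growing interval avoids the tight retry loop that can show up as high CPU usage.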

      Comments from Kurt:

      • Now that Giovanna's fix for request_timeout values in datahandlinglibs has been merged...
        • I will have a follow-up PR in daqsystemtest to reduce the request_timeouts for HSI and TC fragments in the sample config(s).  This is to avoid Inhibits.
        • This work has been scheduled to be completed by the Thursday meeting
        • daqsystemtest PR 120 (link)
      • The start transition in the minimal_system_quick_test seems to be taking a long time with recent nightly builds.  I haven't looked into why that might be happening yet.
        • Recent nightly runs of the integration tests show minimal_system_quick_test taking 49s https://github.com/DUNE-DAQ/daq-release/actions/runs/11076327895/job/30779190785

      Other notes:

      • When John comes back, there will be a mass-rename from confmodel::Session to confmodel::System to distinguish the configuration class from the drunc Session
      • Eric will work on indexing configuration tools in a document
        • Harry's config editor
        • John's create_config_plot (documentation here)
        • Alessandro's daqconf_inspector
        • Kurt's print_detailed_config_info script