v5.2 Release Coordination

US/Central
    • 08:05 08:25
      Working Group Deliverables 20m
      Speakers: Asher Kaboth (Royal Holloway University of London/Rutherford Appleton Laboratory), Joshua Klein (University of Pennsylvania), Kurt Biery (Fermilab), Marco Roda (University of Liverpool), Pierre Lasorak (Imperial College London), Stoyan Trilov
    • 08:25 08:45
      Report on ongoing Work 20m
      Speakers: Artur Sztuc (University College London), Artur Sztuc (Imperial College London), Eric Flumerfelt (Fermilab), Kurt Biery (Fermilab), Michal Rigan (University of Sussex), Pierre Lasorak (Imperial College London)

      Topics for Tuesday:

      • Tag Collection
      • NP-02 Stability Testing (Alejandro, Shyam)
      • Trigger Rate-related Changes (Artur)
        • FakeHSI updates (don't listen to start rate change)
        • Trigger schema updates (TC window/type)
      • PDS Readout
        • daphnemodules Status (Marco)
        • Felix v5 Updates (Giovanna)
        • PDS Emulation (readout_type_scan)  (Eric)
      • PR Reviews
        • RandomTCMaker Updates
        • drunc Improvements
        • Release Coordinator documentation from Eric (Alessandro, Roland, Asher, Kurt)
    • 08:45 09:05
      Questions from Developers and Testers 20m
      Speakers: Artur Sztuc (University College London), Artur Sztuc (Imperial College London), Eric Flumerfelt (Fermilab), Kurt Biery (Fermilab)

      Note: Please let Eric know if you have questions/comments to share here!

      Thursday, 17 October

      • Eric will evaluate packages needing tags and update Tag Collector as needed
        • Package tags are appreciated, code complete is Wednesday, 23 October
      • daphnemodules will not be ready on the v5.2 timescale. It will likely go into a patch release in ~2 weeks
        • We currently cannot emulate PDS data (test cases are commented out in readout_type_scan). Eric to investigate
      • CCM coordinating with trigger for command changes
        • When to use FakeHSI? When HSI data path should be exercised. For now, static configuration of FakeHSI should be sufficient
        • Want to maintain and exercise ability to generate triggers from multiple sources (RTCM, FakeHSI, etc)
      • PRs in drunc, trying to fix remaining issues. Nightly minimal_system_quick_test failed with apparent run control issues

       

      Tuesday, 15 October

      • Need report from Timing experts
        • In terms of timing-only deliverables, the WR-DTS (the one at 0%) is getting postponed, so no need to worry about that. I’ve developed some refinements to the OKS schema and those are ready to merge in. There is a configuration-related feature which I want to include; that should also be fine for next week, but it’s not strictly necessary for the NP02 running. In fact it’s not clear to me if we want to run the timing from v5.2.0. ProtoDUNE HD and VD are sharing a timing master at the moment, and that has been using v4.x.x fine so far. Given that the timing and DAQ sessions are completely independent, we have flexibility.
          Closer to the DAQ is the DTS HSI. As I mentioned last time, it’s implemented in OKS but the raw HSI data is not written to disk. That still needs to be debugged, and then more integration is required to create a segment with both a readout and a controller application. I think that should be done in a week, but it could slip if something else comes up. What also might take longer is figuring out how to go about improving the integration, e.g. creating a common SmartDaqApplication for HSIEvent senders. Those improvements could be made in a patch though. For NP02 operations I don’t think there are any plans to use a DTS HSI to capture any signals, so it could probably be postponed completely if needed.
          Long story short, I don’t think there’s anything which should delay the release.
      • daq-release Release Coordinator document in PR
      • integrationtest Connectivity Service changes (integrationtest manages the CS instead of drunc)
      • CCM looking into tasks to defer from NP02 support list
      • MLT has PRs in trigger and appmodel for review (Eric)

      Thursday, 10 October

      Notes from Kurt:

      • it would be good to follow up on Wes' question about the change-rate command (is it available in v5/drunc?)
        • related: reminder to myself about updating automated regression tests after the RTCM/FakeHSI/--trigger-rate changes
        • FSM will have to be updated to support change-rate (Gordon to look at it)
      • I'd like to request a review of a simple PR in daqdataformats (additional method that returns the total size of the Fragment payload)
        • Eric has reviewed
      • I get an error when I try to run a system out of a base release.  Do we want to look into this?
        • e.g.  source /cvmfs/dunedaq.opensciencegrid.org/setup_dunedaq.sh; setup_dbt latest_v5; dbt-setup-release -n NFD_DEV_241009_A9; drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config boot
        • Likely related to rte script, experts will investigate
      • I'm seeing fairly common errors in sending commands to daq_apps (mainly when running 3ru_3df automated regression tests).  Any suggestions?
        • [22:18:13] INFO     "dfo-01-commander": Queue empty! dfo-01 to conf                                                              rest_api_child.py:306
                     ERROR    "dfo-01-commander": Timeout while waiting for a reply from dfo-01 for command conf                           rest_api_child.py:310
                     ERROR    "dfo-01-rest-api-child": Got error from 'conf' to 'dfo-01': Timeout while waiting for a reply from dfo-01    rest_api_child.py:558
                              for command conf                                                                                          
        • [22:11:49] INFO     "Broadcast": Changing operational_state from initial to preparing-conf                                      broadcast_sender.py:65
                     INFO     "Broadcast": Changing operational_state from preparing-conf to conf-ready                                   broadcast_sender.py:65
                     INFO     "Broadcast": Changing operational_state from conf-ready to propagating-conf                                 broadcast_sender.py:65
                     INFO     "Broadcast": Propagating execute_fsm_command to children                                                    broadcast_sender.py:65
                     INFO     "Broadcast": Propagating execute_fsm_command to children (mlt)                                              broadcast_sender.py:65
                     INFO     "mlt-rest-api-child": Sending 'conf' to 'mlt'                                                                rest_api_child.py:514
                     ERROR    "mlt-commander": Connection error to http://localhost:5702/command                                           rest_api_child.py:278
                     ERROR    "mlt-rest-api-child": Got error from 'conf' to 'mlt': Connection error to http://localhost:5702/command      rest_api_child.py:558

        • Issues should be opened and instructions given so experts can investigate

      • reminders about nagging occasional problems
        • DF app occasionally fails to conf? start?  say 5% of the time.  Problem seems to be a port still in use.
        • drunc-controller processes persist for ~20 seconds after the clean end of a DAQ session
        • connection warnings in logs 
      • other reminders:  2+ sessions on same host; skipped and failing tests in daqsystemtest regression tests;
        • Multiple session running should be possible with changes to be merged today
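      For the "port still in use" symptom noted above, a quick pre-boot check can help when collecting information for an issue report. The following is a minimal sketch, not part of drunc or any DUNE-DAQ package; the helper name port_is_free and the choice of the loopback address are assumptions made for illustration.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper for debugging; it only reports the state at the
    moment of the call, so another process may still grab the port later.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else still holds the port.
            return False
        return True
```

      Calling port_is_free(5702) before booting a session, for example, would show whether the port seen in the mlt connection error above is still held by a leftover process. The check is only a snapshot, so it is suited to diagnosis rather than to port allocation.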

       

      Tuesday, 8 October:

      Note/question from Kurt:

      • In v4 systems, if the trigger_rate is not specified as a parameter to the start_run (or just start) command, then the rate that was specified in the configuration is used.  In v5 systems, the start command trigger-rate parameter has a default value of 1.0 Hz and it overrides anything provided in the configuration.  Could/should we restore the v4 behavior?
        • The v5 behavior means that we always need to specify the start --trigger-rate parameter if we want a rate different from 1.0 Hz.
        • Recall that this "trigger rate" is mainly (and maybe solely) used by HSI software modules (e.g. FakeHSI)
        • IMO, if we keep the v5 behavior, we should probably remove the configuration parameter to avoid confusion.
        • I noticed this behavior when working on automated regression tests that use the FakeHSI.
        • In the meantime, the kbiery/lwr_candidate_changes branch in daqsystemtest has changes to get the long_window_readout_test to work.
        • This discussion led to a few work items:
          • FakeHSI should not be needed in regression tests. RandomTCMaker should be able to have a configurable window, and MLT should be able to alter TC windows
          • The default should be 0Hz, as FakeHSI already knows to ignore that value
          • FakeHSI should not listen to trigger rate changes at start, but RandomTCMaker should
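      The difference between the v4 and v5 behavior described above can be summarized in a short sketch. This is an illustration of the two precedence rules only, not code from drunc, nanorc, or hsilibs; the function names resolve_rate_v4 and resolve_rate_v5 are hypothetical.

```python
from typing import Optional


def resolve_rate_v4(configured_hz: float, start_arg_hz: Optional[float] = None) -> float:
    """v4-style rule: the start argument wins only when it is actually given."""
    return configured_hz if start_arg_hz is None else start_arg_hz


def resolve_rate_v5(configured_hz: float, start_arg_hz: float = 1.0) -> float:
    """v5-style rule as described in the note: the start parameter defaults
    to 1.0 Hz and always overrides the configured value."""
    return start_arg_hz


# A configuration asking for 5 Hz, started without an explicit rate:
print(resolve_rate_v4(5.0))  # configured value is used
print(resolve_rate_v5(5.0))  # default of the start parameter is used
```

      Under the v5-style rule the configured value is never consulted, which is why the note suggests either restoring the v4 rule or removing the configuration parameter.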

       

      Thursday, 3 October:

      Questions/notes from Kurt:

      • The high CPU usage of the MLT and gunicorn processes in the minimal_system_quick_test has been debugged and discussed.  The action items that I recall are
        • reduce the rate of retries in IOManager when there is a dangling connection
        • modify our example configuration(s) to have the RTCM send TCs to the TC DataSubscriberModule instead of directly to the TC TriggerDataHandlerModule
        • remember to check for subscriber endpoints that match up with no publisher(s) in our configuration validation tools (when those become available)
        • anything else?
        • the v5.2 software area instructions have been updated to use the 03-Oct nightly build with many/all of the changes that have been discussed
        • the mlt-dec thread is still using 100% of a CPU, though...
          • ELF: Noticed lots of mutex lock/unlock when running perf
        • updated plots of the minimal_system_quick_test and daqsystemtest example-configs configurations are attached to this agenda item
      • It will be great to hear the answers to Michal's questions on Slack about drunc commands:
        1. is it possible to run a chain of commands with drunc, as was possible with nanorc, i.e. boot conf start 101 enable-trigger in one command/submit
          • This is supported by drunc already
        2. will we get back the options start_run and stop_run  that will execute a chain under the hood but on their own
          • Pierre will look into FSM sequences
      • Some additional notes from Michal:
        • the tpgtools unit tests fail
          • Alex will investigate
        • an issue was observed when running the minimal_system_quick_test (could not create cache).  This happens when running from a release (e.g. pytest -s $DUNE_DAQ_RELEASE_SOURCE/daqsystemtest/integtest/minimal_system_quick_test.py).  Can this be made to work in v5?
          • Eric/Kurt to investigate
      • I'm making progress on the HDF5-related v4-to-v5 ports and will plan on filing PRs once the current in-flight PRs get merged.
        • Planned changes should not interfere with this work, go ahead
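      One of the action items above is to check, in the configuration validation tools, for subscriber endpoints that match up with no publisher. A minimal sketch of such a check is shown below. It assumes a simplified data model (connection names as plain strings) and does not use the real OKS/conffwk APIs; the function name and the example endpoint names are hypothetical.

```python
from typing import Dict, Iterable, List


def find_unmatched_subscribers(
    publishers: Iterable[str], subscribers: Dict[str, str]
) -> List[str]:
    """Return the names of subscribers whose connection has no publisher.

    publishers:  connection names that some module publishes to.
    subscribers: mapping of subscriber name -> connection name it listens on.
    """
    published = set(publishers)
    return sorted(
        name for name, connection in subscribers.items()
        if connection not in published
    )


# Hypothetical example: one subscriber listens on a connection nobody publishes.
dangling = find_unmatched_subscribers(
    publishers=["tc_stream"],
    subscribers={"tc_subscriber": "tc_stream", "ta_subscriber": "ta_stream"},
)
print(dangling)
```

      In this example only ta_subscriber is reported, which is the kind of dangling connection that led to the repeated retries discussed in these notes.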

       

      Other Notes:

      • Release Coordinator documentation to try to ease process/onboard more release coordinators
      • v4 diffs should be reviewed by developers to identify features for porting
      • A daq-release-preparation Slack channel should be created for announcing PR merges and other current-release development discussion

       

      Tuesday,  1 October:

      Questions from Kurt:

      • Will the configuration "inspector" utility from Alessandro be included in v5.2?  (a link to the code is here)
        • If yes, when will it get merged to the develop branch?
        • This should be merged soon. Alessandro is working on a few more updates
      • I still see occasional instances of "controller" processes taking a while to exit after I exit drunc when using the 30-Sept nightly build.
        • has that been fixed, and I'm doing something wrong?
        • if not, will that behavior get fixed in v5.2, or at a later time?
        • if v5.2, when?
        • Probably not v5.2, but an issue should be made/updated to track this
      • Can we ask Michal to proceed with a fix for TR #0 (Issue here)?
        • Target Thursday for fixes from Trigger group (MLT)
      • Is the code ready for a test of multiple simultaneous DAQ sessions running on the same computer?
        • (I'd appreciate a reminder of how to avoid clashes.)
        • "application connectivity service" changes follow environment variable changes scheduled for Thursday. Multiple session running may be available Monday/Tuesday
      • How important is it to port the tardy TP changes in the TPStreamWriter from v4 to v5 (configuration-controlled warning messages; metrics to indicate discarded tardy TPs)?
        • v4-to-v5 porting work is lower priority except where needed for NP02 running. This is associated with v5.2 but should not interfere with higher-priority work
      • [I'm corresponding with Pierre about some of the topics in this bullet, so some of these questions may be obsolete by the time of the meeting...]
        • What are the differences between the available FSM configurations?  It would be great for the Wiki page to have a high-level description of what they can/should be used for.  I can act as scribe if someone tells me what to write.
        • Pierre and I are working on the changes in drunc and rcif to pass the PROD vs. TEST argument to the FSM start command down to the HDF5 file (and from there, into the file-transfer metadata).
        • FSM documentation should be updated. drunc should be documented in docs/*.md files to be captured on readthedocs
        • Additional documentation on "runtime parameters" in drunc should be produced (e.g. how will the disabled_output_test in dfmodules work?)
      • Are the errors and warnings about "Broadcast"s in the controller log files a relatively permanent feature?  If not, will they get addressed for v5.2, or for a later release?
        • log_biery_local-2x3-config_hsi-controller.log:           WARNING  "Broadcast": There is no broadcasting service!                                                              broadcast_sender.py:32
          log_biery_local-2x3-config_hsi-controller.log:           ERROR    "Broadcast": Propagating take_control to children (hsi-01) failed: NOT_EXECUTED_NOT_IMPLEMENTED. See its    broadcast_sender.py:65
        • First error should be fixed by using ERS in drunc (planned for v5.2). Second error should be reduced in severity as it is an informational message
      • Are the occasional errors in the daq_app log files about failures to connect to the ConnectivityService a relatively permanent feature?  If not, will they be addressed for v5.2, or for a later release?
        • Failed to lookup time_sync_104 at /getconnection/local-2x3-config connect: Connection refused
        • This is likely a startup issue, changes to infrastructure management in drunc may help. Definitive test: run against a pocket-based connectivity service and ensure that message does not appear
      • In one of the Slack channels, Alessandro mentioned reorganizing Wiki pages.  Which ones were those, and has that happened yet?
        • Alessandro completed reorganization during the meeting
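      The "Connection refused" lookups above are described as a likely startup issue, i.e. the client asks before the service is listening. A common way to tolerate that kind of race is to retry with an increasing delay. The sketch below is generic and hedged: it is not how daq_app or the ConnectivityService client is implemented, and retry_with_backoff and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    lookup: Callable[[], T],
    attempts: int = 5,
    initial_delay_s: float = 0.1,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call lookup() until it succeeds or the attempts are used up.

    Only ConnectionError (which includes ConnectionRefusedError) is retried;
    the delay is multiplied by `factor` after each failed attempt, and the
    last failure is re-raised to the caller.
    """
    delay = initial_delay_s
    for attempt in range(attempts):
        try:
            return lookup()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(delay)
            delay *= factor
    raise ValueError("attempts must be at least 1")
```

      The sleep function is passed in so that the behavior can be exercised in a test without real waiting. Whether retries belong in the client or the startup ordering should be fixed in drunc instead is exactly the question raised in the note.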

      Comments from Kurt:

      • Now that Giovanna's fix for request_timeout values in datahandlinglibs has been merged...
        • I will have a follow-up PR in daqsystemtest to reduce the request_timeouts for HSI and TC fragments in the sample config(s).  This is to avoid Inhibits.
        • This work has been scheduled to be completed by the Thursday meeting
        • daqsystemtest PR 120 (link)
      • The start transition in the minimal_system_quick_test seems to be taking a long time with recent nightly builds.  I haven't looked into why that might be happening yet.
        • Recent nightly runs of the integration tests show minimal_system_quick_test taking 49s https://github.com/DUNE-DAQ/daq-release/actions/runs/11076327895/job/30779190785

      Other notes:

      • When John comes back, there will be a mass-rename from confmodel::Session to confmodel::System to distinguish configuration from drunc Session
      • Eric will work on indexing configuration tools in a document
        • Harry's config editor
        • John's create_config_plot (documentation here)
        • Alessandro's daqconf_inspector
        • Kurt's print_detailed_config_info script