

# From analog to automation to assistants: detectors, AI, and you

Nhan Tran, Fermilab EDIT school 2024 & Fermilab W&C 22 November 2024

# An interesting time, personally

- October: 2024 Fast ML for Science Conference
  - <u>https://indico.cern.ch/e/fastml2024</u>
- November: DOE Office of Science AI Roundtable on Experiments and Facilities
  - Including biology, environmental, material, fusion sciences, HEP and nuclear physics
- November: DOE-wide discussions on the role of supercomputing and AI
  - FASST initiative: Frontiers in AI for Science, Security and Technology

Huge credit to many folks and stimulating discussions!



# What is in this talk and what is not

- X AI for physics analysis, including theory, simulation, reconstruction, and interpretation: let's just assume physics + AI is awesome :)
- 🗹 AI as it applies to detectors and instruments\* to:
  - <u>Accelerate</u> new physics discoveries
    - Unearth new physics signatures much more quickly
    - Operate instruments and detectors much more efficiently
      - Enabling faster analysis
      - Requiring less data to reach the same results
      - Reducing operational resources and increasing long term reliability



# Outline

- Fast and Slow
- Fast ML and hardware codesign
- Fast and Slow ML together
- Slow ML & Real-Time







COMPARABLE TO GLOBAL INTERNET TRAFFIC BANDWIDTH












# Types of compute

#### Credit: Dorothea vom Bruch

|                                             | CPU                                                                    | GPU                                                    | FPGA                                                                                 |
|---------------------------------------------|------------------------------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------------------------------------|
| Latency                                     | O(10) μs                                                               | O(100) μs                                              | Deterministic, O(100) ns                                                             |
| I/O with processor                          | Ethernet, USB, PCIe                                                    | PCIe, NVLink                                           | Connectivity to any data source via printed circuit board (PCB)                      |
| Engineering cost                            | Low entry level (programmable with C++, Python, etc.)                  | Low entry level (programmable with CUDA, OpenCL, etc.) | Some high-level syntax available, traditionally VHDL, Verilog (specialized engineer) |
| Single-precision floating-point performance | O(10) TFLOPs                                                           | O(10) TFLOPs                                           | Optimized for fixed-point performance                                                |
| Serial / parallel                           | Optimized for serial performance, increasingly using vector processing | Optimized for parallel performance                     | Optimized for parallel performance                                                   |

Slide annotations: FLEXIBILITY (CPU side) to EFFICIENCY, with NPUs and ASICs placed beyond FPGAs.







- 1 channel ~ 10 b
- 1 channel, 1 MHz rate ~ 10 Mb/s
- 100k channels, 1 MHz rate ~ 1 Tb/s
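The scaling above is plain multiplication. A minimal sketch, assuming 10 bits per channel per sample as on the slide (the function name `data_rate_bps` is illustrative, not from the talk):

```python
def data_rate_bps(n_channels: int, bits_per_sample: int, rate_hz: float) -> float:
    """Raw data rate in bits per second for a detector read out at a fixed rate."""
    return n_channels * bits_per_sample * rate_hz


single = data_rate_bps(n_channels=1, bits_per_sample=10, rate_hz=1e6)
full = data_rate_bps(n_channels=100_000, bits_per_sample=10, rate_hz=1e6)

print(f"1 channel:     {single / 1e6:.0f} Mb/s")
print(f"100k channels: {full / 1e12:.0f} Tb/s")
```

Zero suppression, headers, and encoding overhead are ignored here, so this is only the order-of-magnitude estimate the slide makes.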




Too Much Data Too Late



Embed more intelligence

# Fast ML for science and the extreme edge

"Scientific discoveries come from groundbreaking ideas and the capability to validate those ideas by testing nature at new scales - finer and more precise temporal and spatial resolution. This is leading to an explosion of data that must be interpreted, and ML is proving a powerful approach. The more efficiently we can test our hypotheses, the faster we can achieve discovery. To fully unleash the power of ML and accelerate discoveries, it is necessary to embed it into our scientific process, into our instruments and detectors."

> Applications and Techniques for Fast Machine Learning in Science https://doi.org/10.3389/fdata.2022.787421

# Why AI?

Universal function approximation - fit with customizable objective: f(inputs; lots of parameters) = output

- <u>Expressive & powerful</u>: able to find patterns and correlations in high-dimensional data not explicitly accounted for; can unlock large gains in performance
- <u>Adaptive & flexible</u>: able to adapt to new data, conditions; handles all different types of data representations



### Fast ML for Science

The Fast ML for Science community aims to bring **seemingly different domains** together to develop **techniques**, **tools**, **and platforms** for challenges that **far outpace industry**.




MLCommons launches machine learning benchmark for devices like smartwatches and voice assistants by Ben Wodecki 6/16/2021



With experts from Qualcomm, Fermilab, and Google aiding in its development

MLCommons, the open engineering consortium behind the MLPerf benchmark test, has launched a new measurement suite aimed at 'tiny' devices like smartwatches and voice assistants.

MLPerf Tiny Inference is designed to compare performance of embedded devices and models with a footprint of 100kB or less by measuring















Image credit: A. Thea







- Fast control
  - Immediate response to dynamics of the experiment and data readout
  - Event timing, triggering, etc.
- Slow control
  - Detector stability over minutes, days, weeks, months,...
  - Monitoring and controlling operational parameters: electronics gains, pedestals, calibrations, etc.



Daniel Kahneman, *Thinking, Fast and Slow*

...on reducing biases in our "fast" decision-making selves



# Life cycle of instruments, detectors, facilities










# Outline

- Fast and Slow
- Fast ML and hardware codesign
- Fast and Slow ML together
- Slow ML & Real-Time



# Fast ML example applications













Pixel information in the trigger brings fundamentally new capabilities to LHC experiments - but rates are massive!








Figure: computation time [s] of Fast ML for Science benchmark tasks, shown alongside Internet-of-things and mobile devices.

# Fast ML science applications



Dynamic and noisy systems with multiple timescales for changing conditions...



# Moving data expensive, computing cheap

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

Adapted from Horowitz. The slide also shows each entry as a bar of relative energy cost on a logarithmic axis (10 to 10000).
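To make the "moving data is expensive" point concrete, here is a small sketch that turns a few rows of the table into ratios. The values are copied from the table; the helper `relative_cost` is an illustrative name.

```python
# Energy per operation in pJ, copied from the table above (adapted from Horowitz).
ENERGY_PJ = {
    "8b Add": 0.03,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5.0,
    "32b DRAM Read": 640.0,
}


def relative_cost(op: str, reference: str = "8b Add") -> float:
    """How many times more energy `op` costs than `reference`."""
    return ENERGY_PJ[op] / ENERGY_PJ[reference]


for op in ENERGY_PJ:
    print(f"{op:>20}: {relative_cost(op):8.0f}x an 8b add")
```

With these numbers a single 32b DRAM read costs more than a hundred 32b floating-point multiplies, which is the motivation for keeping data and compute close together.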







- Computation parallelization/vectorization and in-memory compute (architecture)
- Quantization, reduced precision
  - For ML, 32-bit floating point is often overkill
  - Integer/fixed-point math at 16,8,7,6,5...1 bits
- Compression, pruning
  - maintain the same performance while removing low weight synapses and neurons
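The quantization and pruning bullets can be sketched in a few lines. This is a toy illustration under stated assumptions: a signed fixed-point format described by total and fractional bits, and one-shot magnitude pruning without the retraining that is normally used to recover performance. The function names are hypothetical.

```python
def quantize_fixed(x: float, total_bits: int, frac_bits: int) -> float:
    """Round x to a signed fixed-point grid and saturate at the representable range."""
    step = 2.0 ** -frac_bits
    q_min = -(2 ** (total_bits - 1))
    q_max = 2 ** (total_bits - 1) - 1
    q = max(q_min, min(q_max, round(x / step)))
    return q * step


def prune_smallest(weights: list[float], fraction: float) -> list[float]:
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_drop = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.5, -0.01, 0.2, 0.03]
print([quantize_fixed(w, total_bits=8, frac_bits=6) for w in weights])
print(prune_smallest(weights, fraction=0.5))
```

Fewer bits shrink every multiplier and every stored weight, while pruning removes multiplications entirely; the two are typically combined.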





$$\boldsymbol{x}_m = \boldsymbol{g}_m \left( \boldsymbol{W}_{m,m-1} \boldsymbol{x}_{m-1} + \boldsymbol{b}_m \right)$$

Figure: a network with an input layer, M hidden layers of widths $N_1, \ldots, N_m, \ldots, N_M$, and an output layer. Each layer m is a matrix-vector multiply followed by a non-linear activation function. The reuse factor sets how many times each multiplier is used:

- reuse = 4: use 1 multiplier 4 times
- reuse = 2: use 2 multipliers 2 times each
- reuse = 1: use 4 multipliers 1 time each
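A minimal sketch of one layer, $x_m = g_m(W_{m,m-1} x_{m-1} + b_m)$, together with the multiplier count implied by a reuse factor. The helper names are illustrative, and the multiplier count is the idealized figure from the slide (4 multiplications shared across 1, 2, or 4 multipliers), not a synthesis result.

```python
import math


def dense_layer(x, W, b, g):
    """One layer x_m = g(W x_{m-1} + b), written out with explicit multiplies."""
    return [g(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def multipliers_needed(n_in: int, n_out: int, reuse: int) -> int:
    """Hardware multipliers required if each one is time-shared `reuse` times."""
    return math.ceil(n_in * n_out / reuse)


relu = lambda v: max(0.0, v)
y = dense_layer([1.0, 2.0], W=[[0.5, -1.0], [1.0, 1.0]], b=[0.0, -1.0], g=relu)
print(y)

for reuse in (4, 2, 1):
    print(f"reuse = {reuse}: {multipliers_needed(2, 2, reuse)} multiplier(s)")
```

A higher reuse factor trades latency for area: fewer multipliers are instantiated, but each one is busy for more clock cycles.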


















Keras, PyTorch → ML architecture; physics requirements

**Data representation** 

Latency? Pipeline Interval?

Resources? Area/power? Radiation? Cryo?

Synthesize, validate design, satisfy design rules/timing

Brevitas (AMD) HAWQ (UC Berkeley) QONNX (Microsoft/AMD) Neural architecture search/ Hyperparameter optimization

ASICs

**Quantize network** 

Intermediate (quantized) representations

Pruning/sparsity?

**Microarchitecture** 

**Parallelization** 

Multi-objective design space optimization




can you help me develop design a 4 layer MLP using QKeras that can be translated to AMD FPGAs using hls4ml?

#### 2. Design Your Quantized MLP Using QKeras

Here's how you can define a 4-layer MLP with quantized weights and activations:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from qkeras import QDense, QActivation
from qkeras.quantizers import quantized_bits, quantized_relu

# Define input and output sizes
input_size = 32   # Replace with your actual input feature size
output_size = 10  # Replace with your actual number of classes

# Define quantizers
weight_quantizer = quantized_bits(8, 0, alpha=1)  # 8-bit quantization for weights
activation_quantizer = quantized_relu(8, 0)       # 8-bit quantization for activations

# Build the model
model = Sequential([
    QDense(64, input_shape=(input_size,),
           kernel_quantizer=weight_quantizer,
           bias_quantizer=weight_quantizer,
           name='fc1'),
    QActivation(activation_quantizer, name='act1'),
    # (the remaining layers are cut off on the slide)
])
```

#### 3. Train Your Model

#### 4. Prepare for Conversion with hls4ml

#### 5. Convert the Model Using hls4ml

Specify the FPGA part number corresponding to your AMD FPGA. Since AMD acquired Xilinx, you can use Xilinx part numbers.

```python
import hls4ml

# `config` is the hls4ml configuration prepared in step 4 (not shown on the slide)
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls4ml_prj',
    part='xcvu9p-flgb2104-2-i'  # Replace with your specific FPGA part number
)
```

#### 6. Compile and Build the HLS Model

#### 7. Validate the HLS Model

Compare the outputs of the original Keras model and the HLS model to ensure consistency.

```python
import numpy as np

# Generate some test data
X_test = np.random.rand(100, input_size)
```

#### 8. Synthesize the Design Using Vivado HLS




how many FPGA resources will this model take?

#### **Estimating Resource Utilization**

While I can't provide exact numbers without running the synthesis, I can guide you on how to estimate and obtain the resource utilization for your model.

#### 1. Understanding the Resource Components

- Lookup Tables (LUTs): Implement combinational logic.
- Flip-Flops (FFs): Store state information.
- Block RAMs (BRAMs): Used for on-chip memory storage.
- Digital Signal Processing (DSP) Slices: Perform arithmetic operations like multiplication and addition.

#### 2. Factors Influencing Resource Usage

- Neuron Count: More neurons increase resource consumption.
- Bit-widths: Higher bit-widths for weights and activations consume more resources.
- Parallelism: Lower ReuseFactor values increase parallelism and resource usage.
- Pipelining: Increases FF usage but can improve performance.
- Optimizations: Strategies like resource sharing can reduce usage.

#### **Estimating Based on Model Parameters**

To get a rough estimate, consider the following:

#### 1. Neuron Connections

Each fully connected layer has input\_size \* output\_size weights.

- First Layer (fc1): 32 inputs \* 64 outputs = 2048 weights
- Second Layer (fc2): 64 inputs \* 64 outputs = 4096 weights
- Third Layer (fc3): 64 inputs \* 64 outputs = 4096 weights
- Output Layer: 64 inputs \* 10 outputs = 640 weights

#### 2. Bit-width Impact

With 8-bit quantization:

- Total Weights Memory: (2048 + 4096 + 4096 + 640) weights \* 8 bits = 87,040 bits ≈ 10.9 kB

#### 3. DSP Slice Usage

Each multiplication can use a part of a DSP slice:

- Total Multiplications per Inference: Sum of weights.

However, hls4ml can optimize to reuse DSP slices depending on the ReuseFactor.

#### 4. ReuseFactor Impact

- ReuseFactor = 1: Maximum parallelism, highest resource usage.
- Higher ReuseFactor: Reduces parallelism, lowers resource usage but increases latency.

#### Conclusion

- Exact Resource Usage: Can only be determined by synthesizing the design and reviewing the reports.
- Estimation: Use the factors above to estimate resource usage.
- Optimization: Adjust model complexity, quantization bit-widths, and hls4ml configurations to meet resource constraints.
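The weight counts in this estimate are easy to check by hand. A minimal sketch, assuming the 32-64-64-64-10 layer sizes listed above and ignoring biases (as the estimate does):

```python
layer_sizes = [32, 64, 64, 64, 10]  # input, fc1, fc2, fc3, output

# Each fully connected layer has n_in * n_out weights.
weights_per_layer = [n_in * n_out
                     for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_weights = sum(weights_per_layer)

bits_per_weight = 8
total_bits = total_weights * bits_per_weight

print(weights_per_layer)
print(f"{total_weights} weights, {total_bits} bits = {total_bits / 8 / 1000:.1f} kB")
```

At 8 bits per weight the 10,880 weights occupy 87,040 bits, about 10.9 kB. With a reuse factor of 1, the number of multiplications per inference equals the number of weights.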






# wa-hls4ml & luiGNN

- Hawks, Plotnikov et al https://indico.cern.ch/event/1387540/contributions/6153600/ https://indico.cern.ch/event/1387540/contributions/6153564/
- Build a surrogate model of hls4ml to predict resources without running costly synthesis steps
- First dataset of its kind
  - > 100k models on NRP
  - Simple to start: MLPs
- Build a graph NN which predicts network FPGA resources
  - Each layer of the network is a layer in the graph
  - In-distribution accuracy is ~few %








| Hyperparameter                         | Range     | Step Size |
|----------------------------------------|-----------|-----------|
| Input Features                         | 8-128     | 8         |
| First Layer Neurons                    | 8-128     | 8         |
| Second Layer Neurons                   | 8-128     | 8         |
| Weight and Bias Precision (Total Bits) | 2-16      | 2         |
| Target Reuse Factor                    | 1024-4093 | 1023      |
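As a sanity check on the "> 100k models" figure, the sketch below counts the configurations in the table. It assumes the dataset samples the full Cartesian product of these ranges with inclusive endpoints, which the slide does not state explicitly; the dictionary keys are illustrative names.

```python
def inclusive_range(start: int, stop: int, step: int) -> list[int]:
    """Values from start to stop (inclusive) in increments of step."""
    return list(range(start, stop + 1, step))


grid = {
    "input_features": inclusive_range(8, 128, 8),
    "layer1_neurons": inclusive_range(8, 128, 8),
    "layer2_neurons": inclusive_range(8, 128, 8),
    "precision_bits": inclusive_range(2, 16, 2),
    "reuse_factor": inclusive_range(1024, 4093, 1023),
}

n_configs = 1
for values in grid.values():
    n_configs *= len(values)

print({name: len(values) for name, values in grid.items()})
print(n_configs)
```

Under that assumption the grid holds 16 × 16 × 16 × 8 × 4 configurations, which is consistent with a dataset of more than 100k synthesized models.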





# Fast ML example applications





# AI on ASIC



Catapult AI NN brings together hls4ml, an open-source package for machine learning hardware acceleration, and Siemens' Catapult<sup>™</sup> HLS software for High-Level Synthesis. Developed in close collaboration with Fermilab, a U.S. Department of Energy Laboratory, and other leading contributors to hls4ml, Catapult AI NN addresses the unique requirements of machine learning accelerator design for power, performance, and area on custom silicon.

 Enables software development teams to seamlessly translate AI models designed in Python into silicon-based implementations, facilitating faster and more power-efficient execution compared to standard processors









Floorplan with analog pixels and power and bias grid



### Red: classifier algorithm

### White: network weights










- Accessing sub-ns time component while meeting power requirements likely requires novel microelectronics solutions
  - Clocked digital CMOS solutions are power hungry
  - Consider neuromorphic approaches, e.g. analog or spiking NN





# **AXOL1TL**



https://cds.cern.ch/record/2904695





### Reconstructed data $\hat{x}$



T. Aarrestad, CMS ML Townhall

 $x - \hat{x}$  represents degree of abnormality



### Train on ZeroBias LHC data

Bottleneck: autoencoder learns to compress high dimensional inputs into low dimensional latent space
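The idea that $x - \hat{x}$ measures abnormality can be shown with a toy stand-in for an autoencoder. This is not the AXOL1TL model: it is a fixed linear encoder and decoder with a one-dimensional bottleneck, where the latent direction `U` is assumed to have been learned from typical data lying along (1, 1).

```python
import math

# Assumed, illustrative "learned" latent direction: normal data lies along (1, 1).
U = [1 / math.sqrt(2), 1 / math.sqrt(2)]


def encode(x):
    """Compress a 2D input to a 1D latent value (the bottleneck)."""
    return sum(x_i * u_i for x_i, u_i in zip(x, U))


def decode(z):
    """Reconstruct a 2D x_hat from the latent value."""
    return [z * u_i for u_i in U]


def anomaly_score(x):
    """Squared reconstruction error |x - x_hat|^2."""
    x_hat = decode(encode(x))
    return sum((x_i - xh_i) ** 2 for x_i, xh_i in zip(x, x_hat))


print(anomaly_score([1.0, 1.0]))   # typical event: close to 0
print(anomaly_score([1.0, -1.0]))  # atypical event: large score
```

Inputs that resemble the training data pass through the bottleneck almost unchanged, while inputs unlike it are reconstructed poorly and receive a large score. A trigger can then cut on that score.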

|         | Latency         | LUTs | FFs | DSPs | BRAMs |
|---------|-----------------|------|-----|------|-------|
| AXOL1TL | 2 ticks (50 ns) | 2.1% | ~0  | 0    | 0     |



### AXOL1TL IRL





CMS Experiment at the LHC, CERN. Data recorded: 2023-May-24 01:42:17.826112 GMT. Run / Event / LS: 367883 / 374187302 / 159

DL1TL 7 payload 7 infrastructure

For some signals, up to factor of 10 increase in signal efficiency!


# **ML-based qubit readout**

B. Du et al, https://indico.cern.ch/event/1387540/contributions/6153407/



Readout frequencies in the RF, ~few GHz; control latency < 1 µs











### < 40 ns latency and < 6% of FPGA resources












Proof-of-concept end-to-end workflow established, extendable to multi-qubit systems and adaptive automated operation





# Outline

- Fast and Slow
- Fast ML and hardware codesign
- Fast and Slow ML together
- Slow ML & Real-Time



# Adaptive systems






Summit on Digital Twins last Tuesday https://indico.fnal.gov/event/66849/





# Cart-pole: canonical reinforcement learning










## Cart-pole IRL

















#### Credit: Marcin Paluch (ETH Zurich) & Ben Hawks (FNAL), Dennis Plotnikov (SULI, JHU)



# Thoughts on adaptive systems

- Real-time feedback often deals with raw data: only one crack at the data!
- The performance of the twin and the control agent are interconnected
- What are the right time scales?
- A robust, adaptive ML control agent impacts system performance and timescales





# Learning = optimization

learning as optimization



to learn the weights, we need the **derivative** of the loss w.r.t. the weight i.e. "how should the weight be updated to decrease the loss?"

$$w' = w - \alpha \frac{\partial \mathcal{L}}{\partial w}$$
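The update rule can be tried on a one-parameter toy problem. A minimal sketch, assuming a quadratic loss $\mathcal{L}(w) = (w - 3)^2$ and a learning rate $\alpha = 0.1$, both chosen only for illustration:

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def dloss_dw(w: float) -> float:
    """Analytic derivative of the toy loss."""
    return 2.0 * (w - 3.0)


w, alpha = 0.0, 0.1
for _ in range(100):
    w = w - alpha * dloss_dw(w)  # w' = w - alpha * dL/dw

print(w, loss(w))
```

Each step moves the weight against the gradient, so the loss shrinks until the weight settles at the minimum. Real networks apply the same rule to every weight, with the derivatives supplied by backpropagation.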



# Robustness, a priori

- Goal: train a model to be more robust to perturbations and more generalizable
  - Input perturbations = noise and lost channels
  - Weight perturbations = bit flips due to radiation








#### Mode connectivity: barriers between converged models



Hessian Trace: Local smoothness

CKA similarity: similarity across trained models



# Learning = optimization

learning as optimization



- Add loss terms to improve robustness and generalizability

$$\lambda \|J(x)\|_F^2$$

**Jacobian regularization** penalizes large output impacts due to input perturbations

$$\lambda \| W^T W - I \|_F$$

**Lipschitz regularization** encourages weight orthogonality for loss landscape smoothness
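Both penalty terms can be written out for a single linear layer $f(x) = Wx$, whose Jacobian with respect to the input is simply $W$. This is a sketch under that simplifying assumption; for a real non-linear network the Jacobian depends on $x$ and would come from automatic differentiation. The function names are illustrative.

```python
import math


def frobenius_sq(A):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(a * a for row in A for a in row)


def jacobian_penalty(W, lam):
    """lam * ||J||_F^2 for the linear map f(x) = W x, whose Jacobian is W itself."""
    return lam * frobenius_sq(W)


def orthogonality_penalty(W, lam):
    """lam * ||W^T W - I||_F."""
    n_rows, n_cols = len(W), len(W[0])
    gram_minus_identity = [
        [sum(W[k][i] * W[k][j] for k in range(n_rows)) - (1.0 if i == j else 0.0)
         for j in range(n_cols)]
        for i in range(n_cols)
    ]
    return lam * math.sqrt(frobenius_sq(gram_minus_identity))


identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[2.0, 0.0], [0.0, 2.0]]
print(orthogonality_penalty(identity, lam=1.0), orthogonality_penalty(scaled, lam=1.0))
print(jacobian_penalty(identity, lam=0.5))
```

An orthogonal weight matrix incurs no orthogonality penalty, while a matrix that stretches its inputs is penalized. Either term is added to the task loss with its own weight $\lambda$.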



## ASIC autoencoder








65 nm LP CMOS








#### Powerful, Flexible, Adaptable Data Compression





## Robustness





Lipschitz regularization seems quite promising as a way to make edge NNs more robust. This feeds into the automation loop and provides more system stability.



# Outline

- Fast and Slow
- Fast ML and hardware codesign
- Fast and Slow ML together
- Slow ML & Real-Time



# Life cycle of instruments, detectors, facilities










#### Thought for 10 seconds

To determine how much energy a muon loses while traversing 5 meters of iron, we need to calculate the energy loss per unit distance and then multiply it by the total distance traveled.

#### Step 1: Understanding Muon Energy Loss Mechanisms

Muons, like other charged particles, lose energy in matter primarily through:

- Ionization and Excitation: Interacting with electrons in the material, causing ionization or excitation.
- Radiative Processes: At higher energies, processes like bremsstrahlung, pair production, and photonuclear interactions become significant.

For this calculation, we'll focus on ionization energy loss, which is dominant for muon energies up to several hundred GeV.

#### Step 2: Using the Bethe-Bloch Formula

The Bethe-Bloch formula describes the mean energy loss per unit path length $\left(\frac{dE}{dx}\right)$ of charged particles as they pass through matter. For muons in iron, the mass stopping power (mean energy loss per unit mass thickness) is approximately:

$$\left(\frac{dE}{dx}\right)_{\rm mass} \approx 1.5\,{\rm MeV}\cdot{\rm cm}^2/{\rm g}$$



#### n.b. GPT-4o got this wrong



#### Step 3: Calculating Energy Loss Per Unit Length

First, we need to convert the mass stopping power to energy loss per unit length using the density of iron:

- Density of iron ($\rho$): approximately $7.874\,{\rm g/cm}^3$

Now, calculate the energy loss per centimeter:

$$\left(\frac{dE}{dx}\right)_{\rm length} = \left(\frac{dE}{dx}\right)_{\rm mass} \times \rho = 1.5\,{\rm MeV}\cdot{\rm cm}^2/{\rm g} \times 7.874\,{\rm g/cm}^3 = 11.811\,{\rm MeV/cm}$$

#### Step 4: Calculating Total Energy Loss Over 5 Meters

Convert 5 meters to centimeters:

$$5\,\mathrm{meters}=500\,\mathrm{cm}$$

Now, multiply the energy loss per centimeter by the total distance:

$$\Delta E = \left(\frac{dE}{dx}\right)_{\rm length} \times {\rm Distance} = 11.811\,{\rm MeV/cm} \times 500\,{\rm cm} = 5905.5\,{\rm MeV}$$

#### Step 5: Converting to GeV

Since $1\,{\rm GeV} = 1000\,{\rm MeV}$:

$$\Delta E = \frac{5905.5\,{\rm MeV}}{1000} = 5.9055\,{\rm GeV}$$

Answer:

Approximately 5.9 GeV of energy is lost by the muon while traversing 5 meters of iron.
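The arithmetic in these steps can be reproduced directly. A minimal sketch using the same inputs (1.5 MeV·cm²/g, 7.874 g/cm³, 5 m), with the same simplification of a constant ionization loss and no radiative contribution:

```python
MASS_STOPPING_POWER = 1.5   # MeV cm^2 / g, value used in the steps above
IRON_DENSITY = 7.874        # g / cm^3
PATH_LENGTH_CM = 500.0      # 5 m expressed in cm

de_dx = MASS_STOPPING_POWER * IRON_DENSITY   # MeV / cm
delta_e_mev = de_dx * PATH_LENGTH_CM         # MeV

print(f"dE/dx = {de_dx:.3f} MeV/cm")
print(f"Energy loss = {delta_e_mev / 1000:.1f} GeV")
```

Changing the stopping power or the path length in this sketch shows how sensitive the final number is to the assumed inputs.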



# Yesterday





# chATLAS

# An AI Assistant for the ATLAS Collaboration

## LLM + RAG (Retrieval-Augmented Generation):

Goal: Ask questions to an LLM about private information without retraining
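The retrieval half of RAG can be sketched without any model at all. This is not the chATLAS implementation: it is a toy word-overlap retriever with hypothetical documents, standing in for the embedding search and vector store a real system would use, and it stops at building the prompt rather than calling an LLM.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase the text, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by the number of words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Prepend retrieved context so the LLM can answer without retraining."""
    context = "\n".join(retrieve(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical stand-ins for private collaboration documents.
docs = [
    "The trigger menu is updated before each data-taking period.",
    "Shift sign-up opens on the first Monday of the month.",
]
question = "When is the trigger menu updated?"
prompt = build_prompt(question, docs)
print(prompt)
```

The private information reaches the LLM only through the prompt, so the document collection can be updated without touching the model weights.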









# In closing...



# Summary

- Fast and Slow
  - Resolving 18 orders of magnitude in time (!) to understand where AI could be deployed to improve sensing, automation, control, and knowledge synthesis
- Fast ML and hardware codesign
  - Embedded ML can be used to access new information, reduce biases, and develop adaptive systems; accessible tools like hls4ml speed up algorithm/hardware design
- Fast and Slow ML together
  - More powerful autonomous instruments enabled through digital twins (physically coupled fast simulators) and more robust real-time controllers
- Slow ML & Real-Time
  - LLMs + RAG and emerging agentic workflows can aid in reducing operational load by preserving and synthesizing collective expertise



# **Ruminations**

- "Real-time" is a curious term
- Humans are not particularly good at pattern recognition over multiple timescales - this also requires us to capture the right data
- AI is a disruptive technology; I'm hopeful that we can harness it to accelerate our particle physics ambitions
- There are many connections to other scientific domains enabling discovery in many (sub-)fields

Image prompt: "time, its illusiveness across multiple timescales in particle physics with a message of hope" (Imagen3)



