

### "On-chip" Computation

FPGA frameworks for edge and near-edge computing: examples and strategy towards HL-LHC

Michalis Bachtis (UCLA), Javier Duarte (FNAL)



### Compute on hardware

- <u>Track Reconstruction:</u> CPU-expensive algorithm at HL-LHC
- Recent application: "offline"-style track reconstruction with a Kalman filter, running in FPGAs in the L1 Muon Trigger during current CMS data taking
  - Opens the door to accelerating track reconstruction algorithms with FPGAs <u>also offline at HL-LHC</u>







# Implementing a Kalman Filter in an FPGA

### Matrix algebra including matrix inversion!



Every step consists of track propagation and a parameter update (k = q/p<sub>T</sub>)
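The propagate-then-update step can be sketched in plain Python. This is a minimal illustration, not the CMS firmware: the three-parameter state (k, phi, phi_b), the matrices F, Q, H, R, and all numbers are assumptions chosen for the example. It shows where the matrix inversion enters: the 2x2 residual covariance S must be inverted to form the gain.

```python
# Minimal Kalman filter step (illustrative sketch, not the CMS firmware).
# Assumed state: x = [k, phi, phi_b] with k = q/pT; all numbers are hypothetical.

def matmul(a, b):
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def propagate(x, P, F, Q):
    """Propagation: x' = F x, P' = F P F^T + Q."""
    return matmul(F, x), add(matmul(matmul(F, P), transpose(F)), Q)

def update(x, P, z, H, R):
    """Update: K = P H^T S^-1 with S = H P H^T + R (the matrix inversion)."""
    S = add(matmul(matmul(H, P), transpose(H)), R)
    K = matmul(matmul(P, transpose(H)), inv2x2(S))
    x_new = add(x, matmul(K, sub(z, matmul(H, x))))
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    P_new = matmul(sub(identity, matmul(K, H)), P)
    return x_new, P_new

# Hypothetical model: phi and phi_b each shift in proportion to k.
F = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.3, 0.0, 1.0]]
Q = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
H = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # a station measures phi and phi_b
R = [[0.04, 0.0], [0.0, 0.04]]

x0 = [[0.2], [0.0], [0.0]]
P0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

x_pred, P_pred = propagate(x0, P0, F, Q)
x_new, P_new = update(x_pred, P_pred, [[0.12], [0.05]], H, R)
```

Although only phi and phi_b are measured, the update also tightens the uncertainty on k, because propagation correlates k with both angles.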



Modern FPGAs: 1000s of DSP cores

- Exist for filtering, AI, and military applications
  - ASIC cores in the FPGA that contain wide multipliers and adders
- Exploiting this commercially available resource reduced the required FPGA resources by a factor of 5
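Each element of a matrix-vector product is a chain of multiply-accumulate operations, which is the kind of arithmetic a DSP core's multiplier and adder are built for. The sketch below emulates this with integers in Python. The fixed-point format (10 fractional bits) and the helper names are assumptions for illustration, not the format used in the trigger firmware.

```python
FRAC_BITS = 10  # assumed number of fractional bits (illustrative)

def to_fixed(value, frac_bits=FRAC_BITS):
    """Convert a float to a scaled integer (fixed point)."""
    return int(round(value * (1 << frac_bits)))

def to_float(raw, frac_bits=FRAC_BITS):
    """Convert a scaled integer back to a float."""
    return raw / (1 << frac_bits)

def mac_dot(row, vec, frac_bits=FRAC_BITS):
    """Dot product as a chain of integer multiply-accumulates."""
    acc = 0  # wide accumulator keeps every product bit until the end
    for a, b in zip(row, vec):
        acc += to_fixed(a, frac_bits) * to_fixed(b, frac_bits)
    return acc >> frac_bits  # rescale once, after accumulation

row = [1.0, 0.5, -0.25]   # hypothetical matrix row
vec = [0.2, 0.1, 0.4]     # hypothetical state vector
approx = to_float(mac_dot(row, vec))
exact = sum(a * b for a, b in zip(row, vec))
```

Rescaling only once, after the accumulation, keeps the rounding error to about one least-significant bit of the output.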



### Implementation results

FPGA firmware written with the latest High Level Synthesis (HLS) tools in C

- Deployed in CMS L1 data taking in Run II
- Reconstructs all muon tracks in 150 ns!

Proving that C code can run efficiently on an FPGA

Fundamental step towards <u>accelerating current C/C++ offline algorithms with FPGAs in HL-LHC</u>






# Interface between L1 Trigger and computing towards HL-LHC



US CMS is performing R&D with cutting-edge FPGAs for the L1 Trigger

FPGA vendors moving towards combining many different technologies in a single chip

- Future-generation FPGAs (to arrive on the market in 2020) will combine FPGA logic with CPUs and dedicated AI cores into an adaptable computing engine
- While the L1 Trigger (due to latency limits of ~$\mu$s) will mostly benefit from the FPGA logic, the same device can be reconfigured for a computing application that accelerates parts of an algorithm in the FPGA logic



# HL-LHC strategy from Muon trigger implementation



1. Advanced/Clever Programming of Modern FPGAs
   - Exploit DSP cores to reduce resource usage: more algorithms in a chip
   - Running C algorithms in an FPGA: enables acceleration of offline algorithms
   - Bigger and faster FPGAs
     - Faster clocks make algorithms faster
     - Embedded computing elements inside the chip perform co-processing
2. High-speed on-board data links & hybrid on-board (or on-chip) computing
   - More and/or higher-speed links (~25 100 Gbps Ethernet connections per FPGA!)
     - Connect multiple devices together
   - Many devices in the same chip: adaptable computing optimized for the application



### Reminder: types of compute engines





#### NN correctly identifies jets 70-80% of the time









#### Optimize NNs for FPGA resources

- Compress: Maintain high performance while removing redundant synapses and neurons
- Quantize: Reduce precision from 32-bit floating point to 20-bit, 8-bit, ...
- Parallelize/Reuse: Balance parallelization (how fast) with FPGA resources needed (how costly)
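The three optimizations can be illustrated with a toy example in plain Python. This is a sketch under stated assumptions, not the hls4ml API: the function names, the magnitude-based pruning rule, the 8-bit format with 2 integer bits, and the weight values are all hypothetical.

```python
import math

def prune(weights, fraction):
    """Compress: zero the smallest-magnitude `fraction` of the weights."""
    n_drop = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

def quantize(w, total_bits, int_bits):
    """Quantize: round to signed fixed point (int_bits includes the sign bit)."""
    scale = 1 << (total_bits - int_bits)
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    raw = max(lo, min(hi, round(w * scale)))  # saturate on overflow
    return raw / scale

def multipliers_needed(n_mults, reuse):
    """Reuse: each multiplier serves `reuse` multiplications in sequence."""
    return math.ceil(n_mults / reuse)

weights = [0.75, -0.02, 0.31, 0.004, -0.58, 0.09, -0.001, 0.44]
pruned = prune(weights, 0.5)                     # keep the 4 largest weights
quantized = [quantize(w, 8, 2) for w in pruned]  # 8 bits, 6 of them fractional
```

With these assumed numbers, a layer of 64 multiplications at a reuse factor of 4 would instantiate `multipliers_needed(64, 4)` = 16 multipliers, at the cost of taking roughly four times as many clock cycles.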

#### J. Duarte et al., JINST 13 P07027

Fast inference of deep neural networks in FPGAs for particle physics

Javier Duarte<sup>a</sup>, Song Han<sup>b</sup>, Philip Harris<sup>b</sup>, Sergo Jindariani<sup>a</sup>, Edward Kreinar<sup>c</sup>, Benjamin Kreis<sup>a</sup>, Jennifer Ngadiuba<sup>d</sup>, Maurizio Pierini<sup>d</sup>, Ryan Rivera<sup>a</sup>, Nhan Tran<sup>a</sup>, Zhenbin Wu<sup>e</sup>

- <sup>a</sup> Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
- <sup>b</sup> Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- <sup>c</sup> HawkEye360, Herndon, VA 20170, USA
- <sup>d</sup> CERN, CH-1211 Geneva 23, Switzerland
- <sup>e</sup> University of Illinois at Chicago, Chicago, IL 60607, USA

*E*-mail: hls4ml.help@gmail.com



### Tool: <u>hls4ml</u>

### hls4ml for physicists or ML experts to translate ML algorithms into FPGA firmware



#### Citation

If you are using the package please cite:

- DOI 10.5281/zenodo.1204445
- J. Duarte *et al.*, "Fast inference of deep neural networks in FPGAs for particle physics", JINST 13 P07027 (2018), arXiv:1804.06913.

#### Contributors

- Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Sioni Summers [CERN]
- Javier Duarte, Sergo Jindariani, Benjamin Kreis, Ryan Rivera, Nhan Tran [Fermilab]
- Edward Kreinar [Hawkeye360]
- Song Han, Philip Harris, Dylan Rankin [MIT]
- Zhenbin Wu [University of Illinois at Chicago]
- Mark Neubauer [University of Illinois Urbana-Champaign]
- Shih-Chieh Hsu [University of Washington]
- Giuseppe Di Guglielmo [Columbia University]



DUNE, ATLAS, and the Accelerator Division are interested

# Accelerating High-Level Trigger with FPGAs

Inputs: TS0, TS1, TS2, TS3, TS4, TS5, TS6, TS7, iη, iφ, depth

- HCAL local reconstruction contributes significantly to HLT compute time
- ML+FPGA as co-processor can reduce HCAL local reco. compute time by up to ×16
- Tested using AWS FPGAs






Machine Learning Inference Latency Insensitive (High Batch)

### Summary

- Exploiting new paradigms to improve and accelerate HL-LHC trigger algorithms with applications for the future computing model in HL-LHC
  - Algorithm acceleration with FPGAs programmed in C
  - High speed interconnect of computing elements
  - New adaptable hardware
- Strengthening connections and familiarity with new industry tools and technologies
- Developed techniques applicable in ATLAS, DUNE, Accelerator controls, and more (with interested collaborators)



# High End CPU, High End GPU, Versal AI Core



Xilinx announced (end of 2018) its 7 nm Versal technology, which supports parallel processing, sequential processing, and AI. White Paper: Versal ACAPs (https://bit.ly/2IZf1BS)


WP505 (v1.0) October 2, 2018