

## HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond



Stefan Abi-Karam<sup>1,2</sup>, Rishov Sarkar<sup>1</sup>, Allison Seigler<sup>3</sup>, Sean Lowe<sup>4</sup>, Zhigang Wei<sup>3</sup>, Hanqiu Chen<sup>1</sup>, Nanditha Rao<sup>5</sup>, <u>Lizy John<sup>3</sup></u>, Aman Arora<sup>4</sup>, Cong Hao<sup>1</sup>

<sup>1</sup>Georgia Institute of Technology, <sup>2</sup>Georgia Tech Research Institute, <sup>3</sup>The University of Texas at Austin <sup>4</sup>Arizona State University, <sup>5</sup>International Institute of Information Technology Bangalore

### Background



ML has been widely used in the HLS domain, but every study uses its own dataset



Accurate Timing and Resource Estimation [FCCM'18, FPL'19, DAC'22]



Power Estimation [ASP-DAC'20, DATE'22]

### Background



ML has been widely used in the HLS domain:

- 1. XGB and ANN models are used to predict post-implementation resource utilization [FCCM'18]
- 2. Pyramid used ANN and SVM models to help find designs with optimal timing and resource usage [FPL'19]
- 3. A GNN is used to predict actual resource usage and timing [DAC'22]
- 4. HL-Pow used a CNN to predict on-board measured average power for each FPGA [ASP-DAC'20]
- 5. PowerGear used a GNN to further increase the accuracy of average power prediction [DATE'22]

However, every study uses its own dataset

### Background



#### Existing datasets:

- 1. Small or homogeneous, containing only a subset of previously published HLS benchmarks
- 2. The designs and intermediate/final tool outputs, which serve as important ML model features, are often reported in non-standard, ad hoc ways
- 3. Challenging for external users to extend

Therefore, HLSFactory is proposed, with the following features:

- 1. Complete and easily extensible with user inputs at multiple stages
- 2. Diverse and comprehensive
- 3. Reproducible and user-friendly
- 4. ML-ready and multi-purpose
- 5. High performance and open-source

**HLSFactory**

- Stage 1: Design Space Expansion
  - Inputs: existing open-source HLS designs/benchmarks; user-submitted HLS abstract designs
  - OptDSL Frontend (vendor agnostic)
  - Expanded design space (can be extremely large) → design space sampling (random sampling or active learning) → sampled concrete designs (sampling rate adjustable)
- User Entry Point 2: one HLS concrete design
- Stage 2: Design Synthesis
  - Inputs: user-submitted HLS concrete designs
  - Tools: AMD/Xilinx Vitis HLS & Vivado, Intel i++ & Quartus, and others
  - Outputs: pre-implementation results, post-implementation results
- User Entry Point 3
- Stage 3: Data Aggregation
  - AMD/Xilinx post-processing, Intel post-processing
  - ML-ready dataset
  - Multi-purpose usage

Key properties:

- ✓ Flexible: Can supply user input at any stage
- ✓ Extensible: Modular architecture is easy to customize
- ✓ Reproducible: Open-source end-to-end build flow

#### **Stage 1: Design Space Expansion and Sampling**
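As a rough illustration of Stage 1, the sketch below expands one abstract design into its full set of concrete designs by taking the Cartesian product of per-knob options, then draws a random sample at an adjustable rate. The knob names (`unroll_factor`, `pipeline`, `array_partition`) and both helper functions are hypothetical; this is not OptDSL syntax, and active-learning sampling is not shown.

```python
import itertools
import random


def expand_design_space(knobs):
    """Enumerate every combination of knob values (the expanded design space)."""
    names = sorted(knobs)
    combos = itertools.product(*(knobs[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def sample_designs(designs, rate, seed=0):
    """Randomly keep roughly `rate` of the designs, and always at least one."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    count = max(1, round(len(designs) * rate))
    return random.Random(seed).sample(designs, count)


# Hypothetical optimization knobs for one abstract design.
knobs = {
    "unroll_factor": [1, 2, 4, 8],
    "pipeline": [False, True],
    "array_partition": [1, 2, 4],
}

space = expand_design_space(knobs)          # 4 * 2 * 3 = 24 concrete designs
sampled = sample_designs(space, rate=0.25)  # 6 designs at a 25% sampling rate
print(len(space), len(sampled))
```

Because the expanded space grows multiplicatively with each knob, the sampling rate is what keeps the number of synthesis runs manageable.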





#### **Stage 2: Design Synthesis**



Two steps:

- 1. HLSSynth: synthesizes the HLS design into RTL
- 2. HLSImpl: implements the generated RTL



#### **Stage 3: Data Extraction and Aggregation**






TABLE I. A comparison of HLSFactory with existing work. ●: feature supported; ○: feature unsupported; ◐: feature partially supported.

| Contributions                    | DB4HLS | HLSyn | HLSDataset | HLSFactory |
|----------------------------------|--------|-------|------------|------------|
| Benchmark: Polybench             | ○      |       |            |            |
| Benchmark: MachSuite             |        |       |            |            |
| Benchmark: Rosetta               |        | Q     |            |            |
| Benchmark: CHStone               |        | Q     |            |            |
| Collection: PP4FPGA              |        | Q     | Q          |            |
| Collection: Accelerators (§V-E)  | ○      | ○     | ○          |            |
| Post-HLS Latency                 |        |       |            |            |
| Post-HLS Resources               |        |       |            | ●          |
| Post-HLS Artifacts               |        | Q     | ○ ○        |            |
| Post-Impl. Data                  | ○      | ○     | ●          |            |
| HLS Optimization DSL             |        | ○ ○   |            |            |
| Fine-Grained Parallel Builds     |        | ○     | ○          |            |
| Xilinx HLS Support               |        |       |            |            |
| Intel HLS Support                |        |       | ○          |            |
| User Extendable to Other Tools   |        | ○     | ○          |            |
| Programmable API                 |        | ○     | ○          |            |
| Open Source                      |        |       |            |            |

### **HLSFactory – Implementation & Usage**



#### **Python API**

| API Functions                          | Description                                  |
|----------------------------------------|----------------------------------------------|
| class Design                           | Single HLS design                            |
| class Dataset                          | Multiple HLS designs                         |
| class Flow(ABC)                        | Abstract class for arbitrary design flow     |
| Flow.execute(design)                   | Execute a flow on one design                 |
| Flow.execute_datasets_parallel(design) | Execute a flow on many designs               |
| class Frontend(Flow)                   | Abstract class for frontend design expansion |
| class OptDSLFrontend(Frontend)         | Opt DSL frontend for Xilinx HLS designs      |
| class ToolFlow(Flow)                   | Abstract class for EDA tool                  |
| class VitisHLSSynthFlow(ToolFlow)      | Run Vitis HLS synthesis                      |
| class VitisHLSImplFlow(ToolFlow)       | Run Vivado implementation (via Vitis HLS)    |
| class VitisHLSImplReportFlow(ToolFlow) | Run Vivado reporting                         |

#### **Example Use of the APIs**
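A minimal sketch of how the classes in the table might compose. It does not import the `hlsfactory` package: it defines small stand-in classes that reuse the names `Design`, `Flow`, `Frontend`, and `ToolFlow` so the control flow can run on its own. The method bodies, the `execute_many` helper, the `ToyFrontend` and `ToySynthFlow` classes, and the assumption that `execute` returns a list of designs are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Stand-in for a single HLS design."""
    name: str
    artifacts: tuple = ()


class Flow(ABC):
    """Stand-in for an arbitrary flow: one design in, a list of designs out (assumed)."""

    @abstractmethod
    def execute(self, design):
        ...

    def execute_many(self, designs):
        # Hypothetical helper: run the flow on each design and flatten the results.
        return [out for d in designs for out in self.execute(d)]


class Frontend(Flow):
    """Frontend flows expand one abstract design into several concrete ones."""


class ToolFlow(Flow):
    """Tool flows wrap a run of an EDA tool."""


class ToyFrontend(Frontend):
    def __init__(self, factors):
        self.factors = factors

    def execute(self, design):
        # One concrete design per hypothetical unroll factor.
        return [Design(f"{design.name}_u{f}") for f in self.factors]


class ToySynthFlow(ToolFlow):
    def execute(self, design):
        # A real flow would invoke the vendor tool here; this only records a fake artifact.
        return [Design(design.name, design.artifacts + ("synth_report",))]


base = [Design("gemm"), Design("fir")]
concrete = ToyFrontend([1, 2, 4]).execute_many(base)
results = ToySynthFlow().execute_many(concrete)
print([d.name for d in results])
```

Treating frontends and tool runs as the same kind of object is what lets stages be chained: the output designs of one flow are valid inputs to the next.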

### **HLSFactory – Implementation & Usage**





Fig. 4. The directory structure that HLSFactory uses. Red are input files; green are the intermediate design points; blue are output files.

#### **Design Directory Structure**

Shows the specific entry-point scripts that users add to integrate a design into HLSFactory



Figure panels: true-vs-predicted scatter plots (axes: True Value, Predicted Value) for six post-implementation metrics: LUTs, FFs, RAMB18s, DSP Blocks, Worst Negative Slack, and Worst Hold Slack. Series: ML 25% Design Space, ML Full Design Space, and HLS Reported. Across the six panels, the ML model trained on the full design space reaches R2 between 0.75 and 0.99, versus 0.06 to 0.72 when trained on 25% of it.

Xilinx Post-Implementation QoR Prediction

Fig. 5. True-vs-predicted plots for the HLS-based ML QoR model. Test values are shown for models trained on the complete training design space and on a partial subset of it. "RAE": Relative Absolute Error ($|\hat{y}-y|/|y-\bar{y}|$); "R2": Coefficient of Determination.

Generating more data points using HLSFactory can result in higher prediction accuracy
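The two metrics in the Fig. 5 caption can be computed as sketched below. Summing the absolute errors over all test samples before dividing is the conventional reading of RAE and is an assumption about how the caption's formula is aggregated; the sample values are made up for illustration.

```python
def relative_absolute_error(y_true, y_pred):
    """Total absolute error relative to that of always predicting the mean."""
    mean = sum(y_true) / len(y_true)
    numerator = sum(abs(p - t) for t, p in zip(y_true, y_pred))
    denominator = sum(abs(t - mean) for t in y_true)
    return numerator / denominator


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, not taken from the dataset.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]   # baseline that always predicts the mean
reversed_pred = [4.0, 3.0, 2.0, 1.0]  # predictor that is worse than the baseline

print(relative_absolute_error(y_true, mean_only), r2_score(y_true, mean_only))
print(relative_absolute_error(y_true, reversed_pred), r2_score(y_true, reversed_pred))
```

A predictor that is worse than the mean baseline yields RAE above 1 and a negative R2, which is how strongly negative R2 values such as those reported for some HLS estimates can arise.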





Effect of design sampling to cover more design space. Sampled designs cover a wider range of metrics than base designs with no optimizations. Latency is HLS estimated; resources are post-implementation. Note that these are stacked density plots to show the effect of cumulative design sampling.



#### Design Space Visualization

Two 2-D projections of the design space (axes: Projection $x_0$ and Projection $x_1$), one grouped by design and one grouped by benchmark.





Parallel execution of Vitis HLS synthesis. Top panel shows core utilization over time with naive parallelism across datasets; bottom panel shows our fine-grained design parallelism across datasets.
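The difference between the two scheduling strategies can be sketched with Python's `concurrent.futures`. Here `run_design` is a hypothetical stand-in for one synthesis run, the design names are made up, and threads are used only to keep the example self-contained; this is not the framework's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def run_design(design):
    """Hypothetical stand-in for one synthesis run."""
    return f"{design}:done"


def run_dataset(designs):
    """Run every design of one dataset sequentially."""
    return [run_design(d) for d in designs]


def naive_parallel(datasets, workers):
    """One task per dataset: a worker owns a whole dataset, so a large
    dataset keeps one core busy while the others sit idle."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_dataset = list(pool.map(run_dataset, datasets.values()))
    return dict(zip(datasets, per_dataset))


def fine_grained_parallel(datasets, workers):
    """One task per design: all designs share a single pool, so workers
    stay busy until the whole queue drains."""
    jobs = [(name, d) for name, designs in datasets.items() for d in designs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(lambda job: run_design(job[1]), jobs))
    results = {name: [] for name in datasets}
    for (name, _), out in zip(jobs, outputs):
        results[name].append(out)
    return results


# An unbalanced pair of datasets with made-up design names.
datasets = {"polybench": [f"pb{i}" for i in range(6)], "machsuite": ["ms0"]}
naive = naive_parallel(datasets, workers=4)
fine = fine_grained_parallel(datasets, workers=4)
print(naive == fine)
```

Both strategies produce the same results; they differ only in how evenly the work is spread, which matters most when dataset sizes are unbalanced as in this example.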














#### Comparison of Vitis HLS Metrics Between Versions 2021.1 and 2023.1



Distribution of HLS tool metrics from two versions of Vitis HLS

### Conclusion



HLSFactory Key Points:

- 1. Complete and easily extensible with user inputs at multiple stages
- 2. Diverse and comprehensive
- 3. Reproducible and user-friendly
- 4. ML-ready and multi-purpose
- 5. High performance and open-source (available at https://github.com/sharc-lab/HLSFactory)

Future Direction:

- Simulation flows, e.g., vendor-supported co-simulation or our own published tools such as LightningSim (which is co-simulation accurate and much faster)
- Adding more designs from others in the academic community and from open source
- Developing more frontends and vendor-agnostic abstractions to enumerate more designs from different design spaces
  - Example: an HLS4ML frontend to enumerate HLS designs from HLS4ML model specs or from ONNX models





#### **Documentation + Tutorials**

sharc-lab.github.io/HLSFactory/docs/

#### **GitHub Repository**

github.com/sharc-lab/hlsfactory

#### **ArXiv Pre-Print Paper**

arxiv.org/abs/2405.00820

# Thanks! Questions?

### References



[1] H. Mohammadi Makrani, F. Farahmand, H. Sayadi, S. Bondi, S. M. Pudukotai Dinakarrao, H. Homayoun, and S. Rafatirad, "Pyramid: Machine learning framework to estimate the optimal timing and resource usage of a high-level synthesis design," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL), 2019.

[2] S. Dai, Y. Zhou, H. Zhang, E. Ustun, E. F. Young, and Z. Zhang, "Fast and accurate estimation of quality of results in high-level synthesis with machine learning," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 129–132.

[3] D. Liu and B. C. Schafer, "Efficient and reliable high-level synthesis design space explorer for fpgas," in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016.

[4] W. Haaswijk, E. Collins, B. Seguin, M. Soeken, F. Kaplan, S. Süsstrunk, and G. De Micheli, "Deep learning for logic optimization algorithms," in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018.

[5] Y. Luo, C. Tan, N. B. Agostini, A. Li, A. Tumeo, N. Dave, and T. Geng, "Ml-cgra: An integrated compilation framework to enable efficient machine learning acceleration on cgras," in 2023 60th ACM/IEEE Design Automation Conference (DAC), 2023.

[6] V. A. Chhabria, Y. Zhang, H. Ren, B. Keller, B. Khailany, and S. S. Sapatnekar, "Mavirec: Ml-aided vectored ir-drop estimation and classification," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021.

[7] R. G. Kim, J. R. Doppa, and P. P. Pande, "Machine learning for design space exploration and optimization of manycore systems," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018, pp. 1–6.

[8] Z. Lin, J. Zhao, S. Sinha, and W. Zhang, "HL-Pow: A Learning-Based Power Modeling Framework for High-Level Synthesis," in 25th Asia and South Pacific Design Automation Conference (ASP-DAC), 2020.

[9] G. Singh, D. Diamantopoulos, J. Gómez-Luna, S. Stuijk, H. Corporaal, and O. Mutlu, "LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning," in IEEE 40th International Conference on Computer Design (ICCD), 2022.

[10] Z. Lin, Z. Yuan, J. Zhao, W. Zhang, H. Wang, and Y. Tian, "Powergear: Early-stage power estimation in fpga hls via heterogeneous edge-centric gnns," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2022.