# RISC-V based GPGPU on FPGA: A Competitive Approach for Scientific Computing?

E. Guthmuller & J. Fereyre

**CEA List**

eric.guthmuller@cea.fr

GPUs have enabled supercomputers to exceed exaFLOP performance, but AI now drives GPU architecture evolution and needs only low-precision computing.

⇒ How long before 64-bit support in GPUs is dropped or emulated?

A typical scientific computing kernel needs 64-bit floating-point support but exhibits low arithmetic intensity, so its performance is limited by memory throughput.

FPGAs now provide hardened arithmetic units, Networks-on-Chip (NoCs) and memory controllers, including High Bandwidth Memory (HBM). While dedicated architectures or CGRAs may better exploit the fine-grained FPGA architecture, GPGPUs are easier to program and are already targeted by existing code.

⇒ Would it be possible to implement a GPGPU on FPGA that maximizes HBM throughput and is thus competitive with ASIC GPUs?

Fig 1. Example of a typical HPC kernel: Conjugate Gradient (CG) iterative solver.

Fig 3. Roofline model of linear algebra kernels for 820 GB/s memory (plot labels: SpMV, SpMV\_UB).

### Platform & Architecture

#### **Targeted Platform: AMD Alveo V80**

Fig 4. AMD Alveo V80 main features and organization

#### **Vortex FPGA Implementation**

Vortex is an open-source RISC-V based GPGPU (https://github.com/vortexgpgpu/vortex).

Fig 5. Vortex architecture with 1 to 14 clusters connected to NoC

## **Early Results**

- Mapping up to 56 Vortex cores on Alveo V80
- Up to 224 FMA lanes (single precision)
- 4 wavefronts per core
- No SLR crossing
- Max frequency stable at ~300 MHz even with high utilization
- OpenCL driver developed over AMD PCIe QDMA driver

Fig 6. Post-route FPGA floorplan with 14 clusters (colored)

| Config               | LUTs  | FFs | RAM (small) | RAM (big) | DSP    | Freq     | Peak FP32   |
|----------------------|-------|-----|-------------|-----------|--------|----------|-------------|
| 4 cores              | 5%    | 3%  | 4%          | 0%        | <1%    | 300 MHz  | 10 GFLOPS   |
| 56 cores             | 70%   | 39% | 39%         | 0%        | 4%     | 282 MHz  | 126 GFLOPS  |
| AMD Versal HBM XCV80 | 2.5 M | -   | 132 Mb      | 541 Mb    | 10.8 K | ~800 MHz | 17.5 TFLOPS |

Fig 7. Implementation results

#### **Future works**

- HW support for double precision (FP64) operations
- HPC benchmarks: Linpack and HPCG
- Optimized memory subsystem to exploit HBM bandwidth
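The peak FP32 figures in the implementation results and the 820 GB/s roofline can be related with a short calculation. The sketch below is illustrative only and rests on assumptions not stated on the poster: one FMA counts as 2 FLOPs per lane per cycle, lanes are spread evenly over cores (224 lanes / 56 cores = 4 lanes per core), and the classic roofline bound min(peak, bandwidth × arithmetic intensity) applies. The function names and the example intensity value are hypothetical.

```python
# Minimal roofline sketch (assumptions: 1 FMA = 2 FLOPs per lane per cycle,
# 4 FMA lanes per core, attainable = min(peak, bandwidth * intensity)).

def peak_gflops(fma_lanes, freq_mhz, flops_per_fma=2):
    """Peak throughput in GFLOPS for a given number of FMA lanes."""
    return fma_lanes * flops_per_fma * freq_mhz / 1000.0


def roofline_gflops(peak, bandwidth_gbs, intensity):
    """Attainable GFLOPS for a kernel with the given FLOP/byte intensity."""
    return min(peak, bandwidth_gbs * intensity)


HBM_BANDWIDTH_GBS = 820.0  # memory bandwidth used for the roofline (Fig 3)
LANES_PER_CORE = 4         # assumption: 224 lanes / 56 cores

for cores, freq_mhz in [(4, 300), (56, 282)]:
    peak = peak_gflops(cores * LANES_PER_CORE, freq_mhz)
    ridge = peak / HBM_BANDWIDTH_GBS  # intensity where the two bounds meet
    print(f"{cores} cores: peak {peak:.1f} GFLOPS, ridge {ridge:.3f} FLOP/byte")

# Hypothetical low-intensity kernel (value chosen only for illustration).
example_intensity = 0.1  # FLOP/byte
bound = roofline_gflops(peak_gflops(224, 282), HBM_BANDWIDTH_GBS, example_intensity)
print(f"attainable at {example_intensity} FLOP/byte: {bound:.1f} GFLOPS")
```

Under these assumptions the computed peaks (9.6 and about 126.3 GFLOPS) agree with the rounded table values. Any kernel whose intensity lies below the ridge point is bounded by memory throughput rather than by the FMA lanes.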