

# Ventus: An RVV-based General-Purpose GPU Design and Implementation

Kexiang Yang<sup>1,2</sup>, Hualin Wu<sup>3</sup>, Jingzhou Li<sup>1,2</sup>, Chufeng Jin<sup>1,2</sup>, Yujie Shi<sup>1,2</sup>, Xudong Liu<sup>1,2</sup>, Zexia Yang<sup>1,2</sup>, Fangfei Yu<sup>1,2</sup>, Mingyuan Ma<sup>1,2</sup>, Sipeng Hu<sup>4</sup>, Tianwei Gong<sup>4</sup>, Hu He<sup>1,2\*</sup>

<sup>1</sup>Tsinghua University, <sup>2</sup>International Innovation Center of Tsinghua University, Shanghai, <sup>3</sup>Terapines Ltd, <sup>4</sup>Beijing Information Science and Technology University

#### What is Ventus?

- Open-sourced RVV-based GPGPU
- Hardware implemented in Chisel HDL, with an accompanying driver and compiler
- OpenCL compatibility
- RISC-V compatibility with 256 registers available

#### Software Stack

- Ventus-LLVM: compiler based on LLVM for Ventus ISA and library
- PoCL: OpenCL platform implementation
- Ventus-driver: KMD implementation
- Ventus-gpgpu-isa-simulator: ISS based on Spike

#### ISA: RV32IMA\_ZFinx\_Zicsr\_V

- Vector instructions for per-thread operation, elen=32 bit, vlen=32\*elen
- Scalar instructions for common data
- Custom instructions:
  - VBranch/Join to control thread divergence
  - EndProgram and Barrier to control warps
  - RegisterExtension to extend register index
- Registers: 64 sGPRs, 256 vGPRs
- Memory space definition and access methods
- Custom CSRs and metadata to launch workgroup and implement workitem functions
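The register and vector-length figures above imply some simple sizing arithmetic, sketched below. The values `elen = 32`, `vlen = 32 * elen`, 256 vGPRs and 64 sGPRs come from the list; everything else is an assumption made for illustration: that each warp holds a full register set, that sGPRs are 32 bits wide (RV32), and that the standard 5-bit RISC-V register field is what RegisterExtension widens. The function names are hypothetical and do not come from the Ventus code base.

```python
# Sizing arithmetic for the ISA parameters listed above (illustrative sketch).
ELEN_BITS = 32                      # element width, from the ISA summary
ELEMENTS_PER_VECTOR = 32            # vlen = 32 * elen, one element per thread
VLEN_BITS = ELEMENTS_PER_VECTOR * ELEN_BITS

NUM_VGPRS = 256                     # vector registers, from the ISA summary
NUM_SGPRS = 64                      # scalar registers, from the ISA summary
XLEN_BITS = 32                      # assumption: RV32 scalar register width
BASE_REG_FIELD_BITS = 5             # standard RISC-V register specifier width


def vgpr_bytes_per_warp() -> int:
    """Bytes needed if one warp owns all 256 vGPRs (assumption)."""
    return NUM_VGPRS * VLEN_BITS // 8


def sgpr_bytes_per_warp() -> int:
    """Bytes needed if one warp owns all 64 sGPRs (assumption)."""
    return NUM_SGPRS * XLEN_BITS // 8


def extra_index_bits(num_regs: int) -> int:
    """Index bits needed beyond the 5-bit RISC-V register field."""
    needed = (num_regs - 1).bit_length()
    return max(0, needed - BASE_REG_FIELD_BITS)


if __name__ == "__main__":
    print("vlen bits:", VLEN_BITS)
    print("vGPR bytes per warp:", vgpr_bytes_per_warp())
    print("sGPR bytes per warp:", sgpr_bytes_per_warp())
    print("extra vGPR index bits:", extra_index_bits(NUM_VGPRS))
    print("extra sGPR index bits:", extra_index_bits(NUM_SGPRS))
```

Under these assumptions a vector register is 1024 bits wide, and addressing 256 vGPRs needs three index bits more than a 5-bit register field provides, which is the kind of gap a register-index extension has to close.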

| Address / base CSR (scope) | Memory space   |
|----------------------------|----------------|
| 0xFFFFFFFF (top)           |                |
| CSR_KNL (NDRange)          | global memory  |
| CSR_PDS (warp/thread)      | private memory |
| CSR_KNL (NDRange)          | instruction    |
| CSR_LDS (workgroup)        | shared memory  |
| 0x00000000 (bottom)        |                |

|                       | AMD                                        | NVIDIA                                    | Intel                                       | Vortex                      | Ventus                                            |
|-----------------------|--------------------------------------------|-------------------------------------------|---------------------------------------------|-----------------------------|---------------------------------------------------|
| ISA                   | RDNA                                       | PTX                                       | GEN                                         | RISC-V IMF                  | RV32V                                             |
| Instruction<br>Length | 32/64 bit                                  | 128 bit (SASS)                            | 128 bit                                     | 32 bit                      | 32 bit                                            |
| Memory<br>Model       | GDS, LDS<br>Constants<br>Global            | Shared,<br>Texture<br>Constants<br>Global | Software<br>Managed                         | Shared<br>Global            | Private<br>Shared<br>Global                       |
| Threading<br>Model    | workgroup<br>wavefront<br>32/64 thread     | CTA<br>warp<br>32 thread                  | Root Thread<br>Child Thread                 | compute unit<br>wavefront   | workgroup<br>warp<br>32 thread                    |
| Register file         | 256 vGPRs<br>106 sGPRs                     | Scalar                                    | 128 GRFs                                    | 32 sGPRs                    | 256 vGPRs<br>64 sGPRs                             |
| Thread<br>Control     | endpgm<br>message<br>branch<br>thread mask | branch<br>predicate                       | message<br>branch<br>SPF Regs<br>split/join | thread mask<br>(split/join) | endprg<br>branch<br>thread mask<br>(vbranch/join) |
| Synchronization       | barrier<br>wait_cnt                        | barrier<br>membar                         | wait<br>fence                               | barrier<br>flush            | barrier<br>fence                                  |
| Execution Unit        | ALU<br>memory<br>Matrix Core               | ALU<br>memory<br>Tensor Core              | ALU<br>memory<br>Matrix Engine              | ALU<br>memory               | ALU<br>memory<br>Tensor Core                      |
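The thread-control row lists a thread mask with vbranch/join for Ventus. The sketch below shows the general idea behind mask-based divergence handling on a SIMT machine: a divergent branch splits the active mask, each path runs under its own mask, and a join restores the saved mask. It is a generic illustration only. The function `run_if_else` and the explicit mask stack are hypothetical and are not the documented semantics of the Ventus VBranch/Join instructions.

```python
# Generic SIMT divergence model: split the active mask at a branch,
# run both paths under their masks, then reconverge at the join.
from typing import Callable, List, Tuple


def run_if_else(
    values: List[int],
    cond: Callable[[int], bool],
    then_fn: Callable[[int], int],
    else_fn: Callable[[int], int],
) -> Tuple[List[int], List[bool]]:
    n = len(values)
    active = [True] * n          # all threads of the warp start active
    mask_stack = []              # saved masks for reconvergence

    # "Branch": evaluate the condition per thread and split the mask.
    taken = [active[i] and cond(values[i]) for i in range(n)]
    not_taken = [active[i] and not taken[i] for i in range(n)]
    mask_stack.append(active)    # mask to restore at the join point

    out = list(values)
    # The two paths execute one after the other, each under its own mask.
    for mask, fn in ((taken, then_fn), (not_taken, else_fn)):
        if not any(mask):
            continue             # no thread took this path, skip it
        for i in range(n):
            if mask[i]:
                out[i] = fn(values[i])

    # "Join": all threads reconverge under the saved mask.
    active = mask_stack.pop()
    return out, active


if __name__ == "__main__":
    result, mask = run_if_else(
        list(range(8)),
        cond=lambda x: x % 2 == 0,
        then_fn=lambda x: x * 10,
        else_fn=lambda x: x + 1,
    )
    print(result)
    print(mask)
```

Nested branches would push further masks onto the same stack, which is why a stack is used here rather than a single saved mask.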

#### Microarchitecture

- Multi-level task allocation is implemented by the driver and the CTA scheduler
- Each SM works as an RVV processor that supports warp scheduling
- The 4-bank register file can be allocated according to usage
- The Tensor Core supports custom tensor operations
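The multi-level task allocation mentioned above can be pictured as a two-step split: an NDRange is divided into workgroups, and each workgroup into warps that are placed on an SM. The sketch below is a minimal model under stated assumptions: a one-dimensional NDRange, 32 threads per warp (the figure given in the comparison table), all warps of a workgroup on the same SM, and round-robin placement of workgroups. `WarpTask` and `split_ndrange` are hypothetical names; the actual driver and CTA-scheduler policies are not described in this poster.

```python
# Illustrative two-level split: NDRange -> workgroups -> warps on SMs.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WarpTask:
    workgroup_id: int
    warp_id: int          # index of the warp inside its workgroup
    sm_id: int            # SM chosen for the whole workgroup
    active_threads: int   # threads enabled in this warp


def split_ndrange(
    global_size: int,
    local_size: int,
    num_sms: int,
    threads_per_warp: int = 32,
) -> List[WarpTask]:
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")

    num_workgroups = global_size // local_size
    # Ceiling division: a partially filled last warp still needs a slot.
    warps_per_workgroup = -(-local_size // threads_per_warp)

    tasks = []
    for wg in range(num_workgroups):
        sm = wg % num_sms  # assumption: round-robin placement of workgroups
        for warp in range(warps_per_workgroup):
            remaining = local_size - warp * threads_per_warp
            tasks.append(
                WarpTask(wg, warp, sm, min(threads_per_warp, remaining))
            )
    return tasks


if __name__ == "__main__":
    for task in split_ndrange(global_size=160, local_size=80, num_sms=2):
        print(task)
```

Keeping a workgroup on one SM matches the per-workgroup scope of shared memory in the memory-space table; the round-robin choice is only a placeholder for whatever policy the scheduler applies.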



#### Evaluation & Conclusion

- A complete implementation of GPGPU based on RVV
- Written in Chisel HDL and configurable in the number of warps, threads, SMs, lanes, and other parameters
- A 16SM-16warp-16lane version with Tensor Core occupies 65% of the area of 4 VU19P FPGAs

Open-sourced at https://github.com/THU-DSP-LAB/ventus-gpgpu