













# CVA6S+: A Superscalar RISC-V Core with High-Throughput Memory Architecture

**Riccardo Tedeschi<sup>1</sup>**, Gianmarco Ottavi<sup>1</sup>, Côme Allart<sup>2,3</sup>, Nils Wistoff<sup>4</sup>, Zexin Fu<sup>4</sup>, Filippo Grillotti<sup>5</sup>, Fabio De Ambroggi<sup>5</sup>, Elio Guidetti<sup>5</sup>, Jean-Baptiste Rigaud<sup>3</sup>, Olivier Potin<sup>3</sup>, Jean Roch Coulon<sup>2</sup>, César Fuguet<sup>6</sup>, Luca Benini<sup>1,4</sup>, Davide Rossi<sup>1</sup>

<sup>1</sup>University of Bologna, Italy; <sup>2</sup>Thales DIS, France; <sup>3</sup>Mines Saint-Etienne, France; <sup>4</sup>ETH Zürich, Switzerland; <sup>5</sup>STMicroelectronics, Italy; <sup>6</sup>Inria, France

#### **PULP Platform**

Open Source Hardware, the way it should be!



















































































- → The growing demand for autonomy in critical fields like automotive, industrial automation, and aerospace is driving the need for high-performance CPUs
- → An increasing number of open-source RISC-V cores are targeting high-performance applications, including BlackParrot, BOOM, XiangShan, and Xuantie C910

CVA6<sup>1</sup> is a configurable 64/32-bit RISC-V core originally developed by the PULP Platform and now maintained by the OpenHW Group with the support of multiple industrial and academic partners:

- → **6-stage** pipeline: two-stage Instruction Fetch (IF), Instruction Decode (ID), Instruction Issue (IS), Instruction Execute (IE), and Writeback (WB)
- → In-order dispatch, out-of-order completion, in-order commit

<sup>1</sup>F. Zaruba, "The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology", IEEE VLSI, 2019

















































# Background: from CVA6 to CVA6S



CVA6 IPC (Instructions Per Clock) is constrained by its simple, scalar in-order front-end microarchitecture

CVA6S<sup>2</sup> is the <u>superscalar dual-issue version</u> of CVA6 by Thales, making the core suitable for more demanding workloads



- → ×2 instruction fetch width
- → ×2 decoding and issue logic
- → Secondary ALU

<sup>2</sup>C. Allart, "<u>Using a Performance Model to Implement a Superscalar CVA6</u>", ACM CF'24










































































































We present <u>CVA6S+</u>, which **builds on the CVA6S microarchitecture** with key enhancements aimed at **further boosting performance**:

- → Register renaming
- → Improved branch predictor
- → ALU-ALU forwarding
- → FPU integration in superscalar mode

Moreover, we integrate and evaluate **CVA6S+** with the OpenHW Core-V High-Performance L1 Data Cache (**HPDCache**)



















































#### CVA6S: the baseline





FPU support was out of scope for CVA6S












































































The evaluation is based on the **Embench-IoT suite** 



Instructions are already dual-issued in 30% of the cycles



















































































#### CVA6S+: what's new?





















































# CVA6S+: Private History Predictor



- → Legacy BHT predictor: 2 bits per entry
- → New PHBHT predictor with n bits of history: n + (2<sup>n</sup> × 2) bits per entry













[Figure: PHBHT example; labels: history NT, T, NT and outcomes NT, Taken, NT]
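To make the per-entry storage formula concrete, here is a minimal Python sketch of a single predictor entry that keeps its own n-bit history and one 2-bit saturating counter per history pattern. The class and function names, the counter reset value, and the way the history indexes the counters are assumptions made for illustration; the slides do not describe the actual hardware at this level of detail.

```python
from dataclasses import dataclass, field


def entry_bits(n: int) -> int:
    """Storage per entry: n history bits plus 2**n two-bit counters."""
    return n + (2 ** n) * 2


@dataclass
class PHBHTEntry:
    n: int                               # history length in bits
    history: int = 0                     # last n outcomes, newest in bit 0
    counters: list = field(init=False)   # one 2-bit counter per history pattern

    def __post_init__(self):
        # Assumed reset state: every counter starts at weakly not-taken (1).
        self.counters = [1] * (2 ** self.n)

    def predict(self) -> bool:
        # Counter values 2 and 3 predict taken, 0 and 1 predict not taken.
        return self.counters[self.history] >= 2

    def update(self, taken: bool) -> None:
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.n) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def mispredictions(entry: PHBHTEntry, outcomes) -> int:
    """Count wrong predictions while training on a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        wrong += entry.predict() != taken
        entry.update(taken)
    return wrong


alternating = [True, False] * 20
print(entry_bits(0), mispredictions(PHBHTEntry(n=0), alternating))
print(entry_bits(2), mispredictions(PHBHTEntry(n=2), alternating))
```

With n = 0 the entry degenerates to a plain 2-bit counter, which in this toy run never locks onto a strictly alternating branch, while the n = 2 entry stops mispredicting after a short warm-up.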

























































- → The scoreboard is a circular buffer
- → RAW hazards need to know the newest instruction to correctly forward data

Reorder the scoreboard based on the commit pointer

Scoreboard (SB)

| ID | Valid | rd |
|----|-------|----|
| 7  | 1     | 10 |
| 6  | 1     | 11 |
| 5  | 1     | 12 |
| 4  | 0     |    |
| 3  | 0     |    |
| 2  | 0     |    |
| 1  | 1     | 5  |
| 0  | 1     | 12 |

Reordered entries, as labeled in the figure: SB-ID 1, 0, 7, 6, 5

Instr. 0 and 5 both write register x12




































Forwarding logic based on SB-ID

[Figure: forwarding logic across registers x0 to x31]

Instr. 0 is newer than instr. 5, so instr. 0 has higher priority when forwarding x12
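The priority rule can be illustrated with a small Python sketch that walks the circular scoreboard from the commit pointer (oldest entry) towards the newest entry and keeps the youngest valid writer of a register. The contents mirror the scoreboard table shown earlier; the commit pointer value of 5 and all function names are assumptions made for this example, not taken from the RTL.

```python
NR_ENTRIES = 8

# (valid, rd) per scoreboard ID, mirroring the scoreboard table in the slides.
scoreboard = {
    7: (1, 10), 6: (1, 11), 5: (1, 12),
    4: (0, None), 3: (0, None), 2: (0, None),
    1: (1, 5), 0: (1, 12),
}


def age_order(commit_ptr: int) -> list:
    """Scoreboard IDs from oldest to newest, starting at the commit pointer."""
    return [(commit_ptr + i) % NR_ENTRIES for i in range(NR_ENTRIES)]


def newest_writer(rs: int, commit_ptr: int):
    """ID of the youngest valid entry that writes register rs, or None."""
    winner = None
    for sb_id in age_order(commit_ptr):
        valid, rd = scoreboard[sb_id]
        if valid and rd == rs:
            winner = sb_id  # a younger match overrides an older one
    return winner


COMMIT_PTR = 5  # assumed: entry 5 is the oldest in-flight instruction
print(age_order(COMMIT_PTR))
print(newest_writer(12, COMMIT_PTR))
```

Under that assumption, reading the valid entries of the age order from newest to oldest gives 1, 0, 7, 6, 5, which matches the SB-ID labels in the figure, and a reader of x12 is forwarded the result of instr. 0 rather than instr. 5.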

















#### CVA6S+: FPU support


































































































- → The ALUs operate separately when dual-issuing independent instructions
- → The ALUs are chained when dependent instructions can be fused
- → A **few selected operations** are **never chained**, to **preserve the critical path**
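The three cases above can be sketched as a small issue-time decision function in Python. This is only an illustration: the set of excluded operations, the choice to check the second instruction of the pair, and every name below are hypothetical, since the slides do not list which operations are excluded from chaining.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical exclusion list; the slides do not name the excluded operations.
NEVER_CHAINED = {"sll", "srl", "sra"}


@dataclass(frozen=True)
class AluOp:
    op: str
    rd: int
    rs1: int
    rs2: Optional[int] = None  # None for immediate forms


def depends(first: AluOp, second: AluOp) -> bool:
    """True if second reads the register written by first (RAW on a non-x0 register)."""
    return first.rd != 0 and first.rd in (second.rs1, second.rs2)


def issue_mode(first: AluOp, second: AluOp) -> str:
    if not depends(first, second):
        return "separate"  # independent: both ALUs work in parallel
    if second.op in NEVER_CHAINED:
        return "single"    # assumed fallback: issue only the first instruction
    return "chained"       # result of the first ALU feeds the second ALU


print(issue_mode(AluOp("add", 5, 1, 2), AluOp("sub", 6, 3, 4)))
print(issue_mode(AluOp("add", 5, 1, 2), AluOp("sub", 6, 5, 4)))
print(issue_mode(AluOp("add", 5, 1, 2), AluOp("sll", 6, 5, 4)))
```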




















































































The **Embench-IoT suite** focuses on the **pipeline**:

- → All the cores use the same cache configuration
- → The working set is fully cached



+43.5% IPC versus baseline CVA6



+10.9% IPC versus CVA6S
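As a quick sanity check on how the two figures relate, the short Python calculation below derives the IPC gain of CVA6S over CVA6 that they imply. It assumes both percentages are ratios of the same aggregate Embench-IoT metric, so that they compose multiplicatively; if the aggregates were computed differently, the result is only approximate.

```python
ipc_gain_vs_cva6 = 0.435   # CVA6S+ over CVA6, from the slide
ipc_gain_vs_cva6s = 0.109  # CVA6S+ over CVA6S, from the slide

# (CVA6S+ / CVA6) = (CVA6S+ / CVA6S) * (CVA6S / CVA6), under the stated assumption
implied_cva6s_vs_cva6 = (1 + ipc_gain_vs_cva6) / (1 + ipc_gain_vs_cva6s) - 1

print(f"implied CVA6S gain over CVA6: {implied_cva6s_vs_cva6:+.1%}")
```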






























































































































#### Evaluation setup:

- → **GF22 nm** CMOS technology
- → Worst timing corner
- → CVA6S+ versus CVA6
- → Same cache configuration

Pipeline area delta: +28%

Total area delta: +9%

Max. frequency: 1090 MHz (-0.5% vs CVA6)

All of this to obtain a 43.5% IPC improvement
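The numbers on this slide can be combined into a rough throughput estimate, since instructions per second scale with IPC times clock frequency. The Python sketch below is a back-of-the-envelope calculation using only the slide's rounded values; the derived CVA6 frequency and the performance-per-area figure are therefore approximations, not reported results.

```python
ipc_gain = 0.435          # from the slide
freq_delta = -0.005       # from the slide
total_area_delta = 0.09   # from the slide
cva6s_plus_fmax_mhz = 1090

# Instructions per second scale with IPC * frequency.
throughput_gain = (1 + ipc_gain) * (1 + freq_delta) - 1

# CVA6 frequency implied by the rounded -0.5% delta (approximate).
cva6_fmax_mhz = cva6s_plus_fmax_mhz / (1 + freq_delta)

# Throughput gain normalized by the total area increase.
perf_per_area_gain = (1 + throughput_gain) / (1 + total_area_delta) - 1

print(f"throughput gain: {throughput_gain:+.1%}")
print(f"implied CVA6 fmax: {cva6_fmax_mhz:.0f} MHz")
print(f"throughput per area: {perf_per_area_gain:+.1%}")
```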















#### CVA6S+: what about the cache?























































## HPDCache: Open-Source High-Performance L1 D\$



- → Performance-Optimized Design: features a pipelined microarchitecture and single-cycle read/write hit latency
- → Out-of-Order Execution & Non-Blocking: handles requests out of order to avoid head-of-line blocking
- → Highly Configurable Architecture: supports both <u>write-back (WB) and write-through (WT) policies</u> at <u>cache-line granularity</u>, and includes configurable associativity, request port count, and data widths
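To illustrate what selecting the write policy per cache line means, here is a toy Python model that counts memory writes for a line under each policy. It captures only the textbook WB/WT behaviour; the class and function names are invented for this sketch and do not correspond to the HPDCache interface or its configuration parameters.

```python
from enum import Enum


class Policy(Enum):
    WB = "write-back"
    WT = "write-through"


class ToyLine:
    """A single cache line carrying its own write policy (toy model)."""

    def __init__(self, policy: Policy):
        self.policy = policy
        self.dirty = False


def write_hit(line: ToyLine) -> int:
    """Return the number of memory writes triggered by a write hit."""
    if line.policy is Policy.WT:
        return 1           # the update is forwarded to memory right away
    line.dirty = True      # WB: defer the memory update until eviction
    return 0


def evict(line: ToyLine) -> int:
    """Return the number of memory writes triggered by evicting the line."""
    writes = 1 if line.dirty else 0
    line.dirty = False
    return writes


def memory_writes(policy: Policy, hits: int) -> int:
    line = ToyLine(policy)
    return sum(write_hit(line) for _ in range(hits)) + evict(line)


print(memory_writes(Policy.WB, 4), memory_writes(Policy.WT, 4))
```

In this model, four write hits followed by an eviction cost one memory write on a WB line and four on a WT line, which is the kind of trade-off a per-line policy lets software choose per data region.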



<sup>3</sup>C. Fuguet, "HPDcache: Open-Source High-Performance L1 Data Cache for RISC-V Cores", ACM CF'23










































#### **RaiderSTREAM**

















































The **RaiderSTREAM suite** focuses on the cache subsystem:

- The same CVA6S+ pipeline is tested with the legacy D\$ and the **HPDCache**
- The working set is 2× the cache capacity



+74.1% bandwidth by replacing the legacy D\$ with the HPDCache
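For intuition on the working-set constraint, the Python sketch below sizes the arrays of a STREAM-style triad kernel so that their combined footprint is twice the cache capacity. The 32 KiB capacity, the 8-byte elements, and the choice of the triad kernel are assumptions for illustration; the slides state neither the cache size nor which kernels were run.

```python
def triad_array_length(cache_bytes: int, ratio: float = 2.0,
                       elem_bytes: int = 8, arrays: int = 3) -> int:
    """Elements per array so that all arrays together span ratio * cache_bytes."""
    return int(cache_bytes * ratio) // (elem_bytes * arrays)


def triad(a, b, c, scalar):
    """STREAM-style triad: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


CACHE_BYTES = 32 * 1024  # hypothetical L1 D$ capacity
n = triad_array_length(CACHE_BYTES)
a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
triad(a, b, c, 3.0)
print(n, a[0])
```

Because the footprint exceeds the cache, every pass over the arrays keeps missing, so the measured bandwidth reflects how well the cache handles misses and refills rather than its hit latency.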




















































































#### Evaluation setup:

- → **GF22 nm** CMOS technology
- → Worst timing corner
- → Legacy Cache versus HPDCache
- → Same pipeline configuration

Cache area: -18.9% due to <u>better SRAM organization</u>, while providing <u>+74.1% bandwidth</u>!




































































- → We introduce CVA6S+, adding key features on top of CVA6S, the superscalar dual-issue extension of the CVA6 RISC-V application-class core
- → We integrate CVA6S+ with the OpenHW Core-V High-Performance L1 Data Cache (HPDCache)
- → We demonstrate 10.9% and 43.5% IPC improvements over CVA6S and CVA6, respectively, with an area overhead of less than 10% and only a 0.5% maximum frequency regression
- → We showcase the benefit of adopting the HPDCache, which improves bandwidth by 74.1% and reduces the cache area by 18.9%














# Thank you! Questions?

Institut für Integrierte Systeme – ETH Zürich Gloriastrasse 35 Zürich, Switzerland

DEI – Università di Bologna





















