Posters

Notes for poster presenters

Preparation before the conference:

Posters shall be printed in A0 format in portrait mode.
Each presenter shall bring their own poster on site.
There is no template for posters.
Make sure that the poster is easy to read from distance and attracts people, use QR codes to link to more content.
At least one author of the poster must register for the core conference (Tuesday 9 to Thursday 11).
Upload ASAP an update your PDF abstract on the submission web site:
- To add the authors’ names if the submission was blind.
- To fix typos, if any, if your submission was non-blind.
Upload your poster PDF on the submission web site before Friday May 29th AOE (Anywhere on Earth).

If you want to print your poster on-site instead of bringing it from home to Bologna, here is a list of print shops in downtdown Bologna. Do not hesitate to contact them directly:

At the conference:

Each posters will be displayed for a full day.
You will mount your own poster, no own tape allowed, you will get some.
Presenters are expected to stand next to their posters during breaks and lunches.
The exhibition and poster area will be open only during breaks and lunches.
The absracts and poster PDFs will be published online on the conference web site.

Accepted posters

Check this page regularly for the schedule and location of posters!

Posters will soon be dispatched over the three days of the core conference
(Tuesday 9 to Thursday 11).

CREATOR: A RISC-V web simulator based on Sail specification language

Sub. #33798M.

Juan Carlos Cano Resa.

Abstract: This paper introduces a new version of the CREATOR tool. CREATOR is a web-based simulator that lets users code, compile, and execute assembly-language programs. The architectures that can be simulated in CREATOR are MIPS and RISC-V. Until now, CREATOR allowed the simulation of the RISC-V architecture with a reduced instruction set (RISC-V IMFD), but this has changed thanks to Sail. Sail is an instruction specification language used by RISC-V to define both its architecture and the entire instruction set. This language has enabled the development of a complete RISC-V architecture simulator for web environments, without requiring any installation. It also allows users to develop their own instructions for research purposes or to adapt the implementation of existing ones for their own industrial purposes. This simulator also integrates an editable cache memory module into its architecture to explore the functioning of the architecture in greater depth. The tool includes a multi-file code-editing module that allows users to import, export, and edit assembly-language programs; an integrated compiler; and a program-debugging module during execution. To develop this simulator for web environments, the simulator implementation in Sail was exported to a high-level language (C), and the exported code was transpiled with Emscripten to generate executable code for web environments while maintaining performance, as if it were a native application.

AdaMut-RV: FPGA-Accelerated RISC-V Fuzzing with Adaptive Mutation Operator Scheduling

Sub. #33SLKJ.

Zheng Huazhong.

Abstract: The exponential growth in RISC-V processor complexity challenges traditional functional verification. Both directed testing and constrained-random simulation are inefficient for modern architectural designs. While hardware fuzzing has emerged as a powerful alternative for uncovering deep microarchitectural bugs, existing software-based fuzzers are severely bottlenecked by slow RTL simulation speeds and suboptimal mutation strategies that lack adaptive guidance.We propose AdaMut-RV, a high-throughput, FPGA-accelerated fuzzing framework specifically optimized for RISC-V processor verification. Unlike software-bound solutions, AdaMut-RV offloads both the processor and the fuzzer onto FPGA hardware, enabling MHz-scale execution speeds. The core innovation of AdaMut-RV is an intelligent mutation-operator scheduler based on the Multi-Armed Bandit (MAB) reinforcement learning algorithm. By categorizing AFL-inspired mutation operators into nine distinct classes, our scheduler dynamically prioritizes those that yield the highest coverage gains based on real-time hardware feedback. This dynamic scheduling mechanism accelerates the exploration of processor design spaces and critical corner-case logic. Preliminary results demonstrate that AdaMut-RV significantly outperforms state-of-the-art software-based fuzzers. It achieves higher Control and Status Register Coverage while reaching the same coverage targets at a significantly faster rate.

The RISC-V test platform; an extension of the omnipresent RISC-V test environments

Sub. #3AVQMT.

Eloi Merino.

Abstract: This paper presents the riscv-test-platform, an enhanced set of environments built upon the riscv-test-env, designed to facilitate the execution of tests and benchmarks on RISC-V architectures atop bare-metal environments. Their balance between code complexity and features make them a flexible platform to execute software in both simulated and FPGA environments, bridging the gap between the two platforms, with the advantages that each provide. The result is a set of four environments that mimic the functionalities of the original riscv-test-env, with the addition of some benchmarking features, which exercise more parts of RTL designs and help verification teams spot mismatches in the early stages of development.

A RISC-V Accelerator for Convex Optimisation

Sub. #3HRCLU.

David Herrera Marti.

Abstract: We show how a processor with native extended floating point precision could be incorporated in algebraic subroutines in convex optimisation, namely in indirect matrix inversion methods like Conjugate Gradient, which are used in Interior Point Methods in the case of very large problem sizes. Also, an estimate is provided of the expected acceleration of the time to solution for a hardware running natively on extended precision. Specifically, when using indirect matrix inversion methods like Conjugate Gradient, which have lower complexity than direct methods and are therefore used in very large problems, we see that increasing the internal working precision reduces the time to solution by a factor that increases with the system size.

A Decoupled IOPMP Architecture: Open-Source Implementation with Distributed Bridges for Multi-Master SoCs

Sub. #3JPLFW.

Hongtuo Yuan.

Abstract: This paper presents a distributed Input-Output Physical Memory Protection (IOPMP) architecture featuring a decoupled I/O bridge and checker. The architecture enables distributed placement of I/O bridges to accommodate multi-master SoC configurations while sharing a single centralized checker, significantly reducing area overhead. The design is implemented in a 7nm process, with a single I/O bridge occupying less than 1200 µm². It incurs less than 0.2% performance overhead under 4KB large-packet transmission. The code has been merged into the OpenXiangShan Git repository and is available as open-source hardware.

Improving DSP Performance in Processors by Repurposing Existing Multiplier Architectures

Sub. #3MEHL7.

Sven Schönewald.

Abstract: Hearing loss is among the most prevalent sensory impairments worldwide. Hearing aids that incorporate adaptable, personalized signal processing have the potential to improve communication outcomes and social participation for affected individuals. The development and rigorous evaluation of novel hearing-aid algorithms requires high-level programmable, low-power, and portable behind-the-ear (BTE) research platforms that enable studies in real-world environments. The RISC-V open source instruction set architecture (ISA) presents a flexible and configurable baseline for the development of signal processing architectures. This paper presents a lightweight single instruction multiple data (SIMD) architecture featuring a complex number extension targeted at embedded applications with a specific focus on the energy efficiency by reusing existing multiplier hardware. Performance data is gathered from a reference Fast Fourier Transform (FFT) implementation, with FFT sizes taken from the field of typical hearing aid applications. Power values are obtained by synthesizing the baseline processor and the extension in a 22 nm FD-SOI technology, followed by a gate-level simulation to obtain accurate switching activity values. The implemented extension achieves a speedup of up to 26 %. A comparison of the energy values through gate-level simulation reveals that the energy consumption of the modified design decreases by 27 % on average.

Maximizing Performance at Low Area Cost in RISC-V Processors Leveraging Fine-Grained Multithreading

Sub. #3TQ3K9.

Arbi.

Abstract: Embedded applications increasingly rely on highly efficient, low-area RISC-V processors. However, short 3-stage pipelines often suffer from data and control hazards that degrade performance by introducing frequent stalls. This paper presents the implementation of Fine-Grained Multithreading (FGMT) on [OMITTED FOR BLIND REVIEW], an industrial 32-bit RISC-V core supporting the RV32ECM instruction set. By interleaving two hardware threads, the design effectively hides pipeline stalls and simplifies branch target calculation without requiring complex branch prediction. To further mitigate structural hazards and the underutilization caused by inactive thread contexts in fixed round-robin scheduling, we introduce a novel “Thread Forwarding” (TF) technique which enables a form of TLP (Thread-Level Parallelism). Implemented in 40nm (C40) technology with a target frequency of 300MHz, the standard FGMT achieves a 14\% IPC improvement over the baseline at the cost of a 9\% area overhead within an MCU SoC featuring 16 KBytes of instruction and data memory. The TF architecture achieves a Pareto optimal configuration, further boosting IPC to 0.958 (+18.8\% over baseline) while maintaining the same area footprint as the standard FGMT implementation.

CHERI-VP: Evaluating CHERI Early for Embedded RISC-V Systems with Virtual Prototypes

Sub. #3X9NJV.

Spandan Das.

Abstract: The adoption of capability-based architectures such as Capability Hardware Enhanced RISC Instructions (CHERI) in constrained RISC-V systems raises open questions regarding performance overheads, verification complexity, and practical evaluation methodologies. Virtual prototyping provides an effective means to explore these questions early in the design process, before committing to Register-Transfer Level (RTL) implementations. In this paper, we present a CHERI-enabled RISC-V Virtual Prototype (VP) targeting constrained embedded systems and demonstrate its use for early architectural evaluation. We describe VP-based verification workflows for both software and hardware and report early performance insights focusing on CHERI tagged memory management. Our experiences highlight the benefits of VPs for guiding CHERI adoption decisions and identify practical challenges, including the need for lightweight benchmarks suitable for constrained environments.

Conflict to Compliance: RISC-V Extension Migration Across Spec, HW, and SW

Sub. #3XNCRH.

Afonso Oliveira.

Abstract: Non-compliant RISC-V extensions remain a practical obstacle for custom and legacy CPU designs. We present a two-phase workflow that uses the RISC-V Unified Database (UDB) as the source of truth for extension definition and a Large Language Model (LLM) connected through Model Context Protocol (MCP) tools to accelerate migration into compliant custom opcode space. In Phase 1, the agent inspects instruction encodings, identifies conflicts against ratified and reserved space, and proposes remappings for human approval. In Phase 2, it generates hardware and software artifacts from the approved mapping and validates them through automated build-and-test loops. To evaluate the flow, we are open-sourcing two packed-SIMD extensions and the supporting hardware and software, including a full-system simulator, GNU assembler support, Zephyr runtime support and even the RTL design for the more than 140 migrated instructions. The result is a working open-source end-to-end stack, from decoder to application code, demonstrating AI assistance as a practical aid for ISA architects and as an automation layer for the associated hardware and software engineering.

Quantum Computing Simulation on RISC-V: Vector and Multithreaded Evaluation

Sub. #3Y7HQP.

Rebeca Rasco Flores.

Abstract: Classical quantum computing simulation is computationally demanding due to exponential state-vector growth.This work evaluates parallelization strategies on the RISC-V SpacemiT K1 using the RISC-V Vector Extension (RVV v1.0) and OpenMP. The dominant qubit-wise multiplication kernel was implemented in four variants: Sequential, OpenMP (MIMD), RVV vectorized (SIMD), and hybrid (OpenMP+RVV). Benchmarks up to 30 qubits 2³⁰ show size-dependent behavior: SIMD benefits small systems, multithreading improves medium scales, and large systems become memory-bound. The hybrid configuration achieves a peak speedup of 72.1× at 16 qubits and maintains 34.7× at 30 qubits, demonstrating the benefits of vector extensions and multi-core parallelism for quantum computing simulation workloads.

Generator-Driven Functional Safety for RISC-V SoCs with Formal Assurance

Sub. #7AQXJK.

Frederik Haxel.

Abstract: Functional safety (FuSa) in modern SoC designs demands rigorous fault detection mechanisms alongside standardized error reporting. We present a fully automated, generator-driven design flow that automatically applies dual modular redundancy (DMR) through a pass implemented in CIRCT, an MLIR-based hardware compiler framework, without requiring manual RTL modification. To validate the correctness of the generated design, we apply formal verification, providing strong assurance that the DMR composition itself introduces no spurious faults. In addition, we address the system-level integration of the generated fault detection signals by routing them to a safety controller that adheres to the “RISC-V RERI Architecture Specification” for error reporting across the SoC, capturing each error’s severity, nature, and location. We validate our generation flow through fault injection, demonstrating reliable fault detection across arbitrary hardware modules and correct propagation, recording, and reporting of detected errors in the safety controller. Combined, our contributions form an automated path from module-level fault hardening to system-level error observability, advancing the practical adoption of FuSa practices in generator-based RISC-V SoCs.

Unlocking High-Performance AVX2 Emulation with RVV 1.0

Sub. #7FMCFX.

Paris Oplopoios.

Abstract: The x86-64 instruction set has a long history of backwards compatibility, and a lot of performance-intensive software, most of which may never be ported to RISC-V. While existing emulators support the Advanced Vector Extensions (AVX) extension, none do so using the RISC-V Vector (RVV) extension directly. We implemented AVX and AVX2 support in Placeholder using RVV 1.0 and compare performance with existing implementations that don’t utilize RVV 1.0. Additionally, we measure the performance benefit of enabling AVX2 support in benchmarks and compare it with the performance benefit in x86-64 hardware.

An Embedded RISC-V Vector Extension for Edge-Oriented Acceleration

Sub. #7TD7TB.

Iñigo Diez de Ulzurrun.

Abstract: This work details a high-performance Vector Processing Unit (VPU) architecture designed to exploit data-Level Parallelism (DLP) within the strict power and area constraints of embedded environments. Addressing the parallelization needs of data-intensive tasks, the proposed modular architecture implements a subset of the RISC-V Vector (RVV) Zve32x sub-extension, focusing on essential 32-bit integer operations. The VPU is integrated as a co-processor to a CV32E20 core within the eXtendable Heterogeneous Energy-efficient Platform (X-HEEP) ecosystem. It leverages the Core-V eXtension Interface (CV-X-IF) 1.0 for low-latency instruction offloading and the Open Bus Interface (OBI) v1.0 protocol to ensure high-throughput data memory access during load/store operations. The implementation, featuring a Vector Register Length (VLEN) of 128 bits, was validated through Register Transfer Level (RTL) simulation and in hardware using a Xilinx Pynq-Z2 FPGA. Performance was evaluated using standard data-parallel kernels including SAXPY, Indexed Arithmetic, and Matrix Multiplication (Matmul). Additionally, this research investigates the RISC-V GNU Compiler Toolchain, comparing standard C auto-vectorization against manual vectorization using RISC-V Vector C Intrinsics.

Pre-silicon Robustness Assessment of RISC-V Cores using bit-accurate FPGA fault injection

Sub. #87KRJS.

Ilya Tuzov.

Abstract: FPGA fault injection (FFI) is a well-known technique for verification and robustness assessment of critical systems. However existing FFI tools for current generation FPGAs support only FPGA-specific fault models irrelevant for ASIC prototypes, and feature very coarse-grain analysis insufficient for localization of dependability bottlenecks in the design. To address these limitations have developed a bit-accurate FFI tool (BAFFI), capable of emulating ASIC (RTL) faults at the level of individual netlist cells. This paper explains how BAFFI can be used to obtain robustness estimates for RTL designs and exemplifies this through a case study of an open-source RISC-V SoC.

High-Performance CRC/EC Acceleration for RISC-V Server Storage via Novel ISA Extensions

Sub. #88TJXG.

Fengrui Sun and Zhanheng Yang.

Abstract: Data integrity and fault tolerance are prerequisites for RISC-V adoption in enterprise server environments, relying heavily on Cyclic Redundancy Check (CRC) and Erasure Coding (EC) for storage reliability and network transmission. Currently, the RISC-V ISA lacks the dedicated hardware acceleration found in mature architectures such as x86, leading to significant overhead in implementations. We propose novel ISA extensions to bridge this gap: a fused carry-less multiply-add instruction for CRC folding achieving up to 4x speedup, and a specialized GF(2^8) multiply-accumulate instruction for EC delivering 4x throughput gains over vectorized baselines. Evaluation confirms that these extensions significantly enhance data path efficiency, positioning RISC-V as a competitive architecture for reliable, high-performance storage and networking systems.

AI inference on bare-metal RISC-V Microcontrollers: A comparison of ExecuTorch and IREE/MLIR

Sub. #8BXPC9.

Jeremy Bennett.

Abstract: We have previously demonstrated that it is practical to bring up ExecuTorch on a low power bare metal microcontroller. ExecuTorch is a project derived from the PyTorch AI framework for inference on embedded devices using traditional eager (“interpreted”) evaluation of AI models. In this paper, we provide a short overview of how to run ExecuTorch on a bare metal microcontroller. We then illustrate the features of 32-bit RISC-V [4] which make it attractive for use in edge AI applications, using the Open Hardware Foundation’s CORE-V CV32E40Pv2 microcontroller as deployed in a real world design by two of the co-authors and their colleagues at . We have now ported IREE for the same platform. IREE is a Linux Foundation experimental project, which uses lazy (“compiled”) evaluation of AI models, with LLVM MLIR as an intermediate representation. We give a short overview of how to run IREE on a bare metal microcontroller, and then assess what aspects of 32-bit RISC-V make it attractive for IREE. We conclude by comparing the feasibility of using IREE instead of ExecuTorch and an assessment on the performance of both when carrying out AI inference.

Runtime Reconfiguration of Decoders in Minimal-area RISC-V Cores

Sub. #8MCLHT.

Lukas Glantschnig and Tobias Scheipel.

Abstract: Processor implementations designed to occupy minimal areas, such as SERV or FazyRV, are becoming increasingly popular. Some of these designs focus on flexibility and configurability while maintaining their compact design. However, due to their minimal area, implementations often involve compromises in specific components to achieve this level of efficiency. The FazyRV decoder, e.g., is highly optimized for area and therefore omits certain checks for illegal instructions. To address these drawbacks, we propose a concept that uses partial runtime reconfiguration to dynamically replace the decoder’s logic with a more robust variant to enable stricter instruction checking. These modifications introduce an area overhead of up to 39% more flip-flops than the original implementation. Dynamic partial reconfiguration can be triggered during runtime via a memory-mapped register, enabling the processor to continue normal operation seamlessly.

Locality-Aware Sparse Matrix Multiplication on RISC-V RVV

Sub. #8QKTQW.

Andrea Herrerías León.

Abstract: Sparse matrix–dense matrix multiplication (SpMM) is a fundamental workload in high-performance computing and emerging edge workloads, yet its performance is typically memory-bound due to irregular and indirect memory accesses. While the RISC-V Vector Extension (RVV) provides flexible data-parallel execution, efficiently exploiting it for sparse workloads remains challenging.

This work evaluates an iterative SpMM kernel on an RVV-enabled RISC-V processor (Spacemit X60, 8 cores) and investigates the combined impact of locality-aware data layout and explicit vectorization. We compare scalar, compiler-vectorized, library-based, and manual intrinsic implementations. Additionally, we apply Morton (Z-order) reordering to improve spatial locality in memory.

Experimental results show that vectorization alone provides limited benefits in memory-bound regimes. However, when combined with Morton reordering, manual RVV vectorization achieves the best performance. Microarchitectural analysis confirms reduced cache misses and improved IPC, although the workload remains fundamentally bandwidth-limited.

The study highlights the importance of data layout co-design when targeting sparse workloads on emerging RISC-V platforms.

Partial-VL is a First-Class Optimization Tool: Fastest CRC for DPDK on RISC-V

Sub. #8QKWEG.

Dr. Philipp Tomsich.

Abstract: We present an optimised CRC32/CRC16 implementation for DPDK using the Zvbc vector carry-less multiply extension. For infrastructure silicon — including SmartNICs targeting AI networking workloads where RDMA-capable packet processing must keep GPUs fed — DPDK on RISC-V needs CRC performance at parity with x86 and Arm.

Starting from a direct port of the Arm NEON folding-and-Barrett-reduction approach, we apply a series of RISC-V-specific refinements that reframe RVV’s variable vector length (VL) and tail-undisturbed policy as general-purpose optimization mechanisms — not merely loop-length controls. The tail-undisturbed Barrett reduction eliminates two 128-bit mask-constant loads, two vand instructions, and one vxor compared to the Arm NEON source, removing 32 bytes of constant data. VL=1 carry-less multiply eliminates ~5-instruction scalar register-file round-trips at each call site. These refinements cascade: each narrows the set of live elements, enabling further simplifications — 128-bit byte-shifts become 64-bit bit-shifts; stack-allocated sequences for CRC seed injection and short-data handling become register-only using tail-undisturbed loads and construction.

The unifying insight is that VL serves as a first-class control knob equivalent to PCLMULQDQ’s imm8 lane selector on x86 and explicit mask constants on Arm — but applicable to every vector instruction, not just one opcode. The implementation requires only Zvbc with VLEN≥128 (both mandated by RVA23), uses LMUL=1, and integrates into DPDK’s existing multi-backend CRC framework with no API changes.

Implementation of Open RAN software in a RISC-V platform

Sub. #9G3V9G.

Javier Hormigo.

Abstract: The use of open-source stacks with the Open~RAN (Radio Access Network) architecture has been predominantly restricted to x86 and ARM architectures. This work presents the first successful porting of the srsRAN Project to RISC-V, targeting the low-cost Banana Pi BPI-F3 (SpacemiT K1 SoC) board. We describe the cross-compilation toolchain, the removal of AVX/NEON dependencies in favour of a scalar C fallback, and a preliminary performance evaluation across two 5G NR FDD scenarios. Profiling with Linux perf identifies key data-parallel physical layer (PHY) bottlenecks, establishing primary targets for RVV 1.0 vectorisation. Results show that RISC-V offers promising real-time MIMO performance for a single user, even with a scalar fallback, suggesting that vectorisation will elevate it to highly competitive levels.

RoRiV: Porting the RTOS RODOS on RISC-V for future satellite missions

Sub. #9GW78U.

Andreas Nüchter, Matthias Jung, Jonathan Hager, Sergio Montenegro and Andreas Theiner.

Abstract: This work implements the first port of the Real-Time Operating System RODOS on RISC-V. Specifically, the utilized RISC-V core is the RV32IM version of the PicoRV32. In order to develop and test the implementation, the PicoSoC, which is a System-on-a-Chip with a PicoRV32 core, is employed on the Basys3 FPGA board. RODOS provides benchmarks that give a rough estimate of the performance. These benchmarks of various versions of the PicoSoC are evaluated on the FPGA and compared to results of already existing ports. Our benchmarks show that the FPGA prototype is only 5 times slower than CPUs on real chips like an STM32F4. This shows that RISC-V is a promising platform for future applications of RODOS.

Custom RISC‑V SIMD Matrix Extensions with LLVM Support

Sub. #9MT9QK.

Catalin Ciobanu and Alexandru Puscasu.

Abstract: The development of our tightly coupled SIMD/Vector accelerator for matrix operations requires extending the RISC-V instruction set. Special compiler support is required for this extension. Our methodology starts from a Sail description of the ISA extension and generates the compiler target description data. The instructions are described in Sail and are tested in the generated simulator. The compiler is generated from the description model and is tested with the accelerator implemented in hardware. The experimental results suggest that for matrix multiplication we obtained speed-ups up to 1413x compared to an ARM A72 core.

Energy-Efficient RISC-V based neuromorphic SoC for Edge AI Applications

Sub. #9PWJCZ.

wenfei.

Abstract: Spiking Neural Networks (SNNs) offer significant energy efficiency for Edge AI, yet their event-driven nature leads to unpredictable, variable-length output data. In traditional heterogeneous SoCs, this unpredictability causes high CPU overhead and bus inefficiency. This paper presents a specialized Event-Adaptive DMA (EA-DMA) integrated into a RISC-V based SoC. Unlike standard DMAs, the proposed engine performs buffer-triggered, variable-length transfers with maximum-size clamping and hardware backpressure for irregular SNN spike traffic. This work provides a scalable solution for integrating neuromorphic accelerators into the RISC-V ecosystem.

The art of zeroing on CHERI RISC-V systems

Sub. #9S7HBH.

Yuecheng Wang.

Abstract: Memory zeroing is a common operation for enforcing system security. Zeroing is used to clear memory contents to prevent information leakage and initialise memory contents to prevent uninitialized memory access. Vendors such as Intel and ARM support fast memory zeroing instructions to improve system performance efficiency. The cache management operation (CMO) extension has also been recently been added to RISC-V which can be used for improving memory zeroing performance. Compared to the standard systems, memory zeroing is more frequently used in capability systems such as CHERI to prevent capability leakage. In this work, we evaluate different memory zeroing strategies on CHERI, and implement hardware support for improving the performance and efficiency of memory zeroing on CHERI-Toooba: a CHERI-extended RISC-V CPU.

Priority-Aware Scheduling of Multi-Model, Multi-Precision DNN Inference on Multi-Cores RISC-V

Sub. #A77QAQ.

PGA and GARREAU.

Abstract: Efficient deployment of Deep Learning (DL) models on RISC-V-based multi-core platforms remains a significant challenge, especially when multiple models with heterogeneous structures and precision requirements must run concurrently. Existing frameworks offer optimized execution for single-model inference but lack support for multi-model scheduling, as well as priority-based resource allocation. In this work, we extend the capabilities of such frameworks by formalizing the problem of multi-model, multiprecision inference scheduling on constrained many-core architectures like Parallel Ultra-Low Power (PULP). We define a scheduling space where multiple Deep Neural Networks (DNNs), varying in size, type and precision, compete for limited computing and memory resources. We introduce a simple, priority-aware scheduling layer that allocates cores and memory tiles across models, aiming to either minimize overall inference latency or find a tradeoff satisfying each model’s deadline. To demonstrate the effectiveness of our approach, we leverage the existing Deployment Oriented to memoRY (DORY) framework, and apply a greedy scheduling strategy. We conducted experiments with several models across several tasks and showed that even basic scheduling policies can significantly improve latency, core utilization, and memory efficiency over static and sequential baselines.

Architectural Scalability Trade-Offs in an RISC-V Vector Processor for Communication Kernels

Sub. #A8BV3C.

Keivan Fayyazifard.

Abstract: Communication baseband workloads such as covariance estimation, synchronization and reduction operations exhibit substantial data-level parallelism. The RISC-V Vector Extension (RVV) introduces vector-length agnostic (VLA) execution, enabling scalable vector implementations independent of a fixed hardware width. In this work, we explore architectural scalability trade-offs of a configurable RVV-based vector processor across VLEN, lane count, and lane width. Using representative communication kernels and synthesis with the predictive ASAP7 PDK, we analyze architectural scaling behavior and the interaction between cycle reduction and frequency degradation. While increasing VLEN reduces cycle counts, critical-path growth and bandwidth imbalance introduce a parallelism–frequency trade-off that yields kernel-dependent optimal configurations. We further demonstrate how a lightweight custom vector complex multiplication instruction improves efficiency for covariance-based workloads. The results highlight the importance of balanced compute–memory design for practical and physically realizable RVV implementations.

Exhaustive Security Verification of Access Control in Processors

Sub. #A9JD3J.

Anna Duque Antón.

Abstract: Access control is a foundation of security and is implemented in the hardware of Systems-on-chip. The entire system stack relies on the secure and correct functioning of these access control mechanisms. However, contemporary security verification methods face major challenges in exhaustively detecting targeted security vulnerabilities while also being scalable. We address these challenges with a novel formulation of security property sets. Our approach introduces interlocked property sets, which have a mathematical characteristic that enables scalable and exhaustive verification of general security targets. We propose an interlocked property set for access control verification in processors and have evaluated our approach in several case studies on RISC-V processor cores. Our approach detected multiple security vulnerabilities.

CHIMERA: Cryptographic Hardware for Integrated Multipurpose Engine on RISC-V with ASCON

Sub. #ACRS3G.

Valeria Piscopo, aledolme and Enrico Manfredi.

Abstract: As the NIST Lightweight Cryptography (LWC) standard, ASCON is pivotal for securing IoT ecosystems. This work presents CHIMERA, a multipurpose cryptographic engine for RISC-V, supporting AEAD and Hashing. We propose two architectural paradigms integrated via the Core-V eXtension Interface (CV-X-IF): a high-performance Complete Round (CR) version utilizing a state-register bank, and a minimalist Bitwise Rotation Unit (BRU) version focusing on Instruction Set Extensions (ISE). Our designs suit throughput-critical workloads, delivering up to 6x speed-up, as well as footprint-constrained deployments on ASIC and FPGA.

InterFinder: A Framework for Memory Interference Analysis in RISC-V Vectors

Sub. #AHPSLC.

Matoussi.

Abstract: Vector architectures are widely used in multicore systems to exploit data-level parallelism, but their bursty, high-bandwidth memory behavior can exacerbate contention for shared resources such as caches and DRAM, increasing timing variability. This paper introduces the principles behind InterFinder, a unified interference analysis framework for RISC-V vector architectures that combines compiler-based analysis, formal SW/HW modeling, and microarchitectural abstraction to support interference-aware timing analysis.

STRiVe-VP: LLVM-based performance simulator for RISC-V processors

Sub. #ALLVDR.

Giovanni Di Guardo and Giorgio Marletta.

Abstract: In this paper we present STRiVe-VP, a hybrid RISC-V simulation framework that unifies functional and timing simulation by leveraging LLVM’s compiler infrastructure. Built on RISC-V VP++, it translates executed instructions into LLVM MCInst and LLVM MCA Instruction objects, which are injected into an extended LLVM MCA pipeline. Custom hardware units (cache, prefetch buffer, branch predictor) are modeled, allowing the combination of static scheduling information with dynamic effects from control flow and memory behavior. This direct integration enables timing-aware decisions using live architectural state and provides unified functional and timing debugging. Validation against an FPGA prototype of an in-order, single-issue rv32emc_zfinx core shows that STRiVe-VP matches FPGA cycle counts exactly for several benchmarks and across multiple optimization levels, demonstrating cycle-accurate performance estimation and a solid basis for extending to more complex RISC-V microarchitectures.

Shield-XS: A Lightweight Dynamic Security Isolation for RISC-V

Sub. #AMVGWA.

yuanmiaomiao.

Abstract: Efficient resource isolation remain critical challenges for RISC-V-based cloud computing, where workloads such as confidential virtual machines (CVMs) and containers face threats from unauthorized memory access, DMA attacks, and cross-privilege&cross-workload attacks. Existing hardware isolation mechanisms like Physical Memory Protection (PMP) suffer from static resource partitioning and inflexible dynamic configuration. This work presents Shield-XS, a lightweight workload isolation model for RISC-V. Shield-XS leverages a bitmap-based mechanism with a Shield-bit (1 bit per 4KB physical page) to dynamically mark and isolate sensitive and normal workloads. It implements hardware-enforced access control integrated into the Memory Manage Unit pipeline, with a dedicated Bitmap Cache to minimize performance overhead. It supports configurable one-way isolation, blocking unauthorized access from non-sensitive workloads to protected memory, I/O, and interrupt resources. Evaluated on a 7nm process, it introduces only 0.3% CPU hardware area overhead. SPEC06 benchmarking shows a mere 0.72% performance overhead. This work provides a flexible, low-overhead isolation solution for virtualization and containerization workloads within the RISC-V ecosystem.

LLM-Driven Multi-Agent Framework for Automated RISC-V Verification Stimulus Generation

Sub. #BREVML.

Nicholas Matus, Kavya Sri Endukuri and Radha Govindaradjou.

Abstract: Writing verification stimulus for RISC-V processors requires deep expertise across ISA specifications, microarchitectural implementation, and test framework APIs. We present an LLM-driven multi-agent framework that transforms a brief natural-language scenario description into a comprehensive, executable RISC-V test generator. Five specialized AI agents form a sequential enrichment pipeline: an ISA expert expands intent into architecturally complete scenarios, an RTL analyst reads hardware source code to inject microarchitecture-targeted stress patterns, a framework specialist maps steps to concrete API calls, a builder synthesizes deployable code, and a validator ensures correctness through static checks and instruction-set-simulator execution. On the RISC-V Svadu extension, a 3-line scenario yields 490 lines of validated, simulation-passing code in under 13 minutes—a ∼40× speedup versus an estimated∼8 hours of manual effort.

Tightly Coupled Near-Memory Matrix Unit for RISC-V Embedded Computing

Sub. #BSGU8Y.

Juan Granja.

Abstract: This work presents the design of a tightly coupled near-memory computing unit compatible with a preliminary RISC-V Attached Matrix Extension. The proposed unit is designed to be integrated with a processor core through the Core-V eXtension Interface (CV-X-IF), enabling matrix operations to be decoded and executed directly in a processing unit attached to a system’s main memory. Instead of moving data into registers prior to computation, load instructions specify operand locations in main memory. Memory access and near-memory computation are deferred until the execution unit requires the operands. To evaluate the feasibility of the proposed architecture, a model of the unit is designed, implemented, and validated in the gem5 architectural simulator. This model serves as a first step to prove the concept and enables design-space exploration of the architecture. As a preliminary evaluation, a quantized convolutional neural network workload is executed on the simulator to assess the potential performance benefits of the approach, achieving a 47x speed-up with respect to a simulated processor baseline.

Enhancing Boot Time Security in RISC-V Leveraging Keccak Hardware Accelerator

Sub. #BVHAVM.

Utku Budak.

Abstract: Secure boot verifies the integrity and authenticity of code before execution; otherwise, it terminates the boot process. In contrast, measured boot produces verifiable evidence of code integrity at boot time, for example using the Device Identifier Composition Engine (DICE). However, in both mechanisms, hashing large code is the main performance bottleneck. This work combines secure boot and DICE-based measured boot, implements the design on a CVA6-based RISC-V platform, incorporates post-quantum cryptography for quantum-resistant secure boot, and accelerates computationally intensive hash computations through a custom Keccak hardware accelerator.

STARBUG: RISC-V Hint Instructions for Lightweight VLIW Execution on Embedded DSP Workloads

Sub. #BXQ73A.

Leo Marek.

Abstract: This paper presents a standards-aligned microarchitectural extension that leverages architecturally reserved RISC-V HINT encodings to enable lightweight Very Long Instruction Word (VLIW) execution while preserving full backward binary compatibility. Unlike conventional superscalar designs that rely on dynamic scheduling, speculative issue, and complex hazard detection, our approach encodes static scheduling decisions in HINT instructions that execute as NOPs on unmodified cores. Modified implementations interpret these hints to form statically scheduled issue bundles, achieving higher Instruction-Level Parallelism (ILP) without increasing ISA surface area or compromising compliance.

We validate the proposal through a full-stack methodology spanning ISA modeling, RTL implementation, and FPGA deployment. ISA semantics were prototyped using Google’s MPACT simulator to evaluate bundle formation and decode behavior. We then extended the OpenHW Group CVW (Wally) core to support 4-wide integer VLIW execution via a widened multi-ported register file and parallel datapaths. The design was verified in Questa and Verilator and synthesized for FPGA-based cycle-accurate measurement.

Evaluation on representative DSP kernels (FFT, FIR, IIR, and dot product) demonstrates substantial IPC and cycle-count improvements relative to scalar RV32I execution, while maintaining binary compatibility and toolchain transparency. The proposed mechanism provides a path for energy-efficient ILP extraction in embedded and domain-specific systems, illustrating how reserved ISA space can be systematically exploited to deliver microarchitectural innovation without ecosystem fragmentation.

Insights on high-performance code generation for early and future RISC-V vector systems

Sub. #C8TUDR.

asantana.

Abstract: To achieve both high performance and productivity, modern software stacks rely on linear algebra libraries that encapsulate decades of optimization efforts. These libraries derive from empirical studies and analytical models developed primarily for high-end general-purpose systems, particularly the vector/SIMD extensions of x86-64 processors (e.g., SSE, AVX, and AVX-512). As a new and open-standard architecture, RISC-V software programming models target systems with unprecedent hardware diversity since pivotal extensions, such as the RISC-V V, may be realized on processors for domains ranging from edge to supercomputing. In this work, we advocate for flexible code generation tools to foster vendor-agnostic performance on high-performance software stacks, highlighting the most impactful software-level performance optimizations on early RISC-V vector systems and our insights on how to handle them on the context of linear algebra library development.

AI-Driven Testlist Generation for RISC-V Core Verification

Sub. #CBLJYX.

Abhishek Rajgadia, Shubham Singla, Radha Govindaradjou and Vikas Dubey.

Abstract: Verifying modern RISC-V cores requires qualifying every merge request (MR) against a large and evolving test space spanning ISA extensions, micro-architectural features, and system-level scenarios. Manually selecting appropriate tests for each MR is time-consuming and error-prone, and does not scale with the rate of RTL changes. This work presents an AI-driven testlist generator that automatically derives MR-specific regression lists for a production RISC-V core verification environment. The tool analyzes Git diffs for an MR, infers impacted features using a combination of static rules and large language models (LLMs), and synthesizes targeted regressions across multiple test generators. The resulting flow reduces MR-qualification effort, improves repeatability, and provides a concrete path toward coverage-driven, closed-loop test selection for RISC-V core verification.

InjectV: Modeling Fault Injection Attacks in RISC-V Simulation Environment

Sub. #CEW33C.

Giorgio Fardo and Niccolò Lentini.

Abstract: Fault Injection Attacks (FIAs) induce transient hardware faults to subvert software security mechanisms, yet assessing fault resilience, especially during early design phases, remains impractical without specialized laboratory equipment. Microarchitectural simulation provides a reproducible and scalable alternative. This paper presents InjectV, a gem5-based fault injection framework targeting RISC-V systems, which employs trace-guided fault injection by identifying Candidate Injection Points (CIPs) at security-critical operations including control-flow branches and conditional comparisons. Supporting transient corruption of architectural registers and physical memory under full-system simulation, InjectV demonstrates that guided fault injection requires 95.8% fewer injections than random exploration to expose successful attacks on the FISSC VerifyPIN benchmarks.

End-to-End AI Compilation for RISC-V: A Multi-Level Optimization Approach

Sub. #CH97A7.

Hongbin Zhang.

Abstract: With the rapid evolution of RISC-V extensions such as Vector, Matrix, and other custom instructions, RISC-V platforms are becoming capable of executing modern AI models. However, achieving high-performance deployment while maintaining a unified software stack across diverse extensions remains a key challenge. This paper introduces Buddy Compiler, an end-to-end AI compiler designed to provide a unified interface for AI model integration, multi-level compilation optimizations, and extensible code generation targeting diverse RISC-V extensions. Buddy Compiler adopts a multi-level architecture consisting of a frontend, middle-section, and backend, enabling reusable high-level optimizations while supporting specialized backends for RISC-V architectures. The frontend provides a graph infrastructure that interfaces with mainstream AI frameworks and converts imported models into a unified representation expressed with high-level MLIR dialects. The middle-section is built on MLIR and performs multi-level compilation optimizations, including operator fusion, memory access optimization, and vectorization. The backend implements dedicated MLIR dialects for RISC-V extensions, such as RVV, IME, AME, and Gemmini, and performs target-specific code generation. Through multi-level compilation, Buddy Compiler and its runtime system enable efficient deployment of AI models on RISC-V platforms, achieving performance comparable to manually optimized implementations such as llama.cpp.

CVA6-RT: an Open-Source Time-Predictable RV64 Processor for Mixed-Criticality Systems

Sub. #DSMVHN.

Enrico Zelioli.

Abstract: This work presents CVA6-RT, a real-time micro-architectural extension of the CVA6 core to bound worst-case latency and reduce task’s timing execution variability. CVA6-RT implements the rv64gch ISA and features advanced support for real-time execution, including TLB partitioning and locking for predictable address translation, a dynamically reconfigurable scratchpad mode in the L1 caches for deterministic memory access, and low-latency interrupt handling via an enhanced interrupt controller combined with hardware-assisted context stacking. With real-time features enabled, CVA6-RT achieves an interrupt latency of 12 cycles, comparable to that of simpler Arm Cortex-M microcontrollers, and 10x lower than the baseline CVA6 core.

Window-Level Telemetry for Runtime Performance and Reliability Monitoring in RISC-V Systems

Sub. #DVVW8V.

Arda Öztürk.

Abstract: RISC-V–based processors and ML accelerators are increasingly targeted for latency-sensitive domains such as automotive Software-Defined Vehicle platforms and edge systems, where runtime observability is essential for performance validation and early fault diagnosis. Although RISC-V standardizes architectural and hardware performance monitoring counters, raw cumulative snapshots do not directly provide window-level deltas or streaming metrics required for real-time analytics. To bridge this gap, we present a monitoring tool that implements a window-level telemetry pipeline to enable real-time observability. It converts cumulative counters into per-window delta values, selects a curated metric set, and computes derived metrics. The resulting telemetry is recorded simultaneously as CSV and structured logs (NDJSON) and streamed to external consumers via ZeroMQ for runtime processing. The approach is validated using a cycle-level gem5 RISC-V simulation, demonstrating 2–3 ms host-side processing per 10 ms window with minimal overhead. The modular design incorporates a source-agnostic acquisition layer, allowing the input backend to be replaced by hardware performance counters with minimal changes to the core processing and output interfaces.

Beyond Bare-Metal: A Lightweight Cross-Privilege Framework for RISC-V RTL Security Evaluation

Sub. #DZKKKP.

Karim AIT LAHSSAINE.

Abstract: Mitigating transient execution attacks like Spectre in RISC-V processors requires cycle-accurate Register Transfer Level (RTL) simulation. However, existing methodologies face a severe dichotomy: simple bare-metal benchmarks lack crucial architectural features (e.g., virtual memory, privilege boundaries), while full-OS simulations incur prohibitive execution times. To bridge this gap, we propose a novel, lightweight RTL simulation framework that accurately models cross-privilege transitions (User and Supervisor modes) and virtual address translation without the overhead of a full OS payload. We validated this approach by simulating a realistic, cross-privilege Spectre-PHT attack on the out-of-order NaxRiscv core, achieving secret recovery in approximately 100,000 cycles. This drastically accelerates vulnerability characterization compared to Linux-boot environments. Ultimately, this low-noise environment provides hardware designers with an efficient tool to rapidly analyze transient vulnerabilities and evaluate the performance overhead of hardware countermeasures.

HBENCH: RISC-V Microbenchmark Suite

Sub. #EDH9EE.

Carlos Rojas Morales, Erick Brandon Cureño Contreras and Victor Asanza.

Abstract: We present HBENCH, a microbenchmark suite for instruction-level characterization of RISC-V (scalar and RISC-V Vector Extension (RVV)) and x86 (scalar and Advanced Vector Extensions (AVX)2), enabling accurate simulator performance models. HBENCH maps scalar and vector microkernels to gem5 latency groups and reports latency and peak throughput for Floating-Point (FP)32, FP64, and Integer (INT) operations. We evaluate a Banana Pi F3 (SpacemiT K1, X60, RVV 1.0, 256-bit Vector Length (VLEN)) and derive a gem5-compatible performance model. Coverage is validated against RIVEC workloads using dominant RVV instruction mixes. Our results span over 329 microkernels, providing per-latency-group latency and throughput, cache hierarchy probes, and Instructions Per Cycle (IPC)-based classification, demonstrating HBENCH’s ability to support high-accuracy instruction-level performance modeling.

From Leakage to Exploitability: Empirical Study of Cross-Process L1 Prime+Probe on RISC-V

Sub. #EFXQQP.

Fortunelli Gianmarco.

Abstract: Cache timing attacks against AES are well studied on x86 and ARM, but their end-to-end exploitability on commercially deployed RISC-V systems under realistic OS scheduling is less documented. This paper presents an experimental evaluation of a Prime+Probe attack targeting the private L1 data cache of a PolarFire SoC RISC-V platform running Linux, where attacker and victim are independent user-space processes time-multiplexed on the same core. We separate three stages, leakage observability, cache-set classification, and key inference, and show that first-round T-table lookups induce measurable per-set interference enabling reliable inference of the most significant 4 bits of AES key bytes. We also find substantial cache-set variability highlighting a practical gap between observable leakage and end-to-end exploitability on real RISC-V systems.

MAGIA-V: A Heterogeneous Zve32d+GEMM Tile for Emerging Mesh-of-Tiles Accelerators

Sub. #EHJ3MR.

Luca Balboni and Alessandro Nadalini.

Abstract: AI and HPC workloads demand scalable, efficient accelerator architectures. We present MAGIA-V, an open mesh-of-tiles accelerator template integrating a RISC-V Zve32d Spatz vector processor with a RedMulE tensor engine, enabling concurrent vector and matrix operations.

Towards Efficient Utilization of RISC-V Long Vector Register Files: A Characterization Study

Sub. #ELYZEY.

Álvaro Moreno.

Abstract: As RISC-V “Vector” (RVV 1.0) architectures scale, the Vector Register File (VRF) becomes a primary bottleneck in area and power. This work characterizes data residency and redundancy patterns in a distributed, long-vector VRF using the gem5 simulator. We identify two primary inefficiencies: resource fragmentation and an entropy-capacity mismatch in active data. Our evaluation of lightweight compression schemes reveals that a 2-entry dictionary-based approach consistently yields a 2.5x compression ratio of the computed vector elements. These results demonstrate that hardware-level data compaction is an interesting path for optimizing the area of future long-vector RISC-V accelerators

Open E-Trace Infrastructure: Tooling for Evaluation, Analysis, and Research

Sub. #ESQB3N.

Julian Ganz.

Abstract: Tracing allows capturing timing sensitive behavior that would be obscured by other means of extraction that rely on code running on the HART itself, such as debugging hardware or instrumentation. The ratified specification “Efficient Trace for RISC-V” (E-Trace) defines a highly compressed yet relatively simple RISC-V-specific instruction and data tracing format. In combination with the program binary, E-Traces allow the complete reconstruction of a program’s execution path. We present an open source Rust library and CLI tool that allows both inspection of traces via an intuitive text interface and converting traces to other formats for analysis by downstream tools. While proprietary solutions for consuming E-Traces exist, this is, to our knowledge, the first open source tool suitable for use in production. Our tooling makes E-Trace-based augmentation of CI flows feasible. Based on traces collected during program execution, the CLI tool enables additional checks and metrics (e.g., coverage). Engineers may also use the tool to gain a better understanding of a failure. Use-case-specific checks may be implemented using the library. We also developed a QEMU plugin that enables experimenting with and evaluating such CI and development flows before any hardware investment, significantly lowering the entry barrier. The plugin serves as a configurable trace encoder, controlled solely by plugin arguments, that produces a trace file on the host.

RISC-V Based SoC for Event-Based Sparse Convolutions

Sub. #ETWYRB.

Diego Gigena Ivanovich.

Abstract: In this work, we present a dedicated IP core optimized for energy-efficient convolution over highly sparse and unstructured input arrays, characteristic of event-based convolutional neural networks. The accelerator was integrated into a RISC-V-enabled system on chip (SoC) and taped out in 65nm TSMC technology to enable full post silicon characterization and to evaluate alternative sparse computation algorithm variants.

A Lightweight Multi-Context Architecture for Mixed-Criticality Systems on RISC-V Processors

Sub. #EZCAXM.

Giacomo Valente and Leonardo Fazzini.

Abstract: Mixed-criticality systems incorporating software components with different criticality levels demand strong isolation mechanisms to guarantee dependability. High-end and mid-end architectures accomplish this through rigorous temporal and spatial partitioning, backed by multiple privilege levels and memory management units. Nevertheless, low-end processors, constrained to two privilege levels, encounter difficulties in realizing effective temporal and spatial partitioning without undermining system composability. This paper presents a novel multi-context framework for low-end RISC-V processors, exploiting a lightweight hardware extension and enabling efficient temporal and spatial partitioning. The proposed approach not only guarantees robust isolation and system composability but also offers flexibility to trade off hardware and software overhead, pushing forward the state of the art in dependability for resource-constrained embedded systems.

Co-optimizing Custom Instructions RISC-V and LLM Specialized Accelerator for Attention-Based Edge AI

Sub. #EZY7K8.

Joaquin Cornejo.

Abstract: This work presents a co-optimized architecture for edge-based Transformers, focusing on a specialized RISC-V CPU designed to manage a parallel AI co-processor. While the BumBleBee (BBB) unit handles core Flash Attention Method (FAM) computations, the system relies on an adaptable RISC-V core for critical data orchestration and pre-processing. To overcome the bottlenecks of a memory-bound system, the CPU’s ISA is enhanced with custom fused instructions—convcat, lwincr, and swincr—which consolidate complex macro-operations into single-cycle actions. Notably, the convcat instruction reduces 13 F-extension instructions to one, cutting latency by over 50%. Furthermore, the CPU incorporates M and F extensions with data-gating in the ALU to minimize power consumption during scaling and normalization tasks. By prioritizing CPU-level adaptability and instruction fusion, the architecture significantly reduces the energy bill and latency required for high-performance LLM inference in power-constrained environments.

EMiX: Emulating Beyond Single-FPGA Limits

Sub. #F39Y3T.

Behzad Salami and Alexander Kropotov.

Abstract: FPGA-level emulation is a key step in pre-silicon chip design validation. However, emulating large-scale multi-core systems increasingly exceed the hardware resource capacity of a single FPGA, limiting the feasibility of full-system emulation. To address this challenge, we introduce EMiX, a scalable multi-FPGA framework that enables distributed emulation of multi-core RISC-V architectures beyond single-FPGA resource limits. EMiX systematically partitions a monolithic multi-core design into multiple components and deploys them across multiple interconnected FPGAs, effectively exploiting inter-FPGA interconnects to balance scalability and performance without requiring fundamental RTL redesign. We prototype EMiX with a 64-core architecture across eight interconnected Alveo U55c FPGAs (scalable on core and FPGA counts), successfully demonstrating full-system execution including Linux boot. EMiX will be released as an open-source platform.

Hardware-Synthesized Monitor-Actuator Design Patterns: a Proof-of-Concept Application

Sub. #F7UKA3.

Giann Spilere Nandi.

Abstract: As system complexity increases, so does the difficulty in demonstrating their overall correctness. The Monitor-Actuator design pattern is one the main approaches in the literature proposing ways to ensure that systems can work safely, even in the presence of undetected or unpatched system defects. This design pattern consists of coupling verification monitors to, at execution time, verify if a target system is executing as expected and intervene when needed. Therefore, maximizing the isolation between the target system and the monitoring unit becomes a fundamental factor to reduce mutual interference, both in functionality and in terms of computational overhead. This work presents a Monitor-Actuator proof of concept system developed for the PolarFire SoC Icicle Kit. The system consists of a target application executing on the PolarFire SoC’s processing system and a dedicated runtime verification monitor IP executing on the programmable logic unit. We detail how the monitor IP is generated from a formal specification that is used to, first synthesize it’s equivalent CPP code, and later serve as input to the process of high-level synthesis of hardware description language. The description of the development process and setup is designed to serve as a reference for future applications requiring low interference hardware synthesized runtime monitors capable of detecting user-specified property violations in a platform’s hardcore and softcore RISC-V processors.

Concolic Execution Guided Hybrid Whitebox Fuzzing for RISC-V Processors

Sub. #FFPCHP.

Zijian Jiang.

Abstract: Verification remains a key bottleneck in the design of modern RISC-V processors, particularly for deep corner cases that are difficult to reach with conventional verification techniques. Coverage-guided hardware fuzzing provides fast exploration, but often relies on coarse-grained coverage feedback and blind mutation, leading to shallow exploration. Symbolic and concolic methods offer control path reasoning, but their practicality is limited by path explosion and high solver cost on realistic RTL processor designs. We present a concolic execution guided hybrid whitebox fuzzing framework for RISC-V processors with FPGA acceleration. The framework combines RTL static analysis, concolic solving, and high-throughput fuzzing to balance exploration of hard-to-trigger deep processor behaviors with fuzzing efficiency. It extracts the processor control-flow graph from RTL, instruments synthesizable control path monitoring, and uses the collected path conditions to steer test generation toward high-value unexplored paths. We further map the DUT and fuzzer on FPGA programmable logic, while running concolic engine and SMT solver on the on-board ARM processor to accelerate the hybrid whitebox fuzzing process through an end-to-end heterogeneous architecture. We evaluate the approach on open-source RISC-V processors, including CVA6, Ibex, and PicoRV32. Results show that our approach can achieve 1.33x higher coverage than SOTA fuzzers and explore deep corner coverage points that are difficult to trigger with existing approaches.

PQCUARK: A Scalar RISC-V ISA Extension for ML-KEM and ML-DSA

Sub. #FGEDJB.

Xavier Carril Gil.

Abstract: Recent advances in quantum computing threaten conventional public-key cryptographic algorithms, necessitating the adoption of post-quantum schemes such as ML-KEM and ML-DSA. The performance of these schemes is constrained primarily by two computationally intensive kernels: the Number-Theoretic Transform (NTT) and the Keccak-f1600 permutation.

This work introduces PQCUARK, a scalar RISC-V Instruction Set Architecture (ISA) extension that accelerates both kernels through two tightly integrated units: a packed-SIMD butterfly unit for the NTT and a Keccak engine capable of delivering two rounds per cycle with direct access to the Load-Store Unit.

Implementation of PQCUARK on an RV64 core and deployment on an FPGA achieves up to a 10.1× speedup over NIST reference software and a 4.2× improvement over optimized implementations, surpassing state-of-the-art solutions by factors ranging from 1.4× to 12.3×. ASIC synthesis in GF22-FDSOI demonstrates only an 8% core-area overhead at 1.2 GHz, with no impact on the critical path.

An Open-Source RISC-V VM-Level TEE Architecture Implemented on XiangShan Processor

Sub. #FLFBGM.

Wenhao Wang.

Abstract: Trusted Execution Environments (TEEs) are essential for cloud security, with Confidential Virtual Machines (CVMs) as the prevailing approach. While proprietary solutions dominate deployments, the RISC-V ecosystem lacks mature open-source CVM implementations despite CoVE progress. This paper presents a VM-level TEE architecture on the open-source XiangShan RISC-V processor, featuring physical isolation of Enclave Management Tasks via dedicated secure cores. We implement bitmap-based page-granularity memory isolation and multi-key memory encryption for fine-grained access control and software-defined full-memory cryptographic protection. Evaluation on FPGA prototypes demonstrates minimal EMS area overhead (<1% of SoC area).

AME-PIM: Breaking the Memory Wall with RISC-V Matrix Extensions and HBM-PIM

Sub. #FP7AUZ.

Emanuele Venieri.

Abstract: Matrix workloads, essential in generative AI, increasingly rely on ISA-level (i.e. AMX, SME). The attached matrix extension (AME) is one of the three (IME, AME, VME) ISA extensions under standardization in RISC-V. In common, all these matrix-ISA assumes extensions of the processor datapath with dedicated matrix acceleration hardware. However, executing matrix kernels requires moving large tiles between memory and processor registers, making performance limited by memory bandwidth. We investigate whether High Bandwidth Memory with Processing-in-Memory (HBM–PIM) can serve as alternative implementation of AME instructions. We propose a PIM Execution Primitive (PEP) computational model mapping AME ISA onto Samsung Aquabolt-XL HBM-PIM microkernels, using an outer-product dataflow to enable in-memory accumulation, as well as remapping AME tile registers into memory regions—making possible to chain AME instructions without leaving the memory. Our experiments show AME tile multiplication reaching 14.9 GFLOP/s (59.4 FLOP/cycle) on a HBM–PIM pseudo-channel, demonstrating that HBM–PIM can serve as an implementation of RISC-V matrix extensions.

APEX: Accelerating FFT on CVA6 with a Tightly Coupled CV-X-IF Co-processor

Sub. #FWGCHN.

Abdul Wadood.

Abstract: The Fast Fourier Transform (FFT) is a fundamental algorithm in embedded and edge signal processing applications, including audio and speech processing, radar systems, and biomedical sensing, where real-time performance must be achieved under strict area and power constraints. Conventional approaches typically rely on dedicated standalone accelerators, but these often impose significant area and power overheads that are impractical for resource-constrained embedded and edge platforms. To address this, tightly-coupled acceleration within the CPU pipeline offers a more efficient alternative by delivering substantial performance gains without requiring an independent hardware block. This paper presents APEX, a tightly-coupled coprocessor integrated with the CV32A6 32-bit RISC-V processor, designed to provide high-performance FFT acceleration for embedded RISC-V systems. For a fixed-point FFT of size N=512, APEX achieves an 83.5% reduction in execution cycles and an 87.9% reduction in instruction count compared to the software FFT implementation on the baseline CV32A6, while preserving the baseline operating frequency and full RV32IM_Zicsr software compatibility with only minimal area overhead. These results demonstrate that APEX is an efficient and practical solution for accelerating FFT-intensive workloads in embedded and edge deployments built on open RISC-V architectures.

Functional Verification Strategy for a CVA6 MMU

Sub. #FWXFHU.

Tanuj Khandelwal.

Abstract: Modern processors implement complex features that require unique verification strategies to exhaustively verify the feature and achieve coverage goals faster. The memory management unit (MMU) within the CVA6, with multiple level of table, page tables, lookaside buffers (TLBs), and physical memory protection (PMP) capabilities, is one such feature. It is highly configurable and complex, making an exhaustive verification a real challenge. It requires smart management of different page table entries (PTEs) and PMP entries, to simulate different types of exceptions, page faults and PMP access errors. This work is done using an Universal Verification Method (UVM) framework provinding an efficient means creating PMPs and PTEs, thus simplifying the verification of MMU.

Spike-RTL: Two technologies for fast and accurate SW-RTL co-simulation

Sub. #FYCK9P.

Eugenio Villar.

Abstract: The verification of integrated systems traditionally relies on detailed Register Transfer Level (RTL) simulations to ensure functional correctness before hardware implementation. While RTL simulation provides cycle-accurate behavior and can even achieve event-level precision when combinational delays are modeled, it suffers from extremely long execution times. Simulating complex software workloads such as booting an operating system may require several days of simulation time. Instruction Set Simulators (ISS) provide a faster alternative for software execution. In the RISC-V ecosystem, Spike is the reference ISS and can achieve simulation speeds several orders of magnitude faster than equivalent RTL processor models. However, replacing the processor RTL model with an ISS introduces temporal discrepancies that may affect the accuracy of system-level simulations. This work presents Spike-RTL, a HW/SW co-simulation framework that integrates the Spike ISS with RTL models of the remaining hardware components. The tool supports both Verilog simulation and C/SystemC HW models (e.g. generated using Verilator). Experimental results show simulation speedups of up to 40× compared to Verilog simulation and 4× compared to Verilator, while maintaining timing errors on the order of 10%. The framework also introduces configurable timing models for instruction execution, cache miss latency integration, and variable-granularity synchronization mechanisms between ISS and RTL components.

Flying V: A Radiation-Hardened L1 Data Cache for RISC-V Aerospace Processors

Sub. #G8FEKT.

César Fuguet.

Abstract: Radiation-induced bit-flips in on-chip memories threaten the reliability of processor-based systems, particularly in aerospace applications. This work introduces ECC-based hardening for the HPDcache, an open-source L1 data cache compatible with RISC-V cores (e.g., CVA6). The design enables Single Error Correction and Double Error Detection (SECDED), thereby protecting SRAMs from transient faults. A scrubber further mitigates multi-bit errors by periodically refreshing cachelines. Implementation of an 8 KiB cache configuration in 45 nm technology shows a 2.1% core area overhead and an 8% clock frequency reduction. This is a first step towards a fully open-source RISC-V core with both safety features and a high-performance memory subsystem to address the increasing computing demand in aerospace applications.

Accelerating the Poseidon2 S-box in a RISC-V SoC with a 4×4 CGRA

Sub. #G93SVJ.

Cristian Campos.

Abstract: Cryptographic hash algorithms for zero-knowledge proof systems often rely on prime-field S-box kernels such as x⁷ mod p over 31-bit fields. We accelerate this class of S-box primitives on a 4×4 coarse-grained reconfigurable array (CGRA) integrated within a RISC-V SoC. As a case study, we use the BabyBear instantiation adopted by the state-of-the-art Poseidon2 hash function, employing Barrett reduction to avoid software division on the host core. Our mapping decomposes operands into 8-bit limbs across CGRA processing elements and exploits the toroidal mesh for carry propagation in 4 hops. Compared to a hand-optimized baseline, we achieve 1.26× speedup and 25.7% energy reduction; versus an automatic compiler, we improve by 6.6× speedup and save 82% energy. Cycle-accurate RTL simulation of a full Poseidon2 integration shows ~3.3× fewer cycles than the RISC-V host for the full 141-invocation workload at 100 MHz (even a ~1.3× reduction at 250 MHz).

Machine Learning-Based Performance Estimation for RISC-V Virtual Prototypes

Sub. #GE87GD.

Caaliph Andriamisaina.

Abstract: Early-stage performance estimation plays a critical role in HW/SW co-design. It enables SW development prior to silicon availability while guiding HW architectural exploration. These activities are inherently iterative and therefore require simulation environments that are sufficiently fast to evaluate SW optimizations and HW design alternatives efficiently. Cycle-accurate simulators provide highly precise execution-time estimation but are often too slow for evaluating realistic workloads. Instruction Set Simulators (ISSs), in contrast, offer significantly higher simulation speed but lack accurate timing information. Abstract performance models represent a promising compromise; however, existing handcrafted approaches remain labor-intensive and difficult to generalize. We present an automated methodology for generating Machine Learning (ML)-based performance models from cycle-accurate simulations and integrating them into fast ISS environments. The approach targets the 64-bit RISC-V core CVA6 and is implemented with the open-source emulator QEMU.

UCAgent: An End-to-End Agent for Block-Level Functional Verification

Sub. #GJMGHA.

wangsa, Fangyuan Song, Junyue Wang, YanPi and yaozhicheng.

Abstract: Functional verification remains a critical bottleneck in modern IC development cycles. However, traditional methods, including constrained-random and formal verification, struggle to keep pace with the growing complexity of modern semiconductor designs.

While recent advances in Large Language Models (LLMs) have shown promise in code generation and task automation, significant challenges hinder the realization of end-to-end functional verification automation. These challenges include (i) limited accuracy in generating Verilog/SystemVerilog verification code, (ii) the fragility of LLMs when executing complex, multi-step verification workflows, and (iii) the difficulty of maintaining verification consistency across specifications, coverage models, and test cases throughout the workflow.

To address these challenges, we propose UCAgent, an end-to-end agent that automates hardware block-level functional verification based on three core mechanisms. First, we establish a pure Python verification environment using Picker and Toffee to avoid relying on LLM-generated SystemVerilog verification code. Second, we introduce a configurable 31-stage fine-grained verification workflow to guide the LLM, where each stage is verified by an automated checker. Furthermore, we propose a Verification Consistency Labeling Mechanism (VCLM) that assigns hierarchical labels to LLM-generated artifacts, improving the reliability and traceability of verification.

Experimental results show that UCAgent can complete end-to-end automated verification on multiple modules, including the UART, FPU, and integer divider modules, achieving up to 98.5\% code coverage and up to 100\% functional coverage. UCAgent also discovers previously unidentified design defects in realistic designs, demonstrating its practical potential.

CAGE-V: Confidential Computing Architecture supporting Guest Enclaves for RISC-V

Sub. #H87VUG.

Moritz Waser.

Abstract: Confidential VMs enable cloud service providers to operate a secure and trustworthy multi-tenant cloud infrastructure. While confidential VMs ensure comprehensive protection for cloud workloads, such heavy-weight isolation is often omitted for serverless applications that co-locate thousands of cloud workers within the same process to optimize FaaS overheads through efficient context switches. In this work, we present CAGE-V, a novel confidential computing architecture that supports lightweight enclave-based isolation for individual cloud workers running inside confidential VMs. Guest enclaves support fast context switches within the confidential VM, as TLB entries are tagged with Domain Identifiers, eliminating overheads that stem from TLB flushes. We present a CAGE-V prototype, consisting of a hardware extension for the CORE-V CVA6 processor and a small security monitor, and evaluate our design in terms of system performance, demonstrating a minor performance impact.

Integration of CVA6 in ESP for ISA extensions and coherent multicore: with FFT-butterfly instruction

Sub. #H8SWVM.

rodrigo olmos.

Abstract: The RISC-V ecosystem is moving toward increasingly heterogeneous SoCs that combine multicore processors, hardware accelerators, and software programmability. In this work, we integrate the application-class CORE-V CVA6 processor into the open-source ESP framework, enabling ISA extensions within a coherent multicore platform. The integration preserves cache coherence, SMP correctness, and Linux-class software support, while providing a practical path to deploying custom instructions in ESP-based systems. To demonstrate the benefits of heterogeneity at the core level, we implement an FFT butterfly extension using a CV-X-IF-based flow with three custom instructions. Across FFT sizes from 16 to 1024 points, the proposed design achieves speedups of 1.37× to 1.45×. These improvements are obtained with low hardware overhead, namely +0.23% at the platform level and +5% at the core level. Results show that ISA-level extensions can complement multi-accelerator architectures by providing efficient fine-grained acceleration for recurring DSP kernels.

Towards Open User-Space Power-Management Communication Interfaces

Sub. #HKZX8R.

Emanuele Venieri and Antonio del Vecchio.

Abstract: Modern processors delegate power and thermal management to dedicated Power Control Systems (PCS), communicating through kernel-mediated interfaces such as SCMI or the emerging RPMI. Prior work has shown that end-to-end control quality is dominated by the power-management policy rather than by interface latency, leaving room to choose communication paradigms based on flexibility rather than raw latency. We integrate Micro XRCE-DDS on ControlPULP, a RISC-V–based PCS, connecting it to a user-space Agent on an ARM host via a custom shared-memory transport. This design removes protocol logic from kernel drivers and naturally supports multi-controller coordination through a shared middleware layer. Experiments on a ZCU102 FPGA at 20 MHz show 490 μs of active processing per publication, 0.8 MB/s throughput, and a memory footprint under 11.2 KB for 32 topics. The resulting latency is comparable to SCMI [1] while enabling a more flexible communication model.

Vishwa: A Scalable RISC-V Based GPGPU

Sub. #HMZYAS.

Prachi Pandey, Vivian and PRANOSE J EDAVOOR.

Abstract: The growing demand for artificial intelligence, scientific computing, and large-scale data analytics has significantly increased the need for massively parallel computing architectures. Modern GPUs provide high computational throughput by executing thousands of concurrent threads, but most existing GPU architectures remain proprietary, limiting open architectural innovation and research. This paper presents Vishwa, a scalable RISC-V based General Purpose GPU (GPGPU) architecture designed to enable open and extensible parallel computing platforms. The architecture adopts a hierarchical compute model composed of Vishwa Compute Clusters (VCLs) containing multiple Vishwa Compute Cores (VCCs) that execute threads using a Single Instruction Multiple Thread (SIMT) execution model. Each compute core integrates specialised Vishwa Matrix Cores (VMCs) designed to accelerate matrix-intensive operations commonly used in machine learning workloads. Work distribution across the architecture is managed by a global Vishwa Work Distributor (VWD) that schedules workloads across available compute clusters. The architecture is supported by a complete software ecosystem through the CHAKRA compiler stack, which integrates with LLVM to provide kernel compilation and runtime execution support. The compute core architecture has been implemented and validated on an FPGA platform, demonstrating functional correctness of the execution pipeline and SIMT execution model.

RISCY Prefetchers

Sub. #HZQQP9.

Mohamed Soliman.

Abstract: While hardware prefetchers accelerate memory performance, they inadvertently leave microarchitectural footprints that attackers can exploit. Previous work showed that instruction and data prefetchers on Intel, AMD and Apple processors are prone to microarchitectural side-channel attacks. In this paper we investigate the data stride prefetcher in the \textit{Xuantie C910} – a server-grade RISC-V processor extensively deployed in cloud environments. Furthermore, we present the first microarchitectural attack targeting a hardware prefetcher on a RISC-V processors. In that regard, we port StrideRE on RISC-V processors to reverse engineer its hardware prefetcher. Finally, we provide two Proof-of-Concept (PoC) attacks: partial memory address disclosure and control flow leakage. We find that both attacks are effective across privilege levels.

RuyiSDK Package Manager - A Unified Package Management and Development Environment for RISC-V

Sub. #K9TL88.

Weilin Cai and Yunxiang Luo.

Abstract: New RISC-V CPU cores are released every year, and while these cores typically conform to standardized RISC-V ISA profiles, vendors frequently introduce additional proprietary extensions. This growing diversity makes it difficult for developers to accurately determine the exact instruction sets supported by a specific CPU core, thereby complicating the selection of appropriate toolchains, firmware, and operating system images. The RuyiSDK Package Manager addresses this challenge by aggregating information on RISC-V CPUs, MCUs, and development boards together with their corresponding toolchains, firmware, and system images. It establishes a comprehensive mapping between CPU architectures, development boards, and required software resources. This mapping is maintained in a structured packages index, which provides a unified, metadata-driven representation of RISC-V hardware and software resources, along with associated download links. This paper presents the overall architecture and design of the RuyiSDK Package Manager, focusing on three core components: package management, virtual environment isolation, and device provisioning. The system currently supports most commercially available RISC-V development boards. Beyond toolchain integration, it lays the foundation for IDE integration and other developer utilities. By streamlining access to software resources and standardizing development workflows, the system lowers the barrier to entry for RISC-V software development, facilitates developer onboarding, and improves visibility into software support across heterogeneous RISC-V platforms.

Functional Verification Strategy of the CORE-V Floating-Point Unit (CVFPU) for RISC-V cores

Sub. #KBLECB.

Ihsane Tahir.

Abstract: Floating-point unit (FPU) verification is inherently challenging due to IEEE-754 corner cases, multiple rounding modes, exception handling, subnormal behavior, and the large input space introduced by mixed precision. The CORE-V Floating-Point Unit (CVFPU) released as open source, provides a highly configurable multi-format implementation but lacks an industrial-grade functional verification framework. This work addresses that gap by proposing a structured UVM-based verification strategy tailored to its configurable architecture. The approach integrates a variable-precision C++ reference model, directed and constrained-random stimulus, assertion-based checks, and coverage-driven closure.

Lessons Learned from Designing Decoupled-Access Hardware Accelerators in a RISC-V Framework

Sub. #KGMSLQ.

Xicu Marí.

Abstract: Sparse tensor operations are critical for scientific computing but their irregular memory access patterns challenge traditional architectures. While domain-specific architectures offer efficiency, integration into mature SoCs often requires ISA modifications or complex driver development. This work addresses these challenges via a decoupled SpMV access unit integrated through Cohort, a coherent shared-memory queue interface communicating with a CVA6 RISC-V core. To mitigate the inter-tile communication overhead, we introduce a hybrid tiling approach that co-locates the access unit and the core in the same tile, enabling direct data delivery into the private cache. This hybrid architecture achieves significant performance gains, yielding geometric mean speedups of 1.33× and 1.50× for COO and CSR formats, respectively, over traditional multi-tile configurations. These results demonstrate that offloading memory traversal to a programmable data-flow engine, combined with optimized placement in the memory hierarchy, efficiently accelerates irregular workloads with minimal intrusion.

RISCV-Perf: A Performance Modeling Framework for RISC-V Processors Integrated with Spike

Sub. #LNBB8V.

Tsung-LI.

Abstract: Microarchitectural performance evaluation is an essential step in modern processor design and architecture exploration. However, developing a cycle-accurate simulator from scratch requires implementing both instruction semantics and detailed microarchitectural models, which significantly increases development complexity.

This work presents RISCV-Perf, a lightweight performance modeling framework designed to integrate with the Spike RISC-V functional simulator. The framework decouples functional execution from cycle-level timing simulation through a minimal instruction interface that captures key instruction attributes such as program counters, operand registers, and memory access information. By reusing the functional correctness provided by Spike, RISCV-Perf focuses solely on modeling microarchitectural timing behavior.

RISCV-Perf adopts an execution-driven simulation approach, enabling cycle-level modeling of superscalar out-of-order processors without generating execution traces. The timing model represents major microarchitectural components including an instruction flow model, register renaming mechanism, memory operation pipelines, and cache hierarchy interactions. In addition, the framework is implemented using a modular policy-based design, allowing architectural components such as branch predictors and cache policies to be easily replaced or extended.

Experimental evaluation using the MiBench benchmark suite on an RV64GC configuration demonstrates that RISCV-Perf can effectively generate performance insights such as CPI behavior and branch prediction miss rates across workloads. These results show that the framework provides a practical platform for workload characterization and microarchitectural policy exploration.

Heterogeneous Interrupts for Ultra-Low Latency Embedded RISC-V Systems

Sub. #LNTKNT.

Antti Nurmi.

Abstract: Reactive real-time systems rely on preemption and, by extension, context switching (CS) to schedule critical tasks. Short, frequent interrupt routines may use a disproportinally large ammount of time and energy for CS rather than core application functionality. Replicated register files (RRFs) are an established solution for fast CS, but area-intensive and poorly scalable. This abstract presents the heterogeneous interrupt architecture, a solution for targeted use of RRFs that maintains area-efficiency, and the parallel context stack (PCS), a novel RRF microarchitecture. The proposed concepts are evaluated with implementations in TSMC 22-nm and a periodic task case study. The implementation in a RISC-V microcontroller system demonstrates a 1.2% area overhead with no timing detriment for the PCS, while the case study demonstrates a reduction in clock cycles and retired instructions of up to 26% and 21%, respectively.

Energy-Efficiency Optimization of a RISC-V Floating-Point Unit for HPC-Oriented Architectures

Sub. #LP7WTL.

Marco Crisologo.

Abstract: As High-Performance Computing (HPC) advances towards the Exascale era, energy efficiency has become the primary design constraint. In HPC systems, the Floating-Point Unit (FPU) is instantiated in massive numbers to support parallel workloads, that require huge number of floating point computations. Consequently, the FPU becomes a dominant consumer of dynamic power within the chip. This work presents an energy-optimized FPU for RISC-V Vector Processing Units. To address the inefficiencies of standard unified FMA datapath, we propose a Split-Path FMA micro- architecture tailored for the RISC-V Vector specification. Our design integrates the physical separation of the arithmetic pipelines with vector-aware clock gating and operand isolation. Evaluated in a commercial 4nm technology at 2 GHz, the optimized design demonstrates up to a 29% increase in energy efficiency for mixed-arithmetic workloads and a 7.8% performance speedup in vector reduction-heavy kernels.

SMSIC: Software-Interrupt MSI Controller for RISCV AIA in Large-Scale NoC Systems

Sub. #LQRAW3.

GUO Ren.

Abstract: Advanced Interrupt Architecture (AIA) Incoming MSI Controller (IMSIC) is a message-signaled interrupt (MSI) solution designed for the RISC-V External interrupt. However, due to the lack of native support for Software-Interrupt, IPI was forced to mix with IMSIC interrupts. In large systems, inter-processor interrupts (IPIs) occur very frequently and in large numbers, far exceeding the number of device interrupts. To alleviate IPI pressure on the Network-on-Chip (NoC), an interrupt-forwarding router is typically designed. However, the requirement for 2048 interrupt sources in the AIA IMSIC consumes a significant amount of SRAM in the BITMAP design, increasing chip area and cost. To improve IPI doorbell efficiency, hardware logic for merge-and-absorption based on BITMAP also needs to be designed per-hart at the transmitter, but AIA IMSIC’s large interrupt sources design makes this design expensive. Furthermore, IMSIC’s IPI allows any MSI-capable device to forge an IPI by sending a specific interrupt number, causing unnecessary disruption. To bridge this gap, propose a Software MSI Controller (SMSIC) for AIA, an optional RISC-V hardware component tightly coupled to each hart. The idea of architecturally separating IPIs from external interrupts not only reduces the cost of improving IPI performance in large systems but also aligns with the original intent of the Software Interrupt design in the RISC-V Privileged Specification.

Wolvrix: A SystemVerilog-Native Graph Infrastructure for RTL Research

Sub. #LSXU7K.

Haojin Tang.

Abstract: We present Wolvrix, an open-source infrastructure that ingests Verilog-2005/SystemVerilog into GRH (Graph RTL Hierarchy), an SSA-based graph intermediate representation, and supports composable transformation passes with Verilog re-emission. Wolvrix models complex SystemVerilog semantics, including multi-event registers, multi-port memories, blackboxes, cross-module references, and DPI-C calls, within a uniform graph structure amenable to analysis and transformation. We describe GRH and Wolvrix’s architecture, and present roundtrip re-emission on XiangShan and XuanTie C910 plus RepCut partitioning on XiangShan.

Using RISC-V E-Trace for effective insights for RISC-V Vector Optimizations

Sub. #LZ7U8Y.

Harry van Haaren.

Abstract: RISC-V E-Trace is a powerful tool for observing the execution of a CPU. Optimizing code to use RISC-V Vector instructions brings novel challenges, and gaining real-time insight into the code as it executes helps quickly iterate to better solutions. The ability of E-Trace to capture runtime “vector length” (vl CSR) and instruction execution, together with its very compressed nature makes it an ideal choice as a tracing format. This keeps traces small, allows live-streaming E-Trace data for post-processing, and ultimately allows the software developer to easily understand the utilization of the vector unit. The end result is a very powerful workflow allowing fast iteration and development of low-level optimized software, with the execution of the code underpinned by QEMU and the RISC-V E-Trace format.

Towards a Secure RISC-V Platform: The Environment Around the CVA6-Core

Sub. #M8ABDU.

Lukas Füreder.

Abstract: The rising adoption of RISC-V in real-world applications raises the requirement of security solutions within its processors. Secure boot enables the product owner to control which software may be booted on it, preventing execution of malicious software. That requires a hardware root-of-trust, typically in conjunction with public key cryptography, establishing the infrastructure to verify software. We propose a secure boot concept for the widely adopted CVA6 core with revocation capabilities. We also modernized the CVA6 software stack to be able to continue the verification steps in later software stages and leverage modern security hardware extensions.

FPGA Lifecycle Management for RISC-V Systems

Sub. #MARKW9.

Tianhai Liu.

Abstract: FPGA lifecycle management remains tied to proprietary toolchains and host architectures, leaving RISC-V without a vendor-neutral model for scalable bitstream deployment. A host-agnostic control-plane architecture is presented that shifts lifecycle management to the operating-system layer by leveraging standard Linux capabilities, thereby decoupling deployment from specific ISAs and vendor stacks. This enables Linux-capable RISC-V processors to serve as control hosts in heterogeneous FPGA systems. Prototyped on a Zynq-7000 SoC and generalizable to RISC-V platforms, the architecture provides a portable foundation for fleet-scale FPGA management.

RV64Y Temporal Safety Exploration

Sub. #MD7RVM.

Jonathan Woodruff.

Abstract: We present the studies leading up to the temporal safety support included in the RV64Y “CHERI” capability RISC-V extension. Memory safety enforcement is increasingly important for new programs, languages, and architectures. RV64Y enforces spatial memory safety natively, and provides the necessary invariants to enforce temporal safety in software. To ensure that RV64Y systems can enforce temporal safety with reasonable performance and memory overhead, we have reproduced experiments from previous CHERI research, optimised CheriBSD revocation support, and explored simplified state machines for virtual memory pages encoded in Page Table Entry (PTE) bits. We managed to optimize revocation in CheriBSD to reduce overhead in Spec2006 by 12%. We then explored the simplest PTE encoding with generational capability read support, and found that they incurred an overhead of about 33% over the optimised baseline, justifying the inclusion of generational capability dirty states in the frozen RV64Y specification. Finally, we discuss ongoing work that has the potential to further optimize temporal safety for RV64Y with vendor-specific or future ratified extensions.

The Practical Security Rules Proposal to HW RoT in Security Requirement RISC-V Server Specification

Sub. #MHEXB7.

Vincent Cui.

Abstract: We present a practical RSS (Root Security System) which not only conforms to HW RoT security requirement in RISC-V Server SoC and Platform Specification but also could be integrated into Server and AI SoC as HW RoT to offer security services applicable to system security service and user application. Beside already defined security rules to HW RoT, we found RSS also requires new function to enhance system security and support server Reliability, Availabitliy and Servicable. we present new 4 security rules proposal to HW RoT according to new function. Finally, we discuss an effective solution for RSS to security compliance certification as TOE.

Evaluating the Vulnerability of RISC-V CPUs Against Cache Timing Attacks

Sub. #MKBAZS.

Vasileios Karakostas.

Abstract: Assessing the vulnerability of caches against side-channel attacks is of critical importance when enhanced microarchitectural security is a must-have feature for a multicore CPU implementation. Previous works have proposed various metrics and methodologies to assess such vulnerabilities. However, those works suffer from limitations regarding either the range of target cache attacks, the support for the RISC-V ISA, or the public availability of the assessment tools. The goal of this paper is to provide support for systematically evaluating RISC-V multicore CPUs against a wide range of cache timing attacks. This support should allow the assessment of both real and simulated systems, enabling early security evaluation in the design phase of the processor using the open-source gem5 microarchitectural simulator. We base our approach on the Cache Timing Vulnerability Score (CTVS) methodology and enhance it along two axes. We first port the CTVS methodology to the RISC-V ISA, and then we integrate the CTVS methodology for the RISC-V and x86 ISAs with gem5. We evaluate the use of the CTVS methodology for simulated RISC-V and x86 multicore CPUs and analyze the results.

QUICK: QEMU Internal Checkpointing for Gem5

Sub. #N7AVZD.

Qi Shao.

Abstract: The gem5 simulator is a widely used tool for microarchitectural research, but often incurs prohibitive execution times. gem5 mitigates this cost through checkpoint-based resumption, yet existing checkpoint-generation mechanisms remain slow, non-portable, or both—significantly limiting iterative hardware-software exploration.

We introduce QUICK (QEMU Internal Checkpointing for gem5), a framework that enables fast, automated, and deterministic generation of gem5-compatible checkpoints directly within QEMU. QUICK integrates full-system checkpointing into QEMU’s TCG engine, capturing architectural, memory, and essential device state without external orchestration. QUICK substantially reduces checkpoint-generation overhead while preserving existing gem5 workflows, enabling scalable and systematic microarchitectural studies.

Initial validation demonstrates correct cross-simulator state transfer and consistent workload resumption.

Integrating AES Cryptographic Acceleration with RISC-V Cryptography Extensions in 32-bit processors

Sub. #NEMJHQ.

Francisco J. Romero.

Abstract: This work introduces a compatible acceleration approach for AES encryption that retains the standardized ISA interface while enhancing execution time for AES-128 on 32-bit processors, including the key-schedule phase. By reformulating the behavior of existing Zk instructions without altering their opcodes, we preserve binary and source compatibility with software written for Zkne, without the performance losses of having to perform key expansion only by software. The result is an integration strategy suitable for constrained IoT or automotive devices that delivers improved throughput with reduced area overhead, enabling systems to realize the intended benefits of RISC-V’s cryptographic extension without sacrificing portability and standarization.

Accelerating LLM Inference on Edge RISC-V CPUs via Vector Extension Instructions and Flash Attention

Sub. #NFYQCV.

Yueh-Feng Lee.

Abstract: In this work, we optimize LLM inference on edge RISC-V CPUs using vector extension instructions. We leverage 4-bit vector load and efficient 8-bit dot-product instructions to accelerate quantized and repacked 4-bit kernels in llama.cpp. In addition, we implement RVV support for tiled flash attention, which further improves performance in the prefill stage. Experimental results show that the proposed optimizations achieve 1.76x-2.14x speedup over the upstream implementation while maintaining near-linear scaling for prefill workloads on an RVV-enabled multi-core platform.

An Efficient Approach to Apply the RISC-V Sail Model to Chip Verification

Sub. #NMX7W7.

Yunxiang Luo and Mingzhu Yan.

Abstract: The Sail RISC-V Model can generate an executable file from its formal specification,. Currrently RISC-V tests only provides limited test cases and cannot comprehensively test your RISC-V implementation. Some chips may use self-developed simulators for testing, but they cannot obtain formal verification-based guarantees like RISC-V Sail Model, nor can they offer full configurability. This work introduces a new test framework that uses the RISC-V Sail Model as the ref model, ensuring the model’s completeness and accuracy. To improve simulation performance, we choose to use Pydrofoil, which is an improved version of the Sail Model that delivers ultra-high performance. To enhance test compatibility and usability, we provide a set of simple test interfaces (including register access, memory access, etc.) and support customizing model configurations. Currently, it has successfully integrated tests for some open-source RISC-V implementations.

Coverage-Directed Smoke Regression Optimization via Greedy Set Cover for RISC-V Verification

Sub. #NTCH3P.

Abhishek Rajgadia, Shubham Singla, Anish Jaltare and Radha Govindaradjou.

Abstract: We present a coverage-driven framework that optimizes RISC-V smoke regressions by decomposing VCS coverage into feature-specific subsets via tag-based pattern matching, ranking tests via greedy set cover, and flagging runtime outliers. Applied to a 978-test production suite drawn from a larger regression pool of 10,000 tests, the framework cut smoke tests by 40% and peak test runtime by 63%, while improving coverage on key architectural features-including +64% (SMRNMI), +53% (timer), and +25% (counters)-with modest regressions on a few features (median <3%), all within project thresholds.

Cost-Benefit Analysis of a 22nm ASIC ML-KEM Accelerator for RISC-V Secure Elements

Sub. #PB9JGQ.

Stefano Di Matteo, Hack, Emanuele Valea and Ivan Sarno.

Abstract: This paper provides a quantitative analysis of the costs and benefits of integrating a dedicated hardware accelerator for the Post Quantum Cryptography (PQC) algorithm ML-KEM into a 32-bit RISC-V SoC. We compare a software-only implementation on the CV32E40P core against a full-hardware datapath offloading the entire algorithm. We implemented the system on a 22 nm ASIC chip, and we measured the results: the dedicated hardware achieves a 139x speed-up over the software baseline. This performance gain requires an area overhead of 301 kGE, representing only a 6% increase in the total SoC silicon footprint. This study provides a data-driven assessment of the silicon-to-latency trade-off for Post-Quantum Cryptography (PQC) in resource-constrained RISC-V systems.

An Open Heterogeneous RISC-V AI Acceleration Architecture for Next-Generation Space Computers

Sub. #PSCKCE.

Yvan Tortorella.

Abstract: Integrating AI onboard satellites to reduce dependence on ground stations and facilitate quick orbital maneuvers demands a new class of onboard computers with enhanced processing power, real-time control capabilities, and robustness against the harsh space environment. Astral is a fully open-source, highly parametric platform for RISC-V-based heterogeneous SoCs targeting reliable onboard control and AI acceleration for next-generation space computers.

Holographic Execution: A Hyperdimensional Computing Approach for Robust RISC-V Instruction Decoding

Sub. #PTSN7C.

Marcello Barbirotta.

Abstract: The evolution of modern computing towards emerging paradigms, such as In-Memory Computing (IMC), is severely limited by the high intrinsic noise of these memory technologies. Simultaneously, conventional Von Neumann architectures exhibit data-dependent execution and power profiles, leaving embedded systems highly vulnerable to physical Side-Channel Attacks. In this extended abstract, we propose a novel paradigm based on Hyperdimensional Computing for encoding and decoding RISC-V instructions. By mapping standard assembly instructions into a neural-inspired holographic representation and storing them in superposition, leveraging the capacity of high-dimensional spaces, the traditional decoding logic is replaced by a highly parallel Associative Memory. Our Design Space Exploration compares 1-bit Binary and 8-bit Integer representations, evaluating the trade-off between instruction capacity (chunk size) and dimensionality. Furthermore, we demonstrate the intrinsic fault tolerance and security-by-design of the architecture: a binary HDC system maintains 100% decoding accuracy even when subjected to a 5% physical memory corruption, while its constant-time execution and massive pseudo-random switching activity inherently mask side-channel leakages. This paradigm paves the way for ultra-robust, secure, and ECC-free RISC-V pipelines tailored for next-generation processing cores.

End-to-End ML Graph Compiler Fused with Triton Kernel Compiler for RISC-V

Sub. #PUPQJF.

Hualin Wu.

Abstract: RISC-V AI acceleration faces a combinatorial explosion: hundreds of kernel variations across shapes, data types, and vendor platforms create unsustainable complexity. We present an end-to-end compilation solution fusing ML graph compilation with Triton DSL kernel compilation in a unified MLIR-based framework targeting RISC-V scalar (RV64IM) and vector (RVV) instruction sets.

Low-power Floating Point Unit for RISC-V Processors using FPHUB format

Sub. #PYSBZM.

Javier Hormigo.

Abstract: In this paper, we present the results of the XXXXXXXX project, in which a fully open-source, parametrizable low-power floating-point unit (FPU) under HUB format has been designed and validated. This unit, implemented in SystemVerilog, supports addition, subtraction, multiplication, division, square root, and Fused Multiply-Add (FMA) operations under HUB format. This FPU has been exhaustively tested through simulation and FPGA implementations. Moreover, it has been integrated with some RISC-V cores and validated using several test benches. The development is complemented by a compiler environment that enables native FPHUB arithmetic for C and C++ programs. The proposed unit achieves a roughly 60\% reduction in area and power consumption compared with a classic IEEE FPU implementation.

Improving ChaCha20 by RISC-V Vector Extension: Design and Engineering Implementation

Sub. #QGKMZ7.

Meng Zhuo.

Abstract: ChaCha20 is a high-performance stream cipher widely deployed in TLS and SSH, typically combined with Poly1305 for authenticated encryption. This paper presents a practical vectorized implementation of ChaCha20 using the RISC-V Vector (RVV) extension, with complete engineering code in the Go ecosystem. We outline how ChaCha20’s add–xor–rotate structure maps to RVV instructions and describe a fully vectorized design covering register allocation, rotation implementation, and 64-byte block processing. Experiments on a real RISC-V 64 platform (Spacemit X60) show up to 1.5X throughput improvement on large data blocks and a 35.58% geometric mean speedup over the generic Go implementation. The implementation is suitable for direct integration into open-source cryptographic stacks on RVV-enabled RISC-V platforms.

Compiler-Aided Autovectorization of PQC on RISC-V Vector Extensions

Sub. #QLUGPK.

Stefano Di Matteo and Ivan Sarno.

Abstract: Post-Quantum Cryptography (PQC) is rapidly becoming a security requirement, and ML-KEM (FIPS 203) is emerging as a foundational primitive for future secure systems. On RISC-V platforms, performance evaluations frequently emphasize custom extensions or dedicated accelerators, while the optimization potential of the standard ISA remains comparatively underexplored. This paper establishes a rigorous performance baseline for the main computational kernels of ML-KEM using only the standard RISC-V Vector Extension (RVV). Rather than relying on handwritten assembly, we apply targeted C-level program transformations that systematically enable effective compiler autovectorization, achieving up to a 10× reduction in instruction count for NTT while preserving portability across all RVV-compliant implementations.

An Open-Source Framework to Enable Float16 On-Device Training on RISC-V Single-Core

Sub. #QM3SVB.

Benjamin Hubinet.

Abstract: This work proposes an open-source framework that leverages both the Zfh (scalar float16) and the Zvfh (vector float16) extensions to enable complete on-device training on resource-constrained RISC-V single-core. On top of reducing the memory footprint by about 50% as compared to using float32, our approach facilitates transfer learning and fine-tuning scenarios by incorporating layer-freezing capabilities. Our work builds onto AIfES an open-source, modular and generic DNN training and inference framework for embedded systems that can be extended with custom hardware-specific functions.

RISC-V Architecture innovations need software stack innovations

Sub. #QPZCJG.

Quentin_Melotte and Henri-Pierre CHARLES.

Abstract: RISC-V is a major breakthrough in the computing ecosystem. It open opportunities for hardware research, innovation for industry. Researcher or industrial can customize a CPU core for a given specific application, thus provide industrial advantage.

It would be strange not to take advantage of this opportunity to revisit the ecosystem of software tools.

In this article, we propose a new compiler for generating a part of the binary code at runtime.

This has several advantages: (1) generating code by leveraging knowledge of user data which provide speed optimization, (2) generating code with knowledge of the accelerators available on a given platform, and (3) taking advantage of unconventional accelerators specific to a computing platform.

The 2 later points are specifically interesting for the RISC-V community which already show a wide variety of platforms.

Accelerating Myers’ Bit-Vector Alignment With RISC-V Vector Intrinsics

Sub. #QUURJD.

Elena Espinosa.

Abstract: Pairwise sequence alignment is a key component of many bioinformatics workflows and is often a performance bottleneck. Recent advances in sequencing technologies have improved accuracy, while also increasing the need for accelerators that can efficiently handle long reads. Myers’ bit-vector algorithm is well suited to acceleration, and AVX-512 has enabled high-performance implementations, such as SeqMatcher. However, these solutions rely on a fixed register width and AVX-512-specific instructions, which creates a scalability ceiling and limits portability. We implement Myers’ algorithm using RISC-V Vector (RVV) intrinsics and focus on the addition step, which we identify as the main bottleneck in our vectorized kernel. We evaluate two RVV addition alternatives across LMUL values and dataset sizes on a Banana Pi and find that the iterative carry-propagation variant achieves up to 10.29x speedup over the scalar baseline.

TOXOS: A RISC-V Coprocessor for Non Linear Function Acceleration

Sub. #RCLLZH.

Luigi Giuffrida.

Abstract: The growing demand for near-sensor processing exposes a gap: nonlinear activation functions still fall back on the host CPU, incurring energy and latency penalties. We present TOXOS, a RISC-V CORDIC coprocessor tightly integrated into X-HEEP via the Core-V eXtension Interface, achieving up to 27× speedup over a hardware FPU (CVFPU) with minimal area overhead.

Scalable Symbolic Quick Error Detection using Lightweight Processor-Level Abstraction

Sub. #RP8QNP.

Yufeng Li.

Abstract: Symbolic Quick Error Detection (SQED) streamlines processor verification by checking a microarchitecture-agnostic self-consistency property using bounded model checking (BMC). While effective in detecting bugs without manual property specification, SQED suffers from severe scalability limitations due to state explosion in complex designs. This paper introduces RDM-SQED to mitigate this bottleneck by reducing the resource-intensive duplicate mode with a lightweight Processor-Level Abstraction (PLA). The PLA captures software-visible behaviors through a concise set of Elementary Instructions (EIs). To further constrain the verification logic, we propose a recursive refinement algorithm that generates a minimal EI set. Experimental evaluation on an out-of-order RISC-V processor demonstrates that RDM-SQED significantly outperforms existing variants in both scalability and bug detection efficiency, successfully identifying bugs that cause timeouts in other methods.

Cincoranch: A Heterogeneous Multi-Microarchitecture RISC-V Test Chip – Silicon Bring-Up

Sub. #RRK9ZJ.

Hugo Safadi.

Abstract: The Cincoranch Test Chip 1 (TC1), manufactured in Intel3 technology, integrates three RISC-V processors with a Vector Processing Unit (VPU) accelerator and an HPC-oriented cache hierarchy. This work presents the electrical characterization of the silicon, the power-on bring-up procedure, and basic functionality verification of the TC1 chips. Initial measurements focused on power consumption and temperature of each core under idle conditions, providing insight into the chip’s behavior and readiness for further workload testing.

A Low Latency Real-Time RISC-V MCU for TEE

Sub. #RT7JWV.

Paul Shan-Chyun Ku.

Abstract: In modern embedded security architectures, the Trusted Execution Environment (TEE) serves as the fundamental tool for isolation, ensuring that critical assets in applications like Electric Vehicles (EVs) and robotics remain protected from compromised software. However, restricted by current RISC-V specifications for MCUs, implementing this isolation typically imposes a severe penalty on real-time performance due to the prolonged software prologue required for context switching. To resolve this, we present a lightweight 2-mode (M-mode and U-mode) secure-domain-aware RISC-V MCU architecture designed for security-sensitive, real-time applications. This architecture introduces a hardware-managed “Trusted State” (TS) used to dynamically filter valid enhanced Physical Memory Protection (ePMP) entries in U-mode. To eliminate register preservation overhead, the MCU features a dedicated “Snapshot Buffer” for every General Purpose Register (GPR) and Control and Status Register (CSR) subject to backup. Crucially, the hardware captures the execution context into this buffer in a single cycle, allowing the CPU to immediately begin executing the Interrupt Service Routine (ISR). The captured data is then pushed to an SP-based Data Local Memory (DLM) via a 128-bit wide data-path in the background. By overlapping this memory write with the ISR’s preamble execution, this design effectively hides the context save time, ensuring the system is seamlessly prepared for nested interrupts. This architecture guarantees hardware-enforced isolation while satisfying the real-time requirement.

RISC-V Instruction-Subset Processors for Extreme Edge Machine Learning.

Sub. #RTYDA8.

Shengyu Duan and Konstantinos Iordanou.

Abstract: We present an end-to-end framework for the automatic generation of custom RISC-V instruction-subset processors (RISSPs) tailored to machine learning (ML) inference. Building on the RISSP methodology, our fully automated flow accepts model hyperparameters and a target dataset, performs offline training, and generates the complete inference implementation together with all deployment artifacts for the target device. The resulting inference code then drives the RISSP generation, synthesising a custom processor that implements only the RISC-V instructions used by the application. By co-optimizing software and hardware within a tightly integrated co-design toolchain, the combined flow reduces ISA footprint and design complexity, enabling smaller and more energy-efficient processors for ML workloads at the edge.

Transaction-Level Analysis and Optimization of Decision Diagram Packages on RISC-V

Sub. #RUUDYM.

Rune Krauss.

Abstract: The complexity of modern electronic systems has increased significantly over the past decades due to continuous technological advances. To cope with this growing complexity, data structures, algorithms, and the underlying hardware platforms used in Electronic Design Automation (EDA) must be continuously improved. Decision Diagrams (DDs) constitute a fundamental graph-based structure for formal verification, enabling efficient representation and algorithmic manipulation of switching functions. Owing to their practical relevance, numerous optimizations have been incorporated into existing DD software packages. However, these optimizations are typically designed in an architecture-agnostic manner and do not explicitly exploit characteristics of a specific target platform. As a consequence, architecture-specific optimization opportunities may remain untapped. In this work, a transaction-level analysis of a representative DD package is conducted using a RISC-V-based trace analysis tool to investigate this potential. The study reveals recurring instruction sequences with strong potential for hardware-level aggregation, enabling more efficient hardware designs. Furthermore, the derived insights provide guidance for higher-level software optimizations.

RETrace EX: Interactive Trace Analysis Framework for RISC-V Hardware Optimization

Sub. #RWGSHJ.

Jan Zielasko.

Abstract: Identifying the optimal hardware configuration for running complex workloads on edge devices is critical for reducing cost and maximizing performance. Tailoring hardware designs to specific applications significantly increases resource efficiency, which is essential to meet the strict performance constraints. Unfortunately, exploring the design space at the hardware-level is difficult due to the complexity of the hardware design processes. We present RETrace EX, an interactive analysis framework for identifying profitable hardware optimizations from system-level execution traces. The tool automatically identifies custom ISA extensions and estimates their performance impact as well as the expected area cost. To adjust the optimization goal for arbitrary systems and design capabilities, the user can choose from a range of preset scoring functions or specify a custom one. Applied to a wide range of representative embedded and edge artificial intelligence workloads, we are able to identify individual custom instructions that yield expected performance improvements of up to 32 % for Embench and 60 % for MLPerf Tiny benchmarks. The framework is provided as open source.

DASICS: Efficient In-process Protection with Hardware-assisted Dynamic Compartmentalization

Sub. #RZD9L9.

Tianyue Lu.

Abstract: Hardware-assisted in-process compartmentalization is an effective method for addressing security threats within complex software applications. This paper proposes DASICS, an efficient design of hardware-assisted in-process compartmentalization, including flexible permission management, sufficient security metadata protection, complete resource access control, and little hardware-to-software ABI modification requirements. DASICS divides the process into trusted and untrusted region and uses boundary registers and user-level interrupts to achieve dynamic permission management, thereby avoiding the overhead of privilege-level switching in traditional methods. We implemented a hardware prototype of DASICS on the RISC-V XiangShan out-of-order processor and validated its effectiveness on FPGA. Experimental results show that DASICS incurs an average performance overhead of only 1.53% in SPECint2006 tests while effectively defending against common vulnerabilities such as stack/heap overflows and control-flow hijacking in security test suite.

RISC-V Packed-SIMD Acceleration for Quantized Edge-AI Inference on Space-Qualified Platforms

Sub. #S3LMTB.

Carlos Rafael Tordoya Taquichiri.

Abstract: Conservative/qualification-sensitive RISC-V ecosystems tend to view large architectural changes as costly due to hardware overhead, integration effort, software/toolchain adaptation, and assurance scope. This is especially relevant for platforms intended for harsh environments and long lifetimes such as space-oriented and radiation-tolerant platforms (e.g., NOEL-V). At the same time, there is growing interest in on-board processing to support time-critical decisions close to the sensor and reduce reliance on transmitting raw sensor data, increasing the demand for compute-intensive Edge-AI inference. In such settings, full vector architectures can deliver high throughput, but they tend to introduce additional architectural state and increase integration complexity across the hardware and software stack. Therefore, to introduce data-parallel acceleration with minimal disruption, we evaluate packed-SIMD as a small-change alternative based on packed subword parallelism that remains close to the existing register and memory model. We consider two packed-SIMD options: SWAR and SPARROW. On a NOEL-V softcore, we implement SWAR operator kernels for the most computationally expensive layers and integrate them into the math backend of a space prequalified inference engine, running on a space prequalified RTOS (RTEMS6 SMP). Using a hardware SWAR unit for packed subword operations, we report full-model results with and without SWAR acceleration, showing improved inference performance without requiring a full vector architecture. Finally, we outline future work extending the same backend methodology to SPARROW to compare performance across packed-SIMD options.

Who Checks the Checker? End-to-End Architectural SEU Tolerance for RISC-V Microcontroller Protection

Sub. #SFGFJE.

Michael Rogenmoser.

Abstract: RISC-V-based microcontroller units (MCUs) are increasingly adopted in radiation-heavy environments such as space, where single-event upsets (SEUs) can cause bit-flips in sequential and combinational logic. RISC-V-based designs are ideally suited for these domains, as open architectures allow for fault-tolerance modifications, enhancing readiness for architectures and systems-on-chip (SoCs). While component-level architectural protection methods, such as error correction codes (ECC) and triple modular redundancy (TMR), can individually harden each component, they leave critical gaps: the voters, encoders, and decoders that implement these protections themselves remain unprotected and become single points of failure. We propose an overlapping protection approach that addresses this fundamental “who checks the checker?” problem. By extending each protection domain to encompass the checking logic of adjacent domains, we achieve end-to-end fault tolerance across an entire RISC-V MCU without requiring radiation-hardened standard cells. We build on croc, an open-source, extensible RISC-V MCU platform based on the CVE2 core, incrementally applying ECC-protected SRAM, triple-core lockstep cores, a reliable OBI interconnect, and TMR peripherals. Fault injection campaigns in both RTL and synthesized netlist show that the fully protected RISC-V MCU achieves over 99.9% fault coverage at 2.71× area overhead, 22% less than fine-grained triplication. Critically, without overlapping protection, 16.33% of faults in voter signals cause failures; with overlapping, this drops to 0.26%. All designs are implemented using the fully open-source IHP 130nm technology, Yosys, and OpenROAD.

Rust on RISC-V: Alignment and Friction at the Hardware-Software Boundary

Sub. #SFZSB9.

David de Rosier.

Abstract: Rust is increasingly discussed in embedded and safety-aware systems, yet it remains uncommon in serious RISC-V projects. For teams working in C and assembly, the question is whether Rust meaningfully complements the RISC-V ecosystem at all.

This talk offers an engineering-level exploration of that question. Rather than a migration guide or code-heavy tutorial, it examines where Rust aligns with low-level RISC-V work - and where real friction remains.

Topics include:

How Rust’s abstractions translate in bare-metal contexts,
Toolchain realities, including LLVM constraints and custom ISA extension workflows,
Practical limits around vector extension,
Incremental adoption strategies for mixed C/Rust systems,
Build reproducibility and multi-target configuration,
Off-hardware testing and separation of logic from hardware layers.

The goal is to give engineers enough practical insight to judge whether Rust has a place in their RISC-V workflow. This is an exploratory talk, not a language tutorial - no prior Rust experience is required or assumed.

CIRCE: CROSS Integrated RISC-V Cryptographic Extension

Sub. #SXGTPT.

Valeria Piscopo and aledolme.

Abstract: Post-Quantum Cryptography (PQC) is moving from algorithm selection to deployment, where performance, energy, and portability are key constraints, especially on embedded and IoT-class processors. Many PQC schemes stress general-purpose cores with large arithmetic workloads and heavy memory traffic. Instruction-set extensions (ISE) offer a practical middle ground: they speed up dominant kernels while preserving programmability. In this context, we target post-quantum digital signatures, which remain under active evaluation, as reflected by NIST’s 2023 call for additional schemes. We focus on CROSS, a code-based signature built from zero-knowledge proofs and the Restricted Syndrome Decoding Problem, and present CIRCE: a RISC-V–integrated extension connected through the Core-V eXtension Interface (CV-X-IF). CIRCE supports both R-SDP and R-SDP(G), runs across all official parameter sets without hardware retuning, and achieves an average 2x speed-up on a Zynq UltraScale+ FPGA with an ultra-compact footprint (down to 800 LUTs / 100 FFs).

Loop Optimization Practices for RISC-V

Sub. #TMWG8J.

Lei Qiu.

Abstract: Compilers play a central role in unlocking the full performance potential of rapidly evolving RISC-V processors. In the practice of optimizing SPEC CPU 2006 and SPEC CPU 2017 using LLVM for RISC-V, a few compiler optimizations targeting RISC-V have been implemented, involving approaches that both enhance the effectiveness of individual optimization passes and refine how passes interact within the optimization pipeline. This work introduces four such optimizations integrated into LLVM: (1) extending loop interchange to support loops containing reduction patterns, (2) enhancing loop strength reduction for nested loops, (3) eliminating unnecessary loop counters to unlock further optimizations such as loop unroll, and (4) refactoring multi-dimensional array accesses to enable subsequent redundant computation elimination. While motivated by RISC-V performance tuning, the proposed techniques can also benefit other architectures such as x86. Evaluated on SPEC CPU 2006 and SPEC CPU 2017, these improvements achieve performance gains ranging from 6\% to 54\% across Intel i9-11900K, SpacemiT Key Stone K1, and XiangShan KMHv3 platforms.

Fault-Tolerant Open-Source CVA6 Core for Automotive, Aeronautics and Space

Sub. #TYLEYW.

Jérôme Quévremont.

Abstract: This paper presents a radiation-hardened, open-source RISC-V CVA6 core designed for space, aeronautics, and automotive applications, where Single Event Upsets (SEUs) threaten reliability or safety. The design integrates error detection and recovery in L1 caches and Dual-Core Lockstep (DCLS) with temporal diversity. For non-critical workloads, the system supports Asymmetric Multiprocessing (AMP), enabling independent core operation. Tested with Linux and Zephyr, this work is inspired by RISC-V International’s Functional Safety white paper and advances open-source, fault-tolerant computing for critical systems. It is being integrated in a new 18 nm SoC for AI.

Bringing Cloud-Connected Automotive Workloads to RISC-V: A CVA6-Based FPGA Case Study

Sub. #UCSJXG.

Holger Blasum and Tianhai Liu.

Abstract: An end-to-end case study evaluating cloud-connected workloads on CVA6 platforms is presented. System behaviour under increasing telemetry loads is analysed using CAN trace replay. The results provide empirical insights into the suitability of open RISC-V platforms for industrial deployment and highlight further optimisation.

Exploring AI Acceleration Paradigms for Automotive RISC-V Platforms

Sub. #V33F9K.

DAVID ALBACETE SEGURA and Anestis Athanasiadis.

Abstract: The transition toward centralized automotive computing platforms demands scalable, high-performance, and energy-efficient AI acceleration tightly integrated with open instruction set architectures. Within the European Chips Joint Undertaking framework, the [PROJECT NAME] project develops a next-generation automotive hardware platform based on RISC-V technology. This paper explores three hardware acceleration paradigms applicable to RISC-V-based automotive systems: (i) memory-mapped monolithic accelerators, (ii) custom ISA extensions tightly coupled to the processor pipeline, and (iii) Near-Memory Computing (NMC) architectures. We present an ongoing comparative study evaluating their applicability to representative automotive AI kernels, including conventional neural networks (CNNs, MLPs), data-driven battery models, and emerging Spiking Neural Networks (SNNs). While all paradigms provide workload-dependent performance benefits, preliminary architectural analysis suggests that Near-Memory Computing offers superior scalability and energy efficiency for memory-bound AI workloads. Complementing the hardware effort, we develop a software ecosystem leveraging MLIR-based compilation flows to efficiently map both conventional and neuromorphic models onto heterogeneous RISC-V accelerators.

RISC-V Hardware Accelerator for 2-D Discrete Cosine Transform

Sub. #VRKMGE.

Andrei Stan.

Abstract: The Discrete Cosine Transform (DCT) is a key component in image and video compression systems due to its high energy compaction and efficient implementation. This paper presents a hardware accelerator for the 2-D DCT integrated into RISC-V–based FPGA systems. The design relies on an optimized 8-point 1-D DCT algorithm requiring only 11 multiplications and 29 additions, extended to 2-D using row–column decomposition. The accelerator employs a three-stage pipeline performing row-wise transform, column-wise transform, and quantization. It was integrated with both MicroBlaze V and CVA6 RISC-V cores and implemented on AMD VCU128 and KCU116 FPGA development boards. Experimental results for multiple image resolutions show significant performance improvements compared with the software implementation, achieving speedups of up to 44.56× and a throughput of 2 Mpixel/s at 100 MHz. The accelerator uses modest FPGA resources, enabling multiple instances and demonstrating its suitability for accelerating image and video compression pipelines in RISC-V–based systems.

kepler-formal: Open Logic Equivalence Checking for RISC-V CI Workflows

Sub. #VVBBMK.

Christophe Alexandre and Noam Cohen.

Abstract: The rapid expansion of the RISC-V ecosystem has led to an increasing number of open hardware projects hosted on collaborative platforms such as GitHub. While modern software development benefits from mature continuous integration and continuous deployment (CI/CD) methodologies, equivalent automated verification infrastructure remains limited for hardware design. In particular, formal verification tools such as logic equivalence checking (LEC) remain largely restricted to proprietary EDA solutions. This work explores the use of lightweight open-source EDA tools as scalable verification agents for open hardware development workflows. We present an open-source logic equivalence checking tool designed to operate efficiently within CI environments for RISC-V projects. Built on a high-performance C++ infrastructure for netlist representation and analysis, the tool enables rapid equivalence verification between different RTL transformations and synthesized netlists. Experimental results on open RISC-V designs demonstrate that automated equivalence checks can be integrated into CI pipelines with execution times compatible with typical pull request validation workflows. This approach provides a practical first verification gate for open hardware repositories before deeper sign-off verification using commercial tools.

The Next Generation RISC-V SoCs for Space Communications

Sub. #VXCZWW.

Marco Bertuletti.

Abstract: Non-Terrestrial Networks (NTN) require software-defined payloads to stretch the lifetime of space-components, while meeting strict real-time and power constraints. We evaluate the end-to-end 5G NTN up&downlink on a single rv64gc core. Measurements show that 273× speedup is needed to run the uplink in 1ms transmission time interval (TTI). We argue that programmable decoupled vector datapaths, implementing the RISC-V “V” extension are the key to bridge performance gaps, while preserving long term flexibility for space-grade systems.

Profiling and Optimizing AME for Matrix Multiplication

Sub. #W3ANF8.

Xinlei Zhao.

Abstract: The RISC-V ecosystem is evolving toward AI-oriented computing, with matrix-oriented proposal directions such as AME, VME, and IME attracting increasing attention. In LLM inference, matrix multiplication constitutes one of the dominant computation patterns, and quantized matrix multiplication is widely adopted by many accelerators to improve efficiency. In this setting, the practical value of matrix-oriented proposals depends not only on the instruction capabilities they provide, but also on how effectively representative operators can be mapped onto realistic execution flows. This work presents an operator-level profiling study of a currently discussed AME proposal for RISC-V AI. We first design representative matrix operators for quantized LLM-style workloads, then develop a gem5-based platform with support for the AME proposal, and profile matrix multiplication on this platform. Based on these observations, we further analyze scaled matrix multiplication as an extended operator flow and discuss a possible scaled matrix multiplication instruction strategy as a future optimization direction.

Toward an open-source platform for multi-lead Embedded ECG Processing on RISC-V processors

Sub. #WDXS8R.

Da Rocha Carvalho Bruno.

Abstract: Interest in edge inference for biomedical applications has boomed in recent years, given its benefits in terms of data privacy, low latency, and reduced cloud costs. We present Embedded ECG Processing on RISC-V(EEP-V), an end-to-end platform for multi-lead embedded ECG processing on RISC-V processors. EEP-V combines a custom multi-lead acquisition board, real-time digital signal conditioning, and on-device neural network inference in a fully local processing pipeline without cloud offloading. The platform is designed as an open-source hardware/software stack to support reproducible research on embedded cardiac monitoring. Our implementation targets a heterogeneous RISC-V architecture based on GAP9 and supports concurrent processing of up to 12 ECG leads. We validate the complete acquisition-to-inference pipeline using a medical-grade patient simulator and a reference multi-class arrhythmia classification model from PhysioNet/CinC Challenge 2021. On the deployed system, inference completes in 150 ms using 488 kB of L2 memory and consumes less than 5.47 mJ per classification, while the full pipeline consumes about 7 mJ per inference cycle. These results show the feasibility of an end-to-end multi-lead ECG processing platform on RISC-V and provide an open foundation for future embedded cardiac-monitoring research.

ONNX Runtime Convolution Acceleration on RISC-V via RVV

Sub. #WHMPM8.

Jose Sanchez-Yun.

Abstract: Inference engines are specialized software systems designed to execute pre-trained Machine Learning models. ONNX Runtime (ORT) emerges as a leading open-source inference engine for the Open Neural Network Exchange (ONNX) format, allowing models to be deployed regardless of the framework in which they were trained. While ORT provides a flexible architecture for deploying models across diverse hardware, it currently lacks architecture-specific optimizations for RISC-V. Consequently, computationally intensive tasks such as the convolution operation—which accounts for the majority of inference time in Convolutional Neural Networks (CNNs)—suffer from hardware underutilization by relying on standard scalar instructions. In this paper, we address this gap by proposing an optimized convolution implementation leveraging the RISC-V Vector Extension (RVV) and integrating it as a custom Execution Provider in ORT. We evaluate our solution on a Banana Pi BPI-F3 board across six standard reference CNN models. Experimental results show that our RVV-accelerated implementation achieves speedups of up to 3x compared to the official scalar ORT release, significantly improving CNN inference performance on RISC-V platforms.

Hardware support in RISC-V for ternary LLMs

Sub. #WKS77D.

David Aledo.

Abstract: Language models are becoming increasingly common, and their number of parameters is continuously increasing, imposing huge memory capacities. One of the most common techniques to reduce their memory footprint is weight quantization. Ternary models are one of the most extreme cases of quantization. So far, most hardware proposals focus on FPGA-based accelerators to optimize inference in quantized models, while current general-purpose processors have limited support (up to 8-bit integers). In this work we attempt a preliminary analysis of the potential benefits of moving the quantization hardware support directly to the processor. To do so, we make use of a state-of-the-art inference framework for CPUs and Small Language Models, evaluating what the competitive advantages of having dedicated SIMD hardware for quantized operations. The results show a speedup x2 (tokens/s) on a 350MB Small Language Model with a tendency to increase the speedup with the model size, using a minimal increase of the hardware resources (1.25% in LUTs).

Evaluating the Impact of Vector Co-Processors on Memory Hierarchies through Hybrid Simulation

Sub. #WVYG7T.

J Parker Jones.

Abstract: With the proliferation of data-hungry accelerators and co-processors in embedded system design, co-design of processors and memory systems is becoming more important. Current simulation techniques for processors rely on oversimplified and inflexible memory models, while techniques for memory system simulation tend to only utilize simple processor models. In this work, we integrate a cycle-accurate Verilator processor and vector co-processor model with the Gem5 memory simulator in order to evaluate the full impact of a data-hungry co-processor on the memory system and main core performance, and to provide a framework for future co-design of both processor and memory systems.

Sail-RISC-V and Spike for RISC-V Vector: Toward Consistent Golden Reference Behavior

Sub. #WWSLLF.

Daniel Große, Katharina Ruep and Manfred Schlägl.

Abstract: In recent years, the executable specification generated from Sail-RISC-V has increasingly been considered as a successor to the widely used Spike ISA Simulator as golden reference for RISC-V, including the complex and highly configurable RISC-V Vector Extension (RVV). In this paper, we compare the RVV behavior of Sail-RISC-V against Spike using the automated testing framework RVVTS. While Sail-RISC-V largely matches Spike under positive testing (0.23% deviations), negative testing reveals substantially more deviations (3.73%), highlighting remaining issues in Sail-RISC-V’s RVV instruction validity checking under dynamic configurations.

Reproducibility in open-source RISC-V HW flows

Sub. #XANKHZ.

Anmol Xx and Petr Kourzanov.

Abstract: Open-source hardware is booming. To prevent fragmentation, encourage collaboration and reuse we propose the RISC-V to join forces with Reproducible Builds communities and concentrate innovation potential where its needed most: creation of new micro-architectures, IPs and their integration into new SoCs and applications. To facilitate this goal, we chose Guix, a rigorous solution for reproducible software artefacts. We apply it to dependency management & reproducibility problems in open-source hardware and show the validity of the approach, taking CVA6 as a running example. The end result - a fully reproducible collection of packaged tests, emulation, simulation and cycle-accurate models - shows a promising workflow that could (in future) scale to support larger RISC-V community with reusable software & hardware components for next-generation platforms.

Sail-RISCV-WASM A Browser-Native RISC-V Toolchain and Debugging Workbench

Sub. #XDQWMR.

Yunxiang Luo and Mingzhu Yan.

Abstract: This paper presents Sail-RISCV-WASM, which addresses three common limitations of existing browser-based RISC-V tools: fragmented capabilities, limited configurability, and disconnected build/debug pipelines. The system uses sail-riscv as its semantic baseline and compiles it to WebAssembly, forming a three-layer architecture in a pure browser environment: a Sail decode/execute layer, a toolchain layer (gas/ld/objdump), and a metadata layer based on the RISC-V UDB. Based on this architecture, the paper defines two core workflows. The first is configuration-sensitive online encode/decode with instruction metadata navigation for cross-configuration behavior comparison. The second is an in-browser assemble-to-ELF, execute, and interactive debugging loop, supporting instruction-level stepping, source-line stepping, synchronized source/disassembly views, and register/memory tracing. Results show that the system provides a complete single-page flow from exploration to build to diagnosis, with strong extension coverage and configuration flexibility.

RISC-V Vector 1.0 code Generation in MLIR-xDSL

Sub. #XTDV7A.

Jie Lei.

Abstract: The fragmented RISC-V ecosystem demands portable, high-performance code generation for the Vector Extension (RVV 1.0). Upstream MLIR (LLVM 22.0) lacks two critical lowering stages needed for this: it cannot flatten dynamic memref ma- trix references into C pointers, nor emit Vector-Length-Agnostic (VLA) RVV intrinsics. This paper closes that gap with a six-stage hybrid MLIR–xDSL compilation workflow that automatically generates parameterized, hardware-aware C micro- kernels for GEMM entirely in Python, without modifying the MLIR C++ codebase. On a COTS BananaPi F3 board (SpaceMiT K1, 256-bit RVV 1.0), we show: (i) isolated micro-kernels match or exceed hand-written reference code (0.98×– 1.05×), peaking at 16.2 GFLOPS at the optimal 16×15 tile; (ii) on BERT-Large transformer layers (B1–B5), generated micro-kernels consistently surpass OpenBLAS, reaching up to 12.2 GFLOPS against the baseline’s 5.1 GFLOPS (a 2.4× speedup) and maintaining an average 15–27% performance advantage across all layer dimensions.

Bicameral+: re-assessing split vector and scalar cache designs for increased efficiency

Sub. #XYRVET.

Aitor Echevarría and Borja Perez.

Abstract: This paper introduces the Bicameral+, an enhanced version of the Bicameral Cache; a vector-aware memory hierarchy that separates vector and scalar accesses into two cache partitions tailored to the needs of each kind of access, improving spatial locality for vectors and eliminating scalar interference. The new design aims to reduce implementation complexity and improve energy efficiency, while retaining the performance improvements of the original proposal, by introducing a set associative design and an alternative opportunistic dirty block management scheme. Experimental results on thirteen benchmarks across various configurations show a 7x area reduction and 18x energy savings, while retaining an average 1.59x speedup w.r.t a conventional cache.

Revisiting x86-64 to RISC-V Binary Translation: A Hardware/Software Co-Design Path

Sub. #YAEZRU.

Xieyuan Wu.

Abstract: RISC-V is rapidly emerging as an open and extensible ISA, yet its adoption in desktop and server environments remains constrained by the dominance of the x86-64 software ecosystem. Dynamic binary translation (DBT) provides a practical mechanism for executing legacy x86-64 binaries on RISC-V without source code, but purely software-based DBT often incurs substantial overhead. In this work, we investigate a hardware/software co-designed approach for user-level x64-to-RV64 translation. We begin with a fine-grained characterization of runtime instruction behavior from SPEC CPU 2017 benchmarks, and extract micro-operation (μop) information for different instruction variants on a representative x86 microarchitecture. By correlating dynamic execution profiles with μop-level complexity, we introduce a quantitative model of semantic inflation, which exposes the semantic gap introduced by cross-ISA translation by discounting the inherent execution complexity of CISC instructions. This model enables us to systematically identify instruction variants that exhibit disproportionate expansion and reveals the underlying causes of this bloat. Based on these insights, we propose targeted hardware extensions to mitigate translation overhead. We implement the proposed approach in a Box64-based prototype and evaluate it through QEMU-based simulation. Experimental results demonstrate a significant reduction in the number of translated instructions, indicating a practical path toward near-native cross-ISA execution efficiency.

CHERI RVY development support platform

Sub. #YLJJMH.

Alexandre Joannou.

Abstract: We present the development flow and platform we have built to support CHERI development and ratification of the RVY extension. CHERI is an ISA extension providing hardware support for capabilities - unforgeable memory references embedding a memory address as well as bounds and permissions metadata. It enables spatial and temporal memory safety by design. We have developed a comprehensive workflow used to validate the proposed RVY extension both for functionality and performance. We maintain and make use of a formal golden model, which we leverage for design verification effort through directed-random fuzz testing of architectural features under development. We gather core CHERI functionalities in a reusable RTL library to use across multiple commercial and research implementations, maximising reuse of verification effort. We build and boot soft-core images of CHERI-enabled systems on FPGA at scale, enabling software development and performance evaluation of RV64Y microarchitectures and software stacks. This infrastructure has enabled rapid convergence for the development of the RVY extension with a high level of confidence in functionality and performance. We are now making use of this infrastructure to further enable various streams of research.

CVA6 Optimization

Sub. #YXSRKX.

Udaya Subedi and Angela Gonzalez.

Abstract: CVA6 is an open-source RISC-V core with highly configurable parameters for tailoring the core to various applications. An optimization-oriented analysis of the current implementation showed that the scoreboard (SB) and controller are the biggest combinational modules involved in the critical path. The SB is in charge of many crucial functions, including issuing, forwarding, writeback, and committing, while the controller manages all the stages of the core. This work presents two optimization proposals: re-order buffer (ROB) and Issue logic separation from the Scoreboard and registering Controller output. Preliminary results show promising outcomes in implementing the core, relaxing the timing, which in turn enables operating at a higher frequency. With this optimization, we get to improve the maximum frequency of operation by 14% for the existing FPGA configuration from OpenHW for Xilinx FPGA.

RVV Tips & Tricks

Sub. #Z8GZYW.

Olaf Bernstein.

Abstract: The RISC-V vector extension introduces SIMD instructions to RISC-V, however many known patterns in other SIMD extensions don’t translate 1-to-1. Therefore, the goal of this document is to share various RVV tips and tricks as well as some common pitfalls. It should help people familiar with other SIMD ISAs to figure out how to efficiently express many common patterns in RVV. We collected these paradigms while porting various software and algorithms to RVV.

Benchmarking the Vortex RISC-V GPU for Sparse Workloads

Sub. #ZBRZ7X.

Jules Dubois.

Abstract: Many computational problems require the processing of large sparse matrices, where the vast majority of entries are zero. The irregular distribution of the non-zero elements in these matrices stresses the memory system resulting in performance being bottlenecked by the memory bandwidth. On parallel architectures, workload imbalances also limit performance. Graphics Processing Units (GPUs) runnning sparse matrix kernels using state-of-the-art Basic Linear Algebra Subsystem (BLAS) libraries are central in modern HPC systems. Although RISC-V application processors are gaining in performance, RISC-V based GPUs are in an early stage of development. We benchmark sparse kernels both on modern HPC-grade GPUs and on Vortex, a RISC-V GPU that is gaining adoption. We analyse their performance under memory-bound workloads and report the gaps in software and hardware required to enable efficient sparse BLAS processing on RISC-V GPUs.

ANSSI IPECC-Accelerated ECC on CVA6 RISC-V SoC: Integration and Benchmarking

Sub. #ZBWEM7.

IGHILAHRIZ Billal.

Abstract: IPECC, an open-source side-channel-resistant ECC hardware accelerator developed by the French national agency ANSSI, is integrated into the CVA6 RISC-V SoC and prototyped on a Genesys 2 (XC7K325T) FPGA. Using the libecc cryptographic library, we evaluate eight signature scheme/curve combinations in three configurations: software-only execution, hardware acceleration without countermeasures, and fully protected hardware acceleration. With all countermeasures active, IPECC reduces ECDSA P-256 signature latency from 1.13 s to 180 ms (a 6.3x speedup), reaching 7.8x for Schnorr-based schemes and scaling up to 9.1x for P-521. On the FPGA target, the countermeasure overhead varies drastically from +3% (hash-dominated EdDSA) to +279% (Schnorr-based schemes). We demonstrate that this variance is fundamentally driven by the physical True Random Number Generator (TRNG) latency and each protocol’s specific reliance on scalar multiplication. In its compact P-256 configuration, the accelerator occupies only 4.2% of the FPGA LUT fabric (3,602 LUTs, 12 DSP48E1). This platform provides a reproducible basis for benchmarking ECC acceleration and side-channel countermeasures on RISC-V SoCs.

Distinguishing Exploit Failure from Effective CHERI Protection on RISC-V

Sub. #ZBWRKF.

Andreas Hinterdorfer, Daniel Große and Manfred Schlägl.

Abstract: CHERI extends conventional ISAs with hardware-enforced capabilities to provide fine-grained memory protection and its integration in RISC-V is gaining momentum with RVY. As adoption grows, implementations must be evaluated to ensure working CHERI protection mechanisms. We show that existing memory-corruption exploit implementations do not directly carry over to CHERI-enabled architectures, and that observed exploit failures (i.e., unsuccessful exploits) do not necessarily imply effective protection. To resolve this ambiguity, we propose a methodology that temporarily disables CHERI enforcement within a RISC-V VP. Comparing exploit behavior with and without CHERI enforcement under otherwise identical conditions makes it possible to distinguish exploit failure from effective CHERI protection.

Performance Characterization and Profiling of HQC Autovectorization on RISC-V Vector cores

Sub. #ZE8LDR.

Vito Cucinelli.

Abstract: The emergence of quantum computers threatens traditional cryptographic schemes, requiring the development of post-quantum algorithms. In this paper, we study the performance of the Hamming Quasi-Cyclic (HQC) scheme, the new Key Encapsulation Method (KEM) ratified by NIST in March 2025. We analyze different implementation approaches for the Sargantana RV64GBV core using the standard RISC-V bit-manipulation (B) and vector (V) extensions. We compare reference implementations against auto-vectorized code and then provide an overview of how to analyze and profile these implementations using RAVE.

1W Envelope: Area-Energy Trade-offs of Scalable RISC-V Systolic Arrays in Sky130

Sub. #ZFMXUE.

Daniel Klünder.

Abstract: Deploying high-performance AI inference on autonomous drones requires a precise balance between computational throughput and a strict 1W power envelope. This paper presents a vertical design space exploration (DSE) of the RISC-V Gemmini accelerator, scaling from 8x8 to 32x32 mesh configurations in the SkyWater 130nm process. Through an end-to-end evaluation using a YOLOv4-tiny model on the VisDrone dataset, we demonstrate a 74.75% reduction in model memory footprint via INT8 quantization and a speedup of up to 2352x compared to a RISC-V CPU baseline. Our results indicate that while the 32x32 mesh excels in peak throughput, the 16x16 mesh represents the optimal “sweet spot” for 1W-limited drone chiplets, combining high performance with manageable leakage and area.

A Fully Integrated FPGA-Based Reconfigurable Intelligent Surface Controller using an Embedded RISC-V Core

Sub. #ZTDS9J.

Rubén Padial-Allué.

Abstract: This paper presents a compact FPGA-based controller for Reconfigurable Intelligent Surfaces (RIS) that integrates an embedded RISC-V processor and dedicated hardware control within a single device. The proposed architecture targets a 15×15 mechanical RIS prototype driven by stepper motor actuators. The embedded RISC-V processor accesses the RIS controller through a lightweight memory-mapped interface, enabling software-programmable RIS reconfiguration while fully abstracting low-level actuation details. By integrating processing and control within the same FPGA, the proposed platform eliminates the need for external computing units and reduces communication latency.

A RISC-V Dual-Core Microcontroller Architecture for Flight Control OSD: A Single-Chip Implementation

Sub. #ZTNLUP.

Yong Yang.

Abstract: This work presents a novel, highly integrated dual-core microcontroller architecture based on the RISC-V ISA, specifically designed for First Person View(FPV) Drone On-Screen Display (OSD) systems. Traditional solutions suffer from computational bottlenecks or multi-chip synchronization latency. By leveraging a specialized RISC-V asymmetric dual-core architecture, this design achieves sub-microsecond synchronization between complex flight control execution and high-framerate video rendering. Incorporating advanced ISA extensions and custom microarchitectural features, the proposed SoC successfully injects rendered OSD data during the video signal’s blanking period with pixel-level precision, showcasing the potential of RISC-V in mission-critical vertical application domains.

Microarchitectural Side-Channel Attack on RISC-V

Sub. #ZZ7ADW.

Sadia Shamas.

Abstract: Side-channel attacks leveraging microarchitectural features are well-studied on x86 and ARM, but less so on RISC-V. This work implements and evaluates Flush+Reload cache-side-channel attacks on user-space software in a RISC-V system simulated in gem5 full-system mode. We develop both eviction-based and cache-block-invalidate (cbo.inval) probes, establishing an attack methodology for an unprivileged process using the RISC-V cycle counter. Our experiments reveal timing differences between cached and evicted accesses, confirming the existence of exploitable timing channels. While key recovery remains partial, these results demonstrate the feasibility of cache side-channel attacks on RISC-V and validate gem5 as an effective platform for microarchitectural security research.

HORCRUX: a Post-Quantum Cryptography Instruction Set Extension

Sub. #ZZQYRH.

Valeria Piscopo and aledolme.

Abstract: This paper introduces HORCRUX, an open RISC-V instruction set extension for post-quantum cryptography (PQC). A modular PQ-ALU, integrated through the Core-V eXtension Interface (CV-X-IF) accelerates the core kernels shared by hash-, lattice-, and code-based schemes, including Keccak processing, sampling, modular/polynomial arithmetic, finite-field operations, and coefficient compression. The design targets NIST-standardized algorithms (ML-KEM, ML-DSA, SLH-DSA, HQC) and additional candidates under evaluation. We release the complete hardware/software stack as open source and report 65 nm ASIC post-synthesis results: with a compact footprint of ~26.3 kGE and energy savings up to 99.5%, the extension provides a practical route to energy-efficient PQC on RISC-V with minimal integration effort.

SoCMake: Modular RISC-V SoCs for Radiation-Harsh and Safety-Critical Environments

Sub. #7DSDWG.

Benoît Denkinger.

Abstract: Building on the long-standing use of programmable system-on-chips (SoCs) for edge computing in other domains, this work explores their early adoption in application-specific integrated circuit (ASIC) designs for high-energy physics (HEP) experiments, targeting optimization from design through in-field operation. From front-end detector readout in harsh radiation environments to infrastructure monitoring such as beam and radiation level surveillance, radiation-hardened ASIC SoCs could benefit a range of HEP applications. Current efforts focus on programmable SoCs for control tasks and local data processing such as chip calibration, with physics data processing remaining a longer-term prospect. Beyond HEP, such fault-tolerant techniques, including triple modular redundancy (TMR), hardened interconnects, and memory protection, are equally applicable to safety-critical embedded controllers in domains such as automotive and industrial control systems. In this context, SoCMake, part of the System-on-Chip Radiation-Tolerant Ecosystem (SOCRATES), is being actively developed and used to produce prototype chips. One such chip is TriglaV , a fully radiation-hardened prototype ASIC designed for reliable operation in the radiation environment typical of Large Hadron Collider (LHC) detector front-end electronics. This paper reports on the current status of SOCRATES/SoCMake and the test results of TriglaV, as well as the ongoing work and future directions for the platform.

A RISC-V based Coarse-Grained Reconfigurable Architecture to Unify Signal and AI Processing

Sub. #7NKZ8A.

Christian Siemers.

Abstract: Combining signal processing and artificial intelligence applications is actually a demanding task as these areas require different hardware support during program execution. Specifically, if the demands on rea-time as well as fast execution are high, any feasible solution will use different processing platforms, e.g. DSPs for signal processing and GPU for AI. This results in higher costs, less reliability and high demands on memory transfer rates not to talk about different development tool chains, This paper introduces the UB410 architecture based on RISC-V with enhancements to support different application classes like digital signal processing and artificial intelligence.

FREESS: A Web-Based Educational Simulator for a RISC-V-Inspired Superscalar Processor Tomasulo-Style

Sub. #8ALWHG.

Roberto Giorgi.

Abstract: FREESS (Free Educational Superscalar Simulator) is an open-source teaching environment for instruction-level parallelism in a RISC-V-inspired superscalar processor. It provides a compact, cycle-by-cycle view of register renaming, issue, execution, write-back, commit, and memory ordering in a Tomasulo-style machine. The simulator exposes the register map, free pool, instruction window, reorder buffer, and load/store queues in one textual representation, so the evolution of the hardware state can be followed on screen and reproduced on paper. Runtime parameters such as issue width, queue sizes, and functional-unit latencies can be changed easily, enabling direct comparison among alternative superscalar organizations. The tool has supported Advanced Computer Architecture teaching for about fifteen years and is publicly available on GitHub.

Integrated Development Environment Features for Unified Database Specification Development

Sub. #8CYVSX.

Madeline Seifert, Isabel Godoy, Ajit Dingankar, Brayden Mendoza, Lughnasa Miller and Nina Luo.

Abstract: The RISC-V Unified Database (UDB) serves as a machine-readable “source of truth” for written RISC-V specifications. To improve the ease of creating these specifications, Qualcomm collaborated with a team of Harvey Mudd College students to develop an Integrated Development Environment (IDE) toolkit that can support architects for RISC-V specifications. The team has worked to develop many of the features one would consider standard for developing in a programming language in a modern IDE, including syntax highlighting, autocompletion, and cross-referencing. The groundwork for this IDE also lays the foundation for other tool developers for the RISC-V ecosystem to use information contained in the UDB more efficiently.

Loom: An Open-Source Toolchain for Automatic FPGA Emulation of Simulation-Grade SystemVerilog

Sub. #8VDLDD.

Florian Zaruba.

Abstract: Functional verification dominates modern SoC development effort, yet migrating simulation testbenches to FPGA emulation typically requires proprietary tools, expensive licenses, and significant manual RTL adaptation-particularly for designs using DPI-C calls, multi-cycle timing blocks, or system tasks like $display and $finish. We present Loom, a fully open-source toolchain that automatically transforms unmodified simulation-grade SystemVerilog into FPGA-synthesizable RTL with complete host communication infrastructure. Built on Yosys, Loom applies five composable compiler passes-memory shadowing, reset extraction, DPI-C bridge instrumentation, scan chain insertion, and AXI-Lite emulation wrapping-to close the semantic gap between simulation and emulation. We validate Loom end-to-end on a Snitch RISC-V core running on a Xilinx Alveo U250 with no manual source modifications, demonstrating DPI argument passing, scan-based state capture/restore, and host memory preloading via PCIe XDMA.

Simulation-Driven Framework for Custom RISC-V HW/SW Co-Development and Debug

Sub. #97EAVY.

Henrik Gustafsson.

Abstract: Custom RISC‑V implementations increasingly require tight coupling between hardware and software development to ensure correctness, performance, and rapid iteration. This paper presents the RISC‑V Unified DB Instruction Set Simulator (RVUDB‑ISS), an open-source simulation‑driven framework that enables early‑stage HW/SW co‑development, configuration validation, and full‑stack debug prior to RTL availability. The ISS is automatically generated from a formally specified configuration, producing an implementation‑accurate model for custom RISC‑V cores and extensions.

RVUDB‑ISS supports configuration‑optimized binaries, enforcement of architectural corner cases, and precise modeling of implementation‑defined behaviors. A key functionality is the ISS’s integrated debug experience: developers can run custom workloads, halt execution at the first instruction, and attach standard tools such as GDB and VS Code to provide a familiar SW debug environment. This enables full symbolic debug of custom cores without hardware availability, significantly reducing time‑to‑bring up, and improving quality at bring up.

Overall, RVUDB‑ISS demonstrates that simulation‑based debug for custom RISC‑V configurations enables earlier validation, higher code quality, and more reliable HW/SW co‑development compared to traditional pre and post‑silicon workflows.

Revisiting Transputers with RISC-V

Sub. #ADHZLM.

Rich Neale.

Abstract: The transputer is a famous High Performance Computing (HPC) architecture from the late 1980s/early 1990s, with Inmos being arguably the most famous example. Embodying a communication-centric, distributed-memory MIMD architecture designed explicitly for scalable parallel process networks, there are numerous potential efficiency advantages to this approach. In a world where scientific programmers are ever demanding more performance, but having to balance this with energy efficiency, this approach is worth another look. The Esperanto ET-SoC-1 was a 1,088-core RISC-V manycore accelerator organised around a mesh network-on-chip (NoC) with hierarchical cache and scratchpad memory structures. Purchased and released by the AI foundry who are focussed on open source, they are emphasising the transputer credentials of the architecture. In this abstract and associated poster we provide and independent exploration around how parallel code written for a T800 transputer array may be systematically mirrored onto the ET-SoC-1 compute fabric. We identify architectural similarities and highlight key divergences.

Integrating RISC-V into University Education: A Full-Stack Approach to Teaching System Security

Sub. #B3ASBU.

Moritz Waser and Lorenz Schumm.

Abstract: The semiconductor industry increasingly requires engineers skilled in both hardware design and software execution. This contribution presents a RISC-V-centric educational pipeline developed at our institute, bridging foundational bachelor’s coursework and specialized master’s programs. We outline three core courses that integrate practical hardware design, custom ISA extensions, and full-stack security. First, a computer organization course teaches students hardware design in SystemVerilog with the goal of modifying and extending a full RISC-V CPU. Second, a hardware security course tasks students with both the implementation of security-related hardware primitives for open-source RISC-V cores, and the development of software to interact with the extended hardware. Finally, a secure system architectures course addresses memory safety through full system prototyping, requiring students to modify the RISC-V Spike simulator and write custom LLVM compiler passes. This hands-on approach provides the ecosystem with engineers equipped to tackle modern microarchitectural and security challenges.

GPU-Accelerated Parallel Simulation for RISC-V Multi-Core IP Verification

Sub. #B8BJHQ.

Abinaya Senthil.

Abstract: Functional verification of RISC-V multi-core IPs is bottleneck by the sequential nature of convectional CPU-based event driven simulation, where coverage closure timelines scale linearly with core count and configuration complexity. This paper presents a GPU- accelerated parallel simulation framework that offloads stimulus generation, constraint solving, and concurrent coverage computation to GPU hardware while retaining UVM testbench orchestration on the CPU host. The framework employs a heterogeneous partitioning tasks including constrained-random transaction generation, functional coverage bin evaluation, and reference model computation are parallelized across GPU threads using CUDA kernels. The control, DUT RTL simulation, and sequential verification logic, ensuring complete compatibility with existing verification flows. Evaluated on RISC-V IP configurations ranging from 2 to 32 cores with AXI4 interconnect and MESI coherency protocol, the framework achieves up to 22x simulation speedup, reduces coverage closure time from 44 hours to 14 hours, and reaches 99 percent functional coverage versus 93 percent for CPU-only baselines within the same wall clock budget. The GPU acceleration advantage scales near -linearly with core count, making it particularly valuable for emerging many core RISC-V designs targeting automotive and data-center applications. The approach requires no modification to existing RTL or UVM testbench architectures, integrating via a lightweight GPU dispatch layer that operates on standard simulation interfaces.

Unleashing the Penguin: Programmable Device Model for verifying RISC-V IOMMU using Linux

Sub. #BWUHVG.

Sai Rajat Goparaju and Nicholas Piggin.

Abstract: RISC-V provides complex platform-level specifications, such as the RISC-V IOMMU, in addition to the core-level ISA to support a complete open computing platform. The RISC-V IOMMU delves into intricate hardware-software interactions, page table formats, command and fault queue handling, and multi-stage address translations that are as critical to system correctness but significantly harder to validate. An essential part of verifying the IOMMU involves executing real-world scenarios as would be presented via Linux. However, setting up a full SoC-level environment to run Linux sequences is time-consuming and resource-intensive. As a result, critical IOMMU interactions are often validated too late or not at all.

We have developed a programmable device model that permits Linux testing of RISC-V IOMMU RTL without requiring PCIe or DMA-capable devices to be integrated into the design under test. The device model has been pivotal in creating an emulation-friendly subsystem-level environment that integrates high-performance RISC-V cores (TT-Ascalon) with RISC-V IOMMU. The subsystem runs Linux as the primary stimulus source, reusing the upstream kernel IOMMU driver to exercise the IOMMU implementation against the RISC-V specification with complex and realistic scenarios.

We will present the design and operation of this device model, the subsystem environment and related software, and shall share our findings, including how it enabled us to quickly uncover corner-case bugs in our IOMMU RTL and its software drivers, thereby complementing traditional IP-level validation approaches.

XSCC: A High-Performance Compiler for RISC-V

Sub. #BYLQDM.

Lei Qiu.

Abstract: The RISC-V architecture has experienced rapid growth in recent years, evolving from an academic research project into a global ecosystem spanning industry, academia, and open-source communities. However, achieving competitive application performance across the diverse RISC-V microarchitectures requires a mature compiler infrastructure capable of realizing the performance potential of the underlying hardware. In this work, we present XSCC, a high-performance compiler built on LLVM 19.1.0, designed to meet industrial-grade performance demands while actively contributing to open-source ecosystem development. XSCC performs a systematic cross-architecture optimization analysis, distilling compiler insights from mature architectures into a cohesive set of optimizations for RISC-V, including enhanced loop transformations, memory access reordering, and microarchitecture-specific scheduling models. Four of these optimizations have been upstreamed to the LLVM project. Experimental evaluation demonstrates consistent improvements over baseline LLVM 19 and GCC 12, achieving up to 1.14x speedup on SPEC CPU 2006 FP on the simulated XiangShan KMHv3 and up to 1.30x speedup on SPEC CPU 2006 INT on commercial RISC-V hardware SpacemiT X60.

From Fragmentation to Systematization: A Standardized Quality Selection and Reconstruction Approach for RISC-V Courses

Sub. #C3QQBZ.

Fuyuan Zhang and Yunxiang Luo.

Abstract: The development of RISC-V technology faces challenges such as the existence of low-quality online courses, fragmented content, a lack of hierarchical and systematic course series, insufficient online experimental practice environments, and limited channels for learning Q&A. The paper sets out to develop a standardized model for assessing the quality of RISC-V courses. In addition, it puts forward a reconstruction method based on course classification tags, organized the individual video into a structured course series. The solution integrates a online RISC-V lab with offline community activities, thereby establishing an integrated online-offline practical teaching environment. This project has produced over 1,000 original RISC-V lecture videos, with total views exceeding 1.3 million. The experimental results demonstrate that the systematically organized course collections generated by this method significantly improve viewership and user engagement, providing a systematic solution for the development of the RISC-V education ecosystem.

From Open Architecture to Open Silicon: Taping out CORE-ET Many-Core RISC-V Platform

Sub. #DHQPQB.

Roman Shaposhnik and Tanya Dadasheva.

Abstract: We are going to talk about a fully open tapeout - from first schematics published for community review to sending the design to the Fab over the course of 6 months. Leveraging ET-platform and ecosystem around it, open source tools and now open CORE-ET silicon platform (part of OpenHW group), we present many-core RISC-V-based design with MRAM, creating a basis for the next generation open designs. This talk presents an increasingly open development model, highlighting both the progress already made and the practical gaps that remain in today’s silicon ecosystem.

ATESOR: A Multi-Stage LLM-based Framework for Autonomous RISC-V Software Porting

Sub. #DMVSJ8.

Akif Ejaz.

Abstract: The RISC-V instruction set architecture (ISA) has seen rapid adoption over the past few years. Despite this growth, the software ecosystem remains a major challenge to broader adoption. In contrast to x86 and ARM platforms, where precompiled binaries are widely available, RISC-V developers often face a significant software availability gap. Consequently, many packages, libraries, or applications must be built from source, requiring substantial expertise in build systems and target architectures. This process is largely manual and time-consuming, creating a significant barrier to widespread adoption of the RISC-V. To address this critical gap, this paper presents ATESOR, a multi-stage LLM-based framework for autonomous RISC-V software porting. The framework uses large language models to plan build requirements, compile packages, debug failures, and test generated binaries in RISC-V sandboxed environments. ATESOR supports both containerized RISC-V environment and native execution on RISC-V hardware such as the Banana Pi BPI-F3 and Milk-V Pioneer, provided by Cloud-V. ATESOR is trained on an internal dataset of more than 500 manually ported packages spanning build systems including CMake, Make, Ninja, and Go. For 100 CMake and Go-based packages, ATESOR demonstrated a 80% successful porting rate and experiment completed in approximately 1.5 hours, corresponding to an average porting time of about 54 seconds per package.

Fully Automated RISC-V ArchitecturalExploration with Chipyard and A-DECA

Sub. #DMYH7U.

Lilia Zaourar, Bruno Bodin and Bruno Bodin.

Abstract: The increasing demand for domain-specific architectures from various domains such as Artificial Intelligence (AI),High Performance Computing (HPC), and automotive systems is reshaping modern System on Chip (SoC) design,requiring faster iteration cycles and deeper hardware/software integration. While the open ISA RISC-V enablesunprecedented architectural flexibility, it also dramatically expands the design space across system, micro-architectural,and implementation levels. Efficiently navigating this complexity remains a key challenge for both academia and industry.The A-DECA framework is a design space exploration framework developed within the SoC Planner project to accelerateproductive SoC design. A-DECA enables a structured and modular exploration from high-level architectural configurationdown to synthesis-aware micro-architectural evaluation, effectively bridging the gap between system-level modeling andimplementation constraints.Our methodology leverage the open-source RISC-V design flow Chipyard to develop a hardware/software co-designsolution that supports automated configuration generation, parameter tuning, and quantitative performance, power, areatrade-off analysis. By reducing manual exploration effort and formalizing early-stage architectural planning, A-DECAsignificantly improves design productivity and accelerates pre-silicon decision-making. The framework reinforces theopen-source chip design ecosystem and lays the foundation for scalable, chiplet-oriented RISC-V architectures. Its plannedopen-source release aims to further enable reproducible research, industrial adoption, and collaborative innovation infull-flow RISC-V SoC development.

Advanced Interrupt Latency Optimization Approaches in RISC‑V Interrupt Architectures

Sub. #FMQNTD.

Evgenii Paltsev.

Abstract: Modern interrupt controllers combine hardware and software mechanisms to reduce interrupt latency, optimizing either worst-case latency, average-case latency, or both. The paper provides analysis of interrupt-latency optimization techniques and their trade-offs in the context of RISC‑V interrupt architectures. It draws on an end-to-end workflow that began with functional modeling and continued through simulation, OS porting, and RTL implementation. This provides practical insight into how these techniques behave both in isolation and in real-world systems. The paper shows that each technique admits multiple realizations, which redistribute cost across latency metrics, software and hardware implementation complexity, memory footprint, and other factors, and that the techniques are interdependent, so the benefits of enabling them are not directly additive.

Integration Challenges in RISC-V System Prototyping: The RISER Microserver Platform

Sub. #G7YQKR.

Manolis Marazakis.

Abstract: This extended abstract reports on the experiences and roadmap of the RISER project, which since January 2023 has been developing first-generation all-European RISC-V cloud server and accelerator prototypes capable of running fully-featured Linux-based software stacks. Building on processor IP from the EPI and EUPILOT projects, RISER targets Europe’s open strategic autonomy in cloud infrastructure. We present the RISER Microserver Platform, an FPGA-assisted prototype that integrates the EPAC1.5 RISC-V vector-processor test-chip in a standalone computing node with its own boot firmware, NVMe storage, and 100 Gbps Ethernet connectivity, and discuss the integration challenges encountered during bring-up

A Doom Demo Journey: Tenstorrent's Ascalon CPU on Synopsys emulation and prototyping systems

Sub. #GLCJSG.

Dongjie Xie, Brandon Zupan and Rae Parnmukh.

Abstract: This paper tells the incremental journey of taking Tenstorrent’s Ascalon RISC‑V CPU IP from RTL and emulation to a playable DOOM demo on a Synopsys’s prototyping platform. Along the way we describe the problems we overcame, and how we optimized our flows and the design. We close with a set of lessons and recommendations for teams who want to use emulation and prototyping and realistic workloads like DOOM to de‑risk RISC‑V IP adoption and accelerate hardware/software co‑design.

Code size reduction by advanced near addressing modes

Sub. #HAZPKR.

Kajetan Nürnberger.

Abstract: To enable debugging and calibration of real time systems, which are in interaction with the real plant, the software used on those systems often has a huge number of global variables. The huge number of global variables exceed the range addressable relative to the global pointer. Therefore, addressing these variables normally needs two instructions. Other CPU architectures commonly used in the real time control systems domain address these by various near addressing modes. This results in significant code size reductions and performance boost. This paper discusses different variants to add such near addressing features to the RISC-V ISA. The impact on the code size is evaluated with different representative workloads

Bao-CHERI: A Pure-Capability RISC-V Hypervisor

Sub. #JFUNQZ.

Bruno Sa.

Abstract: We present our work on porting CHERI to the open-source Bao hypervisor targeting the RISC-V architecture. A preliminary evaluation of our implementation shows a 30.2% increase in code size, an additional 1 KiB of runtime memory usage, a 20% increase in boot time, and a 13.43% increase in interrupt latency. To the best of our knowledge, this is the first publicly available implementation of a hypervisor incorporating CHERI for RISC-V that supports both CHERI and the RISC-V hypervisor extension. The port is publicly available as an open-source artifact for the RISC-V and CHERI communities.

A Hardware-Software Heterogeneous Framework for Agile RISC-V Verification with Model-Based Processor Fuzzing

Sub. #JZAHPN.

Juncheng Huo.

Abstract: Processor designs are increasingly complex, making verification a critical challenge in the chip development process. Traditional verification techniques, heavily reliant on software simulations and random test inputs, often fail to effectively identify complex corner cases, leading to slow convergence and high verification costs. To address these challenges, we propose a heterogeneous hardware-accelerated RISC-V verification framework that integrates FPGA acceleration with a domain-specific generative model. This framework generates semantically-aware RISC-V instruction sequences and executes them in parallel with a reference model, providing real-time coverage collection and differential checking. The system improves verification efficiency by generating high-quality test inputs and reducing the time required for coverage convergence. Experimental results show that our framework outperforms existing fuzzers in terms of both coverage and speed, achieving up to 1.27× higher coverage and accelerates verification by up to 107×(Cascade) to 3343×(DifuzzRTL) compared to state-of-the-art fuzzers, with consistently lower convergence difficulty.

openKylin: Empowering the RISC-V AI Ecosystem

Sub. #L98NW8.

Wenzhu Wang.

Abstract: In the AI era, the RISC-V architecture represents a transformative force due to its inherent modularity and extensibility. However, the transition from hardware potential to a production-ready AI ecosystem is fraught with challenges, primarily the fragmentation of hardware-software interfaces and the relative immaturity of the AI software ecosystem. As a leading Tier-1 operating system community, openKylin serves as the critical architectural glue, addressing these obstacles by empowering the RISC-V AI landscape through foundational OS construction, software stack optimization, and application innovation. By harmonizing hardware diversity with a unified software infrastructure, openKylin not only lowers the barrier to AI deployment but also defines a scalable roadmap for RISC-V across multi-scenario environments, transforming RISC-V into a premier, open-standard architecture for the global AI revolution.

X‑HEEP: An Open Hardware Platform Enabling Research and Education in RISC‑V SoC Design

Sub. #LCDXJK.

Pasquale Davide Schiavone.

Abstract: This work presents X-HEEP, an open-source RISC-V System-on-Chip (SoC) platform designed to lower the barrier to chip design for research and education, providing a configurable and extensible infrastructure that enables rapid development of custom RISC-V-based SoCs and hardware accelerators. The platform demonstrates how open ecosystems can accelerate silicon innovation and enable new academic chip design activities. In addition, X-HEEP illustrates how open hardware fosters collaboration between universities and industry, strengthens education in VLSI design, and contributes to broader European initiatives to advance semiconductor capabilities and technological sovereignty.

C-Trace: An Open-Source RISC-V Trace Encoder and its Ecosystem

Sub. #LWYLAF.

Alexander Weiss and Simon Wegener (AbsInt).

Abstract: Embedded tracing is essential for validating reliability, optimizing performance, and debugging complex embedded software. Despite rapid innovation in the RISC-V ecosystem, open and interoperable trace solutions have remained limited. C-Trace, developed in the context of the European TRISTAN project , addresses this gap with an open-source trace encoder and an extensible ecosystem approach. C-Trace introduces a modular trace-encoder architecture designed for efficient, continuous “live” observation. Beyond standard program-flow tracing, it supports hardware-assisted instrumentation that can automatically emit trace messages on access to selected control/status registers (CSRs) or on configurable watchpoints. This enables trace streams that carry richer runtime context (e.g., program counter, timestamps, direct data, and selected performance counters) and can support application use cases such as worst-case execution time (WCET) estimation, timing optimization, test-case prioritization, and integration-level coverage measurement. In addition to off-chip export, C-Trace can forward trace-triggered events internally to an on-chip CPU, enabling watchdog, runtime verification, or control-flow integrity (CFI) checking functionality. Finally, C-Trace is provided under a dual-licensing model (CERN-OHL-S and a non-copyleft commercial option) to balance open collaboration with industrial IP needs.

RISC-V Address-Encoded Byte Order Extension

Sub. #MKEL9U.

David Guerrero Martos and Jorge.

Abstract: In certain scenarios computer systems have to deal with both little-endian and big-endian data regardless their native endianness. A RISC-V extension that makes it possible to remove the overhead introduced when dealing with foreign-endian data is proposed. It can be implemented with little engineering effort and negligible impact on performance and hardware resources. Preliminary results show that the extension can remove a 62% or 37% of foreign-endian data processing overhead when compared to software solutions using the base Instruction Set Architecture (ISA) or the currently available bit manipulation extensions respectively. This performance boost can benefit both new and legacy software once compiler and library support is put in place.

From Architecture to GDS: Introducing the X200, a Market-Ready, High-Performance RISC-V Core

Sub. #MQJFNF.

feixiaolong.

Abstract: As the RISC-V software ecosystem achieves maturity for high-performance computing, the demand for production-ready, competitive processor cores has reached a critical point. This presentation introduces the X200, our flagship RISC-V core, which has completed its entire development cycle and is now ready for deployment. We provide a comprehensive overview of the X200’s journey, from its ambitious design goals to final GDS layout. The session details its advanced multi-stage pipeline, sophisticated memory subsystem, and scalable multi-core interconnect fabric. We present a transparent competitive analysis, share key benchmark results, and reveal detailed Power, Performance, and Area (PPA) data verified from the final layout. Attendees will gain a clear understanding of the X200’s capabilities and its readiness to power next-generation SoCs.

Optimizing IREE Compilation and End-to-End Object Detection Pipeline for RISC-V

Sub. #MYRD9A.

Adeel Ahmad.

Abstract: This work enables optimized, end-to-end inference of the object detection models on RISC-V vector CPU. It includes the implementation of optimized pre- and post-processing pipelines as well as the enablement of efficient execution of the models at FP32, FP16, and INT8 precisions. IREE, an MLIR-based compiler, is used to compile and optimize the model. Model inference on the Banana Pi BPI-F3 is profiled to identify top hotspot ops and their compilation is optimized in the IREE compilation pipeline either by improving vectorization or by implementing ukernels. For accuracy validation, the mean Average Precision (mAP) is computed using the COCO validation dataset. This project is supported by the RISC-V Software Ecosystem (RISE), and all the developed artifacts are open-source.

ACE: Atomic Cryptography Extension for RISC-V

Sub. #NA9Q9H.

Roberto Avanzi, Ruud Derwig, Luis Fiolhais, and Radim Krcmár.

Abstract: The Atomic Cryptographic Extension (ACE) is an ISA extension to enable secure cryptographic implementations. ACE separates key provisioning from key usage, enabling distinct environments to perform the two functions. For example, keys could be delivered to user software by a TEE applet. Unlike existing round-based AES extensions, which inherently expose key material, ACE performs cryptographic operations atomically. Keys are associated with metadata that ties them to specific algorithms and usage policies. Keys and metadata are bonded to each other by writing them in Context Registers (CRs). The contents of CRs can only be exported in encrypted and authenticated form for secure re-import, enabling secure context switches and VM migrations. ACE is work in progress of the High Assurance Cryptography (HAC) TG of RISC-V International.

openEuler for RVA23: Building a RISC-V Server OS with Ecosystem Partners

Sub. #NJKQXQ.

YANJUN WU, Sheng Qu and Jingwei Wang.

Abstract: In early 2026, the openEuler community, together with the Institute of Software, Chinese Academy of Sciences (ISCAS) and industry partners, released the openEuler 24.03 LTS SP3 for RISC-V server bring-up. The release aligns with ongoing RISC-V Server Platform efforts and adds practical support for RVA23-related vector and virtualization features across toolchains, user-space components, and the kernel. A central part of this work is RVCK (RISC-V Common Kernel), a shared kernel baseline designed to reduce duplicated per-vendor enablement work and improve reuse across platforms. This talk presents the engineering lessons behind that effort, including cross-vendor coordination, upstream collaboration, and the challenges of building a reusable software baseline for server-class RISC-V systems.

ALPES: Advanced Low-Power Edge Skeleton

Sub. #NWCNCN.

Emanuele Valea and JEREMIE PESCATORE.

Abstract: The emergence of the open-source RISC-V Instruction Set Architecture has significantly democratized CPU and SoC design across a wide range of applications. By enabling companies to implement and customize their own processor architectures, rather than relying on proprietary solutions from a few vendors, RISC-V allows architectures to be tailored to specific application requirements. However, CPU and SoC development remains complex and demands substantial design and verification expertise. To address this challenge, several academic and industrial reference platforms have been introduced to accelerate RISC-V–based SoC development. This abstract presents ALPES, a versatile SoC platform built around cores from the OpenHW Foundation. ALPES includes an application-class chipset based on the CVA6 processor, as well as multiple secondary chipsets built around the CV32E40P core, targeting safe and secure real-time use cases. ALPES provides a robust, pre-verified foundation for ASIC projects, supporting several research projects focused on the RISC-V ecosystem.

NoC and Memory Subsystems for AI Employing RISC-V Processors with Vector or Matrix Extensions

Sub. #PRQ9ES.

Ashley Stevens.

Abstract: To enhance processor performance on HPC and AI workloads, the RISC-V Vector Extension (RVV) was ratified by RISC-V International in 2021. Vector processing enables data parallelism by operating on vectors rather than scalars, which studies have shown can improve performance on vectorized workloads by up to eight times or more on some workloads, significantly increasing memory bandwidth requirements compared with scalar processors. The proposed RISC-V Matrix Extensions further multiply the challenges, placing even greater demands on the memory subsystem. This increased data throughput requires architects to re-evaluate their long-held assumptions about SoC architecture. While many studies focus on the software implications of vector and matrix extensions, this paper explores the challenges and evaluates possible solutions for performant and power-efficient, optimal SoC hardware architectures.

PicoNut/RISC-V: One Educational Platform for Hardware and Software Development

Sub. #PRU8TZ.

Johannes Hofmann and Gundolf Kiefer.

Abstract: Most existing educational tools for RISC-V focus on either hardware or software development, rarely both. PicoNut is an open, synthesizable RISC-V processor and system platform designed for academic education and rapid prototyping. By using SystemC-RTL as the main hardware modeling language, modules can be arbitrarily replaced by C/C++ software implementations. This allows to build full RTL as well as software based simulators for highly efficient and cycle-accurate simulations of complete systems. Even emulations of external hardware are possible, for example, by implementing screen hardware with a Qt-based GUI. To any system simulator, GDB can be attached for software debugging.

To demonstrate the platforms capabilities, students implemented a retro game console featuring custom peripherals and several ports of classic games. Through hands-on engagement with real hardware and software challenges, PicoNut empowers students to develop a comprehensive understanding of digital design, computer architecture, hardware-software codesign, and embedded systems development.

Implementing and Optimizing an Open-Source SD-card Host Controller for RISC-V SoCs

Sub. #PTWGKY.

Philippe Sauter.

Abstract: Recent announcements have shown the viability of end-to-end open-source (OS) Linux-capable RISC-V systems on chip (SoCs). However, practical application and software development platforms require efficient non-volatile storage, which is not adequately served by common SPI-based interfaces due to their limited throughput. Secure Digital (SD) cards are the de facto standard storage medium for embedded Linux systems; efficient SD host controller (SDHC) integration is thus essential for open-source RISC-V platforms. We present an OS SD host controller interface (SDHCI) peripheral integrated into the end-to-end OS Cheshire RISC-V SoC platform. The controller and its software stack are designed with full awareness of CVA6’s memory system and Linux driver behavior; during evaluation, we identify a significant performance bottleneck caused by the RISC-V memory model and CVA6’s implementation of the fence instruction, which flushes the pipeline and data cache on memory-mapped register accesses when cache management operations (CMOs) are unavailable. By customizing the driver’s register access paths and avoiding unnecessary fences, we substantially reduced this overhead. Our fully OS controller achieves up to 11.1 MB/s throughput, approaching the 12.5 MB/s limit of the SD interface and providing up to 6.5 times the throughput of SPI-based storage.

Why Edges Matter: A Case Study on Performance Improvements for OpenBLAS GEMM on RISC-V

Sub. #QB3TNY.

Rama Malladi and Chip Kerchner.

Abstract: Matrix multiplication (GEMM) sits at the heart of scientific computing, data analytics, and modern AI workloads. While much attention is given to peak throughput and ideal matrix sizes, real-world performance often hinges on the “edges” i.e., non-ideal dimensions, cache boundaries, and vector tail cases that quietly dominate execution time. In this paper, we present a practical case study of optimizing GEMM in OpenBLAS for RISC-V vector architectures. We show how careful handling of edge conditions, cache reuse, and vectorization strategy can deliver measurable performance gains. Techniques include maximizing cache and register reuse with single-pass data traversal, swapping operands and deferring transposition for easier storage, combining full- and half-vector operations with scalar instructions to efficiently handle irregular dimensions, and leveraging strided segmented load/store vector intrinsics to sustain throughput even in non-ideal layouts. These optimizations are not just academic; small inefficiencies in GEMM propagate directly into AI inference latency and energy. By focusing on edge cases and architectural nuance, we can unlock meaningful improvements for real-world workloads. These optimizations give substantial gains; for example, a 6 x 3072 × 3072 SGEMM MatMul efficiency improves from 23.5% to 68.7% of the peak.

Accelerating neural networks using SIMD ISA-Extension for RISC-V processor platforms: A complete toolflow

Sub. #QBSVKB.

Alexander Zapp and Carsten Rolfes.

Abstract: Our contribution demonstrates how developers can easily run neural networks on RISC-V processors using our custom hardware accelerator TetraEdge. We present a complete solution combining three components.

First, we introduce TetraEdge, a custom hardware SIMD accelerator. TetraEdge contains a four-stage pipeline design to accelerate 8-bit quantized CNNs inference on 32-bit RISC-V processors. In comparison to other hardware accelerators, TetraEdge features an innovative automatic data reordering and min/max accumulation.

Second, we extend the NeoRV32 open-source RISC-V processor, by two custom instructions to control TetraEdge without blocking the main processor. The CPU continues other tasks while the accelerator handles neural network operations. By directly interfacing with the CPU core’s register file, TetraEdge minimizes area and control complexity, enabling seamless integration into existing toolchains.

Finally, we combine both aforementioned contributions to the open-source framework AIfES (Artificial Intelligence for Embedded Systems). AIfES is specifically designed to train and run neural networks directly on resource-constrained devices. Its modular software architecture enables the integration of user-specific hardware accelerators, such as TetraEdge. AIfES reduces software overhead significantly with up to 54\% less memory usage and faster execution for CNNs.

RISC-V Tournament: Battle of HDLs

Sub. #QZVQMY.

Christoph Hazott.

Abstract: Hardware Description Languages (HDLs) have evolved from traditional Register Transfer Level (RTL) modeling over High-level Synthesis (HLS) towards todays generative approaches. Although modern HDLs often assert technical advantages, directly comparable evaluations across HDL paradigms remain scarce. This work introduces a year-long, community-driven tournament, designed to enable reproducible comparison of HDLs under uniform conditions. A RISC-V microarchitecture is independently implemented in multiple HDLs and evaluated within a standardized, GitHub-based framework. Since the framework provides identical conditions, differences can be related to how an HDL enables hardware realization. To ensure the quality of this tournament, all results are public, reproducible, and objectively evaluated, providing transparent evidence of HDL-specific strengths and trade-offs. Through contribution, participants can systematically demonstrate the capabilities of their preferred HDL. The collected implementations can further be used as common reference basis for research, education, and reproducible comparison of HDL approaches.

Beyond the Basics: Elevating Eclipse ThreadX to a First-Class RTOS for RISC-V

Sub. #RGDVRL.

Frédéric Desbiens and Akif Ejaz.

Abstract: As RISC-V moves from experimental silicon to mass-market industrial applications, the availability of proven, safety-certified Real-Time Operating Systems (RTOS) is a key enabler for adoption. Eclipse ThreadX (formerly Azure RTOS) has long been a cornerstone of the embedded industry. Yet, its immature support for the RISC-V ISA, particularly 64-bit implementations, remained a barrier for high-performance adoption.

In this session, you will learn how 10xEngineers, in collaboration with the Eclipse ThreadX project team, brought first-class RISC-V support to the ThreadX kernel. You will go on a deep dive into the architectural challenges of porting the kernel’s core components to both RV32 and RV64, including context switching, interrupt nesting, and timer management tailored for the RISC-V privileged architecture. You will also explore the practical enablement of this port on the SpacemiT K1 SoC (Banana Pi BPI-F3), bridging the gap between virtual prototyping in QEMU and physical hardware deployment. Finally, you will gain insights into the low-level kernel modifications required for RISC-V compliance and discover a roadmap for deploying ThreadX in the next generation of RISC-V embedded systems.

Memory Protection for MMU-less RISC-V: Current Status of SPMP and vSPMP

Sub. #RQP9GP.

joseosyx.

Abstract: As RISC-V expands into embedded critical domains like IoT and automotive automotive require predictable isolation mechanisms. Traditional MMU-based virtualization is often impractical for these resource-constrained environments due to the latency of page-table walks and significant memory overhead. In contrast, MPU-style region-based protection offers deterministic access checks with minimal footprint, making physical memory protection essential for secure, mixed-criticality systems.

While RISC-V PMP provides such mechanisms at machine privilege level, modern embedded software stacks, including RTOSes, separation kernels, and lightweight hypervisors, require similar capabilities at supervisor level. The proposed Supervisor-mode Physical Memory Protection (SPMP) extensions address this gap by allowing supervisor software to define access permissions over physical memory regions, enabling robust compartmentalization of software components in systems without virtual memory.

Virtualization further increases the need for such mechanisms. Embedded hypervisors are increasingly used to consolidate multiple operating systems or software domains on a single microcontroller-class platform while maintaining strict isolation guarantees. To support this model, SPMP is being extended to interact with the RISC-V Hypervisor extension through a two-stage protection approach (vSPMP), enabling the hypervisor to enforce global isolation while allowing guest operating systems to manage their own protection domains.

This talk presents the current status of the SPMP and SPMP for Hypervisor specifications, their architectural design and rationale, and their integration with the RISC-V privilege architecture. We will discuss the design rationale, implementation considerations, and potential deployment scenarios in secure IoT microcontrollers and automotive mixed-criticality systems.

UnifiedDB : Status and Plans for Fast and Rigorous RISC-V Development

Sub. #RXFF98.

Derek Hower and Paul Clarke.

Abstract: UnifiedDB is a machine‑readable encoding of the RISC‑V specification that enables automated generation of documentation, simulators, toolchain support, and certification tooling. We summarize recent accomplishments, including adoption within RISC‑V certification workflows, its role as the authoritative store for architectural parameters, and its maturation into stable, shared specification infrastructure backed by robust continuous integration and validation capabilities. In addition to summarizing technical stabilization and growing community use, the talk outlines a forward‑looking roadmap centered on lowering contribution barriers through new interaction models, including AI‑assisted specification development, while preserving architectural rigor.

Virtual Prototyping of Pixel Detector Architecture via Co-Simulation of PixESL and GVSoC

Sub. #SCCL9S.

mobradovic00.

Abstract: This work presents a co-simulation methodology for the evaluation of pixel detector architecture, combining two independently developed tools: PixESL, a virtual prototyping framework targeting the architectural exploration and performance assessment of pixel detector systems, and GVSoC, a full-platform simulator for RISC-V IoT SoCs.
In parallel with PixESL’s development, studies on integrating RISC-V-based SoCs with pixel detector readout circuitry were carried out using GVSoC. Rather than relying on fixed ASIC readout architectures, this approach introduces a programmable processing layer alongside the pixel readout, enabling software-level control over data handling.
In the outlined co-simulation flow, PixESL generates data for a given readout architecture and set of stimuli, while GVSoC simulates the target application executing on the PULP RISC-V SoC platform. Additionally, in order to accurately capture the overhead of data movement within this chain, a virtual prototype of the DMA block responsible for transfers between pixel readout and SoC was developed. Together, these components provide a unified view of the full readout chain, from initial stimulus to processed data, opening possibilities for more informed hardware-software co-design in future detectors.

RISC-V Silicon at Scale in Academia: Designing “Big” Open-Source Chips on PULP Platform

Sub. #SED3UJ.

Yichao Zhang.

Abstract: The PULP Platform team at ETH Zürich and the University of Bologna has delivered several “big” chips based on RISC-V cores that exceed the complexity of chips traditionally designed in academic/research settings. These designs are made possible through open-source principles that allow greater collaboration and innovation in critical parts of the design. RISC-V has been instrumental in the development of these designs, allowing the team to develop a sandbox of building blocks for creating designs that exceed one billion transistors.

CVA6-CFI: A First Glance at RISC-V Control-Flow Integrity Extensions

Sub. #SPL3GT.

Simone Manoni.

Abstract: This work presents the first design and evaluation of the standard RISC-V Control-Flow Integrity (CFI) extensions. The Zicfiss and Zicfilp extensions protect vulnerable software from control-flow hijacking through shadow stack and landing pad mechanisms. We integrate dedicated hardware support for both extensions into the open-source CVA6 core. Synthesis in 22 nm FDX technology shows only 1.0% area overhead, while evaluation on the MiBench automotive benchmark subset reports up to 15.6% runtime overhead.

Deep Dive into Upstream RISC-V Boot Chain

Sub. #TAK7KZ.

Marcel Ziswiler.

Abstract: While porting Freedesktop SDK to the EBC7700 for my FOSDEM’26 talk [1], I encountered some UEFI boot issues, which motivated me to dig deeper and uncover all the mysteries about the RISC-V boot chain. This talk looks in-depth at the RISC-V boot chain, both on the virtualised QEMU target, as well as practical hardware examples. Starting from their boot ROMs and how fusing/strapping may select the boot sources, usually containing some form of an SPL, like the one from U-Boot. Such SPL, in turn, loads U-Boot proper, which contains OpenSBI, implementing the Supervisor Binary Interface, in so-called FW_DYNAMIC form, meaning it does not require any platform-specific configuration parameters because all required information is passed by the previous booting stage at runtime. OpenSBI gets executed first and stays resident. The handover to the U-Boot boot loader, which implements the UEFI specification, marks the next boot stage. It may either directly load the Linux kernel, an optional initial RAM disk and the device tree, particularly useful during bring-up/development, or launch a UEFI boot loader like systemd-boot or GRUB. Handover to the Linux kernel marks the last and final stage in the boot chain. This talk not only looks at the software involved, but also how it may be built, deployed and debugged. As usual, I complete my talk with a live demo.

[1] https://fosdem.org/2026/schedule/event/LX3NNU-upstream-embedded-linux-on-risc-v-sbcs

Cycle-Accurate IOPMP Reference Model with Configurable Interfaces, Integration Tests, and a CVA6 SoC Implementation

Sub. #TTPKFR.

Gull Ahmed.

Abstract: In RISC-V based systems, a key security mecha- nism is the Input-Output Physical Memory Protection (IOPMP) subsystem, which enables controlled access to shared memory and peripherals by non-core initiators [1]. While the specification defines functional behavior, the availability of a publicly available cycle-accurate reference model will encourage early SoC-level integration. This paper presents an open-source cycle-accurate IOPMP reference model consisting of a SystemVerilog wrapper integrated with a C-based functional reference model. The SystemVer- ilog wrapper models pipeline timing, transaction ordering, and standard bus interfaces, while the C-based reference model provides specification-compliant functional evaluation of address matching, permission checking, and priority resolution. The combined architecture enables both functional correctness and timing-accurate system-level validation. The model supports AMBA AXI4 for transaction enforcement and AMBA AHB3-Lite for configuration, to enables seamless re- placement with actual RTL. A reusable suite of architectural-level bare-metal tests is provided, and the approach is demonstrated through integration in an open-source CVA6-based SoC [4]. Index Terms—RISC-V, IOPMP, SoC Security, Reference Model, Cycle-Accurate Modeling

A Holistic Approach to Attached Matrix Extension on RISC-V From ISA to Software Stack

Sub. #U38PRX.

Qiu Jing.

Abstract: The increasing computational demands of modern AI workloads necessitate a holistic architectural approach to AI acceleration on RISC-V processors. This talk presents the XuanTie Tensor Processing Engine (TPE), a RISC-V-based Attached Matrix Extension (AME) engine designed to address AI acceleration across three dimensions: ISA, microarchitecture, and software ecosystem. At the ISA level, the TPE adopts the in-progress RISC-V AME specification, featuring dedicated tensor registers and a comprehensive instruction set encompassing matrix multiply-accumulate, element-wise, special function, reduction, and load/store operations with broad data type support including INT4, FP8, FP16, and micro-scaling formats. At the microarchitecture level, the design incorporates a matrix engine achieving 2 TOPS/GHz at INT8/FP8, a concurrent vector engine with hardware-accelerated non-linear functions, and a layered memory subsystem featuring a coherent tensor cache and data prefetch engine. A full-stack software ecosystem spanning LLVM toolchain to graph execution runtime completes the solution. Experimental results on the XuanTie C930 cluster demonstrate 99% FP16 GEMM utilization. We discuss key design trade-offs and implications for the evolving RISC-V AME standard.

RISCVML: Teaching RISC-V Embedded ML with Rust — From ESP32-C3 to ESP32-P4

Sub. #UMWQJ8.

Scottie_von_Bruchhausen.

Abstract: RISC-V deployment in embedded systems, IoT, and edge AI has outpaced developer education: most tutorials target C/C++ and cover only basic microcontroller tasks, leaving a gap for building ML systems with modern toolchains. RISCVML addresses this with a Rust curriculum spanning 172 chapters across seven modules, from beginner hardware to on-device ML inference.

The curriculum uses Espressif RISC-V SoCs: the ESP32-C3 (BLE 5.0, ~€3) and ESP32-C6 (Wi-Fi 6, Thread/Matter, ~€4) introduce Rust fundamentals — GPIO, sensors, power management, and wireless protocols. The ESP32-P4 (dual core 400 MHz, AI extensions, 128 bit vector ISA, ~€25) anchors an advanced module: ISP camera pipeline, hardware accelerated 2D rendering, H.264 encoding, DMA orchestration, and vector accelerated ML inference.

These converge in a capstone: a bird detection pipeline capturing MIPI-CSI frames, running quantized detection through esp-dl, driving pan/tilt servos, and recording H.264 video — all in async Rust with ESP-IDF drivers via FFI.

By pairing Rust memory safety with production toolchains (esp-hal, esp-idf-hal) on affordable hardware, and using a mascot to make complex terminology approachable for younger learners, RISCVML lowers the barrier for the next generation of RISC-V developers — supporting Europe’s push for sovereign silicon literacy.

RISC-V vs. ARM in an Embedded Real-Time System

Sub. #UZ9UMQ.

Germano Brunacci and Christian Wenzel-Benner.

Abstract: The Raspberry Pi Pico2 is an ideal platform to showcase the state of RISC-V capabilities in the realm of embedded real-time systems. When switching from the ARM Cortex-M33 to the RISC-V Hazard3 CPU cores everything else stays the same: memory subsystem, clock tree, peripherals. This allows for an apples-to-apples comparison of the relative strengths and weaknesses of the two CPU implementations by compiling the same C code for both architectures. We present a detailed comparison of relative performance and assembly code differences as well as insight on how much effort using RISC-V instead of ARM on the RP2350 MCU powering the Pico2 really adds.

From Profiling to Performance: Optimizing Small Language Models on RISC‑V Architectures

Sub. #VPNYEP.

Dongjie Xie, Jose Arnau, Rama Malladi and Chip Kerchner.

Abstract: Small Language Models (SLMs) are increasingly critical for edge AI, yet their performance on RISC-V requires rigorous profiling to identify architectural bottlenecks. This work evaluates the performance of SLMs including Gemma3, Llama-3.2, Qwen-2.5, DeepSeek, and Phi-3.5 on the Tenstorrent Ascalon RISC-V Core. We developed a profiling methodology to analyze workload distribution, which revealed that Matrix Multiplication (MatMul) contributes ~90% of total compute across all evaluated models. Given the computational complexity of running full-model emulations, we extract these critical kernels for targeted benchmarking. Our implementation on the HAPS platform achieves significant performance leaps over standard baselines. FP32 execution, utilized for maximum precision, was optimized by transitioning from traditional SGEMM to a new high-performance implementation. Simultaneously, INT8 performance, targeted for efficient inference, was accelerated by migrating from standard RVV to a specialized IGEMM (with a VQDOT) implementation.

Vitamin-V: Results and Lessons Learnt

Sub. #VVJ8FY.

Ramon Canal.

Abstract: Vitamin‑V (2023–2025) is a Horizon Europe project building a production‑grade, open‑source RISC‑V ecosystem for cloud environments. It extends RISC-V ISA support in three execution platforms (QEMU, gem5, FPGA), enables virtualization and contributes to the development of full cloud‑native stacks—OpenStack, Kubernetes, Kata Containers, RustVMM. The project also boosts commercial developments from Semidynamics, ZeroPoint, and Virtual Open Systems. This paper summarizes the technical outcomes and lessons learned.

PQC4eMRTD: Post Quantum Cryptography for Resource Constrained RISC-V Systems

Sub. #WEQNRM.

Leonidas Kosmidis.

Abstract: The PQC4eMRTD is a CSA (Coordination and Support Action) project funded by the European Commission, which focuses on monitoring and influencing the standardization space of Post Quantum Security with a particular focus on machine readable travel documents (MRTDs) such as national identification cards, passports and other personal documents. These types of documents include tiny microprocessors which interact with RFID devices in order to read sensitive personal information (e.g. biographic and biometric data) stored securely on the document, and allows the authentication of a person. Existing documents use conventional cryptographic algorithms which will be vulnerable to quantum computer attacks, especially considering the long validity period of such documents. For this reason, there is an interest in planning their migration to post quantum cryptographic algorithms in a standardized way. This abstract focuses on the work performed in the project within RISC-V systems, which extends to other types of resource constrained systems.

RISE and Yocto: Building a RISC-V Board Farm

Sub. #XMBLRH.

Trevor Gamblin.

Abstract: The Yocto Project is an open-source collaboration providing the tools which developers and organizations need to create custom embedded systems for a variety of architectures. This nominally includes both 32- and 64-bit RISC-V platforms, but until recently, official support and testing has been limited to emulated systems and a community-managed board-support layer (targeting hardware compliant with RVA22 and earlier) called meta-riscv. With the impending mass-availability of RVA23-compliant development boards and growing community interest, the RISE Project aims to ensure that Yocto is ready, by providing developer support for triaging RISC-V specific issues, while simultaneously improving test coverage and board support in the meta-riscv layer. To this end, a set of RVA22-based development boards have been deployed alongside some periodic build and test pipelines implemented with Forgejo and Labgrid, allowing early prototyping of longer-term validation workflows that Yocto and the community can build upon in the future. The foundation that this provides will ensure that as more organizations investigate RISC-V platforms for inclusion in their projects, a proven and reliable level of support will be waiting for them.

Monte Cimone v3: Where RISC-V Stands in High-Performance Computing

Sub. #XSLFRB.

Emanuele Venieri.

Abstract: The Monte Cimone project provides a RISC-V testbed for High-Performacne Computing cluster. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044 processor, an evolution of the SG2042 used in MCv2. We characterize MCv3 using HPL and STREAM benchmarks coupled with power measurements, and compare it against two reference platforms: the Intel Xeon Platinum 8480+ (Sapphire Rapids) and the NVIDIA Grace CPU Superchip. Our results show that the SG2044 more than doubles single-core performance and improves scalability compared to SG2042. MCv3 achieves an energy efficiency of 3.08GFLOPs/W which improves of 10x w.r.t. MCv1 and is in the range of x86-64 and Arm servers. On pure performance when normalized on the SIMD/Vector length MCv3 on its peak efficiency point (16 cores) achieves 46% performance of Intel Sapphire Rapids server and 91% performance of NVIDIA Grace CPU superchip.

Enabling Ultra Low Power Signal Processing on RISC V with Micro-DSP Enhanced Microcontrollers

Sub. #XUGKLJ.

Revi Ofir and Dmitry Utyansky.

Abstract: While RISC-V’s modular and efficient design is well suited for cost sensitive, low power systems, modern sensor centric workloads increasingly demand digital signal processing (DSP) and control capability on a single core. Although the RISC-V P extension significantly advances integer based DSP through packed SIMD operations, its breadth and complexity can be prohibitive for ultra low power and highly constrained MCUs. This paper presents the µDSP extension, a lightweight set of SIMD DSP instructions designed to accelerate fixed point signal processing while preserving minimal area and power overhead. Integrated directly into a compact 3 stage in order RISC V pipeline, µDSP targets common sensor processing kernels such as filtering, accumulation, and FFTs. The proposed approach complements the evolving RISC-V ecosystem for deeply constrained embedded sensor-based applications.

"One Student One Chip" Initiative: Learn to Build RISC-V Chips from Scratch with MOOC

Sub. #XVSFHN.

Xiaoke Su.

Abstract: The “One Student One Chip” (OSOC) initiative was launched by the University of Chinese Academy of Sciences in 2019. The initiative guides students through designing a RISC-V processor chip from scratch, including tape-out, developing a simple operating system, running it on the chip, running the real game Legend of Sword and Fairy, and completing the physical design process using open-source EDA tools. This enables students to understand the entire processor chip design process. As of February 2026, OSOC enrollments have surpassed 17,000, representing participants from more than 1,200 universities worldwide. This report introduces the implementation of the “One Student One Chip” initiative and the outcomes of open-source chip talent cultivation.

The ISOLDE Space Demonstrator: a RISC-V Ecosystem for Low-Power On-board Inference

Sub. #XXQJQF.

Emanuele Valpreda, Mattia Paladino and Davide Di Ienno.

Abstract: Integrating AI-based capabilities into satellites improves spacecraft autonomy, but poses considerable obstacles in designing the hardware and software ecosystem. The orbit-dependent generation of power with solar cells, the limited thermal dissipation and weight present significant challenges in designing a compute platform capable of edge inference, forcing the trade-off between high-performance for complex AI models and strict power/area budget. Moreover, AI models must share the same resources of traditional algorithm that are executed onboard concurrently with the inference, such as avionics, attitude orbit control, data handling and signal processing. However, benefits of onboard processing comprise secure and private computation, decreased data uplink/downlink demands, autonomous detection and resolution of anomalies, enabling autonomous spacecraft operation. Therefore, it is necessary to adapt a hardware-aware codesign approach in designing software components to implement energy-efficient and secure edge inference without decreasing the performance of traditional applications. The ISOLDE space demonstrator comprises several RISC-V cores and accelerators, and its hardware architecture and software ecosystem are presented, with a particular focus on the interactions of several open-source and open-hardware IPs developed by various academic and industry European partners.

Towards a Modern Packed SIMD Architecture for RISC-V: Learning from Production Of ET-SIMD

Sub. #YCVWTV.

FelixCLC.

Abstract: The RISC-V Vector Extension (RVV) adopts a vector-length agnostic (VLA) model for exploiting data-level parallelism. We argue that this abstraction imposes significant costs in real silicon: control logic complexity, implicit state tracking in out-of-order pipelines, and runtime overhead that erode VLA’s theoretical portability benefits. Drawing on production experience with the ET-SoC-1, a 1088-core RISC-V processor, we present ET-SIMD, a fixed-width 256-bit packed SIMD extension that overlays the standard F extension register file. In Flynn’s Taxonomy [2], ET-SIMD is a classical SIMD design: scalar and packed instructions share the same register file, a pattern well understood by GCC and LLVM autovectorizers and proven on competing ISAs, yet absent from RISC-V. We describe the extension’s architectural rationale, its relationship to contemporary packed SIMD practice, and its availability through the AI Foundry initiative.

HyperCroc: Open-Source RISC-V MCU with Plug-In Interface for Domain-Specific Accelerators

Sub. #ZFT38M.

Philippe Sauter.

Abstract: Domain-Specific architectures with accelerators for machine learning and signal processing require efficient bulk data movement and high-bandwidth access to large datasets. Such capabilities are often absent from minimal open-source microcontrollers (MCUs). We present HyperCroc, an extension to the end-to-end open-source RISC-V Croc system-on-chip (SoC) integrating a silicon-proven HyperBus controller for off-chip DRAM and Flash memory access and a DMA engine, providing a practical MCU-class platform with streamlined plug-in support for domain-specific acceleration. HyperBus offers a low-pin-count PSDRAM interface at up to 400 MB/s, enabling bandwidth-scaled dataset access, while the DMA engine enables autonomous, high-throughput transfers without CPU intervention. HyperCroc preserves Croc’s open-source synthesis and physical implementation flow targeting IHP’s open 130 nm process design kit (PDK); the full chip can be implemented in under one hour on a consumer-grade workstation. We further report first silicon measurements from MLEM, the first Croc tapeout, confirming that the silicon is fully functional at 72 MHz @ 1.2 V and validating the end-to-end flow.

CHERI for RISC‑V: From Academic Breakthrough to Industry-Scale Ecosystem Adoption

Sub. #ZJXLJD.

Mike EFTIMAKIS.

Abstract: As digital systems become ever more interconnected, the global cost of cybercrime continues its steep rise. 70% of the vulnerabilities leading to these attacks stem from memory safety issues, which have remained persistent for several decades. After 15 years of research, the Capability Hardware Enhanced RISC Instructions (CHERI) technology has matured into a practical solution to this challenge now moving toward standardization within RISC-V. While CHERI has been validated through extensive academic research and multiple industrial prototypes, the next critical step is a broad, sustainable transfer from research labs into commercial RISC-V products. This paper outlines how collaborative ecosystem building—across academia, industry, open-source communities, and government stakeholders—is essential to enable the adoption of CHERI on RISC-V. The CHERI Alliance plays a central role in this transition, acting as a bridge, amplifier, and catalyst for a memory safe ecosystem.