CINXE.COM

<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University</title>  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content=" RISC-V, Memory Hierarchy, Machine Learning, Shared L3 Cache, Tensor-Aware Caching, Advanced Prefetching, Gemmini, Gem5, DRAMSim2, ML Workloads, Memory Bottlenecks. " lang="en" name="keywords"/> <base href="/html/2503.13064v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S1" title="In HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">I </span><span class="ltx_text ltx_font_smallcaps">Introduction</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S2" title="In HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">II </span><span class="ltx_text ltx_font_smallcaps">Literature Survey</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S2.SS1" title="In II Literature Survey ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-A</span> </span><span class="ltx_text ltx_font_italic">Memory Bottlenecks in ML Workloads</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S2.SS2" title="In II Literature Survey ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span> </span><span class="ltx_text ltx_font_italic">Existing Memory Solutions for ML</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S2.SS3" title="In II Literature Survey ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-C</span> </span><span class="ltx_text ltx_font_italic">RISC-V-Based Accelerators</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S2.SS4" title="In II Literature Survey ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-D</span> </span><span class="ltx_text ltx_font_italic">Research Gaps</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S3" title="In HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">III </span><span class="ltx_text ltx_font_smallcaps">Problem Statement</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#S4" title="In HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">IV </span><span class="ltx_text ltx_font_smallcaps">Experimental Setup</span></span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document">HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads <span class="ltx_note ltx_role_thanks" id="id1.id1"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">thanks: </span>Department of Computer Science, Rice University</span></span></span> </h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Pranav Suryadevara </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> Morris Plains, NJ, USA <br class="ltx_break"/>pranav.suryadevara@rice.edu </span></span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id2.id1">The growth of machine learning (ML) workloads has underscored the importance of efficient memory hierarchies to address bandwidth, latency, and scalability challenges. HERMES focuses on optimizing memory subsystems for RISC-V architectures to meet the computational needs of ML models such as CNNs, RNNs, and Transformers. This project explores state-of-the-art techniques such as advanced prefetching, tensor-aware caching, and hybrid memory models. The cornerstone of HERMES is the integration of shared L3 caches with fine-grained coherence protocols and specialized pathways to deep learning accelerators like Gemmini. Simulation tools like Gem5 and DRAMSim2 are used to evaluate baseline performance and scalability under representative ML workloads. The findings of this study highlight the design choices and anticipated challenges, paving the way for low-latency scalable memory operations for ML applications.</p> </div> <div class="ltx_keywords"> <h6 class="ltx_title ltx_title_keywords">Index Terms: </h6> RISC-V, Memory Hierarchy, Machine Learning, Shared L3 Cache, Tensor-Aware Caching, Advanced Prefetching, Gemmini, Gem5, DRAMSim2, ML Workloads, Memory Bottlenecks. </div> <div class="ltx_para" id="p1"> <br class="ltx_break"/> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">I </span><span class="ltx_text ltx_font_smallcaps" id="S1.1.1">Introduction</span> </h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Machine learning (ML) models have become integral to modern computing, with applications spanning computer vision, natural language processing, and autonomous systems. These models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers, often require extensive computational resources and memory bandwidth. Consequently, optimizing memory hierarchies for ML workloads is critical to improving performance and efficiency <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib1" title="">1</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib2" title="">2</a>]</cite>.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">RISC-V, an open source Instruction Set Architecture (ISA), is gaining traction due to its flexibility and extensibility <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib3" title="">3</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib4" title="">4</a>]</cite>. However, the performance of RISC-V processors for ML workloads is hindered by memory bottlenecks, including latency, limited bandwidth, and lack of scalability <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib6" title="">6</a>]</cite>. The HERMES project aims to design a high-performance memory subsystem tailored to ML workloads on RISC-V architectures. Using techniques such as advanced prefetching <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib8" title="">8</a>]</cite>, tensor-aware caching <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib9" title="">9</a>]</cite>, and hybrid memory models <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib7" title="">7</a>]</cite>, HERMES seeks to overcome these challenges and enhance the efficiency of ML computations.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">This report presents a foundational study of HERMES, outlining the design considerations, initial simulation results, and challenges in achieving a scalable, low-latency memory hierarchy for ML workloads and evaluate the performance using simulation tools such Gem5 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib10" title="">10</a>]</cite> and DRAWSim2<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib11" title="">11</a>]</cite>.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">II </span><span class="ltx_text ltx_font_smallcaps" id="S2.1.1">Literature Survey</span> </h2> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS1.5.1.1">II-A</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS1.6.2">Memory Bottlenecks in ML Workloads</span> </h3> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.1">ML workloads, especially deep learning models, are characterized by large datasets and frequent memory access patterns. Studies have identified key challenges that impact the efficiency of memory subsystems:</p> </div> <section class="ltx_subsubsection" id="S2.SS1.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">High Latency</h4> <div class="ltx_para" id="S2.SS1.SSSx1.p1"> <p class="ltx_p" id="S2.SS1.SSSx1.p1.1">Tensor operations, such as matrix multiplications, require frequent access to large datasets stored in memory. The latency involved in fetching data from DRAM can significantly degrade performance if not mitigated by efficient caching and prefetching mechanisms <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib1" title="">1</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib2" title="">2</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib5" title="">5</a>]</cite>.</p> </div> </section> <section class="ltx_subsubsection" id="S2.SS1.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Bandwidth Constraints</h4> <div class="ltx_para" id="S2.SS1.SSSx2.p1"> <p class="ltx_p" id="S2.SS1.SSSx2.p1.1">ML models, particularly those involving convolutions and recurrent operations, demand high memory bandwidth. The concurrent fetching of weights, activations, and gradients often saturates the available bandwidth, leading to bottlenecks in data transfer. This issue is exacerbated in systems with limited on-chip memory <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib12" title="">12</a>]</cite>.</p> </div> </section> <section class="ltx_subsubsection" id="S2.SS1.SSSx3"> <h4 class="ltx_title ltx_title_subsubsection">Scalability</h4> <div class="ltx_para" id="S2.SS1.SSSx3.p1"> <p class="ltx_p" id="S2.SS1.SSSx3.p1.1">As ML models grow in size and complexity, memory subsystems must scale to accommodate increasing data loads. Current architectures struggle to maintain performance as the dataset size exceeds cache capacity, requiring frequent access to slower off-chip memory <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib15" title="">15</a>]</cite>.</p> </div> <div class="ltx_para" id="S2.SS1.SSSx3.p2"> <p class="ltx_p" id="S2.SS1.SSSx3.p2.1">Addressing these bottlenecks requires innovative memory hierarchies that can balance latency, bandwidth, and scalability for different ML workloads.</p> </div> </section> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS2.5.1.1">II-B</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS2.6.2">Existing Memory Solutions for ML</span> </h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.1">Various techniques have been developed to optimize memory systems for ML workloads. These solutions address specific aspects of memory bottlenecks and have been implemented in modern hardware architectures.</p> </div> <section class="ltx_subsubsection" id="S2.SS2.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Tensor-Aware Caching</h4> <div class="ltx_para" id="S2.SS2.SSSx1.p1"> <p class="ltx_p" id="S2.SS2.SSSx1.p1.1">Tensor-aware caching improves data reuse by optimizing the cache layout for tensor operations. For example, Google’s Tensor Processing Unit (TPU) uses specialized caching mechanisms that align with the structure of tensor computations, reducing latency and increasing throughput <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib1" title="">1</a>]</cite>. This approach ensures that frequently accessed tensors remain in fast on-chip memory, minimizing DRAM accesses.</p> </div> </section> <section class="ltx_subsubsection" id="S2.SS2.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Prefetching Strategies</h4> <div class="ltx_para" id="S2.SS2.SSSx2.p1"> <p class="ltx_p" id="S2.SS2.SSSx2.p1.1">Advanced prefetching techniques predict data access patterns and load data into the cache before it is needed. Techniques such as Stride Prefetching and Machine Learning-Based Prefetching have proven effective in reducing latency for sequential and non-sequential data access patterns <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib8" title="">8</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib5" title="">5</a>]</cite>. For instance, Intel’s microprocessors incorporate sophisticated hardware prefetchers that dynamically adapt to workload behavior <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib12" title="">12</a>]</cite>.</p> </div> </section> <section class="ltx_subsubsection" id="S2.SS2.SSSx3"> <h4 class="ltx_title ltx_title_subsubsection">Hybrid Memory Models</h4> <div class="ltx_para" id="S2.SS2.SSSx3.p1"> <p class="ltx_p" id="S2.SS2.SSSx3.p1.1">Combining DRAM with High-Bandwidth Memory (HBM) offers a balance between capacity and speed. HBM provides significantly higher bandwidth compared to traditional DRAM, making it suitable for ML workloads with high data transfer requirements <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib16" title="">16</a>]</cite>. For example, NVIDIA’s GPUs use HBM to deliver the necessary bandwidth for training large neural networks <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib13" title="">13</a>]</cite>.</p> </div> <div class="ltx_para" id="S2.SS2.SSSx3.p2"> <p class="ltx_p" id="S2.SS2.SSSx3.p2.1">These techniques have shown promise in improving memory performance for ML workloads, but their integration in RISC-V architectures remains limited.</p> </div> </section> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS3.5.1.1">II-C</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS3.6.2">RISC-V-Based Accelerators</span> </h3> <div class="ltx_para" id="S2.SS3.p1"> <p class="ltx_p" id="S2.SS3.p1.1">The open source nature of RISC-V has led to the development of several accelerators designed for ML workloads. These accelerators benefit from custom memory pathways and specialized hardware support for tensor computations.</p> </div> <section class="ltx_subsubsection" id="S2.SS3.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Gemmini</h4> <div class="ltx_para" id="S2.SS3.SSSx1.p1"> <p class="ltx_p" id="S2.SS3.SSSx1.p1.1">Gemmini is a configurable deep learning accelerator for RISC-V systems. It provides efficient support for matrix multiplication and convolution operations, which are fundamental to ML workloads <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib14" title="">14</a>]</cite>. The Gemmini design allows integration with custom memory pathways, facilitating efficient data transfer between caches and the accelerator.</p> </div> </section> <section class="ltx_subsubsection" id="S2.SS3.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Simulation Tools</h4> <div class="ltx_para" id="S2.SS3.SSSx2.p1"> <p class="ltx_p" id="S2.SS3.SSSx2.p1.1">Tools like Gem5 and DRAMSim2 are widely used to evaluate memory hierarchies and system performance. Gem5 offers a flexible platform for modeling CPU and memory interactions <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib18" title="">18</a>]</cite>, while DRAMSim2 provides detailed simulations of DRAM behavior <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib11" title="">11</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib17" title="">17</a>]</cite>. These tools are essential for testing and validating new memory subsystem designs in RISC-V architectures.</p> </div> </section> </section> <section class="ltx_subsection" id="S2.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS4.5.1.1">II-D</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS4.6.2">Research Gaps</span> </h3> <div class="ltx_para" id="S2.SS4.p1"> <p class="ltx_p" id="S2.SS4.p1.1">While existing solutions address specific aspects of memory optimization, there is limited research on integrating these techniques into a cohesive memory subsystem for RISC-V architectures specifically targeting ML workloads. Most current implementations focus on proprietary architectures, such as TPUs and GPUs, leaving a gap in the open-source RISC-V ecosystem <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib15" title="">15</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib13" title="">13</a>]</cite>.</p> </div> <section class="ltx_subsubsection" id="S2.SS4.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">HERMES: A Unified Memory Solution</h4> <div class="ltx_para" id="S2.SS4.SSSx1.p1"> <p class="ltx_p" id="S2.SS4.SSSx1.p1.1">HERMES aims to fill this gap by:</p> </div> <div class="ltx_para" id="S2.SS4.SSSx1.p2"> <ol class="ltx_enumerate" id="S2.I1"> <li class="ltx_item" id="S2.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">1.</span> <div class="ltx_para" id="S2.I1.i1.p1"> <p class="ltx_p" id="S2.I1.i1.p1.1">Combining shared L3 caches, hybrid memory models, tensor-aware caching, and advanced prefetching into a unified memory hierarchy.</p> </div> </li> <li class="ltx_item" id="S2.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">2.</span> <div class="ltx_para" id="S2.I1.i2.p1"> <p class="ltx_p" id="S2.I1.i2.p1.1">Optimizing memory pathways for Gemmini and similar RISC-V accelerators <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#bib.bib6" title="">6</a>]</cite>.</p> </div> </li> <li class="ltx_item" id="S2.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">3.</span> <div class="ltx_para" id="S2.I1.i3.p1"> <p class="ltx_p" id="S2.I1.i3.p1.1">Providing a scalable and efficient solution for deep learning models on RISC-V platforms.</p> </div> </li> </ol> </div> </section> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">III </span><span class="ltx_text ltx_font_smallcaps" id="S3.1.1">Problem Statement</span> </h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">Current RISC-V memory hierarchies face significant challenges in meeting the performance requirements of ML workloads due to several interrelated issues:</p> </div> <div class="ltx_para" id="S3.p2"> <ol class="ltx_enumerate" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">1.</span> <div class="ltx_para" id="S3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.p1.1.1">High Latency</span>: Accessing large datasets stored in off-chip memory introduces delays that degrade system performance. Tensor operations, such as matrix multiplications and convolutions, require frequent memory accesses, and high latency in these operations can significantly reduce overall computation speed.</p> </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">2.</span> <div class="ltx_para" id="S3.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i2.p1.1.1">Limited Bandwidth</span>: ML workloads involve the simultaneous transfer of large volumes of data, including weights, activations, and gradients. The limited bandwidth of traditional memory hierarchies often results in data transfer bottlenecks, slowing down the training and inference processes.</p> </div> </li> <li class="ltx_item" id="S3.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">3.</span> <div class="ltx_para" id="S3.I1.i3.p1"> <p class="ltx_p" id="S3.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i3.p1.1.1">Inefficient Caching Strategies</span>: Traditional caching mechanisms are not optimized for the structured and repetitive nature of tensor data. Without tensor-aware caching, caches are prone to evictions and cache misses, reducing data reuse and increasing latency.</p> </div> </li> <li class="ltx_item" id="S3.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">4.</span> <div class="ltx_para" id="S3.I1.i4.p1"> <p class="ltx_p" id="S3.I1.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i4.p1.1.1">Scalability Issues</span>: As ML models and datasets grow in complexity, memory subsystems must scale to handle increasing data loads. Existing RISC-V architectures struggle to maintain performance when datasets exceed cache capacities, leading to frequent off-chip memory accesses and performance degradation.</p> </div> </li> </ol> </div> <div class="ltx_para" id="S3.p3"> <p class="ltx_p" id="S3.p3.1">To address these challenges, <span class="ltx_text ltx_font_bold" id="S3.p3.1.1">HERMES</span> aims to design a high-performance memory hierarchy that integrates:</p> </div> <div class="ltx_para" id="S3.p4"> <ul class="ltx_itemize" id="S3.I2"> <li class="ltx_item" id="S3.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I2.i1.p1"> <p class="ltx_p" id="S3.I2.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i1.p1.1.1">Shared L3 Caches</span>: Enable efficient data sharing between CPU cores and accelerators, reducing latency and improving data availability.</p> </div> </li> <li class="ltx_item" id="S3.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I2.i2.p1"> <p class="ltx_p" id="S3.I2.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i2.p1.1.1">Hybrid Memory Models</span>: Combine DRAM and high-bandwidth memory (HBM) to balance capacity and speed, ensuring sufficient bandwidth for large-scale ML workloads.</p> </div> </li> <li class="ltx_item" id="S3.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I2.i3.p1"> <p class="ltx_p" id="S3.I2.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i3.p1.1.1">Advanced Prefetching</span>: Predict and pre-load data into caches to minimize latency and reduce the frequency of memory stalls.</p> </div> </li> <li class="ltx_item" id="S3.I2.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I2.i4.p1"> <p class="ltx_p" id="S3.I2.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I2.i4.p1.1.1">Tensor-Aware Caching</span>: Optimize cache replacement policies and data layouts specifically for tensor operations, increasing data reuse and reducing cache misses.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S3.p5"> <p class="ltx_p" id="S3.p5.1">These solutions collectively aim to improve memory subsystem performance, enabling faster training and inference for complex ML models on RISC-V architectures.</p> </div> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">IV </span><span class="ltx_text ltx_font_smallcaps" id="S4.1.1">Experimental Setup</span> </h2> <section class="ltx_subsection" id="S4.SSx1"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Simulation Tools</h3> <div class="ltx_para" id="S4.SSx1.p1"> <p class="ltx_p" id="S4.SSx1.p1.1">To evaluate the HERMES memory hierarchy design, we utilize the following simulation tools:</p> </div> <div class="ltx_para" id="S4.SSx1.p2"> <ul class="ltx_itemize" id="S4.I1"> <li class="ltx_item" id="S4.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i1.p1"> <p class="ltx_p" id="S4.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i1.p1.1.1">Gem5</span>: Gem5 is a versatile, cycle-accurate simulator that models CPU and memory interactions in computer architectures. It supports the RISC-V ISA and allows for detailed configuration of cache hierarchies, coherence protocols, and processor cores. We use Gem5 to simulate the entire memory subsystem and analyze performance metrics such as latency, bandwidth, and cache hit rates.</p> </div> </li> <li class="ltx_item" id="S4.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i2.p1"> <p class="ltx_p" id="S4.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i2.p1.1.1">DRAMSim2</span>: DRAMSim2 is a cycle-accurate simulator for DRAM memory systems. It provides detailed timing simulations for various DRAM technologies, including DDR3 and DDR4. DRAMSim2 is integrated with Gem5 to model off-chip memory accesses accurately and measure the impact of different memory hierarchies on system performance.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S4.SSx2"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Architecture Design</h3> <div class="ltx_para" id="S4.SSx2.p1"> <p class="ltx_p" id="S4.SSx2.p1.1">The HERMES memory hierarchy integrates several advanced features to optimize performance for machine learning workloads. The key components of the architecture are as follows:</p> </div> <div class="ltx_para" id="S4.SSx2.p2"> <ul class="ltx_itemize" id="S4.I2"> <li class="ltx_item" id="S4.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i1.p1"> <p class="ltx_p" id="S4.I2.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I2.i1.p1.1.1">Shared L3 Cache</span>: The shared L3 cache is designed to facilitate data sharing between CPU cores and the Gemmini accelerator. It employs fine-grained coherence protocols to ensure data consistency and minimize latency caused by cache invalidations. The L3 cache size and associativity are configurable to balance performance and area constraints.</p> </div> </li> <li class="ltx_item" id="S4.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i2.p1"> <p class="ltx_p" id="S4.I2.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I2.i2.p1.1.1">Hybrid Memory Model</span>: HERMES combines traditional DRAM with High-Bandwidth Memory (HBM). DRAM provides large storage capacity, while HBM delivers high data transfer rates. This combination allows for efficient handling of large datasets while ensuring that critical data can be accessed with low latency.</p> </div> </li> <li class="ltx_item" id="S4.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i3.p1"> <p class="ltx_p" id="S4.I2.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I2.i3.p1.1.1">Tensor-Aware Caching</span>: The cache hierarchy includes tensor-aware caching strategies that optimize data reuse patterns. Tensor data is stored in a way that reduces cache evictions and maximizes hit rates during tensor operations such as convolutions and matrix multiplications.</p> </div> </li> <li class="ltx_item" id="S4.I2.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i4.p1"> <p class="ltx_p" id="S4.I2.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I2.i4.p1.1.1">Advanced Prefetching</span>: The prefetching mechanism anticipates data needs by analyzing access patterns during ML workloads. Both stride prefetching and machine learning-based prefetching are employed to minimize cache miss penalties and improve overall system throughput.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S4.SSx2.p3"> <p class="ltx_p" id="S4.SSx2.p3.1">The following architecture diagram illustrates the core components of the HERMES memory hierarchy and their interconnections:<span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_tag ltx_tag_note">1</span>This diagram shows the relationship between CPU cores, the shared L3 cache, the Gemmini accelerator, and the hybrid memory model (combining DRAM and HBM). It highlights the data flow and memory pathways designed to optimize performance for ML workloads.</span></span></span></p> </div> <figure class="ltx_figure" id="S4.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="208" id="S4.F1.g1" src="extracted/6285171/architecture_diagram.png" width="299"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>HERMES Memory Hierarchy Architecture Diagram</figcaption> </figure> </section> <section class="ltx_subsection" id="S4.SSx3"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Workloads</h3> <div class="ltx_para" id="S4.SSx3.p1"> <p class="ltx_p" id="S4.SSx3.p1.1">We evaluate the HERMES memory hierarchy using a suite of representative machine learning workloads, including:</p> </div> <div class="ltx_para" id="S4.SSx3.p2"> <ul class="ltx_itemize" id="S4.I3"> <li class="ltx_item" id="S4.I3.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I3.i1.p1"> <p class="ltx_p" id="S4.I3.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I3.i1.p1.1.1">Convolutional Neural Networks (CNNs)</span>: CNNs involve intensive matrix multiplications and convolution operations, which stress the memory bandwidth and caching mechanisms. Models such as ResNet and VGG are used for evaluation.</p> </div> </li> <li class="ltx_item" id="S4.I3.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I3.i2.p1"> <p class="ltx_p" id="S4.I3.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I3.i2.p1.1.1">Recurrent Neural Networks (RNNs)</span>: RNNs require frequent access to previous states, placing a high demand on low-latency memory. We use LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) models to measure performance.</p> </div> </li> <li class="ltx_item" id="S4.I3.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I3.i3.p1"> <p class="ltx_p" id="S4.I3.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I3.i3.p1.1.1">Transformer Models</span>: Transformers involve large-scale attention mechanisms and matrix multiplications, resulting in high memory access patterns. BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) models are used to test scalability and bandwidth utilization.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S4.SSx4"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Performance Metrics</h3> <div class="ltx_para" id="S4.SSx4.p1"> <p class="ltx_p" id="S4.SSx4.p1.1">The following performance metrics are used to evaluate the HERMES memory hierarchy:</p> </div> <div class="ltx_para" id="S4.SSx4.p2"> <ul class="ltx_itemize" id="S4.I4"> <li class="ltx_item" id="S4.I4.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I4.i1.p1"> <p class="ltx_p" id="S4.I4.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I4.i1.p1.1.1">Latency (ns)</span>: The average memory access time, including cache hits and misses, is measured to evaluate how efficiently the memory subsystem handles data requests.</p> </div> </li> <li class="ltx_item" id="S4.I4.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I4.i2.p1"> <p class="ltx_p" id="S4.I4.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I4.i2.p1.1.1">Bandwidth (GB/s)</span>: The rate at which data is transferred between memory and the processor/accelerator. Higher bandwidth indicates better handling of large data transfers.</p> </div> </li> <li class="ltx_item" id="S4.I4.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I4.i3.p1"> <p class="ltx_p" id="S4.I4.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I4.i3.p1.1.1">Cache Hit Rate (%)</span>: The percentage of memory accesses that are served from the cache. Higher cache hit rates reduce latency and improve overall performance.</p> </div> </li> <li class="ltx_item" id="S4.I4.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I4.i4.p1"> <p class="ltx_p" id="S4.I4.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I4.i4.p1.1.1">Energy Consumption (µJ/operation)</span>: The energy consumed per memory operation is measured to evaluate the efficiency of the memory subsystem.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="S4.SSx5"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Simulation Configuration</h3> <div class="ltx_para" id="S4.SSx5.p1"> <p class="ltx_p" id="S4.SSx5.p1.1">The experimental configuration used for the simulations is as follows:</p> </div> <div class="ltx_para" id="S4.SSx5.p2"> <ul class="ltx_itemize" id="S4.I5"> <li class="ltx_item" id="S4.I5.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i1.p1"> <p class="ltx_p" id="S4.I5.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i1.p1.1.1">Processor</span>: 4-core RISC-V processor with in-order execution.</p> </div> </li> <li class="ltx_item" id="S4.I5.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i2.p1"> <p class="ltx_p" id="S4.I5.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i2.p1.1.1">L1 Cache</span>: 32 KB per core, 8-way set associative.</p> </div> </li> <li class="ltx_item" id="S4.I5.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i3.p1"> <p class="ltx_p" id="S4.I5.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i3.p1.1.1">L2 Cache</span>: 256 KB per core, 8-way set associative.</p> </div> </li> <li class="ltx_item" id="S4.I5.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i4.p1"> <p class="ltx_p" id="S4.I5.i4.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i4.p1.1.1">Shared L3 Cache</span>: 8 MB, 16-way set associative.</p> </div> </li> <li class="ltx_item" id="S4.I5.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i5.p1"> <p class="ltx_p" id="S4.I5.i5.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i5.p1.1.1">Memory</span>: Hybrid memory model combining 8 GB DRAM and 4 GB HBM.</p> </div> </li> <li class="ltx_item" id="S4.I5.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i6.p1"> <p class="ltx_p" id="S4.I5.i6.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i6.p1.1.1">Coherence Protocol</span>: MESI (Modified, Exclusive, Shared, Invalid) protocol for cache coherence.</p> </div> </li> <li class="ltx_item" id="S4.I5.i7" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I5.i7.p1"> <p class="ltx_p" id="S4.I5.i7.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I5.i7.p1.1.1">Simulation Tools</span>: Gem5 for CPU and cache simulation, DRAMSim2 for DRAM modeling.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S4.SSx5.p3"> <p class="ltx_p" id="S4.SSx5.p3.1">The configurations are designed to balance accuracy and simulation runtime while providing insights into the performance benefits of the HERMES memory hierarchy.</p> </div> </section> </section> <section class="ltx_section" id="Sx1"> <h2 class="ltx_title ltx_font_smallcaps ltx_title_section">Results and Discussion</h2> <div class="ltx_para" id="Sx1.p1"> <p class="ltx_p" id="Sx1.p1.1">This section presents the performance evaluation of the HERMES memory hierarchy using simulations. The key metrics analyzed include latency, bandwidth, cache hit rates, and energy consumption. The results are compared against a baseline RISC-V system to highlight the improvements offered by HERMES.</p> </div> <section class="ltx_subsection" id="Sx1.SSx1"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Latency and Bandwidth Improvements</h3> <div class="ltx_para" id="Sx1.SSx1.p1"> <p class="ltx_p" id="Sx1.SSx1.p1.1">The latency and bandwidth of the HERMES architecture were evaluated and compared to a baseline RISC-V system. The results are summarized in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.T1" title="TABLE I ‣ Latency and Bandwidth Improvements ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">I</span></a>.</p> </div> <figure class="ltx_table" id="Sx1.T1"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">TABLE I: </span>Latency and Bandwidth Comparison</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="Sx1.T1.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="Sx1.T1.1.1.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T1.1.1.1.1"><span class="ltx_text ltx_font_bold" id="Sx1.T1.1.1.1.1.1">Architecture</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="Sx1.T1.1.1.1.2"><span class="ltx_text ltx_font_bold" id="Sx1.T1.1.1.1.2.1">Latency (ns)</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="Sx1.T1.1.1.1.3"><span class="ltx_text ltx_font_bold" id="Sx1.T1.1.1.1.3.1">Bandwidth (GB/s)</span></th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="Sx1.T1.1.2.1"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T1.1.2.1.1">Baseline RISC-V</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T1.1.2.1.2">120</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T1.1.2.1.3">25</td> </tr> <tr class="ltx_tr" id="Sx1.T1.1.3.2"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T1.1.3.2.1">HERMES (Shared L3 Cache)</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T1.1.3.2.2">95</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T1.1.3.2.3">35</td> </tr> <tr class="ltx_tr" id="Sx1.T1.1.4.3"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T1.1.4.3.1">HERMES (Prefetching)</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T1.1.4.3.2">85</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T1.1.4.3.3">40</td> </tr> <tr class="ltx_tr" id="Sx1.T1.1.5.4"> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T1.1.5.4.1">HERMES (Tensor-Aware Caching)</td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="Sx1.T1.1.5.4.2">80</td> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="Sx1.T1.1.5.4.3">42</td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="Sx1.SSx1.p2"> <p class="ltx_p" id="Sx1.SSx1.p2.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.F2" title="Figure 2 ‣ Latency and Bandwidth Improvements ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">2</span></a> shows a significant reduction in latency with the HERMES configurations, while Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.F3" title="Figure 3 ‣ Latency and Bandwidth Improvements ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">3</span></a> highlights the bandwidth improvements due to the hybrid memory model.</p> </div> <figure class="ltx_figure" id="Sx1.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="182" id="Sx1.F2.g1" src="extracted/6285171/latency_comparison.png" width="299"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Memory Latency Comparison: HERMES configurations reduce latency compared to the baseline RISC-V system. The shared L3 cache, advanced prefetching, and tensor-aware caching contribute to this improvement.</figcaption> </figure> <figure class="ltx_figure" id="Sx1.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="182" id="Sx1.F3.g1" src="extracted/6285171/bandwidth_comparison.png" width="299"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Bandwidth Utilization Comparison: The hybrid memory model in HERMES increases bandwidth significantly, supporting high data transfer needs of ML workloads.</figcaption> </figure> </section> <section class="ltx_subsection" id="Sx1.SSx2"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Cache Hit Rate Improvements</h3> <div class="ltx_para" id="Sx1.SSx2.p1"> <p class="ltx_p" id="Sx1.SSx2.p1.1">Cache hit rates were evaluated to determine the effectiveness of tensor-aware caching in HERMES. Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.T2" title="TABLE II ‣ Cache Hit Rate Improvements ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">II</span></a> shows the results.</p> </div> <figure class="ltx_table" id="Sx1.T2"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">TABLE II: </span>Cache Hit Rate Comparison</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="Sx1.T2.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="Sx1.T2.1.1.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T2.1.1.1.1"><span class="ltx_text ltx_font_bold" id="Sx1.T2.1.1.1.1.1">Architecture</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="Sx1.T2.1.1.1.2"><span class="ltx_text ltx_font_bold" id="Sx1.T2.1.1.1.2.1">Cache Hit Rate (%)</span></th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="Sx1.T2.1.2.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T2.1.2.1.1">Baseline RISC-V</th> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T2.1.2.1.2">60</td> </tr> <tr class="ltx_tr" id="Sx1.T2.1.3.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T2.1.3.2.1">HERMES (Shared L3 Cache)</th> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T2.1.3.2.2">75</td> </tr> <tr class="ltx_tr" id="Sx1.T2.1.4.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T2.1.4.3.1">HERMES (Prefetching)</th> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T2.1.4.3.2">80</td> </tr> <tr class="ltx_tr" id="Sx1.T2.1.5.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T2.1.5.4.1">HERMES (Tensor-Aware Caching)</th> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="Sx1.T2.1.5.4.2">90</td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="Sx1.SSx2.p2"> <p class="ltx_p" id="Sx1.SSx2.p2.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.F4" title="Figure 4 ‣ Cache Hit Rate Improvements ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">4</span></a> illustrates the improvements in cache hit rates. Tensor-aware caching aligns data storage with tensor operation patterns, resulting in fewer cache misses.</p> </div> <figure class="ltx_figure" id="Sx1.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="182" id="Sx1.F4.g1" src="extracted/6285171/cache_hit_rate.png" width="299"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>Cache Hit Rate Comparison: Tensor-aware caching in HERMES improves cache hit rates by reducing data evictions and optimizing data reuse patterns.</figcaption> </figure> </section> <section class="ltx_subsection" id="Sx1.SSx3"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Energy Consumption</h3> <div class="ltx_para" id="Sx1.SSx3.p1"> <p class="ltx_p" id="Sx1.SSx3.p1.1">Energy consumption is a critical factor for ML workloads. Table <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.T3" title="TABLE III ‣ Energy Consumption ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">III</span></a> shows the energy consumption per memory operation for different configurations.</p> </div> <figure class="ltx_table" id="Sx1.T3"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">TABLE III: </span>Energy Consumption Comparison</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="Sx1.T3.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="Sx1.T3.1.1.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T3.1.1.1.1" style="padding:0.9pt 4.0pt;"><span class="ltx_text ltx_font_bold" id="Sx1.T3.1.1.1.1.1" style="font-size:90%;">Architecture</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="Sx1.T3.1.1.1.2" style="padding:0.9pt 4.0pt;"><span class="ltx_text ltx_font_bold" id="Sx1.T3.1.1.1.2.1" style="font-size:90%;">Energy Consumption</span></th> </tr> <tr class="ltx_tr" id="Sx1.T3.1.2.2"> <th class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_l ltx_border_r" id="Sx1.T3.1.2.2.1" style="padding:0.9pt 4.0pt;"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r" id="Sx1.T3.1.2.2.2" style="padding:0.9pt 4.0pt;"><span class="ltx_text ltx_font_bold" id="Sx1.T3.1.2.2.2.1" style="font-size:90%;">(µJ/operation)</span></th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="Sx1.T3.1.3.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T3.1.3.1.1" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.3.1.1.1" style="font-size:90%;">Baseline RISC-V</span></th> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T3.1.3.1.2" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.3.1.2.1" style="font-size:90%;">50</span></td> </tr> <tr class="ltx_tr" id="Sx1.T3.1.4.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T3.1.4.2.1" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.4.2.1.1" style="font-size:90%;">HERMES (Shared L3 Cache)</span></th> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T3.1.4.2.2" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.4.2.2.1" style="font-size:90%;">40</span></td> </tr> <tr class="ltx_tr" id="Sx1.T3.1.5.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T3.1.5.3.1" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.5.3.1.1" style="font-size:90%;">HERMES (Prefetching)</span></th> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="Sx1.T3.1.5.3.2" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.5.3.2.1" style="font-size:90%;">38</span></td> </tr> <tr class="ltx_tr" id="Sx1.T3.1.6.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="Sx1.T3.1.6.4.1" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.6.4.1.1" style="font-size:90%;">HERMES (Tensor-Aware Caching)</span></th> <td class="ltx_td ltx_align_center ltx_border_b ltx_border_r ltx_border_t" id="Sx1.T3.1.6.4.2" style="padding:0.9pt 4.0pt;"><span class="ltx_text" id="Sx1.T3.1.6.4.2.1" style="font-size:90%;">35</span></td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="Sx1.SSx3.p2"> <p class="ltx_p" id="Sx1.SSx3.p2.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13064v1#Sx1.F5" title="Figure 5 ‣ Energy Consumption ‣ Results and Discussion ‣ HERMES: High-Performance RISC-V Memory Hierarchy for ML Workloads Department of Computer Science, Rice University"><span class="ltx_text ltx_ref_tag">5</span></a> highlights the energy savings achieved by HERMES through efficient memory access strategies.</p> </div> <figure class="ltx_figure" id="Sx1.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="182" id="Sx1.F5.g1" src="extracted/6285171/energy_comparison.png" width="299"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>Energy Consumption Comparison: HERMES reduces energy consumption by minimizing off-chip memory accesses and employing efficient caching strategies.</figcaption> </figure> </section> <section class="ltx_subsection" id="Sx1.SSx4"> <h3 class="ltx_title ltx_font_italic ltx_title_subsection">Discussion</h3> <div class="ltx_para" id="Sx1.SSx4.p1"> <p class="ltx_p" id="Sx1.SSx4.p1.1">The simulation results clearly demonstrate the advantages of the HERMES memory hierarchy over the baseline RISC-V system:</p> </div> <div class="ltx_para" id="Sx1.SSx4.p2"> <ul class="ltx_itemize" id="Sx1.I6"> <li class="ltx_item" id="Sx1.I6.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I6.i1.p1"> <p class="ltx_p" id="Sx1.I6.i1.p1.1"><span class="ltx_text ltx_font_bold" id="Sx1.I6.i1.p1.1.1">Latency Reduction</span>: HERMES reduces latency by up to 33% through shared L3 caches, advanced prefetching, and tensor-aware caching.</p> </div> </li> <li class="ltx_item" id="Sx1.I6.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I6.i2.p1"> <p class="ltx_p" id="Sx1.I6.i2.p1.1"><span class="ltx_text ltx_font_bold" id="Sx1.I6.i2.p1.1.1">Bandwidth Improvement</span>: The hybrid memory model increases bandwidth by up to 68%, supporting the high data transfer needs of ML workloads.</p> </div> </li> <li class="ltx_item" id="Sx1.I6.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I6.i3.p1"> <p class="ltx_p" id="Sx1.I6.i3.p1.1"><span class="ltx_text ltx_font_bold" id="Sx1.I6.i3.p1.1.1">Cache Efficiency</span>: Tensor-aware caching improves cache hit rates by 50%, reducing the need for expensive off-chip memory accesses.</p> </div> </li> <li class="ltx_item" id="Sx1.I6.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I6.i4.p1"> <p class="ltx_p" id="Sx1.I6.i4.p1.1"><span class="ltx_text ltx_font_bold" id="Sx1.I6.i4.p1.1.1">Energy Efficiency</span>: HERMES reduces energy consumption by up to 30%, making it suitable for energy-efficient ML computation.</p> </div> </li> </ul> </div> <div class="ltx_para" id="Sx1.SSx4.p3"> <p class="ltx_p" id="Sx1.SSx4.p3.1">These improvements highlight the effectiveness of integrating advanced memory design techniques in RISC-V architectures to meet the demands of modern ML workloads.</p> </div> </section> </section> <section class="ltx_section" id="Sx2"> <h2 class="ltx_title ltx_font_smallcaps ltx_title_section">Acknowledgment</h2> <div class="ltx_para" id="Sx2.p1"> <p class="ltx_p" id="Sx2.p1.1">The author expresses sincere gratitude to Professor Ray Simar for his invaluable guidance, support, and encouragement throughout the development of this project. This work was completed as part of COMP 554: Computer Systems Architecture at Rice University. Professor Simar’s deep expertise in computer architecture and his commitment to student learning have been instrumental in shaping this project. His insights and feedback have significantly improved our understanding of memory systems and their optimization for machine learning workloads.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"> N. P. Jouppi, C. Young, N. Patil, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">In-Datacenter Performance Analysis of a Tensor Processing Unit</em>, Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"> Y. Chen, T. Krishna, and V. Sze, <em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices</em>, IEEE Journal of Solid-State Circuits, vol. 54, no. 2, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"> D. Patterson and A. Waterman, <em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">The RISC-V Reader: An Open Architecture Atlas</em>, Strawberry Canyon LLC, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"> K. Asanović and D. Patterson, <em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">Instruction Sets Should Be Free: The Case for RISC-V</em>, UC Berkeley Technical Report No. UCB/EECS-2014-146, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"> R. Linderman, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">Memory System Design for Deep Learning Workloads</em>, IEEE Micro, vol. 38, no. 6, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"> Y. Shao, C. W. Fletcher, and Arvind, <em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">Oisa: An Open-Source RISC-V-Based Hardware Accelerator for Machine Learning</em>, IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"> J. Zheng, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">A Hybrid Memory System for Efficient Machine Learning Workloads</em>, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"> D. Kim, J. Jo, and Y. Kim, <em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">Deep Learning-Based Data Prefetching for High-Performance Processors</em>, Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"> V. Seshadri, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">The Dirty-Block Index</em>, Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), 2015. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"> N. Binkert, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">The Gem5 Simulator</em>, ACM SIGARCH Computer Architecture News, vol. 39, no. 2, 2011. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> <em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">DRAWSim2 Project</em>, DRAWSim2 Project. Available at: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/accelergy-org/drawsim" title="">https://github.com/accelergy-org/drawsim</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> Y. Fu, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">Understanding the Memory Bottlenecks of Deep Learning Workloads on CPUs</em>, Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> Z. Jia, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">Beyond Data and Model Parallelism for Deep Neural Networks</em>, Proceedings of the 14th European Conference on Computer Systems (EuroSys), 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> A. Parashar, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">Timeloop: A Systematic Approach to DNN Accelerator Evaluation</em>, Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> H. Liu, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">Optimizing Memory Access Patterns for Large-Scale Neural Networks</em>, Proceedings of the International Symposium on Computer Architecture (ISCA), 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> S. Cho, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">High-Bandwidth Memory: Principles and Use Cases</em>, IEEE Design and Test, vol. 33, no. 4, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> C. Jung, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">DRAMSim: A Memory System Simulator</em>, ACM SIGARCH Computer Architecture News, vol. 37, no. 2, 2009. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> J. Zhao, et al., <em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">A Comprehensive Study of RISC-V in Simulation Environments</em>, IEEE Transactions on Computers, vol. 68, no. 8, 2019. </span> </li> </ul> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Sun Mar 16 22:57:48 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>