<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Correct Wrong Path</title>  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content=" CPU Microarchitecture, Out of Order Execution " lang="en" name="keywords"/> <base href="/html/2408.05912v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S1" title="In Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">I </span><span class="ltx_text ltx_font_smallcaps">Introduction</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S2" title="In Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">II </span><span class="ltx_text ltx_font_smallcaps">WP Model</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S2.SS1" title="In II WP Model ‣ Correct Wrong Path"><span 
class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-A</span> </span><span class="ltx_text ltx_font_italic">WP Trace</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S2.SS2" title="In II WP Model ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span> </span><span class="ltx_text ltx_font_italic">Implementing WP in a Trace Driven Simulator</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S3" title="In Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">III </span><span class="ltx_text ltx_font_smallcaps">Methodology</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S3.SS1" title="In III Methodology ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-A</span> </span><span class="ltx_text ltx_font_italic">Simulation Infrastructure</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S3.SS2" title="In III Methodology ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-B</span> </span><span class="ltx_text ltx_font_italic">Benchmarks</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4" title="In Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">IV </span><span class="ltx_text ltx_font_smallcaps">Results</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li 
class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.SS1" title="In IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-A</span> </span><span class="ltx_text ltx_font_italic">WP Impact on Instructions processed</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.SS2" title="In IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-B</span> </span><span class="ltx_text ltx_font_italic">Cache State</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.SS3" title="In IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-C</span> </span><span class="ltx_text ltx_font_italic">WP Impact on Cache State</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.SS4" title="In IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-D</span> </span><span class="ltx_text ltx_font_italic">Performance Impact</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S5" title="In Correct Wrong Path"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">V </span><span class="ltx_text ltx_font_smallcaps">Conclusion</span></span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document">Correct Wrong Path</h1> <div class="ltx_authors"> <span class="ltx_creator 
ltx_role_author"> <span class="ltx_personname">Bhargav Reddy Godala<sup class="ltx_sup" id="id3.1.id1">*</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Sankara Prasad Ramesh<sup class="ltx_sup" id="id4.1.id1">*</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Krishnam Tibrewala </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Chrysanthos Pepi </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Gino Chacon </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Svilen Kanev </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Gilles A. Pokam </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Daniel A. Jiménez </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Paul V. Gratz </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"> </span></span></span> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> David I. 
August </span><span class="ltx_author_notes">Bhargav Reddy Godala, Gino Chacon and Gilles A. Pokam from Intel ({bhargav.reddy.godala, ginoachacon, gilles.a.pokam}@intel.com). Sankara Prasad Ramesh from Nvidia (spramesh@ucsd.edu). Krishnam Tibrewala, Chrysanthos Pepi, Daniel A. Jiménez, and Paul V. Gratz from Texas A&M University ({krishnamtibrewala, cpepis, djimenez, pgratz}@tamu.edu). Svilen Kanev from Google (skanev@google.com). David I. August from Princeton University (august@princeton.edu). * Bhargav and Sankara contributed equally to this work. <span class="ltx_contact ltx_role_affiliation"> </span></span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id2.2">Modern OOO CPUs have very deep pipelines with large branch misprediction recovery penalties. Speculatively executed instructions on the wrong path can significantly change cache state, depending on speculation levels. Architects often employ trace-driven simulation models in the design exploration stage, which sacrifice precision for speed. Trace-driven simulators are orders of magnitude faster than execution-driven models, reducing the often hundreds of thousands of simulation hours needed to explore new micro-architectural ideas. Despite this strong benefit, trace-driven simulators often fail to adequately model the consequences of the wrong path because obtaining wrong-path instructions is nontrivial. Prior works consider either a positive or a negative impact of the wrong path, but not both. Here, we examine wrong-path execution in simulation results and design infrastructure for enabling wrong-path execution in a trace-driven simulator. 
Our analysis shows the wrong path affects structures on both the instruction and data sides extensively, resulting in performance variations ranging from <math alttext="-3.05" class="ltx_Math" display="inline" id="id1.1.m1.1"><semantics id="id1.1.m1.1a"><mrow id="id1.1.m1.1.1" xref="id1.1.m1.1.1.cmml"><mo id="id1.1.m1.1.1a" xref="id1.1.m1.1.1.cmml">−</mo><mn id="id1.1.m1.1.1.2" xref="id1.1.m1.1.1.2.cmml">3.05</mn></mrow><annotation-xml encoding="MathML-Content" id="id1.1.m1.1b"><apply id="id1.1.m1.1.1.cmml" xref="id1.1.m1.1.1"><minus id="id1.1.m1.1.1.1.cmml" xref="id1.1.m1.1.1"></minus><cn id="id1.1.m1.1.1.2.cmml" type="float" xref="id1.1.m1.1.1.2">3.05</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="id1.1.m1.1c">-3.05</annotation><annotation encoding="application/x-llamapun" id="id1.1.m1.1d">- 3.05</annotation></semantics></math>% to <math alttext="20.9" class="ltx_Math" display="inline" id="id2.2.m2.1"><semantics id="id2.2.m2.1a"><mn id="id2.2.m2.1.1" xref="id2.2.m2.1.1.cmml">20.9</mn><annotation-xml encoding="MathML-Content" id="id2.2.m2.1b"><cn id="id2.2.m2.1.1.cmml" type="float" xref="id2.2.m2.1.1">20.9</cn></annotation-xml><annotation encoding="application/x-tex" id="id2.2.m2.1c">20.9</annotation><annotation encoding="application/x-llamapun" id="id2.2.m2.1d">20.9</annotation></semantics></math>% when ignoring wrong path. 
To benefit the research community and enhance the accuracy of simulators, we have released our traces and tracing utility in the hope that industry can provide wrong-path traces generated by their internal simulators, enabling academic simulation without exposing industry IP.</p> </div> <div class="ltx_keywords"> <h6 class="ltx_title ltx_title_keywords">Index Terms: </h6> CPU Microarchitecture, Out of Order Execution </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">I </span><span class="ltx_text ltx_font_smallcaps" id="S1.1.1">Introduction</span> </h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Accurate modeling of modern Out-of-Order (OOO) CPUs is very expensive in terms of simulator complexity and simulation times. Simulators facilitate prototyping of micro-architectures, but their complexity and simulation times can be significant bottlenecks in design space exploration. Consequently, architects often use simpler and faster, though less accurate, trace-driven simulators, trading accuracy for speed. Trace-driven simulators do not model the wrong path (WP), as traces are typically generated from the execution of a workload on a real machine and thus can only contain correct-path, executed instructions. Ignoring WP, however, can lead to incorrect estimation of performance.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Prior works <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib16" title="">16</a>]</cite> have proposed solutions to model either the positive or the negative impact of WP, but not both. Here, a positive impact is observed when a cache line fetched on the WP is later used by correct-path instructions, thus acting like a prefetch. 
A negative impact is observed when the cache is polluted with WP data or instructions that are never used. The alternative is to use more accurate execution-driven models, but they are orders of magnitude slower than trace-driven simulators <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib12" title="">12</a>]</cite>.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Instead of using actual execution to generate traces, we propose to use an execution-driven simulator to obtain traces. This approach allows us to observe and annotate both the correct- and wrong-path instructions into the trace. To facilitate this, we also propose a new trace format and a set of design changes to a trace-driven simulator to conditionally model WP instructions and repair state of the CPU before executing in the correct path. Fig <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S1.F1" title="Figure 1 ‣ I Introduction ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">1</span></a> shows our proposed simulation flow where a one-time cost of running the execution-driven model enables quick WP-aware trace-based design space exploration. 
Here we implemented our approach in the widely used ChampSim <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib12" title="">12</a>]</cite> simulator, though this approach could be implemented in any trace-driven simulator.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">This paper makes the following contributions:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">Extensive analysis and metrics to measure the impact of WP instructions.</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">A new tracing utility leveraging gem5 and a new WP trace format <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib4" title="">4</a>]</cite>.</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">Correct WP modeling in a trace-driven simulator <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib2" title="">2</a>]</cite>.</p> </div> </li> <li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i4.p1"> <p class="ltx_p" id="S1.I1.i4.p1.1">An open set of traces of widely used data center and SPEC workloads <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib5" title="">5</a>]</cite>.</p> </div> </li> </ul> </div> <figure class="ltx_figure" id="S1.F1"><img alt="Refer to 
caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="141" id="S1.F1.g1" src="x1.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>WP traces from execution-driven to trace-driven simulator</figcaption> </figure> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">II </span><span class="ltx_text ltx_font_smallcaps" id="S2.1.1">WP Model</span> </h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1">Simulating instructions in the WP involves, first, knowing which instructions lie down the WP after a branch misprediction; then, letting those instructions flow from the fetch to the execute stages of the processor pipeline; and finally, repairing all pipeline stages by flushing the WP instructions once the branch is resolved. Obtaining traces with WP instructions is very challenging, as most tracing utilities annotate only committed instructions because the traces are obtained via binary instrumentation <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib14" title="">14</a>]</cite>.</p> </div> <div class="ltx_para" id="S2.p2"> <p class="ltx_p" id="S2.p2.1">We propose using an execution-driven simulator to capture wrong-path instructions and encode them in the trace. 
Additionally, we suggest several modifications to a trace-driven simulator to support the execution and squashing of wrong-path instructions.</p> </div> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS1.5.1.1">II-A</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS1.6.2">WP Trace</span> </h3> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.1">We propose a new trace format to encode the stream of instructions following a mis-speculation, due to either a branch misprediction or a load-store disambiguation failure. WP instructions executed before the redirection are discarded or squashed. These instructions are encoded in the trace following the mis-speculated instruction, using additional fields to differentiate them from correct-path instructions. In a modern core with a decoupled front-end, the branch predictor runs ahead of fetch to prefetch instructions on the predicted path. The Branch Predictor Unit (BPU) and Instruction Fetch Unit (IFU) communicate using a Fetch Target Queue (FTQ). When the IFU is redirected on a mis-speculation, the FTQ is flushed. The flushed FTQ entries are also on the wrong path. Because the prefetcher would already have issued these FTQ entries, they contribute to instruction cache pollution on the wrong path. We refer to this as FTQ Prefetch. Since it is not possible to obtain wrong-path instructions with existing tracing utilities on real CPUs, we use gem5 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib7" title="">7</a>]</cite> to generate traces with wrong-path instructions. 
gem5 is an execution-driven simulator, so execution of the WP is modeled in an OOO CPU model.</p> </div> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS2.5.1.1">II-B</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS2.6.2">Implementing WP in a Trace Driven Simulator</span> </h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.1">In typical trace-driven simulators, such as ChampSim, when a branch instruction is encountered, the BPU is queried for the branch direction and target. The trace, which includes the correct branch target, is then compared with the predicted target to detect mispredictions. The front-end is stalled until the branch is resolved. Once the branch is resolved, a constant penalty is paid to account for the pipeline repair cost, and fetch then resumes to process the stream of instructions on the correct path. Since wrong-path instructions are never seen in the pipeline, there is no need to squash or repair any pipeline structures, and thus no pollution of the caches.</p> </div> <div class="ltx_para" id="S2.SS2.p2"> <p class="ltx_p" id="S2.SS2.p2.1">To model WP instructions in a trace-driven simulator, all stages in the pipeline are modified as follows. The fetch stage is modified to continue streaming instructions from the trace after a misprediction. WP fetch is stopped when a resteer signal is received, indicating the resolution of the mis-speculation. Resteers may come from the decode stage for unconditional direct branches or from the execute stage for other types of mispredictions.</p> </div> <div class="ltx_para" id="S2.SS2.p3"> <p class="ltx_p" id="S2.SS2.p3.1">In the decode stage, mispredicted unconditional direct branches are identified. Once the mispredicted branch is found, all newer (younger) instructions following the mispredicted branch are squashed. 
A resteer signal is sent to the fetch stage, and the FTQ, decode, and fetch buffers are flushed. Furthermore, the trace is skipped ahead to the correct target instruction.</p> </div> <div class="ltx_para" id="S2.SS2.p4"> <p class="ltx_p" id="S2.SS2.p4.1">The execute stage handles all other mispredictions. At the execute stage, instructions are executed out of order from the Reorder Buffer (ROB). The ROB holds instructions in program order, so every instruction in the ROB that follows a mis-speculating instruction is on the WP. Once the mis-speculation is resolved, all following instructions in the ROB are flushed, along with all pipeline structures leading to the execute stage. Rename maps are repaired to remove stale dependencies introduced by the renamed WP instructions.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">III </span><span class="ltx_text ltx_font_smallcaps" id="S3.1.1">Methodology</span> </h2> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S3.SS1.5.1.1">III-A</span> </span><span class="ltx_text ltx_font_italic" id="S3.SS1.6.2">Simulation Infrastructure</span> </h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">We extended ChampSim <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib12" title="">12</a>]</cite> as described above. The CPU core microarchitecture parameters (Table <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S3.T1" title="TABLE I ‣ III-B Benchmarks ‣ III Methodology ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">I</span></a>) are modeled after Intel’s Golden Cove core (aka Alder Lake). 
Initially, the CPU is warmed up for 10 million instructions, after which the simulation executes 100 million instructions in detailed mode (O3CPU).</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1">We used the gem5 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib7" title="">7</a>]</cite> simulator to obtain the traces with WP instructions. Traces are obtained using the detailed O3CPU with the same core parameters as in Table <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S3.T1" title="TABLE I ‣ III-B Benchmarks ‣ III Methodology ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">I</span></a>. The gem5 simulator also models the fetch-directed instruction prefetching (FDIP) pipeline.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S3.SS2.5.1.1">III-B</span> </span><span class="ltx_text ltx_font_italic" id="S3.SS2.6.2">Benchmarks</span> </h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">We examined 14 widely-used server applications from various benchmark suites <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib3" title="">3</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib15" title="">15</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib11" title="">11</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib1" title="">1</a>]</cite> as listed in Table <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S3.T2" title="TABLE II ‣ III-B Benchmarks ‣ III 
Methodology ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">II</span></a>. These selected workloads contain traditional SPEC’17 CPU <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib3" title="">3</a>]</cite> and datacenter workloads with large code footprints.</p> </div> <figure class="ltx_table" id="S3.T1"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S3.T1.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S3.T1.1.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="S3.T1.1.1.1.1">Front-End</th> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_r ltx_border_t" id="S3.T1.1.1.1.2"> <table class="ltx_tabular ltx_align_middle" id="S3.T1.1.1.1.2.1"> <tr class="ltx_tr" id="S3.T1.1.1.1.2.1.1"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.1.1.2.1.1.1">24 FTQ entries, 16K-entry, 8-way BTB</td> </tr> <tr class="ltx_tr" id="S3.T1.1.1.1.2.1.2"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.1.1.2.1.2.1">64 KB TAGE <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib18" title="">18</a>]</cite>/ITTAGE <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib17" title="">17</a>]</cite> branch predictors,</td> </tr> <tr class="ltx_tr" id="S3.T1.1.1.1.2.1.3"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.1.1.2.1.3.1">32 KB, 8-way L1I with latency of 2 cycles</td> </tr> </table> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S3.T1.1.2.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_l ltx_border_r ltx_border_t" id="S3.T1.1.2.1.1">Execution</th> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T1.1.2.1.2"> <table class="ltx_tabular ltx_align_middle" id="S3.T1.1.2.1.2.1"> <tr class="ltx_tr" 
id="S3.T1.1.2.1.2.1.1"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.2.1.2.1.1.1">512/194/144/112 ROB/Issue/Load/Store entries,</td> </tr> <tr class="ltx_tr" id="S3.T1.1.2.1.2.1.2"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.2.1.2.1.2.1">12-wide Decode and Retire, 448/400 Int/Vec. Registers</td> </tr> </table> </td> </tr> <tr class="ltx_tr" id="S3.T1.1.3.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S3.T1.1.3.2.1">Caches</th> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t" id="S3.T1.1.3.2.2"> <table class="ltx_tabular ltx_align_middle" id="S3.T1.1.3.2.2.1"> <tr class="ltx_tr" id="S3.T1.1.3.2.2.1.1"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.3.2.2.1.1.1">64 KB, 16-way L1D with latency of 1 cycle,</td> </tr> <tr class="ltx_tr" id="S3.T1.1.3.2.2.1.2"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.3.2.2.1.2.1">1 MB, 16-way Private L2C with latency of 10 cycles,</td> </tr> <tr class="ltx_tr" id="S3.T1.1.3.2.2.1.3"> <td class="ltx_td ltx_nopad_r ltx_align_left" id="S3.T1.1.3.2.2.1.3.1">2 MB, 16-way Shared LLC with latency of 20 cycles</td> </tr> </table> </td> </tr> </tbody> </table> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">TABLE I: </span>Processor configurations for ARM ISA</figcaption> </figure> <figure class="ltx_table" id="S3.T2"> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S3.T2.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S3.T2.1.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.1.1.1.1">Benchmark Suite</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S3.T2.1.1.1.2">Benchmarks</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S3.T2.1.2.1"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.2.1.1">SPEC <cite class="ltx_cite ltx_citemacro_cite">[<a 
class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib3" title="">3</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T2.1.2.1.2">401.bzip2, 429.mcf, 500.perlbench, 523.xalancbmk</td> </tr> <tr class="ltx_tr" id="S3.T2.1.3.2"> <td class="ltx_td ltx_border_l ltx_border_r" id="S3.T2.1.3.2.1"></td> <td class="ltx_td ltx_align_left ltx_border_r" id="S3.T2.1.3.2.2">514.leela, 520.omnetpp, 531.deepsjeng, 557.xz</td> </tr> <tr class="ltx_tr" id="S3.T2.1.4.3"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.4.3.1">DaCapo <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib8" title="">8</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T2.1.4.3.2">cassandra, kafka, tomcat</td> </tr> <tr class="ltx_tr" id="S3.T2.1.5.4"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.5.4.1">Renaissance <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib15" title="">15</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T2.1.5.4.2">finagle-chirper, finagle-http</td> </tr> <tr class="ltx_tr" id="S3.T2.1.6.5"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.6.5.1">OLTP-Bench <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib9" title="">9</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T2.1.6.5.2">tpcc, wikipedia</td> </tr> <tr class="ltx_tr" id="S3.T2.1.7.6"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.7.6.1">Tailbench <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib13" title="">13</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" 
id="S3.T2.1.7.6.2">specjbb, web-search, xapian</td> </tr> <tr class="ltx_tr" id="S3.T2.1.8.7"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.8.7.1">Cloudsuite V4 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib11" title="">11</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T2.1.8.7.2">data-serving, media-streaming</td> </tr> <tr class="ltx_tr" id="S3.T2.1.9.8"> <td class="ltx_td ltx_align_left ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.9.8.1">Chipyard <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib6" title="">6</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S3.T2.1.9.8.2">verilator</td> </tr> <tr class="ltx_tr" id="S3.T2.1.10.9"> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_l ltx_border_r ltx_border_t" id="S3.T2.1.10.9.1">Browser Bench <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#bib.bib1" title="">1</a>]</cite> </td> <td class="ltx_td ltx_align_left ltx_border_b ltx_border_r ltx_border_t" id="S3.T2.1.10.9.2">speedometer2.0</td> </tr> </tbody> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">TABLE II: </span>Benchmarks used for evaluation</figcaption> </figure> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">IV </span><span class="ltx_text ltx_font_smallcaps" id="S4.1.1">Results</span> </h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">In this section, we analyze the performance of simulating WP execution (WP mode) and compare it with CP mode, which runs only correct-path (CP) instructions and ignores all WP effects, on all benchmarks detailed in Section III-B. 
Furthermore, we examine the impact of WP mode on the cache hierarchy, providing measurements for the L1 instruction cache (L1I), L1 data cache (L1D), L2 cache, and L3 cache. These insights offer a comprehensive understanding of the implications of WP execution on cache performance across different levels of the cache subsystem.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS1.5.1.1">IV-A</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS1.6.2">WP Impact on Instructions processed</span> </h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1">Modern processors have aggressive decoupled front ends, large structures, and deep ROBs, allowing a substantial number of WP instructions to be fetched and executed before a branch misprediction is identified and corrected. Fig <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.F2" title="Figure 2 ‣ IV-A WP Impact on Instructions processed ‣ IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">2</span></a> shows that on average, 83% more instructions are fetched in WP mode, with many benchmarks such as <span class="ltx_text ltx_font_typewriter" id="S4.SS1.p1.1.1">429.mcf</span>, <span class="ltx_text ltx_font_typewriter" id="S4.SS1.p1.1.2">514.leela</span>, and <span class="ltx_text ltx_font_typewriter" id="S4.SS1.p1.1.3">finagle-chirper</span> showing greater than a 2.5-fold increase. This increase strongly correlates with branch misprediction rates, which control how often we proceed down the wrong path, and structural factors like FTQ size and ROB size, determining how deep we go down the wrong path. ROB occupancy at the time of misprediction in WP is, on average, close to 2 times that seen on the CP. This also leads to considerably more ROB Full events in WP mode. 
Benchmarks with higher misprediction rates and lower ROB occupancy show the greatest increases in WP instructions executed.</p> </div> <figure class="ltx_figure" id="S4.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="739" id="S4.F2.g1" src="x2.png" width="1196"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Relative Increase in Instructions in WP vs CP</figcaption> </figure> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS2.5.1.1">IV-B</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS2.6.2">Cache State</span> </h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1">Modeling the execution of WP instructions has a direct impact on cache state. The L1I cache is the most affected by the excess WP instructions. Fig <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.F3" title="Figure 3 ‣ IV-B Cache State ‣ IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">3</span></a> shows that, on average, there is a 32% increase in misses in the L1I cache in WP mode compared to CP mode. Notably, four benchmarks more than double the number of misses, and the <span class="ltx_text ltx_font_typewriter" id="S4.SS2.p1.1.1">500.perlbench</span> benchmark exhibits a 147% increase. An even more pronounced effect is observed in the hit patterns of the L1I cache. In WP mode, there is an average increase of 66% in hits, with over half the benchmarks showing increases greater than 150%. The states of the other caches are also affected by WP. 
Fig <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.F4" title="Figure 4 ‣ IV-B Cache State ‣ IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">4</span></a> shows that, on average, L1D misses increase by 9.3%, L2C misses increase by 18.5%, and LLC misses increase by 26%. Fig <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.F5" title="Figure 5 ‣ IV-B Cache State ‣ IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">5</span></a> shows that, on average, L1D hits increase by 12%, L2C hits increase by 58.3%, and LLC hits increase by 26.3%. Thus, WP also affects the data caches and must not be ignored. Pressure on the L2C correlates more closely with performance than pressure on the L1D. Fundamentally, the access patterns of caches differ significantly when WP is enabled. Furthermore, the metadata used by cache policies, such as reuse percentage, stride, and other temporal and spatial patterns, also differs. Thus, any work impacted by cache behavior is incomplete without modeling WP accurately. The additional hits may keep high-reuse lines in the cache longer, but they may also pollute policy metadata through WP-triggered updates. 
Depending on the nature of the benchmark, this effect can help or hurt performance in different phases of the program.</p> </div> <figure class="ltx_figure" id="S4.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="519" id="S4.F3.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Cache stats for WP and CP modes in L1I</figcaption> </figure> <figure class="ltx_figure" id="S4.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="519" id="S4.F4.g1" src="x4.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>Cache miss stats for WP and CP modes for all caches</figcaption> </figure> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="519" id="S4.F5.g1" src="x5.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>Relative increase in cache hits of WP mode w.r.t CP mode</figcaption> </figure> <figure class="ltx_figure" id="S4.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="519" id="S4.F6.g1" src="x6.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6: </span>Reduction of cache misses in the CP in WP mode</figcaption> </figure> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS3.5.1.1">IV-C</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS3.6.2">WP Impact on Cache State</span> </h3> <div class="ltx_para" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.1">Cache lines brought in by WP execution can be useful or useless. 
WP lines that are eventually used on the correct path can enhance performance and are considered useful WP fills, with a positive impact similar to a prefetch. Conversely, lines that are evicted without being used can degrade performance and are considered useless WP fills, with a negative impact. WP-induced useless fills reduce the effective occupancy of the caches and may also evict other critical useful lines, leading to lost performance. However, WP-induced useful fills effectively prefetch lines that are later accessed by the CP, resulting in fewer cache misses. On average, 67% of WP fills in the L1I cache are useful. Thus, we see a strong prefetching effect. Further, Fig <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S4.F6" title="Figure 6 ‣ IV-B Cache State ‣ IV Results ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">6</span></a> shows the reduction in misses along the CP when running in WP mode. The L1I sees a 44% reduction in misses along the correct path, while the L1D and L2C show reductions of 3.85% and 8.58%, respectively. Although the overall misses increase, the prefetching effect of WP reduces the number of misses along the CP across both instruction and data caches.</p> </div> </section> <section class="ltx_subsection" id="S4.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS4.5.1.1">IV-D</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS4.6.2">Performance Impact</span> </h3> <div class="ltx_para" id="S4.SS4.p1"> <p class="ltx_p" id="S4.SS4.p1.1">As discussed in the previous sections, the performance impact of WP cannot be approximated by considering cache behavior, branch mispredictions, or back-end impacts in isolation. WP influences these factors differently across various parts of the benchmarks, and only a comprehensive analysis of all these effects together can adequately capture the true impact of WP. 
Figure <a class="ltx_ref" href="https://arxiv.org/html/2408.05912v1#S5.F7" title="Figure 7 ‣ V Conclusion ‣ Correct Wrong Path"><span class="ltx_text ltx_ref_tag">7</span></a> shows the IPC speedup of WP mode over CP mode and the branch MPKI of all benchmarks. Benchmarks like <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p1.1.1">tomcat</span> and <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p1.1.2">cassandra</span> have high branch MPKI. In contrast, <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p1.1.3">523.xalancbmk</span> and <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p1.1.4">525.x264</span> have low branch MPKI, so they show negligible variance with WP. The expected trend is that the higher the branch misprediction rate, the higher the number of WP instructions. However, in <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p1.1.5">verilator</span> the branch MPKI is 9.14, yet the WP instructions are only 55% (lower than other benchmarks with similar MPKI). This is because <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p1.1.6">verilator</span> has a high number of unconditional branches, which are resolved in the decode stage. Thus, instructions fetched in the WP are squashed even before they reach the decode stage.</p> </div> <div class="ltx_para" id="S4.SS4.p2"> <p class="ltx_p" id="S4.SS4.p2.1">The prefetching effect of WP dominates in the majority of benchmarks, so WP has a positive impact on most of them, with a speedup of up to 20.9% and a mean gain of 3.26%. 
WP can also have a negative impact, which results in a performance loss in seven benchmarks, with <span class="ltx_text ltx_font_typewriter" id="S4.SS4.p2.1.1">xapian</span> showing the lowest speedup of -3.09%.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">V </span><span class="ltx_text ltx_font_smallcaps" id="S5.1.1">Conclusion</span> </h2> <div class="ltx_para" id="S5.p1"> <p class="ltx_p" id="S5.p1.1">Complex modern OOO CPUs have speculative pipelines, which have a significant impact on cache state. The effect of WP execution depends on various characteristics of the benchmarks. The impact of WP is often ignored by fast but less accurate trace-driven simulators. We provide a model to simulate WP in a trace-driven simulator, with a new trace format to encode WP instructions. Because the traces are generated from an execution-driven simulator, the accuracy of correct WP modeling is achieved at the speed of a trace-driven simulator. The WP-enabled traces are robust enough to capture various characteristics of benchmarks and CPU parameters.</p> </div> <figure class="ltx_figure" id="S5.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="519" id="S5.F7.g1" src="x7.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7: </span>IPC improvement of WP mode w.r.t CP mode</figcaption> </figure> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"> Browserbench. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://browserbench.org" title="">https://browserbench.org</a>. 
</span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"> “Champsim-wp,” <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/PrincetonUniversity/ChampSim" title="">https://github.com/PrincetonUniversity/ChampSim</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"> Spec 2017. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.spec.org/cpu2017/" title="">https://www.spec.org/cpu2017/</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"> “Tracer utility,” <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/PrincetonUniversity/gem5_FDIP" title="">https://github.com/PrincetonUniversity/gem5_FDIP</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"> “Traces,” <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://tinyurl.com/4ak3kzhv" title="">https://tinyurl.com/4ak3kzhv</a>. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"> A. Amid, D. Biancolin, A. Gonzalez, D. Grubb, S. Karandikar, H. Liew, A. Magyar, H. Mao, A. Ou, N. Pemberton, P. Rigge, C. Schmidt, J. Wright, J. Zhao, Y. S. Shao, K. Asanović, and B. Nikolić, “Chipyard: Integrated design, simulation, and implementation framework for custom socs,” <em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">IEEE Micro</em>, vol. 40, no. 4, pp. 10–21, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"> N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. 
Wood, “The gem5 simulator,” <em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">SIGARCH Comput. Archit. News</em>, 2011. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/2024716.2024718" title="">https://doi.org/10.1145/2024716.2024718</a> </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"> S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The dacapo benchmarks: Java benchmarking development and analysis,” in <em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications</em>, ser. OOPSLA ’06. New York, NY, USA: Association for Computing Machinery, 2006, p. 169–190. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/1167473.1167488" title="">https://doi.org/10.1145/1167473.1167488</a> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"> D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, “Oltp-bench: An extensible testbed for benchmarking relational databases,” <em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">Proc. VLDB Endow.</em>, vol. 7, no. 4, p. 277–288, dec 2013. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.14778/2732240.2732246" title="">https://doi.org/10.14778/2732240.2732246</a> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"> S. Eyerman, S. V. D. Steen, W. Heirman, and I. 
Hur, “Simulating wrong-path instructions in decoupled functional-first simulation,” in <em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)</em>, 2023, pp. 124–133. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” <em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">SIGPLAN Not.</em>, vol. 47, no. 4, p. 37–48, mar 2012. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/2248487.2150982" title="">https://doi.org/10.1145/2248487.2150982</a> </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> N. Gober, G. Chacon, L. Wang, P. V. Gratz, D. A. Jimenez, E. Teran, S. Pugsley, and J. Kim, “The Championship Simulator: Architectural Simulation for Education and Competition,” 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> H. Kasture and D. Sanchez, “TailBench: A benchmark suite and evaluation methodology for latency-critical applications,” in <em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">Workload Characterization (IISWC)</em>, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in <em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">ACM SIGPLAN</em>, 2005. 
</span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> A. Prokopec, A. Rosà, D. Leopoldseder, G. Duboscq, P. Tůma, M. Studener, L. Bulej, Y. Zheng, A. Villazón, D. Simon, T. Würthinger, and W. Binder, “Renaissance: Benchmarking suite for parallel applications on the jvm,” in <em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation</em>, ser. PLDI 2019. New York, NY, USA: Association for Computing Machinery, 2019, p. 31–47. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1145/3314221.3314637" title="">https://doi.org/10.1145/3314221.3314637</a> </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> A. Ros and A. Jimborean, “Wrong-path-aware entangling instruction prefetcher,” <em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">IEEE Transactions on Computers</em>, vol. 73, no. 2, pp. 548–559, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> A. Seznec, “A 64-kbytes ittage indirect branch predictor,” in <em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">JWAC-2: Championship Branch Prediction</em>, 2011. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> A. Seznec and P. Michaud, “A case for (partially) tagged geometric history length branch prediction,” <em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">The Journal of Instruction-Level Parallelism</em>, vol. 8, p. 23, 2006. 
</span> </li> </ul> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Mon Aug 12 01:59:35 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span></a> </div></footer> </div> </body> </html>