<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration</title> <!--Generated on Fri Nov 22 05:49:48 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content="Self-Attention, Analog-Mixed-Signal, Process-in-Memory, PTQ, Bitwise Sparsity, Error Resiliency, Floating-Point Unit (FPU)" lang="en" name="keywords"/> <base href="/html/2411.14733v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S1" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, 
Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Backgrounds</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.SS1" title="In 2. Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.1 </span>Analog-Mixed-Signal Process-in-Memory (AMS-PiM)</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.SS2" title="In 2. Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.2 </span>Self-Attention Layer in Transformer Encoders</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.SS3" title="In 2. Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.3 </span>Kernel Fusion and Self-Attention Inference Optimization</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Motivations</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.SS1" title="In 3. 
Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Motivation for PTQ and Non-BLAS Layer Optimizations</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.SS2" title="In 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Motivation for Accurate and Efficient AMS Computation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.SS3" title="In 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3 </span>Massive GEMVs with Quadratic Latency Bottleneck</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>FLARE Architecture</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS1" title="In 4. 
FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>MRAM-SRAM Hybrid AMS-PiM Design</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS2" title="In 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Dequantization-Free PTQ Technique</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS3" title="In 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.3 </span>Proposed VDR-Softmax Fused with eMSB-Q</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS4" title="In 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.4 </span>BitSift-GEMV Processor</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS5" title="In 4. 
FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.5 </span>Reduced Tensor Traffic</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>FLARE Evaluations</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.SS1" title="In 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.1 </span>Hardware Customization Exploration</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.SS2" title="In 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2 </span>Impact of eMSB-Q and VDR-Softmax</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.SS3" title="In 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.3 </span>Impact of BitSift-GEMV</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.SS4" title="In 5. 
FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.4 </span>Impact on Latency Performance</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.SS5" title="In 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.5 </span>Impact on Energy Performance</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S6" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Discussion - Related Works</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S7" title="In FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">7 </span>Conclusion</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line ltx_leqno"> <h1 class="ltx_title ltx_title_document">FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Donghyeon Yi<sup class="ltx_sup" id="id13.6.id1">1</sup>, Seoyoung Lee<sup class="ltx_sup" id="id14.7.id2">1</sup>, Jongho Kim<sup class="ltx_sup" id="id15.8.id3">1</sup>, Junyoung Kim<sup class="ltx_sup" 
id="id16.9.id4">1</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id5.5.1">KAIST<sup class="ltx_sup" id="id5.5.1.1">1</sup></span><span class="ltx_text ltx_affiliation_country" id="id17.10.id1">Republic of Korea</span> </span> <span class="ltx_contact ltx_role_email"><a href="mailto:yicoreen,%20seoyoung,%20jongho,lotanda17@kaist.ac.kr">yicoreen, seoyoung, jongho,lotanda17@kaist.ac.kr</a> </span></span></span> <span class="ltx_author_before">, </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Sohmyung Ha<sup class="ltx_sup" id="id18.3.id1">2</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id7.2.1">NYU Abu Dhabi<sup class="ltx_sup" id="id7.2.1.1">2</sup></span><span class="ltx_text ltx_affiliation_country" id="id19.4.id1">United Arab Emirates</span> </span> <span class="ltx_contact ltx_role_email"><a href="mailto:sh169@nyu.edu">sh169@nyu.edu</a> </span></span></span> <span class="ltx_author_before">, </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Ik-Joon Chang<sup class="ltx_sup" id="id20.3.id1">3</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text ltx_affiliation_institution" id="id9.2.1">Kyung Hee University<sup class="ltx_sup" id="id9.2.1.1">3</sup></span><span class="ltx_text ltx_affiliation_country" id="id21.4.id1">Republic of Korea</span> </span> <span class="ltx_contact ltx_role_email"><a href="mailto:ichang@khu.ac.kr">ichang@khu.ac.kr</a> </span></span></span> <span class="ltx_author_before"> and </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Minkyu Je<sup class="ltx_sup" id="id22.3.id1">1</sup> </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation"><span class="ltx_text 
ltx_affiliation_institution" id="id11.2.1">KAIST<sup class="ltx_sup" id="id11.2.1.1">1</sup></span><span class="ltx_text ltx_affiliation_country" id="id23.4.id1">Republic of Korea</span> </span> <span class="ltx_contact ltx_role_email"><a href="mailto:mkje@kaist.ac.kr">mkje@kaist.ac.kr</a> </span></span></span> </div> <div class="ltx_dates">(2024)</div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract.</h6> <p class="ltx_p" id="id12.1">Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). 
In particular, high-ENOB ADCs scale exponentially in area and energy (2<math alttext="{}^{\text{ENOB}}" class="ltx_Math" display="inline" id="id12.1.m1.1"><semantics id="id12.1.m1.1a"><msup id="id12.1.m1.1.1" xref="id12.1.m1.1.1.cmml"><mi id="id12.1.m1.1.1a" xref="id12.1.m1.1.1.cmml"></mi><mtext id="id12.1.m1.1.1.1" xref="id12.1.m1.1.1.1a.cmml">ENOB</mtext></msup><annotation-xml encoding="MathML-Content" id="id12.1.m1.1b"><apply id="id12.1.m1.1.1.cmml" xref="id12.1.m1.1.1"><ci id="id12.1.m1.1.1.1a.cmml" xref="id12.1.m1.1.1.1"><mtext id="id12.1.m1.1.1.1.cmml" mathsize="70%" xref="id12.1.m1.1.1.1">ENOB</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="id12.1.m1.1c">{}^{\text{ENOB}}</annotation><annotation encoding="application/x-llamapun" id="id12.1.m1.1d">start_FLOATSUPERSCRIPT ENOB end_FLOATSUPERSCRIPT</annotation></semantics></math>), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ’s challenges in AMS-PiM systems. To overcome these limitations, we propose FLARE, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse matrix-vector multiplication technique. Using the proposed techniques, FLARE improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability.
Experimental results demonstrate that FLARE outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.</p> </div> <div class="ltx_keywords">Self-Attention, Analog-Mixed-Signal, Process-in-Memory, PTQ, Bitwise Sparsity, Error Resiliency, Floating-Point Unit (FPU) </div> <span class="ltx_note ltx_note_frontmatter ltx_role_copyright" id="id1"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">copyright: </span>none</span></span></span><span class="ltx_note ltx_note_frontmatter ltx_role_journalyear" id="id2"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">journalyear: </span>2024</span></span></span><span class="ltx_note ltx_note_frontmatter ltx_role_doi" id="id3"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">doi: </span> </span></span></span><span class="ltx_note ltx_note_frontmatter ltx_role_isbn" id="id4"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">isbn: </span> </span></span></span><span class="ltx_note ltx_note_frontmatter ltx_role_conference" id="id5"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">conference: </span>; Research Preview; 2024</span></span></span> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1. 
</span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Transformer models <cite class="ltx_cite ltx_citemacro_citep">(Vaswani, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib51" title="">2017</a>; Brown, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib9" title="">2020</a>; Achiam et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib2" title="">2023</a>; Devlin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib17" title="">2018</a>; Li and Gu, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib38" title="">2023</a>; Dosovitskiy et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib18" title="">2020</a>)</cite>, built on the attention mechanism <cite class="ltx_cite ltx_citemacro_citep">(Vaswani, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib51" title="">2017</a>)</cite>, have become foundational in machine learning across diverse domains. 
While decoder-based architectures <cite class="ltx_cite ltx_citemacro_citep">(Brown, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib9" title="">2020</a>; Achiam et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib2" title="">2023</a>)</cite>, such as those powering chatbot services, have gained substantial popularity, encoder-based models <cite class="ltx_cite ltx_citemacro_citep">(Devlin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib17" title="">2018</a>; Li and Gu, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib38" title="">2023</a>; Dosovitskiy et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib18" title="">2020</a>)</cite>, like BERT and Vision Transformers (ViT), remain critical for many tasks. Encoder-based transformers, however, process entire input sequences within their self-attention layers. Consequently, as the sequence length increases, the computational intensity and intermediate memory traffic of self-attention layers grow quadratically, posing a significant challenge to processing encoder-based transformers efficiently, especially on edge devices.</p> </div> <figure class="ltx_figure" id="S1.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_portrait" height="1174" id="S1.F1.g1" src="x1.png" width="746"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1. </span>Limitations of PTQ during inference optimizations. 
(a) The Dequantization-Quantization (DQ-Q) process and (b) nonlinear layers necessitate high-ENOB ADCs and FPUs, which incur extreme area/energy overhead.</figcaption> </figure> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Analog-Mixed-Signal Process-in-Memory (AMS-PiM) <cite class="ltx_cite ltx_citemacro_citep">(Shafiee et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib44" title="">2016</a>; Chi et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib12" title="">2016</a>; Jung et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib27" title="">2022</a>; Xue et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib55" title="">2019</a>; Jiang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib25" title="">2020</a>; Sridharan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib47" title="">2023</a>; Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib59" title="">2022</a>; Yang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib56" title="">2020</a>; Si et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib46" title="">2019b</a>; Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib32" title="">2019</a>; Biswas and Chandrakasan, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib7" title="">2018</a>; Yin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib58" title="">2020</a>; Roy et al<span class="ltx_text">.</span>, <a class="ltx_ref" 
href="https://arxiv.org/html/2411.14733v1#bib.bib42" title="">2021</a>; Kazemi et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib30" title="">2021</a>; Lee et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib36" title="">2022</a>; Long et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib40" title="">2018</a>)</cite> has emerged as a promising solution to address these challenges, enabling efficient execution of level-2 and level-3 BLAS operations (Basic Linear Algebra Subprograms, primarily matrix multiplications) directly within memory. These BLAS operations represent a significant portion of the self-attention layers, making AMS-PiM highly effective for reducing off-chip data movement and energy consumption. However, AMS-PiM’s effectiveness does not extend to non-BLAS tasks, such as quantization processes or nonlinear-layer computations, which require additional hardware and computational support. Meanwhile, most AMS-PiM architectures rely on integer quantization techniques for both stationary weights (e.g., weights for Q (query), K (key), V (value), and O projections) and dynamic activations (e.g., input embeddings, Q/K/V values, attention scores, and final outputs), which are crucial for attention operations in transformers. 
Despite recent advancements in low-precision floating-point (FP) quantization methods <cite class="ltx_cite ltx_citemacro_citep">(Dettmers et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib16" title="">2024</a>; Xiao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib54" title="">2023</a>)</cite>, AMS-PiM remains constrained to integer quantization due to the substantial overhead required for exponent alignment in FP arithmetic, or the significant accuracy degradation if alignment is omitted.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Quantization-Aware Training (QAT) <cite class="ltx_cite ltx_citemacro_citep">(Jacob et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib24" title="">2018</a>; Hubara et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib22" title="">2018</a>; Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib60" title="">2016</a>)</cite> and Post-Training Quantization (PTQ) <cite class="ltx_cite ltx_citemacro_citep">(Xiao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib54" title="">2023</a>; Banner et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib6" title="">2019</a>)</cite> are two primary techniques for enabling quantization. QAT, widely adopted in AMS-PiM, optimizes area/energy efficiency by enabling direct quantization with low-ENOB (Effective Number of Bits) ADCs (Analog-to-Digital Converters). Through retraining, QAT allows models to adapt to characteristics of ADCs and integer-approximated nonlinear layers, eliminating the need for dequantization-quantization (DQ-Q) steps and FPUs, as visualized in the left side of Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S1.F1" title="Figure 1 ‣ 1. Introduction ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>-(a). However, QAT’s retraining process introduces substantial overhead, making it increasingly impractical for growing transformer models. </p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">Accordingly, PTQ has become the preferred, retraining-free alternative to QAT. However, as shown on the right side of Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S1.F1" title="Figure 1 ‣ 1. Introduction ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>-(a), PTQ introduces three major obstacles in AMS-PiM architectures: (1) reliance on high-ENOB ADCs, (2) susceptibility to process, voltage, and temperature (PVT) variations, and (3) the need for Floating-Point Units (FPUs).</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.4">Typically, PTQ requires precise partial sums and scaling factors to maintain accuracy for transformer models with large embedding depths (<math alttext="\geq" class="ltx_Math" display="inline" id="S1.p5.1.m1.1"><semantics id="S1.p5.1.m1.1a"><mo id="S1.p5.1.m1.1.1" xref="S1.p5.1.m1.1.1.cmml">≥</mo><annotation-xml encoding="MathML-Content" id="S1.p5.1.m1.1b"><geq id="S1.p5.1.m1.1.1.cmml" xref="S1.p5.1.m1.1.1"></geq></annotation-xml><annotation encoding="application/x-tex" id="S1.p5.1.m1.1c">\geq</annotation><annotation encoding="application/x-llamapun" id="S1.p5.1.m1.1d">≥</annotation></semantics></math>1k) and high bit-precision requirements (<math alttext="\geq" class="ltx_Math" display="inline" id="S1.p5.2.m2.1"><semantics id="S1.p5.2.m2.1a"><mo id="S1.p5.2.m2.1.1" xref="S1.p5.2.m2.1.1.cmml">≥</mo><annotation-xml encoding="MathML-Content" id="S1.p5.2.m2.1b"><geq id="S1.p5.2.m2.1.1.cmml" xref="S1.p5.2.m2.1.1"></geq></annotation-xml><annotation encoding="application/x-tex" id="S1.p5.2.m2.1c">\geq</annotation><annotation encoding="application/x-llamapun" id="S1.p5.2.m2.1d">≥</annotation></semantics></math>4 bits per activation and weight). These demands necessitate ADCs with ENOB of <math alttext="\geq" class="ltx_Math" display="inline" id="S1.p5.3.m3.1"><semantics id="S1.p5.3.m3.1a"><mo id="S1.p5.3.m3.1.1" xref="S1.p5.3.m3.1.1.cmml">≥</mo><annotation-xml encoding="MathML-Content" id="S1.p5.3.m3.1b"><geq id="S1.p5.3.m3.1.1.cmml" xref="S1.p5.3.m3.1.1"></geq></annotation-xml><annotation encoding="application/x-tex" id="S1.p5.3.m3.1c">\geq</annotation><annotation encoding="application/x-llamapun" id="S1.p5.3.m3.1d">≥</annotation></semantics></math>18 bits to prevent quantization errors, as PTQ lacks the adaptability of QAT to lower-ENOB ADCs via retraining. PTQ with low-ENOB ADCs is possible in principle, but segmenting computations into units small enough for such ADCs would require an immense number of computation cycles, creating an excessive latency bottleneck. 
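To make the DQ-Q burden concrete, the following is a minimal NumPy sketch of conventional integer PTQ between two layers, not FLARE's dequantization-free scheme; the function names (`quantize`, `dq_q`) and tensor sizes are illustrative assumptions. The wide integer accumulator must be dequantized with a floating-point scale and requantized for the next layer, which is exactly where precise partial sums (hence high-ENOB ADCs) and FP multiply/divide (hence FPUs) enter the pipeline.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric integer quantization: INT codes plus a floating-point scale."""
    scale = np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale).astype(np.int32), scale

def dq_q(acc_int, in_scale, w_scale, n_bits=8):
    """The DQ-Q step conventional PTQ inserts between layers: dequantize the
    wide integer accumulator to FP, then requantize for the next layer.
    Both steps need FP multiplication/division, i.e., an FPU."""
    x_fp = acc_int.astype(np.float64) * (in_scale * w_scale)  # dequantize
    return quantize(x_fp, n_bits)                             # requantize

# Toy GEMV: INT8 activations and weights accumulate into wide partial sums,
# which a high-ENOB ADC would have to capture faithfully in an AMS-PiM array.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w = rng.standard_normal((16, 64))
x_q, sx = quantize(x)
w_q, sw = quantize(w)
acc = w_q.astype(np.int64) @ x_q.astype(np.int64)  # precise partial sums
y_q, sy = dq_q(acc, sx, sw)                        # next layer's INT8 input
```

Dequantizing `y_q` with `sy` closely tracks the floating-point reference `w @ x`; removing the FP scale handling from this loop is precisely what FLARE's dequantization-free PTQ targets.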
High-ENOB ADCs also exacerbate hardware inefficiencies: their area and power consumption scale exponentially with 2<math alttext="{}^{\text{ENOB}}" class="ltx_Math" display="inline" id="S1.p5.4.m4.1"><semantics id="S1.p5.4.m4.1a"><msup id="S1.p5.4.m4.1.1" xref="S1.p5.4.m4.1.1.cmml"><mi id="S1.p5.4.m4.1.1a" xref="S1.p5.4.m4.1.1.cmml"></mi><mtext id="S1.p5.4.m4.1.1.1" xref="S1.p5.4.m4.1.1.1a.cmml">ENOB</mtext></msup><annotation-xml encoding="MathML-Content" id="S1.p5.4.m4.1b"><apply id="S1.p5.4.m4.1.1.cmml" xref="S1.p5.4.m4.1.1"><ci id="S1.p5.4.m4.1.1.1a.cmml" xref="S1.p5.4.m4.1.1.1"><mtext id="S1.p5.4.m4.1.1.1.cmml" mathsize="70%" xref="S1.p5.4.m4.1.1.1">ENOB</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S1.p5.4.m4.1c">{}^{\text{ENOB}}</annotation><annotation encoding="application/x-llamapun" id="S1.p5.4.m4.1d">start_FLOATSUPERSCRIPT ENOB end_FLOATSUPERSCRIPT</annotation></semantics></math> <cite class="ltx_cite ltx_citemacro_citep">(Danial et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib13" title="">2018</a>)</cite>, and their sensing margins narrow as ENOB grows, making the system more vulnerable to errors caused by PVT variations. Furthermore, PTQ’s reliance on accurate scaling factors during the DQ-Q process necessitates FPUs to handle division operations and avoid cumulative rounding errors, adding to system complexity.</p> </div> <div class="ltx_para" id="S1.p6"> <p class="ltx_p" id="S1.p6.1">Moreover, nonlinear layers under PTQ introduce additional challenges, as depicted in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S1.F1" title="Figure 1 ‣ 1. Introduction ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>-(b). 
While integer-based approximations of nonlinear functions and normalization layers <cite class="ltx_cite ltx_citemacro_citep">(Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib33" title="">2021</a>; Li and Gu, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib38" title="">2023</a>)</cite> work effectively in QAT-based systems, they often lead to significant computation deviations when applied to PTQ. As a result, PTQ relies on dequantization-quantization (DQ-Q) processes and floating-point units (FPUs) to ensure accurate nonlinear-layer processing. This dependency further increases system complexity and overhead.</p> </div> <div class="ltx_para" id="S1.p7"> <p class="ltx_p" id="S1.p7.1">To address these challenges, we propose FLARE, a novel AMS-PiM architecture that holistically tackles PTQ’s inefficiencies. FLARE eliminates high-ENOB ADCs, division arithmetic, and FPUs while introducing dequantization-free PTQ and integer-processed nonlinear layers. Furthermore, by leveraging low-ENOB-ADC-based sparse General Matrix-Vector (GEMV) operations with bitwise sparsity, FLARE ensures fast, error-resilient, and efficient computation. 
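The bitwise-sparsity idea can be illustrated in plain software terms. The sketch below is a functional analogy only, with `bitsliced_gemv` as our own name; FLARE's actual analog array, ADC, and scheduling are described later. Slicing activations into bit-planes keeps each accumulation's dynamic range small, which is what permits a low-ENOB ADC, and any all-zero bit-plane can be skipped outright.

```python
import numpy as np

def bitsliced_gemv(W, x, n_bits=8):
    """Bit-serial GEMV over unsigned n_bits-wide activations.

    Each bit-plane dot product involves only 0/1 activations, so its dynamic
    range is far smaller than the full GEMV's (the low-ENOB-ADC analogy), and
    bit-planes that are entirely zero (bitwise sparsity) cost nothing.
    """
    x = x.astype(np.int64)
    assert (x >= 0).all() and (x < 2 ** n_bits).all()
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for b in range(n_bits):
        plane = (x >> b) & 1                        # one activation bit-plane
        if not plane.any():                         # skip empty planes
            continue
        acc += (W.astype(np.int64) @ plane) << b    # shift-and-add partial sums
    return acc

# The bit-sliced result matches the exact integer GEMV.
rng = np.random.default_rng(1)
W = rng.integers(-8, 8, size=(4, 32))
x = rng.integers(0, 256, size=32)
y = bitsliced_gemv(W, x)
```

Because the shift-and-add recombination is exact, skipping zero bit-planes trades no accuracy for fewer computation cycles, which is the source of the speedup claimed for sparse GEMV.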
With a detailed analysis of end-to-end attention computations throughout our design, we enable on-device end-to-end kernel fusion.</p> </div> <div class="ltx_para" id="S1.p8"> <p class="ltx_p" id="S1.p8.1">In summary, FLARE achieves:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">end-to-end on-chip processing of self-attention layers, reducing quadratic off-chip tensor traffic;</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">INT-only, yet accurate, dequantization-free PTQ and nonlinear-layer processing, to maintain precision without high-ENOB ADC, division, or FPUs;</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">fast, accurate, and efficient sparse GEMV operations with 6-<math alttext="\sigma" class="ltx_Math" display="inline" id="S1.I1.i3.p1.1.m1.1"><semantics id="S1.I1.i3.p1.1.m1.1a"><mi id="S1.I1.i3.p1.1.m1.1.1" xref="S1.I1.i3.p1.1.m1.1.1.cmml">σ</mi><annotation-xml encoding="MathML-Content" id="S1.I1.i3.p1.1.m1.1b"><ci id="S1.I1.i3.p1.1.m1.1.1.cmml" xref="S1.I1.i3.p1.1.m1.1.1">𝜎</ci></annotation-xml><annotation encoding="application/x-tex" id="S1.I1.i3.p1.1.m1.1c">\sigma</annotation><annotation encoding="application/x-llamapun" id="S1.I1.i3.p1.1.m1.1d">italic_σ</annotation></semantics></math> confidence leveraging low-ENOB ADC within MRAM-SRAM hybrid AMS-PiM arrays.</p> </div> </li> </ul> <p class="ltx_p" id="S1.p8.2">These advancements provide a scalable and energy-efficient solution for transformer inference, addressing the distinct inference-time bottlenecks of encoder-based 
models.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2. </span>Backgrounds</h2> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.1. </span>Analog-Mixed-Signal Process-in-Memory (AMS-PiM)</h3> <figure class="ltx_figure" id="S2.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="537" id="S2.F2.g1" src="x2.png" width="680"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2. </span>Design example for typical PiM devices, where the level-2 (GEMV) and level-3 (GEMM) BLAS operations are processed inherently.</figcaption> </figure> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.3">Process-in-Memory (PiM) <cite class="ltx_cite ltx_citemacro_citep">(Si et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib45" title="">2019a</a>; Chen et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib11" title="">2018</a>; Su et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib48" title="">2017</a>; Guo et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib19" title="">2017</a>; Zhou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib59" title="">2022</a>; Jung et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib27" title="">2022</a>; Verma et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib52" title="">2019</a>; Jiang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib25" 
title="">2020</a>; Sridharan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib47" title="">2023</a>; Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib37" title="">2024</a>; Yang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib56" title="">2020</a>; Si et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib46" title="">2019b</a>; Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib32" title="">2019</a>; Yin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib58" title="">2020</a>; Biswas and Chandrakasan, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib7" title="">2018</a>; Gupta et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib20" title="">2019</a>; Roy et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib42" title="">2021</a>; Kazemi et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib30" title="">2021</a>; Lee et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib36" title="">2022</a>; Long et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib40" title="">2018</a>; Imani et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib23" title="">2019</a>; Xue et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib55" title="">2019</a>; Sun et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib49" 
title="">2018</a>)</cite> architectures are highly regarded for their ability to execute level-2 (i.e., GEMV) and level-3 (i.e., GEMM) BLAS operations inherently and efficiently; these operations constitute a major portion of the computation in most deep neural networks (DNNs). PiM devices allocate matrices onto memory arrays and fetch vectors (which can also form matrices) along wordlines (WLs), bitlines (BLs), or sourcelines (SLs) to execute such BLAS operations, as illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.F2" title="Figure 2 ‣ 2.1. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) ‣ 2. Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">2</span></a>. In the illustrated configuration, weights of <math alttext="F" class="ltx_Math" display="inline" id="S2.SS1.p1.1.m1.1"><semantics id="S2.SS1.p1.1.m1.1a"><mi id="S2.SS1.p1.1.m1.1.1" xref="S2.SS1.p1.1.m1.1.1.cmml">F</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.1.m1.1b"><ci id="S2.SS1.p1.1.m1.1.1.cmml" xref="S2.SS1.p1.1.m1.1.1">𝐹</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.1.m1.1c">F</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.1.m1.1d">italic_F</annotation></semantics></math> filters are stored across the memory array, inputs of depth <math alttext="D" class="ltx_Math" display="inline" id="S2.SS1.p1.2.m2.1"><semantics id="S2.SS1.p1.2.m2.1a"><mi id="S2.SS1.p1.2.m2.1.1" xref="S2.SS1.p1.2.m2.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.2.m2.1b"><ci id="S2.SS1.p1.2.m2.1.1.cmml" xref="S2.SS1.p1.2.m2.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.2.m2.1c">D</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.2.m2.1d">italic_D</annotation></semantics></math> are fetched to WL<sub class="ltx_sub" id="S2.SS1.p1.3.1"><span class="ltx_text 
ltx_font_italic" id="S2.SS1.p1.3.1.1">1∼D</span></sub> bit-serially for m-bit representation, and partial sums for each bit position are obtained at BLs. By integrating BLAS operations <cite class="ltx_cite ltx_citemacro_citep">(Blackford et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib8" title="">2002</a>)</cite> directly within the memory storage, PiM drastically reduces data movement overhead and associated latency.</p> </div> <div class="ltx_para" id="S2.SS1.p2"> <p class="ltx_p" id="S2.SS1.p2.1">Analog-Mixed-Signal Process-in-Memory (AMS-PiM) <cite class="ltx_cite ltx_citemacro_citep">(Si et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib45" title="">2019a</a>; Chen et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib11" title="">2018</a>; Su et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib48" title="">2017</a>; Guo et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib19" title="">2017</a>; Jung et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib27" title="">2022</a>; Jiang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib25" title="">2020</a>; Sridharan et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib47" title="">2023</a>; Li et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib37" title="">2024</a>; Yang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib56" title="">2020</a>; Si et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib46" title="">2019b</a>; Kim 
et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib32" title="">2019</a>; Yin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib58" title="">2020</a>; Biswas and Chandrakasan, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib7" title="">2018</a>; Lee et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib36" title="">2022</a>; Long et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib40" title="">2018</a>)</cite> architectures, in particular, enhance computational efficiency over their digital counterparts. This is achieved by enabling the simultaneous activation of multiple memory interfaces, such as WLs, BLs, and SLs, allowing multiple concurrent computations within a memory array. Whereas digital PiM activates a single WL per cycle to perform column-sum operations, i.e., pop (“1” or “on-cell”)-count operations, similar to a conventional read operation, AMS-PiM simultaneously activates multiple WLs and converts the analog sum into a digital signal via Analog-to-Digital Converters (ADCs).</p> </div> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.2. </span>Self-Attention Layer in Transformer Encoders</h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.4">For the attention mechanism in a transformer model (described in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.F3" title="Figure 3 ‣ 2.2. Self-Attention Layer in Transformer Encoders ‣ 2. 
Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">3</span></a>), each token in the input is represented as a vector of dimension <math alttext="D" class="ltx_Math" display="inline" id="S2.SS2.p1.1.m1.1"><semantics id="S2.SS2.p1.1.m1.1a"><mi id="S2.SS2.p1.1.m1.1.1" xref="S2.SS2.p1.1.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.1.m1.1b"><ci id="S2.SS2.p1.1.m1.1.1.cmml" xref="S2.SS2.p1.1.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.1.m1.1c">D</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.1.m1.1d">italic_D</annotation></semantics></math>. Each sequence/image of <math alttext="B" class="ltx_Math" display="inline" id="S2.SS2.p1.2.m2.1"><semantics id="S2.SS2.p1.2.m2.1a"><mi id="S2.SS2.p1.2.m2.1.1" xref="S2.SS2.p1.2.m2.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.2.m2.1b"><ci id="S2.SS2.p1.2.m2.1.1.cmml" xref="S2.SS2.p1.2.m2.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.2.m2.1c">B</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.2.m2.1d">italic_B</annotation></semantics></math> input batches to the self-attention layer consists of <math alttext="N" class="ltx_Math" display="inline" id="S2.SS2.p1.3.m3.1"><semantics id="S2.SS2.p1.3.m3.1a"><mi id="S2.SS2.p1.3.m3.1.1" xref="S2.SS2.p1.3.m3.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.3.m3.1b"><ci id="S2.SS2.p1.3.m3.1.1.cmml" xref="S2.SS2.p1.3.m3.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.3.m3.1c">N</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.3.m3.1d">italic_N</annotation></semantics></math> tokens, resulting in an input tensor of dimension <math alttext="[B,N,D]" class="ltx_Math" display="inline" id="S2.SS2.p1.4.m4.3"><semantics 
id="S2.SS2.p1.4.m4.3a"><mrow id="S2.SS2.p1.4.m4.3.4.2" xref="S2.SS2.p1.4.m4.3.4.1.cmml"><mo id="S2.SS2.p1.4.m4.3.4.2.1" stretchy="false" xref="S2.SS2.p1.4.m4.3.4.1.cmml">[</mo><mi id="S2.SS2.p1.4.m4.1.1" xref="S2.SS2.p1.4.m4.1.1.cmml">B</mi><mo id="S2.SS2.p1.4.m4.3.4.2.2" xref="S2.SS2.p1.4.m4.3.4.1.cmml">,</mo><mi id="S2.SS2.p1.4.m4.2.2" xref="S2.SS2.p1.4.m4.2.2.cmml">N</mi><mo id="S2.SS2.p1.4.m4.3.4.2.3" xref="S2.SS2.p1.4.m4.3.4.1.cmml">,</mo><mi id="S2.SS2.p1.4.m4.3.3" xref="S2.SS2.p1.4.m4.3.3.cmml">D</mi><mo id="S2.SS2.p1.4.m4.3.4.2.4" stretchy="false" xref="S2.SS2.p1.4.m4.3.4.1.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.4.m4.3b"><list id="S2.SS2.p1.4.m4.3.4.1.cmml" xref="S2.SS2.p1.4.m4.3.4.2"><ci id="S2.SS2.p1.4.m4.1.1.cmml" xref="S2.SS2.p1.4.m4.1.1">𝐵</ci><ci id="S2.SS2.p1.4.m4.2.2.cmml" xref="S2.SS2.p1.4.m4.2.2">𝑁</ci><ci id="S2.SS2.p1.4.m4.3.3.cmml" xref="S2.SS2.p1.4.m4.3.3">𝐷</ci></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.4.m4.3c">[B,N,D]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.4.m4.3d">[ italic_B , italic_N , italic_D ]</annotation></semantics></math>. The input is processed through the attention mechanism in several steps, each contributing to tensor traffic.</p> </div> <div class="ltx_para ltx_noindent" id="S2.SS2.p2"> <p class="ltx_p" id="S2.SS2.p2.11"><span class="ltx_text ltx_font_bold" id="S2.SS2.p2.11.1">1. QKV Projections:</span> The initial step projects each input sequence through three distinct weight matrices to produce the Query (Q), Key (K), and Value (V) tensors. 
These projections are computed using stationary weight matrices <math alttext="W_{Q}" class="ltx_Math" display="inline" id="S2.SS2.p2.1.m1.1"><semantics id="S2.SS2.p2.1.m1.1a"><msub id="S2.SS2.p2.1.m1.1.1" xref="S2.SS2.p2.1.m1.1.1.cmml"><mi id="S2.SS2.p2.1.m1.1.1.2" xref="S2.SS2.p2.1.m1.1.1.2.cmml">W</mi><mi id="S2.SS2.p2.1.m1.1.1.3" xref="S2.SS2.p2.1.m1.1.1.3.cmml">Q</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.1.m1.1b"><apply id="S2.SS2.p2.1.m1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.1.m1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1">subscript</csymbol><ci id="S2.SS2.p2.1.m1.1.1.2.cmml" xref="S2.SS2.p2.1.m1.1.1.2">𝑊</ci><ci id="S2.SS2.p2.1.m1.1.1.3.cmml" xref="S2.SS2.p2.1.m1.1.1.3">𝑄</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.1.m1.1c">W_{Q}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.1.m1.1d">italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="W_{K}" class="ltx_Math" display="inline" id="S2.SS2.p2.2.m2.1"><semantics id="S2.SS2.p2.2.m2.1a"><msub id="S2.SS2.p2.2.m2.1.1" xref="S2.SS2.p2.2.m2.1.1.cmml"><mi id="S2.SS2.p2.2.m2.1.1.2" xref="S2.SS2.p2.2.m2.1.1.2.cmml">W</mi><mi id="S2.SS2.p2.2.m2.1.1.3" xref="S2.SS2.p2.2.m2.1.1.3.cmml">K</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.2.m2.1b"><apply id="S2.SS2.p2.2.m2.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.2.m2.1.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1">subscript</csymbol><ci id="S2.SS2.p2.2.m2.1.1.2.cmml" xref="S2.SS2.p2.2.m2.1.1.2">𝑊</ci><ci id="S2.SS2.p2.2.m2.1.1.3.cmml" xref="S2.SS2.p2.2.m2.1.1.3">𝐾</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.2.m2.1c">W_{K}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.2.m2.1d">italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT</annotation></semantics></math>, and <math alttext="W_{V}" 
class="ltx_Math" display="inline" id="S2.SS2.p2.3.m3.1"><semantics id="S2.SS2.p2.3.m3.1a"><msub id="S2.SS2.p2.3.m3.1.1" xref="S2.SS2.p2.3.m3.1.1.cmml"><mi id="S2.SS2.p2.3.m3.1.1.2" xref="S2.SS2.p2.3.m3.1.1.2.cmml">W</mi><mi id="S2.SS2.p2.3.m3.1.1.3" xref="S2.SS2.p2.3.m3.1.1.3.cmml">V</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.3.m3.1b"><apply id="S2.SS2.p2.3.m3.1.1.cmml" xref="S2.SS2.p2.3.m3.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m3.1.1.1.cmml" xref="S2.SS2.p2.3.m3.1.1">subscript</csymbol><ci id="S2.SS2.p2.3.m3.1.1.2.cmml" xref="S2.SS2.p2.3.m3.1.1.2">𝑊</ci><ci id="S2.SS2.p2.3.m3.1.1.3.cmml" xref="S2.SS2.p2.3.m3.1.1.3">𝑉</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.3.m3.1c">W_{V}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.3.m3.1d">italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT</annotation></semantics></math> of dimension <math alttext="[D,d_{k}]" class="ltx_Math" display="inline" id="S2.SS2.p2.4.m4.2"><semantics id="S2.SS2.p2.4.m4.2a"><mrow id="S2.SS2.p2.4.m4.2.2.1" xref="S2.SS2.p2.4.m4.2.2.2.cmml"><mo id="S2.SS2.p2.4.m4.2.2.1.2" stretchy="false" xref="S2.SS2.p2.4.m4.2.2.2.cmml">[</mo><mi id="S2.SS2.p2.4.m4.1.1" xref="S2.SS2.p2.4.m4.1.1.cmml">D</mi><mo id="S2.SS2.p2.4.m4.2.2.1.3" xref="S2.SS2.p2.4.m4.2.2.2.cmml">,</mo><msub id="S2.SS2.p2.4.m4.2.2.1.1" xref="S2.SS2.p2.4.m4.2.2.1.1.cmml"><mi id="S2.SS2.p2.4.m4.2.2.1.1.2" xref="S2.SS2.p2.4.m4.2.2.1.1.2.cmml">d</mi><mi id="S2.SS2.p2.4.m4.2.2.1.1.3" xref="S2.SS2.p2.4.m4.2.2.1.1.3.cmml">k</mi></msub><mo id="S2.SS2.p2.4.m4.2.2.1.4" stretchy="false" xref="S2.SS2.p2.4.m4.2.2.2.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.4.m4.2b"><interval closure="closed" id="S2.SS2.p2.4.m4.2.2.2.cmml" xref="S2.SS2.p2.4.m4.2.2.1"><ci id="S2.SS2.p2.4.m4.1.1.cmml" xref="S2.SS2.p2.4.m4.1.1">𝐷</ci><apply id="S2.SS2.p2.4.m4.2.2.1.1.cmml" xref="S2.SS2.p2.4.m4.2.2.1.1"><csymbol cd="ambiguous" 
id="S2.SS2.p2.4.m4.2.2.1.1.1.cmml" xref="S2.SS2.p2.4.m4.2.2.1.1">subscript</csymbol><ci id="S2.SS2.p2.4.m4.2.2.1.1.2.cmml" xref="S2.SS2.p2.4.m4.2.2.1.1.2">𝑑</ci><ci id="S2.SS2.p2.4.m4.2.2.1.1.3.cmml" xref="S2.SS2.p2.4.m4.2.2.1.1.3">𝑘</ci></apply></interval></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.4.m4.2c">[D,d_{k}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.4.m4.2d">[ italic_D , italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ]</annotation></semantics></math>, where <math alttext="d_{k}" class="ltx_Math" display="inline" id="S2.SS2.p2.5.m5.1"><semantics id="S2.SS2.p2.5.m5.1a"><msub id="S2.SS2.p2.5.m5.1.1" xref="S2.SS2.p2.5.m5.1.1.cmml"><mi id="S2.SS2.p2.5.m5.1.1.2" xref="S2.SS2.p2.5.m5.1.1.2.cmml">d</mi><mi id="S2.SS2.p2.5.m5.1.1.3" xref="S2.SS2.p2.5.m5.1.1.3.cmml">k</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.5.m5.1b"><apply id="S2.SS2.p2.5.m5.1.1.cmml" xref="S2.SS2.p2.5.m5.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.5.m5.1.1.1.cmml" xref="S2.SS2.p2.5.m5.1.1">subscript</csymbol><ci id="S2.SS2.p2.5.m5.1.1.2.cmml" xref="S2.SS2.p2.5.m5.1.1.2">𝑑</ci><ci id="S2.SS2.p2.5.m5.1.1.3.cmml" xref="S2.SS2.p2.5.m5.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.5.m5.1c">d_{k}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.5.m5.1d">italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math> is the weight dimension of each (separate) attention head. 
This results in tensors <math alttext="Q=X\cdot W_{Q}" class="ltx_Math" display="inline" id="S2.SS2.p2.6.m6.1"><semantics id="S2.SS2.p2.6.m6.1a"><mrow id="S2.SS2.p2.6.m6.1.1" xref="S2.SS2.p2.6.m6.1.1.cmml"><mi id="S2.SS2.p2.6.m6.1.1.2" xref="S2.SS2.p2.6.m6.1.1.2.cmml">Q</mi><mo id="S2.SS2.p2.6.m6.1.1.1" xref="S2.SS2.p2.6.m6.1.1.1.cmml">=</mo><mrow id="S2.SS2.p2.6.m6.1.1.3" xref="S2.SS2.p2.6.m6.1.1.3.cmml"><mi id="S2.SS2.p2.6.m6.1.1.3.2" xref="S2.SS2.p2.6.m6.1.1.3.2.cmml">X</mi><mo id="S2.SS2.p2.6.m6.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS2.p2.6.m6.1.1.3.1.cmml">⋅</mo><msub id="S2.SS2.p2.6.m6.1.1.3.3" xref="S2.SS2.p2.6.m6.1.1.3.3.cmml"><mi id="S2.SS2.p2.6.m6.1.1.3.3.2" xref="S2.SS2.p2.6.m6.1.1.3.3.2.cmml">W</mi><mi id="S2.SS2.p2.6.m6.1.1.3.3.3" xref="S2.SS2.p2.6.m6.1.1.3.3.3.cmml">Q</mi></msub></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.6.m6.1b"><apply id="S2.SS2.p2.6.m6.1.1.cmml" xref="S2.SS2.p2.6.m6.1.1"><eq id="S2.SS2.p2.6.m6.1.1.1.cmml" xref="S2.SS2.p2.6.m6.1.1.1"></eq><ci id="S2.SS2.p2.6.m6.1.1.2.cmml" xref="S2.SS2.p2.6.m6.1.1.2">𝑄</ci><apply id="S2.SS2.p2.6.m6.1.1.3.cmml" xref="S2.SS2.p2.6.m6.1.1.3"><ci id="S2.SS2.p2.6.m6.1.1.3.1.cmml" xref="S2.SS2.p2.6.m6.1.1.3.1">⋅</ci><ci id="S2.SS2.p2.6.m6.1.1.3.2.cmml" xref="S2.SS2.p2.6.m6.1.1.3.2">𝑋</ci><apply id="S2.SS2.p2.6.m6.1.1.3.3.cmml" xref="S2.SS2.p2.6.m6.1.1.3.3"><csymbol cd="ambiguous" id="S2.SS2.p2.6.m6.1.1.3.3.1.cmml" xref="S2.SS2.p2.6.m6.1.1.3.3">subscript</csymbol><ci id="S2.SS2.p2.6.m6.1.1.3.3.2.cmml" xref="S2.SS2.p2.6.m6.1.1.3.3.2">𝑊</ci><ci id="S2.SS2.p2.6.m6.1.1.3.3.3.cmml" xref="S2.SS2.p2.6.m6.1.1.3.3.3">𝑄</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.6.m6.1c">Q=X\cdot W_{Q}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.6.m6.1d">italic_Q = italic_X ⋅ italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="K=X\cdot W_{K}" 
class="ltx_Math" display="inline" id="S2.SS2.p2.7.m7.1"><semantics id="S2.SS2.p2.7.m7.1a"><mrow id="S2.SS2.p2.7.m7.1.1" xref="S2.SS2.p2.7.m7.1.1.cmml"><mi id="S2.SS2.p2.7.m7.1.1.2" xref="S2.SS2.p2.7.m7.1.1.2.cmml">K</mi><mo id="S2.SS2.p2.7.m7.1.1.1" xref="S2.SS2.p2.7.m7.1.1.1.cmml">=</mo><mrow id="S2.SS2.p2.7.m7.1.1.3" xref="S2.SS2.p2.7.m7.1.1.3.cmml"><mi id="S2.SS2.p2.7.m7.1.1.3.2" xref="S2.SS2.p2.7.m7.1.1.3.2.cmml">X</mi><mo id="S2.SS2.p2.7.m7.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS2.p2.7.m7.1.1.3.1.cmml">⋅</mo><msub id="S2.SS2.p2.7.m7.1.1.3.3" xref="S2.SS2.p2.7.m7.1.1.3.3.cmml"><mi id="S2.SS2.p2.7.m7.1.1.3.3.2" xref="S2.SS2.p2.7.m7.1.1.3.3.2.cmml">W</mi><mi id="S2.SS2.p2.7.m7.1.1.3.3.3" xref="S2.SS2.p2.7.m7.1.1.3.3.3.cmml">K</mi></msub></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.7.m7.1b"><apply id="S2.SS2.p2.7.m7.1.1.cmml" xref="S2.SS2.p2.7.m7.1.1"><eq id="S2.SS2.p2.7.m7.1.1.1.cmml" xref="S2.SS2.p2.7.m7.1.1.1"></eq><ci id="S2.SS2.p2.7.m7.1.1.2.cmml" xref="S2.SS2.p2.7.m7.1.1.2">𝐾</ci><apply id="S2.SS2.p2.7.m7.1.1.3.cmml" xref="S2.SS2.p2.7.m7.1.1.3"><ci id="S2.SS2.p2.7.m7.1.1.3.1.cmml" xref="S2.SS2.p2.7.m7.1.1.3.1">⋅</ci><ci id="S2.SS2.p2.7.m7.1.1.3.2.cmml" xref="S2.SS2.p2.7.m7.1.1.3.2">𝑋</ci><apply id="S2.SS2.p2.7.m7.1.1.3.3.cmml" xref="S2.SS2.p2.7.m7.1.1.3.3"><csymbol cd="ambiguous" id="S2.SS2.p2.7.m7.1.1.3.3.1.cmml" xref="S2.SS2.p2.7.m7.1.1.3.3">subscript</csymbol><ci id="S2.SS2.p2.7.m7.1.1.3.3.2.cmml" xref="S2.SS2.p2.7.m7.1.1.3.3.2">𝑊</ci><ci id="S2.SS2.p2.7.m7.1.1.3.3.3.cmml" xref="S2.SS2.p2.7.m7.1.1.3.3.3">𝐾</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.7.m7.1c">K=X\cdot W_{K}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.7.m7.1d">italic_K = italic_X ⋅ italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT</annotation></semantics></math>, and <math alttext="V=X\cdot W_{V}" class="ltx_Math" display="inline" id="S2.SS2.p2.8.m8.1"><semantics 
id="S2.SS2.p2.8.m8.1a"><mrow id="S2.SS2.p2.8.m8.1.1" xref="S2.SS2.p2.8.m8.1.1.cmml"><mi id="S2.SS2.p2.8.m8.1.1.2" xref="S2.SS2.p2.8.m8.1.1.2.cmml">V</mi><mo id="S2.SS2.p2.8.m8.1.1.1" xref="S2.SS2.p2.8.m8.1.1.1.cmml">=</mo><mrow id="S2.SS2.p2.8.m8.1.1.3" xref="S2.SS2.p2.8.m8.1.1.3.cmml"><mi id="S2.SS2.p2.8.m8.1.1.3.2" xref="S2.SS2.p2.8.m8.1.1.3.2.cmml">X</mi><mo id="S2.SS2.p2.8.m8.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS2.p2.8.m8.1.1.3.1.cmml">⋅</mo><msub id="S2.SS2.p2.8.m8.1.1.3.3" xref="S2.SS2.p2.8.m8.1.1.3.3.cmml"><mi id="S2.SS2.p2.8.m8.1.1.3.3.2" xref="S2.SS2.p2.8.m8.1.1.3.3.2.cmml">W</mi><mi id="S2.SS2.p2.8.m8.1.1.3.3.3" xref="S2.SS2.p2.8.m8.1.1.3.3.3.cmml">V</mi></msub></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.8.m8.1b"><apply id="S2.SS2.p2.8.m8.1.1.cmml" xref="S2.SS2.p2.8.m8.1.1"><eq id="S2.SS2.p2.8.m8.1.1.1.cmml" xref="S2.SS2.p2.8.m8.1.1.1"></eq><ci id="S2.SS2.p2.8.m8.1.1.2.cmml" xref="S2.SS2.p2.8.m8.1.1.2">𝑉</ci><apply id="S2.SS2.p2.8.m8.1.1.3.cmml" xref="S2.SS2.p2.8.m8.1.1.3"><ci id="S2.SS2.p2.8.m8.1.1.3.1.cmml" xref="S2.SS2.p2.8.m8.1.1.3.1">⋅</ci><ci id="S2.SS2.p2.8.m8.1.1.3.2.cmml" xref="S2.SS2.p2.8.m8.1.1.3.2">𝑋</ci><apply id="S2.SS2.p2.8.m8.1.1.3.3.cmml" xref="S2.SS2.p2.8.m8.1.1.3.3"><csymbol cd="ambiguous" id="S2.SS2.p2.8.m8.1.1.3.3.1.cmml" xref="S2.SS2.p2.8.m8.1.1.3.3">subscript</csymbol><ci id="S2.SS2.p2.8.m8.1.1.3.3.2.cmml" xref="S2.SS2.p2.8.m8.1.1.3.3.2">𝑊</ci><ci id="S2.SS2.p2.8.m8.1.1.3.3.3.cmml" xref="S2.SS2.p2.8.m8.1.1.3.3.3">𝑉</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.8.m8.1c">V=X\cdot W_{V}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.8.m8.1d">italic_V = italic_X ⋅ italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT</annotation></semantics></math>, each with dimension of <math alttext="[B,N,d_{k}]" class="ltx_Math" display="inline" id="S2.SS2.p2.9.m9.3"><semantics id="S2.SS2.p2.9.m9.3a"><mrow 
id="S2.SS2.p2.9.m9.3.3.1" xref="S2.SS2.p2.9.m9.3.3.2.cmml"><mo id="S2.SS2.p2.9.m9.3.3.1.2" stretchy="false" xref="S2.SS2.p2.9.m9.3.3.2.cmml">[</mo><mi id="S2.SS2.p2.9.m9.1.1" xref="S2.SS2.p2.9.m9.1.1.cmml">B</mi><mo id="S2.SS2.p2.9.m9.3.3.1.3" xref="S2.SS2.p2.9.m9.3.3.2.cmml">,</mo><mi id="S2.SS2.p2.9.m9.2.2" xref="S2.SS2.p2.9.m9.2.2.cmml">N</mi><mo id="S2.SS2.p2.9.m9.3.3.1.4" xref="S2.SS2.p2.9.m9.3.3.2.cmml">,</mo><msub id="S2.SS2.p2.9.m9.3.3.1.1" xref="S2.SS2.p2.9.m9.3.3.1.1.cmml"><mi id="S2.SS2.p2.9.m9.3.3.1.1.2" xref="S2.SS2.p2.9.m9.3.3.1.1.2.cmml">d</mi><mi id="S2.SS2.p2.9.m9.3.3.1.1.3" xref="S2.SS2.p2.9.m9.3.3.1.1.3.cmml">k</mi></msub><mo id="S2.SS2.p2.9.m9.3.3.1.5" stretchy="false" xref="S2.SS2.p2.9.m9.3.3.2.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.9.m9.3b"><list id="S2.SS2.p2.9.m9.3.3.2.cmml" xref="S2.SS2.p2.9.m9.3.3.1"><ci id="S2.SS2.p2.9.m9.1.1.cmml" xref="S2.SS2.p2.9.m9.1.1">𝐵</ci><ci id="S2.SS2.p2.9.m9.2.2.cmml" xref="S2.SS2.p2.9.m9.2.2">𝑁</ci><apply id="S2.SS2.p2.9.m9.3.3.1.1.cmml" xref="S2.SS2.p2.9.m9.3.3.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.9.m9.3.3.1.1.1.cmml" xref="S2.SS2.p2.9.m9.3.3.1.1">subscript</csymbol><ci id="S2.SS2.p2.9.m9.3.3.1.1.2.cmml" xref="S2.SS2.p2.9.m9.3.3.1.1.2">𝑑</ci><ci id="S2.SS2.p2.9.m9.3.3.1.1.3.cmml" xref="S2.SS2.p2.9.m9.3.3.1.1.3">𝑘</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.9.m9.3c">[B,N,d_{k}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.9.m9.3d">[ italic_B , italic_N , italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ]</annotation></semantics></math> per head. 
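The projection shapes above can be checked with a short sketch (illustrative NumPy code; the dimension values are arbitrary examples we chose, not taken from the paper):

```python
# Shape walk-through of the QKV projection for a single attention head.
import numpy as np

B, N, D, d_k = 2, 128, 768, 64      # batch, tokens, embedding dim, head dim

X = np.random.randn(B, N, D)        # input tensor of dimension [B, N, D]
W_Q = np.random.randn(D, d_k)       # stationary weight matrices [D, d_k]
W_K = np.random.randn(D, d_k)
W_V = np.random.randn(D, d_k)

# Q = X @ W_Q, K = X @ W_K, V = X @ W_V, each [B, N, d_k] per head
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
assert Q.shape == K.shape == V.shape == (B, N, d_k)
```

NumPy's `@` batches the matrix product over the leading `B` dimension, mirroring how each sequence in the batch is projected independently.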
Given <math alttext="H=D/d_{k}" class="ltx_Math" display="inline" id="S2.SS2.p2.10.m10.1"><semantics id="S2.SS2.p2.10.m10.1a"><mrow id="S2.SS2.p2.10.m10.1.1" xref="S2.SS2.p2.10.m10.1.1.cmml"><mi id="S2.SS2.p2.10.m10.1.1.2" xref="S2.SS2.p2.10.m10.1.1.2.cmml">H</mi><mo id="S2.SS2.p2.10.m10.1.1.1" xref="S2.SS2.p2.10.m10.1.1.1.cmml">=</mo><mrow id="S2.SS2.p2.10.m10.1.1.3" xref="S2.SS2.p2.10.m10.1.1.3.cmml"><mi id="S2.SS2.p2.10.m10.1.1.3.2" xref="S2.SS2.p2.10.m10.1.1.3.2.cmml">D</mi><mo id="S2.SS2.p2.10.m10.1.1.3.1" xref="S2.SS2.p2.10.m10.1.1.3.1.cmml">/</mo><msub id="S2.SS2.p2.10.m10.1.1.3.3" xref="S2.SS2.p2.10.m10.1.1.3.3.cmml"><mi id="S2.SS2.p2.10.m10.1.1.3.3.2" xref="S2.SS2.p2.10.m10.1.1.3.3.2.cmml">d</mi><mi id="S2.SS2.p2.10.m10.1.1.3.3.3" xref="S2.SS2.p2.10.m10.1.1.3.3.3.cmml">k</mi></msub></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.10.m10.1b"><apply id="S2.SS2.p2.10.m10.1.1.cmml" xref="S2.SS2.p2.10.m10.1.1"><eq id="S2.SS2.p2.10.m10.1.1.1.cmml" xref="S2.SS2.p2.10.m10.1.1.1"></eq><ci id="S2.SS2.p2.10.m10.1.1.2.cmml" xref="S2.SS2.p2.10.m10.1.1.2">𝐻</ci><apply id="S2.SS2.p2.10.m10.1.1.3.cmml" xref="S2.SS2.p2.10.m10.1.1.3"><divide id="S2.SS2.p2.10.m10.1.1.3.1.cmml" xref="S2.SS2.p2.10.m10.1.1.3.1"></divide><ci id="S2.SS2.p2.10.m10.1.1.3.2.cmml" xref="S2.SS2.p2.10.m10.1.1.3.2">𝐷</ci><apply id="S2.SS2.p2.10.m10.1.1.3.3.cmml" xref="S2.SS2.p2.10.m10.1.1.3.3"><csymbol cd="ambiguous" id="S2.SS2.p2.10.m10.1.1.3.3.1.cmml" xref="S2.SS2.p2.10.m10.1.1.3.3">subscript</csymbol><ci id="S2.SS2.p2.10.m10.1.1.3.3.2.cmml" xref="S2.SS2.p2.10.m10.1.1.3.3.2">𝑑</ci><ci id="S2.SS2.p2.10.m10.1.1.3.3.3.cmml" xref="S2.SS2.p2.10.m10.1.1.3.3.3">𝑘</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.10.m10.1c">H=D/d_{k}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.10.m10.1d">italic_H = italic_D / italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math> heads, the 
dimension for each Q, K, and V becomes <math alttext="[B,H,N,d_{k}]" class="ltx_Math" display="inline" id="S2.SS2.p2.11.m11.4"><semantics id="S2.SS2.p2.11.m11.4a"><mrow id="S2.SS2.p2.11.m11.4.4.1" xref="S2.SS2.p2.11.m11.4.4.2.cmml"><mo id="S2.SS2.p2.11.m11.4.4.1.2" stretchy="false" xref="S2.SS2.p2.11.m11.4.4.2.cmml">[</mo><mi id="S2.SS2.p2.11.m11.1.1" xref="S2.SS2.p2.11.m11.1.1.cmml">B</mi><mo id="S2.SS2.p2.11.m11.4.4.1.3" xref="S2.SS2.p2.11.m11.4.4.2.cmml">,</mo><mi id="S2.SS2.p2.11.m11.2.2" xref="S2.SS2.p2.11.m11.2.2.cmml">H</mi><mo id="S2.SS2.p2.11.m11.4.4.1.4" xref="S2.SS2.p2.11.m11.4.4.2.cmml">,</mo><mi id="S2.SS2.p2.11.m11.3.3" xref="S2.SS2.p2.11.m11.3.3.cmml">N</mi><mo id="S2.SS2.p2.11.m11.4.4.1.5" xref="S2.SS2.p2.11.m11.4.4.2.cmml">,</mo><msub id="S2.SS2.p2.11.m11.4.4.1.1" xref="S2.SS2.p2.11.m11.4.4.1.1.cmml"><mi id="S2.SS2.p2.11.m11.4.4.1.1.2" xref="S2.SS2.p2.11.m11.4.4.1.1.2.cmml">d</mi><mi id="S2.SS2.p2.11.m11.4.4.1.1.3" xref="S2.SS2.p2.11.m11.4.4.1.1.3.cmml">k</mi></msub><mo id="S2.SS2.p2.11.m11.4.4.1.6" stretchy="false" xref="S2.SS2.p2.11.m11.4.4.2.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.11.m11.4b"><list id="S2.SS2.p2.11.m11.4.4.2.cmml" xref="S2.SS2.p2.11.m11.4.4.1"><ci id="S2.SS2.p2.11.m11.1.1.cmml" xref="S2.SS2.p2.11.m11.1.1">𝐵</ci><ci id="S2.SS2.p2.11.m11.2.2.cmml" xref="S2.SS2.p2.11.m11.2.2">𝐻</ci><ci id="S2.SS2.p2.11.m11.3.3.cmml" xref="S2.SS2.p2.11.m11.3.3">𝑁</ci><apply id="S2.SS2.p2.11.m11.4.4.1.1.cmml" xref="S2.SS2.p2.11.m11.4.4.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.11.m11.4.4.1.1.1.cmml" xref="S2.SS2.p2.11.m11.4.4.1.1">subscript</csymbol><ci id="S2.SS2.p2.11.m11.4.4.1.1.2.cmml" xref="S2.SS2.p2.11.m11.4.4.1.1.2">𝑑</ci><ci id="S2.SS2.p2.11.m11.4.4.1.1.3.cmml" xref="S2.SS2.p2.11.m11.4.4.1.1.3">𝑘</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.11.m11.4c">[B,H,N,d_{k}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.11.m11.4d">[ italic_B , 
italic_H , italic_N , italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ]</annotation></semantics></math>.</p> </div> <div class="ltx_para ltx_noindent" id="S2.SS2.p3"> <p class="ltx_p" id="S2.SS2.p3.4"><span class="ltx_text ltx_font_bold" id="S2.SS2.p3.4.1">2. Logit Calculation:</span> The core of the attention mechanism is computing the logit scores <math alttext="L" class="ltx_Math" display="inline" id="S2.SS2.p3.1.m1.1"><semantics id="S2.SS2.p3.1.m1.1a"><mi id="S2.SS2.p3.1.m1.1.1" xref="S2.SS2.p3.1.m1.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p3.1.m1.1b"><ci id="S2.SS2.p3.1.m1.1.1.cmml" xref="S2.SS2.p3.1.m1.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p3.1.m1.1c">L</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p3.1.m1.1d">italic_L</annotation></semantics></math> by performing a dot product between the query and key: <math alttext="L=Q\cdot K^{T}" class="ltx_Math" display="inline" id="S2.SS2.p3.2.m2.1"><semantics id="S2.SS2.p3.2.m2.1a"><mrow id="S2.SS2.p3.2.m2.1.1" xref="S2.SS2.p3.2.m2.1.1.cmml"><mi id="S2.SS2.p3.2.m2.1.1.2" xref="S2.SS2.p3.2.m2.1.1.2.cmml">L</mi><mo id="S2.SS2.p3.2.m2.1.1.1" xref="S2.SS2.p3.2.m2.1.1.1.cmml">=</mo><mrow id="S2.SS2.p3.2.m2.1.1.3" xref="S2.SS2.p3.2.m2.1.1.3.cmml"><mi id="S2.SS2.p3.2.m2.1.1.3.2" xref="S2.SS2.p3.2.m2.1.1.3.2.cmml">Q</mi><mo id="S2.SS2.p3.2.m2.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS2.p3.2.m2.1.1.3.1.cmml">⋅</mo><msup id="S2.SS2.p3.2.m2.1.1.3.3" xref="S2.SS2.p3.2.m2.1.1.3.3.cmml"><mi id="S2.SS2.p3.2.m2.1.1.3.3.2" xref="S2.SS2.p3.2.m2.1.1.3.3.2.cmml">K</mi><mi id="S2.SS2.p3.2.m2.1.1.3.3.3" xref="S2.SS2.p3.2.m2.1.1.3.3.3.cmml">T</mi></msup></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p3.2.m2.1b"><apply id="S2.SS2.p3.2.m2.1.1.cmml" xref="S2.SS2.p3.2.m2.1.1"><eq id="S2.SS2.p3.2.m2.1.1.1.cmml" xref="S2.SS2.p3.2.m2.1.1.1"></eq><ci id="S2.SS2.p3.2.m2.1.1.2.cmml" 
xref="S2.SS2.p3.2.m2.1.1.2">𝐿</ci><apply id="S2.SS2.p3.2.m2.1.1.3.cmml" xref="S2.SS2.p3.2.m2.1.1.3"><ci id="S2.SS2.p3.2.m2.1.1.3.1.cmml" xref="S2.SS2.p3.2.m2.1.1.3.1">⋅</ci><ci id="S2.SS2.p3.2.m2.1.1.3.2.cmml" xref="S2.SS2.p3.2.m2.1.1.3.2">𝑄</ci><apply id="S2.SS2.p3.2.m2.1.1.3.3.cmml" xref="S2.SS2.p3.2.m2.1.1.3.3"><csymbol cd="ambiguous" id="S2.SS2.p3.2.m2.1.1.3.3.1.cmml" xref="S2.SS2.p3.2.m2.1.1.3.3">superscript</csymbol><ci id="S2.SS2.p3.2.m2.1.1.3.3.2.cmml" xref="S2.SS2.p3.2.m2.1.1.3.3.2">𝐾</ci><ci id="S2.SS2.p3.2.m2.1.1.3.3.3.cmml" xref="S2.SS2.p3.2.m2.1.1.3.3.3">𝑇</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p3.2.m2.1c">L=Q\cdot K^{T}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p3.2.m2.1d">italic_L = italic_Q ⋅ italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT</annotation></semantics></math>. This results in a score tensor of dimension <math alttext="[B,H,N,N]" class="ltx_Math" display="inline" id="S2.SS2.p3.3.m3.4"><semantics id="S2.SS2.p3.3.m3.4a"><mrow id="S2.SS2.p3.3.m3.4.5.2" xref="S2.SS2.p3.3.m3.4.5.1.cmml"><mo id="S2.SS2.p3.3.m3.4.5.2.1" stretchy="false" xref="S2.SS2.p3.3.m3.4.5.1.cmml">[</mo><mi id="S2.SS2.p3.3.m3.1.1" xref="S2.SS2.p3.3.m3.1.1.cmml">B</mi><mo id="S2.SS2.p3.3.m3.4.5.2.2" xref="S2.SS2.p3.3.m3.4.5.1.cmml">,</mo><mi id="S2.SS2.p3.3.m3.2.2" xref="S2.SS2.p3.3.m3.2.2.cmml">H</mi><mo id="S2.SS2.p3.3.m3.4.5.2.3" xref="S2.SS2.p3.3.m3.4.5.1.cmml">,</mo><mi id="S2.SS2.p3.3.m3.3.3" xref="S2.SS2.p3.3.m3.3.3.cmml">N</mi><mo id="S2.SS2.p3.3.m3.4.5.2.4" xref="S2.SS2.p3.3.m3.4.5.1.cmml">,</mo><mi id="S2.SS2.p3.3.m3.4.4" xref="S2.SS2.p3.3.m3.4.4.cmml">N</mi><mo id="S2.SS2.p3.3.m3.4.5.2.5" stretchy="false" xref="S2.SS2.p3.3.m3.4.5.1.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p3.3.m3.4b"><list id="S2.SS2.p3.3.m3.4.5.1.cmml" xref="S2.SS2.p3.3.m3.4.5.2"><ci id="S2.SS2.p3.3.m3.1.1.cmml" xref="S2.SS2.p3.3.m3.1.1">𝐵</ci><ci 
id="S2.SS2.p3.3.m3.2.2.cmml" xref="S2.SS2.p3.3.m3.2.2">𝐻</ci><ci id="S2.SS2.p3.3.m3.3.3.cmml" xref="S2.SS2.p3.3.m3.3.3">𝑁</ci><ci id="S2.SS2.p3.3.m3.4.4.cmml" xref="S2.SS2.p3.3.m3.4.4">𝑁</ci></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p3.3.m3.4c">[B,H,N,N]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p3.3.m3.4d">[ italic_B , italic_H , italic_N , italic_N ]</annotation></semantics></math>. The quadratic nature of this operation, requiring <math alttext="O(N^{2}d)" class="ltx_Math" display="inline" id="S2.SS2.p3.4.m4.1"><semantics id="S2.SS2.p3.4.m4.1a"><mrow id="S2.SS2.p3.4.m4.1.1" xref="S2.SS2.p3.4.m4.1.1.cmml"><mi id="S2.SS2.p3.4.m4.1.1.3" xref="S2.SS2.p3.4.m4.1.1.3.cmml">O</mi><mo id="S2.SS2.p3.4.m4.1.1.2" xref="S2.SS2.p3.4.m4.1.1.2.cmml"></mo><mrow id="S2.SS2.p3.4.m4.1.1.1.1" xref="S2.SS2.p3.4.m4.1.1.1.1.1.cmml"><mo id="S2.SS2.p3.4.m4.1.1.1.1.2" stretchy="false" xref="S2.SS2.p3.4.m4.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS2.p3.4.m4.1.1.1.1.1" xref="S2.SS2.p3.4.m4.1.1.1.1.1.cmml"><msup id="S2.SS2.p3.4.m4.1.1.1.1.1.2" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2.cmml"><mi id="S2.SS2.p3.4.m4.1.1.1.1.1.2.2" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2.2.cmml">N</mi><mn id="S2.SS2.p3.4.m4.1.1.1.1.1.2.3" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2.3.cmml">2</mn></msup><mo id="S2.SS2.p3.4.m4.1.1.1.1.1.1" xref="S2.SS2.p3.4.m4.1.1.1.1.1.1.cmml"></mo><mi id="S2.SS2.p3.4.m4.1.1.1.1.1.3" xref="S2.SS2.p3.4.m4.1.1.1.1.1.3.cmml">d</mi></mrow><mo id="S2.SS2.p3.4.m4.1.1.1.1.3" stretchy="false" xref="S2.SS2.p3.4.m4.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p3.4.m4.1b"><apply id="S2.SS2.p3.4.m4.1.1.cmml" xref="S2.SS2.p3.4.m4.1.1"><times id="S2.SS2.p3.4.m4.1.1.2.cmml" xref="S2.SS2.p3.4.m4.1.1.2"></times><ci id="S2.SS2.p3.4.m4.1.1.3.cmml" xref="S2.SS2.p3.4.m4.1.1.3">𝑂</ci><apply id="S2.SS2.p3.4.m4.1.1.1.1.1.cmml" xref="S2.SS2.p3.4.m4.1.1.1.1"><times id="S2.SS2.p3.4.m4.1.1.1.1.1.1.cmml" 
xref="S2.SS2.p3.4.m4.1.1.1.1.1.1"></times><apply id="S2.SS2.p3.4.m4.1.1.1.1.1.2.cmml" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.SS2.p3.4.m4.1.1.1.1.1.2.1.cmml" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2">superscript</csymbol><ci id="S2.SS2.p3.4.m4.1.1.1.1.1.2.2.cmml" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2.2">𝑁</ci><cn id="S2.SS2.p3.4.m4.1.1.1.1.1.2.3.cmml" type="integer" xref="S2.SS2.p3.4.m4.1.1.1.1.1.2.3">2</cn></apply><ci id="S2.SS2.p3.4.m4.1.1.1.1.1.3.cmml" xref="S2.SS2.p3.4.m4.1.1.1.1.1.3">𝑑</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p3.4.m4.1c">O(N^{2}d)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p3.4.m4.1d">italic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d )</annotation></semantics></math> operations per head, contributes significantly to the tensor traffic and computational load.</p> </div> <div class="ltx_para ltx_noindent" id="S2.SS2.p4"> <p class="ltx_p" id="S2.SS2.p4.1"><span class="ltx_text ltx_font_bold" id="S2.SS2.p4.1.1">3. Softmax Normalization:</span> The computed logits are scaled and passed through the Softmax layer to normalize the scores.
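As a concrete reference for the shapes in steps 2 and 3, a minimal NumPy sketch (the sizes below are arbitrary placeholders, not configurations evaluated in this paper):

```python
import numpy as np

B, H, N, d_k = 2, 8, 128, 64      # arbitrary illustrative sizes
Q = np.random.randn(B, H, N, d_k)
K = np.random.randn(B, H, N, d_k)

# Logits: one [N, d_k] x [d_k, N] matrix product per (batch, head),
# i.e. N * N * d_k multiply-accumulates per head -> O(N^2 d)
L = Q @ K.transpose(0, 1, 3, 2)
assert L.shape == (B, H, N, N)

# Scaled Softmax over the last axis normalizes each row of scores
S = np.exp(L / np.sqrt(d_k))
S /= S.sum(axis=-1, keepdims=True)
```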
This normalization step, involving nonlinear operations, incurs tensor traffic of <math alttext="[B,H,N,N]" class="ltx_Math" display="inline" id="S2.SS2.p4.1.m1.4"><semantics id="S2.SS2.p4.1.m1.4a"><mrow id="S2.SS2.p4.1.m1.4.5.2" xref="S2.SS2.p4.1.m1.4.5.1.cmml"><mo id="S2.SS2.p4.1.m1.4.5.2.1" stretchy="false" xref="S2.SS2.p4.1.m1.4.5.1.cmml">[</mo><mi id="S2.SS2.p4.1.m1.1.1" xref="S2.SS2.p4.1.m1.1.1.cmml">B</mi><mo id="S2.SS2.p4.1.m1.4.5.2.2" xref="S2.SS2.p4.1.m1.4.5.1.cmml">,</mo><mi id="S2.SS2.p4.1.m1.2.2" xref="S2.SS2.p4.1.m1.2.2.cmml">H</mi><mo id="S2.SS2.p4.1.m1.4.5.2.3" xref="S2.SS2.p4.1.m1.4.5.1.cmml">,</mo><mi id="S2.SS2.p4.1.m1.3.3" xref="S2.SS2.p4.1.m1.3.3.cmml">N</mi><mo id="S2.SS2.p4.1.m1.4.5.2.4" xref="S2.SS2.p4.1.m1.4.5.1.cmml">,</mo><mi id="S2.SS2.p4.1.m1.4.4" xref="S2.SS2.p4.1.m1.4.4.cmml">N</mi><mo id="S2.SS2.p4.1.m1.4.5.2.5" stretchy="false" xref="S2.SS2.p4.1.m1.4.5.1.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p4.1.m1.4b"><list id="S2.SS2.p4.1.m1.4.5.1.cmml" xref="S2.SS2.p4.1.m1.4.5.2"><ci id="S2.SS2.p4.1.m1.1.1.cmml" xref="S2.SS2.p4.1.m1.1.1">𝐵</ci><ci id="S2.SS2.p4.1.m1.2.2.cmml" xref="S2.SS2.p4.1.m1.2.2">𝐻</ci><ci id="S2.SS2.p4.1.m1.3.3.cmml" xref="S2.SS2.p4.1.m1.3.3">𝑁</ci><ci id="S2.SS2.p4.1.m1.4.4.cmml" xref="S2.SS2.p4.1.m1.4.4">𝑁</ci></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p4.1.m1.4c">[B,H,N,N]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p4.1.m1.4d">[ italic_B , italic_H , italic_N , italic_N ]</annotation></semantics></math>.</p> </div> <div class="ltx_para ltx_noindent" id="S2.SS2.p5"> <p class="ltx_p" id="S2.SS2.p5.2"><span class="ltx_text ltx_font_bold" id="S2.SS2.p5.2.1">4.
Weighted Sum of Values:</span> The normalized attention scores are then used to compute a weighted sum of the value (<math alttext="V" class="ltx_Math" display="inline" id="S2.SS2.p5.1.m1.1"><semantics id="S2.SS2.p5.1.m1.1a"><mi id="S2.SS2.p5.1.m1.1.1" xref="S2.SS2.p5.1.m1.1.1.cmml">V</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p5.1.m1.1b"><ci id="S2.SS2.p5.1.m1.1.1.cmml" xref="S2.SS2.p5.1.m1.1.1">𝑉</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p5.1.m1.1c">V</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p5.1.m1.1d">italic_V</annotation></semantics></math>) vectors, yielding an output tensor of dimension <math alttext="[B,H,N,d_{k}]" class="ltx_Math" display="inline" id="S2.SS2.p5.2.m2.4"><semantics id="S2.SS2.p5.2.m2.4a"><mrow id="S2.SS2.p5.2.m2.4.4.1" xref="S2.SS2.p5.2.m2.4.4.2.cmml"><mo id="S2.SS2.p5.2.m2.4.4.1.2" stretchy="false" xref="S2.SS2.p5.2.m2.4.4.2.cmml">[</mo><mi id="S2.SS2.p5.2.m2.1.1" xref="S2.SS2.p5.2.m2.1.1.cmml">B</mi><mo id="S2.SS2.p5.2.m2.4.4.1.3" xref="S2.SS2.p5.2.m2.4.4.2.cmml">,</mo><mi id="S2.SS2.p5.2.m2.2.2" xref="S2.SS2.p5.2.m2.2.2.cmml">H</mi><mo id="S2.SS2.p5.2.m2.4.4.1.4" xref="S2.SS2.p5.2.m2.4.4.2.cmml">,</mo><mi id="S2.SS2.p5.2.m2.3.3" xref="S2.SS2.p5.2.m2.3.3.cmml">N</mi><mo id="S2.SS2.p5.2.m2.4.4.1.5" xref="S2.SS2.p5.2.m2.4.4.2.cmml">,</mo><msub id="S2.SS2.p5.2.m2.4.4.1.1" xref="S2.SS2.p5.2.m2.4.4.1.1.cmml"><mi id="S2.SS2.p5.2.m2.4.4.1.1.2" xref="S2.SS2.p5.2.m2.4.4.1.1.2.cmml">d</mi><mi id="S2.SS2.p5.2.m2.4.4.1.1.3" xref="S2.SS2.p5.2.m2.4.4.1.1.3.cmml">k</mi></msub><mo id="S2.SS2.p5.2.m2.4.4.1.6" stretchy="false" xref="S2.SS2.p5.2.m2.4.4.2.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p5.2.m2.4b"><list id="S2.SS2.p5.2.m2.4.4.2.cmml" xref="S2.SS2.p5.2.m2.4.4.1"><ci id="S2.SS2.p5.2.m2.1.1.cmml" xref="S2.SS2.p5.2.m2.1.1">𝐵</ci><ci id="S2.SS2.p5.2.m2.2.2.cmml" xref="S2.SS2.p5.2.m2.2.2">𝐻</ci><ci id="S2.SS2.p5.2.m2.3.3.cmml" 
xref="S2.SS2.p5.2.m2.3.3">𝑁</ci><apply id="S2.SS2.p5.2.m2.4.4.1.1.cmml" xref="S2.SS2.p5.2.m2.4.4.1.1"><csymbol cd="ambiguous" id="S2.SS2.p5.2.m2.4.4.1.1.1.cmml" xref="S2.SS2.p5.2.m2.4.4.1.1">subscript</csymbol><ci id="S2.SS2.p5.2.m2.4.4.1.1.2.cmml" xref="S2.SS2.p5.2.m2.4.4.1.1.2">𝑑</ci><ci id="S2.SS2.p5.2.m2.4.4.1.1.3.cmml" xref="S2.SS2.p5.2.m2.4.4.1.1.3">𝑘</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p5.2.m2.4c">[B,H,N,d_{k}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p5.2.m2.4d">[ italic_B , italic_H , italic_N , italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ]</annotation></semantics></math>.</p> </div> <div class="ltx_para ltx_noindent" id="S2.SS2.p6"> <p class="ltx_p" id="S2.SS2.p6.2"><span class="ltx_text ltx_font_bold" id="S2.SS2.p6.2.1">5. Concatenation and Final Output Projection:</span> Finally, the outputs from all heads are concatenated and projected together once again, using a weight matrix (<math alttext="W_{O}" class="ltx_Math" display="inline" id="S2.SS2.p6.1.m1.1"><semantics id="S2.SS2.p6.1.m1.1a"><msub id="S2.SS2.p6.1.m1.1.1" xref="S2.SS2.p6.1.m1.1.1.cmml"><mi id="S2.SS2.p6.1.m1.1.1.2" xref="S2.SS2.p6.1.m1.1.1.2.cmml">W</mi><mi id="S2.SS2.p6.1.m1.1.1.3" xref="S2.SS2.p6.1.m1.1.1.3.cmml">O</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p6.1.m1.1b"><apply id="S2.SS2.p6.1.m1.1.1.cmml" xref="S2.SS2.p6.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p6.1.m1.1.1.1.cmml" xref="S2.SS2.p6.1.m1.1.1">subscript</csymbol><ci id="S2.SS2.p6.1.m1.1.1.2.cmml" xref="S2.SS2.p6.1.m1.1.1.2">𝑊</ci><ci id="S2.SS2.p6.1.m1.1.1.3.cmml" xref="S2.SS2.p6.1.m1.1.1.3">𝑂</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p6.1.m1.1c">W_{O}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p6.1.m1.1d">italic_W start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT</annotation></semantics></math>) to produce the final output tensor of <math 
alttext="[B,N,D]" class="ltx_Math" display="inline" id="S2.SS2.p6.2.m2.3"><semantics id="S2.SS2.p6.2.m2.3a"><mrow id="S2.SS2.p6.2.m2.3.4.2" xref="S2.SS2.p6.2.m2.3.4.1.cmml"><mo id="S2.SS2.p6.2.m2.3.4.2.1" stretchy="false" xref="S2.SS2.p6.2.m2.3.4.1.cmml">[</mo><mi id="S2.SS2.p6.2.m2.1.1" xref="S2.SS2.p6.2.m2.1.1.cmml">B</mi><mo id="S2.SS2.p6.2.m2.3.4.2.2" xref="S2.SS2.p6.2.m2.3.4.1.cmml">,</mo><mi id="S2.SS2.p6.2.m2.2.2" xref="S2.SS2.p6.2.m2.2.2.cmml">N</mi><mo id="S2.SS2.p6.2.m2.3.4.2.3" xref="S2.SS2.p6.2.m2.3.4.1.cmml">,</mo><mi id="S2.SS2.p6.2.m2.3.3" xref="S2.SS2.p6.2.m2.3.3.cmml">D</mi><mo id="S2.SS2.p6.2.m2.3.4.2.4" stretchy="false" xref="S2.SS2.p6.2.m2.3.4.1.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p6.2.m2.3b"><list id="S2.SS2.p6.2.m2.3.4.1.cmml" xref="S2.SS2.p6.2.m2.3.4.2"><ci id="S2.SS2.p6.2.m2.1.1.cmml" xref="S2.SS2.p6.2.m2.1.1">𝐵</ci><ci id="S2.SS2.p6.2.m2.2.2.cmml" xref="S2.SS2.p6.2.m2.2.2">𝑁</ci><ci id="S2.SS2.p6.2.m2.3.3.cmml" xref="S2.SS2.p6.2.m2.3.3">𝐷</ci></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p6.2.m2.3c">[B,N,D]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p6.2.m2.3d">[ italic_B , italic_N , italic_D ]</annotation></semantics></math>.</p> </div> <div class="ltx_para ltx_noindent" id="S2.SS2.p7"> <p class="ltx_p" id="S2.SS2.p7.1"><span class="ltx_text ltx_font_bold" id="S2.SS2.p7.1.1">- DQ-Q Process and FPU:</span> Under PTQ, not only the Softmax process, with its nonlinear functions and accurate divisions, but also each of the stages (from step <span class="ltx_text ltx_font_bold" id="S2.SS2.p7.1.2">1.</span> to <span class="ltx_text ltx_font_bold" id="S2.SS2.p7.1.3">5.</span>) necessitates the DQ-Q process, which relies on high-ENOB ADCs, division, and FPUs.
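A minimal sketch of this DQ-Q round-trip (symmetric per-tensor quantization assumed; the scale value is illustrative) shows where the division and floating-point arithmetic enter the pipeline:

```python
import numpy as np

def quantize(x_fp, scale, n_bits=8):
    # Q step: division by the scale, rounding, clipping to the INT range
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x_fp / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(x_int, scale):
    # DQ step: INT -> FP multiply, which is why an FPU is needed
    return x_int.astype(np.float32) * scale

# PTQ round-trips every intermediate tensor through floating point:
x_int = np.array([12, -35, 77], dtype=np.int8)
x_fp  = dequantize(x_int, scale=0.125)   # FP intermediate for the next op
x_req = quantize(x_fp, scale=0.125)      # re-quantized for the next layer
```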
These requirements readily degrade the efficacy of PiM, motivating us to design a PiM-based accelerator that alleviates these limitations.</p> </div> <figure class="ltx_figure" id="S2.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="874" id="S2.F3.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3. </span>Overview of self-attention, with operation type and tensor traffic analysis. Self-attention comprises multiple linear projection layers with two types of GEMV operations, connected by Softmax, dimensional adjustments (transpose and concatenation), and DQ-Q operations. Unlike QAT, the DQ-Q process and nonlinear layers in PTQ create a deadlock between hardware overhead and tensor traffic.</figcaption> </figure> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.3. </span>Kernel Fusion and Self-Attention Inference Optimization</h3> <div class="ltx_para" id="S2.SS3.p1"> <p class="ltx_p" id="S2.SS3.p1.1">The unique characteristics of the self-attention mechanism have delivered remarkable performance; however, they also introduce quadratic computational load and memory traffic, prompting the development of various innovative techniques to alleviate these limitations. Like PTQ, kernel (operation) fusion techniques effectively achieve this goal without impacting model inference accuracy <cite class="ltx_cite ltx_citemacro_citep">(Kao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib29" title="">2023</a>; Alwani et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib3" title="">2016</a>; Dao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib14" title="">2022</a>)</cite>.
Kernel fusion techniques combine operations involving massive intermediate tensor traffic, reducing memory access overhead for these intermediate tensors.</p> </div> <div class="ltx_para" id="S2.SS3.p2"> <p class="ltx_p" id="S2.SS3.p2.1">The out-of-PiM tensor traffic for the self-attention computations without any on-device kernel fusion technique can be expressed as:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S7.EGx1"> <tbody id="S2.Ex1"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><span class="ltx_text ltx_markedasmath" id="S2.Ex1.4.3.2.1">Original (Out-of-Device Tensor) traffic</span></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="S2.Ex2"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=B\times N\times D\text{ (in)}+B\times N\times D\text{ (out)}" class="ltx_Math" display="inline" id="S2.Ex2.m1.1"><semantics id="S2.Ex2.m1.1a"><mrow id="S2.Ex2.m1.1.1" xref="S2.Ex2.m1.1.1.cmml"><mi id="S2.Ex2.m1.1.1.2" xref="S2.Ex2.m1.1.1.2.cmml"></mi><mo id="S2.Ex2.m1.1.1.1" xref="S2.Ex2.m1.1.1.1.cmml">=</mo><mrow id="S2.Ex2.m1.1.1.3" xref="S2.Ex2.m1.1.1.3.cmml"><mrow id="S2.Ex2.m1.1.1.3.2" xref="S2.Ex2.m1.1.1.3.2.cmml"><mrow id="S2.Ex2.m1.1.1.3.2.2" xref="S2.Ex2.m1.1.1.3.2.2.cmml"><mi id="S2.Ex2.m1.1.1.3.2.2.2" xref="S2.Ex2.m1.1.1.3.2.2.2.cmml">B</mi><mo id="S2.Ex2.m1.1.1.3.2.2.1" lspace="0.222em" rspace="0.222em" xref="S2.Ex2.m1.1.1.3.2.2.1.cmml">×</mo><mi id="S2.Ex2.m1.1.1.3.2.2.3" xref="S2.Ex2.m1.1.1.3.2.2.3.cmml">N</mi><mo id="S2.Ex2.m1.1.1.3.2.2.1a" lspace="0.222em" rspace="0.222em" xref="S2.Ex2.m1.1.1.3.2.2.1.cmml">×</mo><mi id="S2.Ex2.m1.1.1.3.2.2.4" 
xref="S2.Ex2.m1.1.1.3.2.2.4.cmml">D</mi></mrow><mo id="S2.Ex2.m1.1.1.3.2.1" xref="S2.Ex2.m1.1.1.3.2.1.cmml"></mo><mtext id="S2.Ex2.m1.1.1.3.2.3" xref="S2.Ex2.m1.1.1.3.2.3a.cmml"> (in)</mtext></mrow><mo id="S2.Ex2.m1.1.1.3.1" xref="S2.Ex2.m1.1.1.3.1.cmml">+</mo><mrow id="S2.Ex2.m1.1.1.3.3" xref="S2.Ex2.m1.1.1.3.3.cmml"><mrow id="S2.Ex2.m1.1.1.3.3.2" xref="S2.Ex2.m1.1.1.3.3.2.cmml"><mi id="S2.Ex2.m1.1.1.3.3.2.2" xref="S2.Ex2.m1.1.1.3.3.2.2.cmml">B</mi><mo id="S2.Ex2.m1.1.1.3.3.2.1" lspace="0.222em" rspace="0.222em" xref="S2.Ex2.m1.1.1.3.3.2.1.cmml">×</mo><mi id="S2.Ex2.m1.1.1.3.3.2.3" xref="S2.Ex2.m1.1.1.3.3.2.3.cmml">N</mi><mo id="S2.Ex2.m1.1.1.3.3.2.1a" lspace="0.222em" rspace="0.222em" xref="S2.Ex2.m1.1.1.3.3.2.1.cmml">×</mo><mi id="S2.Ex2.m1.1.1.3.3.2.4" xref="S2.Ex2.m1.1.1.3.3.2.4.cmml">D</mi></mrow><mo id="S2.Ex2.m1.1.1.3.3.1" xref="S2.Ex2.m1.1.1.3.3.1.cmml"></mo><mtext id="S2.Ex2.m1.1.1.3.3.3" xref="S2.Ex2.m1.1.1.3.3.3a.cmml"> (out)</mtext></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.Ex2.m1.1b"><apply id="S2.Ex2.m1.1.1.cmml" xref="S2.Ex2.m1.1.1"><eq id="S2.Ex2.m1.1.1.1.cmml" xref="S2.Ex2.m1.1.1.1"></eq><csymbol cd="latexml" id="S2.Ex2.m1.1.1.2.cmml" xref="S2.Ex2.m1.1.1.2">absent</csymbol><apply id="S2.Ex2.m1.1.1.3.cmml" xref="S2.Ex2.m1.1.1.3"><plus id="S2.Ex2.m1.1.1.3.1.cmml" xref="S2.Ex2.m1.1.1.3.1"></plus><apply id="S2.Ex2.m1.1.1.3.2.cmml" xref="S2.Ex2.m1.1.1.3.2"><times id="S2.Ex2.m1.1.1.3.2.1.cmml" xref="S2.Ex2.m1.1.1.3.2.1"></times><apply id="S2.Ex2.m1.1.1.3.2.2.cmml" xref="S2.Ex2.m1.1.1.3.2.2"><times id="S2.Ex2.m1.1.1.3.2.2.1.cmml" xref="S2.Ex2.m1.1.1.3.2.2.1"></times><ci id="S2.Ex2.m1.1.1.3.2.2.2.cmml" xref="S2.Ex2.m1.1.1.3.2.2.2">𝐵</ci><ci id="S2.Ex2.m1.1.1.3.2.2.3.cmml" xref="S2.Ex2.m1.1.1.3.2.2.3">𝑁</ci><ci id="S2.Ex2.m1.1.1.3.2.2.4.cmml" xref="S2.Ex2.m1.1.1.3.2.2.4">𝐷</ci></apply><ci id="S2.Ex2.m1.1.1.3.2.3a.cmml" xref="S2.Ex2.m1.1.1.3.2.3"><mtext id="S2.Ex2.m1.1.1.3.2.3.cmml" xref="S2.Ex2.m1.1.1.3.2.3"> 
(in)</mtext></ci></apply><apply id="S2.Ex2.m1.1.1.3.3.cmml" xref="S2.Ex2.m1.1.1.3.3"><times id="S2.Ex2.m1.1.1.3.3.1.cmml" xref="S2.Ex2.m1.1.1.3.3.1"></times><apply id="S2.Ex2.m1.1.1.3.3.2.cmml" xref="S2.Ex2.m1.1.1.3.3.2"><times id="S2.Ex2.m1.1.1.3.3.2.1.cmml" xref="S2.Ex2.m1.1.1.3.3.2.1"></times><ci id="S2.Ex2.m1.1.1.3.3.2.2.cmml" xref="S2.Ex2.m1.1.1.3.3.2.2">𝐵</ci><ci id="S2.Ex2.m1.1.1.3.3.2.3.cmml" xref="S2.Ex2.m1.1.1.3.3.2.3">𝑁</ci><ci id="S2.Ex2.m1.1.1.3.3.2.4.cmml" xref="S2.Ex2.m1.1.1.3.3.2.4">𝐷</ci></apply><ci id="S2.Ex2.m1.1.1.3.3.3a.cmml" xref="S2.Ex2.m1.1.1.3.3.3"><mtext id="S2.Ex2.m1.1.1.3.3.3.cmml" xref="S2.Ex2.m1.1.1.3.3.3"> (out)</mtext></ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.Ex2.m1.1c">\displaystyle=B\times N\times D\text{ (in)}+B\times N\times D\text{ (out)}</annotation><annotation encoding="application/x-llamapun" id="S2.Ex2.m1.1d">= italic_B × italic_N × italic_D (in) + italic_B × italic_N × italic_D (out)</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="S2.Ex3"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle+2\times B\times H\times N^{2}+4\times B\times H\times N\times d_% {k}\text{ (in-and-out)}." 
class="ltx_Math" display="inline" id="S2.Ex3.m1.1"><semantics id="S2.Ex3.m1.1a"><mrow id="S2.Ex3.m1.1.1.1" xref="S2.Ex3.m1.1.1.1.1.cmml"><mrow id="S2.Ex3.m1.1.1.1.1" xref="S2.Ex3.m1.1.1.1.1.cmml"><mrow id="S2.Ex3.m1.1.1.1.1.2" xref="S2.Ex3.m1.1.1.1.1.2.cmml"><mo id="S2.Ex3.m1.1.1.1.1.2a" xref="S2.Ex3.m1.1.1.1.1.2.cmml">+</mo><mrow id="S2.Ex3.m1.1.1.1.1.2.2" xref="S2.Ex3.m1.1.1.1.1.2.2.cmml"><mn id="S2.Ex3.m1.1.1.1.1.2.2.2" xref="S2.Ex3.m1.1.1.1.1.2.2.2.cmml">2</mn><mo id="S2.Ex3.m1.1.1.1.1.2.2.1" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.2.2.1.cmml">×</mo><mi id="S2.Ex3.m1.1.1.1.1.2.2.3" xref="S2.Ex3.m1.1.1.1.1.2.2.3.cmml">B</mi><mo id="S2.Ex3.m1.1.1.1.1.2.2.1a" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.2.2.1.cmml">×</mo><mi id="S2.Ex3.m1.1.1.1.1.2.2.4" xref="S2.Ex3.m1.1.1.1.1.2.2.4.cmml">H</mi><mo id="S2.Ex3.m1.1.1.1.1.2.2.1b" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.2.2.1.cmml">×</mo><msup id="S2.Ex3.m1.1.1.1.1.2.2.5" xref="S2.Ex3.m1.1.1.1.1.2.2.5.cmml"><mi id="S2.Ex3.m1.1.1.1.1.2.2.5.2" xref="S2.Ex3.m1.1.1.1.1.2.2.5.2.cmml">N</mi><mn id="S2.Ex3.m1.1.1.1.1.2.2.5.3" xref="S2.Ex3.m1.1.1.1.1.2.2.5.3.cmml">2</mn></msup></mrow></mrow><mo id="S2.Ex3.m1.1.1.1.1.1" xref="S2.Ex3.m1.1.1.1.1.1.cmml">+</mo><mrow id="S2.Ex3.m1.1.1.1.1.3" xref="S2.Ex3.m1.1.1.1.1.3.cmml"><mrow id="S2.Ex3.m1.1.1.1.1.3.2" xref="S2.Ex3.m1.1.1.1.1.3.2.cmml"><mn id="S2.Ex3.m1.1.1.1.1.3.2.2" xref="S2.Ex3.m1.1.1.1.1.3.2.2.cmml">4</mn><mo id="S2.Ex3.m1.1.1.1.1.3.2.1" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.3.2.1.cmml">×</mo><mi id="S2.Ex3.m1.1.1.1.1.3.2.3" xref="S2.Ex3.m1.1.1.1.1.3.2.3.cmml">B</mi><mo id="S2.Ex3.m1.1.1.1.1.3.2.1a" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.3.2.1.cmml">×</mo><mi id="S2.Ex3.m1.1.1.1.1.3.2.4" xref="S2.Ex3.m1.1.1.1.1.3.2.4.cmml">H</mi><mo id="S2.Ex3.m1.1.1.1.1.3.2.1b" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.3.2.1.cmml">×</mo><mi id="S2.Ex3.m1.1.1.1.1.3.2.5" 
xref="S2.Ex3.m1.1.1.1.1.3.2.5.cmml">N</mi><mo id="S2.Ex3.m1.1.1.1.1.3.2.1c" lspace="0.222em" rspace="0.222em" xref="S2.Ex3.m1.1.1.1.1.3.2.1.cmml">×</mo><msub id="S2.Ex3.m1.1.1.1.1.3.2.6" xref="S2.Ex3.m1.1.1.1.1.3.2.6.cmml"><mi id="S2.Ex3.m1.1.1.1.1.3.2.6.2" xref="S2.Ex3.m1.1.1.1.1.3.2.6.2.cmml">d</mi><mi id="S2.Ex3.m1.1.1.1.1.3.2.6.3" xref="S2.Ex3.m1.1.1.1.1.3.2.6.3.cmml">k</mi></msub></mrow><mo id="S2.Ex3.m1.1.1.1.1.3.1" xref="S2.Ex3.m1.1.1.1.1.3.1.cmml"></mo><mtext id="S2.Ex3.m1.1.1.1.1.3.3" xref="S2.Ex3.m1.1.1.1.1.3.3a.cmml"> (in-and-out)</mtext></mrow></mrow><mo id="S2.Ex3.m1.1.1.1.2" lspace="0em" xref="S2.Ex3.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.Ex3.m1.1b"><apply id="S2.Ex3.m1.1.1.1.1.cmml" xref="S2.Ex3.m1.1.1.1"><plus id="S2.Ex3.m1.1.1.1.1.1.cmml" xref="S2.Ex3.m1.1.1.1.1.1"></plus><apply id="S2.Ex3.m1.1.1.1.1.2.cmml" xref="S2.Ex3.m1.1.1.1.1.2"><plus id="S2.Ex3.m1.1.1.1.1.2.1.cmml" xref="S2.Ex3.m1.1.1.1.1.2"></plus><apply id="S2.Ex3.m1.1.1.1.1.2.2.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2"><times id="S2.Ex3.m1.1.1.1.1.2.2.1.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2.1"></times><cn id="S2.Ex3.m1.1.1.1.1.2.2.2.cmml" type="integer" xref="S2.Ex3.m1.1.1.1.1.2.2.2">2</cn><ci id="S2.Ex3.m1.1.1.1.1.2.2.3.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2.3">𝐵</ci><ci id="S2.Ex3.m1.1.1.1.1.2.2.4.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2.4">𝐻</ci><apply id="S2.Ex3.m1.1.1.1.1.2.2.5.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2.5"><csymbol cd="ambiguous" id="S2.Ex3.m1.1.1.1.1.2.2.5.1.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2.5">superscript</csymbol><ci id="S2.Ex3.m1.1.1.1.1.2.2.5.2.cmml" xref="S2.Ex3.m1.1.1.1.1.2.2.5.2">𝑁</ci><cn id="S2.Ex3.m1.1.1.1.1.2.2.5.3.cmml" type="integer" xref="S2.Ex3.m1.1.1.1.1.2.2.5.3">2</cn></apply></apply></apply><apply id="S2.Ex3.m1.1.1.1.1.3.cmml" xref="S2.Ex3.m1.1.1.1.1.3"><times id="S2.Ex3.m1.1.1.1.1.3.1.cmml" xref="S2.Ex3.m1.1.1.1.1.3.1"></times><apply id="S2.Ex3.m1.1.1.1.1.3.2.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2"><times id="S2.Ex3.m1.1.1.1.1.3.2.1.cmml" 
xref="S2.Ex3.m1.1.1.1.1.3.2.1"></times><cn id="S2.Ex3.m1.1.1.1.1.3.2.2.cmml" type="integer" xref="S2.Ex3.m1.1.1.1.1.3.2.2">4</cn><ci id="S2.Ex3.m1.1.1.1.1.3.2.3.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.3">𝐵</ci><ci id="S2.Ex3.m1.1.1.1.1.3.2.4.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.4">𝐻</ci><ci id="S2.Ex3.m1.1.1.1.1.3.2.5.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.5">𝑁</ci><apply id="S2.Ex3.m1.1.1.1.1.3.2.6.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.6"><csymbol cd="ambiguous" id="S2.Ex3.m1.1.1.1.1.3.2.6.1.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.6">subscript</csymbol><ci id="S2.Ex3.m1.1.1.1.1.3.2.6.2.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.6.2">𝑑</ci><ci id="S2.Ex3.m1.1.1.1.1.3.2.6.3.cmml" xref="S2.Ex3.m1.1.1.1.1.3.2.6.3">𝑘</ci></apply></apply><ci id="S2.Ex3.m1.1.1.1.1.3.3a.cmml" xref="S2.Ex3.m1.1.1.1.1.3.3"><mtext id="S2.Ex3.m1.1.1.1.1.3.3.cmml" xref="S2.Ex3.m1.1.1.1.1.3.3"> (in-and-out)</mtext></ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.Ex3.m1.1c">\displaystyle+2\times B\times H\times N^{2}+4\times B\times H\times N\times d_% {k}\text{ (in-and-out)}.</annotation><annotation encoding="application/x-llamapun" id="S2.Ex3.m1.1d">+ 2 × italic_B × italic_H × italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 4 × italic_B × italic_H × italic_N × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (in-and-out) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS3.p2.2">As previously mentioned, PiM devices are only suited for processing BLAS layers on-device. 
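To make the scaling concrete, the traffic expression above can be evaluated for hypothetical BERT-base-like sizes (chosen only for illustration, not the configurations evaluated in this work):

```python
B, N, D, d_k = 1, 512, 768, 64
H = D // d_k  # 12 heads

# Unfused out-of-PiM traffic, per the expression above (element counts)
original = (B * N * D + B * N * D              # input and output tensors
            + 2 * B * H * N ** 2               # logit and score tensors
            + 4 * B * H * N * d_k)             # per-head Q, K, V, and output

# With end-to-end kernel fusion, only the layer input (counted twice)
# and output cross the PiM boundary
fused = 2 * B * N * D + B * N * D

print(original, fused)  # the quadratic N^2 terms dominate the unfused case
```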
Hence, without adequate optimization, kernel fusion on PiM is hampered by its inherent limitations - high-ENOB ADCs and FPUs - which deteriorate the effectiveness of PiM architectures.</p> </div> <div class="ltx_para" id="S2.SS3.p3"> <p class="ltx_p" id="S2.SS3.p3.3">Accordingly, our FLARE architecture proposes innovative quantization and nonlinear-layer processing techniques enabling end-to-end kernel fusion with low hardware overhead, where the revised tensor traffic with our technique is: </p> <table class="ltx_equation ltx_eqn_table" id="S2.Ex4"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\text{Revised Tensor Traffic}=2\times B\times N\times D\text{ (in)}+B\times N% \times D\text{ (out)}," class="ltx_Math" display="block" id="S2.Ex4.m1.1"><semantics id="S2.Ex4.m1.1a"><mrow id="S2.Ex4.m1.1.1.1" xref="S2.Ex4.m1.1.1.1.1.cmml"><mrow id="S2.Ex4.m1.1.1.1.1" xref="S2.Ex4.m1.1.1.1.1.cmml"><mtext id="S2.Ex4.m1.1.1.1.1.2" xref="S2.Ex4.m1.1.1.1.1.2a.cmml">Revised Tensor Traffic</mtext><mo id="S2.Ex4.m1.1.1.1.1.1" xref="S2.Ex4.m1.1.1.1.1.1.cmml">=</mo><mrow id="S2.Ex4.m1.1.1.1.1.3" xref="S2.Ex4.m1.1.1.1.1.3.cmml"><mrow id="S2.Ex4.m1.1.1.1.1.3.2" xref="S2.Ex4.m1.1.1.1.1.3.2.cmml"><mrow id="S2.Ex4.m1.1.1.1.1.3.2.2" xref="S2.Ex4.m1.1.1.1.1.3.2.2.cmml"><mn id="S2.Ex4.m1.1.1.1.1.3.2.2.2" xref="S2.Ex4.m1.1.1.1.1.3.2.2.2.cmml">2</mn><mo id="S2.Ex4.m1.1.1.1.1.3.2.2.1" lspace="0.222em" rspace="0.222em" xref="S2.Ex4.m1.1.1.1.1.3.2.2.1.cmml">×</mo><mi id="S2.Ex4.m1.1.1.1.1.3.2.2.3" xref="S2.Ex4.m1.1.1.1.1.3.2.2.3.cmml">B</mi><mo id="S2.Ex4.m1.1.1.1.1.3.2.2.1a" lspace="0.222em" rspace="0.222em" xref="S2.Ex4.m1.1.1.1.1.3.2.2.1.cmml">×</mo><mi id="S2.Ex4.m1.1.1.1.1.3.2.2.4" xref="S2.Ex4.m1.1.1.1.1.3.2.2.4.cmml">N</mi><mo id="S2.Ex4.m1.1.1.1.1.3.2.2.1b" lspace="0.222em" rspace="0.222em" xref="S2.Ex4.m1.1.1.1.1.3.2.2.1.cmml">×</mo><mi id="S2.Ex4.m1.1.1.1.1.3.2.2.5"
xref="S2.Ex4.m1.1.1.1.1.3.2.2.5.cmml">D</mi></mrow><mo id="S2.Ex4.m1.1.1.1.1.3.2.1" xref="S2.Ex4.m1.1.1.1.1.3.2.1.cmml"></mo><mtext id="S2.Ex4.m1.1.1.1.1.3.2.3" xref="S2.Ex4.m1.1.1.1.1.3.2.3a.cmml"> (in)</mtext></mrow><mo id="S2.Ex4.m1.1.1.1.1.3.1" xref="S2.Ex4.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S2.Ex4.m1.1.1.1.1.3.3" xref="S2.Ex4.m1.1.1.1.1.3.3.cmml"><mrow id="S2.Ex4.m1.1.1.1.1.3.3.2" xref="S2.Ex4.m1.1.1.1.1.3.3.2.cmml"><mi id="S2.Ex4.m1.1.1.1.1.3.3.2.2" xref="S2.Ex4.m1.1.1.1.1.3.3.2.2.cmml">B</mi><mo id="S2.Ex4.m1.1.1.1.1.3.3.2.1" lspace="0.222em" rspace="0.222em" xref="S2.Ex4.m1.1.1.1.1.3.3.2.1.cmml">×</mo><mi id="S2.Ex4.m1.1.1.1.1.3.3.2.3" xref="S2.Ex4.m1.1.1.1.1.3.3.2.3.cmml">N</mi><mo id="S2.Ex4.m1.1.1.1.1.3.3.2.1a" lspace="0.222em" rspace="0.222em" xref="S2.Ex4.m1.1.1.1.1.3.3.2.1.cmml">×</mo><mi id="S2.Ex4.m1.1.1.1.1.3.3.2.4" xref="S2.Ex4.m1.1.1.1.1.3.3.2.4.cmml">D</mi></mrow><mo id="S2.Ex4.m1.1.1.1.1.3.3.1" xref="S2.Ex4.m1.1.1.1.1.3.3.1.cmml"></mo><mtext id="S2.Ex4.m1.1.1.1.1.3.3.3" xref="S2.Ex4.m1.1.1.1.1.3.3.3a.cmml"> (out)</mtext></mrow></mrow></mrow><mo id="S2.Ex4.m1.1.1.1.2" xref="S2.Ex4.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.Ex4.m1.1b"><apply id="S2.Ex4.m1.1.1.1.1.cmml" xref="S2.Ex4.m1.1.1.1"><eq id="S2.Ex4.m1.1.1.1.1.1.cmml" xref="S2.Ex4.m1.1.1.1.1.1"></eq><ci id="S2.Ex4.m1.1.1.1.1.2a.cmml" xref="S2.Ex4.m1.1.1.1.1.2"><mtext id="S2.Ex4.m1.1.1.1.1.2.cmml" xref="S2.Ex4.m1.1.1.1.1.2">Revised Tensor Traffic</mtext></ci><apply id="S2.Ex4.m1.1.1.1.1.3.cmml" xref="S2.Ex4.m1.1.1.1.1.3"><plus id="S2.Ex4.m1.1.1.1.1.3.1.cmml" xref="S2.Ex4.m1.1.1.1.1.3.1"></plus><apply id="S2.Ex4.m1.1.1.1.1.3.2.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2"><times id="S2.Ex4.m1.1.1.1.1.3.2.1.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.1"></times><apply id="S2.Ex4.m1.1.1.1.1.3.2.2.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.2"><times id="S2.Ex4.m1.1.1.1.1.3.2.2.1.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.2.1"></times><cn id="S2.Ex4.m1.1.1.1.1.3.2.2.2.cmml" type="integer" 
xref="S2.Ex4.m1.1.1.1.1.3.2.2.2">2</cn><ci id="S2.Ex4.m1.1.1.1.1.3.2.2.3.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.2.3">𝐵</ci><ci id="S2.Ex4.m1.1.1.1.1.3.2.2.4.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.2.4">𝑁</ci><ci id="S2.Ex4.m1.1.1.1.1.3.2.2.5.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.2.5">𝐷</ci></apply><ci id="S2.Ex4.m1.1.1.1.1.3.2.3a.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.3"><mtext id="S2.Ex4.m1.1.1.1.1.3.2.3.cmml" xref="S2.Ex4.m1.1.1.1.1.3.2.3"> (in)</mtext></ci></apply><apply id="S2.Ex4.m1.1.1.1.1.3.3.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3"><times id="S2.Ex4.m1.1.1.1.1.3.3.1.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.1"></times><apply id="S2.Ex4.m1.1.1.1.1.3.3.2.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.2"><times id="S2.Ex4.m1.1.1.1.1.3.3.2.1.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.2.1"></times><ci id="S2.Ex4.m1.1.1.1.1.3.3.2.2.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.2.2">𝐵</ci><ci id="S2.Ex4.m1.1.1.1.1.3.3.2.3.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.2.3">𝑁</ci><ci id="S2.Ex4.m1.1.1.1.1.3.3.2.4.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.2.4">𝐷</ci></apply><ci id="S2.Ex4.m1.1.1.1.1.3.3.3a.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.3"><mtext id="S2.Ex4.m1.1.1.1.1.3.3.3.cmml" xref="S2.Ex4.m1.1.1.1.1.3.3.3"> (out)</mtext></ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.Ex4.m1.1c">\text{Revised Tensor Traffic}=2\times B\times N\times D\text{ (in)}+B\times N% \times D\text{ (out)},</annotation><annotation encoding="application/x-llamapun" id="S2.Ex4.m1.1d">Revised Tensor Traffic = 2 × italic_B × italic_N × italic_D (in) + italic_B × italic_N × italic_D (out) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS3.p3.2">where the quadratic <math alttext="O(N^{2})" class="ltx_Math" display="inline" id="S2.SS3.p3.1.m1.1"><semantics id="S2.SS3.p3.1.m1.1a"><mrow id="S2.SS3.p3.1.m1.1.1" xref="S2.SS3.p3.1.m1.1.1.cmml"><mi id="S2.SS3.p3.1.m1.1.1.3" xref="S2.SS3.p3.1.m1.1.1.3.cmml">O</mi><mo id="S2.SS3.p3.1.m1.1.1.2" 
xref="S2.SS3.p3.1.m1.1.1.2.cmml"></mo><mrow id="S2.SS3.p3.1.m1.1.1.1.1" xref="S2.SS3.p3.1.m1.1.1.1.1.1.cmml"><mo id="S2.SS3.p3.1.m1.1.1.1.1.2" stretchy="false" xref="S2.SS3.p3.1.m1.1.1.1.1.1.cmml">(</mo><msup id="S2.SS3.p3.1.m1.1.1.1.1.1" xref="S2.SS3.p3.1.m1.1.1.1.1.1.cmml"><mi id="S2.SS3.p3.1.m1.1.1.1.1.1.2" xref="S2.SS3.p3.1.m1.1.1.1.1.1.2.cmml">N</mi><mn id="S2.SS3.p3.1.m1.1.1.1.1.1.3" xref="S2.SS3.p3.1.m1.1.1.1.1.1.3.cmml">2</mn></msup><mo id="S2.SS3.p3.1.m1.1.1.1.1.3" stretchy="false" xref="S2.SS3.p3.1.m1.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p3.1.m1.1b"><apply id="S2.SS3.p3.1.m1.1.1.cmml" xref="S2.SS3.p3.1.m1.1.1"><times id="S2.SS3.p3.1.m1.1.1.2.cmml" xref="S2.SS3.p3.1.m1.1.1.2"></times><ci id="S2.SS3.p3.1.m1.1.1.3.cmml" xref="S2.SS3.p3.1.m1.1.1.3">𝑂</ci><apply id="S2.SS3.p3.1.m1.1.1.1.1.1.cmml" xref="S2.SS3.p3.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p3.1.m1.1.1.1.1.1.1.cmml" xref="S2.SS3.p3.1.m1.1.1.1.1">superscript</csymbol><ci id="S2.SS3.p3.1.m1.1.1.1.1.1.2.cmml" xref="S2.SS3.p3.1.m1.1.1.1.1.1.2">𝑁</ci><cn id="S2.SS3.p3.1.m1.1.1.1.1.1.3.cmml" type="integer" xref="S2.SS3.p3.1.m1.1.1.1.1.1.3">2</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p3.1.m1.1c">O(N^{2})</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p3.1.m1.1d">italic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )</annotation></semantics></math> tensor traffic reduces to a linear <math alttext="O(N)" class="ltx_Math" display="inline" id="S2.SS3.p3.2.m2.1"><semantics id="S2.SS3.p3.2.m2.1a"><mrow id="S2.SS3.p3.2.m2.1.2" xref="S2.SS3.p3.2.m2.1.2.cmml"><mi id="S2.SS3.p3.2.m2.1.2.2" xref="S2.SS3.p3.2.m2.1.2.2.cmml">O</mi><mo id="S2.SS3.p3.2.m2.1.2.1" xref="S2.SS3.p3.2.m2.1.2.1.cmml"></mo><mrow id="S2.SS3.p3.2.m2.1.2.3.2" xref="S2.SS3.p3.2.m2.1.2.cmml"><mo id="S2.SS3.p3.2.m2.1.2.3.2.1" stretchy="false" xref="S2.SS3.p3.2.m2.1.2.cmml">(</mo><mi id="S2.SS3.p3.2.m2.1.1" 
xref="S2.SS3.p3.2.m2.1.1.cmml">N</mi><mo id="S2.SS3.p3.2.m2.1.2.3.2.2" stretchy="false" xref="S2.SS3.p3.2.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p3.2.m2.1b"><apply id="S2.SS3.p3.2.m2.1.2.cmml" xref="S2.SS3.p3.2.m2.1.2"><times id="S2.SS3.p3.2.m2.1.2.1.cmml" xref="S2.SS3.p3.2.m2.1.2.1"></times><ci id="S2.SS3.p3.2.m2.1.2.2.cmml" xref="S2.SS3.p3.2.m2.1.2.2">𝑂</ci><ci id="S2.SS3.p3.2.m2.1.1.cmml" xref="S2.SS3.p3.2.m2.1.1">𝑁</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p3.2.m2.1c">O(N)</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p3.2.m2.1d">italic_O ( italic_N )</annotation></semantics></math>.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3. </span>Motivations</h2> <figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="365" id="S3.F4.g1" src="extracted/6015391/DivisionOverhead.png" width="538"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4. </span>We alleviate not only the reliance on high-ENOB ADCs and FPUs but also the need for division arithmetic in our PiM hardware. </figcaption> </figure> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">Our study reveals that the effectiveness of PiM architectures is notably hindered by the high-ENOB ADC and FPUs. Additionally, among various arithmetic operations required for the execution of self-attention, division arithmetic - not only in FP but also in integer formats - hinders the hardware efficiency, as illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F4" title="Figure 4 ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4</span></a>. 
Thus, our optimization strategy tackles these challenging overheads while preserving accuracy and performance, focusing on the following principles: </p> <ul class="ltx_itemize" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.p1.1.1">Accurate Quantization</span></p> <ul class="ltx_itemize" id="S3.I1.i1.I1"> <li class="ltx_item" id="S3.I1.i1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S3.I1.i1.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.I1.i1.p1.1">Lossless quantization: Our method must accurately represent quantized values even without FP or division arithmetic.</p> </div> </li> <li class="ltx_item" id="S3.I1.i1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.I1.i2.1.1.1">–</span></span> <div class="ltx_para" id="S3.I1.i1.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i1.I1.i2.p1.1">Outlier preservation: Attention layers often generate channels with extreme outliers, which standard integer quantization in PTQ cannot capture well. 
To address this, we use token-wise quantization to reflect these outliers accurately.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i2.p1.1.1">Accurate Nonlinear (Non-BLAS)-Layer Executions</span></p> <ul class="ltx_itemize" id="S3.I1.i2.I1"> <li class="ltx_item" id="S3.I1.i2.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S3.I1.i2.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S3.I1.i2.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i2.I1.i1.p1.1">Transformer models depend heavily on nonlinear layers, which require complex exponentiation and division operations. Our approach preserves the exponential trend and proportional relationships between values without FP or division arithmetic.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="S3.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i3.p1"> <p class="ltx_p" id="S3.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i3.p1.1.1">PVT-Robust and Error-Resilient AMS-PiM</span></p> <ul class="ltx_itemize" id="S3.I1.i3.I1"> <li class="ltx_item" id="S3.I1.i3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="S3.I1.i3.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="S3.I1.i3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i3.I1.i1.p1.1">Despite its high efficiency, AMS-PiM has frequently been criticized for computation errors due to PVT variations. However, design solutions for those errors often conflict with the efficient exploitation of PiM devices. Our proposed techniques resolve this dilemma effectively. 
</p> </div> </li> </ul> </div> </li> </ul> <p class="ltx_p" id="S3.p1.2">In the following sections, we analyze these principles in greater depth before presenting our solutions.</p> </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1. </span>Motivation for PTQ and Non-BLAS Layer Optimizations</h3> <figure class="ltx_figure" id="S3.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="1015" id="S3.F5.g1" src="x4.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5. </span>(a) Even under lossless quantization, values lose their exponent information. (b) Linear projections remain consistent without the exponent. (c) Softmax over the values shown in (a): the absence of exponents corrupts the outputs of nonlinear layers. </figcaption> </figure> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.5">While numerous quantization methods exist that we cannot cover exhaustively here, the fundamental steps of an accurate yet efficient quantization process are as follows: (1) identify the minimum and maximum values among the given inputs (<math alttext="x_{i}\in\mathbf{X}" class="ltx_Math" display="inline" id="S3.SS1.p1.1.m1.1"><semantics id="S3.SS1.p1.1.m1.1a"><mrow id="S3.SS1.p1.1.m1.1.1" xref="S3.SS1.p1.1.m1.1.1.cmml"><msub id="S3.SS1.p1.1.m1.1.1.2" xref="S3.SS1.p1.1.m1.1.1.2.cmml"><mi id="S3.SS1.p1.1.m1.1.1.2.2" xref="S3.SS1.p1.1.m1.1.1.2.2.cmml">x</mi><mi id="S3.SS1.p1.1.m1.1.1.2.3" xref="S3.SS1.p1.1.m1.1.1.2.3.cmml">i</mi></msub><mo id="S3.SS1.p1.1.m1.1.1.1" xref="S3.SS1.p1.1.m1.1.1.1.cmml">∈</mo><mi id="S3.SS1.p1.1.m1.1.1.3" xref="S3.SS1.p1.1.m1.1.1.3.cmml">𝐗</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.1.m1.1b"><apply id="S3.SS1.p1.1.m1.1.1.cmml" xref="S3.SS1.p1.1.m1.1.1"><in id="S3.SS1.p1.1.m1.1.1.1.cmml" 
xref="S3.SS1.p1.1.m1.1.1.1"></in><apply id="S3.SS1.p1.1.m1.1.1.2.cmml" xref="S3.SS1.p1.1.m1.1.1.2"><csymbol cd="ambiguous" id="S3.SS1.p1.1.m1.1.1.2.1.cmml" xref="S3.SS1.p1.1.m1.1.1.2">subscript</csymbol><ci id="S3.SS1.p1.1.m1.1.1.2.2.cmml" xref="S3.SS1.p1.1.m1.1.1.2.2">𝑥</ci><ci id="S3.SS1.p1.1.m1.1.1.2.3.cmml" xref="S3.SS1.p1.1.m1.1.1.2.3">𝑖</ci></apply><ci id="S3.SS1.p1.1.m1.1.1.3.cmml" xref="S3.SS1.p1.1.m1.1.1.3">𝐗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.1.m1.1c">x_{i}\in\mathbf{X}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.1.m1.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_X</annotation></semantics></math>), (2) for n-bit quantization, define a “quantization step (<math alttext="\mathbf{S}" class="ltx_Math" display="inline" id="S3.SS1.p1.2.m2.1"><semantics id="S3.SS1.p1.2.m2.1a"><mi id="S3.SS1.p1.2.m2.1.1" xref="S3.SS1.p1.2.m2.1.1.cmml">𝐒</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.2.m2.1b"><ci id="S3.SS1.p1.2.m2.1.1.cmml" xref="S3.SS1.p1.2.m2.1.1">𝐒</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.2.m2.1c">\mathbf{S}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.2.m2.1d">bold_S</annotation></semantics></math>),” by dividing the minimum-maximum range by <math alttext="2^{n}" class="ltx_Math" display="inline" id="S3.SS1.p1.3.m3.1"><semantics id="S3.SS1.p1.3.m3.1a"><msup id="S3.SS1.p1.3.m3.1.1" xref="S3.SS1.p1.3.m3.1.1.cmml"><mn id="S3.SS1.p1.3.m3.1.1.2" xref="S3.SS1.p1.3.m3.1.1.2.cmml">2</mn><mi id="S3.SS1.p1.3.m3.1.1.3" xref="S3.SS1.p1.3.m3.1.1.3.cmml">n</mi></msup><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.3.m3.1b"><apply id="S3.SS1.p1.3.m3.1.1.cmml" xref="S3.SS1.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS1.p1.3.m3.1.1.1.cmml" xref="S3.SS1.p1.3.m3.1.1">superscript</csymbol><cn id="S3.SS1.p1.3.m3.1.1.2.cmml" type="integer" xref="S3.SS1.p1.3.m3.1.1.2">2</cn><ci 
id="S3.SS1.p1.3.m3.1.1.3.cmml" xref="S3.SS1.p1.3.m3.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.3.m3.1c">2^{n}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.3.m3.1d">2 start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT</annotation></semantics></math> levels, and (3) divide <math alttext="x_{i}" class="ltx_Math" display="inline" id="S3.SS1.p1.4.m4.1"><semantics id="S3.SS1.p1.4.m4.1a"><msub id="S3.SS1.p1.4.m4.1.1" xref="S3.SS1.p1.4.m4.1.1.cmml"><mi id="S3.SS1.p1.4.m4.1.1.2" xref="S3.SS1.p1.4.m4.1.1.2.cmml">x</mi><mi id="S3.SS1.p1.4.m4.1.1.3" xref="S3.SS1.p1.4.m4.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.4.m4.1b"><apply id="S3.SS1.p1.4.m4.1.1.cmml" xref="S3.SS1.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS1.p1.4.m4.1.1.1.cmml" xref="S3.SS1.p1.4.m4.1.1">subscript</csymbol><ci id="S3.SS1.p1.4.m4.1.1.2.cmml" xref="S3.SS1.p1.4.m4.1.1.2">𝑥</ci><ci id="S3.SS1.p1.4.m4.1.1.3.cmml" xref="S3.SS1.p1.4.m4.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.4.m4.1c">x_{i}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.4.m4.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math>’s by <math alttext="\mathbf{S}" class="ltx_Math" display="inline" id="S3.SS1.p1.5.m5.1"><semantics id="S3.SS1.p1.5.m5.1a"><mi id="S3.SS1.p1.5.m5.1.1" xref="S3.SS1.p1.5.m5.1.1.cmml">𝐒</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.5.m5.1b"><ci id="S3.SS1.p1.5.m5.1.1.cmml" xref="S3.SS1.p1.5.m5.1.1">𝐒</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.5.m5.1c">\mathbf{S}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.5.m5.1d">bold_S</annotation></semantics></math> to obtain the quantized values.</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1">Along the process, as illustrated in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F5" title="Figure 5 ‣ 3.1. Motivation for PTQ and Non-BLAS Layer Optimizations ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">5</span></a>-(a), we say a quantization is “lossless” when the ratios between quantized values retain the original proportions. That is:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_left" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_left">(1)</span></td> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\text{quantize}(x_{i})/\text{quantize}(x_{j})\approx x_{i}/x_{j},~{}\forall x_% {i},x_{j}\in\mathbf{X}~{}." class="ltx_Math" display="block" id="S3.E1.m1.1"><semantics id="S3.E1.m1.1a"><mrow id="S3.E1.m1.1.1.1"><mrow id="S3.E1.m1.1.1.1.1.2" xref="S3.E1.m1.1.1.1.1.3.cmml"><mrow id="S3.E1.m1.1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.cmml"><mrow id="S3.E1.m1.1.1.1.1.1.1.2" xref="S3.E1.m1.1.1.1.1.1.1.2.cmml"><mrow id="S3.E1.m1.1.1.1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.1.1.cmml"><mrow id="S3.E1.m1.1.1.1.1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.cmml"><mtext id="S3.E1.m1.1.1.1.1.1.1.1.1.1.3" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.3a.cmml">quantize</mtext><mo id="S3.E1.m1.1.1.1.1.1.1.1.1.1.2" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.2.cmml"></mo><mrow id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><msub id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">x</mi><mi 
id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">i</mi></msub><mo id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.1.1.1.1.1.1.1.1.2" xref="S3.E1.m1.1.1.1.1.1.1.1.1.2.cmml">/</mo><mtext id="S3.E1.m1.1.1.1.1.1.1.1.1.3" xref="S3.E1.m1.1.1.1.1.1.1.1.1.3a.cmml">quantize</mtext></mrow><mo id="S3.E1.m1.1.1.1.1.1.1.2.3" xref="S3.E1.m1.1.1.1.1.1.1.2.3.cmml"></mo><mrow id="S3.E1.m1.1.1.1.1.1.1.2.2.1" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.cmml"><mo id="S3.E1.m1.1.1.1.1.1.1.2.2.1.2" stretchy="false" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.cmml">(</mo><msub id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.cmml"><mi id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.2" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.2.cmml">x</mi><mi id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.3" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.3.cmml">j</mi></msub><mo id="S3.E1.m1.1.1.1.1.1.1.2.2.1.3" stretchy="false" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.1.1.1.1.1.1.5" xref="S3.E1.m1.1.1.1.1.1.1.5.cmml">≈</mo><mrow id="S3.E1.m1.1.1.1.1.1.1.4.2" xref="S3.E1.m1.1.1.1.1.1.1.4.3.cmml"><mrow id="S3.E1.m1.1.1.1.1.1.1.3.1.1" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.cmml"><msub id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.cmml"><mi id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.2" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.2.cmml">x</mi><mi id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.3" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.3.cmml">i</mi></msub><mo id="S3.E1.m1.1.1.1.1.1.1.3.1.1.1" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.1.cmml">/</mo><msub id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.cmml"><mi id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.2" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.2.cmml">x</mi><mi id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.3" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.3.cmml">j</mi></msub></mrow><mo id="S3.E1.m1.1.1.1.1.1.1.4.2.3" rspace="0.497em" xref="S3.E1.m1.1.1.1.1.1.1.4.3.cmml">,</mo><mrow 
id="S3.E1.m1.1.1.1.1.1.1.4.2.2" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.cmml"><mo id="S3.E1.m1.1.1.1.1.1.1.4.2.2.1" rspace="0.167em" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.1.cmml">∀</mo><msub id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.cmml"><mi id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.2" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.2.cmml">x</mi><mi id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.3" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.3.cmml">i</mi></msub></mrow></mrow></mrow><mo id="S3.E1.m1.1.1.1.1.2.3" xref="S3.E1.m1.1.1.1.1.3a.cmml">,</mo><mrow id="S3.E1.m1.1.1.1.1.2.2" xref="S3.E1.m1.1.1.1.1.2.2.cmml"><msub id="S3.E1.m1.1.1.1.1.2.2.2" xref="S3.E1.m1.1.1.1.1.2.2.2.cmml"><mi id="S3.E1.m1.1.1.1.1.2.2.2.2" xref="S3.E1.m1.1.1.1.1.2.2.2.2.cmml">x</mi><mi id="S3.E1.m1.1.1.1.1.2.2.2.3" xref="S3.E1.m1.1.1.1.1.2.2.2.3.cmml">j</mi></msub><mo id="S3.E1.m1.1.1.1.1.2.2.1" xref="S3.E1.m1.1.1.1.1.2.2.1.cmml">∈</mo><mi id="S3.E1.m1.1.1.1.1.2.2.3" xref="S3.E1.m1.1.1.1.1.2.2.3.cmml">𝐗</mi></mrow></mrow><mo id="S3.E1.m1.1.1.1.2" lspace="0.330em">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E1.m1.1b"><apply id="S3.E1.m1.1.1.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.3a.cmml" xref="S3.E1.m1.1.1.1.1.2.3">formulae-sequence</csymbol><apply id="S3.E1.m1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1"><approx id="S3.E1.m1.1.1.1.1.1.1.5.cmml" xref="S3.E1.m1.1.1.1.1.1.1.5"></approx><apply id="S3.E1.m1.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.2"><times id="S3.E1.m1.1.1.1.1.1.1.2.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.2.3"></times><apply id="S3.E1.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1"><divide id="S3.E1.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.2"></divide><apply id="S3.E1.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1"><times id="S3.E1.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.2"></times><ci id="S3.E1.m1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.3"><mtext 
id="S3.E1.m1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.3">quantize</mtext></ci><apply id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.2">𝑥</ci><ci id="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.1.3">𝑖</ci></apply></apply><ci id="S3.E1.m1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E1.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.1.1.3">quantize</mtext></ci></apply><apply id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1">subscript</csymbol><ci id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.2">𝑥</ci><ci id="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.2.2.1.1.3">𝑗</ci></apply></apply><list id="S3.E1.m1.1.1.1.1.1.1.4.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2"><apply id="S3.E1.m1.1.1.1.1.1.1.3.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1"><divide id="S3.E1.m1.1.1.1.1.1.1.3.1.1.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.1"></divide><apply id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2">subscript</csymbol><ci id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.2">𝑥</ci><ci id="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.2.3">𝑖</ci></apply><apply id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3">subscript</csymbol><ci id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.2.cmml" 
xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.2">𝑥</ci><ci id="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.3.1.1.3.3">𝑗</ci></apply></apply><apply id="S3.E1.m1.1.1.1.1.1.1.4.2.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2"><csymbol cd="latexml" id="S3.E1.m1.1.1.1.1.1.1.4.2.2.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.1">for-all</csymbol><apply id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.1.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2">subscript</csymbol><ci id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.2.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.2">𝑥</ci><ci id="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.3.cmml" xref="S3.E1.m1.1.1.1.1.1.1.4.2.2.2.3">𝑖</ci></apply></apply></list></apply><apply id="S3.E1.m1.1.1.1.1.2.2.cmml" xref="S3.E1.m1.1.1.1.1.2.2"><in id="S3.E1.m1.1.1.1.1.2.2.1.cmml" xref="S3.E1.m1.1.1.1.1.2.2.1"></in><apply id="S3.E1.m1.1.1.1.1.2.2.2.cmml" xref="S3.E1.m1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.1.1.1.1.2.2.2.1.cmml" xref="S3.E1.m1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E1.m1.1.1.1.1.2.2.2.2.cmml" xref="S3.E1.m1.1.1.1.1.2.2.2.2">𝑥</ci><ci id="S3.E1.m1.1.1.1.1.2.2.2.3.cmml" xref="S3.E1.m1.1.1.1.1.2.2.2.3">𝑗</ci></apply><ci id="S3.E1.m1.1.1.1.1.2.2.3.cmml" xref="S3.E1.m1.1.1.1.1.2.2.3">𝐗</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E1.m1.1c">\text{quantize}(x_{i})/\text{quantize}(x_{j})\approx x_{i}/x_{j},~{}\forall x_% {i},x_{j}\in\mathbf{X}~{}.</annotation><annotation encoding="application/x-llamapun" id="S3.E1.m1.1d">quantize ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / quantize ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ≈ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , ∀ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ bold_X .</annotation></semantics></math></td> 
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS1.p2.2">However, the original exponent is not retained, even when “lossless” quantization is achieved using FPUs. Nonetheless, the issue of omitting exponents is less severe in linear projection operations (as depicted in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F5" title="Figure 5 ‣ 3.1. Motivation for PTQ and Non-BLAS Layer Optimizations ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">5</span></a>-(b)), as operations keep their inherent quality without exponent information, i.e.,</p> <table class="ltx_equation ltx_eqn_table" id="S3.E2"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_left" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_left">(2)</span></td> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="a\times f(\mathbf{X})=f(a\times\mathbf{X})." 
class="ltx_Math" display="block" id="S3.E2.m1.2"><semantics id="S3.E2.m1.2a"><mrow id="S3.E2.m1.2.2.1" xref="S3.E2.m1.2.2.1.1.cmml"><mrow id="S3.E2.m1.2.2.1.1" xref="S3.E2.m1.2.2.1.1.cmml"><mrow id="S3.E2.m1.2.2.1.1.3" xref="S3.E2.m1.2.2.1.1.3.cmml"><mrow id="S3.E2.m1.2.2.1.1.3.2" xref="S3.E2.m1.2.2.1.1.3.2.cmml"><mi id="S3.E2.m1.2.2.1.1.3.2.2" xref="S3.E2.m1.2.2.1.1.3.2.2.cmml">a</mi><mo id="S3.E2.m1.2.2.1.1.3.2.1" lspace="0.222em" rspace="0.222em" xref="S3.E2.m1.2.2.1.1.3.2.1.cmml">×</mo><mi id="S3.E2.m1.2.2.1.1.3.2.3" xref="S3.E2.m1.2.2.1.1.3.2.3.cmml">f</mi></mrow><mo id="S3.E2.m1.2.2.1.1.3.1" xref="S3.E2.m1.2.2.1.1.3.1.cmml"></mo><mrow id="S3.E2.m1.2.2.1.1.3.3.2" xref="S3.E2.m1.2.2.1.1.3.cmml"><mo id="S3.E2.m1.2.2.1.1.3.3.2.1" stretchy="false" xref="S3.E2.m1.2.2.1.1.3.cmml">(</mo><mi id="S3.E2.m1.1.1" xref="S3.E2.m1.1.1.cmml">𝐗</mi><mo id="S3.E2.m1.2.2.1.1.3.3.2.2" stretchy="false" xref="S3.E2.m1.2.2.1.1.3.cmml">)</mo></mrow></mrow><mo id="S3.E2.m1.2.2.1.1.2" xref="S3.E2.m1.2.2.1.1.2.cmml">=</mo><mrow id="S3.E2.m1.2.2.1.1.1" xref="S3.E2.m1.2.2.1.1.1.cmml"><mi id="S3.E2.m1.2.2.1.1.1.3" xref="S3.E2.m1.2.2.1.1.1.3.cmml">f</mi><mo id="S3.E2.m1.2.2.1.1.1.2" xref="S3.E2.m1.2.2.1.1.1.2.cmml"></mo><mrow id="S3.E2.m1.2.2.1.1.1.1.1" xref="S3.E2.m1.2.2.1.1.1.1.1.1.cmml"><mo id="S3.E2.m1.2.2.1.1.1.1.1.2" stretchy="false" xref="S3.E2.m1.2.2.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E2.m1.2.2.1.1.1.1.1.1" xref="S3.E2.m1.2.2.1.1.1.1.1.1.cmml"><mi id="S3.E2.m1.2.2.1.1.1.1.1.1.2" xref="S3.E2.m1.2.2.1.1.1.1.1.1.2.cmml">a</mi><mo id="S3.E2.m1.2.2.1.1.1.1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.E2.m1.2.2.1.1.1.1.1.1.1.cmml">×</mo><mi id="S3.E2.m1.2.2.1.1.1.1.1.1.3" xref="S3.E2.m1.2.2.1.1.1.1.1.1.3.cmml">𝐗</mi></mrow><mo id="S3.E2.m1.2.2.1.1.1.1.1.3" stretchy="false" xref="S3.E2.m1.2.2.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E2.m1.2.2.1.2" lspace="0em" xref="S3.E2.m1.2.2.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E2.m1.2b"><apply 
id="S3.E2.m1.2.2.1.1.cmml" xref="S3.E2.m1.2.2.1"><eq id="S3.E2.m1.2.2.1.1.2.cmml" xref="S3.E2.m1.2.2.1.1.2"></eq><apply id="S3.E2.m1.2.2.1.1.3.cmml" xref="S3.E2.m1.2.2.1.1.3"><times id="S3.E2.m1.2.2.1.1.3.1.cmml" xref="S3.E2.m1.2.2.1.1.3.1"></times><apply id="S3.E2.m1.2.2.1.1.3.2.cmml" xref="S3.E2.m1.2.2.1.1.3.2"><times id="S3.E2.m1.2.2.1.1.3.2.1.cmml" xref="S3.E2.m1.2.2.1.1.3.2.1"></times><ci id="S3.E2.m1.2.2.1.1.3.2.2.cmml" xref="S3.E2.m1.2.2.1.1.3.2.2">𝑎</ci><ci id="S3.E2.m1.2.2.1.1.3.2.3.cmml" xref="S3.E2.m1.2.2.1.1.3.2.3">𝑓</ci></apply><ci id="S3.E2.m1.1.1.cmml" xref="S3.E2.m1.1.1">𝐗</ci></apply><apply id="S3.E2.m1.2.2.1.1.1.cmml" xref="S3.E2.m1.2.2.1.1.1"><times id="S3.E2.m1.2.2.1.1.1.2.cmml" xref="S3.E2.m1.2.2.1.1.1.2"></times><ci id="S3.E2.m1.2.2.1.1.1.3.cmml" xref="S3.E2.m1.2.2.1.1.1.3">𝑓</ci><apply id="S3.E2.m1.2.2.1.1.1.1.1.1.cmml" xref="S3.E2.m1.2.2.1.1.1.1.1"><times id="S3.E2.m1.2.2.1.1.1.1.1.1.1.cmml" xref="S3.E2.m1.2.2.1.1.1.1.1.1.1"></times><ci id="S3.E2.m1.2.2.1.1.1.1.1.1.2.cmml" xref="S3.E2.m1.2.2.1.1.1.1.1.1.2">𝑎</ci><ci id="S3.E2.m1.2.2.1.1.1.1.1.1.3.cmml" xref="S3.E2.m1.2.2.1.1.1.1.1.1.3">𝐗</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E2.m1.2c">a\times f(\mathbf{X})=f(a\times\mathbf{X}).</annotation><annotation encoding="application/x-llamapun" id="S3.E2.m1.2d">italic_a × italic_f ( bold_X ) = italic_f ( italic_a × bold_X ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS1.p2.3">On the other hand, nonlinear layers are completely disrupted when the exponent information is omitted, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F5" title="Figure 5 ‣ 3.1. Motivation for PTQ and Non-BLAS Layer Optimizations ‣ 3. 
Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">5</span></a>-(c). That is,</p> <table class="ltx_equation ltx_eqn_table" id="S3.E3"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_left" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_left">(3)</span></td> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="a\times f(\mathbf{X})\neq f(a\times\mathbf{X})." class="ltx_Math" display="block" id="S3.E3.m1.2"><semantics id="S3.E3.m1.2a"><mrow id="S3.E3.m1.2.2.1" xref="S3.E3.m1.2.2.1.1.cmml"><mrow id="S3.E3.m1.2.2.1.1" xref="S3.E3.m1.2.2.1.1.cmml"><mrow id="S3.E3.m1.2.2.1.1.3" xref="S3.E3.m1.2.2.1.1.3.cmml"><mrow id="S3.E3.m1.2.2.1.1.3.2" xref="S3.E3.m1.2.2.1.1.3.2.cmml"><mi id="S3.E3.m1.2.2.1.1.3.2.2" xref="S3.E3.m1.2.2.1.1.3.2.2.cmml">a</mi><mo id="S3.E3.m1.2.2.1.1.3.2.1" lspace="0.222em" rspace="0.222em" xref="S3.E3.m1.2.2.1.1.3.2.1.cmml">×</mo><mi id="S3.E3.m1.2.2.1.1.3.2.3" xref="S3.E3.m1.2.2.1.1.3.2.3.cmml">f</mi></mrow><mo id="S3.E3.m1.2.2.1.1.3.1" xref="S3.E3.m1.2.2.1.1.3.1.cmml"></mo><mrow id="S3.E3.m1.2.2.1.1.3.3.2" xref="S3.E3.m1.2.2.1.1.3.cmml"><mo id="S3.E3.m1.2.2.1.1.3.3.2.1" stretchy="false" xref="S3.E3.m1.2.2.1.1.3.cmml">(</mo><mi id="S3.E3.m1.1.1" xref="S3.E3.m1.1.1.cmml">𝐗</mi><mo id="S3.E3.m1.2.2.1.1.3.3.2.2" stretchy="false" xref="S3.E3.m1.2.2.1.1.3.cmml">)</mo></mrow></mrow><mo id="S3.E3.m1.2.2.1.1.2" xref="S3.E3.m1.2.2.1.1.2.cmml">≠</mo><mrow id="S3.E3.m1.2.2.1.1.1" xref="S3.E3.m1.2.2.1.1.1.cmml"><mi id="S3.E3.m1.2.2.1.1.1.3" xref="S3.E3.m1.2.2.1.1.1.3.cmml">f</mi><mo id="S3.E3.m1.2.2.1.1.1.2" xref="S3.E3.m1.2.2.1.1.1.2.cmml"></mo><mrow id="S3.E3.m1.2.2.1.1.1.1.1" xref="S3.E3.m1.2.2.1.1.1.1.1.1.cmml"><mo id="S3.E3.m1.2.2.1.1.1.1.1.2" stretchy="false" 
xref="S3.E3.m1.2.2.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.2.2.1.1.1.1.1.1" xref="S3.E3.m1.2.2.1.1.1.1.1.1.cmml"><mi id="S3.E3.m1.2.2.1.1.1.1.1.1.2" xref="S3.E3.m1.2.2.1.1.1.1.1.1.2.cmml">a</mi><mo id="S3.E3.m1.2.2.1.1.1.1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.E3.m1.2.2.1.1.1.1.1.1.1.cmml">×</mo><mi id="S3.E3.m1.2.2.1.1.1.1.1.1.3" xref="S3.E3.m1.2.2.1.1.1.1.1.1.3.cmml">𝐗</mi></mrow><mo id="S3.E3.m1.2.2.1.1.1.1.1.3" stretchy="false" xref="S3.E3.m1.2.2.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E3.m1.2.2.1.2" lspace="0em" xref="S3.E3.m1.2.2.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E3.m1.2b"><apply id="S3.E3.m1.2.2.1.1.cmml" xref="S3.E3.m1.2.2.1"><neq id="S3.E3.m1.2.2.1.1.2.cmml" xref="S3.E3.m1.2.2.1.1.2"></neq><apply id="S3.E3.m1.2.2.1.1.3.cmml" xref="S3.E3.m1.2.2.1.1.3"><times id="S3.E3.m1.2.2.1.1.3.1.cmml" xref="S3.E3.m1.2.2.1.1.3.1"></times><apply id="S3.E3.m1.2.2.1.1.3.2.cmml" xref="S3.E3.m1.2.2.1.1.3.2"><times id="S3.E3.m1.2.2.1.1.3.2.1.cmml" xref="S3.E3.m1.2.2.1.1.3.2.1"></times><ci id="S3.E3.m1.2.2.1.1.3.2.2.cmml" xref="S3.E3.m1.2.2.1.1.3.2.2">𝑎</ci><ci id="S3.E3.m1.2.2.1.1.3.2.3.cmml" xref="S3.E3.m1.2.2.1.1.3.2.3">𝑓</ci></apply><ci id="S3.E3.m1.1.1.cmml" xref="S3.E3.m1.1.1">𝐗</ci></apply><apply id="S3.E3.m1.2.2.1.1.1.cmml" xref="S3.E3.m1.2.2.1.1.1"><times id="S3.E3.m1.2.2.1.1.1.2.cmml" xref="S3.E3.m1.2.2.1.1.1.2"></times><ci id="S3.E3.m1.2.2.1.1.1.3.cmml" xref="S3.E3.m1.2.2.1.1.1.3">𝑓</ci><apply id="S3.E3.m1.2.2.1.1.1.1.1.1.cmml" xref="S3.E3.m1.2.2.1.1.1.1.1"><times id="S3.E3.m1.2.2.1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.2.2.1.1.1.1.1.1.1"></times><ci id="S3.E3.m1.2.2.1.1.1.1.1.1.2.cmml" xref="S3.E3.m1.2.2.1.1.1.1.1.1.2">𝑎</ci><ci id="S3.E3.m1.2.2.1.1.1.1.1.1.3.cmml" xref="S3.E3.m1.2.2.1.1.1.1.1.1.3">𝐗</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E3.m1.2c">a\times f(\mathbf{X})\neq f(a\times\mathbf{X}).</annotation><annotation 
encoding="application/x-llamapun" id="S3.E3.m1.2d">italic_a × italic_f ( bold_X ) ≠ italic_f ( italic_a × bold_X ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS1.p2.4">Some recent studies simplify FP-based computations <cite class="ltx_cite ltx_citemacro_citep">(Cardarilli et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib10" title="">2021</a>; Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib33" title="">2021</a>; Li and Gu, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib38" title="">2023</a>)</cite> to integer-based lightweight alternatives with satisfactory reliability. However, this approach can reduce the accuracy of PTQ-based inference, because PTQ lacks the retraining needed to adapt the model to these approximations.</p> </div> <div class="ltx_para" id="S3.SS1.p3"> <p class="ltx_p" id="S3.SS1.p3.1">In response, in sections <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS2" title="4.2. Dequantization-Free PTQ Technique ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.2</span></a> and <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS3" title="4.3. Proposed VDR-Softmax Fused with eMSB-Q ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.3</span></a>, we propose an integrated quantization-computation method that preserves the exponent for nonlinear layers while replacing division operations with only parsing and bit-shifting.
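</p> </div> <div class="ltx_para"> <p class="ltx_p">As a minimal sanity check of Eq. (3), the following sketch (hypothetical values; plain NumPy, not the FLARE pipeline) confirms that a nonlinear layer such as softmax does not commute with a scale factor, which is why a dequantization scale cannot simply be deferred past the nonlinearity:</p> <pre class="ltx_verbatim">

```python
import numpy as np

def softmax(x):
    # numerically stable softmax; plays the role of f(X) in Eq. (3)
    e = np.exp(x - x.max())
    return e / e.sum()

a = 4                            # hypothetical dequantization scale
X = np.array([0.5, -1.0, 2.0])   # hypothetical pre-activation values

lhs = a * softmax(X)             # a * f(X)
rhs = softmax(a * X)             # f(a * X)

# The two sides differ, so the scale must be applied before the
# nonlinearity rather than factored out of it.
print(np.allclose(lhs, rhs))     # False
```

</pre> <p class="ltx_p">The check is generic: any nonlinearity that is not homogeneous in its input, such as softmax or GELU, fails it, which is what forces conventional PTQ pipelines to dequantize before such layers.</p> </div> <div class="ltx_para"> <p class="ltx_p">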
While prior works <cite class="ltx_cite ltx_citemacro_citep">(Ho and Chang, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib21" title="">2023</a>; Anonymous, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib5" title="">2024</a>)</cite> introduced techniques that capture the MSB’s position through parsing, they rely on group-wise quantization, incurring large storage overhead and limited scalability. In contrast, our method fuses token-wise quantization with VDR-Softmax, temporarily storing only a single token’s values in compact registers and discarding them after processing, and couples quantization end-to-end across every stage of the attention layer.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2. </span>Motivation for Accurate and Efficient AMS Computation</h3> <figure class="ltx_figure" id="S3.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="439" id="S3.F6.g1" src="x5.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6. </span>AMS-PiM with high-ENOB ADCs is more susceptible to computation errors under PVT variation, owing to reduced sensing margins. </figcaption> </figure> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.4">As mentioned in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S1" title="1.
Introduction ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>, transformer models with PTQ demand high precision for partial sums, easily reaching <math alttext="\geq" class="ltx_Math" display="inline" id="S3.SS2.p1.1.m1.1"><semantics id="S3.SS2.p1.1.m1.1a"><mo id="S3.SS2.p1.1.m1.1.1" xref="S3.SS2.p1.1.m1.1.1.cmml">≥</mo><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.1.m1.1b"><geq id="S3.SS2.p1.1.m1.1.1.cmml" xref="S3.SS2.p1.1.m1.1.1"></geq></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.1.m1.1c">\geq</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.1.m1.1d">≥</annotation></semantics></math>18 bits with <math alttext="d_{model}" class="ltx_Math" display="inline" id="S3.SS2.p1.2.m2.1"><semantics id="S3.SS2.p1.2.m2.1a"><msub id="S3.SS2.p1.2.m2.1.1" xref="S3.SS2.p1.2.m2.1.1.cmml"><mi id="S3.SS2.p1.2.m2.1.1.2" xref="S3.SS2.p1.2.m2.1.1.2.cmml">d</mi><mrow id="S3.SS2.p1.2.m2.1.1.3" xref="S3.SS2.p1.2.m2.1.1.3.cmml"><mi id="S3.SS2.p1.2.m2.1.1.3.2" xref="S3.SS2.p1.2.m2.1.1.3.2.cmml">m</mi><mo id="S3.SS2.p1.2.m2.1.1.3.1" xref="S3.SS2.p1.2.m2.1.1.3.1.cmml"></mo><mi id="S3.SS2.p1.2.m2.1.1.3.3" xref="S3.SS2.p1.2.m2.1.1.3.3.cmml">o</mi><mo id="S3.SS2.p1.2.m2.1.1.3.1a" xref="S3.SS2.p1.2.m2.1.1.3.1.cmml"></mo><mi id="S3.SS2.p1.2.m2.1.1.3.4" xref="S3.SS2.p1.2.m2.1.1.3.4.cmml">d</mi><mo id="S3.SS2.p1.2.m2.1.1.3.1b" xref="S3.SS2.p1.2.m2.1.1.3.1.cmml"></mo><mi id="S3.SS2.p1.2.m2.1.1.3.5" xref="S3.SS2.p1.2.m2.1.1.3.5.cmml">e</mi><mo id="S3.SS2.p1.2.m2.1.1.3.1c" xref="S3.SS2.p1.2.m2.1.1.3.1.cmml"></mo><mi id="S3.SS2.p1.2.m2.1.1.3.6" xref="S3.SS2.p1.2.m2.1.1.3.6.cmml">l</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.2.m2.1b"><apply id="S3.SS2.p1.2.m2.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p1.2.m2.1.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1">subscript</csymbol><ci
id="S3.SS2.p1.2.m2.1.1.2.cmml" xref="S3.SS2.p1.2.m2.1.1.2">𝑑</ci><apply id="S3.SS2.p1.2.m2.1.1.3.cmml" xref="S3.SS2.p1.2.m2.1.1.3"><times id="S3.SS2.p1.2.m2.1.1.3.1.cmml" xref="S3.SS2.p1.2.m2.1.1.3.1"></times><ci id="S3.SS2.p1.2.m2.1.1.3.2.cmml" xref="S3.SS2.p1.2.m2.1.1.3.2">𝑚</ci><ci id="S3.SS2.p1.2.m2.1.1.3.3.cmml" xref="S3.SS2.p1.2.m2.1.1.3.3">𝑜</ci><ci id="S3.SS2.p1.2.m2.1.1.3.4.cmml" xref="S3.SS2.p1.2.m2.1.1.3.4">𝑑</ci><ci id="S3.SS2.p1.2.m2.1.1.3.5.cmml" xref="S3.SS2.p1.2.m2.1.1.3.5">𝑒</ci><ci id="S3.SS2.p1.2.m2.1.1.3.6.cmml" xref="S3.SS2.p1.2.m2.1.1.3.6">𝑙</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.2.m2.1c">d_{model}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.2.m2.1d">italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT</annotation></semantics></math><math alttext="\geq" class="ltx_Math" display="inline" id="S3.SS2.p1.3.m3.1"><semantics id="S3.SS2.p1.3.m3.1a"><mo id="S3.SS2.p1.3.m3.1.1" xref="S3.SS2.p1.3.m3.1.1.cmml">≥</mo><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.3.m3.1b"><geq id="S3.SS2.p1.3.m3.1.1.cmml" xref="S3.SS2.p1.3.m3.1.1"></geq></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.3.m3.1c">\geq</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.3.m3.1d">≥</annotation></semantics></math>1k and with activation/weight operands of <math alttext="\geq" class="ltx_Math" display="inline" id="S3.SS2.p1.4.m4.1"><semantics id="S3.SS2.p1.4.m4.1a"><mo id="S3.SS2.p1.4.m4.1.1" xref="S3.SS2.p1.4.m4.1.1.cmml">≥</mo><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.4.m4.1b"><geq id="S3.SS2.p1.4.m4.1.1.cmml" xref="S3.SS2.p1.4.m4.1.1"></geq></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.4.m4.1c">\geq</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.4.m4.1d">≥</annotation></semantics></math>4bit. 
This high-precision summation, for which direct quantization is infeasible, requires ADCs with higher ENOBs: this increases not only area/energy overhead but also the likelihood of errors.</p> </div> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1">Ensuring precise analog computation for PTQ requires the number of decision boundaries to exceed the number of distinct analog signal levels. This requirement is fundamental in PTQ, as accurate quantization relies on precise summation and digital conversion of the analog values. However, achieving this narrows the sensing margin of ADCs, which worsens error resilience, as depicted in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F6" title="Figure 6 ‣ 3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">6</span></a>. Errors occur when analog signals cross decision boundaries during analog-to-digital conversion, and this challenge intensifies as sensing margins (the distance between decision boundaries) narrow.</p> </div> <div class="ltx_para" id="S3.SS2.p3"> <p class="ltx_p" id="S3.SS2.p3.1">The computational error resilience, which motivates our proposed technique, can be analyzed as follows. Analog signals are proportional to the number of “on-cell”s generating on-current within the activated word lines (WLs), as illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.F2" title="Figure 2 ‣ 2.1. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) ‣ 2. Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">2</span></a>.
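</p> </div> <div class="ltx_para"> <p class="ltx_p">The ≥18-bit figure quoted above follows from simple bit-growth accounting, sketched below (an illustrative assumption, not taken from the paper: signed products of B<sub>a</sub>-bit activations and B<sub>w</sub>-bit weights accumulated over d<sub>model</sub> terms):</p> <pre class="ltx_verbatim">

```python
import math

def partial_sum_bits(b_act, b_wgt, d_model):
    # Each product occupies b_act + b_wgt bits; accumulating d_model of
    # them can grow the worst-case result by ceil(log2(d_model)) bits.
    return b_act + b_wgt + math.ceil(math.log2(d_model))

# 4-bit activations x 4-bit weights with d_model = 1024:
print(partial_sum_bits(4, 4, 1024))  # 18
```

</pre> <p class="ltx_p">Any partial sum wider than the ADC’s ENOB must be resolved across multiple conversions, which is what drives the high-ENOB requirement discussed here.</p> </div> <div class="ltx_para"> <p class="ltx_p">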
Recent studies <cite class="ltx_cite ltx_citemacro_citep">(Yi et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib57" title="">2024</a>; Andrulis et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib4" title="">2023</a>; Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib31" title="">2020</a>)</cite> show that reducing the activated “on-cell”s along the partial sum process reduces error rates. Meanwhile, the number of Simultaneously Activated WLs (SAWL) sets a direct upper limit on the “on-cell”s contributing to the analog summation. Therefore, we can also cap the SAWL to bound the number of active “on-cell”s and segment unit computations into shorter input vectors for more reliable computation. This approach reduces the required ENOB of the ADC, lowering the area/energy overhead of ADCs by <math alttext="2^{\text{ENOB}}" class="ltx_Math" display="inline" id="S3.SS2.p3.1.m1.1"><semantics id="S3.SS2.p3.1.m1.1a"><msup id="S3.SS2.p3.1.m1.1.1" xref="S3.SS2.p3.1.m1.1.1.cmml"><mn id="S3.SS2.p3.1.m1.1.1.2" xref="S3.SS2.p3.1.m1.1.1.2.cmml">2</mn><mtext id="S3.SS2.p3.1.m1.1.1.3" xref="S3.SS2.p3.1.m1.1.1.3a.cmml">ENOB</mtext></msup><annotation-xml encoding="MathML-Content" id="S3.SS2.p3.1.m1.1b"><apply id="S3.SS2.p3.1.m1.1.1.cmml" xref="S3.SS2.p3.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p3.1.m1.1.1.1.cmml" xref="S3.SS2.p3.1.m1.1.1">superscript</csymbol><cn id="S3.SS2.p3.1.m1.1.1.2.cmml" type="integer" xref="S3.SS2.p3.1.m1.1.1.2">2</cn><ci id="S3.SS2.p3.1.m1.1.1.3a.cmml" xref="S3.SS2.p3.1.m1.1.1.3"><mtext id="S3.SS2.p3.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p3.1.m1.1.1.3">ENOB</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p3.1.m1.1c">2^{\text{ENOB}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p3.1.m1.1d">2
start_POSTSUPERSCRIPT ENOB end_POSTSUPERSCRIPT</annotation></semantics></math>, significantly relaxing the hardware complexity <cite class="ltx_cite ltx_citemacro_citep">(Danial et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib13" title="">2018</a>)</cite>. By integrating this segmentation strategy, which reduces ADC overhead, with the error mitigation techniques described in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS4" title="4.4. BitSift-GEMV Processor ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.4</span></a>, FLARE can significantly improve PVT robustness.</p> </div> <figure class="ltx_figure" id="S3.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_portrait" height="1395" id="S3.F7.g1" src="x6.png" width="797"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7. </span>Comparison of sparsity measured at various granularities. (a)<math alttext="\sim" class="ltx_Math" display="inline" id="S3.F7.2.m1.1"><semantics id="S3.F7.2.m1.1b"><mo id="S3.F7.2.m1.1.1" xref="S3.F7.2.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S3.F7.2.m1.1c"><csymbol cd="latexml" id="S3.F7.2.m1.1.1.cmml" xref="S3.F7.2.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S3.F7.2.m1.1d">\sim</annotation><annotation encoding="application/x-llamapun" id="S3.F7.2.m1.1e">∼</annotation></semantics></math>(d) visualize the sparsity of the same 8-bit-integer vectors inputted bit-serially.
Sparsities are measured and exploited in: (a) a bitwise manner, (b) a valuewise (VW) manner (when an 8-bit integer consists entirely of “0”s), (c) an <span class="ltx_text ltx_font_italic" id="S3.F7.6.1">N</span>-valuewise (<span class="ltx_text ltx_font_italic" id="S3.F7.7.2">N</span>-VW) manner, and (d) an <span class="ltx_text ltx_font_italic" id="S3.F7.8.3">N</span>-column-BitSlicewise manner (NCW, when a part of the bitwise input is all “0”s in the column direction). The activation sparsities are measured and averaged over the input, Q, K, V, and output layers in transformer blocks for (e) an NLP task and (f) a vision task.</figcaption> </figure> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.3. </span>Massive GEMVs with Quadratic Latency Bottleneck</h3> <div class="ltx_para" id="S3.SS3.p1"> <p class="ltx_p" id="S3.SS3.p1.3">As delineated in Section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.SS2" title="3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">3.2</span></a>, error-free analog computation can be achieved by limiting the maximum SAWL and segmenting computations over multiple cycles with sliced inputs (e.g., dividing <math alttext="D" class="ltx_Math" display="inline" id="S3.SS3.p1.1.m1.1"><semantics id="S3.SS3.p1.1.m1.1a"><mi id="S3.SS3.p1.1.m1.1.1" xref="S3.SS3.p1.1.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.1.m1.1b"><ci id="S3.SS3.p1.1.m1.1.1.cmml" xref="S3.SS3.p1.1.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.1.m1.1c">D</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.1.m1.1d">italic_D</annotation></semantics></math> in Fig.
<a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S2.F2" title="Figure 2 ‣ 2.1. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) ‣ 2. Backgrounds ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">2</span></a> into <math alttext="D/N" class="ltx_Math" display="inline" id="S3.SS3.p1.2.m2.1"><semantics id="S3.SS3.p1.2.m2.1a"><mrow id="S3.SS3.p1.2.m2.1.1" xref="S3.SS3.p1.2.m2.1.1.cmml"><mi id="S3.SS3.p1.2.m2.1.1.2" xref="S3.SS3.p1.2.m2.1.1.2.cmml">D</mi><mo id="S3.SS3.p1.2.m2.1.1.1" xref="S3.SS3.p1.2.m2.1.1.1.cmml">/</mo><mi id="S3.SS3.p1.2.m2.1.1.3" xref="S3.SS3.p1.2.m2.1.1.3.cmml">N</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.2.m2.1b"><apply id="S3.SS3.p1.2.m2.1.1.cmml" xref="S3.SS3.p1.2.m2.1.1"><divide id="S3.SS3.p1.2.m2.1.1.1.cmml" xref="S3.SS3.p1.2.m2.1.1.1"></divide><ci id="S3.SS3.p1.2.m2.1.1.2.cmml" xref="S3.SS3.p1.2.m2.1.1.2">𝐷</ci><ci id="S3.SS3.p1.2.m2.1.1.3.cmml" xref="S3.SS3.p1.2.m2.1.1.3">𝑁</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.2.m2.1c">D/N</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.2.m2.1d">italic_D / italic_N</annotation></semantics></math> using <math alttext="N" class="ltx_Math" display="inline" id="S3.SS3.p1.3.m3.1"><semantics id="S3.SS3.p1.3.m3.1a"><mi id="S3.SS3.p1.3.m3.1.1" xref="S3.SS3.p1.3.m3.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.3.m3.1b"><ci id="S3.SS3.p1.3.m3.1.1.cmml" xref="S3.SS3.p1.3.m3.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.3.m3.1c">N</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.3.m3.1d">italic_N</annotation></semantics></math> cycles). However, this sliced processing — with a constrained vector length — results in significantly more processing cycles. 
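</p> </div> <div class="ltx_para"> <p class="ltx_p">The segmentation just described can be sketched in a few lines (hypothetical integer data, not from the paper; the point is that capping SAWL at <em>l</em> trades one long analog sum for D/<em>l</em> exact short sums, at the cost of D/<em>l</em> cycles):</p> <pre class="ltx_verbatim">

```python
import numpy as np

rng = np.random.default_rng(0)
D, l = 1024, 8                   # vector length and SAWL cap
x = rng.integers(0, 2, size=D)   # one bit-plane of the input (one bit per WL)
w = rng.integers(-8, 8, size=D)  # a weight column

# Segmented accumulation: D/l short dot products, each activating at
# most l word lines, digitized one chunk at a time.
chunks = [int(x[i:i + l] @ w[i:i + l]) for i in range(0, D, l)]

print(len(chunks))               # 128 cycles for this bit-plane
print(sum(chunks) == x @ w)      # True: segmentation is exact, only slower
```

</pre> <p class="ltx_p">Because each chunk stays within the SAWL cap, its analog sum fits a low-ENOB ADC; the digital accumulator then recombines the chunks losslessly.</p> </div> <div class="ltx_para"> <p class="ltx_p">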
Furthermore, the increasing model size and the resultant quadratic GEMV demand in the attention layer further exacerbate the suboptimal performance.</p> </div> <div class="ltx_para" id="S3.SS3.p2"> <p class="ltx_p" id="S3.SS3.p2.12">For example, consider a length(<math alttext="l" class="ltx_Math" display="inline" id="S3.SS3.p2.1.m1.1"><semantics id="S3.SS3.p2.1.m1.1a"><mi id="S3.SS3.p2.1.m1.1.1" xref="S3.SS3.p2.1.m1.1.1.cmml">l</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.1.m1.1b"><ci id="S3.SS3.p2.1.m1.1.1.cmml" xref="S3.SS3.p2.1.m1.1.1">𝑙</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.1.m1.1c">l</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.1.m1.1d">italic_l</annotation></semantics></math>)-restricted GEMM operation where SAWL<math alttext="\leq" class="ltx_Math" display="inline" id="S3.SS3.p2.2.m2.1"><semantics id="S3.SS3.p2.2.m2.1a"><mo id="S3.SS3.p2.2.m2.1.1" xref="S3.SS3.p2.2.m2.1.1.cmml">≤</mo><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.2.m2.1b"><leq id="S3.SS3.p2.2.m2.1.1.cmml" xref="S3.SS3.p2.2.m2.1.1"></leq></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.2.m2.1c">\leq</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.2.m2.1d">≤</annotation></semantics></math><math alttext="l" class="ltx_Math" display="inline" id="S3.SS3.p2.3.m3.1"><semantics id="S3.SS3.p2.3.m3.1a"><mi id="S3.SS3.p2.3.m3.1.1" xref="S3.SS3.p2.3.m3.1.1.cmml">l</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.3.m3.1b"><ci id="S3.SS3.p2.3.m3.1.1.cmml" xref="S3.SS3.p2.3.m3.1.1">𝑙</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.3.m3.1c">l</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.3.m3.1d">italic_l</annotation></semantics></math>=8, with: <math alttext="D=1024" class="ltx_Math" display="inline" id="S3.SS3.p2.4.m4.1"><semantics id="S3.SS3.p2.4.m4.1a"><mrow id="S3.SS3.p2.4.m4.1.1" 
xref="S3.SS3.p2.4.m4.1.1.cmml"><mi id="S3.SS3.p2.4.m4.1.1.2" xref="S3.SS3.p2.4.m4.1.1.2.cmml">D</mi><mo id="S3.SS3.p2.4.m4.1.1.1" xref="S3.SS3.p2.4.m4.1.1.1.cmml">=</mo><mn id="S3.SS3.p2.4.m4.1.1.3" xref="S3.SS3.p2.4.m4.1.1.3.cmml">1024</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.4.m4.1b"><apply id="S3.SS3.p2.4.m4.1.1.cmml" xref="S3.SS3.p2.4.m4.1.1"><eq id="S3.SS3.p2.4.m4.1.1.1.cmml" xref="S3.SS3.p2.4.m4.1.1.1"></eq><ci id="S3.SS3.p2.4.m4.1.1.2.cmml" xref="S3.SS3.p2.4.m4.1.1.2">𝐷</ci><cn id="S3.SS3.p2.4.m4.1.1.3.cmml" type="integer" xref="S3.SS3.p2.4.m4.1.1.3">1024</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.4.m4.1c">D=1024</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.4.m4.1d">italic_D = 1024</annotation></semantics></math> hidden states and <math alttext="N=512" class="ltx_Math" display="inline" id="S3.SS3.p2.5.m5.1"><semantics id="S3.SS3.p2.5.m5.1a"><mrow id="S3.SS3.p2.5.m5.1.1" xref="S3.SS3.p2.5.m5.1.1.cmml"><mi id="S3.SS3.p2.5.m5.1.1.2" xref="S3.SS3.p2.5.m5.1.1.2.cmml">N</mi><mo id="S3.SS3.p2.5.m5.1.1.1" xref="S3.SS3.p2.5.m5.1.1.1.cmml">=</mo><mn id="S3.SS3.p2.5.m5.1.1.3" xref="S3.SS3.p2.5.m5.1.1.3.cmml">512</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.5.m5.1b"><apply id="S3.SS3.p2.5.m5.1.1.cmml" xref="S3.SS3.p2.5.m5.1.1"><eq id="S3.SS3.p2.5.m5.1.1.1.cmml" xref="S3.SS3.p2.5.m5.1.1.1"></eq><ci id="S3.SS3.p2.5.m5.1.1.2.cmml" xref="S3.SS3.p2.5.m5.1.1.2">𝑁</ci><cn id="S3.SS3.p2.5.m5.1.1.3.cmml" type="integer" xref="S3.SS3.p2.5.m5.1.1.3">512</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.5.m5.1c">N=512</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.5.m5.1d">italic_N = 512</annotation></semantics></math> tokens per batch - resulting in 512 (=<math alttext="N" class="ltx_Math" display="inline" id="S3.SS3.p2.6.m6.1"><semantics id="S3.SS3.p2.6.m6.1a"><mi id="S3.SS3.p2.6.m6.1.1" 
xref="S3.SS3.p2.6.m6.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.6.m6.1b"><ci id="S3.SS3.p2.6.m6.1.1.cmml" xref="S3.SS3.p2.6.m6.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.6.m6.1c">N</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.6.m6.1d">italic_N</annotation></semantics></math>) GEMVs -, using 8-bit integer representation (Bit-Precision, <math alttext="BP" class="ltx_Math" display="inline" id="S3.SS3.p2.7.m7.1"><semantics id="S3.SS3.p2.7.m7.1a"><mrow id="S3.SS3.p2.7.m7.1.1" xref="S3.SS3.p2.7.m7.1.1.cmml"><mi id="S3.SS3.p2.7.m7.1.1.2" xref="S3.SS3.p2.7.m7.1.1.2.cmml">B</mi><mo id="S3.SS3.p2.7.m7.1.1.1" xref="S3.SS3.p2.7.m7.1.1.1.cmml"></mo><mi id="S3.SS3.p2.7.m7.1.1.3" xref="S3.SS3.p2.7.m7.1.1.3.cmml">P</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.7.m7.1b"><apply id="S3.SS3.p2.7.m7.1.1.cmml" xref="S3.SS3.p2.7.m7.1.1"><times id="S3.SS3.p2.7.m7.1.1.1.cmml" xref="S3.SS3.p2.7.m7.1.1.1"></times><ci id="S3.SS3.p2.7.m7.1.1.2.cmml" xref="S3.SS3.p2.7.m7.1.1.2">𝐵</ci><ci id="S3.SS3.p2.7.m7.1.1.3.cmml" xref="S3.SS3.p2.7.m7.1.1.3">𝑃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.7.m7.1c">BP</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.7.m7.1d">italic_B italic_P</annotation></semantics></math>=8) for both input and weight values. 
This necessitates <math alttext="D/l=1024/8=128" class="ltx_Math" display="inline" id="S3.SS3.p2.8.m8.1"><semantics id="S3.SS3.p2.8.m8.1a"><mrow id="S3.SS3.p2.8.m8.1.1" xref="S3.SS3.p2.8.m8.1.1.cmml"><mrow id="S3.SS3.p2.8.m8.1.1.2" xref="S3.SS3.p2.8.m8.1.1.2.cmml"><mi id="S3.SS3.p2.8.m8.1.1.2.2" xref="S3.SS3.p2.8.m8.1.1.2.2.cmml">D</mi><mo id="S3.SS3.p2.8.m8.1.1.2.1" xref="S3.SS3.p2.8.m8.1.1.2.1.cmml">/</mo><mi id="S3.SS3.p2.8.m8.1.1.2.3" xref="S3.SS3.p2.8.m8.1.1.2.3.cmml">l</mi></mrow><mo id="S3.SS3.p2.8.m8.1.1.3" xref="S3.SS3.p2.8.m8.1.1.3.cmml">=</mo><mrow id="S3.SS3.p2.8.m8.1.1.4" xref="S3.SS3.p2.8.m8.1.1.4.cmml"><mn id="S3.SS3.p2.8.m8.1.1.4.2" xref="S3.SS3.p2.8.m8.1.1.4.2.cmml">1024</mn><mo id="S3.SS3.p2.8.m8.1.1.4.1" xref="S3.SS3.p2.8.m8.1.1.4.1.cmml">/</mo><mn id="S3.SS3.p2.8.m8.1.1.4.3" xref="S3.SS3.p2.8.m8.1.1.4.3.cmml">8</mn></mrow><mo id="S3.SS3.p2.8.m8.1.1.5" xref="S3.SS3.p2.8.m8.1.1.5.cmml">=</mo><mn id="S3.SS3.p2.8.m8.1.1.6" xref="S3.SS3.p2.8.m8.1.1.6.cmml">128</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.8.m8.1b"><apply id="S3.SS3.p2.8.m8.1.1.cmml" xref="S3.SS3.p2.8.m8.1.1"><and id="S3.SS3.p2.8.m8.1.1a.cmml" xref="S3.SS3.p2.8.m8.1.1"></and><apply id="S3.SS3.p2.8.m8.1.1b.cmml" xref="S3.SS3.p2.8.m8.1.1"><eq id="S3.SS3.p2.8.m8.1.1.3.cmml" xref="S3.SS3.p2.8.m8.1.1.3"></eq><apply id="S3.SS3.p2.8.m8.1.1.2.cmml" xref="S3.SS3.p2.8.m8.1.1.2"><divide id="S3.SS3.p2.8.m8.1.1.2.1.cmml" xref="S3.SS3.p2.8.m8.1.1.2.1"></divide><ci id="S3.SS3.p2.8.m8.1.1.2.2.cmml" xref="S3.SS3.p2.8.m8.1.1.2.2">𝐷</ci><ci id="S3.SS3.p2.8.m8.1.1.2.3.cmml" xref="S3.SS3.p2.8.m8.1.1.2.3">𝑙</ci></apply><apply id="S3.SS3.p2.8.m8.1.1.4.cmml" xref="S3.SS3.p2.8.m8.1.1.4"><divide id="S3.SS3.p2.8.m8.1.1.4.1.cmml" xref="S3.SS3.p2.8.m8.1.1.4.1"></divide><cn id="S3.SS3.p2.8.m8.1.1.4.2.cmml" type="integer" xref="S3.SS3.p2.8.m8.1.1.4.2">1024</cn><cn id="S3.SS3.p2.8.m8.1.1.4.3.cmml" type="integer" xref="S3.SS3.p2.8.m8.1.1.4.3">8</cn></apply></apply><apply 
id="S3.SS3.p2.8.m8.1.1c.cmml" xref="S3.SS3.p2.8.m8.1.1"><eq id="S3.SS3.p2.8.m8.1.1.5.cmml" xref="S3.SS3.p2.8.m8.1.1.5"></eq><share href="https://arxiv.org/html/2411.14733v1#S3.SS3.p2.8.m8.1.1.4.cmml" id="S3.SS3.p2.8.m8.1.1d.cmml" xref="S3.SS3.p2.8.m8.1.1"></share><cn id="S3.SS3.p2.8.m8.1.1.6.cmml" type="integer" xref="S3.SS3.p2.8.m8.1.1.6">128</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.8.m8.1c">D/l=1024/8=128</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.8.m8.1d">italic_D / italic_l = 1024 / 8 = 128</annotation></semantics></math> cycles to process a GEMV for each bit-serial vector of a token. It involves partitioning each bit-serial part of the input token into 8-WL chunks to limit SAWL<math alttext="\leq" class="ltx_Math" display="inline" id="S3.SS3.p2.9.m9.1"><semantics id="S3.SS3.p2.9.m9.1a"><mo id="S3.SS3.p2.9.m9.1.1" xref="S3.SS3.p2.9.m9.1.1.cmml">≤</mo><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.9.m9.1b"><leq id="S3.SS3.p2.9.m9.1.1.cmml" xref="S3.SS3.p2.9.m9.1.1"></leq></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.9.m9.1c">\leq</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.9.m9.1d">≤</annotation></semantics></math>8. 
Subsequently, the bitwise GEMV operations of <math alttext="D/l" class="ltx_Math" display="inline" id="S3.SS3.p2.10.m10.1"><semantics id="S3.SS3.p2.10.m10.1a"><mrow id="S3.SS3.p2.10.m10.1.1" xref="S3.SS3.p2.10.m10.1.1.cmml"><mi id="S3.SS3.p2.10.m10.1.1.2" xref="S3.SS3.p2.10.m10.1.1.2.cmml">D</mi><mo id="S3.SS3.p2.10.m10.1.1.1" xref="S3.SS3.p2.10.m10.1.1.1.cmml">/</mo><mi id="S3.SS3.p2.10.m10.1.1.3" xref="S3.SS3.p2.10.m10.1.1.3.cmml">l</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.10.m10.1b"><apply id="S3.SS3.p2.10.m10.1.1.cmml" xref="S3.SS3.p2.10.m10.1.1"><divide id="S3.SS3.p2.10.m10.1.1.1.cmml" xref="S3.SS3.p2.10.m10.1.1.1"></divide><ci id="S3.SS3.p2.10.m10.1.1.2.cmml" xref="S3.SS3.p2.10.m10.1.1.2">𝐷</ci><ci id="S3.SS3.p2.10.m10.1.1.3.cmml" xref="S3.SS3.p2.10.m10.1.1.3">𝑙</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.10.m10.1c">D/l</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.10.m10.1d">italic_D / italic_l</annotation></semantics></math>(=128)-cycles-per-bit are required for <math alttext="N\times BP=512\times 8=4,096" class="ltx_Math" display="inline" id="S3.SS3.p2.11.m11.2"><semantics id="S3.SS3.p2.11.m11.2a"><mrow id="S3.SS3.p2.11.m11.2.2.1" xref="S3.SS3.p2.11.m11.2.2.2.cmml"><mrow id="S3.SS3.p2.11.m11.2.2.1.1" xref="S3.SS3.p2.11.m11.2.2.1.1.cmml"><mrow id="S3.SS3.p2.11.m11.2.2.1.1.2" xref="S3.SS3.p2.11.m11.2.2.1.1.2.cmml"><mrow id="S3.SS3.p2.11.m11.2.2.1.1.2.2" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.cmml"><mi id="S3.SS3.p2.11.m11.2.2.1.1.2.2.2" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.2.cmml">N</mi><mo id="S3.SS3.p2.11.m11.2.2.1.1.2.2.1" lspace="0.222em" rspace="0.222em" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.1.cmml">×</mo><mi id="S3.SS3.p2.11.m11.2.2.1.1.2.2.3" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.3.cmml">B</mi></mrow><mo id="S3.SS3.p2.11.m11.2.2.1.1.2.1" xref="S3.SS3.p2.11.m11.2.2.1.1.2.1.cmml"></mo><mi id="S3.SS3.p2.11.m11.2.2.1.1.2.3" 
xref="S3.SS3.p2.11.m11.2.2.1.1.2.3.cmml">P</mi></mrow><mo id="S3.SS3.p2.11.m11.2.2.1.1.3" xref="S3.SS3.p2.11.m11.2.2.1.1.3.cmml">=</mo><mrow id="S3.SS3.p2.11.m11.2.2.1.1.4" xref="S3.SS3.p2.11.m11.2.2.1.1.4.cmml"><mn id="S3.SS3.p2.11.m11.2.2.1.1.4.2" xref="S3.SS3.p2.11.m11.2.2.1.1.4.2.cmml">512</mn><mo id="S3.SS3.p2.11.m11.2.2.1.1.4.1" lspace="0.222em" rspace="0.222em" xref="S3.SS3.p2.11.m11.2.2.1.1.4.1.cmml">×</mo><mn id="S3.SS3.p2.11.m11.2.2.1.1.4.3" xref="S3.SS3.p2.11.m11.2.2.1.1.4.3.cmml">8</mn></mrow><mo id="S3.SS3.p2.11.m11.2.2.1.1.5" xref="S3.SS3.p2.11.m11.2.2.1.1.5.cmml">=</mo><mn id="S3.SS3.p2.11.m11.2.2.1.1.6" xref="S3.SS3.p2.11.m11.2.2.1.1.6.cmml">4</mn></mrow><mo id="S3.SS3.p2.11.m11.2.2.1.2" xref="S3.SS3.p2.11.m11.2.2.2a.cmml">,</mo><mn id="S3.SS3.p2.11.m11.1.1" xref="S3.SS3.p2.11.m11.1.1.cmml">096</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.11.m11.2b"><apply id="S3.SS3.p2.11.m11.2.2.2.cmml" xref="S3.SS3.p2.11.m11.2.2.1"><csymbol cd="ambiguous" id="S3.SS3.p2.11.m11.2.2.2a.cmml" xref="S3.SS3.p2.11.m11.2.2.1.2">formulae-sequence</csymbol><apply id="S3.SS3.p2.11.m11.2.2.1.1.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1"><and id="S3.SS3.p2.11.m11.2.2.1.1a.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1"></and><apply id="S3.SS3.p2.11.m11.2.2.1.1b.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1"><eq id="S3.SS3.p2.11.m11.2.2.1.1.3.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.3"></eq><apply id="S3.SS3.p2.11.m11.2.2.1.1.2.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.2"><times id="S3.SS3.p2.11.m11.2.2.1.1.2.1.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.2.1"></times><apply id="S3.SS3.p2.11.m11.2.2.1.1.2.2.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2"><times id="S3.SS3.p2.11.m11.2.2.1.1.2.2.1.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.1"></times><ci id="S3.SS3.p2.11.m11.2.2.1.1.2.2.2.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.2">𝑁</ci><ci id="S3.SS3.p2.11.m11.2.2.1.1.2.2.3.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.2.2.3">𝐵</ci></apply><ci id="S3.SS3.p2.11.m11.2.2.1.1.2.3.cmml" 
xref="S3.SS3.p2.11.m11.2.2.1.1.2.3">𝑃</ci></apply><apply id="S3.SS3.p2.11.m11.2.2.1.1.4.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.4"><times id="S3.SS3.p2.11.m11.2.2.1.1.4.1.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.4.1"></times><cn id="S3.SS3.p2.11.m11.2.2.1.1.4.2.cmml" type="integer" xref="S3.SS3.p2.11.m11.2.2.1.1.4.2">512</cn><cn id="S3.SS3.p2.11.m11.2.2.1.1.4.3.cmml" type="integer" xref="S3.SS3.p2.11.m11.2.2.1.1.4.3">8</cn></apply></apply><apply id="S3.SS3.p2.11.m11.2.2.1.1c.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1"><eq id="S3.SS3.p2.11.m11.2.2.1.1.5.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1.5"></eq><share href="https://arxiv.org/html/2411.14733v1#S3.SS3.p2.11.m11.2.2.1.1.4.cmml" id="S3.SS3.p2.11.m11.2.2.1.1d.cmml" xref="S3.SS3.p2.11.m11.2.2.1.1"></share><cn id="S3.SS3.p2.11.m11.2.2.1.1.6.cmml" type="integer" xref="S3.SS3.p2.11.m11.2.2.1.1.6">4</cn></apply></apply><cn id="S3.SS3.p2.11.m11.1.1.cmml" type="integer" xref="S3.SS3.p2.11.m11.1.1">096</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.11.m11.2c">N\times BP=512\times 8=4,096</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.11.m11.2d">italic_N × italic_B italic_P = 512 × 8 = 4 , 096</annotation></semantics></math> times to process every GEMVs of bit-vectors and tokens composing the aforementioned GEMM, culminating in <math alttext="(D/l)\times N\times BP=524,288" class="ltx_Math" display="inline" id="S3.SS3.p2.12.m12.3"><semantics id="S3.SS3.p2.12.m12.3a"><mrow id="S3.SS3.p2.12.m12.3.3" xref="S3.SS3.p2.12.m12.3.3.cmml"><mrow id="S3.SS3.p2.12.m12.3.3.1" xref="S3.SS3.p2.12.m12.3.3.1.cmml"><mrow id="S3.SS3.p2.12.m12.3.3.1.1" xref="S3.SS3.p2.12.m12.3.3.1.1.cmml"><mrow id="S3.SS3.p2.12.m12.3.3.1.1.1.1" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.cmml"><mo id="S3.SS3.p2.12.m12.3.3.1.1.1.1.2" stretchy="false" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.cmml">(</mo><mrow id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.cmml"><mi id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.2" 
xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.2.cmml">D</mi><mo id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.1" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.1.cmml">/</mo><mi id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.3" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.3.cmml">l</mi></mrow><mo id="S3.SS3.p2.12.m12.3.3.1.1.1.1.3" rspace="0.055em" stretchy="false" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.cmml">)</mo></mrow><mo id="S3.SS3.p2.12.m12.3.3.1.1.2" rspace="0.222em" xref="S3.SS3.p2.12.m12.3.3.1.1.2.cmml">×</mo><mi id="S3.SS3.p2.12.m12.3.3.1.1.3" xref="S3.SS3.p2.12.m12.3.3.1.1.3.cmml">N</mi><mo id="S3.SS3.p2.12.m12.3.3.1.1.2a" lspace="0.222em" rspace="0.222em" xref="S3.SS3.p2.12.m12.3.3.1.1.2.cmml">×</mo><mi id="S3.SS3.p2.12.m12.3.3.1.1.4" xref="S3.SS3.p2.12.m12.3.3.1.1.4.cmml">B</mi></mrow><mo id="S3.SS3.p2.12.m12.3.3.1.2" xref="S3.SS3.p2.12.m12.3.3.1.2.cmml"></mo><mi id="S3.SS3.p2.12.m12.3.3.1.3" xref="S3.SS3.p2.12.m12.3.3.1.3.cmml">P</mi></mrow><mo id="S3.SS3.p2.12.m12.3.3.2" xref="S3.SS3.p2.12.m12.3.3.2.cmml">=</mo><mrow id="S3.SS3.p2.12.m12.3.3.3.2" xref="S3.SS3.p2.12.m12.3.3.3.1.cmml"><mn id="S3.SS3.p2.12.m12.1.1" xref="S3.SS3.p2.12.m12.1.1.cmml">524</mn><mo id="S3.SS3.p2.12.m12.3.3.3.2.1" xref="S3.SS3.p2.12.m12.3.3.3.1.cmml">,</mo><mn id="S3.SS3.p2.12.m12.2.2" xref="S3.SS3.p2.12.m12.2.2.cmml">288</mn></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.12.m12.3b"><apply id="S3.SS3.p2.12.m12.3.3.cmml" xref="S3.SS3.p2.12.m12.3.3"><eq id="S3.SS3.p2.12.m12.3.3.2.cmml" xref="S3.SS3.p2.12.m12.3.3.2"></eq><apply id="S3.SS3.p2.12.m12.3.3.1.cmml" xref="S3.SS3.p2.12.m12.3.3.1"><times id="S3.SS3.p2.12.m12.3.3.1.2.cmml" xref="S3.SS3.p2.12.m12.3.3.1.2"></times><apply id="S3.SS3.p2.12.m12.3.3.1.1.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1"><times id="S3.SS3.p2.12.m12.3.3.1.1.2.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1.2"></times><apply id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1"><divide id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.1.cmml" 
xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.1"></divide><ci id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.2.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.2">𝐷</ci><ci id="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.3.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1.1.1.1.3">𝑙</ci></apply><ci id="S3.SS3.p2.12.m12.3.3.1.1.3.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1.3">𝑁</ci><ci id="S3.SS3.p2.12.m12.3.3.1.1.4.cmml" xref="S3.SS3.p2.12.m12.3.3.1.1.4">𝐵</ci></apply><ci id="S3.SS3.p2.12.m12.3.3.1.3.cmml" xref="S3.SS3.p2.12.m12.3.3.1.3">𝑃</ci></apply><list id="S3.SS3.p2.12.m12.3.3.3.1.cmml" xref="S3.SS3.p2.12.m12.3.3.3.2"><cn id="S3.SS3.p2.12.m12.1.1.cmml" type="integer" xref="S3.SS3.p2.12.m12.1.1">524</cn><cn id="S3.SS3.p2.12.m12.2.2.cmml" type="integer" xref="S3.SS3.p2.12.m12.2.2">288</cn></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.12.m12.3c">(D/l)\times N\times BP=524,288</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.12.m12.3d">( italic_D / italic_l ) × italic_N × italic_B italic_P = 524 , 288</annotation></semantics></math> cycles to process a GEMM of an embedding input with static weight matrices.</p> </div> <div class="ltx_para" id="S3.SS3.p3"> <p class="ltx_p" id="S3.SS3.p3.1">Assuming 10ns per bitwise GEMV, which is a fast figure for an AMS-PiM array, a GEMM operation for generating Q, K, or V would take approximately 0.5 milliseconds (ms). Other steps involving activation-activation GEMV with quadratic computational complexity and nonlinear layers may induce substantially greater latency, rendering the PiM device noncompetitive due to poor latency performance. 
Consequently, the latency for processing only the linear projections across the attention layers in tens of encoder blocks could reach the order of 100 ms.</p> </div> <div class="ltx_para" id="S3.SS3.p4"> <p class="ltx_p" id="S3.SS3.p4.1">To address the latency issue, we take advantage of the high “bitwise” sparsity found in the activation values of bitwise GEMVs in the attention layer. Our findings suggest that collecting only the “1”s along a bitwise embedding input can significantly accelerate GEMV operations, and hence the GEMMs composed of them. Our investigation, visualized in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F7" title="Figure 7 ‣ 3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">7</span></a>-(e) and (f), shows that when split into bitwise elements (Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F7" title="Figure 7 ‣ 3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">7</span></a>-(a)), the activation matrices exhibit much higher sparsity than when measured in coarser value grains (Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F7" title="Figure 7 ‣ 3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">7</span></a>-(b) <math alttext="\sim" class="ltx_Math" display="inline" id="S3.SS3.p4.1.m1.1"><semantics id="S3.SS3.p4.1.m1.1a"><mo id="S3.SS3.p4.1.m1.1.1" xref="S3.SS3.p4.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S3.SS3.p4.1.m1.1b"><csymbol cd="latexml" id="S3.SS3.p4.1.m1.1.1.cmml" xref="S3.SS3.p4.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p4.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p4.1.m1.1d">∼</annotation></semantics></math> (d)). With our findings, we propose the BitSift-GEMV technique, where all bitwise “0”s of bit-serial-fetched input activations are skipped from GEMV within AMS-PiM arrays.</p> </div> <div class="ltx_para" id="S3.SS3.p5"> <p class="ltx_p" id="S3.SS3.p5.1">In section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS4" title="4.4. BitSift-GEMV Processor ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.4</span></a>, we propose a hardware-oriented technique that designates the longest combination(s) of parsed input slices, incorporating a fixed, desired number of “1”s. By parsing and merging the bit-serial-fetched token vectors, we can significantly extend the length of a processible input vector for a unit GEMV. Furthermore, our proposed technique maximizes the utilization of the designed ADCs and stabilizes their operation by limiting flexibility.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4.
</span>FLARE Architecture</h2> <figure class="ltx_figure" id="S4.F8"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="354" id="S4.F8.g1" src="x7.png" width="838"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 8. </span>(a) Overall FLARE Architecture. (b) Hardware configurations of each FLARE PE. </figcaption> </figure> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">Building upon our findings and discussions, we present our FLARE architecture — an FP- and division-less, end-to-end attention accelerator based on MRAM-SRAM hybrid AMS-PiM with low-ENOB ADCs. The architecture overview is visualized in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F8" title="Figure 8 ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">8</span></a>. Our proposed architecture ensures accurate, reliable, efficient, and fast on-device attention-layer computations and is validated at the post-layout level using a 28nm FD-SOI process.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1. </span>MRAM-SRAM Hybrid AMS-PiM Design</h3> <figure class="ltx_figure" id="S4.F9"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="836" id="S4.F9.g1" src="x8.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 9. 
</span>(a) Attribute and size requirements of each memory type, and (b) Monte-Carlo-simulation-guided SAWL selection for lossless computations.</figcaption> </figure> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1">The foundation of our system, the MRAM-SRAM hybrid AMS-PiM design, exploits the intrinsic advantages of each memory for two distinct types of GEMVs, activation-weight (A-W) and activation-activation (A-A), during BLAS acceleration, which is the DNN’s main workload and the PiM’s core task. The MRAM-PiM is well-suited to A-W-pair GEMVs, owing to its non-volatility, which allows for significant weight reuse. Furthermore, MRAM maximizes array-level parallelism with a small memory cell size (<math alttext="\sim" class="ltx_Math" display="inline" id="S4.SS1.p1.1.m1.1"><semantics id="S4.SS1.p1.1.m1.1a"><mo id="S4.SS1.p1.1.m1.1.1" xref="S4.SS1.p1.1.m1.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.1.m1.1b"><csymbol cd="latexml" id="S4.SS1.p1.1.m1.1.1.cmml" xref="S4.SS1.p1.1.m1.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.1.m1.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.1.m1.1d">∼</annotation></semantics></math> 1/3 of SRAM). On the other hand, A-A-pair GEMVs, which require both operands to be dynamic, are handled using SRAM-PiM devices. SRAM offers the advantages of low write energy and high write speed compared to MRAM, making it ideal for A-A GEMVs.</p> </div> <div class="ltx_para" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1">The size and number of arrays are configured to fully encompass all parameters and computations of the self-attention layers, which is a fundamental requirement for achieving the end-to-end acceleration of the target model. The following paragraphs and Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F9" title="Figure 9 ‣ 4.1. MRAM-SRAM Hybrid AMS-PiM Design ‣ 4.
FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">9</span></a>-(a) outline the minimum array configuration requirements for implementing our FLARE architecture, while our specific design choices are summarized in TABLE <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.T1" title="Table 1 ‣ 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>.</p> </div> <div class="ltx_para" id="S4.SS1.p3"> <p class="ltx_p" id="S4.SS1.p3.3">First, the MRAM-CiM arrays for A-W GEMVs should store D<sup class="ltx_sup" id="S4.SS1.p3.3.1">2</sup> weight elements for each of the query-, key-, value-, and final-output projection layers. For each array, the total number of WLs must be larger than the hidden dimension <math alttext="D" class="ltx_Math" display="inline" id="S4.SS1.p3.2.m2.1"><semantics id="S4.SS1.p3.2.m2.1a"><mi id="S4.SS1.p3.2.m2.1.1" xref="S4.SS1.p3.2.m2.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p3.2.m2.1b"><ci id="S4.SS1.p3.2.m2.1.1.cmml" xref="S4.SS1.p3.2.m2.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p3.2.m2.1c">D</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p3.2.m2.1d">italic_D</annotation></semantics></math> so that the input tokens can be fetched fully in parallel, and the total number of BLs should be larger than <math alttext="D\times WBP" class="ltx_Math" display="inline" id="S4.SS1.p3.3.m3.1"><semantics id="S4.SS1.p3.3.m3.1a"><mrow id="S4.SS1.p3.3.m3.1.1" xref="S4.SS1.p3.3.m3.1.1.cmml"><mrow id="S4.SS1.p3.3.m3.1.1.2" xref="S4.SS1.p3.3.m3.1.1.2.cmml"><mi id="S4.SS1.p3.3.m3.1.1.2.2" xref="S4.SS1.p3.3.m3.1.1.2.2.cmml">D</mi><mo id="S4.SS1.p3.3.m3.1.1.2.1" lspace="0.222em" rspace="0.222em"
xref="S4.SS1.p3.3.m3.1.1.2.1.cmml">×</mo><mi id="S4.SS1.p3.3.m3.1.1.2.3" xref="S4.SS1.p3.3.m3.1.1.2.3.cmml">W</mi></mrow><mo id="S4.SS1.p3.3.m3.1.1.1" xref="S4.SS1.p3.3.m3.1.1.1.cmml"></mo><mi id="S4.SS1.p3.3.m3.1.1.3" xref="S4.SS1.p3.3.m3.1.1.3.cmml">B</mi><mo id="S4.SS1.p3.3.m3.1.1.1a" xref="S4.SS1.p3.3.m3.1.1.1.cmml"></mo><mi id="S4.SS1.p3.3.m3.1.1.4" xref="S4.SS1.p3.3.m3.1.1.4.cmml">P</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p3.3.m3.1b"><apply id="S4.SS1.p3.3.m3.1.1.cmml" xref="S4.SS1.p3.3.m3.1.1"><times id="S4.SS1.p3.3.m3.1.1.1.cmml" xref="S4.SS1.p3.3.m3.1.1.1"></times><apply id="S4.SS1.p3.3.m3.1.1.2.cmml" xref="S4.SS1.p3.3.m3.1.1.2"><times id="S4.SS1.p3.3.m3.1.1.2.1.cmml" xref="S4.SS1.p3.3.m3.1.1.2.1"></times><ci id="S4.SS1.p3.3.m3.1.1.2.2.cmml" xref="S4.SS1.p3.3.m3.1.1.2.2">𝐷</ci><ci id="S4.SS1.p3.3.m3.1.1.2.3.cmml" xref="S4.SS1.p3.3.m3.1.1.2.3">𝑊</ci></apply><ci id="S4.SS1.p3.3.m3.1.1.3.cmml" xref="S4.SS1.p3.3.m3.1.1.3">𝐵</ci><ci id="S4.SS1.p3.3.m3.1.1.4.cmml" xref="S4.SS1.p3.3.m3.1.1.4">𝑃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p3.3.m3.1c">D\times WBP</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p3.3.m3.1d">italic_D × italic_W italic_B italic_P</annotation></semantics></math> (weight-bit-precision) to independently store and compute the multi-bit weights at independent BLs.</p> </div> <div class="ltx_para" id="S4.SS1.p4"> <p class="ltx_p" id="S4.SS1.p4.4">Second, for the SRAM-PiM array, the number of WLs should be guaranteed so that it can handle <math alttext="D/H=d_{k}" class="ltx_Math" display="inline" id="S4.SS1.p4.1.m1.1"><semantics id="S4.SS1.p4.1.m1.1a"><mrow id="S4.SS1.p4.1.m1.1.1" xref="S4.SS1.p4.1.m1.1.1.cmml"><mrow id="S4.SS1.p4.1.m1.1.1.2" xref="S4.SS1.p4.1.m1.1.1.2.cmml"><mi id="S4.SS1.p4.1.m1.1.1.2.2" xref="S4.SS1.p4.1.m1.1.1.2.2.cmml">D</mi><mo id="S4.SS1.p4.1.m1.1.1.2.1" xref="S4.SS1.p4.1.m1.1.1.2.1.cmml">/</mo><mi id="S4.SS1.p4.1.m1.1.1.2.3" 
xref="S4.SS1.p4.1.m1.1.1.2.3.cmml">H</mi></mrow><mo id="S4.SS1.p4.1.m1.1.1.1" xref="S4.SS1.p4.1.m1.1.1.1.cmml">=</mo><msub id="S4.SS1.p4.1.m1.1.1.3" xref="S4.SS1.p4.1.m1.1.1.3.cmml"><mi id="S4.SS1.p4.1.m1.1.1.3.2" xref="S4.SS1.p4.1.m1.1.1.3.2.cmml">d</mi><mi id="S4.SS1.p4.1.m1.1.1.3.3" xref="S4.SS1.p4.1.m1.1.1.3.3.cmml">k</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p4.1.m1.1b"><apply id="S4.SS1.p4.1.m1.1.1.cmml" xref="S4.SS1.p4.1.m1.1.1"><eq id="S4.SS1.p4.1.m1.1.1.1.cmml" xref="S4.SS1.p4.1.m1.1.1.1"></eq><apply id="S4.SS1.p4.1.m1.1.1.2.cmml" xref="S4.SS1.p4.1.m1.1.1.2"><divide id="S4.SS1.p4.1.m1.1.1.2.1.cmml" xref="S4.SS1.p4.1.m1.1.1.2.1"></divide><ci id="S4.SS1.p4.1.m1.1.1.2.2.cmml" xref="S4.SS1.p4.1.m1.1.1.2.2">𝐷</ci><ci id="S4.SS1.p4.1.m1.1.1.2.3.cmml" xref="S4.SS1.p4.1.m1.1.1.2.3">𝐻</ci></apply><apply id="S4.SS1.p4.1.m1.1.1.3.cmml" xref="S4.SS1.p4.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.p4.1.m1.1.1.3.1.cmml" xref="S4.SS1.p4.1.m1.1.1.3">subscript</csymbol><ci id="S4.SS1.p4.1.m1.1.1.3.2.cmml" xref="S4.SS1.p4.1.m1.1.1.3.2">𝑑</ci><ci id="S4.SS1.p4.1.m1.1.1.3.3.cmml" xref="S4.SS1.p4.1.m1.1.1.3.3">𝑘</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p4.1.m1.1c">D/H=d_{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p4.1.m1.1d">italic_D / italic_H = italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math> dimensions, while <math alttext="H" class="ltx_Math" display="inline" id="S4.SS1.p4.2.m2.1"><semantics id="S4.SS1.p4.2.m2.1a"><mi id="S4.SS1.p4.2.m2.1.1" xref="S4.SS1.p4.2.m2.1.1.cmml">H</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p4.2.m2.1b"><ci id="S4.SS1.p4.2.m2.1.1.cmml" xref="S4.SS1.p4.2.m2.1.1">𝐻</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p4.2.m2.1c">H</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p4.2.m2.1d">italic_H</annotation></semantics></math> independent arrays are 
needed to handle multi-head projections. The columns of the SRAM(s) must store <math alttext="N" class="ltx_Math" display="inline" id="S4.SS1.p4.3.m3.1"><semantics id="S4.SS1.p4.3.m3.1a"><mi id="S4.SS1.p4.3.m3.1.1" xref="S4.SS1.p4.3.m3.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p4.3.m3.1b"><ci id="S4.SS1.p4.3.m3.1.1.cmml" xref="S4.SS1.p4.3.m3.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p4.3.m3.1c">N</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p4.3.m3.1d">italic_N</annotation></semantics></math> weight features, requiring <math alttext="N\times IBP" class="ltx_Math" display="inline" id="S4.SS1.p4.4.m4.1"><semantics id="S4.SS1.p4.4.m4.1a"><mrow id="S4.SS1.p4.4.m4.1.1" xref="S4.SS1.p4.4.m4.1.1.cmml"><mrow id="S4.SS1.p4.4.m4.1.1.2" xref="S4.SS1.p4.4.m4.1.1.2.cmml"><mi id="S4.SS1.p4.4.m4.1.1.2.2" xref="S4.SS1.p4.4.m4.1.1.2.2.cmml">N</mi><mo id="S4.SS1.p4.4.m4.1.1.2.1" lspace="0.222em" rspace="0.222em" xref="S4.SS1.p4.4.m4.1.1.2.1.cmml">×</mo><mi id="S4.SS1.p4.4.m4.1.1.2.3" xref="S4.SS1.p4.4.m4.1.1.2.3.cmml">I</mi></mrow><mo id="S4.SS1.p4.4.m4.1.1.1" xref="S4.SS1.p4.4.m4.1.1.1.cmml"></mo><mi id="S4.SS1.p4.4.m4.1.1.3" xref="S4.SS1.p4.4.m4.1.1.3.cmml">B</mi><mo id="S4.SS1.p4.4.m4.1.1.1a" xref="S4.SS1.p4.4.m4.1.1.1.cmml"></mo><mi id="S4.SS1.p4.4.m4.1.1.4" xref="S4.SS1.p4.4.m4.1.1.4.cmml">P</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p4.4.m4.1b"><apply id="S4.SS1.p4.4.m4.1.1.cmml" xref="S4.SS1.p4.4.m4.1.1"><times id="S4.SS1.p4.4.m4.1.1.1.cmml" xref="S4.SS1.p4.4.m4.1.1.1"></times><apply id="S4.SS1.p4.4.m4.1.1.2.cmml" xref="S4.SS1.p4.4.m4.1.1.2"><times id="S4.SS1.p4.4.m4.1.1.2.1.cmml" xref="S4.SS1.p4.4.m4.1.1.2.1"></times><ci id="S4.SS1.p4.4.m4.1.1.2.2.cmml" xref="S4.SS1.p4.4.m4.1.1.2.2">𝑁</ci><ci id="S4.SS1.p4.4.m4.1.1.2.3.cmml" xref="S4.SS1.p4.4.m4.1.1.2.3">𝐼</ci></apply><ci id="S4.SS1.p4.4.m4.1.1.3.cmml" xref="S4.SS1.p4.4.m4.1.1.3">𝐵</ci><ci 
id="S4.SS1.p4.4.m4.1.1.4.cmml" xref="S4.SS1.p4.4.m4.1.1.4">𝑃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p4.4.m4.1c">N\times IBP</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p4.4.m4.1d">italic_N × italic_I italic_B italic_P</annotation></semantics></math> (IBP: input-activation-bit-precision) BLs.</p> </div> <div class="ltx_para" id="S4.SS1.p5"> <p class="ltx_p" id="S4.SS1.p5.2">Meanwhile, to ensure stable, error-free computation in AMS-PiM devices, we limit SAWL<math alttext="\leq" class="ltx_Math" display="inline" id="S4.SS1.p5.1.m1.1"><semantics id="S4.SS1.p5.1.m1.1a"><mo id="S4.SS1.p5.1.m1.1.1" xref="S4.SS1.p5.1.m1.1.1.cmml">≤</mo><annotation-xml encoding="MathML-Content" id="S4.SS1.p5.1.m1.1b"><leq id="S4.SS1.p5.1.m1.1.1.cmml" xref="S4.SS1.p5.1.m1.1.1"></leq></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p5.1.m1.1c">\leq</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p5.1.m1.1d">≤</annotation></semantics></math>8, which guarantees 6-<math alttext="\sigma" class="ltx_Math" display="inline" id="S4.SS1.p5.2.m2.1"><semantics id="S4.SS1.p5.2.m2.1a"><mi id="S4.SS1.p5.2.m2.1.1" xref="S4.SS1.p5.2.m2.1.1.cmml">σ</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p5.2.m2.1b"><ci id="S4.SS1.p5.2.m2.1.1.cmml" xref="S4.SS1.p5.2.m2.1.1">𝜎</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p5.2.m2.1c">\sigma</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p5.2.m2.1d">italic_σ</annotation></semantics></math> error-free computation as verified by Monte-Carlo simulation using our AMS-PiM arrays; the results are summarized in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F9" title="Figure 9 ‣ 4.1. MRAM-SRAM Hybrid AMS-PiM Design ‣ 4.
FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">9</span></a>-(b). This configuration effectively ensures lossless computation; however, to address the latency risk discussed in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.SS3" title="3.3. Massive GEMVs with Quadratic Latency Bottleneck ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">3.3</span></a>, we propose our fast, accurate, and efficient BitSift-GEMV method in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS4" title="4.4. BitSift-GEMV Processor ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.4</span></a>.</p> </div> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2. </span>Dequantization-Free PTQ Technique</h3> <figure class="ltx_figure" id="S4.F10"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_portrait" height="1277" id="S4.F10.g1" src="x9.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 10. </span>Our FP- and division-less integer quantization method, eMSB-Q. (a) Definition of eMSB for negative and positive (or unsigned) numbers. (b) eMSB-Q achieves lossless quantization at the expense of 1 more bit.
(c) When applying our eMSB-Q for the AMS-PiM array, bit-serial inputs also benefit from skipping irrelevant computations.</figcaption> </figure> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1">In this section, we propose a dequantization-free PTQ that spans all computations within the attention layer, replacing FP and division arithmetic while retaining accuracy. As previously discussed, our proposed quantization method enables lossless integer quantization using only shifting and parsing, accommodating the important attributes discussed in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3" title="3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">3</span></a>.</p> </div> <div class="ltx_para" id="S4.SS2.p2"> <p class="ltx_p" id="S4.SS2.p2.5">As discussed in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.SS1" title="3.1. Motivation for PTQ and Non-BLAS Layer Optimizations ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">3.1</span></a>, an integer quantization is “lossless” when the requirement from equation <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.E1" title="In 3.1. Motivation for PTQ and Non-BLAS Layer Optimizations ‣ 3. Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a> is satisfied, which occurs if the quantization step size is smaller than that derived from floating-point representations.
This implies that the original input can be divided by any <math alttext="2^{n}" class="ltx_Math" display="inline" id="S4.SS2.p2.1.m1.1"><semantics id="S4.SS2.p2.1.m1.1a"><msup id="S4.SS2.p2.1.m1.1.1" xref="S4.SS2.p2.1.m1.1.1.cmml"><mn id="S4.SS2.p2.1.m1.1.1.2" xref="S4.SS2.p2.1.m1.1.1.2.cmml">2</mn><mi id="S4.SS2.p2.1.m1.1.1.3" xref="S4.SS2.p2.1.m1.1.1.3.cmml">n</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.1.m1.1b"><apply id="S4.SS2.p2.1.m1.1.1.cmml" xref="S4.SS2.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS2.p2.1.m1.1.1.1.cmml" xref="S4.SS2.p2.1.m1.1.1">superscript</csymbol><cn id="S4.SS2.p2.1.m1.1.1.2.cmml" type="integer" xref="S4.SS2.p2.1.m1.1.1.2">2</cn><ci id="S4.SS2.p2.1.m1.1.1.3.cmml" xref="S4.SS2.p2.1.m1.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.1.m1.1c">2^{n}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.1.m1.1d">2 start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT</annotation></semantics></math>, as long as <math alttext="2^{n}" class="ltx_Math" display="inline" id="S4.SS2.p2.2.m2.1"><semantics id="S4.SS2.p2.2.m2.1a"><msup id="S4.SS2.p2.2.m2.1.1" xref="S4.SS2.p2.2.m2.1.1.cmml"><mn id="S4.SS2.p2.2.m2.1.1.2" xref="S4.SS2.p2.2.m2.1.1.2.cmml">2</mn><mi id="S4.SS2.p2.2.m2.1.1.3" xref="S4.SS2.p2.2.m2.1.1.3.cmml">n</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.2.m2.1b"><apply id="S4.SS2.p2.2.m2.1.1.cmml" xref="S4.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS2.p2.2.m2.1.1.1.cmml" xref="S4.SS2.p2.2.m2.1.1">superscript</csymbol><cn id="S4.SS2.p2.2.m2.1.1.2.cmml" type="integer" xref="S4.SS2.p2.2.m2.1.1.2">2</cn><ci id="S4.SS2.p2.2.m2.1.1.3.cmml" xref="S4.SS2.p2.2.m2.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.2.m2.1c">2^{n}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.2.m2.1d">2 start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT</annotation></semantics></math><math 
alttext="<" class="ltx_Math" display="inline" id="S4.SS2.p2.3.m3.1"><semantics id="S4.SS2.p2.3.m3.1a"><mo id="S4.SS2.p2.3.m3.1.1" xref="S4.SS2.p2.3.m3.1.1.cmml"><</mo><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.3.m3.1b"><lt id="S4.SS2.p2.3.m3.1.1.cmml" xref="S4.SS2.p2.3.m3.1.1"></lt></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.3.m3.1c"><</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.3.m3.1d"><</annotation></semantics></math><math alttext="\mathbf{S}" class="ltx_Math" display="inline" id="S4.SS2.p2.4.m4.1"><semantics id="S4.SS2.p2.4.m4.1a"><mi id="S4.SS2.p2.4.m4.1.1" xref="S4.SS2.p2.4.m4.1.1.cmml">𝐒</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.4.m4.1b"><ci id="S4.SS2.p2.4.m4.1.1.cmml" xref="S4.SS2.p2.4.m4.1.1">𝐒</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.4.m4.1c">\mathbf{S}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.4.m4.1d">bold_S</annotation></semantics></math>, where <math alttext="\mathbf{S}" class="ltx_Math" display="inline" id="S4.SS2.p2.5.m5.1"><semantics id="S4.SS2.p2.5.m5.1a"><mi id="S4.SS2.p2.5.m5.1.1" xref="S4.SS2.p2.5.m5.1.1.cmml">𝐒</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.5.m5.1b"><ci id="S4.SS2.p2.5.m5.1.1.cmml" xref="S4.SS2.p2.5.m5.1.1">𝐒</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.5.m5.1c">\mathbf{S}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.5.m5.1d">bold_S</annotation></semantics></math> represents the quantization step size used in FP-based quantization. Therefore, we decided to replace the dequantization and division process, as depicted and compared with the conventional method, in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F10" title="Figure 10 ‣ 4.2. Dequantization-Free PTQ Technique ‣ 4. 
FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">10</span></a>-(a) and (b). Our method, termed effective-MSB-based Quantization (eMSB-Q), ensures numerical integrity while replacing the dequantization and quantization-step acquisition process with eMSB detection, and division operations with arithmetic shifting and parsing. The eMSB-search process simply identifies the location of the actual MSB within a group of bitwise values, eliminating the need for sorting to determine the maximum “value” and its “index.”</p> </div> <div class="ltx_para" id="S4.SS2.p3"> <p class="ltx_p" id="S4.SS2.p3.4">To prevent outliers from being clipped during eMSB-Q, we apply per-token quantization for <math alttext="\mathbf{Q}" class="ltx_Math" display="inline" id="S4.SS2.p3.1.m1.1"><semantics id="S4.SS2.p3.1.m1.1a"><mi id="S4.SS2.p3.1.m1.1.1" xref="S4.SS2.p3.1.m1.1.1.cmml">𝐐</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p3.1.m1.1b"><ci id="S4.SS2.p3.1.m1.1.1.cmml" xref="S4.SS2.p3.1.m1.1.1">𝐐</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p3.1.m1.1c">\mathbf{Q}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p3.1.m1.1d">bold_Q</annotation></semantics></math>, <math alttext="\mathbf{QK}^{T}" class="ltx_Math" display="inline" id="S4.SS2.p3.2.m2.1"><semantics id="S4.SS2.p3.2.m2.1a"><msup id="S4.SS2.p3.2.m2.1.1" xref="S4.SS2.p3.2.m2.1.1.cmml"><mi id="S4.SS2.p3.2.m2.1.1.2" xref="S4.SS2.p3.2.m2.1.1.2.cmml">𝐐𝐊</mi><mi id="S4.SS2.p3.2.m2.1.1.3" xref="S4.SS2.p3.2.m2.1.1.3.cmml">T</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p3.2.m2.1b"><apply id="S4.SS2.p3.2.m2.1.1.cmml" xref="S4.SS2.p3.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS2.p3.2.m2.1.1.1.cmml" xref="S4.SS2.p3.2.m2.1.1">superscript</csymbol><ci id="S4.SS2.p3.2.m2.1.1.2.cmml" xref="S4.SS2.p3.2.m2.1.1.2">𝐐𝐊</ci><ci 
id="S4.SS2.p3.2.m2.1.1.3.cmml" xref="S4.SS2.p3.2.m2.1.1.3">𝑇</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p3.2.m2.1c">\mathbf{QK}^{T}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p3.2.m2.1d">bold_QK start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT</annotation></semantics></math>, and the final output. This approach leverages the fact that each token vector in the <math alttext="\mathbf{Q}" class="ltx_Math" display="inline" id="S4.SS2.p3.3.m3.1"><semantics id="S4.SS2.p3.3.m3.1a"><mi id="S4.SS2.p3.3.m3.1.1" xref="S4.SS2.p3.3.m3.1.1.cmml">𝐐</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p3.3.m3.1b"><ci id="S4.SS2.p3.3.m3.1.1.cmml" xref="S4.SS2.p3.3.m3.1.1">𝐐</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p3.3.m3.1c">\mathbf{Q}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p3.3.m3.1d">bold_Q</annotation></semantics></math>-matrix is independently processed throughout the self-attention mechanism. However, KV generations, which must preserve global context, involve storing all vectors of <math alttext="N" class="ltx_Math" display="inline" id="S4.SS2.p3.4.m4.1"><semantics id="S4.SS2.p3.4.m4.1a"><mi id="S4.SS2.p3.4.m4.1.1" xref="S4.SS2.p3.4.m4.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p3.4.m4.1b"><ci id="S4.SS2.p3.4.m4.1.1.cmml" xref="S4.SS2.p3.4.m4.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p3.4.m4.1c">N</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p3.4.m4.1d">italic_N</annotation></semantics></math> tokens. This calls for a different parsing strategy: we parse bits starting from the MSBs rather than the eMSBs, using a longer bit-width to reserve a wider dynamic range for the global context.</p> </div> <div class="ltx_para" id="S4.SS2.p4"> <p class="ltx_p" id="S4.SS2.p4.1">Note that eMSB-Q preserves the inherent properties of linear operations.
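To make the shift-only flow concrete, the following is a minimal software sketch of eMSB-Q-style per-token quantization; the helper names and the 8-bit output width are our illustrative assumptions, not the exact hardware datapath:

```python
# Illustrative sketch of eMSB-Q-style per-token quantization.
# Hypothetical helpers: the real design performs these steps in hardware
# with eMSB detection and arithmetic shifting, not Python integers.

def emsb(token):
    """Locate the effective MSB: the highest set bit position across a
    token vector, taken over magnitudes so signs are handled uniformly."""
    peak = max(abs(v) for v in token)
    return peak.bit_length() - 1 if peak > 0 else 0

def emsb_quantize(token, out_bits=8):
    """Quantize a token vector by arithmetic right-shift only.
    The shift amount (a power-of-two step) is derived from the eMSB
    position, so no FP step size and no division is needed."""
    shift = max(0, emsb(token) - (out_bits - 1))
    return [v >> shift for v in token], shift  # shift acts as the exponent

q, s = emsb_quantize([300, -17, 1024], out_bits=8)  # eMSB of 1024 is 10, so shift = 3
```

Here the shift amount plays the role of the quantization exponent that, as described below, can be forwarded to a nonlinear block in place of an FP scale.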
However, to preserve the distinct attributes within the nonlinear layer under our per-token quantization process, we transfer the eMSB information to the VDR-Softmax block, enabling it to incorporate exponent data during processing. This step is essential for the precise functioning of the Softmax layer, which relies on accurate exponent handling, unlike linear layers. In the following section, we further describe the integer-processed Softmax for PTQ.</p> </div> <figure class="ltx_float ltx_float_algorithm ltx_framed ltx_framed_top" id="alg1"> <figcaption class="ltx_caption" style="font-size:90%;"><span class="ltx_tag ltx_tag_float"><span class="ltx_text ltx_font_bold" id="alg1.4.1.1">Algorithm 1</span> </span> VDR-Softmax</figcaption> <div class="ltx_listing ltx_listing" id="alg1.5"> <div class="ltx_listingline" id="alg1.l1"> <span class="ltx_text" id="alg1.l1.2" style="font-size:90%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l1.3" style="font-size:90%;">Input:</span><span class="ltx_text" id="alg1.l1.4" style="font-size:90%;"> </span><math alttext="x" class="ltx_Math" display="inline" id="alg1.l1.m1.1"><semantics id="alg1.l1.m1.1a"><mi id="alg1.l1.m1.1.1" mathsize="90%" xref="alg1.l1.m1.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="alg1.l1.m1.1b"><ci id="alg1.l1.m1.1.1.cmml" xref="alg1.l1.m1.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m1.1c">x</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m1.1d">italic_x</annotation></semantics></math><span class="ltx_text" id="alg1.l1.5" style="font-size:90%;"> [Q</span><sub class="ltx_sub" id="alg1.l1.6"><span class="ltx_text ltx_font_italic" id="alg1.l1.6.1" style="font-size:90%;">I</span></sub><span class="ltx_text" id="alg1.l1.7" style="font-size:90%;">-1:0][(N-1):0], </span><span class="ltx_text ltx_font_italic" id="alg1.l1.1" style="font-size:90%;">n<sub class="ltx_sub" id="alg1.l1.1.1">e</sub></span><span class="ltx_text" id="alg1.l1.8"
style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l2"> <span class="ltx_text" id="alg1.l2.1" style="font-size:90%;"> N = number of the input elements </span> </div> <div class="ltx_listingline" id="alg1.l3"> <span class="ltx_text" id="alg1.l3.1" style="font-size:90%;"> Q</span><sub class="ltx_sub" id="alg1.l3.2"><span class="ltx_text ltx_font_italic" id="alg1.l3.2.1" style="font-size:90%;">I</span></sub><span class="ltx_text" id="alg1.l3.3" style="font-size:90%;"> = input bit precision, Q</span><sub class="ltx_sub" id="alg1.l3.4"><span class="ltx_text ltx_font_italic" id="alg1.l3.4.1" style="font-size:90%;">O</span></sub><span class="ltx_text" id="alg1.l3.5" style="font-size:90%;"> = output bit precision </span> </div> <div class="ltx_listingline" id="alg1.l4"> <span class="ltx_text" id="alg1.l4.2" style="font-size:90%;"> </span><span class="ltx_text ltx_font_italic" id="alg1.l4.1" style="font-size:90%;">n<sub class="ltx_sub" id="alg1.l4.1.1">e</sub></span><span class="ltx_text" id="alg1.l4.3" style="font-size:90%;"> = </span><math alttext="\sum" class="ltx_Math" display="inline" id="alg1.l4.m1.1"><semantics id="alg1.l4.m1.1a"><mo id="alg1.l4.m1.1.1" maxsize="90%" minsize="90%" stretchy="true" xref="alg1.l4.m1.1.1.cmml">∑</mo><annotation-xml encoding="MathML-Content" id="alg1.l4.m1.1b"><sum id="alg1.l4.m1.1.1.cmml" xref="alg1.l4.m1.1.1"></sum></annotation-xml><annotation encoding="application/x-tex" id="alg1.l4.m1.1c">\sum</annotation><annotation encoding="application/x-llamapun" id="alg1.l4.m1.1d">∑</annotation></semantics></math><span class="ltx_text" id="alg1.l4.4" style="font-size:90%;">(MSB-eMSB) over input</span><math alttext="\rightarrow" class="ltx_Math" display="inline" id="alg1.l4.m2.1"><semantics id="alg1.l4.m2.1a"><mo id="alg1.l4.m2.1.1" mathsize="90%" stretchy="false" xref="alg1.l4.m2.1.1.cmml">→</mo><annotation-xml encoding="MathML-Content" id="alg1.l4.m2.1b"><ci id="alg1.l4.m2.1.1.cmml"
xref="alg1.l4.m2.1.1">→</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l4.m2.1c">\rightarrow</annotation><annotation encoding="application/x-llamapun" id="alg1.l4.m2.1d">→</annotation></semantics></math><span class="ltx_text" id="alg1.l4.5" style="font-size:90%;">Q</span><math alttext="\rightarrow" class="ltx_Math" display="inline" id="alg1.l4.m3.1"><semantics id="alg1.l4.m3.1a"><mo id="alg1.l4.m3.1.1" mathsize="90%" stretchy="false" xref="alg1.l4.m3.1.1.cmml">→</mo><annotation-xml encoding="MathML-Content" id="alg1.l4.m3.1b"><ci id="alg1.l4.m3.1.1.cmml" xref="alg1.l4.m3.1.1">→</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l4.m3.1c">\rightarrow</annotation><annotation encoding="application/x-llamapun" id="alg1.l4.m3.1d">→</annotation></semantics></math><span class="ltx_text" id="alg1.l4.6" style="font-size:90%;">QK</span><sup class="ltx_sup" id="alg1.l4.7"><span class="ltx_text ltx_font_italic" id="alg1.l4.7.1" style="font-size:90%;">T</span></sup><span class="ltx_text" id="alg1.l4.8" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l5"> <span class="ltx_text" id="alg1.l5.1" style="font-size:90%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l5.2" style="font-size:90%;">Parameters:</span><span class="ltx_text" id="alg1.l5.3" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l6"> <span class="ltx_text" id="alg1.l6.1" style="font-size:90%;"> </span><math alttext="~{}~{}a,b,c,S,l\leftarrow a(n_{e}),b(n_{e}),c(n_{e}),S(n_{e}),l(n_{e})" class="ltx_Math" display="inline" id="alg1.l6.m1.7"><semantics id="alg1.l6.m1.7a"><mrow id="alg1.l6.m1.7.7.2" xref="alg1.l6.m1.7.7.3.cmml"><mrow id="alg1.l6.m1.6.6.1.1" xref="alg1.l6.m1.6.6.1.1.cmml"><mrow id="alg1.l6.m1.6.6.1.1.3.2" xref="alg1.l6.m1.6.6.1.1.3.1.cmml"><mi id="alg1.l6.m1.1.1" mathsize="90%" xref="alg1.l6.m1.1.1.cmml">a</mi><mo id="alg1.l6.m1.6.6.1.1.3.2.1" mathsize="90%" 
xref="alg1.l6.m1.6.6.1.1.3.1.cmml">,</mo><mi id="alg1.l6.m1.2.2" mathsize="90%" xref="alg1.l6.m1.2.2.cmml">b</mi><mo id="alg1.l6.m1.6.6.1.1.3.2.2" mathsize="90%" xref="alg1.l6.m1.6.6.1.1.3.1.cmml">,</mo><mi id="alg1.l6.m1.3.3" mathsize="90%" xref="alg1.l6.m1.3.3.cmml">c</mi><mo id="alg1.l6.m1.6.6.1.1.3.2.3" mathsize="90%" xref="alg1.l6.m1.6.6.1.1.3.1.cmml">,</mo><mi id="alg1.l6.m1.4.4" mathsize="90%" xref="alg1.l6.m1.4.4.cmml">S</mi><mo id="alg1.l6.m1.6.6.1.1.3.2.4" mathsize="90%" xref="alg1.l6.m1.6.6.1.1.3.1.cmml">,</mo><mi id="alg1.l6.m1.5.5" mathsize="90%" xref="alg1.l6.m1.5.5.cmml">l</mi></mrow><mo id="alg1.l6.m1.6.6.1.1.2" mathsize="90%" stretchy="false" xref="alg1.l6.m1.6.6.1.1.2.cmml">←</mo><mrow id="alg1.l6.m1.6.6.1.1.1" xref="alg1.l6.m1.6.6.1.1.1.cmml"><mi id="alg1.l6.m1.6.6.1.1.1.3" mathsize="90%" xref="alg1.l6.m1.6.6.1.1.1.3.cmml">a</mi><mo id="alg1.l6.m1.6.6.1.1.1.2" xref="alg1.l6.m1.6.6.1.1.1.2.cmml"></mo><mrow id="alg1.l6.m1.6.6.1.1.1.1.1" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.cmml"><mo id="alg1.l6.m1.6.6.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.cmml">(</mo><msub id="alg1.l6.m1.6.6.1.1.1.1.1.1" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.cmml"><mi id="alg1.l6.m1.6.6.1.1.1.1.1.1.2" mathsize="90%" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.2.cmml">n</mi><mi id="alg1.l6.m1.6.6.1.1.1.1.1.1.3" mathsize="90%" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.3.cmml">e</mi></msub><mo id="alg1.l6.m1.6.6.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="alg1.l6.m1.7.7.2.3" mathsize="90%" xref="alg1.l6.m1.7.7.3a.cmml">,</mo><mrow id="alg1.l6.m1.7.7.2.2.4" xref="alg1.l6.m1.7.7.2.2.5.cmml"><mrow id="alg1.l6.m1.7.7.2.2.1.1" xref="alg1.l6.m1.7.7.2.2.1.1.cmml"><mi id="alg1.l6.m1.7.7.2.2.1.1.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.1.1.3.cmml">b</mi><mo id="alg1.l6.m1.7.7.2.2.1.1.2" xref="alg1.l6.m1.7.7.2.2.1.1.2.cmml"></mo><mrow id="alg1.l6.m1.7.7.2.2.1.1.1.1" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.cmml"><mo 
id="alg1.l6.m1.7.7.2.2.1.1.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.cmml">(</mo><msub id="alg1.l6.m1.7.7.2.2.1.1.1.1.1" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.cmml"><mi id="alg1.l6.m1.7.7.2.2.1.1.1.1.1.2" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.2.cmml">n</mi><mi id="alg1.l6.m1.7.7.2.2.1.1.1.1.1.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.3.cmml">e</mi></msub><mo id="alg1.l6.m1.7.7.2.2.1.1.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="alg1.l6.m1.7.7.2.2.4.5" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.5.cmml">,</mo><mrow id="alg1.l6.m1.7.7.2.2.2.2" xref="alg1.l6.m1.7.7.2.2.2.2.cmml"><mi id="alg1.l6.m1.7.7.2.2.2.2.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.2.2.3.cmml">c</mi><mo id="alg1.l6.m1.7.7.2.2.2.2.2" xref="alg1.l6.m1.7.7.2.2.2.2.2.cmml"></mo><mrow id="alg1.l6.m1.7.7.2.2.2.2.1.1" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.cmml"><mo id="alg1.l6.m1.7.7.2.2.2.2.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.cmml">(</mo><msub id="alg1.l6.m1.7.7.2.2.2.2.1.1.1" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.cmml"><mi id="alg1.l6.m1.7.7.2.2.2.2.1.1.1.2" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.2.cmml">n</mi><mi id="alg1.l6.m1.7.7.2.2.2.2.1.1.1.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.3.cmml">e</mi></msub><mo id="alg1.l6.m1.7.7.2.2.2.2.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.cmml">)</mo></mrow></mrow><mo id="alg1.l6.m1.7.7.2.2.4.6" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.5.cmml">,</mo><mrow id="alg1.l6.m1.7.7.2.2.3.3" xref="alg1.l6.m1.7.7.2.2.3.3.cmml"><mi id="alg1.l6.m1.7.7.2.2.3.3.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.3.3.3.cmml">S</mi><mo id="alg1.l6.m1.7.7.2.2.3.3.2" xref="alg1.l6.m1.7.7.2.2.3.3.2.cmml"></mo><mrow id="alg1.l6.m1.7.7.2.2.3.3.1.1" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.cmml"><mo id="alg1.l6.m1.7.7.2.2.3.3.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.cmml">(</mo><msub 
id="alg1.l6.m1.7.7.2.2.3.3.1.1.1" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.cmml"><mi id="alg1.l6.m1.7.7.2.2.3.3.1.1.1.2" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.2.cmml">n</mi><mi id="alg1.l6.m1.7.7.2.2.3.3.1.1.1.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.3.cmml">e</mi></msub><mo id="alg1.l6.m1.7.7.2.2.3.3.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.cmml">)</mo></mrow></mrow><mo id="alg1.l6.m1.7.7.2.2.4.7" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.5.cmml">,</mo><mrow id="alg1.l6.m1.7.7.2.2.4.4" xref="alg1.l6.m1.7.7.2.2.4.4.cmml"><mi id="alg1.l6.m1.7.7.2.2.4.4.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.4.4.3.cmml">l</mi><mo id="alg1.l6.m1.7.7.2.2.4.4.2" xref="alg1.l6.m1.7.7.2.2.4.4.2.cmml"></mo><mrow id="alg1.l6.m1.7.7.2.2.4.4.1.1" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.cmml"><mo id="alg1.l6.m1.7.7.2.2.4.4.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.cmml">(</mo><msub id="alg1.l6.m1.7.7.2.2.4.4.1.1.1" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.cmml"><mi id="alg1.l6.m1.7.7.2.2.4.4.1.1.1.2" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.2.cmml">n</mi><mi id="alg1.l6.m1.7.7.2.2.4.4.1.1.1.3" mathsize="90%" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.3.cmml">e</mi></msub><mo id="alg1.l6.m1.7.7.2.2.4.4.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l6.m1.7b"><apply id="alg1.l6.m1.7.7.3.cmml" xref="alg1.l6.m1.7.7.2"><csymbol cd="ambiguous" id="alg1.l6.m1.7.7.3a.cmml" xref="alg1.l6.m1.7.7.2.3">formulae-sequence</csymbol><apply id="alg1.l6.m1.6.6.1.1.cmml" xref="alg1.l6.m1.6.6.1.1"><ci id="alg1.l6.m1.6.6.1.1.2.cmml" xref="alg1.l6.m1.6.6.1.1.2">←</ci><list id="alg1.l6.m1.6.6.1.1.3.1.cmml" xref="alg1.l6.m1.6.6.1.1.3.2"><ci id="alg1.l6.m1.1.1.cmml" xref="alg1.l6.m1.1.1">𝑎</ci><ci id="alg1.l6.m1.2.2.cmml" xref="alg1.l6.m1.2.2">𝑏</ci><ci id="alg1.l6.m1.3.3.cmml" xref="alg1.l6.m1.3.3">𝑐</ci><ci id="alg1.l6.m1.4.4.cmml" 
xref="alg1.l6.m1.4.4">𝑆</ci><ci id="alg1.l6.m1.5.5.cmml" xref="alg1.l6.m1.5.5">𝑙</ci></list><apply id="alg1.l6.m1.6.6.1.1.1.cmml" xref="alg1.l6.m1.6.6.1.1.1"><times id="alg1.l6.m1.6.6.1.1.1.2.cmml" xref="alg1.l6.m1.6.6.1.1.1.2"></times><ci id="alg1.l6.m1.6.6.1.1.1.3.cmml" xref="alg1.l6.m1.6.6.1.1.1.3">𝑎</ci><apply id="alg1.l6.m1.6.6.1.1.1.1.1.1.cmml" xref="alg1.l6.m1.6.6.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l6.m1.6.6.1.1.1.1.1.1.1.cmml" xref="alg1.l6.m1.6.6.1.1.1.1.1">subscript</csymbol><ci id="alg1.l6.m1.6.6.1.1.1.1.1.1.2.cmml" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.2">𝑛</ci><ci id="alg1.l6.m1.6.6.1.1.1.1.1.1.3.cmml" xref="alg1.l6.m1.6.6.1.1.1.1.1.1.3">𝑒</ci></apply></apply></apply><list id="alg1.l6.m1.7.7.2.2.5.cmml" xref="alg1.l6.m1.7.7.2.2.4"><apply id="alg1.l6.m1.7.7.2.2.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.1.1"><times id="alg1.l6.m1.7.7.2.2.1.1.2.cmml" xref="alg1.l6.m1.7.7.2.2.1.1.2"></times><ci id="alg1.l6.m1.7.7.2.2.1.1.3.cmml" xref="alg1.l6.m1.7.7.2.2.1.1.3">𝑏</ci><apply id="alg1.l6.m1.7.7.2.2.1.1.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l6.m1.7.7.2.2.1.1.1.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.1.1.1.1">subscript</csymbol><ci id="alg1.l6.m1.7.7.2.2.1.1.1.1.1.2.cmml" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.2">𝑛</ci><ci id="alg1.l6.m1.7.7.2.2.1.1.1.1.1.3.cmml" xref="alg1.l6.m1.7.7.2.2.1.1.1.1.1.3">𝑒</ci></apply></apply><apply id="alg1.l6.m1.7.7.2.2.2.2.cmml" xref="alg1.l6.m1.7.7.2.2.2.2"><times id="alg1.l6.m1.7.7.2.2.2.2.2.cmml" xref="alg1.l6.m1.7.7.2.2.2.2.2"></times><ci id="alg1.l6.m1.7.7.2.2.2.2.3.cmml" xref="alg1.l6.m1.7.7.2.2.2.2.3">𝑐</ci><apply id="alg1.l6.m1.7.7.2.2.2.2.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.2.2.1.1"><csymbol cd="ambiguous" id="alg1.l6.m1.7.7.2.2.2.2.1.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.2.2.1.1">subscript</csymbol><ci id="alg1.l6.m1.7.7.2.2.2.2.1.1.1.2.cmml" xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.2">𝑛</ci><ci id="alg1.l6.m1.7.7.2.2.2.2.1.1.1.3.cmml" 
xref="alg1.l6.m1.7.7.2.2.2.2.1.1.1.3">𝑒</ci></apply></apply><apply id="alg1.l6.m1.7.7.2.2.3.3.cmml" xref="alg1.l6.m1.7.7.2.2.3.3"><times id="alg1.l6.m1.7.7.2.2.3.3.2.cmml" xref="alg1.l6.m1.7.7.2.2.3.3.2"></times><ci id="alg1.l6.m1.7.7.2.2.3.3.3.cmml" xref="alg1.l6.m1.7.7.2.2.3.3.3">𝑆</ci><apply id="alg1.l6.m1.7.7.2.2.3.3.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.3.3.1.1"><csymbol cd="ambiguous" id="alg1.l6.m1.7.7.2.2.3.3.1.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.3.3.1.1">subscript</csymbol><ci id="alg1.l6.m1.7.7.2.2.3.3.1.1.1.2.cmml" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.2">𝑛</ci><ci id="alg1.l6.m1.7.7.2.2.3.3.1.1.1.3.cmml" xref="alg1.l6.m1.7.7.2.2.3.3.1.1.1.3">𝑒</ci></apply></apply><apply id="alg1.l6.m1.7.7.2.2.4.4.cmml" xref="alg1.l6.m1.7.7.2.2.4.4"><times id="alg1.l6.m1.7.7.2.2.4.4.2.cmml" xref="alg1.l6.m1.7.7.2.2.4.4.2"></times><ci id="alg1.l6.m1.7.7.2.2.4.4.3.cmml" xref="alg1.l6.m1.7.7.2.2.4.4.3">𝑙</ci><apply id="alg1.l6.m1.7.7.2.2.4.4.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.4.4.1.1"><csymbol cd="ambiguous" id="alg1.l6.m1.7.7.2.2.4.4.1.1.1.1.cmml" xref="alg1.l6.m1.7.7.2.2.4.4.1.1">subscript</csymbol><ci id="alg1.l6.m1.7.7.2.2.4.4.1.1.1.2.cmml" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.2">𝑛</ci><ci id="alg1.l6.m1.7.7.2.2.4.4.1.1.1.3.cmml" xref="alg1.l6.m1.7.7.2.2.4.4.1.1.1.3">𝑒</ci></apply></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l6.m1.7c">~{}~{}a,b,c,S,l\leftarrow a(n_{e}),b(n_{e}),c(n_{e}),S(n_{e}),l(n_{e})</annotation><annotation encoding="application/x-llamapun" id="alg1.l6.m1.7d">italic_a , italic_b , italic_c , italic_S , italic_l ← italic_a ( italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , italic_b ( italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , italic_c ( italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , italic_S ( italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , italic_l ( italic_n start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT )</annotation></semantics></math><span 
class="ltx_text" id="alg1.l6.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l7"> <span class="ltx_text" id="alg1.l7.1" style="font-size:90%;"> </span><math alttext="~{}~{}\Rightarrow\text{once for }Q_{I},\text{incorporates exponent information}." class="ltx_Math" display="inline" id="alg1.l7.m1.2"><semantics id="alg1.l7.m1.2a"><mrow id="alg1.l7.m1.2.2.1" xref="alg1.l7.m1.2.2.1.1.cmml"><mrow id="alg1.l7.m1.2.2.1.1" xref="alg1.l7.m1.2.2.1.1.cmml"><mi id="alg1.l7.m1.2.2.1.1.3" xref="alg1.l7.m1.2.2.1.1.3.cmml"></mi><mo id="alg1.l7.m1.2.2.1.1.2" lspace="0.878em" mathsize="90%" stretchy="false" xref="alg1.l7.m1.2.2.1.1.2.cmml">⇒</mo><mrow id="alg1.l7.m1.2.2.1.1.1.1" xref="alg1.l7.m1.2.2.1.1.1.2.cmml"><mrow id="alg1.l7.m1.2.2.1.1.1.1.1" xref="alg1.l7.m1.2.2.1.1.1.1.1.cmml"><mtext id="alg1.l7.m1.2.2.1.1.1.1.1.2" mathsize="90%" xref="alg1.l7.m1.2.2.1.1.1.1.1.2a.cmml">once for </mtext><mo id="alg1.l7.m1.2.2.1.1.1.1.1.1" xref="alg1.l7.m1.2.2.1.1.1.1.1.1.cmml"></mo><msub id="alg1.l7.m1.2.2.1.1.1.1.1.3" xref="alg1.l7.m1.2.2.1.1.1.1.1.3.cmml"><mi id="alg1.l7.m1.2.2.1.1.1.1.1.3.2" mathsize="90%" xref="alg1.l7.m1.2.2.1.1.1.1.1.3.2.cmml">Q</mi><mi id="alg1.l7.m1.2.2.1.1.1.1.1.3.3" mathsize="90%" xref="alg1.l7.m1.2.2.1.1.1.1.1.3.3.cmml">I</mi></msub></mrow><mo id="alg1.l7.m1.2.2.1.1.1.1.2" mathsize="90%" xref="alg1.l7.m1.2.2.1.1.1.2.cmml">,</mo><mtext id="alg1.l7.m1.1.1" mathsize="90%" xref="alg1.l7.m1.1.1a.cmml">incorporates exponent information</mtext></mrow></mrow><mo id="alg1.l7.m1.2.2.1.2" lspace="0em" mathsize="90%" xref="alg1.l7.m1.2.2.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="alg1.l7.m1.2b"><apply id="alg1.l7.m1.2.2.1.1.cmml" xref="alg1.l7.m1.2.2.1"><ci id="alg1.l7.m1.2.2.1.1.2.cmml" xref="alg1.l7.m1.2.2.1.1.2">⇒</ci><csymbol cd="latexml" id="alg1.l7.m1.2.2.1.1.3.cmml" xref="alg1.l7.m1.2.2.1.1.3">absent</csymbol><list id="alg1.l7.m1.2.2.1.1.1.2.cmml" xref="alg1.l7.m1.2.2.1.1.1.1"><apply 
id="alg1.l7.m1.2.2.1.1.1.1.1.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1"><times id="alg1.l7.m1.2.2.1.1.1.1.1.1.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1.1"></times><ci id="alg1.l7.m1.2.2.1.1.1.1.1.2a.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1.2"><mtext id="alg1.l7.m1.2.2.1.1.1.1.1.2.cmml" mathsize="90%" xref="alg1.l7.m1.2.2.1.1.1.1.1.2">once for </mtext></ci><apply id="alg1.l7.m1.2.2.1.1.1.1.1.3.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1.3"><csymbol cd="ambiguous" id="alg1.l7.m1.2.2.1.1.1.1.1.3.1.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1.3">subscript</csymbol><ci id="alg1.l7.m1.2.2.1.1.1.1.1.3.2.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1.3.2">𝑄</ci><ci id="alg1.l7.m1.2.2.1.1.1.1.1.3.3.cmml" xref="alg1.l7.m1.2.2.1.1.1.1.1.3.3">𝐼</ci></apply></apply><ci id="alg1.l7.m1.1.1a.cmml" xref="alg1.l7.m1.1.1"><mtext id="alg1.l7.m1.1.1.cmml" mathsize="90%" xref="alg1.l7.m1.1.1">incorporates exponent information</mtext></ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l7.m1.2c">~{}~{}\Rightarrow\text{once for }Q_{I},\text{incorporates exponent information}.</annotation><annotation encoding="application/x-llamapun" id="alg1.l7.m1.2d">⇒ once for italic_Q start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT , incorporates exponent information .</annotation></semantics></math><span class="ltx_text" id="alg1.l7.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l8"> <span class="ltx_text" id="alg1.l8.1" style="font-size:90%;"> </span><math alttext="~{}~{}\Rightarrow\text{LUT-processed},100\text{ bytes}" class="ltx_Math" display="inline" id="alg1.l8.m1.2"><semantics id="alg1.l8.m1.2a"><mrow id="alg1.l8.m1.2.2" xref="alg1.l8.m1.2.2.cmml"><mi id="alg1.l8.m1.2.2.3" xref="alg1.l8.m1.2.2.3.cmml"></mi><mo id="alg1.l8.m1.2.2.2" lspace="0.878em" mathsize="90%" stretchy="false" xref="alg1.l8.m1.2.2.2.cmml">⇒</mo><mrow id="alg1.l8.m1.2.2.1.1" xref="alg1.l8.m1.2.2.1.2.cmml"><mtext id="alg1.l8.m1.1.1" mathsize="90%" xref="alg1.l8.m1.1.1a.cmml">LUT-processed</mtext><mo 
id="alg1.l8.m1.2.2.1.1.2" mathsize="90%" xref="alg1.l8.m1.2.2.1.2.cmml">,</mo><mrow id="alg1.l8.m1.2.2.1.1.1" xref="alg1.l8.m1.2.2.1.1.1.cmml"><mn id="alg1.l8.m1.2.2.1.1.1.2" mathsize="90%" xref="alg1.l8.m1.2.2.1.1.1.2.cmml">100</mn><mo id="alg1.l8.m1.2.2.1.1.1.1" xref="alg1.l8.m1.2.2.1.1.1.1.cmml"></mo><mtext id="alg1.l8.m1.2.2.1.1.1.3" mathsize="90%" xref="alg1.l8.m1.2.2.1.1.1.3a.cmml"> bytes</mtext></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l8.m1.2b"><apply id="alg1.l8.m1.2.2.cmml" xref="alg1.l8.m1.2.2"><ci id="alg1.l8.m1.2.2.2.cmml" xref="alg1.l8.m1.2.2.2">⇒</ci><csymbol cd="latexml" id="alg1.l8.m1.2.2.3.cmml" xref="alg1.l8.m1.2.2.3">absent</csymbol><list id="alg1.l8.m1.2.2.1.2.cmml" xref="alg1.l8.m1.2.2.1.1"><ci id="alg1.l8.m1.1.1a.cmml" xref="alg1.l8.m1.1.1"><mtext id="alg1.l8.m1.1.1.cmml" mathsize="90%" xref="alg1.l8.m1.1.1">LUT-processed</mtext></ci><apply id="alg1.l8.m1.2.2.1.1.1.cmml" xref="alg1.l8.m1.2.2.1.1.1"><times id="alg1.l8.m1.2.2.1.1.1.1.cmml" xref="alg1.l8.m1.2.2.1.1.1.1"></times><cn id="alg1.l8.m1.2.2.1.1.1.2.cmml" type="integer" xref="alg1.l8.m1.2.2.1.1.1.2">100</cn><ci id="alg1.l8.m1.2.2.1.1.1.3a.cmml" xref="alg1.l8.m1.2.2.1.1.1.3"><mtext id="alg1.l8.m1.2.2.1.1.1.3.cmml" mathsize="90%" xref="alg1.l8.m1.2.2.1.1.1.3"> bytes</mtext></ci></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l8.m1.2c">~{}~{}\Rightarrow\text{LUT-processed},100\text{ bytes}</annotation><annotation encoding="application/x-llamapun" id="alg1.l8.m1.2d">⇒ LUT-processed , 100 bytes</annotation></semantics></math><span class="ltx_text" id="alg1.l8.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l9"> <span class="ltx_text" id="alg1.l9.1" style="font-size:90%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l9.2" style="font-size:90%;">Output:</span><span class="ltx_text" id="alg1.l9.3" style="font-size:90%;"> </span><math alttext="x_{Softmax}" class="ltx_Math" 
display="inline" id="alg1.l9.m1.1"><semantics id="alg1.l9.m1.1a"><msub id="alg1.l9.m1.1.1" xref="alg1.l9.m1.1.1.cmml"><mi id="alg1.l9.m1.1.1.2" mathsize="90%" xref="alg1.l9.m1.1.1.2.cmml">x</mi><mrow id="alg1.l9.m1.1.1.3" xref="alg1.l9.m1.1.1.3.cmml"><mi id="alg1.l9.m1.1.1.3.2" mathsize="90%" xref="alg1.l9.m1.1.1.3.2.cmml">S</mi><mo id="alg1.l9.m1.1.1.3.1" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.3" mathsize="90%" xref="alg1.l9.m1.1.1.3.3.cmml">o</mi><mo id="alg1.l9.m1.1.1.3.1a" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.4" mathsize="90%" xref="alg1.l9.m1.1.1.3.4.cmml">f</mi><mo id="alg1.l9.m1.1.1.3.1b" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.5" mathsize="90%" xref="alg1.l9.m1.1.1.3.5.cmml">t</mi><mo id="alg1.l9.m1.1.1.3.1c" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.6" mathsize="90%" xref="alg1.l9.m1.1.1.3.6.cmml">m</mi><mo id="alg1.l9.m1.1.1.3.1d" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.7" mathsize="90%" xref="alg1.l9.m1.1.1.3.7.cmml">a</mi><mo id="alg1.l9.m1.1.1.3.1e" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.8" mathsize="90%" xref="alg1.l9.m1.1.1.3.8.cmml">x</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l9.m1.1b"><apply id="alg1.l9.m1.1.1.cmml" xref="alg1.l9.m1.1.1"><csymbol cd="ambiguous" id="alg1.l9.m1.1.1.1.cmml" xref="alg1.l9.m1.1.1">subscript</csymbol><ci id="alg1.l9.m1.1.1.2.cmml" xref="alg1.l9.m1.1.1.2">𝑥</ci><apply id="alg1.l9.m1.1.1.3.cmml" xref="alg1.l9.m1.1.1.3"><times id="alg1.l9.m1.1.1.3.1.cmml" xref="alg1.l9.m1.1.1.3.1"></times><ci id="alg1.l9.m1.1.1.3.2.cmml" xref="alg1.l9.m1.1.1.3.2">𝑆</ci><ci id="alg1.l9.m1.1.1.3.3.cmml" xref="alg1.l9.m1.1.1.3.3">𝑜</ci><ci id="alg1.l9.m1.1.1.3.4.cmml" xref="alg1.l9.m1.1.1.3.4">𝑓</ci><ci id="alg1.l9.m1.1.1.3.5.cmml" xref="alg1.l9.m1.1.1.3.5">𝑡</ci><ci id="alg1.l9.m1.1.1.3.6.cmml" xref="alg1.l9.m1.1.1.3.6">𝑚</ci><ci id="alg1.l9.m1.1.1.3.7.cmml" 
xref="alg1.l9.m1.1.1.3.7">𝑎</ci><ci id="alg1.l9.m1.1.1.3.8.cmml" xref="alg1.l9.m1.1.1.3.8">𝑥</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m1.1c">x_{Softmax}</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m1.1d">italic_x start_POSTSUBSCRIPT italic_S italic_o italic_f italic_t italic_m italic_a italic_x end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l9.4" style="font-size:90%;"> [Q</span><sub class="ltx_sub" id="alg1.l9.5"><span class="ltx_text ltx_font_italic" id="alg1.l9.5.1" style="font-size:90%;">O</span></sub><span class="ltx_text" id="alg1.l9.6" style="font-size:90%;">-1:0][(N-1):0] </span> </div> <div class="ltx_listingline" id="alg1.l10"> <span class="ltx_text" id="alg1.l10.1" style="font-size:90%;"> </span><span class="ltx_text" id="alg1.l10.2" style="font-size:90%;"> </span><span class="ltx_text" id="alg1.l10.3" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l11"> <span class="ltx_text" id="alg1.l11.2" style="font-size:90%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l11.3" style="font-size:90%;">function</span><span class="ltx_text" id="alg1.l11.4" style="font-size:90%;"> VDR_iPOLY(r, </span><span class="ltx_text ltx_font_italic" id="alg1.l11.1" style="font-size:90%;">n<sub class="ltx_sub" id="alg1.l11.1.1">e</sub></span><span class="ltx_text" id="alg1.l11.5" style="font-size:90%;">) </span> </div> <div class="ltx_listingline" id="alg1.l12"> <span class="ltx_text" id="alg1.l12.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}x_{POLY}\leftarrow r\times(r+b)+c" class="ltx_Math" display="inline" id="alg1.l12.m1.1"><semantics id="alg1.l12.m1.1a"><mrow id="alg1.l12.m1.1.1" xref="alg1.l12.m1.1.1.cmml"><msub id="alg1.l12.m1.1.1.3" xref="alg1.l12.m1.1.1.3.cmml"><mi id="alg1.l12.m1.1.1.3.2" mathsize="90%" xref="alg1.l12.m1.1.1.3.2.cmml">x</mi><mrow id="alg1.l12.m1.1.1.3.3" xref="alg1.l12.m1.1.1.3.3.cmml"><mi 
id="alg1.l12.m1.1.1.3.3.2" mathsize="90%" xref="alg1.l12.m1.1.1.3.3.2.cmml">P</mi><mo id="alg1.l12.m1.1.1.3.3.1" xref="alg1.l12.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.3.3" mathsize="90%" xref="alg1.l12.m1.1.1.3.3.3.cmml">O</mi><mo id="alg1.l12.m1.1.1.3.3.1a" xref="alg1.l12.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.3.4" mathsize="90%" xref="alg1.l12.m1.1.1.3.3.4.cmml">L</mi><mo id="alg1.l12.m1.1.1.3.3.1b" xref="alg1.l12.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.3.5" mathsize="90%" xref="alg1.l12.m1.1.1.3.3.5.cmml">Y</mi></mrow></msub><mo id="alg1.l12.m1.1.1.2" mathsize="90%" stretchy="false" xref="alg1.l12.m1.1.1.2.cmml">←</mo><mrow id="alg1.l12.m1.1.1.1" xref="alg1.l12.m1.1.1.1.cmml"><mrow id="alg1.l12.m1.1.1.1.1" xref="alg1.l12.m1.1.1.1.1.cmml"><mi id="alg1.l12.m1.1.1.1.1.3" mathsize="90%" xref="alg1.l12.m1.1.1.1.1.3.cmml">r</mi><mo id="alg1.l12.m1.1.1.1.1.2" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="alg1.l12.m1.1.1.1.1.2.cmml">×</mo><mrow id="alg1.l12.m1.1.1.1.1.1.1" xref="alg1.l12.m1.1.1.1.1.1.1.1.cmml"><mo id="alg1.l12.m1.1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l12.m1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="alg1.l12.m1.1.1.1.1.1.1.1" xref="alg1.l12.m1.1.1.1.1.1.1.1.cmml"><mi id="alg1.l12.m1.1.1.1.1.1.1.1.2" mathsize="90%" xref="alg1.l12.m1.1.1.1.1.1.1.1.2.cmml">r</mi><mo id="alg1.l12.m1.1.1.1.1.1.1.1.1" mathsize="90%" xref="alg1.l12.m1.1.1.1.1.1.1.1.1.cmml">+</mo><mi id="alg1.l12.m1.1.1.1.1.1.1.1.3" mathsize="90%" xref="alg1.l12.m1.1.1.1.1.1.1.1.3.cmml">b</mi></mrow><mo id="alg1.l12.m1.1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l12.m1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="alg1.l12.m1.1.1.1.2" mathsize="90%" xref="alg1.l12.m1.1.1.1.2.cmml">+</mo><mi id="alg1.l12.m1.1.1.1.3" mathsize="90%" xref="alg1.l12.m1.1.1.1.3.cmml">c</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l12.m1.1b"><apply id="alg1.l12.m1.1.1.cmml" xref="alg1.l12.m1.1.1"><ci id="alg1.l12.m1.1.1.2.cmml" 
xref="alg1.l12.m1.1.1.2">←</ci><apply id="alg1.l12.m1.1.1.3.cmml" xref="alg1.l12.m1.1.1.3"><csymbol cd="ambiguous" id="alg1.l12.m1.1.1.3.1.cmml" xref="alg1.l12.m1.1.1.3">subscript</csymbol><ci id="alg1.l12.m1.1.1.3.2.cmml" xref="alg1.l12.m1.1.1.3.2">𝑥</ci><apply id="alg1.l12.m1.1.1.3.3.cmml" xref="alg1.l12.m1.1.1.3.3"><times id="alg1.l12.m1.1.1.3.3.1.cmml" xref="alg1.l12.m1.1.1.3.3.1"></times><ci id="alg1.l12.m1.1.1.3.3.2.cmml" xref="alg1.l12.m1.1.1.3.3.2">𝑃</ci><ci id="alg1.l12.m1.1.1.3.3.3.cmml" xref="alg1.l12.m1.1.1.3.3.3">𝑂</ci><ci id="alg1.l12.m1.1.1.3.3.4.cmml" xref="alg1.l12.m1.1.1.3.3.4">𝐿</ci><ci id="alg1.l12.m1.1.1.3.3.5.cmml" xref="alg1.l12.m1.1.1.3.3.5">𝑌</ci></apply></apply><apply id="alg1.l12.m1.1.1.1.cmml" xref="alg1.l12.m1.1.1.1"><plus id="alg1.l12.m1.1.1.1.2.cmml" xref="alg1.l12.m1.1.1.1.2"></plus><apply id="alg1.l12.m1.1.1.1.1.cmml" xref="alg1.l12.m1.1.1.1.1"><times id="alg1.l12.m1.1.1.1.1.2.cmml" xref="alg1.l12.m1.1.1.1.1.2"></times><ci id="alg1.l12.m1.1.1.1.1.3.cmml" xref="alg1.l12.m1.1.1.1.1.3">𝑟</ci><apply id="alg1.l12.m1.1.1.1.1.1.1.1.cmml" xref="alg1.l12.m1.1.1.1.1.1.1"><plus id="alg1.l12.m1.1.1.1.1.1.1.1.1.cmml" xref="alg1.l12.m1.1.1.1.1.1.1.1.1"></plus><ci id="alg1.l12.m1.1.1.1.1.1.1.1.2.cmml" xref="alg1.l12.m1.1.1.1.1.1.1.1.2">𝑟</ci><ci id="alg1.l12.m1.1.1.1.1.1.1.1.3.cmml" xref="alg1.l12.m1.1.1.1.1.1.1.1.3">𝑏</ci></apply></apply><ci id="alg1.l12.m1.1.1.1.3.cmml" xref="alg1.l12.m1.1.1.1.3">𝑐</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m1.1c">~{}~{}~{}~{}x_{POLY}\leftarrow r\times(r+b)+c</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m1.1d">italic_x start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT ← italic_r × ( italic_r + italic_b ) + italic_c</annotation></semantics></math><span class="ltx_text" id="alg1.l12.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l13"> <span class="ltx_text" id="alg1.l13.1" 
style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}S_{POLY}\leftarrow a\times S" class="ltx_Math" display="inline" id="alg1.l13.m1.1"><semantics id="alg1.l13.m1.1a"><mrow id="alg1.l13.m1.1.1" xref="alg1.l13.m1.1.1.cmml"><msub id="alg1.l13.m1.1.1.2" xref="alg1.l13.m1.1.1.2.cmml"><mi id="alg1.l13.m1.1.1.2.2" mathsize="90%" xref="alg1.l13.m1.1.1.2.2.cmml">S</mi><mrow id="alg1.l13.m1.1.1.2.3" xref="alg1.l13.m1.1.1.2.3.cmml"><mi id="alg1.l13.m1.1.1.2.3.2" mathsize="90%" xref="alg1.l13.m1.1.1.2.3.2.cmml">P</mi><mo id="alg1.l13.m1.1.1.2.3.1" xref="alg1.l13.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l13.m1.1.1.2.3.3" mathsize="90%" xref="alg1.l13.m1.1.1.2.3.3.cmml">O</mi><mo id="alg1.l13.m1.1.1.2.3.1a" xref="alg1.l13.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l13.m1.1.1.2.3.4" mathsize="90%" xref="alg1.l13.m1.1.1.2.3.4.cmml">L</mi><mo id="alg1.l13.m1.1.1.2.3.1b" xref="alg1.l13.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l13.m1.1.1.2.3.5" mathsize="90%" xref="alg1.l13.m1.1.1.2.3.5.cmml">Y</mi></mrow></msub><mo id="alg1.l13.m1.1.1.1" mathsize="90%" stretchy="false" xref="alg1.l13.m1.1.1.1.cmml">←</mo><mrow id="alg1.l13.m1.1.1.3" xref="alg1.l13.m1.1.1.3.cmml"><mi id="alg1.l13.m1.1.1.3.2" mathsize="90%" xref="alg1.l13.m1.1.1.3.2.cmml">a</mi><mo id="alg1.l13.m1.1.1.3.1" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="alg1.l13.m1.1.1.3.1.cmml">×</mo><mi id="alg1.l13.m1.1.1.3.3" mathsize="90%" xref="alg1.l13.m1.1.1.3.3.cmml">S</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l13.m1.1b"><apply id="alg1.l13.m1.1.1.cmml" xref="alg1.l13.m1.1.1"><ci id="alg1.l13.m1.1.1.1.cmml" xref="alg1.l13.m1.1.1.1">←</ci><apply id="alg1.l13.m1.1.1.2.cmml" xref="alg1.l13.m1.1.1.2"><csymbol cd="ambiguous" id="alg1.l13.m1.1.1.2.1.cmml" xref="alg1.l13.m1.1.1.2">subscript</csymbol><ci id="alg1.l13.m1.1.1.2.2.cmml" xref="alg1.l13.m1.1.1.2.2">𝑆</ci><apply id="alg1.l13.m1.1.1.2.3.cmml" xref="alg1.l13.m1.1.1.2.3"><times id="alg1.l13.m1.1.1.2.3.1.cmml" xref="alg1.l13.m1.1.1.2.3.1"></times><ci 
id="alg1.l13.m1.1.1.2.3.2.cmml" xref="alg1.l13.m1.1.1.2.3.2">𝑃</ci><ci id="alg1.l13.m1.1.1.2.3.3.cmml" xref="alg1.l13.m1.1.1.2.3.3">𝑂</ci><ci id="alg1.l13.m1.1.1.2.3.4.cmml" xref="alg1.l13.m1.1.1.2.3.4">𝐿</ci><ci id="alg1.l13.m1.1.1.2.3.5.cmml" xref="alg1.l13.m1.1.1.2.3.5">𝑌</ci></apply></apply><apply id="alg1.l13.m1.1.1.3.cmml" xref="alg1.l13.m1.1.1.3"><times id="alg1.l13.m1.1.1.3.1.cmml" xref="alg1.l13.m1.1.1.3.1"></times><ci id="alg1.l13.m1.1.1.3.2.cmml" xref="alg1.l13.m1.1.1.3.2">𝑎</ci><ci id="alg1.l13.m1.1.1.3.3.cmml" xref="alg1.l13.m1.1.1.3.3">𝑆</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l13.m1.1c">~{}~{}~{}~{}S_{POLY}\leftarrow a\times S</annotation><annotation encoding="application/x-llamapun" id="alg1.l13.m1.1d">italic_S start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT ← italic_a × italic_S</annotation></semantics></math><span class="ltx_text" id="alg1.l13.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l14"> <span class="ltx_text" id="alg1.l14.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}\textbf{return}~{}x_{POLY},S_{POLY}" class="ltx_Math" display="inline" id="alg1.l14.m1.2"><semantics id="alg1.l14.m1.2a"><mrow id="alg1.l14.m1.2.2.2" xref="alg1.l14.m1.2.2.3.cmml"><mrow id="alg1.l14.m1.1.1.1.1" xref="alg1.l14.m1.1.1.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="alg1.l14.m1.1.1.1.1.2" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.2a.cmml">return</mtext><mo id="alg1.l14.m1.1.1.1.1.1" lspace="0.300em" xref="alg1.l14.m1.1.1.1.1.1.cmml"></mo><msub id="alg1.l14.m1.1.1.1.1.3" xref="alg1.l14.m1.1.1.1.1.3.cmml"><mi id="alg1.l14.m1.1.1.1.1.3.2" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.3.2.cmml">x</mi><mrow id="alg1.l14.m1.1.1.1.1.3.3" xref="alg1.l14.m1.1.1.1.1.3.3.cmml"><mi id="alg1.l14.m1.1.1.1.1.3.3.2" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.3.3.2.cmml">P</mi><mo id="alg1.l14.m1.1.1.1.1.3.3.1" xref="alg1.l14.m1.1.1.1.1.3.3.1.cmml"></mo><mi 
id="alg1.l14.m1.1.1.1.1.3.3.3" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.3.3.3.cmml">O</mi><mo id="alg1.l14.m1.1.1.1.1.3.3.1a" xref="alg1.l14.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.1.1.3.3.4" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.3.3.4.cmml">L</mi><mo id="alg1.l14.m1.1.1.1.1.3.3.1b" xref="alg1.l14.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.1.1.3.3.5" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.3.3.5.cmml">Y</mi></mrow></msub></mrow><mo id="alg1.l14.m1.2.2.2.3" mathsize="90%" xref="alg1.l14.m1.2.2.3.cmml">,</mo><msub id="alg1.l14.m1.2.2.2.2" xref="alg1.l14.m1.2.2.2.2.cmml"><mi id="alg1.l14.m1.2.2.2.2.2" mathsize="90%" xref="alg1.l14.m1.2.2.2.2.2.cmml">S</mi><mrow id="alg1.l14.m1.2.2.2.2.3" xref="alg1.l14.m1.2.2.2.2.3.cmml"><mi id="alg1.l14.m1.2.2.2.2.3.2" mathsize="90%" xref="alg1.l14.m1.2.2.2.2.3.2.cmml">P</mi><mo id="alg1.l14.m1.2.2.2.2.3.1" xref="alg1.l14.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l14.m1.2.2.2.2.3.3" mathsize="90%" xref="alg1.l14.m1.2.2.2.2.3.3.cmml">O</mi><mo id="alg1.l14.m1.2.2.2.2.3.1a" xref="alg1.l14.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l14.m1.2.2.2.2.3.4" mathsize="90%" xref="alg1.l14.m1.2.2.2.2.3.4.cmml">L</mi><mo id="alg1.l14.m1.2.2.2.2.3.1b" xref="alg1.l14.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l14.m1.2.2.2.2.3.5" mathsize="90%" xref="alg1.l14.m1.2.2.2.2.3.5.cmml">Y</mi></mrow></msub></mrow><annotation-xml encoding="MathML-Content" id="alg1.l14.m1.2b"><list id="alg1.l14.m1.2.2.3.cmml" xref="alg1.l14.m1.2.2.2"><apply id="alg1.l14.m1.1.1.1.1.cmml" xref="alg1.l14.m1.1.1.1.1"><times id="alg1.l14.m1.1.1.1.1.1.cmml" xref="alg1.l14.m1.1.1.1.1.1"></times><ci id="alg1.l14.m1.1.1.1.1.2a.cmml" xref="alg1.l14.m1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" id="alg1.l14.m1.1.1.1.1.2.cmml" mathsize="90%" xref="alg1.l14.m1.1.1.1.1.2">return</mtext></ci><apply id="alg1.l14.m1.1.1.1.1.3.cmml" xref="alg1.l14.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="alg1.l14.m1.1.1.1.1.3.1.cmml" xref="alg1.l14.m1.1.1.1.1.3">subscript</csymbol><ci 
id="alg1.l14.m1.1.1.1.1.3.2.cmml" xref="alg1.l14.m1.1.1.1.1.3.2">𝑥</ci><apply id="alg1.l14.m1.1.1.1.1.3.3.cmml" xref="alg1.l14.m1.1.1.1.1.3.3"><times id="alg1.l14.m1.1.1.1.1.3.3.1.cmml" xref="alg1.l14.m1.1.1.1.1.3.3.1"></times><ci id="alg1.l14.m1.1.1.1.1.3.3.2.cmml" xref="alg1.l14.m1.1.1.1.1.3.3.2">𝑃</ci><ci id="alg1.l14.m1.1.1.1.1.3.3.3.cmml" xref="alg1.l14.m1.1.1.1.1.3.3.3">𝑂</ci><ci id="alg1.l14.m1.1.1.1.1.3.3.4.cmml" xref="alg1.l14.m1.1.1.1.1.3.3.4">𝐿</ci><ci id="alg1.l14.m1.1.1.1.1.3.3.5.cmml" xref="alg1.l14.m1.1.1.1.1.3.3.5">𝑌</ci></apply></apply></apply><apply id="alg1.l14.m1.2.2.2.2.cmml" xref="alg1.l14.m1.2.2.2.2"><csymbol cd="ambiguous" id="alg1.l14.m1.2.2.2.2.1.cmml" xref="alg1.l14.m1.2.2.2.2">subscript</csymbol><ci id="alg1.l14.m1.2.2.2.2.2.cmml" xref="alg1.l14.m1.2.2.2.2.2">𝑆</ci><apply id="alg1.l14.m1.2.2.2.2.3.cmml" xref="alg1.l14.m1.2.2.2.2.3"><times id="alg1.l14.m1.2.2.2.2.3.1.cmml" xref="alg1.l14.m1.2.2.2.2.3.1"></times><ci id="alg1.l14.m1.2.2.2.2.3.2.cmml" xref="alg1.l14.m1.2.2.2.2.3.2">𝑃</ci><ci id="alg1.l14.m1.2.2.2.2.3.3.cmml" xref="alg1.l14.m1.2.2.2.2.3.3">𝑂</ci><ci id="alg1.l14.m1.2.2.2.2.3.4.cmml" xref="alg1.l14.m1.2.2.2.2.3.4">𝐿</ci><ci id="alg1.l14.m1.2.2.2.2.3.5.cmml" xref="alg1.l14.m1.2.2.2.2.3.5">𝑌</ci></apply></apply></list></annotation-xml><annotation encoding="application/x-tex" id="alg1.l14.m1.2c">~{}~{}~{}~{}\textbf{return}~{}x_{POLY},S_{POLY}</annotation><annotation encoding="application/x-llamapun" id="alg1.l14.m1.2d">return italic_x start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l14.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l15"> <span class="ltx_text" id="alg1.l15.1" style="font-size:90%;"> </span><span class="ltx_text" id="alg1.l15.2" style="font-size:90%;"> </span><span class="ltx_text" id="alg1.l15.3" 
style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l16"> <span class="ltx_text" id="alg1.l16.2" style="font-size:90%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l16.3" style="font-size:90%;">function</span><span class="ltx_text" id="alg1.l16.4" style="font-size:90%;"> VDR_iEXP(x</span><sub class="ltx_sub" id="alg1.l16.5"><span class="ltx_text ltx_font_italic" id="alg1.l16.5.1" style="font-size:90%;">sub</span></sub><span class="ltx_text" id="alg1.l16.6" style="font-size:90%;">, </span><span class="ltx_text ltx_font_italic" id="alg1.l16.1" style="font-size:90%;">n<sub class="ltx_sub" id="alg1.l16.1.1">e</sub></span><span class="ltx_text" id="alg1.l16.7" style="font-size:90%;">) </span> </div> <div class="ltx_listingline" id="alg1.l17"> <span class="ltx_text" id="alg1.l17.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}Q\leftarrow\text{clip}(int(X_{sub}/l),2\times Q_{I})" class="ltx_Math" display="inline" id="alg1.l17.m1.2"><semantics id="alg1.l17.m1.2a"><mrow id="alg1.l17.m1.2.2" xref="alg1.l17.m1.2.2.cmml"><mi id="alg1.l17.m1.2.2.4" mathsize="90%" xref="alg1.l17.m1.2.2.4.cmml">Q</mi><mo id="alg1.l17.m1.2.2.3" mathsize="90%" stretchy="false" xref="alg1.l17.m1.2.2.3.cmml">←</mo><mrow id="alg1.l17.m1.2.2.2" xref="alg1.l17.m1.2.2.2.cmml"><mtext id="alg1.l17.m1.2.2.2.4" mathsize="90%" xref="alg1.l17.m1.2.2.2.4a.cmml">clip</mtext><mo id="alg1.l17.m1.2.2.2.3" xref="alg1.l17.m1.2.2.2.3.cmml"></mo><mrow id="alg1.l17.m1.2.2.2.2.2" xref="alg1.l17.m1.2.2.2.2.3.cmml"><mo id="alg1.l17.m1.2.2.2.2.2.3" maxsize="90%" minsize="90%" xref="alg1.l17.m1.2.2.2.2.3.cmml">(</mo><mrow id="alg1.l17.m1.1.1.1.1.1.1" xref="alg1.l17.m1.1.1.1.1.1.1.cmml"><mi id="alg1.l17.m1.1.1.1.1.1.1.3" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.3.cmml">i</mi><mo id="alg1.l17.m1.1.1.1.1.1.1.2" xref="alg1.l17.m1.1.1.1.1.1.1.2.cmml"></mo><mi id="alg1.l17.m1.1.1.1.1.1.1.4" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.4.cmml">n</mi><mo 
id="alg1.l17.m1.1.1.1.1.1.1.2a" xref="alg1.l17.m1.1.1.1.1.1.1.2.cmml"></mo><mi id="alg1.l17.m1.1.1.1.1.1.1.5" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.5.cmml">t</mi><mo id="alg1.l17.m1.1.1.1.1.1.1.2b" xref="alg1.l17.m1.1.1.1.1.1.1.2.cmml"></mo><mrow id="alg1.l17.m1.1.1.1.1.1.1.1.1" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.cmml"><mo id="alg1.l17.m1.1.1.1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="alg1.l17.m1.1.1.1.1.1.1.1.1.1" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.cmml"><msub id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.cmml"><mi id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.2" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.2.cmml">X</mi><mrow id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.cmml"><mi id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.2" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.2.cmml">s</mi><mo id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.1" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.1.cmml"></mo><mi id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.3" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.3.cmml">u</mi><mo id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.1a" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.1.cmml"></mo><mi id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.4" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.4.cmml">b</mi></mrow></msub><mo id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.1" maxsize="90%" minsize="90%" stretchy="true" symmetric="true" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.1.cmml">/</mo><mi id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.3" mathsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.3.cmml">l</mi></mrow><mo id="alg1.l17.m1.1.1.1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="alg1.l17.m1.2.2.2.2.2.4" mathsize="90%" xref="alg1.l17.m1.2.2.2.2.3.cmml">,</mo><mrow id="alg1.l17.m1.2.2.2.2.2.2" xref="alg1.l17.m1.2.2.2.2.2.2.cmml"><mn id="alg1.l17.m1.2.2.2.2.2.2.2" mathsize="90%" 
xref="alg1.l17.m1.2.2.2.2.2.2.2.cmml">2</mn><mo id="alg1.l17.m1.2.2.2.2.2.2.1" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="alg1.l17.m1.2.2.2.2.2.2.1.cmml">×</mo><msub id="alg1.l17.m1.2.2.2.2.2.2.3" xref="alg1.l17.m1.2.2.2.2.2.2.3.cmml"><mi id="alg1.l17.m1.2.2.2.2.2.2.3.2" mathsize="90%" xref="alg1.l17.m1.2.2.2.2.2.2.3.2.cmml">Q</mi><mi id="alg1.l17.m1.2.2.2.2.2.2.3.3" mathsize="90%" xref="alg1.l17.m1.2.2.2.2.2.2.3.3.cmml">I</mi></msub></mrow><mo id="alg1.l17.m1.2.2.2.2.2.5" maxsize="90%" minsize="90%" xref="alg1.l17.m1.2.2.2.2.3.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l17.m1.2b"><apply id="alg1.l17.m1.2.2.cmml" xref="alg1.l17.m1.2.2"><ci id="alg1.l17.m1.2.2.3.cmml" xref="alg1.l17.m1.2.2.3">←</ci><ci id="alg1.l17.m1.2.2.4.cmml" xref="alg1.l17.m1.2.2.4">𝑄</ci><apply id="alg1.l17.m1.2.2.2.cmml" xref="alg1.l17.m1.2.2.2"><times id="alg1.l17.m1.2.2.2.3.cmml" xref="alg1.l17.m1.2.2.2.3"></times><ci id="alg1.l17.m1.2.2.2.4a.cmml" xref="alg1.l17.m1.2.2.2.4"><mtext id="alg1.l17.m1.2.2.2.4.cmml" mathsize="90%" xref="alg1.l17.m1.2.2.2.4">clip</mtext></ci><interval closure="open" id="alg1.l17.m1.2.2.2.2.3.cmml" xref="alg1.l17.m1.2.2.2.2.2"><apply id="alg1.l17.m1.1.1.1.1.1.1.cmml" xref="alg1.l17.m1.1.1.1.1.1.1"><times id="alg1.l17.m1.1.1.1.1.1.1.2.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.2"></times><ci id="alg1.l17.m1.1.1.1.1.1.1.3.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.3">𝑖</ci><ci id="alg1.l17.m1.1.1.1.1.1.1.4.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.4">𝑛</ci><ci id="alg1.l17.m1.1.1.1.1.1.1.5.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.5">𝑡</ci><apply id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1"><divide id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.1"></divide><apply id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci 
id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.2">𝑋</ci><apply id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3"><times id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.1.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.1"></times><ci id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.2.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.2">𝑠</ci><ci id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.3.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.3">𝑢</ci><ci id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.4.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.2.3.4">𝑏</ci></apply></apply><ci id="alg1.l17.m1.1.1.1.1.1.1.1.1.1.3.cmml" xref="alg1.l17.m1.1.1.1.1.1.1.1.1.1.3">𝑙</ci></apply></apply><apply id="alg1.l17.m1.2.2.2.2.2.2.cmml" xref="alg1.l17.m1.2.2.2.2.2.2"><times id="alg1.l17.m1.2.2.2.2.2.2.1.cmml" xref="alg1.l17.m1.2.2.2.2.2.2.1"></times><cn id="alg1.l17.m1.2.2.2.2.2.2.2.cmml" type="integer" xref="alg1.l17.m1.2.2.2.2.2.2.2">2</cn><apply id="alg1.l17.m1.2.2.2.2.2.2.3.cmml" xref="alg1.l17.m1.2.2.2.2.2.2.3"><csymbol cd="ambiguous" id="alg1.l17.m1.2.2.2.2.2.2.3.1.cmml" xref="alg1.l17.m1.2.2.2.2.2.2.3">subscript</csymbol><ci id="alg1.l17.m1.2.2.2.2.2.2.3.2.cmml" xref="alg1.l17.m1.2.2.2.2.2.2.3.2">𝑄</ci><ci id="alg1.l17.m1.2.2.2.2.2.2.3.3.cmml" xref="alg1.l17.m1.2.2.2.2.2.2.3.3">𝐼</ci></apply></apply></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l17.m1.2c">~{}~{}~{}~{}Q\leftarrow\text{clip}(int(X_{sub}/l),2\times Q_{I})</annotation><annotation encoding="application/x-llamapun" id="alg1.l17.m1.2d">italic_Q ← clip ( italic_i italic_n italic_t ( italic_X start_POSTSUBSCRIPT italic_s italic_u italic_b end_POSTSUBSCRIPT / italic_l ) , 2 × italic_Q start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT )</annotation></semantics></math><span class="ltx_text" id="alg1.l17.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l18"> <span class="ltx_text" id="alg1.l18.1" style="font-size:90%;"> 
</span><math alttext="~{}~{}~{}~{}r\leftarrow x_{sub}-Q-l" class="ltx_Math" display="inline" id="alg1.l18.m1.1"><semantics id="alg1.l18.m1.1a"><mrow id="alg1.l18.m1.1.1" xref="alg1.l18.m1.1.1.cmml"><mi id="alg1.l18.m1.1.1.2" mathsize="90%" xref="alg1.l18.m1.1.1.2.cmml">r</mi><mo id="alg1.l18.m1.1.1.1" mathsize="90%" stretchy="false" xref="alg1.l18.m1.1.1.1.cmml">←</mo><mrow id="alg1.l18.m1.1.1.3" xref="alg1.l18.m1.1.1.3.cmml"><msub id="alg1.l18.m1.1.1.3.2" xref="alg1.l18.m1.1.1.3.2.cmml"><mi id="alg1.l18.m1.1.1.3.2.2" mathsize="90%" xref="alg1.l18.m1.1.1.3.2.2.cmml">x</mi><mrow id="alg1.l18.m1.1.1.3.2.3" xref="alg1.l18.m1.1.1.3.2.3.cmml"><mi id="alg1.l18.m1.1.1.3.2.3.2" mathsize="90%" xref="alg1.l18.m1.1.1.3.2.3.2.cmml">s</mi><mo id="alg1.l18.m1.1.1.3.2.3.1" xref="alg1.l18.m1.1.1.3.2.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.2.3.3" mathsize="90%" xref="alg1.l18.m1.1.1.3.2.3.3.cmml">u</mi><mo id="alg1.l18.m1.1.1.3.2.3.1a" xref="alg1.l18.m1.1.1.3.2.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.2.3.4" mathsize="90%" xref="alg1.l18.m1.1.1.3.2.3.4.cmml">b</mi></mrow></msub><mo id="alg1.l18.m1.1.1.3.1" mathsize="90%" xref="alg1.l18.m1.1.1.3.1.cmml">−</mo><mi id="alg1.l18.m1.1.1.3.3" mathsize="90%" xref="alg1.l18.m1.1.1.3.3.cmml">Q</mi><mo id="alg1.l18.m1.1.1.3.1a" mathsize="90%" xref="alg1.l18.m1.1.1.3.1.cmml">−</mo><mi id="alg1.l18.m1.1.1.3.4" mathsize="90%" xref="alg1.l18.m1.1.1.3.4.cmml">l</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l18.m1.1b"><apply id="alg1.l18.m1.1.1.cmml" xref="alg1.l18.m1.1.1"><ci id="alg1.l18.m1.1.1.1.cmml" xref="alg1.l18.m1.1.1.1">←</ci><ci id="alg1.l18.m1.1.1.2.cmml" xref="alg1.l18.m1.1.1.2">𝑟</ci><apply id="alg1.l18.m1.1.1.3.cmml" xref="alg1.l18.m1.1.1.3"><minus id="alg1.l18.m1.1.1.3.1.cmml" xref="alg1.l18.m1.1.1.3.1"></minus><apply id="alg1.l18.m1.1.1.3.2.cmml" xref="alg1.l18.m1.1.1.3.2"><csymbol cd="ambiguous" id="alg1.l18.m1.1.1.3.2.1.cmml" xref="alg1.l18.m1.1.1.3.2">subscript</csymbol><ci id="alg1.l18.m1.1.1.3.2.2.cmml" 
xref="alg1.l18.m1.1.1.3.2.2">𝑥</ci><apply id="alg1.l18.m1.1.1.3.2.3.cmml" xref="alg1.l18.m1.1.1.3.2.3"><times id="alg1.l18.m1.1.1.3.2.3.1.cmml" xref="alg1.l18.m1.1.1.3.2.3.1"></times><ci id="alg1.l18.m1.1.1.3.2.3.2.cmml" xref="alg1.l18.m1.1.1.3.2.3.2">𝑠</ci><ci id="alg1.l18.m1.1.1.3.2.3.3.cmml" xref="alg1.l18.m1.1.1.3.2.3.3">𝑢</ci><ci id="alg1.l18.m1.1.1.3.2.3.4.cmml" xref="alg1.l18.m1.1.1.3.2.3.4">𝑏</ci></apply></apply><ci id="alg1.l18.m1.1.1.3.3.cmml" xref="alg1.l18.m1.1.1.3.3">𝑄</ci><ci id="alg1.l18.m1.1.1.3.4.cmml" xref="alg1.l18.m1.1.1.3.4">𝑙</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l18.m1.1c">~{}~{}~{}~{}r\leftarrow x_{sub}-Q-l</annotation><annotation encoding="application/x-llamapun" id="alg1.l18.m1.1d">italic_r ← italic_x start_POSTSUBSCRIPT italic_s italic_u italic_b end_POSTSUBSCRIPT - italic_Q - italic_l</annotation></semantics></math><span class="ltx_text" id="alg1.l18.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l19"> <span class="ltx_text" id="alg1.l19.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}r_{POLY},S_{POLY}\leftarrow\text{iPOLY}(r,S)" class="ltx_Math" display="inline" id="alg1.l19.m1.4"><semantics id="alg1.l19.m1.4a"><mrow id="alg1.l19.m1.4.4" xref="alg1.l19.m1.4.4.cmml"><mrow id="alg1.l19.m1.4.4.2.2" xref="alg1.l19.m1.4.4.2.3.cmml"><msub id="alg1.l19.m1.3.3.1.1.1" xref="alg1.l19.m1.3.3.1.1.1.cmml"><mi id="alg1.l19.m1.3.3.1.1.1.2" mathsize="90%" xref="alg1.l19.m1.3.3.1.1.1.2.cmml">r</mi><mrow id="alg1.l19.m1.3.3.1.1.1.3" xref="alg1.l19.m1.3.3.1.1.1.3.cmml"><mi id="alg1.l19.m1.3.3.1.1.1.3.2" mathsize="90%" xref="alg1.l19.m1.3.3.1.1.1.3.2.cmml">P</mi><mo id="alg1.l19.m1.3.3.1.1.1.3.1" xref="alg1.l19.m1.3.3.1.1.1.3.1.cmml"></mo><mi id="alg1.l19.m1.3.3.1.1.1.3.3" mathsize="90%" xref="alg1.l19.m1.3.3.1.1.1.3.3.cmml">O</mi><mo id="alg1.l19.m1.3.3.1.1.1.3.1a" xref="alg1.l19.m1.3.3.1.1.1.3.1.cmml"></mo><mi id="alg1.l19.m1.3.3.1.1.1.3.4" mathsize="90%" 
xref="alg1.l19.m1.3.3.1.1.1.3.4.cmml">L</mi><mo id="alg1.l19.m1.3.3.1.1.1.3.1b" xref="alg1.l19.m1.3.3.1.1.1.3.1.cmml"></mo><mi id="alg1.l19.m1.3.3.1.1.1.3.5" mathsize="90%" xref="alg1.l19.m1.3.3.1.1.1.3.5.cmml">Y</mi></mrow></msub><mo id="alg1.l19.m1.4.4.2.2.3" mathsize="90%" xref="alg1.l19.m1.4.4.2.3.cmml">,</mo><msub id="alg1.l19.m1.4.4.2.2.2" xref="alg1.l19.m1.4.4.2.2.2.cmml"><mi id="alg1.l19.m1.4.4.2.2.2.2" mathsize="90%" xref="alg1.l19.m1.4.4.2.2.2.2.cmml">S</mi><mrow id="alg1.l19.m1.4.4.2.2.2.3" xref="alg1.l19.m1.4.4.2.2.2.3.cmml"><mi id="alg1.l19.m1.4.4.2.2.2.3.2" mathsize="90%" xref="alg1.l19.m1.4.4.2.2.2.3.2.cmml">P</mi><mo id="alg1.l19.m1.4.4.2.2.2.3.1" xref="alg1.l19.m1.4.4.2.2.2.3.1.cmml"></mo><mi id="alg1.l19.m1.4.4.2.2.2.3.3" mathsize="90%" xref="alg1.l19.m1.4.4.2.2.2.3.3.cmml">O</mi><mo id="alg1.l19.m1.4.4.2.2.2.3.1a" xref="alg1.l19.m1.4.4.2.2.2.3.1.cmml"></mo><mi id="alg1.l19.m1.4.4.2.2.2.3.4" mathsize="90%" xref="alg1.l19.m1.4.4.2.2.2.3.4.cmml">L</mi><mo id="alg1.l19.m1.4.4.2.2.2.3.1b" xref="alg1.l19.m1.4.4.2.2.2.3.1.cmml"></mo><mi id="alg1.l19.m1.4.4.2.2.2.3.5" mathsize="90%" xref="alg1.l19.m1.4.4.2.2.2.3.5.cmml">Y</mi></mrow></msub></mrow><mo id="alg1.l19.m1.4.4.3" mathsize="90%" stretchy="false" xref="alg1.l19.m1.4.4.3.cmml">←</mo><mrow id="alg1.l19.m1.4.4.4" xref="alg1.l19.m1.4.4.4.cmml"><mtext id="alg1.l19.m1.4.4.4.2" mathsize="90%" xref="alg1.l19.m1.4.4.4.2a.cmml">iPOLY</mtext><mo id="alg1.l19.m1.4.4.4.1" xref="alg1.l19.m1.4.4.4.1.cmml"></mo><mrow id="alg1.l19.m1.4.4.4.3.2" xref="alg1.l19.m1.4.4.4.3.1.cmml"><mo id="alg1.l19.m1.4.4.4.3.2.1" maxsize="90%" minsize="90%" xref="alg1.l19.m1.4.4.4.3.1.cmml">(</mo><mi id="alg1.l19.m1.1.1" mathsize="90%" xref="alg1.l19.m1.1.1.cmml">r</mi><mo id="alg1.l19.m1.4.4.4.3.2.2" mathsize="90%" xref="alg1.l19.m1.4.4.4.3.1.cmml">,</mo><mi id="alg1.l19.m1.2.2" mathsize="90%" xref="alg1.l19.m1.2.2.cmml">S</mi><mo id="alg1.l19.m1.4.4.4.3.2.3" maxsize="90%" minsize="90%" 
xref="alg1.l19.m1.4.4.4.3.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l19.m1.4b"><apply id="alg1.l19.m1.4.4.cmml" xref="alg1.l19.m1.4.4"><ci id="alg1.l19.m1.4.4.3.cmml" xref="alg1.l19.m1.4.4.3">←</ci><list id="alg1.l19.m1.4.4.2.3.cmml" xref="alg1.l19.m1.4.4.2.2"><apply id="alg1.l19.m1.3.3.1.1.1.cmml" xref="alg1.l19.m1.3.3.1.1.1"><csymbol cd="ambiguous" id="alg1.l19.m1.3.3.1.1.1.1.cmml" xref="alg1.l19.m1.3.3.1.1.1">subscript</csymbol><ci id="alg1.l19.m1.3.3.1.1.1.2.cmml" xref="alg1.l19.m1.3.3.1.1.1.2">𝑟</ci><apply id="alg1.l19.m1.3.3.1.1.1.3.cmml" xref="alg1.l19.m1.3.3.1.1.1.3"><times id="alg1.l19.m1.3.3.1.1.1.3.1.cmml" xref="alg1.l19.m1.3.3.1.1.1.3.1"></times><ci id="alg1.l19.m1.3.3.1.1.1.3.2.cmml" xref="alg1.l19.m1.3.3.1.1.1.3.2">𝑃</ci><ci id="alg1.l19.m1.3.3.1.1.1.3.3.cmml" xref="alg1.l19.m1.3.3.1.1.1.3.3">𝑂</ci><ci id="alg1.l19.m1.3.3.1.1.1.3.4.cmml" xref="alg1.l19.m1.3.3.1.1.1.3.4">𝐿</ci><ci id="alg1.l19.m1.3.3.1.1.1.3.5.cmml" xref="alg1.l19.m1.3.3.1.1.1.3.5">𝑌</ci></apply></apply><apply id="alg1.l19.m1.4.4.2.2.2.cmml" xref="alg1.l19.m1.4.4.2.2.2"><csymbol cd="ambiguous" id="alg1.l19.m1.4.4.2.2.2.1.cmml" xref="alg1.l19.m1.4.4.2.2.2">subscript</csymbol><ci id="alg1.l19.m1.4.4.2.2.2.2.cmml" xref="alg1.l19.m1.4.4.2.2.2.2">𝑆</ci><apply id="alg1.l19.m1.4.4.2.2.2.3.cmml" xref="alg1.l19.m1.4.4.2.2.2.3"><times id="alg1.l19.m1.4.4.2.2.2.3.1.cmml" xref="alg1.l19.m1.4.4.2.2.2.3.1"></times><ci id="alg1.l19.m1.4.4.2.2.2.3.2.cmml" xref="alg1.l19.m1.4.4.2.2.2.3.2">𝑃</ci><ci id="alg1.l19.m1.4.4.2.2.2.3.3.cmml" xref="alg1.l19.m1.4.4.2.2.2.3.3">𝑂</ci><ci id="alg1.l19.m1.4.4.2.2.2.3.4.cmml" xref="alg1.l19.m1.4.4.2.2.2.3.4">𝐿</ci><ci id="alg1.l19.m1.4.4.2.2.2.3.5.cmml" xref="alg1.l19.m1.4.4.2.2.2.3.5">𝑌</ci></apply></apply></list><apply id="alg1.l19.m1.4.4.4.cmml" xref="alg1.l19.m1.4.4.4"><times id="alg1.l19.m1.4.4.4.1.cmml" xref="alg1.l19.m1.4.4.4.1"></times><ci id="alg1.l19.m1.4.4.4.2a.cmml" xref="alg1.l19.m1.4.4.4.2"><mtext 
id="alg1.l19.m1.4.4.4.2.cmml" mathsize="90%" xref="alg1.l19.m1.4.4.4.2">iPOLY</mtext></ci><interval closure="open" id="alg1.l19.m1.4.4.4.3.1.cmml" xref="alg1.l19.m1.4.4.4.3.2"><ci id="alg1.l19.m1.1.1.cmml" xref="alg1.l19.m1.1.1">𝑟</ci><ci id="alg1.l19.m1.2.2.cmml" xref="alg1.l19.m1.2.2">𝑆</ci></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l19.m1.4c">~{}~{}~{}~{}r_{POLY},S_{POLY}\leftarrow\text{iPOLY}(r,S)</annotation><annotation encoding="application/x-llamapun" id="alg1.l19.m1.4d">italic_r start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT ← iPOLY ( italic_r , italic_S )</annotation></semantics></math><span class="ltx_text" id="alg1.l19.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l20"> <span class="ltx_text" id="alg1.l20.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}r_{EXP}=r_{POLY}<<Q" class="ltx_Math" display="inline" id="alg1.l20.m1.1"><semantics id="alg1.l20.m1.1a"><mrow id="alg1.l20.m1.1.1" xref="alg1.l20.m1.1.1.cmml"><msub id="alg1.l20.m1.1.1.2" xref="alg1.l20.m1.1.1.2.cmml"><mi id="alg1.l20.m1.1.1.2.2" mathsize="90%" xref="alg1.l20.m1.1.1.2.2.cmml">r</mi><mrow id="alg1.l20.m1.1.1.2.3" xref="alg1.l20.m1.1.1.2.3.cmml"><mi id="alg1.l20.m1.1.1.2.3.2" mathsize="90%" xref="alg1.l20.m1.1.1.2.3.2.cmml">E</mi><mo id="alg1.l20.m1.1.1.2.3.1" xref="alg1.l20.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l20.m1.1.1.2.3.3" mathsize="90%" xref="alg1.l20.m1.1.1.2.3.3.cmml">X</mi><mo id="alg1.l20.m1.1.1.2.3.1a" xref="alg1.l20.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l20.m1.1.1.2.3.4" mathsize="90%" xref="alg1.l20.m1.1.1.2.3.4.cmml">P</mi></mrow></msub><mo id="alg1.l20.m1.1.1.3" mathsize="90%" xref="alg1.l20.m1.1.1.3.cmml">=</mo><msub id="alg1.l20.m1.1.1.4" xref="alg1.l20.m1.1.1.4.cmml"><mi id="alg1.l20.m1.1.1.4.2" mathsize="90%" xref="alg1.l20.m1.1.1.4.2.cmml">r</mi><mrow 
id="alg1.l20.m1.1.1.4.3" xref="alg1.l20.m1.1.1.4.3.cmml"><mi id="alg1.l20.m1.1.1.4.3.2" mathsize="90%" xref="alg1.l20.m1.1.1.4.3.2.cmml">P</mi><mo id="alg1.l20.m1.1.1.4.3.1" xref="alg1.l20.m1.1.1.4.3.1.cmml"></mo><mi id="alg1.l20.m1.1.1.4.3.3" mathsize="90%" xref="alg1.l20.m1.1.1.4.3.3.cmml">O</mi><mo id="alg1.l20.m1.1.1.4.3.1a" xref="alg1.l20.m1.1.1.4.3.1.cmml"></mo><mi id="alg1.l20.m1.1.1.4.3.4" mathsize="90%" xref="alg1.l20.m1.1.1.4.3.4.cmml">L</mi><mo id="alg1.l20.m1.1.1.4.3.1b" xref="alg1.l20.m1.1.1.4.3.1.cmml"></mo><mi id="alg1.l20.m1.1.1.4.3.5" mathsize="90%" xref="alg1.l20.m1.1.1.4.3.5.cmml">Y</mi></mrow></msub><mo id="alg1.l20.m1.1.1.5" mathsize="90%" xref="alg1.l20.m1.1.1.5.cmml"><<</mo><mi id="alg1.l20.m1.1.1.6" mathsize="90%" xref="alg1.l20.m1.1.1.6.cmml">Q</mi></mrow><annotation-xml encoding="MathML-Content" id="alg1.l20.m1.1b"><apply id="alg1.l20.m1.1.1.cmml" xref="alg1.l20.m1.1.1"><and id="alg1.l20.m1.1.1a.cmml" xref="alg1.l20.m1.1.1"></and><apply id="alg1.l20.m1.1.1b.cmml" xref="alg1.l20.m1.1.1"><eq id="alg1.l20.m1.1.1.3.cmml" xref="alg1.l20.m1.1.1.3"></eq><apply id="alg1.l20.m1.1.1.2.cmml" xref="alg1.l20.m1.1.1.2"><csymbol cd="ambiguous" id="alg1.l20.m1.1.1.2.1.cmml" xref="alg1.l20.m1.1.1.2">subscript</csymbol><ci id="alg1.l20.m1.1.1.2.2.cmml" xref="alg1.l20.m1.1.1.2.2">𝑟</ci><apply id="alg1.l20.m1.1.1.2.3.cmml" xref="alg1.l20.m1.1.1.2.3"><times id="alg1.l20.m1.1.1.2.3.1.cmml" xref="alg1.l20.m1.1.1.2.3.1"></times><ci id="alg1.l20.m1.1.1.2.3.2.cmml" xref="alg1.l20.m1.1.1.2.3.2">𝐸</ci><ci id="alg1.l20.m1.1.1.2.3.3.cmml" xref="alg1.l20.m1.1.1.2.3.3">𝑋</ci><ci id="alg1.l20.m1.1.1.2.3.4.cmml" xref="alg1.l20.m1.1.1.2.3.4">𝑃</ci></apply></apply><apply id="alg1.l20.m1.1.1.4.cmml" xref="alg1.l20.m1.1.1.4"><csymbol cd="ambiguous" id="alg1.l20.m1.1.1.4.1.cmml" xref="alg1.l20.m1.1.1.4">subscript</csymbol><ci id="alg1.l20.m1.1.1.4.2.cmml" xref="alg1.l20.m1.1.1.4.2">𝑟</ci><apply id="alg1.l20.m1.1.1.4.3.cmml" xref="alg1.l20.m1.1.1.4.3"><times 
id="alg1.l20.m1.1.1.4.3.1.cmml" xref="alg1.l20.m1.1.1.4.3.1"></times><ci id="alg1.l20.m1.1.1.4.3.2.cmml" xref="alg1.l20.m1.1.1.4.3.2">𝑃</ci><ci id="alg1.l20.m1.1.1.4.3.3.cmml" xref="alg1.l20.m1.1.1.4.3.3">𝑂</ci><ci id="alg1.l20.m1.1.1.4.3.4.cmml" xref="alg1.l20.m1.1.1.4.3.4">𝐿</ci><ci id="alg1.l20.m1.1.1.4.3.5.cmml" xref="alg1.l20.m1.1.1.4.3.5">𝑌</ci></apply></apply></apply><apply id="alg1.l20.m1.1.1c.cmml" xref="alg1.l20.m1.1.1"><csymbol cd="latexml" id="alg1.l20.m1.1.1.5.cmml" xref="alg1.l20.m1.1.1.5">much-less-than</csymbol><share href="https://arxiv.org/html/2411.14733v1#alg1.l20.m1.1.1.4.cmml" id="alg1.l20.m1.1.1d.cmml" xref="alg1.l20.m1.1.1"></share><ci id="alg1.l20.m1.1.1.6.cmml" xref="alg1.l20.m1.1.1.6">𝑄</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l20.m1.1c">~{}~{}~{}~{}r_{EXP}=r_{POLY}<<Q</annotation><annotation encoding="application/x-llamapun" id="alg1.l20.m1.1d">italic_r start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT = italic_r start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT < < italic_Q</annotation></semantics></math><span class="ltx_text" id="alg1.l20.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l21"> <span class="ltx_text" id="alg1.l21.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}S_{EXP}=S_{POLY}>>Q" class="ltx_Math" display="inline" id="alg1.l21.m1.1"><semantics id="alg1.l21.m1.1a"><mrow id="alg1.l21.m1.1.1" xref="alg1.l21.m1.1.1.cmml"><msub id="alg1.l21.m1.1.1.2" xref="alg1.l21.m1.1.1.2.cmml"><mi id="alg1.l21.m1.1.1.2.2" mathsize="90%" xref="alg1.l21.m1.1.1.2.2.cmml">S</mi><mrow id="alg1.l21.m1.1.1.2.3" xref="alg1.l21.m1.1.1.2.3.cmml"><mi id="alg1.l21.m1.1.1.2.3.2" mathsize="90%" xref="alg1.l21.m1.1.1.2.3.2.cmml">E</mi><mo id="alg1.l21.m1.1.1.2.3.1" xref="alg1.l21.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l21.m1.1.1.2.3.3" mathsize="90%" xref="alg1.l21.m1.1.1.2.3.3.cmml">X</mi><mo id="alg1.l21.m1.1.1.2.3.1a" 
xref="alg1.l21.m1.1.1.2.3.1.cmml"></mo><mi id="alg1.l21.m1.1.1.2.3.4" mathsize="90%" xref="alg1.l21.m1.1.1.2.3.4.cmml">P</mi></mrow></msub><mo id="alg1.l21.m1.1.1.3" mathsize="90%" xref="alg1.l21.m1.1.1.3.cmml">=</mo><msub id="alg1.l21.m1.1.1.4" xref="alg1.l21.m1.1.1.4.cmml"><mi id="alg1.l21.m1.1.1.4.2" mathsize="90%" xref="alg1.l21.m1.1.1.4.2.cmml">S</mi><mrow id="alg1.l21.m1.1.1.4.3" xref="alg1.l21.m1.1.1.4.3.cmml"><mi id="alg1.l21.m1.1.1.4.3.2" mathsize="90%" xref="alg1.l21.m1.1.1.4.3.2.cmml">P</mi><mo id="alg1.l21.m1.1.1.4.3.1" xref="alg1.l21.m1.1.1.4.3.1.cmml"></mo><mi id="alg1.l21.m1.1.1.4.3.3" mathsize="90%" xref="alg1.l21.m1.1.1.4.3.3.cmml">O</mi><mo id="alg1.l21.m1.1.1.4.3.1a" xref="alg1.l21.m1.1.1.4.3.1.cmml"></mo><mi id="alg1.l21.m1.1.1.4.3.4" mathsize="90%" xref="alg1.l21.m1.1.1.4.3.4.cmml">L</mi><mo id="alg1.l21.m1.1.1.4.3.1b" xref="alg1.l21.m1.1.1.4.3.1.cmml"></mo><mi id="alg1.l21.m1.1.1.4.3.5" mathsize="90%" xref="alg1.l21.m1.1.1.4.3.5.cmml">Y</mi></mrow></msub><mo id="alg1.l21.m1.1.1.5" lspace="0.278em" mathsize="90%" rspace="0.278em" xref="alg1.l21.m1.1.1.5.cmml">>></mo><mi id="alg1.l21.m1.1.1.6" mathsize="90%" xref="alg1.l21.m1.1.1.6.cmml">Q</mi></mrow><annotation-xml encoding="MathML-Content" id="alg1.l21.m1.1b"><apply id="alg1.l21.m1.1.1.cmml" xref="alg1.l21.m1.1.1"><and id="alg1.l21.m1.1.1a.cmml" xref="alg1.l21.m1.1.1"></and><apply id="alg1.l21.m1.1.1b.cmml" xref="alg1.l21.m1.1.1"><eq id="alg1.l21.m1.1.1.3.cmml" xref="alg1.l21.m1.1.1.3"></eq><apply id="alg1.l21.m1.1.1.2.cmml" xref="alg1.l21.m1.1.1.2"><csymbol cd="ambiguous" id="alg1.l21.m1.1.1.2.1.cmml" xref="alg1.l21.m1.1.1.2">subscript</csymbol><ci id="alg1.l21.m1.1.1.2.2.cmml" xref="alg1.l21.m1.1.1.2.2">𝑆</ci><apply id="alg1.l21.m1.1.1.2.3.cmml" xref="alg1.l21.m1.1.1.2.3"><times id="alg1.l21.m1.1.1.2.3.1.cmml" xref="alg1.l21.m1.1.1.2.3.1"></times><ci id="alg1.l21.m1.1.1.2.3.2.cmml" xref="alg1.l21.m1.1.1.2.3.2">𝐸</ci><ci id="alg1.l21.m1.1.1.2.3.3.cmml" xref="alg1.l21.m1.1.1.2.3.3">𝑋</ci><ci 
id="alg1.l21.m1.1.1.2.3.4.cmml" xref="alg1.l21.m1.1.1.2.3.4">𝑃</ci></apply></apply><apply id="alg1.l21.m1.1.1.4.cmml" xref="alg1.l21.m1.1.1.4"><csymbol cd="ambiguous" id="alg1.l21.m1.1.1.4.1.cmml" xref="alg1.l21.m1.1.1.4">subscript</csymbol><ci id="alg1.l21.m1.1.1.4.2.cmml" xref="alg1.l21.m1.1.1.4.2">𝑆</ci><apply id="alg1.l21.m1.1.1.4.3.cmml" xref="alg1.l21.m1.1.1.4.3"><times id="alg1.l21.m1.1.1.4.3.1.cmml" xref="alg1.l21.m1.1.1.4.3.1"></times><ci id="alg1.l21.m1.1.1.4.3.2.cmml" xref="alg1.l21.m1.1.1.4.3.2">𝑃</ci><ci id="alg1.l21.m1.1.1.4.3.3.cmml" xref="alg1.l21.m1.1.1.4.3.3">𝑂</ci><ci id="alg1.l21.m1.1.1.4.3.4.cmml" xref="alg1.l21.m1.1.1.4.3.4">𝐿</ci><ci id="alg1.l21.m1.1.1.4.3.5.cmml" xref="alg1.l21.m1.1.1.4.3.5">𝑌</ci></apply></apply></apply><apply id="alg1.l21.m1.1.1c.cmml" xref="alg1.l21.m1.1.1"><csymbol cd="latexml" id="alg1.l21.m1.1.1.5.cmml" xref="alg1.l21.m1.1.1.5">much-greater-than</csymbol><share href="https://arxiv.org/html/2411.14733v1#alg1.l21.m1.1.1.4.cmml" id="alg1.l21.m1.1.1d.cmml" xref="alg1.l21.m1.1.1"></share><ci id="alg1.l21.m1.1.1.6.cmml" xref="alg1.l21.m1.1.1.6">𝑄</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l21.m1.1c">~{}~{}~{}~{}S_{EXP}=S_{POLY}>>Q</annotation><annotation encoding="application/x-llamapun" id="alg1.l21.m1.1d">italic_S start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_P italic_O italic_L italic_Y end_POSTSUBSCRIPT > > italic_Q</annotation></semantics></math><span class="ltx_text" id="alg1.l21.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l22"> <span class="ltx_text" id="alg1.l22.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}\textbf{return}~{}r_{EXP},S_{EXP}" class="ltx_Math" display="inline" id="alg1.l22.m1.2"><semantics id="alg1.l22.m1.2a"><mrow id="alg1.l22.m1.2.2.2" xref="alg1.l22.m1.2.2.3.cmml"><mrow id="alg1.l22.m1.1.1.1.1" xref="alg1.l22.m1.1.1.1.1.cmml"><mtext 
class="ltx_mathvariant_bold" id="alg1.l22.m1.1.1.1.1.2" mathsize="90%" xref="alg1.l22.m1.1.1.1.1.2a.cmml">return</mtext><mo id="alg1.l22.m1.1.1.1.1.1" lspace="0.300em" xref="alg1.l22.m1.1.1.1.1.1.cmml"></mo><msub id="alg1.l22.m1.1.1.1.1.3" xref="alg1.l22.m1.1.1.1.1.3.cmml"><mi id="alg1.l22.m1.1.1.1.1.3.2" mathsize="90%" xref="alg1.l22.m1.1.1.1.1.3.2.cmml">r</mi><mrow id="alg1.l22.m1.1.1.1.1.3.3" xref="alg1.l22.m1.1.1.1.1.3.3.cmml"><mi id="alg1.l22.m1.1.1.1.1.3.3.2" mathsize="90%" xref="alg1.l22.m1.1.1.1.1.3.3.2.cmml">E</mi><mo id="alg1.l22.m1.1.1.1.1.3.3.1" xref="alg1.l22.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l22.m1.1.1.1.1.3.3.3" mathsize="90%" xref="alg1.l22.m1.1.1.1.1.3.3.3.cmml">X</mi><mo id="alg1.l22.m1.1.1.1.1.3.3.1a" xref="alg1.l22.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l22.m1.1.1.1.1.3.3.4" mathsize="90%" xref="alg1.l22.m1.1.1.1.1.3.3.4.cmml">P</mi></mrow></msub></mrow><mo id="alg1.l22.m1.2.2.2.3" mathsize="90%" xref="alg1.l22.m1.2.2.3.cmml">,</mo><msub id="alg1.l22.m1.2.2.2.2" xref="alg1.l22.m1.2.2.2.2.cmml"><mi id="alg1.l22.m1.2.2.2.2.2" mathsize="90%" xref="alg1.l22.m1.2.2.2.2.2.cmml">S</mi><mrow id="alg1.l22.m1.2.2.2.2.3" xref="alg1.l22.m1.2.2.2.2.3.cmml"><mi id="alg1.l22.m1.2.2.2.2.3.2" mathsize="90%" xref="alg1.l22.m1.2.2.2.2.3.2.cmml">E</mi><mo id="alg1.l22.m1.2.2.2.2.3.1" xref="alg1.l22.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l22.m1.2.2.2.2.3.3" mathsize="90%" xref="alg1.l22.m1.2.2.2.2.3.3.cmml">X</mi><mo id="alg1.l22.m1.2.2.2.2.3.1a" xref="alg1.l22.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l22.m1.2.2.2.2.3.4" mathsize="90%" xref="alg1.l22.m1.2.2.2.2.3.4.cmml">P</mi></mrow></msub></mrow><annotation-xml encoding="MathML-Content" id="alg1.l22.m1.2b"><list id="alg1.l22.m1.2.2.3.cmml" xref="alg1.l22.m1.2.2.2"><apply id="alg1.l22.m1.1.1.1.1.cmml" xref="alg1.l22.m1.1.1.1.1"><times id="alg1.l22.m1.1.1.1.1.1.cmml" xref="alg1.l22.m1.1.1.1.1.1"></times><ci id="alg1.l22.m1.1.1.1.1.2a.cmml" xref="alg1.l22.m1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" 
id="alg1.l22.m1.1.1.1.1.2.cmml" mathsize="90%" xref="alg1.l22.m1.1.1.1.1.2">return</mtext></ci><apply id="alg1.l22.m1.1.1.1.1.3.cmml" xref="alg1.l22.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="alg1.l22.m1.1.1.1.1.3.1.cmml" xref="alg1.l22.m1.1.1.1.1.3">subscript</csymbol><ci id="alg1.l22.m1.1.1.1.1.3.2.cmml" xref="alg1.l22.m1.1.1.1.1.3.2">𝑟</ci><apply id="alg1.l22.m1.1.1.1.1.3.3.cmml" xref="alg1.l22.m1.1.1.1.1.3.3"><times id="alg1.l22.m1.1.1.1.1.3.3.1.cmml" xref="alg1.l22.m1.1.1.1.1.3.3.1"></times><ci id="alg1.l22.m1.1.1.1.1.3.3.2.cmml" xref="alg1.l22.m1.1.1.1.1.3.3.2">𝐸</ci><ci id="alg1.l22.m1.1.1.1.1.3.3.3.cmml" xref="alg1.l22.m1.1.1.1.1.3.3.3">𝑋</ci><ci id="alg1.l22.m1.1.1.1.1.3.3.4.cmml" xref="alg1.l22.m1.1.1.1.1.3.3.4">𝑃</ci></apply></apply></apply><apply id="alg1.l22.m1.2.2.2.2.cmml" xref="alg1.l22.m1.2.2.2.2"><csymbol cd="ambiguous" id="alg1.l22.m1.2.2.2.2.1.cmml" xref="alg1.l22.m1.2.2.2.2">subscript</csymbol><ci id="alg1.l22.m1.2.2.2.2.2.cmml" xref="alg1.l22.m1.2.2.2.2.2">𝑆</ci><apply id="alg1.l22.m1.2.2.2.2.3.cmml" xref="alg1.l22.m1.2.2.2.2.3"><times id="alg1.l22.m1.2.2.2.2.3.1.cmml" xref="alg1.l22.m1.2.2.2.2.3.1"></times><ci id="alg1.l22.m1.2.2.2.2.3.2.cmml" xref="alg1.l22.m1.2.2.2.2.3.2">𝐸</ci><ci id="alg1.l22.m1.2.2.2.2.3.3.cmml" xref="alg1.l22.m1.2.2.2.2.3.3">𝑋</ci><ci id="alg1.l22.m1.2.2.2.2.3.4.cmml" xref="alg1.l22.m1.2.2.2.2.3.4">𝑃</ci></apply></apply></list></annotation-xml><annotation encoding="application/x-tex" id="alg1.l22.m1.2c">~{}~{}~{}~{}\textbf{return}~{}r_{EXP},S_{EXP}</annotation><annotation encoding="application/x-llamapun" id="alg1.l22.m1.2d">return italic_r start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l22.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l23"> <span class="ltx_text" id="alg1.l23.1" style="font-size:90%;"> </span><span class="ltx_text" 
id="alg1.l23.2" style="font-size:90%;"> </span><span class="ltx_text" id="alg1.l23.3" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l24"> <span class="ltx_text" id="alg1.l24.1" style="font-size:90%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l24.2" style="font-size:90%;">function</span><span class="ltx_text" id="alg1.l24.3" style="font-size:90%;"> VDR_Norm(x, S) </span> </div> <div class="ltx_listingline" id="alg1.l25"> <span class="ltx_text" id="alg1.l25.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}x_{sub}=x-\text{max}(x)" class="ltx_Math" display="inline" id="alg1.l25.m1.1"><semantics id="alg1.l25.m1.1a"><mrow id="alg1.l25.m1.1.2" xref="alg1.l25.m1.1.2.cmml"><msub id="alg1.l25.m1.1.2.2" xref="alg1.l25.m1.1.2.2.cmml"><mi id="alg1.l25.m1.1.2.2.2" mathsize="90%" xref="alg1.l25.m1.1.2.2.2.cmml">x</mi><mrow id="alg1.l25.m1.1.2.2.3" xref="alg1.l25.m1.1.2.2.3.cmml"><mi id="alg1.l25.m1.1.2.2.3.2" mathsize="90%" xref="alg1.l25.m1.1.2.2.3.2.cmml">s</mi><mo id="alg1.l25.m1.1.2.2.3.1" xref="alg1.l25.m1.1.2.2.3.1.cmml"></mo><mi id="alg1.l25.m1.1.2.2.3.3" mathsize="90%" xref="alg1.l25.m1.1.2.2.3.3.cmml">u</mi><mo id="alg1.l25.m1.1.2.2.3.1a" xref="alg1.l25.m1.1.2.2.3.1.cmml"></mo><mi id="alg1.l25.m1.1.2.2.3.4" mathsize="90%" xref="alg1.l25.m1.1.2.2.3.4.cmml">b</mi></mrow></msub><mo id="alg1.l25.m1.1.2.1" mathsize="90%" xref="alg1.l25.m1.1.2.1.cmml">=</mo><mrow id="alg1.l25.m1.1.2.3" xref="alg1.l25.m1.1.2.3.cmml"><mi id="alg1.l25.m1.1.2.3.2" mathsize="90%" xref="alg1.l25.m1.1.2.3.2.cmml">x</mi><mo id="alg1.l25.m1.1.2.3.1" mathsize="90%" xref="alg1.l25.m1.1.2.3.1.cmml">−</mo><mrow id="alg1.l25.m1.1.2.3.3" xref="alg1.l25.m1.1.2.3.3.cmml"><mtext id="alg1.l25.m1.1.2.3.3.2" mathsize="90%" xref="alg1.l25.m1.1.2.3.3.2a.cmml">max</mtext><mo id="alg1.l25.m1.1.2.3.3.1" xref="alg1.l25.m1.1.2.3.3.1.cmml"></mo><mrow id="alg1.l25.m1.1.2.3.3.3.2" xref="alg1.l25.m1.1.2.3.3.cmml"><mo id="alg1.l25.m1.1.2.3.3.3.2.1" maxsize="90%" 
minsize="90%" xref="alg1.l25.m1.1.2.3.3.cmml">(</mo><mi id="alg1.l25.m1.1.1" mathsize="90%" xref="alg1.l25.m1.1.1.cmml">x</mi><mo id="alg1.l25.m1.1.2.3.3.3.2.2" maxsize="90%" minsize="90%" xref="alg1.l25.m1.1.2.3.3.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l25.m1.1b"><apply id="alg1.l25.m1.1.2.cmml" xref="alg1.l25.m1.1.2"><eq id="alg1.l25.m1.1.2.1.cmml" xref="alg1.l25.m1.1.2.1"></eq><apply id="alg1.l25.m1.1.2.2.cmml" xref="alg1.l25.m1.1.2.2"><csymbol cd="ambiguous" id="alg1.l25.m1.1.2.2.1.cmml" xref="alg1.l25.m1.1.2.2">subscript</csymbol><ci id="alg1.l25.m1.1.2.2.2.cmml" xref="alg1.l25.m1.1.2.2.2">𝑥</ci><apply id="alg1.l25.m1.1.2.2.3.cmml" xref="alg1.l25.m1.1.2.2.3"><times id="alg1.l25.m1.1.2.2.3.1.cmml" xref="alg1.l25.m1.1.2.2.3.1"></times><ci id="alg1.l25.m1.1.2.2.3.2.cmml" xref="alg1.l25.m1.1.2.2.3.2">𝑠</ci><ci id="alg1.l25.m1.1.2.2.3.3.cmml" xref="alg1.l25.m1.1.2.2.3.3">𝑢</ci><ci id="alg1.l25.m1.1.2.2.3.4.cmml" xref="alg1.l25.m1.1.2.2.3.4">𝑏</ci></apply></apply><apply id="alg1.l25.m1.1.2.3.cmml" xref="alg1.l25.m1.1.2.3"><minus id="alg1.l25.m1.1.2.3.1.cmml" xref="alg1.l25.m1.1.2.3.1"></minus><ci id="alg1.l25.m1.1.2.3.2.cmml" xref="alg1.l25.m1.1.2.3.2">𝑥</ci><apply id="alg1.l25.m1.1.2.3.3.cmml" xref="alg1.l25.m1.1.2.3.3"><times id="alg1.l25.m1.1.2.3.3.1.cmml" xref="alg1.l25.m1.1.2.3.3.1"></times><ci id="alg1.l25.m1.1.2.3.3.2a.cmml" xref="alg1.l25.m1.1.2.3.3.2"><mtext id="alg1.l25.m1.1.2.3.3.2.cmml" mathsize="90%" xref="alg1.l25.m1.1.2.3.3.2">max</mtext></ci><ci id="alg1.l25.m1.1.1.cmml" xref="alg1.l25.m1.1.1">𝑥</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l25.m1.1c">~{}~{}~{}~{}x_{sub}=x-\text{max}(x)</annotation><annotation encoding="application/x-llamapun" id="alg1.l25.m1.1d">italic_x start_POSTSUBSCRIPT italic_s italic_u italic_b end_POSTSUBSCRIPT = italic_x - max ( italic_x )</annotation></semantics></math><span class="ltx_text" id="alg1.l25.2" 
style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l26"> <span class="ltx_text" id="alg1.l26.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}r_{EXP},S_{EXP}=iEXP(x_{sub},S)" class="ltx_Math" display="inline" id="alg1.l26.m1.4"><semantics id="alg1.l26.m1.4a"><mrow id="alg1.l26.m1.4.4" xref="alg1.l26.m1.4.4.cmml"><mrow id="alg1.l26.m1.3.3.2.2" xref="alg1.l26.m1.3.3.2.3.cmml"><msub id="alg1.l26.m1.2.2.1.1.1" xref="alg1.l26.m1.2.2.1.1.1.cmml"><mi id="alg1.l26.m1.2.2.1.1.1.2" mathsize="90%" xref="alg1.l26.m1.2.2.1.1.1.2.cmml">r</mi><mrow id="alg1.l26.m1.2.2.1.1.1.3" xref="alg1.l26.m1.2.2.1.1.1.3.cmml"><mi id="alg1.l26.m1.2.2.1.1.1.3.2" mathsize="90%" xref="alg1.l26.m1.2.2.1.1.1.3.2.cmml">E</mi><mo id="alg1.l26.m1.2.2.1.1.1.3.1" xref="alg1.l26.m1.2.2.1.1.1.3.1.cmml"></mo><mi id="alg1.l26.m1.2.2.1.1.1.3.3" mathsize="90%" xref="alg1.l26.m1.2.2.1.1.1.3.3.cmml">X</mi><mo id="alg1.l26.m1.2.2.1.1.1.3.1a" xref="alg1.l26.m1.2.2.1.1.1.3.1.cmml"></mo><mi id="alg1.l26.m1.2.2.1.1.1.3.4" mathsize="90%" xref="alg1.l26.m1.2.2.1.1.1.3.4.cmml">P</mi></mrow></msub><mo id="alg1.l26.m1.3.3.2.2.3" mathsize="90%" xref="alg1.l26.m1.3.3.2.3.cmml">,</mo><msub id="alg1.l26.m1.3.3.2.2.2" xref="alg1.l26.m1.3.3.2.2.2.cmml"><mi id="alg1.l26.m1.3.3.2.2.2.2" mathsize="90%" xref="alg1.l26.m1.3.3.2.2.2.2.cmml">S</mi><mrow id="alg1.l26.m1.3.3.2.2.2.3" xref="alg1.l26.m1.3.3.2.2.2.3.cmml"><mi id="alg1.l26.m1.3.3.2.2.2.3.2" mathsize="90%" xref="alg1.l26.m1.3.3.2.2.2.3.2.cmml">E</mi><mo id="alg1.l26.m1.3.3.2.2.2.3.1" xref="alg1.l26.m1.3.3.2.2.2.3.1.cmml"></mo><mi id="alg1.l26.m1.3.3.2.2.2.3.3" mathsize="90%" xref="alg1.l26.m1.3.3.2.2.2.3.3.cmml">X</mi><mo id="alg1.l26.m1.3.3.2.2.2.3.1a" xref="alg1.l26.m1.3.3.2.2.2.3.1.cmml"></mo><mi id="alg1.l26.m1.3.3.2.2.2.3.4" mathsize="90%" xref="alg1.l26.m1.3.3.2.2.2.3.4.cmml">P</mi></mrow></msub></mrow><mo id="alg1.l26.m1.4.4.4" mathsize="90%" xref="alg1.l26.m1.4.4.4.cmml">=</mo><mrow id="alg1.l26.m1.4.4.3" 
xref="alg1.l26.m1.4.4.3.cmml"><mi id="alg1.l26.m1.4.4.3.3" mathsize="90%" xref="alg1.l26.m1.4.4.3.3.cmml">i</mi><mo id="alg1.l26.m1.4.4.3.2" xref="alg1.l26.m1.4.4.3.2.cmml"></mo><mi id="alg1.l26.m1.4.4.3.4" mathsize="90%" xref="alg1.l26.m1.4.4.3.4.cmml">E</mi><mo id="alg1.l26.m1.4.4.3.2a" xref="alg1.l26.m1.4.4.3.2.cmml"></mo><mi id="alg1.l26.m1.4.4.3.5" mathsize="90%" xref="alg1.l26.m1.4.4.3.5.cmml">X</mi><mo id="alg1.l26.m1.4.4.3.2b" xref="alg1.l26.m1.4.4.3.2.cmml"></mo><mi id="alg1.l26.m1.4.4.3.6" mathsize="90%" xref="alg1.l26.m1.4.4.3.6.cmml">P</mi><mo id="alg1.l26.m1.4.4.3.2c" xref="alg1.l26.m1.4.4.3.2.cmml"></mo><mrow id="alg1.l26.m1.4.4.3.1.1" xref="alg1.l26.m1.4.4.3.1.2.cmml"><mo id="alg1.l26.m1.4.4.3.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l26.m1.4.4.3.1.2.cmml">(</mo><msub id="alg1.l26.m1.4.4.3.1.1.1" xref="alg1.l26.m1.4.4.3.1.1.1.cmml"><mi id="alg1.l26.m1.4.4.3.1.1.1.2" mathsize="90%" xref="alg1.l26.m1.4.4.3.1.1.1.2.cmml">x</mi><mrow id="alg1.l26.m1.4.4.3.1.1.1.3" xref="alg1.l26.m1.4.4.3.1.1.1.3.cmml"><mi id="alg1.l26.m1.4.4.3.1.1.1.3.2" mathsize="90%" xref="alg1.l26.m1.4.4.3.1.1.1.3.2.cmml">s</mi><mo id="alg1.l26.m1.4.4.3.1.1.1.3.1" xref="alg1.l26.m1.4.4.3.1.1.1.3.1.cmml"></mo><mi id="alg1.l26.m1.4.4.3.1.1.1.3.3" mathsize="90%" xref="alg1.l26.m1.4.4.3.1.1.1.3.3.cmml">u</mi><mo id="alg1.l26.m1.4.4.3.1.1.1.3.1a" xref="alg1.l26.m1.4.4.3.1.1.1.3.1.cmml"></mo><mi id="alg1.l26.m1.4.4.3.1.1.1.3.4" mathsize="90%" xref="alg1.l26.m1.4.4.3.1.1.1.3.4.cmml">b</mi></mrow></msub><mo id="alg1.l26.m1.4.4.3.1.1.3" mathsize="90%" xref="alg1.l26.m1.4.4.3.1.2.cmml">,</mo><mi id="alg1.l26.m1.1.1" mathsize="90%" xref="alg1.l26.m1.1.1.cmml">S</mi><mo id="alg1.l26.m1.4.4.3.1.1.4" maxsize="90%" minsize="90%" xref="alg1.l26.m1.4.4.3.1.2.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l26.m1.4b"><apply id="alg1.l26.m1.4.4.cmml" xref="alg1.l26.m1.4.4"><eq id="alg1.l26.m1.4.4.4.cmml" xref="alg1.l26.m1.4.4.4"></eq><list 
id="alg1.l26.m1.3.3.2.3.cmml" xref="alg1.l26.m1.3.3.2.2"><apply id="alg1.l26.m1.2.2.1.1.1.cmml" xref="alg1.l26.m1.2.2.1.1.1"><csymbol cd="ambiguous" id="alg1.l26.m1.2.2.1.1.1.1.cmml" xref="alg1.l26.m1.2.2.1.1.1">subscript</csymbol><ci id="alg1.l26.m1.2.2.1.1.1.2.cmml" xref="alg1.l26.m1.2.2.1.1.1.2">𝑟</ci><apply id="alg1.l26.m1.2.2.1.1.1.3.cmml" xref="alg1.l26.m1.2.2.1.1.1.3"><times id="alg1.l26.m1.2.2.1.1.1.3.1.cmml" xref="alg1.l26.m1.2.2.1.1.1.3.1"></times><ci id="alg1.l26.m1.2.2.1.1.1.3.2.cmml" xref="alg1.l26.m1.2.2.1.1.1.3.2">𝐸</ci><ci id="alg1.l26.m1.2.2.1.1.1.3.3.cmml" xref="alg1.l26.m1.2.2.1.1.1.3.3">𝑋</ci><ci id="alg1.l26.m1.2.2.1.1.1.3.4.cmml" xref="alg1.l26.m1.2.2.1.1.1.3.4">𝑃</ci></apply></apply><apply id="alg1.l26.m1.3.3.2.2.2.cmml" xref="alg1.l26.m1.3.3.2.2.2"><csymbol cd="ambiguous" id="alg1.l26.m1.3.3.2.2.2.1.cmml" xref="alg1.l26.m1.3.3.2.2.2">subscript</csymbol><ci id="alg1.l26.m1.3.3.2.2.2.2.cmml" xref="alg1.l26.m1.3.3.2.2.2.2">𝑆</ci><apply id="alg1.l26.m1.3.3.2.2.2.3.cmml" xref="alg1.l26.m1.3.3.2.2.2.3"><times id="alg1.l26.m1.3.3.2.2.2.3.1.cmml" xref="alg1.l26.m1.3.3.2.2.2.3.1"></times><ci id="alg1.l26.m1.3.3.2.2.2.3.2.cmml" xref="alg1.l26.m1.3.3.2.2.2.3.2">𝐸</ci><ci id="alg1.l26.m1.3.3.2.2.2.3.3.cmml" xref="alg1.l26.m1.3.3.2.2.2.3.3">𝑋</ci><ci id="alg1.l26.m1.3.3.2.2.2.3.4.cmml" xref="alg1.l26.m1.3.3.2.2.2.3.4">𝑃</ci></apply></apply></list><apply id="alg1.l26.m1.4.4.3.cmml" xref="alg1.l26.m1.4.4.3"><times id="alg1.l26.m1.4.4.3.2.cmml" xref="alg1.l26.m1.4.4.3.2"></times><ci id="alg1.l26.m1.4.4.3.3.cmml" xref="alg1.l26.m1.4.4.3.3">𝑖</ci><ci id="alg1.l26.m1.4.4.3.4.cmml" xref="alg1.l26.m1.4.4.3.4">𝐸</ci><ci id="alg1.l26.m1.4.4.3.5.cmml" xref="alg1.l26.m1.4.4.3.5">𝑋</ci><ci id="alg1.l26.m1.4.4.3.6.cmml" xref="alg1.l26.m1.4.4.3.6">𝑃</ci><interval closure="open" id="alg1.l26.m1.4.4.3.1.2.cmml" xref="alg1.l26.m1.4.4.3.1.1"><apply id="alg1.l26.m1.4.4.3.1.1.1.cmml" xref="alg1.l26.m1.4.4.3.1.1.1"><csymbol cd="ambiguous" id="alg1.l26.m1.4.4.3.1.1.1.1.cmml" 
xref="alg1.l26.m1.4.4.3.1.1.1">subscript</csymbol><ci id="alg1.l26.m1.4.4.3.1.1.1.2.cmml" xref="alg1.l26.m1.4.4.3.1.1.1.2">𝑥</ci><apply id="alg1.l26.m1.4.4.3.1.1.1.3.cmml" xref="alg1.l26.m1.4.4.3.1.1.1.3"><times id="alg1.l26.m1.4.4.3.1.1.1.3.1.cmml" xref="alg1.l26.m1.4.4.3.1.1.1.3.1"></times><ci id="alg1.l26.m1.4.4.3.1.1.1.3.2.cmml" xref="alg1.l26.m1.4.4.3.1.1.1.3.2">𝑠</ci><ci id="alg1.l26.m1.4.4.3.1.1.1.3.3.cmml" xref="alg1.l26.m1.4.4.3.1.1.1.3.3">𝑢</ci><ci id="alg1.l26.m1.4.4.3.1.1.1.3.4.cmml" xref="alg1.l26.m1.4.4.3.1.1.1.3.4">𝑏</ci></apply></apply><ci id="alg1.l26.m1.1.1.cmml" xref="alg1.l26.m1.1.1">𝑆</ci></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l26.m1.4c">~{}~{}~{}~{}r_{EXP},S_{EXP}=iEXP(x_{sub},S)</annotation><annotation encoding="application/x-llamapun" id="alg1.l26.m1.4d">italic_r start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT = italic_i italic_E italic_X italic_P ( italic_x start_POSTSUBSCRIPT italic_s italic_u italic_b end_POSTSUBSCRIPT , italic_S )</annotation></semantics></math><span class="ltx_text" id="alg1.l26.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l27"> <span class="ltx_text" id="alg1.l27.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}x_{Softmax}=\text{eMSB-Q}(r_{EXP})" class="ltx_Math" display="inline" id="alg1.l27.m1.1"><semantics id="alg1.l27.m1.1a"><mrow id="alg1.l27.m1.1.1" xref="alg1.l27.m1.1.1.cmml"><msub id="alg1.l27.m1.1.1.3" xref="alg1.l27.m1.1.1.3.cmml"><mi id="alg1.l27.m1.1.1.3.2" mathsize="90%" xref="alg1.l27.m1.1.1.3.2.cmml">x</mi><mrow id="alg1.l27.m1.1.1.3.3" xref="alg1.l27.m1.1.1.3.3.cmml"><mi id="alg1.l27.m1.1.1.3.3.2" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.2.cmml">S</mi><mo id="alg1.l27.m1.1.1.3.3.1" xref="alg1.l27.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.3.3.3" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.3.cmml">o</mi><mo 
id="alg1.l27.m1.1.1.3.3.1a" xref="alg1.l27.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.3.3.4" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.4.cmml">f</mi><mo id="alg1.l27.m1.1.1.3.3.1b" xref="alg1.l27.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.3.3.5" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.5.cmml">t</mi><mo id="alg1.l27.m1.1.1.3.3.1c" xref="alg1.l27.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.3.3.6" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.6.cmml">m</mi><mo id="alg1.l27.m1.1.1.3.3.1d" xref="alg1.l27.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.3.3.7" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.7.cmml">a</mi><mo id="alg1.l27.m1.1.1.3.3.1e" xref="alg1.l27.m1.1.1.3.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.3.3.8" mathsize="90%" xref="alg1.l27.m1.1.1.3.3.8.cmml">x</mi></mrow></msub><mo id="alg1.l27.m1.1.1.2" mathsize="90%" xref="alg1.l27.m1.1.1.2.cmml">=</mo><mrow id="alg1.l27.m1.1.1.1" xref="alg1.l27.m1.1.1.1.cmml"><mtext id="alg1.l27.m1.1.1.1.3" mathsize="90%" xref="alg1.l27.m1.1.1.1.3a.cmml">eMSB-Q</mtext><mo id="alg1.l27.m1.1.1.1.2" xref="alg1.l27.m1.1.1.1.2.cmml"></mo><mrow id="alg1.l27.m1.1.1.1.1.1" xref="alg1.l27.m1.1.1.1.1.1.1.cmml"><mo id="alg1.l27.m1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="alg1.l27.m1.1.1.1.1.1.1.cmml">(</mo><msub id="alg1.l27.m1.1.1.1.1.1.1" xref="alg1.l27.m1.1.1.1.1.1.1.cmml"><mi id="alg1.l27.m1.1.1.1.1.1.1.2" mathsize="90%" xref="alg1.l27.m1.1.1.1.1.1.1.2.cmml">r</mi><mrow id="alg1.l27.m1.1.1.1.1.1.1.3" xref="alg1.l27.m1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l27.m1.1.1.1.1.1.1.3.2" mathsize="90%" xref="alg1.l27.m1.1.1.1.1.1.1.3.2.cmml">E</mi><mo id="alg1.l27.m1.1.1.1.1.1.1.3.1" xref="alg1.l27.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.1.1.1.1.3.3" mathsize="90%" xref="alg1.l27.m1.1.1.1.1.1.1.3.3.cmml">X</mi><mo id="alg1.l27.m1.1.1.1.1.1.1.3.1a" xref="alg1.l27.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l27.m1.1.1.1.1.1.1.3.4" mathsize="90%" xref="alg1.l27.m1.1.1.1.1.1.1.3.4.cmml">P</mi></mrow></msub><mo 
id="alg1.l27.m1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="alg1.l27.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l27.m1.1b"><apply id="alg1.l27.m1.1.1.cmml" xref="alg1.l27.m1.1.1"><eq id="alg1.l27.m1.1.1.2.cmml" xref="alg1.l27.m1.1.1.2"></eq><apply id="alg1.l27.m1.1.1.3.cmml" xref="alg1.l27.m1.1.1.3"><csymbol cd="ambiguous" id="alg1.l27.m1.1.1.3.1.cmml" xref="alg1.l27.m1.1.1.3">subscript</csymbol><ci id="alg1.l27.m1.1.1.3.2.cmml" xref="alg1.l27.m1.1.1.3.2">𝑥</ci><apply id="alg1.l27.m1.1.1.3.3.cmml" xref="alg1.l27.m1.1.1.3.3"><times id="alg1.l27.m1.1.1.3.3.1.cmml" xref="alg1.l27.m1.1.1.3.3.1"></times><ci id="alg1.l27.m1.1.1.3.3.2.cmml" xref="alg1.l27.m1.1.1.3.3.2">𝑆</ci><ci id="alg1.l27.m1.1.1.3.3.3.cmml" xref="alg1.l27.m1.1.1.3.3.3">𝑜</ci><ci id="alg1.l27.m1.1.1.3.3.4.cmml" xref="alg1.l27.m1.1.1.3.3.4">𝑓</ci><ci id="alg1.l27.m1.1.1.3.3.5.cmml" xref="alg1.l27.m1.1.1.3.3.5">𝑡</ci><ci id="alg1.l27.m1.1.1.3.3.6.cmml" xref="alg1.l27.m1.1.1.3.3.6">𝑚</ci><ci id="alg1.l27.m1.1.1.3.3.7.cmml" xref="alg1.l27.m1.1.1.3.3.7">𝑎</ci><ci id="alg1.l27.m1.1.1.3.3.8.cmml" xref="alg1.l27.m1.1.1.3.3.8">𝑥</ci></apply></apply><apply id="alg1.l27.m1.1.1.1.cmml" xref="alg1.l27.m1.1.1.1"><times id="alg1.l27.m1.1.1.1.2.cmml" xref="alg1.l27.m1.1.1.1.2"></times><ci id="alg1.l27.m1.1.1.1.3a.cmml" xref="alg1.l27.m1.1.1.1.3"><mtext id="alg1.l27.m1.1.1.1.3.cmml" mathsize="90%" xref="alg1.l27.m1.1.1.1.3">eMSB-Q</mtext></ci><apply id="alg1.l27.m1.1.1.1.1.1.1.cmml" xref="alg1.l27.m1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l27.m1.1.1.1.1.1.1.1.cmml" xref="alg1.l27.m1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l27.m1.1.1.1.1.1.1.2.cmml" xref="alg1.l27.m1.1.1.1.1.1.1.2">𝑟</ci><apply id="alg1.l27.m1.1.1.1.1.1.1.3.cmml" xref="alg1.l27.m1.1.1.1.1.1.1.3"><times id="alg1.l27.m1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l27.m1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l27.m1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l27.m1.1.1.1.1.1.1.3.2">𝐸</ci><ci 
id="alg1.l27.m1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l27.m1.1.1.1.1.1.1.3.3">𝑋</ci><ci id="alg1.l27.m1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l27.m1.1.1.1.1.1.1.3.4">𝑃</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l27.m1.1c">~{}~{}~{}~{}x_{Softmax}=\text{eMSB-Q}(r_{EXP})</annotation><annotation encoding="application/x-llamapun" id="alg1.l27.m1.1d">italic_x start_POSTSUBSCRIPT italic_S italic_o italic_f italic_t italic_m italic_a italic_x end_POSTSUBSCRIPT = eMSB-Q ( italic_r start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT )</annotation></semantics></math><span class="ltx_text" id="alg1.l27.2" style="font-size:90%;"> </span> </div> <div class="ltx_listingline" id="alg1.l28"> <span class="ltx_text" id="alg1.l28.1" style="font-size:90%;"> </span><math alttext="~{}~{}~{}~{}\textbf{return}~{}x_{Softmax},S_{EXP}" class="ltx_Math" display="inline" id="alg1.l28.m1.2"><semantics id="alg1.l28.m1.2a"><mrow id="alg1.l28.m1.2.2.2" xref="alg1.l28.m1.2.2.3.cmml"><mrow id="alg1.l28.m1.1.1.1.1" xref="alg1.l28.m1.1.1.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="alg1.l28.m1.1.1.1.1.2" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.2a.cmml">return</mtext><mo id="alg1.l28.m1.1.1.1.1.1" lspace="0.300em" xref="alg1.l28.m1.1.1.1.1.1.cmml"></mo><msub id="alg1.l28.m1.1.1.1.1.3" xref="alg1.l28.m1.1.1.1.1.3.cmml"><mi id="alg1.l28.m1.1.1.1.1.3.2" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.2.cmml">x</mi><mrow id="alg1.l28.m1.1.1.1.1.3.3" xref="alg1.l28.m1.1.1.1.1.3.3.cmml"><mi id="alg1.l28.m1.1.1.1.1.3.3.2" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.2.cmml">S</mi><mo id="alg1.l28.m1.1.1.1.1.3.3.1" xref="alg1.l28.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l28.m1.1.1.1.1.3.3.3" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.3.cmml">o</mi><mo id="alg1.l28.m1.1.1.1.1.3.3.1a" xref="alg1.l28.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l28.m1.1.1.1.1.3.3.4" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.4.cmml">f</mi><mo 
id="alg1.l28.m1.1.1.1.1.3.3.1b" xref="alg1.l28.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l28.m1.1.1.1.1.3.3.5" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.5.cmml">t</mi><mo id="alg1.l28.m1.1.1.1.1.3.3.1c" xref="alg1.l28.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l28.m1.1.1.1.1.3.3.6" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.6.cmml">m</mi><mo id="alg1.l28.m1.1.1.1.1.3.3.1d" xref="alg1.l28.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l28.m1.1.1.1.1.3.3.7" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.7.cmml">a</mi><mo id="alg1.l28.m1.1.1.1.1.3.3.1e" xref="alg1.l28.m1.1.1.1.1.3.3.1.cmml"></mo><mi id="alg1.l28.m1.1.1.1.1.3.3.8" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.3.3.8.cmml">x</mi></mrow></msub></mrow><mo id="alg1.l28.m1.2.2.2.3" mathsize="90%" xref="alg1.l28.m1.2.2.3.cmml">,</mo><msub id="alg1.l28.m1.2.2.2.2" xref="alg1.l28.m1.2.2.2.2.cmml"><mi id="alg1.l28.m1.2.2.2.2.2" mathsize="90%" xref="alg1.l28.m1.2.2.2.2.2.cmml">S</mi><mrow id="alg1.l28.m1.2.2.2.2.3" xref="alg1.l28.m1.2.2.2.2.3.cmml"><mi id="alg1.l28.m1.2.2.2.2.3.2" mathsize="90%" xref="alg1.l28.m1.2.2.2.2.3.2.cmml">E</mi><mo id="alg1.l28.m1.2.2.2.2.3.1" xref="alg1.l28.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l28.m1.2.2.2.2.3.3" mathsize="90%" xref="alg1.l28.m1.2.2.2.2.3.3.cmml">X</mi><mo id="alg1.l28.m1.2.2.2.2.3.1a" xref="alg1.l28.m1.2.2.2.2.3.1.cmml"></mo><mi id="alg1.l28.m1.2.2.2.2.3.4" mathsize="90%" xref="alg1.l28.m1.2.2.2.2.3.4.cmml">P</mi></mrow></msub></mrow><annotation-xml encoding="MathML-Content" id="alg1.l28.m1.2b"><list id="alg1.l28.m1.2.2.3.cmml" xref="alg1.l28.m1.2.2.2"><apply id="alg1.l28.m1.1.1.1.1.cmml" xref="alg1.l28.m1.1.1.1.1"><times id="alg1.l28.m1.1.1.1.1.1.cmml" xref="alg1.l28.m1.1.1.1.1.1"></times><ci id="alg1.l28.m1.1.1.1.1.2a.cmml" xref="alg1.l28.m1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" id="alg1.l28.m1.1.1.1.1.2.cmml" mathsize="90%" xref="alg1.l28.m1.1.1.1.1.2">return</mtext></ci><apply id="alg1.l28.m1.1.1.1.1.3.cmml" xref="alg1.l28.m1.1.1.1.1.3"><csymbol cd="ambiguous" 
id="alg1.l28.m1.1.1.1.1.3.1.cmml" xref="alg1.l28.m1.1.1.1.1.3">subscript</csymbol><ci id="alg1.l28.m1.1.1.1.1.3.2.cmml" xref="alg1.l28.m1.1.1.1.1.3.2">𝑥</ci><apply id="alg1.l28.m1.1.1.1.1.3.3.cmml" xref="alg1.l28.m1.1.1.1.1.3.3"><times id="alg1.l28.m1.1.1.1.1.3.3.1.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.1"></times><ci id="alg1.l28.m1.1.1.1.1.3.3.2.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.2">𝑆</ci><ci id="alg1.l28.m1.1.1.1.1.3.3.3.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.3">𝑜</ci><ci id="alg1.l28.m1.1.1.1.1.3.3.4.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.4">𝑓</ci><ci id="alg1.l28.m1.1.1.1.1.3.3.5.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.5">𝑡</ci><ci id="alg1.l28.m1.1.1.1.1.3.3.6.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.6">𝑚</ci><ci id="alg1.l28.m1.1.1.1.1.3.3.7.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.7">𝑎</ci><ci id="alg1.l28.m1.1.1.1.1.3.3.8.cmml" xref="alg1.l28.m1.1.1.1.1.3.3.8">𝑥</ci></apply></apply></apply><apply id="alg1.l28.m1.2.2.2.2.cmml" xref="alg1.l28.m1.2.2.2.2"><csymbol cd="ambiguous" id="alg1.l28.m1.2.2.2.2.1.cmml" xref="alg1.l28.m1.2.2.2.2">subscript</csymbol><ci id="alg1.l28.m1.2.2.2.2.2.cmml" xref="alg1.l28.m1.2.2.2.2.2">𝑆</ci><apply id="alg1.l28.m1.2.2.2.2.3.cmml" xref="alg1.l28.m1.2.2.2.2.3"><times id="alg1.l28.m1.2.2.2.2.3.1.cmml" xref="alg1.l28.m1.2.2.2.2.3.1"></times><ci id="alg1.l28.m1.2.2.2.2.3.2.cmml" xref="alg1.l28.m1.2.2.2.2.3.2">𝐸</ci><ci id="alg1.l28.m1.2.2.2.2.3.3.cmml" xref="alg1.l28.m1.2.2.2.2.3.3">𝑋</ci><ci id="alg1.l28.m1.2.2.2.2.3.4.cmml" xref="alg1.l28.m1.2.2.2.2.3.4">𝑃</ci></apply></apply></list></annotation-xml><annotation encoding="application/x-tex" id="alg1.l28.m1.2c">~{}~{}~{}~{}\textbf{return}~{}x_{Softmax},S_{EXP}</annotation><annotation encoding="application/x-llamapun" id="alg1.l28.m1.2d">return italic_x start_POSTSUBSCRIPT italic_S italic_o italic_f italic_t italic_m italic_a italic_x end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_E italic_X italic_P end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l28.2" 
style="font-size:90%;"> </span> </div> </div> <br class="ltx_break ltx_break"/> </figure> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.3. </span>Proposed VDR-Softmax Fused with eMSB-Q</h3> <div class="ltx_para" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.1">A simplified yet accurate implementation of the Softmax layer in PTQ requires a precise understanding of its characteristics and a more sophisticated integer-processing technique. The implementation involves several key steps, outlined in Algorithm <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#alg1" title="Algorithm 1 ‣ 4.2. Dequantization-Free PTQ Technique ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>. Our approach utilizes per-token eMSB information to ensure accurate yet FP-less processing and uses only shifting and parsing of integers to normalize and quantize the score into integer format.</p> </div> <div class="ltx_para" id="S4.SS3.p2"> <p class="ltx_p" id="S4.SS3.p2.9">While accommodating exponent information in processing the nonlinear layer is critical, as visualized in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F5" title="Figure 5 ‣ 3.1. Motivation for PTQ and Non-BLAS Layer Optimizations ‣ 3.
Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">5</span></a>-(c), directly multiplying/dividing the integer <math alttext="x" class="ltx_Math" display="inline" id="S4.SS3.p2.1.m1.1"><semantics id="S4.SS3.p2.1.m1.1a"><mi id="S4.SS3.p2.1.m1.1.1" xref="S4.SS3.p2.1.m1.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.1.m1.1b"><ci id="S4.SS3.p2.1.m1.1.1.cmml" xref="S4.SS3.p2.1.m1.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.1.m1.1c">x</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.1.m1.1d">italic_x</annotation></semantics></math> makes the number exceed the representable range of the integers with designated bit-precision, leading to loss of information. The key to handling Softmax processing with integer arithmetic lies in adapting the exponent indirectly while approximating the <math alttext="e^{x/n}" class="ltx_Math" display="inline" id="S4.SS3.p2.2.m2.1"><semantics id="S4.SS3.p2.2.m2.1a"><msup id="S4.SS3.p2.2.m2.1.1" xref="S4.SS3.p2.2.m2.1.1.cmml"><mi id="S4.SS3.p2.2.m2.1.1.2" xref="S4.SS3.p2.2.m2.1.1.2.cmml">e</mi><mrow id="S4.SS3.p2.2.m2.1.1.3" xref="S4.SS3.p2.2.m2.1.1.3.cmml"><mi id="S4.SS3.p2.2.m2.1.1.3.2" xref="S4.SS3.p2.2.m2.1.1.3.2.cmml">x</mi><mo id="S4.SS3.p2.2.m2.1.1.3.1" xref="S4.SS3.p2.2.m2.1.1.3.1.cmml">/</mo><mi id="S4.SS3.p2.2.m2.1.1.3.3" xref="S4.SS3.p2.2.m2.1.1.3.3.cmml">n</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.2.m2.1b"><apply id="S4.SS3.p2.2.m2.1.1.cmml" xref="S4.SS3.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.2.m2.1.1.1.cmml" xref="S4.SS3.p2.2.m2.1.1">superscript</csymbol><ci id="S4.SS3.p2.2.m2.1.1.2.cmml" xref="S4.SS3.p2.2.m2.1.1.2">𝑒</ci><apply id="S4.SS3.p2.2.m2.1.1.3.cmml" xref="S4.SS3.p2.2.m2.1.1.3"><divide id="S4.SS3.p2.2.m2.1.1.3.1.cmml" xref="S4.SS3.p2.2.m2.1.1.3.1"></divide><ci 
id="S4.SS3.p2.2.m2.1.1.3.2.cmml" xref="S4.SS3.p2.2.m2.1.1.3.2">𝑥</ci><ci id="S4.SS3.p2.2.m2.1.1.3.3.cmml" xref="S4.SS3.p2.2.m2.1.1.3.3">𝑛</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.2.m2.1c">e^{x/n}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.2.m2.1d">italic_e start_POSTSUPERSCRIPT italic_x / italic_n end_POSTSUPERSCRIPT</annotation></semantics></math> function, where <math alttext="x" class="ltx_Math" display="inline" id="S4.SS3.p2.3.m3.1"><semantics id="S4.SS3.p2.3.m3.1a"><mi id="S4.SS3.p2.3.m3.1.1" xref="S4.SS3.p2.3.m3.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.3.m3.1b"><ci id="S4.SS3.p2.3.m3.1.1.cmml" xref="S4.SS3.p2.3.m3.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.3.m3.1c">x</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.3.m3.1d">italic_x</annotation></semantics></math> is an integer-format input value and <math alttext="n" class="ltx_Math" display="inline" id="S4.SS3.p2.4.m4.1"><semantics id="S4.SS3.p2.4.m4.1a"><mi id="S4.SS3.p2.4.m4.1.1" xref="S4.SS3.p2.4.m4.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.4.m4.1b"><ci id="S4.SS3.p2.4.m4.1.1.cmml" xref="S4.SS3.p2.4.m4.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.4.m4.1c">n</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.4.m4.1d">italic_n</annotation></semantics></math> is a representation of the exponent. 
To accomplish that, we adjust the base (<math alttext="e" class="ltx_Math" display="inline" id="S4.SS3.p2.5.m5.1"><semantics id="S4.SS3.p2.5.m5.1a"><mi id="S4.SS3.p2.5.m5.1.1" xref="S4.SS3.p2.5.m5.1.1.cmml">e</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.5.m5.1b"><ci id="S4.SS3.p2.5.m5.1.1.cmml" xref="S4.SS3.p2.5.m5.1.1">𝑒</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.5.m5.1c">e</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.5.m5.1d">italic_e</annotation></semantics></math>) of the exponential function from <math alttext="e" class="ltx_Math" display="inline" id="S4.SS3.p2.6.m6.1"><semantics id="S4.SS3.p2.6.m6.1a"><mi id="S4.SS3.p2.6.m6.1.1" xref="S4.SS3.p2.6.m6.1.1.cmml">e</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.6.m6.1b"><ci id="S4.SS3.p2.6.m6.1.1.cmml" xref="S4.SS3.p2.6.m6.1.1">𝑒</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.6.m6.1c">e</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.6.m6.1d">italic_e</annotation></semantics></math> to <math alttext="{}^{n}\sqrt{e}" class="ltx_Math" display="inline" id="S4.SS3.p2.7.m7.1"><semantics id="S4.SS3.p2.7.m7.1a"><mmultiscripts id="S4.SS3.p2.7.m7.1.1" xref="S4.SS3.p2.7.m7.1.1.cmml"><msqrt id="S4.SS3.p2.7.m7.1.1.2" xref="S4.SS3.p2.7.m7.1.1.2.cmml"><mi id="S4.SS3.p2.7.m7.1.1.2.2" xref="S4.SS3.p2.7.m7.1.1.2.2.cmml">e</mi></msqrt><mprescripts id="S4.SS3.p2.7.m7.1.1a" xref="S4.SS3.p2.7.m7.1.1.cmml"></mprescripts><mrow id="S4.SS3.p2.7.m7.1.1b" xref="S4.SS3.p2.7.m7.1.1.cmml"></mrow><mi id="S4.SS3.p2.7.m7.1.1.3" xref="S4.SS3.p2.7.m7.1.1.3.cmml">n</mi></mmultiscripts><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.7.m7.1b"><apply id="S4.SS3.p2.7.m7.1.1.cmml" xref="S4.SS3.p2.7.m7.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.7.m7.1.1.1.cmml" xref="S4.SS3.p2.7.m7.1.1">superscript</csymbol><apply id="S4.SS3.p2.7.m7.1.1.2.cmml" xref="S4.SS3.p2.7.m7.1.1.2"><root 
id="S4.SS3.p2.7.m7.1.1.2a.cmml" xref="S4.SS3.p2.7.m7.1.1.2"></root><ci id="S4.SS3.p2.7.m7.1.1.2.2.cmml" xref="S4.SS3.p2.7.m7.1.1.2.2">𝑒</ci></apply><ci id="S4.SS3.p2.7.m7.1.1.3.cmml" xref="S4.SS3.p2.7.m7.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.7.m7.1c">{}^{n}\sqrt{e}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.7.m7.1d">start_FLOATSUPERSCRIPT italic_n end_FLOATSUPERSCRIPT square-root start_ARG italic_e end_ARG</annotation></semantics></math>, rather than adjusting the integer-represented <math alttext="x" class="ltx_Math" display="inline" id="S4.SS3.p2.8.m8.1"><semantics id="S4.SS3.p2.8.m8.1a"><mi id="S4.SS3.p2.8.m8.1.1" xref="S4.SS3.p2.8.m8.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.8.m8.1b"><ci id="S4.SS3.p2.8.m8.1.1.cmml" xref="S4.SS3.p2.8.m8.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.8.m8.1c">x</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.8.m8.1d">italic_x</annotation></semantics></math> in order to avoid the loss of information. 
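To illustrate why the base adjustment preserves information, the following is a minimal Python sketch; the function name, the power-of-two choice <em>n</em> = 2<sup>k</sup>, and the Q16 fixed-point format are our assumptions for illustration, not the paper's exact iEXP circuit:

```python
import math

def exp_base_adjusted(x_int: int, k: int, frac_bits: int = 16) -> int:
    """Approximate e**(x_int / 2**k) for a non-positive integer x_int
    without ever dividing x_int.

    The base is changed from e to e**(1 / 2**k); raising that base to
    the integer power |x_int| is done bitwise, folding in a precomputed
    fixed-point constant e**(-(2**i) / 2**k) for every set bit i.
    """
    assert x_int <= 0, "inputs are max-subtracted, hence non-positive"
    one = 1 << frac_bits
    n, acc, i = -x_int, one, 0
    while (1 << i) <= n:
        if n & (1 << i):
            # In hardware this is a small ROM lookup, not a call to exp().
            c = round(math.exp(-(1 << i) / (1 << k)) * one)
            acc = (acc * c) >> frac_bits  # fixed-point multiply
        i += 1
    return acc  # result with frac_bits fractional bits
```

For example, with k = 4 and x_int = -16 the result approximates e<sup>-1</sup> in Q16 fixed point; the integer input is never divided, and all arithmetic reduces to shifts, table lookups, and integer multiplies.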
We adopt well-established <math alttext="e^{x}" class="ltx_Math" display="inline" id="S4.SS3.p2.9.m9.1"><semantics id="S4.SS3.p2.9.m9.1a"><msup id="S4.SS3.p2.9.m9.1.1" xref="S4.SS3.p2.9.m9.1.1.cmml"><mi id="S4.SS3.p2.9.m9.1.1.2" xref="S4.SS3.p2.9.m9.1.1.2.cmml">e</mi><mi id="S4.SS3.p2.9.m9.1.1.3" xref="S4.SS3.p2.9.m9.1.1.3.cmml">x</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.9.m9.1b"><apply id="S4.SS3.p2.9.m9.1.1.cmml" xref="S4.SS3.p2.9.m9.1.1"><csymbol cd="ambiguous" id="S4.SS3.p2.9.m9.1.1.1.cmml" xref="S4.SS3.p2.9.m9.1.1">superscript</csymbol><ci id="S4.SS3.p2.9.m9.1.1.2.cmml" xref="S4.SS3.p2.9.m9.1.1.2">𝑒</ci><ci id="S4.SS3.p2.9.m9.1.1.3.cmml" xref="S4.SS3.p2.9.m9.1.1.3">𝑥</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.9.m9.1c">e^{x}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.9.m9.1d">italic_e start_POSTSUPERSCRIPT italic_x end_POSTSUPERSCRIPT</annotation></semantics></math>-approximating functions from previous works <cite class="ltx_cite ltx_citemacro_citep">(Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib33" title="">2021</a>; Li and Gu, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib38" title="">2023</a>)</cite>. 
This adjustment is accomplished by modifying the internal parameters of the proposed method, without requiring division arithmetic.</p> </div> <div class="ltx_para" id="S4.SS3.p3"> <p class="ltx_p" id="S4.SS3.p3.2">Additionally, the base adjustment along the exponentiation process - from <math alttext="e" class="ltx_Math" display="inline" id="S4.SS3.p3.1.m1.1"><semantics id="S4.SS3.p3.1.m1.1a"><mi id="S4.SS3.p3.1.m1.1.1" xref="S4.SS3.p3.1.m1.1.1.cmml">e</mi><annotation-xml encoding="MathML-Content" id="S4.SS3.p3.1.m1.1b"><ci id="S4.SS3.p3.1.m1.1.1.cmml" xref="S4.SS3.p3.1.m1.1.1">𝑒</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p3.1.m1.1c">e</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p3.1.m1.1d">italic_e</annotation></semantics></math> to <math alttext="{}^{n}\sqrt{e}" class="ltx_Math" display="inline" id="S4.SS3.p3.2.m2.1"><semantics id="S4.SS3.p3.2.m2.1a"><mmultiscripts id="S4.SS3.p3.2.m2.1.1" xref="S4.SS3.p3.2.m2.1.1.cmml"><msqrt id="S4.SS3.p3.2.m2.1.1.2" xref="S4.SS3.p3.2.m2.1.1.2.cmml"><mi id="S4.SS3.p3.2.m2.1.1.2.2" xref="S4.SS3.p3.2.m2.1.1.2.2.cmml">e</mi></msqrt><mprescripts id="S4.SS3.p3.2.m2.1.1a" xref="S4.SS3.p3.2.m2.1.1.cmml"></mprescripts><mrow id="S4.SS3.p3.2.m2.1.1b" xref="S4.SS3.p3.2.m2.1.1.cmml"></mrow><mi id="S4.SS3.p3.2.m2.1.1.3" xref="S4.SS3.p3.2.m2.1.1.3.cmml">n</mi></mmultiscripts><annotation-xml encoding="MathML-Content" id="S4.SS3.p3.2.m2.1b"><apply id="S4.SS3.p3.2.m2.1.1.cmml" xref="S4.SS3.p3.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS3.p3.2.m2.1.1.1.cmml" xref="S4.SS3.p3.2.m2.1.1">superscript</csymbol><apply id="S4.SS3.p3.2.m2.1.1.2.cmml" xref="S4.SS3.p3.2.m2.1.1.2"><root id="S4.SS3.p3.2.m2.1.1.2a.cmml" xref="S4.SS3.p3.2.m2.1.1.2"></root><ci id="S4.SS3.p3.2.m2.1.1.2.2.cmml" xref="S4.SS3.p3.2.m2.1.1.2.2">𝑒</ci></apply><ci id="S4.SS3.p3.2.m2.1.1.3.cmml" xref="S4.SS3.p3.2.m2.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S4.SS3.p3.2.m2.1c">{}^{n}\sqrt{e}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p3.2.m2.1d">start_FLOATSUPERSCRIPT italic_n end_FLOATSUPERSCRIPT square-root start_ARG italic_e end_ARG</annotation></semantics></math> - is performed per token, using eMSB data from the eMSB-Q stage. Since Softmax is computed per token, aligned with our quantization approach (in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS2" title="4.2. Dequantization-Free PTQ Technique ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.2</span></a>) and dataflow control mechanism (in section <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.SS5" title="4.5. Reduced Tensor Traffic ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">4.5</span></a>), we do not need to store eMSB positions for all Q vectors, enhancing our architecture’s scalability.</p> </div> <div class="ltx_para" id="S4.SS3.p4"> <p class="ltx_p" id="S4.SS3.p4.1">After the exponentiations, we apply eMSB-Q to a group of exponentiated outputs for each token, maximizing the utilization of the dynamic range provided by a Q<sub class="ltx_sub" id="S4.SS3.p4.1.1"><span class="ltx_text ltx_font_italic" id="S4.SS3.p4.1.1.1">O</span></sub>-bit (Softmax output precision) integer. 
This approach avoids division while preserving the original proportions of the values, ensuring that the maximum value is mapped to the highest possible representation, with sufficient range allocated for smaller values.</p> </div> <div class="ltx_para" id="S4.SS3.p5"> <p class="ltx_p" id="S4.SS3.p5.1">Consequently, our VDR-Softmax achieves both efficiency and accuracy, making it well-suited for end-to-end, integer-based acceleration within our PTQ-based attention accelerator architecture. This approach not only retains the critical properties of the exponentiation and divisions in Softmax, but also alleviates the computational overhead, leading to significantly enhanced efficiency with secured performance.</p> </div> </section> <section class="ltx_subsection" id="S4.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.4. </span>BitSift-GEMV Processor</h3> <div class="ltx_para" id="S4.SS4.p1"> <p class="ltx_p" id="S4.SS4.p1.1">Along the end-to-end, on-device fusion of the self-attention layer, our BitSift-GEMV technique for AMS-PiM significantly boosts sparse GEMV, which composes most of the computations in the attention layer. BitSift-GEMV is motivated by two key factors: <span class="ltx_text ltx_font_bold" id="S4.SS4.p1.1.1">(1)</span> the highlighted bitwise sparsity of activation data illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F7" title="Figure 7 ‣ 3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. 
Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">7</span></a> and <span class="ltx_text ltx_font_bold" id="S4.SS4.p1.1.2">(2)</span> the constant maximum number of “1” WL activations (=8), which enables the highest-utilization-rate operation and strengthens the robustness of the ADC design.</p> </div> <div class="ltx_para" id="S4.SS4.p2"> <p class="ltx_p" id="S4.SS4.p2.1">Existing AMS-PiM designs require flexibility in the number of processible SAWL, since input vectors carry an arbitrary number of “1”s along each bit-slice. Notably, a flexible SAWL degrades computation reliability and raises the overhead of the ADC design (Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F11" title="Figure 11 ‣ 4.4. BitSift-GEMV Processor ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">11</span></a>-(b)). For SAWL-flexible configurations, the ADCs must handle a wider, variable dynamic range, higher resolution, and varying thresholds during analog-to-digital conversion. Flexibility also compromises the utilization rate when the actual SAWL drops, wasting the power and area provisioned for a higher SAWL. Therefore, our design fixes “SAWL=MAX” at all times, incorporating dummy-“1” inputs to maintain a constant, maximum number of “1”s.</p> </div> <figure class="ltx_figure" id="S4.F11"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="485" id="S4.F11.g1" src="x10.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 11. 
</span>(a) SAWL<sub class="ltx_sub" id="S4.F11.4.1"><span class="ltx_text ltx_font_italic" id="S4.F11.4.1.1">D</span></sub> controller enabling a fixed SAWL count (=8) when the input data is deficient in “1”s. (b) Relation between SAWL and the dynamic-range variation of the analog-signal-domain popcount. Limiting SAWL flexibility relaxes the ADC-design challenges for AMS-PiM.</figcaption> </figure> <div class="ltx_para" id="S4.SS4.p3"> <p class="ltx_p" id="S4.SS4.p3.5">To maintain a constant SAWL of 8 (= the maximum SAWL that allows 6-<math alttext="\sigma" class="ltx_Math" display="inline" id="S4.SS4.p3.1.m1.1"><semantics id="S4.SS4.p3.1.m1.1a"><mi id="S4.SS4.p3.1.m1.1.1" xref="S4.SS4.p3.1.m1.1.1.cmml">σ</mi><annotation-xml encoding="MathML-Content" id="S4.SS4.p3.1.m1.1b"><ci id="S4.SS4.p3.1.m1.1.1.cmml" xref="S4.SS4.p3.1.m1.1.1">𝜎</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p3.1.m1.1c">\sigma</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p3.1.m1.1d">italic_σ</annotation></semantics></math> reliability), the SAWL<sub class="ltx_sub" id="S4.SS4.p3.5.1"><span class="ltx_text ltx_font_italic" id="S4.SS4.p3.5.1.1">D</span></sub> controller in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F11" title="Figure 11 ‣ 4.4. BitSift-GEMV Processor ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">11</span></a>-(a) supplements the SAWL with additional “1”s when the input vector has fewer than eight “1”s. 
Specifically, the SAWL<sub class="ltx_sub" id="S4.SS4.p3.5.2"><span class="ltx_text ltx_font_italic" id="S4.SS4.p3.5.2.1">D</span></sub> controller activates SAWL<sub class="ltx_sub" id="S4.SS4.p3.5.3"><span class="ltx_text ltx_font_italic" id="S4.SS4.p3.5.3.1">D</span></sub> = (8<math alttext="-" class="ltx_Math" display="inline" id="S4.SS4.p3.5.m5.1"><semantics id="S4.SS4.p3.5.m5.1a"><mo id="S4.SS4.p3.5.m5.1.1" xref="S4.SS4.p3.5.m5.1.1.cmml">−</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p3.5.m5.1b"><minus id="S4.SS4.p3.5.m5.1.1.cmml" xref="S4.SS4.p3.5.m5.1.1"></minus></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p3.5.m5.1c">-</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p3.5.m5.1d">-</annotation></semantics></math>SAWL) WLs at the bottom of the array. This is achieved by incorporating a dummy array and WL interfaces consisting of seven rows at the bottom. The memory cells within these dummy rows are all set to “0” (off-cells), emulating computations with dummy inputs. The array configuration serves two purposes: (1) isolating dummy operations, preventing the dummy array from affecting the actual computation output; and (2) enhancing variation resiliency, since by consistently returning “0”s, the dummy array improves the system’s robustness against variations. That is, while only the “1”s along the input vector are processed, the bitwise “0”s along the vector(s) are skipped from the actual computation.</p> </div> <div class="ltx_para" id="S4.SS4.p4"> <p class="ltx_p" id="S4.SS4.p4.2">The fixed-SAWL processing with the SAWL<sub class="ltx_sub" id="S4.SS4.p4.2.1"><span class="ltx_text ltx_font_italic" id="S4.SS4.p4.2.1.1">D</span></sub> controller employs a two-step approach. First, the input vector is scanned to identify the longest slice containing eight “1”s. The slice is then selected for processing. 
Second, for the tail end of a vector with fewer “1”s (or for a very sparse input vector), the SAWL<sub class="ltx_sub" id="S4.SS4.p4.2.2"><span class="ltx_text ltx_font_italic" id="S4.SS4.p4.2.2.1">D</span></sub> controller supplements the existing “1”s with additional ones from the dummy array. This ensures a consistent maximum SAWL for all computations, regardless of the input vector’s density. By maintaining a constant SAWL of 8, this method fixes the input dynamic range for the ADC and enables lower area/energy/error overhead, even with varying input sparsity.</p> </div> <figure class="ltx_figure" id="S4.F12"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="891" id="S4.F12.g1" src="x11.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 12. </span>Our BitSift-GEMV controller parses, selects, and fetches the longest portion(s) of the input containing the designated SAWL to the WL interface(s). </figcaption> </figure> <div class="ltx_para" id="S4.SS4.p5"> <p class="ltx_p" id="S4.SS4.p5.2">To support the SAWL<sub class="ltx_sub" id="S4.SS4.p5.2.1"><span class="ltx_text ltx_font_italic" id="S4.SS4.p5.2.1.1">D</span></sub> controller by identifying the longest segment(s) of the input whose SAWL count remains constrained to <math alttext="\leq" class="ltx_Math" display="inline" id="S4.SS4.p5.2.m2.1"><semantics id="S4.SS4.p5.2.m2.1a"><mo id="S4.SS4.p5.2.m2.1.1" xref="S4.SS4.p5.2.m2.1.1.cmml">≤</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p5.2.m2.1b"><leq id="S4.SS4.p5.2.m2.1.1.cmml" xref="S4.SS4.p5.2.m2.1.1"></leq></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p5.2.m2.1c">\leq</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p5.2.m2.1d">≤</annotation></semantics></math>8, we introduce our BitSift-GEMV controller, depicted in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F12" title="Figure 12 ‣ 4.4. BitSift-GEMV Processor ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">12</span></a>. This strategy introduces a novel approach for highly efficient sparse GEMV processing, achieving low latency and minimal area/energy overhead. Instead of a hypothetical brute-force method that scans the entire input vector bit by bit, incurring significant latency penalties as vector length and sparsity increase, the BitSift-GEMV controller leverages a hierarchical design with high parallelism. This includes (1) a local-pop controller (LPC) and (2) a global-pop controller (GPC), where “pop-count” refers to the number of “1”s in the input vector.</p> </div> <div class="ltx_para" id="S4.SS4.p6"> <p class="ltx_p" id="S4.SS4.p6.2"><span class="ltx_text ltx_font_bold" id="S4.SS4.p6.2.1">(1)</span> The LPC efficiently counts the number of “1”s within short, fixed-length slices of the input vector using dedicated “pop detector” and “popcount” blocks. This localized processing ensures low-latency computation of the popcount within each slice. <span class="ltx_text ltx_font_bold" id="S4.SS4.p6.2.2">(2)</span> The GPC aggregates the popcounts from the LPCs to rapidly identify the highest possible number of “1”s, constrained to a maximum of 8, for further processing. Once aggregation is complete, the GPC activates the “MASK” to fetch the corresponding vector to the output and WLs. If the total number of “1”s identified by the GPC is fewer than 8, the SAWL<sub class="ltx_sub" id="S4.SS4.p6.2.3"><span class="ltx_text ltx_font_italic" id="S4.SS4.p6.2.3.1">D</span></sub> controller (Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F11" title="Figure 11 ‣ 4.4. BitSift-GEMV Processor ‣ 4. 
FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">11</span></a>-(a)) supplements the input by adding “1”s from the dummy array before forwarding it to the WLs of the PiM array. This hierarchical and parallel architecture enables our AMS-PiM arrays to sustain high performance, even as input vector lengths and sparsity increase. The design is specifically optimized for extremely sparse inputs, particularly those whose popcount within each LPC’s slice remains <math alttext="\leq 8" class="ltx_Math" display="inline" id="S4.SS4.p6.2.m2.1"><semantics id="S4.SS4.p6.2.m2.1a"><mrow id="S4.SS4.p6.2.m2.1.1" xref="S4.SS4.p6.2.m2.1.1.cmml"><mi id="S4.SS4.p6.2.m2.1.1.2" xref="S4.SS4.p6.2.m2.1.1.2.cmml"></mi><mo id="S4.SS4.p6.2.m2.1.1.1" xref="S4.SS4.p6.2.m2.1.1.1.cmml">≤</mo><mn id="S4.SS4.p6.2.m2.1.1.3" xref="S4.SS4.p6.2.m2.1.1.3.cmml">8</mn></mrow><annotation-xml encoding="MathML-Content" id="S4.SS4.p6.2.m2.1b"><apply id="S4.SS4.p6.2.m2.1.1.cmml" xref="S4.SS4.p6.2.m2.1.1"><leq id="S4.SS4.p6.2.m2.1.1.1.cmml" xref="S4.SS4.p6.2.m2.1.1.1"></leq><csymbol cd="latexml" id="S4.SS4.p6.2.m2.1.1.2.cmml" xref="S4.SS4.p6.2.m2.1.1.2">absent</csymbol><cn id="S4.SS4.p6.2.m2.1.1.3.cmml" type="integer" xref="S4.SS4.p6.2.m2.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p6.2.m2.1c">\leq 8</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p6.2.m2.1d">≤ 8</annotation></semantics></math>. This emphasis on sparsity aligns with the activation sparsity frequently observed in neural networks, as illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S3.F7" title="Figure 7 ‣ 3.2. Motivation for Accurate and Efficient AMS Computation ‣ 3. 
Motivations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">7</span></a>-(e), (f).</p> </div> <div class="ltx_para" id="S4.SS4.p7"> <p class="ltx_p" id="S4.SS4.p7.6">However, recognizing that real-world data can exhibit varying sparsity patterns, the BitSift-GEMV controller also incorporates dedicated circuitry to handle denser input segments. That is, our BitSift-GEMV controller dynamically adjusts the processing flow when an LPC encounters eight or more “1”s within its slice, or when the GPC detects a global popcount exceeding 8. This dynamic adjustment mechanism is as follows. When the “pop detector” block in an LPC detects 8<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.p7.1.m1.1"><semantics id="S4.SS4.p7.1.m1.1a"><mo id="S4.SS4.p7.1.m1.1.1" xref="S4.SS4.p7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p7.1.m1.1b"><times id="S4.SS4.p7.1.m1.1.1.cmml" xref="S4.SS4.p7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p7.1.m1.1d">×</annotation></semantics></math>“1”s before reaching the end of the sliced input, the inputs are marked from the beginning to the point where the 8-th “1” is found, using the “pop marker” block. The marked section is then fetched first, consuming a column-sum cycle in the AMS-PiM array. The marker then marks the next section of the inputs to be processed, excluding the section fetched so far, and continues until it reaches the end of the LPC’s processible input section. 
Once the LPC-level sparsity requirement (popcount<math alttext="<" class="ltx_Math" display="inline" id="S4.SS4.p7.2.m2.1"><semantics id="S4.SS4.p7.2.m2.1a"><mo id="S4.SS4.p7.2.m2.1.1" xref="S4.SS4.p7.2.m2.1.1.cmml"><</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p7.2.m2.1b"><lt id="S4.SS4.p7.2.m2.1.1.cmml" xref="S4.SS4.p7.2.m2.1.1"></lt></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p7.2.m2.1c"><</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p7.2.m2.1d"><</annotation></semantics></math>8) is met, if the global popcount adder in the GPC returns <math alttext=">" class="ltx_Math" display="inline" id="S4.SS4.p7.3.m3.1"><semantics id="S4.SS4.p7.3.m3.1a"><mo id="S4.SS4.p7.3.m3.1.1" xref="S4.SS4.p7.3.m3.1.1.cmml">></mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p7.3.m3.1b"><gt id="S4.SS4.p7.3.m3.1.1.cmml" xref="S4.SS4.p7.3.m3.1.1"></gt></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p7.3.m3.1c">></annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p7.3.m3.1d">></annotation></semantics></math>8, the ctq (compute-token queue) circuit block controls the “MASK”s, gating LPCs from fetching their inputs to the PiM array so that only the LPCs whose summed popcount remains under 8 are fetched. 
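The slice selection and dummy padding described above can be modeled with a short Python sketch (a behavioral approximation of the LPC/GPC bookkeeping, not the circuits themselves; the list-of-bits representation is an assumption for illustration):

```python
def bitsift_slices(bits, sawl_max=8):
    """Behavioral sketch of BitSift-GEMV slicing: walk the input
    bit-vector and cut it at every 8th "1", so each fetched slice
    carries at most sawl_max ones; report how many dummy "1"s the
    SAWL_D controller would add to pad the tail slice to sawl_max."""
    slices = []
    start, ones = 0, 0
    for i, b in enumerate(bits):
        ones += b
        if ones == sawl_max:          # slice full: fetch it in one column-sum cycle
            slices.append((bits[start:i + 1], 0))
            start, ones = i + 1, 0
    if ones:                          # tail with fewer ones: pad with dummy WLs
        slices.append((bits[start:], sawl_max - ones))
    return slices
```

An input with ten “1”s thus yields one full slice plus a tail padded with six dummy “1”s, while an all-zero input is skipped entirely.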
Thus, for an input vector of length D, processing takes up to roundup(<math alttext="{D\times\text{(1-bitwise\_sparsity)}}\over 8" class="ltx_Math" display="inline" id="S4.SS4.p7.4.m4.1"><semantics id="S4.SS4.p7.4.m4.1a"><mfrac id="S4.SS4.p7.4.m4.1.1" xref="S4.SS4.p7.4.m4.1.1.cmml"><mrow id="S4.SS4.p7.4.m4.1.1.2" xref="S4.SS4.p7.4.m4.1.1.2.cmml"><mi id="S4.SS4.p7.4.m4.1.1.2.2" xref="S4.SS4.p7.4.m4.1.1.2.2.cmml">D</mi><mo id="S4.SS4.p7.4.m4.1.1.2.1" lspace="0.222em" rspace="0.222em" xref="S4.SS4.p7.4.m4.1.1.2.1.cmml">×</mo><mtext id="S4.SS4.p7.4.m4.1.1.2.3" mathsize="142%" xref="S4.SS4.p7.4.m4.1.1.2.3a.cmml">(1-bitwise_sparsity)</mtext></mrow><mn id="S4.SS4.p7.4.m4.1.1.3" xref="S4.SS4.p7.4.m4.1.1.3.cmml">8</mn></mfrac><annotation-xml encoding="MathML-Content" id="S4.SS4.p7.4.m4.1b"><apply id="S4.SS4.p7.4.m4.1.1.cmml" xref="S4.SS4.p7.4.m4.1.1"><divide id="S4.SS4.p7.4.m4.1.1.1.cmml" xref="S4.SS4.p7.4.m4.1.1"></divide><apply id="S4.SS4.p7.4.m4.1.1.2.cmml" xref="S4.SS4.p7.4.m4.1.1.2"><times id="S4.SS4.p7.4.m4.1.1.2.1.cmml" xref="S4.SS4.p7.4.m4.1.1.2.1"></times><ci id="S4.SS4.p7.4.m4.1.1.2.2.cmml" xref="S4.SS4.p7.4.m4.1.1.2.2">𝐷</ci><ci id="S4.SS4.p7.4.m4.1.1.2.3a.cmml" xref="S4.SS4.p7.4.m4.1.1.2.3"><mtext id="S4.SS4.p7.4.m4.1.1.2.3.cmml" xref="S4.SS4.p7.4.m4.1.1.2.3">(1-bitwise_sparsity)</mtext></ci></apply><cn id="S4.SS4.p7.4.m4.1.1.3.cmml" type="integer" xref="S4.SS4.p7.4.m4.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p7.4.m4.1c">{D\times\text{(1-bitwise\_sparsity)}}\over 8</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p7.4.m4.1d">divide start_ARG italic_D × (1-bitwise_sparsity) end_ARG start_ARG 8 end_ARG</annotation></semantics></math>) processing cycles. 
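For a feel of the numbers, the cycle bound can be evaluated directly (a sketch; D=768 and 90% bitwise sparsity are assumed example values, not measured figures):

```python
import math

def bitsift_cycles(d, bitwise_sparsity, sawl_max=8):
    """Upper bound on column-sum cycles for a length-D input vector:
    roundup(D * (1 - bitwise_sparsity) / sawl_max)."""
    return math.ceil(d * (1.0 - bitwise_sparsity) / sawl_max)
```

With D=768, a dense vector needs 96 cycles, while 90% bitwise sparsity cuts this to 10.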
Therefore, our BitSift-GEMV technique scales the processing latency by a factor of (1 <math alttext="-" class="ltx_Math" display="inline" id="S4.SS4.p7.5.m5.1"><semantics id="S4.SS4.p7.5.m5.1a"><mo id="S4.SS4.p7.5.m5.1.1" xref="S4.SS4.p7.5.m5.1.1.cmml">−</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.p7.5.m5.1b"><minus id="S4.SS4.p7.5.m5.1.1.cmml" xref="S4.SS4.p7.5.m5.1.1"></minus></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p7.5.m5.1c">-</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p7.5.m5.1d">-</annotation></semantics></math> bitwise<math alttext="\_" class="ltx_Math" display="inline" id="S4.SS4.p7.6.m6.1"><semantics id="S4.SS4.p7.6.m6.1a"><mi id="S4.SS4.p7.6.m6.1.1" mathvariant="normal" xref="S4.SS4.p7.6.m6.1.1.cmml">_</mi><annotation-xml encoding="MathML-Content" id="S4.SS4.p7.6.m6.1b"><ci id="S4.SS4.p7.6.m6.1.1.cmml" xref="S4.SS4.p7.6.m6.1.1">_</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.p7.6.m6.1c">\_</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.p7.6.m6.1d">_</annotation></semantics></math>sparsity), fully exploiting the finest grain of sparsity.</p> </div> </section> <section class="ltx_subsection" id="S4.SS5"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.5. </span>Reduced Tensor Traffic</h3> <figure class="ltx_figure" id="S4.F13"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="621" id="S4.F13.g1" src="x12.png" width="805"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 13. </span>Visualized dataflow and tensor traffic of FLARE: QKV generation, on-the-fly processing, and fused operations.</figcaption> </figure> <div class="ltx_para" id="S4.SS5.p1"> <p class="ltx_p" id="S4.SS5.p1.4">In summary, our FLARE architecture minimizes tensor traffic in attention mechanisms by fusing the attention layer end to end through the combination of the novel techniques above. 
The dataflow of our FLARE architecture (depicted in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F13" title="Figure 13 ‣ 4.5. Reduced Tensor Traffic ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">13</span></a>) can be understood as follows: (1) K & V Projections and (2) end-to-end fused, per-token (<math alttext="Q" class="ltx_Math" display="inline" id="S4.SS5.p1.1.m1.1"><semantics id="S4.SS5.p1.1.m1.1a"><mi id="S4.SS5.p1.1.m1.1.1" xref="S4.SS5.p1.1.m1.1.1.cmml">Q</mi><annotation-xml encoding="MathML-Content" id="S4.SS5.p1.1.m1.1b"><ci id="S4.SS5.p1.1.m1.1.1.cmml" xref="S4.SS5.p1.1.m1.1.1">𝑄</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.p1.1.m1.1c">Q</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.p1.1.m1.1d">italic_Q</annotation></semantics></math>-<math alttext="L" class="ltx_Math" display="inline" id="S4.SS5.p1.2.m2.1"><semantics id="S4.SS5.p1.2.m2.1a"><mi id="S4.SS5.p1.2.m2.1.1" xref="S4.SS5.p1.2.m2.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S4.SS5.p1.2.m2.1b"><ci id="S4.SS5.p1.2.m2.1.1.cmml" xref="S4.SS5.p1.2.m2.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.p1.2.m2.1c">L</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.p1.2.m2.1d">italic_L</annotation></semantics></math>-<math alttext="A" class="ltx_Math" display="inline" id="S4.SS5.p1.3.m3.1"><semantics id="S4.SS5.p1.3.m3.1a"><mi id="S4.SS5.p1.3.m3.1.1" xref="S4.SS5.p1.3.m3.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S4.SS5.p1.3.m3.1b"><ci id="S4.SS5.p1.3.m3.1.1.cmml" xref="S4.SS5.p1.3.m3.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.p1.3.m3.1c">A</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.p1.3.m3.1d">italic_A</annotation></semantics></math>-<math alttext="O" 
class="ltx_Math" display="inline" id="S4.SS5.p1.4.m4.1"><semantics id="S4.SS5.p1.4.m4.1a"><mi id="S4.SS5.p1.4.m4.1.1" xref="S4.SS5.p1.4.m4.1.1.cmml">O</mi><annotation-xml encoding="MathML-Content" id="S4.SS5.p1.4.m4.1b"><ci id="S4.SS5.p1.4.m4.1.1.cmml" xref="S4.SS5.p1.4.m4.1.1">𝑂</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.p1.4.m4.1c">O</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.p1.4.m4.1d">italic_O</annotation></semantics></math>) process.</p> </div> <div class="ltx_para" id="S4.SS5.p2"> <p class="ltx_p" id="S4.SS5.p2.2">(1) K & V projections: This stage involves A-W GEMM (a series of GEMV) operations on input tokens with weight matrices W<sub class="ltx_sub" id="S4.SS5.p2.2.1"><span class="ltx_text ltx_font_italic" id="S4.SS5.p2.2.1.1">K</span></sub> & W<sub class="ltx_sub" id="S4.SS5.p2.2.2"><span class="ltx_text ltx_font_italic" id="S4.SS5.p2.2.2.1">V</span></sub>, requiring input streaming to generate the complete KV matrices.</p> </div> <div class="ltx_para" id="S4.SS5.p3"> <p class="ltx_p" id="S4.SS5.p3.1">(2) Per-token fused process: After generating the KV matrices, the end-to-end-fused attention process is handled, requiring a second input stream-in. 
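In element counts, the two stream-ins plus one stream-out can be tallied with a one-line helper (a sketch; B=1, N=512, D=768 are assumed example dimensions):

```python
def revised_tensor_traffic(b, n, d):
    """Tensor traffic of the fused attention layer: the input is
    streamed in twice (K/V projection, then the fused per-token pass)
    and the output is streamed out once."""
    return 2 * b * n * d + b * n * d  # 2*B*N*D (in) + B*N*D (out)
```

For B=1, N=512, D=768 this totals 1,179,648 streamed elements, with no intermediate attention tensors leaving the accelerator.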
Rather than storing an input in an internal buffer, we stream data twice for improved area efficiency and scalability.</p> </div> <div class="ltx_para" id="S4.SS5.p4"> <p class="ltx_p" id="S4.SS5.p4.1">Hence, the revised tensor traffic can be represented as:</p> <table class="ltx_equation ltx_eqn_table" id="S4.Ex5"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\text{Revised Tensor Traffic}=2\times B\times N\times D\text{ (in)}+B\times N% \times D\text{ (out)}~{}," class="ltx_Math" display="block" id="S4.Ex5.m1.1"><semantics id="S4.Ex5.m1.1a"><mrow id="S4.Ex5.m1.1.1.1" xref="S4.Ex5.m1.1.1.1.1.cmml"><mrow id="S4.Ex5.m1.1.1.1.1" xref="S4.Ex5.m1.1.1.1.1.cmml"><mtext id="S4.Ex5.m1.1.1.1.1.2" xref="S4.Ex5.m1.1.1.1.1.2a.cmml">Revised Tensor Traffic</mtext><mo id="S4.Ex5.m1.1.1.1.1.1" xref="S4.Ex5.m1.1.1.1.1.1.cmml">=</mo><mrow id="S4.Ex5.m1.1.1.1.1.3" xref="S4.Ex5.m1.1.1.1.1.3.cmml"><mrow id="S4.Ex5.m1.1.1.1.1.3.2" xref="S4.Ex5.m1.1.1.1.1.3.2.cmml"><mrow id="S4.Ex5.m1.1.1.1.1.3.2.2" xref="S4.Ex5.m1.1.1.1.1.3.2.2.cmml"><mn id="S4.Ex5.m1.1.1.1.1.3.2.2.2" xref="S4.Ex5.m1.1.1.1.1.3.2.2.2.cmml">2</mn><mo id="S4.Ex5.m1.1.1.1.1.3.2.2.1" lspace="0.222em" rspace="0.222em" xref="S4.Ex5.m1.1.1.1.1.3.2.2.1.cmml">×</mo><mi id="S4.Ex5.m1.1.1.1.1.3.2.2.3" xref="S4.Ex5.m1.1.1.1.1.3.2.2.3.cmml">B</mi><mo id="S4.Ex5.m1.1.1.1.1.3.2.2.1a" lspace="0.222em" rspace="0.222em" xref="S4.Ex5.m1.1.1.1.1.3.2.2.1.cmml">×</mo><mi id="S4.Ex5.m1.1.1.1.1.3.2.2.4" xref="S4.Ex5.m1.1.1.1.1.3.2.2.4.cmml">N</mi><mo id="S4.Ex5.m1.1.1.1.1.3.2.2.1b" lspace="0.222em" rspace="0.222em" xref="S4.Ex5.m1.1.1.1.1.3.2.2.1.cmml">×</mo><mi id="S4.Ex5.m1.1.1.1.1.3.2.2.5" xref="S4.Ex5.m1.1.1.1.1.3.2.2.5.cmml">D</mi></mrow><mo id="S4.Ex5.m1.1.1.1.1.3.2.1" xref="S4.Ex5.m1.1.1.1.1.3.2.1.cmml"></mo><mtext id="S4.Ex5.m1.1.1.1.1.3.2.3" xref="S4.Ex5.m1.1.1.1.1.3.2.3a.cmml"> (in)</mtext></mrow><mo 
id="S4.Ex5.m1.1.1.1.1.3.1" xref="S4.Ex5.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S4.Ex5.m1.1.1.1.1.3.3" xref="S4.Ex5.m1.1.1.1.1.3.3.cmml"><mrow id="S4.Ex5.m1.1.1.1.1.3.3.2" xref="S4.Ex5.m1.1.1.1.1.3.3.2.cmml"><mi id="S4.Ex5.m1.1.1.1.1.3.3.2.2" xref="S4.Ex5.m1.1.1.1.1.3.3.2.2.cmml">B</mi><mo id="S4.Ex5.m1.1.1.1.1.3.3.2.1" lspace="0.222em" rspace="0.222em" xref="S4.Ex5.m1.1.1.1.1.3.3.2.1.cmml">×</mo><mi id="S4.Ex5.m1.1.1.1.1.3.3.2.3" xref="S4.Ex5.m1.1.1.1.1.3.3.2.3.cmml">N</mi><mo id="S4.Ex5.m1.1.1.1.1.3.3.2.1a" lspace="0.222em" rspace="0.222em" xref="S4.Ex5.m1.1.1.1.1.3.3.2.1.cmml">×</mo><mi id="S4.Ex5.m1.1.1.1.1.3.3.2.4" xref="S4.Ex5.m1.1.1.1.1.3.3.2.4.cmml">D</mi></mrow><mo id="S4.Ex5.m1.1.1.1.1.3.3.1" xref="S4.Ex5.m1.1.1.1.1.3.3.1.cmml"></mo><mtext id="S4.Ex5.m1.1.1.1.1.3.3.3" xref="S4.Ex5.m1.1.1.1.1.3.3.3a.cmml"> (out)</mtext></mrow></mrow></mrow><mo id="S4.Ex5.m1.1.1.1.2" lspace="0.330em" xref="S4.Ex5.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.Ex5.m1.1b"><apply id="S4.Ex5.m1.1.1.1.1.cmml" xref="S4.Ex5.m1.1.1.1"><eq id="S4.Ex5.m1.1.1.1.1.1.cmml" xref="S4.Ex5.m1.1.1.1.1.1"></eq><ci id="S4.Ex5.m1.1.1.1.1.2a.cmml" xref="S4.Ex5.m1.1.1.1.1.2"><mtext id="S4.Ex5.m1.1.1.1.1.2.cmml" xref="S4.Ex5.m1.1.1.1.1.2">Revised Tensor Traffic</mtext></ci><apply id="S4.Ex5.m1.1.1.1.1.3.cmml" xref="S4.Ex5.m1.1.1.1.1.3"><plus id="S4.Ex5.m1.1.1.1.1.3.1.cmml" xref="S4.Ex5.m1.1.1.1.1.3.1"></plus><apply id="S4.Ex5.m1.1.1.1.1.3.2.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2"><times id="S4.Ex5.m1.1.1.1.1.3.2.1.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.1"></times><apply id="S4.Ex5.m1.1.1.1.1.3.2.2.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.2"><times id="S4.Ex5.m1.1.1.1.1.3.2.2.1.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.2.1"></times><cn id="S4.Ex5.m1.1.1.1.1.3.2.2.2.cmml" type="integer" xref="S4.Ex5.m1.1.1.1.1.3.2.2.2">2</cn><ci id="S4.Ex5.m1.1.1.1.1.3.2.2.3.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.2.3">𝐵</ci><ci id="S4.Ex5.m1.1.1.1.1.3.2.2.4.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.2.4">𝑁</ci><ci 
id="S4.Ex5.m1.1.1.1.1.3.2.2.5.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.2.5">𝐷</ci></apply><ci id="S4.Ex5.m1.1.1.1.1.3.2.3a.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.3"><mtext id="S4.Ex5.m1.1.1.1.1.3.2.3.cmml" xref="S4.Ex5.m1.1.1.1.1.3.2.3"> (in)</mtext></ci></apply><apply id="S4.Ex5.m1.1.1.1.1.3.3.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3"><times id="S4.Ex5.m1.1.1.1.1.3.3.1.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.1"></times><apply id="S4.Ex5.m1.1.1.1.1.3.3.2.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.2"><times id="S4.Ex5.m1.1.1.1.1.3.3.2.1.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.2.1"></times><ci id="S4.Ex5.m1.1.1.1.1.3.3.2.2.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.2.2">𝐵</ci><ci id="S4.Ex5.m1.1.1.1.1.3.3.2.3.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.2.3">𝑁</ci><ci id="S4.Ex5.m1.1.1.1.1.3.3.2.4.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.2.4">𝐷</ci></apply><ci id="S4.Ex5.m1.1.1.1.1.3.3.3a.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.3"><mtext id="S4.Ex5.m1.1.1.1.1.3.3.3.cmml" xref="S4.Ex5.m1.1.1.1.1.3.3.3"> (out)</mtext></ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex5.m1.1c">\text{Revised Tensor Traffic}=2\times B\times N\times D\text{ (in)}+B\times N% \times D\text{ (out)}~{},</annotation><annotation encoding="application/x-llamapun" id="S4.Ex5.m1.1d">Revised Tensor Traffic = 2 × italic_B × italic_N × italic_D (in) + italic_B × italic_N × italic_D (out) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS5.p4.2">achieving substantial tensor traffic reduction by leveraging our proposed techniques.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5. </span>FLARE Evaluations</h2> <figure class="ltx_table" id="S5.T1"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1. 
</span>Summary of our FLARE design in 28nm process</figcaption><img alt="[Uncaptioned image]" class="ltx_graphics ltx_centering ltx_img_square" height="806" id="S5.T1.g1" src="x13.png" width="830"/> </figure> <div class="ltx_para ltx_noindent" id="S5.p1"> <p class="ltx_p" id="S5.p1.1">Our proposed FLARE architecture and its design components are all embedded into a compact ASIC with superior energy and latency performance per token. We validate our design through meticulous circuit simulations, including Monte Carlo simulations (results summarized in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S4.F9" title="Figure 9 ‣ 4.1. MRAM-SRAM Hybrid AMS-PiM Design ‣ 4. FLARE Architecture ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">9</span></a>-(b)) and post-layout circuit simulations for physical-level feasibility, using our custom-designed ASIC on 28nm FD-SOI technology; the design is summarized in TABLE <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.T1" title="Table 1 ‣ 5. 
FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>.</p> </div> <div class="ltx_para" id="S5.p2"> <p class="ltx_p" id="S5.p2.1">We tested our FLARE using 8-bit integer-quantized models across different NLP (with GLUE benchmarks <cite class="ltx_cite ltx_citemacro_citep">(Wang et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib53" title="">2018</a>)</cite>) and vision (ImageNet <cite class="ltx_cite ltx_citemacro_citep">(Deng et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib15" title="">2009</a>)</cite> Classification) tasks, as well as with various transformer models <cite class="ltx_cite ltx_citemacro_citep">(Touvron et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib50" title="">2021</a>; Dosovitskiy et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib18" title="">2020</a>; Liu et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib39" title="">2019</a>; Devlin et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib17" title="">2018</a>)</cite>. We compared the results with state-of-the-art GPUs (Nvidia RTX 3090, RTX 4090, T4, and A100), as well as a PiM baseline from which our proposed techniques are excluded.</p> </div> <div class="ltx_para" id="S5.p3"> <p class="ltx_p" id="S5.p3.1">Our experimental results validate the FLARE architecture’s efficacy across diverse settings. The design exploration highlights the significance of customizing inner-array configurations and parallelism levels to optimize performance. 
The eMSB-Q and VDR-Softmax techniques maintain high inference accuracy while eliminating FP operations, thereby enhancing computational efficiency. Additionally, the BitSift-GEMV technique leverages bitwise sparsity to significantly accelerate GEMV operations. Collectively, these innovations establish our architecture as a robust and scalable solution for transformer acceleration, delivering substantial reductions in both latency and energy consumption.</p> </div> <section class="ltx_subsection" id="S5.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.1. </span>Hardware Customization Exploration</h3> <figure class="ltx_figure" id="S5.F14"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="311" id="S5.F14.g1" src="x14.png" width="829"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 14. </span>Design study on array-level device parallelism. Larger values imply better characteristics.</figcaption> </figure> <div class="ltx_para" id="S5.SS1.p1"> <p class="ltx_p" id="S5.SS1.p1.1">We conducted a design exploration study to understand how different array-parallelism configurations affect the overall performance of our proposed architecture. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.F14" title="Figure 14 ‣ 5.1. Hardware Customization Exploration ‣ 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">14</span></a> shows the normalized metrics of various design choices, including chip size, latency, power, and energy consumption. For the summary in Table <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.T1" title="Table 1 ‣ 5. 
FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">1</span></a>, we chose the 8<math alttext="\times" class="ltx_Math" display="inline" id="S5.SS1.p1.1.m1.1"><semantics id="S5.SS1.p1.1.m1.1a"><mo id="S5.SS1.p1.1.m1.1.1" xref="S5.SS1.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.1.m1.1b"><times id="S5.SS1.p1.1.m1.1.1.cmml" xref="S5.SS1.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.1.m1.1d">×</annotation></semantics></math> array parallelism (i.e., DC8), reporting the design at a medium benchmark point.</p> </div> </section> <section class="ltx_subsection" id="S5.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.2. </span>Impact of eMSB-Q and VDR-Softmax</h3> <figure class="ltx_figure" id="S5.F15"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="872" id="S5.F15.g1" src="x15.png" width="829"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 15. </span>Comparison of inference accuracy results: various NLP tasks and ImageNet classification tasks with different transformer models.</figcaption> </figure> <div class="ltx_para" id="S5.SS2.p1"> <p class="ltx_p" id="S5.SS2.p1.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.F15" title="Figure 15 ‣ 5.2. Impact of eMSB-Q and VDR-Softmax ‣ 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">15</span></a> presents the inference accuracy benchmarks across various tasks and models. 
The comparison includes the FP32 baseline, a conventionally integer-quantized model that relies on the FP32-based DQ-Q process, and our proposed method employing eMSB-Q quantization combined with the VDR-Softmax technique. The results demonstrate that our approach sustains high accuracy while eliminating FP operations, thereby achieving substantial improvements in computational efficiency.</p> </div> </section> <section class="ltx_subsection" id="S5.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.3. </span>Impact of BitSift-GEMV</h3> <figure class="ltx_figure" id="S5.F16"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="242" id="S5.F16.g1" src="extracted/6015391/Experiment_ImpactofSparsity_BitSiftGEMV.png" width="479"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 16. </span>Average boosting factor of GEMV operations. The measured boosting factors closely match our anticipated values, demonstrating BitSift-GEMV’s efficacy.</figcaption> </figure> <div class="ltx_para" id="S5.SS3.p1"> <p class="ltx_p" id="S5.SS3.p1.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.F16" title="Figure 16 ‣ 5.3. Impact of BitSift-GEMV ‣ 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">16</span></a> presents the average boosting factor achieved by our proposed BitSift-GEMV technique in GEMV operation cycles, compared to a PiM baseline with fixed-length GEMV processing. Note that the reported boosting excludes the effect of array parallelism; the measured speedup comes from BitSift-GEMV alone. 
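The anticipated boosting from bit-level sparsity can be illustrated with a simple bit-serial cycle model. This is a sketch under the assumption of plane-level zero-skipping, not the exact BitSift-GEMV datapath; finer-grained (per-column) skipping would boost further.

```python
import numpy as np

def bit_planes(x, n_bits=8):
    """Decompose unsigned 8-bit activations into n_bits bit-planes."""
    return [(x >> b) & 1 for b in range(n_bits)]

def bit_serial_gemv_cycles(x, n_bits=8):
    """Cycle counts for a bit-serial GEMV: the fixed-length baseline
    always spends one cycle per bit-plane, while the sparsity-aware
    variant skips planes whose bits are all zero."""
    planes = bit_planes(x, n_bits)
    baseline = n_bits                                  # fixed-length processing
    boosted = sum(1 for p in planes if p.any())        # skip all-zero planes
    return baseline, boosted

# ReLU-like activations whose values fit in 4 bits, so the four
# high-order bit-planes are empty and can be skipped entirely.
rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=256).astype(np.uint8)
base, fast = bit_serial_gemv_cycles(x)
print(base / fast)   # boosting factor: 8 planes / 4 nonzero planes = 2.0
```

In this model the boosting factor is simply the ratio of total bit-planes to nonzero bit-planes, which is why the measured boosting tracks the anticipated bitwise sparsity so closely.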
The observed boosting factors closely match our anticipated values, demonstrating the effectiveness of exploiting bitwise sparsity to accelerate GEMV processing.</p> </div> </section> <section class="ltx_subsection" id="S5.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.4. </span>Impact on Latency Performance</h3> <figure class="ltx_figure" id="S5.F17"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="376" id="S5.F17.g1" src="extracted/6015391/ExperimentResult_TokenPerSecpng2.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 17. </span>Comparison of normalized token/sec performance. The CiM baseline was omitted due to its poor performance, caused by the significant latency of out-of-PiM tensor traffic and segmented GEMV.</figcaption> </figure> <div class="ltx_para" id="S5.SS4.p1"> <p class="ltx_p" id="S5.SS4.p1.1">We measured and compared token/sec performance across various tasks, models, and hardware platforms. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.F17" title="Figure 17 ‣ 5.4. Impact on Latency Performance ‣ 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">17</span></a> shows that our design achieves competitive latency performance, outperforming not only the PiM baseline but also state-of-the-art GPUs. This improvement is primarily attributed to the integration of the end-to-end fusion technique and the BitSift-GEMV approach.</p> </div> </section> <section class="ltx_subsection" id="S5.SS5"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.5. 
</span>Impact on Energy Performance</h3> <figure class="ltx_figure" id="S5.F18"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="393" id="S5.F18.g1" src="extracted/6015391/ExperimentResult_tokenperjoule2.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 18. </span>Comparison of normalized token/Joule performance.</figcaption> </figure> <div class="ltx_para" id="S5.SS5.p1"> <p class="ltx_p" id="S5.SS5.p1.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#S5.F18" title="Figure 18 ‣ 5.5. Impact on Energy Performance ‣ 5. FLARE Evaluations ‣ FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration"><span class="ltx_text ltx_ref_tag">18</span></a> illustrates the energy efficiency of our FLARE architecture compared to various SOTA GPUs and a PiM baseline across diverse NLP and vision tasks. The results highlight that our architecture achieves outstanding energy efficiency, positioning it as a practical solution for energy-constrained applications. Notably, without our on-device end-to-end kernel fusion technique, the reliance on FPUs or the substantial out-of-PiM tensor traffic significantly undermines the baseline PiM devices’ inherent efficiency.</p> </div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6. </span>Discussion - Related Works</h2> <div class="ltx_para" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">FLARE’s design offers an accurate, fast, and efficient foundation for accelerating self-attention layers in encoder models. Beyond its standalone capabilities, FLARE seamlessly integrates with optimization techniques to further enhance performance without hardware revisions. 
For instance, FLARE complements FlashAttention <cite class="ltx_cite ltx_citemacro_citep">(Dao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib14" title="">2022</a>)</cite> by processing tiles independently in a FLARE PE, effectively scaling to long sequences or oversized models while preserving its core strengths. Also, FLARE aligns with Dynamic Attention Modulation techniques <cite class="ltx_cite ltx_citemacro_citep">(Moitra et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib41" title="">2024</a>)</cite>, dynamically allocating resources based on input complexity to reduce redundant computations and optimize energy efficiency. Moreover, FLARE supports hot-expert routing <cite class="ltx_cite ltx_citemacro_citep">(Kim et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib34" title="">2024</a>)</cite> in Mixture-of-Experts models, efficiently handling intensive computations while reducing tensor traffic. Finally, representation compression techniques <cite class="ltx_cite ltx_citemacro_citep">(Lan, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib35" title="">2019</a>; Jiao et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib26" title="">2019</a>; Sanh, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib43" title="">2019</a>; Jégou et al<span class="ltx_text">.</span>, <a class="ltx_ref" href="https://arxiv.org/html/2411.14733v1#bib.bib28" title="">2011</a>)</cite> align with FLARE’s processing to lower computational demands without sacrificing accuracy.</p> </div> </section> <section class="ltx_section" id="S7"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">7. 
</span>Conclusion</h2> <div class="ltx_para" id="S7.p1"> <p class="ltx_p" id="S7.p1.1">This work presents FLARE, an AMS-PiM-based architecture designed to overcome the computational and hardware bottlenecks of transformer models. By introducing dequantization-free PTQ, integer-only nonlinear processing, and BitSift-GEMV for sparse GEMV acceleration, FLARE achieves robust energy efficiency, error resilience, and computational performance. FLARE’s hybrid MRAM-SRAM design enables on-chip processing of self-attention layers while eliminating the need for high-ENOB ADCs and FPUs, reducing area and power overhead. Experimental results confirm FLARE’s superiority over GPUs and PiM baselines in latency, energy efficiency, and scalability, making it a practical solution for deploying transformers in diverse environments.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Achiam et al<span class="ltx_text" id="bib.bib2.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al<span class="ltx_text" id="bib.bib2.3.1">.</span> 2023. </span> <span class="ltx_bibblock">Gpt-4 technical report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib2.4.1">arXiv preprint arXiv:2303.08774</em> (2023). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Alwani et al<span class="ltx_text" id="bib.bib3.2.2.1">.</span> (2016)</span> <span class="ltx_bibblock"> Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. </span> <span class="ltx_bibblock">Fused-layer CNN accelerators. In <em class="ltx_emph ltx_font_italic" id="bib.bib3.3.1">2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</em>. IEEE, 1–12. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Andrulis et al<span class="ltx_text" id="bib.bib4.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Tanner Andrulis, Joel S Emer, and Vivienne Sze. 2023. </span> <span class="ltx_bibblock">RAELLA: Reforming the arithmetic for efficient, low-resolution, and low-loss analog PIM: No retraining required!. In <em class="ltx_emph ltx_font_italic" id="bib.bib4.3.1">Proceedings of the 50th Annual International Symposium on Computer Architecture</em>. 1–16. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Anonymous (2024)</span> <span class="ltx_bibblock"> Anonymous. 2024. </span> <span class="ltx_bibblock">QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring. In <em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">Submitted to The Thirteenth International Conference on Learning Representations</em>. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openreview.net/forum?id=lwcnZmyojm" title="">https://openreview.net/forum?id=lwcnZmyojm</a> </span> <span class="ltx_bibblock">under review. 
</span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Banner et al<span class="ltx_text" id="bib.bib6.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Ron Banner, Yury Nahshan, and Daniel Soudry. 2019. </span> <span class="ltx_bibblock">Post training 4-bit quantization of convolutional networks for rapid-deployment. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.3.1">Advances in Neural Information Processing Systems</em> 32 (2019). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Biswas and Chandrakasan (2018)</span> <span class="ltx_bibblock"> Avishek Biswas and Anantha P Chandrakasan. 2018. </span> <span class="ltx_bibblock">CONV-SRAM: An energy-efficient SRAM with in-memory dot-product computation for low-power convolutional neural networks. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">IEEE Journal of Solid-State Circuits</em> 54, 1 (2018), 217–230. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Blackford et al<span class="ltx_text" id="bib.bib8.2.2.1">.</span> (2002)</span> <span class="ltx_bibblock"> L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al<span class="ltx_text" id="bib.bib8.3.1">.</span> 2002. </span> <span class="ltx_bibblock">An updated set of basic linear algebra subprograms (BLAS). </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib8.4.1">ACM Trans. Math. Software</em> 28, 2 (2002), 135–151. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Brown (2020)</span> <span class="ltx_bibblock"> Tom B Brown. 2020. </span> <span class="ltx_bibblock">Language models are few-shot learners. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">arXiv preprint arXiv:2005.14165</em> (2020). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cardarilli et al<span class="ltx_text" id="bib.bib10.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Gian Carlo Cardarilli, Luca Di Nunzio, Rocco Fazzolari, Daniele Giardino, Alberto Nannarelli, Marco Re, and Sergio Spanò. 2021. </span> <span class="ltx_bibblock">A pseudo-softmax function for hardware-based high speed image classification. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.3.1">Scientific reports</em> 11, 1 (2021), 15307. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al<span class="ltx_text" id="bib.bib11.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Wei-Hao Chen, Kai-Xiang Li, Wei-Yu Lin, Kuo-Hsiang Hsu, Pin-Yi Li, Cheng-Han Yang, Cheng-Xin Xue, En-Yu Yang, Yen-Kai Chen, Yun-Sheng Chang, et al<span class="ltx_text" id="bib.bib11.3.1">.</span> 2018. </span> <span class="ltx_bibblock">A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors. In <em class="ltx_emph ltx_font_italic" id="bib.bib11.4.1">2018 IEEE International Solid-State Circuits Conference-(ISSCC)</em>. IEEE, 494–496. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chi et al<span class="ltx_text" id="bib.bib12.2.2.1">.</span> (2016)</span> <span class="ltx_bibblock"> Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. </span> <span class="ltx_bibblock">PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In <em class="ltx_emph ltx_font_italic" id="bib.bib12.3.1">2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</em>. IEEE, 27–39. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Danial et al<span class="ltx_text" id="bib.bib13.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Loai Danial, Nicolás Wainstein, Shraga Kraus, and Shahar Kvatinsky. 2018. </span> <span class="ltx_bibblock">Breaking through the speed-power-accuracy tradeoff in ADCs using a memristive neuromorphic architecture. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib13.3.1">IEEE Transactions on Emerging Topics in Computational Intelligence</em> 2, 5 (2018), 396–409. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dao et al<span class="ltx_text" id="bib.bib14.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. </span> <span class="ltx_bibblock">Flashattention: Fast and memory-efficient exact attention with io-awareness. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib14.3.1">Advances in Neural Information Processing Systems</em> 35 (2022), 16344–16359. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Deng et al<span class="ltx_text" id="bib.bib15.2.2.1">.</span> (2009)</span> <span class="ltx_bibblock"> Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. </span> <span class="ltx_bibblock">Imagenet: A large-scale hierarchical image database. In <em class="ltx_emph ltx_font_italic" id="bib.bib15.3.1">2009 IEEE conference on computer vision and pattern recognition</em>. Ieee, 248–255. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dettmers et al<span class="ltx_text" id="bib.bib16.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. </span> <span class="ltx_bibblock">Qlora: Efficient finetuning of quantized llms. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib16.3.1">Advances in Neural Information Processing Systems</em> 36 (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Devlin et al<span class="ltx_text" id="bib.bib17.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. </span> <span class="ltx_bibblock">Bert: Pre-training of deep bidirectional transformers for language understanding. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.3.1">arXiv preprint arXiv:1810.04805</em> (2018). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dosovitskiy et al<span class="ltx_text" id="bib.bib18.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al<span class="ltx_text" id="bib.bib18.3.1">.</span> 2020. </span> <span class="ltx_bibblock">An image is worth 16x16 words: Transformers for image recognition at scale. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib18.4.1">arXiv preprint arXiv:2010.11929</em> (2020). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Guo et al<span class="ltx_text" id="bib.bib19.2.2.1">.</span> (2017)</span> <span class="ltx_bibblock"> Xinjie Guo, F Merrikh Bayat, M Bavandpour, M Klachko, MR Mahmoodi, M Prezioso, KK Likharev, and DB Strukov. 2017. </span> <span class="ltx_bibblock">Fast, energy-efficient, robust, and reproducible mixed-signal neuromorphic classifier based on embedded NOR flash memory technology. In <em class="ltx_emph ltx_font_italic" id="bib.bib19.3.1">2017 IEEE International Electron Devices Meeting (IEDM)</em>. IEEE, 6–5. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gupta et al<span class="ltx_text" id="bib.bib20.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Saransh Gupta, Mohsen Imani, Harveen Kaur, and Tajana Simunic Rosing. 2019. </span> <span class="ltx_bibblock">Nnpim: A processing in-memory architecture for neural network acceleration. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib20.3.1">IEEE Trans. Comput.</em> 68, 9 (2019), 1325–1337. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ho and Chang (2023)</span> <span class="ltx_bibblock"> Nguyen-Dong Ho and Ik-Joon Chang. 2023. </span> <span class="ltx_bibblock">O-2A: Outlier-Aware Compression for 8-bit Post-Training Quantization Model. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">IEEE Access</em> (2023). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hubara et al<span class="ltx_text" id="bib.bib22.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2018. </span> <span class="ltx_bibblock">Quantized neural networks: Training neural networks with low precision weights and activations. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib22.3.1">Journal of Machine Learning Research</em> 18, 187 (2018), 1–30. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Imani et al<span class="ltx_text" id="bib.bib23.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Mohsen Imani, Saransh Gupta, Yeseong Kim, Minxuan Zhou, and Tajana Rosing. 2019. </span> <span class="ltx_bibblock">Digitalpim: Digital-based processing in-memory for big data acceleration. In <em class="ltx_emph ltx_font_italic" id="bib.bib23.3.1">Proceedings of the 2019 on Great Lakes Symposium on VLSI</em>. 429–434. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jacob et al<span class="ltx_text" id="bib.bib24.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. </span> <span class="ltx_bibblock">Quantization and training of neural networks for efficient integer-arithmetic-only inference. In <em class="ltx_emph ltx_font_italic" id="bib.bib24.3.1">Proceedings of the IEEE conference on computer vision and pattern recognition</em>. 2704–2713. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiang et al<span class="ltx_text" id="bib.bib25.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Zhewei Jiang, Shihui Yin, Jae-Sun Seo, and Mingoo Seok. 2020. </span> <span class="ltx_bibblock">C3SRAM: An in-memory-computing SRAM macro based on robust capacitive coupling computing mechanism. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.3.1">IEEE Journal of Solid-State Circuits</em> 55, 7 (2020), 1888–1897. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jiao et al<span class="ltx_text" id="bib.bib26.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. </span> <span class="ltx_bibblock">Tinybert: Distilling bert for natural language understanding. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib26.3.1">arXiv preprint arXiv:1909.10351</em> (2019). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jung et al<span class="ltx_text" id="bib.bib27.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Seungchul Jung, Hyungwoo Lee, Sungmeen Myung, Hyunsoo Kim, Seung Keun Yoon, Soon-Wan Kwon, Yongmin Ju, Minje Kim, Wooseok Yi, Shinhee Han, et al<span class="ltx_text" id="bib.bib27.3.1">.</span> 2022. </span> <span class="ltx_bibblock">A crossbar array of magnetoresistive memory devices for in-memory computing. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.4.1">Nature</em> 601, 7892 (2022), 211–216. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jégou et al<span class="ltx_text" id="bib.bib28.2.2.1">.</span> (2011)</span> <span class="ltx_bibblock"> Herve Jégou, Matthijs Douze, and Cordelia Schmid. 2011. </span> <span class="ltx_bibblock">Product Quantization for Nearest Neighbor Search. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib28.3.1">IEEE Transactions on Pattern Analysis and Machine Intelligence</em> 33, 1 (2011), 117–128. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/TPAMI.2010.57" title="">https://doi.org/10.1109/TPAMI.2010.57</a> </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kao et al<span class="ltx_text" id="bib.bib29.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, and Tushar Krishna. 2023. </span> <span class="ltx_bibblock">Flat: An optimized dataflow for mitigating attention bottlenecks. 
In <em class="ltx_emph ltx_font_italic" id="bib.bib29.3.1">Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2</em>. 295–310. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kazemi et al<span class="ltx_text" id="bib.bib30.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Arman Kazemi, Mohammad Mehdi Sharifi, Zhuowen Zou, Michael Niemier, X Sharon Hu, and Mohsen Imani. 2021. </span> <span class="ltx_bibblock">Mimhd: Accurate and efficient hyperdimensional inference using multi-bit in-memory computing. In <em class="ltx_emph ltx_font_italic" id="bib.bib30.3.1">2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)</em>. IEEE, 1–6. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kim et al<span class="ltx_text" id="bib.bib31.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Hyungjun Kim, Hyunmyung Oh, and Jae-Joon Kim. 2020. </span> <span class="ltx_bibblock">Energy-efficient XNOR-free in-memory BNN accelerator with input distribution regularization. In <em class="ltx_emph ltx_font_italic" id="bib.bib31.3.1">Proceedings of the 39th International Conference on Computer-Aided Design</em>. 1–9. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kim et al<span class="ltx_text" id="bib.bib32.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Hyeonuk Kim, Jaehyeong Sim, Yeongjae Choi, and Lee-Sup Kim. 2019. </span> <span class="ltx_bibblock">Nand-net: Minimizing computational complexity of in-memory processing for binary neural networks. 
In <em class="ltx_emph ltx_font_italic" id="bib.bib32.3.1">2019 IEEE international symposium on high performance computer architecture (HPCA)</em>. IEEE, 661–673. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kim et al<span class="ltx_text" id="bib.bib33.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. 2021. </span> <span class="ltx_bibblock">I-bert: Integer-only bert quantization. In <em class="ltx_emph ltx_font_italic" id="bib.bib33.3.1">International conference on machine learning</em>. PMLR, 5506–5518. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kim et al<span class="ltx_text" id="bib.bib34.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, and Jaewoong Sim. 2024. </span> <span class="ltx_bibblock">MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.3.1">arXiv preprint arXiv:2405.18832</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lan (2019)</span> <span class="ltx_bibblock"> Z Lan. 2019. </span> <span class="ltx_bibblock">Albert: A lite bert for self-supervised learning of language representations. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">arXiv preprint arXiv:1909.11942</em> (2019). 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lee et al<span class="ltx_text" id="bib.bib36.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Hunjun Lee, Minseop Kim, Dongmoon Min, Joonsung Kim, Jongwon Back, Honam Yoo, Jong-Ho Lee, and Jangwoo Kim. 2022. </span> <span class="ltx_bibblock">3D-FPIM: An extreme energy-efficient DNN acceleration system using 3D NAND flash-based in-situ PIM unit. In <em class="ltx_emph ltx_font_italic" id="bib.bib36.3.1">2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)</em>. IEEE, 1359–1376. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al<span class="ltx_text" id="bib.bib37.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Huize Li, Zhaoying Li, Zhenyu Bai, and Tulika Mitra. 2024. </span> <span class="ltx_bibblock">ASADI: Accelerating Sparse Attention Using Diagonal-based In-Situ Computing. In <em class="ltx_emph ltx_font_italic" id="bib.bib37.3.1">2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA)</em>. IEEE, 774–787. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li and Gu (2023)</span> <span class="ltx_bibblock"> Zhikai Li and Qingyi Gu. 2023. </span> <span class="ltx_bibblock">I-vit: Integer-only quantization for efficient vision transformer inference. In <em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">Proceedings of the IEEE/CVF International Conference on Computer Vision</em>. 17065–17075. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al<span class="ltx_text" id="bib.bib39.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. </span> <span class="ltx_bibblock">Roberta: A robustly optimized bert pretraining approach. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.3.1">arXiv preprint arXiv:1907.11692</em> (2019). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Long et al<span class="ltx_text" id="bib.bib40.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Yun Long, Taesik Na, and Saibal Mukhopadhyay. 2018. </span> <span class="ltx_bibblock">ReRAM-Based Processing-in-Memory Architecture for Recurrent Neural Network Acceleration. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.3.1">IEEE Transactions on Very Large Scale Integration (VLSI) Systems</em> 26, 12 (2018), 2781–2794. </span> <span class="ltx_bibblock"> <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.1109/TVLSI.2018.2819190" title="">https://doi.org/10.1109/TVLSI.2018.2819190</a> </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Moitra et al<span class="ltx_text" id="bib.bib41.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Abhishek Moitra, Abhiroop Bhattacharjee, and Priyadarshini Panda. 2024. </span> <span class="ltx_bibblock">PIVOT-Input-aware Path Selection for Energy-efficient ViT Inference. In <em class="ltx_emph ltx_font_italic" id="bib.bib41.3.1">Proceedings of the 61st ACM/IEEE Design Automation Conference</em>. 1–6. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Roy et al<span class="ltx_text" id="bib.bib42.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Sourjya Roy, Mustafa Ali, and Anand Raghunathan. 2021. </span> <span class="ltx_bibblock">PIM-DRAM: Accelerating machine learning workloads using processing in commodity DRAM. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.3.1">IEEE Journal on Emerging and Selected Topics in Circuits and Systems</em> 11, 4 (2021), 701–710. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sanh (2019)</span> <span class="ltx_bibblock"> Victor Sanh. 2019. </span> <span class="ltx_bibblock">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib43.1.1">arXiv preprint arXiv:1910.01108</em> (2019). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shafiee et al<span class="ltx_text" id="bib.bib44.2.2.1">.</span> (2016)</span> <span class="ltx_bibblock"> Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. 2016. </span> <span class="ltx_bibblock">ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In <em class="ltx_emph ltx_font_italic" id="bib.bib44.3.1">2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</em>. IEEE, 14–26. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Si et al<span class="ltx_text" id="bib.bib45.2.2.1">.</span> (2019a)</span> <span class="ltx_bibblock"> Xin Si, Jia-Jing Chen, Yung-Ning Tu, Wei-Hsing Huang, Jing-Hong Wang, Yen-Cheng Chiu, Wei-Chen Wei, Ssu-Yen Wu, Xiaoyu Sun, Rui Liu, et al<span class="ltx_text" id="bib.bib45.3.1">.</span> 2019a. </span> <span class="ltx_bibblock">24.5 A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning. In <em class="ltx_emph ltx_font_italic" id="bib.bib45.4.1">2019 IEEE International Solid-State Circuits Conference (ISSCC)</em>. IEEE, 396–398. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Si et al<span class="ltx_text" id="bib.bib46.2.2.1">.</span> (2019b)</span> <span class="ltx_bibblock"> Xin Si, Jia-Jing Chen, Yung-Ning Tu, Wei-Hsing Huang, Jing-Hong Wang, Yen-Cheng Chiu, Wei-Chen Wei, Ssu-Yen Wu, Xiaoyu Sun, Rui Liu, et al<span class="ltx_text" id="bib.bib46.3.1">.</span> 2019b. </span> <span class="ltx_bibblock">A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.4.1">IEEE Journal of Solid-State Circuits</em> 55, 1 (2019), 189–202. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sridharan et al<span class="ltx_text" id="bib.bib47.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Shrihari Sridharan, Jacob R Stevens, Kaushik Roy, and Anand Raghunathan. 2023. </span> <span class="ltx_bibblock">X-Former: In-memory acceleration of transformers. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib47.3.1">IEEE Transactions on Very Large Scale Integration (VLSI) Systems</em> 31, 8 (2023), 1223–1233. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Su et al<span class="ltx_text" id="bib.bib48.2.2.1">.</span> (2017)</span> <span class="ltx_bibblock"> Fang Su, Wei-Hao Chen, Lixue Xia, Chieh-Pu Lo, Tianqi Tang, Zhibo Wang, Kuo-Hsiang Hsu, Ming Cheng, Jun-Yi Li, Yuan Xie, et al<span class="ltx_text" id="bib.bib48.3.1">.</span> 2017. </span> <span class="ltx_bibblock">A 462GOPs/J RRAM-based nonvolatile intelligent processor for energy harvesting IoE system featuring nonvolatile logics and processing-in-memory. In <em class="ltx_emph ltx_font_italic" id="bib.bib48.4.1">2017 Symposium on VLSI Technology</em>. IEEE, T260–T261. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sun et al<span class="ltx_text" id="bib.bib49.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Baohua Sun, Daniel Liu, Leo Yu, Jay Li, Helen Liu, Wenhan Zhang, and Terry Torng. 2018. </span> <span class="ltx_bibblock">MRAM co-designed processing-in-memory CNN accelerator for mobile and IoT applications. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib49.3.1">arXiv preprint arXiv:1811.12179</em> (2018). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Touvron et al<span class="ltx_text" id="bib.bib50.2.2.1">.</span> (2021)</span> <span class="ltx_bibblock"> Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. </span> <span class="ltx_bibblock">Training data-efficient image transformers & distillation through attention. 
In <em class="ltx_emph ltx_font_italic" id="bib.bib50.3.1">International Conference on Machine Learning</em>. PMLR, 10347–10357. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Vaswani (2017)</span> <span class="ltx_bibblock"> Ashish Vaswani. 2017. </span> <span class="ltx_bibblock">Attention is all you need. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib51.1.1">Advances in Neural Information Processing Systems</em> (2017). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Verma et al<span class="ltx_text" id="bib.bib52.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Naveen Verma, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay, Lung-Yen Chen, Bonan Zhang, and Peter Deaville. 2019. </span> <span class="ltx_bibblock">In-memory computing: Advances and prospects. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib52.3.1">IEEE Solid-State Circuits Magazine</em> 11, 3 (2019), 43–55. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al<span class="ltx_text" id="bib.bib53.2.2.1">.</span> (2018)</span> <span class="ltx_bibblock"> Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. </span> <span class="ltx_bibblock">GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In <em class="ltx_emph ltx_font_italic" id="bib.bib53.3.1">Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</em>. 353–355. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xiao et al<span class="ltx_text" id="bib.bib54.2.2.1">.</span> (2023)</span> <span class="ltx_bibblock"> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. </span> <span class="ltx_bibblock">SmoothQuant: Accurate and efficient post-training quantization for large language models. In <em class="ltx_emph ltx_font_italic" id="bib.bib54.3.1">International Conference on Machine Learning</em>. PMLR, 38087–38099. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xue et al<span class="ltx_text" id="bib.bib55.2.2.1">.</span> (2019)</span> <span class="ltx_bibblock"> Cheng-Xin Xue, Wei-Hao Chen, Je-Syu Liu, Jia-Fang Li, Wei-Yu Lin, Wei-En Lin, Jing-Hong Wang, Wei-Chen Wei, Ting-Wei Chang, Tung-Cheng Chang, et al<span class="ltx_text" id="bib.bib55.3.1">.</span> 2019. </span> <span class="ltx_bibblock">24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN based AI edge processors. In <em class="ltx_emph ltx_font_italic" id="bib.bib55.4.1">2019 IEEE International Solid-State Circuits Conference (ISSCC)</em>. IEEE, 388–390. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al<span class="ltx_text" id="bib.bib56.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Xiaoxuan Yang, Bonan Yan, Hai Li, and Yiran Chen. 2020. </span> <span class="ltx_bibblock">ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration. In <em class="ltx_emph ltx_font_italic" id="bib.bib56.3.1">Proceedings of the 39th International Conference on Computer-Aided Design</em>. 1–9. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yi et al<span class="ltx_text" id="bib.bib57.2.2.1">.</span> (2024)</span> <span class="ltx_bibblock"> Donghyeon Yi, Seoyoung Lee, Injun Choi, Gichan Yun, Edward Jongyoon Choi, Jonghee Park, Jonghoon Kwak, Sung-Joon Jang, Sohmyung Ha, Ik-Joon Chang, et al<span class="ltx_text" id="bib.bib57.3.1">.</span> 2024. </span> <span class="ltx_bibblock">Skew-CIM: Process-Variation-Resilient and Energy-Efficient Computation-in-Memory Design Technique With Skewed Weights. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib57.4.1">IEEE Transactions on Circuits and Systems I: Regular Papers</em> (2024). </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yin et al<span class="ltx_text" id="bib.bib58.2.2.1">.</span> (2020)</span> <span class="ltx_bibblock"> Shihui Yin, Zhewei Jiang, Jae-Sun Seo, and Mingoo Seok. 2020. </span> <span class="ltx_bibblock">XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib58.3.1">IEEE Journal of Solid-State Circuits</em> 55, 6 (2020), 1733–1743. </span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al<span class="ltx_text" id="bib.bib59.2.2.1">.</span> (2022)</span> <span class="ltx_bibblock"> Minxuan Zhou, Weihong Xu, Jaeyoung Kang, and Tajana Rosing. 2022. </span> <span class="ltx_bibblock">TransPIM: A memory-based acceleration via software-hardware co-design for transformer. In <em class="ltx_emph ltx_font_italic" id="bib.bib59.3.1">2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)</em>. IEEE, 1071–1085. 
</span> <span class="ltx_bibblock"> </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al<span class="ltx_text" id="bib.bib60.2.2.1">.</span> (2016)</span> <span class="ltx_bibblock"> Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. </span> <span class="ltx_bibblock">DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib60.3.1">arXiv preprint arXiv:1606.06160</em> (2016). </span> <span class="ltx_bibblock"> </span> </li> </ul> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Fri Nov 22 05:49:48 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span></a> </div></footer> </div> </body> </html>