<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models</title>  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content=" Large language models, speculative decoding, batch processing, learning-based algorithm " lang="en" name="keywords"/> <base href="/html/2503.15921v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S1" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">I </span><span class="ltx_text ltx_font_smallcaps">Introduction</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">II </span><span class="ltx_text 
ltx_font_smallcaps">Background and Motivation</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS1" title="In II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-A</span> </span><span class="ltx_text ltx_font_italic">Background</span></span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS1.SSS1" title="In II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-A</span>1 </span>Autoregressive Decoding</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS1.SSS2" title="In II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-A</span>2 </span>Speculative Decoding</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS2" title="In II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span> </span><span class="ltx_text ltx_font_italic">Motivation</span></span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a 
class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS2.SSS1" title="In II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span>1 </span>Limitation of Homogeneous SSMs</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS2.SSS2" title="In II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span>2 </span>Batching processing is not well supported by speculative decoding</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS2.SSS3" title="In II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span>3 </span>Speculation and verification need to be orchestrated</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S3" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">III </span><span class="ltx_text ltx_font_smallcaps">System Overview</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">IV 
</span><span class="ltx_text ltx_font_smallcaps">Learning-based SSM Selection</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.SS1" title="In IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-A</span> </span><span class="ltx_text ltx_font_italic">Problem Statement</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.SS2" title="In IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-B</span> </span><span class="ltx_text ltx_font_italic">Algorithm Design</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.SS3" title="In IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">IV-C</span> </span><span class="ltx_text ltx_font_italic">Fast SSM Switching</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S5" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">V </span><span class="ltx_text ltx_font_smallcaps">Spin’s Runtime Engine</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" 
href="https://arxiv.org/html/2503.15921v1#S5.SS1" title="In V Spin’s Runtime Engine ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">V-A</span> </span><span class="ltx_text ltx_font_italic">Fast Batch Verification</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S5.SS2" title="In V Spin’s Runtime Engine ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">V-B</span> </span><span class="ltx_text ltx_font_italic">Speculative Decoding Pipeline</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">VI </span><span class="ltx_text ltx_font_smallcaps">Performance Evaluation</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6.SS1" title="In VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">VI-A</span> </span><span class="ltx_text ltx_font_italic">Experimental Settings</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6.SS2" title="In VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag 
ltx_tag_ref"><span class="ltx_text">VI-B</span> </span><span class="ltx_text ltx_font_italic">Results</span></span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6.SS2.SSS1" title="In VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">VI-B</span>1 </span>Overall Performance</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6.SS2.SSS2" title="In VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">VI-B</span>2 </span>Impact of the SSM Selection Algorithm</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6.SS2.SSS3" title="In VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">VI-B</span>3 </span>Impact of the Fast Batch Verification</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S6.SS2.SSS4" title="In VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">VI-B</span>4 </span>Impact of the Speculative Decoding Pipeline</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" 
href="https://arxiv.org/html/2503.15921v1#S7" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">VII </span><span class="ltx_text ltx_font_smallcaps">Related Work</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S8" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">VIII </span><span class="ltx_text ltx_font_smallcaps">Conclusion</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S9" title="In Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">IX </span><span class="ltx_text ltx_font_smallcaps">Acknowledgments</span></span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document"> <span class="ltx_text ltx_font_smallcaps" id="id2.id1">Spin</span>: Accelerating Large Language Model Inference with Heterogeneous Speculative Models</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Fahao Chen1, Peng Li2, Tom H. 
Luan2, Zhou Su2, and Jing Deng3 </span><span class="ltx_author_notes"> <span class="ltx_contact ltx_role_affiliation">1School of Computer Science and Engineering, The University of Aizu, Japan </span> <span class="ltx_contact ltx_role_affiliation">2School of Cyber Science and Engineering, Xi’an Jiaotong University, China </span> <span class="ltx_contact ltx_role_affiliation">3Department of Computer Science, University of North Carolina at Greensboro, USA </span> <span class="ltx_contact ltx_role_affiliation">Emails: c24fahao@u-aizu.ac.jp, pengli@xjtu.edu.cn, tom.luan@xjtu.edu.cn, zhousu@ieee.org, jing.deng@uncg.edu </span></span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id1.1">Speculative decoding has been shown to be an effective way to accelerate Large Language Model (LLM) inference: a Small Speculative Model (SSM) generates candidate tokens in a speculation phase, and the LLM subsequently verifies them in a verification phase. However, current state-of-the-art speculative decoding approaches have three key limitations: reliance on homogeneous SSMs to handle requests of varying difficulty, a lack of robust support for batch processing, and insufficient holistic optimization of the speculation and verification phases. In this paper, we introduce <span class="ltx_text ltx_font_smallcaps" id="id1.1.1">Spin</span>, an efficient LLM inference serving system based on speculative decoding, designed to address these challenges through three main innovations. First, <span class="ltx_text ltx_font_smallcaps" id="id1.1.2">Spin</span> improves token speculation by using multiple heterogeneous SSMs, with a learning-based algorithm for SSM selection that operates without prior knowledge of request difficulty. Second, <span class="ltx_text ltx_font_smallcaps" id="id1.1.3">Spin</span> employs a request decomposition method to minimize batching overhead during LLM verification. 
Finally, <span class="ltx_text ltx_font_smallcaps" id="id1.1.4">Spin</span> orchestrates speculation and verification phases by pipelining their executions on GPUs to achieve further acceleration. Experimental results demonstrate that <span class="ltx_text ltx_font_smallcaps" id="id1.1.5">Spin</span> significantly outperforms state-of-the-art methods, achieving a performance increase of approximately 2.28<math alttext="\times" class="ltx_Math" display="inline" id="id1.1.m1.1"><semantics id="id1.1.m1.1a"><mo id="id1.1.m1.1.1" xref="id1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="id1.1.m1.1b"><times id="id1.1.m1.1.1.cmml" xref="id1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="id1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="id1.1.m1.1d">×</annotation></semantics></math>.</p> </div> <div class="ltx_keywords"> <h6 class="ltx_title ltx_title_keywords">Index Terms: </h6> Large language models, speculative decoding, batch processing, learning-based algorithm </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">I </span><span class="ltx_text ltx_font_smallcaps" id="S1.1.1">Introduction</span> </h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Large Language Models (LLMs), such as the GPT series <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib1" title="">1</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib2" title="">2</a>]</cite>, have demonstrated remarkable performance across a variety of generative tasks <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib3" title="">3</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib4" title="">4</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib5" 
title="">5</a>]</cite>, and efficient LLM inference has attracted significant research attention <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib8" title="">8</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib11" title="">11</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib12" title="">12</a>]</cite>. Recently, speculative decoding <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib14" title="">14</a>]</cite> has been proposed to accelerate LLM inference by using a small speculative model (SSM) in conjunction with the LLM, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S1.F1" title="Figure 1 ‣ I Introduction ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 1</span></a>(a). The speculative decoding process begins with the SSM generating candidate tokens in a <span class="ltx_text ltx_font_italic" id="S1.p1.1.1">speculation</span> phase. Subsequently, these candidate tokens are verified by the LLM in a <span class="ltx_text ltx_font_italic" id="S1.p1.1.2">verification</span> phase. 
The SSM runs fast because of its small size, and verification completes in a single forward pass of the LLM, which checks all speculative tokens in parallel.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Existing research on speculative decoding falls into two main approaches. The first approach aims to improve the speculation phase by launching multiple homogeneous SSM instances, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S1.F1" title="Figure 1 ‣ I Introduction ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 1</span></a>(b), to generate more tokens with a higher probability of being accepted by the LLM. For instance, SpecInfer <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib15" title="">15</a>]</cite> and Medusa <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib16" title="">16</a>]</cite> use a tree-based speculation policy, generating multiple candidate tokens at the same position to increase the acceptance rate. EAGLE <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib17" title="">17</a>]</cite> enhances the SSM with additional inputs, specifically the intermediate results of some LLM layers, to further boost the acceptance rate. The second approach focuses on optimizing the verification phase by reducing the inference cost of the LLM. 
For example, LayerSkip <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib18" title="">18</a>]</cite> and EESD <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib19" title="">19</a>]</cite> employ a portion of the LLM as the SSM, allowing intermediate results from the speculation phase to be reused during verification, thus reducing the overall inference cost of the LLM.</p> </div> <figure class="ltx_figure" id="S1.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="204" id="S1.F1.1.g1" src="x1.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Illustration of different speculative decoding approaches.</figcaption> </figure> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Despite the potential of speculative decoding, existing works do not fully exploit its capabilities for accelerating LLM inference, primarily due to three reasons. <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.p3.1.1">First</span>, during the speculation phase, existing works <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib14" title="">14</a>]</cite> typically employ homogeneous SSMs, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S1.F1" title="Figure 1 ‣ I Introduction ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 1</span></a>(b), which is insufficient for handling requests of varying difficulty. 
Our observations indicate that some requests, such as commonsense questions, are relatively “easy” and can be handled effectively by smaller SSMs, which generate acceptable tokens with high probability. Conversely, some “difficult” requests require larger SSMs to produce candidate tokens that the LLM is more likely to accept. <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.p3.1.2">Second</span>, during the verification phase, existing works do not robustly support batch-based verification by the LLM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib14" title="">14</a>]</cite>. It has been shown that the LLM severely under-utilizes GPU resources when processing a single request. Batching multiple requests for parallel processing is a common practice to increase GPU utilization and throughput. However, our experimental results (§<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS2" title="II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag"><span class="ltx_text">II-B</span></span></a>) demonstrate that batching requests during the verification phase degrades the speedup of speculative decoding by about 30%. This degradation is primarily due to the varying lengths of requests, which necessitate padding with zeros to align them for GPU processing, thereby wasting GPU memory and computational resources. <span class="ltx_text ltx_framed ltx_framed_underline" id="S1.p3.1.3">Third</span>, existing works lack a holistic optimization approach for both speculation and verification phases. They focus either on improving the acceptance rate during the speculation phase or on accelerating verification. 
This lack of comprehensive optimization prevents existing methods from efficiently orchestrating speculation and verification executions within a serving system, thus limiting the overall inference throughput.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1"><span class="ltx_text ltx_font_bold" id="S1.p4.1.1">Goal:</span> In this paper, we present <span class="ltx_text ltx_font_smallcaps" id="S1.p4.1.2">Spin</span>, an efficient LLM inference serving system based on holistically optimized speculative decoding that addresses the above limitations. <span class="ltx_text ltx_font_smallcaps" id="S1.p4.1.3">Spin</span> runs on a GPU cluster and is capable of handling a variety of inference requests, such as multi-turn conversations and code generation. We propose to adopt multiple heterogeneous SSMs in the speculation phase, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S1.F1" title="Figure 1 ‣ I Introduction ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 1</span></a>(c), rather than relying on homogeneous SSMs, ensuring that each request is matched with the most appropriate SSM. Moreover, we propose a fast batch verification scheme for speculative decoding, which can significantly accelerate LLM verification while maintaining inference accuracy.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1"><span class="ltx_text ltx_font_bold" id="S1.p5.1.1">Technical challenges:</span> Harnessing the potential of the aforementioned designs involves overcoming three critical challenges. The first is how to select an appropriate SSM for each request when request difficulty is unknown in advance. The accuracy of speculative tokens depends on the current input and SSM capability, and is hard to qualify or quantify before actual execution. 
Second, simply batching requests for LLM verification introduces excessive padding tokens, a problem also encountered in traditional transformer models for language processing <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib20" title="">20</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib21" title="">21</a>]</cite>. A common solution is to batch requests with similar lengths; however, this is less effective here because speculative decoding exacerbates the length discrepancies between requests, and their lengths change dynamically. Third, using heterogeneous SSMs may cause significant execution synchronization issues. In the default design, LLM verification cannot start until all candidate tokens are received from the SSMs. Since these heterogeneous SSMs may work at different speeds, the slowest SSM can delay the entire system.</p> </div> <div class="ltx_para" id="S1.p6"> <p class="ltx_p" id="S1.p6.1"><span class="ltx_text ltx_font_bold" id="S1.p6.1.1">Solution:</span> <span class="ltx_text ltx_font_smallcaps" id="S1.p6.1.2">Spin</span> addresses the above technical challenges with three novel designs:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">We design a learning-based SSM selection algorithm, which enables <span class="ltx_text ltx_font_smallcaps" id="S1.I1.i1.p1.1.1">Spin</span> to choose suitable SSMs for inference requests to maximize speculative decoding performance (§<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4" title="IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">IV</span></a>). 
We model SSM selection as a multi-armed bandit problem and propose an efficient algorithm to explore each SSM’s performance, without any knowledge about request difficulty or model features.</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1"><span class="ltx_text ltx_font_smallcaps" id="S1.I1.i2.p1.1.1">Spin</span> reduces batching overhead by decomposing long requests into short ones, which can be aligned with fewer padding tokens. To guarantee inference correctness after such decomposition, we propose a new attention computation scheme.</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">To orchestrate speculation and verification phases for further acceleration, <span class="ltx_text ltx_font_smallcaps" id="S1.I1.i3.p1.1.1">Spin</span> divides the original batch into micro-batches and pipelines their executions on SSMs and the LLM. This design significantly reduces the idle time of the LLM during synchronization of heterogeneous SSMs.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S1.p7"> <p class="ltx_p" id="S1.p7.1">We evaluate <span class="ltx_text ltx_font_smallcaps" id="S1.p7.1.1">Spin</span> using five SSMs from the LLaMA series, ranging from 68M to 1.4B parameters. 
Extensive experiments on various LLMs and datasets show that <span class="ltx_text ltx_font_smallcaps" id="S1.p7.1.2">Spin</span> achieves a 2.28<math alttext="\times" class="ltx_Math" display="inline" id="S1.p7.1.m1.1"><semantics id="S1.p7.1.m1.1a"><mo id="S1.p7.1.m1.1.1" xref="S1.p7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S1.p7.1.m1.1b"><times id="S1.p7.1.m1.1.1.cmml" xref="S1.p7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S1.p7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S1.p7.1.m1.1d">×</annotation></semantics></math> improvement in inference throughput over existing baselines.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">II </span><span class="ltx_text ltx_font_smallcaps" id="S2.1.1">Background and Motivation</span> </h2> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS1.5.1.1">II-A</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS1.6.2">Background</span> </h3> <section class="ltx_subsubsection" id="S2.SS1.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S2.SS1.SSS1.5.1.1">II-A</span>1 </span>Autoregressive Decoding</h4> <div class="ltx_para" id="S2.SS1.SSS1.p1"> <p class="ltx_p" id="S2.SS1.SSS1.p1.3">Generative LLM inference typically uses autoregressive decoding with the transformer architecture <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib22" title="">22</a>]</cite>: the model iteratively processes the input sequence along with all previously generated tokens to produce the next token until it generates an “end-of-sequence (EOS)” token. 
The transformer model consists of multiple blocks, where each block includes a self-attention layer and a feed-forward network (FFN) layer. For a request with <math alttext="n" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.1.m1.1"><semantics id="S2.SS1.SSS1.p1.1.m1.1a"><mi id="S2.SS1.SSS1.p1.1.m1.1.1" xref="S2.SS1.SSS1.p1.1.m1.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.1.m1.1b"><ci id="S2.SS1.SSS1.p1.1.m1.1.1.cmml" xref="S2.SS1.SSS1.p1.1.m1.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.1.m1.1c">n</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.1.m1.1d">italic_n</annotation></semantics></math> tokens, denoted by <math alttext="S=[s_{1},s_{2},...,s_{n}]" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.2.m2.4"><semantics id="S2.SS1.SSS1.p1.2.m2.4a"><mrow id="S2.SS1.SSS1.p1.2.m2.4.4" xref="S2.SS1.SSS1.p1.2.m2.4.4.cmml"><mi id="S2.SS1.SSS1.p1.2.m2.4.4.5" xref="S2.SS1.SSS1.p1.2.m2.4.4.5.cmml">S</mi><mo id="S2.SS1.SSS1.p1.2.m2.4.4.4" xref="S2.SS1.SSS1.p1.2.m2.4.4.4.cmml">=</mo><mrow id="S2.SS1.SSS1.p1.2.m2.4.4.3.3" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml"><mo id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.4" stretchy="false" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml">[</mo><msub id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.cmml"><mi id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.2" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.2.cmml">s</mi><mn id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.3" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.3.cmml">1</mn></msub><mo id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.5" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml">,</mo><msub id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.cmml"><mi id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.2" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.2.cmml">s</mi><mn id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.3" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.3.cmml">2</mn></msub><mo id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.6" 
xref="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml">,</mo><mi id="S2.SS1.SSS1.p1.2.m2.1.1" mathvariant="normal" xref="S2.SS1.SSS1.p1.2.m2.1.1.cmml">…</mi><mo id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.7" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml">,</mo><msub id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.cmml"><mi id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.2" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.2.cmml">s</mi><mi id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.3" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.3.cmml">n</mi></msub><mo id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.8" stretchy="false" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.2.m2.4b"><apply id="S2.SS1.SSS1.p1.2.m2.4.4.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4"><eq id="S2.SS1.SSS1.p1.2.m2.4.4.4.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.4"></eq><ci id="S2.SS1.SSS1.p1.2.m2.4.4.5.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.5">𝑆</ci><list id="S2.SS1.SSS1.p1.2.m2.4.4.3.4.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3"><apply id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.cmml" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1">subscript</csymbol><ci id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.2.cmml" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.2">𝑠</ci><cn id="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.3.cmml" type="integer" xref="S2.SS1.SSS1.p1.2.m2.2.2.1.1.1.3">1</cn></apply><apply id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.cmml" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.1.cmml" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2">subscript</csymbol><ci id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.2.cmml" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.2">𝑠</ci><cn id="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.3.cmml" type="integer" xref="S2.SS1.SSS1.p1.2.m2.3.3.2.2.2.3">2</cn></apply><ci id="S2.SS1.SSS1.p1.2.m2.1.1.cmml" xref="S2.SS1.SSS1.p1.2.m2.1.1">…</ci><apply id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3"><csymbol cd="ambiguous" 
id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.1.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3">subscript</csymbol><ci id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.2.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.2">𝑠</ci><ci id="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.3.cmml" xref="S2.SS1.SSS1.p1.2.m2.4.4.3.3.3.3">𝑛</ci></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.2.m2.4c">S=[s_{1},s_{2},...,s_{n}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.2.m2.4d">italic_S = [ italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ]</annotation></semantics></math>, each layer first applies projections on each token through three learnable linear matrices <math alttext="W^{Q},W^{K},W^{V}" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.3.m3.3"><semantics id="S2.SS1.SSS1.p1.3.m3.3a"><mrow id="S2.SS1.SSS1.p1.3.m3.3.3.3" xref="S2.SS1.SSS1.p1.3.m3.3.3.4.cmml"><msup id="S2.SS1.SSS1.p1.3.m3.1.1.1.1" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1.cmml"><mi id="S2.SS1.SSS1.p1.3.m3.1.1.1.1.2" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1.2.cmml">W</mi><mi id="S2.SS1.SSS1.p1.3.m3.1.1.1.1.3" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1.3.cmml">Q</mi></msup><mo id="S2.SS1.SSS1.p1.3.m3.3.3.3.4" xref="S2.SS1.SSS1.p1.3.m3.3.3.4.cmml">,</mo><msup id="S2.SS1.SSS1.p1.3.m3.2.2.2.2" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2.cmml"><mi id="S2.SS1.SSS1.p1.3.m3.2.2.2.2.2" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2.2.cmml">W</mi><mi id="S2.SS1.SSS1.p1.3.m3.2.2.2.2.3" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2.3.cmml">K</mi></msup><mo id="S2.SS1.SSS1.p1.3.m3.3.3.3.5" xref="S2.SS1.SSS1.p1.3.m3.3.3.4.cmml">,</mo><msup id="S2.SS1.SSS1.p1.3.m3.3.3.3.3" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3.cmml"><mi id="S2.SS1.SSS1.p1.3.m3.3.3.3.3.2" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3.2.cmml">W</mi><mi id="S2.SS1.SSS1.p1.3.m3.3.3.3.3.3" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3.3.cmml">V</mi></msup></mrow><annotation-xml 
encoding="MathML-Content" id="S2.SS1.SSS1.p1.3.m3.3b"><list id="S2.SS1.SSS1.p1.3.m3.3.3.4.cmml" xref="S2.SS1.SSS1.p1.3.m3.3.3.3"><apply id="S2.SS1.SSS1.p1.3.m3.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.3.m3.1.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1">superscript</csymbol><ci id="S2.SS1.SSS1.p1.3.m3.1.1.1.1.2.cmml" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1.2">𝑊</ci><ci id="S2.SS1.SSS1.p1.3.m3.1.1.1.1.3.cmml" xref="S2.SS1.SSS1.p1.3.m3.1.1.1.1.3">𝑄</ci></apply><apply id="S2.SS1.SSS1.p1.3.m3.2.2.2.2.cmml" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.3.m3.2.2.2.2.1.cmml" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2">superscript</csymbol><ci id="S2.SS1.SSS1.p1.3.m3.2.2.2.2.2.cmml" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2.2">𝑊</ci><ci id="S2.SS1.SSS1.p1.3.m3.2.2.2.2.3.cmml" xref="S2.SS1.SSS1.p1.3.m3.2.2.2.2.3">𝐾</ci></apply><apply id="S2.SS1.SSS1.p1.3.m3.3.3.3.3.cmml" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.3.m3.3.3.3.3.1.cmml" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3">superscript</csymbol><ci id="S2.SS1.SSS1.p1.3.m3.3.3.3.3.2.cmml" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3.2">𝑊</ci><ci id="S2.SS1.SSS1.p1.3.m3.3.3.3.3.3.cmml" xref="S2.SS1.SSS1.p1.3.m3.3.3.3.3.3">𝑉</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.3.m3.3c">W^{Q},W^{K},W^{V}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.3.m3.3d">italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT , italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT</annotation></semantics></math>, to generate the query, key, and value matrices:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx1"> <tbody id="S2.E1"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math 
alttext="\displaystyle Q=W^{Q}S,\mbox{ }K=W^{K}S,\mbox{ }V=W^{V}S." class="ltx_Math" display="inline" id="S2.E1.m1.1"><semantics id="S2.E1.m1.1a"><mrow id="S2.E1.m1.1.1.1"><mrow id="S2.E1.m1.1.1.1.1.2" xref="S2.E1.m1.1.1.1.1.3.cmml"><mrow id="S2.E1.m1.1.1.1.1.1.1" xref="S2.E1.m1.1.1.1.1.1.1.cmml"><mi id="S2.E1.m1.1.1.1.1.1.1.2" xref="S2.E1.m1.1.1.1.1.1.1.2.cmml">Q</mi><mo id="S2.E1.m1.1.1.1.1.1.1.1" xref="S2.E1.m1.1.1.1.1.1.1.1.cmml">=</mo><mrow id="S2.E1.m1.1.1.1.1.1.1.3" xref="S2.E1.m1.1.1.1.1.1.1.3.cmml"><msup id="S2.E1.m1.1.1.1.1.1.1.3.2" xref="S2.E1.m1.1.1.1.1.1.1.3.2.cmml"><mi id="S2.E1.m1.1.1.1.1.1.1.3.2.2" xref="S2.E1.m1.1.1.1.1.1.1.3.2.2.cmml">W</mi><mi id="S2.E1.m1.1.1.1.1.1.1.3.2.3" xref="S2.E1.m1.1.1.1.1.1.1.3.2.3.cmml">Q</mi></msup><mo id="S2.E1.m1.1.1.1.1.1.1.3.1" xref="S2.E1.m1.1.1.1.1.1.1.3.1.cmml">⁢</mo><mi id="S2.E1.m1.1.1.1.1.1.1.3.3" xref="S2.E1.m1.1.1.1.1.1.1.3.3.cmml">S</mi></mrow></mrow><mo id="S2.E1.m1.1.1.1.1.2.3" xref="S2.E1.m1.1.1.1.1.3a.cmml">,</mo><mrow id="S2.E1.m1.1.1.1.1.2.2.2" xref="S2.E1.m1.1.1.1.1.2.2.3.cmml"><mrow id="S2.E1.m1.1.1.1.1.2.2.1.1" xref="S2.E1.m1.1.1.1.1.2.2.1.1.cmml"><mrow id="S2.E1.m1.1.1.1.1.2.2.1.1.2" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.cmml"><mtext id="S2.E1.m1.1.1.1.1.2.2.1.1.2.2" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.2a.cmml"> </mtext><mo id="S2.E1.m1.1.1.1.1.2.2.1.1.2.1" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.1.cmml">⁢</mo><mi id="S2.E1.m1.1.1.1.1.2.2.1.1.2.3" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.3.cmml">K</mi></mrow><mo id="S2.E1.m1.1.1.1.1.2.2.1.1.1" xref="S2.E1.m1.1.1.1.1.2.2.1.1.1.cmml">=</mo><mrow id="S2.E1.m1.1.1.1.1.2.2.1.1.3" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.cmml"><msup id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.cmml"><mi id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.2" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.2.cmml">W</mi><mi id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.3" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.3.cmml">K</mi></msup><mo id="S2.E1.m1.1.1.1.1.2.2.1.1.3.1" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.1.cmml">⁢</mo><mi 
id="S2.E1.m1.1.1.1.1.2.2.1.1.3.3" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.3.cmml">S</mi></mrow></mrow><mo id="S2.E1.m1.1.1.1.1.2.2.2.3" xref="S2.E1.m1.1.1.1.1.2.2.3a.cmml">,</mo><mrow id="S2.E1.m1.1.1.1.1.2.2.2.2" xref="S2.E1.m1.1.1.1.1.2.2.2.2.cmml"><mrow id="S2.E1.m1.1.1.1.1.2.2.2.2.2" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.cmml"><mtext id="S2.E1.m1.1.1.1.1.2.2.2.2.2.2" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.2a.cmml"> </mtext><mo id="S2.E1.m1.1.1.1.1.2.2.2.2.2.1" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.1.cmml">⁢</mo><mi id="S2.E1.m1.1.1.1.1.2.2.2.2.2.3" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.3.cmml">V</mi></mrow><mo id="S2.E1.m1.1.1.1.1.2.2.2.2.1" xref="S2.E1.m1.1.1.1.1.2.2.2.2.1.cmml">=</mo><mrow id="S2.E1.m1.1.1.1.1.2.2.2.2.3" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.cmml"><msup id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.cmml"><mi id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.2" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.2.cmml">W</mi><mi id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.3" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.3.cmml">V</mi></msup><mo id="S2.E1.m1.1.1.1.1.2.2.2.2.3.1" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.1.cmml">⁢</mo><mi id="S2.E1.m1.1.1.1.1.2.2.2.2.3.3" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.3.cmml">S</mi></mrow></mrow></mrow></mrow><mo id="S2.E1.m1.1.1.1.2" lspace="0em">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m1.1b"><apply id="S2.E1.m1.1.1.1.1.3.cmml" xref="S2.E1.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.E1.m1.1.1.1.1.3a.cmml" xref="S2.E1.m1.1.1.1.1.2.3">formulae-sequence</csymbol><apply id="S2.E1.m1.1.1.1.1.1.1.cmml" xref="S2.E1.m1.1.1.1.1.1.1"><eq id="S2.E1.m1.1.1.1.1.1.1.1.cmml" xref="S2.E1.m1.1.1.1.1.1.1.1"></eq><ci id="S2.E1.m1.1.1.1.1.1.1.2.cmml" xref="S2.E1.m1.1.1.1.1.1.1.2">𝑄</ci><apply id="S2.E1.m1.1.1.1.1.1.1.3.cmml" xref="S2.E1.m1.1.1.1.1.1.1.3"><times id="S2.E1.m1.1.1.1.1.1.1.3.1.cmml" xref="S2.E1.m1.1.1.1.1.1.1.3.1"></times><apply id="S2.E1.m1.1.1.1.1.1.1.3.2.cmml" xref="S2.E1.m1.1.1.1.1.1.1.3.2"><csymbol cd="ambiguous" id="S2.E1.m1.1.1.1.1.1.1.3.2.1.cmml" 
xref="S2.E1.m1.1.1.1.1.1.1.3.2">superscript</csymbol><ci id="S2.E1.m1.1.1.1.1.1.1.3.2.2.cmml" xref="S2.E1.m1.1.1.1.1.1.1.3.2.2">𝑊</ci><ci id="S2.E1.m1.1.1.1.1.1.1.3.2.3.cmml" xref="S2.E1.m1.1.1.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S2.E1.m1.1.1.1.1.1.1.3.3.cmml" xref="S2.E1.m1.1.1.1.1.1.1.3.3">𝑆</ci></apply></apply><apply id="S2.E1.m1.1.1.1.1.2.2.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.1.1.1.1.2.2.3a.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.3">formulae-sequence</csymbol><apply id="S2.E1.m1.1.1.1.1.2.2.1.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1"><eq id="S2.E1.m1.1.1.1.1.2.2.1.1.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.1"></eq><apply id="S2.E1.m1.1.1.1.1.2.2.1.1.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2"><times id="S2.E1.m1.1.1.1.1.2.2.1.1.2.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.1"></times><ci id="S2.E1.m1.1.1.1.1.2.2.1.1.2.2a.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.2"><mtext id="S2.E1.m1.1.1.1.1.2.2.1.1.2.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.2"> </mtext></ci><ci id="S2.E1.m1.1.1.1.1.2.2.1.1.2.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.2.3">𝐾</ci></apply><apply id="S2.E1.m1.1.1.1.1.2.2.1.1.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3"><times id="S2.E1.m1.1.1.1.1.2.2.1.1.3.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.1"></times><apply id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2"><csymbol cd="ambiguous" id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2">superscript</csymbol><ci id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.2">𝑊</ci><ci id="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.2.3">𝐾</ci></apply><ci id="S2.E1.m1.1.1.1.1.2.2.1.1.3.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.1.1.3.3">𝑆</ci></apply></apply><apply id="S2.E1.m1.1.1.1.1.2.2.2.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2"><eq id="S2.E1.m1.1.1.1.1.2.2.2.2.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.1"></eq><apply id="S2.E1.m1.1.1.1.1.2.2.2.2.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2"><times 
id="S2.E1.m1.1.1.1.1.2.2.2.2.2.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.1"></times><ci id="S2.E1.m1.1.1.1.1.2.2.2.2.2.2a.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.2"><mtext id="S2.E1.m1.1.1.1.1.2.2.2.2.2.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.2"> </mtext></ci><ci id="S2.E1.m1.1.1.1.1.2.2.2.2.2.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.2.3">𝑉</ci></apply><apply id="S2.E1.m1.1.1.1.1.2.2.2.2.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3"><times id="S2.E1.m1.1.1.1.1.2.2.2.2.3.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.1"></times><apply id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2"><csymbol cd="ambiguous" id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.1.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2">superscript</csymbol><ci id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.2.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.2">𝑊</ci><ci id="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.2.3">𝑉</ci></apply><ci id="S2.E1.m1.1.1.1.1.2.2.2.2.3.3.cmml" xref="S2.E1.m1.1.1.1.1.2.2.2.2.3.3">𝑆</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1.m1.1c">\displaystyle Q=W^{Q}S,\mbox{ }K=W^{K}S,\mbox{ }V=W^{V}S.</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m1.1d">italic_Q = italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT italic_S , italic_K = italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_S , italic_V = italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT italic_S .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS1.SSS1.p1.4">Then, each layer calculates the output <math alttext="O_{i}" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.4.m1.1"><semantics id="S2.SS1.SSS1.p1.4.m1.1a"><msub id="S2.SS1.SSS1.p1.4.m1.1.1" 
xref="S2.SS1.SSS1.p1.4.m1.1.1.cmml"><mi id="S2.SS1.SSS1.p1.4.m1.1.1.2" xref="S2.SS1.SSS1.p1.4.m1.1.1.2.cmml">O</mi><mi id="S2.SS1.SSS1.p1.4.m1.1.1.3" xref="S2.SS1.SSS1.p1.4.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.4.m1.1b"><apply id="S2.SS1.SSS1.p1.4.m1.1.1.cmml" xref="S2.SS1.SSS1.p1.4.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.4.m1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.4.m1.1.1">subscript</csymbol><ci id="S2.SS1.SSS1.p1.4.m1.1.1.2.cmml" xref="S2.SS1.SSS1.p1.4.m1.1.1.2">𝑂</ci><ci id="S2.SS1.SSS1.p1.4.m1.1.1.3.cmml" xref="S2.SS1.SSS1.p1.4.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.4.m1.1c">O_{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.4.m1.1d">italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> of the self-attention layer for each token by:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx2"> <tbody id="S2.E2"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle O_{i}=\sum_{j=1}^{n}a_{i,j}V_{j},\mbox{ }a_{i,j}=\frac{\mathcal{% F}(Q_{i},K_{j})}{\sum_{j=1}^{n}\mathcal{F}(Q_{i},K_{j})}," class="ltx_Math" display="inline" id="S2.E2.m1.9"><semantics id="S2.E2.m1.9a"><mrow id="S2.E2.m1.9.9.1"><mrow id="S2.E2.m1.9.9.1.1.2" xref="S2.E2.m1.9.9.1.1.3.cmml"><mrow id="S2.E2.m1.9.9.1.1.1.1" xref="S2.E2.m1.9.9.1.1.1.1.cmml"><msub id="S2.E2.m1.9.9.1.1.1.1.2" xref="S2.E2.m1.9.9.1.1.1.1.2.cmml"><mi id="S2.E2.m1.9.9.1.1.1.1.2.2" xref="S2.E2.m1.9.9.1.1.1.1.2.2.cmml">O</mi><mi id="S2.E2.m1.9.9.1.1.1.1.2.3" xref="S2.E2.m1.9.9.1.1.1.1.2.3.cmml">i</mi></msub><mo id="S2.E2.m1.9.9.1.1.1.1.1" xref="S2.E2.m1.9.9.1.1.1.1.1.cmml">=</mo><mrow id="S2.E2.m1.9.9.1.1.1.1.3" xref="S2.E2.m1.9.9.1.1.1.1.3.cmml"><mstyle displaystyle="true" 
id="S2.E2.m1.9.9.1.1.1.1.3.1" xref="S2.E2.m1.9.9.1.1.1.1.3.1.cmml"><munderover id="S2.E2.m1.9.9.1.1.1.1.3.1a" xref="S2.E2.m1.9.9.1.1.1.1.3.1.cmml"><mo id="S2.E2.m1.9.9.1.1.1.1.3.1.2.2" movablelimits="false" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.2.cmml">∑</mo><mrow id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.cmml"><mi id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.2" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.2.cmml">j</mi><mo id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.1" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.1.cmml">=</mo><mn id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.3" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.3.cmml">1</mn></mrow><mi id="S2.E2.m1.9.9.1.1.1.1.3.1.3" xref="S2.E2.m1.9.9.1.1.1.1.3.1.3.cmml">n</mi></munderover></mstyle><mrow id="S2.E2.m1.9.9.1.1.1.1.3.2" xref="S2.E2.m1.9.9.1.1.1.1.3.2.cmml"><msub id="S2.E2.m1.9.9.1.1.1.1.3.2.2" xref="S2.E2.m1.9.9.1.1.1.1.3.2.2.cmml"><mi id="S2.E2.m1.9.9.1.1.1.1.3.2.2.2" xref="S2.E2.m1.9.9.1.1.1.1.3.2.2.2.cmml">a</mi><mrow id="S2.E2.m1.2.2.2.4" xref="S2.E2.m1.2.2.2.3.cmml"><mi id="S2.E2.m1.1.1.1.1" xref="S2.E2.m1.1.1.1.1.cmml">i</mi><mo id="S2.E2.m1.2.2.2.4.1" xref="S2.E2.m1.2.2.2.3.cmml">,</mo><mi id="S2.E2.m1.2.2.2.2" xref="S2.E2.m1.2.2.2.2.cmml">j</mi></mrow></msub><mo id="S2.E2.m1.9.9.1.1.1.1.3.2.1" xref="S2.E2.m1.9.9.1.1.1.1.3.2.1.cmml">⁢</mo><msub id="S2.E2.m1.9.9.1.1.1.1.3.2.3" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3.cmml"><mi id="S2.E2.m1.9.9.1.1.1.1.3.2.3.2" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3.2.cmml">V</mi><mi id="S2.E2.m1.9.9.1.1.1.1.3.2.3.3" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3.3.cmml">j</mi></msub></mrow></mrow></mrow><mo id="S2.E2.m1.9.9.1.1.2.3" xref="S2.E2.m1.9.9.1.1.3a.cmml">,</mo><mrow id="S2.E2.m1.9.9.1.1.2.2" xref="S2.E2.m1.9.9.1.1.2.2.cmml"><mrow id="S2.E2.m1.9.9.1.1.2.2.2" xref="S2.E2.m1.9.9.1.1.2.2.2.cmml"><mtext id="S2.E2.m1.9.9.1.1.2.2.2.2" xref="S2.E2.m1.9.9.1.1.2.2.2.2a.cmml"> </mtext><mo id="S2.E2.m1.9.9.1.1.2.2.2.1" xref="S2.E2.m1.9.9.1.1.2.2.2.1.cmml">⁢</mo><msub id="S2.E2.m1.9.9.1.1.2.2.2.3" xref="S2.E2.m1.9.9.1.1.2.2.2.3.cmml"><mi 
id="S2.E2.m1.9.9.1.1.2.2.2.3.2" xref="S2.E2.m1.9.9.1.1.2.2.2.3.2.cmml">a</mi><mrow id="S2.E2.m1.4.4.2.4" xref="S2.E2.m1.4.4.2.3.cmml"><mi id="S2.E2.m1.3.3.1.1" xref="S2.E2.m1.3.3.1.1.cmml">i</mi><mo id="S2.E2.m1.4.4.2.4.1" xref="S2.E2.m1.4.4.2.3.cmml">,</mo><mi id="S2.E2.m1.4.4.2.2" xref="S2.E2.m1.4.4.2.2.cmml">j</mi></mrow></msub></mrow><mo id="S2.E2.m1.9.9.1.1.2.2.1" xref="S2.E2.m1.9.9.1.1.2.2.1.cmml">=</mo><mstyle displaystyle="true" id="S2.E2.m1.8.8" xref="S2.E2.m1.8.8.cmml"><mfrac id="S2.E2.m1.8.8a" xref="S2.E2.m1.8.8.cmml"><mrow id="S2.E2.m1.6.6.2" xref="S2.E2.m1.6.6.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E2.m1.6.6.2.4" xref="S2.E2.m1.6.6.2.4.cmml">ℱ</mi><mo id="S2.E2.m1.6.6.2.3" xref="S2.E2.m1.6.6.2.3.cmml">⁢</mo><mrow id="S2.E2.m1.6.6.2.2.2" xref="S2.E2.m1.6.6.2.2.3.cmml"><mo id="S2.E2.m1.6.6.2.2.2.3" stretchy="false" xref="S2.E2.m1.6.6.2.2.3.cmml">(</mo><msub id="S2.E2.m1.5.5.1.1.1.1" xref="S2.E2.m1.5.5.1.1.1.1.cmml"><mi id="S2.E2.m1.5.5.1.1.1.1.2" xref="S2.E2.m1.5.5.1.1.1.1.2.cmml">Q</mi><mi id="S2.E2.m1.5.5.1.1.1.1.3" xref="S2.E2.m1.5.5.1.1.1.1.3.cmml">i</mi></msub><mo id="S2.E2.m1.6.6.2.2.2.4" xref="S2.E2.m1.6.6.2.2.3.cmml">,</mo><msub id="S2.E2.m1.6.6.2.2.2.2" xref="S2.E2.m1.6.6.2.2.2.2.cmml"><mi id="S2.E2.m1.6.6.2.2.2.2.2" xref="S2.E2.m1.6.6.2.2.2.2.2.cmml">K</mi><mi id="S2.E2.m1.6.6.2.2.2.2.3" xref="S2.E2.m1.6.6.2.2.2.2.3.cmml">j</mi></msub><mo id="S2.E2.m1.6.6.2.2.2.5" stretchy="false" xref="S2.E2.m1.6.6.2.2.3.cmml">)</mo></mrow></mrow><mrow id="S2.E2.m1.8.8.4" xref="S2.E2.m1.8.8.4.cmml"><msubsup id="S2.E2.m1.8.8.4.3" xref="S2.E2.m1.8.8.4.3.cmml"><mo id="S2.E2.m1.8.8.4.3.2.2" xref="S2.E2.m1.8.8.4.3.2.2.cmml">∑</mo><mrow id="S2.E2.m1.8.8.4.3.2.3" xref="S2.E2.m1.8.8.4.3.2.3.cmml"><mi id="S2.E2.m1.8.8.4.3.2.3.2" xref="S2.E2.m1.8.8.4.3.2.3.2.cmml">j</mi><mo id="S2.E2.m1.8.8.4.3.2.3.1" xref="S2.E2.m1.8.8.4.3.2.3.1.cmml">=</mo><mn id="S2.E2.m1.8.8.4.3.2.3.3" xref="S2.E2.m1.8.8.4.3.2.3.3.cmml">1</mn></mrow><mi id="S2.E2.m1.8.8.4.3.3" 
xref="S2.E2.m1.8.8.4.3.3.cmml">n</mi></msubsup><mrow id="S2.E2.m1.8.8.4.2" xref="S2.E2.m1.8.8.4.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E2.m1.8.8.4.2.4" xref="S2.E2.m1.8.8.4.2.4.cmml">ℱ</mi><mo id="S2.E2.m1.8.8.4.2.3" xref="S2.E2.m1.8.8.4.2.3.cmml">⁢</mo><mrow id="S2.E2.m1.8.8.4.2.2.2" xref="S2.E2.m1.8.8.4.2.2.3.cmml"><mo id="S2.E2.m1.8.8.4.2.2.2.3" stretchy="false" xref="S2.E2.m1.8.8.4.2.2.3.cmml">(</mo><msub id="S2.E2.m1.7.7.3.1.1.1.1" xref="S2.E2.m1.7.7.3.1.1.1.1.cmml"><mi id="S2.E2.m1.7.7.3.1.1.1.1.2" xref="S2.E2.m1.7.7.3.1.1.1.1.2.cmml">Q</mi><mi id="S2.E2.m1.7.7.3.1.1.1.1.3" xref="S2.E2.m1.7.7.3.1.1.1.1.3.cmml">i</mi></msub><mo id="S2.E2.m1.8.8.4.2.2.2.4" xref="S2.E2.m1.8.8.4.2.2.3.cmml">,</mo><msub id="S2.E2.m1.8.8.4.2.2.2.2" xref="S2.E2.m1.8.8.4.2.2.2.2.cmml"><mi id="S2.E2.m1.8.8.4.2.2.2.2.2" xref="S2.E2.m1.8.8.4.2.2.2.2.2.cmml">K</mi><mi id="S2.E2.m1.8.8.4.2.2.2.2.3" xref="S2.E2.m1.8.8.4.2.2.2.2.3.cmml">j</mi></msub><mo id="S2.E2.m1.8.8.4.2.2.2.5" stretchy="false" xref="S2.E2.m1.8.8.4.2.2.3.cmml">)</mo></mrow></mrow></mrow></mfrac></mstyle></mrow></mrow><mo id="S2.E2.m1.9.9.1.2">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E2.m1.9b"><apply id="S2.E2.m1.9.9.1.1.3.cmml" xref="S2.E2.m1.9.9.1.1.2"><csymbol cd="ambiguous" id="S2.E2.m1.9.9.1.1.3a.cmml" xref="S2.E2.m1.9.9.1.1.2.3">formulae-sequence</csymbol><apply id="S2.E2.m1.9.9.1.1.1.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1"><eq id="S2.E2.m1.9.9.1.1.1.1.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.1"></eq><apply id="S2.E2.m1.9.9.1.1.1.1.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.E2.m1.9.9.1.1.1.1.2.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.2">subscript</csymbol><ci id="S2.E2.m1.9.9.1.1.1.1.2.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.2.2">𝑂</ci><ci id="S2.E2.m1.9.9.1.1.1.1.2.3.cmml" xref="S2.E2.m1.9.9.1.1.1.1.2.3">𝑖</ci></apply><apply id="S2.E2.m1.9.9.1.1.1.1.3.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3"><apply id="S2.E2.m1.9.9.1.1.1.1.3.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1"><csymbol 
cd="ambiguous" id="S2.E2.m1.9.9.1.1.1.1.3.1.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1">superscript</csymbol><apply id="S2.E2.m1.9.9.1.1.1.1.3.1.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1"><csymbol cd="ambiguous" id="S2.E2.m1.9.9.1.1.1.1.3.1.2.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1">subscript</csymbol><sum id="S2.E2.m1.9.9.1.1.1.1.3.1.2.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.2"></sum><apply id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3"><eq id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.1"></eq><ci id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.2">𝑗</ci><cn id="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.3.cmml" type="integer" xref="S2.E2.m1.9.9.1.1.1.1.3.1.2.3.3">1</cn></apply></apply><ci id="S2.E2.m1.9.9.1.1.1.1.3.1.3.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.1.3">𝑛</ci></apply><apply id="S2.E2.m1.9.9.1.1.1.1.3.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2"><times id="S2.E2.m1.9.9.1.1.1.1.3.2.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.1"></times><apply id="S2.E2.m1.9.9.1.1.1.1.3.2.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.9.9.1.1.1.1.3.2.2.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.2">subscript</csymbol><ci id="S2.E2.m1.9.9.1.1.1.1.3.2.2.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.2.2">𝑎</ci><list id="S2.E2.m1.2.2.2.3.cmml" xref="S2.E2.m1.2.2.2.4"><ci id="S2.E2.m1.1.1.1.1.cmml" xref="S2.E2.m1.1.1.1.1">𝑖</ci><ci id="S2.E2.m1.2.2.2.2.cmml" xref="S2.E2.m1.2.2.2.2">𝑗</ci></list></apply><apply id="S2.E2.m1.9.9.1.1.1.1.3.2.3.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3"><csymbol cd="ambiguous" id="S2.E2.m1.9.9.1.1.1.1.3.2.3.1.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3">subscript</csymbol><ci id="S2.E2.m1.9.9.1.1.1.1.3.2.3.2.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3.2">𝑉</ci><ci id="S2.E2.m1.9.9.1.1.1.1.3.2.3.3.cmml" xref="S2.E2.m1.9.9.1.1.1.1.3.2.3.3">𝑗</ci></apply></apply></apply></apply><apply id="S2.E2.m1.9.9.1.1.2.2.cmml" xref="S2.E2.m1.9.9.1.1.2.2"><eq id="S2.E2.m1.9.9.1.1.2.2.1.cmml" 
xref="S2.E2.m1.9.9.1.1.2.2.1"></eq><apply id="S2.E2.m1.9.9.1.1.2.2.2.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2"><times id="S2.E2.m1.9.9.1.1.2.2.2.1.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2.1"></times><ci id="S2.E2.m1.9.9.1.1.2.2.2.2a.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2.2"><mtext id="S2.E2.m1.9.9.1.1.2.2.2.2.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2.2"> </mtext></ci><apply id="S2.E2.m1.9.9.1.1.2.2.2.3.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2.3"><csymbol cd="ambiguous" id="S2.E2.m1.9.9.1.1.2.2.2.3.1.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2.3">subscript</csymbol><ci id="S2.E2.m1.9.9.1.1.2.2.2.3.2.cmml" xref="S2.E2.m1.9.9.1.1.2.2.2.3.2">𝑎</ci><list id="S2.E2.m1.4.4.2.3.cmml" xref="S2.E2.m1.4.4.2.4"><ci id="S2.E2.m1.3.3.1.1.cmml" xref="S2.E2.m1.3.3.1.1">𝑖</ci><ci id="S2.E2.m1.4.4.2.2.cmml" xref="S2.E2.m1.4.4.2.2">𝑗</ci></list></apply></apply><apply id="S2.E2.m1.8.8.cmml" xref="S2.E2.m1.8.8"><divide id="S2.E2.m1.8.8.5.cmml" xref="S2.E2.m1.8.8"></divide><apply id="S2.E2.m1.6.6.2.cmml" xref="S2.E2.m1.6.6.2"><times id="S2.E2.m1.6.6.2.3.cmml" xref="S2.E2.m1.6.6.2.3"></times><ci id="S2.E2.m1.6.6.2.4.cmml" xref="S2.E2.m1.6.6.2.4">ℱ</ci><interval closure="open" id="S2.E2.m1.6.6.2.2.3.cmml" xref="S2.E2.m1.6.6.2.2.2"><apply id="S2.E2.m1.5.5.1.1.1.1.cmml" xref="S2.E2.m1.5.5.1.1.1.1"><csymbol cd="ambiguous" id="S2.E2.m1.5.5.1.1.1.1.1.cmml" xref="S2.E2.m1.5.5.1.1.1.1">subscript</csymbol><ci id="S2.E2.m1.5.5.1.1.1.1.2.cmml" xref="S2.E2.m1.5.5.1.1.1.1.2">𝑄</ci><ci id="S2.E2.m1.5.5.1.1.1.1.3.cmml" xref="S2.E2.m1.5.5.1.1.1.1.3">𝑖</ci></apply><apply id="S2.E2.m1.6.6.2.2.2.2.cmml" xref="S2.E2.m1.6.6.2.2.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.6.6.2.2.2.2.1.cmml" xref="S2.E2.m1.6.6.2.2.2.2">subscript</csymbol><ci id="S2.E2.m1.6.6.2.2.2.2.2.cmml" xref="S2.E2.m1.6.6.2.2.2.2.2">𝐾</ci><ci id="S2.E2.m1.6.6.2.2.2.2.3.cmml" xref="S2.E2.m1.6.6.2.2.2.2.3">𝑗</ci></apply></interval></apply><apply id="S2.E2.m1.8.8.4.cmml" xref="S2.E2.m1.8.8.4"><apply id="S2.E2.m1.8.8.4.3.cmml" xref="S2.E2.m1.8.8.4.3"><csymbol cd="ambiguous" 
id="S2.E2.m1.8.8.4.3.1.cmml" xref="S2.E2.m1.8.8.4.3">superscript</csymbol><apply id="S2.E2.m1.8.8.4.3.2.cmml" xref="S2.E2.m1.8.8.4.3"><csymbol cd="ambiguous" id="S2.E2.m1.8.8.4.3.2.1.cmml" xref="S2.E2.m1.8.8.4.3">subscript</csymbol><sum id="S2.E2.m1.8.8.4.3.2.2.cmml" xref="S2.E2.m1.8.8.4.3.2.2"></sum><apply id="S2.E2.m1.8.8.4.3.2.3.cmml" xref="S2.E2.m1.8.8.4.3.2.3"><eq id="S2.E2.m1.8.8.4.3.2.3.1.cmml" xref="S2.E2.m1.8.8.4.3.2.3.1"></eq><ci id="S2.E2.m1.8.8.4.3.2.3.2.cmml" xref="S2.E2.m1.8.8.4.3.2.3.2">𝑗</ci><cn id="S2.E2.m1.8.8.4.3.2.3.3.cmml" type="integer" xref="S2.E2.m1.8.8.4.3.2.3.3">1</cn></apply></apply><ci id="S2.E2.m1.8.8.4.3.3.cmml" xref="S2.E2.m1.8.8.4.3.3">𝑛</ci></apply><apply id="S2.E2.m1.8.8.4.2.cmml" xref="S2.E2.m1.8.8.4.2"><times id="S2.E2.m1.8.8.4.2.3.cmml" xref="S2.E2.m1.8.8.4.2.3"></times><ci id="S2.E2.m1.8.8.4.2.4.cmml" xref="S2.E2.m1.8.8.4.2.4">ℱ</ci><interval closure="open" id="S2.E2.m1.8.8.4.2.2.3.cmml" xref="S2.E2.m1.8.8.4.2.2.2"><apply id="S2.E2.m1.7.7.3.1.1.1.1.cmml" xref="S2.E2.m1.7.7.3.1.1.1.1"><csymbol cd="ambiguous" id="S2.E2.m1.7.7.3.1.1.1.1.1.cmml" xref="S2.E2.m1.7.7.3.1.1.1.1">subscript</csymbol><ci id="S2.E2.m1.7.7.3.1.1.1.1.2.cmml" xref="S2.E2.m1.7.7.3.1.1.1.1.2">𝑄</ci><ci id="S2.E2.m1.7.7.3.1.1.1.1.3.cmml" xref="S2.E2.m1.7.7.3.1.1.1.1.3">𝑖</ci></apply><apply id="S2.E2.m1.8.8.4.2.2.2.2.cmml" xref="S2.E2.m1.8.8.4.2.2.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.8.8.4.2.2.2.2.1.cmml" xref="S2.E2.m1.8.8.4.2.2.2.2">subscript</csymbol><ci id="S2.E2.m1.8.8.4.2.2.2.2.2.cmml" xref="S2.E2.m1.8.8.4.2.2.2.2.2">𝐾</ci><ci id="S2.E2.m1.8.8.4.2.2.2.2.3.cmml" xref="S2.E2.m1.8.8.4.2.2.2.2.3">𝑗</ci></apply></interval></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2.m1.9c">\displaystyle O_{i}=\sum_{j=1}^{n}a_{i,j}V_{j},\mbox{ }a_{i,j}=\frac{\mathcal{% F}(Q_{i},K_{j})}{\sum_{j=1}^{n}\mathcal{F}(Q_{i},K_{j})},</annotation><annotation encoding="application/x-llamapun" id="S2.E2.m1.9d">italic_O 
start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = divide start_ARG caligraphic_F ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT caligraphic_F ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS1.SSS1.p1.8">where <math alttext="a_{i,j}" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.5.m1.2"><semantics id="S2.SS1.SSS1.p1.5.m1.2a"><msub id="S2.SS1.SSS1.p1.5.m1.2.3" xref="S2.SS1.SSS1.p1.5.m1.2.3.cmml"><mi id="S2.SS1.SSS1.p1.5.m1.2.3.2" xref="S2.SS1.SSS1.p1.5.m1.2.3.2.cmml">a</mi><mrow id="S2.SS1.SSS1.p1.5.m1.2.2.2.4" xref="S2.SS1.SSS1.p1.5.m1.2.2.2.3.cmml"><mi id="S2.SS1.SSS1.p1.5.m1.1.1.1.1" xref="S2.SS1.SSS1.p1.5.m1.1.1.1.1.cmml">i</mi><mo id="S2.SS1.SSS1.p1.5.m1.2.2.2.4.1" xref="S2.SS1.SSS1.p1.5.m1.2.2.2.3.cmml">,</mo><mi id="S2.SS1.SSS1.p1.5.m1.2.2.2.2" xref="S2.SS1.SSS1.p1.5.m1.2.2.2.2.cmml">j</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.5.m1.2b"><apply id="S2.SS1.SSS1.p1.5.m1.2.3.cmml" xref="S2.SS1.SSS1.p1.5.m1.2.3"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.5.m1.2.3.1.cmml" xref="S2.SS1.SSS1.p1.5.m1.2.3">subscript</csymbol><ci id="S2.SS1.SSS1.p1.5.m1.2.3.2.cmml" 
xref="S2.SS1.SSS1.p1.5.m1.2.3.2">𝑎</ci><list id="S2.SS1.SSS1.p1.5.m1.2.2.2.3.cmml" xref="S2.SS1.SSS1.p1.5.m1.2.2.2.4"><ci id="S2.SS1.SSS1.p1.5.m1.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.5.m1.1.1.1.1">𝑖</ci><ci id="S2.SS1.SSS1.p1.5.m1.2.2.2.2.cmml" xref="S2.SS1.SSS1.p1.5.m1.2.2.2.2">𝑗</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.5.m1.2c">a_{i,j}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.5.m1.2d">italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT</annotation></semantics></math> is the attention score between tokens <math alttext="s_{i}" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.6.m2.1"><semantics id="S2.SS1.SSS1.p1.6.m2.1a"><msub id="S2.SS1.SSS1.p1.6.m2.1.1" xref="S2.SS1.SSS1.p1.6.m2.1.1.cmml"><mi id="S2.SS1.SSS1.p1.6.m2.1.1.2" xref="S2.SS1.SSS1.p1.6.m2.1.1.2.cmml">s</mi><mi id="S2.SS1.SSS1.p1.6.m2.1.1.3" xref="S2.SS1.SSS1.p1.6.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.6.m2.1b"><apply id="S2.SS1.SSS1.p1.6.m2.1.1.cmml" xref="S2.SS1.SSS1.p1.6.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.6.m2.1.1.1.cmml" xref="S2.SS1.SSS1.p1.6.m2.1.1">subscript</csymbol><ci id="S2.SS1.SSS1.p1.6.m2.1.1.2.cmml" xref="S2.SS1.SSS1.p1.6.m2.1.1.2">𝑠</ci><ci id="S2.SS1.SSS1.p1.6.m2.1.1.3.cmml" xref="S2.SS1.SSS1.p1.6.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.6.m2.1c">s_{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.6.m2.1d">italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="s_{j}" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.7.m3.1"><semantics id="S2.SS1.SSS1.p1.7.m3.1a"><msub id="S2.SS1.SSS1.p1.7.m3.1.1" xref="S2.SS1.SSS1.p1.7.m3.1.1.cmml"><mi id="S2.SS1.SSS1.p1.7.m3.1.1.2" xref="S2.SS1.SSS1.p1.7.m3.1.1.2.cmml">s</mi><mi id="S2.SS1.SSS1.p1.7.m3.1.1.3" 
xref="S2.SS1.SSS1.p1.7.m3.1.1.3.cmml">j</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.7.m3.1b"><apply id="S2.SS1.SSS1.p1.7.m3.1.1.cmml" xref="S2.SS1.SSS1.p1.7.m3.1.1"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.7.m3.1.1.1.cmml" xref="S2.SS1.SSS1.p1.7.m3.1.1">subscript</csymbol><ci id="S2.SS1.SSS1.p1.7.m3.1.1.2.cmml" xref="S2.SS1.SSS1.p1.7.m3.1.1.2">𝑠</ci><ci id="S2.SS1.SSS1.p1.7.m3.1.1.3.cmml" xref="S2.SS1.SSS1.p1.7.m3.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.7.m3.1c">s_{j}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.7.m3.1d">italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math>. Typically, we conduct dot-product attention with Softmax normalization by setting <math alttext="\mathcal{F}(Q_{i},K_{j})=\exp(Q_{i}K_{j}^{T})" class="ltx_Math" display="inline" id="S2.SS1.SSS1.p1.8.m4.4"><semantics id="S2.SS1.SSS1.p1.8.m4.4a"><mrow id="S2.SS1.SSS1.p1.8.m4.4.4" xref="S2.SS1.SSS1.p1.8.m4.4.4.cmml"><mrow id="S2.SS1.SSS1.p1.8.m4.3.3.2" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.SS1.SSS1.p1.8.m4.3.3.2.4" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.4.cmml">ℱ</mi><mo id="S2.SS1.SSS1.p1.8.m4.3.3.2.3" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.3.cmml">⁢</mo><mrow id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.3.cmml"><mo id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.3" stretchy="false" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.3.cmml">(</mo><msub id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.cmml"><mi id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.2" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.2.cmml">Q</mi><mi id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.3" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.3.cmml">i</mi></msub><mo id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.4" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.3.cmml">,</mo><msub id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.cmml"><mi 
id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.2" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.2.cmml">K</mi><mi id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.3" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.3.cmml">j</mi></msub><mo id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.5" stretchy="false" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.3.cmml">)</mo></mrow></mrow><mo id="S2.SS1.SSS1.p1.8.m4.4.4.4" xref="S2.SS1.SSS1.p1.8.m4.4.4.4.cmml">=</mo><mrow id="S2.SS1.SSS1.p1.8.m4.4.4.3.1" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.2.cmml"><mi id="S2.SS1.SSS1.p1.8.m4.1.1" xref="S2.SS1.SSS1.p1.8.m4.1.1.cmml">exp</mi><mo id="S2.SS1.SSS1.p1.8.m4.4.4.3.1a" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.2.cmml">⁡</mo><mrow id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.2.cmml"><mo id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.2" stretchy="false" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.2.cmml">(</mo><mrow id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.cmml"><msub id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.cmml"><mi id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.2" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.2.cmml">Q</mi><mi id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.3" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.3.cmml">i</mi></msub><mo id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.1" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.1.cmml">⁢</mo><msubsup id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.cmml"><mi id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.2" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.2.cmml">K</mi><mi id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.3" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.3.cmml">j</mi><mi id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.3" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.3.cmml">T</mi></msubsup></mrow><mo id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.3" stretchy="false" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.2.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.SSS1.p1.8.m4.4b"><apply id="S2.SS1.SSS1.p1.8.m4.4.4.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4"><eq 
id="S2.SS1.SSS1.p1.8.m4.4.4.4.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.4"></eq><apply id="S2.SS1.SSS1.p1.8.m4.3.3.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2"><times id="S2.SS1.SSS1.p1.8.m4.3.3.2.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.3"></times><ci id="S2.SS1.SSS1.p1.8.m4.3.3.2.4.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.4">ℱ</ci><interval closure="open" id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2"><apply id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1">subscript</csymbol><ci id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.2">𝑄</ci><ci id="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.2.2.1.1.1.1.3">𝑖</ci></apply><apply id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2">subscript</csymbol><ci id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.2">𝐾</ci><ci id="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.3.3.2.2.2.2.3">𝑗</ci></apply></interval></apply><apply id="S2.SS1.SSS1.p1.8.m4.4.4.3.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1"><exp id="S2.SS1.SSS1.p1.8.m4.1.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.1.1"></exp><apply id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1"><times id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.1"></times><apply id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2">subscript</csymbol><ci id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.2">𝑄</ci><ci id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.3.cmml" 
xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.2.3">𝑖</ci></apply><apply id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3">superscript</csymbol><apply id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.1.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3">subscript</csymbol><ci id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.2.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.2">𝐾</ci><ci id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.2.3">𝑗</ci></apply><ci id="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.3.cmml" xref="S2.SS1.SSS1.p1.8.m4.4.4.3.1.1.1.3.3">𝑇</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.SSS1.p1.8.m4.4c">\mathcal{F}(Q_{i},K_{j})=\exp(Q_{i}K_{j}^{T})</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.SSS1.p1.8.m4.4d">caligraphic_F ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = roman_exp ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT )</annotation></semantics></math>. The self-attention outputs are then projected by the FFN layer and forwarded to the next transformer block as inputs. The outputs of the final transformer block are a probability vector to mark out the most probable output tokens.</p> </div> <div class="ltx_para" id="S2.SS1.SSS1.p2"> <p class="ltx_p" id="S2.SS1.SSS1.p2.1">During the generation, the intermediate results (key and value) of generated tokens are used in the self-attention layer for generating the subsequent tokens. 
To avoid redundant computation, existing inference engines <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib23" title="">23</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib24" title="">24</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib25" title="">25</a>]</cite> store these intermediate results in GPU memory and reuse them in subsequent computation; this store is referred to as the KV cache.</p> </div> </section> <section class="ltx_subsubsection" id="S2.SS1.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S2.SS1.SSS2.5.1.1">II-A</span>2 </span>Speculative Decoding</h4> <div class="ltx_para" id="S2.SS1.SSS2.p1"> <p class="ltx_p" id="S2.SS1.SSS2.p1.1">The entire speculative decoding process consists of multiple iterations, each of which involves a <span class="ltx_text ltx_font_italic" id="S2.SS1.SSS2.p1.1.1">speculation phase</span> and a <span class="ltx_text ltx_font_italic" id="S2.SS1.SSS2.p1.1.2">verification phase</span>. In the speculation phase, the SSM generates multiple candidate tokens. The LLM then verifies these candidate tokens in the verification phase. If these tokens are consistent with the LLM’s output, they are accepted as part of the final output. The speculation and verification phases repeat until an EOS token is accepted by the LLM. 
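</p> </div> <div class="ltx_para"> <p class="ltx_p">The iteration described above can be sketched as follows. This is only a minimal illustration under greedy decoding, not the implementation of any inference engine: the names <span class="ltx_text ltx_font_typewriter">speculative_decode</span>, <span class="ltx_text ltx_font_typewriter">toy_llm</span>, and <span class="ltx_text ltx_font_typewriter">toy_ssm</span>, the draft length <span class="ltx_text ltx_font_typewriter">k</span>, and the EOS id 0 are assumptions made for the example. The LLM is also called once per candidate for readability, whereas a real engine would score all candidates in a single forward pass.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from typing import Callable, List

Token = int
EOS: Token = 0  # hypothetical end-of-sequence id
NextToken = Callable[[List[Token]], Token]  # maps a context to its next token


def speculative_decode(prompt: List[Token], ssm: NextToken, llm: NextToken,
                       k: int = 4, max_len: int = 64) -> List[Token]:
    """Greedy speculative decoding with toy next-token functions.

    Assumes a non-empty prompt. max_len is a soft cap checked once per iteration.
    """
    out = list(prompt)
    while out[-1] != EOS and max_len > len(out):
        # Speculation phase: the SSM drafts k candidate tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(ssm(out + draft))
        # Verification phase: compare each candidate with the LLM's own choice.
        for cand in draft:
            target = llm(out)
            out.append(target)  # the emitted token is always the LLM's token
            if target != cand or target == EOS:
                break  # the first mismatch (or EOS) ends this iteration
    return out


def toy_llm(ctx: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10, so 0 (EOS) follows 9."""
    return (ctx[-1] + 1) % 10


def toy_ssm(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with toy_llm except after token 4."""
    return 7 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = speculative_decode([1], toy_ssm, toy_llm)
```
</pre>
<p class="ltx_p">In this toy run the SSM mispredicts once (after token 4), so the first iteration stops at the mismatch and the LLM’s token is emitted instead. Because every emitted token is the LLM’s own choice, the result is identical to what plain autoregressive decoding with the LLM would produce.</p> </div> <div class="ltx_para"> <p class="ltx_p">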
Because speculation is performed by the SSM, which runs significantly faster than the LLM, the more SSM tokens the LLM accepts, the greater the speedup of the LLM’s inference process.</p> </div> <figure class="ltx_figure" id="S2.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="267" id="S2.F2.1.g1" src="x2.png" width="829"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>The ratio of different SSMs selected as the best model across requests in three datasets.</figcaption> </figure> <figure class="ltx_figure" id="S2.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="237" id="S2.F3.1.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Results of three random requests using different SSMs.</figcaption> </figure> </section> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS2.5.1.1">II-B</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS2.6.2">Motivation</span> </h3> <section class="ltx_subsubsection" id="S2.SS2.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S2.SS2.SSS1.5.1.1">II-B</span>1 </span>Limitation of Homogeneous SSMs</h4> <div class="ltx_para" id="S2.SS2.SSS1.p1"> <p class="ltx_p" id="S2.SS2.SSS1.p1.1">Existing work mainly performs speculative decoding using homogeneous SSMs, which fails to achieve the optimal benefit for every inference request. We conduct experiments to demonstrate this issue. Specifically, we use LLaMA-7B as the LLM and select five SSMs from the LLaMA series, whose parameter scales range from 65 million to 1.4 billion. 
We evaluate the performance of speculative decoding across three datasets: Alpaca <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib26" title="">26</a>]</cite>, ChatGPT Prompts (CP) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib27" title="">27</a>]</cite>, and Chatbot Instruction Prompts (CIP) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib28" title="">28</a>]</cite>. Inference latency is used as the performance metric. <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F2" title="Figure 2 ‣ II-A2 Speculative Decoding ‣ II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 2</span></a> reports the percentage of inference requests that achieve the best performance on each SSM. The results indicate that these SSMs perform differently when handling these requests. For example, only 3% of requests in the Alpaca dataset achieve the best performance using LLaMA-265M, but this model is preferred by 40% of requests in the CP dataset.</p> </div> <div class="ltx_para" id="S2.SS2.SSS1.p2"> <p class="ltx_p" id="S2.SS2.SSS1.p2.1">To study these differences in more depth, we randomly select 3 requests from the Alpaca dataset and show their speculation speed, acceptance rate, and inference latency under different SSMs in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F3" title="Figure 3 ‣ II-A2 Speculative Decoding ‣ II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 3</span></a>. 
It can be seen that smaller SSMs, e.g., LLaMA-68M, generate tokens much faster than larger ones, e.g., LLaMA-1.4B. Once speculation speed is weighed against the acceptance rate, which is highly correlated with token prediction accuracy, the optimal SSM varies among these three requests.</p> </div> <div class="ltx_para" id="S2.SS2.SSS1.p3"> <p class="ltx_p" id="S2.SS2.SSS1.p3.1">The above results suggest that heterogeneous SSMs have great potential to accelerate speculative decoding. However, it is not easy to choose the best SSM for each request, because it is hard to accurately estimate the token speculation capability of these SSMs. <span class="ltx_text ltx_font_smallcaps" id="S2.SS2.SSS1.p3.1.1">Spin</span> addresses this challenge with a learning-based algorithm that perceives request difficulty and chooses appropriate SSMs accordingly. It does not rely on prior knowledge of SSM performance on requests.</p> </div> <figure class="ltx_figure" id="S2.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="304" id="S2.F4.1.g1" src="x4.png" width="831"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>The benefits of speculative decoding with different batch sizes.</figcaption> </figure> <figure class="ltx_figure" id="S2.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="136" id="S2.F5.1.g1" src="x5.png" width="747"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>Illustration of padding tokens in attention computation.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="S2.SS2.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S2.SS2.SSS2.5.1.1">II-B</span>2 </span>Batching processing is not well supported by speculative decoding</h4> <div class="ltx_para" id="S2.SS2.SSS2.p1"> <p class="ltx_p" 
id="S2.SS2.SSS2.p1.1">It has been shown that batching multiple requests can significantly improve GPU utilization and LLM inference throughput <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib24" title="">24</a>]</cite>. However, when applying batching for speculative decoding, we find that it comes with increasing overhead as the batch size grows, which may overwhelm its benefits. As shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F4" title="Figure 4 ‣ II-B1 Limitation of Homogeneous SSMs ‣ II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 4</span></a>, we measure the inference throughput, in terms of number of tokens output by LLM per second, under different batch sizes. Without batching, i.e., only 1 request is considered, speculative decoding can achieve 1.61<math alttext="\times" class="ltx_Math" display="inline" id="S2.SS2.SSS2.p1.1.m1.1"><semantics id="S2.SS2.SSS2.p1.1.m1.1a"><mo id="S2.SS2.SSS2.p1.1.m1.1.1" xref="S2.SS2.SSS2.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS2.p1.1.m1.1b"><times id="S2.SS2.SSS2.p1.1.m1.1.1.cmml" xref="S2.SS2.SSS2.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS2.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS2.p1.1.m1.1d">×</annotation></semantics></math> acceleration. However, such an acceleration decreases as we enlarge the batch size. 
In particular, when we set the batch size to 32, speculative decoding runs out of memory, while traditional inference still works with growing throughput.</p> </div> <div class="ltx_para" id="S2.SS2.SSS2.p2"> <p class="ltx_p" id="S2.SS2.SSS2.p2.1">The reason for this inefficiency is that requests have different lengths, but CUDA kernels are designed to handle tensors with regular shapes. To align requests, additional padding tokens must be inserted so that tensors are regular-shaped in the attention computation of the LLM forward pass. An example of padding tokens is shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F5" title="Figure 5 ‣ II-B1 Limitation of Homogeneous SSMs ‣ II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 5</span></a>. These padding tokens contribute nothing to the final output, but waste GPU memory and computation resources.</p> </div> <div class="ltx_para" id="S2.SS2.SSS2.p3"> <p class="ltx_p" id="S2.SS2.SSS2.p3.1">A similar padding overhead also exists in traditional LLM inference without speculative decoding, as claimed by <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib20" title="">20</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib21" title="">21</a>]</cite>. However, speculative decoding exacerbates length differences between batched requests since their token acceptance rates at the LLM may differ greatly, and thus more padding tokens are needed, leading to increased overhead. Moreover, although requests have different lengths in traditional LLM inference, their length gaps are fixed once they are batched. 
In contrast, request lengths change dynamically in speculative decoding, so the solutions proposed by <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib20" title="">20</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib21" title="">21</a>]</cite> can hardly work here.</p> </div> <figure class="ltx_figure" id="S2.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="379" id="S2.F6.1.g1" src="x6.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6: </span>The illustration of orchestration of speculation and verification.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="S2.SS2.SSS3"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S2.SS2.SSS3.5.1.1">II-B</span>3 </span>Speculation and verification need to be orchestrated</h4> <div class="ltx_para" id="S2.SS2.SSS3.p1"> <p class="ltx_p" id="S2.SS2.SSS3.p1.1">Token speculation by the SSMs and token verification by the LLM are sequentially dependent, which can leave GPUs idle and thus constrain the acceleration of speculative decoding, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F6" title="Figure 6 ‣ II-B2 Batching processing is not well supported by speculative decoding ‣ II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 6</span></a>(a). In particular, heterogeneous SSMs have different running speeds, and synchronizing their outputs would seriously delay token verification for fast SSMs. 
SpecPIM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib29" title="">29</a>]</cite> proposes letting some speculative tokens skip verification so that they can be used directly in the next iteration. However, these tokens could be incorrect and waste computing resources. In this paper, we propose to decompose the original batch into multiple micro-batches and pipeline their execution on the SSMs and the LLM. As shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F6" title="Figure 6 ‣ II-B2 Batching processing is not well supported by speculative decoding ‣ II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 6</span></a>(b), this significantly reduces GPU idle time while guaranteeing the correctness of the final output. However, determining how to effectively split the batch on each SSM into micro-batches is challenging, and we design an efficient heuristic to address this issue.</p> </div> </section> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">III </span><span class="ltx_text ltx_font_smallcaps" id="S3.1.1">System Overview</span> </h2> <figure class="ltx_figure" id="S3.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="435" id="S3.F7.1.g1" src="x7.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7: </span>System Overview of <span class="ltx_text ltx_font_smallcaps" id="S3.F7.3.1">Spin</span>.</figcaption> </figure> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1"><span class="ltx_text ltx_font_smallcaps" id="S3.p1.1.1">Spin</span> is an LLM inference serving system based on speculative decoding <cite class="ltx_cite ltx_citemacro_cite">[<a 
class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib14" title="">14</a>]</cite> and batch processing <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib24" title="">24</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib23" title="">23</a>]</cite>. The key design principle of <span class="ltx_text ltx_font_smallcaps" id="S3.p1.1.2">Spin</span> is to exploit heterogeneous SSMs to generate better speculative tokens for various requests. An overview of <span class="ltx_text ltx_font_smallcaps" id="S3.p1.1.3">Spin</span> is shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S3.F7" title="Figure 7 ‣ III System Overview ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 7</span></a>, highlighting its two major modules. The first is <span class="ltx_text ltx_font_smallcaps" id="S3.p1.1.4">Spin</span>’s SSM selector (§<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4" title="IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">IV</span></a>), which selects an appropriate SSM for each request using a learning-based selection algorithm that operates without prior knowledge of request difficulty. The second is the runtime engine (§<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S5" title="V Spin’s Runtime Engine ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">V</span></a>), which manages the execution of the heterogeneous SSMs and the LLM. Requests are first assigned to SSMs according to the decisions made by the selector. 
The speculative tokens generated by the SSMs are sent to the LLM for verification. Request batching is enabled on both the SSMs and the LLM. To improve batching performance, <span class="ltx_text ltx_font_smallcaps" id="S3.p1.1.5">Spin</span> adopts a request decomposition method to reduce redundant padding tokens, particularly benefiting LLM verification. In addition, <span class="ltx_text ltx_font_smallcaps" id="S3.p1.1.6">Spin</span> uses a pipeline parallelism mechanism to orchestrate the speculation and verification phases for further acceleration. In practical deployment, we can launch multiple instances of each type of SSM, and each instance runs on a dedicated GPU, because SSMs are small. The LLM operates on multiple GPUs using tensor parallelism, ensuring efficient utilization of GPU resources.</p> </div> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">IV </span><span class="ltx_text ltx_font_smallcaps" id="S4.1.1">Learning-based SSM Selection</span> </h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">In this section, we first present the problem statement, followed by the algorithm design and its performance analysis. 
</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS1.5.1.1">IV-A</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS1.6.2">Problem Statement</span> </h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.8">We consider a set of inference requests, denoted by <math alttext="\mathcal{N}=\{1,2,...,N\}" class="ltx_Math" display="inline" id="S4.SS1.p1.1.m1.4"><semantics id="S4.SS1.p1.1.m1.4a"><mrow id="S4.SS1.p1.1.m1.4.5" xref="S4.SS1.p1.1.m1.4.5.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS1.p1.1.m1.4.5.2" xref="S4.SS1.p1.1.m1.4.5.2.cmml">𝒩</mi><mo id="S4.SS1.p1.1.m1.4.5.1" xref="S4.SS1.p1.1.m1.4.5.1.cmml">=</mo><mrow id="S4.SS1.p1.1.m1.4.5.3.2" xref="S4.SS1.p1.1.m1.4.5.3.1.cmml"><mo id="S4.SS1.p1.1.m1.4.5.3.2.1" stretchy="false" xref="S4.SS1.p1.1.m1.4.5.3.1.cmml">{</mo><mn id="S4.SS1.p1.1.m1.1.1" xref="S4.SS1.p1.1.m1.1.1.cmml">1</mn><mo id="S4.SS1.p1.1.m1.4.5.3.2.2" xref="S4.SS1.p1.1.m1.4.5.3.1.cmml">,</mo><mn id="S4.SS1.p1.1.m1.2.2" xref="S4.SS1.p1.1.m1.2.2.cmml">2</mn><mo id="S4.SS1.p1.1.m1.4.5.3.2.3" xref="S4.SS1.p1.1.m1.4.5.3.1.cmml">,</mo><mi id="S4.SS1.p1.1.m1.3.3" mathvariant="normal" xref="S4.SS1.p1.1.m1.3.3.cmml">…</mi><mo id="S4.SS1.p1.1.m1.4.5.3.2.4" xref="S4.SS1.p1.1.m1.4.5.3.1.cmml">,</mo><mi id="S4.SS1.p1.1.m1.4.4" xref="S4.SS1.p1.1.m1.4.4.cmml">N</mi><mo id="S4.SS1.p1.1.m1.4.5.3.2.5" stretchy="false" xref="S4.SS1.p1.1.m1.4.5.3.1.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.1.m1.4b"><apply id="S4.SS1.p1.1.m1.4.5.cmml" xref="S4.SS1.p1.1.m1.4.5"><eq id="S4.SS1.p1.1.m1.4.5.1.cmml" xref="S4.SS1.p1.1.m1.4.5.1"></eq><ci id="S4.SS1.p1.1.m1.4.5.2.cmml" xref="S4.SS1.p1.1.m1.4.5.2">𝒩</ci><set id="S4.SS1.p1.1.m1.4.5.3.1.cmml" xref="S4.SS1.p1.1.m1.4.5.3.2"><cn id="S4.SS1.p1.1.m1.1.1.cmml" type="integer" xref="S4.SS1.p1.1.m1.1.1">1</cn><cn id="S4.SS1.p1.1.m1.2.2.cmml" 
type="integer" xref="S4.SS1.p1.1.m1.2.2">2</cn><ci id="S4.SS1.p1.1.m1.3.3.cmml" xref="S4.SS1.p1.1.m1.3.3">…</ci><ci id="S4.SS1.p1.1.m1.4.4.cmml" xref="S4.SS1.p1.1.m1.4.4">𝑁</ci></set></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.1.m1.4c">\mathcal{N}=\{1,2,...,N\}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.1.m1.4d">caligraphic_N = { 1 , 2 , … , italic_N }</annotation></semantics></math>, which is submitted to the GPU cluster managed by <span class="ltx_text ltx_font_smallcaps" id="S4.SS1.p1.8.1">Spin</span>. There are multiple heterogeneous SSMs to perform speculation, denoted by the set <math alttext="\mathcal{M}=\{1,2,...,M\}" class="ltx_Math" display="inline" id="S4.SS1.p1.2.m2.4"><semantics id="S4.SS1.p1.2.m2.4a"><mrow id="S4.SS1.p1.2.m2.4.5" xref="S4.SS1.p1.2.m2.4.5.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS1.p1.2.m2.4.5.2" xref="S4.SS1.p1.2.m2.4.5.2.cmml">ℳ</mi><mo id="S4.SS1.p1.2.m2.4.5.1" xref="S4.SS1.p1.2.m2.4.5.1.cmml">=</mo><mrow id="S4.SS1.p1.2.m2.4.5.3.2" xref="S4.SS1.p1.2.m2.4.5.3.1.cmml"><mo id="S4.SS1.p1.2.m2.4.5.3.2.1" stretchy="false" xref="S4.SS1.p1.2.m2.4.5.3.1.cmml">{</mo><mn id="S4.SS1.p1.2.m2.1.1" xref="S4.SS1.p1.2.m2.1.1.cmml">1</mn><mo id="S4.SS1.p1.2.m2.4.5.3.2.2" xref="S4.SS1.p1.2.m2.4.5.3.1.cmml">,</mo><mn id="S4.SS1.p1.2.m2.2.2" xref="S4.SS1.p1.2.m2.2.2.cmml">2</mn><mo id="S4.SS1.p1.2.m2.4.5.3.2.3" xref="S4.SS1.p1.2.m2.4.5.3.1.cmml">,</mo><mi id="S4.SS1.p1.2.m2.3.3" mathvariant="normal" xref="S4.SS1.p1.2.m2.3.3.cmml">…</mi><mo id="S4.SS1.p1.2.m2.4.5.3.2.4" xref="S4.SS1.p1.2.m2.4.5.3.1.cmml">,</mo><mi id="S4.SS1.p1.2.m2.4.4" xref="S4.SS1.p1.2.m2.4.4.cmml">M</mi><mo id="S4.SS1.p1.2.m2.4.5.3.2.5" stretchy="false" xref="S4.SS1.p1.2.m2.4.5.3.1.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.2.m2.4b"><apply id="S4.SS1.p1.2.m2.4.5.cmml" xref="S4.SS1.p1.2.m2.4.5"><eq id="S4.SS1.p1.2.m2.4.5.1.cmml" xref="S4.SS1.p1.2.m2.4.5.1"></eq><ci 
id="S4.SS1.p1.2.m2.4.5.2.cmml" xref="S4.SS1.p1.2.m2.4.5.2">ℳ</ci><set id="S4.SS1.p1.2.m2.4.5.3.1.cmml" xref="S4.SS1.p1.2.m2.4.5.3.2"><cn id="S4.SS1.p1.2.m2.1.1.cmml" type="integer" xref="S4.SS1.p1.2.m2.1.1">1</cn><cn id="S4.SS1.p1.2.m2.2.2.cmml" type="integer" xref="S4.SS1.p1.2.m2.2.2">2</cn><ci id="S4.SS1.p1.2.m2.3.3.cmml" xref="S4.SS1.p1.2.m2.3.3">…</ci><ci id="S4.SS1.p1.2.m2.4.4.cmml" xref="S4.SS1.p1.2.m2.4.4">𝑀</ci></set></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.2.m2.4c">\mathcal{M}=\{1,2,...,M\}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.2.m2.4d">caligraphic_M = { 1 , 2 , … , italic_M }</annotation></semantics></math>. Each request <math alttext="i\in\mathcal{N}" class="ltx_Math" display="inline" id="S4.SS1.p1.3.m3.1"><semantics id="S4.SS1.p1.3.m3.1a"><mrow id="S4.SS1.p1.3.m3.1.1" xref="S4.SS1.p1.3.m3.1.1.cmml"><mi id="S4.SS1.p1.3.m3.1.1.2" xref="S4.SS1.p1.3.m3.1.1.2.cmml">i</mi><mo id="S4.SS1.p1.3.m3.1.1.1" xref="S4.SS1.p1.3.m3.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.SS1.p1.3.m3.1.1.3" xref="S4.SS1.p1.3.m3.1.1.3.cmml">𝒩</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.3.m3.1b"><apply id="S4.SS1.p1.3.m3.1.1.cmml" xref="S4.SS1.p1.3.m3.1.1"><in id="S4.SS1.p1.3.m3.1.1.1.cmml" xref="S4.SS1.p1.3.m3.1.1.1"></in><ci id="S4.SS1.p1.3.m3.1.1.2.cmml" xref="S4.SS1.p1.3.m3.1.1.2">𝑖</ci><ci id="S4.SS1.p1.3.m3.1.1.3.cmml" xref="S4.SS1.p1.3.m3.1.1.3">𝒩</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.3.m3.1c">i\in\mathcal{N}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.3.m3.1d">italic_i ∈ caligraphic_N</annotation></semantics></math> selects only one SSM and multiple requests using the same SSM can be batched together for simultaneous processing. 
The batch size of each SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS1.p1.4.m4.1"><semantics id="S4.SS1.p1.4.m4.1a"><mi id="S4.SS1.p1.4.m4.1.1" xref="S4.SS1.p1.4.m4.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.4.m4.1b"><ci id="S4.SS1.p1.4.m4.1.1.cmml" xref="S4.SS1.p1.4.m4.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.4.m4.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.4.m4.1d">italic_j</annotation></semantics></math> is denoted as <math alttext="B_{j}" class="ltx_Math" display="inline" id="S4.SS1.p1.5.m5.1"><semantics id="S4.SS1.p1.5.m5.1a"><msub id="S4.SS1.p1.5.m5.1.1" xref="S4.SS1.p1.5.m5.1.1.cmml"><mi id="S4.SS1.p1.5.m5.1.1.2" xref="S4.SS1.p1.5.m5.1.1.2.cmml">B</mi><mi id="S4.SS1.p1.5.m5.1.1.3" xref="S4.SS1.p1.5.m5.1.1.3.cmml">j</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.5.m5.1b"><apply id="S4.SS1.p1.5.m5.1.1.cmml" xref="S4.SS1.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.5.m5.1.1.1.cmml" xref="S4.SS1.p1.5.m5.1.1">subscript</csymbol><ci id="S4.SS1.p1.5.m5.1.1.2.cmml" xref="S4.SS1.p1.5.m5.1.1.2">𝐵</ci><ci id="S4.SS1.p1.5.m5.1.1.3.cmml" xref="S4.SS1.p1.5.m5.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.5.m5.1c">B_{j}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.5.m5.1d">italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math>. We define a metric of <span class="ltx_text ltx_font_italic" id="S4.SS1.p1.8.2">goodput</span>, which indicates how many tokens are accepted by LLM per second, to evaluate the performance of speculative decoding. It is important to note that the goodput does not equal the speculation speed of the SSM since the generated tokens may be rejected by the LLM. 
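</p> </div> <div class="ltx_para"> <p class="ltx_p">To make the distinction between goodput and speculation speed concrete, the following sketch computes both from per-request measurements. The record type <span class="ltx_text ltx_font_typewriter">Trace</span>, the helper names, the SSM labels, and all numbers are hypothetical and chosen only for illustration; they are not measurements from our experiments, and the drafted-token rate is only a rough stand-in for speculation speed.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Trace:
    """Measurements for one request served with one SSM (hypothetical record)."""
    drafted: int    # tokens proposed by the SSM
    accepted: int   # tokens the LLM accepted during verification
    seconds: float  # wall-clock time covering speculation and verification


def goodput(t: Trace) -> float:
    """Tokens accepted by the LLM per second."""
    return t.accepted / t.seconds


def draft_rate(t: Trace) -> float:
    """Tokens drafted per second over the same interval."""
    return t.drafted / t.seconds


# Made-up numbers for a single request under two SSMs.
traces: Dict[str, Trace] = {
    "small-ssm": Trace(drafted=400, accepted=120, seconds=2.0),
    "large-ssm": Trace(drafted=200, accepted=150, seconds=2.0),
}

g = {name: goodput(t) for name, t in traces.items()}
best = max(g, key=g.get)  # the SSM with the highest goodput for this request
```
</pre>
<p class="ltx_p">With these illustrative numbers, the small SSM drafts twice as many tokens per second as the large one but yields the lower goodput (60 versus 75 accepted tokens per second), so the two metrics can rank SSMs differently.</p> </div> <div class="ltx_para"> <p class="ltx_p">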
Formally, the goodput of the speculative decoding when using SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS1.p1.6.m6.1"><semantics id="S4.SS1.p1.6.m6.1a"><mi id="S4.SS1.p1.6.m6.1.1" xref="S4.SS1.p1.6.m6.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.6.m6.1b"><ci id="S4.SS1.p1.6.m6.1.1.cmml" xref="S4.SS1.p1.6.m6.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.6.m6.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.6.m6.1d">italic_j</annotation></semantics></math> for request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS1.p1.7.m7.1"><semantics id="S4.SS1.p1.7.m7.1a"><mi id="S4.SS1.p1.7.m7.1.1" xref="S4.SS1.p1.7.m7.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.7.m7.1b"><ci id="S4.SS1.p1.7.m7.1.1.cmml" xref="S4.SS1.p1.7.m7.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.7.m7.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.7.m7.1d">italic_i</annotation></semantics></math> is denoted as <math alttext="g_{i,j}" class="ltx_Math" display="inline" id="S4.SS1.p1.8.m8.2"><semantics id="S4.SS1.p1.8.m8.2a"><msub id="S4.SS1.p1.8.m8.2.3" xref="S4.SS1.p1.8.m8.2.3.cmml"><mi id="S4.SS1.p1.8.m8.2.3.2" xref="S4.SS1.p1.8.m8.2.3.2.cmml">g</mi><mrow id="S4.SS1.p1.8.m8.2.2.2.4" xref="S4.SS1.p1.8.m8.2.2.2.3.cmml"><mi id="S4.SS1.p1.8.m8.1.1.1.1" xref="S4.SS1.p1.8.m8.1.1.1.1.cmml">i</mi><mo id="S4.SS1.p1.8.m8.2.2.2.4.1" xref="S4.SS1.p1.8.m8.2.2.2.3.cmml">,</mo><mi id="S4.SS1.p1.8.m8.2.2.2.2" xref="S4.SS1.p1.8.m8.2.2.2.2.cmml">j</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.8.m8.2b"><apply id="S4.SS1.p1.8.m8.2.3.cmml" xref="S4.SS1.p1.8.m8.2.3"><csymbol cd="ambiguous" id="S4.SS1.p1.8.m8.2.3.1.cmml" xref="S4.SS1.p1.8.m8.2.3">subscript</csymbol><ci id="S4.SS1.p1.8.m8.2.3.2.cmml" xref="S4.SS1.p1.8.m8.2.3.2">𝑔</ci><list id="S4.SS1.p1.8.m8.2.2.2.3.cmml" 
xref="S4.SS1.p1.8.m8.2.2.2.4"><ci id="S4.SS1.p1.8.m8.1.1.1.1.cmml" xref="S4.SS1.p1.8.m8.1.1.1.1">𝑖</ci><ci id="S4.SS1.p1.8.m8.2.2.2.2.cmml" xref="S4.SS1.p1.8.m8.2.2.2.2">𝑗</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.8.m8.2c">g_{i,j}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.8.m8.2d">italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT</annotation></semantics></math>.</p> </div> <div class="ltx_para" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.2">We denote <math alttext="x_{i}\in\mathcal{M}" class="ltx_Math" display="inline" id="S4.SS1.p2.1.m1.1"><semantics id="S4.SS1.p2.1.m1.1a"><mrow id="S4.SS1.p2.1.m1.1.1" xref="S4.SS1.p2.1.m1.1.1.cmml"><msub id="S4.SS1.p2.1.m1.1.1.2" xref="S4.SS1.p2.1.m1.1.1.2.cmml"><mi id="S4.SS1.p2.1.m1.1.1.2.2" xref="S4.SS1.p2.1.m1.1.1.2.2.cmml">x</mi><mi id="S4.SS1.p2.1.m1.1.1.2.3" xref="S4.SS1.p2.1.m1.1.1.2.3.cmml">i</mi></msub><mo id="S4.SS1.p2.1.m1.1.1.1" xref="S4.SS1.p2.1.m1.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.SS1.p2.1.m1.1.1.3" xref="S4.SS1.p2.1.m1.1.1.3.cmml">ℳ</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p2.1.m1.1b"><apply id="S4.SS1.p2.1.m1.1.1.cmml" xref="S4.SS1.p2.1.m1.1.1"><in id="S4.SS1.p2.1.m1.1.1.1.cmml" xref="S4.SS1.p2.1.m1.1.1.1"></in><apply id="S4.SS1.p2.1.m1.1.1.2.cmml" xref="S4.SS1.p2.1.m1.1.1.2"><csymbol cd="ambiguous" id="S4.SS1.p2.1.m1.1.1.2.1.cmml" xref="S4.SS1.p2.1.m1.1.1.2">subscript</csymbol><ci id="S4.SS1.p2.1.m1.1.1.2.2.cmml" xref="S4.SS1.p2.1.m1.1.1.2.2">𝑥</ci><ci id="S4.SS1.p2.1.m1.1.1.2.3.cmml" xref="S4.SS1.p2.1.m1.1.1.2.3">𝑖</ci></apply><ci id="S4.SS1.p2.1.m1.1.1.3.cmml" xref="S4.SS1.p2.1.m1.1.1.3">ℳ</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p2.1.m1.1c">x_{i}\in\mathcal{M}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p2.1.m1.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ 
caligraphic_M</annotation></semantics></math> as the SSM selection for request <math alttext="i\in\mathcal{N}" class="ltx_Math" display="inline" id="S4.SS1.p2.2.m2.1"><semantics id="S4.SS1.p2.2.m2.1a"><mrow id="S4.SS1.p2.2.m2.1.1" xref="S4.SS1.p2.2.m2.1.1.cmml"><mi id="S4.SS1.p2.2.m2.1.1.2" xref="S4.SS1.p2.2.m2.1.1.2.cmml">i</mi><mo id="S4.SS1.p2.2.m2.1.1.1" xref="S4.SS1.p2.2.m2.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.SS1.p2.2.m2.1.1.3" xref="S4.SS1.p2.2.m2.1.1.3.cmml">𝒩</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p2.2.m2.1b"><apply id="S4.SS1.p2.2.m2.1.1.cmml" xref="S4.SS1.p2.2.m2.1.1"><in id="S4.SS1.p2.2.m2.1.1.1.cmml" xref="S4.SS1.p2.2.m2.1.1.1"></in><ci id="S4.SS1.p2.2.m2.1.1.2.cmml" xref="S4.SS1.p2.2.m2.1.1.2">𝑖</ci><ci id="S4.SS1.p2.2.m2.1.1.3.cmml" xref="S4.SS1.p2.2.m2.1.1.3">𝒩</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p2.2.m2.1c">i\in\mathcal{N}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p2.2.m2.1d">italic_i ∈ caligraphic_N</annotation></semantics></math>. 
The objective of the SSM selection problem is to maximize the total goodput over all requests:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx3"> <tbody id="S4.E3"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\max_{x}\quad" class="ltx_Math" display="inline" id="S4.E3.m1.1"><semantics id="S4.E3.m1.1a"><mrow id="S4.E3.m1.1.1.1" xref="S4.E3.m1.1.1.1.1.cmml"><munder id="S4.E3.m1.1.1.1.1" xref="S4.E3.m1.1.1.1.1.cmml"><mi id="S4.E3.m1.1.1.1.1.2" xref="S4.E3.m1.1.1.1.1.2.cmml">max</mi><mi id="S4.E3.m1.1.1.1.1.3" xref="S4.E3.m1.1.1.1.1.3.cmml">x</mi></munder><mspace id="S4.E3.m1.1.1.1.2" width="1.167em" xref="S4.E3.m1.1.1.1.1.cmml"></mspace></mrow><annotation-xml encoding="MathML-Content" id="S4.E3.m1.1b"><apply id="S4.E3.m1.1.1.1.1.cmml" xref="S4.E3.m1.1.1.1"><csymbol cd="ambiguous" id="S4.E3.m1.1.1.1.1.1.cmml" xref="S4.E3.m1.1.1.1">subscript</csymbol><max id="S4.E3.m1.1.1.1.1.2.cmml" xref="S4.E3.m1.1.1.1.1.2"></max><ci id="S4.E3.m1.1.1.1.1.3.cmml" xref="S4.E3.m1.1.1.1.1.3">𝑥</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E3.m1.1c">\displaystyle\max_{x}\quad</annotation><annotation encoding="application/x-llamapun" id="S4.E3.m1.1d">roman_max start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\sum_{i=1}^{N}g_{i,x_{i}}" class="ltx_Math" display="inline" id="S4.E3.m2.2"><semantics id="S4.E3.m2.2a"><mrow id="S4.E3.m2.2.3" xref="S4.E3.m2.2.3.cmml"><mstyle displaystyle="true" id="S4.E3.m2.2.3.1" xref="S4.E3.m2.2.3.1.cmml"><munderover id="S4.E3.m2.2.3.1a" xref="S4.E3.m2.2.3.1.cmml"><mo id="S4.E3.m2.2.3.1.2.2" movablelimits="false" xref="S4.E3.m2.2.3.1.2.2.cmml">∑</mo><mrow id="S4.E3.m2.2.3.1.2.3" xref="S4.E3.m2.2.3.1.2.3.cmml"><mi id="S4.E3.m2.2.3.1.2.3.2" 
xref="S4.E3.m2.2.3.1.2.3.2.cmml">i</mi><mo id="S4.E3.m2.2.3.1.2.3.1" xref="S4.E3.m2.2.3.1.2.3.1.cmml">=</mo><mn id="S4.E3.m2.2.3.1.2.3.3" xref="S4.E3.m2.2.3.1.2.3.3.cmml">1</mn></mrow><mi id="S4.E3.m2.2.3.1.3" xref="S4.E3.m2.2.3.1.3.cmml">N</mi></munderover></mstyle><msub id="S4.E3.m2.2.3.2" xref="S4.E3.m2.2.3.2.cmml"><mi id="S4.E3.m2.2.3.2.2" xref="S4.E3.m2.2.3.2.2.cmml">g</mi><mrow id="S4.E3.m2.2.2.2.2" xref="S4.E3.m2.2.2.2.3.cmml"><mi id="S4.E3.m2.1.1.1.1" xref="S4.E3.m2.1.1.1.1.cmml">i</mi><mo id="S4.E3.m2.2.2.2.2.2" xref="S4.E3.m2.2.2.2.3.cmml">,</mo><msub id="S4.E3.m2.2.2.2.2.1" xref="S4.E3.m2.2.2.2.2.1.cmml"><mi id="S4.E3.m2.2.2.2.2.1.2" xref="S4.E3.m2.2.2.2.2.1.2.cmml">x</mi><mi id="S4.E3.m2.2.2.2.2.1.3" xref="S4.E3.m2.2.2.2.2.1.3.cmml">i</mi></msub></mrow></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.E3.m2.2b"><apply id="S4.E3.m2.2.3.cmml" xref="S4.E3.m2.2.3"><apply id="S4.E3.m2.2.3.1.cmml" xref="S4.E3.m2.2.3.1"><csymbol cd="ambiguous" id="S4.E3.m2.2.3.1.1.cmml" xref="S4.E3.m2.2.3.1">superscript</csymbol><apply id="S4.E3.m2.2.3.1.2.cmml" xref="S4.E3.m2.2.3.1"><csymbol cd="ambiguous" id="S4.E3.m2.2.3.1.2.1.cmml" xref="S4.E3.m2.2.3.1">subscript</csymbol><sum id="S4.E3.m2.2.3.1.2.2.cmml" xref="S4.E3.m2.2.3.1.2.2"></sum><apply id="S4.E3.m2.2.3.1.2.3.cmml" xref="S4.E3.m2.2.3.1.2.3"><eq id="S4.E3.m2.2.3.1.2.3.1.cmml" xref="S4.E3.m2.2.3.1.2.3.1"></eq><ci id="S4.E3.m2.2.3.1.2.3.2.cmml" xref="S4.E3.m2.2.3.1.2.3.2">𝑖</ci><cn id="S4.E3.m2.2.3.1.2.3.3.cmml" type="integer" xref="S4.E3.m2.2.3.1.2.3.3">1</cn></apply></apply><ci id="S4.E3.m2.2.3.1.3.cmml" xref="S4.E3.m2.2.3.1.3">𝑁</ci></apply><apply id="S4.E3.m2.2.3.2.cmml" xref="S4.E3.m2.2.3.2"><csymbol cd="ambiguous" id="S4.E3.m2.2.3.2.1.cmml" xref="S4.E3.m2.2.3.2">subscript</csymbol><ci id="S4.E3.m2.2.3.2.2.cmml" xref="S4.E3.m2.2.3.2.2">𝑔</ci><list id="S4.E3.m2.2.2.2.3.cmml" xref="S4.E3.m2.2.2.2.2"><ci id="S4.E3.m2.1.1.1.1.cmml" xref="S4.E3.m2.1.1.1.1">𝑖</ci><apply id="S4.E3.m2.2.2.2.2.1.cmml" 
xref="S4.E3.m2.2.2.2.2.1"><csymbol cd="ambiguous" id="S4.E3.m2.2.2.2.2.1.1.cmml" xref="S4.E3.m2.2.2.2.2.1">subscript</csymbol><ci id="S4.E3.m2.2.2.2.2.1.2.cmml" xref="S4.E3.m2.2.2.2.2.1.2">𝑥</ci><ci id="S4.E3.m2.2.2.2.2.1.3.cmml" xref="S4.E3.m2.2.2.2.2.1.3">𝑖</ci></apply></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E3.m2.2c">\displaystyle\sum_{i=1}^{N}g_{i,x_{i}}</annotation><annotation encoding="application/x-llamapun" id="S4.E3.m2.2d">∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_g start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td> </tr></tbody> <tbody id="S4.E4"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><span class="ltx_text ltx_markedasmath" id="S4.E4.2.1.1.1">s.t.</span></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle|\mathcal{N}_{j}|\leq B_{j},\forall j\in\mathcal{M};" class="ltx_Math" display="inline" id="S4.E4.m2.1"><semantics id="S4.E4.m2.1a"><mrow id="S4.E4.m2.1.1.1"><mrow id="S4.E4.m2.1.1.1.1.2" xref="S4.E4.m2.1.1.1.1.3.cmml"><mrow id="S4.E4.m2.1.1.1.1.1.1" xref="S4.E4.m2.1.1.1.1.1.1.cmml"><mrow id="S4.E4.m2.1.1.1.1.1.1.1.1" xref="S4.E4.m2.1.1.1.1.1.1.1.2.cmml"><mo id="S4.E4.m2.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S4.E4.m2.1.1.1.1.1.1.1.2.1.cmml">|</mo><msub id="S4.E4.m2.1.1.1.1.1.1.1.1.1" xref="S4.E4.m2.1.1.1.1.1.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.E4.m2.1.1.1.1.1.1.1.1.1.2" xref="S4.E4.m2.1.1.1.1.1.1.1.1.1.2.cmml">𝒩</mi><mi id="S4.E4.m2.1.1.1.1.1.1.1.1.1.3" 
xref="S4.E4.m2.1.1.1.1.1.1.1.1.1.3.cmml">j</mi></msub><mo id="S4.E4.m2.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S4.E4.m2.1.1.1.1.1.1.1.2.1.cmml">|</mo></mrow><mo id="S4.E4.m2.1.1.1.1.1.1.2" xref="S4.E4.m2.1.1.1.1.1.1.2.cmml">≤</mo><msub id="S4.E4.m2.1.1.1.1.1.1.3" xref="S4.E4.m2.1.1.1.1.1.1.3.cmml"><mi id="S4.E4.m2.1.1.1.1.1.1.3.2" xref="S4.E4.m2.1.1.1.1.1.1.3.2.cmml">B</mi><mi id="S4.E4.m2.1.1.1.1.1.1.3.3" xref="S4.E4.m2.1.1.1.1.1.1.3.3.cmml">j</mi></msub></mrow><mo id="S4.E4.m2.1.1.1.1.2.3" xref="S4.E4.m2.1.1.1.1.3a.cmml">,</mo><mrow id="S4.E4.m2.1.1.1.1.2.2" xref="S4.E4.m2.1.1.1.1.2.2.cmml"><mrow id="S4.E4.m2.1.1.1.1.2.2.2" xref="S4.E4.m2.1.1.1.1.2.2.2.cmml"><mo id="S4.E4.m2.1.1.1.1.2.2.2.1" rspace="0.167em" xref="S4.E4.m2.1.1.1.1.2.2.2.1.cmml">∀</mo><mi id="S4.E4.m2.1.1.1.1.2.2.2.2" xref="S4.E4.m2.1.1.1.1.2.2.2.2.cmml">j</mi></mrow><mo id="S4.E4.m2.1.1.1.1.2.2.1" xref="S4.E4.m2.1.1.1.1.2.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E4.m2.1.1.1.1.2.2.3" xref="S4.E4.m2.1.1.1.1.2.2.3.cmml">ℳ</mi></mrow></mrow><mo id="S4.E4.m2.1.1.1.2">;</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E4.m2.1b"><apply id="S4.E4.m2.1.1.1.1.3.cmml" xref="S4.E4.m2.1.1.1.1.2"><csymbol cd="ambiguous" id="S4.E4.m2.1.1.1.1.3a.cmml" xref="S4.E4.m2.1.1.1.1.2.3">formulae-sequence</csymbol><apply id="S4.E4.m2.1.1.1.1.1.1.cmml" xref="S4.E4.m2.1.1.1.1.1.1"><leq id="S4.E4.m2.1.1.1.1.1.1.2.cmml" xref="S4.E4.m2.1.1.1.1.1.1.2"></leq><apply id="S4.E4.m2.1.1.1.1.1.1.1.2.cmml" xref="S4.E4.m2.1.1.1.1.1.1.1.1"><abs id="S4.E4.m2.1.1.1.1.1.1.1.2.1.cmml" xref="S4.E4.m2.1.1.1.1.1.1.1.1.2"></abs><apply id="S4.E4.m2.1.1.1.1.1.1.1.1.1.cmml" xref="S4.E4.m2.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S4.E4.m2.1.1.1.1.1.1.1.1.1.1.cmml" xref="S4.E4.m2.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S4.E4.m2.1.1.1.1.1.1.1.1.1.2.cmml" xref="S4.E4.m2.1.1.1.1.1.1.1.1.1.2">𝒩</ci><ci id="S4.E4.m2.1.1.1.1.1.1.1.1.1.3.cmml" xref="S4.E4.m2.1.1.1.1.1.1.1.1.1.3">𝑗</ci></apply></apply><apply 
id="S4.E4.m2.1.1.1.1.1.1.3.cmml" xref="S4.E4.m2.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.E4.m2.1.1.1.1.1.1.3.1.cmml" xref="S4.E4.m2.1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.E4.m2.1.1.1.1.1.1.3.2.cmml" xref="S4.E4.m2.1.1.1.1.1.1.3.2">𝐵</ci><ci id="S4.E4.m2.1.1.1.1.1.1.3.3.cmml" xref="S4.E4.m2.1.1.1.1.1.1.3.3">𝑗</ci></apply></apply><apply id="S4.E4.m2.1.1.1.1.2.2.cmml" xref="S4.E4.m2.1.1.1.1.2.2"><in id="S4.E4.m2.1.1.1.1.2.2.1.cmml" xref="S4.E4.m2.1.1.1.1.2.2.1"></in><apply id="S4.E4.m2.1.1.1.1.2.2.2.cmml" xref="S4.E4.m2.1.1.1.1.2.2.2"><csymbol cd="latexml" id="S4.E4.m2.1.1.1.1.2.2.2.1.cmml" xref="S4.E4.m2.1.1.1.1.2.2.2.1">for-all</csymbol><ci id="S4.E4.m2.1.1.1.1.2.2.2.2.cmml" xref="S4.E4.m2.1.1.1.1.2.2.2.2">𝑗</ci></apply><ci id="S4.E4.m2.1.1.1.1.2.2.3.cmml" xref="S4.E4.m2.1.1.1.1.2.2.3">ℳ</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E4.m2.1c">\displaystyle|\mathcal{N}_{j}|\leq B_{j},\forall j\in\mathcal{M};</annotation><annotation encoding="application/x-llamapun" id="S4.E4.m2.1d">| caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | ≤ italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , ∀ italic_j ∈ caligraphic_M ;</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td> </tr></tbody> <tbody id="S4.E5"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle x_{i}=\{1,2,...,M\},\forall i\in\mathcal{N}," class="ltx_Math" display="inline" id="S4.E5.m1.5"><semantics id="S4.E5.m1.5a"><mrow id="S4.E5.m1.5.5.1"><mrow id="S4.E5.m1.5.5.1.1.2" xref="S4.E5.m1.5.5.1.1.3.cmml"><mrow id="S4.E5.m1.5.5.1.1.1.1" xref="S4.E5.m1.5.5.1.1.1.1.cmml"><msub 
id="S4.E5.m1.5.5.1.1.1.1.2" xref="S4.E5.m1.5.5.1.1.1.1.2.cmml"><mi id="S4.E5.m1.5.5.1.1.1.1.2.2" xref="S4.E5.m1.5.5.1.1.1.1.2.2.cmml">x</mi><mi id="S4.E5.m1.5.5.1.1.1.1.2.3" xref="S4.E5.m1.5.5.1.1.1.1.2.3.cmml">i</mi></msub><mo id="S4.E5.m1.5.5.1.1.1.1.1" xref="S4.E5.m1.5.5.1.1.1.1.1.cmml">=</mo><mrow id="S4.E5.m1.5.5.1.1.1.1.3.2" xref="S4.E5.m1.5.5.1.1.1.1.3.1.cmml"><mo id="S4.E5.m1.5.5.1.1.1.1.3.2.1" stretchy="false" xref="S4.E5.m1.5.5.1.1.1.1.3.1.cmml">{</mo><mn id="S4.E5.m1.1.1" xref="S4.E5.m1.1.1.cmml">1</mn><mo id="S4.E5.m1.5.5.1.1.1.1.3.2.2" xref="S4.E5.m1.5.5.1.1.1.1.3.1.cmml">,</mo><mn id="S4.E5.m1.2.2" xref="S4.E5.m1.2.2.cmml">2</mn><mo id="S4.E5.m1.5.5.1.1.1.1.3.2.3" xref="S4.E5.m1.5.5.1.1.1.1.3.1.cmml">,</mo><mi id="S4.E5.m1.3.3" mathvariant="normal" xref="S4.E5.m1.3.3.cmml">…</mi><mo id="S4.E5.m1.5.5.1.1.1.1.3.2.4" xref="S4.E5.m1.5.5.1.1.1.1.3.1.cmml">,</mo><mi id="S4.E5.m1.4.4" xref="S4.E5.m1.4.4.cmml">M</mi><mo id="S4.E5.m1.5.5.1.1.1.1.3.2.5" stretchy="false" xref="S4.E5.m1.5.5.1.1.1.1.3.1.cmml">}</mo></mrow></mrow><mo id="S4.E5.m1.5.5.1.1.2.3" xref="S4.E5.m1.5.5.1.1.3a.cmml">,</mo><mrow id="S4.E5.m1.5.5.1.1.2.2" xref="S4.E5.m1.5.5.1.1.2.2.cmml"><mrow id="S4.E5.m1.5.5.1.1.2.2.2" xref="S4.E5.m1.5.5.1.1.2.2.2.cmml"><mo id="S4.E5.m1.5.5.1.1.2.2.2.1" rspace="0.167em" xref="S4.E5.m1.5.5.1.1.2.2.2.1.cmml">∀</mo><mi id="S4.E5.m1.5.5.1.1.2.2.2.2" xref="S4.E5.m1.5.5.1.1.2.2.2.2.cmml">i</mi></mrow><mo id="S4.E5.m1.5.5.1.1.2.2.1" xref="S4.E5.m1.5.5.1.1.2.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E5.m1.5.5.1.1.2.2.3" xref="S4.E5.m1.5.5.1.1.2.2.3.cmml">𝒩</mi></mrow></mrow><mo id="S4.E5.m1.5.5.1.2">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E5.m1.5b"><apply id="S4.E5.m1.5.5.1.1.3.cmml" xref="S4.E5.m1.5.5.1.1.2"><csymbol cd="ambiguous" id="S4.E5.m1.5.5.1.1.3a.cmml" xref="S4.E5.m1.5.5.1.1.2.3">formulae-sequence</csymbol><apply id="S4.E5.m1.5.5.1.1.1.1.cmml" xref="S4.E5.m1.5.5.1.1.1.1"><eq id="S4.E5.m1.5.5.1.1.1.1.1.cmml" 
xref="S4.E5.m1.5.5.1.1.1.1.1"></eq><apply id="S4.E5.m1.5.5.1.1.1.1.2.cmml" xref="S4.E5.m1.5.5.1.1.1.1.2"><csymbol cd="ambiguous" id="S4.E5.m1.5.5.1.1.1.1.2.1.cmml" xref="S4.E5.m1.5.5.1.1.1.1.2">subscript</csymbol><ci id="S4.E5.m1.5.5.1.1.1.1.2.2.cmml" xref="S4.E5.m1.5.5.1.1.1.1.2.2">𝑥</ci><ci id="S4.E5.m1.5.5.1.1.1.1.2.3.cmml" xref="S4.E5.m1.5.5.1.1.1.1.2.3">𝑖</ci></apply><set id="S4.E5.m1.5.5.1.1.1.1.3.1.cmml" xref="S4.E5.m1.5.5.1.1.1.1.3.2"><cn id="S4.E5.m1.1.1.cmml" type="integer" xref="S4.E5.m1.1.1">1</cn><cn id="S4.E5.m1.2.2.cmml" type="integer" xref="S4.E5.m1.2.2">2</cn><ci id="S4.E5.m1.3.3.cmml" xref="S4.E5.m1.3.3">…</ci><ci id="S4.E5.m1.4.4.cmml" xref="S4.E5.m1.4.4">𝑀</ci></set></apply><apply id="S4.E5.m1.5.5.1.1.2.2.cmml" xref="S4.E5.m1.5.5.1.1.2.2"><in id="S4.E5.m1.5.5.1.1.2.2.1.cmml" xref="S4.E5.m1.5.5.1.1.2.2.1"></in><apply id="S4.E5.m1.5.5.1.1.2.2.2.cmml" xref="S4.E5.m1.5.5.1.1.2.2.2"><csymbol cd="latexml" id="S4.E5.m1.5.5.1.1.2.2.2.1.cmml" xref="S4.E5.m1.5.5.1.1.2.2.2.1">for-all</csymbol><ci id="S4.E5.m1.5.5.1.1.2.2.2.2.cmml" xref="S4.E5.m1.5.5.1.1.2.2.2.2">𝑖</ci></apply><ci id="S4.E5.m1.5.5.1.1.2.2.3.cmml" xref="S4.E5.m1.5.5.1.1.2.2.3">𝒩</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E5.m1.5c">\displaystyle x_{i}=\{1,2,...,M\},\forall i\in\mathcal{N},</annotation><annotation encoding="application/x-llamapun" id="S4.E5.m1.5d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { 1 , 2 , … , italic_M } , ∀ italic_i ∈ caligraphic_N ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS1.p2.5">where <math alttext="\mathcal{N}_{j}=\{i|x_{i}=j\}" class="ltx_Math" display="inline" id="S4.SS1.p2.3.m1.2"><semantics id="S4.SS1.p2.3.m1.2a"><mrow id="S4.SS1.p2.3.m1.2.2" 
xref="S4.SS1.p2.3.m1.2.2.cmml"><msub id="S4.SS1.p2.3.m1.2.2.3" xref="S4.SS1.p2.3.m1.2.2.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS1.p2.3.m1.2.2.3.2" xref="S4.SS1.p2.3.m1.2.2.3.2.cmml">𝒩</mi><mi id="S4.SS1.p2.3.m1.2.2.3.3" xref="S4.SS1.p2.3.m1.2.2.3.3.cmml">j</mi></msub><mo id="S4.SS1.p2.3.m1.2.2.2" xref="S4.SS1.p2.3.m1.2.2.2.cmml">=</mo><mrow id="S4.SS1.p2.3.m1.2.2.1.1" xref="S4.SS1.p2.3.m1.2.2.1.2.cmml"><mo id="S4.SS1.p2.3.m1.2.2.1.1.2" stretchy="false" xref="S4.SS1.p2.3.m1.2.2.1.2.1.cmml">{</mo><mi id="S4.SS1.p2.3.m1.1.1" xref="S4.SS1.p2.3.m1.1.1.cmml">i</mi><mo id="S4.SS1.p2.3.m1.2.2.1.1.3" lspace="0em" rspace="0em" xref="S4.SS1.p2.3.m1.2.2.1.2.1.cmml">|</mo><mrow id="S4.SS1.p2.3.m1.2.2.1.1.1" xref="S4.SS1.p2.3.m1.2.2.1.1.1.cmml"><msub id="S4.SS1.p2.3.m1.2.2.1.1.1.2" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2.cmml"><mi id="S4.SS1.p2.3.m1.2.2.1.1.1.2.2" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2.2.cmml">x</mi><mi id="S4.SS1.p2.3.m1.2.2.1.1.1.2.3" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2.3.cmml">i</mi></msub><mo id="S4.SS1.p2.3.m1.2.2.1.1.1.1" xref="S4.SS1.p2.3.m1.2.2.1.1.1.1.cmml">=</mo><mi id="S4.SS1.p2.3.m1.2.2.1.1.1.3" xref="S4.SS1.p2.3.m1.2.2.1.1.1.3.cmml">j</mi></mrow><mo id="S4.SS1.p2.3.m1.2.2.1.1.4" stretchy="false" xref="S4.SS1.p2.3.m1.2.2.1.2.1.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p2.3.m1.2b"><apply id="S4.SS1.p2.3.m1.2.2.cmml" xref="S4.SS1.p2.3.m1.2.2"><eq id="S4.SS1.p2.3.m1.2.2.2.cmml" xref="S4.SS1.p2.3.m1.2.2.2"></eq><apply id="S4.SS1.p2.3.m1.2.2.3.cmml" xref="S4.SS1.p2.3.m1.2.2.3"><csymbol cd="ambiguous" id="S4.SS1.p2.3.m1.2.2.3.1.cmml" xref="S4.SS1.p2.3.m1.2.2.3">subscript</csymbol><ci id="S4.SS1.p2.3.m1.2.2.3.2.cmml" xref="S4.SS1.p2.3.m1.2.2.3.2">𝒩</ci><ci id="S4.SS1.p2.3.m1.2.2.3.3.cmml" xref="S4.SS1.p2.3.m1.2.2.3.3">𝑗</ci></apply><apply id="S4.SS1.p2.3.m1.2.2.1.2.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1"><csymbol cd="latexml" id="S4.SS1.p2.3.m1.2.2.1.2.1.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.2">conditional-set</csymbol><ci 
id="S4.SS1.p2.3.m1.1.1.cmml" xref="S4.SS1.p2.3.m1.1.1">𝑖</ci><apply id="S4.SS1.p2.3.m1.2.2.1.1.1.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1"><eq id="S4.SS1.p2.3.m1.2.2.1.1.1.1.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1.1"></eq><apply id="S4.SS1.p2.3.m1.2.2.1.1.1.2.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S4.SS1.p2.3.m1.2.2.1.1.1.2.1.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2">subscript</csymbol><ci id="S4.SS1.p2.3.m1.2.2.1.1.1.2.2.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2.2">𝑥</ci><ci id="S4.SS1.p2.3.m1.2.2.1.1.1.2.3.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1.2.3">𝑖</ci></apply><ci id="S4.SS1.p2.3.m1.2.2.1.1.1.3.cmml" xref="S4.SS1.p2.3.m1.2.2.1.1.1.3">𝑗</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p2.3.m1.2c">\mathcal{N}_{j}=\{i|x_{i}=j\}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p2.3.m1.2d">caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = { italic_i | italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j }</annotation></semantics></math> is the set of requests running on SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS1.p2.4.m2.1"><semantics id="S4.SS1.p2.4.m2.1a"><mi id="S4.SS1.p2.4.m2.1.1" xref="S4.SS1.p2.4.m2.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p2.4.m2.1b"><ci id="S4.SS1.p2.4.m2.1.1.cmml" xref="S4.SS1.p2.4.m2.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p2.4.m2.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p2.4.m2.1d">italic_j</annotation></semantics></math>. Constraint (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.E4" title="In IV-A Problem Statement ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">4</span></a>) ensures the batch size limitation on each SSM. 
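</p> <p class="ltx_p">To make the formulation concrete, the following minimal sketch solves (3)-(5) by exhaustive search on a tiny instance, under the assumption that the goodputs were known in advance. The function name <code>select_ssms</code>, the zero-based SSM indices, and the numeric values are illustrative assumptions and are not part of Spin.</p>

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def select_ssms(
    goodput: Sequence[Sequence[float]],
    batch_limit: Sequence[int],
) -> Tuple[Optional[List[int]], float]:
    """Exhaustively solve (3)-(5) for a tiny instance.

    goodput[i][j] plays the role of g_{i,j} and batch_limit[j] the role of B_j.
    SSMs are indexed from 0 here rather than from 1 (an illustrative choice).
    Returns (assignment x, total goodput), or (None, -inf) if no assignment
    satisfies the batch size limits.
    """
    num_requests = len(goodput)
    num_ssms = len(batch_limit)
    best_x: Optional[List[int]] = None
    best_total = float("-inf")
    # Constraint (5): every request picks exactly one SSM.
    for x in product(range(num_ssms), repeat=num_requests):
        # Constraint (4): |N_j| <= B_j for every SSM j.
        if any(x.count(j) > batch_limit[j] for j in range(num_ssms)):
            continue
        # Objective (3): sum of g_{i, x_i} over all requests.
        total = sum(goodput[i][x[i]] for i in range(num_requests))
        if total > best_total:
            best_x, best_total = list(x), total
    return best_x, best_total


# Hypothetical goodputs (tokens per second) for 3 requests and 2 SSMs.
g = [[10.0, 12.0],
     [8.0, 15.0],
     [9.0, 10.0]]
B = [2, 2]
assignment, total = select_ssms(g, B)
print(assignment, total)
```

<p class="ltx_p">Without constraint (4), every request would pick SSM 1 in this example; the batch size limit forces one request onto SSM 0. The search is exponential in the number of requests and is meant only to illustrate the objective and the constraint.</p> <p class="ltx_p">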
However, the above optimization problem cannot be solved directly because the goodput <math alttext="g_{i,x_{i}}" class="ltx_Math" display="inline" id="S4.SS1.p2.5.m3.2"><semantics id="S4.SS1.p2.5.m3.2a"><msub id="S4.SS1.p2.5.m3.2.3" xref="S4.SS1.p2.5.m3.2.3.cmml"><mi id="S4.SS1.p2.5.m3.2.3.2" xref="S4.SS1.p2.5.m3.2.3.2.cmml">g</mi><mrow id="S4.SS1.p2.5.m3.2.2.2.2" xref="S4.SS1.p2.5.m3.2.2.2.3.cmml"><mi id="S4.SS1.p2.5.m3.1.1.1.1" xref="S4.SS1.p2.5.m3.1.1.1.1.cmml">i</mi><mo id="S4.SS1.p2.5.m3.2.2.2.2.2" xref="S4.SS1.p2.5.m3.2.2.2.3.cmml">,</mo><msub id="S4.SS1.p2.5.m3.2.2.2.2.1" xref="S4.SS1.p2.5.m3.2.2.2.2.1.cmml"><mi id="S4.SS1.p2.5.m3.2.2.2.2.1.2" xref="S4.SS1.p2.5.m3.2.2.2.2.1.2.cmml">x</mi><mi id="S4.SS1.p2.5.m3.2.2.2.2.1.3" xref="S4.SS1.p2.5.m3.2.2.2.2.1.3.cmml">i</mi></msub></mrow></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p2.5.m3.2b"><apply id="S4.SS1.p2.5.m3.2.3.cmml" xref="S4.SS1.p2.5.m3.2.3"><csymbol cd="ambiguous" id="S4.SS1.p2.5.m3.2.3.1.cmml" xref="S4.SS1.p2.5.m3.2.3">subscript</csymbol><ci id="S4.SS1.p2.5.m3.2.3.2.cmml" xref="S4.SS1.p2.5.m3.2.3.2">𝑔</ci><list id="S4.SS1.p2.5.m3.2.2.2.3.cmml" xref="S4.SS1.p2.5.m3.2.2.2.2"><ci id="S4.SS1.p2.5.m3.1.1.1.1.cmml" xref="S4.SS1.p2.5.m3.1.1.1.1">𝑖</ci><apply id="S4.SS1.p2.5.m3.2.2.2.2.1.cmml" xref="S4.SS1.p2.5.m3.2.2.2.2.1"><csymbol cd="ambiguous" id="S4.SS1.p2.5.m3.2.2.2.2.1.1.cmml" xref="S4.SS1.p2.5.m3.2.2.2.2.1">subscript</csymbol><ci id="S4.SS1.p2.5.m3.2.2.2.2.1.2.cmml" xref="S4.SS1.p2.5.m3.2.2.2.2.1.2">𝑥</ci><ci id="S4.SS1.p2.5.m3.2.2.2.2.1.3.cmml" xref="S4.SS1.p2.5.m3.2.2.2.2.1.3">𝑖</ci></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p2.5.m3.2c">g_{i,x_{i}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p2.5.m3.2d">italic_g start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT</annotation></semantics></math> is strongly related to the acceptance rate at the LLM, 
which is unknown before execution.</p> </div> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS2.6.1.1">IV-B</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS2.7.2">Algorithm Design</span> </h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.6">We model the SSM selection as a multi-armed bandit (MAB) problem. Specifically, we divide the total speculative decoding process into a series of time slots, denoted by <math alttext="\mathcal{T}=\{1,2,...,T\}" class="ltx_Math" display="inline" id="S4.SS2.p1.1.m1.4"><semantics id="S4.SS2.p1.1.m1.4a"><mrow id="S4.SS2.p1.1.m1.4.5" xref="S4.SS2.p1.1.m1.4.5.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS2.p1.1.m1.4.5.2" xref="S4.SS2.p1.1.m1.4.5.2.cmml">𝒯</mi><mo id="S4.SS2.p1.1.m1.4.5.1" xref="S4.SS2.p1.1.m1.4.5.1.cmml">=</mo><mrow id="S4.SS2.p1.1.m1.4.5.3.2" xref="S4.SS2.p1.1.m1.4.5.3.1.cmml"><mo id="S4.SS2.p1.1.m1.4.5.3.2.1" stretchy="false" xref="S4.SS2.p1.1.m1.4.5.3.1.cmml">{</mo><mn id="S4.SS2.p1.1.m1.1.1" xref="S4.SS2.p1.1.m1.1.1.cmml">1</mn><mo id="S4.SS2.p1.1.m1.4.5.3.2.2" xref="S4.SS2.p1.1.m1.4.5.3.1.cmml">,</mo><mn id="S4.SS2.p1.1.m1.2.2" xref="S4.SS2.p1.1.m1.2.2.cmml">2</mn><mo id="S4.SS2.p1.1.m1.4.5.3.2.3" xref="S4.SS2.p1.1.m1.4.5.3.1.cmml">,</mo><mi id="S4.SS2.p1.1.m1.3.3" mathvariant="normal" xref="S4.SS2.p1.1.m1.3.3.cmml">…</mi><mo id="S4.SS2.p1.1.m1.4.5.3.2.4" xref="S4.SS2.p1.1.m1.4.5.3.1.cmml">,</mo><mi id="S4.SS2.p1.1.m1.4.4" xref="S4.SS2.p1.1.m1.4.4.cmml">T</mi><mo id="S4.SS2.p1.1.m1.4.5.3.2.5" stretchy="false" xref="S4.SS2.p1.1.m1.4.5.3.1.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.1.m1.4b"><apply id="S4.SS2.p1.1.m1.4.5.cmml" xref="S4.SS2.p1.1.m1.4.5"><eq id="S4.SS2.p1.1.m1.4.5.1.cmml" xref="S4.SS2.p1.1.m1.4.5.1"></eq><ci id="S4.SS2.p1.1.m1.4.5.2.cmml" xref="S4.SS2.p1.1.m1.4.5.2">𝒯</ci><set 
id="S4.SS2.p1.1.m1.4.5.3.1.cmml" xref="S4.SS2.p1.1.m1.4.5.3.2"><cn id="S4.SS2.p1.1.m1.1.1.cmml" type="integer" xref="S4.SS2.p1.1.m1.1.1">1</cn><cn id="S4.SS2.p1.1.m1.2.2.cmml" type="integer" xref="S4.SS2.p1.1.m1.2.2">2</cn><ci id="S4.SS2.p1.1.m1.3.3.cmml" xref="S4.SS2.p1.1.m1.3.3">…</ci><ci id="S4.SS2.p1.1.m1.4.4.cmml" xref="S4.SS2.p1.1.m1.4.4">𝑇</ci></set></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.1.m1.4c">\mathcal{T}=\{1,2,...,T\}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.1.m1.4d">caligraphic_T = { 1 , 2 , … , italic_T }</annotation></semantics></math>. At each time slot <math alttext="t" class="ltx_Math" display="inline" id="S4.SS2.p1.2.m2.1"><semantics id="S4.SS2.p1.2.m2.1a"><mi id="S4.SS2.p1.2.m2.1.1" xref="S4.SS2.p1.2.m2.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.2.m2.1b"><ci id="S4.SS2.p1.2.m2.1.1.cmml" xref="S4.SS2.p1.2.m2.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.2.m2.1c">t</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.2.m2.1d">italic_t</annotation></semantics></math>, we decide the SSM selection for each request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS2.p1.3.m3.1"><semantics id="S4.SS2.p1.3.m3.1a"><mi id="S4.SS2.p1.3.m3.1.1" xref="S4.SS2.p1.3.m3.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.3.m3.1b"><ci id="S4.SS2.p1.3.m3.1.1.cmml" xref="S4.SS2.p1.3.m3.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.3.m3.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.3.m3.1d">italic_i</annotation></semantics></math>, denoted by <math alttext="x_{i}(t)\in\mathcal{M}" class="ltx_Math" display="inline" id="S4.SS2.p1.4.m4.1"><semantics id="S4.SS2.p1.4.m4.1a"><mrow id="S4.SS2.p1.4.m4.1.2" xref="S4.SS2.p1.4.m4.1.2.cmml"><mrow id="S4.SS2.p1.4.m4.1.2.2" xref="S4.SS2.p1.4.m4.1.2.2.cmml"><msub 
id="S4.SS2.p1.4.m4.1.2.2.2" xref="S4.SS2.p1.4.m4.1.2.2.2.cmml"><mi id="S4.SS2.p1.4.m4.1.2.2.2.2" xref="S4.SS2.p1.4.m4.1.2.2.2.2.cmml">x</mi><mi id="S4.SS2.p1.4.m4.1.2.2.2.3" xref="S4.SS2.p1.4.m4.1.2.2.2.3.cmml">i</mi></msub><mo id="S4.SS2.p1.4.m4.1.2.2.1" xref="S4.SS2.p1.4.m4.1.2.2.1.cmml">⁢</mo><mrow id="S4.SS2.p1.4.m4.1.2.2.3.2" xref="S4.SS2.p1.4.m4.1.2.2.cmml"><mo id="S4.SS2.p1.4.m4.1.2.2.3.2.1" stretchy="false" xref="S4.SS2.p1.4.m4.1.2.2.cmml">(</mo><mi id="S4.SS2.p1.4.m4.1.1" xref="S4.SS2.p1.4.m4.1.1.cmml">t</mi><mo id="S4.SS2.p1.4.m4.1.2.2.3.2.2" stretchy="false" xref="S4.SS2.p1.4.m4.1.2.2.cmml">)</mo></mrow></mrow><mo id="S4.SS2.p1.4.m4.1.2.1" xref="S4.SS2.p1.4.m4.1.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.SS2.p1.4.m4.1.2.3" xref="S4.SS2.p1.4.m4.1.2.3.cmml">ℳ</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.4.m4.1b"><apply id="S4.SS2.p1.4.m4.1.2.cmml" xref="S4.SS2.p1.4.m4.1.2"><in id="S4.SS2.p1.4.m4.1.2.1.cmml" xref="S4.SS2.p1.4.m4.1.2.1"></in><apply id="S4.SS2.p1.4.m4.1.2.2.cmml" xref="S4.SS2.p1.4.m4.1.2.2"><times id="S4.SS2.p1.4.m4.1.2.2.1.cmml" xref="S4.SS2.p1.4.m4.1.2.2.1"></times><apply id="S4.SS2.p1.4.m4.1.2.2.2.cmml" xref="S4.SS2.p1.4.m4.1.2.2.2"><csymbol cd="ambiguous" id="S4.SS2.p1.4.m4.1.2.2.2.1.cmml" xref="S4.SS2.p1.4.m4.1.2.2.2">subscript</csymbol><ci id="S4.SS2.p1.4.m4.1.2.2.2.2.cmml" xref="S4.SS2.p1.4.m4.1.2.2.2.2">𝑥</ci><ci id="S4.SS2.p1.4.m4.1.2.2.2.3.cmml" xref="S4.SS2.p1.4.m4.1.2.2.2.3">𝑖</ci></apply><ci id="S4.SS2.p1.4.m4.1.1.cmml" xref="S4.SS2.p1.4.m4.1.1">𝑡</ci></apply><ci id="S4.SS2.p1.4.m4.1.2.3.cmml" xref="S4.SS2.p1.4.m4.1.2.3">ℳ</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.4.m4.1c">x_{i}(t)\in\mathcal{M}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.4.m4.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) ∈ caligraphic_M</annotation></semantics></math>, and the corresponding real-time goodput is <math 
alttext="r_{i,x(t)}" class="ltx_Math" display="inline" id="S4.SS2.p1.5.m5.3"><semantics id="S4.SS2.p1.5.m5.3a"><msub id="S4.SS2.p1.5.m5.3.4" xref="S4.SS2.p1.5.m5.3.4.cmml"><mi id="S4.SS2.p1.5.m5.3.4.2" xref="S4.SS2.p1.5.m5.3.4.2.cmml">r</mi><mrow id="S4.SS2.p1.5.m5.3.3.3.3" xref="S4.SS2.p1.5.m5.3.3.3.4.cmml"><mi id="S4.SS2.p1.5.m5.2.2.2.2" xref="S4.SS2.p1.5.m5.2.2.2.2.cmml">i</mi><mo id="S4.SS2.p1.5.m5.3.3.3.3.2" xref="S4.SS2.p1.5.m5.3.3.3.4.cmml">,</mo><mrow id="S4.SS2.p1.5.m5.3.3.3.3.1" xref="S4.SS2.p1.5.m5.3.3.3.3.1.cmml"><mi id="S4.SS2.p1.5.m5.3.3.3.3.1.2" xref="S4.SS2.p1.5.m5.3.3.3.3.1.2.cmml">x</mi><mo id="S4.SS2.p1.5.m5.3.3.3.3.1.1" xref="S4.SS2.p1.5.m5.3.3.3.3.1.1.cmml">⁢</mo><mrow id="S4.SS2.p1.5.m5.3.3.3.3.1.3.2" xref="S4.SS2.p1.5.m5.3.3.3.3.1.cmml"><mo id="S4.SS2.p1.5.m5.3.3.3.3.1.3.2.1" stretchy="false" xref="S4.SS2.p1.5.m5.3.3.3.3.1.cmml">(</mo><mi id="S4.SS2.p1.5.m5.1.1.1.1" xref="S4.SS2.p1.5.m5.1.1.1.1.cmml">t</mi><mo id="S4.SS2.p1.5.m5.3.3.3.3.1.3.2.2" stretchy="false" xref="S4.SS2.p1.5.m5.3.3.3.3.1.cmml">)</mo></mrow></mrow></mrow></msub><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.5.m5.3b"><apply id="S4.SS2.p1.5.m5.3.4.cmml" xref="S4.SS2.p1.5.m5.3.4"><csymbol cd="ambiguous" id="S4.SS2.p1.5.m5.3.4.1.cmml" xref="S4.SS2.p1.5.m5.3.4">subscript</csymbol><ci id="S4.SS2.p1.5.m5.3.4.2.cmml" xref="S4.SS2.p1.5.m5.3.4.2">𝑟</ci><list id="S4.SS2.p1.5.m5.3.3.3.4.cmml" xref="S4.SS2.p1.5.m5.3.3.3.3"><ci id="S4.SS2.p1.5.m5.2.2.2.2.cmml" xref="S4.SS2.p1.5.m5.2.2.2.2">𝑖</ci><apply id="S4.SS2.p1.5.m5.3.3.3.3.1.cmml" xref="S4.SS2.p1.5.m5.3.3.3.3.1"><times id="S4.SS2.p1.5.m5.3.3.3.3.1.1.cmml" xref="S4.SS2.p1.5.m5.3.3.3.3.1.1"></times><ci id="S4.SS2.p1.5.m5.3.3.3.3.1.2.cmml" xref="S4.SS2.p1.5.m5.3.3.3.3.1.2">𝑥</ci><ci id="S4.SS2.p1.5.m5.1.1.1.1.cmml" xref="S4.SS2.p1.5.m5.1.1.1.1">𝑡</ci></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.5.m5.3c">r_{i,x(t)}</annotation><annotation encoding="application/x-llamapun" 
id="S4.SS2.p1.5.m5.3d">italic_r start_POSTSUBSCRIPT italic_i , italic_x ( italic_t ) end_POSTSUBSCRIPT</annotation></semantics></math>. Therefore, we have <math alttext="g_{i,x_{i}}=\mathbb{E}[r_{i,x_{i}(t)}]" class="ltx_Math" display="inline" id="S4.SS2.p1.6.m6.6"><semantics id="S4.SS2.p1.6.m6.6a"><mrow id="S4.SS2.p1.6.m6.6.6" xref="S4.SS2.p1.6.m6.6.6.cmml"><msub id="S4.SS2.p1.6.m6.6.6.3" xref="S4.SS2.p1.6.m6.6.6.3.cmml"><mi id="S4.SS2.p1.6.m6.6.6.3.2" xref="S4.SS2.p1.6.m6.6.6.3.2.cmml">g</mi><mrow id="S4.SS2.p1.6.m6.2.2.2.2" xref="S4.SS2.p1.6.m6.2.2.2.3.cmml"><mi id="S4.SS2.p1.6.m6.1.1.1.1" xref="S4.SS2.p1.6.m6.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p1.6.m6.2.2.2.2.2" xref="S4.SS2.p1.6.m6.2.2.2.3.cmml">,</mo><msub id="S4.SS2.p1.6.m6.2.2.2.2.1" xref="S4.SS2.p1.6.m6.2.2.2.2.1.cmml"><mi id="S4.SS2.p1.6.m6.2.2.2.2.1.2" xref="S4.SS2.p1.6.m6.2.2.2.2.1.2.cmml">x</mi><mi id="S4.SS2.p1.6.m6.2.2.2.2.1.3" xref="S4.SS2.p1.6.m6.2.2.2.2.1.3.cmml">i</mi></msub></mrow></msub><mo id="S4.SS2.p1.6.m6.6.6.2" xref="S4.SS2.p1.6.m6.6.6.2.cmml">=</mo><mrow id="S4.SS2.p1.6.m6.6.6.1" xref="S4.SS2.p1.6.m6.6.6.1.cmml"><mi id="S4.SS2.p1.6.m6.6.6.1.3" xref="S4.SS2.p1.6.m6.6.6.1.3.cmml">𝔼</mi><mo id="S4.SS2.p1.6.m6.6.6.1.2" xref="S4.SS2.p1.6.m6.6.6.1.2.cmml">⁢</mo><mrow id="S4.SS2.p1.6.m6.6.6.1.1.1" xref="S4.SS2.p1.6.m6.6.6.1.1.2.cmml"><mo id="S4.SS2.p1.6.m6.6.6.1.1.1.2" stretchy="false" xref="S4.SS2.p1.6.m6.6.6.1.1.2.1.cmml">[</mo><msub id="S4.SS2.p1.6.m6.6.6.1.1.1.1" xref="S4.SS2.p1.6.m6.6.6.1.1.1.1.cmml"><mi id="S4.SS2.p1.6.m6.6.6.1.1.1.1.2" xref="S4.SS2.p1.6.m6.6.6.1.1.1.1.2.cmml">r</mi><mrow id="S4.SS2.p1.6.m6.5.5.3.3" xref="S4.SS2.p1.6.m6.5.5.3.4.cmml"><mi id="S4.SS2.p1.6.m6.4.4.2.2" xref="S4.SS2.p1.6.m6.4.4.2.2.cmml">i</mi><mo id="S4.SS2.p1.6.m6.5.5.3.3.2" xref="S4.SS2.p1.6.m6.5.5.3.4.cmml">,</mo><mrow id="S4.SS2.p1.6.m6.5.5.3.3.1" xref="S4.SS2.p1.6.m6.5.5.3.3.1.cmml"><msub id="S4.SS2.p1.6.m6.5.5.3.3.1.2" xref="S4.SS2.p1.6.m6.5.5.3.3.1.2.cmml"><mi id="S4.SS2.p1.6.m6.5.5.3.3.1.2.2" 
xref="S4.SS2.p1.6.m6.5.5.3.3.1.2.2.cmml">x</mi><mi id="S4.SS2.p1.6.m6.5.5.3.3.1.2.3" xref="S4.SS2.p1.6.m6.5.5.3.3.1.2.3.cmml">i</mi></msub><mo id="S4.SS2.p1.6.m6.5.5.3.3.1.1" xref="S4.SS2.p1.6.m6.5.5.3.3.1.1.cmml">⁢</mo><mrow id="S4.SS2.p1.6.m6.5.5.3.3.1.3.2" xref="S4.SS2.p1.6.m6.5.5.3.3.1.cmml"><mo id="S4.SS2.p1.6.m6.5.5.3.3.1.3.2.1" stretchy="false" xref="S4.SS2.p1.6.m6.5.5.3.3.1.cmml">(</mo><mi id="S4.SS2.p1.6.m6.3.3.1.1" xref="S4.SS2.p1.6.m6.3.3.1.1.cmml">t</mi><mo id="S4.SS2.p1.6.m6.5.5.3.3.1.3.2.2" stretchy="false" xref="S4.SS2.p1.6.m6.5.5.3.3.1.cmml">)</mo></mrow></mrow></mrow></msub><mo id="S4.SS2.p1.6.m6.6.6.1.1.1.3" stretchy="false" xref="S4.SS2.p1.6.m6.6.6.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.6.m6.6b"><apply id="S4.SS2.p1.6.m6.6.6.cmml" xref="S4.SS2.p1.6.m6.6.6"><eq id="S4.SS2.p1.6.m6.6.6.2.cmml" xref="S4.SS2.p1.6.m6.6.6.2"></eq><apply id="S4.SS2.p1.6.m6.6.6.3.cmml" xref="S4.SS2.p1.6.m6.6.6.3"><csymbol cd="ambiguous" id="S4.SS2.p1.6.m6.6.6.3.1.cmml" xref="S4.SS2.p1.6.m6.6.6.3">subscript</csymbol><ci id="S4.SS2.p1.6.m6.6.6.3.2.cmml" xref="S4.SS2.p1.6.m6.6.6.3.2">𝑔</ci><list id="S4.SS2.p1.6.m6.2.2.2.3.cmml" xref="S4.SS2.p1.6.m6.2.2.2.2"><ci id="S4.SS2.p1.6.m6.1.1.1.1.cmml" xref="S4.SS2.p1.6.m6.1.1.1.1">𝑖</ci><apply id="S4.SS2.p1.6.m6.2.2.2.2.1.cmml" xref="S4.SS2.p1.6.m6.2.2.2.2.1"><csymbol cd="ambiguous" id="S4.SS2.p1.6.m6.2.2.2.2.1.1.cmml" xref="S4.SS2.p1.6.m6.2.2.2.2.1">subscript</csymbol><ci id="S4.SS2.p1.6.m6.2.2.2.2.1.2.cmml" xref="S4.SS2.p1.6.m6.2.2.2.2.1.2">𝑥</ci><ci id="S4.SS2.p1.6.m6.2.2.2.2.1.3.cmml" xref="S4.SS2.p1.6.m6.2.2.2.2.1.3">𝑖</ci></apply></list></apply><apply id="S4.SS2.p1.6.m6.6.6.1.cmml" xref="S4.SS2.p1.6.m6.6.6.1"><times id="S4.SS2.p1.6.m6.6.6.1.2.cmml" xref="S4.SS2.p1.6.m6.6.6.1.2"></times><ci id="S4.SS2.p1.6.m6.6.6.1.3.cmml" xref="S4.SS2.p1.6.m6.6.6.1.3">𝔼</ci><apply id="S4.SS2.p1.6.m6.6.6.1.1.2.cmml" xref="S4.SS2.p1.6.m6.6.6.1.1.1"><csymbol cd="latexml" 
id="S4.SS2.p1.6.m6.6.6.1.1.2.1.cmml" xref="S4.SS2.p1.6.m6.6.6.1.1.1.2">delimited-[]</csymbol><apply id="S4.SS2.p1.6.m6.6.6.1.1.1.1.cmml" xref="S4.SS2.p1.6.m6.6.6.1.1.1.1"><csymbol cd="ambiguous" id="S4.SS2.p1.6.m6.6.6.1.1.1.1.1.cmml" xref="S4.SS2.p1.6.m6.6.6.1.1.1.1">subscript</csymbol><ci id="S4.SS2.p1.6.m6.6.6.1.1.1.1.2.cmml" xref="S4.SS2.p1.6.m6.6.6.1.1.1.1.2">𝑟</ci><list id="S4.SS2.p1.6.m6.5.5.3.4.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3"><ci id="S4.SS2.p1.6.m6.4.4.2.2.cmml" xref="S4.SS2.p1.6.m6.4.4.2.2">𝑖</ci><apply id="S4.SS2.p1.6.m6.5.5.3.3.1.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3.1"><times id="S4.SS2.p1.6.m6.5.5.3.3.1.1.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3.1.1"></times><apply id="S4.SS2.p1.6.m6.5.5.3.3.1.2.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3.1.2"><csymbol cd="ambiguous" id="S4.SS2.p1.6.m6.5.5.3.3.1.2.1.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3.1.2">subscript</csymbol><ci id="S4.SS2.p1.6.m6.5.5.3.3.1.2.2.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3.1.2.2">𝑥</ci><ci id="S4.SS2.p1.6.m6.5.5.3.3.1.2.3.cmml" xref="S4.SS2.p1.6.m6.5.5.3.3.1.2.3">𝑖</ci></apply><ci id="S4.SS2.p1.6.m6.3.3.1.1.cmml" xref="S4.SS2.p1.6.m6.3.3.1.1">𝑡</ci></apply></list></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.6.m6.6c">g_{i,x_{i}}=\mathbb{E}[r_{i,x_{i}(t)}]</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.6.m6.6d">italic_g start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT = blackboard_E [ italic_r start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) end_POSTSUBSCRIPT ]</annotation></semantics></math>.</p> </div> <div class="ltx_para" id="S4.SS2.p2"> <p class="ltx_p" id="S4.SS2.p2.6">However, the bandit-based SSM selection algorithm must explore different SSMs to estimate their goodput, and each switch incurs an additional cost because the KV values of the tokens already generated must be re-computed. 
The KV re-computation cost, which is denoted as <math alttext="c_{i,j}(t)" class="ltx_Math" display="inline" id="S4.SS2.p2.1.m1.3"><semantics id="S4.SS2.p2.1.m1.3a"><mrow id="S4.SS2.p2.1.m1.3.4" xref="S4.SS2.p2.1.m1.3.4.cmml"><msub id="S4.SS2.p2.1.m1.3.4.2" xref="S4.SS2.p2.1.m1.3.4.2.cmml"><mi id="S4.SS2.p2.1.m1.3.4.2.2" xref="S4.SS2.p2.1.m1.3.4.2.2.cmml">c</mi><mrow id="S4.SS2.p2.1.m1.2.2.2.4" xref="S4.SS2.p2.1.m1.2.2.2.3.cmml"><mi id="S4.SS2.p2.1.m1.1.1.1.1" xref="S4.SS2.p2.1.m1.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p2.1.m1.2.2.2.4.1" xref="S4.SS2.p2.1.m1.2.2.2.3.cmml">,</mo><mi id="S4.SS2.p2.1.m1.2.2.2.2" xref="S4.SS2.p2.1.m1.2.2.2.2.cmml">j</mi></mrow></msub><mo id="S4.SS2.p2.1.m1.3.4.1" xref="S4.SS2.p2.1.m1.3.4.1.cmml">⁢</mo><mrow id="S4.SS2.p2.1.m1.3.4.3.2" xref="S4.SS2.p2.1.m1.3.4.cmml"><mo id="S4.SS2.p2.1.m1.3.4.3.2.1" stretchy="false" xref="S4.SS2.p2.1.m1.3.4.cmml">(</mo><mi id="S4.SS2.p2.1.m1.3.3" xref="S4.SS2.p2.1.m1.3.3.cmml">t</mi><mo id="S4.SS2.p2.1.m1.3.4.3.2.2" stretchy="false" xref="S4.SS2.p2.1.m1.3.4.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.1.m1.3b"><apply id="S4.SS2.p2.1.m1.3.4.cmml" xref="S4.SS2.p2.1.m1.3.4"><times id="S4.SS2.p2.1.m1.3.4.1.cmml" xref="S4.SS2.p2.1.m1.3.4.1"></times><apply id="S4.SS2.p2.1.m1.3.4.2.cmml" xref="S4.SS2.p2.1.m1.3.4.2"><csymbol cd="ambiguous" id="S4.SS2.p2.1.m1.3.4.2.1.cmml" xref="S4.SS2.p2.1.m1.3.4.2">subscript</csymbol><ci id="S4.SS2.p2.1.m1.3.4.2.2.cmml" xref="S4.SS2.p2.1.m1.3.4.2.2">𝑐</ci><list id="S4.SS2.p2.1.m1.2.2.2.3.cmml" xref="S4.SS2.p2.1.m1.2.2.2.4"><ci id="S4.SS2.p2.1.m1.1.1.1.1.cmml" xref="S4.SS2.p2.1.m1.1.1.1.1">𝑖</ci><ci id="S4.SS2.p2.1.m1.2.2.2.2.cmml" xref="S4.SS2.p2.1.m1.2.2.2.2">𝑗</ci></list></apply><ci id="S4.SS2.p2.1.m1.3.3.cmml" xref="S4.SS2.p2.1.m1.3.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.1.m1.3c">c_{i,j}(t)</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.1.m1.3d">italic_c 
start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ( italic_t )</annotation></semantics></math>, of request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS2.p2.2.m2.1"><semantics id="S4.SS2.p2.2.m2.1a"><mi id="S4.SS2.p2.2.m2.1.1" xref="S4.SS2.p2.2.m2.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.2.m2.1b"><ci id="S4.SS2.p2.2.m2.1.1.cmml" xref="S4.SS2.p2.2.m2.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.2.m2.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.2.m2.1d">italic_i</annotation></semantics></math> on SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS2.p2.3.m3.1"><semantics id="S4.SS2.p2.3.m3.1a"><mi id="S4.SS2.p2.3.m3.1.1" xref="S4.SS2.p2.3.m3.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.3.m3.1b"><ci id="S4.SS2.p2.3.m3.1.1.cmml" xref="S4.SS2.p2.3.m3.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.3.m3.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.3.m3.1d">italic_j</annotation></semantics></math> increases over time slots as more tokens are generated <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib30" title="">30</a>]</cite>. 
We denote <math alttext="z_{i}(t)=c_{i,x_{i}(t)}(t)\mathbbm{1}_{x_{i}(t)\neq x_{i}(t-1)}" class="ltx_Math" display="inline" id="S4.SS2.p2.4.m4.7"><semantics id="S4.SS2.p2.4.m4.7a"><mrow id="S4.SS2.p2.4.m4.7.8" xref="S4.SS2.p2.4.m4.7.8.cmml"><mrow id="S4.SS2.p2.4.m4.7.8.2" xref="S4.SS2.p2.4.m4.7.8.2.cmml"><msub id="S4.SS2.p2.4.m4.7.8.2.2" xref="S4.SS2.p2.4.m4.7.8.2.2.cmml"><mi id="S4.SS2.p2.4.m4.7.8.2.2.2" xref="S4.SS2.p2.4.m4.7.8.2.2.2.cmml">z</mi><mi id="S4.SS2.p2.4.m4.7.8.2.2.3" xref="S4.SS2.p2.4.m4.7.8.2.2.3.cmml">i</mi></msub><mo id="S4.SS2.p2.4.m4.7.8.2.1" xref="S4.SS2.p2.4.m4.7.8.2.1.cmml">⁢</mo><mrow id="S4.SS2.p2.4.m4.7.8.2.3.2" xref="S4.SS2.p2.4.m4.7.8.2.cmml"><mo id="S4.SS2.p2.4.m4.7.8.2.3.2.1" stretchy="false" xref="S4.SS2.p2.4.m4.7.8.2.cmml">(</mo><mi id="S4.SS2.p2.4.m4.6.6" xref="S4.SS2.p2.4.m4.6.6.cmml">t</mi><mo id="S4.SS2.p2.4.m4.7.8.2.3.2.2" stretchy="false" xref="S4.SS2.p2.4.m4.7.8.2.cmml">)</mo></mrow></mrow><mo id="S4.SS2.p2.4.m4.7.8.1" xref="S4.SS2.p2.4.m4.7.8.1.cmml">=</mo><mrow id="S4.SS2.p2.4.m4.7.8.3" xref="S4.SS2.p2.4.m4.7.8.3.cmml"><msub id="S4.SS2.p2.4.m4.7.8.3.2" xref="S4.SS2.p2.4.m4.7.8.3.2.cmml"><mi id="S4.SS2.p2.4.m4.7.8.3.2.2" xref="S4.SS2.p2.4.m4.7.8.3.2.2.cmml">c</mi><mrow id="S4.SS2.p2.4.m4.3.3.3.3" xref="S4.SS2.p2.4.m4.3.3.3.4.cmml"><mi id="S4.SS2.p2.4.m4.2.2.2.2" xref="S4.SS2.p2.4.m4.2.2.2.2.cmml">i</mi><mo id="S4.SS2.p2.4.m4.3.3.3.3.2" xref="S4.SS2.p2.4.m4.3.3.3.4.cmml">,</mo><mrow id="S4.SS2.p2.4.m4.3.3.3.3.1" xref="S4.SS2.p2.4.m4.3.3.3.3.1.cmml"><msub id="S4.SS2.p2.4.m4.3.3.3.3.1.2" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2.cmml"><mi id="S4.SS2.p2.4.m4.3.3.3.3.1.2.2" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2.2.cmml">x</mi><mi id="S4.SS2.p2.4.m4.3.3.3.3.1.2.3" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2.3.cmml">i</mi></msub><mo id="S4.SS2.p2.4.m4.3.3.3.3.1.1" xref="S4.SS2.p2.4.m4.3.3.3.3.1.1.cmml">⁢</mo><mrow id="S4.SS2.p2.4.m4.3.3.3.3.1.3.2" xref="S4.SS2.p2.4.m4.3.3.3.3.1.cmml"><mo id="S4.SS2.p2.4.m4.3.3.3.3.1.3.2.1" stretchy="false" 
xref="S4.SS2.p2.4.m4.3.3.3.3.1.cmml">(</mo><mi id="S4.SS2.p2.4.m4.1.1.1.1" xref="S4.SS2.p2.4.m4.1.1.1.1.cmml">t</mi><mo id="S4.SS2.p2.4.m4.3.3.3.3.1.3.2.2" stretchy="false" xref="S4.SS2.p2.4.m4.3.3.3.3.1.cmml">)</mo></mrow></mrow></mrow></msub><mo id="S4.SS2.p2.4.m4.7.8.3.1" xref="S4.SS2.p2.4.m4.7.8.3.1.cmml">⁢</mo><mrow id="S4.SS2.p2.4.m4.7.8.3.3.2" xref="S4.SS2.p2.4.m4.7.8.3.cmml"><mo id="S4.SS2.p2.4.m4.7.8.3.3.2.1" stretchy="false" xref="S4.SS2.p2.4.m4.7.8.3.cmml">(</mo><mi id="S4.SS2.p2.4.m4.7.7" xref="S4.SS2.p2.4.m4.7.7.cmml">t</mi><mo id="S4.SS2.p2.4.m4.7.8.3.3.2.2" stretchy="false" xref="S4.SS2.p2.4.m4.7.8.3.cmml">)</mo></mrow><mo id="S4.SS2.p2.4.m4.7.8.3.1a" xref="S4.SS2.p2.4.m4.7.8.3.1.cmml">⁢</mo><msub id="S4.SS2.p2.4.m4.7.8.3.4" xref="S4.SS2.p2.4.m4.7.8.3.4.cmml"><mn id="S4.SS2.p2.4.m4.7.8.3.4.2" xref="S4.SS2.p2.4.m4.7.8.3.4.2.cmml">𝟙</mn><mrow id="S4.SS2.p2.4.m4.5.5.2" xref="S4.SS2.p2.4.m4.5.5.2.cmml"><mrow id="S4.SS2.p2.4.m4.5.5.2.4" xref="S4.SS2.p2.4.m4.5.5.2.4.cmml"><msub id="S4.SS2.p2.4.m4.5.5.2.4.2" xref="S4.SS2.p2.4.m4.5.5.2.4.2.cmml"><mi id="S4.SS2.p2.4.m4.5.5.2.4.2.2" xref="S4.SS2.p2.4.m4.5.5.2.4.2.2.cmml">x</mi><mi id="S4.SS2.p2.4.m4.5.5.2.4.2.3" xref="S4.SS2.p2.4.m4.5.5.2.4.2.3.cmml">i</mi></msub><mo id="S4.SS2.p2.4.m4.5.5.2.4.1" xref="S4.SS2.p2.4.m4.5.5.2.4.1.cmml">⁢</mo><mrow id="S4.SS2.p2.4.m4.5.5.2.4.3.2" xref="S4.SS2.p2.4.m4.5.5.2.4.cmml"><mo id="S4.SS2.p2.4.m4.5.5.2.4.3.2.1" stretchy="false" xref="S4.SS2.p2.4.m4.5.5.2.4.cmml">(</mo><mi id="S4.SS2.p2.4.m4.4.4.1.1" xref="S4.SS2.p2.4.m4.4.4.1.1.cmml">t</mi><mo id="S4.SS2.p2.4.m4.5.5.2.4.3.2.2" stretchy="false" xref="S4.SS2.p2.4.m4.5.5.2.4.cmml">)</mo></mrow></mrow><mo id="S4.SS2.p2.4.m4.5.5.2.3" xref="S4.SS2.p2.4.m4.5.5.2.3.cmml">≠</mo><mrow id="S4.SS2.p2.4.m4.5.5.2.2" xref="S4.SS2.p2.4.m4.5.5.2.2.cmml"><msub id="S4.SS2.p2.4.m4.5.5.2.2.3" xref="S4.SS2.p2.4.m4.5.5.2.2.3.cmml"><mi id="S4.SS2.p2.4.m4.5.5.2.2.3.2" xref="S4.SS2.p2.4.m4.5.5.2.2.3.2.cmml">x</mi><mi id="S4.SS2.p2.4.m4.5.5.2.2.3.3" 
xref="S4.SS2.p2.4.m4.5.5.2.2.3.3.cmml">i</mi></msub><mo id="S4.SS2.p2.4.m4.5.5.2.2.2" xref="S4.SS2.p2.4.m4.5.5.2.2.2.cmml">⁢</mo><mrow id="S4.SS2.p2.4.m4.5.5.2.2.1.1" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.cmml"><mo id="S4.SS2.p2.4.m4.5.5.2.2.1.1.2" stretchy="false" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.cmml">(</mo><mrow id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.cmml"><mi id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.2" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.2.cmml">t</mi><mo id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.1" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.1.cmml">−</mo><mn id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.3" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.3.cmml">1</mn></mrow><mo id="S4.SS2.p2.4.m4.5.5.2.2.1.1.3" stretchy="false" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.cmml">)</mo></mrow></mrow></mrow></msub></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.4.m4.7b"><apply id="S4.SS2.p2.4.m4.7.8.cmml" xref="S4.SS2.p2.4.m4.7.8"><eq id="S4.SS2.p2.4.m4.7.8.1.cmml" xref="S4.SS2.p2.4.m4.7.8.1"></eq><apply id="S4.SS2.p2.4.m4.7.8.2.cmml" xref="S4.SS2.p2.4.m4.7.8.2"><times id="S4.SS2.p2.4.m4.7.8.2.1.cmml" xref="S4.SS2.p2.4.m4.7.8.2.1"></times><apply id="S4.SS2.p2.4.m4.7.8.2.2.cmml" xref="S4.SS2.p2.4.m4.7.8.2.2"><csymbol cd="ambiguous" id="S4.SS2.p2.4.m4.7.8.2.2.1.cmml" xref="S4.SS2.p2.4.m4.7.8.2.2">subscript</csymbol><ci id="S4.SS2.p2.4.m4.7.8.2.2.2.cmml" xref="S4.SS2.p2.4.m4.7.8.2.2.2">𝑧</ci><ci id="S4.SS2.p2.4.m4.7.8.2.2.3.cmml" xref="S4.SS2.p2.4.m4.7.8.2.2.3">𝑖</ci></apply><ci id="S4.SS2.p2.4.m4.6.6.cmml" xref="S4.SS2.p2.4.m4.6.6">𝑡</ci></apply><apply id="S4.SS2.p2.4.m4.7.8.3.cmml" xref="S4.SS2.p2.4.m4.7.8.3"><times id="S4.SS2.p2.4.m4.7.8.3.1.cmml" xref="S4.SS2.p2.4.m4.7.8.3.1"></times><apply id="S4.SS2.p2.4.m4.7.8.3.2.cmml" xref="S4.SS2.p2.4.m4.7.8.3.2"><csymbol cd="ambiguous" id="S4.SS2.p2.4.m4.7.8.3.2.1.cmml" xref="S4.SS2.p2.4.m4.7.8.3.2">subscript</csymbol><ci id="S4.SS2.p2.4.m4.7.8.3.2.2.cmml" xref="S4.SS2.p2.4.m4.7.8.3.2.2">𝑐</ci><list id="S4.SS2.p2.4.m4.3.3.3.4.cmml" 
xref="S4.SS2.p2.4.m4.3.3.3.3"><ci id="S4.SS2.p2.4.m4.2.2.2.2.cmml" xref="S4.SS2.p2.4.m4.2.2.2.2">𝑖</ci><apply id="S4.SS2.p2.4.m4.3.3.3.3.1.cmml" xref="S4.SS2.p2.4.m4.3.3.3.3.1"><times id="S4.SS2.p2.4.m4.3.3.3.3.1.1.cmml" xref="S4.SS2.p2.4.m4.3.3.3.3.1.1"></times><apply id="S4.SS2.p2.4.m4.3.3.3.3.1.2.cmml" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2"><csymbol cd="ambiguous" id="S4.SS2.p2.4.m4.3.3.3.3.1.2.1.cmml" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2">subscript</csymbol><ci id="S4.SS2.p2.4.m4.3.3.3.3.1.2.2.cmml" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2.2">𝑥</ci><ci id="S4.SS2.p2.4.m4.3.3.3.3.1.2.3.cmml" xref="S4.SS2.p2.4.m4.3.3.3.3.1.2.3">𝑖</ci></apply><ci id="S4.SS2.p2.4.m4.1.1.1.1.cmml" xref="S4.SS2.p2.4.m4.1.1.1.1">𝑡</ci></apply></list></apply><ci id="S4.SS2.p2.4.m4.7.7.cmml" xref="S4.SS2.p2.4.m4.7.7">𝑡</ci><apply id="S4.SS2.p2.4.m4.7.8.3.4.cmml" xref="S4.SS2.p2.4.m4.7.8.3.4"><csymbol cd="ambiguous" id="S4.SS2.p2.4.m4.7.8.3.4.1.cmml" xref="S4.SS2.p2.4.m4.7.8.3.4">subscript</csymbol><cn id="S4.SS2.p2.4.m4.7.8.3.4.2.cmml" type="integer" xref="S4.SS2.p2.4.m4.7.8.3.4.2">1</cn><apply id="S4.SS2.p2.4.m4.5.5.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2"><neq id="S4.SS2.p2.4.m4.5.5.2.3.cmml" xref="S4.SS2.p2.4.m4.5.5.2.3"></neq><apply id="S4.SS2.p2.4.m4.5.5.2.4.cmml" xref="S4.SS2.p2.4.m4.5.5.2.4"><times id="S4.SS2.p2.4.m4.5.5.2.4.1.cmml" xref="S4.SS2.p2.4.m4.5.5.2.4.1"></times><apply id="S4.SS2.p2.4.m4.5.5.2.4.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2.4.2"><csymbol cd="ambiguous" id="S4.SS2.p2.4.m4.5.5.2.4.2.1.cmml" xref="S4.SS2.p2.4.m4.5.5.2.4.2">subscript</csymbol><ci id="S4.SS2.p2.4.m4.5.5.2.4.2.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2.4.2.2">𝑥</ci><ci id="S4.SS2.p2.4.m4.5.5.2.4.2.3.cmml" xref="S4.SS2.p2.4.m4.5.5.2.4.2.3">𝑖</ci></apply><ci id="S4.SS2.p2.4.m4.4.4.1.1.cmml" xref="S4.SS2.p2.4.m4.4.4.1.1">𝑡</ci></apply><apply id="S4.SS2.p2.4.m4.5.5.2.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2"><times id="S4.SS2.p2.4.m4.5.5.2.2.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.2"></times><apply id="S4.SS2.p2.4.m4.5.5.2.2.3.cmml" 
xref="S4.SS2.p2.4.m4.5.5.2.2.3"><csymbol cd="ambiguous" id="S4.SS2.p2.4.m4.5.5.2.2.3.1.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.3">subscript</csymbol><ci id="S4.SS2.p2.4.m4.5.5.2.2.3.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.3.2">𝑥</ci><ci id="S4.SS2.p2.4.m4.5.5.2.2.3.3.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.3.3">𝑖</ci></apply><apply id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1"><minus id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.1.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.1"></minus><ci id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.2.cmml" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.2">𝑡</ci><cn id="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.3.cmml" type="integer" xref="S4.SS2.p2.4.m4.5.5.2.2.1.1.1.3">1</cn></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.4.m4.7c">z_{i}(t)=c_{i,x_{i}(t)}(t)\mathbbm{1}_{x_{i}(t)\neq x_{i}(t-1)}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.4.m4.7d">italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) = italic_c start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) end_POSTSUBSCRIPT ( italic_t ) blackboard_1 start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) ≠ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t - 1 ) end_POSTSUBSCRIPT</annotation></semantics></math>, where <math alttext="\mathbbm{1}_{A}" class="ltx_Math" display="inline" id="S4.SS2.p2.5.m5.1"><semantics id="S4.SS2.p2.5.m5.1a"><msub id="S4.SS2.p2.5.m5.1.1" xref="S4.SS2.p2.5.m5.1.1.cmml"><mn id="S4.SS2.p2.5.m5.1.1.2" xref="S4.SS2.p2.5.m5.1.1.2.cmml">𝟙</mn><mi id="S4.SS2.p2.5.m5.1.1.3" xref="S4.SS2.p2.5.m5.1.1.3.cmml">A</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.5.m5.1b"><apply id="S4.SS2.p2.5.m5.1.1.cmml" xref="S4.SS2.p2.5.m5.1.1"><csymbol cd="ambiguous" id="S4.SS2.p2.5.m5.1.1.1.cmml" xref="S4.SS2.p2.5.m5.1.1">subscript</csymbol><cn id="S4.SS2.p2.5.m5.1.1.2.cmml" 
type="integer" xref="S4.SS2.p2.5.m5.1.1.2">1</cn><ci id="S4.SS2.p2.5.m5.1.1.3.cmml" xref="S4.SS2.p2.5.m5.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.5.m5.1c">\mathbbm{1}_{A}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.5.m5.1d">blackboard_1 start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT</annotation></semantics></math> is an indicator function for condition <math alttext="A" class="ltx_Math" display="inline" id="S4.SS2.p2.6.m6.1"><semantics id="S4.SS2.p2.6.m6.1a"><mi id="S4.SS2.p2.6.m6.1.1" xref="S4.SS2.p2.6.m6.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.6.m6.1b"><ci id="S4.SS2.p2.6.m6.1.1.cmml" xref="S4.SS2.p2.6.m6.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.6.m6.1c">A</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.6.m6.1d">italic_A</annotation></semantics></math>. The cumulative regret to quantify the SSM selection performance can be expressed as:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx4"> <tbody id="S4.E6"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\mathcal{R}(T)\!=\!\underbrace{\sum_{i=1}^{N}(Tg_{i,x_{i}^{*}}-% \sum_{t=1}^{T}\!\mathbb{E}[r_{i,x_{i}(t)}])}_{\text{Goodput regret}}+\lambda% \underbrace{\sum_{t=1}^{T}\sum_{i=1}^{N}z_{i}(t)}_{\text{Switching cost}}," class="ltx_Math" display="inline" id="S4.E6.m1.9"><semantics id="S4.E6.m1.9a"><mrow id="S4.E6.m1.9.9.1" xref="S4.E6.m1.9.9.1.1.cmml"><mrow id="S4.E6.m1.9.9.1.1" xref="S4.E6.m1.9.9.1.1.cmml"><mrow id="S4.E6.m1.9.9.1.1.2" xref="S4.E6.m1.9.9.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.E6.m1.9.9.1.1.2.2" xref="S4.E6.m1.9.9.1.1.2.2.cmml">ℛ</mi><mo id="S4.E6.m1.9.9.1.1.2.1" xref="S4.E6.m1.9.9.1.1.2.1.cmml">⁢</mo><mrow id="S4.E6.m1.9.9.1.1.2.3.2" 
xref="S4.E6.m1.9.9.1.1.2.cmml"><mo id="S4.E6.m1.9.9.1.1.2.3.2.1" stretchy="false" xref="S4.E6.m1.9.9.1.1.2.cmml">(</mo><mi id="S4.E6.m1.8.8" xref="S4.E6.m1.8.8.cmml">T</mi><mo id="S4.E6.m1.9.9.1.1.2.3.2.2" stretchy="false" xref="S4.E6.m1.9.9.1.1.2.cmml">)</mo></mrow></mrow><mo id="S4.E6.m1.9.9.1.1.1" lspace="0.108em" rspace="0.108em" xref="S4.E6.m1.9.9.1.1.1.cmml">=</mo><mrow id="S4.E6.m1.9.9.1.1.3" xref="S4.E6.m1.9.9.1.1.3.cmml"><munder id="S4.E6.m1.9.9.1.1.3.2" xref="S4.E6.m1.9.9.1.1.3.2.cmml"><munder accentunder="true" id="S4.E6.m1.6.6" xref="S4.E6.m1.6.6.cmml"><mrow id="S4.E6.m1.6.6.6" xref="S4.E6.m1.6.6.6.cmml"><mstyle displaystyle="true" id="S4.E6.m1.6.6.6.7" xref="S4.E6.m1.6.6.6.7.cmml"><munderover id="S4.E6.m1.6.6.6.7a" xref="S4.E6.m1.6.6.6.7.cmml"><mo id="S4.E6.m1.6.6.6.7.2.2" movablelimits="false" xref="S4.E6.m1.6.6.6.7.2.2.cmml">∑</mo><mrow id="S4.E6.m1.6.6.6.7.2.3" xref="S4.E6.m1.6.6.6.7.2.3.cmml"><mi id="S4.E6.m1.6.6.6.7.2.3.2" xref="S4.E6.m1.6.6.6.7.2.3.2.cmml">i</mi><mo id="S4.E6.m1.6.6.6.7.2.3.1" xref="S4.E6.m1.6.6.6.7.2.3.1.cmml">=</mo><mn id="S4.E6.m1.6.6.6.7.2.3.3" xref="S4.E6.m1.6.6.6.7.2.3.3.cmml">1</mn></mrow><mi id="S4.E6.m1.6.6.6.7.3" xref="S4.E6.m1.6.6.6.7.3.cmml">N</mi></munderover></mstyle><mrow id="S4.E6.m1.6.6.6.6.1" xref="S4.E6.m1.6.6.6.6.1.1.cmml"><mo id="S4.E6.m1.6.6.6.6.1.2" stretchy="false" xref="S4.E6.m1.6.6.6.6.1.1.cmml">(</mo><mrow id="S4.E6.m1.6.6.6.6.1.1" xref="S4.E6.m1.6.6.6.6.1.1.cmml"><mrow id="S4.E6.m1.6.6.6.6.1.1.3" xref="S4.E6.m1.6.6.6.6.1.1.3.cmml"><mi id="S4.E6.m1.6.6.6.6.1.1.3.2" xref="S4.E6.m1.6.6.6.6.1.1.3.2.cmml">T</mi><mo id="S4.E6.m1.6.6.6.6.1.1.3.1" xref="S4.E6.m1.6.6.6.6.1.1.3.1.cmml">⁢</mo><msub id="S4.E6.m1.6.6.6.6.1.1.3.3" xref="S4.E6.m1.6.6.6.6.1.1.3.3.cmml"><mi id="S4.E6.m1.6.6.6.6.1.1.3.3.2" xref="S4.E6.m1.6.6.6.6.1.1.3.3.2.cmml">g</mi><mrow id="S4.E6.m1.2.2.2.2.2.2" xref="S4.E6.m1.2.2.2.2.2.3.cmml"><mi id="S4.E6.m1.1.1.1.1.1.1" xref="S4.E6.m1.1.1.1.1.1.1.cmml">i</mi><mo id="S4.E6.m1.2.2.2.2.2.2.2" 
xref="S4.E6.m1.2.2.2.2.2.3.cmml">,</mo><msubsup id="S4.E6.m1.2.2.2.2.2.2.1" xref="S4.E6.m1.2.2.2.2.2.2.1.cmml"><mi id="S4.E6.m1.2.2.2.2.2.2.1.2.2" xref="S4.E6.m1.2.2.2.2.2.2.1.2.2.cmml">x</mi><mi id="S4.E6.m1.2.2.2.2.2.2.1.2.3" xref="S4.E6.m1.2.2.2.2.2.2.1.2.3.cmml">i</mi><mo id="S4.E6.m1.2.2.2.2.2.2.1.3" xref="S4.E6.m1.2.2.2.2.2.2.1.3.cmml">∗</mo></msubsup></mrow></msub></mrow><mo id="S4.E6.m1.6.6.6.6.1.1.2" xref="S4.E6.m1.6.6.6.6.1.1.2.cmml">−</mo><mrow id="S4.E6.m1.6.6.6.6.1.1.1" xref="S4.E6.m1.6.6.6.6.1.1.1.cmml"><mstyle displaystyle="true" id="S4.E6.m1.6.6.6.6.1.1.1.2" xref="S4.E6.m1.6.6.6.6.1.1.1.2.cmml"><munderover id="S4.E6.m1.6.6.6.6.1.1.1.2a" xref="S4.E6.m1.6.6.6.6.1.1.1.2.cmml"><mo id="S4.E6.m1.6.6.6.6.1.1.1.2.2.2" movablelimits="false" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.2.cmml">∑</mo><mrow id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.cmml"><mi id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.2" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.2.cmml">t</mi><mo id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.1" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.1.cmml">=</mo><mn id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.3" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.3.cmml">1</mn></mrow><mi id="S4.E6.m1.6.6.6.6.1.1.1.2.3" xref="S4.E6.m1.6.6.6.6.1.1.1.2.3.cmml">T</mi></munderover></mstyle><mrow id="S4.E6.m1.6.6.6.6.1.1.1.1" xref="S4.E6.m1.6.6.6.6.1.1.1.1.cmml"><mi id="S4.E6.m1.6.6.6.6.1.1.1.1.3" xref="S4.E6.m1.6.6.6.6.1.1.1.1.3.cmml">𝔼</mi><mo id="S4.E6.m1.6.6.6.6.1.1.1.1.2" xref="S4.E6.m1.6.6.6.6.1.1.1.1.2.cmml">⁢</mo><mrow id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.2.cmml"><mo id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.2" stretchy="false" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.2.1.cmml">[</mo><msub id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.cmml"><mi id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.2" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.2.cmml">r</mi><mrow id="S4.E6.m1.5.5.5.5.3.3" xref="S4.E6.m1.5.5.5.5.3.4.cmml"><mi id="S4.E6.m1.4.4.4.4.2.2" 
xref="S4.E6.m1.4.4.4.4.2.2.cmml">i</mi><mo id="S4.E6.m1.5.5.5.5.3.3.2" xref="S4.E6.m1.5.5.5.5.3.4.cmml">,</mo><mrow id="S4.E6.m1.5.5.5.5.3.3.1" xref="S4.E6.m1.5.5.5.5.3.3.1.cmml"><msub id="S4.E6.m1.5.5.5.5.3.3.1.2" xref="S4.E6.m1.5.5.5.5.3.3.1.2.cmml"><mi id="S4.E6.m1.5.5.5.5.3.3.1.2.2" xref="S4.E6.m1.5.5.5.5.3.3.1.2.2.cmml">x</mi><mi id="S4.E6.m1.5.5.5.5.3.3.1.2.3" xref="S4.E6.m1.5.5.5.5.3.3.1.2.3.cmml">i</mi></msub><mo id="S4.E6.m1.5.5.5.5.3.3.1.1" xref="S4.E6.m1.5.5.5.5.3.3.1.1.cmml">⁢</mo><mrow id="S4.E6.m1.5.5.5.5.3.3.1.3.2" xref="S4.E6.m1.5.5.5.5.3.3.1.cmml"><mo id="S4.E6.m1.5.5.5.5.3.3.1.3.2.1" stretchy="false" xref="S4.E6.m1.5.5.5.5.3.3.1.cmml">(</mo><mi id="S4.E6.m1.3.3.3.3.1.1" xref="S4.E6.m1.3.3.3.3.1.1.cmml">t</mi><mo id="S4.E6.m1.5.5.5.5.3.3.1.3.2.2" stretchy="false" xref="S4.E6.m1.5.5.5.5.3.3.1.cmml">)</mo></mrow></mrow></mrow></msub><mo id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.3" stretchy="false" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow></mrow><mo id="S4.E6.m1.6.6.6.6.1.3" stretchy="false" xref="S4.E6.m1.6.6.6.6.1.1.cmml">)</mo></mrow></mrow><mo id="S4.E6.m1.6.6.7" xref="S4.E6.m1.6.6.7.cmml">⏟</mo></munder><mtext id="S4.E6.m1.9.9.1.1.3.2.2" xref="S4.E6.m1.9.9.1.1.3.2.2a.cmml">Goodput regret</mtext></munder><mo id="S4.E6.m1.9.9.1.1.3.1" xref="S4.E6.m1.9.9.1.1.3.1.cmml">+</mo><mrow id="S4.E6.m1.9.9.1.1.3.3" xref="S4.E6.m1.9.9.1.1.3.3.cmml"><mi id="S4.E6.m1.9.9.1.1.3.3.2" xref="S4.E6.m1.9.9.1.1.3.3.2.cmml">λ</mi><mo id="S4.E6.m1.9.9.1.1.3.3.1" xref="S4.E6.m1.9.9.1.1.3.3.1.cmml">⁢</mo><munder id="S4.E6.m1.9.9.1.1.3.3.3" xref="S4.E6.m1.9.9.1.1.3.3.3.cmml"><munder accentunder="true" id="S4.E6.m1.7.7" xref="S4.E6.m1.7.7.cmml"><mrow id="S4.E6.m1.7.7.1" xref="S4.E6.m1.7.7.1.cmml"><mstyle displaystyle="true" id="S4.E6.m1.7.7.1.2" xref="S4.E6.m1.7.7.1.2.cmml"><munderover id="S4.E6.m1.7.7.1.2a" xref="S4.E6.m1.7.7.1.2.cmml"><mo id="S4.E6.m1.7.7.1.2.2.2" movablelimits="false" xref="S4.E6.m1.7.7.1.2.2.2.cmml">∑</mo><mrow 
id="S4.E6.m1.7.7.1.2.2.3" xref="S4.E6.m1.7.7.1.2.2.3.cmml"><mi id="S4.E6.m1.7.7.1.2.2.3.2" xref="S4.E6.m1.7.7.1.2.2.3.2.cmml">t</mi><mo id="S4.E6.m1.7.7.1.2.2.3.1" xref="S4.E6.m1.7.7.1.2.2.3.1.cmml">=</mo><mn id="S4.E6.m1.7.7.1.2.2.3.3" xref="S4.E6.m1.7.7.1.2.2.3.3.cmml">1</mn></mrow><mi id="S4.E6.m1.7.7.1.2.3" xref="S4.E6.m1.7.7.1.2.3.cmml">T</mi></munderover></mstyle><mrow id="S4.E6.m1.7.7.1.3" xref="S4.E6.m1.7.7.1.3.cmml"><mstyle displaystyle="true" id="S4.E6.m1.7.7.1.3.1" xref="S4.E6.m1.7.7.1.3.1.cmml"><munderover id="S4.E6.m1.7.7.1.3.1a" xref="S4.E6.m1.7.7.1.3.1.cmml"><mo id="S4.E6.m1.7.7.1.3.1.2.2" movablelimits="false" xref="S4.E6.m1.7.7.1.3.1.2.2.cmml">∑</mo><mrow id="S4.E6.m1.7.7.1.3.1.2.3" xref="S4.E6.m1.7.7.1.3.1.2.3.cmml"><mi id="S4.E6.m1.7.7.1.3.1.2.3.2" xref="S4.E6.m1.7.7.1.3.1.2.3.2.cmml">i</mi><mo id="S4.E6.m1.7.7.1.3.1.2.3.1" xref="S4.E6.m1.7.7.1.3.1.2.3.1.cmml">=</mo><mn id="S4.E6.m1.7.7.1.3.1.2.3.3" xref="S4.E6.m1.7.7.1.3.1.2.3.3.cmml">1</mn></mrow><mi id="S4.E6.m1.7.7.1.3.1.3" xref="S4.E6.m1.7.7.1.3.1.3.cmml">N</mi></munderover></mstyle><mrow id="S4.E6.m1.7.7.1.3.2" xref="S4.E6.m1.7.7.1.3.2.cmml"><msub id="S4.E6.m1.7.7.1.3.2.2" xref="S4.E6.m1.7.7.1.3.2.2.cmml"><mi id="S4.E6.m1.7.7.1.3.2.2.2" xref="S4.E6.m1.7.7.1.3.2.2.2.cmml">z</mi><mi id="S4.E6.m1.7.7.1.3.2.2.3" xref="S4.E6.m1.7.7.1.3.2.2.3.cmml">i</mi></msub><mo id="S4.E6.m1.7.7.1.3.2.1" xref="S4.E6.m1.7.7.1.3.2.1.cmml">⁢</mo><mrow id="S4.E6.m1.7.7.1.3.2.3.2" xref="S4.E6.m1.7.7.1.3.2.cmml"><mo id="S4.E6.m1.7.7.1.3.2.3.2.1" stretchy="false" xref="S4.E6.m1.7.7.1.3.2.cmml">(</mo><mi id="S4.E6.m1.7.7.1.1" xref="S4.E6.m1.7.7.1.1.cmml">t</mi><mo id="S4.E6.m1.7.7.1.3.2.3.2.2" stretchy="false" xref="S4.E6.m1.7.7.1.3.2.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="S4.E6.m1.7.7.2" xref="S4.E6.m1.7.7.2.cmml">⏟</mo></munder><mtext id="S4.E6.m1.9.9.1.1.3.3.3.2" xref="S4.E6.m1.9.9.1.1.3.3.3.2a.cmml">Switching cost</mtext></munder></mrow></mrow></mrow><mo id="S4.E6.m1.9.9.1.2" 
xref="S4.E6.m1.9.9.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E6.m1.9b"><apply id="S4.E6.m1.9.9.1.1.cmml" xref="S4.E6.m1.9.9.1"><eq id="S4.E6.m1.9.9.1.1.1.cmml" xref="S4.E6.m1.9.9.1.1.1"></eq><apply id="S4.E6.m1.9.9.1.1.2.cmml" xref="S4.E6.m1.9.9.1.1.2"><times id="S4.E6.m1.9.9.1.1.2.1.cmml" xref="S4.E6.m1.9.9.1.1.2.1"></times><ci id="S4.E6.m1.9.9.1.1.2.2.cmml" xref="S4.E6.m1.9.9.1.1.2.2">ℛ</ci><ci id="S4.E6.m1.8.8.cmml" xref="S4.E6.m1.8.8">𝑇</ci></apply><apply id="S4.E6.m1.9.9.1.1.3.cmml" xref="S4.E6.m1.9.9.1.1.3"><plus id="S4.E6.m1.9.9.1.1.3.1.cmml" xref="S4.E6.m1.9.9.1.1.3.1"></plus><apply id="S4.E6.m1.9.9.1.1.3.2.cmml" xref="S4.E6.m1.9.9.1.1.3.2"><csymbol cd="ambiguous" id="S4.E6.m1.9.9.1.1.3.2.1.cmml" xref="S4.E6.m1.9.9.1.1.3.2">subscript</csymbol><apply id="S4.E6.m1.6.6.cmml" xref="S4.E6.m1.6.6"><ci id="S4.E6.m1.6.6.7.cmml" xref="S4.E6.m1.6.6.7">⏟</ci><apply id="S4.E6.m1.6.6.6.cmml" xref="S4.E6.m1.6.6.6"><apply id="S4.E6.m1.6.6.6.7.cmml" xref="S4.E6.m1.6.6.6.7"><csymbol cd="ambiguous" id="S4.E6.m1.6.6.6.7.1.cmml" xref="S4.E6.m1.6.6.6.7">superscript</csymbol><apply id="S4.E6.m1.6.6.6.7.2.cmml" xref="S4.E6.m1.6.6.6.7"><csymbol cd="ambiguous" id="S4.E6.m1.6.6.6.7.2.1.cmml" xref="S4.E6.m1.6.6.6.7">subscript</csymbol><sum id="S4.E6.m1.6.6.6.7.2.2.cmml" xref="S4.E6.m1.6.6.6.7.2.2"></sum><apply id="S4.E6.m1.6.6.6.7.2.3.cmml" xref="S4.E6.m1.6.6.6.7.2.3"><eq id="S4.E6.m1.6.6.6.7.2.3.1.cmml" xref="S4.E6.m1.6.6.6.7.2.3.1"></eq><ci id="S4.E6.m1.6.6.6.7.2.3.2.cmml" xref="S4.E6.m1.6.6.6.7.2.3.2">𝑖</ci><cn id="S4.E6.m1.6.6.6.7.2.3.3.cmml" type="integer" xref="S4.E6.m1.6.6.6.7.2.3.3">1</cn></apply></apply><ci id="S4.E6.m1.6.6.6.7.3.cmml" xref="S4.E6.m1.6.6.6.7.3">𝑁</ci></apply><apply id="S4.E6.m1.6.6.6.6.1.1.cmml" xref="S4.E6.m1.6.6.6.6.1"><minus id="S4.E6.m1.6.6.6.6.1.1.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.2"></minus><apply id="S4.E6.m1.6.6.6.6.1.1.3.cmml" xref="S4.E6.m1.6.6.6.6.1.1.3"><times id="S4.E6.m1.6.6.6.6.1.1.3.1.cmml" 
xref="S4.E6.m1.6.6.6.6.1.1.3.1"></times><ci id="S4.E6.m1.6.6.6.6.1.1.3.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.3.2">𝑇</ci><apply id="S4.E6.m1.6.6.6.6.1.1.3.3.cmml" xref="S4.E6.m1.6.6.6.6.1.1.3.3"><csymbol cd="ambiguous" id="S4.E6.m1.6.6.6.6.1.1.3.3.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.3.3">subscript</csymbol><ci id="S4.E6.m1.6.6.6.6.1.1.3.3.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.3.3.2">𝑔</ci><list id="S4.E6.m1.2.2.2.2.2.3.cmml" xref="S4.E6.m1.2.2.2.2.2.2"><ci id="S4.E6.m1.1.1.1.1.1.1.cmml" xref="S4.E6.m1.1.1.1.1.1.1">𝑖</ci><apply id="S4.E6.m1.2.2.2.2.2.2.1.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1"><csymbol cd="ambiguous" id="S4.E6.m1.2.2.2.2.2.2.1.1.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1">superscript</csymbol><apply id="S4.E6.m1.2.2.2.2.2.2.1.2.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1"><csymbol cd="ambiguous" id="S4.E6.m1.2.2.2.2.2.2.1.2.1.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1">subscript</csymbol><ci id="S4.E6.m1.2.2.2.2.2.2.1.2.2.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1.2.2">𝑥</ci><ci id="S4.E6.m1.2.2.2.2.2.2.1.2.3.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1.2.3">𝑖</ci></apply><times id="S4.E6.m1.2.2.2.2.2.2.1.3.cmml" xref="S4.E6.m1.2.2.2.2.2.2.1.3"></times></apply></list></apply></apply><apply id="S4.E6.m1.6.6.6.6.1.1.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1"><apply id="S4.E6.m1.6.6.6.6.1.1.1.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2"><csymbol cd="ambiguous" id="S4.E6.m1.6.6.6.6.1.1.1.2.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2">superscript</csymbol><apply id="S4.E6.m1.6.6.6.6.1.1.1.2.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2"><csymbol cd="ambiguous" id="S4.E6.m1.6.6.6.6.1.1.1.2.2.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2">subscript</csymbol><sum id="S4.E6.m1.6.6.6.6.1.1.1.2.2.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.2"></sum><apply id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3"><eq id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.1"></eq><ci id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.2">𝑡</ci><cn 
id="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.3.cmml" type="integer" xref="S4.E6.m1.6.6.6.6.1.1.1.2.2.3.3">1</cn></apply></apply><ci id="S4.E6.m1.6.6.6.6.1.1.1.2.3.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.2.3">𝑇</ci></apply><apply id="S4.E6.m1.6.6.6.6.1.1.1.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1"><times id="S4.E6.m1.6.6.6.6.1.1.1.1.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.2"></times><ci id="S4.E6.m1.6.6.6.6.1.1.1.1.3.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.3">𝔼</ci><apply id="S4.E6.m1.6.6.6.6.1.1.1.1.1.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1"><csymbol cd="latexml" id="S4.E6.m1.6.6.6.6.1.1.1.1.1.2.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.1.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1">subscript</csymbol><ci id="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.2.cmml" xref="S4.E6.m1.6.6.6.6.1.1.1.1.1.1.1.2">𝑟</ci><list id="S4.E6.m1.5.5.5.5.3.4.cmml" xref="S4.E6.m1.5.5.5.5.3.3"><ci id="S4.E6.m1.4.4.4.4.2.2.cmml" xref="S4.E6.m1.4.4.4.4.2.2">𝑖</ci><apply id="S4.E6.m1.5.5.5.5.3.3.1.cmml" xref="S4.E6.m1.5.5.5.5.3.3.1"><times id="S4.E6.m1.5.5.5.5.3.3.1.1.cmml" xref="S4.E6.m1.5.5.5.5.3.3.1.1"></times><apply id="S4.E6.m1.5.5.5.5.3.3.1.2.cmml" xref="S4.E6.m1.5.5.5.5.3.3.1.2"><csymbol cd="ambiguous" id="S4.E6.m1.5.5.5.5.3.3.1.2.1.cmml" xref="S4.E6.m1.5.5.5.5.3.3.1.2">subscript</csymbol><ci id="S4.E6.m1.5.5.5.5.3.3.1.2.2.cmml" xref="S4.E6.m1.5.5.5.5.3.3.1.2.2">𝑥</ci><ci id="S4.E6.m1.5.5.5.5.3.3.1.2.3.cmml" xref="S4.E6.m1.5.5.5.5.3.3.1.2.3">𝑖</ci></apply><ci id="S4.E6.m1.3.3.3.3.1.1.cmml" xref="S4.E6.m1.3.3.3.3.1.1">𝑡</ci></apply></list></apply></apply></apply></apply></apply></apply></apply><ci id="S4.E6.m1.9.9.1.1.3.2.2a.cmml" xref="S4.E6.m1.9.9.1.1.3.2.2"><mtext id="S4.E6.m1.9.9.1.1.3.2.2.cmml" mathsize="70%" xref="S4.E6.m1.9.9.1.1.3.2.2">Goodput regret</mtext></ci></apply><apply id="S4.E6.m1.9.9.1.1.3.3.cmml" xref="S4.E6.m1.9.9.1.1.3.3"><times 
id="S4.E6.m1.9.9.1.1.3.3.1.cmml" xref="S4.E6.m1.9.9.1.1.3.3.1"></times><ci id="S4.E6.m1.9.9.1.1.3.3.2.cmml" xref="S4.E6.m1.9.9.1.1.3.3.2">𝜆</ci><apply id="S4.E6.m1.9.9.1.1.3.3.3.cmml" xref="S4.E6.m1.9.9.1.1.3.3.3"><csymbol cd="ambiguous" id="S4.E6.m1.9.9.1.1.3.3.3.1.cmml" xref="S4.E6.m1.9.9.1.1.3.3.3">subscript</csymbol><apply id="S4.E6.m1.7.7.cmml" xref="S4.E6.m1.7.7"><ci id="S4.E6.m1.7.7.2.cmml" xref="S4.E6.m1.7.7.2">⏟</ci><apply id="S4.E6.m1.7.7.1.cmml" xref="S4.E6.m1.7.7.1"><apply id="S4.E6.m1.7.7.1.2.cmml" xref="S4.E6.m1.7.7.1.2"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.1.2.1.cmml" xref="S4.E6.m1.7.7.1.2">superscript</csymbol><apply id="S4.E6.m1.7.7.1.2.2.cmml" xref="S4.E6.m1.7.7.1.2"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.1.2.2.1.cmml" xref="S4.E6.m1.7.7.1.2">subscript</csymbol><sum id="S4.E6.m1.7.7.1.2.2.2.cmml" xref="S4.E6.m1.7.7.1.2.2.2"></sum><apply id="S4.E6.m1.7.7.1.2.2.3.cmml" xref="S4.E6.m1.7.7.1.2.2.3"><eq id="S4.E6.m1.7.7.1.2.2.3.1.cmml" xref="S4.E6.m1.7.7.1.2.2.3.1"></eq><ci id="S4.E6.m1.7.7.1.2.2.3.2.cmml" xref="S4.E6.m1.7.7.1.2.2.3.2">𝑡</ci><cn id="S4.E6.m1.7.7.1.2.2.3.3.cmml" type="integer" xref="S4.E6.m1.7.7.1.2.2.3.3">1</cn></apply></apply><ci id="S4.E6.m1.7.7.1.2.3.cmml" xref="S4.E6.m1.7.7.1.2.3">𝑇</ci></apply><apply id="S4.E6.m1.7.7.1.3.cmml" xref="S4.E6.m1.7.7.1.3"><apply id="S4.E6.m1.7.7.1.3.1.cmml" xref="S4.E6.m1.7.7.1.3.1"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.1.3.1.1.cmml" xref="S4.E6.m1.7.7.1.3.1">superscript</csymbol><apply id="S4.E6.m1.7.7.1.3.1.2.cmml" xref="S4.E6.m1.7.7.1.3.1"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.1.3.1.2.1.cmml" xref="S4.E6.m1.7.7.1.3.1">subscript</csymbol><sum id="S4.E6.m1.7.7.1.3.1.2.2.cmml" xref="S4.E6.m1.7.7.1.3.1.2.2"></sum><apply id="S4.E6.m1.7.7.1.3.1.2.3.cmml" xref="S4.E6.m1.7.7.1.3.1.2.3"><eq id="S4.E6.m1.7.7.1.3.1.2.3.1.cmml" xref="S4.E6.m1.7.7.1.3.1.2.3.1"></eq><ci id="S4.E6.m1.7.7.1.3.1.2.3.2.cmml" xref="S4.E6.m1.7.7.1.3.1.2.3.2">𝑖</ci><cn id="S4.E6.m1.7.7.1.3.1.2.3.3.cmml" type="integer" 
xref="S4.E6.m1.7.7.1.3.1.2.3.3">1</cn></apply></apply><ci id="S4.E6.m1.7.7.1.3.1.3.cmml" xref="S4.E6.m1.7.7.1.3.1.3">𝑁</ci></apply><apply id="S4.E6.m1.7.7.1.3.2.cmml" xref="S4.E6.m1.7.7.1.3.2"><times id="S4.E6.m1.7.7.1.3.2.1.cmml" xref="S4.E6.m1.7.7.1.3.2.1"></times><apply id="S4.E6.m1.7.7.1.3.2.2.cmml" xref="S4.E6.m1.7.7.1.3.2.2"><csymbol cd="ambiguous" id="S4.E6.m1.7.7.1.3.2.2.1.cmml" xref="S4.E6.m1.7.7.1.3.2.2">subscript</csymbol><ci id="S4.E6.m1.7.7.1.3.2.2.2.cmml" xref="S4.E6.m1.7.7.1.3.2.2.2">𝑧</ci><ci id="S4.E6.m1.7.7.1.3.2.2.3.cmml" xref="S4.E6.m1.7.7.1.3.2.2.3">𝑖</ci></apply><ci id="S4.E6.m1.7.7.1.1.cmml" xref="S4.E6.m1.7.7.1.1">𝑡</ci></apply></apply></apply></apply><ci id="S4.E6.m1.9.9.1.1.3.3.3.2a.cmml" xref="S4.E6.m1.9.9.1.1.3.3.3.2"><mtext id="S4.E6.m1.9.9.1.1.3.3.3.2.cmml" mathsize="70%" xref="S4.E6.m1.9.9.1.1.3.3.3.2">Switching cost</mtext></ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E6.m1.9c">\displaystyle\mathcal{R}(T)\!=\!\underbrace{\sum_{i=1}^{N}(Tg_{i,x_{i}^{*}}-% \sum_{t=1}^{T}\!\mathbb{E}[r_{i,x_{i}(t)}])}_{\text{Goodput regret}}+\lambda% \underbrace{\sum_{t=1}^{T}\sum_{i=1}^{N}z_{i}(t)}_{\text{Switching cost}},</annotation><annotation encoding="application/x-llamapun" id="S4.E6.m1.9d">caligraphic_R ( italic_T ) = under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_T italic_g start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT blackboard_E [ italic_r start_POSTSUBSCRIPT italic_i , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) end_POSTSUBSCRIPT ] ) end_ARG start_POSTSUBSCRIPT Goodput regret end_POSTSUBSCRIPT + italic_λ under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT 
start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) end_ARG start_POSTSUBSCRIPT Switching cost end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS2.p2.7">where <math alttext="\lambda" class="ltx_Math" display="inline" id="S4.SS2.p2.7.m1.1"><semantics id="S4.SS2.p2.7.m1.1a"><mi id="S4.SS2.p2.7.m1.1.1" xref="S4.SS2.p2.7.m1.1.1.cmml">λ</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.7.m1.1b"><ci id="S4.SS2.p2.7.m1.1.1.cmml" xref="S4.SS2.p2.7.m1.1.1">𝜆</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.7.m1.1c">\lambda</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.7.m1.1d">italic_λ</annotation></semantics></math> is a parameter to balance the goodput regret and switching cost. 
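</p> <p class="ltx_p">As a rough numerical illustration of Eq. (6), the sketch below evaluates the two terms for a given selection trace. It rests on assumptions that are not part of the formulation above: the expected goodput table <code>g</code> is taken as known, the expectation of the observed goodput is replaced by the corresponding entry of <code>g</code>, the best SSM of each request is taken as its per-request maximum (ignoring the batch limits), the switching indicator is assumed to be 1 whenever a request changes its SSM between consecutive slots, and no switch is counted in the first slot. The function name and argument layout are hypothetical.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
def empirical_regret(g, x, lam):
    """Evaluate the two terms of Eq. (6) for a given selection trace.

    g[i][j]: expected goodput of request i on SSM j (assumed known here).
    x[t][i]: index of the SSM chosen for request i at time slot t.
    lam:     weight of the switching cost (lambda in Eq. (6)).
    """
    T = len(x)
    N = len(g)
    # Goodput regret: best fixed SSM per request versus the chosen sequence.
    goodput_regret = sum(
        T * max(g[i]) - sum(g[i][x[t][i]] for t in range(T))
        for i in range(N)
    )
    # Switching cost: count (slot, request) pairs where the SSM changes.
    switches = sum(
        1
        for t in range(1, T)
        for i in range(N)
        if x[t][i] != x[t - 1][i]
    )
    return goodput_regret + lam * switches
```
</pre>
<p class="ltx_p">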
Then, the optimization problem can be re-formulated as:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx5"> <tbody id="S4.E7"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\min_{x}\quad" class="ltx_Math" display="inline" id="S4.E7.m1.1"><semantics id="S4.E7.m1.1a"><mrow id="S4.E7.m1.1.1.1" xref="S4.E7.m1.1.1.1.1.cmml"><munder id="S4.E7.m1.1.1.1.1" xref="S4.E7.m1.1.1.1.1.cmml"><mi id="S4.E7.m1.1.1.1.1.2" xref="S4.E7.m1.1.1.1.1.2.cmml">min</mi><mi id="S4.E7.m1.1.1.1.1.3" xref="S4.E7.m1.1.1.1.1.3.cmml">x</mi></munder><mspace id="S4.E7.m1.1.1.1.2" width="1.167em" xref="S4.E7.m1.1.1.1.1.cmml"></mspace></mrow><annotation-xml encoding="MathML-Content" id="S4.E7.m1.1b"><apply id="S4.E7.m1.1.1.1.1.cmml" xref="S4.E7.m1.1.1.1"><csymbol cd="ambiguous" id="S4.E7.m1.1.1.1.1.1.cmml" xref="S4.E7.m1.1.1.1">subscript</csymbol><min id="S4.E7.m1.1.1.1.1.2.cmml" xref="S4.E7.m1.1.1.1.1.2"></min><ci id="S4.E7.m1.1.1.1.1.3.cmml" xref="S4.E7.m1.1.1.1.1.3">𝑥</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E7.m1.1c">\displaystyle\min_{x}\quad</annotation><annotation encoding="application/x-llamapun" id="S4.E7.m1.1d">roman_min start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\mathcal{R}(T)" class="ltx_Math" display="inline" id="S4.E7.m2.1"><semantics id="S4.E7.m2.1a"><mrow id="S4.E7.m2.1.2" xref="S4.E7.m2.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.E7.m2.1.2.2" xref="S4.E7.m2.1.2.2.cmml">ℛ</mi><mo id="S4.E7.m2.1.2.1" xref="S4.E7.m2.1.2.1.cmml">⁢</mo><mrow id="S4.E7.m2.1.2.3.2" xref="S4.E7.m2.1.2.cmml"><mo id="S4.E7.m2.1.2.3.2.1" stretchy="false" xref="S4.E7.m2.1.2.cmml">(</mo><mi id="S4.E7.m2.1.1" xref="S4.E7.m2.1.1.cmml">T</mi><mo id="S4.E7.m2.1.2.3.2.2" stretchy="false" 
xref="S4.E7.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.E7.m2.1b"><apply id="S4.E7.m2.1.2.cmml" xref="S4.E7.m2.1.2"><times id="S4.E7.m2.1.2.1.cmml" xref="S4.E7.m2.1.2.1"></times><ci id="S4.E7.m2.1.2.2.cmml" xref="S4.E7.m2.1.2.2">ℛ</ci><ci id="S4.E7.m2.1.1.cmml" xref="S4.E7.m2.1.1">𝑇</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E7.m2.1c">\displaystyle\mathcal{R}(T)</annotation><annotation encoding="application/x-llamapun" id="S4.E7.m2.1d">caligraphic_R ( italic_T )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td> </tr></tbody> <tbody id="S4.E8"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><span class="ltx_text ltx_markedasmath" id="S4.E8.2.1.1.1">s.t.</span></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle|\mathcal{N}_{j}(t)|\leq B_{j},\forall j\in\mathcal{M},t\in% \mathcal{T};" class="ltx_Math" display="inline" id="S4.E8.m2.2"><semantics id="S4.E8.m2.2a"><mrow id="S4.E8.m2.2.2.1"><mrow id="S4.E8.m2.2.2.1.1.2" xref="S4.E8.m2.2.2.1.1.3.cmml"><mrow id="S4.E8.m2.2.2.1.1.1.1" xref="S4.E8.m2.2.2.1.1.1.1.cmml"><mrow id="S4.E8.m2.2.2.1.1.1.1.1.1" xref="S4.E8.m2.2.2.1.1.1.1.1.2.cmml"><mo id="S4.E8.m2.2.2.1.1.1.1.1.1.2" stretchy="false" xref="S4.E8.m2.2.2.1.1.1.1.1.2.1.cmml">|</mo><mrow id="S4.E8.m2.2.2.1.1.1.1.1.1.1" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.cmml"><msub id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.2" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.2.cmml">𝒩</mi><mi id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.3" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.3.cmml">j</mi></msub><mo 
id="S4.E8.m2.2.2.1.1.1.1.1.1.1.1" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.1.cmml">⁢</mo><mrow id="S4.E8.m2.2.2.1.1.1.1.1.1.1.3.2" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.cmml"><mo id="S4.E8.m2.2.2.1.1.1.1.1.1.1.3.2.1" stretchy="false" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.cmml">(</mo><mi id="S4.E8.m2.1.1" xref="S4.E8.m2.1.1.cmml">t</mi><mo id="S4.E8.m2.2.2.1.1.1.1.1.1.1.3.2.2" stretchy="false" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S4.E8.m2.2.2.1.1.1.1.1.1.3" stretchy="false" xref="S4.E8.m2.2.2.1.1.1.1.1.2.1.cmml">|</mo></mrow><mo id="S4.E8.m2.2.2.1.1.1.1.2" xref="S4.E8.m2.2.2.1.1.1.1.2.cmml">≤</mo><msub id="S4.E8.m2.2.2.1.1.1.1.3" xref="S4.E8.m2.2.2.1.1.1.1.3.cmml"><mi id="S4.E8.m2.2.2.1.1.1.1.3.2" xref="S4.E8.m2.2.2.1.1.1.1.3.2.cmml">B</mi><mi id="S4.E8.m2.2.2.1.1.1.1.3.3" xref="S4.E8.m2.2.2.1.1.1.1.3.3.cmml">j</mi></msub></mrow><mo id="S4.E8.m2.2.2.1.1.2.3" xref="S4.E8.m2.2.2.1.1.3a.cmml">,</mo><mrow id="S4.E8.m2.2.2.1.1.2.2.2" xref="S4.E8.m2.2.2.1.1.2.2.3.cmml"><mrow id="S4.E8.m2.2.2.1.1.2.2.1.1" xref="S4.E8.m2.2.2.1.1.2.2.1.1.cmml"><mrow id="S4.E8.m2.2.2.1.1.2.2.1.1.2" xref="S4.E8.m2.2.2.1.1.2.2.1.1.2.cmml"><mo id="S4.E8.m2.2.2.1.1.2.2.1.1.2.1" rspace="0.167em" xref="S4.E8.m2.2.2.1.1.2.2.1.1.2.1.cmml">∀</mo><mi id="S4.E8.m2.2.2.1.1.2.2.1.1.2.2" xref="S4.E8.m2.2.2.1.1.2.2.1.1.2.2.cmml">j</mi></mrow><mo id="S4.E8.m2.2.2.1.1.2.2.1.1.1" xref="S4.E8.m2.2.2.1.1.2.2.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E8.m2.2.2.1.1.2.2.1.1.3" xref="S4.E8.m2.2.2.1.1.2.2.1.1.3.cmml">ℳ</mi></mrow><mo id="S4.E8.m2.2.2.1.1.2.2.2.3" xref="S4.E8.m2.2.2.1.1.2.2.3a.cmml">,</mo><mrow id="S4.E8.m2.2.2.1.1.2.2.2.2" xref="S4.E8.m2.2.2.1.1.2.2.2.2.cmml"><mi id="S4.E8.m2.2.2.1.1.2.2.2.2.2" xref="S4.E8.m2.2.2.1.1.2.2.2.2.2.cmml">t</mi><mo id="S4.E8.m2.2.2.1.1.2.2.2.2.1" xref="S4.E8.m2.2.2.1.1.2.2.2.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E8.m2.2.2.1.1.2.2.2.2.3" xref="S4.E8.m2.2.2.1.1.2.2.2.2.3.cmml">𝒯</mi></mrow></mrow></mrow><mo 
id="S4.E8.m2.2.2.1.2">;</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E8.m2.2b"><apply id="S4.E8.m2.2.2.1.1.3.cmml" xref="S4.E8.m2.2.2.1.1.2"><csymbol cd="ambiguous" id="S4.E8.m2.2.2.1.1.3a.cmml" xref="S4.E8.m2.2.2.1.1.2.3">formulae-sequence</csymbol><apply id="S4.E8.m2.2.2.1.1.1.1.cmml" xref="S4.E8.m2.2.2.1.1.1.1"><leq id="S4.E8.m2.2.2.1.1.1.1.2.cmml" xref="S4.E8.m2.2.2.1.1.1.1.2"></leq><apply id="S4.E8.m2.2.2.1.1.1.1.1.2.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1"><abs id="S4.E8.m2.2.2.1.1.1.1.1.2.1.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.2"></abs><apply id="S4.E8.m2.2.2.1.1.1.1.1.1.1.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1"><times id="S4.E8.m2.2.2.1.1.1.1.1.1.1.1.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.1"></times><apply id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.1.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.2.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.2">𝒩</ci><ci id="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.3.cmml" xref="S4.E8.m2.2.2.1.1.1.1.1.1.1.2.3">𝑗</ci></apply><ci id="S4.E8.m2.1.1.cmml" xref="S4.E8.m2.1.1">𝑡</ci></apply></apply><apply id="S4.E8.m2.2.2.1.1.1.1.3.cmml" xref="S4.E8.m2.2.2.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.E8.m2.2.2.1.1.1.1.3.1.cmml" xref="S4.E8.m2.2.2.1.1.1.1.3">subscript</csymbol><ci id="S4.E8.m2.2.2.1.1.1.1.3.2.cmml" xref="S4.E8.m2.2.2.1.1.1.1.3.2">𝐵</ci><ci id="S4.E8.m2.2.2.1.1.1.1.3.3.cmml" xref="S4.E8.m2.2.2.1.1.1.1.3.3">𝑗</ci></apply></apply><apply id="S4.E8.m2.2.2.1.1.2.2.3.cmml" xref="S4.E8.m2.2.2.1.1.2.2.2"><csymbol cd="ambiguous" id="S4.E8.m2.2.2.1.1.2.2.3a.cmml" xref="S4.E8.m2.2.2.1.1.2.2.2.3">formulae-sequence</csymbol><apply id="S4.E8.m2.2.2.1.1.2.2.1.1.cmml" xref="S4.E8.m2.2.2.1.1.2.2.1.1"><in id="S4.E8.m2.2.2.1.1.2.2.1.1.1.cmml" xref="S4.E8.m2.2.2.1.1.2.2.1.1.1"></in><apply id="S4.E8.m2.2.2.1.1.2.2.1.1.2.cmml" xref="S4.E8.m2.2.2.1.1.2.2.1.1.2"><csymbol cd="latexml" 
id="S4.E8.m2.2.2.1.1.2.2.1.1.2.1.cmml" xref="S4.E8.m2.2.2.1.1.2.2.1.1.2.1">for-all</csymbol><ci id="S4.E8.m2.2.2.1.1.2.2.1.1.2.2.cmml" xref="S4.E8.m2.2.2.1.1.2.2.1.1.2.2">𝑗</ci></apply><ci id="S4.E8.m2.2.2.1.1.2.2.1.1.3.cmml" xref="S4.E8.m2.2.2.1.1.2.2.1.1.3">ℳ</ci></apply><apply id="S4.E8.m2.2.2.1.1.2.2.2.2.cmml" xref="S4.E8.m2.2.2.1.1.2.2.2.2"><in id="S4.E8.m2.2.2.1.1.2.2.2.2.1.cmml" xref="S4.E8.m2.2.2.1.1.2.2.2.2.1"></in><ci id="S4.E8.m2.2.2.1.1.2.2.2.2.2.cmml" xref="S4.E8.m2.2.2.1.1.2.2.2.2.2">𝑡</ci><ci id="S4.E8.m2.2.2.1.1.2.2.2.2.3.cmml" xref="S4.E8.m2.2.2.1.1.2.2.2.2.3">𝒯</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E8.m2.2c">\displaystyle|\mathcal{N}_{j}(t)|\leq B_{j},\forall j\in\mathcal{M},t\in% \mathcal{T};</annotation><annotation encoding="application/x-llamapun" id="S4.E8.m2.2d">| caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_t ) | ≤ italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , ∀ italic_j ∈ caligraphic_M , italic_t ∈ caligraphic_T ;</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(8)</span></td> </tr></tbody> <tbody id="S4.E9"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle x_{i}(t)\in\{1,2,...,M\},\forall i\in\mathcal{N},t\in\mathcal{T}." 
class="ltx_Math" display="inline" id="S4.E9.m1.6"><semantics id="S4.E9.m1.6a"><mrow id="S4.E9.m1.6.6.1"><mrow id="S4.E9.m1.6.6.1.1.2" xref="S4.E9.m1.6.6.1.1.3.cmml"><mrow id="S4.E9.m1.6.6.1.1.1.1" xref="S4.E9.m1.6.6.1.1.1.1.cmml"><mrow id="S4.E9.m1.6.6.1.1.1.1.2" xref="S4.E9.m1.6.6.1.1.1.1.2.cmml"><msub id="S4.E9.m1.6.6.1.1.1.1.2.2" xref="S4.E9.m1.6.6.1.1.1.1.2.2.cmml"><mi id="S4.E9.m1.6.6.1.1.1.1.2.2.2" xref="S4.E9.m1.6.6.1.1.1.1.2.2.2.cmml">x</mi><mi id="S4.E9.m1.6.6.1.1.1.1.2.2.3" xref="S4.E9.m1.6.6.1.1.1.1.2.2.3.cmml">i</mi></msub><mo id="S4.E9.m1.6.6.1.1.1.1.2.1" xref="S4.E9.m1.6.6.1.1.1.1.2.1.cmml">⁢</mo><mrow id="S4.E9.m1.6.6.1.1.1.1.2.3.2" xref="S4.E9.m1.6.6.1.1.1.1.2.cmml"><mo id="S4.E9.m1.6.6.1.1.1.1.2.3.2.1" stretchy="false" xref="S4.E9.m1.6.6.1.1.1.1.2.cmml">(</mo><mi id="S4.E9.m1.1.1" xref="S4.E9.m1.1.1.cmml">t</mi><mo id="S4.E9.m1.6.6.1.1.1.1.2.3.2.2" stretchy="false" xref="S4.E9.m1.6.6.1.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S4.E9.m1.6.6.1.1.1.1.1" xref="S4.E9.m1.6.6.1.1.1.1.1.cmml">∈</mo><mrow id="S4.E9.m1.6.6.1.1.1.1.3.2" xref="S4.E9.m1.6.6.1.1.1.1.3.1.cmml"><mo id="S4.E9.m1.6.6.1.1.1.1.3.2.1" stretchy="false" xref="S4.E9.m1.6.6.1.1.1.1.3.1.cmml">{</mo><mn id="S4.E9.m1.2.2" xref="S4.E9.m1.2.2.cmml">1</mn><mo id="S4.E9.m1.6.6.1.1.1.1.3.2.2" xref="S4.E9.m1.6.6.1.1.1.1.3.1.cmml">,</mo><mn id="S4.E9.m1.3.3" xref="S4.E9.m1.3.3.cmml">2</mn><mo id="S4.E9.m1.6.6.1.1.1.1.3.2.3" xref="S4.E9.m1.6.6.1.1.1.1.3.1.cmml">,</mo><mi id="S4.E9.m1.4.4" mathvariant="normal" xref="S4.E9.m1.4.4.cmml">…</mi><mo id="S4.E9.m1.6.6.1.1.1.1.3.2.4" xref="S4.E9.m1.6.6.1.1.1.1.3.1.cmml">,</mo><mi id="S4.E9.m1.5.5" xref="S4.E9.m1.5.5.cmml">M</mi><mo id="S4.E9.m1.6.6.1.1.1.1.3.2.5" stretchy="false" xref="S4.E9.m1.6.6.1.1.1.1.3.1.cmml">}</mo></mrow></mrow><mo id="S4.E9.m1.6.6.1.1.2.3" xref="S4.E9.m1.6.6.1.1.3a.cmml">,</mo><mrow id="S4.E9.m1.6.6.1.1.2.2.2" xref="S4.E9.m1.6.6.1.1.2.2.3.cmml"><mrow id="S4.E9.m1.6.6.1.1.2.2.1.1" xref="S4.E9.m1.6.6.1.1.2.2.1.1.cmml"><mrow 
id="S4.E9.m1.6.6.1.1.2.2.1.1.2" xref="S4.E9.m1.6.6.1.1.2.2.1.1.2.cmml"><mo id="S4.E9.m1.6.6.1.1.2.2.1.1.2.1" rspace="0.167em" xref="S4.E9.m1.6.6.1.1.2.2.1.1.2.1.cmml">∀</mo><mi id="S4.E9.m1.6.6.1.1.2.2.1.1.2.2" xref="S4.E9.m1.6.6.1.1.2.2.1.1.2.2.cmml">i</mi></mrow><mo id="S4.E9.m1.6.6.1.1.2.2.1.1.1" xref="S4.E9.m1.6.6.1.1.2.2.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E9.m1.6.6.1.1.2.2.1.1.3" xref="S4.E9.m1.6.6.1.1.2.2.1.1.3.cmml">𝒩</mi></mrow><mo id="S4.E9.m1.6.6.1.1.2.2.2.3" xref="S4.E9.m1.6.6.1.1.2.2.3a.cmml">,</mo><mrow id="S4.E9.m1.6.6.1.1.2.2.2.2" xref="S4.E9.m1.6.6.1.1.2.2.2.2.cmml"><mi id="S4.E9.m1.6.6.1.1.2.2.2.2.2" xref="S4.E9.m1.6.6.1.1.2.2.2.2.2.cmml">t</mi><mo id="S4.E9.m1.6.6.1.1.2.2.2.2.1" xref="S4.E9.m1.6.6.1.1.2.2.2.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E9.m1.6.6.1.1.2.2.2.2.3" xref="S4.E9.m1.6.6.1.1.2.2.2.2.3.cmml">𝒯</mi></mrow></mrow></mrow><mo id="S4.E9.m1.6.6.1.2" lspace="0em">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E9.m1.6b"><apply id="S4.E9.m1.6.6.1.1.3.cmml" xref="S4.E9.m1.6.6.1.1.2"><csymbol cd="ambiguous" id="S4.E9.m1.6.6.1.1.3a.cmml" xref="S4.E9.m1.6.6.1.1.2.3">formulae-sequence</csymbol><apply id="S4.E9.m1.6.6.1.1.1.1.cmml" xref="S4.E9.m1.6.6.1.1.1.1"><in id="S4.E9.m1.6.6.1.1.1.1.1.cmml" xref="S4.E9.m1.6.6.1.1.1.1.1"></in><apply id="S4.E9.m1.6.6.1.1.1.1.2.cmml" xref="S4.E9.m1.6.6.1.1.1.1.2"><times id="S4.E9.m1.6.6.1.1.1.1.2.1.cmml" xref="S4.E9.m1.6.6.1.1.1.1.2.1"></times><apply id="S4.E9.m1.6.6.1.1.1.1.2.2.cmml" xref="S4.E9.m1.6.6.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S4.E9.m1.6.6.1.1.1.1.2.2.1.cmml" xref="S4.E9.m1.6.6.1.1.1.1.2.2">subscript</csymbol><ci id="S4.E9.m1.6.6.1.1.1.1.2.2.2.cmml" xref="S4.E9.m1.6.6.1.1.1.1.2.2.2">𝑥</ci><ci id="S4.E9.m1.6.6.1.1.1.1.2.2.3.cmml" xref="S4.E9.m1.6.6.1.1.1.1.2.2.3">𝑖</ci></apply><ci id="S4.E9.m1.1.1.cmml" xref="S4.E9.m1.1.1">𝑡</ci></apply><set id="S4.E9.m1.6.6.1.1.1.1.3.1.cmml" xref="S4.E9.m1.6.6.1.1.1.1.3.2"><cn 
id="S4.E9.m1.2.2.cmml" type="integer" xref="S4.E9.m1.2.2">1</cn><cn id="S4.E9.m1.3.3.cmml" type="integer" xref="S4.E9.m1.3.3">2</cn><ci id="S4.E9.m1.4.4.cmml" xref="S4.E9.m1.4.4">…</ci><ci id="S4.E9.m1.5.5.cmml" xref="S4.E9.m1.5.5">𝑀</ci></set></apply><apply id="S4.E9.m1.6.6.1.1.2.2.3.cmml" xref="S4.E9.m1.6.6.1.1.2.2.2"><csymbol cd="ambiguous" id="S4.E9.m1.6.6.1.1.2.2.3a.cmml" xref="S4.E9.m1.6.6.1.1.2.2.2.3">formulae-sequence</csymbol><apply id="S4.E9.m1.6.6.1.1.2.2.1.1.cmml" xref="S4.E9.m1.6.6.1.1.2.2.1.1"><in id="S4.E9.m1.6.6.1.1.2.2.1.1.1.cmml" xref="S4.E9.m1.6.6.1.1.2.2.1.1.1"></in><apply id="S4.E9.m1.6.6.1.1.2.2.1.1.2.cmml" xref="S4.E9.m1.6.6.1.1.2.2.1.1.2"><csymbol cd="latexml" id="S4.E9.m1.6.6.1.1.2.2.1.1.2.1.cmml" xref="S4.E9.m1.6.6.1.1.2.2.1.1.2.1">for-all</csymbol><ci id="S4.E9.m1.6.6.1.1.2.2.1.1.2.2.cmml" xref="S4.E9.m1.6.6.1.1.2.2.1.1.2.2">𝑖</ci></apply><ci id="S4.E9.m1.6.6.1.1.2.2.1.1.3.cmml" xref="S4.E9.m1.6.6.1.1.2.2.1.1.3">𝒩</ci></apply><apply id="S4.E9.m1.6.6.1.1.2.2.2.2.cmml" xref="S4.E9.m1.6.6.1.1.2.2.2.2"><in id="S4.E9.m1.6.6.1.1.2.2.2.2.1.cmml" xref="S4.E9.m1.6.6.1.1.2.2.2.2.1"></in><ci id="S4.E9.m1.6.6.1.1.2.2.2.2.2.cmml" xref="S4.E9.m1.6.6.1.1.2.2.2.2.2">𝑡</ci><ci id="S4.E9.m1.6.6.1.1.2.2.2.2.3.cmml" xref="S4.E9.m1.6.6.1.1.2.2.2.2.3">𝒯</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E9.m1.6c">\displaystyle x_{i}(t)\in\{1,2,...,M\},\forall i\in\mathcal{N},t\in\mathcal{T}.</annotation><annotation encoding="application/x-llamapun" id="S4.E9.m1.6d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) ∈ { 1 , 2 , … , italic_M } , ∀ italic_i ∈ caligraphic_N , italic_t ∈ caligraphic_T .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(9)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS2.p2.10">where 
<math alttext="\mathcal{N}_{j}(t)=\{i|x_{i}(t)=j\}" class="ltx_Math" display="inline" id="S4.SS2.p2.8.m1.4"><semantics id="S4.SS2.p2.8.m1.4a"><mrow id="S4.SS2.p2.8.m1.4.4" xref="S4.SS2.p2.8.m1.4.4.cmml"><mrow id="S4.SS2.p2.8.m1.4.4.3" xref="S4.SS2.p2.8.m1.4.4.3.cmml"><msub id="S4.SS2.p2.8.m1.4.4.3.2" xref="S4.SS2.p2.8.m1.4.4.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS2.p2.8.m1.4.4.3.2.2" xref="S4.SS2.p2.8.m1.4.4.3.2.2.cmml">𝒩</mi><mi id="S4.SS2.p2.8.m1.4.4.3.2.3" xref="S4.SS2.p2.8.m1.4.4.3.2.3.cmml">j</mi></msub><mo id="S4.SS2.p2.8.m1.4.4.3.1" xref="S4.SS2.p2.8.m1.4.4.3.1.cmml">⁢</mo><mrow id="S4.SS2.p2.8.m1.4.4.3.3.2" xref="S4.SS2.p2.8.m1.4.4.3.cmml"><mo id="S4.SS2.p2.8.m1.4.4.3.3.2.1" stretchy="false" xref="S4.SS2.p2.8.m1.4.4.3.cmml">(</mo><mi id="S4.SS2.p2.8.m1.1.1" xref="S4.SS2.p2.8.m1.1.1.cmml">t</mi><mo id="S4.SS2.p2.8.m1.4.4.3.3.2.2" stretchy="false" xref="S4.SS2.p2.8.m1.4.4.3.cmml">)</mo></mrow></mrow><mo id="S4.SS2.p2.8.m1.4.4.2" xref="S4.SS2.p2.8.m1.4.4.2.cmml">=</mo><mrow id="S4.SS2.p2.8.m1.4.4.1.1" xref="S4.SS2.p2.8.m1.4.4.1.2.cmml"><mo id="S4.SS2.p2.8.m1.4.4.1.1.2" stretchy="false" xref="S4.SS2.p2.8.m1.4.4.1.2.1.cmml">{</mo><mi id="S4.SS2.p2.8.m1.3.3" xref="S4.SS2.p2.8.m1.3.3.cmml">i</mi><mo id="S4.SS2.p2.8.m1.4.4.1.1.3" lspace="0em" rspace="0em" xref="S4.SS2.p2.8.m1.4.4.1.2.1.cmml">|</mo><mrow id="S4.SS2.p2.8.m1.4.4.1.1.1" xref="S4.SS2.p2.8.m1.4.4.1.1.1.cmml"><mrow id="S4.SS2.p2.8.m1.4.4.1.1.1.2" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.cmml"><msub id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.cmml"><mi id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.2" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.2.cmml">x</mi><mi id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.3" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.3.cmml">i</mi></msub><mo id="S4.SS2.p2.8.m1.4.4.1.1.1.2.1" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.1.cmml">⁢</mo><mrow id="S4.SS2.p2.8.m1.4.4.1.1.1.2.3.2" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.cmml"><mo id="S4.SS2.p2.8.m1.4.4.1.1.1.2.3.2.1" stretchy="false" 
xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.cmml">(</mo><mi id="S4.SS2.p2.8.m1.2.2" xref="S4.SS2.p2.8.m1.2.2.cmml">t</mi><mo id="S4.SS2.p2.8.m1.4.4.1.1.1.2.3.2.2" stretchy="false" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S4.SS2.p2.8.m1.4.4.1.1.1.1" xref="S4.SS2.p2.8.m1.4.4.1.1.1.1.cmml">=</mo><mi id="S4.SS2.p2.8.m1.4.4.1.1.1.3" xref="S4.SS2.p2.8.m1.4.4.1.1.1.3.cmml">j</mi></mrow><mo id="S4.SS2.p2.8.m1.4.4.1.1.4" stretchy="false" xref="S4.SS2.p2.8.m1.4.4.1.2.1.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.8.m1.4b"><apply id="S4.SS2.p2.8.m1.4.4.cmml" xref="S4.SS2.p2.8.m1.4.4"><eq id="S4.SS2.p2.8.m1.4.4.2.cmml" xref="S4.SS2.p2.8.m1.4.4.2"></eq><apply id="S4.SS2.p2.8.m1.4.4.3.cmml" xref="S4.SS2.p2.8.m1.4.4.3"><times id="S4.SS2.p2.8.m1.4.4.3.1.cmml" xref="S4.SS2.p2.8.m1.4.4.3.1"></times><apply id="S4.SS2.p2.8.m1.4.4.3.2.cmml" xref="S4.SS2.p2.8.m1.4.4.3.2"><csymbol cd="ambiguous" id="S4.SS2.p2.8.m1.4.4.3.2.1.cmml" xref="S4.SS2.p2.8.m1.4.4.3.2">subscript</csymbol><ci id="S4.SS2.p2.8.m1.4.4.3.2.2.cmml" xref="S4.SS2.p2.8.m1.4.4.3.2.2">𝒩</ci><ci id="S4.SS2.p2.8.m1.4.4.3.2.3.cmml" xref="S4.SS2.p2.8.m1.4.4.3.2.3">𝑗</ci></apply><ci id="S4.SS2.p2.8.m1.1.1.cmml" xref="S4.SS2.p2.8.m1.1.1">𝑡</ci></apply><apply id="S4.SS2.p2.8.m1.4.4.1.2.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1"><csymbol cd="latexml" id="S4.SS2.p2.8.m1.4.4.1.2.1.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.2">conditional-set</csymbol><ci id="S4.SS2.p2.8.m1.3.3.cmml" xref="S4.SS2.p2.8.m1.3.3">𝑖</ci><apply id="S4.SS2.p2.8.m1.4.4.1.1.1.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1"><eq id="S4.SS2.p2.8.m1.4.4.1.1.1.1.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.1"></eq><apply id="S4.SS2.p2.8.m1.4.4.1.1.1.2.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2"><times id="S4.SS2.p2.8.m1.4.4.1.1.1.2.1.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.1"></times><apply id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2"><csymbol cd="ambiguous" id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.1.cmml" 
xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2">subscript</csymbol><ci id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.2.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.2">𝑥</ci><ci id="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.3.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.2.2.3">𝑖</ci></apply><ci id="S4.SS2.p2.8.m1.2.2.cmml" xref="S4.SS2.p2.8.m1.2.2">𝑡</ci></apply><ci id="S4.SS2.p2.8.m1.4.4.1.1.1.3.cmml" xref="S4.SS2.p2.8.m1.4.4.1.1.1.3">𝑗</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.8.m1.4c">\mathcal{N}_{j}(t)=\{i|x_{i}(t)=j\}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.8.m1.4d">caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_t ) = { italic_i | italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) = italic_j }</annotation></semantics></math> is the set of requests running on SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS2.p2.9.m2.1"><semantics id="S4.SS2.p2.9.m2.1a"><mi id="S4.SS2.p2.9.m2.1.1" xref="S4.SS2.p2.9.m2.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.9.m2.1b"><ci id="S4.SS2.p2.9.m2.1.1.cmml" xref="S4.SS2.p2.9.m2.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.9.m2.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.9.m2.1d">italic_j</annotation></semantics></math> at time slot <math alttext="t" class="ltx_Math" display="inline" id="S4.SS2.p2.10.m3.1"><semantics id="S4.SS2.p2.10.m3.1a"><mi id="S4.SS2.p2.10.m3.1.1" xref="S4.SS2.p2.10.m3.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p2.10.m3.1b"><ci id="S4.SS2.p2.10.m3.1.1.cmml" xref="S4.SS2.p2.10.m3.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p2.10.m3.1c">t</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p2.10.m3.1d">italic_t</annotation></semantics></math>.</p> </div> <figure class="ltx_figure" id="S4.F8"><img alt="Refer to caption" class="ltx_graphics 
ltx_centering ltx_img_landscape" height="303" id="S4.F8.1.g1" src="x8.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 8: </span>Illustration of the epoch division. In each exploration stage, we further group time slots into chunks to reduce the SSM switching frequency. In this example, we set the chunk size <math alttext="\beta" class="ltx_Math" display="inline" id="S4.F8.3.2.m1.1"><semantics id="S4.F8.3.2.m1.1b"><mi id="S4.F8.3.2.m1.1.1" xref="S4.F8.3.2.m1.1.1.cmml">β</mi><annotation-xml encoding="MathML-Content" id="S4.F8.3.2.m1.1c"><ci id="S4.F8.3.2.m1.1.1.cmml" xref="S4.F8.3.2.m1.1.1">𝛽</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.F8.3.2.m1.1d">\beta</annotation><annotation encoding="application/x-llamapun" id="S4.F8.3.2.m1.1e">italic_β</annotation></semantics></math> to 2.</figcaption> </figure> <div class="ltx_para" id="S4.SS2.p3"> <p class="ltx_p" id="S4.SS2.p3.1">The proposed algorithm for solving problem (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.E7" title="In IV-B Algorithm Design ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">7</span></a>) is presented in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#alg1" title="Algorithm 1 ‣ IV-B Algorithm Design ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Algorithm 1</span></a>. 
We make no assumption about the length of <math alttext="\mathcal{T}" class="ltx_Math" display="inline" id="S4.SS2.p3.1.m1.1"><semantics id="S4.SS2.p3.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S4.SS2.p3.1.m1.1.1" xref="S4.SS2.p3.1.m1.1.1.cmml">𝒯</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p3.1.m1.1b"><ci id="S4.SS2.p3.1.m1.1.1.cmml" xref="S4.SS2.p3.1.m1.1.1">𝒯</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p3.1.m1.1c">\mathcal{T}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p3.1.m1.1d">caligraphic_T</annotation></semantics></math>, which is divided into epochs, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S4.F8" title="Figure 8 ‣ IV-B Algorithm Design ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 8</span></a>. Each epoch includes two stages:</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p4"> <p class="ltx_p" id="S4.SS2.p4.12"><span class="ltx_text ltx_font_bold" id="S4.SS2.p4.12.1">Exploration Stage:</span> The exploration stage contains <math alttext="\alpha" class="ltx_Math" display="inline" id="S4.SS2.p4.1.m1.1"><semantics id="S4.SS2.p4.1.m1.1a"><mi id="S4.SS2.p4.1.m1.1.1" xref="S4.SS2.p4.1.m1.1.1.cmml">α</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.1.m1.1b"><ci id="S4.SS2.p4.1.m1.1.1.cmml" xref="S4.SS2.p4.1.m1.1.1">𝛼</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.1.m1.1c">\alpha</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.1.m1.1d">italic_α</annotation></semantics></math> time slots. 
In this stage, we randomly select SSMs for inference requests and observe the goodput <math alttext="r_{i,j}(t)" class="ltx_Math" display="inline" id="S4.SS2.p4.2.m2.3"><semantics id="S4.SS2.p4.2.m2.3a"><mrow id="S4.SS2.p4.2.m2.3.4" xref="S4.SS2.p4.2.m2.3.4.cmml"><msub id="S4.SS2.p4.2.m2.3.4.2" xref="S4.SS2.p4.2.m2.3.4.2.cmml"><mi id="S4.SS2.p4.2.m2.3.4.2.2" xref="S4.SS2.p4.2.m2.3.4.2.2.cmml">r</mi><mrow id="S4.SS2.p4.2.m2.2.2.2.4" xref="S4.SS2.p4.2.m2.2.2.2.3.cmml"><mi id="S4.SS2.p4.2.m2.1.1.1.1" xref="S4.SS2.p4.2.m2.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p4.2.m2.2.2.2.4.1" xref="S4.SS2.p4.2.m2.2.2.2.3.cmml">,</mo><mi id="S4.SS2.p4.2.m2.2.2.2.2" xref="S4.SS2.p4.2.m2.2.2.2.2.cmml">j</mi></mrow></msub><mo id="S4.SS2.p4.2.m2.3.4.1" xref="S4.SS2.p4.2.m2.3.4.1.cmml">⁢</mo><mrow id="S4.SS2.p4.2.m2.3.4.3.2" xref="S4.SS2.p4.2.m2.3.4.cmml"><mo id="S4.SS2.p4.2.m2.3.4.3.2.1" stretchy="false" xref="S4.SS2.p4.2.m2.3.4.cmml">(</mo><mi id="S4.SS2.p4.2.m2.3.3" xref="S4.SS2.p4.2.m2.3.3.cmml">t</mi><mo id="S4.SS2.p4.2.m2.3.4.3.2.2" stretchy="false" xref="S4.SS2.p4.2.m2.3.4.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.2.m2.3b"><apply id="S4.SS2.p4.2.m2.3.4.cmml" xref="S4.SS2.p4.2.m2.3.4"><times id="S4.SS2.p4.2.m2.3.4.1.cmml" xref="S4.SS2.p4.2.m2.3.4.1"></times><apply id="S4.SS2.p4.2.m2.3.4.2.cmml" xref="S4.SS2.p4.2.m2.3.4.2"><csymbol cd="ambiguous" id="S4.SS2.p4.2.m2.3.4.2.1.cmml" xref="S4.SS2.p4.2.m2.3.4.2">subscript</csymbol><ci id="S4.SS2.p4.2.m2.3.4.2.2.cmml" xref="S4.SS2.p4.2.m2.3.4.2.2">𝑟</ci><list id="S4.SS2.p4.2.m2.2.2.2.3.cmml" xref="S4.SS2.p4.2.m2.2.2.2.4"><ci id="S4.SS2.p4.2.m2.1.1.1.1.cmml" xref="S4.SS2.p4.2.m2.1.1.1.1">𝑖</ci><ci id="S4.SS2.p4.2.m2.2.2.2.2.cmml" xref="S4.SS2.p4.2.m2.2.2.2.2">𝑗</ci></list></apply><ci id="S4.SS2.p4.2.m2.3.3.cmml" xref="S4.SS2.p4.2.m2.3.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.2.m2.3c">r_{i,j}(t)</annotation><annotation encoding="application/x-llamapun" 
id="S4.SS2.p4.2.m2.3d">italic_r start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ( italic_t )</annotation></semantics></math>. However, re-drawing the selection at random in every time slot would cause frequent SSM switching, which is costly. We therefore design a chunk-based exploration approach. Specifically, time slots in the exploration stage are further grouped into chunks. Each request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS2.p4.3.m3.1"><semantics id="S4.SS2.p4.3.m3.1a"><mi id="S4.SS2.p4.3.m3.1.1" xref="S4.SS2.p4.3.m3.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.3.m3.1b"><ci id="S4.SS2.p4.3.m3.1.1.cmml" xref="S4.SS2.p4.3.m3.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.3.m3.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.3.m3.1d">italic_i</annotation></semantics></math> uses the same SSM for all time slots within a time chunk, reducing the SSM switching frequency. We denote the chunk size, i.e., the number of time slots in a chunk, as <math alttext="\beta" class="ltx_Math" display="inline" id="S4.SS2.p4.4.m4.1"><semantics id="S4.SS2.p4.4.m4.1a"><mi id="S4.SS2.p4.4.m4.1.1" xref="S4.SS2.p4.4.m4.1.1.cmml">β</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.4.m4.1b"><ci id="S4.SS2.p4.4.m4.1.1.cmml" xref="S4.SS2.p4.4.m4.1.1">𝛽</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.4.m4.1c">\beta</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.4.m4.1d">italic_β</annotation></semantics></math>. 
Thus, there are <math alttext="\alpha/\beta" class="ltx_Math" display="inline" id="S4.SS2.p4.5.m5.1"><semantics id="S4.SS2.p4.5.m5.1a"><mrow id="S4.SS2.p4.5.m5.1.1" xref="S4.SS2.p4.5.m5.1.1.cmml"><mi id="S4.SS2.p4.5.m5.1.1.2" xref="S4.SS2.p4.5.m5.1.1.2.cmml">α</mi><mo id="S4.SS2.p4.5.m5.1.1.1" xref="S4.SS2.p4.5.m5.1.1.1.cmml">/</mo><mi id="S4.SS2.p4.5.m5.1.1.3" xref="S4.SS2.p4.5.m5.1.1.3.cmml">β</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.5.m5.1b"><apply id="S4.SS2.p4.5.m5.1.1.cmml" xref="S4.SS2.p4.5.m5.1.1"><divide id="S4.SS2.p4.5.m5.1.1.1.cmml" xref="S4.SS2.p4.5.m5.1.1.1"></divide><ci id="S4.SS2.p4.5.m5.1.1.2.cmml" xref="S4.SS2.p4.5.m5.1.1.2">𝛼</ci><ci id="S4.SS2.p4.5.m5.1.1.3.cmml" xref="S4.SS2.p4.5.m5.1.1.3">𝛽</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.5.m5.1c">\alpha/\beta</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.5.m5.1d">italic_α / italic_β</annotation></semantics></math> time chunks in the exploration stage. 
At the end of each time slot <math alttext="t" class="ltx_Math" display="inline" id="S4.SS2.p4.6.m6.1"><semantics id="S4.SS2.p4.6.m6.1a"><mi id="S4.SS2.p4.6.m6.1.1" xref="S4.SS2.p4.6.m6.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.6.m6.1b"><ci id="S4.SS2.p4.6.m6.1.1.cmml" xref="S4.SS2.p4.6.m6.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.6.m6.1c">t</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.6.m6.1d">italic_t</annotation></semantics></math> in the exploration stage, request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS2.p4.7.m7.1"><semantics id="S4.SS2.p4.7.m7.1a"><mi id="S4.SS2.p4.7.m7.1.1" xref="S4.SS2.p4.7.m7.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.7.m7.1b"><ci id="S4.SS2.p4.7.m7.1.1.cmml" xref="S4.SS2.p4.7.m7.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.7.m7.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.7.m7.1d">italic_i</annotation></semantics></math> observes the throughput feedback <math alttext="r_{i,j}(t)" class="ltx_Math" display="inline" id="S4.SS2.p4.8.m8.3"><semantics id="S4.SS2.p4.8.m8.3a"><mrow id="S4.SS2.p4.8.m8.3.4" xref="S4.SS2.p4.8.m8.3.4.cmml"><msub id="S4.SS2.p4.8.m8.3.4.2" xref="S4.SS2.p4.8.m8.3.4.2.cmml"><mi id="S4.SS2.p4.8.m8.3.4.2.2" xref="S4.SS2.p4.8.m8.3.4.2.2.cmml">r</mi><mrow id="S4.SS2.p4.8.m8.2.2.2.4" xref="S4.SS2.p4.8.m8.2.2.2.3.cmml"><mi id="S4.SS2.p4.8.m8.1.1.1.1" xref="S4.SS2.p4.8.m8.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p4.8.m8.2.2.2.4.1" xref="S4.SS2.p4.8.m8.2.2.2.3.cmml">,</mo><mi id="S4.SS2.p4.8.m8.2.2.2.2" xref="S4.SS2.p4.8.m8.2.2.2.2.cmml">j</mi></mrow></msub><mo id="S4.SS2.p4.8.m8.3.4.1" xref="S4.SS2.p4.8.m8.3.4.1.cmml">⁢</mo><mrow id="S4.SS2.p4.8.m8.3.4.3.2" xref="S4.SS2.p4.8.m8.3.4.cmml"><mo id="S4.SS2.p4.8.m8.3.4.3.2.1" stretchy="false" xref="S4.SS2.p4.8.m8.3.4.cmml">(</mo><mi id="S4.SS2.p4.8.m8.3.3" 
xref="S4.SS2.p4.8.m8.3.3.cmml">t</mi><mo id="S4.SS2.p4.8.m8.3.4.3.2.2" stretchy="false" xref="S4.SS2.p4.8.m8.3.4.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.8.m8.3b"><apply id="S4.SS2.p4.8.m8.3.4.cmml" xref="S4.SS2.p4.8.m8.3.4"><times id="S4.SS2.p4.8.m8.3.4.1.cmml" xref="S4.SS2.p4.8.m8.3.4.1"></times><apply id="S4.SS2.p4.8.m8.3.4.2.cmml" xref="S4.SS2.p4.8.m8.3.4.2"><csymbol cd="ambiguous" id="S4.SS2.p4.8.m8.3.4.2.1.cmml" xref="S4.SS2.p4.8.m8.3.4.2">subscript</csymbol><ci id="S4.SS2.p4.8.m8.3.4.2.2.cmml" xref="S4.SS2.p4.8.m8.3.4.2.2">𝑟</ci><list id="S4.SS2.p4.8.m8.2.2.2.3.cmml" xref="S4.SS2.p4.8.m8.2.2.2.4"><ci id="S4.SS2.p4.8.m8.1.1.1.1.cmml" xref="S4.SS2.p4.8.m8.1.1.1.1">𝑖</ci><ci id="S4.SS2.p4.8.m8.2.2.2.2.cmml" xref="S4.SS2.p4.8.m8.2.2.2.2">𝑗</ci></list></apply><ci id="S4.SS2.p4.8.m8.3.3.cmml" xref="S4.SS2.p4.8.m8.3.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.8.m8.3c">r_{i,j}(t)</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.8.m8.3d">italic_r start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ( italic_t )</annotation></semantics></math>. 
After the exploration stage of epoch <math alttext="k" class="ltx_Math" display="inline" id="S4.SS2.p4.9.m9.1"><semantics id="S4.SS2.p4.9.m9.1a"><mi id="S4.SS2.p4.9.m9.1.1" xref="S4.SS2.p4.9.m9.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.9.m9.1b"><ci id="S4.SS2.p4.9.m9.1.1.cmml" xref="S4.SS2.p4.9.m9.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.9.m9.1c">k</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.9.m9.1d">italic_k</annotation></semantics></math>, we obtain the average goodput of request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS2.p4.10.m10.1"><semantics id="S4.SS2.p4.10.m10.1a"><mi id="S4.SS2.p4.10.m10.1.1" xref="S4.SS2.p4.10.m10.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.10.m10.1b"><ci id="S4.SS2.p4.10.m10.1.1.cmml" xref="S4.SS2.p4.10.m10.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.10.m10.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.10.m10.1d">italic_i</annotation></semantics></math> on SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS2.p4.11.m11.1"><semantics id="S4.SS2.p4.11.m11.1a"><mi id="S4.SS2.p4.11.m11.1.1" xref="S4.SS2.p4.11.m11.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.11.m11.1b"><ci id="S4.SS2.p4.11.m11.1.1.cmml" xref="S4.SS2.p4.11.m11.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.11.m11.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.11.m11.1d">italic_j</annotation></semantics></math>, denoted by <math alttext="\tilde{g}_{i,j}^{k}" class="ltx_Math" display="inline" id="S4.SS2.p4.12.m12.2"><semantics id="S4.SS2.p4.12.m12.2a"><msubsup id="S4.SS2.p4.12.m12.2.3" xref="S4.SS2.p4.12.m12.2.3.cmml"><mover accent="true" id="S4.SS2.p4.12.m12.2.3.2.2" xref="S4.SS2.p4.12.m12.2.3.2.2.cmml"><mi id="S4.SS2.p4.12.m12.2.3.2.2.2" 
xref="S4.SS2.p4.12.m12.2.3.2.2.2.cmml">g</mi><mo id="S4.SS2.p4.12.m12.2.3.2.2.1" xref="S4.SS2.p4.12.m12.2.3.2.2.1.cmml">~</mo></mover><mrow id="S4.SS2.p4.12.m12.2.2.2.4" xref="S4.SS2.p4.12.m12.2.2.2.3.cmml"><mi id="S4.SS2.p4.12.m12.1.1.1.1" xref="S4.SS2.p4.12.m12.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p4.12.m12.2.2.2.4.1" xref="S4.SS2.p4.12.m12.2.2.2.3.cmml">,</mo><mi id="S4.SS2.p4.12.m12.2.2.2.2" xref="S4.SS2.p4.12.m12.2.2.2.2.cmml">j</mi></mrow><mi id="S4.SS2.p4.12.m12.2.3.3" xref="S4.SS2.p4.12.m12.2.3.3.cmml">k</mi></msubsup><annotation-xml encoding="MathML-Content" id="S4.SS2.p4.12.m12.2b"><apply id="S4.SS2.p4.12.m12.2.3.cmml" xref="S4.SS2.p4.12.m12.2.3"><csymbol cd="ambiguous" id="S4.SS2.p4.12.m12.2.3.1.cmml" xref="S4.SS2.p4.12.m12.2.3">superscript</csymbol><apply id="S4.SS2.p4.12.m12.2.3.2.cmml" xref="S4.SS2.p4.12.m12.2.3"><csymbol cd="ambiguous" id="S4.SS2.p4.12.m12.2.3.2.1.cmml" xref="S4.SS2.p4.12.m12.2.3">subscript</csymbol><apply id="S4.SS2.p4.12.m12.2.3.2.2.cmml" xref="S4.SS2.p4.12.m12.2.3.2.2"><ci id="S4.SS2.p4.12.m12.2.3.2.2.1.cmml" xref="S4.SS2.p4.12.m12.2.3.2.2.1">~</ci><ci id="S4.SS2.p4.12.m12.2.3.2.2.2.cmml" xref="S4.SS2.p4.12.m12.2.3.2.2.2">𝑔</ci></apply><list id="S4.SS2.p4.12.m12.2.2.2.3.cmml" xref="S4.SS2.p4.12.m12.2.2.2.4"><ci id="S4.SS2.p4.12.m12.1.1.1.1.cmml" xref="S4.SS2.p4.12.m12.1.1.1.1">𝑖</ci><ci id="S4.SS2.p4.12.m12.2.2.2.2.cmml" xref="S4.SS2.p4.12.m12.2.2.2.2">𝑗</ci></list></apply><ci id="S4.SS2.p4.12.m12.2.3.3.cmml" xref="S4.SS2.p4.12.m12.2.3.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p4.12.m12.2c">\tilde{g}_{i,j}^{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p4.12.m12.2d">over~ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math>.</p> </div> <figure class="ltx_float ltx_float_algorithm ltx_framed ltx_framed_top" id="alg1"> <figcaption 
class="ltx_caption"><span class="ltx_tag ltx_tag_float"><span class="ltx_text ltx_font_bold" id="alg1.2.1.1">Algorithm 1</span> </span> Learning-based SSM Selection (LBSS)</figcaption> <div class="ltx_listing ltx_listing" id="alg1.3"> <div class="ltx_listingline" id="alg1.l1"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l1.1.1.1" style="font-size:80%;">1:</span></span> <span class="ltx_text ltx_font_bold" id="alg1.l1.2">for</span> epoch <math alttext="k=1,2,..." class="ltx_Math" display="inline" id="alg1.l1.m1.3"><semantics id="alg1.l1.m1.3a"><mrow id="alg1.l1.m1.3.4" xref="alg1.l1.m1.3.4.cmml"><mi id="alg1.l1.m1.3.4.2" xref="alg1.l1.m1.3.4.2.cmml">k</mi><mo id="alg1.l1.m1.3.4.1" xref="alg1.l1.m1.3.4.1.cmml">=</mo><mrow id="alg1.l1.m1.3.4.3.2" xref="alg1.l1.m1.3.4.3.1.cmml"><mn id="alg1.l1.m1.1.1" xref="alg1.l1.m1.1.1.cmml">1</mn><mo id="alg1.l1.m1.3.4.3.2.1" xref="alg1.l1.m1.3.4.3.1.cmml">,</mo><mn id="alg1.l1.m1.2.2" xref="alg1.l1.m1.2.2.cmml">2</mn><mo id="alg1.l1.m1.3.4.3.2.2" xref="alg1.l1.m1.3.4.3.1.cmml">,</mo><mi id="alg1.l1.m1.3.3" mathvariant="normal" xref="alg1.l1.m1.3.3.cmml">…</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l1.m1.3b"><apply id="alg1.l1.m1.3.4.cmml" xref="alg1.l1.m1.3.4"><eq id="alg1.l1.m1.3.4.1.cmml" xref="alg1.l1.m1.3.4.1"></eq><ci id="alg1.l1.m1.3.4.2.cmml" xref="alg1.l1.m1.3.4.2">𝑘</ci><list id="alg1.l1.m1.3.4.3.1.cmml" xref="alg1.l1.m1.3.4.3.2"><cn id="alg1.l1.m1.1.1.cmml" type="integer" xref="alg1.l1.m1.1.1">1</cn><cn id="alg1.l1.m1.2.2.cmml" type="integer" xref="alg1.l1.m1.2.2">2</cn><ci id="alg1.l1.m1.3.3.cmml" xref="alg1.l1.m1.3.3">…</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m1.3c">k=1,2,...</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m1.3d">italic_k = 1 , 2 , …</annotation></semantics></math> <span class="ltx_text ltx_font_bold" id="alg1.l1.3">do</span> </div> <div class="ltx_listingline" id="alg1.l2"> 
<span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l2.1.1.1" style="font-size:80%;">2:</span></span> <span class="ltx_text ltx_font_bold" id="alg1.l2.2">Exploration Stage:</span> Estimate <math alttext="\tilde{g}_{i,j}^{k}" class="ltx_Math" display="inline" id="alg1.l2.m1.2"><semantics id="alg1.l2.m1.2a"><msubsup id="alg1.l2.m1.2.3" xref="alg1.l2.m1.2.3.cmml"><mover accent="true" id="alg1.l2.m1.2.3.2.2" xref="alg1.l2.m1.2.3.2.2.cmml"><mi id="alg1.l2.m1.2.3.2.2.2" xref="alg1.l2.m1.2.3.2.2.2.cmml">g</mi><mo id="alg1.l2.m1.2.3.2.2.1" xref="alg1.l2.m1.2.3.2.2.1.cmml">~</mo></mover><mrow id="alg1.l2.m1.2.2.2.4" xref="alg1.l2.m1.2.2.2.3.cmml"><mi id="alg1.l2.m1.1.1.1.1" xref="alg1.l2.m1.1.1.1.1.cmml">i</mi><mo id="alg1.l2.m1.2.2.2.4.1" xref="alg1.l2.m1.2.2.2.3.cmml">,</mo><mi id="alg1.l2.m1.2.2.2.2" xref="alg1.l2.m1.2.2.2.2.cmml">j</mi></mrow><mi id="alg1.l2.m1.2.3.3" xref="alg1.l2.m1.2.3.3.cmml">k</mi></msubsup><annotation-xml encoding="MathML-Content" id="alg1.l2.m1.2b"><apply id="alg1.l2.m1.2.3.cmml" xref="alg1.l2.m1.2.3"><csymbol cd="ambiguous" id="alg1.l2.m1.2.3.1.cmml" xref="alg1.l2.m1.2.3">superscript</csymbol><apply id="alg1.l2.m1.2.3.2.cmml" xref="alg1.l2.m1.2.3"><csymbol cd="ambiguous" id="alg1.l2.m1.2.3.2.1.cmml" xref="alg1.l2.m1.2.3">subscript</csymbol><apply id="alg1.l2.m1.2.3.2.2.cmml" xref="alg1.l2.m1.2.3.2.2"><ci id="alg1.l2.m1.2.3.2.2.1.cmml" xref="alg1.l2.m1.2.3.2.2.1">~</ci><ci id="alg1.l2.m1.2.3.2.2.2.cmml" xref="alg1.l2.m1.2.3.2.2.2">𝑔</ci></apply><list id="alg1.l2.m1.2.2.2.3.cmml" xref="alg1.l2.m1.2.2.2.4"><ci id="alg1.l2.m1.1.1.1.1.cmml" xref="alg1.l2.m1.1.1.1.1">𝑖</ci><ci id="alg1.l2.m1.2.2.2.2.cmml" xref="alg1.l2.m1.2.2.2.2">𝑗</ci></list></apply><ci id="alg1.l2.m1.2.3.3.cmml" xref="alg1.l2.m1.2.3.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l2.m1.2c">\tilde{g}_{i,j}^{k}</annotation><annotation encoding="application/x-llamapun" id="alg1.l2.m1.2d">over~ start_ARG italic_g end_ARG 
start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math> using <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#alg2" title="Algorithm 2 ‣ IV-B Algorithm Design ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Algorithm 2</span></a>; </div> <div class="ltx_listingline" id="alg1.l3"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l3.1.1.1" style="font-size:80%;">3:</span></span> <span class="ltx_text ltx_font_bold" id="alg1.l3.2">Exploitation Stage:</span> Select SSM by solving (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.E10" title="In IV-B Algorithm Design ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">10</span></a>); </div> <div class="ltx_listingline" id="alg1.l4"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l4.1.1.1" style="font-size:80%;">4:</span></span> Run requests on SSMs according to the decision in exploitation for <math alttext="2^{k}" class="ltx_Math" display="inline" id="alg1.l4.m1.1"><semantics id="alg1.l4.m1.1a"><msup id="alg1.l4.m1.1.1" xref="alg1.l4.m1.1.1.cmml"><mn id="alg1.l4.m1.1.1.2" xref="alg1.l4.m1.1.1.2.cmml">2</mn><mi id="alg1.l4.m1.1.1.3" xref="alg1.l4.m1.1.1.3.cmml">k</mi></msup><annotation-xml encoding="MathML-Content" id="alg1.l4.m1.1b"><apply id="alg1.l4.m1.1.1.cmml" xref="alg1.l4.m1.1.1"><csymbol cd="ambiguous" id="alg1.l4.m1.1.1.1.cmml" xref="alg1.l4.m1.1.1">superscript</csymbol><cn id="alg1.l4.m1.1.1.2.cmml" type="integer" xref="alg1.l4.m1.1.1.2">2</cn><ci id="alg1.l4.m1.1.1.3.cmml" xref="alg1.l4.m1.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="alg1.l4.m1.1c">2^{k}</annotation><annotation encoding="application/x-llamapun" id="alg1.l4.m1.1d">2 start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math> time slots; </div> <div class="ltx_listingline" id="alg1.l5"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l5.1.1.1" style="font-size:80%;">5:</span></span> <span class="ltx_text ltx_font_bold" id="alg1.l5.2">end</span> <span class="ltx_text ltx_font_bold" id="alg1.l5.3">for</span> </div> </div> </figure> <figure class="ltx_float ltx_float_algorithm ltx_framed ltx_framed_top" id="alg2"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_float"><span class="ltx_text ltx_font_bold" id="alg2.4.1.1">Algorithm 2</span> </span> Chunk-based Exploration in Epoch <math alttext="k" class="ltx_Math" display="inline" id="alg2.2.m1.1"><semantics id="alg2.2.m1.1b"><mi id="alg2.2.m1.1.1" xref="alg2.2.m1.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="alg2.2.m1.1c"><ci id="alg2.2.m1.1.1.cmml" xref="alg2.2.m1.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.2.m1.1d">k</annotation><annotation encoding="application/x-llamapun" id="alg2.2.m1.1e">italic_k</annotation></semantics></math></figcaption> <div class="ltx_listing ltx_listing" id="alg2.5"> <div class="ltx_listingline" id="alg2.l1"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l1.1.1.1" style="font-size:80%;">1:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l1.2">Initialization:</span> <math alttext="t=0" class="ltx_Math" display="inline" id="alg2.l1.m1.1"><semantics id="alg2.l1.m1.1a"><mrow id="alg2.l1.m1.1.1" xref="alg2.l1.m1.1.1.cmml"><mi id="alg2.l1.m1.1.1.2" xref="alg2.l1.m1.1.1.2.cmml">t</mi><mo id="alg2.l1.m1.1.1.1" xref="alg2.l1.m1.1.1.1.cmml">=</mo><mn id="alg2.l1.m1.1.1.3" xref="alg2.l1.m1.1.1.3.cmml">0</mn></mrow><annotation-xml encoding="MathML-Content" id="alg2.l1.m1.1b"><apply id="alg2.l1.m1.1.1.cmml" 
xref="alg2.l1.m1.1.1"><eq id="alg2.l1.m1.1.1.1.cmml" xref="alg2.l1.m1.1.1.1"></eq><ci id="alg2.l1.m1.1.1.2.cmml" xref="alg2.l1.m1.1.1.2">𝑡</ci><cn id="alg2.l1.m1.1.1.3.cmml" type="integer" xref="alg2.l1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l1.m1.1c">t=0</annotation><annotation encoding="application/x-llamapun" id="alg2.l1.m1.1d">italic_t = 0</annotation></semantics></math>; </div> <div class="ltx_listingline" id="alg2.l2"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l2.1.1.1" style="font-size:80%;">2:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l2.2">while</span> <math alttext="t\leq\alpha" class="ltx_Math" display="inline" id="alg2.l2.m1.1"><semantics id="alg2.l2.m1.1a"><mrow id="alg2.l2.m1.1.1" xref="alg2.l2.m1.1.1.cmml"><mi id="alg2.l2.m1.1.1.2" xref="alg2.l2.m1.1.1.2.cmml">t</mi><mo id="alg2.l2.m1.1.1.1" xref="alg2.l2.m1.1.1.1.cmml">≤</mo><mi id="alg2.l2.m1.1.1.3" xref="alg2.l2.m1.1.1.3.cmml">α</mi></mrow><annotation-xml encoding="MathML-Content" id="alg2.l2.m1.1b"><apply id="alg2.l2.m1.1.1.cmml" xref="alg2.l2.m1.1.1"><leq id="alg2.l2.m1.1.1.1.cmml" xref="alg2.l2.m1.1.1.1"></leq><ci id="alg2.l2.m1.1.1.2.cmml" xref="alg2.l2.m1.1.1.2">𝑡</ci><ci id="alg2.l2.m1.1.1.3.cmml" xref="alg2.l2.m1.1.1.3">𝛼</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l2.m1.1c">t\leq\alpha</annotation><annotation encoding="application/x-llamapun" id="alg2.l2.m1.1d">italic_t ≤ italic_α</annotation></semantics></math> <span class="ltx_text ltx_font_bold" id="alg2.l2.3">do</span> </div> <div class="ltx_listingline" id="alg2.l3"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l3.1.1.1" style="font-size:80%;">3:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l3.2">for</span> <math alttext="i=1,2,...,N" class="ltx_Math" display="inline" id="alg2.l3.m1.4"><semantics id="alg2.l3.m1.4a"><mrow id="alg2.l3.m1.4.5" 
xref="alg2.l3.m1.4.5.cmml"><mi id="alg2.l3.m1.4.5.2" xref="alg2.l3.m1.4.5.2.cmml">i</mi><mo id="alg2.l3.m1.4.5.1" xref="alg2.l3.m1.4.5.1.cmml">=</mo><mrow id="alg2.l3.m1.4.5.3.2" xref="alg2.l3.m1.4.5.3.1.cmml"><mn id="alg2.l3.m1.1.1" xref="alg2.l3.m1.1.1.cmml">1</mn><mo id="alg2.l3.m1.4.5.3.2.1" xref="alg2.l3.m1.4.5.3.1.cmml">,</mo><mn id="alg2.l3.m1.2.2" xref="alg2.l3.m1.2.2.cmml">2</mn><mo id="alg2.l3.m1.4.5.3.2.2" xref="alg2.l3.m1.4.5.3.1.cmml">,</mo><mi id="alg2.l3.m1.3.3" mathvariant="normal" xref="alg2.l3.m1.3.3.cmml">…</mi><mo id="alg2.l3.m1.4.5.3.2.3" xref="alg2.l3.m1.4.5.3.1.cmml">,</mo><mi id="alg2.l3.m1.4.4" xref="alg2.l3.m1.4.4.cmml">N</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg2.l3.m1.4b"><apply id="alg2.l3.m1.4.5.cmml" xref="alg2.l3.m1.4.5"><eq id="alg2.l3.m1.4.5.1.cmml" xref="alg2.l3.m1.4.5.1"></eq><ci id="alg2.l3.m1.4.5.2.cmml" xref="alg2.l3.m1.4.5.2">𝑖</ci><list id="alg2.l3.m1.4.5.3.1.cmml" xref="alg2.l3.m1.4.5.3.2"><cn id="alg2.l3.m1.1.1.cmml" type="integer" xref="alg2.l3.m1.1.1">1</cn><cn id="alg2.l3.m1.2.2.cmml" type="integer" xref="alg2.l3.m1.2.2">2</cn><ci id="alg2.l3.m1.3.3.cmml" xref="alg2.l3.m1.3.3">…</ci><ci id="alg2.l3.m1.4.4.cmml" xref="alg2.l3.m1.4.4">𝑁</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l3.m1.4c">i=1,2,...,N</annotation><annotation encoding="application/x-llamapun" id="alg2.l3.m1.4d">italic_i = 1 , 2 , … , italic_N</annotation></semantics></math> <span class="ltx_text ltx_font_bold" id="alg2.l3.3">do</span> </div> <div class="ltx_listingline" id="alg2.l4"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l4.1.1.1" style="font-size:80%;">4:</span></span> Randomly select an SSM <math alttext="j" class="ltx_Math" display="inline" id="alg2.l4.m1.1"><semantics id="alg2.l4.m1.1a"><mi id="alg2.l4.m1.1.1" xref="alg2.l4.m1.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="alg2.l4.m1.1b"><ci id="alg2.l4.m1.1.1.cmml" 
xref="alg2.l4.m1.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l4.m1.1c">j</annotation><annotation encoding="application/x-llamapun" id="alg2.l4.m1.1d">italic_j</annotation></semantics></math>; </div> <div class="ltx_listingline" id="alg2.l5"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l5.1.1.1" style="font-size:80%;">5:</span></span> <math alttext="x_{i}(t)=j" class="ltx_Math" display="inline" id="alg2.l5.m1.1"><semantics id="alg2.l5.m1.1a"><mrow id="alg2.l5.m1.1.2" xref="alg2.l5.m1.1.2.cmml"><mrow id="alg2.l5.m1.1.2.2" xref="alg2.l5.m1.1.2.2.cmml"><msub id="alg2.l5.m1.1.2.2.2" xref="alg2.l5.m1.1.2.2.2.cmml"><mi id="alg2.l5.m1.1.2.2.2.2" xref="alg2.l5.m1.1.2.2.2.2.cmml">x</mi><mi id="alg2.l5.m1.1.2.2.2.3" xref="alg2.l5.m1.1.2.2.2.3.cmml">i</mi></msub><mo id="alg2.l5.m1.1.2.2.1" xref="alg2.l5.m1.1.2.2.1.cmml">⁢</mo><mrow id="alg2.l5.m1.1.2.2.3.2" xref="alg2.l5.m1.1.2.2.cmml"><mo id="alg2.l5.m1.1.2.2.3.2.1" stretchy="false" xref="alg2.l5.m1.1.2.2.cmml">(</mo><mi id="alg2.l5.m1.1.1" xref="alg2.l5.m1.1.1.cmml">t</mi><mo id="alg2.l5.m1.1.2.2.3.2.2" stretchy="false" xref="alg2.l5.m1.1.2.2.cmml">)</mo></mrow></mrow><mo id="alg2.l5.m1.1.2.1" xref="alg2.l5.m1.1.2.1.cmml">=</mo><mi id="alg2.l5.m1.1.2.3" xref="alg2.l5.m1.1.2.3.cmml">j</mi></mrow><annotation-xml encoding="MathML-Content" id="alg2.l5.m1.1b"><apply id="alg2.l5.m1.1.2.cmml" xref="alg2.l5.m1.1.2"><eq id="alg2.l5.m1.1.2.1.cmml" xref="alg2.l5.m1.1.2.1"></eq><apply id="alg2.l5.m1.1.2.2.cmml" xref="alg2.l5.m1.1.2.2"><times id="alg2.l5.m1.1.2.2.1.cmml" xref="alg2.l5.m1.1.2.2.1"></times><apply id="alg2.l5.m1.1.2.2.2.cmml" xref="alg2.l5.m1.1.2.2.2"><csymbol cd="ambiguous" id="alg2.l5.m1.1.2.2.2.1.cmml" xref="alg2.l5.m1.1.2.2.2">subscript</csymbol><ci id="alg2.l5.m1.1.2.2.2.2.cmml" xref="alg2.l5.m1.1.2.2.2.2">𝑥</ci><ci id="alg2.l5.m1.1.2.2.2.3.cmml" xref="alg2.l5.m1.1.2.2.2.3">𝑖</ci></apply><ci id="alg2.l5.m1.1.1.cmml" xref="alg2.l5.m1.1.1">𝑡</ci></apply><ci 
id="alg2.l5.m1.1.2.3.cmml" xref="alg2.l5.m1.1.2.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l5.m1.1c">x_{i}(t)=j</annotation><annotation encoding="application/x-llamapun" id="alg2.l5.m1.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t ) = italic_j</annotation></semantics></math>; </div> <div class="ltx_listingline" id="alg2.l6"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l6.1.1.1" style="font-size:80%;">6:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l6.2">end</span> <span class="ltx_text ltx_font_bold" id="alg2.l6.3">for</span> </div> <div class="ltx_listingline" id="alg2.l7"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l7.1.1.1" style="font-size:80%;">7:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l7.2">for</span> <math alttext="j=1,2,...,M" class="ltx_Math" display="inline" id="alg2.l7.m1.4"><semantics id="alg2.l7.m1.4a"><mrow id="alg2.l7.m1.4.5" xref="alg2.l7.m1.4.5.cmml"><mi id="alg2.l7.m1.4.5.2" xref="alg2.l7.m1.4.5.2.cmml">j</mi><mo id="alg2.l7.m1.4.5.1" xref="alg2.l7.m1.4.5.1.cmml">=</mo><mrow id="alg2.l7.m1.4.5.3.2" xref="alg2.l7.m1.4.5.3.1.cmml"><mn id="alg2.l7.m1.1.1" xref="alg2.l7.m1.1.1.cmml">1</mn><mo id="alg2.l7.m1.4.5.3.2.1" xref="alg2.l7.m1.4.5.3.1.cmml">,</mo><mn id="alg2.l7.m1.2.2" xref="alg2.l7.m1.2.2.cmml">2</mn><mo id="alg2.l7.m1.4.5.3.2.2" xref="alg2.l7.m1.4.5.3.1.cmml">,</mo><mi id="alg2.l7.m1.3.3" mathvariant="normal" xref="alg2.l7.m1.3.3.cmml">…</mi><mo id="alg2.l7.m1.4.5.3.2.3" xref="alg2.l7.m1.4.5.3.1.cmml">,</mo><mi id="alg2.l7.m1.4.4" xref="alg2.l7.m1.4.4.cmml">M</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg2.l7.m1.4b"><apply id="alg2.l7.m1.4.5.cmml" xref="alg2.l7.m1.4.5"><eq id="alg2.l7.m1.4.5.1.cmml" xref="alg2.l7.m1.4.5.1"></eq><ci id="alg2.l7.m1.4.5.2.cmml" xref="alg2.l7.m1.4.5.2">𝑗</ci><list id="alg2.l7.m1.4.5.3.1.cmml" xref="alg2.l7.m1.4.5.3.2"><cn 
id="alg2.l7.m1.1.1.cmml" type="integer" xref="alg2.l7.m1.1.1">1</cn><cn id="alg2.l7.m1.2.2.cmml" type="integer" xref="alg2.l7.m1.2.2">2</cn><ci id="alg2.l7.m1.3.3.cmml" xref="alg2.l7.m1.3.3">…</ci><ci id="alg2.l7.m1.4.4.cmml" xref="alg2.l7.m1.4.4">𝑀</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l7.m1.4c">j=1,2,...,M</annotation><annotation encoding="application/x-llamapun" id="alg2.l7.m1.4d">italic_j = 1 , 2 , … , italic_M</annotation></semantics></math> <span class="ltx_text ltx_font_bold" id="alg2.l7.3">do</span> </div> <div class="ltx_listingline" id="alg2.l8"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l8.1.1.1" style="font-size:80%;">8:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l8.2">if</span> <math alttext="|\mathcal{N}_{j}|>B_{j}" class="ltx_Math" display="inline" id="alg2.l8.m1.1"><semantics id="alg2.l8.m1.1a"><mrow id="alg2.l8.m1.1.1" xref="alg2.l8.m1.1.1.cmml"><mrow id="alg2.l8.m1.1.1.1.1" xref="alg2.l8.m1.1.1.1.2.cmml"><mo id="alg2.l8.m1.1.1.1.1.2" stretchy="false" xref="alg2.l8.m1.1.1.1.2.1.cmml">|</mo><msub id="alg2.l8.m1.1.1.1.1.1" xref="alg2.l8.m1.1.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="alg2.l8.m1.1.1.1.1.1.2" xref="alg2.l8.m1.1.1.1.1.1.2.cmml">𝒩</mi><mi id="alg2.l8.m1.1.1.1.1.1.3" xref="alg2.l8.m1.1.1.1.1.1.3.cmml">j</mi></msub><mo id="alg2.l8.m1.1.1.1.1.3" stretchy="false" xref="alg2.l8.m1.1.1.1.2.1.cmml">|</mo></mrow><mo id="alg2.l8.m1.1.1.2" xref="alg2.l8.m1.1.1.2.cmml">></mo><msub id="alg2.l8.m1.1.1.3" xref="alg2.l8.m1.1.1.3.cmml"><mi id="alg2.l8.m1.1.1.3.2" xref="alg2.l8.m1.1.1.3.2.cmml">B</mi><mi id="alg2.l8.m1.1.1.3.3" xref="alg2.l8.m1.1.1.3.3.cmml">j</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="alg2.l8.m1.1b"><apply id="alg2.l8.m1.1.1.cmml" xref="alg2.l8.m1.1.1"><gt id="alg2.l8.m1.1.1.2.cmml" xref="alg2.l8.m1.1.1.2"></gt><apply id="alg2.l8.m1.1.1.1.2.cmml" xref="alg2.l8.m1.1.1.1.1"><abs id="alg2.l8.m1.1.1.1.2.1.cmml" 
xref="alg2.l8.m1.1.1.1.1.2"></abs><apply id="alg2.l8.m1.1.1.1.1.1.cmml" xref="alg2.l8.m1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg2.l8.m1.1.1.1.1.1.1.cmml" xref="alg2.l8.m1.1.1.1.1.1">subscript</csymbol><ci id="alg2.l8.m1.1.1.1.1.1.2.cmml" xref="alg2.l8.m1.1.1.1.1.1.2">𝒩</ci><ci id="alg2.l8.m1.1.1.1.1.1.3.cmml" xref="alg2.l8.m1.1.1.1.1.1.3">𝑗</ci></apply></apply><apply id="alg2.l8.m1.1.1.3.cmml" xref="alg2.l8.m1.1.1.3"><csymbol cd="ambiguous" id="alg2.l8.m1.1.1.3.1.cmml" xref="alg2.l8.m1.1.1.3">subscript</csymbol><ci id="alg2.l8.m1.1.1.3.2.cmml" xref="alg2.l8.m1.1.1.3.2">𝐵</ci><ci id="alg2.l8.m1.1.1.3.3.cmml" xref="alg2.l8.m1.1.1.3.3">𝑗</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l8.m1.1c">|\mathcal{N}_{j}|>B_{j}</annotation><annotation encoding="application/x-llamapun" id="alg2.l8.m1.1d">| caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | > italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math> <span class="ltx_text ltx_font_bold" id="alg2.l8.3">then</span> </div> <div class="ltx_listingline" id="alg2.l9"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l9.1.1.1" style="font-size:80%;">9:</span></span> Randomly choose <math alttext="B_{j}" class="ltx_Math" display="inline" id="alg2.l9.m1.1"><semantics id="alg2.l9.m1.1a"><msub id="alg2.l9.m1.1.1" xref="alg2.l9.m1.1.1.cmml"><mi id="alg2.l9.m1.1.1.2" xref="alg2.l9.m1.1.1.2.cmml">B</mi><mi id="alg2.l9.m1.1.1.3" xref="alg2.l9.m1.1.1.3.cmml">j</mi></msub><annotation-xml encoding="MathML-Content" id="alg2.l9.m1.1b"><apply id="alg2.l9.m1.1.1.cmml" xref="alg2.l9.m1.1.1"><csymbol cd="ambiguous" id="alg2.l9.m1.1.1.1.cmml" xref="alg2.l9.m1.1.1">subscript</csymbol><ci id="alg2.l9.m1.1.1.2.cmml" xref="alg2.l9.m1.1.1.2">𝐵</ci><ci id="alg2.l9.m1.1.1.3.cmml" xref="alg2.l9.m1.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l9.m1.1c">B_{j}</annotation><annotation 
encoding="application/x-llamapun" id="alg2.l9.m1.1d">italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math> requests from <math alttext="\mathcal{N}_{j}" class="ltx_Math" display="inline" id="alg2.l9.m2.1"><semantics id="alg2.l9.m2.1a"><msub id="alg2.l9.m2.1.1" xref="alg2.l9.m2.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="alg2.l9.m2.1.1.2" xref="alg2.l9.m2.1.1.2.cmml">𝒩</mi><mi id="alg2.l9.m2.1.1.3" xref="alg2.l9.m2.1.1.3.cmml">j</mi></msub><annotation-xml encoding="MathML-Content" id="alg2.l9.m2.1b"><apply id="alg2.l9.m2.1.1.cmml" xref="alg2.l9.m2.1.1"><csymbol cd="ambiguous" id="alg2.l9.m2.1.1.1.cmml" xref="alg2.l9.m2.1.1">subscript</csymbol><ci id="alg2.l9.m2.1.1.2.cmml" xref="alg2.l9.m2.1.1.2">𝒩</ci><ci id="alg2.l9.m2.1.1.3.cmml" xref="alg2.l9.m2.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l9.m2.1c">\mathcal{N}_{j}</annotation><annotation encoding="application/x-llamapun" id="alg2.l9.m2.1d">caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math> and drop others; </div> <div class="ltx_listingline" id="alg2.l10"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l10.1.1.1" style="font-size:80%;">10:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l10.2">end</span> <span class="ltx_text ltx_font_bold" id="alg2.l10.3">if</span> </div> <div class="ltx_listingline" id="alg2.l11"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l11.1.1.1" style="font-size:80%;">11:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l11.2">end</span> <span class="ltx_text ltx_font_bold" id="alg2.l11.3">for</span> </div> <div class="ltx_listingline" id="alg2.l12"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l12.1.1.1" style="font-size:80%;">12:</span></span> <em class="ltx_emph ltx_font_italic" id="alg2.l12.2">/*Chunk-based exploration*/</em> </div> <div 
class="ltx_listingline" id="alg2.l13"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l13.1.1.1" style="font-size:80%;">13:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l13.2">for</span> <math alttext="p=1,2,...,\alpha/\beta" class="ltx_Math" display="inline" id="alg2.l13.m1.4"><semantics id="alg2.l13.m1.4a"><mrow id="alg2.l13.m1.4.4" xref="alg2.l13.m1.4.4.cmml"><mi id="alg2.l13.m1.4.4.3" xref="alg2.l13.m1.4.4.3.cmml">p</mi><mo id="alg2.l13.m1.4.4.2" xref="alg2.l13.m1.4.4.2.cmml">=</mo><mrow id="alg2.l13.m1.4.4.1.1" xref="alg2.l13.m1.4.4.1.2.cmml"><mn id="alg2.l13.m1.1.1" xref="alg2.l13.m1.1.1.cmml">1</mn><mo id="alg2.l13.m1.4.4.1.1.2" xref="alg2.l13.m1.4.4.1.2.cmml">,</mo><mn id="alg2.l13.m1.2.2" xref="alg2.l13.m1.2.2.cmml">2</mn><mo id="alg2.l13.m1.4.4.1.1.3" xref="alg2.l13.m1.4.4.1.2.cmml">,</mo><mi id="alg2.l13.m1.3.3" mathvariant="normal" xref="alg2.l13.m1.3.3.cmml">…</mi><mo id="alg2.l13.m1.4.4.1.1.4" xref="alg2.l13.m1.4.4.1.2.cmml">,</mo><mrow id="alg2.l13.m1.4.4.1.1.1" xref="alg2.l13.m1.4.4.1.1.1.cmml"><mi id="alg2.l13.m1.4.4.1.1.1.2" xref="alg2.l13.m1.4.4.1.1.1.2.cmml">α</mi><mo id="alg2.l13.m1.4.4.1.1.1.1" xref="alg2.l13.m1.4.4.1.1.1.1.cmml">/</mo><mi id="alg2.l13.m1.4.4.1.1.1.3" xref="alg2.l13.m1.4.4.1.1.1.3.cmml">β</mi></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg2.l13.m1.4b"><apply id="alg2.l13.m1.4.4.cmml" xref="alg2.l13.m1.4.4"><eq id="alg2.l13.m1.4.4.2.cmml" xref="alg2.l13.m1.4.4.2"></eq><ci id="alg2.l13.m1.4.4.3.cmml" xref="alg2.l13.m1.4.4.3">𝑝</ci><list id="alg2.l13.m1.4.4.1.2.cmml" xref="alg2.l13.m1.4.4.1.1"><cn id="alg2.l13.m1.1.1.cmml" type="integer" xref="alg2.l13.m1.1.1">1</cn><cn id="alg2.l13.m1.2.2.cmml" type="integer" xref="alg2.l13.m1.2.2">2</cn><ci id="alg2.l13.m1.3.3.cmml" xref="alg2.l13.m1.3.3">…</ci><apply id="alg2.l13.m1.4.4.1.1.1.cmml" xref="alg2.l13.m1.4.4.1.1.1"><divide id="alg2.l13.m1.4.4.1.1.1.1.cmml" xref="alg2.l13.m1.4.4.1.1.1.1"></divide><ci 
id="alg2.l13.m1.4.4.1.1.1.2.cmml" xref="alg2.l13.m1.4.4.1.1.1.2">𝛼</ci><ci id="alg2.l13.m1.4.4.1.1.1.3.cmml" xref="alg2.l13.m1.4.4.1.1.1.3">𝛽</ci></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l13.m1.4c">p=1,2,...,\alpha/\beta</annotation><annotation encoding="application/x-llamapun" id="alg2.l13.m1.4d">italic_p = 1 , 2 , … , italic_α / italic_β</annotation></semantics></math> <span class="ltx_text ltx_font_bold" id="alg2.l13.3">do</span> </div> <div class="ltx_listingline" id="alg2.l14"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l14.1.1.1" style="font-size:80%;">14:</span></span> Run request <math alttext="i" class="ltx_Math" display="inline" id="alg2.l14.m1.1"><semantics id="alg2.l14.m1.1a"><mi id="alg2.l14.m1.1.1" xref="alg2.l14.m1.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="alg2.l14.m1.1b"><ci id="alg2.l14.m1.1.1.cmml" xref="alg2.l14.m1.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l14.m1.1c">i</annotation><annotation encoding="application/x-llamapun" id="alg2.l14.m1.1d">italic_i</annotation></semantics></math> on SSM <math alttext="x_{i}(t)" class="ltx_Math" display="inline" id="alg2.l14.m2.1"><semantics id="alg2.l14.m2.1a"><mrow id="alg2.l14.m2.1.2" xref="alg2.l14.m2.1.2.cmml"><msub id="alg2.l14.m2.1.2.2" xref="alg2.l14.m2.1.2.2.cmml"><mi id="alg2.l14.m2.1.2.2.2" xref="alg2.l14.m2.1.2.2.2.cmml">x</mi><mi id="alg2.l14.m2.1.2.2.3" xref="alg2.l14.m2.1.2.2.3.cmml">i</mi></msub><mo id="alg2.l14.m2.1.2.1" xref="alg2.l14.m2.1.2.1.cmml">⁢</mo><mrow id="alg2.l14.m2.1.2.3.2" xref="alg2.l14.m2.1.2.cmml"><mo id="alg2.l14.m2.1.2.3.2.1" stretchy="false" xref="alg2.l14.m2.1.2.cmml">(</mo><mi id="alg2.l14.m2.1.1" xref="alg2.l14.m2.1.1.cmml">t</mi><mo id="alg2.l14.m2.1.2.3.2.2" stretchy="false" xref="alg2.l14.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg2.l14.m2.1b"><apply id="alg2.l14.m2.1.2.cmml" 
xref="alg2.l14.m2.1.2"><times id="alg2.l14.m2.1.2.1.cmml" xref="alg2.l14.m2.1.2.1"></times><apply id="alg2.l14.m2.1.2.2.cmml" xref="alg2.l14.m2.1.2.2"><csymbol cd="ambiguous" id="alg2.l14.m2.1.2.2.1.cmml" xref="alg2.l14.m2.1.2.2">subscript</csymbol><ci id="alg2.l14.m2.1.2.2.2.cmml" xref="alg2.l14.m2.1.2.2.2">𝑥</ci><ci id="alg2.l14.m2.1.2.2.3.cmml" xref="alg2.l14.m2.1.2.2.3">𝑖</ci></apply><ci id="alg2.l14.m2.1.1.cmml" xref="alg2.l14.m2.1.1">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l14.m2.1c">x_{i}(t)</annotation><annotation encoding="application/x-llamapun" id="alg2.l14.m2.1d">italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_t )</annotation></semantics></math> for <math alttext="\beta" class="ltx_Math" display="inline" id="alg2.l14.m3.1"><semantics id="alg2.l14.m3.1a"><mi id="alg2.l14.m3.1.1" xref="alg2.l14.m3.1.1.cmml">β</mi><annotation-xml encoding="MathML-Content" id="alg2.l14.m3.1b"><ci id="alg2.l14.m3.1.1.cmml" xref="alg2.l14.m3.1.1">𝛽</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l14.m3.1c">\beta</annotation><annotation encoding="application/x-llamapun" id="alg2.l14.m3.1d">italic_β</annotation></semantics></math> time slots; </div> <div class="ltx_listingline" id="alg2.l15"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l15.1.1.1" style="font-size:80%;">15:</span></span> Observe throughput at each time slot; </div> <div class="ltx_listingline" id="alg2.l16"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l16.1.1.1" style="font-size:80%;">16:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l16.2">end</span> <span class="ltx_text ltx_font_bold" id="alg2.l16.3">for</span> </div> <div class="ltx_listingline" id="alg2.l17"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l17.1.1.1" style="font-size:80%;">17:</span></span> <span class="ltx_text ltx_font_bold" id="alg2.l17.2">end</span> <span 
class="ltx_text ltx_font_bold" id="alg2.l17.3">while</span> </div> <div class="ltx_listingline" id="alg2.l18"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l18.1.1.1" style="font-size:80%;">18:</span></span> Estimate the goodput <math alttext="\tilde{g}_{i,j}^{k}" class="ltx_Math" display="inline" id="alg2.l18.m1.2"><semantics id="alg2.l18.m1.2a"><msubsup id="alg2.l18.m1.2.3" xref="alg2.l18.m1.2.3.cmml"><mover accent="true" id="alg2.l18.m1.2.3.2.2" xref="alg2.l18.m1.2.3.2.2.cmml"><mi id="alg2.l18.m1.2.3.2.2.2" xref="alg2.l18.m1.2.3.2.2.2.cmml">g</mi><mo id="alg2.l18.m1.2.3.2.2.1" xref="alg2.l18.m1.2.3.2.2.1.cmml">~</mo></mover><mrow id="alg2.l18.m1.2.2.2.4" xref="alg2.l18.m1.2.2.2.3.cmml"><mi id="alg2.l18.m1.1.1.1.1" xref="alg2.l18.m1.1.1.1.1.cmml">i</mi><mo id="alg2.l18.m1.2.2.2.4.1" xref="alg2.l18.m1.2.2.2.3.cmml">,</mo><mi id="alg2.l18.m1.2.2.2.2" xref="alg2.l18.m1.2.2.2.2.cmml">j</mi></mrow><mi id="alg2.l18.m1.2.3.3" xref="alg2.l18.m1.2.3.3.cmml">k</mi></msubsup><annotation-xml encoding="MathML-Content" id="alg2.l18.m1.2b"><apply id="alg2.l18.m1.2.3.cmml" xref="alg2.l18.m1.2.3"><csymbol cd="ambiguous" id="alg2.l18.m1.2.3.1.cmml" xref="alg2.l18.m1.2.3">superscript</csymbol><apply id="alg2.l18.m1.2.3.2.cmml" xref="alg2.l18.m1.2.3"><csymbol cd="ambiguous" id="alg2.l18.m1.2.3.2.1.cmml" xref="alg2.l18.m1.2.3">subscript</csymbol><apply id="alg2.l18.m1.2.3.2.2.cmml" xref="alg2.l18.m1.2.3.2.2"><ci id="alg2.l18.m1.2.3.2.2.1.cmml" xref="alg2.l18.m1.2.3.2.2.1">~</ci><ci id="alg2.l18.m1.2.3.2.2.2.cmml" xref="alg2.l18.m1.2.3.2.2.2">𝑔</ci></apply><list id="alg2.l18.m1.2.2.2.3.cmml" xref="alg2.l18.m1.2.2.2.4"><ci id="alg2.l18.m1.1.1.1.1.cmml" xref="alg2.l18.m1.1.1.1.1">𝑖</ci><ci id="alg2.l18.m1.2.2.2.2.cmml" xref="alg2.l18.m1.2.2.2.2">𝑗</ci></list></apply><ci id="alg2.l18.m1.2.3.3.cmml" xref="alg2.l18.m1.2.3.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="alg2.l18.m1.2c">\tilde{g}_{i,j}^{k}</annotation><annotation 
encoding="application/x-llamapun" id="alg2.l18.m1.2d">over~ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math> by averaging cumulative throughput; </div> </div> </figure> <div class="ltx_para ltx_noindent" id="S4.SS2.p5"> <p class="ltx_p" id="S4.SS2.p5.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p5.1.1">Exploitation Stage:</span> We decide the best SSM for each inference request based on the estimated goodput <math alttext="\tilde{g}_{i,j}^{k}" class="ltx_Math" display="inline" id="S4.SS2.p5.1.m1.2"><semantics id="S4.SS2.p5.1.m1.2a"><msubsup id="S4.SS2.p5.1.m1.2.3" xref="S4.SS2.p5.1.m1.2.3.cmml"><mover accent="true" id="S4.SS2.p5.1.m1.2.3.2.2" xref="S4.SS2.p5.1.m1.2.3.2.2.cmml"><mi id="S4.SS2.p5.1.m1.2.3.2.2.2" xref="S4.SS2.p5.1.m1.2.3.2.2.2.cmml">g</mi><mo id="S4.SS2.p5.1.m1.2.3.2.2.1" xref="S4.SS2.p5.1.m1.2.3.2.2.1.cmml">~</mo></mover><mrow id="S4.SS2.p5.1.m1.2.2.2.4" xref="S4.SS2.p5.1.m1.2.2.2.3.cmml"><mi id="S4.SS2.p5.1.m1.1.1.1.1" xref="S4.SS2.p5.1.m1.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p5.1.m1.2.2.2.4.1" xref="S4.SS2.p5.1.m1.2.2.2.3.cmml">,</mo><mi id="S4.SS2.p5.1.m1.2.2.2.2" xref="S4.SS2.p5.1.m1.2.2.2.2.cmml">j</mi></mrow><mi id="S4.SS2.p5.1.m1.2.3.3" xref="S4.SS2.p5.1.m1.2.3.3.cmml">k</mi></msubsup><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.1.m1.2b"><apply id="S4.SS2.p5.1.m1.2.3.cmml" xref="S4.SS2.p5.1.m1.2.3"><csymbol cd="ambiguous" id="S4.SS2.p5.1.m1.2.3.1.cmml" xref="S4.SS2.p5.1.m1.2.3">superscript</csymbol><apply id="S4.SS2.p5.1.m1.2.3.2.cmml" xref="S4.SS2.p5.1.m1.2.3"><csymbol cd="ambiguous" id="S4.SS2.p5.1.m1.2.3.2.1.cmml" xref="S4.SS2.p5.1.m1.2.3">subscript</csymbol><apply id="S4.SS2.p5.1.m1.2.3.2.2.cmml" xref="S4.SS2.p5.1.m1.2.3.2.2"><ci id="S4.SS2.p5.1.m1.2.3.2.2.1.cmml" xref="S4.SS2.p5.1.m1.2.3.2.2.1">~</ci><ci id="S4.SS2.p5.1.m1.2.3.2.2.2.cmml" xref="S4.SS2.p5.1.m1.2.3.2.2.2">𝑔</ci></apply><list 
id="S4.SS2.p5.1.m1.2.2.2.3.cmml" xref="S4.SS2.p5.1.m1.2.2.2.4"><ci id="S4.SS2.p5.1.m1.1.1.1.1.cmml" xref="S4.SS2.p5.1.m1.1.1.1.1">𝑖</ci><ci id="S4.SS2.p5.1.m1.2.2.2.2.cmml" xref="S4.SS2.p5.1.m1.2.2.2.2">𝑗</ci></list></apply><ci id="S4.SS2.p5.1.m1.2.3.3.cmml" xref="S4.SS2.p5.1.m1.2.3.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.1.m1.2c">\tilde{g}_{i,j}^{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.1.m1.2d">over~ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math> by solving the following problem:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx6"> <tbody id="S4.E10"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\max_{y}\quad" class="ltx_Math" display="inline" id="S4.E10.m1.1"><semantics id="S4.E10.m1.1a"><mrow id="S4.E10.m1.1.1.1" xref="S4.E10.m1.1.1.1.1.cmml"><munder id="S4.E10.m1.1.1.1.1" xref="S4.E10.m1.1.1.1.1.cmml"><mi id="S4.E10.m1.1.1.1.1.2" xref="S4.E10.m1.1.1.1.1.2.cmml">max</mi><mi id="S4.E10.m1.1.1.1.1.3" xref="S4.E10.m1.1.1.1.1.3.cmml">y</mi></munder><mspace id="S4.E10.m1.1.1.1.2" width="1.167em" xref="S4.E10.m1.1.1.1.1.cmml"></mspace></mrow><annotation-xml encoding="MathML-Content" id="S4.E10.m1.1b"><apply id="S4.E10.m1.1.1.1.1.cmml" xref="S4.E10.m1.1.1.1"><csymbol cd="ambiguous" id="S4.E10.m1.1.1.1.1.1.cmml" xref="S4.E10.m1.1.1.1">subscript</csymbol><max id="S4.E10.m1.1.1.1.1.2.cmml" xref="S4.E10.m1.1.1.1.1.2"></max><ci id="S4.E10.m1.1.1.1.1.3.cmml" xref="S4.E10.m1.1.1.1.1.3">𝑦</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E10.m1.1c">\displaystyle\max_{y}\quad</annotation><annotation encoding="application/x-llamapun" id="S4.E10.m1.1d">roman_max start_POSTSUBSCRIPT 
italic_y end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{M}\tilde{g}_{i,j}^{k}y_{i,j}," class="ltx_Math" display="inline" id="S4.E10.m2.5"><semantics id="S4.E10.m2.5a"><mrow id="S4.E10.m2.5.5.1" xref="S4.E10.m2.5.5.1.1.cmml"><mrow id="S4.E10.m2.5.5.1.1" xref="S4.E10.m2.5.5.1.1.cmml"><mstyle displaystyle="true" id="S4.E10.m2.5.5.1.1.1" xref="S4.E10.m2.5.5.1.1.1.cmml"><munderover id="S4.E10.m2.5.5.1.1.1a" xref="S4.E10.m2.5.5.1.1.1.cmml"><mo id="S4.E10.m2.5.5.1.1.1.2.2" movablelimits="false" xref="S4.E10.m2.5.5.1.1.1.2.2.cmml">∑</mo><mrow id="S4.E10.m2.5.5.1.1.1.2.3" xref="S4.E10.m2.5.5.1.1.1.2.3.cmml"><mi id="S4.E10.m2.5.5.1.1.1.2.3.2" xref="S4.E10.m2.5.5.1.1.1.2.3.2.cmml">i</mi><mo id="S4.E10.m2.5.5.1.1.1.2.3.1" xref="S4.E10.m2.5.5.1.1.1.2.3.1.cmml">=</mo><mn id="S4.E10.m2.5.5.1.1.1.2.3.3" xref="S4.E10.m2.5.5.1.1.1.2.3.3.cmml">1</mn></mrow><mi id="S4.E10.m2.5.5.1.1.1.3" xref="S4.E10.m2.5.5.1.1.1.3.cmml">N</mi></munderover></mstyle><mrow id="S4.E10.m2.5.5.1.1.2" xref="S4.E10.m2.5.5.1.1.2.cmml"><mstyle displaystyle="true" id="S4.E10.m2.5.5.1.1.2.1" xref="S4.E10.m2.5.5.1.1.2.1.cmml"><munderover id="S4.E10.m2.5.5.1.1.2.1a" xref="S4.E10.m2.5.5.1.1.2.1.cmml"><mo id="S4.E10.m2.5.5.1.1.2.1.2.2" movablelimits="false" xref="S4.E10.m2.5.5.1.1.2.1.2.2.cmml">∑</mo><mrow id="S4.E10.m2.5.5.1.1.2.1.2.3" xref="S4.E10.m2.5.5.1.1.2.1.2.3.cmml"><mi id="S4.E10.m2.5.5.1.1.2.1.2.3.2" xref="S4.E10.m2.5.5.1.1.2.1.2.3.2.cmml">j</mi><mo id="S4.E10.m2.5.5.1.1.2.1.2.3.1" xref="S4.E10.m2.5.5.1.1.2.1.2.3.1.cmml">=</mo><mn id="S4.E10.m2.5.5.1.1.2.1.2.3.3" xref="S4.E10.m2.5.5.1.1.2.1.2.3.3.cmml">1</mn></mrow><mi id="S4.E10.m2.5.5.1.1.2.1.3" xref="S4.E10.m2.5.5.1.1.2.1.3.cmml">M</mi></munderover></mstyle><mrow id="S4.E10.m2.5.5.1.1.2.2" xref="S4.E10.m2.5.5.1.1.2.2.cmml"><msubsup id="S4.E10.m2.5.5.1.1.2.2.2" xref="S4.E10.m2.5.5.1.1.2.2.2.cmml"><mover accent="true" 
id="S4.E10.m2.5.5.1.1.2.2.2.2.2" xref="S4.E10.m2.5.5.1.1.2.2.2.2.2.cmml"><mi id="S4.E10.m2.5.5.1.1.2.2.2.2.2.2" xref="S4.E10.m2.5.5.1.1.2.2.2.2.2.2.cmml">g</mi><mo id="S4.E10.m2.5.5.1.1.2.2.2.2.2.1" xref="S4.E10.m2.5.5.1.1.2.2.2.2.2.1.cmml">~</mo></mover><mrow id="S4.E10.m2.2.2.2.4" xref="S4.E10.m2.2.2.2.3.cmml"><mi id="S4.E10.m2.1.1.1.1" xref="S4.E10.m2.1.1.1.1.cmml">i</mi><mo id="S4.E10.m2.2.2.2.4.1" xref="S4.E10.m2.2.2.2.3.cmml">,</mo><mi id="S4.E10.m2.2.2.2.2" xref="S4.E10.m2.2.2.2.2.cmml">j</mi></mrow><mi id="S4.E10.m2.5.5.1.1.2.2.2.3" xref="S4.E10.m2.5.5.1.1.2.2.2.3.cmml">k</mi></msubsup><mo id="S4.E10.m2.5.5.1.1.2.2.1" xref="S4.E10.m2.5.5.1.1.2.2.1.cmml">⁢</mo><msub id="S4.E10.m2.5.5.1.1.2.2.3" xref="S4.E10.m2.5.5.1.1.2.2.3.cmml"><mi id="S4.E10.m2.5.5.1.1.2.2.3.2" xref="S4.E10.m2.5.5.1.1.2.2.3.2.cmml">y</mi><mrow id="S4.E10.m2.4.4.2.4" xref="S4.E10.m2.4.4.2.3.cmml"><mi id="S4.E10.m2.3.3.1.1" xref="S4.E10.m2.3.3.1.1.cmml">i</mi><mo id="S4.E10.m2.4.4.2.4.1" xref="S4.E10.m2.4.4.2.3.cmml">,</mo><mi id="S4.E10.m2.4.4.2.2" xref="S4.E10.m2.4.4.2.2.cmml">j</mi></mrow></msub></mrow></mrow></mrow><mo id="S4.E10.m2.5.5.1.2" xref="S4.E10.m2.5.5.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E10.m2.5b"><apply id="S4.E10.m2.5.5.1.1.cmml" xref="S4.E10.m2.5.5.1"><apply id="S4.E10.m2.5.5.1.1.1.cmml" xref="S4.E10.m2.5.5.1.1.1"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.1.1.cmml" xref="S4.E10.m2.5.5.1.1.1">superscript</csymbol><apply id="S4.E10.m2.5.5.1.1.1.2.cmml" xref="S4.E10.m2.5.5.1.1.1"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.1.2.1.cmml" xref="S4.E10.m2.5.5.1.1.1">subscript</csymbol><sum id="S4.E10.m2.5.5.1.1.1.2.2.cmml" xref="S4.E10.m2.5.5.1.1.1.2.2"></sum><apply id="S4.E10.m2.5.5.1.1.1.2.3.cmml" xref="S4.E10.m2.5.5.1.1.1.2.3"><eq id="S4.E10.m2.5.5.1.1.1.2.3.1.cmml" xref="S4.E10.m2.5.5.1.1.1.2.3.1"></eq><ci id="S4.E10.m2.5.5.1.1.1.2.3.2.cmml" xref="S4.E10.m2.5.5.1.1.1.2.3.2">𝑖</ci><cn id="S4.E10.m2.5.5.1.1.1.2.3.3.cmml" type="integer" 
xref="S4.E10.m2.5.5.1.1.1.2.3.3">1</cn></apply></apply><ci id="S4.E10.m2.5.5.1.1.1.3.cmml" xref="S4.E10.m2.5.5.1.1.1.3">𝑁</ci></apply><apply id="S4.E10.m2.5.5.1.1.2.cmml" xref="S4.E10.m2.5.5.1.1.2"><apply id="S4.E10.m2.5.5.1.1.2.1.cmml" xref="S4.E10.m2.5.5.1.1.2.1"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.2.1.1.cmml" xref="S4.E10.m2.5.5.1.1.2.1">superscript</csymbol><apply id="S4.E10.m2.5.5.1.1.2.1.2.cmml" xref="S4.E10.m2.5.5.1.1.2.1"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.2.1.2.1.cmml" xref="S4.E10.m2.5.5.1.1.2.1">subscript</csymbol><sum id="S4.E10.m2.5.5.1.1.2.1.2.2.cmml" xref="S4.E10.m2.5.5.1.1.2.1.2.2"></sum><apply id="S4.E10.m2.5.5.1.1.2.1.2.3.cmml" xref="S4.E10.m2.5.5.1.1.2.1.2.3"><eq id="S4.E10.m2.5.5.1.1.2.1.2.3.1.cmml" xref="S4.E10.m2.5.5.1.1.2.1.2.3.1"></eq><ci id="S4.E10.m2.5.5.1.1.2.1.2.3.2.cmml" xref="S4.E10.m2.5.5.1.1.2.1.2.3.2">𝑗</ci><cn id="S4.E10.m2.5.5.1.1.2.1.2.3.3.cmml" type="integer" xref="S4.E10.m2.5.5.1.1.2.1.2.3.3">1</cn></apply></apply><ci id="S4.E10.m2.5.5.1.1.2.1.3.cmml" xref="S4.E10.m2.5.5.1.1.2.1.3">𝑀</ci></apply><apply id="S4.E10.m2.5.5.1.1.2.2.cmml" xref="S4.E10.m2.5.5.1.1.2.2"><times id="S4.E10.m2.5.5.1.1.2.2.1.cmml" xref="S4.E10.m2.5.5.1.1.2.2.1"></times><apply id="S4.E10.m2.5.5.1.1.2.2.2.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.2.2.2.1.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2">superscript</csymbol><apply id="S4.E10.m2.5.5.1.1.2.2.2.2.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.2.2.2.2.1.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2">subscript</csymbol><apply id="S4.E10.m2.5.5.1.1.2.2.2.2.2.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2.2.2"><ci id="S4.E10.m2.5.5.1.1.2.2.2.2.2.1.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2.2.2.1">~</ci><ci id="S4.E10.m2.5.5.1.1.2.2.2.2.2.2.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2.2.2.2">𝑔</ci></apply><list id="S4.E10.m2.2.2.2.3.cmml" xref="S4.E10.m2.2.2.2.4"><ci id="S4.E10.m2.1.1.1.1.cmml" xref="S4.E10.m2.1.1.1.1">𝑖</ci><ci 
id="S4.E10.m2.2.2.2.2.cmml" xref="S4.E10.m2.2.2.2.2">𝑗</ci></list></apply><ci id="S4.E10.m2.5.5.1.1.2.2.2.3.cmml" xref="S4.E10.m2.5.5.1.1.2.2.2.3">𝑘</ci></apply><apply id="S4.E10.m2.5.5.1.1.2.2.3.cmml" xref="S4.E10.m2.5.5.1.1.2.2.3"><csymbol cd="ambiguous" id="S4.E10.m2.5.5.1.1.2.2.3.1.cmml" xref="S4.E10.m2.5.5.1.1.2.2.3">subscript</csymbol><ci id="S4.E10.m2.5.5.1.1.2.2.3.2.cmml" xref="S4.E10.m2.5.5.1.1.2.2.3.2">𝑦</ci><list id="S4.E10.m2.4.4.2.3.cmml" xref="S4.E10.m2.4.4.2.4"><ci id="S4.E10.m2.3.3.1.1.cmml" xref="S4.E10.m2.3.3.1.1">𝑖</ci><ci id="S4.E10.m2.4.4.2.2.cmml" xref="S4.E10.m2.4.4.2.2">𝑗</ci></list></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E10.m2.5c">\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{M}\tilde{g}_{i,j}^{k}y_{i,j},</annotation><annotation encoding="application/x-llamapun" id="S4.E10.m2.5d">∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT over~ start_ARG italic_g end_ARG start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(10)</span></td> </tr></tbody> <tbody id="S4.E11"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><span class="ltx_text ltx_markedasmath" id="S4.E11.2.1.1.1">s.t.</span></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\sum_{i=1}^{N}y_{i,j}\leq B_{j},\forall j\in\mathcal{M};" class="ltx_Math" display="inline" id="S4.E11.m2.3"><semantics 
id="S4.E11.m2.3a"><mrow id="S4.E11.m2.3.3.1"><mrow id="S4.E11.m2.3.3.1.1.2" xref="S4.E11.m2.3.3.1.1.3.cmml"><mrow id="S4.E11.m2.3.3.1.1.1.1" xref="S4.E11.m2.3.3.1.1.1.1.cmml"><mrow id="S4.E11.m2.3.3.1.1.1.1.2" xref="S4.E11.m2.3.3.1.1.1.1.2.cmml"><mstyle displaystyle="true" id="S4.E11.m2.3.3.1.1.1.1.2.1" xref="S4.E11.m2.3.3.1.1.1.1.2.1.cmml"><munderover id="S4.E11.m2.3.3.1.1.1.1.2.1a" xref="S4.E11.m2.3.3.1.1.1.1.2.1.cmml"><mo id="S4.E11.m2.3.3.1.1.1.1.2.1.2.2" movablelimits="false" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.2.cmml">∑</mo><mrow id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.cmml"><mi id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.2" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.2.cmml">i</mi><mo id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.1" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.1.cmml">=</mo><mn id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.3" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.3.cmml">1</mn></mrow><mi id="S4.E11.m2.3.3.1.1.1.1.2.1.3" xref="S4.E11.m2.3.3.1.1.1.1.2.1.3.cmml">N</mi></munderover></mstyle><msub id="S4.E11.m2.3.3.1.1.1.1.2.2" xref="S4.E11.m2.3.3.1.1.1.1.2.2.cmml"><mi id="S4.E11.m2.3.3.1.1.1.1.2.2.2" xref="S4.E11.m2.3.3.1.1.1.1.2.2.2.cmml">y</mi><mrow id="S4.E11.m2.2.2.2.4" xref="S4.E11.m2.2.2.2.3.cmml"><mi id="S4.E11.m2.1.1.1.1" xref="S4.E11.m2.1.1.1.1.cmml">i</mi><mo id="S4.E11.m2.2.2.2.4.1" xref="S4.E11.m2.2.2.2.3.cmml">,</mo><mi id="S4.E11.m2.2.2.2.2" xref="S4.E11.m2.2.2.2.2.cmml">j</mi></mrow></msub></mrow><mo id="S4.E11.m2.3.3.1.1.1.1.1" xref="S4.E11.m2.3.3.1.1.1.1.1.cmml">≤</mo><msub id="S4.E11.m2.3.3.1.1.1.1.3" xref="S4.E11.m2.3.3.1.1.1.1.3.cmml"><mi id="S4.E11.m2.3.3.1.1.1.1.3.2" xref="S4.E11.m2.3.3.1.1.1.1.3.2.cmml">B</mi><mi id="S4.E11.m2.3.3.1.1.1.1.3.3" xref="S4.E11.m2.3.3.1.1.1.1.3.3.cmml">j</mi></msub></mrow><mo id="S4.E11.m2.3.3.1.1.2.3" xref="S4.E11.m2.3.3.1.1.3a.cmml">,</mo><mrow id="S4.E11.m2.3.3.1.1.2.2" xref="S4.E11.m2.3.3.1.1.2.2.cmml"><mrow id="S4.E11.m2.3.3.1.1.2.2.2" xref="S4.E11.m2.3.3.1.1.2.2.2.cmml"><mo id="S4.E11.m2.3.3.1.1.2.2.2.1" 
rspace="0.167em" xref="S4.E11.m2.3.3.1.1.2.2.2.1.cmml">∀</mo><mi id="S4.E11.m2.3.3.1.1.2.2.2.2" xref="S4.E11.m2.3.3.1.1.2.2.2.2.cmml">j</mi></mrow><mo id="S4.E11.m2.3.3.1.1.2.2.1" xref="S4.E11.m2.3.3.1.1.2.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E11.m2.3.3.1.1.2.2.3" xref="S4.E11.m2.3.3.1.1.2.2.3.cmml">ℳ</mi></mrow></mrow><mo id="S4.E11.m2.3.3.1.2">;</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E11.m2.3b"><apply id="S4.E11.m2.3.3.1.1.3.cmml" xref="S4.E11.m2.3.3.1.1.2"><csymbol cd="ambiguous" id="S4.E11.m2.3.3.1.1.3a.cmml" xref="S4.E11.m2.3.3.1.1.2.3">formulae-sequence</csymbol><apply id="S4.E11.m2.3.3.1.1.1.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1"><leq id="S4.E11.m2.3.3.1.1.1.1.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.1"></leq><apply id="S4.E11.m2.3.3.1.1.1.1.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2"><apply id="S4.E11.m2.3.3.1.1.1.1.2.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1"><csymbol cd="ambiguous" id="S4.E11.m2.3.3.1.1.1.1.2.1.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1">superscript</csymbol><apply id="S4.E11.m2.3.3.1.1.1.1.2.1.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1"><csymbol cd="ambiguous" id="S4.E11.m2.3.3.1.1.1.1.2.1.2.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1">subscript</csymbol><sum id="S4.E11.m2.3.3.1.1.1.1.2.1.2.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.2"></sum><apply id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3"><eq id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.1"></eq><ci id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.2">𝑖</ci><cn id="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.3.cmml" type="integer" xref="S4.E11.m2.3.3.1.1.1.1.2.1.2.3.3">1</cn></apply></apply><ci id="S4.E11.m2.3.3.1.1.1.1.2.1.3.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.1.3">𝑁</ci></apply><apply id="S4.E11.m2.3.3.1.1.1.1.2.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S4.E11.m2.3.3.1.1.1.1.2.2.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.2">subscript</csymbol><ci 
id="S4.E11.m2.3.3.1.1.1.1.2.2.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.2.2.2">𝑦</ci><list id="S4.E11.m2.2.2.2.3.cmml" xref="S4.E11.m2.2.2.2.4"><ci id="S4.E11.m2.1.1.1.1.cmml" xref="S4.E11.m2.1.1.1.1">𝑖</ci><ci id="S4.E11.m2.2.2.2.2.cmml" xref="S4.E11.m2.2.2.2.2">𝑗</ci></list></apply></apply><apply id="S4.E11.m2.3.3.1.1.1.1.3.cmml" xref="S4.E11.m2.3.3.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.E11.m2.3.3.1.1.1.1.3.1.cmml" xref="S4.E11.m2.3.3.1.1.1.1.3">subscript</csymbol><ci id="S4.E11.m2.3.3.1.1.1.1.3.2.cmml" xref="S4.E11.m2.3.3.1.1.1.1.3.2">𝐵</ci><ci id="S4.E11.m2.3.3.1.1.1.1.3.3.cmml" xref="S4.E11.m2.3.3.1.1.1.1.3.3">𝑗</ci></apply></apply><apply id="S4.E11.m2.3.3.1.1.2.2.cmml" xref="S4.E11.m2.3.3.1.1.2.2"><in id="S4.E11.m2.3.3.1.1.2.2.1.cmml" xref="S4.E11.m2.3.3.1.1.2.2.1"></in><apply id="S4.E11.m2.3.3.1.1.2.2.2.cmml" xref="S4.E11.m2.3.3.1.1.2.2.2"><csymbol cd="latexml" id="S4.E11.m2.3.3.1.1.2.2.2.1.cmml" xref="S4.E11.m2.3.3.1.1.2.2.2.1">for-all</csymbol><ci id="S4.E11.m2.3.3.1.1.2.2.2.2.cmml" xref="S4.E11.m2.3.3.1.1.2.2.2.2">𝑗</ci></apply><ci id="S4.E11.m2.3.3.1.1.2.2.3.cmml" xref="S4.E11.m2.3.3.1.1.2.2.3">ℳ</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E11.m2.3c">\displaystyle\sum_{i=1}^{N}y_{i,j}\leq B_{j},\forall j\in\mathcal{M};</annotation><annotation encoding="application/x-llamapun" id="S4.E11.m2.3d">∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ≤ italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , ∀ italic_j ∈ caligraphic_M ;</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(11)</span></td> </tr></tbody> <tbody id="S4.E12"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell 
ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle y_{i,j}=\{0,1\},\forall i\in\mathcal{N},j\in\mathcal{M}," class="ltx_Math" display="inline" id="S4.E12.m1.5"><semantics id="S4.E12.m1.5a"><mrow id="S4.E12.m1.5.5.1"><mrow id="S4.E12.m1.5.5.1.1.2" xref="S4.E12.m1.5.5.1.1.3.cmml"><mrow id="S4.E12.m1.5.5.1.1.1.1" xref="S4.E12.m1.5.5.1.1.1.1.cmml"><msub id="S4.E12.m1.5.5.1.1.1.1.2" xref="S4.E12.m1.5.5.1.1.1.1.2.cmml"><mi id="S4.E12.m1.5.5.1.1.1.1.2.2" xref="S4.E12.m1.5.5.1.1.1.1.2.2.cmml">y</mi><mrow id="S4.E12.m1.2.2.2.4" xref="S4.E12.m1.2.2.2.3.cmml"><mi id="S4.E12.m1.1.1.1.1" xref="S4.E12.m1.1.1.1.1.cmml">i</mi><mo id="S4.E12.m1.2.2.2.4.1" xref="S4.E12.m1.2.2.2.3.cmml">,</mo><mi id="S4.E12.m1.2.2.2.2" xref="S4.E12.m1.2.2.2.2.cmml">j</mi></mrow></msub><mo id="S4.E12.m1.5.5.1.1.1.1.1" xref="S4.E12.m1.5.5.1.1.1.1.1.cmml">=</mo><mrow id="S4.E12.m1.5.5.1.1.1.1.3.2" xref="S4.E12.m1.5.5.1.1.1.1.3.1.cmml"><mo id="S4.E12.m1.5.5.1.1.1.1.3.2.1" stretchy="false" xref="S4.E12.m1.5.5.1.1.1.1.3.1.cmml">{</mo><mn id="S4.E12.m1.3.3" xref="S4.E12.m1.3.3.cmml">0</mn><mo id="S4.E12.m1.5.5.1.1.1.1.3.2.2" xref="S4.E12.m1.5.5.1.1.1.1.3.1.cmml">,</mo><mn id="S4.E12.m1.4.4" xref="S4.E12.m1.4.4.cmml">1</mn><mo id="S4.E12.m1.5.5.1.1.1.1.3.2.3" stretchy="false" xref="S4.E12.m1.5.5.1.1.1.1.3.1.cmml">}</mo></mrow></mrow><mo id="S4.E12.m1.5.5.1.1.2.3" xref="S4.E12.m1.5.5.1.1.3a.cmml">,</mo><mrow id="S4.E12.m1.5.5.1.1.2.2.2" xref="S4.E12.m1.5.5.1.1.2.2.3.cmml"><mrow id="S4.E12.m1.5.5.1.1.2.2.1.1" xref="S4.E12.m1.5.5.1.1.2.2.1.1.cmml"><mrow id="S4.E12.m1.5.5.1.1.2.2.1.1.2" xref="S4.E12.m1.5.5.1.1.2.2.1.1.2.cmml"><mo id="S4.E12.m1.5.5.1.1.2.2.1.1.2.1" rspace="0.167em" xref="S4.E12.m1.5.5.1.1.2.2.1.1.2.1.cmml">∀</mo><mi id="S4.E12.m1.5.5.1.1.2.2.1.1.2.2" xref="S4.E12.m1.5.5.1.1.2.2.1.1.2.2.cmml">i</mi></mrow><mo id="S4.E12.m1.5.5.1.1.2.2.1.1.1" xref="S4.E12.m1.5.5.1.1.2.2.1.1.1.cmml">∈</mo><mi 
class="ltx_font_mathcaligraphic" id="S4.E12.m1.5.5.1.1.2.2.1.1.3" xref="S4.E12.m1.5.5.1.1.2.2.1.1.3.cmml">𝒩</mi></mrow><mo id="S4.E12.m1.5.5.1.1.2.2.2.3" xref="S4.E12.m1.5.5.1.1.2.2.3a.cmml">,</mo><mrow id="S4.E12.m1.5.5.1.1.2.2.2.2" xref="S4.E12.m1.5.5.1.1.2.2.2.2.cmml"><mi id="S4.E12.m1.5.5.1.1.2.2.2.2.2" xref="S4.E12.m1.5.5.1.1.2.2.2.2.2.cmml">j</mi><mo id="S4.E12.m1.5.5.1.1.2.2.2.2.1" xref="S4.E12.m1.5.5.1.1.2.2.2.2.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S4.E12.m1.5.5.1.1.2.2.2.2.3" xref="S4.E12.m1.5.5.1.1.2.2.2.2.3.cmml">ℳ</mi></mrow></mrow></mrow><mo id="S4.E12.m1.5.5.1.2">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.E12.m1.5b"><apply id="S4.E12.m1.5.5.1.1.3.cmml" xref="S4.E12.m1.5.5.1.1.2"><csymbol cd="ambiguous" id="S4.E12.m1.5.5.1.1.3a.cmml" xref="S4.E12.m1.5.5.1.1.2.3">formulae-sequence</csymbol><apply id="S4.E12.m1.5.5.1.1.1.1.cmml" xref="S4.E12.m1.5.5.1.1.1.1"><eq id="S4.E12.m1.5.5.1.1.1.1.1.cmml" xref="S4.E12.m1.5.5.1.1.1.1.1"></eq><apply id="S4.E12.m1.5.5.1.1.1.1.2.cmml" xref="S4.E12.m1.5.5.1.1.1.1.2"><csymbol cd="ambiguous" id="S4.E12.m1.5.5.1.1.1.1.2.1.cmml" xref="S4.E12.m1.5.5.1.1.1.1.2">subscript</csymbol><ci id="S4.E12.m1.5.5.1.1.1.1.2.2.cmml" xref="S4.E12.m1.5.5.1.1.1.1.2.2">𝑦</ci><list id="S4.E12.m1.2.2.2.3.cmml" xref="S4.E12.m1.2.2.2.4"><ci id="S4.E12.m1.1.1.1.1.cmml" xref="S4.E12.m1.1.1.1.1">𝑖</ci><ci id="S4.E12.m1.2.2.2.2.cmml" xref="S4.E12.m1.2.2.2.2">𝑗</ci></list></apply><set id="S4.E12.m1.5.5.1.1.1.1.3.1.cmml" xref="S4.E12.m1.5.5.1.1.1.1.3.2"><cn id="S4.E12.m1.3.3.cmml" type="integer" xref="S4.E12.m1.3.3">0</cn><cn id="S4.E12.m1.4.4.cmml" type="integer" xref="S4.E12.m1.4.4">1</cn></set></apply><apply id="S4.E12.m1.5.5.1.1.2.2.3.cmml" xref="S4.E12.m1.5.5.1.1.2.2.2"><csymbol cd="ambiguous" id="S4.E12.m1.5.5.1.1.2.2.3a.cmml" xref="S4.E12.m1.5.5.1.1.2.2.2.3">formulae-sequence</csymbol><apply id="S4.E12.m1.5.5.1.1.2.2.1.1.cmml" xref="S4.E12.m1.5.5.1.1.2.2.1.1"><in id="S4.E12.m1.5.5.1.1.2.2.1.1.1.cmml" 
xref="S4.E12.m1.5.5.1.1.2.2.1.1.1"></in><apply id="S4.E12.m1.5.5.1.1.2.2.1.1.2.cmml" xref="S4.E12.m1.5.5.1.1.2.2.1.1.2"><csymbol cd="latexml" id="S4.E12.m1.5.5.1.1.2.2.1.1.2.1.cmml" xref="S4.E12.m1.5.5.1.1.2.2.1.1.2.1">for-all</csymbol><ci id="S4.E12.m1.5.5.1.1.2.2.1.1.2.2.cmml" xref="S4.E12.m1.5.5.1.1.2.2.1.1.2.2">𝑖</ci></apply><ci id="S4.E12.m1.5.5.1.1.2.2.1.1.3.cmml" xref="S4.E12.m1.5.5.1.1.2.2.1.1.3">𝒩</ci></apply><apply id="S4.E12.m1.5.5.1.1.2.2.2.2.cmml" xref="S4.E12.m1.5.5.1.1.2.2.2.2"><in id="S4.E12.m1.5.5.1.1.2.2.2.2.1.cmml" xref="S4.E12.m1.5.5.1.1.2.2.2.2.1"></in><ci id="S4.E12.m1.5.5.1.1.2.2.2.2.2.cmml" xref="S4.E12.m1.5.5.1.1.2.2.2.2.2">𝑗</ci><ci id="S4.E12.m1.5.5.1.1.2.2.2.2.3.cmml" xref="S4.E12.m1.5.5.1.1.2.2.2.2.3">ℳ</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E12.m1.5c">\displaystyle y_{i,j}=\{0,1\},\forall i\in\mathcal{N},j\in\mathcal{M},</annotation><annotation encoding="application/x-llamapun" id="S4.E12.m1.5d">italic_y start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = { 0 , 1 } , ∀ italic_i ∈ caligraphic_N , italic_j ∈ caligraphic_M ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(12)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S4.SS2.p5.11">where <math alttext="y_{i,j}" class="ltx_Math" display="inline" id="S4.SS2.p5.2.m1.2"><semantics id="S4.SS2.p5.2.m1.2a"><msub id="S4.SS2.p5.2.m1.2.3" xref="S4.SS2.p5.2.m1.2.3.cmml"><mi id="S4.SS2.p5.2.m1.2.3.2" xref="S4.SS2.p5.2.m1.2.3.2.cmml">y</mi><mrow id="S4.SS2.p5.2.m1.2.2.2.4" xref="S4.SS2.p5.2.m1.2.2.2.3.cmml"><mi id="S4.SS2.p5.2.m1.1.1.1.1" xref="S4.SS2.p5.2.m1.1.1.1.1.cmml">i</mi><mo id="S4.SS2.p5.2.m1.2.2.2.4.1" xref="S4.SS2.p5.2.m1.2.2.2.3.cmml">,</mo><mi id="S4.SS2.p5.2.m1.2.2.2.2" 
xref="S4.SS2.p5.2.m1.2.2.2.2.cmml">j</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.2.m1.2b"><apply id="S4.SS2.p5.2.m1.2.3.cmml" xref="S4.SS2.p5.2.m1.2.3"><csymbol cd="ambiguous" id="S4.SS2.p5.2.m1.2.3.1.cmml" xref="S4.SS2.p5.2.m1.2.3">subscript</csymbol><ci id="S4.SS2.p5.2.m1.2.3.2.cmml" xref="S4.SS2.p5.2.m1.2.3.2">𝑦</ci><list id="S4.SS2.p5.2.m1.2.2.2.3.cmml" xref="S4.SS2.p5.2.m1.2.2.2.4"><ci id="S4.SS2.p5.2.m1.1.1.1.1.cmml" xref="S4.SS2.p5.2.m1.1.1.1.1">𝑖</ci><ci id="S4.SS2.p5.2.m1.2.2.2.2.cmml" xref="S4.SS2.p5.2.m1.2.2.2.2">𝑗</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.2.m1.2c">y_{i,j}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.2.m1.2d">italic_y start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT</annotation></semantics></math> is a binary variable indicating whether request <math alttext="i" class="ltx_Math" display="inline" id="S4.SS2.p5.3.m2.1"><semantics id="S4.SS2.p5.3.m2.1a"><mi id="S4.SS2.p5.3.m2.1.1" xref="S4.SS2.p5.3.m2.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.3.m2.1b"><ci id="S4.SS2.p5.3.m2.1.1.cmml" xref="S4.SS2.p5.3.m2.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.3.m2.1c">i</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.3.m2.1d">italic_i</annotation></semantics></math> is assigned to SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS2.p5.4.m3.1"><semantics id="S4.SS2.p5.4.m3.1a"><mi id="S4.SS2.p5.4.m3.1.1" xref="S4.SS2.p5.4.m3.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.4.m3.1b"><ci id="S4.SS2.p5.4.m3.1.1.cmml" xref="S4.SS2.p5.4.m3.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.4.m3.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.4.m3.1d">italic_j</annotation></semantics></math>. 
The problem (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S4.E10" title="In IV-B Algorithm Design ‣ IV Learning-based SSM Selection ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">10</span></a>) can be modeled as a bipartite graph matching problem between requests and SSMs. Since each SSM can accommodate a batch of requests, we can regard each SSM <math alttext="j" class="ltx_Math" display="inline" id="S4.SS2.p5.5.m4.1"><semantics id="S4.SS2.p5.5.m4.1a"><mi id="S4.SS2.p5.5.m4.1.1" xref="S4.SS2.p5.5.m4.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.5.m4.1b"><ci id="S4.SS2.p5.5.m4.1.1.cmml" xref="S4.SS2.p5.5.m4.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.5.m4.1c">j</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.5.m4.1d">italic_j</annotation></semantics></math> as having <math alttext="B_{j}" class="ltx_Math" display="inline" id="S4.SS2.p5.6.m5.1"><semantics id="S4.SS2.p5.6.m5.1a"><msub id="S4.SS2.p5.6.m5.1.1" xref="S4.SS2.p5.6.m5.1.1.cmml"><mi id="S4.SS2.p5.6.m5.1.1.2" xref="S4.SS2.p5.6.m5.1.1.2.cmml">B</mi><mi id="S4.SS2.p5.6.m5.1.1.3" xref="S4.SS2.p5.6.m5.1.1.3.cmml">j</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.6.m5.1b"><apply id="S4.SS2.p5.6.m5.1.1.cmml" xref="S4.SS2.p5.6.m5.1.1"><csymbol cd="ambiguous" id="S4.SS2.p5.6.m5.1.1.1.cmml" xref="S4.SS2.p5.6.m5.1.1">subscript</csymbol><ci id="S4.SS2.p5.6.m5.1.1.2.cmml" xref="S4.SS2.p5.6.m5.1.1.2">𝐵</ci><ci id="S4.SS2.p5.6.m5.1.1.3.cmml" xref="S4.SS2.p5.6.m5.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.6.m5.1c">B_{j}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.6.m5.1d">italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math> replicas, each capable of serving a request. 
The weight of the edge between a request and an SSM is the estimated goodput, and the objective is to maximize the total weight. We use the Kuhn-Munkres (KM) algorithm to find a maximum-weight match <math alttext="\pi^{k}" class="ltx_Math" display="inline" id="S4.SS2.p5.7.m6.1"><semantics id="S4.SS2.p5.7.m6.1a"><msup id="S4.SS2.p5.7.m6.1.1" xref="S4.SS2.p5.7.m6.1.1.cmml"><mi id="S4.SS2.p5.7.m6.1.1.2" xref="S4.SS2.p5.7.m6.1.1.2.cmml">π</mi><mi id="S4.SS2.p5.7.m6.1.1.3" xref="S4.SS2.p5.7.m6.1.1.3.cmml">k</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.7.m6.1b"><apply id="S4.SS2.p5.7.m6.1.1.cmml" xref="S4.SS2.p5.7.m6.1.1"><csymbol cd="ambiguous" id="S4.SS2.p5.7.m6.1.1.1.cmml" xref="S4.SS2.p5.7.m6.1.1">superscript</csymbol><ci id="S4.SS2.p5.7.m6.1.1.2.cmml" xref="S4.SS2.p5.7.m6.1.1.2">𝜋</ci><ci id="S4.SS2.p5.7.m6.1.1.3.cmml" xref="S4.SS2.p5.7.m6.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.7.m6.1c">\pi^{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.7.m6.1d">italic_π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math>. 
It is important to note that <math alttext="\pi^{k}" class="ltx_Math" display="inline" id="S4.SS2.p5.8.m7.1"><semantics id="S4.SS2.p5.8.m7.1a"><msup id="S4.SS2.p5.8.m7.1.1" xref="S4.SS2.p5.8.m7.1.1.cmml"><mi id="S4.SS2.p5.8.m7.1.1.2" xref="S4.SS2.p5.8.m7.1.1.2.cmml">π</mi><mi id="S4.SS2.p5.8.m7.1.1.3" xref="S4.SS2.p5.8.m7.1.1.3.cmml">k</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.8.m7.1b"><apply id="S4.SS2.p5.8.m7.1.1.cmml" xref="S4.SS2.p5.8.m7.1.1"><csymbol cd="ambiguous" id="S4.SS2.p5.8.m7.1.1.1.cmml" xref="S4.SS2.p5.8.m7.1.1">superscript</csymbol><ci id="S4.SS2.p5.8.m7.1.1.2.cmml" xref="S4.SS2.p5.8.m7.1.1.2">𝜋</ci><ci id="S4.SS2.p5.8.m7.1.1.3.cmml" xref="S4.SS2.p5.8.m7.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.8.m7.1c">\pi^{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.8.m7.1d">italic_π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math> is the best match based on the estimated goodput but may not indicate the optimal SSM selection since the estimated goodput could be inaccurate. 
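</p> </div> <div class="ltx_para"> <p class="ltx_p">The replica view of the matching problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the goodput values and capacities are made up, the helper names <code>expand_replicas</code> and <code>max_weight_match</code> are hypothetical, and an exhaustive search stands in for the KM algorithm, so it is only usable on tiny inputs.</p>

```python
from itertools import permutations


def expand_replicas(goodput, capacity):
    """Duplicate column j of the weight matrix capacity[j] times.

    Returns the expanded matrix and, for every expanded column,
    the index of the SSM that the replica belongs to.
    """
    owner = [j for j, b in enumerate(capacity) for _ in range(b)]
    expanded = [[row[j] for j in owner] for row in goodput]
    return expanded, owner


def max_weight_match(goodput, capacity):
    """Exhaustive maximum-weight matching of requests to SSM replicas.

    Stand-in for the KM algorithm; every request must receive one replica.
    """
    expanded, owner = expand_replicas(goodput, capacity)
    n = len(expanded)
    best_total, best_slots = float("-inf"), None
    for slots in permutations(range(len(owner)), n):
        total = sum(expanded[i][s] for i, s in enumerate(slots))
        if total > best_total:
            best_total, best_slots = total, slots
    if best_slots is None:
        raise ValueError("not enough SSM capacity for all requests")
    return [owner[s] for s in best_slots], best_total


# Hypothetical estimated goodput: 3 requests (rows) x 2 SSMs (columns).
goodput = [[5.0, 4.0],
           [6.0, 1.0],
           [7.0, 3.0]]
capacity = [2, 1]  # B_0 = 2 replicas, B_1 = 1 replica

assignment, total = max_weight_match(goodput, capacity)
print(assignment, total)  # [1, 0, 0] 17.0
```

<p class="ltx_p">All three requests prefer SSM 0, but it has only two replicas, so the request that loses the least goodput by moving is sent to SSM 1.</p> </div> <div class="ltx_para"> <p class="ltx_p">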
According to the match <math alttext="\pi^{k}" class="ltx_Math" display="inline" id="S4.SS2.p5.9.m8.1"><semantics id="S4.SS2.p5.9.m8.1a"><msup id="S4.SS2.p5.9.m8.1.1" xref="S4.SS2.p5.9.m8.1.1.cmml"><mi id="S4.SS2.p5.9.m8.1.1.2" xref="S4.SS2.p5.9.m8.1.1.2.cmml">π</mi><mi id="S4.SS2.p5.9.m8.1.1.3" xref="S4.SS2.p5.9.m8.1.1.3.cmml">k</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.9.m8.1b"><apply id="S4.SS2.p5.9.m8.1.1.cmml" xref="S4.SS2.p5.9.m8.1.1"><csymbol cd="ambiguous" id="S4.SS2.p5.9.m8.1.1.1.cmml" xref="S4.SS2.p5.9.m8.1.1">superscript</csymbol><ci id="S4.SS2.p5.9.m8.1.1.2.cmml" xref="S4.SS2.p5.9.m8.1.1.2">𝜋</ci><ci id="S4.SS2.p5.9.m8.1.1.3.cmml" xref="S4.SS2.p5.9.m8.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.9.m8.1c">\pi^{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.9.m8.1d">italic_π start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT</annotation></semantics></math>, each request is processed on the selected SSM for <math alttext="2^{k}" class="ltx_Math" display="inline" id="S4.SS2.p5.10.m9.1"><semantics id="S4.SS2.p5.10.m9.1a"><msup id="S4.SS2.p5.10.m9.1.1" xref="S4.SS2.p5.10.m9.1.1.cmml"><mn id="S4.SS2.p5.10.m9.1.1.2" xref="S4.SS2.p5.10.m9.1.1.2.cmml">2</mn><mi id="S4.SS2.p5.10.m9.1.1.3" xref="S4.SS2.p5.10.m9.1.1.3.cmml">k</mi></msup><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.10.m9.1b"><apply id="S4.SS2.p5.10.m9.1.1.cmml" xref="S4.SS2.p5.10.m9.1.1"><csymbol cd="ambiguous" id="S4.SS2.p5.10.m9.1.1.1.cmml" xref="S4.SS2.p5.10.m9.1.1">superscript</csymbol><cn id="S4.SS2.p5.10.m9.1.1.2.cmml" type="integer" xref="S4.SS2.p5.10.m9.1.1.2">2</cn><ci id="S4.SS2.p5.10.m9.1.1.3.cmml" xref="S4.SS2.p5.10.m9.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.10.m9.1c">2^{k}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.10.m9.1d">2 start_POSTSUPERSCRIPT italic_k 
end_POSTSUPERSCRIPT</annotation></semantics></math> time slots. We use exponential duration for exploitation because as <math alttext="k" class="ltx_Math" display="inline" id="S4.SS2.p5.11.m10.1"><semantics id="S4.SS2.p5.11.m10.1a"><mi id="S4.SS2.p5.11.m10.1.1" xref="S4.SS2.p5.11.m10.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p5.11.m10.1b"><ci id="S4.SS2.p5.11.m10.1.1.cmml" xref="S4.SS2.p5.11.m10.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p5.11.m10.1c">k</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p5.11.m10.1d">italic_k</annotation></semantics></math> increases, the precision of goodput estimations improves, enhancing the probability of the optimal SSM selection.</p> </div> <div class="ltx_theorem ltx_theorem_theorem" id="Thmtheorem1"> <h6 class="ltx_title ltx_runin ltx_title_theorem"> <span class="ltx_tag ltx_tag_theorem"><span class="ltx_text ltx_font_bold" id="Thmtheorem1.1.1.1">Theorem 1</span></span><span class="ltx_text ltx_font_bold" id="Thmtheorem1.2.2">.</span> </h6> <div class="ltx_para" id="Thmtheorem1.p1"> <p class="ltx_p" id="Thmtheorem1.p1.2"><span class="ltx_text ltx_font_italic" id="Thmtheorem1.p1.2.2">The total regret <math alttext="\mathcal{R}(T)" class="ltx_Math" display="inline" id="Thmtheorem1.p1.1.1.m1.1"><semantics id="Thmtheorem1.p1.1.1.m1.1a"><mrow id="Thmtheorem1.p1.1.1.m1.1.2" xref="Thmtheorem1.p1.1.1.m1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="Thmtheorem1.p1.1.1.m1.1.2.2" xref="Thmtheorem1.p1.1.1.m1.1.2.2.cmml">ℛ</mi><mo id="Thmtheorem1.p1.1.1.m1.1.2.1" xref="Thmtheorem1.p1.1.1.m1.1.2.1.cmml">⁢</mo><mrow id="Thmtheorem1.p1.1.1.m1.1.2.3.2" xref="Thmtheorem1.p1.1.1.m1.1.2.cmml"><mo id="Thmtheorem1.p1.1.1.m1.1.2.3.2.1" stretchy="false" xref="Thmtheorem1.p1.1.1.m1.1.2.cmml">(</mo><mi id="Thmtheorem1.p1.1.1.m1.1.1" xref="Thmtheorem1.p1.1.1.m1.1.1.cmml">T</mi><mo id="Thmtheorem1.p1.1.1.m1.1.2.3.2.2" stretchy="false" 
xref="Thmtheorem1.p1.1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Thmtheorem1.p1.1.1.m1.1b"><apply id="Thmtheorem1.p1.1.1.m1.1.2.cmml" xref="Thmtheorem1.p1.1.1.m1.1.2"><times id="Thmtheorem1.p1.1.1.m1.1.2.1.cmml" xref="Thmtheorem1.p1.1.1.m1.1.2.1"></times><ci id="Thmtheorem1.p1.1.1.m1.1.2.2.cmml" xref="Thmtheorem1.p1.1.1.m1.1.2.2">ℛ</ci><ci id="Thmtheorem1.p1.1.1.m1.1.1.cmml" xref="Thmtheorem1.p1.1.1.m1.1.1">𝑇</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Thmtheorem1.p1.1.1.m1.1c">\mathcal{R}(T)</annotation><annotation encoding="application/x-llamapun" id="Thmtheorem1.p1.1.1.m1.1d">caligraphic_R ( italic_T )</annotation></semantics></math> is bounded by <math alttext="\mathcal{O}(\log_{2}T)" class="ltx_Math" display="inline" id="Thmtheorem1.p1.2.2.m2.1"><semantics id="Thmtheorem1.p1.2.2.m2.1a"><mrow id="Thmtheorem1.p1.2.2.m2.1.1" xref="Thmtheorem1.p1.2.2.m2.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="Thmtheorem1.p1.2.2.m2.1.1.3" xref="Thmtheorem1.p1.2.2.m2.1.1.3.cmml">𝒪</mi><mo id="Thmtheorem1.p1.2.2.m2.1.1.2" xref="Thmtheorem1.p1.2.2.m2.1.1.2.cmml">⁢</mo><mrow id="Thmtheorem1.p1.2.2.m2.1.1.1.1" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.cmml"><mo id="Thmtheorem1.p1.2.2.m2.1.1.1.1.2" stretchy="false" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.cmml">(</mo><mrow id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.cmml"><msub id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.cmml"><mi id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.2" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.2.cmml">log</mi><mn id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.3" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.3.cmml">2</mn></msub><mo id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1a" lspace="0.167em" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.cmml">⁡</mo><mi id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.2" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.2.cmml">T</mi></mrow><mo id="Thmtheorem1.p1.2.2.m2.1.1.1.1.3" stretchy="false" 
xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Thmtheorem1.p1.2.2.m2.1b"><apply id="Thmtheorem1.p1.2.2.m2.1.1.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1"><times id="Thmtheorem1.p1.2.2.m2.1.1.2.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.2"></times><ci id="Thmtheorem1.p1.2.2.m2.1.1.3.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.3">𝒪</ci><apply id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1"><apply id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.1.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1">subscript</csymbol><log id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.2.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.2"></log><cn id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.3.cmml" type="integer" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.1.3">2</cn></apply><ci id="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.2.cmml" xref="Thmtheorem1.p1.2.2.m2.1.1.1.1.1.2">𝑇</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Thmtheorem1.p1.2.2.m2.1c">\mathcal{O}(\log_{2}T)</annotation><annotation encoding="application/x-llamapun" id="Thmtheorem1.p1.2.2.m2.1d">caligraphic_O ( roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T )</annotation></semantics></math>.</span></p> </div> </div> <div class="ltx_proof" id="S4.SS2.1"> <h6 class="ltx_title ltx_runin ltx_font_italic ltx_title_proof">Proof.</h6> <div class="ltx_para" id="S4.SS2.1.p1"> <p class="ltx_p" id="S4.SS2.1.p1.1">Please refer to the technical report <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib31" title="">31</a>]</cite> for the proof. 
∎</p> </div> </div> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S4.SS3.5.1.1">IV-C</span> </span><span class="ltx_text ltx_font_italic" id="S4.SS3.6.2">Fast SSM Switching</span> </h3> <div class="ltx_para" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.1">We propose a fast SSM switching mechanism to further reduce the SSM switching cost during the exploration-exploitation process. The key insight is that newly speculated tokens cannot affect the KV entries of tokens that have already been generated. Therefore, we can pre-compute the KV cache of all existing tokens on the destination SSM, in parallel with the speculation on the source SSM. </p> </div> <div class="ltx_para" id="S4.SS3.p2"> <p class="ltx_p" id="S4.SS3.p2.1">To realize fast SSM switching, the destination SSM must be determined in advance. During the exploration stage, the destination SSM is easy to determine because it is randomly selected. During the exploitation stage, however, the SSM selection is decided by a matching algorithm based on estimated goodput, which makes it challenging to determine the destination SSM in advance. A straightforward solution is to pre-compute the KV cache for all SSMs, but this introduces significant redundant computation and wastes resources. To address this challenge, we choose the SSM with the highest estimated goodput for KV cache re-computation. 
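</p> </div> <div class="ltx_para"> <p class="ltx_p">As a minimal sketch of this selection rule, the snippet below picks the prefetch target for one request. The function name <code>pick_prefetch_target</code>, the SSM names and the goodput numbers are all hypothetical, and skipping the SSM that currently serves the request is an assumption made here rather than something the text specifies.</p>

```python
def pick_prefetch_target(est_goodput, current_ssm):
    """Pick the SSM whose KV cache is prepared ahead of a possible switch.

    est_goodput maps an SSM name to its estimated goodput for one request.
    Assumption: the current SSM is skipped, because its KV cache for this
    request is already up to date.
    """
    others = {name: g for name, g in est_goodput.items() if name != current_ssm}
    if not others:
        return None  # only one SSM available, nothing to prefetch
    return max(others, key=others.get)


# Hypothetical SSM names and goodput estimates.
estimates = {"ssm-small": 41.0, "ssm-medium": 55.5, "ssm-large": 48.2}
target = pick_prefetch_target(estimates, current_ssm="ssm-small")
print(target)  # ssm-medium
```

</div> <div class="ltx_para"> <p class="ltx_p">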
Furthermore, <span class="ltx_text ltx_font_smallcaps" id="S4.SS3.p2.1.1">Spin</span> runs such KV cache re-computation during the idle time of SSMs, minimizing the negative impact on the speculation running on the SSMs.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">V </span><span class="ltx_text ltx_font_smallcaps" id="S5.1.1">Spin’s Runtime Engine</span> </h2> <div class="ltx_para" id="S5.p1"> <p class="ltx_p" id="S5.p1.1">In this section, we present <span class="ltx_text ltx_font_smallcaps" id="S5.p1.1.1">Spin</span>’s speculative decoding engine, which includes a fast batch verification module and a pipeline processing module.</p> </div> <section class="ltx_subsection" id="S5.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S5.SS1.5.1.1">V-A</span> </span><span class="ltx_text ltx_font_italic" id="S5.SS1.6.2">Fast Batch Verification</span> </h3> <figure class="ltx_figure" id="S5.F9"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="210" id="S5.F9.1.g1" src="x9.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 9: </span>The illustration of the self-attention computation with request decomposition.</figcaption> </figure> <div class="ltx_para" id="S5.SS1.p1"> <p class="ltx_p" id="S5.SS1.p1.5">The key idea of fast batch verification is to reshape the KV tensors by cutting off the overlong parts of some requests and stitching them to shorter ones, which reduces the number of padding tokens. 
<a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S5.F9" title="Figure 9 ‣ V-A Fast Batch Verification ‣ V Spin’s Runtime Engine ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 9</span></a> illustrates a simple example of three requests, where tokens in white color are prompts, e.g., <math alttext="(A0,A1,A2,A3)" class="ltx_Math" display="inline" id="S5.SS1.p1.1.m1.4"><semantics id="S5.SS1.p1.1.m1.4a"><mrow id="S5.SS1.p1.1.m1.4.4.4" xref="S5.SS1.p1.1.m1.4.4.5.cmml"><mo id="S5.SS1.p1.1.m1.4.4.4.5" stretchy="false" xref="S5.SS1.p1.1.m1.4.4.5.cmml">(</mo><mrow id="S5.SS1.p1.1.m1.1.1.1.1" xref="S5.SS1.p1.1.m1.1.1.1.1.cmml"><mi id="S5.SS1.p1.1.m1.1.1.1.1.2" xref="S5.SS1.p1.1.m1.1.1.1.1.2.cmml">A</mi><mo id="S5.SS1.p1.1.m1.1.1.1.1.1" xref="S5.SS1.p1.1.m1.1.1.1.1.1.cmml">⁢</mo><mn id="S5.SS1.p1.1.m1.1.1.1.1.3" xref="S5.SS1.p1.1.m1.1.1.1.1.3.cmml">0</mn></mrow><mo id="S5.SS1.p1.1.m1.4.4.4.6" xref="S5.SS1.p1.1.m1.4.4.5.cmml">,</mo><mrow id="S5.SS1.p1.1.m1.2.2.2.2" xref="S5.SS1.p1.1.m1.2.2.2.2.cmml"><mi id="S5.SS1.p1.1.m1.2.2.2.2.2" xref="S5.SS1.p1.1.m1.2.2.2.2.2.cmml">A</mi><mo id="S5.SS1.p1.1.m1.2.2.2.2.1" xref="S5.SS1.p1.1.m1.2.2.2.2.1.cmml">⁢</mo><mn id="S5.SS1.p1.1.m1.2.2.2.2.3" xref="S5.SS1.p1.1.m1.2.2.2.2.3.cmml">1</mn></mrow><mo id="S5.SS1.p1.1.m1.4.4.4.7" xref="S5.SS1.p1.1.m1.4.4.5.cmml">,</mo><mrow id="S5.SS1.p1.1.m1.3.3.3.3" xref="S5.SS1.p1.1.m1.3.3.3.3.cmml"><mi id="S5.SS1.p1.1.m1.3.3.3.3.2" xref="S5.SS1.p1.1.m1.3.3.3.3.2.cmml">A</mi><mo id="S5.SS1.p1.1.m1.3.3.3.3.1" xref="S5.SS1.p1.1.m1.3.3.3.3.1.cmml">⁢</mo><mn id="S5.SS1.p1.1.m1.3.3.3.3.3" xref="S5.SS1.p1.1.m1.3.3.3.3.3.cmml">2</mn></mrow><mo id="S5.SS1.p1.1.m1.4.4.4.8" xref="S5.SS1.p1.1.m1.4.4.5.cmml">,</mo><mrow id="S5.SS1.p1.1.m1.4.4.4.4" xref="S5.SS1.p1.1.m1.4.4.4.4.cmml"><mi id="S5.SS1.p1.1.m1.4.4.4.4.2" xref="S5.SS1.p1.1.m1.4.4.4.4.2.cmml">A</mi><mo id="S5.SS1.p1.1.m1.4.4.4.4.1" 
xref="S5.SS1.p1.1.m1.4.4.4.4.1.cmml">⁢</mo><mn id="S5.SS1.p1.1.m1.4.4.4.4.3" xref="S5.SS1.p1.1.m1.4.4.4.4.3.cmml">3</mn></mrow><mo id="S5.SS1.p1.1.m1.4.4.4.9" stretchy="false" xref="S5.SS1.p1.1.m1.4.4.5.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.1.m1.4b"><vector id="S5.SS1.p1.1.m1.4.4.5.cmml" xref="S5.SS1.p1.1.m1.4.4.4"><apply id="S5.SS1.p1.1.m1.1.1.1.1.cmml" xref="S5.SS1.p1.1.m1.1.1.1.1"><times id="S5.SS1.p1.1.m1.1.1.1.1.1.cmml" xref="S5.SS1.p1.1.m1.1.1.1.1.1"></times><ci id="S5.SS1.p1.1.m1.1.1.1.1.2.cmml" xref="S5.SS1.p1.1.m1.1.1.1.1.2">𝐴</ci><cn id="S5.SS1.p1.1.m1.1.1.1.1.3.cmml" type="integer" xref="S5.SS1.p1.1.m1.1.1.1.1.3">0</cn></apply><apply id="S5.SS1.p1.1.m1.2.2.2.2.cmml" xref="S5.SS1.p1.1.m1.2.2.2.2"><times id="S5.SS1.p1.1.m1.2.2.2.2.1.cmml" xref="S5.SS1.p1.1.m1.2.2.2.2.1"></times><ci id="S5.SS1.p1.1.m1.2.2.2.2.2.cmml" xref="S5.SS1.p1.1.m1.2.2.2.2.2">𝐴</ci><cn id="S5.SS1.p1.1.m1.2.2.2.2.3.cmml" type="integer" xref="S5.SS1.p1.1.m1.2.2.2.2.3">1</cn></apply><apply id="S5.SS1.p1.1.m1.3.3.3.3.cmml" xref="S5.SS1.p1.1.m1.3.3.3.3"><times id="S5.SS1.p1.1.m1.3.3.3.3.1.cmml" xref="S5.SS1.p1.1.m1.3.3.3.3.1"></times><ci id="S5.SS1.p1.1.m1.3.3.3.3.2.cmml" xref="S5.SS1.p1.1.m1.3.3.3.3.2">𝐴</ci><cn id="S5.SS1.p1.1.m1.3.3.3.3.3.cmml" type="integer" xref="S5.SS1.p1.1.m1.3.3.3.3.3">2</cn></apply><apply id="S5.SS1.p1.1.m1.4.4.4.4.cmml" xref="S5.SS1.p1.1.m1.4.4.4.4"><times id="S5.SS1.p1.1.m1.4.4.4.4.1.cmml" xref="S5.SS1.p1.1.m1.4.4.4.4.1"></times><ci id="S5.SS1.p1.1.m1.4.4.4.4.2.cmml" xref="S5.SS1.p1.1.m1.4.4.4.4.2">𝐴</ci><cn id="S5.SS1.p1.1.m1.4.4.4.4.3.cmml" type="integer" xref="S5.SS1.p1.1.m1.4.4.4.4.3">3</cn></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.1.m1.4c">(A0,A1,A2,A3)</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.1.m1.4d">( italic_A 0 , italic_A 1 , italic_A 2 , italic_A 3 )</annotation></semantics></math>, and the ones in blue color are candidate tokens, e.g., 
<math alttext="(A4,A5,A6)" class="ltx_Math" display="inline" id="S5.SS1.p1.2.m2.3"><semantics id="S5.SS1.p1.2.m2.3a"><mrow id="S5.SS1.p1.2.m2.3.3.3" xref="S5.SS1.p1.2.m2.3.3.4.cmml"><mo id="S5.SS1.p1.2.m2.3.3.3.4" stretchy="false" xref="S5.SS1.p1.2.m2.3.3.4.cmml">(</mo><mrow id="S5.SS1.p1.2.m2.1.1.1.1" xref="S5.SS1.p1.2.m2.1.1.1.1.cmml"><mi id="S5.SS1.p1.2.m2.1.1.1.1.2" xref="S5.SS1.p1.2.m2.1.1.1.1.2.cmml">A</mi><mo id="S5.SS1.p1.2.m2.1.1.1.1.1" xref="S5.SS1.p1.2.m2.1.1.1.1.1.cmml">⁢</mo><mn id="S5.SS1.p1.2.m2.1.1.1.1.3" xref="S5.SS1.p1.2.m2.1.1.1.1.3.cmml">4</mn></mrow><mo id="S5.SS1.p1.2.m2.3.3.3.5" xref="S5.SS1.p1.2.m2.3.3.4.cmml">,</mo><mrow id="S5.SS1.p1.2.m2.2.2.2.2" xref="S5.SS1.p1.2.m2.2.2.2.2.cmml"><mi id="S5.SS1.p1.2.m2.2.2.2.2.2" xref="S5.SS1.p1.2.m2.2.2.2.2.2.cmml">A</mi><mo id="S5.SS1.p1.2.m2.2.2.2.2.1" xref="S5.SS1.p1.2.m2.2.2.2.2.1.cmml">⁢</mo><mn id="S5.SS1.p1.2.m2.2.2.2.2.3" xref="S5.SS1.p1.2.m2.2.2.2.2.3.cmml">5</mn></mrow><mo id="S5.SS1.p1.2.m2.3.3.3.6" xref="S5.SS1.p1.2.m2.3.3.4.cmml">,</mo><mrow id="S5.SS1.p1.2.m2.3.3.3.3" xref="S5.SS1.p1.2.m2.3.3.3.3.cmml"><mi id="S5.SS1.p1.2.m2.3.3.3.3.2" xref="S5.SS1.p1.2.m2.3.3.3.3.2.cmml">A</mi><mo id="S5.SS1.p1.2.m2.3.3.3.3.1" xref="S5.SS1.p1.2.m2.3.3.3.3.1.cmml">⁢</mo><mn id="S5.SS1.p1.2.m2.3.3.3.3.3" xref="S5.SS1.p1.2.m2.3.3.3.3.3.cmml">6</mn></mrow><mo id="S5.SS1.p1.2.m2.3.3.3.7" stretchy="false" xref="S5.SS1.p1.2.m2.3.3.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.2.m2.3b"><vector id="S5.SS1.p1.2.m2.3.3.4.cmml" xref="S5.SS1.p1.2.m2.3.3.3"><apply id="S5.SS1.p1.2.m2.1.1.1.1.cmml" xref="S5.SS1.p1.2.m2.1.1.1.1"><times id="S5.SS1.p1.2.m2.1.1.1.1.1.cmml" xref="S5.SS1.p1.2.m2.1.1.1.1.1"></times><ci id="S5.SS1.p1.2.m2.1.1.1.1.2.cmml" xref="S5.SS1.p1.2.m2.1.1.1.1.2">𝐴</ci><cn id="S5.SS1.p1.2.m2.1.1.1.1.3.cmml" type="integer" xref="S5.SS1.p1.2.m2.1.1.1.1.3">4</cn></apply><apply id="S5.SS1.p1.2.m2.2.2.2.2.cmml" xref="S5.SS1.p1.2.m2.2.2.2.2"><times 
id="S5.SS1.p1.2.m2.2.2.2.2.1.cmml" xref="S5.SS1.p1.2.m2.2.2.2.2.1"></times><ci id="S5.SS1.p1.2.m2.2.2.2.2.2.cmml" xref="S5.SS1.p1.2.m2.2.2.2.2.2">𝐴</ci><cn id="S5.SS1.p1.2.m2.2.2.2.2.3.cmml" type="integer" xref="S5.SS1.p1.2.m2.2.2.2.2.3">5</cn></apply><apply id="S5.SS1.p1.2.m2.3.3.3.3.cmml" xref="S5.SS1.p1.2.m2.3.3.3.3"><times id="S5.SS1.p1.2.m2.3.3.3.3.1.cmml" xref="S5.SS1.p1.2.m2.3.3.3.3.1"></times><ci id="S5.SS1.p1.2.m2.3.3.3.3.2.cmml" xref="S5.SS1.p1.2.m2.3.3.3.3.2">𝐴</ci><cn id="S5.SS1.p1.2.m2.3.3.3.3.3.cmml" type="integer" xref="S5.SS1.p1.2.m2.3.3.3.3.3">6</cn></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.2.m2.3c">(A4,A5,A6)</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.2.m2.3d">( italic_A 4 , italic_A 5 , italic_A 6 )</annotation></semantics></math>. In order to align KV tensors for self-attention computation, existing inference engines <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib20" title="">20</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib23" title="">23</a>]</cite> insert four padding tokens into requests <math alttext="B" class="ltx_Math" display="inline" id="S5.SS1.p1.3.m3.1"><semantics id="S5.SS1.p1.3.m3.1a"><mi id="S5.SS1.p1.3.m3.1.1" xref="S5.SS1.p1.3.m3.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.3.m3.1b"><ci id="S5.SS1.p1.3.m3.1.1.cmml" xref="S5.SS1.p1.3.m3.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.3.m3.1c">B</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.3.m3.1d">italic_B</annotation></semantics></math> and <math alttext="C" class="ltx_Math" display="inline" id="S5.SS1.p1.4.m4.1"><semantics id="S5.SS1.p1.4.m4.1a"><mi id="S5.SS1.p1.4.m4.1.1" xref="S5.SS1.p1.4.m4.1.1.cmml">C</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.4.m4.1b"><ci id="S5.SS1.p1.4.m4.1.1.cmml" 
xref="S5.SS1.p1.4.m4.1.1">𝐶</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.4.m4.1c">C</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.4.m4.1d">italic_C</annotation></semantics></math>. In contrast, <span class="ltx_text ltx_font_smallcaps" id="S5.SS1.p1.5.1">Spin</span> decomposes request <math alttext="A" class="ltx_Math" display="inline" id="S5.SS1.p1.5.m5.1"><semantics id="S5.SS1.p1.5.m5.1a"><mi id="S5.SS1.p1.5.m5.1.1" xref="S5.SS1.p1.5.m5.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.5.m5.1b"><ci id="S5.SS1.p1.5.m5.1.1.cmml" xref="S5.SS1.p1.5.m5.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.5.m5.1c">A</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.5.m5.1d">italic_A</annotation></semantics></math> into two sub-requests so that it aligns with the other two requests without padding tokens.</p> </div> <div class="ltx_para" id="S5.SS1.p2"> <p class="ltx_p" id="S5.SS1.p2.3">However, decomposing requests could lead to incorrect computation results in the self-attention layer. Typically, the inference engine calculates the self-attention outputs for each request in the batch separately using (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.E2" title="In II-A1 Autoregressive Decoding ‣ II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">2</span></a>). 
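</p> </div> <div class="ltx_para"> <p class="ltx_p">The effect of splitting one request across rows can be checked with a small numerical sketch. It assumes that the score function is the usual exponential of an attention logit, as in standard softmax attention; the logits and the helper names <code>attention_weights</code> and <code>split_attention_weights</code> are hypothetical. Normalizing each row on its own gives weights that sum to 2 instead of 1, while a denominator that covers every row of the same request reproduces the unsplit result.</p>

```python
import math


def attention_weights(scores):
    """Softmax over one row of attention logits (assumes F = exp)."""
    exps = [math.exp(s) for s in scores]
    denom = sum(exps)
    return [e / denom for e in exps]


def split_attention_weights(chunks):
    """Softmax for one request whose keys are stored in several rows.

    The denominator sums the exponentials of every chunk that belongs
    to the request, not just the chunk stored in the current row.
    """
    denom = sum(math.exp(s) for chunk in chunks for s in chunk)
    return [[math.exp(s) / denom for s in chunk] for chunk in chunks]


# Hypothetical logits of one query against the 7 keys of one request.
scores = [0.2, 1.0, -0.5, 0.7, 0.1, 0.3, 0.9]
chunks = [scores[:4], scores[4:]]  # the request split into two sub-requests

reference = attention_weights(scores)
naive = [w for chunk in chunks for w in attention_weights(chunk)]
fixed = [w for chunk in split_attention_weights(chunks) for w in chunk]

# Each chunk normalizes to 1 on its own, so the naive weights sum to about 2.
print(sum(naive))
# Largest deviation of the request-wide normalization from the reference.
print(max(abs(a - b) for a, b in zip(reference, fixed)))
```

</div> <div class="ltx_para"> <p class="ltx_p">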
When a request is divided into multiple parts located in different rows, the denominator of (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.E2" title="In II-A1 Autoregressive Decoding ‣ II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">2</span></a>) cannot sum all the <math alttext="\mathcal{F}(Q_{i},K_{j})" class="ltx_Math" display="inline" id="S5.SS1.p2.1.m1.2"><semantics id="S5.SS1.p2.1.m1.2a"><mrow id="S5.SS1.p2.1.m1.2.2" xref="S5.SS1.p2.1.m1.2.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S5.SS1.p2.1.m1.2.2.4" xref="S5.SS1.p2.1.m1.2.2.4.cmml">ℱ</mi><mo id="S5.SS1.p2.1.m1.2.2.3" xref="S5.SS1.p2.1.m1.2.2.3.cmml">⁢</mo><mrow id="S5.SS1.p2.1.m1.2.2.2.2" xref="S5.SS1.p2.1.m1.2.2.2.3.cmml"><mo id="S5.SS1.p2.1.m1.2.2.2.2.3" stretchy="false" xref="S5.SS1.p2.1.m1.2.2.2.3.cmml">(</mo><msub id="S5.SS1.p2.1.m1.1.1.1.1.1" xref="S5.SS1.p2.1.m1.1.1.1.1.1.cmml"><mi id="S5.SS1.p2.1.m1.1.1.1.1.1.2" xref="S5.SS1.p2.1.m1.1.1.1.1.1.2.cmml">Q</mi><mi id="S5.SS1.p2.1.m1.1.1.1.1.1.3" xref="S5.SS1.p2.1.m1.1.1.1.1.1.3.cmml">i</mi></msub><mo id="S5.SS1.p2.1.m1.2.2.2.2.4" xref="S5.SS1.p2.1.m1.2.2.2.3.cmml">,</mo><msub id="S5.SS1.p2.1.m1.2.2.2.2.2" xref="S5.SS1.p2.1.m1.2.2.2.2.2.cmml"><mi id="S5.SS1.p2.1.m1.2.2.2.2.2.2" xref="S5.SS1.p2.1.m1.2.2.2.2.2.2.cmml">K</mi><mi id="S5.SS1.p2.1.m1.2.2.2.2.2.3" xref="S5.SS1.p2.1.m1.2.2.2.2.2.3.cmml">j</mi></msub><mo id="S5.SS1.p2.1.m1.2.2.2.2.5" stretchy="false" xref="S5.SS1.p2.1.m1.2.2.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S5.SS1.p2.1.m1.2b"><apply id="S5.SS1.p2.1.m1.2.2.cmml" xref="S5.SS1.p2.1.m1.2.2"><times id="S5.SS1.p2.1.m1.2.2.3.cmml" xref="S5.SS1.p2.1.m1.2.2.3"></times><ci id="S5.SS1.p2.1.m1.2.2.4.cmml" xref="S5.SS1.p2.1.m1.2.2.4">ℱ</ci><interval closure="open" id="S5.SS1.p2.1.m1.2.2.2.3.cmml" xref="S5.SS1.p2.1.m1.2.2.2.2"><apply 
id="S5.SS1.p2.1.m1.1.1.1.1.1.cmml" xref="S5.SS1.p2.1.m1.1.1.1.1.1"><csymbol cd="ambiguous" id="S5.SS1.p2.1.m1.1.1.1.1.1.1.cmml" xref="S5.SS1.p2.1.m1.1.1.1.1.1">subscript</csymbol><ci id="S5.SS1.p2.1.m1.1.1.1.1.1.2.cmml" xref="S5.SS1.p2.1.m1.1.1.1.1.1.2">𝑄</ci><ci id="S5.SS1.p2.1.m1.1.1.1.1.1.3.cmml" xref="S5.SS1.p2.1.m1.1.1.1.1.1.3">𝑖</ci></apply><apply id="S5.SS1.p2.1.m1.2.2.2.2.2.cmml" xref="S5.SS1.p2.1.m1.2.2.2.2.2"><csymbol cd="ambiguous" id="S5.SS1.p2.1.m1.2.2.2.2.2.1.cmml" xref="S5.SS1.p2.1.m1.2.2.2.2.2">subscript</csymbol><ci id="S5.SS1.p2.1.m1.2.2.2.2.2.2.cmml" xref="S5.SS1.p2.1.m1.2.2.2.2.2.2">𝐾</ci><ci id="S5.SS1.p2.1.m1.2.2.2.2.2.3.cmml" xref="S5.SS1.p2.1.m1.2.2.2.2.2.3">𝑗</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p2.1.m1.2c">\mathcal{F}(Q_{i},K_{j})</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p2.1.m1.2d">caligraphic_F ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )</annotation></semantics></math> belonging to the same request, where <math alttext="i" class="ltx_Math" display="inline" id="S5.SS1.p2.2.m2.1"><semantics id="S5.SS1.p2.2.m2.1a"><mi id="S5.SS1.p2.2.m2.1.1" xref="S5.SS1.p2.2.m2.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p2.2.m2.1b"><ci id="S5.SS1.p2.2.m2.1.1.cmml" xref="S5.SS1.p2.2.m2.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p2.2.m2.1c">i</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p2.2.m2.1d">italic_i</annotation></semantics></math> and <math alttext="j" class="ltx_Math" display="inline" id="S5.SS1.p2.3.m3.1"><semantics id="S5.SS1.p2.3.m3.1a"><mi id="S5.SS1.p2.3.m3.1.1" xref="S5.SS1.p2.3.m3.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p2.3.m3.1b"><ci id="S5.SS1.p2.3.m3.1.1.cmml" xref="S5.SS1.p2.3.m3.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" 
id="S5.SS1.p2.3.m3.1c">j</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p2.3.m3.1d">italic_j</annotation></semantics></math> are indices of different tokens.</p> </div> <div class="ltx_para" id="S5.SS1.p3"> <p class="ltx_p" id="S5.SS1.p3.1">In order to ensure the correctness of self-attention computation, we modify the self-attention computation in Eq. (<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.E2" title="In II-A1 Autoregressive Decoding ‣ II-A Background ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">2</span></a>) for request <math alttext="S" class="ltx_Math" display="inline" id="S5.SS1.p3.1.m1.1"><semantics id="S5.SS1.p3.1.m1.1a"><mi id="S5.SS1.p3.1.m1.1.1" xref="S5.SS1.p3.1.m1.1.1.cmml">S</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.1.m1.1b"><ci id="S5.SS1.p3.1.m1.1.1.cmml" xref="S5.SS1.p3.1.m1.1.1">𝑆</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.1.m1.1c">S</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p3.1.m1.1d">italic_S</annotation></semantics></math> as follows:</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S9.EGx7"> <tbody id="S5.E13"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle a_{i,j}=\frac{\mathcal{F}(Q_{i},K_{j})}{\sum\limits_{j,r_{j}\in D% }\mathcal{F}(Q_{i},K_{j})I_{j,S}}," class="ltx_Math" display="inline" id="S5.E13.m1.11"><semantics id="S5.E13.m1.11a"><mrow id="S5.E13.m1.11.11.1" xref="S5.E13.m1.11.11.1.1.cmml"><mrow id="S5.E13.m1.11.11.1.1" xref="S5.E13.m1.11.11.1.1.cmml"><msub id="S5.E13.m1.11.11.1.1.2" xref="S5.E13.m1.11.11.1.1.2.cmml"><mi id="S5.E13.m1.11.11.1.1.2.2" xref="S5.E13.m1.11.11.1.1.2.2.cmml">a</mi><mrow id="S5.E13.m1.2.2.2.4" 
xref="S5.E13.m1.2.2.2.3.cmml"><mi id="S5.E13.m1.1.1.1.1" xref="S5.E13.m1.1.1.1.1.cmml">i</mi><mo id="S5.E13.m1.2.2.2.4.1" xref="S5.E13.m1.2.2.2.3.cmml">,</mo><mi id="S5.E13.m1.2.2.2.2" xref="S5.E13.m1.2.2.2.2.cmml">j</mi></mrow></msub><mo id="S5.E13.m1.11.11.1.1.1" xref="S5.E13.m1.11.11.1.1.1.cmml">=</mo><mstyle displaystyle="true" id="S5.E13.m1.10.10" xref="S5.E13.m1.10.10.cmml"><mfrac id="S5.E13.m1.10.10a" xref="S5.E13.m1.10.10.cmml"><mrow id="S5.E13.m1.4.4.2" xref="S5.E13.m1.4.4.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S5.E13.m1.4.4.2.4" xref="S5.E13.m1.4.4.2.4.cmml">ℱ</mi><mo id="S5.E13.m1.4.4.2.3" xref="S5.E13.m1.4.4.2.3.cmml">⁢</mo><mrow id="S5.E13.m1.4.4.2.2.2" xref="S5.E13.m1.4.4.2.2.3.cmml"><mo id="S5.E13.m1.4.4.2.2.2.3" stretchy="false" xref="S5.E13.m1.4.4.2.2.3.cmml">(</mo><msub id="S5.E13.m1.3.3.1.1.1.1" xref="S5.E13.m1.3.3.1.1.1.1.cmml"><mi id="S5.E13.m1.3.3.1.1.1.1.2" xref="S5.E13.m1.3.3.1.1.1.1.2.cmml">Q</mi><mi id="S5.E13.m1.3.3.1.1.1.1.3" xref="S5.E13.m1.3.3.1.1.1.1.3.cmml">i</mi></msub><mo id="S5.E13.m1.4.4.2.2.2.4" xref="S5.E13.m1.4.4.2.2.3.cmml">,</mo><msub id="S5.E13.m1.4.4.2.2.2.2" xref="S5.E13.m1.4.4.2.2.2.2.cmml"><mi id="S5.E13.m1.4.4.2.2.2.2.2" xref="S5.E13.m1.4.4.2.2.2.2.2.cmml">K</mi><mi id="S5.E13.m1.4.4.2.2.2.2.3" xref="S5.E13.m1.4.4.2.2.2.2.3.cmml">j</mi></msub><mo id="S5.E13.m1.4.4.2.2.2.5" stretchy="false" xref="S5.E13.m1.4.4.2.2.3.cmml">)</mo></mrow></mrow><mrow id="S5.E13.m1.10.10.8" xref="S5.E13.m1.10.10.8.cmml"><munder id="S5.E13.m1.10.10.8.7" xref="S5.E13.m1.10.10.8.7.cmml"><mo id="S5.E13.m1.10.10.8.7.2" movablelimits="false" xref="S5.E13.m1.10.10.8.7.2.cmml">∑</mo><mrow id="S5.E13.m1.6.6.4.2.2" xref="S5.E13.m1.6.6.4.2.2.cmml"><mrow id="S5.E13.m1.6.6.4.2.2.2.1" xref="S5.E13.m1.6.6.4.2.2.2.2.cmml"><mi id="S5.E13.m1.5.5.3.1.1.1" xref="S5.E13.m1.5.5.3.1.1.1.cmml">j</mi><mo id="S5.E13.m1.6.6.4.2.2.2.1.2" xref="S5.E13.m1.6.6.4.2.2.2.2.cmml">,</mo><msub id="S5.E13.m1.6.6.4.2.2.2.1.1" xref="S5.E13.m1.6.6.4.2.2.2.1.1.cmml"><mi 
id="S5.E13.m1.6.6.4.2.2.2.1.1.2" xref="S5.E13.m1.6.6.4.2.2.2.1.1.2.cmml">r</mi><mi id="S5.E13.m1.6.6.4.2.2.2.1.1.3" xref="S5.E13.m1.6.6.4.2.2.2.1.1.3.cmml">j</mi></msub></mrow><mo id="S5.E13.m1.6.6.4.2.2.3" xref="S5.E13.m1.6.6.4.2.2.3.cmml">∈</mo><mi id="S5.E13.m1.6.6.4.2.2.4" xref="S5.E13.m1.6.6.4.2.2.4.cmml">D</mi></mrow></munder><mrow id="S5.E13.m1.10.10.8.6" xref="S5.E13.m1.10.10.8.6.cmml"><mi class="ltx_font_mathcaligraphic" id="S5.E13.m1.10.10.8.6.4" xref="S5.E13.m1.10.10.8.6.4.cmml">ℱ</mi><mo id="S5.E13.m1.10.10.8.6.3" xref="S5.E13.m1.10.10.8.6.3.cmml">⁢</mo><mrow id="S5.E13.m1.10.10.8.6.2.2" xref="S5.E13.m1.10.10.8.6.2.3.cmml"><mo id="S5.E13.m1.10.10.8.6.2.2.3" stretchy="false" xref="S5.E13.m1.10.10.8.6.2.3.cmml">(</mo><msub id="S5.E13.m1.9.9.7.5.1.1.1" xref="S5.E13.m1.9.9.7.5.1.1.1.cmml"><mi id="S5.E13.m1.9.9.7.5.1.1.1.2" xref="S5.E13.m1.9.9.7.5.1.1.1.2.cmml">Q</mi><mi id="S5.E13.m1.9.9.7.5.1.1.1.3" xref="S5.E13.m1.9.9.7.5.1.1.1.3.cmml">i</mi></msub><mo id="S5.E13.m1.10.10.8.6.2.2.4" xref="S5.E13.m1.10.10.8.6.2.3.cmml">,</mo><msub id="S5.E13.m1.10.10.8.6.2.2.2" xref="S5.E13.m1.10.10.8.6.2.2.2.cmml"><mi id="S5.E13.m1.10.10.8.6.2.2.2.2" xref="S5.E13.m1.10.10.8.6.2.2.2.2.cmml">K</mi><mi id="S5.E13.m1.10.10.8.6.2.2.2.3" xref="S5.E13.m1.10.10.8.6.2.2.2.3.cmml">j</mi></msub><mo id="S5.E13.m1.10.10.8.6.2.2.5" stretchy="false" xref="S5.E13.m1.10.10.8.6.2.3.cmml">)</mo></mrow><mo id="S5.E13.m1.10.10.8.6.3a" xref="S5.E13.m1.10.10.8.6.3.cmml">⁢</mo><msub id="S5.E13.m1.10.10.8.6.5" xref="S5.E13.m1.10.10.8.6.5.cmml"><mi id="S5.E13.m1.10.10.8.6.5.2" xref="S5.E13.m1.10.10.8.6.5.2.cmml">I</mi><mrow id="S5.E13.m1.8.8.6.4.2.4" xref="S5.E13.m1.8.8.6.4.2.3.cmml"><mi id="S5.E13.m1.7.7.5.3.1.1" xref="S5.E13.m1.7.7.5.3.1.1.cmml">j</mi><mo id="S5.E13.m1.8.8.6.4.2.4.1" xref="S5.E13.m1.8.8.6.4.2.3.cmml">,</mo><mi id="S5.E13.m1.8.8.6.4.2.2" xref="S5.E13.m1.8.8.6.4.2.2.cmml">S</mi></mrow></msub></mrow></mrow></mfrac></mstyle></mrow><mo id="S5.E13.m1.11.11.1.2" 
xref="S5.E13.m1.11.11.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S5.E13.m1.11b"><apply id="S5.E13.m1.11.11.1.1.cmml" xref="S5.E13.m1.11.11.1"><eq id="S5.E13.m1.11.11.1.1.1.cmml" xref="S5.E13.m1.11.11.1.1.1"></eq><apply id="S5.E13.m1.11.11.1.1.2.cmml" xref="S5.E13.m1.11.11.1.1.2"><csymbol cd="ambiguous" id="S5.E13.m1.11.11.1.1.2.1.cmml" xref="S5.E13.m1.11.11.1.1.2">subscript</csymbol><ci id="S5.E13.m1.11.11.1.1.2.2.cmml" xref="S5.E13.m1.11.11.1.1.2.2">𝑎</ci><list id="S5.E13.m1.2.2.2.3.cmml" xref="S5.E13.m1.2.2.2.4"><ci id="S5.E13.m1.1.1.1.1.cmml" xref="S5.E13.m1.1.1.1.1">𝑖</ci><ci id="S5.E13.m1.2.2.2.2.cmml" xref="S5.E13.m1.2.2.2.2">𝑗</ci></list></apply><apply id="S5.E13.m1.10.10.cmml" xref="S5.E13.m1.10.10"><divide id="S5.E13.m1.10.10.9.cmml" xref="S5.E13.m1.10.10"></divide><apply id="S5.E13.m1.4.4.2.cmml" xref="S5.E13.m1.4.4.2"><times id="S5.E13.m1.4.4.2.3.cmml" xref="S5.E13.m1.4.4.2.3"></times><ci id="S5.E13.m1.4.4.2.4.cmml" xref="S5.E13.m1.4.4.2.4">ℱ</ci><interval closure="open" id="S5.E13.m1.4.4.2.2.3.cmml" xref="S5.E13.m1.4.4.2.2.2"><apply id="S5.E13.m1.3.3.1.1.1.1.cmml" xref="S5.E13.m1.3.3.1.1.1.1"><csymbol cd="ambiguous" id="S5.E13.m1.3.3.1.1.1.1.1.cmml" xref="S5.E13.m1.3.3.1.1.1.1">subscript</csymbol><ci id="S5.E13.m1.3.3.1.1.1.1.2.cmml" xref="S5.E13.m1.3.3.1.1.1.1.2">𝑄</ci><ci id="S5.E13.m1.3.3.1.1.1.1.3.cmml" xref="S5.E13.m1.3.3.1.1.1.1.3">𝑖</ci></apply><apply id="S5.E13.m1.4.4.2.2.2.2.cmml" xref="S5.E13.m1.4.4.2.2.2.2"><csymbol cd="ambiguous" id="S5.E13.m1.4.4.2.2.2.2.1.cmml" xref="S5.E13.m1.4.4.2.2.2.2">subscript</csymbol><ci id="S5.E13.m1.4.4.2.2.2.2.2.cmml" xref="S5.E13.m1.4.4.2.2.2.2.2">𝐾</ci><ci id="S5.E13.m1.4.4.2.2.2.2.3.cmml" xref="S5.E13.m1.4.4.2.2.2.2.3">𝑗</ci></apply></interval></apply><apply id="S5.E13.m1.10.10.8.cmml" xref="S5.E13.m1.10.10.8"><apply id="S5.E13.m1.10.10.8.7.cmml" xref="S5.E13.m1.10.10.8.7"><csymbol cd="ambiguous" id="S5.E13.m1.10.10.8.7.1.cmml" xref="S5.E13.m1.10.10.8.7">subscript</csymbol><sum 
id="S5.E13.m1.10.10.8.7.2.cmml" xref="S5.E13.m1.10.10.8.7.2"></sum><apply id="S5.E13.m1.6.6.4.2.2.cmml" xref="S5.E13.m1.6.6.4.2.2"><in id="S5.E13.m1.6.6.4.2.2.3.cmml" xref="S5.E13.m1.6.6.4.2.2.3"></in><list id="S5.E13.m1.6.6.4.2.2.2.2.cmml" xref="S5.E13.m1.6.6.4.2.2.2.1"><ci id="S5.E13.m1.5.5.3.1.1.1.cmml" xref="S5.E13.m1.5.5.3.1.1.1">𝑗</ci><apply id="S5.E13.m1.6.6.4.2.2.2.1.1.cmml" xref="S5.E13.m1.6.6.4.2.2.2.1.1"><csymbol cd="ambiguous" id="S5.E13.m1.6.6.4.2.2.2.1.1.1.cmml" xref="S5.E13.m1.6.6.4.2.2.2.1.1">subscript</csymbol><ci id="S5.E13.m1.6.6.4.2.2.2.1.1.2.cmml" xref="S5.E13.m1.6.6.4.2.2.2.1.1.2">𝑟</ci><ci id="S5.E13.m1.6.6.4.2.2.2.1.1.3.cmml" xref="S5.E13.m1.6.6.4.2.2.2.1.1.3">𝑗</ci></apply></list><ci id="S5.E13.m1.6.6.4.2.2.4.cmml" xref="S5.E13.m1.6.6.4.2.2.4">𝐷</ci></apply></apply><apply id="S5.E13.m1.10.10.8.6.cmml" xref="S5.E13.m1.10.10.8.6"><times id="S5.E13.m1.10.10.8.6.3.cmml" xref="S5.E13.m1.10.10.8.6.3"></times><ci id="S5.E13.m1.10.10.8.6.4.cmml" xref="S5.E13.m1.10.10.8.6.4">ℱ</ci><interval closure="open" id="S5.E13.m1.10.10.8.6.2.3.cmml" xref="S5.E13.m1.10.10.8.6.2.2"><apply id="S5.E13.m1.9.9.7.5.1.1.1.cmml" xref="S5.E13.m1.9.9.7.5.1.1.1"><csymbol cd="ambiguous" id="S5.E13.m1.9.9.7.5.1.1.1.1.cmml" xref="S5.E13.m1.9.9.7.5.1.1.1">subscript</csymbol><ci id="S5.E13.m1.9.9.7.5.1.1.1.2.cmml" xref="S5.E13.m1.9.9.7.5.1.1.1.2">𝑄</ci><ci id="S5.E13.m1.9.9.7.5.1.1.1.3.cmml" xref="S5.E13.m1.9.9.7.5.1.1.1.3">𝑖</ci></apply><apply id="S5.E13.m1.10.10.8.6.2.2.2.cmml" xref="S5.E13.m1.10.10.8.6.2.2.2"><csymbol cd="ambiguous" id="S5.E13.m1.10.10.8.6.2.2.2.1.cmml" xref="S5.E13.m1.10.10.8.6.2.2.2">subscript</csymbol><ci id="S5.E13.m1.10.10.8.6.2.2.2.2.cmml" xref="S5.E13.m1.10.10.8.6.2.2.2.2">𝐾</ci><ci id="S5.E13.m1.10.10.8.6.2.2.2.3.cmml" xref="S5.E13.m1.10.10.8.6.2.2.2.3">𝑗</ci></apply></interval><apply id="S5.E13.m1.10.10.8.6.5.cmml" xref="S5.E13.m1.10.10.8.6.5"><csymbol cd="ambiguous" id="S5.E13.m1.10.10.8.6.5.1.cmml" 
xref="S5.E13.m1.10.10.8.6.5">subscript</csymbol><ci id="S5.E13.m1.10.10.8.6.5.2.cmml" xref="S5.E13.m1.10.10.8.6.5.2">𝐼</ci><list id="S5.E13.m1.8.8.6.4.2.3.cmml" xref="S5.E13.m1.8.8.6.4.2.4"><ci id="S5.E13.m1.7.7.5.3.1.1.cmml" xref="S5.E13.m1.7.7.5.3.1.1">𝑗</ci><ci id="S5.E13.m1.8.8.6.4.2.2.cmml" xref="S5.E13.m1.8.8.6.4.2.2">𝑆</ci></list></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.E13.m1.11c">\displaystyle a_{i,j}=\frac{\mathcal{F}(Q_{i},K_{j})}{\sum\limits_{j,r_{j}\in D% }\mathcal{F}(Q_{i},K_{j})I_{j,S}},</annotation><annotation encoding="application/x-llamapun" id="S5.E13.m1.11d">italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = divide start_ARG caligraphic_F ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j , italic_r start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_D end_POSTSUBSCRIPT caligraphic_F ( italic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_I start_POSTSUBSCRIPT italic_j , italic_S end_POSTSUBSCRIPT end_ARG ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(13)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S5.SS1.p3.6">where <math alttext="S" class="ltx_Math" display="inline" id="S5.SS1.p3.2.m1.1"><semantics id="S5.SS1.p3.2.m1.1a"><mi id="S5.SS1.p3.2.m1.1.1" xref="S5.SS1.p3.2.m1.1.1.cmml">S</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.2.m1.1b"><ci id="S5.SS1.p3.2.m1.1.1.cmml" xref="S5.SS1.p3.2.m1.1.1">𝑆</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.2.m1.1c">S</annotation><annotation encoding="application/x-llamapun" 
id="S5.SS1.p3.2.m1.1d">italic_S</annotation></semantics></math> is a request before decomposition and <math alttext="D" class="ltx_Math" display="inline" id="S5.SS1.p3.3.m2.1"><semantics id="S5.SS1.p3.3.m2.1a"><mi id="S5.SS1.p3.3.m2.1.1" xref="S5.SS1.p3.3.m2.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.3.m2.1b"><ci id="S5.SS1.p3.3.m2.1.1.cmml" xref="S5.SS1.p3.3.m2.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.3.m2.1c">D</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p3.3.m2.1d">italic_D</annotation></semantics></math> is the set of all tokens in the batch. We use the binary indicator <math alttext="I_{j,S}" class="ltx_Math" display="inline" id="S5.SS1.p3.4.m3.2"><semantics id="S5.SS1.p3.4.m3.2a"><msub id="S5.SS1.p3.4.m3.2.3" xref="S5.SS1.p3.4.m3.2.3.cmml"><mi id="S5.SS1.p3.4.m3.2.3.2" xref="S5.SS1.p3.4.m3.2.3.2.cmml">I</mi><mrow id="S5.SS1.p3.4.m3.2.2.2.4" xref="S5.SS1.p3.4.m3.2.2.2.3.cmml"><mi id="S5.SS1.p3.4.m3.1.1.1.1" xref="S5.SS1.p3.4.m3.1.1.1.1.cmml">j</mi><mo id="S5.SS1.p3.4.m3.2.2.2.4.1" xref="S5.SS1.p3.4.m3.2.2.2.3.cmml">,</mo><mi id="S5.SS1.p3.4.m3.2.2.2.2" xref="S5.SS1.p3.4.m3.2.2.2.2.cmml">S</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.4.m3.2b"><apply id="S5.SS1.p3.4.m3.2.3.cmml" xref="S5.SS1.p3.4.m3.2.3"><csymbol cd="ambiguous" id="S5.SS1.p3.4.m3.2.3.1.cmml" xref="S5.SS1.p3.4.m3.2.3">subscript</csymbol><ci id="S5.SS1.p3.4.m3.2.3.2.cmml" xref="S5.SS1.p3.4.m3.2.3.2">𝐼</ci><list id="S5.SS1.p3.4.m3.2.2.2.3.cmml" xref="S5.SS1.p3.4.m3.2.2.2.4"><ci id="S5.SS1.p3.4.m3.1.1.1.1.cmml" xref="S5.SS1.p3.4.m3.1.1.1.1">𝑗</ci><ci id="S5.SS1.p3.4.m3.2.2.2.2.cmml" xref="S5.SS1.p3.4.m3.2.2.2.2">𝑆</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.4.m3.2c">I_{j,S}</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p3.4.m3.2d">italic_I start_POSTSUBSCRIPT italic_j , italic_S 
end_POSTSUBSCRIPT</annotation></semantics></math> to represent whether <math alttext="r_{j}" class="ltx_Math" display="inline" id="S5.SS1.p3.5.m4.1"><semantics id="S5.SS1.p3.5.m4.1a"><msub id="S5.SS1.p3.5.m4.1.1" xref="S5.SS1.p3.5.m4.1.1.cmml"><mi id="S5.SS1.p3.5.m4.1.1.2" xref="S5.SS1.p3.5.m4.1.1.2.cmml">r</mi><mi id="S5.SS1.p3.5.m4.1.1.3" xref="S5.SS1.p3.5.m4.1.1.3.cmml">j</mi></msub><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.5.m4.1b"><apply id="S5.SS1.p3.5.m4.1.1.cmml" xref="S5.SS1.p3.5.m4.1.1"><csymbol cd="ambiguous" id="S5.SS1.p3.5.m4.1.1.1.cmml" xref="S5.SS1.p3.5.m4.1.1">subscript</csymbol><ci id="S5.SS1.p3.5.m4.1.1.2.cmml" xref="S5.SS1.p3.5.m4.1.1.2">𝑟</ci><ci id="S5.SS1.p3.5.m4.1.1.3.cmml" xref="S5.SS1.p3.5.m4.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.5.m4.1c">r_{j}</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p3.5.m4.1d">italic_r start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT</annotation></semantics></math> belongs to the request <math alttext="S" class="ltx_Math" display="inline" id="S5.SS1.p3.6.m5.1"><semantics id="S5.SS1.p3.6.m5.1a"><mi id="S5.SS1.p3.6.m5.1.1" xref="S5.SS1.p3.6.m5.1.1.cmml">S</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p3.6.m5.1b"><ci id="S5.SS1.p3.6.m5.1.1.cmml" xref="S5.SS1.p3.6.m5.1.1">𝑆</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p3.6.m5.1c">S</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p3.6.m5.1d">italic_S</annotation></semantics></math>.</p> </div> <div class="ltx_para" id="S5.SS1.p4"> <p class="ltx_p" id="S5.SS1.p4.14">Although the number of padding tokens in KV tensors can be reduced, it requires additional copies of tokens in the <math alttext="Q" class="ltx_Math" display="inline" id="S5.SS1.p4.1.m1.1"><semantics id="S5.SS1.p4.1.m1.1a"><mi id="S5.SS1.p4.1.m1.1.1" xref="S5.SS1.p4.1.m1.1.1.cmml">Q</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.1.m1.1b"><ci 
id="S5.SS1.p4.1.m1.1.1.cmml" xref="S5.SS1.p4.1.m1.1.1">𝑄</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.1.m1.1c">Q</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.1.m1.1d">italic_Q</annotation></semantics></math> tensor, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S5.F9" title="Figure 9 ‣ V-A Fast Batch Verification ‣ V Spin’s Runtime Engine ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 9</span></a>. To balance the redundancy in <math alttext="KV" class="ltx_Math" display="inline" id="S5.SS1.p4.2.m2.1"><semantics id="S5.SS1.p4.2.m2.1a"><mrow id="S5.SS1.p4.2.m2.1.1" xref="S5.SS1.p4.2.m2.1.1.cmml"><mi id="S5.SS1.p4.2.m2.1.1.2" xref="S5.SS1.p4.2.m2.1.1.2.cmml">K</mi><mo id="S5.SS1.p4.2.m2.1.1.1" xref="S5.SS1.p4.2.m2.1.1.1.cmml">⁢</mo><mi id="S5.SS1.p4.2.m2.1.1.3" xref="S5.SS1.p4.2.m2.1.1.3.cmml">V</mi></mrow><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.2.m2.1b"><apply id="S5.SS1.p4.2.m2.1.1.cmml" xref="S5.SS1.p4.2.m2.1.1"><times id="S5.SS1.p4.2.m2.1.1.1.cmml" xref="S5.SS1.p4.2.m2.1.1.1"></times><ci id="S5.SS1.p4.2.m2.1.1.2.cmml" xref="S5.SS1.p4.2.m2.1.1.2">𝐾</ci><ci id="S5.SS1.p4.2.m2.1.1.3.cmml" xref="S5.SS1.p4.2.m2.1.1.3">𝑉</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.2.m2.1c">KV</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.2.m2.1d">italic_K italic_V</annotation></semantics></math> and <math alttext="Q" class="ltx_Math" display="inline" id="S5.SS1.p4.3.m3.1"><semantics id="S5.SS1.p4.3.m3.1a"><mi id="S5.SS1.p4.3.m3.1.1" xref="S5.SS1.p4.3.m3.1.1.cmml">Q</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.3.m3.1b"><ci id="S5.SS1.p4.3.m3.1.1.cmml" xref="S5.SS1.p4.3.m3.1.1">𝑄</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.3.m3.1c">Q</annotation><annotation 
encoding="application/x-llamapun" id="S5.SS1.p4.3.m3.1d">italic_Q</annotation></semantics></math> tensors, <span class="ltx_text ltx_font_smallcaps" id="S5.SS1.p4.14.1">Spin</span> carefully determines the request decomposition by setting the <math alttext="KV" class="ltx_Math" display="inline" id="S5.SS1.p4.4.m4.1"><semantics id="S5.SS1.p4.4.m4.1a"><mrow id="S5.SS1.p4.4.m4.1.1" xref="S5.SS1.p4.4.m4.1.1.cmml"><mi id="S5.SS1.p4.4.m4.1.1.2" xref="S5.SS1.p4.4.m4.1.1.2.cmml">K</mi><mo id="S5.SS1.p4.4.m4.1.1.1" xref="S5.SS1.p4.4.m4.1.1.1.cmml">⁢</mo><mi id="S5.SS1.p4.4.m4.1.1.3" xref="S5.SS1.p4.4.m4.1.1.3.cmml">V</mi></mrow><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.4.m4.1b"><apply id="S5.SS1.p4.4.m4.1.1.cmml" xref="S5.SS1.p4.4.m4.1.1"><times id="S5.SS1.p4.4.m4.1.1.1.cmml" xref="S5.SS1.p4.4.m4.1.1.1"></times><ci id="S5.SS1.p4.4.m4.1.1.2.cmml" xref="S5.SS1.p4.4.m4.1.1.2">𝐾</ci><ci id="S5.SS1.p4.4.m4.1.1.3.cmml" xref="S5.SS1.p4.4.m4.1.1.3">𝑉</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.4.m4.1c">KV</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.4.m4.1d">italic_K italic_V</annotation></semantics></math> tensor length <math alttext="L" class="ltx_Math" display="inline" id="S5.SS1.p4.5.m5.1"><semantics id="S5.SS1.p4.5.m5.1a"><mi id="S5.SS1.p4.5.m5.1.1" xref="S5.SS1.p4.5.m5.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.5.m5.1b"><ci id="S5.SS1.p4.5.m5.1.1.cmml" xref="S5.SS1.p4.5.m5.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.5.m5.1c">L</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.5.m5.1d">italic_L</annotation></semantics></math> and width <math alttext="B" class="ltx_Math" display="inline" id="S5.SS1.p4.6.m6.1"><semantics id="S5.SS1.p4.6.m6.1a"><mi id="S5.SS1.p4.6.m6.1.1" xref="S5.SS1.p4.6.m6.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.6.m6.1b"><ci id="S5.SS1.p4.6.m6.1.1.cmml" 
xref="S5.SS1.p4.6.m6.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.6.m6.1c">B</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.6.m6.1d">italic_B</annotation></semantics></math>. The <math alttext="Q" class="ltx_Math" display="inline" id="S5.SS1.p4.7.m7.1"><semantics id="S5.SS1.p4.7.m7.1a"><mi id="S5.SS1.p4.7.m7.1.1" xref="S5.SS1.p4.7.m7.1.1.cmml">Q</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.7.m7.1b"><ci id="S5.SS1.p4.7.m7.1.1.cmml" xref="S5.SS1.p4.7.m7.1.1">𝑄</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.7.m7.1c">Q</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.7.m7.1d">italic_Q</annotation></semantics></math> tensor is given the same width <math alttext="B" class="ltx_Math" display="inline" id="S5.SS1.p4.8.m8.1"><semantics id="S5.SS1.p4.8.m8.1a"><mi id="S5.SS1.p4.8.m8.1.1" xref="S5.SS1.p4.8.m8.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.8.m8.1b"><ci id="S5.SS1.p4.8.m8.1.1.cmml" xref="S5.SS1.p4.8.m8.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.8.m8.1c">B</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.8.m8.1d">italic_B</annotation></semantics></math> by replicating some of its rows accordingly. The length of the <math alttext="Q" class="ltx_Math" display="inline" id="S5.SS1.p4.9.m9.1"><semantics id="S5.SS1.p4.9.m9.1a"><mi id="S5.SS1.p4.9.m9.1.1" xref="S5.SS1.p4.9.m9.1.1.cmml">Q</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.9.m9.1b"><ci id="S5.SS1.p4.9.m9.1.1.cmml" xref="S5.SS1.p4.9.m9.1.1">𝑄</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.9.m9.1c">Q</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.9.m9.1d">italic_Q</annotation></semantics></math> tensor is typically fixed by the prediction window, a system parameter of the runtime engine whose discussion is orthogonal to this work. 
We first fix the tensor width <math alttext="B" class="ltx_Math" display="inline" id="S5.SS1.p4.10.m10.1"><semantics id="S5.SS1.p4.10.m10.1a"><mi id="S5.SS1.p4.10.m10.1.1" xref="S5.SS1.p4.10.m10.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.10.m10.1b"><ci id="S5.SS1.p4.10.m10.1.1.cmml" xref="S5.SS1.p4.10.m10.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.10.m10.1c">B</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.10.m10.1d">italic_B</annotation></semantics></math>, which bounds both the overhead from the <math alttext="Q" class="ltx_Math" display="inline" id="S5.SS1.p4.11.m11.1"><semantics id="S5.SS1.p4.11.m11.1a"><mi id="S5.SS1.p4.11.m11.1.1" xref="S5.SS1.p4.11.m11.1.1.cmml">Q</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.11.m11.1b"><ci id="S5.SS1.p4.11.m11.1.1.cmml" xref="S5.SS1.p4.11.m11.1.1">𝑄</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.11.m11.1c">Q</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.11.m11.1d">italic_Q</annotation></semantics></math> tensor and the number of decomposed requests. 
Then, we search for the value of <math alttext="L" class="ltx_Math" display="inline" id="S5.SS1.p4.12.m12.1"><semantics id="S5.SS1.p4.12.m12.1a"><mi id="S5.SS1.p4.12.m12.1.1" xref="S5.SS1.p4.12.m12.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.12.m12.1b"><ci id="S5.SS1.p4.12.m12.1.1.cmml" xref="S5.SS1.p4.12.m12.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.12.m12.1c">L</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.12.m12.1d">italic_L</annotation></semantics></math> that minimizes the number of padding tokens in the <math alttext="KV" class="ltx_Math" display="inline" id="S5.SS1.p4.13.m13.1"><semantics id="S5.SS1.p4.13.m13.1a"><mrow id="S5.SS1.p4.13.m13.1.1" xref="S5.SS1.p4.13.m13.1.1.cmml"><mi id="S5.SS1.p4.13.m13.1.1.2" xref="S5.SS1.p4.13.m13.1.1.2.cmml">K</mi><mo id="S5.SS1.p4.13.m13.1.1.1" xref="S5.SS1.p4.13.m13.1.1.1.cmml">⁢</mo><mi id="S5.SS1.p4.13.m13.1.1.3" xref="S5.SS1.p4.13.m13.1.1.3.cmml">V</mi></mrow><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.13.m13.1b"><apply id="S5.SS1.p4.13.m13.1.1.cmml" xref="S5.SS1.p4.13.m13.1.1"><times id="S5.SS1.p4.13.m13.1.1.1.cmml" xref="S5.SS1.p4.13.m13.1.1.1"></times><ci id="S5.SS1.p4.13.m13.1.1.2.cmml" xref="S5.SS1.p4.13.m13.1.1.2">𝐾</ci><ci id="S5.SS1.p4.13.m13.1.1.3.cmml" xref="S5.SS1.p4.13.m13.1.1.3">𝑉</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.13.m13.1c">KV</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.13.m13.1d">italic_K italic_V</annotation></semantics></math> tensor. 
Note that the value of <math alttext="L" class="ltx_Math" display="inline" id="S5.SS1.p4.14.m14.1"><semantics id="S5.SS1.p4.14.m14.1a"><mi id="S5.SS1.p4.14.m14.1.1" xref="S5.SS1.p4.14.m14.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S5.SS1.p4.14.m14.1b"><ci id="S5.SS1.p4.14.m14.1.1.cmml" xref="S5.SS1.p4.14.m14.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p4.14.m14.1c">L</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p4.14.m14.1d">italic_L</annotation></semantics></math> does not have a linear relationship with the number of padding tokens, but this search process can be completed quickly.</p> </div> </section> <section class="ltx_subsection" id="S5.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S5.SS2.5.1.1">V-B</span> </span><span class="ltx_text ltx_font_italic" id="S5.SS2.6.2">Speculative Decoding Pipeline</span> </h3> <div class="ltx_para" id="S5.SS2.p1"> <p class="ltx_p" id="S5.SS2.p1.1">To reduce idle time caused by synchronizing heterogeneous SSMs, the speculative decoding pipeline module divides the original batch on each SSM into multiple micro-batches, so that speculation and verification can be performed on these micro-batches in a pipeline manner, as illustrated in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S2.F6" title="Figure 6 ‣ II-B2 Batching processing is not well supported by speculative decoding ‣ II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 6</span></a>(b). Specifically, after SSMs generate speculative tokens for a micro-batch, these results are sent to the LLM for verification. Meanwhile, SSMs can begin speculation on the next micro-batch, running in parallel with the LLM’s verification. 
Since micro-batches involve lighter workloads for the SSMs, their speculation time is shorter, which reduces the time the LLM sits idle while waiting for results from the SSMs.</p> </div> <div class="ltx_para" id="S5.SS2.p2"> <p class="ltx_p" id="S5.SS2.p2.1">A critical design choice in the speculative decoding pipeline is the micro-batch size, which should keep the pipeline saturated so as to maximize resource utilization. If the micro-batch size is too large, the idle time of the LLM caused by synchronizing results from heterogeneous SSMs cannot be effectively reduced. Conversely, a micro-batch size that is too small reduces idle time but leaves the GPU resources under-utilized during the LLM verification phase.</p> </div> <div class="ltx_para" id="S5.SS2.p3"> <p class="ltx_p" id="S5.SS2.p3.3"><span class="ltx_text ltx_font_smallcaps" id="S5.SS2.p3.3.1">Spin</span> employs a simple heuristic to determine a suitable micro-batch setting on each SSM. Specifically, <span class="ltx_text ltx_font_smallcaps" id="S5.SS2.p3.3.2">Spin</span> iteratively splits the batch on each SSM into multiple micro-batches and checks whether the LLM can be fully utilized with the current setting. In the initial iteration, each SSM obtains a batch of inference requests according to the SSM selection decision. 
Then, we split the batch on each SSM into <math alttext="b_{0}" class="ltx_Math" display="inline" id="S5.SS2.p3.1.m1.1"><semantics id="S5.SS2.p3.1.m1.1a"><msub id="S5.SS2.p3.1.m1.1.1" xref="S5.SS2.p3.1.m1.1.1.cmml"><mi id="S5.SS2.p3.1.m1.1.1.2" xref="S5.SS2.p3.1.m1.1.1.2.cmml">b</mi><mn id="S5.SS2.p3.1.m1.1.1.3" xref="S5.SS2.p3.1.m1.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS2.p3.1.m1.1b"><apply id="S5.SS2.p3.1.m1.1.1.cmml" xref="S5.SS2.p3.1.m1.1.1"><csymbol cd="ambiguous" id="S5.SS2.p3.1.m1.1.1.1.cmml" xref="S5.SS2.p3.1.m1.1.1">subscript</csymbol><ci id="S5.SS2.p3.1.m1.1.1.2.cmml" xref="S5.SS2.p3.1.m1.1.1.2">𝑏</ci><cn id="S5.SS2.p3.1.m1.1.1.3.cmml" type="integer" xref="S5.SS2.p3.1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p3.1.m1.1c">b_{0}</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p3.1.m1.1d">italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math>, e.g., <math alttext="b_{0}=2" class="ltx_Math" display="inline" id="S5.SS2.p3.2.m2.1"><semantics id="S5.SS2.p3.2.m2.1a"><mrow id="S5.SS2.p3.2.m2.1.1" xref="S5.SS2.p3.2.m2.1.1.cmml"><msub id="S5.SS2.p3.2.m2.1.1.2" xref="S5.SS2.p3.2.m2.1.1.2.cmml"><mi id="S5.SS2.p3.2.m2.1.1.2.2" xref="S5.SS2.p3.2.m2.1.1.2.2.cmml">b</mi><mn id="S5.SS2.p3.2.m2.1.1.2.3" xref="S5.SS2.p3.2.m2.1.1.2.3.cmml">0</mn></msub><mo id="S5.SS2.p3.2.m2.1.1.1" xref="S5.SS2.p3.2.m2.1.1.1.cmml">=</mo><mn id="S5.SS2.p3.2.m2.1.1.3" xref="S5.SS2.p3.2.m2.1.1.3.cmml">2</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.SS2.p3.2.m2.1b"><apply id="S5.SS2.p3.2.m2.1.1.cmml" xref="S5.SS2.p3.2.m2.1.1"><eq id="S5.SS2.p3.2.m2.1.1.1.cmml" xref="S5.SS2.p3.2.m2.1.1.1"></eq><apply id="S5.SS2.p3.2.m2.1.1.2.cmml" xref="S5.SS2.p3.2.m2.1.1.2"><csymbol cd="ambiguous" id="S5.SS2.p3.2.m2.1.1.2.1.cmml" xref="S5.SS2.p3.2.m2.1.1.2">subscript</csymbol><ci id="S5.SS2.p3.2.m2.1.1.2.2.cmml" xref="S5.SS2.p3.2.m2.1.1.2.2">𝑏</ci><cn 
id="S5.SS2.p3.2.m2.1.1.2.3.cmml" type="integer" xref="S5.SS2.p3.2.m2.1.1.2.3">0</cn></apply><cn id="S5.SS2.p3.2.m2.1.1.3.cmml" type="integer" xref="S5.SS2.p3.2.m2.1.1.3">2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p3.2.m2.1c">b_{0}=2</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p3.2.m2.1d">italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 2</annotation></semantics></math>, micro-batches and check whether there is an obvious throughput drop. If there is no significant degradation, we continue to split the batch into <math alttext="b_{0}+1" class="ltx_Math" display="inline" id="S5.SS2.p3.3.m3.1"><semantics id="S5.SS2.p3.3.m3.1a"><mrow id="S5.SS2.p3.3.m3.1.1" xref="S5.SS2.p3.3.m3.1.1.cmml"><msub id="S5.SS2.p3.3.m3.1.1.2" xref="S5.SS2.p3.3.m3.1.1.2.cmml"><mi id="S5.SS2.p3.3.m3.1.1.2.2" xref="S5.SS2.p3.3.m3.1.1.2.2.cmml">b</mi><mn id="S5.SS2.p3.3.m3.1.1.2.3" xref="S5.SS2.p3.3.m3.1.1.2.3.cmml">0</mn></msub><mo id="S5.SS2.p3.3.m3.1.1.1" xref="S5.SS2.p3.3.m3.1.1.1.cmml">+</mo><mn id="S5.SS2.p3.3.m3.1.1.3" xref="S5.SS2.p3.3.m3.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.SS2.p3.3.m3.1b"><apply id="S5.SS2.p3.3.m3.1.1.cmml" xref="S5.SS2.p3.3.m3.1.1"><plus id="S5.SS2.p3.3.m3.1.1.1.cmml" xref="S5.SS2.p3.3.m3.1.1.1"></plus><apply id="S5.SS2.p3.3.m3.1.1.2.cmml" xref="S5.SS2.p3.3.m3.1.1.2"><csymbol cd="ambiguous" id="S5.SS2.p3.3.m3.1.1.2.1.cmml" xref="S5.SS2.p3.3.m3.1.1.2">subscript</csymbol><ci id="S5.SS2.p3.3.m3.1.1.2.2.cmml" xref="S5.SS2.p3.3.m3.1.1.2.2">𝑏</ci><cn id="S5.SS2.p3.3.m3.1.1.2.3.cmml" type="integer" xref="S5.SS2.p3.3.m3.1.1.2.3">0</cn></apply><cn id="S5.SS2.p3.3.m3.1.1.3.cmml" type="integer" xref="S5.SS2.p3.3.m3.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p3.3.m3.1c">b_{0}+1</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p3.3.m3.1d">italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + 
1</annotation></semantics></math> micro-batches, which would further reduce synchronization time. This process repeats until a significant degradation in inference throughput is observed. Note that the inference throughput of the LLM under different workloads can be profiled offline, so each micro-batch setting can be checked quickly against the profile instead of running a verification pass and measuring the throughput. Alternatively, batch splitting could be formulated as an optimization problem and solved for the optimal setting. However, we find that this heuristic works sufficiently well in practice and that its overhead is negligible.</p> </div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">VI </span><span class="ltx_text ltx_font_smallcaps" id="S6.1.1">Performance Evaluation</span> </h2> <section class="ltx_subsection" id="S6.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S6.SS1.5.1.1">VI-A</span> </span><span class="ltx_text ltx_font_italic" id="S6.SS1.6.2">Experimental Settings</span> </h3> <div class="ltx_para ltx_noindent" id="S6.SS1.p1"> <p class="ltx_p" id="S6.SS1.p1.1"><span class="ltx_text ltx_font_bold" id="S6.SS1.p1.1.1">Model and Environment.</span> We evaluate the performance of <span class="ltx_text ltx_font_smallcaps" id="S6.SS1.p1.1.2">Spin</span> using the LLaMA family of models. We consider LLaMA-7B, LLaMA-13B, and LLaMA-30B as LLMs. For SSMs, we use five smaller models: LLaMA-68M, LLaMA-265M, LLaMA-616M, LLaMA-1.1B, and LLaMA-1.4B. All models are obtained from <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib32" title="">32</a>]</cite>. We deploy each SSM on a single V100 GPU. 
For the LLM, we use a single V100 GPU for the LLaMA-7B model, two V100 GPUs for the LLaMA-13B model, and four V100 GPUs for the LLaMA-30B model. The multi-GPU settings use tensor parallelism. For software configuration, we use Ubuntu 20.04 with Linux kernel 5.15, CUDA 11.7, and cuDNN 8.6.0, along with NVIDIA driver version 525.85.</p> </div> <div class="ltx_para ltx_noindent" id="S6.SS1.p2"> <p class="ltx_p" id="S6.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S6.SS1.p2.1.1">Workloads.</span> We evaluate <span class="ltx_text ltx_font_smallcaps" id="S6.SS1.p2.1.2">Spin</span> using three public datasets: Alpaca <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib26" title="">26</a>]</cite>, ChatGPT Prompts (CP) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib27" title="">27</a>]</cite>, and Chatbot Instruction Prompts (CIP) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib28" title="">28</a>]</cite>. Following the settings in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib15" title="">15</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib33" title="">33</a>]</cite>, we use the questions/prompts from these datasets to create workloads. 
Additionally, we combine the requests from all three datasets to form a mixed dataset, referred to as Mix, that better simulates real-world traces.</p> </div> <div class="ltx_para ltx_noindent" id="S6.SS1.p3"> <p class="ltx_p" id="S6.SS1.p3.1"><span class="ltx_text ltx_font_bold" id="S6.SS1.p3.1.1">Baselines.</span> We compare <span class="ltx_text ltx_font_smallcaps" id="S6.SS1.p3.1.2">Spin</span> with the Vanilla speculative decoding inference system with homogeneous SSMs, which is commonly adopted by existing works <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib16" title="">16</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib33" title="">33</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib29" title="">29</a>]</cite>. We equip this Vanilla system with different SSMs in the experiments.</p> </div> <figure class="ltx_figure" id="S6.F10"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F10.sf1"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F10.sf1.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F10.sf1.1.g1" src="x10.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F10.sf1.3.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S6.F10.sf1.4.2" style="font-size:80%;">Alpaca</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F10.sf2"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F10.sf2.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F10.sf2.1.g1" src="x11.png" width="831"/> </div> <figcaption class="ltx_caption"><span 
class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F10.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S6.F10.sf2.4.2" style="font-size:80%;">CP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F10.sf3"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F10.sf3.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F10.sf3.1.g1" src="x12.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F10.sf3.3.1.1" style="font-size:80%;">(c)</span> </span><span class="ltx_text" id="S6.F10.sf3.4.2" style="font-size:80%;">CIP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F10.sf4"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F10.sf4.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F10.sf4.1.g1" src="x13.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F10.sf4.3.1.1" style="font-size:80%;">(d)</span> </span><span class="ltx_text" id="S6.F10.sf4.4.2" style="font-size:80%;">Mix</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 10: </span>Evaluation of <span class="ltx_text ltx_font_smallcaps" id="S6.F10.2.1">Spin</span> on different datasets.</figcaption> </figure> </section> <section class="ltx_subsection" id="S6.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S6.SS2.5.1.1">VI-B</span> </span><span class="ltx_text ltx_font_italic" id="S6.SS2.6.2">Results</span> </h3> <section 
class="ltx_subsubsection" id="S6.SS2.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S6.SS2.SSS1.5.1.1">VI-B</span>1 </span>Overall Performance</h4> <div class="ltx_para" id="S6.SS2.SSS1.p1"> <p class="ltx_p" id="S6.SS2.SSS1.p1.1">We first evaluate the overall performance of <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.1">Spin</span> by comparing it with the Vanilla system. We set the batch size to 8 and evaluate <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.2">Spin</span> in three settings: (1) <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.3">Spin</span> w/o bat. & pipe.: disabling fast batch verification and pipeline optimization, only enabling heterogeneous SSM speculation; (2) <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.4">Spin</span> w/o pipeline: disabling pipeline optimization but adopting optimized heterogeneous speculation and fast batch verification; and (3) <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.5">Spin</span>: the complete implementation of <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.6">Spin</span>. We use goodput, the number of accepted tokens per second, as the evaluation metric. The results are shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F10" title="Figure 10 ‣ VI-A Experimental Settings ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 10</span></a>. The results show that <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.7">Spin</span> significantly improves goodput compared to the baselines. 
For example, on the Alpaca dataset with LLaMA-7B as the LLM, <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p1.1.8">Spin</span> improves goodput by about 2.34<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS1.p1.1.m1.1"><semantics id="S6.SS2.SSS1.p1.1.m1.1a"><mo id="S6.SS2.SSS1.p1.1.m1.1.1" xref="S6.SS2.SSS1.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS1.p1.1.m1.1b"><times id="S6.SS2.SSS1.p1.1.m1.1.1.cmml" xref="S6.SS2.SSS1.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS1.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS1.p1.1.m1.1d">×</annotation></semantics></math> compared to other baselines.</p> </div> <div class="ltx_para" id="S6.SS2.SSS1.p2"> <p class="ltx_p" id="S6.SS2.SSS1.p2.3">When only speculation of heterogeneous SSMs is enabled, <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p2.3.1">Spin</span> still achieves high performance. For example, on the Alpaca dataset with LLaMA-7B, the Vanilla speculative decoding using only LLaMA-68M achieves a goodput of 32.21 tokens per second, whereas <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p2.3.2">Spin</span> generates 62.33 tokens per second on average. Enabling the fast batch verification and pipeline optimizations further enhances <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p2.3.3">Spin</span>’s performance. 
For example, on the CP dataset with LLaMA-7B, enabling the fast batch verification and pipeline optimizations increases <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS1.p2.3.4">Spin</span>’s performance improvement from 1.45<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS1.p2.1.m1.1"><semantics id="S6.SS2.SSS1.p2.1.m1.1a"><mo id="S6.SS2.SSS1.p2.1.m1.1.1" xref="S6.SS2.SSS1.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS1.p2.1.m1.1b"><times id="S6.SS2.SSS1.p2.1.m1.1.1.cmml" xref="S6.SS2.SSS1.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS1.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS1.p2.1.m1.1d">×</annotation></semantics></math> to 1.81<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS1.p2.2.m2.1"><semantics id="S6.SS2.SSS1.p2.2.m2.1a"><mo id="S6.SS2.SSS1.p2.2.m2.1.1" xref="S6.SS2.SSS1.p2.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS1.p2.2.m2.1b"><times id="S6.SS2.SSS1.p2.2.m2.1.1.cmml" xref="S6.SS2.SSS1.p2.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS1.p2.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS1.p2.2.m2.1d">×</annotation></semantics></math> and 1.99<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS1.p2.3.m3.1"><semantics id="S6.SS2.SSS1.p2.3.m3.1a"><mo id="S6.SS2.SSS1.p2.3.m3.1.1" xref="S6.SS2.SSS1.p2.3.m3.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS1.p2.3.m3.1b"><times id="S6.SS2.SSS1.p2.3.m3.1.1.cmml" xref="S6.SS2.SSS1.p2.3.m3.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS1.p2.3.m3.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS1.p2.3.m3.1d">×</annotation></semantics></math>, respectively.</p> </div> <div class="ltx_para" id="S6.SS2.SSS1.p3"> 
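<p class="ltx_p">The goodput figures quoted above can be related to an improvement factor with a short calculation. The following Python sketch is illustrative only: the helper names <span class="ltx_text ltx_font_typewriter">goodput</span> and <span class="ltx_text ltx_font_typewriter">improvement</span> are hypothetical and not part of Spin, and it assumes that the improvement factor is the plain ratio of two goodput values, which the text does not state explicitly. Only the two goodput numbers (32.21 and 62.33 tokens per second) are taken from the text.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
# Illustrative sketch; helper names are hypothetical, not Spin's API.

def goodput(accepted_tokens, elapsed_seconds):
    """Accepted tokens per second, the metric used in this evaluation."""
    if elapsed_seconds == 0:
        raise ValueError("elapsed_seconds must be non-zero")
    return accepted_tokens / elapsed_seconds


def improvement(system_goodput, baseline_goodput):
    """Improvement factor, assumed here to be a plain ratio of goodputs."""
    return system_goodput / baseline_goodput


# Goodput values quoted in the text (Alpaca dataset, LLaMA-7B as the LLM).
vanilla_68m = 32.21  # tokens/s, Vanilla speculative decoding with LLaMA-68M
spin_hetero = 62.33  # tokens/s, Spin with heterogeneous speculation only

ratio = improvement(spin_hetero, vanilla_68m)
print(f"{ratio:.2f}x")
```
</pre>
<p class="ltx_p">Under this assumption, enabling heterogeneous speculation alone corresponds to roughly a 1.94× goodput gain over the Vanilla system with LLaMA-68M in this setting.</p>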
<p class="ltx_p" id="S6.SS2.SSS1.p3.1">In addition, we observe that the performance of every baseline varies across datasets. For example, the Vanilla system with LLaMA-1.1B outperforms the other baselines on the Alpaca dataset, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F10.sf1" title="10(a) ‣ Figure 10 ‣ VI-A Experimental Settings ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 10(a)</span></a>, while the Vanilla system with LLaMA-265M works best on the CP dataset, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F10.sf2" title="10(b) ‣ Figure 10 ‣ VI-A Experimental Settings ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 10(b)</span></a>. This is because the inference requests in the Alpaca dataset exhibit higher difficulty, requiring larger SSMs to maintain a high acceptance rate. In contrast, inference requests in the CP dataset can benefit more from smaller SSMs due to their lower difficulty. This observation can also be verified by the preliminary experiments in §<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#S2.SS2" title="II-B Motivation ‣ II Background and Motivation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag"><span class="ltx_text">II-B</span></span></a>. However, we observe that the Vanilla speculative decoding systems with LLaMA-68M and with LLaMA-1.4B never outperform the others and consistently show the worst performance. 
This can be attributed to LLaMA-68M having the lowest acceptance rate and LLaMA-1.4B having the lowest inference speed.</p> </div> <figure class="ltx_figure" id="S6.F11"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F11.sf1"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F11.sf1.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F11.sf1.1.g1" src="x14.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F11.sf1.3.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S6.F11.sf1.4.2" style="font-size:80%;">Alpaca</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F11.sf2"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F11.sf2.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F11.sf2.1.g1" src="x15.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F11.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S6.F11.sf2.4.2" style="font-size:80%;">CP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F11.sf3"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F11.sf3.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F11.sf3.1.g1" src="x16.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F11.sf3.3.1.1" style="font-size:80%;">(c)</span> </span><span class="ltx_text" id="S6.F11.sf3.4.2" 
style="font-size:80%;">CIP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F11.sf4"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F11.sf4.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="411" id="S6.F11.sf4.1.g1" src="x17.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F11.sf4.3.1.1" style="font-size:80%;">(d)</span> </span><span class="ltx_text" id="S6.F11.sf4.4.2" style="font-size:80%;">Mix</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 11: </span>The evaluation of <span class="ltx_text ltx_font_smallcaps" id="S6.F11.2.1">Spin</span> with different SSM selection algorithms.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="S6.SS2.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S6.SS2.SSS2.5.1.1">VI-B</span>2 </span>Impact of the SSM Selection Algorithm</h4> <div class="ltx_para" id="S6.SS2.SSS2.p1"> <p class="ltx_p" id="S6.SS2.SSS2.p1.5">We compare the learning-based SSM selection (LBSS) algorithm used in <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS2.p1.5.1">Spin</span> with two baselines: (1) Greedy: requests greedily select the SSMs according to their prompt lengths, where requests with shorter prompts are prioritized using smaller SSMs for speculation while others with longer prompts use the larger SSMs; and (2) <math alttext="\epsilon" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p1.1.m1.1"><semantics id="S6.SS2.SSS2.p1.1.m1.1a"><mi id="S6.SS2.SSS2.p1.1.m1.1.1" xref="S6.SS2.SSS2.p1.1.m1.1.1.cmml">ϵ</mi><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p1.1.m1.1b"><ci id="S6.SS2.SSS2.p1.1.m1.1.1.cmml" 
xref="S6.SS2.SSS2.p1.1.m1.1.1">italic-ϵ</ci></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p1.1.m1.1c">\epsilon</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p1.1.m1.1d">italic_ϵ</annotation></semantics></math>-greedy: with probability <math alttext="\epsilon" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p1.2.m2.1"><semantics id="S6.SS2.SSS2.p1.2.m2.1a"><mi id="S6.SS2.SSS2.p1.2.m2.1.1" xref="S6.SS2.SSS2.p1.2.m2.1.1.cmml">ϵ</mi><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p1.2.m2.1b"><ci id="S6.SS2.SSS2.p1.2.m2.1.1.cmml" xref="S6.SS2.SSS2.p1.2.m2.1.1">italic-ϵ</ci></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p1.2.m2.1c">\epsilon</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p1.2.m2.1d">italic_ϵ</annotation></semantics></math>, all inference requests use the best SSM according to the currently observed performance, and with probability <math alttext="1-\epsilon" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p1.3.m3.1"><semantics id="S6.SS2.SSS2.p1.3.m3.1a"><mrow id="S6.SS2.SSS2.p1.3.m3.1.1" xref="S6.SS2.SSS2.p1.3.m3.1.1.cmml"><mn id="S6.SS2.SSS2.p1.3.m3.1.1.2" xref="S6.SS2.SSS2.p1.3.m3.1.1.2.cmml">1</mn><mo id="S6.SS2.SSS2.p1.3.m3.1.1.1" xref="S6.SS2.SSS2.p1.3.m3.1.1.1.cmml">−</mo><mi id="S6.SS2.SSS2.p1.3.m3.1.1.3" xref="S6.SS2.SSS2.p1.3.m3.1.1.3.cmml">ϵ</mi></mrow><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p1.3.m3.1b"><apply id="S6.SS2.SSS2.p1.3.m3.1.1.cmml" xref="S6.SS2.SSS2.p1.3.m3.1.1"><minus id="S6.SS2.SSS2.p1.3.m3.1.1.1.cmml" xref="S6.SS2.SSS2.p1.3.m3.1.1.1"></minus><cn id="S6.SS2.SSS2.p1.3.m3.1.1.2.cmml" type="integer" xref="S6.SS2.SSS2.p1.3.m3.1.1.2">1</cn><ci id="S6.SS2.SSS2.p1.3.m3.1.1.3.cmml" xref="S6.SS2.SSS2.p1.3.m3.1.1.3">italic-ϵ</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p1.3.m3.1c">1-\epsilon</annotation><annotation encoding="application/x-llamapun" 
id="S6.SS2.SSS2.p1.3.m3.1d">1 - italic_ϵ</annotation></semantics></math>, they randomly select an SSM. We set <math alttext="\epsilon=0.2" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p1.4.m4.1"><semantics id="S6.SS2.SSS2.p1.4.m4.1a"><mrow id="S6.SS2.SSS2.p1.4.m4.1.1" xref="S6.SS2.SSS2.p1.4.m4.1.1.cmml"><mi id="S6.SS2.SSS2.p1.4.m4.1.1.2" xref="S6.SS2.SSS2.p1.4.m4.1.1.2.cmml">ϵ</mi><mo id="S6.SS2.SSS2.p1.4.m4.1.1.1" xref="S6.SS2.SSS2.p1.4.m4.1.1.1.cmml">=</mo><mn id="S6.SS2.SSS2.p1.4.m4.1.1.3" xref="S6.SS2.SSS2.p1.4.m4.1.1.3.cmml">0.2</mn></mrow><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p1.4.m4.1b"><apply id="S6.SS2.SSS2.p1.4.m4.1.1.cmml" xref="S6.SS2.SSS2.p1.4.m4.1.1"><eq id="S6.SS2.SSS2.p1.4.m4.1.1.1.cmml" xref="S6.SS2.SSS2.p1.4.m4.1.1.1"></eq><ci id="S6.SS2.SSS2.p1.4.m4.1.1.2.cmml" xref="S6.SS2.SSS2.p1.4.m4.1.1.2">italic-ϵ</ci><cn id="S6.SS2.SSS2.p1.4.m4.1.1.3.cmml" type="float" xref="S6.SS2.SSS2.p1.4.m4.1.1.3">0.2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p1.4.m4.1c">\epsilon=0.2</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p1.4.m4.1d">italic_ϵ = 0.2</annotation></semantics></math> for better exploration. Additionally, we set the batch size to 8 and disable the fast batch verification and pipeline optimization for all methods. The results are shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F11" title="Figure 11 ‣ VI-B1 Overall Performance ‣ VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 11</span></a>. We observe that the proposed SSM selection algorithm in <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS2.p1.5.2">Spin</span> significantly outperforms the others. For example, on the Alpaca dataset with LLaMA-7B as the LLM, the greedy method generates only 32.21 tokens per second. 
Although the <math alttext="\epsilon" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p1.5.m5.1"><semantics id="S6.SS2.SSS2.p1.5.m5.1a"><mi id="S6.SS2.SSS2.p1.5.m5.1.1" xref="S6.SS2.SSS2.p1.5.m5.1.1.cmml">ϵ</mi><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p1.5.m5.1b"><ci id="S6.SS2.SSS2.p1.5.m5.1.1.cmml" xref="S6.SS2.SSS2.p1.5.m5.1.1">italic-ϵ</ci></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p1.5.m5.1c">\epsilon</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p1.5.m5.1d">italic_ϵ</annotation></semantics></math>-greedy method can explore better SSM selection, it does not balance exploration and exploitation well, achieving a goodput of 43.78 tokens per second. In contrast, our LBSS effectively explores SSMs and selects the best one for inference requests, achieving a higher goodput of 65.31 tokens per second.</p> </div> <div class="ltx_para" id="S6.SS2.SSS2.p2"> <p class="ltx_p" id="S6.SS2.SSS2.p2.3">In addition, we observe that SSM selection becomes more crucial with larger LLMs. 
For example, on the CP dataset, LBSS achieves a goodput improvement of about 1.45<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p2.1.m1.1"><semantics id="S6.SS2.SSS2.p2.1.m1.1a"><mo id="S6.SS2.SSS2.p2.1.m1.1.1" xref="S6.SS2.SSS2.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p2.1.m1.1b"><times id="S6.SS2.SSS2.p2.1.m1.1.1.cmml" xref="S6.SS2.SSS2.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p2.1.m1.1d">×</annotation></semantics></math> over the other two methods with LLaMA-7B as the LLM, while the improvement increases to 1.79<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p2.2.m2.1"><semantics id="S6.SS2.SSS2.p2.2.m2.1a"><mo id="S6.SS2.SSS2.p2.2.m2.1.1" xref="S6.SS2.SSS2.p2.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p2.2.m2.1b"><times id="S6.SS2.SSS2.p2.2.m2.1.1.cmml" xref="S6.SS2.SSS2.p2.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p2.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p2.2.m2.1d">×</annotation></semantics></math> when using LLaMA-30B as the LLM. 
This can be attributed to the fact that both greedy and <math alttext="\epsilon" class="ltx_Math" display="inline" id="S6.SS2.SSS2.p2.3.m3.1"><semantics id="S6.SS2.SSS2.p2.3.m3.1a"><mi id="S6.SS2.SSS2.p2.3.m3.1.1" xref="S6.SS2.SSS2.p2.3.m3.1.1.cmml">ϵ</mi><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS2.p2.3.m3.1b"><ci id="S6.SS2.SSS2.p2.3.m3.1.1.cmml" xref="S6.SS2.SSS2.p2.3.m3.1.1">italic-ϵ</ci></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS2.p2.3.m3.1c">\epsilon</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS2.p2.3.m3.1d">italic_ϵ</annotation></semantics></math>-greedy methods tend to choose unsuitable SSMs for inference requests since they cannot estimate the goodput of SSMs accurately, resulting in significant performance degradation.</p> </div> <figure class="ltx_figure" id="S6.F12"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F12.sf1"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F12.sf1.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="409" id="S6.F12.sf1.1.g1" src="x18.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F12.sf1.3.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S6.F12.sf1.4.2" style="font-size:80%;">Alpaca</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F12.sf2"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F12.sf2.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="409" id="S6.F12.sf2.1.g1" src="x19.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F12.sf2.3.1.1" 
style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S6.F12.sf2.4.2" style="font-size:80%;">CP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F12.sf3"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F12.sf3.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="409" id="S6.F12.sf3.1.g1" src="x20.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F12.sf3.3.1.1" style="font-size:80%;">(c)</span> </span><span class="ltx_text" id="S6.F12.sf3.4.2" style="font-size:80%;">CIP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F12.sf4"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F12.sf4.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="409" id="S6.F12.sf4.1.g1" src="x21.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F12.sf4.3.1.1" style="font-size:80%;">(d)</span> </span><span class="ltx_text" id="S6.F12.sf4.4.2" style="font-size:80%;">Mix</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 12: </span>The evaluation of the batch verification on different batch sizes.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="S6.SS2.SSS3"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S6.SS2.SSS3.5.1.1">VI-B</span>3 </span>Impact of the Fast Batch Verification</h4> <div class="ltx_para" id="S6.SS2.SSS3.p1"> <p class="ltx_p" id="S6.SS2.SSS3.p1.1">We study the impact of the proposed fast batch verification with different batch 
sizes, which adopts a request decomposition method to accelerate verification on the LLM. Since the fast batch verification mainly works during the verification stage, we use the average verification latency as the evaluation metric. We use LLaMA-7B as the LLM. We vary the batch size and show the results in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F12" title="Figure 12 ‣ VI-B2 Impact of the SSM Selection Algorithm ‣ VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 12</span></a>. We observe that the fast batch verification in <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS3.p1.1.1">Spin</span> can significantly reduce verification latency and save memory. For example, on the Alpaca dataset with a batch size of 16, the fast batch verification in <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS3.p1.1.2">Spin</span> reduces verification latency from 529.21 ms to 335.46 ms, a reduction of 36.61%. When the batch size reaches 32, out-of-memory errors occur on the Alpaca and CIP datasets when the fast batch verification is disabled, because padded tokens in the batch occupy too much memory. In contrast, <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS3.p1.1.3">Spin</span> still works with low verification latency, thanks to the memory saved by its request decomposition.</p> </div> <div class="ltx_para" id="S6.SS2.SSS3.p2"> <p class="ltx_p" id="S6.SS2.SSS3.p2.1">In addition, we observe that the fast batch verification provides the most significant performance improvement on the Alpaca dataset. This is because the inference requests exhibit high difficulty, and the number of accepted tokens can vary significantly, leading to substantial padding tokens within the batch. 
In contrast, the benefits of the fast batch verification are smaller on the CP dataset because the requests have more similar acceptance rates, resulting in fewer padding tokens.</p> </div> <figure class="ltx_figure" id="S6.F13"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F13.sf1"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F13.sf1.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="440" id="S6.F13.sf1.1.g1" src="x22.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F13.sf1.3.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S6.F13.sf1.4.2" style="font-size:80%;">Alpaca</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F13.sf2"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F13.sf2.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="440" id="S6.F13.sf2.1.g1" src="x23.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F13.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S6.F13.sf2.4.2" style="font-size:80%;">CP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F13.sf3"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F13.sf3.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="440" id="S6.F13.sf3.1.g1" src="x24.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F13.sf3.3.1.1" style="font-size:80%;">(c)</span> 
</span><span class="ltx_text" id="S6.F13.sf3.4.2" style="font-size:80%;">CIP</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S6.F13.sf4"> <div class="ltx_block ltx_minipage ltx_align_bottom" id="S6.F13.sf4.1" style="width:95.4pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="440" id="S6.F13.sf4.1.g1" src="x25.png" width="831"/> </div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S6.F13.sf4.3.1.1" style="font-size:80%;">(d)</span> </span><span class="ltx_text" id="S6.F13.sf4.4.2" style="font-size:80%;">Mix</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 13: </span>The performance of the speculative decoding pipeline with different numbers of micro-batches. The red dot indicates the setting used by <span class="ltx_text ltx_font_smallcaps" id="S6.F13.2.1">Spin</span>.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="S6.SS2.SSS4"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S6.SS2.SSS4.5.1.1">VI-B</span>4 </span>Impact of the Speculative Decoding Pipeline</h4> <div class="ltx_para" id="S6.SS2.SSS4.p1"> <p class="ltx_p" id="S6.SS2.SSS4.p1.1">Finally, we study the impact of the proposed speculative decoding pipeline mechanism, which divides the batch on each SSM into smaller micro-batches and enables pipelined execution of speculation and verification. We use LLaMA-7B as the LLM and set different numbers of micro-batches on each SSM. 
The results are shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F13" title="Figure 13 ‣ VI-B3 Impact of the Fast Batch Verification ‣ VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 13</span></a>. We observe that the benefits of the speculative decoding pipeline vary with different numbers of micro-batches. Specifically, the benefits increase with more micro-batches up to a certain point, and then degrade when the number of micro-batches continues to increase. For example, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F13.sf1" title="13(a) ‣ Figure 13 ‣ VI-B3 Impact of the Fast Batch Verification ‣ VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 13(a)</span></a>, the pipeline optimization with 4 micro-batches on each SSM can provide a performance improvement of 1.29<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS2.SSS4.p1.1.m1.1"><semantics id="S6.SS2.SSS4.p1.1.m1.1a"><mo id="S6.SS2.SSS4.p1.1.m1.1.1" xref="S6.SS2.SSS4.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS2.SSS4.p1.1.m1.1b"><times id="S6.SS2.SSS4.p1.1.m1.1.1.cmml" xref="S6.SS2.SSS4.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS2.SSS4.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS2.SSS4.p1.1.m1.1d">×</annotation></semantics></math> on the Alpaca dataset. However, the pipeline optimization does not work and even degrades speculative decoding performance when the number of micro-batches reaches 9. 
The rationale is as follows: With more micro-batches, the size of each micro-batch becomes smaller, reducing the variance in inference speed across SSMs. Thus, the LLM has a shorter idle time to synchronize the speculation results from SSMs to start the verification. However, when the number of micro-batches continues to increase, the speculation and verification of more requests are performed sequentially, leading to lower inference throughput.</p> </div> <div class="ltx_para" id="S6.SS2.SSS4.p2"> <p class="ltx_p" id="S6.SS2.SSS4.p2.1">We also observe that the optimal number of micro-batches varies across different datasets. For example, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F13.sf1" title="13(a) ‣ Figure 13 ‣ VI-B3 Impact of the Fast Batch Verification ‣ VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 13(a)</span></a>, the speculative decoding pipeline with 5 micro-batches works best on the Alpaca dataset, while the optimal number of micro-batches for the speculative decoding pipeline is 3 on the CP dataset, as shown in <a class="ltx_ref ltx_refmacro_autoref" href="https://arxiv.org/html/2503.15921v1#S6.F13.sf2" title="13(b) ‣ Figure 13 ‣ VI-B3 Impact of the Fast Batch Verification ‣ VI-B Results ‣ VI Performance Evaluation ‣ Spin: Accelerating Large Language Model Inference with Heterogeneous Speculative Models"><span class="ltx_text ltx_ref_tag">Figure 13(b)</span></a>. Specifically, more inference requests in the Alpaca dataset perform the speculation on larger SSMs due to higher difficulty, such as LLaMA-1.1B and LLaMA-1.4B. Therefore, more micro-batches are needed to reduce synchronization time across SSMs. In contrast, for the CP dataset, more requests are processed by smaller SSMs, such as LLaMA-68M and LLaMA-265M, leading to smaller synchronization times across SSMs. 
Consequently, a smaller number of micro-batches can be sufficient to achieve optimal speculative decoding pipeline performance, and more micro-batches result in degraded inference throughput due to the sequential execution of micro-batches. Finally, we observe that the heuristic used in <span class="ltx_text ltx_font_smallcaps" id="S6.SS2.SSS4.p2.1.1">Spin</span> to determine the number of micro-batches can approximate the optimal point well.</p> </div> </section> </section> </section> <section class="ltx_section" id="S7"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">VII </span><span class="ltx_text ltx_font_smallcaps" id="S7.1.1">Related Work</span> </h2> <div class="ltx_para ltx_noindent" id="S7.p1"> <p class="ltx_p" id="S7.p1.1"><span class="ltx_text ltx_font_bold" id="S7.p1.1.1">Speculative Decoding.</span> Recent works optimize the speculation phase on the SSM to enhance performance. For example, Fu et al. <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib34" title="">34</a>]</cite> use an n-gram model to generate speculation tokens, while Eagle <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib17" title="">17</a>]</cite> directly uses the embeddings from the LLM. Similarly, Medusa <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib16" title="">16</a>]</cite> trains multiple speculation heads using the output from the LLM as input. SpecInfer <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib15" title="">15</a>]</cite> introduces the speculation tree to improve the acceptance rate of the SSM. For the verification phase, existing works mainly focus on reducing the LLM inference costs. 
For example, LayerSkip <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib18" title="">18</a>]</cite> and EESD <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib19" title="">19</a>]</cite> design the SSM as a part of the LLM, enabling the intermediate results of the SSM to be reused in the verification phase. S3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib35" title="">35</a>]</cite> adopts a mid-layer skipping method to reduce the verification cost. More recent works balance speculation correctness and verification cost by adjusting the speculation length. For instance, Su et al. <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib36" title="">36</a>]</cite> consider the relationship between the speculative window and batch size to maximize throughput, while BiLD <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib37" title="">37</a>]</cite> and Kangaroo <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib38" title="">38</a>]</cite> use a heuristic that stops speculation if the confidence in the speculative token falls below a predefined threshold. However, existing works focus only on homogeneous SSMs, which limits their ability to serve inference requests with varying difficulty.</p> </div> <div class="ltx_para ltx_noindent" id="S7.p2"> <p class="ltx_p" id="S7.p2.1"><span class="ltx_text ltx_font_bold" id="S7.p2.1.1">Speculative Decoding based Serving Systems.</span> Recent works aim to design efficient LLM inference serving systems based on speculative decoding. 
SpecInfer <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib15" title="">15</a>]</cite> employs a tree-based parallel decoding mechanism to reduce end-to-end latency. Minions <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib33" title="">33</a>]</cite> introduces an execution pipeline mechanism to decouple SSM speculation and LLM verification using multiple inference batches, thus improving resource utilization. SmartSpec <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib39" title="">39</a>]</cite> is a scheduling framework that determines the optimal speculation length for different requests. SpecExec <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.15921v1#bib.bib40" title="">40</a>]</cite> deploys speculative decoding services on consumer hardware with RAM offloading. Although these works are promising for improving speculative decoding efficiency, they focus on serving a single inference request or a single batch using one SSM. Furthermore, they do not effectively address the strict sequential execution of speculation and verification, which significantly degrades GPU utilization.</p> </div> </section> <section class="ltx_section" id="S8"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">VIII </span><span class="ltx_text ltx_font_smallcaps" id="S8.1.1">Conclusion</span> </h2> <div class="ltx_para" id="S8.p1"> <p class="ltx_p" id="S8.p1.1">We design <span class="ltx_text ltx_font_smallcaps" id="S8.p1.1.1">Spin</span>, an efficient LLM inference serving system based on speculative decoding. <span class="ltx_text ltx_font_smallcaps" id="S8.p1.1.2">Spin</span> incorporates three novel designs. 
First, <span class="ltx_text ltx_font_smallcaps" id="S8.p1.1.3">Spin</span> exploits heterogeneous speculative models to serve inference requests with varying difficulty. Second, <span class="ltx_text ltx_font_smallcaps" id="S8.p1.1.4">Spin</span> accelerates batch verification using a request decomposition method. Finally, <span class="ltx_text ltx_font_smallcaps" id="S8.p1.1.5">Spin</span> introduces a speculative decoding pipeline that divides the batch on each SSM into multiple micro-batches, further improving inference throughput. Extensive experiments demonstrate that <span class="ltx_text ltx_font_smallcaps" id="S8.p1.1.6">Spin</span> can improve inference throughput by about 2.28<math alttext="\times" class="ltx_Math" display="inline" id="S8.p1.1.m1.1"><semantics id="S8.p1.1.m1.1a"><mo id="S8.p1.1.m1.1.1" xref="S8.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S8.p1.1.m1.1b"><times id="S8.p1.1.m1.1.1.cmml" xref="S8.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S8.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S8.p1.1.m1.1d">×</annotation></semantics></math> compared to baseline methods.</p> </div> </section> <section class="ltx_section" id="S9"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">IX </span><span class="ltx_text ltx_font_smallcaps" id="S9.1.1">Acknowledgments</span> </h2> <div class="ltx_para" id="S9.p1"> <p class="ltx_p" id="S9.p1.1">This research was supported by the Japan Society for the Promotion of Science Fellows No. 23KJ1786, the Japan Science and Technology Agency (JST) PRESTO (No. 23828673), and the National Natural Science Foundation of China (No. 62471383, U23A20276 and U22A2029). 
Peng Li is the corresponding author.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"> T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell <em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">et al.</em>, “Language models are few-shot learners,” <em class="ltx_emph ltx_font_italic" id="bib.bib1.2.2">Advances in Neural Information Processing Systems</em>, vol. 33, pp. 1877–1901, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"> J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat <em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">et al.</em>, “GPT-4 technical report,” <em class="ltx_emph ltx_font_italic" id="bib.bib2.2.2">arXiv preprint arXiv:2303.08774</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"> W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong <em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">et al.</em>, “A survey of large language models,” <em class="ltx_emph ltx_font_italic" id="bib.bib3.2.2">arXiv preprint arXiv:2303.18223</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"> B. Fu, F. Chen, P. Li, and D. Zeng, “TCB: Accelerating transformer inference services with request concatenation,” in <em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">Proceedings of the 51st International Conference on Parallel Processing</em>, 2022, pp. 1–11. 
</span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"> H. Wen, Y. Li, G. Liu, S. Zhao, T. Yu, T. J.-J. Li, S. Jiang, Y. Liu, Y. Zhang, and Y. Liu, “AutoDroid: LLM-powered task automation in Android,” in <em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">Proceedings of the 30th Annual International Conference on Mobile Computing and Networking</em>, 2024, pp. 543–557. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"> Y. Wang, K. Chen, H. Tan, and K. Guo, “Tabi: An efficient multi-level inference system for large language models,” in <em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">Proceedings of the Eighteenth European Conference on Computer Systems</em>, 2023, pp. 233–248. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"> Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Re <em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">et al.</em>, “Deja Vu: Contextual sparsity for efficient LLMs at inference time,” in <em class="ltx_emph ltx_font_italic" id="bib.bib7.2.2">International Conference on Machine Learning</em>. PMLR, 2023, pp. 22 137–22 176. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"> A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, and R. Ramjee, “Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve,” in <em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)</em>. USENIX Association, 2024, pp. 117–134. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"> Y. Zhong, S. Liu, J. Chen, J. Hu, Y. 
Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving,” in <em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)</em>. USENIX Association, 2024, pp. 193–210. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"> B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y. Li, and W. Lin, “Llumnix: Dynamic scheduling for large language model serving,” in <em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)</em>. USENIX Association, 2024, pp. 173–191. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> W. Lee, J. Lee, J. Seo, and J. Sim, “InfiniGen: Efficient generative inference of large language models with dynamic KV cache management,” in <em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)</em>. USENIX Association, 2024, pp. 155–172. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> F. Chen, P. Li, S. Pan, L. Zhong, and J. Deng, “Giant could be tiny: Efficient inference of giant models on resource-constrained UAVs,” <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">IEEE Internet of Things Journal</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” <em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">arXiv preprint arXiv:2302.01318</em>, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in <em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">International Conference on Machine Learning</em>. PMLR, 2023, pp. 19 274–19 286. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi <em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">et al.</em>, “SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification,” in <em class="ltx_emph ltx_font_italic" id="bib.bib15.2.2">Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3</em>, 2024, pp. 932–949. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, “Medusa: Simple LLM inference acceleration framework with multiple decoding heads,” <em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">arXiv preprint arXiv:2401.10774</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> Y. Li, F. Wei, C. Zhang, and H. Zhang, “Eagle: Speculative sampling requires rethinking feature uncertainty,” <em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">arXiv preprint arXiv:2401.15077</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> M. Elhoushi, A. Shrivastava, D. Liskovich, B. Hosmer, B. Wasti, L. Lai, A. Mahmoud, B. Acun, S. Agarwal, A. 
Roman <em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">et al.</em>, “LayerSkip: Enabling early exit inference and self-speculative decoding,” <em class="ltx_emph ltx_font_italic" id="bib.bib18.2.2">arXiv preprint arXiv:2404.16710</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"> J. Liu, Q. Wang, J. Wang, and X. Cai, “Speculative decoding via early-exiting for faster LLM inference with Thompson sampling control mechanism,” <em class="ltx_emph ltx_font_italic" id="bib.bib19.1.1">arXiv preprint arXiv:2406.03853</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"> J. Fang, Y. Yu, C. Zhao, and J. Zhou, “TurboTransformers: An efficient GPU serving system for transformer models,” in <em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming</em>, 2021, pp. 389–402. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"> Y. Zhai, C. Jiang, L. Wang, X. Jia, S. Zhang, Z. Chen, X. Liu, and Y. Zhu, “ByteTransformer: A high-performance transformer boosted for variable-length inputs,” in <em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)</em>. IEEE, 2023, pp. 344–355. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"> A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” <em class="ltx_emph ltx_font_italic" id="bib.bib22.1.1">Advances in Neural Information Processing Systems</em>, vol. 30, 2017. 
</span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"> W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in <em class="ltx_emph ltx_font_italic" id="bib.bib23.1.1">Proceedings of the 29th Symposium on Operating Systems Principles</em>, 2023, pp. 611–626. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"> G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for transformer-based generative models,” in <em class="ltx_emph ltx_font_italic" id="bib.bib24.1.1">16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)</em>, 2022, pp. 521–538. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"> B. Sun, Z. Huang, H. Zhao, W. Xiao, X. Zhang, Y. Li, and W. Lin, “Llumnix: Dynamic scheduling for large language model serving,” in <em class="ltx_emph ltx_font_italic" id="bib.bib25.1.1">18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)</em>. USENIX Association, 2024, pp. 173–191. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"> B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with GPT-4,” <em class="ltx_emph ltx_font_italic" id="bib.bib26.1.1">arXiv preprint arXiv:2304.03277</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"> M. Rashad, “ChatGPT-prompts,” 2023. [Online]. 
Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts" title="">https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts</a> </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"> A. Palla, “Chatbot instruction prompts,” 2023. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts" title="">https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts</a> </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"> C. Li, Z. Zhou, S. Zheng, J. Zhang, Y. Liang, and G. Sun, “SpecPIM: Accelerating speculative inference on PIM-enabled system via architecture-dataflow co-exploration,” in <em class="ltx_emph ltx_font_italic" id="bib.bib29.1.1">Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3</em>, 2024, pp. 950–965. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"> B. Gao, Z. He, P. Sharma, Q. Kang, D. Jevdjic, J. Deng, X. Yang, Z. Yu, and P. Zuo, “Cost-Efficient large language model serving for multi-turn conversations with CachedAttention,” in <em class="ltx_emph ltx_font_italic" id="bib.bib30.1.1">2024 USENIX Annual Technical Conference (USENIX ATC 24)</em>, 2024, pp. 111–126. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"> “Technical report,” 2024. [Online]. 
Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.dropbox.com/scl/fi/1wl4pdw8z69e0taflcyuz/Technical_Report_SPIN.pdf?rlkey=815bdk0xg8fwrais3isriuygn&st=4n0d3a77&dl=0" title="">https://www.dropbox.com/scl/fi/1wl4pdw8z69e0taflcyuz/Technical_Report_SPIN.pdf?rlkey=815bdk0xg8fwrais3isriuygn&st=4n0d3a77&dl=0</a> </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"> HuggingFace, “Large language model text generation inference,” 2023. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/huggingface/text-generation-inference" title="">https://github.com/huggingface/text-generation-inference</a> </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"> S. Wang, H. Yang, X. Wang, T. Liu, P. Wang, X. Liang, K. Ma, T. Feng, X. You, Y. Bao <em class="ltx_emph ltx_font_italic" id="bib.bib33.1.1">et al.</em>, “Minions: Accelerating large language model inference with adaptive and collective speculative decoding,” <em class="ltx_emph ltx_font_italic" id="bib.bib33.2.2">arXiv preprint arXiv:2402.15678</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"> Y. Fu, P. Bailis, I. Stoica, and H. Zhang, “Breaking the sequential dependency of LLM inference using lookahead decoding,” November 2023. [Online]. Available: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://lmsys.org/blog/2023-11-21-lookahead-decoding/" title="">https://lmsys.org/blog/2023-11-21-lookahead-decoding/</a> </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"> W. Zhong and M. 
Bharadwaj, “S3D: A simple and cost-effective self-speculative decoding scheme for low-memory GPUs,” <em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">arXiv preprint arXiv:2405.20314</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"> Q. Su, C. Giannoula, and G. Pekhimenko, “The synergy of speculative decoding and batching in serving large language models,” <em class="ltx_emph ltx_font_italic" id="bib.bib36.1.1">arXiv preprint arXiv:2310.18813</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"> S. Kim, K. Mangalam, S. Moon, J. Malik, M. W. Mahoney, A. Gholami, and K. Keutzer, “Speculative decoding with big little decoder,” <em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">Advances in Neural Information Processing Systems</em>, vol. 36, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_tag_bibitem">[38]</span> <span class="ltx_bibblock"> F. Liu, Y. Tang, Z. Liu, Y. Ni, K. Han, and Y. Wang, “Kangaroo: Lossless self-speculative decoding via double early exiting,” <em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">arXiv preprint arXiv:2404.18911</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_tag_bibitem">[39]</span> <span class="ltx_bibblock"> X. Liu, C. Daniel, L. Hu, W. Kwon, Z. Li, X. Mo, A. Cheung, Z. Deng, I. Stoica, and H. Zhang, “Optimizing speculative decoding for serving large language models using goodput,” <em class="ltx_emph ltx_font_italic" id="bib.bib39.1.1">arXiv preprint arXiv:2406.14066</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_tag_bibitem">[40]</span> <span class="ltx_bibblock"> R. Svirschevski, A. May, Z. Chen, B. Chen, Z. Jia, and M. 
Ryabinin, “SpecExec: Massively parallel speculative decoding for interactive LLM inference on consumer devices,” <em class="ltx_emph ltx_font_italic" id="bib.bib40.1.1">arXiv preprint arXiv:2406.02532</em>, 2024. </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Mar 20 07:50:20 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span></a> </div></footer> </div> </body> </html>