V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
Liang Mi¹*, Weijun Wang²*†, Wenming Tu², Qingfeng He², Rui Kong², Xinyu Fang², Yazhu Dong², Yikang Zhang¹, Yuanchun Li², Meng Li¹, Haipeng Dai¹†, Guihai Chen¹, Yunxin Liu²
¹State Key Laboratory for Novel Software Technology, Nanjing University
²Institute for AI Industry Research (AIR), Tsinghua University
*Liang Mi and Weijun Wang contributed equally to this work.
†Corresponding authors: Weijun Wang and Haipeng Dai.

Abstract.
Large Multimodal Models (LMMs) have shown significant progress on various complex vision tasks, building on the solid linguistic and reasoning capacity inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is excessively computationally expensive and causes extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, V-LoRA, enables accurate and efficient vision tasks through 1) an accuracy-aware LoRA adapter generation approach that produces LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapters batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype V-LoRA on five popular vision tasks on three LMMs.
<span class="ltx_text" id="id31.id1.1.2" style="color:#000000;">Experiment results reveal that V-LoRA improves 24-62% of the accuracy compared to the original LMMs and reduces 20-89% of the latency compared to the state-of-the-art LoRA model serving systems.</span></span></p> </div> <span class="ltx_note ltx_note_frontmatter ltx_role_submissionid" id="id1"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">submissionid: </span>¡123¿</span></span></span><span class="ltx_note ltx_role_footnotetext" id="footnotex6"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">footnotetext: </span>Work was done while Liang, Wenming, Qingfeng, Rui, Xinyu, and Yazhu interned at the Institute for AI Industry Research (AIR), Tsinghua University.</span></span></span> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1. </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Encouraged by the success of LLMs in NLP applications <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib7" title="">LLM12App, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib8" title="">7AppLLM, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib10" title="">indataAppLLM, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib13" title="">LLMUseCase, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib56" title="">li2024personal, </a>)</cite>, Large Multimodal Models (LMMs) <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib98" title="">zhu2023minigpt, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib12" title="">gpt4o, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib11" title="">claude, </a>)</cite> have attracted great attention from both academia and industry. They enhance LLMs by perceiving and interpreting multimodal signals (<em class="ltx_emph ltx_font_italic" id="S1.p1.1.1">e.g.,</em> visual inputs <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib60" title="">liu2024visual, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib19" title="">bai2023qwen, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib73" title="">team2023gemini, </a>)</cite>) and well accomplish many complex multi-modal tasks that prior models cannot. For example, GPT-4o <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib12" title="">gpt4o, </a>)</cite> achieves leading accuracy on many multimodal tasks such as visual question answering <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib37" title="">goyal2017making, </a>)</cite>. 
Yet when applied to practical applications that require domain-specific knowledge, LMMs often show suboptimal performance, much like early LLMs suffering from hallucinations (zhang2023siren).

Low-rank adaptation (LoRA) (hu2021lora; dettmers2024qlora) provides a promising way to integrate external knowledge into an LMM. It fine-tunes a small set of additional parameters, known as LoRA adapters, on domain-specific datasets to learn the target knowledge, while freezing the base model to preserve its original capability (more in §2). LLMs often leverage retrieval-augmented generation (RAG) (lewis2020retrieval) for the same goal. Unlike LoRA, which modifies model parameters, RAG appends retrieved knowledge (e.g., documents) to requests (i.e., input data) to produce accurate responses. However, this data-augmented method is not suitable for time-sensitive vision applications: its retrieval process and long-context requests incur >10× response delay (jin2024ragcache). Conversely, LoRA merges fine-tuned adapters into the LMM at runtime and efficiently generates high-quality, consistent domain-specific responses without additional overhead.

Despite these advantages, LoRA introduces complex system challenges. Recent work on system optimization for linguistic applications with LoRA LLMs (chen2024punica; sheng2023slora; wu2024dlora) has made noticeable progress.
Punica (chen2024punica) and S-LoRA (sheng2023slora) propose unmerged inference to overcome the limitation of merged inference, which can merge only one adapter at a time: it computes multiple LoRA adapters in parallel while batching the shared base-model computation across requests, boosting system efficiency. dLoRA (wu2024dlora) further balances throughput and latency by switching between the merged and unmerged inference modes (more in §2). However, these efforts fail to meet the high efficiency and diverse requirements of vision applications (more in §3.2).

In this paper, we answer the following research question: Can we leverage LoRA LMMs to enrich vision applications while meeting their performance requirements? We argue that the answer is yes, but the following challenges must be tackled.

First, many existing small models, trained on domain-specific datasets and used in current vision applications, outperform LMMs on their target tasks. To preserve accuracy, their knowledge must be integrated into the LMM with LoRA adapters. However, simply training one LoRA adapter per task is uneconomical, while fusing too much external knowledge (e.g., multiple small models) into a single adapter causes inevitable accuracy degradation.

Second, a vision application often involves diverse external knowledge. When serving multiple vision applications in particular, the LoRA LMM is very likely asked to execute multiple LoRA adapters simultaneously. Although unmerged inference can compute heterogeneous adapters in parallel, it introduces excessive delays. Therefore, our LoRA LMM inference system must compute concurrent heterogeneous adapters with low latency.

Lastly, different vision applications have distinct performance requirements.
Real-time video analytics (yuan2022PacketGame; AccDecoder) needs low latency, while visual retrieval (kang13blazeit) prefers high throughput. Meeting these needs requires careful management of LoRA adapters and flexible scheduling of application requests. A typical LoRA adapter manager (e.g., dLoRA) is not designed for vision applications and incurs excessive overhead; its inference mode switcher alone costs over half of the LMM inference time.

This paper presents V-LoRA, an end-to-end LoRA LMM serving system, including LoRA adapter preparation (§4.2) and inference runtime (§4.3, §4.4), to empower vision applications. V-LoRA addresses the above challenges with the following techniques:

Accuracy-aware LoRA adapter generation. We propose a LoRA generator (§4.2) that prepares domain-specific LoRA adapters to produce accurate results on target tasks with external knowledge. Considering the complex accuracy variations of knowledge fusion (§3.2), LoRA adapter generation can be formulated as a constrained bin-packing problem: given external knowledge, i.e., small models and domain-specific datasets, generate the minimum number of LoRA adapters while ensuring the accuracy specified by each vision application. We design an accuracy-aware knowledge-fusion algorithm with a greedy heuristic to solve it, sketched below. Additionally, we introduce vision task heads, incorporated as part of the LoRA adapter, to enable low-latency responses for vision tasks.
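To make the greedy heuristic concrete, here is a minimal sketch of accuracy-constrained knowledge packing. The task list, the accuracy requirements, and the `estimate_accuracy` helper (e.g., backed by profiling runs) are illustrative assumptions of this sketch, not V-LoRA's actual algorithm, which is described in §4.2.1.

```python
from typing import Callable, Dict, List

def fuse_knowledge_greedy(
    tasks: List[str],
    accuracy_req: Dict[str, float],
    estimate_accuracy: Callable[[List[str], str], float],
) -> List[List[str]]:
    """Greedily pack tasks (external knowledge sources) into as few LoRA
    adapters as possible while every task fused into an adapter is still
    predicted to meet its accuracy requirement.

    estimate_accuracy(group, task) -> predicted accuracy of `task` when
    fine-tuned jointly with the other tasks in `group` (assumed helper,
    e.g., a profiled lookup table).
    """
    adapters: List[List[str]] = []
    # Place the hardest-to-satisfy tasks first, a common packing heuristic.
    for task in sorted(tasks, key=lambda t: accuracy_req[t], reverse=True):
        placed = False
        for group in adapters:
            candidate = group + [task]
            # Fuse only if *every* member still meets its requirement.
            if all(estimate_accuracy(candidate, t) >= accuracy_req[t]
                   for t in candidate):
                group.append(task)
                placed = True
                break
        if not placed:
            adapters.append([task])  # open a new adapter for this task
    return adapters
```

Packing the tasks with the strictest accuracy requirements first mirrors standard bin-packing heuristics: they are the hardest to place, and deferring them tends to force extra adapters.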
Adaptive-tiling LoRA adapters batching. We propose a concurrent LoRA adapters batching method (§4.3), comprising the Adaptive-Tiling Matrix Multiplication (ATMM) operator and its optimal tiling search algorithm, for efficient computation of heterogeneous LoRA adapters. The offline search algorithm identifies the optimal tiling configuration for each possible input matrix shape, builds a hash table storing these input-to-tiling pairs, and compiles their code implementations for standby. At runtime, ATMM selects the optimal tiling configuration from the hash table according to the shapes of the concurrent requests and the invoked LoRA adapters, then executes the corresponding pre-compiled implementation with high efficiency (see the sketch below).
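The following is a minimal sketch of this lookup-and-dispatch structure. The key layout, the tiling triples, and the kernel registration interface are assumptions made for illustration; the real ATMM operator and its profile-based search are described in §4.3 and Appendices A-B. The fallback path is a plain gather-based computation, not the tiled kernel itself.

```python
from typing import Callable, Dict, Tuple
import torch

# (total_tokens, hidden_dim, lora_rank) -> tiling configuration chosen
# offline by the profile-based search; entries here are placeholders.
TilingKey = Tuple[int, int, int]
_TILING_TABLE: Dict[TilingKey, Tuple[int, int, int]] = {}
_KERNELS: Dict[Tuple[int, int, int], Callable] = {}

def register_tiling(key: TilingKey, tiling: Tuple[int, int, int],
                    kernel: Callable) -> None:
    """Offline: store the profiled-optimal tiling and its pre-compiled kernel."""
    _TILING_TABLE[key] = tiling
    _KERNELS[tiling] = kernel

def atmm(x: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor,
         seg_ids: torch.Tensor) -> torch.Tensor:
    """Runtime dispatch: pick the tiling for the current shapes and run it.
    x: (tokens, d); lora_A: (n_adapters, r, d); lora_B: (n_adapters, d, r);
    seg_ids: adapter index of each token."""
    key = (x.shape[0], x.shape[1], lora_A.shape[1])
    tiling = _TILING_TABLE.get(key)
    if tiling is not None:
        return _KERNELS[tiling](x, lora_A, lora_B, seg_ids)
    # Fallback: gather each token's adapter and compute the bypass directly.
    A = lora_A[seg_ids]                      # (tokens, r, d)
    B = lora_B[seg_ids]                      # (tokens, d, r)
    h = torch.einsum("td,trd->tr", x, A)     # per-token x @ A^T
    return torch.einsum("tr,tdr->td", h, B)  # per-token h @ B^T
```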
Flexible LoRA adapters orchestration. To meet the diverse requirements of vision applications, we propose an orchestrator (§4.4) that efficiently and flexibly orchestrates LoRA adapters at runtime. Two tools facilitate this: a switcher that leverages ATMM and unified memory management to enable swift inference-mode switches and LoRA adapter swaps, and a mixture inference mode, deLoRA, that mitigates starvation. Using these tools, we design an algorithm that dynamically switches among three inference modes, schedules requests, and manages LoRA adapters to satisfy the performance requirement of each application; an illustrative mode-selection sketch follows.
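As an illustration only, the choice among the three modes for one batch might look like the sketch below. The actual scheduling policy (§4.4.3) also accounts for request deadlines, adapter placement, and switch cost, none of which are modeled here; the `dominance` threshold is a made-up knob for this sketch.

```python
from collections import Counter
from typing import List

def choose_inference_mode(batch_adapter_ids: List[int],
                          dominance: float = 0.9) -> str:
    """Illustrative mode selection for one batch of requests:
      - "merged":   all requests use the same adapter, so it can be folded
                    into the base weights with no extra per-token cost;
      - "unmerged": requests spread across many adapters, computed together
                    through the bypass path;
      - "mixed":    one adapter dominates but the rest must not starve, so
                    the dominant one is merged while the others run unmerged
                    (deLoRA-style).
    """
    counts = Counter(batch_adapter_ids)
    top_share = counts.most_common(1)[0][1] / len(batch_adapter_ids)
    if len(counts) == 1:
        return "merged"
    if top_share >= dominance:
        return "mixed"
    return "unmerged"
```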
We summarize our key contributions as follows:

• To the best of our knowledge, we are the first to identify and solve the key problems in empowering vision applications with LoRA LMM.

• We prototype V-LoRA (we will release the code and documentation after this paper is accepted), which enables accurate, efficient, and flexible LoRA LMM serving for vision applications through accuracy-aware LoRA adapter generation, adaptive-tiling LoRA adapters batching, and efficient and flexible LoRA adapters orchestration.

• We implement V-LoRA and evaluate it on five popular analytical tasks with real-world traces on three LMMs. Experimental results show that V-LoRA improves accuracy by 24-62% over the original LMMs and reduces latency by 20-89% compared to state-of-the-art methods.

This work does not raise any ethical issues.

2. Background

Vision applications today exploit AI technology to process images or videos in RGB space (zhang2018awstream; khani2023recl; Romil2022Ekya). In video analytics, for instance, multiple DNNs, each well trained on a domain-specific dataset, take care of one target task apiece and together serve the application well (kang2022viva; kang2017noscope; yu2024vqpy). However, the limited capabilities of small models hinder the development of vision applications. Current applications still rely on simple combinations of vision tasks such as image classification (yuan2022PacketGame), vehicle counting (li2020reducto), and target detection (du2020server). With the natural-language interface inherited from LLMs, LMMs can enrich future vision applications. For example, served by an LMM, a police officer can find the right target given only a text-described query such as "A boy wearing a red sweater lost at the corner". Therefore, this paper aims to empower vision tasks with LoRA LMM and enrich future vision applications.

Figure 1. Illustration of LMM inference.
Qwen-VL-7B (QwenVL) generates the right action recognition answer for a piece of data from the UCF-101 dataset (soomro2012ucf101) and the corresponding prompt.

Large multimodal models (LMMs) aim to achieve stronger general intelligence by extending LLMs with multimodal inputs. Since more than 80% of human perception and activities are mediated through vision (thomas2024vision), it is natural to start the exploration by equipping LLMs with "eyes". By introducing a visual receptor, comprising a visual encoder (e.g., ViT (radford2021learning)) and a vision-language projector (e.g., Q-Former (li2023blip)), LMMs give LLMs visual capacity. Fig. 1 illustrates the inference procedure of LMMs. Given an image input and its prompt, the visual encoder splits the image into small patches (e.g., 14×14-pixel blocks) and extracts the visual features of each. The vision-language projector then converts the patches' visual features into visual tokens and feeds them into the LLM (achiam2023gpt4; chowdhery2023palm; touvron2023llama2; jiang2023mistral), together with the embedded text tokens from the prompt, to generate the answer.
The LLM maps the input into high-level features and predicts the probability distribution of the next token with the language modeling (LM) head in an autoregressive pattern (kwon2023efficient; yu2022orca). As the dotted arrows in Fig. 1 depict, it iteratively generates one token at a time from the input tokens and the previously generated tokens, until an end-of-sentence (<EOS>) token is emitted; a minimal sketch of this loop is shown below.
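The sketch below shows this autoregressive loop with greedy decoding and no KV cache or batching; `lmm` is assumed to be a callable that maps the current token sequence (visual tokens already injected) to next-token logits.

```python
import torch

@torch.no_grad()
def generate(lmm, prompt_tokens: torch.Tensor, eos_id: int,
             max_new_tokens: int = 128) -> torch.Tensor:
    """Greedy autoregressive decoding: the LM head's logits give the
    next-token distribution; generation stops once <EOS> is emitted.
    prompt_tokens: (1, seq_len) visual + text tokens of the request."""
    tokens = prompt_tokens
    for _ in range(max_new_tokens):
        logits = lmm(tokens)                           # (1, seq, vocab)
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)  # append and repeat
        if next_tok.item() == eos_id:
            break
    return tokens
```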
Low-rank adaptation (LoRA) (hu2021lora; dettmers2024qlora; chavan2023one) is a widely used parameter-efficient fine-tuning method (peft) for integrating external knowledge into large models. A LoRA adapter is a small set of trainable parameters that learns this knowledge, typically placed in the attention layers (hu2021lora). The core of LoRA, as illustrated in Fig. 2(a), is to represent each weight update $\Delta W$ as the product of two matrices, $A$ and $B$, with dimensions $r\times d$ and $d\times r$, respectively, whose rank $r$ is much smaller than $d$, the dimension of the base model weight $W$ ($d\times d$). This works because weight updates in large models exhibit a low intrinsic rank (hu2021lora). During fine-tuning, LoRA updates only $A$ and $B$ while keeping $W$ frozen. At inference, the LoRA adapters can be computed in a bypass, as shown in Fig. 2(a), with their results simply added onto the output; alternatively, the product $B\times A$ (i.e., $\Delta W$) can be merged into the base weight $W$, which keeps the computation overhead identical to the base model, as shown in Fig. 2(b). Both paths are equivalent, as the sketch below illustrates.
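A minimal numerical sketch of the two inference paths (the usual LoRA scaling factor is omitted for clarity):

```python
import torch

d, r = 1024, 16                  # hidden size and LoRA rank, r << d
W = torch.randn(d, d)            # frozen base weight (d x d)
A = torch.randn(r, d) * 0.01     # LoRA A: r x d (trainable)
B = torch.randn(d, r) * 0.01     # LoRA B: d x r (trainable)
x = torch.randn(8, d)            # a batch of token activations

# Unmerged (bypass) inference: base output plus the low-rank update.
y_unmerged = x @ W.T + (x @ A.T) @ B.T

# Merged inference: fold delta_W = B @ A into W once, then run as usual.
W_merged = W + B @ A
y_merged = x @ W_merged.T

# The two paths are mathematically identical (up to floating-point error).
assert torch.allclose(y_unmerged, y_merged, atol=1e-3)
```

The unmerged path lets different requests in a batch use different $(A, B)$ pairs at the cost of the extra bypass matmuls, whereas the merged path adds no per-token cost but fixes a single adapter; this is exactly the trade-off the serving systems below navigate.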
Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">2(b)</span></a>.</p> </div> <figure class="ltx_figure" id="S2.F2"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S2.F2.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="162" id="S2.F2.sf1.g1" src="x2.png" width="207"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S2.F2.sf1.2.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S2.F2.sf1.3.2" style="font-size:80%;">Unmerge mode.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S2.F2.sf2"> <p class="ltx_p ltx_align_center" id="S2.F2.sf2.1"><span class="ltx_text" id="S2.F2.sf2.1.1" style="position:relative; bottom:1.4pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="161" id="S2.F2.sf2.1.1.g1" src="x3.png" width="141"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S2.F2.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S2.F2.sf2.4.2" style="font-size:80%;">Merge mode.</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2. </span>LoRA model inference. (a) Unmerge mode supports computing multiple different LoRA adapters in a batch. <math alttext="A_{1}" class="ltx_Math" display="inline" id="S2.F2.3.m1.1"><semantics id="S2.F2.3.m1.1b"><msub id="S2.F2.3.m1.1.1" xref="S2.F2.3.m1.1.1.cmml"><mi id="S2.F2.3.m1.1.1.2" xref="S2.F2.3.m1.1.1.2.cmml">A</mi><mn id="S2.F2.3.m1.1.1.3" xref="S2.F2.3.m1.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S2.F2.3.m1.1c"><apply id="S2.F2.3.m1.1.1.cmml" xref="S2.F2.3.m1.1.1"><csymbol cd="ambiguous" id="S2.F2.3.m1.1.1.1.cmml" xref="S2.F2.3.m1.1.1">subscript</csymbol><ci id="S2.F2.3.m1.1.1.2.cmml" xref="S2.F2.3.m1.1.1.2">𝐴</ci><cn id="S2.F2.3.m1.1.1.3.cmml" type="integer" xref="S2.F2.3.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.3.m1.1d">A_{1}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.3.m1.1e">italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="B_{1}" class="ltx_Math" display="inline" id="S2.F2.4.m2.1"><semantics id="S2.F2.4.m2.1b"><msub id="S2.F2.4.m2.1.1" xref="S2.F2.4.m2.1.1.cmml"><mi id="S2.F2.4.m2.1.1.2" xref="S2.F2.4.m2.1.1.2.cmml">B</mi><mn id="S2.F2.4.m2.1.1.3" xref="S2.F2.4.m2.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S2.F2.4.m2.1c"><apply id="S2.F2.4.m2.1.1.cmml" xref="S2.F2.4.m2.1.1"><csymbol cd="ambiguous" id="S2.F2.4.m2.1.1.1.cmml" xref="S2.F2.4.m2.1.1">subscript</csymbol><ci id="S2.F2.4.m2.1.1.2.cmml" xref="S2.F2.4.m2.1.1.2">𝐵</ci><cn id="S2.F2.4.m2.1.1.3.cmml" type="integer" xref="S2.F2.4.m2.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.4.m2.1d">B_{1}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.4.m2.1e">italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math> constitute LoRA adapter #1. 
(b) Merge mode supports inference with no extra delay but only one adapter at a time.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S2.p4"> <p class="ltx_p" id="S2.p4.1"><span class="ltx_text ltx_font_bold" id="S2.p4.1.1">LoRA models serving system</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib25" title="">chen2024punica, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib69" title="">sheng2023slora, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite> aims to improve LoRA model inference efficiency. In unmerged inference, computing the adapter’s two small matrices underutilizes GPU computing units, leading to unnecessary delay and wasted resources. To tackle this, Punica <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib25" title="">chen2024punica, </a>)</cite> and S-LoRA <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib69" title="">sheng2023slora, </a>)</cite> batch multiple heterogeneous adapters, as the cascaded LoRA adapters illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S2.F2.sf1" title="In Figure 2 ‣ 2. Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">2(a)</span></a>, and compute them within a single custom CUDA kernel. This method does boost system throughput, but it causes significant additional overhead (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS2" title="3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.2</span></a>). dLoRA <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite> merges the most accessed LoRA adapter into the base model and switches to unmerged inference when necessary. However, its mode switch causes an unacceptable time cost (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS2" title="3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.2</span></a>). Besides, like Punica and S-LoRA, dLoRA’s unmerged inference also fails to address the large extra computation overhead.</p> </div> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3. </span>Motivation and Challenges</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">This section explores two questions: (1) What benefits can LoRA LMM bring to vision applications (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS1" title="3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.1</span></a>)? 
(2) What challenges must be tackled when empowering vision applications with LoRA LMM (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS2" title="3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.2</span></a>)?</p> </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">3.1 Potential Benefits from LoRA LMM</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p1.1.1">LMMs offer state-of-the-art performance on many complex vision tasks.</span> To demonstrate this, we take zero-shot grounding and visual question answering as examples and conduct experiments on the Aircraft <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib9" title="">Aircraft, </a>)</cite> and VQAv2 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib38" title="">vqa2, </a>)</cite> datasets with Qwen-VL-7B <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib20" title="">QwenVL, </a>)</cite> (as the LMM), and YOLO <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib33" title="">glenn2021YOLOV5, </a>)</cite> and OSCAR <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib55" title="">li2020oscar, </a>)</cite> (as the baseline small models). Aircraft contains 103 remote sensing images that neither Qwen-VL nor YOLO was pre-trained on, serving as the zero-shot test; VQAv2, the most popular visual question answering dataset, includes both text and visual modalities and tests multi-modal ability. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F3" title="Figure 3 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3</span></a> shows that, with the solid linguistic and reasoning capabilities inherited from LLMs, Qwen-VL greatly outperforms the small models. It delivers a 48.9% higher F1-score in zero-shot grounding than YOLO. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F3.sf1" title="In Figure 3 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3(a)</span></a> visualizes the results on data #38, where Qwen-VL produces more accurate bounding boxes than YOLO. For multi-modal tasks, Qwen-VL achieves 78.8% accuracy, 7.5% higher than OSCAR. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F3.sf2" title="In Figure 3 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3(b)</span></a> exemplifies a typical vehicle counting task in video analytics applications, where only Qwen-VL generates the correct answer.</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p2.1.1">With external knowledge from LoRA adapters, LMM attains remarkable accuracy gains on domain-specific tasks.</span> To investigate the accuracy improvement from external knowledge, we fine-tune three LoRA adapters for image classification, object detection, and video classification, respectively, on the external datasets AID <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib82" title="">AID, </a>)</cite>, Aircraft <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib9" title="">Aircraft, </a>)</cite>, and UCF101 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib72" title="">UCF101, </a>)</cite>. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F4" title="Figure 4 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4</span></a> shows the results. With fine-tuned LoRA adapters, Qwen-VL receives 45.2%, 24.5%, and 62.2% accuracy gains on the three domain-specific tasks, respectively. Note that we only validated the potential gain without fully exploring advanced training techniques like data enhancement <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib92" title="">VideoEnhancement, </a>)</cite>. 
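</p> <p class="ltx_p">The adapters above are produced with standard LoRA fine-tuning. The following is a minimal sketch of such a recipe, assuming the HuggingFace PEFT interface; the checkpoint name, rank, target modules, and <span class="ltx_text ltx_font_typewriter">domain_dataloader</span> are illustrative placeholders, not the exact configuration used here.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Minimal LoRA fine-tuning sketch (illustrative, not the exact V-LoRA training code).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen-VL-Chat"   # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, trust_remote_code=True)

# Rank-r adapters on the attention projections; r and target modules are illustrative.
cfg = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                 target_modules=["c_attn", "c_proj"])
model = get_peft_model(model, cfg)
model.print_trainable_parameters()   # only the A/B matrices are trainable

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in domain_dataloader:      # e.g., AID / Aircraft / UCF101 samples with labels
    loss = model(**batch).loss       # standard cross-entropy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("lora-aircraft")  # saves adapter weights only (tens of MB)
</pre> <p class="ltx_p">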
Nevertheless, based on current results, we believe these techniques could further improve accuracy in future work.</p> </div> <figure class="ltx_figure" id="S3.F3"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S3.F3.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="174" id="S3.F3.sf1.g1" src="x4.png" width="174"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S3.F3.sf1.2.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S3.F3.sf1.3.2" style="font-size:80%;">Zero-shot Grounding results on data #38 in Aircraft dataset <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib9" title="">Aircraft, </a>)</cite>.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S3.F3.sf2"> <p class="ltx_p ltx_align_center" id="S3.F3.sf2.1"><span class="ltx_text" id="S3.F3.sf2.1.1" style="position:relative; bottom:1.4pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="174" id="S3.F3.sf2.1.1.g1" src="x5.png" width="185"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S3.F3.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S3.F3.sf2.4.2" style="font-size:80%;">Visualization of data #133279 in Visual Question Answering <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib38" title="">vqa2, </a>)</cite>.</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3. </span>The potential of LMM. (a) To ground the airplanes in remote sensing view in zero-shot, LMM Qwen-VL, in general, delivers 67.2% accuracy <em class="ltx_emph ltx_font_italic" id="S3.F3.3.1">v.s.</em> the 18.3% of YOLO <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib33" title="">glenn2021YOLOV5, </a>)</cite>. (b) In VQA, Qwen-VL yields 78.8% accuracy <em class="ltx_emph ltx_font_italic" id="S3.F3.4.2">v.s.</em> the 73.3% of OSCAR <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib55" title="">li2020oscar, </a>)</cite>. </figcaption> </figure> <figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="340" id="S3.F4.g1" src="x6.png" width="705"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4. </span>LoRA adapters with domain-specific knowledge improve the Qwen-VL’s accuracy on target tasks.</figcaption> </figure> <div class="ltx_para" id="S3.SS1.p3"> <p class="ltx_p" id="S3.SS1.p3.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p3.1.1">LoRA LMM enables more flexible serving.</span> Today’s vision applications are served by many domain-specific small models <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib67" title="">Romil2022Ekya, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib49" title="">khani2023recl, </a>)</cite>. 
When invoked, the specific model is loaded into the GPU and swapped out to main memory after execution <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib68" title="">shen2019nexus, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib21" title="">bai2020pipeswitch, </a>)</cite>. This swapping method allows model execution in a large batch but incurs inevitable transmission costs. LoRA adapters usually have fewer parameters than small models (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS4.SSS1" title="4.4.1 Swift inference mode switch. ‣ 4.4 Flexible LoRA Adapters Orchestration ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.4.1</span></a>). Swapping them while keeping the LMM in the GPU provides more flexible serving. In our test, swapping an adapter reduces delay by 97% compared with OSCAR (15ms <em class="ltx_emph ltx_font_italic" id="S3.SS1.p3.1.2">v.s.</em> 520ms) and by 86% compared with YOLO’s 110ms.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">3.2 Challenges of Empowering Vision Applications with LoRA LMM </h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">To empower diverse vision applications, the LoRA LMM system must offer accurate and efficient responses to the vision tasks involved and meet the distinct application-specified performance requirements. To this end, we face three challenges. </p> </div> <figure class="ltx_figure" id="S3.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="346" id="S3.F5.g1" src="x7.png" width="622"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5. </span>Accuracy decreases when fusing knowledge from multiple domain-specific small models into a single LoRA. The trend varies across vision tasks. </figcaption> </figure> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p2.1.1">C1: Limited capacity of LoRA adapter.</span> Integrating external knowledge from existing small models or specific datasets into the LMM is essential for generating accurate results. Parameter-efficient fine-tuning with LoRA adapters shows promise. However, it is challenging because a LoRA adapter has only limited capacity, which varies across vision tasks. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F5" title="Figure 5 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">5</span></a> demonstrates this by fusing external knowledge from different numbers of small models on diverse tasks into a single LoRA adapter (experiment setup details in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.SS1" title="6.1 Experimental Setup ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">6.1</span></a>). Training a separate adapter for each small model consistently achieves high accuracy but results in significant adapter capacity waste, while fusing too many small models into one adapter incurs significant accuracy degradation. 
For example, the LoRA adapter that fuses six image classification models in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F5" title="Figure 5 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">5</span></a> retains over 95% accuracy, while fusing six video classification models causes a remarkable accuracy drop.</p> </div> <div class="ltx_para" id="S3.SS2.p3"> <p class="ltx_p" id="S3.SS2.p3.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p3.1.1">C2: Inefficient concurrent request batching.</span> One vision application often involves multiple small models <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib43" title="">jiang2018chameleon, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib31" title="">du2020server, </a>)</cite>. Hence, serving multiple vision applications with an LMM very likely leads to the simultaneous invocation of multiple heterogeneous LoRA adapters. However, current LoRA model inference systems struggle to process them efficiently, particularly under high concurrency. To demonstrate this, we measure three state-of-the-art systems. <em class="ltx_emph ltx_font_italic" id="S3.SS2.p3.1.2">Punica</em> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib25" title="">chen2024punica, </a>)</cite> and <em class="ltx_emph ltx_font_italic" id="S3.SS2.p3.1.3">S-LoRA</em> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib69" title="">sheng2023slora, </a>)</cite> each customize a CUDA operator to batch heterogeneous LoRA adapter computations in unmerge mode (more details in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS1" title="4.3.1 ATMM: Adaptive-tiling matrix multiplication operator. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. 
V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.1</span></a>), while <em class="ltx_emph ltx_font_italic" id="S3.SS2.p3.1.4">dLoRA</em> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite> calls the PyTorch operator <span class="ltx_text ltx_font_typewriter" id="S3.SS2.p3.1.5">Einsum<span class="ltx_note ltx_role_footnote" id="footnote2"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_serif" id="footnote2.1.1.1">†</span></span><span class="ltx_text ltx_font_serif" id="footnote2.5"> Einsum uses the Einstein summation convention </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text ltx_font_serif" id="footnote2.6.1">(</span><a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib32" title="">Einstein1922<span class="ltx_text ltx_font_serif" id="footnote2.7.2.1.1">, </span></a><span class="ltx_text ltx_font_serif" id="footnote2.8.3">)</span></cite><span class="ltx_text ltx_font_serif" id="footnote2.9">, describing tensor index operations in a concise string form, for efficient LoRA adapter batching.</span></span></span></span></span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib3" title="">torcheinsum, </a>)</cite>. The experimental workload randomly generates 2-4 requests per second, each with an input length of 128 to 1024 tokens, and we repeat the measurement 1,000 times to obtain the latency. For fairness, all experiments run on a server equipped with an NVIDIA A100 GPU <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib2" title="">A100, </a>)</cite> via the PEFT <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib61" title="">peft, </a>)</cite> framework, and use Qwen-VL-7B <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib20" title="">QwenVL, </a>)</cite> as the base model; a minimal sketch of the batched unmerged computation that these operators implement is shown below.</p> </div> <div class="ltx_para" id="S3.SS2.p4"> <p class="ltx_p" id="S3.SS2.p4.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a> plots the results. 
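</p> <p class="ltx_p">For reference, the unmerged path that these operators batch amounts to adding each request’s own low-rank update on top of the shared base projection. The following is a minimal PyTorch einsum-style sketch with illustrative shapes; it is not the Punica/S-LoRA CUDA kernels themselves.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Batched unmerged LoRA: each request in the batch may use a different adapter.
# Shapes are illustrative; real systems fuse this into one custom CUDA kernel.
import torch

bsz, seqlen, d, r = 4, 256, 4096, 64
x = torch.randn(bsz, seqlen, d, device="cuda", dtype=torch.float16)
W = torch.randn(d, d, device="cuda", dtype=torch.float16)          # shared base weight

pool_A = torch.randn(8, d, r, device="cuda", dtype=torch.float16)  # adapter pool
pool_B = torch.randn(8, r, d, device="cuda", dtype=torch.float16)
idx = torch.tensor([0, 3, 3, 5], device="cuda")   # adapter id chosen per request

A = pool_A[idx]   # (bsz, d, r) gathered per-request LoRA factors
B = pool_B[idx]   # (bsz, r, d)

y = x @ W                                          # base projection: one large GEMM
y = y + torch.einsum("bsd,bdr,bre->bse", x, A, B)  # two extra matmuls + one add per layer
</pre> <p class="ltx_p">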
The bars of dLoRA, S-LoRA, and Punica denote their extra latency relative to the merged inference, <em class="ltx_emph ltx_font_italic" id="S3.SS2.p4.1.1">e.g.,</em> <math alttext="latency_{dLoRA}-latency_{merge}" class="ltx_Math" display="inline" id="S3.SS2.p4.1.m1.1"><semantics id="S3.SS2.p4.1.m1.1a"><mrow id="S3.SS2.p4.1.m1.1.1"><msub><mi>latency</mi><mi>dLoRA</mi></msub><mo>−</mo><msub><mi>latency</mi><mi>merge</mi></msub></mrow><annotation encoding="application/x-tex" id="S3.SS2.p4.1.m1.1c">latency_{dLoRA}-latency_{merge}</annotation></semantics></math>. The bar of the base model denotes its time cost under the same workload. Unmerged inference yields up to 140ms additional latency when serving four 1024-token requests. This wasted time is enough for the base model to serve four 256-token requests, with resources to spare. The reason for such a high cost is two-fold. 1) Inherently, the additional overhead stems from two additional matrix multiplications and one additional matrix addition per layer, as depicted in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S2.F2.sf1" title="In Figure 2 ‣ 2. Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">2(a)</span></a>; meanwhile, because these computations run in parallel with the base model computations, each layer also requires additional CUDA kernel context operations. 2) In their concrete implementations, all three LoRA adapter batching operators fail to fully utilize the computing units in the GPU, due to significant padding or ill-considered matrix tiling (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS1" title="4.3.1 ATMM: Adaptive-tiling matrix multiplication operator. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.1</span></a>).</p> </div> <figure class="ltx_figure" id="S3.F7"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S3.F7.1" style="width:212.5pt;"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="656" id="S3.F7.1.g1" src="x8.png" width="831"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6. </span>Unmerged inference causes 27-140ms extra latency, equivalent to 40-61% of the base model inference time.</figcaption> </figure> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S3.F7.2" style="width:208.1pt;"> <p class="ltx_p" id="S3.F7.2.1"><span class="ltx_text" id="S3.F7.2.1.1" style="position:relative; bottom:1.4pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="733" id="S3.F7.2.1.1.g1" src="x9.png" width="830"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7. 
</span>Mode switch alone costs 53ms, occupying 64% of the merged inference time of three 256-token requests.</figcaption> </figure> </div> </div> </figure> <div class="ltx_para" id="S3.SS2.p5"> <p class="ltx_p" id="S3.SS2.p5.3"><span class="ltx_text ltx_font_bold" id="S3.SS2.p5.3.4">C3: Inflexible LoRA adapter orchestration.</span> To cope with the distinct performance requirements specified by vision applications, an orchestration that can carefully manage LoRA adapters and flexibly schedule application requests is necessary. We believe an inference mode switch like dLoRA’s is promising, yet it falls significantly short on efficiency: its mode switch yields unacceptable overhead. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a> illustrates a real scheduling state and the mode switch latency for two consecutive inference slots of dLoRA. In this case, dLoRA serves 8 requests, each with an input length of 256, in a first-come-first-served manner. In the first slot, dLoRA serves requests 1-3 in merge mode using the same LoRA adapter, and the heterogeneous requests 4-7 are processed in unmerged mode in the following slot. A mode switch delay of over 53ms forces the last request (the dotted arrow in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a>) to wait 165ms until the next inference slot begins. 
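</p> <p class="ltx_p">For reference, the weight update that a merge or unmerge switch performs on every adapted layer is conceptually simple; the following is a minimal PyTorch sketch with illustrative shapes, not dLoRA’s actual implementation.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Per-layer weight update performed by a merge / unmerge mode switch (sketch).
import torch

d, r = 4096, 64
W = torch.randn(d, d, device="cuda", dtype=torch.float16)   # base projection weight
A = torch.randn(d, r, device="cuda", dtype=torch.float16)   # LoRA factor A
B = torch.randn(r, d, device="cuda", dtype=torch.float16)   # LoRA factor B

def merge(W, A, B):
    # W becomes W + A @ B: afterwards, forward passes need no extra LoRA compute.
    return torch.addmm(W, A, B)

def unmerge(W, A, B):
    # W becomes W - A @ B: restores the clean base weight for other adapters.
    return torch.addmm(W, A, B, alpha=-1.0)

W = merge(W, A, B)     # repeated for every adapted layer in the model
W = unmerge(W, A, B)
</pre> <p class="ltx_p">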
This significant cost stems from 1) unnecessary memory copy of LoRA matrices due to dLoRA’s inefficient memory management, and <span class="ltx_text" id="S3.SS2.p5.3.3" style="color:#000000;">2) the substantial overhead of LoRA matrices <math alttext="\Delta W" class="ltx_Math" display="inline" id="S3.SS2.p5.1.1.m1.1"><semantics id="S3.SS2.p5.1.1.m1.1a"><mrow id="S3.SS2.p5.1.1.m1.1.1" xref="S3.SS2.p5.1.1.m1.1.1.cmml"><mi id="S3.SS2.p5.1.1.m1.1.1.2" mathcolor="#000000" mathvariant="normal" xref="S3.SS2.p5.1.1.m1.1.1.2.cmml">Δ</mi><mo id="S3.SS2.p5.1.1.m1.1.1.1" xref="S3.SS2.p5.1.1.m1.1.1.1.cmml"></mo><mi id="S3.SS2.p5.1.1.m1.1.1.3" mathcolor="#000000" xref="S3.SS2.p5.1.1.m1.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p5.1.1.m1.1b"><apply id="S3.SS2.p5.1.1.m1.1.1.cmml" xref="S3.SS2.p5.1.1.m1.1.1"><times id="S3.SS2.p5.1.1.m1.1.1.1.cmml" xref="S3.SS2.p5.1.1.m1.1.1.1"></times><ci id="S3.SS2.p5.1.1.m1.1.1.2.cmml" xref="S3.SS2.p5.1.1.m1.1.1.2">Δ</ci><ci id="S3.SS2.p5.1.1.m1.1.1.3.cmml" xref="S3.SS2.p5.1.1.m1.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p5.1.1.m1.1c">\Delta W</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p5.1.1.m1.1d">roman_Δ italic_W</annotation></semantics></math> computation, by matrix multiplication <math alttext="A\times B" class="ltx_Math" display="inline" id="S3.SS2.p5.2.2.m2.1"><semantics id="S3.SS2.p5.2.2.m2.1a"><mrow id="S3.SS2.p5.2.2.m2.1.1" xref="S3.SS2.p5.2.2.m2.1.1.cmml"><mi id="S3.SS2.p5.2.2.m2.1.1.2" mathcolor="#000000" xref="S3.SS2.p5.2.2.m2.1.1.2.cmml">A</mi><mo id="S3.SS2.p5.2.2.m2.1.1.1" lspace="0.222em" mathcolor="#000000" rspace="0.222em" xref="S3.SS2.p5.2.2.m2.1.1.1.cmml">×</mo><mi id="S3.SS2.p5.2.2.m2.1.1.3" mathcolor="#000000" xref="S3.SS2.p5.2.2.m2.1.1.3.cmml">B</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p5.2.2.m2.1b"><apply id="S3.SS2.p5.2.2.m2.1.1.cmml" xref="S3.SS2.p5.2.2.m2.1.1"><times id="S3.SS2.p5.2.2.m2.1.1.1.cmml" xref="S3.SS2.p5.2.2.m2.1.1.1"></times><ci id="S3.SS2.p5.2.2.m2.1.1.2.cmml" xref="S3.SS2.p5.2.2.m2.1.1.2">𝐴</ci><ci id="S3.SS2.p5.2.2.m2.1.1.3.cmml" xref="S3.SS2.p5.2.2.m2.1.1.3">𝐵</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p5.2.2.m2.1c">A\times B</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p5.2.2.m2.1d">italic_A × italic_B</annotation></semantics></math>, then added (merge) or subtracted (unmerge) <math alttext="\Delta W" class="ltx_Math" display="inline" id="S3.SS2.p5.3.3.m3.1"><semantics id="S3.SS2.p5.3.3.m3.1a"><mrow id="S3.SS2.p5.3.3.m3.1.1" xref="S3.SS2.p5.3.3.m3.1.1.cmml"><mi id="S3.SS2.p5.3.3.m3.1.1.2" mathcolor="#000000" mathvariant="normal" xref="S3.SS2.p5.3.3.m3.1.1.2.cmml">Δ</mi><mo id="S3.SS2.p5.3.3.m3.1.1.1" xref="S3.SS2.p5.3.3.m3.1.1.1.cmml"></mo><mi id="S3.SS2.p5.3.3.m3.1.1.3" mathcolor="#000000" xref="S3.SS2.p5.3.3.m3.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p5.3.3.m3.1b"><apply id="S3.SS2.p5.3.3.m3.1.1.cmml" xref="S3.SS2.p5.3.3.m3.1.1"><times id="S3.SS2.p5.3.3.m3.1.1.1.cmml" xref="S3.SS2.p5.3.3.m3.1.1.1"></times><ci id="S3.SS2.p5.3.3.m3.1.1.2.cmml" xref="S3.SS2.p5.3.3.m3.1.1.2">Δ</ci><ci id="S3.SS2.p5.3.3.m3.1.1.3.cmml" xref="S3.SS2.p5.3.3.m3.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p5.3.3.m3.1c">\Delta W</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p5.3.3.m3.1d">roman_Δ italic_W</annotation></semantics></math> onto or from the 
base model, by invoking <span class="ltx_text ltx_font_typewriter" id="S3.SS2.p5.3.3.1">torch.addmm</span>, per layer.</span> Conceivably, if the mode switch can be reduced to &lt;10ms (as this paper achieves in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS4.SSS1" title="4.4.1 Swift inference mode switch. ‣ 4.4 Flexible LoRA Adapters Orchestration ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.4.1</span></a>), the average response time in the Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a> case can be reduced by 45ms, with the last request only needing to wait &lt;80ms.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4. </span>V-LoRA Design</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">V-LoRA is an end-to-end system that empowers diverse vision tasks and enriches vision applications with LoRA LMM by addressing the above challenges. We first provide an overview of V-LoRA’s operation, then describe the three core techniques it leverages.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">4.1 System Overview</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1">V-LoRA includes two phases. During the offline phase, the accuracy-aware LoRA adapter generation approach takes the external knowledge from existing domain-specific small models or datasets, as well as the accuracy requirements specified by vision applications (the dotted arrows in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F8" title="Figure 8 ‣ 4.1 System Overview ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">8</span></a>), and generates the minimum number of LoRA adapters (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS2" title="4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.2</span></a>). The generated LoRA adapters are rich in domain-specific knowledge and can output accurate responses to the tasks involved in vision applications.</p> </div> <div class="ltx_para" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1">During the online phase, the flexible LoRA adapter orchestration ingests the requests from vision applications (the solid arrow in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F8" title="Figure 8 ‣ 4.1 System Overview ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">8</span></a>), organizes them into batches, chooses the inference mode, and orchestrates the corresponding LoRA adapters, to minimize the average response latency while guaranteeing each vision application’s latency constraint (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS4" title="4.4 Flexible LoRA Adapters Orchestration ‣ 4. 
V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.4</span></a>). Each request batch is delivered to the corresponding adapters and the LMM and inferred in the chosen mode. The LoRA adapter batching and the inference mode switcher are implemented with ATMM, the adaptive-tiling matrix multiplication operator (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3" title="4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3</span></a>), achieving high efficiency.</p> </div> <figure class="ltx_figure" id="S4.F8"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="384" id="S4.F8.g1" src="x10.png" width="789"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 8. </span>V-LoRA overview.</figcaption> </figure> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">4.2 Accuracy-aware LoRA Adapter Generation</h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1">To offer accurate results on domain-specific tasks, we propose accuracy-aware LoRA adapter generation, consisting of an accuracy-aware knowledge-fusion algorithm and the vision task head. </p> </div> <section class="ltx_subsubsection" id="S4.SS2.SSS1"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.2.1 Accuracy-aware knowledge-fusion algorithm.</h4> <div class="ltx_para" id="S4.SS2.SSS1.p1"> <p class="ltx_p" id="S4.SS2.SSS1.p1.1">To make adapters easy to manage at runtime, we aim to integrate external knowledge into the fewest LoRA adapters without violating the accuracy requirements of any vision task. To this end, the training method must account for the limited capacity of the LoRA adapter and the complex accuracy variations arising from knowledge fusion. </p> </div> <figure class="ltx_figure" id="S4.F9"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="325" id="S4.F9.g1" src="x11.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 9. </span>Accuracy-aware LoRA generation integrates the domain-specific knowledge into LoRA adapters with the proposed accuracy-aware knowledge-fusion algorithm.</figcaption> </figure> <figure class="ltx_figure" id="S4.F10"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="524" id="S4.F10.g1" src="x12.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 10. </span><span class="ltx_text" id="S4.F10.2.1" style="color:#000000;">An example of fusing the knowledge from six object detection models, each detecting one class of object, with the accuracy-aware knowledge-fusion algorithm. </span></figcaption> </figure> <div class="ltx_para" id="S4.SS2.SSS1.p2"> <p class="ltx_p" id="S4.SS2.SSS1.p2.1">This problem is highly challenging. Suppose we have an oracle that knows in advance the accuracy of a LoRA adapter fusing any combination of knowledge. 
Then, the problem can be formulated as a <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS1.p2.1.1">constrained bin-packing problem</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib36" title="">garey1981approximation, </a>)</cite>, where the objective is to pack the knowledge into the minimum number of LoRA adapter bins, ensuring each adapter maintains each vision task’s accuracy beyond the requirement. However, this relaxed variant is NP-hard, and unfortunately, such an oracle does not exist.</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p3"> <p class="ltx_p" id="S4.SS2.SSS1.p3.1"><span class="ltx_text" id="S4.SS2.SSS1.p3.1.1" style="color:#000000;">To solve the original problem, we propose a simple and easy-to-implement heuristic algorithm, the <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS1.p3.1.1.1">accuracy-aware knowledge-fusion algorithm</span>, to determine which knowledge fuses into one LoRA adapter.</span> It first collects the dataset, as illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F9" title="Figure 9 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">9</span></a>, by executing representative data on every existing domain-specific small model; if applications provide the datasets, we directly use them. After that, the training process is a standard supervised learning pipeline that computes the cross-entropy loss, <math alttext="L=CE(y,\hat{y})" class="ltx_Math" display="inline" id="S4.SS2.SSS1.p3.1.m1.2"><semantics id="S4.SS2.SSS1.p3.1.m1.2a"><mrow id="S4.SS2.SSS1.p3.1.m1.2.3" xref="S4.SS2.SSS1.p3.1.m1.2.3.cmml"><mi id="S4.SS2.SSS1.p3.1.m1.2.3.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.2.cmml">L</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.1" xref="S4.SS2.SSS1.p3.1.m1.2.3.1.cmml">=</mo><mrow id="S4.SS2.SSS1.p3.1.m1.2.3.3" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.cmml"><mi id="S4.SS2.SSS1.p3.1.m1.2.3.3.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.2.cmml">C</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.1" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.1.cmml"></mo><mi id="S4.SS2.SSS1.p3.1.m1.2.3.3.3" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.3.cmml">E</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.1a" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.1.cmml"></mo><mrow id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml"><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2.1" stretchy="false" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml">(</mo><mi id="S4.SS2.SSS1.p3.1.m1.1.1" xref="S4.SS2.SSS1.p3.1.m1.1.1.cmml">y</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml">,</mo><mover accent="true" id="S4.SS2.SSS1.p3.1.m1.2.2" xref="S4.SS2.SSS1.p3.1.m1.2.2.cmml"><mi id="S4.SS2.SSS1.p3.1.m1.2.2.2" xref="S4.SS2.SSS1.p3.1.m1.2.2.2.cmml">y</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.2.1" xref="S4.SS2.SSS1.p3.1.m1.2.2.1.cmml">^</mo></mover><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2.3" stretchy="false" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.SSS1.p3.1.m1.2b"><apply id="S4.SS2.SSS1.p3.1.m1.2.3.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3"><eq id="S4.SS2.SSS1.p3.1.m1.2.3.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.1"></eq><ci id="S4.SS2.SSS1.p3.1.m1.2.3.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.2">𝐿</ci><apply id="S4.SS2.SSS1.p3.1.m1.2.3.3.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3"><times 
id="S4.SS2.SSS1.p3.1.m1.2.3.3.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.1"></times><ci id="S4.SS2.SSS1.p3.1.m1.2.3.3.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.2">𝐶</ci><ci id="S4.SS2.SSS1.p3.1.m1.2.3.3.3.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.3">𝐸</ci><interval closure="open" id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2"><ci id="S4.SS2.SSS1.p3.1.m1.1.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.1.1">𝑦</ci><apply id="S4.SS2.SSS1.p3.1.m1.2.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.2"><ci id="S4.SS2.SSS1.p3.1.m1.2.2.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.2.1">^</ci><ci id="S4.SS2.SSS1.p3.1.m1.2.2.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.2.2">𝑦</ci></apply></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.SSS1.p3.1.m1.2c">L=CE(y,\hat{y})</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.SSS1.p3.1.m1.2d">italic_L = italic_C italic_E ( italic_y , over^ start_ARG italic_y end_ARG )</annotation></semantics></math>, for parameter update. The knowledge fusion employs greedy and accuracy-aware heuristics. It begins with a random dataset and sequentially uses each dataset to train the LoRA adapter until its accuracy on a specific task falls below the required threshold. If this occurs, the adapter’s weights are rolled back, and a new adapter is initialized to learn from the most recent dataset. The worst case of such a method may generate one LoRA adapter for each dataset, but in our practical experiments, every LoRA adapter fuses 4 domains of knowledge (datasets) on average.</p> </div> <figure class="ltx_figure" id="S4.F11"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="333" id="S4.F11.g1" src="x13.png" width="788"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 11. </span><span class="ltx_text" id="S4.F11.2.1" style="color:#000000;">Vision task head. Replacing the original language modeling head in LMM with vision task head reduces 4 rounds of LMM inference on the action recognition task of Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S2.F1" title="Figure 1 ‣ 2. Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">1</span></a>.</span></figcaption> </figure> <div class="ltx_para" id="S4.SS2.SSS1.p4"> <p class="ltx_p" id="S4.SS2.SSS1.p4.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F10" title="Figure 10 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">10</span></a> illustrates an example of integrating the knowledge of six object detection models, each detecting one class of target, into LoRA adapters. When fusing the vegetation-detect model (Step ④), the accuracy of LoRA adapter 1 fails on the license plate, needs above 80%, and traffic sign detection, 85%. Hence, the algorithm returns the LoRA adapter 1 of the previous state and initializes LoRA adapter 2 (Step ⑤). As the vegetation, bicycle, and person datasets are fused into LoRA adapter 2 without accuracy violation, we get the second LoRA adapter (Step ⑥). As LMM has powerful learning abilities, the training procedure costs only 25 minutes in this example. We leave the exploration of this to the future. 
</p> </div> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.2.2 Vision task head.</h4> <div class="ltx_para" id="S4.SS2.SSS2.p1"> <p class="ltx_p" id="S4.SS2.SSS2.p1.1">To reduce the inference latency, we design the vision task head. It is designed as a trainable linear layer as a part of the LoRA adapter, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F11" title="Figure 11 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">11</span></a>, to predict task-specific results based on the output features of LMM. <span class="ltx_text" id="S4.SS2.SSS2.p1.1.1" style="color:#000000;">Vision task heads can be flexibly customized to various vision tasks during the LoRA adapter training, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.1.1">e.g.,</em> action recognition head in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F11" title="Figure 11 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">11</span></a>, provided the fusing knowledge is from the same task type. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F11" title="Figure 11 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">11</span></a> compares doing action recognition with the original language modeling (LM) head and the vision task head. By replacing LM head with the vision task head, LMM saves 4 inference rounds, around 180ms time cost.</span> The reason to do so is that the outputs of a large portion of vision tasks are a limited discrete set of candidate options, such as the number of vehicle counts <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib57" title="">li2020reducto, </a>)</cite>, classes of action recognition <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib90" title="">yuan2022PacketGame, </a>)</cite>, and binary query for a specific target on image or video <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib86" title="">yi2020eagleeye, </a>)</cite>.</p> </div> <div class="ltx_para" id="S4.SS2.SSS2.p2"> <p class="ltx_p" id="S4.SS2.SSS2.p2.1"><span class="ltx_text" id="S4.SS2.SSS2.p2.1.1" style="color:#000000;">We retain the LM head for vision applications that need the natural language interface. 
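</span></p> <p class="ltx_p">Functionally, a vision task head replaces the autoregressive LM head with a single classification layer over the LMM’s final hidden state. The following is a minimal PyTorch sketch with illustrative dimensions, not the exact head used in V-LoRA.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Vision task head: one linear layer over the LMM's last hidden state replaces
# multi-round autoregressive decoding for closed-set vision tasks.
import torch
import torch.nn as nn

hidden_dim, num_classes = 4096, 101        # e.g., UCF101 action classes (illustrative)

class VisionTaskHead(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_classes)

    def forward(self, last_hidden_state):
        # Use the final token's feature as the task representation (one simple choice);
        # a single forward pass yields the answer, with no decoding loop.
        return self.proj(last_hidden_state[:, -1, :])

head = VisionTaskHead(hidden_dim, num_classes)
feats = torch.randn(2, 256, hidden_dim)    # stand-in for LMM output features
pred = head(feats).argmax(dim=-1)          # predicted class ids
</pre> <p class="ltx_p"><span class="ltx_text" style="color:#000000;">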
For example, when a video query application asks for “A boy wearing a red sweater lost at the corner” and specifies person detection, V-LoRA invokes the corresponding LoRA adapter containing a detection head for an efficient response. (Some work, e.g., task automation (wen2024autodroid; li2024personal) and dynamic LoRA (feng2024mixture), automatically identifies adapters from queries; such work is orthogonal to V-LoRA.) By using the LoRA adapter batching operator, ATMM, introduced in the next section, these different vision task heads can be executed concurrently, as well as in parallel with the LM head.

4.3 Adaptive-tiling LoRA Adapters Batching

Serving vision applications with an LMM very likely invokes heterogeneous LoRA adapters concurrently. To compute them with high performance, we propose an Adaptive-Tiling Matrix Multiplication operator, ATMM, which enables efficient unmerged inference, inference mode switching (§4.4.1), and the mixture inference mode (§4.4.2).

Table 1. Adaptive tiling saves remarkable computing time compared to static-tiling Punica. Configuration (a, b, c, d, e, f) denotes the thread block tiles a×b and b×c and the warp tiles d×e and e×f; thread tiles are omitted to save space. Input (u×v, s×t) denotes the input matrix shapes. The note under each latency explains why that configuration performs poorly under the corresponding input.

| Tiling Configuration | Input ① (256×4096, 4096×32) | Input ② (8192×4096, 4096×128) |
| --- | --- | --- |
| Punica (chen2024punica): (16, 64, 64, 16, 16, 64) | 0.087 ms (low SM utilization) | 0.19 ms (frequent global memory access) |
| Config ①: (64, 32, 32, 32, 32, 32) | **0.07 ms** | 0.12 ms (frequent global memory access) |
| Config ②: (64, 64, 64, 32, 64, 64) | 0.13 ms (low SM utilization) | **0.10 ms** |
4.3.1 ATMM: Adaptive-tiling matrix multiplication operator.

Directly batching heterogeneous adapter computations on standard kernels, such as the batched GEMM (General Matrix Multiplication) used in dLoRA (wu2024dlora), is feasible but yields excessive latency and hardware underutilization. This stems from the significant padding caused by the heterogeneity of application request lengths and LoRA adapter ranks. Hence, a customized kernel is necessary.

To motivate our design, we first analyze two existing customized kernels, from S-LoRA (sheng2023slora) and Punica (chen2024punica), respectively. S-LoRA’s kernel uses the tiling technique to avoid the significant padding: unlike batched GEMM, which pads heterogeneous input matrices to a uniform shape, it splits them into fine-grained blocks and computes the output of each block in parallel on CUDA cores. Punica’s kernel also employs tiling and further improves efficiency by leveraging the well-developed CUTLASS library (cutlass) and higher-performance Tensor cores (tensorcore). However, as the motivational study in §3.2 shows, both kernels fail to achieve satisfactory efficiency. The root cause is their static tiling configurations, which cannot handle diverse input shapes and thus leave computational resources underutilized.

Our key observation is that computational efficiency varies significantly with the tiling configuration. We run an experiment with two input shapes and three tiling configurations and present the results in Table 1. Under Input ① and Input ②, the three tiling configurations yield up to a 1.9× latency difference. This stems from tiling that is mismatched with the input shape and the hardware architecture, as well as across tiling levels. Fig. 12 compares two configuration pairs from Table 1. In Fig. 12(a), Punica’s smaller-tile configuration produces more thread block tiles and warp tiles than Config ②; the resulting more frequent accesses to global and shared memory, and the extra data transfers they launch, account for Punica’s 1.9× higher latency. However, larger tiles do not always help. In Fig. 12(b), Config ②’s large tiles cause Streaming Multiprocessor (SM) under-utilization: it can occupy only 64 of the 108 SMs on an A100, since each thread block tile is assigned to a single SM, far fewer than Config ①. In sum, dynamic tiling is needed to achieve efficiency.
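To make this trade-off concrete, the following is a minimal Python sketch of the tile hierarchy that a configuration (a, b, c, d, e, f) from Table 1 induces. It is a deliberately simplified proxy of our own, not V-LoRA’s or Punica’s cost model: it ignores split-K, thread tiles, and batching across adapters, and only counts thread block tiles, warps per block, and K-dimension steps.

```python
from collections import namedtuple
from math import ceil

# Table 1 notation: thread block tiles a x b and b x c, warp tiles d x e and e x f.
TileConfig = namedtuple("TileConfig", "a b c d e f")

def tiling_footprint(m, k, n, cfg, num_sms=108):
    """Coarse footprint of an (m x k) @ (k x n) product under `cfg`.
    Simplification: every a x c output tile is one thread block; each block holds
    (a/d) * (c/f) warps and walks the K dimension in steps of b."""
    blocks = ceil(m / cfg.a) * ceil(n / cfg.c)
    warps_per_block = (cfg.a // cfg.d) * (cfg.c // cfg.f)
    k_steps = ceil(k / cfg.b)                        # shared-memory refills per block
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "k_steps": k_steps,
        "sm_fill": min(blocks, num_sms) / num_sms,   # crude occupancy proxy
    }

punica = TileConfig(16, 64, 64, 16, 16, 64)
cfg1 = TileConfig(64, 32, 32, 32, 32, 32)
cfg2 = TileConfig(64, 64, 64, 32, 64, 64)

inputs = {"Input 1": (256, 4096, 32), "Input 2": (8192, 4096, 128)}
for in_name, (m, k, n) in inputs.items():
    for cfg_name, cfg in [("Punica", punica), ("Config 1", cfg1), ("Config 2", cfg2)]:
        print(in_name, cfg_name, tiling_footprint(m, k, n, cfg))
```

Even this crude proxy shows the two failure modes qualitatively: under Input ②, Punica’s 16×64 blocks produce 4× as many thread block tiles as Config ② and hence more memory traffic, while under Input ① Config ② produces only 4 blocks, far fewer than the 108 SMs of an A100.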
</span></p> </div> <figure class="ltx_figure" id="S4.F12"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F12.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="349" id="S4.F12.sf1.g1" src="x14.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F12.sf1.2.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S4.F12.sf1.3.2" style="font-size:80%;">Under heavier Input ②, Punica’s tiling (upper half) yields more tiles, so more costly memory read/write compared to Config ② (lower half).</span></figcaption> </figure> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F12.sf2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="240" id="S4.F12.sf2.g1" src="x15.png" width="829"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F12.sf2.2.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S4.F12.sf2.3.2" style="font-size:80%;">But under lighter Input ①, Config ② (lower half) cannot deliver good performance because it underutilizes SM since fewer block number.</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 12. </span><span class="ltx_text" id="S4.F12.2.1" style="color:#000000;">Paired comparison of tilling configurations in Table <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.T1" title="Table 1 ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">1</span></a>. Following tiling configurations, two multiplied matrices are divided into thread block tiles and further warp tiles. The data in each tile is transferred to their corresponding memory. </span></figcaption> </figure> <div class="ltx_para" id="S4.SS3.SSS1.p4"> <p class="ltx_p" id="S4.SS3.SSS1.p4.1">We propose ATMM, an adaptive-tiling matrix multiplication CUDA kernel that fully utilizes computational resources, achieving efficient heterogeneous LoRA adapters batching. It has two key design choices. (1) Adaptive tiling against input shapes. Given two input matrices, ATMM retrieves the optimal tiling configuration from the <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS1.p4.1.1">hash table</span> (established in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS2" title="4.3.2 Profile-based optimal tiling search. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. 
V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.2</span></a>, and following this configuration, divides the matrix into <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS1.p4.1.2">thread block tiles</span>, then further into <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS1.p4.1.3">warp tiles<span class="ltx_note ltx_role_footnote" id="footnote4"><sup class="ltx_note_mark">§</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">§</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_upright" id="footnote4.1.1.1">§</span></span><span class="ltx_text ltx_font_upright" id="footnote4.5">Each warp can be divided into thread tiles in ¡16,8,16¿ or ¡16,8,8¿ shape.</span></span></span></span></span>. (2) Pipeline data loading and computing. After tiling, ATMM transfers the data of each tile to their corresponding memories, the correspondence as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F12.sf1" title="In Figure 12 ‣ 4.3.1 ATMM: Adaptive-tiling matrix multiplication operator. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">12(a)</span></a>, and executes computation on CUDA/Tensor cores with the corresponding executable kernel (pre-compiled in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS2" title="4.3.2 Profile-based optimal tiling search. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.2</span></a>). To hide data loading latency, ATMM allocates double the space in shared memory and register file for each tile: one for current tile’s computation and one for prefetching next tile’s data. This double buffering is feasible due to each level’s uniform tile shape, enabling efficient location of the next tile (more in Appx.<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#A1" title="Appendix A Adaptive-tiling Matrix Multiplication. ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">A</span></a>).</p> </div> </section> <section class="ltx_subsubsection" id="S4.SS3.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.3.2 Profile-based optimal tiling search.</h4> <div class="ltx_para" id="S4.SS3.SSS2.p1"> <p class="ltx_p" id="S4.SS3.SSS2.p1.1">Keeping the benefit of adaptive tiling in mind, we aim to search out the optimal tiling configuration for every different input and switch among them at runtime. This can be done offline, as it is only affected by input shapes rather than their numerical values. However, this is still challenging due to the complicated thread parallelism mechanism and memory access patterns of GPU, and the large search space of input shapes. </p> </div> <div class="ltx_para" id="S4.SS3.SSS2.p2"> <p class="ltx_p" id="S4.SS3.SSS2.p2.1">To tackle this challenge, we approach the search as a black-box problem and propose a profile-based searching algorithm. 
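The host-side control flow of these two design choices can be sketched as follows. This is an illustrative sketch only, not V-LoRA’s actual API: `tiling_table`, `launch_kernel`, `load_tile`, and `compute_tile` are placeholders for the pre-built shape-to-kernel hash table (§4.3.2) and the device-side work, and in the real kernel the prefetch overlaps with the computation on the GPU, whereas here only the buffer rotation is shown.

```python
# Illustrative host-side sketch of ATMM dispatch (placeholder names, not V-LoRA's API).

def atmm(x, w, tiling_table, launch_kernel):
    """x: (m, k) activations; w: (k, n) LoRA weight, both resident on the GPU.
    Design choice (1): the tiling configuration and its pre-compiled kernel are
    looked up by input shape in a hash table built offline."""
    (m, k), (k2, n) = x.shape, w.shape
    assert k == k2
    cfg, kernel = tiling_table[(m, k, n)]
    return launch_kernel(kernel, x, w, cfg)

def double_buffered_loop(tiles, load_tile, compute_tile):
    """Design choice (2): double buffering. Two buffers are kept per memory level;
    while the current buffer is being consumed, the next tile is prefetched into
    the other one. Uniform tile shapes make the next tile's address easy to find."""
    buffers = [load_tile(tiles[0]), None]
    for i in range(len(tiles)):
        cur = buffers[i % 2]
        if i + 1 < len(tiles):
            buffers[(i + 1) % 2] = load_tile(tiles[i + 1])  # prefetch next tile
        compute_tile(cur)
```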
4.3.2 Profile-based optimal tiling search.

Keeping the benefit of adaptive tiling in mind, we aim to find the optimal tiling configuration for every distinct input and switch among them at runtime. This can be done offline, since the choice depends only on the input shapes, not on their numerical values. It is still challenging, however, because of the GPU’s complicated thread-parallelism mechanism and memory-access patterns, and the large search space of input shapes.

To tackle this challenge, we treat the search as a black-box problem and propose a profile-based search algorithm. It profiles the execution time of all possible ATMM input matrix shapes with the help of the CUTLASS Profiler (cutlass), records the optimal tiling configurations in a hash table, and compiles the corresponding executable kernels. We use the following expert knowledge to reduce the search space. (1) From the hardware perspective, architectural characteristics restrict the feasible input shapes (e.g., the limited memory at each hierarchy level) and tiling configurations (e.g., every tile dimension must be at least 16 and a power of two). (2) From the perspective of the input data, the model dimensions of LMMs constrain the input matrix shapes to change in large steps (e.g., 4096 in Qwen-VL). Appx. B gives the detailed algorithm. With this algorithm, searching all optimal configurations for vision tasks with Qwen-VL on an A100 GPU takes less than 30 minutes.
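A minimal sketch of this offline search is shown below, under the pruning rules stated above. It is an assumption-laden illustration rather than the released algorithm: `profile_latency` stands in for a call to the CUTLASS Profiler, the candidate shape grid is hypothetical, and the feasibility check only encodes the “power of two, at least 16, warp tile fits in block tile” rules.

```python
from itertools import product

def pow2_dims(lo=16, hi=128):
    """Rule (1): tile dimensions are powers of two and at least 16."""
    v = lo
    while v <= hi:
        yield v
        v *= 2

def build_tiling_table(shapes, profile_latency, hi=128):
    """Profile every feasible configuration for every candidate shape and keep
    the fastest one per shape in a hash table (shape -> configuration)."""
    dims = list(pow2_dims(16, hi))
    table = {}
    for shape in shapes:                       # rule (2): shapes vary in large steps
        best_latency, best_cfg = float("inf"), None
        for a, b, c, d, e, f in product(dims, repeat=6):
            if d > a or e > b or f > c:        # warp tile must fit inside block tile
                continue
            latency = profile_latency(shape, (a, b, c, d, e, f))
            if latency < best_latency:
                best_latency, best_cfg = latency, (a, b, c, d, e, f)
        table[shape] = best_cfg
    return table

# Hypothetical shape grid for a 4096-dimensional model and a few LoRA ranks:
# shapes = [(m, 4096, r) for m in (256, 1024, 4096, 8192) for r in (32, 64, 128)]
```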
<span class="ltx_text" id="S4.SS4.SSS1.p1.6.6" style="color:#000000;">To reduce these overheads, one common method is pre-computing LoRA matrices (<em class="ltx_emph ltx_font_italic" id="S4.SS4.SSS1.p1.6.6.1">i.e.,</em>, matrix <math alttext="\Delta W" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.1.1.m1.1"><semantics id="S4.SS4.SSS1.p1.1.1.m1.1a"><mrow id="S4.SS4.SSS1.p1.1.1.m1.1.1" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.cmml"><mi id="S4.SS4.SSS1.p1.1.1.m1.1.1.2" mathcolor="#000000" mathvariant="normal" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.2.cmml">Δ</mi><mo id="S4.SS4.SSS1.p1.1.1.m1.1.1.1" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.1.cmml"></mo><mi id="S4.SS4.SSS1.p1.1.1.m1.1.1.3" mathcolor="#000000" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.1.1.m1.1b"><apply id="S4.SS4.SSS1.p1.1.1.m1.1.1.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1"><times id="S4.SS4.SSS1.p1.1.1.m1.1.1.1.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.1"></times><ci id="S4.SS4.SSS1.p1.1.1.m1.1.1.2.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.2">Δ</ci><ci id="S4.SS4.SSS1.p1.1.1.m1.1.1.3.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.1.1.m1.1c">\Delta W</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.1.1.m1.1d">roman_Δ italic_W</annotation></semantics></math>, <math alttext="B\times A" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.2.2.m2.1"><semantics id="S4.SS4.SSS1.p1.2.2.m2.1a"><mrow id="S4.SS4.SSS1.p1.2.2.m2.1.1" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.cmml"><mi id="S4.SS4.SSS1.p1.2.2.m2.1.1.2" mathcolor="#000000" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.2.cmml">B</mi><mo id="S4.SS4.SSS1.p1.2.2.m2.1.1.1" lspace="0.222em" mathcolor="#000000" rspace="0.222em" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.1.cmml">×</mo><mi id="S4.SS4.SSS1.p1.2.2.m2.1.1.3" mathcolor="#000000" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.3.cmml">A</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.2.2.m2.1b"><apply id="S4.SS4.SSS1.p1.2.2.m2.1.1.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1"><times id="S4.SS4.SSS1.p1.2.2.m2.1.1.1.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.1"></times><ci id="S4.SS4.SSS1.p1.2.2.m2.1.1.2.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.2">𝐵</ci><ci id="S4.SS4.SSS1.p1.2.2.m2.1.1.3.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.2.2.m2.1c">B\times A</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.2.2.m2.1d">italic_B × italic_A</annotation></semantics></math>) for the entire base model and storing them in main memory, then swapping into GPU when needed. 
However, such a mass of data, <em class="ltx_emph ltx_font_italic" id="S4.SS4.SSS1.p1.6.6.2">e.g.,</em> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.3.3.m3.1"><semantics id="S4.SS4.SSS1.p1.3.3.m3.1a"><mo id="S4.SS4.SSS1.p1.3.3.m3.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.3.3.m3.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.3.3.m3.1b"><csymbol cd="latexml" id="S4.SS4.SSS1.p1.3.3.m3.1.1.cmml" xref="S4.SS4.SSS1.p1.3.3.m3.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.3.3.m3.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.3.3.m3.1d">∼</annotation></semantics></math>3GB per LoRA of QWen-VL-7B<span class="ltx_note ltx_role_footnote" id="footnote5"><sup class="ltx_note_mark">¶</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">¶</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text" id="footnote5.1.1.1" style="color:#000000;">¶</span></span><span class="ltx_text" id="footnote5.5" style="color:#000000;">Layer#</span><math alttext="\times\Delta W\times" class="ltx_math_unparsed" display="inline" id="footnote5.m1.1"><semantics id="footnote5.m1.1b"><mrow id="footnote5.m1.1c"><mo id="footnote5.m1.1.1" mathcolor="#000000" rspace="0.222em">×</mo><mi id="footnote5.m1.1.2" mathcolor="#000000" mathvariant="normal">Δ</mi><mi id="footnote5.m1.1.3" mathcolor="#000000">W</mi><mo id="footnote5.m1.1.4" lspace="0.222em" mathcolor="#000000">×</mo></mrow><annotation encoding="application/x-tex" id="footnote5.m1.1d">\times\Delta W\times</annotation><annotation encoding="application/x-llamapun" id="footnote5.m1.1e">× roman_Δ italic_W ×</annotation></semantics></math><span class="ltx_text" id="footnote5.6" style="color:#000000;">precise, 32</span><math alttext="\times" class="ltx_Math" display="inline" id="footnote5.m2.1"><semantics id="footnote5.m2.1b"><mo id="footnote5.m2.1.1" mathcolor="#000000" xref="footnote5.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="footnote5.m2.1c"><times id="footnote5.m2.1.1.cmml" xref="footnote5.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m2.1d">\times</annotation><annotation encoding="application/x-llamapun" id="footnote5.m2.1e">×</annotation></semantics></math><span class="ltx_text" id="footnote5.7" style="color:#000000;">4096</span><math alttext="\times" class="ltx_Math" display="inline" id="footnote5.m3.1"><semantics id="footnote5.m3.1b"><mo id="footnote5.m3.1.1" mathcolor="#000000" xref="footnote5.m3.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="footnote5.m3.1c"><times id="footnote5.m3.1.1.cmml" xref="footnote5.m3.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m3.1d">\times</annotation><annotation encoding="application/x-llamapun" id="footnote5.m3.1e">×</annotation></semantics></math><span class="ltx_text" id="footnote5.8" style="color:#000000;">4096</span><math alttext="\times" class="ltx_Math" display="inline" id="footnote5.m4.1"><semantics id="footnote5.m4.1b"><mo id="footnote5.m4.1.1" mathcolor="#000000" xref="footnote5.m4.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="footnote5.m4.1c"><times id="footnote5.m4.1.1.cmml" xref="footnote5.m4.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m4.1d">\times</annotation><annotation encoding="application/x-llamapun" 
id="footnote5.m4.1e">×</annotation></semantics></math><span class="ltx_text" id="footnote5.9" style="color:#000000;">FP16, </span><math alttext="\Delta W" class="ltx_Math" display="inline" id="footnote5.m5.1"><semantics id="footnote5.m5.1b"><mrow id="footnote5.m5.1.1" xref="footnote5.m5.1.1.cmml"><mi id="footnote5.m5.1.1.2" mathcolor="#000000" mathvariant="normal" xref="footnote5.m5.1.1.2.cmml">Δ</mi><mo id="footnote5.m5.1.1.1" xref="footnote5.m5.1.1.1.cmml"></mo><mi id="footnote5.m5.1.1.3" mathcolor="#000000" xref="footnote5.m5.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="footnote5.m5.1c"><apply id="footnote5.m5.1.1.cmml" xref="footnote5.m5.1.1"><times id="footnote5.m5.1.1.1.cmml" xref="footnote5.m5.1.1.1"></times><ci id="footnote5.m5.1.1.2.cmml" xref="footnote5.m5.1.1.2">Δ</ci><ci id="footnote5.m5.1.1.3.cmml" xref="footnote5.m5.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m5.1d">\Delta W</annotation><annotation encoding="application/x-llamapun" id="footnote5.m5.1e">roman_Δ italic_W</annotation></semantics></math><span class="ltx_text" id="footnote5.10" style="color:#000000;">’s shape is same as LMM.</span></span></span></span>, incurs very high delay, <math alttext="\sim" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.4.4.m4.1"><semantics id="S4.SS4.SSS1.p1.4.4.m4.1a"><mo id="S4.SS4.SSS1.p1.4.4.m4.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.4.4.m4.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.4.4.m4.1b"><csymbol cd="latexml" id="S4.SS4.SSS1.p1.4.4.m4.1.1.cmml" xref="S4.SS4.SSS1.p1.4.4.m4.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.4.4.m4.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.4.4.m4.1d">∼</annotation></semantics></math>1s each, on swapping. Conversely, our swift mode switcher computes LoRA matrices at runtime and stores adapters, <em class="ltx_emph ltx_font_italic" id="S4.SS4.SSS1.p1.6.6.3">i.e.,</em> <math alttext="A" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.5.5.m5.1"><semantics id="S4.SS4.SSS1.p1.5.5.m5.1a"><mi id="S4.SS4.SSS1.p1.5.5.m5.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.5.5.m5.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.5.5.m5.1b"><ci id="S4.SS4.SSS1.p1.5.5.m5.1.1.cmml" xref="S4.SS4.SSS1.p1.5.5.m5.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.5.5.m5.1c">A</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.5.5.m5.1d">italic_A</annotation></semantics></math> and <math alttext="B" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.6.6.m6.1"><semantics id="S4.SS4.SSS1.p1.6.6.m6.1a"><mi id="S4.SS4.SSS1.p1.6.6.m6.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.6.6.m6.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.6.6.m6.1b"><ci id="S4.SS4.SSS1.p1.6.6.m6.1.1.cmml" xref="S4.SS4.SSS1.p1.6.6.m6.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.6.6.m6.1c">B</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.6.6.m6.1d">italic_B</annotation></semantics></math>, only 43MB each, on GPU. It has two core designs. (1) Eliminate unnecessary memory copy with contiguous memory allocation. Our switcher pre-allocates contiguous memory for weight matrices, avoiding the memory copy in tensor reshape, enables efficient in-place LoRA matrices un-/merge. 
(2) Compute all-layer LoRA matrices and un-/merge them in one shot. With ATMM, the switcher can efficiently compute LoRA matrices of the entire model and add/subtract all of them onto/from the base mode weights in one shot.</span> By doing so, our mode switch costs only ¡10ms, which speeds up dLoRA ¿5<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.7.m1.1"><semantics id="S4.SS4.SSS1.p1.7.m1.1a"><mo id="S4.SS4.SSS1.p1.7.m1.1.1" xref="S4.SS4.SSS1.p1.7.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.7.m1.1b"><times id="S4.SS4.SSS1.p1.7.m1.1.1.cmml" xref="S4.SS4.SSS1.p1.7.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.7.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.7.m1.1d">×</annotation></semantics></math>.</p> </div> <figure class="ltx_float ltx_float_algorithm ltx_framed ltx_framed_top" id="alg1"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_float"><span class="ltx_text ltx_font_bold" id="alg1.4.1.1">Algorithm 1</span> </span> Scheduling Policy</figcaption> <div class="ltx_listing ltx_listing" id="alg1.5"> <div class="ltx_listingline" id="alg1.l1"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l1.1.1.1" style="font-size:80%;">1:</span></span><span class="ltx_text" id="alg1.l1.2" style="font-size:80%;">Request </span><math alttext="R\!=\!\!\left\{r_{1},...,r_{n}\right\}" class="ltx_Math" display="inline" id="alg1.l1.m1.3"><semantics id="alg1.l1.m1.3a"><mrow id="alg1.l1.m1.3.3" xref="alg1.l1.m1.3.3.cmml"><mi id="alg1.l1.m1.3.3.4" mathsize="80%" xref="alg1.l1.m1.3.3.4.cmml">R</mi><mpadded width="0.726em"><mo id="alg1.l1.m1.3.3.3" lspace="0.108em" mathsize="80%" xref="alg1.l1.m1.3.3.3.cmml">=</mo></mpadded><mrow id="alg1.l1.m1.3.3.2.2" xref="alg1.l1.m1.3.3.2.3.cmml"><mo id="alg1.l1.m1.3.3.2.2.3" xref="alg1.l1.m1.3.3.2.3.cmml">{</mo><msub id="alg1.l1.m1.2.2.1.1.1" xref="alg1.l1.m1.2.2.1.1.1.cmml"><mi id="alg1.l1.m1.2.2.1.1.1.2" mathsize="80%" xref="alg1.l1.m1.2.2.1.1.1.2.cmml">r</mi><mn id="alg1.l1.m1.2.2.1.1.1.3" mathsize="80%" xref="alg1.l1.m1.2.2.1.1.1.3.cmml">1</mn></msub><mo id="alg1.l1.m1.3.3.2.2.4" mathsize="80%" xref="alg1.l1.m1.3.3.2.3.cmml">,</mo><mi id="alg1.l1.m1.1.1" mathsize="80%" mathvariant="normal" xref="alg1.l1.m1.1.1.cmml">…</mi><mo id="alg1.l1.m1.3.3.2.2.5" mathsize="80%" xref="alg1.l1.m1.3.3.2.3.cmml">,</mo><msub id="alg1.l1.m1.3.3.2.2.2" xref="alg1.l1.m1.3.3.2.2.2.cmml"><mi id="alg1.l1.m1.3.3.2.2.2.2" mathsize="80%" xref="alg1.l1.m1.3.3.2.2.2.2.cmml">r</mi><mi id="alg1.l1.m1.3.3.2.2.2.3" mathsize="80%" xref="alg1.l1.m1.3.3.2.2.2.3.cmml">n</mi></msub><mo id="alg1.l1.m1.3.3.2.2.6" xref="alg1.l1.m1.3.3.2.3.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l1.m1.3b"><apply id="alg1.l1.m1.3.3.cmml" xref="alg1.l1.m1.3.3"><eq id="alg1.l1.m1.3.3.3.cmml" xref="alg1.l1.m1.3.3.3"></eq><ci id="alg1.l1.m1.3.3.4.cmml" xref="alg1.l1.m1.3.3.4">𝑅</ci><set id="alg1.l1.m1.3.3.2.3.cmml" xref="alg1.l1.m1.3.3.2.2"><apply id="alg1.l1.m1.2.2.1.1.1.cmml" xref="alg1.l1.m1.2.2.1.1.1"><csymbol cd="ambiguous" id="alg1.l1.m1.2.2.1.1.1.1.cmml" xref="alg1.l1.m1.2.2.1.1.1">subscript</csymbol><ci id="alg1.l1.m1.2.2.1.1.1.2.cmml" xref="alg1.l1.m1.2.2.1.1.1.2">𝑟</ci><cn id="alg1.l1.m1.2.2.1.1.1.3.cmml" type="integer" xref="alg1.l1.m1.2.2.1.1.1.3">1</cn></apply><ci id="alg1.l1.m1.1.1.cmml" xref="alg1.l1.m1.1.1">…</ci><apply id="alg1.l1.m1.3.3.2.2.2.cmml" 
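A minimal sketch of the one-shot un-/merge is shown below, written with PyTorch tensors to keep it self-contained. In V-LoRA the per-layer B×A products are computed with ATMM; `torch.matmul` stands in for that here, and the buffer layout is assumed for illustration rather than taken from the implementation.

```python
import torch

def merge_all_layers(base_weights, adapters, scaling=1.0):
    """Merged mode: W <- W + scaling * (B @ A) for every layer, in one shot.
    base_weights: pre-allocated, contiguous (out_dim, in_dim) tensors on the GPU.
    adapters: per-layer (A, B) pairs with A: (r, in_dim), B: (out_dim, r).
    scaling is the usual LoRA scaling factor."""
    for W, (A, B) in zip(base_weights, adapters):
        W.add_(B @ A, alpha=scaling)   # in place: no reshape, no extra copy

def unmerge_all_layers(base_weights, adapters, scaling=1.0):
    """Unmerged mode: subtract the same deltas to restore the base weights."""
    for W, (A, B) in zip(base_weights, adapters):
        W.sub_(B @ A, alpha=scaling)
```

Because the weights and adapters already live in pre-allocated contiguous GPU buffers, the switch reduces to these in-place updates, which is what keeps it under 10 ms.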
xref="alg1.l1.m1.3.3.2.2.2"><csymbol cd="ambiguous" id="alg1.l1.m1.3.3.2.2.2.1.cmml" xref="alg1.l1.m1.3.3.2.2.2">subscript</csymbol><ci id="alg1.l1.m1.3.3.2.2.2.2.cmml" xref="alg1.l1.m1.3.3.2.2.2.2">𝑟</ci><ci id="alg1.l1.m1.3.3.2.2.2.3.cmml" xref="alg1.l1.m1.3.3.2.2.2.3">𝑛</ci></apply></set></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m1.3c">R\!=\!\!\left\{r_{1},...,r_{n}\right\}</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m1.3d">italic_R = { italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_r start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }</annotation></semantics></math><span class="ltx_text" id="alg1.l1.3" style="font-size:80%;">, LoRA adapter </span><math alttext="L\!=\!\!\left\{l_{1},...,l_{n}\right\}" class="ltx_Math" display="inline" id="alg1.l1.m2.3"><semantics id="alg1.l1.m2.3a"><mrow id="alg1.l1.m2.3.3" xref="alg1.l1.m2.3.3.cmml"><mi id="alg1.l1.m2.3.3.4" mathsize="80%" xref="alg1.l1.m2.3.3.4.cmml">L</mi><mpadded width="0.726em"><mo id="alg1.l1.m2.3.3.3" lspace="0.108em" mathsize="80%" xref="alg1.l1.m2.3.3.3.cmml">=</mo></mpadded><mrow id="alg1.l1.m2.3.3.2.2" xref="alg1.l1.m2.3.3.2.3.cmml"><mo id="alg1.l1.m2.3.3.2.2.3" xref="alg1.l1.m2.3.3.2.3.cmml">{</mo><msub id="alg1.l1.m2.2.2.1.1.1" xref="alg1.l1.m2.2.2.1.1.1.cmml"><mi id="alg1.l1.m2.2.2.1.1.1.2" mathsize="80%" xref="alg1.l1.m2.2.2.1.1.1.2.cmml">l</mi><mn id="alg1.l1.m2.2.2.1.1.1.3" mathsize="80%" xref="alg1.l1.m2.2.2.1.1.1.3.cmml">1</mn></msub><mo id="alg1.l1.m2.3.3.2.2.4" mathsize="80%" xref="alg1.l1.m2.3.3.2.3.cmml">,</mo><mi id="alg1.l1.m2.1.1" mathsize="80%" mathvariant="normal" xref="alg1.l1.m2.1.1.cmml">…</mi><mo id="alg1.l1.m2.3.3.2.2.5" mathsize="80%" xref="alg1.l1.m2.3.3.2.3.cmml">,</mo><msub id="alg1.l1.m2.3.3.2.2.2" xref="alg1.l1.m2.3.3.2.2.2.cmml"><mi id="alg1.l1.m2.3.3.2.2.2.2" mathsize="80%" xref="alg1.l1.m2.3.3.2.2.2.2.cmml">l</mi><mi id="alg1.l1.m2.3.3.2.2.2.3" mathsize="80%" xref="alg1.l1.m2.3.3.2.2.2.3.cmml">n</mi></msub><mo id="alg1.l1.m2.3.3.2.2.6" xref="alg1.l1.m2.3.3.2.3.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l1.m2.3b"><apply id="alg1.l1.m2.3.3.cmml" xref="alg1.l1.m2.3.3"><eq id="alg1.l1.m2.3.3.3.cmml" xref="alg1.l1.m2.3.3.3"></eq><ci id="alg1.l1.m2.3.3.4.cmml" xref="alg1.l1.m2.3.3.4">𝐿</ci><set id="alg1.l1.m2.3.3.2.3.cmml" xref="alg1.l1.m2.3.3.2.2"><apply id="alg1.l1.m2.2.2.1.1.1.cmml" xref="alg1.l1.m2.2.2.1.1.1"><csymbol cd="ambiguous" id="alg1.l1.m2.2.2.1.1.1.1.cmml" xref="alg1.l1.m2.2.2.1.1.1">subscript</csymbol><ci id="alg1.l1.m2.2.2.1.1.1.2.cmml" xref="alg1.l1.m2.2.2.1.1.1.2">𝑙</ci><cn id="alg1.l1.m2.2.2.1.1.1.3.cmml" type="integer" xref="alg1.l1.m2.2.2.1.1.1.3">1</cn></apply><ci id="alg1.l1.m2.1.1.cmml" xref="alg1.l1.m2.1.1">…</ci><apply id="alg1.l1.m2.3.3.2.2.2.cmml" xref="alg1.l1.m2.3.3.2.2.2"><csymbol cd="ambiguous" id="alg1.l1.m2.3.3.2.2.2.1.cmml" xref="alg1.l1.m2.3.3.2.2.2">subscript</csymbol><ci id="alg1.l1.m2.3.3.2.2.2.2.cmml" xref="alg1.l1.m2.3.3.2.2.2.2">𝑙</ci><ci id="alg1.l1.m2.3.3.2.2.2.3.cmml" xref="alg1.l1.m2.3.3.2.2.2.3">𝑛</ci></apply></set></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m2.3c">L\!=\!\!\left\{l_{1},...,l_{n}\right\}</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m2.3d">italic_L = { italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_l start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }</annotation></semantics></math><span class="ltx_text" id="alg1.l1.4" style="font-size:80%;">, Infer mode 
</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l1.m3.1"><semantics id="alg1.l1.m3.1a"><mi id="alg1.l1.m3.1.1" mathsize="80%" xref="alg1.l1.m3.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l1.m3.1b"><ci id="alg1.l1.m3.1.1.cmml" xref="alg1.l1.m3.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m3.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m3.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l1.5" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l2"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l2.1.1.1" style="font-size:80%;">2:</span></span><span class="ltx_text" id="alg1.l2.2" style="font-size:80%;">The batch of requests to be executed </span><math alttext="B_{next}" class="ltx_Math" display="inline" id="alg1.l2.m1.1"><semantics id="alg1.l2.m1.1a"><msub id="alg1.l2.m1.1.1" xref="alg1.l2.m1.1.1.cmml"><mi id="alg1.l2.m1.1.1.2" mathsize="80%" xref="alg1.l2.m1.1.1.2.cmml">B</mi><mrow id="alg1.l2.m1.1.1.3" xref="alg1.l2.m1.1.1.3.cmml"><mi id="alg1.l2.m1.1.1.3.2" mathsize="80%" xref="alg1.l2.m1.1.1.3.2.cmml">n</mi><mo id="alg1.l2.m1.1.1.3.1" xref="alg1.l2.m1.1.1.3.1.cmml"></mo><mi id="alg1.l2.m1.1.1.3.3" mathsize="80%" xref="alg1.l2.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l2.m1.1.1.3.1a" xref="alg1.l2.m1.1.1.3.1.cmml"></mo><mi id="alg1.l2.m1.1.1.3.4" mathsize="80%" xref="alg1.l2.m1.1.1.3.4.cmml">x</mi><mo id="alg1.l2.m1.1.1.3.1b" xref="alg1.l2.m1.1.1.3.1.cmml"></mo><mi id="alg1.l2.m1.1.1.3.5" mathsize="80%" xref="alg1.l2.m1.1.1.3.5.cmml">t</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l2.m1.1b"><apply id="alg1.l2.m1.1.1.cmml" xref="alg1.l2.m1.1.1"><csymbol cd="ambiguous" id="alg1.l2.m1.1.1.1.cmml" xref="alg1.l2.m1.1.1">subscript</csymbol><ci id="alg1.l2.m1.1.1.2.cmml" xref="alg1.l2.m1.1.1.2">𝐵</ci><apply id="alg1.l2.m1.1.1.3.cmml" xref="alg1.l2.m1.1.1.3"><times id="alg1.l2.m1.1.1.3.1.cmml" xref="alg1.l2.m1.1.1.3.1"></times><ci id="alg1.l2.m1.1.1.3.2.cmml" xref="alg1.l2.m1.1.1.3.2">𝑛</ci><ci id="alg1.l2.m1.1.1.3.3.cmml" xref="alg1.l2.m1.1.1.3.3">𝑒</ci><ci id="alg1.l2.m1.1.1.3.4.cmml" xref="alg1.l2.m1.1.1.3.4">𝑥</ci><ci id="alg1.l2.m1.1.1.3.5.cmml" xref="alg1.l2.m1.1.1.3.5">𝑡</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l2.m1.1c">B_{next}</annotation><annotation encoding="application/x-llamapun" id="alg1.l2.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l2.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l3"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l3.1.1.1" style="font-size:80%;">3:</span></span><span class="ltx_text ltx_font_bold" id="alg1.l3.2" style="font-size:80%;">function</span><span class="ltx_text" id="alg1.l3.3" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l3.4" style="font-size:80%;">Scheduling</span><span class="ltx_text" id="alg1.l3.5" style="font-size:80%;">(R, M, L) </span> </div> <div class="ltx_listingline" id="alg1.l4"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l4.1.1.1" style="font-size:80%;">4:</span></span><span class="ltx_text" id="alg1.l4.2" style="font-size:80%;"> </span><math alttext="R_{starve}=\left[R_{i}.credit>\theta\;for\;R_{i}\;in\;R\right]" class="ltx_math_unparsed" display="inline" 
id="alg1.l4.m1.1"><semantics id="alg1.l4.m1.1a"><mrow id="alg1.l4.m1.1b"><msub id="alg1.l4.m1.1.1"><mi id="alg1.l4.m1.1.1.2" mathsize="80%">R</mi><mrow id="alg1.l4.m1.1.1.3"><mi id="alg1.l4.m1.1.1.3.2" mathsize="80%">s</mi><mo id="alg1.l4.m1.1.1.3.1"></mo><mi id="alg1.l4.m1.1.1.3.3" mathsize="80%">t</mi><mo id="alg1.l4.m1.1.1.3.1a"></mo><mi id="alg1.l4.m1.1.1.3.4" mathsize="80%">a</mi><mo id="alg1.l4.m1.1.1.3.1b"></mo><mi id="alg1.l4.m1.1.1.3.5" mathsize="80%">r</mi><mo id="alg1.l4.m1.1.1.3.1c"></mo><mi id="alg1.l4.m1.1.1.3.6" mathsize="80%">v</mi><mo id="alg1.l4.m1.1.1.3.1d"></mo><mi id="alg1.l4.m1.1.1.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l4.m1.1.2" mathsize="80%">=</mo><mrow id="alg1.l4.m1.1.3"><mo id="alg1.l4.m1.1.3.1">[</mo><msub id="alg1.l4.m1.1.3.2"><mi id="alg1.l4.m1.1.3.2.2" mathsize="80%">R</mi><mi id="alg1.l4.m1.1.3.2.3" mathsize="80%">i</mi></msub><mo id="alg1.l4.m1.1.3.3" lspace="0em" mathsize="80%" rspace="0.167em">.</mo><mi id="alg1.l4.m1.1.3.4" mathsize="80%">c</mi><mi id="alg1.l4.m1.1.3.5" mathsize="80%">r</mi><mi id="alg1.l4.m1.1.3.6" mathsize="80%">e</mi><mi id="alg1.l4.m1.1.3.7" mathsize="80%">d</mi><mi id="alg1.l4.m1.1.3.8" mathsize="80%">i</mi><mi id="alg1.l4.m1.1.3.9" mathsize="80%">t</mi><mo id="alg1.l4.m1.1.3.10" mathsize="80%">></mo><mi id="alg1.l4.m1.1.3.11" mathsize="80%">θ</mi><mi id="alg1.l4.m1.1.3.12" mathsize="80%">f</mi><mi id="alg1.l4.m1.1.3.13" mathsize="80%">o</mi><mi id="alg1.l4.m1.1.3.14" mathsize="80%">r</mi><msub id="alg1.l4.m1.1.3.15"><mi id="alg1.l4.m1.1.3.15.2" mathsize="80%">R</mi><mi id="alg1.l4.m1.1.3.15.3" mathsize="80%">i</mi></msub><mi id="alg1.l4.m1.1.3.16" mathsize="80%">i</mi><mi id="alg1.l4.m1.1.3.17" mathsize="80%">n</mi><mi id="alg1.l4.m1.1.3.18" mathsize="80%">R</mi><mo id="alg1.l4.m1.1.3.19">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l4.m1.1c">R_{starve}=\left[R_{i}.credit>\theta\;for\;R_{i}\;in\;R\right]</annotation><annotation encoding="application/x-llamapun" id="alg1.l4.m1.1d">italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT = [ italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT . 
italic_c italic_r italic_e italic_d italic_i italic_t > italic_θ italic_f italic_o italic_r italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_i italic_n italic_R ]</annotation></semantics></math><span class="ltx_text" id="alg1.l4.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l5"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l5.1.1.1" style="font-size:80%;">5:</span></span><span class="ltx_text" id="alg1.l5.2" style="font-size:80%;"> </span><math alttext="len=MaxBS-|R_{starve}|" class="ltx_Math" display="inline" id="alg1.l5.m1.1"><semantics id="alg1.l5.m1.1a"><mrow id="alg1.l5.m1.1.1" xref="alg1.l5.m1.1.1.cmml"><mrow id="alg1.l5.m1.1.1.3" xref="alg1.l5.m1.1.1.3.cmml"><mi id="alg1.l5.m1.1.1.3.2" mathsize="80%" xref="alg1.l5.m1.1.1.3.2.cmml">l</mi><mo id="alg1.l5.m1.1.1.3.1" xref="alg1.l5.m1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.3.3" mathsize="80%" xref="alg1.l5.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l5.m1.1.1.3.1a" xref="alg1.l5.m1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.3.4" mathsize="80%" xref="alg1.l5.m1.1.1.3.4.cmml">n</mi></mrow><mo id="alg1.l5.m1.1.1.2" mathsize="80%" xref="alg1.l5.m1.1.1.2.cmml">=</mo><mrow id="alg1.l5.m1.1.1.1" xref="alg1.l5.m1.1.1.1.cmml"><mrow id="alg1.l5.m1.1.1.1.3" xref="alg1.l5.m1.1.1.1.3.cmml"><mi id="alg1.l5.m1.1.1.1.3.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.2.cmml">M</mi><mo id="alg1.l5.m1.1.1.1.3.1" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.3" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.3.cmml">a</mi><mo id="alg1.l5.m1.1.1.1.3.1a" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.4" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.4.cmml">x</mi><mo id="alg1.l5.m1.1.1.1.3.1b" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.5" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.5.cmml">B</mi><mo id="alg1.l5.m1.1.1.1.3.1c" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.6" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.6.cmml">S</mi></mrow><mo id="alg1.l5.m1.1.1.1.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.2.cmml">−</mo><mrow id="alg1.l5.m1.1.1.1.1.1" xref="alg1.l5.m1.1.1.1.1.2.cmml"><mo id="alg1.l5.m1.1.1.1.1.1.2" maxsize="80%" minsize="80%" xref="alg1.l5.m1.1.1.1.1.2.1.cmml">|</mo><msub id="alg1.l5.m1.1.1.1.1.1.1" xref="alg1.l5.m1.1.1.1.1.1.1.cmml"><mi id="alg1.l5.m1.1.1.1.1.1.1.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l5.m1.1.1.1.1.1.1.3" xref="alg1.l5.m1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l5.m1.1.1.1.1.1.1.3.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.2.cmml">s</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.3" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.3.cmml">t</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1a" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.4" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.4.cmml">a</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1b" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.5" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.5.cmml">r</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1c" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.6" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.6.cmml">v</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1d" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.7" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.7.cmml">e</mi></mrow></msub><mo id="alg1.l5.m1.1.1.1.1.1.3" maxsize="80%" 
minsize="80%" xref="alg1.l5.m1.1.1.1.1.2.1.cmml">|</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l5.m1.1b"><apply id="alg1.l5.m1.1.1.cmml" xref="alg1.l5.m1.1.1"><eq id="alg1.l5.m1.1.1.2.cmml" xref="alg1.l5.m1.1.1.2"></eq><apply id="alg1.l5.m1.1.1.3.cmml" xref="alg1.l5.m1.1.1.3"><times id="alg1.l5.m1.1.1.3.1.cmml" xref="alg1.l5.m1.1.1.3.1"></times><ci id="alg1.l5.m1.1.1.3.2.cmml" xref="alg1.l5.m1.1.1.3.2">𝑙</ci><ci id="alg1.l5.m1.1.1.3.3.cmml" xref="alg1.l5.m1.1.1.3.3">𝑒</ci><ci id="alg1.l5.m1.1.1.3.4.cmml" xref="alg1.l5.m1.1.1.3.4">𝑛</ci></apply><apply id="alg1.l5.m1.1.1.1.cmml" xref="alg1.l5.m1.1.1.1"><minus id="alg1.l5.m1.1.1.1.2.cmml" xref="alg1.l5.m1.1.1.1.2"></minus><apply id="alg1.l5.m1.1.1.1.3.cmml" xref="alg1.l5.m1.1.1.1.3"><times id="alg1.l5.m1.1.1.1.3.1.cmml" xref="alg1.l5.m1.1.1.1.3.1"></times><ci id="alg1.l5.m1.1.1.1.3.2.cmml" xref="alg1.l5.m1.1.1.1.3.2">𝑀</ci><ci id="alg1.l5.m1.1.1.1.3.3.cmml" xref="alg1.l5.m1.1.1.1.3.3">𝑎</ci><ci id="alg1.l5.m1.1.1.1.3.4.cmml" xref="alg1.l5.m1.1.1.1.3.4">𝑥</ci><ci id="alg1.l5.m1.1.1.1.3.5.cmml" xref="alg1.l5.m1.1.1.1.3.5">𝐵</ci><ci id="alg1.l5.m1.1.1.1.3.6.cmml" xref="alg1.l5.m1.1.1.1.3.6">𝑆</ci></apply><apply id="alg1.l5.m1.1.1.1.1.2.cmml" xref="alg1.l5.m1.1.1.1.1.1"><abs id="alg1.l5.m1.1.1.1.1.2.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.2"></abs><apply id="alg1.l5.m1.1.1.1.1.1.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l5.m1.1.1.1.1.1.1.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l5.m1.1.1.1.1.1.1.2.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.2">𝑅</ci><apply id="alg1.l5.m1.1.1.1.1.1.1.3.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3"><times id="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l5.m1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.2">𝑠</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.3">𝑡</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.4">𝑎</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.5.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.5">𝑟</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.6.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.6">𝑣</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.7.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.7">𝑒</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l5.m1.1c">len=MaxBS-|R_{starve}|</annotation><annotation encoding="application/x-llamapun" id="alg1.l5.m1.1d">italic_l italic_e italic_n = italic_M italic_a italic_x italic_B italic_S - | italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT |</annotation></semantics></math><span class="ltx_text" id="alg1.l5.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l6"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l6.1.1.1" style="font-size:80%;">6:</span></span><span class="ltx_text" id="alg1.l6.2" style="font-size:80%;"> </span><math alttext="R_{merge}=argmax_{l_{i}\in L}\left\{r_{i}\in R|r_{i}.lora==l_{i}\right\}" class="ltx_math_unparsed" display="inline" id="alg1.l6.m1.1"><semantics id="alg1.l6.m1.1a"><mrow id="alg1.l6.m1.1b"><msub id="alg1.l6.m1.1.1"><mi id="alg1.l6.m1.1.1.2" mathsize="80%">R</mi><mrow id="alg1.l6.m1.1.1.3"><mi id="alg1.l6.m1.1.1.3.2" mathsize="80%">m</mi><mo id="alg1.l6.m1.1.1.3.1"></mo><mi id="alg1.l6.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l6.m1.1.1.3.1a"></mo><mi id="alg1.l6.m1.1.1.3.4" mathsize="80%">r</mi><mo id="alg1.l6.m1.1.1.3.1b"></mo><mi 
id="alg1.l6.m1.1.1.3.5" mathsize="80%">g</mi><mo id="alg1.l6.m1.1.1.3.1c"></mo><mi id="alg1.l6.m1.1.1.3.6" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l6.m1.1.2" mathsize="80%">=</mo><mi id="alg1.l6.m1.1.3" mathsize="80%">a</mi><mi id="alg1.l6.m1.1.4" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.5" mathsize="80%">g</mi><mi id="alg1.l6.m1.1.6" mathsize="80%">m</mi><mi id="alg1.l6.m1.1.7" mathsize="80%">a</mi><msub id="alg1.l6.m1.1.8"><mi id="alg1.l6.m1.1.8.2" mathsize="80%">x</mi><mrow id="alg1.l6.m1.1.8.3"><msub id="alg1.l6.m1.1.8.3.2"><mi id="alg1.l6.m1.1.8.3.2.2" mathsize="80%">l</mi><mi id="alg1.l6.m1.1.8.3.2.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.8.3.1" mathsize="80%">∈</mo><mi id="alg1.l6.m1.1.8.3.3" mathsize="80%">L</mi></mrow></msub><mrow id="alg1.l6.m1.1.9"><mo id="alg1.l6.m1.1.9.1">{</mo><msub id="alg1.l6.m1.1.9.2"><mi id="alg1.l6.m1.1.9.2.2" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.9.2.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.9.3" mathsize="80%">∈</mo><mi id="alg1.l6.m1.1.9.4" mathsize="80%">R</mi><mo fence="false" id="alg1.l6.m1.1.9.5" maxsize="80%" minsize="80%" rspace="0.167em">|</mo><msub id="alg1.l6.m1.1.9.6"><mi id="alg1.l6.m1.1.9.6.2" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.9.6.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.9.7" lspace="0em" mathsize="80%" rspace="0.167em">.</mo><mi id="alg1.l6.m1.1.9.8" mathsize="80%">l</mi><mi id="alg1.l6.m1.1.9.9" mathsize="80%">o</mi><mi id="alg1.l6.m1.1.9.10" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.9.11" mathsize="80%">a</mi><mo id="alg1.l6.m1.1.9.12" mathsize="80%" rspace="0em">=</mo><mo id="alg1.l6.m1.1.9.13" lspace="0em" mathsize="80%">=</mo><msub id="alg1.l6.m1.1.9.14"><mi id="alg1.l6.m1.1.9.14.2" mathsize="80%">l</mi><mi id="alg1.l6.m1.1.9.14.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.9.15">}</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l6.m1.1c">R_{merge}=argmax_{l_{i}\in L}\left\{r_{i}\in R|r_{i}.lora==l_{i}\right\}</annotation><annotation encoding="application/x-llamapun" id="alg1.l6.m1.1d">italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT = italic_a italic_r italic_g italic_m italic_a italic_x start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_L end_POSTSUBSCRIPT { italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_R | italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT . 
italic_l italic_o italic_r italic_a = = italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }</annotation></semantics></math><span class="ltx_text" id="alg1.l6.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l7"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l7.1.1.1" style="font-size:80%;">7:</span></span><span class="ltx_text" id="alg1.l7.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l7.3" style="font-size:80%;">if</span><span class="ltx_text" id="alg1.l7.4" style="font-size:80%;"> </span><math alttext="|R_{starve}|/MaxBS\leq 0.5" class="ltx_Math" display="inline" id="alg1.l7.m1.1"><semantics id="alg1.l7.m1.1a"><mrow id="alg1.l7.m1.1.1" xref="alg1.l7.m1.1.1.cmml"><mrow id="alg1.l7.m1.1.1.1" xref="alg1.l7.m1.1.1.1.cmml"><mrow id="alg1.l7.m1.1.1.1.1" xref="alg1.l7.m1.1.1.1.1.cmml"><mrow id="alg1.l7.m1.1.1.1.1.1.1" xref="alg1.l7.m1.1.1.1.1.1.2.cmml"><mo id="alg1.l7.m1.1.1.1.1.1.1.2" maxsize="80%" minsize="80%" xref="alg1.l7.m1.1.1.1.1.1.2.1.cmml">|</mo><msub id="alg1.l7.m1.1.1.1.1.1.1.1" xref="alg1.l7.m1.1.1.1.1.1.1.1.cmml"><mi id="alg1.l7.m1.1.1.1.1.1.1.1.2" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l7.m1.1.1.1.1.1.1.1.3" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.2" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.2.cmml">s</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.3" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.3.cmml">t</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1a" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.4" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.4.cmml">a</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1b" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.5" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.5.cmml">r</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1c" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.6" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.6.cmml">v</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1d" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.7" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.7.cmml">e</mi></mrow></msub><mo id="alg1.l7.m1.1.1.1.1.1.1.3" maxsize="80%" minsize="80%" xref="alg1.l7.m1.1.1.1.1.1.2.1.cmml">|</mo></mrow><mo id="alg1.l7.m1.1.1.1.1.2" maxsize="80%" minsize="80%" stretchy="true" symmetric="true" xref="alg1.l7.m1.1.1.1.1.2.cmml">/</mo><mi id="alg1.l7.m1.1.1.1.1.3" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.3.cmml">M</mi></mrow><mo id="alg1.l7.m1.1.1.1.2" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.3" mathsize="80%" xref="alg1.l7.m1.1.1.1.3.cmml">a</mi><mo id="alg1.l7.m1.1.1.1.2a" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.4" mathsize="80%" xref="alg1.l7.m1.1.1.1.4.cmml">x</mi><mo id="alg1.l7.m1.1.1.1.2b" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.5" mathsize="80%" xref="alg1.l7.m1.1.1.1.5.cmml">B</mi><mo id="alg1.l7.m1.1.1.1.2c" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.6" mathsize="80%" xref="alg1.l7.m1.1.1.1.6.cmml">S</mi></mrow><mo id="alg1.l7.m1.1.1.2" mathsize="80%" xref="alg1.l7.m1.1.1.2.cmml">≤</mo><mn id="alg1.l7.m1.1.1.3" mathsize="80%" xref="alg1.l7.m1.1.1.3.cmml">0.5</mn></mrow><annotation-xml encoding="MathML-Content" id="alg1.l7.m1.1b"><apply id="alg1.l7.m1.1.1.cmml" xref="alg1.l7.m1.1.1"><leq 
id="alg1.l7.m1.1.1.2.cmml" xref="alg1.l7.m1.1.1.2"></leq><apply id="alg1.l7.m1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1"><times id="alg1.l7.m1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.2"></times><apply id="alg1.l7.m1.1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1.1"><divide id="alg1.l7.m1.1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.1.2"></divide><apply id="alg1.l7.m1.1.1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.1.1.1"><abs id="alg1.l7.m1.1.1.1.1.1.2.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.2"></abs><apply id="alg1.l7.m1.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l7.m1.1.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l7.m1.1.1.1.1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.2">𝑅</ci><apply id="alg1.l7.m1.1.1.1.1.1.1.1.3.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3"><times id="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.2">𝑠</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.3">𝑡</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.4">𝑎</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.5.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.5">𝑟</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.6.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.6">𝑣</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.7.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.7">𝑒</ci></apply></apply></apply><ci id="alg1.l7.m1.1.1.1.1.3.cmml" xref="alg1.l7.m1.1.1.1.1.3">𝑀</ci></apply><ci id="alg1.l7.m1.1.1.1.3.cmml" xref="alg1.l7.m1.1.1.1.3">𝑎</ci><ci id="alg1.l7.m1.1.1.1.4.cmml" xref="alg1.l7.m1.1.1.1.4">𝑥</ci><ci id="alg1.l7.m1.1.1.1.5.cmml" xref="alg1.l7.m1.1.1.1.5">𝐵</ci><ci id="alg1.l7.m1.1.1.1.6.cmml" xref="alg1.l7.m1.1.1.1.6">𝑆</ci></apply><cn id="alg1.l7.m1.1.1.3.cmml" type="float" xref="alg1.l7.m1.1.1.3">0.5</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l7.m1.1c">|R_{starve}|/MaxBS\leq 0.5</annotation><annotation encoding="application/x-llamapun" id="alg1.l7.m1.1d">| italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT | / italic_M italic_a italic_x italic_B italic_S ≤ 0.5</annotation></semantics></math><span class="ltx_text" id="alg1.l7.5" style="font-size:80%;"> and </span><math alttext="|R_{merge}|/MaxBS>0.5" class="ltx_Math" display="inline" id="alg1.l7.m2.1"><semantics id="alg1.l7.m2.1a"><mrow id="alg1.l7.m2.1.1" xref="alg1.l7.m2.1.1.cmml"><mrow id="alg1.l7.m2.1.1.1" xref="alg1.l7.m2.1.1.1.cmml"><mrow id="alg1.l7.m2.1.1.1.1" xref="alg1.l7.m2.1.1.1.1.cmml"><mrow id="alg1.l7.m2.1.1.1.1.1.1" xref="alg1.l7.m2.1.1.1.1.1.2.cmml"><mo id="alg1.l7.m2.1.1.1.1.1.1.2" maxsize="80%" minsize="80%" xref="alg1.l7.m2.1.1.1.1.1.2.1.cmml">|</mo><msub id="alg1.l7.m2.1.1.1.1.1.1.1" xref="alg1.l7.m2.1.1.1.1.1.1.1.cmml"><mi id="alg1.l7.m2.1.1.1.1.1.1.1.2" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l7.m2.1.1.1.1.1.1.1.3" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.2" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.2.cmml">m</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.3" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.3.cmml">e</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1a" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.4" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.4.cmml">r</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1b" 
xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.5" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.5.cmml">g</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1c" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.6" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.6.cmml">e</mi></mrow></msub><mo id="alg1.l7.m2.1.1.1.1.1.1.3" maxsize="80%" minsize="80%" xref="alg1.l7.m2.1.1.1.1.1.2.1.cmml">|</mo></mrow><mo id="alg1.l7.m2.1.1.1.1.2" maxsize="80%" minsize="80%" stretchy="true" symmetric="true" xref="alg1.l7.m2.1.1.1.1.2.cmml">/</mo><mi id="alg1.l7.m2.1.1.1.1.3" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.3.cmml">M</mi></mrow><mo id="alg1.l7.m2.1.1.1.2" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.3" mathsize="80%" xref="alg1.l7.m2.1.1.1.3.cmml">a</mi><mo id="alg1.l7.m2.1.1.1.2a" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.4" mathsize="80%" xref="alg1.l7.m2.1.1.1.4.cmml">x</mi><mo id="alg1.l7.m2.1.1.1.2b" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.5" mathsize="80%" xref="alg1.l7.m2.1.1.1.5.cmml">B</mi><mo id="alg1.l7.m2.1.1.1.2c" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.6" mathsize="80%" xref="alg1.l7.m2.1.1.1.6.cmml">S</mi></mrow><mo id="alg1.l7.m2.1.1.2" mathsize="80%" xref="alg1.l7.m2.1.1.2.cmml">></mo><mn id="alg1.l7.m2.1.1.3" mathsize="80%" xref="alg1.l7.m2.1.1.3.cmml">0.5</mn></mrow><annotation-xml encoding="MathML-Content" id="alg1.l7.m2.1b"><apply id="alg1.l7.m2.1.1.cmml" xref="alg1.l7.m2.1.1"><gt id="alg1.l7.m2.1.1.2.cmml" xref="alg1.l7.m2.1.1.2"></gt><apply id="alg1.l7.m2.1.1.1.cmml" xref="alg1.l7.m2.1.1.1"><times id="alg1.l7.m2.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.2"></times><apply id="alg1.l7.m2.1.1.1.1.cmml" xref="alg1.l7.m2.1.1.1.1"><divide id="alg1.l7.m2.1.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.1.2"></divide><apply id="alg1.l7.m2.1.1.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.1.1.1"><abs id="alg1.l7.m2.1.1.1.1.1.2.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.2"></abs><apply id="alg1.l7.m2.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l7.m2.1.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l7.m2.1.1.1.1.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.2">𝑅</ci><apply id="alg1.l7.m2.1.1.1.1.1.1.1.3.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3"><times id="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.2">𝑚</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.3">𝑒</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.4">𝑟</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.5.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.5">𝑔</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.6.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.6">𝑒</ci></apply></apply></apply><ci id="alg1.l7.m2.1.1.1.1.3.cmml" xref="alg1.l7.m2.1.1.1.1.3">𝑀</ci></apply><ci id="alg1.l7.m2.1.1.1.3.cmml" xref="alg1.l7.m2.1.1.1.3">𝑎</ci><ci id="alg1.l7.m2.1.1.1.4.cmml" xref="alg1.l7.m2.1.1.1.4">𝑥</ci><ci id="alg1.l7.m2.1.1.1.5.cmml" xref="alg1.l7.m2.1.1.1.5">𝐵</ci><ci id="alg1.l7.m2.1.1.1.6.cmml" xref="alg1.l7.m2.1.1.1.6">𝑆</ci></apply><cn id="alg1.l7.m2.1.1.3.cmml" type="float" xref="alg1.l7.m2.1.1.3">0.5</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l7.m2.1c">|R_{merge}|/MaxBS>0.5</annotation><annotation encoding="application/x-llamapun" id="alg1.l7.m2.1d">| italic_R start_POSTSUBSCRIPT italic_m italic_e 
italic_r italic_g italic_e end_POSTSUBSCRIPT | / italic_M italic_a italic_x italic_B italic_S > 0.5</annotation></semantics></math><span class="ltx_text" id="alg1.l7.6" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l7.7" style="font-size:80%;">then</span><span class="ltx_text" id="alg1.l7.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l8"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l8.1.1.1" style="font-size:80%;">8:</span></span><span class="ltx_text" id="alg1.l8.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l8.3" style="font-size:80%;">if</span><span class="ltx_text" id="alg1.l8.4" style="font-size:80%;"> </span><math alttext="|R_{starve}|==0" class="ltx_math_unparsed" display="inline" id="alg1.l8.m1.1"><semantics id="alg1.l8.m1.1a"><mrow id="alg1.l8.m1.1b"><mo fence="false" id="alg1.l8.m1.1.1" maxsize="80%" minsize="80%" rspace="0.167em">|</mo><msub id="alg1.l8.m1.1.2"><mi id="alg1.l8.m1.1.2.2" mathsize="80%">R</mi><mrow id="alg1.l8.m1.1.2.3"><mi id="alg1.l8.m1.1.2.3.2" mathsize="80%">s</mi><mo id="alg1.l8.m1.1.2.3.1"></mo><mi id="alg1.l8.m1.1.2.3.3" mathsize="80%">t</mi><mo id="alg1.l8.m1.1.2.3.1a"></mo><mi id="alg1.l8.m1.1.2.3.4" mathsize="80%">a</mi><mo id="alg1.l8.m1.1.2.3.1b"></mo><mi id="alg1.l8.m1.1.2.3.5" mathsize="80%">r</mi><mo id="alg1.l8.m1.1.2.3.1c"></mo><mi id="alg1.l8.m1.1.2.3.6" mathsize="80%">v</mi><mo id="alg1.l8.m1.1.2.3.1d"></mo><mi id="alg1.l8.m1.1.2.3.7" mathsize="80%">e</mi></mrow></msub><mo fence="false" id="alg1.l8.m1.1.3" maxsize="80%" minsize="80%">|</mo><mo id="alg1.l8.m1.1.4" lspace="0.167em" mathsize="80%" rspace="0em">=</mo><mo id="alg1.l8.m1.1.5" lspace="0em" mathsize="80%">=</mo><mn id="alg1.l8.m1.1.6" mathsize="80%">0</mn></mrow><annotation encoding="application/x-tex" id="alg1.l8.m1.1c">|R_{starve}|==0</annotation><annotation encoding="application/x-llamapun" id="alg1.l8.m1.1d">| italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT | = = 0</annotation></semantics></math><span class="ltx_text" id="alg1.l8.5" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l8.6" style="font-size:80%;">then</span><span class="ltx_text" id="alg1.l8.7" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l9"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l9.1.1.1" style="font-size:80%;">9:</span></span><span class="ltx_text" id="alg1.l9.2" style="font-size:80%;"> </span><math alttext="M=Merge" class="ltx_Math" display="inline" id="alg1.l9.m1.1"><semantics id="alg1.l9.m1.1a"><mrow id="alg1.l9.m1.1.1" xref="alg1.l9.m1.1.1.cmml"><mi id="alg1.l9.m1.1.1.2" mathsize="80%" xref="alg1.l9.m1.1.1.2.cmml">M</mi><mo id="alg1.l9.m1.1.1.1" mathsize="80%" xref="alg1.l9.m1.1.1.1.cmml">=</mo><mrow id="alg1.l9.m1.1.1.3" xref="alg1.l9.m1.1.1.3.cmml"><mi id="alg1.l9.m1.1.1.3.2" mathsize="80%" xref="alg1.l9.m1.1.1.3.2.cmml">M</mi><mo id="alg1.l9.m1.1.1.3.1" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.3" mathsize="80%" xref="alg1.l9.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l9.m1.1.1.3.1a" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.4" mathsize="80%" xref="alg1.l9.m1.1.1.3.4.cmml">r</mi><mo id="alg1.l9.m1.1.1.3.1b" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.5" mathsize="80%" xref="alg1.l9.m1.1.1.3.5.cmml">g</mi><mo id="alg1.l9.m1.1.1.3.1c" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi 
id="alg1.l9.m1.1.1.3.6" mathsize="80%" xref="alg1.l9.m1.1.1.3.6.cmml">e</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l9.m1.1b"><apply id="alg1.l9.m1.1.1.cmml" xref="alg1.l9.m1.1.1"><eq id="alg1.l9.m1.1.1.1.cmml" xref="alg1.l9.m1.1.1.1"></eq><ci id="alg1.l9.m1.1.1.2.cmml" xref="alg1.l9.m1.1.1.2">𝑀</ci><apply id="alg1.l9.m1.1.1.3.cmml" xref="alg1.l9.m1.1.1.3"><times id="alg1.l9.m1.1.1.3.1.cmml" xref="alg1.l9.m1.1.1.3.1"></times><ci id="alg1.l9.m1.1.1.3.2.cmml" xref="alg1.l9.m1.1.1.3.2">𝑀</ci><ci id="alg1.l9.m1.1.1.3.3.cmml" xref="alg1.l9.m1.1.1.3.3">𝑒</ci><ci id="alg1.l9.m1.1.1.3.4.cmml" xref="alg1.l9.m1.1.1.3.4">𝑟</ci><ci id="alg1.l9.m1.1.1.3.5.cmml" xref="alg1.l9.m1.1.1.3.5">𝑔</ci><ci id="alg1.l9.m1.1.1.3.6.cmml" xref="alg1.l9.m1.1.1.3.6">𝑒</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m1.1c">M=Merge</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m1.1d">italic_M = italic_M italic_e italic_r italic_g italic_e</annotation></semantics></math><span class="ltx_text" id="alg1.l9.3" style="font-size:80%;"> and </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l9.4" style="font-size:80%;">ModeSwitch</span><span class="ltx_text" id="alg1.l9.5" style="font-size:80%;">(</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l9.m2.1"><semantics id="alg1.l9.m2.1a"><mi id="alg1.l9.m2.1.1" mathsize="80%" xref="alg1.l9.m2.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l9.m2.1b"><ci id="alg1.l9.m2.1.1.cmml" xref="alg1.l9.m2.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m2.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m2.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l9.6" style="font-size:80%;">, </span><math alttext="R_{merge}.lora" class="ltx_Math" display="inline" id="alg1.l9.m3.2"><semantics id="alg1.l9.m3.2a"><mrow id="alg1.l9.m3.2.2.2" xref="alg1.l9.m3.2.2.3.cmml"><msub id="alg1.l9.m3.1.1.1.1" xref="alg1.l9.m3.1.1.1.1.cmml"><mi id="alg1.l9.m3.1.1.1.1.2" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l9.m3.1.1.1.1.3" xref="alg1.l9.m3.1.1.1.1.3.cmml"><mi id="alg1.l9.m3.1.1.1.1.3.2" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.2.cmml">m</mi><mo id="alg1.l9.m3.1.1.1.1.3.1" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.3" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.3.cmml">e</mi><mo id="alg1.l9.m3.1.1.1.1.3.1a" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.4" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.4.cmml">r</mi><mo id="alg1.l9.m3.1.1.1.1.3.1b" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.5" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.5.cmml">g</mi><mo id="alg1.l9.m3.1.1.1.1.3.1c" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.6" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.6.cmml">e</mi></mrow></msub><mo id="alg1.l9.m3.2.2.2.3" lspace="0em" mathsize="80%" rspace="0.167em" xref="alg1.l9.m3.2.2.3a.cmml">.</mo><mrow id="alg1.l9.m3.2.2.2.2" xref="alg1.l9.m3.2.2.2.2.cmml"><mi id="alg1.l9.m3.2.2.2.2.2" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.2.cmml">l</mi><mo id="alg1.l9.m3.2.2.2.2.1" xref="alg1.l9.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l9.m3.2.2.2.2.3" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.3.cmml">o</mi><mo id="alg1.l9.m3.2.2.2.2.1a" xref="alg1.l9.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l9.m3.2.2.2.2.4" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.4.cmml">r</mi><mo id="alg1.l9.m3.2.2.2.2.1b" 
xref="alg1.l9.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l9.m3.2.2.2.2.5" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.5.cmml">a</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l9.m3.2b"><apply id="alg1.l9.m3.2.2.3.cmml" xref="alg1.l9.m3.2.2.2"><csymbol cd="ambiguous" id="alg1.l9.m3.2.2.3a.cmml" xref="alg1.l9.m3.2.2.2.3">formulae-sequence</csymbol><apply id="alg1.l9.m3.1.1.1.1.cmml" xref="alg1.l9.m3.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l9.m3.1.1.1.1.1.cmml" xref="alg1.l9.m3.1.1.1.1">subscript</csymbol><ci id="alg1.l9.m3.1.1.1.1.2.cmml" xref="alg1.l9.m3.1.1.1.1.2">𝑅</ci><apply id="alg1.l9.m3.1.1.1.1.3.cmml" xref="alg1.l9.m3.1.1.1.1.3"><times id="alg1.l9.m3.1.1.1.1.3.1.cmml" xref="alg1.l9.m3.1.1.1.1.3.1"></times><ci id="alg1.l9.m3.1.1.1.1.3.2.cmml" xref="alg1.l9.m3.1.1.1.1.3.2">𝑚</ci><ci id="alg1.l9.m3.1.1.1.1.3.3.cmml" xref="alg1.l9.m3.1.1.1.1.3.3">𝑒</ci><ci id="alg1.l9.m3.1.1.1.1.3.4.cmml" xref="alg1.l9.m3.1.1.1.1.3.4">𝑟</ci><ci id="alg1.l9.m3.1.1.1.1.3.5.cmml" xref="alg1.l9.m3.1.1.1.1.3.5">𝑔</ci><ci id="alg1.l9.m3.1.1.1.1.3.6.cmml" xref="alg1.l9.m3.1.1.1.1.3.6">𝑒</ci></apply></apply><apply id="alg1.l9.m3.2.2.2.2.cmml" xref="alg1.l9.m3.2.2.2.2"><times id="alg1.l9.m3.2.2.2.2.1.cmml" xref="alg1.l9.m3.2.2.2.2.1"></times><ci id="alg1.l9.m3.2.2.2.2.2.cmml" xref="alg1.l9.m3.2.2.2.2.2">𝑙</ci><ci id="alg1.l9.m3.2.2.2.2.3.cmml" xref="alg1.l9.m3.2.2.2.2.3">𝑜</ci><ci id="alg1.l9.m3.2.2.2.2.4.cmml" xref="alg1.l9.m3.2.2.2.2.4">𝑟</ci><ci id="alg1.l9.m3.2.2.2.2.5.cmml" xref="alg1.l9.m3.2.2.2.2.5">𝑎</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m3.2c">R_{merge}.lora</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m3.2d">italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT . 
italic_l italic_o italic_r italic_a</annotation></semantics></math><span class="ltx_text" id="alg1.l9.7" style="font-size:80%;">) </span> </div> <div class="ltx_listingline" id="alg1.l10"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l10.1.1.1" style="font-size:80%;">10:</span></span><span class="ltx_text" id="alg1.l10.2" style="font-size:80%;"> </span><math alttext="B_{next}=R_{merge}\left[:MaxBS\right]" class="ltx_math_unparsed" display="inline" id="alg1.l10.m1.1"><semantics id="alg1.l10.m1.1a"><mrow id="alg1.l10.m1.1b"><msub id="alg1.l10.m1.1.1"><mi id="alg1.l10.m1.1.1.2" mathsize="80%">B</mi><mrow id="alg1.l10.m1.1.1.3"><mi id="alg1.l10.m1.1.1.3.2" mathsize="80%">n</mi><mo id="alg1.l10.m1.1.1.3.1"></mo><mi id="alg1.l10.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l10.m1.1.1.3.1a"></mo><mi id="alg1.l10.m1.1.1.3.4" mathsize="80%">x</mi><mo id="alg1.l10.m1.1.1.3.1b"></mo><mi id="alg1.l10.m1.1.1.3.5" mathsize="80%">t</mi></mrow></msub><mo id="alg1.l10.m1.1.2" mathsize="80%">=</mo><msub id="alg1.l10.m1.1.3"><mi id="alg1.l10.m1.1.3.2" mathsize="80%">R</mi><mrow id="alg1.l10.m1.1.3.3"><mi id="alg1.l10.m1.1.3.3.2" mathsize="80%">m</mi><mo id="alg1.l10.m1.1.3.3.1"></mo><mi id="alg1.l10.m1.1.3.3.3" mathsize="80%">e</mi><mo id="alg1.l10.m1.1.3.3.1a"></mo><mi id="alg1.l10.m1.1.3.3.4" mathsize="80%">r</mi><mo id="alg1.l10.m1.1.3.3.1b"></mo><mi id="alg1.l10.m1.1.3.3.5" mathsize="80%">g</mi><mo id="alg1.l10.m1.1.3.3.1c"></mo><mi id="alg1.l10.m1.1.3.3.6" mathsize="80%">e</mi></mrow></msub><mrow id="alg1.l10.m1.1.4"><mo id="alg1.l10.m1.1.4.1">[</mo><mo id="alg1.l10.m1.1.4.2" mathsize="80%" rspace="0.278em">:</mo><mi id="alg1.l10.m1.1.4.3" mathsize="80%">M</mi><mi id="alg1.l10.m1.1.4.4" mathsize="80%">a</mi><mi id="alg1.l10.m1.1.4.5" mathsize="80%">x</mi><mi id="alg1.l10.m1.1.4.6" mathsize="80%">B</mi><mi id="alg1.l10.m1.1.4.7" mathsize="80%">S</mi><mo id="alg1.l10.m1.1.4.8">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l10.m1.1c">B_{next}=R_{merge}\left[:MaxBS\right]</annotation><annotation encoding="application/x-llamapun" id="alg1.l10.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT = italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT [ : italic_M italic_a italic_x italic_B italic_S ]</annotation></semantics></math><span class="ltx_text" id="alg1.l10.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l11"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l11.1.1.1" style="font-size:80%;">11:</span></span><span class="ltx_text" id="alg1.l11.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l11.3" style="font-size:80%;">else</span> </div> <div class="ltx_listingline" id="alg1.l12"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l12.1.1.1" style="font-size:80%;">12:</span></span><span class="ltx_text" id="alg1.l12.2" style="font-size:80%;"> </span><math alttext="M=Mix" class="ltx_Math" display="inline" id="alg1.l12.m1.1"><semantics id="alg1.l12.m1.1a"><mrow id="alg1.l12.m1.1.1" xref="alg1.l12.m1.1.1.cmml"><mi id="alg1.l12.m1.1.1.2" mathsize="80%" xref="alg1.l12.m1.1.1.2.cmml">M</mi><mo id="alg1.l12.m1.1.1.1" mathsize="80%" xref="alg1.l12.m1.1.1.1.cmml">=</mo><mrow id="alg1.l12.m1.1.1.3" xref="alg1.l12.m1.1.1.3.cmml"><mi id="alg1.l12.m1.1.1.3.2" mathsize="80%" xref="alg1.l12.m1.1.1.3.2.cmml">M</mi><mo id="alg1.l12.m1.1.1.3.1" 
xref="alg1.l12.m1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.3" mathsize="80%" xref="alg1.l12.m1.1.1.3.3.cmml">i</mi><mo id="alg1.l12.m1.1.1.3.1a" xref="alg1.l12.m1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.4" mathsize="80%" xref="alg1.l12.m1.1.1.3.4.cmml">x</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l12.m1.1b"><apply id="alg1.l12.m1.1.1.cmml" xref="alg1.l12.m1.1.1"><eq id="alg1.l12.m1.1.1.1.cmml" xref="alg1.l12.m1.1.1.1"></eq><ci id="alg1.l12.m1.1.1.2.cmml" xref="alg1.l12.m1.1.1.2">𝑀</ci><apply id="alg1.l12.m1.1.1.3.cmml" xref="alg1.l12.m1.1.1.3"><times id="alg1.l12.m1.1.1.3.1.cmml" xref="alg1.l12.m1.1.1.3.1"></times><ci id="alg1.l12.m1.1.1.3.2.cmml" xref="alg1.l12.m1.1.1.3.2">𝑀</ci><ci id="alg1.l12.m1.1.1.3.3.cmml" xref="alg1.l12.m1.1.1.3.3">𝑖</ci><ci id="alg1.l12.m1.1.1.3.4.cmml" xref="alg1.l12.m1.1.1.3.4">𝑥</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m1.1c">M=Mix</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m1.1d">italic_M = italic_M italic_i italic_x</annotation></semantics></math><span class="ltx_text" id="alg1.l12.3" style="font-size:80%;"> and </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l12.4" style="font-size:80%;">ModeSwitch</span><span class="ltx_text" id="alg1.l12.5" style="font-size:80%;">(</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l12.m2.1"><semantics id="alg1.l12.m2.1a"><mi id="alg1.l12.m2.1.1" mathsize="80%" xref="alg1.l12.m2.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l12.m2.1b"><ci id="alg1.l12.m2.1.1.cmml" xref="alg1.l12.m2.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m2.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m2.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l12.6" style="font-size:80%;">, </span><math alttext="R_{merge}.lora" class="ltx_Math" display="inline" id="alg1.l12.m3.2"><semantics id="alg1.l12.m3.2a"><mrow id="alg1.l12.m3.2.2.2" xref="alg1.l12.m3.2.2.3.cmml"><msub id="alg1.l12.m3.1.1.1.1" xref="alg1.l12.m3.1.1.1.1.cmml"><mi id="alg1.l12.m3.1.1.1.1.2" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l12.m3.1.1.1.1.3" xref="alg1.l12.m3.1.1.1.1.3.cmml"><mi id="alg1.l12.m3.1.1.1.1.3.2" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.2.cmml">m</mi><mo id="alg1.l12.m3.1.1.1.1.3.1" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.3" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.3.cmml">e</mi><mo id="alg1.l12.m3.1.1.1.1.3.1a" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.4" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.4.cmml">r</mi><mo id="alg1.l12.m3.1.1.1.1.3.1b" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.5" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.5.cmml">g</mi><mo id="alg1.l12.m3.1.1.1.1.3.1c" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.6" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.6.cmml">e</mi></mrow></msub><mo id="alg1.l12.m3.2.2.2.3" lspace="0em" mathsize="80%" rspace="0.167em" xref="alg1.l12.m3.2.2.3a.cmml">.</mo><mrow id="alg1.l12.m3.2.2.2.2" xref="alg1.l12.m3.2.2.2.2.cmml"><mi id="alg1.l12.m3.2.2.2.2.2" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.2.cmml">l</mi><mo id="alg1.l12.m3.2.2.2.2.1" xref="alg1.l12.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l12.m3.2.2.2.2.3" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.3.cmml">o</mi><mo id="alg1.l12.m3.2.2.2.2.1a" 
xref="alg1.l12.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l12.m3.2.2.2.2.4" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.4.cmml">r</mi><mo id="alg1.l12.m3.2.2.2.2.1b" xref="alg1.l12.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l12.m3.2.2.2.2.5" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.5.cmml">a</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l12.m3.2b"><apply id="alg1.l12.m3.2.2.3.cmml" xref="alg1.l12.m3.2.2.2"><csymbol cd="ambiguous" id="alg1.l12.m3.2.2.3a.cmml" xref="alg1.l12.m3.2.2.2.3">formulae-sequence</csymbol><apply id="alg1.l12.m3.1.1.1.1.cmml" xref="alg1.l12.m3.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l12.m3.1.1.1.1.1.cmml" xref="alg1.l12.m3.1.1.1.1">subscript</csymbol><ci id="alg1.l12.m3.1.1.1.1.2.cmml" xref="alg1.l12.m3.1.1.1.1.2">𝑅</ci><apply id="alg1.l12.m3.1.1.1.1.3.cmml" xref="alg1.l12.m3.1.1.1.1.3"><times id="alg1.l12.m3.1.1.1.1.3.1.cmml" xref="alg1.l12.m3.1.1.1.1.3.1"></times><ci id="alg1.l12.m3.1.1.1.1.3.2.cmml" xref="alg1.l12.m3.1.1.1.1.3.2">𝑚</ci><ci id="alg1.l12.m3.1.1.1.1.3.3.cmml" xref="alg1.l12.m3.1.1.1.1.3.3">𝑒</ci><ci id="alg1.l12.m3.1.1.1.1.3.4.cmml" xref="alg1.l12.m3.1.1.1.1.3.4">𝑟</ci><ci id="alg1.l12.m3.1.1.1.1.3.5.cmml" xref="alg1.l12.m3.1.1.1.1.3.5">𝑔</ci><ci id="alg1.l12.m3.1.1.1.1.3.6.cmml" xref="alg1.l12.m3.1.1.1.1.3.6">𝑒</ci></apply></apply><apply id="alg1.l12.m3.2.2.2.2.cmml" xref="alg1.l12.m3.2.2.2.2"><times id="alg1.l12.m3.2.2.2.2.1.cmml" xref="alg1.l12.m3.2.2.2.2.1"></times><ci id="alg1.l12.m3.2.2.2.2.2.cmml" xref="alg1.l12.m3.2.2.2.2.2">𝑙</ci><ci id="alg1.l12.m3.2.2.2.2.3.cmml" xref="alg1.l12.m3.2.2.2.2.3">𝑜</ci><ci id="alg1.l12.m3.2.2.2.2.4.cmml" xref="alg1.l12.m3.2.2.2.2.4">𝑟</ci><ci id="alg1.l12.m3.2.2.2.2.5.cmml" xref="alg1.l12.m3.2.2.2.2.5">𝑎</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m3.2c">R_{merge}.lora</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m3.2d">italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT . 
italic_l italic_o italic_r italic_a</annotation></semantics></math><span class="ltx_text" id="alg1.l12.7" style="font-size:80%;">) </span> </div> <div class="ltx_listingline" id="alg1.l13"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l13.1.1.1" style="font-size:80%;">13:</span></span><span class="ltx_text" id="alg1.l13.2" style="font-size:80%;"> </span><math alttext="B_{next}=R_{starve}+(R_{merge}-R_{starve})\left[:len\right]" class="ltx_math_unparsed" display="inline" id="alg1.l13.m1.1"><semantics id="alg1.l13.m1.1a"><mrow id="alg1.l13.m1.1b"><msub id="alg1.l13.m1.1.1"><mi id="alg1.l13.m1.1.1.2" mathsize="80%">B</mi><mrow id="alg1.l13.m1.1.1.3"><mi id="alg1.l13.m1.1.1.3.2" mathsize="80%">n</mi><mo id="alg1.l13.m1.1.1.3.1"></mo><mi id="alg1.l13.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l13.m1.1.1.3.1a"></mo><mi id="alg1.l13.m1.1.1.3.4" mathsize="80%">x</mi><mo id="alg1.l13.m1.1.1.3.1b"></mo><mi id="alg1.l13.m1.1.1.3.5" mathsize="80%">t</mi></mrow></msub><mo id="alg1.l13.m1.1.2" mathsize="80%">=</mo><msub id="alg1.l13.m1.1.3"><mi id="alg1.l13.m1.1.3.2" mathsize="80%">R</mi><mrow id="alg1.l13.m1.1.3.3"><mi id="alg1.l13.m1.1.3.3.2" mathsize="80%">s</mi><mo id="alg1.l13.m1.1.3.3.1"></mo><mi id="alg1.l13.m1.1.3.3.3" mathsize="80%">t</mi><mo id="alg1.l13.m1.1.3.3.1a"></mo><mi id="alg1.l13.m1.1.3.3.4" mathsize="80%">a</mi><mo id="alg1.l13.m1.1.3.3.1b"></mo><mi id="alg1.l13.m1.1.3.3.5" mathsize="80%">r</mi><mo id="alg1.l13.m1.1.3.3.1c"></mo><mi id="alg1.l13.m1.1.3.3.6" mathsize="80%">v</mi><mo id="alg1.l13.m1.1.3.3.1d"></mo><mi id="alg1.l13.m1.1.3.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l13.m1.1.4" mathsize="80%">+</mo><mrow id="alg1.l13.m1.1.5"><mo id="alg1.l13.m1.1.5.1" maxsize="80%" minsize="80%">(</mo><msub id="alg1.l13.m1.1.5.2"><mi id="alg1.l13.m1.1.5.2.2" mathsize="80%">R</mi><mrow id="alg1.l13.m1.1.5.2.3"><mi id="alg1.l13.m1.1.5.2.3.2" mathsize="80%">m</mi><mo id="alg1.l13.m1.1.5.2.3.1"></mo><mi id="alg1.l13.m1.1.5.2.3.3" mathsize="80%">e</mi><mo id="alg1.l13.m1.1.5.2.3.1a"></mo><mi id="alg1.l13.m1.1.5.2.3.4" mathsize="80%">r</mi><mo id="alg1.l13.m1.1.5.2.3.1b"></mo><mi id="alg1.l13.m1.1.5.2.3.5" mathsize="80%">g</mi><mo id="alg1.l13.m1.1.5.2.3.1c"></mo><mi id="alg1.l13.m1.1.5.2.3.6" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l13.m1.1.5.3" mathsize="80%">−</mo><msub id="alg1.l13.m1.1.5.4"><mi id="alg1.l13.m1.1.5.4.2" mathsize="80%">R</mi><mrow id="alg1.l13.m1.1.5.4.3"><mi id="alg1.l13.m1.1.5.4.3.2" mathsize="80%">s</mi><mo id="alg1.l13.m1.1.5.4.3.1"></mo><mi id="alg1.l13.m1.1.5.4.3.3" mathsize="80%">t</mi><mo id="alg1.l13.m1.1.5.4.3.1a"></mo><mi id="alg1.l13.m1.1.5.4.3.4" mathsize="80%">a</mi><mo id="alg1.l13.m1.1.5.4.3.1b"></mo><mi id="alg1.l13.m1.1.5.4.3.5" mathsize="80%">r</mi><mo id="alg1.l13.m1.1.5.4.3.1c"></mo><mi id="alg1.l13.m1.1.5.4.3.6" mathsize="80%">v</mi><mo id="alg1.l13.m1.1.5.4.3.1d"></mo><mi id="alg1.l13.m1.1.5.4.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l13.m1.1.5.5" maxsize="80%" minsize="80%">)</mo></mrow><mrow id="alg1.l13.m1.1.6"><mo id="alg1.l13.m1.1.6.1">[</mo><mo id="alg1.l13.m1.1.6.2" mathsize="80%" rspace="0.278em">:</mo><mi id="alg1.l13.m1.1.6.3" mathsize="80%">l</mi><mi id="alg1.l13.m1.1.6.4" mathsize="80%">e</mi><mi id="alg1.l13.m1.1.6.5" mathsize="80%">n</mi><mo id="alg1.l13.m1.1.6.6">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l13.m1.1c">B_{next}=R_{starve}+(R_{merge}-R_{starve})\left[:len\right]</annotation><annotation encoding="application/x-llamapun" 
id="alg1.l13.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT = italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT + ( italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT - italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT ) [ : italic_l italic_e italic_n ]</annotation></semantics></math><span class="ltx_text" id="alg1.l13.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l14"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l14.1.1.1" style="font-size:80%;">14:</span></span><span class="ltx_text" id="alg1.l14.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l14.3" style="font-size:80%;">InitDELoRA</span><span class="ltx_text" id="alg1.l14.4" style="font-size:80%;">(</span><math alttext="B_{next}" class="ltx_Math" display="inline" id="alg1.l14.m1.1"><semantics id="alg1.l14.m1.1a"><msub id="alg1.l14.m1.1.1" xref="alg1.l14.m1.1.1.cmml"><mi id="alg1.l14.m1.1.1.2" mathsize="80%" xref="alg1.l14.m1.1.1.2.cmml">B</mi><mrow id="alg1.l14.m1.1.1.3" xref="alg1.l14.m1.1.1.3.cmml"><mi id="alg1.l14.m1.1.1.3.2" mathsize="80%" xref="alg1.l14.m1.1.1.3.2.cmml">n</mi><mo id="alg1.l14.m1.1.1.3.1" xref="alg1.l14.m1.1.1.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.3.3" mathsize="80%" xref="alg1.l14.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l14.m1.1.1.3.1a" xref="alg1.l14.m1.1.1.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.3.4" mathsize="80%" xref="alg1.l14.m1.1.1.3.4.cmml">x</mi><mo id="alg1.l14.m1.1.1.3.1b" xref="alg1.l14.m1.1.1.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.3.5" mathsize="80%" xref="alg1.l14.m1.1.1.3.5.cmml">t</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l14.m1.1b"><apply id="alg1.l14.m1.1.1.cmml" xref="alg1.l14.m1.1.1"><csymbol cd="ambiguous" id="alg1.l14.m1.1.1.1.cmml" xref="alg1.l14.m1.1.1">subscript</csymbol><ci id="alg1.l14.m1.1.1.2.cmml" xref="alg1.l14.m1.1.1.2">𝐵</ci><apply id="alg1.l14.m1.1.1.3.cmml" xref="alg1.l14.m1.1.1.3"><times id="alg1.l14.m1.1.1.3.1.cmml" xref="alg1.l14.m1.1.1.3.1"></times><ci id="alg1.l14.m1.1.1.3.2.cmml" xref="alg1.l14.m1.1.1.3.2">𝑛</ci><ci id="alg1.l14.m1.1.1.3.3.cmml" xref="alg1.l14.m1.1.1.3.3">𝑒</ci><ci id="alg1.l14.m1.1.1.3.4.cmml" xref="alg1.l14.m1.1.1.3.4">𝑥</ci><ci id="alg1.l14.m1.1.1.3.5.cmml" xref="alg1.l14.m1.1.1.3.5">𝑡</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l14.m1.1c">B_{next}</annotation><annotation encoding="application/x-llamapun" id="alg1.l14.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l14.5" style="font-size:80%;">) </span> </div> <div class="ltx_listingline" id="alg1.l15"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l15.1.1.1" style="font-size:80%;">15:</span></span><span class="ltx_text" id="alg1.l15.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l15.3" style="font-size:80%;">else</span> </div> <div class="ltx_listingline" id="alg1.l16"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l16.1.1.1" style="font-size:80%;">16:</span></span><span class="ltx_text" id="alg1.l16.2" style="font-size:80%;"> </span><math alttext="M=Unmerge" class="ltx_Math" display="inline" id="alg1.l16.m1.1"><semantics id="alg1.l16.m1.1a"><mrow 
id="alg1.l16.m1.1.1" xref="alg1.l16.m1.1.1.cmml"><mi id="alg1.l16.m1.1.1.2" mathsize="80%" xref="alg1.l16.m1.1.1.2.cmml">M</mi><mo id="alg1.l16.m1.1.1.1" mathsize="80%" xref="alg1.l16.m1.1.1.1.cmml">=</mo><mrow id="alg1.l16.m1.1.1.3" xref="alg1.l16.m1.1.1.3.cmml"><mi id="alg1.l16.m1.1.1.3.2" mathsize="80%" xref="alg1.l16.m1.1.1.3.2.cmml">U</mi><mo id="alg1.l16.m1.1.1.3.1" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.3" mathsize="80%" xref="alg1.l16.m1.1.1.3.3.cmml">n</mi><mo id="alg1.l16.m1.1.1.3.1a" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.4" mathsize="80%" xref="alg1.l16.m1.1.1.3.4.cmml">m</mi><mo id="alg1.l16.m1.1.1.3.1b" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.5" mathsize="80%" xref="alg1.l16.m1.1.1.3.5.cmml">e</mi><mo id="alg1.l16.m1.1.1.3.1c" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.6" mathsize="80%" xref="alg1.l16.m1.1.1.3.6.cmml">r</mi><mo id="alg1.l16.m1.1.1.3.1d" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.7" mathsize="80%" xref="alg1.l16.m1.1.1.3.7.cmml">g</mi><mo id="alg1.l16.m1.1.1.3.1e" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.8" mathsize="80%" xref="alg1.l16.m1.1.1.3.8.cmml">e</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l16.m1.1b"><apply id="alg1.l16.m1.1.1.cmml" xref="alg1.l16.m1.1.1"><eq id="alg1.l16.m1.1.1.1.cmml" xref="alg1.l16.m1.1.1.1"></eq><ci id="alg1.l16.m1.1.1.2.cmml" xref="alg1.l16.m1.1.1.2">𝑀</ci><apply id="alg1.l16.m1.1.1.3.cmml" xref="alg1.l16.m1.1.1.3"><times id="alg1.l16.m1.1.1.3.1.cmml" xref="alg1.l16.m1.1.1.3.1"></times><ci id="alg1.l16.m1.1.1.3.2.cmml" xref="alg1.l16.m1.1.1.3.2">𝑈</ci><ci id="alg1.l16.m1.1.1.3.3.cmml" xref="alg1.l16.m1.1.1.3.3">𝑛</ci><ci id="alg1.l16.m1.1.1.3.4.cmml" xref="alg1.l16.m1.1.1.3.4">𝑚</ci><ci id="alg1.l16.m1.1.1.3.5.cmml" xref="alg1.l16.m1.1.1.3.5">𝑒</ci><ci id="alg1.l16.m1.1.1.3.6.cmml" xref="alg1.l16.m1.1.1.3.6">𝑟</ci><ci id="alg1.l16.m1.1.1.3.7.cmml" xref="alg1.l16.m1.1.1.3.7">𝑔</ci><ci id="alg1.l16.m1.1.1.3.8.cmml" xref="alg1.l16.m1.1.1.3.8">𝑒</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l16.m1.1c">M=Unmerge</annotation><annotation encoding="application/x-llamapun" id="alg1.l16.m1.1d">italic_M = italic_U italic_n italic_m italic_e italic_r italic_g italic_e</annotation></semantics></math><span class="ltx_text" id="alg1.l16.3" style="font-size:80%;"> and </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l16.4" style="font-size:80%;">ModeSwitch</span><span class="ltx_text" id="alg1.l16.5" style="font-size:80%;">(</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l16.m2.1"><semantics id="alg1.l16.m2.1a"><mi id="alg1.l16.m2.1.1" mathsize="80%" xref="alg1.l16.m2.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l16.m2.1b"><ci id="alg1.l16.m2.1.1.cmml" xref="alg1.l16.m2.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l16.m2.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l16.m2.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l16.6" style="font-size:80%;">, ) </span> </div> <div class="ltx_listingline" id="alg1.l17"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l17.1.1.1" style="font-size:80%;">17:</span></span><span class="ltx_text" id="alg1.l17.2" style="font-size:80%;"> </span><math alttext="B_{next}=R_{starve}+(R-R_{starve})\left[:len\right]" class="ltx_math_unparsed" display="inline" 
id="alg1.l17.m1.1"><semantics id="alg1.l17.m1.1a"><mrow id="alg1.l17.m1.1b"><msub id="alg1.l17.m1.1.1"><mi id="alg1.l17.m1.1.1.2" mathsize="80%">B</mi><mrow id="alg1.l17.m1.1.1.3"><mi id="alg1.l17.m1.1.1.3.2" mathsize="80%">n</mi><mo id="alg1.l17.m1.1.1.3.1"></mo><mi id="alg1.l17.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l17.m1.1.1.3.1a"></mo><mi id="alg1.l17.m1.1.1.3.4" mathsize="80%">x</mi><mo id="alg1.l17.m1.1.1.3.1b"></mo><mi id="alg1.l17.m1.1.1.3.5" mathsize="80%">t</mi></mrow></msub><mo id="alg1.l17.m1.1.2" mathsize="80%">=</mo><msub id="alg1.l17.m1.1.3"><mi id="alg1.l17.m1.1.3.2" mathsize="80%">R</mi><mrow id="alg1.l17.m1.1.3.3"><mi id="alg1.l17.m1.1.3.3.2" mathsize="80%">s</mi><mo id="alg1.l17.m1.1.3.3.1"></mo><mi id="alg1.l17.m1.1.3.3.3" mathsize="80%">t</mi><mo id="alg1.l17.m1.1.3.3.1a"></mo><mi id="alg1.l17.m1.1.3.3.4" mathsize="80%">a</mi><mo id="alg1.l17.m1.1.3.3.1b"></mo><mi id="alg1.l17.m1.1.3.3.5" mathsize="80%">r</mi><mo id="alg1.l17.m1.1.3.3.1c"></mo><mi id="alg1.l17.m1.1.3.3.6" mathsize="80%">v</mi><mo id="alg1.l17.m1.1.3.3.1d"></mo><mi id="alg1.l17.m1.1.3.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l17.m1.1.4" mathsize="80%">+</mo><mrow id="alg1.l17.m1.1.5"><mo id="alg1.l17.m1.1.5.1" maxsize="80%" minsize="80%">(</mo><mi id="alg1.l17.m1.1.5.2" mathsize="80%">R</mi><mo id="alg1.l17.m1.1.5.3" mathsize="80%">−</mo><msub id="alg1.l17.m1.1.5.4"><mi id="alg1.l17.m1.1.5.4.2" mathsize="80%">R</mi><mrow id="alg1.l17.m1.1.5.4.3"><mi id="alg1.l17.m1.1.5.4.3.2" mathsize="80%">s</mi><mo id="alg1.l17.m1.1.5.4.3.1"></mo><mi id="alg1.l17.m1.1.5.4.3.3" mathsize="80%">t</mi><mo id="alg1.l17.m1.1.5.4.3.1a"></mo><mi id="alg1.l17.m1.1.5.4.3.4" mathsize="80%">a</mi><mo id="alg1.l17.m1.1.5.4.3.1b"></mo><mi id="alg1.l17.m1.1.5.4.3.5" mathsize="80%">r</mi><mo id="alg1.l17.m1.1.5.4.3.1c"></mo><mi id="alg1.l17.m1.1.5.4.3.6" mathsize="80%">v</mi><mo id="alg1.l17.m1.1.5.4.3.1d"></mo><mi id="alg1.l17.m1.1.5.4.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l17.m1.1.5.5" maxsize="80%" minsize="80%">)</mo></mrow><mrow id="alg1.l17.m1.1.6"><mo id="alg1.l17.m1.1.6.1">[</mo><mo id="alg1.l17.m1.1.6.2" mathsize="80%" rspace="0.278em">:</mo><mi id="alg1.l17.m1.1.6.3" mathsize="80%">l</mi><mi id="alg1.l17.m1.1.6.4" mathsize="80%">e</mi><mi id="alg1.l17.m1.1.6.5" mathsize="80%">n</mi><mo id="alg1.l17.m1.1.6.6">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l17.m1.1c">B_{next}=R_{starve}+(R-R_{starve})\left[:len\right]</annotation><annotation encoding="application/x-llamapun" id="alg1.l17.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT = italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT + ( italic_R - italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT ) [ : italic_l italic_e italic_n ]</annotation></semantics></math><span class="ltx_text" id="alg1.l17.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l18"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l18.1.1.1" style="font-size:80%;">18:</span></span><span class="ltx_text" id="alg1.l18.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l18.3" style="font-size:80%;">return</span><span class="ltx_text" id="alg1.l18.4" style="font-size:80%;"> </span><math alttext="B_{next}" class="ltx_Math" display="inline" id="alg1.l18.m1.1"><semantics id="alg1.l18.m1.1a"><msub id="alg1.l18.m1.1.1" 
xref="alg1.l18.m1.1.1.cmml"><mi id="alg1.l18.m1.1.1.2" mathsize="80%" xref="alg1.l18.m1.1.1.2.cmml">B</mi><mrow id="alg1.l18.m1.1.1.3" xref="alg1.l18.m1.1.1.3.cmml"><mi id="alg1.l18.m1.1.1.3.2" mathsize="80%" xref="alg1.l18.m1.1.1.3.2.cmml">n</mi><mo id="alg1.l18.m1.1.1.3.1" xref="alg1.l18.m1.1.1.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.3" mathsize="80%" xref="alg1.l18.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l18.m1.1.1.3.1a" xref="alg1.l18.m1.1.1.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.4" mathsize="80%" xref="alg1.l18.m1.1.1.3.4.cmml">x</mi><mo id="alg1.l18.m1.1.1.3.1b" xref="alg1.l18.m1.1.1.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.5" mathsize="80%" xref="alg1.l18.m1.1.1.3.5.cmml">t</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l18.m1.1b"><apply id="alg1.l18.m1.1.1.cmml" xref="alg1.l18.m1.1.1"><csymbol cd="ambiguous" id="alg1.l18.m1.1.1.1.cmml" xref="alg1.l18.m1.1.1">subscript</csymbol><ci id="alg1.l18.m1.1.1.2.cmml" xref="alg1.l18.m1.1.1.2">𝐵</ci><apply id="alg1.l18.m1.1.1.3.cmml" xref="alg1.l18.m1.1.1.3"><times id="alg1.l18.m1.1.1.3.1.cmml" xref="alg1.l18.m1.1.1.3.1"></times><ci id="alg1.l18.m1.1.1.3.2.cmml" xref="alg1.l18.m1.1.1.3.2">𝑛</ci><ci id="alg1.l18.m1.1.1.3.3.cmml" xref="alg1.l18.m1.1.1.3.3">𝑒</ci><ci id="alg1.l18.m1.1.1.3.4.cmml" xref="alg1.l18.m1.1.1.3.4">𝑥</ci><ci id="alg1.l18.m1.1.1.3.5.cmml" xref="alg1.l18.m1.1.1.3.5">𝑡</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l18.m1.1c">B_{next}</annotation><annotation encoding="application/x-llamapun" id="alg1.l18.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l18.5" style="font-size:80%;"> </span> </div> </div> </figure> </section> <section class="ltx_subsubsection" id="S4.SS4.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.4.2 Mixture inference mode.</h4> <div class="ltx_para" id="S4.SS4.SSS2.p1"> <p class="ltx_p" id="S4.SS4.SSS2.p1.4">Compared to the unmerged mode, merged inference supports only one LoRA adapter at once, which results in the starvation of requests from other vision tasks. To alleviate starvation, we propose a novel inference mode, deLoRA, to enable the simultaneous execution of merged and unmerged inference. As shown in Fig.<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F13" title="Figure 13 ‣ 4.4 Flexible LoRA Adapters Orchestration ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">13</span></a>, LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.1">1</sub> handles the requests in merged mode, while other LoRA adapters, LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.2"><span class="ltx_text ltx_font_italic" id="S4.SS4.SSS2.p1.4.2.1">x</span></sub>, process their requests in unmerged mode. To maintain the consistent results of LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.3"><span class="ltx_text ltx_font_italic" id="S4.SS4.SSS2.p1.4.3.1">x</span></sub>’s request, we introduce the deLoRA branch to prevent contamination from LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.4">1</sub>. The weight of deLoRA is the same as that of the merged LoRA. 
Based on the distributive property of matrix multiplication, the correctness can be verified as follows.</p> <table class="ltx_equationgroup ltx_eqn_table" id="S4.Ex1"> <tbody id="S4.Ex1X"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathbf{output_{x}}" class="ltx_Math" display="inline" id="S4.Ex1X.2.1.1.m1.1"><semantics id="S4.Ex1X.2.1.1.m1.1a"><msub id="S4.Ex1X.2.1.1.m1.1.1" xref="S4.Ex1X.2.1.1.m1.1.1.cmml"><mi id="S4.Ex1X.2.1.1.m1.1.1.2" mathsize="90%" xref="S4.Ex1X.2.1.1.m1.1.1.2.cmml">𝐨𝐮𝐭𝐩𝐮𝐭</mi><mi id="S4.Ex1X.2.1.1.m1.1.1.3" mathsize="90%" xref="S4.Ex1X.2.1.1.m1.1.1.3.cmml">𝐱</mi></msub><annotation-xml encoding="MathML-Content" id="S4.Ex1X.2.1.1.m1.1b"><apply id="S4.Ex1X.2.1.1.m1.1.1.cmml" xref="S4.Ex1X.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.Ex1X.2.1.1.m1.1.1.1.cmml" xref="S4.Ex1X.2.1.1.m1.1.1">subscript</csymbol><ci id="S4.Ex1X.2.1.1.m1.1.1.2.cmml" xref="S4.Ex1X.2.1.1.m1.1.1.2">𝐨𝐮𝐭𝐩𝐮𝐭</ci><ci id="S4.Ex1X.2.1.1.m1.1.1.3.cmml" xref="S4.Ex1X.2.1.1.m1.1.1.3">𝐱</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex1X.2.1.1.m1.1c">\displaystyle\mathbf{output_{x}}</annotation><annotation encoding="application/x-llamapun" id="S4.Ex1X.2.1.1.m1.1d">bold_output start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{merge}}-\mathbf{W_{deLoRA_{1% }}}+\mathbf{W_{LoRA_{x}}})" class="ltx_Math" display="inline" id="S4.Ex1X.3.2.2.m1.1"><semantics id="S4.Ex1X.3.2.2.m1.1a"><mrow id="S4.Ex1X.3.2.2.m1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.3" xref="S4.Ex1X.3.2.2.m1.1.1.3.cmml"></mi><mo id="S4.Ex1X.3.2.2.m1.1.1.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.2.cmml">=</mo><mrow id="S4.Ex1X.3.2.2.m1.1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.1.cmml"><msub id="S4.Ex1X.3.2.2.m1.1.1.1.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.2.cmml">𝐢𝐧𝐩𝐮𝐭</mi><mi id="S4.Ex1X.3.2.2.m1.1.1.1.3.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.3.cmml">𝐱</mi></msub><mo id="S4.Ex1X.3.2.2.m1.1.1.1.2" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="S4.Ex1X.3.2.2.m1.1.1.1.2.cmml">×</mo><mrow id="S4.Ex1X.3.2.2.m1.1.1.1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml"><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml"><mrow id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.cmml"><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2.cmml">𝐖</mi><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3.cmml">𝐦𝐞𝐫𝐠𝐞</mi></msub><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1.cmml">−</mo><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2.cmml">𝐖</mi><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2" mathsize="90%" 
xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2.cmml">𝐝𝐞𝐋𝐨𝐑𝐀</mi><mn id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3.cmml">𝟏</mn></msub></msub></mrow><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1.cmml">+</mo><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2.cmml">𝐖</mi><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2.cmml">𝐋𝐨𝐑𝐀</mi><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3.cmml">𝐱</mi></msub></msub></mrow><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.Ex1X.3.2.2.m1.1b"><apply id="S4.Ex1X.3.2.2.m1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1"><eq id="S4.Ex1X.3.2.2.m1.1.1.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.2"></eq><csymbol cd="latexml" id="S4.Ex1X.3.2.2.m1.1.1.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.3">absent</csymbol><apply id="S4.Ex1X.3.2.2.m1.1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1"><times id="S4.Ex1X.3.2.2.m1.1.1.1.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.2"></times><apply id="S4.Ex1X.3.2.2.m1.1.1.1.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.2">𝐢𝐧𝐩𝐮𝐭</ci><ci id="S4.Ex1X.3.2.2.m1.1.1.1.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.3">𝐱</ci></apply><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1"><plus id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1"></plus><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2"><minus id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1"></minus><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2">𝐖</ci><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3">𝐦𝐞𝐫𝐠𝐞</ci></apply><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2">𝐖</ci><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2">𝐝𝐞𝐋𝐨𝐑𝐀</ci><cn id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3.cmml" type="integer" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3">1</cn></apply></apply></apply><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.1.cmml" 
xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2">𝐖</ci><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2">𝐋𝐨𝐑𝐀</ci><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3">𝐱</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex1X.3.2.2.m1.1c">\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{merge}}-\mathbf{W_{deLoRA_{1% }}}+\mathbf{W_{LoRA_{x}}})</annotation><annotation encoding="application/x-llamapun" id="S4.Ex1X.3.2.2.m1.1d">= bold_input start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT × ( bold_W start_POSTSUBSCRIPT bold_merge end_POSTSUBSCRIPT - bold_W start_POSTSUBSCRIPT bold_deLoRA start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + bold_W start_POSTSUBSCRIPT bold_LoRA start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="S4.Ex1Xa"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{base}}+\mathbf{W_{LoRA_{x}}})," class="ltx_Math" display="inline" id="S4.Ex1Xa.2.1.1.m1.1"><semantics id="S4.Ex1Xa.2.1.1.m1.1a"><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml"><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3.cmml"></mi><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2.cmml">=</mo><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.cmml"><msub id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2.cmml">𝐢𝐧𝐩𝐮𝐭</mi><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3.cmml">𝐱</mi></msub><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2.cmml">×</mo><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><msub id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2.cmml">𝐖</mi><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3.cmml">𝐛𝐚𝐬𝐞</mi></msub><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1.cmml">+</mo><msub id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2.cmml">𝐖</mi><msub 
id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2.cmml">𝐋𝐨𝐑𝐀</mi><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3.cmml">𝐱</mi></msub></msub></mrow><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.Ex1Xa.2.1.1.m1.1b"><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1"><eq id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2"></eq><csymbol cd="latexml" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3">absent</csymbol><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1"><times id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2"></times><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2">𝐢𝐧𝐩𝐮𝐭</ci><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3">𝐱</ci></apply><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1"><plus id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1"></plus><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2">𝐖</ci><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3">𝐛𝐚𝐬𝐞</ci></apply><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2">𝐖</ci><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2">𝐋𝐨𝐑𝐀</ci><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3">𝐱</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex1Xa.2.1.1.m1.1c">\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{base}}+\mathbf{W_{LoRA_{x}}}),</annotation><annotation encoding="application/x-llamapun" id="S4.Ex1Xa.2.1.1.m1.1d">= bold_input start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT × ( bold_W start_POSTSUBSCRIPT bold_base end_POSTSUBSCRIPT + bold_W start_POSTSUBSCRIPT bold_LoRA start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> 
in which $\mathbf{W_{merge}}=\mathbf{W_{base}}+\mathbf{W_{LoRA_{1}}}$, $\mathbf{W_{LoRA_{1}}}=\mathbf{W_{deLoRA_{1}}}$, and $\mathbf{W}$ denotes the weight of the base model, the deLoRA, and the LoRA adapter, respectively; $\mathbf{input_{x}}$ and $\mathbf{output_{x}}$ are the input and output of a request served by LoRA$_{x}$.

Mixture inference mode has two advantages: 1) it does not incur the cost of switching from merged to unmerged mode, and 2) it incurs less extra computation than unmerged inference when more requests target the merged LoRA adapter than the other adapters.
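For concreteness, the following is a minimal PyTorch-style sketch of the merged and mixture computations implied by the equation above. The names (`delta_W`, `merged_id`, etc.) are illustrative rather than V-LoRA's actual interfaces, and a real implementation would keep the low-rank factors instead of materializing each full-rank update:

```python
def mixture_forward(inputs, W_base, delta_W, lora_ids, merged_id):
    """Sketch of merged vs. mixture LoRA inference for one linear layer.

    inputs   : (batch, d_in) PyTorch tensor, one row per request
    W_base   : (d_in, d_out) frozen base weight
    delta_W  : dict adapter_id -> (d_in, d_out) low-rank update (A @ B)
    lora_ids : per-request adapter ids
    merged_id: the adapter currently merged into the base weight
    """
    # Merged mode: one dense GEMM serves every request of the merged adapter.
    W_merge = W_base + delta_W[merged_id]
    out = inputs @ W_merge

    # Mixture mode: requests of other adapters "de-merge" the merged adapter
    # and add their own, i.e. input_x @ (W_merge - W_deLoRA_1 + W_LoRA_x).
    for i, lid in enumerate(lora_ids):
        if lid != merged_id:
            out[i] = inputs[i] @ (W_merge - delta_W[merged_id] + delta_W[lid])
    return out
```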
4.4.3 Scheduling policy. To minimize the average response latency and meet each request's latency constraint, our orchestrator must carefully coordinate requests, adapters, and inference modes. Our policy follows a greedy heuristic with two principles: (1) execute in merged mode whenever possible, as it yields the fastest response with no extra overhead; (2) when starvation occurs, switch first to mixture mode and then to unmerged mode, in increasing order of switching cost and extra computation. Alg. 1 shows the pseudo-code. To alleviate starvation, it assigns each request a credit, defined as its waiting time plus its execution time in the current mode plus the mode-switch latency (line #2), and sets a tolerance threshold $\theta$ as the condition for entering mixture mode. When the request workload meets the criteria for switching to merged mode, the algorithm switches to merged mode (lines #5-8). When the number of starving requests exceeds $\theta$, Alg. 1 processes them in mixture mode immediately (lines #9-12). When it further exceeds half the maximum batch size, the algorithm switches to unmerged mode (lines #13-15).
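A condensed Python sketch of this heuristic is given below. It is not Alg. 1 itself: the request attributes (`wait_time`, `exec_time`, `switch_cost`, `latency_budget`), the 60% merge-friendliness share, and the way starvation is detected from the credit are illustrative assumptions.

```python
def merge_criteria_met(pending, share=0.6):
    # Illustrative: merged mode pays off when most queued requests share one adapter.
    if not pending:
        return False
    counts = {}
    for r in pending:
        counts[r.lora_id] = counts.get(r.lora_id, 0) + 1
    return max(counts.values()) / len(pending) >= share

def choose_mode(pending, current_mode, theta, max_batch):
    # Credit: waiting time + execution time in the current mode + mode-switch latency.
    for r in pending:
        r.credit = r.wait_time + r.exec_time[current_mode] + r.switch_cost
    starving = [r for r in pending if r.credit > r.latency_budget]

    if merge_criteria_met(pending):      # principle (1): merged mode whenever possible
        return "merged"
    if len(starving) > max_batch // 2:   # severe starvation: fall back to unmerged mode
        return "unmerged"
    if len(starving) > theta:            # moderate starvation: cheaper switch to mixture
        return "mixture"
    return current_mode                  # otherwise keep the current mode
```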
5. Implementation

We implement V-LoRA on top of several tools, including PyTorch (v2.0.1) (pytorch), Triton (v2.0.1) (Triton), CUTLASS (v3.5.1) (cutlass), and vLLM (v0.3.0) (vLLM), in ~7.1K LOC. Most of the code is written in Python (v3.9), except ATMM, which is written in CUDA. We build V-LoRA on vLLM, specifically the LightLLM (LightLLM) variant, because of its advanced features such as PagedAttention, iteration-level scheduling, and token-based memory management. The three key techniques in V-LoRA are implemented as follows. (1) We implement accuracy-aware LoRA adapter generation for popular LMMs, including the Qwen-VL (QwenVL) and LLaVA (llava) series, based on the transformers (transformers) and PEFT (peft) libraries. (2) We implement ATMM in CUDA C++ on top of CUTLASS. Its hash table, which stores the optimal input-tiling pairs, uses a 128-bit unsigned integer key to index input shapes. Since CUTLASS operators cannot be compiled dynamically, we create a Python interface that binds and packages the implementations of the optimal tiling configurations with Pybind11 (pybind11) and setuptools (setuptools). (3) We integrate ATMM into vLLM to support unmerged inference, mixture inference, and swift inference-mode switching. To manage the many adapters and requests in unmerged and mixture mode, we encode the LoRA type of each request as a one-hot vector and build a request-type mapping matrix for the current batch.
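As a rough illustration of the request-type mapping matrix (the actual vLLM integration is more involved, and the names here are ours, not V-LoRA's API):

```python
import torch

def build_request_type_matrix(lora_ids, resident_adapters):
    """Sketch: one-hot request-to-adapter mapping for the current batch.

    lora_ids          : per-request adapter ids in the batch
    resident_adapters : ids of the adapters currently loaded on the GPU
    Returns an (n_requests, n_adapters) 0/1 matrix whose row i selects the
    adapter of request i, so batched unmerged/mixture LoRA kernels can gather
    the right low-rank factors per request.
    """
    col = {aid: j for j, aid in enumerate(resident_adapters)}
    M = torch.zeros(len(lora_ids), len(resident_adapters))
    for i, lid in enumerate(lora_ids):
        M[i, col[lid]] = 1.0
    return M
```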
For efficient memory management, we use unified memory management (sheng2023slora) for the KV cache and the adapters. V-LoRA streams the text and images in requests asynchronously via RPyC (rpyc). After receiving a request, V-LoRA identifies its LoRA adapter, dispatches the request to that adapter, generates a response, and returns it.

LoRA adapter swap. Since a large number of LoRA adapters may be generated, we keep them in main memory. To reduce delay, we store only the low-rank factors $A$ and $B$ rather than $\Delta W$, swap them asynchronously between GPU and host memory, and compute $\Delta W$ with ATMM at runtime.
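A small sketch of this swap scheme, assuming PyTorch tensors and a plain matrix multiply in place of ATMM (the class and method names are illustrative):

```python
class AdapterStore:
    """Sketch of host-side LoRA storage: keep only the low-rank factors A, B
    and materialize delta_W = B @ A on the GPU when an adapter is activated."""

    def __init__(self):
        self.cpu_factors = {}          # adapter_id -> (A, B) pinned on the host

    def register(self, adapter_id, A, B):
        # A: (r, d_in), B: (d_out, r); storing factors costs O(r * (d_in + d_out))
        # instead of O(d_in * d_out) for the full delta_W.
        self.cpu_factors[adapter_id] = (A.pin_memory(), B.pin_memory())

    def activate(self, adapter_id, device="cuda"):
        A, B = self.cpu_factors[adapter_id]
        # Asynchronous host-to-GPU copies of the small factors only.
        A_gpu = A.to(device, non_blocking=True)
        B_gpu = B.to(device, non_blocking=True)
        # delta_W is recomputed at runtime; V-LoRA would use ATMM for this step.
        return B_gpu @ A_gpu           # (d_out, d_in) weight update
```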
KV Cache reuse. The same images may be accessed multiple times in some applications, e.g., multi-round visual question answering (vqa2). We implement prefix matching, based on CacheBlend (yao2024cacheblend) and SGLang (zheng2023efficiently), to reuse the KV cache of identical images, avoiding redundant storage.
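A minimal sketch of the idea, assuming images are represented by their vision-token prefixes (this is our simplification, not the CacheBlend/SGLang mechanism itself):

```python
class ImageKVCache:
    """Sketch: reuse KV cache entries keyed by an image's token prefix."""

    def __init__(self):
        self.cache = {}                        # token-id prefix -> KV tensors

    def lookup(self, image_token_ids):
        # Identical images produce identical vision-token prefixes.
        return self.cache.get(tuple(image_token_ids))

    def store(self, image_token_ids, kv):
        self.cache[tuple(image_token_ids)] = kv
```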
6. Evaluation

We evaluate V-LoRA with two vision applications involving five tasks on three LMMs. The key takeaways are:

• V-LoRA reduces end-to-end latency by 20-89% compared to state-of-the-art serving systems and achieves accuracy comparable to specialized small models on specific domains. (§6.2)
• Accuracy-aware LoRA Adapter Generation brings remarkable accuracy and throughput benefits; Adaptive-tiling LoRA Adapters Batching and Flexible LoRA Adapters Orchestration substantially improve latency and throughput. (§6.3)
• V-LoRA is stable under diverse workloads and numbers of LoRA adapters, and scales to multiple GPUs. (§6.4)
Figure 14. Average token latency comparison over various serving systems on two vision applications and three LMMs.

Model           Vision Encoder        Size   Layer #   Dimension
Qwen-VL-7B      Openclip-ViT (1.9B)   18GB   32        4096
LLaVA-1.5-7B    CLIP-ViT (0.3B)       13GB   32        4096
LLaVA-1.5-13B   CLIP-ViT (0.3B)       24GB   40        5120
Table 2. Model configurations.

6.1 Experimental Setup

Vision applications and datasets. We select two distinct types of vision applications, visual retrieval and video analytics, to evaluate V-LoRA. Visual retrieval analyzes images and responds to queries; it involves visual question-answering, image captioning, and specific-target detection tasks as required by the queries. We evaluate visual retrieval on the SharedGPT-4V (SharedGPT4V) and RefCOCO (kazemzadeh2014referitgame; yu2016modeling) datasets. Video analytics ingests and analyzes each RGB frame of a video and outputs results for fixed vision tasks, including object detection and video understanding, as in prior work (wang2024region; yuan2022PacketGame). Object detection locates and identifies objects in each video frame on YODA (xiao2021towards) and Cityscapes (Cordts2016Cityscapes); video understanding recognizes actions over consecutive video frames on UCF101 (UCF101).

Testbed and workload. We evaluate V-LoRA on a server equipped with one NVIDIA A100 80GB GPU and an Intel Xeon Platinum 8358 CPU with 128 GB of host memory. The visual retrieval workload is drawn from production traces in the Microsoft Azure LLM inference trace 2023 (AzureLLM). Because the volume of requests in this trace exceeds current hardware capabilities, we follow prior work (wu2024dlora) and randomly sample requests at varying rates in a round-robin fashion.
Like prior work (wang2024region; du2020server), the video analytics workload ingests one 30-frame video chunk per second per stream.

Metrics. Accuracy follows the standard metric for each task: visual retrieval is measured by VQA score (goyal2017making), object detection by average F1-score, and video understanding by Top-1 accuracy. For system metrics, we evaluate end-to-end throughput, i.e., requests per second, and average token latency, i.e., the sum of each request's end-to-end latency divided by the total number of tokens.
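For clarity, the average token latency metric can be computed as in the short sketch below (field names are illustrative):

```python
def average_token_latency(requests):
    """Sum of each request's end-to-end latency divided by the total number
    of tokens, as defined above; `requests` is a list of dicts (illustrative)."""
    total_latency = sum(r["e2e_latency_s"] for r in requests)
    total_tokens = sum(r["num_tokens"] for r in requests)
    return total_latency / total_tokens
```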
Models. We choose the widely used open-source LMMs Qwen-VL (QwenVL) and the LLaVA-1.5 series (llava), using LLaVA models of different sizes, LLaVA-v1.5-7B and LLaVA-v1.5-13B. The details of these models are shown in Table 2. The rank of the LoRA adapters is set to 64. We also use five small models that yield state-of-the-art performance on the corresponding datasets for accuracy comparison: VisionMamba (VisionMamba), YOLO (glenn2021YOLOV5), OSCAR (li2020oscar), VideoMAE (VideoMAE), and UNINEXT (yan2023universal).

Figure 15. Accuracy comparison between small models and the LMM across five different vision tasks.

Baselines. To show the benefits of V-LoRA, we compare it with three baselines:
• S-LoRA (sheng2023slora) serves only in unmerged mode and employs a customized CUDA operator to batch concurrent heterogeneous LoRA computations.
• Punica (chen2024punica) also serves only in unmerged mode, like S-LoRA, but employs its own operator.
<br class="ltx_break"/> <span class="ltx_inline-block ltx_transformed_outer" id="S6.SS1.p5.3.3" style="width:4.0pt;height:3.6pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(-0.5pt,0.4pt) scale(0.8,0.8) ;"> <span class="ltx_p" id="S6.SS1.p5.3.3.1"><math alttext="\bullet" class="ltx_Math" display="inline" id="S6.SS1.p5.3.3.1.m1.1"><semantics id="S6.SS1.p5.3.3.1.m1.1a"><mo id="S6.SS1.p5.3.3.1.m1.1.1" xref="S6.SS1.p5.3.3.1.m1.1.1.cmml">∙</mo><annotation-xml encoding="MathML-Content" id="S6.SS1.p5.3.3.1.m1.1b"><ci id="S6.SS1.p5.3.3.1.m1.1.1.cmml" xref="S6.SS1.p5.3.3.1.m1.1.1">∙</ci></annotation-xml><annotation encoding="application/x-tex" id="S6.SS1.p5.3.3.1.m1.1c">\bullet</annotation><annotation encoding="application/x-llamapun" id="S6.SS1.p5.3.3.1.m1.1d">∙</annotation></semantics></math></span> </span></span> <span class="ltx_text ltx_font_bold" id="S6.SS1.p5.3.7">dLoRA <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite></span> dynamically switches between unmerged and merged mode based on workload and invokes PyTorch operator <span class="ltx_text ltx_font_typewriter" id="S6.SS1.p5.3.8">Einsum</span> to implement unmerged inference.</p> </div> </section> <section class="ltx_subsection" id="S6.SS2"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">6.2 End-to-End Performance</h3> <div class="ltx_para" id="S6.SS2.p1"> <p class="ltx_p" id="S6.SS2.p1.1">This section reports the E2E performance of V-LoRA on multiple LMMs and applications.</p> </div> <figure class="ltx_figure" id="S6.F19"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.1" style="width:101.9pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="668" id="S6.F19.1.g1" src="x19.png" width="832"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 16. </span>Latency comparison between original LMM head and video analytics head.</figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.2" style="width:101.9pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="660" id="S6.F19.2.g1" src="x20.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 17. </span>Latency comparison of different operators across different token batch sizes.</figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.3" style="width:104.1pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="646" id="S6.F19.3.g1" src="x21.png" width="790"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 18. 
</span><span class="ltx_text" id="S6.F19.3.2.1" style="color:#000000;">Performance of different operators at average, 90-tile, and 95-tile .</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.4" style="width:104.1pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="627" id="S6.F19.4.g1" src="x22.png" width="787"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 19. </span><span class="ltx_text" id="S6.F19.4.2.1" style="color:#000000;">Performance of different schedulers under different skewness.</span></figcaption> </figure> </div> </div> </figure> <div class="ltx_para ltx_noindent" id="S6.SS2.p2"> <p class="ltx_p" id="S6.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S6.SS2.p2.1.1">System performance.</span> V-LoRA achieves notably lower average token latency than dLoRA, Punica, and SLoRA regardless of vision applications and LMMs. The first row in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F14" title="Figure 14 ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">14</span></a> shows their performance for visual retrieval on three LMMs. Across three LMMs, V-LoRA reduces 72%, 50%, and 20% average token latency, compared to dLoRA, Punica, and S-LoRA, respectively. To Punica and S-LoRA, this acceleration is obvious. They only work in unmerge mode, which ignores the merge-friendly workload pattern, <em class="ltx_emph ltx_font_italic" id="S6.SS2.p2.1.2">e.g.,</em> 60% of requests asking for the same LoRA adapter. dLoRA, though, takes the preference of both inference modes into account, its high mode switching cost and inefficient <span class="ltx_text ltx_font_typewriter" id="S6.SS2.p2.1.3">Einsum</span> operator in unmerged inference incurs this 20% drop; conversely, V-LoRA’s ATMM operator enables swift switch and efficient unmerge and mixture mode (more experiments in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.SS3" title="6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">6.3</span></a>). Moreover, along with the increasing requests per second, the inflection points of most serving systems occur at 6, which can be eliminated if equipped with more GPUs (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.SS4" title="6.4 Stability and Scalability ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">6.4</span></a>).</p> </div> <div class="ltx_para" id="S6.SS2.p3"> <p class="ltx_p" id="S6.SS2.p3.1">V-LoRA delivers the best service on video analytics than other systems, as shown in the second row of Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F14" title="Figure 14 ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">14</span></a>. It makes 89%, 83%, and 71% of average token latency reduction than dLoRA, Punica, and S-LoRA, respectively. This benefit arises from the vision task head. 
It effectively eliminates the multiple rounds of autoregressive inference in LLMs (more experiments in §6.3.1).

Comparing the columns of Fig. 14, V-LoRA yields larger gains over the other serving systems on the video analytics application. This difference stems from the different distributions of input and output token lengths in the two vision applications. Unlike visual retrieval, video analytics typically has more input tokens and fewer output tokens, e.g., each video understanding request has 6$\times$256 input tokens and 5-10 output tokens, whereas a VQA request has 256 and 200+. Since input tokens are batched in the prefill stage, they cost far less time (<1 ms per token) than decode-stage output tokens (30-50 ms per token). Furthermore, the extra latency of unmerged inference grows with input length (see Fig. 19); thus, the longer-input video analytics workload benefits more from V-LoRA's mode switching.

Accuracy performance. V-LoRA achieves accuracy close to or surpassing the state-of-the-art accuracy of domain-specific small models across various tasks. Fig. 15 reports the results of Qwen-VL with LoRA adapters fine-tuned for different vision tasks. We train on an A100 80GB for five distinct tasks and test against the corresponding SOTA small models. For instance, V-LoRA achieves a 4.3-5% accuracy improvement on visual QA and image captioning. For tasks where small models typically excel, such as object detection and video understanding, V-LoRA's fine-tuned LoRA adapters improve Qwen-VL's accuracy by 24.5-62.2%, reaching competitive accuracy in these domains.
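As a rough, illustrative back-of-the-envelope check of the token-length argument above (using the ballpark per-token costs quoted there, not new measurements):

```python
# Illustrative estimate only: per-token costs are the ranges quoted above.
prefill_ms_per_token = 1.0     # prefill, upper bound of "<1 ms per token"
decode_ms_per_token = 40.0     # decode, middle of the 30-50 ms range

# Video understanding request: 6 x 256 input tokens, ~8 output tokens.
video_ms = 6 * 256 * prefill_ms_per_token + 8 * decode_ms_per_token    # ~1.9 s
# VQA request: ~256 input tokens, 200+ output tokens.
vqa_ms = 256 * prefill_ms_per_token + 200 * decode_ms_per_token        # ~8.3 s

print(video_ms, vqa_ms)  # prefill dominates video analytics; decode dominates VQA
```

Under these assumed numbers, video analytics latency is dominated by the prefill stage, which is why reducing prefill-side LoRA overhead through mode switching matters more there.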
</p> </div> </section> <section class="ltx_subsection" id="S6.SS3"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">6.3 Comprehensive Component-wise Analysis</h3> <div class="ltx_para" id="S6.SS3.p1"> <p class="ltx_p" id="S6.SS3.p1.1">We provide an in-depth performance analysis of individual system components. If not mentioned, all results are tested on the Qwen-VL model with 10 requests per second.</p> </div> <section class="ltx_subsubsection" id="S6.SS3.SSS1"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">6.3.1 Accuracy-aware LoRA Adapter Generation</h4> <div class="ltx_para" id="S6.SS3.SSS1.p1"> <p class="ltx_p" id="S6.SS3.SSS1.p1.1">helps V-LoRA achieve great throughput improvement while keeping high accuracy. The accuracy gain has been discussed above; we only analyze the throughput gain from the vision task head. Especially for video analytics tasks, the vision task head employed by V-LoRA significantly reduces the rounds of autoregressive decodes, thereby greatly enhancing system performance. As illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F19" title="Figure 19 ‣ 6.2 End-to-End Performance ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">19</span></a>, V-LoRA achieves a 41-63% reduction in latency compared to the original language modeling head. This gain is attributed to the video analytics head’s contribution to minimizing the prompt length and requiring only one inference round. In video understanding tasks, V-LoRA equipped with the video analytics head can match the accuracy of certain small models and handle 3-4 video streams in real time.</p> </div> </section> <section class="ltx_subsubsection" id="S6.SS3.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">6.3.2 Adaptive-tiling LoRA Adapters Batching</h4> <div class="ltx_para" id="S6.SS3.SSS2.p1"> <p class="ltx_p" id="S6.SS3.SSS2.p1.6">gives the most efficient and stable matrix multiplication by ATMM among all comparisons. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F19" title="Figure 19 ‣ 6.2 End-to-End Performance ‣ 6. 
6.3.2 Adaptive-tiling LoRA Adapters Batching

Adaptive-tiling LoRA adapter batching, via ATMM, delivers the most efficient and stable matrix multiplication among all compared operators. As shown in Fig. 19, testing over 100 rounds after 10 warm-ups on a large set of diverse inputs, ATMM achieves the lowest average latency across batch sizes, running 2.7×, 2.3×, and 3.4× faster than S-LoRA, Punica, and dLoRA, respectively. As for stability, the statistics plotted in Fig. 19 show that ATMM delivers the most robust performance, reducing latency fluctuation by 3×, 2×, and 2× compared to S-LoRA, Punica, and dLoRA. These benefits stem from the profile-based optimal tiling search performed in the offline phase. When the batch size exceeds 1024, for instance, ATMM adaptively switches to a larger tile shape and fully utilizes the hardware resources, while the other operators suffer from static tiling.
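The profile-then-lookup idea behind the offline tiling search can be sketched as follows. This is a simplified stand-in that profiles a blocked NumPy GEMM over candidate tile sizes per batch-size bucket; the real ATMM operator profiles GPU kernel tiling configurations offline and indexes the best one by input shape at runtime.

```python
import time
import numpy as np

CANDIDATE_TILES = [32, 64, 128, 256]      # candidate tile edge lengths (illustrative)
BATCH_BUCKETS   = [16, 256, 1024, 4096]   # representative batch sizes to profile

def tiled_matmul(a, b, tile):
    """Blocked GEMM: a CPU stand-in for a tiled GPU kernel."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

def profile_best_tiles(hidden=512, rank=64):
    """Offline phase: pick the fastest tile per batch-size bucket."""
    rng = np.random.default_rng(0)
    best = {}
    for bs in BATCH_BUCKETS:
        x = rng.standard_normal((bs, hidden), dtype=np.float32)
        w = rng.standard_normal((hidden, rank), dtype=np.float32)
        timings = {}
        for tile in CANDIDATE_TILES:
            t0 = time.perf_counter()
            tiled_matmul(x, w, tile)
            timings[tile] = time.perf_counter() - t0
        best[bs] = min(timings, key=timings.get)
    return best

def pick_tile(table, batch_size):
    """Online phase: look up the profiled tile for the nearest bucket."""
    bucket = min(table, key=lambda b: abs(b - batch_size))
    return table[bucket]

table = profile_best_tiles()
print(table, "-> batch 2000 uses tile", pick_tile(table, 2000))
```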
</span><span class="ltx_text" id="S6.F23.2.2.1" style="color:#000000;">Benefits from swift mode switch.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_many"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F23.3" style="width:86.7pt;"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="663" id="S6.F23.3.g1" src="x25.png" width="831"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 22. </span>Impact of the request skewness. </figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_many"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F23.4" style="width:86.7pt;"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="679" id="S6.F23.4.g1" src="x26.png" width="832"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 23. </span>Impact of the adapter number. </figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_many"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F23.fig1" style="width:65.0pt;"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <table class="ltx_tabular ltx_centering ltx_figure_panel ltx_align_middle" id="S6.F23.fig1.1"> <tr class="ltx_tr" id="S6.F23.fig1.1.1"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_tt" id="S6.F23.fig1.1.1.1"> <span class="ltx_text" id="S6.F23.fig1.1.1.1.1"></span><span class="ltx_text" id="S6.F23.fig1.1.1.1.2" style="font-size:80%;"> </span><span class="ltx_text" id="S6.F23.fig1.1.1.1.3" style="font-size:80%;"> <span class="ltx_tabular ltx_align_middle" id="S6.F23.fig1.1.1.1.3.1"> <span class="ltx_tr" id="S6.F23.fig1.1.1.1.3.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.1.3.1.1.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.1.3.1.1.1.1">GPU</span></span></span> <span class="ltx_tr" id="S6.F23.fig1.1.1.1.3.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.1.3.1.2.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.1.3.1.2.1.1">Num.</span></span></span> </span></span><span class="ltx_text" id="S6.F23.fig1.1.1.1.4"></span><span class="ltx_text" id="S6.F23.fig1.1.1.1.5" style="font-size:80%;"></span> </td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S6.F23.fig1.1.1.2"> <span class="ltx_text" id="S6.F23.fig1.1.1.2.1"></span><span class="ltx_text" id="S6.F23.fig1.1.1.2.2" style="font-size:80%;"> </span><span class="ltx_text" id="S6.F23.fig1.1.1.2.3" style="font-size:80%;"> <span class="ltx_tabular ltx_align_middle" id="S6.F23.fig1.1.1.2.3.1"> <span class="ltx_tr" id="S6.F23.fig1.1.1.2.3.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.2.3.1.1.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.2.3.1.1.1.1">TPT</span></span></span> <span class="ltx_tr" id="S6.F23.fig1.1.1.2.3.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.2.3.1.2.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.2.3.1.2.1.1">(req/s)</span></span></span> </span></span><span class="ltx_text" id="S6.F23.fig1.1.1.2.4"></span><span class="ltx_text" id="S6.F23.fig1.1.1.2.5" style="font-size:80%;"></span> </td> </tr> <tr class="ltx_tr" id="S6.F23.fig1.1.2"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S6.F23.fig1.1.2.1"><span 
class="ltx_text" id="S6.F23.fig1.1.2.1.1" style="font-size:80%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S6.F23.fig1.1.2.2"><span class="ltx_text" id="S6.F23.fig1.1.2.2.1" style="font-size:80%;">6.07</span></td> </tr> <tr class="ltx_tr" id="S6.F23.fig1.1.3"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S6.F23.fig1.1.3.1"><span class="ltx_text" id="S6.F23.fig1.1.3.1.1" style="font-size:80%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S6.F23.fig1.1.3.2"><span class="ltx_text" id="S6.F23.fig1.1.3.2.1" style="font-size:80%;">11.48</span></td> </tr> <tr class="ltx_tr" id="S6.F23.fig1.1.4"> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t" id="S6.F23.fig1.1.4.1"><span class="ltx_text" id="S6.F23.fig1.1.4.1.1" style="font-size:80%;">4</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S6.F23.fig1.1.4.2"><span class="ltx_text" id="S6.F23.fig1.1.4.2.1" style="font-size:80%;">23.97</span></td> </tr> </table> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_table ltx_figure_panel ltx_align_center" id="S6.T3"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 3. </span>Scales to multiple GPUs.</figcaption> </figure> </div> </div> </figure> </div> </div> </figure> <div class="ltx_para" id="S6.SS3.SSS2.p2"> <p class="ltx_p" id="S6.SS3.SSS2.p2.2">At the decode stage with small input matrix shapes, the left part in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F19" title="Figure 19 ‣ 6.2 End-to-End Performance ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">19</span></a>, ATMM maintains high efficiency by adapting smaller tile shapes. It delivers comparable latency to S-LoRA, outperforming dLoRA and Punica by 4.5<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS3.SSS2.p2.1.m1.1"><semantics id="S6.SS3.SSS2.p2.1.m1.1a"><mo id="S6.SS3.SSS2.p2.1.m1.1.1" xref="S6.SS3.SSS2.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS3.SSS2.p2.1.m1.1b"><times id="S6.SS3.SSS2.p2.1.m1.1.1.cmml" xref="S6.SS3.SSS2.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS3.SSS2.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS3.SSS2.p2.1.m1.1d">×</annotation></semantics></math> and 2.6<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS3.SSS2.p2.2.m2.1"><semantics id="S6.SS3.SSS2.p2.2.m2.1a"><mo id="S6.SS3.SSS2.p2.2.m2.1.1" xref="S6.SS3.SSS2.p2.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS3.SSS2.p2.2.m2.1b"><times id="S6.SS3.SSS2.p2.2.m2.1.1.cmml" xref="S6.SS3.SSS2.p2.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS3.SSS2.p2.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS3.SSS2.p2.2.m2.1d">×</annotation></semantics></math>. dLoRA suffers from the large context-switching overhead of repeated kernel calls by <span class="ltx_text ltx_font_typewriter" id="S6.SS3.SSS2.p2.2.1">Einsum</span>, while Punica results in a low core utilization due to its mismatched tile shapes. 
6.3.3 Flexible LoRA Adapters Orchestration

Flexible LoRA adapter orchestration dynamically selects and switches inference modes with our swift switcher and the deLoRA operation, offering the best service among all baselines. As shown in Fig. 19, V-LoRA outperforms merge-only, unmerge-only, and dLoRA by 33%, 59%, and 21% in latency under different skewness levels, respectively; the skewness denotes the proportion of requests asking for the most popular LoRA adapter. Merge-only can batch only requests that invoke the same LoRA adapter, leading to underutilized resources and small batch sizes, while unmerge-only introduces significant extra computation. dLoRA shows benefits only under highly skewed workloads because of the poor performance of its unmerged-inference operator, Einsum. Our scheduling policy performs best because it fully exploits the low-latency merged mode and introduces the mixture mode to eliminate some mode switches.
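A minimal sketch of the kind of decision the orchestrator makes is shown below; it is our simplification with illustrative thresholds, not the actual V-LoRA policy. Given the adapter distribution of the pending batch and an estimated switch cost, it chooses between merged, unmerged, and mixture mode.

```python
from collections import Counter

def choose_mode(adapter_ids,
                merge_share_threshold=0.5,   # illustrative threshold, not V-LoRA's value
                switch_cost_ms=5.0,          # cost of folding/unfolding adapters into weights
                est_batch_time_ms=50.0):
    """Pick an inference mode for the next batch (illustrative heuristic).

    merged  : every request uses one adapter, which is folded into the base weights
    mixture : the dominant adapter stays merged; the tail is served with unmerged deltas
    unmerged: no adapter dominates, so apply per-request low-rank deltas
    """
    counts = Counter(adapter_ids)
    dominant, hits = counts.most_common(1)[0]
    share = hits / len(adapter_ids)
    if share == 1.0:
        return f"merged({dominant})"
    if share >= merge_share_threshold and switch_cost_ms < est_batch_time_ms:
        # Keep the popular adapter merged and serve the rest without a full mode switch.
        return f"mixture(merged={dominant})"
    return "unmerged"

print(choose_mode(["traffic"] * 8))                        # merged(traffic)
print(choose_mode(["traffic"] * 6 + ["retail", "drone"]))  # mixture(merged=traffic)
print(choose_mode(["a", "b", "c", "d"]))                   # unmerged
```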
<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F23" title="Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">23</span></a> case that infers with two LoRA adapters.</p> </div> </section> </section> <section class="ltx_subsection" id="S6.SS4"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">6.4 Stability and Scalability</h3> <div class="ltx_para" id="S6.SS4.p1"> <p class="ltx_p" id="S6.SS4.p1.1">V-LoRA demonstrates great stability and scalability.</p> </div> <div class="ltx_para ltx_noindent" id="S6.SS4.p2"> <p class="ltx_p" id="S6.SS4.p2.1"><span class="ltx_text ltx_font_bold" id="S6.SS4.p2.1.1">Impacts of different skewness of requests.</span> V-LoRA achieves the best average token latency compared to other systems under diverse skewness. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F23" title="Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">23</span></a> shows that V-LoRA achieves a reduction in average token latency by 76-81%, 72-83%, and 63-76% compared to dLoRA, Punica, and S-LoRA under four different skewness conditions. This superiority arises from the V-LoRA’s timely mode switch and proper requests and adapters orchestration. With the swift switcher and mixture mode, it responds to workload changes fast.</p> </div> <div class="ltx_para ltx_noindent" id="S6.SS4.p3"> <p class="ltx_p" id="S6.SS4.p3.1"><span class="ltx_text ltx_font_bold" id="S6.SS4.p3.1.1">Impacts of different number of LoRA adapters.</span> V-LoRA maintains the best and most stable performance when the number of LoRA adapters increases. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F23" title="Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">23</span></a>, it suffers the minimal impact, which benefits from V-LoRA’s efficient memory management. V-LoRA’s pre-allocated contiguous memory reduces unnecessary memory copy or movement and memory fragmentation. In addition, when the number increases to need LoRA to swap, V-LoRA’s strategy swaps adapter and computes matrix at runtime, and the asynchronous swap contributes to the low latency. With the highly optimized ATMM kernel, the LoRA matrix swapping keeps high stability compared to the matrix multiplication via batched GEMM in dLoRA. </p> </div> <div class="ltx_para ltx_noindent" id="S6.SS4.p4"> <p class="ltx_p" id="S6.SS4.p4.1"><span class="ltx_text ltx_font_bold" id="S6.SS4.p4.1.1">Scales to multiple GPUs.</span> V-LoRA demonstrates excellent scalability and can significantly enhance the overall system throughput by multiple GPUs. As shown in Tab. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.T3" title="Table 3 ‣ Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. 
Scales to multiple GPUs. V-LoRA scales well and can significantly increase overall system throughput with multiple GPUs. As shown in Tab. 3, on servers equipped with 1, 2, and 4 A100 GPUs, the total system throughput reaches 6.07, 11.48, and 23.97 requests per second, respectively. In future work, multi-GPU performance can be further improved by incorporating inter-GPU scheduling as in dLoRA [81] and by supporting larger LMMs such as InternVL2-76B [35].

7. Related Works

Large model serving systems have recently leveraged system-level optimizations to improve LLM inference efficiency. With paged memory management, vLLM [50] proposes the PagedAttention operator, which cooperates with a block-based KV cache to minimize GPU memory fragmentation. On the batching side, Orca [87] introduces iteration-level scheduling to continuously batch requests of varying lengths, and DeepSpeed-FastGen [39] further improves on it with a dynamic split-fuse strategy. With a distributed architecture, FlexGen [70] employs offloading to raise LLM serving throughput, while Mooncake [64] features a KVCache-centric architecture that separates the prefill and decoding clusters. Exploiting LLM inference characteristics, SpecInfer [62] utilizes speculative decoding to reduce latency, while SARATHI [18] schedules requests by piggybacking decodes onto chunked prefills.
S-LoRA [69], Punica [25], and dLoRA [81] are the only three systems that also serve multiple LoRA LLMs by batching requests destined for different adapters; their shortcomings are analyzed in depth in this paper. An earlier study, PetS [97], also considers serving multiple parameter-efficient DNN models, but it does not target autoregressive LLMs or the unique system characteristics of LoRA adapters. Compared to them, V-LoRA provides a more efficient serving runtime and, as an end-to-end system, also covers LoRA adapter generation.

Parameter-efficient fine-tuning (PEFT) [40, 54] adapts large pre-trained models to specific tasks or domains. By adjusting only a few layers or parameters, PEFT retains the core pre-trained knowledge while efficiently learning task-specific nuances [58]. As a proof of concept, V-LoRA's LoRA adapter generation adopts a heuristic algorithm.
It can easily be replaced by more advanced PEFT techniques [94]; we leave this to future work.

Retrieval-augmented generation (RAG) [22, 52, 44] enhances LLMs by incorporating relevant knowledge from external databases, achieving performance comparable to fine-tuned LLMs [24]. Some works [66, 77] suggest iterative retrieval throughout generation for higher quality. V-LoRA's system performance greatly exceeds that of RAG because of RAG's costly vector search [27] and long-context prompts.

8. Conclusion

In this paper, we explore the use of LMMs as foundation models for vision applications to achieve both high serving efficiency and a user-friendly natural-language interface. To this end, we propose V-LoRA, an end-to-end system that adapts LMMs to domain-specific visual tasks with LoRA adapters and efficiently manages them at runtime to enrich vision applications. Across two typical vision applications, we show that V-LoRA enables a single LMM to be used effectively, achieving superior performance and generalization on multiple visual tasks. While V-LoRA is by no means the final answer, we hope it serves as a stepping stone toward a poly-basic design for future vision applications and demonstrates the potential of adapting LMMs to visual tasks.

References

[1] LightLLM. https://github.com/ModelTC/lightllm.
[2] NVIDIA A100. https://www.nvidia.com/en-us/data-center/a100/.
[3] PyTorch torch.einsum. https://pytorch.org/docs/stable/generated/torch.einsum.html.
[4] rpyc. https://github.com/tomerfiliba-org/rpyc.
[5] setuptools. https://github.com/pypa/setuptools.
[6] PyTorch: official website of the Python platform. https://pytorch.org/, 2021.
[7] 12 practical large language model (LLM) applications. Techopedia. https://www.techopedia.com/12-practical-large-language-model-llm-applications (accessed 09/18/2024).
[8] 7 top large language model use cases and applications. https://www.projectpro.io/article/large-language-model-use-cases-and-applications/887 (accessed 09/18/2024).
[9] Airbus Aircraft dataset. https://www.kaggle.com/datasets/airbusgeo/airbus-aircrafts-sample-dataset/data (accessed 09/18/2024).
[10] Applications of large language models. InData Labs. https://indatalabs.com/blog/large-language-model-apps (accessed 09/18/2024).
</span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> Claude3.5-Sonnet. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.anthropic.com/news/claude-3-5-sonnet" title="">https://www.anthropic.com/news/claude-3-5-sonnet</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> OpenAI GPT-4o. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openai.com/index/hello-gpt-4o/" title="">https://openai.com/index/hello-gpt-4o/</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> Real-world use cases for large language models (llms). </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://cellstrat.medium.com/real-world-use-cases-for-large-language-models-llms-d71c3a577bf2" title="">https://cellstrat.medium.com/real-world-use-cases-for-large-language-models-llms-d71c3a577bf2</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> Azure LLM inference trace 2023. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/Azure/AzurePublicDataset/blob/master/AzureLLMInferenceDataset2023.md" title="">https://github.com/Azure/AzurePublicDataset/blob/master/AzureLLMInferenceDataset2023.md</a>, (Accessed on 10/07/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> NVIDIA CUTLASS 3.5.1. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/NVIDIA/cutlass" title="">https://github.com/NVIDIA/cutlass</a>, (Accessed on 10/07/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> NVIDIA Tensor Core. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.nvidia.cn/data-center/tensor-cores/" title="">https://www.nvidia.cn/data-center/tensor-cores/</a>, (Accessed on 10/07/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. </span> <span class="ltx_bibblock">Gpt-4 technical report. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib17.1.1">arXiv preprint arXiv:2303.08774</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. </span> <span class="ltx_bibblock">Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib18.1.1">arXiv preprint arXiv:2308.16369</span>, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"> Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. </span> <span class="ltx_bibblock">Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. </span> <span class="ltx_bibblock">2023. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"> Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. </span> <span class="ltx_bibblock">Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib20.1.1">arXiv preprint arXiv:2308.12966</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"> Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. </span> <span class="ltx_bibblock">PipeSwitch: Fast pipelined context switching for deep learning applications. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib21.1.1">USENIX Symposium on Operating Systems Design and Implementation (OSDI)</span>, pages 499–514, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"> Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. </span> <span class="ltx_bibblock">Improving language models by retrieving from trillions of tokens. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib22.1.1">International conference on machine learning (ICML)</span>, pages 2206–2240, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"> Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. </span> <span class="ltx_bibblock">One-for-all: Generalized lora for parameter-efficient fine-tuning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib23.1.1">arXiv preprint arXiv:2306.07967</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"> Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. </span> <span class="ltx_bibblock">Benchmarking large language models in retrieval-augmented generation. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib24.1.1">Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"> Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. </span> <span class="ltx_bibblock">Punica: Multi-tenant lora serving. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib25.1.1">Proceedings of Machine Learning and Systems (MLSys)</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"> Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. </span> <span class="ltx_bibblock">Sharegpt4v: Improving large multi-modal models with better captions, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"> Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. </span> <span class="ltx_bibblock">Spann: Highly-efficient billion-scale approximate nearest neighborhood search. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib27.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"> Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. </span> <span class="ltx_bibblock">Palm: Scaling language modeling with pathways. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib28.1.1">Journal of Machine Learning Research</span>, 24(240):1–113, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"> Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. </span> <span class="ltx_bibblock">The cityscapes dataset for semantic urban scene understanding. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib29.1.1">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</span>, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"> Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. </span> <span class="ltx_bibblock">Qlora: Efficient finetuning of quantized llms. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib30.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"> Kuntai Du, Ahsan Pervaiz, Xin Yuan, Aakanksha Chowdhery, Qizheng Zhang, Henry Hoffmann, and Junchen Jiang. </span> <span class="ltx_bibblock">Server-Driven Video Streaming for Deep Learning Inference. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib31.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 557–570. ACM, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"> Albert Einstein. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib32.1.1">The General Theory of Relativity</span>, pages 54–75. </span> <span class="ltx_bibblock">Springer Netherlands, Dordrecht, 1922. 
</span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"> Glenn Jocher et. al. </span> <span class="ltx_bibblock">ultralytics/yolov5: v6.0 - YOLOv5n ’Nano’ models, Roboflow integration, TensorFlow export, OpenCV DNN support, October 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"> Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, and Hao Wang. </span> <span class="ltx_bibblock">Mixture-of-loras: An efficient multitask tuning for large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib34.1.1">arXiv preprint arXiv:2403.03432</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"> Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, and Wenhai Wang. </span> <span class="ltx_bibblock">Mini-internvl: A flexible-transfer pocket multimodal model with 5 </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib35.1.1">arXiv preprint arXiv:2410.16261</span>. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"> Michael R Garey and David S Johnson. </span> <span class="ltx_bibblock">Approximation algorithms for bin packing problems: A survey. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib36.1.1">Analysis and design of algorithms in combinatorial optimization</span>, pages 147–172. Springer, 1981. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"> Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. </span> <span class="ltx_bibblock">Making the v in vqa matter: Elevating the role of image understanding in visual question answering. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib37.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</span>, pages 6904–6913, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[38]</span> <span class="ltx_bibblock"> Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. </span> <span class="ltx_bibblock">Making the v in vqa matter: Elevating the role of image understanding in visual question answering. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib38.1.1">IEEE conference on computer vision and pattern recognition (CVPR)</span>, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[39]</span> <span class="ltx_bibblock"> Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. </span> <span class="ltx_bibblock">Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib39.1.1">arXiv preprint arXiv:2401.08671</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[40]</span> <span class="ltx_bibblock"> Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. </span> <span class="ltx_bibblock">Lora: Low-rank adaptation of large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib40.1.1">arXiv preprint arXiv:2106.09685</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[41]</span> <span class="ltx_bibblock"> Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. </span> <span class="ltx_bibblock">pybind11 — seamless operability between c++11 and python, 2016. </span> <span class="ltx_bibblock">https://github.com/pybind/pybind11. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[42]</span> <span class="ltx_bibblock"> Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. </span> <span class="ltx_bibblock">Mistral 7b. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib42.1.1">arXiv preprint arXiv:2310.06825</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[43]</span> <span class="ltx_bibblock"> Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. </span> <span class="ltx_bibblock">Chameleon: scalable adaptation of video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib43.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 253–266, August 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[44]</span> <span class="ltx_bibblock"> Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. </span> <span class="ltx_bibblock">Ragcache: Efficient knowledge caching for retrieval-augmented generation. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib44.1.1">arXiv preprint arXiv:2404.12457</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[45]</span> <span class="ltx_bibblock"> Daniel Kang, Peter Bailis, and Matei Zaharia. </span> <span class="ltx_bibblock">Blazeit: Optimizing declarative aggregation and limit queries for neural network-based video analytics. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib45.1.1">VLDB Endowment</span>, 13(4), 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[46]</span> <span class="ltx_bibblock"> Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. </span> <span class="ltx_bibblock">NoScope: optimizing neural network queries over video at scale. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib46.1.1">Proceedings of the VLDB Endowment</span>, 10(11):1586–1597, August 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[47]</span> <span class="ltx_bibblock"> Daniel Kang, Francisco Romero, Peter D Bailis, Christos Kozyrakis, and Matei Zaharia. 
</span> <span class="ltx_bibblock">Viva: An end-to-end system for interactive video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib47.1.1">CIDR</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[48]</span> <span class="ltx_bibblock"> Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. </span> <span class="ltx_bibblock">Referitgame: Referring to objects in photographs of natural scenes. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib48.1.1">ACL conference on empirical methods in natural language processing (EMNLP)</span>, pages 787–798, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[49]</span> <span class="ltx_bibblock"> Mehrdad Khani, Ganesh Ananthanarayanan, Kevin Hsieh, Junchen Jiang, Ravi Netravali, Yuanchao Shu, Mohammad Alizadeh, and Victor Bahl. </span> <span class="ltx_bibblock">RECL: Responsive resource-efficient continuous learning for video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib49.1.1">USENIX Symposium on Networked Systems Design and Implementation (NSDI)</span>, pages 917–932, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[50]</span> <span class="ltx_bibblock"> Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. </span> <span class="ltx_bibblock">Efficient memory management for large language model serving with pagedattention. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib50.1.1">USENIX Symposium on Operating Systems Principles (OSDI)</span>, pages 611–626, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[51]</span> <span class="ltx_bibblock"> Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. </span> <span class="ltx_bibblock">Efficient memory management for large language model serving with pagedattention. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib51.1.1">Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[52]</span> <span class="ltx_bibblock"> Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. </span> <span class="ltx_bibblock">Retrieval-augmented generation for knowledge-intensive nlp tasks. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib52.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[53]</span> <span class="ltx_bibblock"> Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. </span> <span class="ltx_bibblock">Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. 
</span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib53.1.1">International conference on machine learning (ICML)</span>, pages 19730–19742. PMLR, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[54]</span> <span class="ltx_bibblock"> Xiang Lisa Li and Percy Liang. </span> <span class="ltx_bibblock">Prefix-tuning: Optimizing continuous prompts for generation. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib54.1.1">Proceedings of Annual Meeting of the Association for Computational Linguistics and the Joint Conference on Natural Language Processing</span>, pages 4582–4597, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[55]</span> <span class="ltx_bibblock"> Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. </span> <span class="ltx_bibblock">Oscar: Object-semantics aligned pre-training for vision-language tasks. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib55.1.1">Springer European Conference Computer Vision (ECCV)</span>, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[56]</span> <span class="ltx_bibblock"> Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. </span> <span class="ltx_bibblock">Personal llm agents: Insights and survey about the capability, efficiency and security. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib56.1.1">arXiv preprint arXiv:2401.05459</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[57]</span> <span class="ltx_bibblock"> Yuanqi Li, Arthi Padmanabhan, Pengzhan Zhao, Yufei Wang, Guoqing Harry Xu, and Ravi Netravali. </span> <span class="ltx_bibblock">Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib57.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 359–376, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[58]</span> <span class="ltx_bibblock"> Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. </span> <span class="ltx_bibblock">Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib58.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 35:1950–1965, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[59]</span> <span class="ltx_bibblock"> Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. </span> <span class="ltx_bibblock">Improved baselines with visual instruction tuning, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[60]</span> <span class="ltx_bibblock"> Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. </span> <span class="ltx_bibblock">Visual instruction tuning. 
</span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib60.1.1">Conference and Workshop on Neural Information Processing Systems (NeurIPS)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[61]</span> <span class="ltx_bibblock"> Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. </span> <span class="ltx_bibblock">Peft: State-of-the-art parameter-efficient fine-tuning methods. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/huggingface/peft" title="">https://github.com/huggingface/peft</a>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[62]</span> <span class="ltx_bibblock"> Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. </span> <span class="ltx_bibblock">Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib62.1.1">ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)</span>, pages 932–949, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[63]</span> <span class="ltx_bibblock"> Thomas Politzer. </span> <span class="ltx_bibblock">Vision is our dominant sense. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.brainline.org/article/vision-our-dominant-sense" title="">https://www.brainline.org/article/vision-our-dominant-sense</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[64]</span> <span class="ltx_bibblock"> Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. </span> <span class="ltx_bibblock">Mooncake: A kvcache-centric disaggregated architecture for llm serving. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib64.1.1">arXiv preprint arxiv:2407.00079</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[65]</span> <span class="ltx_bibblock"> Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. </span> <span class="ltx_bibblock">Learning transferable visual models from natural language supervision. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib65.1.1">International conference on machine learning</span>, pages 8748–8763. PMLR, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib66"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[66]</span> <span class="ltx_bibblock"> Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. </span> <span class="ltx_bibblock">In-context retrieval-augmented language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib66.1.1">Transactions of the Association for Computational Linguistics</span>, 11:1316–1331, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib67"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[67]</span> <span class="ltx_bibblock"> Bhardwaj Romil, Xia Zhengxu, Ananthanarayanan Ganesh, Jiang Junchen, Shu Yuanchao, Karianakis Nikolaos, Hsieh Kevin, Bahl Paramvir, and Stoica Ion. </span> <span class="ltx_bibblock">Ekya: Continuous learning of video analytics models on edge compute servers. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib67.1.1">USENIX Symposium on Networked Systems Design and Implementation (NSDI)</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib68"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[68]</span> <span class="ltx_bibblock"> Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. </span> <span class="ltx_bibblock">Nexus: A gpu cluster engine for accelerating dnn-based video analysis. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib68.1.1">ACM Symposium on Operating Systems Principles (SOSP)</span>, pages 322–337, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib69"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[69]</span> <span class="ltx_bibblock"> Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. </span> <span class="ltx_bibblock">S-lora: Serving thousands of concurrent lora adapters. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib69.1.1">Proceedings of Machine Learning and Systems (MLSys)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib70"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[70]</span> <span class="ltx_bibblock"> Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. </span> <span class="ltx_bibblock">Flexgen: High-throughput generative inference of large language models with a single gpu. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib70.1.1">International Conference on Machine Learning (ICML)</span>, pages 31094–31116, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib71"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[71]</span> <span class="ltx_bibblock"> Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. </span> <span class="ltx_bibblock">Ucf101: A dataset of 101 human actions classes from videos in the wild. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib71.1.1">arXiv preprint arXiv:1212.0402</span>, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib72"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[72]</span> <span class="ltx_bibblock"> Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. </span> <span class="ltx_bibblock">Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib73"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[73]</span> <span class="ltx_bibblock"> Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. </span> <span class="ltx_bibblock">Gemini: a family of highly capable multimodal models. 
</span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib73.1.1">arXiv preprint arXiv:2312.11805</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib74"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[74]</span> <span class="ltx_bibblock"> Philippe Tillet, H. T. Kung, and David Cox. </span> <span class="ltx_bibblock">Triton: an intermediate language and compiler for tiled neural network computations. </span> <span class="ltx_bibblock">MAPL 2019, page 10–19, New York, NY, USA, 2019. Association for Computing Machinery. </span> </li> <li class="ltx_bibitem" id="bib.bib75"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[75]</span> <span class="ltx_bibblock"> Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. </span> <span class="ltx_bibblock">Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib76"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[76]</span> <span class="ltx_bibblock"> Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. </span> <span class="ltx_bibblock">Llama 2: Open foundation and fine-tuned chat models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib76.1.1">arXiv preprint arXiv:2307.09288</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib77"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[77]</span> <span class="ltx_bibblock"> Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. </span> <span class="ltx_bibblock">Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib77.1.1">Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib78"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[78]</span> <span class="ltx_bibblock"> Weijun Wang, Liang Mi, Shaowei Cen, Haipeng Dai, Yuanchun Li, Xiaoming Fu, and Yunxin Liu. </span> <span class="ltx_bibblock">Region-based content enhancement for efficient video analytics at the edge. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib78.1.1">arXiv preprint arXiv:2407.16990</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib79"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[79]</span> <span class="ltx_bibblock"> Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. </span> <span class="ltx_bibblock">Autodroid: Llm-powered task automation in android. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib79.1.1">Proceedings of the ACM International Conference on Mobile Computing and Networking (MobiCom)</span>, pages 543–557, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib80"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[80]</span> <span class="ltx_bibblock"> Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 
</span> <span class="ltx_bibblock">Huggingface’s transformers: State-of-the-art natural language processing, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib81"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[81]</span> <span class="ltx_bibblock"> Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. </span> <span class="ltx_bibblock">dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib81.1.1">USENIX Symposium on Operating Systems Design and Implementation (OSDI)</span>, pages 911–927, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib82"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[82]</span> <span class="ltx_bibblock"> Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. </span> <span class="ltx_bibblock">Aid: A benchmark data set for performance evaluation of aerial scene classification. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib82.1.1">IEEE Transactions on Geoscience and Remote Sensing</span>, 55(7):3965–3981, July 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib83"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[83]</span> <span class="ltx_bibblock"> Zhujun Xiao, Zhengxu Xia, Haitao Zheng, Ben Y Zhao, and Junchen Jiang. </span> <span class="ltx_bibblock">Towards performance clarity of edge video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib83.1.1">IEEE/ACM Symposium on Edge Computing (SEC)</span>, pages 148–164, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib84"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[84]</span> <span class="ltx_bibblock"> Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. </span> <span class="ltx_bibblock">Universal instance perception as object discovery and retrieval. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib84.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</span>, pages 15325–15336, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib85"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[85]</span> <span class="ltx_bibblock"> Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. </span> <span class="ltx_bibblock">Cacheblend: Fast large language model serving with cached knowledge fusion. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib85.1.1">arXiv preprint arXiv:2405.16444</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib86"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[86]</span> <span class="ltx_bibblock"> Juheon Yi, Sunghyun Choi, and Youngki Lee. </span> <span class="ltx_bibblock">EagleEye: wearable camera-based person identification in crowded urban spaces. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib86.1.1">ACM International Conference on Mobile Computing and Networking (MobiCom)</span>, pages 1–14, April 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib87"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[87]</span> <span class="ltx_bibblock"> Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 
</span> <span class="ltx_bibblock">Orca: A distributed serving system for <math alttext="\{" class="ltx_Math" display="inline" id="bib.bib87.1.m1.1"><semantics id="bib.bib87.1.m1.1a"><mo id="bib.bib87.1.m1.1.1" stretchy="false" xref="bib.bib87.1.m1.1.1.cmml">{</mo><annotation-xml encoding="MathML-Content" id="bib.bib87.1.m1.1b"><ci id="bib.bib87.1.m1.1.1.cmml" xref="bib.bib87.1.m1.1.1">{</ci></annotation-xml><annotation encoding="application/x-tex" id="bib.bib87.1.m1.1c">\{</annotation><annotation encoding="application/x-llamapun" id="bib.bib87.1.m1.1d">{</annotation></semantics></math>Transformer-Based<math alttext="\}" class="ltx_Math" display="inline" id="bib.bib87.2.m2.1"><semantics id="bib.bib87.2.m2.1a"><mo id="bib.bib87.2.m2.1.1" stretchy="false" xref="bib.bib87.2.m2.1.1.cmml">}</mo><annotation-xml encoding="MathML-Content" id="bib.bib87.2.m2.1b"><ci id="bib.bib87.2.m2.1.1.cmml" xref="bib.bib87.2.m2.1.1">}</ci></annotation-xml><annotation encoding="application/x-tex" id="bib.bib87.2.m2.1c">\}</annotation><annotation encoding="application/x-llamapun" id="bib.bib87.2.m2.1d">}</annotation></semantics></math> generative models. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib87.3.1">USENIX Symposium on Operating Systems Design and Implementation (OSDI)</span>, pages 521–538, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib88"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[88]</span> <span class="ltx_bibblock"> Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. </span> <span class="ltx_bibblock">Modeling context in referring expressions. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib88.1.1">European Conference on Computer Vision (ECCV)</span>, pages 69–85. Springer, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib89"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[89]</span> <span class="ltx_bibblock"> Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, and Harry Xu. </span> <span class="ltx_bibblock">Vqpy: An object-oriented approach to modern video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib89.1.1">Machine Learning and Systems (MLSys)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib90"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[90]</span> <span class="ltx_bibblock"> Mu Yuan, Lan Zhang, Xuanke You, and Xiang-Yang Li. </span> <span class="ltx_bibblock">Packetgame: Multi-stream packet gating for concurrent video inference at scale. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib90.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib91"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[91]</span> <span class="ltx_bibblock"> Tingting Yuan, Liang Mi, Weijun Wang, Haipeng Dai, and Xiaoming Fu. </span> <span class="ltx_bibblock">Accdecoder: Accelerated decoding for neural-enhanced video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib91.1.1">IEEE Conference on Computer Communications (INFOCOM)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib92"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[92]</span> <span class="ltx_bibblock"> Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, and Jinhyung Kim. 
</span> <span class="ltx_bibblock">Videomix: Rethinking data augmentation for video classification, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib93"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[93]</span> <span class="ltx_bibblock"> Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward A. Lee. </span> <span class="ltx_bibblock">AWStream: adaptive wide-area streaming analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib93.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 236–252. ACM, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib94"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[94]</span> <span class="ltx_bibblock"> Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. </span> <span class="ltx_bibblock">Adaptive budget allocation for parameter-efficient fine-tuning. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib94.1.1">International Conference on Learning Representations (ICLR)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib95"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[95]</span> <span class="ltx_bibblock"> Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. </span> <span class="ltx_bibblock">Siren’s song in the ai ocean: a survey on hallucination in large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib95.1.1">arXiv preprint arXiv:2309.01219</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib96"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[96]</span> <span class="ltx_bibblock"> Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. </span> <span class="ltx_bibblock">Efficiently programming large language models using sglang. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib96.1.1">arXiv preprint arXiv:2312.07104</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib97"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[97]</span> <span class="ltx_bibblock"> Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. </span> <span class="ltx_bibblock">Pet: A unified framework for Parameter-Efficient transformers serving. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib97.1.1">USENIX Annual Technical Conference (ATC)</span>, pages 489–504, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib98"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[98]</span> <span class="ltx_bibblock"> Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. </span> <span class="ltx_bibblock">Minigpt-4: Enhancing vision-language understanding with advanced large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib98.1.1">arXiv preprint arXiv:2304.10592</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib99"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[99]</span> <span class="ltx_bibblock"> Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. </span> <span class="ltx_bibblock">Vision mamba: Efficient visual representation learning with bidirectional state space model, 2024. 
Appendix A Adaptive-tiling Matrix Multiplication

As shown in Fig. 24, ATMM adaptively divides the input matrices into thread-block tiles, warp tiles, and thread tiles, in decreasing order of granularity, guided by the hash table; it then transfers the data and performs the computation by invoking the corresponding executable kernel.

Figure 24. Procedure of ATMM. The hash table and executable kernels are prepared in §4.3.2. ⟨in_x, in_y, in_z⟩, ⟨tb_a, tb_b, tb_c⟩, and ⟨w_d, w_e, w_f⟩ denote the shapes of the input matrices, the thread-block tile, and the warp tile, respectively.
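To make the lookup-and-dispatch flow above concrete, the following is a minimal Python sketch of the ATMM dispatch path. The names (TileConfig, KERNEL_TABLE, atmm) and the table layout are assumptions for illustration; V-LoRA's actual implementation launches pre-compiled CUDA kernels rather than Python callables.

```python
# Minimal sketch of the ATMM dispatch path; names are illustrative, not V-LoRA's API.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class TileConfig:
    thread_block: Tuple[int, int, int]  # <tb_a, tb_b, tb_c>
    warp: Tuple[int, int, int]          # <w_d, w_e, w_f>

# Hash table prepared offline by the profile-based tiling search (Algorithm 2):
# maps an input-matrix shape <in_x, in_y, in_z> to its best tiling and kernel.
KERNEL_TABLE: Dict[Tuple[int, int, int], Tuple[TileConfig, Callable]] = {}

def atmm(x, lora_weight):
    """Adaptive-tiling matrix multiply: hash-table lookup, then kernel launch."""
    shape = (x.shape[0], x.shape[1], lora_weight.shape[-1])  # <in_x, in_y, in_z>
    config, kernel = KERNEL_TABLE[shape]   # O(1) lookup; no tuning at serving time
    return kernel(x, lora_weight, config)  # kernel tiles blocks/warps/threads per config
```

Because the table is keyed by the full input shape, the per-request overhead reduces to a single hash lookup, so no tiling search runs on the serving critical path.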
id="alg2.l4.3" style="font-size:80%;"> min, max=</span><span class="ltx_text ltx_font_smallcaps" id="alg2.l4.4" style="font-size:80%;">GetRange</span><span class="ltx_text" id="alg2.l4.5" style="font-size:80%;">(model, hardware) </span><span class="ltx_text" id="alg2.l4.1" style="font-size:80%;float:right;"><math alttext="\triangleright" class="ltx_Math" display="inline" id="alg2.l4.1.m1.1"><semantics id="alg2.l4.1.m1.1a"><mo id="alg2.l4.1.m1.1.1" xref="alg2.l4.1.m1.1.1.cmml">▷</mo><annotation-xml encoding="MathML-Content" id="alg2.l4.1.m1.1b"><ci id="alg2.l4.1.m1.1.1.cmml" xref="alg2.l4.1.m1.1.1">▷</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l4.1.m1.1c">\triangleright</annotation><annotation encoding="application/x-llamapun" id="alg2.l4.1.m1.1d">▷</annotation></semantics></math> <span class="ltx_inline-block ltx_parbox ltx_align_top" id="alg2.l4.1.1" style="width:130.1pt;"> <span class="ltx_p" id="alg2.l4.1.1.1">Limit the input size by limited memory</span> </span> </span> </div> <div class="ltx_listingline" id="alg2.l5"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l5.2.1.1" style="font-size:80%;">5:</span></span><span class="ltx_text" id="alg2.l5.3" style="font-size:80%;"> configlist=</span><span class="ltx_text ltx_font_smallcaps" id="alg2.l5.4" style="font-size:80%;">GetConfig</span><span class="ltx_text" id="alg2.l5.5" style="font-size:80%;">(harware) </span><span class="ltx_text" id="alg2.l5.1" style="font-size:80%;float:right;"><math alttext="\triangleright" class="ltx_Math" display="inline" id="alg2.l5.1.m1.1"><semantics id="alg2.l5.1.m1.1a"><mo id="alg2.l5.1.m1.1.1" xref="alg2.l5.1.m1.1.1.cmml">▷</mo><annotation-xml encoding="MathML-Content" id="alg2.l5.1.m1.1b"><ci id="alg2.l5.1.m1.1.1.cmml" xref="alg2.l5.1.m1.1.1">▷</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l5.1.m1.1c">\triangleright</annotation><annotation encoding="application/x-llamapun" id="alg2.l5.1.m1.1d">▷</annotation></semantics></math> <span class="ltx_inline-block ltx_parbox ltx_align_top" id="alg2.l5.1.1" style="width:173.4pt;"> <span class="ltx_p" id="alg2.l5.1.1.1">Limit the tiling step s.t. 
the arch characteristics</span> </span></span> </div> <div class="ltx_listingline" id="alg2.l6"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l6.1.1.1" style="font-size:80%;">6:</span></span> </div> <div class="ltx_listingline" id="alg2.l7"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l7.2.1.1" style="font-size:80%;">7:</span></span><span class="ltx_text" id="alg2.l7.3" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l7.4" style="font-size:80%;">for</span><span class="ltx_text" id="alg2.l7.5" style="font-size:80%;"> shape </span><span class="ltx_text ltx_font_bold" id="alg2.l7.6" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l7.7" style="font-size:80%;"> range(min, max, 32) </span><span class="ltx_text ltx_font_bold" id="alg2.l7.8" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l7.9" style="font-size:80%;"> </span><span class="ltx_text" id="alg2.l7.1" style="font-size:80%;float:right;"><math alttext="\triangleright" class="ltx_Math" display="inline" id="alg2.l7.1.m1.1"><semantics id="alg2.l7.1.m1.1a"><mo id="alg2.l7.1.m1.1.1" xref="alg2.l7.1.m1.1.1.cmml">▷</mo><annotation-xml encoding="MathML-Content" id="alg2.l7.1.m1.1b"><ci id="alg2.l7.1.m1.1.1.cmml" xref="alg2.l7.1.m1.1.1">▷</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l7.1.m1.1c">\triangleright</annotation><annotation encoding="application/x-llamapun" id="alg2.l7.1.m1.1d">▷</annotation></semantics></math> <span class="ltx_inline-block ltx_parbox ltx_align_top" id="alg2.l7.1.1" style="width:151.8pt;"> <span class="ltx_p" id="alg2.l7.1.1.1">Reduce the search step size based on the vision task characteristics</span> </span> </span> </div> <div class="ltx_listingline" id="alg2.l8"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l8.1.1.1" style="font-size:80%;">8:</span></span><span class="ltx_text" id="alg2.l8.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l8.3" style="font-size:80%;">for</span><span class="ltx_text" id="alg2.l8.4" style="font-size:80%;"> rank </span><span class="ltx_text ltx_font_bold" id="alg2.l8.5" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l8.6" style="font-size:80%;"> ranklist </span><span class="ltx_text ltx_font_bold" id="alg2.l8.7" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l8.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg2.l9"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l9.1.1.1" style="font-size:80%;">9:</span></span><span class="ltx_text" id="alg2.l9.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l9.3" style="font-size:80%;">for</span><span class="ltx_text" id="alg2.l9.4" style="font-size:80%;"> tbtile </span><span class="ltx_text ltx_font_bold" id="alg2.l9.5" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l9.6" style="font-size:80%;"> configlist[0] </span><span class="ltx_text ltx_font_bold" id="alg2.l9.7" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l9.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg2.l10"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l10.1.1.1" style="font-size:80%;">10:</span></span><span class="ltx_text" id="alg2.l10.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l10.3" style="font-size:80%;">for</span><span class="ltx_text" 
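For readers who prefer executable code, below is a self-contained Python rendering of Algorithm 2. The helpers get_range, get_configs, and profile_latency are placeholder stand-ins for GetRange, GetConfig, and Profile; in the paper these would query the GPU architecture and time real CUDA kernels, so the concrete values here are illustrative only.

```python
# Self-contained Python rendering of Algorithm 2 (profile-based tiling search).
# get_range / get_configs / profile_latency are placeholders; the real system
# queries the GPU architecture and profiles actual kernels.
import itertools

def get_range(model, hardware):
    return 32, 4096                                    # placeholder: bounded by GPU memory

def get_configs(hardware):
    tb_tiles = [(128, 128, 8), (64, 64, 8)]            # placeholder thread-block tiles
    warp_tiles = [(64, 32, 8), (32, 32, 8)]            # placeholder warp tiles
    return tb_tiles, warp_tiles

def profile_latency(shape, rank, tb_tile, warp_tile):
    return shape * rank / (tb_tile[0] * warp_tile[0])  # placeholder cost model

def tiling_search(model, hardware, ranklist=(8, 16, 32, 64)):
    best = {}                                          # (shape, rank) -> (latency, tb, warp)
    lo, hi = get_range(model, hardware)                # limit input sizes by memory
    tb_tiles, warp_tiles = get_configs(hardware)       # limit tiles to the architecture
    for shape in range(lo, hi, 32):                    # step of 32 suits vision-token counts
        for rank in ranklist:                          # candidate LoRA ranks
            for tb, warp in itertools.product(tb_tiles, warp_tiles):
                lat = profile_latency(shape, rank, tb, warp)
                key = (shape, rank)
                if key not in best or lat < best[key][0]:
                    best[key] = (lat, tb, warp)        # UpdateBestConfig
    return best                                        # populates ATMM's hash table
```

The output of this offline search is what populates the hash table consumed by ATMM in Appendix A.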
id="alg2.l10.4" style="font-size:80%;"> warptile </span><span class="ltx_text ltx_font_bold" id="alg2.l10.5" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l10.6" style="font-size:80%;"> configlist[1] </span><span class="ltx_text ltx_font_bold" id="alg2.l10.7" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l10.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg2.l11"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l11.1.1.1" style="font-size:80%;">11:</span></span><span class="ltx_text" id="alg2.l11.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg2.l11.3" style="font-size:80%;">Profile</span><span class="ltx_text" id="alg2.l11.4" style="font-size:80%;">(shape, tbtile, warptile) </span> </div> <div class="ltx_listingline" id="alg2.l12"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l12.1.1.1" style="font-size:80%;">12:</span></span><span class="ltx_text" id="alg2.l12.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg2.l12.3" style="font-size:80%;">UpdateBestConfig</span><span class="ltx_text" id="alg2.l12.4" style="font-size:80%;">( ) </span> </div> </div> </figure> </section> <section class="ltx_appendix" id="A3"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix C </span>Prompt Template of Vision Applications</h2> <div class="ltx_para" id="A3.p1"> <p class="ltx_p" id="A3.p1.1">We report the prompt template for two vision applications, visual retrieval and video analytics, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#A3.F25" title="Figure 25 ‣ Appendix C Prompt Template of Vision Applications ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">25</span></a>. Visual retrieval includes three tasks: referring expression task, visual question answering, and image captioning. The black text represents the prompt, and the blue text shows the response. Video analytics involves two tasks: object detection and video understanding. Object detection uses a similar prompt as the referring expression task. Video understanding provides multiple image frames as input, followed by an instruction to analyze the actions depicted in the sequence.</p> </div> <figure class="ltx_figure" id="A3.F25"> <div class="ltx_block ltx_minipage ltx_align_center ltx_align_bottom" id="A3.F25.1" style="width:433.6pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="954" id="A3.F25.1.g1" src="x28.png" width="831"/> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 25. 
Figure 25. The prompt template of vision applications.
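Since the template itself appears only in Fig. 25 (an image), the snippet below is a purely illustrative sketch of the prompt/response structure described above; the wording, the <image> placeholder, and the example responses are assumptions, not the authors' exact template.

```python
# Illustrative prompt/response structure only; the authoritative wording is in Fig. 25.
REFERRING_EXPRESSION_PROMPT = (
    "<image>\n"
    "Please provide the bounding box of the region this sentence describes: {expression}."
)
# Example response (blue text in Fig. 25), e.g.: "[0.32, 0.41, 0.58, 0.77]"

VIDEO_UNDERSTANDING_PROMPT = (
    "<image>\n" * 8  # several frames sampled from the clip
    + "These frames are sampled from a short video. Describe the action being performed."
)
# Example response, e.g.: "playing basketball"
```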