V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM
Liang Mi¹*, Weijun Wang²*†, Wenming Tu², Qingfeng He², Rui Kong², Xinyu Fang², Yazhu Dong², Yikang Zhang¹, Yuanchun Li², Meng Li¹, Haipeng Dai¹†, Guihai Chen¹, Yunxin Liu²
¹State Key Laboratory for Novel Software Technology, Nanjing University
²Institute for AI Industry Research (AIR), Tsinghua University
*Liang Mi and Weijun Wang contributed equally to this work.
†Corresponding authors: Weijun Wang and Haipeng Dai.

Abstract.
Large Multimodal Models (LMMs) have shown significant progress on various complex vision tasks, building on the solid linguistic and reasoning capacity inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising method to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is excessively computationally expensive and causes extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, V-LoRA, enables accurate and efficient vision tasks through 1) an accuracy-aware LoRA adapter generation approach that produces LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapters batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype V-LoRA on five popular vision tasks on three LMMs.
<span class="ltx_text" id="id31.id1.1.2" style="color:#000000;">Experiment results reveal that V-LoRA improves 24-62% of the accuracy compared to the original LMMs and reduces 20-89% of the latency compared to the state-of-the-art LoRA model serving systems.</span></span></p> </div> <span class="ltx_note ltx_note_frontmatter ltx_role_submissionid" id="id1"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">submissionid: </span>¡123¿</span></span></span><span class="ltx_note ltx_role_footnotetext" id="footnotex6"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">footnotetext: </span>Work was done while Liang, Wenming, Qingfeng, Rui, Xinyu, and Yazhu interned at the Institute for AI Industry Research (AIR), Tsinghua University.</span></span></span> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1. </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Encouraged by the success of LLMs in NLP applications <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib7" title="">LLM12App, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib8" title="">7AppLLM, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib10" title="">indataAppLLM, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib13" title="">LLMUseCase, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib56" title="">li2024personal, </a>)</cite>, Large Multimodal Models (LMMs) <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib98" title="">zhu2023minigpt, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib12" title="">gpt4o, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib11" title="">claude, </a>)</cite> have attracted great attention from both academia and industry. They enhance LLMs by perceiving and interpreting multimodal signals (<em class="ltx_emph ltx_font_italic" id="S1.p1.1.1">e.g.,</em> visual inputs <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib60" title="">liu2024visual, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib19" title="">bai2023qwen, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib73" title="">team2023gemini, </a>)</cite>) and well accomplish many complex multi-modal tasks that prior models cannot. For example, GPT-4o <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib12" title="">gpt4o, </a>)</cite> achieves leading accuracy on many multimodal tasks such as visual question answering <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib37" title="">goyal2017making, </a>)</cite>. 
Yet when applied to practical applications that require domain-specific knowledge, LMMs often show suboptimal performance, much like early LLMs suffering from hallucinations (zhang2023siren).

Low-rank adaptation (LoRA) (hu2021lora; dettmers2024qlora) provides a promising way to integrate external knowledge into an LMM. It fine-tunes a small set of additional parameters, known as LoRA adapters, on domain-specific datasets to learn the target knowledge, while freezing the base model to preserve its original capability (more in §2). LLMs often leverage retrieval-augmented generation (RAG) (lewis2020retrieval) for the same goal. Unlike LoRA, which modifies model parameters, RAG appends retrieved knowledge (e.g., documents) to requests (i.e., input data) to produce accurate responses. However, this data-augmented method is not suitable for time-sensitive vision applications: its retrieval process and long-context requests incur >10× response delay (jin2024ragcache). Conversely, LoRA merges fine-tuned adapters into the LMM at runtime and efficiently generates high-quality, consistent domain-specific responses without additional overhead.

Despite these advantages, LoRA introduces complex system challenges. Recent work on system optimization for linguistic applications with LoRA LLMs (chen2024punica; sheng2023slora; wu2024dlora) has made noticeable progress.
Punica (chen2024punica) and S-LoRA (sheng2023slora) propose unmerged inference to overcome the limitation of merged inference, which can merge only one adapter at a time: it computes multiple LoRA adapters in parallel while batching the shared base-model computation across requests, boosting system efficiency. dLoRA (wu2024dlora) further balances throughput and latency by switching between the merged and unmerged inference modes (more in §2). However, these efforts fail to meet the high efficiency and diverse requirements of vision applications (more in §3.2).

In this paper, we answer the following research question: Can we leverage LoRA LMMs to enrich vision applications while meeting their performance requirements? We argue that the answer is yes, but the following challenges must be tackled.

First, many existing small models, trained on domain-specific datasets and used in current vision applications, outperform LMMs on their target tasks. To preserve accuracy, their knowledge must be integrated into the LMM with LoRA adapters. However, simply training one LoRA adapter per task is uneconomical, while fusing too much external knowledge (e.g., multiple small models) into a single adapter causes inevitable accuracy degradation.

Second, a vision application often involves diverse external knowledge. When serving multiple vision applications in particular, the LoRA LMM is very likely asked to execute multiple LoRA adapters simultaneously. Although unmerged inference can compute heterogeneous adapters in parallel, it introduces excessive delays. Therefore, our LoRA LMM inference system must compute concurrent heterogeneous adapters with low latency.

Lastly, different vision applications have distinct performance requirements.
Real-time video analytics (yuan2022PacketGame; AccDecoder) needs low latency, while visual retrieval (kang13blazeit) prefers high throughput. Meeting these needs requires careful management of LoRA adapters and flexible scheduling of application requests. A typical LoRA adapter manager (e.g., dLoRA) is not designed for vision applications and incurs excessive overhead; its inference mode switcher alone costs over half of the LMM inference time.

This paper presents V-LoRA, an end-to-end LoRA LMM serving system, including LoRA adapter preparation (§4.2) and inference runtime (§4.3, §4.4), to empower vision applications. V-LoRA addresses the above challenges with the following techniques:

Accuracy-aware LoRA adapter generation. We propose a LoRA generator (§4.2) that prepares domain-specific LoRA adapters to produce accurate results on target tasks with external knowledge. Considering the complex accuracy variations of knowledge fusion (§3.2), LoRA adapter generation can be formulated as a constrained bin-packing problem: given external knowledge, i.e., small models and domain-specific datasets, generate the minimum number of LoRA adapters while ensuring the accuracy specified by each vision application. We design an accuracy-aware knowledge-fusion algorithm with a greedy heuristic to solve it, sketched below. Additionally, we introduce vision task heads, incorporated as part of the LoRA adapter, to enable low-latency responses for vision tasks.
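To make the greedy heuristic concrete, here is a minimal sketch of accuracy-constrained knowledge packing. The task list, the accuracy requirements, and the `estimate_accuracy` helper (e.g., backed by profiling runs) are illustrative assumptions of this sketch, not V-LoRA's actual algorithm, which is described in §4.2.1.

```python
from typing import Callable, Dict, List

def fuse_knowledge_greedy(
    tasks: List[str],
    accuracy_req: Dict[str, float],
    estimate_accuracy: Callable[[List[str], str], float],
) -> List[List[str]]:
    """Greedily pack tasks (external knowledge sources) into as few LoRA
    adapters as possible while every task fused into an adapter is still
    predicted to meet its accuracy requirement.

    estimate_accuracy(group, task) -> predicted accuracy of `task` when
    fine-tuned jointly with the other tasks in `group` (assumed helper,
    e.g., a profiled lookup table).
    """
    adapters: List[List[str]] = []
    # Place the hardest-to-satisfy tasks first, a common packing heuristic.
    for task in sorted(tasks, key=lambda t: accuracy_req[t], reverse=True):
        placed = False
        for group in adapters:
            candidate = group + [task]
            # Fuse only if *every* member still meets its requirement.
            if all(estimate_accuracy(candidate, t) >= accuracy_req[t]
                   for t in candidate):
                group.append(task)
                placed = True
                break
        if not placed:
            adapters.append([task])  # open a new adapter for this task
    return adapters
```

Packing the tasks with the strictest accuracy requirements first mirrors standard bin-packing heuristics: they are the hardest to place, and deferring them tends to force extra adapters.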
Adaptive-tiling LoRA adapters batching. We propose a concurrent LoRA adapters batching method (§4.3), comprising the Adaptive-Tiling Matrix Multiplication (ATMM) operator and its optimal tiling search algorithm, for efficient computation of heterogeneous LoRA adapters. The offline search algorithm identifies the optimal tiling configuration for each possible input matrix shape, builds a hash table storing these input-to-tiling pairs, and compiles their code implementations for standby. At runtime, ATMM selects the optimal tiling configuration from the hash table according to the shapes of the concurrent requests and the invoked LoRA adapters, then executes the corresponding pre-compiled implementation with high efficiency (see the sketch below).
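The following is a minimal sketch of this lookup-and-dispatch structure. The key layout, the tiling triples, and the kernel registration interface are assumptions made for illustration; the real ATMM operator and its profile-based search are described in §4.3 and Appendices A-B. The fallback path is a plain gather-based computation, not the tiled kernel itself.

```python
from typing import Callable, Dict, Tuple
import torch

# (total_tokens, hidden_dim, lora_rank) -> tiling configuration chosen
# offline by the profile-based search; entries here are placeholders.
TilingKey = Tuple[int, int, int]
_TILING_TABLE: Dict[TilingKey, Tuple[int, int, int]] = {}
_KERNELS: Dict[Tuple[int, int, int], Callable] = {}

def register_tiling(key: TilingKey, tiling: Tuple[int, int, int],
                    kernel: Callable) -> None:
    """Offline: store the profiled-optimal tiling and its pre-compiled kernel."""
    _TILING_TABLE[key] = tiling
    _KERNELS[tiling] = kernel

def atmm(x: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor,
         seg_ids: torch.Tensor) -> torch.Tensor:
    """Runtime dispatch: pick the tiling for the current shapes and run it.
    x: (tokens, d); lora_A: (n_adapters, r, d); lora_B: (n_adapters, d, r);
    seg_ids: adapter index of each token."""
    key = (x.shape[0], x.shape[1], lora_A.shape[1])
    tiling = _TILING_TABLE.get(key)
    if tiling is not None:
        return _KERNELS[tiling](x, lora_A, lora_B, seg_ids)
    # Fallback: gather each token's adapter and compute the bypass directly.
    A = lora_A[seg_ids]                      # (tokens, r, d)
    B = lora_B[seg_ids]                      # (tokens, d, r)
    h = torch.einsum("td,trd->tr", x, A)     # per-token x @ A^T
    return torch.einsum("tr,tdr->td", h, B)  # per-token h @ B^T
```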
Flexible LoRA adapters orchestration. To meet the diverse requirements of vision applications, we propose an orchestrator (§4.4) that efficiently and flexibly orchestrates LoRA adapters at runtime. Two tools facilitate this: a switcher that leverages ATMM and unified memory management to enable swift inference-mode switches and LoRA adapter swaps, and a mixture inference mode, deLoRA, that mitigates starvation. Using these tools, we design an algorithm that dynamically switches among three inference modes, schedules requests, and manages LoRA adapters to satisfy the performance requirement of each application; an illustrative mode-selection sketch follows.
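As an illustration only, the choice among the three modes for one batch might look like the sketch below. The actual scheduling policy (§4.4.3) also accounts for request deadlines, adapter placement, and switch cost, none of which are modeled here; the `dominance` threshold is a made-up knob for this sketch.

```python
from collections import Counter
from typing import List

def choose_inference_mode(batch_adapter_ids: List[int],
                          dominance: float = 0.9) -> str:
    """Illustrative mode selection for one batch of requests:
      - "merged":   all requests use the same adapter, so it can be folded
                    into the base weights with no extra per-token cost;
      - "unmerged": requests spread across many adapters, computed together
                    through the bypass path;
      - "mixed":    one adapter dominates but the rest must not starve, so
                    the dominant one is merged while the others run unmerged
                    (deLoRA-style).
    """
    counts = Counter(batch_adapter_ids)
    top_share = counts.most_common(1)[0][1] / len(batch_adapter_ids)
    if len(counts) == 1:
        return "merged"
    if top_share >= dominance:
        return "mixed"
    return "unmerged"
```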
We summarize our key contributions as follows:

• To the best of our knowledge, we are the first to identify and solve the key problems in empowering vision applications with LoRA LMM.

• We prototype V-LoRA (we will release the code and documentation after this paper is accepted), which enables accurate, efficient, and flexible LoRA LMM serving for vision applications through accuracy-aware LoRA adapter generation, adaptive-tiling LoRA adapters batching, and efficient and flexible LoRA adapters orchestration.

• We implement V-LoRA and evaluate it on five popular analytical tasks with real-world traces on three LMMs. Experimental results show that V-LoRA improves accuracy by 24-62% over the original LMMs and reduces latency by 20-89% compared to state-of-the-art methods.

This work does not raise any ethical issues.

2. Background

Vision applications today exploit AI technology to process images or videos in RGB space (zhang2018awstream; khani2023recl; Romil2022Ekya). In video analytics, for instance, multiple DNNs, each well trained on a domain-specific dataset, take care of one target task apiece and together serve the application well (kang2022viva; kang2017noscope; yu2024vqpy). However, the limited capabilities of small models hinder the development of vision applications. Current applications still rely on simple combinations of vision tasks such as image classification (yuan2022PacketGame), vehicle counting (li2020reducto), and target detection (du2020server). With the natural-language interface inherited from LLMs, LMMs can enrich future vision applications. For example, served by an LMM, a police officer can find the right target given only a text-described query such as "A boy wearing a red sweater lost at the corner". Therefore, this paper aims to empower vision tasks with LoRA LMM and enrich future vision applications.

Figure 1. Illustration of LMM inference.
Qwen-VL-7B (QwenVL) generates the right action recognition answer for a piece of data from the UCF-101 dataset (soomro2012ucf101) and the corresponding prompt.

Large multimodal models (LMMs) aim to achieve stronger general intelligence by extending LLMs with multimodal inputs. Since more than 80% of human perception and activities are mediated through vision (thomas2024vision), it is natural to start the exploration by equipping LLMs with "eyes". By introducing a visual receptor, comprising a visual encoder (e.g., ViT (radford2021learning)) and a vision-language projector (e.g., Q-Former (li2023blip)), LMMs give LLMs visual capacity. Fig. 1 illustrates the inference procedure of LMMs. Given an image input and its prompt, the visual encoder splits the image into small patches (e.g., 14×14-pixel blocks) and extracts the visual features of each. The vision-language projector then converts the patches' visual features into visual tokens and feeds them into the LLM (achiam2023gpt4; chowdhery2023palm; touvron2023llama2; jiang2023mistral), together with the embedded text tokens from the prompt, to generate the answer.
The LLM maps the input into high-level features and predicts the probability distribution of the next token with the language modeling (LM) head in an autoregressive pattern (kwon2023efficient; yu2022orca). As the dotted arrows in Fig. 1 depict, it iteratively generates one token at a time from the input tokens and the previously generated tokens, until an end-of-sentence (<EOS>) token is emitted; a minimal sketch of this loop is shown below.
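The sketch below shows this autoregressive loop with greedy decoding and no KV cache or batching; `lmm` is assumed to be a callable that maps the current token sequence (visual tokens already injected) to next-token logits.

```python
import torch

@torch.no_grad()
def generate(lmm, prompt_tokens: torch.Tensor, eos_id: int,
             max_new_tokens: int = 128) -> torch.Tensor:
    """Greedy autoregressive decoding: the LM head's logits give the
    next-token distribution; generation stops once <EOS> is emitted.
    prompt_tokens: (1, seq_len) visual + text tokens of the request."""
    tokens = prompt_tokens
    for _ in range(max_new_tokens):
        logits = lmm(tokens)                           # (1, seq, vocab)
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)  # append and repeat
        if next_tok.item() == eos_id:
            break
    return tokens
```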
Low-rank adaptation (LoRA) (hu2021lora; dettmers2024qlora; chavan2023one) is a widely used parameter-efficient fine-tuning method (peft) for integrating external knowledge into large models. A LoRA adapter is a small set of trainable parameters that learns this knowledge, typically placed in the attention layers (hu2021lora). The core of LoRA, as illustrated in Fig. 2(a), is to represent each weight update $\Delta W$ as the product of two matrices, $A$ and $B$, with dimensions $r\times d$ and $d\times r$, respectively, whose rank $r$ is much smaller than $d$, the dimension of the base model weight $W$ ($d\times d$). This works because weight updates in large models exhibit a low intrinsic rank (hu2021lora). During fine-tuning, LoRA updates only $A$ and $B$ while keeping $W$ frozen. At inference, the LoRA adapters can be computed in a bypass, as shown in Fig. 2(a), with their results simply added onto the output; alternatively, the product $B\times A$ (i.e., $\Delta W$) can be merged into the base weight $W$, which keeps the computation overhead identical to the base model, as shown in Fig. 2(b). Both paths are equivalent, as the sketch below illustrates.
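A minimal numerical sketch of the two inference paths (the usual LoRA scaling factor is omitted for clarity):

```python
import torch

d, r = 1024, 16                  # hidden size and LoRA rank, r << d
W = torch.randn(d, d)            # frozen base weight (d x d)
A = torch.randn(r, d) * 0.01     # LoRA A: r x d (trainable)
B = torch.randn(d, r) * 0.01     # LoRA B: d x r (trainable)
x = torch.randn(8, d)            # a batch of token activations

# Unmerged (bypass) inference: base output plus the low-rank update.
y_unmerged = x @ W.T + (x @ A.T) @ B.T

# Merged inference: fold delta_W = B @ A into W once, then run as usual.
W_merged = W + B @ A
y_merged = x @ W_merged.T

# The two paths are mathematically identical (up to floating-point error).
assert torch.allclose(y_unmerged, y_merged, atol=1e-3)
```

The unmerged path lets different requests in a batch use different $(A, B)$ pairs at the cost of the extra bypass matmuls, whereas the merged path adds no per-token cost but fixes a single adapter; this is exactly the trade-off the serving systems below navigate.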
Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">2(b)</span></a>.</p> </div> <figure class="ltx_figure" id="S2.F2"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S2.F2.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="162" id="S2.F2.sf1.g1" src="x2.png" width="207"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S2.F2.sf1.2.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S2.F2.sf1.3.2" style="font-size:80%;">Unmerge mode.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S2.F2.sf2"> <p class="ltx_p ltx_align_center" id="S2.F2.sf2.1"><span class="ltx_text" id="S2.F2.sf2.1.1" style="position:relative; bottom:1.4pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="161" id="S2.F2.sf2.1.1.g1" src="x3.png" width="141"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S2.F2.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S2.F2.sf2.4.2" style="font-size:80%;">Merge mode.</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2. </span>LoRA model inference. (a) Unmerge mode supports computing multiple different LoRA adapters in a batch. <math alttext="A_{1}" class="ltx_Math" display="inline" id="S2.F2.3.m1.1"><semantics id="S2.F2.3.m1.1b"><msub id="S2.F2.3.m1.1.1" xref="S2.F2.3.m1.1.1.cmml"><mi id="S2.F2.3.m1.1.1.2" xref="S2.F2.3.m1.1.1.2.cmml">A</mi><mn id="S2.F2.3.m1.1.1.3" xref="S2.F2.3.m1.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S2.F2.3.m1.1c"><apply id="S2.F2.3.m1.1.1.cmml" xref="S2.F2.3.m1.1.1"><csymbol cd="ambiguous" id="S2.F2.3.m1.1.1.1.cmml" xref="S2.F2.3.m1.1.1">subscript</csymbol><ci id="S2.F2.3.m1.1.1.2.cmml" xref="S2.F2.3.m1.1.1.2">𝐴</ci><cn id="S2.F2.3.m1.1.1.3.cmml" type="integer" xref="S2.F2.3.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.3.m1.1d">A_{1}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.3.m1.1e">italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="B_{1}" class="ltx_Math" display="inline" id="S2.F2.4.m2.1"><semantics id="S2.F2.4.m2.1b"><msub id="S2.F2.4.m2.1.1" xref="S2.F2.4.m2.1.1.cmml"><mi id="S2.F2.4.m2.1.1.2" xref="S2.F2.4.m2.1.1.2.cmml">B</mi><mn id="S2.F2.4.m2.1.1.3" xref="S2.F2.4.m2.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S2.F2.4.m2.1c"><apply id="S2.F2.4.m2.1.1.cmml" xref="S2.F2.4.m2.1.1"><csymbol cd="ambiguous" id="S2.F2.4.m2.1.1.1.cmml" xref="S2.F2.4.m2.1.1">subscript</csymbol><ci id="S2.F2.4.m2.1.1.2.cmml" xref="S2.F2.4.m2.1.1.2">𝐵</ci><cn id="S2.F2.4.m2.1.1.3.cmml" type="integer" xref="S2.F2.4.m2.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.4.m2.1d">B_{1}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.4.m2.1e">italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math> constitute LoRA adapter #1. 
(b) Merge mode supports inference with no extra delay but only one adapter at a time.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S2.p4"> <p class="ltx_p" id="S2.p4.1"><span class="ltx_text ltx_font_bold" id="S2.p4.1.1">LoRA models serving system</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib25" title="">chen2024punica, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib69" title="">sheng2023slora, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite> aims to improve LoRA model inference efficiency. In unmerged inference, computing the adapter’s two small matrices underutilizes GPU computing units, leading to unnecessary delay and wasted resources. To tackle this, Punica <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib25" title="">chen2024punica, </a>)</cite> and S-LoRA <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib69" title="">sheng2023slora, </a>)</cite> batch multiple heterogeneous adapters, as the cascaded LoRA adapters illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S2.F2.sf1" title="In Figure 2 ‣ 2. Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">2(a)</span></a>, and compute them within a single custom CUDA kernel. This method does boost system throughput, but it causes significant additional overhead (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS2" title="3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.2</span></a>). dLoRA <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite> merges the most accessed LoRA adapter into the base model and switches to unmerged inference when necessary. However, its mode switch causes an unacceptable time cost (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS2" title="3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.2</span></a>). Besides, like Punica and S-LoRA, dLoRA’s unmerged inference also fails to address the large extra computation overhead.</p> </div> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3. </span>Motivation and Challenges</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">This section explores two questions: (1) What benefits can LoRA LMM bring to vision applications (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS1" title="3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.1</span></a>)? 
(2) What challenges must be tackled when empowering vision applications with LoRA LMM (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.SS2" title="3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3.2</span></a>)?</p> </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">3.1 Potential Benefits from LoRA LMM</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p1.1.1">LMMs offer state-of-the-art performance on many complex vision tasks.</span> To demonstrate this, we take zero-shot grounding and visual question answering as examples and conduct experiments on the Aircraft <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib9" title="">Aircraft, </a>)</cite> and VQAv2 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib38" title="">vqa2, </a>)</cite> datasets with Qwen-VL-7B <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib20" title="">QwenVL, </a>)</cite> (as the LMM), and YOLO <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib33" title="">glenn2021YOLOV5, </a>)</cite> and OSCAR <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib55" title="">li2020oscar, </a>)</cite> (as the baseline small models). Aircraft contains 103 remote sensing images that neither Qwen-VL nor YOLO was pre-trained on, serving as the zero-shot test; VQAv2, the most popular visual question answering dataset, includes both text and visual modalities and tests multi-modal ability. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F3" title="Figure 3 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3</span></a> shows that, with the solid linguistic and reasoning capabilities inherited from LLMs, Qwen-VL greatly outperforms the small models. It delivers a 48.9% higher F1-score in zero-shot grounding than YOLO. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F3.sf1" title="In Figure 3 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3(a)</span></a> visualizes the results on data #38, where Qwen-VL produces more accurate bounding boxes than YOLO. For multi-modal tasks, Qwen-VL achieves 78.8% accuracy, 7.5% higher than OSCAR. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F3.sf2" title="In Figure 3 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">3(b)</span></a> exemplifies a typical vehicle counting task in video analytics applications, where only Qwen-VL generates the correct answer.</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p2.1.1">With external knowledge from LoRA adapters, LMM attains remarkable accuracy gains on domain-specific tasks.</span> To investigate the accuracy improvement from external knowledge, we fine-tune three LoRA adapters for image classification, object detection, and video classification, respectively, on the external datasets AID <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib82" title="">AID, </a>)</cite>, Aircraft <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib9" title="">Aircraft, </a>)</cite>, and UCF101 <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib72" title="">UCF101, </a>)</cite>. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F4" title="Figure 4 ‣ 3.1 Potential Benefits from LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4</span></a> shows the results. With fine-tuned LoRA adapters, Qwen-VL receives 45.2%, 24.5%, and 62.2% accuracy gains on the three domain-specific tasks, respectively. Note that we only validated the potential gain without fully exploring advanced training techniques like data enhancement <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib92" title="">VideoEnhancement, </a>)</cite>. 
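</p> <p class="ltx_p">The adapters above are produced with standard LoRA fine-tuning. The following is a minimal sketch of such a recipe, assuming the HuggingFace PEFT interface; the checkpoint name, rank, target modules, and <span class="ltx_text ltx_font_typewriter">domain_dataloader</span> are illustrative placeholders, not the exact configuration used here.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Minimal LoRA fine-tuning sketch (illustrative, not the exact V-LoRA training code).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen-VL-Chat"   # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, trust_remote_code=True)

# Rank-r adapters on the attention projections; r and target modules are illustrative.
cfg = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                 target_modules=["c_attn", "c_proj"])
model = get_peft_model(model, cfg)
model.print_trainable_parameters()   # only the A/B matrices are trainable

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in domain_dataloader:      # e.g., AID / Aircraft / UCF101 samples with labels
    loss = model(**batch).loss       # standard cross-entropy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("lora-aircraft")  # saves adapter weights only (tens of MB)
</pre> <p class="ltx_p">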
Nevertheless, based on current results, we believe these techniques could further improve accuracy in future work.</p> </div> <figure class="ltx_figure" id="S3.F3"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S3.F3.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="174" id="S3.F3.sf1.g1" src="x4.png" width="174"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S3.F3.sf1.2.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S3.F3.sf1.3.2" style="font-size:80%;">Zero-shot Grounding results on data #38 in Aircraft dataset <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib9" title="">Aircraft, </a>)</cite>.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S3.F3.sf2"> <p class="ltx_p ltx_align_center" id="S3.F3.sf2.1"><span class="ltx_text" id="S3.F3.sf2.1.1" style="position:relative; bottom:1.4pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="174" id="S3.F3.sf2.1.1.g1" src="x5.png" width="185"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S3.F3.sf2.3.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S3.F3.sf2.4.2" style="font-size:80%;">Visualization of data #133279 in Visual Question Answering <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib38" title="">vqa2, </a>)</cite>.</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3. </span>The potential of LMM. (a) To ground the airplanes in remote sensing view in zero-shot, LMM Qwen-VL, in general, delivers 67.2% accuracy <em class="ltx_emph ltx_font_italic" id="S3.F3.3.1">v.s.</em> the 18.3% of YOLO <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib33" title="">glenn2021YOLOV5, </a>)</cite>. (b) In VQA, Qwen-VL yields 78.8% accuracy <em class="ltx_emph ltx_font_italic" id="S3.F3.4.2">v.s.</em> the 73.3% of OSCAR <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib55" title="">li2020oscar, </a>)</cite>. </figcaption> </figure> <figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="340" id="S3.F4.g1" src="x6.png" width="705"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4. </span>LoRA adapters with domain-specific knowledge improve the Qwen-VL’s accuracy on target tasks.</figcaption> </figure> <div class="ltx_para" id="S3.SS1.p3"> <p class="ltx_p" id="S3.SS1.p3.1"><span class="ltx_text ltx_font_bold" id="S3.SS1.p3.1.1">LoRA LMM enables more flexible serving.</span> Today’s vision applications are served by many domain-specific small models <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib67" title="">Romil2022Ekya, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib49" title="">khani2023recl, </a>)</cite>. 
When invoked, the specific model is loaded into the GPU and swapped out to main memory after execution <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib68" title="">shen2019nexus, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib21" title="">bai2020pipeswitch, </a>)</cite>. This swapping method allows model execution in a large batch but incurs inevitable transmission costs. LoRA adapters usually have fewer parameters than small models (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS4.SSS1" title="4.4.1 Swift inference mode switch. ‣ 4.4 Flexible LoRA Adapters Orchestration ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.4.1</span></a>). Swapping them while keeping the LMM in the GPU provides more flexible serving. In our test, swapping an adapter reduces delay by 97% compared with OSCAR (15ms <em class="ltx_emph ltx_font_italic" id="S3.SS1.p3.1.2">v.s.</em> 520ms) and by 86% compared with YOLO’s 110ms.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">3.2 Challenges of Empowering Vision Applications with LoRA LMM </h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">To empower diverse vision applications, the LoRA LMM system must offer accurate and efficient responses to the vision tasks involved and meet the distinct application-specified performance requirements. To this end, we face three challenges. </p> </div> <figure class="ltx_figure" id="S3.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="346" id="S3.F5.g1" src="x7.png" width="622"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5. </span>Accuracy decreases when fusing knowledge from multiple domain-specific small models into a single LoRA. The trend varies across vision tasks. </figcaption> </figure> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p2.1.1">C1: Limited capacity of LoRA adapter.</span> Integrating external knowledge from existing small models or specific datasets into the LMM is essential for generating accurate results. Parameter-efficient fine-tuning with LoRA adapters shows promise. However, it is challenging because a LoRA adapter has only limited capacity, which varies across vision tasks. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F5" title="Figure 5 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">5</span></a> demonstrates this by fusing external knowledge from different numbers of small models on diverse tasks into a single LoRA adapter (experiment setup details in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.SS1" title="6.1 Experimental Setup ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">6.1</span></a>). Training a separate adapter for each small model consistently achieves high accuracy but results in significant adapter capacity waste, while fusing too many small models into one adapter incurs significant accuracy degradation. 
For example, the LoRA adapter that fuses six image classification models in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F5" title="Figure 5 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">5</span></a> retains over 95% accuracy, while fusing six video classification models causes a remarkable accuracy drop.</p> </div> <div class="ltx_para" id="S3.SS2.p3"> <p class="ltx_p" id="S3.SS2.p3.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p3.1.1">C2: Inefficient concurrent request batching.</span> One vision application often involves multiple small models <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib43" title="">jiang2018chameleon, </a>; <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib31" title="">du2020server, </a>)</cite>. Hence, serving multiple vision applications with an LMM very likely leads to the simultaneous invocation of multiple heterogeneous LoRA adapters. However, current LoRA model inference systems struggle to process them efficiently, particularly under high concurrency. To demonstrate this, we measure three state-of-the-art systems. <em class="ltx_emph ltx_font_italic" id="S3.SS2.p3.1.2">Punica</em> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib25" title="">chen2024punica, </a>)</cite> and <em class="ltx_emph ltx_font_italic" id="S3.SS2.p3.1.3">S-LoRA</em> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib69" title="">sheng2023slora, </a>)</cite> each customize a CUDA operator to batch heterogeneous LoRA adapter computations in unmerge mode (more details in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS1" title="4.3.1 ATMM: Adaptive-tiling matrix multiplication operator. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. 
V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.1</span></a>), while <em class="ltx_emph ltx_font_italic" id="S3.SS2.p3.1.4">dLoRA</em> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite> calls the PyTorch operator <span class="ltx_text ltx_font_typewriter" id="S3.SS2.p3.1.5">Einsum<span class="ltx_note ltx_role_footnote" id="footnote2"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_serif" id="footnote2.1.1.1">†</span></span><span class="ltx_text ltx_font_serif" id="footnote2.5"> Einsum uses the Einstein summation convention </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text ltx_font_serif" id="footnote2.6.1">(</span><a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib32" title="">Einstein1922<span class="ltx_text ltx_font_serif" id="footnote2.7.2.1.1">, </span></a><span class="ltx_text ltx_font_serif" id="footnote2.8.3">)</span></cite><span class="ltx_text ltx_font_serif" id="footnote2.9">, describing tensor index operations in a concise string form, for efficient LoRA adapter batching.</span></span></span></span></span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib3" title="">torcheinsum, </a>)</cite>. The experimental workload randomly generates 2-4 requests per second, each with an input length of 128 to 1024 tokens, and we repeat the measurement 1,000 times to obtain the latency. For fairness, all experiments run on a server equipped with an NVIDIA A100 GPU <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib2" title="">A100, </a>)</cite> via the PEFT <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib61" title="">peft, </a>)</cite> framework, and use Qwen-VL-7B <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib20" title="">QwenVL, </a>)</cite> as the base model; a minimal sketch of the batched unmerged computation that these operators implement is shown below.</p> </div> <div class="ltx_para" id="S3.SS2.p4"> <p class="ltx_p" id="S3.SS2.p4.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a> plots the results. 
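</p> <p class="ltx_p">For reference, the unmerged path that these operators batch amounts to adding each request’s own low-rank update on top of the shared base projection. The following is a minimal PyTorch einsum-style sketch with illustrative shapes; it is not the Punica/S-LoRA CUDA kernels themselves.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Batched unmerged LoRA: each request in the batch may use a different adapter.
# Shapes are illustrative; real systems fuse this into one custom CUDA kernel.
import torch

bsz, seqlen, d, r = 4, 256, 4096, 64
x = torch.randn(bsz, seqlen, d, device="cuda", dtype=torch.float16)
W = torch.randn(d, d, device="cuda", dtype=torch.float16)          # shared base weight

pool_A = torch.randn(8, d, r, device="cuda", dtype=torch.float16)  # adapter pool
pool_B = torch.randn(8, r, d, device="cuda", dtype=torch.float16)
idx = torch.tensor([0, 3, 3, 5], device="cuda")   # adapter id chosen per request

A = pool_A[idx]   # (bsz, d, r) gathered per-request LoRA factors
B = pool_B[idx]   # (bsz, r, d)

y = x @ W                                          # base projection: one large GEMM
y = y + torch.einsum("bsd,bdr,bre->bse", x, A, B)  # two extra matmuls + one add per layer
</pre> <p class="ltx_p">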
The bars of dLoRA, S-LoRA, and Punica denote their extra latency relative to the merged inference, <em class="ltx_emph ltx_font_italic" id="S3.SS2.p4.1.1">e.g.,</em> <math alttext="latency_{dLoRA}-latency_{merge}" class="ltx_Math" display="inline" id="S3.SS2.p4.1.m1.1"><semantics id="S3.SS2.p4.1.m1.1a"><mrow id="S3.SS2.p4.1.m1.1.1"><msub><mi>latency</mi><mi>dLoRA</mi></msub><mo>−</mo><msub><mi>latency</mi><mi>merge</mi></msub></mrow><annotation encoding="application/x-tex" id="S3.SS2.p4.1.m1.1c">latency_{dLoRA}-latency_{merge}</annotation></semantics></math>. The bar of the base model denotes its time cost under the same workload. Unmerged inference yields up to 140ms additional latency when serving four 1024-token requests. This wasted time is enough for the base model to serve four 256-token requests, with resources to spare. The reason for such a high cost is two-fold. 1) Inherently, the additional overhead stems from two additional matrix multiplications and one additional matrix addition per layer, as depicted in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S2.F2.sf1" title="In Figure 2 ‣ 2. Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">2(a)</span></a>; meanwhile, because these computations run in parallel with the base model computations, each layer also requires additional CUDA kernel context operations. 2) In their concrete implementations, all three LoRA adapter batching operators fail to fully utilize the computing units in the GPU, due to significant padding or ill-considered matrix tiling (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS1" title="4.3.1 ATMM: Adaptive-tiling matrix multiplication operator. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.1</span></a>).</p> </div> <figure class="ltx_figure" id="S3.F7"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S3.F7.1" style="width:212.5pt;"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="656" id="S3.F7.1.g1" src="x8.png" width="831"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6. </span>Unmerged inference causes 27-140ms extra latency, equivalent to 40-61% of the base model inference time.</figcaption> </figure> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S3.F7.2" style="width:208.1pt;"> <p class="ltx_p" id="S3.F7.2.1"><span class="ltx_text" id="S3.F7.2.1.1" style="position:relative; bottom:1.4pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="733" id="S3.F7.2.1.1.g1" src="x9.png" width="830"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7. 
</span>Mode switch alone costs 53ms, occupying 64% of the merged inference time of three 256-token requests.</figcaption> </figure> </div> </div> </figure> <div class="ltx_para" id="S3.SS2.p5"> <p class="ltx_p" id="S3.SS2.p5.3"><span class="ltx_text ltx_font_bold" id="S3.SS2.p5.3.4">C3: Inflexible LoRA adapter orchestration.</span> To cope with the distinct performance requirements specified by vision applications, an orchestration that can carefully manage LoRA adapters and flexibly schedule application requests is necessary. We believe an inference mode switch like dLoRA’s is promising, yet it falls significantly short on efficiency: its mode switch yields unacceptable overhead. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a> illustrates a real scheduling state and the mode switch latency for two consecutive inference slots of dLoRA. In this case, dLoRA serves 8 requests, each with an input length of 256, in a first-come-first-served manner. In the first slot, dLoRA serves requests 1-3 in merge mode using the same LoRA adapter, and the heterogeneous requests 4-7 are processed in unmerged mode in the following slot. A mode switch delay of over 53ms forces the last request (the dotted arrow in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a>) to wait 165ms until the next inference slot begins. 
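</p> <p class="ltx_p">For reference, the weight update that a merge or unmerge switch performs on every adapted layer is conceptually simple; the following is a minimal PyTorch sketch with illustrative shapes, not dLoRA’s actual implementation.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Per-layer weight update performed by a merge / unmerge mode switch (sketch).
import torch

d, r = 4096, 64
W = torch.randn(d, d, device="cuda", dtype=torch.float16)   # base projection weight
A = torch.randn(d, r, device="cuda", dtype=torch.float16)   # LoRA factor A
B = torch.randn(r, d, device="cuda", dtype=torch.float16)   # LoRA factor B

def merge(W, A, B):
    # W becomes W + A @ B: afterwards, forward passes need no extra LoRA compute.
    return torch.addmm(W, A, B)

def unmerge(W, A, B):
    # W becomes W - A @ B: restores the clean base weight for other adapters.
    return torch.addmm(W, A, B, alpha=-1.0)

W = merge(W, A, B)     # repeated for every adapted layer in the model
W = unmerge(W, A, B)
</pre> <p class="ltx_p">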
This significant cost stems from 1) unnecessary memory copy of LoRA matrices due to dLoRA’s inefficient memory management, and <span class="ltx_text" id="S3.SS2.p5.3.3" style="color:#000000;">2) the substantial overhead of LoRA matrices <math alttext="\Delta W" class="ltx_Math" display="inline" id="S3.SS2.p5.1.1.m1.1"><semantics id="S3.SS2.p5.1.1.m1.1a"><mrow id="S3.SS2.p5.1.1.m1.1.1" xref="S3.SS2.p5.1.1.m1.1.1.cmml"><mi id="S3.SS2.p5.1.1.m1.1.1.2" mathcolor="#000000" mathvariant="normal" xref="S3.SS2.p5.1.1.m1.1.1.2.cmml">Δ</mi><mo id="S3.SS2.p5.1.1.m1.1.1.1" xref="S3.SS2.p5.1.1.m1.1.1.1.cmml"></mo><mi id="S3.SS2.p5.1.1.m1.1.1.3" mathcolor="#000000" xref="S3.SS2.p5.1.1.m1.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p5.1.1.m1.1b"><apply id="S3.SS2.p5.1.1.m1.1.1.cmml" xref="S3.SS2.p5.1.1.m1.1.1"><times id="S3.SS2.p5.1.1.m1.1.1.1.cmml" xref="S3.SS2.p5.1.1.m1.1.1.1"></times><ci id="S3.SS2.p5.1.1.m1.1.1.2.cmml" xref="S3.SS2.p5.1.1.m1.1.1.2">Δ</ci><ci id="S3.SS2.p5.1.1.m1.1.1.3.cmml" xref="S3.SS2.p5.1.1.m1.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p5.1.1.m1.1c">\Delta W</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p5.1.1.m1.1d">roman_Δ italic_W</annotation></semantics></math> computation, by matrix multiplication <math alttext="A\times B" class="ltx_Math" display="inline" id="S3.SS2.p5.2.2.m2.1"><semantics id="S3.SS2.p5.2.2.m2.1a"><mrow id="S3.SS2.p5.2.2.m2.1.1" xref="S3.SS2.p5.2.2.m2.1.1.cmml"><mi id="S3.SS2.p5.2.2.m2.1.1.2" mathcolor="#000000" xref="S3.SS2.p5.2.2.m2.1.1.2.cmml">A</mi><mo id="S3.SS2.p5.2.2.m2.1.1.1" lspace="0.222em" mathcolor="#000000" rspace="0.222em" xref="S3.SS2.p5.2.2.m2.1.1.1.cmml">×</mo><mi id="S3.SS2.p5.2.2.m2.1.1.3" mathcolor="#000000" xref="S3.SS2.p5.2.2.m2.1.1.3.cmml">B</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p5.2.2.m2.1b"><apply id="S3.SS2.p5.2.2.m2.1.1.cmml" xref="S3.SS2.p5.2.2.m2.1.1"><times id="S3.SS2.p5.2.2.m2.1.1.1.cmml" xref="S3.SS2.p5.2.2.m2.1.1.1"></times><ci id="S3.SS2.p5.2.2.m2.1.1.2.cmml" xref="S3.SS2.p5.2.2.m2.1.1.2">𝐴</ci><ci id="S3.SS2.p5.2.2.m2.1.1.3.cmml" xref="S3.SS2.p5.2.2.m2.1.1.3">𝐵</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p5.2.2.m2.1c">A\times B</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p5.2.2.m2.1d">italic_A × italic_B</annotation></semantics></math>, then added (merge) or subtracted (unmerge) <math alttext="\Delta W" class="ltx_Math" display="inline" id="S3.SS2.p5.3.3.m3.1"><semantics id="S3.SS2.p5.3.3.m3.1a"><mrow id="S3.SS2.p5.3.3.m3.1.1" xref="S3.SS2.p5.3.3.m3.1.1.cmml"><mi id="S3.SS2.p5.3.3.m3.1.1.2" mathcolor="#000000" mathvariant="normal" xref="S3.SS2.p5.3.3.m3.1.1.2.cmml">Δ</mi><mo id="S3.SS2.p5.3.3.m3.1.1.1" xref="S3.SS2.p5.3.3.m3.1.1.1.cmml"></mo><mi id="S3.SS2.p5.3.3.m3.1.1.3" mathcolor="#000000" xref="S3.SS2.p5.3.3.m3.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p5.3.3.m3.1b"><apply id="S3.SS2.p5.3.3.m3.1.1.cmml" xref="S3.SS2.p5.3.3.m3.1.1"><times id="S3.SS2.p5.3.3.m3.1.1.1.cmml" xref="S3.SS2.p5.3.3.m3.1.1.1"></times><ci id="S3.SS2.p5.3.3.m3.1.1.2.cmml" xref="S3.SS2.p5.3.3.m3.1.1.2">Δ</ci><ci id="S3.SS2.p5.3.3.m3.1.1.3.cmml" xref="S3.SS2.p5.3.3.m3.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p5.3.3.m3.1c">\Delta W</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p5.3.3.m3.1d">roman_Δ italic_W</annotation></semantics></math> onto or from the 
base model, by invoking <span class="ltx_text ltx_font_typewriter" id="S3.SS2.p5.3.3.1">torch.addmm</span>, per layer.</span> Conceivably, if the mode switch can be reduced to &lt;10ms (as this paper achieves in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS4.SSS1" title="4.4.1 Swift inference mode switch. ‣ 4.4 Flexible LoRA Adapters Orchestration ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.4.1</span></a>), the average response time in the Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S3.F7" title="Figure 7 ‣ 3.2 Challenges of Empowering Vision Applications with LoRA LMM ‣ 3. Motivation and Challenges ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">7</span></a> case can be reduced by 45ms, with the last request only needing to wait &lt;80ms.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4. </span>V-LoRA Design</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">V-LoRA is an end-to-end system that empowers diverse vision tasks and enriches vision applications with LoRA LMM by addressing the above challenges. We first provide an overview of V-LoRA’s operation, then describe the three core techniques it leverages.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">4.1 System Overview</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1">V-LoRA includes two phases. During the offline phase, the accuracy-aware LoRA adapter generation approach takes the external knowledge from existing domain-specific small models or datasets, as well as the accuracy requirements specified by vision applications (the dotted arrows in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F8" title="Figure 8 ‣ 4.1 System Overview ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">8</span></a>), and generates the minimum number of LoRA adapters (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS2" title="4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.2</span></a>). The generated LoRA adapters are rich in domain-specific knowledge and can output accurate responses to the tasks involved in vision applications.</p> </div> <div class="ltx_para" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1">During the online phase, the flexible LoRA adapter orchestration ingests the requests from vision applications (the solid arrow in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F8" title="Figure 8 ‣ 4.1 System Overview ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">8</span></a>), organizes them into batches, chooses the inference mode, and orchestrates the corresponding LoRA adapters, to minimize the average response latency while guaranteeing each vision application’s latency constraint (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS4" title="4.4 Flexible LoRA Adapters Orchestration ‣ 4. 
V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.4</span></a>). Each request batch is delivered to the corresponding adapters and the LMM and inferred in the chosen mode. The LoRA adapter batching and the inference mode switcher are implemented with ATMM, the adaptive-tiling matrix multiplication operator (§<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3" title="4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3</span></a>), achieving high efficiency.</p> </div> <figure class="ltx_figure" id="S4.F8"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="384" id="S4.F8.g1" src="x10.png" width="789"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 8. </span>V-LoRA overview.</figcaption> </figure> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">4.2 Accuracy-aware LoRA Adapter Generation</h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1">To offer accurate results on domain-specific tasks, we propose accuracy-aware LoRA adapter generation, consisting of an accuracy-aware knowledge-fusion algorithm and the vision task head. </p> </div> <section class="ltx_subsubsection" id="S4.SS2.SSS1"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.2.1 Accuracy-aware knowledge-fusion algorithm.</h4> <div class="ltx_para" id="S4.SS2.SSS1.p1"> <p class="ltx_p" id="S4.SS2.SSS1.p1.1">To make adapters easy to manage at runtime, we aim to integrate external knowledge into the fewest LoRA adapters without violating the accuracy requirements of any vision task. To this end, the training method must account for the limited capacity of the LoRA adapter and the complex accuracy variations arising from knowledge fusion. </p> </div> <figure class="ltx_figure" id="S4.F9"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="325" id="S4.F9.g1" src="x11.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 9. </span>Accuracy-aware LoRA generation integrates the domain-specific knowledge into LoRA adapters with the proposed accuracy-aware knowledge-fusion algorithm.</figcaption> </figure> <figure class="ltx_figure" id="S4.F10"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="524" id="S4.F10.g1" src="x12.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 10. </span><span class="ltx_text" id="S4.F10.2.1" style="color:#000000;">An example of fusing the knowledge from six object detection models, each detecting one class of object, with the accuracy-aware knowledge-fusion algorithm. </span></figcaption> </figure> <div class="ltx_para" id="S4.SS2.SSS1.p2"> <p class="ltx_p" id="S4.SS2.SSS1.p2.1">This problem is highly challenging. Suppose we have an oracle that knows in advance the accuracy of a LoRA adapter fusing any combination of knowledge. 
Then, the problem can be formulated as a <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS1.p2.1.1">constrained bin-packing problem</span> <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib36" title="">garey1981approximation, </a>)</cite>, where the objective is to pack the knowledge into the minimum number of LoRA adapter bins, ensuring each adapter maintains each vision task’s accuracy beyond the requirement. However, this relaxed variant is NP-hard, and unfortunately, such an oracle does not exist.</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p3"> <p class="ltx_p" id="S4.SS2.SSS1.p3.1"><span class="ltx_text" id="S4.SS2.SSS1.p3.1.1" style="color:#000000;">To solve the original problem, we propose a simple and easy-to-implement heuristic algorithm, the <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS1.p3.1.1.1">accuracy-aware knowledge-fusion algorithm</span>, to determine which knowledge fuses into one LoRA adapter.</span> It first collects the dataset, as illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F9" title="Figure 9 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">9</span></a>, by executing representative data on every existing domain-specific small model; if applications provide the datasets, we directly use them. After that, the training process is a standard supervised learning pipeline that computes the cross-entropy loss, <math alttext="L=CE(y,\hat{y})" class="ltx_Math" display="inline" id="S4.SS2.SSS1.p3.1.m1.2"><semantics id="S4.SS2.SSS1.p3.1.m1.2a"><mrow id="S4.SS2.SSS1.p3.1.m1.2.3" xref="S4.SS2.SSS1.p3.1.m1.2.3.cmml"><mi id="S4.SS2.SSS1.p3.1.m1.2.3.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.2.cmml">L</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.1" xref="S4.SS2.SSS1.p3.1.m1.2.3.1.cmml">=</mo><mrow id="S4.SS2.SSS1.p3.1.m1.2.3.3" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.cmml"><mi id="S4.SS2.SSS1.p3.1.m1.2.3.3.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.2.cmml">C</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.1" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.1.cmml"></mo><mi id="S4.SS2.SSS1.p3.1.m1.2.3.3.3" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.3.cmml">E</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.1a" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.1.cmml"></mo><mrow id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml"><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2.1" stretchy="false" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml">(</mo><mi id="S4.SS2.SSS1.p3.1.m1.1.1" xref="S4.SS2.SSS1.p3.1.m1.1.1.cmml">y</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2.2" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml">,</mo><mover accent="true" id="S4.SS2.SSS1.p3.1.m1.2.2" xref="S4.SS2.SSS1.p3.1.m1.2.2.cmml"><mi id="S4.SS2.SSS1.p3.1.m1.2.2.2" xref="S4.SS2.SSS1.p3.1.m1.2.2.2.cmml">y</mi><mo id="S4.SS2.SSS1.p3.1.m1.2.2.1" xref="S4.SS2.SSS1.p3.1.m1.2.2.1.cmml">^</mo></mover><mo id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2.3" stretchy="false" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.SSS1.p3.1.m1.2b"><apply id="S4.SS2.SSS1.p3.1.m1.2.3.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3"><eq id="S4.SS2.SSS1.p3.1.m1.2.3.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.1"></eq><ci id="S4.SS2.SSS1.p3.1.m1.2.3.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.2">𝐿</ci><apply id="S4.SS2.SSS1.p3.1.m1.2.3.3.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3"><times 
id="S4.SS2.SSS1.p3.1.m1.2.3.3.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.1"></times><ci id="S4.SS2.SSS1.p3.1.m1.2.3.3.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.2">𝐶</ci><ci id="S4.SS2.SSS1.p3.1.m1.2.3.3.3.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.3">𝐸</ci><interval closure="open" id="S4.SS2.SSS1.p3.1.m1.2.3.3.4.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.3.3.4.2"><ci id="S4.SS2.SSS1.p3.1.m1.1.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.1.1">𝑦</ci><apply id="S4.SS2.SSS1.p3.1.m1.2.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.2"><ci id="S4.SS2.SSS1.p3.1.m1.2.2.1.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.2.1">^</ci><ci id="S4.SS2.SSS1.p3.1.m1.2.2.2.cmml" xref="S4.SS2.SSS1.p3.1.m1.2.2.2">𝑦</ci></apply></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.SSS1.p3.1.m1.2c">L=CE(y,\hat{y})</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.SSS1.p3.1.m1.2d">italic_L = italic_C italic_E ( italic_y , over^ start_ARG italic_y end_ARG )</annotation></semantics></math>, for parameter update. The knowledge fusion employs greedy and accuracy-aware heuristics. It begins with a random dataset and sequentially uses each dataset to train the LoRA adapter until its accuracy on a specific task falls below the required threshold. If this occurs, the adapter’s weights are rolled back, and a new adapter is initialized to learn from the most recent dataset. The worst case of such a method may generate one LoRA adapter for each dataset, but in our practical experiments, every LoRA adapter fuses 4 domains of knowledge (datasets) on average.</p> </div> <figure class="ltx_figure" id="S4.F11"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="333" id="S4.F11.g1" src="x13.png" width="788"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 11. </span><span class="ltx_text" id="S4.F11.2.1" style="color:#000000;">Vision task head. Replacing the original language modeling head in LMM with vision task head reduces 4 rounds of LMM inference on the action recognition task of Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S2.F1" title="Figure 1 ‣ 2. Background ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">1</span></a>.</span></figcaption> </figure> <div class="ltx_para" id="S4.SS2.SSS1.p4"> <p class="ltx_p" id="S4.SS2.SSS1.p4.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F10" title="Figure 10 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">10</span></a> illustrates an example of integrating the knowledge of six object detection models, each detecting one class of target, into LoRA adapters. When fusing the vegetation-detect model (Step ④), the accuracy of LoRA adapter 1 fails on the license plate, needs above 80%, and traffic sign detection, 85%. Hence, the algorithm returns the LoRA adapter 1 of the previous state and initializes LoRA adapter 2 (Step ⑤). As the vegetation, bicycle, and person datasets are fused into LoRA adapter 2 without accuracy violation, we get the second LoRA adapter (Step ⑥). As LMM has powerful learning abilities, the training procedure costs only 25 minutes in this example. We leave the exploration of this to the future. 
</p> </div> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.2.2 Vision task head.</h4> <div class="ltx_para" id="S4.SS2.SSS2.p1"> <p class="ltx_p" id="S4.SS2.SSS2.p1.1">To reduce the inference latency, we design the vision task head. It is designed as a trainable linear layer as a part of the LoRA adapter, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F11" title="Figure 11 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">11</span></a>, to predict task-specific results based on the output features of LMM. <span class="ltx_text" id="S4.SS2.SSS2.p1.1.1" style="color:#000000;">Vision task heads can be flexibly customized to various vision tasks during the LoRA adapter training, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.1.1">e.g.,</em> action recognition head in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F11" title="Figure 11 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">11</span></a>, provided the fusing knowledge is from the same task type. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F11" title="Figure 11 ‣ 4.2.1 Accuracy-aware knowledge-fusion algorithm. ‣ 4.2 Accuracy-aware LoRA Adapter Generation ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">11</span></a> compares doing action recognition with the original language modeling (LM) head and the vision task head. By replacing LM head with the vision task head, LMM saves 4 inference rounds, around 180ms time cost.</span> The reason to do so is that the outputs of a large portion of vision tasks are a limited discrete set of candidate options, such as the number of vehicle counts <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib57" title="">li2020reducto, </a>)</cite>, classes of action recognition <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib90" title="">yuan2022PacketGame, </a>)</cite>, and binary query for a specific target on image or video <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib86" title="">yi2020eagleeye, </a>)</cite>.</p> </div> <div class="ltx_para" id="S4.SS2.SSS2.p2"> <p class="ltx_p" id="S4.SS2.SSS2.p2.1"><span class="ltx_text" id="S4.SS2.SSS2.p2.1.1" style="color:#000000;">We retain the LM head for vision applications that need the natural language interface. 
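</span></p> <p class="ltx_p">Functionally, a vision task head replaces the autoregressive LM head with a single classification layer over the LMM’s final hidden state. The following is a minimal PyTorch sketch with illustrative dimensions, not the exact head used in V-LoRA.</p> <pre class="ltx_verbatim ltx_font_typewriter">
# Vision task head: one linear layer over the LMM's last hidden state replaces
# multi-round autoregressive decoding for closed-set vision tasks.
import torch
import torch.nn as nn

hidden_dim, num_classes = 4096, 101        # e.g., UCF101 action classes (illustrative)

class VisionTaskHead(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_classes)

    def forward(self, last_hidden_state):
        # Use the final token's feature as the task representation (one simple choice);
        # a single forward pass yields the answer, with no decoding loop.
        return self.proj(last_hidden_state[:, -1, :])

head = VisionTaskHead(hidden_dim, num_classes)
feats = torch.randn(2, 256, hidden_dim)    # stand-in for LMM output features
pred = head(feats).argmax(dim=-1)          # predicted class ids
</pre> <p class="ltx_p"><span class="ltx_text" style="color:#000000;">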
For example, when a video query application asks for “A boy wearing a red sweater lost at the corner” and specifies person detection, V-LoRA invokes the corresponding LoRA adapter containing a detection head for an efficient response. (Some work, e.g., task automation (wen2024autodroid; li2024personal) and dynamic LoRA (feng2024mixture), automatically identifies adapters from queries; such work is orthogonal to V-LoRA.) By using the LoRA adapter batching operator, ATMM, introduced in the next section, these different vision task heads can be executed concurrently, as well as in parallel with the LM head.

4.3 Adaptive-tiling LoRA Adapters Batching

Serving vision applications with an LMM very likely invokes heterogeneous LoRA adapters concurrently. To compute them with high performance, we propose an Adaptive-Tiling Matrix Multiplication operator, ATMM, which enables efficient unmerged inference, inference mode switching (§4.4.1), and the mixture inference mode (§4.4.2).

Table 1. Adaptive tiling saves remarkable computing time compared to static-tiling Punica. Configuration (a, b, c, d, e, f) denotes the thread block tiles a×b and b×c and the warp tiles d×e and e×f; thread tiles are omitted to save space. Input (u×v, s×t) denotes the input matrix shapes. The note under each latency explains why that configuration performs poorly under the corresponding input.

| Tiling Configuration | Input ① (256×4096, 4096×32) | Input ② (8192×4096, 4096×128) |
| --- | --- | --- |
| Punica (chen2024punica): (16, 64, 64, 16, 16, 64) | 0.087 ms (low SM utilization) | 0.19 ms (frequent global memory access) |
| Config ①: (64, 32, 32, 32, 32, 32) | **0.07 ms** | 0.12 ms (frequent global memory access) |
| Config ②: (64, 64, 64, 32, 64, 64) | 0.13 ms (low SM utilization) | **0.10 ms** |
4.3.1 ATMM: Adaptive-tiling matrix multiplication operator.

Directly batching heterogeneous adapter computations on standard kernels, such as the batched GEMM (General Matrix Multiplication) used in dLoRA (wu2024dlora), is feasible but yields excessive latency and hardware underutilization. This stems from the significant padding caused by the heterogeneity of application request lengths and LoRA adapter ranks. Hence, a customized kernel is necessary.

To motivate our design, we first analyze two existing customized kernels, from S-LoRA (sheng2023slora) and Punica (chen2024punica), respectively. S-LoRA’s kernel uses the tiling technique to avoid the significant padding: unlike batched GEMM, which pads heterogeneous input matrices to a uniform shape, it splits them into fine-grained blocks and computes the output of each block in parallel on CUDA cores. Punica’s kernel also employs tiling and further improves efficiency by leveraging the well-developed CUTLASS library (cutlass) and higher-performance Tensor cores (tensorcore). However, as the motivational study in §3.2 shows, both kernels fail to achieve satisfactory efficiency. The root cause is their static tiling configurations, which cannot handle diverse input shapes and thus leave computational resources underutilized.

Our key observation is that computational efficiency varies significantly with the tiling configuration. We run an experiment with two input shapes and three tiling configurations and present the results in Table 1. Under Input ① and Input ②, the three tiling configurations yield up to a 1.9× latency difference. This stems from tiling that is mismatched with the input shape and the hardware architecture, as well as across tiling levels. Fig. 12 compares two configuration pairs from Table 1. In Fig. 12(a), Punica’s smaller-tile configuration produces more thread block tiles and warp tiles than Config ②; the resulting more frequent accesses to global and shared memory, and the extra data transfers they launch, account for Punica’s 1.9× higher latency. However, larger tiles do not always help. In Fig. 12(b), Config ②’s large tiles cause Streaming Multiprocessor (SM) under-utilization: it can occupy only 64 of the 108 SMs on an A100, since each thread block tile is assigned to a single SM, far fewer than Config ①. In sum, dynamic tiling is needed to achieve efficiency.
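To make this trade-off concrete, the following is a minimal Python sketch of the tile hierarchy that a configuration (a, b, c, d, e, f) from Table 1 induces. It is a deliberately simplified proxy of our own, not V-LoRA’s or Punica’s cost model: it ignores split-K, thread tiles, and batching across adapters, and only counts thread block tiles, warps per block, and K-dimension steps.

```python
from collections import namedtuple
from math import ceil

# Table 1 notation: thread block tiles a x b and b x c, warp tiles d x e and e x f.
TileConfig = namedtuple("TileConfig", "a b c d e f")

def tiling_footprint(m, k, n, cfg, num_sms=108):
    """Coarse footprint of an (m x k) @ (k x n) product under `cfg`.
    Simplification: every a x c output tile is one thread block; each block holds
    (a/d) * (c/f) warps and walks the K dimension in steps of b."""
    blocks = ceil(m / cfg.a) * ceil(n / cfg.c)
    warps_per_block = (cfg.a // cfg.d) * (cfg.c // cfg.f)
    k_steps = ceil(k / cfg.b)                        # shared-memory refills per block
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "k_steps": k_steps,
        "sm_fill": min(blocks, num_sms) / num_sms,   # crude occupancy proxy
    }

punica = TileConfig(16, 64, 64, 16, 16, 64)
cfg1 = TileConfig(64, 32, 32, 32, 32, 32)
cfg2 = TileConfig(64, 64, 64, 32, 64, 64)

inputs = {"Input 1": (256, 4096, 32), "Input 2": (8192, 4096, 128)}
for in_name, (m, k, n) in inputs.items():
    for cfg_name, cfg in [("Punica", punica), ("Config 1", cfg1), ("Config 2", cfg2)]:
        print(in_name, cfg_name, tiling_footprint(m, k, n, cfg))
```

Even this crude proxy shows the two failure modes qualitatively: under Input ②, Punica’s 16×64 blocks produce 4× as many thread block tiles as Config ② and hence more memory traffic, while under Input ① Config ② produces only 4 blocks, far fewer than the 108 SMs of an A100.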
</span></p> </div> <figure class="ltx_figure" id="S4.F12"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F12.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="349" id="S4.F12.sf1.g1" src="x14.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F12.sf1.2.1.1" style="font-size:80%;">(a)</span> </span><span class="ltx_text" id="S4.F12.sf1.3.2" style="font-size:80%;">Under heavier Input ②, Punica’s tiling (upper half) yields more tiles, so more costly memory read/write compared to Config ② (lower half).</span></figcaption> </figure> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F12.sf2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="240" id="S4.F12.sf2.g1" src="x15.png" width="829"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F12.sf2.2.1.1" style="font-size:80%;">(b)</span> </span><span class="ltx_text" id="S4.F12.sf2.3.2" style="font-size:80%;">But under lighter Input ①, Config ② (lower half) cannot deliver good performance because it underutilizes SM since fewer block number.</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 12. </span><span class="ltx_text" id="S4.F12.2.1" style="color:#000000;">Paired comparison of tilling configurations in Table <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.T1" title="Table 1 ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">1</span></a>. Following tiling configurations, two multiplied matrices are divided into thread block tiles and further warp tiles. The data in each tile is transferred to their corresponding memory. </span></figcaption> </figure> <div class="ltx_para" id="S4.SS3.SSS1.p4"> <p class="ltx_p" id="S4.SS3.SSS1.p4.1">We propose ATMM, an adaptive-tiling matrix multiplication CUDA kernel that fully utilizes computational resources, achieving efficient heterogeneous LoRA adapters batching. It has two key design choices. (1) Adaptive tiling against input shapes. Given two input matrices, ATMM retrieves the optimal tiling configuration from the <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS1.p4.1.1">hash table</span> (established in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS2" title="4.3.2 Profile-based optimal tiling search. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. 
V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.2</span></a>, and following this configuration, divides the matrix into <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS1.p4.1.2">thread block tiles</span>, then further into <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS1.p4.1.3">warp tiles<span class="ltx_note ltx_role_footnote" id="footnote4"><sup class="ltx_note_mark">§</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">§</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_upright" id="footnote4.1.1.1">§</span></span><span class="ltx_text ltx_font_upright" id="footnote4.5">Each warp can be divided into thread tiles in ¡16,8,16¿ or ¡16,8,8¿ shape.</span></span></span></span></span>. (2) Pipeline data loading and computing. After tiling, ATMM transfers the data of each tile to their corresponding memories, the correspondence as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F12.sf1" title="In Figure 12 ‣ 4.3.1 ATMM: Adaptive-tiling matrix multiplication operator. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">12(a)</span></a>, and executes computation on CUDA/Tensor cores with the corresponding executable kernel (pre-compiled in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.SS3.SSS2" title="4.3.2 Profile-based optimal tiling search. ‣ 4.3 Adaptive-tiling LoRA Adapters Batching ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">4.3.2</span></a>). To hide data loading latency, ATMM allocates double the space in shared memory and register file for each tile: one for current tile’s computation and one for prefetching next tile’s data. This double buffering is feasible due to each level’s uniform tile shape, enabling efficient location of the next tile (more in Appx.<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#A1" title="Appendix A Adaptive-tiling Matrix Multiplication. ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">A</span></a>).</p> </div> </section> <section class="ltx_subsubsection" id="S4.SS3.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.3.2 Profile-based optimal tiling search.</h4> <div class="ltx_para" id="S4.SS3.SSS2.p1"> <p class="ltx_p" id="S4.SS3.SSS2.p1.1">Keeping the benefit of adaptive tiling in mind, we aim to search out the optimal tiling configuration for every different input and switch among them at runtime. This can be done offline, as it is only affected by input shapes rather than their numerical values. However, this is still challenging due to the complicated thread parallelism mechanism and memory access patterns of GPU, and the large search space of input shapes. </p> </div> <div class="ltx_para" id="S4.SS3.SSS2.p2"> <p class="ltx_p" id="S4.SS3.SSS2.p2.1">To tackle this challenge, we approach the search as a black-box problem and propose a profile-based searching algorithm. 
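The host-side control flow of these two design choices can be sketched as follows. This is an illustrative sketch only, not V-LoRA’s actual API: `tiling_table`, `launch_kernel`, `load_tile`, and `compute_tile` are placeholders for the pre-built shape-to-kernel hash table (§4.3.2) and the device-side work, and in the real kernel the prefetch overlaps with the computation on the GPU, whereas here only the buffer rotation is shown.

```python
# Illustrative host-side sketch of ATMM dispatch (placeholder names, not V-LoRA's API).

def atmm(x, w, tiling_table, launch_kernel):
    """x: (m, k) activations; w: (k, n) LoRA weight, both resident on the GPU.
    Design choice (1): the tiling configuration and its pre-compiled kernel are
    looked up by input shape in a hash table built offline."""
    (m, k), (k2, n) = x.shape, w.shape
    assert k == k2
    cfg, kernel = tiling_table[(m, k, n)]
    return launch_kernel(kernel, x, w, cfg)

def double_buffered_loop(tiles, load_tile, compute_tile):
    """Design choice (2): double buffering. Two buffers are kept per memory level;
    while the current buffer is being consumed, the next tile is prefetched into
    the other one. Uniform tile shapes make the next tile's address easy to find."""
    buffers = [load_tile(tiles[0]), None]
    for i in range(len(tiles)):
        cur = buffers[i % 2]
        if i + 1 < len(tiles):
            buffers[(i + 1) % 2] = load_tile(tiles[i + 1])  # prefetch next tile
        compute_tile(cur)
```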
4.3.2 Profile-based optimal tiling search.

Keeping the benefit of adaptive tiling in mind, we aim to find the optimal tiling configuration for every distinct input and switch among them at runtime. This can be done offline, since the choice depends only on the input shapes, not on their numerical values. It is still challenging, however, because of the GPU’s complicated thread-parallelism mechanism and memory-access patterns, and the large search space of input shapes.

To tackle this challenge, we treat the search as a black-box problem and propose a profile-based search algorithm. It profiles the execution time of all possible ATMM input matrix shapes with the help of the CUTLASS Profiler (cutlass), records the optimal tiling configurations in a hash table, and compiles the corresponding executable kernels. We use the following expert knowledge to reduce the search space. (1) From the hardware perspective, architectural characteristics restrict the feasible input shapes (e.g., the limited memory at each hierarchy level) and tiling configurations (e.g., every tile dimension must be at least 16 and a power of two). (2) From the perspective of the input data, the model dimensions of LMMs constrain the input matrix shapes to change in large steps (e.g., 4096 in Qwen-VL). Appx. B gives the detailed algorithm. With this algorithm, searching all optimal configurations for vision tasks with Qwen-VL on an A100 GPU takes less than 30 minutes.
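A minimal sketch of this offline search is shown below, under the pruning rules stated above. It is an assumption-laden illustration rather than the released algorithm: `profile_latency` stands in for a call to the CUTLASS Profiler, the candidate shape grid is hypothetical, and the feasibility check only encodes the “power of two, at least 16, warp tile fits in block tile” rules.

```python
from itertools import product

def pow2_dims(lo=16, hi=128):
    """Rule (1): tile dimensions are powers of two and at least 16."""
    v = lo
    while v <= hi:
        yield v
        v *= 2

def build_tiling_table(shapes, profile_latency, hi=128):
    """Profile every feasible configuration for every candidate shape and keep
    the fastest one per shape in a hash table (shape -> configuration)."""
    dims = list(pow2_dims(16, hi))
    table = {}
    for shape in shapes:                       # rule (2): shapes vary in large steps
        best_latency, best_cfg = float("inf"), None
        for a, b, c, d, e, f in product(dims, repeat=6):
            if d > a or e > b or f > c:        # warp tile must fit inside block tile
                continue
            latency = profile_latency(shape, (a, b, c, d, e, f))
            if latency < best_latency:
                best_latency, best_cfg = latency, (a, b, c, d, e, f)
        table[shape] = best_cfg
    return table

# Hypothetical shape grid for a 4096-dimensional model and a few LoRA ranks:
# shapes = [(m, 4096, r) for m in (256, 1024, 4096, 8192) for r in (32, 64, 128)]
```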
<span class="ltx_text" id="S4.SS4.SSS1.p1.6.6" style="color:#000000;">To reduce these overheads, one common method is pre-computing LoRA matrices (<em class="ltx_emph ltx_font_italic" id="S4.SS4.SSS1.p1.6.6.1">i.e.,</em>, matrix <math alttext="\Delta W" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.1.1.m1.1"><semantics id="S4.SS4.SSS1.p1.1.1.m1.1a"><mrow id="S4.SS4.SSS1.p1.1.1.m1.1.1" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.cmml"><mi id="S4.SS4.SSS1.p1.1.1.m1.1.1.2" mathcolor="#000000" mathvariant="normal" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.2.cmml">Δ</mi><mo id="S4.SS4.SSS1.p1.1.1.m1.1.1.1" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.1.cmml"></mo><mi id="S4.SS4.SSS1.p1.1.1.m1.1.1.3" mathcolor="#000000" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.1.1.m1.1b"><apply id="S4.SS4.SSS1.p1.1.1.m1.1.1.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1"><times id="S4.SS4.SSS1.p1.1.1.m1.1.1.1.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.1"></times><ci id="S4.SS4.SSS1.p1.1.1.m1.1.1.2.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.2">Δ</ci><ci id="S4.SS4.SSS1.p1.1.1.m1.1.1.3.cmml" xref="S4.SS4.SSS1.p1.1.1.m1.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.1.1.m1.1c">\Delta W</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.1.1.m1.1d">roman_Δ italic_W</annotation></semantics></math>, <math alttext="B\times A" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.2.2.m2.1"><semantics id="S4.SS4.SSS1.p1.2.2.m2.1a"><mrow id="S4.SS4.SSS1.p1.2.2.m2.1.1" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.cmml"><mi id="S4.SS4.SSS1.p1.2.2.m2.1.1.2" mathcolor="#000000" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.2.cmml">B</mi><mo id="S4.SS4.SSS1.p1.2.2.m2.1.1.1" lspace="0.222em" mathcolor="#000000" rspace="0.222em" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.1.cmml">×</mo><mi id="S4.SS4.SSS1.p1.2.2.m2.1.1.3" mathcolor="#000000" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.3.cmml">A</mi></mrow><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.2.2.m2.1b"><apply id="S4.SS4.SSS1.p1.2.2.m2.1.1.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1"><times id="S4.SS4.SSS1.p1.2.2.m2.1.1.1.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.1"></times><ci id="S4.SS4.SSS1.p1.2.2.m2.1.1.2.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.2">𝐵</ci><ci id="S4.SS4.SSS1.p1.2.2.m2.1.1.3.cmml" xref="S4.SS4.SSS1.p1.2.2.m2.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.2.2.m2.1c">B\times A</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.2.2.m2.1d">italic_B × italic_A</annotation></semantics></math>) for the entire base model and storing them in main memory, then swapping into GPU when needed. 
However, such a mass of data, <em class="ltx_emph ltx_font_italic" id="S4.SS4.SSS1.p1.6.6.2">e.g.,</em> <math alttext="\sim" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.3.3.m3.1"><semantics id="S4.SS4.SSS1.p1.3.3.m3.1a"><mo id="S4.SS4.SSS1.p1.3.3.m3.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.3.3.m3.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.3.3.m3.1b"><csymbol cd="latexml" id="S4.SS4.SSS1.p1.3.3.m3.1.1.cmml" xref="S4.SS4.SSS1.p1.3.3.m3.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.3.3.m3.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.3.3.m3.1d">∼</annotation></semantics></math>3GB per LoRA of QWen-VL-7B<span class="ltx_note ltx_role_footnote" id="footnote5"><sup class="ltx_note_mark">¶</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">¶</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text" id="footnote5.1.1.1" style="color:#000000;">¶</span></span><span class="ltx_text" id="footnote5.5" style="color:#000000;">Layer#</span><math alttext="\times\Delta W\times" class="ltx_math_unparsed" display="inline" id="footnote5.m1.1"><semantics id="footnote5.m1.1b"><mrow id="footnote5.m1.1c"><mo id="footnote5.m1.1.1" mathcolor="#000000" rspace="0.222em">×</mo><mi id="footnote5.m1.1.2" mathcolor="#000000" mathvariant="normal">Δ</mi><mi id="footnote5.m1.1.3" mathcolor="#000000">W</mi><mo id="footnote5.m1.1.4" lspace="0.222em" mathcolor="#000000">×</mo></mrow><annotation encoding="application/x-tex" id="footnote5.m1.1d">\times\Delta W\times</annotation><annotation encoding="application/x-llamapun" id="footnote5.m1.1e">× roman_Δ italic_W ×</annotation></semantics></math><span class="ltx_text" id="footnote5.6" style="color:#000000;">precise, 32</span><math alttext="\times" class="ltx_Math" display="inline" id="footnote5.m2.1"><semantics id="footnote5.m2.1b"><mo id="footnote5.m2.1.1" mathcolor="#000000" xref="footnote5.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="footnote5.m2.1c"><times id="footnote5.m2.1.1.cmml" xref="footnote5.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m2.1d">\times</annotation><annotation encoding="application/x-llamapun" id="footnote5.m2.1e">×</annotation></semantics></math><span class="ltx_text" id="footnote5.7" style="color:#000000;">4096</span><math alttext="\times" class="ltx_Math" display="inline" id="footnote5.m3.1"><semantics id="footnote5.m3.1b"><mo id="footnote5.m3.1.1" mathcolor="#000000" xref="footnote5.m3.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="footnote5.m3.1c"><times id="footnote5.m3.1.1.cmml" xref="footnote5.m3.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m3.1d">\times</annotation><annotation encoding="application/x-llamapun" id="footnote5.m3.1e">×</annotation></semantics></math><span class="ltx_text" id="footnote5.8" style="color:#000000;">4096</span><math alttext="\times" class="ltx_Math" display="inline" id="footnote5.m4.1"><semantics id="footnote5.m4.1b"><mo id="footnote5.m4.1.1" mathcolor="#000000" xref="footnote5.m4.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="footnote5.m4.1c"><times id="footnote5.m4.1.1.cmml" xref="footnote5.m4.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m4.1d">\times</annotation><annotation encoding="application/x-llamapun" 
id="footnote5.m4.1e">×</annotation></semantics></math><span class="ltx_text" id="footnote5.9" style="color:#000000;">FP16, </span><math alttext="\Delta W" class="ltx_Math" display="inline" id="footnote5.m5.1"><semantics id="footnote5.m5.1b"><mrow id="footnote5.m5.1.1" xref="footnote5.m5.1.1.cmml"><mi id="footnote5.m5.1.1.2" mathcolor="#000000" mathvariant="normal" xref="footnote5.m5.1.1.2.cmml">Δ</mi><mo id="footnote5.m5.1.1.1" xref="footnote5.m5.1.1.1.cmml"></mo><mi id="footnote5.m5.1.1.3" mathcolor="#000000" xref="footnote5.m5.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" id="footnote5.m5.1c"><apply id="footnote5.m5.1.1.cmml" xref="footnote5.m5.1.1"><times id="footnote5.m5.1.1.1.cmml" xref="footnote5.m5.1.1.1"></times><ci id="footnote5.m5.1.1.2.cmml" xref="footnote5.m5.1.1.2">Δ</ci><ci id="footnote5.m5.1.1.3.cmml" xref="footnote5.m5.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="footnote5.m5.1d">\Delta W</annotation><annotation encoding="application/x-llamapun" id="footnote5.m5.1e">roman_Δ italic_W</annotation></semantics></math><span class="ltx_text" id="footnote5.10" style="color:#000000;">’s shape is same as LMM.</span></span></span></span>, incurs very high delay, <math alttext="\sim" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.4.4.m4.1"><semantics id="S4.SS4.SSS1.p1.4.4.m4.1a"><mo id="S4.SS4.SSS1.p1.4.4.m4.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.4.4.m4.1.1.cmml">∼</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.4.4.m4.1b"><csymbol cd="latexml" id="S4.SS4.SSS1.p1.4.4.m4.1.1.cmml" xref="S4.SS4.SSS1.p1.4.4.m4.1.1">similar-to</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.4.4.m4.1c">\sim</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.4.4.m4.1d">∼</annotation></semantics></math>1s each, on swapping. Conversely, our swift mode switcher computes LoRA matrices at runtime and stores adapters, <em class="ltx_emph ltx_font_italic" id="S4.SS4.SSS1.p1.6.6.3">i.e.,</em> <math alttext="A" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.5.5.m5.1"><semantics id="S4.SS4.SSS1.p1.5.5.m5.1a"><mi id="S4.SS4.SSS1.p1.5.5.m5.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.5.5.m5.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.5.5.m5.1b"><ci id="S4.SS4.SSS1.p1.5.5.m5.1.1.cmml" xref="S4.SS4.SSS1.p1.5.5.m5.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.5.5.m5.1c">A</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.5.5.m5.1d">italic_A</annotation></semantics></math> and <math alttext="B" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.6.6.m6.1"><semantics id="S4.SS4.SSS1.p1.6.6.m6.1a"><mi id="S4.SS4.SSS1.p1.6.6.m6.1.1" mathcolor="#000000" xref="S4.SS4.SSS1.p1.6.6.m6.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.6.6.m6.1b"><ci id="S4.SS4.SSS1.p1.6.6.m6.1.1.cmml" xref="S4.SS4.SSS1.p1.6.6.m6.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.6.6.m6.1c">B</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.6.6.m6.1d">italic_B</annotation></semantics></math>, only 43MB each, on GPU. It has two core designs. (1) Eliminate unnecessary memory copy with contiguous memory allocation. Our switcher pre-allocates contiguous memory for weight matrices, avoiding the memory copy in tensor reshape, enables efficient in-place LoRA matrices un-/merge. 
(2) Compute all-layer LoRA matrices and un-/merge them in one shot. With ATMM, the switcher can efficiently compute LoRA matrices of the entire model and add/subtract all of them onto/from the base mode weights in one shot.</span> By doing so, our mode switch costs only ¡10ms, which speeds up dLoRA ¿5<math alttext="\times" class="ltx_Math" display="inline" id="S4.SS4.SSS1.p1.7.m1.1"><semantics id="S4.SS4.SSS1.p1.7.m1.1a"><mo id="S4.SS4.SSS1.p1.7.m1.1.1" xref="S4.SS4.SSS1.p1.7.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.SS4.SSS1.p1.7.m1.1b"><times id="S4.SS4.SSS1.p1.7.m1.1.1.cmml" xref="S4.SS4.SSS1.p1.7.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.SS4.SSS1.p1.7.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.SS4.SSS1.p1.7.m1.1d">×</annotation></semantics></math>.</p> </div> <figure class="ltx_float ltx_float_algorithm ltx_framed ltx_framed_top" id="alg1"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_float"><span class="ltx_text ltx_font_bold" id="alg1.4.1.1">Algorithm 1</span> </span> Scheduling Policy</figcaption> <div class="ltx_listing ltx_listing" id="alg1.5"> <div class="ltx_listingline" id="alg1.l1"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l1.1.1.1" style="font-size:80%;">1:</span></span><span class="ltx_text" id="alg1.l1.2" style="font-size:80%;">Request </span><math alttext="R\!=\!\!\left\{r_{1},...,r_{n}\right\}" class="ltx_Math" display="inline" id="alg1.l1.m1.3"><semantics id="alg1.l1.m1.3a"><mrow id="alg1.l1.m1.3.3" xref="alg1.l1.m1.3.3.cmml"><mi id="alg1.l1.m1.3.3.4" mathsize="80%" xref="alg1.l1.m1.3.3.4.cmml">R</mi><mpadded width="0.726em"><mo id="alg1.l1.m1.3.3.3" lspace="0.108em" mathsize="80%" xref="alg1.l1.m1.3.3.3.cmml">=</mo></mpadded><mrow id="alg1.l1.m1.3.3.2.2" xref="alg1.l1.m1.3.3.2.3.cmml"><mo id="alg1.l1.m1.3.3.2.2.3" xref="alg1.l1.m1.3.3.2.3.cmml">{</mo><msub id="alg1.l1.m1.2.2.1.1.1" xref="alg1.l1.m1.2.2.1.1.1.cmml"><mi id="alg1.l1.m1.2.2.1.1.1.2" mathsize="80%" xref="alg1.l1.m1.2.2.1.1.1.2.cmml">r</mi><mn id="alg1.l1.m1.2.2.1.1.1.3" mathsize="80%" xref="alg1.l1.m1.2.2.1.1.1.3.cmml">1</mn></msub><mo id="alg1.l1.m1.3.3.2.2.4" mathsize="80%" xref="alg1.l1.m1.3.3.2.3.cmml">,</mo><mi id="alg1.l1.m1.1.1" mathsize="80%" mathvariant="normal" xref="alg1.l1.m1.1.1.cmml">…</mi><mo id="alg1.l1.m1.3.3.2.2.5" mathsize="80%" xref="alg1.l1.m1.3.3.2.3.cmml">,</mo><msub id="alg1.l1.m1.3.3.2.2.2" xref="alg1.l1.m1.3.3.2.2.2.cmml"><mi id="alg1.l1.m1.3.3.2.2.2.2" mathsize="80%" xref="alg1.l1.m1.3.3.2.2.2.2.cmml">r</mi><mi id="alg1.l1.m1.3.3.2.2.2.3" mathsize="80%" xref="alg1.l1.m1.3.3.2.2.2.3.cmml">n</mi></msub><mo id="alg1.l1.m1.3.3.2.2.6" xref="alg1.l1.m1.3.3.2.3.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l1.m1.3b"><apply id="alg1.l1.m1.3.3.cmml" xref="alg1.l1.m1.3.3"><eq id="alg1.l1.m1.3.3.3.cmml" xref="alg1.l1.m1.3.3.3"></eq><ci id="alg1.l1.m1.3.3.4.cmml" xref="alg1.l1.m1.3.3.4">𝑅</ci><set id="alg1.l1.m1.3.3.2.3.cmml" xref="alg1.l1.m1.3.3.2.2"><apply id="alg1.l1.m1.2.2.1.1.1.cmml" xref="alg1.l1.m1.2.2.1.1.1"><csymbol cd="ambiguous" id="alg1.l1.m1.2.2.1.1.1.1.cmml" xref="alg1.l1.m1.2.2.1.1.1">subscript</csymbol><ci id="alg1.l1.m1.2.2.1.1.1.2.cmml" xref="alg1.l1.m1.2.2.1.1.1.2">𝑟</ci><cn id="alg1.l1.m1.2.2.1.1.1.3.cmml" type="integer" xref="alg1.l1.m1.2.2.1.1.1.3">1</cn></apply><ci id="alg1.l1.m1.1.1.cmml" xref="alg1.l1.m1.1.1">…</ci><apply id="alg1.l1.m1.3.3.2.2.2.cmml" 
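A minimal sketch of the one-shot un-/merge is shown below, written with PyTorch tensors to keep it self-contained. In V-LoRA the per-layer B×A products are computed with ATMM; `torch.matmul` stands in for that here, and the buffer layout is assumed for illustration rather than taken from the implementation.

```python
import torch

def merge_all_layers(base_weights, adapters, scaling=1.0):
    """Merged mode: W <- W + scaling * (B @ A) for every layer, in one shot.
    base_weights: pre-allocated, contiguous (out_dim, in_dim) tensors on the GPU.
    adapters: per-layer (A, B) pairs with A: (r, in_dim), B: (out_dim, r).
    scaling is the usual LoRA scaling factor."""
    for W, (A, B) in zip(base_weights, adapters):
        W.add_(B @ A, alpha=scaling)   # in place: no reshape, no extra copy

def unmerge_all_layers(base_weights, adapters, scaling=1.0):
    """Unmerged mode: subtract the same deltas to restore the base weights."""
    for W, (A, B) in zip(base_weights, adapters):
        W.sub_(B @ A, alpha=scaling)
```

Because the weights and adapters already live in pre-allocated contiguous GPU buffers, the switch reduces to these in-place updates, which is what keeps it under 10 ms.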
xref="alg1.l1.m1.3.3.2.2.2"><csymbol cd="ambiguous" id="alg1.l1.m1.3.3.2.2.2.1.cmml" xref="alg1.l1.m1.3.3.2.2.2">subscript</csymbol><ci id="alg1.l1.m1.3.3.2.2.2.2.cmml" xref="alg1.l1.m1.3.3.2.2.2.2">𝑟</ci><ci id="alg1.l1.m1.3.3.2.2.2.3.cmml" xref="alg1.l1.m1.3.3.2.2.2.3">𝑛</ci></apply></set></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m1.3c">R\!=\!\!\left\{r_{1},...,r_{n}\right\}</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m1.3d">italic_R = { italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_r start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }</annotation></semantics></math><span class="ltx_text" id="alg1.l1.3" style="font-size:80%;">, LoRA adapter </span><math alttext="L\!=\!\!\left\{l_{1},...,l_{n}\right\}" class="ltx_Math" display="inline" id="alg1.l1.m2.3"><semantics id="alg1.l1.m2.3a"><mrow id="alg1.l1.m2.3.3" xref="alg1.l1.m2.3.3.cmml"><mi id="alg1.l1.m2.3.3.4" mathsize="80%" xref="alg1.l1.m2.3.3.4.cmml">L</mi><mpadded width="0.726em"><mo id="alg1.l1.m2.3.3.3" lspace="0.108em" mathsize="80%" xref="alg1.l1.m2.3.3.3.cmml">=</mo></mpadded><mrow id="alg1.l1.m2.3.3.2.2" xref="alg1.l1.m2.3.3.2.3.cmml"><mo id="alg1.l1.m2.3.3.2.2.3" xref="alg1.l1.m2.3.3.2.3.cmml">{</mo><msub id="alg1.l1.m2.2.2.1.1.1" xref="alg1.l1.m2.2.2.1.1.1.cmml"><mi id="alg1.l1.m2.2.2.1.1.1.2" mathsize="80%" xref="alg1.l1.m2.2.2.1.1.1.2.cmml">l</mi><mn id="alg1.l1.m2.2.2.1.1.1.3" mathsize="80%" xref="alg1.l1.m2.2.2.1.1.1.3.cmml">1</mn></msub><mo id="alg1.l1.m2.3.3.2.2.4" mathsize="80%" xref="alg1.l1.m2.3.3.2.3.cmml">,</mo><mi id="alg1.l1.m2.1.1" mathsize="80%" mathvariant="normal" xref="alg1.l1.m2.1.1.cmml">…</mi><mo id="alg1.l1.m2.3.3.2.2.5" mathsize="80%" xref="alg1.l1.m2.3.3.2.3.cmml">,</mo><msub id="alg1.l1.m2.3.3.2.2.2" xref="alg1.l1.m2.3.3.2.2.2.cmml"><mi id="alg1.l1.m2.3.3.2.2.2.2" mathsize="80%" xref="alg1.l1.m2.3.3.2.2.2.2.cmml">l</mi><mi id="alg1.l1.m2.3.3.2.2.2.3" mathsize="80%" xref="alg1.l1.m2.3.3.2.2.2.3.cmml">n</mi></msub><mo id="alg1.l1.m2.3.3.2.2.6" xref="alg1.l1.m2.3.3.2.3.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l1.m2.3b"><apply id="alg1.l1.m2.3.3.cmml" xref="alg1.l1.m2.3.3"><eq id="alg1.l1.m2.3.3.3.cmml" xref="alg1.l1.m2.3.3.3"></eq><ci id="alg1.l1.m2.3.3.4.cmml" xref="alg1.l1.m2.3.3.4">𝐿</ci><set id="alg1.l1.m2.3.3.2.3.cmml" xref="alg1.l1.m2.3.3.2.2"><apply id="alg1.l1.m2.2.2.1.1.1.cmml" xref="alg1.l1.m2.2.2.1.1.1"><csymbol cd="ambiguous" id="alg1.l1.m2.2.2.1.1.1.1.cmml" xref="alg1.l1.m2.2.2.1.1.1">subscript</csymbol><ci id="alg1.l1.m2.2.2.1.1.1.2.cmml" xref="alg1.l1.m2.2.2.1.1.1.2">𝑙</ci><cn id="alg1.l1.m2.2.2.1.1.1.3.cmml" type="integer" xref="alg1.l1.m2.2.2.1.1.1.3">1</cn></apply><ci id="alg1.l1.m2.1.1.cmml" xref="alg1.l1.m2.1.1">…</ci><apply id="alg1.l1.m2.3.3.2.2.2.cmml" xref="alg1.l1.m2.3.3.2.2.2"><csymbol cd="ambiguous" id="alg1.l1.m2.3.3.2.2.2.1.cmml" xref="alg1.l1.m2.3.3.2.2.2">subscript</csymbol><ci id="alg1.l1.m2.3.3.2.2.2.2.cmml" xref="alg1.l1.m2.3.3.2.2.2.2">𝑙</ci><ci id="alg1.l1.m2.3.3.2.2.2.3.cmml" xref="alg1.l1.m2.3.3.2.2.2.3">𝑛</ci></apply></set></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m2.3c">L\!=\!\!\left\{l_{1},...,l_{n}\right\}</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m2.3d">italic_L = { italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_l start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }</annotation></semantics></math><span class="ltx_text" id="alg1.l1.4" style="font-size:80%;">, Infer mode 
</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l1.m3.1"><semantics id="alg1.l1.m3.1a"><mi id="alg1.l1.m3.1.1" mathsize="80%" xref="alg1.l1.m3.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l1.m3.1b"><ci id="alg1.l1.m3.1.1.cmml" xref="alg1.l1.m3.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l1.m3.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l1.m3.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l1.5" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l2"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l2.1.1.1" style="font-size:80%;">2:</span></span><span class="ltx_text" id="alg1.l2.2" style="font-size:80%;">The batch of requests to be executed </span><math alttext="B_{next}" class="ltx_Math" display="inline" id="alg1.l2.m1.1"><semantics id="alg1.l2.m1.1a"><msub id="alg1.l2.m1.1.1" xref="alg1.l2.m1.1.1.cmml"><mi id="alg1.l2.m1.1.1.2" mathsize="80%" xref="alg1.l2.m1.1.1.2.cmml">B</mi><mrow id="alg1.l2.m1.1.1.3" xref="alg1.l2.m1.1.1.3.cmml"><mi id="alg1.l2.m1.1.1.3.2" mathsize="80%" xref="alg1.l2.m1.1.1.3.2.cmml">n</mi><mo id="alg1.l2.m1.1.1.3.1" xref="alg1.l2.m1.1.1.3.1.cmml"></mo><mi id="alg1.l2.m1.1.1.3.3" mathsize="80%" xref="alg1.l2.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l2.m1.1.1.3.1a" xref="alg1.l2.m1.1.1.3.1.cmml"></mo><mi id="alg1.l2.m1.1.1.3.4" mathsize="80%" xref="alg1.l2.m1.1.1.3.4.cmml">x</mi><mo id="alg1.l2.m1.1.1.3.1b" xref="alg1.l2.m1.1.1.3.1.cmml"></mo><mi id="alg1.l2.m1.1.1.3.5" mathsize="80%" xref="alg1.l2.m1.1.1.3.5.cmml">t</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l2.m1.1b"><apply id="alg1.l2.m1.1.1.cmml" xref="alg1.l2.m1.1.1"><csymbol cd="ambiguous" id="alg1.l2.m1.1.1.1.cmml" xref="alg1.l2.m1.1.1">subscript</csymbol><ci id="alg1.l2.m1.1.1.2.cmml" xref="alg1.l2.m1.1.1.2">𝐵</ci><apply id="alg1.l2.m1.1.1.3.cmml" xref="alg1.l2.m1.1.1.3"><times id="alg1.l2.m1.1.1.3.1.cmml" xref="alg1.l2.m1.1.1.3.1"></times><ci id="alg1.l2.m1.1.1.3.2.cmml" xref="alg1.l2.m1.1.1.3.2">𝑛</ci><ci id="alg1.l2.m1.1.1.3.3.cmml" xref="alg1.l2.m1.1.1.3.3">𝑒</ci><ci id="alg1.l2.m1.1.1.3.4.cmml" xref="alg1.l2.m1.1.1.3.4">𝑥</ci><ci id="alg1.l2.m1.1.1.3.5.cmml" xref="alg1.l2.m1.1.1.3.5">𝑡</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l2.m1.1c">B_{next}</annotation><annotation encoding="application/x-llamapun" id="alg1.l2.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l2.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l3"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l3.1.1.1" style="font-size:80%;">3:</span></span><span class="ltx_text ltx_font_bold" id="alg1.l3.2" style="font-size:80%;">function</span><span class="ltx_text" id="alg1.l3.3" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l3.4" style="font-size:80%;">Scheduling</span><span class="ltx_text" id="alg1.l3.5" style="font-size:80%;">(R, M, L) </span> </div> <div class="ltx_listingline" id="alg1.l4"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l4.1.1.1" style="font-size:80%;">4:</span></span><span class="ltx_text" id="alg1.l4.2" style="font-size:80%;"> </span><math alttext="R_{starve}=\left[R_{i}.credit>\theta\;for\;R_{i}\;in\;R\right]" class="ltx_math_unparsed" display="inline" 
id="alg1.l4.m1.1"><semantics id="alg1.l4.m1.1a"><mrow id="alg1.l4.m1.1b"><msub id="alg1.l4.m1.1.1"><mi id="alg1.l4.m1.1.1.2" mathsize="80%">R</mi><mrow id="alg1.l4.m1.1.1.3"><mi id="alg1.l4.m1.1.1.3.2" mathsize="80%">s</mi><mo id="alg1.l4.m1.1.1.3.1"></mo><mi id="alg1.l4.m1.1.1.3.3" mathsize="80%">t</mi><mo id="alg1.l4.m1.1.1.3.1a"></mo><mi id="alg1.l4.m1.1.1.3.4" mathsize="80%">a</mi><mo id="alg1.l4.m1.1.1.3.1b"></mo><mi id="alg1.l4.m1.1.1.3.5" mathsize="80%">r</mi><mo id="alg1.l4.m1.1.1.3.1c"></mo><mi id="alg1.l4.m1.1.1.3.6" mathsize="80%">v</mi><mo id="alg1.l4.m1.1.1.3.1d"></mo><mi id="alg1.l4.m1.1.1.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l4.m1.1.2" mathsize="80%">=</mo><mrow id="alg1.l4.m1.1.3"><mo id="alg1.l4.m1.1.3.1">[</mo><msub id="alg1.l4.m1.1.3.2"><mi id="alg1.l4.m1.1.3.2.2" mathsize="80%">R</mi><mi id="alg1.l4.m1.1.3.2.3" mathsize="80%">i</mi></msub><mo id="alg1.l4.m1.1.3.3" lspace="0em" mathsize="80%" rspace="0.167em">.</mo><mi id="alg1.l4.m1.1.3.4" mathsize="80%">c</mi><mi id="alg1.l4.m1.1.3.5" mathsize="80%">r</mi><mi id="alg1.l4.m1.1.3.6" mathsize="80%">e</mi><mi id="alg1.l4.m1.1.3.7" mathsize="80%">d</mi><mi id="alg1.l4.m1.1.3.8" mathsize="80%">i</mi><mi id="alg1.l4.m1.1.3.9" mathsize="80%">t</mi><mo id="alg1.l4.m1.1.3.10" mathsize="80%">></mo><mi id="alg1.l4.m1.1.3.11" mathsize="80%">θ</mi><mi id="alg1.l4.m1.1.3.12" mathsize="80%">f</mi><mi id="alg1.l4.m1.1.3.13" mathsize="80%">o</mi><mi id="alg1.l4.m1.1.3.14" mathsize="80%">r</mi><msub id="alg1.l4.m1.1.3.15"><mi id="alg1.l4.m1.1.3.15.2" mathsize="80%">R</mi><mi id="alg1.l4.m1.1.3.15.3" mathsize="80%">i</mi></msub><mi id="alg1.l4.m1.1.3.16" mathsize="80%">i</mi><mi id="alg1.l4.m1.1.3.17" mathsize="80%">n</mi><mi id="alg1.l4.m1.1.3.18" mathsize="80%">R</mi><mo id="alg1.l4.m1.1.3.19">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l4.m1.1c">R_{starve}=\left[R_{i}.credit>\theta\;for\;R_{i}\;in\;R\right]</annotation><annotation encoding="application/x-llamapun" id="alg1.l4.m1.1d">italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT = [ italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT . 
italic_c italic_r italic_e italic_d italic_i italic_t > italic_θ italic_f italic_o italic_r italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_i italic_n italic_R ]</annotation></semantics></math><span class="ltx_text" id="alg1.l4.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l5"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l5.1.1.1" style="font-size:80%;">5:</span></span><span class="ltx_text" id="alg1.l5.2" style="font-size:80%;"> </span><math alttext="len=MaxBS-|R_{starve}|" class="ltx_Math" display="inline" id="alg1.l5.m1.1"><semantics id="alg1.l5.m1.1a"><mrow id="alg1.l5.m1.1.1" xref="alg1.l5.m1.1.1.cmml"><mrow id="alg1.l5.m1.1.1.3" xref="alg1.l5.m1.1.1.3.cmml"><mi id="alg1.l5.m1.1.1.3.2" mathsize="80%" xref="alg1.l5.m1.1.1.3.2.cmml">l</mi><mo id="alg1.l5.m1.1.1.3.1" xref="alg1.l5.m1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.3.3" mathsize="80%" xref="alg1.l5.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l5.m1.1.1.3.1a" xref="alg1.l5.m1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.3.4" mathsize="80%" xref="alg1.l5.m1.1.1.3.4.cmml">n</mi></mrow><mo id="alg1.l5.m1.1.1.2" mathsize="80%" xref="alg1.l5.m1.1.1.2.cmml">=</mo><mrow id="alg1.l5.m1.1.1.1" xref="alg1.l5.m1.1.1.1.cmml"><mrow id="alg1.l5.m1.1.1.1.3" xref="alg1.l5.m1.1.1.1.3.cmml"><mi id="alg1.l5.m1.1.1.1.3.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.2.cmml">M</mi><mo id="alg1.l5.m1.1.1.1.3.1" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.3" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.3.cmml">a</mi><mo id="alg1.l5.m1.1.1.1.3.1a" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.4" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.4.cmml">x</mi><mo id="alg1.l5.m1.1.1.1.3.1b" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.5" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.5.cmml">B</mi><mo id="alg1.l5.m1.1.1.1.3.1c" xref="alg1.l5.m1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.3.6" mathsize="80%" xref="alg1.l5.m1.1.1.1.3.6.cmml">S</mi></mrow><mo id="alg1.l5.m1.1.1.1.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.2.cmml">−</mo><mrow id="alg1.l5.m1.1.1.1.1.1" xref="alg1.l5.m1.1.1.1.1.2.cmml"><mo id="alg1.l5.m1.1.1.1.1.1.2" maxsize="80%" minsize="80%" xref="alg1.l5.m1.1.1.1.1.2.1.cmml">|</mo><msub id="alg1.l5.m1.1.1.1.1.1.1" xref="alg1.l5.m1.1.1.1.1.1.1.cmml"><mi id="alg1.l5.m1.1.1.1.1.1.1.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l5.m1.1.1.1.1.1.1.3" xref="alg1.l5.m1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l5.m1.1.1.1.1.1.1.3.2" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.2.cmml">s</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.3" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.3.cmml">t</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1a" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.4" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.4.cmml">a</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1b" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.5" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.5.cmml">r</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1c" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.6" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.6.cmml">v</mi><mo id="alg1.l5.m1.1.1.1.1.1.1.3.1d" xref="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l5.m1.1.1.1.1.1.1.3.7" mathsize="80%" xref="alg1.l5.m1.1.1.1.1.1.1.3.7.cmml">e</mi></mrow></msub><mo id="alg1.l5.m1.1.1.1.1.1.3" maxsize="80%" 
minsize="80%" xref="alg1.l5.m1.1.1.1.1.2.1.cmml">|</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l5.m1.1b"><apply id="alg1.l5.m1.1.1.cmml" xref="alg1.l5.m1.1.1"><eq id="alg1.l5.m1.1.1.2.cmml" xref="alg1.l5.m1.1.1.2"></eq><apply id="alg1.l5.m1.1.1.3.cmml" xref="alg1.l5.m1.1.1.3"><times id="alg1.l5.m1.1.1.3.1.cmml" xref="alg1.l5.m1.1.1.3.1"></times><ci id="alg1.l5.m1.1.1.3.2.cmml" xref="alg1.l5.m1.1.1.3.2">𝑙</ci><ci id="alg1.l5.m1.1.1.3.3.cmml" xref="alg1.l5.m1.1.1.3.3">𝑒</ci><ci id="alg1.l5.m1.1.1.3.4.cmml" xref="alg1.l5.m1.1.1.3.4">𝑛</ci></apply><apply id="alg1.l5.m1.1.1.1.cmml" xref="alg1.l5.m1.1.1.1"><minus id="alg1.l5.m1.1.1.1.2.cmml" xref="alg1.l5.m1.1.1.1.2"></minus><apply id="alg1.l5.m1.1.1.1.3.cmml" xref="alg1.l5.m1.1.1.1.3"><times id="alg1.l5.m1.1.1.1.3.1.cmml" xref="alg1.l5.m1.1.1.1.3.1"></times><ci id="alg1.l5.m1.1.1.1.3.2.cmml" xref="alg1.l5.m1.1.1.1.3.2">𝑀</ci><ci id="alg1.l5.m1.1.1.1.3.3.cmml" xref="alg1.l5.m1.1.1.1.3.3">𝑎</ci><ci id="alg1.l5.m1.1.1.1.3.4.cmml" xref="alg1.l5.m1.1.1.1.3.4">𝑥</ci><ci id="alg1.l5.m1.1.1.1.3.5.cmml" xref="alg1.l5.m1.1.1.1.3.5">𝐵</ci><ci id="alg1.l5.m1.1.1.1.3.6.cmml" xref="alg1.l5.m1.1.1.1.3.6">𝑆</ci></apply><apply id="alg1.l5.m1.1.1.1.1.2.cmml" xref="alg1.l5.m1.1.1.1.1.1"><abs id="alg1.l5.m1.1.1.1.1.2.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.2"></abs><apply id="alg1.l5.m1.1.1.1.1.1.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l5.m1.1.1.1.1.1.1.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l5.m1.1.1.1.1.1.1.2.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.2">𝑅</ci><apply id="alg1.l5.m1.1.1.1.1.1.1.3.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3"><times id="alg1.l5.m1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l5.m1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.2">𝑠</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.3">𝑡</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.4">𝑎</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.5.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.5">𝑟</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.6.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.6">𝑣</ci><ci id="alg1.l5.m1.1.1.1.1.1.1.3.7.cmml" xref="alg1.l5.m1.1.1.1.1.1.1.3.7">𝑒</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l5.m1.1c">len=MaxBS-|R_{starve}|</annotation><annotation encoding="application/x-llamapun" id="alg1.l5.m1.1d">italic_l italic_e italic_n = italic_M italic_a italic_x italic_B italic_S - | italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT |</annotation></semantics></math><span class="ltx_text" id="alg1.l5.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l6"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l6.1.1.1" style="font-size:80%;">6:</span></span><span class="ltx_text" id="alg1.l6.2" style="font-size:80%;"> </span><math alttext="R_{merge}=argmax_{l_{i}\in L}\left\{r_{i}\in R|r_{i}.lora==l_{i}\right\}" class="ltx_math_unparsed" display="inline" id="alg1.l6.m1.1"><semantics id="alg1.l6.m1.1a"><mrow id="alg1.l6.m1.1b"><msub id="alg1.l6.m1.1.1"><mi id="alg1.l6.m1.1.1.2" mathsize="80%">R</mi><mrow id="alg1.l6.m1.1.1.3"><mi id="alg1.l6.m1.1.1.3.2" mathsize="80%">m</mi><mo id="alg1.l6.m1.1.1.3.1"></mo><mi id="alg1.l6.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l6.m1.1.1.3.1a"></mo><mi id="alg1.l6.m1.1.1.3.4" mathsize="80%">r</mi><mo id="alg1.l6.m1.1.1.3.1b"></mo><mi 
id="alg1.l6.m1.1.1.3.5" mathsize="80%">g</mi><mo id="alg1.l6.m1.1.1.3.1c"></mo><mi id="alg1.l6.m1.1.1.3.6" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l6.m1.1.2" mathsize="80%">=</mo><mi id="alg1.l6.m1.1.3" mathsize="80%">a</mi><mi id="alg1.l6.m1.1.4" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.5" mathsize="80%">g</mi><mi id="alg1.l6.m1.1.6" mathsize="80%">m</mi><mi id="alg1.l6.m1.1.7" mathsize="80%">a</mi><msub id="alg1.l6.m1.1.8"><mi id="alg1.l6.m1.1.8.2" mathsize="80%">x</mi><mrow id="alg1.l6.m1.1.8.3"><msub id="alg1.l6.m1.1.8.3.2"><mi id="alg1.l6.m1.1.8.3.2.2" mathsize="80%">l</mi><mi id="alg1.l6.m1.1.8.3.2.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.8.3.1" mathsize="80%">∈</mo><mi id="alg1.l6.m1.1.8.3.3" mathsize="80%">L</mi></mrow></msub><mrow id="alg1.l6.m1.1.9"><mo id="alg1.l6.m1.1.9.1">{</mo><msub id="alg1.l6.m1.1.9.2"><mi id="alg1.l6.m1.1.9.2.2" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.9.2.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.9.3" mathsize="80%">∈</mo><mi id="alg1.l6.m1.1.9.4" mathsize="80%">R</mi><mo fence="false" id="alg1.l6.m1.1.9.5" maxsize="80%" minsize="80%" rspace="0.167em">|</mo><msub id="alg1.l6.m1.1.9.6"><mi id="alg1.l6.m1.1.9.6.2" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.9.6.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.9.7" lspace="0em" mathsize="80%" rspace="0.167em">.</mo><mi id="alg1.l6.m1.1.9.8" mathsize="80%">l</mi><mi id="alg1.l6.m1.1.9.9" mathsize="80%">o</mi><mi id="alg1.l6.m1.1.9.10" mathsize="80%">r</mi><mi id="alg1.l6.m1.1.9.11" mathsize="80%">a</mi><mo id="alg1.l6.m1.1.9.12" mathsize="80%" rspace="0em">=</mo><mo id="alg1.l6.m1.1.9.13" lspace="0em" mathsize="80%">=</mo><msub id="alg1.l6.m1.1.9.14"><mi id="alg1.l6.m1.1.9.14.2" mathsize="80%">l</mi><mi id="alg1.l6.m1.1.9.14.3" mathsize="80%">i</mi></msub><mo id="alg1.l6.m1.1.9.15">}</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l6.m1.1c">R_{merge}=argmax_{l_{i}\in L}\left\{r_{i}\in R|r_{i}.lora==l_{i}\right\}</annotation><annotation encoding="application/x-llamapun" id="alg1.l6.m1.1d">italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT = italic_a italic_r italic_g italic_m italic_a italic_x start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_L end_POSTSUBSCRIPT { italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_R | italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT . 
italic_l italic_o italic_r italic_a = = italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }</annotation></semantics></math><span class="ltx_text" id="alg1.l6.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l7"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l7.1.1.1" style="font-size:80%;">7:</span></span><span class="ltx_text" id="alg1.l7.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l7.3" style="font-size:80%;">if</span><span class="ltx_text" id="alg1.l7.4" style="font-size:80%;"> </span><math alttext="|R_{starve}|/MaxBS\leq 0.5" class="ltx_Math" display="inline" id="alg1.l7.m1.1"><semantics id="alg1.l7.m1.1a"><mrow id="alg1.l7.m1.1.1" xref="alg1.l7.m1.1.1.cmml"><mrow id="alg1.l7.m1.1.1.1" xref="alg1.l7.m1.1.1.1.cmml"><mrow id="alg1.l7.m1.1.1.1.1" xref="alg1.l7.m1.1.1.1.1.cmml"><mrow id="alg1.l7.m1.1.1.1.1.1.1" xref="alg1.l7.m1.1.1.1.1.1.2.cmml"><mo id="alg1.l7.m1.1.1.1.1.1.1.2" maxsize="80%" minsize="80%" xref="alg1.l7.m1.1.1.1.1.1.2.1.cmml">|</mo><msub id="alg1.l7.m1.1.1.1.1.1.1.1" xref="alg1.l7.m1.1.1.1.1.1.1.1.cmml"><mi id="alg1.l7.m1.1.1.1.1.1.1.1.2" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l7.m1.1.1.1.1.1.1.1.3" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.2" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.2.cmml">s</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.3" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.3.cmml">t</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1a" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.4" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.4.cmml">a</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1b" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.5" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.5.cmml">r</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1c" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.6" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.6.cmml">v</mi><mo id="alg1.l7.m1.1.1.1.1.1.1.1.3.1d" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m1.1.1.1.1.1.1.1.3.7" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.7.cmml">e</mi></mrow></msub><mo id="alg1.l7.m1.1.1.1.1.1.1.3" maxsize="80%" minsize="80%" xref="alg1.l7.m1.1.1.1.1.1.2.1.cmml">|</mo></mrow><mo id="alg1.l7.m1.1.1.1.1.2" maxsize="80%" minsize="80%" stretchy="true" symmetric="true" xref="alg1.l7.m1.1.1.1.1.2.cmml">/</mo><mi id="alg1.l7.m1.1.1.1.1.3" mathsize="80%" xref="alg1.l7.m1.1.1.1.1.3.cmml">M</mi></mrow><mo id="alg1.l7.m1.1.1.1.2" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.3" mathsize="80%" xref="alg1.l7.m1.1.1.1.3.cmml">a</mi><mo id="alg1.l7.m1.1.1.1.2a" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.4" mathsize="80%" xref="alg1.l7.m1.1.1.1.4.cmml">x</mi><mo id="alg1.l7.m1.1.1.1.2b" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.5" mathsize="80%" xref="alg1.l7.m1.1.1.1.5.cmml">B</mi><mo id="alg1.l7.m1.1.1.1.2c" xref="alg1.l7.m1.1.1.1.2.cmml"></mo><mi id="alg1.l7.m1.1.1.1.6" mathsize="80%" xref="alg1.l7.m1.1.1.1.6.cmml">S</mi></mrow><mo id="alg1.l7.m1.1.1.2" mathsize="80%" xref="alg1.l7.m1.1.1.2.cmml">≤</mo><mn id="alg1.l7.m1.1.1.3" mathsize="80%" xref="alg1.l7.m1.1.1.3.cmml">0.5</mn></mrow><annotation-xml encoding="MathML-Content" id="alg1.l7.m1.1b"><apply id="alg1.l7.m1.1.1.cmml" xref="alg1.l7.m1.1.1"><leq 
id="alg1.l7.m1.1.1.2.cmml" xref="alg1.l7.m1.1.1.2"></leq><apply id="alg1.l7.m1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1"><times id="alg1.l7.m1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.2"></times><apply id="alg1.l7.m1.1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1.1"><divide id="alg1.l7.m1.1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.1.2"></divide><apply id="alg1.l7.m1.1.1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.1.1.1"><abs id="alg1.l7.m1.1.1.1.1.1.2.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.2"></abs><apply id="alg1.l7.m1.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l7.m1.1.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l7.m1.1.1.1.1.1.1.1.2.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.2">𝑅</ci><apply id="alg1.l7.m1.1.1.1.1.1.1.1.3.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3"><times id="alg1.l7.m1.1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.2">𝑠</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.3">𝑡</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.4">𝑎</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.5.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.5">𝑟</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.6.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.6">𝑣</ci><ci id="alg1.l7.m1.1.1.1.1.1.1.1.3.7.cmml" xref="alg1.l7.m1.1.1.1.1.1.1.1.3.7">𝑒</ci></apply></apply></apply><ci id="alg1.l7.m1.1.1.1.1.3.cmml" xref="alg1.l7.m1.1.1.1.1.3">𝑀</ci></apply><ci id="alg1.l7.m1.1.1.1.3.cmml" xref="alg1.l7.m1.1.1.1.3">𝑎</ci><ci id="alg1.l7.m1.1.1.1.4.cmml" xref="alg1.l7.m1.1.1.1.4">𝑥</ci><ci id="alg1.l7.m1.1.1.1.5.cmml" xref="alg1.l7.m1.1.1.1.5">𝐵</ci><ci id="alg1.l7.m1.1.1.1.6.cmml" xref="alg1.l7.m1.1.1.1.6">𝑆</ci></apply><cn id="alg1.l7.m1.1.1.3.cmml" type="float" xref="alg1.l7.m1.1.1.3">0.5</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l7.m1.1c">|R_{starve}|/MaxBS\leq 0.5</annotation><annotation encoding="application/x-llamapun" id="alg1.l7.m1.1d">| italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT | / italic_M italic_a italic_x italic_B italic_S ≤ 0.5</annotation></semantics></math><span class="ltx_text" id="alg1.l7.5" style="font-size:80%;"> and </span><math alttext="|R_{merge}|/MaxBS>0.5" class="ltx_Math" display="inline" id="alg1.l7.m2.1"><semantics id="alg1.l7.m2.1a"><mrow id="alg1.l7.m2.1.1" xref="alg1.l7.m2.1.1.cmml"><mrow id="alg1.l7.m2.1.1.1" xref="alg1.l7.m2.1.1.1.cmml"><mrow id="alg1.l7.m2.1.1.1.1" xref="alg1.l7.m2.1.1.1.1.cmml"><mrow id="alg1.l7.m2.1.1.1.1.1.1" xref="alg1.l7.m2.1.1.1.1.1.2.cmml"><mo id="alg1.l7.m2.1.1.1.1.1.1.2" maxsize="80%" minsize="80%" xref="alg1.l7.m2.1.1.1.1.1.2.1.cmml">|</mo><msub id="alg1.l7.m2.1.1.1.1.1.1.1" xref="alg1.l7.m2.1.1.1.1.1.1.1.cmml"><mi id="alg1.l7.m2.1.1.1.1.1.1.1.2" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l7.m2.1.1.1.1.1.1.1.3" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.cmml"><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.2" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.2.cmml">m</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.3" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.3.cmml">e</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1a" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.4" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.4.cmml">r</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1b" 
xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.5" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.5.cmml">g</mi><mo id="alg1.l7.m2.1.1.1.1.1.1.1.3.1c" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l7.m2.1.1.1.1.1.1.1.3.6" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.6.cmml">e</mi></mrow></msub><mo id="alg1.l7.m2.1.1.1.1.1.1.3" maxsize="80%" minsize="80%" xref="alg1.l7.m2.1.1.1.1.1.2.1.cmml">|</mo></mrow><mo id="alg1.l7.m2.1.1.1.1.2" maxsize="80%" minsize="80%" stretchy="true" symmetric="true" xref="alg1.l7.m2.1.1.1.1.2.cmml">/</mo><mi id="alg1.l7.m2.1.1.1.1.3" mathsize="80%" xref="alg1.l7.m2.1.1.1.1.3.cmml">M</mi></mrow><mo id="alg1.l7.m2.1.1.1.2" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.3" mathsize="80%" xref="alg1.l7.m2.1.1.1.3.cmml">a</mi><mo id="alg1.l7.m2.1.1.1.2a" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.4" mathsize="80%" xref="alg1.l7.m2.1.1.1.4.cmml">x</mi><mo id="alg1.l7.m2.1.1.1.2b" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.5" mathsize="80%" xref="alg1.l7.m2.1.1.1.5.cmml">B</mi><mo id="alg1.l7.m2.1.1.1.2c" xref="alg1.l7.m2.1.1.1.2.cmml"></mo><mi id="alg1.l7.m2.1.1.1.6" mathsize="80%" xref="alg1.l7.m2.1.1.1.6.cmml">S</mi></mrow><mo id="alg1.l7.m2.1.1.2" mathsize="80%" xref="alg1.l7.m2.1.1.2.cmml">></mo><mn id="alg1.l7.m2.1.1.3" mathsize="80%" xref="alg1.l7.m2.1.1.3.cmml">0.5</mn></mrow><annotation-xml encoding="MathML-Content" id="alg1.l7.m2.1b"><apply id="alg1.l7.m2.1.1.cmml" xref="alg1.l7.m2.1.1"><gt id="alg1.l7.m2.1.1.2.cmml" xref="alg1.l7.m2.1.1.2"></gt><apply id="alg1.l7.m2.1.1.1.cmml" xref="alg1.l7.m2.1.1.1"><times id="alg1.l7.m2.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.2"></times><apply id="alg1.l7.m2.1.1.1.1.cmml" xref="alg1.l7.m2.1.1.1.1"><divide id="alg1.l7.m2.1.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.1.2"></divide><apply id="alg1.l7.m2.1.1.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.1.1.1"><abs id="alg1.l7.m2.1.1.1.1.1.2.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.2"></abs><apply id="alg1.l7.m2.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l7.m2.1.1.1.1.1.1.1.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1">subscript</csymbol><ci id="alg1.l7.m2.1.1.1.1.1.1.1.2.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.2">𝑅</ci><apply id="alg1.l7.m2.1.1.1.1.1.1.1.3.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3"><times id="alg1.l7.m2.1.1.1.1.1.1.1.3.1.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.1"></times><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.2.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.2">𝑚</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.3.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.3">𝑒</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.4.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.4">𝑟</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.5.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.5">𝑔</ci><ci id="alg1.l7.m2.1.1.1.1.1.1.1.3.6.cmml" xref="alg1.l7.m2.1.1.1.1.1.1.1.3.6">𝑒</ci></apply></apply></apply><ci id="alg1.l7.m2.1.1.1.1.3.cmml" xref="alg1.l7.m2.1.1.1.1.3">𝑀</ci></apply><ci id="alg1.l7.m2.1.1.1.3.cmml" xref="alg1.l7.m2.1.1.1.3">𝑎</ci><ci id="alg1.l7.m2.1.1.1.4.cmml" xref="alg1.l7.m2.1.1.1.4">𝑥</ci><ci id="alg1.l7.m2.1.1.1.5.cmml" xref="alg1.l7.m2.1.1.1.5">𝐵</ci><ci id="alg1.l7.m2.1.1.1.6.cmml" xref="alg1.l7.m2.1.1.1.6">𝑆</ci></apply><cn id="alg1.l7.m2.1.1.3.cmml" type="float" xref="alg1.l7.m2.1.1.3">0.5</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l7.m2.1c">|R_{merge}|/MaxBS>0.5</annotation><annotation encoding="application/x-llamapun" id="alg1.l7.m2.1d">| italic_R start_POSTSUBSCRIPT italic_m italic_e 
italic_r italic_g italic_e end_POSTSUBSCRIPT | / italic_M italic_a italic_x italic_B italic_S > 0.5</annotation></semantics></math><span class="ltx_text" id="alg1.l7.6" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l7.7" style="font-size:80%;">then</span><span class="ltx_text" id="alg1.l7.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l8"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l8.1.1.1" style="font-size:80%;">8:</span></span><span class="ltx_text" id="alg1.l8.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l8.3" style="font-size:80%;">if</span><span class="ltx_text" id="alg1.l8.4" style="font-size:80%;"> </span><math alttext="|R_{starve}|==0" class="ltx_math_unparsed" display="inline" id="alg1.l8.m1.1"><semantics id="alg1.l8.m1.1a"><mrow id="alg1.l8.m1.1b"><mo fence="false" id="alg1.l8.m1.1.1" maxsize="80%" minsize="80%" rspace="0.167em">|</mo><msub id="alg1.l8.m1.1.2"><mi id="alg1.l8.m1.1.2.2" mathsize="80%">R</mi><mrow id="alg1.l8.m1.1.2.3"><mi id="alg1.l8.m1.1.2.3.2" mathsize="80%">s</mi><mo id="alg1.l8.m1.1.2.3.1"></mo><mi id="alg1.l8.m1.1.2.3.3" mathsize="80%">t</mi><mo id="alg1.l8.m1.1.2.3.1a"></mo><mi id="alg1.l8.m1.1.2.3.4" mathsize="80%">a</mi><mo id="alg1.l8.m1.1.2.3.1b"></mo><mi id="alg1.l8.m1.1.2.3.5" mathsize="80%">r</mi><mo id="alg1.l8.m1.1.2.3.1c"></mo><mi id="alg1.l8.m1.1.2.3.6" mathsize="80%">v</mi><mo id="alg1.l8.m1.1.2.3.1d"></mo><mi id="alg1.l8.m1.1.2.3.7" mathsize="80%">e</mi></mrow></msub><mo fence="false" id="alg1.l8.m1.1.3" maxsize="80%" minsize="80%">|</mo><mo id="alg1.l8.m1.1.4" lspace="0.167em" mathsize="80%" rspace="0em">=</mo><mo id="alg1.l8.m1.1.5" lspace="0em" mathsize="80%">=</mo><mn id="alg1.l8.m1.1.6" mathsize="80%">0</mn></mrow><annotation encoding="application/x-tex" id="alg1.l8.m1.1c">|R_{starve}|==0</annotation><annotation encoding="application/x-llamapun" id="alg1.l8.m1.1d">| italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT | = = 0</annotation></semantics></math><span class="ltx_text" id="alg1.l8.5" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l8.6" style="font-size:80%;">then</span><span class="ltx_text" id="alg1.l8.7" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l9"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l9.1.1.1" style="font-size:80%;">9:</span></span><span class="ltx_text" id="alg1.l9.2" style="font-size:80%;"> </span><math alttext="M=Merge" class="ltx_Math" display="inline" id="alg1.l9.m1.1"><semantics id="alg1.l9.m1.1a"><mrow id="alg1.l9.m1.1.1" xref="alg1.l9.m1.1.1.cmml"><mi id="alg1.l9.m1.1.1.2" mathsize="80%" xref="alg1.l9.m1.1.1.2.cmml">M</mi><mo id="alg1.l9.m1.1.1.1" mathsize="80%" xref="alg1.l9.m1.1.1.1.cmml">=</mo><mrow id="alg1.l9.m1.1.1.3" xref="alg1.l9.m1.1.1.3.cmml"><mi id="alg1.l9.m1.1.1.3.2" mathsize="80%" xref="alg1.l9.m1.1.1.3.2.cmml">M</mi><mo id="alg1.l9.m1.1.1.3.1" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.3" mathsize="80%" xref="alg1.l9.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l9.m1.1.1.3.1a" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.4" mathsize="80%" xref="alg1.l9.m1.1.1.3.4.cmml">r</mi><mo id="alg1.l9.m1.1.1.3.1b" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m1.1.1.3.5" mathsize="80%" xref="alg1.l9.m1.1.1.3.5.cmml">g</mi><mo id="alg1.l9.m1.1.1.3.1c" xref="alg1.l9.m1.1.1.3.1.cmml"></mo><mi 
id="alg1.l9.m1.1.1.3.6" mathsize="80%" xref="alg1.l9.m1.1.1.3.6.cmml">e</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l9.m1.1b"><apply id="alg1.l9.m1.1.1.cmml" xref="alg1.l9.m1.1.1"><eq id="alg1.l9.m1.1.1.1.cmml" xref="alg1.l9.m1.1.1.1"></eq><ci id="alg1.l9.m1.1.1.2.cmml" xref="alg1.l9.m1.1.1.2">𝑀</ci><apply id="alg1.l9.m1.1.1.3.cmml" xref="alg1.l9.m1.1.1.3"><times id="alg1.l9.m1.1.1.3.1.cmml" xref="alg1.l9.m1.1.1.3.1"></times><ci id="alg1.l9.m1.1.1.3.2.cmml" xref="alg1.l9.m1.1.1.3.2">𝑀</ci><ci id="alg1.l9.m1.1.1.3.3.cmml" xref="alg1.l9.m1.1.1.3.3">𝑒</ci><ci id="alg1.l9.m1.1.1.3.4.cmml" xref="alg1.l9.m1.1.1.3.4">𝑟</ci><ci id="alg1.l9.m1.1.1.3.5.cmml" xref="alg1.l9.m1.1.1.3.5">𝑔</ci><ci id="alg1.l9.m1.1.1.3.6.cmml" xref="alg1.l9.m1.1.1.3.6">𝑒</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m1.1c">M=Merge</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m1.1d">italic_M = italic_M italic_e italic_r italic_g italic_e</annotation></semantics></math><span class="ltx_text" id="alg1.l9.3" style="font-size:80%;"> and </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l9.4" style="font-size:80%;">ModeSwitch</span><span class="ltx_text" id="alg1.l9.5" style="font-size:80%;">(</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l9.m2.1"><semantics id="alg1.l9.m2.1a"><mi id="alg1.l9.m2.1.1" mathsize="80%" xref="alg1.l9.m2.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l9.m2.1b"><ci id="alg1.l9.m2.1.1.cmml" xref="alg1.l9.m2.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m2.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m2.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l9.6" style="font-size:80%;">, </span><math alttext="R_{merge}.lora" class="ltx_Math" display="inline" id="alg1.l9.m3.2"><semantics id="alg1.l9.m3.2a"><mrow id="alg1.l9.m3.2.2.2" xref="alg1.l9.m3.2.2.3.cmml"><msub id="alg1.l9.m3.1.1.1.1" xref="alg1.l9.m3.1.1.1.1.cmml"><mi id="alg1.l9.m3.1.1.1.1.2" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l9.m3.1.1.1.1.3" xref="alg1.l9.m3.1.1.1.1.3.cmml"><mi id="alg1.l9.m3.1.1.1.1.3.2" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.2.cmml">m</mi><mo id="alg1.l9.m3.1.1.1.1.3.1" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.3" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.3.cmml">e</mi><mo id="alg1.l9.m3.1.1.1.1.3.1a" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.4" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.4.cmml">r</mi><mo id="alg1.l9.m3.1.1.1.1.3.1b" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.5" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.5.cmml">g</mi><mo id="alg1.l9.m3.1.1.1.1.3.1c" xref="alg1.l9.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l9.m3.1.1.1.1.3.6" mathsize="80%" xref="alg1.l9.m3.1.1.1.1.3.6.cmml">e</mi></mrow></msub><mo id="alg1.l9.m3.2.2.2.3" lspace="0em" mathsize="80%" rspace="0.167em" xref="alg1.l9.m3.2.2.3a.cmml">.</mo><mrow id="alg1.l9.m3.2.2.2.2" xref="alg1.l9.m3.2.2.2.2.cmml"><mi id="alg1.l9.m3.2.2.2.2.2" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.2.cmml">l</mi><mo id="alg1.l9.m3.2.2.2.2.1" xref="alg1.l9.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l9.m3.2.2.2.2.3" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.3.cmml">o</mi><mo id="alg1.l9.m3.2.2.2.2.1a" xref="alg1.l9.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l9.m3.2.2.2.2.4" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.4.cmml">r</mi><mo id="alg1.l9.m3.2.2.2.2.1b" 
xref="alg1.l9.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l9.m3.2.2.2.2.5" mathsize="80%" xref="alg1.l9.m3.2.2.2.2.5.cmml">a</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l9.m3.2b"><apply id="alg1.l9.m3.2.2.3.cmml" xref="alg1.l9.m3.2.2.2"><csymbol cd="ambiguous" id="alg1.l9.m3.2.2.3a.cmml" xref="alg1.l9.m3.2.2.2.3">formulae-sequence</csymbol><apply id="alg1.l9.m3.1.1.1.1.cmml" xref="alg1.l9.m3.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l9.m3.1.1.1.1.1.cmml" xref="alg1.l9.m3.1.1.1.1">subscript</csymbol><ci id="alg1.l9.m3.1.1.1.1.2.cmml" xref="alg1.l9.m3.1.1.1.1.2">𝑅</ci><apply id="alg1.l9.m3.1.1.1.1.3.cmml" xref="alg1.l9.m3.1.1.1.1.3"><times id="alg1.l9.m3.1.1.1.1.3.1.cmml" xref="alg1.l9.m3.1.1.1.1.3.1"></times><ci id="alg1.l9.m3.1.1.1.1.3.2.cmml" xref="alg1.l9.m3.1.1.1.1.3.2">𝑚</ci><ci id="alg1.l9.m3.1.1.1.1.3.3.cmml" xref="alg1.l9.m3.1.1.1.1.3.3">𝑒</ci><ci id="alg1.l9.m3.1.1.1.1.3.4.cmml" xref="alg1.l9.m3.1.1.1.1.3.4">𝑟</ci><ci id="alg1.l9.m3.1.1.1.1.3.5.cmml" xref="alg1.l9.m3.1.1.1.1.3.5">𝑔</ci><ci id="alg1.l9.m3.1.1.1.1.3.6.cmml" xref="alg1.l9.m3.1.1.1.1.3.6">𝑒</ci></apply></apply><apply id="alg1.l9.m3.2.2.2.2.cmml" xref="alg1.l9.m3.2.2.2.2"><times id="alg1.l9.m3.2.2.2.2.1.cmml" xref="alg1.l9.m3.2.2.2.2.1"></times><ci id="alg1.l9.m3.2.2.2.2.2.cmml" xref="alg1.l9.m3.2.2.2.2.2">𝑙</ci><ci id="alg1.l9.m3.2.2.2.2.3.cmml" xref="alg1.l9.m3.2.2.2.2.3">𝑜</ci><ci id="alg1.l9.m3.2.2.2.2.4.cmml" xref="alg1.l9.m3.2.2.2.2.4">𝑟</ci><ci id="alg1.l9.m3.2.2.2.2.5.cmml" xref="alg1.l9.m3.2.2.2.2.5">𝑎</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l9.m3.2c">R_{merge}.lora</annotation><annotation encoding="application/x-llamapun" id="alg1.l9.m3.2d">italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT . 
italic_l italic_o italic_r italic_a</annotation></semantics></math><span class="ltx_text" id="alg1.l9.7" style="font-size:80%;">) </span> </div> <div class="ltx_listingline" id="alg1.l10"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l10.1.1.1" style="font-size:80%;">10:</span></span><span class="ltx_text" id="alg1.l10.2" style="font-size:80%;"> </span><math alttext="B_{next}=R_{merge}\left[:MaxBS\right]" class="ltx_math_unparsed" display="inline" id="alg1.l10.m1.1"><semantics id="alg1.l10.m1.1a"><mrow id="alg1.l10.m1.1b"><msub id="alg1.l10.m1.1.1"><mi id="alg1.l10.m1.1.1.2" mathsize="80%">B</mi><mrow id="alg1.l10.m1.1.1.3"><mi id="alg1.l10.m1.1.1.3.2" mathsize="80%">n</mi><mo id="alg1.l10.m1.1.1.3.1"></mo><mi id="alg1.l10.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l10.m1.1.1.3.1a"></mo><mi id="alg1.l10.m1.1.1.3.4" mathsize="80%">x</mi><mo id="alg1.l10.m1.1.1.3.1b"></mo><mi id="alg1.l10.m1.1.1.3.5" mathsize="80%">t</mi></mrow></msub><mo id="alg1.l10.m1.1.2" mathsize="80%">=</mo><msub id="alg1.l10.m1.1.3"><mi id="alg1.l10.m1.1.3.2" mathsize="80%">R</mi><mrow id="alg1.l10.m1.1.3.3"><mi id="alg1.l10.m1.1.3.3.2" mathsize="80%">m</mi><mo id="alg1.l10.m1.1.3.3.1"></mo><mi id="alg1.l10.m1.1.3.3.3" mathsize="80%">e</mi><mo id="alg1.l10.m1.1.3.3.1a"></mo><mi id="alg1.l10.m1.1.3.3.4" mathsize="80%">r</mi><mo id="alg1.l10.m1.1.3.3.1b"></mo><mi id="alg1.l10.m1.1.3.3.5" mathsize="80%">g</mi><mo id="alg1.l10.m1.1.3.3.1c"></mo><mi id="alg1.l10.m1.1.3.3.6" mathsize="80%">e</mi></mrow></msub><mrow id="alg1.l10.m1.1.4"><mo id="alg1.l10.m1.1.4.1">[</mo><mo id="alg1.l10.m1.1.4.2" mathsize="80%" rspace="0.278em">:</mo><mi id="alg1.l10.m1.1.4.3" mathsize="80%">M</mi><mi id="alg1.l10.m1.1.4.4" mathsize="80%">a</mi><mi id="alg1.l10.m1.1.4.5" mathsize="80%">x</mi><mi id="alg1.l10.m1.1.4.6" mathsize="80%">B</mi><mi id="alg1.l10.m1.1.4.7" mathsize="80%">S</mi><mo id="alg1.l10.m1.1.4.8">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l10.m1.1c">B_{next}=R_{merge}\left[:MaxBS\right]</annotation><annotation encoding="application/x-llamapun" id="alg1.l10.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT = italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT [ : italic_M italic_a italic_x italic_B italic_S ]</annotation></semantics></math><span class="ltx_text" id="alg1.l10.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l11"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l11.1.1.1" style="font-size:80%;">11:</span></span><span class="ltx_text" id="alg1.l11.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l11.3" style="font-size:80%;">else</span> </div> <div class="ltx_listingline" id="alg1.l12"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l12.1.1.1" style="font-size:80%;">12:</span></span><span class="ltx_text" id="alg1.l12.2" style="font-size:80%;"> </span><math alttext="M=Mix" class="ltx_Math" display="inline" id="alg1.l12.m1.1"><semantics id="alg1.l12.m1.1a"><mrow id="alg1.l12.m1.1.1" xref="alg1.l12.m1.1.1.cmml"><mi id="alg1.l12.m1.1.1.2" mathsize="80%" xref="alg1.l12.m1.1.1.2.cmml">M</mi><mo id="alg1.l12.m1.1.1.1" mathsize="80%" xref="alg1.l12.m1.1.1.1.cmml">=</mo><mrow id="alg1.l12.m1.1.1.3" xref="alg1.l12.m1.1.1.3.cmml"><mi id="alg1.l12.m1.1.1.3.2" mathsize="80%" xref="alg1.l12.m1.1.1.3.2.cmml">M</mi><mo id="alg1.l12.m1.1.1.3.1" 
xref="alg1.l12.m1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.3" mathsize="80%" xref="alg1.l12.m1.1.1.3.3.cmml">i</mi><mo id="alg1.l12.m1.1.1.3.1a" xref="alg1.l12.m1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m1.1.1.3.4" mathsize="80%" xref="alg1.l12.m1.1.1.3.4.cmml">x</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l12.m1.1b"><apply id="alg1.l12.m1.1.1.cmml" xref="alg1.l12.m1.1.1"><eq id="alg1.l12.m1.1.1.1.cmml" xref="alg1.l12.m1.1.1.1"></eq><ci id="alg1.l12.m1.1.1.2.cmml" xref="alg1.l12.m1.1.1.2">𝑀</ci><apply id="alg1.l12.m1.1.1.3.cmml" xref="alg1.l12.m1.1.1.3"><times id="alg1.l12.m1.1.1.3.1.cmml" xref="alg1.l12.m1.1.1.3.1"></times><ci id="alg1.l12.m1.1.1.3.2.cmml" xref="alg1.l12.m1.1.1.3.2">𝑀</ci><ci id="alg1.l12.m1.1.1.3.3.cmml" xref="alg1.l12.m1.1.1.3.3">𝑖</ci><ci id="alg1.l12.m1.1.1.3.4.cmml" xref="alg1.l12.m1.1.1.3.4">𝑥</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m1.1c">M=Mix</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m1.1d">italic_M = italic_M italic_i italic_x</annotation></semantics></math><span class="ltx_text" id="alg1.l12.3" style="font-size:80%;"> and </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l12.4" style="font-size:80%;">ModeSwitch</span><span class="ltx_text" id="alg1.l12.5" style="font-size:80%;">(</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l12.m2.1"><semantics id="alg1.l12.m2.1a"><mi id="alg1.l12.m2.1.1" mathsize="80%" xref="alg1.l12.m2.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l12.m2.1b"><ci id="alg1.l12.m2.1.1.cmml" xref="alg1.l12.m2.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m2.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m2.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l12.6" style="font-size:80%;">, </span><math alttext="R_{merge}.lora" class="ltx_Math" display="inline" id="alg1.l12.m3.2"><semantics id="alg1.l12.m3.2a"><mrow id="alg1.l12.m3.2.2.2" xref="alg1.l12.m3.2.2.3.cmml"><msub id="alg1.l12.m3.1.1.1.1" xref="alg1.l12.m3.1.1.1.1.cmml"><mi id="alg1.l12.m3.1.1.1.1.2" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.2.cmml">R</mi><mrow id="alg1.l12.m3.1.1.1.1.3" xref="alg1.l12.m3.1.1.1.1.3.cmml"><mi id="alg1.l12.m3.1.1.1.1.3.2" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.2.cmml">m</mi><mo id="alg1.l12.m3.1.1.1.1.3.1" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.3" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.3.cmml">e</mi><mo id="alg1.l12.m3.1.1.1.1.3.1a" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.4" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.4.cmml">r</mi><mo id="alg1.l12.m3.1.1.1.1.3.1b" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.5" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.5.cmml">g</mi><mo id="alg1.l12.m3.1.1.1.1.3.1c" xref="alg1.l12.m3.1.1.1.1.3.1.cmml"></mo><mi id="alg1.l12.m3.1.1.1.1.3.6" mathsize="80%" xref="alg1.l12.m3.1.1.1.1.3.6.cmml">e</mi></mrow></msub><mo id="alg1.l12.m3.2.2.2.3" lspace="0em" mathsize="80%" rspace="0.167em" xref="alg1.l12.m3.2.2.3a.cmml">.</mo><mrow id="alg1.l12.m3.2.2.2.2" xref="alg1.l12.m3.2.2.2.2.cmml"><mi id="alg1.l12.m3.2.2.2.2.2" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.2.cmml">l</mi><mo id="alg1.l12.m3.2.2.2.2.1" xref="alg1.l12.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l12.m3.2.2.2.2.3" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.3.cmml">o</mi><mo id="alg1.l12.m3.2.2.2.2.1a" 
xref="alg1.l12.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l12.m3.2.2.2.2.4" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.4.cmml">r</mi><mo id="alg1.l12.m3.2.2.2.2.1b" xref="alg1.l12.m3.2.2.2.2.1.cmml"></mo><mi id="alg1.l12.m3.2.2.2.2.5" mathsize="80%" xref="alg1.l12.m3.2.2.2.2.5.cmml">a</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l12.m3.2b"><apply id="alg1.l12.m3.2.2.3.cmml" xref="alg1.l12.m3.2.2.2"><csymbol cd="ambiguous" id="alg1.l12.m3.2.2.3a.cmml" xref="alg1.l12.m3.2.2.2.3">formulae-sequence</csymbol><apply id="alg1.l12.m3.1.1.1.1.cmml" xref="alg1.l12.m3.1.1.1.1"><csymbol cd="ambiguous" id="alg1.l12.m3.1.1.1.1.1.cmml" xref="alg1.l12.m3.1.1.1.1">subscript</csymbol><ci id="alg1.l12.m3.1.1.1.1.2.cmml" xref="alg1.l12.m3.1.1.1.1.2">𝑅</ci><apply id="alg1.l12.m3.1.1.1.1.3.cmml" xref="alg1.l12.m3.1.1.1.1.3"><times id="alg1.l12.m3.1.1.1.1.3.1.cmml" xref="alg1.l12.m3.1.1.1.1.3.1"></times><ci id="alg1.l12.m3.1.1.1.1.3.2.cmml" xref="alg1.l12.m3.1.1.1.1.3.2">𝑚</ci><ci id="alg1.l12.m3.1.1.1.1.3.3.cmml" xref="alg1.l12.m3.1.1.1.1.3.3">𝑒</ci><ci id="alg1.l12.m3.1.1.1.1.3.4.cmml" xref="alg1.l12.m3.1.1.1.1.3.4">𝑟</ci><ci id="alg1.l12.m3.1.1.1.1.3.5.cmml" xref="alg1.l12.m3.1.1.1.1.3.5">𝑔</ci><ci id="alg1.l12.m3.1.1.1.1.3.6.cmml" xref="alg1.l12.m3.1.1.1.1.3.6">𝑒</ci></apply></apply><apply id="alg1.l12.m3.2.2.2.2.cmml" xref="alg1.l12.m3.2.2.2.2"><times id="alg1.l12.m3.2.2.2.2.1.cmml" xref="alg1.l12.m3.2.2.2.2.1"></times><ci id="alg1.l12.m3.2.2.2.2.2.cmml" xref="alg1.l12.m3.2.2.2.2.2">𝑙</ci><ci id="alg1.l12.m3.2.2.2.2.3.cmml" xref="alg1.l12.m3.2.2.2.2.3">𝑜</ci><ci id="alg1.l12.m3.2.2.2.2.4.cmml" xref="alg1.l12.m3.2.2.2.2.4">𝑟</ci><ci id="alg1.l12.m3.2.2.2.2.5.cmml" xref="alg1.l12.m3.2.2.2.2.5">𝑎</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l12.m3.2c">R_{merge}.lora</annotation><annotation encoding="application/x-llamapun" id="alg1.l12.m3.2d">italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT . 
italic_l italic_o italic_r italic_a</annotation></semantics></math><span class="ltx_text" id="alg1.l12.7" style="font-size:80%;">) </span> </div> <div class="ltx_listingline" id="alg1.l13"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l13.1.1.1" style="font-size:80%;">13:</span></span><span class="ltx_text" id="alg1.l13.2" style="font-size:80%;"> </span><math alttext="B_{next}=R_{starve}+(R_{merge}-R_{starve})\left[:len\right]" class="ltx_math_unparsed" display="inline" id="alg1.l13.m1.1"><semantics id="alg1.l13.m1.1a"><mrow id="alg1.l13.m1.1b"><msub id="alg1.l13.m1.1.1"><mi id="alg1.l13.m1.1.1.2" mathsize="80%">B</mi><mrow id="alg1.l13.m1.1.1.3"><mi id="alg1.l13.m1.1.1.3.2" mathsize="80%">n</mi><mo id="alg1.l13.m1.1.1.3.1"></mo><mi id="alg1.l13.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l13.m1.1.1.3.1a"></mo><mi id="alg1.l13.m1.1.1.3.4" mathsize="80%">x</mi><mo id="alg1.l13.m1.1.1.3.1b"></mo><mi id="alg1.l13.m1.1.1.3.5" mathsize="80%">t</mi></mrow></msub><mo id="alg1.l13.m1.1.2" mathsize="80%">=</mo><msub id="alg1.l13.m1.1.3"><mi id="alg1.l13.m1.1.3.2" mathsize="80%">R</mi><mrow id="alg1.l13.m1.1.3.3"><mi id="alg1.l13.m1.1.3.3.2" mathsize="80%">s</mi><mo id="alg1.l13.m1.1.3.3.1"></mo><mi id="alg1.l13.m1.1.3.3.3" mathsize="80%">t</mi><mo id="alg1.l13.m1.1.3.3.1a"></mo><mi id="alg1.l13.m1.1.3.3.4" mathsize="80%">a</mi><mo id="alg1.l13.m1.1.3.3.1b"></mo><mi id="alg1.l13.m1.1.3.3.5" mathsize="80%">r</mi><mo id="alg1.l13.m1.1.3.3.1c"></mo><mi id="alg1.l13.m1.1.3.3.6" mathsize="80%">v</mi><mo id="alg1.l13.m1.1.3.3.1d"></mo><mi id="alg1.l13.m1.1.3.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l13.m1.1.4" mathsize="80%">+</mo><mrow id="alg1.l13.m1.1.5"><mo id="alg1.l13.m1.1.5.1" maxsize="80%" minsize="80%">(</mo><msub id="alg1.l13.m1.1.5.2"><mi id="alg1.l13.m1.1.5.2.2" mathsize="80%">R</mi><mrow id="alg1.l13.m1.1.5.2.3"><mi id="alg1.l13.m1.1.5.2.3.2" mathsize="80%">m</mi><mo id="alg1.l13.m1.1.5.2.3.1"></mo><mi id="alg1.l13.m1.1.5.2.3.3" mathsize="80%">e</mi><mo id="alg1.l13.m1.1.5.2.3.1a"></mo><mi id="alg1.l13.m1.1.5.2.3.4" mathsize="80%">r</mi><mo id="alg1.l13.m1.1.5.2.3.1b"></mo><mi id="alg1.l13.m1.1.5.2.3.5" mathsize="80%">g</mi><mo id="alg1.l13.m1.1.5.2.3.1c"></mo><mi id="alg1.l13.m1.1.5.2.3.6" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l13.m1.1.5.3" mathsize="80%">−</mo><msub id="alg1.l13.m1.1.5.4"><mi id="alg1.l13.m1.1.5.4.2" mathsize="80%">R</mi><mrow id="alg1.l13.m1.1.5.4.3"><mi id="alg1.l13.m1.1.5.4.3.2" mathsize="80%">s</mi><mo id="alg1.l13.m1.1.5.4.3.1"></mo><mi id="alg1.l13.m1.1.5.4.3.3" mathsize="80%">t</mi><mo id="alg1.l13.m1.1.5.4.3.1a"></mo><mi id="alg1.l13.m1.1.5.4.3.4" mathsize="80%">a</mi><mo id="alg1.l13.m1.1.5.4.3.1b"></mo><mi id="alg1.l13.m1.1.5.4.3.5" mathsize="80%">r</mi><mo id="alg1.l13.m1.1.5.4.3.1c"></mo><mi id="alg1.l13.m1.1.5.4.3.6" mathsize="80%">v</mi><mo id="alg1.l13.m1.1.5.4.3.1d"></mo><mi id="alg1.l13.m1.1.5.4.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l13.m1.1.5.5" maxsize="80%" minsize="80%">)</mo></mrow><mrow id="alg1.l13.m1.1.6"><mo id="alg1.l13.m1.1.6.1">[</mo><mo id="alg1.l13.m1.1.6.2" mathsize="80%" rspace="0.278em">:</mo><mi id="alg1.l13.m1.1.6.3" mathsize="80%">l</mi><mi id="alg1.l13.m1.1.6.4" mathsize="80%">e</mi><mi id="alg1.l13.m1.1.6.5" mathsize="80%">n</mi><mo id="alg1.l13.m1.1.6.6">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l13.m1.1c">B_{next}=R_{starve}+(R_{merge}-R_{starve})\left[:len\right]</annotation><annotation encoding="application/x-llamapun" 
id="alg1.l13.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT = italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT + ( italic_R start_POSTSUBSCRIPT italic_m italic_e italic_r italic_g italic_e end_POSTSUBSCRIPT - italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT ) [ : italic_l italic_e italic_n ]</annotation></semantics></math><span class="ltx_text" id="alg1.l13.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l14"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l14.1.1.1" style="font-size:80%;">14:</span></span><span class="ltx_text" id="alg1.l14.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l14.3" style="font-size:80%;">InitDELoRA</span><span class="ltx_text" id="alg1.l14.4" style="font-size:80%;">(</span><math alttext="B_{next}" class="ltx_Math" display="inline" id="alg1.l14.m1.1"><semantics id="alg1.l14.m1.1a"><msub id="alg1.l14.m1.1.1" xref="alg1.l14.m1.1.1.cmml"><mi id="alg1.l14.m1.1.1.2" mathsize="80%" xref="alg1.l14.m1.1.1.2.cmml">B</mi><mrow id="alg1.l14.m1.1.1.3" xref="alg1.l14.m1.1.1.3.cmml"><mi id="alg1.l14.m1.1.1.3.2" mathsize="80%" xref="alg1.l14.m1.1.1.3.2.cmml">n</mi><mo id="alg1.l14.m1.1.1.3.1" xref="alg1.l14.m1.1.1.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.3.3" mathsize="80%" xref="alg1.l14.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l14.m1.1.1.3.1a" xref="alg1.l14.m1.1.1.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.3.4" mathsize="80%" xref="alg1.l14.m1.1.1.3.4.cmml">x</mi><mo id="alg1.l14.m1.1.1.3.1b" xref="alg1.l14.m1.1.1.3.1.cmml"></mo><mi id="alg1.l14.m1.1.1.3.5" mathsize="80%" xref="alg1.l14.m1.1.1.3.5.cmml">t</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l14.m1.1b"><apply id="alg1.l14.m1.1.1.cmml" xref="alg1.l14.m1.1.1"><csymbol cd="ambiguous" id="alg1.l14.m1.1.1.1.cmml" xref="alg1.l14.m1.1.1">subscript</csymbol><ci id="alg1.l14.m1.1.1.2.cmml" xref="alg1.l14.m1.1.1.2">𝐵</ci><apply id="alg1.l14.m1.1.1.3.cmml" xref="alg1.l14.m1.1.1.3"><times id="alg1.l14.m1.1.1.3.1.cmml" xref="alg1.l14.m1.1.1.3.1"></times><ci id="alg1.l14.m1.1.1.3.2.cmml" xref="alg1.l14.m1.1.1.3.2">𝑛</ci><ci id="alg1.l14.m1.1.1.3.3.cmml" xref="alg1.l14.m1.1.1.3.3">𝑒</ci><ci id="alg1.l14.m1.1.1.3.4.cmml" xref="alg1.l14.m1.1.1.3.4">𝑥</ci><ci id="alg1.l14.m1.1.1.3.5.cmml" xref="alg1.l14.m1.1.1.3.5">𝑡</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l14.m1.1c">B_{next}</annotation><annotation encoding="application/x-llamapun" id="alg1.l14.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l14.5" style="font-size:80%;">) </span> </div> <div class="ltx_listingline" id="alg1.l15"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l15.1.1.1" style="font-size:80%;">15:</span></span><span class="ltx_text" id="alg1.l15.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l15.3" style="font-size:80%;">else</span> </div> <div class="ltx_listingline" id="alg1.l16"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l16.1.1.1" style="font-size:80%;">16:</span></span><span class="ltx_text" id="alg1.l16.2" style="font-size:80%;"> </span><math alttext="M=Unmerge" class="ltx_Math" display="inline" id="alg1.l16.m1.1"><semantics id="alg1.l16.m1.1a"><mrow 
id="alg1.l16.m1.1.1" xref="alg1.l16.m1.1.1.cmml"><mi id="alg1.l16.m1.1.1.2" mathsize="80%" xref="alg1.l16.m1.1.1.2.cmml">M</mi><mo id="alg1.l16.m1.1.1.1" mathsize="80%" xref="alg1.l16.m1.1.1.1.cmml">=</mo><mrow id="alg1.l16.m1.1.1.3" xref="alg1.l16.m1.1.1.3.cmml"><mi id="alg1.l16.m1.1.1.3.2" mathsize="80%" xref="alg1.l16.m1.1.1.3.2.cmml">U</mi><mo id="alg1.l16.m1.1.1.3.1" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.3" mathsize="80%" xref="alg1.l16.m1.1.1.3.3.cmml">n</mi><mo id="alg1.l16.m1.1.1.3.1a" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.4" mathsize="80%" xref="alg1.l16.m1.1.1.3.4.cmml">m</mi><mo id="alg1.l16.m1.1.1.3.1b" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.5" mathsize="80%" xref="alg1.l16.m1.1.1.3.5.cmml">e</mi><mo id="alg1.l16.m1.1.1.3.1c" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.6" mathsize="80%" xref="alg1.l16.m1.1.1.3.6.cmml">r</mi><mo id="alg1.l16.m1.1.1.3.1d" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.7" mathsize="80%" xref="alg1.l16.m1.1.1.3.7.cmml">g</mi><mo id="alg1.l16.m1.1.1.3.1e" xref="alg1.l16.m1.1.1.3.1.cmml"></mo><mi id="alg1.l16.m1.1.1.3.8" mathsize="80%" xref="alg1.l16.m1.1.1.3.8.cmml">e</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="alg1.l16.m1.1b"><apply id="alg1.l16.m1.1.1.cmml" xref="alg1.l16.m1.1.1"><eq id="alg1.l16.m1.1.1.1.cmml" xref="alg1.l16.m1.1.1.1"></eq><ci id="alg1.l16.m1.1.1.2.cmml" xref="alg1.l16.m1.1.1.2">𝑀</ci><apply id="alg1.l16.m1.1.1.3.cmml" xref="alg1.l16.m1.1.1.3"><times id="alg1.l16.m1.1.1.3.1.cmml" xref="alg1.l16.m1.1.1.3.1"></times><ci id="alg1.l16.m1.1.1.3.2.cmml" xref="alg1.l16.m1.1.1.3.2">𝑈</ci><ci id="alg1.l16.m1.1.1.3.3.cmml" xref="alg1.l16.m1.1.1.3.3">𝑛</ci><ci id="alg1.l16.m1.1.1.3.4.cmml" xref="alg1.l16.m1.1.1.3.4">𝑚</ci><ci id="alg1.l16.m1.1.1.3.5.cmml" xref="alg1.l16.m1.1.1.3.5">𝑒</ci><ci id="alg1.l16.m1.1.1.3.6.cmml" xref="alg1.l16.m1.1.1.3.6">𝑟</ci><ci id="alg1.l16.m1.1.1.3.7.cmml" xref="alg1.l16.m1.1.1.3.7">𝑔</ci><ci id="alg1.l16.m1.1.1.3.8.cmml" xref="alg1.l16.m1.1.1.3.8">𝑒</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l16.m1.1c">M=Unmerge</annotation><annotation encoding="application/x-llamapun" id="alg1.l16.m1.1d">italic_M = italic_U italic_n italic_m italic_e italic_r italic_g italic_e</annotation></semantics></math><span class="ltx_text" id="alg1.l16.3" style="font-size:80%;"> and </span><span class="ltx_text ltx_font_smallcaps" id="alg1.l16.4" style="font-size:80%;">ModeSwitch</span><span class="ltx_text" id="alg1.l16.5" style="font-size:80%;">(</span><math alttext="M" class="ltx_Math" display="inline" id="alg1.l16.m2.1"><semantics id="alg1.l16.m2.1a"><mi id="alg1.l16.m2.1.1" mathsize="80%" xref="alg1.l16.m2.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="alg1.l16.m2.1b"><ci id="alg1.l16.m2.1.1.cmml" xref="alg1.l16.m2.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="alg1.l16.m2.1c">M</annotation><annotation encoding="application/x-llamapun" id="alg1.l16.m2.1d">italic_M</annotation></semantics></math><span class="ltx_text" id="alg1.l16.6" style="font-size:80%;">, ) </span> </div> <div class="ltx_listingline" id="alg1.l17"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l17.1.1.1" style="font-size:80%;">17:</span></span><span class="ltx_text" id="alg1.l17.2" style="font-size:80%;"> </span><math alttext="B_{next}=R_{starve}+(R-R_{starve})\left[:len\right]" class="ltx_math_unparsed" display="inline" 
id="alg1.l17.m1.1"><semantics id="alg1.l17.m1.1a"><mrow id="alg1.l17.m1.1b"><msub id="alg1.l17.m1.1.1"><mi id="alg1.l17.m1.1.1.2" mathsize="80%">B</mi><mrow id="alg1.l17.m1.1.1.3"><mi id="alg1.l17.m1.1.1.3.2" mathsize="80%">n</mi><mo id="alg1.l17.m1.1.1.3.1"></mo><mi id="alg1.l17.m1.1.1.3.3" mathsize="80%">e</mi><mo id="alg1.l17.m1.1.1.3.1a"></mo><mi id="alg1.l17.m1.1.1.3.4" mathsize="80%">x</mi><mo id="alg1.l17.m1.1.1.3.1b"></mo><mi id="alg1.l17.m1.1.1.3.5" mathsize="80%">t</mi></mrow></msub><mo id="alg1.l17.m1.1.2" mathsize="80%">=</mo><msub id="alg1.l17.m1.1.3"><mi id="alg1.l17.m1.1.3.2" mathsize="80%">R</mi><mrow id="alg1.l17.m1.1.3.3"><mi id="alg1.l17.m1.1.3.3.2" mathsize="80%">s</mi><mo id="alg1.l17.m1.1.3.3.1"></mo><mi id="alg1.l17.m1.1.3.3.3" mathsize="80%">t</mi><mo id="alg1.l17.m1.1.3.3.1a"></mo><mi id="alg1.l17.m1.1.3.3.4" mathsize="80%">a</mi><mo id="alg1.l17.m1.1.3.3.1b"></mo><mi id="alg1.l17.m1.1.3.3.5" mathsize="80%">r</mi><mo id="alg1.l17.m1.1.3.3.1c"></mo><mi id="alg1.l17.m1.1.3.3.6" mathsize="80%">v</mi><mo id="alg1.l17.m1.1.3.3.1d"></mo><mi id="alg1.l17.m1.1.3.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l17.m1.1.4" mathsize="80%">+</mo><mrow id="alg1.l17.m1.1.5"><mo id="alg1.l17.m1.1.5.1" maxsize="80%" minsize="80%">(</mo><mi id="alg1.l17.m1.1.5.2" mathsize="80%">R</mi><mo id="alg1.l17.m1.1.5.3" mathsize="80%">−</mo><msub id="alg1.l17.m1.1.5.4"><mi id="alg1.l17.m1.1.5.4.2" mathsize="80%">R</mi><mrow id="alg1.l17.m1.1.5.4.3"><mi id="alg1.l17.m1.1.5.4.3.2" mathsize="80%">s</mi><mo id="alg1.l17.m1.1.5.4.3.1"></mo><mi id="alg1.l17.m1.1.5.4.3.3" mathsize="80%">t</mi><mo id="alg1.l17.m1.1.5.4.3.1a"></mo><mi id="alg1.l17.m1.1.5.4.3.4" mathsize="80%">a</mi><mo id="alg1.l17.m1.1.5.4.3.1b"></mo><mi id="alg1.l17.m1.1.5.4.3.5" mathsize="80%">r</mi><mo id="alg1.l17.m1.1.5.4.3.1c"></mo><mi id="alg1.l17.m1.1.5.4.3.6" mathsize="80%">v</mi><mo id="alg1.l17.m1.1.5.4.3.1d"></mo><mi id="alg1.l17.m1.1.5.4.3.7" mathsize="80%">e</mi></mrow></msub><mo id="alg1.l17.m1.1.5.5" maxsize="80%" minsize="80%">)</mo></mrow><mrow id="alg1.l17.m1.1.6"><mo id="alg1.l17.m1.1.6.1">[</mo><mo id="alg1.l17.m1.1.6.2" mathsize="80%" rspace="0.278em">:</mo><mi id="alg1.l17.m1.1.6.3" mathsize="80%">l</mi><mi id="alg1.l17.m1.1.6.4" mathsize="80%">e</mi><mi id="alg1.l17.m1.1.6.5" mathsize="80%">n</mi><mo id="alg1.l17.m1.1.6.6">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="alg1.l17.m1.1c">B_{next}=R_{starve}+(R-R_{starve})\left[:len\right]</annotation><annotation encoding="application/x-llamapun" id="alg1.l17.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT = italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT + ( italic_R - italic_R start_POSTSUBSCRIPT italic_s italic_t italic_a italic_r italic_v italic_e end_POSTSUBSCRIPT ) [ : italic_l italic_e italic_n ]</annotation></semantics></math><span class="ltx_text" id="alg1.l17.3" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg1.l18"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg1.l18.1.1.1" style="font-size:80%;">18:</span></span><span class="ltx_text" id="alg1.l18.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg1.l18.3" style="font-size:80%;">return</span><span class="ltx_text" id="alg1.l18.4" style="font-size:80%;"> </span><math alttext="B_{next}" class="ltx_Math" display="inline" id="alg1.l18.m1.1"><semantics id="alg1.l18.m1.1a"><msub id="alg1.l18.m1.1.1" 
xref="alg1.l18.m1.1.1.cmml"><mi id="alg1.l18.m1.1.1.2" mathsize="80%" xref="alg1.l18.m1.1.1.2.cmml">B</mi><mrow id="alg1.l18.m1.1.1.3" xref="alg1.l18.m1.1.1.3.cmml"><mi id="alg1.l18.m1.1.1.3.2" mathsize="80%" xref="alg1.l18.m1.1.1.3.2.cmml">n</mi><mo id="alg1.l18.m1.1.1.3.1" xref="alg1.l18.m1.1.1.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.3" mathsize="80%" xref="alg1.l18.m1.1.1.3.3.cmml">e</mi><mo id="alg1.l18.m1.1.1.3.1a" xref="alg1.l18.m1.1.1.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.4" mathsize="80%" xref="alg1.l18.m1.1.1.3.4.cmml">x</mi><mo id="alg1.l18.m1.1.1.3.1b" xref="alg1.l18.m1.1.1.3.1.cmml"></mo><mi id="alg1.l18.m1.1.1.3.5" mathsize="80%" xref="alg1.l18.m1.1.1.3.5.cmml">t</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="alg1.l18.m1.1b"><apply id="alg1.l18.m1.1.1.cmml" xref="alg1.l18.m1.1.1"><csymbol cd="ambiguous" id="alg1.l18.m1.1.1.1.cmml" xref="alg1.l18.m1.1.1">subscript</csymbol><ci id="alg1.l18.m1.1.1.2.cmml" xref="alg1.l18.m1.1.1.2">𝐵</ci><apply id="alg1.l18.m1.1.1.3.cmml" xref="alg1.l18.m1.1.1.3"><times id="alg1.l18.m1.1.1.3.1.cmml" xref="alg1.l18.m1.1.1.3.1"></times><ci id="alg1.l18.m1.1.1.3.2.cmml" xref="alg1.l18.m1.1.1.3.2">𝑛</ci><ci id="alg1.l18.m1.1.1.3.3.cmml" xref="alg1.l18.m1.1.1.3.3">𝑒</ci><ci id="alg1.l18.m1.1.1.3.4.cmml" xref="alg1.l18.m1.1.1.3.4">𝑥</ci><ci id="alg1.l18.m1.1.1.3.5.cmml" xref="alg1.l18.m1.1.1.3.5">𝑡</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="alg1.l18.m1.1c">B_{next}</annotation><annotation encoding="application/x-llamapun" id="alg1.l18.m1.1d">italic_B start_POSTSUBSCRIPT italic_n italic_e italic_x italic_t end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="alg1.l18.5" style="font-size:80%;"> </span> </div> </div> </figure> </section> <section class="ltx_subsubsection" id="S4.SS4.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">4.4.2 Mixture inference mode.</h4> <div class="ltx_para" id="S4.SS4.SSS2.p1"> <p class="ltx_p" id="S4.SS4.SSS2.p1.4">Compared to the unmerged mode, merged inference supports only one LoRA adapter at once, which results in the starvation of requests from other vision tasks. To alleviate starvation, we propose a novel inference mode, deLoRA, to enable the simultaneous execution of merged and unmerged inference. As shown in Fig.<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S4.F13" title="Figure 13 ‣ 4.4 Flexible LoRA Adapters Orchestration ‣ 4. V-LoRA Design ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">13</span></a>, LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.1">1</sub> handles the requests in merged mode, while other LoRA adapters, LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.2"><span class="ltx_text ltx_font_italic" id="S4.SS4.SSS2.p1.4.2.1">x</span></sub>, process their requests in unmerged mode. To maintain the consistent results of LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.3"><span class="ltx_text ltx_font_italic" id="S4.SS4.SSS2.p1.4.3.1">x</span></sub>’s request, we introduce the deLoRA branch to prevent contamination from LoRA<sub class="ltx_sub" id="S4.SS4.SSS2.p1.4.4">1</sub>. The weight of deLoRA is the same as that of the merged LoRA. 
Based on the distributive property of matrix multiplication, the correctness can be verified as follows.</p> <table class="ltx_equationgroup ltx_eqn_table" id="S4.Ex1"> <tbody id="S4.Ex1X"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathbf{output_{x}}" class="ltx_Math" display="inline" id="S4.Ex1X.2.1.1.m1.1"><semantics id="S4.Ex1X.2.1.1.m1.1a"><msub id="S4.Ex1X.2.1.1.m1.1.1" xref="S4.Ex1X.2.1.1.m1.1.1.cmml"><mi id="S4.Ex1X.2.1.1.m1.1.1.2" mathsize="90%" xref="S4.Ex1X.2.1.1.m1.1.1.2.cmml">𝐨𝐮𝐭𝐩𝐮𝐭</mi><mi id="S4.Ex1X.2.1.1.m1.1.1.3" mathsize="90%" xref="S4.Ex1X.2.1.1.m1.1.1.3.cmml">𝐱</mi></msub><annotation-xml encoding="MathML-Content" id="S4.Ex1X.2.1.1.m1.1b"><apply id="S4.Ex1X.2.1.1.m1.1.1.cmml" xref="S4.Ex1X.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.Ex1X.2.1.1.m1.1.1.1.cmml" xref="S4.Ex1X.2.1.1.m1.1.1">subscript</csymbol><ci id="S4.Ex1X.2.1.1.m1.1.1.2.cmml" xref="S4.Ex1X.2.1.1.m1.1.1.2">𝐨𝐮𝐭𝐩𝐮𝐭</ci><ci id="S4.Ex1X.2.1.1.m1.1.1.3.cmml" xref="S4.Ex1X.2.1.1.m1.1.1.3">𝐱</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex1X.2.1.1.m1.1c">\displaystyle\mathbf{output_{x}}</annotation><annotation encoding="application/x-llamapun" id="S4.Ex1X.2.1.1.m1.1d">bold_output start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{merge}}-\mathbf{W_{deLoRA_{1% }}}+\mathbf{W_{LoRA_{x}}})" class="ltx_Math" display="inline" id="S4.Ex1X.3.2.2.m1.1"><semantics id="S4.Ex1X.3.2.2.m1.1a"><mrow id="S4.Ex1X.3.2.2.m1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.3" xref="S4.Ex1X.3.2.2.m1.1.1.3.cmml"></mi><mo id="S4.Ex1X.3.2.2.m1.1.1.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.2.cmml">=</mo><mrow id="S4.Ex1X.3.2.2.m1.1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.1.cmml"><msub id="S4.Ex1X.3.2.2.m1.1.1.1.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.2.cmml">𝐢𝐧𝐩𝐮𝐭</mi><mi id="S4.Ex1X.3.2.2.m1.1.1.1.3.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.3.cmml">𝐱</mi></msub><mo id="S4.Ex1X.3.2.2.m1.1.1.1.2" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="S4.Ex1X.3.2.2.m1.1.1.1.2.cmml">×</mo><mrow id="S4.Ex1X.3.2.2.m1.1.1.1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml"><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml"><mrow id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.cmml"><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2.cmml">𝐖</mi><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3.cmml">𝐦𝐞𝐫𝐠𝐞</mi></msub><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1.cmml">−</mo><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2.cmml">𝐖</mi><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2" mathsize="90%" 
xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2.cmml">𝐝𝐞𝐋𝐨𝐑𝐀</mi><mn id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3.cmml">𝟏</mn></msub></msub></mrow><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1.cmml">+</mo><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2.cmml">𝐖</mi><msub id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.cmml"><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2.cmml">𝐋𝐨𝐑𝐀</mi><mi id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3" mathsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3.cmml">𝐱</mi></msub></msub></mrow><mo id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S4.Ex1X.3.2.2.m1.1b"><apply id="S4.Ex1X.3.2.2.m1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1"><eq id="S4.Ex1X.3.2.2.m1.1.1.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.2"></eq><csymbol cd="latexml" id="S4.Ex1X.3.2.2.m1.1.1.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.3">absent</csymbol><apply id="S4.Ex1X.3.2.2.m1.1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1"><times id="S4.Ex1X.3.2.2.m1.1.1.1.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.2"></times><apply id="S4.Ex1X.3.2.2.m1.1.1.1.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.2">𝐢𝐧𝐩𝐮𝐭</ci><ci id="S4.Ex1X.3.2.2.m1.1.1.1.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.3.3">𝐱</ci></apply><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1"><plus id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.1"></plus><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2"><minus id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.1"></minus><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.2">𝐖</ci><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.2.3">𝐦𝐞𝐫𝐠𝐞</ci></apply><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.2">𝐖</ci><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.2">𝐝𝐞𝐋𝐨𝐑𝐀</ci><cn id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3.cmml" type="integer" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.2.3.3.3">1</cn></apply></apply></apply><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.1.cmml" 
xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.2">𝐖</ci><apply id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3"><csymbol cd="ambiguous" id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.1.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3">subscript</csymbol><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.2">𝐋𝐨𝐑𝐀</ci><ci id="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3.cmml" xref="S4.Ex1X.3.2.2.m1.1.1.1.1.1.1.3.3.3">𝐱</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex1X.3.2.2.m1.1c">\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{merge}}-\mathbf{W_{deLoRA_{1% }}}+\mathbf{W_{LoRA_{x}}})</annotation><annotation encoding="application/x-llamapun" id="S4.Ex1X.3.2.2.m1.1d">= bold_input start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT × ( bold_W start_POSTSUBSCRIPT bold_merge end_POSTSUBSCRIPT - bold_W start_POSTSUBSCRIPT bold_deLoRA start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + bold_W start_POSTSUBSCRIPT bold_LoRA start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="S4.Ex1Xa"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{base}}+\mathbf{W_{LoRA_{x}}})," class="ltx_Math" display="inline" id="S4.Ex1Xa.2.1.1.m1.1"><semantics id="S4.Ex1Xa.2.1.1.m1.1a"><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml"><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3.cmml"></mi><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2.cmml">=</mo><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.cmml"><msub id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2.cmml">𝐢𝐧𝐩𝐮𝐭</mi><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3.cmml">𝐱</mi></msub><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2" lspace="0.222em" mathsize="90%" rspace="0.222em" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2.cmml">×</mo><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.2" maxsize="90%" minsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><msub id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2.cmml">𝐖</mi><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3.cmml">𝐛𝐚𝐬𝐞</mi></msub><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1.cmml">+</mo><msub id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2.cmml">𝐖</mi><msub 
id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.cmml"><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2.cmml">𝐋𝐨𝐑𝐀</mi><mi id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3.cmml">𝐱</mi></msub></msub></mrow><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.3" maxsize="90%" minsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S4.Ex1Xa.2.1.1.m1.1.1.1.2" mathsize="90%" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.Ex1Xa.2.1.1.m1.1b"><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1"><eq id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.2"></eq><csymbol cd="latexml" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.3">absent</csymbol><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1"><times id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.2"></times><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.2">𝐢𝐧𝐩𝐮𝐭</ci><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.3.3">𝐱</ci></apply><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1"><plus id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.1"></plus><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.2">𝐖</ci><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.2.3">𝐛𝐚𝐬𝐞</ci></apply><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.2">𝐖</ci><apply id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3"><csymbol cd="ambiguous" id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.1.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3">subscript</csymbol><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.2">𝐋𝐨𝐑𝐀</ci><ci id="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3.cmml" xref="S4.Ex1Xa.2.1.1.m1.1.1.1.1.1.1.1.1.3.3.3">𝐱</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.Ex1Xa.2.1.1.m1.1c">\displaystyle=\mathbf{input_{x}}\times(\mathbf{W_{base}}+\mathbf{W_{LoRA_{x}}}),</annotation><annotation encoding="application/x-llamapun" id="S4.Ex1Xa.2.1.1.m1.1d">= bold_input start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT × ( bold_W start_POSTSUBSCRIPT bold_base end_POSTSUBSCRIPT + bold_W start_POSTSUBSCRIPT bold_LoRA start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> 
in which $\mathbf{W_{merge}}=\mathbf{W_{base}}+\mathbf{W_{LoRA_{1}}}$, $\mathbf{W_{LoRA_{1}}}=\mathbf{W_{deLoRA_{1}}}$, and $\mathbf{W}$ denotes the weight of the base model, the deLoRA, and the LoRA adapter, respectively; $\mathbf{input_{x}}$ and $\mathbf{output_{x}}$ are the input and output of a request served by LoRA$_{x}$.

Mixture inference mode has two advantages: 1) it does not incur the cost of switching from merged to unmerged mode, and 2) it incurs less extra computation than unmerged inference when more requests target the merged LoRA adapter than the other adapters.
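For concreteness, the following is a minimal PyTorch-style sketch of the merged and mixture computations implied by the equation above. The names (`delta_W`, `merged_id`, etc.) are illustrative rather than V-LoRA's actual interfaces, and a real implementation would keep the low-rank factors instead of materializing each full-rank update:

```python
def mixture_forward(inputs, W_base, delta_W, lora_ids, merged_id):
    """Sketch of merged vs. mixture LoRA inference for one linear layer.

    inputs   : (batch, d_in) PyTorch tensor, one row per request
    W_base   : (d_in, d_out) frozen base weight
    delta_W  : dict adapter_id -> (d_in, d_out) low-rank update (A @ B)
    lora_ids : per-request adapter ids
    merged_id: the adapter currently merged into the base weight
    """
    # Merged mode: one dense GEMM serves every request of the merged adapter.
    W_merge = W_base + delta_W[merged_id]
    out = inputs @ W_merge

    # Mixture mode: requests of other adapters "de-merge" the merged adapter
    # and add their own, i.e. input_x @ (W_merge - W_deLoRA_1 + W_LoRA_x).
    for i, lid in enumerate(lora_ids):
        if lid != merged_id:
            out[i] = inputs[i] @ (W_merge - delta_W[merged_id] + delta_W[lid])
    return out
```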
4.4.3 Scheduling policy. To minimize the average response latency and meet each request's latency constraint, our orchestrator must carefully coordinate requests, adapters, and inference modes. Our policy follows a greedy heuristic with two principles: (1) execute in merged mode whenever possible, as it yields the fastest response with no extra overhead; (2) when starvation occurs, switch first to mixture mode and then to unmerged mode, in increasing order of switching cost and extra computation. Alg. 1 shows the pseudo-code. To alleviate starvation, it assigns each request a credit, defined as its waiting time plus its execution time in the current mode plus the mode-switch latency (line #2), and sets a tolerance threshold $\theta$ as the condition for entering mixture mode. When the request workload meets the criteria for switching to merged mode, the algorithm switches to merged mode (lines #5-8). When the number of starving requests exceeds $\theta$, Alg. 1 processes them in mixture mode immediately (lines #9-12). When it further exceeds half the maximum batch size, the algorithm switches to unmerged mode (lines #13-15).
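A condensed Python sketch of this heuristic is given below. It is not Alg. 1 itself: the request attributes (`wait_time`, `exec_time`, `switch_cost`, `latency_budget`), the 60% merge-friendliness share, and the way starvation is detected from the credit are illustrative assumptions.

```python
def merge_criteria_met(pending, share=0.6):
    # Illustrative: merged mode pays off when most queued requests share one adapter.
    if not pending:
        return False
    counts = {}
    for r in pending:
        counts[r.lora_id] = counts.get(r.lora_id, 0) + 1
    return max(counts.values()) / len(pending) >= share

def choose_mode(pending, current_mode, theta, max_batch):
    # Credit: waiting time + execution time in the current mode + mode-switch latency.
    for r in pending:
        r.credit = r.wait_time + r.exec_time[current_mode] + r.switch_cost
    starving = [r for r in pending if r.credit > r.latency_budget]

    if merge_criteria_met(pending):      # principle (1): merged mode whenever possible
        return "merged"
    if len(starving) > max_batch // 2:   # severe starvation: fall back to unmerged mode
        return "unmerged"
    if len(starving) > theta:            # moderate starvation: cheaper switch to mixture
        return "mixture"
    return current_mode                  # otherwise keep the current mode
```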
5. Implementation

We implement V-LoRA on top of several tools, including PyTorch (v2.0.1) (pytorch), Triton (v2.0.1) (Triton), CUTLASS (v3.5.1) (cutlass), and vLLM (v0.3.0) (vLLM), in ~7.1K LOC. Most of the code is written in Python (v3.9), except ATMM, which is written in CUDA. We build V-LoRA on vLLM, specifically the LightLLM (LightLLM) variant, because of its advanced features such as PagedAttention, iteration-level scheduling, and token-based memory management. The three key techniques in V-LoRA are implemented as follows. (1) We implement accuracy-aware LoRA adapter generation for popular LMMs, including the Qwen-VL (QwenVL) and LLaVA (llava) series, based on the transformers (transformers) and PEFT (peft) libraries. (2) We implement ATMM in CUDA C++ on top of CUTLASS. Its hash table, which stores the optimal input-tiling pairs, uses a 128-bit unsigned integer key to index input shapes. Since CUTLASS operators cannot be compiled dynamically, we create a Python interface that binds and packages the implementations of the optimal tiling configurations with Pybind11 (pybind11) and setuptools (setuptools). (3) We integrate ATMM into vLLM to support unmerged inference, mixture inference, and swift inference-mode switching. To manage the many adapters and requests in unmerged and mixture mode, we encode the LoRA type of each request as a one-hot vector and build a request-type mapping matrix for the current batch.
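As a rough illustration of the request-type mapping matrix (the actual vLLM integration is more involved, and the names here are ours, not V-LoRA's API):

```python
import torch

def build_request_type_matrix(lora_ids, resident_adapters):
    """Sketch: one-hot request-to-adapter mapping for the current batch.

    lora_ids          : per-request adapter ids in the batch
    resident_adapters : ids of the adapters currently loaded on the GPU
    Returns an (n_requests, n_adapters) 0/1 matrix whose row i selects the
    adapter of request i, so batched unmerged/mixture LoRA kernels can gather
    the right low-rank factors per request.
    """
    col = {aid: j for j, aid in enumerate(resident_adapters)}
    M = torch.zeros(len(lora_ids), len(resident_adapters))
    for i, lid in enumerate(lora_ids):
        M[i, col[lid]] = 1.0
    return M
```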
For efficient memory management, we use unified memory management (sheng2023slora) for the KV cache and the adapters. V-LoRA streams the text and images in requests asynchronously via RPyC (rpyc). After receiving a request, V-LoRA identifies its LoRA adapter, dispatches the request to that adapter, generates a response, and returns it.

LoRA adapter swap. Since a large number of LoRA adapters may be generated, we keep them in main memory. To reduce delay, we store only the low-rank factors $A$ and $B$ rather than $\Delta W$, swap them asynchronously between GPU and host memory, and compute $\Delta W$ with ATMM at runtime.
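A small sketch of this swap scheme, assuming PyTorch tensors and a plain matrix multiply in place of ATMM (the class and method names are illustrative):

```python
class AdapterStore:
    """Sketch of host-side LoRA storage: keep only the low-rank factors A, B
    and materialize delta_W = B @ A on the GPU when an adapter is activated."""

    def __init__(self):
        self.cpu_factors = {}          # adapter_id -> (A, B) pinned on the host

    def register(self, adapter_id, A, B):
        # A: (r, d_in), B: (d_out, r); storing factors costs O(r * (d_in + d_out))
        # instead of O(d_in * d_out) for the full delta_W.
        self.cpu_factors[adapter_id] = (A.pin_memory(), B.pin_memory())

    def activate(self, adapter_id, device="cuda"):
        A, B = self.cpu_factors[adapter_id]
        # Asynchronous host-to-GPU copies of the small factors only.
        A_gpu = A.to(device, non_blocking=True)
        B_gpu = B.to(device, non_blocking=True)
        # delta_W is recomputed at runtime; V-LoRA would use ATMM for this step.
        return B_gpu @ A_gpu           # (d_out, d_in) weight update
```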
KV Cache reuse. The same images may be accessed multiple times in some applications, e.g., multi-round visual question answering (vqa2). We implement prefix matching, based on CacheBlend (yao2024cacheblend) and SGLang (zheng2023efficiently), to reuse the KV cache of identical images, avoiding redundant storage.
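A minimal sketch of the idea, assuming images are represented by their vision-token prefixes (this is our simplification, not the CacheBlend/SGLang mechanism itself):

```python
class ImageKVCache:
    """Sketch: reuse KV cache entries keyed by an image's token prefix."""

    def __init__(self):
        self.cache = {}                        # token-id prefix -> KV tensors

    def lookup(self, image_token_ids):
        # Identical images produce identical vision-token prefixes.
        return self.cache.get(tuple(image_token_ids))

    def store(self, image_token_ids, kv):
        self.cache[tuple(image_token_ids)] = kv
```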
6. Evaluation

We evaluate V-LoRA with two vision applications involving five tasks on three LMMs. The key takeaways are:

• V-LoRA reduces end-to-end latency by 20-89% compared to state-of-the-art serving systems and achieves accuracy comparable to specialized small models on specific domains. (§6.2)
• Accuracy-aware LoRA Adapter Generation brings remarkable accuracy and throughput benefits; Adaptive-tiling LoRA Adapters Batching and Flexible LoRA Adapters Orchestration substantially improve latency and throughput. (§6.3)
• V-LoRA is stable under diverse workloads and numbers of LoRA adapters, and scales to multiple GPUs. (§6.4)
Figure 14. Average token latency comparison over various serving systems on two vision applications and three LMMs.

Model           Vision Encoder        Size   Layer #   Dimension
Qwen-VL-7B      Openclip-ViT (1.9B)   18GB   32        4096
LLaVA-1.5-7B    CLIP-ViT (0.3B)       13GB   32        4096
LLaVA-1.5-13B   CLIP-ViT (0.3B)       24GB   40        5120
Table 2. Model configurations.

6.1 Experimental Setup

Vision applications and datasets. We select two distinct types of vision applications, visual retrieval and video analytics, to evaluate V-LoRA. Visual retrieval analyzes images and responds to queries; it involves visual question-answering, image captioning, and specific-target detection tasks as required by the queries. We evaluate visual retrieval on the SharedGPT-4V (SharedGPT4V) and RefCOCO (kazemzadeh2014referitgame; yu2016modeling) datasets. Video analytics ingests and analyzes each RGB frame of a video and outputs results for fixed vision tasks, including object detection and video understanding, as in prior work (wang2024region; yuan2022PacketGame). Object detection locates and identifies objects in each video frame on YODA (xiao2021towards) and Cityscapes (Cordts2016Cityscapes); video understanding recognizes actions over consecutive video frames on UCF101 (UCF101).

Testbed and workload. We evaluate V-LoRA on a server equipped with one NVIDIA A100 80GB GPU and an Intel Xeon Platinum 8358 CPU with 128 GB of host memory. The visual retrieval workload is drawn from production traces in the Microsoft Azure LLM inference trace 2023 (AzureLLM). Because the volume of requests in this trace exceeds current hardware capabilities, we follow prior work (wu2024dlora) and randomly sample requests at varying rates in a round-robin fashion.
Like prior work (wang2024region; du2020server), the video analytics workload ingests one 30-frame video chunk per second per stream.

Metrics. Accuracy follows the standard metric for each task: visual retrieval is measured by VQA score (goyal2017making), object detection by average F1-score, and video understanding by Top-1 accuracy. For system metrics, we evaluate end-to-end throughput, i.e., requests per second, and average token latency, i.e., the sum of each request's end-to-end latency divided by the total number of tokens.
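For clarity, the average token latency metric can be computed as in the short sketch below (field names are illustrative):

```python
def average_token_latency(requests):
    """Sum of each request's end-to-end latency divided by the total number
    of tokens, as defined above; `requests` is a list of dicts (illustrative)."""
    total_latency = sum(r["e2e_latency_s"] for r in requests)
    total_tokens = sum(r["num_tokens"] for r in requests)
    return total_latency / total_tokens
```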
Models. We choose the widely used open-source LMMs Qwen-VL (QwenVL) and the LLaVA-1.5 series (llava), using LLaVA models of different sizes, LLaVA-v1.5-7B and LLaVA-v1.5-13B. The details of these models are shown in Table 2. The rank of the LoRA adapters is set to 64. We also use five small models that yield state-of-the-art performance on the corresponding datasets for accuracy comparison: VisionMamba (VisionMamba), YOLO (glenn2021YOLOV5), OSCAR (li2020oscar), VideoMAE (VideoMAE), and UNINEXT (yan2023universal).

Figure 15. Accuracy comparison between small models and the LMM across five different vision tasks.

Baselines. To show the benefits of V-LoRA, we compare it with three baselines:
• S-LoRA (sheng2023slora) serves only in unmerged mode and employs a customized CUDA operator to batch concurrent heterogeneous LoRA computations.
• Punica (chen2024punica) also serves only in unmerged mode, like S-LoRA, but employs its own operator.
<br class="ltx_break"/> <span class="ltx_inline-block ltx_transformed_outer" id="S6.SS1.p5.3.3" style="width:4.0pt;height:3.6pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(-0.5pt,0.4pt) scale(0.8,0.8) ;"> <span class="ltx_p" id="S6.SS1.p5.3.3.1"><math alttext="\bullet" class="ltx_Math" display="inline" id="S6.SS1.p5.3.3.1.m1.1"><semantics id="S6.SS1.p5.3.3.1.m1.1a"><mo id="S6.SS1.p5.3.3.1.m1.1.1" xref="S6.SS1.p5.3.3.1.m1.1.1.cmml">∙</mo><annotation-xml encoding="MathML-Content" id="S6.SS1.p5.3.3.1.m1.1b"><ci id="S6.SS1.p5.3.3.1.m1.1.1.cmml" xref="S6.SS1.p5.3.3.1.m1.1.1">∙</ci></annotation-xml><annotation encoding="application/x-tex" id="S6.SS1.p5.3.3.1.m1.1c">\bullet</annotation><annotation encoding="application/x-llamapun" id="S6.SS1.p5.3.3.1.m1.1d">∙</annotation></semantics></math></span> </span></span> <span class="ltx_text ltx_font_bold" id="S6.SS1.p5.3.7">dLoRA <cite class="ltx_cite ltx_citemacro_citep">(<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#bib.bib81" title="">wu2024dlora, </a>)</cite></span> dynamically switches between unmerged and merged mode based on workload and invokes PyTorch operator <span class="ltx_text ltx_font_typewriter" id="S6.SS1.p5.3.8">Einsum</span> to implement unmerged inference.</p> </div> </section> <section class="ltx_subsection" id="S6.SS2"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">6.2 End-to-End Performance</h3> <div class="ltx_para" id="S6.SS2.p1"> <p class="ltx_p" id="S6.SS2.p1.1">This section reports the E2E performance of V-LoRA on multiple LMMs and applications.</p> </div> <figure class="ltx_figure" id="S6.F19"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.1" style="width:101.9pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="668" id="S6.F19.1.g1" src="x19.png" width="832"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 16. </span>Latency comparison between original LMM head and video analytics head.</figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.2" style="width:101.9pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="660" id="S6.F19.2.g1" src="x20.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 17. </span>Latency comparison of different operators across different token batch sizes.</figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.3" style="width:104.1pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="646" id="S6.F19.3.g1" src="x21.png" width="790"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 18. 
</span><span class="ltx_text" id="S6.F19.3.2.1" style="color:#000000;">Performance of different operators at average, 90-tile, and 95-tile .</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_4"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F19.4" style="width:104.1pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="627" id="S6.F19.4.g1" src="x22.png" width="787"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 19. </span><span class="ltx_text" id="S6.F19.4.2.1" style="color:#000000;">Performance of different schedulers under different skewness.</span></figcaption> </figure> </div> </div> </figure> <div class="ltx_para ltx_noindent" id="S6.SS2.p2"> <p class="ltx_p" id="S6.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S6.SS2.p2.1.1">System performance.</span> V-LoRA achieves notably lower average token latency than dLoRA, Punica, and SLoRA regardless of vision applications and LMMs. The first row in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F14" title="Figure 14 ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">14</span></a> shows their performance for visual retrieval on three LMMs. Across three LMMs, V-LoRA reduces 72%, 50%, and 20% average token latency, compared to dLoRA, Punica, and S-LoRA, respectively. To Punica and S-LoRA, this acceleration is obvious. They only work in unmerge mode, which ignores the merge-friendly workload pattern, <em class="ltx_emph ltx_font_italic" id="S6.SS2.p2.1.2">e.g.,</em> 60% of requests asking for the same LoRA adapter. dLoRA, though, takes the preference of both inference modes into account, its high mode switching cost and inefficient <span class="ltx_text ltx_font_typewriter" id="S6.SS2.p2.1.3">Einsum</span> operator in unmerged inference incurs this 20% drop; conversely, V-LoRA’s ATMM operator enables swift switch and efficient unmerge and mixture mode (more experiments in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.SS3" title="6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">6.3</span></a>). Moreover, along with the increasing requests per second, the inflection points of most serving systems occur at 6, which can be eliminated if equipped with more GPUs (more in §<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.SS4" title="6.4 Stability and Scalability ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">6.4</span></a>).</p> </div> <div class="ltx_para" id="S6.SS2.p3"> <p class="ltx_p" id="S6.SS2.p3.1">V-LoRA delivers the best service on video analytics than other systems, as shown in the second row of Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F14" title="Figure 14 ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">14</span></a>. It makes 89%, 83%, and 71% of average token latency reduction than dLoRA, Punica, and S-LoRA, respectively. This benefit arises from the vision task head. 
It effectively eliminates the multiple rounds of autoregressive inference in LLMs (more experiments in §6.3.1).

Comparing the columns of Fig. 14, V-LoRA yields larger gains over the other serving systems on the video analytics application. This difference stems from the different distributions of input and output token lengths in the two vision applications. Unlike visual retrieval, video analytics typically has more input tokens and fewer output tokens, e.g., each video understanding request has 6$\times$256 input tokens and 5-10 output tokens, whereas a VQA request has 256 and 200+. Since input tokens are batched in the prefill stage, they cost far less time (<1 ms per token) than decode-stage output tokens (30-50 ms per token). Furthermore, the extra latency of unmerged inference grows with input length (see Fig. 19); thus, the longer-input video analytics workload benefits more from V-LoRA's mode switching.

Accuracy performance. V-LoRA achieves accuracy close to or surpassing the state-of-the-art accuracy of domain-specific small models across various tasks. Fig. 15 reports the results of Qwen-VL with LoRA adapters fine-tuned for different vision tasks. We train on an A100 80GB for five distinct tasks and test against the corresponding SOTA small models. For instance, V-LoRA achieves a 4.3-5% accuracy improvement on visual QA and image captioning. For tasks where small models typically excel, such as object detection and video understanding, V-LoRA's fine-tuned LoRA adapters improve Qwen-VL's accuracy by 24.5-62.2%, reaching competitive accuracy in these domains.
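As a rough, illustrative back-of-the-envelope check of the token-length argument above (using the ballpark per-token costs quoted there, not new measurements):

```python
# Illustrative estimate only: per-token costs are the ranges quoted above.
prefill_ms_per_token = 1.0     # prefill, upper bound of "<1 ms per token"
decode_ms_per_token = 40.0     # decode, middle of the 30-50 ms range

# Video understanding request: 6 x 256 input tokens, ~8 output tokens.
video_ms = 6 * 256 * prefill_ms_per_token + 8 * decode_ms_per_token    # ~1.9 s
# VQA request: ~256 input tokens, 200+ output tokens.
vqa_ms = 256 * prefill_ms_per_token + 200 * decode_ms_per_token        # ~8.3 s

print(video_ms, vqa_ms)  # prefill dominates video analytics; decode dominates VQA
```

Under these assumed numbers, video analytics latency is dominated by the prefill stage, which is why reducing prefill-side LoRA overhead through mode switching matters more there.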
</p> </div> </section> <section class="ltx_subsection" id="S6.SS3"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">6.3 Comprehensive Component-wise Analysis</h3> <div class="ltx_para" id="S6.SS3.p1"> <p class="ltx_p" id="S6.SS3.p1.1">We provide an in-depth performance analysis of individual system components. If not mentioned, all results are tested on the Qwen-VL model with 10 requests per second.</p> </div> <section class="ltx_subsubsection" id="S6.SS3.SSS1"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">6.3.1 Accuracy-aware LoRA Adapter Generation</h4> <div class="ltx_para" id="S6.SS3.SSS1.p1"> <p class="ltx_p" id="S6.SS3.SSS1.p1.1">helps V-LoRA achieve great throughput improvement while keeping high accuracy. The accuracy gain has been discussed above; we only analyze the throughput gain from the vision task head. Especially for video analytics tasks, the vision task head employed by V-LoRA significantly reduces the rounds of autoregressive decodes, thereby greatly enhancing system performance. As illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F19" title="Figure 19 ‣ 6.2 End-to-End Performance ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">19</span></a>, V-LoRA achieves a 41-63% reduction in latency compared to the original language modeling head. This gain is attributed to the video analytics head’s contribution to minimizing the prompt length and requiring only one inference round. In video understanding tasks, V-LoRA equipped with the video analytics head can match the accuracy of certain small models and handle 3-4 video streams in real time.</p> </div> </section> <section class="ltx_subsubsection" id="S6.SS3.SSS2"> <h4 class="ltx_title ltx_runin ltx_font_bold ltx_title_subsubsection">6.3.2 Adaptive-tiling LoRA Adapters Batching</h4> <div class="ltx_para" id="S6.SS3.SSS2.p1"> <p class="ltx_p" id="S6.SS3.SSS2.p1.6">gives the most efficient and stable matrix multiplication by ATMM among all comparisons. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F19" title="Figure 19 ‣ 6.2 End-to-End Performance ‣ 6. 
6.3.2 Adaptive-tiling LoRA Adapters Batching

Adaptive-tiling LoRA adapter batching, via ATMM, delivers the most efficient and stable matrix multiplication among all compared operators. As shown in Fig. 19, testing over 100 rounds after 10 warm-ups on a large set of diverse inputs, ATMM achieves the lowest average latency across batch sizes, running 2.7×, 2.3×, and 3.4× faster than S-LoRA, Punica, and dLoRA, respectively. As for stability, the statistics plotted in Fig. 19 show that ATMM delivers the most robust performance, reducing latency fluctuation by 3×, 2×, and 2× compared to S-LoRA, Punica, and dLoRA. These benefits stem from the profile-based optimal tiling search performed in the offline phase. When the batch size exceeds 1024, for instance, ATMM adaptively switches to a larger tile shape and fully utilizes the hardware resources, while the other operators suffer from static tiling.
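The profile-then-lookup idea behind the offline tiling search can be sketched as follows. This is a simplified stand-in that profiles a blocked NumPy GEMM over candidate tile sizes per batch-size bucket; the real ATMM operator profiles GPU kernel tiling configurations offline and indexes the best one by input shape at runtime.

```python
import time
import numpy as np

CANDIDATE_TILES = [32, 64, 128, 256]      # candidate tile edge lengths (illustrative)
BATCH_BUCKETS   = [16, 256, 1024, 4096]   # representative batch sizes to profile

def tiled_matmul(a, b, tile):
    """Blocked GEMM: a CPU stand-in for a tiled GPU kernel."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

def profile_best_tiles(hidden=512, rank=64):
    """Offline phase: pick the fastest tile per batch-size bucket."""
    rng = np.random.default_rng(0)
    best = {}
    for bs in BATCH_BUCKETS:
        x = rng.standard_normal((bs, hidden), dtype=np.float32)
        w = rng.standard_normal((hidden, rank), dtype=np.float32)
        timings = {}
        for tile in CANDIDATE_TILES:
            t0 = time.perf_counter()
            tiled_matmul(x, w, tile)
            timings[tile] = time.perf_counter() - t0
        best[bs] = min(timings, key=timings.get)
    return best

def pick_tile(table, batch_size):
    """Online phase: look up the profiled tile for the nearest bucket."""
    bucket = min(table, key=lambda b: abs(b - batch_size))
    return table[bucket]

table = profile_best_tiles()
print(table, "-> batch 2000 uses tile", pick_tile(table, 2000))
```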
</span><span class="ltx_text" id="S6.F23.2.2.1" style="color:#000000;">Benefits from swift mode switch.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_many"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F23.3" style="width:86.7pt;"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="663" id="S6.F23.3.g1" src="x25.png" width="831"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 22. </span>Impact of the request skewness. </figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_many"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F23.4" style="width:86.7pt;"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="679" id="S6.F23.4.g1" src="x26.png" width="832"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 23. </span>Impact of the adapter number. </figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_many"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_center ltx_align_bottom" id="S6.F23.fig1" style="width:65.0pt;"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <table class="ltx_tabular ltx_centering ltx_figure_panel ltx_align_middle" id="S6.F23.fig1.1"> <tr class="ltx_tr" id="S6.F23.fig1.1.1"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_tt" id="S6.F23.fig1.1.1.1"> <span class="ltx_text" id="S6.F23.fig1.1.1.1.1"></span><span class="ltx_text" id="S6.F23.fig1.1.1.1.2" style="font-size:80%;"> </span><span class="ltx_text" id="S6.F23.fig1.1.1.1.3" style="font-size:80%;"> <span class="ltx_tabular ltx_align_middle" id="S6.F23.fig1.1.1.1.3.1"> <span class="ltx_tr" id="S6.F23.fig1.1.1.1.3.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.1.3.1.1.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.1.3.1.1.1.1">GPU</span></span></span> <span class="ltx_tr" id="S6.F23.fig1.1.1.1.3.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.1.3.1.2.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.1.3.1.2.1.1">Num.</span></span></span> </span></span><span class="ltx_text" id="S6.F23.fig1.1.1.1.4"></span><span class="ltx_text" id="S6.F23.fig1.1.1.1.5" style="font-size:80%;"></span> </td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S6.F23.fig1.1.1.2"> <span class="ltx_text" id="S6.F23.fig1.1.1.2.1"></span><span class="ltx_text" id="S6.F23.fig1.1.1.2.2" style="font-size:80%;"> </span><span class="ltx_text" id="S6.F23.fig1.1.1.2.3" style="font-size:80%;"> <span class="ltx_tabular ltx_align_middle" id="S6.F23.fig1.1.1.2.3.1"> <span class="ltx_tr" id="S6.F23.fig1.1.1.2.3.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.2.3.1.1.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.2.3.1.1.1.1">TPT</span></span></span> <span class="ltx_tr" id="S6.F23.fig1.1.1.2.3.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S6.F23.fig1.1.1.2.3.1.2.1"><span class="ltx_text ltx_font_bold" id="S6.F23.fig1.1.1.2.3.1.2.1.1">(req/s)</span></span></span> </span></span><span class="ltx_text" id="S6.F23.fig1.1.1.2.4"></span><span class="ltx_text" id="S6.F23.fig1.1.1.2.5" style="font-size:80%;"></span> </td> </tr> <tr class="ltx_tr" id="S6.F23.fig1.1.2"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S6.F23.fig1.1.2.1"><span 
class="ltx_text" id="S6.F23.fig1.1.2.1.1" style="font-size:80%;">1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S6.F23.fig1.1.2.2"><span class="ltx_text" id="S6.F23.fig1.1.2.2.1" style="font-size:80%;">6.07</span></td> </tr> <tr class="ltx_tr" id="S6.F23.fig1.1.3"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S6.F23.fig1.1.3.1"><span class="ltx_text" id="S6.F23.fig1.1.3.1.1" style="font-size:80%;">2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S6.F23.fig1.1.3.2"><span class="ltx_text" id="S6.F23.fig1.1.3.2.1" style="font-size:80%;">11.48</span></td> </tr> <tr class="ltx_tr" id="S6.F23.fig1.1.4"> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t" id="S6.F23.fig1.1.4.1"><span class="ltx_text" id="S6.F23.fig1.1.4.1.1" style="font-size:80%;">4</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S6.F23.fig1.1.4.2"><span class="ltx_text" id="S6.F23.fig1.1.4.2.1" style="font-size:80%;">23.97</span></td> </tr> </table> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"> <figure class="ltx_table ltx_figure_panel ltx_align_center" id="S6.T3"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 3. </span>Scales to multiple GPUs.</figcaption> </figure> </div> </div> </figure> </div> </div> </figure> <div class="ltx_para" id="S6.SS3.SSS2.p2"> <p class="ltx_p" id="S6.SS3.SSS2.p2.2">At the decode stage with small input matrix shapes, the left part in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F19" title="Figure 19 ‣ 6.2 End-to-End Performance ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">19</span></a>, ATMM maintains high efficiency by adapting smaller tile shapes. It delivers comparable latency to S-LoRA, outperforming dLoRA and Punica by 4.5<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS3.SSS2.p2.1.m1.1"><semantics id="S6.SS3.SSS2.p2.1.m1.1a"><mo id="S6.SS3.SSS2.p2.1.m1.1.1" xref="S6.SS3.SSS2.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS3.SSS2.p2.1.m1.1b"><times id="S6.SS3.SSS2.p2.1.m1.1.1.cmml" xref="S6.SS3.SSS2.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS3.SSS2.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS3.SSS2.p2.1.m1.1d">×</annotation></semantics></math> and 2.6<math alttext="\times" class="ltx_Math" display="inline" id="S6.SS3.SSS2.p2.2.m2.1"><semantics id="S6.SS3.SSS2.p2.2.m2.1a"><mo id="S6.SS3.SSS2.p2.2.m2.1.1" xref="S6.SS3.SSS2.p2.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S6.SS3.SSS2.p2.2.m2.1b"><times id="S6.SS3.SSS2.p2.2.m2.1.1.cmml" xref="S6.SS3.SSS2.p2.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S6.SS3.SSS2.p2.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S6.SS3.SSS2.p2.2.m2.1d">×</annotation></semantics></math>. dLoRA suffers from the large context-switching overhead of repeated kernel calls by <span class="ltx_text ltx_font_typewriter" id="S6.SS3.SSS2.p2.2.1">Einsum</span>, while Punica results in a low core utilization due to its mismatched tile shapes. 
6.3.3 Flexible LoRA Adapters Orchestration

Flexible LoRA adapter orchestration dynamically selects and switches inference modes with our swift switcher and the deLoRA operation, offering the best service among all baselines. As shown in Fig. 19, V-LoRA outperforms merge-only, unmerge-only, and dLoRA by 33%, 59%, and 21% in latency under different skewness levels, respectively; the skewness denotes the proportion of requests asking for the most popular LoRA adapter. Merge-only can batch only requests that invoke the same LoRA adapter, leading to underutilized resources and small batch sizes, while unmerge-only introduces significant extra computation. dLoRA shows benefits only under highly skewed workloads because of the poor performance of its unmerged-inference operator, Einsum. Our scheduling policy performs best because it fully exploits the low-latency merged mode and introduces the mixture mode to eliminate some mode switches.
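A minimal sketch of the kind of decision the orchestrator makes is shown below; it is our simplification with illustrative thresholds, not the actual V-LoRA policy. Given the adapter distribution of the pending batch and an estimated switch cost, it chooses between merged, unmerged, and mixture mode.

```python
from collections import Counter

def choose_mode(adapter_ids,
                merge_share_threshold=0.5,   # illustrative threshold, not V-LoRA's value
                switch_cost_ms=5.0,          # cost of folding/unfolding adapters into weights
                est_batch_time_ms=50.0):
    """Pick an inference mode for the next batch (illustrative heuristic).

    merged  : every request uses one adapter, which is folded into the base weights
    mixture : the dominant adapter stays merged; the tail is served with unmerged deltas
    unmerged: no adapter dominates, so apply per-request low-rank deltas
    """
    counts = Counter(adapter_ids)
    dominant, hits = counts.most_common(1)[0]
    share = hits / len(adapter_ids)
    if share == 1.0:
        return f"merged({dominant})"
    if share >= merge_share_threshold and switch_cost_ms < est_batch_time_ms:
        # Keep the popular adapter merged and serve the rest without a full mode switch.
        return f"mixture(merged={dominant})"
    return "unmerged"

print(choose_mode(["traffic"] * 8))                        # merged(traffic)
print(choose_mode(["traffic"] * 6 + ["retail", "drone"]))  # mixture(merged=traffic)
print(choose_mode(["a", "b", "c", "d"]))                   # unmerged
```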
<a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F23" title="Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">23</span></a> case that infers with two LoRA adapters.</p> </div> </section> </section> <section class="ltx_subsection" id="S6.SS4"> <h3 class="ltx_title ltx_font_bold ltx_title_subsection">6.4 Stability and Scalability</h3> <div class="ltx_para" id="S6.SS4.p1"> <p class="ltx_p" id="S6.SS4.p1.1">V-LoRA demonstrates great stability and scalability.</p> </div> <div class="ltx_para ltx_noindent" id="S6.SS4.p2"> <p class="ltx_p" id="S6.SS4.p2.1"><span class="ltx_text ltx_font_bold" id="S6.SS4.p2.1.1">Impacts of different skewness of requests.</span> V-LoRA achieves the best average token latency compared to other systems under diverse skewness. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F23" title="Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">23</span></a> shows that V-LoRA achieves a reduction in average token latency by 76-81%, 72-83%, and 63-76% compared to dLoRA, Punica, and S-LoRA under four different skewness conditions. This superiority arises from the V-LoRA’s timely mode switch and proper requests and adapters orchestration. With the swift switcher and mixture mode, it responds to workload changes fast.</p> </div> <div class="ltx_para ltx_noindent" id="S6.SS4.p3"> <p class="ltx_p" id="S6.SS4.p3.1"><span class="ltx_text ltx_font_bold" id="S6.SS4.p3.1.1">Impacts of different number of LoRA adapters.</span> V-LoRA maintains the best and most stable performance when the number of LoRA adapters increases. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.F23" title="Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. Evaluation ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">23</span></a>, it suffers the minimal impact, which benefits from V-LoRA’s efficient memory management. V-LoRA’s pre-allocated contiguous memory reduces unnecessary memory copy or movement and memory fragmentation. In addition, when the number increases to need LoRA to swap, V-LoRA’s strategy swaps adapter and computes matrix at runtime, and the asynchronous swap contributes to the low latency. With the highly optimized ATMM kernel, the LoRA matrix swapping keeps high stability compared to the matrix multiplication via batched GEMM in dLoRA. </p> </div> <div class="ltx_para ltx_noindent" id="S6.SS4.p4"> <p class="ltx_p" id="S6.SS4.p4.1"><span class="ltx_text ltx_font_bold" id="S6.SS4.p4.1.1">Scales to multiple GPUs.</span> V-LoRA demonstrates excellent scalability and can significantly enhance the overall system throughput by multiple GPUs. As shown in Tab. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#S6.T3" title="Table 3 ‣ Figure 23 ‣ 6.3.2 Adaptive-tiling LoRA Adapters Batching ‣ 6.3 Comprehensive Component-wise Analysis ‣ 6. 
Scales to multiple GPUs. V-LoRA scales well and can significantly increase overall system throughput with multiple GPUs. As shown in Tab. 3, on servers equipped with 1, 2, and 4 A100 GPUs, the total system throughput reaches 6.07, 11.48, and 23.97 requests per second, respectively. In future work, multi-GPU performance can be further improved by incorporating inter-GPU scheduling as in dLoRA [81] and by supporting larger LMMs such as InternVL2-76B [35].

7. Related Works

Large model serving systems have recently leveraged system-level optimizations to improve LLM inference efficiency. With paged memory management, vLLM [50] proposes the PagedAttention operator, which cooperates with a block-based KV cache to minimize GPU memory fragmentation. On the batching side, Orca [87] introduces iteration-level scheduling to continuously batch requests of varying lengths, and DeepSpeed-FastGen [39] further improves on it with a dynamic split-fuse strategy. With a distributed architecture, FlexGen [70] employs offloading to raise LLM serving throughput, while Mooncake [64] features a KVCache-centric architecture that separates the prefill and decoding clusters. Exploiting LLM inference characteristics, SpecInfer [62] utilizes speculative decoding to reduce latency, while SARATHI [18] schedules requests by piggybacking decodes onto chunked prefills.
S-LoRA [69], Punica [25], and dLoRA [81] are the only three systems that also serve multiple LoRA LLMs by batching requests destined for different adapters; their shortcomings are analyzed in depth in this paper. An earlier study, PetS [97], also considers serving multiple parameter-efficient DNN models, but it does not target autoregressive LLMs or the unique system characteristics of LoRA adapters. Compared to them, V-LoRA provides a more efficient serving runtime and, as an end-to-end system, also covers LoRA adapter generation.

Parameter-efficient fine-tuning (PEFT) [40, 54] adapts large pre-trained models to specific tasks or domains. By adjusting only a few layers or parameters, PEFT retains the core pre-trained knowledge while efficiently learning task-specific nuances [58]. As a proof of concept, V-LoRA's LoRA adapter generation adopts a heuristic algorithm.
It can easily be replaced by more advanced PEFT techniques [94]; we leave this to future work.

Retrieval-augmented generation (RAG) [22, 52, 44] enhances LLMs by incorporating relevant knowledge from external databases, achieving performance comparable to fine-tuned LLMs [24]. Some works [66, 77] suggest iterative retrieval throughout generation for higher quality. V-LoRA's system performance greatly exceeds that of RAG because of RAG's costly vector search [27] and long-context prompts.

8. Conclusion

In this paper, we explore the use of LMMs as foundation models for vision applications to achieve both high serving efficiency and a user-friendly natural-language interface. To this end, we propose V-LoRA, an end-to-end system that adapts LMMs to domain-specific visual tasks with LoRA adapters and efficiently manages them at runtime to enrich vision applications. Across two typical vision applications, we show that V-LoRA enables a single LMM to be used effectively, achieving superior performance and generalization on multiple visual tasks. While V-LoRA is by no means the final answer, we hope it serves as a stepping stone toward a poly-basic design for future vision applications and demonstrates the potential of adapting LMMs to visual tasks.

References

[1] LightLLM. https://github.com/ModelTC/lightllm.
[2] NVIDIA A100. https://www.nvidia.com/en-us/data-center/a100/.
[3] PyTorch torch.einsum. https://pytorch.org/docs/stable/generated/torch.einsum.html.
[4] rpyc. https://github.com/tomerfiliba-org/rpyc.
[5] setuptools. https://github.com/pypa/setuptools.
[6] PyTorch: official website of the Python platform. https://pytorch.org/, 2021.
[7] 12 practical large language model (LLM) applications. Techopedia. https://www.techopedia.com/12-practical-large-language-model-llm-applications (accessed 09/18/2024).
[8] 7 top large language model use cases and applications. https://www.projectpro.io/article/large-language-model-use-cases-and-applications/887 (accessed 09/18/2024).
[9] Airbus Aircraft dataset. https://www.kaggle.com/datasets/airbusgeo/airbus-aircrafts-sample-dataset/data (accessed 09/18/2024).
[10] Applications of large language models. InData Labs. https://indatalabs.com/blog/large-language-model-apps (accessed 09/18/2024).
</span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> Claude3.5-Sonnet. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.anthropic.com/news/claude-3-5-sonnet" title="">https://www.anthropic.com/news/claude-3-5-sonnet</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> OpenAI GPT-4o. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openai.com/index/hello-gpt-4o/" title="">https://openai.com/index/hello-gpt-4o/</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> Real-world use cases for large language models (llms). </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://cellstrat.medium.com/real-world-use-cases-for-large-language-models-llms-d71c3a577bf2" title="">https://cellstrat.medium.com/real-world-use-cases-for-large-language-models-llms-d71c3a577bf2</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> Azure LLM inference trace 2023. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/Azure/AzurePublicDataset/blob/master/AzureLLMInferenceDataset2023.md" title="">https://github.com/Azure/AzurePublicDataset/blob/master/AzureLLMInferenceDataset2023.md</a>, (Accessed on 10/07/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> NVIDIA CUTLASS 3.5.1. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/NVIDIA/cutlass" title="">https://github.com/NVIDIA/cutlass</a>, (Accessed on 10/07/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> NVIDIA Tensor Core. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.nvidia.cn/data-center/tensor-cores/" title="">https://www.nvidia.cn/data-center/tensor-cores/</a>, (Accessed on 10/07/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. </span> <span class="ltx_bibblock">Gpt-4 technical report. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib17.1.1">arXiv preprint arXiv:2303.08774</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. </span> <span class="ltx_bibblock">Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib18.1.1">arXiv preprint arXiv:2308.16369</span>, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"> Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. </span> <span class="ltx_bibblock">Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. </span> <span class="ltx_bibblock">2023. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"> Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. </span> <span class="ltx_bibblock">Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib20.1.1">arXiv preprint arXiv:2308.12966</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"> Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. </span> <span class="ltx_bibblock">PipeSwitch: Fast pipelined context switching for deep learning applications. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib21.1.1">USENIX Symposium on Operating Systems Design and Implementation (OSDI)</span>, pages 499–514, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"> Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. </span> <span class="ltx_bibblock">Improving language models by retrieving from trillions of tokens. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib22.1.1">International conference on machine learning (ICML)</span>, pages 2206–2240, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"> Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. </span> <span class="ltx_bibblock">One-for-all: Generalized lora for parameter-efficient fine-tuning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib23.1.1">arXiv preprint arXiv:2306.07967</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"> Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. </span> <span class="ltx_bibblock">Benchmarking large language models in retrieval-augmented generation. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib24.1.1">Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"> Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. </span> <span class="ltx_bibblock">Punica: Multi-tenant lora serving. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib25.1.1">Proceedings of Machine Learning and Systems (MLSys)</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"> Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. </span> <span class="ltx_bibblock">Sharegpt4v: Improving large multi-modal models with better captions, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"> Qi Chen, Bing Zhao, Haidong Wang, Mingqin Li, Chuanjie Liu, Zengzhong Li, Mao Yang, and Jingdong Wang. </span> <span class="ltx_bibblock">Spann: Highly-efficient billion-scale approximate nearest neighborhood search. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib27.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"> Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. </span> <span class="ltx_bibblock">Palm: Scaling language modeling with pathways. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib28.1.1">Journal of Machine Learning Research</span>, 24(240):1–113, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"> Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. </span> <span class="ltx_bibblock">The cityscapes dataset for semantic urban scene understanding. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib29.1.1">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</span>, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"> Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. </span> <span class="ltx_bibblock">Qlora: Efficient finetuning of quantized llms. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib30.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"> Kuntai Du, Ahsan Pervaiz, Xin Yuan, Aakanksha Chowdhery, Qizheng Zhang, Henry Hoffmann, and Junchen Jiang. </span> <span class="ltx_bibblock">Server-Driven Video Streaming for Deep Learning Inference. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib31.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 557–570. ACM, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"> Albert Einstein. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib32.1.1">The General Theory of Relativity</span>, pages 54–75. </span> <span class="ltx_bibblock">Springer Netherlands, Dordrecht, 1922. 
</span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"> Glenn Jocher et. al. </span> <span class="ltx_bibblock">ultralytics/yolov5: v6.0 - YOLOv5n ’Nano’ models, Roboflow integration, TensorFlow export, OpenCV DNN support, October 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"> Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, and Hao Wang. </span> <span class="ltx_bibblock">Mixture-of-loras: An efficient multitask tuning for large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib34.1.1">arXiv preprint arXiv:2403.03432</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"> Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, and Wenhai Wang. </span> <span class="ltx_bibblock">Mini-internvl: A flexible-transfer pocket multimodal model with 5 </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib35.1.1">arXiv preprint arXiv:2410.16261</span>. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"> Michael R Garey and David S Johnson. </span> <span class="ltx_bibblock">Approximation algorithms for bin packing problems: A survey. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib36.1.1">Analysis and design of algorithms in combinatorial optimization</span>, pages 147–172. Springer, 1981. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"> Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. </span> <span class="ltx_bibblock">Making the v in vqa matter: Elevating the role of image understanding in visual question answering. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib37.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</span>, pages 6904–6913, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[38]</span> <span class="ltx_bibblock"> Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. </span> <span class="ltx_bibblock">Making the v in vqa matter: Elevating the role of image understanding in visual question answering. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib38.1.1">IEEE conference on computer vision and pattern recognition (CVPR)</span>, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[39]</span> <span class="ltx_bibblock"> Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. </span> <span class="ltx_bibblock">Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib39.1.1">arXiv preprint arXiv:2401.08671</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[40]</span> <span class="ltx_bibblock"> Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. </span> <span class="ltx_bibblock">Lora: Low-rank adaptation of large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib40.1.1">arXiv preprint arXiv:2106.09685</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[41]</span> <span class="ltx_bibblock"> Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. </span> <span class="ltx_bibblock">pybind11 — seamless operability between c++11 and python, 2016. </span> <span class="ltx_bibblock">https://github.com/pybind/pybind11. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[42]</span> <span class="ltx_bibblock"> Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. </span> <span class="ltx_bibblock">Mistral 7b. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib42.1.1">arXiv preprint arXiv:2310.06825</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[43]</span> <span class="ltx_bibblock"> Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. </span> <span class="ltx_bibblock">Chameleon: scalable adaptation of video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib43.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 253–266, August 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[44]</span> <span class="ltx_bibblock"> Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. </span> <span class="ltx_bibblock">Ragcache: Efficient knowledge caching for retrieval-augmented generation. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib44.1.1">arXiv preprint arXiv:2404.12457</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[45]</span> <span class="ltx_bibblock"> Daniel Kang, Peter Bailis, and Matei Zaharia. </span> <span class="ltx_bibblock">Blazeit: Optimizing declarative aggregation and limit queries for neural network-based video analytics. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib45.1.1">VLDB Endowment</span>, 13(4), 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[46]</span> <span class="ltx_bibblock"> Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. </span> <span class="ltx_bibblock">NoScope: optimizing neural network queries over video at scale. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib46.1.1">Proceedings of the VLDB Endowment</span>, 10(11):1586–1597, August 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[47]</span> <span class="ltx_bibblock"> Daniel Kang, Francisco Romero, Peter D Bailis, Christos Kozyrakis, and Matei Zaharia. 
</span> <span class="ltx_bibblock">Viva: An end-to-end system for interactive video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib47.1.1">CIDR</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[48]</span> <span class="ltx_bibblock"> Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. </span> <span class="ltx_bibblock">Referitgame: Referring to objects in photographs of natural scenes. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib48.1.1">ACL conference on empirical methods in natural language processing (EMNLP)</span>, pages 787–798, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[49]</span> <span class="ltx_bibblock"> Mehrdad Khani, Ganesh Ananthanarayanan, Kevin Hsieh, Junchen Jiang, Ravi Netravali, Yuanchao Shu, Mohammad Alizadeh, and Victor Bahl. </span> <span class="ltx_bibblock">RECL: Responsive resource-efficient continuous learning for video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib49.1.1">USENIX Symposium on Networked Systems Design and Implementation (NSDI)</span>, pages 917–932, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[50]</span> <span class="ltx_bibblock"> Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. </span> <span class="ltx_bibblock">Efficient memory management for large language model serving with pagedattention. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib50.1.1">USENIX Symposium on Operating Systems Principles (OSDI)</span>, pages 611–626, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[51]</span> <span class="ltx_bibblock"> Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. </span> <span class="ltx_bibblock">Efficient memory management for large language model serving with pagedattention. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib51.1.1">Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[52]</span> <span class="ltx_bibblock"> Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. </span> <span class="ltx_bibblock">Retrieval-augmented generation for knowledge-intensive nlp tasks. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib52.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[53]</span> <span class="ltx_bibblock"> Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. </span> <span class="ltx_bibblock">Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. 
</span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib53.1.1">International conference on machine learning (ICML)</span>, pages 19730–19742. PMLR, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[54]</span> <span class="ltx_bibblock"> Xiang Lisa Li and Percy Liang. </span> <span class="ltx_bibblock">Prefix-tuning: Optimizing continuous prompts for generation. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib54.1.1">Proceedings of Annual Meeting of the Association for Computational Linguistics and the Joint Conference on Natural Language Processing</span>, pages 4582–4597, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[55]</span> <span class="ltx_bibblock"> Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. </span> <span class="ltx_bibblock">Oscar: Object-semantics aligned pre-training for vision-language tasks. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib55.1.1">Springer European Conference Computer Vision (ECCV)</span>, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[56]</span> <span class="ltx_bibblock"> Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. </span> <span class="ltx_bibblock">Personal llm agents: Insights and survey about the capability, efficiency and security. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib56.1.1">arXiv preprint arXiv:2401.05459</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[57]</span> <span class="ltx_bibblock"> Yuanqi Li, Arthi Padmanabhan, Pengzhan Zhao, Yufei Wang, Guoqing Harry Xu, and Ravi Netravali. </span> <span class="ltx_bibblock">Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib57.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 359–376, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[58]</span> <span class="ltx_bibblock"> Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. </span> <span class="ltx_bibblock">Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib58.1.1">Advances in Neural Information Processing Systems (NeurIPS)</span>, 35:1950–1965, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[59]</span> <span class="ltx_bibblock"> Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. </span> <span class="ltx_bibblock">Improved baselines with visual instruction tuning, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[60]</span> <span class="ltx_bibblock"> Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. </span> <span class="ltx_bibblock">Visual instruction tuning. 
</span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib60.1.1">Conference and Workshop on Neural Information Processing Systems (NeurIPS)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[61]</span> <span class="ltx_bibblock"> Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. </span> <span class="ltx_bibblock">Peft: State-of-the-art parameter-efficient fine-tuning methods. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/huggingface/peft" title="">https://github.com/huggingface/peft</a>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[62]</span> <span class="ltx_bibblock"> Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. </span> <span class="ltx_bibblock">Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib62.1.1">ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)</span>, pages 932–949, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[63]</span> <span class="ltx_bibblock"> Thomas Politzer. </span> <span class="ltx_bibblock">Vision is our dominant sense. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.brainline.org/article/vision-our-dominant-sense" title="">https://www.brainline.org/article/vision-our-dominant-sense</a>, (Accessed on 09/18/2024). </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[64]</span> <span class="ltx_bibblock"> Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. </span> <span class="ltx_bibblock">Mooncake: A kvcache-centric disaggregated architecture for llm serving. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib64.1.1">arXiv preprint arxiv:2407.00079</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[65]</span> <span class="ltx_bibblock"> Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. </span> <span class="ltx_bibblock">Learning transferable visual models from natural language supervision. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib65.1.1">International conference on machine learning</span>, pages 8748–8763. PMLR, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib66"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[66]</span> <span class="ltx_bibblock"> Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. </span> <span class="ltx_bibblock">In-context retrieval-augmented language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib66.1.1">Transactions of the Association for Computational Linguistics</span>, 11:1316–1331, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib67"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[67]</span> <span class="ltx_bibblock"> Bhardwaj Romil, Xia Zhengxu, Ananthanarayanan Ganesh, Jiang Junchen, Shu Yuanchao, Karianakis Nikolaos, Hsieh Kevin, Bahl Paramvir, and Stoica Ion. </span> <span class="ltx_bibblock">Ekya: Continuous learning of video analytics models on edge compute servers. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib67.1.1">USENIX Symposium on Networked Systems Design and Implementation (NSDI)</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib68"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[68]</span> <span class="ltx_bibblock"> Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. </span> <span class="ltx_bibblock">Nexus: A gpu cluster engine for accelerating dnn-based video analysis. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib68.1.1">ACM Symposium on Operating Systems Principles (SOSP)</span>, pages 322–337, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib69"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[69]</span> <span class="ltx_bibblock"> Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. </span> <span class="ltx_bibblock">S-lora: Serving thousands of concurrent lora adapters. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib69.1.1">Proceedings of Machine Learning and Systems (MLSys)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib70"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[70]</span> <span class="ltx_bibblock"> Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. </span> <span class="ltx_bibblock">Flexgen: High-throughput generative inference of large language models with a single gpu. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib70.1.1">International Conference on Machine Learning (ICML)</span>, pages 31094–31116, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib71"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[71]</span> <span class="ltx_bibblock"> Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. </span> <span class="ltx_bibblock">Ucf101: A dataset of 101 human actions classes from videos in the wild. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib71.1.1">arXiv preprint arXiv:1212.0402</span>, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib72"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[72]</span> <span class="ltx_bibblock"> Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. </span> <span class="ltx_bibblock">Ucf101: A dataset of 101 human actions classes from videos in the wild, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib73"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[73]</span> <span class="ltx_bibblock"> Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. </span> <span class="ltx_bibblock">Gemini: a family of highly capable multimodal models. 
</span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib73.1.1">arXiv preprint arXiv:2312.11805</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib74"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[74]</span> <span class="ltx_bibblock"> Philippe Tillet, H. T. Kung, and David Cox. </span> <span class="ltx_bibblock">Triton: an intermediate language and compiler for tiled neural network computations. </span> <span class="ltx_bibblock">MAPL 2019, page 10–19, New York, NY, USA, 2019. Association for Computing Machinery. </span> </li> <li class="ltx_bibitem" id="bib.bib75"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[75]</span> <span class="ltx_bibblock"> Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. </span> <span class="ltx_bibblock">Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib76"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[76]</span> <span class="ltx_bibblock"> Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. </span> <span class="ltx_bibblock">Llama 2: Open foundation and fine-tuned chat models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib76.1.1">arXiv preprint arXiv:2307.09288</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib77"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[77]</span> <span class="ltx_bibblock"> Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. </span> <span class="ltx_bibblock">Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib77.1.1">Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib78"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[78]</span> <span class="ltx_bibblock"> Weijun Wang, Liang Mi, Shaowei Cen, Haipeng Dai, Yuanchun Li, Xiaoming Fu, and Yunxin Liu. </span> <span class="ltx_bibblock">Region-based content enhancement for efficient video analytics at the edge. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib78.1.1">arXiv preprint arXiv:2407.16990</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib79"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[79]</span> <span class="ltx_bibblock"> Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. </span> <span class="ltx_bibblock">Autodroid: Llm-powered task automation in android. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib79.1.1">Proceedings of the ACM International Conference on Mobile Computing and Networking (MobiCom)</span>, pages 543–557, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib80"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[80]</span> <span class="ltx_bibblock"> Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 
</span> <span class="ltx_bibblock">Huggingface’s transformers: State-of-the-art natural language processing, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib81"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[81]</span> <span class="ltx_bibblock"> Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. </span> <span class="ltx_bibblock">dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib81.1.1">USENIX Symposium on Operating Systems Design and Implementation (OSDI)</span>, pages 911–927, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib82"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[82]</span> <span class="ltx_bibblock"> Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. </span> <span class="ltx_bibblock">Aid: A benchmark data set for performance evaluation of aerial scene classification. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib82.1.1">IEEE Transactions on Geoscience and Remote Sensing</span>, 55(7):3965–3981, July 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib83"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[83]</span> <span class="ltx_bibblock"> Zhujun Xiao, Zhengxu Xia, Haitao Zheng, Ben Y Zhao, and Junchen Jiang. </span> <span class="ltx_bibblock">Towards performance clarity of edge video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib83.1.1">IEEE/ACM Symposium on Edge Computing (SEC)</span>, pages 148–164, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib84"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[84]</span> <span class="ltx_bibblock"> Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. </span> <span class="ltx_bibblock">Universal instance perception as object discovery and retrieval. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib84.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</span>, pages 15325–15336, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib85"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[85]</span> <span class="ltx_bibblock"> Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. </span> <span class="ltx_bibblock">Cacheblend: Fast large language model serving with cached knowledge fusion. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib85.1.1">arXiv preprint arXiv:2405.16444</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib86"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[86]</span> <span class="ltx_bibblock"> Juheon Yi, Sunghyun Choi, and Youngki Lee. </span> <span class="ltx_bibblock">EagleEye: wearable camera-based person identification in crowded urban spaces. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib86.1.1">ACM International Conference on Mobile Computing and Networking (MobiCom)</span>, pages 1–14, April 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib87"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[87]</span> <span class="ltx_bibblock"> Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 
</span> <span class="ltx_bibblock">Orca: A distributed serving system for <math alttext="\{" class="ltx_Math" display="inline" id="bib.bib87.1.m1.1"><semantics id="bib.bib87.1.m1.1a"><mo id="bib.bib87.1.m1.1.1" stretchy="false" xref="bib.bib87.1.m1.1.1.cmml">{</mo><annotation-xml encoding="MathML-Content" id="bib.bib87.1.m1.1b"><ci id="bib.bib87.1.m1.1.1.cmml" xref="bib.bib87.1.m1.1.1">{</ci></annotation-xml><annotation encoding="application/x-tex" id="bib.bib87.1.m1.1c">\{</annotation><annotation encoding="application/x-llamapun" id="bib.bib87.1.m1.1d">{</annotation></semantics></math>Transformer-Based<math alttext="\}" class="ltx_Math" display="inline" id="bib.bib87.2.m2.1"><semantics id="bib.bib87.2.m2.1a"><mo id="bib.bib87.2.m2.1.1" stretchy="false" xref="bib.bib87.2.m2.1.1.cmml">}</mo><annotation-xml encoding="MathML-Content" id="bib.bib87.2.m2.1b"><ci id="bib.bib87.2.m2.1.1.cmml" xref="bib.bib87.2.m2.1.1">}</ci></annotation-xml><annotation encoding="application/x-tex" id="bib.bib87.2.m2.1c">\}</annotation><annotation encoding="application/x-llamapun" id="bib.bib87.2.m2.1d">}</annotation></semantics></math> generative models. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib87.3.1">USENIX Symposium on Operating Systems Design and Implementation (OSDI)</span>, pages 521–538, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib88"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[88]</span> <span class="ltx_bibblock"> Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. </span> <span class="ltx_bibblock">Modeling context in referring expressions. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib88.1.1">European Conference on Computer Vision (ECCV)</span>, pages 69–85. Springer, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib89"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[89]</span> <span class="ltx_bibblock"> Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, and Harry Xu. </span> <span class="ltx_bibblock">Vqpy: An object-oriented approach to modern video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib89.1.1">Machine Learning and Systems (MLSys)</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib90"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[90]</span> <span class="ltx_bibblock"> Mu Yuan, Lan Zhang, Xuanke You, and Xiang-Yang Li. </span> <span class="ltx_bibblock">Packetgame: Multi-stream packet gating for concurrent video inference at scale. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib90.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib91"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[91]</span> <span class="ltx_bibblock"> Tingting Yuan, Liang Mi, Weijun Wang, Haipeng Dai, and Xiaoming Fu. </span> <span class="ltx_bibblock">Accdecoder: Accelerated decoding for neural-enhanced video analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib91.1.1">IEEE Conference on Computer Communications (INFOCOM)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib92"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[92]</span> <span class="ltx_bibblock"> Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, and Jinhyung Kim. 
</span> <span class="ltx_bibblock">Videomix: Rethinking data augmentation for video classification, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib93"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[93]</span> <span class="ltx_bibblock"> Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward A. Lee. </span> <span class="ltx_bibblock">AWStream: adaptive wide-area streaming analytics. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib93.1.1">ACM Special Interest Group on Data Communication (SIGCOMM)</span>, pages 236–252. ACM, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib94"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[94]</span> <span class="ltx_bibblock"> Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. </span> <span class="ltx_bibblock">Adaptive budget allocation for parameter-efficient fine-tuning. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib94.1.1">International Conference on Learning Representations (ICLR)</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib95"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[95]</span> <span class="ltx_bibblock"> Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. </span> <span class="ltx_bibblock">Siren’s song in the ai ocean: a survey on hallucination in large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib95.1.1">arXiv preprint arXiv:2309.01219</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib96"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[96]</span> <span class="ltx_bibblock"> Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. </span> <span class="ltx_bibblock">Efficiently programming large language models using sglang. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib96.1.1">arXiv preprint arXiv:2312.07104</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib97"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[97]</span> <span class="ltx_bibblock"> Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. </span> <span class="ltx_bibblock">Pet: A unified framework for Parameter-Efficient transformers serving. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib97.1.1">USENIX Annual Technical Conference (ATC)</span>, pages 489–504, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib98"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[98]</span> <span class="ltx_bibblock"> Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. </span> <span class="ltx_bibblock">Minigpt-4: Enhancing vision-language understanding with advanced large language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib98.1.1">arXiv preprint arXiv:2304.10592</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib99"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">[99]</span> <span class="ltx_bibblock"> Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. </span> <span class="ltx_bibblock">Vision mamba: Efficient visual representation learning with bidirectional state space model, 2024. 
Appendix A Adaptive-tiling Matrix Multiplication

As shown in Fig. 24, ATMM adaptively divides the input matrices into thread-block tiles, warp tiles, and thread tiles, in decreasing order of granularity, guided by the hash table; it then transfers the data and performs the computation by invoking the corresponding executable kernel.

Figure 24. Procedure of ATMM. The hash table and executable kernels are prepared in §4.3.2. ⟨in_x, in_y, in_z⟩, ⟨tb_a, tb_b, tb_c⟩, and ⟨w_d, w_e, w_f⟩ denote the shapes of the input matrices, the thread-block tile, and the warp tile, respectively.
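To make the lookup-and-dispatch flow above concrete, the following is a minimal Python sketch of the ATMM dispatch path. The names (TileConfig, KERNEL_TABLE, atmm) and the table layout are assumptions for illustration; V-LoRA's actual implementation launches pre-compiled CUDA kernels rather than Python callables.

```python
# Minimal sketch of the ATMM dispatch path; names are illustrative, not V-LoRA's API.
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class TileConfig:
    thread_block: Tuple[int, int, int]  # <tb_a, tb_b, tb_c>
    warp: Tuple[int, int, int]          # <w_d, w_e, w_f>

# Hash table prepared offline by the profile-based tiling search (Algorithm 2):
# maps an input-matrix shape <in_x, in_y, in_z> to its best tiling and kernel.
KERNEL_TABLE: Dict[Tuple[int, int, int], Tuple[TileConfig, Callable]] = {}

def atmm(x, lora_weight):
    """Adaptive-tiling matrix multiply: hash-table lookup, then kernel launch."""
    shape = (x.shape[0], x.shape[1], lora_weight.shape[-1])  # <in_x, in_y, in_z>
    config, kernel = KERNEL_TABLE[shape]   # O(1) lookup; no tuning at serving time
    return kernel(x, lora_weight, config)  # kernel tiles blocks/warps/threads per config
```

Because the table is keyed by the full input shape, the per-request overhead reduces to a single hash lookup, so no tiling search runs on the serving critical path.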
id="alg2.l4.3" style="font-size:80%;"> min, max=</span><span class="ltx_text ltx_font_smallcaps" id="alg2.l4.4" style="font-size:80%;">GetRange</span><span class="ltx_text" id="alg2.l4.5" style="font-size:80%;">(model, hardware) </span><span class="ltx_text" id="alg2.l4.1" style="font-size:80%;float:right;"><math alttext="\triangleright" class="ltx_Math" display="inline" id="alg2.l4.1.m1.1"><semantics id="alg2.l4.1.m1.1a"><mo id="alg2.l4.1.m1.1.1" xref="alg2.l4.1.m1.1.1.cmml">▷</mo><annotation-xml encoding="MathML-Content" id="alg2.l4.1.m1.1b"><ci id="alg2.l4.1.m1.1.1.cmml" xref="alg2.l4.1.m1.1.1">▷</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l4.1.m1.1c">\triangleright</annotation><annotation encoding="application/x-llamapun" id="alg2.l4.1.m1.1d">▷</annotation></semantics></math> <span class="ltx_inline-block ltx_parbox ltx_align_top" id="alg2.l4.1.1" style="width:130.1pt;"> <span class="ltx_p" id="alg2.l4.1.1.1">Limit the input size by limited memory</span> </span> </span> </div> <div class="ltx_listingline" id="alg2.l5"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l5.2.1.1" style="font-size:80%;">5:</span></span><span class="ltx_text" id="alg2.l5.3" style="font-size:80%;"> configlist=</span><span class="ltx_text ltx_font_smallcaps" id="alg2.l5.4" style="font-size:80%;">GetConfig</span><span class="ltx_text" id="alg2.l5.5" style="font-size:80%;">(harware) </span><span class="ltx_text" id="alg2.l5.1" style="font-size:80%;float:right;"><math alttext="\triangleright" class="ltx_Math" display="inline" id="alg2.l5.1.m1.1"><semantics id="alg2.l5.1.m1.1a"><mo id="alg2.l5.1.m1.1.1" xref="alg2.l5.1.m1.1.1.cmml">▷</mo><annotation-xml encoding="MathML-Content" id="alg2.l5.1.m1.1b"><ci id="alg2.l5.1.m1.1.1.cmml" xref="alg2.l5.1.m1.1.1">▷</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l5.1.m1.1c">\triangleright</annotation><annotation encoding="application/x-llamapun" id="alg2.l5.1.m1.1d">▷</annotation></semantics></math> <span class="ltx_inline-block ltx_parbox ltx_align_top" id="alg2.l5.1.1" style="width:173.4pt;"> <span class="ltx_p" id="alg2.l5.1.1.1">Limit the tiling step s.t. 
the arch characteristics</span> </span></span> </div> <div class="ltx_listingline" id="alg2.l6"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l6.1.1.1" style="font-size:80%;">6:</span></span> </div> <div class="ltx_listingline" id="alg2.l7"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l7.2.1.1" style="font-size:80%;">7:</span></span><span class="ltx_text" id="alg2.l7.3" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l7.4" style="font-size:80%;">for</span><span class="ltx_text" id="alg2.l7.5" style="font-size:80%;"> shape </span><span class="ltx_text ltx_font_bold" id="alg2.l7.6" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l7.7" style="font-size:80%;"> range(min, max, 32) </span><span class="ltx_text ltx_font_bold" id="alg2.l7.8" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l7.9" style="font-size:80%;"> </span><span class="ltx_text" id="alg2.l7.1" style="font-size:80%;float:right;"><math alttext="\triangleright" class="ltx_Math" display="inline" id="alg2.l7.1.m1.1"><semantics id="alg2.l7.1.m1.1a"><mo id="alg2.l7.1.m1.1.1" xref="alg2.l7.1.m1.1.1.cmml">▷</mo><annotation-xml encoding="MathML-Content" id="alg2.l7.1.m1.1b"><ci id="alg2.l7.1.m1.1.1.cmml" xref="alg2.l7.1.m1.1.1">▷</ci></annotation-xml><annotation encoding="application/x-tex" id="alg2.l7.1.m1.1c">\triangleright</annotation><annotation encoding="application/x-llamapun" id="alg2.l7.1.m1.1d">▷</annotation></semantics></math> <span class="ltx_inline-block ltx_parbox ltx_align_top" id="alg2.l7.1.1" style="width:151.8pt;"> <span class="ltx_p" id="alg2.l7.1.1.1">Reduce the search step size based on the vision task characteristics</span> </span> </span> </div> <div class="ltx_listingline" id="alg2.l8"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l8.1.1.1" style="font-size:80%;">8:</span></span><span class="ltx_text" id="alg2.l8.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l8.3" style="font-size:80%;">for</span><span class="ltx_text" id="alg2.l8.4" style="font-size:80%;"> rank </span><span class="ltx_text ltx_font_bold" id="alg2.l8.5" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l8.6" style="font-size:80%;"> ranklist </span><span class="ltx_text ltx_font_bold" id="alg2.l8.7" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l8.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg2.l9"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l9.1.1.1" style="font-size:80%;">9:</span></span><span class="ltx_text" id="alg2.l9.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l9.3" style="font-size:80%;">for</span><span class="ltx_text" id="alg2.l9.4" style="font-size:80%;"> tbtile </span><span class="ltx_text ltx_font_bold" id="alg2.l9.5" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l9.6" style="font-size:80%;"> configlist[0] </span><span class="ltx_text ltx_font_bold" id="alg2.l9.7" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l9.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg2.l10"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l10.1.1.1" style="font-size:80%;">10:</span></span><span class="ltx_text" id="alg2.l10.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_bold" id="alg2.l10.3" style="font-size:80%;">for</span><span class="ltx_text" 
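For readers who prefer executable code, below is a self-contained Python rendering of Algorithm 2. The helpers get_range, get_configs, and profile_latency are placeholder stand-ins for GetRange, GetConfig, and Profile; in the paper these would query the GPU architecture and time real CUDA kernels, so the concrete values here are illustrative only.

```python
# Self-contained Python rendering of Algorithm 2 (profile-based tiling search).
# get_range / get_configs / profile_latency are placeholders; the real system
# queries the GPU architecture and profiles actual kernels.
import itertools

def get_range(model, hardware):
    return 32, 4096                                    # placeholder: bounded by GPU memory

def get_configs(hardware):
    tb_tiles = [(128, 128, 8), (64, 64, 8)]            # placeholder thread-block tiles
    warp_tiles = [(64, 32, 8), (32, 32, 8)]            # placeholder warp tiles
    return tb_tiles, warp_tiles

def profile_latency(shape, rank, tb_tile, warp_tile):
    return shape * rank / (tb_tile[0] * warp_tile[0])  # placeholder cost model

def tiling_search(model, hardware, ranklist=(8, 16, 32, 64)):
    best = {}                                          # (shape, rank) -> (latency, tb, warp)
    lo, hi = get_range(model, hardware)                # limit input sizes by memory
    tb_tiles, warp_tiles = get_configs(hardware)       # limit tiles to the architecture
    for shape in range(lo, hi, 32):                    # step of 32 suits vision-token counts
        for rank in ranklist:                          # candidate LoRA ranks
            for tb, warp in itertools.product(tb_tiles, warp_tiles):
                lat = profile_latency(shape, rank, tb, warp)
                key = (shape, rank)
                if key not in best or lat < best[key][0]:
                    best[key] = (lat, tb, warp)        # UpdateBestConfig
    return best                                        # populates ATMM's hash table
```

The output of this offline search is what populates the hash table consumed by ATMM in Appendix A.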
id="alg2.l10.4" style="font-size:80%;"> warptile </span><span class="ltx_text ltx_font_bold" id="alg2.l10.5" style="font-size:80%;">in</span><span class="ltx_text" id="alg2.l10.6" style="font-size:80%;"> configlist[1] </span><span class="ltx_text ltx_font_bold" id="alg2.l10.7" style="font-size:80%;">do</span><span class="ltx_text" id="alg2.l10.8" style="font-size:80%;"> </span> </div> <div class="ltx_listingline" id="alg2.l11"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l11.1.1.1" style="font-size:80%;">11:</span></span><span class="ltx_text" id="alg2.l11.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg2.l11.3" style="font-size:80%;">Profile</span><span class="ltx_text" id="alg2.l11.4" style="font-size:80%;">(shape, tbtile, warptile) </span> </div> <div class="ltx_listingline" id="alg2.l12"> <span class="ltx_tag ltx_tag_listingline"><span class="ltx_text" id="alg2.l12.1.1.1" style="font-size:80%;">12:</span></span><span class="ltx_text" id="alg2.l12.2" style="font-size:80%;"> </span><span class="ltx_text ltx_font_smallcaps" id="alg2.l12.3" style="font-size:80%;">UpdateBestConfig</span><span class="ltx_text" id="alg2.l12.4" style="font-size:80%;">( ) </span> </div> </div> </figure> </section> <section class="ltx_appendix" id="A3"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix C </span>Prompt Template of Vision Applications</h2> <div class="ltx_para" id="A3.p1"> <p class="ltx_p" id="A3.p1.1">We report the prompt template for two vision applications, visual retrieval and video analytics, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2411.00915v1#A3.F25" title="Figure 25 ‣ Appendix C Prompt Template of Vision Applications ‣ V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM"><span class="ltx_text ltx_ref_tag">25</span></a>. Visual retrieval includes three tasks: referring expression task, visual question answering, and image captioning. The black text represents the prompt, and the blue text shows the response. Video analytics involves two tasks: object detection and video understanding. Object detection uses a similar prompt as the referring expression task. Video understanding provides multiple image frames as input, followed by an instruction to analyze the actions depicted in the sequence.</p> </div> <figure class="ltx_figure" id="A3.F25"> <div class="ltx_block ltx_minipage ltx_align_center ltx_align_bottom" id="A3.F25.1" style="width:433.6pt;"> <img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="954" id="A3.F25.1.g1" src="x28.png" width="831"/> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 25. 
Figure 25. The prompt template of vision applications.
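Since the template itself appears only in Fig. 25 (an image), the snippet below is a purely illustrative sketch of the prompt/response structure described above; the wording, the <image> placeholder, and the example responses are assumptions, not the authors' exact template.

```python
# Illustrative prompt/response structure only; the authoritative wording is in Fig. 25.
REFERRING_EXPRESSION_PROMPT = (
    "<image>\n"
    "Please provide the bounding box of the region this sentence describes: {expression}."
)
# Example response (blue text in Fig. 25), e.g.: "[0.32, 0.41, 0.58, 0.77]"

VIDEO_UNDERSTANDING_PROMPT = (
    "<image>\n" * 8  # several frames sampled from the clip
    + "These frames are sampled from a short video. Describe the action being performed."
)
# Example response, e.g.: "playing basketball"
```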