<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University.</title> <!--Generated on Wed Mar 12 06:38:27 2025 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content=" Audio Generation, Multimodality, Diffusion Model, Contrastive Pretraining, AIGC " lang="en" name="keywords"/> <base href="/html/2503.10700v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S1" title="In TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">I </span><span class="ltx_text 
ltx_font_smallcaps">Introduction</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2" title="In TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">II </span><span class="ltx_text ltx_font_smallcaps">Method</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2.SS1" title="In II Method ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-A</span> </span><span class="ltx_text ltx_font_italic">Contrastive Video-Audio-Language Pretraining</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2.SS2" title="In II Method ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-B</span> </span><span class="ltx_text ltx_font_italic">Feature Mixing</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2.SS3" title="In II Method ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development 
Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-C</span> </span><span class="ltx_text ltx_font_italic">Latent Diffusion Model</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2.SS4" title="In II Method ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">II-D</span> </span><span class="ltx_text ltx_font_italic">Inference with Guidance</span></span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3" title="In TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">III </span><span class="ltx_text ltx_font_smallcaps">Experiments</span></span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.SS1" title="In III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-A</span> </span><span class="ltx_text ltx_font_italic">Datasets and Data 
Processing</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.SS2" title="In III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-B</span> </span><span class="ltx_text ltx_font_italic">Configurations</span></span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.SS3" title="In III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-C</span> </span><span class="ltx_text ltx_font_italic">Evaluation</span></span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.SS3.SSS1" title="In III-C Evaluation ‣ III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-C</span>1 </span>Evaluation Metrics</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.SS3.SSS2" title="In III-C Evaluation ‣ III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported 
by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text">III-C</span>2 </span>Results and Analysis</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S4" title="In TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">IV </span><span class="ltx_text ltx_font_smallcaps">Conclusion</span></span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document">TA-V2A: Textually Assisted Video-to-Audio Generation <br class="ltx_break"/><span class="ltx_note ltx_role_thanks" id="id1.id1"><sup class="ltx_note_mark">†</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">†</sup><span class="ltx_note_type">thanks: </span>This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University.</span></span></span> </h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> <span class="ltx_inline-block ltx_minipage ltx_align_top" id="id2.1.id1" style="width:130.1pt;"> <span class="ltx_p" id="id2.1.id1.1">Yuhuan You</span> <span class="ltx_p ltx_align_center" id="id2.1.id1.2"><span class="ltx_text ltx_font_italic" id="id2.1.id1.2.1">State Key Laboratory of <br class="ltx_break"/>General Artificial Intelligence <br class="ltx_break"/>School of <br 
class="ltx_break"/>Intelligence Science and Technology <br class="ltx_break"/>Peking University</span>, Beijing, China</span> <span class="ltx_p ltx_align_center" id="id2.1.id1.3">2000017809@stu.pku.edu.cn</span> </span> <span class="ltx_inline-block ltx_minipage ltx_align_top" id="id3.2.id2" style="width:130.1pt;"> <span class="ltx_p" id="id3.2.id2.1">Xihong Wu</span> <span class="ltx_p ltx_align_center" id="id3.2.id2.2"><span class="ltx_text ltx_font_italic" id="id3.2.id2.2.1">State Key Laboratory of <br class="ltx_break"/>General Artificial Intelligence <br class="ltx_break"/>School of <br class="ltx_break"/>Intelligence Science and Technology <br class="ltx_break"/>Peking University</span>, Beijing, China</span> <span class="ltx_p ltx_align_center" id="id3.2.id2.3">wxh@cis.pku.edu.cn</span> </span> <span class="ltx_inline-block ltx_minipage ltx_align_top" id="id4.3.id3" style="width:130.1pt;"> <span class="ltx_p" id="id4.3.id3.1">Tianshu Qu</span> <span class="ltx_p ltx_align_center" id="id4.3.id3.2"><span class="ltx_text ltx_font_italic" id="id4.3.id3.2.1">State Key Laboratory of <br class="ltx_break"/>General Artificial Intelligence <br class="ltx_break"/>School of <br class="ltx_break"/>Intelligence Science and Technology <br class="ltx_break"/>Peking University</span>, Beijing, China</span> <span class="ltx_p ltx_align_center" id="id4.3.id3.3">qutianshu@pku.edu.cn</span> </span> </span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id5.id1">As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. 
While Transformer and diffusion models have advanced audio generation, extracting precise semantic information from videos remains a significant challenge, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion-based system uses automated text modulation to improve inference quality and efficiency, and provides personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.</p> </div> <div class="ltx_keywords"> <h6 class="ltx_title ltx_title_keywords">Index Terms: </h6> Audio Generation, Multimodality, Diffusion Model, Contrastive Pretraining, AIGC </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">I </span><span class="ltx_text ltx_font_smallcaps" id="S1.1.1">Introduction</span> </h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Recently, the cross-modal task of video-to-audio (V2A) generation has garnered considerable attention. 
The ability to generate corresponding audio from video is crucial for applications such as enhancing virtual reality experiences, automating video foley synthesis, and improving the performance of robots in perceiving and understanding environments.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">V2A methods generally derive from text-to-audio (T2A) frameworks such as <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib1" title="">1</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib2" title="">2</a>]</cite>, which generate audio from text input alone. Some approaches, including <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib3" title="">3</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib4" title="">4</a>]</cite>, extend T2A frameworks by treating video frames as auxiliary features to enhance the audio generation process. Other approaches, such as <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib5" title="">5</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib7" title="">7</a>]</cite>, aim for joint audio and video generation. Generative Adversarial Networks (GANs) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib8" title="">8</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib9" title="">9</a>]</cite> and Transformers <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib11" title="">11</a>]</cite> have also been applied to pure V2A tasks. 
Recent advances in diffusion-based technologies have further expanded the capabilities of audio generation, with Luo et al. <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite> leveraging Latent Diffusion Models (LDM) and Contrastive Audio-Visual Pretraining (CAVP) for V2A tasks and Xu et al. <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib13" title="">13</a>]</cite> examining the impact of module selection within the V2A framework.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Despite these advancements, a key challenge remains in extracting precise semantic information from videos, as current models often lose sequential context when relying solely on frame-based features. This lack of temporal consistency leads to gaps in audio generation, where the produced sound fails to match the nuances of actions or events unfolding over time. To address this, integrating text guidance has shown promise in enhancing semantic representation. Previous work has used text to guide audio generation <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib14" title="">14</a>]</cite> and combined video and text features for better semantic and temporal alignment <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib15" title="">15</a>]</cite>.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">Building on these efforts, we introduce TA-V2A, a text-assisted V2A generation system that uses text as an auxiliary feature in network training, feature generation, and the guidance stages of the diffusion process. 
The role of text assistance in this system is twofold: first, we integrate video, audio, and text modalities in a unified training framework to achieve precise and refined semantic alignment; second, inspired by advancements in multimodal large language models (MLLMs) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib16" title="">16</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib17" title="">17</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib18" title="">18</a>]</cite>, we employ large language models to generate textual descriptions of videos, which are then used for text-aligned training and as guiding conditions within the diffusion model. This approach significantly enhances the generation of audio that closely aligns with human descriptive preferences.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">We also explore which feature vectors in the latent space are most conducive to high-quality audio generation in V2A tasks, as well as how multi-condition inputs can be decomposed and integrated in the diffusion model. By focusing on the latent space, we aim to better understand the relationships between video, audio, and text features, so that the generated audio not only aligns with the video’s temporal sequence but also accurately reflects the semantic content captured in the video. Extensive experiments and analyses validate the effectiveness of these approaches, suggesting that TA-V2A can significantly advance video-to-audio generation, enhancing multimedia processing and intelligent information comprehension.</p> </div> <div class="ltx_para" id="S1.p6"> <p class="ltx_p" id="S1.p6.1">The rest of this paper is organized as follows. 
Section <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2" title="II Method ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">II</span></a> presents our proposed approach and introduces the core methodologies. Section <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3" title="III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">III</span></a> details the experimental configurations and results. Finally, Section <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S4" title="IV Conclusion ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">IV</span></a> offers conclusions.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">II </span><span class="ltx_text ltx_font_smallcaps" id="S2.1.1">Method</span> </h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1">Our TA-V2A generation system combines video, audio, and textual data to produce synchronized audio outputs from video inputs using contrastive learning, feature alignment, and generative modeling. An overview of the complete workflow is illustrated in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S2.F1" title="Figure 1 ‣ II Method ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">1</span></a>.</p> </div> <div class="ltx_para" id="S2.p2"> <p class="ltx_p" id="S2.p2.4">The system begins with video and textual inputs, where text is either manually provided or generated by an LLM and undergoes automated augmentation. Features from video, audio, and text are extracted by modality-specific encoders <math alttext="\mathcal{E}_{v},\mathcal{E}_{a},\mathcal{E}_{l}" class="ltx_Math" display="inline" id="S2.p2.1.m1.3"><semantics id="S2.p2.1.m1.3a"><mrow id="S2.p2.1.m1.3.3.3" xref="S2.p2.1.m1.3.3.4.cmml"><msub id="S2.p2.1.m1.1.1.1.1" xref="S2.p2.1.m1.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.p2.1.m1.1.1.1.1.2" xref="S2.p2.1.m1.1.1.1.1.2.cmml">ℰ</mi><mi id="S2.p2.1.m1.1.1.1.1.3" xref="S2.p2.1.m1.1.1.1.1.3.cmml">v</mi></msub><mo id="S2.p2.1.m1.3.3.3.4" xref="S2.p2.1.m1.3.3.4.cmml">,</mo><msub id="S2.p2.1.m1.2.2.2.2" xref="S2.p2.1.m1.2.2.2.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.p2.1.m1.2.2.2.2.2" xref="S2.p2.1.m1.2.2.2.2.2.cmml">ℰ</mi><mi id="S2.p2.1.m1.2.2.2.2.3" xref="S2.p2.1.m1.2.2.2.2.3.cmml">a</mi></msub><mo id="S2.p2.1.m1.3.3.3.5" xref="S2.p2.1.m1.3.3.4.cmml">,</mo><msub id="S2.p2.1.m1.3.3.3.3" xref="S2.p2.1.m1.3.3.3.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.p2.1.m1.3.3.3.3.2" xref="S2.p2.1.m1.3.3.3.3.2.cmml">ℰ</mi><mi id="S2.p2.1.m1.3.3.3.3.3" xref="S2.p2.1.m1.3.3.3.3.3.cmml">l</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S2.p2.1.m1.3b"><list id="S2.p2.1.m1.3.3.4.cmml" xref="S2.p2.1.m1.3.3.3"><apply id="S2.p2.1.m1.1.1.1.1.cmml" xref="S2.p2.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S2.p2.1.m1.1.1.1.1.1.cmml" 
xref="S2.p2.1.m1.1.1.1.1">subscript</csymbol><ci id="S2.p2.1.m1.1.1.1.1.2.cmml" xref="S2.p2.1.m1.1.1.1.1.2">ℰ</ci><ci id="S2.p2.1.m1.1.1.1.1.3.cmml" xref="S2.p2.1.m1.1.1.1.1.3">𝑣</ci></apply><apply id="S2.p2.1.m1.2.2.2.2.cmml" xref="S2.p2.1.m1.2.2.2.2"><csymbol cd="ambiguous" id="S2.p2.1.m1.2.2.2.2.1.cmml" xref="S2.p2.1.m1.2.2.2.2">subscript</csymbol><ci id="S2.p2.1.m1.2.2.2.2.2.cmml" xref="S2.p2.1.m1.2.2.2.2.2">ℰ</ci><ci id="S2.p2.1.m1.2.2.2.2.3.cmml" xref="S2.p2.1.m1.2.2.2.2.3">𝑎</ci></apply><apply id="S2.p2.1.m1.3.3.3.3.cmml" xref="S2.p2.1.m1.3.3.3.3"><csymbol cd="ambiguous" id="S2.p2.1.m1.3.3.3.3.1.cmml" xref="S2.p2.1.m1.3.3.3.3">subscript</csymbol><ci id="S2.p2.1.m1.3.3.3.3.2.cmml" xref="S2.p2.1.m1.3.3.3.3.2">ℰ</ci><ci id="S2.p2.1.m1.3.3.3.3.3.cmml" xref="S2.p2.1.m1.3.3.3.3.3">𝑙</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.p2.1.m1.3c">\mathcal{E}_{v},\mathcal{E}_{a},\mathcal{E}_{l}</annotation><annotation encoding="application/x-llamapun" id="S2.p2.1.m1.3d">caligraphic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , caligraphic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , caligraphic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT</annotation></semantics></math> in the Contrastive Video-Audio-Language Pretraining (CVALP) module, aligned, and fused into audio-aligned features <math alttext="E_{\text{mix}}" class="ltx_Math" display="inline" id="S2.p2.2.m2.1"><semantics id="S2.p2.2.m2.1a"><msub id="S2.p2.2.m2.1.1" xref="S2.p2.2.m2.1.1.cmml"><mi id="S2.p2.2.m2.1.1.2" xref="S2.p2.2.m2.1.1.2.cmml">E</mi><mtext id="S2.p2.2.m2.1.1.3" xref="S2.p2.2.m2.1.1.3a.cmml">mix</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.p2.2.m2.1b"><apply id="S2.p2.2.m2.1.1.cmml" xref="S2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S2.p2.2.m2.1.1.1.cmml" xref="S2.p2.2.m2.1.1">subscript</csymbol><ci id="S2.p2.2.m2.1.1.2.cmml" xref="S2.p2.2.m2.1.1.2">𝐸</ci><ci id="S2.p2.2.m2.1.1.3a.cmml" xref="S2.p2.2.m2.1.1.3"><mtext 
id="S2.p2.2.m2.1.1.3.cmml" mathsize="70%" xref="S2.p2.2.m2.1.1.3">mix</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.p2.2.m2.1c">E_{\text{mix}}</annotation><annotation encoding="application/x-llamapun" id="S2.p2.2.m2.1d">italic_E start_POSTSUBSCRIPT mix end_POSTSUBSCRIPT</annotation></semantics></math>. These features are fed into LDM, which generates audio features from Gaussian noise <math alttext="\mathcal{N}(0,I)" class="ltx_Math" display="inline" id="S2.p2.3.m3.2"><semantics id="S2.p2.3.m3.2a"><mrow id="S2.p2.3.m3.2.3" xref="S2.p2.3.m3.2.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.p2.3.m3.2.3.2" xref="S2.p2.3.m3.2.3.2.cmml">𝒩</mi><mo id="S2.p2.3.m3.2.3.1" xref="S2.p2.3.m3.2.3.1.cmml">⁢</mo><mrow id="S2.p2.3.m3.2.3.3.2" xref="S2.p2.3.m3.2.3.3.1.cmml"><mo id="S2.p2.3.m3.2.3.3.2.1" stretchy="false" xref="S2.p2.3.m3.2.3.3.1.cmml">(</mo><mn id="S2.p2.3.m3.1.1" xref="S2.p2.3.m3.1.1.cmml">0</mn><mo id="S2.p2.3.m3.2.3.3.2.2" xref="S2.p2.3.m3.2.3.3.1.cmml">,</mo><mi id="S2.p2.3.m3.2.2" xref="S2.p2.3.m3.2.2.cmml">I</mi><mo id="S2.p2.3.m3.2.3.3.2.3" stretchy="false" xref="S2.p2.3.m3.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.p2.3.m3.2b"><apply id="S2.p2.3.m3.2.3.cmml" xref="S2.p2.3.m3.2.3"><times id="S2.p2.3.m3.2.3.1.cmml" xref="S2.p2.3.m3.2.3.1"></times><ci id="S2.p2.3.m3.2.3.2.cmml" xref="S2.p2.3.m3.2.3.2">𝒩</ci><interval closure="open" id="S2.p2.3.m3.2.3.3.1.cmml" xref="S2.p2.3.m3.2.3.3.2"><cn id="S2.p2.3.m3.1.1.cmml" type="integer" xref="S2.p2.3.m3.1.1">0</cn><ci id="S2.p2.3.m3.2.2.cmml" xref="S2.p2.3.m3.2.2">𝐼</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.p2.3.m3.2c">\mathcal{N}(0,I)</annotation><annotation encoding="application/x-llamapun" id="S2.p2.3.m3.2d">caligraphic_N ( 0 , italic_I )</annotation></semantics></math>. 
The LDM output is decoded into a Mel-spectrogram (<math alttext="\hat{z}_{0}" class="ltx_Math" display="inline" id="S2.p2.4.m4.1"><semantics id="S2.p2.4.m4.1a"><msub id="S2.p2.4.m4.1.1" xref="S2.p2.4.m4.1.1.cmml"><mover accent="true" id="S2.p2.4.m4.1.1.2" xref="S2.p2.4.m4.1.1.2.cmml"><mi id="S2.p2.4.m4.1.1.2.2" xref="S2.p2.4.m4.1.1.2.2.cmml">z</mi><mo id="S2.p2.4.m4.1.1.2.1" xref="S2.p2.4.m4.1.1.2.1.cmml">^</mo></mover><mn id="S2.p2.4.m4.1.1.3" xref="S2.p2.4.m4.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S2.p2.4.m4.1b"><apply id="S2.p2.4.m4.1.1.cmml" xref="S2.p2.4.m4.1.1"><csymbol cd="ambiguous" id="S2.p2.4.m4.1.1.1.cmml" xref="S2.p2.4.m4.1.1">subscript</csymbol><apply id="S2.p2.4.m4.1.1.2.cmml" xref="S2.p2.4.m4.1.1.2"><ci id="S2.p2.4.m4.1.1.2.1.cmml" xref="S2.p2.4.m4.1.1.2.1">^</ci><ci id="S2.p2.4.m4.1.1.2.2.cmml" xref="S2.p2.4.m4.1.1.2.2">𝑧</ci></apply><cn id="S2.p2.4.m4.1.1.3.cmml" type="integer" xref="S2.p2.4.m4.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.p2.4.m4.1c">\hat{z}_{0}</annotation><annotation encoding="application/x-llamapun" id="S2.p2.4.m4.1d">over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math>) and synthesized into an audio waveform by a vocoder. Training refines system parameters, while inference generates the final synchronized audio output. Below, we detail each component of the system.</p> </div> <figure class="ltx_figure" id="S2.F1"> <p class="ltx_p ltx_align_center ltx_align_center" id="S2.F1.1"><span class="ltx_text" id="S2.F1.1.1"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="212" id="S2.F1.1.1.g1" src="extracted/6273342/long_new.png" width="598"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>The complete workflow of the TA-V2A generation system. 
The system takes a video and a textual description as input, with the textual description generated by an LLM. The CVALP module extracts and aligns features from video, audio, and text, creating audio-aligned video and text features. These features are then fed into the LDM, which iteratively generates high-quality audio from noise. During inference, guidance techniques such as classifier-free guidance (CFG) and human-modified text prompts are used to control the generation process, ensuring better alignment between the generated audio and the input modalities. Finally, the audio representation is decoded into a Mel-spectrogram and synthesized into the actual audio waveform using a vocoder.</figcaption> </figure> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS1.5.1.1">II-A</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS1.6.2">Contrastive Video-Audio-Language Pretraining</span> </h3> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.1">The CVALP module serves as the backbone of our feature extraction and alignment process. Inspired by prior works <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib19" title="">19</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib20" title="">20</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite>, CVALP aligns the video and text modalities with audio using contrastive learning across video-audio and text-audio pairs. 
This simultaneous alignment improves feature quality, convergence speed, model robustness, and hierarchical learning.</p> </div> <div class="ltx_para" id="S2.SS1.p2"> <p class="ltx_p" id="S2.SS1.p2.18">Given a video-audio-text triplet <math alttext="(x_{v},x_{a},x_{l})" class="ltx_Math" display="inline" id="S2.SS1.p2.1.m1.3"><semantics id="S2.SS1.p2.1.m1.3a"><mrow id="S2.SS1.p2.1.m1.3.3.3" xref="S2.SS1.p2.1.m1.3.3.4.cmml"><mo id="S2.SS1.p2.1.m1.3.3.3.4" stretchy="false" xref="S2.SS1.p2.1.m1.3.3.4.cmml">(</mo><msub id="S2.SS1.p2.1.m1.1.1.1.1" xref="S2.SS1.p2.1.m1.1.1.1.1.cmml"><mi id="S2.SS1.p2.1.m1.1.1.1.1.2" xref="S2.SS1.p2.1.m1.1.1.1.1.2.cmml">x</mi><mi id="S2.SS1.p2.1.m1.1.1.1.1.3" xref="S2.SS1.p2.1.m1.1.1.1.1.3.cmml">v</mi></msub><mo id="S2.SS1.p2.1.m1.3.3.3.5" xref="S2.SS1.p2.1.m1.3.3.4.cmml">,</mo><msub id="S2.SS1.p2.1.m1.2.2.2.2" xref="S2.SS1.p2.1.m1.2.2.2.2.cmml"><mi id="S2.SS1.p2.1.m1.2.2.2.2.2" xref="S2.SS1.p2.1.m1.2.2.2.2.2.cmml">x</mi><mi id="S2.SS1.p2.1.m1.2.2.2.2.3" xref="S2.SS1.p2.1.m1.2.2.2.2.3.cmml">a</mi></msub><mo id="S2.SS1.p2.1.m1.3.3.3.6" xref="S2.SS1.p2.1.m1.3.3.4.cmml">,</mo><msub id="S2.SS1.p2.1.m1.3.3.3.3" xref="S2.SS1.p2.1.m1.3.3.3.3.cmml"><mi id="S2.SS1.p2.1.m1.3.3.3.3.2" xref="S2.SS1.p2.1.m1.3.3.3.3.2.cmml">x</mi><mi id="S2.SS1.p2.1.m1.3.3.3.3.3" xref="S2.SS1.p2.1.m1.3.3.3.3.3.cmml">l</mi></msub><mo id="S2.SS1.p2.1.m1.3.3.3.7" stretchy="false" xref="S2.SS1.p2.1.m1.3.3.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.1.m1.3b"><vector id="S2.SS1.p2.1.m1.3.3.4.cmml" xref="S2.SS1.p2.1.m1.3.3.3"><apply id="S2.SS1.p2.1.m1.1.1.1.1.cmml" xref="S2.SS1.p2.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.1.m1.1.1.1.1.1.cmml" xref="S2.SS1.p2.1.m1.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p2.1.m1.1.1.1.1.2.cmml" xref="S2.SS1.p2.1.m1.1.1.1.1.2">𝑥</ci><ci id="S2.SS1.p2.1.m1.1.1.1.1.3.cmml" xref="S2.SS1.p2.1.m1.1.1.1.1.3">𝑣</ci></apply><apply id="S2.SS1.p2.1.m1.2.2.2.2.cmml" xref="S2.SS1.p2.1.m1.2.2.2.2"><csymbol 
cd="ambiguous" id="S2.SS1.p2.1.m1.2.2.2.2.1.cmml" xref="S2.SS1.p2.1.m1.2.2.2.2">subscript</csymbol><ci id="S2.SS1.p2.1.m1.2.2.2.2.2.cmml" xref="S2.SS1.p2.1.m1.2.2.2.2.2">𝑥</ci><ci id="S2.SS1.p2.1.m1.2.2.2.2.3.cmml" xref="S2.SS1.p2.1.m1.2.2.2.2.3">𝑎</ci></apply><apply id="S2.SS1.p2.1.m1.3.3.3.3.cmml" xref="S2.SS1.p2.1.m1.3.3.3.3"><csymbol cd="ambiguous" id="S2.SS1.p2.1.m1.3.3.3.3.1.cmml" xref="S2.SS1.p2.1.m1.3.3.3.3">subscript</csymbol><ci id="S2.SS1.p2.1.m1.3.3.3.3.2.cmml" xref="S2.SS1.p2.1.m1.3.3.3.3.2">𝑥</ci><ci id="S2.SS1.p2.1.m1.3.3.3.3.3.cmml" xref="S2.SS1.p2.1.m1.3.3.3.3.3">𝑙</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.1.m1.3c">(x_{v},x_{a},x_{l})</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.1.m1.3d">( italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT )</annotation></semantics></math>, where <math alttext="x_{v}\in\mathbb{R}^{T_{v}\times 3\times H\times W}" class="ltx_Math" display="inline" id="S2.SS1.p2.2.m2.1"><semantics id="S2.SS1.p2.2.m2.1a"><mrow id="S2.SS1.p2.2.m2.1.1" xref="S2.SS1.p2.2.m2.1.1.cmml"><msub id="S2.SS1.p2.2.m2.1.1.2" xref="S2.SS1.p2.2.m2.1.1.2.cmml"><mi id="S2.SS1.p2.2.m2.1.1.2.2" xref="S2.SS1.p2.2.m2.1.1.2.2.cmml">x</mi><mi id="S2.SS1.p2.2.m2.1.1.2.3" xref="S2.SS1.p2.2.m2.1.1.2.3.cmml">v</mi></msub><mo id="S2.SS1.p2.2.m2.1.1.1" xref="S2.SS1.p2.2.m2.1.1.1.cmml">∈</mo><msup id="S2.SS1.p2.2.m2.1.1.3" xref="S2.SS1.p2.2.m2.1.1.3.cmml"><mi id="S2.SS1.p2.2.m2.1.1.3.2" xref="S2.SS1.p2.2.m2.1.1.3.2.cmml">ℝ</mi><mrow id="S2.SS1.p2.2.m2.1.1.3.3" xref="S2.SS1.p2.2.m2.1.1.3.3.cmml"><msub id="S2.SS1.p2.2.m2.1.1.3.3.2" xref="S2.SS1.p2.2.m2.1.1.3.3.2.cmml"><mi id="S2.SS1.p2.2.m2.1.1.3.3.2.2" xref="S2.SS1.p2.2.m2.1.1.3.3.2.2.cmml">T</mi><mi id="S2.SS1.p2.2.m2.1.1.3.3.2.3" xref="S2.SS1.p2.2.m2.1.1.3.3.2.3.cmml">v</mi></msub><mo id="S2.SS1.p2.2.m2.1.1.3.3.1" 
lspace="0.222em" rspace="0.222em" xref="S2.SS1.p2.2.m2.1.1.3.3.1.cmml">×</mo><mn id="S2.SS1.p2.2.m2.1.1.3.3.3" xref="S2.SS1.p2.2.m2.1.1.3.3.3.cmml">3</mn><mo id="S2.SS1.p2.2.m2.1.1.3.3.1a" lspace="0.222em" rspace="0.222em" xref="S2.SS1.p2.2.m2.1.1.3.3.1.cmml">×</mo><mi id="S2.SS1.p2.2.m2.1.1.3.3.4" xref="S2.SS1.p2.2.m2.1.1.3.3.4.cmml">H</mi><mo id="S2.SS1.p2.2.m2.1.1.3.3.1b" lspace="0.222em" rspace="0.222em" xref="S2.SS1.p2.2.m2.1.1.3.3.1.cmml">×</mo><mi id="S2.SS1.p2.2.m2.1.1.3.3.5" xref="S2.SS1.p2.2.m2.1.1.3.3.5.cmml">W</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.2.m2.1b"><apply id="S2.SS1.p2.2.m2.1.1.cmml" xref="S2.SS1.p2.2.m2.1.1"><in id="S2.SS1.p2.2.m2.1.1.1.cmml" xref="S2.SS1.p2.2.m2.1.1.1"></in><apply id="S2.SS1.p2.2.m2.1.1.2.cmml" xref="S2.SS1.p2.2.m2.1.1.2"><csymbol cd="ambiguous" id="S2.SS1.p2.2.m2.1.1.2.1.cmml" xref="S2.SS1.p2.2.m2.1.1.2">subscript</csymbol><ci id="S2.SS1.p2.2.m2.1.1.2.2.cmml" xref="S2.SS1.p2.2.m2.1.1.2.2">𝑥</ci><ci id="S2.SS1.p2.2.m2.1.1.2.3.cmml" xref="S2.SS1.p2.2.m2.1.1.2.3">𝑣</ci></apply><apply id="S2.SS1.p2.2.m2.1.1.3.cmml" xref="S2.SS1.p2.2.m2.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.p2.2.m2.1.1.3.1.cmml" xref="S2.SS1.p2.2.m2.1.1.3">superscript</csymbol><ci id="S2.SS1.p2.2.m2.1.1.3.2.cmml" xref="S2.SS1.p2.2.m2.1.1.3.2">ℝ</ci><apply id="S2.SS1.p2.2.m2.1.1.3.3.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3"><times id="S2.SS1.p2.2.m2.1.1.3.3.1.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.1"></times><apply id="S2.SS1.p2.2.m2.1.1.3.3.2.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.2"><csymbol cd="ambiguous" id="S2.SS1.p2.2.m2.1.1.3.3.2.1.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.2">subscript</csymbol><ci id="S2.SS1.p2.2.m2.1.1.3.3.2.2.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.2.2">𝑇</ci><ci id="S2.SS1.p2.2.m2.1.1.3.3.2.3.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.2.3">𝑣</ci></apply><cn id="S2.SS1.p2.2.m2.1.1.3.3.3.cmml" type="integer" xref="S2.SS1.p2.2.m2.1.1.3.3.3">3</cn><ci id="S2.SS1.p2.2.m2.1.1.3.3.4.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.4">𝐻</ci><ci 
id="S2.SS1.p2.2.m2.1.1.3.3.5.cmml" xref="S2.SS1.p2.2.m2.1.1.3.3.5">𝑊</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.2.m2.1c">x_{v}\in\mathbb{R}^{T_{v}\times 3\times H\times W}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.2.m2.1d">italic_x start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT × 3 × italic_H × italic_W end_POSTSUPERSCRIPT</annotation></semantics></math> denotes a video clip with <math alttext="T_{v}" class="ltx_Math" display="inline" id="S2.SS1.p2.3.m3.1"><semantics id="S2.SS1.p2.3.m3.1a"><msub id="S2.SS1.p2.3.m3.1.1" xref="S2.SS1.p2.3.m3.1.1.cmml"><mi id="S2.SS1.p2.3.m3.1.1.2" xref="S2.SS1.p2.3.m3.1.1.2.cmml">T</mi><mi id="S2.SS1.p2.3.m3.1.1.3" xref="S2.SS1.p2.3.m3.1.1.3.cmml">v</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.3.m3.1b"><apply id="S2.SS1.p2.3.m3.1.1.cmml" xref="S2.SS1.p2.3.m3.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.3.m3.1.1.1.cmml" xref="S2.SS1.p2.3.m3.1.1">subscript</csymbol><ci id="S2.SS1.p2.3.m3.1.1.2.cmml" xref="S2.SS1.p2.3.m3.1.1.2">𝑇</ci><ci id="S2.SS1.p2.3.m3.1.1.3.cmml" xref="S2.SS1.p2.3.m3.1.1.3">𝑣</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.3.m3.1c">T_{v}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.3.m3.1d">italic_T start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT</annotation></semantics></math> frames of size <math alttext="H\times W" class="ltx_Math" display="inline" id="S2.SS1.p2.4.m4.1"><semantics id="S2.SS1.p2.4.m4.1a"><mrow id="S2.SS1.p2.4.m4.1.1" xref="S2.SS1.p2.4.m4.1.1.cmml"><mi id="S2.SS1.p2.4.m4.1.1.2" xref="S2.SS1.p2.4.m4.1.1.2.cmml">H</mi><mo id="S2.SS1.p2.4.m4.1.1.1" lspace="0.222em" rspace="0.222em" xref="S2.SS1.p2.4.m4.1.1.1.cmml">×</mo><mi id="S2.SS1.p2.4.m4.1.1.3" xref="S2.SS1.p2.4.m4.1.1.3.cmml">W</mi></mrow><annotation-xml encoding="MathML-Content" 
id="S2.SS1.p2.4.m4.1b"><apply id="S2.SS1.p2.4.m4.1.1.cmml" xref="S2.SS1.p2.4.m4.1.1"><times id="S2.SS1.p2.4.m4.1.1.1.cmml" xref="S2.SS1.p2.4.m4.1.1.1"></times><ci id="S2.SS1.p2.4.m4.1.1.2.cmml" xref="S2.SS1.p2.4.m4.1.1.2">𝐻</ci><ci id="S2.SS1.p2.4.m4.1.1.3.cmml" xref="S2.SS1.p2.4.m4.1.1.3">𝑊</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.4.m4.1c">H\times W</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.4.m4.1d">italic_H × italic_W</annotation></semantics></math> and RGB channels, <math alttext="x_{a}\in\mathbb{R}^{T_{a}\times M}" class="ltx_Math" display="inline" id="S2.SS1.p2.5.m5.1"><semantics id="S2.SS1.p2.5.m5.1a"><mrow id="S2.SS1.p2.5.m5.1.1" xref="S2.SS1.p2.5.m5.1.1.cmml"><msub id="S2.SS1.p2.5.m5.1.1.2" xref="S2.SS1.p2.5.m5.1.1.2.cmml"><mi id="S2.SS1.p2.5.m5.1.1.2.2" xref="S2.SS1.p2.5.m5.1.1.2.2.cmml">x</mi><mi id="S2.SS1.p2.5.m5.1.1.2.3" xref="S2.SS1.p2.5.m5.1.1.2.3.cmml">a</mi></msub><mo id="S2.SS1.p2.5.m5.1.1.1" xref="S2.SS1.p2.5.m5.1.1.1.cmml">∈</mo><msup id="S2.SS1.p2.5.m5.1.1.3" xref="S2.SS1.p2.5.m5.1.1.3.cmml"><mi id="S2.SS1.p2.5.m5.1.1.3.2" xref="S2.SS1.p2.5.m5.1.1.3.2.cmml">ℝ</mi><mrow id="S2.SS1.p2.5.m5.1.1.3.3" xref="S2.SS1.p2.5.m5.1.1.3.3.cmml"><msub id="S2.SS1.p2.5.m5.1.1.3.3.2" xref="S2.SS1.p2.5.m5.1.1.3.3.2.cmml"><mi id="S2.SS1.p2.5.m5.1.1.3.3.2.2" xref="S2.SS1.p2.5.m5.1.1.3.3.2.2.cmml">T</mi><mi id="S2.SS1.p2.5.m5.1.1.3.3.2.3" xref="S2.SS1.p2.5.m5.1.1.3.3.2.3.cmml">a</mi></msub><mo id="S2.SS1.p2.5.m5.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS1.p2.5.m5.1.1.3.3.1.cmml">×</mo><mi id="S2.SS1.p2.5.m5.1.1.3.3.3" xref="S2.SS1.p2.5.m5.1.1.3.3.3.cmml">M</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.5.m5.1b"><apply id="S2.SS1.p2.5.m5.1.1.cmml" xref="S2.SS1.p2.5.m5.1.1"><in id="S2.SS1.p2.5.m5.1.1.1.cmml" xref="S2.SS1.p2.5.m5.1.1.1"></in><apply id="S2.SS1.p2.5.m5.1.1.2.cmml" xref="S2.SS1.p2.5.m5.1.1.2"><csymbol cd="ambiguous" 
id="S2.SS1.p2.5.m5.1.1.2.1.cmml" xref="S2.SS1.p2.5.m5.1.1.2">subscript</csymbol><ci id="S2.SS1.p2.5.m5.1.1.2.2.cmml" xref="S2.SS1.p2.5.m5.1.1.2.2">𝑥</ci><ci id="S2.SS1.p2.5.m5.1.1.2.3.cmml" xref="S2.SS1.p2.5.m5.1.1.2.3">𝑎</ci></apply><apply id="S2.SS1.p2.5.m5.1.1.3.cmml" xref="S2.SS1.p2.5.m5.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.p2.5.m5.1.1.3.1.cmml" xref="S2.SS1.p2.5.m5.1.1.3">superscript</csymbol><ci id="S2.SS1.p2.5.m5.1.1.3.2.cmml" xref="S2.SS1.p2.5.m5.1.1.3.2">ℝ</ci><apply id="S2.SS1.p2.5.m5.1.1.3.3.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3"><times id="S2.SS1.p2.5.m5.1.1.3.3.1.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3.1"></times><apply id="S2.SS1.p2.5.m5.1.1.3.3.2.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3.2"><csymbol cd="ambiguous" id="S2.SS1.p2.5.m5.1.1.3.3.2.1.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3.2">subscript</csymbol><ci id="S2.SS1.p2.5.m5.1.1.3.3.2.2.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3.2.2">𝑇</ci><ci id="S2.SS1.p2.5.m5.1.1.3.3.2.3.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3.2.3">𝑎</ci></apply><ci id="S2.SS1.p2.5.m5.1.1.3.3.3.cmml" xref="S2.SS1.p2.5.m5.1.1.3.3.3">𝑀</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.5.m5.1c">x_{a}\in\mathbb{R}^{T_{a}\times M}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.5.m5.1d">italic_x start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT × italic_M end_POSTSUPERSCRIPT</annotation></semantics></math> represents a Mel-spectrogram with <math alttext="T_{a}" class="ltx_Math" display="inline" id="S2.SS1.p2.6.m6.1"><semantics id="S2.SS1.p2.6.m6.1a"><msub id="S2.SS1.p2.6.m6.1.1" xref="S2.SS1.p2.6.m6.1.1.cmml"><mi id="S2.SS1.p2.6.m6.1.1.2" xref="S2.SS1.p2.6.m6.1.1.2.cmml">T</mi><mi id="S2.SS1.p2.6.m6.1.1.3" xref="S2.SS1.p2.6.m6.1.1.3.cmml">a</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.6.m6.1b"><apply id="S2.SS1.p2.6.m6.1.1.cmml" xref="S2.SS1.p2.6.m6.1.1"><csymbol cd="ambiguous" 
id="S2.SS1.p2.6.m6.1.1.1.cmml" xref="S2.SS1.p2.6.m6.1.1">subscript</csymbol><ci id="S2.SS1.p2.6.m6.1.1.2.cmml" xref="S2.SS1.p2.6.m6.1.1.2">𝑇</ci><ci id="S2.SS1.p2.6.m6.1.1.3.cmml" xref="S2.SS1.p2.6.m6.1.1.3">𝑎</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.6.m6.1c">T_{a}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.6.m6.1d">italic_T start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT</annotation></semantics></math> time steps and <math alttext="M" class="ltx_Math" display="inline" id="S2.SS1.p2.7.m7.1"><semantics id="S2.SS1.p2.7.m7.1a"><mi id="S2.SS1.p2.7.m7.1.1" xref="S2.SS1.p2.7.m7.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.7.m7.1b"><ci id="S2.SS1.p2.7.m7.1.1.cmml" xref="S2.SS1.p2.7.m7.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.7.m7.1c">M</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.7.m7.1d">italic_M</annotation></semantics></math> mel bands, and <math alttext="x_{l}" class="ltx_Math" display="inline" id="S2.SS1.p2.8.m8.1"><semantics id="S2.SS1.p2.8.m8.1a"><msub id="S2.SS1.p2.8.m8.1.1" xref="S2.SS1.p2.8.m8.1.1.cmml"><mi id="S2.SS1.p2.8.m8.1.1.2" xref="S2.SS1.p2.8.m8.1.1.2.cmml">x</mi><mi id="S2.SS1.p2.8.m8.1.1.3" xref="S2.SS1.p2.8.m8.1.1.3.cmml">l</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.8.m8.1b"><apply id="S2.SS1.p2.8.m8.1.1.cmml" xref="S2.SS1.p2.8.m8.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.8.m8.1.1.1.cmml" xref="S2.SS1.p2.8.m8.1.1">subscript</csymbol><ci id="S2.SS1.p2.8.m8.1.1.2.cmml" xref="S2.SS1.p2.8.m8.1.1.2">𝑥</ci><ci id="S2.SS1.p2.8.m8.1.1.3.cmml" xref="S2.SS1.p2.8.m8.1.1.3">𝑙</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.8.m8.1c">x_{l}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.8.m8.1d">italic_x start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT</annotation></semantics></math> is the corresponding text 
description. We use video, audio, and text encoders <math alttext="f_{V}(\cdot)" class="ltx_Math" display="inline" id="S2.SS1.p2.9.m9.1"><semantics id="S2.SS1.p2.9.m9.1a"><mrow id="S2.SS1.p2.9.m9.1.2" xref="S2.SS1.p2.9.m9.1.2.cmml"><msub id="S2.SS1.p2.9.m9.1.2.2" xref="S2.SS1.p2.9.m9.1.2.2.cmml"><mi id="S2.SS1.p2.9.m9.1.2.2.2" xref="S2.SS1.p2.9.m9.1.2.2.2.cmml">f</mi><mi id="S2.SS1.p2.9.m9.1.2.2.3" xref="S2.SS1.p2.9.m9.1.2.2.3.cmml">V</mi></msub><mo id="S2.SS1.p2.9.m9.1.2.1" xref="S2.SS1.p2.9.m9.1.2.1.cmml">⁢</mo><mrow id="S2.SS1.p2.9.m9.1.2.3.2" xref="S2.SS1.p2.9.m9.1.2.cmml"><mo id="S2.SS1.p2.9.m9.1.2.3.2.1" stretchy="false" xref="S2.SS1.p2.9.m9.1.2.cmml">(</mo><mo id="S2.SS1.p2.9.m9.1.1" lspace="0em" rspace="0em" xref="S2.SS1.p2.9.m9.1.1.cmml">⋅</mo><mo id="S2.SS1.p2.9.m9.1.2.3.2.2" stretchy="false" xref="S2.SS1.p2.9.m9.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.9.m9.1b"><apply id="S2.SS1.p2.9.m9.1.2.cmml" xref="S2.SS1.p2.9.m9.1.2"><times id="S2.SS1.p2.9.m9.1.2.1.cmml" xref="S2.SS1.p2.9.m9.1.2.1"></times><apply id="S2.SS1.p2.9.m9.1.2.2.cmml" xref="S2.SS1.p2.9.m9.1.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.9.m9.1.2.2.1.cmml" xref="S2.SS1.p2.9.m9.1.2.2">subscript</csymbol><ci id="S2.SS1.p2.9.m9.1.2.2.2.cmml" xref="S2.SS1.p2.9.m9.1.2.2.2">𝑓</ci><ci id="S2.SS1.p2.9.m9.1.2.2.3.cmml" xref="S2.SS1.p2.9.m9.1.2.2.3">𝑉</ci></apply><ci id="S2.SS1.p2.9.m9.1.1.cmml" xref="S2.SS1.p2.9.m9.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.9.m9.1c">f_{V}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.9.m9.1d">italic_f start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ( ⋅ )</annotation></semantics></math>, <math alttext="f_{A}(\cdot)" class="ltx_Math" display="inline" id="S2.SS1.p2.10.m10.1"><semantics id="S2.SS1.p2.10.m10.1a"><mrow id="S2.SS1.p2.10.m10.1.2" xref="S2.SS1.p2.10.m10.1.2.cmml"><msub id="S2.SS1.p2.10.m10.1.2.2" xref="S2.SS1.p2.10.m10.1.2.2.cmml"><mi 
id="S2.SS1.p2.10.m10.1.2.2.2" xref="S2.SS1.p2.10.m10.1.2.2.2.cmml">f</mi><mi id="S2.SS1.p2.10.m10.1.2.2.3" xref="S2.SS1.p2.10.m10.1.2.2.3.cmml">A</mi></msub><mo id="S2.SS1.p2.10.m10.1.2.1" xref="S2.SS1.p2.10.m10.1.2.1.cmml">⁢</mo><mrow id="S2.SS1.p2.10.m10.1.2.3.2" xref="S2.SS1.p2.10.m10.1.2.cmml"><mo id="S2.SS1.p2.10.m10.1.2.3.2.1" stretchy="false" xref="S2.SS1.p2.10.m10.1.2.cmml">(</mo><mo id="S2.SS1.p2.10.m10.1.1" lspace="0em" rspace="0em" xref="S2.SS1.p2.10.m10.1.1.cmml">⋅</mo><mo id="S2.SS1.p2.10.m10.1.2.3.2.2" stretchy="false" xref="S2.SS1.p2.10.m10.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.10.m10.1b"><apply id="S2.SS1.p2.10.m10.1.2.cmml" xref="S2.SS1.p2.10.m10.1.2"><times id="S2.SS1.p2.10.m10.1.2.1.cmml" xref="S2.SS1.p2.10.m10.1.2.1"></times><apply id="S2.SS1.p2.10.m10.1.2.2.cmml" xref="S2.SS1.p2.10.m10.1.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.10.m10.1.2.2.1.cmml" xref="S2.SS1.p2.10.m10.1.2.2">subscript</csymbol><ci id="S2.SS1.p2.10.m10.1.2.2.2.cmml" xref="S2.SS1.p2.10.m10.1.2.2.2">𝑓</ci><ci id="S2.SS1.p2.10.m10.1.2.2.3.cmml" xref="S2.SS1.p2.10.m10.1.2.2.3">𝐴</ci></apply><ci id="S2.SS1.p2.10.m10.1.1.cmml" xref="S2.SS1.p2.10.m10.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.10.m10.1c">f_{A}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.10.m10.1d">italic_f start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT ( ⋅ )</annotation></semantics></math>, and <math alttext="f_{L}(\cdot)" class="ltx_Math" display="inline" id="S2.SS1.p2.11.m11.1"><semantics id="S2.SS1.p2.11.m11.1a"><mrow id="S2.SS1.p2.11.m11.1.2" xref="S2.SS1.p2.11.m11.1.2.cmml"><msub id="S2.SS1.p2.11.m11.1.2.2" xref="S2.SS1.p2.11.m11.1.2.2.cmml"><mi id="S2.SS1.p2.11.m11.1.2.2.2" xref="S2.SS1.p2.11.m11.1.2.2.2.cmml">f</mi><mi id="S2.SS1.p2.11.m11.1.2.2.3" xref="S2.SS1.p2.11.m11.1.2.2.3.cmml">L</mi></msub><mo id="S2.SS1.p2.11.m11.1.2.1" xref="S2.SS1.p2.11.m11.1.2.1.cmml">⁢</mo><mrow 
id="S2.SS1.p2.11.m11.1.2.3.2" xref="S2.SS1.p2.11.m11.1.2.cmml"><mo id="S2.SS1.p2.11.m11.1.2.3.2.1" stretchy="false" xref="S2.SS1.p2.11.m11.1.2.cmml">(</mo><mo id="S2.SS1.p2.11.m11.1.1" lspace="0em" rspace="0em" xref="S2.SS1.p2.11.m11.1.1.cmml">⋅</mo><mo id="S2.SS1.p2.11.m11.1.2.3.2.2" stretchy="false" xref="S2.SS1.p2.11.m11.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.11.m11.1b"><apply id="S2.SS1.p2.11.m11.1.2.cmml" xref="S2.SS1.p2.11.m11.1.2"><times id="S2.SS1.p2.11.m11.1.2.1.cmml" xref="S2.SS1.p2.11.m11.1.2.1"></times><apply id="S2.SS1.p2.11.m11.1.2.2.cmml" xref="S2.SS1.p2.11.m11.1.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.11.m11.1.2.2.1.cmml" xref="S2.SS1.p2.11.m11.1.2.2">subscript</csymbol><ci id="S2.SS1.p2.11.m11.1.2.2.2.cmml" xref="S2.SS1.p2.11.m11.1.2.2.2">𝑓</ci><ci id="S2.SS1.p2.11.m11.1.2.2.3.cmml" xref="S2.SS1.p2.11.m11.1.2.2.3">𝐿</ci></apply><ci id="S2.SS1.p2.11.m11.1.1.cmml" xref="S2.SS1.p2.11.m11.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.11.m11.1c">f_{L}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.11.m11.1d">italic_f start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ( ⋅ )</annotation></semantics></math> to extract features <math alttext="E_{v}^{T},E_{a}^{T}\in\mathbb{R}^{T\times C}" class="ltx_Math" display="inline" id="S2.SS1.p2.12.m12.2"><semantics id="S2.SS1.p2.12.m12.2a"><mrow id="S2.SS1.p2.12.m12.2.2" xref="S2.SS1.p2.12.m12.2.2.cmml"><mrow id="S2.SS1.p2.12.m12.2.2.2.2" xref="S2.SS1.p2.12.m12.2.2.2.3.cmml"><msubsup id="S2.SS1.p2.12.m12.1.1.1.1.1" xref="S2.SS1.p2.12.m12.1.1.1.1.1.cmml"><mi id="S2.SS1.p2.12.m12.1.1.1.1.1.2.2" xref="S2.SS1.p2.12.m12.1.1.1.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p2.12.m12.1.1.1.1.1.2.3" xref="S2.SS1.p2.12.m12.1.1.1.1.1.2.3.cmml">v</mi><mi id="S2.SS1.p2.12.m12.1.1.1.1.1.3" xref="S2.SS1.p2.12.m12.1.1.1.1.1.3.cmml">T</mi></msubsup><mo id="S2.SS1.p2.12.m12.2.2.2.2.3" 
xref="S2.SS1.p2.12.m12.2.2.2.3.cmml">,</mo><msubsup id="S2.SS1.p2.12.m12.2.2.2.2.2" xref="S2.SS1.p2.12.m12.2.2.2.2.2.cmml"><mi id="S2.SS1.p2.12.m12.2.2.2.2.2.2.2" xref="S2.SS1.p2.12.m12.2.2.2.2.2.2.2.cmml">E</mi><mi id="S2.SS1.p2.12.m12.2.2.2.2.2.2.3" xref="S2.SS1.p2.12.m12.2.2.2.2.2.2.3.cmml">a</mi><mi id="S2.SS1.p2.12.m12.2.2.2.2.2.3" xref="S2.SS1.p2.12.m12.2.2.2.2.2.3.cmml">T</mi></msubsup></mrow><mo id="S2.SS1.p2.12.m12.2.2.3" xref="S2.SS1.p2.12.m12.2.2.3.cmml">∈</mo><msup id="S2.SS1.p2.12.m12.2.2.4" xref="S2.SS1.p2.12.m12.2.2.4.cmml"><mi id="S2.SS1.p2.12.m12.2.2.4.2" xref="S2.SS1.p2.12.m12.2.2.4.2.cmml">ℝ</mi><mrow id="S2.SS1.p2.12.m12.2.2.4.3" xref="S2.SS1.p2.12.m12.2.2.4.3.cmml"><mi id="S2.SS1.p2.12.m12.2.2.4.3.2" xref="S2.SS1.p2.12.m12.2.2.4.3.2.cmml">T</mi><mo id="S2.SS1.p2.12.m12.2.2.4.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS1.p2.12.m12.2.2.4.3.1.cmml">×</mo><mi id="S2.SS1.p2.12.m12.2.2.4.3.3" xref="S2.SS1.p2.12.m12.2.2.4.3.3.cmml">C</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.12.m12.2b"><apply id="S2.SS1.p2.12.m12.2.2.cmml" xref="S2.SS1.p2.12.m12.2.2"><in id="S2.SS1.p2.12.m12.2.2.3.cmml" xref="S2.SS1.p2.12.m12.2.2.3"></in><list id="S2.SS1.p2.12.m12.2.2.2.3.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2"><apply id="S2.SS1.p2.12.m12.1.1.1.1.1.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.12.m12.1.1.1.1.1.1.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1">superscript</csymbol><apply id="S2.SS1.p2.12.m12.1.1.1.1.1.2.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.12.m12.1.1.1.1.1.2.1.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p2.12.m12.1.1.1.1.1.2.2.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1.2.2">𝐸</ci><ci id="S2.SS1.p2.12.m12.1.1.1.1.1.2.3.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1.2.3">𝑣</ci></apply><ci id="S2.SS1.p2.12.m12.1.1.1.1.1.3.cmml" xref="S2.SS1.p2.12.m12.1.1.1.1.1.3">𝑇</ci></apply><apply id="S2.SS1.p2.12.m12.2.2.2.2.2.cmml" 
xref="S2.SS1.p2.12.m12.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.12.m12.2.2.2.2.2.1.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2.2">superscript</csymbol><apply id="S2.SS1.p2.12.m12.2.2.2.2.2.2.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.12.m12.2.2.2.2.2.2.1.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2.2">subscript</csymbol><ci id="S2.SS1.p2.12.m12.2.2.2.2.2.2.2.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2.2.2.2">𝐸</ci><ci id="S2.SS1.p2.12.m12.2.2.2.2.2.2.3.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2.2.2.3">𝑎</ci></apply><ci id="S2.SS1.p2.12.m12.2.2.2.2.2.3.cmml" xref="S2.SS1.p2.12.m12.2.2.2.2.2.3">𝑇</ci></apply></list><apply id="S2.SS1.p2.12.m12.2.2.4.cmml" xref="S2.SS1.p2.12.m12.2.2.4"><csymbol cd="ambiguous" id="S2.SS1.p2.12.m12.2.2.4.1.cmml" xref="S2.SS1.p2.12.m12.2.2.4">superscript</csymbol><ci id="S2.SS1.p2.12.m12.2.2.4.2.cmml" xref="S2.SS1.p2.12.m12.2.2.4.2">ℝ</ci><apply id="S2.SS1.p2.12.m12.2.2.4.3.cmml" xref="S2.SS1.p2.12.m12.2.2.4.3"><times id="S2.SS1.p2.12.m12.2.2.4.3.1.cmml" xref="S2.SS1.p2.12.m12.2.2.4.3.1"></times><ci id="S2.SS1.p2.12.m12.2.2.4.3.2.cmml" xref="S2.SS1.p2.12.m12.2.2.4.3.2">𝑇</ci><ci id="S2.SS1.p2.12.m12.2.2.4.3.3.cmml" xref="S2.SS1.p2.12.m12.2.2.4.3.3">𝐶</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.12.m12.2c">E_{v}^{T},E_{a}^{T}\in\mathbb{R}^{T\times C}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.12.m12.2d">italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_T × italic_C end_POSTSUPERSCRIPT</annotation></semantics></math> from video and audio, where <math alttext="T=32" class="ltx_Math" display="inline" id="S2.SS1.p2.13.m13.1"><semantics id="S2.SS1.p2.13.m13.1a"><mrow id="S2.SS1.p2.13.m13.1.1" xref="S2.SS1.p2.13.m13.1.1.cmml"><mi 
id="S2.SS1.p2.13.m13.1.1.2" xref="S2.SS1.p2.13.m13.1.1.2.cmml">T</mi><mo id="S2.SS1.p2.13.m13.1.1.1" xref="S2.SS1.p2.13.m13.1.1.1.cmml">=</mo><mn id="S2.SS1.p2.13.m13.1.1.3" xref="S2.SS1.p2.13.m13.1.1.3.cmml">32</mn></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.13.m13.1b"><apply id="S2.SS1.p2.13.m13.1.1.cmml" xref="S2.SS1.p2.13.m13.1.1"><eq id="S2.SS1.p2.13.m13.1.1.1.cmml" xref="S2.SS1.p2.13.m13.1.1.1"></eq><ci id="S2.SS1.p2.13.m13.1.1.2.cmml" xref="S2.SS1.p2.13.m13.1.1.2">𝑇</ci><cn id="S2.SS1.p2.13.m13.1.1.3.cmml" type="integer" xref="S2.SS1.p2.13.m13.1.1.3">32</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.13.m13.1c">T=32</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.13.m13.1d">italic_T = 32</annotation></semantics></math> is the number of temporal segments and <math alttext="C=512" class="ltx_Math" display="inline" id="S2.SS1.p2.14.m14.1"><semantics id="S2.SS1.p2.14.m14.1a"><mrow id="S2.SS1.p2.14.m14.1.1" xref="S2.SS1.p2.14.m14.1.1.cmml"><mi id="S2.SS1.p2.14.m14.1.1.2" xref="S2.SS1.p2.14.m14.1.1.2.cmml">C</mi><mo id="S2.SS1.p2.14.m14.1.1.1" xref="S2.SS1.p2.14.m14.1.1.1.cmml">=</mo><mn id="S2.SS1.p2.14.m14.1.1.3" xref="S2.SS1.p2.14.m14.1.1.3.cmml">512</mn></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.14.m14.1b"><apply id="S2.SS1.p2.14.m14.1.1.cmml" xref="S2.SS1.p2.14.m14.1.1"><eq id="S2.SS1.p2.14.m14.1.1.1.cmml" xref="S2.SS1.p2.14.m14.1.1.1"></eq><ci id="S2.SS1.p2.14.m14.1.1.2.cmml" xref="S2.SS1.p2.14.m14.1.1.2">𝐶</ci><cn id="S2.SS1.p2.14.m14.1.1.3.cmml" type="integer" xref="S2.SS1.p2.14.m14.1.1.3">512</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.14.m14.1c">C=512</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.14.m14.1d">italic_C = 512</annotation></semantics></math> is the feature dimension, and <math alttext="E_{l}\in\mathbb{R}^{C}" class="ltx_Math" display="inline" id="S2.SS1.p2.15.m15.1"><semantics 
id="S2.SS1.p2.15.m15.1a"><mrow id="S2.SS1.p2.15.m15.1.1" xref="S2.SS1.p2.15.m15.1.1.cmml"><msub id="S2.SS1.p2.15.m15.1.1.2" xref="S2.SS1.p2.15.m15.1.1.2.cmml"><mi id="S2.SS1.p2.15.m15.1.1.2.2" xref="S2.SS1.p2.15.m15.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p2.15.m15.1.1.2.3" xref="S2.SS1.p2.15.m15.1.1.2.3.cmml">l</mi></msub><mo id="S2.SS1.p2.15.m15.1.1.1" xref="S2.SS1.p2.15.m15.1.1.1.cmml">∈</mo><msup id="S2.SS1.p2.15.m15.1.1.3" xref="S2.SS1.p2.15.m15.1.1.3.cmml"><mi id="S2.SS1.p2.15.m15.1.1.3.2" xref="S2.SS1.p2.15.m15.1.1.3.2.cmml">ℝ</mi><mi id="S2.SS1.p2.15.m15.1.1.3.3" xref="S2.SS1.p2.15.m15.1.1.3.3.cmml">C</mi></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.15.m15.1b"><apply id="S2.SS1.p2.15.m15.1.1.cmml" xref="S2.SS1.p2.15.m15.1.1"><in id="S2.SS1.p2.15.m15.1.1.1.cmml" xref="S2.SS1.p2.15.m15.1.1.1"></in><apply id="S2.SS1.p2.15.m15.1.1.2.cmml" xref="S2.SS1.p2.15.m15.1.1.2"><csymbol cd="ambiguous" id="S2.SS1.p2.15.m15.1.1.2.1.cmml" xref="S2.SS1.p2.15.m15.1.1.2">subscript</csymbol><ci id="S2.SS1.p2.15.m15.1.1.2.2.cmml" xref="S2.SS1.p2.15.m15.1.1.2.2">𝐸</ci><ci id="S2.SS1.p2.15.m15.1.1.2.3.cmml" xref="S2.SS1.p2.15.m15.1.1.2.3">𝑙</ci></apply><apply id="S2.SS1.p2.15.m15.1.1.3.cmml" xref="S2.SS1.p2.15.m15.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.p2.15.m15.1.1.3.1.cmml" xref="S2.SS1.p2.15.m15.1.1.3">superscript</csymbol><ci id="S2.SS1.p2.15.m15.1.1.3.2.cmml" xref="S2.SS1.p2.15.m15.1.1.3.2">ℝ</ci><ci id="S2.SS1.p2.15.m15.1.1.3.3.cmml" xref="S2.SS1.p2.15.m15.1.1.3.3">𝐶</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.15.m15.1c">E_{l}\in\mathbb{R}^{C}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.15.m15.1d">italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT</annotation></semantics></math> from text. 
To align feature dimensions, we apply temporal pooling to the video and audio features, resulting in <math alttext="E_{v}" class="ltx_Math" display="inline" id="S2.SS1.p2.16.m16.1"><semantics id="S2.SS1.p2.16.m16.1a"><msub id="S2.SS1.p2.16.m16.1.1" xref="S2.SS1.p2.16.m16.1.1.cmml"><mi id="S2.SS1.p2.16.m16.1.1.2" xref="S2.SS1.p2.16.m16.1.1.2.cmml">E</mi><mi id="S2.SS1.p2.16.m16.1.1.3" xref="S2.SS1.p2.16.m16.1.1.3.cmml">v</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.16.m16.1b"><apply id="S2.SS1.p2.16.m16.1.1.cmml" xref="S2.SS1.p2.16.m16.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.16.m16.1.1.1.cmml" xref="S2.SS1.p2.16.m16.1.1">subscript</csymbol><ci id="S2.SS1.p2.16.m16.1.1.2.cmml" xref="S2.SS1.p2.16.m16.1.1.2">𝐸</ci><ci id="S2.SS1.p2.16.m16.1.1.3.cmml" xref="S2.SS1.p2.16.m16.1.1.3">𝑣</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.16.m16.1c">E_{v}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.16.m16.1d">italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="E_{a}\in\mathbb{R}^{C}" class="ltx_Math" display="inline" id="S2.SS1.p2.17.m17.1"><semantics id="S2.SS1.p2.17.m17.1a"><mrow id="S2.SS1.p2.17.m17.1.1" xref="S2.SS1.p2.17.m17.1.1.cmml"><msub id="S2.SS1.p2.17.m17.1.1.2" xref="S2.SS1.p2.17.m17.1.1.2.cmml"><mi id="S2.SS1.p2.17.m17.1.1.2.2" xref="S2.SS1.p2.17.m17.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p2.17.m17.1.1.2.3" xref="S2.SS1.p2.17.m17.1.1.2.3.cmml">a</mi></msub><mo id="S2.SS1.p2.17.m17.1.1.1" xref="S2.SS1.p2.17.m17.1.1.1.cmml">∈</mo><msup id="S2.SS1.p2.17.m17.1.1.3" xref="S2.SS1.p2.17.m17.1.1.3.cmml"><mi id="S2.SS1.p2.17.m17.1.1.3.2" xref="S2.SS1.p2.17.m17.1.1.3.2.cmml">ℝ</mi><mi id="S2.SS1.p2.17.m17.1.1.3.3" xref="S2.SS1.p2.17.m17.1.1.3.3.cmml">C</mi></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.17.m17.1b"><apply id="S2.SS1.p2.17.m17.1.1.cmml" xref="S2.SS1.p2.17.m17.1.1"><in id="S2.SS1.p2.17.m17.1.1.1.cmml" 
xref="S2.SS1.p2.17.m17.1.1.1"></in><apply id="S2.SS1.p2.17.m17.1.1.2.cmml" xref="S2.SS1.p2.17.m17.1.1.2"><csymbol cd="ambiguous" id="S2.SS1.p2.17.m17.1.1.2.1.cmml" xref="S2.SS1.p2.17.m17.1.1.2">subscript</csymbol><ci id="S2.SS1.p2.17.m17.1.1.2.2.cmml" xref="S2.SS1.p2.17.m17.1.1.2.2">𝐸</ci><ci id="S2.SS1.p2.17.m17.1.1.2.3.cmml" xref="S2.SS1.p2.17.m17.1.1.2.3">𝑎</ci></apply><apply id="S2.SS1.p2.17.m17.1.1.3.cmml" xref="S2.SS1.p2.17.m17.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.p2.17.m17.1.1.3.1.cmml" xref="S2.SS1.p2.17.m17.1.1.3">superscript</csymbol><ci id="S2.SS1.p2.17.m17.1.1.3.2.cmml" xref="S2.SS1.p2.17.m17.1.1.3.2">ℝ</ci><ci id="S2.SS1.p2.17.m17.1.1.3.3.cmml" xref="S2.SS1.p2.17.m17.1.1.3.3">𝐶</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.17.m17.1c">E_{a}\in\mathbb{R}^{C}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.17.m17.1d">italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT</annotation></semantics></math>. 
This ensures all features are in the same space <math alttext="\mathbb{R}^{C}" class="ltx_Math" display="inline" id="S2.SS1.p2.18.m18.1"><semantics id="S2.SS1.p2.18.m18.1a"><msup id="S2.SS1.p2.18.m18.1.1" xref="S2.SS1.p2.18.m18.1.1.cmml"><mi id="S2.SS1.p2.18.m18.1.1.2" xref="S2.SS1.p2.18.m18.1.1.2.cmml">ℝ</mi><mi id="S2.SS1.p2.18.m18.1.1.3" xref="S2.SS1.p2.18.m18.1.1.3.cmml">C</mi></msup><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.18.m18.1b"><apply id="S2.SS1.p2.18.m18.1.1.cmml" xref="S2.SS1.p2.18.m18.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.18.m18.1.1.1.cmml" xref="S2.SS1.p2.18.m18.1.1">superscript</csymbol><ci id="S2.SS1.p2.18.m18.1.1.2.cmml" xref="S2.SS1.p2.18.m18.1.1.2">ℝ</ci><ci id="S2.SS1.p2.18.m18.1.1.3.cmml" xref="S2.SS1.p2.18.m18.1.1.3">𝐶</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.18.m18.1c">\mathbb{R}^{C}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.18.m18.1d">blackboard_R start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT</annotation></semantics></math>, facilitating fusion and comparison.</p> </div> <div class="ltx_para" id="S2.SS1.p3"> <p class="ltx_p" id="S2.SS1.p3.1">We define the contrastive loss function for the <math alttext="i" class="ltx_Math" display="inline" id="S2.SS1.p3.1.m1.1"><semantics id="S2.SS1.p3.1.m1.1a"><mi id="S2.SS1.p3.1.m1.1.1" xref="S2.SS1.p3.1.m1.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p3.1.m1.1b"><ci id="S2.SS1.p3.1.m1.1.1.cmml" xref="S2.SS1.p3.1.m1.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p3.1.m1.1c">i</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p3.1.m1.1d">italic_i</annotation></semantics></math>-th cross-modal pair as:</p> </div> <div class="ltx_para" id="S2.SS1.p4"> <table class="ltx_equation ltx_eqn_table" id="S2.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td 
class="ltx_eqn_cell ltx_align_center"><math alttext="f(E_{x}^{i},E_{y}^{i},M)=\log\frac{\exp(E_{x}^{i}\cdot E_{y}^{i}/\tau)}{\sum_{% j=1}^{M}\exp(E_{x}^{i}\cdot E_{y}^{j}/\tau)}" class="ltx_Math" display="block" id="S2.E1.m1.7"><semantics id="S2.E1.m1.7a"><mrow id="S2.E1.m1.7.7" xref="S2.E1.m1.7.7.cmml"><mrow id="S2.E1.m1.7.7.2" xref="S2.E1.m1.7.7.2.cmml"><mi id="S2.E1.m1.7.7.2.4" xref="S2.E1.m1.7.7.2.4.cmml">f</mi><mo id="S2.E1.m1.7.7.2.3" xref="S2.E1.m1.7.7.2.3.cmml">⁢</mo><mrow id="S2.E1.m1.7.7.2.2.2" xref="S2.E1.m1.7.7.2.2.3.cmml"><mo id="S2.E1.m1.7.7.2.2.2.3" stretchy="false" xref="S2.E1.m1.7.7.2.2.3.cmml">(</mo><msubsup id="S2.E1.m1.6.6.1.1.1.1" xref="S2.E1.m1.6.6.1.1.1.1.cmml"><mi id="S2.E1.m1.6.6.1.1.1.1.2.2" xref="S2.E1.m1.6.6.1.1.1.1.2.2.cmml">E</mi><mi id="S2.E1.m1.6.6.1.1.1.1.2.3" xref="S2.E1.m1.6.6.1.1.1.1.2.3.cmml">x</mi><mi id="S2.E1.m1.6.6.1.1.1.1.3" xref="S2.E1.m1.6.6.1.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E1.m1.7.7.2.2.2.4" xref="S2.E1.m1.7.7.2.2.3.cmml">,</mo><msubsup id="S2.E1.m1.7.7.2.2.2.2" xref="S2.E1.m1.7.7.2.2.2.2.cmml"><mi id="S2.E1.m1.7.7.2.2.2.2.2.2" xref="S2.E1.m1.7.7.2.2.2.2.2.2.cmml">E</mi><mi id="S2.E1.m1.7.7.2.2.2.2.2.3" xref="S2.E1.m1.7.7.2.2.2.2.2.3.cmml">y</mi><mi id="S2.E1.m1.7.7.2.2.2.2.3" xref="S2.E1.m1.7.7.2.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E1.m1.7.7.2.2.2.5" xref="S2.E1.m1.7.7.2.2.3.cmml">,</mo><mi id="S2.E1.m1.5.5" xref="S2.E1.m1.5.5.cmml">M</mi><mo id="S2.E1.m1.7.7.2.2.2.6" stretchy="false" xref="S2.E1.m1.7.7.2.2.3.cmml">)</mo></mrow></mrow><mo id="S2.E1.m1.7.7.3" xref="S2.E1.m1.7.7.3.cmml">=</mo><mrow id="S2.E1.m1.7.7.4" xref="S2.E1.m1.7.7.4.cmml"><mi id="S2.E1.m1.7.7.4.1" xref="S2.E1.m1.7.7.4.1.cmml">log</mi><mo id="S2.E1.m1.7.7.4a" lspace="0.167em" xref="S2.E1.m1.7.7.4.cmml">⁡</mo><mfrac id="S2.E1.m1.4.4" xref="S2.E1.m1.4.4.cmml"><mrow id="S2.E1.m1.2.2.2.2" xref="S2.E1.m1.2.2.2.3.cmml"><mi id="S2.E1.m1.1.1.1.1" xref="S2.E1.m1.1.1.1.1.cmml">exp</mi><mo id="S2.E1.m1.2.2.2.2a" 
xref="S2.E1.m1.2.2.2.3.cmml">⁡</mo><mrow id="S2.E1.m1.2.2.2.2.1" xref="S2.E1.m1.2.2.2.3.cmml"><mo id="S2.E1.m1.2.2.2.2.1.2" stretchy="false" xref="S2.E1.m1.2.2.2.3.cmml">(</mo><mrow id="S2.E1.m1.2.2.2.2.1.1" xref="S2.E1.m1.2.2.2.2.1.1.cmml"><mrow id="S2.E1.m1.2.2.2.2.1.1.2" xref="S2.E1.m1.2.2.2.2.1.1.2.cmml"><msubsup id="S2.E1.m1.2.2.2.2.1.1.2.2" xref="S2.E1.m1.2.2.2.2.1.1.2.2.cmml"><mi id="S2.E1.m1.2.2.2.2.1.1.2.2.2.2" xref="S2.E1.m1.2.2.2.2.1.1.2.2.2.2.cmml">E</mi><mi id="S2.E1.m1.2.2.2.2.1.1.2.2.2.3" xref="S2.E1.m1.2.2.2.2.1.1.2.2.2.3.cmml">x</mi><mi id="S2.E1.m1.2.2.2.2.1.1.2.2.3" xref="S2.E1.m1.2.2.2.2.1.1.2.2.3.cmml">i</mi></msubsup><mo id="S2.E1.m1.2.2.2.2.1.1.2.1" lspace="0.222em" rspace="0.222em" xref="S2.E1.m1.2.2.2.2.1.1.2.1.cmml">⋅</mo><msubsup id="S2.E1.m1.2.2.2.2.1.1.2.3" xref="S2.E1.m1.2.2.2.2.1.1.2.3.cmml"><mi id="S2.E1.m1.2.2.2.2.1.1.2.3.2.2" xref="S2.E1.m1.2.2.2.2.1.1.2.3.2.2.cmml">E</mi><mi id="S2.E1.m1.2.2.2.2.1.1.2.3.2.3" xref="S2.E1.m1.2.2.2.2.1.1.2.3.2.3.cmml">y</mi><mi id="S2.E1.m1.2.2.2.2.1.1.2.3.3" xref="S2.E1.m1.2.2.2.2.1.1.2.3.3.cmml">i</mi></msubsup></mrow><mo id="S2.E1.m1.2.2.2.2.1.1.1" xref="S2.E1.m1.2.2.2.2.1.1.1.cmml">/</mo><mi id="S2.E1.m1.2.2.2.2.1.1.3" xref="S2.E1.m1.2.2.2.2.1.1.3.cmml">τ</mi></mrow><mo id="S2.E1.m1.2.2.2.2.1.3" stretchy="false" xref="S2.E1.m1.2.2.2.3.cmml">)</mo></mrow></mrow><mrow id="S2.E1.m1.4.4.4" xref="S2.E1.m1.4.4.4.cmml"><msubsup id="S2.E1.m1.4.4.4.3" xref="S2.E1.m1.4.4.4.3.cmml"><mo id="S2.E1.m1.4.4.4.3.2.2" xref="S2.E1.m1.4.4.4.3.2.2.cmml">∑</mo><mrow id="S2.E1.m1.4.4.4.3.2.3" xref="S2.E1.m1.4.4.4.3.2.3.cmml"><mi id="S2.E1.m1.4.4.4.3.2.3.2" xref="S2.E1.m1.4.4.4.3.2.3.2.cmml">j</mi><mo id="S2.E1.m1.4.4.4.3.2.3.1" xref="S2.E1.m1.4.4.4.3.2.3.1.cmml">=</mo><mn id="S2.E1.m1.4.4.4.3.2.3.3" xref="S2.E1.m1.4.4.4.3.2.3.3.cmml">1</mn></mrow><mi id="S2.E1.m1.4.4.4.3.3" xref="S2.E1.m1.4.4.4.3.3.cmml">M</mi></msubsup><mrow id="S2.E1.m1.4.4.4.2.1" xref="S2.E1.m1.4.4.4.2.2.cmml"><mi id="S2.E1.m1.3.3.3.1" 
xref="S2.E1.m1.3.3.3.1.cmml">exp</mi><mo id="S2.E1.m1.4.4.4.2.1a" xref="S2.E1.m1.4.4.4.2.2.cmml">⁡</mo><mrow id="S2.E1.m1.4.4.4.2.1.1" xref="S2.E1.m1.4.4.4.2.2.cmml"><mo id="S2.E1.m1.4.4.4.2.1.1.2" stretchy="false" xref="S2.E1.m1.4.4.4.2.2.cmml">(</mo><mrow id="S2.E1.m1.4.4.4.2.1.1.1" xref="S2.E1.m1.4.4.4.2.1.1.1.cmml"><mrow id="S2.E1.m1.4.4.4.2.1.1.1.2" xref="S2.E1.m1.4.4.4.2.1.1.1.2.cmml"><msubsup id="S2.E1.m1.4.4.4.2.1.1.1.2.2" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.cmml"><mi id="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.2" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.2.cmml">E</mi><mi id="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.3" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.3.cmml">x</mi><mi id="S2.E1.m1.4.4.4.2.1.1.1.2.2.3" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.3.cmml">i</mi></msubsup><mo id="S2.E1.m1.4.4.4.2.1.1.1.2.1" lspace="0.222em" rspace="0.222em" xref="S2.E1.m1.4.4.4.2.1.1.1.2.1.cmml">⋅</mo><msubsup id="S2.E1.m1.4.4.4.2.1.1.1.2.3" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.cmml"><mi id="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.2" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.2.cmml">E</mi><mi id="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.3" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.3.cmml">y</mi><mi id="S2.E1.m1.4.4.4.2.1.1.1.2.3.3" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.3.cmml">j</mi></msubsup></mrow><mo id="S2.E1.m1.4.4.4.2.1.1.1.1" xref="S2.E1.m1.4.4.4.2.1.1.1.1.cmml">/</mo><mi id="S2.E1.m1.4.4.4.2.1.1.1.3" xref="S2.E1.m1.4.4.4.2.1.1.1.3.cmml">τ</mi></mrow><mo id="S2.E1.m1.4.4.4.2.1.1.3" stretchy="false" xref="S2.E1.m1.4.4.4.2.2.cmml">)</mo></mrow></mrow></mrow></mfrac></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m1.7b"><apply id="S2.E1.m1.7.7.cmml" xref="S2.E1.m1.7.7"><eq id="S2.E1.m1.7.7.3.cmml" xref="S2.E1.m1.7.7.3"></eq><apply id="S2.E1.m1.7.7.2.cmml" xref="S2.E1.m1.7.7.2"><times id="S2.E1.m1.7.7.2.3.cmml" xref="S2.E1.m1.7.7.2.3"></times><ci id="S2.E1.m1.7.7.2.4.cmml" xref="S2.E1.m1.7.7.2.4">𝑓</ci><vector id="S2.E1.m1.7.7.2.2.3.cmml" xref="S2.E1.m1.7.7.2.2.2"><apply id="S2.E1.m1.6.6.1.1.1.1.cmml" xref="S2.E1.m1.6.6.1.1.1.1"><csymbol 
cd="ambiguous" id="S2.E1.m1.6.6.1.1.1.1.1.cmml" xref="S2.E1.m1.6.6.1.1.1.1">superscript</csymbol><apply id="S2.E1.m1.6.6.1.1.1.1.2.cmml" xref="S2.E1.m1.6.6.1.1.1.1"><csymbol cd="ambiguous" id="S2.E1.m1.6.6.1.1.1.1.2.1.cmml" xref="S2.E1.m1.6.6.1.1.1.1">subscript</csymbol><ci id="S2.E1.m1.6.6.1.1.1.1.2.2.cmml" xref="S2.E1.m1.6.6.1.1.1.1.2.2">𝐸</ci><ci id="S2.E1.m1.6.6.1.1.1.1.2.3.cmml" xref="S2.E1.m1.6.6.1.1.1.1.2.3">𝑥</ci></apply><ci id="S2.E1.m1.6.6.1.1.1.1.3.cmml" xref="S2.E1.m1.6.6.1.1.1.1.3">𝑖</ci></apply><apply id="S2.E1.m1.7.7.2.2.2.2.cmml" xref="S2.E1.m1.7.7.2.2.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.7.7.2.2.2.2.1.cmml" xref="S2.E1.m1.7.7.2.2.2.2">superscript</csymbol><apply id="S2.E1.m1.7.7.2.2.2.2.2.cmml" xref="S2.E1.m1.7.7.2.2.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.7.7.2.2.2.2.2.1.cmml" xref="S2.E1.m1.7.7.2.2.2.2">subscript</csymbol><ci id="S2.E1.m1.7.7.2.2.2.2.2.2.cmml" xref="S2.E1.m1.7.7.2.2.2.2.2.2">𝐸</ci><ci id="S2.E1.m1.7.7.2.2.2.2.2.3.cmml" xref="S2.E1.m1.7.7.2.2.2.2.2.3">𝑦</ci></apply><ci id="S2.E1.m1.7.7.2.2.2.2.3.cmml" xref="S2.E1.m1.7.7.2.2.2.2.3">𝑖</ci></apply><ci id="S2.E1.m1.5.5.cmml" xref="S2.E1.m1.5.5">𝑀</ci></vector></apply><apply id="S2.E1.m1.7.7.4.cmml" xref="S2.E1.m1.7.7.4"><log id="S2.E1.m1.7.7.4.1.cmml" xref="S2.E1.m1.7.7.4.1"></log><apply id="S2.E1.m1.4.4.cmml" xref="S2.E1.m1.4.4"><divide id="S2.E1.m1.4.4.5.cmml" xref="S2.E1.m1.4.4"></divide><apply id="S2.E1.m1.2.2.2.3.cmml" xref="S2.E1.m1.2.2.2.2"><exp id="S2.E1.m1.1.1.1.1.cmml" xref="S2.E1.m1.1.1.1.1"></exp><apply id="S2.E1.m1.2.2.2.2.1.1.cmml" xref="S2.E1.m1.2.2.2.2.1.1"><divide id="S2.E1.m1.2.2.2.2.1.1.1.cmml" xref="S2.E1.m1.2.2.2.2.1.1.1"></divide><apply id="S2.E1.m1.2.2.2.2.1.1.2.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2"><ci id="S2.E1.m1.2.2.2.2.1.1.2.1.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.1">⋅</ci><apply id="S2.E1.m1.2.2.2.2.1.1.2.2.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.2.2.2.2.1.1.2.2.1.cmml" 
xref="S2.E1.m1.2.2.2.2.1.1.2.2">superscript</csymbol><apply id="S2.E1.m1.2.2.2.2.1.1.2.2.2.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.2.2.2.2.1.1.2.2.2.1.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.2">subscript</csymbol><ci id="S2.E1.m1.2.2.2.2.1.1.2.2.2.2.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.2.2.2">𝐸</ci><ci id="S2.E1.m1.2.2.2.2.1.1.2.2.2.3.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.2.2.3">𝑥</ci></apply><ci id="S2.E1.m1.2.2.2.2.1.1.2.2.3.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.2.3">𝑖</ci></apply><apply id="S2.E1.m1.2.2.2.2.1.1.2.3.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3"><csymbol cd="ambiguous" id="S2.E1.m1.2.2.2.2.1.1.2.3.1.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3">superscript</csymbol><apply id="S2.E1.m1.2.2.2.2.1.1.2.3.2.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3"><csymbol cd="ambiguous" id="S2.E1.m1.2.2.2.2.1.1.2.3.2.1.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3">subscript</csymbol><ci id="S2.E1.m1.2.2.2.2.1.1.2.3.2.2.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3.2.2">𝐸</ci><ci id="S2.E1.m1.2.2.2.2.1.1.2.3.2.3.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3.2.3">𝑦</ci></apply><ci id="S2.E1.m1.2.2.2.2.1.1.2.3.3.cmml" xref="S2.E1.m1.2.2.2.2.1.1.2.3.3">𝑖</ci></apply></apply><ci id="S2.E1.m1.2.2.2.2.1.1.3.cmml" xref="S2.E1.m1.2.2.2.2.1.1.3">𝜏</ci></apply></apply><apply id="S2.E1.m1.4.4.4.cmml" xref="S2.E1.m1.4.4.4"><apply id="S2.E1.m1.4.4.4.3.cmml" xref="S2.E1.m1.4.4.4.3"><csymbol cd="ambiguous" id="S2.E1.m1.4.4.4.3.1.cmml" xref="S2.E1.m1.4.4.4.3">superscript</csymbol><apply id="S2.E1.m1.4.4.4.3.2.cmml" xref="S2.E1.m1.4.4.4.3"><csymbol cd="ambiguous" id="S2.E1.m1.4.4.4.3.2.1.cmml" xref="S2.E1.m1.4.4.4.3">subscript</csymbol><sum id="S2.E1.m1.4.4.4.3.2.2.cmml" xref="S2.E1.m1.4.4.4.3.2.2"></sum><apply id="S2.E1.m1.4.4.4.3.2.3.cmml" xref="S2.E1.m1.4.4.4.3.2.3"><eq id="S2.E1.m1.4.4.4.3.2.3.1.cmml" xref="S2.E1.m1.4.4.4.3.2.3.1"></eq><ci id="S2.E1.m1.4.4.4.3.2.3.2.cmml" xref="S2.E1.m1.4.4.4.3.2.3.2">𝑗</ci><cn id="S2.E1.m1.4.4.4.3.2.3.3.cmml" type="integer" 
xref="S2.E1.m1.4.4.4.3.2.3.3">1</cn></apply></apply><ci id="S2.E1.m1.4.4.4.3.3.cmml" xref="S2.E1.m1.4.4.4.3.3">𝑀</ci></apply><apply id="S2.E1.m1.4.4.4.2.2.cmml" xref="S2.E1.m1.4.4.4.2.1"><exp id="S2.E1.m1.3.3.3.1.cmml" xref="S2.E1.m1.3.3.3.1"></exp><apply id="S2.E1.m1.4.4.4.2.1.1.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1"><divide id="S2.E1.m1.4.4.4.2.1.1.1.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.1"></divide><apply id="S2.E1.m1.4.4.4.2.1.1.1.2.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2"><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.1">⋅</ci><apply id="S2.E1.m1.4.4.4.2.1.1.1.2.2.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.4.4.4.2.1.1.1.2.2.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2">superscript</csymbol><apply id="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2">subscript</csymbol><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.2.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.2">𝐸</ci><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.3.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.2.3">𝑥</ci></apply><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.2.3.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.2.3">𝑖</ci></apply><apply id="S2.E1.m1.4.4.4.2.1.1.1.2.3.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3"><csymbol cd="ambiguous" id="S2.E1.m1.4.4.4.2.1.1.1.2.3.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3">superscript</csymbol><apply id="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3"><csymbol cd="ambiguous" id="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.1.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3">subscript</csymbol><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.2.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.2">𝐸</ci><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.3.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.2.3">𝑦</ci></apply><ci id="S2.E1.m1.4.4.4.2.1.1.1.2.3.3.cmml" xref="S2.E1.m1.4.4.4.2.1.1.1.2.3.3">𝑗</ci></apply></apply><ci id="S2.E1.m1.4.4.4.2.1.1.1.3.cmml" 
xref="S2.E1.m1.4.4.4.2.1.1.1.3">𝜏</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1.m1.7c">f(E_{x}^{i},E_{y}^{i},M)=\log\frac{\exp(E_{x}^{i}\cdot E_{y}^{i}/\tau)}{\sum_{% j=1}^{M}\exp(E_{x}^{i}\cdot E_{y}^{j}/\tau)}</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m1.7d">italic_f ( italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_M ) = roman_log divide start_ARG roman_exp ( italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT / italic_τ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT roman_exp ( italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT / italic_τ ) end_ARG</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS1.p4.12">where <math alttext="E_{x}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p4.1.m1.1"><semantics id="S2.SS1.p4.1.m1.1a"><msubsup id="S2.SS1.p4.1.m1.1.1" xref="S2.SS1.p4.1.m1.1.1.cmml"><mi id="S2.SS1.p4.1.m1.1.1.2.2" xref="S2.SS1.p4.1.m1.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.1.m1.1.1.2.3" xref="S2.SS1.p4.1.m1.1.1.2.3.cmml">x</mi><mi id="S2.SS1.p4.1.m1.1.1.3" xref="S2.SS1.p4.1.m1.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" 
id="S2.SS1.p4.1.m1.1b"><apply id="S2.SS1.p4.1.m1.1.1.cmml" xref="S2.SS1.p4.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.1.m1.1.1.1.cmml" xref="S2.SS1.p4.1.m1.1.1">superscript</csymbol><apply id="S2.SS1.p4.1.m1.1.1.2.cmml" xref="S2.SS1.p4.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.1.m1.1.1.2.1.cmml" xref="S2.SS1.p4.1.m1.1.1">subscript</csymbol><ci id="S2.SS1.p4.1.m1.1.1.2.2.cmml" xref="S2.SS1.p4.1.m1.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.1.m1.1.1.2.3.cmml" xref="S2.SS1.p4.1.m1.1.1.2.3">𝑥</ci></apply><ci id="S2.SS1.p4.1.m1.1.1.3.cmml" xref="S2.SS1.p4.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.1.m1.1c">E_{x}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.1.m1.1d">italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> and <math alttext="E_{y}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p4.2.m2.1"><semantics id="S2.SS1.p4.2.m2.1a"><msubsup id="S2.SS1.p4.2.m2.1.1" xref="S2.SS1.p4.2.m2.1.1.cmml"><mi id="S2.SS1.p4.2.m2.1.1.2.2" xref="S2.SS1.p4.2.m2.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.2.m2.1.1.2.3" xref="S2.SS1.p4.2.m2.1.1.2.3.cmml">y</mi><mi id="S2.SS1.p4.2.m2.1.1.3" xref="S2.SS1.p4.2.m2.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.2.m2.1b"><apply id="S2.SS1.p4.2.m2.1.1.cmml" xref="S2.SS1.p4.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.2.m2.1.1.1.cmml" xref="S2.SS1.p4.2.m2.1.1">superscript</csymbol><apply id="S2.SS1.p4.2.m2.1.1.2.cmml" xref="S2.SS1.p4.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.2.m2.1.1.2.1.cmml" xref="S2.SS1.p4.2.m2.1.1">subscript</csymbol><ci id="S2.SS1.p4.2.m2.1.1.2.2.cmml" xref="S2.SS1.p4.2.m2.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.2.m2.1.1.2.3.cmml" xref="S2.SS1.p4.2.m2.1.1.2.3">𝑦</ci></apply><ci id="S2.SS1.p4.2.m2.1.1.3.cmml" xref="S2.SS1.p4.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S2.SS1.p4.2.m2.1c">E_{y}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.2.m2.1d">italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> are feature vectors from different modalities <math alttext="x" class="ltx_Math" display="inline" id="S2.SS1.p4.3.m3.1"><semantics id="S2.SS1.p4.3.m3.1a"><mi id="S2.SS1.p4.3.m3.1.1" xref="S2.SS1.p4.3.m3.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.3.m3.1b"><ci id="S2.SS1.p4.3.m3.1.1.cmml" xref="S2.SS1.p4.3.m3.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.3.m3.1c">x</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.3.m3.1d">italic_x</annotation></semantics></math> and <math alttext="y" class="ltx_Math" display="inline" id="S2.SS1.p4.4.m4.1"><semantics id="S2.SS1.p4.4.m4.1a"><mi id="S2.SS1.p4.4.m4.1.1" xref="S2.SS1.p4.4.m4.1.1.cmml">y</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.4.m4.1b"><ci id="S2.SS1.p4.4.m4.1.1.cmml" xref="S2.SS1.p4.4.m4.1.1">𝑦</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.4.m4.1c">y</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.4.m4.1d">italic_y</annotation></semantics></math>, <math alttext="M" class="ltx_Math" display="inline" id="S2.SS1.p4.5.m5.1"><semantics id="S2.SS1.p4.5.m5.1a"><mi id="S2.SS1.p4.5.m5.1.1" xref="S2.SS1.p4.5.m5.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.5.m5.1b"><ci id="S2.SS1.p4.5.m5.1.1.cmml" xref="S2.SS1.p4.5.m5.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.5.m5.1c">M</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.5.m5.1d">italic_M</annotation></semantics></math> is the number of cross-modal pairs, and <math alttext="\tau" class="ltx_Math" display="inline" id="S2.SS1.p4.6.m6.1"><semantics id="S2.SS1.p4.6.m6.1a"><mi 
id="S2.SS1.p4.6.m6.1.1" xref="S2.SS1.p4.6.m6.1.1.cmml">τ</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.6.m6.1b"><ci id="S2.SS1.p4.6.m6.1.1.cmml" xref="S2.SS1.p4.6.m6.1.1">𝜏</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.6.m6.1c">\tau</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.6.m6.1d">italic_τ</annotation></semantics></math> is the temperature, which controls the smoothness of the softmax. The numerator is the exponentiated similarity of the matched pair, while the denominator sums the exponentiated similarities between <math alttext="E_{x}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p4.7.m7.1"><semantics id="S2.SS1.p4.7.m7.1a"><msubsup id="S2.SS1.p4.7.m7.1.1" xref="S2.SS1.p4.7.m7.1.1.cmml"><mi id="S2.SS1.p4.7.m7.1.1.2.2" xref="S2.SS1.p4.7.m7.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.7.m7.1.1.2.3" xref="S2.SS1.p4.7.m7.1.1.2.3.cmml">x</mi><mi id="S2.SS1.p4.7.m7.1.1.3" xref="S2.SS1.p4.7.m7.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.7.m7.1b"><apply id="S2.SS1.p4.7.m7.1.1.cmml" xref="S2.SS1.p4.7.m7.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.7.m7.1.1.1.cmml" xref="S2.SS1.p4.7.m7.1.1">superscript</csymbol><apply id="S2.SS1.p4.7.m7.1.1.2.cmml" xref="S2.SS1.p4.7.m7.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.7.m7.1.1.2.1.cmml" xref="S2.SS1.p4.7.m7.1.1">subscript</csymbol><ci id="S2.SS1.p4.7.m7.1.1.2.2.cmml" xref="S2.SS1.p4.7.m7.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.7.m7.1.1.2.3.cmml" xref="S2.SS1.p4.7.m7.1.1.2.3">𝑥</ci></apply><ci id="S2.SS1.p4.7.m7.1.1.3.cmml" xref="S2.SS1.p4.7.m7.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.7.m7.1c">E_{x}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.7.m7.1d">italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> and every <math alttext="E_{y}^{j}" class="ltx_Math" display="inline" id="S2.SS1.p4.8.m8.1"><semantics 
id="S2.SS1.p4.8.m8.1a"><msubsup id="S2.SS1.p4.8.m8.1.1" xref="S2.SS1.p4.8.m8.1.1.cmml"><mi id="S2.SS1.p4.8.m8.1.1.2.2" xref="S2.SS1.p4.8.m8.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.8.m8.1.1.2.3" xref="S2.SS1.p4.8.m8.1.1.2.3.cmml">y</mi><mi id="S2.SS1.p4.8.m8.1.1.3" xref="S2.SS1.p4.8.m8.1.1.3.cmml">j</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.8.m8.1b"><apply id="S2.SS1.p4.8.m8.1.1.cmml" xref="S2.SS1.p4.8.m8.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.8.m8.1.1.1.cmml" xref="S2.SS1.p4.8.m8.1.1">superscript</csymbol><apply id="S2.SS1.p4.8.m8.1.1.2.cmml" xref="S2.SS1.p4.8.m8.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.8.m8.1.1.2.1.cmml" xref="S2.SS1.p4.8.m8.1.1">subscript</csymbol><ci id="S2.SS1.p4.8.m8.1.1.2.2.cmml" xref="S2.SS1.p4.8.m8.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.8.m8.1.1.2.3.cmml" xref="S2.SS1.p4.8.m8.1.1.2.3">𝑦</ci></apply><ci id="S2.SS1.p4.8.m8.1.1.3.cmml" xref="S2.SS1.p4.8.m8.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.8.m8.1c">E_{y}^{j}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.8.m8.1d">italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT</annotation></semantics></math>, including the matched one. 
This loss is not symmetric in its first two arguments: swapping <math alttext="E_{x}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p4.9.m9.1"><semantics id="S2.SS1.p4.9.m9.1a"><msubsup id="S2.SS1.p4.9.m9.1.1" xref="S2.SS1.p4.9.m9.1.1.cmml"><mi id="S2.SS1.p4.9.m9.1.1.2.2" xref="S2.SS1.p4.9.m9.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.9.m9.1.1.2.3" xref="S2.SS1.p4.9.m9.1.1.2.3.cmml">x</mi><mi id="S2.SS1.p4.9.m9.1.1.3" xref="S2.SS1.p4.9.m9.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.9.m9.1b"><apply id="S2.SS1.p4.9.m9.1.1.cmml" xref="S2.SS1.p4.9.m9.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.9.m9.1.1.1.cmml" xref="S2.SS1.p4.9.m9.1.1">superscript</csymbol><apply id="S2.SS1.p4.9.m9.1.1.2.cmml" xref="S2.SS1.p4.9.m9.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.9.m9.1.1.2.1.cmml" xref="S2.SS1.p4.9.m9.1.1">subscript</csymbol><ci id="S2.SS1.p4.9.m9.1.1.2.2.cmml" xref="S2.SS1.p4.9.m9.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.9.m9.1.1.2.3.cmml" xref="S2.SS1.p4.9.m9.1.1.2.3">𝑥</ci></apply><ci id="S2.SS1.p4.9.m9.1.1.3.cmml" xref="S2.SS1.p4.9.m9.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.9.m9.1c">E_{x}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.9.m9.1d">italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> and <math alttext="E_{y}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p4.10.m10.1"><semantics id="S2.SS1.p4.10.m10.1a"><msubsup id="S2.SS1.p4.10.m10.1.1" xref="S2.SS1.p4.10.m10.1.1.cmml"><mi id="S2.SS1.p4.10.m10.1.1.2.2" xref="S2.SS1.p4.10.m10.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.10.m10.1.1.2.3" xref="S2.SS1.p4.10.m10.1.1.2.3.cmml">y</mi><mi id="S2.SS1.p4.10.m10.1.1.3" xref="S2.SS1.p4.10.m10.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.10.m10.1b"><apply id="S2.SS1.p4.10.m10.1.1.cmml" xref="S2.SS1.p4.10.m10.1.1"><csymbol cd="ambiguous" 
id="S2.SS1.p4.10.m10.1.1.1.cmml" xref="S2.SS1.p4.10.m10.1.1">superscript</csymbol><apply id="S2.SS1.p4.10.m10.1.1.2.cmml" xref="S2.SS1.p4.10.m10.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.10.m10.1.1.2.1.cmml" xref="S2.SS1.p4.10.m10.1.1">subscript</csymbol><ci id="S2.SS1.p4.10.m10.1.1.2.2.cmml" xref="S2.SS1.p4.10.m10.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.10.m10.1.1.2.3.cmml" xref="S2.SS1.p4.10.m10.1.1.2.3">𝑦</ci></apply><ci id="S2.SS1.p4.10.m10.1.1.3.cmml" xref="S2.SS1.p4.10.m10.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.10.m10.1c">E_{y}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.10.m10.1d">italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> changes the value, because <math alttext="E_{x}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p4.11.m11.1"><semantics id="S2.SS1.p4.11.m11.1a"><msubsup id="S2.SS1.p4.11.m11.1.1" xref="S2.SS1.p4.11.m11.1.1.cmml"><mi id="S2.SS1.p4.11.m11.1.1.2.2" xref="S2.SS1.p4.11.m11.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.11.m11.1.1.2.3" xref="S2.SS1.p4.11.m11.1.1.2.3.cmml">x</mi><mi id="S2.SS1.p4.11.m11.1.1.3" xref="S2.SS1.p4.11.m11.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.11.m11.1b"><apply id="S2.SS1.p4.11.m11.1.1.cmml" xref="S2.SS1.p4.11.m11.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.11.m11.1.1.1.cmml" xref="S2.SS1.p4.11.m11.1.1">superscript</csymbol><apply id="S2.SS1.p4.11.m11.1.1.2.cmml" xref="S2.SS1.p4.11.m11.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.11.m11.1.1.2.1.cmml" xref="S2.SS1.p4.11.m11.1.1">subscript</csymbol><ci id="S2.SS1.p4.11.m11.1.1.2.2.cmml" xref="S2.SS1.p4.11.m11.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.11.m11.1.1.2.3.cmml" xref="S2.SS1.p4.11.m11.1.1.2.3">𝑥</ci></apply><ci id="S2.SS1.p4.11.m11.1.1.3.cmml" xref="S2.SS1.p4.11.m11.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S2.SS1.p4.11.m11.1c">E_{x}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.11.m11.1d">italic_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> always serves as the anchor that is compared against every <math alttext="E_{y}^{j}" class="ltx_Math" display="inline" id="S2.SS1.p4.12.m12.1"><semantics id="S2.SS1.p4.12.m12.1a"><msubsup id="S2.SS1.p4.12.m12.1.1" xref="S2.SS1.p4.12.m12.1.1.cmml"><mi id="S2.SS1.p4.12.m12.1.1.2.2" xref="S2.SS1.p4.12.m12.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p4.12.m12.1.1.2.3" xref="S2.SS1.p4.12.m12.1.1.2.3.cmml">y</mi><mi id="S2.SS1.p4.12.m12.1.1.3" xref="S2.SS1.p4.12.m12.1.1.3.cmml">j</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p4.12.m12.1b"><apply id="S2.SS1.p4.12.m12.1.1.cmml" xref="S2.SS1.p4.12.m12.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.12.m12.1.1.1.cmml" xref="S2.SS1.p4.12.m12.1.1">superscript</csymbol><apply id="S2.SS1.p4.12.m12.1.1.2.cmml" xref="S2.SS1.p4.12.m12.1.1"><csymbol cd="ambiguous" id="S2.SS1.p4.12.m12.1.1.2.1.cmml" xref="S2.SS1.p4.12.m12.1.1">subscript</csymbol><ci id="S2.SS1.p4.12.m12.1.1.2.2.cmml" xref="S2.SS1.p4.12.m12.1.1.2.2">𝐸</ci><ci id="S2.SS1.p4.12.m12.1.1.2.3.cmml" xref="S2.SS1.p4.12.m12.1.1.2.3">𝑦</ci></apply><ci id="S2.SS1.p4.12.m12.1.1.3.cmml" xref="S2.SS1.p4.12.m12.1.1.3">𝑗</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p4.12.m12.1c">E_{y}^{j}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p4.12.m12.1d">italic_E start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT</annotation></semantics></math>. 
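</p>
<p class="ltx_p">The dependence of Eq. (1) on argument order can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function name <code>contrastive_term</code>, the temperature value 0.5, and the toy two-dimensional embeddings are all assumptions made for illustration.</p>
<pre class="ltx_verbatim">
```python
import math


def contrastive_term(anchors, targets, i, tau=0.5):
    """Eq. (1) for pair i: log-softmax score of targets[i] given anchors[i].

    tau=0.5 is an illustrative value, not one taken from the paper.
    """
    anchor = anchors[i]
    # Temperature-scaled dot products between the anchor and every target.
    logits = [sum(a * t for a, t in zip(anchor, target)) / tau
              for target in targets]
    # Log of the denominator, computed with the usual max shift for stability.
    shift = max(logits)
    log_denominator = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return logits[i] - log_denominator


# Hypothetical unit-norm embeddings for M = 3 pairs from two modalities.
E_x = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
E_y = [[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]]

forward = contrastive_term(E_x, E_y, 0)   # E_x^0 is the anchor
backward = contrastive_term(E_y, E_x, 0)  # E_y^0 is the anchor
print(forward, backward)  # the two values differ
```
</pre>
<p class="ltx_p">Both calls share the same numerator, since the matched pair has the same dot product in either direction; only the denominators differ, because each anchor is compared against a different set of candidates. As written, f is a log-probability and is therefore at most zero. Whether training minimizes its negative is not stated in this passage, so the sketch returns f itself.</p>
<p class="ltx_p">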
This asymmetry is useful when one modality, such as text, provides stronger semantic guidance for aligning features from another modality, such as video or audio.</p> </div> <div class="ltx_para" id="S2.SS1.p5"> <p class="ltx_p" id="S2.SS1.p5.1">For the feature triplet <math alttext="(E_{v},E_{a},E_{l})" class="ltx_Math" display="inline" id="S2.SS1.p5.1.m1.3"><semantics id="S2.SS1.p5.1.m1.3a"><mrow id="S2.SS1.p5.1.m1.3.3.3" xref="S2.SS1.p5.1.m1.3.3.4.cmml"><mo id="S2.SS1.p5.1.m1.3.3.3.4" stretchy="false" xref="S2.SS1.p5.1.m1.3.3.4.cmml">(</mo><msub id="S2.SS1.p5.1.m1.1.1.1.1" xref="S2.SS1.p5.1.m1.1.1.1.1.cmml"><mi id="S2.SS1.p5.1.m1.1.1.1.1.2" xref="S2.SS1.p5.1.m1.1.1.1.1.2.cmml">E</mi><mi id="S2.SS1.p5.1.m1.1.1.1.1.3" xref="S2.SS1.p5.1.m1.1.1.1.1.3.cmml">v</mi></msub><mo id="S2.SS1.p5.1.m1.3.3.3.5" xref="S2.SS1.p5.1.m1.3.3.4.cmml">,</mo><msub id="S2.SS1.p5.1.m1.2.2.2.2" xref="S2.SS1.p5.1.m1.2.2.2.2.cmml"><mi id="S2.SS1.p5.1.m1.2.2.2.2.2" xref="S2.SS1.p5.1.m1.2.2.2.2.2.cmml">E</mi><mi id="S2.SS1.p5.1.m1.2.2.2.2.3" xref="S2.SS1.p5.1.m1.2.2.2.2.3.cmml">a</mi></msub><mo id="S2.SS1.p5.1.m1.3.3.3.6" xref="S2.SS1.p5.1.m1.3.3.4.cmml">,</mo><msub id="S2.SS1.p5.1.m1.3.3.3.3" xref="S2.SS1.p5.1.m1.3.3.3.3.cmml"><mi id="S2.SS1.p5.1.m1.3.3.3.3.2" xref="S2.SS1.p5.1.m1.3.3.3.3.2.cmml">E</mi><mi id="S2.SS1.p5.1.m1.3.3.3.3.3" xref="S2.SS1.p5.1.m1.3.3.3.3.3.cmml">l</mi></msub><mo id="S2.SS1.p5.1.m1.3.3.3.7" stretchy="false" xref="S2.SS1.p5.1.m1.3.3.4.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.1.m1.3b"><vector id="S2.SS1.p5.1.m1.3.3.4.cmml" xref="S2.SS1.p5.1.m1.3.3.3"><apply id="S2.SS1.p5.1.m1.1.1.1.1.cmml" xref="S2.SS1.p5.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.1.m1.1.1.1.1.1.cmml" xref="S2.SS1.p5.1.m1.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p5.1.m1.1.1.1.1.2.cmml" xref="S2.SS1.p5.1.m1.1.1.1.1.2">𝐸</ci><ci id="S2.SS1.p5.1.m1.1.1.1.1.3.cmml" xref="S2.SS1.p5.1.m1.1.1.1.1.3">𝑣</ci></apply><apply id="S2.SS1.p5.1.m1.2.2.2.2.cmml" 
xref="S2.SS1.p5.1.m1.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p5.1.m1.2.2.2.2.1.cmml" xref="S2.SS1.p5.1.m1.2.2.2.2">subscript</csymbol><ci id="S2.SS1.p5.1.m1.2.2.2.2.2.cmml" xref="S2.SS1.p5.1.m1.2.2.2.2.2">𝐸</ci><ci id="S2.SS1.p5.1.m1.2.2.2.2.3.cmml" xref="S2.SS1.p5.1.m1.2.2.2.2.3">𝑎</ci></apply><apply id="S2.SS1.p5.1.m1.3.3.3.3.cmml" xref="S2.SS1.p5.1.m1.3.3.3.3"><csymbol cd="ambiguous" id="S2.SS1.p5.1.m1.3.3.3.3.1.cmml" xref="S2.SS1.p5.1.m1.3.3.3.3">subscript</csymbol><ci id="S2.SS1.p5.1.m1.3.3.3.3.2.cmml" xref="S2.SS1.p5.1.m1.3.3.3.3.2">𝐸</ci><ci id="S2.SS1.p5.1.m1.3.3.3.3.3.cmml" xref="S2.SS1.p5.1.m1.3.3.3.3.3">𝑙</ci></apply></vector></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.1.m1.3c">(E_{v},E_{a},E_{l})</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.1.m1.3d">( italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT )</annotation></semantics></math>, we define the loss function centered on audio as follows:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E2"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{AL}^{i}=f(E_{a}^{i},E_{l}^{i},N)+f(E_{l}^{i},E_{a}^{i},N)" class="ltx_Math" display="block" id="S2.E2.m1.6"><semantics id="S2.E2.m1.6a"><mrow id="S2.E2.m1.6.6" xref="S2.E2.m1.6.6.cmml"><msubsup id="S2.E2.m1.6.6.6" xref="S2.E2.m1.6.6.6.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E2.m1.6.6.6.2.2" xref="S2.E2.m1.6.6.6.2.2.cmml">ℒ</mi><mrow id="S2.E2.m1.6.6.6.2.3" xref="S2.E2.m1.6.6.6.2.3.cmml"><mi id="S2.E2.m1.6.6.6.2.3.2" xref="S2.E2.m1.6.6.6.2.3.2.cmml">A</mi><mo id="S2.E2.m1.6.6.6.2.3.1" xref="S2.E2.m1.6.6.6.2.3.1.cmml">⁢</mo><mi id="S2.E2.m1.6.6.6.2.3.3" xref="S2.E2.m1.6.6.6.2.3.3.cmml">L</mi></mrow><mi id="S2.E2.m1.6.6.6.3" 
xref="S2.E2.m1.6.6.6.3.cmml">i</mi></msubsup><mo id="S2.E2.m1.6.6.5" xref="S2.E2.m1.6.6.5.cmml">=</mo><mrow id="S2.E2.m1.6.6.4" xref="S2.E2.m1.6.6.4.cmml"><mrow id="S2.E2.m1.4.4.2.2" xref="S2.E2.m1.4.4.2.2.cmml"><mi id="S2.E2.m1.4.4.2.2.4" xref="S2.E2.m1.4.4.2.2.4.cmml">f</mi><mo id="S2.E2.m1.4.4.2.2.3" xref="S2.E2.m1.4.4.2.2.3.cmml">⁢</mo><mrow id="S2.E2.m1.4.4.2.2.2.2" xref="S2.E2.m1.4.4.2.2.2.3.cmml"><mo id="S2.E2.m1.4.4.2.2.2.2.3" stretchy="false" xref="S2.E2.m1.4.4.2.2.2.3.cmml">(</mo><msubsup id="S2.E2.m1.3.3.1.1.1.1.1" xref="S2.E2.m1.3.3.1.1.1.1.1.cmml"><mi id="S2.E2.m1.3.3.1.1.1.1.1.2.2" xref="S2.E2.m1.3.3.1.1.1.1.1.2.2.cmml">E</mi><mi id="S2.E2.m1.3.3.1.1.1.1.1.2.3" xref="S2.E2.m1.3.3.1.1.1.1.1.2.3.cmml">a</mi><mi id="S2.E2.m1.3.3.1.1.1.1.1.3" xref="S2.E2.m1.3.3.1.1.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E2.m1.4.4.2.2.2.2.4" xref="S2.E2.m1.4.4.2.2.2.3.cmml">,</mo><msubsup id="S2.E2.m1.4.4.2.2.2.2.2" xref="S2.E2.m1.4.4.2.2.2.2.2.cmml"><mi id="S2.E2.m1.4.4.2.2.2.2.2.2.2" xref="S2.E2.m1.4.4.2.2.2.2.2.2.2.cmml">E</mi><mi id="S2.E2.m1.4.4.2.2.2.2.2.2.3" xref="S2.E2.m1.4.4.2.2.2.2.2.2.3.cmml">l</mi><mi id="S2.E2.m1.4.4.2.2.2.2.2.3" xref="S2.E2.m1.4.4.2.2.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E2.m1.4.4.2.2.2.2.5" xref="S2.E2.m1.4.4.2.2.2.3.cmml">,</mo><mi id="S2.E2.m1.1.1" xref="S2.E2.m1.1.1.cmml">N</mi><mo id="S2.E2.m1.4.4.2.2.2.2.6" stretchy="false" xref="S2.E2.m1.4.4.2.2.2.3.cmml">)</mo></mrow></mrow><mo id="S2.E2.m1.6.6.4.5" xref="S2.E2.m1.6.6.4.5.cmml">+</mo><mrow id="S2.E2.m1.6.6.4.4" xref="S2.E2.m1.6.6.4.4.cmml"><mi id="S2.E2.m1.6.6.4.4.4" xref="S2.E2.m1.6.6.4.4.4.cmml">f</mi><mo id="S2.E2.m1.6.6.4.4.3" xref="S2.E2.m1.6.6.4.4.3.cmml">⁢</mo><mrow id="S2.E2.m1.6.6.4.4.2.2" xref="S2.E2.m1.6.6.4.4.2.3.cmml"><mo id="S2.E2.m1.6.6.4.4.2.2.3" stretchy="false" xref="S2.E2.m1.6.6.4.4.2.3.cmml">(</mo><msubsup id="S2.E2.m1.5.5.3.3.1.1.1" xref="S2.E2.m1.5.5.3.3.1.1.1.cmml"><mi id="S2.E2.m1.5.5.3.3.1.1.1.2.2" xref="S2.E2.m1.5.5.3.3.1.1.1.2.2.cmml">E</mi><mi 
id="S2.E2.m1.5.5.3.3.1.1.1.2.3" xref="S2.E2.m1.5.5.3.3.1.1.1.2.3.cmml">l</mi><mi id="S2.E2.m1.5.5.3.3.1.1.1.3" xref="S2.E2.m1.5.5.3.3.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E2.m1.6.6.4.4.2.2.4" xref="S2.E2.m1.6.6.4.4.2.3.cmml">,</mo><msubsup id="S2.E2.m1.6.6.4.4.2.2.2" xref="S2.E2.m1.6.6.4.4.2.2.2.cmml"><mi id="S2.E2.m1.6.6.4.4.2.2.2.2.2" xref="S2.E2.m1.6.6.4.4.2.2.2.2.2.cmml">E</mi><mi id="S2.E2.m1.6.6.4.4.2.2.2.2.3" xref="S2.E2.m1.6.6.4.4.2.2.2.2.3.cmml">a</mi><mi id="S2.E2.m1.6.6.4.4.2.2.2.3" xref="S2.E2.m1.6.6.4.4.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E2.m1.6.6.4.4.2.2.5" xref="S2.E2.m1.6.6.4.4.2.3.cmml">,</mo><mi id="S2.E2.m1.2.2" xref="S2.E2.m1.2.2.cmml">N</mi><mo id="S2.E2.m1.6.6.4.4.2.2.6" stretchy="false" xref="S2.E2.m1.6.6.4.4.2.3.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E2.m1.6b"><apply id="S2.E2.m1.6.6.cmml" xref="S2.E2.m1.6.6"><eq id="S2.E2.m1.6.6.5.cmml" xref="S2.E2.m1.6.6.5"></eq><apply id="S2.E2.m1.6.6.6.cmml" xref="S2.E2.m1.6.6.6"><csymbol cd="ambiguous" id="S2.E2.m1.6.6.6.1.cmml" xref="S2.E2.m1.6.6.6">superscript</csymbol><apply id="S2.E2.m1.6.6.6.2.cmml" xref="S2.E2.m1.6.6.6"><csymbol cd="ambiguous" id="S2.E2.m1.6.6.6.2.1.cmml" xref="S2.E2.m1.6.6.6">subscript</csymbol><ci id="S2.E2.m1.6.6.6.2.2.cmml" xref="S2.E2.m1.6.6.6.2.2">ℒ</ci><apply id="S2.E2.m1.6.6.6.2.3.cmml" xref="S2.E2.m1.6.6.6.2.3"><times id="S2.E2.m1.6.6.6.2.3.1.cmml" xref="S2.E2.m1.6.6.6.2.3.1"></times><ci id="S2.E2.m1.6.6.6.2.3.2.cmml" xref="S2.E2.m1.6.6.6.2.3.2">𝐴</ci><ci id="S2.E2.m1.6.6.6.2.3.3.cmml" xref="S2.E2.m1.6.6.6.2.3.3">𝐿</ci></apply></apply><ci id="S2.E2.m1.6.6.6.3.cmml" xref="S2.E2.m1.6.6.6.3">𝑖</ci></apply><apply id="S2.E2.m1.6.6.4.cmml" xref="S2.E2.m1.6.6.4"><plus id="S2.E2.m1.6.6.4.5.cmml" xref="S2.E2.m1.6.6.4.5"></plus><apply id="S2.E2.m1.4.4.2.2.cmml" xref="S2.E2.m1.4.4.2.2"><times id="S2.E2.m1.4.4.2.2.3.cmml" xref="S2.E2.m1.4.4.2.2.3"></times><ci id="S2.E2.m1.4.4.2.2.4.cmml" 
xref="S2.E2.m1.4.4.2.2.4">𝑓</ci><vector id="S2.E2.m1.4.4.2.2.2.3.cmml" xref="S2.E2.m1.4.4.2.2.2.2"><apply id="S2.E2.m1.3.3.1.1.1.1.1.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E2.m1.3.3.1.1.1.1.1.1.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1">superscript</csymbol><apply id="S2.E2.m1.3.3.1.1.1.1.1.2.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E2.m1.3.3.1.1.1.1.1.2.1.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1">subscript</csymbol><ci id="S2.E2.m1.3.3.1.1.1.1.1.2.2.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1.2.2">𝐸</ci><ci id="S2.E2.m1.3.3.1.1.1.1.1.2.3.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1.2.3">𝑎</ci></apply><ci id="S2.E2.m1.3.3.1.1.1.1.1.3.cmml" xref="S2.E2.m1.3.3.1.1.1.1.1.3">𝑖</ci></apply><apply id="S2.E2.m1.4.4.2.2.2.2.2.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.4.4.2.2.2.2.2.1.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2">superscript</csymbol><apply id="S2.E2.m1.4.4.2.2.2.2.2.2.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.4.4.2.2.2.2.2.2.1.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2">subscript</csymbol><ci id="S2.E2.m1.4.4.2.2.2.2.2.2.2.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2.2.2">𝐸</ci><ci id="S2.E2.m1.4.4.2.2.2.2.2.2.3.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2.2.3">𝑙</ci></apply><ci id="S2.E2.m1.4.4.2.2.2.2.2.3.cmml" xref="S2.E2.m1.4.4.2.2.2.2.2.3">𝑖</ci></apply><ci id="S2.E2.m1.1.1.cmml" xref="S2.E2.m1.1.1">𝑁</ci></vector></apply><apply id="S2.E2.m1.6.6.4.4.cmml" xref="S2.E2.m1.6.6.4.4"><times id="S2.E2.m1.6.6.4.4.3.cmml" xref="S2.E2.m1.6.6.4.4.3"></times><ci id="S2.E2.m1.6.6.4.4.4.cmml" xref="S2.E2.m1.6.6.4.4.4">𝑓</ci><vector id="S2.E2.m1.6.6.4.4.2.3.cmml" xref="S2.E2.m1.6.6.4.4.2.2"><apply id="S2.E2.m1.5.5.3.3.1.1.1.cmml" xref="S2.E2.m1.5.5.3.3.1.1.1"><csymbol cd="ambiguous" id="S2.E2.m1.5.5.3.3.1.1.1.1.cmml" xref="S2.E2.m1.5.5.3.3.1.1.1">superscript</csymbol><apply id="S2.E2.m1.5.5.3.3.1.1.1.2.cmml" xref="S2.E2.m1.5.5.3.3.1.1.1"><csymbol cd="ambiguous" id="S2.E2.m1.5.5.3.3.1.1.1.2.1.cmml" 
xref="S2.E2.m1.5.5.3.3.1.1.1">subscript</csymbol><ci id="S2.E2.m1.5.5.3.3.1.1.1.2.2.cmml" xref="S2.E2.m1.5.5.3.3.1.1.1.2.2">𝐸</ci><ci id="S2.E2.m1.5.5.3.3.1.1.1.2.3.cmml" xref="S2.E2.m1.5.5.3.3.1.1.1.2.3">𝑙</ci></apply><ci id="S2.E2.m1.5.5.3.3.1.1.1.3.cmml" xref="S2.E2.m1.5.5.3.3.1.1.1.3">𝑖</ci></apply><apply id="S2.E2.m1.6.6.4.4.2.2.2.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.6.6.4.4.2.2.2.1.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2">superscript</csymbol><apply id="S2.E2.m1.6.6.4.4.2.2.2.2.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2"><csymbol cd="ambiguous" id="S2.E2.m1.6.6.4.4.2.2.2.2.1.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2">subscript</csymbol><ci id="S2.E2.m1.6.6.4.4.2.2.2.2.2.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2.2.2">𝐸</ci><ci id="S2.E2.m1.6.6.4.4.2.2.2.2.3.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2.2.3">𝑎</ci></apply><ci id="S2.E2.m1.6.6.4.4.2.2.2.3.cmml" xref="S2.E2.m1.6.6.4.4.2.2.2.3">𝑖</ci></apply><ci id="S2.E2.m1.2.2.cmml" xref="S2.E2.m1.2.2">𝑁</ci></vector></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2.m1.6c">\mathcal{L}_{AL}^{i}=f(E_{a}^{i},E_{l}^{i},N)+f(E_{l}^{i},E_{a}^{i},N)</annotation><annotation encoding="application/x-llamapun" id="S2.E2.m1.6d">caligraphic_L start_POSTSUBSCRIPT italic_A italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = italic_f ( italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_N ) + italic_f ( italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_N )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" 
rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td> </tr></tbody> </table> <table class="ltx_equation ltx_eqn_table" id="S2.E3"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{AVS}^{i}=f(E_{a}^{i},E_{v}^{i},N_{S})+f(E_{v}^{i},E_{a}^{i},N_{S})" class="ltx_Math" display="block" id="S2.E3.m1.6"><semantics id="S2.E3.m1.6a"><mrow id="S2.E3.m1.6.6" xref="S2.E3.m1.6.6.cmml"><msubsup id="S2.E3.m1.6.6.8" xref="S2.E3.m1.6.6.8.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E3.m1.6.6.8.2.2" xref="S2.E3.m1.6.6.8.2.2.cmml">ℒ</mi><mrow id="S2.E3.m1.6.6.8.2.3" xref="S2.E3.m1.6.6.8.2.3.cmml"><mi id="S2.E3.m1.6.6.8.2.3.2" xref="S2.E3.m1.6.6.8.2.3.2.cmml">A</mi><mo id="S2.E3.m1.6.6.8.2.3.1" xref="S2.E3.m1.6.6.8.2.3.1.cmml">⁢</mo><mi id="S2.E3.m1.6.6.8.2.3.3" xref="S2.E3.m1.6.6.8.2.3.3.cmml">V</mi><mo id="S2.E3.m1.6.6.8.2.3.1a" xref="S2.E3.m1.6.6.8.2.3.1.cmml">⁢</mo><mi id="S2.E3.m1.6.6.8.2.3.4" xref="S2.E3.m1.6.6.8.2.3.4.cmml">S</mi></mrow><mi id="S2.E3.m1.6.6.8.3" xref="S2.E3.m1.6.6.8.3.cmml">i</mi></msubsup><mo id="S2.E3.m1.6.6.7" xref="S2.E3.m1.6.6.7.cmml">=</mo><mrow id="S2.E3.m1.6.6.6" xref="S2.E3.m1.6.6.6.cmml"><mrow id="S2.E3.m1.3.3.3.3" xref="S2.E3.m1.3.3.3.3.cmml"><mi id="S2.E3.m1.3.3.3.3.5" xref="S2.E3.m1.3.3.3.3.5.cmml">f</mi><mo id="S2.E3.m1.3.3.3.3.4" xref="S2.E3.m1.3.3.3.3.4.cmml">⁢</mo><mrow id="S2.E3.m1.3.3.3.3.3.3" xref="S2.E3.m1.3.3.3.3.3.4.cmml"><mo id="S2.E3.m1.3.3.3.3.3.3.4" stretchy="false" xref="S2.E3.m1.3.3.3.3.3.4.cmml">(</mo><msubsup id="S2.E3.m1.1.1.1.1.1.1.1" xref="S2.E3.m1.1.1.1.1.1.1.1.cmml"><mi id="S2.E3.m1.1.1.1.1.1.1.1.2.2" xref="S2.E3.m1.1.1.1.1.1.1.1.2.2.cmml">E</mi><mi id="S2.E3.m1.1.1.1.1.1.1.1.2.3" xref="S2.E3.m1.1.1.1.1.1.1.1.2.3.cmml">a</mi><mi id="S2.E3.m1.1.1.1.1.1.1.1.3" xref="S2.E3.m1.1.1.1.1.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E3.m1.3.3.3.3.3.3.5" 
xref="S2.E3.m1.3.3.3.3.3.4.cmml">,</mo><msubsup id="S2.E3.m1.2.2.2.2.2.2.2" xref="S2.E3.m1.2.2.2.2.2.2.2.cmml"><mi id="S2.E3.m1.2.2.2.2.2.2.2.2.2" xref="S2.E3.m1.2.2.2.2.2.2.2.2.2.cmml">E</mi><mi id="S2.E3.m1.2.2.2.2.2.2.2.2.3" xref="S2.E3.m1.2.2.2.2.2.2.2.2.3.cmml">v</mi><mi id="S2.E3.m1.2.2.2.2.2.2.2.3" xref="S2.E3.m1.2.2.2.2.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E3.m1.3.3.3.3.3.3.6" xref="S2.E3.m1.3.3.3.3.3.4.cmml">,</mo><msub id="S2.E3.m1.3.3.3.3.3.3.3" xref="S2.E3.m1.3.3.3.3.3.3.3.cmml"><mi id="S2.E3.m1.3.3.3.3.3.3.3.2" xref="S2.E3.m1.3.3.3.3.3.3.3.2.cmml">N</mi><mi id="S2.E3.m1.3.3.3.3.3.3.3.3" xref="S2.E3.m1.3.3.3.3.3.3.3.3.cmml">S</mi></msub><mo id="S2.E3.m1.3.3.3.3.3.3.7" stretchy="false" xref="S2.E3.m1.3.3.3.3.3.4.cmml">)</mo></mrow></mrow><mo id="S2.E3.m1.6.6.6.7" xref="S2.E3.m1.6.6.6.7.cmml">+</mo><mrow id="S2.E3.m1.6.6.6.6" xref="S2.E3.m1.6.6.6.6.cmml"><mi id="S2.E3.m1.6.6.6.6.5" xref="S2.E3.m1.6.6.6.6.5.cmml">f</mi><mo id="S2.E3.m1.6.6.6.6.4" xref="S2.E3.m1.6.6.6.6.4.cmml">⁢</mo><mrow id="S2.E3.m1.6.6.6.6.3.3" xref="S2.E3.m1.6.6.6.6.3.4.cmml"><mo id="S2.E3.m1.6.6.6.6.3.3.4" stretchy="false" xref="S2.E3.m1.6.6.6.6.3.4.cmml">(</mo><msubsup id="S2.E3.m1.4.4.4.4.1.1.1" xref="S2.E3.m1.4.4.4.4.1.1.1.cmml"><mi id="S2.E3.m1.4.4.4.4.1.1.1.2.2" xref="S2.E3.m1.4.4.4.4.1.1.1.2.2.cmml">E</mi><mi id="S2.E3.m1.4.4.4.4.1.1.1.2.3" xref="S2.E3.m1.4.4.4.4.1.1.1.2.3.cmml">v</mi><mi id="S2.E3.m1.4.4.4.4.1.1.1.3" xref="S2.E3.m1.4.4.4.4.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E3.m1.6.6.6.6.3.3.5" xref="S2.E3.m1.6.6.6.6.3.4.cmml">,</mo><msubsup id="S2.E3.m1.5.5.5.5.2.2.2" xref="S2.E3.m1.5.5.5.5.2.2.2.cmml"><mi id="S2.E3.m1.5.5.5.5.2.2.2.2.2" xref="S2.E3.m1.5.5.5.5.2.2.2.2.2.cmml">E</mi><mi id="S2.E3.m1.5.5.5.5.2.2.2.2.3" xref="S2.E3.m1.5.5.5.5.2.2.2.2.3.cmml">a</mi><mi id="S2.E3.m1.5.5.5.5.2.2.2.3" xref="S2.E3.m1.5.5.5.5.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E3.m1.6.6.6.6.3.3.6" xref="S2.E3.m1.6.6.6.6.3.4.cmml">,</mo><msub id="S2.E3.m1.6.6.6.6.3.3.3" 
xref="S2.E3.m1.6.6.6.6.3.3.3.cmml"><mi id="S2.E3.m1.6.6.6.6.3.3.3.2" xref="S2.E3.m1.6.6.6.6.3.3.3.2.cmml">N</mi><mi id="S2.E3.m1.6.6.6.6.3.3.3.3" xref="S2.E3.m1.6.6.6.6.3.3.3.3.cmml">S</mi></msub><mo id="S2.E3.m1.6.6.6.6.3.3.7" stretchy="false" xref="S2.E3.m1.6.6.6.6.3.4.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E3.m1.6b"><apply id="S2.E3.m1.6.6.cmml" xref="S2.E3.m1.6.6"><eq id="S2.E3.m1.6.6.7.cmml" xref="S2.E3.m1.6.6.7"></eq><apply id="S2.E3.m1.6.6.8.cmml" xref="S2.E3.m1.6.6.8"><csymbol cd="ambiguous" id="S2.E3.m1.6.6.8.1.cmml" xref="S2.E3.m1.6.6.8">superscript</csymbol><apply id="S2.E3.m1.6.6.8.2.cmml" xref="S2.E3.m1.6.6.8"><csymbol cd="ambiguous" id="S2.E3.m1.6.6.8.2.1.cmml" xref="S2.E3.m1.6.6.8">subscript</csymbol><ci id="S2.E3.m1.6.6.8.2.2.cmml" xref="S2.E3.m1.6.6.8.2.2">ℒ</ci><apply id="S2.E3.m1.6.6.8.2.3.cmml" xref="S2.E3.m1.6.6.8.2.3"><times id="S2.E3.m1.6.6.8.2.3.1.cmml" xref="S2.E3.m1.6.6.8.2.3.1"></times><ci id="S2.E3.m1.6.6.8.2.3.2.cmml" xref="S2.E3.m1.6.6.8.2.3.2">𝐴</ci><ci id="S2.E3.m1.6.6.8.2.3.3.cmml" xref="S2.E3.m1.6.6.8.2.3.3">𝑉</ci><ci id="S2.E3.m1.6.6.8.2.3.4.cmml" xref="S2.E3.m1.6.6.8.2.3.4">𝑆</ci></apply></apply><ci id="S2.E3.m1.6.6.8.3.cmml" xref="S2.E3.m1.6.6.8.3">𝑖</ci></apply><apply id="S2.E3.m1.6.6.6.cmml" xref="S2.E3.m1.6.6.6"><plus id="S2.E3.m1.6.6.6.7.cmml" xref="S2.E3.m1.6.6.6.7"></plus><apply id="S2.E3.m1.3.3.3.3.cmml" xref="S2.E3.m1.3.3.3.3"><times id="S2.E3.m1.3.3.3.3.4.cmml" xref="S2.E3.m1.3.3.3.3.4"></times><ci id="S2.E3.m1.3.3.3.3.5.cmml" xref="S2.E3.m1.3.3.3.3.5">𝑓</ci><vector id="S2.E3.m1.3.3.3.3.3.4.cmml" xref="S2.E3.m1.3.3.3.3.3.3"><apply id="S2.E3.m1.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E3.m1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.1.1.1.1.1.1.1">superscript</csymbol><apply id="S2.E3.m1.1.1.1.1.1.1.1.2.cmml" xref="S2.E3.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E3.m1.1.1.1.1.1.1.1.2.1.cmml" 
xref="S2.E3.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E3.m1.1.1.1.1.1.1.1.2.2.cmml" xref="S2.E3.m1.1.1.1.1.1.1.1.2.2">𝐸</ci><ci id="S2.E3.m1.1.1.1.1.1.1.1.2.3.cmml" xref="S2.E3.m1.1.1.1.1.1.1.1.2.3">𝑎</ci></apply><ci id="S2.E3.m1.1.1.1.1.1.1.1.3.cmml" xref="S2.E3.m1.1.1.1.1.1.1.1.3">𝑖</ci></apply><apply id="S2.E3.m1.2.2.2.2.2.2.2.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E3.m1.2.2.2.2.2.2.2.1.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2">superscript</csymbol><apply id="S2.E3.m1.2.2.2.2.2.2.2.2.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E3.m1.2.2.2.2.2.2.2.2.1.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2">subscript</csymbol><ci id="S2.E3.m1.2.2.2.2.2.2.2.2.2.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2.2.2">𝐸</ci><ci id="S2.E3.m1.2.2.2.2.2.2.2.2.3.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2.2.3">𝑣</ci></apply><ci id="S2.E3.m1.2.2.2.2.2.2.2.3.cmml" xref="S2.E3.m1.2.2.2.2.2.2.2.3">𝑖</ci></apply><apply id="S2.E3.m1.3.3.3.3.3.3.3.cmml" xref="S2.E3.m1.3.3.3.3.3.3.3"><csymbol cd="ambiguous" id="S2.E3.m1.3.3.3.3.3.3.3.1.cmml" xref="S2.E3.m1.3.3.3.3.3.3.3">subscript</csymbol><ci id="S2.E3.m1.3.3.3.3.3.3.3.2.cmml" xref="S2.E3.m1.3.3.3.3.3.3.3.2">𝑁</ci><ci id="S2.E3.m1.3.3.3.3.3.3.3.3.cmml" xref="S2.E3.m1.3.3.3.3.3.3.3.3">𝑆</ci></apply></vector></apply><apply id="S2.E3.m1.6.6.6.6.cmml" xref="S2.E3.m1.6.6.6.6"><times id="S2.E3.m1.6.6.6.6.4.cmml" xref="S2.E3.m1.6.6.6.6.4"></times><ci id="S2.E3.m1.6.6.6.6.5.cmml" xref="S2.E3.m1.6.6.6.6.5">𝑓</ci><vector id="S2.E3.m1.6.6.6.6.3.4.cmml" xref="S2.E3.m1.6.6.6.6.3.3"><apply id="S2.E3.m1.4.4.4.4.1.1.1.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.4.4.1.1.1.1.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1">superscript</csymbol><apply id="S2.E3.m1.4.4.4.4.1.1.1.2.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.4.4.1.1.1.2.1.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1">subscript</csymbol><ci id="S2.E3.m1.4.4.4.4.1.1.1.2.2.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1.2.2">𝐸</ci><ci 
id="S2.E3.m1.4.4.4.4.1.1.1.2.3.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1.2.3">𝑣</ci></apply><ci id="S2.E3.m1.4.4.4.4.1.1.1.3.cmml" xref="S2.E3.m1.4.4.4.4.1.1.1.3">𝑖</ci></apply><apply id="S2.E3.m1.5.5.5.5.2.2.2.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2"><csymbol cd="ambiguous" id="S2.E3.m1.5.5.5.5.2.2.2.1.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2">superscript</csymbol><apply id="S2.E3.m1.5.5.5.5.2.2.2.2.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2"><csymbol cd="ambiguous" id="S2.E3.m1.5.5.5.5.2.2.2.2.1.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2">subscript</csymbol><ci id="S2.E3.m1.5.5.5.5.2.2.2.2.2.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2.2.2">𝐸</ci><ci id="S2.E3.m1.5.5.5.5.2.2.2.2.3.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2.2.3">𝑎</ci></apply><ci id="S2.E3.m1.5.5.5.5.2.2.2.3.cmml" xref="S2.E3.m1.5.5.5.5.2.2.2.3">𝑖</ci></apply><apply id="S2.E3.m1.6.6.6.6.3.3.3.cmml" xref="S2.E3.m1.6.6.6.6.3.3.3"><csymbol cd="ambiguous" id="S2.E3.m1.6.6.6.6.3.3.3.1.cmml" xref="S2.E3.m1.6.6.6.6.3.3.3">subscript</csymbol><ci id="S2.E3.m1.6.6.6.6.3.3.3.2.cmml" xref="S2.E3.m1.6.6.6.6.3.3.3.2">𝑁</ci><ci id="S2.E3.m1.6.6.6.6.3.3.3.3.cmml" xref="S2.E3.m1.6.6.6.6.3.3.3.3">𝑆</ci></apply></vector></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E3.m1.6c">\mathcal{L}_{AVS}^{i}=f(E_{a}^{i},E_{v}^{i},N_{S})+f(E_{v}^{i},E_{a}^{i},N_{S})</annotation><annotation encoding="application/x-llamapun" id="S2.E3.m1.6d">caligraphic_L start_POSTSUBSCRIPT italic_A italic_V italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = italic_f ( italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_N start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) + italic_f ( italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT 
start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_N start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td> </tr></tbody> </table> <table class="ltx_equation ltx_eqn_table" id="S2.E4"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{AVT}^{i}=f(E_{a}^{i},E_{v}^{i},N_{T})+f(E_{v}^{i},E_{a}^{i},N_{T})" class="ltx_Math" display="block" id="S2.E4.m1.6"><semantics id="S2.E4.m1.6a"><mrow id="S2.E4.m1.6.6" xref="S2.E4.m1.6.6.cmml"><msubsup id="S2.E4.m1.6.6.8" xref="S2.E4.m1.6.6.8.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E4.m1.6.6.8.2.2" xref="S2.E4.m1.6.6.8.2.2.cmml">ℒ</mi><mrow id="S2.E4.m1.6.6.8.2.3" xref="S2.E4.m1.6.6.8.2.3.cmml"><mi id="S2.E4.m1.6.6.8.2.3.2" xref="S2.E4.m1.6.6.8.2.3.2.cmml">A</mi><mo id="S2.E4.m1.6.6.8.2.3.1" xref="S2.E4.m1.6.6.8.2.3.1.cmml">⁢</mo><mi id="S2.E4.m1.6.6.8.2.3.3" xref="S2.E4.m1.6.6.8.2.3.3.cmml">V</mi><mo id="S2.E4.m1.6.6.8.2.3.1a" xref="S2.E4.m1.6.6.8.2.3.1.cmml">⁢</mo><mi id="S2.E4.m1.6.6.8.2.3.4" xref="S2.E4.m1.6.6.8.2.3.4.cmml">T</mi></mrow><mi id="S2.E4.m1.6.6.8.3" xref="S2.E4.m1.6.6.8.3.cmml">i</mi></msubsup><mo id="S2.E4.m1.6.6.7" xref="S2.E4.m1.6.6.7.cmml">=</mo><mrow id="S2.E4.m1.6.6.6" xref="S2.E4.m1.6.6.6.cmml"><mrow id="S2.E4.m1.3.3.3.3" xref="S2.E4.m1.3.3.3.3.cmml"><mi id="S2.E4.m1.3.3.3.3.5" xref="S2.E4.m1.3.3.3.3.5.cmml">f</mi><mo id="S2.E4.m1.3.3.3.3.4" xref="S2.E4.m1.3.3.3.3.4.cmml">⁢</mo><mrow id="S2.E4.m1.3.3.3.3.3.3" xref="S2.E4.m1.3.3.3.3.3.4.cmml"><mo id="S2.E4.m1.3.3.3.3.3.3.4" stretchy="false" xref="S2.E4.m1.3.3.3.3.3.4.cmml">(</mo><msubsup id="S2.E4.m1.1.1.1.1.1.1.1" xref="S2.E4.m1.1.1.1.1.1.1.1.cmml"><mi 
id="S2.E4.m1.1.1.1.1.1.1.1.2.2" xref="S2.E4.m1.1.1.1.1.1.1.1.2.2.cmml">E</mi><mi id="S2.E4.m1.1.1.1.1.1.1.1.2.3" xref="S2.E4.m1.1.1.1.1.1.1.1.2.3.cmml">a</mi><mi id="S2.E4.m1.1.1.1.1.1.1.1.3" xref="S2.E4.m1.1.1.1.1.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E4.m1.3.3.3.3.3.3.5" xref="S2.E4.m1.3.3.3.3.3.4.cmml">,</mo><msubsup id="S2.E4.m1.2.2.2.2.2.2.2" xref="S2.E4.m1.2.2.2.2.2.2.2.cmml"><mi id="S2.E4.m1.2.2.2.2.2.2.2.2.2" xref="S2.E4.m1.2.2.2.2.2.2.2.2.2.cmml">E</mi><mi id="S2.E4.m1.2.2.2.2.2.2.2.2.3" xref="S2.E4.m1.2.2.2.2.2.2.2.2.3.cmml">v</mi><mi id="S2.E4.m1.2.2.2.2.2.2.2.3" xref="S2.E4.m1.2.2.2.2.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E4.m1.3.3.3.3.3.3.6" xref="S2.E4.m1.3.3.3.3.3.4.cmml">,</mo><msub id="S2.E4.m1.3.3.3.3.3.3.3" xref="S2.E4.m1.3.3.3.3.3.3.3.cmml"><mi id="S2.E4.m1.3.3.3.3.3.3.3.2" xref="S2.E4.m1.3.3.3.3.3.3.3.2.cmml">N</mi><mi id="S2.E4.m1.3.3.3.3.3.3.3.3" xref="S2.E4.m1.3.3.3.3.3.3.3.3.cmml">T</mi></msub><mo id="S2.E4.m1.3.3.3.3.3.3.7" stretchy="false" xref="S2.E4.m1.3.3.3.3.3.4.cmml">)</mo></mrow></mrow><mo id="S2.E4.m1.6.6.6.7" xref="S2.E4.m1.6.6.6.7.cmml">+</mo><mrow id="S2.E4.m1.6.6.6.6" xref="S2.E4.m1.6.6.6.6.cmml"><mi id="S2.E4.m1.6.6.6.6.5" xref="S2.E4.m1.6.6.6.6.5.cmml">f</mi><mo id="S2.E4.m1.6.6.6.6.4" xref="S2.E4.m1.6.6.6.6.4.cmml">⁢</mo><mrow id="S2.E4.m1.6.6.6.6.3.3" xref="S2.E4.m1.6.6.6.6.3.4.cmml"><mo id="S2.E4.m1.6.6.6.6.3.3.4" stretchy="false" xref="S2.E4.m1.6.6.6.6.3.4.cmml">(</mo><msubsup id="S2.E4.m1.4.4.4.4.1.1.1" xref="S2.E4.m1.4.4.4.4.1.1.1.cmml"><mi id="S2.E4.m1.4.4.4.4.1.1.1.2.2" xref="S2.E4.m1.4.4.4.4.1.1.1.2.2.cmml">E</mi><mi id="S2.E4.m1.4.4.4.4.1.1.1.2.3" xref="S2.E4.m1.4.4.4.4.1.1.1.2.3.cmml">v</mi><mi id="S2.E4.m1.4.4.4.4.1.1.1.3" xref="S2.E4.m1.4.4.4.4.1.1.1.3.cmml">i</mi></msubsup><mo id="S2.E4.m1.6.6.6.6.3.3.5" xref="S2.E4.m1.6.6.6.6.3.4.cmml">,</mo><msubsup id="S2.E4.m1.5.5.5.5.2.2.2" xref="S2.E4.m1.5.5.5.5.2.2.2.cmml"><mi id="S2.E4.m1.5.5.5.5.2.2.2.2.2" xref="S2.E4.m1.5.5.5.5.2.2.2.2.2.cmml">E</mi><mi 
id="S2.E4.m1.5.5.5.5.2.2.2.2.3" xref="S2.E4.m1.5.5.5.5.2.2.2.2.3.cmml">a</mi><mi id="S2.E4.m1.5.5.5.5.2.2.2.3" xref="S2.E4.m1.5.5.5.5.2.2.2.3.cmml">i</mi></msubsup><mo id="S2.E4.m1.6.6.6.6.3.3.6" xref="S2.E4.m1.6.6.6.6.3.4.cmml">,</mo><msub id="S2.E4.m1.6.6.6.6.3.3.3" xref="S2.E4.m1.6.6.6.6.3.3.3.cmml"><mi id="S2.E4.m1.6.6.6.6.3.3.3.2" xref="S2.E4.m1.6.6.6.6.3.3.3.2.cmml">N</mi><mi id="S2.E4.m1.6.6.6.6.3.3.3.3" xref="S2.E4.m1.6.6.6.6.3.3.3.3.cmml">T</mi></msub><mo id="S2.E4.m1.6.6.6.6.3.3.7" stretchy="false" xref="S2.E4.m1.6.6.6.6.3.4.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E4.m1.6b"><apply id="S2.E4.m1.6.6.cmml" xref="S2.E4.m1.6.6"><eq id="S2.E4.m1.6.6.7.cmml" xref="S2.E4.m1.6.6.7"></eq><apply id="S2.E4.m1.6.6.8.cmml" xref="S2.E4.m1.6.6.8"><csymbol cd="ambiguous" id="S2.E4.m1.6.6.8.1.cmml" xref="S2.E4.m1.6.6.8">superscript</csymbol><apply id="S2.E4.m1.6.6.8.2.cmml" xref="S2.E4.m1.6.6.8"><csymbol cd="ambiguous" id="S2.E4.m1.6.6.8.2.1.cmml" xref="S2.E4.m1.6.6.8">subscript</csymbol><ci id="S2.E4.m1.6.6.8.2.2.cmml" xref="S2.E4.m1.6.6.8.2.2">ℒ</ci><apply id="S2.E4.m1.6.6.8.2.3.cmml" xref="S2.E4.m1.6.6.8.2.3"><times id="S2.E4.m1.6.6.8.2.3.1.cmml" xref="S2.E4.m1.6.6.8.2.3.1"></times><ci id="S2.E4.m1.6.6.8.2.3.2.cmml" xref="S2.E4.m1.6.6.8.2.3.2">𝐴</ci><ci id="S2.E4.m1.6.6.8.2.3.3.cmml" xref="S2.E4.m1.6.6.8.2.3.3">𝑉</ci><ci id="S2.E4.m1.6.6.8.2.3.4.cmml" xref="S2.E4.m1.6.6.8.2.3.4">𝑇</ci></apply></apply><ci id="S2.E4.m1.6.6.8.3.cmml" xref="S2.E4.m1.6.6.8.3">𝑖</ci></apply><apply id="S2.E4.m1.6.6.6.cmml" xref="S2.E4.m1.6.6.6"><plus id="S2.E4.m1.6.6.6.7.cmml" xref="S2.E4.m1.6.6.6.7"></plus><apply id="S2.E4.m1.3.3.3.3.cmml" xref="S2.E4.m1.3.3.3.3"><times id="S2.E4.m1.3.3.3.3.4.cmml" xref="S2.E4.m1.3.3.3.3.4"></times><ci id="S2.E4.m1.3.3.3.3.5.cmml" xref="S2.E4.m1.3.3.3.3.5">𝑓</ci><vector id="S2.E4.m1.3.3.3.3.3.4.cmml" xref="S2.E4.m1.3.3.3.3.3.3"><apply id="S2.E4.m1.1.1.1.1.1.1.1.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1"><csymbol 
cd="ambiguous" id="S2.E4.m1.1.1.1.1.1.1.1.1.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1">superscript</csymbol><apply id="S2.E4.m1.1.1.1.1.1.1.1.2.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E4.m1.1.1.1.1.1.1.1.2.1.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E4.m1.1.1.1.1.1.1.1.2.2.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1.2.2">𝐸</ci><ci id="S2.E4.m1.1.1.1.1.1.1.1.2.3.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1.2.3">𝑎</ci></apply><ci id="S2.E4.m1.1.1.1.1.1.1.1.3.cmml" xref="S2.E4.m1.1.1.1.1.1.1.1.3">𝑖</ci></apply><apply id="S2.E4.m1.2.2.2.2.2.2.2.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E4.m1.2.2.2.2.2.2.2.1.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2">superscript</csymbol><apply id="S2.E4.m1.2.2.2.2.2.2.2.2.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E4.m1.2.2.2.2.2.2.2.2.1.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2">subscript</csymbol><ci id="S2.E4.m1.2.2.2.2.2.2.2.2.2.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2.2.2">𝐸</ci><ci id="S2.E4.m1.2.2.2.2.2.2.2.2.3.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2.2.3">𝑣</ci></apply><ci id="S2.E4.m1.2.2.2.2.2.2.2.3.cmml" xref="S2.E4.m1.2.2.2.2.2.2.2.3">𝑖</ci></apply><apply id="S2.E4.m1.3.3.3.3.3.3.3.cmml" xref="S2.E4.m1.3.3.3.3.3.3.3"><csymbol cd="ambiguous" id="S2.E4.m1.3.3.3.3.3.3.3.1.cmml" xref="S2.E4.m1.3.3.3.3.3.3.3">subscript</csymbol><ci id="S2.E4.m1.3.3.3.3.3.3.3.2.cmml" xref="S2.E4.m1.3.3.3.3.3.3.3.2">𝑁</ci><ci id="S2.E4.m1.3.3.3.3.3.3.3.3.cmml" xref="S2.E4.m1.3.3.3.3.3.3.3.3">𝑇</ci></apply></vector></apply><apply id="S2.E4.m1.6.6.6.6.cmml" xref="S2.E4.m1.6.6.6.6"><times id="S2.E4.m1.6.6.6.6.4.cmml" xref="S2.E4.m1.6.6.6.6.4"></times><ci id="S2.E4.m1.6.6.6.6.5.cmml" xref="S2.E4.m1.6.6.6.6.5">𝑓</ci><vector id="S2.E4.m1.6.6.6.6.3.4.cmml" xref="S2.E4.m1.6.6.6.6.3.3"><apply id="S2.E4.m1.4.4.4.4.1.1.1.cmml" xref="S2.E4.m1.4.4.4.4.1.1.1"><csymbol cd="ambiguous" id="S2.E4.m1.4.4.4.4.1.1.1.1.cmml" xref="S2.E4.m1.4.4.4.4.1.1.1">superscript</csymbol><apply id="S2.E4.m1.4.4.4.4.1.1.1.2.cmml" 
xref="S2.E4.m1.4.4.4.4.1.1.1"><csymbol cd="ambiguous" id="S2.E4.m1.4.4.4.4.1.1.1.2.1.cmml" xref="S2.E4.m1.4.4.4.4.1.1.1">subscript</csymbol><ci id="S2.E4.m1.4.4.4.4.1.1.1.2.2.cmml" xref="S2.E4.m1.4.4.4.4.1.1.1.2.2">𝐸</ci><ci id="S2.E4.m1.4.4.4.4.1.1.1.2.3.cmml" xref="S2.E4.m1.4.4.4.4.1.1.1.2.3">𝑣</ci></apply><ci id="S2.E4.m1.4.4.4.4.1.1.1.3.cmml" xref="S2.E4.m1.4.4.4.4.1.1.1.3">𝑖</ci></apply><apply id="S2.E4.m1.5.5.5.5.2.2.2.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2"><csymbol cd="ambiguous" id="S2.E4.m1.5.5.5.5.2.2.2.1.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2">superscript</csymbol><apply id="S2.E4.m1.5.5.5.5.2.2.2.2.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2"><csymbol cd="ambiguous" id="S2.E4.m1.5.5.5.5.2.2.2.2.1.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2">subscript</csymbol><ci id="S2.E4.m1.5.5.5.5.2.2.2.2.2.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2.2.2">𝐸</ci><ci id="S2.E4.m1.5.5.5.5.2.2.2.2.3.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2.2.3">𝑎</ci></apply><ci id="S2.E4.m1.5.5.5.5.2.2.2.3.cmml" xref="S2.E4.m1.5.5.5.5.2.2.2.3">𝑖</ci></apply><apply id="S2.E4.m1.6.6.6.6.3.3.3.cmml" xref="S2.E4.m1.6.6.6.6.3.3.3"><csymbol cd="ambiguous" id="S2.E4.m1.6.6.6.6.3.3.3.1.cmml" xref="S2.E4.m1.6.6.6.6.3.3.3">subscript</csymbol><ci id="S2.E4.m1.6.6.6.6.3.3.3.2.cmml" xref="S2.E4.m1.6.6.6.6.3.3.3.2">𝑁</ci><ci id="S2.E4.m1.6.6.6.6.3.3.3.3.cmml" xref="S2.E4.m1.6.6.6.6.3.3.3.3">𝑇</ci></apply></vector></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E4.m1.6c">\mathcal{L}_{AVT}^{i}=f(E_{a}^{i},E_{v}^{i},N_{T})+f(E_{v}^{i},E_{a}^{i},N_{T})</annotation><annotation encoding="application/x-llamapun" id="S2.E4.m1.6d">caligraphic_L start_POSTSUBSCRIPT italic_A italic_V italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = italic_f ( italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_N 
start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) + italic_f ( italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td> </tr></tbody> </table> <table class="ltx_equation ltx_eqn_table" id="S2.E5"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}=\frac{1}{2N}\sum_{i=1}^{N}\mathcal{L}_{AL}^{i}+\frac{\lambda}{2N_{% S}}\sum_{i=1}^{N_{S}}\mathcal{L}_{AVS}^{i}+\frac{\mu}{2N_{T}}\sum_{i=1}^{N_{T}% }\mathcal{L}_{AVT}^{i}" class="ltx_Math" display="block" id="S2.E5.m1.1"><semantics id="S2.E5.m1.1a"><mrow id="S2.E5.m1.1.1" xref="S2.E5.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E5.m1.1.1.2" xref="S2.E5.m1.1.1.2.cmml">ℒ</mi><mo id="S2.E5.m1.1.1.1" xref="S2.E5.m1.1.1.1.cmml">=</mo><mrow id="S2.E5.m1.1.1.3" xref="S2.E5.m1.1.1.3.cmml"><mrow id="S2.E5.m1.1.1.3.2" xref="S2.E5.m1.1.1.3.2.cmml"><mfrac id="S2.E5.m1.1.1.3.2.2" xref="S2.E5.m1.1.1.3.2.2.cmml"><mn id="S2.E5.m1.1.1.3.2.2.2" xref="S2.E5.m1.1.1.3.2.2.2.cmml">1</mn><mrow id="S2.E5.m1.1.1.3.2.2.3" xref="S2.E5.m1.1.1.3.2.2.3.cmml"><mn id="S2.E5.m1.1.1.3.2.2.3.2" xref="S2.E5.m1.1.1.3.2.2.3.2.cmml">2</mn><mo id="S2.E5.m1.1.1.3.2.2.3.1" xref="S2.E5.m1.1.1.3.2.2.3.1.cmml">⁢</mo><mi id="S2.E5.m1.1.1.3.2.2.3.3" xref="S2.E5.m1.1.1.3.2.2.3.3.cmml">N</mi></mrow></mfrac><mo id="S2.E5.m1.1.1.3.2.1" xref="S2.E5.m1.1.1.3.2.1.cmml">⁢</mo><mrow id="S2.E5.m1.1.1.3.2.3" xref="S2.E5.m1.1.1.3.2.3.cmml"><munderover id="S2.E5.m1.1.1.3.2.3.1" 
xref="S2.E5.m1.1.1.3.2.3.1.cmml"><mo id="S2.E5.m1.1.1.3.2.3.1.2.2" movablelimits="false" xref="S2.E5.m1.1.1.3.2.3.1.2.2.cmml">∑</mo><mrow id="S2.E5.m1.1.1.3.2.3.1.2.3" xref="S2.E5.m1.1.1.3.2.3.1.2.3.cmml"><mi id="S2.E5.m1.1.1.3.2.3.1.2.3.2" xref="S2.E5.m1.1.1.3.2.3.1.2.3.2.cmml">i</mi><mo id="S2.E5.m1.1.1.3.2.3.1.2.3.1" xref="S2.E5.m1.1.1.3.2.3.1.2.3.1.cmml">=</mo><mn id="S2.E5.m1.1.1.3.2.3.1.2.3.3" xref="S2.E5.m1.1.1.3.2.3.1.2.3.3.cmml">1</mn></mrow><mi id="S2.E5.m1.1.1.3.2.3.1.3" xref="S2.E5.m1.1.1.3.2.3.1.3.cmml">N</mi></munderover><msubsup id="S2.E5.m1.1.1.3.2.3.2" xref="S2.E5.m1.1.1.3.2.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E5.m1.1.1.3.2.3.2.2.2" xref="S2.E5.m1.1.1.3.2.3.2.2.2.cmml">ℒ</mi><mrow id="S2.E5.m1.1.1.3.2.3.2.2.3" xref="S2.E5.m1.1.1.3.2.3.2.2.3.cmml"><mi id="S2.E5.m1.1.1.3.2.3.2.2.3.2" xref="S2.E5.m1.1.1.3.2.3.2.2.3.2.cmml">A</mi><mo id="S2.E5.m1.1.1.3.2.3.2.2.3.1" xref="S2.E5.m1.1.1.3.2.3.2.2.3.1.cmml">⁢</mo><mi id="S2.E5.m1.1.1.3.2.3.2.2.3.3" xref="S2.E5.m1.1.1.3.2.3.2.2.3.3.cmml">L</mi></mrow><mi id="S2.E5.m1.1.1.3.2.3.2.3" xref="S2.E5.m1.1.1.3.2.3.2.3.cmml">i</mi></msubsup></mrow></mrow><mo id="S2.E5.m1.1.1.3.1" xref="S2.E5.m1.1.1.3.1.cmml">+</mo><mrow id="S2.E5.m1.1.1.3.3" xref="S2.E5.m1.1.1.3.3.cmml"><mfrac id="S2.E5.m1.1.1.3.3.2" xref="S2.E5.m1.1.1.3.3.2.cmml"><mi id="S2.E5.m1.1.1.3.3.2.2" xref="S2.E5.m1.1.1.3.3.2.2.cmml">λ</mi><mrow id="S2.E5.m1.1.1.3.3.2.3" xref="S2.E5.m1.1.1.3.3.2.3.cmml"><mn id="S2.E5.m1.1.1.3.3.2.3.2" xref="S2.E5.m1.1.1.3.3.2.3.2.cmml">2</mn><mo id="S2.E5.m1.1.1.3.3.2.3.1" xref="S2.E5.m1.1.1.3.3.2.3.1.cmml">⁢</mo><msub id="S2.E5.m1.1.1.3.3.2.3.3" xref="S2.E5.m1.1.1.3.3.2.3.3.cmml"><mi id="S2.E5.m1.1.1.3.3.2.3.3.2" xref="S2.E5.m1.1.1.3.3.2.3.3.2.cmml">N</mi><mi id="S2.E5.m1.1.1.3.3.2.3.3.3" xref="S2.E5.m1.1.1.3.3.2.3.3.3.cmml">S</mi></msub></mrow></mfrac><mo id="S2.E5.m1.1.1.3.3.1" xref="S2.E5.m1.1.1.3.3.1.cmml">⁢</mo><mrow id="S2.E5.m1.1.1.3.3.3" xref="S2.E5.m1.1.1.3.3.3.cmml"><munderover 
id="S2.E5.m1.1.1.3.3.3.1" xref="S2.E5.m1.1.1.3.3.3.1.cmml"><mo id="S2.E5.m1.1.1.3.3.3.1.2.2" movablelimits="false" xref="S2.E5.m1.1.1.3.3.3.1.2.2.cmml">∑</mo><mrow id="S2.E5.m1.1.1.3.3.3.1.2.3" xref="S2.E5.m1.1.1.3.3.3.1.2.3.cmml"><mi id="S2.E5.m1.1.1.3.3.3.1.2.3.2" xref="S2.E5.m1.1.1.3.3.3.1.2.3.2.cmml">i</mi><mo id="S2.E5.m1.1.1.3.3.3.1.2.3.1" xref="S2.E5.m1.1.1.3.3.3.1.2.3.1.cmml">=</mo><mn id="S2.E5.m1.1.1.3.3.3.1.2.3.3" xref="S2.E5.m1.1.1.3.3.3.1.2.3.3.cmml">1</mn></mrow><msub id="S2.E5.m1.1.1.3.3.3.1.3" xref="S2.E5.m1.1.1.3.3.3.1.3.cmml"><mi id="S2.E5.m1.1.1.3.3.3.1.3.2" xref="S2.E5.m1.1.1.3.3.3.1.3.2.cmml">N</mi><mi id="S2.E5.m1.1.1.3.3.3.1.3.3" xref="S2.E5.m1.1.1.3.3.3.1.3.3.cmml">S</mi></msub></munderover><msubsup id="S2.E5.m1.1.1.3.3.3.2" xref="S2.E5.m1.1.1.3.3.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E5.m1.1.1.3.3.3.2.2.2" xref="S2.E5.m1.1.1.3.3.3.2.2.2.cmml">ℒ</mi><mrow id="S2.E5.m1.1.1.3.3.3.2.2.3" xref="S2.E5.m1.1.1.3.3.3.2.2.3.cmml"><mi id="S2.E5.m1.1.1.3.3.3.2.2.3.2" xref="S2.E5.m1.1.1.3.3.3.2.2.3.2.cmml">A</mi><mo id="S2.E5.m1.1.1.3.3.3.2.2.3.1" xref="S2.E5.m1.1.1.3.3.3.2.2.3.1.cmml">⁢</mo><mi id="S2.E5.m1.1.1.3.3.3.2.2.3.3" xref="S2.E5.m1.1.1.3.3.3.2.2.3.3.cmml">V</mi><mo id="S2.E5.m1.1.1.3.3.3.2.2.3.1a" xref="S2.E5.m1.1.1.3.3.3.2.2.3.1.cmml">⁢</mo><mi id="S2.E5.m1.1.1.3.3.3.2.2.3.4" xref="S2.E5.m1.1.1.3.3.3.2.2.3.4.cmml">S</mi></mrow><mi id="S2.E5.m1.1.1.3.3.3.2.3" xref="S2.E5.m1.1.1.3.3.3.2.3.cmml">i</mi></msubsup></mrow></mrow><mo id="S2.E5.m1.1.1.3.1a" xref="S2.E5.m1.1.1.3.1.cmml">+</mo><mrow id="S2.E5.m1.1.1.3.4" xref="S2.E5.m1.1.1.3.4.cmml"><mfrac id="S2.E5.m1.1.1.3.4.2" xref="S2.E5.m1.1.1.3.4.2.cmml"><mi id="S2.E5.m1.1.1.3.4.2.2" xref="S2.E5.m1.1.1.3.4.2.2.cmml">μ</mi><mrow id="S2.E5.m1.1.1.3.4.2.3" xref="S2.E5.m1.1.1.3.4.2.3.cmml"><mn id="S2.E5.m1.1.1.3.4.2.3.2" xref="S2.E5.m1.1.1.3.4.2.3.2.cmml">2</mn><mo id="S2.E5.m1.1.1.3.4.2.3.1" xref="S2.E5.m1.1.1.3.4.2.3.1.cmml">⁢</mo><msub id="S2.E5.m1.1.1.3.4.2.3.3" 
xref="S2.E5.m1.1.1.3.4.2.3.3.cmml"><mi id="S2.E5.m1.1.1.3.4.2.3.3.2" xref="S2.E5.m1.1.1.3.4.2.3.3.2.cmml">N</mi><mi id="S2.E5.m1.1.1.3.4.2.3.3.3" xref="S2.E5.m1.1.1.3.4.2.3.3.3.cmml">T</mi></msub></mrow></mfrac><mo id="S2.E5.m1.1.1.3.4.1" xref="S2.E5.m1.1.1.3.4.1.cmml">⁢</mo><mrow id="S2.E5.m1.1.1.3.4.3" xref="S2.E5.m1.1.1.3.4.3.cmml"><munderover id="S2.E5.m1.1.1.3.4.3.1" xref="S2.E5.m1.1.1.3.4.3.1.cmml"><mo id="S2.E5.m1.1.1.3.4.3.1.2.2" movablelimits="false" xref="S2.E5.m1.1.1.3.4.3.1.2.2.cmml">∑</mo><mrow id="S2.E5.m1.1.1.3.4.3.1.2.3" xref="S2.E5.m1.1.1.3.4.3.1.2.3.cmml"><mi id="S2.E5.m1.1.1.3.4.3.1.2.3.2" xref="S2.E5.m1.1.1.3.4.3.1.2.3.2.cmml">i</mi><mo id="S2.E5.m1.1.1.3.4.3.1.2.3.1" xref="S2.E5.m1.1.1.3.4.3.1.2.3.1.cmml">=</mo><mn id="S2.E5.m1.1.1.3.4.3.1.2.3.3" xref="S2.E5.m1.1.1.3.4.3.1.2.3.3.cmml">1</mn></mrow><msub id="S2.E5.m1.1.1.3.4.3.1.3" xref="S2.E5.m1.1.1.3.4.3.1.3.cmml"><mi id="S2.E5.m1.1.1.3.4.3.1.3.2" xref="S2.E5.m1.1.1.3.4.3.1.3.2.cmml">N</mi><mi id="S2.E5.m1.1.1.3.4.3.1.3.3" xref="S2.E5.m1.1.1.3.4.3.1.3.3.cmml">T</mi></msub></munderover><msubsup id="S2.E5.m1.1.1.3.4.3.2" xref="S2.E5.m1.1.1.3.4.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E5.m1.1.1.3.4.3.2.2.2" xref="S2.E5.m1.1.1.3.4.3.2.2.2.cmml">ℒ</mi><mrow id="S2.E5.m1.1.1.3.4.3.2.2.3" xref="S2.E5.m1.1.1.3.4.3.2.2.3.cmml"><mi id="S2.E5.m1.1.1.3.4.3.2.2.3.2" xref="S2.E5.m1.1.1.3.4.3.2.2.3.2.cmml">A</mi><mo id="S2.E5.m1.1.1.3.4.3.2.2.3.1" xref="S2.E5.m1.1.1.3.4.3.2.2.3.1.cmml">⁢</mo><mi id="S2.E5.m1.1.1.3.4.3.2.2.3.3" xref="S2.E5.m1.1.1.3.4.3.2.2.3.3.cmml">V</mi><mo id="S2.E5.m1.1.1.3.4.3.2.2.3.1a" xref="S2.E5.m1.1.1.3.4.3.2.2.3.1.cmml">⁢</mo><mi id="S2.E5.m1.1.1.3.4.3.2.2.3.4" xref="S2.E5.m1.1.1.3.4.3.2.2.3.4.cmml">T</mi></mrow><mi id="S2.E5.m1.1.1.3.4.3.2.3" xref="S2.E5.m1.1.1.3.4.3.2.3.cmml">i</mi></msubsup></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E5.m1.1b"><apply id="S2.E5.m1.1.1.cmml" xref="S2.E5.m1.1.1"><eq id="S2.E5.m1.1.1.1.cmml" 
xref="S2.E5.m1.1.1.1"></eq><ci id="S2.E5.m1.1.1.2.cmml" xref="S2.E5.m1.1.1.2">ℒ</ci><apply id="S2.E5.m1.1.1.3.cmml" xref="S2.E5.m1.1.1.3"><plus id="S2.E5.m1.1.1.3.1.cmml" xref="S2.E5.m1.1.1.3.1"></plus><apply id="S2.E5.m1.1.1.3.2.cmml" xref="S2.E5.m1.1.1.3.2"><times id="S2.E5.m1.1.1.3.2.1.cmml" xref="S2.E5.m1.1.1.3.2.1"></times><apply id="S2.E5.m1.1.1.3.2.2.cmml" xref="S2.E5.m1.1.1.3.2.2"><divide id="S2.E5.m1.1.1.3.2.2.1.cmml" xref="S2.E5.m1.1.1.3.2.2"></divide><cn id="S2.E5.m1.1.1.3.2.2.2.cmml" type="integer" xref="S2.E5.m1.1.1.3.2.2.2">1</cn><apply id="S2.E5.m1.1.1.3.2.2.3.cmml" xref="S2.E5.m1.1.1.3.2.2.3"><times id="S2.E5.m1.1.1.3.2.2.3.1.cmml" xref="S2.E5.m1.1.1.3.2.2.3.1"></times><cn id="S2.E5.m1.1.1.3.2.2.3.2.cmml" type="integer" xref="S2.E5.m1.1.1.3.2.2.3.2">2</cn><ci id="S2.E5.m1.1.1.3.2.2.3.3.cmml" xref="S2.E5.m1.1.1.3.2.2.3.3">𝑁</ci></apply></apply><apply id="S2.E5.m1.1.1.3.2.3.cmml" xref="S2.E5.m1.1.1.3.2.3"><apply id="S2.E5.m1.1.1.3.2.3.1.cmml" xref="S2.E5.m1.1.1.3.2.3.1"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.2.3.1.1.cmml" xref="S2.E5.m1.1.1.3.2.3.1">superscript</csymbol><apply id="S2.E5.m1.1.1.3.2.3.1.2.cmml" xref="S2.E5.m1.1.1.3.2.3.1"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.2.3.1.2.1.cmml" xref="S2.E5.m1.1.1.3.2.3.1">subscript</csymbol><sum id="S2.E5.m1.1.1.3.2.3.1.2.2.cmml" xref="S2.E5.m1.1.1.3.2.3.1.2.2"></sum><apply id="S2.E5.m1.1.1.3.2.3.1.2.3.cmml" xref="S2.E5.m1.1.1.3.2.3.1.2.3"><eq id="S2.E5.m1.1.1.3.2.3.1.2.3.1.cmml" xref="S2.E5.m1.1.1.3.2.3.1.2.3.1"></eq><ci id="S2.E5.m1.1.1.3.2.3.1.2.3.2.cmml" xref="S2.E5.m1.1.1.3.2.3.1.2.3.2">𝑖</ci><cn id="S2.E5.m1.1.1.3.2.3.1.2.3.3.cmml" type="integer" xref="S2.E5.m1.1.1.3.2.3.1.2.3.3">1</cn></apply></apply><ci id="S2.E5.m1.1.1.3.2.3.1.3.cmml" xref="S2.E5.m1.1.1.3.2.3.1.3">𝑁</ci></apply><apply id="S2.E5.m1.1.1.3.2.3.2.cmml" xref="S2.E5.m1.1.1.3.2.3.2"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.2.3.2.1.cmml" xref="S2.E5.m1.1.1.3.2.3.2">superscript</csymbol><apply id="S2.E5.m1.1.1.3.2.3.2.2.cmml" 
xref="S2.E5.m1.1.1.3.2.3.2"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.2.3.2.2.1.cmml" xref="S2.E5.m1.1.1.3.2.3.2">subscript</csymbol><ci id="S2.E5.m1.1.1.3.2.3.2.2.2.cmml" xref="S2.E5.m1.1.1.3.2.3.2.2.2">ℒ</ci><apply id="S2.E5.m1.1.1.3.2.3.2.2.3.cmml" xref="S2.E5.m1.1.1.3.2.3.2.2.3"><times id="S2.E5.m1.1.1.3.2.3.2.2.3.1.cmml" xref="S2.E5.m1.1.1.3.2.3.2.2.3.1"></times><ci id="S2.E5.m1.1.1.3.2.3.2.2.3.2.cmml" xref="S2.E5.m1.1.1.3.2.3.2.2.3.2">𝐴</ci><ci id="S2.E5.m1.1.1.3.2.3.2.2.3.3.cmml" xref="S2.E5.m1.1.1.3.2.3.2.2.3.3">𝐿</ci></apply></apply><ci id="S2.E5.m1.1.1.3.2.3.2.3.cmml" xref="S2.E5.m1.1.1.3.2.3.2.3">𝑖</ci></apply></apply></apply><apply id="S2.E5.m1.1.1.3.3.cmml" xref="S2.E5.m1.1.1.3.3"><times id="S2.E5.m1.1.1.3.3.1.cmml" xref="S2.E5.m1.1.1.3.3.1"></times><apply id="S2.E5.m1.1.1.3.3.2.cmml" xref="S2.E5.m1.1.1.3.3.2"><divide id="S2.E5.m1.1.1.3.3.2.1.cmml" xref="S2.E5.m1.1.1.3.3.2"></divide><ci id="S2.E5.m1.1.1.3.3.2.2.cmml" xref="S2.E5.m1.1.1.3.3.2.2">𝜆</ci><apply id="S2.E5.m1.1.1.3.3.2.3.cmml" xref="S2.E5.m1.1.1.3.3.2.3"><times id="S2.E5.m1.1.1.3.3.2.3.1.cmml" xref="S2.E5.m1.1.1.3.3.2.3.1"></times><cn id="S2.E5.m1.1.1.3.3.2.3.2.cmml" type="integer" xref="S2.E5.m1.1.1.3.3.2.3.2">2</cn><apply id="S2.E5.m1.1.1.3.3.2.3.3.cmml" xref="S2.E5.m1.1.1.3.3.2.3.3"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.3.2.3.3.1.cmml" xref="S2.E5.m1.1.1.3.3.2.3.3">subscript</csymbol><ci id="S2.E5.m1.1.1.3.3.2.3.3.2.cmml" xref="S2.E5.m1.1.1.3.3.2.3.3.2">𝑁</ci><ci id="S2.E5.m1.1.1.3.3.2.3.3.3.cmml" xref="S2.E5.m1.1.1.3.3.2.3.3.3">𝑆</ci></apply></apply></apply><apply id="S2.E5.m1.1.1.3.3.3.cmml" xref="S2.E5.m1.1.1.3.3.3"><apply id="S2.E5.m1.1.1.3.3.3.1.cmml" xref="S2.E5.m1.1.1.3.3.3.1"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.3.3.1.1.cmml" xref="S2.E5.m1.1.1.3.3.3.1">superscript</csymbol><apply id="S2.E5.m1.1.1.3.3.3.1.2.cmml" xref="S2.E5.m1.1.1.3.3.3.1"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.3.3.1.2.1.cmml" xref="S2.E5.m1.1.1.3.3.3.1">subscript</csymbol><sum 
id="S2.E5.m1.1.1.3.3.3.1.2.2.cmml" xref="S2.E5.m1.1.1.3.3.3.1.2.2"></sum><apply id="S2.E5.m1.1.1.3.3.3.1.2.3.cmml" xref="S2.E5.m1.1.1.3.3.3.1.2.3"><eq id="S2.E5.m1.1.1.3.3.3.1.2.3.1.cmml" xref="S2.E5.m1.1.1.3.3.3.1.2.3.1"></eq><ci id="S2.E5.m1.1.1.3.3.3.1.2.3.2.cmml" xref="S2.E5.m1.1.1.3.3.3.1.2.3.2">𝑖</ci><cn id="S2.E5.m1.1.1.3.3.3.1.2.3.3.cmml" type="integer" xref="S2.E5.m1.1.1.3.3.3.1.2.3.3">1</cn></apply></apply><apply id="S2.E5.m1.1.1.3.3.3.1.3.cmml" xref="S2.E5.m1.1.1.3.3.3.1.3"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.3.3.1.3.1.cmml" xref="S2.E5.m1.1.1.3.3.3.1.3">subscript</csymbol><ci id="S2.E5.m1.1.1.3.3.3.1.3.2.cmml" xref="S2.E5.m1.1.1.3.3.3.1.3.2">𝑁</ci><ci id="S2.E5.m1.1.1.3.3.3.1.3.3.cmml" xref="S2.E5.m1.1.1.3.3.3.1.3.3">𝑆</ci></apply></apply><apply id="S2.E5.m1.1.1.3.3.3.2.cmml" xref="S2.E5.m1.1.1.3.3.3.2"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.3.3.2.1.cmml" xref="S2.E5.m1.1.1.3.3.3.2">superscript</csymbol><apply id="S2.E5.m1.1.1.3.3.3.2.2.cmml" xref="S2.E5.m1.1.1.3.3.3.2"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.3.3.2.2.1.cmml" xref="S2.E5.m1.1.1.3.3.3.2">subscript</csymbol><ci id="S2.E5.m1.1.1.3.3.3.2.2.2.cmml" xref="S2.E5.m1.1.1.3.3.3.2.2.2">ℒ</ci><apply id="S2.E5.m1.1.1.3.3.3.2.2.3.cmml" xref="S2.E5.m1.1.1.3.3.3.2.2.3"><times id="S2.E5.m1.1.1.3.3.3.2.2.3.1.cmml" xref="S2.E5.m1.1.1.3.3.3.2.2.3.1"></times><ci id="S2.E5.m1.1.1.3.3.3.2.2.3.2.cmml" xref="S2.E5.m1.1.1.3.3.3.2.2.3.2">𝐴</ci><ci id="S2.E5.m1.1.1.3.3.3.2.2.3.3.cmml" xref="S2.E5.m1.1.1.3.3.3.2.2.3.3">𝑉</ci><ci id="S2.E5.m1.1.1.3.3.3.2.2.3.4.cmml" xref="S2.E5.m1.1.1.3.3.3.2.2.3.4">𝑆</ci></apply></apply><ci id="S2.E5.m1.1.1.3.3.3.2.3.cmml" xref="S2.E5.m1.1.1.3.3.3.2.3">𝑖</ci></apply></apply></apply><apply id="S2.E5.m1.1.1.3.4.cmml" xref="S2.E5.m1.1.1.3.4"><times id="S2.E5.m1.1.1.3.4.1.cmml" xref="S2.E5.m1.1.1.3.4.1"></times><apply id="S2.E5.m1.1.1.3.4.2.cmml" xref="S2.E5.m1.1.1.3.4.2"><divide id="S2.E5.m1.1.1.3.4.2.1.cmml" xref="S2.E5.m1.1.1.3.4.2"></divide><ci 
id="S2.E5.m1.1.1.3.4.2.2.cmml" xref="S2.E5.m1.1.1.3.4.2.2">𝜇</ci><apply id="S2.E5.m1.1.1.3.4.2.3.cmml" xref="S2.E5.m1.1.1.3.4.2.3"><times id="S2.E5.m1.1.1.3.4.2.3.1.cmml" xref="S2.E5.m1.1.1.3.4.2.3.1"></times><cn id="S2.E5.m1.1.1.3.4.2.3.2.cmml" type="integer" xref="S2.E5.m1.1.1.3.4.2.3.2">2</cn><apply id="S2.E5.m1.1.1.3.4.2.3.3.cmml" xref="S2.E5.m1.1.1.3.4.2.3.3"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.4.2.3.3.1.cmml" xref="S2.E5.m1.1.1.3.4.2.3.3">subscript</csymbol><ci id="S2.E5.m1.1.1.3.4.2.3.3.2.cmml" xref="S2.E5.m1.1.1.3.4.2.3.3.2">𝑁</ci><ci id="S2.E5.m1.1.1.3.4.2.3.3.3.cmml" xref="S2.E5.m1.1.1.3.4.2.3.3.3">𝑇</ci></apply></apply></apply><apply id="S2.E5.m1.1.1.3.4.3.cmml" xref="S2.E5.m1.1.1.3.4.3"><apply id="S2.E5.m1.1.1.3.4.3.1.cmml" xref="S2.E5.m1.1.1.3.4.3.1"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.4.3.1.1.cmml" xref="S2.E5.m1.1.1.3.4.3.1">superscript</csymbol><apply id="S2.E5.m1.1.1.3.4.3.1.2.cmml" xref="S2.E5.m1.1.1.3.4.3.1"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.4.3.1.2.1.cmml" xref="S2.E5.m1.1.1.3.4.3.1">subscript</csymbol><sum id="S2.E5.m1.1.1.3.4.3.1.2.2.cmml" xref="S2.E5.m1.1.1.3.4.3.1.2.2"></sum><apply id="S2.E5.m1.1.1.3.4.3.1.2.3.cmml" xref="S2.E5.m1.1.1.3.4.3.1.2.3"><eq id="S2.E5.m1.1.1.3.4.3.1.2.3.1.cmml" xref="S2.E5.m1.1.1.3.4.3.1.2.3.1"></eq><ci id="S2.E5.m1.1.1.3.4.3.1.2.3.2.cmml" xref="S2.E5.m1.1.1.3.4.3.1.2.3.2">𝑖</ci><cn id="S2.E5.m1.1.1.3.4.3.1.2.3.3.cmml" type="integer" xref="S2.E5.m1.1.1.3.4.3.1.2.3.3">1</cn></apply></apply><apply id="S2.E5.m1.1.1.3.4.3.1.3.cmml" xref="S2.E5.m1.1.1.3.4.3.1.3"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.4.3.1.3.1.cmml" xref="S2.E5.m1.1.1.3.4.3.1.3">subscript</csymbol><ci id="S2.E5.m1.1.1.3.4.3.1.3.2.cmml" xref="S2.E5.m1.1.1.3.4.3.1.3.2">𝑁</ci><ci id="S2.E5.m1.1.1.3.4.3.1.3.3.cmml" xref="S2.E5.m1.1.1.3.4.3.1.3.3">𝑇</ci></apply></apply><apply id="S2.E5.m1.1.1.3.4.3.2.cmml" xref="S2.E5.m1.1.1.3.4.3.2"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.4.3.2.1.cmml" 
xref="S2.E5.m1.1.1.3.4.3.2">superscript</csymbol><apply id="S2.E5.m1.1.1.3.4.3.2.2.cmml" xref="S2.E5.m1.1.1.3.4.3.2"><csymbol cd="ambiguous" id="S2.E5.m1.1.1.3.4.3.2.2.1.cmml" xref="S2.E5.m1.1.1.3.4.3.2">subscript</csymbol><ci id="S2.E5.m1.1.1.3.4.3.2.2.2.cmml" xref="S2.E5.m1.1.1.3.4.3.2.2.2">ℒ</ci><apply id="S2.E5.m1.1.1.3.4.3.2.2.3.cmml" xref="S2.E5.m1.1.1.3.4.3.2.2.3"><times id="S2.E5.m1.1.1.3.4.3.2.2.3.1.cmml" xref="S2.E5.m1.1.1.3.4.3.2.2.3.1"></times><ci id="S2.E5.m1.1.1.3.4.3.2.2.3.2.cmml" xref="S2.E5.m1.1.1.3.4.3.2.2.3.2">𝐴</ci><ci id="S2.E5.m1.1.1.3.4.3.2.2.3.3.cmml" xref="S2.E5.m1.1.1.3.4.3.2.2.3.3">𝑉</ci><ci id="S2.E5.m1.1.1.3.4.3.2.2.3.4.cmml" xref="S2.E5.m1.1.1.3.4.3.2.2.3.4">𝑇</ci></apply></apply><ci id="S2.E5.m1.1.1.3.4.3.2.3.cmml" xref="S2.E5.m1.1.1.3.4.3.2.3">𝑖</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E5.m1.1c">\mathcal{L}=\frac{1}{2N}\sum_{i=1}^{N}\mathcal{L}_{AL}^{i}+\frac{\lambda}{2N_{% S}}\sum_{i=1}^{N_{S}}\mathcal{L}_{AVS}^{i}+\frac{\mu}{2N_{T}}\sum_{i=1}^{N_{T}% }\mathcal{L}_{AVT}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.E5.m1.1d">caligraphic_L = divide start_ARG 1 end_ARG start_ARG 2 italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_A italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT + divide start_ARG italic_λ end_ARG start_ARG 2 italic_N start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT end_POSTSUPERSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_A italic_V italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT + divide start_ARG italic_μ end_ARG start_ARG 2 italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 
1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_A italic_V italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS1.p5.8">where <math alttext="E_{a}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p5.2.m1.1"><semantics id="S2.SS1.p5.2.m1.1a"><msubsup id="S2.SS1.p5.2.m1.1.1" xref="S2.SS1.p5.2.m1.1.1.cmml"><mi id="S2.SS1.p5.2.m1.1.1.2.2" xref="S2.SS1.p5.2.m1.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p5.2.m1.1.1.2.3" xref="S2.SS1.p5.2.m1.1.1.2.3.cmml">a</mi><mi id="S2.SS1.p5.2.m1.1.1.3" xref="S2.SS1.p5.2.m1.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.2.m1.1b"><apply id="S2.SS1.p5.2.m1.1.1.cmml" xref="S2.SS1.p5.2.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.2.m1.1.1.1.cmml" xref="S2.SS1.p5.2.m1.1.1">superscript</csymbol><apply id="S2.SS1.p5.2.m1.1.1.2.cmml" xref="S2.SS1.p5.2.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.2.m1.1.1.2.1.cmml" xref="S2.SS1.p5.2.m1.1.1">subscript</csymbol><ci id="S2.SS1.p5.2.m1.1.1.2.2.cmml" xref="S2.SS1.p5.2.m1.1.1.2.2">𝐸</ci><ci id="S2.SS1.p5.2.m1.1.1.2.3.cmml" xref="S2.SS1.p5.2.m1.1.1.2.3">𝑎</ci></apply><ci id="S2.SS1.p5.2.m1.1.1.3.cmml" xref="S2.SS1.p5.2.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.2.m1.1c">E_{a}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.2.m1.1d">italic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math>, <math alttext="E_{l}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p5.3.m2.1"><semantics 
id="S2.SS1.p5.3.m2.1a"><msubsup id="S2.SS1.p5.3.m2.1.1" xref="S2.SS1.p5.3.m2.1.1.cmml"><mi id="S2.SS1.p5.3.m2.1.1.2.2" xref="S2.SS1.p5.3.m2.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p5.3.m2.1.1.2.3" xref="S2.SS1.p5.3.m2.1.1.2.3.cmml">l</mi><mi id="S2.SS1.p5.3.m2.1.1.3" xref="S2.SS1.p5.3.m2.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.3.m2.1b"><apply id="S2.SS1.p5.3.m2.1.1.cmml" xref="S2.SS1.p5.3.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.3.m2.1.1.1.cmml" xref="S2.SS1.p5.3.m2.1.1">superscript</csymbol><apply id="S2.SS1.p5.3.m2.1.1.2.cmml" xref="S2.SS1.p5.3.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.3.m2.1.1.2.1.cmml" xref="S2.SS1.p5.3.m2.1.1">subscript</csymbol><ci id="S2.SS1.p5.3.m2.1.1.2.2.cmml" xref="S2.SS1.p5.3.m2.1.1.2.2">𝐸</ci><ci id="S2.SS1.p5.3.m2.1.1.2.3.cmml" xref="S2.SS1.p5.3.m2.1.1.2.3">𝑙</ci></apply><ci id="S2.SS1.p5.3.m2.1.1.3.cmml" xref="S2.SS1.p5.3.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.3.m2.1c">E_{l}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.3.m2.1d">italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math>, and <math alttext="E_{v}^{i}" class="ltx_Math" display="inline" id="S2.SS1.p5.4.m3.1"><semantics id="S2.SS1.p5.4.m3.1a"><msubsup id="S2.SS1.p5.4.m3.1.1" xref="S2.SS1.p5.4.m3.1.1.cmml"><mi id="S2.SS1.p5.4.m3.1.1.2.2" xref="S2.SS1.p5.4.m3.1.1.2.2.cmml">E</mi><mi id="S2.SS1.p5.4.m3.1.1.2.3" xref="S2.SS1.p5.4.m3.1.1.2.3.cmml">v</mi><mi id="S2.SS1.p5.4.m3.1.1.3" xref="S2.SS1.p5.4.m3.1.1.3.cmml">i</mi></msubsup><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.4.m3.1b"><apply id="S2.SS1.p5.4.m3.1.1.cmml" xref="S2.SS1.p5.4.m3.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.4.m3.1.1.1.cmml" xref="S2.SS1.p5.4.m3.1.1">superscript</csymbol><apply id="S2.SS1.p5.4.m3.1.1.2.cmml" xref="S2.SS1.p5.4.m3.1.1"><csymbol cd="ambiguous" 
id="S2.SS1.p5.4.m3.1.1.2.1.cmml" xref="S2.SS1.p5.4.m3.1.1">subscript</csymbol><ci id="S2.SS1.p5.4.m3.1.1.2.2.cmml" xref="S2.SS1.p5.4.m3.1.1.2.2">𝐸</ci><ci id="S2.SS1.p5.4.m3.1.1.2.3.cmml" xref="S2.SS1.p5.4.m3.1.1.2.3">𝑣</ci></apply><ci id="S2.SS1.p5.4.m3.1.1.3.cmml" xref="S2.SS1.p5.4.m3.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.4.m3.1c">E_{v}^{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.4.m3.1d">italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT</annotation></semantics></math> represent the audio, text, and video embeddings for the <math alttext="i" class="ltx_Math" display="inline" id="S2.SS1.p5.5.m4.1"><semantics id="S2.SS1.p5.5.m4.1a"><mi id="S2.SS1.p5.5.m4.1.1" xref="S2.SS1.p5.5.m4.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.5.m4.1b"><ci id="S2.SS1.p5.5.m4.1.1.cmml" xref="S2.SS1.p5.5.m4.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.5.m4.1c">i</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.5.m4.1d">italic_i</annotation></semantics></math>-th sample, respectively. 
<math alttext="N" class="ltx_Math" display="inline" id="S2.SS1.p5.6.m5.1"><semantics id="S2.SS1.p5.6.m5.1a"><mi id="S2.SS1.p5.6.m5.1.1" xref="S2.SS1.p5.6.m5.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.6.m5.1b"><ci id="S2.SS1.p5.6.m5.1.1.cmml" xref="S2.SS1.p5.6.m5.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.6.m5.1c">N</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.6.m5.1d">italic_N</annotation></semantics></math>, <math alttext="N_{S}" class="ltx_Math" display="inline" id="S2.SS1.p5.7.m6.1"><semantics id="S2.SS1.p5.7.m6.1a"><msub id="S2.SS1.p5.7.m6.1.1" xref="S2.SS1.p5.7.m6.1.1.cmml"><mi id="S2.SS1.p5.7.m6.1.1.2" xref="S2.SS1.p5.7.m6.1.1.2.cmml">N</mi><mi id="S2.SS1.p5.7.m6.1.1.3" xref="S2.SS1.p5.7.m6.1.1.3.cmml">S</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.7.m6.1b"><apply id="S2.SS1.p5.7.m6.1.1.cmml" xref="S2.SS1.p5.7.m6.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.7.m6.1.1.1.cmml" xref="S2.SS1.p5.7.m6.1.1">subscript</csymbol><ci id="S2.SS1.p5.7.m6.1.1.2.cmml" xref="S2.SS1.p5.7.m6.1.1.2">𝑁</ci><ci id="S2.SS1.p5.7.m6.1.1.3.cmml" xref="S2.SS1.p5.7.m6.1.1.3">𝑆</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.7.m6.1c">N_{S}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.7.m6.1d">italic_N start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT</annotation></semantics></math>, and <math alttext="N_{T}" class="ltx_Math" display="inline" id="S2.SS1.p5.8.m7.1"><semantics id="S2.SS1.p5.8.m7.1a"><msub id="S2.SS1.p5.8.m7.1.1" xref="S2.SS1.p5.8.m7.1.1.cmml"><mi id="S2.SS1.p5.8.m7.1.1.2" xref="S2.SS1.p5.8.m7.1.1.2.cmml">N</mi><mi id="S2.SS1.p5.8.m7.1.1.3" xref="S2.SS1.p5.8.m7.1.1.3.cmml">T</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p5.8.m7.1b"><apply id="S2.SS1.p5.8.m7.1.1.cmml" xref="S2.SS1.p5.8.m7.1.1"><csymbol cd="ambiguous" id="S2.SS1.p5.8.m7.1.1.1.cmml" 
xref="S2.SS1.p5.8.m7.1.1">subscript</csymbol><ci id="S2.SS1.p5.8.m7.1.1.2.cmml" xref="S2.SS1.p5.8.m7.1.1.2">𝑁</ci><ci id="S2.SS1.p5.8.m7.1.1.3.cmml" xref="S2.SS1.p5.8.m7.1.1.3">𝑇</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p5.8.m7.1c">N_{T}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p5.8.m7.1d">italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT</annotation></semantics></math> denote the total numbers of audio-text pairs, cross-video audio-video pairs, and intra-video temporal pairs, respectively, as defined in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite>.</p> </div> <div class="ltx_para" id="S2.SS1.p6"> <p class="ltx_p" id="S2.SS1.p6.4">The parameters <math alttext="\lambda" class="ltx_Math" display="inline" id="S2.SS1.p6.1.m1.1"><semantics id="S2.SS1.p6.1.m1.1a"><mi id="S2.SS1.p6.1.m1.1.1" xref="S2.SS1.p6.1.m1.1.1.cmml">λ</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p6.1.m1.1b"><ci id="S2.SS1.p6.1.m1.1.1.cmml" xref="S2.SS1.p6.1.m1.1.1">𝜆</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p6.1.m1.1c">\lambda</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p6.1.m1.1d">italic_λ</annotation></semantics></math> and <math alttext="\mu" class="ltx_Math" display="inline" id="S2.SS1.p6.2.m2.1"><semantics id="S2.SS1.p6.2.m2.1a"><mi id="S2.SS1.p6.2.m2.1.1" xref="S2.SS1.p6.2.m2.1.1.cmml">μ</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p6.2.m2.1b"><ci id="S2.SS1.p6.2.m2.1.1.cmml" xref="S2.SS1.p6.2.m2.1.1">𝜇</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p6.2.m2.1c">\mu</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p6.2.m2.1d">italic_μ</annotation></semantics></math> are weights that determine the relative importance of the different loss components: <math alttext="1/(\lambda+\mu)" class="ltx_Math" display="inline" 
id="S2.SS1.p6.3.m3.1"><semantics id="S2.SS1.p6.3.m3.1a"><mrow id="S2.SS1.p6.3.m3.1.1" xref="S2.SS1.p6.3.m3.1.1.cmml"><mn id="S2.SS1.p6.3.m3.1.1.3" xref="S2.SS1.p6.3.m3.1.1.3.cmml">1</mn><mo id="S2.SS1.p6.3.m3.1.1.2" xref="S2.SS1.p6.3.m3.1.1.2.cmml">/</mo><mrow id="S2.SS1.p6.3.m3.1.1.1.1" xref="S2.SS1.p6.3.m3.1.1.1.1.1.cmml"><mo id="S2.SS1.p6.3.m3.1.1.1.1.2" stretchy="false" xref="S2.SS1.p6.3.m3.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS1.p6.3.m3.1.1.1.1.1" xref="S2.SS1.p6.3.m3.1.1.1.1.1.cmml"><mi id="S2.SS1.p6.3.m3.1.1.1.1.1.2" xref="S2.SS1.p6.3.m3.1.1.1.1.1.2.cmml">λ</mi><mo id="S2.SS1.p6.3.m3.1.1.1.1.1.1" xref="S2.SS1.p6.3.m3.1.1.1.1.1.1.cmml">+</mo><mi id="S2.SS1.p6.3.m3.1.1.1.1.1.3" xref="S2.SS1.p6.3.m3.1.1.1.1.1.3.cmml">μ</mi></mrow><mo id="S2.SS1.p6.3.m3.1.1.1.1.3" stretchy="false" xref="S2.SS1.p6.3.m3.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p6.3.m3.1b"><apply id="S2.SS1.p6.3.m3.1.1.cmml" xref="S2.SS1.p6.3.m3.1.1"><divide id="S2.SS1.p6.3.m3.1.1.2.cmml" xref="S2.SS1.p6.3.m3.1.1.2"></divide><cn id="S2.SS1.p6.3.m3.1.1.3.cmml" type="integer" xref="S2.SS1.p6.3.m3.1.1.3">1</cn><apply id="S2.SS1.p6.3.m3.1.1.1.1.1.cmml" xref="S2.SS1.p6.3.m3.1.1.1.1"><plus id="S2.SS1.p6.3.m3.1.1.1.1.1.1.cmml" xref="S2.SS1.p6.3.m3.1.1.1.1.1.1"></plus><ci id="S2.SS1.p6.3.m3.1.1.1.1.1.2.cmml" xref="S2.SS1.p6.3.m3.1.1.1.1.1.2">𝜆</ci><ci id="S2.SS1.p6.3.m3.1.1.1.1.1.3.cmml" xref="S2.SS1.p6.3.m3.1.1.1.1.1.3">𝜇</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p6.3.m3.1c">1/(\lambda+\mu)</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p6.3.m3.1d">1 / ( italic_λ + italic_μ )</annotation></semantics></math> reflects the importance of text relative to video, and <math alttext="\mu/(1+\lambda)" class="ltx_Math" display="inline" id="S2.SS1.p6.4.m4.1"><semantics id="S2.SS1.p6.4.m4.1a"><mrow id="S2.SS1.p6.4.m4.1.1" xref="S2.SS1.p6.4.m4.1.1.cmml"><mi id="S2.SS1.p6.4.m4.1.1.3" 
xref="S2.SS1.p6.4.m4.1.1.3.cmml">μ</mi><mo id="S2.SS1.p6.4.m4.1.1.2" xref="S2.SS1.p6.4.m4.1.1.2.cmml">/</mo><mrow id="S2.SS1.p6.4.m4.1.1.1.1" xref="S2.SS1.p6.4.m4.1.1.1.1.1.cmml"><mo id="S2.SS1.p6.4.m4.1.1.1.1.2" stretchy="false" xref="S2.SS1.p6.4.m4.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS1.p6.4.m4.1.1.1.1.1" xref="S2.SS1.p6.4.m4.1.1.1.1.1.cmml"><mn id="S2.SS1.p6.4.m4.1.1.1.1.1.2" xref="S2.SS1.p6.4.m4.1.1.1.1.1.2.cmml">1</mn><mo id="S2.SS1.p6.4.m4.1.1.1.1.1.1" xref="S2.SS1.p6.4.m4.1.1.1.1.1.1.cmml">+</mo><mi id="S2.SS1.p6.4.m4.1.1.1.1.1.3" xref="S2.SS1.p6.4.m4.1.1.1.1.1.3.cmml">λ</mi></mrow><mo id="S2.SS1.p6.4.m4.1.1.1.1.3" stretchy="false" xref="S2.SS1.p6.4.m4.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p6.4.m4.1b"><apply id="S2.SS1.p6.4.m4.1.1.cmml" xref="S2.SS1.p6.4.m4.1.1"><divide id="S2.SS1.p6.4.m4.1.1.2.cmml" xref="S2.SS1.p6.4.m4.1.1.2"></divide><ci id="S2.SS1.p6.4.m4.1.1.3.cmml" xref="S2.SS1.p6.4.m4.1.1.3">𝜇</ci><apply id="S2.SS1.p6.4.m4.1.1.1.1.1.cmml" xref="S2.SS1.p6.4.m4.1.1.1.1"><plus id="S2.SS1.p6.4.m4.1.1.1.1.1.1.cmml" xref="S2.SS1.p6.4.m4.1.1.1.1.1.1"></plus><cn id="S2.SS1.p6.4.m4.1.1.1.1.1.2.cmml" type="integer" xref="S2.SS1.p6.4.m4.1.1.1.1.1.2">1</cn><ci id="S2.SS1.p6.4.m4.1.1.1.1.1.3.cmml" xref="S2.SS1.p6.4.m4.1.1.1.1.1.3">𝜆</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p6.4.m4.1c">\mu/(1+\lambda)</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p6.4.m4.1d">italic_μ / ( 1 + italic_λ )</annotation></semantics></math> indicates the emphasis on temporal alignment versus semantic expression. Both weights are determined experimentally.</p> </div> <div class="ltx_para" id="S2.SS1.p7"> <p class="ltx_p" id="S2.SS1.p7.1">Equation (2) captures the similarity between audio and text, while Equations (3) and (4) express the similarity between audio and video from the semantic and temporal perspectives, respectively. 
The final loss function in (5) integrates these modality pairs.</p> </div> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS2.5.1.1">II-B</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS2.6.2">Feature Mixing</span> </h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.4">To blend the features extracted from different modalities, we employ feature mixing strategies that balance information from video and text. Since the video modality contains both semantic and temporal features, while text usually contains only semantic features, blindly aligning the two (for example, with cross-attention) may lose the temporal alignment information learned in CAVP. To address this, we employ weighted averaging and concatenation for feature fusion:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E6"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="E_{\text{mix}}^{\text{aver}}=E_{\text{aver}}=\left[E_{l}+(\lambda+\mu)E_{v}% \right]/(1+\lambda+\mu)" class="ltx_Math" display="block" id="S2.E6.m1.2"><semantics id="S2.E6.m1.2a"><mrow id="S2.E6.m1.2.2" xref="S2.E6.m1.2.2.cmml"><msubsup id="S2.E6.m1.2.2.4" xref="S2.E6.m1.2.2.4.cmml"><mi id="S2.E6.m1.2.2.4.2.2" xref="S2.E6.m1.2.2.4.2.2.cmml">E</mi><mtext id="S2.E6.m1.2.2.4.2.3" xref="S2.E6.m1.2.2.4.2.3a.cmml">mix</mtext><mtext id="S2.E6.m1.2.2.4.3" xref="S2.E6.m1.2.2.4.3a.cmml">aver</mtext></msubsup><mo id="S2.E6.m1.2.2.5" xref="S2.E6.m1.2.2.5.cmml">=</mo><msub id="S2.E6.m1.2.2.6" xref="S2.E6.m1.2.2.6.cmml"><mi id="S2.E6.m1.2.2.6.2" xref="S2.E6.m1.2.2.6.2.cmml">E</mi><mtext id="S2.E6.m1.2.2.6.3" xref="S2.E6.m1.2.2.6.3a.cmml">aver</mtext></msub><mo id="S2.E6.m1.2.2.7" xref="S2.E6.m1.2.2.7.cmml">=</mo><mrow 
id="S2.E6.m1.2.2.2" xref="S2.E6.m1.2.2.2.cmml"><mrow id="S2.E6.m1.1.1.1.1.1" xref="S2.E6.m1.1.1.1.1.2.cmml"><mo id="S2.E6.m1.1.1.1.1.1.2" xref="S2.E6.m1.1.1.1.1.2.1.cmml">[</mo><mrow id="S2.E6.m1.1.1.1.1.1.1" xref="S2.E6.m1.1.1.1.1.1.1.cmml"><msub id="S2.E6.m1.1.1.1.1.1.1.3" xref="S2.E6.m1.1.1.1.1.1.1.3.cmml"><mi id="S2.E6.m1.1.1.1.1.1.1.3.2" xref="S2.E6.m1.1.1.1.1.1.1.3.2.cmml">E</mi><mi id="S2.E6.m1.1.1.1.1.1.1.3.3" xref="S2.E6.m1.1.1.1.1.1.1.3.3.cmml">l</mi></msub><mo id="S2.E6.m1.1.1.1.1.1.1.2" xref="S2.E6.m1.1.1.1.1.1.1.2.cmml">+</mo><mrow id="S2.E6.m1.1.1.1.1.1.1.1" xref="S2.E6.m1.1.1.1.1.1.1.1.cmml"><mrow id="S2.E6.m1.1.1.1.1.1.1.1.1.1" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E6.m1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.2" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.2.cmml">λ</mi><mo id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.1" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.1.cmml">+</mo><mi id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.3" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">μ</mi></mrow><mo id="S2.E6.m1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="S2.E6.m1.1.1.1.1.1.1.1.2" xref="S2.E6.m1.1.1.1.1.1.1.1.2.cmml">⁢</mo><msub id="S2.E6.m1.1.1.1.1.1.1.1.3" xref="S2.E6.m1.1.1.1.1.1.1.1.3.cmml"><mi id="S2.E6.m1.1.1.1.1.1.1.1.3.2" xref="S2.E6.m1.1.1.1.1.1.1.1.3.2.cmml">E</mi><mi id="S2.E6.m1.1.1.1.1.1.1.1.3.3" xref="S2.E6.m1.1.1.1.1.1.1.1.3.3.cmml">v</mi></msub></mrow></mrow><mo id="S2.E6.m1.1.1.1.1.1.3" xref="S2.E6.m1.1.1.1.1.2.1.cmml">]</mo></mrow><mo id="S2.E6.m1.2.2.2.3" xref="S2.E6.m1.2.2.2.3.cmml">/</mo><mrow id="S2.E6.m1.2.2.2.2.1" xref="S2.E6.m1.2.2.2.2.1.1.cmml"><mo id="S2.E6.m1.2.2.2.2.1.2" stretchy="false" xref="S2.E6.m1.2.2.2.2.1.1.cmml">(</mo><mrow id="S2.E6.m1.2.2.2.2.1.1" xref="S2.E6.m1.2.2.2.2.1.1.cmml"><mn id="S2.E6.m1.2.2.2.2.1.1.2" 
xref="S2.E6.m1.2.2.2.2.1.1.2.cmml">1</mn><mo id="S2.E6.m1.2.2.2.2.1.1.1" xref="S2.E6.m1.2.2.2.2.1.1.1.cmml">+</mo><mi id="S2.E6.m1.2.2.2.2.1.1.3" xref="S2.E6.m1.2.2.2.2.1.1.3.cmml">λ</mi><mo id="S2.E6.m1.2.2.2.2.1.1.1a" xref="S2.E6.m1.2.2.2.2.1.1.1.cmml">+</mo><mi id="S2.E6.m1.2.2.2.2.1.1.4" xref="S2.E6.m1.2.2.2.2.1.1.4.cmml">μ</mi></mrow><mo id="S2.E6.m1.2.2.2.2.1.3" stretchy="false" xref="S2.E6.m1.2.2.2.2.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E6.m1.2b"><apply id="S2.E6.m1.2.2.cmml" xref="S2.E6.m1.2.2"><and id="S2.E6.m1.2.2a.cmml" xref="S2.E6.m1.2.2"></and><apply id="S2.E6.m1.2.2b.cmml" xref="S2.E6.m1.2.2"><eq id="S2.E6.m1.2.2.5.cmml" xref="S2.E6.m1.2.2.5"></eq><apply id="S2.E6.m1.2.2.4.cmml" xref="S2.E6.m1.2.2.4"><csymbol cd="ambiguous" id="S2.E6.m1.2.2.4.1.cmml" xref="S2.E6.m1.2.2.4">superscript</csymbol><apply id="S2.E6.m1.2.2.4.2.cmml" xref="S2.E6.m1.2.2.4"><csymbol cd="ambiguous" id="S2.E6.m1.2.2.4.2.1.cmml" xref="S2.E6.m1.2.2.4">subscript</csymbol><ci id="S2.E6.m1.2.2.4.2.2.cmml" xref="S2.E6.m1.2.2.4.2.2">𝐸</ci><ci id="S2.E6.m1.2.2.4.2.3a.cmml" xref="S2.E6.m1.2.2.4.2.3"><mtext id="S2.E6.m1.2.2.4.2.3.cmml" mathsize="70%" xref="S2.E6.m1.2.2.4.2.3">mix</mtext></ci></apply><ci id="S2.E6.m1.2.2.4.3a.cmml" xref="S2.E6.m1.2.2.4.3"><mtext id="S2.E6.m1.2.2.4.3.cmml" mathsize="70%" xref="S2.E6.m1.2.2.4.3">aver</mtext></ci></apply><apply id="S2.E6.m1.2.2.6.cmml" xref="S2.E6.m1.2.2.6"><csymbol cd="ambiguous" id="S2.E6.m1.2.2.6.1.cmml" xref="S2.E6.m1.2.2.6">subscript</csymbol><ci id="S2.E6.m1.2.2.6.2.cmml" xref="S2.E6.m1.2.2.6.2">𝐸</ci><ci id="S2.E6.m1.2.2.6.3a.cmml" xref="S2.E6.m1.2.2.6.3"><mtext id="S2.E6.m1.2.2.6.3.cmml" mathsize="70%" xref="S2.E6.m1.2.2.6.3">aver</mtext></ci></apply></apply><apply id="S2.E6.m1.2.2c.cmml" xref="S2.E6.m1.2.2"><eq id="S2.E6.m1.2.2.7.cmml" xref="S2.E6.m1.2.2.7"></eq><share href="https://arxiv.org/html/2503.10700v1#S2.E6.m1.2.2.6.cmml" id="S2.E6.m1.2.2d.cmml" 
xref="S2.E6.m1.2.2"></share><apply id="S2.E6.m1.2.2.2.cmml" xref="S2.E6.m1.2.2.2"><divide id="S2.E6.m1.2.2.2.3.cmml" xref="S2.E6.m1.2.2.2.3"></divide><apply id="S2.E6.m1.1.1.1.1.2.cmml" xref="S2.E6.m1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E6.m1.1.1.1.1.2.1.cmml" xref="S2.E6.m1.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S2.E6.m1.1.1.1.1.1.1.cmml" xref="S2.E6.m1.1.1.1.1.1.1"><plus id="S2.E6.m1.1.1.1.1.1.1.2.cmml" xref="S2.E6.m1.1.1.1.1.1.1.2"></plus><apply id="S2.E6.m1.1.1.1.1.1.1.3.cmml" xref="S2.E6.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E6.m1.1.1.1.1.1.1.3.1.cmml" xref="S2.E6.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S2.E6.m1.1.1.1.1.1.1.3.2.cmml" xref="S2.E6.m1.1.1.1.1.1.1.3.2">𝐸</ci><ci id="S2.E6.m1.1.1.1.1.1.1.3.3.cmml" xref="S2.E6.m1.1.1.1.1.1.1.3.3">𝑙</ci></apply><apply id="S2.E6.m1.1.1.1.1.1.1.1.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1"><times id="S2.E6.m1.1.1.1.1.1.1.1.2.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.2"></times><apply id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1"><plus id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.1"></plus><ci id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.2">𝜆</ci><ci id="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.1.1.1.3">𝜇</ci></apply><apply id="S2.E6.m1.1.1.1.1.1.1.1.3.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E6.m1.1.1.1.1.1.1.1.3.1.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S2.E6.m1.1.1.1.1.1.1.1.3.2.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.3.2">𝐸</ci><ci id="S2.E6.m1.1.1.1.1.1.1.1.3.3.cmml" xref="S2.E6.m1.1.1.1.1.1.1.1.3.3">𝑣</ci></apply></apply></apply></apply><apply id="S2.E6.m1.2.2.2.2.1.1.cmml" xref="S2.E6.m1.2.2.2.2.1"><plus id="S2.E6.m1.2.2.2.2.1.1.1.cmml" xref="S2.E6.m1.2.2.2.2.1.1.1"></plus><cn id="S2.E6.m1.2.2.2.2.1.1.2.cmml" type="integer" xref="S2.E6.m1.2.2.2.2.1.1.2">1</cn><ci id="S2.E6.m1.2.2.2.2.1.1.3.cmml" xref="S2.E6.m1.2.2.2.2.1.1.3">𝜆</ci><ci 
id="S2.E6.m1.2.2.2.2.1.1.4.cmml" xref="S2.E6.m1.2.2.2.2.1.1.4">𝜇</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E6.m1.2c">E_{\text{mix}}^{\text{aver}}=E_{\text{aver}}=\left[E_{l}+(\lambda+\mu)E_{v}% \right]/(1+\lambda+\mu)</annotation><annotation encoding="application/x-llamapun" id="S2.E6.m1.2d">italic_E start_POSTSUBSCRIPT mix end_POSTSUBSCRIPT start_POSTSUPERSCRIPT aver end_POSTSUPERSCRIPT = italic_E start_POSTSUBSCRIPT aver end_POSTSUBSCRIPT = [ italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT + ( italic_λ + italic_μ ) italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ] / ( 1 + italic_λ + italic_μ )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td> </tr></tbody> </table> <table class="ltx_equation ltx_eqn_table" id="S2.E7"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="E_{\text{concat}}=\text{concat}[P(E_{v});P(E_{l})]" class="ltx_math_unparsed" display="block" id="S2.E7.m1.1"><semantics id="S2.E7.m1.1a"><mrow id="S2.E7.m1.1b"><msub id="S2.E7.m1.1.1"><mi id="S2.E7.m1.1.1.2">E</mi><mtext id="S2.E7.m1.1.1.3">concat</mtext></msub><mo id="S2.E7.m1.1.2">=</mo><mtext id="S2.E7.m1.1.3">concat</mtext><mrow id="S2.E7.m1.1.4"><mo id="S2.E7.m1.1.4.1" stretchy="false">[</mo><mi id="S2.E7.m1.1.4.2">P</mi><mrow id="S2.E7.m1.1.4.3"><mo id="S2.E7.m1.1.4.3.1" stretchy="false">(</mo><msub id="S2.E7.m1.1.4.3.2"><mi id="S2.E7.m1.1.4.3.2.2">E</mi><mi id="S2.E7.m1.1.4.3.2.3">v</mi></msub><mo id="S2.E7.m1.1.4.3.3" stretchy="false">)</mo></mrow><mo id="S2.E7.m1.1.4.4">;</mo><mi id="S2.E7.m1.1.4.5">P</mi><mrow id="S2.E7.m1.1.4.6"><mo id="S2.E7.m1.1.4.6.1" stretchy="false">(</mo><msub 
id="S2.E7.m1.1.4.6.2"><mi id="S2.E7.m1.1.4.6.2.2">E</mi><mi id="S2.E7.m1.1.4.6.2.3">l</mi></msub><mo id="S2.E7.m1.1.4.6.3" stretchy="false">)</mo></mrow><mo id="S2.E7.m1.1.4.7" stretchy="false">]</mo></mrow></mrow><annotation encoding="application/x-tex" id="S2.E7.m1.1c">E_{\text{concat}}=\text{concat}[P(E_{v});P(E_{l})]</annotation><annotation encoding="application/x-llamapun" id="S2.E7.m1.1d">italic_E start_POSTSUBSCRIPT concat end_POSTSUBSCRIPT = concat [ italic_P ( italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) ; italic_P ( italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ]</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS2.p1.3">where <math alttext="P(E_{v})" class="ltx_Math" display="inline" id="S2.SS2.p1.1.m1.1"><semantics id="S2.SS2.p1.1.m1.1a"><mrow id="S2.SS2.p1.1.m1.1.1" xref="S2.SS2.p1.1.m1.1.1.cmml"><mi id="S2.SS2.p1.1.m1.1.1.3" xref="S2.SS2.p1.1.m1.1.1.3.cmml">P</mi><mo id="S2.SS2.p1.1.m1.1.1.2" xref="S2.SS2.p1.1.m1.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p1.1.m1.1.1.1.1" xref="S2.SS2.p1.1.m1.1.1.1.1.1.cmml"><mo id="S2.SS2.p1.1.m1.1.1.1.1.2" stretchy="false" xref="S2.SS2.p1.1.m1.1.1.1.1.1.cmml">(</mo><msub id="S2.SS2.p1.1.m1.1.1.1.1.1" xref="S2.SS2.p1.1.m1.1.1.1.1.1.cmml"><mi id="S2.SS2.p1.1.m1.1.1.1.1.1.2" xref="S2.SS2.p1.1.m1.1.1.1.1.1.2.cmml">E</mi><mi id="S2.SS2.p1.1.m1.1.1.1.1.1.3" xref="S2.SS2.p1.1.m1.1.1.1.1.1.3.cmml">v</mi></msub><mo id="S2.SS2.p1.1.m1.1.1.1.1.3" stretchy="false" xref="S2.SS2.p1.1.m1.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.1.m1.1b"><apply id="S2.SS2.p1.1.m1.1.1.cmml" xref="S2.SS2.p1.1.m1.1.1"><times id="S2.SS2.p1.1.m1.1.1.2.cmml" xref="S2.SS2.p1.1.m1.1.1.2"></times><ci 
id="S2.SS2.p1.1.m1.1.1.3.cmml" xref="S2.SS2.p1.1.m1.1.1.3">𝑃</ci><apply id="S2.SS2.p1.1.m1.1.1.1.1.1.cmml" xref="S2.SS2.p1.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p1.1.m1.1.1.1.1.1.1.cmml" xref="S2.SS2.p1.1.m1.1.1.1.1">subscript</csymbol><ci id="S2.SS2.p1.1.m1.1.1.1.1.1.2.cmml" xref="S2.SS2.p1.1.m1.1.1.1.1.1.2">𝐸</ci><ci id="S2.SS2.p1.1.m1.1.1.1.1.1.3.cmml" xref="S2.SS2.p1.1.m1.1.1.1.1.1.3">𝑣</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.1.m1.1c">P(E_{v})</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.1.m1.1d">italic_P ( italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT )</annotation></semantics></math> and <math alttext="P(E_{l})" class="ltx_Math" display="inline" id="S2.SS2.p1.2.m2.1"><semantics id="S2.SS2.p1.2.m2.1a"><mrow id="S2.SS2.p1.2.m2.1.1" xref="S2.SS2.p1.2.m2.1.1.cmml"><mi id="S2.SS2.p1.2.m2.1.1.3" xref="S2.SS2.p1.2.m2.1.1.3.cmml">P</mi><mo id="S2.SS2.p1.2.m2.1.1.2" xref="S2.SS2.p1.2.m2.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p1.2.m2.1.1.1.1" xref="S2.SS2.p1.2.m2.1.1.1.1.1.cmml"><mo id="S2.SS2.p1.2.m2.1.1.1.1.2" stretchy="false" xref="S2.SS2.p1.2.m2.1.1.1.1.1.cmml">(</mo><msub id="S2.SS2.p1.2.m2.1.1.1.1.1" xref="S2.SS2.p1.2.m2.1.1.1.1.1.cmml"><mi id="S2.SS2.p1.2.m2.1.1.1.1.1.2" xref="S2.SS2.p1.2.m2.1.1.1.1.1.2.cmml">E</mi><mi id="S2.SS2.p1.2.m2.1.1.1.1.1.3" xref="S2.SS2.p1.2.m2.1.1.1.1.1.3.cmml">l</mi></msub><mo id="S2.SS2.p1.2.m2.1.1.1.1.3" stretchy="false" xref="S2.SS2.p1.2.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.2.m2.1b"><apply id="S2.SS2.p1.2.m2.1.1.cmml" xref="S2.SS2.p1.2.m2.1.1"><times id="S2.SS2.p1.2.m2.1.1.2.cmml" xref="S2.SS2.p1.2.m2.1.1.2"></times><ci id="S2.SS2.p1.2.m2.1.1.3.cmml" xref="S2.SS2.p1.2.m2.1.1.3">𝑃</ci><apply id="S2.SS2.p1.2.m2.1.1.1.1.1.cmml" xref="S2.SS2.p1.2.m2.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p1.2.m2.1.1.1.1.1.1.cmml" xref="S2.SS2.p1.2.m2.1.1.1.1">subscript</csymbol><ci 
id="S2.SS2.p1.2.m2.1.1.1.1.1.2.cmml" xref="S2.SS2.p1.2.m2.1.1.1.1.1.2">𝐸</ci><ci id="S2.SS2.p1.2.m2.1.1.1.1.1.3.cmml" xref="S2.SS2.p1.2.m2.1.1.1.1.1.3">𝑙</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.2.m2.1c">P(E_{l})</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.2.m2.1d">italic_P ( italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT )</annotation></semantics></math> are linear projections that halve the feature dimension, so that <math alttext="E_{\text{concat}}" class="ltx_Math" display="inline" id="S2.SS2.p1.3.m3.1"><semantics id="S2.SS2.p1.3.m3.1a"><msub id="S2.SS2.p1.3.m3.1.1" xref="S2.SS2.p1.3.m3.1.1.cmml"><mi id="S2.SS2.p1.3.m3.1.1.2" xref="S2.SS2.p1.3.m3.1.1.2.cmml">E</mi><mtext id="S2.SS2.p1.3.m3.1.1.3" xref="S2.SS2.p1.3.m3.1.1.3a.cmml">concat</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.3.m3.1b"><apply id="S2.SS2.p1.3.m3.1.1.cmml" xref="S2.SS2.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S2.SS2.p1.3.m3.1.1.1.cmml" xref="S2.SS2.p1.3.m3.1.1">subscript</csymbol><ci id="S2.SS2.p1.3.m3.1.1.2.cmml" xref="S2.SS2.p1.3.m3.1.1.2">𝐸</ci><ci id="S2.SS2.p1.3.m3.1.1.3a.cmml" xref="S2.SS2.p1.3.m3.1.1.3"><mtext id="S2.SS2.p1.3.m3.1.1.3.cmml" mathsize="70%" xref="S2.SS2.p1.3.m3.1.1.3">concat</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.3.m3.1c">E_{\text{concat}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.3.m3.1d">italic_E start_POSTSUBSCRIPT concat end_POSTSUBSCRIPT</annotation></semantics></math> retains the original embedding size.</p> </div> <div class="ltx_para" id="S2.SS2.p2"> <p class="ltx_p" id="S2.SS2.p2.3">We use positional encoding and a projection layer <math alttext="\tau_{\theta}" class="ltx_Math" display="inline" id="S2.SS2.p2.1.m1.1"><semantics id="S2.SS2.p2.1.m1.1a"><msub id="S2.SS2.p2.1.m1.1.1" xref="S2.SS2.p2.1.m1.1.1.cmml"><mi id="S2.SS2.p2.1.m1.1.1.2" xref="S2.SS2.p2.1.m1.1.1.2.cmml">τ</mi><mi 
id="S2.SS2.p2.1.m1.1.1.3" xref="S2.SS2.p2.1.m1.1.1.3.cmml">θ</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.1.m1.1b"><apply id="S2.SS2.p2.1.m1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.1.m1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1">subscript</csymbol><ci id="S2.SS2.p2.1.m1.1.1.2.cmml" xref="S2.SS2.p2.1.m1.1.1.2">𝜏</ci><ci id="S2.SS2.p2.1.m1.1.1.3.cmml" xref="S2.SS2.p2.1.m1.1.1.3">𝜃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.1.m1.1c">\tau_{\theta}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.1.m1.1d">italic_τ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT</annotation></semantics></math> to map <math alttext="E_{\text{concat}}" class="ltx_Math" display="inline" id="S2.SS2.p2.2.m2.1"><semantics id="S2.SS2.p2.2.m2.1a"><msub id="S2.SS2.p2.2.m2.1.1" xref="S2.SS2.p2.2.m2.1.1.cmml"><mi id="S2.SS2.p2.2.m2.1.1.2" xref="S2.SS2.p2.2.m2.1.1.2.cmml">E</mi><mtext id="S2.SS2.p2.2.m2.1.1.3" xref="S2.SS2.p2.2.m2.1.1.3a.cmml">concat</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.2.m2.1b"><apply id="S2.SS2.p2.2.m2.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.2.m2.1.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1">subscript</csymbol><ci id="S2.SS2.p2.2.m2.1.1.2.cmml" xref="S2.SS2.p2.2.m2.1.1.2">𝐸</ci><ci id="S2.SS2.p2.2.m2.1.1.3a.cmml" xref="S2.SS2.p2.2.m2.1.1.3"><mtext id="S2.SS2.p2.2.m2.1.1.3.cmml" mathsize="70%" xref="S2.SS2.p2.2.m2.1.1.3">concat</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.2.m2.1c">E_{\text{concat}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.2.m2.1d">italic_E start_POSTSUBSCRIPT concat end_POSTSUBSCRIPT</annotation></semantics></math> to the appropriate dimension, expressed as <math alttext="{E}_{\text{mix}}^{\text{concat}}=\tau_{\theta}(E_{\text{concat}})=MLP(E_{\text% {concat}}+PE)" class="ltx_Math" display="inline" 
id="S2.SS2.p2.3.m3.2"><semantics id="S2.SS2.p2.3.m3.2a"><mrow id="S2.SS2.p2.3.m3.2.2" xref="S2.SS2.p2.3.m3.2.2.cmml"><msubsup id="S2.SS2.p2.3.m3.2.2.4" xref="S2.SS2.p2.3.m3.2.2.4.cmml"><mi id="S2.SS2.p2.3.m3.2.2.4.2.2" xref="S2.SS2.p2.3.m3.2.2.4.2.2.cmml">E</mi><mtext id="S2.SS2.p2.3.m3.2.2.4.2.3" xref="S2.SS2.p2.3.m3.2.2.4.2.3a.cmml">mix</mtext><mtext id="S2.SS2.p2.3.m3.2.2.4.3" xref="S2.SS2.p2.3.m3.2.2.4.3a.cmml">concat</mtext></msubsup><mo id="S2.SS2.p2.3.m3.2.2.5" xref="S2.SS2.p2.3.m3.2.2.5.cmml">=</mo><mrow id="S2.SS2.p2.3.m3.1.1.1" xref="S2.SS2.p2.3.m3.1.1.1.cmml"><msub id="S2.SS2.p2.3.m3.1.1.1.3" xref="S2.SS2.p2.3.m3.1.1.1.3.cmml"><mi id="S2.SS2.p2.3.m3.1.1.1.3.2" xref="S2.SS2.p2.3.m3.1.1.1.3.2.cmml">τ</mi><mi id="S2.SS2.p2.3.m3.1.1.1.3.3" xref="S2.SS2.p2.3.m3.1.1.1.3.3.cmml">θ</mi></msub><mo id="S2.SS2.p2.3.m3.1.1.1.2" xref="S2.SS2.p2.3.m3.1.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p2.3.m3.1.1.1.1.1" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.cmml"><mo id="S2.SS2.p2.3.m3.1.1.1.1.1.2" stretchy="false" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.cmml">(</mo><msub id="S2.SS2.p2.3.m3.1.1.1.1.1.1" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.cmml"><mi id="S2.SS2.p2.3.m3.1.1.1.1.1.1.2" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.2.cmml">E</mi><mtext id="S2.SS2.p2.3.m3.1.1.1.1.1.1.3" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.3a.cmml">concat</mtext></msub><mo id="S2.SS2.p2.3.m3.1.1.1.1.1.3" stretchy="false" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.SS2.p2.3.m3.2.2.6" xref="S2.SS2.p2.3.m3.2.2.6.cmml">=</mo><mrow id="S2.SS2.p2.3.m3.2.2.2" xref="S2.SS2.p2.3.m3.2.2.2.cmml"><mi id="S2.SS2.p2.3.m3.2.2.2.3" xref="S2.SS2.p2.3.m3.2.2.2.3.cmml">M</mi><mo id="S2.SS2.p2.3.m3.2.2.2.2" xref="S2.SS2.p2.3.m3.2.2.2.2.cmml">⁢</mo><mi id="S2.SS2.p2.3.m3.2.2.2.4" xref="S2.SS2.p2.3.m3.2.2.2.4.cmml">L</mi><mo id="S2.SS2.p2.3.m3.2.2.2.2a" xref="S2.SS2.p2.3.m3.2.2.2.2.cmml">⁢</mo><mi id="S2.SS2.p2.3.m3.2.2.2.5" xref="S2.SS2.p2.3.m3.2.2.2.5.cmml">P</mi><mo id="S2.SS2.p2.3.m3.2.2.2.2b" 
xref="S2.SS2.p2.3.m3.2.2.2.2.cmml">⁢</mo><mrow id="S2.SS2.p2.3.m3.2.2.2.1.1" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.cmml"><mo id="S2.SS2.p2.3.m3.2.2.2.1.1.2" stretchy="false" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.cmml">(</mo><mrow id="S2.SS2.p2.3.m3.2.2.2.1.1.1" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.cmml"><msub id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.cmml"><mi id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.2" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.2.cmml">E</mi><mtext id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.3" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.3a.cmml">concat</mtext></msub><mo id="S2.SS2.p2.3.m3.2.2.2.1.1.1.1" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.1.cmml">+</mo><mrow id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.cmml"><mi id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.2" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.2.cmml">P</mi><mo id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.1" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.1.cmml">⁢</mo><mi id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.3" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.3.cmml">E</mi></mrow></mrow><mo id="S2.SS2.p2.3.m3.2.2.2.1.1.3" stretchy="false" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.3.m3.2b"><apply id="S2.SS2.p2.3.m3.2.2.cmml" xref="S2.SS2.p2.3.m3.2.2"><and id="S2.SS2.p2.3.m3.2.2a.cmml" xref="S2.SS2.p2.3.m3.2.2"></and><apply id="S2.SS2.p2.3.m3.2.2b.cmml" xref="S2.SS2.p2.3.m3.2.2"><eq id="S2.SS2.p2.3.m3.2.2.5.cmml" xref="S2.SS2.p2.3.m3.2.2.5"></eq><apply id="S2.SS2.p2.3.m3.2.2.4.cmml" xref="S2.SS2.p2.3.m3.2.2.4"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m3.2.2.4.1.cmml" xref="S2.SS2.p2.3.m3.2.2.4">superscript</csymbol><apply id="S2.SS2.p2.3.m3.2.2.4.2.cmml" xref="S2.SS2.p2.3.m3.2.2.4"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m3.2.2.4.2.1.cmml" xref="S2.SS2.p2.3.m3.2.2.4">subscript</csymbol><ci id="S2.SS2.p2.3.m3.2.2.4.2.2.cmml" xref="S2.SS2.p2.3.m3.2.2.4.2.2">𝐸</ci><ci id="S2.SS2.p2.3.m3.2.2.4.2.3a.cmml" xref="S2.SS2.p2.3.m3.2.2.4.2.3"><mtext id="S2.SS2.p2.3.m3.2.2.4.2.3.cmml" 
mathsize="70%" xref="S2.SS2.p2.3.m3.2.2.4.2.3">mix</mtext></ci></apply><ci id="S2.SS2.p2.3.m3.2.2.4.3a.cmml" xref="S2.SS2.p2.3.m3.2.2.4.3"><mtext id="S2.SS2.p2.3.m3.2.2.4.3.cmml" mathsize="70%" xref="S2.SS2.p2.3.m3.2.2.4.3">concat</mtext></ci></apply><apply id="S2.SS2.p2.3.m3.1.1.1.cmml" xref="S2.SS2.p2.3.m3.1.1.1"><times id="S2.SS2.p2.3.m3.1.1.1.2.cmml" xref="S2.SS2.p2.3.m3.1.1.1.2"></times><apply id="S2.SS2.p2.3.m3.1.1.1.3.cmml" xref="S2.SS2.p2.3.m3.1.1.1.3"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m3.1.1.1.3.1.cmml" xref="S2.SS2.p2.3.m3.1.1.1.3">subscript</csymbol><ci id="S2.SS2.p2.3.m3.1.1.1.3.2.cmml" xref="S2.SS2.p2.3.m3.1.1.1.3.2">𝜏</ci><ci id="S2.SS2.p2.3.m3.1.1.1.3.3.cmml" xref="S2.SS2.p2.3.m3.1.1.1.3.3">𝜃</ci></apply><apply id="S2.SS2.p2.3.m3.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.3.m3.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m3.1.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.3.m3.1.1.1.1.1">subscript</csymbol><ci id="S2.SS2.p2.3.m3.1.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.2">𝐸</ci><ci id="S2.SS2.p2.3.m3.1.1.1.1.1.1.3a.cmml" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.3"><mtext id="S2.SS2.p2.3.m3.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S2.SS2.p2.3.m3.1.1.1.1.1.1.3">concat</mtext></ci></apply></apply></apply><apply id="S2.SS2.p2.3.m3.2.2c.cmml" xref="S2.SS2.p2.3.m3.2.2"><eq id="S2.SS2.p2.3.m3.2.2.6.cmml" xref="S2.SS2.p2.3.m3.2.2.6"></eq><share href="https://arxiv.org/html/2503.10700v1#S2.SS2.p2.3.m3.1.1.1.cmml" id="S2.SS2.p2.3.m3.2.2d.cmml" xref="S2.SS2.p2.3.m3.2.2"></share><apply id="S2.SS2.p2.3.m3.2.2.2.cmml" xref="S2.SS2.p2.3.m3.2.2.2"><times id="S2.SS2.p2.3.m3.2.2.2.2.cmml" xref="S2.SS2.p2.3.m3.2.2.2.2"></times><ci id="S2.SS2.p2.3.m3.2.2.2.3.cmml" xref="S2.SS2.p2.3.m3.2.2.2.3">𝑀</ci><ci id="S2.SS2.p2.3.m3.2.2.2.4.cmml" xref="S2.SS2.p2.3.m3.2.2.2.4">𝐿</ci><ci id="S2.SS2.p2.3.m3.2.2.2.5.cmml" xref="S2.SS2.p2.3.m3.2.2.2.5">𝑃</ci><apply id="S2.SS2.p2.3.m3.2.2.2.1.1.1.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1"><plus id="S2.SS2.p2.3.m3.2.2.2.1.1.1.1.cmml" 
xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.1"></plus><apply id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.1.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2">subscript</csymbol><ci id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.2.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.2">𝐸</ci><ci id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.3a.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.3"><mtext id="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.2.3">concat</mtext></ci></apply><apply id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3"><times id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.1.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.1"></times><ci id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.2.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.2">𝑃</ci><ci id="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.3.cmml" xref="S2.SS2.p2.3.m3.2.2.2.1.1.1.3.3">𝐸</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.3.m3.2c">{E}_{\text{mix}}^{\text{concat}}=\tau_{\theta}(E_{\text{concat}})=MLP(E_{\text% {concat}}+PE)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.3.m3.2d">italic_E start_POSTSUBSCRIPT mix end_POSTSUBSCRIPT start_POSTSUPERSCRIPT concat end_POSTSUPERSCRIPT = italic_τ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_E start_POSTSUBSCRIPT concat end_POSTSUBSCRIPT ) = italic_M italic_L italic_P ( italic_E start_POSTSUBSCRIPT concat end_POSTSUBSCRIPT + italic_P italic_E )</annotation></semantics></math>, where MLP serves as the projection layer and PE denotes positional encoding. 
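</p> <p class="ltx_p">The two fusion strategies, Eq. (6) and Eq. (7) followed by the projection layer, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the weight matrices <span class="ltx_text ltx_font_typewriter">w_v</span>, <span class="ltx_text ltx_font_typewriter">w_l</span>, <span class="ltx_text ltx_font_typewriter">w1</span>, <span class="ltx_text ltx_font_typewriter">w2</span>, the sinusoidal form of the positional encoding, and the two-layer ReLU MLP are assumptions made for illustration, and both embeddings are assumed to share the shape (sequence length, feature dimension).</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import numpy as np


def average_fusion(e_l, e_v, lam, mu):
    """Eq. (6): weighted average of text (e_l) and video (e_v) embeddings."""
    return (e_l + (lam + mu) * e_v) / (1.0 + lam + mu)


def sinusoidal_pe(length, dim):
    """Sinusoidal positional encoding (an assumed choice for PE)."""
    pos = np.arange(length)[:, None]
    idx = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    # Even feature indices use sine, odd feature indices use cosine.
    return np.where(idx % 2 == 0, np.sin(angle), np.cos(angle))


def concat_fusion(e_l, e_v, w_v, w_l, w1, w2):
    """Eq. (7) followed by tau_theta, i.e. MLP(E_concat + PE)."""
    # P(E_v) and P(E_l): linear projections that halve the feature dimension,
    # so the concatenation has the same width as each input embedding.
    e_concat = np.concatenate([e_v @ w_v, e_l @ w_l], axis=-1)
    x = e_concat + sinusoidal_pe(*e_concat.shape)
    # Hypothetical two-layer MLP with ReLU standing in for the projection layer.
    return np.maximum(x @ w1, 0.0) @ w2
```
</pre>
<p class="ltx_p">In this sketch, setting both weights to zero in <span class="ltx_text ltx_font_typewriter">average_fusion</span> returns the text embedding unchanged, which follows directly from Eq. (6).</p> <p class="ltx_p">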
This hybrid feature representation serves as the conditioning input to the LDM during both training and inference.</p> </div> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS3.5.1.1">II-C</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS3.6.2">Latent Diffusion Model</span> </h3> <div class="ltx_para" id="S2.SS3.p1"> <p class="ltx_p" id="S2.SS3.p1.3">The LDM generates high-dimensional audio data by operating in a lower-dimensional latent space <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib21" title="">21</a>]</cite>. Starting from an encoded Mel-spectrogram <math alttext="z_{0}=\mathcal{E}(x_{a})" class="ltx_Math" display="inline" id="S2.SS3.p1.1.m1.1"><semantics id="S2.SS3.p1.1.m1.1a"><mrow id="S2.SS3.p1.1.m1.1.1" xref="S2.SS3.p1.1.m1.1.1.cmml"><msub id="S2.SS3.p1.1.m1.1.1.3" xref="S2.SS3.p1.1.m1.1.1.3.cmml"><mi id="S2.SS3.p1.1.m1.1.1.3.2" xref="S2.SS3.p1.1.m1.1.1.3.2.cmml">z</mi><mn id="S2.SS3.p1.1.m1.1.1.3.3" xref="S2.SS3.p1.1.m1.1.1.3.3.cmml">0</mn></msub><mo id="S2.SS3.p1.1.m1.1.1.2" xref="S2.SS3.p1.1.m1.1.1.2.cmml">=</mo><mrow id="S2.SS3.p1.1.m1.1.1.1" xref="S2.SS3.p1.1.m1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.SS3.p1.1.m1.1.1.1.3" xref="S2.SS3.p1.1.m1.1.1.1.3.cmml">ℰ</mi><mo id="S2.SS3.p1.1.m1.1.1.1.2" xref="S2.SS3.p1.1.m1.1.1.1.2.cmml">⁢</mo><mrow id="S2.SS3.p1.1.m1.1.1.1.1.1" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.cmml"><mo id="S2.SS3.p1.1.m1.1.1.1.1.1.2" stretchy="false" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.cmml">(</mo><msub id="S2.SS3.p1.1.m1.1.1.1.1.1.1" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.cmml"><mi id="S2.SS3.p1.1.m1.1.1.1.1.1.1.2" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.2.cmml">x</mi><mi id="S2.SS3.p1.1.m1.1.1.1.1.1.1.3" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.3.cmml">a</mi></msub><mo id="S2.SS3.p1.1.m1.1.1.1.1.1.3" stretchy="false" 
xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.1.m1.1b"><apply id="S2.SS3.p1.1.m1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1"><eq id="S2.SS3.p1.1.m1.1.1.2.cmml" xref="S2.SS3.p1.1.m1.1.1.2"></eq><apply id="S2.SS3.p1.1.m1.1.1.3.cmml" xref="S2.SS3.p1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S2.SS3.p1.1.m1.1.1.3.1.cmml" xref="S2.SS3.p1.1.m1.1.1.3">subscript</csymbol><ci id="S2.SS3.p1.1.m1.1.1.3.2.cmml" xref="S2.SS3.p1.1.m1.1.1.3.2">𝑧</ci><cn id="S2.SS3.p1.1.m1.1.1.3.3.cmml" type="integer" xref="S2.SS3.p1.1.m1.1.1.3.3">0</cn></apply><apply id="S2.SS3.p1.1.m1.1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1.1"><times id="S2.SS3.p1.1.m1.1.1.1.2.cmml" xref="S2.SS3.p1.1.m1.1.1.1.2"></times><ci id="S2.SS3.p1.1.m1.1.1.1.3.cmml" xref="S2.SS3.p1.1.m1.1.1.1.3">ℰ</ci><apply id="S2.SS3.p1.1.m1.1.1.1.1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.1.m1.1.1.1.1.1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1.1.1.1">subscript</csymbol><ci id="S2.SS3.p1.1.m1.1.1.1.1.1.1.2.cmml" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.2">𝑥</ci><ci id="S2.SS3.p1.1.m1.1.1.1.1.1.1.3.cmml" xref="S2.SS3.p1.1.m1.1.1.1.1.1.1.3">𝑎</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.1.m1.1c">z_{0}=\mathcal{E}(x_{a})</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.1.m1.1d">italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = caligraphic_E ( italic_x start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT )</annotation></semantics></math>, the diffusion process is conducted in the latent space, where noise is progressively added to <math alttext="z_{0}" class="ltx_Math" display="inline" id="S2.SS3.p1.2.m2.1"><semantics id="S2.SS3.p1.2.m2.1a"><msub id="S2.SS3.p1.2.m2.1.1" xref="S2.SS3.p1.2.m2.1.1.cmml"><mi id="S2.SS3.p1.2.m2.1.1.2" xref="S2.SS3.p1.2.m2.1.1.2.cmml">z</mi><mn id="S2.SS3.p1.2.m2.1.1.3" xref="S2.SS3.p1.2.m2.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" 
id="S2.SS3.p1.2.m2.1b"><apply id="S2.SS3.p1.2.m2.1.1.cmml" xref="S2.SS3.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.2.m2.1.1.1.cmml" xref="S2.SS3.p1.2.m2.1.1">subscript</csymbol><ci id="S2.SS3.p1.2.m2.1.1.2.cmml" xref="S2.SS3.p1.2.m2.1.1.2">𝑧</ci><cn id="S2.SS3.p1.2.m2.1.1.3.cmml" type="integer" xref="S2.SS3.p1.2.m2.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.2.m2.1c">z_{0}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.2.m2.1d">italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math>, forming a sequence of latent variables <math alttext="z_{1},z_{2},\ldots,z_{T}" class="ltx_Math" display="inline" id="S2.SS3.p1.3.m3.4"><semantics id="S2.SS3.p1.3.m3.4a"><mrow id="S2.SS3.p1.3.m3.4.4.3" xref="S2.SS3.p1.3.m3.4.4.4.cmml"><msub id="S2.SS3.p1.3.m3.2.2.1.1" xref="S2.SS3.p1.3.m3.2.2.1.1.cmml"><mi id="S2.SS3.p1.3.m3.2.2.1.1.2" xref="S2.SS3.p1.3.m3.2.2.1.1.2.cmml">z</mi><mn id="S2.SS3.p1.3.m3.2.2.1.1.3" xref="S2.SS3.p1.3.m3.2.2.1.1.3.cmml">1</mn></msub><mo id="S2.SS3.p1.3.m3.4.4.3.4" xref="S2.SS3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.SS3.p1.3.m3.3.3.2.2" xref="S2.SS3.p1.3.m3.3.3.2.2.cmml"><mi id="S2.SS3.p1.3.m3.3.3.2.2.2" xref="S2.SS3.p1.3.m3.3.3.2.2.2.cmml">z</mi><mn id="S2.SS3.p1.3.m3.3.3.2.2.3" xref="S2.SS3.p1.3.m3.3.3.2.2.3.cmml">2</mn></msub><mo id="S2.SS3.p1.3.m3.4.4.3.5" xref="S2.SS3.p1.3.m3.4.4.4.cmml">,</mo><mi id="S2.SS3.p1.3.m3.1.1" mathvariant="normal" xref="S2.SS3.p1.3.m3.1.1.cmml">…</mi><mo id="S2.SS3.p1.3.m3.4.4.3.6" xref="S2.SS3.p1.3.m3.4.4.4.cmml">,</mo><msub id="S2.SS3.p1.3.m3.4.4.3.3" xref="S2.SS3.p1.3.m3.4.4.3.3.cmml"><mi id="S2.SS3.p1.3.m3.4.4.3.3.2" xref="S2.SS3.p1.3.m3.4.4.3.3.2.cmml">z</mi><mi id="S2.SS3.p1.3.m3.4.4.3.3.3" xref="S2.SS3.p1.3.m3.4.4.3.3.3.cmml">T</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.3.m3.4b"><list id="S2.SS3.p1.3.m3.4.4.4.cmml" xref="S2.SS3.p1.3.m3.4.4.3"><apply id="S2.SS3.p1.3.m3.2.2.1.1.cmml" 
xref="S2.SS3.p1.3.m3.2.2.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.3.m3.2.2.1.1.1.cmml" xref="S2.SS3.p1.3.m3.2.2.1.1">subscript</csymbol><ci id="S2.SS3.p1.3.m3.2.2.1.1.2.cmml" xref="S2.SS3.p1.3.m3.2.2.1.1.2">𝑧</ci><cn id="S2.SS3.p1.3.m3.2.2.1.1.3.cmml" type="integer" xref="S2.SS3.p1.3.m3.2.2.1.1.3">1</cn></apply><apply id="S2.SS3.p1.3.m3.3.3.2.2.cmml" xref="S2.SS3.p1.3.m3.3.3.2.2"><csymbol cd="ambiguous" id="S2.SS3.p1.3.m3.3.3.2.2.1.cmml" xref="S2.SS3.p1.3.m3.3.3.2.2">subscript</csymbol><ci id="S2.SS3.p1.3.m3.3.3.2.2.2.cmml" xref="S2.SS3.p1.3.m3.3.3.2.2.2">𝑧</ci><cn id="S2.SS3.p1.3.m3.3.3.2.2.3.cmml" type="integer" xref="S2.SS3.p1.3.m3.3.3.2.2.3">2</cn></apply><ci id="S2.SS3.p1.3.m3.1.1.cmml" xref="S2.SS3.p1.3.m3.1.1">…</ci><apply id="S2.SS3.p1.3.m3.4.4.3.3.cmml" xref="S2.SS3.p1.3.m3.4.4.3.3"><csymbol cd="ambiguous" id="S2.SS3.p1.3.m3.4.4.3.3.1.cmml" xref="S2.SS3.p1.3.m3.4.4.3.3">subscript</csymbol><ci id="S2.SS3.p1.3.m3.4.4.3.3.2.cmml" xref="S2.SS3.p1.3.m3.4.4.3.3.2">𝑧</ci><ci id="S2.SS3.p1.3.m3.4.4.3.3.3.cmml" xref="S2.SS3.p1.3.m3.4.4.3.3.3">𝑇</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.3.m3.4c">z_{1},z_{2},\ldots,z_{T}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.3.m3.4d">italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT</annotation></semantics></math>. 
Each diffusion step is modeled as:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E8"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="q(z_{t}|z_{t-1})=\mathcal{N}(z_{t};\sqrt{1-\beta_{t}}z_{t-1},\beta_{t}\mathbf{% I})" class="ltx_Math" display="block" id="S2.E8.m1.4"><semantics id="S2.E8.m1.4a"><mrow id="S2.E8.m1.4.4" xref="S2.E8.m1.4.4.cmml"><mrow id="S2.E8.m1.1.1.1" xref="S2.E8.m1.1.1.1.cmml"><mi id="S2.E8.m1.1.1.1.3" xref="S2.E8.m1.1.1.1.3.cmml">q</mi><mo id="S2.E8.m1.1.1.1.2" xref="S2.E8.m1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E8.m1.1.1.1.1.1" xref="S2.E8.m1.1.1.1.1.1.1.cmml"><mo id="S2.E8.m1.1.1.1.1.1.2" stretchy="false" xref="S2.E8.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E8.m1.1.1.1.1.1.1" xref="S2.E8.m1.1.1.1.1.1.1.cmml"><msub id="S2.E8.m1.1.1.1.1.1.1.2" xref="S2.E8.m1.1.1.1.1.1.1.2.cmml"><mi id="S2.E8.m1.1.1.1.1.1.1.2.2" xref="S2.E8.m1.1.1.1.1.1.1.2.2.cmml">z</mi><mi id="S2.E8.m1.1.1.1.1.1.1.2.3" xref="S2.E8.m1.1.1.1.1.1.1.2.3.cmml">t</mi></msub><mo fence="false" id="S2.E8.m1.1.1.1.1.1.1.1" xref="S2.E8.m1.1.1.1.1.1.1.1.cmml">|</mo><msub id="S2.E8.m1.1.1.1.1.1.1.3" xref="S2.E8.m1.1.1.1.1.1.1.3.cmml"><mi id="S2.E8.m1.1.1.1.1.1.1.3.2" xref="S2.E8.m1.1.1.1.1.1.1.3.2.cmml">z</mi><mrow id="S2.E8.m1.1.1.1.1.1.1.3.3" xref="S2.E8.m1.1.1.1.1.1.1.3.3.cmml"><mi id="S2.E8.m1.1.1.1.1.1.1.3.3.2" xref="S2.E8.m1.1.1.1.1.1.1.3.3.2.cmml">t</mi><mo id="S2.E8.m1.1.1.1.1.1.1.3.3.1" xref="S2.E8.m1.1.1.1.1.1.1.3.3.1.cmml">−</mo><mn id="S2.E8.m1.1.1.1.1.1.1.3.3.3" xref="S2.E8.m1.1.1.1.1.1.1.3.3.3.cmml">1</mn></mrow></msub></mrow><mo id="S2.E8.m1.1.1.1.1.1.3" stretchy="false" xref="S2.E8.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E8.m1.4.4.5" xref="S2.E8.m1.4.4.5.cmml">=</mo><mrow id="S2.E8.m1.4.4.4" xref="S2.E8.m1.4.4.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E8.m1.4.4.4.5" xref="S2.E8.m1.4.4.4.5.cmml">𝒩</mi><mo id="S2.E8.m1.4.4.4.4" 
xref="S2.E8.m1.4.4.4.4.cmml">⁢</mo><mrow id="S2.E8.m1.4.4.4.3.3" xref="S2.E8.m1.4.4.4.3.4.cmml"><mo id="S2.E8.m1.4.4.4.3.3.4" stretchy="false" xref="S2.E8.m1.4.4.4.3.4.cmml">(</mo><msub id="S2.E8.m1.2.2.2.1.1.1" xref="S2.E8.m1.2.2.2.1.1.1.cmml"><mi id="S2.E8.m1.2.2.2.1.1.1.2" xref="S2.E8.m1.2.2.2.1.1.1.2.cmml">z</mi><mi id="S2.E8.m1.2.2.2.1.1.1.3" xref="S2.E8.m1.2.2.2.1.1.1.3.cmml">t</mi></msub><mo id="S2.E8.m1.4.4.4.3.3.5" xref="S2.E8.m1.4.4.4.3.4.cmml">;</mo><mrow id="S2.E8.m1.3.3.3.2.2.2" xref="S2.E8.m1.3.3.3.2.2.2.cmml"><msqrt id="S2.E8.m1.3.3.3.2.2.2.2" xref="S2.E8.m1.3.3.3.2.2.2.2.cmml"><mrow id="S2.E8.m1.3.3.3.2.2.2.2.2" xref="S2.E8.m1.3.3.3.2.2.2.2.2.cmml"><mn id="S2.E8.m1.3.3.3.2.2.2.2.2.2" xref="S2.E8.m1.3.3.3.2.2.2.2.2.2.cmml">1</mn><mo id="S2.E8.m1.3.3.3.2.2.2.2.2.1" xref="S2.E8.m1.3.3.3.2.2.2.2.2.1.cmml">−</mo><msub id="S2.E8.m1.3.3.3.2.2.2.2.2.3" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3.cmml"><mi id="S2.E8.m1.3.3.3.2.2.2.2.2.3.2" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3.2.cmml">β</mi><mi id="S2.E8.m1.3.3.3.2.2.2.2.2.3.3" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3.3.cmml">t</mi></msub></mrow></msqrt><mo id="S2.E8.m1.3.3.3.2.2.2.1" xref="S2.E8.m1.3.3.3.2.2.2.1.cmml">⁢</mo><msub id="S2.E8.m1.3.3.3.2.2.2.3" xref="S2.E8.m1.3.3.3.2.2.2.3.cmml"><mi id="S2.E8.m1.3.3.3.2.2.2.3.2" xref="S2.E8.m1.3.3.3.2.2.2.3.2.cmml">z</mi><mrow id="S2.E8.m1.3.3.3.2.2.2.3.3" xref="S2.E8.m1.3.3.3.2.2.2.3.3.cmml"><mi id="S2.E8.m1.3.3.3.2.2.2.3.3.2" xref="S2.E8.m1.3.3.3.2.2.2.3.3.2.cmml">t</mi><mo id="S2.E8.m1.3.3.3.2.2.2.3.3.1" xref="S2.E8.m1.3.3.3.2.2.2.3.3.1.cmml">−</mo><mn id="S2.E8.m1.3.3.3.2.2.2.3.3.3" xref="S2.E8.m1.3.3.3.2.2.2.3.3.3.cmml">1</mn></mrow></msub></mrow><mo id="S2.E8.m1.4.4.4.3.3.6" xref="S2.E8.m1.4.4.4.3.4.cmml">,</mo><mrow id="S2.E8.m1.4.4.4.3.3.3" xref="S2.E8.m1.4.4.4.3.3.3.cmml"><msub id="S2.E8.m1.4.4.4.3.3.3.2" xref="S2.E8.m1.4.4.4.3.3.3.2.cmml"><mi id="S2.E8.m1.4.4.4.3.3.3.2.2" xref="S2.E8.m1.4.4.4.3.3.3.2.2.cmml">β</mi><mi id="S2.E8.m1.4.4.4.3.3.3.2.3" 
xref="S2.E8.m1.4.4.4.3.3.3.2.3.cmml">t</mi></msub><mo id="S2.E8.m1.4.4.4.3.3.3.1" xref="S2.E8.m1.4.4.4.3.3.3.1.cmml">⁢</mo><mi id="S2.E8.m1.4.4.4.3.3.3.3" xref="S2.E8.m1.4.4.4.3.3.3.3.cmml">𝐈</mi></mrow><mo id="S2.E8.m1.4.4.4.3.3.7" stretchy="false" xref="S2.E8.m1.4.4.4.3.4.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E8.m1.4b"><apply id="S2.E8.m1.4.4.cmml" xref="S2.E8.m1.4.4"><eq id="S2.E8.m1.4.4.5.cmml" xref="S2.E8.m1.4.4.5"></eq><apply id="S2.E8.m1.1.1.1.cmml" xref="S2.E8.m1.1.1.1"><times id="S2.E8.m1.1.1.1.2.cmml" xref="S2.E8.m1.1.1.1.2"></times><ci id="S2.E8.m1.1.1.1.3.cmml" xref="S2.E8.m1.1.1.1.3">𝑞</ci><apply id="S2.E8.m1.1.1.1.1.1.1.cmml" xref="S2.E8.m1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E8.m1.1.1.1.1.1.1.1.cmml" xref="S2.E8.m1.1.1.1.1.1.1.1">conditional</csymbol><apply id="S2.E8.m1.1.1.1.1.1.1.2.cmml" xref="S2.E8.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.E8.m1.1.1.1.1.1.1.2.1.cmml" xref="S2.E8.m1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S2.E8.m1.1.1.1.1.1.1.2.2.cmml" xref="S2.E8.m1.1.1.1.1.1.1.2.2">𝑧</ci><ci id="S2.E8.m1.1.1.1.1.1.1.2.3.cmml" xref="S2.E8.m1.1.1.1.1.1.1.2.3">𝑡</ci></apply><apply id="S2.E8.m1.1.1.1.1.1.1.3.cmml" xref="S2.E8.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E8.m1.1.1.1.1.1.1.3.1.cmml" xref="S2.E8.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S2.E8.m1.1.1.1.1.1.1.3.2.cmml" xref="S2.E8.m1.1.1.1.1.1.1.3.2">𝑧</ci><apply id="S2.E8.m1.1.1.1.1.1.1.3.3.cmml" xref="S2.E8.m1.1.1.1.1.1.1.3.3"><minus id="S2.E8.m1.1.1.1.1.1.1.3.3.1.cmml" xref="S2.E8.m1.1.1.1.1.1.1.3.3.1"></minus><ci id="S2.E8.m1.1.1.1.1.1.1.3.3.2.cmml" xref="S2.E8.m1.1.1.1.1.1.1.3.3.2">𝑡</ci><cn id="S2.E8.m1.1.1.1.1.1.1.3.3.3.cmml" type="integer" xref="S2.E8.m1.1.1.1.1.1.1.3.3.3">1</cn></apply></apply></apply></apply><apply id="S2.E8.m1.4.4.4.cmml" xref="S2.E8.m1.4.4.4"><times id="S2.E8.m1.4.4.4.4.cmml" xref="S2.E8.m1.4.4.4.4"></times><ci id="S2.E8.m1.4.4.4.5.cmml" xref="S2.E8.m1.4.4.4.5">𝒩</ci><list 
id="S2.E8.m1.4.4.4.3.4.cmml" xref="S2.E8.m1.4.4.4.3.3"><apply id="S2.E8.m1.2.2.2.1.1.1.cmml" xref="S2.E8.m1.2.2.2.1.1.1"><csymbol cd="ambiguous" id="S2.E8.m1.2.2.2.1.1.1.1.cmml" xref="S2.E8.m1.2.2.2.1.1.1">subscript</csymbol><ci id="S2.E8.m1.2.2.2.1.1.1.2.cmml" xref="S2.E8.m1.2.2.2.1.1.1.2">𝑧</ci><ci id="S2.E8.m1.2.2.2.1.1.1.3.cmml" xref="S2.E8.m1.2.2.2.1.1.1.3">𝑡</ci></apply><apply id="S2.E8.m1.3.3.3.2.2.2.cmml" xref="S2.E8.m1.3.3.3.2.2.2"><times id="S2.E8.m1.3.3.3.2.2.2.1.cmml" xref="S2.E8.m1.3.3.3.2.2.2.1"></times><apply id="S2.E8.m1.3.3.3.2.2.2.2.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2"><root id="S2.E8.m1.3.3.3.2.2.2.2a.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2"></root><apply id="S2.E8.m1.3.3.3.2.2.2.2.2.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2.2"><minus id="S2.E8.m1.3.3.3.2.2.2.2.2.1.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2.2.1"></minus><cn id="S2.E8.m1.3.3.3.2.2.2.2.2.2.cmml" type="integer" xref="S2.E8.m1.3.3.3.2.2.2.2.2.2">1</cn><apply id="S2.E8.m1.3.3.3.2.2.2.2.2.3.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3"><csymbol cd="ambiguous" id="S2.E8.m1.3.3.3.2.2.2.2.2.3.1.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3">subscript</csymbol><ci id="S2.E8.m1.3.3.3.2.2.2.2.2.3.2.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3.2">𝛽</ci><ci id="S2.E8.m1.3.3.3.2.2.2.2.2.3.3.cmml" xref="S2.E8.m1.3.3.3.2.2.2.2.2.3.3">𝑡</ci></apply></apply></apply><apply id="S2.E8.m1.3.3.3.2.2.2.3.cmml" xref="S2.E8.m1.3.3.3.2.2.2.3"><csymbol cd="ambiguous" id="S2.E8.m1.3.3.3.2.2.2.3.1.cmml" xref="S2.E8.m1.3.3.3.2.2.2.3">subscript</csymbol><ci id="S2.E8.m1.3.3.3.2.2.2.3.2.cmml" xref="S2.E8.m1.3.3.3.2.2.2.3.2">𝑧</ci><apply id="S2.E8.m1.3.3.3.2.2.2.3.3.cmml" xref="S2.E8.m1.3.3.3.2.2.2.3.3"><minus id="S2.E8.m1.3.3.3.2.2.2.3.3.1.cmml" xref="S2.E8.m1.3.3.3.2.2.2.3.3.1"></minus><ci id="S2.E8.m1.3.3.3.2.2.2.3.3.2.cmml" xref="S2.E8.m1.3.3.3.2.2.2.3.3.2">𝑡</ci><cn id="S2.E8.m1.3.3.3.2.2.2.3.3.3.cmml" type="integer" xref="S2.E8.m1.3.3.3.2.2.2.3.3.3">1</cn></apply></apply></apply><apply id="S2.E8.m1.4.4.4.3.3.3.cmml" 
xref="S2.E8.m1.4.4.4.3.3.3"><times id="S2.E8.m1.4.4.4.3.3.3.1.cmml" xref="S2.E8.m1.4.4.4.3.3.3.1"></times><apply id="S2.E8.m1.4.4.4.3.3.3.2.cmml" xref="S2.E8.m1.4.4.4.3.3.3.2"><csymbol cd="ambiguous" id="S2.E8.m1.4.4.4.3.3.3.2.1.cmml" xref="S2.E8.m1.4.4.4.3.3.3.2">subscript</csymbol><ci id="S2.E8.m1.4.4.4.3.3.3.2.2.cmml" xref="S2.E8.m1.4.4.4.3.3.3.2.2">𝛽</ci><ci id="S2.E8.m1.4.4.4.3.3.3.2.3.cmml" xref="S2.E8.m1.4.4.4.3.3.3.2.3">𝑡</ci></apply><ci id="S2.E8.m1.4.4.4.3.3.3.3.cmml" xref="S2.E8.m1.4.4.4.3.3.3.3">𝐈</ci></apply></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E8.m1.4c">q(z_{t}|z_{t-1})=\mathcal{N}(z_{t};\sqrt{1-\beta_{t}}z_{t-1},\beta_{t}\mathbf{% I})</annotation><annotation encoding="application/x-llamapun" id="S2.E8.m1.4d">italic_q ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) = caligraphic_N ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; square-root start_ARG 1 - italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG italic_z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_I )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(8)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS3.p1.8">where <math alttext="t" class="ltx_Math" display="inline" id="S2.SS3.p1.4.m1.1"><semantics id="S2.SS3.p1.4.m1.1a"><mi id="S2.SS3.p1.4.m1.1.1" xref="S2.SS3.p1.4.m1.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.4.m1.1b"><ci id="S2.SS3.p1.4.m1.1.1.cmml" xref="S2.SS3.p1.4.m1.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.4.m1.1c">t</annotation><annotation encoding="application/x-llamapun" 
id="S2.SS3.p1.4.m1.1d">italic_t</annotation></semantics></math> denotes the time step, <math alttext="\beta_{t}" class="ltx_Math" display="inline" id="S2.SS3.p1.5.m2.1"><semantics id="S2.SS3.p1.5.m2.1a"><msub id="S2.SS3.p1.5.m2.1.1" xref="S2.SS3.p1.5.m2.1.1.cmml"><mi id="S2.SS3.p1.5.m2.1.1.2" xref="S2.SS3.p1.5.m2.1.1.2.cmml">β</mi><mi id="S2.SS3.p1.5.m2.1.1.3" xref="S2.SS3.p1.5.m2.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.5.m2.1b"><apply id="S2.SS3.p1.5.m2.1.1.cmml" xref="S2.SS3.p1.5.m2.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.5.m2.1.1.1.cmml" xref="S2.SS3.p1.5.m2.1.1">subscript</csymbol><ci id="S2.SS3.p1.5.m2.1.1.2.cmml" xref="S2.SS3.p1.5.m2.1.1.2">𝛽</ci><ci id="S2.SS3.p1.5.m2.1.1.3.cmml" xref="S2.SS3.p1.5.m2.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.5.m2.1c">\beta_{t}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.5.m2.1d">italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> is the diffusion coefficient, and <math alttext="\mathcal{N}" class="ltx_Math" display="inline" id="S2.SS3.p1.6.m3.1"><semantics id="S2.SS3.p1.6.m3.1a"><mi class="ltx_font_mathcaligraphic" id="S2.SS3.p1.6.m3.1.1" xref="S2.SS3.p1.6.m3.1.1.cmml">𝒩</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.6.m3.1b"><ci id="S2.SS3.p1.6.m3.1.1.cmml" xref="S2.SS3.p1.6.m3.1.1">𝒩</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.6.m3.1c">\mathcal{N}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.6.m3.1d">caligraphic_N</annotation></semantics></math> represents the normal distribution. 
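</p> <p class="ltx_p">The forward step in Eq. (8) can be sampled directly, as the following minimal NumPy sketch shows. The function names, the latent shape, the number of steps, and the linear schedule for the coefficients are assumptions chosen for illustration and are not taken from the paper.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import numpy as np


def forward_step(z_prev, beta_t, rng):
    """Sample z_t from N(sqrt(1 - beta_t) * z_prev, beta_t * I), as in Eq. (8)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * noise


def forward_chain(z0, betas, rng):
    """Apply Eq. (8) once per coefficient and return [z_1, ..., z_T]."""
    latents, z = [], z0
    for beta_t in betas:
        z = forward_step(z, beta_t, rng)
        latents.append(z)
    return latents


rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 16))       # stand-in for an encoded Mel-spectrogram
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule with T = 1000
zs = forward_chain(z0, betas, rng)
```
</pre>
<p class="ltx_p">With a coefficient of zero the step leaves the latent unchanged, and with a coefficient of one it returns pure Gaussian noise regardless of the previous latent.</p> <p class="ltx_p">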
For data generation, the model reverses the diffusion process, starting from a sample <math alttext="z_{T}" class="ltx_Math" display="inline" id="S2.SS3.p1.7.m4.1"><semantics id="S2.SS3.p1.7.m4.1a"><msub id="S2.SS3.p1.7.m4.1.1" xref="S2.SS3.p1.7.m4.1.1.cmml"><mi id="S2.SS3.p1.7.m4.1.1.2" xref="S2.SS3.p1.7.m4.1.1.2.cmml">z</mi><mi id="S2.SS3.p1.7.m4.1.1.3" xref="S2.SS3.p1.7.m4.1.1.3.cmml">T</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.7.m4.1b"><apply id="S2.SS3.p1.7.m4.1.1.cmml" xref="S2.SS3.p1.7.m4.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.7.m4.1.1.1.cmml" xref="S2.SS3.p1.7.m4.1.1">subscript</csymbol><ci id="S2.SS3.p1.7.m4.1.1.2.cmml" xref="S2.SS3.p1.7.m4.1.1.2">𝑧</ci><ci id="S2.SS3.p1.7.m4.1.1.3.cmml" xref="S2.SS3.p1.7.m4.1.1.3">𝑇</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.7.m4.1c">z_{T}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.7.m4.1d">italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT</annotation></semantics></math> drawn from a Gaussian distribution. 
A neural network <math alttext="p_{\theta}" class="ltx_Math" display="inline" id="S2.SS3.p1.8.m5.1"><semantics id="S2.SS3.p1.8.m5.1a"><msub id="S2.SS3.p1.8.m5.1.1" xref="S2.SS3.p1.8.m5.1.1.cmml"><mi id="S2.SS3.p1.8.m5.1.1.2" xref="S2.SS3.p1.8.m5.1.1.2.cmml">p</mi><mi id="S2.SS3.p1.8.m5.1.1.3" xref="S2.SS3.p1.8.m5.1.1.3.cmml">θ</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.8.m5.1b"><apply id="S2.SS3.p1.8.m5.1.1.cmml" xref="S2.SS3.p1.8.m5.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.8.m5.1.1.1.cmml" xref="S2.SS3.p1.8.m5.1.1">subscript</csymbol><ci id="S2.SS3.p1.8.m5.1.1.2.cmml" xref="S2.SS3.p1.8.m5.1.1.2">𝑝</ci><ci id="S2.SS3.p1.8.m5.1.1.3.cmml" xref="S2.SS3.p1.8.m5.1.1.3">𝜃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.8.m5.1c">p_{\theta}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.8.m5.1d">italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT</annotation></semantics></math> predicts the reverse step at each timestep:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E9"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="p_{\theta}(z_{t-1}|z_{t})=\mathcal{N}\left(z_{t-1};\mu_{\theta}(z_{t},t,E_{% \text{mix}}),\sigma_{t}^{2}\mathbf{I}\right)" class="ltx_Math" display="block" id="S2.E9.m1.5"><semantics id="S2.E9.m1.5a"><mrow id="S2.E9.m1.5.5" xref="S2.E9.m1.5.5.cmml"><mrow id="S2.E9.m1.2.2.1" xref="S2.E9.m1.2.2.1.cmml"><msub id="S2.E9.m1.2.2.1.3" xref="S2.E9.m1.2.2.1.3.cmml"><mi id="S2.E9.m1.2.2.1.3.2" xref="S2.E9.m1.2.2.1.3.2.cmml">p</mi><mi id="S2.E9.m1.2.2.1.3.3" xref="S2.E9.m1.2.2.1.3.3.cmml">θ</mi></msub><mo id="S2.E9.m1.2.2.1.2" xref="S2.E9.m1.2.2.1.2.cmml">⁢</mo><mrow id="S2.E9.m1.2.2.1.1.1" xref="S2.E9.m1.2.2.1.1.1.1.cmml"><mo id="S2.E9.m1.2.2.1.1.1.2" stretchy="false" xref="S2.E9.m1.2.2.1.1.1.1.cmml">(</mo><mrow id="S2.E9.m1.2.2.1.1.1.1" 
xref="S2.E9.m1.2.2.1.1.1.1.cmml"><msub id="S2.E9.m1.2.2.1.1.1.1.2" xref="S2.E9.m1.2.2.1.1.1.1.2.cmml"><mi id="S2.E9.m1.2.2.1.1.1.1.2.2" xref="S2.E9.m1.2.2.1.1.1.1.2.2.cmml">z</mi><mrow id="S2.E9.m1.2.2.1.1.1.1.2.3" xref="S2.E9.m1.2.2.1.1.1.1.2.3.cmml"><mi id="S2.E9.m1.2.2.1.1.1.1.2.3.2" xref="S2.E9.m1.2.2.1.1.1.1.2.3.2.cmml">t</mi><mo id="S2.E9.m1.2.2.1.1.1.1.2.3.1" xref="S2.E9.m1.2.2.1.1.1.1.2.3.1.cmml">−</mo><mn id="S2.E9.m1.2.2.1.1.1.1.2.3.3" xref="S2.E9.m1.2.2.1.1.1.1.2.3.3.cmml">1</mn></mrow></msub><mo fence="false" id="S2.E9.m1.2.2.1.1.1.1.1" xref="S2.E9.m1.2.2.1.1.1.1.1.cmml">|</mo><msub id="S2.E9.m1.2.2.1.1.1.1.3" xref="S2.E9.m1.2.2.1.1.1.1.3.cmml"><mi id="S2.E9.m1.2.2.1.1.1.1.3.2" xref="S2.E9.m1.2.2.1.1.1.1.3.2.cmml">z</mi><mi id="S2.E9.m1.2.2.1.1.1.1.3.3" xref="S2.E9.m1.2.2.1.1.1.1.3.3.cmml">t</mi></msub></mrow><mo id="S2.E9.m1.2.2.1.1.1.3" stretchy="false" xref="S2.E9.m1.2.2.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E9.m1.5.5.5" xref="S2.E9.m1.5.5.5.cmml">=</mo><mrow id="S2.E9.m1.5.5.4" xref="S2.E9.m1.5.5.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E9.m1.5.5.4.5" xref="S2.E9.m1.5.5.4.5.cmml">𝒩</mi><mo id="S2.E9.m1.5.5.4.4" xref="S2.E9.m1.5.5.4.4.cmml">⁢</mo><mrow id="S2.E9.m1.5.5.4.3.3" xref="S2.E9.m1.5.5.4.3.4.cmml"><mo id="S2.E9.m1.5.5.4.3.3.4" xref="S2.E9.m1.5.5.4.3.4.cmml">(</mo><msub id="S2.E9.m1.3.3.2.1.1.1" xref="S2.E9.m1.3.3.2.1.1.1.cmml"><mi id="S2.E9.m1.3.3.2.1.1.1.2" xref="S2.E9.m1.3.3.2.1.1.1.2.cmml">z</mi><mrow id="S2.E9.m1.3.3.2.1.1.1.3" xref="S2.E9.m1.3.3.2.1.1.1.3.cmml"><mi id="S2.E9.m1.3.3.2.1.1.1.3.2" xref="S2.E9.m1.3.3.2.1.1.1.3.2.cmml">t</mi><mo id="S2.E9.m1.3.3.2.1.1.1.3.1" xref="S2.E9.m1.3.3.2.1.1.1.3.1.cmml">−</mo><mn id="S2.E9.m1.3.3.2.1.1.1.3.3" xref="S2.E9.m1.3.3.2.1.1.1.3.3.cmml">1</mn></mrow></msub><mo id="S2.E9.m1.5.5.4.3.3.5" xref="S2.E9.m1.5.5.4.3.4.cmml">;</mo><mrow id="S2.E9.m1.4.4.3.2.2.2" xref="S2.E9.m1.4.4.3.2.2.2.cmml"><msub id="S2.E9.m1.4.4.3.2.2.2.4" xref="S2.E9.m1.4.4.3.2.2.2.4.cmml"><mi 
id="S2.E9.m1.4.4.3.2.2.2.4.2" xref="S2.E9.m1.4.4.3.2.2.2.4.2.cmml">μ</mi><mi id="S2.E9.m1.4.4.3.2.2.2.4.3" xref="S2.E9.m1.4.4.3.2.2.2.4.3.cmml">θ</mi></msub><mo id="S2.E9.m1.4.4.3.2.2.2.3" xref="S2.E9.m1.4.4.3.2.2.2.3.cmml">⁢</mo><mrow id="S2.E9.m1.4.4.3.2.2.2.2.2" xref="S2.E9.m1.4.4.3.2.2.2.2.3.cmml"><mo id="S2.E9.m1.4.4.3.2.2.2.2.2.3" stretchy="false" xref="S2.E9.m1.4.4.3.2.2.2.2.3.cmml">(</mo><msub id="S2.E9.m1.4.4.3.2.2.2.1.1.1" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1.cmml"><mi id="S2.E9.m1.4.4.3.2.2.2.1.1.1.2" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1.2.cmml">z</mi><mi id="S2.E9.m1.4.4.3.2.2.2.1.1.1.3" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1.3.cmml">t</mi></msub><mo id="S2.E9.m1.4.4.3.2.2.2.2.2.4" xref="S2.E9.m1.4.4.3.2.2.2.2.3.cmml">,</mo><mi id="S2.E9.m1.1.1" xref="S2.E9.m1.1.1.cmml">t</mi><mo id="S2.E9.m1.4.4.3.2.2.2.2.2.5" xref="S2.E9.m1.4.4.3.2.2.2.2.3.cmml">,</mo><msub id="S2.E9.m1.4.4.3.2.2.2.2.2.2" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2.cmml"><mi id="S2.E9.m1.4.4.3.2.2.2.2.2.2.2" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2.2.cmml">E</mi><mtext id="S2.E9.m1.4.4.3.2.2.2.2.2.2.3" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2.3a.cmml">mix</mtext></msub><mo id="S2.E9.m1.4.4.3.2.2.2.2.2.6" stretchy="false" xref="S2.E9.m1.4.4.3.2.2.2.2.3.cmml">)</mo></mrow></mrow><mo id="S2.E9.m1.5.5.4.3.3.6" xref="S2.E9.m1.5.5.4.3.4.cmml">,</mo><mrow id="S2.E9.m1.5.5.4.3.3.3" xref="S2.E9.m1.5.5.4.3.3.3.cmml"><msubsup id="S2.E9.m1.5.5.4.3.3.3.2" xref="S2.E9.m1.5.5.4.3.3.3.2.cmml"><mi id="S2.E9.m1.5.5.4.3.3.3.2.2.2" xref="S2.E9.m1.5.5.4.3.3.3.2.2.2.cmml">σ</mi><mi id="S2.E9.m1.5.5.4.3.3.3.2.2.3" xref="S2.E9.m1.5.5.4.3.3.3.2.2.3.cmml">t</mi><mn id="S2.E9.m1.5.5.4.3.3.3.2.3" xref="S2.E9.m1.5.5.4.3.3.3.2.3.cmml">2</mn></msubsup><mo id="S2.E9.m1.5.5.4.3.3.3.1" xref="S2.E9.m1.5.5.4.3.3.3.1.cmml">⁢</mo><mi id="S2.E9.m1.5.5.4.3.3.3.3" xref="S2.E9.m1.5.5.4.3.3.3.3.cmml">𝐈</mi></mrow><mo id="S2.E9.m1.5.5.4.3.3.7" xref="S2.E9.m1.5.5.4.3.4.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E9.m1.5b"><apply 
id="S2.E9.m1.5.5.cmml" xref="S2.E9.m1.5.5"><eq id="S2.E9.m1.5.5.5.cmml" xref="S2.E9.m1.5.5.5"></eq><apply id="S2.E9.m1.2.2.1.cmml" xref="S2.E9.m1.2.2.1"><times id="S2.E9.m1.2.2.1.2.cmml" xref="S2.E9.m1.2.2.1.2"></times><apply id="S2.E9.m1.2.2.1.3.cmml" xref="S2.E9.m1.2.2.1.3"><csymbol cd="ambiguous" id="S2.E9.m1.2.2.1.3.1.cmml" xref="S2.E9.m1.2.2.1.3">subscript</csymbol><ci id="S2.E9.m1.2.2.1.3.2.cmml" xref="S2.E9.m1.2.2.1.3.2">𝑝</ci><ci id="S2.E9.m1.2.2.1.3.3.cmml" xref="S2.E9.m1.2.2.1.3.3">𝜃</ci></apply><apply id="S2.E9.m1.2.2.1.1.1.1.cmml" xref="S2.E9.m1.2.2.1.1.1"><csymbol cd="latexml" id="S2.E9.m1.2.2.1.1.1.1.1.cmml" xref="S2.E9.m1.2.2.1.1.1.1.1">conditional</csymbol><apply id="S2.E9.m1.2.2.1.1.1.1.2.cmml" xref="S2.E9.m1.2.2.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.E9.m1.2.2.1.1.1.1.2.1.cmml" xref="S2.E9.m1.2.2.1.1.1.1.2">subscript</csymbol><ci id="S2.E9.m1.2.2.1.1.1.1.2.2.cmml" xref="S2.E9.m1.2.2.1.1.1.1.2.2">𝑧</ci><apply id="S2.E9.m1.2.2.1.1.1.1.2.3.cmml" xref="S2.E9.m1.2.2.1.1.1.1.2.3"><minus id="S2.E9.m1.2.2.1.1.1.1.2.3.1.cmml" xref="S2.E9.m1.2.2.1.1.1.1.2.3.1"></minus><ci id="S2.E9.m1.2.2.1.1.1.1.2.3.2.cmml" xref="S2.E9.m1.2.2.1.1.1.1.2.3.2">𝑡</ci><cn id="S2.E9.m1.2.2.1.1.1.1.2.3.3.cmml" type="integer" xref="S2.E9.m1.2.2.1.1.1.1.2.3.3">1</cn></apply></apply><apply id="S2.E9.m1.2.2.1.1.1.1.3.cmml" xref="S2.E9.m1.2.2.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E9.m1.2.2.1.1.1.1.3.1.cmml" xref="S2.E9.m1.2.2.1.1.1.1.3">subscript</csymbol><ci id="S2.E9.m1.2.2.1.1.1.1.3.2.cmml" xref="S2.E9.m1.2.2.1.1.1.1.3.2">𝑧</ci><ci id="S2.E9.m1.2.2.1.1.1.1.3.3.cmml" xref="S2.E9.m1.2.2.1.1.1.1.3.3">𝑡</ci></apply></apply></apply><apply id="S2.E9.m1.5.5.4.cmml" xref="S2.E9.m1.5.5.4"><times id="S2.E9.m1.5.5.4.4.cmml" xref="S2.E9.m1.5.5.4.4"></times><ci id="S2.E9.m1.5.5.4.5.cmml" xref="S2.E9.m1.5.5.4.5">𝒩</ci><list id="S2.E9.m1.5.5.4.3.4.cmml" xref="S2.E9.m1.5.5.4.3.3"><apply id="S2.E9.m1.3.3.2.1.1.1.cmml" xref="S2.E9.m1.3.3.2.1.1.1"><csymbol cd="ambiguous" 
id="S2.E9.m1.3.3.2.1.1.1.1.cmml" xref="S2.E9.m1.3.3.2.1.1.1">subscript</csymbol><ci id="S2.E9.m1.3.3.2.1.1.1.2.cmml" xref="S2.E9.m1.3.3.2.1.1.1.2">𝑧</ci><apply id="S2.E9.m1.3.3.2.1.1.1.3.cmml" xref="S2.E9.m1.3.3.2.1.1.1.3"><minus id="S2.E9.m1.3.3.2.1.1.1.3.1.cmml" xref="S2.E9.m1.3.3.2.1.1.1.3.1"></minus><ci id="S2.E9.m1.3.3.2.1.1.1.3.2.cmml" xref="S2.E9.m1.3.3.2.1.1.1.3.2">𝑡</ci><cn id="S2.E9.m1.3.3.2.1.1.1.3.3.cmml" type="integer" xref="S2.E9.m1.3.3.2.1.1.1.3.3">1</cn></apply></apply><apply id="S2.E9.m1.4.4.3.2.2.2.cmml" xref="S2.E9.m1.4.4.3.2.2.2"><times id="S2.E9.m1.4.4.3.2.2.2.3.cmml" xref="S2.E9.m1.4.4.3.2.2.2.3"></times><apply id="S2.E9.m1.4.4.3.2.2.2.4.cmml" xref="S2.E9.m1.4.4.3.2.2.2.4"><csymbol cd="ambiguous" id="S2.E9.m1.4.4.3.2.2.2.4.1.cmml" xref="S2.E9.m1.4.4.3.2.2.2.4">subscript</csymbol><ci id="S2.E9.m1.4.4.3.2.2.2.4.2.cmml" xref="S2.E9.m1.4.4.3.2.2.2.4.2">𝜇</ci><ci id="S2.E9.m1.4.4.3.2.2.2.4.3.cmml" xref="S2.E9.m1.4.4.3.2.2.2.4.3">𝜃</ci></apply><vector id="S2.E9.m1.4.4.3.2.2.2.2.3.cmml" xref="S2.E9.m1.4.4.3.2.2.2.2.2"><apply id="S2.E9.m1.4.4.3.2.2.2.1.1.1.cmml" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1"><csymbol cd="ambiguous" id="S2.E9.m1.4.4.3.2.2.2.1.1.1.1.cmml" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1">subscript</csymbol><ci id="S2.E9.m1.4.4.3.2.2.2.1.1.1.2.cmml" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1.2">𝑧</ci><ci id="S2.E9.m1.4.4.3.2.2.2.1.1.1.3.cmml" xref="S2.E9.m1.4.4.3.2.2.2.1.1.1.3">𝑡</ci></apply><ci id="S2.E9.m1.1.1.cmml" xref="S2.E9.m1.1.1">𝑡</ci><apply id="S2.E9.m1.4.4.3.2.2.2.2.2.2.cmml" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.E9.m1.4.4.3.2.2.2.2.2.2.1.cmml" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2">subscript</csymbol><ci id="S2.E9.m1.4.4.3.2.2.2.2.2.2.2.cmml" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2.2">𝐸</ci><ci id="S2.E9.m1.4.4.3.2.2.2.2.2.2.3a.cmml" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2.3"><mtext id="S2.E9.m1.4.4.3.2.2.2.2.2.2.3.cmml" mathsize="70%" xref="S2.E9.m1.4.4.3.2.2.2.2.2.2.3">mix</mtext></ci></apply></vector></apply><apply 
id="S2.E9.m1.5.5.4.3.3.3.cmml" xref="S2.E9.m1.5.5.4.3.3.3"><times id="S2.E9.m1.5.5.4.3.3.3.1.cmml" xref="S2.E9.m1.5.5.4.3.3.3.1"></times><apply id="S2.E9.m1.5.5.4.3.3.3.2.cmml" xref="S2.E9.m1.5.5.4.3.3.3.2"><csymbol cd="ambiguous" id="S2.E9.m1.5.5.4.3.3.3.2.1.cmml" xref="S2.E9.m1.5.5.4.3.3.3.2">superscript</csymbol><apply id="S2.E9.m1.5.5.4.3.3.3.2.2.cmml" xref="S2.E9.m1.5.5.4.3.3.3.2"><csymbol cd="ambiguous" id="S2.E9.m1.5.5.4.3.3.3.2.2.1.cmml" xref="S2.E9.m1.5.5.4.3.3.3.2">subscript</csymbol><ci id="S2.E9.m1.5.5.4.3.3.3.2.2.2.cmml" xref="S2.E9.m1.5.5.4.3.3.3.2.2.2">𝜎</ci><ci id="S2.E9.m1.5.5.4.3.3.3.2.2.3.cmml" xref="S2.E9.m1.5.5.4.3.3.3.2.2.3">𝑡</ci></apply><cn id="S2.E9.m1.5.5.4.3.3.3.2.3.cmml" type="integer" xref="S2.E9.m1.5.5.4.3.3.3.2.3">2</cn></apply><ci id="S2.E9.m1.5.5.4.3.3.3.3.cmml" xref="S2.E9.m1.5.5.4.3.3.3.3">𝐈</ci></apply></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E9.m1.5c">p_{\theta}(z_{t-1}|z_{t})=\mathcal{N}\left(z_{t-1};\mu_{\theta}(z_{t},t,E_{% \text{mix}}),\sigma_{t}^{2}\mathbf{I}\right)</annotation><annotation encoding="application/x-llamapun" id="S2.E9.m1.5d">italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = caligraphic_N ( italic_z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_E start_POSTSUBSCRIPT mix end_POSTSUBSCRIPT ) , italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(9)</span></td> </tr></tbody> </table> <p class="ltx_p" 
id="S2.SS3.p1.12">where <math alttext="\mu_{\theta}" class="ltx_Math" display="inline" id="S2.SS3.p1.9.m1.1"><semantics id="S2.SS3.p1.9.m1.1a"><msub id="S2.SS3.p1.9.m1.1.1" xref="S2.SS3.p1.9.m1.1.1.cmml"><mi id="S2.SS3.p1.9.m1.1.1.2" xref="S2.SS3.p1.9.m1.1.1.2.cmml">μ</mi><mi id="S2.SS3.p1.9.m1.1.1.3" xref="S2.SS3.p1.9.m1.1.1.3.cmml">θ</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.9.m1.1b"><apply id="S2.SS3.p1.9.m1.1.1.cmml" xref="S2.SS3.p1.9.m1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.9.m1.1.1.1.cmml" xref="S2.SS3.p1.9.m1.1.1">subscript</csymbol><ci id="S2.SS3.p1.9.m1.1.1.2.cmml" xref="S2.SS3.p1.9.m1.1.1.2">𝜇</ci><ci id="S2.SS3.p1.9.m1.1.1.3.cmml" xref="S2.SS3.p1.9.m1.1.1.3">𝜃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.9.m1.1c">\mu_{\theta}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.9.m1.1d">italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT</annotation></semantics></math> is the mean of the Gaussian distribution predicted by the neural network parameterized by <math alttext="\theta" class="ltx_Math" display="inline" id="S2.SS3.p1.10.m2.1"><semantics id="S2.SS3.p1.10.m2.1a"><mi id="S2.SS3.p1.10.m2.1.1" xref="S2.SS3.p1.10.m2.1.1.cmml">θ</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.10.m2.1b"><ci id="S2.SS3.p1.10.m2.1.1.cmml" xref="S2.SS3.p1.10.m2.1.1">𝜃</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.10.m2.1c">\theta</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.10.m2.1d">italic_θ</annotation></semantics></math>, and <math alttext="\sigma_{t}" class="ltx_Math" display="inline" id="S2.SS3.p1.11.m3.1"><semantics id="S2.SS3.p1.11.m3.1a"><msub id="S2.SS3.p1.11.m3.1.1" xref="S2.SS3.p1.11.m3.1.1.cmml"><mi id="S2.SS3.p1.11.m3.1.1.2" xref="S2.SS3.p1.11.m3.1.1.2.cmml">σ</mi><mi id="S2.SS3.p1.11.m3.1.1.3" xref="S2.SS3.p1.11.m3.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" 
id="S2.SS3.p1.11.m3.1b"><apply id="S2.SS3.p1.11.m3.1.1.cmml" xref="S2.SS3.p1.11.m3.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.11.m3.1.1.1.cmml" xref="S2.SS3.p1.11.m3.1.1">subscript</csymbol><ci id="S2.SS3.p1.11.m3.1.1.2.cmml" xref="S2.SS3.p1.11.m3.1.1.2">𝜎</ci><ci id="S2.SS3.p1.11.m3.1.1.3.cmml" xref="S2.SS3.p1.11.m3.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.11.m3.1c">\sigma_{t}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.11.m3.1d">italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> denotes the standard deviation of the reverse step, which generally depends on the timestep <math alttext="t" class="ltx_Math" display="inline" id="S2.SS3.p1.12.m4.1"><semantics id="S2.SS3.p1.12.m4.1a"><mi id="S2.SS3.p1.12.m4.1.1" xref="S2.SS3.p1.12.m4.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.12.m4.1b"><ci id="S2.SS3.p1.12.m4.1.1.cmml" xref="S2.SS3.p1.12.m4.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.12.m4.1c">t</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.12.m4.1d">italic_t</annotation></semantics></math>.</p> </div> <div class="ltx_para" id="S2.SS3.p2"> <p class="ltx_p" id="S2.SS3.p2.3">After reverse diffusion, the final latent variable <math alttext="z_{0}" class="ltx_Math" display="inline" id="S2.SS3.p2.1.m1.1"><semantics id="S2.SS3.p2.1.m1.1a"><msub id="S2.SS3.p2.1.m1.1.1" xref="S2.SS3.p2.1.m1.1.1.cmml"><mi id="S2.SS3.p2.1.m1.1.1.2" xref="S2.SS3.p2.1.m1.1.1.2.cmml">z</mi><mn id="S2.SS3.p2.1.m1.1.1.3" xref="S2.SS3.p2.1.m1.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.1.m1.1b"><apply id="S2.SS3.p2.1.m1.1.1.cmml" xref="S2.SS3.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p2.1.m1.1.1.1.cmml" xref="S2.SS3.p2.1.m1.1.1">subscript</csymbol><ci id="S2.SS3.p2.1.m1.1.1.2.cmml" xref="S2.SS3.p2.1.m1.1.1.2">𝑧</ci><cn id="S2.SS3.p2.1.m1.1.1.3.cmml" type="integer" 
xref="S2.SS3.p2.1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.1.m1.1c">z_{0}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.1.m1.1d">italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math> is decoded by <math alttext="\mathcal{D}" class="ltx_Math" display="inline" id="S2.SS3.p2.2.m2.1"><semantics id="S2.SS3.p2.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S2.SS3.p2.2.m2.1.1" xref="S2.SS3.p2.2.m2.1.1.cmml">𝒟</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.2.m2.1b"><ci id="S2.SS3.p2.2.m2.1.1.cmml" xref="S2.SS3.p2.2.m2.1.1">𝒟</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.2.m2.1c">\mathcal{D}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.2.m2.1d">caligraphic_D</annotation></semantics></math> back into the data space, producing the generated Mel-Spectrogram <math alttext="\hat{x}_{a}=\mathcal{D}(z_{0})" class="ltx_Math" display="inline" id="S2.SS3.p2.3.m3.1"><semantics id="S2.SS3.p2.3.m3.1a"><mrow id="S2.SS3.p2.3.m3.1.1" xref="S2.SS3.p2.3.m3.1.1.cmml"><msub id="S2.SS3.p2.3.m3.1.1.3" xref="S2.SS3.p2.3.m3.1.1.3.cmml"><mover accent="true" id="S2.SS3.p2.3.m3.1.1.3.2" xref="S2.SS3.p2.3.m3.1.1.3.2.cmml"><mi id="S2.SS3.p2.3.m3.1.1.3.2.2" xref="S2.SS3.p2.3.m3.1.1.3.2.2.cmml">x</mi><mo id="S2.SS3.p2.3.m3.1.1.3.2.1" xref="S2.SS3.p2.3.m3.1.1.3.2.1.cmml">^</mo></mover><mi id="S2.SS3.p2.3.m3.1.1.3.3" xref="S2.SS3.p2.3.m3.1.1.3.3.cmml">a</mi></msub><mo id="S2.SS3.p2.3.m3.1.1.2" xref="S2.SS3.p2.3.m3.1.1.2.cmml">=</mo><mrow id="S2.SS3.p2.3.m3.1.1.1" xref="S2.SS3.p2.3.m3.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.SS3.p2.3.m3.1.1.1.3" xref="S2.SS3.p2.3.m3.1.1.1.3.cmml">𝒟</mi><mo id="S2.SS3.p2.3.m3.1.1.1.2" xref="S2.SS3.p2.3.m3.1.1.1.2.cmml">⁢</mo><mrow id="S2.SS3.p2.3.m3.1.1.1.1.1" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.cmml"><mo id="S2.SS3.p2.3.m3.1.1.1.1.1.2" stretchy="false" 
xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.cmml">(</mo><msub id="S2.SS3.p2.3.m3.1.1.1.1.1.1" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.cmml"><mi id="S2.SS3.p2.3.m3.1.1.1.1.1.1.2" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.2.cmml">z</mi><mn id="S2.SS3.p2.3.m3.1.1.1.1.1.1.3" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.3.cmml">0</mn></msub><mo id="S2.SS3.p2.3.m3.1.1.1.1.1.3" stretchy="false" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.3.m3.1b"><apply id="S2.SS3.p2.3.m3.1.1.cmml" xref="S2.SS3.p2.3.m3.1.1"><eq id="S2.SS3.p2.3.m3.1.1.2.cmml" xref="S2.SS3.p2.3.m3.1.1.2"></eq><apply id="S2.SS3.p2.3.m3.1.1.3.cmml" xref="S2.SS3.p2.3.m3.1.1.3"><csymbol cd="ambiguous" id="S2.SS3.p2.3.m3.1.1.3.1.cmml" xref="S2.SS3.p2.3.m3.1.1.3">subscript</csymbol><apply id="S2.SS3.p2.3.m3.1.1.3.2.cmml" xref="S2.SS3.p2.3.m3.1.1.3.2"><ci id="S2.SS3.p2.3.m3.1.1.3.2.1.cmml" xref="S2.SS3.p2.3.m3.1.1.3.2.1">^</ci><ci id="S2.SS3.p2.3.m3.1.1.3.2.2.cmml" xref="S2.SS3.p2.3.m3.1.1.3.2.2">𝑥</ci></apply><ci id="S2.SS3.p2.3.m3.1.1.3.3.cmml" xref="S2.SS3.p2.3.m3.1.1.3.3">𝑎</ci></apply><apply id="S2.SS3.p2.3.m3.1.1.1.cmml" xref="S2.SS3.p2.3.m3.1.1.1"><times id="S2.SS3.p2.3.m3.1.1.1.2.cmml" xref="S2.SS3.p2.3.m3.1.1.1.2"></times><ci id="S2.SS3.p2.3.m3.1.1.1.3.cmml" xref="S2.SS3.p2.3.m3.1.1.1.3">𝒟</ci><apply id="S2.SS3.p2.3.m3.1.1.1.1.1.1.cmml" xref="S2.SS3.p2.3.m3.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p2.3.m3.1.1.1.1.1.1.1.cmml" xref="S2.SS3.p2.3.m3.1.1.1.1.1">subscript</csymbol><ci id="S2.SS3.p2.3.m3.1.1.1.1.1.1.2.cmml" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.2">𝑧</ci><cn id="S2.SS3.p2.3.m3.1.1.1.1.1.1.3.cmml" type="integer" xref="S2.SS3.p2.3.m3.1.1.1.1.1.1.3">0</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.3.m3.1c">\hat{x}_{a}=\mathcal{D}(z_{0})</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.3.m3.1d">over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT = 
caligraphic_D ( italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )</annotation></semantics></math>, which is then converted into an audio sample using a vocoder. The conditional loss function is given by:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E10"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{LDM}}=\mathbb{E}_{\mathcal{E}(x),\epsilon\sim\mathcal{N}(0,% 1),t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t},t,E_{\text{mix}}\right% )\right\|_{2}^{2}\right]" class="ltx_Math" display="block" id="S2.E10.m1.8"><semantics id="S2.E10.m1.8a"><mrow id="S2.E10.m1.8.8" xref="S2.E10.m1.8.8.cmml"><msub id="S2.E10.m1.8.8.3" xref="S2.E10.m1.8.8.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E10.m1.8.8.3.2" xref="S2.E10.m1.8.8.3.2.cmml">ℒ</mi><mtext id="S2.E10.m1.8.8.3.3" xref="S2.E10.m1.8.8.3.3a.cmml">LDM</mtext></msub><mo id="S2.E10.m1.8.8.2" xref="S2.E10.m1.8.8.2.cmml">=</mo><mrow id="S2.E10.m1.8.8.1" xref="S2.E10.m1.8.8.1.cmml"><msub id="S2.E10.m1.8.8.1.3" xref="S2.E10.m1.8.8.1.3.cmml"><mi id="S2.E10.m1.8.8.1.3.2" xref="S2.E10.m1.8.8.1.3.2.cmml">𝔼</mi><mrow id="S2.E10.m1.6.6.6.6" xref="S2.E10.m1.6.6.6.7.cmml"><mrow id="S2.E10.m1.6.6.6.6.1" xref="S2.E10.m1.6.6.6.6.1.cmml"><mrow id="S2.E10.m1.6.6.6.6.1.1.1" xref="S2.E10.m1.6.6.6.6.1.1.2.cmml"><mrow id="S2.E10.m1.6.6.6.6.1.1.1.1" xref="S2.E10.m1.6.6.6.6.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E10.m1.6.6.6.6.1.1.1.1.2" xref="S2.E10.m1.6.6.6.6.1.1.1.1.2.cmml">ℰ</mi><mo id="S2.E10.m1.6.6.6.6.1.1.1.1.1" xref="S2.E10.m1.6.6.6.6.1.1.1.1.1.cmml">⁢</mo><mrow id="S2.E10.m1.6.6.6.6.1.1.1.1.3.2" xref="S2.E10.m1.6.6.6.6.1.1.1.1.cmml"><mo id="S2.E10.m1.6.6.6.6.1.1.1.1.3.2.1" stretchy="false" xref="S2.E10.m1.6.6.6.6.1.1.1.1.cmml">(</mo><mi id="S2.E10.m1.1.1.1.1" xref="S2.E10.m1.1.1.1.1.cmml">x</mi><mo id="S2.E10.m1.6.6.6.6.1.1.1.1.3.2.2" stretchy="false" 
xref="S2.E10.m1.6.6.6.6.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E10.m1.6.6.6.6.1.1.1.2" xref="S2.E10.m1.6.6.6.6.1.1.2.cmml">,</mo><mi id="S2.E10.m1.4.4.4.4" xref="S2.E10.m1.4.4.4.4.cmml">ϵ</mi></mrow><mo id="S2.E10.m1.6.6.6.6.1.2" xref="S2.E10.m1.6.6.6.6.1.2.cmml">∼</mo><mrow id="S2.E10.m1.6.6.6.6.1.3" xref="S2.E10.m1.6.6.6.6.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.E10.m1.6.6.6.6.1.3.2" xref="S2.E10.m1.6.6.6.6.1.3.2.cmml">𝒩</mi><mo id="S2.E10.m1.6.6.6.6.1.3.1" xref="S2.E10.m1.6.6.6.6.1.3.1.cmml">⁢</mo><mrow id="S2.E10.m1.6.6.6.6.1.3.3.2" xref="S2.E10.m1.6.6.6.6.1.3.3.1.cmml"><mo id="S2.E10.m1.6.6.6.6.1.3.3.2.1" stretchy="false" xref="S2.E10.m1.6.6.6.6.1.3.3.1.cmml">(</mo><mn id="S2.E10.m1.2.2.2.2" xref="S2.E10.m1.2.2.2.2.cmml">0</mn><mo id="S2.E10.m1.6.6.6.6.1.3.3.2.2" xref="S2.E10.m1.6.6.6.6.1.3.3.1.cmml">,</mo><mn id="S2.E10.m1.3.3.3.3" xref="S2.E10.m1.3.3.3.3.cmml">1</mn><mo id="S2.E10.m1.6.6.6.6.1.3.3.2.3" stretchy="false" xref="S2.E10.m1.6.6.6.6.1.3.3.1.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E10.m1.6.6.6.6.2" xref="S2.E10.m1.6.6.6.7a.cmml">,</mo><mi id="S2.E10.m1.5.5.5.5" xref="S2.E10.m1.5.5.5.5.cmml">t</mi></mrow></msub><mo id="S2.E10.m1.8.8.1.2" xref="S2.E10.m1.8.8.1.2.cmml">⁢</mo><mrow id="S2.E10.m1.8.8.1.1.1" xref="S2.E10.m1.8.8.1.1.2.cmml"><mo id="S2.E10.m1.8.8.1.1.1.2" xref="S2.E10.m1.8.8.1.1.2.1.cmml">[</mo><msubsup id="S2.E10.m1.8.8.1.1.1.1" xref="S2.E10.m1.8.8.1.1.1.1.cmml"><mrow id="S2.E10.m1.8.8.1.1.1.1.1.1.1" xref="S2.E10.m1.8.8.1.1.1.1.1.1.2.cmml"><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.4" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.4.cmml">ϵ</mi><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.3.cmml">−</mo><mrow id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.cmml"><msub id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4" 
xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.cmml"><mi id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.2.cmml">ϵ</mi><mi id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.3.cmml">θ</mi></msub><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.3.cmml">⁢</mo><mrow id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.3.cmml"><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.3.cmml">(</mo><msub id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">z</mi><mi id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.4" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><mi id="S2.E10.m1.7.7" xref="S2.E10.m1.7.7.cmml">t</mi><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.5" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><msub id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.cmml"><mi id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.2" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml">E</mi><mtext id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.3a.cmml">mix</mtext></msub><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.6" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.3.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E10.m1.8.8.1.1.1.1.1.1.1.3" xref="S2.E10.m1.8.8.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow><mn id="S2.E10.m1.8.8.1.1.1.1.1.3" xref="S2.E10.m1.8.8.1.1.1.1.1.3.cmml">2</mn><mn id="S2.E10.m1.8.8.1.1.1.1.3" xref="S2.E10.m1.8.8.1.1.1.1.3.cmml">2</mn></msubsup><mo id="S2.E10.m1.8.8.1.1.1.3" xref="S2.E10.m1.8.8.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E10.m1.8b"><apply 
id="S2.E10.m1.8.8.cmml" xref="S2.E10.m1.8.8"><eq id="S2.E10.m1.8.8.2.cmml" xref="S2.E10.m1.8.8.2"></eq><apply id="S2.E10.m1.8.8.3.cmml" xref="S2.E10.m1.8.8.3"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.3.1.cmml" xref="S2.E10.m1.8.8.3">subscript</csymbol><ci id="S2.E10.m1.8.8.3.2.cmml" xref="S2.E10.m1.8.8.3.2">ℒ</ci><ci id="S2.E10.m1.8.8.3.3a.cmml" xref="S2.E10.m1.8.8.3.3"><mtext id="S2.E10.m1.8.8.3.3.cmml" mathsize="70%" xref="S2.E10.m1.8.8.3.3">LDM</mtext></ci></apply><apply id="S2.E10.m1.8.8.1.cmml" xref="S2.E10.m1.8.8.1"><times id="S2.E10.m1.8.8.1.2.cmml" xref="S2.E10.m1.8.8.1.2"></times><apply id="S2.E10.m1.8.8.1.3.cmml" xref="S2.E10.m1.8.8.1.3"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.1.3.1.cmml" xref="S2.E10.m1.8.8.1.3">subscript</csymbol><ci id="S2.E10.m1.8.8.1.3.2.cmml" xref="S2.E10.m1.8.8.1.3.2">𝔼</ci><apply id="S2.E10.m1.6.6.6.7.cmml" xref="S2.E10.m1.6.6.6.6"><csymbol cd="ambiguous" id="S2.E10.m1.6.6.6.7a.cmml" xref="S2.E10.m1.6.6.6.6.2">formulae-sequence</csymbol><apply id="S2.E10.m1.6.6.6.6.1.cmml" xref="S2.E10.m1.6.6.6.6.1"><csymbol cd="latexml" id="S2.E10.m1.6.6.6.6.1.2.cmml" xref="S2.E10.m1.6.6.6.6.1.2">similar-to</csymbol><list id="S2.E10.m1.6.6.6.6.1.1.2.cmml" xref="S2.E10.m1.6.6.6.6.1.1.1"><apply id="S2.E10.m1.6.6.6.6.1.1.1.1.cmml" xref="S2.E10.m1.6.6.6.6.1.1.1.1"><times id="S2.E10.m1.6.6.6.6.1.1.1.1.1.cmml" xref="S2.E10.m1.6.6.6.6.1.1.1.1.1"></times><ci id="S2.E10.m1.6.6.6.6.1.1.1.1.2.cmml" xref="S2.E10.m1.6.6.6.6.1.1.1.1.2">ℰ</ci><ci id="S2.E10.m1.1.1.1.1.cmml" xref="S2.E10.m1.1.1.1.1">𝑥</ci></apply><ci id="S2.E10.m1.4.4.4.4.cmml" xref="S2.E10.m1.4.4.4.4">italic-ϵ</ci></list><apply id="S2.E10.m1.6.6.6.6.1.3.cmml" xref="S2.E10.m1.6.6.6.6.1.3"><times id="S2.E10.m1.6.6.6.6.1.3.1.cmml" xref="S2.E10.m1.6.6.6.6.1.3.1"></times><ci id="S2.E10.m1.6.6.6.6.1.3.2.cmml" xref="S2.E10.m1.6.6.6.6.1.3.2">𝒩</ci><interval closure="open" id="S2.E10.m1.6.6.6.6.1.3.3.1.cmml" xref="S2.E10.m1.6.6.6.6.1.3.3.2"><cn id="S2.E10.m1.2.2.2.2.cmml" type="integer" 
xref="S2.E10.m1.2.2.2.2">0</cn><cn id="S2.E10.m1.3.3.3.3.cmml" type="integer" xref="S2.E10.m1.3.3.3.3">1</cn></interval></apply></apply><ci id="S2.E10.m1.5.5.5.5.cmml" xref="S2.E10.m1.5.5.5.5">𝑡</ci></apply></apply><apply id="S2.E10.m1.8.8.1.1.2.cmml" xref="S2.E10.m1.8.8.1.1.1"><csymbol cd="latexml" id="S2.E10.m1.8.8.1.1.2.1.cmml" xref="S2.E10.m1.8.8.1.1.1.2">delimited-[]</csymbol><apply id="S2.E10.m1.8.8.1.1.1.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.1.1.1.1.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1">superscript</csymbol><apply id="S2.E10.m1.8.8.1.1.1.1.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.1.1.1.1.1.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1">subscript</csymbol><apply id="S2.E10.m1.8.8.1.1.1.1.1.1.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E10.m1.8.8.1.1.1.1.1.1.2.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1"><minus id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.3"></minus><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.4.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.4">italic-ϵ</ci><apply id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2"><times id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.3.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.3"></times><apply id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4">subscript</csymbol><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.2">italic-ϵ</ci><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.3.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.4.3">𝜃</ci></apply><vector id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.3.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2"><apply id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.cmml" 
xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.2">𝑧</ci><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.1.1.1.1.3">𝑡</ci></apply><ci id="S2.E10.m1.7.7.cmml" xref="S2.E10.m1.7.7">𝑡</ci><apply id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2"><csymbol cd="ambiguous" id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.1.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2">subscript</csymbol><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.2">𝐸</ci><ci id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.3a.cmml" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.3"><mtext id="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml" mathsize="70%" xref="S2.E10.m1.8.8.1.1.1.1.1.1.1.1.2.2.2.2.3">mix</mtext></ci></apply></vector></apply></apply></apply><cn id="S2.E10.m1.8.8.1.1.1.1.1.3.cmml" type="integer" xref="S2.E10.m1.8.8.1.1.1.1.1.3">2</cn></apply><cn id="S2.E10.m1.8.8.1.1.1.1.3.cmml" type="integer" xref="S2.E10.m1.8.8.1.1.1.1.3">2</cn></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E10.m1.8c">\mathcal{L}_{\text{LDM}}=\mathbb{E}_{\mathcal{E}(x),\epsilon\sim\mathcal{N}(0,% 1),t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t},t,E_{\text{mix}}\right% )\right\|_{2}^{2}\right]</annotation><annotation encoding="application/x-llamapun" id="S2.E10.m1.8d">caligraphic_L start_POSTSUBSCRIPT LDM end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT caligraphic_E ( italic_x ) , italic_ϵ ∼ caligraphic_N ( 0 , 1 ) , italic_t end_POSTSUBSCRIPT [ ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_E start_POSTSUBSCRIPT 
mix end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(10)</span></td> </tr></tbody> </table> </div> </section> <section class="ltx_subsection" id="S2.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S2.SS4.5.1.1">II-D</span> </span><span class="ltx_text ltx_font_italic" id="S2.SS4.6.2">Inference with Guidance</span> </h3> <div class="ltx_para" id="S2.SS4.p1"> <p class="ltx_p" id="S2.SS4.p1.2">During inference, guidance techniques such as Classifier Guidance (CG)<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib22" title="">22</a>]</cite> and Classifier-Free Guidance (CFG)<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib23" title="">23</a>]</cite> are employed to control the generation process. 
CG relies on an additional classifier <math alttext="P_{\phi}" class="ltx_Math" display="inline" id="S2.SS4.p1.1.m1.1"><semantics id="S2.SS4.p1.1.m1.1a"><msub id="S2.SS4.p1.1.m1.1.1" xref="S2.SS4.p1.1.m1.1.1.cmml"><mi id="S2.SS4.p1.1.m1.1.1.2" xref="S2.SS4.p1.1.m1.1.1.2.cmml">P</mi><mi id="S2.SS4.p1.1.m1.1.1.3" xref="S2.SS4.p1.1.m1.1.1.3.cmml">ϕ</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.1.m1.1b"><apply id="S2.SS4.p1.1.m1.1.1.cmml" xref="S2.SS4.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS4.p1.1.m1.1.1.1.cmml" xref="S2.SS4.p1.1.m1.1.1">subscript</csymbol><ci id="S2.SS4.p1.1.m1.1.1.2.cmml" xref="S2.SS4.p1.1.m1.1.1.2">𝑃</ci><ci id="S2.SS4.p1.1.m1.1.1.3.cmml" xref="S2.SS4.p1.1.m1.1.1.3">italic-ϕ</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.1.m1.1c">P_{\phi}</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.1.m1.1d">italic_P start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT</annotation></semantics></math> to guide the reverse process at each timestep via the gradient of the class label log-likelihood <math alttext="\nabla\log P_{\phi}(y|x_{t})" class="ltx_Math" display="inline" id="S2.SS4.p1.2.m2.1"><semantics id="S2.SS4.p1.2.m2.1a"><mrow id="S2.SS4.p1.2.m2.1.1" xref="S2.SS4.p1.2.m2.1.1.cmml"><mrow id="S2.SS4.p1.2.m2.1.1.3" xref="S2.SS4.p1.2.m2.1.1.3.cmml"><mrow id="S2.SS4.p1.2.m2.1.1.3.1" xref="S2.SS4.p1.2.m2.1.1.3.1.cmml"><mo id="S2.SS4.p1.2.m2.1.1.3.1.1" rspace="0.167em" xref="S2.SS4.p1.2.m2.1.1.3.1.1.cmml">∇</mo><mi id="S2.SS4.p1.2.m2.1.1.3.1.2" xref="S2.SS4.p1.2.m2.1.1.3.1.2.cmml">log</mi></mrow><mo id="S2.SS4.p1.2.m2.1.1.3a" lspace="0.167em" xref="S2.SS4.p1.2.m2.1.1.3.cmml">⁡</mo><msub id="S2.SS4.p1.2.m2.1.1.3.2" xref="S2.SS4.p1.2.m2.1.1.3.2.cmml"><mi id="S2.SS4.p1.2.m2.1.1.3.2.2" xref="S2.SS4.p1.2.m2.1.1.3.2.2.cmml">P</mi><mi id="S2.SS4.p1.2.m2.1.1.3.2.3" xref="S2.SS4.p1.2.m2.1.1.3.2.3.cmml">ϕ</mi></msub></mrow><mo id="S2.SS4.p1.2.m2.1.1.2" 
xref="S2.SS4.p1.2.m2.1.1.2.cmml">⁢</mo><mrow id="S2.SS4.p1.2.m2.1.1.1.1" xref="S2.SS4.p1.2.m2.1.1.1.1.1.cmml"><mo id="S2.SS4.p1.2.m2.1.1.1.1.2" stretchy="false" xref="S2.SS4.p1.2.m2.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS4.p1.2.m2.1.1.1.1.1" xref="S2.SS4.p1.2.m2.1.1.1.1.1.cmml"><mi id="S2.SS4.p1.2.m2.1.1.1.1.1.2" xref="S2.SS4.p1.2.m2.1.1.1.1.1.2.cmml">y</mi><mo fence="false" id="S2.SS4.p1.2.m2.1.1.1.1.1.1" xref="S2.SS4.p1.2.m2.1.1.1.1.1.1.cmml">|</mo><msub id="S2.SS4.p1.2.m2.1.1.1.1.1.3" xref="S2.SS4.p1.2.m2.1.1.1.1.1.3.cmml"><mi id="S2.SS4.p1.2.m2.1.1.1.1.1.3.2" xref="S2.SS4.p1.2.m2.1.1.1.1.1.3.2.cmml">x</mi><mi id="S2.SS4.p1.2.m2.1.1.1.1.1.3.3" xref="S2.SS4.p1.2.m2.1.1.1.1.1.3.3.cmml">t</mi></msub></mrow><mo id="S2.SS4.p1.2.m2.1.1.1.1.3" stretchy="false" xref="S2.SS4.p1.2.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.2.m2.1b"><apply id="S2.SS4.p1.2.m2.1.1.cmml" xref="S2.SS4.p1.2.m2.1.1"><times id="S2.SS4.p1.2.m2.1.1.2.cmml" xref="S2.SS4.p1.2.m2.1.1.2"></times><apply id="S2.SS4.p1.2.m2.1.1.3.cmml" xref="S2.SS4.p1.2.m2.1.1.3"><apply id="S2.SS4.p1.2.m2.1.1.3.1.cmml" xref="S2.SS4.p1.2.m2.1.1.3.1"><ci id="S2.SS4.p1.2.m2.1.1.3.1.1.cmml" xref="S2.SS4.p1.2.m2.1.1.3.1.1">∇</ci><log id="S2.SS4.p1.2.m2.1.1.3.1.2.cmml" xref="S2.SS4.p1.2.m2.1.1.3.1.2"></log></apply><apply id="S2.SS4.p1.2.m2.1.1.3.2.cmml" xref="S2.SS4.p1.2.m2.1.1.3.2"><csymbol cd="ambiguous" id="S2.SS4.p1.2.m2.1.1.3.2.1.cmml" xref="S2.SS4.p1.2.m2.1.1.3.2">subscript</csymbol><ci id="S2.SS4.p1.2.m2.1.1.3.2.2.cmml" xref="S2.SS4.p1.2.m2.1.1.3.2.2">𝑃</ci><ci id="S2.SS4.p1.2.m2.1.1.3.2.3.cmml" xref="S2.SS4.p1.2.m2.1.1.3.2.3">italic-ϕ</ci></apply></apply><apply id="S2.SS4.p1.2.m2.1.1.1.1.1.cmml" xref="S2.SS4.p1.2.m2.1.1.1.1"><csymbol cd="latexml" id="S2.SS4.p1.2.m2.1.1.1.1.1.1.cmml" xref="S2.SS4.p1.2.m2.1.1.1.1.1.1">conditional</csymbol><ci id="S2.SS4.p1.2.m2.1.1.1.1.1.2.cmml" xref="S2.SS4.p1.2.m2.1.1.1.1.1.2">𝑦</ci><apply id="S2.SS4.p1.2.m2.1.1.1.1.1.3.cmml" 
xref="S2.SS4.p1.2.m2.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.SS4.p1.2.m2.1.1.1.1.1.3.1.cmml" xref="S2.SS4.p1.2.m2.1.1.1.1.1.3">subscript</csymbol><ci id="S2.SS4.p1.2.m2.1.1.1.1.1.3.2.cmml" xref="S2.SS4.p1.2.m2.1.1.1.1.1.3.2">𝑥</ci><ci id="S2.SS4.p1.2.m2.1.1.1.1.1.3.3.cmml" xref="S2.SS4.p1.2.m2.1.1.1.1.1.3.3">𝑡</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.2.m2.1c">\nabla\log P_{\phi}(y|x_{t})</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.2.m2.1d">∇ roman_log italic_P start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_y | italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )</annotation></semantics></math>. CFG, on the other hand, combines conditional and unconditional score estimates to steer the reverse process. As suggested in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite>, double guidance can be applied for enhanced alignment:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E11"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\hat{\epsilon}_{\theta}\left(z_{t},t\right)\leftarrow\begin{gathered}\omega% \epsilon_{\theta}\left(z_{t},t,E_{\text{mix}}\right)+(1-\omega)\epsilon_{% \theta}\left(z_{t},t,\varnothing\right)\\ -\gamma\bar{\beta}_{t}\nabla\log P_{\phi}\left(y|z_{t},t,E_{v}\right)\end{gathered}" class="ltx_Math" display="block" id="S2.E11.m1.59"><semantics id="S2.E11.m1.59a"><mrow id="S2.E11.m1.59.59" xref="S2.E11.m1.59.59.cmml"><mrow id="S2.E11.m1.59.59.1" xref="S2.E11.m1.59.59.1.cmml"><msub id="S2.E11.m1.59.59.1.3" xref="S2.E11.m1.59.59.1.3.cmml"><mover accent="true" id="S2.E11.m1.59.59.1.3.2" xref="S2.E11.m1.59.59.1.3.2.cmml"><mi id="S2.E11.m1.59.59.1.3.2.2" xref="S2.E11.m1.59.59.1.3.2.2.cmml">ϵ</mi><mo id="S2.E11.m1.59.59.1.3.2.1" 
xref="S2.E11.m1.59.59.1.3.2.1.cmml">^</mo></mover><mi id="S2.E11.m1.59.59.1.3.3" xref="S2.E11.m1.59.59.1.3.3.cmml">θ</mi></msub><mo id="S2.E11.m1.59.59.1.2" xref="S2.E11.m1.59.59.1.2.cmml">⁢</mo><mrow id="S2.E11.m1.59.59.1.1.1" xref="S2.E11.m1.59.59.1.1.2.cmml"><mo id="S2.E11.m1.59.59.1.1.1.2" xref="S2.E11.m1.59.59.1.1.2.cmml">(</mo><msub id="S2.E11.m1.59.59.1.1.1.1" xref="S2.E11.m1.59.59.1.1.1.1.cmml"><mi id="S2.E11.m1.59.59.1.1.1.1.2" xref="S2.E11.m1.59.59.1.1.1.1.2.cmml">z</mi><mi id="S2.E11.m1.59.59.1.1.1.1.3" xref="S2.E11.m1.59.59.1.1.1.1.3.cmml">t</mi></msub><mo id="S2.E11.m1.59.59.1.1.1.3" xref="S2.E11.m1.59.59.1.1.2.cmml">,</mo><mi id="S2.E11.m1.58.58" xref="S2.E11.m1.58.58.cmml">t</mi><mo id="S2.E11.m1.59.59.1.1.1.4" xref="S2.E11.m1.59.59.1.1.2.cmml">)</mo></mrow></mrow><mo id="S2.E11.m1.59.59.2" stretchy="false" xref="S2.E11.m1.59.59.2.cmml">←</mo><mtable displaystyle="true" id="S2.E11.m1.57.57.10" rowspacing="0pt" xref="S2.E11.m1.52.52.5.cmml"><mtr id="S2.E11.m1.57.57.10a" xref="S2.E11.m1.52.52.5.cmml"><mtd id="S2.E11.m1.57.57.10b" xref="S2.E11.m1.52.52.5.cmml"><mrow id="S2.E11.m1.56.56.9.51.32.32" xref="S2.E11.m1.52.52.5.cmml"><mrow id="S2.E11.m1.54.54.7.49.30.30.30" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.1.1.1.1.1.1" xref="S2.E11.m1.1.1.1.1.1.1.cmml">ω</mi><mo id="S2.E11.m1.54.54.7.49.30.30.30.3" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><msub id="S2.E11.m1.54.54.7.49.30.30.30.4" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.2.2.2.2.2.2" xref="S2.E11.m1.2.2.2.2.2.2.cmml">ϵ</mi><mi id="S2.E11.m1.3.3.3.3.3.3.1" xref="S2.E11.m1.3.3.3.3.3.3.1.cmml">θ</mi></msub><mo id="S2.E11.m1.54.54.7.49.30.30.30.3a" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><mrow id="S2.E11.m1.54.54.7.49.30.30.30.2.2" xref="S2.E11.m1.52.52.5.cmml"><mo id="S2.E11.m1.4.4.4.4.4.4" xref="S2.E11.m1.52.52.5.cmml">(</mo><msub id="S2.E11.m1.53.53.6.48.29.29.29.1.1.1" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.5.5.5.5.5.5" xref="S2.E11.m1.5.5.5.5.5.5.cmml">z</mi><mi 
id="S2.E11.m1.6.6.6.6.6.6.1" xref="S2.E11.m1.6.6.6.6.6.6.1.cmml">t</mi></msub><mo id="S2.E11.m1.7.7.7.7.7.7" xref="S2.E11.m1.52.52.5.cmml">,</mo><mi id="S2.E11.m1.8.8.8.8.8.8" xref="S2.E11.m1.8.8.8.8.8.8.cmml">t</mi><mo id="S2.E11.m1.9.9.9.9.9.9" xref="S2.E11.m1.52.52.5.cmml">,</mo><msub id="S2.E11.m1.54.54.7.49.30.30.30.2.2.2" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.10.10.10.10.10.10" xref="S2.E11.m1.10.10.10.10.10.10.cmml">E</mi><mtext id="S2.E11.m1.11.11.11.11.11.11.1" xref="S2.E11.m1.11.11.11.11.11.11.1a.cmml">mix</mtext></msub><mo id="S2.E11.m1.12.12.12.12.12.12" xref="S2.E11.m1.52.52.5.cmml">)</mo></mrow></mrow><mo id="S2.E11.m1.13.13.13.13.13.13" xref="S2.E11.m1.13.13.13.13.13.13.cmml">+</mo><mrow id="S2.E11.m1.56.56.9.51.32.32.32" xref="S2.E11.m1.52.52.5.cmml"><mrow id="S2.E11.m1.55.55.8.50.31.31.31.1.1" xref="S2.E11.m1.52.52.5.cmml"><mo id="S2.E11.m1.14.14.14.14.14.14" stretchy="false" xref="S2.E11.m1.52.52.5.cmml">(</mo><mrow id="S2.E11.m1.55.55.8.50.31.31.31.1.1.1" xref="S2.E11.m1.52.52.5.cmml"><mn id="S2.E11.m1.15.15.15.15.15.15" xref="S2.E11.m1.15.15.15.15.15.15.cmml">1</mn><mo id="S2.E11.m1.16.16.16.16.16.16" xref="S2.E11.m1.16.16.16.16.16.16.cmml">−</mo><mi id="S2.E11.m1.17.17.17.17.17.17" xref="S2.E11.m1.17.17.17.17.17.17.cmml">ω</mi></mrow><mo id="S2.E11.m1.18.18.18.18.18.18" stretchy="false" xref="S2.E11.m1.52.52.5.cmml">)</mo></mrow><mo id="S2.E11.m1.56.56.9.51.32.32.32.3" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><msub id="S2.E11.m1.56.56.9.51.32.32.32.4" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.19.19.19.19.19.19" xref="S2.E11.m1.19.19.19.19.19.19.cmml">ϵ</mi><mi id="S2.E11.m1.20.20.20.20.20.20.1" xref="S2.E11.m1.20.20.20.20.20.20.1.cmml">θ</mi></msub><mo id="S2.E11.m1.56.56.9.51.32.32.32.3a" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><mrow id="S2.E11.m1.56.56.9.51.32.32.32.2.1" xref="S2.E11.m1.52.52.5.cmml"><mo id="S2.E11.m1.21.21.21.21.21.21" xref="S2.E11.m1.52.52.5.cmml">(</mo><msub id="S2.E11.m1.56.56.9.51.32.32.32.2.1.1" 
xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.22.22.22.22.22.22" xref="S2.E11.m1.22.22.22.22.22.22.cmml">z</mi><mi id="S2.E11.m1.23.23.23.23.23.23.1" xref="S2.E11.m1.23.23.23.23.23.23.1.cmml">t</mi></msub><mo id="S2.E11.m1.24.24.24.24.24.24" xref="S2.E11.m1.52.52.5.cmml">,</mo><mi id="S2.E11.m1.25.25.25.25.25.25" xref="S2.E11.m1.25.25.25.25.25.25.cmml">t</mi><mo id="S2.E11.m1.26.26.26.26.26.26" xref="S2.E11.m1.52.52.5.cmml">,</mo><mi id="S2.E11.m1.27.27.27.27.27.27" mathvariant="normal" xref="S2.E11.m1.27.27.27.27.27.27.cmml">∅</mi><mo id="S2.E11.m1.28.28.28.28.28.28" xref="S2.E11.m1.52.52.5.cmml">)</mo></mrow></mrow></mrow></mtd></mtr><mtr id="S2.E11.m1.57.57.10c" xref="S2.E11.m1.52.52.5.cmml"><mtd id="S2.E11.m1.57.57.10d" xref="S2.E11.m1.52.52.5.cmml"><mrow id="S2.E11.m1.57.57.10.52.20.20" xref="S2.E11.m1.52.52.5.cmml"><mo id="S2.E11.m1.57.57.10.52.20.20a" xref="S2.E11.m1.52.52.5.cmml">−</mo><mrow id="S2.E11.m1.57.57.10.52.20.20.20" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.30.30.30.2.2.2" xref="S2.E11.m1.30.30.30.2.2.2.cmml">γ</mi><mo id="S2.E11.m1.57.57.10.52.20.20.20.2" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><msub id="S2.E11.m1.57.57.10.52.20.20.20.3" xref="S2.E11.m1.52.52.5.cmml"><mover accent="true" id="S2.E11.m1.31.31.31.3.3.3" xref="S2.E11.m1.31.31.31.3.3.3.cmml"><mi id="S2.E11.m1.31.31.31.3.3.3.2" xref="S2.E11.m1.31.31.31.3.3.3.2.cmml">β</mi><mo id="S2.E11.m1.31.31.31.3.3.3.1" xref="S2.E11.m1.31.31.31.3.3.3.1.cmml">¯</mo></mover><mi id="S2.E11.m1.32.32.32.4.4.4.1" xref="S2.E11.m1.32.32.32.4.4.4.1.cmml">t</mi></msub><mo id="S2.E11.m1.57.57.10.52.20.20.20.2a" lspace="0.167em" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><mrow id="S2.E11.m1.57.57.10.52.20.20.20.4" xref="S2.E11.m1.52.52.5.cmml"><mrow id="S2.E11.m1.57.57.10.52.20.20.20.4.1" xref="S2.E11.m1.52.52.5.cmml"><mo id="S2.E11.m1.33.33.33.5.5.5" rspace="0.167em" xref="S2.E11.m1.33.33.33.5.5.5.cmml">∇</mo><mi id="S2.E11.m1.34.34.34.6.6.6" xref="S2.E11.m1.34.34.34.6.6.6.cmml">log</mi></mrow><mo 
id="S2.E11.m1.57.57.10.52.20.20.20.4a" lspace="0.167em" xref="S2.E11.m1.52.52.5.cmml">⁡</mo><msub id="S2.E11.m1.57.57.10.52.20.20.20.4.2" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.35.35.35.7.7.7" xref="S2.E11.m1.35.35.35.7.7.7.cmml">P</mi><mi id="S2.E11.m1.36.36.36.8.8.8.1" xref="S2.E11.m1.36.36.36.8.8.8.1.cmml">ϕ</mi></msub></mrow><mo id="S2.E11.m1.57.57.10.52.20.20.20.2b" xref="S2.E11.m1.52.52.5.cmml">⁢</mo><mrow id="S2.E11.m1.57.57.10.52.20.20.20.1.1" xref="S2.E11.m1.52.52.5.cmml"><mo id="S2.E11.m1.37.37.37.9.9.9" xref="S2.E11.m1.52.52.5.cmml">(</mo><mrow id="S2.E11.m1.57.57.10.52.20.20.20.1.1.1" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.38.38.38.10.10.10" xref="S2.E11.m1.38.38.38.10.10.10.cmml">y</mi><mo fence="false" id="S2.E11.m1.39.39.39.11.11.11" xref="S2.E11.m1.39.39.39.11.11.11.cmml">|</mo><mrow id="S2.E11.m1.57.57.10.52.20.20.20.1.1.1.2.2" xref="S2.E11.m1.52.52.5.cmml"><msub id="S2.E11.m1.57.57.10.52.20.20.20.1.1.1.1.1.1" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.40.40.40.12.12.12" xref="S2.E11.m1.40.40.40.12.12.12.cmml">z</mi><mi id="S2.E11.m1.41.41.41.13.13.13.1" xref="S2.E11.m1.41.41.41.13.13.13.1.cmml">t</mi></msub><mo id="S2.E11.m1.42.42.42.14.14.14" xref="S2.E11.m1.52.52.5.cmml">,</mo><mi id="S2.E11.m1.43.43.43.15.15.15" xref="S2.E11.m1.43.43.43.15.15.15.cmml">t</mi><mo id="S2.E11.m1.44.44.44.16.16.16" xref="S2.E11.m1.52.52.5.cmml">,</mo><msub id="S2.E11.m1.57.57.10.52.20.20.20.1.1.1.2.2.2" xref="S2.E11.m1.52.52.5.cmml"><mi id="S2.E11.m1.45.45.45.17.17.17" xref="S2.E11.m1.45.45.45.17.17.17.cmml">E</mi><mi id="S2.E11.m1.46.46.46.18.18.18.1" xref="S2.E11.m1.46.46.46.18.18.18.1.cmml">v</mi></msub></mrow></mrow><mo id="S2.E11.m1.47.47.47.19.19.19" xref="S2.E11.m1.52.52.5.cmml">)</mo></mrow></mrow></mrow></mtd></mtr></mtable></mrow><annotation-xml encoding="MathML-Content" id="S2.E11.m1.59b"><apply id="S2.E11.m1.59.59.cmml" xref="S2.E11.m1.59.59"><ci id="S2.E11.m1.59.59.2.cmml" xref="S2.E11.m1.59.59.2">←</ci><apply 
id="S2.E11.m1.59.59.1.cmml" xref="S2.E11.m1.59.59.1"><times id="S2.E11.m1.59.59.1.2.cmml" xref="S2.E11.m1.59.59.1.2"></times><apply id="S2.E11.m1.59.59.1.3.cmml" xref="S2.E11.m1.59.59.1.3"><csymbol cd="ambiguous" id="S2.E11.m1.59.59.1.3.1.cmml" xref="S2.E11.m1.59.59.1.3">subscript</csymbol><apply id="S2.E11.m1.59.59.1.3.2.cmml" xref="S2.E11.m1.59.59.1.3.2"><ci id="S2.E11.m1.59.59.1.3.2.1.cmml" xref="S2.E11.m1.59.59.1.3.2.1">^</ci><ci id="S2.E11.m1.59.59.1.3.2.2.cmml" xref="S2.E11.m1.59.59.1.3.2.2">italic-ϵ</ci></apply><ci id="S2.E11.m1.59.59.1.3.3.cmml" xref="S2.E11.m1.59.59.1.3.3">𝜃</ci></apply><interval closure="open" id="S2.E11.m1.59.59.1.1.2.cmml" xref="S2.E11.m1.59.59.1.1.1"><apply id="S2.E11.m1.59.59.1.1.1.1.cmml" xref="S2.E11.m1.59.59.1.1.1.1"><csymbol cd="ambiguous" id="S2.E11.m1.59.59.1.1.1.1.1.cmml" xref="S2.E11.m1.59.59.1.1.1.1">subscript</csymbol><ci id="S2.E11.m1.59.59.1.1.1.1.2.cmml" xref="S2.E11.m1.59.59.1.1.1.1.2">𝑧</ci><ci id="S2.E11.m1.59.59.1.1.1.1.3.cmml" xref="S2.E11.m1.59.59.1.1.1.1.3">𝑡</ci></apply><ci id="S2.E11.m1.58.58.cmml" xref="S2.E11.m1.58.58">𝑡</ci></interval></apply><apply id="S2.E11.m1.52.52.5.cmml" xref="S2.E11.m1.57.57.10"><minus id="S2.E11.m1.29.29.29.1.1.1.cmml" xref="S2.E11.m1.57.57.10"></minus><apply id="S2.E11.m1.51.51.4.4.cmml" xref="S2.E11.m1.57.57.10"><plus id="S2.E11.m1.13.13.13.13.13.13.cmml" xref="S2.E11.m1.13.13.13.13.13.13"></plus><apply id="S2.E11.m1.49.49.2.2.2.cmml" xref="S2.E11.m1.57.57.10"><times id="S2.E11.m1.49.49.2.2.2.3.cmml" xref="S2.E11.m1.57.57.10"></times><ci id="S2.E11.m1.1.1.1.1.1.1.cmml" xref="S2.E11.m1.1.1.1.1.1.1">𝜔</ci><apply id="S2.E11.m1.49.49.2.2.2.5.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.49.49.2.2.2.5.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.2.2.2.2.2.2.cmml" xref="S2.E11.m1.2.2.2.2.2.2">italic-ϵ</ci><ci id="S2.E11.m1.3.3.3.3.3.3.1.cmml" xref="S2.E11.m1.3.3.3.3.3.3.1">𝜃</ci></apply><vector id="S2.E11.m1.49.49.2.2.2.2.3.cmml" 
xref="S2.E11.m1.57.57.10"><apply id="S2.E11.m1.48.48.1.1.1.1.1.1.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.48.48.1.1.1.1.1.1.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.5.5.5.5.5.5.cmml" xref="S2.E11.m1.5.5.5.5.5.5">𝑧</ci><ci id="S2.E11.m1.6.6.6.6.6.6.1.cmml" xref="S2.E11.m1.6.6.6.6.6.6.1">𝑡</ci></apply><ci id="S2.E11.m1.8.8.8.8.8.8.cmml" xref="S2.E11.m1.8.8.8.8.8.8">𝑡</ci><apply id="S2.E11.m1.49.49.2.2.2.2.2.2.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.49.49.2.2.2.2.2.2.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.10.10.10.10.10.10.cmml" xref="S2.E11.m1.10.10.10.10.10.10">𝐸</ci><ci id="S2.E11.m1.11.11.11.11.11.11.1a.cmml" xref="S2.E11.m1.11.11.11.11.11.11.1"><mtext id="S2.E11.m1.11.11.11.11.11.11.1.cmml" mathsize="70%" xref="S2.E11.m1.11.11.11.11.11.11.1">mix</mtext></ci></apply></vector></apply><apply id="S2.E11.m1.51.51.4.4.4.cmml" xref="S2.E11.m1.57.57.10"><times id="S2.E11.m1.51.51.4.4.4.3.cmml" xref="S2.E11.m1.57.57.10"></times><apply id="S2.E11.m1.50.50.3.3.3.1.1.1.cmml" xref="S2.E11.m1.57.57.10"><minus id="S2.E11.m1.16.16.16.16.16.16.cmml" xref="S2.E11.m1.16.16.16.16.16.16"></minus><cn id="S2.E11.m1.15.15.15.15.15.15.cmml" type="integer" xref="S2.E11.m1.15.15.15.15.15.15">1</cn><ci id="S2.E11.m1.17.17.17.17.17.17.cmml" xref="S2.E11.m1.17.17.17.17.17.17">𝜔</ci></apply><apply id="S2.E11.m1.51.51.4.4.4.4.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.51.51.4.4.4.4.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.19.19.19.19.19.19.cmml" xref="S2.E11.m1.19.19.19.19.19.19">italic-ϵ</ci><ci id="S2.E11.m1.20.20.20.20.20.20.1.cmml" xref="S2.E11.m1.20.20.20.20.20.20.1">𝜃</ci></apply><vector id="S2.E11.m1.51.51.4.4.4.2.2.cmml" xref="S2.E11.m1.57.57.10"><apply id="S2.E11.m1.51.51.4.4.4.2.1.1.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.51.51.4.4.4.2.1.1.1.cmml" 
xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.22.22.22.22.22.22.cmml" xref="S2.E11.m1.22.22.22.22.22.22">𝑧</ci><ci id="S2.E11.m1.23.23.23.23.23.23.1.cmml" xref="S2.E11.m1.23.23.23.23.23.23.1">𝑡</ci></apply><ci id="S2.E11.m1.25.25.25.25.25.25.cmml" xref="S2.E11.m1.25.25.25.25.25.25">𝑡</ci><emptyset id="S2.E11.m1.27.27.27.27.27.27.cmml" xref="S2.E11.m1.27.27.27.27.27.27"></emptyset></vector></apply></apply><apply id="S2.E11.m1.52.52.5.5.cmml" xref="S2.E11.m1.57.57.10"><times id="S2.E11.m1.52.52.5.5.2.cmml" xref="S2.E11.m1.57.57.10"></times><ci id="S2.E11.m1.30.30.30.2.2.2.cmml" xref="S2.E11.m1.30.30.30.2.2.2">𝛾</ci><apply id="S2.E11.m1.52.52.5.5.4.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.52.52.5.5.4.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><apply id="S2.E11.m1.31.31.31.3.3.3.cmml" xref="S2.E11.m1.31.31.31.3.3.3"><ci id="S2.E11.m1.31.31.31.3.3.3.1.cmml" xref="S2.E11.m1.31.31.31.3.3.3.1">¯</ci><ci id="S2.E11.m1.31.31.31.3.3.3.2.cmml" xref="S2.E11.m1.31.31.31.3.3.3.2">𝛽</ci></apply><ci id="S2.E11.m1.32.32.32.4.4.4.1.cmml" xref="S2.E11.m1.32.32.32.4.4.4.1">𝑡</ci></apply><apply id="S2.E11.m1.52.52.5.5.5.cmml" xref="S2.E11.m1.57.57.10"><apply id="S2.E11.m1.52.52.5.5.5.1.cmml" xref="S2.E11.m1.57.57.10"><ci id="S2.E11.m1.33.33.33.5.5.5.cmml" xref="S2.E11.m1.33.33.33.5.5.5">∇</ci><log id="S2.E11.m1.34.34.34.6.6.6.cmml" xref="S2.E11.m1.34.34.34.6.6.6"></log></apply><apply id="S2.E11.m1.52.52.5.5.5.2.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.52.52.5.5.5.2.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.35.35.35.7.7.7.cmml" xref="S2.E11.m1.35.35.35.7.7.7">𝑃</ci><ci id="S2.E11.m1.36.36.36.8.8.8.1.cmml" xref="S2.E11.m1.36.36.36.8.8.8.1">italic-ϕ</ci></apply></apply><apply id="S2.E11.m1.52.52.5.5.1.1.1.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="latexml" id="S2.E11.m1.39.39.39.11.11.11.cmml" xref="S2.E11.m1.39.39.39.11.11.11">conditional</csymbol><ci 
id="S2.E11.m1.38.38.38.10.10.10.cmml" xref="S2.E11.m1.38.38.38.10.10.10">𝑦</ci><list id="S2.E11.m1.52.52.5.5.1.1.1.2.3.cmml" xref="S2.E11.m1.57.57.10"><apply id="S2.E11.m1.52.52.5.5.1.1.1.1.1.1.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.52.52.5.5.1.1.1.1.1.1.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.40.40.40.12.12.12.cmml" xref="S2.E11.m1.40.40.40.12.12.12">𝑧</ci><ci id="S2.E11.m1.41.41.41.13.13.13.1.cmml" xref="S2.E11.m1.41.41.41.13.13.13.1">𝑡</ci></apply><ci id="S2.E11.m1.43.43.43.15.15.15.cmml" xref="S2.E11.m1.43.43.43.15.15.15">𝑡</ci><apply id="S2.E11.m1.52.52.5.5.1.1.1.2.2.2.cmml" xref="S2.E11.m1.57.57.10"><csymbol cd="ambiguous" id="S2.E11.m1.52.52.5.5.1.1.1.2.2.2.1.cmml" xref="S2.E11.m1.57.57.10">subscript</csymbol><ci id="S2.E11.m1.45.45.45.17.17.17.cmml" xref="S2.E11.m1.45.45.45.17.17.17">𝐸</ci><ci id="S2.E11.m1.46.46.46.18.18.18.1.cmml" xref="S2.E11.m1.46.46.46.18.18.18.1">𝑣</ci></apply></list></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E11.m1.59c">\hat{\epsilon}_{\theta}\left(z_{t},t\right)\leftarrow\begin{gathered}\omega% \epsilon_{\theta}\left(z_{t},t,E_{\text{mix}}\right)+(1-\omega)\epsilon_{% \theta}\left(z_{t},t,\varnothing\right)\\ -\gamma\bar{\beta}_{t}\nabla\log P_{\phi}\left(y|z_{t},t,E_{v}\right)\end{gathered}</annotation><annotation encoding="application/x-llamapun" id="S2.E11.m1.59d">over^ start_ARG italic_ϵ end_ARG start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ← start_ROW start_CELL italic_ω italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_E start_POSTSUBSCRIPT mix end_POSTSUBSCRIPT ) + ( 1 - italic_ω ) italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , ∅ ) end_CELL end_ROW start_ROW start_CELL - italic_γ 
over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∇ roman_log italic_P start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_y | italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(11)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS4.p1.7">where <math alttext="\gamma" class="ltx_Math" display="inline" id="S2.SS4.p1.3.m1.1"><semantics id="S2.SS4.p1.3.m1.1a"><mi id="S2.SS4.p1.3.m1.1.1" xref="S2.SS4.p1.3.m1.1.1.cmml">γ</mi><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.3.m1.1b"><ci id="S2.SS4.p1.3.m1.1.1.cmml" xref="S2.SS4.p1.3.m1.1.1">𝛾</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.3.m1.1c">\gamma</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.3.m1.1d">italic_γ</annotation></semantics></math> and <math alttext="\omega" class="ltx_Math" display="inline" id="S2.SS4.p1.4.m2.1"><semantics id="S2.SS4.p1.4.m2.1a"><mi id="S2.SS4.p1.4.m2.1.1" xref="S2.SS4.p1.4.m2.1.1.cmml">ω</mi><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.4.m2.1b"><ci id="S2.SS4.p1.4.m2.1.1.cmml" xref="S2.SS4.p1.4.m2.1.1">𝜔</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.4.m2.1c">\omega</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.4.m2.1d">italic_ω</annotation></semantics></math> represent the scales for CG and CFG, respectively. 
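</p> <p class="ltx_p">As a minimal sketch of how the update in Eq. (11) combines its three terms, the following Python function applies the rule elementwise to flat lists of floats. The function name <code>double_guidance</code> and all argument names are illustrative assumptions rather than the authors' implementation. The two noise predictions and the classifier gradient are treated as precomputed inputs, and the gradient is assumed to be taken with respect to the noisy latent.</p>

```python
from typing import List, Sequence


def double_guidance(
    eps_cond: Sequence[float],    # eps_theta(z_t, t, E_mix): conditional prediction
    eps_uncond: Sequence[float],  # eps_theta(z_t, t, null): unconditional prediction
    grad_log_p: Sequence[float],  # gradient of log P_phi(y | z_t, t, E_v), assumed w.r.t. z_t
    omega: float,                 # CFG scale
    gamma: float,                 # CG scale
    beta_bar_t: float,            # schedule-dependent factor at timestep t
) -> List[float]:
    """Combine the CFG and CG terms of Eq. (11) elementwise (hypothetical helper)."""
    if not (len(eps_cond) == len(eps_uncond) == len(grad_log_p)):
        raise ValueError("all inputs must have the same length")
    return [
        # CFG mixture of the two predictions, minus the scaled classifier gradient
        omega * c + (1.0 - omega) * u - gamma * beta_bar_t * g
        for c, u, g in zip(eps_cond, eps_uncond, grad_log_p)
    ]
```

<p class="ltx_p">Under this form, setting the CFG scale to 1 and the CG scale to 0 returns the conditional prediction unchanged, which gives a simple sanity check for any tensor-based version of the same rule.</p> <p class="ltx_p">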
Notably, CFG employs <math alttext="E_{\text{mix}}" class="ltx_Math" display="inline" id="S2.SS4.p1.5.m3.1"><semantics id="S2.SS4.p1.5.m3.1a"><msub id="S2.SS4.p1.5.m3.1.1" xref="S2.SS4.p1.5.m3.1.1.cmml"><mi id="S2.SS4.p1.5.m3.1.1.2" xref="S2.SS4.p1.5.m3.1.1.2.cmml">E</mi><mtext id="S2.SS4.p1.5.m3.1.1.3" xref="S2.SS4.p1.5.m3.1.1.3a.cmml">mix</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.5.m3.1b"><apply id="S2.SS4.p1.5.m3.1.1.cmml" xref="S2.SS4.p1.5.m3.1.1"><csymbol cd="ambiguous" id="S2.SS4.p1.5.m3.1.1.1.cmml" xref="S2.SS4.p1.5.m3.1.1">subscript</csymbol><ci id="S2.SS4.p1.5.m3.1.1.2.cmml" xref="S2.SS4.p1.5.m3.1.1.2">𝐸</ci><ci id="S2.SS4.p1.5.m3.1.1.3a.cmml" xref="S2.SS4.p1.5.m3.1.1.3"><mtext id="S2.SS4.p1.5.m3.1.1.3.cmml" mathsize="70%" xref="S2.SS4.p1.5.m3.1.1.3">mix</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.5.m3.1c">E_{\text{mix}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.5.m3.1d">italic_E start_POSTSUBSCRIPT mix end_POSTSUBSCRIPT</annotation></semantics></math>, while CG uses <math alttext="E_{v}" class="ltx_Math" display="inline" id="S2.SS4.p1.6.m4.1"><semantics id="S2.SS4.p1.6.m4.1a"><msub id="S2.SS4.p1.6.m4.1.1" xref="S2.SS4.p1.6.m4.1.1.cmml"><mi id="S2.SS4.p1.6.m4.1.1.2" xref="S2.SS4.p1.6.m4.1.1.2.cmml">E</mi><mi id="S2.SS4.p1.6.m4.1.1.3" xref="S2.SS4.p1.6.m4.1.1.3.cmml">v</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.6.m4.1b"><apply id="S2.SS4.p1.6.m4.1.1.cmml" xref="S2.SS4.p1.6.m4.1.1"><csymbol cd="ambiguous" id="S2.SS4.p1.6.m4.1.1.1.cmml" xref="S2.SS4.p1.6.m4.1.1">subscript</csymbol><ci id="S2.SS4.p1.6.m4.1.1.2.cmml" xref="S2.SS4.p1.6.m4.1.1.2">𝐸</ci><ci id="S2.SS4.p1.6.m4.1.1.3.cmml" xref="S2.SS4.p1.6.m4.1.1.3">𝑣</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.6.m4.1c">E_{v}</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.6.m4.1d">italic_E start_POSTSUBSCRIPT italic_v 
end_POSTSUBSCRIPT</annotation></semantics></math> due to the aligned classifier <math alttext="P_{\phi}(y|z_{t},E_{v})" class="ltx_Math" display="inline" id="S2.SS4.p1.7.m5.1"><semantics id="S2.SS4.p1.7.m5.1a"><mrow id="S2.SS4.p1.7.m5.1.1" xref="S2.SS4.p1.7.m5.1.1.cmml"><msub id="S2.SS4.p1.7.m5.1.1.3" xref="S2.SS4.p1.7.m5.1.1.3.cmml"><mi id="S2.SS4.p1.7.m5.1.1.3.2" xref="S2.SS4.p1.7.m5.1.1.3.2.cmml">P</mi><mi id="S2.SS4.p1.7.m5.1.1.3.3" xref="S2.SS4.p1.7.m5.1.1.3.3.cmml">ϕ</mi></msub><mo id="S2.SS4.p1.7.m5.1.1.2" xref="S2.SS4.p1.7.m5.1.1.2.cmml">⁢</mo><mrow id="S2.SS4.p1.7.m5.1.1.1.1" xref="S2.SS4.p1.7.m5.1.1.1.1.1.cmml"><mo id="S2.SS4.p1.7.m5.1.1.1.1.2" stretchy="false" xref="S2.SS4.p1.7.m5.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS4.p1.7.m5.1.1.1.1.1" xref="S2.SS4.p1.7.m5.1.1.1.1.1.cmml"><mi id="S2.SS4.p1.7.m5.1.1.1.1.1.4" xref="S2.SS4.p1.7.m5.1.1.1.1.1.4.cmml">y</mi><mo fence="false" id="S2.SS4.p1.7.m5.1.1.1.1.1.3" xref="S2.SS4.p1.7.m5.1.1.1.1.1.3.cmml">|</mo><mrow id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.3.cmml"><msub id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.cmml"><mi id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.2" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.2.cmml">z</mi><mi id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.3" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.3" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.3.cmml">,</mo><msub id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.cmml"><mi id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.2" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.2.cmml">E</mi><mi id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.3" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.3.cmml">v</mi></msub></mrow></mrow><mo id="S2.SS4.p1.7.m5.1.1.1.1.3" stretchy="false" xref="S2.SS4.p1.7.m5.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS4.p1.7.m5.1b"><apply id="S2.SS4.p1.7.m5.1.1.cmml" xref="S2.SS4.p1.7.m5.1.1"><times id="S2.SS4.p1.7.m5.1.1.2.cmml" 
xref="S2.SS4.p1.7.m5.1.1.2"></times><apply id="S2.SS4.p1.7.m5.1.1.3.cmml" xref="S2.SS4.p1.7.m5.1.1.3"><csymbol cd="ambiguous" id="S2.SS4.p1.7.m5.1.1.3.1.cmml" xref="S2.SS4.p1.7.m5.1.1.3">subscript</csymbol><ci id="S2.SS4.p1.7.m5.1.1.3.2.cmml" xref="S2.SS4.p1.7.m5.1.1.3.2">𝑃</ci><ci id="S2.SS4.p1.7.m5.1.1.3.3.cmml" xref="S2.SS4.p1.7.m5.1.1.3.3">italic-ϕ</ci></apply><apply id="S2.SS4.p1.7.m5.1.1.1.1.1.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1"><csymbol cd="latexml" id="S2.SS4.p1.7.m5.1.1.1.1.1.3.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.3">conditional</csymbol><ci id="S2.SS4.p1.7.m5.1.1.1.1.1.4.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.4">𝑦</ci><list id="S2.SS4.p1.7.m5.1.1.1.1.1.2.3.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2"><apply id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.1.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.2.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.2">𝑧</ci><ci id="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.3.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.1.1.1.3">𝑡</ci></apply><apply id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.1.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.2.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.2">𝐸</ci><ci id="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.3.cmml" xref="S2.SS4.p1.7.m5.1.1.1.1.1.2.2.2.3">𝑣</ci></apply></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p1.7.m5.1c">P_{\phi}(y|z_{t},E_{v})</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p1.7.m5.1d">italic_P start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_y | italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT )</annotation></semantics></math> trained for the alignment of audio-visual pairs as 
discussed in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite>.</p> </div> <div class="ltx_para" id="S2.SS4.p2"> <p class="ltx_p" id="S2.SS4.p2.1">From the perspective of Energy Based Models (EBMs)<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib24" title="">24</a>]</cite>, multiple conditions can also influence the inference process independently without combination. The conditional probability can be estimated by the following formula:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E12"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="p\left(\boldsymbol{x}|\boldsymbol{c}_{1},\ldots,\boldsymbol{c}_{n}\right)% \propto p(\boldsymbol{x})\prod_{i=1}^{n}\frac{p\left(\boldsymbol{x}|% \boldsymbol{c}_{i}\right)}{p(\boldsymbol{x})}" class="ltx_Math" display="block" id="S2.E12.m1.5"><semantics id="S2.E12.m1.5a"><mrow id="S2.E12.m1.5.5" xref="S2.E12.m1.5.5.cmml"><mrow id="S2.E12.m1.5.5.1" xref="S2.E12.m1.5.5.1.cmml"><mi id="S2.E12.m1.5.5.1.3" xref="S2.E12.m1.5.5.1.3.cmml">p</mi><mo id="S2.E12.m1.5.5.1.2" xref="S2.E12.m1.5.5.1.2.cmml">⁢</mo><mrow id="S2.E12.m1.5.5.1.1.1" xref="S2.E12.m1.5.5.1.1.1.1.cmml"><mo id="S2.E12.m1.5.5.1.1.1.2" xref="S2.E12.m1.5.5.1.1.1.1.cmml">(</mo><mrow id="S2.E12.m1.5.5.1.1.1.1" xref="S2.E12.m1.5.5.1.1.1.1.cmml"><mi id="S2.E12.m1.5.5.1.1.1.1.4" xref="S2.E12.m1.5.5.1.1.1.1.4.cmml">𝒙</mi><mo fence="false" id="S2.E12.m1.5.5.1.1.1.1.3" xref="S2.E12.m1.5.5.1.1.1.1.3.cmml">|</mo><mrow id="S2.E12.m1.5.5.1.1.1.1.2.2" xref="S2.E12.m1.5.5.1.1.1.1.2.3.cmml"><msub id="S2.E12.m1.5.5.1.1.1.1.1.1.1" xref="S2.E12.m1.5.5.1.1.1.1.1.1.1.cmml"><mi id="S2.E12.m1.5.5.1.1.1.1.1.1.1.2" xref="S2.E12.m1.5.5.1.1.1.1.1.1.1.2.cmml">𝒄</mi><mn id="S2.E12.m1.5.5.1.1.1.1.1.1.1.3" 
xref="S2.E12.m1.5.5.1.1.1.1.1.1.1.3.cmml">1</mn></msub><mo id="S2.E12.m1.5.5.1.1.1.1.2.2.3" xref="S2.E12.m1.5.5.1.1.1.1.2.3.cmml">,</mo><mi id="S2.E12.m1.3.3" mathvariant="normal" xref="S2.E12.m1.3.3.cmml">…</mi><mo id="S2.E12.m1.5.5.1.1.1.1.2.2.4" xref="S2.E12.m1.5.5.1.1.1.1.2.3.cmml">,</mo><msub id="S2.E12.m1.5.5.1.1.1.1.2.2.2" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2.cmml"><mi id="S2.E12.m1.5.5.1.1.1.1.2.2.2.2" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2.2.cmml">𝒄</mi><mi id="S2.E12.m1.5.5.1.1.1.1.2.2.2.3" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2.3.cmml">n</mi></msub></mrow></mrow><mo id="S2.E12.m1.5.5.1.1.1.3" xref="S2.E12.m1.5.5.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E12.m1.5.5.2" xref="S2.E12.m1.5.5.2.cmml">∝</mo><mrow id="S2.E12.m1.5.5.3" xref="S2.E12.m1.5.5.3.cmml"><mi id="S2.E12.m1.5.5.3.2" xref="S2.E12.m1.5.5.3.2.cmml">p</mi><mo id="S2.E12.m1.5.5.3.1" xref="S2.E12.m1.5.5.3.1.cmml">⁢</mo><mrow id="S2.E12.m1.5.5.3.3.2" xref="S2.E12.m1.5.5.3.cmml"><mo id="S2.E12.m1.5.5.3.3.2.1" stretchy="false" xref="S2.E12.m1.5.5.3.cmml">(</mo><mi id="S2.E12.m1.4.4" xref="S2.E12.m1.4.4.cmml">𝒙</mi><mo id="S2.E12.m1.5.5.3.3.2.2" stretchy="false" xref="S2.E12.m1.5.5.3.cmml">)</mo></mrow><mo id="S2.E12.m1.5.5.3.1a" xref="S2.E12.m1.5.5.3.1.cmml">⁢</mo><mrow id="S2.E12.m1.5.5.3.4" xref="S2.E12.m1.5.5.3.4.cmml"><munderover id="S2.E12.m1.5.5.3.4.1" xref="S2.E12.m1.5.5.3.4.1.cmml"><mo id="S2.E12.m1.5.5.3.4.1.2.2" movablelimits="false" xref="S2.E12.m1.5.5.3.4.1.2.2.cmml">∏</mo><mrow id="S2.E12.m1.5.5.3.4.1.2.3" xref="S2.E12.m1.5.5.3.4.1.2.3.cmml"><mi id="S2.E12.m1.5.5.3.4.1.2.3.2" xref="S2.E12.m1.5.5.3.4.1.2.3.2.cmml">i</mi><mo id="S2.E12.m1.5.5.3.4.1.2.3.1" xref="S2.E12.m1.5.5.3.4.1.2.3.1.cmml">=</mo><mn id="S2.E12.m1.5.5.3.4.1.2.3.3" xref="S2.E12.m1.5.5.3.4.1.2.3.3.cmml">1</mn></mrow><mi id="S2.E12.m1.5.5.3.4.1.3" xref="S2.E12.m1.5.5.3.4.1.3.cmml">n</mi></munderover><mfrac id="S2.E12.m1.2.2" xref="S2.E12.m1.2.2.cmml"><mrow id="S2.E12.m1.1.1.1" xref="S2.E12.m1.1.1.1.cmml"><mi id="S2.E12.m1.1.1.1.3" 
xref="S2.E12.m1.1.1.1.3.cmml">p</mi><mo id="S2.E12.m1.1.1.1.2" xref="S2.E12.m1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E12.m1.1.1.1.1.1" xref="S2.E12.m1.1.1.1.1.1.1.cmml"><mo id="S2.E12.m1.1.1.1.1.1.2" xref="S2.E12.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E12.m1.1.1.1.1.1.1" xref="S2.E12.m1.1.1.1.1.1.1.cmml"><mi id="S2.E12.m1.1.1.1.1.1.1.2" xref="S2.E12.m1.1.1.1.1.1.1.2.cmml">𝒙</mi><mo fence="false" id="S2.E12.m1.1.1.1.1.1.1.1" xref="S2.E12.m1.1.1.1.1.1.1.1.cmml">|</mo><msub id="S2.E12.m1.1.1.1.1.1.1.3" xref="S2.E12.m1.1.1.1.1.1.1.3.cmml"><mi id="S2.E12.m1.1.1.1.1.1.1.3.2" xref="S2.E12.m1.1.1.1.1.1.1.3.2.cmml">𝒄</mi><mi id="S2.E12.m1.1.1.1.1.1.1.3.3" xref="S2.E12.m1.1.1.1.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="S2.E12.m1.1.1.1.1.1.3" xref="S2.E12.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mrow id="S2.E12.m1.2.2.2" xref="S2.E12.m1.2.2.2.cmml"><mi id="S2.E12.m1.2.2.2.3" xref="S2.E12.m1.2.2.2.3.cmml">p</mi><mo id="S2.E12.m1.2.2.2.2" xref="S2.E12.m1.2.2.2.2.cmml">⁢</mo><mrow id="S2.E12.m1.2.2.2.4.2" xref="S2.E12.m1.2.2.2.cmml"><mo id="S2.E12.m1.2.2.2.4.2.1" stretchy="false" xref="S2.E12.m1.2.2.2.cmml">(</mo><mi id="S2.E12.m1.2.2.2.1" xref="S2.E12.m1.2.2.2.1.cmml">𝒙</mi><mo id="S2.E12.m1.2.2.2.4.2.2" stretchy="false" xref="S2.E12.m1.2.2.2.cmml">)</mo></mrow></mrow></mfrac></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E12.m1.5b"><apply id="S2.E12.m1.5.5.cmml" xref="S2.E12.m1.5.5"><csymbol cd="latexml" id="S2.E12.m1.5.5.2.cmml" xref="S2.E12.m1.5.5.2">proportional-to</csymbol><apply id="S2.E12.m1.5.5.1.cmml" xref="S2.E12.m1.5.5.1"><times id="S2.E12.m1.5.5.1.2.cmml" xref="S2.E12.m1.5.5.1.2"></times><ci id="S2.E12.m1.5.5.1.3.cmml" xref="S2.E12.m1.5.5.1.3">𝑝</ci><apply id="S2.E12.m1.5.5.1.1.1.1.cmml" xref="S2.E12.m1.5.5.1.1.1"><csymbol cd="latexml" id="S2.E12.m1.5.5.1.1.1.1.3.cmml" xref="S2.E12.m1.5.5.1.1.1.1.3">conditional</csymbol><ci id="S2.E12.m1.5.5.1.1.1.1.4.cmml" xref="S2.E12.m1.5.5.1.1.1.1.4">𝒙</ci><list id="S2.E12.m1.5.5.1.1.1.1.2.3.cmml" 
xref="S2.E12.m1.5.5.1.1.1.1.2.2"><apply id="S2.E12.m1.5.5.1.1.1.1.1.1.1.cmml" xref="S2.E12.m1.5.5.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E12.m1.5.5.1.1.1.1.1.1.1.1.cmml" xref="S2.E12.m1.5.5.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E12.m1.5.5.1.1.1.1.1.1.1.2.cmml" xref="S2.E12.m1.5.5.1.1.1.1.1.1.1.2">𝒄</ci><cn id="S2.E12.m1.5.5.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S2.E12.m1.5.5.1.1.1.1.1.1.1.3">1</cn></apply><ci id="S2.E12.m1.3.3.cmml" xref="S2.E12.m1.3.3">…</ci><apply id="S2.E12.m1.5.5.1.1.1.1.2.2.2.cmml" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S2.E12.m1.5.5.1.1.1.1.2.2.2.1.cmml" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2">subscript</csymbol><ci id="S2.E12.m1.5.5.1.1.1.1.2.2.2.2.cmml" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2.2">𝒄</ci><ci id="S2.E12.m1.5.5.1.1.1.1.2.2.2.3.cmml" xref="S2.E12.m1.5.5.1.1.1.1.2.2.2.3">𝑛</ci></apply></list></apply></apply><apply id="S2.E12.m1.5.5.3.cmml" xref="S2.E12.m1.5.5.3"><times id="S2.E12.m1.5.5.3.1.cmml" xref="S2.E12.m1.5.5.3.1"></times><ci id="S2.E12.m1.5.5.3.2.cmml" xref="S2.E12.m1.5.5.3.2">𝑝</ci><ci id="S2.E12.m1.4.4.cmml" xref="S2.E12.m1.4.4">𝒙</ci><apply id="S2.E12.m1.5.5.3.4.cmml" xref="S2.E12.m1.5.5.3.4"><apply id="S2.E12.m1.5.5.3.4.1.cmml" xref="S2.E12.m1.5.5.3.4.1"><csymbol cd="ambiguous" id="S2.E12.m1.5.5.3.4.1.1.cmml" xref="S2.E12.m1.5.5.3.4.1">superscript</csymbol><apply id="S2.E12.m1.5.5.3.4.1.2.cmml" xref="S2.E12.m1.5.5.3.4.1"><csymbol cd="ambiguous" id="S2.E12.m1.5.5.3.4.1.2.1.cmml" xref="S2.E12.m1.5.5.3.4.1">subscript</csymbol><csymbol cd="latexml" id="S2.E12.m1.5.5.3.4.1.2.2.cmml" xref="S2.E12.m1.5.5.3.4.1.2.2">product</csymbol><apply id="S2.E12.m1.5.5.3.4.1.2.3.cmml" xref="S2.E12.m1.5.5.3.4.1.2.3"><eq id="S2.E12.m1.5.5.3.4.1.2.3.1.cmml" xref="S2.E12.m1.5.5.3.4.1.2.3.1"></eq><ci id="S2.E12.m1.5.5.3.4.1.2.3.2.cmml" xref="S2.E12.m1.5.5.3.4.1.2.3.2">𝑖</ci><cn id="S2.E12.m1.5.5.3.4.1.2.3.3.cmml" type="integer" xref="S2.E12.m1.5.5.3.4.1.2.3.3">1</cn></apply></apply><ci 
id="S2.E12.m1.5.5.3.4.1.3.cmml" xref="S2.E12.m1.5.5.3.4.1.3">𝑛</ci></apply><apply id="S2.E12.m1.2.2.cmml" xref="S2.E12.m1.2.2"><divide id="S2.E12.m1.2.2.3.cmml" xref="S2.E12.m1.2.2"></divide><apply id="S2.E12.m1.1.1.1.cmml" xref="S2.E12.m1.1.1.1"><times id="S2.E12.m1.1.1.1.2.cmml" xref="S2.E12.m1.1.1.1.2"></times><ci id="S2.E12.m1.1.1.1.3.cmml" xref="S2.E12.m1.1.1.1.3">𝑝</ci><apply id="S2.E12.m1.1.1.1.1.1.1.cmml" xref="S2.E12.m1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E12.m1.1.1.1.1.1.1.1.cmml" xref="S2.E12.m1.1.1.1.1.1.1.1">conditional</csymbol><ci id="S2.E12.m1.1.1.1.1.1.1.2.cmml" xref="S2.E12.m1.1.1.1.1.1.1.2">𝒙</ci><apply id="S2.E12.m1.1.1.1.1.1.1.3.cmml" xref="S2.E12.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E12.m1.1.1.1.1.1.1.3.1.cmml" xref="S2.E12.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S2.E12.m1.1.1.1.1.1.1.3.2.cmml" xref="S2.E12.m1.1.1.1.1.1.1.3.2">𝒄</ci><ci id="S2.E12.m1.1.1.1.1.1.1.3.3.cmml" xref="S2.E12.m1.1.1.1.1.1.1.3.3">𝑖</ci></apply></apply></apply><apply id="S2.E12.m1.2.2.2.cmml" xref="S2.E12.m1.2.2.2"><times id="S2.E12.m1.2.2.2.2.cmml" xref="S2.E12.m1.2.2.2.2"></times><ci id="S2.E12.m1.2.2.2.3.cmml" xref="S2.E12.m1.2.2.2.3">𝑝</ci><ci id="S2.E12.m1.2.2.2.1.cmml" xref="S2.E12.m1.2.2.2.1">𝒙</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E12.m1.5c">p\left(\boldsymbol{x}|\boldsymbol{c}_{1},\ldots,\boldsymbol{c}_{n}\right)% \propto p(\boldsymbol{x})\prod_{i=1}^{n}\frac{p\left(\boldsymbol{x}|% \boldsymbol{c}_{i}\right)}{p(\boldsymbol{x})}</annotation><annotation encoding="application/x-llamapun" id="S2.E12.m1.5d">italic_p ( bold_italic_x | bold_italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ∝ italic_p ( bold_italic_x ) ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT divide start_ARG italic_p ( bold_italic_x | bold_italic_c start_POSTSUBSCRIPT italic_i 
end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p ( bold_italic_x ) end_ARG</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(12)</span></td> </tr></tbody> </table> </div> <div class="ltx_para" id="S2.SS4.p3"> <p class="ltx_p" id="S2.SS4.p3.1">Here, we define <math alttext="g" class="ltx_Math" display="inline" id="S2.SS4.p3.1.m1.1"><semantics id="S2.SS4.p3.1.m1.1a"><mi id="S2.SS4.p3.1.m1.1.1" xref="S2.SS4.p3.1.m1.1.1.cmml">g</mi><annotation-xml encoding="MathML-Content" id="S2.SS4.p3.1.m1.1b"><ci id="S2.SS4.p3.1.m1.1.1.cmml" xref="S2.SS4.p3.1.m1.1.1">𝑔</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p3.1.m1.1c">g</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p3.1.m1.1d">italic_g</annotation></semantics></math> to represent the scaled difference between the conditional and unconditional score estimates:</p> <table class="ltx_equation ltx_eqn_table" id="S2.E13"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="g(c,\omega)=\omega\left(\epsilon_{\theta}\left(z_{t},t|c\right)-\epsilon_{% \theta}\left(z_{t},t\right)\right)" class="ltx_Math" display="block" id="S2.E13.m1.4"><semantics id="S2.E13.m1.4a"><mrow id="S2.E13.m1.4.4" xref="S2.E13.m1.4.4.cmml"><mrow id="S2.E13.m1.4.4.3" xref="S2.E13.m1.4.4.3.cmml"><mi id="S2.E13.m1.4.4.3.2" xref="S2.E13.m1.4.4.3.2.cmml">g</mi><mo id="S2.E13.m1.4.4.3.1" xref="S2.E13.m1.4.4.3.1.cmml">⁢</mo><mrow id="S2.E13.m1.4.4.3.3.2" xref="S2.E13.m1.4.4.3.3.1.cmml"><mo id="S2.E13.m1.4.4.3.3.2.1" stretchy="false" xref="S2.E13.m1.4.4.3.3.1.cmml">(</mo><mi id="S2.E13.m1.1.1" xref="S2.E13.m1.1.1.cmml">c</mi><mo id="S2.E13.m1.4.4.3.3.2.2" xref="S2.E13.m1.4.4.3.3.1.cmml">,</mo><mi id="S2.E13.m1.2.2" 
xref="S2.E13.m1.2.2.cmml">ω</mi><mo id="S2.E13.m1.4.4.3.3.2.3" stretchy="false" xref="S2.E13.m1.4.4.3.3.1.cmml">)</mo></mrow></mrow><mo id="S2.E13.m1.4.4.2" xref="S2.E13.m1.4.4.2.cmml">=</mo><mrow id="S2.E13.m1.4.4.1" xref="S2.E13.m1.4.4.1.cmml"><mi id="S2.E13.m1.4.4.1.3" xref="S2.E13.m1.4.4.1.3.cmml">ω</mi><mo id="S2.E13.m1.4.4.1.2" xref="S2.E13.m1.4.4.1.2.cmml">⁢</mo><mrow id="S2.E13.m1.4.4.1.1.1" xref="S2.E13.m1.4.4.1.1.1.1.cmml"><mo id="S2.E13.m1.4.4.1.1.1.2" xref="S2.E13.m1.4.4.1.1.1.1.cmml">(</mo><mrow id="S2.E13.m1.4.4.1.1.1.1" xref="S2.E13.m1.4.4.1.1.1.1.cmml"><mrow id="S2.E13.m1.4.4.1.1.1.1.2" xref="S2.E13.m1.4.4.1.1.1.1.2.cmml"><msub id="S2.E13.m1.4.4.1.1.1.1.2.4" xref="S2.E13.m1.4.4.1.1.1.1.2.4.cmml"><mi id="S2.E13.m1.4.4.1.1.1.1.2.4.2" xref="S2.E13.m1.4.4.1.1.1.1.2.4.2.cmml">ϵ</mi><mi id="S2.E13.m1.4.4.1.1.1.1.2.4.3" xref="S2.E13.m1.4.4.1.1.1.1.2.4.3.cmml">θ</mi></msub><mo id="S2.E13.m1.4.4.1.1.1.1.2.3" xref="S2.E13.m1.4.4.1.1.1.1.2.3.cmml">⁢</mo><mrow id="S2.E13.m1.4.4.1.1.1.1.2.2.2" xref="S2.E13.m1.4.4.1.1.1.1.2.2.3.cmml"><mo id="S2.E13.m1.4.4.1.1.1.1.2.2.2.3" xref="S2.E13.m1.4.4.1.1.1.1.2.2.3.cmml">(</mo><msub id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.2" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.2.cmml">z</mi><mi id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.3" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S2.E13.m1.4.4.1.1.1.1.2.2.2.4" xref="S2.E13.m1.4.4.1.1.1.1.2.2.3.cmml">,</mo><mrow id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.cmml"><mi id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.2" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.2.cmml">t</mi><mo fence="false" id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.1" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.1.cmml">|</mo><mi id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.3" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.3.cmml">c</mi></mrow><mo id="S2.E13.m1.4.4.1.1.1.1.2.2.2.5" xref="S2.E13.m1.4.4.1.1.1.1.2.2.3.cmml">)</mo></mrow></mrow><mo 
id="S2.E13.m1.4.4.1.1.1.1.4" xref="S2.E13.m1.4.4.1.1.1.1.4.cmml">−</mo><mrow id="S2.E13.m1.4.4.1.1.1.1.3" xref="S2.E13.m1.4.4.1.1.1.1.3.cmml"><msub id="S2.E13.m1.4.4.1.1.1.1.3.3" xref="S2.E13.m1.4.4.1.1.1.1.3.3.cmml"><mi id="S2.E13.m1.4.4.1.1.1.1.3.3.2" xref="S2.E13.m1.4.4.1.1.1.1.3.3.2.cmml">ϵ</mi><mi id="S2.E13.m1.4.4.1.1.1.1.3.3.3" xref="S2.E13.m1.4.4.1.1.1.1.3.3.3.cmml">θ</mi></msub><mo id="S2.E13.m1.4.4.1.1.1.1.3.2" xref="S2.E13.m1.4.4.1.1.1.1.3.2.cmml">⁢</mo><mrow id="S2.E13.m1.4.4.1.1.1.1.3.1.1" xref="S2.E13.m1.4.4.1.1.1.1.3.1.2.cmml"><mo id="S2.E13.m1.4.4.1.1.1.1.3.1.1.2" xref="S2.E13.m1.4.4.1.1.1.1.3.1.2.cmml">(</mo><msub id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.cmml"><mi id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.2" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.2.cmml">z</mi><mi id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.3" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.3.cmml">t</mi></msub><mo id="S2.E13.m1.4.4.1.1.1.1.3.1.1.3" xref="S2.E13.m1.4.4.1.1.1.1.3.1.2.cmml">,</mo><mi id="S2.E13.m1.3.3" xref="S2.E13.m1.3.3.cmml">t</mi><mo id="S2.E13.m1.4.4.1.1.1.1.3.1.1.4" xref="S2.E13.m1.4.4.1.1.1.1.3.1.2.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E13.m1.4.4.1.1.1.3" xref="S2.E13.m1.4.4.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E13.m1.4b"><apply id="S2.E13.m1.4.4.cmml" xref="S2.E13.m1.4.4"><eq id="S2.E13.m1.4.4.2.cmml" xref="S2.E13.m1.4.4.2"></eq><apply id="S2.E13.m1.4.4.3.cmml" xref="S2.E13.m1.4.4.3"><times id="S2.E13.m1.4.4.3.1.cmml" xref="S2.E13.m1.4.4.3.1"></times><ci id="S2.E13.m1.4.4.3.2.cmml" xref="S2.E13.m1.4.4.3.2">𝑔</ci><interval closure="open" id="S2.E13.m1.4.4.3.3.1.cmml" xref="S2.E13.m1.4.4.3.3.2"><ci id="S2.E13.m1.1.1.cmml" xref="S2.E13.m1.1.1">𝑐</ci><ci id="S2.E13.m1.2.2.cmml" xref="S2.E13.m1.2.2">𝜔</ci></interval></apply><apply id="S2.E13.m1.4.4.1.cmml" xref="S2.E13.m1.4.4.1"><times id="S2.E13.m1.4.4.1.2.cmml" xref="S2.E13.m1.4.4.1.2"></times><ci id="S2.E13.m1.4.4.1.3.cmml" 
xref="S2.E13.m1.4.4.1.3">𝜔</ci><apply id="S2.E13.m1.4.4.1.1.1.1.cmml" xref="S2.E13.m1.4.4.1.1.1"><minus id="S2.E13.m1.4.4.1.1.1.1.4.cmml" xref="S2.E13.m1.4.4.1.1.1.1.4"></minus><apply id="S2.E13.m1.4.4.1.1.1.1.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2"><times id="S2.E13.m1.4.4.1.1.1.1.2.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.3"></times><apply id="S2.E13.m1.4.4.1.1.1.1.2.4.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.4"><csymbol cd="ambiguous" id="S2.E13.m1.4.4.1.1.1.1.2.4.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.4">subscript</csymbol><ci id="S2.E13.m1.4.4.1.1.1.1.2.4.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.4.2">italic-ϵ</ci><ci id="S2.E13.m1.4.4.1.1.1.1.2.4.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.4.3">𝜃</ci></apply><interval closure="open" id="S2.E13.m1.4.4.1.1.1.1.2.2.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2"><apply id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.2">𝑧</ci><ci id="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.1.1.1.1.3">𝑡</ci></apply><apply id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2"><csymbol cd="latexml" id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.1">conditional</csymbol><ci id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.2">𝑡</ci><ci id="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.2.2.2.2.3">𝑐</ci></apply></interval></apply><apply id="S2.E13.m1.4.4.1.1.1.1.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3"><times id="S2.E13.m1.4.4.1.1.1.1.3.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.2"></times><apply id="S2.E13.m1.4.4.1.1.1.1.3.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.3"><csymbol cd="ambiguous" id="S2.E13.m1.4.4.1.1.1.1.3.3.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.3">subscript</csymbol><ci 
id="S2.E13.m1.4.4.1.1.1.1.3.3.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.3.2">italic-ϵ</ci><ci id="S2.E13.m1.4.4.1.1.1.1.3.3.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.3.3">𝜃</ci></apply><interval closure="open" id="S2.E13.m1.4.4.1.1.1.1.3.1.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1"><apply id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1"><csymbol cd="ambiguous" id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.1.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1">subscript</csymbol><ci id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.2.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.2">𝑧</ci><ci id="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.3.cmml" xref="S2.E13.m1.4.4.1.1.1.1.3.1.1.1.3">𝑡</ci></apply><ci id="S2.E13.m1.3.3.cmml" xref="S2.E13.m1.3.3">𝑡</ci></interval></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E13.m1.4c">g(c,\omega)=\omega\left(\epsilon_{\theta}\left(z_{t},t|c\right)-\epsilon_{% \theta}\left(z_{t},t\right)\right)</annotation><annotation encoding="application/x-llamapun" id="S2.E13.m1.4d">italic_g ( italic_c , italic_ω ) = italic_ω ( italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t | italic_c ) - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(13)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS4.p3.3">where <math alttext="c" class="ltx_Math" display="inline" id="S2.SS4.p3.2.m1.1"><semantics id="S2.SS4.p3.2.m1.1a"><mi id="S2.SS4.p3.2.m1.1.1" xref="S2.SS4.p3.2.m1.1.1.cmml">c</mi><annotation-xml encoding="MathML-Content" id="S2.SS4.p3.2.m1.1b"><ci id="S2.SS4.p3.2.m1.1.1.cmml" xref="S2.SS4.p3.2.m1.1.1">𝑐</ci></annotation-xml><annotation 
encoding="application/x-tex" id="S2.SS4.p3.2.m1.1c">c</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p3.2.m1.1d">italic_c</annotation></semantics></math> is the condition and <math alttext="\omega" class="ltx_Math" display="inline" id="S2.SS4.p3.3.m2.1"><semantics id="S2.SS4.p3.3.m2.1a"><mi id="S2.SS4.p3.3.m2.1.1" xref="S2.SS4.p3.3.m2.1.1.cmml">ω</mi><annotation-xml encoding="MathML-Content" id="S2.SS4.p3.3.m2.1b"><ci id="S2.SS4.p3.3.m2.1.1.cmml" xref="S2.SS4.p3.3.m2.1.1">𝜔</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p3.3.m2.1c">\omega</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p3.3.m2.1d">italic_ω</annotation></semantics></math> is a hyperparameter that scales the guidance term. The multi-conditioned inference step can then be written as</p> <table class="ltx_equation ltx_eqn_table" id="S2.E14"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\hat{\epsilon}_{\theta}\left(z_{t},t\right)\leftarrow\begin{gathered}\epsilon_% {\theta}\left(z_{t},t\right)+g\left(E_{l},\omega_{l}\right)-g\left(E_{nl},% \omega_{nl}\right)\\ +g\left(E_{v},\omega_{v}\right)-\gamma\bar{\beta}_{t}\nabla\log P_{\phi}\left(% y|z_{t},t,E_{v}\right)\end{gathered}" class="ltx_Math" display="block" id="S2.E14.m1.72"><semantics id="S2.E14.m1.72a"><mrow id="S2.E14.m1.72.72" xref="S2.E14.m1.72.72.cmml"><mrow id="S2.E14.m1.72.72.1" xref="S2.E14.m1.72.72.1.cmml"><msub id="S2.E14.m1.72.72.1.3" xref="S2.E14.m1.72.72.1.3.cmml"><mover accent="true" id="S2.E14.m1.72.72.1.3.2" xref="S2.E14.m1.72.72.1.3.2.cmml"><mi id="S2.E14.m1.72.72.1.3.2.2" xref="S2.E14.m1.72.72.1.3.2.2.cmml">ϵ</mi><mo id="S2.E14.m1.72.72.1.3.2.1" xref="S2.E14.m1.72.72.1.3.2.1.cmml">^</mo></mover><mi id="S2.E14.m1.72.72.1.3.3" xref="S2.E14.m1.72.72.1.3.3.cmml">θ</mi></msub><mo id="S2.E14.m1.72.72.1.2" xref="S2.E14.m1.72.72.1.2.cmml">⁢</mo><mrow 
id="S2.E14.m1.72.72.1.1.1" xref="S2.E14.m1.72.72.1.1.2.cmml"><mo id="S2.E14.m1.72.72.1.1.1.2" xref="S2.E14.m1.72.72.1.1.2.cmml">(</mo><msub id="S2.E14.m1.72.72.1.1.1.1" xref="S2.E14.m1.72.72.1.1.1.1.cmml"><mi id="S2.E14.m1.72.72.1.1.1.1.2" xref="S2.E14.m1.72.72.1.1.1.1.2.cmml">z</mi><mi id="S2.E14.m1.72.72.1.1.1.1.3" xref="S2.E14.m1.72.72.1.1.1.1.3.cmml">t</mi></msub><mo id="S2.E14.m1.72.72.1.1.1.3" xref="S2.E14.m1.72.72.1.1.2.cmml">,</mo><mi id="S2.E14.m1.71.71" xref="S2.E14.m1.71.71.cmml">t</mi><mo id="S2.E14.m1.72.72.1.1.1.4" xref="S2.E14.m1.72.72.1.1.2.cmml">)</mo></mrow></mrow><mo id="S2.E14.m1.72.72.2" stretchy="false" xref="S2.E14.m1.72.72.2.cmml">←</mo><mtable displaystyle="true" id="S2.E14.m1.70.70.16" rowspacing="0pt" xref="S2.E14.m1.62.62.8.cmml"><mtr id="S2.E14.m1.70.70.16a" xref="S2.E14.m1.62.62.8.cmml"><mtd id="S2.E14.m1.70.70.16b" xref="S2.E14.m1.62.62.8.cmml"><mrow id="S2.E14.m1.67.67.13.59.31.31" xref="S2.E14.m1.62.62.8.cmml"><mrow id="S2.E14.m1.65.65.11.57.29.29.29" xref="S2.E14.m1.62.62.8.cmml"><mrow id="S2.E14.m1.63.63.9.55.27.27.27.1" xref="S2.E14.m1.62.62.8.cmml"><msub id="S2.E14.m1.63.63.9.55.27.27.27.1.3" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.1.1.1.1.1.1" xref="S2.E14.m1.1.1.1.1.1.1.cmml">ϵ</mi><mi id="S2.E14.m1.2.2.2.2.2.2.1" xref="S2.E14.m1.2.2.2.2.2.2.1.cmml">θ</mi></msub><mo id="S2.E14.m1.63.63.9.55.27.27.27.1.2" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><mrow id="S2.E14.m1.63.63.9.55.27.27.27.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mo id="S2.E14.m1.3.3.3.3.3.3" xref="S2.E14.m1.62.62.8.cmml">(</mo><msub id="S2.E14.m1.63.63.9.55.27.27.27.1.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.4.4.4.4.4.4" xref="S2.E14.m1.4.4.4.4.4.4.cmml">z</mi><mi id="S2.E14.m1.5.5.5.5.5.5.1" xref="S2.E14.m1.5.5.5.5.5.5.1.cmml">t</mi></msub><mo id="S2.E14.m1.6.6.6.6.6.6" xref="S2.E14.m1.62.62.8.cmml">,</mo><mi id="S2.E14.m1.7.7.7.7.7.7" xref="S2.E14.m1.7.7.7.7.7.7.cmml">t</mi><mo id="S2.E14.m1.8.8.8.8.8.8" 
xref="S2.E14.m1.62.62.8.cmml">)</mo></mrow></mrow><mo id="S2.E14.m1.9.9.9.9.9.9" xref="S2.E14.m1.9.9.9.9.9.9.cmml">+</mo><mrow id="S2.E14.m1.65.65.11.57.29.29.29.3" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.10.10.10.10.10.10" xref="S2.E14.m1.10.10.10.10.10.10.cmml">g</mi><mo id="S2.E14.m1.65.65.11.57.29.29.29.3.3" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><mrow id="S2.E14.m1.65.65.11.57.29.29.29.3.2.2" xref="S2.E14.m1.62.62.8.cmml"><mo id="S2.E14.m1.11.11.11.11.11.11" xref="S2.E14.m1.62.62.8.cmml">(</mo><msub id="S2.E14.m1.64.64.10.56.28.28.28.2.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.12.12.12.12.12.12" xref="S2.E14.m1.12.12.12.12.12.12.cmml">E</mi><mi id="S2.E14.m1.13.13.13.13.13.13.1" xref="S2.E14.m1.13.13.13.13.13.13.1.cmml">l</mi></msub><mo id="S2.E14.m1.14.14.14.14.14.14" xref="S2.E14.m1.62.62.8.cmml">,</mo><msub id="S2.E14.m1.65.65.11.57.29.29.29.3.2.2.2" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.15.15.15.15.15.15" xref="S2.E14.m1.15.15.15.15.15.15.cmml">ω</mi><mi id="S2.E14.m1.16.16.16.16.16.16.1" xref="S2.E14.m1.16.16.16.16.16.16.1.cmml">l</mi></msub><mo id="S2.E14.m1.17.17.17.17.17.17" xref="S2.E14.m1.62.62.8.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E14.m1.18.18.18.18.18.18" xref="S2.E14.m1.18.18.18.18.18.18.cmml">−</mo><mrow id="S2.E14.m1.67.67.13.59.31.31.31" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.19.19.19.19.19.19" xref="S2.E14.m1.19.19.19.19.19.19.cmml">g</mi><mo id="S2.E14.m1.67.67.13.59.31.31.31.3" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><mrow id="S2.E14.m1.67.67.13.59.31.31.31.2.2" xref="S2.E14.m1.62.62.8.cmml"><mo id="S2.E14.m1.20.20.20.20.20.20" xref="S2.E14.m1.62.62.8.cmml">(</mo><msub id="S2.E14.m1.66.66.12.58.30.30.30.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.21.21.21.21.21.21" xref="S2.E14.m1.21.21.21.21.21.21.cmml">E</mi><mrow id="S2.E14.m1.22.22.22.22.22.22.1" xref="S2.E14.m1.22.22.22.22.22.22.1.cmml"><mi id="S2.E14.m1.22.22.22.22.22.22.1.2" xref="S2.E14.m1.22.22.22.22.22.22.1.2.cmml">n</mi><mo 
id="S2.E14.m1.22.22.22.22.22.22.1.1" xref="S2.E14.m1.22.22.22.22.22.22.1.1.cmml">⁢</mo><mi id="S2.E14.m1.22.22.22.22.22.22.1.3" xref="S2.E14.m1.22.22.22.22.22.22.1.3.cmml">l</mi></mrow></msub><mo id="S2.E14.m1.23.23.23.23.23.23" xref="S2.E14.m1.62.62.8.cmml">,</mo><msub id="S2.E14.m1.67.67.13.59.31.31.31.2.2.2" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.24.24.24.24.24.24" xref="S2.E14.m1.24.24.24.24.24.24.cmml">ω</mi><mrow id="S2.E14.m1.25.25.25.25.25.25.1" xref="S2.E14.m1.25.25.25.25.25.25.1.cmml"><mi id="S2.E14.m1.25.25.25.25.25.25.1.2" xref="S2.E14.m1.25.25.25.25.25.25.1.2.cmml">n</mi><mo id="S2.E14.m1.25.25.25.25.25.25.1.1" xref="S2.E14.m1.25.25.25.25.25.25.1.1.cmml">⁢</mo><mi id="S2.E14.m1.25.25.25.25.25.25.1.3" xref="S2.E14.m1.25.25.25.25.25.25.1.3.cmml">l</mi></mrow></msub><mo id="S2.E14.m1.26.26.26.26.26.26" xref="S2.E14.m1.62.62.8.cmml">)</mo></mrow></mrow></mrow></mtd></mtr><mtr id="S2.E14.m1.70.70.16c" xref="S2.E14.m1.62.62.8.cmml"><mtd id="S2.E14.m1.70.70.16d" xref="S2.E14.m1.62.62.8.cmml"><mrow id="S2.E14.m1.70.70.16.62.31.31" xref="S2.E14.m1.62.62.8.cmml"><mrow id="S2.E14.m1.69.69.15.61.30.30.30" xref="S2.E14.m1.62.62.8.cmml"><mo id="S2.E14.m1.69.69.15.61.30.30.30a" xref="S2.E14.m1.62.62.8.cmml">+</mo><mrow id="S2.E14.m1.69.69.15.61.30.30.30.2" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.28.28.28.2.2.2" xref="S2.E14.m1.28.28.28.2.2.2.cmml">g</mi><mo id="S2.E14.m1.69.69.15.61.30.30.30.2.3" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><mrow id="S2.E14.m1.69.69.15.61.30.30.30.2.2.2" xref="S2.E14.m1.62.62.8.cmml"><mo id="S2.E14.m1.29.29.29.3.3.3" xref="S2.E14.m1.62.62.8.cmml">(</mo><msub id="S2.E14.m1.68.68.14.60.29.29.29.1.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.30.30.30.4.4.4" xref="S2.E14.m1.30.30.30.4.4.4.cmml">E</mi><mi id="S2.E14.m1.31.31.31.5.5.5.1" xref="S2.E14.m1.31.31.31.5.5.5.1.cmml">v</mi></msub><mo id="S2.E14.m1.32.32.32.6.6.6" xref="S2.E14.m1.62.62.8.cmml">,</mo><msub id="S2.E14.m1.69.69.15.61.30.30.30.2.2.2.2" 
xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.33.33.33.7.7.7" xref="S2.E14.m1.33.33.33.7.7.7.cmml">ω</mi><mi id="S2.E14.m1.34.34.34.8.8.8.1" xref="S2.E14.m1.34.34.34.8.8.8.1.cmml">v</mi></msub><mo id="S2.E14.m1.35.35.35.9.9.9" xref="S2.E14.m1.62.62.8.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E14.m1.36.36.36.10.10.10" xref="S2.E14.m1.36.36.36.10.10.10.cmml">−</mo><mrow id="S2.E14.m1.70.70.16.62.31.31.31" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.37.37.37.11.11.11" xref="S2.E14.m1.37.37.37.11.11.11.cmml">γ</mi><mo id="S2.E14.m1.70.70.16.62.31.31.31.2" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><msub id="S2.E14.m1.70.70.16.62.31.31.31.3" xref="S2.E14.m1.62.62.8.cmml"><mover accent="true" id="S2.E14.m1.38.38.38.12.12.12" xref="S2.E14.m1.38.38.38.12.12.12.cmml"><mi id="S2.E14.m1.38.38.38.12.12.12.2" xref="S2.E14.m1.38.38.38.12.12.12.2.cmml">β</mi><mo id="S2.E14.m1.38.38.38.12.12.12.1" xref="S2.E14.m1.38.38.38.12.12.12.1.cmml">¯</mo></mover><mi id="S2.E14.m1.39.39.39.13.13.13.1" xref="S2.E14.m1.39.39.39.13.13.13.1.cmml">t</mi></msub><mo id="S2.E14.m1.70.70.16.62.31.31.31.2a" lspace="0.167em" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><mrow id="S2.E14.m1.70.70.16.62.31.31.31.4" xref="S2.E14.m1.62.62.8.cmml"><mrow id="S2.E14.m1.70.70.16.62.31.31.31.4.1" xref="S2.E14.m1.62.62.8.cmml"><mo id="S2.E14.m1.40.40.40.14.14.14" rspace="0.167em" xref="S2.E14.m1.40.40.40.14.14.14.cmml">∇</mo><mi id="S2.E14.m1.41.41.41.15.15.15" xref="S2.E14.m1.41.41.41.15.15.15.cmml">log</mi></mrow><mo id="S2.E14.m1.70.70.16.62.31.31.31.4a" lspace="0.167em" xref="S2.E14.m1.62.62.8.cmml">⁡</mo><msub id="S2.E14.m1.70.70.16.62.31.31.31.4.2" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.42.42.42.16.16.16" xref="S2.E14.m1.42.42.42.16.16.16.cmml">P</mi><mi id="S2.E14.m1.43.43.43.17.17.17.1" xref="S2.E14.m1.43.43.43.17.17.17.1.cmml">ϕ</mi></msub></mrow><mo id="S2.E14.m1.70.70.16.62.31.31.31.2b" xref="S2.E14.m1.62.62.8.cmml">⁢</mo><mrow id="S2.E14.m1.70.70.16.62.31.31.31.1.1" xref="S2.E14.m1.62.62.8.cmml"><mo 
id="S2.E14.m1.44.44.44.18.18.18" xref="S2.E14.m1.62.62.8.cmml">(</mo><mrow id="S2.E14.m1.70.70.16.62.31.31.31.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.45.45.45.19.19.19" xref="S2.E14.m1.45.45.45.19.19.19.cmml">y</mi><mo fence="false" id="S2.E14.m1.46.46.46.20.20.20" xref="S2.E14.m1.46.46.46.20.20.20.cmml">|</mo><mrow id="S2.E14.m1.70.70.16.62.31.31.31.1.1.1.2.2" xref="S2.E14.m1.62.62.8.cmml"><msub id="S2.E14.m1.70.70.16.62.31.31.31.1.1.1.1.1.1" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.47.47.47.21.21.21" xref="S2.E14.m1.47.47.47.21.21.21.cmml">z</mi><mi id="S2.E14.m1.48.48.48.22.22.22.1" xref="S2.E14.m1.48.48.48.22.22.22.1.cmml">t</mi></msub><mo id="S2.E14.m1.49.49.49.23.23.23" xref="S2.E14.m1.62.62.8.cmml">,</mo><mi id="S2.E14.m1.50.50.50.24.24.24" xref="S2.E14.m1.50.50.50.24.24.24.cmml">t</mi><mo id="S2.E14.m1.51.51.51.25.25.25" xref="S2.E14.m1.62.62.8.cmml">,</mo><msub id="S2.E14.m1.70.70.16.62.31.31.31.1.1.1.2.2.2" xref="S2.E14.m1.62.62.8.cmml"><mi id="S2.E14.m1.52.52.52.26.26.26" xref="S2.E14.m1.52.52.52.26.26.26.cmml">E</mi><mi id="S2.E14.m1.53.53.53.27.27.27.1" xref="S2.E14.m1.53.53.53.27.27.27.1.cmml">v</mi></msub></mrow></mrow><mo id="S2.E14.m1.54.54.54.28.28.28" xref="S2.E14.m1.62.62.8.cmml">)</mo></mrow></mrow></mrow></mtd></mtr></mtable></mrow><annotation-xml encoding="MathML-Content" id="S2.E14.m1.72b"><apply id="S2.E14.m1.72.72.cmml" xref="S2.E14.m1.72.72"><ci id="S2.E14.m1.72.72.2.cmml" xref="S2.E14.m1.72.72.2">←</ci><apply id="S2.E14.m1.72.72.1.cmml" xref="S2.E14.m1.72.72.1"><times id="S2.E14.m1.72.72.1.2.cmml" xref="S2.E14.m1.72.72.1.2"></times><apply id="S2.E14.m1.72.72.1.3.cmml" xref="S2.E14.m1.72.72.1.3"><csymbol cd="ambiguous" id="S2.E14.m1.72.72.1.3.1.cmml" xref="S2.E14.m1.72.72.1.3">subscript</csymbol><apply id="S2.E14.m1.72.72.1.3.2.cmml" xref="S2.E14.m1.72.72.1.3.2"><ci id="S2.E14.m1.72.72.1.3.2.1.cmml" xref="S2.E14.m1.72.72.1.3.2.1">^</ci><ci id="S2.E14.m1.72.72.1.3.2.2.cmml" 
xref="S2.E14.m1.72.72.1.3.2.2">italic-ϵ</ci></apply><ci id="S2.E14.m1.72.72.1.3.3.cmml" xref="S2.E14.m1.72.72.1.3.3">𝜃</ci></apply><interval closure="open" id="S2.E14.m1.72.72.1.1.2.cmml" xref="S2.E14.m1.72.72.1.1.1"><apply id="S2.E14.m1.72.72.1.1.1.1.cmml" xref="S2.E14.m1.72.72.1.1.1.1"><csymbol cd="ambiguous" id="S2.E14.m1.72.72.1.1.1.1.1.cmml" xref="S2.E14.m1.72.72.1.1.1.1">subscript</csymbol><ci id="S2.E14.m1.72.72.1.1.1.1.2.cmml" xref="S2.E14.m1.72.72.1.1.1.1.2">𝑧</ci><ci id="S2.E14.m1.72.72.1.1.1.1.3.cmml" xref="S2.E14.m1.72.72.1.1.1.1.3">𝑡</ci></apply><ci id="S2.E14.m1.71.71.cmml" xref="S2.E14.m1.71.71">𝑡</ci></interval></apply><apply id="S2.E14.m1.62.62.8.cmml" xref="S2.E14.m1.70.70.16"><minus id="S2.E14.m1.36.36.36.10.10.10.cmml" xref="S2.E14.m1.36.36.36.10.10.10"></minus><apply id="S2.E14.m1.61.61.7.7.cmml" xref="S2.E14.m1.70.70.16"><plus id="S2.E14.m1.27.27.27.1.1.1.cmml" xref="S2.E14.m1.70.70.16"></plus><apply id="S2.E14.m1.59.59.5.5.5.cmml" xref="S2.E14.m1.70.70.16"><minus id="S2.E14.m1.18.18.18.18.18.18.cmml" xref="S2.E14.m1.18.18.18.18.18.18"></minus><apply id="S2.E14.m1.57.57.3.3.3.3.cmml" xref="S2.E14.m1.70.70.16"><plus id="S2.E14.m1.9.9.9.9.9.9.cmml" xref="S2.E14.m1.9.9.9.9.9.9"></plus><apply id="S2.E14.m1.55.55.1.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><times id="S2.E14.m1.55.55.1.1.1.1.1.2.cmml" xref="S2.E14.m1.70.70.16"></times><apply id="S2.E14.m1.55.55.1.1.1.1.1.3.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.55.55.1.1.1.1.1.3.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.1.1.1.1.1.1.cmml" xref="S2.E14.m1.1.1.1.1.1.1">italic-ϵ</ci><ci id="S2.E14.m1.2.2.2.2.2.2.1.cmml" xref="S2.E14.m1.2.2.2.2.2.2.1">𝜃</ci></apply><interval closure="open" id="S2.E14.m1.55.55.1.1.1.1.1.1.2.cmml" xref="S2.E14.m1.70.70.16"><apply id="S2.E14.m1.55.55.1.1.1.1.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.55.55.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci 
id="S2.E14.m1.4.4.4.4.4.4.cmml" xref="S2.E14.m1.4.4.4.4.4.4">𝑧</ci><ci id="S2.E14.m1.5.5.5.5.5.5.1.cmml" xref="S2.E14.m1.5.5.5.5.5.5.1">𝑡</ci></apply><ci id="S2.E14.m1.7.7.7.7.7.7.cmml" xref="S2.E14.m1.7.7.7.7.7.7">𝑡</ci></interval></apply><apply id="S2.E14.m1.57.57.3.3.3.3.3.cmml" xref="S2.E14.m1.70.70.16"><times id="S2.E14.m1.57.57.3.3.3.3.3.3.cmml" xref="S2.E14.m1.70.70.16"></times><ci id="S2.E14.m1.10.10.10.10.10.10.cmml" xref="S2.E14.m1.10.10.10.10.10.10">𝑔</ci><interval closure="open" id="S2.E14.m1.57.57.3.3.3.3.3.2.3.cmml" xref="S2.E14.m1.70.70.16"><apply id="S2.E14.m1.56.56.2.2.2.2.2.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.56.56.2.2.2.2.2.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.12.12.12.12.12.12.cmml" xref="S2.E14.m1.12.12.12.12.12.12">𝐸</ci><ci id="S2.E14.m1.13.13.13.13.13.13.1.cmml" xref="S2.E14.m1.13.13.13.13.13.13.1">𝑙</ci></apply><apply id="S2.E14.m1.57.57.3.3.3.3.3.2.2.2.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.57.57.3.3.3.3.3.2.2.2.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.15.15.15.15.15.15.cmml" xref="S2.E14.m1.15.15.15.15.15.15">𝜔</ci><ci id="S2.E14.m1.16.16.16.16.16.16.1.cmml" xref="S2.E14.m1.16.16.16.16.16.16.1">𝑙</ci></apply></interval></apply></apply><apply id="S2.E14.m1.59.59.5.5.5.5.cmml" xref="S2.E14.m1.70.70.16"><times id="S2.E14.m1.59.59.5.5.5.5.3.cmml" xref="S2.E14.m1.70.70.16"></times><ci id="S2.E14.m1.19.19.19.19.19.19.cmml" xref="S2.E14.m1.19.19.19.19.19.19">𝑔</ci><interval closure="open" id="S2.E14.m1.59.59.5.5.5.5.2.3.cmml" xref="S2.E14.m1.70.70.16"><apply id="S2.E14.m1.58.58.4.4.4.4.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.58.58.4.4.4.4.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.21.21.21.21.21.21.cmml" xref="S2.E14.m1.21.21.21.21.21.21">𝐸</ci><apply id="S2.E14.m1.22.22.22.22.22.22.1.cmml" xref="S2.E14.m1.22.22.22.22.22.22.1"><times 
id="S2.E14.m1.22.22.22.22.22.22.1.1.cmml" xref="S2.E14.m1.22.22.22.22.22.22.1.1"></times><ci id="S2.E14.m1.22.22.22.22.22.22.1.2.cmml" xref="S2.E14.m1.22.22.22.22.22.22.1.2">𝑛</ci><ci id="S2.E14.m1.22.22.22.22.22.22.1.3.cmml" xref="S2.E14.m1.22.22.22.22.22.22.1.3">𝑙</ci></apply></apply><apply id="S2.E14.m1.59.59.5.5.5.5.2.2.2.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.59.59.5.5.5.5.2.2.2.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.24.24.24.24.24.24.cmml" xref="S2.E14.m1.24.24.24.24.24.24">𝜔</ci><apply id="S2.E14.m1.25.25.25.25.25.25.1.cmml" xref="S2.E14.m1.25.25.25.25.25.25.1"><times id="S2.E14.m1.25.25.25.25.25.25.1.1.cmml" xref="S2.E14.m1.25.25.25.25.25.25.1.1"></times><ci id="S2.E14.m1.25.25.25.25.25.25.1.2.cmml" xref="S2.E14.m1.25.25.25.25.25.25.1.2">𝑛</ci><ci id="S2.E14.m1.25.25.25.25.25.25.1.3.cmml" xref="S2.E14.m1.25.25.25.25.25.25.1.3">𝑙</ci></apply></apply></interval></apply></apply><apply id="S2.E14.m1.61.61.7.7.7.cmml" xref="S2.E14.m1.70.70.16"><times id="S2.E14.m1.61.61.7.7.7.3.cmml" xref="S2.E14.m1.70.70.16"></times><ci id="S2.E14.m1.28.28.28.2.2.2.cmml" xref="S2.E14.m1.28.28.28.2.2.2">𝑔</ci><interval closure="open" id="S2.E14.m1.61.61.7.7.7.2.3.cmml" xref="S2.E14.m1.70.70.16"><apply id="S2.E14.m1.60.60.6.6.6.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.60.60.6.6.6.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.30.30.30.4.4.4.cmml" xref="S2.E14.m1.30.30.30.4.4.4">𝐸</ci><ci id="S2.E14.m1.31.31.31.5.5.5.1.cmml" xref="S2.E14.m1.31.31.31.5.5.5.1">𝑣</ci></apply><apply id="S2.E14.m1.61.61.7.7.7.2.2.2.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.61.61.7.7.7.2.2.2.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.33.33.33.7.7.7.cmml" xref="S2.E14.m1.33.33.33.7.7.7">𝜔</ci><ci id="S2.E14.m1.34.34.34.8.8.8.1.cmml" xref="S2.E14.m1.34.34.34.8.8.8.1">𝑣</ci></apply></interval></apply></apply><apply 
id="S2.E14.m1.62.62.8.8.cmml" xref="S2.E14.m1.70.70.16"><times id="S2.E14.m1.62.62.8.8.2.cmml" xref="S2.E14.m1.70.70.16"></times><ci id="S2.E14.m1.37.37.37.11.11.11.cmml" xref="S2.E14.m1.37.37.37.11.11.11">𝛾</ci><apply id="S2.E14.m1.62.62.8.8.4.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.62.62.8.8.4.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><apply id="S2.E14.m1.38.38.38.12.12.12.cmml" xref="S2.E14.m1.38.38.38.12.12.12"><ci id="S2.E14.m1.38.38.38.12.12.12.1.cmml" xref="S2.E14.m1.38.38.38.12.12.12.1">¯</ci><ci id="S2.E14.m1.38.38.38.12.12.12.2.cmml" xref="S2.E14.m1.38.38.38.12.12.12.2">𝛽</ci></apply><ci id="S2.E14.m1.39.39.39.13.13.13.1.cmml" xref="S2.E14.m1.39.39.39.13.13.13.1">𝑡</ci></apply><apply id="S2.E14.m1.62.62.8.8.5.cmml" xref="S2.E14.m1.70.70.16"><apply id="S2.E14.m1.62.62.8.8.5.1.cmml" xref="S2.E14.m1.70.70.16"><ci id="S2.E14.m1.40.40.40.14.14.14.cmml" xref="S2.E14.m1.40.40.40.14.14.14">∇</ci><log id="S2.E14.m1.41.41.41.15.15.15.cmml" xref="S2.E14.m1.41.41.41.15.15.15"></log></apply><apply id="S2.E14.m1.62.62.8.8.5.2.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.62.62.8.8.5.2.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.42.42.42.16.16.16.cmml" xref="S2.E14.m1.42.42.42.16.16.16">𝑃</ci><ci id="S2.E14.m1.43.43.43.17.17.17.1.cmml" xref="S2.E14.m1.43.43.43.17.17.17.1">italic-ϕ</ci></apply></apply><apply id="S2.E14.m1.62.62.8.8.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="latexml" id="S2.E14.m1.46.46.46.20.20.20.cmml" xref="S2.E14.m1.46.46.46.20.20.20">conditional</csymbol><ci id="S2.E14.m1.45.45.45.19.19.19.cmml" xref="S2.E14.m1.45.45.45.19.19.19">𝑦</ci><list id="S2.E14.m1.62.62.8.8.1.1.1.2.3.cmml" xref="S2.E14.m1.70.70.16"><apply id="S2.E14.m1.62.62.8.8.1.1.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.62.62.8.8.1.1.1.1.1.1.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.47.47.47.21.21.21.cmml" 
xref="S2.E14.m1.47.47.47.21.21.21">𝑧</ci><ci id="S2.E14.m1.48.48.48.22.22.22.1.cmml" xref="S2.E14.m1.48.48.48.22.22.22.1">𝑡</ci></apply><ci id="S2.E14.m1.50.50.50.24.24.24.cmml" xref="S2.E14.m1.50.50.50.24.24.24">𝑡</ci><apply id="S2.E14.m1.62.62.8.8.1.1.1.2.2.2.cmml" xref="S2.E14.m1.70.70.16"><csymbol cd="ambiguous" id="S2.E14.m1.62.62.8.8.1.1.1.2.2.2.1.cmml" xref="S2.E14.m1.70.70.16">subscript</csymbol><ci id="S2.E14.m1.52.52.52.26.26.26.cmml" xref="S2.E14.m1.52.52.52.26.26.26">𝐸</ci><ci id="S2.E14.m1.53.53.53.27.27.27.1.cmml" xref="S2.E14.m1.53.53.53.27.27.27.1">𝑣</ci></apply></list></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E14.m1.72c">\hat{\epsilon}_{\theta}\left(z_{t},t\right)\leftarrow\begin{gathered}\epsilon_% {\theta}\left(z_{t},t\right)+g\left(E_{l},\omega_{l}\right)-g\left(E_{nl},% \omega_{nl}\right)\\ +g\left(E_{v},\omega_{v}\right)-\gamma\bar{\beta}_{t}\nabla\log P_{\phi}\left(% y|z_{t},t,E_{v}\right)\end{gathered}</annotation><annotation encoding="application/x-llamapun" id="S2.E14.m1.72d">over^ start_ARG italic_ϵ end_ARG start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) ← start_ROW start_CELL italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t ) + italic_g ( italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , italic_ω start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) - italic_g ( italic_E start_POSTSUBSCRIPT italic_n italic_l end_POSTSUBSCRIPT , italic_ω start_POSTSUBSCRIPT italic_n italic_l end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL + italic_g ( italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , italic_ω start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) - italic_γ over¯ start_ARG italic_β end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∇ roman_log italic_P start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_y | italic_z 
start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ) end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(14)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS4.p3.6">where <math alttext="E_{v}" class="ltx_Math" display="inline" id="S2.SS4.p3.4.m1.1"><semantics id="S2.SS4.p3.4.m1.1a"><msub id="S2.SS4.p3.4.m1.1.1" xref="S2.SS4.p3.4.m1.1.1.cmml"><mi id="S2.SS4.p3.4.m1.1.1.2" xref="S2.SS4.p3.4.m1.1.1.2.cmml">E</mi><mi id="S2.SS4.p3.4.m1.1.1.3" xref="S2.SS4.p3.4.m1.1.1.3.cmml">v</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS4.p3.4.m1.1b"><apply id="S2.SS4.p3.4.m1.1.1.cmml" xref="S2.SS4.p3.4.m1.1.1"><csymbol cd="ambiguous" id="S2.SS4.p3.4.m1.1.1.1.cmml" xref="S2.SS4.p3.4.m1.1.1">subscript</csymbol><ci id="S2.SS4.p3.4.m1.1.1.2.cmml" xref="S2.SS4.p3.4.m1.1.1.2">𝐸</ci><ci id="S2.SS4.p3.4.m1.1.1.3.cmml" xref="S2.SS4.p3.4.m1.1.1.3">𝑣</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p3.4.m1.1c">E_{v}</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p3.4.m1.1d">italic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT</annotation></semantics></math> represents video features, <math alttext="E_{l}" class="ltx_Math" display="inline" id="S2.SS4.p3.5.m2.1"><semantics id="S2.SS4.p3.5.m2.1a"><msub id="S2.SS4.p3.5.m2.1.1" xref="S2.SS4.p3.5.m2.1.1.cmml"><mi id="S2.SS4.p3.5.m2.1.1.2" xref="S2.SS4.p3.5.m2.1.1.2.cmml">E</mi><mi id="S2.SS4.p3.5.m2.1.1.3" xref="S2.SS4.p3.5.m2.1.1.3.cmml">l</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS4.p3.5.m2.1b"><apply id="S2.SS4.p3.5.m2.1.1.cmml" xref="S2.SS4.p3.5.m2.1.1"><csymbol cd="ambiguous" id="S2.SS4.p3.5.m2.1.1.1.cmml" xref="S2.SS4.p3.5.m2.1.1">subscript</csymbol><ci id="S2.SS4.p3.5.m2.1.1.2.cmml" 
xref="S2.SS4.p3.5.m2.1.1.2">𝐸</ci><ci id="S2.SS4.p3.5.m2.1.1.3.cmml" xref="S2.SS4.p3.5.m2.1.1.3">𝑙</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p3.5.m2.1c">E_{l}</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p3.5.m2.1d">italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT</annotation></semantics></math> represents prompt features, and <math alttext="E_{nl}" class="ltx_Math" display="inline" id="S2.SS4.p3.6.m3.1"><semantics id="S2.SS4.p3.6.m3.1a"><msub id="S2.SS4.p3.6.m3.1.1" xref="S2.SS4.p3.6.m3.1.1.cmml"><mi id="S2.SS4.p3.6.m3.1.1.2" xref="S2.SS4.p3.6.m3.1.1.2.cmml">E</mi><mrow id="S2.SS4.p3.6.m3.1.1.3" xref="S2.SS4.p3.6.m3.1.1.3.cmml"><mi id="S2.SS4.p3.6.m3.1.1.3.2" xref="S2.SS4.p3.6.m3.1.1.3.2.cmml">n</mi><mo id="S2.SS4.p3.6.m3.1.1.3.1" xref="S2.SS4.p3.6.m3.1.1.3.1.cmml">⁢</mo><mi id="S2.SS4.p3.6.m3.1.1.3.3" xref="S2.SS4.p3.6.m3.1.1.3.3.cmml">l</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S2.SS4.p3.6.m3.1b"><apply id="S2.SS4.p3.6.m3.1.1.cmml" xref="S2.SS4.p3.6.m3.1.1"><csymbol cd="ambiguous" id="S2.SS4.p3.6.m3.1.1.1.cmml" xref="S2.SS4.p3.6.m3.1.1">subscript</csymbol><ci id="S2.SS4.p3.6.m3.1.1.2.cmml" xref="S2.SS4.p3.6.m3.1.1.2">𝐸</ci><apply id="S2.SS4.p3.6.m3.1.1.3.cmml" xref="S2.SS4.p3.6.m3.1.1.3"><times id="S2.SS4.p3.6.m3.1.1.3.1.cmml" xref="S2.SS4.p3.6.m3.1.1.3.1"></times><ci id="S2.SS4.p3.6.m3.1.1.3.2.cmml" xref="S2.SS4.p3.6.m3.1.1.3.2">𝑛</ci><ci id="S2.SS4.p3.6.m3.1.1.3.3.cmml" xref="S2.SS4.p3.6.m3.1.1.3.3">𝑙</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS4.p3.6.m3.1c">E_{nl}</annotation><annotation encoding="application/x-llamapun" id="S2.SS4.p3.6.m3.1d">italic_E start_POSTSUBSCRIPT italic_n italic_l end_POSTSUBSCRIPT</annotation></semantics></math> represents negative prompt features. 
This approach allows for a more flexible inference process by independently considering the effects of multiple conditions.</p> </div> <div class="ltx_para" id="S2.SS4.p4"> <p class="ltx_p" id="S2.SS4.p4.1">The workflow is designed to efficiently process the video and textual descriptions through CVALP, align and mix features using LDM, and apply inference techniques to produce the final high-quality audio output that matches the given video content. Specifically, during inference, the positive prompt supplied by the user is passed through the Portable Plug-in Prompt Refiner (PPPR)<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib25" title="">25</a>]</cite> for standardization, ensuring that it is consistent with the AI-generated text used in training. If no human-provided prompt is available, we use the Video-LlaMA2<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib18" title="">18</a>]</cite> model to generate a description of the video content automatically.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">III </span><span class="ltx_text ltx_font_smallcaps" id="S3.1.1">Experiments</span> </h2> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S3.SS1.5.1.1">III-A</span> </span><span class="ltx_text ltx_font_italic" id="S3.SS1.6.2">Datasets and Data Processing</span> </h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.2">Our study utilizes the VGGSound <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib26" title="">26</a>]</cite> dataset, which contains approximately 200,000 videos, each with a duration of 10 seconds. 
Following the original scheme, we use the provided training and testing split of the dataset, with 183,971 videos in the training set and 15,496 videos in the test set. Data preprocessing covers video, audio, and text: the videos were resized to <math alttext="224\times 224" class="ltx_Math" display="inline" id="S3.SS1.p1.1.m1.1"><semantics id="S3.SS1.p1.1.m1.1a"><mrow id="S3.SS1.p1.1.m1.1.1" xref="S3.SS1.p1.1.m1.1.1.cmml"><mn id="S3.SS1.p1.1.m1.1.1.2" xref="S3.SS1.p1.1.m1.1.1.2.cmml">224</mn><mo id="S3.SS1.p1.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.SS1.p1.1.m1.1.1.1.cmml">×</mo><mn id="S3.SS1.p1.1.m1.1.1.3" xref="S3.SS1.p1.1.m1.1.1.3.cmml">224</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.1.m1.1b"><apply id="S3.SS1.p1.1.m1.1.1.cmml" xref="S3.SS1.p1.1.m1.1.1"><times id="S3.SS1.p1.1.m1.1.1.1.cmml" xref="S3.SS1.p1.1.m1.1.1.1"></times><cn id="S3.SS1.p1.1.m1.1.1.2.cmml" type="integer" xref="S3.SS1.p1.1.m1.1.1.2">224</cn><cn id="S3.SS1.p1.1.m1.1.1.3.cmml" type="integer" xref="S3.SS1.p1.1.m1.1.1.3">224</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.1.m1.1c">224\times 224</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.1.m1.1d">224 × 224</annotation></semantics></math> and frames were sampled at 4 FPS; the audio was sampled at 16 kHz and converted to Mel spectrograms (Mel basis <math alttext="M=128" class="ltx_Math" display="inline" id="S3.SS1.p1.2.m2.1"><semantics id="S3.SS1.p1.2.m2.1a"><mrow id="S3.SS1.p1.2.m2.1.1" xref="S3.SS1.p1.2.m2.1.1.cmml"><mi id="S3.SS1.p1.2.m2.1.1.2" xref="S3.SS1.p1.2.m2.1.1.2.cmml">M</mi><mo id="S3.SS1.p1.2.m2.1.1.1" xref="S3.SS1.p1.2.m2.1.1.1.cmml">=</mo><mn id="S3.SS1.p1.2.m2.1.1.3" xref="S3.SS1.p1.2.m2.1.1.3.cmml">128</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p1.2.m2.1b"><apply id="S3.SS1.p1.2.m2.1.1.cmml" xref="S3.SS1.p1.2.m2.1.1"><eq id="S3.SS1.p1.2.m2.1.1.1.cmml" xref="S3.SS1.p1.2.m2.1.1.1"></eq><ci 
id="S3.SS1.p1.2.m2.1.1.2.cmml" xref="S3.SS1.p1.2.m2.1.1.2">𝑀</ci><cn id="S3.SS1.p1.2.m2.1.1.3.cmml" type="integer" xref="S3.SS1.p1.2.m2.1.1.3">128</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p1.2.m2.1c">M=128</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p1.2.m2.1d">italic_M = 128</annotation></semantics></math>), with a hop size of 256. For text processing, we expanded the annotations in the VGGSound dataset using ChatGPT4o to improve text-audio alignment and maintain consistency with inference prompts. A unified prompt was used to generate video descriptions, which served as contrastive learning material in CVALP training:</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <p class="ltx_p" id="S3.SS1.p2.1"><span class="ltx_text ltx_font_italic" id="S3.SS1.p2.1.1">“Here are the annotated texts from a video dataset. Please expand each into a full sentence, keeping the core content unchanged.”</span></p> </div> <div class="ltx_para" id="S3.SS1.p3"> <p class="ltx_p" id="S3.SS1.p3.1">To introduce variation and avoid identical descriptions for similar videos during contrastive learning, we applied PPPR while preserving the core content, as detailed in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib25" title="">25</a>]</cite>.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S3.SS2.5.1.1">III-B</span> </span><span class="ltx_text ltx_font_italic" id="S3.SS2.6.2">Configurations</span> </h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.3">In the CVALP contrastive learning process, we employed the PANNs<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib27" title="">27</a>]</cite> model pretrained on the AudioSet<cite 
class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib28" title="">28</a>]</cite> dataset as the audio encoder, and the SlowOnly<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib29" title="">29</a>]</cite> model pretrained on the Kinetics-400<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib30" title="">30</a>]</cite> dataset as the video encoder, with Flan-T5<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib31" title="">31</a>]</cite> serving as the text encoder. For the LDM, we adopted the architecture of Stable Diffusion, utilizing frozen, pretrained latent encoder <math alttext="\mathcal{E}" class="ltx_Math" display="inline" id="S3.SS2.p1.1.m1.1"><semantics id="S3.SS2.p1.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S3.SS2.p1.1.m1.1.1" xref="S3.SS2.p1.1.m1.1.1.cmml">ℰ</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.1.m1.1b"><ci id="S3.SS2.p1.1.m1.1.1.cmml" xref="S3.SS2.p1.1.m1.1.1">ℰ</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.1.m1.1c">\mathcal{E}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.1.m1.1d">caligraphic_E</annotation></semantics></math> and decoder <math alttext="\mathcal{D}" class="ltx_Math" display="inline" id="S3.SS2.p1.2.m2.1"><semantics id="S3.SS2.p1.2.m2.1a"><mi class="ltx_font_mathcaligraphic" id="S3.SS2.p1.2.m2.1.1" xref="S3.SS2.p1.2.m2.1.1.cmml">𝒟</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.2.m2.1b"><ci id="S3.SS2.p1.2.m2.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1">𝒟</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.2.m2.1c">\mathcal{D}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.2.m2.1d">caligraphic_D</annotation></semantics></math> components. 
The denoising process involved 1,000 steps, and we used a learning rate of <math alttext="10^{-4}" class="ltx_Math" display="inline" id="S3.SS2.p1.3.m3.1"><semantics id="S3.SS2.p1.3.m3.1a"><msup id="S3.SS2.p1.3.m3.1.1" xref="S3.SS2.p1.3.m3.1.1.cmml"><mn id="S3.SS2.p1.3.m3.1.1.2" xref="S3.SS2.p1.3.m3.1.1.2.cmml">10</mn><mrow id="S3.SS2.p1.3.m3.1.1.3" xref="S3.SS2.p1.3.m3.1.1.3.cmml"><mo id="S3.SS2.p1.3.m3.1.1.3a" xref="S3.SS2.p1.3.m3.1.1.3.cmml">−</mo><mn id="S3.SS2.p1.3.m3.1.1.3.2" xref="S3.SS2.p1.3.m3.1.1.3.2.cmml">4</mn></mrow></msup><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.3.m3.1b"><apply id="S3.SS2.p1.3.m3.1.1.cmml" xref="S3.SS2.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS2.p1.3.m3.1.1.1.cmml" xref="S3.SS2.p1.3.m3.1.1">superscript</csymbol><cn id="S3.SS2.p1.3.m3.1.1.2.cmml" type="integer" xref="S3.SS2.p1.3.m3.1.1.2">10</cn><apply id="S3.SS2.p1.3.m3.1.1.3.cmml" xref="S3.SS2.p1.3.m3.1.1.3"><minus id="S3.SS2.p1.3.m3.1.1.3.1.cmml" xref="S3.SS2.p1.3.m3.1.1.3"></minus><cn id="S3.SS2.p1.3.m3.1.1.3.2.cmml" type="integer" xref="S3.SS2.p1.3.m3.1.1.3.2">4</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.3.m3.1c">10^{-4}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.3.m3.1d">10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT</annotation></semantics></math> with an initial warmup phase of 1,000 steps. 
During the inference stage, we incorporated agent attention<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib32" title="">32</a>]</cite>, which introduces agent tokens to improve computational efficiency while maintaining global context modeling, to enhance speed and effectiveness.</p> </div> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.3">We set the CFG scale to <math alttext="\omega=4.5" class="ltx_Math" display="inline" id="S3.SS2.p2.1.m1.1"><semantics id="S3.SS2.p2.1.m1.1a"><mrow id="S3.SS2.p2.1.m1.1.1" xref="S3.SS2.p2.1.m1.1.1.cmml"><mi id="S3.SS2.p2.1.m1.1.1.2" xref="S3.SS2.p2.1.m1.1.1.2.cmml">ω</mi><mo id="S3.SS2.p2.1.m1.1.1.1" xref="S3.SS2.p2.1.m1.1.1.1.cmml">=</mo><mn id="S3.SS2.p2.1.m1.1.1.3" xref="S3.SS2.p2.1.m1.1.1.3.cmml">4.5</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.1.m1.1b"><apply id="S3.SS2.p2.1.m1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1"><eq id="S3.SS2.p2.1.m1.1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1.1"></eq><ci id="S3.SS2.p2.1.m1.1.1.2.cmml" xref="S3.SS2.p2.1.m1.1.1.2">𝜔</ci><cn id="S3.SS2.p2.1.m1.1.1.3.cmml" type="float" xref="S3.SS2.p2.1.m1.1.1.3">4.5</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.1.m1.1c">\omega=4.5</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.1.m1.1d">italic_ω = 4.5</annotation></semantics></math> and the CG scale to <math alttext="\gamma=50" class="ltx_Math" display="inline" id="S3.SS2.p2.2.m2.1"><semantics id="S3.SS2.p2.2.m2.1a"><mrow id="S3.SS2.p2.2.m2.1.1" xref="S3.SS2.p2.2.m2.1.1.cmml"><mi id="S3.SS2.p2.2.m2.1.1.2" xref="S3.SS2.p2.2.m2.1.1.2.cmml">γ</mi><mo id="S3.SS2.p2.2.m2.1.1.1" xref="S3.SS2.p2.2.m2.1.1.1.cmml">=</mo><mn id="S3.SS2.p2.2.m2.1.1.3" xref="S3.SS2.p2.2.m2.1.1.3.cmml">50</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.2.m2.1b"><apply id="S3.SS2.p2.2.m2.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1"><eq id="S3.SS2.p2.2.m2.1.1.1.cmml" 
xref="S3.SS2.p2.2.m2.1.1.1"></eq><ci id="S3.SS2.p2.2.m2.1.1.2.cmml" xref="S3.SS2.p2.2.m2.1.1.2">𝛾</ci><cn id="S3.SS2.p2.2.m2.1.1.3.cmml" type="integer" xref="S3.SS2.p2.2.m2.1.1.3">50</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.2.m2.1c">\gamma=50</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.2.m2.1d">italic_γ = 50</annotation></semantics></math>, the same as in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite>. For composite condition inference, we set <math alttext="\omega_{l}=\omega_{nl}=\omega_{v}=2.5" class="ltx_Math" display="inline" id="S3.SS2.p2.3.m3.1"><semantics id="S3.SS2.p2.3.m3.1a"><mrow id="S3.SS2.p2.3.m3.1.1" xref="S3.SS2.p2.3.m3.1.1.cmml"><msub id="S3.SS2.p2.3.m3.1.1.2" xref="S3.SS2.p2.3.m3.1.1.2.cmml"><mi id="S3.SS2.p2.3.m3.1.1.2.2" xref="S3.SS2.p2.3.m3.1.1.2.2.cmml">ω</mi><mi id="S3.SS2.p2.3.m3.1.1.2.3" xref="S3.SS2.p2.3.m3.1.1.2.3.cmml">l</mi></msub><mo id="S3.SS2.p2.3.m3.1.1.3" xref="S3.SS2.p2.3.m3.1.1.3.cmml">=</mo><msub id="S3.SS2.p2.3.m3.1.1.4" xref="S3.SS2.p2.3.m3.1.1.4.cmml"><mi id="S3.SS2.p2.3.m3.1.1.4.2" xref="S3.SS2.p2.3.m3.1.1.4.2.cmml">ω</mi><mrow id="S3.SS2.p2.3.m3.1.1.4.3" xref="S3.SS2.p2.3.m3.1.1.4.3.cmml"><mi id="S3.SS2.p2.3.m3.1.1.4.3.2" xref="S3.SS2.p2.3.m3.1.1.4.3.2.cmml">n</mi><mo id="S3.SS2.p2.3.m3.1.1.4.3.1" xref="S3.SS2.p2.3.m3.1.1.4.3.1.cmml">⁢</mo><mi id="S3.SS2.p2.3.m3.1.1.4.3.3" xref="S3.SS2.p2.3.m3.1.1.4.3.3.cmml">l</mi></mrow></msub><mo id="S3.SS2.p2.3.m3.1.1.5" xref="S3.SS2.p2.3.m3.1.1.5.cmml">=</mo><msub id="S3.SS2.p2.3.m3.1.1.6" xref="S3.SS2.p2.3.m3.1.1.6.cmml"><mi id="S3.SS2.p2.3.m3.1.1.6.2" xref="S3.SS2.p2.3.m3.1.1.6.2.cmml">ω</mi><mi id="S3.SS2.p2.3.m3.1.1.6.3" xref="S3.SS2.p2.3.m3.1.1.6.3.cmml">v</mi></msub><mo id="S3.SS2.p2.3.m3.1.1.7" xref="S3.SS2.p2.3.m3.1.1.7.cmml">=</mo><mn id="S3.SS2.p2.3.m3.1.1.8" xref="S3.SS2.p2.3.m3.1.1.8.cmml">2.5</mn></mrow><annotation-xml 
encoding="MathML-Content" id="S3.SS2.p2.3.m3.1b"><apply id="S3.SS2.p2.3.m3.1.1.cmml" xref="S3.SS2.p2.3.m3.1.1"><and id="S3.SS2.p2.3.m3.1.1a.cmml" xref="S3.SS2.p2.3.m3.1.1"></and><apply id="S3.SS2.p2.3.m3.1.1b.cmml" xref="S3.SS2.p2.3.m3.1.1"><eq id="S3.SS2.p2.3.m3.1.1.3.cmml" xref="S3.SS2.p2.3.m3.1.1.3"></eq><apply id="S3.SS2.p2.3.m3.1.1.2.cmml" xref="S3.SS2.p2.3.m3.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m3.1.1.2.1.cmml" xref="S3.SS2.p2.3.m3.1.1.2">subscript</csymbol><ci id="S3.SS2.p2.3.m3.1.1.2.2.cmml" xref="S3.SS2.p2.3.m3.1.1.2.2">𝜔</ci><ci id="S3.SS2.p2.3.m3.1.1.2.3.cmml" xref="S3.SS2.p2.3.m3.1.1.2.3">𝑙</ci></apply><apply id="S3.SS2.p2.3.m3.1.1.4.cmml" xref="S3.SS2.p2.3.m3.1.1.4"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m3.1.1.4.1.cmml" xref="S3.SS2.p2.3.m3.1.1.4">subscript</csymbol><ci id="S3.SS2.p2.3.m3.1.1.4.2.cmml" xref="S3.SS2.p2.3.m3.1.1.4.2">𝜔</ci><apply id="S3.SS2.p2.3.m3.1.1.4.3.cmml" xref="S3.SS2.p2.3.m3.1.1.4.3"><times id="S3.SS2.p2.3.m3.1.1.4.3.1.cmml" xref="S3.SS2.p2.3.m3.1.1.4.3.1"></times><ci id="S3.SS2.p2.3.m3.1.1.4.3.2.cmml" xref="S3.SS2.p2.3.m3.1.1.4.3.2">𝑛</ci><ci id="S3.SS2.p2.3.m3.1.1.4.3.3.cmml" xref="S3.SS2.p2.3.m3.1.1.4.3.3">𝑙</ci></apply></apply></apply><apply id="S3.SS2.p2.3.m3.1.1c.cmml" xref="S3.SS2.p2.3.m3.1.1"><eq id="S3.SS2.p2.3.m3.1.1.5.cmml" xref="S3.SS2.p2.3.m3.1.1.5"></eq><share href="https://arxiv.org/html/2503.10700v1#S3.SS2.p2.3.m3.1.1.4.cmml" id="S3.SS2.p2.3.m3.1.1d.cmml" xref="S3.SS2.p2.3.m3.1.1"></share><apply id="S3.SS2.p2.3.m3.1.1.6.cmml" xref="S3.SS2.p2.3.m3.1.1.6"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m3.1.1.6.1.cmml" xref="S3.SS2.p2.3.m3.1.1.6">subscript</csymbol><ci id="S3.SS2.p2.3.m3.1.1.6.2.cmml" xref="S3.SS2.p2.3.m3.1.1.6.2">𝜔</ci><ci id="S3.SS2.p2.3.m3.1.1.6.3.cmml" xref="S3.SS2.p2.3.m3.1.1.6.3">𝑣</ci></apply></apply><apply id="S3.SS2.p2.3.m3.1.1e.cmml" xref="S3.SS2.p2.3.m3.1.1"><eq id="S3.SS2.p2.3.m3.1.1.7.cmml" xref="S3.SS2.p2.3.m3.1.1.7"></eq><share 
href="https://arxiv.org/html/2503.10700v1#S3.SS2.p2.3.m3.1.1.6.cmml" id="S3.SS2.p2.3.m3.1.1f.cmml" xref="S3.SS2.p2.3.m3.1.1"></share><cn id="S3.SS2.p2.3.m3.1.1.8.cmml" type="float" xref="S3.SS2.p2.3.m3.1.1.8">2.5</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.3.m3.1c">\omega_{l}=\omega_{nl}=\omega_{v}=2.5</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.3.m3.1d">italic_ω start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_ω start_POSTSUBSCRIPT italic_n italic_l end_POSTSUBSCRIPT = italic_ω start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = 2.5</annotation></semantics></math>, as determined experimentally. After generating the Mel spectrogram, we used the GLA-GRAD <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib33" title="">33</a>]</cite> vocoder, which is also based on a diffusion model architecture, to produce the audio signal.</p> </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection"><span class="ltx_text" id="S3.SS3.5.1.1">III-C</span> </span><span class="ltx_text ltx_font_italic" id="S3.SS3.6.2">Evaluation</span> </h3> <section class="ltx_subsubsection" id="S3.SS3.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S3.SS3.SSS1.5.1.1">III-C</span>1 </span>Evaluation Metrics</h4> <div class="ltx_para ltx_noindent" id="S3.SS3.SSS1.p1"> <p class="ltx_p" id="S3.SS3.SSS1.p1.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.SSS1.p1.1.1">Baseline</span> We conducted ablation experiments using the same LDM module, deriving latent features through different methods. Our model employed the CVALP module with either averaging (aver.) or concatenation (concat). 
For comparison, we tested against the Diff-Foley <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite> model with the CAVP module and the VTA-LDM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib13" title="">13</a>]</cite> model using Clip4Clip <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib34" title="">34</a>]</cite>. All experiments were conducted on the VGGSound <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib26" title="">26</a>]</cite> test set, generating 8-second audio clips for evaluation.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.SSS1.p2"> <p class="ltx_p" id="S3.SS3.SSS1.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.SSS1.p2.1.1">Objective Evaluation</span> We used four metrics from <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib35" title="">35</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib13" title="">13</a>]</cite> to assess semantic generation quality. Inception Score (IS) evaluates how well the generated distribution matches the diversity of real data. Fréchet Inception Distance (FID) and Fréchet Audio Distance (FAD) compare the statistical features of generated samples with those of real samples; their effectiveness is validated in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib36" title="">36</a>]</cite>. Mean Kullback–Leibler divergence (MKL) measures the divergence between the probability distributions of generated and real data, reflecting how closely they align. 
We also used Alignment Accuracy (Align) from <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib12" title="">12</a>]</cite> to evaluate temporal synchronization between generated audio and video content.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.SSS1.p3"> <p class="ltx_p" id="S3.SS3.SSS1.p3.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.SSS1.p3.1.1">Subjective Evaluation</span> Twenty participants (8 female, 12 male, aged 20-45) with self-reported normal hearing and normal or corrected-to-normal vision rated audio-visual clips using Sennheiser HD600 headphones. Participants viewed five distinct videos, each with four generated audio tracks corresponding to the four experimental conditions listed in the table (as identified by the “Model” and “Latent Features” columns collectively), presented in random order. Each participant listened to each audio sample only once to ensure that the ratings reflected their initial impressions. This procedure was repeated for each of the five videos.</p> </div> <div class="ltx_para" id="S3.SS3.SSS1.p4"> <p class="ltx_p" id="S3.SS3.SSS1.p4.1">Among these conditions, the “Users” group wrote descriptions based on the video content. 
These descriptions were used as text prompt input for the model in the “Users” condition, serving as a guiding condition during the diffusion process.</p> </div> <div class="ltx_para" id="S3.SS3.SSS1.p5"> <p class="ltx_p" id="S3.SS3.SSS1.p5.1">Participants rated the audio tracks using Mean Opinion Scores (MOS)<cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#bib.bib37" title="">37</a>]</cite> on a five-point scale, where 1 represents “Bad,” 2 “Poor,” 3 “Fair,” 4 “Good,” and 5 “Excellent.” MOS for semantic consistency was based on how well the audio content matched the video content, while temporal alignment MOS was based on the synchronization between the audio and the visual cues.</p> </div> </section> <section class="ltx_subsubsection" id="S3.SS3.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection"><span class="ltx_text" id="S3.SS3.SSS2.5.1.1">III-C</span>2 </span>Results and Analysis</h4> <figure class="ltx_table" id="S3.T1"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">TABLE I: </span>Objective Evaluation Results</figcaption> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S3.T1.2.2"> <tr class="ltx_tr" id="S3.T1.2.2.3"> <td class="ltx_td ltx_align_center ltx_border_tt" id="S3.T1.2.2.3.1">Model</td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_tt" id="S3.T1.2.2.3.2" style="width:28.5pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.3.2.1"> <span class="ltx_p" id="S3.T1.2.2.3.2.1.1">Features</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_tt" id="S3.T1.2.2.3.3" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.3.3.1"> <span class="ltx_p" id="S3.T1.2.2.3.3.1.1">IS↑</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_tt" id="S3.T1.2.2.3.4" style="width:14.2pt;"> <span class="ltx_inline-block 
ltx_align_top" id="S3.T1.2.2.3.4.1"> <span class="ltx_p" id="S3.T1.2.2.3.4.1.1">FID↓</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_tt" id="S3.T1.2.2.3.5" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.3.5.1"> <span class="ltx_p" id="S3.T1.2.2.3.5.1.1">FAD↓</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_tt" id="S3.T1.2.2.3.6" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.3.6.1"> <span class="ltx_p" id="S3.T1.2.2.3.6.1.1">MKL↓</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_tt" id="S3.T1.2.2.3.7" style="width:22.8pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.3.7.1"> <span class="ltx_p" id="S3.T1.2.2.3.7.1.1">Align(%)↑</span> </span> </td> </tr> <tr class="ltx_tr" id="S3.T1.1.1.1"> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.1.1.1.2">TA-V2A</td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_t" id="S3.T1.1.1.1.1" style="width:28.5pt;"><span class="ltx_inline-logical-block ltx_align_top" id="S3.T1.1.1.1.1.1"> <span class="ltx_para ltx_noindent" id="S3.T1.1.1.1.1.1.p1"> <span class="ltx_p" id="S3.T1.1.1.1.1.1.p1.1"><span class="ltx_text" id="S3.T1.1.1.1.1.1.p1.1.1"></span> <span class="ltx_text" id="S3.T1.1.1.1.1.1.p1.1.2"> <span class="ltx_tabular ltx_align_middle" id="S3.T1.1.1.1.1.1.p1.1.2.1"> <span class="ltx_tr" id="S3.T1.1.1.1.1.1.p1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T1.1.1.1.1.1.p1.1.2.1.1.1">CVALP</span></span> <span class="ltx_tr" id="S3.T1.1.1.1.1.1.p1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T1.1.1.1.1.1.p1.1.2.1.2.1">(aver.)</span></span> </span></span><span class="ltx_text" id="S3.T1.1.1.1.1.1.p1.1.3"></span></span> </span></span></td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_t" id="S3.T1.1.1.1.3" style="width:14.2pt;"> <span 
class="ltx_inline-block ltx_align_top" id="S3.T1.1.1.1.3.1"> <span class="ltx_p" id="S3.T1.1.1.1.3.1.1">7.16</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_t" id="S3.T1.1.1.1.4" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.1.1.1.4.1"> <span class="ltx_p" id="S3.T1.1.1.1.4.1.1">48.47</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_t" id="S3.T1.1.1.1.5" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.1.1.1.5.1"> <span class="ltx_p" id="S3.T1.1.1.1.5.1.1">6.03</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_t" id="S3.T1.1.1.1.6" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.1.1.1.6.1"> <span class="ltx_p" id="S3.T1.1.1.1.6.1.1">4.92</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_t" id="S3.T1.1.1.1.7" style="width:22.8pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.1.1.1.7.1"> <span class="ltx_p" id="S3.T1.1.1.1.7.1.1">72.61</span> </span> </td> </tr> <tr class="ltx_tr" id="S3.T1.2.2.2"> <td class="ltx_td ltx_align_center" id="S3.T1.2.2.2.2">TA-V2A</td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.2.1" style="width:28.5pt;"><span class="ltx_inline-logical-block ltx_align_top" id="S3.T1.2.2.2.1.1"> <span class="ltx_para ltx_noindent" id="S3.T1.2.2.2.1.1.p1"> <span class="ltx_p" id="S3.T1.2.2.2.1.1.p1.1"><span class="ltx_text" id="S3.T1.2.2.2.1.1.p1.1.1"></span> <span class="ltx_text" id="S3.T1.2.2.2.1.1.p1.1.2"> <span class="ltx_tabular ltx_align_middle" id="S3.T1.2.2.2.1.1.p1.1.2.1"> <span class="ltx_tr" id="S3.T1.2.2.2.1.1.p1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T1.2.2.2.1.1.p1.1.2.1.1.1">CVALP</span></span> <span class="ltx_tr" id="S3.T1.2.2.2.1.1.p1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" 
id="S3.T1.2.2.2.1.1.p1.1.2.1.2.1">(concat)</span></span> </span></span><span class="ltx_text" id="S3.T1.2.2.2.1.1.p1.1.3"></span></span> </span></span></td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.2.3" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.2.3.1"> <span class="ltx_p" id="S3.T1.2.2.2.3.1.1">10.59</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.2.4" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.2.4.1"> <span class="ltx_p" id="S3.T1.2.2.2.4.1.1"><span class="ltx_text ltx_font_bold" id="S3.T1.2.2.2.4.1.1.1">21.71</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.2.5" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.2.5.1"> <span class="ltx_p" id="S3.T1.2.2.2.5.1.1"><span class="ltx_text ltx_font_bold" id="S3.T1.2.2.2.5.1.1.1">2.66</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.2.6" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.2.6.1"> <span class="ltx_p" id="S3.T1.2.2.2.6.1.1"><span class="ltx_text ltx_font_bold" id="S3.T1.2.2.2.6.1.1.1">2.74</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.2.7" style="width:22.8pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.2.7.1"> <span class="ltx_p" id="S3.T1.2.2.2.7.1.1">84.37</span> </span> </td> </tr> <tr class="ltx_tr" id="S3.T1.2.2.4"> <td class="ltx_td ltx_align_center" id="S3.T1.2.2.4.1">Diff-Foley</td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.4.2" style="width:28.5pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.4.2.1"> <span class="ltx_p" id="S3.T1.2.2.4.2.1.1">CAVP</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.4.3" style="width:14.2pt;"> <span class="ltx_inline-block 
ltx_align_top" id="S3.T1.2.2.4.3.1"> <span class="ltx_p" id="S3.T1.2.2.4.3.1.1">9.51</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.4.4" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.4.4.1"> <span class="ltx_p" id="S3.T1.2.2.4.4.1.1">36.20</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.4.5" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.4.5.1"> <span class="ltx_p" id="S3.T1.2.2.4.5.1.1">4.87</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.4.6" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.4.6.1"> <span class="ltx_p" id="S3.T1.2.2.4.6.1.1">4.53</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle" id="S3.T1.2.2.4.7" style="width:22.8pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.4.7.1"> <span class="ltx_p" id="S3.T1.2.2.4.7.1.1"><span class="ltx_text ltx_font_bold" id="S3.T1.2.2.4.7.1.1.1">86.77</span></span> </span> </td> </tr> <tr class="ltx_tr" id="S3.T1.2.2.5"> <td class="ltx_td ltx_align_center ltx_border_bb" id="S3.T1.2.2.5.1">VTA-LDM</td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_bb" id="S3.T1.2.2.5.2" style="width:28.5pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.5.2.1"> <span class="ltx_p" id="S3.T1.2.2.5.2.1.1">Clip4Clip</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_bb" id="S3.T1.2.2.5.3" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.5.3.1"> <span class="ltx_p" id="S3.T1.2.2.5.3.1.1"><span class="ltx_text ltx_font_bold" id="S3.T1.2.2.5.3.1.1.1">10.74</span></span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_bb" id="S3.T1.2.2.5.4" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.5.4.1"> <span class="ltx_p" 
id="S3.T1.2.2.5.4.1.1">26.05</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_bb" id="S3.T1.2.2.5.5" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.5.5.1"> <span class="ltx_p" id="S3.T1.2.2.5.5.1.1">2.74</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_bb" id="S3.T1.2.2.5.6" style="width:14.2pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.5.6.1"> <span class="ltx_p" id="S3.T1.2.2.5.6.1.1">3.10</span> </span> </td> <td class="ltx_td ltx_align_justify ltx_align_middle ltx_border_bb" id="S3.T1.2.2.5.7" style="width:22.8pt;"> <span class="ltx_inline-block ltx_align_top" id="S3.T1.2.2.5.7.1"> <span class="ltx_p" id="S3.T1.2.2.5.7.1.1">79.89</span> </span> </td> </tr> </table> </figure> <figure class="ltx_table" id="S3.T2"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table">TABLE II: </span>Subjective Evaluation Results</figcaption> <table class="ltx_tabular ltx_centering ltx_align_middle" id="S3.T2.1"> <tr class="ltx_tr" id="S3.T2.1.1"> <td class="ltx_td ltx_align_center ltx_border_tt" id="S3.T2.1.1.1">Model</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S3.T2.1.1.2">Latent Features</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S3.T2.1.1.3">Semantic↑</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S3.T2.1.1.4">Temporal↑</td> </tr> <tr class="ltx_tr" id="S3.T2.1.2"> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.2.1">TA-V2A (Auto)</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.2.2">CVALP (concat)</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.2.3">4.00</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.1.2.4"><span class="ltx_text ltx_font_bold" id="S3.T2.1.2.4.1">3.75</span></td> </tr> <tr class="ltx_tr" id="S3.T2.1.3"> <td class="ltx_td ltx_align_center" id="S3.T2.1.3.1">TA-V2A (Users)</td> <td class="ltx_td ltx_align_center" 
id="S3.T2.1.3.2">CVALP (concat)</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.3.3"><span class="ltx_text ltx_font_bold" id="S3.T2.1.3.3.1">4.30</span></td> <td class="ltx_td ltx_align_center" id="S3.T2.1.3.4">3.70</td> </tr> <tr class="ltx_tr" id="S3.T2.1.4"> <td class="ltx_td ltx_align_center" id="S3.T2.1.4.1">Diff-Foley</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.4.2">CAVP</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.4.3">3.35</td> <td class="ltx_td ltx_align_center" id="S3.T2.1.4.4"><span class="ltx_text ltx_font_bold" id="S3.T2.1.4.4.1">3.75</span></td> </tr> <tr class="ltx_tr" id="S3.T2.1.5"> <td class="ltx_td ltx_align_center ltx_border_bb" id="S3.T2.1.5.1">VTA-LDM</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S3.T2.1.5.2">Clip4Clip</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S3.T2.1.5.3">3.90</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S3.T2.1.5.4">3.10</td> </tr> </table> </figure> <div class="ltx_para" id="S3.SS3.SSS2.p1"> <p class="ltx_p" id="S3.SS3.SSS2.p1.1">To showcase our results more effectively, we provide a video that visually demonstrates the audio generation outcomes: <a class="ltx_ref ltx_href" href="https://drive.google.com/file/d/1P3BLVhO_GcpYUqq67ij3LOD1FRxR3nra/view?usp=sharing" title="">Display</a></p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.SSS2.p2"> <p class="ltx_p" id="S3.SS3.SSS2.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.SSS2.p2.1.1">Objective Evaluation Results</span> Table <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.T1" title="TABLE I ‣ III-C2 Results and Analysis ‣ III-C Evaluation ‣ III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">I</span></a> presents the objective evaluation outcomes of the 
models. With CVALP features derived by concatenation (concat), TA-V2A achieves the best FID (21.71), FAD (2.66), and MKL (2.74) among the compared models, while its IS (10.59) and Align (84.37%) fall slightly below the best values, obtained by VTA-LDM (10.74) and Diff-Foley (86.77%), respectively. Performance drops on every metric when the features are averaged (aver.) instead, likely because simple averaging disrupts the algebraic structure of the feature vectors or matrices. Concatenating the multimodal information and then applying projection and dimensionality reduction enhances semantic expression while largely preserving the alignment capability of the video features.</p> </div> <figure class="ltx_figure" id="S3.F2"> <p class="ltx_p ltx_align_center ltx_align_center" id="S3.F2.1"><span class="ltx_text" id="S3.F2.1.1"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="288" id="S3.F2.1.1.g1" src="extracted/6273342/Align.png" width="568"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>An Example of Video-Audio Alignment. The top shows frames from a badminton sequence, while the bottom compares audio spectrograms from different methods: Ground Truth, TA-V2A, Diff-Foley, and VTA-LDM. Yellow boxes highlight key synchronized moments between video and audio.</figcaption> </figure> <div class="ltx_para" id="S3.SS3.SSS2.p3"> <p class="ltx_p" id="S3.SS3.SSS2.p3.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.F2" title="Figure 2 ‣ III-C2 Results and Analysis ‣ III-C Evaluation ‣ III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">2</span></a> compares the generated audio from different models, with a focus on our TA-V2A model. 
The video frames shown at the top depict a badminton match with multiple shuttlecock hits, which are the key sound events that the audio generation models must reproduce. The audio spectrogram generated by our model closely mirrors the ground truth in terms of both timing and frequency. This indicates that the TA-V2A model accurately captures the key audio events, such as the sharp sounds of racket impacts and the shuttlecock’s flight, while maintaining precise temporal alignment with the visual actions in the video.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.SSS2.p4"> <p class="ltx_p" id="S3.SS3.SSS2.p4.1"><span class="ltx_text ltx_font_bold" id="S3.SS3.SSS2.p4.1.1">Subjective Evaluation Results</span> Table <a class="ltx_ref" href="https://arxiv.org/html/2503.10700v1#S3.T2" title="TABLE II ‣ III-C2 Results and Analysis ‣ III-C Evaluation ‣ III Experiments ‣ TA-V2A: Textually Assisted Video-to-Audio Generation This work is supported by the National Key Research and Development Program of China (No.2024YFB2808902), and the High-performance Computing Platform of Peking University."><span class="ltx_text ltx_ref_tag">II</span></a> presents the subjective evaluation results of the models. The TA-V2A model, when user modifications were applied during the inference phase, achieved the highest MOS for semantic consistency (4.30), while TA-V2A with automatically generated text tied with Diff-Foley for the highest temporal alignment MOS (3.75). This suggests that the integration of a text control interface, along with the PPPR text expansion method, enhances the model’s ability to produce audio that is semantically consistent with the video and closely aligned with human understanding.</p> </div> <div class="ltx_para" id="S3.SS3.SSS2.p5"> <p class="ltx_p" id="S3.SS3.SSS2.p5.1">Indeed, as the analysis above shows, semantic expression and temporal alignment are closely connected rather than independent variables. 
Only with high-quality recognition and generation can the accuracy of alignment be meaningfully evaluated.</p> </div> </section> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">IV </span><span class="ltx_text ltx_font_smallcaps" id="S4.1.1">Conclusion</span> </h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">We present TA-V2A, an innovative system for text-assisted video-to-audio generation, featuring a pretraining method that aligns video, text, and audio using advanced diffusion guidance techniques. It includes a text interface for personalized sound generation. Extensive evaluations show that TA-V2A outperforms existing methods in both objective and subjective assessments, enhancing semantic expression. We hope to push the field toward more human-centered, context-aware sound generation.</p> </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"> Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, </span> <span class="ltx_bibblock">“Audioldm: Text-to-audio generation with latent diffusion models,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib1.1.1">arXiv preprint arXiv:2301.12503</span>, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"> Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi, </span> <span class="ltx_bibblock">“Audiogen: Textually guided audio generation,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib2.1.1">arXiv preprint arXiv:2209.15352</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"> Shentong Mo, Jing Shi, and Yapeng Tian, </span> <span class="ltx_bibblock">“Text-to-audio generation synchronized with videos,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib3.1.1">arXiv preprint arXiv:2403.07938</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"> Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai, </span> <span class="ltx_bibblock">“V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib4.1.1">Proceedings of the AAAI Conference on Artificial Intelligence</span>, 2024, vol. 38, pp. 15492–15501. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"> Ruihan Yang, Hannes Gamper, and Sebastian Braun, </span> <span class="ltx_bibblock">“Cmmd: Contrastive multi-modal diffusion for video-audio conditional modeling,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib5.1.1">arXiv preprint arXiv:2312.05412</span>, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"> Vinod K Kurmi, Vipul Bajaj, Badri N Patro, KS Venkatesh, Vinay P Namboodiri, and Preethi Jyothi, </span> <span class="ltx_bibblock">“Collaborative learning to generate audio-video jointly,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib6.1.1">ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</span>. IEEE, 2021, pp. 4180–4184. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"> Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo, </span> <span class="ltx_bibblock">“Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib7.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</span>, 2023, pp. 10219–10228. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"> Sanchita Ghose and John J Prevost, </span> <span class="ltx_bibblock">“Foleygan: Visually guided generative adversarial network-based synchronous sound generation in silent videos,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib8.1.1">IEEE Transactions on Multimedia</span>, vol. 25, pp. 4508–4519, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"> Vladimir Iashin and Esa Rahtu, </span> <span class="ltx_bibblock">“Taming visually guided sound generation,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib9.1.1">arXiv preprint arXiv:2110.08791</span>, 2021. 
</span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"> Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens, </span> <span class="ltx_bibblock">“Conditional generation of audio from video via foley analogies,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib10.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</span>, 2023, pp. 2426–2436. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, and Joan Serrà, </span> <span class="ltx_bibblock">“Masked generative video-to-audio transformers with enhanced synchronicity,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib11.1.1">arXiv preprint arXiv:2407.10387</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao, </span> <span class="ltx_bibblock">“Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib12.1.1">Advances in Neural Information Processing Systems</span>, vol. 36, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu, </span> <span class="ltx_bibblock">“Video-to-audio generation with hidden alignment,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib13.1.1">arXiv preprint arXiv:2407.07464</span>, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee, </span> <span class="ltx_bibblock">“Read, watch and scream! sound generation from text and video,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib14.1.1">arXiv preprint arXiv:2407.05551</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen, </span> <span class="ltx_bibblock">“Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib15.1.1">arXiv preprint arXiv:2407.01494</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua, </span> <span class="ltx_bibblock">“Next-gpt: Any-to-any multimodal llm,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib16.1.1">arXiv preprint arXiv:2309.05519</span>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal, </span> <span class="ltx_bibblock">“Any-to-any generation via composable diffusion,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib17.1.1">Advances in Neural Information Processing Systems</span>, vol. 36, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing, </span> <span class="ltx_bibblock">“Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib18.1.1">arXiv preprint arXiv:2406.07476</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"> Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., </span> <span class="ltx_bibblock">“Learning transferable visual models from natural language supervision,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib19.1.1">International conference on machine learning</span>. PMLR, 2021, pp. 8748–8763. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"> Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, </span> <span class="ltx_bibblock">“Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib20.1.1">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</span>. IEEE, 2023, pp. 1–5. 
</span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"> Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, </span> <span class="ltx_bibblock">“High-resolution image synthesis with latent diffusion models,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib21.1.1">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</span>, 2022, pp. 10684–10695. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"> Prafulla Dhariwal and Alexander Nichol, </span> <span class="ltx_bibblock">“Diffusion models beat gans on image synthesis,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib22.1.1">Advances in neural information processing systems</span>, vol. 34, pp. 8780–8794, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"> Jonathan Ho and Tim Salimans, </span> <span class="ltx_bibblock">“Classifier-free diffusion guidance,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib23.1.1">arXiv preprint arXiv:2207.12598</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"> Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum, </span> <span class="ltx_bibblock">“Compositional visual generation with composable diffusion models,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib24.1.1">Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII</span>. Springer, 2022, pp. 423–439. 
</span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"> Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, et al., </span> <span class="ltx_bibblock">“Pppr: Portable plug-in prompt refiner for text to audio generation,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib25.1.1">arXiv preprint arXiv:2406.04683</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"> Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman, </span> <span class="ltx_bibblock">“Vggsound: A large-scale audio-visual dataset,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib26.1.1">ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</span>. IEEE, 2020, pp. 721–725. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"> Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley, </span> <span class="ltx_bibblock">“Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"> Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, </span> <span class="ltx_bibblock">“Audio set: An ontology and human-labeled dataset for audio events,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib28.1.1">2017 IEEE international conference on acoustics, speech and signal processing (ICASSP)</span>. IEEE, 2017, pp. 776–780. 
</span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"> Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He, </span> <span class="ltx_bibblock">“Slowfast networks for video recognition,” 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"> Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman, </span> <span class="ltx_bibblock">“The kinetics human action video dataset,” 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"> Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei, </span> <span class="ltx_bibblock">“Scaling instruction-finetuned language models,” 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"> Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, and Gao Huang, </span> <span class="ltx_bibblock">“Agent attention: On the integration of softmax and linear attention,” 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"> Haocheng Liu, Teysir Baoueb, Mathieu Fontaine, Jonathan Le Roux, and Gael Richard, </span> <span class="ltx_bibblock">“GLA-Grad: A Griffin-Lim extended waveform generation diffusion model,” </span> <span class="ltx_bibblock">in <span class="ltx_text ltx_font_italic" id="bib.bib33.1.1">ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</span>. IEEE, 2024, pp. 11611–11615. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"> Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li, </span> <span class="ltx_bibblock">“CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib34.1.1">arXiv preprint arXiv:2104.08860</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"> Vladimir Iashin and Esa Rahtu, </span> <span class="ltx_bibblock">“Taming visually guided sound generation,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib35.1.1">CoRR</span>, vol. abs/2110.08791, 2021. 
</span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"> Modan Tailleur, Junwon Lee, Mathieu Lagrange, Keunwoo Choi, Laurie M Heller, Keisuke Imoto, and Yuki Okamoto, </span> <span class="ltx_bibblock">“Correlation of Fréchet audio distance with human perception of environmental audio is embedding dependant,” </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib36.2.1">arXiv preprint arXiv:2403.17508</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"> International Telecommunication Union, </span> <span class="ltx_bibblock">“ITU-T Recommendation P.800: Methods for Subjective Determination of Transmission Quality,” <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.itu.int/rec/T-REC-P.800-199608-I" title="">https://www.itu.int/rec/T-REC-P.800-199608-I</a>, 1996, </span> <span class="ltx_bibblock">Accessed: 2024-09-03. 
</span> </li> </ul> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Wed Mar 12 06:38:27 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>
