GiVE: Guiding Visual Encoder to Perceive Overlooked Information

<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>GiVE: Guiding Visual Encoder to Perceive Overlooked Information</title> <!--Generated on Sat Oct 26 07:32:20 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2410.20109v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx1" title="In GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx2" title="In GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Related Work</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx2.SSx1" title="In Related Work ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked 
Information"><span class="ltx_text ltx_ref_title">Pre-training Visual Encoders</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx2.SSx2" title="In Related Work ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Visual Perception of MLLMs</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3" title="In GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Methodology</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.SSx1" title="In Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Model Architecture</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.SSx2" title="In Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Attention-guided Adapter</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.SSx3" title="In Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Object-focused Visual Semantic Learning</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.SSx4" title="In Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Instruction of New Dataset</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4" title="In GiVE: Guiding 
Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Experiments</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx1" title="In Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Experimental Setup</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx1.SSSx1" title="In Experimental Setup ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Datasets</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx1.SSSx2" title="In Experimental Setup ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Baselines</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx1.SSSx3" title="In Experimental Setup ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Evaluation Metrics</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx2" title="In Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Performance Evaluation</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx2.SSSx1" title="In Performance Evaluation ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span 
class="ltx_text ltx_ref_title">Image Classification</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx2.SSSx2" title="In Performance Evaluation ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Image-Text Retrieval</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx2.SSSx3" title="In Performance Evaluation ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Comparison with Secondary Coding Methods</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx3" title="In Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Ablation Studies</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.SSx4" title="In Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Attention Visualization</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx5" title="In GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Conclusion</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1" title="In GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A </span>Appendix</span></a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" 
href="https://arxiv.org/html/2410.20109v1#A1.SSx1" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Prompt Engineering and Pipeline</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx2" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">MLLM and Image Encoder Summary</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx3" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Multi-Object Instruction Dataset</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx3.SSSx1" title="In Multi-Object Instruction Dataset ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Construction</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx4" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Experimental Setup</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx4.SSSx1" title="In Experimental Setup ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Evaluation Metrics</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx4.SSSx2" 
title="In Experimental Setup ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Hyper-parameter Settings</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx4.SSSx3" title="In Experimental Setup ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Compute Resources</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx5" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Parameter Sensitivity</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx5.SSSx1" title="In Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Configurations of CLIP with GiVE</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx5.SSSx2" title="In Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Sensitivity Analysis on Loss Weights</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx5.SSSx3" title="In Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Comparison with Clip-based Method</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx5.SSSx4" 
title="In Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Training Stability</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx6" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">More Guided Attention Visualization</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.SSx7" title="In Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_title">Visual Attention and Response</span></a></li> </ol> </li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document">GiVE: Guiding Visual Encoder to Perceive Overlooked Information</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Junjie Li<sup class="ltx_sup" id="id1.1.id1">1</sup>, Jianghong Ma<sup class="ltx_sup" id="id2.2.id2">1</sup>, Xiaofeng Zhang<sup class="ltx_sup" id="id3.3.id3">1</sup>, Yuhang Li<sup class="ltx_sup" id="id4.4.id4">1</sup>, Jianyang Shi<sup class="ltx_sup" id="id5.5.id5">1</sup> </span><span class="ltx_author_notes">Corresponding author.</span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id6.id1">Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. 
We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval, and the Multi-Object Instruction (MOInst) dataset. Experiments show our approach achieves state-of-the-art performance.</p> </div> <section class="ltx_section" id="Sx1"> <h2 class="ltx_title ltx_title_section">Introduction</h2> <div class="ltx_para" id="Sx1.p1"> <p class="ltx_p" id="Sx1.p1.1">The multimodal large language models (MLLMs) <cite class="ltx_cite ltx_citemacro_citep">(Zhu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib62" title="">2023</a>; Liu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib24" title="">2023b</a>; Ye et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib52" title="">2023</a>; Li et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib19" title="">2023</a>)</cite> have significantly advanced general artificial intelligence by demonstrating exceptional capabilities in generation and inference across diverse real-world applications, such as text-to-video generation <cite class="ltx_cite ltx_citemacro_citep">(Blattmann et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib2" title="">2023</a>)</cite>, visual question answering <cite class="ltx_cite ltx_citemacro_citep">(Singh et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib39" title="">2019</a>)</cite>, and embodied robotics <cite class="ltx_cite ltx_citemacro_citep">(Mu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib30" title="">2023</a>)</cite>. To optimize performance while managing costs, a prevalent MLLM architecture integrates a visual encoder with a large language model (LLM). In this architecture, the encoder embeds non-text data, and a mapping mechanism transforms these embeddings into vectors comprehensible to the LLM. Although prior research <cite class="ltx_cite ltx_citemacro_citep">(Wang et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib45" title="">2023</a>; Liu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib22" title="">2023a</a>)</cite> has underscored the efficacy of this architecture, the cross-modal perception abilities of MLLMs rely heavily on the visual encoder, with the quality of encoded image embeddings directly influencing the response quality of the MLLMs.</p> </div> <div class="ltx_para" id="Sx1.p2"> <p class="ltx_p" id="Sx1.p2.1">An image encoder is a specialized visual encoder designed to map high-dimensional image data to a lower-dimensional feature space. 
These encoders can be broadly categorized based on their pre-training tasks into two main types: reconstruction-based and cross-modal contrastive learning-based encoders.</p> <ul class="ltx_itemize" id="Sx1.I1"> <li class="ltx_item" id="Sx1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I1.i1.p1"> <p class="ltx_p" id="Sx1.I1.i1.p1.1">Image encoding models trained with <span class="ltx_text ltx_font_bold" id="Sx1.I1.i1.p1.1.1">reconstruction tasks</span> <cite class="ltx_cite ltx_citemacro_citep">(van den Oord, Vinyals, and Kavukcuoglu <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib44" title="">2017</a>; Huang, Fu, and Bors <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib16" title="">2023</a>)</cite> are proficient in capturing comprehensive image details. However, these models <span class="ltx_text ltx_font_italic" id="Sx1.I1.i1.p1.1.2">lack semantic alignment with text</span> during training, which complicates the LLM’s ability to interpret image embeddings <cite class="ltx_cite ltx_citemacro_citep">(Yin et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib54" title="">2023</a>)</cite>. Consequently, such encoders are infrequently utilized in MLLMs.</p> </div> </li> <li class="ltx_item" id="Sx1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I1.i2.p1"> <p class="ltx_p" id="Sx1.I1.i2.p1.1">Vision transformer (ViT) models trained with <span class="ltx_text ltx_font_bold" id="Sx1.I1.i2.p1.1.1">image-text contrastive learning</span> <cite class="ltx_cite ltx_citemacro_citep">(Radford et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>; Jia et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib17" title="">2021</a>)</cite> generally <span class="ltx_text ltx_font_italic" id="Sx1.I1.i2.p1.1.2">align effectively with LLMs</span> but face an implicit “ignore” problem, limiting their expressive capability. This limitation arises because different modalities convey distinct types of information. For instance, an image may feature multiple objects with unique attributes, such as texture, color, spatial location, and potential interactions. In contrast, <span class="ltx_text ltx_font_bold" id="Sx1.I1.i2.p1.1.3">abstract text</span> typically highlights only <span class="ltx_text ltx_font_bold" id="Sx1.I1.i2.p1.1.4">the most salient objects</span> and provides limited descriptions of other visual elements. ViTs trained for image-text matching tend to focus on the salient regions of the image that correspond to the text, thereby <span class="ltx_text ltx_font_bold" id="Sx1.I1.i2.p1.1.5">overlooking secondary elements</span> like the background. 
As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T13" title="Table 13 ‣ Visual Attention and Response ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">13</span></a> of the Appendix, MLLMs using such visual encoders exhibit diminished response quality when users inquire about non-salient objects.</p> </div> </li> </ul> </div> <div class="ltx_para" id="Sx1.p3"> <p class="ltx_p" id="Sx1.p3.1">In summary, for effective integration with LLMs, MLLMs require an image encoder that is both (1) semantically aligned with text during training and (2) capable of flexible attention to prevent the omission of relevant visual features.</p> </div> <figure class="ltx_figure" id="Sx1.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="187" id="Sx1.F1.g1" src="x1.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span><span class="ltx_text ltx_font_bold" id="Sx1.F1.7.1">Motivation overview</span>. (a) The original reconstruction-based encoder perceives the full image but is <span class="ltx_text ltx_font_italic" id="Sx1.F1.8.2">not aligned with textual semantics</span>, thereby limiting its utility for LLMs in effectively interpreting the image embeddings. (b) The contrastive learning-based encoder only processes images <span class="ltx_text ltx_font_italic" id="Sx1.F1.9.3">without the benefit of textual instructions</span>, leading to a focus <span class="ltx_text ltx_font_italic" id="Sx1.F1.10.4">solely on salient objects</span> (Ⓟ) and neglecting user-specific concerns (Ⓞ). 
(c) Our proposed visual encoder addresses these limitations by flexibly adjusting its focus to highlight various objects, whether <span class="ltx_text ltx_font_italic" id="Sx1.F1.11.5">salient</span> (Ⓢ) or <span class="ltx_text ltx_font_italic" id="Sx1.F1.12.6">non-salient</span> (Ⓝ), according to the provided instructions. </figcaption> </figure> <div class="ltx_para" id="Sx1.p4"> <p class="ltx_p" id="Sx1.p4.1">To address these challenges, we propose the <span class="ltx_text ltx_font_bold" id="Sx1.p4.1.1">G</span>u<span class="ltx_text ltx_font_bold" id="Sx1.p4.1.2">i</span>ding <span class="ltx_text ltx_font_bold" id="Sx1.p4.1.3">V</span>isual <span class="ltx_text ltx_font_bold" id="Sx1.p4.1.4">E</span>ncoder to Perceive Overlooked Information (<span class="ltx_text ltx_font_bold" id="Sx1.p4.1.5">GiVE</span>) approach, which guides the visual encoder to adaptively adjust its attention so that it better captures overlooked information. In this approach, we introduce a novel <span class="ltx_text ltx_font_bold" id="Sx1.p4.1.6">Attention-Guided Adapter (AG-Adapter)</span> module that enhances the representation ability of the visual encoder by <span class="ltx_text ltx_font_italic" id="Sx1.p4.1.7">aligning the visual representations with abstract semantics</span>. This module also functions as a plug-in for generalizing abstract semantics, enabling it to address user queries more effectively.</p> </div> <div class="ltx_para" id="Sx1.p5"> <p class="ltx_p" id="Sx1.p5.1">To tackle the above limitations in detail, GiVE incorporates another innovative module: <span class="ltx_text ltx_font_bold" id="Sx1.p5.1.1">Object-focused Visual Semantic Learning</span>. 
This module employs three distinct loss terms: (i) an Object-focused Image-Text Contrast (OITC) loss, which encourages the model to generate distinct embeddings for varied instructions, ensuring consideration of <span class="ltx_text ltx_font_italic" id="Sx1.p5.1.2">both salient and non-salient objects</span>; (ii) an Object-focused Image-Image Contrast (OIIC) loss, which improves the <span class="ltx_text ltx_font_italic" id="Sx1.p5.1.3">accuracy of object retrieval</span> by enabling the model to learn common features among in-class objects, thus enhancing concept generalization ability; and (iii) an Object-focused Image Discrimination (OID) loss, which improves the <span class="ltx_text ltx_font_italic" id="Sx1.p5.1.4">comprehensiveness of object retrieval</span> by helping the model identify specific objects and recognize potential correlations among objects, thereby preventing the omission of objects. As illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx1.F1" title="Figure 1 ‣ Introduction ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">1</span></a>, GiVE enables visual encoders to generate image embeddings that are rich in targeted information.</p> </div> <div class="ltx_para" id="Sx1.p6"> <p class="ltx_p" id="Sx1.p6.1">Our contributions can be summarized as follows:</p> <ul class="ltx_itemize" id="Sx1.I2"> <li class="ltx_item" id="Sx1.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I2.i1.p1"> <p class="ltx_p" id="Sx1.I2.i1.p1.1">We propose an innovative GiVE approach to give attention to overlooked information, enabling dynamic adjustment of focus by modifying textual instructions. 
This capability allows for flexible and adaptive attention to various objects within the image.</p> </div> </li> <li class="ltx_item" id="Sx1.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I2.i2.p1"> <p class="ltx_p" id="Sx1.I2.i2.p1.1">We propose three novel loss functions: OITC, OIIC, and OID. The OITC loss addresses the current limitations of the original visual encoder, particularly its neglect of non-salient objects. The OIIC loss strengthens the encoder's ability to generalize across objects of the same class, thereby improving object retrieval accuracy. The OID loss helps the model recognize potential correlations among objects, further increasing the comprehensiveness of object retrieval.</p> </div> </li> <li class="ltx_item" id="Sx1.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx1.I2.i3.p1"> <p class="ltx_p" id="Sx1.I2.i3.p1.1">We present a fine-grained image-text dataset with instructional labels, named the Multi-Object Instruction (MOInst) dataset, designed to provide semantic indications for different objects. Extensive experiments conducted on both constructed and real-world datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.</p> </div> </li> </ul> </div> </section> <section class="ltx_section" id="Sx2"> <h2 class="ltx_title ltx_title_section">Related Work</h2> <section class="ltx_subsection" id="Sx2.SSx1"> <h3 class="ltx_title ltx_title_subsection">Pre-training Visual Encoders</h3> <div class="ltx_para" id="Sx2.SSx1.p1"> <p class="ltx_p" id="Sx2.SSx1.p1.1">Visual encoding models can be categorized based on their pre-training tasks into image reconstruction models and image-text contrast models. (i) Image reconstruction models, such as MAE <cite class="ltx_cite ltx_citemacro_citep">(He et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib14" title="">2022</a>)</cite>, DeepMIM <cite class="ltx_cite ltx_citemacro_citep">(Ren et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib37" title="">2023</a>)</cite>, and MIRL <cite class="ltx_cite ltx_citemacro_citep">(Huang, Fu, and Bors <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib16" title="">2023</a>)</cite>, use masked image modeling similar to language modeling. Other models, like VQ-VAE <cite class="ltx_cite ltx_citemacro_citep">(van den Oord, Vinyals, and Kavukcuoglu <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib44" title="">2017</a>)</cite>, VQGAN <cite class="ltx_cite ltx_citemacro_citep">(Esser, Rombach, and Ommer <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib10" title="">2021</a>)</cite>, and HQ-VAE <cite class="ltx_cite ltx_citemacro_citep">(You et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib55" title="">2022</a>)</cite>, implement image reconstruction using autoencoder architectures. (ii) Models like CLIP <cite class="ltx_cite ltx_citemacro_citep">(Radford et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite>, ALIGN <cite class="ltx_cite ltx_citemacro_citep">(Jia et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib17" title="">2021</a>)</cite>, and EVA-CLIP <cite class="ltx_cite ltx_citemacro_citep">(Sun et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite> utilize cross-modal contrastive learning to align textual and visual information into a unified semantic space. Consequently, such pre-aligned encoders are more readily aligned with LLMs through alignment pre-training <cite class="ltx_cite ltx_citemacro_citep">(Yin et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib54" title="">2023</a>)</cite>. 
In contrast, reconstruction-trained models lack this text correlation due to their independent training process. As detailed in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T5" title="Table 5 ‣ MLLM and Image Encoder Summary ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">5</span></a> in the Appendix, 90% of the MLLMs we investigated use image-text contrast encoders.</p> </div> <div class="ltx_para" id="Sx2.SSx1.p2"> <p class="ltx_p" id="Sx2.SSx1.p2.1"><span class="ltx_text ltx_font_bold" id="Sx2.SSx1.p2.1.1">Discussion.</span> Our study preserves the semantic alignment capabilities of visual encoders while minimizing the omission of effective information by using semantic instructions.</p> </div> <figure class="ltx_figure" id="Sx2.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="352" id="Sx2.F2.g1" src="x2.png" width="748"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span><span class="ltx_text ltx_font_bold" id="Sx2.F2.2.1">Overall architecture of GiVE</span>. The plug-in module, AG-Adapter, is inserted into the feature extraction layers of the visual encoder and trained with the three losses proposed in our work: Object-focused Image-Text Contrast (OITC), Object-focused Image-Image Contrast (OIIC), and Object-focused Image Discrimination (OID). Cross-attention is used to emphasize the visual elements most relevant to the textual instructions. 
The text instructions are pre-integrated into the prompt template designated as “a photo of {object}”.</figcaption> </figure> </section> <section class="ltx_subsection" id="Sx2.SSx2"> <h3 class="ltx_title ltx_title_subsection">Visual Perception of MLLMs</h3> <div class="ltx_para" id="Sx2.SSx2.p1"> <p class="ltx_p" id="Sx2.SSx2.p1.1">Although some pioneering works mitigate the negative effects of visual ignorance, they still have limitations. For instance, InstructBLIP <cite class="ltx_cite ltx_citemacro_citep">(Dai et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib9" title="">2023</a>)</cite> injects textual instructions to guide post-processing, in which a lightweight Q-Former translates the image embeddings. However, this downstream secondary coding only reinforces indication-related information in the image embeddings and does not recover useful features lost in the upstream image encoder. Similarly, LLaVA-NeXT <cite class="ltx_cite ltx_citemacro_citep">(Liu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib23" title="">2024</a>)</cite> encodes the image multiple times after segmentation, allowing each subimage to have its own salient object, thus reducing the visual feature loss rate. However, this multiplies token occupancy and encoding time. Additionally, BLIVA <cite class="ltx_cite ltx_citemacro_citep">(Hu et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib15" title="">2024</a>)</cite> combines the mappers of BLIP-2 and LLaVA to enhance the LLM’s understanding from more translation perspectives, but since both mappers share the same image encoder, this method does not fundamentally address the loss of useful information during visual encoding.</p> </div> <div class="ltx_para" id="Sx2.SSx2.p2"> <p class="ltx_p" id="Sx2.SSx2.p2.1"><span class="ltx_text ltx_font_bold" id="Sx2.SSx2.p2.1.1">Discussion.</span> Overlooking the inherent imperfections of visual encoders and attempting to rectify them through alternative means is frequently inadequate. Our objective is to confront these shortcomings directly within the visual encoder, thereby minimizing the accumulation of errors from the outset of the encoding process.</p> </div> </section> </section> <section class="ltx_section" id="Sx3"> <h2 class="ltx_title ltx_title_section">Methodology</h2> <div class="ltx_para" id="Sx3.p1"> <p class="ltx_p" id="Sx3.p1.1">The proposed approach uses a text encoder to encode both captions and instructions, and a visual encoder to encode visual content. An AG-Adapter module is inserted into the stacked layers of the visual encoder, where it provides the tunable parameters for aligning cross-modal data. The AG-Adapter is trained by the designed Object-focused Visual Semantic Learning module. Details of the model architecture, the AG-Adapter, and the Object-focused Visual Semantic Learning module are presented in the following subsections. 
The overall architecture is depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx2.F2" title="Figure 2 ‣ Pre-training Visual Encoders ‣ Related Work ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">2</span></a>.</p> </div> <section class="ltx_subsection" id="Sx3.SSx1"> <h3 class="ltx_title ltx_title_subsection">Model Architecture</h3> <div class="ltx_para" id="Sx3.SSx1.p1"> <p class="ltx_p" id="Sx3.SSx1.p1.8">The proposed model includes a visual encoder <math alttext="\Phi_{I}(\cdot,\cdot)" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.1.m1.2"><semantics id="Sx3.SSx1.p1.1.m1.2a"><mrow id="Sx3.SSx1.p1.1.m1.2.3" xref="Sx3.SSx1.p1.1.m1.2.3.cmml"><msub id="Sx3.SSx1.p1.1.m1.2.3.2" xref="Sx3.SSx1.p1.1.m1.2.3.2.cmml"><mi id="Sx3.SSx1.p1.1.m1.2.3.2.2" mathvariant="normal" xref="Sx3.SSx1.p1.1.m1.2.3.2.2.cmml">Φ</mi><mi id="Sx3.SSx1.p1.1.m1.2.3.2.3" xref="Sx3.SSx1.p1.1.m1.2.3.2.3.cmml">I</mi></msub><mo id="Sx3.SSx1.p1.1.m1.2.3.1" xref="Sx3.SSx1.p1.1.m1.2.3.1.cmml">⁢</mo><mrow id="Sx3.SSx1.p1.1.m1.2.3.3.2" xref="Sx3.SSx1.p1.1.m1.2.3.3.1.cmml"><mo id="Sx3.SSx1.p1.1.m1.2.3.3.2.1" stretchy="false" xref="Sx3.SSx1.p1.1.m1.2.3.3.1.cmml">(</mo><mo id="Sx3.SSx1.p1.1.m1.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx1.p1.1.m1.1.1.cmml">⋅</mo><mo id="Sx3.SSx1.p1.1.m1.2.3.3.2.2" rspace="0em" xref="Sx3.SSx1.p1.1.m1.2.3.3.1.cmml">,</mo><mo id="Sx3.SSx1.p1.1.m1.2.2" lspace="0em" rspace="0em" xref="Sx3.SSx1.p1.1.m1.2.2.cmml">⋅</mo><mo id="Sx3.SSx1.p1.1.m1.2.3.3.2.3" stretchy="false" xref="Sx3.SSx1.p1.1.m1.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.1.m1.2b"><apply id="Sx3.SSx1.p1.1.m1.2.3.cmml" xref="Sx3.SSx1.p1.1.m1.2.3"><times id="Sx3.SSx1.p1.1.m1.2.3.1.cmml" xref="Sx3.SSx1.p1.1.m1.2.3.1"></times><apply id="Sx3.SSx1.p1.1.m1.2.3.2.cmml" xref="Sx3.SSx1.p1.1.m1.2.3.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.1.m1.2.3.2.1.cmml" 
xref="Sx3.SSx1.p1.1.m1.2.3.2">subscript</csymbol><ci id="Sx3.SSx1.p1.1.m1.2.3.2.2.cmml" xref="Sx3.SSx1.p1.1.m1.2.3.2.2">Φ</ci><ci id="Sx3.SSx1.p1.1.m1.2.3.2.3.cmml" xref="Sx3.SSx1.p1.1.m1.2.3.2.3">𝐼</ci></apply><interval closure="open" id="Sx3.SSx1.p1.1.m1.2.3.3.1.cmml" xref="Sx3.SSx1.p1.1.m1.2.3.3.2"><ci id="Sx3.SSx1.p1.1.m1.1.1.cmml" xref="Sx3.SSx1.p1.1.m1.1.1">⋅</ci><ci id="Sx3.SSx1.p1.1.m1.2.2.cmml" xref="Sx3.SSx1.p1.1.m1.2.2">⋅</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.1.m1.2c">\Phi_{I}(\cdot,\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.1.m1.2d">roman_Φ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( ⋅ , ⋅ )</annotation></semantics></math> and a text encoder <math alttext="\Phi_{T}(\cdot)" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.2.m2.1"><semantics id="Sx3.SSx1.p1.2.m2.1a"><mrow id="Sx3.SSx1.p1.2.m2.1.2" xref="Sx3.SSx1.p1.2.m2.1.2.cmml"><msub id="Sx3.SSx1.p1.2.m2.1.2.2" xref="Sx3.SSx1.p1.2.m2.1.2.2.cmml"><mi id="Sx3.SSx1.p1.2.m2.1.2.2.2" mathvariant="normal" xref="Sx3.SSx1.p1.2.m2.1.2.2.2.cmml">Φ</mi><mi id="Sx3.SSx1.p1.2.m2.1.2.2.3" xref="Sx3.SSx1.p1.2.m2.1.2.2.3.cmml">T</mi></msub><mo id="Sx3.SSx1.p1.2.m2.1.2.1" xref="Sx3.SSx1.p1.2.m2.1.2.1.cmml">⁢</mo><mrow id="Sx3.SSx1.p1.2.m2.1.2.3.2" xref="Sx3.SSx1.p1.2.m2.1.2.cmml"><mo id="Sx3.SSx1.p1.2.m2.1.2.3.2.1" stretchy="false" xref="Sx3.SSx1.p1.2.m2.1.2.cmml">(</mo><mo id="Sx3.SSx1.p1.2.m2.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx1.p1.2.m2.1.1.cmml">⋅</mo><mo id="Sx3.SSx1.p1.2.m2.1.2.3.2.2" stretchy="false" xref="Sx3.SSx1.p1.2.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.2.m2.1b"><apply id="Sx3.SSx1.p1.2.m2.1.2.cmml" xref="Sx3.SSx1.p1.2.m2.1.2"><times id="Sx3.SSx1.p1.2.m2.1.2.1.cmml" xref="Sx3.SSx1.p1.2.m2.1.2.1"></times><apply id="Sx3.SSx1.p1.2.m2.1.2.2.cmml" xref="Sx3.SSx1.p1.2.m2.1.2.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.2.m2.1.2.2.1.cmml" 
xref="Sx3.SSx1.p1.2.m2.1.2.2">subscript</csymbol><ci id="Sx3.SSx1.p1.2.m2.1.2.2.2.cmml" xref="Sx3.SSx1.p1.2.m2.1.2.2.2">Φ</ci><ci id="Sx3.SSx1.p1.2.m2.1.2.2.3.cmml" xref="Sx3.SSx1.p1.2.m2.1.2.2.3">𝑇</ci></apply><ci id="Sx3.SSx1.p1.2.m2.1.1.cmml" xref="Sx3.SSx1.p1.2.m2.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.2.m2.1c">\Phi_{T}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.2.m2.1d">roman_Φ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( ⋅ )</annotation></semantics></math> to respectively encode visual and textual content. It accepts image-text-object triplets <math alttext="\{\boldsymbol{x}^{I},\boldsymbol{x}^{T},\boldsymbol{x}^{O}\}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.3.m3.3"><semantics id="Sx3.SSx1.p1.3.m3.3a"><mrow id="Sx3.SSx1.p1.3.m3.3.3.3" xref="Sx3.SSx1.p1.3.m3.3.3.4.cmml"><mo id="Sx3.SSx1.p1.3.m3.3.3.3.4" stretchy="false" xref="Sx3.SSx1.p1.3.m3.3.3.4.cmml">{</mo><msup id="Sx3.SSx1.p1.3.m3.1.1.1.1" xref="Sx3.SSx1.p1.3.m3.1.1.1.1.cmml"><mi id="Sx3.SSx1.p1.3.m3.1.1.1.1.2" xref="Sx3.SSx1.p1.3.m3.1.1.1.1.2.cmml">𝒙</mi><mi id="Sx3.SSx1.p1.3.m3.1.1.1.1.3" xref="Sx3.SSx1.p1.3.m3.1.1.1.1.3.cmml">I</mi></msup><mo id="Sx3.SSx1.p1.3.m3.3.3.3.5" xref="Sx3.SSx1.p1.3.m3.3.3.4.cmml">,</mo><msup id="Sx3.SSx1.p1.3.m3.2.2.2.2" xref="Sx3.SSx1.p1.3.m3.2.2.2.2.cmml"><mi id="Sx3.SSx1.p1.3.m3.2.2.2.2.2" xref="Sx3.SSx1.p1.3.m3.2.2.2.2.2.cmml">𝒙</mi><mi id="Sx3.SSx1.p1.3.m3.2.2.2.2.3" xref="Sx3.SSx1.p1.3.m3.2.2.2.2.3.cmml">T</mi></msup><mo id="Sx3.SSx1.p1.3.m3.3.3.3.6" xref="Sx3.SSx1.p1.3.m3.3.3.4.cmml">,</mo><msup id="Sx3.SSx1.p1.3.m3.3.3.3.3" xref="Sx3.SSx1.p1.3.m3.3.3.3.3.cmml"><mi id="Sx3.SSx1.p1.3.m3.3.3.3.3.2" xref="Sx3.SSx1.p1.3.m3.3.3.3.3.2.cmml">𝒙</mi><mi id="Sx3.SSx1.p1.3.m3.3.3.3.3.3" xref="Sx3.SSx1.p1.3.m3.3.3.3.3.3.cmml">O</mi></msup><mo id="Sx3.SSx1.p1.3.m3.3.3.3.7" stretchy="false" xref="Sx3.SSx1.p1.3.m3.3.3.4.cmml">}</mo></mrow><annotation-xml 
encoding="MathML-Content" id="Sx3.SSx1.p1.3.m3.3b"><set id="Sx3.SSx1.p1.3.m3.3.3.4.cmml" xref="Sx3.SSx1.p1.3.m3.3.3.3"><apply id="Sx3.SSx1.p1.3.m3.1.1.1.1.cmml" xref="Sx3.SSx1.p1.3.m3.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.3.m3.1.1.1.1.1.cmml" xref="Sx3.SSx1.p1.3.m3.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx1.p1.3.m3.1.1.1.1.2.cmml" xref="Sx3.SSx1.p1.3.m3.1.1.1.1.2">𝒙</ci><ci id="Sx3.SSx1.p1.3.m3.1.1.1.1.3.cmml" xref="Sx3.SSx1.p1.3.m3.1.1.1.1.3">𝐼</ci></apply><apply id="Sx3.SSx1.p1.3.m3.2.2.2.2.cmml" xref="Sx3.SSx1.p1.3.m3.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.3.m3.2.2.2.2.1.cmml" xref="Sx3.SSx1.p1.3.m3.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx1.p1.3.m3.2.2.2.2.2.cmml" xref="Sx3.SSx1.p1.3.m3.2.2.2.2.2">𝒙</ci><ci id="Sx3.SSx1.p1.3.m3.2.2.2.2.3.cmml" xref="Sx3.SSx1.p1.3.m3.2.2.2.2.3">𝑇</ci></apply><apply id="Sx3.SSx1.p1.3.m3.3.3.3.3.cmml" xref="Sx3.SSx1.p1.3.m3.3.3.3.3"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.3.m3.3.3.3.3.1.cmml" xref="Sx3.SSx1.p1.3.m3.3.3.3.3">superscript</csymbol><ci id="Sx3.SSx1.p1.3.m3.3.3.3.3.2.cmml" xref="Sx3.SSx1.p1.3.m3.3.3.3.3.2">𝒙</ci><ci id="Sx3.SSx1.p1.3.m3.3.3.3.3.3.cmml" xref="Sx3.SSx1.p1.3.m3.3.3.3.3.3">𝑂</ci></apply></set></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.3.m3.3c">\{\boldsymbol{x}^{I},\boldsymbol{x}^{T},\boldsymbol{x}^{O}\}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.3.m3.3d">{ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT }</annotation></semantics></math> as input, where <math alttext="\boldsymbol{x}^{I}\in\mathcal{I}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.4.m4.1"><semantics id="Sx3.SSx1.p1.4.m4.1a"><mrow id="Sx3.SSx1.p1.4.m4.1.1" xref="Sx3.SSx1.p1.4.m4.1.1.cmml"><msup id="Sx3.SSx1.p1.4.m4.1.1.2" xref="Sx3.SSx1.p1.4.m4.1.1.2.cmml"><mi id="Sx3.SSx1.p1.4.m4.1.1.2.2" 
xref="Sx3.SSx1.p1.4.m4.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx1.p1.4.m4.1.1.2.3" xref="Sx3.SSx1.p1.4.m4.1.1.2.3.cmml">I</mi></msup><mo id="Sx3.SSx1.p1.4.m4.1.1.1" xref="Sx3.SSx1.p1.4.m4.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx1.p1.4.m4.1.1.3" xref="Sx3.SSx1.p1.4.m4.1.1.3.cmml">ℐ</mi></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.4.m4.1b"><apply id="Sx3.SSx1.p1.4.m4.1.1.cmml" xref="Sx3.SSx1.p1.4.m4.1.1"><in id="Sx3.SSx1.p1.4.m4.1.1.1.cmml" xref="Sx3.SSx1.p1.4.m4.1.1.1"></in><apply id="Sx3.SSx1.p1.4.m4.1.1.2.cmml" xref="Sx3.SSx1.p1.4.m4.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.4.m4.1.1.2.1.cmml" xref="Sx3.SSx1.p1.4.m4.1.1.2">superscript</csymbol><ci id="Sx3.SSx1.p1.4.m4.1.1.2.2.cmml" xref="Sx3.SSx1.p1.4.m4.1.1.2.2">𝒙</ci><ci id="Sx3.SSx1.p1.4.m4.1.1.2.3.cmml" xref="Sx3.SSx1.p1.4.m4.1.1.2.3">𝐼</ci></apply><ci id="Sx3.SSx1.p1.4.m4.1.1.3.cmml" xref="Sx3.SSx1.p1.4.m4.1.1.3">ℐ</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.4.m4.1c">\boldsymbol{x}^{I}\in\mathcal{I}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.4.m4.1d">bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT ∈ caligraphic_I</annotation></semantics></math> is an image, <math alttext="\boldsymbol{x}^{T}\in\mathcal{T}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.5.m5.1"><semantics id="Sx3.SSx1.p1.5.m5.1a"><mrow id="Sx3.SSx1.p1.5.m5.1.1" xref="Sx3.SSx1.p1.5.m5.1.1.cmml"><msup id="Sx3.SSx1.p1.5.m5.1.1.2" xref="Sx3.SSx1.p1.5.m5.1.1.2.cmml"><mi id="Sx3.SSx1.p1.5.m5.1.1.2.2" xref="Sx3.SSx1.p1.5.m5.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx1.p1.5.m5.1.1.2.3" xref="Sx3.SSx1.p1.5.m5.1.1.2.3.cmml">T</mi></msup><mo id="Sx3.SSx1.p1.5.m5.1.1.1" xref="Sx3.SSx1.p1.5.m5.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx1.p1.5.m5.1.1.3" xref="Sx3.SSx1.p1.5.m5.1.1.3.cmml">𝒯</mi></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.5.m5.1b"><apply id="Sx3.SSx1.p1.5.m5.1.1.cmml" 
xref="Sx3.SSx1.p1.5.m5.1.1"><in id="Sx3.SSx1.p1.5.m5.1.1.1.cmml" xref="Sx3.SSx1.p1.5.m5.1.1.1"></in><apply id="Sx3.SSx1.p1.5.m5.1.1.2.cmml" xref="Sx3.SSx1.p1.5.m5.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.5.m5.1.1.2.1.cmml" xref="Sx3.SSx1.p1.5.m5.1.1.2">superscript</csymbol><ci id="Sx3.SSx1.p1.5.m5.1.1.2.2.cmml" xref="Sx3.SSx1.p1.5.m5.1.1.2.2">𝒙</ci><ci id="Sx3.SSx1.p1.5.m5.1.1.2.3.cmml" xref="Sx3.SSx1.p1.5.m5.1.1.2.3">𝑇</ci></apply><ci id="Sx3.SSx1.p1.5.m5.1.1.3.cmml" xref="Sx3.SSx1.p1.5.m5.1.1.3">𝒯</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.5.m5.1c">\boldsymbol{x}^{T}\in\mathcal{T}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.5.m5.1d">bold_italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ caligraphic_T</annotation></semantics></math> is a text, and <math alttext="\boldsymbol{x}^{O}\in\mathcal{O}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.6.m6.1"><semantics id="Sx3.SSx1.p1.6.m6.1a"><mrow id="Sx3.SSx1.p1.6.m6.1.1" xref="Sx3.SSx1.p1.6.m6.1.1.cmml"><msup id="Sx3.SSx1.p1.6.m6.1.1.2" xref="Sx3.SSx1.p1.6.m6.1.1.2.cmml"><mi id="Sx3.SSx1.p1.6.m6.1.1.2.2" xref="Sx3.SSx1.p1.6.m6.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx1.p1.6.m6.1.1.2.3" xref="Sx3.SSx1.p1.6.m6.1.1.2.3.cmml">O</mi></msup><mo id="Sx3.SSx1.p1.6.m6.1.1.1" xref="Sx3.SSx1.p1.6.m6.1.1.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx1.p1.6.m6.1.1.3" xref="Sx3.SSx1.p1.6.m6.1.1.3.cmml">𝒪</mi></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.6.m6.1b"><apply id="Sx3.SSx1.p1.6.m6.1.1.cmml" xref="Sx3.SSx1.p1.6.m6.1.1"><in id="Sx3.SSx1.p1.6.m6.1.1.1.cmml" xref="Sx3.SSx1.p1.6.m6.1.1.1"></in><apply id="Sx3.SSx1.p1.6.m6.1.1.2.cmml" xref="Sx3.SSx1.p1.6.m6.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.6.m6.1.1.2.1.cmml" xref="Sx3.SSx1.p1.6.m6.1.1.2">superscript</csymbol><ci id="Sx3.SSx1.p1.6.m6.1.1.2.2.cmml" xref="Sx3.SSx1.p1.6.m6.1.1.2.2">𝒙</ci><ci id="Sx3.SSx1.p1.6.m6.1.1.2.3.cmml" 
xref="Sx3.SSx1.p1.6.m6.1.1.2.3">𝑂</ci></apply><ci id="Sx3.SSx1.p1.6.m6.1.1.3.cmml" xref="Sx3.SSx1.p1.6.m6.1.1.3">𝒪</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.6.m6.1c">\boldsymbol{x}^{O}\in\mathcal{O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.6.m6.1d">bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT ∈ caligraphic_O</annotation></semantics></math> is an indicative text denoting the target object, such as “person”. The model then extracts conditional image and text features <math alttext="(\boldsymbol{y}^{I|O},\boldsymbol{y}^{T})" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.7.m7.2"><semantics id="Sx3.SSx1.p1.7.m7.2a"><mrow id="Sx3.SSx1.p1.7.m7.2.2.2" xref="Sx3.SSx1.p1.7.m7.2.2.3.cmml"><mo id="Sx3.SSx1.p1.7.m7.2.2.2.3" stretchy="false" xref="Sx3.SSx1.p1.7.m7.2.2.3.cmml">(</mo><msup id="Sx3.SSx1.p1.7.m7.1.1.1.1" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.cmml"><mi id="Sx3.SSx1.p1.7.m7.1.1.1.1.2" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.2.cmml">𝒚</mi><mrow id="Sx3.SSx1.p1.7.m7.1.1.1.1.3" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.cmml"><mi id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.2" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.1" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.1.cmml">|</mo><mi id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.3" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.3.cmml">O</mi></mrow></msup><mo id="Sx3.SSx1.p1.7.m7.2.2.2.4" xref="Sx3.SSx1.p1.7.m7.2.2.3.cmml">,</mo><msup id="Sx3.SSx1.p1.7.m7.2.2.2.2" xref="Sx3.SSx1.p1.7.m7.2.2.2.2.cmml"><mi id="Sx3.SSx1.p1.7.m7.2.2.2.2.2" xref="Sx3.SSx1.p1.7.m7.2.2.2.2.2.cmml">𝒚</mi><mi id="Sx3.SSx1.p1.7.m7.2.2.2.2.3" xref="Sx3.SSx1.p1.7.m7.2.2.2.2.3.cmml">T</mi></msup><mo id="Sx3.SSx1.p1.7.m7.2.2.2.5" stretchy="false" xref="Sx3.SSx1.p1.7.m7.2.2.3.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.7.m7.2b"><interval closure="open" id="Sx3.SSx1.p1.7.m7.2.2.3.cmml" xref="Sx3.SSx1.p1.7.m7.2.2.2"><apply 
id="Sx3.SSx1.p1.7.m7.1.1.1.1.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.7.m7.1.1.1.1.1.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx1.p1.7.m7.1.1.1.1.2.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.2">𝒚</ci><apply id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3"><csymbol cd="latexml" id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.1.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.1">conditional</csymbol><ci id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.2.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.2">𝐼</ci><ci id="Sx3.SSx1.p1.7.m7.1.1.1.1.3.3.cmml" xref="Sx3.SSx1.p1.7.m7.1.1.1.1.3.3">𝑂</ci></apply></apply><apply id="Sx3.SSx1.p1.7.m7.2.2.2.2.cmml" xref="Sx3.SSx1.p1.7.m7.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.7.m7.2.2.2.2.1.cmml" xref="Sx3.SSx1.p1.7.m7.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx1.p1.7.m7.2.2.2.2.2.cmml" xref="Sx3.SSx1.p1.7.m7.2.2.2.2.2">𝒚</ci><ci id="Sx3.SSx1.p1.7.m7.2.2.2.2.3.cmml" xref="Sx3.SSx1.p1.7.m7.2.2.2.2.3">𝑇</ci></apply></interval></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.7.m7.2c">(\boldsymbol{y}^{I|O},\boldsymbol{y}^{T})</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.7.m7.2d">( bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT , bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT )</annotation></semantics></math> using paired encoders. 
When extracting conditional image features, the instruction feature <math alttext="\boldsymbol{y}^{O}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.8.m8.1"><semantics id="Sx3.SSx1.p1.8.m8.1a"><msup id="Sx3.SSx1.p1.8.m8.1.1" xref="Sx3.SSx1.p1.8.m8.1.1.cmml"><mi id="Sx3.SSx1.p1.8.m8.1.1.2" xref="Sx3.SSx1.p1.8.m8.1.1.2.cmml">𝒚</mi><mi id="Sx3.SSx1.p1.8.m8.1.1.3" xref="Sx3.SSx1.p1.8.m8.1.1.3.cmml">O</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.8.m8.1b"><apply id="Sx3.SSx1.p1.8.m8.1.1.cmml" xref="Sx3.SSx1.p1.8.m8.1.1"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.8.m8.1.1.1.cmml" xref="Sx3.SSx1.p1.8.m8.1.1">superscript</csymbol><ci id="Sx3.SSx1.p1.8.m8.1.1.2.cmml" xref="Sx3.SSx1.p1.8.m8.1.1.2">𝒚</ci><ci id="Sx3.SSx1.p1.8.m8.1.1.3.cmml" xref="Sx3.SSx1.p1.8.m8.1.1.3">𝑂</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.8.m8.1c">\boldsymbol{y}^{O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.8.m8.1d">bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> is fused with the visual data stream within the AG-Adapter module. 
Formally,</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\boldsymbol{y}^{T}=\Phi_{T}(\boldsymbol{x}^{T}),\ \boldsymbol{y}^{O}=\Phi_{T}(% \boldsymbol{x}^{O}),\ \boldsymbol{y}^{I|O}=\Phi_{I}(\boldsymbol{x}^{I},% \boldsymbol{y}^{O})," class="ltx_Math" display="block" id="Sx3.E1.m1.1"><semantics id="Sx3.E1.m1.1a"><mrow id="Sx3.E1.m1.1.1.1"><mrow id="Sx3.E1.m1.1.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.3.cmml"><mrow id="Sx3.E1.m1.1.1.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.1.1.cmml"><msup id="Sx3.E1.m1.1.1.1.1.1.1.3" xref="Sx3.E1.m1.1.1.1.1.1.1.3.cmml"><mi id="Sx3.E1.m1.1.1.1.1.1.1.3.2" xref="Sx3.E1.m1.1.1.1.1.1.1.3.2.cmml">𝒚</mi><mi id="Sx3.E1.m1.1.1.1.1.1.1.3.3" xref="Sx3.E1.m1.1.1.1.1.1.1.3.3.cmml">T</mi></msup><mo id="Sx3.E1.m1.1.1.1.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.1.1.2.cmml">=</mo><mrow id="Sx3.E1.m1.1.1.1.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.1.1.1.cmml"><msub id="Sx3.E1.m1.1.1.1.1.1.1.1.3" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3.cmml"><mi id="Sx3.E1.m1.1.1.1.1.1.1.1.3.2" mathvariant="normal" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3.2.cmml">Φ</mi><mi id="Sx3.E1.m1.1.1.1.1.1.1.1.3.3" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3.3.cmml">T</mi></msub><mo id="Sx3.E1.m1.1.1.1.1.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><msup id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.2.cmml">𝒙</mi><mi id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.3" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">T</mi></msup><mo id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="Sx3.E1.m1.1.1.1.1.2.3" 
rspace="0.667em" xref="Sx3.E1.m1.1.1.1.1.3a.cmml">,</mo><mrow id="Sx3.E1.m1.1.1.1.1.2.2.2" xref="Sx3.E1.m1.1.1.1.1.2.2.3.cmml"><mrow id="Sx3.E1.m1.1.1.1.1.2.2.1.1" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.cmml"><msup id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.2" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.2.cmml">𝒚</mi><mi id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.3" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.3.cmml">O</mi></msup><mo id="Sx3.E1.m1.1.1.1.1.2.2.1.1.2" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.2.cmml">=</mo><mrow id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.cmml"><msub id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.2" mathvariant="normal" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.2.cmml">Φ</mi><mi id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.3" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.3.cmml">T</mi></msub><mo id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.cmml"><mo id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.2" stretchy="false" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.cmml">(</mo><msup id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.2.cmml">𝒙</mi><mi id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.3" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.3.cmml">O</mi></msup><mo id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.3" stretchy="false" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="Sx3.E1.m1.1.1.1.1.2.2.2.3" rspace="0.667em" xref="Sx3.E1.m1.1.1.1.1.2.2.3a.cmml">,</mo><mrow id="Sx3.E1.m1.1.1.1.1.2.2.2.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.cmml"><msup id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.2.cmml">𝒚</mi><mrow 
id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.2.cmml">I</mi><mo fence="false" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.1" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.1.cmml">|</mo><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.3.cmml">O</mi></mrow></msup><mo id="Sx3.E1.m1.1.1.1.1.2.2.2.2.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.3.cmml">=</mo><mrow id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.cmml"><msub id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.2" mathvariant="normal" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.2.cmml">Φ</mi><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.3.cmml">I</mi></msub><mo id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.3.cmml">⁢</mo><mrow id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.3.cmml"><mo id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.3" stretchy="false" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.3.cmml">(</mo><msup id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.2.cmml">𝒙</mi><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.3.cmml">I</mi></msup><mo id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.4" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.3.cmml">,</mo><msup id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.cmml"><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.2" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.2.cmml">𝒚</mi><mi id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.3" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.3.cmml">O</mi></msup><mo id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.5" stretchy="false" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.3.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><mo 
id="Sx3.E1.m1.1.1.1.2">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E1.m1.1b"><apply id="Sx3.E1.m1.1.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.3a.cmml" xref="Sx3.E1.m1.1.1.1.1.2.3">formulae-sequence</csymbol><apply id="Sx3.E1.m1.1.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1"><eq id="Sx3.E1.m1.1.1.1.1.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.2"></eq><apply id="Sx3.E1.m1.1.1.1.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.1.1.3.1.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.3">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.1.1.3.2.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.3.2">𝒚</ci><ci id="Sx3.E1.m1.1.1.1.1.1.1.3.3.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.3.3">𝑇</ci></apply><apply id="Sx3.E1.m1.1.1.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1"><times id="Sx3.E1.m1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.2"></times><apply id="Sx3.E1.m1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.1.1.1.3.1.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.1.1.1.3.2.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3.2">Φ</ci><ci id="Sx3.E1.m1.1.1.1.1.1.1.1.3.3.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.3.3">𝑇</ci></apply><apply id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.2">𝒙</ci><ci id="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.1.1.1.1.1.1.3">𝑇</ci></apply></apply></apply><apply id="Sx3.E1.m1.1.1.1.1.2.2.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.3a.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.3">formulae-sequence</csymbol><apply id="Sx3.E1.m1.1.1.1.1.2.2.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1"><eq 
id="Sx3.E1.m1.1.1.1.1.2.2.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.2"></eq><apply id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.2">𝒚</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.3.3">𝑂</ci></apply><apply id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1"><times id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.2"></times><apply id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3">subscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.2">Φ</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.3.3">𝑇</ci></apply><apply id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.2">𝒙</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.1.1.1.1.1.1.3">𝑂</ci></apply></apply></apply><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2"><eq id="Sx3.E1.m1.1.1.1.1.2.2.2.2.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.3"></eq><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.2">𝒚</ci><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.cmml" 
xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3"><csymbol cd="latexml" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.1">conditional</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.2">𝐼</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.4.3.3">𝑂</ci></apply></apply><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2"><times id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.3"></times><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4">subscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.2">Φ</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.4.3">𝐼</ci></apply><interval closure="open" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2"><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.2">𝒙</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.1.1.1.1.3">𝐼</ci></apply><apply id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.1.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2">superscript</csymbol><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.2.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.2">𝒚</ci><ci id="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.3.cmml" xref="Sx3.E1.m1.1.1.1.1.2.2.2.2.2.2.2.2.3">𝑂</ci></apply></interval></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="Sx3.E1.m1.1c">\boldsymbol{y}^{T}=\Phi_{T}(\boldsymbol{x}^{T}),\ \boldsymbol{y}^{O}=\Phi_{T}(% \boldsymbol{x}^{O}),\ \boldsymbol{y}^{I|O}=\Phi_{I}(\boldsymbol{x}^{I},% \boldsymbol{y}^{O}),</annotation><annotation encoding="application/x-llamapun" id="Sx3.E1.m1.1d">bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = roman_Φ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) , bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT = roman_Φ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT ) , bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT = roman_Φ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx1.p1.11">where <math alttext="\boldsymbol{y}^{T}\in\mathbb{R}^{d}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.9.m1.1"><semantics id="Sx3.SSx1.p1.9.m1.1a"><mrow id="Sx3.SSx1.p1.9.m1.1.1" xref="Sx3.SSx1.p1.9.m1.1.1.cmml"><msup id="Sx3.SSx1.p1.9.m1.1.1.2" xref="Sx3.SSx1.p1.9.m1.1.1.2.cmml"><mi id="Sx3.SSx1.p1.9.m1.1.1.2.2" xref="Sx3.SSx1.p1.9.m1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx1.p1.9.m1.1.1.2.3" xref="Sx3.SSx1.p1.9.m1.1.1.2.3.cmml">T</mi></msup><mo id="Sx3.SSx1.p1.9.m1.1.1.1" xref="Sx3.SSx1.p1.9.m1.1.1.1.cmml">∈</mo><msup id="Sx3.SSx1.p1.9.m1.1.1.3" xref="Sx3.SSx1.p1.9.m1.1.1.3.cmml"><mi id="Sx3.SSx1.p1.9.m1.1.1.3.2" xref="Sx3.SSx1.p1.9.m1.1.1.3.2.cmml">ℝ</mi><mi id="Sx3.SSx1.p1.9.m1.1.1.3.3" 
xref="Sx3.SSx1.p1.9.m1.1.1.3.3.cmml">d</mi></msup></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.9.m1.1b"><apply id="Sx3.SSx1.p1.9.m1.1.1.cmml" xref="Sx3.SSx1.p1.9.m1.1.1"><in id="Sx3.SSx1.p1.9.m1.1.1.1.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.1"></in><apply id="Sx3.SSx1.p1.9.m1.1.1.2.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.9.m1.1.1.2.1.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.2">superscript</csymbol><ci id="Sx3.SSx1.p1.9.m1.1.1.2.2.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.2.2">𝒚</ci><ci id="Sx3.SSx1.p1.9.m1.1.1.2.3.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.2.3">𝑇</ci></apply><apply id="Sx3.SSx1.p1.9.m1.1.1.3.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.3"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.9.m1.1.1.3.1.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.3">superscript</csymbol><ci id="Sx3.SSx1.p1.9.m1.1.1.3.2.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.3.2">ℝ</ci><ci id="Sx3.SSx1.p1.9.m1.1.1.3.3.cmml" xref="Sx3.SSx1.p1.9.m1.1.1.3.3">𝑑</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.9.m1.1c">\boldsymbol{y}^{T}\in\mathbb{R}^{d}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.9.m1.1d">bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math> is the text feature, <math alttext="\boldsymbol{y}^{O}\in\mathbb{R}^{d}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.10.m2.1"><semantics id="Sx3.SSx1.p1.10.m2.1a"><mrow id="Sx3.SSx1.p1.10.m2.1.1" xref="Sx3.SSx1.p1.10.m2.1.1.cmml"><msup id="Sx3.SSx1.p1.10.m2.1.1.2" xref="Sx3.SSx1.p1.10.m2.1.1.2.cmml"><mi id="Sx3.SSx1.p1.10.m2.1.1.2.2" xref="Sx3.SSx1.p1.10.m2.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx1.p1.10.m2.1.1.2.3" xref="Sx3.SSx1.p1.10.m2.1.1.2.3.cmml">O</mi></msup><mo id="Sx3.SSx1.p1.10.m2.1.1.1" xref="Sx3.SSx1.p1.10.m2.1.1.1.cmml">∈</mo><msup id="Sx3.SSx1.p1.10.m2.1.1.3" xref="Sx3.SSx1.p1.10.m2.1.1.3.cmml"><mi id="Sx3.SSx1.p1.10.m2.1.1.3.2" 
xref="Sx3.SSx1.p1.10.m2.1.1.3.2.cmml">ℝ</mi><mi id="Sx3.SSx1.p1.10.m2.1.1.3.3" xref="Sx3.SSx1.p1.10.m2.1.1.3.3.cmml">d</mi></msup></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.10.m2.1b"><apply id="Sx3.SSx1.p1.10.m2.1.1.cmml" xref="Sx3.SSx1.p1.10.m2.1.1"><in id="Sx3.SSx1.p1.10.m2.1.1.1.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.1"></in><apply id="Sx3.SSx1.p1.10.m2.1.1.2.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.10.m2.1.1.2.1.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.2">superscript</csymbol><ci id="Sx3.SSx1.p1.10.m2.1.1.2.2.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.2.2">𝒚</ci><ci id="Sx3.SSx1.p1.10.m2.1.1.2.3.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.2.3">𝑂</ci></apply><apply id="Sx3.SSx1.p1.10.m2.1.1.3.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.3"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.10.m2.1.1.3.1.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.3">superscript</csymbol><ci id="Sx3.SSx1.p1.10.m2.1.1.3.2.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.3.2">ℝ</ci><ci id="Sx3.SSx1.p1.10.m2.1.1.3.3.cmml" xref="Sx3.SSx1.p1.10.m2.1.1.3.3">𝑑</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.10.m2.1c">\boldsymbol{y}^{O}\in\mathbb{R}^{d}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.10.m2.1d">bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math> is the instruction feature, and <math alttext="\boldsymbol{y}^{I|O}" class="ltx_Math" display="inline" id="Sx3.SSx1.p1.11.m3.1"><semantics id="Sx3.SSx1.p1.11.m3.1a"><msup id="Sx3.SSx1.p1.11.m3.1.1" xref="Sx3.SSx1.p1.11.m3.1.1.cmml"><mi id="Sx3.SSx1.p1.11.m3.1.1.2" xref="Sx3.SSx1.p1.11.m3.1.1.2.cmml">𝒚</mi><mrow id="Sx3.SSx1.p1.11.m3.1.1.3" xref="Sx3.SSx1.p1.11.m3.1.1.3.cmml"><mi id="Sx3.SSx1.p1.11.m3.1.1.3.2" xref="Sx3.SSx1.p1.11.m3.1.1.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx1.p1.11.m3.1.1.3.1" xref="Sx3.SSx1.p1.11.m3.1.1.3.1.cmml">|</mo><mi 
id="Sx3.SSx1.p1.11.m3.1.1.3.3" xref="Sx3.SSx1.p1.11.m3.1.1.3.3.cmml">O</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx1.p1.11.m3.1b"><apply id="Sx3.SSx1.p1.11.m3.1.1.cmml" xref="Sx3.SSx1.p1.11.m3.1.1"><csymbol cd="ambiguous" id="Sx3.SSx1.p1.11.m3.1.1.1.cmml" xref="Sx3.SSx1.p1.11.m3.1.1">superscript</csymbol><ci id="Sx3.SSx1.p1.11.m3.1.1.2.cmml" xref="Sx3.SSx1.p1.11.m3.1.1.2">𝒚</ci><apply id="Sx3.SSx1.p1.11.m3.1.1.3.cmml" xref="Sx3.SSx1.p1.11.m3.1.1.3"><csymbol cd="latexml" id="Sx3.SSx1.p1.11.m3.1.1.3.1.cmml" xref="Sx3.SSx1.p1.11.m3.1.1.3.1">conditional</csymbol><ci id="Sx3.SSx1.p1.11.m3.1.1.3.2.cmml" xref="Sx3.SSx1.p1.11.m3.1.1.3.2">𝐼</ci><ci id="Sx3.SSx1.p1.11.m3.1.1.3.3.cmml" xref="Sx3.SSx1.p1.11.m3.1.1.3.3">𝑂</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx1.p1.11.m3.1c">\boldsymbol{y}^{I|O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx1.p1.11.m3.1d">bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> is the conditional image feature, i.e., the output of the visual encoder integrated with the AG-Adapter. The AG-Adapter is trained using our Object-focused Visual Semantic Learning component, which comprises three loss terms: OITC, OIIC, and OID. During training, the loss is computed from the output embeddings of both encoders. During inference, the text encoder is retained to encode the textual instructions.</p> </div> <figure class="ltx_figure" id="Sx3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="271" id="Sx3.F3.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span><span class="ltx_text ltx_font_bold" id="Sx3.F3.5.1">Learning objectives illustration</span>. The image and text encoders jointly compute three losses. 
(a) For <span class="ltx_text ltx_font_bold" id="Sx3.F3.6.2">Object-focused Image-Text Contrast</span>, the paired text and image should not only correspond to each other but also correspond to the same semantic object. (b) <span class="ltx_text ltx_font_bold" id="Sx3.F3.7.3">Object-focused Image-Image Contrast</span> requires the model to predict pairs of image features that contain the same semantic object. (c) <span class="ltx_text ltx_font_bold" id="Sx3.F3.8.4">Object-focused Image Discrimination</span> determines whether a specific object is present in the image. The text instructions, such as “bike”, are pre-integrated into the prompt template designated as “a photo of {object}”.</figcaption> </figure> </section> <section class="ltx_subsection" id="Sx3.SSx2"> <h3 class="ltx_title ltx_title_subsection">Attention-guided Adapter</h3> <div class="ltx_para" id="Sx3.SSx2.p1"> <p class="ltx_p" id="Sx3.SSx2.p1.1">The proposed AG-Adapter module, highlighted by the green rectangle in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx2.F2" title="Figure 2 ‣ Pre-training Visual Encoders ‣ Related Work ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">2</span></a>, is a simple yet effective plug-in that interweaves semantic directives with visual cues, enabling the visual model to perceive queried objects. The AG-Adapter is inspired by the Latent Diffusion Model (LDM) <cite class="ltx_cite ltx_citemacro_citep">(Rombach et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib38" title="">2022</a>)</cite>, which enhances the alignment of visual representations with abstract semantics by conditioning on text representations. 
In particular, the AG-Adapter is incorporated into the pre-training feature extraction layers of the <math alttext="\Phi_{I}(\cdot,\cdot)" class="ltx_Math" display="inline" id="Sx3.SSx2.p1.1.m1.2"><semantics id="Sx3.SSx2.p1.1.m1.2a"><mrow id="Sx3.SSx2.p1.1.m1.2.3" xref="Sx3.SSx2.p1.1.m1.2.3.cmml"><msub id="Sx3.SSx2.p1.1.m1.2.3.2" xref="Sx3.SSx2.p1.1.m1.2.3.2.cmml"><mi id="Sx3.SSx2.p1.1.m1.2.3.2.2" mathvariant="normal" xref="Sx3.SSx2.p1.1.m1.2.3.2.2.cmml">Φ</mi><mi id="Sx3.SSx2.p1.1.m1.2.3.2.3" xref="Sx3.SSx2.p1.1.m1.2.3.2.3.cmml">I</mi></msub><mo id="Sx3.SSx2.p1.1.m1.2.3.1" xref="Sx3.SSx2.p1.1.m1.2.3.1.cmml">⁢</mo><mrow id="Sx3.SSx2.p1.1.m1.2.3.3.2" xref="Sx3.SSx2.p1.1.m1.2.3.3.1.cmml"><mo id="Sx3.SSx2.p1.1.m1.2.3.3.2.1" stretchy="false" xref="Sx3.SSx2.p1.1.m1.2.3.3.1.cmml">(</mo><mo id="Sx3.SSx2.p1.1.m1.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx2.p1.1.m1.1.1.cmml">⋅</mo><mo id="Sx3.SSx2.p1.1.m1.2.3.3.2.2" rspace="0em" xref="Sx3.SSx2.p1.1.m1.2.3.3.1.cmml">,</mo><mo id="Sx3.SSx2.p1.1.m1.2.2" lspace="0em" rspace="0em" xref="Sx3.SSx2.p1.1.m1.2.2.cmml">⋅</mo><mo id="Sx3.SSx2.p1.1.m1.2.3.3.2.3" stretchy="false" xref="Sx3.SSx2.p1.1.m1.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p1.1.m1.2b"><apply id="Sx3.SSx2.p1.1.m1.2.3.cmml" xref="Sx3.SSx2.p1.1.m1.2.3"><times id="Sx3.SSx2.p1.1.m1.2.3.1.cmml" xref="Sx3.SSx2.p1.1.m1.2.3.1"></times><apply id="Sx3.SSx2.p1.1.m1.2.3.2.cmml" xref="Sx3.SSx2.p1.1.m1.2.3.2"><csymbol cd="ambiguous" id="Sx3.SSx2.p1.1.m1.2.3.2.1.cmml" xref="Sx3.SSx2.p1.1.m1.2.3.2">subscript</csymbol><ci id="Sx3.SSx2.p1.1.m1.2.3.2.2.cmml" xref="Sx3.SSx2.p1.1.m1.2.3.2.2">Φ</ci><ci id="Sx3.SSx2.p1.1.m1.2.3.2.3.cmml" xref="Sx3.SSx2.p1.1.m1.2.3.2.3">𝐼</ci></apply><interval closure="open" id="Sx3.SSx2.p1.1.m1.2.3.3.1.cmml" xref="Sx3.SSx2.p1.1.m1.2.3.3.2"><ci id="Sx3.SSx2.p1.1.m1.1.1.cmml" xref="Sx3.SSx2.p1.1.m1.1.1">⋅</ci><ci id="Sx3.SSx2.p1.1.m1.2.2.cmml" 
xref="Sx3.SSx2.p1.1.m1.2.2">⋅</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p1.1.m1.2c">\Phi_{I}(\cdot,\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p1.1.m1.2d">roman_Φ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( ⋅ , ⋅ )</annotation></semantics></math>, with only the inserted layer undergoing the training process.</p> </div> <div class="ltx_para" id="Sx3.SSx2.p2"> <p class="ltx_p" id="Sx3.SSx2.p2.1">Formally, in each layer of the visual encoder, the AG-Adapter module <math alttext="\varphi(\cdot)" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.1.m1.1"><semantics id="Sx3.SSx2.p2.1.m1.1a"><mrow id="Sx3.SSx2.p2.1.m1.1.2" xref="Sx3.SSx2.p2.1.m1.1.2.cmml"><mi id="Sx3.SSx2.p2.1.m1.1.2.2" xref="Sx3.SSx2.p2.1.m1.1.2.2.cmml">φ</mi><mo id="Sx3.SSx2.p2.1.m1.1.2.1" xref="Sx3.SSx2.p2.1.m1.1.2.1.cmml">⁢</mo><mrow id="Sx3.SSx2.p2.1.m1.1.2.3.2" xref="Sx3.SSx2.p2.1.m1.1.2.cmml"><mo id="Sx3.SSx2.p2.1.m1.1.2.3.2.1" stretchy="false" xref="Sx3.SSx2.p2.1.m1.1.2.cmml">(</mo><mo id="Sx3.SSx2.p2.1.m1.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx2.p2.1.m1.1.1.cmml">⋅</mo><mo id="Sx3.SSx2.p2.1.m1.1.2.3.2.2" stretchy="false" xref="Sx3.SSx2.p2.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.1.m1.1b"><apply id="Sx3.SSx2.p2.1.m1.1.2.cmml" xref="Sx3.SSx2.p2.1.m1.1.2"><times id="Sx3.SSx2.p2.1.m1.1.2.1.cmml" xref="Sx3.SSx2.p2.1.m1.1.2.1"></times><ci id="Sx3.SSx2.p2.1.m1.1.2.2.cmml" xref="Sx3.SSx2.p2.1.m1.1.2.2">𝜑</ci><ci id="Sx3.SSx2.p2.1.m1.1.1.cmml" xref="Sx3.SSx2.p2.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.1.m1.1c">\varphi(\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.1.m1.1d">italic_φ ( ⋅ )</annotation></semantics></math> enhances fine-grained object features:</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E2"> <tbody><tr class="ltx_equation ltx_eqn_row 
ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\boldsymbol{f}^{V|O}=\varphi(\boldsymbol{y}^{O},\boldsymbol{f}^{V}),\vspace{-1mm}" class="ltx_Math" display="block" id="Sx3.E2.m1.1"><semantics id="Sx3.E2.m1.1a"><mrow id="Sx3.E2.m1.1.1.1" xref="Sx3.E2.m1.1.1.1.1.cmml"><mrow id="Sx3.E2.m1.1.1.1.1" xref="Sx3.E2.m1.1.1.1.1.cmml"><msup id="Sx3.E2.m1.1.1.1.1.4" xref="Sx3.E2.m1.1.1.1.1.4.cmml"><mi id="Sx3.E2.m1.1.1.1.1.4.2" xref="Sx3.E2.m1.1.1.1.1.4.2.cmml">𝒇</mi><mrow id="Sx3.E2.m1.1.1.1.1.4.3" xref="Sx3.E2.m1.1.1.1.1.4.3.cmml"><mi id="Sx3.E2.m1.1.1.1.1.4.3.2" xref="Sx3.E2.m1.1.1.1.1.4.3.2.cmml">V</mi><mo fence="false" id="Sx3.E2.m1.1.1.1.1.4.3.1" xref="Sx3.E2.m1.1.1.1.1.4.3.1.cmml">|</mo><mi id="Sx3.E2.m1.1.1.1.1.4.3.3" xref="Sx3.E2.m1.1.1.1.1.4.3.3.cmml">O</mi></mrow></msup><mo id="Sx3.E2.m1.1.1.1.1.3" xref="Sx3.E2.m1.1.1.1.1.3.cmml">=</mo><mrow id="Sx3.E2.m1.1.1.1.1.2" xref="Sx3.E2.m1.1.1.1.1.2.cmml"><mi id="Sx3.E2.m1.1.1.1.1.2.4" xref="Sx3.E2.m1.1.1.1.1.2.4.cmml">φ</mi><mo id="Sx3.E2.m1.1.1.1.1.2.3" xref="Sx3.E2.m1.1.1.1.1.2.3.cmml">⁢</mo><mrow id="Sx3.E2.m1.1.1.1.1.2.2.2" xref="Sx3.E2.m1.1.1.1.1.2.2.3.cmml"><mo id="Sx3.E2.m1.1.1.1.1.2.2.2.3" stretchy="false" xref="Sx3.E2.m1.1.1.1.1.2.2.3.cmml">(</mo><msup id="Sx3.E2.m1.1.1.1.1.1.1.1.1" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E2.m1.1.1.1.1.1.1.1.1.2" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1.2.cmml">𝒚</mi><mi id="Sx3.E2.m1.1.1.1.1.1.1.1.1.3" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1.3.cmml">O</mi></msup><mo id="Sx3.E2.m1.1.1.1.1.2.2.2.4" xref="Sx3.E2.m1.1.1.1.1.2.2.3.cmml">,</mo><msup id="Sx3.E2.m1.1.1.1.1.2.2.2.2" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2.cmml"><mi id="Sx3.E2.m1.1.1.1.1.2.2.2.2.2" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2.2.cmml">𝒇</mi><mi id="Sx3.E2.m1.1.1.1.1.2.2.2.2.3" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2.3.cmml">V</mi></msup><mo id="Sx3.E2.m1.1.1.1.1.2.2.2.5" stretchy="false" xref="Sx3.E2.m1.1.1.1.1.2.2.3.cmml">)</mo></mrow></mrow></mrow><mo 
id="Sx3.E2.m1.1.1.1.2" xref="Sx3.E2.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E2.m1.1b"><apply id="Sx3.E2.m1.1.1.1.1.cmml" xref="Sx3.E2.m1.1.1.1"><eq id="Sx3.E2.m1.1.1.1.1.3.cmml" xref="Sx3.E2.m1.1.1.1.1.3"></eq><apply id="Sx3.E2.m1.1.1.1.1.4.cmml" xref="Sx3.E2.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="Sx3.E2.m1.1.1.1.1.4.1.cmml" xref="Sx3.E2.m1.1.1.1.1.4">superscript</csymbol><ci id="Sx3.E2.m1.1.1.1.1.4.2.cmml" xref="Sx3.E2.m1.1.1.1.1.4.2">𝒇</ci><apply id="Sx3.E2.m1.1.1.1.1.4.3.cmml" xref="Sx3.E2.m1.1.1.1.1.4.3"><csymbol cd="latexml" id="Sx3.E2.m1.1.1.1.1.4.3.1.cmml" xref="Sx3.E2.m1.1.1.1.1.4.3.1">conditional</csymbol><ci id="Sx3.E2.m1.1.1.1.1.4.3.2.cmml" xref="Sx3.E2.m1.1.1.1.1.4.3.2">𝑉</ci><ci id="Sx3.E2.m1.1.1.1.1.4.3.3.cmml" xref="Sx3.E2.m1.1.1.1.1.4.3.3">𝑂</ci></apply></apply><apply id="Sx3.E2.m1.1.1.1.1.2.cmml" xref="Sx3.E2.m1.1.1.1.1.2"><times id="Sx3.E2.m1.1.1.1.1.2.3.cmml" xref="Sx3.E2.m1.1.1.1.1.2.3"></times><ci id="Sx3.E2.m1.1.1.1.1.2.4.cmml" xref="Sx3.E2.m1.1.1.1.1.2.4">𝜑</ci><interval closure="open" id="Sx3.E2.m1.1.1.1.1.2.2.3.cmml" xref="Sx3.E2.m1.1.1.1.1.2.2.2"><apply id="Sx3.E2.m1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E2.m1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E2.m1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1.2">𝒚</ci><ci id="Sx3.E2.m1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E2.m1.1.1.1.1.1.1.1.1.3">𝑂</ci></apply><apply id="Sx3.E2.m1.1.1.1.1.2.2.2.2.cmml" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.E2.m1.1.1.1.1.2.2.2.2.1.cmml" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2">superscript</csymbol><ci id="Sx3.E2.m1.1.1.1.1.2.2.2.2.2.cmml" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2.2">𝒇</ci><ci id="Sx3.E2.m1.1.1.1.1.2.2.2.2.3.cmml" xref="Sx3.E2.m1.1.1.1.1.2.2.2.2.3">𝑉</ci></apply></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="Sx3.E2.m1.1c">\boldsymbol{f}^{V|O}=\varphi(\boldsymbol{y}^{O},\boldsymbol{f}^{V}),\vspace{-1mm}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E2.m1.1d">bold_italic_f start_POSTSUPERSCRIPT italic_V | italic_O end_POSTSUPERSCRIPT = italic_φ ( bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT , bold_italic_f start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx2.p2.7">where <math alttext="\boldsymbol{y}^{O}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.2.m1.1"><semantics id="Sx3.SSx2.p2.2.m1.1a"><msup id="Sx3.SSx2.p2.2.m1.1.1" xref="Sx3.SSx2.p2.2.m1.1.1.cmml"><mi id="Sx3.SSx2.p2.2.m1.1.1.2" xref="Sx3.SSx2.p2.2.m1.1.1.2.cmml">𝒚</mi><mi id="Sx3.SSx2.p2.2.m1.1.1.3" xref="Sx3.SSx2.p2.2.m1.1.1.3.cmml">O</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.2.m1.1b"><apply id="Sx3.SSx2.p2.2.m1.1.1.cmml" xref="Sx3.SSx2.p2.2.m1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.2.m1.1.1.1.cmml" xref="Sx3.SSx2.p2.2.m1.1.1">superscript</csymbol><ci id="Sx3.SSx2.p2.2.m1.1.1.2.cmml" xref="Sx3.SSx2.p2.2.m1.1.1.2">𝒚</ci><ci id="Sx3.SSx2.p2.2.m1.1.1.3.cmml" xref="Sx3.SSx2.p2.2.m1.1.1.3">𝑂</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.2.m1.1c">\boldsymbol{y}^{O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.2.m1.1d">bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> is the instruction feature derived from user queries and <math alttext="\boldsymbol{f}^{V}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.3.m2.1"><semantics id="Sx3.SSx2.p2.3.m2.1a"><msup id="Sx3.SSx2.p2.3.m2.1.1" xref="Sx3.SSx2.p2.3.m2.1.1.cmml"><mi 
id="Sx3.SSx2.p2.3.m2.1.1.2" xref="Sx3.SSx2.p2.3.m2.1.1.2.cmml">𝒇</mi><mi id="Sx3.SSx2.p2.3.m2.1.1.3" xref="Sx3.SSx2.p2.3.m2.1.1.3.cmml">V</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.3.m2.1b"><apply id="Sx3.SSx2.p2.3.m2.1.1.cmml" xref="Sx3.SSx2.p2.3.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.3.m2.1.1.1.cmml" xref="Sx3.SSx2.p2.3.m2.1.1">superscript</csymbol><ci id="Sx3.SSx2.p2.3.m2.1.1.2.cmml" xref="Sx3.SSx2.p2.3.m2.1.1.2">𝒇</ci><ci id="Sx3.SSx2.p2.3.m2.1.1.3.cmml" xref="Sx3.SSx2.p2.3.m2.1.1.3">𝑉</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.3.m2.1c">\boldsymbol{f}^{V}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.3.m2.1d">bold_italic_f start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT</annotation></semantics></math> refers to the feature of the visual data flow. The <math alttext="\boldsymbol{f}^{V}\in\mathbb{R}^{(M+K)\times d}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.4.m3.1"><semantics id="Sx3.SSx2.p2.4.m3.1a"><mrow id="Sx3.SSx2.p2.4.m3.1.2" xref="Sx3.SSx2.p2.4.m3.1.2.cmml"><msup id="Sx3.SSx2.p2.4.m3.1.2.2" xref="Sx3.SSx2.p2.4.m3.1.2.2.cmml"><mi id="Sx3.SSx2.p2.4.m3.1.2.2.2" xref="Sx3.SSx2.p2.4.m3.1.2.2.2.cmml">𝒇</mi><mi id="Sx3.SSx2.p2.4.m3.1.2.2.3" xref="Sx3.SSx2.p2.4.m3.1.2.2.3.cmml">V</mi></msup><mo id="Sx3.SSx2.p2.4.m3.1.2.1" xref="Sx3.SSx2.p2.4.m3.1.2.1.cmml">∈</mo><msup id="Sx3.SSx2.p2.4.m3.1.2.3" xref="Sx3.SSx2.p2.4.m3.1.2.3.cmml"><mi id="Sx3.SSx2.p2.4.m3.1.2.3.2" xref="Sx3.SSx2.p2.4.m3.1.2.3.2.cmml">ℝ</mi><mrow id="Sx3.SSx2.p2.4.m3.1.1.1" xref="Sx3.SSx2.p2.4.m3.1.1.1.cmml"><mrow id="Sx3.SSx2.p2.4.m3.1.1.1.1.1" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.cmml"><mo id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.2" stretchy="false" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.cmml">(</mo><mrow id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.2" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.2.cmml">M</mi><mo 
id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.1" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.1.cmml">+</mo><mi id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.3" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.3.cmml">K</mi></mrow><mo id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.3" rspace="0.055em" stretchy="false" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="Sx3.SSx2.p2.4.m3.1.1.1.2" rspace="0.222em" xref="Sx3.SSx2.p2.4.m3.1.1.1.2.cmml">×</mo><mi id="Sx3.SSx2.p2.4.m3.1.1.1.3" xref="Sx3.SSx2.p2.4.m3.1.1.1.3.cmml">d</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.4.m3.1b"><apply id="Sx3.SSx2.p2.4.m3.1.2.cmml" xref="Sx3.SSx2.p2.4.m3.1.2"><in id="Sx3.SSx2.p2.4.m3.1.2.1.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.1"></in><apply id="Sx3.SSx2.p2.4.m3.1.2.2.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.2"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.4.m3.1.2.2.1.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.2">superscript</csymbol><ci id="Sx3.SSx2.p2.4.m3.1.2.2.2.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.2.2">𝒇</ci><ci id="Sx3.SSx2.p2.4.m3.1.2.2.3.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.2.3">𝑉</ci></apply><apply id="Sx3.SSx2.p2.4.m3.1.2.3.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.3"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.4.m3.1.2.3.1.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.3">superscript</csymbol><ci id="Sx3.SSx2.p2.4.m3.1.2.3.2.cmml" xref="Sx3.SSx2.p2.4.m3.1.2.3.2">ℝ</ci><apply id="Sx3.SSx2.p2.4.m3.1.1.1.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1"><times id="Sx3.SSx2.p2.4.m3.1.1.1.2.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1.2"></times><apply id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1"><plus id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.1"></plus><ci id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.2">𝑀</ci><ci id="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1.1.1.1.3">𝐾</ci></apply><ci id="Sx3.SSx2.p2.4.m3.1.1.1.3.cmml" xref="Sx3.SSx2.p2.4.m3.1.1.1.3">𝑑</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="Sx3.SSx2.p2.4.m3.1c">\boldsymbol{f}^{V}\in\mathbb{R}^{(M+K)\times d}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.4.m3.1d">bold_italic_f start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_M + italic_K ) × italic_d end_POSTSUPERSCRIPT</annotation></semantics></math> is composed of two types of tokens: image tokens <math alttext="\boldsymbol{f}^{I}\in\mathbb{R}^{M\times d}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.5.m4.1"><semantics id="Sx3.SSx2.p2.5.m4.1a"><mrow id="Sx3.SSx2.p2.5.m4.1.1" xref="Sx3.SSx2.p2.5.m4.1.1.cmml"><msup id="Sx3.SSx2.p2.5.m4.1.1.2" xref="Sx3.SSx2.p2.5.m4.1.1.2.cmml"><mi id="Sx3.SSx2.p2.5.m4.1.1.2.2" xref="Sx3.SSx2.p2.5.m4.1.1.2.2.cmml">𝒇</mi><mi id="Sx3.SSx2.p2.5.m4.1.1.2.3" xref="Sx3.SSx2.p2.5.m4.1.1.2.3.cmml">I</mi></msup><mo id="Sx3.SSx2.p2.5.m4.1.1.1" xref="Sx3.SSx2.p2.5.m4.1.1.1.cmml">∈</mo><msup id="Sx3.SSx2.p2.5.m4.1.1.3" xref="Sx3.SSx2.p2.5.m4.1.1.3.cmml"><mi id="Sx3.SSx2.p2.5.m4.1.1.3.2" xref="Sx3.SSx2.p2.5.m4.1.1.3.2.cmml">ℝ</mi><mrow id="Sx3.SSx2.p2.5.m4.1.1.3.3" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.cmml"><mi id="Sx3.SSx2.p2.5.m4.1.1.3.3.2" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.2.cmml">M</mi><mo id="Sx3.SSx2.p2.5.m4.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.1.cmml">×</mo><mi id="Sx3.SSx2.p2.5.m4.1.1.3.3.3" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.3.cmml">d</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.5.m4.1b"><apply id="Sx3.SSx2.p2.5.m4.1.1.cmml" xref="Sx3.SSx2.p2.5.m4.1.1"><in id="Sx3.SSx2.p2.5.m4.1.1.1.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.1"></in><apply id="Sx3.SSx2.p2.5.m4.1.1.2.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.5.m4.1.1.2.1.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.2">superscript</csymbol><ci id="Sx3.SSx2.p2.5.m4.1.1.2.2.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.2.2">𝒇</ci><ci id="Sx3.SSx2.p2.5.m4.1.1.2.3.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.2.3">𝐼</ci></apply><apply 
id="Sx3.SSx2.p2.5.m4.1.1.3.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.5.m4.1.1.3.1.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3">superscript</csymbol><ci id="Sx3.SSx2.p2.5.m4.1.1.3.2.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3.2">ℝ</ci><apply id="Sx3.SSx2.p2.5.m4.1.1.3.3.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3.3"><times id="Sx3.SSx2.p2.5.m4.1.1.3.3.1.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.1"></times><ci id="Sx3.SSx2.p2.5.m4.1.1.3.3.2.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.2">𝑀</ci><ci id="Sx3.SSx2.p2.5.m4.1.1.3.3.3.cmml" xref="Sx3.SSx2.p2.5.m4.1.1.3.3.3">𝑑</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.5.m4.1c">\boldsymbol{f}^{I}\in\mathbb{R}^{M\times d}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.5.m4.1d">bold_italic_f start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M × italic_d end_POSTSUPERSCRIPT</annotation></semantics></math> and other tokens <math alttext="\boldsymbol{f}^{H}\in\mathbb{R}^{K\times d}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.6.m5.1"><semantics id="Sx3.SSx2.p2.6.m5.1a"><mrow id="Sx3.SSx2.p2.6.m5.1.1" xref="Sx3.SSx2.p2.6.m5.1.1.cmml"><msup id="Sx3.SSx2.p2.6.m5.1.1.2" xref="Sx3.SSx2.p2.6.m5.1.1.2.cmml"><mi id="Sx3.SSx2.p2.6.m5.1.1.2.2" xref="Sx3.SSx2.p2.6.m5.1.1.2.2.cmml">𝒇</mi><mi id="Sx3.SSx2.p2.6.m5.1.1.2.3" xref="Sx3.SSx2.p2.6.m5.1.1.2.3.cmml">H</mi></msup><mo id="Sx3.SSx2.p2.6.m5.1.1.1" xref="Sx3.SSx2.p2.6.m5.1.1.1.cmml">∈</mo><msup id="Sx3.SSx2.p2.6.m5.1.1.3" xref="Sx3.SSx2.p2.6.m5.1.1.3.cmml"><mi id="Sx3.SSx2.p2.6.m5.1.1.3.2" xref="Sx3.SSx2.p2.6.m5.1.1.3.2.cmml">ℝ</mi><mrow id="Sx3.SSx2.p2.6.m5.1.1.3.3" xref="Sx3.SSx2.p2.6.m5.1.1.3.3.cmml"><mi id="Sx3.SSx2.p2.6.m5.1.1.3.3.2" xref="Sx3.SSx2.p2.6.m5.1.1.3.3.2.cmml">K</mi><mo id="Sx3.SSx2.p2.6.m5.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="Sx3.SSx2.p2.6.m5.1.1.3.3.1.cmml">×</mo><mi id="Sx3.SSx2.p2.6.m5.1.1.3.3.3" 
xref="Sx3.SSx2.p2.6.m5.1.1.3.3.3.cmml">d</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.6.m5.1b"><apply id="Sx3.SSx2.p2.6.m5.1.1.cmml" xref="Sx3.SSx2.p2.6.m5.1.1"><in id="Sx3.SSx2.p2.6.m5.1.1.1.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.1"></in><apply id="Sx3.SSx2.p2.6.m5.1.1.2.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.6.m5.1.1.2.1.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.2">superscript</csymbol><ci id="Sx3.SSx2.p2.6.m5.1.1.2.2.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.2.2">𝒇</ci><ci id="Sx3.SSx2.p2.6.m5.1.1.2.3.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.2.3">𝐻</ci></apply><apply id="Sx3.SSx2.p2.6.m5.1.1.3.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.6.m5.1.1.3.1.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3">superscript</csymbol><ci id="Sx3.SSx2.p2.6.m5.1.1.3.2.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3.2">ℝ</ci><apply id="Sx3.SSx2.p2.6.m5.1.1.3.3.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3.3"><times id="Sx3.SSx2.p2.6.m5.1.1.3.3.1.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3.3.1"></times><ci id="Sx3.SSx2.p2.6.m5.1.1.3.3.2.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3.3.2">𝐾</ci><ci id="Sx3.SSx2.p2.6.m5.1.1.3.3.3.cmml" xref="Sx3.SSx2.p2.6.m5.1.1.3.3.3">𝑑</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.6.m5.1c">\boldsymbol{f}^{H}\in\mathbb{R}^{K\times d}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.6.m5.1d">bold_italic_f start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_K × italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>. The image tokens are derived directly from the input image, while the role of other tokens depends on the baseline model. In the case of GroupViT <cite class="ltx_cite ltx_citemacro_citep">(Xu et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib49" title="">2022</a>)</cite>, other tokens are group tokens; in contrast, in CLIP <cite class="ltx_cite ltx_citemacro_citep">(Radford et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite>, they are absent, and <math alttext="K=0" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.7.m6.1"><semantics id="Sx3.SSx2.p2.7.m6.1a"><mrow id="Sx3.SSx2.p2.7.m6.1.1" xref="Sx3.SSx2.p2.7.m6.1.1.cmml"><mi id="Sx3.SSx2.p2.7.m6.1.1.2" xref="Sx3.SSx2.p2.7.m6.1.1.2.cmml">K</mi><mo id="Sx3.SSx2.p2.7.m6.1.1.1" xref="Sx3.SSx2.p2.7.m6.1.1.1.cmml">=</mo><mn id="Sx3.SSx2.p2.7.m6.1.1.3" xref="Sx3.SSx2.p2.7.m6.1.1.3.cmml">0</mn></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.7.m6.1b"><apply id="Sx3.SSx2.p2.7.m6.1.1.cmml" xref="Sx3.SSx2.p2.7.m6.1.1"><eq id="Sx3.SSx2.p2.7.m6.1.1.1.cmml" xref="Sx3.SSx2.p2.7.m6.1.1.1"></eq><ci id="Sx3.SSx2.p2.7.m6.1.1.2.cmml" xref="Sx3.SSx2.p2.7.m6.1.1.2">𝐾</ci><cn id="Sx3.SSx2.p2.7.m6.1.1.3.cmml" type="integer" xref="Sx3.SSx2.p2.7.m6.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.7.m6.1c">K=0</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.7.m6.1d">italic_K = 0</annotation></semantics></math>. 
Within the AG-Adapter, the dual-layer MLP serves to bridge the gap between text and image representations, namely,</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E3"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\boldsymbol{f}^{O}=\text{MLP}(\boldsymbol{y}^{O}),\vspace{-1mm}" class="ltx_Math" display="block" id="Sx3.E3.m1.1"><semantics id="Sx3.E3.m1.1a"><mrow id="Sx3.E3.m1.1.1.1" xref="Sx3.E3.m1.1.1.1.1.cmml"><mrow id="Sx3.E3.m1.1.1.1.1" xref="Sx3.E3.m1.1.1.1.1.cmml"><msup id="Sx3.E3.m1.1.1.1.1.3" xref="Sx3.E3.m1.1.1.1.1.3.cmml"><mi id="Sx3.E3.m1.1.1.1.1.3.2" xref="Sx3.E3.m1.1.1.1.1.3.2.cmml">𝒇</mi><mi id="Sx3.E3.m1.1.1.1.1.3.3" xref="Sx3.E3.m1.1.1.1.1.3.3.cmml">O</mi></msup><mo id="Sx3.E3.m1.1.1.1.1.2" xref="Sx3.E3.m1.1.1.1.1.2.cmml">=</mo><mrow id="Sx3.E3.m1.1.1.1.1.1" xref="Sx3.E3.m1.1.1.1.1.1.cmml"><mtext id="Sx3.E3.m1.1.1.1.1.1.3" xref="Sx3.E3.m1.1.1.1.1.1.3a.cmml">MLP</mtext><mo id="Sx3.E3.m1.1.1.1.1.1.2" xref="Sx3.E3.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E3.m1.1.1.1.1.1.1.1" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.cmml"><mo id="Sx3.E3.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.cmml">(</mo><msup id="Sx3.E3.m1.1.1.1.1.1.1.1.1" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E3.m1.1.1.1.1.1.1.1.1.2" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.2.cmml">𝒚</mi><mi id="Sx3.E3.m1.1.1.1.1.1.1.1.1.3" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.3.cmml">O</mi></msup><mo id="Sx3.E3.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="Sx3.E3.m1.1.1.1.2" xref="Sx3.E3.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E3.m1.1b"><apply id="Sx3.E3.m1.1.1.1.1.cmml" xref="Sx3.E3.m1.1.1.1"><eq id="Sx3.E3.m1.1.1.1.1.2.cmml" xref="Sx3.E3.m1.1.1.1.1.2"></eq><apply id="Sx3.E3.m1.1.1.1.1.3.cmml" xref="Sx3.E3.m1.1.1.1.1.3"><csymbol cd="ambiguous" 
id="Sx3.E3.m1.1.1.1.1.3.1.cmml" xref="Sx3.E3.m1.1.1.1.1.3">superscript</csymbol><ci id="Sx3.E3.m1.1.1.1.1.3.2.cmml" xref="Sx3.E3.m1.1.1.1.1.3.2">𝒇</ci><ci id="Sx3.E3.m1.1.1.1.1.3.3.cmml" xref="Sx3.E3.m1.1.1.1.1.3.3">𝑂</ci></apply><apply id="Sx3.E3.m1.1.1.1.1.1.cmml" xref="Sx3.E3.m1.1.1.1.1.1"><times id="Sx3.E3.m1.1.1.1.1.1.2.cmml" xref="Sx3.E3.m1.1.1.1.1.1.2"></times><ci id="Sx3.E3.m1.1.1.1.1.1.3a.cmml" xref="Sx3.E3.m1.1.1.1.1.1.3"><mtext id="Sx3.E3.m1.1.1.1.1.1.3.cmml" xref="Sx3.E3.m1.1.1.1.1.1.3">MLP</mtext></ci><apply id="Sx3.E3.m1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E3.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E3.m1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E3.m1.1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E3.m1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.2">𝒚</ci><ci id="Sx3.E3.m1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E3.m1.1.1.1.1.1.1.1.1.3">𝑂</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E3.m1.1c">\boldsymbol{f}^{O}=\text{MLP}(\boldsymbol{y}^{O}),\vspace{-1mm}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E3.m1.1d">bold_italic_f start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT = MLP ( bold_italic_y start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx2.p2.8">where <math alttext="\text{MLP}(\cdot)" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.8.m1.1"><semantics id="Sx3.SSx2.p2.8.m1.1a"><mrow id="Sx3.SSx2.p2.8.m1.1.2" xref="Sx3.SSx2.p2.8.m1.1.2.cmml"><mtext id="Sx3.SSx2.p2.8.m1.1.2.2" xref="Sx3.SSx2.p2.8.m1.1.2.2a.cmml">MLP</mtext><mo id="Sx3.SSx2.p2.8.m1.1.2.1" xref="Sx3.SSx2.p2.8.m1.1.2.1.cmml">⁢</mo><mrow id="Sx3.SSx2.p2.8.m1.1.2.3.2" xref="Sx3.SSx2.p2.8.m1.1.2.cmml"><mo 
id="Sx3.SSx2.p2.8.m1.1.2.3.2.1" stretchy="false" xref="Sx3.SSx2.p2.8.m1.1.2.cmml">(</mo><mo id="Sx3.SSx2.p2.8.m1.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx2.p2.8.m1.1.1.cmml">⋅</mo><mo id="Sx3.SSx2.p2.8.m1.1.2.3.2.2" stretchy="false" xref="Sx3.SSx2.p2.8.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.8.m1.1b"><apply id="Sx3.SSx2.p2.8.m1.1.2.cmml" xref="Sx3.SSx2.p2.8.m1.1.2"><times id="Sx3.SSx2.p2.8.m1.1.2.1.cmml" xref="Sx3.SSx2.p2.8.m1.1.2.1"></times><ci id="Sx3.SSx2.p2.8.m1.1.2.2a.cmml" xref="Sx3.SSx2.p2.8.m1.1.2.2"><mtext id="Sx3.SSx2.p2.8.m1.1.2.2.cmml" xref="Sx3.SSx2.p2.8.m1.1.2.2">MLP</mtext></ci><ci id="Sx3.SSx2.p2.8.m1.1.1.cmml" xref="Sx3.SSx2.p2.8.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.8.m1.1c">\text{MLP}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.8.m1.1d">MLP ( ⋅ )</annotation></semantics></math> denotes two linear mappings and one non-linear mapping. 
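</p> <p class="ltx_p">The mapping in Eq. (3) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the text specifies two linear mappings and one non-linear mapping, so the choice of ReLU, its placement between the two linear layers, the hidden width, the random initialization, and the helper names <span class="ltx_text ltx_font_typewriter">linear</span>, <span class="ltx_text ltx_font_typewriter">relu</span>, <span class="ltx_text ltx_font_typewriter">mlp</span> and <span class="ltx_text ltx_font_typewriter">init</span> are hypothetical and not taken from the paper.</p>
<pre class="ltx_verbatim">
```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    # Assumption: the source only says "one non-linear mapping";
    # ReLU is used here purely for illustration.
    return [max(0.0, v) for v in x]


def mlp(y_o, w1, b1, w2, b2):
    """Sketch of f^O = MLP(y^O): linear, non-linearity, linear."""
    return linear(relu(linear(y_o, w1, b1)), w2, b2)


def init(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
d, hidden = 4, 8  # illustrative sizes, not values from the paper
w1, b1 = init(hidden, d, rng), [0.0] * hidden
w2, b2 = init(d, hidden, rng), [0.0] * d

y_o = [0.5, -1.0, 0.25, 2.0]  # stand-in for an instruction feature y^O
f_o = mlp(y_o, w1, b1, w2, b2)
print(len(f_o))  # the projected feature keeps dimension d in this sketch
```
</pre>
<p class="ltx_p">In this sketch the output keeps the input dimension so that the projected instruction feature can be compared with visual tokens of the same width; whether the actual adapter changes the width is not stated in this passage.</p> <p class="ltx_p">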
The image tokens are standardized in order to enhance training stability:</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E4"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\vspace{-1mm}\hat{\boldsymbol{f}}^{I}=\text{Norm}(\boldsymbol{f}^{I}),\vspace{% -1mm}" class="ltx_Math" display="block" id="Sx3.E4.m1.1"><semantics id="Sx3.E4.m1.1a"><mrow id="Sx3.E4.m1.1.1.1" xref="Sx3.E4.m1.1.1.1.1.cmml"><mrow id="Sx3.E4.m1.1.1.1.1" xref="Sx3.E4.m1.1.1.1.1.cmml"><msup id="Sx3.E4.m1.1.1.1.1.3" xref="Sx3.E4.m1.1.1.1.1.3.cmml"><mover accent="true" id="Sx3.E4.m1.1.1.1.1.3.2" xref="Sx3.E4.m1.1.1.1.1.3.2.cmml"><mi id="Sx3.E4.m1.1.1.1.1.3.2.2" xref="Sx3.E4.m1.1.1.1.1.3.2.2.cmml">𝒇</mi><mo id="Sx3.E4.m1.1.1.1.1.3.2.1" xref="Sx3.E4.m1.1.1.1.1.3.2.1.cmml">^</mo></mover><mi id="Sx3.E4.m1.1.1.1.1.3.3" xref="Sx3.E4.m1.1.1.1.1.3.3.cmml">I</mi></msup><mo id="Sx3.E4.m1.1.1.1.1.2" xref="Sx3.E4.m1.1.1.1.1.2.cmml">=</mo><mrow id="Sx3.E4.m1.1.1.1.1.1" xref="Sx3.E4.m1.1.1.1.1.1.cmml"><mtext id="Sx3.E4.m1.1.1.1.1.1.3" xref="Sx3.E4.m1.1.1.1.1.1.3a.cmml">Norm</mtext><mo id="Sx3.E4.m1.1.1.1.1.1.2" xref="Sx3.E4.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E4.m1.1.1.1.1.1.1.1" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.cmml"><mo id="Sx3.E4.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.cmml">(</mo><msup id="Sx3.E4.m1.1.1.1.1.1.1.1.1" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E4.m1.1.1.1.1.1.1.1.1.2" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.2.cmml">𝒇</mi><mi id="Sx3.E4.m1.1.1.1.1.1.1.1.1.3" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.3.cmml">I</mi></msup><mo id="Sx3.E4.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="Sx3.E4.m1.1.1.1.2" xref="Sx3.E4.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E4.m1.1b"><apply id="Sx3.E4.m1.1.1.1.1.cmml" xref="Sx3.E4.m1.1.1.1"><eq 
id="Sx3.E4.m1.1.1.1.1.2.cmml" xref="Sx3.E4.m1.1.1.1.1.2"></eq><apply id="Sx3.E4.m1.1.1.1.1.3.cmml" xref="Sx3.E4.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E4.m1.1.1.1.1.3.1.cmml" xref="Sx3.E4.m1.1.1.1.1.3">superscript</csymbol><apply id="Sx3.E4.m1.1.1.1.1.3.2.cmml" xref="Sx3.E4.m1.1.1.1.1.3.2"><ci id="Sx3.E4.m1.1.1.1.1.3.2.1.cmml" xref="Sx3.E4.m1.1.1.1.1.3.2.1">^</ci><ci id="Sx3.E4.m1.1.1.1.1.3.2.2.cmml" xref="Sx3.E4.m1.1.1.1.1.3.2.2">𝒇</ci></apply><ci id="Sx3.E4.m1.1.1.1.1.3.3.cmml" xref="Sx3.E4.m1.1.1.1.1.3.3">𝐼</ci></apply><apply id="Sx3.E4.m1.1.1.1.1.1.cmml" xref="Sx3.E4.m1.1.1.1.1.1"><times id="Sx3.E4.m1.1.1.1.1.1.2.cmml" xref="Sx3.E4.m1.1.1.1.1.1.2"></times><ci id="Sx3.E4.m1.1.1.1.1.1.3a.cmml" xref="Sx3.E4.m1.1.1.1.1.1.3"><mtext id="Sx3.E4.m1.1.1.1.1.1.3.cmml" xref="Sx3.E4.m1.1.1.1.1.1.3">Norm</mtext></ci><apply id="Sx3.E4.m1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E4.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E4.m1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E4.m1.1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E4.m1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.2">𝒇</ci><ci id="Sx3.E4.m1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E4.m1.1.1.1.1.1.1.1.1.3">𝐼</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E4.m1.1c">\vspace{-1mm}\hat{\boldsymbol{f}}^{I}=\text{Norm}(\boldsymbol{f}^{I}),\vspace{% -1mm}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E4.m1.1d">over^ start_ARG bold_italic_f end_ARG start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT = Norm ( bold_italic_f start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx2.p2.11">where <math alttext="\text{Norm}(\cdot)" class="ltx_Math" display="inline" 
id="Sx3.SSx2.p2.9.m1.1"><semantics id="Sx3.SSx2.p2.9.m1.1a"><mrow id="Sx3.SSx2.p2.9.m1.1.2" xref="Sx3.SSx2.p2.9.m1.1.2.cmml"><mtext id="Sx3.SSx2.p2.9.m1.1.2.2" xref="Sx3.SSx2.p2.9.m1.1.2.2a.cmml">Norm</mtext><mo id="Sx3.SSx2.p2.9.m1.1.2.1" xref="Sx3.SSx2.p2.9.m1.1.2.1.cmml">⁢</mo><mrow id="Sx3.SSx2.p2.9.m1.1.2.3.2" xref="Sx3.SSx2.p2.9.m1.1.2.cmml"><mo id="Sx3.SSx2.p2.9.m1.1.2.3.2.1" stretchy="false" xref="Sx3.SSx2.p2.9.m1.1.2.cmml">(</mo><mo id="Sx3.SSx2.p2.9.m1.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx2.p2.9.m1.1.1.cmml">⋅</mo><mo id="Sx3.SSx2.p2.9.m1.1.2.3.2.2" stretchy="false" xref="Sx3.SSx2.p2.9.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.9.m1.1b"><apply id="Sx3.SSx2.p2.9.m1.1.2.cmml" xref="Sx3.SSx2.p2.9.m1.1.2"><times id="Sx3.SSx2.p2.9.m1.1.2.1.cmml" xref="Sx3.SSx2.p2.9.m1.1.2.1"></times><ci id="Sx3.SSx2.p2.9.m1.1.2.2a.cmml" xref="Sx3.SSx2.p2.9.m1.1.2.2"><mtext id="Sx3.SSx2.p2.9.m1.1.2.2.cmml" xref="Sx3.SSx2.p2.9.m1.1.2.2">Norm</mtext></ci><ci id="Sx3.SSx2.p2.9.m1.1.1.cmml" xref="Sx3.SSx2.p2.9.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.9.m1.1c">\text{Norm}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.9.m1.1d">Norm ( ⋅ )</annotation></semantics></math> denotes the layer normalization operator. 
Subsequently, <math alttext="\boldsymbol{f}^{O}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.10.m2.1"><semantics id="Sx3.SSx2.p2.10.m2.1a"><msup id="Sx3.SSx2.p2.10.m2.1.1" xref="Sx3.SSx2.p2.10.m2.1.1.cmml"><mi id="Sx3.SSx2.p2.10.m2.1.1.2" xref="Sx3.SSx2.p2.10.m2.1.1.2.cmml">𝒇</mi><mi id="Sx3.SSx2.p2.10.m2.1.1.3" xref="Sx3.SSx2.p2.10.m2.1.1.3.cmml">O</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.10.m2.1b"><apply id="Sx3.SSx2.p2.10.m2.1.1.cmml" xref="Sx3.SSx2.p2.10.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.10.m2.1.1.1.cmml" xref="Sx3.SSx2.p2.10.m2.1.1">superscript</csymbol><ci id="Sx3.SSx2.p2.10.m2.1.1.2.cmml" xref="Sx3.SSx2.p2.10.m2.1.1.2">𝒇</ci><ci id="Sx3.SSx2.p2.10.m2.1.1.3.cmml" xref="Sx3.SSx2.p2.10.m2.1.1.3">𝑂</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.10.m2.1c">\boldsymbol{f}^{O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.10.m2.1d">bold_italic_f start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> and <math alttext="\hat{\boldsymbol{f}}^{I}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.11.m3.1"><semantics id="Sx3.SSx2.p2.11.m3.1a"><msup id="Sx3.SSx2.p2.11.m3.1.1" xref="Sx3.SSx2.p2.11.m3.1.1.cmml"><mover accent="true" id="Sx3.SSx2.p2.11.m3.1.1.2" xref="Sx3.SSx2.p2.11.m3.1.1.2.cmml"><mi id="Sx3.SSx2.p2.11.m3.1.1.2.2" xref="Sx3.SSx2.p2.11.m3.1.1.2.2.cmml">𝒇</mi><mo id="Sx3.SSx2.p2.11.m3.1.1.2.1" xref="Sx3.SSx2.p2.11.m3.1.1.2.1.cmml">^</mo></mover><mi id="Sx3.SSx2.p2.11.m3.1.1.3" xref="Sx3.SSx2.p2.11.m3.1.1.3.cmml">I</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.11.m3.1b"><apply id="Sx3.SSx2.p2.11.m3.1.1.cmml" xref="Sx3.SSx2.p2.11.m3.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.11.m3.1.1.1.cmml" xref="Sx3.SSx2.p2.11.m3.1.1">superscript</csymbol><apply id="Sx3.SSx2.p2.11.m3.1.1.2.cmml" xref="Sx3.SSx2.p2.11.m3.1.1.2"><ci id="Sx3.SSx2.p2.11.m3.1.1.2.1.cmml" xref="Sx3.SSx2.p2.11.m3.1.1.2.1">^</ci><ci 
id="Sx3.SSx2.p2.11.m3.1.1.2.2.cmml" xref="Sx3.SSx2.p2.11.m3.1.1.2.2">𝒇</ci></apply><ci id="Sx3.SSx2.p2.11.m3.1.1.3.cmml" xref="Sx3.SSx2.p2.11.m3.1.1.3">𝐼</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.11.m3.1c">\hat{\boldsymbol{f}}^{I}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.11.m3.1d">over^ start_ARG bold_italic_f end_ARG start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT</annotation></semantics></math> are fused through a cross-attention mechanism, and the fused features are added back to the visual data stream through a residual connection:</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E5"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\boldsymbol{f}^{I|O}=\boldsymbol{f}^{I}+\text{Cross}(\hat{\boldsymbol{f}}^{I},% \boldsymbol{f}^{O}),\vspace{-1mm}" class="ltx_Math" display="block" id="Sx3.E5.m1.1"><semantics id="Sx3.E5.m1.1a"><mrow id="Sx3.E5.m1.1.1.1" xref="Sx3.E5.m1.1.1.1.1.cmml"><mrow id="Sx3.E5.m1.1.1.1.1" xref="Sx3.E5.m1.1.1.1.1.cmml"><msup id="Sx3.E5.m1.1.1.1.1.4" xref="Sx3.E5.m1.1.1.1.1.4.cmml"><mi id="Sx3.E5.m1.1.1.1.1.4.2" xref="Sx3.E5.m1.1.1.1.1.4.2.cmml">𝒇</mi><mrow id="Sx3.E5.m1.1.1.1.1.4.3" xref="Sx3.E5.m1.1.1.1.1.4.3.cmml"><mi id="Sx3.E5.m1.1.1.1.1.4.3.2" xref="Sx3.E5.m1.1.1.1.1.4.3.2.cmml">I</mi><mo fence="false" id="Sx3.E5.m1.1.1.1.1.4.3.1" xref="Sx3.E5.m1.1.1.1.1.4.3.1.cmml">|</mo><mi id="Sx3.E5.m1.1.1.1.1.4.3.3" xref="Sx3.E5.m1.1.1.1.1.4.3.3.cmml">O</mi></mrow></msup><mo id="Sx3.E5.m1.1.1.1.1.3" xref="Sx3.E5.m1.1.1.1.1.3.cmml">=</mo><mrow id="Sx3.E5.m1.1.1.1.1.2" xref="Sx3.E5.m1.1.1.1.1.2.cmml"><msup id="Sx3.E5.m1.1.1.1.1.2.4" xref="Sx3.E5.m1.1.1.1.1.2.4.cmml"><mi id="Sx3.E5.m1.1.1.1.1.2.4.2" xref="Sx3.E5.m1.1.1.1.1.2.4.2.cmml">𝒇</mi><mi id="Sx3.E5.m1.1.1.1.1.2.4.3" xref="Sx3.E5.m1.1.1.1.1.2.4.3.cmml">I</mi></msup><mo 
id="Sx3.E5.m1.1.1.1.1.2.3" xref="Sx3.E5.m1.1.1.1.1.2.3.cmml">+</mo><mrow id="Sx3.E5.m1.1.1.1.1.2.2" xref="Sx3.E5.m1.1.1.1.1.2.2.cmml"><mtext id="Sx3.E5.m1.1.1.1.1.2.2.4" xref="Sx3.E5.m1.1.1.1.1.2.2.4a.cmml">Cross</mtext><mo id="Sx3.E5.m1.1.1.1.1.2.2.3" xref="Sx3.E5.m1.1.1.1.1.2.2.3.cmml">⁢</mo><mrow id="Sx3.E5.m1.1.1.1.1.2.2.2.2" xref="Sx3.E5.m1.1.1.1.1.2.2.2.3.cmml"><mo id="Sx3.E5.m1.1.1.1.1.2.2.2.2.3" stretchy="false" xref="Sx3.E5.m1.1.1.1.1.2.2.2.3.cmml">(</mo><msup id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.cmml"><mover accent="true" id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.cmml"><mi id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.2" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.2.cmml">𝒇</mi><mo id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.1" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.1.cmml">^</mo></mover><mi id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.3" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.3.cmml">I</mi></msup><mo id="Sx3.E5.m1.1.1.1.1.2.2.2.2.4" xref="Sx3.E5.m1.1.1.1.1.2.2.2.3.cmml">,</mo><msup id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.cmml"><mi id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.2" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.2.cmml">𝒇</mi><mi id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.3" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.3.cmml">O</mi></msup><mo id="Sx3.E5.m1.1.1.1.1.2.2.2.2.5" stretchy="false" xref="Sx3.E5.m1.1.1.1.1.2.2.2.3.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="Sx3.E5.m1.1.1.1.2" xref="Sx3.E5.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E5.m1.1b"><apply id="Sx3.E5.m1.1.1.1.1.cmml" xref="Sx3.E5.m1.1.1.1"><eq id="Sx3.E5.m1.1.1.1.1.3.cmml" xref="Sx3.E5.m1.1.1.1.1.3"></eq><apply id="Sx3.E5.m1.1.1.1.1.4.cmml" xref="Sx3.E5.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="Sx3.E5.m1.1.1.1.1.4.1.cmml" xref="Sx3.E5.m1.1.1.1.1.4">superscript</csymbol><ci id="Sx3.E5.m1.1.1.1.1.4.2.cmml" xref="Sx3.E5.m1.1.1.1.1.4.2">𝒇</ci><apply id="Sx3.E5.m1.1.1.1.1.4.3.cmml" xref="Sx3.E5.m1.1.1.1.1.4.3"><csymbol cd="latexml" 
id="Sx3.E5.m1.1.1.1.1.4.3.1.cmml" xref="Sx3.E5.m1.1.1.1.1.4.3.1">conditional</csymbol><ci id="Sx3.E5.m1.1.1.1.1.4.3.2.cmml" xref="Sx3.E5.m1.1.1.1.1.4.3.2">𝐼</ci><ci id="Sx3.E5.m1.1.1.1.1.4.3.3.cmml" xref="Sx3.E5.m1.1.1.1.1.4.3.3">𝑂</ci></apply></apply><apply id="Sx3.E5.m1.1.1.1.1.2.cmml" xref="Sx3.E5.m1.1.1.1.1.2"><plus id="Sx3.E5.m1.1.1.1.1.2.3.cmml" xref="Sx3.E5.m1.1.1.1.1.2.3"></plus><apply id="Sx3.E5.m1.1.1.1.1.2.4.cmml" xref="Sx3.E5.m1.1.1.1.1.2.4"><csymbol cd="ambiguous" id="Sx3.E5.m1.1.1.1.1.2.4.1.cmml" xref="Sx3.E5.m1.1.1.1.1.2.4">superscript</csymbol><ci id="Sx3.E5.m1.1.1.1.1.2.4.2.cmml" xref="Sx3.E5.m1.1.1.1.1.2.4.2">𝒇</ci><ci id="Sx3.E5.m1.1.1.1.1.2.4.3.cmml" xref="Sx3.E5.m1.1.1.1.1.2.4.3">𝐼</ci></apply><apply id="Sx3.E5.m1.1.1.1.1.2.2.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2"><times id="Sx3.E5.m1.1.1.1.1.2.2.3.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.3"></times><ci id="Sx3.E5.m1.1.1.1.1.2.2.4a.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.4"><mtext id="Sx3.E5.m1.1.1.1.1.2.2.4.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.4">Cross</mtext></ci><interval closure="open" id="Sx3.E5.m1.1.1.1.1.2.2.2.3.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2"><apply id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1">superscript</csymbol><apply id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2"><ci id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.1">^</ci><ci id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.2.2">𝒇</ci></apply><ci id="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E5.m1.1.1.1.1.1.1.1.1.1.3">𝐼</ci></apply><apply id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.1.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2">superscript</csymbol><ci id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.2.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.2">𝒇</ci><ci 
id="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.3.cmml" xref="Sx3.E5.m1.1.1.1.1.2.2.2.2.2.3">𝑂</ci></apply></interval></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E5.m1.1c">\boldsymbol{f}^{I|O}=\boldsymbol{f}^{I}+\text{Cross}(\hat{\boldsymbol{f}}^{I},% \boldsymbol{f}^{O}),\vspace{-1mm}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E5.m1.1d">bold_italic_f start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT = bold_italic_f start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT + Cross ( over^ start_ARG bold_italic_f end_ARG start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT , bold_italic_f start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx2.p2.15">where <math alttext="\text{Cross}(\cdot,\cdot)" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.12.m1.2"><semantics id="Sx3.SSx2.p2.12.m1.2a"><mrow id="Sx3.SSx2.p2.12.m1.2.3" xref="Sx3.SSx2.p2.12.m1.2.3.cmml"><mtext id="Sx3.SSx2.p2.12.m1.2.3.2" xref="Sx3.SSx2.p2.12.m1.2.3.2a.cmml">Cross</mtext><mo id="Sx3.SSx2.p2.12.m1.2.3.1" xref="Sx3.SSx2.p2.12.m1.2.3.1.cmml">⁢</mo><mrow id="Sx3.SSx2.p2.12.m1.2.3.3.2" xref="Sx3.SSx2.p2.12.m1.2.3.3.1.cmml"><mo id="Sx3.SSx2.p2.12.m1.2.3.3.2.1" stretchy="false" xref="Sx3.SSx2.p2.12.m1.2.3.3.1.cmml">(</mo><mo id="Sx3.SSx2.p2.12.m1.1.1" lspace="0em" rspace="0em" xref="Sx3.SSx2.p2.12.m1.1.1.cmml">⋅</mo><mo id="Sx3.SSx2.p2.12.m1.2.3.3.2.2" rspace="0em" xref="Sx3.SSx2.p2.12.m1.2.3.3.1.cmml">,</mo><mo id="Sx3.SSx2.p2.12.m1.2.2" lspace="0em" rspace="0em" xref="Sx3.SSx2.p2.12.m1.2.2.cmml">⋅</mo><mo id="Sx3.SSx2.p2.12.m1.2.3.3.2.3" stretchy="false" xref="Sx3.SSx2.p2.12.m1.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" 
id="Sx3.SSx2.p2.12.m1.2b"><apply id="Sx3.SSx2.p2.12.m1.2.3.cmml" xref="Sx3.SSx2.p2.12.m1.2.3"><times id="Sx3.SSx2.p2.12.m1.2.3.1.cmml" xref="Sx3.SSx2.p2.12.m1.2.3.1"></times><ci id="Sx3.SSx2.p2.12.m1.2.3.2a.cmml" xref="Sx3.SSx2.p2.12.m1.2.3.2"><mtext id="Sx3.SSx2.p2.12.m1.2.3.2.cmml" xref="Sx3.SSx2.p2.12.m1.2.3.2">Cross</mtext></ci><interval closure="open" id="Sx3.SSx2.p2.12.m1.2.3.3.1.cmml" xref="Sx3.SSx2.p2.12.m1.2.3.3.2"><ci id="Sx3.SSx2.p2.12.m1.1.1.cmml" xref="Sx3.SSx2.p2.12.m1.1.1">⋅</ci><ci id="Sx3.SSx2.p2.12.m1.2.2.cmml" xref="Sx3.SSx2.p2.12.m1.2.2">⋅</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.12.m1.2c">\text{Cross}(\cdot,\cdot)</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.12.m1.2d">Cross ( ⋅ , ⋅ )</annotation></semantics></math> represents a cross-attention operator, <math alttext="\hat{\boldsymbol{f}}^{I}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.13.m2.1"><semantics id="Sx3.SSx2.p2.13.m2.1a"><msup id="Sx3.SSx2.p2.13.m2.1.1" xref="Sx3.SSx2.p2.13.m2.1.1.cmml"><mover accent="true" id="Sx3.SSx2.p2.13.m2.1.1.2" xref="Sx3.SSx2.p2.13.m2.1.1.2.cmml"><mi id="Sx3.SSx2.p2.13.m2.1.1.2.2" xref="Sx3.SSx2.p2.13.m2.1.1.2.2.cmml">𝒇</mi><mo id="Sx3.SSx2.p2.13.m2.1.1.2.1" xref="Sx3.SSx2.p2.13.m2.1.1.2.1.cmml">^</mo></mover><mi id="Sx3.SSx2.p2.13.m2.1.1.3" xref="Sx3.SSx2.p2.13.m2.1.1.3.cmml">I</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.13.m2.1b"><apply id="Sx3.SSx2.p2.13.m2.1.1.cmml" xref="Sx3.SSx2.p2.13.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.13.m2.1.1.1.cmml" xref="Sx3.SSx2.p2.13.m2.1.1">superscript</csymbol><apply id="Sx3.SSx2.p2.13.m2.1.1.2.cmml" xref="Sx3.SSx2.p2.13.m2.1.1.2"><ci id="Sx3.SSx2.p2.13.m2.1.1.2.1.cmml" xref="Sx3.SSx2.p2.13.m2.1.1.2.1">^</ci><ci id="Sx3.SSx2.p2.13.m2.1.1.2.2.cmml" xref="Sx3.SSx2.p2.13.m2.1.1.2.2">𝒇</ci></apply><ci id="Sx3.SSx2.p2.13.m2.1.1.3.cmml" 
xref="Sx3.SSx2.p2.13.m2.1.1.3">𝐼</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.13.m2.1c">\hat{\boldsymbol{f}}^{I}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.13.m2.1d">over^ start_ARG bold_italic_f end_ARG start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT</annotation></semantics></math> is treated as the query, and <math alttext="\boldsymbol{f}^{O}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.14.m3.1"><semantics id="Sx3.SSx2.p2.14.m3.1a"><msup id="Sx3.SSx2.p2.14.m3.1.1" xref="Sx3.SSx2.p2.14.m3.1.1.cmml"><mi id="Sx3.SSx2.p2.14.m3.1.1.2" xref="Sx3.SSx2.p2.14.m3.1.1.2.cmml">𝒇</mi><mi id="Sx3.SSx2.p2.14.m3.1.1.3" xref="Sx3.SSx2.p2.14.m3.1.1.3.cmml">O</mi></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.14.m3.1b"><apply id="Sx3.SSx2.p2.14.m3.1.1.cmml" xref="Sx3.SSx2.p2.14.m3.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.14.m3.1.1.1.cmml" xref="Sx3.SSx2.p2.14.m3.1.1">superscript</csymbol><ci id="Sx3.SSx2.p2.14.m3.1.1.2.cmml" xref="Sx3.SSx2.p2.14.m3.1.1.2">𝒇</ci><ci id="Sx3.SSx2.p2.14.m3.1.1.3.cmml" xref="Sx3.SSx2.p2.14.m3.1.1.3">𝑂</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.14.m3.1c">\boldsymbol{f}^{O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.14.m3.1d">bold_italic_f start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> serves as both the key and the value. 
<math alttext="\boldsymbol{f}^{I|O}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.15.m4.1"><semantics id="Sx3.SSx2.p2.15.m4.1a"><msup id="Sx3.SSx2.p2.15.m4.1.1" xref="Sx3.SSx2.p2.15.m4.1.1.cmml"><mi id="Sx3.SSx2.p2.15.m4.1.1.2" xref="Sx3.SSx2.p2.15.m4.1.1.2.cmml">𝒇</mi><mrow id="Sx3.SSx2.p2.15.m4.1.1.3" xref="Sx3.SSx2.p2.15.m4.1.1.3.cmml"><mi id="Sx3.SSx2.p2.15.m4.1.1.3.2" xref="Sx3.SSx2.p2.15.m4.1.1.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx2.p2.15.m4.1.1.3.1" xref="Sx3.SSx2.p2.15.m4.1.1.3.1.cmml">|</mo><mi id="Sx3.SSx2.p2.15.m4.1.1.3.3" xref="Sx3.SSx2.p2.15.m4.1.1.3.3.cmml">O</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.15.m4.1b"><apply id="Sx3.SSx2.p2.15.m4.1.1.cmml" xref="Sx3.SSx2.p2.15.m4.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.15.m4.1.1.1.cmml" xref="Sx3.SSx2.p2.15.m4.1.1">superscript</csymbol><ci id="Sx3.SSx2.p2.15.m4.1.1.2.cmml" xref="Sx3.SSx2.p2.15.m4.1.1.2">𝒇</ci><apply id="Sx3.SSx2.p2.15.m4.1.1.3.cmml" xref="Sx3.SSx2.p2.15.m4.1.1.3"><csymbol cd="latexml" id="Sx3.SSx2.p2.15.m4.1.1.3.1.cmml" xref="Sx3.SSx2.p2.15.m4.1.1.3.1">conditional</csymbol><ci id="Sx3.SSx2.p2.15.m4.1.1.3.2.cmml" xref="Sx3.SSx2.p2.15.m4.1.1.3.2">𝐼</ci><ci id="Sx3.SSx2.p2.15.m4.1.1.3.3.cmml" xref="Sx3.SSx2.p2.15.m4.1.1.3.3">𝑂</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.15.m4.1c">\boldsymbol{f}^{I|O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.15.m4.1d">bold_italic_f start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> emphasizes the semantic objects while maintaining the integrity of the original visual information. 
The resulting tokens are then concatenated with other tokens to form the final features:</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E6"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\boldsymbol{f}^{V|O}=[\boldsymbol{f}^{I|O};\boldsymbol{f}^{H}],\vspace{-1mm}" class="ltx_Math" display="block" id="Sx3.E6.m1.1"><semantics id="Sx3.E6.m1.1a"><mrow id="Sx3.E6.m1.1.1.1" xref="Sx3.E6.m1.1.1.1.1.cmml"><mrow id="Sx3.E6.m1.1.1.1.1" xref="Sx3.E6.m1.1.1.1.1.cmml"><msup id="Sx3.E6.m1.1.1.1.1.4" xref="Sx3.E6.m1.1.1.1.1.4.cmml"><mi id="Sx3.E6.m1.1.1.1.1.4.2" xref="Sx3.E6.m1.1.1.1.1.4.2.cmml">𝒇</mi><mrow id="Sx3.E6.m1.1.1.1.1.4.3" xref="Sx3.E6.m1.1.1.1.1.4.3.cmml"><mi id="Sx3.E6.m1.1.1.1.1.4.3.2" xref="Sx3.E6.m1.1.1.1.1.4.3.2.cmml">V</mi><mo fence="false" id="Sx3.E6.m1.1.1.1.1.4.3.1" xref="Sx3.E6.m1.1.1.1.1.4.3.1.cmml">|</mo><mi id="Sx3.E6.m1.1.1.1.1.4.3.3" xref="Sx3.E6.m1.1.1.1.1.4.3.3.cmml">O</mi></mrow></msup><mo id="Sx3.E6.m1.1.1.1.1.3" xref="Sx3.E6.m1.1.1.1.1.3.cmml">=</mo><mrow id="Sx3.E6.m1.1.1.1.1.2.2" xref="Sx3.E6.m1.1.1.1.1.2.3.cmml"><mo id="Sx3.E6.m1.1.1.1.1.2.2.3" stretchy="false" xref="Sx3.E6.m1.1.1.1.1.2.3.cmml">[</mo><msup id="Sx3.E6.m1.1.1.1.1.1.1.1" xref="Sx3.E6.m1.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E6.m1.1.1.1.1.1.1.1.2" xref="Sx3.E6.m1.1.1.1.1.1.1.1.2.cmml">𝒇</mi><mrow id="Sx3.E6.m1.1.1.1.1.1.1.1.3" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.cmml"><mi id="Sx3.E6.m1.1.1.1.1.1.1.1.3.2" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.2.cmml">I</mi><mo fence="false" id="Sx3.E6.m1.1.1.1.1.1.1.1.3.1" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.1.cmml">|</mo><mi id="Sx3.E6.m1.1.1.1.1.1.1.1.3.3" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.3.cmml">O</mi></mrow></msup><mo id="Sx3.E6.m1.1.1.1.1.2.2.4" xref="Sx3.E6.m1.1.1.1.1.2.3.cmml">;</mo><msup id="Sx3.E6.m1.1.1.1.1.2.2.2" xref="Sx3.E6.m1.1.1.1.1.2.2.2.cmml"><mi id="Sx3.E6.m1.1.1.1.1.2.2.2.2" 
xref="Sx3.E6.m1.1.1.1.1.2.2.2.2.cmml">𝒇</mi><mi id="Sx3.E6.m1.1.1.1.1.2.2.2.3" xref="Sx3.E6.m1.1.1.1.1.2.2.2.3.cmml">H</mi></msup><mo id="Sx3.E6.m1.1.1.1.1.2.2.5" stretchy="false" xref="Sx3.E6.m1.1.1.1.1.2.3.cmml">]</mo></mrow></mrow><mo id="Sx3.E6.m1.1.1.1.2" xref="Sx3.E6.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E6.m1.1b"><apply id="Sx3.E6.m1.1.1.1.1.cmml" xref="Sx3.E6.m1.1.1.1"><eq id="Sx3.E6.m1.1.1.1.1.3.cmml" xref="Sx3.E6.m1.1.1.1.1.3"></eq><apply id="Sx3.E6.m1.1.1.1.1.4.cmml" xref="Sx3.E6.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="Sx3.E6.m1.1.1.1.1.4.1.cmml" xref="Sx3.E6.m1.1.1.1.1.4">superscript</csymbol><ci id="Sx3.E6.m1.1.1.1.1.4.2.cmml" xref="Sx3.E6.m1.1.1.1.1.4.2">𝒇</ci><apply id="Sx3.E6.m1.1.1.1.1.4.3.cmml" xref="Sx3.E6.m1.1.1.1.1.4.3"><csymbol cd="latexml" id="Sx3.E6.m1.1.1.1.1.4.3.1.cmml" xref="Sx3.E6.m1.1.1.1.1.4.3.1">conditional</csymbol><ci id="Sx3.E6.m1.1.1.1.1.4.3.2.cmml" xref="Sx3.E6.m1.1.1.1.1.4.3.2">𝑉</ci><ci id="Sx3.E6.m1.1.1.1.1.4.3.3.cmml" xref="Sx3.E6.m1.1.1.1.1.4.3.3">𝑂</ci></apply></apply><list id="Sx3.E6.m1.1.1.1.1.2.3.cmml" xref="Sx3.E6.m1.1.1.1.1.2.2"><apply id="Sx3.E6.m1.1.1.1.1.1.1.1.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E6.m1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E6.m1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1.2">𝒇</ci><apply id="Sx3.E6.m1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3"><csymbol cd="latexml" id="Sx3.E6.m1.1.1.1.1.1.1.1.3.1.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.1">conditional</csymbol><ci id="Sx3.E6.m1.1.1.1.1.1.1.1.3.2.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.2">𝐼</ci><ci id="Sx3.E6.m1.1.1.1.1.1.1.1.3.3.cmml" xref="Sx3.E6.m1.1.1.1.1.1.1.1.3.3">𝑂</ci></apply></apply><apply id="Sx3.E6.m1.1.1.1.1.2.2.2.cmml" xref="Sx3.E6.m1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="Sx3.E6.m1.1.1.1.1.2.2.2.1.cmml" xref="Sx3.E6.m1.1.1.1.1.2.2.2">superscript</csymbol><ci id="Sx3.E6.m1.1.1.1.1.2.2.2.2.cmml" 
xref="Sx3.E6.m1.1.1.1.1.2.2.2.2">𝒇</ci><ci id="Sx3.E6.m1.1.1.1.1.2.2.2.3.cmml" xref="Sx3.E6.m1.1.1.1.1.2.2.2.3">𝐻</ci></apply></list></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E6.m1.1c">\boldsymbol{f}^{V|O}=[\boldsymbol{f}^{I|O};\boldsymbol{f}^{H}],\vspace{-1mm}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E6.m1.1d">bold_italic_f start_POSTSUPERSCRIPT italic_V | italic_O end_POSTSUPERSCRIPT = [ bold_italic_f start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT ; bold_italic_f start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx2.p2.17">where <math alttext="[\ ;\ ]" class="ltx_math_unparsed" display="inline" id="Sx3.SSx2.p2.16.m1.1"><semantics id="Sx3.SSx2.p2.16.m1.1a"><mrow id="Sx3.SSx2.p2.16.m1.1b"><mo id="Sx3.SSx2.p2.16.m1.1.1" rspace="0.500em" stretchy="false">[</mo><mo id="Sx3.SSx2.p2.16.m1.1.2" rspace="0.667em">;</mo><mo id="Sx3.SSx2.p2.16.m1.1.3" stretchy="false">]</mo></mrow><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.16.m1.1c">[\ ;\ ]</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.16.m1.1d">[ ; ]</annotation></semantics></math> represents the concatenation of two vectors. 
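</p> <p class="ltx_p">As a rough illustration of Eqs. (4)-(6), and not the released implementation, the data flow can be sketched in plain Python. Several details here are assumptions made only for the example: the attention is single-head, the projection matrices w_q, w_k and w_v are hypothetical and randomly initialized, the layer normalization has no learnable scale or shift, f_other stands in for the remaining tokens written as f^H, and the concatenation in Eq. (6) is taken along the token axis.</p>
<pre class="ltx_verbatim">
```python
import math
import random


def layer_norm(tokens, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance, as in Eq. (4)."""
    out = []
    for row in tokens:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from the image, keys and values from the text."""
    q = matmul(query_tokens, w_q)
    k = matmul(context_tokens, w_k)
    v = matmul(context_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def ag_adapter_step(f_img, f_obj, f_other, w_q, w_k, w_v):
    """Hypothetical helper chaining Eqs. (4)-(6) for one image."""
    f_img_hat = layer_norm(f_img)                                # Eq. (4)
    attended = cross_attention(f_img_hat, f_obj, w_q, w_k, w_v)
    f_img_obj = [[a + b for a, b in zip(r1, r2)]                 # Eq. (5): residual addition
                 for r1, r2 in zip(f_img, attended)]
    return f_img_obj + f_other                                   # Eq. (6): concatenate tokens


def rand_matrix(rows, cols):
    """Random Gaussian matrix used as placeholder features and weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim = 4
f_img, f_obj, f_other = rand_matrix(6, dim), rand_matrix(2, dim), rand_matrix(1, dim)
w_q, w_k, w_v = rand_matrix(dim, dim), rand_matrix(dim, dim), rand_matrix(dim, dim)
f_vo = ag_adapter_step(f_img, f_obj, f_other, w_q, w_k, w_v)
print(len(f_vo), len(f_vo[0]))  # prints: 7 4
```
</pre>
<p class="ltx_p">In this sketch the six image tokens keep their count and width after fusion, because the attended text features are added to them rather than appended; only the final step increases the number of tokens.</p> <p class="ltx_p">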
The feature <math alttext="\boldsymbol{f}^{V|O}" class="ltx_Math" display="inline" id="Sx3.SSx2.p2.17.m2.1"><semantics id="Sx3.SSx2.p2.17.m2.1a"><msup id="Sx3.SSx2.p2.17.m2.1.1" xref="Sx3.SSx2.p2.17.m2.1.1.cmml"><mi id="Sx3.SSx2.p2.17.m2.1.1.2" xref="Sx3.SSx2.p2.17.m2.1.1.2.cmml">𝒇</mi><mrow id="Sx3.SSx2.p2.17.m2.1.1.3" xref="Sx3.SSx2.p2.17.m2.1.1.3.cmml"><mi id="Sx3.SSx2.p2.17.m2.1.1.3.2" xref="Sx3.SSx2.p2.17.m2.1.1.3.2.cmml">V</mi><mo fence="false" id="Sx3.SSx2.p2.17.m2.1.1.3.1" xref="Sx3.SSx2.p2.17.m2.1.1.3.1.cmml">|</mo><mi id="Sx3.SSx2.p2.17.m2.1.1.3.3" xref="Sx3.SSx2.p2.17.m2.1.1.3.3.cmml">O</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx2.p2.17.m2.1b"><apply id="Sx3.SSx2.p2.17.m2.1.1.cmml" xref="Sx3.SSx2.p2.17.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx2.p2.17.m2.1.1.1.cmml" xref="Sx3.SSx2.p2.17.m2.1.1">superscript</csymbol><ci id="Sx3.SSx2.p2.17.m2.1.1.2.cmml" xref="Sx3.SSx2.p2.17.m2.1.1.2">𝒇</ci><apply id="Sx3.SSx2.p2.17.m2.1.1.3.cmml" xref="Sx3.SSx2.p2.17.m2.1.1.3"><csymbol cd="latexml" id="Sx3.SSx2.p2.17.m2.1.1.3.1.cmml" xref="Sx3.SSx2.p2.17.m2.1.1.3.1">conditional</csymbol><ci id="Sx3.SSx2.p2.17.m2.1.1.3.2.cmml" xref="Sx3.SSx2.p2.17.m2.1.1.3.2">𝑉</ci><ci id="Sx3.SSx2.p2.17.m2.1.1.3.3.cmml" xref="Sx3.SSx2.p2.17.m2.1.1.3.3">𝑂</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx2.p2.17.m2.1c">\boldsymbol{f}^{V|O}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx2.p2.17.m2.1d">bold_italic_f start_POSTSUPERSCRIPT italic_V | italic_O end_POSTSUPERSCRIPT</annotation></semantics></math> that enriches the visual information of the indicated object is fed to the subsequent module.</p> </div> <div class="ltx_para" id="Sx3.SSx2.p3"> <p class="ltx_p" id="Sx3.SSx2.p3.1">In this module, we counter-intuitively use the image as the query and the text as the key and value. This approach is driven by two considerations. First, it ensures that visual information is adequately preserved. 
Typically, instruction tokens are much shorter than image tokens, so a text-based query might not carry sufficient information. Second, the LDM <cite class="ltx_cite ltx_citemacro_citep">(Rombach et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib38" title="">2022</a>)</cite> conditions the image feature delivery on the text, also using the image as the query and the text as the key and value, indicating that such modal fusion is feasible. Our subsequent experiments confirm the effectiveness of this role assignment.</p> </div> </section> <section class="ltx_subsection" id="Sx3.SSx3"> <h3 class="ltx_title ltx_title_subsection">Object-focused Visual Semantic Learning</h3> <div class="ltx_para" id="Sx3.SSx3.p1"> <p class="ltx_p" id="Sx3.SSx3.p1.3">During the abstract concept learning phase, the original parameters of the image and text encoders are frozen, and the parameters of the AG-Adapter are trained. In each training batch, we sample <math alttext="b" class="ltx_Math" display="inline" id="Sx3.SSx3.p1.1.m1.1"><semantics id="Sx3.SSx3.p1.1.m1.1a"><mi id="Sx3.SSx3.p1.1.m1.1.1" xref="Sx3.SSx3.p1.1.m1.1.1.cmml">b</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p1.1.m1.1b"><ci id="Sx3.SSx3.p1.1.m1.1.1.cmml" xref="Sx3.SSx3.p1.1.m1.1.1">𝑏</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p1.1.m1.1c">b</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p1.1.m1.1d">italic_b</annotation></semantics></math> image-text-object triples <math alttext="\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{T}_{k},\boldsymbol{x}^{O}_{k}\}^{b}_{% k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p1.2.m2.3"><semantics id="Sx3.SSx3.p1.2.m2.3a"><msubsup id="Sx3.SSx3.p1.2.m2.3.3" xref="Sx3.SSx3.p1.2.m2.3.3.cmml"><mrow id="Sx3.SSx3.p1.2.m2.3.3.3.3.3" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.4.cmml"><mo id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.4" stretchy="false" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.4.cmml">{</mo><msubsup 
id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.2" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.3" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.3.cmml">k</mi><mi id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.3" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.5" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.4.cmml">,</mo><msubsup id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.cmml"><mi id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.2" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.3" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.3.cmml">k</mi><mi id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.3" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.3.cmml">T</mi></msubsup><mo id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.6" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.4.cmml">,</mo><msubsup id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.cmml"><mi id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.2" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.3" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.3.cmml">k</mi><mi id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.3" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.3.cmml">O</mi></msubsup><mo id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.7" stretchy="false" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.4.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p1.2.m2.3.3.5" xref="Sx3.SSx3.p1.2.m2.3.3.5.cmml"><mi id="Sx3.SSx3.p1.2.m2.3.3.5.2" xref="Sx3.SSx3.p1.2.m2.3.3.5.2.cmml">k</mi><mo id="Sx3.SSx3.p1.2.m2.3.3.5.1" xref="Sx3.SSx3.p1.2.m2.3.3.5.1.cmml">=</mo><mn id="Sx3.SSx3.p1.2.m2.3.3.5.3" xref="Sx3.SSx3.p1.2.m2.3.3.5.3.cmml">1</mn></mrow><mi id="Sx3.SSx3.p1.2.m2.3.3.3.5" xref="Sx3.SSx3.p1.2.m2.3.3.3.5.cmml">b</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p1.2.m2.3b"><apply id="Sx3.SSx3.p1.2.m2.3.3.cmml" xref="Sx3.SSx3.p1.2.m2.3.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.3.3.4.cmml" 
xref="Sx3.SSx3.p1.2.m2.3.3">subscript</csymbol><apply id="Sx3.SSx3.p1.2.m2.3.3.3.cmml" xref="Sx3.SSx3.p1.2.m2.3.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.3.3.3.4.cmml" xref="Sx3.SSx3.p1.2.m2.3.3">superscript</csymbol><set id="Sx3.SSx3.p1.2.m2.3.3.3.3.4.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3"><apply id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.1.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.2.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.2">𝒙</ci><ci id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p1.2.m2.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.2">𝒙</ci><ci id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.2.3">𝑇</ci></apply><ci id="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p1.2.m2.2.2.2.2.2.2.3">𝑘</ci></apply><apply id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.1.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3">subscript</csymbol><apply id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3"><csymbol 
cd="ambiguous" id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.1.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3">superscript</csymbol><ci id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.2.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.2">𝒙</ci><ci id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.3.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.2.3">𝑂</ci></apply><ci id="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.3.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.3.3.3.3">𝑘</ci></apply></set><ci id="Sx3.SSx3.p1.2.m2.3.3.3.5.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.3.5">𝑏</ci></apply><apply id="Sx3.SSx3.p1.2.m2.3.3.5.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.5"><eq id="Sx3.SSx3.p1.2.m2.3.3.5.1.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.5.1"></eq><ci id="Sx3.SSx3.p1.2.m2.3.3.5.2.cmml" xref="Sx3.SSx3.p1.2.m2.3.3.5.2">𝑘</ci><cn id="Sx3.SSx3.p1.2.m2.3.3.5.3.cmml" type="integer" xref="Sx3.SSx3.p1.2.m2.3.3.5.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p1.2.m2.3c">\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{T}_{k},\boldsymbol{x}^{O}_{k}\}^{b}_{% k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p1.2.m2.3d">{ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math> and encode them to obtain image-text feature pairs <math alttext="\{\boldsymbol{y}^{I|O}_{k},\boldsymbol{y}^{T}_{k}\}^{b}_{k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p1.3.m3.2"><semantics id="Sx3.SSx3.p1.3.m3.2a"><msubsup id="Sx3.SSx3.p1.3.m3.2.2" xref="Sx3.SSx3.p1.3.m3.2.2.cmml"><mrow id="Sx3.SSx3.p1.3.m3.2.2.2.2.2" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.3.cmml"><mo id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.3" stretchy="false" 
xref="Sx3.SSx3.p1.3.m3.2.2.2.2.3.cmml">{</mo><msubsup id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.2" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.3" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.cmml"><mi id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.2" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.1" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.3" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.4" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.3.cmml">,</mo><msubsup id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.cmml"><mi id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.2" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.3" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.3.cmml">k</mi><mi id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.3" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.3.cmml">T</mi></msubsup><mo id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.5" stretchy="false" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.3.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p1.3.m3.2.2.4" xref="Sx3.SSx3.p1.3.m3.2.2.4.cmml"><mi id="Sx3.SSx3.p1.3.m3.2.2.4.2" xref="Sx3.SSx3.p1.3.m3.2.2.4.2.cmml">k</mi><mo id="Sx3.SSx3.p1.3.m3.2.2.4.1" xref="Sx3.SSx3.p1.3.m3.2.2.4.1.cmml">=</mo><mn id="Sx3.SSx3.p1.3.m3.2.2.4.3" xref="Sx3.SSx3.p1.3.m3.2.2.4.3.cmml">1</mn></mrow><mi id="Sx3.SSx3.p1.3.m3.2.2.2.4" xref="Sx3.SSx3.p1.3.m3.2.2.2.4.cmml">b</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p1.3.m3.2b"><apply id="Sx3.SSx3.p1.3.m3.2.2.cmml" xref="Sx3.SSx3.p1.3.m3.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.3.m3.2.2.3.cmml" xref="Sx3.SSx3.p1.3.m3.2.2">subscript</csymbol><apply id="Sx3.SSx3.p1.3.m3.2.2.2.cmml" xref="Sx3.SSx3.p1.3.m3.2.2"><csymbol 
cd="ambiguous" id="Sx3.SSx3.p1.3.m3.2.2.2.3.cmml" xref="Sx3.SSx3.p1.3.m3.2.2">superscript</csymbol><set id="Sx3.SSx3.p1.3.m3.2.2.2.2.3.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2"><apply id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.1.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.2.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.1.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.2.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.3.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p1.3.m3.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.2">𝒚</ci><ci id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.2.3">𝑇</ci></apply><ci id="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.2.2.2.2.3">𝑘</ci></apply></set><ci id="Sx3.SSx3.p1.3.m3.2.2.2.4.cmml" 
xref="Sx3.SSx3.p1.3.m3.2.2.2.4">𝑏</ci></apply><apply id="Sx3.SSx3.p1.3.m3.2.2.4.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.4"><eq id="Sx3.SSx3.p1.3.m3.2.2.4.1.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.4.1"></eq><ci id="Sx3.SSx3.p1.3.m3.2.2.4.2.cmml" xref="Sx3.SSx3.p1.3.m3.2.2.4.2">𝑘</ci><cn id="Sx3.SSx3.p1.3.m3.2.2.4.3.cmml" type="integer" xref="Sx3.SSx3.p1.3.m3.2.2.4.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p1.3.m3.2c">\{\boldsymbol{y}^{I|O}_{k},\boldsymbol{y}^{T}_{k}\}^{b}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p1.3.m3.2d">{ bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math>. Our goal is to train the AG-Adapter to extract the most relevant visual representations based on semantic instructions. To this end, we jointly optimize three training objectives that share model parameters, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.F3" title="Figure 3 ‣ Model Architecture ‣ Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">3</span></a>.</p> </div> <div class="ltx_para" id="Sx3.SSx3.p2"> <p class="ltx_p" id="Sx3.SSx3.p2.13">The <span class="ltx_text ltx_font_bold" id="Sx3.SSx3.p2.13.1">Object-focused Image-Text Contrast (OITC)</span> objective aligns image and text representations centered on the instructed objects, with the aim of maximizing their mutual information. It requires the visual encoder to generate distinct features for different instructions. 
The similarity between features is computed as follows:</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E7"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="s^{I|O}_{i,j}=s^{T}_{i,j}=(\boldsymbol{y}^{I|O}_{i})^{\top}\boldsymbol{y}^{T}_{j}," class="ltx_Math" display="block" id="Sx3.E7.m1.5"><semantics id="Sx3.E7.m1.5a"><mrow id="Sx3.E7.m1.5.5.1" xref="Sx3.E7.m1.5.5.1.1.cmml"><mrow id="Sx3.E7.m1.5.5.1.1" xref="Sx3.E7.m1.5.5.1.1.cmml"><msubsup id="Sx3.E7.m1.5.5.1.1.3" xref="Sx3.E7.m1.5.5.1.1.3.cmml"><mi id="Sx3.E7.m1.5.5.1.1.3.2.2" xref="Sx3.E7.m1.5.5.1.1.3.2.2.cmml">s</mi><mrow id="Sx3.E7.m1.2.2.2.4" xref="Sx3.E7.m1.2.2.2.3.cmml"><mi id="Sx3.E7.m1.1.1.1.1" xref="Sx3.E7.m1.1.1.1.1.cmml">i</mi><mo id="Sx3.E7.m1.2.2.2.4.1" xref="Sx3.E7.m1.2.2.2.3.cmml">,</mo><mi id="Sx3.E7.m1.2.2.2.2" xref="Sx3.E7.m1.2.2.2.2.cmml">j</mi></mrow><mrow id="Sx3.E7.m1.5.5.1.1.3.2.3" xref="Sx3.E7.m1.5.5.1.1.3.2.3.cmml"><mi id="Sx3.E7.m1.5.5.1.1.3.2.3.2" xref="Sx3.E7.m1.5.5.1.1.3.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E7.m1.5.5.1.1.3.2.3.1" xref="Sx3.E7.m1.5.5.1.1.3.2.3.1.cmml">|</mo><mi id="Sx3.E7.m1.5.5.1.1.3.2.3.3" xref="Sx3.E7.m1.5.5.1.1.3.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.E7.m1.5.5.1.1.4" xref="Sx3.E7.m1.5.5.1.1.4.cmml">=</mo><msubsup id="Sx3.E7.m1.5.5.1.1.5" xref="Sx3.E7.m1.5.5.1.1.5.cmml"><mi id="Sx3.E7.m1.5.5.1.1.5.2.2" xref="Sx3.E7.m1.5.5.1.1.5.2.2.cmml">s</mi><mrow id="Sx3.E7.m1.4.4.2.4" xref="Sx3.E7.m1.4.4.2.3.cmml"><mi id="Sx3.E7.m1.3.3.1.1" xref="Sx3.E7.m1.3.3.1.1.cmml">i</mi><mo id="Sx3.E7.m1.4.4.2.4.1" xref="Sx3.E7.m1.4.4.2.3.cmml">,</mo><mi id="Sx3.E7.m1.4.4.2.2" xref="Sx3.E7.m1.4.4.2.2.cmml">j</mi></mrow><mi id="Sx3.E7.m1.5.5.1.1.5.2.3" xref="Sx3.E7.m1.5.5.1.1.5.2.3.cmml">T</mi></msubsup><mo id="Sx3.E7.m1.5.5.1.1.6" xref="Sx3.E7.m1.5.5.1.1.6.cmml">=</mo><mrow id="Sx3.E7.m1.5.5.1.1.1" 
xref="Sx3.E7.m1.5.5.1.1.1.cmml"><msup id="Sx3.E7.m1.5.5.1.1.1.1" xref="Sx3.E7.m1.5.5.1.1.1.1.cmml"><mrow id="Sx3.E7.m1.5.5.1.1.1.1.1.1" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.cmml"><mo id="Sx3.E7.m1.5.5.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.cmml">(</mo><msubsup id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.2" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.3" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.3.cmml">i</mi><mrow id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.cmml"><mi id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.2" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.1" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.3" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.E7.m1.5.5.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="Sx3.E7.m1.5.5.1.1.1.1.3" xref="Sx3.E7.m1.5.5.1.1.1.1.3.cmml">⊤</mo></msup><mo id="Sx3.E7.m1.5.5.1.1.1.2" xref="Sx3.E7.m1.5.5.1.1.1.2.cmml">⁢</mo><msubsup id="Sx3.E7.m1.5.5.1.1.1.3" xref="Sx3.E7.m1.5.5.1.1.1.3.cmml"><mi id="Sx3.E7.m1.5.5.1.1.1.3.2.2" xref="Sx3.E7.m1.5.5.1.1.1.3.2.2.cmml">𝒚</mi><mi id="Sx3.E7.m1.5.5.1.1.1.3.3" xref="Sx3.E7.m1.5.5.1.1.1.3.3.cmml">j</mi><mi id="Sx3.E7.m1.5.5.1.1.1.3.2.3" xref="Sx3.E7.m1.5.5.1.1.1.3.2.3.cmml">T</mi></msubsup></mrow></mrow><mo id="Sx3.E7.m1.5.5.1.2" xref="Sx3.E7.m1.5.5.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E7.m1.5b"><apply id="Sx3.E7.m1.5.5.1.1.cmml" xref="Sx3.E7.m1.5.5.1"><and id="Sx3.E7.m1.5.5.1.1a.cmml" xref="Sx3.E7.m1.5.5.1"></and><apply id="Sx3.E7.m1.5.5.1.1b.cmml" xref="Sx3.E7.m1.5.5.1"><eq id="Sx3.E7.m1.5.5.1.1.4.cmml" xref="Sx3.E7.m1.5.5.1.1.4"></eq><apply id="Sx3.E7.m1.5.5.1.1.3.cmml" xref="Sx3.E7.m1.5.5.1.1.3"><csymbol cd="ambiguous" 
id="Sx3.E7.m1.5.5.1.1.3.1.cmml" xref="Sx3.E7.m1.5.5.1.1.3">subscript</csymbol><apply id="Sx3.E7.m1.5.5.1.1.3.2.cmml" xref="Sx3.E7.m1.5.5.1.1.3"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.3.2.1.cmml" xref="Sx3.E7.m1.5.5.1.1.3">superscript</csymbol><ci id="Sx3.E7.m1.5.5.1.1.3.2.2.cmml" xref="Sx3.E7.m1.5.5.1.1.3.2.2">𝑠</ci><apply id="Sx3.E7.m1.5.5.1.1.3.2.3.cmml" xref="Sx3.E7.m1.5.5.1.1.3.2.3"><csymbol cd="latexml" id="Sx3.E7.m1.5.5.1.1.3.2.3.1.cmml" xref="Sx3.E7.m1.5.5.1.1.3.2.3.1">conditional</csymbol><ci id="Sx3.E7.m1.5.5.1.1.3.2.3.2.cmml" xref="Sx3.E7.m1.5.5.1.1.3.2.3.2">𝐼</ci><ci id="Sx3.E7.m1.5.5.1.1.3.2.3.3.cmml" xref="Sx3.E7.m1.5.5.1.1.3.2.3.3">𝑂</ci></apply></apply><list id="Sx3.E7.m1.2.2.2.3.cmml" xref="Sx3.E7.m1.2.2.2.4"><ci id="Sx3.E7.m1.1.1.1.1.cmml" xref="Sx3.E7.m1.1.1.1.1">𝑖</ci><ci id="Sx3.E7.m1.2.2.2.2.cmml" xref="Sx3.E7.m1.2.2.2.2">𝑗</ci></list></apply><apply id="Sx3.E7.m1.5.5.1.1.5.cmml" xref="Sx3.E7.m1.5.5.1.1.5"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.5.1.cmml" xref="Sx3.E7.m1.5.5.1.1.5">subscript</csymbol><apply id="Sx3.E7.m1.5.5.1.1.5.2.cmml" xref="Sx3.E7.m1.5.5.1.1.5"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.5.2.1.cmml" xref="Sx3.E7.m1.5.5.1.1.5">superscript</csymbol><ci id="Sx3.E7.m1.5.5.1.1.5.2.2.cmml" xref="Sx3.E7.m1.5.5.1.1.5.2.2">𝑠</ci><ci id="Sx3.E7.m1.5.5.1.1.5.2.3.cmml" xref="Sx3.E7.m1.5.5.1.1.5.2.3">𝑇</ci></apply><list id="Sx3.E7.m1.4.4.2.3.cmml" xref="Sx3.E7.m1.4.4.2.4"><ci id="Sx3.E7.m1.3.3.1.1.cmml" xref="Sx3.E7.m1.3.3.1.1">𝑖</ci><ci id="Sx3.E7.m1.4.4.2.2.cmml" xref="Sx3.E7.m1.4.4.2.2">𝑗</ci></list></apply></apply><apply id="Sx3.E7.m1.5.5.1.1c.cmml" xref="Sx3.E7.m1.5.5.1"><eq id="Sx3.E7.m1.5.5.1.1.6.cmml" xref="Sx3.E7.m1.5.5.1.1.6"></eq><share href="https://arxiv.org/html/2410.20109v1#Sx3.E7.m1.5.5.1.1.5.cmml" id="Sx3.E7.m1.5.5.1.1d.cmml" xref="Sx3.E7.m1.5.5.1"></share><apply id="Sx3.E7.m1.5.5.1.1.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1"><times id="Sx3.E7.m1.5.5.1.1.1.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.2"></times><apply 
id="Sx3.E7.m1.5.5.1.1.1.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.1.1.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1">superscript</csymbol><apply id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.1.1.1.3">𝑖</ci></apply><csymbol cd="latexml" id="Sx3.E7.m1.5.5.1.1.1.1.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.1.3">top</csymbol></apply><apply id="Sx3.E7.m1.5.5.1.1.1.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.1.3.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3">subscript</csymbol><apply id="Sx3.E7.m1.5.5.1.1.1.3.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E7.m1.5.5.1.1.1.3.2.1.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3">superscript</csymbol><ci id="Sx3.E7.m1.5.5.1.1.1.3.2.2.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3.2.2">𝒚</ci><ci id="Sx3.E7.m1.5.5.1.1.1.3.2.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3.2.3">𝑇</ci></apply><ci id="Sx3.E7.m1.5.5.1.1.1.3.3.cmml" xref="Sx3.E7.m1.5.5.1.1.1.3.3">𝑗</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="Sx3.E7.m1.5c">s^{I|O}_{i,j}=s^{T}_{i,j}=(\boldsymbol{y}^{I|O}_{i})^{\top}\boldsymbol{y}^{T}_{j},</annotation><annotation encoding="application/x-llamapun" id="Sx3.E7.m1.5d">italic_s start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = ( bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx3.p2.9">where <math alttext="s^{I|O}_{i,j}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.1.m1.2"><semantics id="Sx3.SSx3.p2.1.m1.2a"><msubsup id="Sx3.SSx3.p2.1.m1.2.3" xref="Sx3.SSx3.p2.1.m1.2.3.cmml"><mi id="Sx3.SSx3.p2.1.m1.2.3.2.2" xref="Sx3.SSx3.p2.1.m1.2.3.2.2.cmml">s</mi><mrow id="Sx3.SSx3.p2.1.m1.2.2.2.4" xref="Sx3.SSx3.p2.1.m1.2.2.2.3.cmml"><mi id="Sx3.SSx3.p2.1.m1.1.1.1.1" xref="Sx3.SSx3.p2.1.m1.1.1.1.1.cmml">i</mi><mo id="Sx3.SSx3.p2.1.m1.2.2.2.4.1" xref="Sx3.SSx3.p2.1.m1.2.2.2.3.cmml">,</mo><mi id="Sx3.SSx3.p2.1.m1.2.2.2.2" xref="Sx3.SSx3.p2.1.m1.2.2.2.2.cmml">j</mi></mrow><mrow id="Sx3.SSx3.p2.1.m1.2.3.2.3" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.cmml"><mi id="Sx3.SSx3.p2.1.m1.2.3.2.3.2" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p2.1.m1.2.3.2.3.1" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p2.1.m1.2.3.2.3.3" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.3.cmml">O</mi></mrow></msubsup><annotation-xml 
encoding="MathML-Content" id="Sx3.SSx3.p2.1.m1.2b"><apply id="Sx3.SSx3.p2.1.m1.2.3.cmml" xref="Sx3.SSx3.p2.1.m1.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.1.m1.2.3.1.cmml" xref="Sx3.SSx3.p2.1.m1.2.3">subscript</csymbol><apply id="Sx3.SSx3.p2.1.m1.2.3.2.cmml" xref="Sx3.SSx3.p2.1.m1.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.1.m1.2.3.2.1.cmml" xref="Sx3.SSx3.p2.1.m1.2.3">superscript</csymbol><ci id="Sx3.SSx3.p2.1.m1.2.3.2.2.cmml" xref="Sx3.SSx3.p2.1.m1.2.3.2.2">𝑠</ci><apply id="Sx3.SSx3.p2.1.m1.2.3.2.3.cmml" xref="Sx3.SSx3.p2.1.m1.2.3.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p2.1.m1.2.3.2.3.1.cmml" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p2.1.m1.2.3.2.3.2.cmml" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p2.1.m1.2.3.2.3.3.cmml" xref="Sx3.SSx3.p2.1.m1.2.3.2.3.3">𝑂</ci></apply></apply><list id="Sx3.SSx3.p2.1.m1.2.2.2.3.cmml" xref="Sx3.SSx3.p2.1.m1.2.2.2.4"><ci id="Sx3.SSx3.p2.1.m1.1.1.1.1.cmml" xref="Sx3.SSx3.p2.1.m1.1.1.1.1">𝑖</ci><ci id="Sx3.SSx3.p2.1.m1.2.2.2.2.cmml" xref="Sx3.SSx3.p2.1.m1.2.2.2.2">𝑗</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.1.m1.2c">s^{I|O}_{i,j}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.1.m1.2d">italic_s start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT</annotation></semantics></math> denotes the similarity of the <math alttext="i" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.2.m2.1"><semantics id="Sx3.SSx3.p2.2.m2.1a"><mi id="Sx3.SSx3.p2.2.m2.1.1" xref="Sx3.SSx3.p2.2.m2.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.2.m2.1b"><ci id="Sx3.SSx3.p2.2.m2.1.1.cmml" xref="Sx3.SSx3.p2.2.m2.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.2.m2.1c">i</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.2.m2.1d">italic_i</annotation></semantics></math>-th conditional image 
feature to the <math alttext="j" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.3.m3.1"><semantics id="Sx3.SSx3.p2.3.m3.1a"><mi id="Sx3.SSx3.p2.3.m3.1.1" xref="Sx3.SSx3.p2.3.m3.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.3.m3.1b"><ci id="Sx3.SSx3.p2.3.m3.1.1.cmml" xref="Sx3.SSx3.p2.3.m3.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.3.m3.1c">j</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.3.m3.1d">italic_j</annotation></semantics></math>-th text feature, and <math alttext="s^{T}_{i,j}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.4.m4.2"><semantics id="Sx3.SSx3.p2.4.m4.2a"><msubsup id="Sx3.SSx3.p2.4.m4.2.3" xref="Sx3.SSx3.p2.4.m4.2.3.cmml"><mi id="Sx3.SSx3.p2.4.m4.2.3.2.2" xref="Sx3.SSx3.p2.4.m4.2.3.2.2.cmml">s</mi><mrow id="Sx3.SSx3.p2.4.m4.2.2.2.4" xref="Sx3.SSx3.p2.4.m4.2.2.2.3.cmml"><mi id="Sx3.SSx3.p2.4.m4.1.1.1.1" xref="Sx3.SSx3.p2.4.m4.1.1.1.1.cmml">i</mi><mo id="Sx3.SSx3.p2.4.m4.2.2.2.4.1" xref="Sx3.SSx3.p2.4.m4.2.2.2.3.cmml">,</mo><mi id="Sx3.SSx3.p2.4.m4.2.2.2.2" xref="Sx3.SSx3.p2.4.m4.2.2.2.2.cmml">j</mi></mrow><mi id="Sx3.SSx3.p2.4.m4.2.3.2.3" xref="Sx3.SSx3.p2.4.m4.2.3.2.3.cmml">T</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.4.m4.2b"><apply id="Sx3.SSx3.p2.4.m4.2.3.cmml" xref="Sx3.SSx3.p2.4.m4.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.4.m4.2.3.1.cmml" xref="Sx3.SSx3.p2.4.m4.2.3">subscript</csymbol><apply id="Sx3.SSx3.p2.4.m4.2.3.2.cmml" xref="Sx3.SSx3.p2.4.m4.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.4.m4.2.3.2.1.cmml" xref="Sx3.SSx3.p2.4.m4.2.3">superscript</csymbol><ci id="Sx3.SSx3.p2.4.m4.2.3.2.2.cmml" xref="Sx3.SSx3.p2.4.m4.2.3.2.2">𝑠</ci><ci id="Sx3.SSx3.p2.4.m4.2.3.2.3.cmml" xref="Sx3.SSx3.p2.4.m4.2.3.2.3">𝑇</ci></apply><list id="Sx3.SSx3.p2.4.m4.2.2.2.3.cmml" xref="Sx3.SSx3.p2.4.m4.2.2.2.4"><ci id="Sx3.SSx3.p2.4.m4.1.1.1.1.cmml" xref="Sx3.SSx3.p2.4.m4.1.1.1.1">𝑖</ci><ci id="Sx3.SSx3.p2.4.m4.2.2.2.2.cmml" 
xref="Sx3.SSx3.p2.4.m4.2.2.2.2">𝑗</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.4.m4.2c">s^{T}_{i,j}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.4.m4.2d">italic_s start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT</annotation></semantics></math> represents the similarity between the <math alttext="i" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.5.m5.1"><semantics id="Sx3.SSx3.p2.5.m5.1a"><mi id="Sx3.SSx3.p2.5.m5.1.1" xref="Sx3.SSx3.p2.5.m5.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.5.m5.1b"><ci id="Sx3.SSx3.p2.5.m5.1.1.cmml" xref="Sx3.SSx3.p2.5.m5.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.5.m5.1c">i</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.5.m5.1d">italic_i</annotation></semantics></math>-th text feature and the <math alttext="j" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.6.m6.1"><semantics id="Sx3.SSx3.p2.6.m6.1a"><mi id="Sx3.SSx3.p2.6.m6.1.1" xref="Sx3.SSx3.p2.6.m6.1.1.cmml">j</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.6.m6.1b"><ci id="Sx3.SSx3.p2.6.m6.1.1.cmml" xref="Sx3.SSx3.p2.6.m6.1.1">𝑗</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.6.m6.1c">j</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.6.m6.1d">italic_j</annotation></semantics></math>-th conditional image feature. 
For the conditional image feature <math alttext="\boldsymbol{y}^{I|O}_{k}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.7.m7.1"><semantics id="Sx3.SSx3.p2.7.m7.1a"><msubsup id="Sx3.SSx3.p2.7.m7.1.1" xref="Sx3.SSx3.p2.7.m7.1.1.cmml"><mi id="Sx3.SSx3.p2.7.m7.1.1.2.2" xref="Sx3.SSx3.p2.7.m7.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p2.7.m7.1.1.3" xref="Sx3.SSx3.p2.7.m7.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p2.7.m7.1.1.2.3" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.cmml"><mi id="Sx3.SSx3.p2.7.m7.1.1.2.3.2" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p2.7.m7.1.1.2.3.1" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p2.7.m7.1.1.2.3.3" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.3.cmml">O</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.7.m7.1b"><apply id="Sx3.SSx3.p2.7.m7.1.1.cmml" xref="Sx3.SSx3.p2.7.m7.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.7.m7.1.1.1.cmml" xref="Sx3.SSx3.p2.7.m7.1.1">subscript</csymbol><apply id="Sx3.SSx3.p2.7.m7.1.1.2.cmml" xref="Sx3.SSx3.p2.7.m7.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.7.m7.1.1.2.1.cmml" xref="Sx3.SSx3.p2.7.m7.1.1">superscript</csymbol><ci id="Sx3.SSx3.p2.7.m7.1.1.2.2.cmml" xref="Sx3.SSx3.p2.7.m7.1.1.2.2">𝒚</ci><apply id="Sx3.SSx3.p2.7.m7.1.1.2.3.cmml" xref="Sx3.SSx3.p2.7.m7.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p2.7.m7.1.1.2.3.1.cmml" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p2.7.m7.1.1.2.3.2.cmml" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p2.7.m7.1.1.2.3.3.cmml" xref="Sx3.SSx3.p2.7.m7.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p2.7.m7.1.1.3.cmml" xref="Sx3.SSx3.p2.7.m7.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.7.m7.1c">\boldsymbol{y}^{I|O}_{k}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.7.m7.1d">bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k 
end_POSTSUBSCRIPT</annotation></semantics></math> in feature pair <math alttext="\{\boldsymbol{y}^{I|O}_{k},\boldsymbol{y}^{T}_{k}\}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.8.m8.2"><semantics id="Sx3.SSx3.p2.8.m8.2a"><mrow id="Sx3.SSx3.p2.8.m8.2.2.2" xref="Sx3.SSx3.p2.8.m8.2.2.3.cmml"><mo id="Sx3.SSx3.p2.8.m8.2.2.2.3" stretchy="false" xref="Sx3.SSx3.p2.8.m8.2.2.3.cmml">{</mo><msubsup id="Sx3.SSx3.p2.8.m8.1.1.1.1" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.cmml"><mi id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.2" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p2.8.m8.1.1.1.1.3" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.cmml"><mi id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.2" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.1" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.3" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.SSx3.p2.8.m8.2.2.2.4" xref="Sx3.SSx3.p2.8.m8.2.2.3.cmml">,</mo><msubsup id="Sx3.SSx3.p2.8.m8.2.2.2.2" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.cmml"><mi id="Sx3.SSx3.p2.8.m8.2.2.2.2.2.2" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p2.8.m8.2.2.2.2.3" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.3.cmml">k</mi><mi id="Sx3.SSx3.p2.8.m8.2.2.2.2.2.3" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.2.3.cmml">T</mi></msubsup><mo id="Sx3.SSx3.p2.8.m8.2.2.2.5" stretchy="false" xref="Sx3.SSx3.p2.8.m8.2.2.3.cmml">}</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.8.m8.2b"><set id="Sx3.SSx3.p2.8.m8.2.2.3.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2"><apply id="Sx3.SSx3.p2.8.m8.1.1.1.1.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.8.m8.1.1.1.1.1.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.1.cmml" 
xref="Sx3.SSx3.p2.8.m8.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.2.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.1.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.2.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.3.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p2.8.m8.1.1.1.1.3.cmml" xref="Sx3.SSx3.p2.8.m8.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.SSx3.p2.8.m8.2.2.2.2.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.8.m8.2.2.2.2.1.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2">subscript</csymbol><apply id="Sx3.SSx3.p2.8.m8.2.2.2.2.2.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.8.m8.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx3.p2.8.m8.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.2.2">𝒚</ci><ci id="Sx3.SSx3.p2.8.m8.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.2.3">𝑇</ci></apply><ci id="Sx3.SSx3.p2.8.m8.2.2.2.2.3.cmml" xref="Sx3.SSx3.p2.8.m8.2.2.2.2.3">𝑘</ci></apply></set></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.8.m8.2c">\{\boldsymbol{y}^{I|O}_{k},\boldsymbol{y}^{T}_{k}\}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.8.m8.2d">{ bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }</annotation></semantics></math>, the corresponding text features <math alttext="\boldsymbol{y}^{T}_{j|\boldsymbol{x}^{I}_{j}=\boldsymbol{x}^{I}_{k}\land% \boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}_{k}}" class="ltx_Math" display="inline" 
id="Sx3.SSx3.p2.9.m9.1"><semantics id="Sx3.SSx3.p2.9.m9.1a"><msubsup id="Sx3.SSx3.p2.9.m9.1.1" xref="Sx3.SSx3.p2.9.m9.1.1.cmml"><mi id="Sx3.SSx3.p2.9.m9.1.1.2.2" xref="Sx3.SSx3.p2.9.m9.1.1.2.2.cmml">𝒚</mi><mrow id="Sx3.SSx3.p2.9.m9.1.1.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.cmml"><mrow id="Sx3.SSx3.p2.9.m9.1.1.3.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.cmml"><mi id="Sx3.SSx3.p2.9.m9.1.1.3.2.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.2.cmml">j</mi><mo fence="false" id="Sx3.SSx3.p2.9.m9.1.1.3.2.1" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.1.cmml">|</mo><msubsup id="Sx3.SSx3.p2.9.m9.1.1.3.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.cmml"><mi id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.3.cmml">j</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.3.cmml">I</mi></msubsup></mrow><mo id="Sx3.SSx3.p2.9.m9.1.1.3.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.3.cmml">=</mo><mrow id="Sx3.SSx3.p2.9.m9.1.1.3.4" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.cmml"><msubsup id="Sx3.SSx3.p2.9.m9.1.1.3.4.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.cmml"><mi id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.3.cmml">k</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.3.cmml">I</mi></msubsup><mo id="Sx3.SSx3.p2.9.m9.1.1.3.4.1" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.1.cmml">∧</mo><msubsup id="Sx3.SSx3.p2.9.m9.1.1.3.4.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.cmml"><mi id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.3.cmml">j</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.SSx3.p2.9.m9.1.1.3.5" xref="Sx3.SSx3.p2.9.m9.1.1.3.5.cmml">=</mo><msubsup id="Sx3.SSx3.p2.9.m9.1.1.3.6" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.cmml"><mi 
id="Sx3.SSx3.p2.9.m9.1.1.3.6.2.2" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.6.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.3.cmml">k</mi><mi id="Sx3.SSx3.p2.9.m9.1.1.3.6.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.2.3.cmml">O</mi></msubsup></mrow><mi id="Sx3.SSx3.p2.9.m9.1.1.2.3" xref="Sx3.SSx3.p2.9.m9.1.1.2.3.cmml">T</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.9.m9.1b"><apply id="Sx3.SSx3.p2.9.m9.1.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1">subscript</csymbol><apply id="Sx3.SSx3.p2.9.m9.1.1.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1">superscript</csymbol><ci id="Sx3.SSx3.p2.9.m9.1.1.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.2.2">𝒚</ci><ci id="Sx3.SSx3.p2.9.m9.1.1.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.2.3">𝑇</ci></apply><apply id="Sx3.SSx3.p2.9.m9.1.1.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3"><and id="Sx3.SSx3.p2.9.m9.1.1.3a.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3"></and><apply id="Sx3.SSx3.p2.9.m9.1.1.3b.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3"><eq id="Sx3.SSx3.p2.9.m9.1.1.3.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.3"></eq><apply id="Sx3.SSx3.p2.9.m9.1.1.3.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2"><csymbol cd="latexml" id="Sx3.SSx3.p2.9.m9.1.1.3.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.1">conditional</csymbol><ci id="Sx3.SSx3.p2.9.m9.1.1.3.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.2">𝑗</ci><apply id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3">subscript</csymbol><apply id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3">superscript</csymbol><ci id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.2">𝒙</ci><ci 
id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p2.9.m9.1.1.3.2.3.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.2.3.3">𝑗</ci></apply></apply><apply id="Sx3.SSx3.p2.9.m9.1.1.3.4.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4"><and id="Sx3.SSx3.p2.9.m9.1.1.3.4.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.1"></and><apply id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2">subscript</csymbol><apply id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2">superscript</csymbol><ci id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.2">𝒙</ci><ci id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p2.9.m9.1.1.3.4.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.2.3">𝑘</ci></apply><apply id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3">subscript</csymbol><apply id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3">superscript</csymbol><ci id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.2">𝒙</ci><ci id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.2.3">𝑂</ci></apply><ci id="Sx3.SSx3.p2.9.m9.1.1.3.4.3.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.4.3.3">𝑗</ci></apply></apply></apply><apply id="Sx3.SSx3.p2.9.m9.1.1.3c.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3"><eq id="Sx3.SSx3.p2.9.m9.1.1.3.5.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.5"></eq><share href="https://arxiv.org/html/2410.20109v1#Sx3.SSx3.p2.9.m9.1.1.3.4.cmml" id="Sx3.SSx3.p2.9.m9.1.1.3d.cmml" 
xref="Sx3.SSx3.p2.9.m9.1.1.3"></share><apply id="Sx3.SSx3.p2.9.m9.1.1.3.6.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.6.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6">subscript</csymbol><apply id="Sx3.SSx3.p2.9.m9.1.1.3.6.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.9.m9.1.1.3.6.2.1.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6">superscript</csymbol><ci id="Sx3.SSx3.p2.9.m9.1.1.3.6.2.2.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.2.2">𝒙</ci><ci id="Sx3.SSx3.p2.9.m9.1.1.3.6.2.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.2.3">𝑂</ci></apply><ci id="Sx3.SSx3.p2.9.m9.1.1.3.6.3.cmml" xref="Sx3.SSx3.p2.9.m9.1.1.3.6.3">𝑘</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.9.m9.1c">\boldsymbol{y}^{T}_{j|\boldsymbol{x}^{I}_{j}=\boldsymbol{x}^{I}_{k}\land% \boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}_{k}}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.9.m9.1d">bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j | bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∧ bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT</annotation></semantics></math> of the same object within the same image are positive, while other text features within the batch are negative. 
The loss of a batch can be represented by</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="A1.EGx1"> <tbody id="Sx3.Ex1"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\mathcal{L}^{I|O}_{k}(\boldsymbol{y}^{I|O}_{k},\{\boldsymbol{y}^{% T}_{j}\}^{b}_{j=1})=" class="ltx_Math" display="inline" id="Sx3.Ex1.m1.2"><semantics id="Sx3.Ex1.m1.2a"><mrow id="Sx3.Ex1.m1.2.2" xref="Sx3.Ex1.m1.2.2.cmml"><mrow id="Sx3.Ex1.m1.2.2.2" xref="Sx3.Ex1.m1.2.2.2.cmml"><msubsup id="Sx3.Ex1.m1.2.2.2.4" xref="Sx3.Ex1.m1.2.2.2.4.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.Ex1.m1.2.2.2.4.2.2" xref="Sx3.Ex1.m1.2.2.2.4.2.2.cmml">ℒ</mi><mi id="Sx3.Ex1.m1.2.2.2.4.3" xref="Sx3.Ex1.m1.2.2.2.4.3.cmml">k</mi><mrow id="Sx3.Ex1.m1.2.2.2.4.2.3" xref="Sx3.Ex1.m1.2.2.2.4.2.3.cmml"><mi id="Sx3.Ex1.m1.2.2.2.4.2.3.2" xref="Sx3.Ex1.m1.2.2.2.4.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.Ex1.m1.2.2.2.4.2.3.1" xref="Sx3.Ex1.m1.2.2.2.4.2.3.1.cmml">|</mo><mi id="Sx3.Ex1.m1.2.2.2.4.2.3.3" xref="Sx3.Ex1.m1.2.2.2.4.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.Ex1.m1.2.2.2.3" xref="Sx3.Ex1.m1.2.2.2.3.cmml">⁢</mo><mrow id="Sx3.Ex1.m1.2.2.2.2.2" xref="Sx3.Ex1.m1.2.2.2.2.3.cmml"><mo id="Sx3.Ex1.m1.2.2.2.2.2.3" stretchy="false" xref="Sx3.Ex1.m1.2.2.2.2.3.cmml">(</mo><msubsup id="Sx3.Ex1.m1.1.1.1.1.1.1" xref="Sx3.Ex1.m1.1.1.1.1.1.1.cmml"><mi id="Sx3.Ex1.m1.1.1.1.1.1.1.2.2" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.Ex1.m1.1.1.1.1.1.1.3" xref="Sx3.Ex1.m1.1.1.1.1.1.1.3.cmml">k</mi><mrow id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.cmml"><mi id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.2" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.1" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.3" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo 
id="Sx3.Ex1.m1.2.2.2.2.2.4" xref="Sx3.Ex1.m1.2.2.2.2.3.cmml">,</mo><msubsup id="Sx3.Ex1.m1.2.2.2.2.2.2" xref="Sx3.Ex1.m1.2.2.2.2.2.2.cmml"><mrow id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.2.cmml"><mo id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.2" stretchy="false" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.2.cmml">{</mo><msubsup id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.cmml"><mi id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.2" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.3" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.3.cmml">j</mi><mi id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.3" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.3.cmml">T</mi></msubsup><mo id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.3" stretchy="false" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.2.cmml">}</mo></mrow><mrow id="Sx3.Ex1.m1.2.2.2.2.2.2.3" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.cmml"><mi id="Sx3.Ex1.m1.2.2.2.2.2.2.3.2" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.2.cmml">j</mi><mo id="Sx3.Ex1.m1.2.2.2.2.2.2.3.1" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.1.cmml">=</mo><mn id="Sx3.Ex1.m1.2.2.2.2.2.2.3.3" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.3.cmml">1</mn></mrow><mi id="Sx3.Ex1.m1.2.2.2.2.2.2.1.3" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.3.cmml">b</mi></msubsup><mo id="Sx3.Ex1.m1.2.2.2.2.2.5" stretchy="false" xref="Sx3.Ex1.m1.2.2.2.2.3.cmml">)</mo></mrow></mrow><mo id="Sx3.Ex1.m1.2.2.3" xref="Sx3.Ex1.m1.2.2.3.cmml">=</mo><mi id="Sx3.Ex1.m1.2.2.4" xref="Sx3.Ex1.m1.2.2.4.cmml"></mi></mrow><annotation-xml encoding="MathML-Content" id="Sx3.Ex1.m1.2b"><apply id="Sx3.Ex1.m1.2.2.cmml" xref="Sx3.Ex1.m1.2.2"><eq id="Sx3.Ex1.m1.2.2.3.cmml" xref="Sx3.Ex1.m1.2.2.3"></eq><apply id="Sx3.Ex1.m1.2.2.2.cmml" xref="Sx3.Ex1.m1.2.2.2"><times id="Sx3.Ex1.m1.2.2.2.3.cmml" xref="Sx3.Ex1.m1.2.2.2.3"></times><apply id="Sx3.Ex1.m1.2.2.2.4.cmml" xref="Sx3.Ex1.m1.2.2.2.4"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.2.2.2.4.1.cmml" xref="Sx3.Ex1.m1.2.2.2.4">subscript</csymbol><apply id="Sx3.Ex1.m1.2.2.2.4.2.cmml" 
xref="Sx3.Ex1.m1.2.2.2.4"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.2.2.2.4.2.1.cmml" xref="Sx3.Ex1.m1.2.2.2.4">superscript</csymbol><ci id="Sx3.Ex1.m1.2.2.2.4.2.2.cmml" xref="Sx3.Ex1.m1.2.2.2.4.2.2">ℒ</ci><apply id="Sx3.Ex1.m1.2.2.2.4.2.3.cmml" xref="Sx3.Ex1.m1.2.2.2.4.2.3"><csymbol cd="latexml" id="Sx3.Ex1.m1.2.2.2.4.2.3.1.cmml" xref="Sx3.Ex1.m1.2.2.2.4.2.3.1">conditional</csymbol><ci id="Sx3.Ex1.m1.2.2.2.4.2.3.2.cmml" xref="Sx3.Ex1.m1.2.2.2.4.2.3.2">𝐼</ci><ci id="Sx3.Ex1.m1.2.2.2.4.2.3.3.cmml" xref="Sx3.Ex1.m1.2.2.2.4.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.Ex1.m1.2.2.2.4.3.cmml" xref="Sx3.Ex1.m1.2.2.2.4.3">𝑘</ci></apply><interval closure="open" id="Sx3.Ex1.m1.2.2.2.2.3.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2"><apply id="Sx3.Ex1.m1.1.1.1.1.1.1.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.1.1.1.1.1.1.1.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.Ex1.m1.1.1.1.1.1.1.2.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.Ex1.m1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.1.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.2.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.3.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.Ex1.m1.1.1.1.1.1.1.3.cmml" xref="Sx3.Ex1.m1.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.Ex1.m1.2.2.2.2.2.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.2.2.2.2.2.2.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.Ex1.m1.2.2.2.2.2.2.1.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.2.2.2.2.2.2.1.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2">superscript</csymbol><set 
id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1"><apply id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.1.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1">subscript</csymbol><apply id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.1.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1">superscript</csymbol><ci id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.2">𝒚</ci><ci id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.3.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.2.3">𝑇</ci></apply><ci id="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.3.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.1.1.1.3">𝑗</ci></apply></set><ci id="Sx3.Ex1.m1.2.2.2.2.2.2.1.3.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.1.3">𝑏</ci></apply><apply id="Sx3.Ex1.m1.2.2.2.2.2.2.3.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3"><eq id="Sx3.Ex1.m1.2.2.2.2.2.2.3.1.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.1"></eq><ci id="Sx3.Ex1.m1.2.2.2.2.2.2.3.2.cmml" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.2">𝑗</ci><cn id="Sx3.Ex1.m1.2.2.2.2.2.2.3.3.cmml" type="integer" xref="Sx3.Ex1.m1.2.2.2.2.2.2.3.3">1</cn></apply></apply></interval></apply><csymbol cd="latexml" id="Sx3.Ex1.m1.2.2.4.cmml" xref="Sx3.Ex1.m1.2.2.4">absent</csymbol></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.Ex1.m1.2c">\displaystyle\mathcal{L}^{I|O}_{k}(\boldsymbol{y}^{I|O}_{k},\{\boldsymbol{y}^{% T}_{j}\}^{b}_{j=1})=</annotation><annotation encoding="application/x-llamapun" id="Sx3.Ex1.m1.2d">caligraphic_L start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , { bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } 
start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT ) =</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="Sx3.E8"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle-\frac{1}{b}\log\frac{\sum_{k,j}\exp(s^{I|O}_{k,j}\mid\boldsymbol% {x}^{I}_{j}=\boldsymbol{x}^{I}_{k}\land\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{% O}_{k})}{\sum_{j}\exp(s^{I|O}_{k,j})}," class="ltx_Math" display="inline" id="Sx3.E8.m1.11"><semantics id="Sx3.E8.m1.11a"><mrow id="Sx3.E8.m1.11.11.1" xref="Sx3.E8.m1.11.11.1.1.cmml"><mrow id="Sx3.E8.m1.11.11.1.1" xref="Sx3.E8.m1.11.11.1.1.cmml"><mo id="Sx3.E8.m1.11.11.1.1a" xref="Sx3.E8.m1.11.11.1.1.cmml">−</mo><mrow id="Sx3.E8.m1.11.11.1.1.2" xref="Sx3.E8.m1.11.11.1.1.2.cmml"><mstyle displaystyle="true" id="Sx3.E8.m1.11.11.1.1.2.2" xref="Sx3.E8.m1.11.11.1.1.2.2.cmml"><mfrac id="Sx3.E8.m1.11.11.1.1.2.2a" xref="Sx3.E8.m1.11.11.1.1.2.2.cmml"><mn id="Sx3.E8.m1.11.11.1.1.2.2.2" xref="Sx3.E8.m1.11.11.1.1.2.2.2.cmml">1</mn><mi id="Sx3.E8.m1.11.11.1.1.2.2.3" xref="Sx3.E8.m1.11.11.1.1.2.2.3.cmml">b</mi></mfrac></mstyle><mo id="Sx3.E8.m1.11.11.1.1.2.1" lspace="0.167em" xref="Sx3.E8.m1.11.11.1.1.2.1.cmml">⁢</mo><mrow id="Sx3.E8.m1.11.11.1.1.2.3" xref="Sx3.E8.m1.11.11.1.1.2.3.cmml"><mi id="Sx3.E8.m1.11.11.1.1.2.3.1" xref="Sx3.E8.m1.11.11.1.1.2.3.1.cmml">log</mi><mo id="Sx3.E8.m1.11.11.1.1.2.3a" lspace="0.167em" xref="Sx3.E8.m1.11.11.1.1.2.3.cmml">⁡</mo><mstyle displaystyle="true" id="Sx3.E8.m1.10.10" xref="Sx3.E8.m1.10.10.cmml"><mfrac id="Sx3.E8.m1.10.10a" xref="Sx3.E8.m1.10.10.cmml"><mrow id="Sx3.E8.m1.6.6.6" xref="Sx3.E8.m1.6.6.6.cmml"><msub id="Sx3.E8.m1.6.6.6.7" xref="Sx3.E8.m1.6.6.6.7.cmml"><mo id="Sx3.E8.m1.6.6.6.7.2" xref="Sx3.E8.m1.6.6.6.7.2.cmml">∑</mo><mrow id="Sx3.E8.m1.2.2.2.2.2.4" 
xref="Sx3.E8.m1.2.2.2.2.2.3.cmml"><mi id="Sx3.E8.m1.1.1.1.1.1.1" xref="Sx3.E8.m1.1.1.1.1.1.1.cmml">k</mi><mo id="Sx3.E8.m1.2.2.2.2.2.4.1" xref="Sx3.E8.m1.2.2.2.2.2.3.cmml">,</mo><mi id="Sx3.E8.m1.2.2.2.2.2.2" xref="Sx3.E8.m1.2.2.2.2.2.2.cmml">j</mi></mrow></msub><mrow id="Sx3.E8.m1.6.6.6.6.1" xref="Sx3.E8.m1.6.6.6.6.2.cmml"><mi id="Sx3.E8.m1.5.5.5.5" xref="Sx3.E8.m1.5.5.5.5.cmml">exp</mi><mo id="Sx3.E8.m1.6.6.6.6.1a" xref="Sx3.E8.m1.6.6.6.6.2.cmml">⁡</mo><mrow id="Sx3.E8.m1.6.6.6.6.1.1" xref="Sx3.E8.m1.6.6.6.6.2.cmml"><mo id="Sx3.E8.m1.6.6.6.6.1.1.2" stretchy="false" xref="Sx3.E8.m1.6.6.6.6.2.cmml">(</mo><mrow id="Sx3.E8.m1.6.6.6.6.1.1.1" xref="Sx3.E8.m1.6.6.6.6.1.1.1.cmml"><mrow id="Sx3.E8.m1.6.6.6.6.1.1.1.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.cmml"><msubsup id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.cmml"><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.2.cmml">s</mi><mrow id="Sx3.E8.m1.4.4.4.4.2.4" xref="Sx3.E8.m1.4.4.4.4.2.3.cmml"><mi id="Sx3.E8.m1.3.3.3.3.1.1" xref="Sx3.E8.m1.3.3.3.3.1.1.cmml">k</mi><mo id="Sx3.E8.m1.4.4.4.4.2.4.1" xref="Sx3.E8.m1.4.4.4.4.2.3.cmml">,</mo><mi id="Sx3.E8.m1.4.4.4.4.2.2" xref="Sx3.E8.m1.4.4.4.4.2.2.cmml">j</mi></mrow><mrow id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.cmml"><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.1" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.1.cmml">|</mo><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.E8.m1.6.6.6.6.1.1.1.2.1" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.1.cmml">∣</mo><msubsup id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.cmml"><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.2.cmml">𝒙</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.3.cmml">j</mi><mi 
id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.3.cmml">I</mi></msubsup></mrow><mo id="Sx3.E8.m1.6.6.6.6.1.1.1.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.3.cmml">=</mo><mrow id="Sx3.E8.m1.6.6.6.6.1.1.1.4" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.cmml"><msubsup id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.cmml"><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.2.cmml">𝒙</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.3.cmml">k</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.3.cmml">I</mi></msubsup><mo id="Sx3.E8.m1.6.6.6.6.1.1.1.4.1" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.1.cmml">∧</mo><msubsup id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.cmml"><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.2.cmml">𝒙</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.3.cmml">j</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.E8.m1.6.6.6.6.1.1.1.5" xref="Sx3.E8.m1.6.6.6.6.1.1.1.5.cmml">=</mo><msubsup id="Sx3.E8.m1.6.6.6.6.1.1.1.6" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.cmml"><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.2" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.2.cmml">𝒙</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.6.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.3.cmml">k</mi><mi id="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.3" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.E8.m1.6.6.6.6.1.1.3" stretchy="false" xref="Sx3.E8.m1.6.6.6.6.2.cmml">)</mo></mrow></mrow></mrow><mrow id="Sx3.E8.m1.10.10.10" xref="Sx3.E8.m1.10.10.10.cmml"><msub id="Sx3.E8.m1.10.10.10.5" xref="Sx3.E8.m1.10.10.10.5.cmml"><mo id="Sx3.E8.m1.10.10.10.5.2" xref="Sx3.E8.m1.10.10.10.5.2.cmml">∑</mo><mi id="Sx3.E8.m1.10.10.10.5.3" xref="Sx3.E8.m1.10.10.10.5.3.cmml">j</mi></msub><mrow id="Sx3.E8.m1.10.10.10.4.1" xref="Sx3.E8.m1.10.10.10.4.2.cmml"><mi id="Sx3.E8.m1.9.9.9.3" 
xref="Sx3.E8.m1.9.9.9.3.cmml">exp</mi><mo id="Sx3.E8.m1.10.10.10.4.1a" xref="Sx3.E8.m1.10.10.10.4.2.cmml">⁡</mo><mrow id="Sx3.E8.m1.10.10.10.4.1.1" xref="Sx3.E8.m1.10.10.10.4.2.cmml"><mo id="Sx3.E8.m1.10.10.10.4.1.1.2" stretchy="false" xref="Sx3.E8.m1.10.10.10.4.2.cmml">(</mo><msubsup id="Sx3.E8.m1.10.10.10.4.1.1.1" xref="Sx3.E8.m1.10.10.10.4.1.1.1.cmml"><mi id="Sx3.E8.m1.10.10.10.4.1.1.1.2.2" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.2.cmml">s</mi><mrow id="Sx3.E8.m1.8.8.8.2.2.4" xref="Sx3.E8.m1.8.8.8.2.2.3.cmml"><mi id="Sx3.E8.m1.7.7.7.1.1.1" xref="Sx3.E8.m1.7.7.7.1.1.1.cmml">k</mi><mo id="Sx3.E8.m1.8.8.8.2.2.4.1" xref="Sx3.E8.m1.8.8.8.2.2.3.cmml">,</mo><mi id="Sx3.E8.m1.8.8.8.2.2.2" xref="Sx3.E8.m1.8.8.8.2.2.2.cmml">j</mi></mrow><mrow id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.cmml"><mi id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.2" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.1" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.3" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.E8.m1.10.10.10.4.1.1.3" stretchy="false" xref="Sx3.E8.m1.10.10.10.4.2.cmml">)</mo></mrow></mrow></mrow></mfrac></mstyle></mrow></mrow></mrow><mo id="Sx3.E8.m1.11.11.1.2" xref="Sx3.E8.m1.11.11.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E8.m1.11b"><apply id="Sx3.E8.m1.11.11.1.1.cmml" xref="Sx3.E8.m1.11.11.1"><minus id="Sx3.E8.m1.11.11.1.1.1.cmml" xref="Sx3.E8.m1.11.11.1"></minus><apply id="Sx3.E8.m1.11.11.1.1.2.cmml" xref="Sx3.E8.m1.11.11.1.1.2"><times id="Sx3.E8.m1.11.11.1.1.2.1.cmml" xref="Sx3.E8.m1.11.11.1.1.2.1"></times><apply id="Sx3.E8.m1.11.11.1.1.2.2.cmml" xref="Sx3.E8.m1.11.11.1.1.2.2"><divide id="Sx3.E8.m1.11.11.1.1.2.2.1.cmml" xref="Sx3.E8.m1.11.11.1.1.2.2"></divide><cn id="Sx3.E8.m1.11.11.1.1.2.2.2.cmml" type="integer" xref="Sx3.E8.m1.11.11.1.1.2.2.2">1</cn><ci id="Sx3.E8.m1.11.11.1.1.2.2.3.cmml" 
xref="Sx3.E8.m1.11.11.1.1.2.2.3">𝑏</ci></apply><apply id="Sx3.E8.m1.11.11.1.1.2.3.cmml" xref="Sx3.E8.m1.11.11.1.1.2.3"><log id="Sx3.E8.m1.11.11.1.1.2.3.1.cmml" xref="Sx3.E8.m1.11.11.1.1.2.3.1"></log><apply id="Sx3.E8.m1.10.10.cmml" xref="Sx3.E8.m1.10.10"><divide id="Sx3.E8.m1.10.10.11.cmml" xref="Sx3.E8.m1.10.10"></divide><apply id="Sx3.E8.m1.6.6.6.cmml" xref="Sx3.E8.m1.6.6.6"><apply id="Sx3.E8.m1.6.6.6.7.cmml" xref="Sx3.E8.m1.6.6.6.7"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.7.1.cmml" xref="Sx3.E8.m1.6.6.6.7">subscript</csymbol><sum id="Sx3.E8.m1.6.6.6.7.2.cmml" xref="Sx3.E8.m1.6.6.6.7.2"></sum><list id="Sx3.E8.m1.2.2.2.2.2.3.cmml" xref="Sx3.E8.m1.2.2.2.2.2.4"><ci id="Sx3.E8.m1.1.1.1.1.1.1.cmml" xref="Sx3.E8.m1.1.1.1.1.1.1">𝑘</ci><ci id="Sx3.E8.m1.2.2.2.2.2.2.cmml" xref="Sx3.E8.m1.2.2.2.2.2.2">𝑗</ci></list></apply><apply id="Sx3.E8.m1.6.6.6.6.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1"><exp id="Sx3.E8.m1.5.5.5.5.cmml" xref="Sx3.E8.m1.5.5.5.5"></exp><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1"><and id="Sx3.E8.m1.6.6.6.6.1.1.1a.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1"></and><apply id="Sx3.E8.m1.6.6.6.6.1.1.1b.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1"><eq id="Sx3.E8.m1.6.6.6.6.1.1.1.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.3"></eq><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2"><csymbol cd="latexml" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.1">conditional</csymbol><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2">subscript</csymbol><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2">superscript</csymbol><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.2">𝑠</ci><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.cmml" 
xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3"><csymbol cd="latexml" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.1">conditional</csymbol><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.2">𝐼</ci><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.2.2.3.3">𝑂</ci></apply></apply><list id="Sx3.E8.m1.4.4.4.4.2.3.cmml" xref="Sx3.E8.m1.4.4.4.4.2.4"><ci id="Sx3.E8.m1.3.3.3.3.1.1.cmml" xref="Sx3.E8.m1.3.3.3.3.1.1">𝑘</ci><ci id="Sx3.E8.m1.4.4.4.4.2.2.cmml" xref="Sx3.E8.m1.4.4.4.4.2.2">𝑗</ci></list></apply><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3">subscript</csymbol><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3">superscript</csymbol><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.2">𝒙</ci><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.2.3">𝐼</ci></apply><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.2.3.3">𝑗</ci></apply></apply><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.4.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4"><and id="Sx3.E8.m1.6.6.6.6.1.1.1.4.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.1"></and><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2">subscript</csymbol><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2">superscript</csymbol><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.2">𝒙</ci><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.3.cmml" 
xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.2.3">𝐼</ci></apply><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.2.3">𝑘</ci></apply><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3">subscript</csymbol><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3">superscript</csymbol><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.2">𝒙</ci><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.2.3">𝑂</ci></apply><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.4.3.3">𝑗</ci></apply></apply></apply><apply id="Sx3.E8.m1.6.6.6.6.1.1.1c.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1"><eq id="Sx3.E8.m1.6.6.6.6.1.1.1.5.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.5"></eq><share href="https://arxiv.org/html/2410.20109v1#Sx3.E8.m1.6.6.6.6.1.1.1.4.cmml" id="Sx3.E8.m1.6.6.6.6.1.1.1d.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1"></share><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.6.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.6.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6">subscript</csymbol><apply id="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6"><csymbol cd="ambiguous" id="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.1.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6">superscript</csymbol><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.2.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.2">𝒙</ci><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.2.3">𝑂</ci></apply><ci id="Sx3.E8.m1.6.6.6.6.1.1.1.6.3.cmml" xref="Sx3.E8.m1.6.6.6.6.1.1.1.6.3">𝑘</ci></apply></apply></apply></apply></apply><apply id="Sx3.E8.m1.10.10.10.cmml" xref="Sx3.E8.m1.10.10.10"><apply id="Sx3.E8.m1.10.10.10.5.cmml" xref="Sx3.E8.m1.10.10.10.5"><csymbol cd="ambiguous" 
id="Sx3.E8.m1.10.10.10.5.1.cmml" xref="Sx3.E8.m1.10.10.10.5">subscript</csymbol><sum id="Sx3.E8.m1.10.10.10.5.2.cmml" xref="Sx3.E8.m1.10.10.10.5.2"></sum><ci id="Sx3.E8.m1.10.10.10.5.3.cmml" xref="Sx3.E8.m1.10.10.10.5.3">𝑗</ci></apply><apply id="Sx3.E8.m1.10.10.10.4.2.cmml" xref="Sx3.E8.m1.10.10.10.4.1"><exp id="Sx3.E8.m1.9.9.9.3.cmml" xref="Sx3.E8.m1.9.9.9.3"></exp><apply id="Sx3.E8.m1.10.10.10.4.1.1.1.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1"><csymbol cd="ambiguous" id="Sx3.E8.m1.10.10.10.4.1.1.1.1.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1">subscript</csymbol><apply id="Sx3.E8.m1.10.10.10.4.1.1.1.2.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1"><csymbol cd="ambiguous" id="Sx3.E8.m1.10.10.10.4.1.1.1.2.1.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1">superscript</csymbol><ci id="Sx3.E8.m1.10.10.10.4.1.1.1.2.2.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.2">𝑠</ci><apply id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.1.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.2.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.3.cmml" xref="Sx3.E8.m1.10.10.10.4.1.1.1.2.3.3">𝑂</ci></apply></apply><list id="Sx3.E8.m1.8.8.8.2.2.3.cmml" xref="Sx3.E8.m1.8.8.8.2.2.4"><ci id="Sx3.E8.m1.7.7.7.1.1.1.cmml" xref="Sx3.E8.m1.7.7.7.1.1.1">𝑘</ci><ci id="Sx3.E8.m1.8.8.8.2.2.2.cmml" xref="Sx3.E8.m1.8.8.8.2.2.2">𝑗</ci></list></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E8.m1.11c">\displaystyle-\frac{1}{b}\log\frac{\sum_{k,j}\exp(s^{I|O}_{k,j}\mid\boldsymbol% {x}^{I}_{j}=\boldsymbol{x}^{I}_{k}\land\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{% O}_{k})}{\sum_{j}\exp(s^{I|O}_{k,j})},</annotation><annotation encoding="application/x-llamapun" id="Sx3.E8.m1.11d">- divide start_ARG 1 end_ARG start_ARG italic_b end_ARG roman_log divide start_ARG ∑ start_POSTSUBSCRIPT 
italic_k , italic_j end_POSTSUBSCRIPT roman_exp ( italic_s start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT ∣ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∧ bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_exp ( italic_s start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT ) end_ARG ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(8)</span></td> </tr></tbody> </table> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="A1.EGx2"> <tbody id="Sx3.Ex2"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\mathcal{L}^{T}_{k}(\boldsymbol{y}^{T}_{k},\{\boldsymbol{y}^{I|O}% _{j}\}^{b}_{j=1})=" class="ltx_Math" display="inline" id="Sx3.Ex2.m1.2"><semantics id="Sx3.Ex2.m1.2a"><mrow id="Sx3.Ex2.m1.2.2" xref="Sx3.Ex2.m1.2.2.cmml"><mrow id="Sx3.Ex2.m1.2.2.2" xref="Sx3.Ex2.m1.2.2.2.cmml"><msubsup id="Sx3.Ex2.m1.2.2.2.4" xref="Sx3.Ex2.m1.2.2.2.4.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.Ex2.m1.2.2.2.4.2.2" xref="Sx3.Ex2.m1.2.2.2.4.2.2.cmml">ℒ</mi><mi id="Sx3.Ex2.m1.2.2.2.4.3" xref="Sx3.Ex2.m1.2.2.2.4.3.cmml">k</mi><mi id="Sx3.Ex2.m1.2.2.2.4.2.3" 
xref="Sx3.Ex2.m1.2.2.2.4.2.3.cmml">T</mi></msubsup><mo id="Sx3.Ex2.m1.2.2.2.3" xref="Sx3.Ex2.m1.2.2.2.3.cmml">⁢</mo><mrow id="Sx3.Ex2.m1.2.2.2.2.2" xref="Sx3.Ex2.m1.2.2.2.2.3.cmml"><mo id="Sx3.Ex2.m1.2.2.2.2.2.3" stretchy="false" xref="Sx3.Ex2.m1.2.2.2.2.3.cmml">(</mo><msubsup id="Sx3.Ex2.m1.1.1.1.1.1.1" xref="Sx3.Ex2.m1.1.1.1.1.1.1.cmml"><mi id="Sx3.Ex2.m1.1.1.1.1.1.1.2.2" xref="Sx3.Ex2.m1.1.1.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.Ex2.m1.1.1.1.1.1.1.3" xref="Sx3.Ex2.m1.1.1.1.1.1.1.3.cmml">k</mi><mi id="Sx3.Ex2.m1.1.1.1.1.1.1.2.3" xref="Sx3.Ex2.m1.1.1.1.1.1.1.2.3.cmml">T</mi></msubsup><mo id="Sx3.Ex2.m1.2.2.2.2.2.4" xref="Sx3.Ex2.m1.2.2.2.2.3.cmml">,</mo><msubsup id="Sx3.Ex2.m1.2.2.2.2.2.2" xref="Sx3.Ex2.m1.2.2.2.2.2.2.cmml"><mrow id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.2.cmml"><mo id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.2" stretchy="false" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.2.cmml">{</mo><msubsup id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.cmml"><mi id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.2" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.3" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.3.cmml">j</mi><mrow id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.cmml"><mi id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.2" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.1" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.3" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.3" stretchy="false" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.2.cmml">}</mo></mrow><mrow id="Sx3.Ex2.m1.2.2.2.2.2.2.3" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.cmml"><mi id="Sx3.Ex2.m1.2.2.2.2.2.2.3.2" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.2.cmml">j</mi><mo id="Sx3.Ex2.m1.2.2.2.2.2.2.3.1" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.1.cmml">=</mo><mn 
id="Sx3.Ex2.m1.2.2.2.2.2.2.3.3" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.3.cmml">1</mn></mrow><mi id="Sx3.Ex2.m1.2.2.2.2.2.2.1.3" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.3.cmml">b</mi></msubsup><mo id="Sx3.Ex2.m1.2.2.2.2.2.5" stretchy="false" xref="Sx3.Ex2.m1.2.2.2.2.3.cmml">)</mo></mrow></mrow><mo id="Sx3.Ex2.m1.2.2.3" xref="Sx3.Ex2.m1.2.2.3.cmml">=</mo><mi id="Sx3.Ex2.m1.2.2.4" xref="Sx3.Ex2.m1.2.2.4.cmml"></mi></mrow><annotation-xml encoding="MathML-Content" id="Sx3.Ex2.m1.2b"><apply id="Sx3.Ex2.m1.2.2.cmml" xref="Sx3.Ex2.m1.2.2"><eq id="Sx3.Ex2.m1.2.2.3.cmml" xref="Sx3.Ex2.m1.2.2.3"></eq><apply id="Sx3.Ex2.m1.2.2.2.cmml" xref="Sx3.Ex2.m1.2.2.2"><times id="Sx3.Ex2.m1.2.2.2.3.cmml" xref="Sx3.Ex2.m1.2.2.2.3"></times><apply id="Sx3.Ex2.m1.2.2.2.4.cmml" xref="Sx3.Ex2.m1.2.2.2.4"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.2.2.2.4.1.cmml" xref="Sx3.Ex2.m1.2.2.2.4">subscript</csymbol><apply id="Sx3.Ex2.m1.2.2.2.4.2.cmml" xref="Sx3.Ex2.m1.2.2.2.4"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.2.2.2.4.2.1.cmml" xref="Sx3.Ex2.m1.2.2.2.4">superscript</csymbol><ci id="Sx3.Ex2.m1.2.2.2.4.2.2.cmml" xref="Sx3.Ex2.m1.2.2.2.4.2.2">ℒ</ci><ci id="Sx3.Ex2.m1.2.2.2.4.2.3.cmml" xref="Sx3.Ex2.m1.2.2.2.4.2.3">𝑇</ci></apply><ci id="Sx3.Ex2.m1.2.2.2.4.3.cmml" xref="Sx3.Ex2.m1.2.2.2.4.3">𝑘</ci></apply><interval closure="open" id="Sx3.Ex2.m1.2.2.2.2.3.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2"><apply id="Sx3.Ex2.m1.1.1.1.1.1.1.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.1.1.1.1.1.1.1.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.Ex2.m1.1.1.1.1.1.1.2.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.Ex2.m1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1.2.2">𝒚</ci><ci id="Sx3.Ex2.m1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1.2.3">𝑇</ci></apply><ci id="Sx3.Ex2.m1.1.1.1.1.1.1.3.cmml" xref="Sx3.Ex2.m1.1.1.1.1.1.1.3">𝑘</ci></apply><apply 
id="Sx3.Ex2.m1.2.2.2.2.2.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.2.2.2.2.2.2.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.Ex2.m1.2.2.2.2.2.2.1.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.2.2.2.2.2.2.1.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2">superscript</csymbol><set id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1"><apply id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.1.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1">subscript</csymbol><apply id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.1.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1">superscript</csymbol><ci id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.1.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.3.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.3.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.1.1.1.3">𝑗</ci></apply></set><ci id="Sx3.Ex2.m1.2.2.2.2.2.2.1.3.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.1.3">𝑏</ci></apply><apply id="Sx3.Ex2.m1.2.2.2.2.2.2.3.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3"><eq id="Sx3.Ex2.m1.2.2.2.2.2.2.3.1.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.1"></eq><ci id="Sx3.Ex2.m1.2.2.2.2.2.2.3.2.cmml" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.2">𝑗</ci><cn id="Sx3.Ex2.m1.2.2.2.2.2.2.3.3.cmml" type="integer" xref="Sx3.Ex2.m1.2.2.2.2.2.2.3.3">1</cn></apply></apply></interval></apply><csymbol cd="latexml" 
id="Sx3.Ex2.m1.2.2.4.cmml" xref="Sx3.Ex2.m1.2.2.4">absent</csymbol></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.Ex2.m1.2c">\displaystyle\mathcal{L}^{T}_{k}(\boldsymbol{y}^{T}_{k},\{\boldsymbol{y}^{I|O}% _{j}\}^{b}_{j=1})=</annotation><annotation encoding="application/x-llamapun" id="Sx3.Ex2.m1.2d">caligraphic_L start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_y start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , { bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT ) =</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="Sx3.E9"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle-\frac{1}{b}\log\frac{\sum_{k,j}\exp(s^{T}_{k,j}\mid\boldsymbol{x% }^{I}_{j}=\boldsymbol{x}^{I}_{k}\land\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}% _{k})}{\sum_{j}\exp(s^{T}_{k,j})}," class="ltx_Math" display="inline" id="Sx3.E9.m1.11"><semantics id="Sx3.E9.m1.11a"><mrow id="Sx3.E9.m1.11.11.1" xref="Sx3.E9.m1.11.11.1.1.cmml"><mrow id="Sx3.E9.m1.11.11.1.1" xref="Sx3.E9.m1.11.11.1.1.cmml"><mo id="Sx3.E9.m1.11.11.1.1a" xref="Sx3.E9.m1.11.11.1.1.cmml">−</mo><mrow id="Sx3.E9.m1.11.11.1.1.2" xref="Sx3.E9.m1.11.11.1.1.2.cmml"><mstyle displaystyle="true" id="Sx3.E9.m1.11.11.1.1.2.2" xref="Sx3.E9.m1.11.11.1.1.2.2.cmml"><mfrac id="Sx3.E9.m1.11.11.1.1.2.2a" xref="Sx3.E9.m1.11.11.1.1.2.2.cmml"><mn id="Sx3.E9.m1.11.11.1.1.2.2.2" xref="Sx3.E9.m1.11.11.1.1.2.2.2.cmml">1</mn><mi id="Sx3.E9.m1.11.11.1.1.2.2.3" xref="Sx3.E9.m1.11.11.1.1.2.2.3.cmml">b</mi></mfrac></mstyle><mo id="Sx3.E9.m1.11.11.1.1.2.1" 
lspace="0.167em" xref="Sx3.E9.m1.11.11.1.1.2.1.cmml">⁢</mo><mrow id="Sx3.E9.m1.11.11.1.1.2.3" xref="Sx3.E9.m1.11.11.1.1.2.3.cmml"><mi id="Sx3.E9.m1.11.11.1.1.2.3.1" xref="Sx3.E9.m1.11.11.1.1.2.3.1.cmml">log</mi><mo id="Sx3.E9.m1.11.11.1.1.2.3a" lspace="0.167em" xref="Sx3.E9.m1.11.11.1.1.2.3.cmml">⁡</mo><mstyle displaystyle="true" id="Sx3.E9.m1.10.10" xref="Sx3.E9.m1.10.10.cmml"><mfrac id="Sx3.E9.m1.10.10a" xref="Sx3.E9.m1.10.10.cmml"><mrow id="Sx3.E9.m1.6.6.6" xref="Sx3.E9.m1.6.6.6.cmml"><msub id="Sx3.E9.m1.6.6.6.7" xref="Sx3.E9.m1.6.6.6.7.cmml"><mo id="Sx3.E9.m1.6.6.6.7.2" xref="Sx3.E9.m1.6.6.6.7.2.cmml">∑</mo><mrow id="Sx3.E9.m1.2.2.2.2.2.4" xref="Sx3.E9.m1.2.2.2.2.2.3.cmml"><mi id="Sx3.E9.m1.1.1.1.1.1.1" xref="Sx3.E9.m1.1.1.1.1.1.1.cmml">k</mi><mo id="Sx3.E9.m1.2.2.2.2.2.4.1" xref="Sx3.E9.m1.2.2.2.2.2.3.cmml">,</mo><mi id="Sx3.E9.m1.2.2.2.2.2.2" xref="Sx3.E9.m1.2.2.2.2.2.2.cmml">j</mi></mrow></msub><mrow id="Sx3.E9.m1.6.6.6.6.1" xref="Sx3.E9.m1.6.6.6.6.2.cmml"><mi id="Sx3.E9.m1.5.5.5.5" xref="Sx3.E9.m1.5.5.5.5.cmml">exp</mi><mo id="Sx3.E9.m1.6.6.6.6.1a" xref="Sx3.E9.m1.6.6.6.6.2.cmml">⁡</mo><mrow id="Sx3.E9.m1.6.6.6.6.1.1" xref="Sx3.E9.m1.6.6.6.6.2.cmml"><mo id="Sx3.E9.m1.6.6.6.6.1.1.2" stretchy="false" xref="Sx3.E9.m1.6.6.6.6.2.cmml">(</mo><mrow id="Sx3.E9.m1.6.6.6.6.1.1.1" xref="Sx3.E9.m1.6.6.6.6.1.1.1.cmml"><mrow id="Sx3.E9.m1.6.6.6.6.1.1.1.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.cmml"><msubsup id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.cmml"><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.2.cmml">s</mi><mrow id="Sx3.E9.m1.4.4.4.4.2.4" xref="Sx3.E9.m1.4.4.4.4.2.3.cmml"><mi id="Sx3.E9.m1.3.3.3.3.1.1" xref="Sx3.E9.m1.3.3.3.3.1.1.cmml">k</mi><mo id="Sx3.E9.m1.4.4.4.4.2.4.1" xref="Sx3.E9.m1.4.4.4.4.2.3.cmml">,</mo><mi id="Sx3.E9.m1.4.4.4.4.2.2" xref="Sx3.E9.m1.4.4.4.4.2.2.cmml">j</mi></mrow><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.3.cmml">T</mi></msubsup><mo 
id="Sx3.E9.m1.6.6.6.6.1.1.1.2.1" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.1.cmml">∣</mo><msubsup id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.cmml"><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.2.cmml">𝒙</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.3.cmml">j</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.3.cmml">I</mi></msubsup></mrow><mo id="Sx3.E9.m1.6.6.6.6.1.1.1.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.3.cmml">=</mo><mrow id="Sx3.E9.m1.6.6.6.6.1.1.1.4" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.cmml"><msubsup id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.cmml"><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.2.cmml">𝒙</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.3.cmml">k</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.3.cmml">I</mi></msubsup><mo id="Sx3.E9.m1.6.6.6.6.1.1.1.4.1" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.1.cmml">∧</mo><msubsup id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.cmml"><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.2.cmml">𝒙</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.3.cmml">j</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.E9.m1.6.6.6.6.1.1.1.5" xref="Sx3.E9.m1.6.6.6.6.1.1.1.5.cmml">=</mo><msubsup id="Sx3.E9.m1.6.6.6.6.1.1.1.6" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.cmml"><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.2" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.2.cmml">𝒙</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.6.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.3.cmml">k</mi><mi id="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.3" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.E9.m1.6.6.6.6.1.1.3" stretchy="false" xref="Sx3.E9.m1.6.6.6.6.2.cmml">)</mo></mrow></mrow></mrow><mrow id="Sx3.E9.m1.10.10.10" 
xref="Sx3.E9.m1.10.10.10.cmml"><msub id="Sx3.E9.m1.10.10.10.5" xref="Sx3.E9.m1.10.10.10.5.cmml"><mo id="Sx3.E9.m1.10.10.10.5.2" xref="Sx3.E9.m1.10.10.10.5.2.cmml">∑</mo><mi id="Sx3.E9.m1.10.10.10.5.3" xref="Sx3.E9.m1.10.10.10.5.3.cmml">j</mi></msub><mrow id="Sx3.E9.m1.10.10.10.4.1" xref="Sx3.E9.m1.10.10.10.4.2.cmml"><mi id="Sx3.E9.m1.9.9.9.3" xref="Sx3.E9.m1.9.9.9.3.cmml">exp</mi><mo id="Sx3.E9.m1.10.10.10.4.1a" xref="Sx3.E9.m1.10.10.10.4.2.cmml">⁡</mo><mrow id="Sx3.E9.m1.10.10.10.4.1.1" xref="Sx3.E9.m1.10.10.10.4.2.cmml"><mo id="Sx3.E9.m1.10.10.10.4.1.1.2" stretchy="false" xref="Sx3.E9.m1.10.10.10.4.2.cmml">(</mo><msubsup id="Sx3.E9.m1.10.10.10.4.1.1.1" xref="Sx3.E9.m1.10.10.10.4.1.1.1.cmml"><mi id="Sx3.E9.m1.10.10.10.4.1.1.1.2.2" xref="Sx3.E9.m1.10.10.10.4.1.1.1.2.2.cmml">s</mi><mrow id="Sx3.E9.m1.8.8.8.2.2.4" xref="Sx3.E9.m1.8.8.8.2.2.3.cmml"><mi id="Sx3.E9.m1.7.7.7.1.1.1" xref="Sx3.E9.m1.7.7.7.1.1.1.cmml">k</mi><mo id="Sx3.E9.m1.8.8.8.2.2.4.1" xref="Sx3.E9.m1.8.8.8.2.2.3.cmml">,</mo><mi id="Sx3.E9.m1.8.8.8.2.2.2" xref="Sx3.E9.m1.8.8.8.2.2.2.cmml">j</mi></mrow><mi id="Sx3.E9.m1.10.10.10.4.1.1.1.2.3" xref="Sx3.E9.m1.10.10.10.4.1.1.1.2.3.cmml">T</mi></msubsup><mo id="Sx3.E9.m1.10.10.10.4.1.1.3" stretchy="false" xref="Sx3.E9.m1.10.10.10.4.2.cmml">)</mo></mrow></mrow></mrow></mfrac></mstyle></mrow></mrow></mrow><mo id="Sx3.E9.m1.11.11.1.2" xref="Sx3.E9.m1.11.11.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E9.m1.11b"><apply id="Sx3.E9.m1.11.11.1.1.cmml" xref="Sx3.E9.m1.11.11.1"><minus id="Sx3.E9.m1.11.11.1.1.1.cmml" xref="Sx3.E9.m1.11.11.1"></minus><apply id="Sx3.E9.m1.11.11.1.1.2.cmml" xref="Sx3.E9.m1.11.11.1.1.2"><times id="Sx3.E9.m1.11.11.1.1.2.1.cmml" xref="Sx3.E9.m1.11.11.1.1.2.1"></times><apply id="Sx3.E9.m1.11.11.1.1.2.2.cmml" xref="Sx3.E9.m1.11.11.1.1.2.2"><divide id="Sx3.E9.m1.11.11.1.1.2.2.1.cmml" xref="Sx3.E9.m1.11.11.1.1.2.2"></divide><cn id="Sx3.E9.m1.11.11.1.1.2.2.2.cmml" type="integer" 
xref="Sx3.E9.m1.11.11.1.1.2.2.2">1</cn><ci id="Sx3.E9.m1.11.11.1.1.2.2.3.cmml" xref="Sx3.E9.m1.11.11.1.1.2.2.3">𝑏</ci></apply><apply id="Sx3.E9.m1.11.11.1.1.2.3.cmml" xref="Sx3.E9.m1.11.11.1.1.2.3"><log id="Sx3.E9.m1.11.11.1.1.2.3.1.cmml" xref="Sx3.E9.m1.11.11.1.1.2.3.1"></log><apply id="Sx3.E9.m1.10.10.cmml" xref="Sx3.E9.m1.10.10"><divide id="Sx3.E9.m1.10.10.11.cmml" xref="Sx3.E9.m1.10.10"></divide><apply id="Sx3.E9.m1.6.6.6.cmml" xref="Sx3.E9.m1.6.6.6"><apply id="Sx3.E9.m1.6.6.6.7.cmml" xref="Sx3.E9.m1.6.6.6.7"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.7.1.cmml" xref="Sx3.E9.m1.6.6.6.7">subscript</csymbol><sum id="Sx3.E9.m1.6.6.6.7.2.cmml" xref="Sx3.E9.m1.6.6.6.7.2"></sum><list id="Sx3.E9.m1.2.2.2.2.2.3.cmml" xref="Sx3.E9.m1.2.2.2.2.2.4"><ci id="Sx3.E9.m1.1.1.1.1.1.1.cmml" xref="Sx3.E9.m1.1.1.1.1.1.1">𝑘</ci><ci id="Sx3.E9.m1.2.2.2.2.2.2.cmml" xref="Sx3.E9.m1.2.2.2.2.2.2">𝑗</ci></list></apply><apply id="Sx3.E9.m1.6.6.6.6.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1"><exp id="Sx3.E9.m1.5.5.5.5.cmml" xref="Sx3.E9.m1.5.5.5.5"></exp><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1"><and id="Sx3.E9.m1.6.6.6.6.1.1.1a.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1"></and><apply id="Sx3.E9.m1.6.6.6.6.1.1.1b.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1"><eq id="Sx3.E9.m1.6.6.6.6.1.1.1.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.3"></eq><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2"><csymbol cd="latexml" id="Sx3.E9.m1.6.6.6.6.1.1.1.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.1">conditional</csymbol><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2">subscript</csymbol><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2">superscript</csymbol><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.2.cmml" 
xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.2">𝑠</ci><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.2.2.3">𝑇</ci></apply><list id="Sx3.E9.m1.4.4.4.4.2.3.cmml" xref="Sx3.E9.m1.4.4.4.4.2.4"><ci id="Sx3.E9.m1.3.3.3.3.1.1.cmml" xref="Sx3.E9.m1.3.3.3.3.1.1">𝑘</ci><ci id="Sx3.E9.m1.4.4.4.4.2.2.cmml" xref="Sx3.E9.m1.4.4.4.4.2.2">𝑗</ci></list></apply><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3">subscript</csymbol><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3">superscript</csymbol><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.2">𝒙</ci><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.2.3">𝐼</ci></apply><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.2.3.3">𝑗</ci></apply></apply><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.4.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4"><and id="Sx3.E9.m1.6.6.6.6.1.1.1.4.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.1"></and><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2">subscript</csymbol><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2">superscript</csymbol><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.2">𝒙</ci><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.2.3">𝐼</ci></apply><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.2.3">𝑘</ci></apply><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3"><csymbol 
cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3">subscript</csymbol><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3">superscript</csymbol><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.2">𝒙</ci><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.2.3">𝑂</ci></apply><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.4.3.3">𝑗</ci></apply></apply></apply><apply id="Sx3.E9.m1.6.6.6.6.1.1.1c.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1"><eq id="Sx3.E9.m1.6.6.6.6.1.1.1.5.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.5"></eq><share href="https://arxiv.org/html/2410.20109v1#Sx3.E9.m1.6.6.6.6.1.1.1.4.cmml" id="Sx3.E9.m1.6.6.6.6.1.1.1d.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1"></share><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.6.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.6.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6">subscript</csymbol><apply id="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6"><csymbol cd="ambiguous" id="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.1.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6">superscript</csymbol><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.2.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.2">𝒙</ci><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.2.3">𝑂</ci></apply><ci id="Sx3.E9.m1.6.6.6.6.1.1.1.6.3.cmml" xref="Sx3.E9.m1.6.6.6.6.1.1.1.6.3">𝑘</ci></apply></apply></apply></apply></apply><apply id="Sx3.E9.m1.10.10.10.cmml" xref="Sx3.E9.m1.10.10.10"><apply id="Sx3.E9.m1.10.10.10.5.cmml" xref="Sx3.E9.m1.10.10.10.5"><csymbol cd="ambiguous" id="Sx3.E9.m1.10.10.10.5.1.cmml" xref="Sx3.E9.m1.10.10.10.5">subscript</csymbol><sum id="Sx3.E9.m1.10.10.10.5.2.cmml" xref="Sx3.E9.m1.10.10.10.5.2"></sum><ci id="Sx3.E9.m1.10.10.10.5.3.cmml" 
xref="Sx3.E9.m1.10.10.10.5.3">𝑗</ci></apply><apply id="Sx3.E9.m1.10.10.10.4.2.cmml" xref="Sx3.E9.m1.10.10.10.4.1"><exp id="Sx3.E9.m1.9.9.9.3.cmml" xref="Sx3.E9.m1.9.9.9.3"></exp><apply id="Sx3.E9.m1.10.10.10.4.1.1.1.cmml" xref="Sx3.E9.m1.10.10.10.4.1.1.1"><csymbol cd="ambiguous" id="Sx3.E9.m1.10.10.10.4.1.1.1.1.cmml" xref="Sx3.E9.m1.10.10.10.4.1.1.1">subscript</csymbol><apply id="Sx3.E9.m1.10.10.10.4.1.1.1.2.cmml" xref="Sx3.E9.m1.10.10.10.4.1.1.1"><csymbol cd="ambiguous" id="Sx3.E9.m1.10.10.10.4.1.1.1.2.1.cmml" xref="Sx3.E9.m1.10.10.10.4.1.1.1">superscript</csymbol><ci id="Sx3.E9.m1.10.10.10.4.1.1.1.2.2.cmml" xref="Sx3.E9.m1.10.10.10.4.1.1.1.2.2">𝑠</ci><ci id="Sx3.E9.m1.10.10.10.4.1.1.1.2.3.cmml" xref="Sx3.E9.m1.10.10.10.4.1.1.1.2.3">𝑇</ci></apply><list id="Sx3.E9.m1.8.8.8.2.2.3.cmml" xref="Sx3.E9.m1.8.8.8.2.2.4"><ci id="Sx3.E9.m1.7.7.7.1.1.1.cmml" xref="Sx3.E9.m1.7.7.7.1.1.1">𝑘</ci><ci id="Sx3.E9.m1.8.8.8.2.2.2.cmml" xref="Sx3.E9.m1.8.8.8.2.2.2">𝑗</ci></list></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E9.m1.11c">\displaystyle-\frac{1}{b}\log\frac{\sum_{k,j}\exp(s^{T}_{k,j}\mid\boldsymbol{x% }^{I}_{j}=\boldsymbol{x}^{I}_{k}\land\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}% _{k})}{\sum_{j}\exp(s^{T}_{k,j})},</annotation><annotation encoding="application/x-llamapun" id="Sx3.E9.m1.11d">- divide start_ARG 1 end_ARG start_ARG italic_b end_ARG roman_log divide start_ARG ∑ start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT roman_exp ( italic_s start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT ∣ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∧ bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = 
bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_exp ( italic_s start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT ) end_ARG ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(9)</span></td> </tr></tbody> </table> <table class="ltx_equation ltx_eqn_table" id="Sx3.E10"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}^{\text{OITC}}=\frac{1}{2}\sum^{b}_{k=1}(\mathcal{L}^{I|O}_{k}+\mathcal{L}^{T}_{k})," class="ltx_Math" display="block" id="Sx3.E10.m1.1"><semantics id="Sx3.E10.m1.1a"><mrow id="Sx3.E10.m1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.cmml"><mrow id="Sx3.E10.m1.1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.cmml"><msup id="Sx3.E10.m1.1.1.1.1.3" xref="Sx3.E10.m1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.E10.m1.1.1.1.1.3.2" xref="Sx3.E10.m1.1.1.1.1.3.2.cmml">ℒ</mi><mtext id="Sx3.E10.m1.1.1.1.1.3.3" xref="Sx3.E10.m1.1.1.1.1.3.3a.cmml">OITC</mtext></msup><mo id="Sx3.E10.m1.1.1.1.1.2" xref="Sx3.E10.m1.1.1.1.1.2.cmml">=</mo><mrow id="Sx3.E10.m1.1.1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.1.cmml"><mfrac id="Sx3.E10.m1.1.1.1.1.1.3" xref="Sx3.E10.m1.1.1.1.1.1.3.cmml"><mn id="Sx3.E10.m1.1.1.1.1.1.3.2" xref="Sx3.E10.m1.1.1.1.1.1.3.2.cmml">1</mn><mn id="Sx3.E10.m1.1.1.1.1.1.3.3" xref="Sx3.E10.m1.1.1.1.1.1.3.3.cmml">2</mn></mfrac><mo id="Sx3.E10.m1.1.1.1.1.1.2" xref="Sx3.E10.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E10.m1.1.1.1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.1.1.cmml"><munderover id="Sx3.E10.m1.1.1.1.1.1.1.2" xref="Sx3.E10.m1.1.1.1.1.1.1.2.cmml"><mo 
id="Sx3.E10.m1.1.1.1.1.1.1.2.2.2" movablelimits="false" rspace="0em" xref="Sx3.E10.m1.1.1.1.1.1.1.2.2.2.cmml">∑</mo><mrow id="Sx3.E10.m1.1.1.1.1.1.1.2.3" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.cmml"><mi id="Sx3.E10.m1.1.1.1.1.1.1.2.3.2" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.2.cmml">k</mi><mo id="Sx3.E10.m1.1.1.1.1.1.1.2.3.1" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.1.cmml">=</mo><mn id="Sx3.E10.m1.1.1.1.1.1.1.2.3.3" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.3.cmml">1</mn></mrow><mi id="Sx3.E10.m1.1.1.1.1.1.1.2.2.3" xref="Sx3.E10.m1.1.1.1.1.1.1.2.2.3.cmml">b</mi></munderover><mrow id="Sx3.E10.m1.1.1.1.1.1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.cmml"><mo id="Sx3.E10.m1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.cmml"><msubsup id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.2" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.2.cmml">ℒ</mi><mi id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.3" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.3.cmml">k</mi><mrow id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.cmml"><mi id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.2" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.1" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.1.cmml">|</mo><mi id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.3" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.1" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.1.cmml">+</mo><msubsup id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.2" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.2.cmml">ℒ</mi><mi id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.3" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.3.cmml">k</mi><mi id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.3" 
xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.3.cmml">T</mi></msubsup></mrow><mo id="Sx3.E10.m1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="Sx3.E10.m1.1.1.1.2" xref="Sx3.E10.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E10.m1.1b"><apply id="Sx3.E10.m1.1.1.1.1.cmml" xref="Sx3.E10.m1.1.1.1"><eq id="Sx3.E10.m1.1.1.1.1.2.cmml" xref="Sx3.E10.m1.1.1.1.1.2"></eq><apply id="Sx3.E10.m1.1.1.1.1.3.cmml" xref="Sx3.E10.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E10.m1.1.1.1.1.3.1.cmml" xref="Sx3.E10.m1.1.1.1.1.3">superscript</csymbol><ci id="Sx3.E10.m1.1.1.1.1.3.2.cmml" xref="Sx3.E10.m1.1.1.1.1.3.2">ℒ</ci><ci id="Sx3.E10.m1.1.1.1.1.3.3a.cmml" xref="Sx3.E10.m1.1.1.1.1.3.3"><mtext id="Sx3.E10.m1.1.1.1.1.3.3.cmml" mathsize="70%" xref="Sx3.E10.m1.1.1.1.1.3.3">OITC</mtext></ci></apply><apply id="Sx3.E10.m1.1.1.1.1.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1"><times id="Sx3.E10.m1.1.1.1.1.1.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.2"></times><apply id="Sx3.E10.m1.1.1.1.1.1.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.3"><divide id="Sx3.E10.m1.1.1.1.1.1.3.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.3"></divide><cn id="Sx3.E10.m1.1.1.1.1.1.3.2.cmml" type="integer" xref="Sx3.E10.m1.1.1.1.1.1.3.2">1</cn><cn id="Sx3.E10.m1.1.1.1.1.1.3.3.cmml" type="integer" xref="Sx3.E10.m1.1.1.1.1.1.3.3">2</cn></apply><apply id="Sx3.E10.m1.1.1.1.1.1.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1"><apply id="Sx3.E10.m1.1.1.1.1.1.1.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E10.m1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2">subscript</csymbol><apply id="Sx3.E10.m1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E10.m1.1.1.1.1.1.1.2.2.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2">superscript</csymbol><sum id="Sx3.E10.m1.1.1.1.1.1.1.2.2.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2.2.2"></sum><ci id="Sx3.E10.m1.1.1.1.1.1.1.2.2.3.cmml" 
xref="Sx3.E10.m1.1.1.1.1.1.1.2.2.3">𝑏</ci></apply><apply id="Sx3.E10.m1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3"><eq id="Sx3.E10.m1.1.1.1.1.1.1.2.3.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.1"></eq><ci id="Sx3.E10.m1.1.1.1.1.1.1.2.3.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.2">𝑘</ci><cn id="Sx3.E10.m1.1.1.1.1.1.1.2.3.3.cmml" type="integer" xref="Sx3.E10.m1.1.1.1.1.1.1.2.3.3">1</cn></apply></apply><apply id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1"><plus id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.1"></plus><apply id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2">subscript</csymbol><apply id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2">superscript</csymbol><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.2">ℒ</ci><apply id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3"><csymbol cd="latexml" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.1">conditional</csymbol><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.2">𝐼</ci><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.2.3">𝑘</ci></apply><apply id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><apply id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" 
id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.1.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3">superscript</csymbol><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.2.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.2">ℒ</ci><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.2.3">𝑇</ci></apply><ci id="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.3.cmml" xref="Sx3.E10.m1.1.1.1.1.1.1.1.1.1.3.3">𝑘</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E10.m1.1c">\mathcal{L}^{\text{OITC}}=\frac{1}{2}\sum^{b}_{k=1}(\mathcal{L}^{I|O}_{k}+\mathcal{L}^{T}_{k}),</annotation><annotation encoding="application/x-llamapun" id="Sx3.E10.m1.1d">caligraphic_L start_POSTSUPERSCRIPT OITC end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT ( caligraphic_L start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + caligraphic_L start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(10)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx3.p2.12">where <math alttext="\mathcal{L}^{I|O}_{k}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.10.m1.1"><semantics id="Sx3.SSx3.p2.10.m1.1a"><msubsup id="Sx3.SSx3.p2.10.m1.1.1" xref="Sx3.SSx3.p2.10.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx3.p2.10.m1.1.1.2.2" xref="Sx3.SSx3.p2.10.m1.1.1.2.2.cmml">ℒ</mi><mi id="Sx3.SSx3.p2.10.m1.1.1.3" xref="Sx3.SSx3.p2.10.m1.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p2.10.m1.1.1.2.3" xref="Sx3.SSx3.p2.10.m1.1.1.2.3.cmml"><mi id="Sx3.SSx3.p2.10.m1.1.1.2.3.2" 
xref="Sx3.SSx3.p2.10.m1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p2.10.m1.1.1.2.3.1" xref="Sx3.SSx3.p2.10.m1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p2.10.m1.1.1.2.3.3" xref="Sx3.SSx3.p2.10.m1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.10.m1.1b"><apply id="Sx3.SSx3.p2.10.m1.1.1.cmml" xref="Sx3.SSx3.p2.10.m1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.10.m1.1.1.1.cmml" xref="Sx3.SSx3.p2.10.m1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p2.10.m1.1.1.2.cmml" xref="Sx3.SSx3.p2.10.m1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.10.m1.1.1.2.1.cmml" xref="Sx3.SSx3.p2.10.m1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p2.10.m1.1.1.2.2.cmml" xref="Sx3.SSx3.p2.10.m1.1.1.2.2">ℒ</ci><apply id="Sx3.SSx3.p2.10.m1.1.1.2.3.cmml" xref="Sx3.SSx3.p2.10.m1.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p2.10.m1.1.1.2.3.1.cmml" xref="Sx3.SSx3.p2.10.m1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p2.10.m1.1.1.2.3.2.cmml" xref="Sx3.SSx3.p2.10.m1.1.1.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p2.10.m1.1.1.2.3.3.cmml" xref="Sx3.SSx3.p2.10.m1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p2.10.m1.1.1.3.cmml" xref="Sx3.SSx3.p2.10.m1.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.10.m1.1c">\mathcal{L}^{I|O}_{k}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.10.m1.1d">caligraphic_L start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math> is the image-to-text contrastive loss, <math alttext="\mathcal{L}^{T}_{k}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.11.m2.1"><semantics id="Sx3.SSx3.p2.11.m2.1a"><msubsup id="Sx3.SSx3.p2.11.m2.1.1" xref="Sx3.SSx3.p2.11.m2.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx3.p2.11.m2.1.1.2.2" xref="Sx3.SSx3.p2.11.m2.1.1.2.2.cmml">ℒ</mi><mi id="Sx3.SSx3.p2.11.m2.1.1.3" xref="Sx3.SSx3.p2.11.m2.1.1.3.cmml">k</mi><mi 
id="Sx3.SSx3.p2.11.m2.1.1.2.3" xref="Sx3.SSx3.p2.11.m2.1.1.2.3.cmml">T</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.11.m2.1b"><apply id="Sx3.SSx3.p2.11.m2.1.1.cmml" xref="Sx3.SSx3.p2.11.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.11.m2.1.1.1.cmml" xref="Sx3.SSx3.p2.11.m2.1.1">subscript</csymbol><apply id="Sx3.SSx3.p2.11.m2.1.1.2.cmml" xref="Sx3.SSx3.p2.11.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.11.m2.1.1.2.1.cmml" xref="Sx3.SSx3.p2.11.m2.1.1">superscript</csymbol><ci id="Sx3.SSx3.p2.11.m2.1.1.2.2.cmml" xref="Sx3.SSx3.p2.11.m2.1.1.2.2">ℒ</ci><ci id="Sx3.SSx3.p2.11.m2.1.1.2.3.cmml" xref="Sx3.SSx3.p2.11.m2.1.1.2.3">𝑇</ci></apply><ci id="Sx3.SSx3.p2.11.m2.1.1.3.cmml" xref="Sx3.SSx3.p2.11.m2.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.11.m2.1c">\mathcal{L}^{T}_{k}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.11.m2.1d">caligraphic_L start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math> is the text-to-image contrastive loss, and <math alttext="\mathcal{L}^{\text{OITC}}" class="ltx_Math" display="inline" id="Sx3.SSx3.p2.12.m3.1"><semantics id="Sx3.SSx3.p2.12.m3.1a"><msup id="Sx3.SSx3.p2.12.m3.1.1" xref="Sx3.SSx3.p2.12.m3.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx3.p2.12.m3.1.1.2" xref="Sx3.SSx3.p2.12.m3.1.1.2.cmml">ℒ</mi><mtext id="Sx3.SSx3.p2.12.m3.1.1.3" xref="Sx3.SSx3.p2.12.m3.1.1.3a.cmml">OITC</mtext></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p2.12.m3.1b"><apply id="Sx3.SSx3.p2.12.m3.1.1.cmml" xref="Sx3.SSx3.p2.12.m3.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p2.12.m3.1.1.1.cmml" xref="Sx3.SSx3.p2.12.m3.1.1">superscript</csymbol><ci id="Sx3.SSx3.p2.12.m3.1.1.2.cmml" xref="Sx3.SSx3.p2.12.m3.1.1.2">ℒ</ci><ci id="Sx3.SSx3.p2.12.m3.1.1.3a.cmml" xref="Sx3.SSx3.p2.12.m3.1.1.3"><mtext id="Sx3.SSx3.p2.12.m3.1.1.3.cmml" mathsize="70%" 
xref="Sx3.SSx3.p2.12.m3.1.1.3">OITC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p2.12.m3.1c">\mathcal{L}^{\text{OITC}}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p2.12.m3.1d">caligraphic_L start_POSTSUPERSCRIPT OITC end_POSTSUPERSCRIPT</annotation></semantics></math> is the total loss.</p> </div> <div class="ltx_para" id="Sx3.SSx3.p3"> <p class="ltx_p" id="Sx3.SSx3.p3.2"><span class="ltx_text ltx_font_bold" id="Sx3.SSx3.p3.2.1">Object-focused Image-Image Contrast (OIIC)</span> loss emphasizes the commonality of objects within the same class, requiring the encoder to generate similar features for these objects. The contrast is performed within the image modality. For a feature <math alttext="\boldsymbol{y}^{I|O}_{k}" class="ltx_Math" display="inline" id="Sx3.SSx3.p3.1.m1.1"><semantics id="Sx3.SSx3.p3.1.m1.1a"><msubsup id="Sx3.SSx3.p3.1.m1.1.1" xref="Sx3.SSx3.p3.1.m1.1.1.cmml"><mi id="Sx3.SSx3.p3.1.m1.1.1.2.2" xref="Sx3.SSx3.p3.1.m1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p3.1.m1.1.1.3" xref="Sx3.SSx3.p3.1.m1.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p3.1.m1.1.1.2.3" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.cmml"><mi id="Sx3.SSx3.p3.1.m1.1.1.2.3.2" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p3.1.m1.1.1.2.3.1" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p3.1.m1.1.1.2.3.3" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p3.1.m1.1b"><apply id="Sx3.SSx3.p3.1.m1.1.1.cmml" xref="Sx3.SSx3.p3.1.m1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.1.m1.1.1.1.cmml" xref="Sx3.SSx3.p3.1.m1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p3.1.m1.1.1.2.cmml" xref="Sx3.SSx3.p3.1.m1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.1.m1.1.1.2.1.cmml" xref="Sx3.SSx3.p3.1.m1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p3.1.m1.1.1.2.2.cmml" xref="Sx3.SSx3.p3.1.m1.1.1.2.2">𝒚</ci><apply id="Sx3.SSx3.p3.1.m1.1.1.2.3.cmml" 
xref="Sx3.SSx3.p3.1.m1.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p3.1.m1.1.1.2.3.1.cmml" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p3.1.m1.1.1.2.3.2.cmml" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p3.1.m1.1.1.2.3.3.cmml" xref="Sx3.SSx3.p3.1.m1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p3.1.m1.1.1.3.cmml" xref="Sx3.SSx3.p3.1.m1.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p3.1.m1.1c">\boldsymbol{y}^{I|O}_{k}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p3.1.m1.1d">bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="\boldsymbol{y}^{I|O}_{j|\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}_{k}}" class="ltx_Math" display="inline" id="Sx3.SSx3.p3.2.m2.1"><semantics id="Sx3.SSx3.p3.2.m2.1a"><msubsup id="Sx3.SSx3.p3.2.m2.1.1" xref="Sx3.SSx3.p3.2.m2.1.1.cmml"><mi id="Sx3.SSx3.p3.2.m2.1.1.2.2" xref="Sx3.SSx3.p3.2.m2.1.1.2.2.cmml">𝒚</mi><mrow id="Sx3.SSx3.p3.2.m2.1.1.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.cmml"><mrow id="Sx3.SSx3.p3.2.m2.1.1.3.2" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.cmml"><mi id="Sx3.SSx3.p3.2.m2.1.1.3.2.2" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.2.cmml">j</mi><mo fence="false" id="Sx3.SSx3.p3.2.m2.1.1.3.2.1" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.1.cmml">|</mo><msubsup id="Sx3.SSx3.p3.2.m2.1.1.3.2.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.cmml"><mi id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.2" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.3.cmml">j</mi><mi id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.SSx3.p3.2.m2.1.1.3.1" xref="Sx3.SSx3.p3.2.m2.1.1.3.1.cmml">=</mo><msubsup id="Sx3.SSx3.p3.2.m2.1.1.3.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.cmml"><mi id="Sx3.SSx3.p3.2.m2.1.1.3.3.2.2" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.2.2.cmml">𝒙</mi><mi 
id="Sx3.SSx3.p3.2.m2.1.1.3.3.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.3.cmml">k</mi><mi id="Sx3.SSx3.p3.2.m2.1.1.3.3.2.3" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.2.3.cmml">O</mi></msubsup></mrow><mrow id="Sx3.SSx3.p3.2.m2.1.1.2.3" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.cmml"><mi id="Sx3.SSx3.p3.2.m2.1.1.2.3.2" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p3.2.m2.1.1.2.3.1" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.1.cmml">|</mo><mi id="Sx3.SSx3.p3.2.m2.1.1.2.3.3" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.3.cmml">O</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p3.2.m2.1b"><apply id="Sx3.SSx3.p3.2.m2.1.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.2.m2.1.1.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1">subscript</csymbol><apply id="Sx3.SSx3.p3.2.m2.1.1.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.2.m2.1.1.2.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1">superscript</csymbol><ci id="Sx3.SSx3.p3.2.m2.1.1.2.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.2.2">𝒚</ci><apply id="Sx3.SSx3.p3.2.m2.1.1.2.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p3.2.m2.1.1.2.3.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p3.2.m2.1.1.2.3.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.2">𝐼</ci><ci id="Sx3.SSx3.p3.2.m2.1.1.2.3.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.2.3.3">𝑂</ci></apply></apply><apply id="Sx3.SSx3.p3.2.m2.1.1.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3"><eq id="Sx3.SSx3.p3.2.m2.1.1.3.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.1"></eq><apply id="Sx3.SSx3.p3.2.m2.1.1.3.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2"><csymbol cd="latexml" id="Sx3.SSx3.p3.2.m2.1.1.3.2.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.1">conditional</csymbol><ci id="Sx3.SSx3.p3.2.m2.1.1.3.2.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.2">𝑗</ci><apply id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3">subscript</csymbol><apply 
id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3">superscript</csymbol><ci id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.2">𝒙</ci><ci id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.2.3">𝑂</ci></apply><ci id="Sx3.SSx3.p3.2.m2.1.1.3.2.3.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.2.3.3">𝑗</ci></apply></apply><apply id="Sx3.SSx3.p3.2.m2.1.1.3.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.2.m2.1.1.3.3.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3">subscript</csymbol><apply id="Sx3.SSx3.p3.2.m2.1.1.3.3.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.2.m2.1.1.3.3.2.1.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3">superscript</csymbol><ci id="Sx3.SSx3.p3.2.m2.1.1.3.3.2.2.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.2.2">𝒙</ci><ci id="Sx3.SSx3.p3.2.m2.1.1.3.3.2.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.2.3">𝑂</ci></apply><ci id="Sx3.SSx3.p3.2.m2.1.1.3.3.3.cmml" xref="Sx3.SSx3.p3.2.m2.1.1.3.3.3">𝑘</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p3.2.m2.1c">\boldsymbol{y}^{I|O}_{j|\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}_{k}}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p3.2.m2.1d">bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j | bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT</annotation></semantics></math> is its positive example, while other conditional image features in the batch serve as in-batch negatives. 
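</p> <p class="ltx_p">As a hedged illustration of this positive/negative assignment, the following plain-Python sketch evaluates the inner-product similarity of Eq. (11) and the loss of Eq. (12) on a toy batch. The function name <span class="ltx_text ltx_font_typewriter">oiic_loss</span>, the integer <span class="ltx_text ltx_font_typewriter">labels</span> standing in for the object condition, the reading of the numerator as a sum over all in-batch features that share the anchor's object (the anchor included), and the max-subtraction used for numerical stability are assumptions made for this example, not details taken from the method's implementation.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import math


def oiic_loss(features, labels):
    """Toy evaluation of Eq. (12) on a batch of conditional image features.

    features: list of b equal-length lists of floats (hypothetical inputs).
    labels:   list of b object ids; equal ids stand in for x_j^O = x_k^O.
    """
    b = len(features)
    total = 0.0
    for k in range(b):
        # Eq. (11): s_{k,j} is the inner product of feature k and feature j.
        sims = [
            sum(p * q for p, q in zip(features[k], features[j]))
            for j in range(b)
        ]
        # Assumption: subtract the row maximum before exp(); the ratio in
        # Eq. (12) is unchanged because the shift cancels.
        shift = max(sims)
        exps = [math.exp(s - shift) for s in sims]
        # Numerator: entries whose object id matches that of anchor k
        # (assumed to include j == k). Denominator: every entry in the batch.
        positive = sum(e for e, lab in zip(exps, labels) if lab == labels[k])
        total += math.log(positive / sum(exps))
    return -total / b


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(oiic_loss(feats, [0, 0, 1, 1]))  # features agree with object ids
    print(oiic_loss(feats, [0, 1, 0, 1]))  # features disagree with object ids
```
</pre>
<p class="ltx_p">In this toy batch the loss is lower when features that share an object id are also close in feature space (the first call) than when they are not (the second call), which is the behavior the loss is meant to encourage.</p> <p class="ltx_p">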
The similarity between two conditional image features is computed as</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E11"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="s_{i,j}=(\boldsymbol{y}^{I|O}_{i})^{\top}\boldsymbol{y}^{I|O}_{j}." class="ltx_Math" display="block" id="Sx3.E11.m1.3"><semantics id="Sx3.E11.m1.3a"><mrow id="Sx3.E11.m1.3.3.1" xref="Sx3.E11.m1.3.3.1.1.cmml"><mrow id="Sx3.E11.m1.3.3.1.1" xref="Sx3.E11.m1.3.3.1.1.cmml"><msub id="Sx3.E11.m1.3.3.1.1.3" xref="Sx3.E11.m1.3.3.1.1.3.cmml"><mi id="Sx3.E11.m1.3.3.1.1.3.2" xref="Sx3.E11.m1.3.3.1.1.3.2.cmml">s</mi><mrow id="Sx3.E11.m1.2.2.2.4" xref="Sx3.E11.m1.2.2.2.3.cmml"><mi id="Sx3.E11.m1.1.1.1.1" xref="Sx3.E11.m1.1.1.1.1.cmml">i</mi><mo id="Sx3.E11.m1.2.2.2.4.1" xref="Sx3.E11.m1.2.2.2.3.cmml">,</mo><mi id="Sx3.E11.m1.2.2.2.2" xref="Sx3.E11.m1.2.2.2.2.cmml">j</mi></mrow></msub><mo id="Sx3.E11.m1.3.3.1.1.2" xref="Sx3.E11.m1.3.3.1.1.2.cmml">=</mo><mrow id="Sx3.E11.m1.3.3.1.1.1" xref="Sx3.E11.m1.3.3.1.1.1.cmml"><msup id="Sx3.E11.m1.3.3.1.1.1.1" xref="Sx3.E11.m1.3.3.1.1.1.1.cmml"><mrow id="Sx3.E11.m1.3.3.1.1.1.1.1.1" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.cmml"><mo id="Sx3.E11.m1.3.3.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.cmml">(</mo><msubsup id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.2" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.3" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.3.cmml">i</mi><mrow id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.cmml"><mi id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.2" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.1" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.1.cmml">|</mo><mi id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.3" 
xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.E11.m1.3.3.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="Sx3.E11.m1.3.3.1.1.1.1.3" xref="Sx3.E11.m1.3.3.1.1.1.1.3.cmml">⊤</mo></msup><mo id="Sx3.E11.m1.3.3.1.1.1.2" xref="Sx3.E11.m1.3.3.1.1.1.2.cmml">⁢</mo><msubsup id="Sx3.E11.m1.3.3.1.1.1.3" xref="Sx3.E11.m1.3.3.1.1.1.3.cmml"><mi id="Sx3.E11.m1.3.3.1.1.1.3.2.2" xref="Sx3.E11.m1.3.3.1.1.1.3.2.2.cmml">𝒚</mi><mi id="Sx3.E11.m1.3.3.1.1.1.3.3" xref="Sx3.E11.m1.3.3.1.1.1.3.3.cmml">j</mi><mrow id="Sx3.E11.m1.3.3.1.1.1.3.2.3" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.cmml"><mi id="Sx3.E11.m1.3.3.1.1.1.3.2.3.2" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.E11.m1.3.3.1.1.1.3.2.3.1" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.1.cmml">|</mo><mi id="Sx3.E11.m1.3.3.1.1.1.3.2.3.3" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.3.cmml">O</mi></mrow></msubsup></mrow></mrow><mo id="Sx3.E11.m1.3.3.1.2" lspace="0em" xref="Sx3.E11.m1.3.3.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E11.m1.3b"><apply id="Sx3.E11.m1.3.3.1.1.cmml" xref="Sx3.E11.m1.3.3.1"><eq id="Sx3.E11.m1.3.3.1.1.2.cmml" xref="Sx3.E11.m1.3.3.1.1.2"></eq><apply id="Sx3.E11.m1.3.3.1.1.3.cmml" xref="Sx3.E11.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="Sx3.E11.m1.3.3.1.1.3.1.cmml" xref="Sx3.E11.m1.3.3.1.1.3">subscript</csymbol><ci id="Sx3.E11.m1.3.3.1.1.3.2.cmml" xref="Sx3.E11.m1.3.3.1.1.3.2">𝑠</ci><list id="Sx3.E11.m1.2.2.2.3.cmml" xref="Sx3.E11.m1.2.2.2.4"><ci id="Sx3.E11.m1.1.1.1.1.cmml" xref="Sx3.E11.m1.1.1.1.1">𝑖</ci><ci id="Sx3.E11.m1.2.2.2.2.cmml" xref="Sx3.E11.m1.2.2.2.2">𝑗</ci></list></apply><apply id="Sx3.E11.m1.3.3.1.1.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1"><times id="Sx3.E11.m1.3.3.1.1.1.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.2"></times><apply id="Sx3.E11.m1.3.3.1.1.1.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E11.m1.3.3.1.1.1.1.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1">superscript</csymbol><apply 
id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.2">𝐼</ci><ci id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.1.1.1.3">𝑖</ci></apply><csymbol cd="latexml" id="Sx3.E11.m1.3.3.1.1.1.1.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.1.3">top</csymbol></apply><apply id="Sx3.E11.m1.3.3.1.1.1.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E11.m1.3.3.1.1.1.3.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3">subscript</csymbol><apply id="Sx3.E11.m1.3.3.1.1.1.3.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E11.m1.3.3.1.1.1.3.2.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3">superscript</csymbol><ci id="Sx3.E11.m1.3.3.1.1.1.3.2.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3.2.2">𝒚</ci><apply id="Sx3.E11.m1.3.3.1.1.1.3.2.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3"><csymbol cd="latexml" id="Sx3.E11.m1.3.3.1.1.1.3.2.3.1.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.1">conditional</csymbol><ci id="Sx3.E11.m1.3.3.1.1.1.3.2.3.2.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.2">𝐼</ci><ci id="Sx3.E11.m1.3.3.1.1.1.3.2.3.3.cmml" xref="Sx3.E11.m1.3.3.1.1.1.3.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.E11.m1.3.3.1.1.1.3.3.cmml" 
xref="Sx3.E11.m1.3.3.1.1.1.3.3">𝑗</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E11.m1.3c">s_{i,j}=(\boldsymbol{y}^{I|O}_{i})^{\top}\boldsymbol{y}^{I|O}_{j}.</annotation><annotation encoding="application/x-llamapun" id="Sx3.E11.m1.3d">italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = ( bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(11)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx3.p3.3">The OIIC loss, denoted by <math alttext="\mathcal{L}^{\text{OIIC}}" class="ltx_Math" display="inline" id="Sx3.SSx3.p3.3.m1.1"><semantics id="Sx3.SSx3.p3.3.m1.1a"><msup id="Sx3.SSx3.p3.3.m1.1.1" xref="Sx3.SSx3.p3.3.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.SSx3.p3.3.m1.1.1.2" xref="Sx3.SSx3.p3.3.m1.1.1.2.cmml">ℒ</mi><mtext id="Sx3.SSx3.p3.3.m1.1.1.3" xref="Sx3.SSx3.p3.3.m1.1.1.3a.cmml">OIIC</mtext></msup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p3.3.m1.1b"><apply id="Sx3.SSx3.p3.3.m1.1.1.cmml" xref="Sx3.SSx3.p3.3.m1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p3.3.m1.1.1.1.cmml" xref="Sx3.SSx3.p3.3.m1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p3.3.m1.1.1.2.cmml" xref="Sx3.SSx3.p3.3.m1.1.1.2">ℒ</ci><ci id="Sx3.SSx3.p3.3.m1.1.1.3a.cmml" xref="Sx3.SSx3.p3.3.m1.1.1.3"><mtext id="Sx3.SSx3.p3.3.m1.1.1.3.cmml" mathsize="70%" xref="Sx3.SSx3.p3.3.m1.1.1.3">OIIC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="Sx3.SSx3.p3.3.m1.1c">\mathcal{L}^{\text{OIIC}}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p3.3.m1.1d">caligraphic_L start_POSTSUPERSCRIPT OIIC end_POSTSUPERSCRIPT</annotation></semantics></math>, is defined as</p> <table class="ltx_equation ltx_eqn_table" id="Sx3.E12"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}^{\text{OIIC}}=-\frac{1}{b}\sum^{b}_{k=1}\log\frac{\sum_{k,j}\exp(s_{k,j}\mid\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}_{k})}{\sum_{j}\exp(s_{k,j})}." class="ltx_Math" display="block" id="Sx3.E12.m1.11"><semantics id="Sx3.E12.m1.11a"><mrow id="Sx3.E12.m1.11.11.1" xref="Sx3.E12.m1.11.11.1.1.cmml"><mrow id="Sx3.E12.m1.11.11.1.1" xref="Sx3.E12.m1.11.11.1.1.cmml"><msup id="Sx3.E12.m1.11.11.1.1.2" xref="Sx3.E12.m1.11.11.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.E12.m1.11.11.1.1.2.2" xref="Sx3.E12.m1.11.11.1.1.2.2.cmml">ℒ</mi><mtext id="Sx3.E12.m1.11.11.1.1.2.3" xref="Sx3.E12.m1.11.11.1.1.2.3a.cmml">OIIC</mtext></msup><mo id="Sx3.E12.m1.11.11.1.1.1" xref="Sx3.E12.m1.11.11.1.1.1.cmml">=</mo><mrow id="Sx3.E12.m1.11.11.1.1.3" xref="Sx3.E12.m1.11.11.1.1.3.cmml"><mo id="Sx3.E12.m1.11.11.1.1.3a" xref="Sx3.E12.m1.11.11.1.1.3.cmml">−</mo><mrow id="Sx3.E12.m1.11.11.1.1.3.2" xref="Sx3.E12.m1.11.11.1.1.3.2.cmml"><mfrac id="Sx3.E12.m1.11.11.1.1.3.2.2" xref="Sx3.E12.m1.11.11.1.1.3.2.2.cmml"><mn id="Sx3.E12.m1.11.11.1.1.3.2.2.2" xref="Sx3.E12.m1.11.11.1.1.3.2.2.2.cmml">1</mn><mi id="Sx3.E12.m1.11.11.1.1.3.2.2.3" xref="Sx3.E12.m1.11.11.1.1.3.2.2.3.cmml">b</mi></mfrac><mo id="Sx3.E12.m1.11.11.1.1.3.2.1" xref="Sx3.E12.m1.11.11.1.1.3.2.1.cmml">⁢</mo><mrow id="Sx3.E12.m1.11.11.1.1.3.2.3" xref="Sx3.E12.m1.11.11.1.1.3.2.3.cmml"><munderover id="Sx3.E12.m1.11.11.1.1.3.2.3.1" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.cmml"><mo id="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.2" 
movablelimits="false" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.2.cmml">∑</mo><mrow id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.cmml"><mi id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.2" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.2.cmml">k</mi><mo id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.1" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.1.cmml">=</mo><mn id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.3" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.3.cmml">1</mn></mrow><mi id="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.3" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.3.cmml">b</mi></munderover><mrow id="Sx3.E12.m1.11.11.1.1.3.2.3.2" xref="Sx3.E12.m1.11.11.1.1.3.2.3.2.cmml"><mi id="Sx3.E12.m1.11.11.1.1.3.2.3.2.1" xref="Sx3.E12.m1.11.11.1.1.3.2.3.2.1.cmml">log</mi><mo id="Sx3.E12.m1.11.11.1.1.3.2.3.2a" lspace="0.167em" xref="Sx3.E12.m1.11.11.1.1.3.2.3.2.cmml">⁡</mo><mfrac id="Sx3.E12.m1.10.10" xref="Sx3.E12.m1.10.10.cmml"><mrow id="Sx3.E12.m1.6.6.6" xref="Sx3.E12.m1.6.6.6.cmml"><msub id="Sx3.E12.m1.6.6.6.7" xref="Sx3.E12.m1.6.6.6.7.cmml"><mo id="Sx3.E12.m1.6.6.6.7.2" xref="Sx3.E12.m1.6.6.6.7.2.cmml">∑</mo><mrow id="Sx3.E12.m1.2.2.2.2.2.4" xref="Sx3.E12.m1.2.2.2.2.2.3.cmml"><mi id="Sx3.E12.m1.1.1.1.1.1.1" xref="Sx3.E12.m1.1.1.1.1.1.1.cmml">k</mi><mo id="Sx3.E12.m1.2.2.2.2.2.4.1" xref="Sx3.E12.m1.2.2.2.2.2.3.cmml">,</mo><mi id="Sx3.E12.m1.2.2.2.2.2.2" xref="Sx3.E12.m1.2.2.2.2.2.2.cmml">j</mi></mrow></msub><mrow id="Sx3.E12.m1.6.6.6.6.1" xref="Sx3.E12.m1.6.6.6.6.2.cmml"><mi id="Sx3.E12.m1.5.5.5.5" xref="Sx3.E12.m1.5.5.5.5.cmml">exp</mi><mo id="Sx3.E12.m1.6.6.6.6.1a" xref="Sx3.E12.m1.6.6.6.6.2.cmml">⁡</mo><mrow id="Sx3.E12.m1.6.6.6.6.1.1" xref="Sx3.E12.m1.6.6.6.6.2.cmml"><mo id="Sx3.E12.m1.6.6.6.6.1.1.2" stretchy="false" xref="Sx3.E12.m1.6.6.6.6.2.cmml">(</mo><mrow id="Sx3.E12.m1.6.6.6.6.1.1.1" xref="Sx3.E12.m1.6.6.6.6.1.1.1.cmml"><mrow id="Sx3.E12.m1.6.6.6.6.1.1.1.2" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.cmml"><msub id="Sx3.E12.m1.6.6.6.6.1.1.1.2.2" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.cmml"><mi 
id="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.2" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.2.cmml">s</mi><mrow id="Sx3.E12.m1.4.4.4.4.2.4" xref="Sx3.E12.m1.4.4.4.4.2.3.cmml"><mi id="Sx3.E12.m1.3.3.3.3.1.1" xref="Sx3.E12.m1.3.3.3.3.1.1.cmml">k</mi><mo id="Sx3.E12.m1.4.4.4.4.2.4.1" xref="Sx3.E12.m1.4.4.4.4.2.3.cmml">,</mo><mi id="Sx3.E12.m1.4.4.4.4.2.2" xref="Sx3.E12.m1.4.4.4.4.2.2.cmml">j</mi></mrow></msub><mo id="Sx3.E12.m1.6.6.6.6.1.1.1.2.1" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.1.cmml">∣</mo><msubsup id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.cmml"><mi id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.2" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.2.cmml">𝒙</mi><mi id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.3" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.3.cmml">j</mi><mi id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.3" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.E12.m1.6.6.6.6.1.1.1.1" xref="Sx3.E12.m1.6.6.6.6.1.1.1.1.cmml">=</mo><msubsup id="Sx3.E12.m1.6.6.6.6.1.1.1.3" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.cmml"><mi id="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.2" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.2.cmml">𝒙</mi><mi id="Sx3.E12.m1.6.6.6.6.1.1.1.3.3" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.3.cmml">k</mi><mi id="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.3" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.3.cmml">O</mi></msubsup></mrow><mo id="Sx3.E12.m1.6.6.6.6.1.1.3" stretchy="false" xref="Sx3.E12.m1.6.6.6.6.2.cmml">)</mo></mrow></mrow></mrow><mrow id="Sx3.E12.m1.10.10.10" xref="Sx3.E12.m1.10.10.10.cmml"><msub id="Sx3.E12.m1.10.10.10.5" xref="Sx3.E12.m1.10.10.10.5.cmml"><mo id="Sx3.E12.m1.10.10.10.5.2" xref="Sx3.E12.m1.10.10.10.5.2.cmml">∑</mo><mi id="Sx3.E12.m1.10.10.10.5.3" xref="Sx3.E12.m1.10.10.10.5.3.cmml">j</mi></msub><mrow id="Sx3.E12.m1.10.10.10.4.1" xref="Sx3.E12.m1.10.10.10.4.2.cmml"><mi id="Sx3.E12.m1.9.9.9.3" xref="Sx3.E12.m1.9.9.9.3.cmml">exp</mi><mo id="Sx3.E12.m1.10.10.10.4.1a" xref="Sx3.E12.m1.10.10.10.4.2.cmml">⁡</mo><mrow id="Sx3.E12.m1.10.10.10.4.1.1" xref="Sx3.E12.m1.10.10.10.4.2.cmml"><mo 
id="Sx3.E12.m1.10.10.10.4.1.1.2" stretchy="false" xref="Sx3.E12.m1.10.10.10.4.2.cmml">(</mo><msub id="Sx3.E12.m1.10.10.10.4.1.1.1" xref="Sx3.E12.m1.10.10.10.4.1.1.1.cmml"><mi id="Sx3.E12.m1.10.10.10.4.1.1.1.2" xref="Sx3.E12.m1.10.10.10.4.1.1.1.2.cmml">s</mi><mrow id="Sx3.E12.m1.8.8.8.2.2.4" xref="Sx3.E12.m1.8.8.8.2.2.3.cmml"><mi id="Sx3.E12.m1.7.7.7.1.1.1" xref="Sx3.E12.m1.7.7.7.1.1.1.cmml">k</mi><mo id="Sx3.E12.m1.8.8.8.2.2.4.1" xref="Sx3.E12.m1.8.8.8.2.2.3.cmml">,</mo><mi id="Sx3.E12.m1.8.8.8.2.2.2" xref="Sx3.E12.m1.8.8.8.2.2.2.cmml">j</mi></mrow></msub><mo id="Sx3.E12.m1.10.10.10.4.1.1.3" stretchy="false" xref="Sx3.E12.m1.10.10.10.4.2.cmml">)</mo></mrow></mrow></mrow></mfrac></mrow></mrow></mrow></mrow></mrow><mo id="Sx3.E12.m1.11.11.1.2" lspace="0em" xref="Sx3.E12.m1.11.11.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E12.m1.11b"><apply id="Sx3.E12.m1.11.11.1.1.cmml" xref="Sx3.E12.m1.11.11.1"><eq id="Sx3.E12.m1.11.11.1.1.1.cmml" xref="Sx3.E12.m1.11.11.1.1.1"></eq><apply id="Sx3.E12.m1.11.11.1.1.2.cmml" xref="Sx3.E12.m1.11.11.1.1.2"><csymbol cd="ambiguous" id="Sx3.E12.m1.11.11.1.1.2.1.cmml" xref="Sx3.E12.m1.11.11.1.1.2">superscript</csymbol><ci id="Sx3.E12.m1.11.11.1.1.2.2.cmml" xref="Sx3.E12.m1.11.11.1.1.2.2">ℒ</ci><ci id="Sx3.E12.m1.11.11.1.1.2.3a.cmml" xref="Sx3.E12.m1.11.11.1.1.2.3"><mtext id="Sx3.E12.m1.11.11.1.1.2.3.cmml" mathsize="70%" xref="Sx3.E12.m1.11.11.1.1.2.3">OIIC</mtext></ci></apply><apply id="Sx3.E12.m1.11.11.1.1.3.cmml" xref="Sx3.E12.m1.11.11.1.1.3"><minus id="Sx3.E12.m1.11.11.1.1.3.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3"></minus><apply id="Sx3.E12.m1.11.11.1.1.3.2.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2"><times id="Sx3.E12.m1.11.11.1.1.3.2.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.1"></times><apply id="Sx3.E12.m1.11.11.1.1.3.2.2.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.2"><divide id="Sx3.E12.m1.11.11.1.1.3.2.2.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.2"></divide><cn id="Sx3.E12.m1.11.11.1.1.3.2.2.2.cmml" type="integer" 
xref="Sx3.E12.m1.11.11.1.1.3.2.2.2">1</cn><ci id="Sx3.E12.m1.11.11.1.1.3.2.2.3.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.2.3">𝑏</ci></apply><apply id="Sx3.E12.m1.11.11.1.1.3.2.3.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3"><apply id="Sx3.E12.m1.11.11.1.1.3.2.3.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1"><csymbol cd="ambiguous" id="Sx3.E12.m1.11.11.1.1.3.2.3.1.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1">subscript</csymbol><apply id="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1"><csymbol cd="ambiguous" id="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1">superscript</csymbol><sum id="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.2.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.2"></sum><ci id="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.3.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.2.3">𝑏</ci></apply><apply id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3"><eq id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.1"></eq><ci id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.2.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.2">𝑘</ci><cn id="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.3.cmml" type="integer" xref="Sx3.E12.m1.11.11.1.1.3.2.3.1.3.3">1</cn></apply></apply><apply id="Sx3.E12.m1.11.11.1.1.3.2.3.2.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.2"><log id="Sx3.E12.m1.11.11.1.1.3.2.3.2.1.cmml" xref="Sx3.E12.m1.11.11.1.1.3.2.3.2.1"></log><apply id="Sx3.E12.m1.10.10.cmml" xref="Sx3.E12.m1.10.10"><divide id="Sx3.E12.m1.10.10.11.cmml" xref="Sx3.E12.m1.10.10"></divide><apply id="Sx3.E12.m1.6.6.6.cmml" xref="Sx3.E12.m1.6.6.6"><apply id="Sx3.E12.m1.6.6.6.7.cmml" xref="Sx3.E12.m1.6.6.6.7"><csymbol cd="ambiguous" id="Sx3.E12.m1.6.6.6.7.1.cmml" xref="Sx3.E12.m1.6.6.6.7">subscript</csymbol><sum id="Sx3.E12.m1.6.6.6.7.2.cmml" xref="Sx3.E12.m1.6.6.6.7.2"></sum><list id="Sx3.E12.m1.2.2.2.2.2.3.cmml" xref="Sx3.E12.m1.2.2.2.2.2.4"><ci id="Sx3.E12.m1.1.1.1.1.1.1.cmml" xref="Sx3.E12.m1.1.1.1.1.1.1">𝑘</ci><ci id="Sx3.E12.m1.2.2.2.2.2.2.cmml" 
xref="Sx3.E12.m1.2.2.2.2.2.2">𝑗</ci></list></apply><apply id="Sx3.E12.m1.6.6.6.6.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1"><exp id="Sx3.E12.m1.5.5.5.5.cmml" xref="Sx3.E12.m1.5.5.5.5"></exp><apply id="Sx3.E12.m1.6.6.6.6.1.1.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1"><eq id="Sx3.E12.m1.6.6.6.6.1.1.1.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.1"></eq><apply id="Sx3.E12.m1.6.6.6.6.1.1.1.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2"><csymbol cd="latexml" id="Sx3.E12.m1.6.6.6.6.1.1.1.2.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.1">conditional</csymbol><apply id="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.2"><csymbol cd="ambiguous" id="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.2">subscript</csymbol><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.2.2">𝑠</ci><list id="Sx3.E12.m1.4.4.4.4.2.3.cmml" xref="Sx3.E12.m1.4.4.4.4.2.4"><ci id="Sx3.E12.m1.3.3.3.3.1.1.cmml" xref="Sx3.E12.m1.3.3.3.3.1.1">𝑘</ci><ci id="Sx3.E12.m1.4.4.4.4.2.2.cmml" xref="Sx3.E12.m1.4.4.4.4.2.2">𝑗</ci></list></apply><apply id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3"><csymbol cd="ambiguous" id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3">subscript</csymbol><apply id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3"><csymbol cd="ambiguous" id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3">superscript</csymbol><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.2">𝒙</ci><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.3.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.2.3">𝑂</ci></apply><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.3.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.2.3.3">𝑗</ci></apply></apply><apply id="Sx3.E12.m1.6.6.6.6.1.1.1.3.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E12.m1.6.6.6.6.1.1.1.3.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3">subscript</csymbol><apply 
id="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.1.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3">superscript</csymbol><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.2.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.2">𝒙</ci><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.3.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.2.3">𝑂</ci></apply><ci id="Sx3.E12.m1.6.6.6.6.1.1.1.3.3.cmml" xref="Sx3.E12.m1.6.6.6.6.1.1.1.3.3">𝑘</ci></apply></apply></apply></apply><apply id="Sx3.E12.m1.10.10.10.cmml" xref="Sx3.E12.m1.10.10.10"><apply id="Sx3.E12.m1.10.10.10.5.cmml" xref="Sx3.E12.m1.10.10.10.5"><csymbol cd="ambiguous" id="Sx3.E12.m1.10.10.10.5.1.cmml" xref="Sx3.E12.m1.10.10.10.5">subscript</csymbol><sum id="Sx3.E12.m1.10.10.10.5.2.cmml" xref="Sx3.E12.m1.10.10.10.5.2"></sum><ci id="Sx3.E12.m1.10.10.10.5.3.cmml" xref="Sx3.E12.m1.10.10.10.5.3">𝑗</ci></apply><apply id="Sx3.E12.m1.10.10.10.4.2.cmml" xref="Sx3.E12.m1.10.10.10.4.1"><exp id="Sx3.E12.m1.9.9.9.3.cmml" xref="Sx3.E12.m1.9.9.9.3"></exp><apply id="Sx3.E12.m1.10.10.10.4.1.1.1.cmml" xref="Sx3.E12.m1.10.10.10.4.1.1.1"><csymbol cd="ambiguous" id="Sx3.E12.m1.10.10.10.4.1.1.1.1.cmml" xref="Sx3.E12.m1.10.10.10.4.1.1.1">subscript</csymbol><ci id="Sx3.E12.m1.10.10.10.4.1.1.1.2.cmml" xref="Sx3.E12.m1.10.10.10.4.1.1.1.2">𝑠</ci><list id="Sx3.E12.m1.8.8.8.2.2.3.cmml" xref="Sx3.E12.m1.8.8.8.2.2.4"><ci id="Sx3.E12.m1.7.7.7.1.1.1.cmml" xref="Sx3.E12.m1.7.7.7.1.1.1">𝑘</ci><ci id="Sx3.E12.m1.8.8.8.2.2.2.cmml" xref="Sx3.E12.m1.8.8.8.2.2.2">𝑗</ci></list></apply></apply></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E12.m1.11c">\mathcal{L}^{\text{OIIC}}=-\frac{1}{b}\sum^{b}_{k=1}\log\frac{\sum_{k,j}\exp(s_{k,j}\mid\boldsymbol{x}^{O}_{j}=\boldsymbol{x}^{O}_{k})}{\sum_{j}\exp(s_{k,j})}.</annotation><annotation encoding="application/x-llamapun" id="Sx3.E12.m1.11d">caligraphic_L start_POSTSUPERSCRIPT OIIC 
end_POSTSUPERSCRIPT = - divide start_ARG 1 end_ARG start_ARG italic_b end_ARG ∑ start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT roman_log divide start_ARG ∑ start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT roman_exp ( italic_s start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT ∣ bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_exp ( italic_s start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT ) end_ARG .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(12)</span></td> </tr></tbody> </table> </div> <figure class="ltx_table" id="Sx3.T1"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="Sx3.T1.4"> <tr class="ltx_tr" id="Sx3.T1.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T1.2.2.3" rowspan="2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.2.2.3.1">Model</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T1.1.1.1" style="padding-left:9.0pt;padding-right:9.0pt;">LVIS<sup class="ltx_sup" id="Sx3.T1.1.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx3.T1.1.1.1.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T1.2.2.2" style="padding-left:9.0pt;padding-right:9.0pt;">LVIS<sup class="ltx_sup" id="Sx3.T1.2.2.2.1"><span class="ltx_text ltx_font_italic" id="Sx3.T1.2.2.2.1.1">a</span></sup> </td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.5"> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.5.1" 
style="padding-left:9.0pt;padding-right:9.0pt;">F1</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.5.2" style="padding-left:9.0pt;padding-right:9.0pt;">AUC</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.5.3" style="padding-left:9.0pt;padding-right:9.0pt;">F1</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.5.4" style="padding-left:9.0pt;padding-right:9.0pt;">AUC</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.6"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T1.4.6.1" style="padding-left:9.0pt;padding-right:9.0pt;">Instruct.</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.6.2" style="padding-left:9.0pt;padding-right:9.0pt;">-0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.6.3" style="padding-left:9.0pt;padding-right:9.0pt;">49.9</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.6.4" style="padding-left:9.0pt;padding-right:9.0pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.6.5" style="padding-left:9.0pt;padding-right:9.0pt;">50.0</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.7" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T1.4.7.1" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.7.1.1" style="background-color:#E6E6E6;">CLIP-ViT</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.7.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.7.2.1" style="background-color:#E6E6E6;">11.3</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.7.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.7.3.1" style="background-color:#E6E6E6;">57.8</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.7.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.7.4.1" style="background-color:#E6E6E6;">8.6</span></td> <td class="ltx_td 
ltx_align_center ltx_border_t" id="Sx3.T1.4.7.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.7.5.1" style="background-color:#E6E6E6;">55.5</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.8"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.8.1" style="padding-left:9.0pt;padding-right:9.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.8.2" style="padding-left:9.0pt;padding-right:9.0pt;">50.7</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.8.3" style="padding-left:9.0pt;padding-right:9.0pt;">75.6</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.8.4" style="padding-left:9.0pt;padding-right:9.0pt;">56.2</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.8.5" style="padding-left:9.0pt;padding-right:9.0pt;">77.3</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.9"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.9.1" style="padding-left:9.0pt;padding-right:9.0pt;">Improv.</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.9.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.9.2.1" style="color:#00E000;">348.7%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.9.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.9.3.1" style="color:#00E000;">30.8%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.9.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.9.4.1" style="color:#00E000;">553.5%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.9.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.9.5.1" style="color:#00E000;">39.3%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.3.3" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T1.3.3.1" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.3.3.1.1" style="background-color:#E6E6E6;">GroupViT<sub class="ltx_sub" 
id="Sx3.T1.3.3.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx3.T1.3.3.1.1.1.1" style="background-color:#E6E6E6;">y</span></sub></span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.3.3.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.3.3.2.1" style="background-color:#E6E6E6;">9.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.3.3.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.3.3.3.1" style="background-color:#E6E6E6;">56.8</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.3.3.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.3.3.4.1" style="background-color:#E6E6E6;">7.3</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.3.3.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.3.3.5.1" style="background-color:#E6E6E6;">54.8</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.10"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.10.1" style="padding-left:9.0pt;padding-right:9.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.10.2" style="padding-left:9.0pt;padding-right:9.0pt;">31.9</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.10.3" style="padding-left:9.0pt;padding-right:9.0pt;">65.3</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.10.4" style="padding-left:9.0pt;padding-right:9.0pt;">42.3</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.10.5" style="padding-left:9.0pt;padding-right:9.0pt;">69.8</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.11"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.11.1" style="padding-left:9.0pt;padding-right:9.0pt;">Improv.</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.11.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.11.2.1" style="color:#00E000;">232.3%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.11.3" 
style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.11.3.1" style="color:#00E000;">15.0%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.11.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.11.4.1" style="color:#00E000;">479.5%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.11.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.11.5.1" style="color:#00E000;">27.4%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.4" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T1.4.4.1" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.4.1.1" style="background-color:#E6E6E6;">GroupViT<sub class="ltx_sub" id="Sx3.T1.4.4.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx3.T1.4.4.1.1.1.1" style="background-color:#E6E6E6;">r</span></sub></span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.4.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.4.2.1" style="background-color:#E6E6E6;">10.1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.4.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.4.3.1" style="background-color:#E6E6E6;">57.2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.4.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.4.4.1" style="background-color:#E6E6E6;">8.2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.4.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.4.5.1" style="background-color:#E6E6E6;">55.4</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.12"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.12.1" style="padding-left:9.0pt;padding-right:9.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.12.2" 
style="padding-left:9.0pt;padding-right:9.0pt;">31.8</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.12.3" style="padding-left:9.0pt;padding-right:9.0pt;">65.9</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.12.4" style="padding-left:9.0pt;padding-right:9.0pt;">42.8</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.12.5" style="padding-left:9.0pt;padding-right:9.0pt;">70.3</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.13"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.13.1" style="padding-left:9.0pt;padding-right:9.0pt;">Improv.</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.13.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.13.2.1" style="color:#00E000;">214.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.13.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.13.3.1" style="color:#00E000;">15.2%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.13.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.13.4.1" style="color:#00E000;">422.0%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.13.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.13.5.1" style="color:#00E000;">26.9%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.14" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T1.4.14.1" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.14.1.1" style="background-color:#E6E6E6;">SigLIP</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.14.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.14.2.1" style="background-color:#E6E6E6;">10.7</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.14.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.14.3.1" 
style="background-color:#E6E6E6;">57.3</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.14.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.14.4.1" style="background-color:#E6E6E6;">9.0</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.14.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.14.5.1" style="background-color:#E6E6E6;">55.6</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.15"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.15.1" style="padding-left:9.0pt;padding-right:9.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.15.2" style="padding-left:9.0pt;padding-right:9.0pt;">48.6</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.15.3" style="padding-left:9.0pt;padding-right:9.0pt;">74.2</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.15.4" style="padding-left:9.0pt;padding-right:9.0pt;">53.0</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.15.5" style="padding-left:9.0pt;padding-right:9.0pt;">75.0</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.16"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.16.1" style="padding-left:9.0pt;padding-right:9.0pt;">Improv.</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.16.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.16.2.1" style="color:#00E000;">354.2%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.16.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.16.3.1" style="color:#00E000;">29.5%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.16.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.16.4.1" style="color:#00E000;">488.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.16.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.16.5.1" style="color:#00E000;">34.9%</span></td> </tr> 
<tr class="ltx_tr" id="Sx3.T1.4.17" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T1.4.17.1" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.17.1.1" style="background-color:#E6E6E6;">OwlViT</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.17.2" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.17.2.1" style="background-color:#E6E6E6;">0.0</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.17.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.17.3.1" style="background-color:#E6E6E6;">50.0</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.17.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.17.4.1" style="background-color:#E6E6E6;">0.0</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T1.4.17.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.17.5.1" style="background-color:#E6E6E6;">50.0</span></td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.18"> <td class="ltx_td ltx_align_left" id="Sx3.T1.4.18.1" style="padding-left:9.0pt;padding-right:9.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.18.2" style="padding-left:9.0pt;padding-right:9.0pt;">40.6</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.18.3" style="padding-left:9.0pt;padding-right:9.0pt;">70.6</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.18.4" style="padding-left:9.0pt;padding-right:9.0pt;">41.8</td> <td class="ltx_td ltx_align_center" id="Sx3.T1.4.18.5" style="padding-left:9.0pt;padding-right:9.0pt;">69.9</td> </tr> <tr class="ltx_tr" id="Sx3.T1.4.19"> <td class="ltx_td ltx_align_left ltx_border_bb" id="Sx3.T1.4.19.1" style="padding-left:9.0pt;padding-right:9.0pt;">Improv.</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T1.4.19.2" 
style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.19.2.1" style="color:#00E000;">-</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T1.4.19.3" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.19.3.1" style="color:#00E000;">41.2%</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T1.4.19.4" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.19.4.1" style="color:#00E000;">-</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T1.4.19.5" style="padding-left:9.0pt;padding-right:9.0pt;"><span class="ltx_text" id="Sx3.T1.4.19.5.1" style="color:#00E000;">39.8%</span></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>Top-1 F1 and AUC (%) of zero-shot image classification on multi-object datasets. “Instruct.” signifies that classification is based on the instruction text rather than the image. “GroupViT<sub class="ltx_sub" id="Sx3.T1.11.1"><span class="ltx_text ltx_font_italic" id="Sx3.T1.11.1.1">y</span></sub>” and “GroupViT<sub class="ltx_sub" id="Sx3.T1.12.2"><span class="ltx_text ltx_font_italic" id="Sx3.T1.12.2.1">r</span></sub>” denote GroupViT-gcc-yfcc and GroupViT-gcc-redcaps, respectively, two variants of GroupViT trained on different datasets. 
Because our evaluation metric subtracts the influence of textual interference, reported values can be negative.</figcaption> </figure> <figure class="ltx_table" id="Sx3.T2"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="Sx3.T2.4"> <tr class="ltx_tr" id="Sx3.T2.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T2.2.2.3" rowspan="2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.2.2.3.1">Model</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T2.2.2.4" rowspan="2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.2.2.4.1"><span class="ltx_text" id="Sx3.T2.2.2.4.1.1"></span><span class="ltx_text" id="Sx3.T2.2.2.4.1.2"> <span class="ltx_tabular ltx_align_middle" id="Sx3.T2.2.2.4.1.2.1"> <span class="ltx_tr" id="Sx3.T2.2.2.4.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx3.T2.2.2.4.1.2.1.1.1" style="padding-left:4.0pt;padding-right:4.0pt;"># TP</span></span> </span></span> <span class="ltx_text" id="Sx3.T2.2.2.4.1.3"></span></span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T2.1.1.1" style="padding-left:4.0pt;padding-right:4.0pt;">Image <math alttext="\rightarrow" class="ltx_Math" display="inline" id="Sx3.T2.1.1.1.m1.1"><semantics id="Sx3.T2.1.1.1.m1.1a"><mo id="Sx3.T2.1.1.1.m1.1.1" stretchy="false" xref="Sx3.T2.1.1.1.m1.1.1.cmml">→</mo><annotation-xml encoding="MathML-Content" id="Sx3.T2.1.1.1.m1.1b"><ci id="Sx3.T2.1.1.1.m1.1.1.cmml" xref="Sx3.T2.1.1.1.m1.1.1">→</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.T2.1.1.1.m1.1c">\rightarrow</annotation><annotation encoding="application/x-llamapun" id="Sx3.T2.1.1.1.m1.1d">→</annotation></semantics></math> Text</td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T2.2.2.2" style="padding-left:4.0pt;padding-right:4.0pt;">Text <math alttext="\rightarrow" class="ltx_Math" display="inline" 
id="Sx3.T2.2.2.2.m1.1"><semantics id="Sx3.T2.2.2.2.m1.1a"><mo id="Sx3.T2.2.2.2.m1.1.1" stretchy="false" xref="Sx3.T2.2.2.2.m1.1.1.cmml">→</mo><annotation-xml encoding="MathML-Content" id="Sx3.T2.2.2.2.m1.1b"><ci id="Sx3.T2.2.2.2.m1.1.1.cmml" xref="Sx3.T2.2.2.2.m1.1.1">→</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.T2.2.2.2.m1.1c">\rightarrow</annotation><annotation encoding="application/x-llamapun" id="Sx3.T2.2.2.2.m1.1d">→</annotation></semantics></math> Image</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.5"> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.5.1" style="padding-left:4.0pt;padding-right:4.0pt;">R@1</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.5.2" style="padding-left:4.0pt;padding-right:4.0pt;">R@5</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.5.3" style="padding-left:4.0pt;padding-right:4.0pt;">R@1</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.5.4" style="padding-left:4.0pt;padding-right:4.0pt;">R@5</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.6"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.6.1" style="padding-left:4.0pt;padding-right:4.0pt;">Instruct.</td> <td class="ltx_td ltx_border_t" id="Sx3.T2.4.6.2" style="padding-left:4.0pt;padding-right:4.0pt;"></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.6.3" style="padding-left:4.0pt;padding-right:4.0pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.6.4" style="padding-left:4.0pt;padding-right:4.0pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.6.5" style="padding-left:4.0pt;padding-right:4.0pt;">-0.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.6.6" style="padding-left:4.0pt;padding-right:4.0pt;">0.4</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.7" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.7.1" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.7.1.1" 
style="background-color:#E6E6E6;">CLIP-ViT</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.7.2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.7.2.1" style="background-color:#E6E6E6;">88M</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.7.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.7.3.1" style="background-color:#E6E6E6;">7.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.7.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.7.4.1" style="background-color:#E6E6E6;">18.7</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.7.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.7.5.1" style="background-color:#E6E6E6;">5.4</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.7.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.7.6.1" style="background-color:#E6E6E6;">19.7</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.8"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.8.1" style="padding-left:4.0pt;padding-right:4.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.8.2" style="padding-left:4.0pt;padding-right:4.0pt;">62M</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.8.3" style="padding-left:4.0pt;padding-right:4.0pt;">29.1</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.8.4" style="padding-left:4.0pt;padding-right:4.0pt;">53.3</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.8.5" style="padding-left:4.0pt;padding-right:4.0pt;">29.8</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.8.6" style="padding-left:4.0pt;padding-right:4.0pt;">54.9</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.9"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.9.1" style="padding-left:4.0pt;padding-right:4.0pt;">Improv.</td> <td class="ltx_td" id="Sx3.T2.4.9.2" 
style="padding-left:4.0pt;padding-right:4.0pt;"></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.9.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.9.3.1" style="color:#00E000;">282.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.9.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.9.4.1" style="color:#00E000;">185.0%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.9.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.9.5.1" style="color:#00E000;">451.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.9.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.9.6.1" style="color:#00E000;">178.7%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.3.3" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.3.3.1" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.3.3.1.1" style="background-color:#E6E6E6;">GroupViT<sub class="ltx_sub" id="Sx3.T2.3.3.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx3.T2.3.3.1.1.1.1" style="background-color:#E6E6E6;">y</span></sub></span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.3.3.2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.3.3.2.1" style="background-color:#E6E6E6;">31M</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.3.3.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.3.3.3.1" style="background-color:#E6E6E6;">5.9</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.3.3.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.3.3.4.1" style="background-color:#E6E6E6;">18.0</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.3.3.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span 
class="ltx_text" id="Sx3.T2.3.3.5.1" style="background-color:#E6E6E6;">4.2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.3.3.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.3.3.6.1" style="background-color:#E6E6E6;">17.8</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.10"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.10.1" style="padding-left:4.0pt;padding-right:4.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.10.2" style="padding-left:4.0pt;padding-right:4.0pt;">15M</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.10.3" style="padding-left:4.0pt;padding-right:4.0pt;">13.2</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.10.4" style="padding-left:4.0pt;padding-right:4.0pt;">29.3</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.10.5" style="padding-left:4.0pt;padding-right:4.0pt;">16.7</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.10.6" style="padding-left:4.0pt;padding-right:4.0pt;">34.9</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.11"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.11.1" style="padding-left:4.0pt;padding-right:4.0pt;">Improv.</td> <td class="ltx_td" id="Sx3.T2.4.11.2" style="padding-left:4.0pt;padding-right:4.0pt;"></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.11.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.11.3.1" style="color:#00E000;">123.7%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.11.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.11.4.1" style="color:#00E000;">62.8%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.11.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.11.5.1" style="color:#00E000;">297.6%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.11.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.11.6.1" 
style="color:#00E000;">96.1%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.4" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.4.1" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.4.1.1" style="background-color:#E6E6E6;">GroupViT<sub class="ltx_sub" id="Sx3.T2.4.4.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx3.T2.4.4.1.1.1.1" style="background-color:#E6E6E6;">r</span></sub></span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.4.2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.4.2.1" style="background-color:#E6E6E6;">31M</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.4.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.4.3.1" style="background-color:#E6E6E6;">7.2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.4.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.4.4.1" style="background-color:#E6E6E6;">19.1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.4.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.4.5.1" style="background-color:#E6E6E6;">4.7</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.4.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.4.6.1" style="background-color:#E6E6E6;">18.7</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.12"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.12.1" style="padding-left:4.0pt;padding-right:4.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.12.2" style="padding-left:4.0pt;padding-right:4.0pt;">15M</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.12.3" style="padding-left:4.0pt;padding-right:4.0pt;">12.6</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.12.4" 
style="padding-left:4.0pt;padding-right:4.0pt;">29.8</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.12.5" style="padding-left:4.0pt;padding-right:4.0pt;">17.1</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.12.6" style="padding-left:4.0pt;padding-right:4.0pt;">36.2</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.13"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.13.1" style="padding-left:4.0pt;padding-right:4.0pt;">Improv.</td> <td class="ltx_td" id="Sx3.T2.4.13.2" style="padding-left:4.0pt;padding-right:4.0pt;"></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.13.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.13.3.1" style="color:#00E000;">75.0%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.13.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.13.4.1" style="color:#00E000;">56.0%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.13.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.13.5.1" style="color:#00E000;">263.8%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.13.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.13.6.1" style="color:#00E000;">93.6%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.14" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.14.1" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.14.1.1" style="background-color:#E6E6E6;">SigLIP</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.14.2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.14.2.1" style="background-color:#E6E6E6;">93M</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.14.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.14.3.1" style="background-color:#E6E6E6;">9.3</span></td> <td 
class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.14.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.14.4.1" style="background-color:#E6E6E6;">23.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.14.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.14.5.1" style="background-color:#E6E6E6;">6.9</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.14.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.14.6.1" style="background-color:#E6E6E6;">25.6</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.15"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.15.1" style="padding-left:4.0pt;padding-right:4.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.15.2" style="padding-left:4.0pt;padding-right:4.0pt;">85M</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.15.3" style="padding-left:4.0pt;padding-right:4.0pt;">20.7</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.15.4" style="padding-left:4.0pt;padding-right:4.0pt;">43.6</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.15.5" style="padding-left:4.0pt;padding-right:4.0pt;">25.7</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.15.6" style="padding-left:4.0pt;padding-right:4.0pt;">48.1</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.16"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.16.1" style="padding-left:4.0pt;padding-right:4.0pt;">Improv.</td> <td class="ltx_td" id="Sx3.T2.4.16.2" style="padding-left:4.0pt;padding-right:4.0pt;"></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.16.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.16.3.1" style="color:#00E000;">122.6%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.16.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.16.4.1" style="color:#00E000;">84.7%</span></td> <td class="ltx_td 
ltx_align_center" id="Sx3.T2.4.16.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.16.5.1" style="color:#00E000;">272.5%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.16.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.16.6.1" style="color:#00E000;">87.9%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.17" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.17.1" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.17.1.1" style="background-color:#E6E6E6;">OwlViT</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T2.4.17.2" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.17.2.1" style="background-color:#E6E6E6;">88M</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.17.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.17.3.1" style="background-color:#E6E6E6;">5.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.17.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.17.4.1" style="background-color:#E6E6E6;">16.5</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.17.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.17.5.1" style="background-color:#E6E6E6;">4.1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T2.4.17.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.17.6.1" style="background-color:#E6E6E6;">16.5</span></td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.18"> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.18.1" style="padding-left:4.0pt;padding-right:4.0pt;">+ GiVE</td> <td class="ltx_td ltx_align_left" id="Sx3.T2.4.18.2" style="padding-left:4.0pt;padding-right:4.0pt;">62M</td> <td class="ltx_td ltx_align_center" 
id="Sx3.T2.4.18.3" style="padding-left:4.0pt;padding-right:4.0pt;">12.4</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.18.4" style="padding-left:4.0pt;padding-right:4.0pt;">32.1</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.18.5" style="padding-left:4.0pt;padding-right:4.0pt;">14.4</td> <td class="ltx_td ltx_align_center" id="Sx3.T2.4.18.6" style="padding-left:4.0pt;padding-right:4.0pt;">35.2</td> </tr> <tr class="ltx_tr" id="Sx3.T2.4.19"> <td class="ltx_td ltx_align_left ltx_border_bb" id="Sx3.T2.4.19.1" style="padding-left:4.0pt;padding-right:4.0pt;">Improv.</td> <td class="ltx_td ltx_border_bb" id="Sx3.T2.4.19.2" style="padding-left:4.0pt;padding-right:4.0pt;"></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T2.4.19.3" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.19.3.1" style="color:#00E000;">121.4%</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T2.4.19.4" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.19.4.1" style="color:#00E000;">94.5%</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T2.4.19.5" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.19.5.1" style="color:#00E000;">251.2%</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="Sx3.T2.4.19.6" style="padding-left:4.0pt;padding-right:4.0pt;"><span class="ltx_text" id="Sx3.T2.4.19.6.1" style="color:#00E000;">113.3%</span></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>Fine-tuned image-text retrieval results on the MOInst dataset. “# TP” denotes the number of trainable parameters. 
</figcaption> </figure> <div class="ltx_para" id="Sx3.SSx3.p4"> <p class="ltx_p" id="Sx3.SSx3.p4.8"><span class="ltx_text ltx_font_bold" id="Sx3.SSx3.p4.8.1">Object-focused Image Discrimination (OID)</span> is a binary classification task that requires the model to predict whether a given image and the indicated object match. For each batch of sample pairs <math alttext="\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{O}_{k}\}^{b}_{k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.1.m1.2"><semantics id="Sx3.SSx3.p4.1.m1.2a"><msubsup id="Sx3.SSx3.p4.1.m1.2.2" xref="Sx3.SSx3.p4.1.m1.2.2.cmml"><mrow id="Sx3.SSx3.p4.1.m1.2.2.2.2.2" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.3.cmml"><mo id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.3" stretchy="false" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.3.cmml">{</mo><msubsup id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.2" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.3" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.3.cmml">k</mi><mi id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.3" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.4" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.3.cmml">,</mo><msubsup id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.cmml"><mi id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.2" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.3" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.3.cmml">k</mi><mi id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.3" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.3.cmml">O</mi></msubsup><mo id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.5" stretchy="false" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.3.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p4.1.m1.2.2.4" xref="Sx3.SSx3.p4.1.m1.2.2.4.cmml"><mi id="Sx3.SSx3.p4.1.m1.2.2.4.2" xref="Sx3.SSx3.p4.1.m1.2.2.4.2.cmml">k</mi><mo id="Sx3.SSx3.p4.1.m1.2.2.4.1" xref="Sx3.SSx3.p4.1.m1.2.2.4.1.cmml">=</mo><mn id="Sx3.SSx3.p4.1.m1.2.2.4.3" 
xref="Sx3.SSx3.p4.1.m1.2.2.4.3.cmml">1</mn></mrow><mi id="Sx3.SSx3.p4.1.m1.2.2.2.4" xref="Sx3.SSx3.p4.1.m1.2.2.2.4.cmml">b</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.1.m1.2b"><apply id="Sx3.SSx3.p4.1.m1.2.2.cmml" xref="Sx3.SSx3.p4.1.m1.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.1.m1.2.2.3.cmml" xref="Sx3.SSx3.p4.1.m1.2.2">subscript</csymbol><apply id="Sx3.SSx3.p4.1.m1.2.2.2.cmml" xref="Sx3.SSx3.p4.1.m1.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.1.m1.2.2.2.3.cmml" xref="Sx3.SSx3.p4.1.m1.2.2">superscript</csymbol><set id="Sx3.SSx3.p4.1.m1.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2"><apply id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.1.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.2.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.2">𝒙</ci><ci id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p4.1.m1.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.2">𝒙</ci><ci id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.2.3">𝑂</ci></apply><ci id="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.3.cmml" 
xref="Sx3.SSx3.p4.1.m1.2.2.2.2.2.2.3">𝑘</ci></apply></set><ci id="Sx3.SSx3.p4.1.m1.2.2.2.4.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.2.4">𝑏</ci></apply><apply id="Sx3.SSx3.p4.1.m1.2.2.4.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.4"><eq id="Sx3.SSx3.p4.1.m1.2.2.4.1.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.4.1"></eq><ci id="Sx3.SSx3.p4.1.m1.2.2.4.2.cmml" xref="Sx3.SSx3.p4.1.m1.2.2.4.2">𝑘</ci><cn id="Sx3.SSx3.p4.1.m1.2.2.4.3.cmml" type="integer" xref="Sx3.SSx3.p4.1.m1.2.2.4.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.1.m1.2c">\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{O}_{k}\}^{b}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.1.m1.2d">{ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math>, we additionally construct <math alttext="b" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.2.m2.1"><semantics id="Sx3.SSx3.p4.2.m2.1a"><mi id="Sx3.SSx3.p4.2.m2.1.1" xref="Sx3.SSx3.p4.2.m2.1.1.cmml">b</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.2.m2.1b"><ci id="Sx3.SSx3.p4.2.m2.1.1.cmml" xref="Sx3.SSx3.p4.2.m2.1.1">𝑏</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.2.m2.1c">b</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.2.m2.1d">italic_b</annotation></semantics></math> negative pairs <math alttext="\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{NO}_{k}\}^{b}_{k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.3.m3.2"><semantics id="Sx3.SSx3.p4.3.m3.2a"><msubsup id="Sx3.SSx3.p4.3.m3.2.2" xref="Sx3.SSx3.p4.3.m3.2.2.cmml"><mrow id="Sx3.SSx3.p4.3.m3.2.2.2.2.2" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.3.cmml"><mo id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.3" 
stretchy="false" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.3.cmml">{</mo><msubsup id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.2" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.3" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.3.cmml">k</mi><mi id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.3" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.4" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.3.cmml">,</mo><msubsup id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.cmml"><mi id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.2" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.3" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.3.cmml">k</mi><mrow id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.cmml"><mi id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.2" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.2.cmml">N</mi><mo id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.1" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.3" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.3.cmml">O</mi></mrow></msubsup><mo id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.5" stretchy="false" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.3.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p4.3.m3.2.2.4" xref="Sx3.SSx3.p4.3.m3.2.2.4.cmml"><mi id="Sx3.SSx3.p4.3.m3.2.2.4.2" xref="Sx3.SSx3.p4.3.m3.2.2.4.2.cmml">k</mi><mo id="Sx3.SSx3.p4.3.m3.2.2.4.1" xref="Sx3.SSx3.p4.3.m3.2.2.4.1.cmml">=</mo><mn id="Sx3.SSx3.p4.3.m3.2.2.4.3" xref="Sx3.SSx3.p4.3.m3.2.2.4.3.cmml">1</mn></mrow><mi id="Sx3.SSx3.p4.3.m3.2.2.2.4" xref="Sx3.SSx3.p4.3.m3.2.2.2.4.cmml">b</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.3.m3.2b"><apply id="Sx3.SSx3.p4.3.m3.2.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.3.m3.2.2.3.cmml" xref="Sx3.SSx3.p4.3.m3.2.2">subscript</csymbol><apply id="Sx3.SSx3.p4.3.m3.2.2.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2"><csymbol 
cd="ambiguous" id="Sx3.SSx3.p4.3.m3.2.2.2.3.cmml" xref="Sx3.SSx3.p4.3.m3.2.2">superscript</csymbol><set id="Sx3.SSx3.p4.3.m3.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2"><apply id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.1.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.2.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.2">𝒙</ci><ci id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p4.3.m3.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.2">𝒙</ci><apply id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3"><times id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.1.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.1"></times><ci id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.2">𝑁</ci><ci id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.3.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.2.2.2.3">𝑘</ci></apply></set><ci id="Sx3.SSx3.p4.3.m3.2.2.2.4.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.2.4">𝑏</ci></apply><apply 
id="Sx3.SSx3.p4.3.m3.2.2.4.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.4"><eq id="Sx3.SSx3.p4.3.m3.2.2.4.1.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.4.1"></eq><ci id="Sx3.SSx3.p4.3.m3.2.2.4.2.cmml" xref="Sx3.SSx3.p4.3.m3.2.2.4.2">𝑘</ci><cn id="Sx3.SSx3.p4.3.m3.2.2.4.3.cmml" type="integer" xref="Sx3.SSx3.p4.3.m3.2.2.4.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.3.m3.2c">\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{NO}_{k}\}^{b}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.3.m3.2d">{ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_N italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math>, where <math alttext="\boldsymbol{x}^{NO}_{k}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.4.m4.1"><semantics id="Sx3.SSx3.p4.4.m4.1a"><msubsup id="Sx3.SSx3.p4.4.m4.1.1" xref="Sx3.SSx3.p4.4.m4.1.1.cmml"><mi id="Sx3.SSx3.p4.4.m4.1.1.2.2" xref="Sx3.SSx3.p4.4.m4.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.4.m4.1.1.3" xref="Sx3.SSx3.p4.4.m4.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p4.4.m4.1.1.2.3" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.cmml"><mi id="Sx3.SSx3.p4.4.m4.1.1.2.3.2" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.2.cmml">N</mi><mo id="Sx3.SSx3.p4.4.m4.1.1.2.3.1" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.4.m4.1.1.2.3.3" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.3.cmml">O</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.4.m4.1b"><apply id="Sx3.SSx3.p4.4.m4.1.1.cmml" xref="Sx3.SSx3.p4.4.m4.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.4.m4.1.1.1.cmml" xref="Sx3.SSx3.p4.4.m4.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.4.m4.1.1.2.cmml" xref="Sx3.SSx3.p4.4.m4.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.4.m4.1.1.2.1.cmml" 
xref="Sx3.SSx3.p4.4.m4.1.1">superscript</csymbol><ci id="Sx3.SSx3.p4.4.m4.1.1.2.2.cmml" xref="Sx3.SSx3.p4.4.m4.1.1.2.2">𝒙</ci><apply id="Sx3.SSx3.p4.4.m4.1.1.2.3.cmml" xref="Sx3.SSx3.p4.4.m4.1.1.2.3"><times id="Sx3.SSx3.p4.4.m4.1.1.2.3.1.cmml" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.1"></times><ci id="Sx3.SSx3.p4.4.m4.1.1.2.3.2.cmml" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.2">𝑁</ci><ci id="Sx3.SSx3.p4.4.m4.1.1.2.3.3.cmml" xref="Sx3.SSx3.p4.4.m4.1.1.2.3.3">𝑂</ci></apply></apply><ci id="Sx3.SSx3.p4.4.m4.1.1.3.cmml" xref="Sx3.SSx3.p4.4.m4.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.4.m4.1c">\boldsymbol{x}^{NO}_{k}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.4.m4.1d">bold_italic_x start_POSTSUPERSCRIPT italic_N italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math> indicates an object that does not appear in <math alttext="\boldsymbol{x}^{I}_{k}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.5.m5.1"><semantics id="Sx3.SSx3.p4.5.m5.1a"><msubsup id="Sx3.SSx3.p4.5.m5.1.1" xref="Sx3.SSx3.p4.5.m5.1.1.cmml"><mi id="Sx3.SSx3.p4.5.m5.1.1.2.2" xref="Sx3.SSx3.p4.5.m5.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.5.m5.1.1.3" xref="Sx3.SSx3.p4.5.m5.1.1.3.cmml">k</mi><mi id="Sx3.SSx3.p4.5.m5.1.1.2.3" xref="Sx3.SSx3.p4.5.m5.1.1.2.3.cmml">I</mi></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.5.m5.1b"><apply id="Sx3.SSx3.p4.5.m5.1.1.cmml" xref="Sx3.SSx3.p4.5.m5.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.5.m5.1.1.1.cmml" xref="Sx3.SSx3.p4.5.m5.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.5.m5.1.1.2.cmml" xref="Sx3.SSx3.p4.5.m5.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.5.m5.1.1.2.1.cmml" xref="Sx3.SSx3.p4.5.m5.1.1">superscript</csymbol><ci id="Sx3.SSx3.p4.5.m5.1.1.2.2.cmml" xref="Sx3.SSx3.p4.5.m5.1.1.2.2">𝒙</ci><ci id="Sx3.SSx3.p4.5.m5.1.1.2.3.cmml" xref="Sx3.SSx3.p4.5.m5.1.1.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p4.5.m5.1.1.3.cmml" 
xref="Sx3.SSx3.p4.5.m5.1.1.3">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.5.m5.1c">\boldsymbol{x}^{I}_{k}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.5.m5.1d">bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT</annotation></semantics></math>. These positive and negative samples <math alttext="\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{O\cup NO}_{k}\}^{2b}_{k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.6.m6.2"><semantics id="Sx3.SSx3.p4.6.m6.2a"><msubsup id="Sx3.SSx3.p4.6.m6.2.2" xref="Sx3.SSx3.p4.6.m6.2.2.cmml"><mrow id="Sx3.SSx3.p4.6.m6.2.2.2.2.2" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.3.cmml"><mo id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.3" stretchy="false" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.3.cmml">{</mo><msubsup id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.2" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.3" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.3.cmml">k</mi><mi id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.3" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.4" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.3.cmml">,</mo><msubsup id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.cmml"><mi id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.2" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.2.cmml">𝒙</mi><mi id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.3" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.3.cmml">k</mi><mrow id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.cmml"><mi id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.2" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.2.cmml">O</mi><mo id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.1" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.1.cmml">∪</mo><mrow id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.cmml"><mi id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.2" 
xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.2.cmml">N</mi><mo id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.1" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.3" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.3.cmml">O</mi></mrow></mrow></msubsup><mo id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.5" stretchy="false" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.3.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p4.6.m6.2.2.4" xref="Sx3.SSx3.p4.6.m6.2.2.4.cmml"><mi id="Sx3.SSx3.p4.6.m6.2.2.4.2" xref="Sx3.SSx3.p4.6.m6.2.2.4.2.cmml">k</mi><mo id="Sx3.SSx3.p4.6.m6.2.2.4.1" xref="Sx3.SSx3.p4.6.m6.2.2.4.1.cmml">=</mo><mn id="Sx3.SSx3.p4.6.m6.2.2.4.3" xref="Sx3.SSx3.p4.6.m6.2.2.4.3.cmml">1</mn></mrow><mrow id="Sx3.SSx3.p4.6.m6.2.2.2.4" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.cmml"><mn id="Sx3.SSx3.p4.6.m6.2.2.2.4.2" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.2.cmml">2</mn><mo id="Sx3.SSx3.p4.6.m6.2.2.2.4.1" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.6.m6.2.2.2.4.3" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.3.cmml">b</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.6.m6.2b"><apply id="Sx3.SSx3.p4.6.m6.2.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.6.m6.2.2.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2">subscript</csymbol><apply id="Sx3.SSx3.p4.6.m6.2.2.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.6.m6.2.2.2.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2">superscript</csymbol><set id="Sx3.SSx3.p4.6.m6.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2"><apply id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.1.cmml" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.2.cmml" 
xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.2">𝒙</ci><ci id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p4.6.m6.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2">subscript</csymbol><apply id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.1.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2">superscript</csymbol><ci id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.2">𝒙</ci><apply id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3"><union id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.1.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.1"></union><ci id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.2">𝑂</ci><apply id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3"><times id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.1.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.1"></times><ci id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.2">𝑁</ci><ci id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.2.3.3.3">𝑂</ci></apply></apply></apply><ci id="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.2.2.2.3">𝑘</ci></apply></set><apply id="Sx3.SSx3.p4.6.m6.2.2.2.4.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.4"><times id="Sx3.SSx3.p4.6.m6.2.2.2.4.1.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.1"></times><cn id="Sx3.SSx3.p4.6.m6.2.2.2.4.2.cmml" type="integer" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.2">2</cn><ci id="Sx3.SSx3.p4.6.m6.2.2.2.4.3.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.2.4.3">𝑏</ci></apply></apply><apply id="Sx3.SSx3.p4.6.m6.2.2.4.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.4"><eq 
id="Sx3.SSx3.p4.6.m6.2.2.4.1.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.4.1"></eq><ci id="Sx3.SSx3.p4.6.m6.2.2.4.2.cmml" xref="Sx3.SSx3.p4.6.m6.2.2.4.2">𝑘</ci><cn id="Sx3.SSx3.p4.6.m6.2.2.4.3.cmml" type="integer" xref="Sx3.SSx3.p4.6.m6.2.2.4.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.6.m6.2c">\{\boldsymbol{x}^{I}_{k},\boldsymbol{x}^{O\cup NO}_{k}\}^{2b}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.6.m6.2d">{ bold_italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_O ∪ italic_N italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT 2 italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math> are encoded to <math alttext="\{\boldsymbol{y}^{I|O\cup NO}_{k}\}^{2b}_{k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.7.m7.1"><semantics id="Sx3.SSx3.p4.7.m7.1a"><msubsup id="Sx3.SSx3.p4.7.m7.1.1" xref="Sx3.SSx3.p4.7.m7.1.1.cmml"><mrow id="Sx3.SSx3.p4.7.m7.1.1.1.1.1" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.2.cmml"><mo id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.2" stretchy="false" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.2.cmml">{</mo><msubsup id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.2" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.2.cmml">𝒚</mi><mi id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.3.cmml">k</mi><mrow id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.cmml"><mi id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.2" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.2.cmml">I</mi><mo fence="false" id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.1" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.1.cmml">|</mo><mrow id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.cmml"><mi id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.2" 
xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.2.cmml">O</mi><mo id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.1" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.1.cmml">∪</mo><mrow id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.cmml"><mi id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.2" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.2.cmml">N</mi><mo id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.1" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.3.cmml">O</mi></mrow></mrow></mrow></msubsup><mo id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.3" stretchy="false" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.2.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p4.7.m7.1.1.3" xref="Sx3.SSx3.p4.7.m7.1.1.3.cmml"><mi id="Sx3.SSx3.p4.7.m7.1.1.3.2" xref="Sx3.SSx3.p4.7.m7.1.1.3.2.cmml">k</mi><mo id="Sx3.SSx3.p4.7.m7.1.1.3.1" xref="Sx3.SSx3.p4.7.m7.1.1.3.1.cmml">=</mo><mn id="Sx3.SSx3.p4.7.m7.1.1.3.3" xref="Sx3.SSx3.p4.7.m7.1.1.3.3.cmml">1</mn></mrow><mrow id="Sx3.SSx3.p4.7.m7.1.1.1.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.cmml"><mn id="Sx3.SSx3.p4.7.m7.1.1.1.3.2" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.2.cmml">2</mn><mo id="Sx3.SSx3.p4.7.m7.1.1.1.3.1" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.7.m7.1.1.1.3.3" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.3.cmml">b</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.7.m7.1b"><apply id="Sx3.SSx3.p4.7.m7.1.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.7.m7.1.1.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.7.m7.1.1.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.7.m7.1.1.1.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1">superscript</csymbol><set id="Sx3.SSx3.p4.7.m7.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1"><apply id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.1.cmml" 
xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1">superscript</csymbol><ci id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.2">𝒚</ci><apply id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3"><csymbol cd="latexml" id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.1">conditional</csymbol><ci id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.2">𝐼</ci><apply id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3"><union id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.1"></union><ci id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.2">𝑂</ci><apply id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3"><times id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.1"></times><ci id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.2">𝑁</ci><ci id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.2.3.3.3.3">𝑂</ci></apply></apply></apply></apply><ci id="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.1.1.1.3">𝑘</ci></apply></set><apply id="Sx3.SSx3.p4.7.m7.1.1.1.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.3"><times id="Sx3.SSx3.p4.7.m7.1.1.1.3.1.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.1"></times><cn id="Sx3.SSx3.p4.7.m7.1.1.1.3.2.cmml" type="integer" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.2">2</cn><ci id="Sx3.SSx3.p4.7.m7.1.1.1.3.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.1.3.3">𝑏</ci></apply></apply><apply id="Sx3.SSx3.p4.7.m7.1.1.3.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.3"><eq id="Sx3.SSx3.p4.7.m7.1.1.3.1.cmml" 
xref="Sx3.SSx3.p4.7.m7.1.1.3.1"></eq><ci id="Sx3.SSx3.p4.7.m7.1.1.3.2.cmml" xref="Sx3.SSx3.p4.7.m7.1.1.3.2">𝑘</ci><cn id="Sx3.SSx3.p4.7.m7.1.1.3.3.cmml" type="integer" xref="Sx3.SSx3.p4.7.m7.1.1.3.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.7.m7.1c">\{\boldsymbol{y}^{I|O\cup NO}_{k}\}^{2b}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.7.m7.1d">{ bold_italic_y start_POSTSUPERSCRIPT italic_I | italic_O ∪ italic_N italic_O end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT 2 italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math>, which are then input into a binary linear classifier to obtain logits <math alttext="\{\boldsymbol{z}_{k}\}^{2b}_{k=1}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.8.m8.1"><semantics id="Sx3.SSx3.p4.8.m8.1a"><msubsup id="Sx3.SSx3.p4.8.m8.1.1" xref="Sx3.SSx3.p4.8.m8.1.1.cmml"><mrow id="Sx3.SSx3.p4.8.m8.1.1.1.1.1" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.2.cmml"><mo id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.2" stretchy="false" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.2.cmml">{</mo><msub id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.cmml"><mi id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.2" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.3" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.3.cmml">k</mi></msub><mo id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.3" stretchy="false" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.2.cmml">}</mo></mrow><mrow id="Sx3.SSx3.p4.8.m8.1.1.3" xref="Sx3.SSx3.p4.8.m8.1.1.3.cmml"><mi id="Sx3.SSx3.p4.8.m8.1.1.3.2" xref="Sx3.SSx3.p4.8.m8.1.1.3.2.cmml">k</mi><mo id="Sx3.SSx3.p4.8.m8.1.1.3.1" xref="Sx3.SSx3.p4.8.m8.1.1.3.1.cmml">=</mo><mn id="Sx3.SSx3.p4.8.m8.1.1.3.3" xref="Sx3.SSx3.p4.8.m8.1.1.3.3.cmml">1</mn></mrow><mrow id="Sx3.SSx3.p4.8.m8.1.1.1.3" xref="Sx3.SSx3.p4.8.m8.1.1.1.3.cmml"><mn id="Sx3.SSx3.p4.8.m8.1.1.1.3.2" 
xref="Sx3.SSx3.p4.8.m8.1.1.1.3.2.cmml">2</mn><mo id="Sx3.SSx3.p4.8.m8.1.1.1.3.1" xref="Sx3.SSx3.p4.8.m8.1.1.1.3.1.cmml">⁢</mo><mi id="Sx3.SSx3.p4.8.m8.1.1.1.3.3" xref="Sx3.SSx3.p4.8.m8.1.1.1.3.3.cmml">b</mi></mrow></msubsup><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.8.m8.1b"><apply id="Sx3.SSx3.p4.8.m8.1.1.cmml" xref="Sx3.SSx3.p4.8.m8.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.8.m8.1.1.2.cmml" xref="Sx3.SSx3.p4.8.m8.1.1">subscript</csymbol><apply id="Sx3.SSx3.p4.8.m8.1.1.1.cmml" xref="Sx3.SSx3.p4.8.m8.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.8.m8.1.1.1.2.cmml" xref="Sx3.SSx3.p4.8.m8.1.1">superscript</csymbol><set id="Sx3.SSx3.p4.8.m8.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1"><apply id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.1.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1">subscript</csymbol><ci id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.2.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.2">𝒛</ci><ci id="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.3.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.1.1.1.3">𝑘</ci></apply></set><apply id="Sx3.SSx3.p4.8.m8.1.1.1.3.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.3"><times id="Sx3.SSx3.p4.8.m8.1.1.1.3.1.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.3.1"></times><cn id="Sx3.SSx3.p4.8.m8.1.1.1.3.2.cmml" type="integer" xref="Sx3.SSx3.p4.8.m8.1.1.1.3.2">2</cn><ci id="Sx3.SSx3.p4.8.m8.1.1.1.3.3.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.1.3.3">𝑏</ci></apply></apply><apply id="Sx3.SSx3.p4.8.m8.1.1.3.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.3"><eq id="Sx3.SSx3.p4.8.m8.1.1.3.1.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.3.1"></eq><ci id="Sx3.SSx3.p4.8.m8.1.1.3.2.cmml" xref="Sx3.SSx3.p4.8.m8.1.1.3.2">𝑘</ci><cn id="Sx3.SSx3.p4.8.m8.1.1.3.3.cmml" type="integer" xref="Sx3.SSx3.p4.8.m8.1.1.3.3">1</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.8.m8.1c">\{\boldsymbol{z}_{k}\}^{2b}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.8.m8.1d">{ 
bold_italic_z start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT 2 italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math>. The loss function is formalized as</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="A1.EGx3"> <tbody id="Sx3.E13"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle p_{i}" class="ltx_Math" display="inline" id="Sx3.E13.m1.1"><semantics id="Sx3.E13.m1.1a"><msub id="Sx3.E13.m1.1.1" xref="Sx3.E13.m1.1.1.cmml"><mi id="Sx3.E13.m1.1.1.2" xref="Sx3.E13.m1.1.1.2.cmml">p</mi><mi id="Sx3.E13.m1.1.1.3" xref="Sx3.E13.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="Sx3.E13.m1.1b"><apply id="Sx3.E13.m1.1.1.cmml" xref="Sx3.E13.m1.1.1"><csymbol cd="ambiguous" id="Sx3.E13.m1.1.1.1.cmml" xref="Sx3.E13.m1.1.1">subscript</csymbol><ci id="Sx3.E13.m1.1.1.2.cmml" xref="Sx3.E13.m1.1.1.2">𝑝</ci><ci id="Sx3.E13.m1.1.1.3.cmml" xref="Sx3.E13.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E13.m1.1c">\displaystyle p_{i}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E13.m1.1d">italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\frac{1}{1+\exp(-z_{i})}," class="ltx_Math" display="inline" id="Sx3.E13.m2.3"><semantics id="Sx3.E13.m2.3a"><mrow id="Sx3.E13.m2.3.3.1" xref="Sx3.E13.m2.3.3.1.1.cmml"><mrow id="Sx3.E13.m2.3.3.1.1" xref="Sx3.E13.m2.3.3.1.1.cmml"><mi id="Sx3.E13.m2.3.3.1.1.2" xref="Sx3.E13.m2.3.3.1.1.2.cmml"></mi><mo id="Sx3.E13.m2.3.3.1.1.1" xref="Sx3.E13.m2.3.3.1.1.1.cmml">=</mo><mstyle displaystyle="true" id="Sx3.E13.m2.2.2" xref="Sx3.E13.m2.2.2.cmml"><mfrac id="Sx3.E13.m2.2.2a" xref="Sx3.E13.m2.2.2.cmml"><mn id="Sx3.E13.m2.2.2.4" 
xref="Sx3.E13.m2.2.2.4.cmml">1</mn><mrow id="Sx3.E13.m2.2.2.2" xref="Sx3.E13.m2.2.2.2.cmml"><mn id="Sx3.E13.m2.2.2.2.4" xref="Sx3.E13.m2.2.2.2.4.cmml">1</mn><mo id="Sx3.E13.m2.2.2.2.3" xref="Sx3.E13.m2.2.2.2.3.cmml">+</mo><mrow id="Sx3.E13.m2.2.2.2.2.1" xref="Sx3.E13.m2.2.2.2.2.2.cmml"><mi id="Sx3.E13.m2.1.1.1.1" xref="Sx3.E13.m2.1.1.1.1.cmml">exp</mi><mo id="Sx3.E13.m2.2.2.2.2.1a" xref="Sx3.E13.m2.2.2.2.2.2.cmml">⁡</mo><mrow id="Sx3.E13.m2.2.2.2.2.1.1" xref="Sx3.E13.m2.2.2.2.2.2.cmml"><mo id="Sx3.E13.m2.2.2.2.2.1.1.2" stretchy="false" xref="Sx3.E13.m2.2.2.2.2.2.cmml">(</mo><mrow id="Sx3.E13.m2.2.2.2.2.1.1.1" xref="Sx3.E13.m2.2.2.2.2.1.1.1.cmml"><mo id="Sx3.E13.m2.2.2.2.2.1.1.1a" xref="Sx3.E13.m2.2.2.2.2.1.1.1.cmml">−</mo><msub id="Sx3.E13.m2.2.2.2.2.1.1.1.2" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2.cmml"><mi id="Sx3.E13.m2.2.2.2.2.1.1.1.2.2" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2.2.cmml">z</mi><mi id="Sx3.E13.m2.2.2.2.2.1.1.1.2.3" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2.3.cmml">i</mi></msub></mrow><mo id="Sx3.E13.m2.2.2.2.2.1.1.3" stretchy="false" xref="Sx3.E13.m2.2.2.2.2.2.cmml">)</mo></mrow></mrow></mrow></mfrac></mstyle></mrow><mo id="Sx3.E13.m2.3.3.1.2" xref="Sx3.E13.m2.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E13.m2.3b"><apply id="Sx3.E13.m2.3.3.1.1.cmml" xref="Sx3.E13.m2.3.3.1"><eq id="Sx3.E13.m2.3.3.1.1.1.cmml" xref="Sx3.E13.m2.3.3.1.1.1"></eq><csymbol cd="latexml" id="Sx3.E13.m2.3.3.1.1.2.cmml" xref="Sx3.E13.m2.3.3.1.1.2">absent</csymbol><apply id="Sx3.E13.m2.2.2.cmml" xref="Sx3.E13.m2.2.2"><divide id="Sx3.E13.m2.2.2.3.cmml" xref="Sx3.E13.m2.2.2"></divide><cn id="Sx3.E13.m2.2.2.4.cmml" type="integer" xref="Sx3.E13.m2.2.2.4">1</cn><apply id="Sx3.E13.m2.2.2.2.cmml" xref="Sx3.E13.m2.2.2.2"><plus id="Sx3.E13.m2.2.2.2.3.cmml" xref="Sx3.E13.m2.2.2.2.3"></plus><cn id="Sx3.E13.m2.2.2.2.4.cmml" type="integer" xref="Sx3.E13.m2.2.2.2.4">1</cn><apply id="Sx3.E13.m2.2.2.2.2.2.cmml" xref="Sx3.E13.m2.2.2.2.2.1"><exp id="Sx3.E13.m2.1.1.1.1.cmml" 
xref="Sx3.E13.m2.1.1.1.1"></exp><apply id="Sx3.E13.m2.2.2.2.2.1.1.1.cmml" xref="Sx3.E13.m2.2.2.2.2.1.1.1"><minus id="Sx3.E13.m2.2.2.2.2.1.1.1.1.cmml" xref="Sx3.E13.m2.2.2.2.2.1.1.1"></minus><apply id="Sx3.E13.m2.2.2.2.2.1.1.1.2.cmml" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E13.m2.2.2.2.2.1.1.1.2.1.cmml" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2">subscript</csymbol><ci id="Sx3.E13.m2.2.2.2.2.1.1.1.2.2.cmml" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2.2">𝑧</ci><ci id="Sx3.E13.m2.2.2.2.2.1.1.1.2.3.cmml" xref="Sx3.E13.m2.2.2.2.2.1.1.1.2.3">𝑖</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E13.m2.3c">\displaystyle=\frac{1}{1+\exp(-z_{i})},</annotation><annotation encoding="application/x-llamapun" id="Sx3.E13.m2.3d">= divide start_ARG 1 end_ARG start_ARG 1 + roman_exp ( - italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(13)</span></td> </tr></tbody> <tbody id="Sx3.E14"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathcal{L}^{\text{OID}}" class="ltx_Math" display="inline" id="Sx3.E14.m1.1"><semantics id="Sx3.E14.m1.1a"><msup id="Sx3.E14.m1.1.1" xref="Sx3.E14.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="Sx3.E14.m1.1.1.2" xref="Sx3.E14.m1.1.1.2.cmml">ℒ</mi><mtext id="Sx3.E14.m1.1.1.3" xref="Sx3.E14.m1.1.1.3a.cmml">OID</mtext></msup><annotation-xml encoding="MathML-Content" id="Sx3.E14.m1.1b"><apply id="Sx3.E14.m1.1.1.cmml" xref="Sx3.E14.m1.1.1"><csymbol cd="ambiguous" id="Sx3.E14.m1.1.1.1.cmml" xref="Sx3.E14.m1.1.1">superscript</csymbol><ci id="Sx3.E14.m1.1.1.2.cmml" xref="Sx3.E14.m1.1.1.2">ℒ</ci><ci 
id="Sx3.E14.m1.1.1.3a.cmml" xref="Sx3.E14.m1.1.1.3"><mtext id="Sx3.E14.m1.1.1.3.cmml" mathsize="70%" xref="Sx3.E14.m1.1.1.3">OID</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.E14.m1.1c">\displaystyle\mathcal{L}^{\text{OID}}</annotation><annotation encoding="application/x-llamapun" id="Sx3.E14.m1.1d">caligraphic_L start_POSTSUPERSCRIPT OID end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=-\frac{1}{2b}\sum_{i=1}^{2b}[t_{i}\log(p_{i})+(1-t_{i})\log(1-p_% {i})]," class="ltx_Math" display="inline" id="Sx3.E14.m2.3"><semantics id="Sx3.E14.m2.3a"><mrow id="Sx3.E14.m2.3.3.1" xref="Sx3.E14.m2.3.3.1.1.cmml"><mrow id="Sx3.E14.m2.3.3.1.1" xref="Sx3.E14.m2.3.3.1.1.cmml"><mi id="Sx3.E14.m2.3.3.1.1.3" xref="Sx3.E14.m2.3.3.1.1.3.cmml"></mi><mo id="Sx3.E14.m2.3.3.1.1.2" xref="Sx3.E14.m2.3.3.1.1.2.cmml">=</mo><mrow id="Sx3.E14.m2.3.3.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.cmml"><mo id="Sx3.E14.m2.3.3.1.1.1a" xref="Sx3.E14.m2.3.3.1.1.1.cmml">−</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.cmml"><mstyle displaystyle="true" id="Sx3.E14.m2.3.3.1.1.1.1.3" xref="Sx3.E14.m2.3.3.1.1.1.1.3.cmml"><mfrac id="Sx3.E14.m2.3.3.1.1.1.1.3a" xref="Sx3.E14.m2.3.3.1.1.1.1.3.cmml"><mn id="Sx3.E14.m2.3.3.1.1.1.1.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.3.2.cmml">1</mn><mrow id="Sx3.E14.m2.3.3.1.1.1.1.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.cmml"><mn id="Sx3.E14.m2.3.3.1.1.1.1.3.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.2.cmml">2</mn><mo id="Sx3.E14.m2.3.3.1.1.1.1.3.3.1" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.1.cmml">⁢</mo><mi id="Sx3.E14.m2.3.3.1.1.1.1.3.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.3.cmml">b</mi></mrow></mfrac></mstyle><mo id="Sx3.E14.m2.3.3.1.1.1.1.2" xref="Sx3.E14.m2.3.3.1.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.cmml"><mstyle displaystyle="true" id="Sx3.E14.m2.3.3.1.1.1.1.1.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.cmml"><munderover 
id="Sx3.E14.m2.3.3.1.1.1.1.1.2a" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.cmml"><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.2" movablelimits="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.2.cmml">∑</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.cmml"><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.2.cmml">i</mi><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.1.cmml">=</mo><mn id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.3.cmml">1</mn></mrow><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.cmml"><mn id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.2.cmml">2</mn><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.1.cmml">⁢</mo><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.3.cmml">b</mi></mrow></munderover></mstyle><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.2.cmml"><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.2.1.cmml">[</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.cmml"><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.cmml"><msub id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.2.cmml">t</mi><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.3.cmml">i</mi></msub><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.2" lspace="0.167em" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.2.cmml"><mi id="Sx3.E14.m2.1.1" xref="Sx3.E14.m2.1.1.cmml">log</mi><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1a" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.2.cmml">⁡</mo><mrow 
id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.2.cmml"><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.2.cmml">(</mo><msub id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">p</mi><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">i</mi></msub><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.2.cmml">)</mo></mrow></mrow></mrow><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.4" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.4.cmml">+</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.cmml"><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.cmml"><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.2" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.cmml">(</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.cmml"><mn id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.2.cmml">1</mn><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.1.cmml">−</mo><msub id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.cmml"><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.2.cmml">t</mi><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.3" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.cmml">)</mo></mrow><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.3" lspace="0.167em" 
xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.3.cmml">⁢</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.2.cmml"><mi id="Sx3.E14.m2.2.2" xref="Sx3.E14.m2.2.2.cmml">log</mi><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1a" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.2.cmml">⁡</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.2.cmml"><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.2" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.2.cmml">(</mo><mrow id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.cmml"><mn id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.2.cmml">1</mn><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.1" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.1.cmml">−</mo><msub id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.cmml"><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.2" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.2.cmml">p</mi><mi id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.3" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.3" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.2.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.3" stretchy="false" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow></mrow></mrow><mo id="Sx3.E14.m2.3.3.1.2" xref="Sx3.E14.m2.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="Sx3.E14.m2.3b"><apply id="Sx3.E14.m2.3.3.1.1.cmml" xref="Sx3.E14.m2.3.3.1"><eq id="Sx3.E14.m2.3.3.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.2"></eq><csymbol cd="latexml" id="Sx3.E14.m2.3.3.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.3">absent</csymbol><apply id="Sx3.E14.m2.3.3.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1"><minus id="Sx3.E14.m2.3.3.1.1.1.2.cmml" 
xref="Sx3.E14.m2.3.3.1.1.1"></minus><apply id="Sx3.E14.m2.3.3.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1"><times id="Sx3.E14.m2.3.3.1.1.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.2"></times><apply id="Sx3.E14.m2.3.3.1.1.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.3"><divide id="Sx3.E14.m2.3.3.1.1.1.1.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.3"></divide><cn id="Sx3.E14.m2.3.3.1.1.1.1.3.2.cmml" type="integer" xref="Sx3.E14.m2.3.3.1.1.1.1.3.2">1</cn><apply id="Sx3.E14.m2.3.3.1.1.1.1.3.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3"><times id="Sx3.E14.m2.3.3.1.1.1.1.3.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.1"></times><cn id="Sx3.E14.m2.3.3.1.1.1.1.3.3.2.cmml" type="integer" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.2">2</cn><ci id="Sx3.E14.m2.3.3.1.1.1.1.3.3.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.3.3.3">𝑏</ci></apply></apply><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1"><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E14.m2.3.3.1.1.1.1.1.2.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2">superscript</csymbol><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2"><csymbol cd="ambiguous" id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2">subscript</csymbol><sum id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.2"></sum><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3"><eq id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.1"></eq><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.2">𝑖</ci><cn id="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.3.cmml" type="integer" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.2.3.3">1</cn></apply></apply><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3"><times id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.1"></times><cn id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.2.cmml" 
type="integer" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.2">2</cn><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.2.3.3">𝑏</ci></apply></apply><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1"><csymbol cd="latexml" id="Sx3.E14.m2.3.3.1.1.1.1.1.1.2.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.2">delimited-[]</csymbol><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1"><plus id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.4.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.4"></plus><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1"><times id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.2"></times><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.2">𝑡</ci><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.3.3">𝑖</ci></apply><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1"><log id="Sx3.E14.m2.1.1.cmml" xref="Sx3.E14.m2.1.1"></log><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.2">𝑝</ci><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.1.1.1.1.1.3">𝑖</ci></apply></apply></apply><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3"><times id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.3.cmml" 
xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.3"></times><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1"><minus id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.1"></minus><cn id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.2.cmml" type="integer" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.2">1</cn><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3">subscript</csymbol><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.2">𝑡</ci><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.2.1.1.1.3.3">𝑖</ci></apply></apply><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1"><log id="Sx3.E14.m2.2.2.cmml" xref="Sx3.E14.m2.2.2"></log><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1"><minus id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.1"></minus><cn id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.2.cmml" type="integer" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.2">1</cn><apply id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3"><csymbol cd="ambiguous" id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.1.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3">subscript</csymbol><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.2.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.2">𝑝</ci><ci id="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.3.cmml" xref="Sx3.E14.m2.3.3.1.1.1.1.1.1.1.1.3.2.1.1.1.3.3">𝑖</ci></apply></apply></apply></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="Sx3.E14.m2.3c">\displaystyle=-\frac{1}{2b}\sum_{i=1}^{2b}[t_{i}\log(p_{i})+(1-t_{i})\log(1-p_% {i})],</annotation><annotation encoding="application/x-llamapun" id="Sx3.E14.m2.3d">= - divide start_ARG 1 end_ARG start_ARG 2 italic_b end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_b end_POSTSUPERSCRIPT [ italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + ( 1 - italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log ( 1 - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(14)</span></td> </tr></tbody> </table> <p class="ltx_p" id="Sx3.SSx3.p4.11">where <math alttext="p_{i}\in\mathbb{R}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.9.m1.1"><semantics id="Sx3.SSx3.p4.9.m1.1a"><mrow id="Sx3.SSx3.p4.9.m1.1.1" xref="Sx3.SSx3.p4.9.m1.1.1.cmml"><msub id="Sx3.SSx3.p4.9.m1.1.1.2" xref="Sx3.SSx3.p4.9.m1.1.1.2.cmml"><mi id="Sx3.SSx3.p4.9.m1.1.1.2.2" xref="Sx3.SSx3.p4.9.m1.1.1.2.2.cmml">p</mi><mi id="Sx3.SSx3.p4.9.m1.1.1.2.3" xref="Sx3.SSx3.p4.9.m1.1.1.2.3.cmml">i</mi></msub><mo id="Sx3.SSx3.p4.9.m1.1.1.1" xref="Sx3.SSx3.p4.9.m1.1.1.1.cmml">∈</mo><mi id="Sx3.SSx3.p4.9.m1.1.1.3" xref="Sx3.SSx3.p4.9.m1.1.1.3.cmml">ℝ</mi></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.9.m1.1b"><apply id="Sx3.SSx3.p4.9.m1.1.1.cmml" xref="Sx3.SSx3.p4.9.m1.1.1"><in id="Sx3.SSx3.p4.9.m1.1.1.1.cmml" xref="Sx3.SSx3.p4.9.m1.1.1.1"></in><apply id="Sx3.SSx3.p4.9.m1.1.1.2.cmml" xref="Sx3.SSx3.p4.9.m1.1.1.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.9.m1.1.1.2.1.cmml" xref="Sx3.SSx3.p4.9.m1.1.1.2">subscript</csymbol><ci id="Sx3.SSx3.p4.9.m1.1.1.2.2.cmml" 
xref="Sx3.SSx3.p4.9.m1.1.1.2.2">𝑝</ci><ci id="Sx3.SSx3.p4.9.m1.1.1.2.3.cmml" xref="Sx3.SSx3.p4.9.m1.1.1.2.3">𝑖</ci></apply><ci id="Sx3.SSx3.p4.9.m1.1.1.3.cmml" xref="Sx3.SSx3.p4.9.m1.1.1.3">ℝ</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.9.m1.1c">p_{i}\in\mathbb{R}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.9.m1.1d">italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R</annotation></semantics></math> and <math alttext="t_{i}\in\{0,1\}" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.10.m2.2"><semantics id="Sx3.SSx3.p4.10.m2.2a"><mrow id="Sx3.SSx3.p4.10.m2.2.3" xref="Sx3.SSx3.p4.10.m2.2.3.cmml"><msub id="Sx3.SSx3.p4.10.m2.2.3.2" xref="Sx3.SSx3.p4.10.m2.2.3.2.cmml"><mi id="Sx3.SSx3.p4.10.m2.2.3.2.2" xref="Sx3.SSx3.p4.10.m2.2.3.2.2.cmml">t</mi><mi id="Sx3.SSx3.p4.10.m2.2.3.2.3" xref="Sx3.SSx3.p4.10.m2.2.3.2.3.cmml">i</mi></msub><mo id="Sx3.SSx3.p4.10.m2.2.3.1" xref="Sx3.SSx3.p4.10.m2.2.3.1.cmml">∈</mo><mrow id="Sx3.SSx3.p4.10.m2.2.3.3.2" xref="Sx3.SSx3.p4.10.m2.2.3.3.1.cmml"><mo id="Sx3.SSx3.p4.10.m2.2.3.3.2.1" stretchy="false" xref="Sx3.SSx3.p4.10.m2.2.3.3.1.cmml">{</mo><mn id="Sx3.SSx3.p4.10.m2.1.1" xref="Sx3.SSx3.p4.10.m2.1.1.cmml">0</mn><mo id="Sx3.SSx3.p4.10.m2.2.3.3.2.2" xref="Sx3.SSx3.p4.10.m2.2.3.3.1.cmml">,</mo><mn id="Sx3.SSx3.p4.10.m2.2.2" xref="Sx3.SSx3.p4.10.m2.2.2.cmml">1</mn><mo id="Sx3.SSx3.p4.10.m2.2.3.3.2.3" stretchy="false" xref="Sx3.SSx3.p4.10.m2.2.3.3.1.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.10.m2.2b"><apply id="Sx3.SSx3.p4.10.m2.2.3.cmml" xref="Sx3.SSx3.p4.10.m2.2.3"><in id="Sx3.SSx3.p4.10.m2.2.3.1.cmml" xref="Sx3.SSx3.p4.10.m2.2.3.1"></in><apply id="Sx3.SSx3.p4.10.m2.2.3.2.cmml" xref="Sx3.SSx3.p4.10.m2.2.3.2"><csymbol cd="ambiguous" id="Sx3.SSx3.p4.10.m2.2.3.2.1.cmml" xref="Sx3.SSx3.p4.10.m2.2.3.2">subscript</csymbol><ci id="Sx3.SSx3.p4.10.m2.2.3.2.2.cmml" xref="Sx3.SSx3.p4.10.m2.2.3.2.2">𝑡</ci><ci 
id="Sx3.SSx3.p4.10.m2.2.3.2.3.cmml" xref="Sx3.SSx3.p4.10.m2.2.3.2.3">𝑖</ci></apply><set id="Sx3.SSx3.p4.10.m2.2.3.3.1.cmml" xref="Sx3.SSx3.p4.10.m2.2.3.3.2"><cn id="Sx3.SSx3.p4.10.m2.1.1.cmml" type="integer" xref="Sx3.SSx3.p4.10.m2.1.1">0</cn><cn id="Sx3.SSx3.p4.10.m2.2.2.cmml" type="integer" xref="Sx3.SSx3.p4.10.m2.2.2">1</cn></set></apply></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.10.m2.2c">t_{i}\in\{0,1\}</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.10.m2.2d">italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 0 , 1 }</annotation></semantics></math> denote the predicted probability and true label of the <math alttext="i" class="ltx_Math" display="inline" id="Sx3.SSx3.p4.11.m3.1"><semantics id="Sx3.SSx3.p4.11.m3.1a"><mi id="Sx3.SSx3.p4.11.m3.1.1" xref="Sx3.SSx3.p4.11.m3.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="Sx3.SSx3.p4.11.m3.1b"><ci id="Sx3.SSx3.p4.11.m3.1.1.cmml" xref="Sx3.SSx3.p4.11.m3.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="Sx3.SSx3.p4.11.m3.1c">i</annotation><annotation encoding="application/x-llamapun" id="Sx3.SSx3.p4.11.m3.1d">italic_i</annotation></semantics></math>-th sample, respectively.</p> </div> <div class="ltx_para" id="Sx3.SSx3.p5"> <p class="ltx_p" id="Sx3.SSx3.p5.1">For the inference pipeline, see Appendix Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.F5" title="Figure 5 ‣ Prompt Engineering and Pipeline ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">5</span></a> and its accompanying description.</p> </div> </section> <section class="ltx_subsection" id="Sx3.SSx4"> <h3 class="ltx_title ltx_title_subsection">Instruction of New Dataset</h3> <div class="ltx_para" id="Sx3.SSx4.p1"> <p class="ltx_p" id="Sx3.SSx4.p1.1">Although the Visual Genome (VG) dataset <cite class="ltx_cite ltx_citemacro_citep">(Krishna et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib18" title="">2017</a>)</cite> provides multiple objects and captions per image, it lacks object-caption correspondence, has noisy labels, and contains low-quality text. We therefore do not recommend using the VG dataset directly. To address these issues, we construct the Multi-Object Instruction (MOInst) dataset, which comprises 81,536 high-fidelity, complex images accompanied by 244,378 textual captions. Each caption is associated with one of 264 distinct categories. Additional details are available in the Appendix. </p> </div> <figure class="ltx_table" id="Sx3.T3"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="Sx3.T3.3"> <tr class="ltx_tr" id="Sx3.T3.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.3" rowspan="3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.3.1">Model</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.4" rowspan="3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.4.1"># TP</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.5" rowspan="3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.5.1">Inst.</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.6" rowspan="3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.6.1">Img Enc.</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.7" rowspan="3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.7.1">Type</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.8" rowspan="3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.8.1">GiVE</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx3.T3.2.2.9" rowspan="3" 
style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.2.2.9.1"><span class="ltx_text" id="Sx3.T3.2.2.9.1.1"></span><span class="ltx_text" id="Sx3.T3.2.2.9.1.2"> <span class="ltx_tabular ltx_align_middle" id="Sx3.T3.2.2.9.1.2.1"> <span class="ltx_tr" id="Sx3.T3.2.2.9.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx3.T3.2.2.9.1.2.1.1.1" style="padding-left:5.0pt;padding-right:5.0pt;">Post Proc.</span></span> </span></span> <span class="ltx_text" id="Sx3.T3.2.2.9.1.3"></span></span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T3.2.2.10" style="padding-left:5.0pt;padding-right:5.0pt;">MOInst</td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T3.1.1.1" style="padding-left:5.0pt;padding-right:5.0pt;">LVIS<sup class="ltx_sup" id="Sx3.T3.1.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx3.T3.1.1.1.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx3.T3.2.2.2" style="padding-left:5.0pt;padding-right:5.0pt;">LVIS<sup class="ltx_sup" id="Sx3.T3.2.2.2.1"><span class="ltx_text ltx_font_italic" id="Sx3.T3.2.2.2.1.1">a</span></sup> </td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.4"> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.4.1" style="padding-left:5.0pt;padding-right:5.0pt;">I2T</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.4.2" style="padding-left:5.0pt;padding-right:5.0pt;">T2I</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.4.3" rowspan="2" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.4.3.1">F1</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.4.4" rowspan="2" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.4.4.1">AUC</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.4.5" rowspan="2" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.4.5.1">F1</span></td> <td class="ltx_td 
ltx_align_center" id="Sx3.T3.3.4.6" rowspan="2" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.4.6.1">AUC</span></td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.5"> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.5.1" style="padding-left:5.0pt;padding-right:5.0pt;">R@1</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.5.2" style="padding-left:5.0pt;padding-right:5.0pt;">R@1</td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.6" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.1" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.1.1" style="background-color:#E6E6E6;">BLIP-2</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.2" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.2.1" style="background-color:#E6E6E6;">1.2B</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.3.1" style="background-color:#E6E6E6;">✗</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.4" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.4.1" style="background-color:#E6E6E6;">EVA-CLIP-ViT</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.5" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.5.1" style="background-color:#E6E6E6;">G/14</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.6" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.6.1" style="background-color:#E6E6E6;">✗</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.6.7" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.7.1" style="background-color:#E6E6E6;">Q-Former</span></td> <td class="ltx_td ltx_align_center 
ltx_border_t" id="Sx3.T3.3.6.8" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.8.1" style="background-color:#E6E6E6;">14.1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.6.9" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.9.1" style="background-color:#E6E6E6;">9.4</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.6.10" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.10.1" style="background-color:#E6E6E6;">14.1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.6.11" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.11.1" style="background-color:#E6E6E6;">58.2</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.6.12" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.12.1" style="background-color:#E6E6E6;">11.3</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.6.13" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.6.13.1" style="background-color:#E6E6E6;">56.3</span></td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.7"> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.1" style="padding-left:5.0pt;padding-right:5.0pt;">BLIP-2</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.2" style="padding-left:5.0pt;padding-right:5.0pt;">388M</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.3" style="padding-left:5.0pt;padding-right:5.0pt;">✔</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.4" style="padding-left:5.0pt;padding-right:5.0pt;">CLIP-ViT</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.5" style="padding-left:5.0pt;padding-right:5.0pt;">B/32</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.6" style="padding-left:5.0pt;padding-right:5.0pt;">✔</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.7.7" 
style="padding-left:5.0pt;padding-right:5.0pt;">Q-Former</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.7.8" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text ltx_font_bold" id="Sx3.T3.3.7.8.1">30.1</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.7.9" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text ltx_font_bold" id="Sx3.T3.3.7.9.1">31.1</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.7.10" style="padding-left:5.0pt;padding-right:5.0pt;">35.2</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.7.11" style="padding-left:5.0pt;padding-right:5.0pt;">67.4</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.7.12" style="padding-left:5.0pt;padding-right:5.0pt;">40.1</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.7.13" style="padding-left:5.0pt;padding-right:5.0pt;">68.9</td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.8"> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.8.1" style="padding-left:5.0pt;padding-right:5.0pt;">Improv.</td> <td class="ltx_td" id="Sx3.T3.3.8.2" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.8.3" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.8.4" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.8.5" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.8.6" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.8.7" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.8.8" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.8.8.1" style="color:#00E000;">113.5%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.8.9" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.8.9.1" style="color:#00E000;">230.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.8.10" 
style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.8.10.1" style="color:#00E000;">149.6%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.8.11" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.8.11.1" style="color:#00E000;">15.8%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.8.12" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.8.12.1" style="color:#00E000;">254.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.8.13" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.8.13.1" style="color:#00E000;">22.4%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.9" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.1" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.1.1" style="background-color:#E6E6E6;">InstructBLIP</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.2" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.2.1" style="background-color:#E6E6E6;">1.2B</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.3" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.3.1" style="background-color:#E6E6E6;">✔</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.4" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.4.1" style="background-color:#E6E6E6;">EVA-CLIP-ViT</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.5" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.5.1" style="background-color:#E6E6E6;">G/14</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.6" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.6.1" 
style="background-color:#E6E6E6;">✗</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx3.T3.3.9.7" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.7.1" style="background-color:#E6E6E6;">Q-Former</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.9.8" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.8.1" style="background-color:#E6E6E6;">19.8</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.9.9" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.9.1" style="background-color:#E6E6E6;">22.1</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.9.10" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.10.1" style="background-color:#E6E6E6;">28.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.9.11" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.11.1" style="background-color:#E6E6E6;">64.7</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.9.12" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.12.1" style="background-color:#E6E6E6;">23.0</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx3.T3.3.9.13" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.9.13.1" style="background-color:#E6E6E6;">60.9</span></td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.10"> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.1" style="padding-left:5.0pt;padding-right:5.0pt;">InstructBLIP</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.2" style="padding-left:5.0pt;padding-right:5.0pt;">388M</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.3" style="padding-left:5.0pt;padding-right:5.0pt;">✔</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.4" 
style="padding-left:5.0pt;padding-right:5.0pt;">CLIP-ViT</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.5" style="padding-left:5.0pt;padding-right:5.0pt;">B/32</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.6" style="padding-left:5.0pt;padding-right:5.0pt;">✔</td> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.10.7" style="padding-left:5.0pt;padding-right:5.0pt;">Q-Former</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.10.8" style="padding-left:5.0pt;padding-right:5.0pt;">24.2</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.10.9" style="padding-left:5.0pt;padding-right:5.0pt;">21.9</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.10.10" style="padding-left:5.0pt;padding-right:5.0pt;">35.0</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.10.11" style="padding-left:5.0pt;padding-right:5.0pt;">67.1</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.10.12" style="padding-left:5.0pt;padding-right:5.0pt;">42.7</td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.10.13" style="padding-left:5.0pt;padding-right:5.0pt;">70.6</td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.11"> <td class="ltx_td ltx_align_left" id="Sx3.T3.3.11.1" style="padding-left:5.0pt;padding-right:5.0pt;">Improv.</td> <td class="ltx_td" id="Sx3.T3.3.11.2" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.11.3" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.11.4" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.11.5" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.11.6" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td" id="Sx3.T3.3.11.7" style="padding-left:5.0pt;padding-right:5.0pt;"></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.11.8" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.11.8.1" style="color:#00E000;">22.2%</span></td> <td class="ltx_td 
ltx_align_center" id="Sx3.T3.3.11.9" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.11.9.1" style="color:#B80000;">-0.9%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.11.10" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.11.10.1" style="color:#00E000;">22.4%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.11.11" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.11.11.1" style="color:#00E000;">3.7%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.11.12" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.11.12.1" style="color:#00E000;">85.7%</span></td> <td class="ltx_td ltx_align_center" id="Sx3.T3.3.11.13" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text" id="Sx3.T3.3.11.13.1" style="color:#00E000;">15.9%</span></td> </tr> <tr class="ltx_tr" id="Sx3.T3.3.3"> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.2" style="padding-left:5.0pt;padding-right:5.0pt;">AG-CLIP</td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.3" style="padding-left:5.0pt;padding-right:5.0pt;">62M</td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.4" style="padding-left:5.0pt;padding-right:5.0pt;">✔</td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.5" style="padding-left:5.0pt;padding-right:5.0pt;">CLIP-ViT</td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.6" style="padding-left:5.0pt;padding-right:5.0pt;">B/32</td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.7" style="padding-left:5.0pt;padding-right:5.0pt;">✔</td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.1" style="padding-left:5.0pt;padding-right:5.0pt;"><math alttext="-" class="ltx_Math" display="inline" 
id="Sx3.T3.3.3.1.m1.1"><semantics id="Sx3.T3.3.3.1.m1.1a"><mo id="Sx3.T3.3.3.1.m1.1.1" xref="Sx3.T3.3.3.1.m1.1.1.cmml">−</mo><annotation-xml encoding="MathML-Content" id="Sx3.T3.3.3.1.m1.1b"><minus id="Sx3.T3.3.3.1.m1.1.1.cmml" xref="Sx3.T3.3.3.1.m1.1.1"></minus></annotation-xml><annotation encoding="application/x-tex" id="Sx3.T3.3.3.1.m1.1c">-</annotation><annotation encoding="application/x-llamapun" id="Sx3.T3.3.3.1.m1.1d">-</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.8" style="padding-left:5.0pt;padding-right:5.0pt;">29.1</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.9" style="padding-left:5.0pt;padding-right:5.0pt;">29.8</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.10" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text ltx_font_bold" id="Sx3.T3.3.3.10.1">50.7</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.11" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text ltx_font_bold" id="Sx3.T3.3.3.11.1">75.6</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.12" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text ltx_font_bold" id="Sx3.T3.3.3.12.1">56.2</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx3.T3.3.3.13" style="padding-left:5.0pt;padding-right:5.0pt;"><span class="ltx_text ltx_font_bold" id="Sx3.T3.3.3.13.1">77.3</span></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 3: </span>The effect of semantic instructions on secondary coding methods. AG-CLIP is CLIP-ViT with GiVE for attention guidance. “Inst.” indicates whether the model receives additional instructions. “Img Enc.” specifies the image encoder used by the model, and “Type” denotes the type of image encoder. 
“GiVE” shows whether our GiVE method is applied to the image encoder, and “Post Proc.” identifies the type of post-processor used for the secondary encoding of the image embeddings.</figcaption> </figure> </section> </section> <section class="ltx_section" id="Sx4"> <h2 class="ltx_title ltx_title_section">Experiments</h2> <section class="ltx_subsection" id="Sx4.SSx1"> <h3 class="ltx_title ltx_title_subsection">Experimental Setup</h3> <section class="ltx_subsubsection" id="Sx4.SSx1.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Datasets</h4> <div class="ltx_para" id="Sx4.SSx1.SSSx1.p1"> <p class="ltx_p" id="Sx4.SSx1.SSSx1.p1.2">We evaluate the effectiveness of GiVE using both the LVIS dataset <cite class="ltx_cite ltx_citemacro_citep">(Gupta, Dollár, and Girshick <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib13" title="">2019</a>)</cite> and the MOInst dataset, each annotating multiple objects per image. The comprehensive LVIS dataset, referred to as LVIS<sup class="ltx_sup" id="Sx4.SSx1.SSSx1.p1.2.1"><span class="ltx_text ltx_font_italic" id="Sx4.SSx1.SSSx1.p1.2.1.1">a</span></sup>, comprises 1,203 categories, from which we derive a subset, denoted as LVIS<sup class="ltx_sup" id="Sx4.SSx1.SSSx1.p1.2.2"><span class="ltx_text ltx_font_italic" id="Sx4.SSx1.SSSx1.p1.2.2.1">f</span></sup>, with 405 “frequent” categories. In contrast, the MOInst dataset used for training contains 264 categories, with category label overlaps of 7.7% and 17.6% with the two LVIS datasets, respectively. We also conduct experiments on the COCO <cite class="ltx_cite ltx_citemacro_citep">(Lin et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib21" title="">2014</a>)</cite>, SUN397 <cite class="ltx_cite ltx_citemacro_citep">(Xiao et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib47" title="">2010</a>)</cite>, and ImageNet <cite class="ltx_cite ltx_citemacro_citep">(Gao et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib11" title="">2023</a>)</cite> datasets.</p> </div> </section> <section class="ltx_subsubsection" id="Sx4.SSx1.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Baselines</h4> <div class="ltx_para" id="Sx4.SSx1.SSSx2.p1"> <p class="ltx_p" id="Sx4.SSx1.SSSx2.p1.1">We evaluate the gains brought by integrating GiVE with several representative ViT baselines, including CLIP <cite class="ltx_cite ltx_citemacro_citep">(Radford et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite>, GroupViT <cite class="ltx_cite ltx_citemacro_citep">(Xu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib49" title="">2022</a>)</cite>, SigLIP <cite class="ltx_cite ltx_citemacro_citep">(Zhai et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib57" title="">2023</a>)</cite>, and OwlViT <cite class="ltx_cite ltx_citemacro_citep">(Minderer et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib29" title="">2022</a>)</cite>. GroupViT has two versions trained on different datasets, and OwlViT is a visual encoder designed for object detection. Additionally, we apply GiVE to BLIP-2 <cite class="ltx_cite ltx_citemacro_citep">(Li et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib19" title="">2023</a>)</cite> and InstructBLIP <cite class="ltx_cite ltx_citemacro_citep">(Dai et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib9" title="">2023</a>)</cite> to explore its impact within a larger encoding framework. Throughout this paper, unless explicitly stated otherwise, the experiments with GiVE are conducted on the CLIP platform, with the unmarked CLIP model referring to CLIP-ViT-B/32. Furthermore, we compare CLIP using GiVE with the more recent MetaCLIP <cite class="ltx_cite ltx_citemacro_citep">(Xu et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib48" title="">2024</a>)</cite>.</p> </div> </section> <section class="ltx_subsubsection" id="Sx4.SSx1.SSSx3"> <h4 class="ltx_title ltx_title_subsubsection">Evaluation Metrics</h4> <div class="ltx_para" id="Sx4.SSx1.SSSx3.p1"> <p class="ltx_p" id="Sx4.SSx1.SSSx3.p1.1">Inspired by prior works <cite class="ltx_cite ltx_citemacro_citep">(Li et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib19" title="">2023</a>; Yao et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib51" title="">2022</a>)</cite>, we conduct quantitative evaluations on image classification and image-text retrieval tasks. Scores are calculated solely from vector similarity, without further optimization, to accurately reflect the effective information content of the extracted features. To ensure that the scores reflect the visual information extraction capabilities of visual encoders, we mitigate the potential interference by quantifying the influence of textual data when evaluating methods that use instruction text, such as InstructBLIP <cite class="ltx_cite ltx_citemacro_citep">(Dai et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib9" title="">2023</a>)</cite>. For detailed calculations and implementation specifics, see the Appendix.</p> </div> <figure class="ltx_figure" id="Sx4.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="248" id="Sx4.F4.g1" src="x4.png" width="747"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span><span class="ltx_text ltx_font_bold" id="Sx4.F4.2.1">Visual encoder attention visualization</span>. The images are divided into three groups, each containing the raw image and its attention visualization. The first row shows the raw image and the attention of CLIP-ViT. 
The second row displays attention maps of CLIP integrated with the AG-Adapter under various instructions.</figcaption> </figure> <figure class="ltx_table" id="Sx4.T4"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="Sx4.T4.2"> <tr class="ltx_tr" id="Sx4.T4.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx4.T4.2.2.3" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.2.3.1">Loss</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx4.T4.2.2.4" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.2.4.1">Fuse</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="Sx4.T4.2.2.5" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.2.5.1">Inst.</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx4.T4.2.2.6" style="padding-left:2.8pt;padding-right:2.8pt;">MOInst</td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx4.T4.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="Sx4.T4.1.1.1.1"><span class="ltx_text ltx_font_italic" id="Sx4.T4.1.1.1.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="Sx4.T4.2.2.2" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="Sx4.T4.2.2.2.1"><span class="ltx_text ltx_font_italic" id="Sx4.T4.2.2.2.1.1">a</span></sup> </td> <td class="ltx_td ltx_border_tt" id="Sx4.T4.2.2.7" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.3"> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">I2T</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.3.2" style="padding-left:2.8pt;padding-right:2.8pt;">T2I</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.3.3" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" 
id="Sx4.T4.2.3.3.1">F1</span></td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.3.4" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.3.4.1">AUC</span></td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.3.5" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.3.5.1">F1</span></td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.3.6" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.3.6.1">AUC</span></td> <td class="ltx_td" id="Sx4.T4.2.3.7" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.4"> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.4.1" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.4.2" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> <td class="ltx_td" id="Sx4.T4.2.4.3" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.5" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.5.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.1.1" style="background-color:#E6E6E6;">✗</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.5.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.2.1" style="background-color:#E6E6E6;">✗</span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.5.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.3.1" style="background-color:#E6E6E6;">✗</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.5.4" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.4.1" style="background-color:#E6E6E6;">7.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.5.5" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.5.1" style="background-color:#E6E6E6;">5.4</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.5.6" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.6.1" style="background-color:#E6E6E6;">11.3</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.5.7" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.7.1" style="background-color:#E6E6E6;">57.8</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.5.8" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.8.1" style="background-color:#E6E6E6;">8.6</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.5.9" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.5.9.1" style="background-color:#E6E6E6;">55.5</span></td> <td class="ltx_td ltx_border_t" id="Sx4.T4.2.5.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.6"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.6.1" style="padding-left:2.8pt;padding-right:2.8pt;">-OITC</td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.6.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="Sx4.T4.2.6.2.1"></span><span class="ltx_text" id="Sx4.T4.2.6.2.2"> <span class="ltx_tabular ltx_align_middle" id="Sx4.T4.2.6.2.2.1"> <span class="ltx_tr" id="Sx4.T4.2.6.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx4.T4.2.6.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">early+late</span></span> </span></span><span class="ltx_text" id="Sx4.T4.2.6.2.3"></span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.6.3" style="padding-left:2.8pt;padding-right:2.8pt;">✔</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.6.4" 
style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.6.5" style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.6.6" style="padding-left:2.8pt;padding-right:2.8pt;">0.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.6.7" style="padding-left:2.8pt;padding-right:2.8pt;">50.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.6.8" style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.6.9" style="padding-left:2.8pt;padding-right:2.8pt;">50.0</td> <td class="ltx_td ltx_border_t" id="Sx4.T4.2.6.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.7"> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.7.1" style="padding-left:2.8pt;padding-right:2.8pt;">-OIIC</td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.7.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="Sx4.T4.2.7.2.1"></span><span class="ltx_text" id="Sx4.T4.2.7.2.2"> <span class="ltx_tabular ltx_align_middle" id="Sx4.T4.2.7.2.2.1"> <span class="ltx_tr" id="Sx4.T4.2.7.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx4.T4.2.7.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">early+late</span></span> </span></span><span class="ltx_text" id="Sx4.T4.2.7.2.3"></span></td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.7.3" style="padding-left:2.8pt;padding-right:2.8pt;">✔</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.7.4" style="padding-left:2.8pt;padding-right:2.8pt;">28.1</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.7.5" style="padding-left:2.8pt;padding-right:2.8pt;">29.4</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.7.6" style="padding-left:2.8pt;padding-right:2.8pt;">48.4</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.7.7" 
style="padding-left:2.8pt;padding-right:2.8pt;">74.0</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.7.8" style="padding-left:2.8pt;padding-right:2.8pt;">55.7</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.7.9" style="padding-left:2.8pt;padding-right:2.8pt;">76.7</td> <td class="ltx_td" id="Sx4.T4.2.7.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.8"> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.8.1" style="padding-left:2.8pt;padding-right:2.8pt;">-OID</td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.8.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="Sx4.T4.2.8.2.1"></span><span class="ltx_text" id="Sx4.T4.2.8.2.2"> <span class="ltx_tabular ltx_align_middle" id="Sx4.T4.2.8.2.2.1"> <span class="ltx_tr" id="Sx4.T4.2.8.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx4.T4.2.8.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">early+late</span></span> </span></span><span class="ltx_text" id="Sx4.T4.2.8.2.3"></span></td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.8.3" style="padding-left:2.8pt;padding-right:2.8pt;">✔</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.8.4" style="padding-left:2.8pt;padding-right:2.8pt;">28.1</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.8.5" style="padding-left:2.8pt;padding-right:2.8pt;">29.7</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.8.6" style="padding-left:2.8pt;padding-right:2.8pt;">37.6</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.8.7" style="padding-left:2.8pt;padding-right:2.8pt;">67.8</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.8.8" style="padding-left:2.8pt;padding-right:2.8pt;">47.6</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.8.9" style="padding-left:2.8pt;padding-right:2.8pt;">72.2</td> <td class="ltx_td" id="Sx4.T4.2.8.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.9"> <td class="ltx_td ltx_align_left 
ltx_border_t" id="Sx4.T4.2.9.1" style="padding-left:2.8pt;padding-right:2.8pt;">all</td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.9.2" style="padding-left:2.8pt;padding-right:2.8pt;">early</td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.9.3" style="padding-left:2.8pt;padding-right:2.8pt;">✔</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.9.4" style="padding-left:2.8pt;padding-right:2.8pt;">6.4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.9.5" style="padding-left:2.8pt;padding-right:2.8pt;">8.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.9.6" style="padding-left:2.8pt;padding-right:2.8pt;">27.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.9.7" style="padding-left:2.8pt;padding-right:2.8pt;">63.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.9.8" style="padding-left:2.8pt;padding-right:2.8pt;">37.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.9.9" style="padding-left:2.8pt;padding-right:2.8pt;">67.3</td> <td class="ltx_td ltx_border_t" id="Sx4.T4.2.9.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.10"> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.10.1" style="padding-left:2.8pt;padding-right:2.8pt;">all</td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.10.2" style="padding-left:2.8pt;padding-right:2.8pt;">late</td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.10.3" style="padding-left:2.8pt;padding-right:2.8pt;">✔</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.10.4" style="padding-left:2.8pt;padding-right:2.8pt;">27.1</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.10.5" style="padding-left:2.8pt;padding-right:2.8pt;">29.1</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.10.6" style="padding-left:2.8pt;padding-right:2.8pt;">47.4</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.10.7" 
style="padding-left:2.8pt;padding-right:2.8pt;">73.6</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.10.8" style="padding-left:2.8pt;padding-right:2.8pt;">55.6</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.10.9" style="padding-left:2.8pt;padding-right:2.8pt;">76.5</td> <td class="ltx_td" id="Sx4.T4.2.10.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.11"> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.11.1" style="padding-left:2.8pt;padding-right:2.8pt;">all</td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.11.2" style="padding-left:2.8pt;padding-right:2.8pt;">sparse</td> <td class="ltx_td ltx_align_left" id="Sx4.T4.2.11.3" style="padding-left:2.8pt;padding-right:2.8pt;">✔</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.11.4" style="padding-left:2.8pt;padding-right:2.8pt;">27.0</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.11.5" style="padding-left:2.8pt;padding-right:2.8pt;">29.0</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.11.6" style="padding-left:2.8pt;padding-right:2.8pt;">46.9</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.11.7" style="padding-left:2.8pt;padding-right:2.8pt;">73.0</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.11.8" style="padding-left:2.8pt;padding-right:2.8pt;">55.1</td> <td class="ltx_td ltx_align_center" id="Sx4.T4.2.11.9" style="padding-left:2.8pt;padding-right:2.8pt;">76.3</td> <td class="ltx_td" id="Sx4.T4.2.11.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.12"> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.12.1" style="padding-left:2.8pt;padding-right:2.8pt;">all</td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.12.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="Sx4.T4.2.12.2.1"></span><span class="ltx_text" id="Sx4.T4.2.12.2.2"> <span class="ltx_tabular ltx_align_middle" id="Sx4.T4.2.12.2.2.1"> <span class="ltx_tr" 
id="Sx4.T4.2.12.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx4.T4.2.12.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">early+late</span></span> </span></span><span class="ltx_text" id="Sx4.T4.2.12.2.3"></span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="Sx4.T4.2.12.3" style="padding-left:2.8pt;padding-right:2.8pt;">✗</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.12.4" style="padding-left:2.8pt;padding-right:2.8pt;">0.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.12.5" style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.12.6" style="padding-left:2.8pt;padding-right:2.8pt;">0.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.12.7" style="padding-left:2.8pt;padding-right:2.8pt;">50.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.12.8" style="padding-left:2.8pt;padding-right:2.8pt;">0.4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="Sx4.T4.2.12.9" style="padding-left:2.8pt;padding-right:2.8pt;">50.0</td> <td class="ltx_td ltx_border_t" id="Sx4.T4.2.12.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="Sx4.T4.2.13" style="background-color:#FFFFCC;"> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.13.1.1" style="background-color:#FFFFCC;">all</span></td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="Sx4.T4.2.13.2.1"></span><span class="ltx_text" id="Sx4.T4.2.13.2.2" style="background-color:#FFFFCC;"> <span class="ltx_tabular ltx_align_middle" id="Sx4.T4.2.13.2.2.1"> <span class="ltx_tr" id="Sx4.T4.2.13.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx4.T4.2.13.2.2.1.1.1" 
style="padding-left:2.8pt;padding-right:2.8pt;">early+late</span></span> <span class="ltx_tr" id="Sx4.T4.2.13.2.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="Sx4.T4.2.13.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">(dense)</span></span> </span></span><span class="ltx_text" id="Sx4.T4.2.13.2.3"></span><span class="ltx_text" id="Sx4.T4.2.13.2.4" style="background-color:#FFFFCC;"></span> </td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="Sx4.T4.2.13.3.1" style="background-color:#FFFFCC;">✔</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.4" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="Sx4.T4.2.13.4.1" style="background-color:#FFFFCC;">29.1</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.5" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="Sx4.T4.2.13.5.1" style="background-color:#FFFFCC;">29.8</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.6" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="Sx4.T4.2.13.6.1" style="background-color:#FFFFCC;">50.7</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.7" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="Sx4.T4.2.13.7.1" style="background-color:#FFFFCC;">75.6</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.8" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="Sx4.T4.2.13.8.1" style="background-color:#FFFFCC;">56.2</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.9" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text 
ltx_font_bold" id="Sx4.T4.2.13.9.1" style="background-color:#FFFFCC;">77.3</span></td> <td class="ltx_td ltx_border_bb ltx_border_t" id="Sx4.T4.2.13.10" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 4: </span>Ablation study. “Fuse” denotes the layers where the AG-Adapter is inserted. “Early” indicates feature fusion only in the first half of the layers (near the input), while “late” indicates fusion in the latter half. “Sparse” refers to fusion in alternating layers, whereas “dense” means fusion in all layers.</figcaption> </figure> </section> </section> <section class="ltx_subsection" id="Sx4.SSx2"> <h3 class="ltx_title ltx_title_subsection">Performance Evaluation</h3> <section class="ltx_subsubsection" id="Sx4.SSx2.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Image Classification</h4> <div class="ltx_para" id="Sx4.SSx2.SSSx1.p1"> <p class="ltx_p" id="Sx4.SSx2.SSSx1.p1.1">We conduct zero-shot image classification on the “frequent” subset and the full LVIS test set. Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.T1" title="Table 1 ‣ Object-focused Visual Semantic Learning ‣ Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">1</span></a> presents the results on five ViT baselines. Key observations from these experiments are as follows:</p> </div> <div class="ltx_para" id="Sx4.SSx2.SSSx1.p2"> <ul class="ltx_itemize" id="Sx4.I3"> <li class="ltx_item" id="Sx4.I3.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I3.i1.p1"> <p class="ltx_p" id="Sx4.I3.i1.p1.1">The score for the instruction text is nearly equivalent to random classification, demonstrating that our metric successfully filters out textual interference. 
This ensures a fair comparison of the visual feature extraction capabilities of each model.</p> </div> </li> <li class="ltx_item" id="Sx4.I3.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I3.i2.p1"> <p class="ltx_p" id="Sx4.I3.i2.p1.1">GiVE yields a notable improvement for all baselines across all metrics on both evaluation datasets. This highlights GiVE’s capacity to redirect attention based on semantic instructions.</p> </div> </li> <li class="ltx_item" id="Sx4.I3.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I3.i3.p1"> <p class="ltx_p" id="Sx4.I3.i3.p1.1">OwlViT, a ViT designed for object detection, requires fine-tuning to serve effectively as an image feature extractor. However, GiVE can unlock its potential.</p> </div> </li> <li class="ltx_item" id="Sx4.I3.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I3.i4.p1"> <p class="ltx_p" id="Sx4.I3.i4.p1.1">The AG-Adapter, trained on the MOInst dataset with 264 classes, achieves F1 scores of over 40% on the LVIS<sup class="ltx_sup" id="Sx4.I3.i4.p1.1.1"><span class="ltx_text ltx_font_italic" id="Sx4.I3.i4.p1.1.1.1">a</span></sup> dataset with 1,203 classes. 
This gap in the number of categories indicates that the instruction semantics generalize beyond the training scope.</p> </div> </li> </ul> </div> <div class="ltx_para" id="Sx4.SSx2.SSSx1.p3"> <p class="ltx_p" id="Sx4.SSx2.SSSx1.p3.1">See Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T11" title="Table 11 ‣ Comparison with Clip-based Method ‣ Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">11</span></a> in the Appendix for a comparison with MetaCLIP on more datasets.</p> </div> </section> <section class="ltx_subsubsection" id="Sx4.SSx2.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Image-Text Retrieval</h4> <div class="ltx_para" id="Sx4.SSx2.SSSx2.p1"> <p class="ltx_p" id="Sx4.SSx2.SSSx2.p1.1">Image-text retrieval comprises two subtasks: image-to-text retrieval and text-to-image retrieval. We evaluate the models on the MOInst dataset. Since the AG-Adapter is trained on this dataset, we compare the results under the fine-tuned setting, as reported in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.T2" title="Table 2 ‣ Object-focused Visual Semantic Learning ‣ Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">2</span></a>.</p> </div> <div class="ltx_para" id="Sx4.SSx2.SSSx2.p2"> <p class="ltx_p" id="Sx4.SSx2.SSSx2.p2.1">As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.T2" title="Table 2 ‣ Object-focused Visual Semantic Learning ‣ Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">2</span></a>, GiVE outperforms all baseline methods while using fewer trainable parameters. 
Moreover, the informativeness gap between instruction and caption, illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.F5" title="Figure 5 ‣ Prompt Engineering and Pipeline ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">5</span></a> in the Appendix, indicates that a simple instruction alone is insufficient for the model to achieve such a high hit rate. Taken together, these results indicate that GiVE effectively extracts target visual features. Additionally, the evaluation results for “Instruct.” further support the credibility of our experimental findings.</p> </div> </section> <section class="ltx_subsubsection" id="Sx4.SSx2.SSSx3"> <h4 class="ltx_title ltx_title_subsubsection">Comparison with Secondary Coding Methods</h4> <div class="ltx_para" id="Sx4.SSx2.SSSx3.p1"> <p class="ltx_p" id="Sx4.SSx2.SSSx3.p1.1">The BLIP-2 and InstructBLIP methods use a Q-Former to re-encode image features after the visual encoder, with InstructBLIP incorporating text instructions into the Q-Former. We replace the visual encoders of these two methods with AG-CLIP-ViT-B/32 and then fine-tune them on MOInst following their respective strategies. The results are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx3.T3" title="Table 3 ‣ Instruction of New Dataset ‣ Methodology ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">3</span></a>. The table shows that:</p> </div> <div class="ltx_para" id="Sx4.SSx2.SSSx3.p2"> <ul class="ltx_itemize" id="Sx4.I4"> <li class="ltx_item" id="Sx4.I4.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I4.i1.p1"> <p class="ltx_p" id="Sx4.I4.i1.p1.1">Our AG-CLIP improves on the original BLIP-2 and InstructBLIP, despite the significant difference in the number of parameters. 
This supports the validity of injecting instructions.</p> </div> </li> <li class="ltx_item" id="Sx4.I4.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I4.i2.p1"> <p class="ltx_p" id="Sx4.I4.i2.p1.1">Semantic instructions prove more efficient when injected during visual feature extraction than when applied as post-processing. In our second set of experiments, we replace the original giant EVA-CLIP in InstructBLIP with the base AG-CLIP, resulting in superior performance across most metrics. Although AG-CLIP performs slightly worse in the text-to-image retrieval task, the substantial difference in the number of trainable parameters makes this acceptable.</p> </div> </li> <li class="ltx_item" id="Sx4.I4.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I4.i3.p1"> <p class="ltx_p" id="Sx4.I4.i3.p1.2">Secondary encoding may weaken the abstract semantics in the visual features, leading to decreased performance in image classification tasks, as evidenced by the superior performance of AG-CLIP compared to BLIP-2 and InstructBLIP on the LVIS<sup class="ltx_sup" id="Sx4.I4.i3.p1.2.1"><span class="ltx_text ltx_font_italic" id="Sx4.I4.i3.p1.2.1.1">f</span></sup> and LVIS<sup class="ltx_sup" id="Sx4.I4.i3.p1.2.2"><span class="ltx_text ltx_font_italic" id="Sx4.I4.i3.p1.2.2.1">a</span></sup> datasets.</p> </div> </li> </ul> </div> </section> </section> <section class="ltx_subsection" id="Sx4.SSx3"> <h3 class="ltx_title ltx_title_subsection">Ablation Studies</h3> <div class="ltx_para" id="Sx4.SSx3.p1"> <p class="ltx_p" id="Sx4.SSx3.p1.1">In this section, we conduct an ablation study of GiVE from three perspectives: loss, fusion layers, and abstract semantic instructions. Specifically, for instructions, we eliminate abstract semantics by replacing short object prompts with detailed descriptions. 
Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.T4" title="Table 4 ‣ Evaluation Metrics ‣ Experimental Setup ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">4</span></a> presents the experimental results, with the gray row representing the original CLIP-ViT and the yellow row representing our AG-CLIP-ViT. Observations from the results are as follows:</p> </div> <div class="ltx_para" id="Sx4.SSx3.p2"> <ul class="ltx_itemize" id="Sx4.I5"> <li class="ltx_item" id="Sx4.I5.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I5.i1.p1"> <p class="ltx_p" id="Sx4.I5.i1.p1.1">The ablation study on loss functions suggests that all three losses are crucial, with particular emphasis on the OITC loss, which is responsible for aligning object-focused visual features with text features.</p> </div> </li> <li class="ltx_item" id="Sx4.I5.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I5.i2.p1"> <p class="ltx_p" id="Sx4.I5.i2.p1.1">The late encoding layers have a significantly greater impact on abstract semantics compared to the early and sparse layers, though the early and sparse layers also contribute to the understanding of semantics.</p> </div> </li> <li class="ltx_item" id="Sx4.I5.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I5.i3.p1"> <p class="ltx_p" id="Sx4.I5.i3.p1.1">In the absence of instructions, captions can serve as textual inputs. 
However, overly specific captions may cause the visual encoder to rely heavily on textual instructions, so that it may miss key details in the image.</p> </div> </li> </ul> </div> </section> <section class="ltx_subsection" id="Sx4.SSx4"> <h3 class="ltx_title ltx_title_subsection">Attention Visualization</h3> <div class="ltx_para" id="Sx4.SSx4.p1"> <p class="ltx_p" id="Sx4.SSx4.p1.1">In this section, we use EigenCAM <cite class="ltx_cite ltx_citemacro_citep">(Muhammad and Yeasin <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib31" title="">2020</a>)</cite> to visualize the attention of CLIP-ViT with and without GiVE on the three example images in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#Sx4.F4" title="Figure 4 ‣ Evaluation Metrics ‣ Experimental Setup ‣ Experiments ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">4</span></a>. The visualizations show that GiVE helps the original ViT encode images more effectively in the following ways:</p> </div> <div class="ltx_para" id="Sx4.SSx4.p2"> <ul class="ltx_itemize" id="Sx4.I6"> <li class="ltx_item" id="Sx4.I6.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I6.i1.p1"> <p class="ltx_p" id="Sx4.I6.i1.p1.1"><span class="ltx_text ltx_font_bold" id="Sx4.I6.i1.p1.1.1">Flexibility for a single image.</span> The dynamically changing attention areas in each image indicate that the ViT no longer encodes fixed image features but can extract visual features tailored to the query.</p> </div> </li> <li class="ltx_item" id="Sx4.I6.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I6.i2.p1"> <p class="ltx_p" id="Sx4.I6.i2.p1.1"><span class="ltx_text ltx_font_bold" id="Sx4.I6.i2.p1.1.1">Effectiveness of semantic control over attention.</span> The three attention distributions for the left and middle images show 
that even objects initially overlooked by the ViT can be brought into focus through instructions.</p> </div> </li> <li class="ltx_item" id="Sx4.I6.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx4.I6.i3.p1"> <p class="ltx_p" id="Sx4.I6.i3.p1.1"><span class="ltx_text ltx_font_bold" id="Sx4.I6.i3.p1.1.1">Completeness of object retrieval.</span> As shown in the right image with the “shoe” instruction, all shoes are attended to, regardless of whether they are in a box or separated by a human body. GiVE enables the ViT to locate the specified objects more comprehensively.</p> </div> </li> </ul> <p class="ltx_p" id="Sx4.SSx4.p2.1">For additional results, see Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.F7" title="Figure 7 ‣ More Guided Attention Visualization ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">7</span></a> in the Appendix.</p> </div> </section> </section> <section class="ltx_section" id="Sx5"> <h2 class="ltx_title ltx_title_section">Conclusion</h2> <div class="ltx_para" id="Sx5.p1"> <p class="ltx_p" id="Sx5.p1.1">This paper presents GiVE, a novel approach that improves the integration of visual encoders with LLMs by addressing semantic alignment and overlooked information. GiVE features the AG-Adapter and three loss functions: OITC, OIIC, and OID. The AG-Adapter aligns visual representations with abstract semantics, while the OITC loss ensures attention to both salient and non-salient objects, and the OIIC and OID losses enhance object retrieval accuracy and comprehensiveness. 
Experiments show GiVE’s significant improvements over existing methods in multiple tasks.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bai et al. (2023)</span> <span class="ltx_bibblock"> Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; and Zhou, J. 2023. </span> <span class="ltx_bibblock">Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">arXiv preprint arXiv:2308.12966</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Blattmann et al. (2023)</span> <span class="ltx_bibblock"> Blattmann, A.; Rombach, R.; Ling, H.; Dockhorn, T.; Kim, S. W.; Fidler, S.; and Kreis, K. 2023. </span> <span class="ltx_bibblock">Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023</em>, 22563–22575. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2023a)</span> <span class="ltx_bibblock"> Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; and Elhoseiny, M. 2023a. </span> <span class="ltx_bibblock">MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">CoRR</em>, abs/2310.09478. 
</span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2023b)</span> <span class="ltx_bibblock"> Chen, K.; Zhang, Z.; Zeng, W.; Zhang, R.; Zhu, F.; and Zhao, R. 2023b. </span> <span class="ltx_bibblock">Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">arXiv preprint arXiv:2306.15195</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2023c)</span> <span class="ltx_bibblock"> Chen, L.; Li, J.; Dong, X.; Zhang, P.; He, C.; Wang, J.; Zhao, F.; and Lin, D. 2023c. </span> <span class="ltx_bibblock">ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">arXiv preprint arXiv:2311.12793</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2023d)</span> <span class="ltx_bibblock"> Chen, Y.; Sikka, K.; Cogswell, M.; Ji, H.; and Divakaran, A. 2023d. </span> <span class="ltx_bibblock">DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">CoRR</em>, abs/2311.10081. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2023e)</span> <span class="ltx_bibblock"> Chen, Y.; Sikka, K.; Cogswell, M.; Ji, H.; and Divakaran, A. 2023e. </span> <span class="ltx_bibblock">Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">CoRR</em>, abs/2309.04461. 
</span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cherti et al. (2023)</span> <span class="ltx_bibblock"> Cherti, M.; Beaumont, R.; Wightman, R.; Wortsman, M.; Ilharco, G.; Gordon, C.; Schuhmann, C.; Schmidt, L.; and Jitsev, J. 2023. </span> <span class="ltx_bibblock">Reproducible Scaling Laws for Contrastive Language-Image Learning. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023</em>, 2818–2829. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dai et al. (2023)</span> <span class="ltx_bibblock"> Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; and Hoi, S. C. H. 2023. </span> <span class="ltx_bibblock">InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Esser, Rombach, and Ommer (2021)</span> <span class="ltx_bibblock"> Esser, P.; Rombach, R.; and Ommer, B. 2021. </span> <span class="ltx_bibblock">Taming Transformers for High-Resolution Image Synthesis. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">CVPR</em>, 12873–12883. Computer Vision Foundation / IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gao et al. (2023)</span> <span class="ltx_bibblock"> Gao, S.; Li, Z.; Yang, M.; Cheng, M.; Han, J.; and Torr, P. H. S. 2023. 
</span> <span class="ltx_bibblock">Large-Scale Unsupervised Semantic Segmentation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">IEEE Trans. Pattern Anal. Mach. Intell.</em>, 45(6): 7457–7476. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Girdhar et al. (2023)</span> <span class="ltx_bibblock"> Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K. V.; Joulin, A.; and Misra, I. 2023. </span> <span class="ltx_bibblock">ImageBind: One Embedding Space to Bind Them All. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">CVPR</em>, 15180–15190. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gupta, Dollár, and Girshick (2019)</span> <span class="ltx_bibblock"> Gupta, A.; Dollár, P.; and Girshick, R. B. 2019. </span> <span class="ltx_bibblock">LVIS: A Dataset for Large Vocabulary Instance Segmentation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">IEEE Conference on Computer Vision and Pattern Recognition</em>, 5356–5364. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">He et al. (2022)</span> <span class="ltx_bibblock"> He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; and Girshick, R. B. 2022. </span> <span class="ltx_bibblock">Masked Autoencoders Are Scalable Vision Learners. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">CVPR</em>, 15979–15988. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hu et al. (2024)</span> <span class="ltx_bibblock"> Hu, W.; Xu, Y.; Li, Y.; Li, W.; Chen, Z.; and Tu, Z. 2024. </span> <span class="ltx_bibblock">BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">Thirty-Eighth AAAI Conference on Artificial Intelligence</em>, 2256–2264. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Huang, Fu, and Bors (2023)</span> <span class="ltx_bibblock"> Huang, G.; Fu, H.; and Bors, A. G. 2023. </span> <span class="ltx_bibblock">Masked Image Residual Learning for Scaling Deeper Vision Transformers. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">NeurIPS</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Jia et al. (2021)</span> <span class="ltx_bibblock"> Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.; Parekh, Z.; Pham, H.; Le, Q. V.; Sung, Y.; Li, Z.; and Duerig, T. 2021. </span> <span class="ltx_bibblock">Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">ICML</em>, volume 139 of <em class="ltx_emph ltx_font_italic" id="bib.bib17.2.2">Proceedings of Machine Learning Research</em>, 4904–4916. PMLR. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Krishna et al. (2017)</span> <span class="ltx_bibblock"> Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2017. </span> <span class="ltx_bibblock">Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">Int. J. Comput. Vis.</em>, 123(1): 32–73. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. 
(2023)</span> <span class="ltx_bibblock"> Li, J.; Li, D.; Savarese, S.; and Hoi, S. C. H. 2023. </span> <span class="ltx_bibblock">BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib19.1.1">International Conference on Machine Learning, ICML</em>, 19730–19742. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024)</span> <span class="ltx_bibblock"> Li, Y.; Zhang, Y.; Wang, C.; Zhong, Z.; Chen, Y.; Chu, R.; Liu, S.; and Jia, J. 2024. </span> <span class="ltx_bibblock">Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">CoRR</em>, abs/2403.18814. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lin et al. (2014)</span> <span class="ltx_bibblock"> Lin, T.; Maire, M.; Belongie, S. J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. </span> <span class="ltx_bibblock">Microsoft COCO: Common Objects in Context. </span> <span class="ltx_bibblock">In Fleet, D. J.; Pajdla, T.; Schiele, B.; and Tuytelaars, T., eds., <em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V</em>, volume 8693 of <em class="ltx_emph ltx_font_italic" id="bib.bib21.2.2">Lecture Notes in Computer Science</em>, 740–755. Springer. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2023a)</span> <span class="ltx_bibblock"> Liu, H.; Li, C.; Li, Y.; and Lee, Y. J. 2023a. </span> <span class="ltx_bibblock">Improved Baselines with Visual Instruction Tuning. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib22.1.1">CoRR</em>, abs/2310.03744. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2024)</span> <span class="ltx_bibblock"> Liu, H.; Li, C.; Li, Y.; Li, B.; Zhang, Y.; Shen, S.; and Lee, Y. J. 2024. </span> <span class="ltx_bibblock">LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2023b)</span> <span class="ltx_bibblock"> Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023b. </span> <span class="ltx_bibblock">Visual Instruction Tuning. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib24.1.1">Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu, Yan, and Abbeel (2023)</span> <span class="ltx_bibblock"> Liu, H.; Yan, W.; and Abbeel, P. 2023. </span> <span class="ltx_bibblock">Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib25.1.1">NeurIPS</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2022)</span> <span class="ltx_bibblock"> Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; and Xie, S. 2022. </span> <span class="ltx_bibblock">A ConvNet for the 2020s. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib26.1.1">CVPR</em>, 11966–11976. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Maaz et al. (2023)</span> <span class="ltx_bibblock"> Maaz, M.; Rasheed, H. A.; Khan, S. H.; and Khan, F. S. 
2023. </span> <span class="ltx_bibblock">Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">CoRR</em>, abs/2306.05424. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mikolov et al. (2013)</span> <span class="ltx_bibblock"> Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. </span> <span class="ltx_bibblock">word2vec - Tool for Efficient Estimation of Word Representations in Vector Space. </span> <span class="ltx_bibblock">Accessed: 2024-08-07. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Minderer et al. (2022)</span> <span class="ltx_bibblock"> Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; Wang, X.; Zhai, X.; Kipf, T.; and Houlsby, N. 2022. </span> <span class="ltx_bibblock">Simple Open-Vocabulary Object Detection with Vision Transformers. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib29.1.1">ECCV</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mu et al. (2023)</span> <span class="ltx_bibblock"> Mu, Y.; Zhang, Q.; Hu, M.; Wang, W.; Ding, M.; Jin, J.; Wang, B.; Dai, J.; Qiao, Y.; and Luo, P. 2023. </span> <span class="ltx_bibblock">EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib30.1.1">Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Muhammad and Yeasin (2020)</span> <span class="ltx_bibblock"> Muhammad, M. B.; and Yeasin, M. 2020. 
</span> <span class="ltx_bibblock">Eigen-CAM: Class Activation Map using Principal Components. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib31.1.1">2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020</em>, 1–7. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">OpenAI (2023)</span> <span class="ltx_bibblock"> OpenAI. 2023. </span> <span class="ltx_bibblock">ChatGPT: Optimizing Language Models for Dialogue. </span> <span class="ltx_bibblock"><span class="ltx_ref ltx_nolink ltx_url ltx_ref_self">https://openai.com/blog/chatgpt</span>. </span> <span class="ltx_bibblock">Accessed: 2023-12-21. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Oquab et al. (2023)</span> <span class="ltx_bibblock"> Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; Assran, M.; Ballas, N.; Galuba, W.; Howes, R.; Huang, P.; Li, S.; Misra, I.; Rabbat, M. G.; Sharma, V.; Synnaeve, G.; Xu, H.; Jégou, H.; Mairal, J.; Labatut, P.; Joulin, A.; and Bojanowski, P. 2023. </span> <span class="ltx_bibblock">DINOv2: Learning Robust Visual Features without Supervision. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib33.1.1">CoRR</em>, abs/2304.07193. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Panagopoulou et al. (2023)</span> <span class="ltx_bibblock"> Panagopoulou, A.; Xue, L.; Yu, N.; Li, J.; Li, D.; Joty, S.; Xu, R.; Savarese, S.; Xiong, C.; and Niebles, J. C. 2023. </span> <span class="ltx_bibblock">X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.1.1">CoRR</em>, abs/2311.18799. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Peng et al. (2022)</span> <span class="ltx_bibblock"> Peng, Z.; Dong, L.; Bao, H.; Ye, Q.; and Wei, F. 2022. </span> <span class="ltx_bibblock">BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">CoRR</em>, abs/2208.06366. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Radford et al. (2021)</span> <span class="ltx_bibblock"> Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. </span> <span class="ltx_bibblock">Learning Transferable Visual Models From Natural Language Supervision. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib36.1.1">Proceedings of the 38th International Conference on Machine Learning, ICML</em>, volume 139, 8748–8763. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ren et al. (2023)</span> <span class="ltx_bibblock"> Ren, S.; Wei, F.; Albanie, S.; Zhang, Z.; and Hu, H. 2023. </span> <span class="ltx_bibblock">DeepMIM: Deep Supervision for Masked Image Modeling. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">CoRR</em>, abs/2303.08817. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Rombach et al. (2022)</span> <span class="ltx_bibblock"> Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. </span> <span class="ltx_bibblock">High-Resolution Image Synthesis with Latent Diffusion Models. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR</em>, 10674–10685. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Singh et al. (2019)</span> <span class="ltx_bibblock"> Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; and Rohrbach, M. 2019. </span> <span class="ltx_bibblock">Towards VQA Models That Can Read. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib39.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, 8317–8326. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Su et al. (2023)</span> <span class="ltx_bibblock"> Su, Y.; Lan, T.; Li, H.; Xu, J.; Wang, Y.; and Cai, D. 2023. </span> <span class="ltx_bibblock">PandaGPT: One Model To Instruction-Follow Them All. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.1.1">arXiv preprint arXiv:2305.16355</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sun et al. (2023)</span> <span class="ltx_bibblock"> Sun, Q.; Fang, Y.; Wu, L.; Wang, X.; and Cao, Y. 2023. </span> <span class="ltx_bibblock">EVA-CLIP: Improved Training Techniques for CLIP at Scale. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib41.1.1">CoRR</em>, abs/2303.15389. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tang et al. (2024)</span> <span class="ltx_bibblock"> Tang, Z.; Yang, Z.; Khademi, M.; Liu, Y.; Zhu, C.; and Bansal, M. 2024. </span> <span class="ltx_bibblock">CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib42.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 27425–27434. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">THUDM (2023)</span> <span class="ltx_bibblock"> THUDM. 2023. </span> <span class="ltx_bibblock">VisualGLM-6B. </span> <span class="ltx_bibblock"><span class="ltx_ref ltx_nolink ltx_url ltx_ref_self">https://github.com/THUDM/VisualGLM-6B</span>. </span> <span class="ltx_bibblock">GitHub repository. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">van den Oord, Vinyals, and Kavukcuoglu (2017)</span> <span class="ltx_bibblock"> van den Oord, A.; Vinyals, O.; and Kavukcuoglu, K. 2017. </span> <span class="ltx_bibblock">Neural Discrete Representation Learning. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib44.1.1">NIPS</em>, 6306–6315. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al. (2023)</span> <span class="ltx_bibblock"> Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Song, X.; Xu, J.; Xu, B.; Li, J.; Dong, Y.; Ding, M.; and Tang, J. 2023. </span> <span class="ltx_bibblock">CogVLM: Visual Expert for Pretrained Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib45.1.1">CoRR</em>, abs/2311.03079. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wu et al. (2023)</span> <span class="ltx_bibblock"> Wu, S.; Fei, H.; Qu, L.; Ji, W.; and Chua, T. 2023. </span> <span class="ltx_bibblock">NExT-GPT: Any-to-Any Multimodal LLM. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.1.1">CoRR</em>, abs/2309.05519. 
</span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xiao et al. (2010)</span> <span class="ltx_bibblock"> Xiao, J.; Hays, J.; Ehinger, K. A.; Oliva, A.; and Torralba, A. 2010. </span> <span class="ltx_bibblock">SUN database: Large-scale scene recognition from abbey to zoo. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib47.1.1">CVPR</em>, 3485–3492. IEEE Computer Society. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xu et al. (2024)</span> <span class="ltx_bibblock"> Xu, H.; Xie, S.; Tan, X. E.; Huang, P.; Howes, R.; Sharma, V.; Li, S.; Ghosh, G.; Zettlemoyer, L.; and Feichtenhofer, C. 2024. </span> <span class="ltx_bibblock">Demystifying CLIP Data. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib48.1.1">ICLR</em>. OpenReview.net. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xu et al. (2022)</span> <span class="ltx_bibblock"> Xu, J.; Mello, S. D.; Liu, S.; Byeon, W.; Breuel, T. M.; Kautz, J.; and Wang, X. 2022. </span> <span class="ltx_bibblock">GroupViT: Semantic Segmentation Emerges from Text Supervision. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib49.1.1">IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR</em>, 18113–18123. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al. (2023)</span> <span class="ltx_bibblock"> Yang, Z.; Zhang, Y.; Meng, F.; and Zhou, J. 2023. </span> <span class="ltx_bibblock">TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.1.1">CoRR</em>, abs/2311.04589. 
</span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yao et al. (2022)</span> <span class="ltx_bibblock"> Yao, L.; Huang, R.; Hou, L.; Lu, G.; Niu, M.; Xu, H.; Liang, X.; Li, Z.; Jiang, X.; and Xu, C. 2022. </span> <span class="ltx_bibblock">FILIP: Fine-grained Interactive Language-Image Pre-Training. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib51.1.1">The Tenth International Conference on Learning Representations, ICLR</em>. OpenReview.net. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al. (2023)</span> <span class="ltx_bibblock"> Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; Li, C.; Xu, Y.; Chen, H.; Tian, J.; Qi, Q.; Zhang, J.; and Huang, F. 2023. </span> <span class="ltx_bibblock">mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib52.1.1">CoRR</em>, abs/2304.14178. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al. (2024)</span> <span class="ltx_bibblock"> Ye, Q.; Xu, H.; Ye, J.; Yan, M.; Hu, A.; Liu, H.; Qian, Q.; Zhang, J.; and Huang, F. 2024. </span> <span class="ltx_bibblock">mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib53.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 13040–13051. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yin et al. (2023)</span> <span class="ltx_bibblock"> Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; and Chen, E. 2023. </span> <span class="ltx_bibblock">A Survey on Multimodal Large Language Models. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib54.1.1">CoRR</em>, abs/2306.13549. </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">You et al. (2022)</span> <span class="ltx_bibblock"> You, T.; Kim, S.; Kim, C.; Lee, D.; and Han, B. 2022. </span> <span class="ltx_bibblock">Locally Hierarchical Auto-Regressive Modeling for Image Generation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib55.1.1">NeurIPS</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yu et al. (2023)</span> <span class="ltx_bibblock"> Yu, L.; Cheng, Y.; Wang, Z.; Kumar, V.; Macherey, W.; Huang, Y.; Ross, D. A.; Essa, I.; Bisk, Y.; Yang, M.; Murphy, K. P.; Hauptmann, A. G.; and Jiang, L. 2023. </span> <span class="ltx_bibblock">SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib56.1.1">NeurIPS</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhai et al. (2023)</span> <span class="ltx_bibblock"> Zhai, X.; Mustafa, B.; Kolesnikov, A.; and Beyer, L. 2023. </span> <span class="ltx_bibblock">Sigmoid Loss for Language Image Pre-Training. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib57.1.1">IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023</em>, 11941–11952. IEEE. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang, Li, and Bing (2023)</span> <span class="ltx_bibblock"> Zhang, H.; Li, X.; and Bing, L. 2023. </span> <span class="ltx_bibblock">Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib58.1.1">EMNLP (Demos)</em>, 543–553. Association for Computational Linguistics. </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhao et al. (2024)</span> <span class="ltx_bibblock"> Zhao, H.; Zhang, M.; Zhao, W.; Ding, P.; Huang, S.; and Wang, D. 2024. </span> <span class="ltx_bibblock">Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib59.1.1">CoRR</em>, abs/2403.14520. </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhao et al. (2023)</span> <span class="ltx_bibblock"> Zhao, Y.; Lin, Z.; Zhou, D.; Huang, Z.; Feng, J.; and Kang, B. 2023. </span> <span class="ltx_bibblock">BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib60.1.1">CoRR</em>, abs/2307.08581. </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zheng, He, and Wang (2023)</span> <span class="ltx_bibblock"> Zheng, K.; He, X.; and Wang, X. E. 2023. </span> <span class="ltx_bibblock">MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib61.1.1">CoRR</em>, abs/2310.02239. </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhu et al. (2023)</span> <span class="ltx_bibblock"> Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. </span> <span class="ltx_bibblock">MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib62.1.1">CoRR</em>, abs/2304.10592. 
</span> </li> </ul> </section> <section class="ltx_section" id="Sx6"> <h2 class="ltx_title ltx_title_section">Reproducibility Checklist</h2> <div class="ltx_para" id="Sx6.p1"> <p class="ltx_p" id="Sx6.p1.1">This paper:</p> <ul class="ltx_itemize" id="Sx6.I7"> <li class="ltx_item" id="Sx6.I7.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I7.i1.p1"> <p class="ltx_p" id="Sx6.I7.i1.p1.1">Includes a conceptual outline and/or pseudocode description of AI methods introduced (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I7.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I7.i2.p1"> <p class="ltx_p" id="Sx6.I7.i2.p1.1">Clearly delineates statements that are opinions, hypotheses, and speculation from objective facts and results (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I7.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I7.i3.p1"> <p class="ltx_p" id="Sx6.I7.i3.p1.1">Provides well-marked pedagogical references for less-familiar readers to gain background necessary to replicate the paper (yes)</p> </div> </li> </ul> </div> <div class="ltx_para ltx_noindent" id="Sx6.p2"> <p class="ltx_p" id="Sx6.p2.1">Does this paper make theoretical contributions? (no)</p> </div> <div class="ltx_para ltx_noindent" id="Sx6.p3"> <p class="ltx_p" id="Sx6.p3.1">Does this paper rely on one or more datasets? 
(yes)</p> <ul class="ltx_itemize" id="Sx6.I8"> <li class="ltx_item" id="Sx6.I8.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I8.i1.p1"> <p class="ltx_p" id="Sx6.I8.i1.p1.1">A motivation is given for why the experiments are conducted on the selected datasets (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I8.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I8.i2.p1"> <p class="ltx_p" id="Sx6.I8.i2.p1.1">All novel datasets introduced in this paper are included in a data appendix. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I8.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I8.i3.p1"> <p class="ltx_p" id="Sx6.I8.i3.p1.1">All novel datasets introduced in this paper will be made publicly available upon publication of the paper with a license that allows free usage for research purposes. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I8.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I8.i4.p1"> <p class="ltx_p" id="Sx6.I8.i4.p1.1">All datasets drawn from the existing literature (potentially including authors’ own previously published work) are accompanied by appropriate citations. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I8.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I8.i5.p1"> <p class="ltx_p" id="Sx6.I8.i5.p1.1">All datasets drawn from the existing literature (potentially including authors’ own previously published work) are publicly available. 
(yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I8.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I8.i6.p1"> <p class="ltx_p" id="Sx6.I8.i6.p1.1">All datasets that are not publicly available are described in detail, with explanation why publicly available alternatives are not scientifically satisficing. (NA)</p> </div> </li> </ul> </div> <div class="ltx_para ltx_noindent" id="Sx6.p4"> <p class="ltx_p" id="Sx6.p4.1">Does this paper include computational experiments? (yes)</p> <ul class="ltx_itemize" id="Sx6.I9"> <li class="ltx_item" id="Sx6.I9.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i1.p1"> <p class="ltx_p" id="Sx6.I9.i1.p1.1">Any code required for pre-processing data is included in the appendix. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i2.p1"> <p class="ltx_p" id="Sx6.I9.i2.p1.1">All source code required for conducting and analyzing the experiments is included in a code appendix. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i3.p1"> <p class="ltx_p" id="Sx6.I9.i3.p1.1">All source code required for conducting and analyzing the experiments will be made publicly available upon publication of the paper with a license that allows free usage for research purposes. 
(yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i4.p1"> <p class="ltx_p" id="Sx6.I9.i4.p1.1">All source code implementing new methods has comments detailing the implementation, with references to the paper where each step comes from (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i5.p1"> <p class="ltx_p" id="Sx6.I9.i5.p1.1">If an algorithm depends on randomness, then the method used for setting seeds is described in a way sufficient to allow replication of results. (NA) </p> </div> </li> <li class="ltx_item" id="Sx6.I9.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i6.p1"> <p class="ltx_p" id="Sx6.I9.i6.p1.1">This paper specifies the computing infrastructure used for running experiments (hardware and software), including GPU/CPU models; amount of memory; operating system; names and versions of relevant software libraries and frameworks. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i7" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i7.p1"> <p class="ltx_p" id="Sx6.I9.i7.p1.1">This paper formally describes evaluation metrics used and explains the motivation for choosing these metrics. (yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i8" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i8.p1"> <p class="ltx_p" id="Sx6.I9.i8.p1.1">This paper states the number of algorithm runs used to compute each reported result. 
(yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i9" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i9.p1"> <p class="ltx_p" id="Sx6.I9.i9.p1.1">Analysis of experiments goes beyond single-dimensional summaries of performance (e.g., average; median) to include measures of variation, confidence, or other distributional information. (yes) </p> </div> </li> <li class="ltx_item" id="Sx6.I9.i10" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i10.p1"> <p class="ltx_p" id="Sx6.I9.i10.p1.1">The significance of any improvement or decrease in performance is judged using appropriate statistical tests (e.g., Wilcoxon signed-rank). (partial)</p> <ul class="ltx_itemize" id="Sx6.I9.i10.I1"> <li class="ltx_item" id="Sx6.I9.i10.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item"><span class="ltx_text ltx_font_bold" id="Sx6.I9.i10.I1.i1.1.1.1">–</span></span> <div class="ltx_para" id="Sx6.I9.i10.I1.i1.p1"> <p class="ltx_p" id="Sx6.I9.i10.I1.i1.p1.1">The methodology presented in this paper is demonstrably superior to existing baselines for our tasks, and its efficacy is not merely attributable to random fluctuations. Consequently, statistical testing is not a necessary component of this evaluation.</p> </div> </li> </ul> </div> </li> <li class="ltx_item" id="Sx6.I9.i11" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i11.p1"> <p class="ltx_p" id="Sx6.I9.i11.p1.1">This paper lists all final (hyper-)parameters used for each model/algorithm in the paper’s experiments. 
(yes)</p> </div> </li> <li class="ltx_item" id="Sx6.I9.i12" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="Sx6.I9.i12.p1"> <p class="ltx_p" id="Sx6.I9.i12.p1.1">This paper states the number and range of values tried per (hyper-)parameter during development of the paper, along with the criterion used for selecting the final parameter setting. (partial)</p> </div> </li> </ul> </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix A </span>Appendix</h2> <div class="ltx_para" id="A1.p1"> <p class="ltx_p" id="A1.p1.1">The appendix covers several topics. First, it provides an example of the inference pipeline to illustrate the GiVE workflow. Next, it presents statistics on the types of image encoders used by MLLMs. It then describes the proposed MOInst dataset and its construction process, details the experimental setup, and reports additional experiments, including parameter sensitivity and visualizations. Finally, it demonstrates the impact of visual attention on MLLMs and the improvements achieved with our GiVE approach.</p> </div> <section class="ltx_subsection" id="A1.SSx1"> <h3 class="ltx_title ltx_title_subsection">Prompt Engineering and Pipeline</h3> <div class="ltx_para" id="A1.SSx1.p1"> <p class="ltx_p" id="A1.SSx1.p1.1">GiVE aims to provide a text-guided visual encoding scheme. Text is a medium that both humans and models comprehend readily. We therefore use text as the guidance signal, which raises the challenge of avoiding information leakage. To mitigate this issue, we provide only minimal information to the visual encoder, specifically words that serve as object labels. This level of information is sufficient to prevent the visual encoder from missing key features. 
</p> </div> <div class="ltx_para" id="A1.SSx1.p2"> <p class="ltx_p" id="A1.SSx1.p2.1">However, text encoders are typically trained on complete sentences describing images rather than single words. To address this discrepancy, object indicator words are inserted into the template “A photo of {object}” during both training and inference. Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.F5" title="Figure 5 ‣ Prompt Engineering and Pipeline ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">5</span></a> illustrates the model pipeline, using image-text retrieval as an example.</p> </div> <figure class="ltx_figure" id="A1.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="659" id="A1.F5.g1" src="x5.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span><span class="ltx_text ltx_font_bold" id="A1.F5.2.1">Pipeline for image-text retrieval</span>. The frozen text encoder synthesizes retrieval instructions through constructed statements with target objects. The trained image encoder generates corresponding image embeddings for different instructions.</figcaption> </figure> </section> <section class="ltx_subsection" id="A1.SSx2"> <h3 class="ltx_title ltx_title_subsection">MLLM and Image Encoder Summary</h3> <div class="ltx_para" id="A1.SSx2.p1"> <p class="ltx_p" id="A1.SSx2.p1.1">We surveyed 30 common MLLMs from the past two years to establish which image encoders they employ and which pre-training tasks those encoders use. The findings are set out in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T5" title="Table 5 ‣ MLLM and Image Encoder Summary ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">5</span></a>. 
The majority of researchers prefer pre-aligned encoders such as CLIP <cite class="ltx_cite ltx_citemacro_citep">(Radford et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite> when constructing MLLMs. Even the few models that employ image reconstruction encoders for specific purposes, such as expanding modalities <cite class="ltx_cite ltx_citemacro_citep">(Yang et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib50" title="">2023</a>)</cite> or freezing LLMs <cite class="ltx_cite ltx_citemacro_citep">(Liu, Yan, and Abbeel <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib25" title="">2023</a>; Yu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib56" title="">2023</a>)</cite>, exhibit diminished capacity for multimodal understanding compared with models like LLaVA <cite class="ltx_cite ltx_citemacro_citep">(Liu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib24" title="">2023b</a>)</cite>, whose encoders are trained with contrastive learning.</p> </div> <figure class="ltx_table" id="A1.T5"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T5.1"> <tr class="ltx_tr" id="A1.T5.1.1"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T5.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">Model</td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T5.1.1.2" style="padding-left:2.8pt;padding-right:2.8pt;">Image Encoder</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T5.1.1.3" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T5.1.1.3.1"></span> <span class="ltx_text" id="A1.T5.1.1.3.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T5.1.1.3.2.1"> <span class="ltx_tr" id="A1.T5.1.1.3.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T5.1.1.3.2.1.1.1" 
style="padding-left:2.8pt;padding-right:2.8pt;">Pre-training</span></span> <span class="ltx_tr" id="A1.T5.1.1.3.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T5.1.1.3.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">Task</span></span> </span></span><span class="ltx_text" id="A1.T5.1.1.3.3"></span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.2" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T5.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.2.1.1" style="background-color:#E6E6E6;">BLIP-2 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib19" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T5.1.2.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T5.1.2.2.1"></span><span class="ltx_text" id="A1.T5.1.2.2.2" style="background-color:#E6E6E6;"> <span class="ltx_tabular ltx_align_middle" id="A1.T5.1.2.2.2.1"> <span class="ltx_tr" id="A1.T5.1.2.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.2.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></span> <span class="ltx_tr" id="A1.T5.1.2.2.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.2.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">/EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></span> </span></span><span class="ltx_text" id="A1.T5.1.2.2.3"></span><span class="ltx_text" id="A1.T5.1.2.2.4" style="background-color:#E6E6E6;"></span> </td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T5.1.2.3" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.2.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.3"> <td class="ltx_td ltx_align_left" id="A1.T5.1.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">InstructBLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib9" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.3.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T5.1.3.2.1"></span><span class="ltx_text" id="A1.T5.1.3.2.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T5.1.3.2.2.1"> <span class="ltx_tr" id="A1.T5.1.3.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.3.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></span> <span class="ltx_tr" id="A1.T5.1.3.2.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.3.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">/EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></span> </span></span><span class="ltx_text" id="A1.T5.1.3.2.3"></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.3.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.4" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.4.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.4.1.1" style="background-color:#E6E6E6;">X-InstructBLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib34" title="">2023</a>)</cite></span></td> <td 
class="ltx_td ltx_align_left" id="A1.T5.1.4.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.4.2.1" style="background-color:#E6E6E6;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.4.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.4.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.5"> <td class="ltx_td ltx_align_left" id="A1.T5.1.5.1" style="padding-left:2.8pt;padding-right:2.8pt;">LLaVA <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib24" title="">2023b</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.5.2" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.5.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.6" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.6.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.6.1.1" style="background-color:#E6E6E6;">LLaVA-1.5 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib22" title="">2023a</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.6.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.6.2.1" style="background-color:#E6E6E6;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></td> <td class="ltx_td 
ltx_align_center" id="A1.T5.1.6.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.6.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.7"> <td class="ltx_td ltx_align_left" id="A1.T5.1.7.1" style="padding-left:2.8pt;padding-right:2.8pt;">LLaVA-NeXT <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib23" title="">2024</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.7.2" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.7.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.8" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.8.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.8.1.1" style="background-color:#E6E6E6;">BLIVA <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib15" title="">2024</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.8.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.8.2.1" style="background-color:#E6E6E6;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.8.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.8.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.9"> <td class="ltx_td ltx_align_left" id="A1.T5.1.9.1" style="padding-left:2.8pt;padding-right:2.8pt;">MiniGPT-4 <cite class="ltx_cite 
ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib62" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.9.2" style="padding-left:2.8pt;padding-right:2.8pt;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.9.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.10" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.10.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.10.1.1" style="background-color:#E6E6E6;">MiniGPT-5 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib61" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.10.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.10.2.1" style="background-color:#E6E6E6;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.10.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.10.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.11"> <td class="ltx_td ltx_align_left" id="A1.T5.1.11.1" style="padding-left:2.8pt;padding-right:2.8pt;">MiniGPT-v2 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib3" title="">2023a</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.11.2" style="padding-left:2.8pt;padding-right:2.8pt;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" 
href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.11.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.12" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.12.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.12.1.1" style="background-color:#E6E6E6;">mPLUG-Owl <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib52" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.12.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.12.2.1" style="background-color:#E6E6E6;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.12.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.12.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.13"> <td class="ltx_td ltx_align_left" id="A1.T5.1.13.1" style="padding-left:2.8pt;padding-right:2.8pt;">mPLUG-Owl2 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib53" title="">2024</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.13.2" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.13.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.14" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.14.1" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.14.1.1" style="background-color:#E6E6E6;">ShareGPT4V <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib5" title="">2023c</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.14.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.14.2.1" style="background-color:#E6E6E6;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.14.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.14.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.15"> <td class="ltx_td ltx_align_left" id="A1.T5.1.15.1" style="padding-left:2.8pt;padding-right:2.8pt;">PandaGPT <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib40" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.15.2" style="padding-left:2.8pt;padding-right:2.8pt;">ImageBind <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib12" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.15.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.16" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.16.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.16.1.1" style="background-color:#E6E6E6;">Video-LLaMA <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib58" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.16.2" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.16.2.1" style="background-color:#E6E6E6;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.16.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.16.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.17"> <td class="ltx_td ltx_align_left" id="A1.T5.1.17.1" style="padding-left:2.8pt;padding-right:2.8pt;">Video-ChatGPT <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib27" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.17.2" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.17.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.18" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.18.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.18.1.1" style="background-color:#E6E6E6;">Shikra <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib4" title="">2023b</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.18.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.18.2.1" style="background-color:#E6E6E6;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.18.3" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.18.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.19"> <td class="ltx_td ltx_align_left" id="A1.T5.1.19.1" style="padding-left:2.8pt;padding-right:2.8pt;">VisualGLM-6B <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib43" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.19.2" style="padding-left:2.8pt;padding-right:2.8pt;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.19.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.20" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.20.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.20.1.1" style="background-color:#E6E6E6;">CogVLM <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib45" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.20.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.20.2.1" style="background-color:#E6E6E6;">EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.20.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.20.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.21"> <td class="ltx_td ltx_align_left" id="A1.T5.1.21.1" style="padding-left:2.8pt;padding-right:2.8pt;">Mini-Gemini <cite class="ltx_cite 
ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib20" title="">2024</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.21.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T5.1.21.2.1"></span><span class="ltx_text" id="A1.T5.1.21.2.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T5.1.21.2.2.1"> <span class="ltx_tr" id="A1.T5.1.21.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.21.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite>;</span></span> <span class="ltx_tr" id="A1.T5.1.21.2.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.21.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">ConvNeXt <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib26" title="">2022</a>)</cite></span></span> </span></span><span class="ltx_text" id="A1.T5.1.21.2.3"></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.21.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.22" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.22.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.22.1.1" style="background-color:#E6E6E6;">BuboGPT <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib60" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.22.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T5.1.22.2.1"></span><span class="ltx_text" id="A1.T5.1.22.2.2" style="background-color:#E6E6E6;"> <span class="ltx_tabular ltx_align_middle" id="A1.T5.1.22.2.2.1"> <span class="ltx_tr" 
id="A1.T5.1.22.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.22.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite></span></span> <span class="ltx_tr" id="A1.T5.1.22.2.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.22.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">/EVA-CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></span> </span></span><span class="ltx_text" id="A1.T5.1.22.2.3"></span><span class="ltx_text" id="A1.T5.1.22.2.4" style="background-color:#E6E6E6;"></span> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.22.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.22.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.23"> <td class="ltx_td ltx_align_left" id="A1.T5.1.23.1" style="padding-left:2.8pt;padding-right:2.8pt;">ChatSpot <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib32" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.23.2" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.23.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.24" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.24.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.24.1.1" style="background-color:#E6E6E6;">Qwen-VL <cite class="ltx_cite 
ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib1" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.24.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.24.2.1" style="background-color:#E6E6E6;">OpenCLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib8" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.24.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.24.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.25"> <td class="ltx_td ltx_align_left" id="A1.T5.1.25.1" style="padding-left:2.8pt;padding-right:2.8pt;">NExT-GPT <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib46" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.25.2" style="padding-left:2.8pt;padding-right:2.8pt;">ImageBind <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib12" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.25.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.26" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.26.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.26.1.1" style="background-color:#E6E6E6;">DRESS <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib6" title="">2023d</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.26.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.26.2.1" style="background-color:#E6E6E6;">EVA-CLIP <cite class="ltx_cite 
ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib41" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.26.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.26.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.27"> <td class="ltx_td ltx_align_left" id="A1.T5.1.27.1" style="padding-left:2.8pt;padding-right:2.8pt;">Cobra <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib59" title="">2024</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.27.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T5.1.27.2.1"></span><span class="ltx_text" id="A1.T5.1.27.2.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T5.1.27.2.2.1"> <span class="ltx_tr" id="A1.T5.1.27.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.27.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">SigLIP <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib57" title="">2023</a>)</cite>;</span></span> <span class="ltx_tr" id="A1.T5.1.27.2.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T5.1.27.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">DINOv2 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib33" title="">2023</a>)</cite></span></span> </span></span><span class="ltx_text" id="A1.T5.1.27.2.3"></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.27.3" style="padding-left:2.8pt;padding-right:2.8pt;">contrast</td> </tr> <tr class="ltx_tr" id="A1.T5.1.28" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.28.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.28.1.1" 
style="background-color:#E6E6E6;">CoDi-2 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib42" title="">2024</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.28.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.28.2.1" style="background-color:#E6E6E6;">ImageBind <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib12" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.28.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.28.3.1" style="background-color:#E6E6E6;">contrast</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.29"> <td class="ltx_td ltx_align_left" id="A1.T5.1.29.1" style="padding-left:2.8pt;padding-right:2.8pt;">LQAE <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib25" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left" id="A1.T5.1.29.2" style="padding-left:2.8pt;padding-right:2.8pt;">LQAE <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib25" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_center" id="A1.T5.1.29.3" style="padding-left:2.8pt;padding-right:2.8pt;">reconstruction</td> </tr> <tr class="ltx_tr" id="A1.T5.1.30" style="background-color:#E6E6E6;"> <td class="ltx_td ltx_align_left" id="A1.T5.1.30.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.30.1.1" style="background-color:#E6E6E6;">SPAE <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib56" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_left" id="A1.T5.1.30.2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.30.2.1" 
style="background-color:#E6E6E6;">SPAE <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib56" title="">2023</a>)</cite></span></td> <td class="ltx_td ltx_align_center" id="A1.T5.1.30.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T5.1.30.3.1" style="background-color:#E6E6E6;">reconstruction</span></td> </tr> <tr class="ltx_tr" id="A1.T5.1.31"> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T5.1.31.1" style="padding-left:2.8pt;padding-right:2.8pt;">TEAL <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib50" title="">2023</a>)</cite> </td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T5.1.31.2" style="padding-left:2.8pt;padding-right:2.8pt;">BEiT-V2 <cite class="ltx_cite ltx_citemacro_citeyearpar">(<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib35" title="">2022</a>)</cite> </td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T5.1.31.3" style="padding-left:2.8pt;padding-right:2.8pt;">reconstruction</td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 5: </span>A summary of common MLLMs and the image encoders they use. This table presents the image encoder pre-training tasks. 
Most researchers prefer pre-aligned encoders when building MLLMs.</figcaption> </figure> </section> <section class="ltx_subsection" id="A1.SSx3"> <h3 class="ltx_title ltx_title_subsection">Multi-Object Instruction Dataset</h3> <div class="ltx_para" id="A1.SSx3.p1"> <p class="ltx_p" id="A1.SSx3.p1.1">Our Multi-Object Instruction (MOInst) dataset is divided into training and testing sets; their statistics are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T6" title="Table 6 ‣ Multi-Object Instruction Dataset ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">6</span></a>. The MOInst annotation files contain the following fields: “image_id” is the unique identifier of an image, “file_name” is the image file name, “image_path” is the path to the image file, “caption” is a textual description of an object in the image, “category_id” is the category identifier of the caption, and “object_id” uniquely identifies a particular category within a specific image. 
The key information for an image in the dataset is shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T7" title="Table 7 ‣ Construction ‣ Multi-Object Instruction Dataset ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">7</span></a> as an example.</p> </div> <figure class="ltx_table" id="A1.T6"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T6.1"> <tr class="ltx_tr" id="A1.T6.1.1"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T6.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">Parts</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T6.1.1.2" style="padding-left:2.8pt;padding-right:2.8pt;"># Imgs</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T6.1.1.3" style="padding-left:2.8pt;padding-right:2.8pt;"># Caps</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T6.1.1.4" style="padding-left:2.8pt;padding-right:2.8pt;"># Cats</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T6.1.1.5" style="padding-left:2.8pt;padding-right:2.8pt;"># Avg. 
Cats/Image</td> </tr> <tr class="ltx_tr" id="A1.T6.1.2"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T6.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">Train</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T6.1.2.2" style="padding-left:2.8pt;padding-right:2.8pt;">80536</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T6.1.2.3" style="padding-left:2.8pt;padding-right:2.8pt;">240183</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T6.1.2.4" style="padding-left:2.8pt;padding-right:2.8pt;">264</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T6.1.2.5" style="padding-left:2.8pt;padding-right:2.8pt;">2.4</td> </tr> <tr class="ltx_tr" id="A1.T6.1.3"> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="A1.T6.1.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">Test</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T6.1.3.2" style="padding-left:2.8pt;padding-right:2.8pt;">1000</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T6.1.3.3" style="padding-left:2.8pt;padding-right:2.8pt;">4195</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T6.1.3.4" style="padding-left:2.8pt;padding-right:2.8pt;">264</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T6.1.3.5" style="padding-left:2.8pt;padding-right:2.8pt;">3.2</td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 6: </span>Statistics of the MOInst dataset. Parts represents the different partitions of the dataset, # Imgs indicates the total number of images in each partition, # Caps refers to the total number of captions, # Cats denotes the total number of distinct categories, and # Avg. Cats/Image is the average number of categories labeled per image. 
Multiple captions of an image may share the same category label.</figcaption> </figure> <section class="ltx_subsubsection" id="A1.SSx3.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Construction</h4> <div class="ltx_para" id="A1.SSx3.SSSx1.p1"> <p class="ltx_p" id="A1.SSx3.SSSx1.p1.1">We follow the pipeline of <cite class="ltx_cite ltx_citemacro_citep">(Chen et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib7" title="">2023e</a>)</cite> to construct our dataset. As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.F6" title="Figure 6 ‣ Construction ‣ Multi-Object Instruction Dataset ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">6</span></a>, the pipeline consists of two stages. The first stage generates category labels for the objects described by the captions, while the second stage improves the quality of those labels.</p> </div> <figure class="ltx_figure" id="A1.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_portrait" height="1382" id="A1.F6.g1" src="x6.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6: </span><span class="ltx_text ltx_font_bold" id="A1.F6.2.1">Pipeline for dataset construction</span>. Dataset construction comprises a generation stage and a processing stage. The dataset is created in a semi-automated manner with the help of existing models and minimal human assistance.</figcaption> </figure> <div class="ltx_para" id="A1.SSx3.SSSx1.p2"> <ol class="ltx_enumerate" id="A1.I1"> <li class="ltx_item" id="A1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">1.</span> <div class="ltx_para" id="A1.I1.i1.p1"> <p class="ltx_p" id="A1.I1.i1.p1.1">We filter the raw caption data from the VG dataset using criteria such as caption length. 
Additionally, we devise an initial prompt that briefly describes the data the LLM should generate with reference to the examples; we use GPT <cite class="ltx_cite ltx_citemacro_citep">(OpenAI <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib32" title="">2023</a>)</cite> as the LLM in this paper. This prompt, along with the necessary context (caption text), is then used to generate a small sample set. Next, we inspect the generated samples and modify the prompt accordingly. After several iterations, the refined prompt effectively guides the LLM in generating caption labels. The final prompt is given in Listing <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#listing1" title="Listing 1 ‣ Construction ‣ Multi-Object Instruction Dataset ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">1</span></a>.</p> </div> </li> <li class="ltx_item" id="A1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">2.</span> <div class="ltx_para" id="A1.I1.i2.p1"> <p class="ltx_p" id="A1.I1.i2.p1.1">Although the labeled data roughly meets expectations, it has issues stemming from the limitations of existing LLMs. These issues include confusion between singular and plural forms, overly fine-grained category labels, and insufficient samples within certain categories. Therefore, the labeled data requires further processing. First, we obtain noun lemmas through lemmatization and merge labels derived from the same word. Next, we use the word2vec model <cite class="ltx_cite ltx_citemacro_citep">(Mikolov et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib28" title="">2013</a>)</cite> to derive word embeddings for the labels and compute the similarity between them. 
Then, manual rules are set, including a similarity threshold and a minimum number of samples within a category, to cluster and filter the labels. By iterating these rules based on the merged data, the final dataset is obtained.</p> </div> </li> </ol> </div> <figure class="ltx_float ltx_float_listing ltx_framed ltx_framed_top" id="listing1"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_float"><span class="ltx_text ltx_font_bold" id="listing1.2.1.1">Listing 1</span> </span> The final prompt used in the MOInst construction</figcaption> <div class="ltx_listing ltx_lst_language_Haskell ltx_lst_numbers_left ltx_lstlisting ltx_listing" id="listing1.3"> <div class="ltx_listing_data"><a download="" href="data:text/plain;base64,WW91IHdpbGwgbGFiZWwgZWFjaCBzZW50ZW5jZSBvZiB0aGUgZ2l2ZW4gbXVsdGlwbGUgc2VudGVuY2VzIGRpdmlkZWQgd2l0aCA8Q0FQPiwgdGhlIGxhYmVsIGluZGljYXRlcyB0aGUgb2JqZWN0IG9mIHRoZSBzZW50ZW5jZS4gTGFiZWxpbmcgZXhhbXBsZXMgYXJlIGdpdmVuLgoKbGFiZWxpbmcgZXhhbXBsZXM6CmEgcmVkIGNhciBpcyBhdCB0aGUgYmFjayA8Q0FQPiB0d28gbGlnaHRzIGFyZSBvbiBvbiB0aGUgYnVpbGRpbmcgPENBUD4gYmlrZXMgYXJlIHBhcmtlZCBvbiB0aGUgZmFyIGVuZCA8Q0FQPiB0aGUgd29yZCBcIndhbGtcIiBpcyBsaXQgb24gYSBzdHJlZXQgc2lnbiA8Q0FQPiB0d28gbWVuIHN0YW5kIG9uIGFuIHVyYmFuIHNpZGV3YWxrCmNhciA8TEFCPiBsaWdodCA8TEFCPiBiaWN5Y2xlIDxMQUI+IGNoYXJhY3RlciA8TEFCPiBtYW4KCnNlbnRlbmNlczoKe2lucHV0X3RleHR9CgpJbmRpY2F0ZSB0aGUgc3ViamVjdCBvZiBlYWNoIHNlbnRlbmNlIHdpdGggcHJvdG90eXBlIG9mIG9uZSB3b3JkIHNlcGFyYXRlbHksIHNob3cgbWUgdGhlIGxhYmVscyBkaXJlY3RseSB3aXRob3V0IHRoZSBzZW50ZW5jZXMgd2l0aCBlYWNoIGxhYmVsIHNlcGFyYXRlZCBieSA8TEFCPiwgYW5kIG1ha2Ugc3VyZSB0aGF0IHRoZSBtYXJrZWQgbGFiZWxzIGNvcnJlc3BvbmQgb25lLXRvLW9uZSB3aXRoIHRoZSBvcmlnaW5hbCBzZW50ZW5jZS4=">⬇</a></div> <div class="ltx_listingline" id="lstnumberx1"> <span class="ltx_tag ltx_tag_listingline">1</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.1">You</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.2"> </span><span class="ltx_text 
ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.3">will</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.4"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.5">label</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.6"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.7">each</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.8"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.9">sentence</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.10"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter ltx_font_bold" id="lstnumberx1.11">of</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.12"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.13">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.14"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.15">given</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.16"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.17">multiple</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.18"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.19">sentences</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.20"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.21">divided</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.22"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.23">with</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.24"> </span><span class="ltx_text ltx_font_typewriter" 
id="lstnumberx1.25">&lt;</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.26">CAP</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.27">&gt;,</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.28"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.29">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.30"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.31">label</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.32"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.33">indicates</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.34"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.35">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.36"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.37">object</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.38"> </span><span class="ltx_text ltx_lst_keyword ltx_font_typewriter ltx_font_bold" id="lstnumberx1.39">of</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.40"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.41">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.42"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.43">sentence</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.44">.</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.45"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.46">Labeling</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.47"> </span><span class="ltx_text 
ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.48">examples</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.49"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.50">are</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx1.51"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx1.52">given</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx1.53">.</span> </div> <div class="ltx_listingline" id="lstnumberx2"> <span class="ltx_tag ltx_tag_listingline">2</span> </div> <div class="ltx_listingline" id="lstnumberx3"> <span class="ltx_tag ltx_tag_listingline">3</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.1">labeling</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx3.2"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx3.3">examples</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx3.4">:</span> </div> <div class="ltx_listingline" id="lstnumberx4"> <span class="ltx_tag ltx_tag_listingline">4</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.1">a</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.2"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.3">red</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.4"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.5">car</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.6"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.7">is</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.8"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.9">at</span><span class="ltx_text ltx_lst_space 
ltx_font_typewriter" id="lstnumberx4.10"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.11">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.12"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.13">back</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.14"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.15">&lt;</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.16">CAP</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.17">&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.18"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.19">two</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.20"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.21">lights</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.22"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.23">are</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.24"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.25">on</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.26"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.27">on</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.28"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.29">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.30"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.31">building</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.32"> </span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx4.33">&lt;</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.34">CAP</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.35">&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.36"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.37">bikes</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.38"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.39">are</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.40"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.41">parked</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.42"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.43">on</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.44"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.45">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.46"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.47">far</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.48"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.49">end</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.50"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.51">&lt;</span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.52">CAP</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.53">&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.54"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.55">the</span><span class="ltx_text ltx_lst_space 
ltx_font_typewriter" id="lstnumberx4.56"> </span><span class="ltx_text ltx_lst_identifier ltx_font_typewriter" id="lstnumberx4.57">word</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx4.58"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx4.59">\</span><span class="ltx_text ltx_lst_string ltx_font_typewriter" id="lstnumberx4.60">"walk\"<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.1"> </span>is<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.2"> </span>lit<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.3"> </span>on<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.4"> </span>a<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.5"> </span>street<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.6"> </span>sign<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.7"> </span>&lt;CAP&gt;<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.8"> </span>two<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.9"> </span>men<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.10"> </span>stand<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.11"> </span>on<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.12"> </span>an<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.13"> </span>urban<span class="ltx_text ltx_lst_space" id="lstnumberx4.60.14"> </span>sidewalk</span> </div> <div class="ltx_listingline" id="lstnumberx5"> <span class="ltx_tag ltx_tag_listingline">5</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.1">car</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.2"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.3">&lt;LAB&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.4"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.5">light</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.6"> </span><span class="ltx_text 
ltx_font_typewriter" id="lstnumberx5.7">&lt;LAB&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.8"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.9">bicycle</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.10"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.11">&lt;LAB&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.12"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.13">character</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.14"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.15">&lt;LAB&gt;</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx5.16"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx5.17">man</span> </div> <div class="ltx_listingline" id="lstnumberx6"> <span class="ltx_tag ltx_tag_listingline">6</span> </div> <div class="ltx_listingline" id="lstnumberx7"> <span class="ltx_tag ltx_tag_listingline">7</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx7.1">sentences:</span> </div> <div class="ltx_listingline" id="lstnumberx8"> <span class="ltx_tag ltx_tag_listingline">8</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx8.1">{input_text}</span> </div> <div class="ltx_listingline" id="lstnumberx9"> <span class="ltx_tag ltx_tag_listingline">9</span> </div> <div class="ltx_listingline" id="lstnumberx10"> <span class="ltx_tag ltx_tag_listingline">10</span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.1">Indicate</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.2"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.3">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.4"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.5">subject</span><span class="ltx_text ltx_lst_space 
ltx_font_typewriter" id="lstnumberx10.6"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.7">of</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.8"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.9">each</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.10"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.11">sentence</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.12"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.13">with</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.14"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.15">prototype</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.16"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.17">of</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.18"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.19">one</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.20"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.21">word</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.22"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.23">separately,</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.24"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.25">show</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.26"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.27">me</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.28"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.29">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.30"> </span><span 
class="ltx_text ltx_font_typewriter" id="lstnumberx10.31">labels</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.32"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.33">directly</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.34"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.35">without</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.36"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.37">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.38"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.39">sentences</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.40"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.41">with</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.42"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.43">each</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.44"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.45">label</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.46"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.47">separated</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.48"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.49">by</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.50"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.51">&lt;LAB&gt;,</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.52"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.53">and</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.54"> </span><span class="ltx_text ltx_font_typewriter" 
id="lstnumberx10.55">make</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.56"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.57">sure</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.58"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.59">that</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.60"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.61">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.62"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.63">marked</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.64"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.65">labels</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.66"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.67">correspond</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.68"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.69">one-to-one</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.70"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.71">with</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.72"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.73">the</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.74"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.75">original</span><span class="ltx_text ltx_lst_space ltx_font_typewriter" id="lstnumberx10.76"> </span><span class="ltx_text ltx_font_typewriter" id="lstnumberx10.77">sentence."</span> </div> </div> </figure> <figure class="ltx_table" id="A1.T7"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T7.1"> <tr class="ltx_tr" id="A1.T7.1.2"> 
<td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T7.1.2.1" style="padding-left:5.5pt;padding-right:5.5pt;">Image ID</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T7.1.2.2" style="padding-left:5.5pt;padding-right:5.5pt;">Image</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T7.1.2.3" style="padding-left:5.5pt;padding-right:5.5pt;">Category ID</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T7.1.2.4" style="padding-left:5.5pt;padding-right:5.5pt;">Category</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T7.1.2.5" style="padding-left:5.5pt;padding-right:5.5pt;">Object ID</td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T7.1.2.6" style="padding-left:5.5pt;padding-right:5.5pt;"> <span class="ltx_text" id="A1.T7.1.2.6.1"></span> <span class="ltx_text" id="A1.T7.1.2.6.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T7.1.2.6.2.1"> <span class="ltx_tr" id="A1.T7.1.2.6.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T7.1.2.6.2.1.1.1" style="padding-left:5.5pt;padding-right:5.5pt;">Captions</span></span> </span></span><span class="ltx_text" id="A1.T7.1.2.6.3"></span></td> </tr> <tr class="ltx_tr" id="A1.T7.1.1"> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T7.1.1.2" rowspan="4" style="padding-left:5.5pt;padding-right:5.5pt;"><span class="ltx_text" id="A1.T7.1.1.2.1">484</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T7.1.1.1" rowspan="4" style="padding-left:5.5pt;padding-right:5.5pt;"><span class="ltx_text" id="A1.T7.1.1.1.1"><img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_square" height="59" id="A1.T7.1.1.1.1.g1" src="extracted/5955593/image/484.jpg" width="59"/></span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T7.1.1.3" style="padding-left:5.5pt;padding-right:5.5pt;">2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T7.1.1.4" 
style="padding-left:5.5pt;padding-right:5.5pt;">fence</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T7.1.1.5" style="padding-left:5.5pt;padding-right:5.5pt;">4840002</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T7.1.1.6" style="padding-left:5.5pt;padding-right:5.5pt;">A fence made of metal by the road.</td> </tr> <tr class="ltx_tr" id="A1.T7.1.3"> <td class="ltx_td ltx_align_center" id="A1.T7.1.3.1" style="padding-left:5.5pt;padding-right:5.5pt;">31</td> <td class="ltx_td ltx_align_center" id="A1.T7.1.3.2" style="padding-left:5.5pt;padding-right:5.5pt;">bike</td> <td class="ltx_td ltx_align_center" id="A1.T7.1.3.3" style="padding-left:5.5pt;padding-right:5.5pt;">4840031</td> <td class="ltx_td ltx_align_left" id="A1.T7.1.3.4" style="padding-left:5.5pt;padding-right:5.5pt;">A blue bike is parked off to the side.</td> </tr> <tr class="ltx_tr" id="A1.T7.1.4"> <td class="ltx_td ltx_align_center" id="A1.T7.1.4.1" style="padding-left:5.5pt;padding-right:5.5pt;">36</td> <td class="ltx_td ltx_align_center" id="A1.T7.1.4.2" style="padding-left:5.5pt;padding-right:5.5pt;">woman</td> <td class="ltx_td ltx_align_center" id="A1.T7.1.4.3" style="padding-left:5.5pt;padding-right:5.5pt;">4840036</td> <td class="ltx_td ltx_align_left" id="A1.T7.1.4.4" style="padding-left:5.5pt;padding-right:5.5pt;">A woman is sitting down in a chair.</td> </tr> <tr class="ltx_tr" id="A1.T7.1.5"> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T7.1.5.1" style="padding-left:5.5pt;padding-right:5.5pt;">130</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T7.1.5.2" style="padding-left:5.5pt;padding-right:5.5pt;">grass</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T7.1.5.3" style="padding-left:5.5pt;padding-right:5.5pt;">4840130</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T7.1.5.4" style="padding-left:5.5pt;padding-right:5.5pt;">The grass is green on the side where the woman is sitting.</td> </tr> </table> <figcaption 
class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 7: </span>Key information for one image in the MOInst dataset. Image ID, Category ID, and Object ID are unique identifiers for an image, a category label, and a specific category within a particular image, respectively.</figcaption> </figure> </section> </section> <section class="ltx_subsection" id="A1.SSx4"> <h3 class="ltx_title ltx_title_subsection">Experimental Setup</h3> <div class="ltx_para" id="A1.SSx4.p1"> <p class="ltx_p" id="A1.SSx4.p1.1">We introduce a new task for which no prior experimental results exist, so we must evaluate all models ourselves, including the baselines and our proposed method. Because computational resources are limited, running repeated experiments to obtain error bars is prohibitively expensive. The results presented in this paper are therefore based on a single run.</p> </div> <section class="ltx_subsubsection" id="A1.SSx4.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Evaluation Metrics</h4> <div class="ltx_para" id="A1.SSx4.SSSx1.p1"> <p class="ltx_p" id="A1.SSx4.SSSx1.p1.7">To measure visual encoding ability accurately, it is essential to exclude the score inflation introduced by the instructional text. One way to quantify the influence of the instruction on the assessment is to compute the probability that the task output is consistent with an erroneous instruction. 
Specifically, for task <math alttext="\tau(\cdot)" class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.1.m1.1"><semantics id="A1.SSx4.SSSx1.p1.1.m1.1a"><mrow id="A1.SSx4.SSSx1.p1.1.m1.1.2" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.cmml"><mi id="A1.SSx4.SSSx1.p1.1.m1.1.2.2" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.2.cmml">τ</mi><mo id="A1.SSx4.SSSx1.p1.1.m1.1.2.1" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.1.cmml">⁢</mo><mrow id="A1.SSx4.SSSx1.p1.1.m1.1.2.3.2" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.cmml"><mo id="A1.SSx4.SSSx1.p1.1.m1.1.2.3.2.1" stretchy="false" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.cmml">(</mo><mo id="A1.SSx4.SSSx1.p1.1.m1.1.1" lspace="0em" rspace="0em" xref="A1.SSx4.SSSx1.p1.1.m1.1.1.cmml">⋅</mo><mo id="A1.SSx4.SSSx1.p1.1.m1.1.2.3.2.2" stretchy="false" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.1.m1.1b"><apply id="A1.SSx4.SSSx1.p1.1.m1.1.2.cmml" xref="A1.SSx4.SSSx1.p1.1.m1.1.2"><times id="A1.SSx4.SSSx1.p1.1.m1.1.2.1.cmml" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.1"></times><ci id="A1.SSx4.SSSx1.p1.1.m1.1.2.2.cmml" xref="A1.SSx4.SSSx1.p1.1.m1.1.2.2">𝜏</ci><ci id="A1.SSx4.SSSx1.p1.1.m1.1.1.cmml" xref="A1.SSx4.SSSx1.p1.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.1.m1.1c">\tau(\cdot)</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.1.m1.1d">italic_τ ( ⋅ )</annotation></semantics></math> and model <math alttext="\Phi(\cdot,\cdot)" class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.2.m2.2"><semantics id="A1.SSx4.SSSx1.p1.2.m2.2a"><mrow id="A1.SSx4.SSSx1.p1.2.m2.2.3" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.cmml"><mi id="A1.SSx4.SSSx1.p1.2.m2.2.3.2" mathvariant="normal" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.2.cmml">Φ</mi><mo id="A1.SSx4.SSSx1.p1.2.m2.2.3.1" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.1.cmml">⁢</mo><mrow id="A1.SSx4.SSSx1.p1.2.m2.2.3.3.2" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.3.1.cmml"><mo id="A1.SSx4.SSSx1.p1.2.m2.2.3.3.2.1" stretchy="false" 
xref="A1.SSx4.SSSx1.p1.2.m2.2.3.3.1.cmml">(</mo><mo id="A1.SSx4.SSSx1.p1.2.m2.1.1" lspace="0em" rspace="0em" xref="A1.SSx4.SSSx1.p1.2.m2.1.1.cmml">⋅</mo><mo id="A1.SSx4.SSSx1.p1.2.m2.2.3.3.2.2" rspace="0em" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.3.1.cmml">,</mo><mo id="A1.SSx4.SSSx1.p1.2.m2.2.2" lspace="0em" rspace="0em" xref="A1.SSx4.SSSx1.p1.2.m2.2.2.cmml">⋅</mo><mo id="A1.SSx4.SSSx1.p1.2.m2.2.3.3.2.3" stretchy="false" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.2.m2.2b"><apply id="A1.SSx4.SSSx1.p1.2.m2.2.3.cmml" xref="A1.SSx4.SSSx1.p1.2.m2.2.3"><times id="A1.SSx4.SSSx1.p1.2.m2.2.3.1.cmml" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.1"></times><ci id="A1.SSx4.SSSx1.p1.2.m2.2.3.2.cmml" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.2">Φ</ci><interval closure="open" id="A1.SSx4.SSSx1.p1.2.m2.2.3.3.1.cmml" xref="A1.SSx4.SSSx1.p1.2.m2.2.3.3.2"><ci id="A1.SSx4.SSSx1.p1.2.m2.1.1.cmml" xref="A1.SSx4.SSSx1.p1.2.m2.1.1">⋅</ci><ci id="A1.SSx4.SSSx1.p1.2.m2.2.2.cmml" xref="A1.SSx4.SSSx1.p1.2.m2.2.2">⋅</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.2.m2.2c">\Phi(\cdot,\cdot)</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.2.m2.2d">roman_Φ ( ⋅ , ⋅ )</annotation></semantics></math>, a set of <math alttext="p" class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.3.m3.1"><semantics id="A1.SSx4.SSSx1.p1.3.m3.1a"><mi id="A1.SSx4.SSSx1.p1.3.m3.1.1" xref="A1.SSx4.SSSx1.p1.3.m3.1.1.cmml">p</mi><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.3.m3.1b"><ci id="A1.SSx4.SSSx1.p1.3.m3.1.1.cmml" xref="A1.SSx4.SSSx1.p1.3.m3.1.1">𝑝</ci></annotation-xml><annotation encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.3.m3.1c">p</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.3.m3.1d">italic_p</annotation></semantics></math> evaluation samples is given as <math alttext="\{x^{I}_{k},x^{IT}_{k},t_{k}\}^{p}_{k=1}" 
class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.4.m4.3"><semantics id="A1.SSx4.SSSx1.p1.4.m4.3a"><msubsup id="A1.SSx4.SSSx1.p1.4.m4.3.3" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.cmml"><mrow id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.4.cmml"><mo id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.4" stretchy="false" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.4.cmml">{</mo><msubsup id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.cmml"><mi id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.2" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.2.cmml">x</mi><mi id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.3" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.3.cmml">k</mi><mi id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.3" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.5" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.4.cmml">,</mo><msubsup id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.cmml"><mi id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.2" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.2.cmml">x</mi><mi id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.3" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.3.cmml">k</mi><mrow id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.cmml"><mi id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.2" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.2.cmml">I</mi><mo id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.1" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.1.cmml">⁢</mo><mi id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.3" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.3.cmml">T</mi></mrow></msubsup><mo id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.6" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.4.cmml">,</mo><msub id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.cmml"><mi id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.2" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.2.cmml">t</mi><mi id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.3" 
xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.3.cmml">k</mi></msub><mo id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.7" stretchy="false" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.4.cmml">}</mo></mrow><mrow id="A1.SSx4.SSSx1.p1.4.m4.3.3.5" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.cmml"><mi id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.2" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.2.cmml">k</mi><mo id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.1" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.1.cmml">=</mo><mn id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.3" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.3.cmml">1</mn></mrow><mi id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.5" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.5.cmml">p</mi></msubsup><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.4.m4.3b"><apply id="A1.SSx4.SSSx1.p1.4.m4.3.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.3.3.4.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3">subscript</csymbol><apply id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.4.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3">superscript</csymbol><set id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.4.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3"><apply id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1">subscript</csymbol><apply id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1">superscript</csymbol><ci id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.2">𝑥</ci><ci id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.cmml" 
xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2">subscript</csymbol><apply id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2">superscript</csymbol><ci id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.2">𝑥</ci><apply id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3"><times id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.1"></times><ci id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.2">𝐼</ci><ci id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.2.3.3">𝑇</ci></apply></apply><ci id="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.2.2.2.2.2.2.3">𝑘</ci></apply><apply id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3">subscript</csymbol><ci id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.2">𝑡</ci><ci id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.3.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.3.3.3.3">𝑘</ci></apply></set><ci id="A1.SSx4.SSSx1.p1.4.m4.3.3.3.5.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.3.5">𝑝</ci></apply><apply id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5"><eq id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.1.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.1"></eq><ci id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.2.cmml" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.2">𝑘</ci><cn id="A1.SSx4.SSSx1.p1.4.m4.3.3.5.3.cmml" type="integer" xref="A1.SSx4.SSSx1.p1.4.m4.3.3.5.3">1</cn></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.4.m4.3c">\{x^{I}_{k},x^{IT}_{k},t_{k}\}^{p}_{k=1}</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.4.m4.3d">{ italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT italic_I italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT</annotation></semantics></math>, where <math alttext="x^{IT}" class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.5.m5.1"><semantics id="A1.SSx4.SSSx1.p1.5.m5.1a"><msup id="A1.SSx4.SSSx1.p1.5.m5.1.1" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.cmml"><mi id="A1.SSx4.SSSx1.p1.5.m5.1.1.2" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.2.cmml">x</mi><mrow id="A1.SSx4.SSSx1.p1.5.m5.1.1.3" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.cmml"><mi id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.2" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.2.cmml">I</mi><mo id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.1" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.1.cmml">⁢</mo><mi id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.3" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.3.cmml">T</mi></mrow></msup><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.5.m5.1b"><apply id="A1.SSx4.SSSx1.p1.5.m5.1.1.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.5.m5.1.1.1.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1">superscript</csymbol><ci id="A1.SSx4.SSSx1.p1.5.m5.1.1.2.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.2">𝑥</ci><apply id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3"><times id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.1.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.1"></times><ci id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.2.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.2">𝐼</ci><ci id="A1.SSx4.SSSx1.p1.5.m5.1.1.3.3.cmml" xref="A1.SSx4.SSSx1.p1.5.m5.1.1.3.3">𝑇</ci></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.5.m5.1c">x^{IT}</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.5.m5.1d">italic_x start_POSTSUPERSCRIPT italic_I italic_T end_POSTSUPERSCRIPT</annotation></semantics></math> represents the instruction text corresponding to the image <math alttext="x^{I}" class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.6.m6.1"><semantics id="A1.SSx4.SSSx1.p1.6.m6.1a"><msup id="A1.SSx4.SSSx1.p1.6.m6.1.1" xref="A1.SSx4.SSSx1.p1.6.m6.1.1.cmml"><mi id="A1.SSx4.SSSx1.p1.6.m6.1.1.2" xref="A1.SSx4.SSSx1.p1.6.m6.1.1.2.cmml">x</mi><mi id="A1.SSx4.SSSx1.p1.6.m6.1.1.3" xref="A1.SSx4.SSSx1.p1.6.m6.1.1.3.cmml">I</mi></msup><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.6.m6.1b"><apply id="A1.SSx4.SSSx1.p1.6.m6.1.1.cmml" xref="A1.SSx4.SSSx1.p1.6.m6.1.1"><csymbol cd="ambiguous" id="A1.SSx4.SSSx1.p1.6.m6.1.1.1.cmml" xref="A1.SSx4.SSSx1.p1.6.m6.1.1">superscript</csymbol><ci id="A1.SSx4.SSSx1.p1.6.m6.1.1.2.cmml" xref="A1.SSx4.SSSx1.p1.6.m6.1.1.2">𝑥</ci><ci id="A1.SSx4.SSSx1.p1.6.m6.1.1.3.cmml" xref="A1.SSx4.SSSx1.p1.6.m6.1.1.3">𝐼</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.6.m6.1c">x^{I}</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.6.m6.1d">italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT</annotation></semantics></math>, and <math alttext="t" class="ltx_Math" display="inline" id="A1.SSx4.SSSx1.p1.7.m7.1"><semantics id="A1.SSx4.SSSx1.p1.7.m7.1a"><mi id="A1.SSx4.SSSx1.p1.7.m7.1.1" xref="A1.SSx4.SSSx1.p1.7.m7.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="A1.SSx4.SSSx1.p1.7.m7.1b"><ci id="A1.SSx4.SSSx1.p1.7.m7.1.1.cmml" xref="A1.SSx4.SSSx1.p1.7.m7.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="A1.SSx4.SSSx1.p1.7.m7.1c">t</annotation><annotation encoding="application/x-llamapun" id="A1.SSx4.SSSx1.p1.7.m7.1d">italic_t</annotation></semantics></math> is the ground truth. 
The evaluation score of the model can be formalized as</p> <table class="ltx_equation ltx_eqn_table" id="A1.E15"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{S}=\mathbb{E}[\mathbb{I}(\tau(\Phi(x^{I}_{k},x^{IT}_{k}))=t_{k})]." class="ltx_Math" display="block" id="A1.E15.m1.1"><semantics id="A1.E15.m1.1a"><mrow id="A1.E15.m1.1.1.1" xref="A1.E15.m1.1.1.1.1.cmml"><mrow id="A1.E15.m1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.E15.m1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.3.cmml">𝒮</mi><mo id="A1.E15.m1.1.1.1.1.2" xref="A1.E15.m1.1.1.1.1.2.cmml">=</mo><mrow id="A1.E15.m1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.cmml"><mi id="A1.E15.m1.1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.1.3.cmml">𝔼</mi><mo id="A1.E15.m1.1.1.1.1.1.2" xref="A1.E15.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.2.cmml"><mo id="A1.E15.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.2.1.cmml">[</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.3.cmml">𝕀</mi><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">τ</mi><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1" 
xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.4" mathvariant="normal" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.4.cmml">Φ</mi><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">⁢</mo><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml"><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.3" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml">(</mo><msubsup id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.cmml">x</mi><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">k</mi><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.4" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml">,</mo><msubsup id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml">x</mi><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.3.cmml">k</mi><mrow id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.2" 
xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.2.cmml">I</mi><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.1" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.1.cmml">⁢</mo><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.3.cmml">T</mi></mrow></msubsup><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.5" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml">)</mo></mrow></mrow><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">=</mo><msub id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.2" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.2.cmml">t</mi><mi id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.3" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.3.cmml">k</mi></msub></mrow><mo id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="A1.E15.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="A1.E15.m1.1.1.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><mo id="A1.E15.m1.1.1.1.2" lspace="0em" xref="A1.E15.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="A1.E15.m1.1b"><apply id="A1.E15.m1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1"><eq id="A1.E15.m1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.2"></eq><ci id="A1.E15.m1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.3">𝒮</ci><apply id="A1.E15.m1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1"><times id="A1.E15.m1.1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.1.2"></times><ci id="A1.E15.m1.1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.1.3">𝔼</ci><apply id="A1.E15.m1.1.1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="A1.E15.m1.1.1.1.1.1.1.2.1.cmml" 
xref="A1.E15.m1.1.1.1.1.1.1.1.2">delimited-[]</csymbol><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1"><times id="A1.E15.m1.1.1.1.1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.2"></times><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.3">𝕀</ci><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1"><eq id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.2"></eq><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1"><times id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.2"></times><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.3">𝜏</ci><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1"><times id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3"></times><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.4.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.4">Φ</ci><interval closure="open" id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2"><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1">superscript</csymbol><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2">𝑥</ci><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml" 
xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2">subscript</csymbol><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2">superscript</csymbol><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2">𝑥</ci><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3"><times id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.1"></times><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.2">𝐼</ci><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.3">𝑇</ci></apply></apply><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.2.2.3">𝑘</ci></apply></interval></apply></apply><apply id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.2">𝑡</ci><ci id="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.3.cmml" 
xref="A1.E15.m1.1.1.1.1.1.1.1.1.1.1.1.3.3">𝑘</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.E15.m1.1c">\mathcal{S}=\mathbb{E}[\mathbb{I}(\tau(\Phi(x^{I}_{k},x^{IT}_{k}))=t_{k})].</annotation><annotation encoding="application/x-llamapun" id="A1.E15.m1.1d">caligraphic_S = blackboard_E [ blackboard_I ( italic_τ ( roman_Φ ( italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT italic_I italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) = italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ] .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(15)</span></td> </tr></tbody> </table> <p class="ltx_p" id="A1.SSx4.SSSx1.p1.8">Then the interference of the instruction text on the score can be expressed as</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="A1.EGx4"> <tbody id="A1.E16"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathcal{N}=" class="ltx_Math" display="inline" id="A1.E16.m1.1"><semantics id="A1.E16.m1.1a"><mrow id="A1.E16.m1.1.1" xref="A1.E16.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.E16.m1.1.1.2" xref="A1.E16.m1.1.1.2.cmml">𝒩</mi><mo id="A1.E16.m1.1.1.1" xref="A1.E16.m1.1.1.1.cmml">=</mo><mi id="A1.E16.m1.1.1.3" xref="A1.E16.m1.1.1.3.cmml"></mi></mrow><annotation-xml encoding="MathML-Content" id="A1.E16.m1.1b"><apply id="A1.E16.m1.1.1.cmml" xref="A1.E16.m1.1.1"><eq id="A1.E16.m1.1.1.1.cmml" xref="A1.E16.m1.1.1.1"></eq><ci id="A1.E16.m1.1.1.2.cmml" xref="A1.E16.m1.1.1.2">𝒩</ci><csymbol cd="latexml" id="A1.E16.m1.1.1.3.cmml" 
xref="A1.E16.m1.1.1.3">absent</csymbol></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.E16.m1.1c">\displaystyle\mathcal{N}=</annotation><annotation encoding="application/x-llamapun" id="A1.E16.m1.1d">caligraphic_N =</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\mathcal{S}(x^{I}_{k},x^{IT}_{j},t_{j}\mid j\neq k)" class="ltx_Math" display="inline" id="A1.E16.m2.1"><semantics id="A1.E16.m2.1a"><mrow id="A1.E16.m2.1.1" xref="A1.E16.m2.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.E16.m2.1.1.3" xref="A1.E16.m2.1.1.3.cmml">𝒮</mi><mo id="A1.E16.m2.1.1.2" xref="A1.E16.m2.1.1.2.cmml">⁢</mo><mrow id="A1.E16.m2.1.1.1.1" xref="A1.E16.m2.1.1.1.1.1.cmml"><mo id="A1.E16.m2.1.1.1.1.2" stretchy="false" xref="A1.E16.m2.1.1.1.1.1.cmml">(</mo><mrow id="A1.E16.m2.1.1.1.1.1" xref="A1.E16.m2.1.1.1.1.1.cmml"><mrow id="A1.E16.m2.1.1.1.1.1.3.3" xref="A1.E16.m2.1.1.1.1.1.3.4.cmml"><msubsup id="A1.E16.m2.1.1.1.1.1.1.1.1" xref="A1.E16.m2.1.1.1.1.1.1.1.1.cmml"><mi id="A1.E16.m2.1.1.1.1.1.1.1.1.2.2" xref="A1.E16.m2.1.1.1.1.1.1.1.1.2.2.cmml">x</mi><mi id="A1.E16.m2.1.1.1.1.1.1.1.1.3" xref="A1.E16.m2.1.1.1.1.1.1.1.1.3.cmml">k</mi><mi id="A1.E16.m2.1.1.1.1.1.1.1.1.2.3" xref="A1.E16.m2.1.1.1.1.1.1.1.1.2.3.cmml">I</mi></msubsup><mo id="A1.E16.m2.1.1.1.1.1.3.3.4" xref="A1.E16.m2.1.1.1.1.1.3.4.cmml">,</mo><msubsup id="A1.E16.m2.1.1.1.1.1.2.2.2" xref="A1.E16.m2.1.1.1.1.1.2.2.2.cmml"><mi id="A1.E16.m2.1.1.1.1.1.2.2.2.2.2" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.2.cmml">x</mi><mi id="A1.E16.m2.1.1.1.1.1.2.2.2.3" xref="A1.E16.m2.1.1.1.1.1.2.2.2.3.cmml">j</mi><mrow id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.cmml"><mi id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.2" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.2.cmml">I</mi><mo id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.1" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.1.cmml">⁢</mo><mi id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.3" 
xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.3.cmml">T</mi></mrow></msubsup><mo id="A1.E16.m2.1.1.1.1.1.3.3.5" xref="A1.E16.m2.1.1.1.1.1.3.4.cmml">,</mo><mrow id="A1.E16.m2.1.1.1.1.1.3.3.3" xref="A1.E16.m2.1.1.1.1.1.3.3.3.cmml"><msub id="A1.E16.m2.1.1.1.1.1.3.3.3.2" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2.cmml"><mi id="A1.E16.m2.1.1.1.1.1.3.3.3.2.2" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2.2.cmml">t</mi><mi id="A1.E16.m2.1.1.1.1.1.3.3.3.2.3" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2.3.cmml">j</mi></msub><mo id="A1.E16.m2.1.1.1.1.1.3.3.3.1" xref="A1.E16.m2.1.1.1.1.1.3.3.3.1.cmml">∣</mo><mi id="A1.E16.m2.1.1.1.1.1.3.3.3.3" xref="A1.E16.m2.1.1.1.1.1.3.3.3.3.cmml">j</mi></mrow></mrow><mo id="A1.E16.m2.1.1.1.1.1.4" xref="A1.E16.m2.1.1.1.1.1.4.cmml">≠</mo><mi id="A1.E16.m2.1.1.1.1.1.5" xref="A1.E16.m2.1.1.1.1.1.5.cmml">k</mi></mrow><mo id="A1.E16.m2.1.1.1.1.3" stretchy="false" xref="A1.E16.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A1.E16.m2.1b"><apply id="A1.E16.m2.1.1.cmml" xref="A1.E16.m2.1.1"><times id="A1.E16.m2.1.1.2.cmml" xref="A1.E16.m2.1.1.2"></times><ci id="A1.E16.m2.1.1.3.cmml" xref="A1.E16.m2.1.1.3">𝒮</ci><apply id="A1.E16.m2.1.1.1.1.1.cmml" xref="A1.E16.m2.1.1.1.1"><neq id="A1.E16.m2.1.1.1.1.1.4.cmml" xref="A1.E16.m2.1.1.1.1.1.4"></neq><list id="A1.E16.m2.1.1.1.1.1.3.4.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3"><apply id="A1.E16.m2.1.1.1.1.1.1.1.1.cmml" xref="A1.E16.m2.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A1.E16.m2.1.1.1.1.1.1.1.1.1.cmml" xref="A1.E16.m2.1.1.1.1.1.1.1.1">subscript</csymbol><apply id="A1.E16.m2.1.1.1.1.1.1.1.1.2.cmml" xref="A1.E16.m2.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A1.E16.m2.1.1.1.1.1.1.1.1.2.1.cmml" xref="A1.E16.m2.1.1.1.1.1.1.1.1">superscript</csymbol><ci id="A1.E16.m2.1.1.1.1.1.1.1.1.2.2.cmml" xref="A1.E16.m2.1.1.1.1.1.1.1.1.2.2">𝑥</ci><ci id="A1.E16.m2.1.1.1.1.1.1.1.1.2.3.cmml" xref="A1.E16.m2.1.1.1.1.1.1.1.1.2.3">𝐼</ci></apply><ci id="A1.E16.m2.1.1.1.1.1.1.1.1.3.cmml" 
xref="A1.E16.m2.1.1.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="A1.E16.m2.1.1.1.1.1.2.2.2.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="A1.E16.m2.1.1.1.1.1.2.2.2.1.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2">subscript</csymbol><apply id="A1.E16.m2.1.1.1.1.1.2.2.2.2.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="A1.E16.m2.1.1.1.1.1.2.2.2.2.1.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2">superscript</csymbol><ci id="A1.E16.m2.1.1.1.1.1.2.2.2.2.2.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.2">𝑥</ci><apply id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3"><times id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.1.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.1"></times><ci id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.2.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.2">𝐼</ci><ci id="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.3.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2.2.3.3">𝑇</ci></apply></apply><ci id="A1.E16.m2.1.1.1.1.1.2.2.2.3.cmml" xref="A1.E16.m2.1.1.1.1.1.2.2.2.3">𝑗</ci></apply><apply id="A1.E16.m2.1.1.1.1.1.3.3.3.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3"><csymbol cd="latexml" id="A1.E16.m2.1.1.1.1.1.3.3.3.1.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3.1">conditional</csymbol><apply id="A1.E16.m2.1.1.1.1.1.3.3.3.2.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2"><csymbol cd="ambiguous" id="A1.E16.m2.1.1.1.1.1.3.3.3.2.1.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2">subscript</csymbol><ci id="A1.E16.m2.1.1.1.1.1.3.3.3.2.2.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2.2">𝑡</ci><ci id="A1.E16.m2.1.1.1.1.1.3.3.3.2.3.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3.2.3">𝑗</ci></apply><ci id="A1.E16.m2.1.1.1.1.1.3.3.3.3.cmml" xref="A1.E16.m2.1.1.1.1.1.3.3.3.3">𝑗</ci></apply></list><ci id="A1.E16.m2.1.1.1.1.1.5.cmml" xref="A1.E16.m2.1.1.1.1.1.5">𝑘</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.E16.m2.1c">\displaystyle\mathcal{S}(x^{I}_{k},x^{IT}_{j},t_{j}\mid j\neq k)</annotation><annotation encoding="application/x-llamapun" id="A1.E16.m2.1d">caligraphic_S ( italic_x 
start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT italic_I italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ italic_j ≠ italic_k )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(16)</span></td> </tr></tbody> <tbody id="A1.E17"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle-\mathcal{S}(x^{I}_{k},x^{IT}_{j},t_{i}\mid j\neq k,i\in\{1,2,\ldots,p\})," class="ltx_math_unparsed" display="inline" id="A1.E17.m1.4"><semantics id="A1.E17.m1.4a"><mrow id="A1.E17.m1.4b"><mo id="A1.E17.m1.4.5">−</mo><mi class="ltx_font_mathcaligraphic" id="A1.E17.m1.4.6">𝒮</mi><mrow id="A1.E17.m1.4.7"><mo id="A1.E17.m1.4.7.1" stretchy="false">(</mo><msubsup id="A1.E17.m1.4.7.2"><mi id="A1.E17.m1.4.7.2.2.2">x</mi><mi id="A1.E17.m1.4.7.2.3">k</mi><mi id="A1.E17.m1.4.7.2.2.3">I</mi></msubsup><mo id="A1.E17.m1.4.7.3">,</mo><msubsup id="A1.E17.m1.4.7.4"><mi id="A1.E17.m1.4.7.4.2.2">x</mi><mi id="A1.E17.m1.4.7.4.3">j</mi><mrow id="A1.E17.m1.4.7.4.2.3"><mi id="A1.E17.m1.4.7.4.2.3.2">I</mi><mo id="A1.E17.m1.4.7.4.2.3.1">⁢</mo><mi id="A1.E17.m1.4.7.4.2.3.3">T</mi></mrow></msubsup><mo id="A1.E17.m1.4.7.5">,</mo><msub id="A1.E17.m1.4.7.6"><mi id="A1.E17.m1.4.7.6.2">t</mi><mi id="A1.E17.m1.4.7.6.3">i</mi></msub><mo id="A1.E17.m1.4.7.7" lspace="0em" rspace="0.167em">∣</mo><mi id="A1.E17.m1.4.7.8">j</mi><mo id="A1.E17.m1.4.7.9">≠</mo><mi id="A1.E17.m1.4.7.10">k</mi><mo id="A1.E17.m1.4.7.11">,</mo><mi id="A1.E17.m1.4.7.12">i</mi><mo id="A1.E17.m1.4.7.13">∈</mo><mrow 
id="A1.E17.m1.4.7.14"><mo id="A1.E17.m1.4.7.14.1" stretchy="false">{</mo><mn id="A1.E17.m1.1.1">1</mn><mo id="A1.E17.m1.4.7.14.2">,</mo><mn id="A1.E17.m1.2.2">2</mn><mo id="A1.E17.m1.4.7.14.3">,</mo><mi id="A1.E17.m1.3.3" mathvariant="normal">…</mi><mo id="A1.E17.m1.4.7.14.4">,</mo><mi id="A1.E17.m1.4.4">p</mi><mo id="A1.E17.m1.4.7.14.5" stretchy="false">}</mo></mrow><mo id="A1.E17.m1.4.7.15" stretchy="false">)</mo></mrow><mo id="A1.E17.m1.4.8">,</mo></mrow><annotation encoding="application/x-tex" id="A1.E17.m1.4c">\displaystyle-\mathcal{S}(x^{I}_{k},x^{IT}_{j},t_{i}\mid j\neq k,i\in\{1,2,\ldots,p\}),</annotation><annotation encoding="application/x-llamapun" id="A1.E17.m1.4d">- caligraphic_S ( italic_x start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT italic_I italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ italic_j ≠ italic_k , italic_i ∈ { 1 , 2 , … , italic_p } ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(17)</span></td> </tr></tbody> </table> <p class="ltx_p" id="A1.SSx4.SSSx1.p1.9">which is the likelihood of being misdirected to a specific wrong answer by a mismatched instruction, minus the likelihood of matching a randomly chosen ground truth. Thus, the score attributable to the visual features is</p> <table class="ltx_equation ltx_eqn_table" id="A1.E18"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{S}^{V}=\mathcal{S}-\mathcal{N}." 
class="ltx_Math" display="block" id="A1.E18.m1.1"><semantics id="A1.E18.m1.1a"><mrow id="A1.E18.m1.1.1.1" xref="A1.E18.m1.1.1.1.1.cmml"><mrow id="A1.E18.m1.1.1.1.1" xref="A1.E18.m1.1.1.1.1.cmml"><msup id="A1.E18.m1.1.1.1.1.2" xref="A1.E18.m1.1.1.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.E18.m1.1.1.1.1.2.2" xref="A1.E18.m1.1.1.1.1.2.2.cmml">𝒮</mi><mi id="A1.E18.m1.1.1.1.1.2.3" xref="A1.E18.m1.1.1.1.1.2.3.cmml">V</mi></msup><mo id="A1.E18.m1.1.1.1.1.1" xref="A1.E18.m1.1.1.1.1.1.cmml">=</mo><mrow id="A1.E18.m1.1.1.1.1.3" xref="A1.E18.m1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.E18.m1.1.1.1.1.3.2" xref="A1.E18.m1.1.1.1.1.3.2.cmml">𝒮</mi><mo id="A1.E18.m1.1.1.1.1.3.1" xref="A1.E18.m1.1.1.1.1.3.1.cmml">−</mo><mi class="ltx_font_mathcaligraphic" id="A1.E18.m1.1.1.1.1.3.3" xref="A1.E18.m1.1.1.1.1.3.3.cmml">𝒩</mi></mrow></mrow><mo id="A1.E18.m1.1.1.1.2" lspace="0em" xref="A1.E18.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="A1.E18.m1.1b"><apply id="A1.E18.m1.1.1.1.1.cmml" xref="A1.E18.m1.1.1.1"><eq id="A1.E18.m1.1.1.1.1.1.cmml" xref="A1.E18.m1.1.1.1.1.1"></eq><apply id="A1.E18.m1.1.1.1.1.2.cmml" xref="A1.E18.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="A1.E18.m1.1.1.1.1.2.1.cmml" xref="A1.E18.m1.1.1.1.1.2">superscript</csymbol><ci id="A1.E18.m1.1.1.1.1.2.2.cmml" xref="A1.E18.m1.1.1.1.1.2.2">𝒮</ci><ci id="A1.E18.m1.1.1.1.1.2.3.cmml" xref="A1.E18.m1.1.1.1.1.2.3">𝑉</ci></apply><apply id="A1.E18.m1.1.1.1.1.3.cmml" xref="A1.E18.m1.1.1.1.1.3"><minus id="A1.E18.m1.1.1.1.1.3.1.cmml" xref="A1.E18.m1.1.1.1.1.3.1"></minus><ci id="A1.E18.m1.1.1.1.1.3.2.cmml" xref="A1.E18.m1.1.1.1.1.3.2">𝒮</ci><ci id="A1.E18.m1.1.1.1.1.3.3.cmml" xref="A1.E18.m1.1.1.1.1.3.3">𝒩</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.E18.m1.1c">\mathcal{S}^{V}=\mathcal{S}-\mathcal{N}.</annotation><annotation encoding="application/x-llamapun" id="A1.E18.m1.1d">caligraphic_S start_POSTSUPERSCRIPT italic_V 
end_POSTSUPERSCRIPT = caligraphic_S - caligraphic_N .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(18)</span></td> </tr></tbody> </table> </div> </section> <section class="ltx_subsubsection" id="A1.SSx4.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Hyper-parameter Settings</h4> <div class="ltx_para" id="A1.SSx4.SSSx2.p1"> <p class="ltx_p" id="A1.SSx4.SSSx2.p1.1">We report the hyper-parameter settings used during GiVE training in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T8" title="Table 8 ‣ Hyper-parameter Settings ‣ Experimental Setup ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">8</span></a>. All other settings align with those of the corresponding visual encoder.</p> </div> <figure class="ltx_table" id="A1.T8"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T8.1"> <tr class="ltx_tr" id="A1.T8.1.1"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T8.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">Item</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_tt" id="A1.T8.1.1.2" style="padding-left:2.8pt;padding-right:2.8pt;">Value</td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T8.1.1.3" style="padding-left:2.8pt;padding-right:2.8pt;">Item</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T8.1.1.4" style="padding-left:2.8pt;padding-right:2.8pt;">Value</td> </tr> <tr class="ltx_tr" id="A1.T8.1.2"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T8.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">Optimizer</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A1.T8.1.2.2" style="padding-left:2.8pt;padding-right:2.8pt;">AdamW</td> <td class="ltx_td ltx_align_left ltx_border_t" 
id="A1.T8.1.2.3" style="padding-left:2.8pt;padding-right:2.8pt;">Batch size</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T8.1.2.4" style="padding-left:2.8pt;padding-right:2.8pt;">640</td> </tr> <tr class="ltx_tr" id="A1.T8.1.3"> <td class="ltx_td ltx_align_left" id="A1.T8.1.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">Learning rate</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A1.T8.1.3.2" style="padding-left:2.8pt;padding-right:2.8pt;">5e-5</td> <td class="ltx_td ltx_align_left" id="A1.T8.1.3.3" style="padding-left:2.8pt;padding-right:2.8pt;">Weight decay</td> <td class="ltx_td ltx_align_center" id="A1.T8.1.3.4" style="padding-left:2.8pt;padding-right:2.8pt;">3e-3</td> </tr> <tr class="ltx_tr" id="A1.T8.1.4"> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T8.1.4.1" style="padding-left:2.8pt;padding-right:2.8pt;">Epoch</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A1.T8.1.4.2" style="padding-left:2.8pt;padding-right:2.8pt;">5</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T8.1.4.3" style="padding-left:2.8pt;padding-right:2.8pt;">Warmup ratio</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T8.1.4.4" style="padding-left:2.8pt;padding-right:2.8pt;">3e-2</td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 8: </span>Hyper-parameters for training the model.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="A1.SSx4.SSSx3"> <h4 class="ltx_title ltx_title_subsubsection">Compute Resources</h4> <div class="ltx_para" id="A1.SSx4.SSSx3.p1"> <p class="ltx_p" id="A1.SSx4.SSSx3.p1.1">We conduct our experiments on an NVIDIA A800-SXM4-80GB GPU and an Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz. 
The machine has 1.0 terabyte (TB) of memory and runs Ubuntu 22.04.3 LTS.</p> </div> </section> </section> <section class="ltx_subsection" id="A1.SSx5"> <h3 class="ltx_title ltx_title_subsection">Parameter Sensitivity</h3> <section class="ltx_subsubsection" id="A1.SSx5.SSSx1"> <h4 class="ltx_title ltx_title_subsubsection">Configurations of CLIP with GiVE</h4> <div class="ltx_para" id="A1.SSx5.SSSx1.p1"> <p class="ltx_p" id="A1.SSx5.SSSx1.p1.1">We conduct experiments with different configurations of CLIP, and the results in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T9" title="Table 9 ‣ Configurations of CLIP with GiVE ‣ Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">9</span></a> demonstrate that models integrated with the AG-Adapter can still be enhanced using typical methods. Specifically, smaller patch sizes can capture more fine-grained information, and an increased number of parameters can provide greater learning capacity. 
These enhancements improve CLIP’s performance, and AG-CLIP, which combines CLIP with GiVE, benefits from them as well.</p> </div> <figure class="ltx_table" id="A1.T9"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T9.5"> <tr class="ltx_tr" id="A1.T9.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T9.2.2.3" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.2.2.3.1">Type</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T9.2.2.4" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.2.2.4.1">PS</span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T9.2.2.5" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.2.2.5.1"># TP</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T9.2.2.6" style="padding-left:2.8pt;padding-right:2.8pt;">MOInst</td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T9.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T9.1.1.1.1"><span class="ltx_text ltx_font_italic" id="A1.T9.1.1.1.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T9.2.2.2" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T9.2.2.2.1"><span class="ltx_text ltx_font_italic" id="A1.T9.2.2.2.1.1">a</span></sup> </td> </tr> <tr class="ltx_tr" id="A1.T9.5.6"> <td class="ltx_td ltx_align_left" id="A1.T9.5.6.1" style="padding-left:2.8pt;padding-right:2.8pt;">I2T</td> <td class="ltx_td ltx_align_left" id="A1.T9.5.6.2" style="padding-left:2.8pt;padding-right:2.8pt;">T2I</td> <td class="ltx_td ltx_align_center" id="A1.T9.5.6.3" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.5.6.3.1">F1</span></td> <td class="ltx_td ltx_align_center" id="A1.T9.5.6.4" rowspan="2" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.5.6.4.1">AUC</span></td> <td class="ltx_td ltx_align_center" id="A1.T9.5.6.5" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.5.6.5.1">F1</span></td> <td class="ltx_td ltx_align_center" id="A1.T9.5.6.6" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T9.5.6.6.1">AUC</span></td> </tr> <tr class="ltx_tr" id="A1.T9.5.7"> <td class="ltx_td ltx_align_left" id="A1.T9.5.7.1" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> <td class="ltx_td ltx_align_left" id="A1.T9.5.7.2" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> </tr> <tr class="ltx_tr" id="A1.T9.3.3"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T9.3.3.2" style="padding-left:2.8pt;padding-right:2.8pt;">B/32</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T9.3.3.1" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="32^{2}" class="ltx_Math" display="inline" id="A1.T9.3.3.1.m1.1"><semantics id="A1.T9.3.3.1.m1.1a"><msup id="A1.T9.3.3.1.m1.1.1" xref="A1.T9.3.3.1.m1.1.1.cmml"><mn id="A1.T9.3.3.1.m1.1.1.2" xref="A1.T9.3.3.1.m1.1.1.2.cmml">32</mn><mn id="A1.T9.3.3.1.m1.1.1.3" xref="A1.T9.3.3.1.m1.1.1.3.cmml">2</mn></msup><annotation-xml encoding="MathML-Content" id="A1.T9.3.3.1.m1.1b"><apply id="A1.T9.3.3.1.m1.1.1.cmml" xref="A1.T9.3.3.1.m1.1.1"><csymbol cd="ambiguous" id="A1.T9.3.3.1.m1.1.1.1.cmml" xref="A1.T9.3.3.1.m1.1.1">superscript</csymbol><cn id="A1.T9.3.3.1.m1.1.1.2.cmml" type="integer" xref="A1.T9.3.3.1.m1.1.1.2">32</cn><cn id="A1.T9.3.3.1.m1.1.1.3.cmml" type="integer" xref="A1.T9.3.3.1.m1.1.1.3">2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T9.3.3.1.m1.1c">32^{2}</annotation><annotation encoding="application/x-llamapun" id="A1.T9.3.3.1.m1.1d">32 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left 
ltx_border_t" id="A1.T9.3.3.3" style="padding-left:2.8pt;padding-right:2.8pt;">62M</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T9.3.3.4" style="padding-left:2.8pt;padding-right:2.8pt;">29.1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T9.3.3.5" style="padding-left:2.8pt;padding-right:2.8pt;">29.8</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T9.3.3.6" style="padding-left:2.8pt;padding-right:2.8pt;">50.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T9.3.3.7" style="padding-left:2.8pt;padding-right:2.8pt;">75.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T9.3.3.8" style="padding-left:2.8pt;padding-right:2.8pt;">56.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T9.3.3.9" style="padding-left:2.8pt;padding-right:2.8pt;">77.3</td> </tr> <tr class="ltx_tr" id="A1.T9.4.4"> <td class="ltx_td ltx_align_left" id="A1.T9.4.4.2" style="padding-left:2.8pt;padding-right:2.8pt;">B/16</td> <td class="ltx_td ltx_align_left" id="A1.T9.4.4.1" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="16^{2}" class="ltx_Math" display="inline" id="A1.T9.4.4.1.m1.1"><semantics id="A1.T9.4.4.1.m1.1a"><msup id="A1.T9.4.4.1.m1.1.1" xref="A1.T9.4.4.1.m1.1.1.cmml"><mn id="A1.T9.4.4.1.m1.1.1.2" xref="A1.T9.4.4.1.m1.1.1.2.cmml">16</mn><mn id="A1.T9.4.4.1.m1.1.1.3" xref="A1.T9.4.4.1.m1.1.1.3.cmml">2</mn></msup><annotation-xml encoding="MathML-Content" id="A1.T9.4.4.1.m1.1b"><apply id="A1.T9.4.4.1.m1.1.1.cmml" xref="A1.T9.4.4.1.m1.1.1"><csymbol cd="ambiguous" id="A1.T9.4.4.1.m1.1.1.1.cmml" xref="A1.T9.4.4.1.m1.1.1">superscript</csymbol><cn id="A1.T9.4.4.1.m1.1.1.2.cmml" type="integer" xref="A1.T9.4.4.1.m1.1.1.2">16</cn><cn id="A1.T9.4.4.1.m1.1.1.3.cmml" type="integer" xref="A1.T9.4.4.1.m1.1.1.3">2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T9.4.4.1.m1.1c">16^{2}</annotation><annotation encoding="application/x-llamapun" id="A1.T9.4.4.1.m1.1d">16 start_POSTSUPERSCRIPT 2 
end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left" id="A1.T9.4.4.3" style="padding-left:2.8pt;padding-right:2.8pt;">62M</td> <td class="ltx_td ltx_align_left" id="A1.T9.4.4.4" style="padding-left:2.8pt;padding-right:2.8pt;">31.2</td> <td class="ltx_td ltx_align_left" id="A1.T9.4.4.5" style="padding-left:2.8pt;padding-right:2.8pt;">33.5</td> <td class="ltx_td ltx_align_center" id="A1.T9.4.4.6" style="padding-left:2.8pt;padding-right:2.8pt;">55.8</td> <td class="ltx_td ltx_align_center" id="A1.T9.4.4.7" style="padding-left:2.8pt;padding-right:2.8pt;">77.4</td> <td class="ltx_td ltx_align_center" id="A1.T9.4.4.8" style="padding-left:2.8pt;padding-right:2.8pt;">61.9</td> <td class="ltx_td ltx_align_center" id="A1.T9.4.4.9" style="padding-left:2.8pt;padding-right:2.8pt;">79.9</td> </tr> <tr class="ltx_tr" id="A1.T9.5.5"> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T9.5.5.2" style="padding-left:2.8pt;padding-right:2.8pt;">L/14</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T9.5.5.1" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="14^{2}" class="ltx_Math" display="inline" id="A1.T9.5.5.1.m1.1"><semantics id="A1.T9.5.5.1.m1.1a"><msup id="A1.T9.5.5.1.m1.1.1" xref="A1.T9.5.5.1.m1.1.1.cmml"><mn id="A1.T9.5.5.1.m1.1.1.2" xref="A1.T9.5.5.1.m1.1.1.2.cmml">14</mn><mn id="A1.T9.5.5.1.m1.1.1.3" xref="A1.T9.5.5.1.m1.1.1.3.cmml">2</mn></msup><annotation-xml encoding="MathML-Content" id="A1.T9.5.5.1.m1.1b"><apply id="A1.T9.5.5.1.m1.1.1.cmml" xref="A1.T9.5.5.1.m1.1.1"><csymbol cd="ambiguous" id="A1.T9.5.5.1.m1.1.1.1.cmml" xref="A1.T9.5.5.1.m1.1.1">superscript</csymbol><cn id="A1.T9.5.5.1.m1.1.1.2.cmml" type="integer" xref="A1.T9.5.5.1.m1.1.1.2">14</cn><cn id="A1.T9.5.5.1.m1.1.1.3.cmml" type="integer" xref="A1.T9.5.5.1.m1.1.1.3">2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T9.5.5.1.m1.1c">14^{2}</annotation><annotation encoding="application/x-llamapun" 
id="A1.T9.5.5.1.m1.1d">14 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T9.5.5.3" style="padding-left:2.8pt;padding-right:2.8pt;">240M</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T9.5.5.4" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T9.5.5.4.1">34.7</span></td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T9.5.5.5" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T9.5.5.5.1">36.9</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T9.5.5.6" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T9.5.5.6.1">60.4</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T9.5.5.7" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T9.5.5.7.1">80.2</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T9.5.5.8" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T9.5.5.8.1">62.8</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T9.5.5.9" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T9.5.5.9.1">80.7</span></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 9: </span>Evaluation of the GiVE applied on different configurations of CLIP-ViT. 
“Type” denotes the base model variant, “PS” the patch size used to divide the image into patches, and “# TP” the number of trainable parameters.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="A1.SSx5.SSSx2"> <h4 class="ltx_title ltx_title_subsubsection">Sensitivity Analysis on Loss Weights</h4> <div class="ltx_para" id="A1.SSx5.SSSx2.p1"> <p class="ltx_p" id="A1.SSx5.SSSx2.p1.1">We also examine how the three loss weights in GiVE affect the AG-Adapter, with the aim of balancing the richness of effective information against the accuracy of semantic concepts. As the results in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T10" title="Table 10 ‣ Sensitivity Analysis on Loss Weights ‣ Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">10</span></a> show, varying any of these three hyper-parameters shifts the visual encoder’s capabilities toward one side of this trade-off at the expense of the other. 
Taking everything into account, we adopt the relatively balanced values from the first row.</p> </div> <figure class="ltx_table" id="A1.T10"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T10.5"> <tr class="ltx_tr" id="A1.T10.5.5"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T10.1.1.1" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.1.1.1.1"><math alttext="\omega^{\text{OITC}}" class="ltx_Math" display="inline" id="A1.T10.1.1.1.1.m1.1"><semantics id="A1.T10.1.1.1.1.m1.1a"><msup id="A1.T10.1.1.1.1.m1.1.1" xref="A1.T10.1.1.1.1.m1.1.1.cmml"><mi id="A1.T10.1.1.1.1.m1.1.1.2" xref="A1.T10.1.1.1.1.m1.1.1.2.cmml">ω</mi><mtext id="A1.T10.1.1.1.1.m1.1.1.3" xref="A1.T10.1.1.1.1.m1.1.1.3a.cmml">OITC</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.1.1.1.1.m1.1b"><apply id="A1.T10.1.1.1.1.m1.1.1.cmml" xref="A1.T10.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="A1.T10.1.1.1.1.m1.1.1.1.cmml" xref="A1.T10.1.1.1.1.m1.1.1">superscript</csymbol><ci id="A1.T10.1.1.1.1.m1.1.1.2.cmml" xref="A1.T10.1.1.1.1.m1.1.1.2">𝜔</ci><ci id="A1.T10.1.1.1.1.m1.1.1.3a.cmml" xref="A1.T10.1.1.1.1.m1.1.1.3"><mtext id="A1.T10.1.1.1.1.m1.1.1.3.cmml" mathsize="70%" xref="A1.T10.1.1.1.1.m1.1.1.3">OITC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.1.1.1.1.m1.1c">\omega^{\text{OITC}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.1.1.1.1.m1.1d">italic_ω start_POSTSUPERSCRIPT OITC end_POSTSUPERSCRIPT</annotation></semantics></math></span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T10.2.2.2" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.2.2.2.1"><math alttext="\omega^{\text{OIIC}}" class="ltx_Math" display="inline" id="A1.T10.2.2.2.1.m1.1"><semantics id="A1.T10.2.2.2.1.m1.1a"><msup id="A1.T10.2.2.2.1.m1.1.1" xref="A1.T10.2.2.2.1.m1.1.1.cmml"><mi id="A1.T10.2.2.2.1.m1.1.1.2" 
xref="A1.T10.2.2.2.1.m1.1.1.2.cmml">ω</mi><mtext id="A1.T10.2.2.2.1.m1.1.1.3" xref="A1.T10.2.2.2.1.m1.1.1.3a.cmml">OIIC</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.2.2.2.1.m1.1b"><apply id="A1.T10.2.2.2.1.m1.1.1.cmml" xref="A1.T10.2.2.2.1.m1.1.1"><csymbol cd="ambiguous" id="A1.T10.2.2.2.1.m1.1.1.1.cmml" xref="A1.T10.2.2.2.1.m1.1.1">superscript</csymbol><ci id="A1.T10.2.2.2.1.m1.1.1.2.cmml" xref="A1.T10.2.2.2.1.m1.1.1.2">𝜔</ci><ci id="A1.T10.2.2.2.1.m1.1.1.3a.cmml" xref="A1.T10.2.2.2.1.m1.1.1.3"><mtext id="A1.T10.2.2.2.1.m1.1.1.3.cmml" mathsize="70%" xref="A1.T10.2.2.2.1.m1.1.1.3">OIIC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.2.2.2.1.m1.1c">\omega^{\text{OIIC}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.2.2.2.1.m1.1d">italic_ω start_POSTSUPERSCRIPT OIIC end_POSTSUPERSCRIPT</annotation></semantics></math></span></td> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T10.3.3.3" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.3.3.3.1"><math alttext="\omega^{\text{OID}}" class="ltx_Math" display="inline" id="A1.T10.3.3.3.1.m1.1"><semantics id="A1.T10.3.3.3.1.m1.1a"><msup id="A1.T10.3.3.3.1.m1.1.1" xref="A1.T10.3.3.3.1.m1.1.1.cmml"><mi id="A1.T10.3.3.3.1.m1.1.1.2" xref="A1.T10.3.3.3.1.m1.1.1.2.cmml">ω</mi><mtext id="A1.T10.3.3.3.1.m1.1.1.3" xref="A1.T10.3.3.3.1.m1.1.1.3a.cmml">OID</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.3.3.3.1.m1.1b"><apply id="A1.T10.3.3.3.1.m1.1.1.cmml" xref="A1.T10.3.3.3.1.m1.1.1"><csymbol cd="ambiguous" id="A1.T10.3.3.3.1.m1.1.1.1.cmml" xref="A1.T10.3.3.3.1.m1.1.1">superscript</csymbol><ci id="A1.T10.3.3.3.1.m1.1.1.2.cmml" xref="A1.T10.3.3.3.1.m1.1.1.2">𝜔</ci><ci id="A1.T10.3.3.3.1.m1.1.1.3a.cmml" xref="A1.T10.3.3.3.1.m1.1.1.3"><mtext id="A1.T10.3.3.3.1.m1.1.1.3.cmml" mathsize="70%" xref="A1.T10.3.3.3.1.m1.1.1.3">OID</mtext></ci></apply></annotation-xml><annotation 
encoding="application/x-tex" id="A1.T10.3.3.3.1.m1.1c">\omega^{\text{OID}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.3.3.3.1.m1.1d">italic_ω start_POSTSUPERSCRIPT OID end_POSTSUPERSCRIPT</annotation></semantics></math></span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T10.5.5.6" style="padding-left:2.8pt;padding-right:2.8pt;">MOInst</td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T10.4.4.4" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T10.4.4.4.1"><span class="ltx_text ltx_font_italic" id="A1.T10.4.4.4.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T10.5.5.5" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T10.5.5.5.1"><span class="ltx_text ltx_font_italic" id="A1.T10.5.5.5.1.1">a</span></sup> </td> </tr> <tr class="ltx_tr" id="A1.T10.5.6"> <td class="ltx_td ltx_align_left" id="A1.T10.5.6.1" style="padding-left:2.8pt;padding-right:2.8pt;">I2T</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.6.2" style="padding-left:2.8pt;padding-right:2.8pt;">T2I</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.6.3" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.5.6.3.1">F1</span></td> <td class="ltx_td ltx_align_center" id="A1.T10.5.6.4" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.5.6.4.1">AUC</span></td> <td class="ltx_td ltx_align_center" id="A1.T10.5.6.5" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.5.6.5.1">F1</span></td> <td class="ltx_td ltx_align_center" id="A1.T10.5.6.6" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T10.5.6.6.1">AUC</span></td> </tr> <tr class="ltx_tr" id="A1.T10.5.7"> <td class="ltx_td ltx_align_left" id="A1.T10.5.7.1" 
style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.7.2" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> </tr> <tr class="ltx_tr" id="A1.T10.5.8"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.8.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.8.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.8.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.8.4" style="padding-left:2.8pt;padding-right:2.8pt;">29.1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.8.5" style="padding-left:2.8pt;padding-right:2.8pt;">29.8</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.8.6" style="padding-left:2.8pt;padding-right:2.8pt;">50.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.8.7" style="padding-left:2.8pt;padding-right:2.8pt;">75.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.8.8" style="padding-left:2.8pt;padding-right:2.8pt;">56.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.8.9" style="padding-left:2.8pt;padding-right:2.8pt;">77.3</td> </tr> <tr class="ltx_tr" id="A1.T10.5.9"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.9.1" style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.9.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.9.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.9.4" style="padding-left:2.8pt;padding-right:2.8pt;">15.0</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.9.5" style="padding-left:2.8pt;padding-right:2.8pt;">18.4</td> <td class="ltx_td 
ltx_align_center ltx_border_t" id="A1.T10.5.9.6" style="padding-left:2.8pt;padding-right:2.8pt;">41.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.9.7" style="padding-left:2.8pt;padding-right:2.8pt;">71.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.9.8" style="padding-left:2.8pt;padding-right:2.8pt;">49.9</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.9.9" style="padding-left:2.8pt;padding-right:2.8pt;">73.9</td> </tr> <tr class="ltx_tr" id="A1.T10.5.10"> <td class="ltx_td ltx_align_left" id="A1.T10.5.10.1" style="padding-left:2.8pt;padding-right:2.8pt;">0.5</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.10.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.10.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.10.4" style="padding-left:2.8pt;padding-right:2.8pt;">24.7</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.10.5" style="padding-left:2.8pt;padding-right:2.8pt;">26.6</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.10.6" style="padding-left:2.8pt;padding-right:2.8pt;">51.7</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.10.7" style="padding-left:2.8pt;padding-right:2.8pt;">73.1</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.10.8" style="padding-left:2.8pt;padding-right:2.8pt;">55.1</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.10.9" style="padding-left:2.8pt;padding-right:2.8pt;">76.5</td> </tr> <tr class="ltx_tr" id="A1.T10.5.11"> <td class="ltx_td ltx_align_left" id="A1.T10.5.11.1" style="padding-left:2.8pt;padding-right:2.8pt;">5</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.11.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.11.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.11.4" 
style="padding-left:2.8pt;padding-right:2.8pt;">30.2</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.11.5" style="padding-left:2.8pt;padding-right:2.8pt;">30.4</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.11.6" style="padding-left:2.8pt;padding-right:2.8pt;">39.1</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.11.7" style="padding-left:2.8pt;padding-right:2.8pt;">68.7</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.11.8" style="padding-left:2.8pt;padding-right:2.8pt;">48.9</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.11.9" style="padding-left:2.8pt;padding-right:2.8pt;">73.0</td> </tr> <tr class="ltx_tr" id="A1.T10.5.12"> <td class="ltx_td ltx_align_left" id="A1.T10.5.12.1" style="padding-left:2.8pt;padding-right:2.8pt;">10</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.12.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.12.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.12.4" style="padding-left:2.8pt;padding-right:2.8pt;">29.7</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.12.5" style="padding-left:2.8pt;padding-right:2.8pt;">30.6</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.12.6" style="padding-left:2.8pt;padding-right:2.8pt;">33.2</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.12.7" style="padding-left:2.8pt;padding-right:2.8pt;">65.3</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.12.8" style="padding-left:2.8pt;padding-right:2.8pt;">42.9</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.12.9" style="padding-left:2.8pt;padding-right:2.8pt;">69.5</td> </tr> <tr class="ltx_tr" id="A1.T10.5.13"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.13.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.13.2" style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_left ltx_border_t" 
id="A1.T10.5.13.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.13.4" style="padding-left:2.8pt;padding-right:2.8pt;">29.5</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.13.5" style="padding-left:2.8pt;padding-right:2.8pt;">31.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.13.6" style="padding-left:2.8pt;padding-right:2.8pt;">48.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.13.7" style="padding-left:2.8pt;padding-right:2.8pt;">73.5</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.13.8" style="padding-left:2.8pt;padding-right:2.8pt;">56.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.13.9" style="padding-left:2.8pt;padding-right:2.8pt;">77.2</td> </tr> <tr class="ltx_tr" id="A1.T10.5.14"> <td class="ltx_td ltx_align_left" id="A1.T10.5.14.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.14.2" style="padding-left:2.8pt;padding-right:2.8pt;">0.5</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.14.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.14.4" style="padding-left:2.8pt;padding-right:2.8pt;">29.3</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.14.5" style="padding-left:2.8pt;padding-right:2.8pt;">30.8</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.14.6" style="padding-left:2.8pt;padding-right:2.8pt;">48.2</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.14.7" style="padding-left:2.8pt;padding-right:2.8pt;">74.1</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.14.8" style="padding-left:2.8pt;padding-right:2.8pt;">54.7</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.14.9" style="padding-left:2.8pt;padding-right:2.8pt;">76.5</td> </tr> <tr class="ltx_tr" id="A1.T10.5.15"> <td class="ltx_td ltx_align_left" id="A1.T10.5.15.1" 
style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.15.2" style="padding-left:2.8pt;padding-right:2.8pt;">5</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.15.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.15.4" style="padding-left:2.8pt;padding-right:2.8pt;">24.8</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.15.5" style="padding-left:2.8pt;padding-right:2.8pt;">26.7</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.15.6" style="padding-left:2.8pt;padding-right:2.8pt;">43.3</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.15.7" style="padding-left:2.8pt;padding-right:2.8pt;">71.9</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.15.8" style="padding-left:2.8pt;padding-right:2.8pt;">52.5</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.15.9" style="padding-left:2.8pt;padding-right:2.8pt;">75.6</td> </tr> <tr class="ltx_tr" id="A1.T10.5.16"> <td class="ltx_td ltx_align_left" id="A1.T10.5.16.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.16.2" style="padding-left:2.8pt;padding-right:2.8pt;">10</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.16.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.01</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.16.4" style="padding-left:2.8pt;padding-right:2.8pt;">23.6</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.16.5" style="padding-left:2.8pt;padding-right:2.8pt;">27.5</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.16.6" style="padding-left:2.8pt;padding-right:2.8pt;">39.4</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.16.7" style="padding-left:2.8pt;padding-right:2.8pt;">70.7</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.16.8" style="padding-left:2.8pt;padding-right:2.8pt;">49.2</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.16.9" style="padding-left:2.8pt;padding-right:2.8pt;">74.3</td> </tr> 
<tr class="ltx_tr" id="A1.T10.5.17"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.17.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.17.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.17.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.001</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.17.4" style="padding-left:2.8pt;padding-right:2.8pt;">28.9</td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T10.5.17.5" style="padding-left:2.8pt;padding-right:2.8pt;">30.8</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.17.6" style="padding-left:2.8pt;padding-right:2.8pt;">32.4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.17.7" style="padding-left:2.8pt;padding-right:2.8pt;">65.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.17.8" style="padding-left:2.8pt;padding-right:2.8pt;">43.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T10.5.17.9" style="padding-left:2.8pt;padding-right:2.8pt;">69.8</td> </tr> <tr class="ltx_tr" id="A1.T10.5.18"> <td class="ltx_td ltx_align_left" id="A1.T10.5.18.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.18.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.18.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.005</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.18.4" style="padding-left:2.8pt;padding-right:2.8pt;">29.4</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.18.5" style="padding-left:2.8pt;padding-right:2.8pt;">30.0</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.18.6" style="padding-left:2.8pt;padding-right:2.8pt;">44.6</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.18.7" style="padding-left:2.8pt;padding-right:2.8pt;">71.6</td> <td class="ltx_td 
ltx_align_center" id="A1.T10.5.18.8" style="padding-left:2.8pt;padding-right:2.8pt;">54.3</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.18.9" style="padding-left:2.8pt;padding-right:2.8pt;">75.6</td> </tr> <tr class="ltx_tr" id="A1.T10.5.19"> <td class="ltx_td ltx_align_left" id="A1.T10.5.19.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.19.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.19.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.05</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.19.4" style="padding-left:2.8pt;padding-right:2.8pt;">22.4</td> <td class="ltx_td ltx_align_left" id="A1.T10.5.19.5" style="padding-left:2.8pt;padding-right:2.8pt;">24.1</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.19.6" style="padding-left:2.8pt;padding-right:2.8pt;">53.3</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.19.7" style="padding-left:2.8pt;padding-right:2.8pt;">77.0</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.19.8" style="padding-left:2.8pt;padding-right:2.8pt;">54.9</td> <td class="ltx_td ltx_align_center" id="A1.T10.5.19.9" style="padding-left:2.8pt;padding-right:2.8pt;">76.6</td> </tr> <tr class="ltx_tr" id="A1.T10.5.20"> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T10.5.20.1" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T10.5.20.2" style="padding-left:2.8pt;padding-right:2.8pt;">1</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T10.5.20.3" style="padding-left:2.8pt;padding-right:2.8pt;">0.1</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T10.5.20.4" style="padding-left:2.8pt;padding-right:2.8pt;">20.4</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T10.5.20.5" style="padding-left:2.8pt;padding-right:2.8pt;">21.7</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T10.5.20.6" 
style="padding-left:2.8pt;padding-right:2.8pt;">48.2</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T10.5.20.7" style="padding-left:2.8pt;padding-right:2.8pt;">74.2</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T10.5.20.8" style="padding-left:2.8pt;padding-right:2.8pt;">53.2</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T10.5.20.9" style="padding-left:2.8pt;padding-right:2.8pt;">75.8</td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 10: </span>Hyper-parameter grid search on <math alttext="\omega^{\text{OITC}}" class="ltx_Math" display="inline" id="A1.T10.12.m1.1"><semantics id="A1.T10.12.m1.1b"><msup id="A1.T10.12.m1.1.1" xref="A1.T10.12.m1.1.1.cmml"><mi id="A1.T10.12.m1.1.1.2" xref="A1.T10.12.m1.1.1.2.cmml">ω</mi><mtext id="A1.T10.12.m1.1.1.3" xref="A1.T10.12.m1.1.1.3a.cmml">OITC</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.12.m1.1c"><apply id="A1.T10.12.m1.1.1.cmml" xref="A1.T10.12.m1.1.1"><csymbol cd="ambiguous" id="A1.T10.12.m1.1.1.1.cmml" xref="A1.T10.12.m1.1.1">superscript</csymbol><ci id="A1.T10.12.m1.1.1.2.cmml" xref="A1.T10.12.m1.1.1.2">𝜔</ci><ci id="A1.T10.12.m1.1.1.3a.cmml" xref="A1.T10.12.m1.1.1.3"><mtext id="A1.T10.12.m1.1.1.3.cmml" mathsize="70%" xref="A1.T10.12.m1.1.1.3">OITC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.12.m1.1d">\omega^{\text{OITC}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.12.m1.1e">italic_ω start_POSTSUPERSCRIPT OITC end_POSTSUPERSCRIPT</annotation></semantics></math>, <math alttext="\omega^{\text{OIIC}}" class="ltx_Math" display="inline" id="A1.T10.13.m2.1"><semantics id="A1.T10.13.m2.1b"><msup id="A1.T10.13.m2.1.1" xref="A1.T10.13.m2.1.1.cmml"><mi id="A1.T10.13.m2.1.1.2" xref="A1.T10.13.m2.1.1.2.cmml">ω</mi><mtext id="A1.T10.13.m2.1.1.3" xref="A1.T10.13.m2.1.1.3a.cmml">OIIC</mtext></msup><annotation-xml encoding="MathML-Content" 
id="A1.T10.13.m2.1c"><apply id="A1.T10.13.m2.1.1.cmml" xref="A1.T10.13.m2.1.1"><csymbol cd="ambiguous" id="A1.T10.13.m2.1.1.1.cmml" xref="A1.T10.13.m2.1.1">superscript</csymbol><ci id="A1.T10.13.m2.1.1.2.cmml" xref="A1.T10.13.m2.1.1.2">𝜔</ci><ci id="A1.T10.13.m2.1.1.3a.cmml" xref="A1.T10.13.m2.1.1.3"><mtext id="A1.T10.13.m2.1.1.3.cmml" mathsize="70%" xref="A1.T10.13.m2.1.1.3">OIIC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.13.m2.1d">\omega^{\text{OIIC}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.13.m2.1e">italic_ω start_POSTSUPERSCRIPT OIIC end_POSTSUPERSCRIPT</annotation></semantics></math>, and <math alttext="\omega^{\text{OID}}" class="ltx_Math" display="inline" id="A1.T10.14.m3.1"><semantics id="A1.T10.14.m3.1b"><msup id="A1.T10.14.m3.1.1" xref="A1.T10.14.m3.1.1.cmml"><mi id="A1.T10.14.m3.1.1.2" xref="A1.T10.14.m3.1.1.2.cmml">ω</mi><mtext id="A1.T10.14.m3.1.1.3" xref="A1.T10.14.m3.1.1.3a.cmml">OID</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.14.m3.1c"><apply id="A1.T10.14.m3.1.1.cmml" xref="A1.T10.14.m3.1.1"><csymbol cd="ambiguous" id="A1.T10.14.m3.1.1.1.cmml" xref="A1.T10.14.m3.1.1">superscript</csymbol><ci id="A1.T10.14.m3.1.1.2.cmml" xref="A1.T10.14.m3.1.1.2">𝜔</ci><ci id="A1.T10.14.m3.1.1.3a.cmml" xref="A1.T10.14.m3.1.1.3"><mtext id="A1.T10.14.m3.1.1.3.cmml" mathsize="70%" xref="A1.T10.14.m3.1.1.3">OID</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.14.m3.1d">\omega^{\text{OID}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.14.m3.1e">italic_ω start_POSTSUPERSCRIPT OID end_POSTSUPERSCRIPT</annotation></semantics></math>, which represent the weights for <math alttext="\mathcal{L}^{\text{OITC}}" class="ltx_Math" display="inline" id="A1.T10.15.m4.1"><semantics id="A1.T10.15.m4.1b"><msup id="A1.T10.15.m4.1.1" xref="A1.T10.15.m4.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.T10.15.m4.1.1.2" 
xref="A1.T10.15.m4.1.1.2.cmml">ℒ</mi><mtext id="A1.T10.15.m4.1.1.3" xref="A1.T10.15.m4.1.1.3a.cmml">OITC</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.15.m4.1c"><apply id="A1.T10.15.m4.1.1.cmml" xref="A1.T10.15.m4.1.1"><csymbol cd="ambiguous" id="A1.T10.15.m4.1.1.1.cmml" xref="A1.T10.15.m4.1.1">superscript</csymbol><ci id="A1.T10.15.m4.1.1.2.cmml" xref="A1.T10.15.m4.1.1.2">ℒ</ci><ci id="A1.T10.15.m4.1.1.3a.cmml" xref="A1.T10.15.m4.1.1.3"><mtext id="A1.T10.15.m4.1.1.3.cmml" mathsize="70%" xref="A1.T10.15.m4.1.1.3">OITC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.15.m4.1d">\mathcal{L}^{\text{OITC}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.15.m4.1e">caligraphic_L start_POSTSUPERSCRIPT OITC end_POSTSUPERSCRIPT</annotation></semantics></math>, <math alttext="\mathcal{L}^{\text{OIIC}}" class="ltx_Math" display="inline" id="A1.T10.16.m5.1"><semantics id="A1.T10.16.m5.1b"><msup id="A1.T10.16.m5.1.1" xref="A1.T10.16.m5.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.T10.16.m5.1.1.2" xref="A1.T10.16.m5.1.1.2.cmml">ℒ</mi><mtext id="A1.T10.16.m5.1.1.3" xref="A1.T10.16.m5.1.1.3a.cmml">OIIC</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.16.m5.1c"><apply id="A1.T10.16.m5.1.1.cmml" xref="A1.T10.16.m5.1.1"><csymbol cd="ambiguous" id="A1.T10.16.m5.1.1.1.cmml" xref="A1.T10.16.m5.1.1">superscript</csymbol><ci id="A1.T10.16.m5.1.1.2.cmml" xref="A1.T10.16.m5.1.1.2">ℒ</ci><ci id="A1.T10.16.m5.1.1.3a.cmml" xref="A1.T10.16.m5.1.1.3"><mtext id="A1.T10.16.m5.1.1.3.cmml" mathsize="70%" xref="A1.T10.16.m5.1.1.3">OIIC</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.16.m5.1d">\mathcal{L}^{\text{OIIC}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.16.m5.1e">caligraphic_L start_POSTSUPERSCRIPT OIIC end_POSTSUPERSCRIPT</annotation></semantics></math>, and <math alttext="\mathcal{L}^{\text{OID}}" class="ltx_Math" 
display="inline" id="A1.T10.17.m6.1"><semantics id="A1.T10.17.m6.1b"><msup id="A1.T10.17.m6.1.1" xref="A1.T10.17.m6.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="A1.T10.17.m6.1.1.2" xref="A1.T10.17.m6.1.1.2.cmml">ℒ</mi><mtext id="A1.T10.17.m6.1.1.3" xref="A1.T10.17.m6.1.1.3a.cmml">OID</mtext></msup><annotation-xml encoding="MathML-Content" id="A1.T10.17.m6.1c"><apply id="A1.T10.17.m6.1.1.cmml" xref="A1.T10.17.m6.1.1"><csymbol cd="ambiguous" id="A1.T10.17.m6.1.1.1.cmml" xref="A1.T10.17.m6.1.1">superscript</csymbol><ci id="A1.T10.17.m6.1.1.2.cmml" xref="A1.T10.17.m6.1.1.2">ℒ</ci><ci id="A1.T10.17.m6.1.1.3a.cmml" xref="A1.T10.17.m6.1.1.3"><mtext id="A1.T10.17.m6.1.1.3.cmml" mathsize="70%" xref="A1.T10.17.m6.1.1.3">OID</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T10.17.m6.1d">\mathcal{L}^{\text{OID}}</annotation><annotation encoding="application/x-llamapun" id="A1.T10.17.m6.1e">caligraphic_L start_POSTSUPERSCRIPT OID end_POSTSUPERSCRIPT</annotation></semantics></math>, respectively.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="A1.SSx5.SSSx3"> <h4 class="ltx_title ltx_title_subsubsection">Comparison with CLIP-based Methods</h4> <div class="ltx_para" id="A1.SSx5.SSSx3.p1"> <p class="ltx_p" id="A1.SSx5.SSSx3.p1.1">We compare CLIP <cite class="ltx_cite ltx_citemacro_citep">(Radford et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib36" title="">2021</a>)</cite>, the recent MetaCLIP <cite class="ltx_cite ltx_citemacro_citep">(Xu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib48" title="">2024</a>)</cite>, and CLIP with our GiVE applied, on five datasets for zero-shot image classification. 
The results, shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T11" title="Table 11 ‣ Comparison with Clip-based Method ‣ Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">11</span></a>, show that CLIP with GiVE outperforms both the original CLIP and the state-of-the-art (SoTA) MetaCLIP on all five datasets.</p> </div> <figure class="ltx_table" id="A1.T11"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T11.2"> <tr class="ltx_tr" id="A1.T11.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T11.2.2.3" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T11.2.2.3.1">Model</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T11.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T11.1.1.1.1"><span class="ltx_text ltx_font_italic" id="A1.T11.1.1.1.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T11.2.2.2" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T11.2.2.2.1"><span class="ltx_text ltx_font_italic" id="A1.T11.2.2.2.1.1">a</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T11.2.2.4" style="padding-left:2.8pt;padding-right:2.8pt;">COCO</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T11.2.2.5" style="padding-left:2.8pt;padding-right:2.8pt;">SUN397</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T11.2.2.6" style="padding-left:2.8pt;padding-right:2.8pt;">ImageNet</td> </tr> <tr class="ltx_tr" id="A1.T11.2.3"> <td class="ltx_td ltx_align_center" id="A1.T11.2.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">AUC</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.3.2" style="padding-left:2.8pt;padding-right:2.8pt;">AUC</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.3.3" style="padding-left:2.8pt;padding-right:2.8pt;">AUC</td> <td 
class="ltx_td ltx_align_center" id="A1.T11.2.3.4" style="padding-left:2.8pt;padding-right:2.8pt;">AUC</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.3.5" style="padding-left:2.8pt;padding-right:2.8pt;">AUC</td> </tr> <tr class="ltx_tr" id="A1.T11.2.4"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T11.2.4.1" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T11.2.4.2" style="padding-left:2.8pt;padding-right:2.8pt;">57.8</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T11.2.4.3" style="padding-left:2.8pt;padding-right:2.8pt;">55.5</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T11.2.4.4" style="padding-left:2.8pt;padding-right:2.8pt;">63.8</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T11.2.4.5" style="padding-left:2.8pt;padding-right:2.8pt;">81.9</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T11.2.4.6" style="padding-left:2.8pt;padding-right:2.8pt;">80.2</td> </tr> <tr class="ltx_tr" id="A1.T11.2.5"> <td class="ltx_td ltx_align_left" id="A1.T11.2.5.1" style="padding-left:2.8pt;padding-right:2.8pt;">MetaCLIP</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.5.2" style="padding-left:2.8pt;padding-right:2.8pt;">57.8</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.5.3" style="padding-left:2.8pt;padding-right:2.8pt;">55.6</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.5.4" style="padding-left:2.8pt;padding-right:2.8pt;">63.7</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.5.5" style="padding-left:2.8pt;padding-right:2.8pt;">83.6</td> <td class="ltx_td ltx_align_center" id="A1.T11.2.5.6" style="padding-left:2.8pt;padding-right:2.8pt;">81.1</td> </tr> <tr class="ltx_tr" id="A1.T11.2.6"> <td class="ltx_td ltx_align_left ltx_border_bb" id="A1.T11.2.6.1" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP +GiVE</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T11.2.6.2" 
style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T11.2.6.2.1">75.6</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T11.2.6.3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T11.2.6.3.1">77.3</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T11.2.6.4" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T11.2.6.4.1">75.0</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T11.2.6.5" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T11.2.6.5.1">85.2</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T11.2.6.6" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T11.2.6.6.1">82.7</span></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 11: </span>Top-1 AUC (%) of zero-shot image classification on 5 datasets.</figcaption> </figure> </section> <section class="ltx_subsubsection" id="A1.SSx5.SSSx4"> <h4 class="ltx_title ltx_title_subsubsection">Training Stability</h4> <div class="ltx_para" id="A1.SSx5.SSSx4.p1"> <p class="ltx_p" id="A1.SSx5.SSSx4.p1.1">We apply our GiVE method to CLIP-ViT and repeat the experiment 5 times to calculate the mean and standard deviation, as recorded in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T12" title="Table 12 ‣ Training Stability ‣ Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">12</span></a>. 
The standard deviations range from 0.2 to 2.2 points, indicating that our method is relatively stable across all 3 tasks.</p> </div> <figure class="ltx_table" id="A1.T12"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T12.6"> <tr class="ltx_tr" id="A1.T12.2.2"> <td class="ltx_td ltx_align_left ltx_border_tt" id="A1.T12.2.2.3" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T12.2.2.3.1">Model</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="A1.T12.2.2.4" style="padding-left:2.8pt;padding-right:2.8pt;">MOInst</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T12.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T12.1.1.1.1"><span class="ltx_text ltx_font_italic" id="A1.T12.1.1.1.1.1">f</span></sup> </td> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T12.2.2.2" style="padding-left:2.8pt;padding-right:2.8pt;">LVIS<sup class="ltx_sup" id="A1.T12.2.2.2.1"><span class="ltx_text ltx_font_italic" id="A1.T12.2.2.2.1.1">a</span></sup> </td> </tr> <tr class="ltx_tr" id="A1.T12.6.7"> <td class="ltx_td ltx_align_center" id="A1.T12.6.7.1" style="padding-left:2.8pt;padding-right:2.8pt;">I2T</td> <td class="ltx_td ltx_align_center" id="A1.T12.6.7.2" style="padding-left:2.8pt;padding-right:2.8pt;">T2I</td> <td class="ltx_td ltx_align_center" id="A1.T12.6.7.3" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T12.6.7.3.1">AUC</span></td> <td class="ltx_td ltx_align_center" id="A1.T12.6.7.4" rowspan="2" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T12.6.7.4.1">AUC</span></td> </tr> <tr class="ltx_tr" id="A1.T12.6.8"> <td class="ltx_td ltx_align_center" id="A1.T12.6.8.1" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> <td class="ltx_td ltx_align_center" id="A1.T12.6.8.2" style="padding-left:2.8pt;padding-right:2.8pt;">R@1</td> </tr> <tr class="ltx_tr" id="A1.T12.6.6"> <td class="ltx_td 
ltx_align_left ltx_border_bb ltx_border_t" id="A1.T12.6.6.5" style="padding-left:2.8pt;padding-right:2.8pt;">CLIP +GiVE</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T12.3.3.1" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="29.1\!\pm\!0.2" class="ltx_Math" display="inline" id="A1.T12.3.3.1.m1.1"><semantics id="A1.T12.3.3.1.m1.1a"><mrow id="A1.T12.3.3.1.m1.1.1" xref="A1.T12.3.3.1.m1.1.1.cmml"><mn id="A1.T12.3.3.1.m1.1.1.2" xref="A1.T12.3.3.1.m1.1.1.2.cmml">29.1</mn><mo id="A1.T12.3.3.1.m1.1.1.1" lspace="0.052em" rspace="0.052em" xref="A1.T12.3.3.1.m1.1.1.1.cmml">±</mo><mn id="A1.T12.3.3.1.m1.1.1.3" xref="A1.T12.3.3.1.m1.1.1.3.cmml">0.2</mn></mrow><annotation-xml encoding="MathML-Content" id="A1.T12.3.3.1.m1.1b"><apply id="A1.T12.3.3.1.m1.1.1.cmml" xref="A1.T12.3.3.1.m1.1.1"><csymbol cd="latexml" id="A1.T12.3.3.1.m1.1.1.1.cmml" xref="A1.T12.3.3.1.m1.1.1.1">plus-or-minus</csymbol><cn id="A1.T12.3.3.1.m1.1.1.2.cmml" type="float" xref="A1.T12.3.3.1.m1.1.1.2">29.1</cn><cn id="A1.T12.3.3.1.m1.1.1.3.cmml" type="float" xref="A1.T12.3.3.1.m1.1.1.3">0.2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T12.3.3.1.m1.1c">29.1\!\pm\!0.2</annotation><annotation encoding="application/x-llamapun" id="A1.T12.3.3.1.m1.1d">29.1 ± 0.2</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T12.4.4.2" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="30.1\!\pm\!0.3" class="ltx_Math" display="inline" id="A1.T12.4.4.2.m1.1"><semantics id="A1.T12.4.4.2.m1.1a"><mrow id="A1.T12.4.4.2.m1.1.1" xref="A1.T12.4.4.2.m1.1.1.cmml"><mn id="A1.T12.4.4.2.m1.1.1.2" xref="A1.T12.4.4.2.m1.1.1.2.cmml">30.1</mn><mo id="A1.T12.4.4.2.m1.1.1.1" lspace="0.052em" rspace="0.052em" xref="A1.T12.4.4.2.m1.1.1.1.cmml">±</mo><mn id="A1.T12.4.4.2.m1.1.1.3" xref="A1.T12.4.4.2.m1.1.1.3.cmml">0.3</mn></mrow><annotation-xml encoding="MathML-Content" id="A1.T12.4.4.2.m1.1b"><apply 
id="A1.T12.4.4.2.m1.1.1.cmml" xref="A1.T12.4.4.2.m1.1.1"><csymbol cd="latexml" id="A1.T12.4.4.2.m1.1.1.1.cmml" xref="A1.T12.4.4.2.m1.1.1.1">plus-or-minus</csymbol><cn id="A1.T12.4.4.2.m1.1.1.2.cmml" type="float" xref="A1.T12.4.4.2.m1.1.1.2">30.1</cn><cn id="A1.T12.4.4.2.m1.1.1.3.cmml" type="float" xref="A1.T12.4.4.2.m1.1.1.3">0.3</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T12.4.4.2.m1.1c">30.1\!\pm\!0.3</annotation><annotation encoding="application/x-llamapun" id="A1.T12.4.4.2.m1.1d">30.1 ± 0.3</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T12.5.5.3" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="75.3\!\pm\!2.2" class="ltx_Math" display="inline" id="A1.T12.5.5.3.m1.1"><semantics id="A1.T12.5.5.3.m1.1a"><mrow id="A1.T12.5.5.3.m1.1.1" xref="A1.T12.5.5.3.m1.1.1.cmml"><mn id="A1.T12.5.5.3.m1.1.1.2" xref="A1.T12.5.5.3.m1.1.1.2.cmml">75.3</mn><mo id="A1.T12.5.5.3.m1.1.1.1" lspace="0.052em" rspace="0.052em" xref="A1.T12.5.5.3.m1.1.1.1.cmml">±</mo><mn id="A1.T12.5.5.3.m1.1.1.3" xref="A1.T12.5.5.3.m1.1.1.3.cmml">2.2</mn></mrow><annotation-xml encoding="MathML-Content" id="A1.T12.5.5.3.m1.1b"><apply id="A1.T12.5.5.3.m1.1.1.cmml" xref="A1.T12.5.5.3.m1.1.1"><csymbol cd="latexml" id="A1.T12.5.5.3.m1.1.1.1.cmml" xref="A1.T12.5.5.3.m1.1.1.1">plus-or-minus</csymbol><cn id="A1.T12.5.5.3.m1.1.1.2.cmml" type="float" xref="A1.T12.5.5.3.m1.1.1.2">75.3</cn><cn id="A1.T12.5.5.3.m1.1.1.3.cmml" type="float" xref="A1.T12.5.5.3.m1.1.1.3">2.2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T12.5.5.3.m1.1c">75.3\!\pm\!2.2</annotation><annotation encoding="application/x-llamapun" id="A1.T12.5.5.3.m1.1d">75.3 ± 2.2</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="A1.T12.6.6.4" style="padding-left:2.8pt;padding-right:2.8pt;"><math alttext="77.4\!\pm\!1.1" class="ltx_Math" display="inline" 
id="A1.T12.6.6.4.m1.1"><semantics id="A1.T12.6.6.4.m1.1a"><mrow id="A1.T12.6.6.4.m1.1.1" xref="A1.T12.6.6.4.m1.1.1.cmml"><mn id="A1.T12.6.6.4.m1.1.1.2" xref="A1.T12.6.6.4.m1.1.1.2.cmml">77.4</mn><mo id="A1.T12.6.6.4.m1.1.1.1" lspace="0.052em" rspace="0.052em" xref="A1.T12.6.6.4.m1.1.1.1.cmml">±</mo><mn id="A1.T12.6.6.4.m1.1.1.3" xref="A1.T12.6.6.4.m1.1.1.3.cmml">1.1</mn></mrow><annotation-xml encoding="MathML-Content" id="A1.T12.6.6.4.m1.1b"><apply id="A1.T12.6.6.4.m1.1.1.cmml" xref="A1.T12.6.6.4.m1.1.1"><csymbol cd="latexml" id="A1.T12.6.6.4.m1.1.1.1.cmml" xref="A1.T12.6.6.4.m1.1.1.1">plus-or-minus</csymbol><cn id="A1.T12.6.6.4.m1.1.1.2.cmml" type="float" xref="A1.T12.6.6.4.m1.1.1.2">77.4</cn><cn id="A1.T12.6.6.4.m1.1.1.3.cmml" type="float" xref="A1.T12.6.6.4.m1.1.1.3">1.1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T12.6.6.4.m1.1c">77.4\!\pm\!1.1</annotation><annotation encoding="application/x-llamapun" id="A1.T12.6.6.4.m1.1d">77.4 ± 1.1</annotation></semantics></math></td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 12: </span>The means and standard deviations measured by CLIP with GiVE across 5 experiments on 3 different tasks.</figcaption> </figure> </section> </section> <section class="ltx_subsection" id="A1.SSx6"> <h3 class="ltx_title ltx_title_subsection">More Guided Attention Visualization</h3> <div class="ltx_para" id="A1.SSx6.p1"> <p class="ltx_p" id="A1.SSx6.p1.1">In Figure <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.F7" title="Figure 7 ‣ More Guided Attention Visualization ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">7</span></a>, we use EigenCAM <cite class="ltx_cite ltx_citemacro_citep">(Muhammad and Yeasin <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib31" title="">2020</a>)</cite> to generate heat maps that visualize instruction following. 
The results demonstrate that our proposed method enables an existing image encoder to flexibly perceive the queried objects.</p> </div> <figure class="ltx_figure" id="A1.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_portrait" height="994" id="A1.F7.g1" src="x7.png" width="710"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7: </span><span class="ltx_text ltx_font_bold" id="A1.F7.2.1">More visualizations on different categories</span>. Each row represents a set, comprising the raw image and the visualized attention of the image encoders. The first column contains the raw image, the second column shows the attention of CLIP-ViT, and the last four columns illustrate how GiVE guides CLIP-ViT to attend to both salient and non-salient objects. </figcaption> </figure> </section> <section class="ltx_subsection" id="A1.SSx7"> <h3 class="ltx_title ltx_title_subsection">Visual Attention and Response</h3> <div class="ltx_para" id="A1.SSx7.p1"> <p class="ltx_p" id="A1.SSx7.p1.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T13" title="Table 13 ‣ Visual Attention and Response ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">13</span></a> illustrates how the responses of MLLMs differ when they are queried about different objects. The attention distributions of the visual encoders on images are visualized using EigenCAM <cite class="ltx_cite ltx_citemacro_citep">(Muhammad and Yeasin <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib31" title="">2020</a>)</cite>, with the intensity of red indicating how much attention the network allocates. The table presents a case study where the model accurately responds to a query about a highly salient SUV but incorrectly answers a query about a less noticeable car that shares features with the SUV. 
This error arises because the model’s perception is constrained by image features that emphasize salient objects. When query information is incomplete, the model tends to default to common sense embedded in its weights. While this approach generally works for abstract text, it can lead to errors when dealing with detail-rich images.</p> </div> <figure class="ltx_table" id="A1.T13"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T13.3"> <tr class="ltx_tr" id="A1.T13.3.4"> <td class="ltx_td ltx_align_left ltx_border_tt" colspan="5" id="A1.T13.3.4.1" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T13.3.4.1.1"></span><span class="ltx_text" id="A1.T13.3.4.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T13.3.4.1.2.1"> <span class="ltx_tr" id="A1.T13.3.4.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T13.3.4.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T13.3.4.1.2.1.1.1.1">Visual Input Example,</span></span></span> <span class="ltx_tr" id="A1.T13.3.4.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T13.3.4.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T13.3.4.1.2.1.2.1.1">Salience and Non-Salience Object Question:</span></span></span> </span></span><span class="ltx_text" id="A1.T13.3.4.1.3"></span></td> </tr> <tr class="ltx_tr" id="A1.T13.1.1"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.1.1.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_text" id="A1.T13.1.1.2.1"></span><span class="ltx_text" id="A1.T13.1.1.2.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T13.1.1.2.2.1"> <span class="ltx_tr" id="A1.T13.1.1.2.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_left" id="A1.T13.1.1.2.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">Input</span></span> <span class="ltx_tr" id="A1.T13.1.1.2.2.1.2"> <span class="ltx_td 
ltx_nopad_r ltx_align_left" id="A1.T13.1.1.2.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">image</span></span> </span></span><span class="ltx_text" id="A1.T13.1.1.2.3"></span></td> <td class="ltx_td ltx_border_t" id="A1.T13.1.1.3" style="padding-left:2.8pt;padding-right:2.8pt;"></td> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;"><img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="79" id="A1.T13.1.1.1.g1" src="extracted/5955593/image/Confusing-Pictures.jpg" width="118"/></td> <td class="ltx_td ltx_border_t" id="A1.T13.1.1.4" style="padding-left:2.8pt;padding-right:2.8pt;"></td> <td class="ltx_td ltx_border_t" id="A1.T13.1.1.5" style="padding-left:2.8pt;padding-right:2.8pt;"></td> </tr> <tr class="ltx_tr" id="A1.T13.3.3"> <td class="ltx_td ltx_align_left" id="A1.T13.3.3.3" style="padding-left:2.8pt;padding-right:2.8pt;">CAM</td> <td class="ltx_td ltx_align_right" id="A1.T13.3.3.4" style="padding-left:2.8pt;padding-right:2.8pt;"> <div class="ltx_inline-block ltx_transformed_outer" id="A1.T13.3.3.4.1" style="width:6.8pt;height:47.5pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="width:47.5pt;transform:translate(-20.33pt,-20.33pt) rotate(-90deg) ;"> <p class="ltx_p" id="A1.T13.3.3.4.1.1">LLaVA-1.5</p> </span></div> </td> <td class="ltx_td ltx_align_left" id="A1.T13.2.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="79" id="A1.T13.2.2.1.g1" src="extracted/5955593/image/Confusing-Pictures_vit_l_336_CAM.jpg" width="118"/></td> <td class="ltx_td ltx_align_right" id="A1.T13.3.3.5" style="padding-left:2.8pt;padding-right:2.8pt;"> <div class="ltx_inline-block ltx_transformed_outer" id="A1.T13.3.3.5.1" style="width:6.8pt;height:58.6pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="width:58.6pt;transform:translate(-25.86pt,-25.86pt) rotate(-90deg) ;"> <p 
class="ltx_p" id="A1.T13.3.3.5.1.1">InstructBLIP</p> </span></div> </td> <td class="ltx_td ltx_align_left" id="A1.T13.3.3.2" style="padding-left:2.8pt;padding-right:2.8pt;"><img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="79" id="A1.T13.3.3.2.g1" src="extracted/5955593/image/Confusing-Pictures_eva_vit_g_CAM.jpg" width="118"/></td> </tr> <tr class="ltx_tr" id="A1.T13.3.5"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.3.5.1" style="padding-left:2.8pt;padding-right:2.8pt;">User</td> <td class="ltx_td ltx_align_left ltx_border_t" colspan="4" id="A1.T13.3.5.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T13.3.5.2.1"> <span class="ltx_p" id="A1.T13.3.5.2.1.1" style="width:156.5pt;">Describe in detail the SUV on the right.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T13.3.6"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.3.6.1" style="padding-left:2.8pt;padding-right:2.8pt;"> <div class="ltx_inline-block ltx_transformed_outer" id="A1.T13.3.6.1.1" style="width:36.0pt;height:167.9pt;vertical-align:-167.9pt;"><span class="ltx_transformed_inner" style="width:167.9pt;transform:translate(-65.95pt,101.95pt) rotate(-90deg) ;"> <p class="ltx_p" id="A1.T13.3.6.1.1.1"><span class="ltx_text" id="A1.T13.3.6.1.1.1.1"></span> <span class="ltx_text" id="A1.T13.3.6.1.1.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T13.3.6.1.1.1.2.1"> <span class="ltx_tr" id="A1.T13.3.6.1.1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.6.1.1.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LLaVA-1.5</span></span> <span class="ltx_tr" id="A1.T13.3.6.1.1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.6.1.1.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><cite class="ltx_cite ltx_citemacro_citep">(Liu et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib22" title="">2023a</a>)</cite></span></span> </span></span> <span class="ltx_text" id="A1.T13.3.6.1.1.1.3"></span></p> </span></div> </td> <td class="ltx_td ltx_align_left ltx_border_t" colspan="4" id="A1.T13.3.6.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T13.3.6.2.1"> <span class="ltx_p" id="A1.T13.3.6.2.1.1" style="width:213.4pt;">The SUV on the right is a yellow vehicle, possibly a van or a truck, with a man standing on the back of it. The man is holding a shirt, possibly ironing it or preparing to hang it on a clothesline. The SUV is parked on the side of the road, and it appears to be a unique and unconventional sight.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T13.3.7"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.3.7.1" style="padding-left:2.8pt;padding-right:2.8pt;"> <div class="ltx_inline-block ltx_transformed_outer" id="A1.T13.3.7.1.1" style="width:36.0pt;height:182.4pt;vertical-align:-182.4pt;"><span class="ltx_transformed_inner" style="width:182.4pt;transform:translate(-73.21pt,109.21pt) rotate(-90deg) ;"> <p class="ltx_p" id="A1.T13.3.7.1.1.1"><span class="ltx_text" id="A1.T13.3.7.1.1.1.1"></span> <span class="ltx_text" id="A1.T13.3.7.1.1.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T13.3.7.1.1.1.2.1"> <span class="ltx_tr" id="A1.T13.3.7.1.1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.7.1.1.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">InstructBLIP</span></span> <span class="ltx_tr" id="A1.T13.3.7.1.1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.7.1.1.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><cite class="ltx_cite ltx_citemacro_citep">(Dai et al. 
<a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib9" title="">2023</a>)</cite></span></span> </span></span> <span class="ltx_text" id="A1.T13.3.7.1.1.1.3"></span></p> </span></div> </td> <td class="ltx_td ltx_align_left ltx_border_t" colspan="4" id="A1.T13.3.7.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T13.3.7.2.1"> <span class="ltx_p" id="A1.T13.3.7.2.1.1" style="width:213.4pt;">In the image, a yellow SUV is parked on the side of a busy city street. A man wearing a yellow shirt is standing on the back of the SUV, ironing clothes on an ironing board that is attached to the vehicle’s roof rack. He is positioned in such a way that he can easily access the ironing board from the back of the SUV. The presence of the ironing board on top of the SUV adds a unique and creative touch to the scene, showcasing the man’s resourcefulness and his ability to make the most out of his surroundings. Additionally, two taxis are visible in the background, adding to the bustling atmosphere of the city street.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T13.3.8"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.3.8.1" style="padding-left:2.8pt;padding-right:2.8pt;">User</td> <td class="ltx_td ltx_align_left ltx_border_t" colspan="4" id="A1.T13.3.8.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T13.3.8.2.1"> <span class="ltx_p" id="A1.T13.3.8.2.1.1" style="width:156.5pt;">Describe in detail the car on the left.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T13.3.9"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T13.3.9.1" style="padding-left:2.8pt;padding-right:2.8pt;"> <div class="ltx_inline-block ltx_transformed_outer" id="A1.T13.3.9.1.1" style="width:36.0pt;height:167.9pt;vertical-align:-167.9pt;"><span class="ltx_transformed_inner" style="width:167.9pt;transform:translate(-65.95pt,101.95pt) rotate(-90deg) ;"> <p 
class="ltx_p" id="A1.T13.3.9.1.1.1"><span class="ltx_text" id="A1.T13.3.9.1.1.1.1"></span> <span class="ltx_text" id="A1.T13.3.9.1.1.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T13.3.9.1.1.1.2.1"> <span class="ltx_tr" id="A1.T13.3.9.1.1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.9.1.1.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LLaVA-1.5</span></span> <span class="ltx_tr" id="A1.T13.3.9.1.1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.9.1.1.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><cite class="ltx_cite ltx_citemacro_citep">(Liu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib22" title="">2023a</a>)</cite></span></span> </span></span> <span class="ltx_text" id="A1.T13.3.9.1.1.1.3"></span></p> </span></div> </td> <td class="ltx_td ltx_align_left ltx_border_t" colspan="4" id="A1.T13.3.9.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T13.3.9.2.1"> <span class="ltx_p" id="A1.T13.3.9.2.1.1" style="width:213.4pt;">The car on the left is a yellow taxi cab. It is driving down the street, and there is a man standing on the back of the car, holding a shirt. 
The taxi is passing by a building, and the man appears to be hanging the shirt on a clothesline.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T13.3.10"> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="A1.T13.3.10.1" style="padding-left:2.8pt;padding-right:2.8pt;"> <div class="ltx_inline-block ltx_transformed_outer" id="A1.T13.3.10.1.1" style="width:36.0pt;height:182.4pt;vertical-align:-182.4pt;"><span class="ltx_transformed_inner" style="width:182.4pt;transform:translate(-73.21pt,109.21pt) rotate(-90deg) ;"> <p class="ltx_p" id="A1.T13.3.10.1.1.1"><span class="ltx_text" id="A1.T13.3.10.1.1.1.1"></span> <span class="ltx_text" id="A1.T13.3.10.1.1.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T13.3.10.1.1.1.2.1"> <span class="ltx_tr" id="A1.T13.3.10.1.1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.10.1.1.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">InstructBLIP</span></span> <span class="ltx_tr" id="A1.T13.3.10.1.1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T13.3.10.1.1.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><cite class="ltx_cite ltx_citemacro_citep">(Dai et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib9" title="">2023</a>)</cite></span></span> </span></span> <span class="ltx_text" id="A1.T13.3.10.1.1.1.3"></span></p> </span></div> </td> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" colspan="4" id="A1.T13.3.10.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T13.3.10.2.1"> <span class="ltx_p" id="A1.T13.3.10.2.1.1" style="width:213.4pt;">In the image, a man wearing a yellow shirt is seen standing on top of a car parked on the side of a busy city street. The car appears to be a large SUV or minivan, possibly a Toyota Highlander or Honda Odyssey. 
The man is ironing clothes on the roof of the vehicle, which is an unusual and creative way to do laundry in the middle of a bustling city. The presence of a yellow taxi cab nearby adds to the vibrant and lively atmosphere of the scene.</span> </span> </td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 13: </span>Responses of MLLMs to queries on objects with varying attention levels.</figcaption> </figure> <div class="ltx_para" id="A1.SSx7.p2"> <p class="ltx_p" id="A1.SSx7.p2.1">In comparison, our GiVE aids in addressing the visual overlook issue. It enables the visual encoder to focus on the query object, even if it is non-salient. We substitute AG-CLIP-ViT-L/14, as referenced in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T9" title="Table 9 ‣ Configurations of CLIP with GiVE ‣ Parameter Sensitivity ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">9</span></a>, for CLIP-ViT-L/14, which is utilized by LLaVA <cite class="ltx_cite ltx_citemacro_citep">(Liu et al. <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#bib.bib24" title="">2023b</a>)</cite>, without additional training. 
As illustrated in Table <a class="ltx_ref" href="https://arxiv.org/html/2410.20109v1#A1.T14" title="Table 14 ‣ Visual Attention and Response ‣ Appendix A Appendix ‣ GiVE: Guiding Visual Encoder to Perceive Overlooked Information"><span class="ltx_text ltx_ref_tag">14</span></a>, this substitution allows LLaVA to generate a more accurate response.</p> </div> <figure class="ltx_table" id="A1.T14"> <table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T14.1"> <tr class="ltx_tr" id="A1.T14.1.2"> <td class="ltx_td ltx_align_left ltx_border_tt" colspan="2" id="A1.T14.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text ltx_font_bold" id="A1.T14.1.2.1.1">Visual input example, Non-Salience Object Question:</span></td> </tr> <tr class="ltx_tr" id="A1.T14.1.1"> <td class="ltx_td ltx_border_t" id="A1.T14.1.1.2" style="padding-left:2.8pt;padding-right:2.8pt;"></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T14.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T14.1.1.1.1"> <span class="ltx_p" id="A1.T14.1.1.1.1.1" style="width:193.5pt;"><img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="138" id="A1.T14.1.1.1.1.1.g1" src="extracted/5955593/image/Confusing-Pictures.jpg" width="206"/></span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T14.1.3"> <td class="ltx_td ltx_align_left" id="A1.T14.1.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">User</td> <td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T14.1.3.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T14.1.3.2.1"> <span class="ltx_p" id="A1.T14.1.3.2.1.1" style="width:193.5pt;">Describe in detail the car on the left.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T14.1.4"> <td class="ltx_td ltx_align_left ltx_border_t" id="A1.T14.1.4.1" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" 
id="A1.T14.1.4.1.1"><span class="ltx_text" id="A1.T14.1.4.1.1.1"></span> <span class="ltx_text" id="A1.T14.1.4.1.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T14.1.4.1.1.2.1"> <span class="ltx_tr" id="A1.T14.1.4.1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T14.1.4.1.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LLaVA</span></span> <span class="ltx_tr" id="A1.T14.1.4.1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T14.1.4.1.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">w/o</span></span> <span class="ltx_tr" id="A1.T14.1.4.1.1.2.1.3"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T14.1.4.1.1.2.1.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">GiVE</span></span> </span></span> <span class="ltx_text" id="A1.T14.1.4.1.1.3"></span></span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T14.1.4.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T14.1.4.2.1"> <span class="ltx_p" id="A1.T14.1.4.2.1.1" style="width:193.5pt;">The car on the left is a yellow taxi cab. It is driving down the street, and a man is standing on the back of the car, holding a clothes iron. 
This unusual scene suggests that the man might be a tailor or a professional ironing service provider, offering his services to the people in the city.</span> </span> </td> </tr> <tr class="ltx_tr" id="A1.T14.1.5"> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_t" id="A1.T14.1.5.1" rowspan="3" style="padding-left:2.8pt;padding-right:2.8pt;"><span class="ltx_text" id="A1.T14.1.5.1.1"><span class="ltx_text" id="A1.T14.1.5.1.1.1"></span> <span class="ltx_text" id="A1.T14.1.5.1.1.2"> <span class="ltx_tabular ltx_align_middle" id="A1.T14.1.5.1.1.2.1"> <span class="ltx_tr" id="A1.T14.1.5.1.1.2.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T14.1.5.1.1.2.1.1.1" style="padding-left:2.8pt;padding-right:2.8pt;">LLaVA</span></span> <span class="ltx_tr" id="A1.T14.1.5.1.1.2.1.2"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T14.1.5.1.1.2.1.2.1" style="padding-left:2.8pt;padding-right:2.8pt;">w/</span></span> <span class="ltx_tr" id="A1.T14.1.5.1.1.2.1.3"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="A1.T14.1.5.1.1.2.1.3.1" style="padding-left:2.8pt;padding-right:2.8pt;">GiVE</span></span> </span></span> <span class="ltx_text" id="A1.T14.1.5.1.1.3"></span></span></td> <td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb ltx_border_t" id="A1.T14.1.5.2" style="padding-left:2.8pt;padding-right:2.8pt;"> <span class="ltx_inline-block ltx_align_top" id="A1.T14.1.5.2.1"> <span class="ltx_p" id="A1.T14.1.5.2.1.1" style="width:193.5pt;">The car on the left is a yellow taxi cab. It is driving down the street, <span class="ltx_text ltx_font_bold" id="A1.T14.1.5.2.1.1.1">passing by a man who is ironing clothes in the back of a truck. 
The taxi cab is positioned behind the truck, and it appears to be in motion.</span></span> </span> </td> </tr> </table> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 14: </span>Integrating GiVE enables the MLLM to use more targeted visual features, which improves response quality. In this example, LLaVA with GiVE does not incorrectly place the man on the back of the car, and its response stays focused on the car rather than shifting emphasis to the man, as the original LLaVA does.</figcaption> </figure> </section> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Sat Oct 26 07:32:20 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/">LaTeXML</a> </div></footer> </div> </body> </html>
