<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing</title> <!--Generated on Mon Mar 17 17:52:00 2025 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content="Machine Learning, ICML" lang="en" name="keywords"/> <base href="/html/2503.13434v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S1" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Blob-Based Element-level Representation</span></a> 
<ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.SS1" title="In 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.1 </span>Blob Formula</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.SS2" title="In 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.2 </span>Blob Opacity</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.SS3" title="In 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.3 </span>Blob Composition and Splatting</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Self-supervised Paradigm for BlobCtrl</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS1" title="In 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Model Architecture</span></a> <ol class="ltx_toclist 
ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS1.SSS0.Px1" title="In 3.1 Model Architecture ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Foreground Branch.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS1.SSS0.Px2" title="In 3.1 Model Architecture ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Background Branch.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS2" title="In 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Self-supervised Training</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS3" title="In 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3 </span>ID Preservation and Scene Harmonization</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS3.SSS0.Px1" title="In 3.3 ID Preservation and Scene Harmonization ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Random Data Augmentation.</span></a></li> <li 
class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS3.SSS0.Px2" title="In 3.3 ID Preservation and Scene Harmonization ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Identity Preservation Score Function.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS4" title="In 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.4 </span>Controllable Fidelity-Diversity Trade-off</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Experiments</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS1" title="In 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Datasets, Benchmark and Metrics</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS1.SSS0.Px1" title="In 4.1 Datasets, Benchmark and Metrics ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_text ltx_font_italic">BlobData</span> Curation.</span></a></li> <li 
class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS1.SSS0.Px2" title="In 4.1 Datasets, Benchmark and Metrics ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_text ltx_font_italic">BlobBench</span> Curation.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS1.SSS0.Px3" title="In 4.1 Datasets, Benchmark and Metrics ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Evaluation Metrics.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS2" title="In 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Implementation Details.</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS2.SSS0.Px1" title="In 4.2 Implementation Details. ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Training Details.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS2.SSS0.Px2" title="In 4.2 Implementation Details. 
‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Evaluation Details.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS3" title="In 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.3 </span>Quantitative Evaluation</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS3.SSS0.Px1" title="In 4.3 Quantitative Evaluation ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Comparison to State-of-the-Art Methods.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS3.SSS0.Px2" title="In 4.3 Quantitative Evaluation ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Human Evaluation.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS4" title="In 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.4 </span>Qualitative Evaluation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS5" title="In 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.5 </span>Ablation 
Studies</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS5.SSS0.Px1" title="In 4.5 Ablation Studies ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Analysis of Controllability and Flexibility</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.SS5.SSS0.Px2" title="In 4.5 Ablation Studies ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Ablation of Identity Preservation Score Function.</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S5" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Related Work</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S5.SS0.SSS0.Px1" title="In 5 Related Work ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Element-level Generation.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S5.SS0.SSS0.Px2" title="In 5 Related Work ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Element-level Editing.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S6" title="In BlobCtrl: A Unified and Flexible 
Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Discussion</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S6.SS0.SSS0.Px1" title="In 6 Discussion ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Conclusion.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S6.SS0.SSS0.Px2" title="In 6 Discussion ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Limitations and Future Work.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A1" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A </span>BlobBench Overview and Evaluation Metrics</span></a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A1.SS0.SSS0.Px1" title="In Appendix A BlobBench Overview and Evaluation Metrics ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">BlobBench Overview</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A1.SS0.SSS0.Px2" title="In Appendix A BlobBench Overview and Evaluation Metrics ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">Evaluation Metrics</span></a></li> </ol> </li> <li class="ltx_tocentry 
ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A2" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B </span>Mathematical Relationship Between Ellipses and 2D Gaussian Distributions</span></a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A2.SS0.SSS0.Px1" title="In Appendix B Mathematical Relationship Between Ellipses and 2D Gaussian Distributions ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">From Gaussian to Ellipse.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A2.SS0.SSS0.Px2" title="In Appendix B Mathematical Relationship Between Ellipses and 2D Gaussian Distributions ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title">From Ellipse to Gaussian.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A3" title="In BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">C </span>BlobData Curation</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_pruned_first"> <h1 class="ltx_title ltx_title_document">BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Yaowei Li </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span 
class="ltx_personname">Lingen Li </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Zhaoyang Zhang </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Xiaoyu Li </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Guangzhi Wang </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Hongxiang Li </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Xiaodong Cun </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Ying Shan </span></span> <span class="ltx_author_before">  </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Yuexian Zou </span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id1.id1">Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce <span class="ltx_text ltx_font_italic" id="id1.id1.1">BlobCtrl</span>, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. 
Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce <em class="ltx_emph ltx_font_italic" id="id1.id1.2">BlobData</em> for large-scale training and <em class="ltx_emph ltx_font_italic" id="id1.id1.3">BlobBench</em> for systematic evaluation. Experiments show that <span class="ltx_text ltx_font_italic" id="id1.id1.4">BlobCtrl</span> excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://liyaowei-stu.github.io/project/BlobCtrl/" title="">https://liyaowei-stu.github.io/project/BlobCtrl/</a>.</p> </div> <div class="ltx_keywords">Machine Learning, ICML </div> <div class="ltx_para" id="p3"> <p class="ltx_p" id="p3.2">Project lead: Zhaoyang Zhang (zhaoyangzhang@link.cuhk.edu.hk)</p> </div> <div class="ltx_para ltx_align_center" id="p4"> <img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_square" height="613" id="p4.g1" src="x1.png" width="714"/> </div> <figure class="ltx_figure ltx_align_center" id="S0.F1"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">Figure 1: </span> Our proposed <span class="ltx_text ltx_font_italic" id="S0.F1.3.1">BlobCtrl</span> framework enables comprehensive element-level control over both visual appearance and spatial layout, facilitating diverse manipulation operations including compositional generation, 
spatial transformation, element removal, content replacement, and arbitrary combinations thereof (top). Through an iterative refinement process, <span class="ltx_text ltx_font_italic" id="S0.F1.4.2">BlobCtrl</span> supports precise, fine-grained editing to achieve the desired visual outcome (bottom). </figcaption> </figure> <div class="ltx_para" id="p5"> <br class="ltx_break"/> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Element-level image manipulation has long been a goal in digital art, with tools like Adobe Photoshop <cite class="ltx_cite ltx_citemacro_citep">(Adobe Inc., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib1" title="">1988–2023</a>)</cite> enabling precise manipulation of visual elements. While recent AI models <cite class="ltx_cite ltx_citemacro_citep">(Ramesh et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib37" title="">2022</a>; Labs, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib24" title="">2023</a>; Esser et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib12" title="">2024</a>; Sheynin et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib41" title="">2024</a>; Shi et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib43" title="">2024</a>)</cite> excel at high-quality image synthesis, they often lack fine-grained control over individual elements, a key feature of traditional tools <cite class="ltx_cite ltx_citemacro_citep">(Adobe Inc., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib1" title="">1988–2023</a>; Serif Europe Ltd., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib40" title="">2015–2023</a>)</cite>. 
Advances like ControlNet <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib51" title="">2023a</a>)</cite> and IP-Adapter <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib49" title="">2023</a>)</cite> have improved controllability but still do not support the interactive, multi-round, element-based manipulation (e.g., composition, resizing, arrangement) that is crucial for creative workflows. The key challenges are: 1) decoupling and representing visual elements, 2) controlling layout continuously, 3) preserving appearance and identity, 4) maintaining visual harmony, and 5) the scarcity of large-scale paired data for end-to-end training.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Current efforts in element-level manipulation follow two approaches, generation and editing, each facing its own obstacles. <em class="ltx_emph ltx_font_italic" id="S1.p2.1.1">Element-level generation</em> <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>; Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib49" title="">2023</a>; Nie et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib31" title="">2024</a>; Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>; Xiong et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib47" title="">2024</a>; Parmar et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib34" title="">2025</a>)</cite> uses grounding tokens (e.g., bounding boxes, ellipses) for spatial control and identity tokens such as CLIP <cite class="ltx_cite ltx_citemacro_citep">(Radford et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib36" title="">2021</a>)</cite> and DINO 
<cite class="ltx_cite ltx_citemacro_citep">(Caron et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib7" title="">2021</a>)</cite> for appearance maintenance. These methods struggle with continuous layout control because grounding tokens are discrete, and with detailed appearance preservation because identity tokens are highly compressed. <em class="ltx_emph ltx_font_italic" id="S1.p2.1.2">Element-level editing</em> <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib54" title="">2023b</a>; Avrahami et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib4" title="">2023</a>; Shi et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib42" title="">2023</a>; Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>; Mu et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib30" title="">2025</a>)</cite> employs optimization, segmentation, clustering, and drag-based methods for attribute control. These approaches often lack flexibility and struggle with visual harmony, and they frequently rely on video data, which introduces complexities such as camera motion that degrade performance and generalization. Related work is discussed in detail in <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S5" title="5 Related Work ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">Section</span> <span class="ltx_text ltx_ref_tag">5</span></a>.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">The essence of element-level visual modeling lies in the flexible decoupling and representation of location, semantics, and identity. <span class="ltx_text ltx_font_italic" id="S1.p3.1.1">BlobCtrl</span> uses blobs as visual primitives to achieve this. 
Formally, a blob is a two-dimensional Gaussian distribution <cite class="ltx_cite ltx_citemacro_citep">(Carson et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib8" title="">1999</a>)</cite>; geometrically, it appears as an ellipse <cite class="ltx_cite ltx_citemacro_citep">(Nie et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib31" title="">2024</a>)</cite>. The blob parameters specify position, size, and orientation, while the smoothness of the Gaussian enables harmonious, continuous layout control. For visual identity, we use differentiable blob splatting <cite class="ltx_cite ltx_citemacro_citep">(Epstein et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib11" title="">2022</a>)</cite> combined with variational autoencoder (VAE) features <cite class="ltx_cite ltx_citemacro_citep">(Kingma, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib20" title="">2013</a>)</cite> to preserve appearance.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">Building on the probabilistic blob representation, we introduce a dual-branch diffusion model: one branch for foreground elements and another for background elements. A self-supervised training paradigm enhances generalization and efficiency, with specific strategies improving <span class="ltx_text ltx_font_italic" id="S1.p4.1.1">BlobCtrl</span>’s robustness. To preserve foreground identities, we propose random data augmentation and an identity preservation score function. Additionally, random dropout in the dual-branch structure allows flexible balancing of appearance fidelity and creative diversity during inference. 
These design choices make <span class="ltx_text ltx_font_italic" id="S1.p4.1.2">BlobCtrl</span> an efficient, flexible solution for element-level generation and editing.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">To scale up our method and ensure comprehensive evaluation, we introduce a new training dataset, <span class="ltx_text ltx_font_italic" id="S1.p5.1.1">BlobData</span>, and a benchmark, <span class="ltx_text ltx_font_italic" id="S1.p5.1.2">BlobBench</span>. Extensive quantitative and qualitative results demonstrate <span class="ltx_text ltx_font_italic" id="S1.p5.1.3">BlobCtrl</span>’s effectiveness in element-level generation (combining multiple subjects) and editing (moving, resizing, adding, deleting, and replacing elements).</p> </div> <div class="ltx_para" id="S1.p6"> <p class="ltx_p" id="S1.p6.1">In summary, our main contributions are:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">We propose <span class="ltx_text ltx_font_italic" id="S1.I1.i1.p1.1.1">BlobCtrl</span>, a novel unified framework that is the first to enable precise and flexible manipulation of visual elements through element-level generation and editing, while effectively preserving their intrinsic characteristics.</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">We design a dual-branch architecture with tailored training paradigms and strategies, achieving a controllable balance between maintaining appearance fidelity and enabling creative diversity in visual manipulation.</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div 
class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">We introduce <span class="ltx_text ltx_font_italic" id="S1.I1.i3.p1.1.1">BlobData</span>, a comprehensive large-scale dataset specifically curated for training element-level visual models, alongside <span class="ltx_text ltx_font_italic" id="S1.I1.i3.p1.1.2">BlobBench</span>, a rigorous evaluation benchmark for assessing element-level generation and editing capabilities.</p> </div> </li> <li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i4.p1"> <p class="ltx_p" id="S1.I1.i4.p1.1">Through extensive experimentation, we demonstrate that <span class="ltx_text ltx_font_italic" id="S1.I1.i4.p1.1.1">BlobCtrl</span> achieves superior performance compared to existing methods in both element-level generation and editing tasks, while maintaining computational efficiency and practical applicability.</p> </div> </li> </ul> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Blob-Based Element-level Representation</h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1">Why is the blob an effective element-level representation? As a grounding token, a blob precisely represents an object’s position, size, and orientation. As a Gaussian distribution, it offers more flexible and harmonious element-level expression than segmentation masks, which have strong shape constraints. In this section, we define the blob and explain its role as an element-level visual representation.</p> </div> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.1 </span>Blob Formula</h3> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.16">Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.F2" title="Figure 2 ‣ 2.1 Blob Formula ‣ 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">2</span></a> visualizes a blob. Geometrically, a blob can be considered as an ellipse, parameterized by <math alttext="\bm{e}_{\tau}=[C_{x},C_{y},a,b,\theta]" class="ltx_Math" display="inline" id="S2.SS1.p1.1.m1.5"><semantics id="S2.SS1.p1.1.m1.5a"><mrow id="S2.SS1.p1.1.m1.5.5" xref="S2.SS1.p1.1.m1.5.5.cmml"><msub id="S2.SS1.p1.1.m1.5.5.4" xref="S2.SS1.p1.1.m1.5.5.4.cmml"><mi id="S2.SS1.p1.1.m1.5.5.4.2" xref="S2.SS1.p1.1.m1.5.5.4.2.cmml">𝒆</mi><mi id="S2.SS1.p1.1.m1.5.5.4.3" xref="S2.SS1.p1.1.m1.5.5.4.3.cmml">τ</mi></msub><mo id="S2.SS1.p1.1.m1.5.5.3" xref="S2.SS1.p1.1.m1.5.5.3.cmml">=</mo><mrow id="S2.SS1.p1.1.m1.5.5.2.2" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml"><mo id="S2.SS1.p1.1.m1.5.5.2.2.3" stretchy="false" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml">[</mo><msub id="S2.SS1.p1.1.m1.4.4.1.1.1" xref="S2.SS1.p1.1.m1.4.4.1.1.1.cmml"><mi id="S2.SS1.p1.1.m1.4.4.1.1.1.2" xref="S2.SS1.p1.1.m1.4.4.1.1.1.2.cmml">C</mi><mi id="S2.SS1.p1.1.m1.4.4.1.1.1.3" xref="S2.SS1.p1.1.m1.4.4.1.1.1.3.cmml">x</mi></msub><mo id="S2.SS1.p1.1.m1.5.5.2.2.4" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml">,</mo><msub id="S2.SS1.p1.1.m1.5.5.2.2.2" xref="S2.SS1.p1.1.m1.5.5.2.2.2.cmml"><mi id="S2.SS1.p1.1.m1.5.5.2.2.2.2" xref="S2.SS1.p1.1.m1.5.5.2.2.2.2.cmml">C</mi><mi id="S2.SS1.p1.1.m1.5.5.2.2.2.3" xref="S2.SS1.p1.1.m1.5.5.2.2.2.3.cmml">y</mi></msub><mo id="S2.SS1.p1.1.m1.5.5.2.2.5" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml">,</mo><mi id="S2.SS1.p1.1.m1.1.1" xref="S2.SS1.p1.1.m1.1.1.cmml">a</mi><mo id="S2.SS1.p1.1.m1.5.5.2.2.6" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml">,</mo><mi id="S2.SS1.p1.1.m1.2.2" xref="S2.SS1.p1.1.m1.2.2.cmml">b</mi><mo id="S2.SS1.p1.1.m1.5.5.2.2.7" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml">,</mo><mi id="S2.SS1.p1.1.m1.3.3" 
xref="S2.SS1.p1.1.m1.3.3.cmml">θ</mi><mo id="S2.SS1.p1.1.m1.5.5.2.2.8" stretchy="false" xref="S2.SS1.p1.1.m1.5.5.2.3.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.1.m1.5b"><apply id="S2.SS1.p1.1.m1.5.5.cmml" xref="S2.SS1.p1.1.m1.5.5"><eq id="S2.SS1.p1.1.m1.5.5.3.cmml" xref="S2.SS1.p1.1.m1.5.5.3"></eq><apply id="S2.SS1.p1.1.m1.5.5.4.cmml" xref="S2.SS1.p1.1.m1.5.5.4"><csymbol cd="ambiguous" id="S2.SS1.p1.1.m1.5.5.4.1.cmml" xref="S2.SS1.p1.1.m1.5.5.4">subscript</csymbol><ci id="S2.SS1.p1.1.m1.5.5.4.2.cmml" xref="S2.SS1.p1.1.m1.5.5.4.2">𝒆</ci><ci id="S2.SS1.p1.1.m1.5.5.4.3.cmml" xref="S2.SS1.p1.1.m1.5.5.4.3">𝜏</ci></apply><list id="S2.SS1.p1.1.m1.5.5.2.3.cmml" xref="S2.SS1.p1.1.m1.5.5.2.2"><apply id="S2.SS1.p1.1.m1.4.4.1.1.1.cmml" xref="S2.SS1.p1.1.m1.4.4.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.1.m1.4.4.1.1.1.1.cmml" xref="S2.SS1.p1.1.m1.4.4.1.1.1">subscript</csymbol><ci id="S2.SS1.p1.1.m1.4.4.1.1.1.2.cmml" xref="S2.SS1.p1.1.m1.4.4.1.1.1.2">𝐶</ci><ci id="S2.SS1.p1.1.m1.4.4.1.1.1.3.cmml" xref="S2.SS1.p1.1.m1.4.4.1.1.1.3">𝑥</ci></apply><apply id="S2.SS1.p1.1.m1.5.5.2.2.2.cmml" xref="S2.SS1.p1.1.m1.5.5.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p1.1.m1.5.5.2.2.2.1.cmml" xref="S2.SS1.p1.1.m1.5.5.2.2.2">subscript</csymbol><ci id="S2.SS1.p1.1.m1.5.5.2.2.2.2.cmml" xref="S2.SS1.p1.1.m1.5.5.2.2.2.2">𝐶</ci><ci id="S2.SS1.p1.1.m1.5.5.2.2.2.3.cmml" xref="S2.SS1.p1.1.m1.5.5.2.2.2.3">𝑦</ci></apply><ci id="S2.SS1.p1.1.m1.1.1.cmml" xref="S2.SS1.p1.1.m1.1.1">𝑎</ci><ci id="S2.SS1.p1.1.m1.2.2.cmml" xref="S2.SS1.p1.1.m1.2.2">𝑏</ci><ci id="S2.SS1.p1.1.m1.3.3.cmml" xref="S2.SS1.p1.1.m1.3.3">𝜃</ci></list></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.1.m1.5c">\bm{e}_{\tau}=[C_{x},C_{y},a,b,\theta]</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.1.m1.5d">bold_italic_e start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT = [ italic_C start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_C 
start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_a , italic_b , italic_θ ]</annotation></semantics></math>, where <math alttext="C_{x}" class="ltx_Math" display="inline" id="S2.SS1.p1.2.m2.1"><semantics id="S2.SS1.p1.2.m2.1a"><msub id="S2.SS1.p1.2.m2.1.1" xref="S2.SS1.p1.2.m2.1.1.cmml"><mi id="S2.SS1.p1.2.m2.1.1.2" xref="S2.SS1.p1.2.m2.1.1.2.cmml">C</mi><mi id="S2.SS1.p1.2.m2.1.1.3" xref="S2.SS1.p1.2.m2.1.1.3.cmml">x</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.2.m2.1b"><apply id="S2.SS1.p1.2.m2.1.1.cmml" xref="S2.SS1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.2.m2.1.1.1.cmml" xref="S2.SS1.p1.2.m2.1.1">subscript</csymbol><ci id="S2.SS1.p1.2.m2.1.1.2.cmml" xref="S2.SS1.p1.2.m2.1.1.2">𝐶</ci><ci id="S2.SS1.p1.2.m2.1.1.3.cmml" xref="S2.SS1.p1.2.m2.1.1.3">𝑥</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.2.m2.1c">C_{x}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.2.m2.1d">italic_C start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="C_{y}" class="ltx_Math" display="inline" id="S2.SS1.p1.3.m3.1"><semantics id="S2.SS1.p1.3.m3.1a"><msub id="S2.SS1.p1.3.m3.1.1" xref="S2.SS1.p1.3.m3.1.1.cmml"><mi id="S2.SS1.p1.3.m3.1.1.2" xref="S2.SS1.p1.3.m3.1.1.2.cmml">C</mi><mi id="S2.SS1.p1.3.m3.1.1.3" xref="S2.SS1.p1.3.m3.1.1.3.cmml">y</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.3.m3.1b"><apply id="S2.SS1.p1.3.m3.1.1.cmml" xref="S2.SS1.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.3.m3.1.1.1.cmml" xref="S2.SS1.p1.3.m3.1.1">subscript</csymbol><ci id="S2.SS1.p1.3.m3.1.1.2.cmml" xref="S2.SS1.p1.3.m3.1.1.2">𝐶</ci><ci id="S2.SS1.p1.3.m3.1.1.3.cmml" xref="S2.SS1.p1.3.m3.1.1.3">𝑦</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.3.m3.1c">C_{y}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.3.m3.1d">italic_C start_POSTSUBSCRIPT italic_y 
end_POSTSUBSCRIPT</annotation></semantics></math> denote the coordinates of the ellipse’s center, <math alttext="{a}" class="ltx_Math" display="inline" id="S2.SS1.p1.4.m4.1"><semantics id="S2.SS1.p1.4.m4.1a"><mi id="S2.SS1.p1.4.m4.1.1" xref="S2.SS1.p1.4.m4.1.1.cmml">a</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.4.m4.1b"><ci id="S2.SS1.p1.4.m4.1.1.cmml" xref="S2.SS1.p1.4.m4.1.1">𝑎</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.4.m4.1c">{a}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.4.m4.1d">italic_a</annotation></semantics></math> and <math alttext="{b}" class="ltx_Math" display="inline" id="S2.SS1.p1.5.m5.1"><semantics id="S2.SS1.p1.5.m5.1a"><mi id="S2.SS1.p1.5.m5.1.1" xref="S2.SS1.p1.5.m5.1.1.cmml">b</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.5.m5.1b"><ci id="S2.SS1.p1.5.m5.1.1.cmml" xref="S2.SS1.p1.5.m5.1.1">𝑏</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.5.m5.1c">{b}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.5.m5.1d">italic_b</annotation></semantics></math> are the lengths of the minor and major axes, respectively, and <math alttext="\theta\in[0,\pi]" class="ltx_Math" display="inline" id="S2.SS1.p1.6.m6.2"><semantics id="S2.SS1.p1.6.m6.2a"><mrow id="S2.SS1.p1.6.m6.2.3" xref="S2.SS1.p1.6.m6.2.3.cmml"><mi id="S2.SS1.p1.6.m6.2.3.2" xref="S2.SS1.p1.6.m6.2.3.2.cmml">θ</mi><mo id="S2.SS1.p1.6.m6.2.3.1" xref="S2.SS1.p1.6.m6.2.3.1.cmml">∈</mo><mrow id="S2.SS1.p1.6.m6.2.3.3.2" xref="S2.SS1.p1.6.m6.2.3.3.1.cmml"><mo id="S2.SS1.p1.6.m6.2.3.3.2.1" stretchy="false" xref="S2.SS1.p1.6.m6.2.3.3.1.cmml">[</mo><mn id="S2.SS1.p1.6.m6.1.1" xref="S2.SS1.p1.6.m6.1.1.cmml">0</mn><mo id="S2.SS1.p1.6.m6.2.3.3.2.2" xref="S2.SS1.p1.6.m6.2.3.3.1.cmml">,</mo><mi id="S2.SS1.p1.6.m6.2.2" xref="S2.SS1.p1.6.m6.2.2.cmml">π</mi><mo id="S2.SS1.p1.6.m6.2.3.3.2.3" stretchy="false" xref="S2.SS1.p1.6.m6.2.3.3.1.cmml">]</mo></mrow></mrow><annotation-xml 
encoding="MathML-Content" id="S2.SS1.p1.6.m6.2b"><apply id="S2.SS1.p1.6.m6.2.3.cmml" xref="S2.SS1.p1.6.m6.2.3"><in id="S2.SS1.p1.6.m6.2.3.1.cmml" xref="S2.SS1.p1.6.m6.2.3.1"></in><ci id="S2.SS1.p1.6.m6.2.3.2.cmml" xref="S2.SS1.p1.6.m6.2.3.2">𝜃</ci><interval closure="closed" id="S2.SS1.p1.6.m6.2.3.3.1.cmml" xref="S2.SS1.p1.6.m6.2.3.3.2"><cn id="S2.SS1.p1.6.m6.1.1.cmml" type="integer" xref="S2.SS1.p1.6.m6.1.1">0</cn><ci id="S2.SS1.p1.6.m6.2.2.cmml" xref="S2.SS1.p1.6.m6.2.2">𝜋</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.6.m6.2c">\theta\in[0,\pi]</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.6.m6.2d">italic_θ ∈ [ 0 , italic_π ]</annotation></semantics></math> is the orientation angle of the ellipse. Statistically, a blob is modeled as a two-dimensional Gaussian distribution, parameterized by <math alttext="\bm{\mu}=[\mu_{x},\mu_{y}]" class="ltx_Math" display="inline" id="S2.SS1.p1.7.m7.2"><semantics id="S2.SS1.p1.7.m7.2a"><mrow id="S2.SS1.p1.7.m7.2.2" xref="S2.SS1.p1.7.m7.2.2.cmml"><mi id="S2.SS1.p1.7.m7.2.2.4" xref="S2.SS1.p1.7.m7.2.2.4.cmml">𝝁</mi><mo id="S2.SS1.p1.7.m7.2.2.3" xref="S2.SS1.p1.7.m7.2.2.3.cmml">=</mo><mrow id="S2.SS1.p1.7.m7.2.2.2.2" xref="S2.SS1.p1.7.m7.2.2.2.3.cmml"><mo id="S2.SS1.p1.7.m7.2.2.2.2.3" stretchy="false" xref="S2.SS1.p1.7.m7.2.2.2.3.cmml">[</mo><msub id="S2.SS1.p1.7.m7.1.1.1.1.1" xref="S2.SS1.p1.7.m7.1.1.1.1.1.cmml"><mi id="S2.SS1.p1.7.m7.1.1.1.1.1.2" xref="S2.SS1.p1.7.m7.1.1.1.1.1.2.cmml">μ</mi><mi id="S2.SS1.p1.7.m7.1.1.1.1.1.3" xref="S2.SS1.p1.7.m7.1.1.1.1.1.3.cmml">x</mi></msub><mo id="S2.SS1.p1.7.m7.2.2.2.2.4" xref="S2.SS1.p1.7.m7.2.2.2.3.cmml">,</mo><msub id="S2.SS1.p1.7.m7.2.2.2.2.2" xref="S2.SS1.p1.7.m7.2.2.2.2.2.cmml"><mi id="S2.SS1.p1.7.m7.2.2.2.2.2.2" xref="S2.SS1.p1.7.m7.2.2.2.2.2.2.cmml">μ</mi><mi id="S2.SS1.p1.7.m7.2.2.2.2.2.3" xref="S2.SS1.p1.7.m7.2.2.2.2.2.3.cmml">y</mi></msub><mo id="S2.SS1.p1.7.m7.2.2.2.2.5" stretchy="false" 
xref="S2.SS1.p1.7.m7.2.2.2.3.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.7.m7.2b"><apply id="S2.SS1.p1.7.m7.2.2.cmml" xref="S2.SS1.p1.7.m7.2.2"><eq id="S2.SS1.p1.7.m7.2.2.3.cmml" xref="S2.SS1.p1.7.m7.2.2.3"></eq><ci id="S2.SS1.p1.7.m7.2.2.4.cmml" xref="S2.SS1.p1.7.m7.2.2.4">𝝁</ci><interval closure="closed" id="S2.SS1.p1.7.m7.2.2.2.3.cmml" xref="S2.SS1.p1.7.m7.2.2.2.2"><apply id="S2.SS1.p1.7.m7.1.1.1.1.1.cmml" xref="S2.SS1.p1.7.m7.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.7.m7.1.1.1.1.1.1.cmml" xref="S2.SS1.p1.7.m7.1.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p1.7.m7.1.1.1.1.1.2.cmml" xref="S2.SS1.p1.7.m7.1.1.1.1.1.2">𝜇</ci><ci id="S2.SS1.p1.7.m7.1.1.1.1.1.3.cmml" xref="S2.SS1.p1.7.m7.1.1.1.1.1.3">𝑥</ci></apply><apply id="S2.SS1.p1.7.m7.2.2.2.2.2.cmml" xref="S2.SS1.p1.7.m7.2.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p1.7.m7.2.2.2.2.2.1.cmml" xref="S2.SS1.p1.7.m7.2.2.2.2.2">subscript</csymbol><ci id="S2.SS1.p1.7.m7.2.2.2.2.2.2.cmml" xref="S2.SS1.p1.7.m7.2.2.2.2.2.2">𝜇</ci><ci id="S2.SS1.p1.7.m7.2.2.2.2.2.3.cmml" xref="S2.SS1.p1.7.m7.2.2.2.2.2.3">𝑦</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.7.m7.2c">\bm{\mu}=[\mu_{x},\mu_{y}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.7.m7.2d">bold_italic_μ = [ italic_μ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ]</annotation></semantics></math> and <math alttext="\bm{\Sigma}=\begin{bmatrix}\sigma_{xx}&amp;\sigma_{xy}\\ \sigma_{yx}&amp;\sigma_{yy}\end{bmatrix}" class="ltx_Math" display="inline" id="S2.SS1.p1.8.m8.1"><semantics id="S2.SS1.p1.8.m8.1a"><mrow id="S2.SS1.p1.8.m8.1.2" xref="S2.SS1.p1.8.m8.1.2.cmml"><mi id="S2.SS1.p1.8.m8.1.2.2" xref="S2.SS1.p1.8.m8.1.2.2.cmml">𝚺</mi><mo id="S2.SS1.p1.8.m8.1.2.1" xref="S2.SS1.p1.8.m8.1.2.1.cmml">=</mo><mrow id="S2.SS1.p1.8.m8.1.1.3" xref="S2.SS1.p1.8.m8.1.1.2.cmml"><mo id="S2.SS1.p1.8.m8.1.1.3.1" 
xref="S2.SS1.p1.8.m8.1.1.2.1.cmml">[</mo><mtable columnspacing="5pt" id="S2.SS1.p1.8.m8.1.1.1.1" rowspacing="0pt" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><mtr id="S2.SS1.p1.8.m8.1.1.1.1a" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><mtd id="S2.SS1.p1.8.m8.1.1.1.1b" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><msub id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.2" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.2.cmml">σ</mi><mrow id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.2" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.2.cmml">x</mi><mo id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.1" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.3" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.3.cmml">x</mi></mrow></msub></mtd><mtd id="S2.SS1.p1.8.m8.1.1.1.1c" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><msub id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.2" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.2.cmml">σ</mi><mrow id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.2" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.2.cmml">x</mi><mo id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.1" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.3" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.3.cmml">y</mi></mrow></msub></mtd></mtr><mtr id="S2.SS1.p1.8.m8.1.1.1.1d" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><mtd id="S2.SS1.p1.8.m8.1.1.1.1e" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><msub id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.2" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.2.cmml">σ</mi><mrow id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.2" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.2.cmml">y</mi><mo id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.1" 
xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.3" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.3.cmml">x</mi></mrow></msub></mtd><mtd id="S2.SS1.p1.8.m8.1.1.1.1f" xref="S2.SS1.p1.8.m8.1.1.1.1.cmml"><msub id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.2" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.2.cmml">σ</mi><mrow id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.cmml"><mi id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.2" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.2.cmml">y</mi><mo id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.1" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.3" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.3.cmml">y</mi></mrow></msub></mtd></mtr></mtable><mo id="S2.SS1.p1.8.m8.1.1.3.2" xref="S2.SS1.p1.8.m8.1.1.2.1.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.8.m8.1b"><apply id="S2.SS1.p1.8.m8.1.2.cmml" xref="S2.SS1.p1.8.m8.1.2"><eq id="S2.SS1.p1.8.m8.1.2.1.cmml" xref="S2.SS1.p1.8.m8.1.2.1"></eq><ci id="S2.SS1.p1.8.m8.1.2.2.cmml" xref="S2.SS1.p1.8.m8.1.2.2">𝚺</ci><apply id="S2.SS1.p1.8.m8.1.1.2.cmml" xref="S2.SS1.p1.8.m8.1.1.3"><csymbol cd="latexml" id="S2.SS1.p1.8.m8.1.1.2.1.cmml" xref="S2.SS1.p1.8.m8.1.1.3.1">matrix</csymbol><matrix id="S2.SS1.p1.8.m8.1.1.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1"><matrixrow id="S2.SS1.p1.8.m8.1.1.1.1a.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1"><apply id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.2">𝜎</ci><apply id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3"><times id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.1"></times><ci id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.2.cmml" 
xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.2">𝑥</ci><ci id="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.1.1.3.3">𝑥</ci></apply></apply><apply id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1"><csymbol cd="ambiguous" id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1">subscript</csymbol><ci id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.2">𝜎</ci><apply id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3"><times id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.1"></times><ci id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.2">𝑥</ci><ci id="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.1.2.1.3.3">𝑦</ci></apply></apply></matrixrow><matrixrow id="S2.SS1.p1.8.m8.1.1.1.1b.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1"><apply id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1">subscript</csymbol><ci id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.2">𝜎</ci><apply id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3"><times id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.1"></times><ci id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.2">𝑦</ci><ci id="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.1.1.3.3">𝑥</ci></apply></apply><apply id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1"><csymbol cd="ambiguous" id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1">subscript</csymbol><ci id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.2">𝜎</ci><apply id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3"><times 
id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.1.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.1"></times><ci id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.2.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.2">𝑦</ci><ci id="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.3.cmml" xref="S2.SS1.p1.8.m8.1.1.1.1.2.2.1.3.3">𝑦</ci></apply></apply></matrixrow></matrix></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.8.m8.1c">\bm{\Sigma}=\begin{bmatrix}\sigma_{xx}&amp;\sigma_{xy}\\ \sigma_{yx}&amp;\sigma_{yy}\end{bmatrix}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.8.m8.1d">bold_Σ = [ start_ARG start_ROW start_CELL italic_σ start_POSTSUBSCRIPT italic_x italic_x end_POSTSUBSCRIPT end_CELL start_CELL italic_σ start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_σ start_POSTSUBSCRIPT italic_y italic_x end_POSTSUBSCRIPT end_CELL start_CELL italic_σ start_POSTSUBSCRIPT italic_y italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ]</annotation></semantics></math>, where <math alttext="\mu_{x}" class="ltx_Math" display="inline" id="S2.SS1.p1.9.m9.1"><semantics id="S2.SS1.p1.9.m9.1a"><msub id="S2.SS1.p1.9.m9.1.1" xref="S2.SS1.p1.9.m9.1.1.cmml"><mi id="S2.SS1.p1.9.m9.1.1.2" xref="S2.SS1.p1.9.m9.1.1.2.cmml">μ</mi><mi id="S2.SS1.p1.9.m9.1.1.3" xref="S2.SS1.p1.9.m9.1.1.3.cmml">x</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.9.m9.1b"><apply id="S2.SS1.p1.9.m9.1.1.cmml" xref="S2.SS1.p1.9.m9.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.9.m9.1.1.1.cmml" xref="S2.SS1.p1.9.m9.1.1">subscript</csymbol><ci id="S2.SS1.p1.9.m9.1.1.2.cmml" xref="S2.SS1.p1.9.m9.1.1.2">𝜇</ci><ci id="S2.SS1.p1.9.m9.1.1.3.cmml" xref="S2.SS1.p1.9.m9.1.1.3">𝑥</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.9.m9.1c">\mu_{x}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.9.m9.1d">italic_μ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT</annotation></semantics></math> and <math 
alttext="\mu_{y}" class="ltx_Math" display="inline" id="S2.SS1.p1.10.m10.1"><semantics id="S2.SS1.p1.10.m10.1a"><msub id="S2.SS1.p1.10.m10.1.1" xref="S2.SS1.p1.10.m10.1.1.cmml"><mi id="S2.SS1.p1.10.m10.1.1.2" xref="S2.SS1.p1.10.m10.1.1.2.cmml">μ</mi><mi id="S2.SS1.p1.10.m10.1.1.3" xref="S2.SS1.p1.10.m10.1.1.3.cmml">y</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.10.m10.1b"><apply id="S2.SS1.p1.10.m10.1.1.cmml" xref="S2.SS1.p1.10.m10.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.10.m10.1.1.1.cmml" xref="S2.SS1.p1.10.m10.1.1">subscript</csymbol><ci id="S2.SS1.p1.10.m10.1.1.2.cmml" xref="S2.SS1.p1.10.m10.1.1.2">𝜇</ci><ci id="S2.SS1.p1.10.m10.1.1.3.cmml" xref="S2.SS1.p1.10.m10.1.1.3">𝑦</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.10.m10.1c">\mu_{y}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.10.m10.1d">italic_μ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT</annotation></semantics></math> are the means of the Gaussian distribution, <math alttext="[\sigma_{xx},\sigma_{yy}]" class="ltx_Math" display="inline" id="S2.SS1.p1.11.m11.2"><semantics id="S2.SS1.p1.11.m11.2a"><mrow id="S2.SS1.p1.11.m11.2.2.2" xref="S2.SS1.p1.11.m11.2.2.3.cmml"><mo id="S2.SS1.p1.11.m11.2.2.2.3" stretchy="false" xref="S2.SS1.p1.11.m11.2.2.3.cmml">[</mo><msub id="S2.SS1.p1.11.m11.1.1.1.1" xref="S2.SS1.p1.11.m11.1.1.1.1.cmml"><mi id="S2.SS1.p1.11.m11.1.1.1.1.2" xref="S2.SS1.p1.11.m11.1.1.1.1.2.cmml">σ</mi><mrow id="S2.SS1.p1.11.m11.1.1.1.1.3" xref="S2.SS1.p1.11.m11.1.1.1.1.3.cmml"><mi id="S2.SS1.p1.11.m11.1.1.1.1.3.2" xref="S2.SS1.p1.11.m11.1.1.1.1.3.2.cmml">x</mi><mo id="S2.SS1.p1.11.m11.1.1.1.1.3.1" xref="S2.SS1.p1.11.m11.1.1.1.1.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.11.m11.1.1.1.1.3.3" xref="S2.SS1.p1.11.m11.1.1.1.1.3.3.cmml">x</mi></mrow></msub><mo id="S2.SS1.p1.11.m11.2.2.2.4" xref="S2.SS1.p1.11.m11.2.2.3.cmml">,</mo><msub id="S2.SS1.p1.11.m11.2.2.2.2" xref="S2.SS1.p1.11.m11.2.2.2.2.cmml"><mi 
id="S2.SS1.p1.11.m11.2.2.2.2.2" xref="S2.SS1.p1.11.m11.2.2.2.2.2.cmml">σ</mi><mrow id="S2.SS1.p1.11.m11.2.2.2.2.3" xref="S2.SS1.p1.11.m11.2.2.2.2.3.cmml"><mi id="S2.SS1.p1.11.m11.2.2.2.2.3.2" xref="S2.SS1.p1.11.m11.2.2.2.2.3.2.cmml">y</mi><mo id="S2.SS1.p1.11.m11.2.2.2.2.3.1" xref="S2.SS1.p1.11.m11.2.2.2.2.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.11.m11.2.2.2.2.3.3" xref="S2.SS1.p1.11.m11.2.2.2.2.3.3.cmml">y</mi></mrow></msub><mo id="S2.SS1.p1.11.m11.2.2.2.5" stretchy="false" xref="S2.SS1.p1.11.m11.2.2.3.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.11.m11.2b"><interval closure="closed" id="S2.SS1.p1.11.m11.2.2.3.cmml" xref="S2.SS1.p1.11.m11.2.2.2"><apply id="S2.SS1.p1.11.m11.1.1.1.1.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.11.m11.1.1.1.1.1.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p1.11.m11.1.1.1.1.2.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1.2">𝜎</ci><apply id="S2.SS1.p1.11.m11.1.1.1.1.3.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1.3"><times id="S2.SS1.p1.11.m11.1.1.1.1.3.1.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1.3.1"></times><ci id="S2.SS1.p1.11.m11.1.1.1.1.3.2.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1.3.2">𝑥</ci><ci id="S2.SS1.p1.11.m11.1.1.1.1.3.3.cmml" xref="S2.SS1.p1.11.m11.1.1.1.1.3.3">𝑥</ci></apply></apply><apply id="S2.SS1.p1.11.m11.2.2.2.2.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p1.11.m11.2.2.2.2.1.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2">subscript</csymbol><ci id="S2.SS1.p1.11.m11.2.2.2.2.2.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2.2">𝜎</ci><apply id="S2.SS1.p1.11.m11.2.2.2.2.3.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2.3"><times id="S2.SS1.p1.11.m11.2.2.2.2.3.1.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2.3.1"></times><ci id="S2.SS1.p1.11.m11.2.2.2.2.3.2.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2.3.2">𝑦</ci><ci id="S2.SS1.p1.11.m11.2.2.2.2.3.3.cmml" xref="S2.SS1.p1.11.m11.2.2.2.2.3.3">𝑦</ci></apply></apply></interval></annotation-xml><annotation encoding="application/x-tex" 
id="S2.SS1.p1.11.m11.2c">[\sigma_{xx},\sigma_{yy}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.11.m11.2d">[ italic_σ start_POSTSUBSCRIPT italic_x italic_x end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_y italic_y end_POSTSUBSCRIPT ]</annotation></semantics></math> are the variances corresponding to the <math alttext="x" class="ltx_Math" display="inline" id="S2.SS1.p1.12.m12.1"><semantics id="S2.SS1.p1.12.m12.1a"><mi id="S2.SS1.p1.12.m12.1.1" xref="S2.SS1.p1.12.m12.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.12.m12.1b"><ci id="S2.SS1.p1.12.m12.1.1.cmml" xref="S2.SS1.p1.12.m12.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.12.m12.1c">x</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.12.m12.1d">italic_x</annotation></semantics></math> and <math alttext="y" class="ltx_Math" display="inline" id="S2.SS1.p1.13.m13.1"><semantics id="S2.SS1.p1.13.m13.1a"><mi id="S2.SS1.p1.13.m13.1.1" xref="S2.SS1.p1.13.m13.1.1.cmml">y</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.13.m13.1b"><ci id="S2.SS1.p1.13.m13.1.1.cmml" xref="S2.SS1.p1.13.m13.1.1">𝑦</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.13.m13.1c">y</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.13.m13.1d">italic_y</annotation></semantics></math> directions, respectively, and <math alttext="[\sigma_{xy},\sigma_{yx}]" class="ltx_Math" display="inline" id="S2.SS1.p1.14.m14.2"><semantics id="S2.SS1.p1.14.m14.2a"><mrow id="S2.SS1.p1.14.m14.2.2.2" xref="S2.SS1.p1.14.m14.2.2.3.cmml"><mo id="S2.SS1.p1.14.m14.2.2.2.3" stretchy="false" xref="S2.SS1.p1.14.m14.2.2.3.cmml">[</mo><msub id="S2.SS1.p1.14.m14.1.1.1.1" xref="S2.SS1.p1.14.m14.1.1.1.1.cmml"><mi id="S2.SS1.p1.14.m14.1.1.1.1.2" xref="S2.SS1.p1.14.m14.1.1.1.1.2.cmml">σ</mi><mrow id="S2.SS1.p1.14.m14.1.1.1.1.3" xref="S2.SS1.p1.14.m14.1.1.1.1.3.cmml"><mi id="S2.SS1.p1.14.m14.1.1.1.1.3.2" 
xref="S2.SS1.p1.14.m14.1.1.1.1.3.2.cmml">x</mi><mo id="S2.SS1.p1.14.m14.1.1.1.1.3.1" xref="S2.SS1.p1.14.m14.1.1.1.1.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.14.m14.1.1.1.1.3.3" xref="S2.SS1.p1.14.m14.1.1.1.1.3.3.cmml">y</mi></mrow></msub><mo id="S2.SS1.p1.14.m14.2.2.2.4" xref="S2.SS1.p1.14.m14.2.2.3.cmml">,</mo><msub id="S2.SS1.p1.14.m14.2.2.2.2" xref="S2.SS1.p1.14.m14.2.2.2.2.cmml"><mi id="S2.SS1.p1.14.m14.2.2.2.2.2" xref="S2.SS1.p1.14.m14.2.2.2.2.2.cmml">σ</mi><mrow id="S2.SS1.p1.14.m14.2.2.2.2.3" xref="S2.SS1.p1.14.m14.2.2.2.2.3.cmml"><mi id="S2.SS1.p1.14.m14.2.2.2.2.3.2" xref="S2.SS1.p1.14.m14.2.2.2.2.3.2.cmml">y</mi><mo id="S2.SS1.p1.14.m14.2.2.2.2.3.1" xref="S2.SS1.p1.14.m14.2.2.2.2.3.1.cmml">⁢</mo><mi id="S2.SS1.p1.14.m14.2.2.2.2.3.3" xref="S2.SS1.p1.14.m14.2.2.2.2.3.3.cmml">x</mi></mrow></msub><mo id="S2.SS1.p1.14.m14.2.2.2.5" stretchy="false" xref="S2.SS1.p1.14.m14.2.2.3.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.14.m14.2b"><interval closure="closed" id="S2.SS1.p1.14.m14.2.2.3.cmml" xref="S2.SS1.p1.14.m14.2.2.2"><apply id="S2.SS1.p1.14.m14.1.1.1.1.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p1.14.m14.1.1.1.1.1.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1">subscript</csymbol><ci id="S2.SS1.p1.14.m14.1.1.1.1.2.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1.2">𝜎</ci><apply id="S2.SS1.p1.14.m14.1.1.1.1.3.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1.3"><times id="S2.SS1.p1.14.m14.1.1.1.1.3.1.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1.3.1"></times><ci id="S2.SS1.p1.14.m14.1.1.1.1.3.2.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1.3.2">𝑥</ci><ci id="S2.SS1.p1.14.m14.1.1.1.1.3.3.cmml" xref="S2.SS1.p1.14.m14.1.1.1.1.3.3">𝑦</ci></apply></apply><apply id="S2.SS1.p1.14.m14.2.2.2.2.cmml" xref="S2.SS1.p1.14.m14.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p1.14.m14.2.2.2.2.1.cmml" xref="S2.SS1.p1.14.m14.2.2.2.2">subscript</csymbol><ci id="S2.SS1.p1.14.m14.2.2.2.2.2.cmml" xref="S2.SS1.p1.14.m14.2.2.2.2.2">𝜎</ci><apply id="S2.SS1.p1.14.m14.2.2.2.2.3.cmml" 
xref="S2.SS1.p1.14.m14.2.2.2.2.3"><times id="S2.SS1.p1.14.m14.2.2.2.2.3.1.cmml" xref="S2.SS1.p1.14.m14.2.2.2.2.3.1"></times><ci id="S2.SS1.p1.14.m14.2.2.2.2.3.2.cmml" xref="S2.SS1.p1.14.m14.2.2.2.2.3.2">𝑦</ci><ci id="S2.SS1.p1.14.m14.2.2.2.2.3.3.cmml" xref="S2.SS1.p1.14.m14.2.2.2.2.3.3">𝑥</ci></apply></apply></interval></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.14.m14.2c">[\sigma_{xy},\sigma_{yx}]</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.14.m14.2d">[ italic_σ start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_y italic_x end_POSTSUBSCRIPT ]</annotation></semantics></math> are the covariances, indicating the correlation between <math alttext="x" class="ltx_Math" display="inline" id="S2.SS1.p1.15.m15.1"><semantics id="S2.SS1.p1.15.m15.1a"><mi id="S2.SS1.p1.15.m15.1.1" xref="S2.SS1.p1.15.m15.1.1.cmml">x</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.15.m15.1b"><ci id="S2.SS1.p1.15.m15.1.1.cmml" xref="S2.SS1.p1.15.m15.1.1">𝑥</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.15.m15.1c">x</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.15.m15.1d">italic_x</annotation></semantics></math> and <math alttext="y" class="ltx_Math" display="inline" id="S2.SS1.p1.16.m16.1"><semantics id="S2.SS1.p1.16.m16.1a"><mi id="S2.SS1.p1.16.m16.1.1" xref="S2.SS1.p1.16.m16.1.1.cmml">y</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.16.m16.1b"><ci id="S2.SS1.p1.16.m16.1.1.cmml" xref="S2.SS1.p1.16.m16.1.1">𝑦</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.16.m16.1c">y</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.16.m16.1d">italic_y</annotation></semantics></math>.</p> </div> <figure class="ltx_figure" id="S2.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="379" id="S2.F2.g1" src="x2.png" width="747"/> <figcaption 
class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span><span class="ltx_text ltx_font_bold" id="S2.F2.12.1">Blob Formula.</span> A blob can be represented in two equivalent forms: geometrically as an ellipse parameterized by center coordinates <math alttext="(C_{x},C_{y})" class="ltx_Math" display="inline" id="S2.F2.6.m1.2"><semantics id="S2.F2.6.m1.2b"><mrow id="S2.F2.6.m1.2.2.2" xref="S2.F2.6.m1.2.2.3.cmml"><mo id="S2.F2.6.m1.2.2.2.3" stretchy="false" xref="S2.F2.6.m1.2.2.3.cmml">(</mo><msub id="S2.F2.6.m1.1.1.1.1" xref="S2.F2.6.m1.1.1.1.1.cmml"><mi id="S2.F2.6.m1.1.1.1.1.2" xref="S2.F2.6.m1.1.1.1.1.2.cmml">C</mi><mi id="S2.F2.6.m1.1.1.1.1.3" xref="S2.F2.6.m1.1.1.1.1.3.cmml">x</mi></msub><mo id="S2.F2.6.m1.2.2.2.4" xref="S2.F2.6.m1.2.2.3.cmml">,</mo><msub id="S2.F2.6.m1.2.2.2.2" xref="S2.F2.6.m1.2.2.2.2.cmml"><mi id="S2.F2.6.m1.2.2.2.2.2" xref="S2.F2.6.m1.2.2.2.2.2.cmml">C</mi><mi id="S2.F2.6.m1.2.2.2.2.3" xref="S2.F2.6.m1.2.2.2.2.3.cmml">y</mi></msub><mo id="S2.F2.6.m1.2.2.2.5" stretchy="false" xref="S2.F2.6.m1.2.2.3.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.F2.6.m1.2c"><interval closure="open" id="S2.F2.6.m1.2.2.3.cmml" xref="S2.F2.6.m1.2.2.2"><apply id="S2.F2.6.m1.1.1.1.1.cmml" xref="S2.F2.6.m1.1.1.1.1"><csymbol cd="ambiguous" id="S2.F2.6.m1.1.1.1.1.1.cmml" xref="S2.F2.6.m1.1.1.1.1">subscript</csymbol><ci id="S2.F2.6.m1.1.1.1.1.2.cmml" xref="S2.F2.6.m1.1.1.1.1.2">𝐶</ci><ci id="S2.F2.6.m1.1.1.1.1.3.cmml" xref="S2.F2.6.m1.1.1.1.1.3">𝑥</ci></apply><apply id="S2.F2.6.m1.2.2.2.2.cmml" xref="S2.F2.6.m1.2.2.2.2"><csymbol cd="ambiguous" id="S2.F2.6.m1.2.2.2.2.1.cmml" xref="S2.F2.6.m1.2.2.2.2">subscript</csymbol><ci id="S2.F2.6.m1.2.2.2.2.2.cmml" xref="S2.F2.6.m1.2.2.2.2.2">𝐶</ci><ci id="S2.F2.6.m1.2.2.2.2.3.cmml" xref="S2.F2.6.m1.2.2.2.2.3">𝑦</ci></apply></interval></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.6.m1.2d">(C_{x},C_{y})</annotation><annotation 
encoding="application/x-llamapun" id="S2.F2.6.m1.2e">( italic_C start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT )</annotation></semantics></math>, axes lengths <math alttext="(a,b)" class="ltx_Math" display="inline" id="S2.F2.7.m2.2"><semantics id="S2.F2.7.m2.2b"><mrow id="S2.F2.7.m2.2.3.2" xref="S2.F2.7.m2.2.3.1.cmml"><mo id="S2.F2.7.m2.2.3.2.1" stretchy="false" xref="S2.F2.7.m2.2.3.1.cmml">(</mo><mi id="S2.F2.7.m2.1.1" xref="S2.F2.7.m2.1.1.cmml">a</mi><mo id="S2.F2.7.m2.2.3.2.2" xref="S2.F2.7.m2.2.3.1.cmml">,</mo><mi id="S2.F2.7.m2.2.2" xref="S2.F2.7.m2.2.2.cmml">b</mi><mo id="S2.F2.7.m2.2.3.2.3" stretchy="false" xref="S2.F2.7.m2.2.3.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.F2.7.m2.2c"><interval closure="open" id="S2.F2.7.m2.2.3.1.cmml" xref="S2.F2.7.m2.2.3.2"><ci id="S2.F2.7.m2.1.1.cmml" xref="S2.F2.7.m2.1.1">𝑎</ci><ci id="S2.F2.7.m2.2.2.cmml" xref="S2.F2.7.m2.2.2">𝑏</ci></interval></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.7.m2.2d">(a,b)</annotation><annotation encoding="application/x-llamapun" id="S2.F2.7.m2.2e">( italic_a , italic_b )</annotation></semantics></math>, and orientation <math alttext="\theta" class="ltx_Math" display="inline" id="S2.F2.8.m3.1"><semantics id="S2.F2.8.m3.1b"><mi id="S2.F2.8.m3.1.1" xref="S2.F2.8.m3.1.1.cmml">θ</mi><annotation-xml encoding="MathML-Content" id="S2.F2.8.m3.1c"><ci id="S2.F2.8.m3.1.1.cmml" xref="S2.F2.8.m3.1.1">𝜃</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.8.m3.1d">\theta</annotation><annotation encoding="application/x-llamapun" id="S2.F2.8.m3.1e">italic_θ</annotation></semantics></math>; and statistically as a 2D Gaussian distribution characterized by mean <math alttext="\bm{\mu}" class="ltx_Math" display="inline" id="S2.F2.9.m4.1"><semantics id="S2.F2.9.m4.1b"><mi id="S2.F2.9.m4.1.1" xref="S2.F2.9.m4.1.1.cmml">𝝁</mi><annotation-xml encoding="MathML-Content" id="S2.F2.9.m4.1c"><ci 
id="S2.F2.9.m4.1.1.cmml" xref="S2.F2.9.m4.1.1">𝝁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.9.m4.1d">\bm{\mu}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.9.m4.1e">bold_italic_μ</annotation></semantics></math> and covariance matrix <math alttext="\bm{\Sigma}" class="ltx_Math" display="inline" id="S2.F2.10.m5.1"><semantics id="S2.F2.10.m5.1b"><mi id="S2.F2.10.m5.1.1" xref="S2.F2.10.m5.1.1.cmml">𝚺</mi><annotation-xml encoding="MathML-Content" id="S2.F2.10.m5.1c"><ci id="S2.F2.10.m5.1.1.cmml" xref="S2.F2.10.m5.1.1">𝚺</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.10.m5.1d">\bm{\Sigma}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.10.m5.1e">bold_Σ</annotation></semantics></math>. The two forms are exactly equivalent and interchangeable.</figcaption> </figure> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.2 </span>Blob Opacity</h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.1">Notably, representing the blob as a Gaussian allows its opacity to be computed at every spatial location, which gives rise to the concepts of blob splatting and blob composition. 
These concepts are crucial for achieving smooth rendering and seamless integration of visual elements in graphics.</p> </div> <div class="ltx_para" id="S2.SS2.p2"> <p class="ltx_p" id="S2.SS2.p2.11">In particular, the squared Mahalanobis distance <cite class="ltx_cite ltx_citemacro_citep">(Mahalanobis, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib29" title="">1936</a>)</cite> to the blob center is first computed:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S2.E1"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E1X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle d_{M}(\bm{x}_{\text{grid}},\bm{Q})=(\bm{x}_{\text{grid}}-\bm{\mu% })^{T}\bm{\Sigma}^{-1}(\bm{x}_{\text{grid}}-\bm{\mu})," class="ltx_Math" display="inline" id="S2.E1X.2.1.1.m1.2"><semantics id="S2.E1X.2.1.1.m1.2a"><mrow id="S2.E1X.2.1.1.m1.2.2.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.cmml"><mrow id="S2.E1X.2.1.1.m1.2.2.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.cmml"><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.cmml"><msub id="S2.E1X.2.1.1.m1.2.2.1.1.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3.cmml"><mi id="S2.E1X.2.1.1.m1.2.2.1.1.1.3.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3.2.cmml">d</mi><mi id="S2.E1X.2.1.1.m1.2.2.1.1.1.3.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3.3.cmml">M</mi></msub><mo id="S2.E1X.2.1.1.m1.2.2.1.1.1.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.2.cmml">⁢</mo><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.2.cmml"><mo id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.2" stretchy="false" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.2.cmml">(</mo><msub id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.cmml"><mi id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.2.cmml">𝒙</mi><mtext id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.3a.cmml">grid</mtext></msub><mo 
id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.2.cmml">,</mo><mi id="S2.E1X.2.1.1.m1.1.1" xref="S2.E1X.2.1.1.m1.1.1.cmml">𝑸</mi><mo id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.4" stretchy="false" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S2.E1X.2.1.1.m1.2.2.1.1.4" xref="S2.E1X.2.1.1.m1.2.2.1.1.4.cmml">=</mo><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.cmml"><msup id="S2.E1X.2.1.1.m1.2.2.1.1.2.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.cmml"><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.cmml"><mo id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.2" stretchy="false" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.cmml">(</mo><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.cmml"><msub id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.cmml"><mi id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.2.cmml">𝒙</mi><mtext id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.3a.cmml">grid</mtext></msub><mo id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.1.cmml">−</mo><mi id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.3.cmml">𝝁</mi></mrow><mo id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.3" stretchy="false" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.cmml">)</mo></mrow><mi id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.3.cmml">T</mi></msup><mo id="S2.E1X.2.1.1.m1.2.2.1.1.3.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.3.cmml">⁢</mo><msup id="S2.E1X.2.1.1.m1.2.2.1.1.3.4" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.cmml"><mi id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.2.cmml">𝚺</mi><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.cmml"><mo id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3a" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.cmml">−</mo><mn id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.2" 
xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.2.cmml">1</mn></mrow></msup><mo id="S2.E1X.2.1.1.m1.2.2.1.1.3.3a" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.3.cmml">⁢</mo><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.cmml"><mo id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.2" stretchy="false" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.cmml">(</mo><mrow id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.cmml"><msub id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.cmml"><mi id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.2.cmml">𝒙</mi><mtext id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.3a.cmml">grid</mtext></msub><mo id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.1" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.1.cmml">−</mo><mi id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.3" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.3.cmml">𝝁</mi></mrow><mo id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.3" stretchy="false" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E1X.2.1.1.m1.2.2.1.2" xref="S2.E1X.2.1.1.m1.2.2.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E1X.2.1.1.m1.2b"><apply id="S2.E1X.2.1.1.m1.2.2.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1"><eq id="S2.E1X.2.1.1.m1.2.2.1.1.4.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.4"></eq><apply id="S2.E1X.2.1.1.m1.2.2.1.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1"><times id="S2.E1X.2.1.1.m1.2.2.1.1.1.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.2"></times><apply id="S2.E1X.2.1.1.m1.2.2.1.1.1.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S2.E1X.2.1.1.m1.2.2.1.1.1.3.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3">subscript</csymbol><ci id="S2.E1X.2.1.1.m1.2.2.1.1.1.3.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3.2">𝑑</ci><ci id="S2.E1X.2.1.1.m1.2.2.1.1.1.3.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.3.3">𝑀</ci></apply><interval closure="open" id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.2.cmml" 
xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1"><apply id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.2">𝒙</ci><ci id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.3a.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.3"><mtext id="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S2.E1X.2.1.1.m1.2.2.1.1.1.1.1.1.3">grid</mtext></ci></apply><ci id="S2.E1X.2.1.1.m1.1.1.cmml" xref="S2.E1X.2.1.1.m1.1.1">𝑸</ci></interval></apply><apply id="S2.E1X.2.1.1.m1.2.2.1.1.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3"><times id="S2.E1X.2.1.1.m1.2.2.1.1.3.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.3"></times><apply id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1"><csymbol cd="ambiguous" id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1">superscript</csymbol><apply id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1"><minus id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.1"></minus><apply id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2">subscript</csymbol><ci id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.2">𝒙</ci><ci id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.3a.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.3"><mtext id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.3.cmml" mathsize="70%" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.2.3">grid</mtext></ci></apply><ci id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.1.1.1.3">𝝁</ci></apply><ci id="S2.E1X.2.1.1.m1.2.2.1.1.2.1.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.2.1.3">𝑇</ci></apply><apply 
id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4"><csymbol cd="ambiguous" id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4">superscript</csymbol><ci id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.2">𝚺</ci><apply id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3"><minus id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3"></minus><cn id="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.2.cmml" type="integer" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.4.3.2">1</cn></apply></apply><apply id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1"><minus id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.1"></minus><apply id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2"><csymbol cd="ambiguous" id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.1.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2">subscript</csymbol><ci id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.2.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.2">𝒙</ci><ci id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.3a.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.3"><mtext id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.3.cmml" mathsize="70%" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.2.3">grid</mtext></ci></apply><ci id="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.3.cmml" xref="S2.E1X.2.1.1.m1.2.2.1.1.3.2.1.1.3">𝝁</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1X.2.1.1.m1.2c">\displaystyle d_{M}(\bm{x}_{\text{grid}},\bm{Q})=(\bm{x}_{\text{grid}}-\bm{\mu% })^{T}\bm{\Sigma}^{-1}(\bm{x}_{\text{grid}}-\bm{\mu}),</annotation><annotation encoding="application/x-llamapun" id="S2.E1X.2.1.1.m1.2d">italic_d start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT , bold_italic_Q ) = ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT - bold_italic_μ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT 
bold_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT - bold_italic_μ ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(1)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S2.SS2.p2.8">where <math alttext="\bm{x}_{\text{grid}}\in\left\{\left(\frac{w}{W},\frac{h}{H}\right)\right\}_{w=% 1,h=1}^{W,H}" class="ltx_Math" display="inline" id="S2.SS2.p2.1.m1.7"><semantics id="S2.SS2.p2.1.m1.7a"><mrow id="S2.SS2.p2.1.m1.7.7" xref="S2.SS2.p2.1.m1.7.7.cmml"><msub id="S2.SS2.p2.1.m1.7.7.3" xref="S2.SS2.p2.1.m1.7.7.3.cmml"><mi id="S2.SS2.p2.1.m1.7.7.3.2" xref="S2.SS2.p2.1.m1.7.7.3.2.cmml">𝒙</mi><mtext id="S2.SS2.p2.1.m1.7.7.3.3" xref="S2.SS2.p2.1.m1.7.7.3.3a.cmml">grid</mtext></msub><mo id="S2.SS2.p2.1.m1.7.7.2" xref="S2.SS2.p2.1.m1.7.7.2.cmml">∈</mo><msubsup id="S2.SS2.p2.1.m1.7.7.1" xref="S2.SS2.p2.1.m1.7.7.1.cmml"><mrow id="S2.SS2.p2.1.m1.7.7.1.1.1.1" xref="S2.SS2.p2.1.m1.7.7.1.1.1.2.cmml"><mo id="S2.SS2.p2.1.m1.7.7.1.1.1.1.2" xref="S2.SS2.p2.1.m1.7.7.1.1.1.2.cmml">{</mo><mrow id="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.2" xref="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.1.cmml"><mo id="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.2.1" xref="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.1.cmml">(</mo><mfrac id="S2.SS2.p2.1.m1.5.5" xref="S2.SS2.p2.1.m1.5.5.cmml"><mi id="S2.SS2.p2.1.m1.5.5.2" xref="S2.SS2.p2.1.m1.5.5.2.cmml">w</mi><mi id="S2.SS2.p2.1.m1.5.5.3" xref="S2.SS2.p2.1.m1.5.5.3.cmml">W</mi></mfrac><mo id="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.2.2" xref="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.1.cmml">,</mo><mfrac id="S2.SS2.p2.1.m1.6.6" xref="S2.SS2.p2.1.m1.6.6.cmml"><mi id="S2.SS2.p2.1.m1.6.6.2" xref="S2.SS2.p2.1.m1.6.6.2.cmml">h</mi><mi id="S2.SS2.p2.1.m1.6.6.3" xref="S2.SS2.p2.1.m1.6.6.3.cmml">H</mi></mfrac><mo id="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.2.3" 
xref="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="S2.SS2.p2.1.m1.7.7.1.1.1.1.3" xref="S2.SS2.p2.1.m1.7.7.1.1.1.2.cmml">}</mo></mrow><mrow id="S2.SS2.p2.1.m1.2.2.2.2" xref="S2.SS2.p2.1.m1.2.2.2.3.cmml"><mrow id="S2.SS2.p2.1.m1.1.1.1.1.1" xref="S2.SS2.p2.1.m1.1.1.1.1.1.cmml"><mi id="S2.SS2.p2.1.m1.1.1.1.1.1.2" xref="S2.SS2.p2.1.m1.1.1.1.1.1.2.cmml">w</mi><mo id="S2.SS2.p2.1.m1.1.1.1.1.1.1" xref="S2.SS2.p2.1.m1.1.1.1.1.1.1.cmml">=</mo><mn id="S2.SS2.p2.1.m1.1.1.1.1.1.3" xref="S2.SS2.p2.1.m1.1.1.1.1.1.3.cmml">1</mn></mrow><mo id="S2.SS2.p2.1.m1.2.2.2.2.3" xref="S2.SS2.p2.1.m1.2.2.2.3a.cmml">,</mo><mrow id="S2.SS2.p2.1.m1.2.2.2.2.2" xref="S2.SS2.p2.1.m1.2.2.2.2.2.cmml"><mi id="S2.SS2.p2.1.m1.2.2.2.2.2.2" xref="S2.SS2.p2.1.m1.2.2.2.2.2.2.cmml">h</mi><mo id="S2.SS2.p2.1.m1.2.2.2.2.2.1" xref="S2.SS2.p2.1.m1.2.2.2.2.2.1.cmml">=</mo><mn id="S2.SS2.p2.1.m1.2.2.2.2.2.3" xref="S2.SS2.p2.1.m1.2.2.2.2.2.3.cmml">1</mn></mrow></mrow><mrow id="S2.SS2.p2.1.m1.4.4.2.4" xref="S2.SS2.p2.1.m1.4.4.2.3.cmml"><mi id="S2.SS2.p2.1.m1.3.3.1.1" xref="S2.SS2.p2.1.m1.3.3.1.1.cmml">W</mi><mo id="S2.SS2.p2.1.m1.4.4.2.4.1" xref="S2.SS2.p2.1.m1.4.4.2.3.cmml">,</mo><mi id="S2.SS2.p2.1.m1.4.4.2.2" xref="S2.SS2.p2.1.m1.4.4.2.2.cmml">H</mi></mrow></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.1.m1.7b"><apply id="S2.SS2.p2.1.m1.7.7.cmml" xref="S2.SS2.p2.1.m1.7.7"><in id="S2.SS2.p2.1.m1.7.7.2.cmml" xref="S2.SS2.p2.1.m1.7.7.2"></in><apply id="S2.SS2.p2.1.m1.7.7.3.cmml" xref="S2.SS2.p2.1.m1.7.7.3"><csymbol cd="ambiguous" id="S2.SS2.p2.1.m1.7.7.3.1.cmml" xref="S2.SS2.p2.1.m1.7.7.3">subscript</csymbol><ci id="S2.SS2.p2.1.m1.7.7.3.2.cmml" xref="S2.SS2.p2.1.m1.7.7.3.2">𝒙</ci><ci id="S2.SS2.p2.1.m1.7.7.3.3a.cmml" xref="S2.SS2.p2.1.m1.7.7.3.3"><mtext id="S2.SS2.p2.1.m1.7.7.3.3.cmml" mathsize="70%" xref="S2.SS2.p2.1.m1.7.7.3.3">grid</mtext></ci></apply><apply id="S2.SS2.p2.1.m1.7.7.1.cmml" xref="S2.SS2.p2.1.m1.7.7.1"><csymbol cd="ambiguous" id="S2.SS2.p2.1.m1.7.7.1.2.cmml" 
xref="S2.SS2.p2.1.m1.7.7.1">superscript</csymbol><apply id="S2.SS2.p2.1.m1.7.7.1.1.cmml" xref="S2.SS2.p2.1.m1.7.7.1"><csymbol cd="ambiguous" id="S2.SS2.p2.1.m1.7.7.1.1.2.cmml" xref="S2.SS2.p2.1.m1.7.7.1">subscript</csymbol><set id="S2.SS2.p2.1.m1.7.7.1.1.1.2.cmml" xref="S2.SS2.p2.1.m1.7.7.1.1.1.1"><interval closure="open" id="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.7.7.1.1.1.1.1.2"><apply id="S2.SS2.p2.1.m1.5.5.cmml" xref="S2.SS2.p2.1.m1.5.5"><divide id="S2.SS2.p2.1.m1.5.5.1.cmml" xref="S2.SS2.p2.1.m1.5.5"></divide><ci id="S2.SS2.p2.1.m1.5.5.2.cmml" xref="S2.SS2.p2.1.m1.5.5.2">𝑤</ci><ci id="S2.SS2.p2.1.m1.5.5.3.cmml" xref="S2.SS2.p2.1.m1.5.5.3">𝑊</ci></apply><apply id="S2.SS2.p2.1.m1.6.6.cmml" xref="S2.SS2.p2.1.m1.6.6"><divide id="S2.SS2.p2.1.m1.6.6.1.cmml" xref="S2.SS2.p2.1.m1.6.6"></divide><ci id="S2.SS2.p2.1.m1.6.6.2.cmml" xref="S2.SS2.p2.1.m1.6.6.2">ℎ</ci><ci id="S2.SS2.p2.1.m1.6.6.3.cmml" xref="S2.SS2.p2.1.m1.6.6.3">𝐻</ci></apply></interval></set><apply id="S2.SS2.p2.1.m1.2.2.2.3.cmml" xref="S2.SS2.p2.1.m1.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS2.p2.1.m1.2.2.2.3a.cmml" xref="S2.SS2.p2.1.m1.2.2.2.2.3">formulae-sequence</csymbol><apply id="S2.SS2.p2.1.m1.1.1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1.1"><eq id="S2.SS2.p2.1.m1.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1.1.1"></eq><ci id="S2.SS2.p2.1.m1.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1.1.2">𝑤</ci><cn id="S2.SS2.p2.1.m1.1.1.1.1.1.3.cmml" type="integer" xref="S2.SS2.p2.1.m1.1.1.1.1.1.3">1</cn></apply><apply id="S2.SS2.p2.1.m1.2.2.2.2.2.cmml" xref="S2.SS2.p2.1.m1.2.2.2.2.2"><eq id="S2.SS2.p2.1.m1.2.2.2.2.2.1.cmml" xref="S2.SS2.p2.1.m1.2.2.2.2.2.1"></eq><ci id="S2.SS2.p2.1.m1.2.2.2.2.2.2.cmml" xref="S2.SS2.p2.1.m1.2.2.2.2.2.2">ℎ</ci><cn id="S2.SS2.p2.1.m1.2.2.2.2.2.3.cmml" type="integer" xref="S2.SS2.p2.1.m1.2.2.2.2.2.3">1</cn></apply></apply></apply><list id="S2.SS2.p2.1.m1.4.4.2.3.cmml" xref="S2.SS2.p2.1.m1.4.4.2.4"><ci id="S2.SS2.p2.1.m1.3.3.1.1.cmml" 
xref="S2.SS2.p2.1.m1.3.3.1.1">𝑊</ci><ci id="S2.SS2.p2.1.m1.4.4.2.2.cmml" xref="S2.SS2.p2.1.m1.4.4.2.2">𝐻</ci></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.1.m1.7c">\bm{x}_{\text{grid}}\in\left\{\left(\frac{w}{W},\frac{h}{H}\right)\right\}_{w=% 1,h=1}^{W,H}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.1.m1.7d">bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT ∈ { ( divide start_ARG italic_w end_ARG start_ARG italic_W end_ARG , divide start_ARG italic_h end_ARG start_ARG italic_H end_ARG ) } start_POSTSUBSCRIPT italic_w = 1 , italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W , italic_H end_POSTSUPERSCRIPT</annotation></semantics></math> is a point on a two-dimensional grid map, and <math alttext="\bm{Q}" class="ltx_Math" display="inline" id="S2.SS2.p2.2.m2.1"><semantics id="S2.SS2.p2.2.m2.1a"><mi id="S2.SS2.p2.2.m2.1.1" xref="S2.SS2.p2.2.m2.1.1.cmml">𝑸</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.2.m2.1b"><ci id="S2.SS2.p2.2.m2.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1">𝑸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.2.m2.1c">\bm{Q}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.2.m2.1d">bold_italic_Q</annotation></semantics></math> represents the 2D Gaussian distribution of the blob, characterized by its mean <math alttext="\bm{\mu}" class="ltx_Math" display="inline" id="S2.SS2.p2.3.m3.1"><semantics id="S2.SS2.p2.3.m3.1a"><mi id="S2.SS2.p2.3.m3.1.1" xref="S2.SS2.p2.3.m3.1.1.cmml">𝝁</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.3.m3.1b"><ci id="S2.SS2.p2.3.m3.1.1.cmml" xref="S2.SS2.p2.3.m3.1.1">𝝁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.3.m3.1c">\bm{\mu}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.3.m3.1d">bold_italic_μ</annotation></semantics></math> and covariance matrix <math alttext="\bm{\Sigma}" class="ltx_Math" display="inline" 
id="S2.SS2.p2.4.m4.1"><semantics id="S2.SS2.p2.4.m4.1a"><mi id="S2.SS2.p2.4.m4.1.1" xref="S2.SS2.p2.4.m4.1.1.cmml">𝚺</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.4.m4.1b"><ci id="S2.SS2.p2.4.m4.1.1.cmml" xref="S2.SS2.p2.4.m4.1.1">𝚺</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.4.m4.1c">\bm{\Sigma}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.4.m4.1d">bold_Σ</annotation></semantics></math>. The distance <math alttext="d_{M}\in\mathbb{R}^{H,W}" class="ltx_Math" display="inline" id="S2.SS2.p2.5.m5.2"><semantics id="S2.SS2.p2.5.m5.2a"><mrow id="S2.SS2.p2.5.m5.2.3" xref="S2.SS2.p2.5.m5.2.3.cmml"><msub id="S2.SS2.p2.5.m5.2.3.2" xref="S2.SS2.p2.5.m5.2.3.2.cmml"><mi id="S2.SS2.p2.5.m5.2.3.2.2" xref="S2.SS2.p2.5.m5.2.3.2.2.cmml">d</mi><mi id="S2.SS2.p2.5.m5.2.3.2.3" xref="S2.SS2.p2.5.m5.2.3.2.3.cmml">M</mi></msub><mo id="S2.SS2.p2.5.m5.2.3.1" xref="S2.SS2.p2.5.m5.2.3.1.cmml">∈</mo><msup id="S2.SS2.p2.5.m5.2.3.3" xref="S2.SS2.p2.5.m5.2.3.3.cmml"><mi id="S2.SS2.p2.5.m5.2.3.3.2" xref="S2.SS2.p2.5.m5.2.3.3.2.cmml">ℝ</mi><mrow id="S2.SS2.p2.5.m5.2.2.2.4" xref="S2.SS2.p2.5.m5.2.2.2.3.cmml"><mi id="S2.SS2.p2.5.m5.1.1.1.1" xref="S2.SS2.p2.5.m5.1.1.1.1.cmml">H</mi><mo id="S2.SS2.p2.5.m5.2.2.2.4.1" xref="S2.SS2.p2.5.m5.2.2.2.3.cmml">,</mo><mi id="S2.SS2.p2.5.m5.2.2.2.2" xref="S2.SS2.p2.5.m5.2.2.2.2.cmml">W</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.5.m5.2b"><apply id="S2.SS2.p2.5.m5.2.3.cmml" xref="S2.SS2.p2.5.m5.2.3"><in id="S2.SS2.p2.5.m5.2.3.1.cmml" xref="S2.SS2.p2.5.m5.2.3.1"></in><apply id="S2.SS2.p2.5.m5.2.3.2.cmml" xref="S2.SS2.p2.5.m5.2.3.2"><csymbol cd="ambiguous" id="S2.SS2.p2.5.m5.2.3.2.1.cmml" xref="S2.SS2.p2.5.m5.2.3.2">subscript</csymbol><ci id="S2.SS2.p2.5.m5.2.3.2.2.cmml" xref="S2.SS2.p2.5.m5.2.3.2.2">𝑑</ci><ci id="S2.SS2.p2.5.m5.2.3.2.3.cmml" xref="S2.SS2.p2.5.m5.2.3.2.3">𝑀</ci></apply><apply id="S2.SS2.p2.5.m5.2.3.3.cmml" xref="S2.SS2.p2.5.m5.2.3.3"><csymbol 
cd="ambiguous" id="S2.SS2.p2.5.m5.2.3.3.1.cmml" xref="S2.SS2.p2.5.m5.2.3.3">superscript</csymbol><ci id="S2.SS2.p2.5.m5.2.3.3.2.cmml" xref="S2.SS2.p2.5.m5.2.3.3.2">ℝ</ci><list id="S2.SS2.p2.5.m5.2.2.2.3.cmml" xref="S2.SS2.p2.5.m5.2.2.2.4"><ci id="S2.SS2.p2.5.m5.1.1.1.1.cmml" xref="S2.SS2.p2.5.m5.1.1.1.1">𝐻</ci><ci id="S2.SS2.p2.5.m5.2.2.2.2.cmml" xref="S2.SS2.p2.5.m5.2.2.2.2">𝑊</ci></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.5.m5.2c">d_{M}\in\mathbb{R}^{H,W}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.5.m5.2d">italic_d start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H , italic_W end_POSTSUPERSCRIPT</annotation></semantics></math> quantifies how far the point <math alttext="\bm{x}_{\text{grid}}" class="ltx_Math" display="inline" id="S2.SS2.p2.6.m6.1"><semantics id="S2.SS2.p2.6.m6.1a"><msub id="S2.SS2.p2.6.m6.1.1" xref="S2.SS2.p2.6.m6.1.1.cmml"><mi id="S2.SS2.p2.6.m6.1.1.2" xref="S2.SS2.p2.6.m6.1.1.2.cmml">𝒙</mi><mtext id="S2.SS2.p2.6.m6.1.1.3" xref="S2.SS2.p2.6.m6.1.1.3a.cmml">grid</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.6.m6.1b"><apply id="S2.SS2.p2.6.m6.1.1.cmml" xref="S2.SS2.p2.6.m6.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.6.m6.1.1.1.cmml" xref="S2.SS2.p2.6.m6.1.1">subscript</csymbol><ci id="S2.SS2.p2.6.m6.1.1.2.cmml" xref="S2.SS2.p2.6.m6.1.1.2">𝒙</ci><ci id="S2.SS2.p2.6.m6.1.1.3a.cmml" xref="S2.SS2.p2.6.m6.1.1.3"><mtext id="S2.SS2.p2.6.m6.1.1.3.cmml" mathsize="70%" xref="S2.SS2.p2.6.m6.1.1.3">grid</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.6.m6.1c">\bm{x}_{\text{grid}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.6.m6.1d">bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT</annotation></semantics></math> is from the center <math alttext="\bm{\mu}" class="ltx_Math" display="inline" id="S2.SS2.p2.7.m7.1"><semantics 
id="S2.SS2.p2.7.m7.1a"><mi id="S2.SS2.p2.7.m7.1.1" xref="S2.SS2.p2.7.m7.1.1.cmml">𝝁</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.7.m7.1b"><ci id="S2.SS2.p2.7.m7.1.1.cmml" xref="S2.SS2.p2.7.m7.1.1">𝝁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.7.m7.1c">\bm{\mu}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.7.m7.1d">bold_italic_μ</annotation></semantics></math> while taking into account the shape of the distribution defined by <math alttext="\bm{\Sigma}" class="ltx_Math" display="inline" id="S2.SS2.p2.8.m8.1"><semantics id="S2.SS2.p2.8.m8.1a"><mi id="S2.SS2.p2.8.m8.1.1" xref="S2.SS2.p2.8.m8.1.1.cmml">𝚺</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.8.m8.1b"><ci id="S2.SS2.p2.8.m8.1.1.cmml" xref="S2.SS2.p2.8.m8.1.1">𝚺</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.8.m8.1c">\bm{\Sigma}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.8.m8.1d">bold_Σ</annotation></semantics></math>. 
Then, the blob opacity can be calculated based on this distance:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S2.E2"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E2X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle O(\bm{x}_{\text{grid}})={\text{sigmoid}}(-d_{M})," class="ltx_Math" display="inline" id="S2.E2X.2.1.1.m1.1"><semantics id="S2.E2X.2.1.1.m1.1a"><mrow id="S2.E2X.2.1.1.m1.1.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.cmml"><mrow id="S2.E2X.2.1.1.m1.1.1.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.cmml"><mrow id="S2.E2X.2.1.1.m1.1.1.1.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.cmml"><mi id="S2.E2X.2.1.1.m1.1.1.1.1.1.3" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.3.cmml">O</mi><mo id="S2.E2X.2.1.1.m1.1.1.1.1.1.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">(</mo><msub id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml">𝒙</mi><mtext id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.3" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.3a.cmml">grid</mtext></msub><mo id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E2X.2.1.1.m1.1.1.1.1.3" xref="S2.E2X.2.1.1.m1.1.1.1.1.3.cmml">=</mo><mrow id="S2.E2X.2.1.1.m1.1.1.1.1.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.cmml"><mtext id="S2.E2X.2.1.1.m1.1.1.1.1.2.3" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.3a.cmml">sigmoid</mtext><mo id="S2.E2X.2.1.1.m1.1.1.1.1.2.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.2.cmml">⁢</mo><mrow id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml"><mo id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.2" stretchy="false" 
xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml">(</mo><mrow id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml"><mo id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1a" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml">−</mo><msub id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.cmml"><mi id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.2.cmml">d</mi><mi id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.3" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.3.cmml">M</mi></msub></mrow><mo id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.3" stretchy="false" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E2X.2.1.1.m1.1.1.1.2" xref="S2.E2X.2.1.1.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E2X.2.1.1.m1.1b"><apply id="S2.E2X.2.1.1.m1.1.1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1"><eq id="S2.E2X.2.1.1.m1.1.1.1.1.3.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.3"></eq><apply id="S2.E2X.2.1.1.m1.1.1.1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1"><times id="S2.E2X.2.1.1.m1.1.1.1.1.1.2.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.2"></times><ci id="S2.E2X.2.1.1.m1.1.1.1.1.1.3.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.3">𝑂</ci><apply id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.2">𝒙</ci><ci id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.3a.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.3"><mtext id="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S2.E2X.2.1.1.m1.1.1.1.1.1.1.1.1.3">grid</mtext></ci></apply></apply><apply id="S2.E2X.2.1.1.m1.1.1.1.1.2.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2"><times id="S2.E2X.2.1.1.m1.1.1.1.1.2.2.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.2"></times><ci id="S2.E2X.2.1.1.m1.1.1.1.1.2.3a.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.3"><mtext 
id="S2.E2X.2.1.1.m1.1.1.1.1.2.3.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.3">sigmoid</mtext></ci><apply id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1"><minus id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1"></minus><apply id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2"><csymbol cd="ambiguous" id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.1.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2">subscript</csymbol><ci id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.2.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.2">𝑑</ci><ci id="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.3.cmml" xref="S2.E2X.2.1.1.m1.1.1.1.1.2.1.1.1.2.3">𝑀</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2X.2.1.1.m1.1c">\displaystyle O(\bm{x}_{\text{grid}})={\text{sigmoid}}(-d_{M}),</annotation><annotation encoding="application/x-llamapun" id="S2.E2X.2.1.1.m1.1d">italic_O ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT ) = sigmoid ( - italic_d start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(2)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S2.SS2.p2.10">which maps the distance <math alttext="d_{M}" class="ltx_Math" display="inline" id="S2.SS2.p2.9.m1.1"><semantics id="S2.SS2.p2.9.m1.1a"><msub id="S2.SS2.p2.9.m1.1.1" xref="S2.SS2.p2.9.m1.1.1.cmml"><mi id="S2.SS2.p2.9.m1.1.1.2" xref="S2.SS2.p2.9.m1.1.1.2.cmml">d</mi><mi id="S2.SS2.p2.9.m1.1.1.3" xref="S2.SS2.p2.9.m1.1.1.3.cmml">M</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.9.m1.1b"><apply id="S2.SS2.p2.9.m1.1.1.cmml" xref="S2.SS2.p2.9.m1.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.9.m1.1.1.1.cmml" xref="S2.SS2.p2.9.m1.1.1">subscript</csymbol><ci 
id="S2.SS2.p2.9.m1.1.1.2.cmml" xref="S2.SS2.p2.9.m1.1.1.2">𝑑</ci><ci id="S2.SS2.p2.9.m1.1.1.3.cmml" xref="S2.SS2.p2.9.m1.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.9.m1.1c">d_{M}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.9.m1.1d">italic_d start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT</annotation></semantics></math> to a value between 0 and 1, effectively representing the opacity of the blob at the point <math alttext="\bm{x}_{\text{grid}}" class="ltx_Math" display="inline" id="S2.SS2.p2.10.m2.1"><semantics id="S2.SS2.p2.10.m2.1a"><msub id="S2.SS2.p2.10.m2.1.1" xref="S2.SS2.p2.10.m2.1.1.cmml"><mi id="S2.SS2.p2.10.m2.1.1.2" xref="S2.SS2.p2.10.m2.1.1.2.cmml">𝒙</mi><mtext id="S2.SS2.p2.10.m2.1.1.3" xref="S2.SS2.p2.10.m2.1.1.3a.cmml">grid</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.10.m2.1b"><apply id="S2.SS2.p2.10.m2.1.1.cmml" xref="S2.SS2.p2.10.m2.1.1"><csymbol cd="ambiguous" id="S2.SS2.p2.10.m2.1.1.1.cmml" xref="S2.SS2.p2.10.m2.1.1">subscript</csymbol><ci id="S2.SS2.p2.10.m2.1.1.2.cmml" xref="S2.SS2.p2.10.m2.1.1.2">𝒙</ci><ci id="S2.SS2.p2.10.m2.1.1.3a.cmml" xref="S2.SS2.p2.10.m2.1.1.3"><mtext id="S2.SS2.p2.10.m2.1.1.3.cmml" mathsize="70%" xref="S2.SS2.p2.10.m2.1.1.3">grid</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.10.m2.1c">\bm{x}_{\text{grid}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.10.m2.1d">bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT</annotation></semantics></math>. 
As a result, points closer to the center of the blob have higher opacity, while those farther away are more transparent, yielding a smooth and continuous transition.</p> </div> <figure class="ltx_figure" id="S2.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="274" id="S2.F3.g1" src="x3.png" width="813"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span><span class="ltx_text ltx_font_bold" id="S2.F3.2.1">Overview of <span class="ltx_text ltx_font_italic" id="S2.F3.2.1.1">BlobCtrl</span>.</span> Our framework consists of: (1) A dual-branch architecture with a foreground branch for element identity encoding and a background branch for scene context preservation and harmonization. Both branches use concatenated inputs of noisy latents and reference conditions (Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.SS1" title="3.1 Model Architecture ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">3.1</span></a>). (2) A self-supervised training paradigm for element-level manipulation through stochastic position generation and target reconstruction optimization. 
Through feature fusion between branches, our framework achieves precise control over elements while maintaining visual coherence.</figcaption> </figure> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.3 </span>Blob Composition and Splatting</h3> <div class="ltx_para" id="S2.SS3.p1"> <p class="ltx_p" id="S2.SS3.p1.6">Blob composition is the process of integrating multiple blobs through depth-aware alpha compositing <cite class="ltx_cite ltx_citemacro_citep">(Porter &amp; Duff, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib35" title="">1984</a>; Nitzberg &amp; Mumford, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib32" title="">1990</a>)</cite>, which handles occlusion and models inter-object relationships. Mathematically, blob composition is formulated as follows:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S2.E3"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E3X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle O_{c}^{i}(\bm{x}_{\text{grid}})=O_{i}(\bm{x}_{\text{grid}})\prod% _{j=i+1}^{m}\left(1-O_{j}\left(\bm{x}_{g}\right)\right)," class="ltx_Math" display="inline" id="S2.E3X.2.1.1.m1.1"><semantics id="S2.E3X.2.1.1.m1.1a"><mrow id="S2.E3X.2.1.1.m1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.cmml"><mrow id="S2.E3X.2.1.1.m1.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.cmml"><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.cmml"><msubsup id="S2.E3X.2.1.1.m1.1.1.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.2.cmml">O</mi><mi id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.3.cmml">c</mi><mi id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.3.cmml">i</mi></msubsup><mo 
id="S2.E3X.2.1.1.m1.1.1.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">(</mo><msub id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml">𝒙</mi><mtext id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.3a.cmml">grid</mtext></msub><mo id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E3X.2.1.1.m1.1.1.1.1.4" xref="S2.E3X.2.1.1.m1.1.1.1.1.4.cmml">=</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.cmml"><msub id="S2.E3X.2.1.1.m1.1.1.1.1.3.4" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.4.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4.2.cmml">O</mi><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.4.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4.3.cmml">i</mi></msub><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.3.cmml">⁢</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml"><mo id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.2" stretchy="false" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml">(</mo><msub id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.2.cmml">𝒙</mi><mtext id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.3a.cmml">grid</mtext></msub><mo id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.3" stretchy="false" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml">)</mo></mrow><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.3a" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.3.cmml">⁢</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.cmml"><mstyle displaystyle="true" 
id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.cmml"><munderover id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2a" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.cmml"><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.2" movablelimits="false" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.2.cmml">∏</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.2.cmml">j</mi><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.1.cmml">=</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.2.cmml">i</mi><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.1.cmml">+</mo><mn id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.3.cmml">1</mn></mrow></mrow><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.3.cmml">m</mi></munderover></mstyle><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.cmml"><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.cmml">(</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.cmml"><mn id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.3.cmml">1</mn><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.2.cmml">−</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.cmml"><msub id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.2.cmml">O</mi><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.3.cmml">j</mi></msub><mo 
id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.cmml"><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.cmml">(</mo><msub id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.cmml"><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.2.cmml">𝒙</mi><mi id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.3.cmml">g</mi></msub><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.3" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="S2.E3X.2.1.1.m1.1.1.1.2" xref="S2.E3X.2.1.1.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E3X.2.1.1.m1.1b"><apply id="S2.E3X.2.1.1.m1.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1"><eq id="S2.E3X.2.1.1.m1.1.1.1.1.4.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.4"></eq><apply id="S2.E3X.2.1.1.m1.1.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1"><times id="S2.E3X.2.1.1.m1.1.1.1.1.1.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.2"></times><apply id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3">superscript</csymbol><apply id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.2">𝑂</ci><ci id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.2.3">𝑐</ci></apply><ci 
id="S2.E3X.2.1.1.m1.1.1.1.1.1.3.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.3.3">𝑖</ci></apply><apply id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.2">𝒙</ci><ci id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.3a.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.3"><mtext id="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S2.E3X.2.1.1.m1.1.1.1.1.1.1.1.1.3">grid</mtext></ci></apply></apply><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3"><times id="S2.E3X.2.1.1.m1.1.1.1.1.3.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.3"></times><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.4.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.3.4.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4">subscript</csymbol><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.4.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4.2">𝑂</ci><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.4.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.4.3">𝑖</ci></apply><apply id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1">subscript</csymbol><ci id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.2">𝒙</ci><ci id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.3a.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.3"><mtext id="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.3.cmml" mathsize="70%" xref="S2.E3X.2.1.1.m1.1.1.1.1.2.1.1.1.3">grid</mtext></ci></apply><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2"><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2">superscript</csymbol><apply 
id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2">subscript</csymbol><csymbol cd="latexml" id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.2">product</csymbol><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3"><eq id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.1"></eq><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.2">𝑗</ci><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3"><plus id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.1"></plus><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.2">𝑖</ci><cn id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.3.cmml" type="integer" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.2.3.3.3">1</cn></apply></apply></apply><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.2.3">𝑚</ci></apply><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1"><minus id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.2"></minus><cn id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.3.cmml" type="integer" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.3">1</cn><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1"><times id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.2"></times><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3">subscript</csymbol><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.2.cmml" 
xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.2">𝑂</ci><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.3.3">𝑗</ci></apply><apply id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.1.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.2.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.2">𝒙</ci><ci id="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.3.cmml" xref="S2.E3X.2.1.1.m1.1.1.1.1.3.2.1.1.1.1.1.1.1.3">𝑔</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E3X.2.1.1.m1.1c">\displaystyle O_{c}^{i}(\bm{x}_{\text{grid}})=O_{i}(\bm{x}_{\text{grid}})\prod% _{j=i+1}^{m}\left(1-O_{j}\left(\bm{x}_{g}\right)\right),</annotation><annotation encoding="application/x-llamapun" id="S2.E3X.2.1.1.m1.1d">italic_O start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT ) = italic_O start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( 1 - italic_O start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(3)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S2.SS3.p1.5">where <math alttext="m" class="ltx_Math" display="inline" id="S2.SS3.p1.1.m1.1"><semantics id="S2.SS3.p1.1.m1.1a"><mi id="S2.SS3.p1.1.m1.1.1" 
xref="S2.SS3.p1.1.m1.1.1.cmml">m</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.1.m1.1b"><ci id="S2.SS3.p1.1.m1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1">𝑚</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.1.m1.1c">m</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.1.m1.1d">italic_m</annotation></semantics></math> is the total number of blobs to be composed, <math alttext="O_{c}^{i}(\bm{x}_{\text{grid}})\in\mathbb{R}^{h,w}" class="ltx_Math" display="inline" id="S2.SS3.p1.2.m2.3"><semantics id="S2.SS3.p1.2.m2.3a"><mrow id="S2.SS3.p1.2.m2.3.3" xref="S2.SS3.p1.2.m2.3.3.cmml"><mrow id="S2.SS3.p1.2.m2.3.3.1" xref="S2.SS3.p1.2.m2.3.3.1.cmml"><msubsup id="S2.SS3.p1.2.m2.3.3.1.3" xref="S2.SS3.p1.2.m2.3.3.1.3.cmml"><mi id="S2.SS3.p1.2.m2.3.3.1.3.2.2" xref="S2.SS3.p1.2.m2.3.3.1.3.2.2.cmml">O</mi><mi id="S2.SS3.p1.2.m2.3.3.1.3.2.3" xref="S2.SS3.p1.2.m2.3.3.1.3.2.3.cmml">c</mi><mi id="S2.SS3.p1.2.m2.3.3.1.3.3" xref="S2.SS3.p1.2.m2.3.3.1.3.3.cmml">i</mi></msubsup><mo id="S2.SS3.p1.2.m2.3.3.1.2" xref="S2.SS3.p1.2.m2.3.3.1.2.cmml">⁢</mo><mrow id="S2.SS3.p1.2.m2.3.3.1.1.1" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.cmml"><mo id="S2.SS3.p1.2.m2.3.3.1.1.1.2" stretchy="false" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.cmml">(</mo><msub id="S2.SS3.p1.2.m2.3.3.1.1.1.1" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.cmml"><mi id="S2.SS3.p1.2.m2.3.3.1.1.1.1.2" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.2.cmml">𝒙</mi><mtext id="S2.SS3.p1.2.m2.3.3.1.1.1.1.3" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.3a.cmml">grid</mtext></msub><mo id="S2.SS3.p1.2.m2.3.3.1.1.1.3" stretchy="false" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.SS3.p1.2.m2.3.3.2" xref="S2.SS3.p1.2.m2.3.3.2.cmml">∈</mo><msup id="S2.SS3.p1.2.m2.3.3.3" xref="S2.SS3.p1.2.m2.3.3.3.cmml"><mi id="S2.SS3.p1.2.m2.3.3.3.2" xref="S2.SS3.p1.2.m2.3.3.3.2.cmml">ℝ</mi><mrow id="S2.SS3.p1.2.m2.2.2.2.4" xref="S2.SS3.p1.2.m2.2.2.2.3.cmml"><mi id="S2.SS3.p1.2.m2.1.1.1.1" xref="S2.SS3.p1.2.m2.1.1.1.1.cmml">h</mi><mo 
id="S2.SS3.p1.2.m2.2.2.2.4.1" xref="S2.SS3.p1.2.m2.2.2.2.3.cmml">,</mo><mi id="S2.SS3.p1.2.m2.2.2.2.2" xref="S2.SS3.p1.2.m2.2.2.2.2.cmml">w</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.2.m2.3b"><apply id="S2.SS3.p1.2.m2.3.3.cmml" xref="S2.SS3.p1.2.m2.3.3"><in id="S2.SS3.p1.2.m2.3.3.2.cmml" xref="S2.SS3.p1.2.m2.3.3.2"></in><apply id="S2.SS3.p1.2.m2.3.3.1.cmml" xref="S2.SS3.p1.2.m2.3.3.1"><times id="S2.SS3.p1.2.m2.3.3.1.2.cmml" xref="S2.SS3.p1.2.m2.3.3.1.2"></times><apply id="S2.SS3.p1.2.m2.3.3.1.3.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3"><csymbol cd="ambiguous" id="S2.SS3.p1.2.m2.3.3.1.3.1.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3">superscript</csymbol><apply id="S2.SS3.p1.2.m2.3.3.1.3.2.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3"><csymbol cd="ambiguous" id="S2.SS3.p1.2.m2.3.3.1.3.2.1.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3">subscript</csymbol><ci id="S2.SS3.p1.2.m2.3.3.1.3.2.2.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3.2.2">𝑂</ci><ci id="S2.SS3.p1.2.m2.3.3.1.3.2.3.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3.2.3">𝑐</ci></apply><ci id="S2.SS3.p1.2.m2.3.3.1.3.3.cmml" xref="S2.SS3.p1.2.m2.3.3.1.3.3">𝑖</ci></apply><apply id="S2.SS3.p1.2.m2.3.3.1.1.1.1.cmml" xref="S2.SS3.p1.2.m2.3.3.1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.2.m2.3.3.1.1.1.1.1.cmml" xref="S2.SS3.p1.2.m2.3.3.1.1.1">subscript</csymbol><ci id="S2.SS3.p1.2.m2.3.3.1.1.1.1.2.cmml" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.2">𝒙</ci><ci id="S2.SS3.p1.2.m2.3.3.1.1.1.1.3a.cmml" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.3"><mtext id="S2.SS3.p1.2.m2.3.3.1.1.1.1.3.cmml" mathsize="70%" xref="S2.SS3.p1.2.m2.3.3.1.1.1.1.3">grid</mtext></ci></apply></apply><apply id="S2.SS3.p1.2.m2.3.3.3.cmml" xref="S2.SS3.p1.2.m2.3.3.3"><csymbol cd="ambiguous" id="S2.SS3.p1.2.m2.3.3.3.1.cmml" xref="S2.SS3.p1.2.m2.3.3.3">superscript</csymbol><ci id="S2.SS3.p1.2.m2.3.3.3.2.cmml" xref="S2.SS3.p1.2.m2.3.3.3.2">ℝ</ci><list id="S2.SS3.p1.2.m2.2.2.2.3.cmml" xref="S2.SS3.p1.2.m2.2.2.2.4"><ci id="S2.SS3.p1.2.m2.1.1.1.1.cmml" xref="S2.SS3.p1.2.m2.1.1.1.1">ℎ</ci><ci 
id="S2.SS3.p1.2.m2.2.2.2.2.cmml" xref="S2.SS3.p1.2.m2.2.2.2.2">𝑤</ci></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.2.m2.3c">O_{c}^{i}(\bm{x}_{\text{grid}})\in\mathbb{R}^{h,w}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.2.m2.3d">italic_O start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( bold_italic_x start_POSTSUBSCRIPT grid end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_h , italic_w end_POSTSUPERSCRIPT</annotation></semantics></math> represents the composed opacity of the <math alttext="i" class="ltx_Math" display="inline" id="S2.SS3.p1.3.m3.1"><semantics id="S2.SS3.p1.3.m3.1a"><mi id="S2.SS3.p1.3.m3.1.1" xref="S2.SS3.p1.3.m3.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.3.m3.1b"><ci id="S2.SS3.p1.3.m3.1.1.cmml" xref="S2.SS3.p1.3.m3.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.3.m3.1c">i</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.3.m3.1d">italic_i</annotation></semantics></math>-th blob across spatial dimensions, and the product term accounts for the occlusion effects of subsequent blobs in the sequence. 
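</p> <p class="ltx_p">As a minimal sketch (not the authors' implementation), Eq. (3) can be evaluated with a single back-to-front pass over the blobs. The function name <span class="ltx_text ltx_font_typewriter">compose_opacities</span> and the (m, h, w) array layout are assumptions made here for illustration; the ordering convention, in which blobs later in the sequence occlude earlier ones, follows the product term in Eq. (3).</p>
<pre class="ltx_verbatim">
```python
import numpy as np


def compose_opacities(opacities):
    """Depth-aware alpha compositing of per-blob opacity maps, as in Eq. (3).

    opacities: array-like of shape (m, h, w) with values in [0, 1].
    Blobs later in the sequence occlude earlier ones.
    Returns an array of shape (m, h, w) with the composed opacities O_c^i.
    """
    opacities = np.asarray(opacities, dtype=np.float64)
    num_blobs = opacities.shape[0]
    composed = np.empty_like(opacities)

    # Running product of (1 - O_j) over all blobs j that come after blob i.
    # For the last blob the product is empty, so it starts at 1.
    transmittance = np.ones(opacities.shape[1:], dtype=np.float64)

    # Walk from the last blob to the first so each product is built once.
    for i in range(num_blobs - 1, -1, -1):
        composed[i] = opacities[i] * transmittance
        transmittance = transmittance * (1.0 - opacities[i])

    return composed
```
</pre>
<p class="ltx_p">For two single-pixel blobs with opacities 0.5 (earlier) and 0.4 (later), the later blob keeps its opacity of 0.4, while the earlier blob is attenuated to 0.5 × (1 − 0.4) = 0.3.</p> <p class="ltx_p">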
In <span class="ltx_text ltx_font_italic" id="S2.SS3.p1.5.1">BlobCtrl</span>, we compose the foreground and background elements to obtain the foreground opacity <math alttext="O_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S2.SS3.p1.4.m4.1"><semantics id="S2.SS3.p1.4.m4.1a"><msub id="S2.SS3.p1.4.m4.1.1" xref="S2.SS3.p1.4.m4.1.1.cmml"><mi id="S2.SS3.p1.4.m4.1.1.2" xref="S2.SS3.p1.4.m4.1.1.2.cmml">O</mi><mi id="S2.SS3.p1.4.m4.1.1.3" xref="S2.SS3.p1.4.m4.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.4.m4.1b"><apply id="S2.SS3.p1.4.m4.1.1.cmml" xref="S2.SS3.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.4.m4.1.1.1.cmml" xref="S2.SS3.p1.4.m4.1.1">subscript</csymbol><ci id="S2.SS3.p1.4.m4.1.1.2.cmml" xref="S2.SS3.p1.4.m4.1.1.2">𝑂</ci><ci id="S2.SS3.p1.4.m4.1.1.3.cmml" xref="S2.SS3.p1.4.m4.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.4.m4.1c">O_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.4.m4.1d">italic_O start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> and background opacity <math alttext="O_{\mathsf{bg}}" class="ltx_Math" display="inline" id="S2.SS3.p1.5.m5.1"><semantics id="S2.SS3.p1.5.m5.1a"><msub id="S2.SS3.p1.5.m5.1.1" xref="S2.SS3.p1.5.m5.1.1.cmml"><mi id="S2.SS3.p1.5.m5.1.1.2" xref="S2.SS3.p1.5.m5.1.1.2.cmml">O</mi><mi id="S2.SS3.p1.5.m5.1.1.3" xref="S2.SS3.p1.5.m5.1.1.3.cmml">𝖻𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.5.m5.1b"><apply id="S2.SS3.p1.5.m5.1.1.cmml" xref="S2.SS3.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.5.m5.1.1.1.cmml" xref="S2.SS3.p1.5.m5.1.1">subscript</csymbol><ci id="S2.SS3.p1.5.m5.1.1.2.cmml" xref="S2.SS3.p1.5.m5.1.1.2">𝑂</ci><ci id="S2.SS3.p1.5.m5.1.1.3.cmml" xref="S2.SS3.p1.5.m5.1.1.3">𝖻𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.5.m5.1c">O_{\mathsf{bg}}</annotation><annotation 
encoding="application/x-llamapun" id="S2.SS3.p1.5.m5.1d">italic_O start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT</annotation></semantics></math>, which serve as an element-level layout representation.</p> </div> <div class="ltx_para" id="S2.SS3.p2"> <p class="ltx_p" id="S2.SS3.p2.4">Blob splatting <cite class="ltx_cite ltx_citemacro_citep">(Epstein et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib11" title="">2022</a>)</cite> refers to the ability to project <math alttext="m" class="ltx_Math" display="inline" id="S2.SS3.p2.1.m1.1"><semantics id="S2.SS3.p2.1.m1.1a"><mi id="S2.SS3.p2.1.m1.1.1" xref="S2.SS3.p2.1.m1.1.1.cmml">m</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.1.m1.1b"><ci id="S2.SS3.p2.1.m1.1.1.cmml" xref="S2.SS3.p2.1.m1.1.1">𝑚</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.1.m1.1c">m</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.1.m1.1d">italic_m</annotation></semantics></math> features <math alttext="\bm{f^{i}}\in\mathbb{R}^{1\times d}" class="ltx_Math" display="inline" id="S2.SS3.p2.2.m2.1"><semantics id="S2.SS3.p2.2.m2.1a"><mrow id="S2.SS3.p2.2.m2.1.1" xref="S2.SS3.p2.2.m2.1.1.cmml"><msup id="S2.SS3.p2.2.m2.1.1.2" xref="S2.SS3.p2.2.m2.1.1.2.cmml"><mi id="S2.SS3.p2.2.m2.1.1.2.2" xref="S2.SS3.p2.2.m2.1.1.2.2.cmml">𝒇</mi><mi id="S2.SS3.p2.2.m2.1.1.2.3" xref="S2.SS3.p2.2.m2.1.1.2.3.cmml">𝒊</mi></msup><mo id="S2.SS3.p2.2.m2.1.1.1" xref="S2.SS3.p2.2.m2.1.1.1.cmml">∈</mo><msup id="S2.SS3.p2.2.m2.1.1.3" xref="S2.SS3.p2.2.m2.1.1.3.cmml"><mi id="S2.SS3.p2.2.m2.1.1.3.2" xref="S2.SS3.p2.2.m2.1.1.3.2.cmml">ℝ</mi><mrow id="S2.SS3.p2.2.m2.1.1.3.3" xref="S2.SS3.p2.2.m2.1.1.3.3.cmml"><mn id="S2.SS3.p2.2.m2.1.1.3.3.2" xref="S2.SS3.p2.2.m2.1.1.3.3.2.cmml">1</mn><mo id="S2.SS3.p2.2.m2.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS3.p2.2.m2.1.1.3.3.1.cmml">×</mo><mi id="S2.SS3.p2.2.m2.1.1.3.3.3" 
xref="S2.SS3.p2.2.m2.1.1.3.3.3.cmml">d</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.2.m2.1b"><apply id="S2.SS3.p2.2.m2.1.1.cmml" xref="S2.SS3.p2.2.m2.1.1"><in id="S2.SS3.p2.2.m2.1.1.1.cmml" xref="S2.SS3.p2.2.m2.1.1.1"></in><apply id="S2.SS3.p2.2.m2.1.1.2.cmml" xref="S2.SS3.p2.2.m2.1.1.2"><csymbol cd="ambiguous" id="S2.SS3.p2.2.m2.1.1.2.1.cmml" xref="S2.SS3.p2.2.m2.1.1.2">superscript</csymbol><ci id="S2.SS3.p2.2.m2.1.1.2.2.cmml" xref="S2.SS3.p2.2.m2.1.1.2.2">𝒇</ci><ci id="S2.SS3.p2.2.m2.1.1.2.3.cmml" xref="S2.SS3.p2.2.m2.1.1.2.3">𝒊</ci></apply><apply id="S2.SS3.p2.2.m2.1.1.3.cmml" xref="S2.SS3.p2.2.m2.1.1.3"><csymbol cd="ambiguous" id="S2.SS3.p2.2.m2.1.1.3.1.cmml" xref="S2.SS3.p2.2.m2.1.1.3">superscript</csymbol><ci id="S2.SS3.p2.2.m2.1.1.3.2.cmml" xref="S2.SS3.p2.2.m2.1.1.3.2">ℝ</ci><apply id="S2.SS3.p2.2.m2.1.1.3.3.cmml" xref="S2.SS3.p2.2.m2.1.1.3.3"><times id="S2.SS3.p2.2.m2.1.1.3.3.1.cmml" xref="S2.SS3.p2.2.m2.1.1.3.3.1"></times><cn id="S2.SS3.p2.2.m2.1.1.3.3.2.cmml" type="integer" xref="S2.SS3.p2.2.m2.1.1.3.3.2">1</cn><ci id="S2.SS3.p2.2.m2.1.1.3.3.3.cmml" xref="S2.SS3.p2.2.m2.1.1.3.3.3">𝑑</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.2.m2.1c">\bm{f^{i}}\in\mathbb{R}^{1\times d}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.2.m2.1d">bold_italic_f start_POSTSUPERSCRIPT bold_italic_i end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_d end_POSTSUPERSCRIPT</annotation></semantics></math> into a two-dimensional space via <math alttext="m" class="ltx_Math" display="inline" id="S2.SS3.p2.3.m3.1"><semantics id="S2.SS3.p2.3.m3.1a"><mi id="S2.SS3.p2.3.m3.1.1" xref="S2.SS3.p2.3.m3.1.1.cmml">m</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.3.m3.1b"><ci id="S2.SS3.p2.3.m3.1.1.cmml" xref="S2.SS3.p2.3.m3.1.1">𝑚</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.3.m3.1c">m</annotation><annotation 
encoding="application/x-llamapun" id="S2.SS3.p2.3.m3.1d">italic_m</annotation></semantics></math> blobs, creating spatially aware features <math alttext="\bm{F}\in\mathbb{R}^{h\times w\times d}" class="ltx_Math" display="inline" id="S2.SS3.p2.4.m4.1"><semantics id="S2.SS3.p2.4.m4.1a"><mrow id="S2.SS3.p2.4.m4.1.1" xref="S2.SS3.p2.4.m4.1.1.cmml"><mi id="S2.SS3.p2.4.m4.1.1.2" xref="S2.SS3.p2.4.m4.1.1.2.cmml">𝑭</mi><mo id="S2.SS3.p2.4.m4.1.1.1" xref="S2.SS3.p2.4.m4.1.1.1.cmml">∈</mo><msup id="S2.SS3.p2.4.m4.1.1.3" xref="S2.SS3.p2.4.m4.1.1.3.cmml"><mi id="S2.SS3.p2.4.m4.1.1.3.2" xref="S2.SS3.p2.4.m4.1.1.3.2.cmml">ℝ</mi><mrow id="S2.SS3.p2.4.m4.1.1.3.3" xref="S2.SS3.p2.4.m4.1.1.3.3.cmml"><mi id="S2.SS3.p2.4.m4.1.1.3.3.2" xref="S2.SS3.p2.4.m4.1.1.3.3.2.cmml">h</mi><mo id="S2.SS3.p2.4.m4.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="S2.SS3.p2.4.m4.1.1.3.3.1.cmml">×</mo><mi id="S2.SS3.p2.4.m4.1.1.3.3.3" xref="S2.SS3.p2.4.m4.1.1.3.3.3.cmml">w</mi><mo id="S2.SS3.p2.4.m4.1.1.3.3.1a" lspace="0.222em" rspace="0.222em" xref="S2.SS3.p2.4.m4.1.1.3.3.1.cmml">×</mo><mi id="S2.SS3.p2.4.m4.1.1.3.3.4" xref="S2.SS3.p2.4.m4.1.1.3.3.4.cmml">d</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.4.m4.1b"><apply id="S2.SS3.p2.4.m4.1.1.cmml" xref="S2.SS3.p2.4.m4.1.1"><in id="S2.SS3.p2.4.m4.1.1.1.cmml" xref="S2.SS3.p2.4.m4.1.1.1"></in><ci id="S2.SS3.p2.4.m4.1.1.2.cmml" xref="S2.SS3.p2.4.m4.1.1.2">𝑭</ci><apply id="S2.SS3.p2.4.m4.1.1.3.cmml" xref="S2.SS3.p2.4.m4.1.1.3"><csymbol cd="ambiguous" id="S2.SS3.p2.4.m4.1.1.3.1.cmml" xref="S2.SS3.p2.4.m4.1.1.3">superscript</csymbol><ci id="S2.SS3.p2.4.m4.1.1.3.2.cmml" xref="S2.SS3.p2.4.m4.1.1.3.2">ℝ</ci><apply id="S2.SS3.p2.4.m4.1.1.3.3.cmml" xref="S2.SS3.p2.4.m4.1.1.3.3"><times id="S2.SS3.p2.4.m4.1.1.3.3.1.cmml" xref="S2.SS3.p2.4.m4.1.1.3.3.1"></times><ci id="S2.SS3.p2.4.m4.1.1.3.3.2.cmml" xref="S2.SS3.p2.4.m4.1.1.3.3.2">ℎ</ci><ci id="S2.SS3.p2.4.m4.1.1.3.3.3.cmml" xref="S2.SS3.p2.4.m4.1.1.3.3.3">𝑤</ci><ci 
id="S2.SS3.p2.4.m4.1.1.3.3.4.cmml" xref="S2.SS3.p2.4.m4.1.1.3.3.4">𝑑</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.4.m4.1c">\bm{F}\in\mathbb{R}^{h\times w\times d}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.4.m4.1d">bold_italic_F ∈ blackboard_R start_POSTSUPERSCRIPT italic_h × italic_w × italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>. Formally, blob splatting is expressed as follows:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S2.E4"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E4X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\bm{F}=g_{\text{splatting}}(\bm{f},O_{c})=\sum_{i=0}^{m}{O_{c}^{i% }\cdot\bm{f^{i}}}," class="ltx_Math" display="inline" id="S2.E4X.2.1.1.m1.2"><semantics id="S2.E4X.2.1.1.m1.2a"><mrow id="S2.E4X.2.1.1.m1.2.2.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.cmml"><mrow id="S2.E4X.2.1.1.m1.2.2.1.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.cmml"><mi id="S2.E4X.2.1.1.m1.2.2.1.1.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.3.cmml">𝑭</mi><mo id="S2.E4X.2.1.1.m1.2.2.1.1.4" xref="S2.E4X.2.1.1.m1.2.2.1.1.4.cmml">=</mo><mrow id="S2.E4X.2.1.1.m1.2.2.1.1.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.cmml"><msub id="S2.E4X.2.1.1.m1.2.2.1.1.1.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3.cmml"><mi id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3.2.cmml">g</mi><mtext id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3.3a.cmml">splatting</mtext></msub><mo id="S2.E4X.2.1.1.m1.2.2.1.1.1.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.2.cmml">⁢</mo><mrow id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.2.cmml"><mo id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.2" stretchy="false" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.2.cmml">(</mo><mi id="S2.E4X.2.1.1.m1.1.1" xref="S2.E4X.2.1.1.m1.1.1.cmml">𝒇</mi><mo id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.3" 
xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.2.cmml">,</mo><msub id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.cmml"><mi id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.2.cmml">O</mi><mi id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.3.cmml">c</mi></msub><mo id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.4" stretchy="false" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S2.E4X.2.1.1.m1.2.2.1.1.5" xref="S2.E4X.2.1.1.m1.2.2.1.1.5.cmml">=</mo><mrow id="S2.E4X.2.1.1.m1.2.2.1.1.6" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.cmml"><mstyle displaystyle="true" id="S2.E4X.2.1.1.m1.2.2.1.1.6.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.cmml"><munderover id="S2.E4X.2.1.1.m1.2.2.1.1.6.1a" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.cmml"><mo id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.2" movablelimits="false" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.2.cmml">∑</mo><mrow id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.cmml"><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.2.cmml">i</mi><mo id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.1" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.1.cmml">=</mo><mn id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.3.cmml">0</mn></mrow><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.3.cmml">m</mi></munderover></mstyle><mrow id="S2.E4X.2.1.1.m1.2.2.1.1.6.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.cmml"><msubsup id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.cmml"><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.2.cmml">O</mi><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.3.cmml">c</mi><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.3.cmml">i</mi></msubsup><mo id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.1" lspace="0.222em" rspace="0.222em" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.1.cmml">⋅</mo><msup 
id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.cmml"><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.2.cmml">𝒇</mi><mi id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.3" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.3.cmml">𝒊</mi></msup></mrow></mrow></mrow><mo id="S2.E4X.2.1.1.m1.2.2.1.2" xref="S2.E4X.2.1.1.m1.2.2.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E4X.2.1.1.m1.2b"><apply id="S2.E4X.2.1.1.m1.2.2.1.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1"><and id="S2.E4X.2.1.1.m1.2.2.1.1a.cmml" xref="S2.E4X.2.1.1.m1.2.2.1"></and><apply id="S2.E4X.2.1.1.m1.2.2.1.1b.cmml" xref="S2.E4X.2.1.1.m1.2.2.1"><eq id="S2.E4X.2.1.1.m1.2.2.1.1.4.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.4"></eq><ci id="S2.E4X.2.1.1.m1.2.2.1.1.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.3">𝑭</ci><apply id="S2.E4X.2.1.1.m1.2.2.1.1.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1"><times id="S2.E4X.2.1.1.m1.2.2.1.1.1.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.2"></times><apply id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3">subscript</csymbol><ci id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3.2">𝑔</ci><ci id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.3a.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3.3"><mtext id="S2.E4X.2.1.1.m1.2.2.1.1.1.3.3.cmml" mathsize="70%" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.3.3">splatting</mtext></ci></apply><interval closure="open" id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1"><ci id="S2.E4X.2.1.1.m1.1.1.cmml" xref="S2.E4X.2.1.1.m1.1.1">𝒇</ci><apply id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1">subscript</csymbol><ci id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.2">𝑂</ci><ci id="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.3.cmml" 
xref="S2.E4X.2.1.1.m1.2.2.1.1.1.1.1.1.3">𝑐</ci></apply></interval></apply></apply><apply id="S2.E4X.2.1.1.m1.2.2.1.1c.cmml" xref="S2.E4X.2.1.1.m1.2.2.1"><eq id="S2.E4X.2.1.1.m1.2.2.1.1.5.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.5"></eq><share href="https://arxiv.org/html/2503.13434v1#S2.E4X.2.1.1.m1.2.2.1.1.1.cmml" id="S2.E4X.2.1.1.m1.2.2.1.1d.cmml" xref="S2.E4X.2.1.1.m1.2.2.1"></share><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6"><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1">superscript</csymbol><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1">subscript</csymbol><sum id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.2"></sum><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3"><eq id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.1"></eq><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.2">𝑖</ci><cn id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.3.cmml" type="integer" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.2.3.3">0</cn></apply></apply><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.1.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.1.3">𝑚</ci></apply><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2"><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.1">⋅</ci><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2">superscript</csymbol><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.1.cmml" 
xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2">subscript</csymbol><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.2">𝑂</ci><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.2.3">𝑐</ci></apply><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.2.3">𝑖</ci></apply><apply id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3">superscript</csymbol><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.2">𝒇</ci><ci id="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.1.1.6.2.3.3">𝒊</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E4X.2.1.1.m1.2c">\displaystyle\bm{F}=g_{\text{splatting}}(\bm{f},O_{c})=\sum_{i=0}^{m}{O_{c}^{i% }\cdot\bm{f^{i}}},</annotation><annotation encoding="application/x-llamapun" id="S2.E4X.2.1.1.m1.2d">bold_italic_F = italic_g start_POSTSUBSCRIPT splatting end_POSTSUBSCRIPT ( bold_italic_f , italic_O start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_O start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ⋅ bold_italic_f start_POSTSUPERSCRIPT bold_italic_i end_POSTSUPERSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(4)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S2.SS3.p2.8">where <math alttext="\cdot" class="ltx_Math" display="inline" id="S2.SS3.p2.5.m1.1"><semantics id="S2.SS3.p2.5.m1.1a"><mo id="S2.SS3.p2.5.m1.1.1" xref="S2.SS3.p2.5.m1.1.1.cmml">⋅</mo><annotation-xml 
encoding="MathML-Content" id="S2.SS3.p2.5.m1.1b"><ci id="S2.SS3.p2.5.m1.1.1.cmml" xref="S2.SS3.p2.5.m1.1.1">⋅</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.5.m1.1c">\cdot</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.5.m1.1d">⋅</annotation></semantics></math> denotes element-wise multiplication with broadcasting. In <span class="ltx_text ltx_font_italic" id="S2.SS3.p2.8.1">BlobCtrl</span>, we first use DINOv2 <cite class="ltx_cite ltx_citemacro_citep">(Oquab et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib33" title="">2023</a>)</cite> to encode the visual semantics of the foreground element. The semantic features <math alttext="\bm{f}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S2.SS3.p2.6.m2.1"><semantics id="S2.SS3.p2.6.m2.1a"><msub id="S2.SS3.p2.6.m2.1.1" xref="S2.SS3.p2.6.m2.1.1.cmml"><mi id="S2.SS3.p2.6.m2.1.1.2" xref="S2.SS3.p2.6.m2.1.1.2.cmml">𝒇</mi><mi id="S2.SS3.p2.6.m2.1.1.3" xref="S2.SS3.p2.6.m2.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.6.m2.1b"><apply id="S2.SS3.p2.6.m2.1.1.cmml" xref="S2.SS3.p2.6.m2.1.1"><csymbol cd="ambiguous" id="S2.SS3.p2.6.m2.1.1.1.cmml" xref="S2.SS3.p2.6.m2.1.1">subscript</csymbol><ci id="S2.SS3.p2.6.m2.1.1.2.cmml" xref="S2.SS3.p2.6.m2.1.1.2">𝒇</ci><ci id="S2.SS3.p2.6.m2.1.1.3.cmml" xref="S2.SS3.p2.6.m2.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.6.m2.1c">\bm{f}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.6.m2.1d">bold_italic_f start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> are then splatted with the foreground opacity <math alttext="O_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S2.SS3.p2.7.m3.1"><semantics id="S2.SS3.p2.7.m3.1a"><msub id="S2.SS3.p2.7.m3.1.1" xref="S2.SS3.p2.7.m3.1.1.cmml"><mi id="S2.SS3.p2.7.m3.1.1.2" xref="S2.SS3.p2.7.m3.1.1.2.cmml">O</mi><mi 
id="S2.SS3.p2.7.m3.1.1.3" xref="S2.SS3.p2.7.m3.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.7.m3.1b"><apply id="S2.SS3.p2.7.m3.1.1.cmml" xref="S2.SS3.p2.7.m3.1.1"><csymbol cd="ambiguous" id="S2.SS3.p2.7.m3.1.1.1.cmml" xref="S2.SS3.p2.7.m3.1.1">subscript</csymbol><ci id="S2.SS3.p2.7.m3.1.1.2.cmml" xref="S2.SS3.p2.7.m3.1.1.2">𝑂</ci><ci id="S2.SS3.p2.7.m3.1.1.3.cmml" xref="S2.SS3.p2.7.m3.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.7.m3.1c">O_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.7.m3.1d">italic_O start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math>, resulting in the spatially-aware, element-level visual semantic feature <math alttext="\bm{F}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S2.SS3.p2.8.m4.1"><semantics id="S2.SS3.p2.8.m4.1a"><msub id="S2.SS3.p2.8.m4.1.1" xref="S2.SS3.p2.8.m4.1.1.cmml"><mi id="S2.SS3.p2.8.m4.1.1.2" xref="S2.SS3.p2.8.m4.1.1.2.cmml">𝑭</mi><mi id="S2.SS3.p2.8.m4.1.1.3" xref="S2.SS3.p2.8.m4.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.8.m4.1b"><apply id="S2.SS3.p2.8.m4.1.1.cmml" xref="S2.SS3.p2.8.m4.1.1"><csymbol cd="ambiguous" id="S2.SS3.p2.8.m4.1.1.1.cmml" xref="S2.SS3.p2.8.m4.1.1">subscript</csymbol><ci id="S2.SS3.p2.8.m4.1.1.2.cmml" xref="S2.SS3.p2.8.m4.1.1.2">𝑭</ci><ci id="S2.SS3.p2.8.m4.1.1.3.cmml" xref="S2.SS3.p2.8.m4.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.8.m4.1c">\bm{F}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.8.m4.1d">bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math>.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>Self-supervised Paradigm for BlobCtrl</h2> <div class="ltx_para" 
id="S3.p1"> <p class="ltx_p" id="S3.p1.1">As discussed in Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2" title="2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">2</span></a>, the blob-based representation offers continuous spatial control for flexible manipulation, seamless composition for harmonious integration, and spatially-aware splatting for visual semantics. Leveraging these advantages, we introduce a self-supervised training paradigm to develop a robust and versatile model for element-level visual generation and editing.</p> </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1 </span>Model Architecture</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">Building on the blob representation, we propose a dual-branch diffusion model that handles foreground and background elements separately. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.F3" title="Figure 3 ‣ 2.2 Blob Opacity ‣ 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">3</span></a>, our model consists of two key components:</p> </div> <section class="ltx_paragraph" id="S3.SS1.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">Foreground Branch.</h4> <div class="ltx_para" id="S3.SS1.SSS0.Px1.p1"> <p class="ltx_p" id="S3.SS1.SSS0.Px1.p1.11">The foreground branch is designed to preserve the identity and appearance of foreground elements while enabling flexible layout control. 
As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.F3" title="Figure 3 ‣ 2.2 Blob Opacity ‣ 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">3</span></a>, we concatenate the noisy latent <math alttext="\bm{z}_{t}\in\mathbb{R}^{c,h,w}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.1.m1.3"><semantics id="S3.SS1.SSS0.Px1.p1.1.m1.3a"><mrow id="S3.SS1.SSS0.Px1.p1.1.m1.3.4" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.cmml"><msub id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.cmml"><mi id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.2" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.2.cmml">𝒛</mi><mi id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.3" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.3.cmml">t</mi></msub><mo id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.1" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.1.cmml">∈</mo><msup id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.cmml"><mi id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.2" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.2.cmml">ℝ</mi><mrow id="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.5" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.4.cmml"><mi id="S3.SS1.SSS0.Px1.p1.1.m1.1.1.1.1" xref="S3.SS1.SSS0.Px1.p1.1.m1.1.1.1.1.cmml">c</mi><mo id="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.5.1" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.4.cmml">,</mo><mi id="S3.SS1.SSS0.Px1.p1.1.m1.2.2.2.2" xref="S3.SS1.SSS0.Px1.p1.1.m1.2.2.2.2.cmml">h</mi><mo id="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.5.2" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.4.cmml">,</mo><mi id="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.3" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.3.cmml">w</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.1.m1.3b"><apply id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4"><in id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.1.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.1"></in><apply id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.cmml" 
xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.1.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.2.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.2">𝒛</ci><ci id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.3.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.2.3">𝑡</ci></apply><apply id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.1.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3">superscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.2.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.4.3.2">ℝ</ci><list id="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.4.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.5"><ci id="S3.SS1.SSS0.Px1.p1.1.m1.1.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.1.1.1.1">𝑐</ci><ci id="S3.SS1.SSS0.Px1.p1.1.m1.2.2.2.2.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.2.2.2.2">ℎ</ci><ci id="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.3.cmml" xref="S3.SS1.SSS0.Px1.p1.1.m1.3.3.3.3">𝑤</ci></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.1.m1.3c">\bm{z}_{t}\in\mathbb{R}^{c,h,w}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.1.m1.3d">bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_c , italic_h , italic_w end_POSTSUPERSCRIPT</annotation></semantics></math> with reference foreground conditions <math alttext="\bm{c}_{\mathsf{fg}}\in\mathbb{R}^{(c+1+d),h,w}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.2.m2.3"><semantics id="S3.SS1.SSS0.Px1.p1.2.m2.3a"><mrow id="S3.SS1.SSS0.Px1.p1.2.m2.3.4" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.cmml"><msub id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.cmml"><mi id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.2" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.2.cmml">𝒄</mi><mi id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.3" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.3.cmml">𝖿𝗀</mi></msub><mo 
id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.1" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.1.cmml">∈</mo><msup id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.cmml"><mi id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.2" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.2.cmml">ℝ</mi><mrow id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.4.cmml"><mrow id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.cmml"><mo id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.2" stretchy="false" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.cmml">(</mo><mrow id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.2" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.2.cmml">c</mi><mo id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.1" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.1.cmml">+</mo><mn id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.3" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.3.cmml">1</mn><mo id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.1a" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.1.cmml">+</mo><mi id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.4" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.4.cmml">d</mi></mrow><mo id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.3" stretchy="false" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.cmml">)</mo></mrow><mo id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.2" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.4.cmml">,</mo><mi id="S3.SS1.SSS0.Px1.p1.2.m2.1.1.1.1" xref="S3.SS1.SSS0.Px1.p1.2.m2.1.1.1.1.cmml">h</mi><mo id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.3" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.4.cmml">,</mo><mi id="S3.SS1.SSS0.Px1.p1.2.m2.2.2.2.2" xref="S3.SS1.SSS0.Px1.p1.2.m2.2.2.2.2.cmml">w</mi></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.2.m2.3b"><apply id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4"><in id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.1.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.1"></in><apply 
id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.1.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.2.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.2">𝒄</ci><ci id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.3.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.2.3">𝖿𝗀</ci></apply><apply id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.1.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3">superscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.2.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.4.3.2">ℝ</ci><list id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.4.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3"><apply id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1"><plus id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.1"></plus><ci id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.2">𝑐</ci><cn id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.3.cmml" type="integer" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.3">1</cn><ci id="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.4.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.3.3.3.3.1.1.1.4">𝑑</ci></apply><ci id="S3.SS1.SSS0.Px1.p1.2.m2.1.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.1.1.1.1">ℎ</ci><ci id="S3.SS1.SSS0.Px1.p1.2.m2.2.2.2.2.cmml" xref="S3.SS1.SSS0.Px1.p1.2.m2.2.2.2.2">𝑤</ci></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.2.m2.3c">\bm{c}_{\mathsf{fg}}\in\mathbb{R}^{(c+1+d),h,w}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.2.m2.3d">bold_italic_c start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_c + 1 + italic_d ) , italic_h , italic_w end_POSTSUPERSCRIPT</annotation></semantics></math> along the column dimension as 
input to the foreground branch. This column-wise concatenation strategy improves the model’s in-context learning capabilities, allowing it to more effectively grasp and retain the characteristics of elements. The reference foreground conditions <math alttext="\bm{c}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.3.m3.1"><semantics id="S3.SS1.SSS0.Px1.p1.3.m3.1a"><msub id="S3.SS1.SSS0.Px1.p1.3.m3.1.1" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.3.m3.1.1.2" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1.2.cmml">𝒄</mi><mi id="S3.SS1.SSS0.Px1.p1.3.m3.1.1.3" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.3.m3.1b"><apply id="S3.SS1.SSS0.Px1.p1.3.m3.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.3.m3.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.3.m3.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1.2">𝒄</ci><ci id="S3.SS1.SSS0.Px1.p1.3.m3.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.3.m3.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.3.m3.1c">\bm{c}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.3.m3.1d">bold_italic_c start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> are constructed by concatenating three key components along the channel dimension: (1) the opacity map <math alttext="O_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.4.m4.1"><semantics id="S3.SS1.SSS0.Px1.p1.4.m4.1a"><msub id="S3.SS1.SSS0.Px1.p1.4.m4.1.1" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.4.m4.1.1.2" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1.2.cmml">O</mi><mi id="S3.SS1.SSS0.Px1.p1.4.m4.1.1.3" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.4.m4.1b"><apply 
id="S3.SS1.SSS0.Px1.p1.4.m4.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.4.m4.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.4.m4.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1.2">𝑂</ci><ci id="S3.SS1.SSS0.Px1.p1.4.m4.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.4.m4.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.4.m4.1c">O_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.4.m4.1d">italic_O start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> for layout information, (2) spatially-aware semantic features <math alttext="\bm{F}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.5.m5.1"><semantics id="S3.SS1.SSS0.Px1.p1.5.m5.1a"><msub id="S3.SS1.SSS0.Px1.p1.5.m5.1.1" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.5.m5.1.1.2" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1.2.cmml">𝑭</mi><mi id="S3.SS1.SSS0.Px1.p1.5.m5.1.1.3" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.5.m5.1b"><apply id="S3.SS1.SSS0.Px1.p1.5.m5.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.5.m5.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.5.m5.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1.2">𝑭</ci><ci id="S3.SS1.SSS0.Px1.p1.5.m5.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.5.m5.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.5.m5.1c">\bm{F}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.5.m5.1d">bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> for identity preservation, and (3) VAE latent <math alttext="\bm{z}_{\mathsf{fg}}" class="ltx_Math" display="inline" 
id="S3.SS1.SSS0.Px1.p1.6.m6.1"><semantics id="S3.SS1.SSS0.Px1.p1.6.m6.1a"><msub id="S3.SS1.SSS0.Px1.p1.6.m6.1.1" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.6.m6.1.1.2" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1.2.cmml">𝒛</mi><mi id="S3.SS1.SSS0.Px1.p1.6.m6.1.1.3" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.6.m6.1b"><apply id="S3.SS1.SSS0.Px1.p1.6.m6.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.6.m6.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.6.m6.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1.2">𝒛</ci><ci id="S3.SS1.SSS0.Px1.p1.6.m6.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.6.m6.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.6.m6.1c">\bm{z}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.6.m6.1d">bold_italic_z start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> for appearance encoding. 
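</p>
<p class="ltx_p">The splatting of Eq. (4) and the channel-wise construction of the reference foreground conditions described above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: the helper names <span class="ltx_text ltx_font_typewriter">splat</span> and <span class="ltx_text ltx_font_typewriter">build_fg_condition</span>, the toy tensor sizes, the random stand-ins for the DINOv2 feature and the VAE latent, and the channel order (latent, opacity, splatted feature, chosen to mirror the (c+1+d) channel count) are hypothetical and are not taken from the BlobCtrl implementation.</p>

```python
import numpy as np


def splat(features, opacities):
    """Splatting in the sense of Eq. (4): F = sum_i O^i * f^i.

    features:  (m, d) array, one d-dimensional semantic vector per blob.
    opacities: (m, h, w) array, one opacity map per blob.
    returns:   (d, h, w) array; each vector is broadcast over its opacity
               map and the per-blob results are summed.
    """
    # Contract the blob axis (axis 0 of both inputs).
    return np.tensordot(features, opacities, axes=([0], [0]))


def build_fg_condition(z_fg, o_fg, f_fg):
    """Stack latent, opacity and splatted feature along the channel axis.

    z_fg: (c, h, w) stand-in for the VAE latent of the foreground element.
    o_fg: (h, w) foreground opacity map.
    f_fg: (d,) stand-in for the foreground semantic feature.
    returns: (c + 1 + d, h, w) array.
    NOTE: the channel order used here is an assumption.
    """
    big_f_fg = splat(f_fg[None, :], o_fg[None, :, :])  # (d, h, w)
    return np.concatenate([z_fg, o_fg[None, :, :], big_f_fg], axis=0)


# Toy sizes, chosen only for illustration.
c, d, h, w = 4, 8, 16, 16
rng = np.random.default_rng(0)
z_fg = rng.standard_normal((c, h, w))
o_fg = rng.uniform(0.0, 1.0, size=(h, w))
f_fg = rng.standard_normal(d)

c_fg = build_fg_condition(z_fg, o_fg, f_fg)
print(c_fg.shape)  # (13, 16, 16), i.e. (c + 1 + d, h, w)
```

<p class="ltx_p">With a single foreground blob the sum in Eq. (4) has one term, so the splatted feature is simply the opacity map scaled by each feature component, which is why one channel per feature dimension appears in the result.</p>
<p class="ltx_p">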
To ensure dimensional compatibility for the column-wise concatenation between <math alttext="\bm{z_{t}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.7.m7.1"><semantics id="S3.SS1.SSS0.Px1.p1.7.m7.1a"><msub id="S3.SS1.SSS0.Px1.p1.7.m7.1.1" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.7.m7.1.1.2" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1.2.cmml">𝒛</mi><mi id="S3.SS1.SSS0.Px1.p1.7.m7.1.1.3" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1.3.cmml">𝒕</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.7.m7.1b"><apply id="S3.SS1.SSS0.Px1.p1.7.m7.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.7.m7.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.7.m7.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1.2">𝒛</ci><ci id="S3.SS1.SSS0.Px1.p1.7.m7.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.7.m7.1.1.3">𝒕</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.7.m7.1c">\bm{z_{t}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.7.m7.1d">bold_italic_z start_POSTSUBSCRIPT bold_italic_t end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="\bm{c}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.8.m8.1"><semantics id="S3.SS1.SSS0.Px1.p1.8.m8.1a"><msub id="S3.SS1.SSS0.Px1.p1.8.m8.1.1" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.8.m8.1.1.2" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1.2.cmml">𝒄</mi><mi id="S3.SS1.SSS0.Px1.p1.8.m8.1.1.3" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.8.m8.1b"><apply id="S3.SS1.SSS0.Px1.p1.8.m8.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.8.m8.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.8.m8.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1.2">𝒄</ci><ci 
id="S3.SS1.SSS0.Px1.p1.8.m8.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.8.m8.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.8.m8.1c">\bm{c}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.8.m8.1d">bold_italic_c start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math>, we additionally concatenate <math alttext="O_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.9.m9.1"><semantics id="S3.SS1.SSS0.Px1.p1.9.m9.1a"><msub id="S3.SS1.SSS0.Px1.p1.9.m9.1.1" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.9.m9.1.1.2" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1.2.cmml">O</mi><mi id="S3.SS1.SSS0.Px1.p1.9.m9.1.1.3" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.9.m9.1b"><apply id="S3.SS1.SSS0.Px1.p1.9.m9.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.9.m9.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.9.m9.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1.2">𝑂</ci><ci id="S3.SS1.SSS0.Px1.p1.9.m9.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.9.m9.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.9.m9.1c">O_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.9.m9.1d">italic_O start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="\bm{F}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.10.m10.1"><semantics id="S3.SS1.SSS0.Px1.p1.10.m10.1a"><msub id="S3.SS1.SSS0.Px1.p1.10.m10.1.1" xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.10.m10.1.1.2" xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1.2.cmml">𝑭</mi><mi id="S3.SS1.SSS0.Px1.p1.10.m10.1.1.3" 
xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.10.m10.1b"><apply id="S3.SS1.SSS0.Px1.p1.10.m10.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.10.m10.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.10.m10.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1.2">𝑭</ci><ci id="S3.SS1.SSS0.Px1.p1.10.m10.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.10.m10.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.10.m10.1c">\bm{F}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.10.m10.1d">bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> along the channel dimension of <math alttext="\bm{z_{t}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px1.p1.11.m11.1"><semantics id="S3.SS1.SSS0.Px1.p1.11.m11.1a"><msub id="S3.SS1.SSS0.Px1.p1.11.m11.1.1" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1.cmml"><mi id="S3.SS1.SSS0.Px1.p1.11.m11.1.1.2" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1.2.cmml">𝒛</mi><mi id="S3.SS1.SSS0.Px1.p1.11.m11.1.1.3" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1.3.cmml">𝒕</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px1.p1.11.m11.1b"><apply id="S3.SS1.SSS0.Px1.p1.11.m11.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px1.p1.11.m11.1.1.1.cmml" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px1.p1.11.m11.1.1.2.cmml" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1.2">𝒛</ci><ci id="S3.SS1.SSS0.Px1.p1.11.m11.1.1.3.cmml" xref="S3.SS1.SSS0.Px1.p1.11.m11.1.1.3">𝒕</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.11.m11.1c">\bm{z_{t}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.11.m11.1d">bold_italic_z start_POSTSUBSCRIPT bold_italic_t 
end_POSTSUBSCRIPT</annotation></semantics></math>. The input construction process for the foreground branch can be formally expressed as:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E5"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E5X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\text{input}_{\mathsf{fg}}=\left[\bm{c}_{\mathsf{fg}},\left[\bm{z% }_{t},O_{\mathsf{fg}},\bm{F}_{\mathsf{fg}};\text{axis}=0\right];\text{axis}=2% \right]," class="ltx_Math" display="inline" id="S3.E5X.2.1.1.m1.3"><semantics id="S3.E5X.2.1.1.m1.3a"><mrow id="S3.E5X.2.1.1.m1.3.3.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.cmml"><mrow id="S3.E5X.2.1.1.m1.3.3.1.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.cmml"><msub id="S3.E5X.2.1.1.m1.3.3.1.1.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.3.cmml"><mtext id="S3.E5X.2.1.1.m1.3.3.1.1.3.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.3.2a.cmml">input</mtext><mi id="S3.E5X.2.1.1.m1.3.3.1.1.3.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.3.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E5X.2.1.1.m1.3.3.1.1.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.2.cmml">=</mo><mrow id="S3.E5X.2.1.1.m1.3.3.1.1.1.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.2.cmml"><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.2.1.cmml">[</mo><mrow id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.cmml"><mrow id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml"><msub id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml">𝒄</mi><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml">,</mo><mrow id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.cmml"><mo 
id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.1.cmml">[</mo><mrow id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.cmml"><mrow id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.4.cmml"><msub id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.cmml"><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.4" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.4.cmml">,</mo><msub id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.cmml"><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2.cmml">O</mi><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.5" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.4.cmml">,</mo><msub id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.cmml"><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.2.cmml">𝑭</mi><mi id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.6" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.4.cmml">;</mo><mtext id="S3.E5X.2.1.1.m1.1.1" xref="S3.E5X.2.1.1.m1.1.1a.cmml">axis</mtext></mrow><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4.cmml">=</mo><mn 
id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.5" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.5.cmml">0</mn></mrow><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.1.cmml">]</mo></mrow><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.4" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml">;</mo><mtext id="S3.E5X.2.1.1.m1.2.2" xref="S3.E5X.2.1.1.m1.2.2a.cmml">axis</mtext></mrow><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.3.cmml">=</mo><mn id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.4" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.4.cmml">2</mn></mrow><mo id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.3" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.2.1.cmml">]</mo></mrow></mrow><mo id="S3.E5X.2.1.1.m1.3.3.1.2" xref="S3.E5X.2.1.1.m1.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E5X.2.1.1.m1.3b"><apply id="S3.E5X.2.1.1.m1.3.3.1.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1"><eq id="S3.E5X.2.1.1.m1.3.3.1.1.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.2"></eq><apply id="S3.E5X.2.1.1.m1.3.3.1.1.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="S3.E5X.2.1.1.m1.3.3.1.1.3.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.3">subscript</csymbol><ci id="S3.E5X.2.1.1.m1.3.3.1.1.3.2a.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.3.2"><mtext id="S3.E5X.2.1.1.m1.3.3.1.1.3.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.3.2">input</mtext></ci><ci id="S3.E5X.2.1.1.m1.3.3.1.1.3.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.3.3">𝖿𝗀</ci></apply><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1"><csymbol cd="latexml" id="S3.E5X.2.1.1.m1.3.3.1.1.1.2.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.2">delimited-[]</csymbol><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1"><eq id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.3"></eq><list id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2"><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.cmml" 
xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2">𝒄</ci><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3">𝖿𝗀</ci></apply><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1"><csymbol cd="latexml" id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.2">delimited-[]</csymbol><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1"><eq id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4"></eq><list id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.4.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3"><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1">subscript</csymbol><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2">𝒛</ci><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3">𝑡</ci></apply><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2">subscript</csymbol><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2">𝑂</ci><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3.cmml" 
xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3">𝖿𝗀</ci></apply><apply id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.1.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3">subscript</csymbol><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.2.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.2">𝑭</ci><ci id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.3.cmml" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.3.3.3">𝖿𝗀</ci></apply><ci id="S3.E5X.2.1.1.m1.1.1a.cmml" xref="S3.E5X.2.1.1.m1.1.1"><mtext id="S3.E5X.2.1.1.m1.1.1.cmml" xref="S3.E5X.2.1.1.m1.1.1">axis</mtext></ci></list><cn id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.5.cmml" type="integer" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.5">0</cn></apply></apply><ci id="S3.E5X.2.1.1.m1.2.2a.cmml" xref="S3.E5X.2.1.1.m1.2.2"><mtext id="S3.E5X.2.1.1.m1.2.2.cmml" xref="S3.E5X.2.1.1.m1.2.2">axis</mtext></ci></list><cn id="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.4.cmml" type="integer" xref="S3.E5X.2.1.1.m1.3.3.1.1.1.1.1.4">2</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E5X.2.1.1.m1.3c">\displaystyle\text{input}_{\mathsf{fg}}=\left[\bm{c}_{\mathsf{fg}},\left[\bm{z% }_{t},O_{\mathsf{fg}},\bm{F}_{\mathsf{fg}};\text{axis}=0\right];\text{axis}=2% \right],</annotation><annotation encoding="application/x-llamapun" id="S3.E5X.2.1.1.m1.3d">input start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT = [ bold_italic_c start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT , [ bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_O start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT , bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT ; axis = 0 ] ; axis = 2 ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle 
ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(5)</span></td> </tr> </tbody> </table> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E6"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E6X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\bm{c}_{\mathsf{fg}}=\left[\bm{z}_{\mathsf{fg}},O_{\mathsf{fg}},% \bm{F}_{\mathsf{fg}};\text{axis}=0\right]\in\mathbb{R}^{(c+1+d),h,w}," class="ltx_Math" display="inline" id="S3.E6X.2.1.1.m1.5"><semantics id="S3.E6X.2.1.1.m1.5a"><mrow id="S3.E6X.2.1.1.m1.5.5.1" xref="S3.E6X.2.1.1.m1.5.5.1.1.cmml"><mrow id="S3.E6X.2.1.1.m1.5.5.1.1" xref="S3.E6X.2.1.1.m1.5.5.1.1.cmml"><msub id="S3.E6X.2.1.1.m1.5.5.1.1.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.3.cmml"><mi id="S3.E6X.2.1.1.m1.5.5.1.1.3.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.3.2.cmml">𝒄</mi><mi id="S3.E6X.2.1.1.m1.5.5.1.1.3.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.3.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E6X.2.1.1.m1.5.5.1.1.4" xref="S3.E6X.2.1.1.m1.5.5.1.1.4.cmml">=</mo><mrow id="S3.E6X.2.1.1.m1.5.5.1.1.1.1" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.2.cmml"><mo id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.2.1.cmml">[</mo><mrow id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.cmml"><mrow id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.4.cmml"><msub id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.4" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.4.cmml">,</mo><msub id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.cmml"><mi id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2" 
xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2.cmml">O</mi><mi id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.5" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.4.cmml">,</mo><msub id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.cmml"><mi id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.2.cmml">𝑭</mi><mi id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.6" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.4.cmml">;</mo><mtext id="S3.E6X.2.1.1.m1.4.4" xref="S3.E6X.2.1.1.m1.4.4a.cmml">axis</mtext></mrow><mo id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.4" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.4.cmml">=</mo><mn id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.5" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.5.cmml">0</mn></mrow><mo id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.3" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.2.1.cmml">]</mo></mrow><mo id="S3.E6X.2.1.1.m1.5.5.1.1.5" xref="S3.E6X.2.1.1.m1.5.5.1.1.5.cmml">∈</mo><msup id="S3.E6X.2.1.1.m1.5.5.1.1.6" xref="S3.E6X.2.1.1.m1.5.5.1.1.6.cmml"><mi id="S3.E6X.2.1.1.m1.5.5.1.1.6.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.6.2.cmml">ℝ</mi><mrow id="S3.E6X.2.1.1.m1.3.3.3.3" xref="S3.E6X.2.1.1.m1.3.3.3.4.cmml"><mrow id="S3.E6X.2.1.1.m1.3.3.3.3.1.1" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.cmml"><mo id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.2" stretchy="false" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.cmml">(</mo><mrow id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.cmml"><mi id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.2" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.2.cmml">c</mi><mo id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.1" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.1.cmml">+</mo><mn id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.3" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.3.cmml">1</mn><mo id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.1a" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.1.cmml">+</mo><mi 
id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.4" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.4.cmml">d</mi></mrow><mo id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.3" stretchy="false" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.cmml">)</mo></mrow><mo id="S3.E6X.2.1.1.m1.3.3.3.3.2" xref="S3.E6X.2.1.1.m1.3.3.3.4.cmml">,</mo><mi id="S3.E6X.2.1.1.m1.1.1.1.1" xref="S3.E6X.2.1.1.m1.1.1.1.1.cmml">h</mi><mo id="S3.E6X.2.1.1.m1.3.3.3.3.3" xref="S3.E6X.2.1.1.m1.3.3.3.4.cmml">,</mo><mi id="S3.E6X.2.1.1.m1.2.2.2.2" xref="S3.E6X.2.1.1.m1.2.2.2.2.cmml">w</mi></mrow></msup></mrow><mo id="S3.E6X.2.1.1.m1.5.5.1.2" xref="S3.E6X.2.1.1.m1.5.5.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E6X.2.1.1.m1.5b"><apply id="S3.E6X.2.1.1.m1.5.5.1.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1"><and id="S3.E6X.2.1.1.m1.5.5.1.1a.cmml" xref="S3.E6X.2.1.1.m1.5.5.1"></and><apply id="S3.E6X.2.1.1.m1.5.5.1.1b.cmml" xref="S3.E6X.2.1.1.m1.5.5.1"><eq id="S3.E6X.2.1.1.m1.5.5.1.1.4.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.4"></eq><apply id="S3.E6X.2.1.1.m1.5.5.1.1.3.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.3"><csymbol cd="ambiguous" id="S3.E6X.2.1.1.m1.5.5.1.1.3.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.3">subscript</csymbol><ci id="S3.E6X.2.1.1.m1.5.5.1.1.3.2.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.3.2">𝒄</ci><ci id="S3.E6X.2.1.1.m1.5.5.1.1.3.3.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.3.3">𝖿𝗀</ci></apply><apply id="S3.E6X.2.1.1.m1.5.5.1.1.1.2.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1"><csymbol cd="latexml" id="S3.E6X.2.1.1.m1.5.5.1.1.1.2.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.2">delimited-[]</csymbol><apply id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1"><eq id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.4.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.4"></eq><list id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.4.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3"><apply id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.1.cmml" 
xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2">𝒛</ci><ci id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3">𝖿𝗀</ci></apply><apply id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2">𝑂</ci><ci id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3">𝖿𝗀</ci></apply><apply id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3">subscript</csymbol><ci id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.2.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.2">𝑭</ci><ci id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.3.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.3.3.3.3">𝖿𝗀</ci></apply><ci id="S3.E6X.2.1.1.m1.4.4a.cmml" xref="S3.E6X.2.1.1.m1.4.4"><mtext id="S3.E6X.2.1.1.m1.4.4.cmml" xref="S3.E6X.2.1.1.m1.4.4">axis</mtext></ci></list><cn id="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.5.cmml" type="integer" xref="S3.E6X.2.1.1.m1.5.5.1.1.1.1.1.5">0</cn></apply></apply></apply><apply id="S3.E6X.2.1.1.m1.5.5.1.1c.cmml" xref="S3.E6X.2.1.1.m1.5.5.1"><in id="S3.E6X.2.1.1.m1.5.5.1.1.5.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.5"></in><share href="https://arxiv.org/html/2503.13434v1#S3.E6X.2.1.1.m1.5.5.1.1.1.cmml" id="S3.E6X.2.1.1.m1.5.5.1.1d.cmml" xref="S3.E6X.2.1.1.m1.5.5.1"></share><apply id="S3.E6X.2.1.1.m1.5.5.1.1.6.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.6"><csymbol cd="ambiguous" id="S3.E6X.2.1.1.m1.5.5.1.1.6.1.cmml" xref="S3.E6X.2.1.1.m1.5.5.1.1.6">superscript</csymbol><ci id="S3.E6X.2.1.1.m1.5.5.1.1.6.2.cmml" 
xref="S3.E6X.2.1.1.m1.5.5.1.1.6.2">ℝ</ci><list id="S3.E6X.2.1.1.m1.3.3.3.4.cmml" xref="S3.E6X.2.1.1.m1.3.3.3.3"><apply id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.cmml" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1"><plus id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.1.cmml" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.1"></plus><ci id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.2.cmml" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.2">𝑐</ci><cn id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.3.cmml" type="integer" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.3">1</cn><ci id="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.4.cmml" xref="S3.E6X.2.1.1.m1.3.3.3.3.1.1.1.4">𝑑</ci></apply><ci id="S3.E6X.2.1.1.m1.1.1.1.1.cmml" xref="S3.E6X.2.1.1.m1.1.1.1.1">ℎ</ci><ci id="S3.E6X.2.1.1.m1.2.2.2.2.cmml" xref="S3.E6X.2.1.1.m1.2.2.2.2">𝑤</ci></list></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E6X.2.1.1.m1.5c">\displaystyle\bm{c}_{\mathsf{fg}}=\left[\bm{z}_{\mathsf{fg}},O_{\mathsf{fg}},% \bm{F}_{\mathsf{fg}};\text{axis}=0\right]\in\mathbb{R}^{(c+1+d),h,w},</annotation><annotation encoding="application/x-llamapun" id="S3.E6X.2.1.1.m1.5d">bold_italic_c start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT = [ bold_italic_z start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT , italic_O start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT , bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT ; axis = 0 ] ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_c + 1 + italic_d ) , italic_h , italic_w end_POSTSUPERSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(6)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS1.SSS0.Px1.p1.13">where <math alttext="[;\text{axis}=0]" class="ltx_math_unparsed" display="inline" id="S3.SS1.SSS0.Px1.p1.12.m1.1"><semantics id="S3.SS1.SSS0.Px1.p1.12.m1.1a"><mrow id="S3.SS1.SSS0.Px1.p1.12.m1.1b"><mo 
id="S3.SS1.SSS0.Px1.p1.12.m1.1.1" stretchy="false">[</mo><mo id="S3.SS1.SSS0.Px1.p1.12.m1.1.2">;</mo><mtext id="S3.SS1.SSS0.Px1.p1.12.m1.1.3">axis</mtext><mo id="S3.SS1.SSS0.Px1.p1.12.m1.1.4">=</mo><mn id="S3.SS1.SSS0.Px1.p1.12.m1.1.5">0</mn><mo id="S3.SS1.SSS0.Px1.p1.12.m1.1.6" stretchy="false">]</mo></mrow><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.12.m1.1c">[;\text{axis}=0]</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.12.m1.1d">[ ; axis = 0 ]</annotation></semantics></math> and <math alttext="[;\text{axis}=2]" class="ltx_math_unparsed" display="inline" id="S3.SS1.SSS0.Px1.p1.13.m2.1"><semantics id="S3.SS1.SSS0.Px1.p1.13.m2.1a"><mrow id="S3.SS1.SSS0.Px1.p1.13.m2.1b"><mo id="S3.SS1.SSS0.Px1.p1.13.m2.1.1" stretchy="false">[</mo><mo id="S3.SS1.SSS0.Px1.p1.13.m2.1.2">;</mo><mtext id="S3.SS1.SSS0.Px1.p1.13.m2.1.3">axis</mtext><mo id="S3.SS1.SSS0.Px1.p1.13.m2.1.4">=</mo><mn id="S3.SS1.SSS0.Px1.p1.13.m2.1.5">2</mn><mo id="S3.SS1.SSS0.Px1.p1.13.m2.1.6" stretchy="false">]</mo></mrow><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px1.p1.13.m2.1c">[;\text{axis}=2]</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px1.p1.13.m2.1d">[ ; axis = 2 ]</annotation></semantics></math> denote concatenation along the channel and column dimensions, respectively.</p> </div> <div class="ltx_para" id="S3.SS1.SSS0.Px1.p2"> <p class="ltx_p" id="S3.SS1.SSS0.Px1.p2.1">To process element-level foreground input, we use a modified pre-trained diffusion backbone with cross-attention layers removed. 
This approach serves two purposes: the pre-trained weights offer a strong generative prior for effective foreground feature processing, and removing cross-attention layers ensures the model focuses solely on visual content without broader contextual influences.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Background Branch.</h4> <div class="ltx_para" id="S3.SS1.SSS0.Px2.p1"> <p class="ltx_p" id="S3.SS1.SSS0.Px2.p1.2">The background branch serves as a diffusion backbone that aims to preserve the original background while harmoniously integrating foreground elements into the scene. Similarly, we concatenate the noisy latent <math alttext="\bm{z}_{t}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px2.p1.1.m1.1"><semantics id="S3.SS1.SSS0.Px2.p1.1.m1.1a"><msub id="S3.SS1.SSS0.Px2.p1.1.m1.1.1" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1.cmml"><mi id="S3.SS1.SSS0.Px2.p1.1.m1.1.1.2" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1.2.cmml">𝒛</mi><mi id="S3.SS1.SSS0.Px2.p1.1.m1.1.1.3" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px2.p1.1.m1.1b"><apply id="S3.SS1.SSS0.Px2.p1.1.m1.1.1.cmml" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px2.p1.1.m1.1.1.1.cmml" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px2.p1.1.m1.1.1.2.cmml" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1.2">𝒛</ci><ci id="S3.SS1.SSS0.Px2.p1.1.m1.1.1.3.cmml" xref="S3.SS1.SSS0.Px2.p1.1.m1.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px2.p1.1.m1.1c">\bm{z}_{t}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px2.p1.1.m1.1d">bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> with reference background conditions <math alttext="\bm{c}_{\mathsf{bg}}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px2.p1.2.m2.1"><semantics 
id="S3.SS1.SSS0.Px2.p1.2.m2.1a"><msub id="S3.SS1.SSS0.Px2.p1.2.m2.1.1" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1.cmml"><mi id="S3.SS1.SSS0.Px2.p1.2.m2.1.1.2" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1.2.cmml">𝒄</mi><mi id="S3.SS1.SSS0.Px2.p1.2.m2.1.1.3" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1.3.cmml">𝖻𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px2.p1.2.m2.1b"><apply id="S3.SS1.SSS0.Px2.p1.2.m2.1.1.cmml" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px2.p1.2.m2.1.1.1.cmml" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1">subscript</csymbol><ci id="S3.SS1.SSS0.Px2.p1.2.m2.1.1.2.cmml" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1.2">𝒄</ci><ci id="S3.SS1.SSS0.Px2.p1.2.m2.1.1.3.cmml" xref="S3.SS1.SSS0.Px2.p1.2.m2.1.1.3">𝖻𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px2.p1.2.m2.1c">\bm{c}_{\mathsf{bg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px2.p1.2.m2.1d">bold_italic_c start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT</annotation></semantics></math> along the column dimension as input, as shown below:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E7"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E7X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\text{input}_{\mathsf{bg}}=\left[\bm{c}_{\mathsf{bg}},\left[\bm{z% }_{t},O_{\mathsf{bg}};\text{axis}=0\right];\text{axis}=2\right]," class="ltx_Math" display="inline" id="S3.E7X.2.1.1.m1.3"><semantics id="S3.E7X.2.1.1.m1.3a"><mrow id="S3.E7X.2.1.1.m1.3.3.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.cmml"><mrow id="S3.E7X.2.1.1.m1.3.3.1.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.cmml"><msub id="S3.E7X.2.1.1.m1.3.3.1.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.3.cmml"><mtext id="S3.E7X.2.1.1.m1.3.3.1.1.3.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.3.2a.cmml">input</mtext><mi id="S3.E7X.2.1.1.m1.3.3.1.1.3.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.3.3.cmml">𝖻𝗀</mi></msub><mo 
id="S3.E7X.2.1.1.m1.3.3.1.1.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.2.cmml">=</mo><mrow id="S3.E7X.2.1.1.m1.3.3.1.1.1.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.2.cmml"><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.2.1.cmml">[</mo><mrow id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.cmml"><mrow id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml"><msub id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml">𝒄</mi><mi id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml">𝖻𝗀</mi></msub><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml">,</mo><mrow id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.cmml"><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.1.cmml">[</mo><mrow id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.cmml"><mrow id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.3.cmml"><msub id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.cmml"><mi id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.3.cmml">,</mo><msub id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.cmml"><mi id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2.cmml">O</mi><mi 
id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3.cmml">𝖻𝗀</mi></msub><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.4" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.3.cmml">;</mo><mtext id="S3.E7X.2.1.1.m1.1.1" xref="S3.E7X.2.1.1.m1.1.1a.cmml">axis</mtext></mrow><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.cmml">=</mo><mn id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4.cmml">0</mn></mrow><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.1.cmml">]</mo></mrow><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.4" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml">;</mo><mtext id="S3.E7X.2.1.1.m1.2.2" xref="S3.E7X.2.1.1.m1.2.2a.cmml">axis</mtext></mrow><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.3.cmml">=</mo><mn id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.4" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.4.cmml">2</mn></mrow><mo id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.3" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.2.1.cmml">]</mo></mrow></mrow><mo id="S3.E7X.2.1.1.m1.3.3.1.2" xref="S3.E7X.2.1.1.m1.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E7X.2.1.1.m1.3b"><apply id="S3.E7X.2.1.1.m1.3.3.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1"><eq id="S3.E7X.2.1.1.m1.3.3.1.1.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.2"></eq><apply id="S3.E7X.2.1.1.m1.3.3.1.1.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="S3.E7X.2.1.1.m1.3.3.1.1.3.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.3">subscript</csymbol><ci id="S3.E7X.2.1.1.m1.3.3.1.1.3.2a.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.3.2"><mtext id="S3.E7X.2.1.1.m1.3.3.1.1.3.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.3.2">input</mtext></ci><ci id="S3.E7X.2.1.1.m1.3.3.1.1.3.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.3.3">𝖻𝗀</ci></apply><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1"><csymbol cd="latexml" 
id="S3.E7X.2.1.1.m1.3.3.1.1.1.2.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.2">delimited-[]</csymbol><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1"><eq id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.3"></eq><list id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2"><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.2">𝒄</ci><ci id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.1.1.1.3">𝖻𝗀</ci></apply><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1"><csymbol cd="latexml" id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.2.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.2">delimited-[]</csymbol><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1"><eq id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.3"></eq><list id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2"><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1">subscript</csymbol><ci id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.2">𝒛</ci><ci id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.1.1.1.3">𝑡</ci></apply><apply id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.cmml" 
xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.1.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2">subscript</csymbol><ci id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.2">𝑂</ci><ci id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3.cmml" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.2.2.2.3">𝖻𝗀</ci></apply><ci id="S3.E7X.2.1.1.m1.1.1a.cmml" xref="S3.E7X.2.1.1.m1.1.1"><mtext id="S3.E7X.2.1.1.m1.1.1.cmml" xref="S3.E7X.2.1.1.m1.1.1">axis</mtext></ci></list><cn id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4.cmml" type="integer" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.2.2.2.1.1.4">0</cn></apply></apply><ci id="S3.E7X.2.1.1.m1.2.2a.cmml" xref="S3.E7X.2.1.1.m1.2.2"><mtext id="S3.E7X.2.1.1.m1.2.2.cmml" xref="S3.E7X.2.1.1.m1.2.2">axis</mtext></ci></list><cn id="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.4.cmml" type="integer" xref="S3.E7X.2.1.1.m1.3.3.1.1.1.1.1.4">2</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E7X.2.1.1.m1.3c">\displaystyle\text{input}_{\mathsf{bg}}=\left[\bm{c}_{\mathsf{bg}},\left[\bm{z% }_{t},O_{\mathsf{bg}};\text{axis}=0\right];\text{axis}=2\right],</annotation><annotation encoding="application/x-llamapun" id="S3.E7X.2.1.1.m1.3d">input start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT = [ bold_italic_c start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT , [ bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_O start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT ; axis = 0 ] ; axis = 2 ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(7)</span></td> </tr> </tbody> </table> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E8"> <tbody> <tr class="ltx_equation 
ltx_eqn_row ltx_align_baseline" id="S3.E8X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\bm{c}_{\mathsf{bg}}=\left[\bm{z}_{\mathsf{bg}},O_{\mathsf{bg}};% \text{axis}=0\right]\in\mathbb{R}^{(c+1),h,w}," class="ltx_Math" display="inline" id="S3.E8X.2.1.1.m1.5"><semantics id="S3.E8X.2.1.1.m1.5a"><mrow id="S3.E8X.2.1.1.m1.5.5.1" xref="S3.E8X.2.1.1.m1.5.5.1.1.cmml"><mrow id="S3.E8X.2.1.1.m1.5.5.1.1" xref="S3.E8X.2.1.1.m1.5.5.1.1.cmml"><msub id="S3.E8X.2.1.1.m1.5.5.1.1.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.3.cmml"><mi id="S3.E8X.2.1.1.m1.5.5.1.1.3.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.3.2.cmml">𝒄</mi><mi id="S3.E8X.2.1.1.m1.5.5.1.1.3.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.3.3.cmml">𝖻𝗀</mi></msub><mo id="S3.E8X.2.1.1.m1.5.5.1.1.4" xref="S3.E8X.2.1.1.m1.5.5.1.1.4.cmml">=</mo><mrow id="S3.E8X.2.1.1.m1.5.5.1.1.1.1" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.2.cmml"><mo id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.2.1.cmml">[</mo><mrow id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.cmml"><mrow id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.3.cmml"><msub id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3.cmml">𝖻𝗀</mi></msub><mo id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.3.cmml">,</mo><msub id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.cmml"><mi id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2.cmml">O</mi><mi id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3.cmml">𝖻𝗀</mi></msub><mo id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.4" 
xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.3.cmml">;</mo><mtext id="S3.E8X.2.1.1.m1.4.4" xref="S3.E8X.2.1.1.m1.4.4a.cmml">axis</mtext></mrow><mo id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.3.cmml">=</mo><mn id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.4" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.4.cmml">0</mn></mrow><mo id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.3" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.2.1.cmml">]</mo></mrow><mo id="S3.E8X.2.1.1.m1.5.5.1.1.5" xref="S3.E8X.2.1.1.m1.5.5.1.1.5.cmml">∈</mo><msup id="S3.E8X.2.1.1.m1.5.5.1.1.6" xref="S3.E8X.2.1.1.m1.5.5.1.1.6.cmml"><mi id="S3.E8X.2.1.1.m1.5.5.1.1.6.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.6.2.cmml">ℝ</mi><mrow id="S3.E8X.2.1.1.m1.3.3.3.3" xref="S3.E8X.2.1.1.m1.3.3.3.4.cmml"><mrow id="S3.E8X.2.1.1.m1.3.3.3.3.1.1" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.cmml"><mo id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.2" stretchy="false" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.cmml">(</mo><mrow id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.cmml"><mi id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.2" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.2.cmml">c</mi><mo id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.1" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.1.cmml">+</mo><mn id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.3" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.3.cmml">1</mn></mrow><mo id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.3" stretchy="false" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.cmml">)</mo></mrow><mo id="S3.E8X.2.1.1.m1.3.3.3.3.2" xref="S3.E8X.2.1.1.m1.3.3.3.4.cmml">,</mo><mi id="S3.E8X.2.1.1.m1.1.1.1.1" xref="S3.E8X.2.1.1.m1.1.1.1.1.cmml">h</mi><mo id="S3.E8X.2.1.1.m1.3.3.3.3.3" xref="S3.E8X.2.1.1.m1.3.3.3.4.cmml">,</mo><mi id="S3.E8X.2.1.1.m1.2.2.2.2" xref="S3.E8X.2.1.1.m1.2.2.2.2.cmml">w</mi></mrow></msup></mrow><mo id="S3.E8X.2.1.1.m1.5.5.1.2" xref="S3.E8X.2.1.1.m1.5.5.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E8X.2.1.1.m1.5b"><apply id="S3.E8X.2.1.1.m1.5.5.1.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1"><and id="S3.E8X.2.1.1.m1.5.5.1.1a.cmml" 
xref="S3.E8X.2.1.1.m1.5.5.1"></and><apply id="S3.E8X.2.1.1.m1.5.5.1.1b.cmml" xref="S3.E8X.2.1.1.m1.5.5.1"><eq id="S3.E8X.2.1.1.m1.5.5.1.1.4.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.4"></eq><apply id="S3.E8X.2.1.1.m1.5.5.1.1.3.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.3"><csymbol cd="ambiguous" id="S3.E8X.2.1.1.m1.5.5.1.1.3.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.3">subscript</csymbol><ci id="S3.E8X.2.1.1.m1.5.5.1.1.3.2.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.3.2">𝒄</ci><ci id="S3.E8X.2.1.1.m1.5.5.1.1.3.3.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.3.3">𝖻𝗀</ci></apply><apply id="S3.E8X.2.1.1.m1.5.5.1.1.1.2.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1"><csymbol cd="latexml" id="S3.E8X.2.1.1.m1.5.5.1.1.1.2.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.2">delimited-[]</csymbol><apply id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1"><eq id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.3.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.3"></eq><list id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.3.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2"><apply id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.2">𝒛</ci><ci id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.1.1.1.3">𝖻𝗀</ci></apply><apply id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.2">𝑂</ci><ci id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.2.2.2.3">𝖻𝗀</ci></apply><ci id="S3.E8X.2.1.1.m1.4.4a.cmml" xref="S3.E8X.2.1.1.m1.4.4"><mtext 
id="S3.E8X.2.1.1.m1.4.4.cmml" xref="S3.E8X.2.1.1.m1.4.4">axis</mtext></ci></list><cn id="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.4.cmml" type="integer" xref="S3.E8X.2.1.1.m1.5.5.1.1.1.1.1.4">0</cn></apply></apply></apply><apply id="S3.E8X.2.1.1.m1.5.5.1.1c.cmml" xref="S3.E8X.2.1.1.m1.5.5.1"><in id="S3.E8X.2.1.1.m1.5.5.1.1.5.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.5"></in><share href="https://arxiv.org/html/2503.13434v1#S3.E8X.2.1.1.m1.5.5.1.1.1.cmml" id="S3.E8X.2.1.1.m1.5.5.1.1d.cmml" xref="S3.E8X.2.1.1.m1.5.5.1"></share><apply id="S3.E8X.2.1.1.m1.5.5.1.1.6.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.6"><csymbol cd="ambiguous" id="S3.E8X.2.1.1.m1.5.5.1.1.6.1.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.6">superscript</csymbol><ci id="S3.E8X.2.1.1.m1.5.5.1.1.6.2.cmml" xref="S3.E8X.2.1.1.m1.5.5.1.1.6.2">ℝ</ci><list id="S3.E8X.2.1.1.m1.3.3.3.4.cmml" xref="S3.E8X.2.1.1.m1.3.3.3.3"><apply id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.cmml" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1"><plus id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.1.cmml" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.1"></plus><ci id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.2.cmml" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.2">𝑐</ci><cn id="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.3.cmml" type="integer" xref="S3.E8X.2.1.1.m1.3.3.3.3.1.1.1.3">1</cn></apply><ci id="S3.E8X.2.1.1.m1.1.1.1.1.cmml" xref="S3.E8X.2.1.1.m1.1.1.1.1">ℎ</ci><ci id="S3.E8X.2.1.1.m1.2.2.2.2.cmml" xref="S3.E8X.2.1.1.m1.2.2.2.2">𝑤</ci></list></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E8X.2.1.1.m1.5c">\displaystyle\bm{c}_{\mathsf{bg}}=\left[\bm{z}_{\mathsf{bg}},O_{\mathsf{bg}};% \text{axis}=0\right]\in\mathbb{R}^{(c+1),h,w},</annotation><annotation encoding="application/x-llamapun" id="S3.E8X.2.1.1.m1.5d">bold_italic_c start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT = [ bold_italic_z start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT , italic_O start_POSTSUBSCRIPT sansserif_bg end_POSTSUBSCRIPT ; axis = 0 ] ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_c + 1 ) , italic_h , italic_w 
end_POSTSUPERSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(8)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS1.SSS0.Px2.p1.3">where the background branch lacks spatial-aware semantic features, as it tends to preserve information completely.</p> </div> <div class="ltx_para" id="S3.SS1.SSS0.Px2.p2"> <p class="ltx_p" id="S3.SS1.SSS0.Px2.p2.1">In element-level editing, the background is the masked image in which both the original and target regions of the foreground element are masked out. For instance, when moving a bird, the background has masks at both the bird’s initial and destination positions.</p> </div> <div class="ltx_para" id="S3.SS1.SSS0.Px2.p3"> <p class="ltx_p" id="S3.SS1.SSS0.Px2.p3.1">The background branch uses a complete diffusion backbone with cross-attention layers. To seamlessly integrate foreground and background elements, we employ hierarchical feature fusion, progressively injecting foreground features into the background branch at multiple resolution levels. 
We also use zero-initialization <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib51" title="">2023a</a>)</cite> <math alttext="\mathcal{Z}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px2.p3.1.m1.1"><semantics id="S3.SS1.SSS0.Px2.p3.1.m1.1a"><mi class="ltx_font_mathcaligraphic" id="S3.SS1.SSS0.Px2.p3.1.m1.1.1" xref="S3.SS1.SSS0.Px2.p3.1.m1.1.1.cmml">𝒵</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px2.p3.1.m1.1b"><ci id="S3.SS1.SSS0.Px2.p3.1.m1.1.1.cmml" xref="S3.SS1.SSS0.Px2.p3.1.m1.1.1">𝒵</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px2.p3.1.m1.1c">\mathcal{Z}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px2.p3.1.m1.1d">caligraphic_Z</annotation></semantics></math> for stable training. Feature fusion for the i-th block is formulated as:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E9"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E9X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\bm{\epsilon}_{\theta}^{i}(\bm{z}_{t},t,C)=\bm{\epsilon}_{\theta}% ^{i}(\bm{z}_{t},t,C)+\omega\cdot\mathcal{Z}(\bm{\epsilon}_{\theta}^{i}(\bm{z}_% {t},t,C^{\prime}))," class="ltx_Math" display="inline" id="S3.E9X.2.1.1.m1.6"><semantics id="S3.E9X.2.1.1.m1.6a"><mrow id="S3.E9X.2.1.1.m1.6.6.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.cmml"><mrow id="S3.E9X.2.1.1.m1.6.6.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.cmml"><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.cmml"><msubsup id="S3.E9X.2.1.1.m1.6.6.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.cmml"><mi class="ltx_mathvariant_bold-italic" id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.2" mathvariant="bold-italic" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.2.cmml">ϵ</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.3.cmml">θ</mi><mi 
id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.3.cmml">i</mi></msubsup><mo id="S3.E9X.2.1.1.m1.6.6.1.1.1.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.2.cmml">⁢</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.2.cmml"><mo id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.2" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.2.cmml">(</mo><msub id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.cmml"><mi id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.2.cmml">,</mo><mi id="S3.E9X.2.1.1.m1.1.1" xref="S3.E9X.2.1.1.m1.1.1.cmml">t</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.4" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.2.cmml">,</mo><mi id="S3.E9X.2.1.1.m1.2.2" xref="S3.E9X.2.1.1.m1.2.2.cmml">C</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.5" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S3.E9X.2.1.1.m1.6.6.1.1.4" xref="S3.E9X.2.1.1.m1.6.6.1.1.4.cmml">=</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.cmml"><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.2.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.cmml"><msubsup id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.cmml"><mi class="ltx_mathvariant_bold-italic" id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.2" mathvariant="bold-italic" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.2.cmml">ϵ</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.3.cmml">θ</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.3.cmml">i</mi></msubsup><mo id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.2.cmml">⁢</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.2.cmml"><mo id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.2" stretchy="false" 
xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.2.cmml">(</mo><msub id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.cmml"><mi id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.2.cmml">,</mo><mi id="S3.E9X.2.1.1.m1.3.3" xref="S3.E9X.2.1.1.m1.3.3.cmml">t</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.4" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.2.cmml">,</mo><mi id="S3.E9X.2.1.1.m1.4.4" xref="S3.E9X.2.1.1.m1.4.4.cmml">C</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.5" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.2.cmml">)</mo></mrow></mrow><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.3.cmml">+</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.3.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.cmml"><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.cmml"><mi id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.2.cmml">ω</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.1" lspace="0.222em" rspace="0.222em" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.1.cmml">⋅</mo><mi class="ltx_font_mathcaligraphic" id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.3.cmml">𝒵</mi></mrow><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.2.cmml">⁢</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.cmml"><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.2" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.cmml">(</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.cmml"><msubsup id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.cmml"><mi class="ltx_mathvariant_bold-italic" id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.2" mathvariant="bold-italic" 
xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.2.cmml">ϵ</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.3.cmml">θ</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.3.cmml">i</mi></msubsup><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.3.cmml">⁢</mo><mrow id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.3.cmml"><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.3" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.3.cmml">(</mo><msub id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.cmml"><mi id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.4" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.3.cmml">,</mo><mi id="S3.E9X.2.1.1.m1.5.5" xref="S3.E9X.2.1.1.m1.5.5.cmml">t</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.5" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.3.cmml">,</mo><msup id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.cmml"><mi id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.2.cmml">C</mi><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.3" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.3.cmml">′</mo></msup><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.6" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.3" stretchy="false" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="S3.E9X.2.1.1.m1.6.6.1.2" xref="S3.E9X.2.1.1.m1.6.6.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E9X.2.1.1.m1.6b"><apply 
id="S3.E9X.2.1.1.m1.6.6.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1"><eq id="S3.E9X.2.1.1.m1.6.6.1.1.4.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.4"></eq><apply id="S3.E9X.2.1.1.m1.6.6.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1"><times id="S3.E9X.2.1.1.m1.6.6.1.1.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.2"></times><apply id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3">superscript</csymbol><apply id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3">subscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.2">bold-italic-ϵ</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.2.3">𝜃</ci></apply><ci id="S3.E9X.2.1.1.m1.6.6.1.1.1.3.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.3.3">𝑖</ci></apply><vector id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1"><apply id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.2">𝒛</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.1.1.1.1.3">𝑡</ci></apply><ci id="S3.E9X.2.1.1.m1.1.1.cmml" xref="S3.E9X.2.1.1.m1.1.1">𝑡</ci><ci id="S3.E9X.2.1.1.m1.2.2.cmml" xref="S3.E9X.2.1.1.m1.2.2">𝐶</ci></vector></apply><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3"><plus id="S3.E9X.2.1.1.m1.6.6.1.1.3.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.3"></plus><apply id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1"><times id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.2"></times><apply 
id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3">superscript</csymbol><apply id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3">subscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.2">bold-italic-ϵ</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.2.3">𝜃</ci></apply><ci id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.3.3">𝑖</ci></apply><vector id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1"><apply id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1">subscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.2">𝒛</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.2.1.1.1.1.3">𝑡</ci></apply><ci id="S3.E9X.2.1.1.m1.3.3.cmml" xref="S3.E9X.2.1.1.m1.3.3">𝑡</ci><ci id="S3.E9X.2.1.1.m1.4.4.cmml" xref="S3.E9X.2.1.1.m1.4.4">𝐶</ci></vector></apply><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2"><times id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.2"></times><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3"><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.1">⋅</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.2">𝜔</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.3.3">𝒵</ci></apply><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.cmml" 
xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1"><times id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.3"></times><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4">superscript</csymbol><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4">subscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.2">bold-italic-ϵ</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.2.3">𝜃</ci></apply><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.4.3">𝑖</ci></apply><vector id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2"><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.2">𝒛</ci><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.1.1.1.3">𝑡</ci></apply><ci id="S3.E9X.2.1.1.m1.5.5.cmml" xref="S3.E9X.2.1.1.m1.5.5">𝑡</ci><apply id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.1.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2">superscript</csymbol><ci id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.2.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.2">𝐶</ci><ci 
id="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.3.cmml" xref="S3.E9X.2.1.1.m1.6.6.1.1.3.2.1.1.1.2.2.2.3">′</ci></apply></vector></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E9X.2.1.1.m1.6c">\displaystyle\bm{\epsilon}_{\theta}^{i}(\bm{z}_{t},t,C)=\bm{\epsilon}_{\theta}% ^{i}(\bm{z}_{t},t,C)+\omega\cdot\mathcal{Z}(\bm{\epsilon}_{\theta}^{i}(\bm{z}_% {t},t,C^{\prime})),</annotation><annotation encoding="application/x-llamapun" id="S3.E9X.2.1.1.m1.6d">bold_italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_C ) = bold_italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_C ) + italic_ω ⋅ caligraphic_Z ( bold_italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(9)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS1.SSS0.Px2.p3.4">where <math alttext="C" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px2.p3.2.m1.1"><semantics id="S3.SS1.SSS0.Px2.p3.2.m1.1a"><mi id="S3.SS1.SSS0.Px2.p3.2.m1.1.1" xref="S3.SS1.SSS0.Px2.p3.2.m1.1.1.cmml">C</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px2.p3.2.m1.1b"><ci id="S3.SS1.SSS0.Px2.p3.2.m1.1.1.cmml" xref="S3.SS1.SSS0.Px2.p3.2.m1.1.1">𝐶</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px2.p3.2.m1.1c">C</annotation><annotation 
encoding="application/x-llamapun" id="S3.SS1.SSS0.Px2.p3.2.m1.1d">italic_C</annotation></semantics></math> and <math alttext="C^{\prime}" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px2.p3.3.m2.1"><semantics id="S3.SS1.SSS0.Px2.p3.3.m2.1a"><msup id="S3.SS1.SSS0.Px2.p3.3.m2.1.1" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1.cmml"><mi id="S3.SS1.SSS0.Px2.p3.3.m2.1.1.2" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1.2.cmml">C</mi><mo id="S3.SS1.SSS0.Px2.p3.3.m2.1.1.3" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1.3.cmml">′</mo></msup><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px2.p3.3.m2.1b"><apply id="S3.SS1.SSS0.Px2.p3.3.m2.1.1.cmml" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS0.Px2.p3.3.m2.1.1.1.cmml" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1">superscript</csymbol><ci id="S3.SS1.SSS0.Px2.p3.3.m2.1.1.2.cmml" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1.2">𝐶</ci><ci id="S3.SS1.SSS0.Px2.p3.3.m2.1.1.3.cmml" xref="S3.SS1.SSS0.Px2.p3.3.m2.1.1.3">′</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px2.p3.3.m2.1c">C^{\prime}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px2.p3.3.m2.1d">italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT</annotation></semantics></math> are the input conditions for the background and foreground branches, respectively, and <math alttext="\omega" class="ltx_Math" display="inline" id="S3.SS1.SSS0.Px2.p3.4.m3.1"><semantics id="S3.SS1.SSS0.Px2.p3.4.m3.1a"><mi id="S3.SS1.SSS0.Px2.p3.4.m3.1.1" xref="S3.SS1.SSS0.Px2.p3.4.m3.1.1.cmml">ω</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS0.Px2.p3.4.m3.1b"><ci id="S3.SS1.SSS0.Px2.p3.4.m3.1.1.cmml" xref="S3.SS1.SSS0.Px2.p3.4.m3.1.1">𝜔</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS0.Px2.p3.4.m3.1c">\omega</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS0.Px2.p3.4.m3.1d">italic_ω</annotation></semantics></math> is the weight for feature fusion.</p> </div> </section> </section> <section 
class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>Self-supervised Training</h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">While paired data of objects at different positions would be ideal for training, such data is scarce. Previous methods <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>; Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> rely on video data, but this introduces unwanted complexities that degrade model performance.</p> </div> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.4">Instead, we propose a self-supervised training strategy, building on the idea that any image can be seen as the target result of an element manipulation process. For each training image, we identify the target element’s position and randomly generate a blob at a different location to simulate the source position. This mimics the manipulation process, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S2.F3" title="Figure 3 ‣ 2.2 Blob Opacity ‣ 2 Blob-Based Element-level Representation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">3</span></a>, where a toy appears to move from a random left position to its actual right position. 
We optimize our model using a noise-prediction score function during training:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E10"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E10X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathcal{L}=\mathbb{E}_{z_{0},C,C^{\prime},\epsilon\sim\mathcal{N% }(0,\mathit{I})}\left[\|\epsilon-\epsilon_{\theta}(\bm{z}_{t},t,C,C^{\prime})% \|_{2}^{2}\right]," class="ltx_Math" display="inline" id="S3.E10X.2.1.1.m1.9"><semantics id="S3.E10X.2.1.1.m1.9a"><mrow id="S3.E10X.2.1.1.m1.9.9.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.cmml"><mrow id="S3.E10X.2.1.1.m1.9.9.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E10X.2.1.1.m1.9.9.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.3.cmml">ℒ</mi><mo id="S3.E10X.2.1.1.m1.9.9.1.1.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.2.cmml">=</mo><mrow id="S3.E10X.2.1.1.m1.9.9.1.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.cmml"><msub id="S3.E10X.2.1.1.m1.9.9.1.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.3.cmml"><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.3.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.3.2.cmml">𝔼</mi><mrow id="S3.E10X.2.1.1.m1.6.6.6" xref="S3.E10X.2.1.1.m1.6.6.6.cmml"><mrow id="S3.E10X.2.1.1.m1.6.6.6.6.2" xref="S3.E10X.2.1.1.m1.6.6.6.6.3.cmml"><msub id="S3.E10X.2.1.1.m1.5.5.5.5.1.1" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1.cmml"><mi id="S3.E10X.2.1.1.m1.5.5.5.5.1.1.2" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1.2.cmml">z</mi><mn id="S3.E10X.2.1.1.m1.5.5.5.5.1.1.3" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1.3.cmml">0</mn></msub><mo id="S3.E10X.2.1.1.m1.6.6.6.6.2.3" xref="S3.E10X.2.1.1.m1.6.6.6.6.3.cmml">,</mo><mi id="S3.E10X.2.1.1.m1.3.3.3.3" xref="S3.E10X.2.1.1.m1.3.3.3.3.cmml">C</mi><mo id="S3.E10X.2.1.1.m1.6.6.6.6.2.4" xref="S3.E10X.2.1.1.m1.6.6.6.6.3.cmml">,</mo><msup id="S3.E10X.2.1.1.m1.6.6.6.6.2.2" xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2.cmml"><mi id="S3.E10X.2.1.1.m1.6.6.6.6.2.2.2" 
xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2.2.cmml">C</mi><mo id="S3.E10X.2.1.1.m1.6.6.6.6.2.2.3" xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2.3.cmml">′</mo></msup><mo id="S3.E10X.2.1.1.m1.6.6.6.6.2.5" xref="S3.E10X.2.1.1.m1.6.6.6.6.3.cmml">,</mo><mi id="S3.E10X.2.1.1.m1.4.4.4.4" xref="S3.E10X.2.1.1.m1.4.4.4.4.cmml">ϵ</mi></mrow><mo id="S3.E10X.2.1.1.m1.6.6.6.7" xref="S3.E10X.2.1.1.m1.6.6.6.7.cmml">∼</mo><mrow id="S3.E10X.2.1.1.m1.6.6.6.8" xref="S3.E10X.2.1.1.m1.6.6.6.8.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E10X.2.1.1.m1.6.6.6.8.2" xref="S3.E10X.2.1.1.m1.6.6.6.8.2.cmml">𝒩</mi><mo id="S3.E10X.2.1.1.m1.6.6.6.8.1" xref="S3.E10X.2.1.1.m1.6.6.6.8.1.cmml">⁢</mo><mrow id="S3.E10X.2.1.1.m1.6.6.6.8.3.2" xref="S3.E10X.2.1.1.m1.6.6.6.8.3.1.cmml"><mo id="S3.E10X.2.1.1.m1.6.6.6.8.3.2.1" stretchy="false" xref="S3.E10X.2.1.1.m1.6.6.6.8.3.1.cmml">(</mo><mn id="S3.E10X.2.1.1.m1.1.1.1.1" xref="S3.E10X.2.1.1.m1.1.1.1.1.cmml">0</mn><mo id="S3.E10X.2.1.1.m1.6.6.6.8.3.2.2" xref="S3.E10X.2.1.1.m1.6.6.6.8.3.1.cmml">,</mo><mi id="S3.E10X.2.1.1.m1.2.2.2.2" xref="S3.E10X.2.1.1.m1.2.2.2.2.cmml">I</mi><mo id="S3.E10X.2.1.1.m1.6.6.6.8.3.2.3" stretchy="false" xref="S3.E10X.2.1.1.m1.6.6.6.8.3.1.cmml">)</mo></mrow></mrow></mrow></msub><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.2.cmml">⁢</mo><mrow id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.2.cmml"><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.2.1.cmml">[</mo><msubsup id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.cmml"><mrow id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.2.cmml"><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.4" 
xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.4.cmml">ϵ</mi><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.3.cmml">−</mo><mrow id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.cmml"><msub id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.cmml"><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.2.cmml">ϵ</mi><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.3.cmml">θ</mi></msub><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.3.cmml">⁢</mo><mrow id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml"><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.3" stretchy="false" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">(</mo><msub id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.4" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><mi id="S3.E10X.2.1.1.m1.7.7" xref="S3.E10X.2.1.1.m1.7.7.cmml">t</mi><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.5" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><mi id="S3.E10X.2.1.1.m1.8.8" xref="S3.E10X.2.1.1.m1.8.8.cmml">C</mi><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.6" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><msup id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2" 
xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.cmml"><mi id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml">C</mi><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml">′</mo></msup><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.7" stretchy="false" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow><mn id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.3.cmml">2</mn><mn id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.3.cmml">2</mn></msubsup><mo id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.3" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><mo id="S3.E10X.2.1.1.m1.9.9.1.2" xref="S3.E10X.2.1.1.m1.9.9.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E10X.2.1.1.m1.9b"><apply id="S3.E10X.2.1.1.m1.9.9.1.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1"><eq id="S3.E10X.2.1.1.m1.9.9.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.2"></eq><ci id="S3.E10X.2.1.1.m1.9.9.1.1.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.3">ℒ</ci><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1"><times id="S3.E10X.2.1.1.m1.9.9.1.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.2"></times><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.9.9.1.1.1.3.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.3">subscript</csymbol><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.3.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.3.2">𝔼</ci><apply id="S3.E10X.2.1.1.m1.6.6.6.cmml" xref="S3.E10X.2.1.1.m1.6.6.6"><csymbol cd="latexml" id="S3.E10X.2.1.1.m1.6.6.6.7.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.7">similar-to</csymbol><list 
id="S3.E10X.2.1.1.m1.6.6.6.6.3.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.6.2"><apply id="S3.E10X.2.1.1.m1.5.5.5.5.1.1.cmml" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.5.5.5.5.1.1.1.cmml" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1">subscript</csymbol><ci id="S3.E10X.2.1.1.m1.5.5.5.5.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1.2">𝑧</ci><cn id="S3.E10X.2.1.1.m1.5.5.5.5.1.1.3.cmml" type="integer" xref="S3.E10X.2.1.1.m1.5.5.5.5.1.1.3">0</cn></apply><ci id="S3.E10X.2.1.1.m1.3.3.3.3.cmml" xref="S3.E10X.2.1.1.m1.3.3.3.3">𝐶</ci><apply id="S3.E10X.2.1.1.m1.6.6.6.6.2.2.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.6.6.6.6.2.2.1.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2">superscript</csymbol><ci id="S3.E10X.2.1.1.m1.6.6.6.6.2.2.2.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2.2">𝐶</ci><ci id="S3.E10X.2.1.1.m1.6.6.6.6.2.2.3.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.6.2.2.3">′</ci></apply><ci id="S3.E10X.2.1.1.m1.4.4.4.4.cmml" xref="S3.E10X.2.1.1.m1.4.4.4.4">italic-ϵ</ci></list><apply id="S3.E10X.2.1.1.m1.6.6.6.8.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.8"><times id="S3.E10X.2.1.1.m1.6.6.6.8.1.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.8.1"></times><ci id="S3.E10X.2.1.1.m1.6.6.6.8.2.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.8.2">𝒩</ci><interval closure="open" id="S3.E10X.2.1.1.m1.6.6.6.8.3.1.cmml" xref="S3.E10X.2.1.1.m1.6.6.6.8.3.2"><cn id="S3.E10X.2.1.1.m1.1.1.1.1.cmml" type="integer" xref="S3.E10X.2.1.1.m1.1.1.1.1">0</cn><ci id="S3.E10X.2.1.1.m1.2.2.2.2.cmml" xref="S3.E10X.2.1.1.m1.2.2.2.2">𝐼</ci></interval></apply></apply></apply><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1"><csymbol cd="latexml" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.2.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.2.cmml" 
xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1">superscript</csymbol><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1">subscript</csymbol><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1"><minus id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.3"></minus><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.4.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.4">italic-ϵ</ci><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2"><times id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.3"></times><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4">subscript</csymbol><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.2">italic-ϵ</ci><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.4.3">𝜃</ci></apply><vector id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2"><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" 
xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2">𝒛</ci><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3">𝑡</ci></apply><ci id="S3.E10X.2.1.1.m1.7.7.cmml" xref="S3.E10X.2.1.1.m1.7.7">𝑡</ci><ci id="S3.E10X.2.1.1.m1.8.8.cmml" xref="S3.E10X.2.1.1.m1.8.8">𝐶</ci><apply id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2"><csymbol cd="ambiguous" id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.1.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2">superscript</csymbol><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2">𝐶</ci><ci id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3">′</ci></apply></vector></apply></apply></apply><cn id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.1.3">2</cn></apply><cn id="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E10X.2.1.1.m1.9.9.1.1.1.1.1.1.3">2</cn></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E10X.2.1.1.m1.9c">\displaystyle\mathcal{L}=\mathbb{E}_{z_{0},C,C^{\prime},\epsilon\sim\mathcal{N% }(0,\mathit{I})}\left[\|\epsilon-\epsilon_{\theta}(\bm{z}_{t},t,C,C^{\prime})% \|_{2}^{2}\right],</annotation><annotation encoding="application/x-llamapun" id="S3.E10X.2.1.1.m1.9d">caligraphic_L = blackboard_E start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_C , italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_ϵ ∼ caligraphic_N ( 0 , italic_I ) end_POSTSUBSCRIPT [ ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_z 
start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_C , italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(10)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS2.p2.3">where <math alttext="z_{0}" class="ltx_Math" display="inline" id="S3.SS2.p2.1.m1.1"><semantics id="S3.SS2.p2.1.m1.1a"><msub id="S3.SS2.p2.1.m1.1.1" xref="S3.SS2.p2.1.m1.1.1.cmml"><mi id="S3.SS2.p2.1.m1.1.1.2" xref="S3.SS2.p2.1.m1.1.1.2.cmml">z</mi><mn id="S3.SS2.p2.1.m1.1.1.3" xref="S3.SS2.p2.1.m1.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.1.m1.1b"><apply id="S3.SS2.p2.1.m1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.1.m1.1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1">subscript</csymbol><ci id="S3.SS2.p2.1.m1.1.1.2.cmml" xref="S3.SS2.p2.1.m1.1.1.2">𝑧</ci><cn id="S3.SS2.p2.1.m1.1.1.3.cmml" type="integer" xref="S3.SS2.p2.1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.1.m1.1c">z_{0}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.1.m1.1d">italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math> is the latent of the source image and <math alttext="z_{t}=\sqrt{\bar{\alpha}_{t}}\bm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon" class="ltx_Math" display="inline" id="S3.SS2.p2.2.m2.1"><semantics id="S3.SS2.p2.2.m2.1a"><mrow id="S3.SS2.p2.2.m2.1.1" xref="S3.SS2.p2.2.m2.1.1.cmml"><msub id="S3.SS2.p2.2.m2.1.1.2" xref="S3.SS2.p2.2.m2.1.1.2.cmml"><mi id="S3.SS2.p2.2.m2.1.1.2.2" xref="S3.SS2.p2.2.m2.1.1.2.2.cmml">z</mi><mi id="S3.SS2.p2.2.m2.1.1.2.3" xref="S3.SS2.p2.2.m2.1.1.2.3.cmml">t</mi></msub><mo 
id="S3.SS2.p2.2.m2.1.1.1" xref="S3.SS2.p2.2.m2.1.1.1.cmml">=</mo><mrow id="S3.SS2.p2.2.m2.1.1.3" xref="S3.SS2.p2.2.m2.1.1.3.cmml"><mrow id="S3.SS2.p2.2.m2.1.1.3.2" xref="S3.SS2.p2.2.m2.1.1.3.2.cmml"><msqrt id="S3.SS2.p2.2.m2.1.1.3.2.2" xref="S3.SS2.p2.2.m2.1.1.3.2.2.cmml"><msub id="S3.SS2.p2.2.m2.1.1.3.2.2.2" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.cmml"><mover accent="true" id="S3.SS2.p2.2.m2.1.1.3.2.2.2.2" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.cmml"><mi id="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.2" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.2.cmml">α</mi><mo id="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.1" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.1.cmml">¯</mo></mover><mi id="S3.SS2.p2.2.m2.1.1.3.2.2.2.3" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.3.cmml">t</mi></msub></msqrt><mo id="S3.SS2.p2.2.m2.1.1.3.2.1" xref="S3.SS2.p2.2.m2.1.1.3.2.1.cmml">⁢</mo><msub id="S3.SS2.p2.2.m2.1.1.3.2.3" xref="S3.SS2.p2.2.m2.1.1.3.2.3.cmml"><mi id="S3.SS2.p2.2.m2.1.1.3.2.3.2" xref="S3.SS2.p2.2.m2.1.1.3.2.3.2.cmml">𝒙</mi><mn id="S3.SS2.p2.2.m2.1.1.3.2.3.3" xref="S3.SS2.p2.2.m2.1.1.3.2.3.3.cmml">0</mn></msub></mrow><mo id="S3.SS2.p2.2.m2.1.1.3.1" xref="S3.SS2.p2.2.m2.1.1.3.1.cmml">+</mo><mrow id="S3.SS2.p2.2.m2.1.1.3.3" xref="S3.SS2.p2.2.m2.1.1.3.3.cmml"><msqrt id="S3.SS2.p2.2.m2.1.1.3.3.2" xref="S3.SS2.p2.2.m2.1.1.3.3.2.cmml"><mrow id="S3.SS2.p2.2.m2.1.1.3.3.2.2" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.cmml"><mn id="S3.SS2.p2.2.m2.1.1.3.3.2.2.2" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.2.cmml">1</mn><mo id="S3.SS2.p2.2.m2.1.1.3.3.2.2.1" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.1.cmml">−</mo><msub id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.cmml"><mover accent="true" id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.cmml"><mi id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.2" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.2.cmml">α</mi><mo id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.1" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.1.cmml">¯</mo></mover><mi id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.3" 
xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.3.cmml">t</mi></msub></mrow></msqrt><mo id="S3.SS2.p2.2.m2.1.1.3.3.1" xref="S3.SS2.p2.2.m2.1.1.3.3.1.cmml">⁢</mo><mi id="S3.SS2.p2.2.m2.1.1.3.3.3" xref="S3.SS2.p2.2.m2.1.1.3.3.3.cmml">ϵ</mi></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.2.m2.1b"><apply id="S3.SS2.p2.2.m2.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1"><eq id="S3.SS2.p2.2.m2.1.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1.1"></eq><apply id="S3.SS2.p2.2.m2.1.1.2.cmml" xref="S3.SS2.p2.2.m2.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1.2">subscript</csymbol><ci id="S3.SS2.p2.2.m2.1.1.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.2.2">𝑧</ci><ci id="S3.SS2.p2.2.m2.1.1.2.3.cmml" xref="S3.SS2.p2.2.m2.1.1.2.3">𝑡</ci></apply><apply id="S3.SS2.p2.2.m2.1.1.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3"><plus id="S3.SS2.p2.2.m2.1.1.3.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.1"></plus><apply id="S3.SS2.p2.2.m2.1.1.3.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2"><times id="S3.SS2.p2.2.m2.1.1.3.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.1"></times><apply id="S3.SS2.p2.2.m2.1.1.3.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2"><root id="S3.SS2.p2.2.m2.1.1.3.2.2a.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2"></root><apply id="S3.SS2.p2.2.m2.1.1.3.2.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.3.2.2.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2">subscript</csymbol><apply id="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.2"><ci id="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.1">¯</ci><ci id="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.2.2">𝛼</ci></apply><ci id="S3.SS2.p2.2.m2.1.1.3.2.2.2.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.2.2.3">𝑡</ci></apply></apply><apply id="S3.SS2.p2.2.m2.1.1.3.2.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.3"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.3.2.3.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.3">subscript</csymbol><ci 
id="S3.SS2.p2.2.m2.1.1.3.2.3.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.2.3.2">𝒙</ci><cn id="S3.SS2.p2.2.m2.1.1.3.2.3.3.cmml" type="integer" xref="S3.SS2.p2.2.m2.1.1.3.2.3.3">0</cn></apply></apply><apply id="S3.SS2.p2.2.m2.1.1.3.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3"><times id="S3.SS2.p2.2.m2.1.1.3.3.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.1"></times><apply id="S3.SS2.p2.2.m2.1.1.3.3.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2"><root id="S3.SS2.p2.2.m2.1.1.3.3.2a.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2"></root><apply id="S3.SS2.p2.2.m2.1.1.3.3.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2"><minus id="S3.SS2.p2.2.m2.1.1.3.3.2.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.1"></minus><cn id="S3.SS2.p2.2.m2.1.1.3.3.2.2.2.cmml" type="integer" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.2">1</cn><apply id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3">subscript</csymbol><apply id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2"><ci id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.1">¯</ci><ci id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.2.2">𝛼</ci></apply><ci id="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.2.2.3.3">𝑡</ci></apply></apply></apply><ci id="S3.SS2.p2.2.m2.1.1.3.3.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3.3.3">italic-ϵ</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.2.m2.1c">z_{t}=\sqrt{\bar{\alpha}_{t}}\bm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.2.m2.1d">italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = square-root start_ARG over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + square-root start_ARG 1 - over¯ start_ARG italic_α end_ARG 
start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG italic_ϵ</annotation></semantics></math> is the noisy latent at time step <math alttext="t" class="ltx_Math" display="inline" id="S3.SS2.p2.3.m3.1"><semantics id="S3.SS2.p2.3.m3.1a"><mi id="S3.SS2.p2.3.m3.1.1" xref="S3.SS2.p2.3.m3.1.1.cmml">t</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.3.m3.1b"><ci id="S3.SS2.p2.3.m3.1.1.cmml" xref="S3.SS2.p2.3.m3.1.1">𝑡</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.3.m3.1c">t</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.3.m3.1d">italic_t</annotation></semantics></math> (here <math alttext="\bm{x}_{0}" class="ltx_Math" display="inline"><msub><mi>𝒙</mi><mn>0</mn></msub></math> denotes the same clean latent as <math alttext="z_{0}" class="ltx_Math" display="inline"><msub><mi>z</mi><mn>0</mn></msub></math>). This score function drives the model to fill in foreground elements at the target layout, inpaint background elements at the original foreground position, and integrate the entire scene harmoniously.</p> </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.3 </span>ID Preservation and Scene Harmonization</h3> <section class="ltx_paragraph" id="S3.SS3.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">Random Data Augmentation.</h4> <div class="ltx_para" id="S3.SS3.SSS0.Px1.p1"> <p class="ltx_p" id="S3.SS3.SSS0.Px1.p1.1">To prevent the model from defaulting to a simple copy-and-paste solution, we apply extensive data augmentation to foreground elements during training, including random color jittering, scaling, rotation, erasing, and perspective changes. These augmentations serve two main purposes: they compel the model to place foreground elements harmoniously according to the specified layout and appearance, and random erasing in particular fosters robust inpainting of incomplete elements. 
This approach encourages the model to generate and manipulate elements flexibly and in context while maintaining visual coherence with the background.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS3.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Identity Preservation Score Function.</h4> <div class="ltx_para" id="S3.SS3.SSS0.Px2.p1"> <p class="ltx_p" id="S3.SS3.SSS0.Px2.p1.4">To decouple the foreground and background branches effectively, so that the foreground branch injects element-level information while the background branch integrates these elements, we propose an identity preservation score function. During training, we retain the diffusion model’s output layer in the foreground branch (it is discarded at inference) and apply a score function that operates only within the foreground element region.</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E11"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E11X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathcal{L}_{\text{id}}=\mathbb{E}_{z_{0},C^{\prime},\epsilon\sim% \mathcal{N}(0,\mathit{I})}\left[M_{\mathsf{fg}}\cdot\|\epsilon-\epsilon_{% \theta}(\bm{z}_{t},t,C^{\prime})\|_{2}^{2}\right]," class="ltx_Math" display="inline" id="S3.E11X.2.1.1.m1.7"><semantics id="S3.E11X.2.1.1.m1.7a"><mrow id="S3.E11X.2.1.1.m1.7.7.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.cmml"><mrow id="S3.E11X.2.1.1.m1.7.7.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.cmml"><msub id="S3.E11X.2.1.1.m1.7.7.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E11X.2.1.1.m1.7.7.1.1.3.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.3.2.cmml">ℒ</mi><mtext id="S3.E11X.2.1.1.m1.7.7.1.1.3.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.3.3a.cmml">id</mtext></msub><mo id="S3.E11X.2.1.1.m1.7.7.1.1.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.2.cmml">=</mo><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.cmml"><msub 
id="S3.E11X.2.1.1.m1.7.7.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.3.cmml"><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.3.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.3.2.cmml">𝔼</mi><mrow id="S3.E11X.2.1.1.m1.5.5.5" xref="S3.E11X.2.1.1.m1.5.5.5.cmml"><mrow id="S3.E11X.2.1.1.m1.5.5.5.5.2" xref="S3.E11X.2.1.1.m1.5.5.5.5.3.cmml"><msub id="S3.E11X.2.1.1.m1.4.4.4.4.1.1" xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1.cmml"><mi id="S3.E11X.2.1.1.m1.4.4.4.4.1.1.2" xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1.2.cmml">z</mi><mn id="S3.E11X.2.1.1.m1.4.4.4.4.1.1.3" xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1.3.cmml">0</mn></msub><mo id="S3.E11X.2.1.1.m1.5.5.5.5.2.3" xref="S3.E11X.2.1.1.m1.5.5.5.5.3.cmml">,</mo><msup id="S3.E11X.2.1.1.m1.5.5.5.5.2.2" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2.cmml"><mi id="S3.E11X.2.1.1.m1.5.5.5.5.2.2.2" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2.2.cmml">C</mi><mo id="S3.E11X.2.1.1.m1.5.5.5.5.2.2.3" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2.3.cmml">′</mo></msup><mo id="S3.E11X.2.1.1.m1.5.5.5.5.2.4" xref="S3.E11X.2.1.1.m1.5.5.5.5.3.cmml">,</mo><mi id="S3.E11X.2.1.1.m1.3.3.3.3" xref="S3.E11X.2.1.1.m1.3.3.3.3.cmml">ϵ</mi></mrow><mo id="S3.E11X.2.1.1.m1.5.5.5.6" xref="S3.E11X.2.1.1.m1.5.5.5.6.cmml">∼</mo><mrow id="S3.E11X.2.1.1.m1.5.5.5.7" xref="S3.E11X.2.1.1.m1.5.5.5.7.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E11X.2.1.1.m1.5.5.5.7.2" xref="S3.E11X.2.1.1.m1.5.5.5.7.2.cmml">𝒩</mi><mo id="S3.E11X.2.1.1.m1.5.5.5.7.1" xref="S3.E11X.2.1.1.m1.5.5.5.7.1.cmml">⁢</mo><mrow id="S3.E11X.2.1.1.m1.5.5.5.7.3.2" xref="S3.E11X.2.1.1.m1.5.5.5.7.3.1.cmml"><mo id="S3.E11X.2.1.1.m1.5.5.5.7.3.2.1" stretchy="false" xref="S3.E11X.2.1.1.m1.5.5.5.7.3.1.cmml">(</mo><mn id="S3.E11X.2.1.1.m1.1.1.1.1" xref="S3.E11X.2.1.1.m1.1.1.1.1.cmml">0</mn><mo id="S3.E11X.2.1.1.m1.5.5.5.7.3.2.2" xref="S3.E11X.2.1.1.m1.5.5.5.7.3.1.cmml">,</mo><mi id="S3.E11X.2.1.1.m1.2.2.2.2" xref="S3.E11X.2.1.1.m1.2.2.2.2.cmml">I</mi><mo id="S3.E11X.2.1.1.m1.5.5.5.7.3.2.3" stretchy="false" xref="S3.E11X.2.1.1.m1.5.5.5.7.3.1.cmml">)</mo></mrow></mrow></mrow></msub><mo 
id="S3.E11X.2.1.1.m1.7.7.1.1.1.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.2.cmml">⁢</mo><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.2.cmml"><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.2.1.cmml">[</mo><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.cmml"><msub id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.cmml"><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.2.cmml">M</mi><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.2" lspace="0.222em" rspace="0.222em" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.2.cmml">⋅</mo><msubsup id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.cmml"><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.2.cmml"><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.4" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.4.cmml">ϵ</mi><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.3.cmml">−</mo><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.cmml"><msub id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.cmml"><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.2.cmml">ϵ</mi><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.3.cmml">θ</mi></msub><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.3" 
xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml">⁢</mo><mrow id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml"><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.3" stretchy="false" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">(</mo><msub id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">𝒛</mi><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml">t</mi></msub><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.4" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><mi id="S3.E11X.2.1.1.m1.6.6" xref="S3.E11X.2.1.1.m1.6.6.cmml">t</mi><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.5" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">,</mo><msup id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.cmml"><mi id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml">C</mi><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml">′</mo></msup><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.6" stretchy="false" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow><mn id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.3.cmml">2</mn><mn id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.3.cmml">2</mn></msubsup></mrow><mo 
id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.3" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><mo id="S3.E11X.2.1.1.m1.7.7.1.2" xref="S3.E11X.2.1.1.m1.7.7.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E11X.2.1.1.m1.7b"><apply id="S3.E11X.2.1.1.m1.7.7.1.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1"><eq id="S3.E11X.2.1.1.m1.7.7.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.2"></eq><apply id="S3.E11X.2.1.1.m1.7.7.1.1.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.3"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.3.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.3">subscript</csymbol><ci id="S3.E11X.2.1.1.m1.7.7.1.1.3.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.3.2">ℒ</ci><ci id="S3.E11X.2.1.1.m1.7.7.1.1.3.3a.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.3.3"><mtext id="S3.E11X.2.1.1.m1.7.7.1.1.3.3.cmml" mathsize="70%" xref="S3.E11X.2.1.1.m1.7.7.1.1.3.3">id</mtext></ci></apply><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1"><times id="S3.E11X.2.1.1.m1.7.7.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.2"></times><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.3"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.3.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.3">subscript</csymbol><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.3.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.3.2">𝔼</ci><apply id="S3.E11X.2.1.1.m1.5.5.5.cmml" xref="S3.E11X.2.1.1.m1.5.5.5"><csymbol cd="latexml" id="S3.E11X.2.1.1.m1.5.5.5.6.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.6">similar-to</csymbol><list id="S3.E11X.2.1.1.m1.5.5.5.5.3.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.5.2"><apply id="S3.E11X.2.1.1.m1.4.4.4.4.1.1.cmml" xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.4.4.4.4.1.1.1.cmml" xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1">subscript</csymbol><ci id="S3.E11X.2.1.1.m1.4.4.4.4.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1.2">𝑧</ci><cn id="S3.E11X.2.1.1.m1.4.4.4.4.1.1.3.cmml" type="integer" 
xref="S3.E11X.2.1.1.m1.4.4.4.4.1.1.3">0</cn></apply><apply id="S3.E11X.2.1.1.m1.5.5.5.5.2.2.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.5.5.5.5.2.2.1.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2">superscript</csymbol><ci id="S3.E11X.2.1.1.m1.5.5.5.5.2.2.2.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2.2">𝐶</ci><ci id="S3.E11X.2.1.1.m1.5.5.5.5.2.2.3.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.5.2.2.3">′</ci></apply><ci id="S3.E11X.2.1.1.m1.3.3.3.3.cmml" xref="S3.E11X.2.1.1.m1.3.3.3.3">italic-ϵ</ci></list><apply id="S3.E11X.2.1.1.m1.5.5.5.7.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.7"><times id="S3.E11X.2.1.1.m1.5.5.5.7.1.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.7.1"></times><ci id="S3.E11X.2.1.1.m1.5.5.5.7.2.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.7.2">𝒩</ci><interval closure="open" id="S3.E11X.2.1.1.m1.5.5.5.7.3.1.cmml" xref="S3.E11X.2.1.1.m1.5.5.5.7.3.2"><cn id="S3.E11X.2.1.1.m1.1.1.1.1.cmml" type="integer" xref="S3.E11X.2.1.1.m1.1.1.1.1">0</cn><ci id="S3.E11X.2.1.1.m1.2.2.2.2.cmml" xref="S3.E11X.2.1.1.m1.2.2.2.2">𝐼</ci></interval></apply></apply></apply><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1"><csymbol cd="latexml" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.2.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1"><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.2">⋅</ci><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.2">𝑀</ci><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.3.3">𝖿𝗀</ci></apply><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.cmml" 
xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1">superscript</csymbol><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1">subscript</csymbol><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1"><minus id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.3"></minus><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.4.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.4">italic-ϵ</ci><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2"><times id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.3"></times><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4">subscript</csymbol><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.2">italic-ϵ</ci><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.4.3">𝜃</ci></apply><vector id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2"><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" 
xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2">𝒛</ci><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.3">𝑡</ci></apply><ci id="S3.E11X.2.1.1.m1.6.6.cmml" xref="S3.E11X.2.1.1.m1.6.6">𝑡</ci><apply id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2"><csymbol cd="ambiguous" id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.1.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2">superscript</csymbol><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.2">𝐶</ci><ci id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3.cmml" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.1.1.1.2.2.2.2.3">′</ci></apply></vector></apply></apply></apply><cn id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.1.3">2</cn></apply><cn id="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E11X.2.1.1.m1.7.7.1.1.1.1.1.1.1.3">2</cn></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E11X.2.1.1.m1.7c">\displaystyle\mathcal{L}_{\text{id}}=\mathbb{E}_{z_{0},C^{\prime},\epsilon\sim% \mathcal{N}(0,\mathit{I})}\left[M_{\mathsf{fg}}\cdot\|\epsilon-\epsilon_{% \theta}(\bm{z}_{t},t,C^{\prime})\|_{2}^{2}\right],</annotation><annotation encoding="application/x-llamapun" id="S3.E11X.2.1.1.m1.7d">caligraphic_L start_POSTSUBSCRIPT id end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_C start_POSTSUPERSCRIPT ′ 
end_POSTSUPERSCRIPT , italic_ϵ ∼ caligraphic_N ( 0 , italic_I ) end_POSTSUBSCRIPT [ italic_M start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT ⋅ ∥ italic_ϵ - italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_t , italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(11)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS3.SSS0.Px2.p1.1">where <math alttext="M_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S3.SS3.SSS0.Px2.p1.1.m1.1"><semantics id="S3.SS3.SSS0.Px2.p1.1.m1.1a"><msub id="S3.SS3.SSS0.Px2.p1.1.m1.1.1" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1.cmml"><mi id="S3.SS3.SSS0.Px2.p1.1.m1.1.1.2" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1.2.cmml">M</mi><mi id="S3.SS3.SSS0.Px2.p1.1.m1.1.1.3" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.SSS0.Px2.p1.1.m1.1b"><apply id="S3.SS3.SSS0.Px2.p1.1.m1.1.1.cmml" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.SSS0.Px2.p1.1.m1.1.1.1.cmml" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1">subscript</csymbol><ci id="S3.SS3.SSS0.Px2.p1.1.m1.1.1.2.cmml" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1.2">𝑀</ci><ci id="S3.SS3.SSS0.Px2.p1.1.m1.1.1.3.cmml" xref="S3.SS3.SSS0.Px2.p1.1.m1.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.SSS0.Px2.p1.1.m1.1c">M_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.SSS0.Px2.p1.1.m1.1d">italic_M start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> is the foreground mask that indicates the foreground element region. 
This helps ensure accurate preservation of foreground element appearance while allowing flexibility in background integration. During training, the overall optimization objective is:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E12"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E12X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\mathcal{L}_{\text{total}}=\mathcal{L}+\lambda_{\text{id}}% \mathcal{L}_{\text{id}}," class="ltx_Math" display="inline" id="S3.E12X.2.1.1.m1.1"><semantics id="S3.E12X.2.1.1.m1.1a"><mrow id="S3.E12X.2.1.1.m1.1.1.1" xref="S3.E12X.2.1.1.m1.1.1.1.1.cmml"><mrow id="S3.E12X.2.1.1.m1.1.1.1.1" xref="S3.E12X.2.1.1.m1.1.1.1.1.cmml"><msub id="S3.E12X.2.1.1.m1.1.1.1.1.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E12X.2.1.1.m1.1.1.1.1.2.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.2.2.cmml">ℒ</mi><mtext id="S3.E12X.2.1.1.m1.1.1.1.1.2.3" xref="S3.E12X.2.1.1.m1.1.1.1.1.2.3a.cmml">total</mtext></msub><mo id="S3.E12X.2.1.1.m1.1.1.1.1.1" xref="S3.E12X.2.1.1.m1.1.1.1.1.1.cmml">=</mo><mrow id="S3.E12X.2.1.1.m1.1.1.1.1.3" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E12X.2.1.1.m1.1.1.1.1.3.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.2.cmml">ℒ</mi><mo id="S3.E12X.2.1.1.m1.1.1.1.1.3.1" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E12X.2.1.1.m1.1.1.1.1.3.3" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.cmml"><msub id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.cmml"><mi id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.2.cmml">λ</mi><mtext id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.3" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.3a.cmml">id</mtext></msub><mo id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.1" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.1.cmml">⁢</mo><msub id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.cmml"><mi 
class="ltx_font_mathcaligraphic" id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.2.cmml">ℒ</mi><mtext id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.3" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.3a.cmml">id</mtext></msub></mrow></mrow></mrow><mo id="S3.E12X.2.1.1.m1.1.1.1.2" xref="S3.E12X.2.1.1.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E12X.2.1.1.m1.1b"><apply id="S3.E12X.2.1.1.m1.1.1.1.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1"><eq id="S3.E12X.2.1.1.m1.1.1.1.1.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.1"></eq><apply id="S3.E12X.2.1.1.m1.1.1.1.1.2.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E12X.2.1.1.m1.1.1.1.1.2.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.2">subscript</csymbol><ci id="S3.E12X.2.1.1.m1.1.1.1.1.2.2.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.2.2">ℒ</ci><ci id="S3.E12X.2.1.1.m1.1.1.1.1.2.3a.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.2.3"><mtext id="S3.E12X.2.1.1.m1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E12X.2.1.1.m1.1.1.1.1.2.3">total</mtext></ci></apply><apply id="S3.E12X.2.1.1.m1.1.1.1.1.3.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3"><plus id="S3.E12X.2.1.1.m1.1.1.1.1.3.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.1"></plus><ci id="S3.E12X.2.1.1.m1.1.1.1.1.3.2.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.2">ℒ</ci><apply id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3"><times id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.1"></times><apply id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2"><csymbol cd="ambiguous" id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2">subscript</csymbol><ci id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.2.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.2">𝜆</ci><ci id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.3a.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.3"><mtext id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.3.cmml" mathsize="70%" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.2.3">id</mtext></ci></apply><apply 
id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.1.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3">subscript</csymbol><ci id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.2.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.2">ℒ</ci><ci id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.3a.cmml" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.3"><mtext id="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.3.cmml" mathsize="70%" xref="S3.E12X.2.1.1.m1.1.1.1.1.3.3.3.3">id</mtext></ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E12X.2.1.1.m1.1c">\displaystyle\mathcal{L}_{\text{total}}=\mathcal{L}+\lambda_{\text{id}}% \mathcal{L}_{\text{id}},</annotation><annotation encoding="application/x-llamapun" id="S3.E12X.2.1.1.m1.1d">caligraphic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT = caligraphic_L + italic_λ start_POSTSUBSCRIPT id end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT id end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(12)</span></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS3.SSS0.Px2.p1.3">where <math alttext="\lambda_{\text{id}}" class="ltx_Math" display="inline" id="S3.SS3.SSS0.Px2.p1.2.m1.1"><semantics id="S3.SS3.SSS0.Px2.p1.2.m1.1a"><msub id="S3.SS3.SSS0.Px2.p1.2.m1.1.1" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1.cmml"><mi id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.2" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1.2.cmml">λ</mi><mtext id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.3" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.SSS0.Px2.p1.2.m1.1b"><apply id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.cmml" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.1.cmml" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1">subscript</csymbol><ci 
id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.2.cmml" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1.2">𝜆</ci><ci id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.3a.cmml" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1.3"><mtext id="S3.SS3.SSS0.Px2.p1.2.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.SSS0.Px2.p1.2.m1.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.SSS0.Px2.p1.2.m1.1c">\lambda_{\text{id}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.SSS0.Px2.p1.2.m1.1d">italic_λ start_POSTSUBSCRIPT id end_POSTSUBSCRIPT</annotation></semantics></math> is a hyperparameter that controls the weight of identity preservation. We gradually decay <math alttext="\lambda_{\text{id}}" class="ltx_Math" display="inline" id="S3.SS3.SSS0.Px2.p1.3.m2.1"><semantics id="S3.SS3.SSS0.Px2.p1.3.m2.1a"><msub id="S3.SS3.SSS0.Px2.p1.3.m2.1.1" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1.cmml"><mi id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.2" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1.2.cmml">λ</mi><mtext id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.3" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.SSS0.Px2.p1.3.m2.1b"><apply id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.cmml" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.1.cmml" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1">subscript</csymbol><ci id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.2.cmml" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1.2">𝜆</ci><ci id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.3a.cmml" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1.3"><mtext id="S3.SS3.SSS0.Px2.p1.3.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS3.SSS0.Px2.p1.3.m2.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.SSS0.Px2.p1.3.m2.1c">\lambda_{\text{id}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.SSS0.Px2.p1.3.m2.1d">italic_λ start_POSTSUBSCRIPT id end_POSTSUBSCRIPT</annotation></semantics></math> from 1.0 to 0.6 during training, which encourages the model to focus more on scene 
harmonization in later training stages while still maintaining reasonable identity preservation.</p> </div> </section> </section> <section class="ltx_subsection" id="S3.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.4 </span>Controllable Fidelity-Diversity Trade-off</h3> <div class="ltx_para" id="S3.SS4.p1"> <p class="ltx_p" id="S3.SS4.p1.4">To achieve flexible control between appearance fidelity and creative diversity, we implement random dropout strategies during training. First, we randomly drop the weights of the foreground branch, allowing the model to adjust between freely generating foreground elements based on global textual information and strictly preserving given foreground identities. Second, we randomly drop both the semantic features to be splatted and the VAE features of foreground elements, enabling flexible control over the balance between semantics and appearance. Specifically, we apply:</p> <table class="ltx_equationgroup ltx_eqn_table" id="S3.E13"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E13X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\omega^{\prime}" class="ltx_Math" display="inline" id="S3.E13X.2.1.1.m1.1"><semantics id="S3.E13X.2.1.1.m1.1a"><msup id="S3.E13X.2.1.1.m1.1.1" xref="S3.E13X.2.1.1.m1.1.1.cmml"><mi id="S3.E13X.2.1.1.m1.1.1.2" xref="S3.E13X.2.1.1.m1.1.1.2.cmml">ω</mi><mo id="S3.E13X.2.1.1.m1.1.1.3" xref="S3.E13X.2.1.1.m1.1.1.3.cmml">′</mo></msup><annotation-xml encoding="MathML-Content" id="S3.E13X.2.1.1.m1.1b"><apply id="S3.E13X.2.1.1.m1.1.1.cmml" xref="S3.E13X.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.E13X.2.1.1.m1.1.1.1.cmml" xref="S3.E13X.2.1.1.m1.1.1">superscript</csymbol><ci id="S3.E13X.2.1.1.m1.1.1.2.cmml" xref="S3.E13X.2.1.1.m1.1.1.2">𝜔</ci><ci id="S3.E13X.2.1.1.m1.1.1.3.cmml" xref="S3.E13X.2.1.1.m1.1.1.3">′</ci></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S3.E13X.2.1.1.m1.1c">\displaystyle\omega^{\prime}</annotation><annotation encoding="application/x-llamapun" id="S3.E13X.2.1.1.m1.1d">italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\omega\cdot\mathbb{I}(\xi_{1}&lt;p_{\omega})" class="ltx_Math" display="inline" id="S3.E13X.3.2.2.m1.1"><semantics id="S3.E13X.3.2.2.m1.1a"><mrow id="S3.E13X.3.2.2.m1.1.1" xref="S3.E13X.3.2.2.m1.1.1.cmml"><mi id="S3.E13X.3.2.2.m1.1.1.3" xref="S3.E13X.3.2.2.m1.1.1.3.cmml"></mi><mo id="S3.E13X.3.2.2.m1.1.1.2" xref="S3.E13X.3.2.2.m1.1.1.2.cmml">=</mo><mrow id="S3.E13X.3.2.2.m1.1.1.1" xref="S3.E13X.3.2.2.m1.1.1.1.cmml"><mrow id="S3.E13X.3.2.2.m1.1.1.1.3" xref="S3.E13X.3.2.2.m1.1.1.1.3.cmml"><mi id="S3.E13X.3.2.2.m1.1.1.1.3.2" xref="S3.E13X.3.2.2.m1.1.1.1.3.2.cmml">ω</mi><mo id="S3.E13X.3.2.2.m1.1.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S3.E13X.3.2.2.m1.1.1.1.3.1.cmml">⋅</mo><mi id="S3.E13X.3.2.2.m1.1.1.1.3.3" xref="S3.E13X.3.2.2.m1.1.1.1.3.3.cmml">𝕀</mi></mrow><mo id="S3.E13X.3.2.2.m1.1.1.1.2" xref="S3.E13X.3.2.2.m1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E13X.3.2.2.m1.1.1.1.1.1" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.cmml"><mo id="S3.E13X.3.2.2.m1.1.1.1.1.1.2" stretchy="false" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E13X.3.2.2.m1.1.1.1.1.1.1" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.cmml"><msub id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.cmml"><mi id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.2" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.2.cmml">ξ</mi><mn id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.3" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.3.cmml">1</mn></msub><mo id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.1" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.1.cmml">&lt;</mo><msub id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.cmml"><mi id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.2" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.2.cmml">p</mi><mi 
id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.3" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.3.cmml">ω</mi></msub></mrow><mo id="S3.E13X.3.2.2.m1.1.1.1.1.1.3" stretchy="false" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.E13X.3.2.2.m1.1b"><apply id="S3.E13X.3.2.2.m1.1.1.cmml" xref="S3.E13X.3.2.2.m1.1.1"><eq id="S3.E13X.3.2.2.m1.1.1.2.cmml" xref="S3.E13X.3.2.2.m1.1.1.2"></eq><csymbol cd="latexml" id="S3.E13X.3.2.2.m1.1.1.3.cmml" xref="S3.E13X.3.2.2.m1.1.1.3">absent</csymbol><apply id="S3.E13X.3.2.2.m1.1.1.1.cmml" xref="S3.E13X.3.2.2.m1.1.1.1"><times id="S3.E13X.3.2.2.m1.1.1.1.2.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.2"></times><apply id="S3.E13X.3.2.2.m1.1.1.1.3.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.3"><ci id="S3.E13X.3.2.2.m1.1.1.1.3.1.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.3.1">⋅</ci><ci id="S3.E13X.3.2.2.m1.1.1.1.3.2.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.3.2">𝜔</ci><ci id="S3.E13X.3.2.2.m1.1.1.1.3.3.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.3.3">𝕀</ci></apply><apply id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1"><lt id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.1.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.1"></lt><apply id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.1.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.2.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.2">𝜉</ci><cn id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.3.cmml" type="integer" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.2.3">1</cn></apply><apply id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.1.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.2.cmml" xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.2">𝑝</ci><ci id="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.3.cmml" 
xref="S3.E13X.3.2.2.m1.1.1.1.1.1.1.3.3">𝜔</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E13X.3.2.2.m1.1c">\displaystyle=\omega\cdot\mathbb{I}(\xi_{1}&lt;p_{\omega})</annotation><annotation encoding="application/x-llamapun" id="S3.E13X.3.2.2.m1.1d">= italic_ω ⋅ blackboard_I ( italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT &lt; italic_p start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="3"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(13)</span></td> </tr> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E13Xa"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\bm{F}_{\mathsf{fg}}^{\prime}" class="ltx_Math" display="inline" id="S3.E13Xa.2.1.1.m1.1"><semantics id="S3.E13Xa.2.1.1.m1.1a"><msubsup id="S3.E13Xa.2.1.1.m1.1.1" xref="S3.E13Xa.2.1.1.m1.1.1.cmml"><mi id="S3.E13Xa.2.1.1.m1.1.1.2.2" xref="S3.E13Xa.2.1.1.m1.1.1.2.2.cmml">𝑭</mi><mi id="S3.E13Xa.2.1.1.m1.1.1.2.3" xref="S3.E13Xa.2.1.1.m1.1.1.2.3.cmml">𝖿𝗀</mi><mo id="S3.E13Xa.2.1.1.m1.1.1.3" xref="S3.E13Xa.2.1.1.m1.1.1.3.cmml">′</mo></msubsup><annotation-xml encoding="MathML-Content" id="S3.E13Xa.2.1.1.m1.1b"><apply id="S3.E13Xa.2.1.1.m1.1.1.cmml" xref="S3.E13Xa.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.E13Xa.2.1.1.m1.1.1.1.cmml" xref="S3.E13Xa.2.1.1.m1.1.1">superscript</csymbol><apply id="S3.E13Xa.2.1.1.m1.1.1.2.cmml" xref="S3.E13Xa.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.E13Xa.2.1.1.m1.1.1.2.1.cmml" xref="S3.E13Xa.2.1.1.m1.1.1">subscript</csymbol><ci id="S3.E13Xa.2.1.1.m1.1.1.2.2.cmml" xref="S3.E13Xa.2.1.1.m1.1.1.2.2">𝑭</ci><ci id="S3.E13Xa.2.1.1.m1.1.1.2.3.cmml" xref="S3.E13Xa.2.1.1.m1.1.1.2.3">𝖿𝗀</ci></apply><ci id="S3.E13Xa.2.1.1.m1.1.1.3.cmml" 
xref="S3.E13Xa.2.1.1.m1.1.1.3">′</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E13Xa.2.1.1.m1.1c">\displaystyle\bm{F}_{\mathsf{fg}}^{\prime}</annotation><annotation encoding="application/x-llamapun" id="S3.E13Xa.2.1.1.m1.1d">bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\bm{F}_{\mathsf{fg}}\cdot\mathbb{I}(\xi_{2}&lt;p_{\mathsf{feat}})" class="ltx_Math" display="inline" id="S3.E13Xa.3.2.2.m1.1"><semantics id="S3.E13Xa.3.2.2.m1.1a"><mrow id="S3.E13Xa.3.2.2.m1.1.1" xref="S3.E13Xa.3.2.2.m1.1.1.cmml"><mi id="S3.E13Xa.3.2.2.m1.1.1.3" xref="S3.E13Xa.3.2.2.m1.1.1.3.cmml"></mi><mo id="S3.E13Xa.3.2.2.m1.1.1.2" xref="S3.E13Xa.3.2.2.m1.1.1.2.cmml">=</mo><mrow id="S3.E13Xa.3.2.2.m1.1.1.1" xref="S3.E13Xa.3.2.2.m1.1.1.1.cmml"><mrow id="S3.E13Xa.3.2.2.m1.1.1.1.3" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.cmml"><msub id="S3.E13Xa.3.2.2.m1.1.1.1.3.2" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2.cmml"><mi id="S3.E13Xa.3.2.2.m1.1.1.1.3.2.2" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2.2.cmml">𝑭</mi><mi id="S3.E13Xa.3.2.2.m1.1.1.1.3.2.3" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E13Xa.3.2.2.m1.1.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.1.cmml">⋅</mo><mi id="S3.E13Xa.3.2.2.m1.1.1.1.3.3" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.3.cmml">𝕀</mi></mrow><mo id="S3.E13Xa.3.2.2.m1.1.1.1.2" xref="S3.E13Xa.3.2.2.m1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E13Xa.3.2.2.m1.1.1.1.1.1" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.cmml"><mo id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.2" stretchy="false" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.cmml"><msub id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.cmml"><mi id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.2" 
xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.2.cmml">ξ</mi><mn id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.3" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.3.cmml">2</mn></msub><mo id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.1" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.1.cmml">&lt;</mo><msub id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.cmml"><mi id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.2" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.2.cmml">p</mi><mi id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.3" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.3.cmml">𝖿𝖾𝖺𝗍</mi></msub></mrow><mo id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.3" stretchy="false" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.E13Xa.3.2.2.m1.1b"><apply id="S3.E13Xa.3.2.2.m1.1.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1"><eq id="S3.E13Xa.3.2.2.m1.1.1.2.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.2"></eq><csymbol cd="latexml" id="S3.E13Xa.3.2.2.m1.1.1.3.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.3">absent</csymbol><apply id="S3.E13Xa.3.2.2.m1.1.1.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1"><times id="S3.E13Xa.3.2.2.m1.1.1.1.2.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.2"></times><apply id="S3.E13Xa.3.2.2.m1.1.1.1.3.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3"><ci id="S3.E13Xa.3.2.2.m1.1.1.1.3.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.1">⋅</ci><apply id="S3.E13Xa.3.2.2.m1.1.1.1.3.2.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2"><csymbol cd="ambiguous" id="S3.E13Xa.3.2.2.m1.1.1.1.3.2.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2">subscript</csymbol><ci id="S3.E13Xa.3.2.2.m1.1.1.1.3.2.2.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2.2">𝑭</ci><ci id="S3.E13Xa.3.2.2.m1.1.1.1.3.2.3.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.2.3">𝖿𝗀</ci></apply><ci id="S3.E13Xa.3.2.2.m1.1.1.1.3.3.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.3.3">𝕀</ci></apply><apply id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1"><lt id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.1"></lt><apply id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.cmml" 
xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.2.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.2">𝜉</ci><cn id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.3.cmml" type="integer" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.2.3">2</cn></apply><apply id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.1.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.2.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.2">𝑝</ci><ci id="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.3.cmml" xref="S3.E13Xa.3.2.2.m1.1.1.1.1.1.1.3.3">𝖿𝖾𝖺𝗍</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E13Xa.3.2.2.m1.1c">\displaystyle=\bm{F}_{\mathsf{fg}}\cdot\mathbb{I}(\xi_{2}&lt;p_{\mathsf{feat}})</annotation><annotation encoding="application/x-llamapun" id="S3.E13Xa.3.2.2.m1.1d">= bold_italic_F start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT ⋅ blackboard_I ( italic_ξ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT &lt; italic_p start_POSTSUBSCRIPT sansserif_feat end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S3.E13Xb"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\bm{z}_{\mathsf{fg}}^{\prime}" class="ltx_Math" display="inline" id="S3.E13Xb.2.1.1.m1.1"><semantics id="S3.E13Xb.2.1.1.m1.1a"><msubsup id="S3.E13Xb.2.1.1.m1.1.1" xref="S3.E13Xb.2.1.1.m1.1.1.cmml"><mi id="S3.E13Xb.2.1.1.m1.1.1.2.2" xref="S3.E13Xb.2.1.1.m1.1.1.2.2.cmml">𝒛</mi><mi id="S3.E13Xb.2.1.1.m1.1.1.2.3" xref="S3.E13Xb.2.1.1.m1.1.1.2.3.cmml">𝖿𝗀</mi><mo id="S3.E13Xb.2.1.1.m1.1.1.3" 
xref="S3.E13Xb.2.1.1.m1.1.1.3.cmml">′</mo></msubsup><annotation-xml encoding="MathML-Content" id="S3.E13Xb.2.1.1.m1.1b"><apply id="S3.E13Xb.2.1.1.m1.1.1.cmml" xref="S3.E13Xb.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.E13Xb.2.1.1.m1.1.1.1.cmml" xref="S3.E13Xb.2.1.1.m1.1.1">superscript</csymbol><apply id="S3.E13Xb.2.1.1.m1.1.1.2.cmml" xref="S3.E13Xb.2.1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.E13Xb.2.1.1.m1.1.1.2.1.cmml" xref="S3.E13Xb.2.1.1.m1.1.1">subscript</csymbol><ci id="S3.E13Xb.2.1.1.m1.1.1.2.2.cmml" xref="S3.E13Xb.2.1.1.m1.1.1.2.2">𝒛</ci><ci id="S3.E13Xb.2.1.1.m1.1.1.2.3.cmml" xref="S3.E13Xb.2.1.1.m1.1.1.2.3">𝖿𝗀</ci></apply><ci id="S3.E13Xb.2.1.1.m1.1.1.3.cmml" xref="S3.E13Xb.2.1.1.m1.1.1.3">′</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E13Xb.2.1.1.m1.1c">\displaystyle\bm{z}_{\mathsf{fg}}^{\prime}</annotation><annotation encoding="application/x-llamapun" id="S3.E13Xb.2.1.1.m1.1d">bold_italic_z start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\bm{z}_{\mathsf{fg}}\cdot\mathbb{I}(\xi_{3}&lt;p_{\mathsf{vae}})" class="ltx_Math" display="inline" id="S3.E13Xb.3.2.2.m1.1"><semantics id="S3.E13Xb.3.2.2.m1.1a"><mrow id="S3.E13Xb.3.2.2.m1.1.1" xref="S3.E13Xb.3.2.2.m1.1.1.cmml"><mi id="S3.E13Xb.3.2.2.m1.1.1.3" xref="S3.E13Xb.3.2.2.m1.1.1.3.cmml"></mi><mo id="S3.E13Xb.3.2.2.m1.1.1.2" xref="S3.E13Xb.3.2.2.m1.1.1.2.cmml">=</mo><mrow id="S3.E13Xb.3.2.2.m1.1.1.1" xref="S3.E13Xb.3.2.2.m1.1.1.1.cmml"><mrow id="S3.E13Xb.3.2.2.m1.1.1.1.3" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.cmml"><msub id="S3.E13Xb.3.2.2.m1.1.1.1.3.2" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2.cmml"><mi id="S3.E13Xb.3.2.2.m1.1.1.1.3.2.2" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2.2.cmml">𝒛</mi><mi id="S3.E13Xb.3.2.2.m1.1.1.1.3.2.3" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2.3.cmml">𝖿𝗀</mi></msub><mo id="S3.E13Xb.3.2.2.m1.1.1.1.3.1" lspace="0.222em" 
rspace="0.222em" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.1.cmml">⋅</mo><mi id="S3.E13Xb.3.2.2.m1.1.1.1.3.3" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.3.cmml">𝕀</mi></mrow><mo id="S3.E13Xb.3.2.2.m1.1.1.1.2" xref="S3.E13Xb.3.2.2.m1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E13Xb.3.2.2.m1.1.1.1.1.1" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.cmml"><mo id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.2" stretchy="false" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.cmml"><msub id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.cmml"><mi id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.2" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.2.cmml">ξ</mi><mn id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.3" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.3.cmml">3</mn></msub><mo id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.1" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.1.cmml">&lt;</mo><msub id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.cmml"><mi id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.2" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.2.cmml">p</mi><mi id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.3" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.3.cmml">𝗏𝖺𝖾</mi></msub></mrow><mo id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.3" stretchy="false" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.E13Xb.3.2.2.m1.1b"><apply id="S3.E13Xb.3.2.2.m1.1.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1"><eq id="S3.E13Xb.3.2.2.m1.1.1.2.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.2"></eq><csymbol cd="latexml" id="S3.E13Xb.3.2.2.m1.1.1.3.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.3">absent</csymbol><apply id="S3.E13Xb.3.2.2.m1.1.1.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1"><times id="S3.E13Xb.3.2.2.m1.1.1.1.2.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.2"></times><apply id="S3.E13Xb.3.2.2.m1.1.1.1.3.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.3"><ci id="S3.E13Xb.3.2.2.m1.1.1.1.3.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.1">⋅</ci><apply id="S3.E13Xb.3.2.2.m1.1.1.1.3.2.cmml" 
xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2"><csymbol cd="ambiguous" id="S3.E13Xb.3.2.2.m1.1.1.1.3.2.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2">subscript</csymbol><ci id="S3.E13Xb.3.2.2.m1.1.1.1.3.2.2.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2.2">𝒛</ci><ci id="S3.E13Xb.3.2.2.m1.1.1.1.3.2.3.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.2.3">𝖿𝗀</ci></apply><ci id="S3.E13Xb.3.2.2.m1.1.1.1.3.3.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.3.3">𝕀</ci></apply><apply id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1"><lt id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.1"></lt><apply id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.2.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.2">𝜉</ci><cn id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.3.cmml" type="integer" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.2.3">3</cn></apply><apply id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.1.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.2.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.2">𝑝</ci><ci id="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.3.cmml" xref="S3.E13Xb.3.2.2.m1.1.1.1.1.1.1.3.3">𝗏𝖺𝖾</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E13Xb.3.2.2.m1.1c">\displaystyle=\bm{z}_{\mathsf{fg}}\cdot\mathbb{I}(\xi_{3}&lt;p_{\mathsf{vae}})</annotation><annotation encoding="application/x-llamapun" id="S3.E13Xb.3.2.2.m1.1d">= bold_italic_z start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT ⋅ blackboard_I ( italic_ξ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT &lt; italic_p start_POSTSUBSCRIPT sansserif_vae end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell 
ltx_eqn_center_padright"></td> </tr> </tbody> </table> <p class="ltx_p" id="S3.SS4.p1.3">where <math alttext="\mathbb{I}" class="ltx_Math" display="inline" id="S3.SS4.p1.1.m1.1"><semantics id="S3.SS4.p1.1.m1.1a"><mi id="S3.SS4.p1.1.m1.1.1" xref="S3.SS4.p1.1.m1.1.1.cmml">𝕀</mi><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.1.m1.1b"><ci id="S3.SS4.p1.1.m1.1.1.cmml" xref="S3.SS4.p1.1.m1.1.1">𝕀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.1.m1.1c">\mathbb{I}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.1.m1.1d">blackboard_I</annotation></semantics></math> is the indicator function, and <math alttext="\xi_{1},\xi_{2},\xi_{3}\sim\mathcal{U}(0,1)" class="ltx_Math" display="inline" id="S3.SS4.p1.2.m2.5"><semantics id="S3.SS4.p1.2.m2.5a"><mrow id="S3.SS4.p1.2.m2.5.5" xref="S3.SS4.p1.2.m2.5.5.cmml"><mrow id="S3.SS4.p1.2.m2.5.5.3.3" xref="S3.SS4.p1.2.m2.5.5.3.4.cmml"><msub id="S3.SS4.p1.2.m2.3.3.1.1.1" xref="S3.SS4.p1.2.m2.3.3.1.1.1.cmml"><mi id="S3.SS4.p1.2.m2.3.3.1.1.1.2" xref="S3.SS4.p1.2.m2.3.3.1.1.1.2.cmml">ξ</mi><mn id="S3.SS4.p1.2.m2.3.3.1.1.1.3" xref="S3.SS4.p1.2.m2.3.3.1.1.1.3.cmml">1</mn></msub><mo id="S3.SS4.p1.2.m2.5.5.3.3.4" xref="S3.SS4.p1.2.m2.5.5.3.4.cmml">,</mo><msub id="S3.SS4.p1.2.m2.4.4.2.2.2" xref="S3.SS4.p1.2.m2.4.4.2.2.2.cmml"><mi id="S3.SS4.p1.2.m2.4.4.2.2.2.2" xref="S3.SS4.p1.2.m2.4.4.2.2.2.2.cmml">ξ</mi><mn id="S3.SS4.p1.2.m2.4.4.2.2.2.3" xref="S3.SS4.p1.2.m2.4.4.2.2.2.3.cmml">2</mn></msub><mo id="S3.SS4.p1.2.m2.5.5.3.3.5" xref="S3.SS4.p1.2.m2.5.5.3.4.cmml">,</mo><msub id="S3.SS4.p1.2.m2.5.5.3.3.3" xref="S3.SS4.p1.2.m2.5.5.3.3.3.cmml"><mi id="S3.SS4.p1.2.m2.5.5.3.3.3.2" xref="S3.SS4.p1.2.m2.5.5.3.3.3.2.cmml">ξ</mi><mn id="S3.SS4.p1.2.m2.5.5.3.3.3.3" xref="S3.SS4.p1.2.m2.5.5.3.3.3.3.cmml">3</mn></msub></mrow><mo id="S3.SS4.p1.2.m2.5.5.4" xref="S3.SS4.p1.2.m2.5.5.4.cmml">∼</mo><mrow id="S3.SS4.p1.2.m2.5.5.5" xref="S3.SS4.p1.2.m2.5.5.5.cmml"><mi class="ltx_font_mathcaligraphic" 
id="S3.SS4.p1.2.m2.5.5.5.2" xref="S3.SS4.p1.2.m2.5.5.5.2.cmml">𝒰</mi><mo id="S3.SS4.p1.2.m2.5.5.5.1" xref="S3.SS4.p1.2.m2.5.5.5.1.cmml">⁢</mo><mrow id="S3.SS4.p1.2.m2.5.5.5.3.2" xref="S3.SS4.p1.2.m2.5.5.5.3.1.cmml"><mo id="S3.SS4.p1.2.m2.5.5.5.3.2.1" stretchy="false" xref="S3.SS4.p1.2.m2.5.5.5.3.1.cmml">(</mo><mn id="S3.SS4.p1.2.m2.1.1" xref="S3.SS4.p1.2.m2.1.1.cmml">0</mn><mo id="S3.SS4.p1.2.m2.5.5.5.3.2.2" xref="S3.SS4.p1.2.m2.5.5.5.3.1.cmml">,</mo><mn id="S3.SS4.p1.2.m2.2.2" xref="S3.SS4.p1.2.m2.2.2.cmml">1</mn><mo id="S3.SS4.p1.2.m2.5.5.5.3.2.3" stretchy="false" xref="S3.SS4.p1.2.m2.5.5.5.3.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.2.m2.5b"><apply id="S3.SS4.p1.2.m2.5.5.cmml" xref="S3.SS4.p1.2.m2.5.5"><csymbol cd="latexml" id="S3.SS4.p1.2.m2.5.5.4.cmml" xref="S3.SS4.p1.2.m2.5.5.4">similar-to</csymbol><list id="S3.SS4.p1.2.m2.5.5.3.4.cmml" xref="S3.SS4.p1.2.m2.5.5.3.3"><apply id="S3.SS4.p1.2.m2.3.3.1.1.1.cmml" xref="S3.SS4.p1.2.m2.3.3.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.2.m2.3.3.1.1.1.1.cmml" xref="S3.SS4.p1.2.m2.3.3.1.1.1">subscript</csymbol><ci id="S3.SS4.p1.2.m2.3.3.1.1.1.2.cmml" xref="S3.SS4.p1.2.m2.3.3.1.1.1.2">𝜉</ci><cn id="S3.SS4.p1.2.m2.3.3.1.1.1.3.cmml" type="integer" xref="S3.SS4.p1.2.m2.3.3.1.1.1.3">1</cn></apply><apply id="S3.SS4.p1.2.m2.4.4.2.2.2.cmml" xref="S3.SS4.p1.2.m2.4.4.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p1.2.m2.4.4.2.2.2.1.cmml" xref="S3.SS4.p1.2.m2.4.4.2.2.2">subscript</csymbol><ci id="S3.SS4.p1.2.m2.4.4.2.2.2.2.cmml" xref="S3.SS4.p1.2.m2.4.4.2.2.2.2">𝜉</ci><cn id="S3.SS4.p1.2.m2.4.4.2.2.2.3.cmml" type="integer" xref="S3.SS4.p1.2.m2.4.4.2.2.2.3">2</cn></apply><apply id="S3.SS4.p1.2.m2.5.5.3.3.3.cmml" xref="S3.SS4.p1.2.m2.5.5.3.3.3"><csymbol cd="ambiguous" id="S3.SS4.p1.2.m2.5.5.3.3.3.1.cmml" xref="S3.SS4.p1.2.m2.5.5.3.3.3">subscript</csymbol><ci id="S3.SS4.p1.2.m2.5.5.3.3.3.2.cmml" xref="S3.SS4.p1.2.m2.5.5.3.3.3.2">𝜉</ci><cn id="S3.SS4.p1.2.m2.5.5.3.3.3.3.cmml" type="integer" 
xref="S3.SS4.p1.2.m2.5.5.3.3.3.3">3</cn></apply></list><apply id="S3.SS4.p1.2.m2.5.5.5.cmml" xref="S3.SS4.p1.2.m2.5.5.5"><times id="S3.SS4.p1.2.m2.5.5.5.1.cmml" xref="S3.SS4.p1.2.m2.5.5.5.1"></times><ci id="S3.SS4.p1.2.m2.5.5.5.2.cmml" xref="S3.SS4.p1.2.m2.5.5.5.2">𝒰</ci><interval closure="open" id="S3.SS4.p1.2.m2.5.5.5.3.1.cmml" xref="S3.SS4.p1.2.m2.5.5.5.3.2"><cn id="S3.SS4.p1.2.m2.1.1.cmml" type="integer" xref="S3.SS4.p1.2.m2.1.1">0</cn><cn id="S3.SS4.p1.2.m2.2.2.cmml" type="integer" xref="S3.SS4.p1.2.m2.2.2">1</cn></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.2.m2.5c">\xi_{1},\xi_{2},\xi_{3}\sim\mathcal{U}(0,1)</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.2.m2.5d">italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ξ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ξ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∼ caligraphic_U ( 0 , 1 )</annotation></semantics></math> are uniform random variables; <math alttext="p_{\omega},p_{\mathsf{feat}},p_{\mathsf{vae}}" class="ltx_Math" display="inline" id="S3.SS4.p1.3.m3.3"><semantics id="S3.SS4.p1.3.m3.3a"><mrow id="S3.SS4.p1.3.m3.3.3.3" xref="S3.SS4.p1.3.m3.3.3.4.cmml"><msub id="S3.SS4.p1.3.m3.1.1.1.1" xref="S3.SS4.p1.3.m3.1.1.1.1.cmml"><mi id="S3.SS4.p1.3.m3.1.1.1.1.2" xref="S3.SS4.p1.3.m3.1.1.1.1.2.cmml">p</mi><mi id="S3.SS4.p1.3.m3.1.1.1.1.3" xref="S3.SS4.p1.3.m3.1.1.1.1.3.cmml">ω</mi></msub><mo id="S3.SS4.p1.3.m3.3.3.3.4" xref="S3.SS4.p1.3.m3.3.3.4.cmml">,</mo><msub id="S3.SS4.p1.3.m3.2.2.2.2" xref="S3.SS4.p1.3.m3.2.2.2.2.cmml"><mi id="S3.SS4.p1.3.m3.2.2.2.2.2" xref="S3.SS4.p1.3.m3.2.2.2.2.2.cmml">p</mi><mi id="S3.SS4.p1.3.m3.2.2.2.2.3" xref="S3.SS4.p1.3.m3.2.2.2.2.3.cmml">𝖿𝖾𝖺𝗍</mi></msub><mo id="S3.SS4.p1.3.m3.3.3.3.5" xref="S3.SS4.p1.3.m3.3.3.4.cmml">,</mo><msub id="S3.SS4.p1.3.m3.3.3.3.3" xref="S3.SS4.p1.3.m3.3.3.3.3.cmml"><mi id="S3.SS4.p1.3.m3.3.3.3.3.2" xref="S3.SS4.p1.3.m3.3.3.3.3.2.cmml">p</mi><mi id="S3.SS4.p1.3.m3.3.3.3.3.3" 
xref="S3.SS4.p1.3.m3.3.3.3.3.3.cmml">𝗏𝖺𝖾</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.3.m3.3b"><list id="S3.SS4.p1.3.m3.3.3.4.cmml" xref="S3.SS4.p1.3.m3.3.3.3"><apply id="S3.SS4.p1.3.m3.1.1.1.1.cmml" xref="S3.SS4.p1.3.m3.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.3.m3.1.1.1.1.1.cmml" xref="S3.SS4.p1.3.m3.1.1.1.1">subscript</csymbol><ci id="S3.SS4.p1.3.m3.1.1.1.1.2.cmml" xref="S3.SS4.p1.3.m3.1.1.1.1.2">𝑝</ci><ci id="S3.SS4.p1.3.m3.1.1.1.1.3.cmml" xref="S3.SS4.p1.3.m3.1.1.1.1.3">𝜔</ci></apply><apply id="S3.SS4.p1.3.m3.2.2.2.2.cmml" xref="S3.SS4.p1.3.m3.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p1.3.m3.2.2.2.2.1.cmml" xref="S3.SS4.p1.3.m3.2.2.2.2">subscript</csymbol><ci id="S3.SS4.p1.3.m3.2.2.2.2.2.cmml" xref="S3.SS4.p1.3.m3.2.2.2.2.2">𝑝</ci><ci id="S3.SS4.p1.3.m3.2.2.2.2.3.cmml" xref="S3.SS4.p1.3.m3.2.2.2.2.3">𝖿𝖾𝖺𝗍</ci></apply><apply id="S3.SS4.p1.3.m3.3.3.3.3.cmml" xref="S3.SS4.p1.3.m3.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS4.p1.3.m3.3.3.3.3.1.cmml" xref="S3.SS4.p1.3.m3.3.3.3.3">subscript</csymbol><ci id="S3.SS4.p1.3.m3.3.3.3.3.2.cmml" xref="S3.SS4.p1.3.m3.3.3.3.3.2">𝑝</ci><ci id="S3.SS4.p1.3.m3.3.3.3.3.3.cmml" xref="S3.SS4.p1.3.m3.3.3.3.3.3">𝗏𝖺𝖾</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.3.m3.3c">p_{\omega},p_{\mathsf{feat}},p_{\mathsf{vae}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.3.m3.3d">italic_p start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT sansserif_feat end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT sansserif_vae end_POSTSUBSCRIPT</annotation></semantics></math> are the dropout probabilities for branch weights, semantic features, and VAE features respectively.</p> </div> <figure class="ltx_table" id="S3.T1"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S3.T1.14" style="width:433.6pt;height:54.1pt;vertical-align:-0.5pt;"><span class="ltx_transformed_inner" 
style="transform:translate(-220.4pt,27.2pt) scale(0.495903736838476,0.495903736838476) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S3.T1.14.14"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S3.T1.14.14.15.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="S3.T1.14.14.15.1.1" rowspan="2" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text" id="S3.T1.14.14.15.1.1.1">Method</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" colspan="3" id="S3.T1.14.14.15.1.2" style="padding-top:0.5pt;padding-bottom:0.5pt;">Compose</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" colspan="3" id="S3.T1.14.14.15.1.3" style="padding-top:0.5pt;padding-bottom:0.5pt;">Move</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" colspan="3" id="S3.T1.14.14.15.1.4" style="padding-top:0.5pt;padding-bottom:0.5pt;">Resize</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" colspan="3" id="S3.T1.14.14.15.1.5" style="padding-top:0.5pt;padding-bottom:0.5pt;">Replace</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="2" id="S3.T1.14.14.15.1.6" style="padding-top:0.5pt;padding-bottom:0.5pt;">Remove</th> </tr> <tr class="ltx_tr" id="S3.T1.14.14.14"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.1.1.1.1" style="padding-top:0.5pt;padding-bottom:0.5pt;">CLIP-I<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.1.1.1.1.m1.1"><semantics id="S3.T1.1.1.1.1.m1.1a"><mo id="S3.T1.1.1.1.1.m1.1.1" stretchy="false" xref="S3.T1.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.1.1.1.1.m1.1b"><ci id="S3.T1.1.1.1.1.m1.1.1.cmml" xref="S3.T1.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" 
id="S3.T1.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.2.2.2.2" style="padding-top:0.5pt;padding-bottom:0.5pt;">DINO<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.2.2.2.2.m1.1"><semantics id="S3.T1.2.2.2.2.m1.1a"><mo id="S3.T1.2.2.2.2.m1.1.1" stretchy="false" xref="S3.T1.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.2.2.2.2.m1.1b"><ci id="S3.T1.2.2.2.2.m1.1.1.cmml" xref="S3.T1.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="S3.T1.3.3.3.3" style="padding-top:0.5pt;padding-bottom:0.5pt;">MSE <math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T1.3.3.3.3.m1.1"><semantics id="S3.T1.3.3.3.3.m1.1a"><mo id="S3.T1.3.3.3.3.m1.1.1" stretchy="false" xref="S3.T1.3.3.3.3.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T1.3.3.3.3.m1.1b"><ci id="S3.T1.3.3.3.3.m1.1.1.cmml" xref="S3.T1.3.3.3.3.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.3.3.3.3.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.3.3.3.3.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.4.4.4.4" style="padding-top:0.5pt;padding-bottom:0.5pt;">CLIP-I<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.4.4.4.4.m1.1"><semantics id="S3.T1.4.4.4.4.m1.1a"><mo id="S3.T1.4.4.4.4.m1.1.1" stretchy="false" xref="S3.T1.4.4.4.4.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.4.4.4.4.m1.1b"><ci 
id="S3.T1.4.4.4.4.m1.1.1.cmml" xref="S3.T1.4.4.4.4.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.4.4.4.4.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.4.4.4.4.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.5.5.5.5" style="padding-top:0.5pt;padding-bottom:0.5pt;">DINO<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.5.5.5.5.m1.1"><semantics id="S3.T1.5.5.5.5.m1.1a"><mo id="S3.T1.5.5.5.5.m1.1.1" stretchy="false" xref="S3.T1.5.5.5.5.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.5.5.5.5.m1.1b"><ci id="S3.T1.5.5.5.5.m1.1.1.cmml" xref="S3.T1.5.5.5.5.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.5.5.5.5.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.5.5.5.5.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="S3.T1.6.6.6.6" style="padding-top:0.5pt;padding-bottom:0.5pt;">MSE<math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T1.6.6.6.6.m1.1"><semantics id="S3.T1.6.6.6.6.m1.1a"><mo id="S3.T1.6.6.6.6.m1.1.1" stretchy="false" xref="S3.T1.6.6.6.6.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T1.6.6.6.6.m1.1b"><ci id="S3.T1.6.6.6.6.m1.1.1.cmml" xref="S3.T1.6.6.6.6.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.6.6.6.6.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.6.6.6.6.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.7.7.7.7" style="padding-top:0.5pt;padding-bottom:0.5pt;">CLIP-I<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.7.7.7.7.m1.1"><semantics id="S3.T1.7.7.7.7.m1.1a"><mo id="S3.T1.7.7.7.7.m1.1.1" stretchy="false" 
xref="S3.T1.7.7.7.7.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.7.7.7.7.m1.1b"><ci id="S3.T1.7.7.7.7.m1.1.1.cmml" xref="S3.T1.7.7.7.7.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.7.7.7.7.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.7.7.7.7.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.8.8.8.8" style="padding-top:0.5pt;padding-bottom:0.5pt;">DINO<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.8.8.8.8.m1.1"><semantics id="S3.T1.8.8.8.8.m1.1a"><mo id="S3.T1.8.8.8.8.m1.1.1" stretchy="false" xref="S3.T1.8.8.8.8.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.8.8.8.8.m1.1b"><ci id="S3.T1.8.8.8.8.m1.1.1.cmml" xref="S3.T1.8.8.8.8.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.8.8.8.8.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.8.8.8.8.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="S3.T1.9.9.9.9" style="padding-top:0.5pt;padding-bottom:0.5pt;">MSE<math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T1.9.9.9.9.m1.1"><semantics id="S3.T1.9.9.9.9.m1.1a"><mo id="S3.T1.9.9.9.9.m1.1.1" stretchy="false" xref="S3.T1.9.9.9.9.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T1.9.9.9.9.m1.1b"><ci id="S3.T1.9.9.9.9.m1.1.1.cmml" xref="S3.T1.9.9.9.9.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.9.9.9.9.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.9.9.9.9.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.10.10.10.10" style="padding-top:0.5pt;padding-bottom:0.5pt;">CLIP-I<math alttext="\uparrow" class="ltx_Math" 
display="inline" id="S3.T1.10.10.10.10.m1.1"><semantics id="S3.T1.10.10.10.10.m1.1a"><mo id="S3.T1.10.10.10.10.m1.1.1" stretchy="false" xref="S3.T1.10.10.10.10.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.10.10.10.10.m1.1b"><ci id="S3.T1.10.10.10.10.m1.1.1.cmml" xref="S3.T1.10.10.10.10.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.10.10.10.10.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.10.10.10.10.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.11.11.11.11" style="padding-top:0.5pt;padding-bottom:0.5pt;">DINO<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T1.11.11.11.11.m1.1"><semantics id="S3.T1.11.11.11.11.m1.1a"><mo id="S3.T1.11.11.11.11.m1.1.1" stretchy="false" xref="S3.T1.11.11.11.11.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T1.11.11.11.11.m1.1b"><ci id="S3.T1.11.11.11.11.m1.1.1.cmml" xref="S3.T1.11.11.11.11.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.11.11.11.11.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.11.11.11.11.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="S3.T1.12.12.12.12" style="padding-top:0.5pt;padding-bottom:0.5pt;">MSE<math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T1.12.12.12.12.m1.1"><semantics id="S3.T1.12.12.12.12.m1.1a"><mo id="S3.T1.12.12.12.12.m1.1.1" stretchy="false" xref="S3.T1.12.12.12.12.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T1.12.12.12.12.m1.1b"><ci id="S3.T1.12.12.12.12.m1.1.1.cmml" xref="S3.T1.12.12.12.12.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.12.12.12.12.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" 
id="S3.T1.12.12.12.12.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.13.13.13.13" style="padding-top:0.5pt;padding-bottom:0.5pt;">CLIP-I<math alttext="{}^{*}\downarrow" class="ltx_math_unparsed" display="inline" id="S3.T1.13.13.13.13.m1.1"><semantics id="S3.T1.13.13.13.13.m1.1a"><mmultiscripts id="S3.T1.13.13.13.13.m1.1.1"><mo id="S3.T1.13.13.13.13.m1.1.1.2" stretchy="false">↓</mo><mprescripts id="S3.T1.13.13.13.13.m1.1.1a"></mprescripts><mrow id="S3.T1.13.13.13.13.m1.1.1b"></mrow><mo id="S3.T1.13.13.13.13.m1.1.1.3">∗</mo></mmultiscripts><annotation encoding="application/x-tex" id="S3.T1.13.13.13.13.m1.1b">{}^{*}\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.13.13.13.13.m1.1c">start_FLOATSUPERSCRIPT ∗ end_FLOATSUPERSCRIPT ↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_t" id="S3.T1.14.14.14.14" style="padding-top:0.5pt;padding-bottom:0.5pt;">DINO<math alttext="{}^{*}\downarrow" class="ltx_math_unparsed" display="inline" id="S3.T1.14.14.14.14.m1.1"><semantics id="S3.T1.14.14.14.14.m1.1a"><mmultiscripts id="S3.T1.14.14.14.14.m1.1.1"><mo id="S3.T1.14.14.14.14.m1.1.1.2" stretchy="false">↓</mo><mprescripts id="S3.T1.14.14.14.14.m1.1.1a"></mprescripts><mrow id="S3.T1.14.14.14.14.m1.1.1b"></mrow><mo id="S3.T1.14.14.14.14.m1.1.1.3">∗</mo></mmultiscripts><annotation encoding="application/x-tex" id="S3.T1.14.14.14.14.m1.1b">{}^{*}\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.14.14.14.14.m1.1c">start_FLOATSUPERSCRIPT ∗ end_FLOATSUPERSCRIPT ↓</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S3.T1.14.14.16.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S3.T1.14.14.16.1.1" style="padding-top:0.5pt;padding-bottom:0.5pt;">Anydoor <cite class="ltx_cite 
ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.2" style="padding-top:0.5pt;padding-bottom:0.5pt;">86.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.3" style="padding-top:0.5pt;padding-bottom:0.5pt;">81.2</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S3.T1.14.14.16.1.4" style="padding-top:0.5pt;padding-bottom:0.5pt;">6.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.5" style="padding-top:0.5pt;padding-bottom:0.5pt;">85.4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.6" style="padding-top:0.5pt;padding-bottom:0.5pt;">81.7</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S3.T1.14.14.16.1.7" style="padding-top:0.5pt;padding-bottom:0.5pt;">6.8</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.8" style="padding-top:0.5pt;padding-bottom:0.5pt;">83.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.9" style="padding-top:0.5pt;padding-bottom:0.5pt;">83.7</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S3.T1.14.14.16.1.10" style="padding-top:0.5pt;padding-bottom:0.5pt;">9.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.11" style="padding-top:0.5pt;padding-bottom:0.5pt;">81.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.12" style="padding-top:0.5pt;padding-bottom:0.5pt;">80.2</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="S3.T1.14.14.16.1.13" style="padding-top:0.5pt;padding-bottom:0.5pt;">9.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.14" style="padding-top:0.5pt;padding-bottom:0.5pt;">39.5</td> <td class="ltx_td ltx_nopad_r ltx_align_center ltx_border_t" id="S3.T1.14.14.16.1.15" 
style="padding-top:0.5pt;padding-bottom:0.5pt;">13.6</td> </tr> <tr class="ltx_tr" id="S3.T1.14.14.17.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S3.T1.14.14.17.2.1" style="padding-top:0.5pt;padding-bottom:0.5pt;">GliGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.2" style="padding-top:0.5pt;padding-bottom:0.5pt;">70.7</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.3" style="padding-top:0.5pt;padding-bottom:0.5pt;">57.8</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.17.2.4" style="padding-top:0.5pt;padding-bottom:0.5pt;">6.9</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.5" style="padding-top:0.5pt;padding-bottom:0.5pt;">71.2</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.6" style="padding-top:0.5pt;padding-bottom:0.5pt;">62.4</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.17.2.7" style="padding-top:0.5pt;padding-bottom:0.5pt;">7.1</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.8" style="padding-top:0.5pt;padding-bottom:0.5pt;">78.2</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.9" style="padding-top:0.5pt;padding-bottom:0.5pt;">69.4</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.17.2.10" style="padding-top:0.5pt;padding-bottom:0.5pt;">9.7</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.11" style="padding-top:0.5pt;padding-bottom:0.5pt;">68.4</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.12" style="padding-top:0.5pt;padding-bottom:0.5pt;">60.6</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.17.2.13" style="padding-top:0.5pt;padding-bottom:0.5pt;">9.6</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.17.2.14" style="padding-top:0.5pt;padding-bottom:0.5pt;">40.2</td> 
<td class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T1.14.14.17.2.15" style="padding-top:0.5pt;padding-bottom:0.5pt;">15.3</td> </tr> <tr class="ltx_tr" id="S3.T1.14.14.18.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S3.T1.14.14.18.3.1" style="padding-top:0.5pt;padding-bottom:0.5pt;">MagicFix <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.2" style="padding-top:0.5pt;padding-bottom:0.5pt;">80.5</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.3" style="padding-top:0.5pt;padding-bottom:0.5pt;">78.6</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.18.3.4" style="padding-top:0.5pt;padding-bottom:0.5pt;">6.9</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.5" style="padding-top:0.5pt;padding-bottom:0.5pt;">84.6</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.6" style="padding-top:0.5pt;padding-bottom:0.5pt;">82.4</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.18.3.7" style="padding-top:0.5pt;padding-bottom:0.5pt;">6.7</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.8" style="padding-top:0.5pt;padding-bottom:0.5pt;">83.7</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.9" style="padding-top:0.5pt;padding-bottom:0.5pt;">85.2</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.18.3.10" style="padding-top:0.5pt;padding-bottom:0.5pt;">9.0</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.11" style="padding-top:0.5pt;padding-bottom:0.5pt;">84.2</td> <td class="ltx_td ltx_align_center" id="S3.T1.14.14.18.3.12" style="padding-top:0.5pt;padding-bottom:0.5pt;">80.1</td> <td class="ltx_td ltx_align_center ltx_border_r" id="S3.T1.14.14.18.3.13" style="padding-top:0.5pt;padding-bottom:0.5pt;">9.4</td> <td class="ltx_td ltx_align_center" 
id="S3.T1.14.14.18.3.14" style="padding-top:0.5pt;padding-bottom:0.5pt;">43.6</td> <td class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T1.14.14.18.3.15" style="padding-top:0.5pt;padding-bottom:0.5pt;">23.1</td> </tr> <tr class="ltx_tr" id="S3.T1.14.14.19.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r ltx_border_t" id="S3.T1.14.14.19.4.1" style="padding-top:0.5pt;padding-bottom:0.5pt;"> <span class="ltx_text ltx_font_italic" id="S3.T1.14.14.19.4.1.1">BlobCtrl</span> (Ours)</th> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.2" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.2.1">88.3</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.3" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.3.1">86.9</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t" id="S3.T1.14.14.19.4.4" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.4.1">6.4</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.5" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.5.1">88.9</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.6" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.6.1">87.8</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t" id="S3.T1.14.14.19.4.7" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.7.1">6.3</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.8" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text 
ltx_font_bold" id="S3.T1.14.14.19.4.8.1">86.5</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.9" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.9.1">89.1</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t" id="S3.T1.14.14.19.4.10" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.10.1">8.9</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.11" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.11.1">86.2</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.12" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.12.1">86.0</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r ltx_border_t" id="S3.T1.14.14.19.4.13" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.13.1">9.0</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.14" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.14.1">35.3</span></td> <td class="ltx_td ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_t" id="S3.T1.14.14.19.4.15" style="padding-top:0.5pt;padding-bottom:0.5pt;"><span class="ltx_text ltx_font_bold" id="S3.T1.14.14.19.4.15.1">8.6</span></td> </tr> </tbody> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>Quantitative comparison of identity preservation and grounding accuracy across various element-level manipulations. We evaluate using CLIP-I and DINO scores for identity preservation, and MSE for grounding accuracy. 
For removal operations, lower CLIP-I<sup class="ltx_sup" id="S3.T1.23.1">∗</sup> and DINO<sup class="ltx_sup" id="S3.T1.24.2">∗</sup> scores (<math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T1.20.m3.1"><semantics id="S3.T1.20.m3.1b"><mo id="S3.T1.20.m3.1.1" stretchy="false" xref="S3.T1.20.m3.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T1.20.m3.1c"><ci id="S3.T1.20.m3.1.1.cmml" xref="S3.T1.20.m3.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T1.20.m3.1d">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T1.20.m3.1e">↓</annotation></semantics></math>) are desired as they indicate more complete removal of target elements. Our method consistently outperforms existing approaches across all operations.</figcaption> </figure> <figure class="ltx_table" id="S3.T2"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S3.T2.4" style="width:411.9pt;height:86.1pt;vertical-align:-0.9pt;"><span class="ltx_transformed_inner" style="transform:translate(-11.6pt,2.4pt) scale(0.946565479947767,0.946565479947767) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S3.T2.4.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S3.T2.4.4.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="S3.T2.4.4.4.5">Method</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T2.1.1.1.1">PSNR<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T2.1.1.1.1.m1.1"><semantics id="S3.T2.1.1.1.1.m1.1a"><mo id="S3.T2.1.1.1.1.m1.1.1" stretchy="false" xref="S3.T2.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T2.1.1.1.1.m1.1b"><ci id="S3.T2.1.1.1.1.m1.1.1.cmml" xref="S3.T2.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T2.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" 
id="S3.T2.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T2.2.2.2.2">SSIM<math alttext="\uparrow" class="ltx_Math" display="inline" id="S3.T2.2.2.2.2.m1.1"><semantics id="S3.T2.2.2.2.2.m1.1a"><mo id="S3.T2.2.2.2.2.m1.1.1" stretchy="false" xref="S3.T2.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S3.T2.2.2.2.2.m1.1b"><ci id="S3.T2.2.2.2.2.m1.1.1.cmml" xref="S3.T2.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T2.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S3.T2.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T2.3.3.3.3">LPIPS<math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T2.3.3.3.3.m1.1"><semantics id="S3.T2.3.3.3.3.m1.1a"><mo id="S3.T2.3.3.3.3.m1.1.1" stretchy="false" xref="S3.T2.3.3.3.3.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T2.3.3.3.3.m1.1b"><ci id="S3.T2.3.3.3.3.m1.1.1.cmml" xref="S3.T2.3.3.3.3.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T2.3.3.3.3.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S3.T2.3.3.3.3.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S3.T2.4.4.4.4">FID<math alttext="\downarrow" class="ltx_Math" display="inline" id="S3.T2.4.4.4.4.m1.1"><semantics id="S3.T2.4.4.4.4.m1.1a"><mo id="S3.T2.4.4.4.4.m1.1.1" stretchy="false" xref="S3.T2.4.4.4.4.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S3.T2.4.4.4.4.m1.1b"><ci id="S3.T2.4.4.4.4.m1.1.1.cmml" xref="S3.T2.4.4.4.4.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.T2.4.4.4.4.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" 
id="S3.T2.4.4.4.4.m1.1d">↓</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S3.T2.4.4.5.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S3.T2.4.4.5.1.1">Anydoor <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.4.5.1.2">32.0631</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.4.5.1.3">0.7424</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S3.T2.4.4.5.1.4">0.2394</td> <td class="ltx_td ltx_nopad_r ltx_align_center ltx_border_t" id="S3.T2.4.4.5.1.5">145.2546</td> </tr> <tr class="ltx_tr" id="S3.T2.4.4.6.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S3.T2.4.4.6.2.1">GliGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.6.2.2">27.923</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.6.2.3">0.2414</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.6.2.4">0.6963</td> <td class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T2.4.4.6.2.5">307.8219</td> </tr> <tr class="ltx_tr" id="S3.T2.4.4.7.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S3.T2.4.4.7.3.1">MagicFix <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.7.3.2">30.3958</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.7.3.3">0.7415</td> <td class="ltx_td ltx_align_center" id="S3.T2.4.4.7.3.4">0.2277</td> <td class="ltx_td ltx_nopad_r ltx_align_center" id="S3.T2.4.4.7.3.5">194.0154</td> </tr> <tr class="ltx_tr" id="S3.T2.4.4.8.4"> <th 
class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r ltx_border_t" id="S3.T2.4.4.8.4.1"> <span class="ltx_text ltx_font_italic" id="S3.T2.4.4.8.4.1.1">BlobCtrl</span> (Ours)</th> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.4.8.4.2"><span class="ltx_text ltx_font_bold" id="S3.T2.4.4.8.4.2.1">32.1571</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.4.8.4.3"><span class="ltx_text ltx_font_bold" id="S3.T2.4.4.8.4.3.1">0.7507</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.4.8.4.4"><span class="ltx_text ltx_font_bold" id="S3.T2.4.4.8.4.4.1">0.2196</span></td> <td class="ltx_td ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_t" id="S3.T2.4.4.8.4.5"><span class="ltx_text ltx_font_bold" id="S3.T2.4.4.8.4.5.1">102.8094</span></td> </tr> </tbody> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>Comparison of image generation quality measured by PSNR, SSIM, LPIPS, and FID. Our method achieves the best score on every metric, indicating higher generation quality and fewer artifacts.</figcaption> </figure> <figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="597" id="S3.F4.g1" src="x4.png" width="714"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span><span class="ltx_text ltx_font_bold" id="S3.F4.3.1">Visual comparison of element-level manipulation capabilities across different methods.</span> We evaluate five fundamental operations: composition, movement, resizing, replacement, and removal. 
Anydoor <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite> struggles with precise identity preservation, GliGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>)</cite> fails to maintain any identity information, and Magic Fixup <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> produces results with poor visual harmonization. In contrast, <span class="ltx_text ltx_font_italic" id="S3.F4.4.2">BlobCtrl</span> achieves superior results across all operations, preserving identity while maintaining visual harmony. We recommend zooming in to examine the source images and element-level manipulation instructions in detail.</figcaption> </figure> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>Experiments</h2> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>Datasets, Benchmark and Metrics</h3> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph"> <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px1.1.1">BlobData</span> Curation.</h4> <div class="ltx_para" id="S4.SS1.SSS0.Px1.p1"> <p class="ltx_p" id="S4.SS1.SSS0.Px1.p1.1">To train <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px1.p1.1.1">BlobCtrl</span>, we construct <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px1.p1.1.2">BlobData</span> (1.86M samples), sourced from BrushData and containing images, segmentation masks, fitted ellipse parameters (with derived 2D Gaussians), and descriptive texts. 
The dataset curation process involves: (1) Filtering source images to retain those whose shorter side exceeds 480 pixels and that have valid instance segmentation masks. (2) Applying mask filtering criteria to preserve masks with area ratios between 0.01 and 0.9 of the total image area and excluding those at image boundaries. (3) For the filtered masks, fitting ellipse parameters <span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_tag ltx_tag_note">1</span><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://docs.opencv.org/4.x/de/d62/tutorial_bounding_rotated_ellipses.html" title="">https://docs.opencv.org/4.x/de/d62/tutorial_bounding_rotated_ellipses.html</a></span></span></span> and deriving 2D Gaussian distributions. (4) Removing invalid samples, particularly those with covariance values below 1e-5. (5) Generating detailed image descriptions with InternVL-2.5 <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib10" title="">2024</a>)</cite>.</p> </div> </section> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph"> <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px2.1.1">BlobBench</span> Curation.</h4> <div class="ltx_para" id="S4.SS1.SSS0.Px2.p1"> <p class="ltx_p" id="S4.SS1.SSS0.Px2.p1.1">Existing evaluation benchmarks like DreamBooth <cite class="ltx_cite ltx_citemacro_citep">(Ruiz et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib39" title="">2023</a>)</cite>, COCOE <cite class="ltx_cite ltx_citemacro_citep">(Yang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib48" title="">2023</a>)</cite>, COCO Val <cite class="ltx_cite ltx_citemacro_citep">(Lin et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib28" 
title="">2014</a>)</cite>, and CreatiLayout <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib50" title="">2024a</a>)</cite> assess either grounding capability or identity preservation, but not both simultaneously. They also lack coverage of the full range of element-level manipulations, such as composition, movement, resizing, deletion, and replacement. To address these gaps, we introduce <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px2.p1.1.1">BlobBench</span>, a comprehensive benchmark with 100 curated images, evenly distributed across different element-level operations. Each image is annotated by experts with ellipse parameters, a foreground mask, and detailed text descriptions. <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px2.p1.1.2">BlobBench</span> includes both real-world and AI-generated images across diverse scenarios, such as indoor and outdoor scenes, animals, and landscapes, ensuring a fair and effective evaluation.</p> </div> </section> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px3"> <h4 class="ltx_title ltx_title_paragraph">Evaluation Metrics.</h4> <div class="ltx_para" id="S4.SS1.SSS0.Px3.p1"> <p class="ltx_p" id="S4.SS1.SSS0.Px3.p1.1">We evaluate <span class="ltx_text ltx_font_italic" id="S4.SS1.SSS0.Px3.p1.1.1">BlobCtrl</span> with objective metrics (identity preservation, grounding accuracy, generation quality and harmonization) and a subjective human assessment. 
Please refer to <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A1" title="Appendix A BlobBench Overview and Evaluation Metrics ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">Appendix</span> <span class="ltx_text ltx_ref_tag">A</span></a> for details about these metrics.</p> </div> </section> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2 </span>Implementation Details.</h3> <section class="ltx_paragraph" id="S4.SS2.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">Training Details.</h4> <div class="ltx_para" id="S4.SS2.SSS0.Px1.p1"> <p class="ltx_p" id="S4.SS2.SSS0.Px1.p1.3"><span class="ltx_text ltx_font_italic" id="S4.SS2.SSS0.Px1.p1.3.1">BlobCtrl</span> is built upon Stable Diffusion v1.5 <cite class="ltx_cite ltx_citemacro_citep">(Rombach et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib38" title="">2022</a>)</cite>. 
During training, all images and annotations are resized to <math alttext="512\times 512" class="ltx_Math" display="inline" id="S4.SS2.SSS0.Px1.p1.1.m1.1"><semantics id="S4.SS2.SSS0.Px1.p1.1.m1.1a"><mrow id="S4.SS2.SSS0.Px1.p1.1.m1.1.1" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.cmml"><mn id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.2" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.2.cmml">512</mn><mo id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.1.cmml">×</mo><mn id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.3" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.3.cmml">512</mn></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.SSS0.Px1.p1.1.m1.1b"><apply id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.cmml" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1"><times id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.1.cmml" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.1"></times><cn id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.2.cmml" type="integer" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.2">512</cn><cn id="S4.SS2.SSS0.Px1.p1.1.m1.1.1.3.cmml" type="integer" xref="S4.SS2.SSS0.Px1.p1.1.m1.1.1.3">512</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.SSS0.Px1.p1.1.m1.1c">512\times 512</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.SSS0.Px1.p1.1.m1.1d">512 × 512</annotation></semantics></math> pixels. We initialize both foreground and background branches with pretrained UNet weights. The foreground branch undergoes full fine-tuning with cross-attention layers removed, while the background branch is fine-tuned using LoRA <cite class="ltx_cite ltx_citemacro_citep">(Hu et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib17" title="">2021</a>)</cite> with rank=64. We employ the Adam optimizer <cite class="ltx_cite ltx_citemacro_citep">(Kingma &amp; Ba, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib21" title="">2014</a>)</cite> with a learning rate of 1e-5 and weight decay of 0.01. 
The model is trained on our curated <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS0.Px1.p1.3.2">BlobData</span> dataset comprising 1.86M samples for 7 days using 24 NVIDIA V100 GPUs with a batch size of 192. For controllable fidelity-diversity trade-off, we set dropout probabilities <math alttext="p_{\omega},p_{\mathsf{feat}},p_{\mathsf{vae}}" class="ltx_Math" display="inline" id="S4.SS2.SSS0.Px1.p1.2.m2.3"><semantics id="S4.SS2.SSS0.Px1.p1.2.m2.3a"><mrow id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.4.cmml"><msub id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.cmml"><mi id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.2" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.2.cmml">p</mi><mi id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.3" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.3.cmml">ω</mi></msub><mo id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.4" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.4.cmml">,</mo><msub id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.cmml"><mi id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.2" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.2.cmml">p</mi><mi id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.3" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.3.cmml">𝖿𝖾𝖺𝗍</mi></msub><mo id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.5" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.4.cmml">,</mo><msub id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.cmml"><mi id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.2" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.2.cmml">p</mi><mi id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.3" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.3.cmml">𝗏𝖺𝖾</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.SSS0.Px1.p1.2.m2.3b"><list id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.4.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3"><apply id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1"><csymbol cd="ambiguous" id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.1.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1">subscript</csymbol><ci 
id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.2.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.2">𝑝</ci><ci id="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.3.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.1.1.1.1.3">𝜔</ci></apply><apply id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2"><csymbol cd="ambiguous" id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.1.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2">subscript</csymbol><ci id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.2.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.2">𝑝</ci><ci id="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.3.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.2.2.2.2.3">𝖿𝖾𝖺𝗍</ci></apply><apply id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3"><csymbol cd="ambiguous" id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.1.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3">subscript</csymbol><ci id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.2.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.2">𝑝</ci><ci id="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.3.cmml" xref="S4.SS2.SSS0.Px1.p1.2.m2.3.3.3.3.3">𝗏𝖺𝖾</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.SSS0.Px1.p1.2.m2.3c">p_{\omega},p_{\mathsf{feat}},p_{\mathsf{vae}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.SSS0.Px1.p1.2.m2.3d">italic_p start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT sansserif_feat end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT sansserif_vae end_POSTSUBSCRIPT</annotation></semantics></math> to 0.1. 
The weight of identity preservation loss <math alttext="\lambda_{\text{id}}" class="ltx_Math" display="inline" id="S4.SS2.SSS0.Px1.p1.3.m3.1"><semantics id="S4.SS2.SSS0.Px1.p1.3.m3.1a"><msub id="S4.SS2.SSS0.Px1.p1.3.m3.1.1" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1.cmml"><mi id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.2" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1.2.cmml">λ</mi><mtext id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.3" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S4.SS2.SSS0.Px1.p1.3.m3.1b"><apply id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.cmml" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.1.cmml" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1">subscript</csymbol><ci id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.2.cmml" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1.2">𝜆</ci><ci id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.3a.cmml" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1.3"><mtext id="S4.SS2.SSS0.Px1.p1.3.m3.1.1.3.cmml" mathsize="70%" xref="S4.SS2.SSS0.Px1.p1.3.m3.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.SSS0.Px1.p1.3.m3.1c">\lambda_{\text{id}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.SSS0.Px1.p1.3.m3.1d">italic_λ start_POSTSUBSCRIPT id end_POSTSUBSCRIPT</annotation></semantics></math> is gradually decayed from 1.0 to 0.6 during training. 
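</p> <p class="ltx_p">The decay of the identity-preservation loss weight described above could be sketched as follows. This is only an illustration: the text states the two endpoint values (1.0 and 0.6), while the linear shape, the step-based schedule, and the names <code>lambda_id</code> and <code>total_steps</code> are assumptions introduced here.</p>
<pre class="ltx_verbatim">
```python
def lambda_id(step, total_steps, start=1.0, end=0.6):
    """Interpolate the identity-loss weight from start to end.

    The linear shape is an assumption; the source only gives the
    endpoint values 1.0 and 0.6.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay at an endpoint.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * progress


# Weights at the start, midpoint, and end of a hypothetical 100-step run.
weights = [round(lambda_id(s, 100), 2) for s in (0, 50, 100)]
```
</pre>
<p class="ltx_p">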
Additionally, to enable classifier-free guidance during inference, we set the caption dropout probability to 0.1.</p> </div> </section> <section class="ltx_paragraph" id="S4.SS2.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Evaluation Details.</h4> <div class="ltx_para" id="S4.SS2.SSS0.Px2.p1"> <p class="ltx_p" id="S4.SS2.SSS0.Px2.p1.1">We evaluate <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS0.Px2.p1.1.1">BlobCtrl</span> against three state-of-the-art methods on the <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS0.Px2.p1.1.2">BlobBench</span> benchmark: GliGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>)</cite>, a bounding box-based text-to-image model; Anydoor <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite>, a segmentation mask-based image-to-image model; and Magic Fixup <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite>, which specializes in harmonizing transformed regions. To systematically assess five fundamental element-level operations (composition, movement, resizing, replacement, and removal), we adapt the baselines with specific workflows. For Anydoor, we create a clean background by teleporting background to foreground regions, then edit by teleporting foreground objects to target locations. For GliGen, we use <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS0.Px2.p1.1.3">BlobCtrl</span> to remove elements for a clean background, then apply bounding box constraints with text and image conditions. 
For Magic Fixup, we warp foreground elements using rigid transformations derived from the editing operations, followed by scene harmonization.</p> </div> </section> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.3 </span>Quantitative Evaluation</h3> <section class="ltx_paragraph" id="S4.SS3.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">Comparison to State-of-the-Art Methods.</h4> <div class="ltx_para" id="S4.SS3.SSS0.Px1.p1"> <p class="ltx_p" id="S4.SS3.SSS0.Px1.p1.1">As shown in Tab. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.T1" title="Table 1 ‣ 3.4 Controllable Fidelity-Diversity Trade-off ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">1</span></a> and Tab. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.T2" title="Table 2 ‣ 3.4 Controllable Fidelity-Diversity Trade-off ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">2</span></a>, <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS0.Px1.p1.1.1">BlobCtrl</span> demonstrates consistent and significant improvements over existing methods across all evaluation metrics:</p> </div> <div class="ltx_para" id="S4.SS3.SSS0.Px1.p2"> <ul class="ltx_itemize" id="S4.I1"> <li class="ltx_item" id="S4.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i1.p1"> <p class="ltx_p" id="S4.I1.i1.p1.8"><span class="ltx_text ltx_font_bold" id="S4.I1.i1.p1.8.1">Identity Preservation:</span> For tasks requiring identity preservation (composition, movement, resizing, replacement), <span class="ltx_text ltx_font_italic" id="S4.I1.i1.p1.8.2">BlobCtrl</span> achieves substantially higher average CLIP-I (<math 
alttext="87.48" class="ltx_Math" display="inline" id="S4.I1.i1.p1.1.m1.1"><semantics id="S4.I1.i1.p1.1.m1.1a"><mn id="S4.I1.i1.p1.1.m1.1.1" xref="S4.I1.i1.p1.1.m1.1.1.cmml">87.48</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i1.p1.1.m1.1b"><cn id="S4.I1.i1.p1.1.m1.1.1.cmml" type="float" xref="S4.I1.i1.p1.1.m1.1.1">87.48</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i1.p1.1.m1.1c">87.48</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i1.p1.1.m1.1d">87.48</annotation></semantics></math> vs. <math alttext="84.28" class="ltx_Math" display="inline" id="S4.I1.i1.p1.2.m2.1"><semantics id="S4.I1.i1.p1.2.m2.1a"><mn id="S4.I1.i1.p1.2.m2.1.1" xref="S4.I1.i1.p1.2.m2.1.1.cmml">84.28</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i1.p1.2.m2.1b"><cn id="S4.I1.i1.p1.2.m2.1.1.cmml" type="float" xref="S4.I1.i1.p1.2.m2.1.1">84.28</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i1.p1.2.m2.1c">84.28</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i1.p1.2.m2.1d">84.28</annotation></semantics></math>) and DINO (<math alttext="87.45" class="ltx_Math" display="inline" id="S4.I1.i1.p1.3.m3.1"><semantics id="S4.I1.i1.p1.3.m3.1a"><mn id="S4.I1.i1.p1.3.m3.1.1" xref="S4.I1.i1.p1.3.m3.1.1.cmml">87.45</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i1.p1.3.m3.1b"><cn id="S4.I1.i1.p1.3.m3.1.1.cmml" type="float" xref="S4.I1.i1.p1.3.m3.1.1">87.45</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i1.p1.3.m3.1c">87.45</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i1.p1.3.m3.1d">87.45</annotation></semantics></math> vs. 
<math alttext="81.70" class="ltx_Math" display="inline" id="S4.I1.i1.p1.4.m4.1"><semantics id="S4.I1.i1.p1.4.m4.1a"><mn id="S4.I1.i1.p1.4.m4.1.1" xref="S4.I1.i1.p1.4.m4.1.1.cmml">81.70</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i1.p1.4.m4.1b"><cn id="S4.I1.i1.p1.4.m4.1.1.cmml" type="float" xref="S4.I1.i1.p1.4.m4.1.1">81.70</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i1.p1.4.m4.1c">81.70</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i1.p1.4.m4.1d">81.70</annotation></semantics></math>) scores compared to the best baseline. For removal tasks, our method shows lower identity scores (average of CLIP-I<sup class="ltx_sup" id="S4.I1.i1.p1.8.3">∗</sup> and DINO<sup class="ltx_sup" id="S4.I1.i1.p1.8.4">∗</sup> scores) (<math alttext="21.95" class="ltx_Math" display="inline" id="S4.I1.i1.p1.7.m7.1"><semantics id="S4.I1.i1.p1.7.m7.1a"><mn id="S4.I1.i1.p1.7.m7.1.1" xref="S4.I1.i1.p1.7.m7.1.1.cmml">21.95</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i1.p1.7.m7.1b"><cn id="S4.I1.i1.p1.7.m7.1.1.cmml" type="float" xref="S4.I1.i1.p1.7.m7.1.1">21.95</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i1.p1.7.m7.1c">21.95</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i1.p1.7.m7.1d">21.95</annotation></semantics></math> vs. 
<math alttext="26.55" class="ltx_Math" display="inline" id="S4.I1.i1.p1.8.m8.1"><semantics id="S4.I1.i1.p1.8.m8.1a"><mn id="S4.I1.i1.p1.8.m8.1.1" xref="S4.I1.i1.p1.8.m8.1.1.cmml">26.55</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i1.p1.8.m8.1b"><cn id="S4.I1.i1.p1.8.m8.1.1.cmml" type="float" xref="S4.I1.i1.p1.8.m8.1.1">26.55</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i1.p1.8.m8.1c">26.55</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i1.p1.8.m8.1d">26.55</annotation></semantics></math>), indicating more thorough element elimination.</p> </div> </li> <li class="ltx_item" id="S4.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i2.p1"> <p class="ltx_p" id="S4.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.I1.i2.p1.1.1">Layout Control:</span> <span class="ltx_text ltx_font_italic" id="S4.I1.i2.p1.1.2">BlobCtrl</span> exhibits superior spatial control accuracy, reducing layout MSE by <math alttext="8.11\%" class="ltx_Math" display="inline" id="S4.I1.i2.p1.1.m1.1"><semantics id="S4.I1.i2.p1.1.m1.1a"><mrow id="S4.I1.i2.p1.1.m1.1.1" xref="S4.I1.i2.p1.1.m1.1.1.cmml"><mn id="S4.I1.i2.p1.1.m1.1.1.2" xref="S4.I1.i2.p1.1.m1.1.1.2.cmml">8.11</mn><mo id="S4.I1.i2.p1.1.m1.1.1.1" xref="S4.I1.i2.p1.1.m1.1.1.1.cmml">%</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.I1.i2.p1.1.m1.1b"><apply id="S4.I1.i2.p1.1.m1.1.1.cmml" xref="S4.I1.i2.p1.1.m1.1.1"><csymbol cd="latexml" id="S4.I1.i2.p1.1.m1.1.1.1.cmml" xref="S4.I1.i2.p1.1.m1.1.1.1">percent</csymbol><cn id="S4.I1.i2.p1.1.m1.1.1.2.cmml" type="float" xref="S4.I1.i2.p1.1.m1.1.1.2">8.11</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i2.p1.1.m1.1c">8.11\%</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i2.p1.1.m1.1d">8.11 %</annotation></semantics></math> relative to the previous best method. 
This validates the effectiveness of our probabilistic blob representation for precise element manipulation.</p> </div> </li> <li class="ltx_item" id="S4.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I1.i3.p1"> <p class="ltx_p" id="S4.I1.i3.p1.4"><span class="ltx_text ltx_font_bold" id="S4.I1.i3.p1.4.1">Generation Quality:</span> Our method sets new state-of-the-art performance benchmarks across standard quality metrics: FID <math alttext="102.8094" class="ltx_Math" display="inline" id="S4.I1.i3.p1.1.m1.1"><semantics id="S4.I1.i3.p1.1.m1.1a"><mn id="S4.I1.i3.p1.1.m1.1.1" xref="S4.I1.i3.p1.1.m1.1.1.cmml">102.8094</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i3.p1.1.m1.1b"><cn id="S4.I1.i3.p1.1.m1.1.1.cmml" type="float" xref="S4.I1.i3.p1.1.m1.1.1">102.8094</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i3.p1.1.m1.1c">102.8094</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i3.p1.1.m1.1d">102.8094</annotation></semantics></math>, LPIPS <math alttext="0.2196" class="ltx_Math" display="inline" id="S4.I1.i3.p1.2.m2.1"><semantics id="S4.I1.i3.p1.2.m2.1a"><mn id="S4.I1.i3.p1.2.m2.1.1" xref="S4.I1.i3.p1.2.m2.1.1.cmml">0.2196</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i3.p1.2.m2.1b"><cn id="S4.I1.i3.p1.2.m2.1.1.cmml" type="float" xref="S4.I1.i3.p1.2.m2.1.1">0.2196</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i3.p1.2.m2.1c">0.2196</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i3.p1.2.m2.1d">0.2196</annotation></semantics></math>, PSNR <math alttext="32.1571" class="ltx_Math" display="inline" id="S4.I1.i3.p1.3.m3.1"><semantics id="S4.I1.i3.p1.3.m3.1a"><mn id="S4.I1.i3.p1.3.m3.1.1" xref="S4.I1.i3.p1.3.m3.1.1.cmml">32.1571</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i3.p1.3.m3.1b"><cn id="S4.I1.i3.p1.3.m3.1.1.cmml" type="float" 
xref="S4.I1.i3.p1.3.m3.1.1">32.1571</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i3.p1.3.m3.1c">32.1571</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i3.p1.3.m3.1d">32.1571</annotation></semantics></math>, and SSIM <math alttext="0.7507" class="ltx_Math" display="inline" id="S4.I1.i3.p1.4.m4.1"><semantics id="S4.I1.i3.p1.4.m4.1a"><mn id="S4.I1.i3.p1.4.m4.1.1" xref="S4.I1.i3.p1.4.m4.1.1.cmml">0.7507</mn><annotation-xml encoding="MathML-Content" id="S4.I1.i3.p1.4.m4.1b"><cn id="S4.I1.i3.p1.4.m4.1.1.cmml" type="float" xref="S4.I1.i3.p1.4.m4.1.1">0.7507</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.I1.i3.p1.4.m4.1c">0.7507</annotation><annotation encoding="application/x-llamapun" id="S4.I1.i3.p1.4.m4.1d">0.7507</annotation></semantics></math>. These results demonstrate <span class="ltx_text ltx_font_italic" id="S4.I1.i3.p1.4.2">BlobCtrl</span>’s ability to generate high-fidelity outputs while maintaining global visual coherence.</p> </div> </li> </ul> <p class="ltx_p" id="S4.SS3.SSS0.Px1.p2.1">We attribute these substantial improvements to two key innovations: (1) the probabilistic blob representation that enables precise control over element attributes, and (2) our self-supervised training paradigm that effectively decouples and recombines visual elements’ identity, semantics and layout information, while eliminating performance degradation caused by unnecessary camera movements and other video-specific artifacts that plague previous methods.</p> </div> </section> <section class="ltx_paragraph" id="S4.SS3.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Human Evaluation.</h4> <div class="ltx_para" id="S4.SS3.SSS0.Px2.p1"> <p class="ltx_p" id="S4.SS3.SSS0.Px2.p1.1">The subjective evaluation results reported in Tab. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.T3" title="Table 3 ‣ Human Evaluation. 
‣ 4.3 Quantitative Evaluation ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">3</span></a> show that <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS0.Px2.p1.1.1">BlobCtrl</span> is preferred on all three assessment criteria. Against the strongest baseline on each criterion, our method reaches an 87.2% preference rate in appearance fidelity versus 82.5% for Anydoor, 86.5% in layout accuracy versus 81.7% for Anydoor, and 82.1% in visual harmony versus 80.3% for GliGen. These gains indicate that <span class="ltx_text ltx_font_italic" id="S4.SS3.SSS0.Px2.p1.1.2">BlobCtrl</span> produces results that human observers find more faithful and natural, making it better suited for real-world applications.</p> </div> <figure class="ltx_table" id="S4.T3"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T3.3" style="width:390.3pt;height:84.9pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(-11.7pt,2.5pt) scale(0.94347444025254,0.94347444025254) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T3.3.3"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T3.3.3.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="S4.T3.3.3.3.4">Method</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T3.1.1.1.1">Fidelity<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.1.1.1.1.m1.1"><semantics id="S4.T3.1.1.1.1.m1.1a"><mo id="S4.T3.1.1.1.1.m1.1.1" stretchy="false" xref="S4.T3.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.1.1.1.1.m1.1b"><ci id="S4.T3.1.1.1.1.m1.1.1.cmml" 
xref="S4.T3.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T3.2.2.2.2">Layout<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.2.2.2.2.m1.1"><semantics id="S4.T3.2.2.2.2.m1.1a"><mo id="S4.T3.2.2.2.2.m1.1.1" stretchy="false" xref="S4.T3.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.2.2.2.2.m1.1b"><ci id="S4.T3.2.2.2.2.m1.1.1.cmml" xref="S4.T3.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T3.3.3.3.3">Harmony<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.3.3.3.3.m1.1"><semantics id="S4.T3.3.3.3.3.m1.1a"><mo id="S4.T3.3.3.3.3.m1.1.1" stretchy="false" xref="S4.T3.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.3.3.3.3.m1.1b"><ci id="S4.T3.3.3.3.3.m1.1.1.cmml" xref="S4.T3.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T3.3.3.4.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T3.3.3.4.1.1">Anydoor <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.3.3.4.1.2">82.5%</td> <td 
class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.3.3.4.1.3">81.7%</td> <td class="ltx_td ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T3.3.3.4.1.4">78.1%</td> </tr> <tr class="ltx_tr" id="S4.T3.3.3.5.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T3.3.3.5.2.1">GliGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.5.2.2">51.2%</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.5.2.3">68.1%</td> <td class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T3.3.3.5.2.4">80.3%</td> </tr> <tr class="ltx_tr" id="S4.T3.3.3.6.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T3.3.3.6.3.1">Magic Fixup <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.6.3.2">70.2%</td> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.6.3.3">73.1%</td> <td class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T3.3.3.6.3.4">49.4%</td> </tr> <tr class="ltx_tr" id="S4.T3.3.3.7.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r ltx_border_t" id="S4.T3.3.3.7.4.1"> <span class="ltx_text ltx_font_italic" id="S4.T3.3.3.7.4.1.1">BlobCtrl</span> (Ours)</th> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S4.T3.3.3.7.4.2"><span class="ltx_text ltx_font_bold" id="S4.T3.3.3.7.4.2.1">87.2%</span></td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_t" id="S4.T3.3.3.7.4.3"><span class="ltx_text ltx_font_bold" id="S4.T3.3.3.7.4.3.1">86.5%</span></td> <td class="ltx_td ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_t" id="S4.T3.3.3.7.4.4"><span class="ltx_text ltx_font_bold" id="S4.T3.3.3.7.4.4.1">82.1%</span></td> </tr> </tbody> </table> </span></div> 
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 3: </span><span class="ltx_text ltx_font_bold" id="S4.T3.5.1">Human evaluation results comparing our method with baselines.</span> Our method achieves consistently higher human preference scores across all metrics, demonstrating superior perceptual quality.</figcaption> </figure> </section> </section> <section class="ltx_subsection" id="S4.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.4 </span>Qualitative Evaluation</h3> <div class="ltx_para" id="S4.SS4.p1"> <p class="ltx_p" id="S4.SS4.p1.1">Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S3.F4" title="Figure 4 ‣ 3.4 Controllable Fidelity-Diversity Trade-off ‣ 3 Self-supervised Paradigm for BlobCtrl ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">4</span></a> presents qualitative comparisons between <span class="ltx_text ltx_font_italic" id="S4.SS4.p1.1.1">BlobCtrl</span> and state-of-the-art methods across various element-level manipulation scenarios. 
The comparison highlights characteristic limitations of each baseline:</p> </div> <div class="ltx_para" id="S4.SS4.p2"> <ul class="ltx_itemize" id="S4.I2"> <li class="ltx_item" id="S4.I2.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i1.p1"> <p class="ltx_p" id="S4.I2.i1.p1.1">Anydoor <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite> struggles with accurate identity preservation during element manipulation and shows limitations in element-level removal, often leaving artifacts or incomplete modifications.</p> </div> </li> <li class="ltx_item" id="S4.I2.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i2.p1"> <p class="ltx_p" id="S4.I2.i2.p1.1">While GliGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>)</cite> provides layout control capabilities, it fails to effectively preserve the visual appearance and identity of manipulated elements, resulting in inconsistent outputs.</p> </div> </li> <li class="ltx_item" id="S4.I2.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S4.I2.i3.p1"> <p class="ltx_p" id="S4.I2.i3.p1.1">Magic Fixup <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> exhibits insufficient harmonization abilities, leading to visual inconsistencies between modified elements and their surroundings.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S4.SS4.p3"> <p class="ltx_p" id="S4.SS4.p3.1">In contrast, <span class="ltx_text ltx_font_italic" id="S4.SS4.p3.1.1">BlobCtrl</span> demonstrates superior performance across all aspects: better generalization to diverse scenarios, 
more accurate identity preservation, and precise layout control while maintaining visual coherence.</p> </div> </section> <section class="ltx_subsection" id="S4.SS5"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.5 </span>Ablation Studies</h3> <section class="ltx_paragraph" id="S4.SS5.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">Analysis of Controllability and Flexibility</h4> <div class="ltx_para" id="S4.SS5.SSS0.Px1.p1"> <p class="ltx_p" id="S4.SS5.SSS0.Px1.p1.3">As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.F5" title="Figure 5 ‣ Analysis of Controllability and Flexibility ‣ 4.5 Ablation Studies ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">5</span></a>, <span class="ltx_text ltx_font_italic" id="S4.SS5.SSS0.Px1.p1.3.1">BlobCtrl</span> achieves flexible control over the trade-off between identity preservation and diversity by adjusting the control timestep interval and control strength <math alttext="\omega" class="ltx_Math" display="inline" id="S4.SS5.SSS0.Px1.p1.1.m1.1"><semantics id="S4.SS5.SSS0.Px1.p1.1.m1.1a"><mi id="S4.SS5.SSS0.Px1.p1.1.m1.1.1" xref="S4.SS5.SSS0.Px1.p1.1.m1.1.1.cmml">ω</mi><annotation-xml encoding="MathML-Content" id="S4.SS5.SSS0.Px1.p1.1.m1.1b"><ci id="S4.SS5.SSS0.Px1.p1.1.m1.1.1.cmml" xref="S4.SS5.SSS0.Px1.p1.1.m1.1.1">𝜔</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.SSS0.Px1.p1.1.m1.1c">\omega</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.SSS0.Px1.p1.1.m1.1d">italic_ω</annotation></semantics></math> of the dual-branch fusion. When using only the background branch with text prompts, both identity preservation and layout accuracy suffer. 
Best results come from combining spatial-aware semantic features <math alttext="\bm{s}_{\mathsf{fg}}" class="ltx_Math" display="inline" id="S4.SS5.SSS0.Px1.p1.2.m2.1"><semantics id="S4.SS5.SSS0.Px1.p1.2.m2.1a"><msub id="S4.SS5.SSS0.Px1.p1.2.m2.1.1" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1.cmml"><mi id="S4.SS5.SSS0.Px1.p1.2.m2.1.1.2" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1.2.cmml">𝒔</mi><mi id="S4.SS5.SSS0.Px1.p1.2.m2.1.1.3" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1.3.cmml">𝖿𝗀</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS5.SSS0.Px1.p1.2.m2.1b"><apply id="S4.SS5.SSS0.Px1.p1.2.m2.1.1.cmml" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS5.SSS0.Px1.p1.2.m2.1.1.1.cmml" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1">subscript</csymbol><ci id="S4.SS5.SSS0.Px1.p1.2.m2.1.1.2.cmml" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1.2">𝒔</ci><ci id="S4.SS5.SSS0.Px1.p1.2.m2.1.1.3.cmml" xref="S4.SS5.SSS0.Px1.p1.2.m2.1.1.3">𝖿𝗀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.SSS0.Px1.p1.2.m2.1c">\bm{s}_{\mathsf{fg}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.SSS0.Px1.p1.2.m2.1d">bold_italic_s start_POSTSUBSCRIPT sansserif_fg end_POSTSUBSCRIPT</annotation></semantics></math> and VAE features <math alttext="\bm{z}_{\mathsf{vae}}" class="ltx_Math" display="inline" id="S4.SS5.SSS0.Px1.p1.3.m3.1"><semantics id="S4.SS5.SSS0.Px1.p1.3.m3.1a"><msub id="S4.SS5.SSS0.Px1.p1.3.m3.1.1" xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1.cmml"><mi id="S4.SS5.SSS0.Px1.p1.3.m3.1.1.2" xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1.2.cmml">𝒛</mi><mi id="S4.SS5.SSS0.Px1.p1.3.m3.1.1.3" xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1.3.cmml">𝗏𝖺𝖾</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS5.SSS0.Px1.p1.3.m3.1b"><apply id="S4.SS5.SSS0.Px1.p1.3.m3.1.1.cmml" xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S4.SS5.SSS0.Px1.p1.3.m3.1.1.1.cmml" xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1">subscript</csymbol><ci id="S4.SS5.SSS0.Px1.p1.3.m3.1.1.2.cmml" 
xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1.2">𝒛</ci><ci id="S4.SS5.SSS0.Px1.p1.3.m3.1.1.3.cmml" xref="S4.SS5.SSS0.Px1.p1.3.m3.1.1.3">𝗏𝖺𝖾</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS5.SSS0.Px1.p1.3.m3.1c">\bm{z}_{\mathsf{vae}}</annotation><annotation encoding="application/x-llamapun" id="S4.SS5.SSS0.Px1.p1.3.m3.1d">bold_italic_z start_POSTSUBSCRIPT sansserif_vae end_POSTSUBSCRIPT</annotation></semantics></math>.</p> </div> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="570" id="S4.F5.g1" src="x5.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span><span class="ltx_text ltx_font_bold" id="S4.F5.4.1">Flexible Control.</span> Our dual-branch fusion mechanism enables flexible control over the trade-off between diversity and appearance preservation by adjusting the control timestep interval and fusion strength <math alttext="\omega" class="ltx_Math" display="inline" id="S4.F5.2.m1.1"><semantics id="S4.F5.2.m1.1b"><mi id="S4.F5.2.m1.1.1" xref="S4.F5.2.m1.1.1.cmml">ω</mi><annotation-xml encoding="MathML-Content" id="S4.F5.2.m1.1c"><ci id="S4.F5.2.m1.1.1.cmml" xref="S4.F5.2.m1.1.1">𝜔</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.F5.2.m1.1d">\omega</annotation><annotation encoding="application/x-llamapun" id="S4.F5.2.m1.1e">italic_ω</annotation></semantics></math>. Additionally, the feature dropout mechanisms provide more flexible interfaces for controlling the generation process.</figcaption> </figure> </section> <section class="ltx_paragraph" id="S4.SS5.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Ablation of Identity Preservation Score Function.</h4> <div class="ltx_para" id="S4.SS5.SSS0.Px2.p1"> <p class="ltx_p" id="S4.SS5.SSS0.Px2.p1.1">We conduct an ablation study to analyze the effectiveness of our Identity Preservation Score Function. As shown in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#S4.F6" title="Figure 6 ‣ Ablation of Identity Preservation Score Function. ‣ 4.5 Ablation Studies ‣ 4 Experiments ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">6</span></a>, under the same training steps, the model with Identity Preservation Score Function achieves significantly lower noise prediction loss (0.0235) compared to the model without it (0.0399), demonstrating faster convergence. To better understand how this score function affects the generation process, we visualize the denoising results using the predicted noise from the foreground branch. The visualization reveals that the foreground branch effectively focuses on generating foreground content when guided by the Identity Preservation Score Function, validating our design choice of decoupling foreground and background element generation through this mechanism.</p> </div> <figure class="ltx_figure" id="S4.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="244" id="S4.F6.g1" src="x6.png" width="813"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6: </span><span class="ltx_text ltx_font_bold" id="S4.F6.2.1">Ablation of Identity Preservation Score Function.</span> Training loss and denoising visualization for scaling a deer, demonstrating how Identity Preservation Score Function enables faster convergence and effective foreground-background decoupling during element-level manipulation.</figcaption> </figure> </section> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Related Work</h2> <section class="ltx_paragraph" id="S5.SS0.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">Element-level Generation.</h4> <div class="ltx_para" id="S5.SS0.SSS0.Px1.p1"> <p class="ltx_p" 
id="S5.SS0.SSS0.Px1.p1.1">Contemporary element-level generation approaches can be categorized into two main paradigms: grounding-based and subject-driven methods. <em class="ltx_emph ltx_font_italic" id="S5.SS0.SSS0.Px1.p1.1.1">Grounding-based</em> approaches, exemplified by GliGen and BlobGen <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib26" title="">2023</a>; Nie et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib31" title="">2024</a>)</cite>, employ bounding boxes and ellipses to achieve spatial control. However, these methods lack robust identity control mechanisms, resulting in significant content variations across different random initializations. While VisualComposer <cite class="ltx_cite ltx_citemacro_citep">(Parmar et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib34" title="">2025</a>)</cite> advances the field by incorporating multi-granular encoders for identity feature extraction, it struggles with precise layout control. Although Anydoor <cite class="ltx_cite ltx_citemacro_citep">(Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib9" title="">2023</a>)</cite> and GroundingBooth <cite class="ltx_cite ltx_citemacro_citep">(Xiong et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib47" title="">2024</a>)</cite> demonstrate promising results in controlling both identity and layout, their heavy reliance on multi-view and video training data constrains their practical applicability and generalization capabilities. 
In the realm of <em class="ltx_emph ltx_font_italic" id="S5.SS0.SSS0.Px1.p1.1.2">subject-driven</em> approaches, existing methods face significant limitations: they either demand computationally expensive test-time optimization <cite class="ltx_cite ltx_citemacro_citep">(Gal et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib14" title="">2022</a>; Ruiz et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib39" title="">2023</a>; <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib13" title="">Gal et al., </a>; Kumari et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib23" title="">2023</a>)</cite> or depend extensively on multi-view datasets <cite class="ltx_cite ltx_citemacro_citep">(Arar et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib3" title="">2023</a>; Wei et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib45" title="">2023</a>; Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib25" title="">2024a</a>; Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib53" title="">2024b</a>)</cite>, which impedes their out-of-distribution generalization and spatial control capabilities. 
In contrast, our <span class="ltx_text ltx_font_italic" id="S5.SS0.SSS0.Px1.p1.1.3">BlobCtrl</span> presents an elegant solution by seamlessly integrating layout, semantic, and identity information through probabilistic blob representations and self-supervised training, thereby achieving flexible control over both appearance fidelity and creative diversity.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS0.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Element-level Editing.</h4> <div class="ltx_para" id="S5.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="S5.SS0.SSS0.Px2.p1.1">Traditional image editing methods <cite class="ltx_cite ltx_citemacro_citep">(Hertz et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib15" title="">2023</a>; Brooks et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib5" title="">2023</a>; Huang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib18" title="">2024</a>; Cao et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib6" title="">2023</a>; Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib27" title="">2024b</a>; Shi et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib43" title="">2024</a>)</cite> rely on text prompts to introduce editing information, while element-level editing methods focus on manipulating visual elements through operations like moving, resizing, replacing, and removing. Continuous Layout Editing <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib54" title="">2023b</a>)</cite> decouples layout and appearance through test-time optimization but suffers from high computational costs. 
Magic Fixup <cite class="ltx_cite ltx_citemacro_citep">(Alzayer et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib2" title="">2024</a>)</cite> introduces a two-stage pipeline consisting of transformation and harmonization steps, but relies on video data for training which can lead to degraded performance. Editable-element <cite class="ltx_cite ltx_citemacro_citep">(Mu et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib30" title="">2025</a>)</cite> proposes an element-level VAE approach but shows limited generalization ability due to its heavy dependence on large-scale paired training data. In contrast, our approach enables flexible element-level editing through self-supervised training, eliminating the need for explicitly paired editing data.</p> </div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Discussion</h2> <section class="ltx_paragraph" id="S6.SS0.SSS0.Px1"> <h4 class="ltx_title ltx_font_bold ltx_title_paragraph">Conclusion.</h4> <div class="ltx_para" id="S6.SS0.SSS0.Px1.p1"> <p class="ltx_p" id="S6.SS0.SSS0.Px1.p1.1">This work introduces <span class="ltx_text ltx_font_italic" id="S6.SS0.SSS0.Px1.p1.1.1">BlobCtrl</span>, a unified framework that integrates element-level generation and editing using a probabilistic blob-based representation. Blobs serve as visual primitives to encode spatial layout, semantics, and identity, allowing precise element manipulation. The dual-branch architecture with self-supervised training preserves foreground identities and maintains background harmony. Random data augmentation and dropout strategies offer flexible control between appearance fidelity and creative diversity. 
Extensive experiments on <span class="ltx_text ltx_font_italic" id="S6.SS0.SSS0.Px1.p1.1.2">BlobBench</span> demonstrate that <span class="ltx_text ltx_font_italic" id="S6.SS0.SSS0.Px1.p1.1.3">BlobCtrl</span> achieves state-of-the-art performance in element-level manipulation tasks.</p> </div> </section> <section class="ltx_paragraph" id="S6.SS0.SSS0.Px2"> <h4 class="ltx_title ltx_font_bold ltx_title_paragraph">Limitations and Future Work.</h4> <div class="ltx_para" id="S6.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="S6.SS0.SSS0.Px2.p1.1">While <span class="ltx_text ltx_font_italic" id="S6.SS0.SSS0.Px2.p1.1.1">BlobCtrl</span> demonstrates strong capabilities in element-level manipulation, it currently handles only a single element per model forward pass, so multiple elements must be manipulated iteratively. Fortunately, our blob-based representation inherently supports depth-aware composition, opening promising directions for future work.</p> </div> </section> </section> <section class="ltx_section" id="Sx1"> <h2 class="ltx_title ltx_title_section">Impact Statement</h2> <div class="ltx_para" id="Sx1.p1"> <p class="ltx_p" id="Sx1.p1.1">Our work on element-level manipulation presents both opportunities and risks. While it enables more precise and flexible creative tools, there are potential concerns about misuse for creating misleading or harmful content. We advocate for responsible development and deployment of such technologies, with clear guidelines for ethical use and transparency about AI-generated content.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Adobe Inc. (1988–2023)</span> <span class="ltx_bibblock"> Adobe Inc. </span> <span class="ltx_bibblock">Adobe photoshop, 1988–2023. 
</span> <span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.adobe.com/products/photoshop.html" title="">https://www.adobe.com/products/photoshop.html</a>. </span> <span class="ltx_bibblock">Version 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Alzayer et al. (2024)</span> <span class="ltx_bibblock"> Alzayer, H., Xia, Z., Zhang, X., Shechtman, E., Huang, J.-B., and Gharbi, M. </span> <span class="ltx_bibblock">Magic fixup: Streamlining photo editing by watching dynamic videos. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">arXiv preprint arXiv:2403.13044</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Arar et al. (2023)</span> <span class="ltx_bibblock"> Arar, M., Gal, R., Atzmon, Y., Chechik, G., Cohen-Or, D., Shamir, A., and H. Bermano, A. </span> <span class="ltx_bibblock">Domain-agnostic tuning-encoder for fast personalization of text-to-image models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">SIGGRAPH Asia 2023 Conference Papers</em>, pp.  1–10, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Avrahami et al. (2023)</span> <span class="ltx_bibblock"> Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., and Lischinski, D. </span> <span class="ltx_bibblock">Break-a-scene: Extracting multiple concepts from a single image. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">SIGGRAPH Asia 2023 Conference Papers</em>, pp.  1–12, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Brooks et al. (2023)</span> <span class="ltx_bibblock"> Brooks, T., Holynski, A., and Efros, A. A. 
</span> <span class="ltx_bibblock">Instructpix2pix: Learning to follow image editing instructions. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  18392–18402, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cao et al. (2023)</span> <span class="ltx_bibblock"> Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., and Zheng, Y. </span> <span class="ltx_bibblock">Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, pp.  22560–22570, October 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Caron et al. (2021)</span> <span class="ltx_bibblock"> Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. </span> <span class="ltx_bibblock">Emerging properties in self-supervised vision transformers. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">Proceedings of the IEEE/CVF international conference on computer vision</em>, pp.  9650–9660, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Carson et al. (1999)</span> <span class="ltx_bibblock"> Carson, C., Thomas, M., Belongie, S., Hellerstein, J. M., and Malik, J. </span> <span class="ltx_bibblock">Blobworld: A system for region-based image indexing and retrieval. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">Visual Information and Information Systems: Third International Conference, VISUAL’99 Amsterdam, The Netherlands, June 2–4, 1999 Proceedings 3</em>, pp.  509–517. 
Springer, 1999. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2023)</span> <span class="ltx_bibblock"> Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., and Zhao, H. </span> <span class="ltx_bibblock">Anydoor: Zero-shot object-level image customization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">arXiv preprint</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2024)</span> <span class="ltx_bibblock"> Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. </span> <span class="ltx_bibblock">Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">arXiv preprint arXiv:2412.05271</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Epstein et al. (2022)</span> <span class="ltx_bibblock"> Epstein, D., Park, T., Zhang, R., Shechtman, E., and Efros, A. A. </span> <span class="ltx_bibblock">Blobgan: Spatially disentangled scene representations. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">European Conference on Computer Vision</em>, pp.  616–635. Springer, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Esser et al. (2024)</span> <span class="ltx_bibblock"> Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. </span> <span class="ltx_bibblock">Scaling rectified flow transformers for high-resolution image synthesis. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">Forty-first International Conference on Machine Learning</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">(13)</span> <span class="ltx_bibblock"> Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-or, D. </span> <span class="ltx_bibblock">An image is worth one word: Personalizing text-to-image generation using textual inversion. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">The Eleventh International Conference on Learning Representations</em>. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gal et al. (2022)</span> <span class="ltx_bibblock"> Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. </span> <span class="ltx_bibblock">An image is worth one word: Personalizing text-to-image generation using textual inversion. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">arXiv preprint arXiv:2208.01618</em>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hertz et al. (2023)</span> <span class="ltx_bibblock"> Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., and Cohen-Or, D. </span> <span class="ltx_bibblock">Prompt-to-prompt image editing with cross-attention control. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">ICLR</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Heusel et al. (2017)</span> <span class="ltx_bibblock"> Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. 
</span> <span class="ltx_bibblock">GANs trained by a two time-scale update rule converge to a local Nash equilibrium. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">Advances in Neural Information Processing Systems (NIPS)</em>, 30, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hu et al. (2021)</span> <span class="ltx_bibblock"> Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. </span> <span class="ltx_bibblock">Lora: Low-rank adaptation of large language models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">arXiv preprint arXiv:2106.09685</em>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Huang et al. (2024)</span> <span class="ltx_bibblock"> Huang, Y., Xie, L., Wang, X., Yuan, Z., Cun, X., Ge, Y., Zhou, J., Dong, C., Huang, R., Zhang, R., et al. </span> <span class="ltx_bibblock">Smartedit: Exploring complex instruction-based image editing with multimodal large language models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  8362–8371, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ju et al. (2024)</span> <span class="ltx_bibblock"> Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., and Xu, Q. </span> <span class="ltx_bibblock">Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib19.1.1">European Conference on Computer Vision</em>, pp.  150–168. Springer, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kingma (2013)</span> <span class="ltx_bibblock"> Kingma, D. P. </span> <span class="ltx_bibblock">Auto-encoding variational bayes. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">arXiv preprint arXiv:1312.6114</em>, 2013. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kingma &amp; Ba (2014)</span> <span class="ltx_bibblock"> Kingma, D. P. and Ba, J. </span> <span class="ltx_bibblock">Adam: A method for stochastic optimization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">arXiv preprint arXiv:1412.6980</em>, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kirillov et al. (2023)</span> <span class="ltx_bibblock"> Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. </span> <span class="ltx_bibblock">Segment anything. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib22.1.1">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, October 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kumari et al. (2023)</span> <span class="ltx_bibblock"> Kumari, N., Zhang, B., Zhang, R., Shechtman, E., and Zhu, J.-Y. </span> <span class="ltx_bibblock">Multi-concept customization of text-to-image diffusion. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib23.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  1931–1941, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Labs (2023)</span> <span class="ltx_bibblock"> Labs, B. F. </span> <span class="ltx_bibblock">Flux. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/black-forest-labs/flux" title="">https://github.com/black-forest-labs/flux</a>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024a)</span> <span class="ltx_bibblock"> Li, D., Li, J., and Hoi, S. </span> <span class="ltx_bibblock">Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.1.1">Advances in Neural Information Processing Systems</em>, 36, 2024a. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2023)</span> <span class="ltx_bibblock"> Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., and Lee, Y. J. </span> <span class="ltx_bibblock">Gligen: Open-set grounded text-to-image generation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib26.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  22511–22521, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024b)</span> <span class="ltx_bibblock"> Li, Y., Bian, Y., Ju, X., Zhang, Z., Shan, Y., Zou, Y., and Xu, Q. </span> <span class="ltx_bibblock">Brushedit: All-in-one image inpainting and editing. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">arXiv preprint arXiv:2412.10316</em>, 2024b. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lin et al. 
(2014)</span> <span class="ltx_bibblock"> Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. </span> <span class="ltx_bibblock">Microsoft coco: Common objects in context. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib28.1.1">Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13</em>, pp.  740–755. Springer, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mahalanobis (1936)</span> <span class="ltx_bibblock"> Mahalanobis, P. </span> <span class="ltx_bibblock">On the generalized distance in statistics. </span> <span class="ltx_bibblock">National Institute of Science of India, 1936. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mu et al. (2025)</span> <span class="ltx_bibblock"> Mu, J., Gharbi, M., Zhang, R., Shechtman, E., Vasconcelos, N., Wang, X., and Park, T. </span> <span class="ltx_bibblock">Editable image elements for controllable synthesis. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib30.1.1">European Conference on Computer Vision</em>, pp.  39–56. Springer, 2025. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Nie et al. (2024)</span> <span class="ltx_bibblock"> Nie, W., Liu, S., Mardani, M., Liu, C., Eckart, B., and Vahdat, A. </span> <span class="ltx_bibblock">Compositional text-to-image generation with dense blob representations. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib31.1.1">Forty-first International Conference on Machine Learning</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Nitzberg &amp; Mumford (1990)</span> <span class="ltx_bibblock"> Nitzberg, M. 
and Mumford, D. B. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib32.1.1">The 2.1-D sketch</em>. </span> <span class="ltx_bibblock">IEEE Computer Society Press, 1990. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Oquab et al. (2023)</span> <span class="ltx_bibblock"> Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.-Y., Xu, H., Sharma, V., Li, S.-W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. </span> <span class="ltx_bibblock">Dinov2: Learning robust visual features without supervision, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Parmar et al. (2025)</span> <span class="ltx_bibblock"> Parmar, G., Patashnik, O., Wang, K.-C., Ostashev, D., Narasimhan, S., Zhu, J.-Y., Cohen-Or, D., and Aberman, K. </span> <span class="ltx_bibblock">Object-level visual prompts for compositional image generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.1.1">arXiv preprint arXiv:2501.01424</em>, 2025. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Porter &amp; Duff (1984)</span> <span class="ltx_bibblock"> Porter, T. and Duff, T. </span> <span class="ltx_bibblock">Compositing digital images. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">Proceedings of the 11th annual conference on Computer graphics and interactive techniques</em>, pp.  253–259, 1984. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Radford et al. (2021)</span> <span class="ltx_bibblock"> Radford, A., Kim, J. 
W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. </span> <span class="ltx_bibblock">Learning transferable visual models from natural language supervision. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib36.1.1">International conference on machine learning</em>, pp.  8748–8763. PMLR, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ramesh et al. (2022)</span> <span class="ltx_bibblock"> Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. </span> <span class="ltx_bibblock">Hierarchical text-conditional image generation with CLIP latents. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">arXiv preprint arXiv:2204.06125</em>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Rombach et al. (2022)</span> <span class="ltx_bibblock"> Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. </span> <span class="ltx_bibblock">High-resolution image synthesis with latent diffusion models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">CVPR</em>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ruiz et al. (2023)</span> <span class="ltx_bibblock"> Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. </span> <span class="ltx_bibblock">Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib39.1.1">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</em>, pp.  22500–22510, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Serif Europe Ltd. 
(2015–2023)</span> <span class="ltx_bibblock"> Serif Europe Ltd. </span> <span class="ltx_bibblock">Affinity photo, 2015–2023. </span> <span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://affinity.serif.com/photo/" title="">https://affinity.serif.com/photo/</a>. </span> <span class="ltx_bibblock">Version 2.0. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sheynin et al. (2024)</span> <span class="ltx_bibblock"> Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., and Taigman, Y. </span> <span class="ltx_bibblock">Emu edit: Precise image editing via recognition and generation tasks. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib41.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  8871–8879, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al. (2023)</span> <span class="ltx_bibblock"> Shi, Y., Xue, C., Pan, J., Zhang, W., Tan, V. Y., and Bai, S. </span> <span class="ltx_bibblock">Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.1.1">arXiv preprint arXiv:2306.14435</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shi et al. (2024)</span> <span class="ltx_bibblock"> Shi, Y., Wang, P., and Huang, W. </span> <span class="ltx_bibblock">Seededit: Align image re-generation to image editing. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib43.1.1">arXiv preprint arXiv:2411.06686</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al. 
(2004)</span> <span class="ltx_bibblock"> Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. </span> <span class="ltx_bibblock">Image quality assessment: from error visibility to structural similarity. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib44.1.1">IEEE transactions on image processing</em>, 13(4):600–612, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wei et al. (2023)</span> <span class="ltx_bibblock"> Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., and Zuo, W. </span> <span class="ltx_bibblock">Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib45.1.1">Proceedings of the IEEE/CVF International Conference on Computer Vision</em>, pp.  15943–15953, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wikipedia contributors (2024)</span> <span class="ltx_bibblock"> Wikipedia contributors. </span> <span class="ltx_bibblock">Peak signal-to-noise ratio — Wikipedia, the free encyclopedia, 2024. </span> <span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://en.wikipedia.org/w/index.php?title=Peak_signal-to-noise_ratio&amp;oldid=1210897995" title="">https://en.wikipedia.org/w/index.php?title=Peak_signal-to-noise_ratio&amp;oldid=1210897995</a>. </span> <span class="ltx_bibblock">[Online; accessed 4-March-2024]. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xiong et al. (2024)</span> <span class="ltx_bibblock"> Xiong, Z., Xiong, W., Shi, J., Zhang, H., Song, Y., and Jacobs, N. </span> <span class="ltx_bibblock">Groundingbooth: Grounding text-to-image customization. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib47.1.1">arXiv preprint arXiv:2409.08520</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al. (2023)</span> <span class="ltx_bibblock"> Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., and Wen, F. </span> <span class="ltx_bibblock">Paint by example: Exemplar-based image editing with diffusion models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib48.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  18381–18391, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al. (2023)</span> <span class="ltx_bibblock"> Ye, H., Zhang, J., Liu, S., Han, X., and Yang, W. </span> <span class="ltx_bibblock">Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib49.1.1">arXiv preprint arXiv:2308.06721</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2024a)</span> <span class="ltx_bibblock"> Zhang, H., Hong, D., Gao, T., Wang, Y., Shao, J., Wu, X., Wu, Z., and Jiang, Y.-G. </span> <span class="ltx_bibblock">Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.1.1">arXiv preprint arXiv:2412.03859</em>, 2024a. </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2023a)</span> <span class="ltx_bibblock"> Zhang, L., Rao, A., and Agrawala, M. </span> <span class="ltx_bibblock">Adding conditional control to text-to-image diffusion models, 2023a. 
</span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2018)</span> <span class="ltx_bibblock"> Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. </span> <span class="ltx_bibblock">The unreasonable effectiveness of deep features as a perceptual metric. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib52.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, pp.  586–595, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2024b)</span> <span class="ltx_bibblock"> Zhang, Y., Song, Y., Liu, J., Wang, R., Yu, J., Tang, H., Li, H., Tang, X., Hu, Y., Pan, H., et al. </span> <span class="ltx_bibblock">Ssr-encoder: Encoding selective subject representation for subject-driven generation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib53.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp.  8069–8078, 2024b. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2023b)</span> <span class="ltx_bibblock"> Zhang, Z., Huang, Z., and Liao, J. </span> <span class="ltx_bibblock">Continuous layout editing of single images with diffusion models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib54.1.1">Computer Graphics Forum</em>, pp.  e14966. Wiley Online Library, 2023b. 
</span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix A </span>BlobBench Overview and Evaluation Metrics</h2> <section class="ltx_paragraph" id="A1.SS0.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">BlobBench Overview</h4> <div class="ltx_para" id="A1.SS0.SSS0.Px1.p1"> <p class="ltx_p" id="A1.SS0.SSS0.Px1.p1.1">As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A1.F7" title="Figure 7 ‣ BlobBench Overview ‣ Appendix A BlobBench Overview and Evaluation Metrics ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">7</span></a>, BlobBench is a benchmark of 100 curated images, evenly distributed across five element-level operations (composition, movement, resizing, removal, and replacement). Each image is annotated with ellipse parameters, foreground masks, and text descriptions. 
The benchmark includes both real-world and AI-generated images across diverse scenarios such as indoor and outdoor scenes, animals, and landscapes.</p> </div> <figure class="ltx_figure" id="A1.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="158" id="A1.F7.g1" src="extracted/6284831/figures/benchmark_overview.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7: </span>Overview of BlobBench.</figcaption> </figure> </section> <section class="ltx_paragraph" id="A1.SS0.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">Evaluation Metrics</h4> <div class="ltx_para" id="A1.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="A1.SS0.SSS0.Px2.p1.1">For <em class="ltx_emph ltx_font_italic" id="A1.SS0.SSS0.Px2.p1.1.1">objective evaluation</em>, we consider the following aspects:</p> <ul class="ltx_itemize" id="A1.I1"> <li class="ltx_item" id="A1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i1.p1"> <p class="ltx_p" id="A1.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="A1.I1.i1.p1.1.1">Identity Preservation.</span> To evaluate element-level appearance preservation, we employ CLIP-I <cite class="ltx_cite ltx_citemacro_citep">(Radford et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib36" title="">2021</a>)</cite> and DINO <cite class="ltx_cite ltx_citemacro_citep">(Caron et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib7" title="">2021</a>)</cite> scores to measure the appearance similarity between generated and reference images.</p> </div> </li> <li class="ltx_item" id="A1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i2.p1"> <p class="ltx_p" id="A1.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="A1.I1.i2.p1.1.1">Grounding Accuracy.</span> To evaluate layout control capability, we first extract 
masks from generated images using SAM <cite class="ltx_cite ltx_citemacro_citep">(Kirillov et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib22" title="">2023</a>)</cite> and fit ellipses or bounding boxes to these masks. We then compute the Mean Squared Error (MSE) between these fitted grounding annotations and the ground truth to quantify the accuracy of spatial control.</p> </div> </li> <li class="ltx_item" id="A1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i3.p1"> <p class="ltx_p" id="A1.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="A1.I1.i3.p1.1.1">Generation Quality and Harmonization.</span> We adopt standard image quality metrics including FID <cite class="ltx_cite ltx_citemacro_citep">(Heusel et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib16" title="">2017</a>)</cite> for distribution similarity, PSNR <cite class="ltx_cite ltx_citemacro_citep">(Wikipedia contributors, <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib46" title="">2024</a>)</cite> and SSIM <cite class="ltx_cite ltx_citemacro_citep">(Wang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib44" title="">2004</a>)</cite> for pixel-level fidelity, and LPIPS <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib52" title="">2018</a>)</cite> for perceptual quality to evaluate both generation and editing results.</p> </div> </li> </ul> </div> <div class="ltx_para" id="A1.SS0.SSS0.Px2.p2"> <p class="ltx_p" id="A1.SS0.SSS0.Px2.p2.1">For <em class="ltx_emph ltx_font_italic" id="A1.SS0.SSS0.Px2.p2.1.1">subjective evaluation</em>, we conduct a human study with 30 participants rating 20 sets of generated images on three aspects: fidelity, layout accuracy, and visual harmony. 
Each aspect is scored on a scale of 1-5, with 5 being the highest quality.</p> </div> </section> </section> <section class="ltx_appendix" id="A2"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix B </span>Mathematical Relationship Between Ellipses and 2D Gaussian Distributions</h2> <div class="ltx_para" id="A2.p1"> <p class="ltx_p" id="A2.p1.1">A 2D Gaussian distribution and an ellipse can be mathematically related through their covariance matrix and level sets. Here we explain their conversion:</p> </div> <section class="ltx_paragraph" id="A2.SS0.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph">From Gaussian to Ellipse.</h4> <div class="ltx_para" id="A2.SS0.SSS0.Px1.p1"> <p class="ltx_p" id="A2.SS0.SSS0.Px1.p1.2">A 2D Gaussian distribution is formulated by its mean <math alttext="\bm{\mu}=(\mu_{x},\mu_{y})" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px1.p1.1.m1.2"><semantics id="A2.SS0.SSS0.Px1.p1.1.m1.2a"><mrow id="A2.SS0.SSS0.Px1.p1.1.m1.2.2" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.cmml"><mi id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.4" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.4.cmml">𝝁</mi><mo id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.3" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.3.cmml">=</mo><mrow id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.3.cmml"><mo id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.3" stretchy="false" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.3.cmml">(</mo><msub id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.cmml"><mi id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.2" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.2.cmml">μ</mi><mi id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.3" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.3.cmml">x</mi></msub><mo id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.4" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.3.cmml">,</mo><msub id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.cmml"><mi id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.2" 
xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.2.cmml">μ</mi><mi id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.3" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.3.cmml">y</mi></msub><mo id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.5" stretchy="false" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px1.p1.1.m1.2b"><apply id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2"><eq id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.3.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.3"></eq><ci id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.4.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.4">𝝁</ci><interval closure="open" id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.3.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2"><apply id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1"><csymbol cd="ambiguous" id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.1.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1">subscript</csymbol><ci id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.2.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.2">𝜇</ci><ci id="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.3.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.1.1.1.1.1.3">𝑥</ci></apply><apply id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2"><csymbol cd="ambiguous" id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.1.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2">subscript</csymbol><ci id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.2.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.2">𝜇</ci><ci id="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.3.cmml" xref="A2.SS0.SSS0.Px1.p1.1.m1.2.2.2.2.2.3">𝑦</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px1.p1.1.m1.2c">\bm{\mu}=(\mu_{x},\mu_{y})</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px1.p1.1.m1.2d">bold_italic_μ = ( italic_μ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT )</annotation></semantics></math> and covariance matrix <math 
alttext="\bm{\Sigma}" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px1.p1.2.m2.1"><semantics id="A2.SS0.SSS0.Px1.p1.2.m2.1a"><mi id="A2.SS0.SSS0.Px1.p1.2.m2.1.1" xref="A2.SS0.SSS0.Px1.p1.2.m2.1.1.cmml">𝚺</mi><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px1.p1.2.m2.1b"><ci id="A2.SS0.SSS0.Px1.p1.2.m2.1.1.cmml" xref="A2.SS0.SSS0.Px1.p1.2.m2.1.1">𝚺</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px1.p1.2.m2.1c">\bm{\Sigma}</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px1.p1.2.m2.1d">bold_Σ</annotation></semantics></math>:</p> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px1.p2"> <table class="ltx_equation ltx_eqn_table" id="A2.E14"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\bm{\Sigma}=\begin{bmatrix}\sigma_{x}^{2}&amp;\rho\sigma_{x}\sigma_{y}\\ \rho\sigma_{x}\sigma_{y}&amp;\sigma_{y}^{2}\end{bmatrix}" class="ltx_Math" display="block" id="A2.E14.m1.1"><semantics id="A2.E14.m1.1a"><mrow id="A2.E14.m1.1.2" xref="A2.E14.m1.1.2.cmml"><mi id="A2.E14.m1.1.2.2" xref="A2.E14.m1.1.2.2.cmml">𝚺</mi><mo id="A2.E14.m1.1.2.1" xref="A2.E14.m1.1.2.1.cmml">=</mo><mrow id="A2.E14.m1.1.1.3" xref="A2.E14.m1.1.1.2.cmml"><mo id="A2.E14.m1.1.1.3.1" xref="A2.E14.m1.1.1.2.1.cmml">[</mo><mtable columnspacing="5pt" displaystyle="true" id="A2.E14.m1.1.1.1.1" rowspacing="0pt" xref="A2.E14.m1.1.1.1.1.cmml"><mtr id="A2.E14.m1.1.1.1.1a" xref="A2.E14.m1.1.1.1.1.cmml"><mtd id="A2.E14.m1.1.1.1.1b" xref="A2.E14.m1.1.1.1.1.cmml"><msubsup id="A2.E14.m1.1.1.1.1.1.1.1" xref="A2.E14.m1.1.1.1.1.1.1.1.cmml"><mi id="A2.E14.m1.1.1.1.1.1.1.1.2.2" xref="A2.E14.m1.1.1.1.1.1.1.1.2.2.cmml">σ</mi><mi id="A2.E14.m1.1.1.1.1.1.1.1.2.3" xref="A2.E14.m1.1.1.1.1.1.1.1.2.3.cmml">x</mi><mn id="A2.E14.m1.1.1.1.1.1.1.1.3" xref="A2.E14.m1.1.1.1.1.1.1.1.3.cmml">2</mn></msubsup></mtd><mtd id="A2.E14.m1.1.1.1.1c" 
xref="A2.E14.m1.1.1.1.1.cmml"><mrow id="A2.E14.m1.1.1.1.1.1.2.1" xref="A2.E14.m1.1.1.1.1.1.2.1.cmml"><mi id="A2.E14.m1.1.1.1.1.1.2.1.2" xref="A2.E14.m1.1.1.1.1.1.2.1.2.cmml">ρ</mi><mo id="A2.E14.m1.1.1.1.1.1.2.1.1" xref="A2.E14.m1.1.1.1.1.1.2.1.1.cmml">⁢</mo><msub id="A2.E14.m1.1.1.1.1.1.2.1.3" xref="A2.E14.m1.1.1.1.1.1.2.1.3.cmml"><mi id="A2.E14.m1.1.1.1.1.1.2.1.3.2" xref="A2.E14.m1.1.1.1.1.1.2.1.3.2.cmml">σ</mi><mi id="A2.E14.m1.1.1.1.1.1.2.1.3.3" xref="A2.E14.m1.1.1.1.1.1.2.1.3.3.cmml">x</mi></msub><mo id="A2.E14.m1.1.1.1.1.1.2.1.1a" xref="A2.E14.m1.1.1.1.1.1.2.1.1.cmml">⁢</mo><msub id="A2.E14.m1.1.1.1.1.1.2.1.4" xref="A2.E14.m1.1.1.1.1.1.2.1.4.cmml"><mi id="A2.E14.m1.1.1.1.1.1.2.1.4.2" xref="A2.E14.m1.1.1.1.1.1.2.1.4.2.cmml">σ</mi><mi id="A2.E14.m1.1.1.1.1.1.2.1.4.3" xref="A2.E14.m1.1.1.1.1.1.2.1.4.3.cmml">y</mi></msub></mrow></mtd></mtr><mtr id="A2.E14.m1.1.1.1.1d" xref="A2.E14.m1.1.1.1.1.cmml"><mtd id="A2.E14.m1.1.1.1.1e" xref="A2.E14.m1.1.1.1.1.cmml"><mrow id="A2.E14.m1.1.1.1.1.2.1.1" xref="A2.E14.m1.1.1.1.1.2.1.1.cmml"><mi id="A2.E14.m1.1.1.1.1.2.1.1.2" xref="A2.E14.m1.1.1.1.1.2.1.1.2.cmml">ρ</mi><mo id="A2.E14.m1.1.1.1.1.2.1.1.1" xref="A2.E14.m1.1.1.1.1.2.1.1.1.cmml">⁢</mo><msub id="A2.E14.m1.1.1.1.1.2.1.1.3" xref="A2.E14.m1.1.1.1.1.2.1.1.3.cmml"><mi id="A2.E14.m1.1.1.1.1.2.1.1.3.2" xref="A2.E14.m1.1.1.1.1.2.1.1.3.2.cmml">σ</mi><mi id="A2.E14.m1.1.1.1.1.2.1.1.3.3" xref="A2.E14.m1.1.1.1.1.2.1.1.3.3.cmml">x</mi></msub><mo id="A2.E14.m1.1.1.1.1.2.1.1.1a" xref="A2.E14.m1.1.1.1.1.2.1.1.1.cmml">⁢</mo><msub id="A2.E14.m1.1.1.1.1.2.1.1.4" xref="A2.E14.m1.1.1.1.1.2.1.1.4.cmml"><mi id="A2.E14.m1.1.1.1.1.2.1.1.4.2" xref="A2.E14.m1.1.1.1.1.2.1.1.4.2.cmml">σ</mi><mi id="A2.E14.m1.1.1.1.1.2.1.1.4.3" xref="A2.E14.m1.1.1.1.1.2.1.1.4.3.cmml">y</mi></msub></mrow></mtd><mtd id="A2.E14.m1.1.1.1.1f" xref="A2.E14.m1.1.1.1.1.cmml"><msubsup id="A2.E14.m1.1.1.1.1.2.2.1" xref="A2.E14.m1.1.1.1.1.2.2.1.cmml"><mi id="A2.E14.m1.1.1.1.1.2.2.1.2.2" 
xref="A2.E14.m1.1.1.1.1.2.2.1.2.2.cmml">σ</mi><mi id="A2.E14.m1.1.1.1.1.2.2.1.2.3" xref="A2.E14.m1.1.1.1.1.2.2.1.2.3.cmml">y</mi><mn id="A2.E14.m1.1.1.1.1.2.2.1.3" xref="A2.E14.m1.1.1.1.1.2.2.1.3.cmml">2</mn></msubsup></mtd></mtr></mtable><mo id="A2.E14.m1.1.1.3.2" xref="A2.E14.m1.1.1.2.1.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.E14.m1.1b"><apply id="A2.E14.m1.1.2.cmml" xref="A2.E14.m1.1.2"><eq id="A2.E14.m1.1.2.1.cmml" xref="A2.E14.m1.1.2.1"></eq><ci id="A2.E14.m1.1.2.2.cmml" xref="A2.E14.m1.1.2.2">𝚺</ci><apply id="A2.E14.m1.1.1.2.cmml" xref="A2.E14.m1.1.1.3"><csymbol cd="latexml" id="A2.E14.m1.1.1.2.1.cmml" xref="A2.E14.m1.1.1.3.1">matrix</csymbol><matrix id="A2.E14.m1.1.1.1.1.cmml" xref="A2.E14.m1.1.1.1.1"><matrixrow id="A2.E14.m1.1.1.1.1a.cmml" xref="A2.E14.m1.1.1.1.1"><apply id="A2.E14.m1.1.1.1.1.1.1.1.cmml" xref="A2.E14.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.1.1.1.1.cmml" xref="A2.E14.m1.1.1.1.1.1.1.1">superscript</csymbol><apply id="A2.E14.m1.1.1.1.1.1.1.1.2.cmml" xref="A2.E14.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.1.1.1.2.1.cmml" xref="A2.E14.m1.1.1.1.1.1.1.1">subscript</csymbol><ci id="A2.E14.m1.1.1.1.1.1.1.1.2.2.cmml" xref="A2.E14.m1.1.1.1.1.1.1.1.2.2">𝜎</ci><ci id="A2.E14.m1.1.1.1.1.1.1.1.2.3.cmml" xref="A2.E14.m1.1.1.1.1.1.1.1.2.3">𝑥</ci></apply><cn id="A2.E14.m1.1.1.1.1.1.1.1.3.cmml" type="integer" xref="A2.E14.m1.1.1.1.1.1.1.1.3">2</cn></apply><apply id="A2.E14.m1.1.1.1.1.1.2.1.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1"><times id="A2.E14.m1.1.1.1.1.1.2.1.1.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.1"></times><ci id="A2.E14.m1.1.1.1.1.1.2.1.2.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.2">𝜌</ci><apply id="A2.E14.m1.1.1.1.1.1.2.1.3.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.3"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.1.2.1.3.1.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.3">subscript</csymbol><ci id="A2.E14.m1.1.1.1.1.1.2.1.3.2.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.3.2">𝜎</ci><ci 
id="A2.E14.m1.1.1.1.1.1.2.1.3.3.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.3.3">𝑥</ci></apply><apply id="A2.E14.m1.1.1.1.1.1.2.1.4.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.4"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.1.2.1.4.1.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.4">subscript</csymbol><ci id="A2.E14.m1.1.1.1.1.1.2.1.4.2.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.4.2">𝜎</ci><ci id="A2.E14.m1.1.1.1.1.1.2.1.4.3.cmml" xref="A2.E14.m1.1.1.1.1.1.2.1.4.3">𝑦</ci></apply></apply></matrixrow><matrixrow id="A2.E14.m1.1.1.1.1b.cmml" xref="A2.E14.m1.1.1.1.1"><apply id="A2.E14.m1.1.1.1.1.2.1.1.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1"><times id="A2.E14.m1.1.1.1.1.2.1.1.1.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.1"></times><ci id="A2.E14.m1.1.1.1.1.2.1.1.2.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.2">𝜌</ci><apply id="A2.E14.m1.1.1.1.1.2.1.1.3.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.3"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.2.1.1.3.1.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.3">subscript</csymbol><ci id="A2.E14.m1.1.1.1.1.2.1.1.3.2.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.3.2">𝜎</ci><ci id="A2.E14.m1.1.1.1.1.2.1.1.3.3.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.3.3">𝑥</ci></apply><apply id="A2.E14.m1.1.1.1.1.2.1.1.4.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.4"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.2.1.1.4.1.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.4">subscript</csymbol><ci id="A2.E14.m1.1.1.1.1.2.1.1.4.2.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.4.2">𝜎</ci><ci id="A2.E14.m1.1.1.1.1.2.1.1.4.3.cmml" xref="A2.E14.m1.1.1.1.1.2.1.1.4.3">𝑦</ci></apply></apply><apply id="A2.E14.m1.1.1.1.1.2.2.1.cmml" xref="A2.E14.m1.1.1.1.1.2.2.1"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.2.2.1.1.cmml" xref="A2.E14.m1.1.1.1.1.2.2.1">superscript</csymbol><apply id="A2.E14.m1.1.1.1.1.2.2.1.2.cmml" xref="A2.E14.m1.1.1.1.1.2.2.1"><csymbol cd="ambiguous" id="A2.E14.m1.1.1.1.1.2.2.1.2.1.cmml" xref="A2.E14.m1.1.1.1.1.2.2.1">subscript</csymbol><ci id="A2.E14.m1.1.1.1.1.2.2.1.2.2.cmml" xref="A2.E14.m1.1.1.1.1.2.2.1.2.2">𝜎</ci><ci 
id="A2.E14.m1.1.1.1.1.2.2.1.2.3.cmml" xref="A2.E14.m1.1.1.1.1.2.2.1.2.3">𝑦</ci></apply><cn id="A2.E14.m1.1.1.1.1.2.2.1.3.cmml" type="integer" xref="A2.E14.m1.1.1.1.1.2.2.1.3">2</cn></apply></matrixrow></matrix></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.E14.m1.1c">\bm{\Sigma}=\begin{bmatrix}\sigma_{x}^{2}&amp;\rho\sigma_{x}\sigma_{y}\\ \rho\sigma_{x}\sigma_{y}&amp;\sigma_{y}^{2}\end{bmatrix}</annotation><annotation encoding="application/x-llamapun" id="A2.E14.m1.1d">bold_Σ = [ start_ARG start_ROW start_CELL italic_σ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL italic_ρ italic_σ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_ρ italic_σ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL start_CELL italic_σ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ]</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(14)</span></td> </tr></tbody> </table> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px1.p3"> <p class="ltx_p" id="A2.SS0.SSS0.Px1.p3.1">The level sets of this distribution form ellipses. 
For a given confidence level <math alttext="\alpha" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px1.p3.1.m1.1"><semantics id="A2.SS0.SSS0.Px1.p3.1.m1.1a"><mi id="A2.SS0.SSS0.Px1.p3.1.m1.1.1" xref="A2.SS0.SSS0.Px1.p3.1.m1.1.1.cmml">α</mi><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px1.p3.1.m1.1b"><ci id="A2.SS0.SSS0.Px1.p3.1.m1.1.1.cmml" xref="A2.SS0.SSS0.Px1.p3.1.m1.1.1">𝛼</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px1.p3.1.m1.1c">\alpha</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px1.p3.1.m1.1d">italic_α</annotation></semantics></math>, the corresponding ellipse equation is:</p> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px1.p4"> <table class="ltx_equation ltx_eqn_table" id="A2.E15"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="(\mathbf{x}-\bm{\mu})^{T}\bm{\Sigma}^{-1}(\mathbf{x}-\bm{\mu})=\chi^{2}_{2}(\alpha)" class="ltx_Math" display="block" id="A2.E15.m1.3"><semantics id="A2.E15.m1.3a"><mrow id="A2.E15.m1.3.3" xref="A2.E15.m1.3.3.cmml"><mrow id="A2.E15.m1.3.3.2" xref="A2.E15.m1.3.3.2.cmml"><msup id="A2.E15.m1.2.2.1.1" xref="A2.E15.m1.2.2.1.1.cmml"><mrow id="A2.E15.m1.2.2.1.1.1.1" xref="A2.E15.m1.2.2.1.1.1.1.1.cmml"><mo id="A2.E15.m1.2.2.1.1.1.1.2" stretchy="false" xref="A2.E15.m1.2.2.1.1.1.1.1.cmml">(</mo><mrow id="A2.E15.m1.2.2.1.1.1.1.1" xref="A2.E15.m1.2.2.1.1.1.1.1.cmml"><mi id="A2.E15.m1.2.2.1.1.1.1.1.2" xref="A2.E15.m1.2.2.1.1.1.1.1.2.cmml">𝐱</mi><mo id="A2.E15.m1.2.2.1.1.1.1.1.1" xref="A2.E15.m1.2.2.1.1.1.1.1.1.cmml">−</mo><mi id="A2.E15.m1.2.2.1.1.1.1.1.3" xref="A2.E15.m1.2.2.1.1.1.1.1.3.cmml">𝝁</mi></mrow><mo id="A2.E15.m1.2.2.1.1.1.1.3" stretchy="false" xref="A2.E15.m1.2.2.1.1.1.1.1.cmml">)</mo></mrow><mi id="A2.E15.m1.2.2.1.1.3" xref="A2.E15.m1.2.2.1.1.3.cmml">T</mi></msup><mo id="A2.E15.m1.3.3.2.3" 
xref="A2.E15.m1.3.3.2.3.cmml">⁢</mo><msup id="A2.E15.m1.3.3.2.4" xref="A2.E15.m1.3.3.2.4.cmml"><mi id="A2.E15.m1.3.3.2.4.2" xref="A2.E15.m1.3.3.2.4.2.cmml">𝚺</mi><mrow id="A2.E15.m1.3.3.2.4.3" xref="A2.E15.m1.3.3.2.4.3.cmml"><mo id="A2.E15.m1.3.3.2.4.3a" xref="A2.E15.m1.3.3.2.4.3.cmml">−</mo><mn id="A2.E15.m1.3.3.2.4.3.2" xref="A2.E15.m1.3.3.2.4.3.2.cmml">1</mn></mrow></msup><mo id="A2.E15.m1.3.3.2.3a" xref="A2.E15.m1.3.3.2.3.cmml">⁢</mo><mrow id="A2.E15.m1.3.3.2.2.1" xref="A2.E15.m1.3.3.2.2.1.1.cmml"><mo id="A2.E15.m1.3.3.2.2.1.2" stretchy="false" xref="A2.E15.m1.3.3.2.2.1.1.cmml">(</mo><mrow id="A2.E15.m1.3.3.2.2.1.1" xref="A2.E15.m1.3.3.2.2.1.1.cmml"><mi id="A2.E15.m1.3.3.2.2.1.1.2" xref="A2.E15.m1.3.3.2.2.1.1.2.cmml">𝐱</mi><mo id="A2.E15.m1.3.3.2.2.1.1.1" xref="A2.E15.m1.3.3.2.2.1.1.1.cmml">−</mo><mi id="A2.E15.m1.3.3.2.2.1.1.3" xref="A2.E15.m1.3.3.2.2.1.1.3.cmml">𝝁</mi></mrow><mo id="A2.E15.m1.3.3.2.2.1.3" stretchy="false" xref="A2.E15.m1.3.3.2.2.1.1.cmml">)</mo></mrow></mrow><mo id="A2.E15.m1.3.3.3" xref="A2.E15.m1.3.3.3.cmml">=</mo><mrow id="A2.E15.m1.3.3.4" xref="A2.E15.m1.3.3.4.cmml"><msubsup id="A2.E15.m1.3.3.4.2" xref="A2.E15.m1.3.3.4.2.cmml"><mi id="A2.E15.m1.3.3.4.2.2.2" xref="A2.E15.m1.3.3.4.2.2.2.cmml">χ</mi><mn id="A2.E15.m1.3.3.4.2.3" xref="A2.E15.m1.3.3.4.2.3.cmml">2</mn><mn id="A2.E15.m1.3.3.4.2.2.3" xref="A2.E15.m1.3.3.4.2.2.3.cmml">2</mn></msubsup><mo id="A2.E15.m1.3.3.4.1" xref="A2.E15.m1.3.3.4.1.cmml">⁢</mo><mrow id="A2.E15.m1.3.3.4.3.2" xref="A2.E15.m1.3.3.4.cmml"><mo id="A2.E15.m1.3.3.4.3.2.1" stretchy="false" xref="A2.E15.m1.3.3.4.cmml">(</mo><mi id="A2.E15.m1.1.1" xref="A2.E15.m1.1.1.cmml">α</mi><mo id="A2.E15.m1.3.3.4.3.2.2" stretchy="false" xref="A2.E15.m1.3.3.4.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.E15.m1.3b"><apply id="A2.E15.m1.3.3.cmml" xref="A2.E15.m1.3.3"><eq id="A2.E15.m1.3.3.3.cmml" xref="A2.E15.m1.3.3.3"></eq><apply id="A2.E15.m1.3.3.2.cmml" xref="A2.E15.m1.3.3.2"><times 
id="A2.E15.m1.3.3.2.3.cmml" xref="A2.E15.m1.3.3.2.3"></times><apply id="A2.E15.m1.2.2.1.1.cmml" xref="A2.E15.m1.2.2.1.1"><csymbol cd="ambiguous" id="A2.E15.m1.2.2.1.1.2.cmml" xref="A2.E15.m1.2.2.1.1">superscript</csymbol><apply id="A2.E15.m1.2.2.1.1.1.1.1.cmml" xref="A2.E15.m1.2.2.1.1.1.1"><minus id="A2.E15.m1.2.2.1.1.1.1.1.1.cmml" xref="A2.E15.m1.2.2.1.1.1.1.1.1"></minus><ci id="A2.E15.m1.2.2.1.1.1.1.1.2.cmml" xref="A2.E15.m1.2.2.1.1.1.1.1.2">𝐱</ci><ci id="A2.E15.m1.2.2.1.1.1.1.1.3.cmml" xref="A2.E15.m1.2.2.1.1.1.1.1.3">𝝁</ci></apply><ci id="A2.E15.m1.2.2.1.1.3.cmml" xref="A2.E15.m1.2.2.1.1.3">𝑇</ci></apply><apply id="A2.E15.m1.3.3.2.4.cmml" xref="A2.E15.m1.3.3.2.4"><csymbol cd="ambiguous" id="A2.E15.m1.3.3.2.4.1.cmml" xref="A2.E15.m1.3.3.2.4">superscript</csymbol><ci id="A2.E15.m1.3.3.2.4.2.cmml" xref="A2.E15.m1.3.3.2.4.2">𝚺</ci><apply id="A2.E15.m1.3.3.2.4.3.cmml" xref="A2.E15.m1.3.3.2.4.3"><minus id="A2.E15.m1.3.3.2.4.3.1.cmml" xref="A2.E15.m1.3.3.2.4.3"></minus><cn id="A2.E15.m1.3.3.2.4.3.2.cmml" type="integer" xref="A2.E15.m1.3.3.2.4.3.2">1</cn></apply></apply><apply id="A2.E15.m1.3.3.2.2.1.1.cmml" xref="A2.E15.m1.3.3.2.2.1"><minus id="A2.E15.m1.3.3.2.2.1.1.1.cmml" xref="A2.E15.m1.3.3.2.2.1.1.1"></minus><ci id="A2.E15.m1.3.3.2.2.1.1.2.cmml" xref="A2.E15.m1.3.3.2.2.1.1.2">𝐱</ci><ci id="A2.E15.m1.3.3.2.2.1.1.3.cmml" xref="A2.E15.m1.3.3.2.2.1.1.3">𝝁</ci></apply></apply><apply id="A2.E15.m1.3.3.4.cmml" xref="A2.E15.m1.3.3.4"><times id="A2.E15.m1.3.3.4.1.cmml" xref="A2.E15.m1.3.3.4.1"></times><apply id="A2.E15.m1.3.3.4.2.cmml" xref="A2.E15.m1.3.3.4.2"><csymbol cd="ambiguous" id="A2.E15.m1.3.3.4.2.1.cmml" xref="A2.E15.m1.3.3.4.2">subscript</csymbol><apply id="A2.E15.m1.3.3.4.2.2.cmml" xref="A2.E15.m1.3.3.4.2"><csymbol cd="ambiguous" id="A2.E15.m1.3.3.4.2.2.1.cmml" xref="A2.E15.m1.3.3.4.2">superscript</csymbol><ci id="A2.E15.m1.3.3.4.2.2.2.cmml" xref="A2.E15.m1.3.3.4.2.2.2">𝜒</ci><cn id="A2.E15.m1.3.3.4.2.2.3.cmml" type="integer" 
xref="A2.E15.m1.3.3.4.2.2.3">2</cn></apply><cn id="A2.E15.m1.3.3.4.2.3.cmml" type="integer" xref="A2.E15.m1.3.3.4.2.3">2</cn></apply><ci id="A2.E15.m1.1.1.cmml" xref="A2.E15.m1.1.1">𝛼</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.E15.m1.3c">(\mathbf{x}-\bm{\mu})^{T}\bm{\Sigma}^{-1}(\mathbf{x}-\bm{\mu})=\chi^{2}_{2}(\alpha)</annotation><annotation encoding="application/x-llamapun" id="A2.E15.m1.3d">( bold_x - bold_italic_μ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_x - bold_italic_μ ) = italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_α )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(15)</span></td> </tr></tbody> </table> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px1.p5"> <p class="ltx_p" id="A2.SS0.SSS0.Px1.p5.1">where <math alttext="\chi^{2}_{2}(\alpha)" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px1.p5.1.m1.1"><semantics id="A2.SS0.SSS0.Px1.p5.1.m1.1a"><mrow id="A2.SS0.SSS0.Px1.p5.1.m1.1.2" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.cmml"><msubsup id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.cmml"><mi id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.2" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.2.cmml">χ</mi><mn id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.3" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.3.cmml">2</mn><mn id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.3" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.3.cmml">2</mn></msubsup><mo id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.1" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.1.cmml">⁢</mo><mrow id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.3.2" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.cmml"><mo id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.3.2.1" stretchy="false" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.cmml">(</mo><mi id="A2.SS0.SSS0.Px1.p5.1.m1.1.1" 
xref="A2.SS0.SSS0.Px1.p5.1.m1.1.1.cmml">α</mi><mo id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.3.2.2" stretchy="false" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px1.p5.1.m1.1b"><apply id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2"><times id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.1.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.1"></times><apply id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2"><csymbol cd="ambiguous" id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.1.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2">subscript</csymbol><apply id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2"><csymbol cd="ambiguous" id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.1.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2">superscript</csymbol><ci id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.2.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.2">𝜒</ci><cn id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.3.cmml" type="integer" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.2.3">2</cn></apply><cn id="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.3.cmml" type="integer" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.2.2.3">2</cn></apply><ci id="A2.SS0.SSS0.Px1.p5.1.m1.1.1.cmml" xref="A2.SS0.SSS0.Px1.p5.1.m1.1.1">𝛼</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px1.p5.1.m1.1c">\chi^{2}_{2}(\alpha)</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px1.p5.1.m1.1d">italic_χ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_α )</annotation></semantics></math> is the quantile function of the chi-square distribution with 2 degrees of freedom.</p> </div> </section> <section class="ltx_paragraph" id="A2.SS0.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph">From Ellipse to Gaussian.</h4> <div class="ltx_para" id="A2.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="A2.SS0.SSS0.Px2.p1.4">Conversely, given an ellipse defined by its center <math alttext="(h,k)" class="ltx_Math" 
display="inline" id="A2.SS0.SSS0.Px2.p1.1.m1.2"><semantics id="A2.SS0.SSS0.Px2.p1.1.m1.2a"><mrow id="A2.SS0.SSS0.Px2.p1.1.m1.2.3.2" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.3.1.cmml"><mo id="A2.SS0.SSS0.Px2.p1.1.m1.2.3.2.1" stretchy="false" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.3.1.cmml">(</mo><mi id="A2.SS0.SSS0.Px2.p1.1.m1.1.1" xref="A2.SS0.SSS0.Px2.p1.1.m1.1.1.cmml">h</mi><mo id="A2.SS0.SSS0.Px2.p1.1.m1.2.3.2.2" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.3.1.cmml">,</mo><mi id="A2.SS0.SSS0.Px2.p1.1.m1.2.2" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.2.cmml">k</mi><mo id="A2.SS0.SSS0.Px2.p1.1.m1.2.3.2.3" stretchy="false" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.3.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px2.p1.1.m1.2b"><interval closure="open" id="A2.SS0.SSS0.Px2.p1.1.m1.2.3.1.cmml" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.3.2"><ci id="A2.SS0.SSS0.Px2.p1.1.m1.1.1.cmml" xref="A2.SS0.SSS0.Px2.p1.1.m1.1.1">ℎ</ci><ci id="A2.SS0.SSS0.Px2.p1.1.m1.2.2.cmml" xref="A2.SS0.SSS0.Px2.p1.1.m1.2.2">𝑘</ci></interval></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px2.p1.1.m1.2c">(h,k)</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px2.p1.1.m1.2d">( italic_h , italic_k )</annotation></semantics></math>, semi-major axis <math alttext="a" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px2.p1.2.m2.1"><semantics id="A2.SS0.SSS0.Px2.p1.2.m2.1a"><mi id="A2.SS0.SSS0.Px2.p1.2.m2.1.1" xref="A2.SS0.SSS0.Px2.p1.2.m2.1.1.cmml">a</mi><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px2.p1.2.m2.1b"><ci id="A2.SS0.SSS0.Px2.p1.2.m2.1.1.cmml" xref="A2.SS0.SSS0.Px2.p1.2.m2.1.1">𝑎</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px2.p1.2.m2.1c">a</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px2.p1.2.m2.1d">italic_a</annotation></semantics></math>, semi-minor axis <math alttext="b" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px2.p1.3.m3.1"><semantics id="A2.SS0.SSS0.Px2.p1.3.m3.1a"><mi 
id="A2.SS0.SSS0.Px2.p1.3.m3.1.1" xref="A2.SS0.SSS0.Px2.p1.3.m3.1.1.cmml">b</mi><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px2.p1.3.m3.1b"><ci id="A2.SS0.SSS0.Px2.p1.3.m3.1.1.cmml" xref="A2.SS0.SSS0.Px2.p1.3.m3.1.1">𝑏</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px2.p1.3.m3.1c">b</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px2.p1.3.m3.1d">italic_b</annotation></semantics></math>, and rotation angle <math alttext="\theta" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px2.p1.4.m4.1"><semantics id="A2.SS0.SSS0.Px2.p1.4.m4.1a"><mi id="A2.SS0.SSS0.Px2.p1.4.m4.1.1" xref="A2.SS0.SSS0.Px2.p1.4.m4.1.1.cmml">θ</mi><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px2.p1.4.m4.1b"><ci id="A2.SS0.SSS0.Px2.p1.4.m4.1.1.cmml" xref="A2.SS0.SSS0.Px2.p1.4.m4.1.1">𝜃</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px2.p1.4.m4.1c">\theta</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px2.p1.4.m4.1d">italic_θ</annotation></semantics></math>, we can construct the corresponding Gaussian distribution:</p> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px2.p2"> <table class="ltx_equation ltx_eqn_table" id="A2.E16"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\bm{\mu}=\begin{pmatrix}h\\ k\end{pmatrix}" class="ltx_Math" display="block" id="A2.E16.m1.1"><semantics id="A2.E16.m1.1a"><mrow id="A2.E16.m1.1.2" xref="A2.E16.m1.1.2.cmml"><mi id="A2.E16.m1.1.2.2" xref="A2.E16.m1.1.2.2.cmml">𝝁</mi><mo id="A2.E16.m1.1.2.1" xref="A2.E16.m1.1.2.1.cmml">=</mo><mrow id="A2.E16.m1.1.1.3" xref="A2.E16.m1.1.1.2.cmml"><mo id="A2.E16.m1.1.1.3.1" xref="A2.E16.m1.1.1.2.1.cmml">(</mo><mtable displaystyle="true" id="A2.E16.m1.1.1.1.1" rowspacing="0pt" xref="A2.E16.m1.1.1.1.1.cmml"><mtr id="A2.E16.m1.1.1.1.1a" xref="A2.E16.m1.1.1.1.1.cmml"><mtd 
id="A2.E16.m1.1.1.1.1b" xref="A2.E16.m1.1.1.1.1.cmml"><mi id="A2.E16.m1.1.1.1.1.1.1.1" xref="A2.E16.m1.1.1.1.1.1.1.1.cmml">h</mi></mtd></mtr><mtr id="A2.E16.m1.1.1.1.1c" xref="A2.E16.m1.1.1.1.1.cmml"><mtd id="A2.E16.m1.1.1.1.1d" xref="A2.E16.m1.1.1.1.1.cmml"><mi id="A2.E16.m1.1.1.1.1.2.1.1" xref="A2.E16.m1.1.1.1.1.2.1.1.cmml">k</mi></mtd></mtr></mtable><mo id="A2.E16.m1.1.1.3.2" xref="A2.E16.m1.1.1.2.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.E16.m1.1b"><apply id="A2.E16.m1.1.2.cmml" xref="A2.E16.m1.1.2"><eq id="A2.E16.m1.1.2.1.cmml" xref="A2.E16.m1.1.2.1"></eq><ci id="A2.E16.m1.1.2.2.cmml" xref="A2.E16.m1.1.2.2">𝝁</ci><apply id="A2.E16.m1.1.1.2.cmml" xref="A2.E16.m1.1.1.3"><csymbol cd="latexml" id="A2.E16.m1.1.1.2.1.cmml" xref="A2.E16.m1.1.1.3.1">matrix</csymbol><matrix id="A2.E16.m1.1.1.1.1.cmml" xref="A2.E16.m1.1.1.1.1"><matrixrow id="A2.E16.m1.1.1.1.1a.cmml" xref="A2.E16.m1.1.1.1.1"><ci id="A2.E16.m1.1.1.1.1.1.1.1.cmml" xref="A2.E16.m1.1.1.1.1.1.1.1">ℎ</ci></matrixrow><matrixrow id="A2.E16.m1.1.1.1.1b.cmml" xref="A2.E16.m1.1.1.1.1"><ci id="A2.E16.m1.1.1.1.1.2.1.1.cmml" xref="A2.E16.m1.1.1.1.1.2.1.1">𝑘</ci></matrixrow></matrix></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.E16.m1.1c">\bm{\mu}=\begin{pmatrix}h\\ k\end{pmatrix}</annotation><annotation encoding="application/x-llamapun" id="A2.E16.m1.1d">bold_italic_μ = ( start_ARG start_ROW start_CELL italic_h end_CELL end_ROW start_ROW start_CELL italic_k end_CELL end_ROW end_ARG )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(16)</span></td> </tr></tbody> </table> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px2.p3"> <table class="ltx_equation ltx_eqn_table" id="A2.E17"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell 
ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\bm{\Sigma}=\mathbf{R}(\theta)\begin{bmatrix}a^{2}&amp;0\\ 0&amp;b^{2}\end{bmatrix}\mathbf{R}(\theta)^{T}" class="ltx_Math" display="block" id="A2.E17.m1.3"><semantics id="A2.E17.m1.3a"><mrow id="A2.E17.m1.3.4" xref="A2.E17.m1.3.4.cmml"><mi id="A2.E17.m1.3.4.2" xref="A2.E17.m1.3.4.2.cmml">𝚺</mi><mo id="A2.E17.m1.3.4.1" xref="A2.E17.m1.3.4.1.cmml">=</mo><mrow id="A2.E17.m1.3.4.3" xref="A2.E17.m1.3.4.3.cmml"><mi id="A2.E17.m1.3.4.3.2" xref="A2.E17.m1.3.4.3.2.cmml">𝐑</mi><mo id="A2.E17.m1.3.4.3.1" xref="A2.E17.m1.3.4.3.1.cmml">⁢</mo><mrow id="A2.E17.m1.3.4.3.3.2" xref="A2.E17.m1.3.4.3.cmml"><mo id="A2.E17.m1.3.4.3.3.2.1" stretchy="false" xref="A2.E17.m1.3.4.3.cmml">(</mo><mi id="A2.E17.m1.2.2" xref="A2.E17.m1.2.2.cmml">θ</mi><mo id="A2.E17.m1.3.4.3.3.2.2" stretchy="false" xref="A2.E17.m1.3.4.3.cmml">)</mo></mrow><mo id="A2.E17.m1.3.4.3.1a" xref="A2.E17.m1.3.4.3.1.cmml">⁢</mo><mrow id="A2.E17.m1.1.1.3" xref="A2.E17.m1.1.1.2.cmml"><mo id="A2.E17.m1.1.1.3.1" xref="A2.E17.m1.1.1.2.1.cmml">[</mo><mtable columnspacing="5pt" displaystyle="true" id="A2.E17.m1.1.1.1.1" rowspacing="0pt" xref="A2.E17.m1.1.1.1.1.cmml"><mtr id="A2.E17.m1.1.1.1.1a" xref="A2.E17.m1.1.1.1.1.cmml"><mtd id="A2.E17.m1.1.1.1.1b" xref="A2.E17.m1.1.1.1.1.cmml"><msup id="A2.E17.m1.1.1.1.1.1.1.1" xref="A2.E17.m1.1.1.1.1.1.1.1.cmml"><mi id="A2.E17.m1.1.1.1.1.1.1.1.2" xref="A2.E17.m1.1.1.1.1.1.1.1.2.cmml">a</mi><mn id="A2.E17.m1.1.1.1.1.1.1.1.3" xref="A2.E17.m1.1.1.1.1.1.1.1.3.cmml">2</mn></msup></mtd><mtd id="A2.E17.m1.1.1.1.1c" xref="A2.E17.m1.1.1.1.1.cmml"><mn id="A2.E17.m1.1.1.1.1.1.2.1" xref="A2.E17.m1.1.1.1.1.1.2.1.cmml">0</mn></mtd></mtr><mtr id="A2.E17.m1.1.1.1.1d" xref="A2.E17.m1.1.1.1.1.cmml"><mtd id="A2.E17.m1.1.1.1.1e" xref="A2.E17.m1.1.1.1.1.cmml"><mn id="A2.E17.m1.1.1.1.1.2.1.1" xref="A2.E17.m1.1.1.1.1.2.1.1.cmml">0</mn></mtd><mtd id="A2.E17.m1.1.1.1.1f" xref="A2.E17.m1.1.1.1.1.cmml"><msup 
id="A2.E17.m1.1.1.1.1.2.2.1" xref="A2.E17.m1.1.1.1.1.2.2.1.cmml"><mi id="A2.E17.m1.1.1.1.1.2.2.1.2" xref="A2.E17.m1.1.1.1.1.2.2.1.2.cmml">b</mi><mn id="A2.E17.m1.1.1.1.1.2.2.1.3" xref="A2.E17.m1.1.1.1.1.2.2.1.3.cmml">2</mn></msup></mtd></mtr></mtable><mo id="A2.E17.m1.1.1.3.2" xref="A2.E17.m1.1.1.2.1.cmml">]</mo></mrow><mo id="A2.E17.m1.3.4.3.1b" xref="A2.E17.m1.3.4.3.1.cmml">⁢</mo><mi id="A2.E17.m1.3.4.3.4" xref="A2.E17.m1.3.4.3.4.cmml">𝐑</mi><mo id="A2.E17.m1.3.4.3.1c" xref="A2.E17.m1.3.4.3.1.cmml">⁢</mo><msup id="A2.E17.m1.3.4.3.5" xref="A2.E17.m1.3.4.3.5.cmml"><mrow id="A2.E17.m1.3.4.3.5.2.2" xref="A2.E17.m1.3.4.3.5.cmml"><mo id="A2.E17.m1.3.4.3.5.2.2.1" stretchy="false" xref="A2.E17.m1.3.4.3.5.cmml">(</mo><mi id="A2.E17.m1.3.3" xref="A2.E17.m1.3.3.cmml">θ</mi><mo id="A2.E17.m1.3.4.3.5.2.2.2" stretchy="false" xref="A2.E17.m1.3.4.3.5.cmml">)</mo></mrow><mi id="A2.E17.m1.3.4.3.5.3" xref="A2.E17.m1.3.4.3.5.3.cmml">T</mi></msup></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.E17.m1.3b"><apply id="A2.E17.m1.3.4.cmml" xref="A2.E17.m1.3.4"><eq id="A2.E17.m1.3.4.1.cmml" xref="A2.E17.m1.3.4.1"></eq><ci id="A2.E17.m1.3.4.2.cmml" xref="A2.E17.m1.3.4.2">𝚺</ci><apply id="A2.E17.m1.3.4.3.cmml" xref="A2.E17.m1.3.4.3"><times id="A2.E17.m1.3.4.3.1.cmml" xref="A2.E17.m1.3.4.3.1"></times><ci id="A2.E17.m1.3.4.3.2.cmml" xref="A2.E17.m1.3.4.3.2">𝐑</ci><ci id="A2.E17.m1.2.2.cmml" xref="A2.E17.m1.2.2">𝜃</ci><apply id="A2.E17.m1.1.1.2.cmml" xref="A2.E17.m1.1.1.3"><csymbol cd="latexml" id="A2.E17.m1.1.1.2.1.cmml" xref="A2.E17.m1.1.1.3.1">matrix</csymbol><matrix id="A2.E17.m1.1.1.1.1.cmml" xref="A2.E17.m1.1.1.1.1"><matrixrow id="A2.E17.m1.1.1.1.1a.cmml" xref="A2.E17.m1.1.1.1.1"><apply id="A2.E17.m1.1.1.1.1.1.1.1.cmml" xref="A2.E17.m1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="A2.E17.m1.1.1.1.1.1.1.1.1.cmml" xref="A2.E17.m1.1.1.1.1.1.1.1">superscript</csymbol><ci id="A2.E17.m1.1.1.1.1.1.1.1.2.cmml" xref="A2.E17.m1.1.1.1.1.1.1.1.2">𝑎</ci><cn 
id="A2.E17.m1.1.1.1.1.1.1.1.3.cmml" type="integer" xref="A2.E17.m1.1.1.1.1.1.1.1.3">2</cn></apply><cn id="A2.E17.m1.1.1.1.1.1.2.1.cmml" type="integer" xref="A2.E17.m1.1.1.1.1.1.2.1">0</cn></matrixrow><matrixrow id="A2.E17.m1.1.1.1.1b.cmml" xref="A2.E17.m1.1.1.1.1"><cn id="A2.E17.m1.1.1.1.1.2.1.1.cmml" type="integer" xref="A2.E17.m1.1.1.1.1.2.1.1">0</cn><apply id="A2.E17.m1.1.1.1.1.2.2.1.cmml" xref="A2.E17.m1.1.1.1.1.2.2.1"><csymbol cd="ambiguous" id="A2.E17.m1.1.1.1.1.2.2.1.1.cmml" xref="A2.E17.m1.1.1.1.1.2.2.1">superscript</csymbol><ci id="A2.E17.m1.1.1.1.1.2.2.1.2.cmml" xref="A2.E17.m1.1.1.1.1.2.2.1.2">𝑏</ci><cn id="A2.E17.m1.1.1.1.1.2.2.1.3.cmml" type="integer" xref="A2.E17.m1.1.1.1.1.2.2.1.3">2</cn></apply></matrixrow></matrix></apply><ci id="A2.E17.m1.3.4.3.4.cmml" xref="A2.E17.m1.3.4.3.4">𝐑</ci><apply id="A2.E17.m1.3.4.3.5.cmml" xref="A2.E17.m1.3.4.3.5"><csymbol cd="ambiguous" id="A2.E17.m1.3.4.3.5.1.cmml" xref="A2.E17.m1.3.4.3.5">superscript</csymbol><ci id="A2.E17.m1.3.3.cmml" xref="A2.E17.m1.3.3">𝜃</ci><ci id="A2.E17.m1.3.4.3.5.3.cmml" xref="A2.E17.m1.3.4.3.5.3">𝑇</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.E17.m1.3c">\bm{\Sigma}=\mathbf{R}(\theta)\begin{bmatrix}a^{2}&amp;0\\ 0&amp;b^{2}\end{bmatrix}\mathbf{R}(\theta)^{T}</annotation><annotation encoding="application/x-llamapun" id="A2.E17.m1.3d">bold_Σ = bold_R ( italic_θ ) [ start_ARG start_ROW start_CELL italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_b start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] bold_R ( italic_θ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(17)</span></td> </tr></tbody> </table> 
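<p class="ltx_p">The construction in Eqs. (16) and (17) can be sketched numerically as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names <span class="ltx_text ltx_font_typewriter">rotation_matrix</span> and <span class="ltx_text ltx_font_typewriter">ellipse_to_gaussian</span> are hypothetical, the angle is assumed to be given in radians, and R(θ) is taken to be the standard counter-clockwise 2D rotation matrix.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import numpy as np


def rotation_matrix(theta):
    """Counter-clockwise 2D rotation by theta (radians); assumed form of R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def ellipse_to_gaussian(h, k, a, b, theta):
    """Hypothetical helper: map ellipse parameters to (mu, sigma) per Eqs. (16)-(17)."""
    mu = np.array([h, k], dtype=float)
    rot = rotation_matrix(theta)
    # Sigma = R diag(a^2, b^2) R^T: squared semi-axes are the eigenvalues.
    sigma = rot @ np.diag([a ** 2, b ** 2]) @ rot.T
    return mu, sigma


# Illustrative values only (not taken from the paper).
theta = np.pi / 6
mu, sigma = ellipse_to_gaussian(h=3.0, k=-1.0, a=2.0, b=1.0, theta=theta)

# Endpoint of the semi-major axis and its squared Mahalanobis distance to mu.
p = mu + rotation_matrix(theta) @ np.array([2.0, 0.0])
d2 = (p - mu) @ np.linalg.inv(sigma) @ (p - mu)
```
</pre>
<p class="ltx_p">With this covariance, the endpoints of the semi-axes sit at squared Mahalanobis distance 1 from the mean, which is what <span class="ltx_text ltx_font_typewriter">d2</span> computes. If the ellipse parameters come from OpenCV's <span class="ltx_text ltx_font_typewriter">cv2.fitEllipse</span>, note that it reports full axis lengths and an angle in degrees, so they would presumably need to be halved and converted to radians first.</p>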
</div> <div class="ltx_para" id="A2.SS0.SSS0.Px2.p4"> <p class="ltx_p" id="A2.SS0.SSS0.Px2.p4.1">where <math alttext="\mathbf{R}(\theta)" class="ltx_Math" display="inline" id="A2.SS0.SSS0.Px2.p4.1.m1.1"><semantics id="A2.SS0.SSS0.Px2.p4.1.m1.1a"><mrow id="A2.SS0.SSS0.Px2.p4.1.m1.1.2" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.cmml"><mi id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.2" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.2.cmml">𝐑</mi><mo id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.1" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.1.cmml">⁢</mo><mrow id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.3.2" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.cmml"><mo id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.3.2.1" stretchy="false" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.cmml">(</mo><mi id="A2.SS0.SSS0.Px2.p4.1.m1.1.1" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.1.cmml">θ</mi><mo id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.3.2.2" stretchy="false" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.SS0.SSS0.Px2.p4.1.m1.1b"><apply id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.cmml" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2"><times id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.1.cmml" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.1"></times><ci id="A2.SS0.SSS0.Px2.p4.1.m1.1.2.2.cmml" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.2.2">𝐑</ci><ci id="A2.SS0.SSS0.Px2.p4.1.m1.1.1.cmml" xref="A2.SS0.SSS0.Px2.p4.1.m1.1.1">𝜃</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.SS0.SSS0.Px2.p4.1.m1.1c">\mathbf{R}(\theta)</annotation><annotation encoding="application/x-llamapun" id="A2.SS0.SSS0.Px2.p4.1.m1.1d">bold_R ( italic_θ )</annotation></semantics></math> is the 2D rotation matrix:</p> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px2.p5"> <table class="ltx_equation ltx_eqn_table" id="A2.E18"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathbf{R}(\theta)=\begin{bmatrix}\cos\theta&amp;-\sin\theta\\ \sin\theta&amp;\cos\theta\end{bmatrix}" 
class="ltx_Math" display="block" id="A2.E18.m1.2"><semantics id="A2.E18.m1.2a"><mrow id="A2.E18.m1.2.3" xref="A2.E18.m1.2.3.cmml"><mrow id="A2.E18.m1.2.3.2" xref="A2.E18.m1.2.3.2.cmml"><mi id="A2.E18.m1.2.3.2.2" xref="A2.E18.m1.2.3.2.2.cmml">𝐑</mi><mo id="A2.E18.m1.2.3.2.1" xref="A2.E18.m1.2.3.2.1.cmml">⁢</mo><mrow id="A2.E18.m1.2.3.2.3.2" xref="A2.E18.m1.2.3.2.cmml"><mo id="A2.E18.m1.2.3.2.3.2.1" stretchy="false" xref="A2.E18.m1.2.3.2.cmml">(</mo><mi id="A2.E18.m1.2.2" xref="A2.E18.m1.2.2.cmml">θ</mi><mo id="A2.E18.m1.2.3.2.3.2.2" stretchy="false" xref="A2.E18.m1.2.3.2.cmml">)</mo></mrow></mrow><mo id="A2.E18.m1.2.3.1" xref="A2.E18.m1.2.3.1.cmml">=</mo><mrow id="A2.E18.m1.1.1.3" xref="A2.E18.m1.1.1.2.cmml"><mo id="A2.E18.m1.1.1.3.1" xref="A2.E18.m1.1.1.2.1.cmml">[</mo><mtable columnspacing="5pt" displaystyle="true" id="A2.E18.m1.1.1.1.1" rowspacing="0pt" xref="A2.E18.m1.1.1.1.1.cmml"><mtr id="A2.E18.m1.1.1.1.1a" xref="A2.E18.m1.1.1.1.1.cmml"><mtd id="A2.E18.m1.1.1.1.1b" xref="A2.E18.m1.1.1.1.1.cmml"><mrow id="A2.E18.m1.1.1.1.1.1.1.1" xref="A2.E18.m1.1.1.1.1.1.1.1.cmml"><mi id="A2.E18.m1.1.1.1.1.1.1.1.1" xref="A2.E18.m1.1.1.1.1.1.1.1.1.cmml">cos</mi><mo id="A2.E18.m1.1.1.1.1.1.1.1a" lspace="0.167em" xref="A2.E18.m1.1.1.1.1.1.1.1.cmml">⁡</mo><mi id="A2.E18.m1.1.1.1.1.1.1.1.2" xref="A2.E18.m1.1.1.1.1.1.1.1.2.cmml">θ</mi></mrow></mtd><mtd id="A2.E18.m1.1.1.1.1c" xref="A2.E18.m1.1.1.1.1.cmml"><mrow id="A2.E18.m1.1.1.1.1.1.2.1" xref="A2.E18.m1.1.1.1.1.1.2.1.cmml"><mo id="A2.E18.m1.1.1.1.1.1.2.1a" rspace="0.167em" xref="A2.E18.m1.1.1.1.1.1.2.1.cmml">−</mo><mrow id="A2.E18.m1.1.1.1.1.1.2.1.2" xref="A2.E18.m1.1.1.1.1.1.2.1.2.cmml"><mi id="A2.E18.m1.1.1.1.1.1.2.1.2.1" xref="A2.E18.m1.1.1.1.1.1.2.1.2.1.cmml">sin</mi><mo id="A2.E18.m1.1.1.1.1.1.2.1.2a" lspace="0.167em" xref="A2.E18.m1.1.1.1.1.1.2.1.2.cmml">⁡</mo><mi id="A2.E18.m1.1.1.1.1.1.2.1.2.2" xref="A2.E18.m1.1.1.1.1.1.2.1.2.2.cmml">θ</mi></mrow></mrow></mtd></mtr><mtr id="A2.E18.m1.1.1.1.1d" 
xref="A2.E18.m1.1.1.1.1.cmml"><mtd id="A2.E18.m1.1.1.1.1e" xref="A2.E18.m1.1.1.1.1.cmml"><mrow id="A2.E18.m1.1.1.1.1.2.1.1" xref="A2.E18.m1.1.1.1.1.2.1.1.cmml"><mi id="A2.E18.m1.1.1.1.1.2.1.1.1" xref="A2.E18.m1.1.1.1.1.2.1.1.1.cmml">sin</mi><mo id="A2.E18.m1.1.1.1.1.2.1.1a" lspace="0.167em" xref="A2.E18.m1.1.1.1.1.2.1.1.cmml">⁡</mo><mi id="A2.E18.m1.1.1.1.1.2.1.1.2" xref="A2.E18.m1.1.1.1.1.2.1.1.2.cmml">θ</mi></mrow></mtd><mtd id="A2.E18.m1.1.1.1.1f" xref="A2.E18.m1.1.1.1.1.cmml"><mrow id="A2.E18.m1.1.1.1.1.2.2.1" xref="A2.E18.m1.1.1.1.1.2.2.1.cmml"><mi id="A2.E18.m1.1.1.1.1.2.2.1.1" xref="A2.E18.m1.1.1.1.1.2.2.1.1.cmml">cos</mi><mo id="A2.E18.m1.1.1.1.1.2.2.1a" lspace="0.167em" xref="A2.E18.m1.1.1.1.1.2.2.1.cmml">⁡</mo><mi id="A2.E18.m1.1.1.1.1.2.2.1.2" xref="A2.E18.m1.1.1.1.1.2.2.1.2.cmml">θ</mi></mrow></mtd></mtr></mtable><mo id="A2.E18.m1.1.1.3.2" xref="A2.E18.m1.1.1.2.1.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.E18.m1.2b"><apply id="A2.E18.m1.2.3.cmml" xref="A2.E18.m1.2.3"><eq id="A2.E18.m1.2.3.1.cmml" xref="A2.E18.m1.2.3.1"></eq><apply id="A2.E18.m1.2.3.2.cmml" xref="A2.E18.m1.2.3.2"><times id="A2.E18.m1.2.3.2.1.cmml" xref="A2.E18.m1.2.3.2.1"></times><ci id="A2.E18.m1.2.3.2.2.cmml" xref="A2.E18.m1.2.3.2.2">𝐑</ci><ci id="A2.E18.m1.2.2.cmml" xref="A2.E18.m1.2.2">𝜃</ci></apply><apply id="A2.E18.m1.1.1.2.cmml" xref="A2.E18.m1.1.1.3"><csymbol cd="latexml" id="A2.E18.m1.1.1.2.1.cmml" xref="A2.E18.m1.1.1.3.1">matrix</csymbol><matrix id="A2.E18.m1.1.1.1.1.cmml" xref="A2.E18.m1.1.1.1.1"><matrixrow id="A2.E18.m1.1.1.1.1a.cmml" xref="A2.E18.m1.1.1.1.1"><apply id="A2.E18.m1.1.1.1.1.1.1.1.cmml" xref="A2.E18.m1.1.1.1.1.1.1.1"><cos id="A2.E18.m1.1.1.1.1.1.1.1.1.cmml" xref="A2.E18.m1.1.1.1.1.1.1.1.1"></cos><ci id="A2.E18.m1.1.1.1.1.1.1.1.2.cmml" xref="A2.E18.m1.1.1.1.1.1.1.1.2">𝜃</ci></apply><apply id="A2.E18.m1.1.1.1.1.1.2.1.cmml" xref="A2.E18.m1.1.1.1.1.1.2.1"><minus id="A2.E18.m1.1.1.1.1.1.2.1.1.cmml" 
xref="A2.E18.m1.1.1.1.1.1.2.1"></minus><apply id="A2.E18.m1.1.1.1.1.1.2.1.2.cmml" xref="A2.E18.m1.1.1.1.1.1.2.1.2"><sin id="A2.E18.m1.1.1.1.1.1.2.1.2.1.cmml" xref="A2.E18.m1.1.1.1.1.1.2.1.2.1"></sin><ci id="A2.E18.m1.1.1.1.1.1.2.1.2.2.cmml" xref="A2.E18.m1.1.1.1.1.1.2.1.2.2">𝜃</ci></apply></apply></matrixrow><matrixrow id="A2.E18.m1.1.1.1.1b.cmml" xref="A2.E18.m1.1.1.1.1"><apply id="A2.E18.m1.1.1.1.1.2.1.1.cmml" xref="A2.E18.m1.1.1.1.1.2.1.1"><sin id="A2.E18.m1.1.1.1.1.2.1.1.1.cmml" xref="A2.E18.m1.1.1.1.1.2.1.1.1"></sin><ci id="A2.E18.m1.1.1.1.1.2.1.1.2.cmml" xref="A2.E18.m1.1.1.1.1.2.1.1.2">𝜃</ci></apply><apply id="A2.E18.m1.1.1.1.1.2.2.1.cmml" xref="A2.E18.m1.1.1.1.1.2.2.1"><cos id="A2.E18.m1.1.1.1.1.2.2.1.1.cmml" xref="A2.E18.m1.1.1.1.1.2.2.1.1"></cos><ci id="A2.E18.m1.1.1.1.1.2.2.1.2.cmml" xref="A2.E18.m1.1.1.1.1.2.2.1.2">𝜃</ci></apply></matrixrow></matrix></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.E18.m1.2c">\mathbf{R}(\theta)=\begin{bmatrix}\cos\theta&amp;-\sin\theta\\ \sin\theta&amp;\cos\theta\end{bmatrix}</annotation><annotation encoding="application/x-llamapun" id="A2.E18.m1.2d">bold_R ( italic_θ ) = [ start_ARG start_ROW start_CELL roman_cos italic_θ end_CELL start_CELL - roman_sin italic_θ end_CELL end_ROW start_ROW start_CELL roman_sin italic_θ end_CELL start_CELL roman_cos italic_θ end_CELL end_ROW end_ARG ]</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(18)</span></td> </tr></tbody> </table> </div> <div class="ltx_para" id="A2.SS0.SSS0.Px2.p6"> <p class="ltx_p" id="A2.SS0.SSS0.Px2.p6.1">This mathematical relationship enables us to seamlessly transition between probabilistic blob representations and geometric ellipse controls in our framework.</p> </div> </section> </section> <section class="ltx_appendix" id="A3"> <h2 
class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix C </span>BlobData Curation</h2> <div class="ltx_para" id="A3.p1"> <p class="ltx_p" id="A3.p1.1">BlobData is a large-scale dataset containing 1.86M samples sourced from BrushData <cite class="ltx_cite ltx_citemacro_citep">(Ju et al., <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#bib.bib19" title="">2024</a>)</cite>. Each sample comprises an image, a segmentation mask, fitted ellipse parameters with the derived 2D Gaussian, and a descriptive text. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.13434v1#A3.F8" title="Figure 8 ‣ Appendix C BlobData Curation ‣ BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing"><span class="ltx_text ltx_ref_tag">8</span></a>, the BlobData curation process consists of three stages:</p> <ul class="ltx_itemize" id="A3.I1"> <li class="ltx_item" id="A3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A3.I1.i1.p1"> <p class="ltx_p" id="A3.I1.i1.p1.1"><span class="ltx_text ltx_font_bold" id="A3.I1.i1.p1.1.1">Image Filtering.</span> We filter the source data in four steps: 1) retain images whose shorter side exceeds 480 pixels; 2) keep only images with valid instance segmentation masks; 3) keep only masks whose area is between 0.01 and 0.9 of the total image area; 4) exclude masks that touch the image boundary.</p> </div> </li> <li class="ltx_item" id="A3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A3.I1.i2.p1"> <p class="ltx_p" id="A3.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="A3.I1.i2.p1.1.1">Parameter Extraction.</span> 1) Fit ellipse parameters using OpenCV’s ellipse fitting algorithm; 2) derive the corresponding 2D Gaussian distributions; 3) remove invalid samples with covariance values below 1e-5.</p> </div> </li> <li class="ltx_item" id="A3.I1.i3" 
style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A3.I1.i3.p1"> <p class="ltx_p" id="A3.I1.i3.p1.1"><span class="ltx_text ltx_font_bold" id="A3.I1.i3.p1.1.1">Annotation.</span> We generate detailed image descriptions using InternVL-2.5, providing rich textual context for each sample in the dataset.</p> </div> </li> </ul> </div> <figure class="ltx_figure" id="A3.F8"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="158" id="A3.F8.g1" src="extracted/6284831/figures/data_pipeline.png" width="515"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 8: </span>The BlobData curation workflow.</figcaption> </figure> <div class="ltx_pagination ltx_role_newpage"></div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Mon Mar 17 17:52:00 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" 
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>
