<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars</title> <!--Generated on Thu Mar 13 21:10:50 2025 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2503.11978v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S1" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S2" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Related Work</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3" title="In Snapmoji: Instant Generation of Animatable 
Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Method</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS1" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Datasets</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS2" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>2D Dual-Stylized Avatar Generation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS3" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3 </span>3D Animatable Stylized Avatar Generation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS4" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.4 </span>Training and Losses</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Experiments</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.SS1" title="In 4 Experiments ‣ Snapmoji: Instant 
Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>2D Stylized Avatar Generation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.SS2" title="In 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>3D Dual-Stylized Avatar Generation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.SS3" title="In 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.3 </span>3D Stylized Avatar Animation</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S5" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Conclusion</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text" style="font-size:144%;">A</span> </span><span class="ltx_text" style="font-size:144%;">Implementation Details</span></span></a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS1" title="In Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.1 </span>Bitmoji Training Data</span></a> <ol class="ltx_toclist 
ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS1.SSS0.Px1" title="In A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title">GDA Training Data.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS1.SSS0.Px2" title="In A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title">Multi-view Training Data.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS2" title="In Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.2 </span>Facial Action Coding System</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS3" title="In Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A.3 </span>User Interfaces</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text" style="font-size:144%;">B</span> </span><span class="ltx_text" style="font-size:144%;">Additional Results</span></span></a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" 
href="https://arxiv.org/html/2503.11978v1#A2.SS1" title="In Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B.1 </span>Results Gallery</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.SS2" title="In Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B.2 </span>More Applications</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.SS3" title="In Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B.3 </span>Ablation Studies</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A3" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref"><span class="ltx_text" style="font-size:144%;">C</span> </span><span class="ltx_text" style="font-size:144%;">Ethical Discussion</span></span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line ltx_pruned_first"> <h1 class="ltx_title ltx_title_document"> <span class="ltx_text" id="id16.id1">Snapmoji</span>: Instant Generation of Animatable Dual-Stylized Avatars</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Eric Ming Chen<sup class="ltx_sup" id="id17.16.id1"><span class="ltx_text ltx_font_italic" id="id17.16.id1.1">1,2∗</span></sup>  Di Liu<sup class="ltx_sup" id="id18.17.id2"><span class="ltx_text ltx_font_italic" 
id="id18.17.id2.1">1,3∗</span></sup>  Sizhuo Ma<sup class="ltx_sup" id="id19.18.id3">1</sup>  Michael Vasilkovsky<sup class="ltx_sup" id="id20.19.id4">1</sup>  Bing Zhou<sup class="ltx_sup" id="id21.20.id5">1</sup>  Qiang Gao<sup class="ltx_sup" id="id22.21.id6">1</sup> <br class="ltx_break"/>Wenzhou Wang<sup class="ltx_sup" id="id23.22.id7">1</sup>  Jiahao Luo<sup class="ltx_sup" id="id24.23.id8"><span class="ltx_text ltx_font_italic" id="id24.23.id8.1">1,4</span></sup>  Dimitris N. Metaxas<sup class="ltx_sup" id="id25.24.id9">3</sup>  Vincent Sitzmann<sup class="ltx_sup" id="id26.25.id10">2</sup>  Jian Wang<sup class="ltx_sup" id="id27.26.id11">1</sup> <br class="ltx_break"/><sup class="ltx_sup" id="id28.27.id12">1</sup>Snap Inc  <sup class="ltx_sup" id="id29.28.id13">2</sup>MIT  <sup class="ltx_sup" id="id30.29.id14">3</sup>Rutgers University  <sup class="ltx_sup" id="id31.30.id15">4</sup>University of California, Santa Cruz <br class="ltx_break"/> </span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id32.id1">The increasing popularity of personalized avatar systems, such as Snapchat Bitmojis and Apple Memojis, highlights the growing demand for digital self-representation. Despite their widespread use, existing avatar platforms face significant limitations, including restricted expressivity due to predefined assets, tedious customization processes, or inefficient rendering requirements. Addressing these shortcomings, we introduce <span class="ltx_text" id="id32.id1.1">Snapmoji</span>, an avatar generation system that instantly creates animatable, dual-stylized avatars from a selfie. We propose Gaussian Domain Adaptation (GDA), which is pre-trained on large-scale Gaussian models using 3D data from sources like Objaverse and fine-tuned with 2D style transfer tasks, endowing it with a rich 3D prior. 
This enables <span class="ltx_text" id="id32.id1.2">Snapmoji</span> to transform a selfie into a primary stylized avatar (<em class="ltx_emph ltx_font_italic" id="id32.id1.3">e.g</em>.<span class="ltx_text" id="id32.id1.4"></span>, Bitmoji style) and apply a secondary style (<em class="ltx_emph ltx_font_italic" id="id32.id1.5">e.g</em>.<span class="ltx_text" id="id32.id1.6"></span>, Plastic Toy or Alien), all while preserving the user’s identity and the primary style’s integrity. Our system produces 3D Gaussian avatars that support dynamic animation, including accurate facial expression transfer. Designed for efficiency, <span class="ltx_text" id="id32.id1.7">Snapmoji</span> achieves selfie-to-avatar conversion in 0.9 seconds and supports real-time interactions on mobile devices at 30–40 FPS. Extensive testing confirms that <span class="ltx_text" id="id32.id1.8">Snapmoji</span> outperforms existing methods in versatility and speed, making it a convenient tool for automatic avatar creation in various styles.</p> </div> <div class="ltx_para" id="p2"> <img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="178" id="p2.g1" src="x1.png" width="830"/> </div> <figure class="ltx_figure" id="S0.F1"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S0.F1.4.1.1" style="font-size:90%;">Figure 1</span>: </span><span class="ltx_text" id="S0.F1.5.2" style="font-size:90%;">We introduce <span class="ltx_text ltx_font_bold" id="S0.F1.5.2.1">Snapmoji</span>, a system that can instantly generate animatable dual-stylized avatars. Our dual stylization process reimagines avatars in various artistic styles, enabling users to visualize themselves in diverse scenarios and create personalized stories. Our approach also enables the generation of stylized 3D Gaussian avatars and the animation of their expressions. 
<span class="ltx_text" id="S0.F1.5.2.2">Snapmoji</span> accomplishes the selfie-to-avatar conversion in just 0.9 seconds, and offers real-time functionality for mobile applications. <a class="ltx_ref ltx_href" href="https://echen01.github.io/instamoji-supp/" title="">Project page.</a></span></figcaption> </figure><span class="ltx_note ltx_role_footnotetext" id="footnotex1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_note_type">footnotetext: </span>Equal contribution</span></span></span> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Personalized cartoon avatars such as <span class="ltx_text ltx_font_italic" id="S1.p1.1.1">Snapchat Bitmojis</span> <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib5" title=""><span class="ltx_text" style="font-size:90%;">5</span></a>]</cite>, <span class="ltx_text ltx_font_italic" id="S1.p1.1.2">Apple Memojis</span> <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib1" title=""><span class="ltx_text" style="font-size:90%;">1</span></a>]</cite>, and <span class="ltx_text ltx_font_italic" id="S1.p1.1.3">Meta Avatars</span> <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib35" title=""><span class="ltx_text" style="font-size:90%;">35</span></a>]</cite> have become popular digital self-representations. 
The broader digital avatar market, encompassing both stylized and photorealistic avatars, was valued at over $18 billion in 2023 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib41" title=""><span class="ltx_text" style="font-size:90%;">41</span></a>]</cite>. Current stylized avatar platforms, although offering some level of customization, are often restricted by predefined traits, which makes it difficult to adapt avatars to varied styles without developing new 3D assets <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib57" title=""><span class="ltx_text" style="font-size:90%;">57</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title=""><span class="ltx_text" style="font-size:90%;">56</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib29" title=""><span class="ltx_text" style="font-size:90%;">29</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib46" title=""><span class="ltx_text" style="font-size:90%;">46</span></a>]</cite>. Moreover, navigating extensive trait lists can be tedious, and efficiency demands frequently lead to compromises in texture detail and polygon count. 
Asset-free methods, such as StyleAvatar3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib61" title=""><span class="ltx_text" style="font-size:90%;">61</span></a>]</cite>, TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" style="font-size:90%;">54</span></a>]</cite> and DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title=""><span class="ltx_text" style="font-size:90%;">23</span></a>]</cite>, lack support for real-time operation or animatability for mobile augmented reality (AR). Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S1.T1" title="Table 1 ‣ 1 Introduction ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">1</span></a> provides a comparative overview of existing stylized avatar generation methods. 
While photorealistic avatar techniques <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib44" title=""><span class="ltx_text" style="font-size:90%;">44</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title=""><span class="ltx_text" style="font-size:90%;">40</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib32" title=""><span class="ltx_text" style="font-size:90%;">32</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib30" title=""><span class="ltx_text" style="font-size:90%;">30</span></a>]</cite> excel at creating and animating realistic representations, they fall short when adapting to the uniquely stylized geometries of cartoon avatars.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">In pursuit of a more expressive stylized avatar creation platform, we introduce <span class="ltx_text ltx_font_italic" id="S1.p2.1.1">Snapmoji</span>, a system that generates 3D avatars, represented as 3D Gaussian Splats <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib22" title=""><span class="ltx_text" style="font-size:90%;">22</span></a>]</cite>, in only 0.9 seconds.  <span class="ltx_text" id="S1.p2.1.2">Snapmoji</span> is built upon the Bitmoji platform, leveraging its public API and robust developer support <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib52" title=""><span class="ltx_text" style="font-size:90%;">52</span></a>]</cite>. Unlike traditional avatars limited to predefined assets, our solution allows for stylization through text prompts, offering greater flexibility and creativity. 
Our system is designed with three core objectives in mind: 1) <span class="ltx_text ltx_font_italic" id="S1.p2.1.3">Dual Stylization:</span> The system should generate avatars in the Bitmoji art style combined with a secondary style, such as Plastic Toy or Alien, while preserving user identity; 2) <span class="ltx_text ltx_font_italic" id="S1.p2.1.4">User Convenience:</span> For ease of use, the system should require only a single image input and produce results instantly; 3) <span class="ltx_text ltx_font_italic" id="S1.p2.1.5">Efficiency:</span> The avatars should be optimized for real-time rendering on mobile devices, supporting applications like AR, with minimal compute requirements.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Following these design objectives, we propose a two-stage pipeline for <span class="ltx_text" id="S1.p3.1.1">Snapmoji</span>. First, Gaussian Domain Adaptation (GDA) transforms realistic selfies into 2D Bitmoji-style images, and then a diffusion-based model further stylizes these images based on user-specified text prompts. Second, the stylized image is lifted to an animatable 3D Gaussian avatar that faithfully captures the user’s identity and chosen styles. To facilitate AR applications, these avatars can be animated in real time using facial parameter estimators. Although showcased with Bitmojis, our approach is applicable to other avatar platforms as well. 
In summary, our contributions include:</p> </div> <div class="ltx_para" id="S1.p4"> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">Introducing an advanced avatar generation system that produces dual-stylized avatars instantly from a selfie.</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">Developing Gaussian Domain Adaptation for enriching <span class="ltx_text" id="S1.I1.i2.p1.1.1">Snapmoji</span> with a 3D prior, enabling dual-style transformations of selfies while preserving identity and style.</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">Creating an animatable model by leveraging driving signals from 3DMM and blendshape priors combined with 3D Gaussians, enabling efficient, real-time rendering of dual-stylized avatars.</p> </div> </li> <li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i4.p1"> <p class="ltx_p" id="S1.I1.i4.p1.1">Demonstrating, through extensive experiments, that our method outperforms existing solutions in both generation and animation performance, enabling real-time applications on mobile devices.</p> </div> </li> </ul> </div> <figure class="ltx_table" id="S1.T1"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S1.T1.2" style="width:433.6pt;height:117.6pt;vertical-align:-0.8pt;"><span class="ltx_transformed_inner" style="transform:translate(-50.6pt,13.6pt) scale(0.810844965673996,0.810844965673996) ;"> <table class="ltx_tabular ltx_guessed_headers 
ltx_align_middle" id="S1.T1.2.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S1.T1.2.1.1.1"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">Method</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">Selfie Input</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">Mobile AR</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">Asset-free</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">Animatable</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S1.T1.2.1.1.1.6" style="padding-left:0.0pt;padding-right:0.0pt;">Dual Style</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S1.T1.2.1.2.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">StyleAvatar3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib61" title=""><span class="ltx_text" style="font-size:90%;">61</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.3" 
style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S1.T1.2.1.2.1.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.3.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title=""><span class="ltx_text" style="font-size:90%;">23</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.3" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.5" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.3.2.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.4.3"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" 
style="font-size:90%;">54</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.2" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.4.3.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.5.4"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.1" style="padding-left:0.0pt;padding-right:0.0pt;">EasyCraft <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib57" title=""><span class="ltx_text" style="font-size:90%;">57</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.2" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.5.4.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.6.5"> <td class="ltx_td ltx_nopad_l ltx_nopad_r 
ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.1" style="padding-left:0.0pt;padding-right:0.0pt;">SwiftAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title=""><span class="ltx_text" style="font-size:90%;">56</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.6.5.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.7.6"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.1" style="padding-left:0.0pt;padding-right:0.0pt;">AgileAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib46" title=""><span class="ltx_text" style="font-size:90%;">46</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.5" 
style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.7.6.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.8.7"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.1" style="padding-left:0.0pt;padding-right:0.0pt;"> <span class="ltx_text" id="S1.T1.2.1.8.7.1.1">Snapmoji</span> (ours)</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S1.T1.2.1.8.7.6" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> </tr> </tbody> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S1.T1.3.1.1" style="font-size:90%;">Table 1</span>: </span><span class="ltx_text" id="S1.T1.4.2" style="font-size:90%;">Feature comparison among various stylized avatar generation methods.</span></figcaption> </figure> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Related Work</h2> <figure class="ltx_figure" id="S2.F2"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="444" id="S2.F2.g1" src="x2.png" width="830"/> <figcaption class="ltx_caption"><span 
class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S2.F2.23.11.1" style="font-size:90%;">Figure 2</span>: </span><span class="ltx_text ltx_font_bold" id="S2.F2.20.10" style="font-size:90%;">The <span class="ltx_text" id="S2.F2.20.10.11">Snapmoji</span> Inference Pipeline.<span class="ltx_text ltx_font_medium" id="S2.F2.20.10.10"> The pipeline has two stages. First, the Gaussian Domain Adaptation network <math alttext="\mathcal{E}_{\text{GDA}}" class="ltx_Math" display="inline" id="S2.F2.11.1.1.m1.1"><semantics id="S2.F2.11.1.1.m1.1b"><msub id="S2.F2.11.1.1.m1.1.1" xref="S2.F2.11.1.1.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.F2.11.1.1.m1.1.1.2" xref="S2.F2.11.1.1.m1.1.1.2.cmml">ℰ</mi><mtext id="S2.F2.11.1.1.m1.1.1.3" xref="S2.F2.11.1.1.m1.1.1.3a.cmml">GDA</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.11.1.1.m1.1c"><apply id="S2.F2.11.1.1.m1.1.1.cmml" xref="S2.F2.11.1.1.m1.1.1"><csymbol cd="ambiguous" id="S2.F2.11.1.1.m1.1.1.1.cmml" xref="S2.F2.11.1.1.m1.1.1">subscript</csymbol><ci id="S2.F2.11.1.1.m1.1.1.2.cmml" xref="S2.F2.11.1.1.m1.1.1.2">ℰ</ci><ci id="S2.F2.11.1.1.m1.1.1.3a.cmml" xref="S2.F2.11.1.1.m1.1.1.3"><mtext id="S2.F2.11.1.1.m1.1.1.3.cmml" mathsize="70%" xref="S2.F2.11.1.1.m1.1.1.3">GDA</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.11.1.1.m1.1d">\mathcal{E}_{\text{GDA}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.11.1.1.m1.1e">caligraphic_E start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT</annotation></semantics></math> converts a facial image into a primary-style avatar <math alttext="I_{\text{sty}}" class="ltx_Math" display="inline" id="S2.F2.12.2.2.m2.1"><semantics id="S2.F2.12.2.2.m2.1b"><msub id="S2.F2.12.2.2.m2.1.1" xref="S2.F2.12.2.2.m2.1.1.cmml"><mi id="S2.F2.12.2.2.m2.1.1.2" xref="S2.F2.12.2.2.m2.1.1.2.cmml">I</mi><mtext id="S2.F2.12.2.2.m2.1.1.3" xref="S2.F2.12.2.2.m2.1.1.3a.cmml">sty</mtext></msub><annotation-xml encoding="MathML-Content" 
id="S2.F2.12.2.2.m2.1c"><apply id="S2.F2.12.2.2.m2.1.1.cmml" xref="S2.F2.12.2.2.m2.1.1"><csymbol cd="ambiguous" id="S2.F2.12.2.2.m2.1.1.1.cmml" xref="S2.F2.12.2.2.m2.1.1">subscript</csymbol><ci id="S2.F2.12.2.2.m2.1.1.2.cmml" xref="S2.F2.12.2.2.m2.1.1.2">𝐼</ci><ci id="S2.F2.12.2.2.m2.1.1.3a.cmml" xref="S2.F2.12.2.2.m2.1.1.3"><mtext id="S2.F2.12.2.2.m2.1.1.3.cmml" mathsize="70%" xref="S2.F2.12.2.2.m2.1.1.3">sty</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.12.2.2.m2.1d">I_{\text{sty}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.12.2.2.m2.1e">italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT</annotation></semantics></math>. This avatar undergoes further personalization using a text-guided diffusion process with <math alttext="T" class="ltx_Math" display="inline" id="S2.F2.13.3.3.m3.1"><semantics id="S2.F2.13.3.3.m3.1b"><mi id="S2.F2.13.3.3.m3.1.1" xref="S2.F2.13.3.3.m3.1.1.cmml">T</mi><annotation-xml encoding="MathML-Content" id="S2.F2.13.3.3.m3.1c"><ci id="S2.F2.13.3.3.m3.1.1.cmml" xref="S2.F2.13.3.3.m3.1.1">𝑇</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.13.3.3.m3.1d">T</annotation><annotation encoding="application/x-llamapun" id="S2.F2.13.3.3.m3.1e">italic_T</annotation></semantics></math> steps for additional stylization. 
Second, expression codes extracted via a 3DMM and FACS are combined with identity features <math alttext="f_{\text{id}}" class="ltx_Math" display="inline" id="S2.F2.14.4.4.m4.1"><semantics id="S2.F2.14.4.4.m4.1b"><msub id="S2.F2.14.4.4.m4.1.1" xref="S2.F2.14.4.4.m4.1.1.cmml"><mi id="S2.F2.14.4.4.m4.1.1.2" xref="S2.F2.14.4.4.m4.1.1.2.cmml">f</mi><mtext id="S2.F2.14.4.4.m4.1.1.3" xref="S2.F2.14.4.4.m4.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.14.4.4.m4.1c"><apply id="S2.F2.14.4.4.m4.1.1.cmml" xref="S2.F2.14.4.4.m4.1.1"><csymbol cd="ambiguous" id="S2.F2.14.4.4.m4.1.1.1.cmml" xref="S2.F2.14.4.4.m4.1.1">subscript</csymbol><ci id="S2.F2.14.4.4.m4.1.1.2.cmml" xref="S2.F2.14.4.4.m4.1.1.2">𝑓</ci><ci id="S2.F2.14.4.4.m4.1.1.3a.cmml" xref="S2.F2.14.4.4.m4.1.1.3"><mtext id="S2.F2.14.4.4.m4.1.1.3.cmml" mathsize="70%" xref="S2.F2.14.4.4.m4.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.14.4.4.m4.1d">f_{\text{id}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.14.4.4.m4.1e">italic_f start_POSTSUBSCRIPT id end_POSTSUBSCRIPT</annotation></semantics></math> from a reference image <math alttext="I_{\text{ref}}" class="ltx_Math" display="inline" id="S2.F2.15.5.5.m5.1"><semantics id="S2.F2.15.5.5.m5.1b"><msub id="S2.F2.15.5.5.m5.1.1" xref="S2.F2.15.5.5.m5.1.1.cmml"><mi id="S2.F2.15.5.5.m5.1.1.2" xref="S2.F2.15.5.5.m5.1.1.2.cmml">I</mi><mtext id="S2.F2.15.5.5.m5.1.1.3" xref="S2.F2.15.5.5.m5.1.1.3a.cmml">ref</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.15.5.5.m5.1c"><apply id="S2.F2.15.5.5.m5.1.1.cmml" xref="S2.F2.15.5.5.m5.1.1"><csymbol cd="ambiguous" id="S2.F2.15.5.5.m5.1.1.1.cmml" xref="S2.F2.15.5.5.m5.1.1">subscript</csymbol><ci id="S2.F2.15.5.5.m5.1.1.2.cmml" xref="S2.F2.15.5.5.m5.1.1.2">𝐼</ci><ci id="S2.F2.15.5.5.m5.1.1.3a.cmml" xref="S2.F2.15.5.5.m5.1.1.3"><mtext id="S2.F2.15.5.5.m5.1.1.3.cmml" mathsize="70%" 
xref="S2.F2.15.5.5.m5.1.1.3">ref</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.15.5.5.m5.1d">I_{\text{ref}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.15.5.5.m5.1e">italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT</annotation></semantics></math> and positional maps <math alttext="f_{\text{pos}}" class="ltx_Math" display="inline" id="S2.F2.16.6.6.m6.1"><semantics id="S2.F2.16.6.6.m6.1b"><msub id="S2.F2.16.6.6.m6.1.1" xref="S2.F2.16.6.6.m6.1.1.cmml"><mi id="S2.F2.16.6.6.m6.1.1.2" xref="S2.F2.16.6.6.m6.1.1.2.cmml">f</mi><mtext id="S2.F2.16.6.6.m6.1.1.3" xref="S2.F2.16.6.6.m6.1.1.3a.cmml">pos</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.16.6.6.m6.1c"><apply id="S2.F2.16.6.6.m6.1.1.cmml" xref="S2.F2.16.6.6.m6.1.1"><csymbol cd="ambiguous" id="S2.F2.16.6.6.m6.1.1.1.cmml" xref="S2.F2.16.6.6.m6.1.1">subscript</csymbol><ci id="S2.F2.16.6.6.m6.1.1.2.cmml" xref="S2.F2.16.6.6.m6.1.1.2">𝑓</ci><ci id="S2.F2.16.6.6.m6.1.1.3a.cmml" xref="S2.F2.16.6.6.m6.1.1.3"><mtext id="S2.F2.16.6.6.m6.1.1.3.cmml" mathsize="70%" xref="S2.F2.16.6.6.m6.1.1.3">pos</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.16.6.6.m6.1d">f_{\text{pos}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.16.6.6.m6.1e">italic_f start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT</annotation></semantics></math> from a driving image <math alttext="I_{\text{drive}}" class="ltx_Math" display="inline" id="S2.F2.17.7.7.m7.1"><semantics id="S2.F2.17.7.7.m7.1b"><msub id="S2.F2.17.7.7.m7.1.1" xref="S2.F2.17.7.7.m7.1.1.cmml"><mi id="S2.F2.17.7.7.m7.1.1.2" xref="S2.F2.17.7.7.m7.1.1.2.cmml">I</mi><mtext id="S2.F2.17.7.7.m7.1.1.3" xref="S2.F2.17.7.7.m7.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.17.7.7.m7.1c"><apply id="S2.F2.17.7.7.m7.1.1.cmml" xref="S2.F2.17.7.7.m7.1.1"><csymbol cd="ambiguous" id="S2.F2.17.7.7.m7.1.1.1.cmml" 
xref="S2.F2.17.7.7.m7.1.1">subscript</csymbol><ci id="S2.F2.17.7.7.m7.1.1.2.cmml" xref="S2.F2.17.7.7.m7.1.1.2">𝐼</ci><ci id="S2.F2.17.7.7.m7.1.1.3a.cmml" xref="S2.F2.17.7.7.m7.1.1.3"><mtext id="S2.F2.17.7.7.m7.1.1.3.cmml" mathsize="70%" xref="S2.F2.17.7.7.m7.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.17.7.7.m7.1d">I_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.17.7.7.m7.1e">italic_I start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math>. The unposed dual-stylized avatar <math alttext="I_{\text{unposed}}" class="ltx_Math" display="inline" id="S2.F2.18.8.8.m8.1"><semantics id="S2.F2.18.8.8.m8.1b"><msub id="S2.F2.18.8.8.m8.1.1" xref="S2.F2.18.8.8.m8.1.1.cmml"><mi id="S2.F2.18.8.8.m8.1.1.2" xref="S2.F2.18.8.8.m8.1.1.2.cmml">I</mi><mtext id="S2.F2.18.8.8.m8.1.1.3" xref="S2.F2.18.8.8.m8.1.1.3a.cmml">unposed</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.18.8.8.m8.1c"><apply id="S2.F2.18.8.8.m8.1.1.cmml" xref="S2.F2.18.8.8.m8.1.1"><csymbol cd="ambiguous" id="S2.F2.18.8.8.m8.1.1.1.cmml" xref="S2.F2.18.8.8.m8.1.1">subscript</csymbol><ci id="S2.F2.18.8.8.m8.1.1.2.cmml" xref="S2.F2.18.8.8.m8.1.1.2">𝐼</ci><ci id="S2.F2.18.8.8.m8.1.1.3a.cmml" xref="S2.F2.18.8.8.m8.1.1.3"><mtext id="S2.F2.18.8.8.m8.1.1.3.cmml" mathsize="70%" xref="S2.F2.18.8.8.m8.1.1.3">unposed</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.18.8.8.m8.1d">I_{\text{unposed}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.18.8.8.m8.1e">italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT</annotation></semantics></math> is then processed by an asymmetric UNet <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S2.F2.19.9.9.m9.1"><semantics id="S2.F2.19.9.9.m9.1b"><mrow id="S2.F2.19.9.9.m9.1.2" xref="S2.F2.19.9.9.m9.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.F2.19.9.9.m9.1.2.2" 
xref="S2.F2.19.9.9.m9.1.2.2.cmml">𝒢</mi><mo id="S2.F2.19.9.9.m9.1.2.1" xref="S2.F2.19.9.9.m9.1.2.1.cmml">⁢</mo><mrow id="S2.F2.19.9.9.m9.1.2.3.2" xref="S2.F2.19.9.9.m9.1.2.cmml"><mo id="S2.F2.19.9.9.m9.1.2.3.2.1" stretchy="false" xref="S2.F2.19.9.9.m9.1.2.cmml">(</mo><mo id="S2.F2.19.9.9.m9.1.1" lspace="0em" rspace="0em" xref="S2.F2.19.9.9.m9.1.1.cmml">⋅</mo><mo id="S2.F2.19.9.9.m9.1.2.3.2.2" stretchy="false" xref="S2.F2.19.9.9.m9.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.F2.19.9.9.m9.1c"><apply id="S2.F2.19.9.9.m9.1.2.cmml" xref="S2.F2.19.9.9.m9.1.2"><times id="S2.F2.19.9.9.m9.1.2.1.cmml" xref="S2.F2.19.9.9.m9.1.2.1"></times><ci id="S2.F2.19.9.9.m9.1.2.2.cmml" xref="S2.F2.19.9.9.m9.1.2.2">𝒢</ci><ci id="S2.F2.19.9.9.m9.1.1.cmml" xref="S2.F2.19.9.9.m9.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.19.9.9.m9.1d">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.F2.19.9.9.m9.1e">caligraphic_G ( ⋅ )</annotation></semantics></math>, conditioned on the driving codes <math alttext="f_{\text{drive}}" class="ltx_Math" display="inline" id="S2.F2.20.10.10.m10.1"><semantics id="S2.F2.20.10.10.m10.1b"><msub id="S2.F2.20.10.10.m10.1.1" xref="S2.F2.20.10.10.m10.1.1.cmml"><mi id="S2.F2.20.10.10.m10.1.1.2" xref="S2.F2.20.10.10.m10.1.1.2.cmml">f</mi><mtext id="S2.F2.20.10.10.m10.1.1.3" xref="S2.F2.20.10.10.m10.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.20.10.10.m10.1c"><apply id="S2.F2.20.10.10.m10.1.1.cmml" xref="S2.F2.20.10.10.m10.1.1"><csymbol cd="ambiguous" id="S2.F2.20.10.10.m10.1.1.1.cmml" xref="S2.F2.20.10.10.m10.1.1">subscript</csymbol><ci id="S2.F2.20.10.10.m10.1.1.2.cmml" xref="S2.F2.20.10.10.m10.1.1.2">𝑓</ci><ci id="S2.F2.20.10.10.m10.1.1.3a.cmml" xref="S2.F2.20.10.10.m10.1.1.3"><mtext id="S2.F2.20.10.10.m10.1.1.3.cmml" mathsize="70%" xref="S2.F2.20.10.10.m10.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S2.F2.20.10.10.m10.1d">f_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.20.10.10.m10.1e">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math> through cross-attention, to generate animated, dual-stylized 3D avatars.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="S2.p1"> <p class="ltx_p" id="S2.p1.1"><span class="ltx_text ltx_font_bold" id="S2.p1.1.1">2D Stylized Avatar Generation.</span> In the realm of 2D avatar generation, neural networks like StyleGAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib21" title=""><span class="ltx_text" style="font-size:90%;">21</span></a>]</cite> are renowned for producing realistic images with interpretable latent spaces, as explored in works like <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib15" title=""><span class="ltx_text" style="font-size:90%;">15</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib58" title=""><span class="ltx_text" style="font-size:90%;">58</span></a>]</cite>. StyleGAN’s versatility enables transformations into various styles, including Disney cartoons, paintings, and vintage photos <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib39" title=""><span class="ltx_text" style="font-size:90%;">39</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib7" title=""><span class="ltx_text" style="font-size:90%;">7</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib31" title=""><span class="ltx_text" style="font-size:90%;">31</span></a>]</cite>. 
A significant advantage of StyleGAN is that it performs these transformations without paired domain images, a property also exploited by SwiftAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title=""><span class="ltx_text" style="font-size:90%;">56</span></a>]</cite>, which creates paired data between realistic and stylized avatars. Diffusion models are another prominent approach to 2D stylization, known for their larger architectures and greater diversity of generated content. Models such as Stable Diffusion <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib43" title=""><span class="ltx_text" style="font-size:90%;">43</span></a>]</cite> generate images conditioned on text prompts, increasing user control. Further advances, including SDEdit <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib34" title=""><span class="ltx_text" style="font-size:90%;">34</span></a>]</cite>, ControlNet <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib63" title=""><span class="ltx_text" style="font-size:90%;">63</span></a>]</cite>, and IP-Adapter <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib60" title=""><span class="ltx_text" style="font-size:90%;">60</span></a>]</cite>, provide additional control by injecting noise or conditioning on structural inputs. 
Both GANs and diffusion models effectively convert real images into target styles, such as Bitmojis.</p> </div> <div class="ltx_para ltx_noindent" id="S2.p2"> <p class="ltx_p" id="S2.p2.1"><span class="ltx_text ltx_font_bold" id="S2.p2.1.1">3D Content and Avatar Generation.</span> Recent advances in 3D content creation have also significantly impacted avatar generation <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib62" title=""><span class="ltx_text" style="font-size:90%;">62</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib27" title=""><span class="ltx_text" style="font-size:90%;">27</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib26" title=""><span class="ltx_text" style="font-size:90%;">26</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib28" title=""><span class="ltx_text" style="font-size:90%;">28</span></a>]</cite>. 3D representations like NeRFs <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib36" title=""><span class="ltx_text" style="font-size:90%;">36</span></a>]</cite> and Gaussian Splats <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib22" title=""><span class="ltx_text" style="font-size:90%;">22</span></a>]</cite> have been integrated with generative models to automate the creation of 3D assets. For instance, DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title=""><span class="ltx_text" style="font-size:90%;">23</span></a>]</cite> and StyleAvatar3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib61" title=""><span class="ltx_text" style="font-size:90%;">61</span></a>]</cite> employ 3D GANs to generate and stylize 3D facial models. 
More recent developments utilize text-to-image diffusion models for avatar stylization <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" style="font-size:90%;">54</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib37" title=""><span class="ltx_text" style="font-size:90%;">37</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib33" title=""><span class="ltx_text" style="font-size:90%;">33</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib16" title=""><span class="ltx_text" style="font-size:90%;">16</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib9" title=""><span class="ltx_text" style="font-size:90%;">9</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib14" title=""><span class="ltx_text" style="font-size:90%;">14</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib65" title=""><span class="ltx_text" style="font-size:90%;">65</span></a>]</cite>, though this process tends to be slow, taking approximately 10 minutes. 
For photorealistic avatar creation, techniques using Gaussian Splats <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib44" title=""><span class="ltx_text" style="font-size:90%;">44</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title=""><span class="ltx_text" style="font-size:90%;">40</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib32" title=""><span class="ltx_text" style="font-size:90%;">32</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib30" title=""><span class="ltx_text" style="font-size:90%;">30</span></a>]</cite> fit real faces to 3D Morphable Models (3DMMs) like FLAME <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib25" title=""><span class="ltx_text" style="font-size:90%;">25</span></a>]</cite>, but these do not adapt well to the geometry of cartoon avatars. Beyond avatars, work on general 3D object generation includes models like LRM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib17" title=""><span class="ltx_text" style="font-size:90%;">17</span></a>]</cite> and LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite>, which train end-to-end neural networks to map 2D images to 3D objects. These models offer much faster inference than diffusion models but rely on extensive internet-scale multi-view datasets, such as Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title=""><span class="ltx_text" style="font-size:90%;">10</span></a>]</cite>, for training. 
This approach has yet to be applied to the specific challenge of 3D stylized avatar generation, which is a gap we seek to address.</p> </div> <div class="ltx_para ltx_noindent" id="S2.p3"> <p class="ltx_p" id="S2.p3.1"><span class="ltx_text ltx_font_bold" id="S2.p3.1.1">Production Systems for Stylized 3D Avatar Creation.</span> Developments in avatar creation platforms have introduced automated processes for selecting avatars by training classifiers that can predict avatar traits from a user’s photograph <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib45" title=""><span class="ltx_text" style="font-size:90%;">45</span></a>]</cite>. Initially, these systems generate a basic version of an avatar, which users can then personalize by adjusting various traits to their preference. A significant challenge in training these classifiers is the need for paired data that links real faces to specific avatar traits, a requirement that is difficult to meet on a large scale. 
To address this challenge, approaches like AgileAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib46" title=""><span class="ltx_text" style="font-size:90%;">46</span></a>]</cite>, F2P <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib48" title=""><span class="ltx_text" style="font-size:90%;">48</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib49" title=""><span class="ltx_text" style="font-size:90%;">49</span></a>]</cite>, EasyCraft <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib57" title=""><span class="ltx_text" style="font-size:90%;">57</span></a>]</cite> and SwiftAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title=""><span class="ltx_text" style="font-size:90%;">56</span></a>]</cite> have been developed, utilizing self-supervised learning techniques. While these methods are efficient, they are currently limited to creating avatars from pre-existing 3D asset libraries, rather than generating entirely new styles.</p> </div> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>Method</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">Our method begins with the creation of datasets for real face images and their primary-style avatars (Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS1" title="3.1 Datasets ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3.1</span></a>), facilitating avatar dual-stylization via Gaussian Domain Adaptation (Sec. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS2" title="3.2 2D Dual-Stylized Avatar Generation ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3.2</span></a>). This framework supports 3D generation and animation of dual-stylized avatars (Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS3" title="3.3 3D Animatable Stylized Avatar Generation ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3.3</span></a>). Loss functions and training details are provided in Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS4" title="3.4 Training and Losses ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3.4</span></a>.</p> </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1 </span>Datasets</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">Training our image-to-avatar GDA model requires paired datasets of real faces and avatars in a primary style, which are not available at scale. To overcome this, we employ GAN inversion techniques inspired by unsupervised domain adaptation <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title=""><span class="ltx_text" style="font-size:90%;">56</span></a>]</cite> to create synthetic paired data. 
By aligning the latent spaces of a source GAN and a fine-tuned target GAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib59" title=""><span class="ltx_text" style="font-size:90%;">59</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib7" title=""><span class="ltx_text" style="font-size:90%;">7</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title=""><span class="ltx_text" style="font-size:90%;">56</span></a>]</cite>, we generate corresponding pairs of realistic and primary-stylized images. Specifically, avatar images are inverted into the target GAN’s latent space to obtain latent codes, which are then applied to the source GAN to produce realistic-face counterparts:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="w:=\text{argmin}_{w\in\mathcal{W}}\|G_{\text{tgt}}(w)-I_{\text{tgt}}\|,\\ I_{\text{src}}=G_{\text{src}}(w)." 
class="ltx_Math" display="block" id="S3.E1.m1.3"><semantics id="S3.E1.m1.3a"><mrow id="S3.E1.m1.3.3.1"><mrow id="S3.E1.m1.3.3.1.1.2" xref="S3.E1.m1.3.3.1.1.3.cmml"><mrow id="S3.E1.m1.3.3.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.3" xref="S3.E1.m1.3.3.1.1.1.1.3.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.1.1.2" lspace="0.278em" rspace="0.278em" xref="S3.E1.m1.3.3.1.1.1.1.2.cmml">:=</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.cmml"><msub id="S3.E1.m1.3.3.1.1.1.1.1.3" xref="S3.E1.m1.3.3.1.1.1.1.1.3.cmml"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.3.2a.cmml">argmin</mtext><mrow id="S3.E1.m1.3.3.1.1.1.1.1.3.3" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.1.3.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.2.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.1.1.1.3.3.1" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S3.E1.m1.3.3.1.1.1.1.1.3.3.3" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.3.cmml">𝒲</mi></mrow></msub><mo id="S3.E1.m1.3.3.1.1.1.1.1.2" xref="S3.E1.m1.3.3.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.2.cmml"><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E1.m1.3.3.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.cmml"><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml"><msub id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2.cmml">G</mi><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3a.cmml">tgt</mtext></msub><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1.cmml">⁢</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml"><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.3.2.1" stretchy="false" 
xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml">(</mo><mi id="S3.E1.m1.1.1" xref="S3.E1.m1.1.1.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.3.2.2" stretchy="false" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2.cmml">I</mi><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3a.cmml">tgt</mtext></msub></mrow><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E1.m1.3.3.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow></mrow></mrow><mo id="S3.E1.m1.3.3.1.1.2.3" xref="S3.E1.m1.3.3.1.1.3a.cmml">,</mo><mrow id="S3.E1.m1.3.3.1.1.2.2" xref="S3.E1.m1.3.3.1.1.2.2.cmml"><msub id="S3.E1.m1.3.3.1.1.2.2.2" xref="S3.E1.m1.3.3.1.1.2.2.2.cmml"><mi id="S3.E1.m1.3.3.1.1.2.2.2.2" xref="S3.E1.m1.3.3.1.1.2.2.2.2.cmml">I</mi><mtext id="S3.E1.m1.3.3.1.1.2.2.2.3" xref="S3.E1.m1.3.3.1.1.2.2.2.3a.cmml">src</mtext></msub><mo id="S3.E1.m1.3.3.1.1.2.2.1" xref="S3.E1.m1.3.3.1.1.2.2.1.cmml">=</mo><mrow id="S3.E1.m1.3.3.1.1.2.2.3" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml"><msub id="S3.E1.m1.3.3.1.1.2.2.3.2" xref="S3.E1.m1.3.3.1.1.2.2.3.2.cmml"><mi id="S3.E1.m1.3.3.1.1.2.2.3.2.2" xref="S3.E1.m1.3.3.1.1.2.2.3.2.2.cmml">G</mi><mtext id="S3.E1.m1.3.3.1.1.2.2.3.2.3" xref="S3.E1.m1.3.3.1.1.2.2.3.2.3a.cmml">src</mtext></msub><mo id="S3.E1.m1.3.3.1.1.2.2.3.1" xref="S3.E1.m1.3.3.1.1.2.2.3.1.cmml">⁢</mo><mrow id="S3.E1.m1.3.3.1.1.2.2.3.3.2" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml"><mo id="S3.E1.m1.3.3.1.1.2.2.3.3.2.1" stretchy="false" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml">(</mo><mi id="S3.E1.m1.2.2" xref="S3.E1.m1.2.2.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.2.2.3.3.2.2" stretchy="false" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="S3.E1.m1.3.3.1.2" lspace="0em">.</mo></mrow><annotation-xml 
encoding="MathML-Content" id="S3.E1.m1.3b"><apply id="S3.E1.m1.3.3.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.2"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.3a.cmml" xref="S3.E1.m1.3.3.1.1.2.3">formulae-sequence</csymbol><apply id="S3.E1.m1.3.3.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1"><csymbol cd="latexml" id="S3.E1.m1.3.3.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.2">assign</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.3">𝑤</ci><apply id="S3.E1.m1.3.3.1.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1"><times id="S3.E1.m1.3.3.1.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.2"></times><apply id="S3.E1.m1.3.3.1.1.1.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.1.1.1.3.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.1.3.2a.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.2"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.3.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.2">argmin</mtext></ci><apply id="S3.E1.m1.3.3.1.1.1.1.1.3.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3"><in id="S3.E1.m1.3.3.1.1.1.1.1.3.3.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.1"></in><ci id="S3.E1.m1.3.3.1.1.1.1.1.3.3.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.2">𝑤</ci><ci id="S3.E1.m1.3.3.1.1.1.1.1.3.3.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.3">𝒲</ci></apply></apply><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S3.E1.m1.3.3.1.1.1.1.1.1.2.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1"><minus id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1"></minus><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2"><times id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1"></times><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" 
id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2">𝐺</ci><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3">tgt</mtext></ci></apply><ci id="S3.E1.m1.1.1.cmml" xref="S3.E1.m1.1.1">𝑤</ci></apply><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2">𝐼</ci><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3">tgt</mtext></ci></apply></apply></apply></apply></apply><apply id="S3.E1.m1.3.3.1.1.2.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2"><eq id="S3.E1.m1.3.3.1.1.2.2.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.1"></eq><apply id="S3.E1.m1.3.3.1.1.2.2.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.2.2.2.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.2.2.2.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2.2">𝐼</ci><ci id="S3.E1.m1.3.3.1.1.2.2.2.3a.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2.3"><mtext id="S3.E1.m1.3.3.1.1.2.2.2.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.2.2.2.3">src</mtext></ci></apply><apply id="S3.E1.m1.3.3.1.1.2.2.3.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3"><times id="S3.E1.m1.3.3.1.1.2.2.3.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.1"></times><apply id="S3.E1.m1.3.3.1.1.2.2.3.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.2"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.2.2.3.2.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.2">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.2.2.3.2.2.cmml" 
xref="S3.E1.m1.3.3.1.1.2.2.3.2.2">𝐺</ci><ci id="S3.E1.m1.3.3.1.1.2.2.3.2.3a.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.2.3"><mtext id="S3.E1.m1.3.3.1.1.2.2.3.2.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.2.2.3.2.3">src</mtext></ci></apply><ci id="S3.E1.m1.2.2.cmml" xref="S3.E1.m1.2.2">𝑤</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E1.m1.3c">w:=\text{argmin}_{w\in\mathcal{W}}\|G_{\text{tgt}}(w)-I_{\text{tgt}}\|,\\ I_{\text{src}}=G_{\text{src}}(w).</annotation><annotation encoding="application/x-llamapun" id="S3.E1.m1.3d">italic_w := argmin start_POSTSUBSCRIPT italic_w ∈ caligraphic_W end_POSTSUBSCRIPT ∥ italic_G start_POSTSUBSCRIPT tgt end_POSTSUBSCRIPT ( italic_w ) - italic_I start_POSTSUBSCRIPT tgt end_POSTSUBSCRIPT ∥ , italic_I start_POSTSUBSCRIPT src end_POSTSUBSCRIPT = italic_G start_POSTSUBSCRIPT src end_POSTSUBSCRIPT ( italic_w ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS1.p1.2">Using this method, we generated 13,000 synthetic image pairs from Bitmoji avatars, forming the basis for GDA training. (See Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F9" title="Figure 9 ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">9</span></a> and Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F10" title="Figure 10 ‣ Multi-view Training Data. 
‣ A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">10</span></a> in <span class="ltx_text ltx_font_italic" id="S3.SS1.p1.2.1">Suppl.</span> for details.)</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>2D Dual-Stylized Avatar Generation</h3> <div class="ltx_para ltx_noindent" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.7"><span class="ltx_text ltx_font_bold" id="S3.SS2.p1.7.1">Gaussian Domain Adaptation</span> <math alttext="\mathcal{E}_{\text{GDA}}(\cdot)" class="ltx_Math" display="inline" id="S3.SS2.p1.1.m1.1"><semantics id="S3.SS2.p1.1.m1.1a"><mrow id="S3.SS2.p1.1.m1.1.2" xref="S3.SS2.p1.1.m1.1.2.cmml"><msub id="S3.SS2.p1.1.m1.1.2.2" xref="S3.SS2.p1.1.m1.1.2.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS2.p1.1.m1.1.2.2.2" xref="S3.SS2.p1.1.m1.1.2.2.2.cmml">ℰ</mi><mtext id="S3.SS2.p1.1.m1.1.2.2.3" xref="S3.SS2.p1.1.m1.1.2.2.3a.cmml">GDA</mtext></msub><mo id="S3.SS2.p1.1.m1.1.2.1" xref="S3.SS2.p1.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS2.p1.1.m1.1.2.3.2" xref="S3.SS2.p1.1.m1.1.2.cmml"><mo id="S3.SS2.p1.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS2.p1.1.m1.1.2.cmml">(</mo><mo id="S3.SS2.p1.1.m1.1.1" lspace="0em" rspace="0em" xref="S3.SS2.p1.1.m1.1.1.cmml">⋅</mo><mo id="S3.SS2.p1.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS2.p1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.1.m1.1b"><apply id="S3.SS2.p1.1.m1.1.2.cmml" xref="S3.SS2.p1.1.m1.1.2"><times id="S3.SS2.p1.1.m1.1.2.1.cmml" xref="S3.SS2.p1.1.m1.1.2.1"></times><apply id="S3.SS2.p1.1.m1.1.2.2.cmml" xref="S3.SS2.p1.1.m1.1.2.2"><csymbol cd="ambiguous" id="S3.SS2.p1.1.m1.1.2.2.1.cmml" xref="S3.SS2.p1.1.m1.1.2.2">subscript</csymbol><ci id="S3.SS2.p1.1.m1.1.2.2.2.cmml" xref="S3.SS2.p1.1.m1.1.2.2.2">ℰ</ci><ci 
id="S3.SS2.p1.1.m1.1.2.2.3a.cmml" xref="S3.SS2.p1.1.m1.1.2.2.3"><mtext id="S3.SS2.p1.1.m1.1.2.2.3.cmml" mathsize="70%" xref="S3.SS2.p1.1.m1.1.2.2.3">GDA</mtext></ci></apply><ci id="S3.SS2.p1.1.m1.1.1.cmml" xref="S3.SS2.p1.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.1.m1.1c">\mathcal{E}_{\text{GDA}}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.1.m1.1d">caligraphic_E start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT ( ⋅ )</annotation></semantics></math><span class="ltx_text ltx_font_bold" id="S3.SS2.p1.7.2">.</span> To bridge the domain gap between real photos and 3D-aware cartoonish avatars (<em class="ltx_emph ltx_font_italic" id="S3.SS2.p1.7.3">i.e</em>.<span class="ltx_text" id="S3.SS2.p1.7.4"></span>, primary-style avatars), we propose <span class="ltx_text ltx_font_italic" id="S3.SS2.p1.7.5">Gaussian Domain Adaptation</span><span class="ltx_text ltx_font_italic" id="S3.SS2.p1.7.6"> (<span class="ltx_text" id="S3.SS2.p1.7.6.1">GDA</span>)</span>. Surprisingly, we find that features learned by Large Multi-view Gaussian models (LGMs) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite> can be quickly and robustly adapted for style transfer. We believe this is because they retain internet-scale information from multi-view training datasets such as Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title=""><span class="ltx_text" style="font-size:90%;">10</span></a>]</cite>. We begin with a U-Net backbone from LGM trained on 3D objects. Then, we perform GDA by fine-tuning the network to map real face photos to primary-style avatar images in the frontal view. 
At inference time, for an input selfie, we first apply face alignment and background removal preprocessing. The preprocessed image is then passed through the asymmetric U-Net that takes the input reference image <math alttext="I_{\text{ref}}\in\mathbb{R}^{3\times 512\times 512}" class="ltx_Math" display="inline" id="S3.SS2.p1.2.m2.1"><semantics id="S3.SS2.p1.2.m2.1a"><mrow id="S3.SS2.p1.2.m2.1.1" xref="S3.SS2.p1.2.m2.1.1.cmml"><msub id="S3.SS2.p1.2.m2.1.1.2" xref="S3.SS2.p1.2.m2.1.1.2.cmml"><mi id="S3.SS2.p1.2.m2.1.1.2.2" xref="S3.SS2.p1.2.m2.1.1.2.2.cmml">I</mi><mtext id="S3.SS2.p1.2.m2.1.1.2.3" xref="S3.SS2.p1.2.m2.1.1.2.3a.cmml">ref</mtext></msub><mo id="S3.SS2.p1.2.m2.1.1.1" xref="S3.SS2.p1.2.m2.1.1.1.cmml">∈</mo><msup id="S3.SS2.p1.2.m2.1.1.3" xref="S3.SS2.p1.2.m2.1.1.3.cmml"><mi id="S3.SS2.p1.2.m2.1.1.3.2" xref="S3.SS2.p1.2.m2.1.1.3.2.cmml">ℝ</mi><mrow id="S3.SS2.p1.2.m2.1.1.3.3" xref="S3.SS2.p1.2.m2.1.1.3.3.cmml"><mn id="S3.SS2.p1.2.m2.1.1.3.3.2" xref="S3.SS2.p1.2.m2.1.1.3.3.2.cmml">3</mn><mo id="S3.SS2.p1.2.m2.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="S3.SS2.p1.2.m2.1.1.3.3.1.cmml">×</mo><mn id="S3.SS2.p1.2.m2.1.1.3.3.3" xref="S3.SS2.p1.2.m2.1.1.3.3.3.cmml">512</mn><mo id="S3.SS2.p1.2.m2.1.1.3.3.1a" lspace="0.222em" rspace="0.222em" xref="S3.SS2.p1.2.m2.1.1.3.3.1.cmml">×</mo><mn id="S3.SS2.p1.2.m2.1.1.3.3.4" xref="S3.SS2.p1.2.m2.1.1.3.3.4.cmml">512</mn></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.2.m2.1b"><apply id="S3.SS2.p1.2.m2.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1"><in id="S3.SS2.p1.2.m2.1.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1.1"></in><apply id="S3.SS2.p1.2.m2.1.1.2.cmml" xref="S3.SS2.p1.2.m2.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p1.2.m2.1.1.2.1.cmml" xref="S3.SS2.p1.2.m2.1.1.2">subscript</csymbol><ci id="S3.SS2.p1.2.m2.1.1.2.2.cmml" xref="S3.SS2.p1.2.m2.1.1.2.2">𝐼</ci><ci id="S3.SS2.p1.2.m2.1.1.2.3a.cmml" xref="S3.SS2.p1.2.m2.1.1.2.3"><mtext id="S3.SS2.p1.2.m2.1.1.2.3.cmml" mathsize="70%" 
xref="S3.SS2.p1.2.m2.1.1.2.3">ref</mtext></ci></apply><apply id="S3.SS2.p1.2.m2.1.1.3.cmml" xref="S3.SS2.p1.2.m2.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.2.m2.1.1.3.1.cmml" xref="S3.SS2.p1.2.m2.1.1.3">superscript</csymbol><ci id="S3.SS2.p1.2.m2.1.1.3.2.cmml" xref="S3.SS2.p1.2.m2.1.1.3.2">ℝ</ci><apply id="S3.SS2.p1.2.m2.1.1.3.3.cmml" xref="S3.SS2.p1.2.m2.1.1.3.3"><times id="S3.SS2.p1.2.m2.1.1.3.3.1.cmml" xref="S3.SS2.p1.2.m2.1.1.3.3.1"></times><cn id="S3.SS2.p1.2.m2.1.1.3.3.2.cmml" type="integer" xref="S3.SS2.p1.2.m2.1.1.3.3.2">3</cn><cn id="S3.SS2.p1.2.m2.1.1.3.3.3.cmml" type="integer" xref="S3.SS2.p1.2.m2.1.1.3.3.3">512</cn><cn id="S3.SS2.p1.2.m2.1.1.3.3.4.cmml" type="integer" xref="S3.SS2.p1.2.m2.1.1.3.3.4">512</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.2.m2.1c">I_{\text{ref}}\in\mathbb{R}^{3\times 512\times 512}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.2.m2.1d">italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 512 × 512 end_POSTSUPERSCRIPT</annotation></semantics></math> and maps it to pixel-aligned Gaussian parameters, including scaling <math alttext="\boldsymbol{s}" class="ltx_Math" display="inline" id="S3.SS2.p1.3.m3.1"><semantics id="S3.SS2.p1.3.m3.1a"><mi id="S3.SS2.p1.3.m3.1.1" xref="S3.SS2.p1.3.m3.1.1.cmml">𝒔</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.3.m3.1b"><ci id="S3.SS2.p1.3.m3.1.1.cmml" xref="S3.SS2.p1.3.m3.1.1">𝒔</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.3.m3.1c">\boldsymbol{s}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.3.m3.1d">bold_italic_s</annotation></semantics></math>, position <math alttext="\boldsymbol{t}" class="ltx_Math" display="inline" id="S3.SS2.p1.4.m4.1"><semantics id="S3.SS2.p1.4.m4.1a"><mi id="S3.SS2.p1.4.m4.1.1" xref="S3.SS2.p1.4.m4.1.1.cmml">𝒕</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.4.m4.1b"><ci 
id="S3.SS2.p1.4.m4.1.1.cmml" xref="S3.SS2.p1.4.m4.1.1">𝒕</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.4.m4.1c">\boldsymbol{t}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.4.m4.1d">bold_italic_t</annotation></semantics></math>, color <math alttext="\boldsymbol{c}" class="ltx_Math" display="inline" id="S3.SS2.p1.5.m5.1"><semantics id="S3.SS2.p1.5.m5.1a"><mi id="S3.SS2.p1.5.m5.1.1" xref="S3.SS2.p1.5.m5.1.1.cmml">𝒄</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.5.m5.1b"><ci id="S3.SS2.p1.5.m5.1.1.cmml" xref="S3.SS2.p1.5.m5.1.1">𝒄</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.5.m5.1c">\boldsymbol{c}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.5.m5.1d">bold_italic_c</annotation></semantics></math>, opacity <math alttext="\boldsymbol{o}" class="ltx_Math" display="inline" id="S3.SS2.p1.6.m6.1"><semantics id="S3.SS2.p1.6.m6.1a"><mi id="S3.SS2.p1.6.m6.1.1" xref="S3.SS2.p1.6.m6.1.1.cmml">𝒐</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.6.m6.1b"><ci id="S3.SS2.p1.6.m6.1.1.cmml" xref="S3.SS2.p1.6.m6.1.1">𝒐</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.6.m6.1c">\boldsymbol{o}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.6.m6.1d">bold_italic_o</annotation></semantics></math>, and orientation <math alttext="\boldsymbol{q}" class="ltx_Math" display="inline" id="S3.SS2.p1.7.m7.1"><semantics id="S3.SS2.p1.7.m7.1a"><mi id="S3.SS2.p1.7.m7.1.1" xref="S3.SS2.p1.7.m7.1.1.cmml">𝒒</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.7.m7.1b"><ci id="S3.SS2.p1.7.m7.1.1.cmml" xref="S3.SS2.p1.7.m7.1.1">𝒒</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.7.m7.1c">\boldsymbol{q}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.7.m7.1d">bold_italic_q</annotation></semantics></math>:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E2"> 
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\theta_{\text{GDA}}=\left\{\boldsymbol{s}^{k},\boldsymbol{t}^{k},\boldsymbol{q% }^{k},\boldsymbol{c}^{k},\boldsymbol{o}^{k}\right\}_{k=1}^{M}=\mathcal{E}_{% \text{GDA}}(I_{\text{ref}};\Phi_{\text{GDA}})," class="ltx_Math" display="block" id="S3.E2.m1.1"><semantics id="S3.E2.m1.1a"><mrow id="S3.E2.m1.1.1.1" xref="S3.E2.m1.1.1.1.1.cmml"><mrow id="S3.E2.m1.1.1.1.1" xref="S3.E2.m1.1.1.1.1.cmml"><msub id="S3.E2.m1.1.1.1.1.9" xref="S3.E2.m1.1.1.1.1.9.cmml"><mi id="S3.E2.m1.1.1.1.1.9.2" xref="S3.E2.m1.1.1.1.1.9.2.cmml">θ</mi><mtext id="S3.E2.m1.1.1.1.1.9.3" xref="S3.E2.m1.1.1.1.1.9.3a.cmml">GDA</mtext></msub><mo id="S3.E2.m1.1.1.1.1.10" xref="S3.E2.m1.1.1.1.1.10.cmml">=</mo><msubsup id="S3.E2.m1.1.1.1.1.5" xref="S3.E2.m1.1.1.1.1.5.cmml"><mrow id="S3.E2.m1.1.1.1.1.5.5.5.5" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml"><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.6" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">{</mo><msup id="S3.E2.m1.1.1.1.1.1.1.1.1.1" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E2.m1.1.1.1.1.1.1.1.1.1.2" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.2.cmml">𝒔</mi><mi id="S3.E2.m1.1.1.1.1.1.1.1.1.1.3" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.7" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.2.2.2.2.2" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.cmml"><mi id="S3.E2.m1.1.1.1.1.2.2.2.2.2.2" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.2.cmml">𝒕</mi><mi id="S3.E2.m1.1.1.1.1.2.2.2.2.2.3" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.8" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.3.3.3.3.3" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.cmml"><mi id="S3.E2.m1.1.1.1.1.3.3.3.3.3.2" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.2.cmml">𝒒</mi><mi id="S3.E2.m1.1.1.1.1.3.3.3.3.3.3" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.3.cmml">k</mi></msup><mo 
id="S3.E2.m1.1.1.1.1.5.5.5.5.9" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.4.4.4.4.4" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.cmml"><mi id="S3.E2.m1.1.1.1.1.4.4.4.4.4.2" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.2.cmml">𝒄</mi><mi id="S3.E2.m1.1.1.1.1.4.4.4.4.4.3" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.10" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.5.5.5.5.5" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.cmml"><mi id="S3.E2.m1.1.1.1.1.5.5.5.5.5.2" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.2.cmml">𝒐</mi><mi id="S3.E2.m1.1.1.1.1.5.5.5.5.5.3" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.11" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">}</mo></mrow><mrow id="S3.E2.m1.1.1.1.1.5.5.7" xref="S3.E2.m1.1.1.1.1.5.5.7.cmml"><mi id="S3.E2.m1.1.1.1.1.5.5.7.2" xref="S3.E2.m1.1.1.1.1.5.5.7.2.cmml">k</mi><mo id="S3.E2.m1.1.1.1.1.5.5.7.1" xref="S3.E2.m1.1.1.1.1.5.5.7.1.cmml">=</mo><mn id="S3.E2.m1.1.1.1.1.5.5.7.3" xref="S3.E2.m1.1.1.1.1.5.5.7.3.cmml">1</mn></mrow><mi id="S3.E2.m1.1.1.1.1.5.7" xref="S3.E2.m1.1.1.1.1.5.7.cmml">M</mi></msubsup><mo id="S3.E2.m1.1.1.1.1.11" xref="S3.E2.m1.1.1.1.1.11.cmml">=</mo><mrow id="S3.E2.m1.1.1.1.1.7" xref="S3.E2.m1.1.1.1.1.7.cmml"><msub id="S3.E2.m1.1.1.1.1.7.4" xref="S3.E2.m1.1.1.1.1.7.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E2.m1.1.1.1.1.7.4.2" xref="S3.E2.m1.1.1.1.1.7.4.2.cmml">ℰ</mi><mtext id="S3.E2.m1.1.1.1.1.7.4.3" xref="S3.E2.m1.1.1.1.1.7.4.3a.cmml">GDA</mtext></msub><mo id="S3.E2.m1.1.1.1.1.7.3" xref="S3.E2.m1.1.1.1.1.7.3.cmml">⁢</mo><mrow id="S3.E2.m1.1.1.1.1.7.2.2" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml"><mo id="S3.E2.m1.1.1.1.1.7.2.2.3" stretchy="false" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml">(</mo><msub id="S3.E2.m1.1.1.1.1.6.1.1.1" xref="S3.E2.m1.1.1.1.1.6.1.1.1.cmml"><mi id="S3.E2.m1.1.1.1.1.6.1.1.1.2" xref="S3.E2.m1.1.1.1.1.6.1.1.1.2.cmml">I</mi><mtext id="S3.E2.m1.1.1.1.1.6.1.1.1.3" 
xref="S3.E2.m1.1.1.1.1.6.1.1.1.3a.cmml">ref</mtext></msub><mo id="S3.E2.m1.1.1.1.1.7.2.2.4" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml">;</mo><msub id="S3.E2.m1.1.1.1.1.7.2.2.2" xref="S3.E2.m1.1.1.1.1.7.2.2.2.cmml"><mi id="S3.E2.m1.1.1.1.1.7.2.2.2.2" mathvariant="normal" xref="S3.E2.m1.1.1.1.1.7.2.2.2.2.cmml">Φ</mi><mtext id="S3.E2.m1.1.1.1.1.7.2.2.2.3" xref="S3.E2.m1.1.1.1.1.7.2.2.2.3a.cmml">GDA</mtext></msub><mo id="S3.E2.m1.1.1.1.1.7.2.2.5" stretchy="false" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E2.m1.1.1.1.2" xref="S3.E2.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E2.m1.1b"><apply id="S3.E2.m1.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1"><and id="S3.E2.m1.1.1.1.1a.cmml" xref="S3.E2.m1.1.1.1"></and><apply id="S3.E2.m1.1.1.1.1b.cmml" xref="S3.E2.m1.1.1.1"><eq id="S3.E2.m1.1.1.1.1.10.cmml" xref="S3.E2.m1.1.1.1.1.10"></eq><apply id="S3.E2.m1.1.1.1.1.9.cmml" xref="S3.E2.m1.1.1.1.1.9"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.9.1.cmml" xref="S3.E2.m1.1.1.1.1.9">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.9.2.cmml" xref="S3.E2.m1.1.1.1.1.9.2">𝜃</ci><ci id="S3.E2.m1.1.1.1.1.9.3a.cmml" xref="S3.E2.m1.1.1.1.1.9.3"><mtext id="S3.E2.m1.1.1.1.1.9.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.9.3">GDA</mtext></ci></apply><apply id="S3.E2.m1.1.1.1.1.5.cmml" xref="S3.E2.m1.1.1.1.1.5"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.5.6.cmml" xref="S3.E2.m1.1.1.1.1.5">superscript</csymbol><apply id="S3.E2.m1.1.1.1.1.5.5.cmml" xref="S3.E2.m1.1.1.1.1.5"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.5.5.6.cmml" xref="S3.E2.m1.1.1.1.1.5">subscript</csymbol><set id="S3.E2.m1.1.1.1.1.5.5.5.6.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5"><apply id="S3.E2.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.2">𝒔</ci><ci 
id="S3.E2.m1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.2.2.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.2.2.2.2.2.1.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.2.2.2.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.2">𝒕</ci><ci id="S3.E2.m1.1.1.1.1.2.2.2.2.2.3.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.3.3.3.3.3.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.3.3.3.3.3.1.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.3.3.3.3.3.2.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.2">𝒒</ci><ci id="S3.E2.m1.1.1.1.1.3.3.3.3.3.3.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.4.4.4.4.4.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.4.4.4.4.4.1.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.4.4.4.4.4.2.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.2">𝒄</ci><ci id="S3.E2.m1.1.1.1.1.4.4.4.4.4.3.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.5.5.5.5.5.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.5.5.5.5.5.1.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.5.5.5.5.5.2.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.2">𝒐</ci><ci id="S3.E2.m1.1.1.1.1.5.5.5.5.5.3.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.3">𝑘</ci></apply></set><apply id="S3.E2.m1.1.1.1.1.5.5.7.cmml" xref="S3.E2.m1.1.1.1.1.5.5.7"><eq id="S3.E2.m1.1.1.1.1.5.5.7.1.cmml" xref="S3.E2.m1.1.1.1.1.5.5.7.1"></eq><ci id="S3.E2.m1.1.1.1.1.5.5.7.2.cmml" xref="S3.E2.m1.1.1.1.1.5.5.7.2">𝑘</ci><cn id="S3.E2.m1.1.1.1.1.5.5.7.3.cmml" type="integer" xref="S3.E2.m1.1.1.1.1.5.5.7.3">1</cn></apply></apply><ci id="S3.E2.m1.1.1.1.1.5.7.cmml" 
xref="S3.E2.m1.1.1.1.1.5.7">𝑀</ci></apply></apply><apply id="S3.E2.m1.1.1.1.1c.cmml" xref="S3.E2.m1.1.1.1"><eq id="S3.E2.m1.1.1.1.1.11.cmml" xref="S3.E2.m1.1.1.1.1.11"></eq><share href="https://arxiv.org/html/2503.11978v1#S3.E2.m1.1.1.1.1.5.cmml" id="S3.E2.m1.1.1.1.1d.cmml" xref="S3.E2.m1.1.1.1"></share><apply id="S3.E2.m1.1.1.1.1.7.cmml" xref="S3.E2.m1.1.1.1.1.7"><times id="S3.E2.m1.1.1.1.1.7.3.cmml" xref="S3.E2.m1.1.1.1.1.7.3"></times><apply id="S3.E2.m1.1.1.1.1.7.4.cmml" xref="S3.E2.m1.1.1.1.1.7.4"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.7.4.1.cmml" xref="S3.E2.m1.1.1.1.1.7.4">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.7.4.2.cmml" xref="S3.E2.m1.1.1.1.1.7.4.2">ℰ</ci><ci id="S3.E2.m1.1.1.1.1.7.4.3a.cmml" xref="S3.E2.m1.1.1.1.1.7.4.3"><mtext id="S3.E2.m1.1.1.1.1.7.4.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.7.4.3">GDA</mtext></ci></apply><list id="S3.E2.m1.1.1.1.1.7.2.3.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2"><apply id="S3.E2.m1.1.1.1.1.6.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.6.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.6.1.1.1.2.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1.2">𝐼</ci><ci id="S3.E2.m1.1.1.1.1.6.1.1.1.3a.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1.3"><mtext id="S3.E2.m1.1.1.1.1.6.1.1.1.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.6.1.1.1.3">ref</mtext></ci></apply><apply id="S3.E2.m1.1.1.1.1.7.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.7.2.2.2.1.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.7.2.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2.2">Φ</ci><ci id="S3.E2.m1.1.1.1.1.7.2.2.2.3a.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2.3"><mtext id="S3.E2.m1.1.1.1.1.7.2.2.2.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.7.2.2.2.3">GDA</mtext></ci></apply></list></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.E2.m1.1c">\theta_{\text{GDA}}=\left\{\boldsymbol{s}^{k},\boldsymbol{t}^{k},\boldsymbol{q% }^{k},\boldsymbol{c}^{k},\boldsymbol{o}^{k}\right\}_{k=1}^{M}=\mathcal{E}_{% \text{GDA}}(I_{\text{ref}};\Phi_{\text{GDA}}),</annotation><annotation encoding="application/x-llamapun" id="S3.E2.m1.1d">italic_θ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT = { bold_italic_s start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_t start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_o start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT = caligraphic_E start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ; roman_Φ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS2.p1.10">where <math alttext="M" class="ltx_Math" display="inline" id="S3.SS2.p1.8.m1.1"><semantics id="S3.SS2.p1.8.m1.1a"><mi id="S3.SS2.p1.8.m1.1.1" xref="S3.SS2.p1.8.m1.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.8.m1.1b"><ci id="S3.SS2.p1.8.m1.1.1.cmml" xref="S3.SS2.p1.8.m1.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.8.m1.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.8.m1.1d">italic_M</annotation></semantics></math> is the number of Gaussians and <math alttext="\Phi_{\text{GDA}}" class="ltx_Math" display="inline" id="S3.SS2.p1.9.m2.1"><semantics id="S3.SS2.p1.9.m2.1a"><msub id="S3.SS2.p1.9.m2.1.1" xref="S3.SS2.p1.9.m2.1.1.cmml"><mi 
id="S3.SS2.p1.9.m2.1.1.2" mathvariant="normal" xref="S3.SS2.p1.9.m2.1.1.2.cmml">Φ</mi><mtext id="S3.SS2.p1.9.m2.1.1.3" xref="S3.SS2.p1.9.m2.1.1.3a.cmml">GDA</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.9.m2.1b"><apply id="S3.SS2.p1.9.m2.1.1.cmml" xref="S3.SS2.p1.9.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p1.9.m2.1.1.1.cmml" xref="S3.SS2.p1.9.m2.1.1">subscript</csymbol><ci id="S3.SS2.p1.9.m2.1.1.2.cmml" xref="S3.SS2.p1.9.m2.1.1.2">Φ</ci><ci id="S3.SS2.p1.9.m2.1.1.3a.cmml" xref="S3.SS2.p1.9.m2.1.1.3"><mtext id="S3.SS2.p1.9.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p1.9.m2.1.1.3">GDA</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.9.m2.1c">\Phi_{\text{GDA}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.9.m2.1d">roman_Φ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT</annotation></semantics></math> denotes the learnable parameters. The 3D Gaussians are then rendered in the frontal view as <math alttext="I_{\text{sty}}=\mathcal{E}^{\text{render}}_{\text{3D}}(\theta_{\text{GDA}})" class="ltx_Math" display="inline" id="S3.SS2.p1.10.m3.1"><semantics id="S3.SS2.p1.10.m3.1a"><mrow id="S3.SS2.p1.10.m3.1.1" xref="S3.SS2.p1.10.m3.1.1.cmml"><msub id="S3.SS2.p1.10.m3.1.1.3" xref="S3.SS2.p1.10.m3.1.1.3.cmml"><mi id="S3.SS2.p1.10.m3.1.1.3.2" xref="S3.SS2.p1.10.m3.1.1.3.2.cmml">I</mi><mtext id="S3.SS2.p1.10.m3.1.1.3.3" xref="S3.SS2.p1.10.m3.1.1.3.3a.cmml">sty</mtext></msub><mo id="S3.SS2.p1.10.m3.1.1.2" xref="S3.SS2.p1.10.m3.1.1.2.cmml">=</mo><mrow id="S3.SS2.p1.10.m3.1.1.1" xref="S3.SS2.p1.10.m3.1.1.1.cmml"><msubsup id="S3.SS2.p1.10.m3.1.1.1.3" xref="S3.SS2.p1.10.m3.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS2.p1.10.m3.1.1.1.3.2.2" xref="S3.SS2.p1.10.m3.1.1.1.3.2.2.cmml">ℰ</mi><mtext id="S3.SS2.p1.10.m3.1.1.1.3.3" xref="S3.SS2.p1.10.m3.1.1.1.3.3a.cmml">3D</mtext><mtext id="S3.SS2.p1.10.m3.1.1.1.3.2.3" xref="S3.SS2.p1.10.m3.1.1.1.3.2.3a.cmml">render</mtext></msubsup><mo 
id="S3.SS2.p1.10.m3.1.1.1.2" xref="S3.SS2.p1.10.m3.1.1.1.2.cmml">⁢</mo><mrow id="S3.SS2.p1.10.m3.1.1.1.1.1" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml"><mo id="S3.SS2.p1.10.m3.1.1.1.1.1.2" stretchy="false" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml">(</mo><msub id="S3.SS2.p1.10.m3.1.1.1.1.1.1" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml"><mi id="S3.SS2.p1.10.m3.1.1.1.1.1.1.2" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.2.cmml">θ</mi><mtext id="S3.SS2.p1.10.m3.1.1.1.1.1.1.3" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.3a.cmml">GDA</mtext></msub><mo id="S3.SS2.p1.10.m3.1.1.1.1.1.3" stretchy="false" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.10.m3.1b"><apply id="S3.SS2.p1.10.m3.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1"><eq id="S3.SS2.p1.10.m3.1.1.2.cmml" xref="S3.SS2.p1.10.m3.1.1.2"></eq><apply id="S3.SS2.p1.10.m3.1.1.3.cmml" xref="S3.SS2.p1.10.m3.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.3.1.cmml" xref="S3.SS2.p1.10.m3.1.1.3">subscript</csymbol><ci id="S3.SS2.p1.10.m3.1.1.3.2.cmml" xref="S3.SS2.p1.10.m3.1.1.3.2">𝐼</ci><ci id="S3.SS2.p1.10.m3.1.1.3.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.3.3"><mtext id="S3.SS2.p1.10.m3.1.1.3.3.cmml" mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.3.3">sty</mtext></ci></apply><apply id="S3.SS2.p1.10.m3.1.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1"><times id="S3.SS2.p1.10.m3.1.1.1.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.2"></times><apply id="S3.SS2.p1.10.m3.1.1.1.3.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.1.3.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3">subscript</csymbol><apply id="S3.SS2.p1.10.m3.1.1.1.3.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.1.3.2.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3">superscript</csymbol><ci id="S3.SS2.p1.10.m3.1.1.1.3.2.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3.2.2">ℰ</ci><ci id="S3.SS2.p1.10.m3.1.1.1.3.2.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3.2.3"><mtext id="S3.SS2.p1.10.m3.1.1.1.3.2.3.cmml" 
mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.1.3.2.3">render</mtext></ci></apply><ci id="S3.SS2.p1.10.m3.1.1.1.3.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3.3"><mtext id="S3.SS2.p1.10.m3.1.1.1.3.3.cmml" mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.1.3.3">3D</mtext></ci></apply><apply id="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.1.1.1.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1">subscript</csymbol><ci id="S3.SS2.p1.10.m3.1.1.1.1.1.1.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.2">𝜃</ci><ci id="S3.SS2.p1.10.m3.1.1.1.1.1.1.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.3"><mtext id="S3.SS2.p1.10.m3.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.3">GDA</mtext></ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.10.m3.1c">I_{\text{sty}}=\mathcal{E}^{\text{render}}_{\text{3D}}(\theta_{\text{GDA}})</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.10.m3.1d">italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT = caligraphic_E start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3D end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT )</annotation></semantics></math>. This process transforms real face photos into the primary avatar domain while preserving identity-related features, enabling seamless 3D-aware avatar generation and animation. </p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.2"><span class="ltx_text ltx_font_bold" id="S3.SS2.p2.2.1">Avatar Dual-Stylization.</span> While GDA efficiently generates avatars in a single primary style, our dual-stylization approach allows for additional customization. 
We employ a diffusion-based pipeline built on SDEdit <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib34" title=""><span class="ltx_text" style="font-size:90%;">34</span></a>]</cite> to refine the GDA output image <math alttext="I_{\text{sty}}" class="ltx_Math" display="inline" id="S3.SS2.p2.1.m1.1"><semantics id="S3.SS2.p2.1.m1.1a"><msub id="S3.SS2.p2.1.m1.1.1" xref="S3.SS2.p2.1.m1.1.1.cmml"><mi id="S3.SS2.p2.1.m1.1.1.2" xref="S3.SS2.p2.1.m1.1.1.2.cmml">I</mi><mtext id="S3.SS2.p2.1.m1.1.1.3" xref="S3.SS2.p2.1.m1.1.1.3a.cmml">sty</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.1.m1.1b"><apply id="S3.SS2.p2.1.m1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.1.m1.1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1">subscript</csymbol><ci id="S3.SS2.p2.1.m1.1.1.2.cmml" xref="S3.SS2.p2.1.m1.1.1.2">𝐼</ci><ci id="S3.SS2.p2.1.m1.1.1.3a.cmml" xref="S3.SS2.p2.1.m1.1.1.3"><mtext id="S3.SS2.p2.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.1.m1.1.1.3">sty</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.1.m1.1c">I_{\text{sty}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.1.m1.1d">italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT</annotation></semantics></math>: we add a small amount of noise and then apply text-guided denoising to introduce features such as art style and accessories. To preserve the avatar’s primary style, we use ControlNet <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib63" title=""><span class="ltx_text" style="font-size:90%;">63</span></a>]</cite> conditioned on Canny edges to maintain geometric integrity. 
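</p> <p class="ltx_p">The noise-injection half of the SDEdit refinement described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in this work: the linear beta schedule, the <code>strength</code> parameter, and the function names <code>make_alpha_bar</code> and <code>sdedit_noise</code> are hypothetical, and the text-, ControlNet-, and IP-Adapter-conditioned denoiser is omitted entirely.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def sdedit_noise(image, strength, alpha_bar, rng):
    """Perturb `image` to an intermediate timestep selected by `strength`.

    `strength` lies in (0, 1]; small values add little noise. Returns the
    noised image and the timestep index from which a conditioned
    reverse-diffusion process would start denoising.
    """
    num_steps = len(alpha_bar)
    t0 = max(int(round(strength * num_steps)) - 1, 0)
    eps = rng.standard_normal(image.shape)
    # Standard forward-diffusion marginal: scaled signal plus scaled noise.
    noised = np.sqrt(alpha_bar[t0]) * image + np.sqrt(1.0 - alpha_bar[t0]) * eps
    return noised, t0


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
# Random stand-in for the rendered primary-style image I_sty (toy resolution).
i_sty = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
noised, t0 = sdedit_noise(i_sty, strength=0.3, alpha_bar=alpha_bar, rng=rng)
```
</pre>
<p class="ltx_p">In this sketch, a smaller <code>strength</code> keeps more of the primary-style render in the starting point for denoising, which is one plausible reading of "a small amount of noise"; the actual noise level is not specified here.</p> <p class="ltx_p">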
Additionally, IP Adapter <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib60" title=""><span class="ltx_text" style="font-size:90%;">60</span></a>]</cite> is integrated with facial similarity embeddings to ensure resemblance to the original user. The cross-attention outputs for each layer, <math alttext="f_{\text{out}}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.2.m2.1"><semantics id="S3.SS2.p2.2.m2.1a"><msubsup id="S3.SS2.p2.2.m2.1.1" xref="S3.SS2.p2.2.m2.1.1.cmml"><mi id="S3.SS2.p2.2.m2.1.1.2.2" xref="S3.SS2.p2.2.m2.1.1.2.2.cmml">f</mi><mtext id="S3.SS2.p2.2.m2.1.1.2.3" xref="S3.SS2.p2.2.m2.1.1.2.3a.cmml">out</mtext><mi id="S3.SS2.p2.2.m2.1.1.3" xref="S3.SS2.p2.2.m2.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.2.m2.1b"><apply id="S3.SS2.p2.2.m2.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1">superscript</csymbol><apply id="S3.SS2.p2.2.m2.1.1.2.cmml" xref="S3.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1">subscript</csymbol><ci id="S3.SS2.p2.2.m2.1.1.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.2.2">𝑓</ci><ci id="S3.SS2.p2.2.m2.1.1.2.3a.cmml" xref="S3.SS2.p2.2.m2.1.1.2.3"><mtext id="S3.SS2.p2.2.m2.1.1.2.3.cmml" mathsize="70%" xref="S3.SS2.p2.2.m2.1.1.2.3">out</mtext></ci></apply><ci id="S3.SS2.p2.2.m2.1.1.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.2.m2.1c">f_{\text{out}}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.2.m2.1d">italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>, combine features from the reference image, text prompts, and the primary-stylized avatar:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E3"> <tbody><tr class="ltx_equation 
ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\begin{split}f_{\text{out}}^{d}&amp;=\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q% }^{d})(f_{\text{ref}}W_{K}^{d})^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{ref}}W_% {V}^{d})\\ &amp;+\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q}^{d})(f_{\text{txt}}W_{K}^{d})% ^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{txt}}W_{V}^{d}),\end{split}" class="ltx_Math" display="block" id="S3.E3.m1.31"><semantics id="S3.E3.m1.31a"><mtable columnspacing="0pt" displaystyle="true" id="S3.E3.m1.31.31.3" rowspacing="0pt"><mtr id="S3.E3.m1.31.31.3a"><mtd class="ltx_align_right" columnalign="right" id="S3.E3.m1.31.31.3b"><msubsup id="S3.E3.m1.3.3.3.3.3"><mi id="S3.E3.m1.1.1.1.1.1.1" xref="S3.E3.m1.1.1.1.1.1.1.cmml">f</mi><mtext id="S3.E3.m1.2.2.2.2.2.2.1" xref="S3.E3.m1.2.2.2.2.2.2.1a.cmml">out</mtext><mi id="S3.E3.m1.3.3.3.3.3.3.1" xref="S3.E3.m1.3.3.3.3.3.3.1.cmml">d</mi></msubsup></mtd><mtd class="ltx_align_left" columnalign="left" id="S3.E3.m1.31.31.3c"><mrow id="S3.E3.m1.30.30.2.29.16.13"><mi id="S3.E3.m1.30.30.2.29.16.13.14" xref="S3.E3.m1.29.29.1.1.1.cmml"></mi><mo id="S3.E3.m1.4.4.4.4.1.1" xref="S3.E3.m1.4.4.4.4.1.1.cmml">=</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13"><mtext id="S3.E3.m1.5.5.5.5.2.2" xref="S3.E3.m1.5.5.5.5.2.2a.cmml">softmax</mtext><mo id="S3.E3.m1.30.30.2.29.16.13.13.2" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13.3"><mo id="S3.E3.m1.6.6.6.6.3.3" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mfrac id="S3.E3.m1.7.7.7.7.4.4" xref="S3.E3.m1.7.7.7.7.4.4.cmml"><mrow id="S3.E3.m1.7.7.7.7.4.4.2" xref="S3.E3.m1.7.7.7.7.4.4.2.cmml"><mrow id="S3.E3.m1.7.7.7.7.4.4.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml"><mo id="S3.E3.m1.7.7.7.7.4.4.1.1.1.2" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml"><msub 
id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3a.cmml">sty</mtext></msub><mo id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3.cmml">Q</mi><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.7.7.7.7.4.4.1.1.1.3" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.7.7.7.7.4.4.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.3.cmml">⁢</mo><msup id="S3.E3.m1.7.7.7.7.4.4.2.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.cmml"><mrow id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml"><mo id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.2" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml"><msub id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3a.cmml">ref</mtext></msub><mo id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3.cmml">K</mi><mi 
id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.3" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml">)</mo></mrow><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.3.cmml">T</mi></msup></mrow><msqrt id="S3.E3.m1.7.7.7.7.4.4.4" xref="S3.E3.m1.7.7.7.7.4.4.4.cmml"><msubsup id="S3.E3.m1.7.7.7.7.4.4.4.2" xref="S3.E3.m1.7.7.7.7.4.4.4.2.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.4.2.2.2" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.2.cmml">d</mi><mi id="S3.E3.m1.7.7.7.7.4.4.4.2.2.3" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.3.cmml">k</mi><mi id="S3.E3.m1.7.7.7.7.4.4.4.2.3" xref="S3.E3.m1.7.7.7.7.4.4.4.2.3.cmml">d</mi></msubsup></msqrt></mfrac><mo id="S3.E3.m1.8.8.8.8.5.5" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.30.30.2.29.16.13.13.2a" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13.1.1"><mo id="S3.E3.m1.9.9.9.9.6.6" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1"><msub id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1.2"><mi id="S3.E3.m1.10.10.10.10.7.7" xref="S3.E3.m1.10.10.10.10.7.7.cmml">f</mi><mtext id="S3.E3.m1.11.11.11.11.8.8.1" xref="S3.E3.m1.11.11.11.11.8.8.1a.cmml">ref</mtext></msub><mo id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1.1" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1.3"><mi id="S3.E3.m1.12.12.12.12.9.9" xref="S3.E3.m1.12.12.12.12.9.9.cmml">W</mi><mi id="S3.E3.m1.13.13.13.13.10.10.1" xref="S3.E3.m1.13.13.13.13.10.10.1.cmml">V</mi><mi id="S3.E3.m1.14.14.14.14.11.11.1" xref="S3.E3.m1.14.14.14.14.11.11.1.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.15.15.15.15.12.12" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow></mrow></mrow></mtd></mtr><mtr id="S3.E3.m1.31.31.3d"><mtd id="S3.E3.m1.31.31.3e" xref="S3.E3.m1.29.29.1.1.1.cmml"></mtd><mtd class="ltx_align_left" columnalign="left" id="S3.E3.m1.31.31.3f"><mrow 
id="S3.E3.m1.31.31.3.30.14.14.14"><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1"><mo id="S3.E3.m1.31.31.3.30.14.14.14.1a" xref="S3.E3.m1.29.29.1.1.1.cmml">+</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1"><mtext id="S3.E3.m1.17.17.17.2.2.2" xref="S3.E3.m1.17.17.17.2.2.2a.cmml">softmax</mtext><mo id="S3.E3.m1.31.31.3.30.14.14.14.1.1.2" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1.3"><mo id="S3.E3.m1.18.18.18.3.3.3" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mfrac id="S3.E3.m1.19.19.19.4.4.4" xref="S3.E3.m1.19.19.19.4.4.4.cmml"><mrow id="S3.E3.m1.19.19.19.4.4.4.2" xref="S3.E3.m1.19.19.19.4.4.4.2.cmml"><mrow id="S3.E3.m1.19.19.19.4.4.4.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml"><mo id="S3.E3.m1.19.19.19.4.4.4.1.1.1.2" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml"><msub id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3a.cmml">sty</mtext></msub><mo id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3.cmml">Q</mi><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.19.19.19.4.4.4.1.1.1.3" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.19.19.19.4.4.4.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.3.cmml">⁢</mo><msup id="S3.E3.m1.19.19.19.4.4.4.2.2" 
xref="S3.E3.m1.19.19.19.4.4.4.2.2.cmml"><mrow id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml"><mo id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.2" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml"><msub id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3a.cmml">txt</mtext></msub><mo id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3.cmml">K</mi><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.3" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml">)</mo></mrow><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.3.cmml">T</mi></msup></mrow><msqrt id="S3.E3.m1.19.19.19.4.4.4.4" xref="S3.E3.m1.19.19.19.4.4.4.4.cmml"><msubsup id="S3.E3.m1.19.19.19.4.4.4.4.2" xref="S3.E3.m1.19.19.19.4.4.4.4.2.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.4.2.2.2" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.2.cmml">d</mi><mi id="S3.E3.m1.19.19.19.4.4.4.4.2.2.3" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.3.cmml">k</mi><mi id="S3.E3.m1.19.19.19.4.4.4.4.2.3" xref="S3.E3.m1.19.19.19.4.4.4.4.2.3.cmml">d</mi></msubsup></msqrt></mfrac><mo id="S3.E3.m1.20.20.20.5.5.5" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.31.31.3.30.14.14.14.1.1.2a" 
xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1"><mo id="S3.E3.m1.21.21.21.6.6.6" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1"><msub id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1.2"><mi id="S3.E3.m1.22.22.22.7.7.7" xref="S3.E3.m1.22.22.22.7.7.7.cmml">f</mi><mtext id="S3.E3.m1.23.23.23.8.8.8.1" xref="S3.E3.m1.23.23.23.8.8.8.1a.cmml">txt</mtext></msub><mo id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1.1" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1.3"><mi id="S3.E3.m1.24.24.24.9.9.9" xref="S3.E3.m1.24.24.24.9.9.9.cmml">W</mi><mi id="S3.E3.m1.25.25.25.10.10.10.1" xref="S3.E3.m1.25.25.25.10.10.10.1.cmml">V</mi><mi id="S3.E3.m1.26.26.26.11.11.11.1" xref="S3.E3.m1.26.26.26.11.11.11.1.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.27.27.27.12.12.12" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E3.m1.28.28.28.13.13.13" xref="S3.E3.m1.29.29.1.1.1.cmml">,</mo></mrow></mtd></mtr></mtable><annotation-xml encoding="MathML-Content" id="S3.E3.m1.31b"><apply id="S3.E3.m1.29.29.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><eq id="S3.E3.m1.4.4.4.4.1.1.cmml" xref="S3.E3.m1.4.4.4.4.1.1"></eq><apply id="S3.E3.m1.29.29.1.1.1.4.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.4.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">superscript</csymbol><apply id="S3.E3.m1.29.29.1.1.1.4.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.4.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.1.1.1.1.1.1">𝑓</ci><ci id="S3.E3.m1.2.2.2.2.2.2.1a.cmml" xref="S3.E3.m1.2.2.2.2.2.2.1"><mtext id="S3.E3.m1.2.2.2.2.2.2.1.cmml" mathsize="70%" xref="S3.E3.m1.2.2.2.2.2.2.1">out</mtext></ci></apply><ci id="S3.E3.m1.3.3.3.3.3.3.1.cmml" xref="S3.E3.m1.3.3.3.3.3.3.1">𝑑</ci></apply><apply 
id="S3.E3.m1.29.29.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><plus id="S3.E3.m1.16.16.16.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></plus><apply id="S3.E3.m1.29.29.1.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><ci id="S3.E3.m1.5.5.5.5.2.2a.cmml" xref="S3.E3.m1.5.5.5.5.2.2"><mtext id="S3.E3.m1.5.5.5.5.2.2.cmml" xref="S3.E3.m1.5.5.5.5.2.2">softmax</mtext></ci><apply id="S3.E3.m1.7.7.7.7.4.4.cmml" xref="S3.E3.m1.7.7.7.7.4.4"><divide id="S3.E3.m1.7.7.7.7.4.4.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4"></divide><apply id="S3.E3.m1.7.7.7.7.4.4.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2"><times id="S3.E3.m1.7.7.7.7.4.4.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.3"></times><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1"><times id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1"></times><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3a.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3"><mtext id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3">sty</mtext></ci></apply><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2">𝑊</ci><ci 
id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3">𝑑</ci></apply></apply><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1"><times id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1"></times><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3a.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3"><mtext id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3">ref</mtext></ci></apply><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3">𝐾</ci></apply><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3">𝑑</ci></apply></apply><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.3">𝑇</ci></apply></apply><apply 
id="S3.E3.m1.7.7.7.7.4.4.4.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4"><root id="S3.E3.m1.7.7.7.7.4.4.4a.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4"></root><apply id="S3.E3.m1.7.7.7.7.4.4.4.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.4.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.4.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.4.2.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.4.2.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.2">𝑑</ci><ci id="S3.E3.m1.7.7.7.7.4.4.4.2.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.3">𝑘</ci></apply><ci id="S3.E3.m1.7.7.7.7.4.4.4.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2.3">𝑑</ci></apply></apply></apply><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.10.10.10.10.7.7.cmml" xref="S3.E3.m1.10.10.10.10.7.7">𝑓</ci><ci id="S3.E3.m1.11.11.11.11.8.8.1a.cmml" xref="S3.E3.m1.11.11.11.11.8.8.1"><mtext id="S3.E3.m1.11.11.11.11.8.8.1.cmml" mathsize="70%" xref="S3.E3.m1.11.11.11.11.8.8.1">ref</mtext></ci></apply><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">superscript</csymbol><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.12.12.12.12.9.9.cmml" xref="S3.E3.m1.12.12.12.12.9.9">𝑊</ci><ci 
id="S3.E3.m1.13.13.13.13.10.10.1.cmml" xref="S3.E3.m1.13.13.13.13.10.10.1">𝑉</ci></apply><ci id="S3.E3.m1.14.14.14.14.11.11.1.cmml" xref="S3.E3.m1.14.14.14.14.11.11.1">𝑑</ci></apply></apply></apply><apply id="S3.E3.m1.29.29.1.1.1.2.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.2.2.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><ci id="S3.E3.m1.17.17.17.2.2.2a.cmml" xref="S3.E3.m1.17.17.17.2.2.2"><mtext id="S3.E3.m1.17.17.17.2.2.2.cmml" xref="S3.E3.m1.17.17.17.2.2.2">softmax</mtext></ci><apply id="S3.E3.m1.19.19.19.4.4.4.cmml" xref="S3.E3.m1.19.19.19.4.4.4"><divide id="S3.E3.m1.19.19.19.4.4.4.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4"></divide><apply id="S3.E3.m1.19.19.19.4.4.4.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2"><times id="S3.E3.m1.19.19.19.4.4.4.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.3"></times><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1"><times id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1"></times><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3a.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3"><mtext id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3">sty</mtext></ci></apply><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.1.cmml" 
xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3">𝑑</ci></apply></apply><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1"><times id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1"></times><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3a.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3"><mtext id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3">txt</mtext></ci></apply><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3.cmml" 
xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3">𝐾</ci></apply><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3">𝑑</ci></apply></apply><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.3">𝑇</ci></apply></apply><apply id="S3.E3.m1.19.19.19.4.4.4.4.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4"><root id="S3.E3.m1.19.19.19.4.4.4.4a.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4"></root><apply id="S3.E3.m1.19.19.19.4.4.4.4.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.4.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.4.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.4.2.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.4.2.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.2">𝑑</ci><ci id="S3.E3.m1.19.19.19.4.4.4.4.2.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.3">𝑘</ci></apply><ci id="S3.E3.m1.19.19.19.4.4.4.4.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2.3">𝑑</ci></apply></apply></apply><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.22.22.22.7.7.7.cmml" xref="S3.E3.m1.22.22.22.7.7.7">𝑓</ci><ci id="S3.E3.m1.23.23.23.8.8.8.1a.cmml" xref="S3.E3.m1.23.23.23.8.8.8.1"><mtext id="S3.E3.m1.23.23.23.8.8.8.1.cmml" mathsize="70%" xref="S3.E3.m1.23.23.23.8.8.8.1">txt</mtext></ci></apply><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.1.cmml" 
xref="S3.E3.m1.30.30.2.29.16.13.14">superscript</csymbol><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.24.24.24.9.9.9.cmml" xref="S3.E3.m1.24.24.24.9.9.9">𝑊</ci><ci id="S3.E3.m1.25.25.25.10.10.10.1.cmml" xref="S3.E3.m1.25.25.25.10.10.10.1">𝑉</ci></apply><ci id="S3.E3.m1.26.26.26.11.11.11.1.cmml" xref="S3.E3.m1.26.26.26.11.11.11.1">𝑑</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E3.m1.31c">\begin{split}f_{\text{out}}^{d}&amp;=\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q% }^{d})(f_{\text{ref}}W_{K}^{d})^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{ref}}W_% {V}^{d})\\ &amp;+\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q}^{d})(f_{\text{txt}}W_{K}^{d})% ^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{txt}}W_{V}^{d}),\end{split}</annotation><annotation encoding="application/x-llamapun" id="S3.E3.m1.31d">start_ROW start_CELL italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_CELL start_CELL = softmax ( divide start_ARG ( italic_f start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) ( italic_f start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG end_ARG ) ( italic_f start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + softmax ( divide 
start_ARG ( italic_f start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) ( italic_f start_POSTSUBSCRIPT txt end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG end_ARG ) ( italic_f start_POSTSUBSCRIPT txt end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) , end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS2.p2.10">where <math alttext="f_{\text{ref}}" class="ltx_Math" display="inline" id="S3.SS2.p2.3.m1.1"><semantics id="S3.SS2.p2.3.m1.1a"><msub id="S3.SS2.p2.3.m1.1.1" xref="S3.SS2.p2.3.m1.1.1.cmml"><mi id="S3.SS2.p2.3.m1.1.1.2" xref="S3.SS2.p2.3.m1.1.1.2.cmml">f</mi><mtext id="S3.SS2.p2.3.m1.1.1.3" xref="S3.SS2.p2.3.m1.1.1.3a.cmml">ref</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.3.m1.1b"><apply id="S3.SS2.p2.3.m1.1.1.cmml" xref="S3.SS2.p2.3.m1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m1.1.1.1.cmml" xref="S3.SS2.p2.3.m1.1.1">subscript</csymbol><ci id="S3.SS2.p2.3.m1.1.1.2.cmml" xref="S3.SS2.p2.3.m1.1.1.2">𝑓</ci><ci id="S3.SS2.p2.3.m1.1.1.3a.cmml" xref="S3.SS2.p2.3.m1.1.1.3"><mtext id="S3.SS2.p2.3.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.3.m1.1.1.3">ref</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.3.m1.1c">f_{\text{ref}}</annotation><annotation encoding="application/x-llamapun" 
id="S3.SS2.p2.3.m1.1d">italic_f start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="f_{\text{txt}}" class="ltx_Math" display="inline" id="S3.SS2.p2.4.m2.1"><semantics id="S3.SS2.p2.4.m2.1a"><msub id="S3.SS2.p2.4.m2.1.1" xref="S3.SS2.p2.4.m2.1.1.cmml"><mi id="S3.SS2.p2.4.m2.1.1.2" xref="S3.SS2.p2.4.m2.1.1.2.cmml">f</mi><mtext id="S3.SS2.p2.4.m2.1.1.3" xref="S3.SS2.p2.4.m2.1.1.3a.cmml">txt</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.4.m2.1b"><apply id="S3.SS2.p2.4.m2.1.1.cmml" xref="S3.SS2.p2.4.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.4.m2.1.1.1.cmml" xref="S3.SS2.p2.4.m2.1.1">subscript</csymbol><ci id="S3.SS2.p2.4.m2.1.1.2.cmml" xref="S3.SS2.p2.4.m2.1.1.2">𝑓</ci><ci id="S3.SS2.p2.4.m2.1.1.3a.cmml" xref="S3.SS2.p2.4.m2.1.1.3"><mtext id="S3.SS2.p2.4.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.4.m2.1.1.3">txt</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.4.m2.1c">f_{\text{txt}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.4.m2.1d">italic_f start_POSTSUBSCRIPT txt end_POSTSUBSCRIPT</annotation></semantics></math>, and <math alttext="f_{\text{sty}}" class="ltx_Math" display="inline" id="S3.SS2.p2.5.m3.1"><semantics id="S3.SS2.p2.5.m3.1a"><msub id="S3.SS2.p2.5.m3.1.1" xref="S3.SS2.p2.5.m3.1.1.cmml"><mi id="S3.SS2.p2.5.m3.1.1.2" xref="S3.SS2.p2.5.m3.1.1.2.cmml">f</mi><mtext id="S3.SS2.p2.5.m3.1.1.3" xref="S3.SS2.p2.5.m3.1.1.3a.cmml">sty</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.5.m3.1b"><apply id="S3.SS2.p2.5.m3.1.1.cmml" xref="S3.SS2.p2.5.m3.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.5.m3.1.1.1.cmml" xref="S3.SS2.p2.5.m3.1.1">subscript</csymbol><ci id="S3.SS2.p2.5.m3.1.1.2.cmml" xref="S3.SS2.p2.5.m3.1.1.2">𝑓</ci><ci id="S3.SS2.p2.5.m3.1.1.3a.cmml" xref="S3.SS2.p2.5.m3.1.1.3"><mtext id="S3.SS2.p2.5.m3.1.1.3.cmml" mathsize="70%" 
xref="S3.SS2.p2.5.m3.1.1.3">sty</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.5.m3.1c">f_{\text{sty}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.5.m3.1d">italic_f start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT</annotation></semantics></math> correspond to the features of the reference image, text prompts, and the primary-stylized avatar, respectively. <math alttext="W_{Q}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.6.m4.1"><semantics id="S3.SS2.p2.6.m4.1a"><msubsup id="S3.SS2.p2.6.m4.1.1" xref="S3.SS2.p2.6.m4.1.1.cmml"><mi id="S3.SS2.p2.6.m4.1.1.2.2" xref="S3.SS2.p2.6.m4.1.1.2.2.cmml">W</mi><mi id="S3.SS2.p2.6.m4.1.1.2.3" xref="S3.SS2.p2.6.m4.1.1.2.3.cmml">Q</mi><mi id="S3.SS2.p2.6.m4.1.1.3" xref="S3.SS2.p2.6.m4.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.6.m4.1b"><apply id="S3.SS2.p2.6.m4.1.1.cmml" xref="S3.SS2.p2.6.m4.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.6.m4.1.1.1.cmml" xref="S3.SS2.p2.6.m4.1.1">superscript</csymbol><apply id="S3.SS2.p2.6.m4.1.1.2.cmml" xref="S3.SS2.p2.6.m4.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.6.m4.1.1.2.1.cmml" xref="S3.SS2.p2.6.m4.1.1">subscript</csymbol><ci id="S3.SS2.p2.6.m4.1.1.2.2.cmml" xref="S3.SS2.p2.6.m4.1.1.2.2">𝑊</ci><ci id="S3.SS2.p2.6.m4.1.1.2.3.cmml" xref="S3.SS2.p2.6.m4.1.1.2.3">𝑄</ci></apply><ci id="S3.SS2.p2.6.m4.1.1.3.cmml" xref="S3.SS2.p2.6.m4.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.6.m4.1c">W_{Q}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.6.m4.1d">italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>, <math alttext="W_{K}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.7.m5.1"><semantics id="S3.SS2.p2.7.m5.1a"><msubsup id="S3.SS2.p2.7.m5.1.1" xref="S3.SS2.p2.7.m5.1.1.cmml"><mi id="S3.SS2.p2.7.m5.1.1.2.2" 
xref="S3.SS2.p2.7.m5.1.1.2.2.cmml">W</mi><mi id="S3.SS2.p2.7.m5.1.1.2.3" xref="S3.SS2.p2.7.m5.1.1.2.3.cmml">K</mi><mi id="S3.SS2.p2.7.m5.1.1.3" xref="S3.SS2.p2.7.m5.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.7.m5.1b"><apply id="S3.SS2.p2.7.m5.1.1.cmml" xref="S3.SS2.p2.7.m5.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.7.m5.1.1.1.cmml" xref="S3.SS2.p2.7.m5.1.1">superscript</csymbol><apply id="S3.SS2.p2.7.m5.1.1.2.cmml" xref="S3.SS2.p2.7.m5.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.7.m5.1.1.2.1.cmml" xref="S3.SS2.p2.7.m5.1.1">subscript</csymbol><ci id="S3.SS2.p2.7.m5.1.1.2.2.cmml" xref="S3.SS2.p2.7.m5.1.1.2.2">𝑊</ci><ci id="S3.SS2.p2.7.m5.1.1.2.3.cmml" xref="S3.SS2.p2.7.m5.1.1.2.3">𝐾</ci></apply><ci id="S3.SS2.p2.7.m5.1.1.3.cmml" xref="S3.SS2.p2.7.m5.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.7.m5.1c">W_{K}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.7.m5.1d">italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>, and <math alttext="W_{V}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.8.m6.1"><semantics id="S3.SS2.p2.8.m6.1a"><msubsup id="S3.SS2.p2.8.m6.1.1" xref="S3.SS2.p2.8.m6.1.1.cmml"><mi id="S3.SS2.p2.8.m6.1.1.2.2" xref="S3.SS2.p2.8.m6.1.1.2.2.cmml">W</mi><mi id="S3.SS2.p2.8.m6.1.1.2.3" xref="S3.SS2.p2.8.m6.1.1.2.3.cmml">V</mi><mi id="S3.SS2.p2.8.m6.1.1.3" xref="S3.SS2.p2.8.m6.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.8.m6.1b"><apply id="S3.SS2.p2.8.m6.1.1.cmml" xref="S3.SS2.p2.8.m6.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.8.m6.1.1.1.cmml" xref="S3.SS2.p2.8.m6.1.1">superscript</csymbol><apply id="S3.SS2.p2.8.m6.1.1.2.cmml" xref="S3.SS2.p2.8.m6.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.8.m6.1.1.2.1.cmml" xref="S3.SS2.p2.8.m6.1.1">subscript</csymbol><ci id="S3.SS2.p2.8.m6.1.1.2.2.cmml" 
xref="S3.SS2.p2.8.m6.1.1.2.2">𝑊</ci><ci id="S3.SS2.p2.8.m6.1.1.2.3.cmml" xref="S3.SS2.p2.8.m6.1.1.2.3">𝑉</ci></apply><ci id="S3.SS2.p2.8.m6.1.1.3.cmml" xref="S3.SS2.p2.8.m6.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.8.m6.1c">W_{V}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.8.m6.1d">italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math> are the weight matrices. Each cross-attention layer includes a residual connection (scaled by <math alttext="\sqrt{d_{k}^{d}}" class="ltx_Math" display="inline" id="S3.SS2.p2.9.m7.1"><semantics id="S3.SS2.p2.9.m7.1a"><msqrt id="S3.SS2.p2.9.m7.1.1" xref="S3.SS2.p2.9.m7.1.1.cmml"><msubsup id="S3.SS2.p2.9.m7.1.1.2" xref="S3.SS2.p2.9.m7.1.1.2.cmml"><mi id="S3.SS2.p2.9.m7.1.1.2.2.2" xref="S3.SS2.p2.9.m7.1.1.2.2.2.cmml">d</mi><mi id="S3.SS2.p2.9.m7.1.1.2.2.3" xref="S3.SS2.p2.9.m7.1.1.2.2.3.cmml">k</mi><mi id="S3.SS2.p2.9.m7.1.1.2.3" xref="S3.SS2.p2.9.m7.1.1.2.3.cmml">d</mi></msubsup></msqrt><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.9.m7.1b"><apply id="S3.SS2.p2.9.m7.1.1.cmml" xref="S3.SS2.p2.9.m7.1.1"><root id="S3.SS2.p2.9.m7.1.1a.cmml" xref="S3.SS2.p2.9.m7.1.1"></root><apply id="S3.SS2.p2.9.m7.1.1.2.cmml" xref="S3.SS2.p2.9.m7.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p2.9.m7.1.1.2.1.cmml" xref="S3.SS2.p2.9.m7.1.1.2">superscript</csymbol><apply id="S3.SS2.p2.9.m7.1.1.2.2.cmml" xref="S3.SS2.p2.9.m7.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p2.9.m7.1.1.2.2.1.cmml" xref="S3.SS2.p2.9.m7.1.1.2">subscript</csymbol><ci id="S3.SS2.p2.9.m7.1.1.2.2.2.cmml" xref="S3.SS2.p2.9.m7.1.1.2.2.2">𝑑</ci><ci id="S3.SS2.p2.9.m7.1.1.2.2.3.cmml" xref="S3.SS2.p2.9.m7.1.1.2.2.3">𝑘</ci></apply><ci id="S3.SS2.p2.9.m7.1.1.2.3.cmml" xref="S3.SS2.p2.9.m7.1.1.2.3">𝑑</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS2.p2.9.m7.1c">\sqrt{d_{k}^{d}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.9.m7.1d">square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG</annotation></semantics></math>) for stable gradient flow. Using the DDIM scheduler <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib53" title=""><span class="ltx_text" style="font-size:90%;">53</span></a>]</cite>, we perform only <math alttext="T=10" class="ltx_Math" display="inline" id="S3.SS2.p2.10.m8.1"><semantics id="S3.SS2.p2.10.m8.1a"><mrow id="S3.SS2.p2.10.m8.1.1" xref="S3.SS2.p2.10.m8.1.1.cmml"><mi id="S3.SS2.p2.10.m8.1.1.2" xref="S3.SS2.p2.10.m8.1.1.2.cmml">T</mi><mo id="S3.SS2.p2.10.m8.1.1.1" xref="S3.SS2.p2.10.m8.1.1.1.cmml">=</mo><mn id="S3.SS2.p2.10.m8.1.1.3" xref="S3.SS2.p2.10.m8.1.1.3.cmml">10</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.10.m8.1b"><apply id="S3.SS2.p2.10.m8.1.1.cmml" xref="S3.SS2.p2.10.m8.1.1"><eq id="S3.SS2.p2.10.m8.1.1.1.cmml" xref="S3.SS2.p2.10.m8.1.1.1"></eq><ci id="S3.SS2.p2.10.m8.1.1.2.cmml" xref="S3.SS2.p2.10.m8.1.1.2">𝑇</ci><cn id="S3.SS2.p2.10.m8.1.1.3.cmml" type="integer" xref="S3.SS2.p2.10.m8.1.1.3">10</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.10.m8.1c">T=10</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.10.m8.1d">italic_T = 10</annotation></semantics></math> denoising steps to rapidly integrate the secondary style in about one second.</p> </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.3 </span>3D Animatable Stylized Avatar Generation</h3> <div class="ltx_para ltx_noindent" id="S3.SS3.p1"> <p class="ltx_p" id="S3.SS3.p1.5"><span class="ltx_text ltx_font_bold" id="S3.SS3.p1.5.1">Expression Encoder.</span> Current avatar 
animation techniques using Gaussian Splats often depend solely on 3D Morphable Models (3DMM) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib8" title=""><span class="ltx_text" style="font-size:90%;">8</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib19" title=""><span class="ltx_text" style="font-size:90%;">19</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title=""><span class="ltx_text" style="font-size:90%;">40</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib47" title=""><span class="ltx_text" style="font-size:90%;">47</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" style="font-size:90%;">54</span></a>]</cite>, which limits generalization beyond realistic faces, especially to stylized or cartoon avatars. To overcome this limitation, we condition the 3D generation network on a combination of 3DMM features and blendshape weights derived from the Facial Action Coding System (FACS) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib13" title=""><span class="ltx_text" style="font-size:90%;">13</span></a>]</cite>, which is widely used in cartoon animation to control facial features such as eye position and mouth shape. As depicted in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S2.F2" title="Figure 2 ‣ 2 Related Work ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">2</span></a>, for generating expressive avatars, we extract expression codes <math alttext="f_{\text{mm}}\in\mathbb{R}^{100}" class="ltx_Math" display="inline" id="S3.SS3.p1.1.m1.1"><semantics id="S3.SS3.p1.1.m1.1a"><mrow id="S3.SS3.p1.1.m1.1.1" xref="S3.SS3.p1.1.m1.1.1.cmml"><msub id="S3.SS3.p1.1.m1.1.1.2" xref="S3.SS3.p1.1.m1.1.1.2.cmml"><mi id="S3.SS3.p1.1.m1.1.1.2.2" xref="S3.SS3.p1.1.m1.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p1.1.m1.1.1.2.3" xref="S3.SS3.p1.1.m1.1.1.2.3a.cmml">mm</mtext></msub><mo id="S3.SS3.p1.1.m1.1.1.1" xref="S3.SS3.p1.1.m1.1.1.1.cmml">∈</mo><msup id="S3.SS3.p1.1.m1.1.1.3" xref="S3.SS3.p1.1.m1.1.1.3.cmml"><mi id="S3.SS3.p1.1.m1.1.1.3.2" xref="S3.SS3.p1.1.m1.1.1.3.2.cmml">ℝ</mi><mn id="S3.SS3.p1.1.m1.1.1.3.3" xref="S3.SS3.p1.1.m1.1.1.3.3.cmml">100</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.1.m1.1b"><apply id="S3.SS3.p1.1.m1.1.1.cmml" xref="S3.SS3.p1.1.m1.1.1"><in id="S3.SS3.p1.1.m1.1.1.1.cmml" xref="S3.SS3.p1.1.m1.1.1.1"></in><apply id="S3.SS3.p1.1.m1.1.1.2.cmml" xref="S3.SS3.p1.1.m1.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p1.1.m1.1.1.2.1.cmml" xref="S3.SS3.p1.1.m1.1.1.2">subscript</csymbol><ci id="S3.SS3.p1.1.m1.1.1.2.2.cmml" xref="S3.SS3.p1.1.m1.1.1.2.2">𝑓</ci><ci id="S3.SS3.p1.1.m1.1.1.2.3a.cmml" xref="S3.SS3.p1.1.m1.1.1.2.3"><mtext id="S3.SS3.p1.1.m1.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p1.1.m1.1.1.2.3">mm</mtext></ci></apply><apply id="S3.SS3.p1.1.m1.1.1.3.cmml" xref="S3.SS3.p1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.1.m1.1.1.3.1.cmml" xref="S3.SS3.p1.1.m1.1.1.3">superscript</csymbol><ci id="S3.SS3.p1.1.m1.1.1.3.2.cmml" xref="S3.SS3.p1.1.m1.1.1.3.2">ℝ</ci><cn id="S3.SS3.p1.1.m1.1.1.3.3.cmml" type="integer" xref="S3.SS3.p1.1.m1.1.1.3.3">100</cn></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S3.SS3.p1.1.m1.1c">f_{\text{mm}}\in\mathbb{R}^{100}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.1.m1.1d">italic_f start_POSTSUBSCRIPT mm end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 100 end_POSTSUPERSCRIPT</annotation></semantics></math> from the driving image using a 3DMM estimator. These codes are concatenated with the blendshape vector <math alttext="f_{\text{bs}}\in\mathbb{R}^{16}" class="ltx_Math" display="inline" id="S3.SS3.p1.2.m2.1"><semantics id="S3.SS3.p1.2.m2.1a"><mrow id="S3.SS3.p1.2.m2.1.1" xref="S3.SS3.p1.2.m2.1.1.cmml"><msub id="S3.SS3.p1.2.m2.1.1.2" xref="S3.SS3.p1.2.m2.1.1.2.cmml"><mi id="S3.SS3.p1.2.m2.1.1.2.2" xref="S3.SS3.p1.2.m2.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p1.2.m2.1.1.2.3" xref="S3.SS3.p1.2.m2.1.1.2.3a.cmml">bs</mtext></msub><mo id="S3.SS3.p1.2.m2.1.1.1" xref="S3.SS3.p1.2.m2.1.1.1.cmml">∈</mo><msup id="S3.SS3.p1.2.m2.1.1.3" xref="S3.SS3.p1.2.m2.1.1.3.cmml"><mi id="S3.SS3.p1.2.m2.1.1.3.2" xref="S3.SS3.p1.2.m2.1.1.3.2.cmml">ℝ</mi><mn id="S3.SS3.p1.2.m2.1.1.3.3" xref="S3.SS3.p1.2.m2.1.1.3.3.cmml">16</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.2.m2.1b"><apply id="S3.SS3.p1.2.m2.1.1.cmml" xref="S3.SS3.p1.2.m2.1.1"><in id="S3.SS3.p1.2.m2.1.1.1.cmml" xref="S3.SS3.p1.2.m2.1.1.1"></in><apply id="S3.SS3.p1.2.m2.1.1.2.cmml" xref="S3.SS3.p1.2.m2.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p1.2.m2.1.1.2.1.cmml" xref="S3.SS3.p1.2.m2.1.1.2">subscript</csymbol><ci id="S3.SS3.p1.2.m2.1.1.2.2.cmml" xref="S3.SS3.p1.2.m2.1.1.2.2">𝑓</ci><ci id="S3.SS3.p1.2.m2.1.1.2.3a.cmml" xref="S3.SS3.p1.2.m2.1.1.2.3"><mtext id="S3.SS3.p1.2.m2.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p1.2.m2.1.1.2.3">bs</mtext></ci></apply><apply id="S3.SS3.p1.2.m2.1.1.3.cmml" xref="S3.SS3.p1.2.m2.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.2.m2.1.1.3.1.cmml" xref="S3.SS3.p1.2.m2.1.1.3">superscript</csymbol><ci id="S3.SS3.p1.2.m2.1.1.3.2.cmml" xref="S3.SS3.p1.2.m2.1.1.3.2">ℝ</ci><cn 
id="S3.SS3.p1.2.m2.1.1.3.3.cmml" type="integer" xref="S3.SS3.p1.2.m2.1.1.3.3">16</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.2.m2.1c">f_{\text{bs}}\in\mathbb{R}^{16}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.2.m2.1d">italic_f start_POSTSUBSCRIPT bs end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT</annotation></semantics></math>, producing a comprehensive expression feature. A learnable projection layer <math alttext="\mathcal{E}_{\text{proj}}" class="ltx_Math" display="inline" id="S3.SS3.p1.3.m3.1"><semantics id="S3.SS3.p1.3.m3.1a"><msub id="S3.SS3.p1.3.m3.1.1" xref="S3.SS3.p1.3.m3.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p1.3.m3.1.1.2" xref="S3.SS3.p1.3.m3.1.1.2.cmml">ℰ</mi><mtext id="S3.SS3.p1.3.m3.1.1.3" xref="S3.SS3.p1.3.m3.1.1.3a.cmml">proj</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.3.m3.1b"><apply id="S3.SS3.p1.3.m3.1.1.cmml" xref="S3.SS3.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.3.m3.1.1.1.cmml" xref="S3.SS3.p1.3.m3.1.1">subscript</csymbol><ci id="S3.SS3.p1.3.m3.1.1.2.cmml" xref="S3.SS3.p1.3.m3.1.1.2">ℰ</ci><ci id="S3.SS3.p1.3.m3.1.1.3a.cmml" xref="S3.SS3.p1.3.m3.1.1.3"><mtext id="S3.SS3.p1.3.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.3.m3.1.1.3">proj</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.3.m3.1c">\mathcal{E}_{\text{proj}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.3.m3.1d">caligraphic_E start_POSTSUBSCRIPT proj end_POSTSUBSCRIPT</annotation></semantics></math> then projects this combined feature into a 16-dimensional expression vector <math alttext="f_{\text{exp}}=\mathcal{E}_{\text{proj}}([f_{\text{bs}};f_{\text{mm}}])" class="ltx_Math" display="inline" id="S3.SS3.p1.4.m4.1"><semantics id="S3.SS3.p1.4.m4.1a"><mrow id="S3.SS3.p1.4.m4.1.1" xref="S3.SS3.p1.4.m4.1.1.cmml"><msub id="S3.SS3.p1.4.m4.1.1.3" 
xref="S3.SS3.p1.4.m4.1.1.3.cmml"><mi id="S3.SS3.p1.4.m4.1.1.3.2" xref="S3.SS3.p1.4.m4.1.1.3.2.cmml">f</mi><mtext id="S3.SS3.p1.4.m4.1.1.3.3" xref="S3.SS3.p1.4.m4.1.1.3.3a.cmml">exp</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.2" xref="S3.SS3.p1.4.m4.1.1.2.cmml">=</mo><mrow id="S3.SS3.p1.4.m4.1.1.1" xref="S3.SS3.p1.4.m4.1.1.1.cmml"><msub id="S3.SS3.p1.4.m4.1.1.1.3" xref="S3.SS3.p1.4.m4.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p1.4.m4.1.1.1.3.2" xref="S3.SS3.p1.4.m4.1.1.1.3.2.cmml">ℰ</mi><mtext id="S3.SS3.p1.4.m4.1.1.1.3.3" xref="S3.SS3.p1.4.m4.1.1.1.3.3a.cmml">proj</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.1.2" xref="S3.SS3.p1.4.m4.1.1.1.2.cmml">⁢</mo><mrow id="S3.SS3.p1.4.m4.1.1.1.1.1" xref="S3.SS3.p1.4.m4.1.1.1.cmml"><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.2" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.cmml">(</mo><mrow id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml"><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.3" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml">[</mo><msub id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.cmml"><mi id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2.cmml">f</mi><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3a.cmml">bs</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.4" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml">;</mo><msub id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.cmml"><mi id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2.cmml">f</mi><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3a.cmml">mm</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.5" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml">]</mo></mrow><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.3" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.4.m4.1b"><apply 
id="S3.SS3.p1.4.m4.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1"><eq id="S3.SS3.p1.4.m4.1.1.2.cmml" xref="S3.SS3.p1.4.m4.1.1.2"></eq><apply id="S3.SS3.p1.4.m4.1.1.3.cmml" xref="S3.SS3.p1.4.m4.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.3.1.cmml" xref="S3.SS3.p1.4.m4.1.1.3">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.3.2.cmml" xref="S3.SS3.p1.4.m4.1.1.3.2">𝑓</ci><ci id="S3.SS3.p1.4.m4.1.1.3.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.3.3"><mtext id="S3.SS3.p1.4.m4.1.1.3.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.3.3">exp</mtext></ci></apply><apply id="S3.SS3.p1.4.m4.1.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1"><times id="S3.SS3.p1.4.m4.1.1.1.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.2"></times><apply id="S3.SS3.p1.4.m4.1.1.1.3.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.1.3.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.1.3.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3.2">ℰ</ci><ci id="S3.SS3.p1.4.m4.1.1.1.3.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3.3"><mtext id="S3.SS3.p1.4.m4.1.1.1.3.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.1.3.3">proj</mtext></ci></apply><list id="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2"><apply id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2">𝑓</ci><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3"><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3">bs</mtext></ci></apply><apply id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2.cmml" 
xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2">𝑓</ci><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3"><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3">mm</mtext></ci></apply></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.4.m4.1c">f_{\text{exp}}=\mathcal{E}_{\text{proj}}([f_{\text{bs}};f_{\text{mm}}])</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.4.m4.1d">italic_f start_POSTSUBSCRIPT exp end_POSTSUBSCRIPT = caligraphic_E start_POSTSUBSCRIPT proj end_POSTSUBSCRIPT ( [ italic_f start_POSTSUBSCRIPT bs end_POSTSUBSCRIPT ; italic_f start_POSTSUBSCRIPT mm end_POSTSUBSCRIPT ] )</annotation></semantics></math>, where <math alttext="[\cdot]" class="ltx_Math" display="inline" id="S3.SS3.p1.5.m5.1"><semantics id="S3.SS3.p1.5.m5.1a"><mrow id="S3.SS3.p1.5.m5.1.2.2" xref="S3.SS3.p1.5.m5.1.2.1.cmml"><mo id="S3.SS3.p1.5.m5.1.2.2.1" stretchy="false" xref="S3.SS3.p1.5.m5.1.2.1.1.cmml">[</mo><mo id="S3.SS3.p1.5.m5.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p1.5.m5.1.1.cmml">⋅</mo><mo id="S3.SS3.p1.5.m5.1.2.2.2" stretchy="false" xref="S3.SS3.p1.5.m5.1.2.1.1.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.5.m5.1b"><apply id="S3.SS3.p1.5.m5.1.2.1.cmml" xref="S3.SS3.p1.5.m5.1.2.2"><csymbol cd="latexml" id="S3.SS3.p1.5.m5.1.2.1.1.cmml" xref="S3.SS3.p1.5.m5.1.2.2.1">delimited-[]</csymbol><ci id="S3.SS3.p1.5.m5.1.1.cmml" xref="S3.SS3.p1.5.m5.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.5.m5.1c">[\cdot]</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.5.m5.1d">[ ⋅ ]</annotation></semantics></math> indicates feature concatenation. 
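</p> <p class="ltx_p">As a rough illustration of the dimensions involved, the concatenate-then-project step described above can be sketched in plain Python. This is a minimal sketch and not the authors' implementation: it assumes the projection is a single linear layer, uses random weights in place of learned ones, and the names make_projection, project, and encode_expression are hypothetical. Only the sizes 100, 16, and 16 are taken from the text.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import random

MM_DIM = 100   # size of the 3DMM expression code f_mm (from the text)
BS_DIM = 16    # size of the FACS blendshape vector f_bs (from the text)
EXP_DIM = 16   # size of the projected expression vector f_exp (from the text)


def make_projection(in_dim, out_dim, seed=0):
    """Return (weights, bias) for a hypothetical single linear layer.

    The random values only stand in for parameters that would be learned.
    """
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(features, weights, bias):
    """Apply y = W x + b to a flat list of floats."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def encode_expression(f_bs, f_mm, weights, bias):
    """Compute f_exp = E_proj([f_bs; f_mm]) under the assumptions above."""
    if len(f_bs) != BS_DIM or len(f_mm) != MM_DIM:
        raise ValueError("expected a 16-d blendshape vector and a 100-d 3DMM code")
    combined = list(f_bs) + list(f_mm)  # concatenation [f_bs; f_mm], 116 values
    return project(combined, weights, bias)


# Example call with placeholder inputs (all zeros).
weights, bias = make_projection(BS_DIM + MM_DIM, EXP_DIM)
f_exp = encode_expression([0.0] * BS_DIM, [0.0] * MM_DIM, weights, bias)
```
</pre>
<p class="ltx_p">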
To integrate expressiveness with identity, the driving signal is formulated as:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E4"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="f_{\text{drive}}=(\mathcal{E}_{\text{mlp}}([f_{\text{exp}},f_{\text{id}}]),f_{% \text{pos}})." class="ltx_Math" display="block" id="S3.E4.m1.1"><semantics id="S3.E4.m1.1a"><mrow id="S3.E4.m1.1.1.1" xref="S3.E4.m1.1.1.1.1.cmml"><mrow id="S3.E4.m1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.cmml"><msub id="S3.E4.m1.1.1.1.1.4" xref="S3.E4.m1.1.1.1.1.4.cmml"><mi id="S3.E4.m1.1.1.1.1.4.2" xref="S3.E4.m1.1.1.1.1.4.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.4.3" xref="S3.E4.m1.1.1.1.1.4.3a.cmml">drive</mtext></msub><mo id="S3.E4.m1.1.1.1.1.3" xref="S3.E4.m1.1.1.1.1.3.cmml">=</mo><mrow id="S3.E4.m1.1.1.1.1.2.2" xref="S3.E4.m1.1.1.1.1.2.3.cmml"><mo id="S3.E4.m1.1.1.1.1.2.2.3" stretchy="false" xref="S3.E4.m1.1.1.1.1.2.3.cmml">(</mo><mrow id="S3.E4.m1.1.1.1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml"><msub id="S3.E4.m1.1.1.1.1.1.1.1.3" xref="S3.E4.m1.1.1.1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E4.m1.1.1.1.1.1.1.1.3.2" xref="S3.E4.m1.1.1.1.1.1.1.1.3.2.cmml">ℰ</mi><mtext id="S3.E4.m1.1.1.1.1.1.1.1.3.3" xref="S3.E4.m1.1.1.1.1.1.1.1.3.3a.cmml">mlp</mtext></msub><mo id="S3.E4.m1.1.1.1.1.1.1.1.2" xref="S3.E4.m1.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E4.m1.1.1.1.1.1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml"><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml"><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.3" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">[</mo><msub id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2" 
xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml">exp</mtext></msub><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.4" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">,</mo><msub id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml"><mi id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3a.cmml">id</mtext></msub><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.5" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">]</mo></mrow><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S3.E4.m1.1.1.1.1.2.2.4" xref="S3.E4.m1.1.1.1.1.2.3.cmml">,</mo><msub id="S3.E4.m1.1.1.1.1.2.2.2" xref="S3.E4.m1.1.1.1.1.2.2.2.cmml"><mi id="S3.E4.m1.1.1.1.1.2.2.2.2" xref="S3.E4.m1.1.1.1.1.2.2.2.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.2.2.2.3" xref="S3.E4.m1.1.1.1.1.2.2.2.3a.cmml">pos</mtext></msub><mo id="S3.E4.m1.1.1.1.1.2.2.5" stretchy="false" xref="S3.E4.m1.1.1.1.1.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E4.m1.1.1.1.2" lspace="0em" xref="S3.E4.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E4.m1.1b"><apply id="S3.E4.m1.1.1.1.1.cmml" xref="S3.E4.m1.1.1.1"><eq id="S3.E4.m1.1.1.1.1.3.cmml" xref="S3.E4.m1.1.1.1.1.3"></eq><apply id="S3.E4.m1.1.1.1.1.4.cmml" xref="S3.E4.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.4.1.cmml" xref="S3.E4.m1.1.1.1.1.4">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.4.2.cmml" xref="S3.E4.m1.1.1.1.1.4.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.4.3a.cmml" xref="S3.E4.m1.1.1.1.1.4.3"><mtext id="S3.E4.m1.1.1.1.1.4.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.4.3">drive</mtext></ci></apply><interval closure="open" id="S3.E4.m1.1.1.1.1.2.3.cmml" xref="S3.E4.m1.1.1.1.1.2.2"><apply id="S3.E4.m1.1.1.1.1.1.1.1.cmml" 
xref="S3.E4.m1.1.1.1.1.1.1.1"><times id="S3.E4.m1.1.1.1.1.1.1.1.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.2"></times><apply id="S3.E4.m1.1.1.1.1.1.1.1.3.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3.2">ℰ</ci><ci id="S3.E4.m1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E4.m1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.1.1.1.3.3">mlp</mtext></ci></apply><interval closure="closed" id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2"><apply id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3">exp</mtext></ci></apply><apply id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3"><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3">id</mtext></ci></apply></interval></apply><apply id="S3.E4.m1.1.1.1.1.2.2.2.cmml" xref="S3.E4.m1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.2.2.2.1.cmml" xref="S3.E4.m1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.2.2.2.2.cmml" 
xref="S3.E4.m1.1.1.1.1.2.2.2.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.2.2.2.3a.cmml" xref="S3.E4.m1.1.1.1.1.2.2.2.3"><mtext id="S3.E4.m1.1.1.1.1.2.2.2.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.2.2.2.3">pos</mtext></ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E4.m1.1c">f_{\text{drive}}=(\mathcal{E}_{\text{mlp}}([f_{\text{exp}},f_{\text{id}}]),f_{% \text{pos}}).</annotation><annotation encoding="application/x-llamapun" id="S3.E4.m1.1d">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT = ( caligraphic_E start_POSTSUBSCRIPT mlp end_POSTSUBSCRIPT ( [ italic_f start_POSTSUBSCRIPT exp end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT id end_POSTSUBSCRIPT ] ) , italic_f start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS3.p1.8">Here, <math alttext="f_{\text{id}}" class="ltx_Math" display="inline" id="S3.SS3.p1.6.m1.1"><semantics id="S3.SS3.p1.6.m1.1a"><msub id="S3.SS3.p1.6.m1.1.1" xref="S3.SS3.p1.6.m1.1.1.cmml"><mi id="S3.SS3.p1.6.m1.1.1.2" xref="S3.SS3.p1.6.m1.1.1.2.cmml">f</mi><mtext id="S3.SS3.p1.6.m1.1.1.3" xref="S3.SS3.p1.6.m1.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.6.m1.1b"><apply id="S3.SS3.p1.6.m1.1.1.cmml" xref="S3.SS3.p1.6.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.6.m1.1.1.1.cmml" xref="S3.SS3.p1.6.m1.1.1">subscript</csymbol><ci id="S3.SS3.p1.6.m1.1.1.2.cmml" xref="S3.SS3.p1.6.m1.1.1.2">𝑓</ci><ci id="S3.SS3.p1.6.m1.1.1.3a.cmml" xref="S3.SS3.p1.6.m1.1.1.3"><mtext id="S3.SS3.p1.6.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.6.m1.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.6.m1.1c">f_{\text{id}}</annotation><annotation 
encoding="application/x-llamapun" id="S3.SS3.p1.6.m1.1d">italic_f start_POSTSUBSCRIPT id end_POSTSUBSCRIPT</annotation></semantics></math> is the global identity feature extracted from a reference image <math alttext="I_{r}" class="ltx_Math" display="inline" id="S3.SS3.p1.7.m2.1"><semantics id="S3.SS3.p1.7.m2.1a"><msub id="S3.SS3.p1.7.m2.1.1" xref="S3.SS3.p1.7.m2.1.1.cmml"><mi id="S3.SS3.p1.7.m2.1.1.2" xref="S3.SS3.p1.7.m2.1.1.2.cmml">I</mi><mi id="S3.SS3.p1.7.m2.1.1.3" xref="S3.SS3.p1.7.m2.1.1.3.cmml">r</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.7.m2.1b"><apply id="S3.SS3.p1.7.m2.1.1.cmml" xref="S3.SS3.p1.7.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.7.m2.1.1.1.cmml" xref="S3.SS3.p1.7.m2.1.1">subscript</csymbol><ci id="S3.SS3.p1.7.m2.1.1.2.cmml" xref="S3.SS3.p1.7.m2.1.1.2">𝐼</ci><ci id="S3.SS3.p1.7.m2.1.1.3.cmml" xref="S3.SS3.p1.7.m2.1.1.3">𝑟</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.7.m2.1c">I_{r}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.7.m2.1d">italic_I start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT</annotation></semantics></math> via a frozen DINOv2 backbone <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib38" title=""><span class="ltx_text" style="font-size:90%;">38</span></a>]</cite>, and <math alttext="f_{\text{pos}}" class="ltx_Math" display="inline" id="S3.SS3.p1.8.m3.1"><semantics id="S3.SS3.p1.8.m3.1a"><msub id="S3.SS3.p1.8.m3.1.1" xref="S3.SS3.p1.8.m3.1.1.cmml"><mi id="S3.SS3.p1.8.m3.1.1.2" xref="S3.SS3.p1.8.m3.1.1.2.cmml">f</mi><mtext id="S3.SS3.p1.8.m3.1.1.3" xref="S3.SS3.p1.8.m3.1.1.3a.cmml">pos</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.8.m3.1b"><apply id="S3.SS3.p1.8.m3.1.1.cmml" xref="S3.SS3.p1.8.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.8.m3.1.1.1.cmml" xref="S3.SS3.p1.8.m3.1.1">subscript</csymbol><ci id="S3.SS3.p1.8.m3.1.1.2.cmml" 
xref="S3.SS3.p1.8.m3.1.1.2">𝑓</ci><ci id="S3.SS3.p1.8.m3.1.1.3a.cmml" xref="S3.SS3.p1.8.m3.1.1.3"><mtext id="S3.SS3.p1.8.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.8.m3.1.1.3">pos</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.8.m3.1c">f_{\text{pos}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.8.m3.1d">italic_f start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT</annotation></semantics></math> denotes the position map from 3DMM vertices.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p2"> <p class="ltx_p" id="S3.SS3.p2.3"><span class="ltx_text ltx_font_bold" id="S3.SS3.p2.1.1">3D Generation Network <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.1.1.m1.1"><semantics id="S3.SS3.p2.1.1.m1.1a"><mrow id="S3.SS3.p2.1.1.m1.1.2" xref="S3.SS3.p2.1.1.m1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.1.1.m1.1.2.2" xref="S3.SS3.p2.1.1.m1.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.1.1.m1.1.2.1" xref="S3.SS3.p2.1.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.1.1.m1.1.2.3.2" xref="S3.SS3.p2.1.1.m1.1.2.cmml"><mo id="S3.SS3.p2.1.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.1.1.m1.1.2.cmml">(</mo><mo id="S3.SS3.p2.1.1.m1.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.1.1.m1.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.1.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.1.1.m1.1b"><apply id="S3.SS3.p2.1.1.m1.1.2.cmml" xref="S3.SS3.p2.1.1.m1.1.2"><times id="S3.SS3.p2.1.1.m1.1.2.1.cmml" xref="S3.SS3.p2.1.1.m1.1.2.1"></times><ci id="S3.SS3.p2.1.1.m1.1.2.2.cmml" xref="S3.SS3.p2.1.1.m1.1.2.2">𝒢</ci><ci id="S3.SS3.p2.1.1.m1.1.1.cmml" xref="S3.SS3.p2.1.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.1.1.m1.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.1.1.m1.1d">caligraphic_G ( ⋅ 
)</annotation></semantics></math>.</span> Given the generated unposed avatars <math alttext="I_{\text{unposed}}" class="ltx_Math" display="inline" id="S3.SS3.p2.2.m1.1"><semantics id="S3.SS3.p2.2.m1.1a"><msub id="S3.SS3.p2.2.m1.1.1" xref="S3.SS3.p2.2.m1.1.1.cmml"><mi id="S3.SS3.p2.2.m1.1.1.2" xref="S3.SS3.p2.2.m1.1.1.2.cmml">I</mi><mtext id="S3.SS3.p2.2.m1.1.1.3" xref="S3.SS3.p2.2.m1.1.1.3a.cmml">unposed</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.2.m1.1b"><apply id="S3.SS3.p2.2.m1.1.1.cmml" xref="S3.SS3.p2.2.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.2.m1.1.1.1.cmml" xref="S3.SS3.p2.2.m1.1.1">subscript</csymbol><ci id="S3.SS3.p2.2.m1.1.1.2.cmml" xref="S3.SS3.p2.2.m1.1.1.2">𝐼</ci><ci id="S3.SS3.p2.2.m1.1.1.3a.cmml" xref="S3.SS3.p2.2.m1.1.1.3"><mtext id="S3.SS3.p2.2.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p2.2.m1.1.1.3">unposed</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.2.m1.1c">I_{\text{unposed}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.2.m1.1d">italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT</annotation></semantics></math> and driving features <math alttext="f_{\text{drive}}" class="ltx_Math" display="inline" id="S3.SS3.p2.3.m2.1"><semantics id="S3.SS3.p2.3.m2.1a"><msub id="S3.SS3.p2.3.m2.1.1" xref="S3.SS3.p2.3.m2.1.1.cmml"><mi id="S3.SS3.p2.3.m2.1.1.2" xref="S3.SS3.p2.3.m2.1.1.2.cmml">f</mi><mtext id="S3.SS3.p2.3.m2.1.1.3" xref="S3.SS3.p2.3.m2.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.3.m2.1b"><apply id="S3.SS3.p2.3.m2.1.1.cmml" xref="S3.SS3.p2.3.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.3.m2.1.1.1.cmml" xref="S3.SS3.p2.3.m2.1.1">subscript</csymbol><ci id="S3.SS3.p2.3.m2.1.1.2.cmml" xref="S3.SS3.p2.3.m2.1.1.2">𝑓</ci><ci id="S3.SS3.p2.3.m2.1.1.3a.cmml" xref="S3.SS3.p2.3.m2.1.1.3"><mtext id="S3.SS3.p2.3.m2.1.1.3.cmml" mathsize="70%" 
xref="S3.SS3.p2.3.m2.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.3.m2.1c">f_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.3.m2.1d">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math> from the expression encoder, we employ an asymmetric U-Net architecture akin to Large Multi-view Gaussian Models <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite> and incorporate cross-attention layers to merge the driving features seamlessly:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E5"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="I_{\text{posed}}=\mathcal{E}_{\text{2D}}^{\text{render}}(\mathcal{G}(I_{\text{% unposed}},f_{\text{drive}};\Phi_{g}))," class="ltx_Math" display="block" id="S3.E5.m1.1"><semantics id="S3.E5.m1.1a"><mrow id="S3.E5.m1.1.1.1" xref="S3.E5.m1.1.1.1.1.cmml"><mrow id="S3.E5.m1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.cmml"><msub id="S3.E5.m1.1.1.1.1.3" xref="S3.E5.m1.1.1.1.1.3.cmml"><mi id="S3.E5.m1.1.1.1.1.3.2" xref="S3.E5.m1.1.1.1.1.3.2.cmml">I</mi><mtext id="S3.E5.m1.1.1.1.1.3.3" xref="S3.E5.m1.1.1.1.1.3.3a.cmml">posed</mtext></msub><mo id="S3.E5.m1.1.1.1.1.2" xref="S3.E5.m1.1.1.1.1.2.cmml">=</mo><mrow id="S3.E5.m1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.cmml"><msubsup id="S3.E5.m1.1.1.1.1.1.3" xref="S3.E5.m1.1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E5.m1.1.1.1.1.1.3.2.2" xref="S3.E5.m1.1.1.1.1.1.3.2.2.cmml">ℰ</mi><mtext id="S3.E5.m1.1.1.1.1.1.3.2.3" xref="S3.E5.m1.1.1.1.1.1.3.2.3a.cmml">2D</mtext><mtext id="S3.E5.m1.1.1.1.1.1.3.3" xref="S3.E5.m1.1.1.1.1.1.3.3a.cmml">render</mtext></msubsup><mo id="S3.E5.m1.1.1.1.1.1.2" 
xref="S3.E5.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E5.m1.1.1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E5.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E5.m1.1.1.1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E5.m1.1.1.1.1.1.1.1.1.5" xref="S3.E5.m1.1.1.1.1.1.1.1.1.5.cmml">𝒢</mi><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.4" xref="S3.E5.m1.1.1.1.1.1.1.1.1.4.cmml">⁢</mo><mrow id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml"><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.4" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">(</mo><msub id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">I</mi><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml">unposed</mtext></msub><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.5" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">,</mo><msub id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.cmml"><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2.cmml">f</mi><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3a.cmml">drive</mtext></msub><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.6" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">;</mo><msub id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.cmml"><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2" mathvariant="normal" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2.cmml">Φ</mi><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3.cmml">g</mi></msub><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.7" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">)</mo></mrow></mrow><mo id="S3.E5.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E5.m1.1.1.1.2" 
xref="S3.E5.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E5.m1.1b"><apply id="S3.E5.m1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1"><eq id="S3.E5.m1.1.1.1.1.2.cmml" xref="S3.E5.m1.1.1.1.1.2"></eq><apply id="S3.E5.m1.1.1.1.1.3.cmml" xref="S3.E5.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.3.1.cmml" xref="S3.E5.m1.1.1.1.1.3">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.3.2.cmml" xref="S3.E5.m1.1.1.1.1.3.2">𝐼</ci><ci id="S3.E5.m1.1.1.1.1.3.3a.cmml" xref="S3.E5.m1.1.1.1.1.3.3"><mtext id="S3.E5.m1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.3.3">posed</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1"><times id="S3.E5.m1.1.1.1.1.1.2.cmml" xref="S3.E5.m1.1.1.1.1.1.2"></times><apply id="S3.E5.m1.1.1.1.1.1.3.cmml" xref="S3.E5.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.3.1.cmml" xref="S3.E5.m1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E5.m1.1.1.1.1.1.3.2.cmml" xref="S3.E5.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.3.2.1.cmml" xref="S3.E5.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.3.2.2.cmml" xref="S3.E5.m1.1.1.1.1.1.3.2.2">ℰ</ci><ci id="S3.E5.m1.1.1.1.1.1.3.2.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.3.2.3"><mtext id="S3.E5.m1.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.3.2.3">2D</mtext></ci></apply><ci id="S3.E5.m1.1.1.1.1.1.3.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.3.3"><mtext id="S3.E5.m1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.3.3">render</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1"><times id="S3.E5.m1.1.1.1.1.1.1.1.1.4.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.4"></times><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.5.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.5">𝒢</ci><vector id="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3"><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" 
id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2">𝐼</ci><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3">unposed</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2">𝑓</ci><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3"><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3">drive</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2">Φ</ci><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3">𝑔</ci></apply></vector></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E5.m1.1c">I_{\text{posed}}=\mathcal{E}_{\text{2D}}^{\text{render}}(\mathcal{G}(I_{\text{% unposed}},f_{\text{drive}};\Phi_{g})),</annotation><annotation encoding="application/x-llamapun" id="S3.E5.m1.1d">italic_I start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT = caligraphic_E start_POSTSUBSCRIPT 2D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT ( caligraphic_G ( italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT ; roman_Φ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) ) 
,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS3.p2.7">where <math alttext="\Phi_{g}" class="ltx_Math" display="inline" id="S3.SS3.p2.4.m1.1"><semantics id="S3.SS3.p2.4.m1.1a"><msub id="S3.SS3.p2.4.m1.1.1" xref="S3.SS3.p2.4.m1.1.1.cmml"><mi id="S3.SS3.p2.4.m1.1.1.2" mathvariant="normal" xref="S3.SS3.p2.4.m1.1.1.2.cmml">Φ</mi><mi id="S3.SS3.p2.4.m1.1.1.3" xref="S3.SS3.p2.4.m1.1.1.3.cmml">g</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.4.m1.1b"><apply id="S3.SS3.p2.4.m1.1.1.cmml" xref="S3.SS3.p2.4.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.4.m1.1.1.1.cmml" xref="S3.SS3.p2.4.m1.1.1">subscript</csymbol><ci id="S3.SS3.p2.4.m1.1.1.2.cmml" xref="S3.SS3.p2.4.m1.1.1.2">Φ</ci><ci id="S3.SS3.p2.4.m1.1.1.3.cmml" xref="S3.SS3.p2.4.m1.1.1.3">𝑔</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.4.m1.1c">\Phi_{g}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.4.m1.1d">roman_Φ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT</annotation></semantics></math> denotes the learnable parameters of <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.5.m2.1"><semantics id="S3.SS3.p2.5.m2.1a"><mrow id="S3.SS3.p2.5.m2.1.2" xref="S3.SS3.p2.5.m2.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.5.m2.1.2.2" xref="S3.SS3.p2.5.m2.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.5.m2.1.2.1" xref="S3.SS3.p2.5.m2.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.5.m2.1.2.3.2" xref="S3.SS3.p2.5.m2.1.2.cmml"><mo id="S3.SS3.p2.5.m2.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.5.m2.1.2.cmml">(</mo><mo id="S3.SS3.p2.5.m2.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.5.m2.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.5.m2.1.2.3.2.2" stretchy="false" 
xref="S3.SS3.p2.5.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.5.m2.1b"><apply id="S3.SS3.p2.5.m2.1.2.cmml" xref="S3.SS3.p2.5.m2.1.2"><times id="S3.SS3.p2.5.m2.1.2.1.cmml" xref="S3.SS3.p2.5.m2.1.2.1"></times><ci id="S3.SS3.p2.5.m2.1.2.2.cmml" xref="S3.SS3.p2.5.m2.1.2.2">𝒢</ci><ci id="S3.SS3.p2.5.m2.1.1.cmml" xref="S3.SS3.p2.5.m2.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.5.m2.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.5.m2.1d">caligraphic_G ( ⋅ )</annotation></semantics></math> and <math alttext="\mathcal{E}_{\text{2D}}^{\text{render}}" class="ltx_Math" display="inline" id="S3.SS3.p2.6.m3.1"><semantics id="S3.SS3.p2.6.m3.1a"><msubsup id="S3.SS3.p2.6.m3.1.1" xref="S3.SS3.p2.6.m3.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.6.m3.1.1.2.2" xref="S3.SS3.p2.6.m3.1.1.2.2.cmml">ℰ</mi><mtext id="S3.SS3.p2.6.m3.1.1.2.3" xref="S3.SS3.p2.6.m3.1.1.2.3a.cmml">2D</mtext><mtext id="S3.SS3.p2.6.m3.1.1.3" xref="S3.SS3.p2.6.m3.1.1.3a.cmml">render</mtext></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.6.m3.1b"><apply id="S3.SS3.p2.6.m3.1.1.cmml" xref="S3.SS3.p2.6.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.6.m3.1.1.1.cmml" xref="S3.SS3.p2.6.m3.1.1">superscript</csymbol><apply id="S3.SS3.p2.6.m3.1.1.2.cmml" xref="S3.SS3.p2.6.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.6.m3.1.1.2.1.cmml" xref="S3.SS3.p2.6.m3.1.1">subscript</csymbol><ci id="S3.SS3.p2.6.m3.1.1.2.2.cmml" xref="S3.SS3.p2.6.m3.1.1.2.2">ℰ</ci><ci id="S3.SS3.p2.6.m3.1.1.2.3a.cmml" xref="S3.SS3.p2.6.m3.1.1.2.3"><mtext id="S3.SS3.p2.6.m3.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p2.6.m3.1.1.2.3">2D</mtext></ci></apply><ci id="S3.SS3.p2.6.m3.1.1.3a.cmml" xref="S3.SS3.p2.6.m3.1.1.3"><mtext id="S3.SS3.p2.6.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p2.6.m3.1.1.3">render</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS3.p2.6.m3.1c">\mathcal{E}_{\text{2D}}^{\text{render}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.6.m3.1d">caligraphic_E start_POSTSUBSCRIPT 2D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT</annotation></semantics></math> is a 2DGS renderer <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib20" title=""><span class="ltx_text" style="font-size:90%;">20</span></a>]</cite>. <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.7.m4.1"><semantics id="S3.SS3.p2.7.m4.1a"><mrow id="S3.SS3.p2.7.m4.1.2" xref="S3.SS3.p2.7.m4.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.7.m4.1.2.2" xref="S3.SS3.p2.7.m4.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.7.m4.1.2.1" xref="S3.SS3.p2.7.m4.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.7.m4.1.2.3.2" xref="S3.SS3.p2.7.m4.1.2.cmml"><mo id="S3.SS3.p2.7.m4.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.7.m4.1.2.cmml">(</mo><mo id="S3.SS3.p2.7.m4.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.7.m4.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.7.m4.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.7.m4.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.7.m4.1b"><apply id="S3.SS3.p2.7.m4.1.2.cmml" xref="S3.SS3.p2.7.m4.1.2"><times id="S3.SS3.p2.7.m4.1.2.1.cmml" xref="S3.SS3.p2.7.m4.1.2.1"></times><ci id="S3.SS3.p2.7.m4.1.2.2.cmml" xref="S3.SS3.p2.7.m4.1.2.2">𝒢</ci><ci id="S3.SS3.p2.7.m4.1.1.cmml" xref="S3.SS3.p2.7.m4.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.7.m4.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.7.m4.1d">caligraphic_G ( ⋅ )</annotation></semantics></math> consists of an encoder with five down-sampling blocks, a middle block, and a decoder with three up-sampling blocks. Each block contains two ResNet layers with group normalization and SiLU activation. 
Cross-attention modules are strategically placed in the deeper layers of the network: the last two down-sampling blocks, the middle block, and the first two up-sampling blocks. The cross-attention is defined as:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E6"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="f_{\text{out}}^{g}=\text{softmax}\left(\frac{(f_{\text{in}}^{g}W_{Q}^{g})(f_{% \text{drive}}W_{K}^{g})^{T}}{\sqrt{d_{k}^{g}}}\right)(f_{\text{drive}}W_{V}^{g% })," class="ltx_Math" display="block" id="S3.E6.m1.3"><semantics id="S3.E6.m1.3a"><mrow id="S3.E6.m1.3.3.1" xref="S3.E6.m1.3.3.1.1.cmml"><mrow id="S3.E6.m1.3.3.1.1" xref="S3.E6.m1.3.3.1.1.cmml"><msubsup id="S3.E6.m1.3.3.1.1.3" xref="S3.E6.m1.3.3.1.1.3.cmml"><mi id="S3.E6.m1.3.3.1.1.3.2.2" xref="S3.E6.m1.3.3.1.1.3.2.2.cmml">f</mi><mtext id="S3.E6.m1.3.3.1.1.3.2.3" xref="S3.E6.m1.3.3.1.1.3.2.3a.cmml">out</mtext><mi id="S3.E6.m1.3.3.1.1.3.3" xref="S3.E6.m1.3.3.1.1.3.3.cmml">g</mi></msubsup><mo id="S3.E6.m1.3.3.1.1.2" xref="S3.E6.m1.3.3.1.1.2.cmml">=</mo><mrow id="S3.E6.m1.3.3.1.1.1" xref="S3.E6.m1.3.3.1.1.1.cmml"><mtext id="S3.E6.m1.3.3.1.1.1.3" xref="S3.E6.m1.3.3.1.1.1.3a.cmml">softmax</mtext><mo id="S3.E6.m1.3.3.1.1.1.2" xref="S3.E6.m1.3.3.1.1.1.2.cmml">⁢</mo><mrow id="S3.E6.m1.3.3.1.1.1.4.2" xref="S3.E6.m1.2.2.cmml"><mo id="S3.E6.m1.3.3.1.1.1.4.2.1" xref="S3.E6.m1.2.2.cmml">(</mo><mfrac id="S3.E6.m1.2.2" xref="S3.E6.m1.2.2.cmml"><mrow id="S3.E6.m1.2.2.2" xref="S3.E6.m1.2.2.2.cmml"><mrow id="S3.E6.m1.1.1.1.1.1" xref="S3.E6.m1.1.1.1.1.1.1.cmml"><mo id="S3.E6.m1.1.1.1.1.1.2" stretchy="false" xref="S3.E6.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E6.m1.1.1.1.1.1.1" xref="S3.E6.m1.1.1.1.1.1.1.cmml"><msubsup id="S3.E6.m1.1.1.1.1.1.1.2" xref="S3.E6.m1.1.1.1.1.1.1.2.cmml"><mi id="S3.E6.m1.1.1.1.1.1.1.2.2.2" xref="S3.E6.m1.1.1.1.1.1.1.2.2.2.cmml">f</mi><mtext 
id="S3.E6.m1.1.1.1.1.1.1.2.2.3" xref="S3.E6.m1.1.1.1.1.1.1.2.2.3a.cmml">in</mtext><mi id="S3.E6.m1.1.1.1.1.1.1.2.3" xref="S3.E6.m1.1.1.1.1.1.1.2.3.cmml">g</mi></msubsup><mo id="S3.E6.m1.1.1.1.1.1.1.1" xref="S3.E6.m1.1.1.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E6.m1.1.1.1.1.1.1.3" xref="S3.E6.m1.1.1.1.1.1.1.3.cmml"><mi id="S3.E6.m1.1.1.1.1.1.1.3.2.2" xref="S3.E6.m1.1.1.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E6.m1.1.1.1.1.1.1.3.2.3" xref="S3.E6.m1.1.1.1.1.1.1.3.2.3.cmml">Q</mi><mi id="S3.E6.m1.1.1.1.1.1.1.3.3" xref="S3.E6.m1.1.1.1.1.1.1.3.3.cmml">g</mi></msubsup></mrow><mo id="S3.E6.m1.1.1.1.1.1.3" stretchy="false" xref="S3.E6.m1.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="S3.E6.m1.2.2.2.3" xref="S3.E6.m1.2.2.2.3.cmml">⁢</mo><msup id="S3.E6.m1.2.2.2.2" xref="S3.E6.m1.2.2.2.2.cmml"><mrow id="S3.E6.m1.2.2.2.2.1.1" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml"><mo id="S3.E6.m1.2.2.2.2.1.1.2" stretchy="false" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml">(</mo><mrow id="S3.E6.m1.2.2.2.2.1.1.1" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml"><msub id="S3.E6.m1.2.2.2.2.1.1.1.2" xref="S3.E6.m1.2.2.2.2.1.1.1.2.cmml"><mi id="S3.E6.m1.2.2.2.2.1.1.1.2.2" xref="S3.E6.m1.2.2.2.2.1.1.1.2.2.cmml">f</mi><mtext id="S3.E6.m1.2.2.2.2.1.1.1.2.3" xref="S3.E6.m1.2.2.2.2.1.1.1.2.3a.cmml">drive</mtext></msub><mo id="S3.E6.m1.2.2.2.2.1.1.1.1" xref="S3.E6.m1.2.2.2.2.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E6.m1.2.2.2.2.1.1.1.3" xref="S3.E6.m1.2.2.2.2.1.1.1.3.cmml"><mi id="S3.E6.m1.2.2.2.2.1.1.1.3.2.2" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E6.m1.2.2.2.2.1.1.1.3.2.3" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.3.cmml">K</mi><mi id="S3.E6.m1.2.2.2.2.1.1.1.3.3" xref="S3.E6.m1.2.2.2.2.1.1.1.3.3.cmml">g</mi></msubsup></mrow><mo id="S3.E6.m1.2.2.2.2.1.1.3" stretchy="false" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml">)</mo></mrow><mi id="S3.E6.m1.2.2.2.2.3" xref="S3.E6.m1.2.2.2.2.3.cmml">T</mi></msup></mrow><msqrt id="S3.E6.m1.2.2.4" xref="S3.E6.m1.2.2.4.cmml"><msubsup id="S3.E6.m1.2.2.4.2" xref="S3.E6.m1.2.2.4.2.cmml"><mi 
id="S3.E6.m1.2.2.4.2.2.2" xref="S3.E6.m1.2.2.4.2.2.2.cmml">d</mi><mi id="S3.E6.m1.2.2.4.2.2.3" xref="S3.E6.m1.2.2.4.2.2.3.cmml">k</mi><mi id="S3.E6.m1.2.2.4.2.3" xref="S3.E6.m1.2.2.4.2.3.cmml">g</mi></msubsup></msqrt></mfrac><mo id="S3.E6.m1.3.3.1.1.1.4.2.2" xref="S3.E6.m1.2.2.cmml">)</mo></mrow><mo id="S3.E6.m1.3.3.1.1.1.2a" xref="S3.E6.m1.3.3.1.1.1.2.cmml">⁢</mo><mrow id="S3.E6.m1.3.3.1.1.1.1.1" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml"><mo id="S3.E6.m1.3.3.1.1.1.1.1.2" stretchy="false" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E6.m1.3.3.1.1.1.1.1.1" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml"><msub id="S3.E6.m1.3.3.1.1.1.1.1.1.2" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.cmml"><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.2.2" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.E6.m1.3.3.1.1.1.1.1.1.2.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.3a.cmml">drive</mtext></msub><mo id="S3.E6.m1.3.3.1.1.1.1.1.1.1" xref="S3.E6.m1.3.3.1.1.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E6.m1.3.3.1.1.1.1.1.1.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.cmml"><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3.cmml">V</mi><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.3.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.3.cmml">g</mi></msubsup></mrow><mo id="S3.E6.m1.3.3.1.1.1.1.1.3" stretchy="false" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E6.m1.3.3.1.2" xref="S3.E6.m1.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E6.m1.3b"><apply id="S3.E6.m1.3.3.1.1.cmml" xref="S3.E6.m1.3.3.1"><eq id="S3.E6.m1.3.3.1.1.2.cmml" xref="S3.E6.m1.3.3.1.1.2"></eq><apply id="S3.E6.m1.3.3.1.1.3.cmml" xref="S3.E6.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.3.1.cmml" xref="S3.E6.m1.3.3.1.1.3">superscript</csymbol><apply id="S3.E6.m1.3.3.1.1.3.2.cmml" xref="S3.E6.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.3.2.1.cmml" 
xref="S3.E6.m1.3.3.1.1.3">subscript</csymbol><ci id="S3.E6.m1.3.3.1.1.3.2.2.cmml" xref="S3.E6.m1.3.3.1.1.3.2.2">𝑓</ci><ci id="S3.E6.m1.3.3.1.1.3.2.3a.cmml" xref="S3.E6.m1.3.3.1.1.3.2.3"><mtext id="S3.E6.m1.3.3.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E6.m1.3.3.1.1.3.2.3">out</mtext></ci></apply><ci id="S3.E6.m1.3.3.1.1.3.3.cmml" xref="S3.E6.m1.3.3.1.1.3.3">𝑔</ci></apply><apply id="S3.E6.m1.3.3.1.1.1.cmml" xref="S3.E6.m1.3.3.1.1.1"><times id="S3.E6.m1.3.3.1.1.1.2.cmml" xref="S3.E6.m1.3.3.1.1.1.2"></times><ci id="S3.E6.m1.3.3.1.1.1.3a.cmml" xref="S3.E6.m1.3.3.1.1.1.3"><mtext id="S3.E6.m1.3.3.1.1.1.3.cmml" xref="S3.E6.m1.3.3.1.1.1.3">softmax</mtext></ci><apply id="S3.E6.m1.2.2.cmml" xref="S3.E6.m1.3.3.1.1.1.4.2"><divide id="S3.E6.m1.2.2.3.cmml" xref="S3.E6.m1.3.3.1.1.1.4.2"></divide><apply id="S3.E6.m1.2.2.2.cmml" xref="S3.E6.m1.2.2.2"><times id="S3.E6.m1.2.2.2.3.cmml" xref="S3.E6.m1.2.2.2.3"></times><apply id="S3.E6.m1.1.1.1.1.1.1.cmml" xref="S3.E6.m1.1.1.1.1.1"><times id="S3.E6.m1.1.1.1.1.1.1.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.1"></times><apply id="S3.E6.m1.1.1.1.1.1.1.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.2.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2">superscript</csymbol><apply id="S3.E6.m1.1.1.1.1.1.1.2.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.2.2.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E6.m1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2.2.2">𝑓</ci><ci id="S3.E6.m1.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2.2.3"><mtext id="S3.E6.m1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.E6.m1.1.1.1.1.1.1.2.2.3">in</mtext></ci></apply><ci id="S3.E6.m1.1.1.1.1.1.1.2.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2.3">𝑔</ci></apply><apply id="S3.E6.m1.1.1.1.1.1.1.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.3.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E6.m1.1.1.1.1.1.1.3.2.cmml" 
xref="S3.E6.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E6.m1.1.1.1.1.1.1.3.2.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E6.m1.1.1.1.1.1.1.3.2.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S3.E6.m1.1.1.1.1.1.1.3.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3.3">𝑔</ci></apply></apply><apply id="S3.E6.m1.2.2.2.2.cmml" xref="S3.E6.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.2.cmml" xref="S3.E6.m1.2.2.2.2">superscript</csymbol><apply id="S3.E6.m1.2.2.2.2.1.1.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1"><times id="S3.E6.m1.2.2.2.2.1.1.1.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.1"></times><apply id="S3.E6.m1.2.2.2.2.1.1.1.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.1.1.1.2.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2">subscript</csymbol><ci id="S3.E6.m1.2.2.2.2.1.1.1.2.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2.2">𝑓</ci><ci id="S3.E6.m1.2.2.2.2.1.1.1.2.3a.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2.3"><mtext id="S3.E6.m1.2.2.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E6.m1.2.2.2.2.1.1.1.2.3">drive</mtext></ci></apply><apply id="S3.E6.m1.2.2.2.2.1.1.1.3.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.1.1.1.3.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3">superscript</csymbol><apply id="S3.E6.m1.2.2.2.2.1.1.1.3.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.1.1.1.3.2.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3">subscript</csymbol><ci id="S3.E6.m1.2.2.2.2.1.1.1.3.2.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.2">𝑊</ci><ci id="S3.E6.m1.2.2.2.2.1.1.1.3.2.3.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.3">𝐾</ci></apply><ci id="S3.E6.m1.2.2.2.2.1.1.1.3.3.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3.3">𝑔</ci></apply></apply><ci id="S3.E6.m1.2.2.2.2.3.cmml" xref="S3.E6.m1.2.2.2.2.3">𝑇</ci></apply></apply><apply id="S3.E6.m1.2.2.4.cmml" xref="S3.E6.m1.2.2.4"><root id="S3.E6.m1.2.2.4a.cmml" 
xref="S3.E6.m1.2.2.4"></root><apply id="S3.E6.m1.2.2.4.2.cmml" xref="S3.E6.m1.2.2.4.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.4.2.1.cmml" xref="S3.E6.m1.2.2.4.2">superscript</csymbol><apply id="S3.E6.m1.2.2.4.2.2.cmml" xref="S3.E6.m1.2.2.4.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.4.2.2.1.cmml" xref="S3.E6.m1.2.2.4.2">subscript</csymbol><ci id="S3.E6.m1.2.2.4.2.2.2.cmml" xref="S3.E6.m1.2.2.4.2.2.2">𝑑</ci><ci id="S3.E6.m1.2.2.4.2.2.3.cmml" xref="S3.E6.m1.2.2.4.2.2.3">𝑘</ci></apply><ci id="S3.E6.m1.2.2.4.2.3.cmml" xref="S3.E6.m1.2.2.4.2.3">𝑔</ci></apply></apply></apply><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1"><times id="S3.E6.m1.3.3.1.1.1.1.1.1.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.1"></times><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.1.1.1.1.2.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.2.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.2">𝑓</ci><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.2.3a.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.3"><mtext id="S3.E6.m1.3.3.1.1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.3">drive</mtext></ci></apply><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.3.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.1.1.1.1.3.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3">𝑉</ci></apply><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.3.3.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.3">𝑔</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.E6.m1.3c">f_{\text{out}}^{g}=\text{softmax}\left(\frac{(f_{\text{in}}^{g}W_{Q}^{g})(f_{% \text{drive}}W_{K}^{g})^{T}}{\sqrt{d_{k}^{g}}}\right)(f_{\text{drive}}W_{V}^{g% }),</annotation><annotation encoding="application/x-llamapun" id="S3.E6.m1.3d">italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = softmax ( divide start_ARG ( italic_f start_POSTSUBSCRIPT in end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) ( italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG end_ARG ) ( italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS3.p2.11">where <math alttext="f_{\text{in}}^{g},f_{\text{out}}^{g}" class="ltx_Math" display="inline" id="S3.SS3.p2.8.m1.2"><semantics id="S3.SS3.p2.8.m1.2a"><mrow id="S3.SS3.p2.8.m1.2.2.2" xref="S3.SS3.p2.8.m1.2.2.3.cmml"><msubsup id="S3.SS3.p2.8.m1.1.1.1.1" xref="S3.SS3.p2.8.m1.1.1.1.1.cmml"><mi id="S3.SS3.p2.8.m1.1.1.1.1.2.2" xref="S3.SS3.p2.8.m1.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p2.8.m1.1.1.1.1.2.3" xref="S3.SS3.p2.8.m1.1.1.1.1.2.3a.cmml">in</mtext><mi id="S3.SS3.p2.8.m1.1.1.1.1.3" xref="S3.SS3.p2.8.m1.1.1.1.1.3.cmml">g</mi></msubsup><mo 
id="S3.SS3.p2.8.m1.2.2.2.3" xref="S3.SS3.p2.8.m1.2.2.3.cmml">,</mo><msubsup id="S3.SS3.p2.8.m1.2.2.2.2" xref="S3.SS3.p2.8.m1.2.2.2.2.cmml"><mi id="S3.SS3.p2.8.m1.2.2.2.2.2.2" xref="S3.SS3.p2.8.m1.2.2.2.2.2.2.cmml">f</mi><mtext id="S3.SS3.p2.8.m1.2.2.2.2.2.3" xref="S3.SS3.p2.8.m1.2.2.2.2.2.3a.cmml">out</mtext><mi id="S3.SS3.p2.8.m1.2.2.2.2.3" xref="S3.SS3.p2.8.m1.2.2.2.2.3.cmml">g</mi></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.8.m1.2b"><list id="S3.SS3.p2.8.m1.2.2.3.cmml" xref="S3.SS3.p2.8.m1.2.2.2"><apply id="S3.SS3.p2.8.m1.1.1.1.1.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.1.1.1.1.1.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1">superscript</csymbol><apply id="S3.SS3.p2.8.m1.1.1.1.1.2.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.1.1.1.1.2.1.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1">subscript</csymbol><ci id="S3.SS3.p2.8.m1.1.1.1.1.2.2.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1.2.2">𝑓</ci><ci id="S3.SS3.p2.8.m1.1.1.1.1.2.3a.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1.2.3"><mtext id="S3.SS3.p2.8.m1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p2.8.m1.1.1.1.1.2.3">in</mtext></ci></apply><ci id="S3.SS3.p2.8.m1.1.1.1.1.3.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1.3">𝑔</ci></apply><apply id="S3.SS3.p2.8.m1.2.2.2.2.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.2.2.2.2.1.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2">superscript</csymbol><apply id="S3.SS3.p2.8.m1.2.2.2.2.2.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.2.2.2.2.2.1.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2">subscript</csymbol><ci id="S3.SS3.p2.8.m1.2.2.2.2.2.2.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2.2.2">𝑓</ci><ci id="S3.SS3.p2.8.m1.2.2.2.2.2.3a.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2.2.3"><mtext id="S3.SS3.p2.8.m1.2.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS3.p2.8.m1.2.2.2.2.2.3">out</mtext></ci></apply><ci id="S3.SS3.p2.8.m1.2.2.2.2.3.cmml" 
xref="S3.SS3.p2.8.m1.2.2.2.2.3">𝑔</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.8.m1.2c">f_{\text{in}}^{g},f_{\text{out}}^{g}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.8.m1.2d">italic_f start_POSTSUBSCRIPT in end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT</annotation></semantics></math> are the input and output features of the cross-attention module in <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.9.m2.1"><semantics id="S3.SS3.p2.9.m2.1a"><mrow id="S3.SS3.p2.9.m2.1.2" xref="S3.SS3.p2.9.m2.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.9.m2.1.2.2" xref="S3.SS3.p2.9.m2.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.9.m2.1.2.1" xref="S3.SS3.p2.9.m2.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.9.m2.1.2.3.2" xref="S3.SS3.p2.9.m2.1.2.cmml"><mo id="S3.SS3.p2.9.m2.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.9.m2.1.2.cmml">(</mo><mo id="S3.SS3.p2.9.m2.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.9.m2.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.9.m2.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.9.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.9.m2.1b"><apply id="S3.SS3.p2.9.m2.1.2.cmml" xref="S3.SS3.p2.9.m2.1.2"><times id="S3.SS3.p2.9.m2.1.2.1.cmml" xref="S3.SS3.p2.9.m2.1.2.1"></times><ci id="S3.SS3.p2.9.m2.1.2.2.cmml" xref="S3.SS3.p2.9.m2.1.2.2">𝒢</ci><ci id="S3.SS3.p2.9.m2.1.1.cmml" xref="S3.SS3.p2.9.m2.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.9.m2.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.9.m2.1d">caligraphic_G ( ⋅ )</annotation></semantics></math>, respectively. 
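</p> <p class="ltx_p">As an illustration of Eq. (6), the following is a minimal single-head NumPy sketch, not the authors' implementation. It assumes the U-Net feature map has been flattened into N tokens and the driving features into M tokens; the function name <span class="ltx_text ltx_font_typewriter">cross_attention</span>, all tensor sizes, and the random weights are hypothetical choices made only for this example.</p>

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(f_in, f_drive, W_Q, W_K, W_V):
    """Single-head cross-attention following the form of Eq. (6).

    f_in:    (N, c_in)    flattened U-Net features (queries come from here)
    f_drive: (M, c_drive) driving features (keys and values come from here)
    """
    Q = f_in @ W_Q       # (N, d_k)
    K = f_drive @ W_K    # (M, d_k)
    V = f_drive @ W_V    # (M, d_v)
    d_k = Q.shape[-1]
    # Scaled dot-product scores, normalized over the M driving tokens.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (N, M)
    return attn @ V, attn


# Hypothetical sizes, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
N, M, c_in, c_drive, d_k, d_v = 16, 4, 32, 8, 24, 32
f_in = rng.standard_normal((N, c_in))
f_drive = rng.standard_normal((M, c_drive))
W_Q = rng.standard_normal((c_in, d_k))
W_K = rng.standard_normal((c_drive, d_k))
W_V = rng.standard_normal((c_drive, d_v))

f_out, attn = cross_attention(f_in, f_drive, W_Q, W_K, W_V)
print(f_out.shape, attn.shape)
```

<p class="ltx_p">Each row of <span class="ltx_text ltx_font_typewriter">attn</span> sums to one, so every output token is a convex combination of the projected driving features.</p> <p class="ltx_p">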
<math alttext="W_{Q}^{g},W_{K}^{g},W_{V}^{g}" class="ltx_Math" display="inline" id="S3.SS3.p2.10.m3.3"><semantics id="S3.SS3.p2.10.m3.3a"><mrow id="S3.SS3.p2.10.m3.3.3.3" xref="S3.SS3.p2.10.m3.3.3.4.cmml"><msubsup id="S3.SS3.p2.10.m3.1.1.1.1" xref="S3.SS3.p2.10.m3.1.1.1.1.cmml"><mi id="S3.SS3.p2.10.m3.1.1.1.1.2.2" xref="S3.SS3.p2.10.m3.1.1.1.1.2.2.cmml">W</mi><mi id="S3.SS3.p2.10.m3.1.1.1.1.2.3" xref="S3.SS3.p2.10.m3.1.1.1.1.2.3.cmml">Q</mi><mi id="S3.SS3.p2.10.m3.1.1.1.1.3" xref="S3.SS3.p2.10.m3.1.1.1.1.3.cmml">g</mi></msubsup><mo id="S3.SS3.p2.10.m3.3.3.3.4" xref="S3.SS3.p2.10.m3.3.3.4.cmml">,</mo><msubsup id="S3.SS3.p2.10.m3.2.2.2.2" xref="S3.SS3.p2.10.m3.2.2.2.2.cmml"><mi id="S3.SS3.p2.10.m3.2.2.2.2.2.2" xref="S3.SS3.p2.10.m3.2.2.2.2.2.2.cmml">W</mi><mi id="S3.SS3.p2.10.m3.2.2.2.2.2.3" xref="S3.SS3.p2.10.m3.2.2.2.2.2.3.cmml">K</mi><mi id="S3.SS3.p2.10.m3.2.2.2.2.3" xref="S3.SS3.p2.10.m3.2.2.2.2.3.cmml">g</mi></msubsup><mo id="S3.SS3.p2.10.m3.3.3.3.5" xref="S3.SS3.p2.10.m3.3.3.4.cmml">,</mo><msubsup id="S3.SS3.p2.10.m3.3.3.3.3" xref="S3.SS3.p2.10.m3.3.3.3.3.cmml"><mi id="S3.SS3.p2.10.m3.3.3.3.3.2.2" xref="S3.SS3.p2.10.m3.3.3.3.3.2.2.cmml">W</mi><mi id="S3.SS3.p2.10.m3.3.3.3.3.2.3" xref="S3.SS3.p2.10.m3.3.3.3.3.2.3.cmml">V</mi><mi id="S3.SS3.p2.10.m3.3.3.3.3.3" xref="S3.SS3.p2.10.m3.3.3.3.3.3.cmml">g</mi></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.10.m3.3b"><list id="S3.SS3.p2.10.m3.3.3.4.cmml" xref="S3.SS3.p2.10.m3.3.3.3"><apply id="S3.SS3.p2.10.m3.1.1.1.1.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.1.1.1.1.1.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1">superscript</csymbol><apply id="S3.SS3.p2.10.m3.1.1.1.1.2.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.1.1.1.1.2.1.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1">subscript</csymbol><ci id="S3.SS3.p2.10.m3.1.1.1.1.2.2.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1.2.2">𝑊</ci><ci id="S3.SS3.p2.10.m3.1.1.1.1.2.3.cmml" 
xref="S3.SS3.p2.10.m3.1.1.1.1.2.3">𝑄</ci></apply><ci id="S3.SS3.p2.10.m3.1.1.1.1.3.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1.3">𝑔</ci></apply><apply id="S3.SS3.p2.10.m3.2.2.2.2.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.2.2.2.2.1.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2">superscript</csymbol><apply id="S3.SS3.p2.10.m3.2.2.2.2.2.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.2.2.2.2.2.1.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2">subscript</csymbol><ci id="S3.SS3.p2.10.m3.2.2.2.2.2.2.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2.2.2">𝑊</ci><ci id="S3.SS3.p2.10.m3.2.2.2.2.2.3.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2.2.3">𝐾</ci></apply><ci id="S3.SS3.p2.10.m3.2.2.2.2.3.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2.3">𝑔</ci></apply><apply id="S3.SS3.p2.10.m3.3.3.3.3.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.3.3.3.3.1.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3">superscript</csymbol><apply id="S3.SS3.p2.10.m3.3.3.3.3.2.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.3.3.3.3.2.1.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3">subscript</csymbol><ci id="S3.SS3.p2.10.m3.3.3.3.3.2.2.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3.2.2">𝑊</ci><ci id="S3.SS3.p2.10.m3.3.3.3.3.2.3.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3.2.3">𝑉</ci></apply><ci id="S3.SS3.p2.10.m3.3.3.3.3.3.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3.3">𝑔</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.10.m3.3c">W_{Q}^{g},W_{K}^{g},W_{V}^{g}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.10.m3.3d">italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT</annotation></semantics></math> are the learnable weight matrices and <math 
alttext="\sqrt{d_{k}^{g}}" class="ltx_Math" display="inline" id="S3.SS3.p2.11.m4.1"><semantics id="S3.SS3.p2.11.m4.1a"><msqrt id="S3.SS3.p2.11.m4.1.1" xref="S3.SS3.p2.11.m4.1.1.cmml"><msubsup id="S3.SS3.p2.11.m4.1.1.2" xref="S3.SS3.p2.11.m4.1.1.2.cmml"><mi id="S3.SS3.p2.11.m4.1.1.2.2.2" xref="S3.SS3.p2.11.m4.1.1.2.2.2.cmml">d</mi><mi id="S3.SS3.p2.11.m4.1.1.2.2.3" xref="S3.SS3.p2.11.m4.1.1.2.2.3.cmml">k</mi><mi id="S3.SS3.p2.11.m4.1.1.2.3" xref="S3.SS3.p2.11.m4.1.1.2.3.cmml">g</mi></msubsup></msqrt><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.11.m4.1b"><apply id="S3.SS3.p2.11.m4.1.1.cmml" xref="S3.SS3.p2.11.m4.1.1"><root id="S3.SS3.p2.11.m4.1.1a.cmml" xref="S3.SS3.p2.11.m4.1.1"></root><apply id="S3.SS3.p2.11.m4.1.1.2.cmml" xref="S3.SS3.p2.11.m4.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p2.11.m4.1.1.2.1.cmml" xref="S3.SS3.p2.11.m4.1.1.2">superscript</csymbol><apply id="S3.SS3.p2.11.m4.1.1.2.2.cmml" xref="S3.SS3.p2.11.m4.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p2.11.m4.1.1.2.2.1.cmml" xref="S3.SS3.p2.11.m4.1.1.2">subscript</csymbol><ci id="S3.SS3.p2.11.m4.1.1.2.2.2.cmml" xref="S3.SS3.p2.11.m4.1.1.2.2.2">𝑑</ci><ci id="S3.SS3.p2.11.m4.1.1.2.2.3.cmml" xref="S3.SS3.p2.11.m4.1.1.2.2.3">𝑘</ci></apply><ci id="S3.SS3.p2.11.m4.1.1.2.3.cmml" xref="S3.SS3.p2.11.m4.1.1.2.3">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.11.m4.1c">\sqrt{d_{k}^{g}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.11.m4.1d">square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG</annotation></semantics></math> the scale factor.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p3"> <p class="ltx_p" id="S3.SS3.p3.4"><span class="ltx_text ltx_font_bold" id="S3.SS3.p3.4.1">Mobile AR Application.</span> Our model is also designed to facilitate real-time animation on mobile devices, striking a balance between advanced animation 
techniques and efficient rendering. Offline, we use the 3D Generation Network <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p3.1.m1.1"><semantics id="S3.SS3.p3.1.m1.1a"><mrow id="S3.SS3.p3.1.m1.1.2" xref="S3.SS3.p3.1.m1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p3.1.m1.1.2.2" xref="S3.SS3.p3.1.m1.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p3.1.m1.1.2.1" xref="S3.SS3.p3.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p3.1.m1.1.2.3.2" xref="S3.SS3.p3.1.m1.1.2.cmml"><mo id="S3.SS3.p3.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS3.p3.1.m1.1.2.cmml">(</mo><mo id="S3.SS3.p3.1.m1.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p3.1.m1.1.1.cmml">⋅</mo><mo id="S3.SS3.p3.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS3.p3.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.1.m1.1b"><apply id="S3.SS3.p3.1.m1.1.2.cmml" xref="S3.SS3.p3.1.m1.1.2"><times id="S3.SS3.p3.1.m1.1.2.1.cmml" xref="S3.SS3.p3.1.m1.1.2.1"></times><ci id="S3.SS3.p3.1.m1.1.2.2.cmml" xref="S3.SS3.p3.1.m1.1.2.2">𝒢</ci><ci id="S3.SS3.p3.1.m1.1.1.cmml" xref="S3.SS3.p3.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.1.m1.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.1.m1.1d">caligraphic_G ( ⋅ )</annotation></semantics></math> from the pipeline shown in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S2.F2" title="Figure 2 ‣ 2 Related Work ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">2</span></a> to initially generate a base set of Gaussians for the avatar in a rest pose <math alttext="\theta_{\text{rest}}" class="ltx_Math" display="inline" id="S3.SS3.p3.2.m2.1"><semantics id="S3.SS3.p3.2.m2.1a"><msub id="S3.SS3.p3.2.m2.1.1" xref="S3.SS3.p3.2.m2.1.1.cmml"><mi id="S3.SS3.p3.2.m2.1.1.2" xref="S3.SS3.p3.2.m2.1.1.2.cmml">θ</mi><mtext id="S3.SS3.p3.2.m2.1.1.3" xref="S3.SS3.p3.2.m2.1.1.3a.cmml">rest</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.2.m2.1b"><apply id="S3.SS3.p3.2.m2.1.1.cmml" xref="S3.SS3.p3.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.p3.2.m2.1.1.1.cmml" xref="S3.SS3.p3.2.m2.1.1">subscript</csymbol><ci id="S3.SS3.p3.2.m2.1.1.2.cmml" xref="S3.SS3.p3.2.m2.1.1.2">𝜃</ci><ci id="S3.SS3.p3.2.m2.1.1.3a.cmml" xref="S3.SS3.p3.2.m2.1.1.3"><mtext id="S3.SS3.p3.2.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p3.2.m2.1.1.3">rest</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.2.m2.1c">\theta_{\text{rest}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.2.m2.1d">italic_θ start_POSTSUBSCRIPT rest end_POSTSUBSCRIPT</annotation></semantics></math>, along with specific Gaussian sets corresponding to each component of the expression features <math alttext="f_{\text{drive}}" class="ltx_Math" display="inline" id="S3.SS3.p3.3.m3.1"><semantics id="S3.SS3.p3.3.m3.1a"><msub id="S3.SS3.p3.3.m3.1.1" xref="S3.SS3.p3.3.m3.1.1.cmml"><mi id="S3.SS3.p3.3.m3.1.1.2" xref="S3.SS3.p3.3.m3.1.1.2.cmml">f</mi><mtext id="S3.SS3.p3.3.m3.1.1.3" xref="S3.SS3.p3.3.m3.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.3.m3.1b"><apply id="S3.SS3.p3.3.m3.1.1.cmml" xref="S3.SS3.p3.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p3.3.m3.1.1.1.cmml" 
xref="S3.SS3.p3.3.m3.1.1">subscript</csymbol><ci id="S3.SS3.p3.3.m3.1.1.2.cmml" xref="S3.SS3.p3.3.m3.1.1.2">𝑓</ci><ci id="S3.SS3.p3.3.m3.1.1.3a.cmml" xref="S3.SS3.p3.3.m3.1.1.3"><mtext id="S3.SS3.p3.3.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p3.3.m3.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.3.m3.1c">f_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.3.m3.1d">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math>. On the mobile device, we use a face tracker, such as <span class="ltx_text ltx_font_italic" id="S3.SS3.p3.4.2">Mediapipe</span>’s BlazeFace tracker <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib4" title=""><span class="ltx_text" style="font-size:90%;">4</span></a>]</cite>, to generate a list of blendshape weights <math alttext="f_{\text{bs}}\in\mathbb{R}^{16}" class="ltx_Math" display="inline" id="S3.SS3.p3.4.m4.1"><semantics id="S3.SS3.p3.4.m4.1a"><mrow id="S3.SS3.p3.4.m4.1.1" xref="S3.SS3.p3.4.m4.1.1.cmml"><msub id="S3.SS3.p3.4.m4.1.1.2" xref="S3.SS3.p3.4.m4.1.1.2.cmml"><mi id="S3.SS3.p3.4.m4.1.1.2.2" xref="S3.SS3.p3.4.m4.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p3.4.m4.1.1.2.3" xref="S3.SS3.p3.4.m4.1.1.2.3a.cmml">bs</mtext></msub><mo id="S3.SS3.p3.4.m4.1.1.1" xref="S3.SS3.p3.4.m4.1.1.1.cmml">∈</mo><msup id="S3.SS3.p3.4.m4.1.1.3" xref="S3.SS3.p3.4.m4.1.1.3.cmml"><mi id="S3.SS3.p3.4.m4.1.1.3.2" xref="S3.SS3.p3.4.m4.1.1.3.2.cmml">ℝ</mi><mn id="S3.SS3.p3.4.m4.1.1.3.3" xref="S3.SS3.p3.4.m4.1.1.3.3.cmml">16</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.4.m4.1b"><apply id="S3.SS3.p3.4.m4.1.1.cmml" xref="S3.SS3.p3.4.m4.1.1"><in id="S3.SS3.p3.4.m4.1.1.1.cmml" xref="S3.SS3.p3.4.m4.1.1.1"></in><apply id="S3.SS3.p3.4.m4.1.1.2.cmml" xref="S3.SS3.p3.4.m4.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p3.4.m4.1.1.2.1.cmml" 
xref="S3.SS3.p3.4.m4.1.1.2">subscript</csymbol><ci id="S3.SS3.p3.4.m4.1.1.2.2.cmml" xref="S3.SS3.p3.4.m4.1.1.2.2">𝑓</ci><ci id="S3.SS3.p3.4.m4.1.1.2.3a.cmml" xref="S3.SS3.p3.4.m4.1.1.2.3"><mtext id="S3.SS3.p3.4.m4.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p3.4.m4.1.1.2.3">bs</mtext></ci></apply><apply id="S3.SS3.p3.4.m4.1.1.3.cmml" xref="S3.SS3.p3.4.m4.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p3.4.m4.1.1.3.1.cmml" xref="S3.SS3.p3.4.m4.1.1.3">superscript</csymbol><ci id="S3.SS3.p3.4.m4.1.1.3.2.cmml" xref="S3.SS3.p3.4.m4.1.1.3.2">ℝ</ci><cn id="S3.SS3.p3.4.m4.1.1.3.3.cmml" type="integer" xref="S3.SS3.p3.4.m4.1.1.3.3">16</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.4.m4.1c">f_{\text{bs}}\in\mathbb{R}^{16}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.4.m4.1d">italic_f start_POSTSUBSCRIPT bs end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT</annotation></semantics></math>. We leverage these weights to animate the avatar through linear interpolation between the parameters of each feature component:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E7"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\theta_{\text{mobile}}=\theta_{\text{rest}}+\sum_{i=1}^{K}f_{\text{drive}}^{i}% (\theta_{i}-\theta_{\text{rest}})" class="ltx_Math" display="block" id="S3.E7.m1.1"><semantics id="S3.E7.m1.1a"><mrow id="S3.E7.m1.1.1" xref="S3.E7.m1.1.1.cmml"><msub id="S3.E7.m1.1.1.3" xref="S3.E7.m1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.3.2" xref="S3.E7.m1.1.1.3.2.cmml">θ</mi><mtext id="S3.E7.m1.1.1.3.3" xref="S3.E7.m1.1.1.3.3a.cmml">mobile</mtext></msub><mo id="S3.E7.m1.1.1.2" xref="S3.E7.m1.1.1.2.cmml">=</mo><mrow id="S3.E7.m1.1.1.1" xref="S3.E7.m1.1.1.1.cmml"><msub id="S3.E7.m1.1.1.1.3" xref="S3.E7.m1.1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.1.3.2" 
xref="S3.E7.m1.1.1.1.3.2.cmml">θ</mi><mtext id="S3.E7.m1.1.1.1.3.3" xref="S3.E7.m1.1.1.1.3.3a.cmml">rest</mtext></msub><mo id="S3.E7.m1.1.1.1.2" rspace="0.055em" xref="S3.E7.m1.1.1.1.2.cmml">+</mo><mrow id="S3.E7.m1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.cmml"><munderover id="S3.E7.m1.1.1.1.1.2" xref="S3.E7.m1.1.1.1.1.2.cmml"><mo id="S3.E7.m1.1.1.1.1.2.2.2" movablelimits="false" xref="S3.E7.m1.1.1.1.1.2.2.2.cmml">∑</mo><mrow id="S3.E7.m1.1.1.1.1.2.2.3" xref="S3.E7.m1.1.1.1.1.2.2.3.cmml"><mi id="S3.E7.m1.1.1.1.1.2.2.3.2" xref="S3.E7.m1.1.1.1.1.2.2.3.2.cmml">i</mi><mo id="S3.E7.m1.1.1.1.1.2.2.3.1" xref="S3.E7.m1.1.1.1.1.2.2.3.1.cmml">=</mo><mn id="S3.E7.m1.1.1.1.1.2.2.3.3" xref="S3.E7.m1.1.1.1.1.2.2.3.3.cmml">1</mn></mrow><mi id="S3.E7.m1.1.1.1.1.2.3" xref="S3.E7.m1.1.1.1.1.2.3.cmml">K</mi></munderover><mrow id="S3.E7.m1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.cmml"><msubsup id="S3.E7.m1.1.1.1.1.1.3" xref="S3.E7.m1.1.1.1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.1.1.1.3.2.2" xref="S3.E7.m1.1.1.1.1.1.3.2.2.cmml">f</mi><mtext id="S3.E7.m1.1.1.1.1.1.3.2.3" xref="S3.E7.m1.1.1.1.1.1.3.2.3a.cmml">drive</mtext><mi id="S3.E7.m1.1.1.1.1.1.3.3" xref="S3.E7.m1.1.1.1.1.1.3.3.cmml">i</mi></msubsup><mo id="S3.E7.m1.1.1.1.1.1.2" xref="S3.E7.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E7.m1.1.1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E7.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E7.m1.1.1.1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml"><msub id="S3.E7.m1.1.1.1.1.1.1.1.1.2" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S3.E7.m1.1.1.1.1.1.1.1.1.2.2" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.2.cmml">θ</mi><mi id="S3.E7.m1.1.1.1.1.1.1.1.1.2.3" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.3.cmml">i</mi></msub><mo id="S3.E7.m1.1.1.1.1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="S3.E7.m1.1.1.1.1.1.1.1.1.3" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.1.1.1.1.1.1.3.2" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.2.cmml">θ</mi><mtext 
id="S3.E7.m1.1.1.1.1.1.1.1.1.3.3" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.3a.cmml">rest</mtext></msub></mrow><mo id="S3.E7.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.E7.m1.1b"><apply id="S3.E7.m1.1.1.cmml" xref="S3.E7.m1.1.1"><eq id="S3.E7.m1.1.1.2.cmml" xref="S3.E7.m1.1.1.2"></eq><apply id="S3.E7.m1.1.1.3.cmml" xref="S3.E7.m1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.3.2">𝜃</ci><ci id="S3.E7.m1.1.1.3.3a.cmml" xref="S3.E7.m1.1.1.3.3"><mtext id="S3.E7.m1.1.1.3.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.3.3">mobile</mtext></ci></apply><apply id="S3.E7.m1.1.1.1.cmml" xref="S3.E7.m1.1.1.1"><plus id="S3.E7.m1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.2"></plus><apply id="S3.E7.m1.1.1.1.3.cmml" xref="S3.E7.m1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.1.3.2">𝜃</ci><ci id="S3.E7.m1.1.1.1.3.3a.cmml" xref="S3.E7.m1.1.1.1.3.3"><mtext id="S3.E7.m1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.1.3.3">rest</mtext></ci></apply><apply id="S3.E7.m1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1"><apply id="S3.E7.m1.1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.2.1.cmml" xref="S3.E7.m1.1.1.1.1.2">superscript</csymbol><apply id="S3.E7.m1.1.1.1.1.2.2.cmml" xref="S3.E7.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.2.2.1.cmml" xref="S3.E7.m1.1.1.1.1.2">subscript</csymbol><sum id="S3.E7.m1.1.1.1.1.2.2.2.cmml" xref="S3.E7.m1.1.1.1.1.2.2.2"></sum><apply id="S3.E7.m1.1.1.1.1.2.2.3.cmml" xref="S3.E7.m1.1.1.1.1.2.2.3"><eq id="S3.E7.m1.1.1.1.1.2.2.3.1.cmml" xref="S3.E7.m1.1.1.1.1.2.2.3.1"></eq><ci id="S3.E7.m1.1.1.1.1.2.2.3.2.cmml" xref="S3.E7.m1.1.1.1.1.2.2.3.2">𝑖</ci><cn id="S3.E7.m1.1.1.1.1.2.2.3.3.cmml" type="integer" 
xref="S3.E7.m1.1.1.1.1.2.2.3.3">1</cn></apply></apply><ci id="S3.E7.m1.1.1.1.1.2.3.cmml" xref="S3.E7.m1.1.1.1.1.2.3">𝐾</ci></apply><apply id="S3.E7.m1.1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1.1"><times id="S3.E7.m1.1.1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.1.1.2"></times><apply id="S3.E7.m1.1.1.1.1.1.3.cmml" xref="S3.E7.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E7.m1.1.1.1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.3.2.1.cmml" xref="S3.E7.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.1.1.1.3.2.2.cmml" xref="S3.E7.m1.1.1.1.1.1.3.2.2">𝑓</ci><ci id="S3.E7.m1.1.1.1.1.1.3.2.3a.cmml" xref="S3.E7.m1.1.1.1.1.1.3.2.3"><mtext id="S3.E7.m1.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.1.1.1.3.2.3">drive</mtext></ci></apply><ci id="S3.E7.m1.1.1.1.1.1.3.3.cmml" xref="S3.E7.m1.1.1.1.1.1.3.3">𝑖</ci></apply><apply id="S3.E7.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1"><minus id="S3.E7.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.1"></minus><apply id="S3.E7.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.2">𝜃</ci><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.3">𝑖</ci></apply><apply id="S3.E7.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.2">𝜃</ci><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E7.m1.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" 
xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.3">rest</mtext></ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E7.m1.1c">\theta_{\text{mobile}}=\theta_{\text{rest}}+\sum_{i=1}^{K}f_{\text{drive}}^{i}% (\theta_{i}-\theta_{\text{rest}})</annotation><annotation encoding="application/x-llamapun" id="S3.E7.m1.1d">italic_θ start_POSTSUBSCRIPT mobile end_POSTSUBSCRIPT = italic_θ start_POSTSUBSCRIPT rest end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT rest end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS3.p3.7"><math alttext="K" class="ltx_Math" display="inline" id="S3.SS3.p3.5.m1.1"><semantics id="S3.SS3.p3.5.m1.1a"><mi id="S3.SS3.p3.5.m1.1.1" xref="S3.SS3.p3.5.m1.1.1.cmml">K</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.5.m1.1b"><ci id="S3.SS3.p3.5.m1.1.1.cmml" xref="S3.SS3.p3.5.m1.1.1">𝐾</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.5.m1.1c">K</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.5.m1.1d">italic_K</annotation></semantics></math> represents the number of driving features, and can be tuned to balance expression detail and speed. 
For compatibility with Mediapipe, we choose <math alttext="K=16" class="ltx_Math" display="inline" id="S3.SS3.p3.6.m2.1"><semantics id="S3.SS3.p3.6.m2.1a"><mrow id="S3.SS3.p3.6.m2.1.1" xref="S3.SS3.p3.6.m2.1.1.cmml"><mi id="S3.SS3.p3.6.m2.1.1.2" xref="S3.SS3.p3.6.m2.1.1.2.cmml">K</mi><mo id="S3.SS3.p3.6.m2.1.1.1" xref="S3.SS3.p3.6.m2.1.1.1.cmml">=</mo><mn id="S3.SS3.p3.6.m2.1.1.3" xref="S3.SS3.p3.6.m2.1.1.3.cmml">16</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.6.m2.1b"><apply id="S3.SS3.p3.6.m2.1.1.cmml" xref="S3.SS3.p3.6.m2.1.1"><eq id="S3.SS3.p3.6.m2.1.1.1.cmml" xref="S3.SS3.p3.6.m2.1.1.1"></eq><ci id="S3.SS3.p3.6.m2.1.1.2.cmml" xref="S3.SS3.p3.6.m2.1.1.2">𝐾</ci><cn id="S3.SS3.p3.6.m2.1.1.3.cmml" type="integer" xref="S3.SS3.p3.6.m2.1.1.3">16</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.6.m2.1c">K=16</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.6.m2.1d">italic_K = 16</annotation></semantics></math>. 
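</p> <p class="ltx_p">As a rough illustration, the blend in Eq. (7) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation. The function name <span class="ltx_text ltx_font_typewriter">blend_gaussians</span> and the flat-list parameter layout are hypothetical, and the sketch assumes that the tracker's blendshape weights are used directly as the per-component coefficients in Eq. (7). It applies the same linear blend to every parameter; whether attributes such as rotations need additional handling is not specified here.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
from typing import List, Sequence


def blend_gaussians(
    theta_rest: Sequence[float],
    theta_targets: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Linear blend of Eq. (7): theta_rest + sum_i w_i * (theta_i - theta_rest).

    theta_rest    -- flattened rest-pose Gaussian parameters (assumed layout)
    theta_targets -- K flattened parameter sets, one per expression component
    weights       -- K blend weights, e.g. from a face tracker (assumption)
    """
    if len(theta_targets) != len(weights):
        raise ValueError("need exactly one weight per target parameter set")

    blended = list(theta_rest)
    for theta_i, w in zip(theta_targets, weights):
        if len(theta_i) != len(theta_rest):
            raise ValueError("every target must match the rest-pose length")
        # Add the weighted offset of this component from the rest pose.
        for j, (target, rest) in enumerate(zip(theta_i, theta_rest)):
            blended[j] += w * (target - rest)
    return blended


if __name__ == "__main__":
    rest = [0.0, 1.0, 2.0]
    targets = [[1.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
    print(blend_gaussians(rest, targets, [0.5, 0.25]))
```
</pre>
<p class="ltx_p">With all weights at zero the function returns the rest pose unchanged, and with a single weight of one it returns that component's parameter set. Because the per-frame cost is one multiply-add per parameter and component, the number of components trades expression detail against speed, as noted above.</p> <p class="ltx_p">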
The resulting Gaussians <math alttext="\theta_{\text{mobile}}" class="ltx_Math" display="inline" id="S3.SS3.p3.7.m3.1"><semantics id="S3.SS3.p3.7.m3.1a"><msub id="S3.SS3.p3.7.m3.1.1" xref="S3.SS3.p3.7.m3.1.1.cmml"><mi id="S3.SS3.p3.7.m3.1.1.2" xref="S3.SS3.p3.7.m3.1.1.2.cmml">θ</mi><mtext id="S3.SS3.p3.7.m3.1.1.3" xref="S3.SS3.p3.7.m3.1.1.3a.cmml">mobile</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.7.m3.1b"><apply id="S3.SS3.p3.7.m3.1.1.cmml" xref="S3.SS3.p3.7.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p3.7.m3.1.1.1.cmml" xref="S3.SS3.p3.7.m3.1.1">subscript</csymbol><ci id="S3.SS3.p3.7.m3.1.1.2.cmml" xref="S3.SS3.p3.7.m3.1.1.2">𝜃</ci><ci id="S3.SS3.p3.7.m3.1.1.3a.cmml" xref="S3.SS3.p3.7.m3.1.1.3"><mtext id="S3.SS3.p3.7.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p3.7.m3.1.1.3">mobile</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.7.m3.1c">\theta_{\text{mobile}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.7.m3.1d">italic_θ start_POSTSUBSCRIPT mobile end_POSTSUBSCRIPT</annotation></semantics></math> are rendered in WebGL, which keeps rendering efficient while retaining high visual fidelity. 
To demonstrate this capability, we developed a JavaScript application that allows users to control their avatars directly in their browsers.</p> </div> </section> <section class="ltx_subsection" id="S3.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.4 </span>Training and Losses</h3> <div class="ltx_para ltx_noindent" id="S3.SS4.p1"> <p class="ltx_p" id="S3.SS4.p1.4"><span class="ltx_text ltx_font_bold" id="S3.SS4.p1.4.1">2D Dual-Stylized Avatar Generation.</span> Our training process involves using GDA to map the input reference image <math alttext="I_{\text{ref}}" class="ltx_Math" display="inline" id="S3.SS4.p1.1.m1.1"><semantics id="S3.SS4.p1.1.m1.1a"><msub id="S3.SS4.p1.1.m1.1.1" xref="S3.SS4.p1.1.m1.1.1.cmml"><mi id="S3.SS4.p1.1.m1.1.1.2" xref="S3.SS4.p1.1.m1.1.1.2.cmml">I</mi><mtext id="S3.SS4.p1.1.m1.1.1.3" xref="S3.SS4.p1.1.m1.1.1.3a.cmml">ref</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.1.m1.1b"><apply id="S3.SS4.p1.1.m1.1.1.cmml" xref="S3.SS4.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.1.m1.1.1.1.cmml" xref="S3.SS4.p1.1.m1.1.1">subscript</csymbol><ci id="S3.SS4.p1.1.m1.1.1.2.cmml" xref="S3.SS4.p1.1.m1.1.1.2">𝐼</ci><ci id="S3.SS4.p1.1.m1.1.1.3a.cmml" xref="S3.SS4.p1.1.m1.1.1.3"><mtext id="S3.SS4.p1.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.1.m1.1.1.3">ref</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.1.m1.1c">I_{\text{ref}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.1.m1.1d">italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT</annotation></semantics></math> to Gaussian parameters <math alttext="\theta_{\text{GDA}}" class="ltx_Math" display="inline" id="S3.SS4.p1.2.m2.1"><semantics id="S3.SS4.p1.2.m2.1a"><msub id="S3.SS4.p1.2.m2.1.1" xref="S3.SS4.p1.2.m2.1.1.cmml"><mi id="S3.SS4.p1.2.m2.1.1.2" xref="S3.SS4.p1.2.m2.1.1.2.cmml">θ</mi><mtext id="S3.SS4.p1.2.m2.1.1.3" 
xref="S3.SS4.p1.2.m2.1.1.3a.cmml">GDA</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.2.m2.1b"><apply id="S3.SS4.p1.2.m2.1.1.cmml" xref="S3.SS4.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.2.m2.1.1.1.cmml" xref="S3.SS4.p1.2.m2.1.1">subscript</csymbol><ci id="S3.SS4.p1.2.m2.1.1.2.cmml" xref="S3.SS4.p1.2.m2.1.1.2">𝜃</ci><ci id="S3.SS4.p1.2.m2.1.1.3a.cmml" xref="S3.SS4.p1.2.m2.1.1.3"><mtext id="S3.SS4.p1.2.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.2.m2.1.1.3">GDA</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.2.m2.1c">\theta_{\text{GDA}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.2.m2.1d">italic_θ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT</annotation></semantics></math>, which are then used to render the unposed primary-style avatar <math alttext="I_{\text{unposed}}" class="ltx_Math" display="inline" id="S3.SS4.p1.3.m3.1"><semantics id="S3.SS4.p1.3.m3.1a"><msub id="S3.SS4.p1.3.m3.1.1" xref="S3.SS4.p1.3.m3.1.1.cmml"><mi id="S3.SS4.p1.3.m3.1.1.2" xref="S3.SS4.p1.3.m3.1.1.2.cmml">I</mi><mtext id="S3.SS4.p1.3.m3.1.1.3" xref="S3.SS4.p1.3.m3.1.1.3a.cmml">unposed</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.3.m3.1b"><apply id="S3.SS4.p1.3.m3.1.1.cmml" xref="S3.SS4.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.3.m3.1.1.1.cmml" xref="S3.SS4.p1.3.m3.1.1">subscript</csymbol><ci id="S3.SS4.p1.3.m3.1.1.2.cmml" xref="S3.SS4.p1.3.m3.1.1.2">𝐼</ci><ci id="S3.SS4.p1.3.m3.1.1.3a.cmml" xref="S3.SS4.p1.3.m3.1.1.3"><mtext id="S3.SS4.p1.3.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.3.m3.1.1.3">unposed</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.3.m3.1c">I_{\text{unposed}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.3.m3.1d">italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT</annotation></semantics></math> from the frontal view via a 3DGS renderer <math 
alttext="\mathcal{E}_{\text{3D}}^{\text{render}}" class="ltx_Math" display="inline" id="S3.SS4.p1.4.m4.1"><semantics id="S3.SS4.p1.4.m4.1a"><msubsup id="S3.SS4.p1.4.m4.1.1" xref="S3.SS4.p1.4.m4.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p1.4.m4.1.1.2.2" xref="S3.SS4.p1.4.m4.1.1.2.2.cmml">ℰ</mi><mtext id="S3.SS4.p1.4.m4.1.1.2.3" xref="S3.SS4.p1.4.m4.1.1.2.3a.cmml">3D</mtext><mtext id="S3.SS4.p1.4.m4.1.1.3" xref="S3.SS4.p1.4.m4.1.1.3a.cmml">render</mtext></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.4.m4.1b"><apply id="S3.SS4.p1.4.m4.1.1.cmml" xref="S3.SS4.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.4.m4.1.1.1.cmml" xref="S3.SS4.p1.4.m4.1.1">superscript</csymbol><apply id="S3.SS4.p1.4.m4.1.1.2.cmml" xref="S3.SS4.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.4.m4.1.1.2.1.cmml" xref="S3.SS4.p1.4.m4.1.1">subscript</csymbol><ci id="S3.SS4.p1.4.m4.1.1.2.2.cmml" xref="S3.SS4.p1.4.m4.1.1.2.2">ℰ</ci><ci id="S3.SS4.p1.4.m4.1.1.2.3a.cmml" xref="S3.SS4.p1.4.m4.1.1.2.3"><mtext id="S3.SS4.p1.4.m4.1.1.2.3.cmml" mathsize="70%" xref="S3.SS4.p1.4.m4.1.1.2.3">3D</mtext></ci></apply><ci id="S3.SS4.p1.4.m4.1.1.3a.cmml" xref="S3.SS4.p1.4.m4.1.1.3"><mtext id="S3.SS4.p1.4.m4.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.4.m4.1.1.3">render</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.4.m4.1c">\mathcal{E}_{\text{3D}}^{\text{render}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.4.m4.1d">caligraphic_E start_POSTSUBSCRIPT 3D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT</annotation></semantics></math>. 
The rendered image is supervised using a combination of Mean Squared Error (MSE) and perceptual LPIPS <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib64" title=""><span class="ltx_text" style="font-size:90%;">64</span></a>]</cite> losses:</p> </div> <div class="ltx_para" id="S3.SS4.p2"> <table class="ltx_equation ltx_eqn_table" id="S3.E8"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{GDA}}=\mathcal{L}_{\text{mse}}(I_{\text{ref}},I_{\text{sty}% })+\mathcal{L}_{\text{LPIPS}}(I_{\text{ref}},I_{\text{sty}})." class="ltx_Math" display="block" id="S3.E8.m1.1"><semantics id="S3.E8.m1.1a"><mrow id="S3.E8.m1.1.1.1" xref="S3.E8.m1.1.1.1.1.cmml"><mrow id="S3.E8.m1.1.1.1.1" xref="S3.E8.m1.1.1.1.1.cmml"><msub id="S3.E8.m1.1.1.1.1.6" xref="S3.E8.m1.1.1.1.1.6.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E8.m1.1.1.1.1.6.2" xref="S3.E8.m1.1.1.1.1.6.2.cmml">ℒ</mi><mtext id="S3.E8.m1.1.1.1.1.6.3" xref="S3.E8.m1.1.1.1.1.6.3a.cmml">GDA</mtext></msub><mo id="S3.E8.m1.1.1.1.1.5" xref="S3.E8.m1.1.1.1.1.5.cmml">=</mo><mrow id="S3.E8.m1.1.1.1.1.4" xref="S3.E8.m1.1.1.1.1.4.cmml"><mrow id="S3.E8.m1.1.1.1.1.2.2" xref="S3.E8.m1.1.1.1.1.2.2.cmml"><msub id="S3.E8.m1.1.1.1.1.2.2.4" xref="S3.E8.m1.1.1.1.1.2.2.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E8.m1.1.1.1.1.2.2.4.2" xref="S3.E8.m1.1.1.1.1.2.2.4.2.cmml">ℒ</mi><mtext id="S3.E8.m1.1.1.1.1.2.2.4.3" xref="S3.E8.m1.1.1.1.1.2.2.4.3a.cmml">mse</mtext></msub><mo id="S3.E8.m1.1.1.1.1.2.2.3" xref="S3.E8.m1.1.1.1.1.2.2.3.cmml">⁢</mo><mrow id="S3.E8.m1.1.1.1.1.2.2.2.2" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml"><mo id="S3.E8.m1.1.1.1.1.2.2.2.2.3" stretchy="false" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml">(</mo><msub id="S3.E8.m1.1.1.1.1.1.1.1.1.1" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E8.m1.1.1.1.1.1.1.1.1.1.2" 
xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.1.1.1.1.1.3" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.3a.cmml">ref</mtext></msub><mo id="S3.E8.m1.1.1.1.1.2.2.2.2.4" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml">,</mo><msub id="S3.E8.m1.1.1.1.1.2.2.2.2.2" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.cmml"><mi id="S3.E8.m1.1.1.1.1.2.2.2.2.2.2" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.2.2.2.2.2.3" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.3a.cmml">sty</mtext></msub><mo id="S3.E8.m1.1.1.1.1.2.2.2.2.5" stretchy="false" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E8.m1.1.1.1.1.4.5" xref="S3.E8.m1.1.1.1.1.4.5.cmml">+</mo><mrow id="S3.E8.m1.1.1.1.1.4.4" xref="S3.E8.m1.1.1.1.1.4.4.cmml"><msub id="S3.E8.m1.1.1.1.1.4.4.4" xref="S3.E8.m1.1.1.1.1.4.4.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E8.m1.1.1.1.1.4.4.4.2" xref="S3.E8.m1.1.1.1.1.4.4.4.2.cmml">ℒ</mi><mtext id="S3.E8.m1.1.1.1.1.4.4.4.3" xref="S3.E8.m1.1.1.1.1.4.4.4.3a.cmml">LPIPS</mtext></msub><mo id="S3.E8.m1.1.1.1.1.4.4.3" xref="S3.E8.m1.1.1.1.1.4.4.3.cmml">⁢</mo><mrow id="S3.E8.m1.1.1.1.1.4.4.2.2" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml"><mo id="S3.E8.m1.1.1.1.1.4.4.2.2.3" stretchy="false" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml">(</mo><msub id="S3.E8.m1.1.1.1.1.3.3.1.1.1" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.cmml"><mi id="S3.E8.m1.1.1.1.1.3.3.1.1.1.2" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.3.3.1.1.1.3" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.3a.cmml">ref</mtext></msub><mo id="S3.E8.m1.1.1.1.1.4.4.2.2.4" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml">,</mo><msub id="S3.E8.m1.1.1.1.1.4.4.2.2.2" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.cmml"><mi id="S3.E8.m1.1.1.1.1.4.4.2.2.2.2" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.4.4.2.2.2.3" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.3a.cmml">sty</mtext></msub><mo id="S3.E8.m1.1.1.1.1.4.4.2.2.5" stretchy="false" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml">)</mo></mrow></mrow></mrow></mrow><mo 
id="S3.E8.m1.1.1.1.2" lspace="0em" xref="S3.E8.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E8.m1.1b"><apply id="S3.E8.m1.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1"><eq id="S3.E8.m1.1.1.1.1.5.cmml" xref="S3.E8.m1.1.1.1.1.5"></eq><apply id="S3.E8.m1.1.1.1.1.6.cmml" xref="S3.E8.m1.1.1.1.1.6"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.6.1.cmml" xref="S3.E8.m1.1.1.1.1.6">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.6.2.cmml" xref="S3.E8.m1.1.1.1.1.6.2">ℒ</ci><ci id="S3.E8.m1.1.1.1.1.6.3a.cmml" xref="S3.E8.m1.1.1.1.1.6.3"><mtext id="S3.E8.m1.1.1.1.1.6.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.6.3">GDA</mtext></ci></apply><apply id="S3.E8.m1.1.1.1.1.4.cmml" xref="S3.E8.m1.1.1.1.1.4"><plus id="S3.E8.m1.1.1.1.1.4.5.cmml" xref="S3.E8.m1.1.1.1.1.4.5"></plus><apply id="S3.E8.m1.1.1.1.1.2.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2"><times id="S3.E8.m1.1.1.1.1.2.2.3.cmml" xref="S3.E8.m1.1.1.1.1.2.2.3"></times><apply id="S3.E8.m1.1.1.1.1.2.2.4.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.2.2.4.1.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.2.2.4.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4.2">ℒ</ci><ci id="S3.E8.m1.1.1.1.1.2.2.4.3a.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4.3"><mtext id="S3.E8.m1.1.1.1.1.2.2.4.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.2.2.4.3">mse</mtext></ci></apply><interval closure="open" id="S3.E8.m1.1.1.1.1.2.2.2.3.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2"><apply id="S3.E8.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E8.m1.1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.3">ref</mtext></ci></apply><apply id="S3.E8.m1.1.1.1.1.2.2.2.2.2.cmml" 
xref="S3.E8.m1.1.1.1.1.2.2.2.2.2"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.2.2.2.2.2.1.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.2.2.2.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.2.2.2.2.2.3a.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.3"><mtext id="S3.E8.m1.1.1.1.1.2.2.2.2.2.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.3">sty</mtext></ci></apply></interval></apply><apply id="S3.E8.m1.1.1.1.1.4.4.cmml" xref="S3.E8.m1.1.1.1.1.4.4"><times id="S3.E8.m1.1.1.1.1.4.4.3.cmml" xref="S3.E8.m1.1.1.1.1.4.4.3"></times><apply id="S3.E8.m1.1.1.1.1.4.4.4.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.4.4.4.1.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.4.4.4.2.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4.2">ℒ</ci><ci id="S3.E8.m1.1.1.1.1.4.4.4.3a.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4.3"><mtext id="S3.E8.m1.1.1.1.1.4.4.4.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.4.4.4.3">LPIPS</mtext></ci></apply><interval closure="open" id="S3.E8.m1.1.1.1.1.4.4.2.3.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2"><apply id="S3.E8.m1.1.1.1.1.3.3.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.3.3.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.3.3.1.1.1.2.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.3.3.1.1.1.3a.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.3"><mtext id="S3.E8.m1.1.1.1.1.3.3.1.1.1.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.3">ref</mtext></ci></apply><apply id="S3.E8.m1.1.1.1.1.4.4.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.4.4.2.2.2.1.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.4.4.2.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.4.4.2.2.2.3a.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.3"><mtext 
id="S3.E8.m1.1.1.1.1.4.4.2.2.2.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.3">sty</mtext></ci></apply></interval></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E8.m1.1c">\mathcal{L}_{\text{GDA}}=\mathcal{L}_{\text{mse}}(I_{\text{ref}},I_{\text{sty}% })+\mathcal{L}_{\text{LPIPS}}(I_{\text{ref}},I_{\text{sty}}).</annotation><annotation encoding="application/x-llamapun" id="S3.E8.m1.1d">caligraphic_L start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT mse end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT ) + caligraphic_L start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(8)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS4.p2.1">Despite potential noise introduced by low-quality GAN inversion, the extensive pre-training on 3D datasets including Objaverse equips our network with strong generalization capabilities, enabling effective real-to-avatar domain adaptation. As shown in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.F3" title="Figure 3 ‣ 3.4 Training and Losses ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3</span></a>, GDA efficiently transforms realistic faces into a primary style while preserving the subjects’ identity and enhancing features, such as eye size.</p> </div> <figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="403" id="S3.F3.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S3.F3.3.1.1" style="font-size:90%;">Figure 3</span>: </span><span class="ltx_text ltx_font_bold" id="S3.F3.4.2" style="font-size:90%;">Gaussian Domain Adaptation.<span class="ltx_text ltx_font_medium" id="S3.F3.4.2.1"> We show the outputs of the GDA network over several training epochs to visualize the domain shifts from natural images to cartoon avatars.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="S3.SS4.p3"> <p class="ltx_p" id="S3.SS4.p3.2"><span class="ltx_text ltx_font_bold" id="S3.SS4.p3.2.1">3D Animatable Stylized Avatar Generation.</span> To improve the surface geometry of avatars, our model incorporates normal consistency and depth distortion losses. 
The normal consistency loss <math alttext="\mathcal{L}_{\text{normal}}" class="ltx_Math" display="inline" id="S3.SS4.p3.1.m1.1"><semantics id="S3.SS4.p3.1.m1.1a"><msub id="S3.SS4.p3.1.m1.1.1" xref="S3.SS4.p3.1.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.1.m1.1.1.2" xref="S3.SS4.p3.1.m1.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.1.m1.1.1.3" xref="S3.SS4.p3.1.m1.1.1.3a.cmml">normal</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.1.m1.1b"><apply id="S3.SS4.p3.1.m1.1.1.cmml" xref="S3.SS4.p3.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.1.m1.1.1.1.cmml" xref="S3.SS4.p3.1.m1.1.1">subscript</csymbol><ci id="S3.SS4.p3.1.m1.1.1.2.cmml" xref="S3.SS4.p3.1.m1.1.1.2">ℒ</ci><ci id="S3.SS4.p3.1.m1.1.1.3a.cmml" xref="S3.SS4.p3.1.m1.1.1.3"><mtext id="S3.SS4.p3.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.1.m1.1.1.3">normal</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.1.m1.1c">\mathcal{L}_{\text{normal}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.1.m1.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT</annotation></semantics></math> aligns the normals of 2D Gaussians <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib20" title=""><span class="ltx_text" style="font-size:90%;">20</span></a>]</cite> with surface normals determined through finite differences from rendered depths, thereby reducing noise. 
Meanwhile, the depth distortion loss <math alttext="\mathcal{L}_{\text{dist}}" class="ltx_Math" display="inline" id="S3.SS4.p3.2.m2.1"><semantics id="S3.SS4.p3.2.m2.1a"><msub id="S3.SS4.p3.2.m2.1.1" xref="S3.SS4.p3.2.m2.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.2.m2.1.1.2" xref="S3.SS4.p3.2.m2.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.2.m2.1.1.3" xref="S3.SS4.p3.2.m2.1.1.3a.cmml">dist</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.2.m2.1b"><apply id="S3.SS4.p3.2.m2.1.1.cmml" xref="S3.SS4.p3.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.2.m2.1.1.1.cmml" xref="S3.SS4.p3.2.m2.1.1">subscript</csymbol><ci id="S3.SS4.p3.2.m2.1.1.2.cmml" xref="S3.SS4.p3.2.m2.1.1.2">ℒ</ci><ci id="S3.SS4.p3.2.m2.1.1.3a.cmml" xref="S3.SS4.p3.2.m2.1.1.3"><mtext id="S3.SS4.p3.2.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.2.m2.1.1.3">dist</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.2.m2.1c">\mathcal{L}_{\text{dist}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.2.m2.1d">caligraphic_L start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT</annotation></semantics></math>, implemented following <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib2" title=""><span class="ltx_text" style="font-size:90%;">2</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib3" title=""><span class="ltx_text" style="font-size:90%;">3</span></a>]</cite>, encourages Gaussians to cluster closely along camera rays, effectively enhancing surface representation. This optimization allows our network to output avatars with detailed geometry, suitable for applications such as animation and relighting. 
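</p> <p class="ltx_p">The normal consistency term described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes an orthographic camera with unit pixel spacing, central finite differences on the rendered depth, and a plain mean over interior pixels. The function names <code>normalize</code>, <code>depth_to_normals</code> and <code>normal_consistency_loss</code> are hypothetical and chosen only for this example.</p>

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        return tuple(v)
    return tuple(c / norm for c in v)

def depth_to_normals(depth):
    """Estimate a surface normal at each interior pixel of a depth map.

    Assumption of this sketch: orthographic camera, unit pixel spacing.
    """
    h, w = len(depth), len(depth[0])
    normals = {}
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central finite differences of depth along x and y.
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            # Cross product of the tangents (1, 0, dzdx) and (0, 1, dzdy).
            normals[(y, x)] = normalize((-dzdx, -dzdy, 1.0))
    return normals

def normal_consistency_loss(pred_normals, depth):
    """Mean of 1 - (n_pred . n_surf) over interior pixels (averaging is an assumption)."""
    surf_normals = depth_to_normals(depth)
    if not surf_normals:
        raise ValueError("depth map needs at least 3 rows and 3 columns")
    total = 0.0
    for (y, x), n_surf in surf_normals.items():
        n_pred = normalize(pred_normals[y][x])
        total += 1.0 - sum(p * s for p, s in zip(n_pred, n_surf))
    return total / len(surf_normals)
```

<p class="ltx_p">In this sketch, a flat depth map paired with predicted normals pointing along the viewing axis gives a loss of zero, and the loss grows as the predicted normals tilt away from the depth-derived ones.</p> <p class="ltx_p">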
The total loss function for the 3D generation network is defined as:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E9"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{3DGen}}=\mathcal{L}_{\text{render}}+\lambda_{\text{LPIPS}}% \mathcal{L}_{\text{LPIPS}}+\lambda_{\text{n}}\mathcal{L}_{\text{normal}}+% \lambda_{\text{d}}\mathcal{L}_{\text{dist}}," class="ltx_Math" display="block" id="S3.E9.m1.1"><semantics id="S3.E9.m1.1a"><mrow id="S3.E9.m1.1.1.1" xref="S3.E9.m1.1.1.1.1.cmml"><mrow id="S3.E9.m1.1.1.1.1" xref="S3.E9.m1.1.1.1.1.cmml"><msub id="S3.E9.m1.1.1.1.1.2" xref="S3.E9.m1.1.1.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.2.2" xref="S3.E9.m1.1.1.1.1.2.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.2.3" xref="S3.E9.m1.1.1.1.1.2.3a.cmml">3DGen</mtext></msub><mo id="S3.E9.m1.1.1.1.1.1" xref="S3.E9.m1.1.1.1.1.1.cmml">=</mo><mrow id="S3.E9.m1.1.1.1.1.3" xref="S3.E9.m1.1.1.1.1.3.cmml"><msub id="S3.E9.m1.1.1.1.1.3.2" xref="S3.E9.m1.1.1.1.1.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.2.2" xref="S3.E9.m1.1.1.1.1.3.2.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.2.3" xref="S3.E9.m1.1.1.1.1.3.2.3a.cmml">render</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.1" xref="S3.E9.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E9.m1.1.1.1.1.3.3" xref="S3.E9.m1.1.1.1.1.3.3.cmml"><msub id="S3.E9.m1.1.1.1.1.3.3.2" xref="S3.E9.m1.1.1.1.1.3.3.2.cmml"><mi id="S3.E9.m1.1.1.1.1.3.3.2.2" xref="S3.E9.m1.1.1.1.1.3.3.2.2.cmml">λ</mi><mtext id="S3.E9.m1.1.1.1.1.3.3.2.3" xref="S3.E9.m1.1.1.1.1.3.3.2.3a.cmml">LPIPS</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.3.1" xref="S3.E9.m1.1.1.1.1.3.3.1.cmml">⁢</mo><msub id="S3.E9.m1.1.1.1.1.3.3.3" xref="S3.E9.m1.1.1.1.1.3.3.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.3.3.2" xref="S3.E9.m1.1.1.1.1.3.3.3.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.3.3.3" 
xref="S3.E9.m1.1.1.1.1.3.3.3.3a.cmml">LPIPS</mtext></msub></mrow><mo id="S3.E9.m1.1.1.1.1.3.1a" xref="S3.E9.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E9.m1.1.1.1.1.3.4" xref="S3.E9.m1.1.1.1.1.3.4.cmml"><msub id="S3.E9.m1.1.1.1.1.3.4.2" xref="S3.E9.m1.1.1.1.1.3.4.2.cmml"><mi id="S3.E9.m1.1.1.1.1.3.4.2.2" xref="S3.E9.m1.1.1.1.1.3.4.2.2.cmml">λ</mi><mtext id="S3.E9.m1.1.1.1.1.3.4.2.3" xref="S3.E9.m1.1.1.1.1.3.4.2.3a.cmml">n</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.4.1" xref="S3.E9.m1.1.1.1.1.3.4.1.cmml">⁢</mo><msub id="S3.E9.m1.1.1.1.1.3.4.3" xref="S3.E9.m1.1.1.1.1.3.4.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.4.3.2" xref="S3.E9.m1.1.1.1.1.3.4.3.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.4.3.3" xref="S3.E9.m1.1.1.1.1.3.4.3.3a.cmml">normal</mtext></msub></mrow><mo id="S3.E9.m1.1.1.1.1.3.1b" xref="S3.E9.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E9.m1.1.1.1.1.3.5" xref="S3.E9.m1.1.1.1.1.3.5.cmml"><msub id="S3.E9.m1.1.1.1.1.3.5.2" xref="S3.E9.m1.1.1.1.1.3.5.2.cmml"><mi id="S3.E9.m1.1.1.1.1.3.5.2.2" xref="S3.E9.m1.1.1.1.1.3.5.2.2.cmml">λ</mi><mtext id="S3.E9.m1.1.1.1.1.3.5.2.3" xref="S3.E9.m1.1.1.1.1.3.5.2.3a.cmml">d</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.5.1" xref="S3.E9.m1.1.1.1.1.3.5.1.cmml">⁢</mo><msub id="S3.E9.m1.1.1.1.1.3.5.3" xref="S3.E9.m1.1.1.1.1.3.5.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.5.3.2" xref="S3.E9.m1.1.1.1.1.3.5.3.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.5.3.3" xref="S3.E9.m1.1.1.1.1.3.5.3.3a.cmml">dist</mtext></msub></mrow></mrow></mrow><mo id="S3.E9.m1.1.1.1.2" xref="S3.E9.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E9.m1.1b"><apply id="S3.E9.m1.1.1.1.1.cmml" xref="S3.E9.m1.1.1.1"><eq id="S3.E9.m1.1.1.1.1.1.cmml" xref="S3.E9.m1.1.1.1.1.1"></eq><apply id="S3.E9.m1.1.1.1.1.2.cmml" xref="S3.E9.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.2.1.cmml" xref="S3.E9.m1.1.1.1.1.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.2.2.cmml" 
xref="S3.E9.m1.1.1.1.1.2.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.2.3"><mtext id="S3.E9.m1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.2.3">3DGen</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.cmml" xref="S3.E9.m1.1.1.1.1.3"><plus id="S3.E9.m1.1.1.1.1.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.1"></plus><apply id="S3.E9.m1.1.1.1.1.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.2.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.2.3">render</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.3.cmml" xref="S3.E9.m1.1.1.1.1.3.3"><times id="S3.E9.m1.1.1.1.1.3.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.3.1"></times><apply id="S3.E9.m1.1.1.1.1.3.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.3.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.3.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2.2">𝜆</ci><ci id="S3.E9.m1.1.1.1.1.3.3.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.3.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.3.2.3">LPIPS</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.3.3.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.3.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.3.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.3.3.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3.3"><mtext id="S3.E9.m1.1.1.1.1.3.3.3.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.3.3.3">LPIPS</mtext></ci></apply></apply><apply id="S3.E9.m1.1.1.1.1.3.4.cmml" xref="S3.E9.m1.1.1.1.1.3.4"><times id="S3.E9.m1.1.1.1.1.3.4.1.cmml" xref="S3.E9.m1.1.1.1.1.3.4.1"></times><apply id="S3.E9.m1.1.1.1.1.3.4.2.cmml" 
xref="S3.E9.m1.1.1.1.1.3.4.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.4.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.4.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.4.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.4.2.2">𝜆</ci><ci id="S3.E9.m1.1.1.1.1.3.4.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.4.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.4.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.4.2.3">n</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.4.3.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.4.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.4.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.4.3.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3.3"><mtext id="S3.E9.m1.1.1.1.1.3.4.3.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.4.3.3">normal</mtext></ci></apply></apply><apply id="S3.E9.m1.1.1.1.1.3.5.cmml" xref="S3.E9.m1.1.1.1.1.3.5"><times id="S3.E9.m1.1.1.1.1.3.5.1.cmml" xref="S3.E9.m1.1.1.1.1.3.5.1"></times><apply id="S3.E9.m1.1.1.1.1.3.5.2.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.5.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.5.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2.2">𝜆</ci><ci id="S3.E9.m1.1.1.1.1.3.5.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.5.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.5.2.3">d</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.5.3.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.5.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.5.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.5.3.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3.3"><mtext id="S3.E9.m1.1.1.1.1.3.5.3.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.5.3.3">dist</mtext></ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.E9.m1.1c">\mathcal{L}_{\text{3DGen}}=\mathcal{L}_{\text{render}}+\lambda_{\text{LPIPS}}% \mathcal{L}_{\text{LPIPS}}+\lambda_{\text{n}}\mathcal{L}_{\text{normal}}+% \lambda_{\text{d}}\mathcal{L}_{\text{dist}},</annotation><annotation encoding="application/x-llamapun" id="S3.E9.m1.1d">caligraphic_L start_POSTSUBSCRIPT 3DGen end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT render end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT n end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT d end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(9)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS4.p3.3">where <math alttext="\mathcal{L}_{\text{render}}" class="ltx_Math" display="inline" id="S3.SS4.p3.3.m1.1"><semantics id="S3.SS4.p3.3.m1.1a"><msub id="S3.SS4.p3.3.m1.1.1" xref="S3.SS4.p3.3.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.3.m1.1.1.2" xref="S3.SS4.p3.3.m1.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.3.m1.1.1.3" xref="S3.SS4.p3.3.m1.1.1.3a.cmml">render</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.3.m1.1b"><apply id="S3.SS4.p3.3.m1.1.1.cmml" xref="S3.SS4.p3.3.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.3.m1.1.1.1.cmml" xref="S3.SS4.p3.3.m1.1.1">subscript</csymbol><ci id="S3.SS4.p3.3.m1.1.1.2.cmml" xref="S3.SS4.p3.3.m1.1.1.2">ℒ</ci><ci id="S3.SS4.p3.3.m1.1.1.3a.cmml" xref="S3.SS4.p3.3.m1.1.1.3"><mtext id="S3.SS4.p3.3.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.3.m1.1.1.3">render</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS4.p3.3.m1.1c">\mathcal{L}_{\text{render}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.3.m1.1d">caligraphic_L start_POSTSUBSCRIPT render end_POSTSUBSCRIPT</annotation></semantics></math> combines RGB and alpha mask losses:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E10"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{render}}=\|{I}_{\text{posed}}-{I}^{\text{gt}}_{\text{posed}% }\|_{2}^{2}+\|\mathbf{\alpha}^{\text{pred}}-\mathbf{M}^{\text{gt}}\|_{2}^{2}." class="ltx_Math" display="block" id="S3.E10.m1.1"><semantics id="S3.E10.m1.1a"><mrow id="S3.E10.m1.1.1.1" xref="S3.E10.m1.1.1.1.1.cmml"><mrow id="S3.E10.m1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.cmml"><msub id="S3.E10.m1.1.1.1.1.4" xref="S3.E10.m1.1.1.1.1.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E10.m1.1.1.1.1.4.2" xref="S3.E10.m1.1.1.1.1.4.2.cmml">ℒ</mi><mtext id="S3.E10.m1.1.1.1.1.4.3" xref="S3.E10.m1.1.1.1.1.4.3a.cmml">render</mtext></msub><mo id="S3.E10.m1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.3.cmml">=</mo><mrow id="S3.E10.m1.1.1.1.1.2" xref="S3.E10.m1.1.1.1.1.2.cmml"><msubsup id="S3.E10.m1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.cmml"><mrow id="S3.E10.m1.1.1.1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.1.1.2.cmml"><mo id="S3.E10.m1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E10.m1.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.cmml"><msub id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml">I</mi><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3a.cmml">posed</mtext></msub><mo id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1.cmml">−</mo><msubsup 
id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2.cmml">I</mi><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3a.cmml">posed</mtext><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3a.cmml">gt</mtext></msubsup></mrow><mo id="S3.E10.m1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E10.m1.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow><mn id="S3.E10.m1.1.1.1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.1.1.1.3.cmml">2</mn><mn id="S3.E10.m1.1.1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.1.1.3.cmml">2</mn></msubsup><mo id="S3.E10.m1.1.1.1.1.2.3" xref="S3.E10.m1.1.1.1.1.2.3.cmml">+</mo><msubsup id="S3.E10.m1.1.1.1.1.2.2" xref="S3.E10.m1.1.1.1.1.2.2.cmml"><mrow id="S3.E10.m1.1.1.1.1.2.2.1.1.1" xref="S3.E10.m1.1.1.1.1.2.2.1.1.2.cmml"><mo id="S3.E10.m1.1.1.1.1.2.2.1.1.1.2" stretchy="false" xref="S3.E10.m1.1.1.1.1.2.2.1.1.2.1.cmml">‖</mo><mrow id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.cmml"><msup id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.cmml"><mi id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2.cmml">α</mi><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3a.cmml">pred</mtext></msup><mo id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1.cmml">−</mo><msup id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.cmml"><mi id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2.cmml">𝐌</mi><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3a.cmml">gt</mtext></msup></mrow><mo id="S3.E10.m1.1.1.1.1.2.2.1.1.1.3" stretchy="false" xref="S3.E10.m1.1.1.1.1.2.2.1.1.2.1.cmml">‖</mo></mrow><mn id="S3.E10.m1.1.1.1.1.2.2.1.3" xref="S3.E10.m1.1.1.1.1.2.2.1.3.cmml">2</mn><mn id="S3.E10.m1.1.1.1.1.2.2.3" 
xref="S3.E10.m1.1.1.1.1.2.2.3.cmml">2</mn></msubsup></mrow></mrow><mo id="S3.E10.m1.1.1.1.2" lspace="0em" xref="S3.E10.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E10.m1.1b"><apply id="S3.E10.m1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1"><eq id="S3.E10.m1.1.1.1.1.3.cmml" xref="S3.E10.m1.1.1.1.1.3"></eq><apply id="S3.E10.m1.1.1.1.1.4.cmml" xref="S3.E10.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.4.1.cmml" xref="S3.E10.m1.1.1.1.1.4">subscript</csymbol><ci id="S3.E10.m1.1.1.1.1.4.2.cmml" xref="S3.E10.m1.1.1.1.1.4.2">ℒ</ci><ci id="S3.E10.m1.1.1.1.1.4.3a.cmml" xref="S3.E10.m1.1.1.1.1.4.3"><mtext id="S3.E10.m1.1.1.1.1.4.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.4.3">render</mtext></ci></apply><apply id="S3.E10.m1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2"><plus id="S3.E10.m1.1.1.1.1.2.3.cmml" xref="S3.E10.m1.1.1.1.1.2.3"></plus><apply id="S3.E10.m1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1">superscript</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1">subscript</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S3.E10.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1"><minus id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1"></minus><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2">𝐼</ci><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3a.cmml" 
xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3"><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3">posed</mtext></ci></apply><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3">superscript</csymbol><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2">𝐼</ci><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3a.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3"><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3">gt</mtext></ci></apply><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3">posed</mtext></ci></apply></apply></apply><cn id="S3.E10.m1.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.1.1.1.3">2</cn></apply><cn id="S3.E10.m1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.1.1.3">2</cn></apply><apply id="S3.E10.m1.1.1.1.1.2.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2">superscript</csymbol><apply id="S3.E10.m1.1.1.1.1.2.2.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2">subscript</csymbol><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1"><csymbol cd="latexml" id="S3.E10.m1.1.1.1.1.2.2.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.2">norm</csymbol><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.cmml" 
xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1"><minus id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1"></minus><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2">superscript</csymbol><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2">𝛼</ci><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3a.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3"><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3">pred</mtext></ci></apply><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3">superscript</csymbol><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2">𝐌</ci><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3a.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3"><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3">gt</mtext></ci></apply></apply></apply><cn id="S3.E10.m1.1.1.1.1.2.2.1.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.2.2.1.3">2</cn></apply><cn id="S3.E10.m1.1.1.1.1.2.2.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.2.2.3">2</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E10.m1.1c">\mathcal{L}_{\text{render}}=\|{I}_{\text{posed}}-{I}^{\text{gt}}_{\text{posed}% }\|_{2}^{2}+\|\mathbf{\alpha}^{\text{pred}}-\mathbf{M}^{\text{gt}}\|_{2}^{2}.</annotation><annotation encoding="application/x-llamapun" id="S3.E10.m1.1d">caligraphic_L start_POSTSUBSCRIPT render end_POSTSUBSCRIPT = ∥ italic_I start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT - italic_I start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 
start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_α start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT - bold_M start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(10)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS4.p3.4"><math alttext="\mathcal{L}_{\text{normal}}" class="ltx_Math" display="inline" id="S3.SS4.p3.4.m1.1"><semantics id="S3.SS4.p3.4.m1.1a"><msub id="S3.SS4.p3.4.m1.1.1" xref="S3.SS4.p3.4.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.4.m1.1.1.2" xref="S3.SS4.p3.4.m1.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.4.m1.1.1.3" xref="S3.SS4.p3.4.m1.1.1.3a.cmml">normal</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.4.m1.1b"><apply id="S3.SS4.p3.4.m1.1.1.cmml" xref="S3.SS4.p3.4.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.4.m1.1.1.1.cmml" xref="S3.SS4.p3.4.m1.1.1">subscript</csymbol><ci id="S3.SS4.p3.4.m1.1.1.2.cmml" xref="S3.SS4.p3.4.m1.1.1.2">ℒ</ci><ci id="S3.SS4.p3.4.m1.1.1.3a.cmml" xref="S3.SS4.p3.4.m1.1.1.3"><mtext id="S3.SS4.p3.4.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.4.m1.1.1.3">normal</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.4.m1.1c">\mathcal{L}_{\text{normal}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.4.m1.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT</annotation></semantics></math> aligns predicted normals with surface normals:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E11"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math 
alttext="\mathcal{L}_{\text{normal}}=1-(\mathbf{n}^{\text{pred}}\cdot\mathbf{n}^{\text{% surf}})." class="ltx_Math" display="block" id="S3.E11.m1.1"><semantics id="S3.E11.m1.1a"><mrow id="S3.E11.m1.1.1.1" xref="S3.E11.m1.1.1.1.1.cmml"><mrow id="S3.E11.m1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.cmml"><msub id="S3.E11.m1.1.1.1.1.3" xref="S3.E11.m1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E11.m1.1.1.1.1.3.2" xref="S3.E11.m1.1.1.1.1.3.2.cmml">ℒ</mi><mtext id="S3.E11.m1.1.1.1.1.3.3" xref="S3.E11.m1.1.1.1.1.3.3a.cmml">normal</mtext></msub><mo id="S3.E11.m1.1.1.1.1.2" xref="S3.E11.m1.1.1.1.1.2.cmml">=</mo><mrow id="S3.E11.m1.1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.1.cmml"><mn id="S3.E11.m1.1.1.1.1.1.3" xref="S3.E11.m1.1.1.1.1.1.3.cmml">1</mn><mo id="S3.E11.m1.1.1.1.1.1.2" xref="S3.E11.m1.1.1.1.1.1.2.cmml">−</mo><mrow id="S3.E11.m1.1.1.1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E11.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E11.m1.1.1.1.1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml"><msup id="S3.E11.m1.1.1.1.1.1.1.1.1.2" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S3.E11.m1.1.1.1.1.1.1.1.1.2.2" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.2.cmml">𝐧</mi><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.2.3" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.3a.cmml">pred</mtext></msup><mo id="S3.E11.m1.1.1.1.1.1.1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.E11.m1.1.1.1.1.1.1.1.1.1.cmml">⋅</mo><msup id="S3.E11.m1.1.1.1.1.1.1.1.1.3" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E11.m1.1.1.1.1.1.1.1.1.3.2" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.2.cmml">𝐧</mi><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.3.3" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.3a.cmml">surf</mtext></msup></mrow><mo id="S3.E11.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E11.m1.1.1.1.2" lspace="0em" xref="S3.E11.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E11.m1.1b"><apply 
id="S3.E11.m1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1"><eq id="S3.E11.m1.1.1.1.1.2.cmml" xref="S3.E11.m1.1.1.1.1.2"></eq><apply id="S3.E11.m1.1.1.1.1.3.cmml" xref="S3.E11.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E11.m1.1.1.1.1.3.1.cmml" xref="S3.E11.m1.1.1.1.1.3">subscript</csymbol><ci id="S3.E11.m1.1.1.1.1.3.2.cmml" xref="S3.E11.m1.1.1.1.1.3.2">ℒ</ci><ci id="S3.E11.m1.1.1.1.1.3.3a.cmml" xref="S3.E11.m1.1.1.1.1.3.3"><mtext id="S3.E11.m1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E11.m1.1.1.1.1.3.3">normal</mtext></ci></apply><apply id="S3.E11.m1.1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1.1.1"><minus id="S3.E11.m1.1.1.1.1.1.2.cmml" xref="S3.E11.m1.1.1.1.1.1.2"></minus><cn id="S3.E11.m1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E11.m1.1.1.1.1.1.3">1</cn><apply id="S3.E11.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1"><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.1">⋅</ci><apply id="S3.E11.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E11.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2">superscript</csymbol><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.2">𝐧</ci><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.2.3a.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.3"><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.3">pred</mtext></ci></apply><apply id="S3.E11.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E11.m1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3">superscript</csymbol><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.2">𝐧</ci><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.3">surf</mtext></ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.E11.m1.1c">\mathcal{L}_{\text{normal}}=1-(\mathbf{n}^{\text{pred}}\cdot\mathbf{n}^{\text{% surf}}).</annotation><annotation encoding="application/x-llamapun" id="S3.E11.m1.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT = 1 - ( bold_n start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT ⋅ bold_n start_POSTSUPERSCRIPT surf end_POSTSUPERSCRIPT ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(11)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S3.SS4.p3.12">Here, <math alttext="{I}_{\text{posed}},{I}^{\text{gt}}_{\text{posed}}" class="ltx_Math" display="inline" id="S3.SS4.p3.5.m1.2"><semantics id="S3.SS4.p3.5.m1.2a"><mrow id="S3.SS4.p3.5.m1.2.2.2" xref="S3.SS4.p3.5.m1.2.2.3.cmml"><msub id="S3.SS4.p3.5.m1.1.1.1.1" xref="S3.SS4.p3.5.m1.1.1.1.1.cmml"><mi id="S3.SS4.p3.5.m1.1.1.1.1.2" xref="S3.SS4.p3.5.m1.1.1.1.1.2.cmml">I</mi><mtext id="S3.SS4.p3.5.m1.1.1.1.1.3" xref="S3.SS4.p3.5.m1.1.1.1.1.3a.cmml">posed</mtext></msub><mo id="S3.SS4.p3.5.m1.2.2.2.3" xref="S3.SS4.p3.5.m1.2.2.3.cmml">,</mo><msubsup id="S3.SS4.p3.5.m1.2.2.2.2" xref="S3.SS4.p3.5.m1.2.2.2.2.cmml"><mi id="S3.SS4.p3.5.m1.2.2.2.2.2.2" xref="S3.SS4.p3.5.m1.2.2.2.2.2.2.cmml">I</mi><mtext id="S3.SS4.p3.5.m1.2.2.2.2.3" xref="S3.SS4.p3.5.m1.2.2.2.2.3a.cmml">posed</mtext><mtext id="S3.SS4.p3.5.m1.2.2.2.2.2.3" xref="S3.SS4.p3.5.m1.2.2.2.2.2.3a.cmml">gt</mtext></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.5.m1.2b"><list id="S3.SS4.p3.5.m1.2.2.3.cmml" xref="S3.SS4.p3.5.m1.2.2.2"><apply id="S3.SS4.p3.5.m1.1.1.1.1.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.5.m1.1.1.1.1.1.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1">subscript</csymbol><ci id="S3.SS4.p3.5.m1.1.1.1.1.2.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1.2">𝐼</ci><ci id="S3.SS4.p3.5.m1.1.1.1.1.3a.cmml" 
xref="S3.SS4.p3.5.m1.1.1.1.1.3"><mtext id="S3.SS4.p3.5.m1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.5.m1.1.1.1.1.3">posed</mtext></ci></apply><apply id="S3.SS4.p3.5.m1.2.2.2.2.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.5.m1.2.2.2.2.1.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2">subscript</csymbol><apply id="S3.SS4.p3.5.m1.2.2.2.2.2.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.5.m1.2.2.2.2.2.1.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2">superscript</csymbol><ci id="S3.SS4.p3.5.m1.2.2.2.2.2.2.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2.2.2">𝐼</ci><ci id="S3.SS4.p3.5.m1.2.2.2.2.2.3a.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2.2.3"><mtext id="S3.SS4.p3.5.m1.2.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.5.m1.2.2.2.2.2.3">gt</mtext></ci></apply><ci id="S3.SS4.p3.5.m1.2.2.2.2.3a.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2.3"><mtext id="S3.SS4.p3.5.m1.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.5.m1.2.2.2.2.3">posed</mtext></ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.5.m1.2c">{I}_{\text{posed}},{I}^{\text{gt}}_{\text{posed}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.5.m1.2d">italic_I start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT , italic_I start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT</annotation></semantics></math> are the predicted and ground-truth images; <math alttext="\mathbf{\alpha}^{\text{pred}}" class="ltx_Math" display="inline" id="S3.SS4.p3.6.m2.1"><semantics id="S3.SS4.p3.6.m2.1a"><msup id="S3.SS4.p3.6.m2.1.1" xref="S3.SS4.p3.6.m2.1.1.cmml"><mi id="S3.SS4.p3.6.m2.1.1.2" xref="S3.SS4.p3.6.m2.1.1.2.cmml">α</mi><mtext id="S3.SS4.p3.6.m2.1.1.3" xref="S3.SS4.p3.6.m2.1.1.3a.cmml">pred</mtext></msup><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.6.m2.1b"><apply id="S3.SS4.p3.6.m2.1.1.cmml" xref="S3.SS4.p3.6.m2.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.6.m2.1.1.1.cmml" 
xref="S3.SS4.p3.6.m2.1.1">superscript</csymbol><ci id="S3.SS4.p3.6.m2.1.1.2.cmml" xref="S3.SS4.p3.6.m2.1.1.2">𝛼</ci><ci id="S3.SS4.p3.6.m2.1.1.3a.cmml" xref="S3.SS4.p3.6.m2.1.1.3"><mtext id="S3.SS4.p3.6.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.6.m2.1.1.3">pred</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.6.m2.1c">\mathbf{\alpha}^{\text{pred}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.6.m2.1d">italic_α start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT</annotation></semantics></math> and <math alttext="\mathbf{M}^{\text{gt}}" class="ltx_Math" display="inline" id="S3.SS4.p3.7.m3.1"><semantics id="S3.SS4.p3.7.m3.1a"><msup id="S3.SS4.p3.7.m3.1.1" xref="S3.SS4.p3.7.m3.1.1.cmml"><mi id="S3.SS4.p3.7.m3.1.1.2" xref="S3.SS4.p3.7.m3.1.1.2.cmml">𝐌</mi><mtext id="S3.SS4.p3.7.m3.1.1.3" xref="S3.SS4.p3.7.m3.1.1.3a.cmml">gt</mtext></msup><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.7.m3.1b"><apply id="S3.SS4.p3.7.m3.1.1.cmml" xref="S3.SS4.p3.7.m3.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.7.m3.1.1.1.cmml" xref="S3.SS4.p3.7.m3.1.1">superscript</csymbol><ci id="S3.SS4.p3.7.m3.1.1.2.cmml" xref="S3.SS4.p3.7.m3.1.1.2">𝐌</ci><ci id="S3.SS4.p3.7.m3.1.1.3a.cmml" xref="S3.SS4.p3.7.m3.1.1.3"><mtext id="S3.SS4.p3.7.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.7.m3.1.1.3">gt</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.7.m3.1c">\mathbf{M}^{\text{gt}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.7.m3.1d">bold_M start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT</annotation></semantics></math> are the predicted alpha mask and its ground truth counterpart; <math alttext="\mathbf{n}^{\text{pred}},\mathbf{n}^{\text{surf}}" class="ltx_Math" display="inline" id="S3.SS4.p3.8.m4.2"><semantics id="S3.SS4.p3.8.m4.2a"><mrow id="S3.SS4.p3.8.m4.2.2.2" xref="S3.SS4.p3.8.m4.2.2.3.cmml"><msup id="S3.SS4.p3.8.m4.1.1.1.1" 
xref="S3.SS4.p3.8.m4.1.1.1.1.cmml"><mi id="S3.SS4.p3.8.m4.1.1.1.1.2" xref="S3.SS4.p3.8.m4.1.1.1.1.2.cmml">𝐧</mi><mtext id="S3.SS4.p3.8.m4.1.1.1.1.3" xref="S3.SS4.p3.8.m4.1.1.1.1.3a.cmml">pred</mtext></msup><mo id="S3.SS4.p3.8.m4.2.2.2.3" xref="S3.SS4.p3.8.m4.2.2.3.cmml">,</mo><msup id="S3.SS4.p3.8.m4.2.2.2.2" xref="S3.SS4.p3.8.m4.2.2.2.2.cmml"><mi id="S3.SS4.p3.8.m4.2.2.2.2.2" xref="S3.SS4.p3.8.m4.2.2.2.2.2.cmml">𝐧</mi><mtext id="S3.SS4.p3.8.m4.2.2.2.2.3" xref="S3.SS4.p3.8.m4.2.2.2.2.3a.cmml">surf</mtext></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.8.m4.2b"><list id="S3.SS4.p3.8.m4.2.2.3.cmml" xref="S3.SS4.p3.8.m4.2.2.2"><apply id="S3.SS4.p3.8.m4.1.1.1.1.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.8.m4.1.1.1.1.1.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1">superscript</csymbol><ci id="S3.SS4.p3.8.m4.1.1.1.1.2.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1.2">𝐧</ci><ci id="S3.SS4.p3.8.m4.1.1.1.1.3a.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1.3"><mtext id="S3.SS4.p3.8.m4.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.8.m4.1.1.1.1.3">pred</mtext></ci></apply><apply id="S3.SS4.p3.8.m4.2.2.2.2.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.8.m4.2.2.2.2.1.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2">superscript</csymbol><ci id="S3.SS4.p3.8.m4.2.2.2.2.2.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2.2">𝐧</ci><ci id="S3.SS4.p3.8.m4.2.2.2.2.3a.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2.3"><mtext id="S3.SS4.p3.8.m4.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.8.m4.2.2.2.2.3">surf</mtext></ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.8.m4.2c">\mathbf{n}^{\text{pred}},\mathbf{n}^{\text{surf}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.8.m4.2d">bold_n start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT , bold_n start_POSTSUPERSCRIPT surf end_POSTSUPERSCRIPT</annotation></semantics></math> are predicted and surface normal vectors. 
<math alttext="\lambda_{\text{lpips}},\lambda_{\text{n}},\lambda_{\text{d}}" class="ltx_Math" display="inline" id="S3.SS4.p3.9.m5.3"><semantics id="S3.SS4.p3.9.m5.3a"><mrow id="S3.SS4.p3.9.m5.3.3.3" xref="S3.SS4.p3.9.m5.3.3.4.cmml"><msub id="S3.SS4.p3.9.m5.1.1.1.1" xref="S3.SS4.p3.9.m5.1.1.1.1.cmml"><mi id="S3.SS4.p3.9.m5.1.1.1.1.2" xref="S3.SS4.p3.9.m5.1.1.1.1.2.cmml">λ</mi><mtext id="S3.SS4.p3.9.m5.1.1.1.1.3" xref="S3.SS4.p3.9.m5.1.1.1.1.3a.cmml">lpips</mtext></msub><mo id="S3.SS4.p3.9.m5.3.3.3.4" xref="S3.SS4.p3.9.m5.3.3.4.cmml">,</mo><msub id="S3.SS4.p3.9.m5.2.2.2.2" xref="S3.SS4.p3.9.m5.2.2.2.2.cmml"><mi id="S3.SS4.p3.9.m5.2.2.2.2.2" xref="S3.SS4.p3.9.m5.2.2.2.2.2.cmml">λ</mi><mtext id="S3.SS4.p3.9.m5.2.2.2.2.3" xref="S3.SS4.p3.9.m5.2.2.2.2.3a.cmml">n</mtext></msub><mo id="S3.SS4.p3.9.m5.3.3.3.5" xref="S3.SS4.p3.9.m5.3.3.4.cmml">,</mo><msub id="S3.SS4.p3.9.m5.3.3.3.3" xref="S3.SS4.p3.9.m5.3.3.3.3.cmml"><mi id="S3.SS4.p3.9.m5.3.3.3.3.2" xref="S3.SS4.p3.9.m5.3.3.3.3.2.cmml">λ</mi><mtext id="S3.SS4.p3.9.m5.3.3.3.3.3" xref="S3.SS4.p3.9.m5.3.3.3.3.3a.cmml">d</mtext></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.9.m5.3b"><list id="S3.SS4.p3.9.m5.3.3.4.cmml" xref="S3.SS4.p3.9.m5.3.3.3"><apply id="S3.SS4.p3.9.m5.1.1.1.1.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.9.m5.1.1.1.1.1.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1">subscript</csymbol><ci id="S3.SS4.p3.9.m5.1.1.1.1.2.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1.2">𝜆</ci><ci id="S3.SS4.p3.9.m5.1.1.1.1.3a.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1.3"><mtext id="S3.SS4.p3.9.m5.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.9.m5.1.1.1.1.3">lpips</mtext></ci></apply><apply id="S3.SS4.p3.9.m5.2.2.2.2.cmml" xref="S3.SS4.p3.9.m5.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.9.m5.2.2.2.2.1.cmml" xref="S3.SS4.p3.9.m5.2.2.2.2">subscript</csymbol><ci id="S3.SS4.p3.9.m5.2.2.2.2.2.cmml" xref="S3.SS4.p3.9.m5.2.2.2.2.2">𝜆</ci><ci id="S3.SS4.p3.9.m5.2.2.2.2.3a.cmml" 
xref="S3.SS4.p3.9.m5.2.2.2.2.3"><mtext id="S3.SS4.p3.9.m5.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.9.m5.2.2.2.2.3">n</mtext></ci></apply><apply id="S3.SS4.p3.9.m5.3.3.3.3.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS4.p3.9.m5.3.3.3.3.1.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3">subscript</csymbol><ci id="S3.SS4.p3.9.m5.3.3.3.3.2.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3.2">𝜆</ci><ci id="S3.SS4.p3.9.m5.3.3.3.3.3a.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3.3"><mtext id="S3.SS4.p3.9.m5.3.3.3.3.3.cmml" mathsize="70%" xref="S3.SS4.p3.9.m5.3.3.3.3.3">d</mtext></ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.9.m5.3c">\lambda_{\text{lpips}},\lambda_{\text{n}},\lambda_{\text{d}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.9.m5.3d">italic_λ start_POSTSUBSCRIPT lpips end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT n end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT d end_POSTSUBSCRIPT</annotation></semantics></math> are weights for <math alttext="\mathcal{L}_{\text{lpips}}" class="ltx_Math" display="inline" id="S3.SS4.p3.10.m6.1"><semantics id="S3.SS4.p3.10.m6.1a"><msub id="S3.SS4.p3.10.m6.1.1" xref="S3.SS4.p3.10.m6.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.10.m6.1.1.2" xref="S3.SS4.p3.10.m6.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.10.m6.1.1.3" xref="S3.SS4.p3.10.m6.1.1.3a.cmml">lpips</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.10.m6.1b"><apply id="S3.SS4.p3.10.m6.1.1.cmml" xref="S3.SS4.p3.10.m6.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.10.m6.1.1.1.cmml" xref="S3.SS4.p3.10.m6.1.1">subscript</csymbol><ci id="S3.SS4.p3.10.m6.1.1.2.cmml" xref="S3.SS4.p3.10.m6.1.1.2">ℒ</ci><ci id="S3.SS4.p3.10.m6.1.1.3a.cmml" xref="S3.SS4.p3.10.m6.1.1.3"><mtext id="S3.SS4.p3.10.m6.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.10.m6.1.1.3">lpips</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS4.p3.10.m6.1c">\mathcal{L}_{\text{lpips}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.10.m6.1d">caligraphic_L start_POSTSUBSCRIPT lpips end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="\mathcal{L}_{\text{normal}}" class="ltx_Math" display="inline" id="S3.SS4.p3.11.m7.1"><semantics id="S3.SS4.p3.11.m7.1a"><msub id="S3.SS4.p3.11.m7.1.1" xref="S3.SS4.p3.11.m7.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.11.m7.1.1.2" xref="S3.SS4.p3.11.m7.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.11.m7.1.1.3" xref="S3.SS4.p3.11.m7.1.1.3a.cmml">normal</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.11.m7.1b"><apply id="S3.SS4.p3.11.m7.1.1.cmml" xref="S3.SS4.p3.11.m7.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.11.m7.1.1.1.cmml" xref="S3.SS4.p3.11.m7.1.1">subscript</csymbol><ci id="S3.SS4.p3.11.m7.1.1.2.cmml" xref="S3.SS4.p3.11.m7.1.1.2">ℒ</ci><ci id="S3.SS4.p3.11.m7.1.1.3a.cmml" xref="S3.SS4.p3.11.m7.1.1.3"><mtext id="S3.SS4.p3.11.m7.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.11.m7.1.1.3">normal</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.11.m7.1c">\mathcal{L}_{\text{normal}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.11.m7.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="\mathcal{L}_{\text{dist}}" class="ltx_Math" display="inline" id="S3.SS4.p3.12.m8.1"><semantics id="S3.SS4.p3.12.m8.1a"><msub id="S3.SS4.p3.12.m8.1.1" xref="S3.SS4.p3.12.m8.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.12.m8.1.1.2" xref="S3.SS4.p3.12.m8.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.12.m8.1.1.3" xref="S3.SS4.p3.12.m8.1.1.3a.cmml">dist</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.12.m8.1b"><apply id="S3.SS4.p3.12.m8.1.1.cmml" xref="S3.SS4.p3.12.m8.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.12.m8.1.1.1.cmml" 
xref="S3.SS4.p3.12.m8.1.1">subscript</csymbol><ci id="S3.SS4.p3.12.m8.1.1.2.cmml" xref="S3.SS4.p3.12.m8.1.1.2">ℒ</ci><ci id="S3.SS4.p3.12.m8.1.1.3a.cmml" xref="S3.SS4.p3.12.m8.1.1.3"><mtext id="S3.SS4.p3.12.m8.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.12.m8.1.1.3">dist</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.12.m8.1c">\mathcal{L}_{\text{dist}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.12.m8.1d">caligraphic_L start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT</annotation></semantics></math>, respectively. The normal and distortion losses commence after 20% of training to first establish basic appearance convergence.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>Experiments</h2> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>2D Stylized Avatar Generation </h3> <div class="ltx_para ltx_noindent" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p1.1.1">Baselines.</span> We evaluate GDA for 2D stylized avatar generation and compare with GAN inversion and diffusion-based methods, both fine-tuned on our Bitmoji dataset. For GAN inversion, we use a SemanticStyleGAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib50" title=""><span class="ltx_text" style="font-size:90%;">50</span></a>]</cite> model, translating real faces into avatars by inverting them into the latent space of a fine-tuned model. 
The diffusion-based method employs Stable Diffusion 1.5 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib43" title=""><span class="ltx_text" style="font-size:90%;">43</span></a>]</cite> fine-tuned with LoRA <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib18" title=""><span class="ltx_text" style="font-size:90%;">18</span></a>]</cite>, using BLIP-2 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib24" title=""><span class="ltx_text" style="font-size:90%;">24</span></a>]</cite> for avatar captioning and IP-Adapter Plus Face <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib60" title=""><span class="ltx_text" style="font-size:90%;">60</span></a>]</cite> for identity conditioning. Both methods aim to efficiently create primary-style avatars that retain the identity of the input images.</p> </div> <figure class="ltx_figure" id="S4.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="419" id="S4.F4.g1" src="x4.png" width="875"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F4.3.1.1" style="font-size:90%;">Figure 4</span>: </span><span class="ltx_text ltx_font_bold" id="S4.F4.4.2" style="font-size:90%;">2D Stylized Avatar Generation.<span class="ltx_text ltx_font_medium" id="S4.F4.4.2.1"> This figure shows photos of eight individuals translated into the Bitmoji domain by each method. GAN inversion produces overly generic avatars and struggles with distinctive features such as beards, glasses, and headwear. Diffusion-based models add features that are absent from the input photo and render the target style inconsistently. 
In contrast, our GDA method excels in creating high-quality avatars, effectively retaining the original identity features.</span></span></figcaption> </figure> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="447" id="S4.F5.g1" src="x5.png" width="875"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F5.3.1.1" style="font-size:90%;">Figure 5</span>: </span><span class="ltx_text ltx_font_bold" id="S4.F5.4.2" style="font-size:90%;">2D Stylized Avatar to 3D Generation.<span class="ltx_text ltx_font_medium" id="S4.F5.4.2.1"> We demonstrate the process of converting dual-stylized avatar images, derived from the single-stylized avatars in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F4" title="Figure 4 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">4</span></a>, into 3D avatars. PTI inversion with EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title=""><span class="ltx_text" style="font-size:90%;">6</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib42" title=""><span class="ltx_text" style="font-size:90%;">42</span></a>]</cite> struggles to accurately reproduce 3D geometry, while LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite> produces artifacts in both geometry and texture. 
Despite being trained exclusively on the Bitmoji style, our method successfully generates high-quality 3D avatars in previously unseen styles.</span></span></figcaption> </figure> <figure class="ltx_figure" id="S4.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="380" id="S4.F6.g1" src="x6.png" width="789"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F6.5.1.1" style="font-size:90%;">Figure 6</span>: </span><span class="ltx_text ltx_font_bold" id="S4.F6.6.2" style="font-size:90%;">Single Portrait to 3D Generation.<span class="ltx_text ltx_font_medium" id="S4.F6.6.2.1"> We compare <span class="ltx_text" id="S4.F6.6.2.1.1">Snapmoji</span> with DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title=""><span class="ltx_text" style="font-size:90%;">23</span></a>]</cite> in the context of 3D toonification. For each method and style, we render outputs from two viewpoints alongside a normal map. DATID-3D exhibits typical GAN-related issues, such as poor identity preservation and limited stylistic diversity, resulting in similar outputs across different identities. Conversely, <span class="ltx_text" id="S4.F6.6.2.1.2">Snapmoji</span> effectively maintains identity and produces distinct styles, showcasing superior image quality and sharper geometry.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS1.p2.1.1">Evaluation.</span> We conducted an evaluation using 100 randomly selected faces from the FFHQ dataset, assessing each method on visual quality, identity retention, and speed. Visual quality was measured through FID and KID scores, comparing the transformed images to the Bitmoji dataset. 
Identity retention was evaluated using ArcFace <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib11" title=""><span class="ltx_text" style="font-size:90%;">11</span></a>]</cite>, while speed was benchmarked on an NVIDIA L4 GPU. As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.T2" title="Table 2 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">2</span></a>, GDA outperforms the other methods across all metrics, achieving an FID more than 20 points lower than that of either GAN inversion or diffusion. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F4" title="Figure 4 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">4</span></a> illustrates these quality differences: GAN inversion produces generic avatars with limited diversity because it struggles to generate out-of-distribution images. The diffusion approach fails to maintain a consistent style and often incorrectly introduces features like glasses, undermining both style and identity preservation. In contrast, GDA produces avatars with a consistent style that retain key identity features such as eye color, sunglasses, and hairstyles. It is also efficient: it requires only a single forward pass through a U-Net and completes a translation in 0.080 seconds, roughly 44 times faster than diffusion (3.54 seconds) and roughly 1,200 times faster than GAN inversion (98.14 seconds). 
Surprisingly, even though GDA uses data generated from GAN inversion for training, it produces images that are more detailed due to the learned 3D prior from the Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title=""><span class="ltx_text" style="font-size:90%;">10</span></a>]</cite> dataset, which enhances its generalization capability.</p> </div> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2 </span>3D Dual-Stylized Avatar Generation </h3> <figure class="ltx_table" id="S4.T2"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T2.4" style="width:433.6pt;height:173.7pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(126.9pt,-50.9pt) scale(2.41253272928583,2.41253272928583) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T2.4.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T2.4.4.4"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.4.4.4.5" style="padding-left:0.0pt;padding-right:0.0pt;"></th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.1.1.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">FID <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T2.1.1.1.1.m1.1"><semantics id="S4.T2.1.1.1.1.m1.1a"><mo id="S4.T2.1.1.1.1.m1.1.1" stretchy="false" xref="S4.T2.1.1.1.1.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T2.1.1.1.1.m1.1b"><ci id="S4.T2.1.1.1.1.m1.1.1.cmml" xref="S4.T2.1.1.1.1.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.1.1.1.1.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.1.1.1.1.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td 
ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.2.2.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">KID <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T2.2.2.2.2.m1.1"><semantics id="S4.T2.2.2.2.2.m1.1a"><mo id="S4.T2.2.2.2.2.m1.1.1" stretchy="false" xref="S4.T2.2.2.2.2.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T2.2.2.2.2.m1.1b"><ci id="S4.T2.2.2.2.2.m1.1.1.cmml" xref="S4.T2.2.2.2.2.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.2.2.2.2.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.2.2.2.2.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.3.3.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">ID <math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.3.3.3.3.m1.1"><semantics id="S4.T2.3.3.3.3.m1.1a"><mo id="S4.T2.3.3.3.3.m1.1.1" stretchy="false" xref="S4.T2.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.3.3.3.3.m1.1b"><ci id="S4.T2.3.3.3.3.m1.1.1.cmml" xref="S4.T2.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.4.4.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">Speed <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T2.4.4.4.4.m1.1"><semantics id="S4.T2.4.4.4.4.m1.1a"><mo id="S4.T2.4.4.4.4.m1.1.1" stretchy="false" xref="S4.T2.4.4.4.4.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T2.4.4.4.4.m1.1b"><ci id="S4.T2.4.4.4.4.m1.1.1.cmml" xref="S4.T2.4.4.4.4.m1.1.1">↓</ci></annotation-xml><annotation 
encoding="application/x-tex" id="S4.T2.4.4.4.4.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.4.4.4.4.m1.1d">↓</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T2.4.4.5.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">GAN Inversion</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">93.73</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.0603</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.16</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T2.4.4.5.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">98.14s</td> </tr> <tr class="ltx_tr" id="S4.T2.4.4.6.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">Diffusion</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">93.63</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.0457</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.19</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S4.T2.4.4.6.2.5" style="padding-left:0.0pt;padding-right:0.0pt;">3.54s</td> </tr> <tr class="ltx_tr" id="S4.T2.4.4.7.3"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center 
ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">GDA (Ours)</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.2" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T2.4.4.7.3.2.1">72.94</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.3" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T2.4.4.7.3.3.1">0.0346</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.4" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T2.4.4.7.3.4.1">0.25</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T2.4.4.7.3.5" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T2.4.4.7.3.5.1">0.080s</span></td> </tr> </tbody> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S4.T2.7.1.1" style="font-size:90%;">Table 2</span>: </span><span class="ltx_text ltx_font_bold" id="S4.T2.8.2" style="font-size:90%;">2D Stylized Avatar Generation.<span class="ltx_text ltx_font_medium" id="S4.T2.8.2.1"> We compare different methods of generating 2D stylized avatars. 
Our GDA significantly outperforms GAN inversion and diffusion in terms of image quality (FID, KID), identity preservation (ID), and execution speed.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p1.1.1">2D Stylized Avatar to 3D Generation.</span> To evaluate 3D avatar generation from a 2D stylized image, we compare our method against two single-image 3D reconstruction techniques: EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title=""><span class="ltx_text" style="font-size:90%;">6</span></a>]</cite> and LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite>. EG3D, a 3D GAN based on the StyleGAN framework, generates a 3D neural radiance field and is fine-tuned on a multi-view 3D avatar dataset. 
It inverts a front-facing image into the GAN’s <math alttext="\mathcal{W}+" class="ltx_Math" display="inline" id="S4.SS2.p1.1.m1.1"><semantics id="S4.SS2.p1.1.m1.1a"><mrow id="S4.SS2.p1.1.m1.1.1" xref="S4.SS2.p1.1.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS2.p1.1.m1.1.1.2" xref="S4.SS2.p1.1.m1.1.1.2.cmml">𝒲</mi><mo id="S4.SS2.p1.1.m1.1.1.3" xref="S4.SS2.p1.1.m1.1.1.3.cmml">+</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.1.m1.1b"><apply id="S4.SS2.p1.1.m1.1.1.cmml" xref="S4.SS2.p1.1.m1.1.1"><csymbol cd="latexml" id="S4.SS2.p1.1.m1.1.1.1.cmml" xref="S4.SS2.p1.1.m1.1.1">limit-from</csymbol><ci id="S4.SS2.p1.1.m1.1.1.2.cmml" xref="S4.SS2.p1.1.m1.1.1.2">𝒲</ci><plus id="S4.SS2.p1.1.m1.1.1.3.cmml" xref="S4.SS2.p1.1.m1.1.1.3"></plus></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.1.m1.1c">\mathcal{W}+</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.1.m1.1d">caligraphic_W +</annotation></semantics></math> space to render outputs from various viewpoints. LGM uses MVDream <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib51" title=""><span class="ltx_text" style="font-size:90%;">51</span></a>]</cite> to transform a single image into multiple viewpoints for input into a U-Net, which outputs 3D Gaussians. We assessed each method using 100 random 3D Bitmojis, each rendered from one front-facing view and ten additional views distributed spherically around the head. By inputting the front-facing view into each model, we calculated PSNR, SSIM, LPIPS <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib64" title=""><span class="ltx_text" style="font-size:90%;">64</span></a>]</cite>, and speed metrics. 
As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.T3" title="Table 3 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3</span></a>, our method surpasses all baselines, demonstrating superior capability in accurately converting 2D images to 3D, while being significantly faster, needing only a single U-Net pass. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F5" title="Figure 5 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">5</span></a> provides visual comparisons on 3D <span class="ltx_text ltx_font_italic" id="S4.SS2.p1.1.2">dual-stylized</span> avatars that fall outside the training distribution. The top row features eight stylized avatars generated using diffusion stylization from the identities in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F4" title="Figure 4 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">4</span></a>, with different style prompts as described in Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS2" title="3.2 2D Dual-Stylized Avatar Generation ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">3.2</span></a>. Subsequent rows show each method’s performance in translating these images to 3D. Even when employing PTI <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib42" title=""><span class="ltx_text" style="font-size:90%;">42</span></a>]</cite> for out-of-distribution images, EG3D struggles with high-fidelity geometry generation. 
Similarly, due to the diffusion process in MVDream, LGM often produces incorrect 3D head geometries. In contrast, our method successfully creates high-quality textures and geometry, even accommodating out-of-distribution accessories like turbans and sunglasses.</p> </div> <figure class="ltx_table" id="S4.T3"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T3.4" style="width:433.6pt;height:100.9pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(62.1pt,-14.5pt) scale(1.40180152360723,1.40180152360723) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T3.4.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T3.4.4.4"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.4.4.4.5" style="padding-left:0.0pt;padding-right:0.0pt;"></th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.1.1.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">PSNR <math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.1.1.1.1.m1.1"><semantics id="S4.T3.1.1.1.1.m1.1a"><mo id="S4.T3.1.1.1.1.m1.1.1" stretchy="false" xref="S4.T3.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.1.1.1.1.m1.1b"><ci id="S4.T3.1.1.1.1.m1.1.1.cmml" xref="S4.T3.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.2.2.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">SSIM <math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.2.2.2.2.m1.1"><semantics id="S4.T3.2.2.2.2.m1.1a"><mo id="S4.T3.2.2.2.2.m1.1.1" stretchy="false" 
xref="S4.T3.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.2.2.2.2.m1.1b"><ci id="S4.T3.2.2.2.2.m1.1.1.cmml" xref="S4.T3.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.3.3.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">LPIPS <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T3.3.3.3.3.m1.1"><semantics id="S4.T3.3.3.3.3.m1.1a"><mo id="S4.T3.3.3.3.3.m1.1.1" stretchy="false" xref="S4.T3.3.3.3.3.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T3.3.3.3.3.m1.1b"><ci id="S4.T3.3.3.3.3.m1.1.1.cmml" xref="S4.T3.3.3.3.3.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.3.3.3.3.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.3.3.3.3.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T3.4.4.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">Speed <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T3.4.4.4.4.m1.1"><semantics id="S4.T3.4.4.4.4.m1.1a"><mo id="S4.T3.4.4.4.4.m1.1.1" stretchy="false" xref="S4.T3.4.4.4.4.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T3.4.4.4.4.m1.1b"><ci id="S4.T3.4.4.4.4.m1.1.1.cmml" xref="S4.T3.4.4.4.4.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.4.4.4.4.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.4.4.4.4.m1.1d">↓</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T3.4.4.5.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center 
ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title=""><span class="ltx_text" style="font-size:90%;">6</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">10.92</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.68</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.50</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T3.4.4.5.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">95.1s</td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.6.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">12.16</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.69</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.53</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S4.T3.4.4.6.2.5" style="padding-left:0.0pt;padding-right:0.0pt;">2.82s</td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.7.3"> <td class="ltx_td 
ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">Ours</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.2" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T3.4.4.7.3.2.1">18.73</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.3" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T3.4.4.7.3.3.1">0.81</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.4" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T3.4.4.7.3.4.1">0.24</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T3.4.4.7.3.5" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text ltx_font_bold" id="S4.T3.4.4.7.3.5.1">0.091s</span></td> </tr> </tbody> </table> </span></div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S4.T3.7.1.1" style="font-size:90%;">Table 3</span>: </span><span class="ltx_text ltx_font_bold" id="S4.T3.8.2" style="font-size:90%;">2D Stylized Avatar to 3D Generation.<span class="ltx_text ltx_font_medium" id="S4.T3.8.2.1"> Our approach outperforms EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title=""><span class="ltx_text" style="font-size:90%;">6</span></a>]</cite> and LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title=""><span class="ltx_text" style="font-size:90%;">55</span></a>]</cite> on all metrics, providing superior texture and geometry accuracy with faster processing.</span></span></figcaption> </figure> <figure 
class="ltx_figure" id="S4.F7"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="898" id="S4.F7.g1" src="x7.png" width="830"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F7.5.1.1" style="font-size:90%;">Figure 7</span>: </span><span class="ltx_text ltx_font_bold" id="S4.F7.6.2" style="font-size:90%;">3D Stylized Avatar Animation.<span class="ltx_text ltx_font_medium" id="S4.F7.6.2.1"> (a) A <span class="ltx_text" id="S4.F7.6.2.1.1">Snapmoji</span> showcasing various emotions using blendshape weights. (b) <span class="ltx_text" id="S4.F7.6.2.1.2">Snapmoji</span> effectively transfers expressions from driving images, outperforming Portrait4D-v2 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib12" title=""><span class="ltx_text" style="font-size:90%;">12</span></a>]</cite> in accuracy and visual appeal.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="S4.SS2.p2"> <p class="ltx_p" id="S4.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS2.p2.1.1">Single Portrait to 3D Generation.</span> We evaluate <span class="ltx_text" id="S4.SS2.p2.1.2">Snapmoji</span>’s capability to generate a 3D avatar from a single portrait and compare it against DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title=""><span class="ltx_text" style="font-size:90%;">23</span></a>]</cite>. DATID-3D uses a GAN to derive a latent code from the user’s portrait, followed by domain adaptation for styling. Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F6" title="Figure 6 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">6</span></a> shows that DATID-3D struggles to maintain the original identity in avatars with distinct styles, like Plastic Toy and Alien. <span class="ltx_text" id="S4.SS2.p2.1.3">Snapmoji</span>, however, achieves a robust balance of identity preservation and style versatility, allowing for enhanced user customization. Our approach produces sharp images and detailed geometries, processing each image in just 0.9 seconds, a significant improvement over DATID-3D’s 90-second processing time for GAN inversion.</p> </div> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.3 </span>3D Stylized Avatar Animation </h3> <div class="ltx_para ltx_noindent" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.1"><span class="ltx_text ltx_font_bold" id="S4.SS3.p1.1.1">Expression Animation.</span> <span class="ltx_text" id="S4.SS3.p1.1.2">Snapmoji</span> enables avatars to express a wide range of emotions, such as neutrality, happiness, frustration, playfulness, anger, and surprise, by using blendshape weights, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F7" title="Figure 7 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">7</span></a>(a). Additionally, <span class="ltx_text" id="S4.SS3.p1.1.3">Snapmoji</span> can perform expression transfer from driving images, producing 3D-consistent and visually appealing avatars. Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F7" title="Figure 7 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">7</span></a>(b) shows this capability, where <span class="ltx_text" id="S4.SS3.p1.1.4">Snapmoji</span> outperforms Portrait4D-v2 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib12" title=""><span class="ltx_text" style="font-size:90%;">12</span></a>]</cite> by generating avatars with more accurate expressions derived from the target image.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS3.p2"> <p class="ltx_p" id="S4.SS3.p2.1"><span class="ltx_text ltx_font_bold" id="S4.SS3.p2.1.1">Mobile AR Application.</span> We showcase a web-based AR app that demonstrates the efficient rendering of avatars on mobile devices. As illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F8" title="Figure 8 ‣ 4.3 3D Stylized Avatar Animation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">8</span></a>, an avatar animated using a user’s facial expressions achieves rendering speeds of 30–40 FPS on an iPhone 13 Pro. These avatars are highly compact, occupying only 3 MB of disk space, enabling the creation of dynamic filters and engaging AR effects directly within a mobile web browser. To highlight the advantages of our animation technique, we compare <span class="ltx_text" id="S4.SS3.p2.1.2">Snapmoji</span> against TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" style="font-size:90%;">54</span></a>]</cite>. 
Like many other avatar generation methods <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib19" title=""><span class="ltx_text" style="font-size:90%;">19</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title=""><span class="ltx_text" style="font-size:90%;">40</span></a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib8" title=""><span class="ltx_text" style="font-size:90%;">8</span></a>]</cite>, TextToon relies solely on 3DMM features, achieving only 15–18 FPS on an M1 MacBook. In contrast, our method consistently runs at 90–100 FPS on the same hardware. Moreover, TextToon’s dependence on 3DMMs limits its practicality on phones, whereas our cross-platform solution sustains 30–40 FPS on an iPhone 13 Pro. Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.T4" title="Table 4 ‣ 4.3 3D Stylized Avatar Animation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">4</span></a> offers a detailed feature comparison.</p> </div> <figure class="ltx_table" id="S4.T4"> <div class="ltx_inline-block ltx_transformed_outer" id="S4.T4.2" style="width:433.6pt;height:63.4pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(-29.5pt,4.3pt) scale(0.880103445537666,0.880103445537666) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T4.2.1"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T4.2.1.1.1"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_tt" id="S4.T4.2.1.1.1.1" rowspan="2" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text" id="S4.T4.2.1.1.1.1.1">Method</span></th> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_tt" colspan="2" id="S4.T4.2.1.1.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">Frame Rate (FPS)</td> <td class="ltx_td 
ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_tt" id="S4.T4.2.1.1.1.3" rowspan="2" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text" id="S4.T4.2.1.1.1.3.1">Cross-Platform</span></td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_tt" id="S4.T4.2.1.1.1.4" rowspan="2" style="padding-left:0.0pt;padding-right:0.0pt;"><span class="ltx_text" id="S4.T4.2.1.1.1.4.1">Driving Signal</span></td> </tr> <tr class="ltx_tr" id="S4.T4.2.1.2.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S4.T4.2.1.2.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">M1 MacBook</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T4.2.1.2.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">iPhone 13 Pro</td> </tr> <tr class="ltx_tr" id="S4.T4.2.1.3.3"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T4.2.1.3.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" style="font-size:90%;">54</span></a>]</cite> </th> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T4.2.1.3.3.2" style="padding-left:0.0pt;padding-right:0.0pt;">15–18</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T4.2.1.3.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">N/A</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T4.2.1.3.3.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T4.2.1.3.3.5" style="padding-left:0.0pt;padding-right:0.0pt;">3DMM</td> </tr> <tr class="ltx_tr" id="S4.T4.2.1.4.4"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" 
id="S4.T4.2.1.4.4.1" style="padding-left:0.0pt;padding-right:0.0pt;"> <span class="ltx_text" id="S4.T4.2.1.4.4.1.1">Snapmoji</span> (Ours)</th> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T4.2.1.4.4.2" style="padding-left:0.0pt;padding-right:0.0pt;">90–100</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T4.2.1.4.4.3" style="padding-left:0.0pt;padding-right:0.0pt;">30–40</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T4.2.1.4.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T4.2.1.4.4.5" style="padding-left:0.0pt;padding-right:0.0pt;">3DMM + Blendshapes</td> </tr> </tbody> </table> </span></div> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table"><span class="ltx_text" id="S4.T4.4.1.1" style="font-size:90%;">Table 4</span>: </span><span class="ltx_text ltx_font_bold" id="S4.T4.5.2" style="font-size:90%;">Mobile AR Application Comparison.<span class="ltx_text ltx_font_medium" id="S4.T4.5.2.1"> We compare various features of our mobile AR application and TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title=""><span class="ltx_text" style="font-size:90%;">54</span></a>]</cite>.</span></span></figcaption> </figure> <figure class="ltx_figure" id="S4.F8"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="434" id="S4.F8.g1" src="x8.png" width="830"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S4.F8.4.1.1" style="font-size:90%;">Figure 8</span>: </span><span class="ltx_text ltx_font_bold" id="S4.F8.5.2" style="font-size:90%;">Mobile AR Application.<span class="ltx_text ltx_font_medium" id="S4.F8.5.2.1"> <span class="ltx_text" id="S4.F8.5.2.1.1">Snapmoji</span>’s efficient mobile animation 
method enables a user to puppet their avatar in augmented reality. </span></span></figcaption> </figure> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Conclusion</h2> <div class="ltx_para" id="S5.p1"> <p class="ltx_p" id="S5.p1.1">We introduce <span class="ltx_text" id="S5.p1.1.1">Snapmoji</span>, an easy-to-use system for generating animatable, dual-stylized avatars from selfies instantly. Leveraging Gaussian Domain Adaptation, <span class="ltx_text" id="S5.p1.1.2">Snapmoji</span> first converts selfies into primary stylized avatars, then applies a diffusion process for a secondary style while preserving identity integrity. The system supports 3D Gaussian avatars with dynamic animations and precise facial expression transfer, achieving selfie-to-avatar conversion in just 0.9 seconds, with real-time interactions at 30–40 FPS. Extensive testing confirms <span class="ltx_text" id="S5.p1.1.3">Snapmoji</span>’s versatility and speed, highlighting its value in creating diverse avatar styles.</p> </div> <div class="ltx_para ltx_noindent" id="S5.p2"> <p class="ltx_p" id="S5.p2.1"><span class="ltx_text ltx_font_bold" id="S5.p2.1.1">Limitations and Future Work.</span> <span class="ltx_text" id="S5.p2.1.2">Snapmoji</span> relies on paired data from GAN inversion, which can sometimes yield low-quality images, and requires extensive 3D avatar datasets and text prompt engineering. 
Future improvements could include using multiple images for more accurate user head geometry and enabling post-stylization edits to specific facial features, like eye color or eyeglasses.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography" style="font-size:90%;">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib1.4.4.1" style="font-size:90%;">Apple [2018]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib1.6.1" style="font-size:90%;"> Apple. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib1.7.1" style="font-size:90%;">Use memoji on your iphone or ipad pro. </span> </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://support.apple.com/en-us/111115" style="font-size:90%;" title="">https://support.apple.com/en-us/111115</a><span class="ltx_text" id="bib.bib1.8.1" style="font-size:90%;">, 2018. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib1.9.1" style="font-size:90%;">Accessed: 2024-11-03. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib2.5.5.1" style="font-size:90%;">Barron et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib2.7.1" style="font-size:90%;"> Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib2.8.1" style="font-size:90%;">Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib2.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib2.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF international conference on computer vision</em><span class="ltx_text" id="bib.bib2.11.3" style="font-size:90%;">, pages 5855–5864, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib3.5.5.1" style="font-size:90%;">Barron et al. [2022]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib3.7.1" style="font-size:90%;"> Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib3.8.1" style="font-size:90%;">Mip-nerf 360: Unbounded anti-aliased neural radiance fields. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib3.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib3.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</em><span class="ltx_text" id="bib.bib3.11.3" style="font-size:90%;">, pages 5470–5479, 2022. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib4.5.5.1" style="font-size:90%;">Bazarevsky et al. [2019]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib4.7.1" style="font-size:90%;"> Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib4.8.1" style="font-size:90%;">Blazeface: Sub-millisecond neural face detection on mobile gpus. 
</span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib4.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib4.10.2" style="font-size:90%;">, abs/1907.05047, 2019. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib5.4.4.1" style="font-size:90%;">Bitstrips [2007]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib5.6.1" style="font-size:90%;"> Bitstrips. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib5.7.1" style="font-size:90%;">Bitmoji. </span> </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.bitmoji.com/" style="font-size:90%;" title="">https://www.bitmoji.com/</a><span class="ltx_text" id="bib.bib5.8.1" style="font-size:90%;">, 2007. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib5.9.1" style="font-size:90%;">Accessed: 2024-11-03. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib6.5.5.1" style="font-size:90%;">Chan et al. [2022]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib6.7.1" style="font-size:90%;"> Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib6.8.1" style="font-size:90%;">Efficient geometry-aware 3D generative adversarial networks. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib6.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib6.10.2" style="font-size:90%;">CVPR</em><span class="ltx_text" id="bib.bib6.11.3" style="font-size:90%;">, 2022. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib7.5.5.1" style="font-size:90%;">Chen et al. [2022]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib7.7.1" style="font-size:90%;"> Eric Chen, Jin Sun, Apoorv Khandelwal, Dani Lischinski, Noah Snavely, and Hadar Averbuch-Elor. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib7.8.1" style="font-size:90%;">What’s in a decade? transforming faces through time. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib7.9.1" style="font-size:90%;">Computer Graphics Forum</em><span class="ltx_text" id="bib.bib7.10.2" style="font-size:90%;">, 42, 2022. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib8.4.4.1" style="font-size:90%;">Chu and Harada [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib8.6.1" style="font-size:90%;"> Xuangeng Chu and Tatsuya Harada. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib8.7.1" style="font-size:90%;">Generalizable and animatable gaussian head avatar. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib8.8.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib8.9.2" style="font-size:90%;">The Thirty-eighth Annual Conference on Neural Information Processing Systems</em><span class="ltx_text" id="bib.bib8.10.3" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib9.5.5.1" style="font-size:90%;">Dao et al. 
[2025]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib9.7.1" style="font-size:90%;"> Quan Dao, Khanh Doan, Di Liu, Trung Le, and Dimitris Metaxas. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib9.8.1" style="font-size:90%;">Improved training technique for latent consistency models. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.9.1" style="font-size:90%;">Proceedings of the International Conference on Learning Representations (ICLR)</em><span class="ltx_text" id="bib.bib9.10.2" style="font-size:90%;">, 2025. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib10.5.5.1" style="font-size:90%;">Deitke et al. [2022]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib10.7.1" style="font-size:90%;"> Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib10.8.1" style="font-size:90%;">Objaverse: A universe of annotated 3d objects. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.9.1" style="font-size:90%;">2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib10.10.2" style="font-size:90%;">, pages 13142–13153, 2022. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib11.5.5.1" style="font-size:90%;">Deng et al. [2019]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib11.7.1" style="font-size:90%;"> Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib11.8.1" style="font-size:90%;">Arcface: Additive angular margin loss for deep face recognition. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib11.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib11.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em><span class="ltx_text" id="bib.bib11.11.3" style="font-size:90%;">, pages 4690–4699, 2019. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib12.5.5.1" style="font-size:90%;">Deng et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib12.7.1" style="font-size:90%;"> Yu Deng, Duomin Wang, and Baoyuan Wang. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib12.8.1" style="font-size:90%;">Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib12.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib12.10.2" style="font-size:90%;">European Conference on Computer Vision</em><span class="ltx_text" id="bib.bib12.11.3" style="font-size:90%;">, pages 316–333. Springer, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib13.4.4.1" style="font-size:90%;">Ekman and Friesen [1978]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib13.6.1" style="font-size:90%;"> Paul Ekman and Wallace V. Friesen. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib13.7.1" style="font-size:90%;">Facial action coding system: a technique for the measurement of facial movement. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib13.8.1" style="font-size:90%;">1978. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib14.5.5.1" style="font-size:90%;">Han et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib14.7.1" style="font-size:90%;"> Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, et al. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib14.8.1" style="font-size:90%;">Proxedit: Improving tuning-free real image editing with proximal guidance. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib14.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib14.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</em><span class="ltx_text" id="bib.bib14.11.3" style="font-size:90%;">, pages 4291–4301, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib15.5.5.1" style="font-size:90%;">Härkönen et al. [2020]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib15.7.1" style="font-size:90%;"> Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib15.8.1" style="font-size:90%;">Ganspace: Discovering interpretable gan controls. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib15.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib15.10.2" style="font-size:90%;">, abs/2004.02546, 2020. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib16.5.5.1" style="font-size:90%;">He et al. [2025]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib16.7.1" style="font-size:90%;"> Xiaoxiao He, Ligong Han, Quan Dao, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, et al. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib16.8.1" style="font-size:90%;">Dice: Discrete inversion enabling controllable editing for multinomial diffusion and masked generative models. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib16.9.1" style="font-size:90%;">2025. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib17.5.5.1" style="font-size:90%;">Hong et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib17.7.1" style="font-size:90%;"> Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib17.8.1" style="font-size:90%;">Lrm: Large reconstruction model for single image to 3d. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib17.10.2" style="font-size:90%;">, abs/2311.04400, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib18.5.5.1" style="font-size:90%;">Hu et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib18.7.1" style="font-size:90%;"> J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib18.8.1" style="font-size:90%;">Lora: Low-rank adaptation of large language models. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib18.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib18.10.2" style="font-size:90%;">, abs/2106.09685, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib19.5.5.1" style="font-size:90%;">Hu et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib19.7.1" style="font-size:90%;"> Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib19.8.1" style="font-size:90%;">Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib19.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib19.10.2" style="font-size:90%;">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib19.11.3" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib20.5.5.1" style="font-size:90%;">Huang et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib20.7.1" style="font-size:90%;"> Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib20.8.1" style="font-size:90%;">2d gaussian splatting for geometrically accurate radiance fields. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib20.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib20.10.2" style="font-size:90%;">SIGGRAPH 2024 Conference Papers</em><span class="ltx_text" id="bib.bib20.11.3" style="font-size:90%;">. Association for Computing Machinery, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib21.5.5.1" style="font-size:90%;">Karras et al. [2019]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib21.7.1" style="font-size:90%;"> Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib21.8.1" style="font-size:90%;">Analyzing and improving the image quality of stylegan. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib21.9.1" style="font-size:90%;">2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib21.10.2" style="font-size:90%;">, pages 8107–8116, 2019. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib22.5.5.1" style="font-size:90%;">Kerbl et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib22.7.1" style="font-size:90%;"> Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib22.8.1" style="font-size:90%;">3d gaussian splatting for real-time radiance field rendering. 
</span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib22.9.1" style="font-size:90%;">ACM Transactions on Graphics (TOG)</em><span class="ltx_text" id="bib.bib22.10.2" style="font-size:90%;">, 42:1 – 14, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib23.4.4.1" style="font-size:90%;">Kim and Chun [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib23.6.1" style="font-size:90%;"> Gwanghyun Kim and Se Young Chun. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib23.7.1" style="font-size:90%;">Datid-3d: Diversity-preserved domain adaptation using text-to-image diffusion for 3d generative model. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib23.8.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib23.9.2" style="font-size:90%;">CVPR</em><span class="ltx_text" id="bib.bib23.10.3" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib24.5.5.1" style="font-size:90%;">Li et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib24.7.1" style="font-size:90%;"> Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib24.8.1" style="font-size:90%;">Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib24.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib24.10.2" style="font-size:90%;">International Conference on Machine Learning</em><span class="ltx_text" id="bib.bib24.11.3" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib25.5.5.1" style="font-size:90%;">Li et al. [2017]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib25.7.1" style="font-size:90%;"> Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib25.8.1" style="font-size:90%;">Learning a model of facial shape and expression from 4D scans. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.9.1" style="font-size:90%;">ACM Transactions on Graphics, (Proc. SIGGRAPH Asia)</em><span class="ltx_text" id="bib.bib25.10.2" style="font-size:90%;">, 36(6):194:1–194:17, 2017. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib26.5.5.1" style="font-size:90%;">Liu et al. [2023a]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib26.7.1" style="font-size:90%;"> Di Liu, Xiang Yu, Meng Ye, Qilong Zhangli, Zhuowei Li, Zhixing Zhang, and Dimitris N Metaxas. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib26.8.1" style="font-size:90%;">Deformer: Integrating transformers with deformable models for 3d shape abstraction from a single image. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib26.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib26.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF International Conference on Computer Vision</em><span class="ltx_text" id="bib.bib26.11.3" style="font-size:90%;">, pages 14236–14246, 2023a. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib27.5.5.1" style="font-size:90%;">Liu et al. 
[2023b]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib27.7.1" style="font-size:90%;"> Di Liu, Long Zhao, Qilong Zhangli, Yunhe Gao, Ting Liu, and Dimitris N Metaxas. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib27.8.1" style="font-size:90%;">Deep deformable models: Learning 3d shape abstractions with part consistency. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.9.1" style="font-size:90%;">arXiv preprint arXiv:2309.01035</em><span class="ltx_text" id="bib.bib27.10.2" style="font-size:90%;">, 2023b. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib28.5.5.1" style="font-size:90%;">Liu et al. [2024a]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib28.7.1" style="font-size:90%;"> Di Liu, Qilong Zhangli, Yunhe Gao, and Dimitris Metaxas. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib28.8.1" style="font-size:90%;">Lepard: Learning explicit part discovery for 3d articulated shape reconstruction. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib28.9.1" style="font-size:90%;">Advances in Neural Information Processing Systems</em><span class="ltx_text" id="bib.bib28.10.2" style="font-size:90%;">, 36, 2024a. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib29.5.5.1" style="font-size:90%;">Liu et al. [2024b]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib29.7.1" style="font-size:90%;"> Di Liu, Bingbing Zhuang, Dimitris N. Metaxas, and Manmohan Chandraker. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib29.8.1" style="font-size:90%;">Instantaneous perception of moving objects in 3d. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib29.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib29.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</em><span class="ltx_text" id="bib.bib29.11.3" style="font-size:90%;">, 2024b. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib30.5.5.1" style="font-size:90%;">Liu et al. [2025]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib30.7.1" style="font-size:90%;"> Di Liu, Teng Deng, Giljoo Nam, Yu Rong, Stanislav Pidhorskyi, Junxuan Li, Jason Saragih, Dimitris N. Metaxas, and Chen Cao. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib30.8.1" style="font-size:90%;">Lucas: Layered universal codec avatars, 2025. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib31.5.5.1" style="font-size:90%;">Luo et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib31.7.1" style="font-size:90%;"> Xuan Luo, Xuaner Zhang, Paul Yoo, Ricardo Martin-Brualla, Jason Lawrence, and Steven M. Seitz. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib31.8.1" style="font-size:90%;">Time-travel rephotography. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.9.1" style="font-size:90%;">ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2021)</em><span class="ltx_text" id="bib.bib31.10.2" style="font-size:90%;">, 40(6), 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib32.5.5.1" style="font-size:90%;">Ma et al. 
[2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib32.7.1" style="font-size:90%;"> Shengjie Ma, Yanlin Weng, Tianjia Shao, and Kun Zhou. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib32.8.1" style="font-size:90%;">3d gaussian blendshapes for head avatar animation. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib32.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib32.10.2" style="font-size:90%;">ACM SIGGRAPH Conference Proceedings, Denver, CO, United States, July 28 - August 1, 2024</em><span class="ltx_text" id="bib.bib32.11.3" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib33.5.5.1" style="font-size:90%;">Men et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib33.7.1" style="font-size:90%;"> Yifang Men, Hanxi Liu, Yuan Yao, Miaomiao Cui, Xuansong Xie, and Zhouhui Lian. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib33.8.1" style="font-size:90%;">3dtoonify: Creating your high-fidelity 3d stylized avatar easily from 2d portrait images. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib33.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib33.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib33.11.3" style="font-size:90%;">, pages 10127–10137, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib34.5.5.1" style="font-size:90%;">Meng et al. 
[2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib34.7.1" style="font-size:90%;"> Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib34.8.1" style="font-size:90%;">Sdedit: Guided image synthesis and editing with stochastic differential equations. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib34.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib34.10.2" style="font-size:90%;">International Conference on Learning Representations</em><span class="ltx_text" id="bib.bib34.11.3" style="font-size:90%;">, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib35.4.4.1" style="font-size:90%;">Meta [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib35.6.1" style="font-size:90%;"> Meta. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib35.7.1" style="font-size:90%;">Express yourself with Meta avatars. </span> </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.meta.com/avatars/" style="font-size:90%;" title="">https://www.meta.com/avatars/</a><span class="ltx_text" id="bib.bib35.8.1" style="font-size:90%;">, 2024. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib35.9.1" style="font-size:90%;">Accessed: 2024-11-03. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib36.5.5.1" style="font-size:90%;">Mildenhall et al. [2020]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib36.7.1" style="font-size:90%;"> Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib36.8.1" style="font-size:90%;">Nerf: Representing scenes as neural radiance fields for view synthesis. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib36.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib36.10.2" style="font-size:90%;">ECCV</em><span class="ltx_text" id="bib.bib36.11.3" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib37.5.5.1" style="font-size:90%;">Nguyen-Phuoc et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib37.7.1" style="font-size:90%;"> Thu Nguyen-Phuoc, Gabriel Schwartz, Yuting Ye, Stephen Lombardi, and Lei Xiao. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib37.8.1" style="font-size:90%;">Alteredavatar: Stylizing dynamic 3d avatars with fast style adaptation. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib37.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib37.10.2" style="font-size:90%;">, abs/2305.19245, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib38.5.5.1" style="font-size:90%;">Oquab et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib38.7.1" style="font-size:90%;"> Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib38.8.1" style="font-size:90%;">Dinov2: Learning robust visual features without supervision. 
</span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib38.9.1" style="font-size:90%;">arXiv preprint arXiv:2304.07193</em><span class="ltx_text" id="bib.bib38.10.2" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib39.4.4.1" style="font-size:90%;">Pinkney and Adler [2020]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib39.6.1" style="font-size:90%;"> Justin N. M. Pinkney and Doron Adler. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib39.7.1" style="font-size:90%;">Resolution dependent gan interpolation for controllable image synthesis between domains. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.8.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib39.9.2" style="font-size:90%;">, abs/2010.05334, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib40.5.5.1" style="font-size:90%;">Qian et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib40.7.1" style="font-size:90%;"> Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib40.8.1" style="font-size:90%;">Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.9.1" style="font-size:90%;">arXiv preprint arXiv:2312.02069</em><span class="ltx_text" id="bib.bib40.10.2" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib41.4.4.1" style="font-size:90%;">Research [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib41.6.1" style="font-size:90%;"> Grand View Research. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib41.7.1" style="font-size:90%;">Digital avatar market size, share and growth report, 2030. </span> </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.grandviewresearch.com/industry-analysis/digital-avatar-market-report/" style="font-size:90%;" title="">https://www.grandviewresearch.com/industry-analysis/digital-avatar-market-report/</a><span class="ltx_text" id="bib.bib41.8.1" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib42.5.5.1" style="font-size:90%;">Roich et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib42.7.1" style="font-size:90%;"> Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib42.8.1" style="font-size:90%;">Pivotal tuning for latent-based editing of real images. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.9.1" style="font-size:90%;">ACM Transactions on Graphics (TOG)</em><span class="ltx_text" id="bib.bib42.10.2" style="font-size:90%;">, 42:1 – 13, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib43.5.5.1" style="font-size:90%;">Rombach et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib43.7.1" style="font-size:90%;"> Robin Rombach, A. 
Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib43.8.1" style="font-size:90%;">High-resolution image synthesis with latent diffusion models. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib43.9.1" style="font-size:90%;">2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib43.10.2" style="font-size:90%;">, pages 10674–10685, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib44.5.5.1" style="font-size:90%;">Saito et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib44.7.1" style="font-size:90%;"> Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib44.8.1" style="font-size:90%;">Relightable gaussian codec avatars. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib44.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib44.10.2" style="font-size:90%;">CVPR</em><span class="ltx_text" id="bib.bib44.11.3" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib45.4.4.1" style="font-size:90%;">Samsung [2018]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib45.6.1" style="font-size:90%;"> Samsung. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib45.7.1" style="font-size:90%;">How to use the AR Emoji feature on your Galaxy phone. 
</span> </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.samsung.com/sg/support/mobile-devices/how-to-use-the-emoji-feature-on-your-galaxy-phone/" style="font-size:90%;" title="">https://www.samsung.com/sg/support/mobile-devices/how-to-use-the-emoji-feature-on-your-galaxy-phone/</a><span class="ltx_text" id="bib.bib45.8.1" style="font-size:90%;">, 2018. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib45.9.1" style="font-size:90%;">Accessed: 2024-11-03. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib46.5.5.1" style="font-size:90%;">Sang et al. [2022]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib46.7.1" style="font-size:90%;"> Shen Sang, Tiancheng Zhi, Guoxian Song, Minghao Liu, Chun-Pong Lai, Jing Liu, Xiang Wen, James Davis, and Linjie Luo. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib46.8.1" style="font-size:90%;">Agileavatar: Stylized 3d avatar creation via cascaded domain bridging. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.9.1" style="font-size:90%;">SIGGRAPH Asia 2022 Conference Papers</em><span class="ltx_text" id="bib.bib46.10.2" style="font-size:90%;">, 2022. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib47.5.5.1" style="font-size:90%;">Shao et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib47.7.1" style="font-size:90%;"> Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib47.8.1" style="font-size:90%;">SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib47.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib47.10.2" style="font-size:90%;">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib47.11.3" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib48.5.5.1" style="font-size:90%;">Shi et al. [2019]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib48.7.1" style="font-size:90%;"> Tianyang Shi, Yi Yuan, Changjie Fan, Zhengxia Zou, Zhen Xia Shi, and Yong Liu. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib48.8.1" style="font-size:90%;">Face-to-parameter translation for game character auto-creation. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib48.9.1" style="font-size:90%;">2019 IEEE/CVF International Conference on Computer Vision (ICCV)</em><span class="ltx_text" id="bib.bib48.10.2" style="font-size:90%;">, pages 161–170, 2019. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib49.5.5.1" style="font-size:90%;">Shi et al. [2020]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib49.7.1" style="font-size:90%;"> Tianyang Shi, Yi Yuan, Changjie Fan, Zhengxia Zou, Zhenwei Shi, and Yong Liu. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib49.8.1" style="font-size:90%;">Fast and robust face-to-parameter translation for game character auto-creation. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib49.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib49.10.2" style="font-size:90%;">, abs/2008.07132, 2020. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib50.5.5.1" style="font-size:90%;">Shi et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib50.7.1" style="font-size:90%;"> Yichun Shi, Xiao Yang, Yangyue Wan, and Xiaohui Shen. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib50.8.1" style="font-size:90%;">Semanticstylegan: Learning compositional generative priors for controllable image synthesis and editing. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.9.1" style="font-size:90%;">2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib50.10.2" style="font-size:90%;">, pages 11244–11254, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib51.5.5.1" style="font-size:90%;">Shi et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib51.7.1" style="font-size:90%;"> Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib51.8.1" style="font-size:90%;">Mvdream: Multi-view diffusion for 3d generation. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib51.9.1" style="font-size:90%;">arXiv:2308.16512</em><span class="ltx_text" id="bib.bib51.10.2" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib52.3.3.1" style="font-size:90%;">[52]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib52.5.1" style="font-size:90%;"> Snap Inc. 
</span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib52.6.1" style="font-size:90%;">Bitmoji 3d avatar platform solutions. </span> </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://developers.snap.com/lens-studio/platform-solutions/bitmoji-avatar/bitmoji-3d" style="font-size:90%;" title="">https://developers.snap.com/lens-studio/platform-solutions/bitmoji-avatar/bitmoji-3d</a><span class="ltx_text" id="bib.bib52.7.1" style="font-size:90%;">. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib53.5.5.1" style="font-size:90%;">Song et al. [2020]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib53.7.1" style="font-size:90%;"> Jiaming Song, Chenlin Meng, and Stefano Ermon. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib53.8.1" style="font-size:90%;">Denoising diffusion implicit models. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib53.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib53.10.2" style="font-size:90%;">, abs/2010.02502, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib54.5.5.1" style="font-size:90%;">Song et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib54.7.1" style="font-size:90%;"> Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, and Chenliang Xu. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib54.8.1" style="font-size:90%;">Texttoon: Real-time text toonify head avatar from single video. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib54.9.1" style="font-size:90%;">2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib55.5.5.1" style="font-size:90%;">Tang et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib55.7.1" style="font-size:90%;"> Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib55.8.1" style="font-size:90%;">Lgm: Large multi-view gaussian model for high-resolution 3d content creation. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib55.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib55.10.2" style="font-size:90%;">European Conference on Computer Vision</em><span class="ltx_text" id="bib.bib55.11.3" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib56.5.5.1" style="font-size:90%;">Wang et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib56.7.1" style="font-size:90%;"> Shizun Wang, Weihong Zeng, Xu Wang, Han Yang, Li Chen, Chuang Zhang, Ming Wu, Yi Yuan, Yunzhao Zeng, and Minghang Zheng. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib56.8.1" style="font-size:90%;">Swiftavatar: Efficient auto-creation of parameterized stylized character on arbitrary avatar engines. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib56.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib56.10.2" style="font-size:90%;">AAAI Conference on Artificial Intelligence</em><span class="ltx_text" id="bib.bib56.11.3" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib57.5.5.1" style="font-size:90%;">Wang et al. [2025]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib57.7.1" style="font-size:90%;"> Suzhen Wang, Weijie Chen, Wei Zhang, Minda Zhao, Lincheng Li, Rongsheng Zhang, Zhipeng Hu, and Xin Yu. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib57.8.1" style="font-size:90%;">Easycraft: A robust and efficient framework for automatic avatar crafting, 2025. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib58.5.5.1" style="font-size:90%;">Wu et al. [2020]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib58.7.1" style="font-size:90%;"> Zongze Wu, Dani Lischinski, and Eli Shechtman. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib58.8.1" style="font-size:90%;">Stylespace analysis: Disentangled controls for stylegan image generation. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib58.9.1" style="font-size:90%;">2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib58.10.2" style="font-size:90%;">, pages 12858–12867, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib59.5.5.1" style="font-size:90%;">Wu et al. [2021]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib59.7.1" style="font-size:90%;"> Zongze Wu, Yotam Nitzan, Eli Shechtman, and Dani Lischinski. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib59.8.1" style="font-size:90%;">Stylealign: Analysis and applications of aligned stylegan models. 
</span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib59.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib59.10.2" style="font-size:90%;">, abs/2110.11323, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib60.5.5.1" style="font-size:90%;">Ye et al. [2023]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib60.7.1" style="font-size:90%;"> Hu Ye, Jun Zhang, Siyi Liu, Xiao Han, and Wei Yang. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib60.8.1" style="font-size:90%;">Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib60.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib60.10.2" style="font-size:90%;">, abs/2308.06721, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib61.5.5.1" style="font-size:90%;">Zhang et al. [2023a]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib61.7.1" style="font-size:90%;"> Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang YU, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, and Chunhua Shen. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib61.8.1" style="font-size:90%;">Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation, 2023a. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib62.5.5.1" style="font-size:90%;">Zhang et al. 
[2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib62.7.1" style="font-size:90%;"> Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib62.8.1" style="font-size:90%;">Gs-lrm: Large reconstruction model for 3d gaussian splatting. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib62.9.1" style="font-size:90%;">ArXiv</em><span class="ltx_text" id="bib.bib62.10.2" style="font-size:90%;">, abs/2404.19702, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib63.5.5.1" style="font-size:90%;">Zhang et al. [2023b]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib63.7.1" style="font-size:90%;"> Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib63.8.1" style="font-size:90%;">Adding conditional control to text-to-image diffusion models. </span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib63.9.1" style="font-size:90%;">2023 IEEE/CVF International Conference on Computer Vision (ICCV)</em><span class="ltx_text" id="bib.bib63.10.2" style="font-size:90%;">, pages 3813–3824, 2023b. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib64.5.5.1" style="font-size:90%;">Zhang et al. [2018]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib64.7.1" style="font-size:90%;"> Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib64.8.1" style="font-size:90%;">The unreasonable effectiveness of deep features as a perceptual metric. 
</span> </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib64.9.1" style="font-size:90%;">2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</em><span class="ltx_text" id="bib.bib64.10.2" style="font-size:90%;">, pages 586–595, 2018. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem"><span class="ltx_text" id="bib.bib65.5.5.1" style="font-size:90%;">Zhangli et al. [2024]</span></span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib65.7.1" style="font-size:90%;"> Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N Metaxas, and Praveen Krishnan. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib65.8.1" style="font-size:90%;">Layout-agnostic scene text image synthesis with diffusion models. </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib65.9.1" style="font-size:90%;">In </span><em class="ltx_emph ltx_font_italic" id="bib.bib65.10.2" style="font-size:90%;">2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em><span class="ltx_text" id="bib.bib65.11.3" style="font-size:90%;">, pages 7496–7506. IEEE Computer Society, 2024. 
</span> </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <div class="ltx_pagination ltx_role_newpage"></div> <div class="ltx_para ltx_align_center" id="p3"> <span class="ltx_text" id="p3.1" style="font-size:144%;">Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars</span> <br class="ltx_break"/> <p class="ltx_p" id="p3.2"><span class="ltx_text" id="p3.2.1" style="font-size:144%;">Supplementary Material <br class="ltx_break"/></span></p> </div> <section class="ltx_appendix ltx_centering" id="A1"> <h2 class="ltx_title ltx_title_appendix" style="font-size:144%;"> <span class="ltx_tag ltx_tag_appendix">Appendix A </span>Implementation Details</h2> <figure class="ltx_figure" id="A1.F9"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="814" id="A1.F9.g1" src="x9.png" width="830"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="A1.F9.5.1.1" style="font-size:63%;">Figure 9</span>: </span><span class="ltx_text ltx_font_bold" id="A1.F9.6.2" style="font-size:63%;">GDA Training Data.<span class="ltx_text ltx_font_medium" id="A1.F9.6.2.1"> Visualization of data pairs used to train the GDA network. 
Each avatar is inverted into the latent space of a GAN, and a generator trained on realistic faces creates a corresponding realistic face, preserving features such as hairstyles, hair colors, and general facial characteristics.</span></span></figcaption> </figure> <section class="ltx_subsection" id="A1.SS1"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> <span class="ltx_tag ltx_tag_subsection">A.1 </span>Bitmoji Training Data</h3> <section class="ltx_paragraph" id="A1.SS1.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph" style="font-size:144%;">GDA Training Data.</h4> <div class="ltx_para" id="A1.SS1.SSS0.Px1.p1"> <p class="ltx_p" id="A1.SS1.SSS0.Px1.p1.1"><span class="ltx_text" id="A1.SS1.SSS0.Px1.p1.1.1" style="font-size:144%;">To train our Gaussian Domain Adaptation network, we start with a dataset of random Bitmojis and use GAN inversion to generate corresponding realistic images. Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F9" style="font-size:144%;" title="Figure 9 ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">9</span></a><span class="ltx_text" id="A1.SS1.SSS0.Px1.p1.1.2" style="font-size:144%;"> showcases some examples of the training data generated through this process. The resulting faces mirror the hairstyles, hair colors, and facial features of the original avatars. 
While the generated images may contain artifacts and exhibit limited diversity, our GDA model benefits from pre-training on Objaverse </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" id="A1.SS1.SSS0.Px1.p1.1.3.1" style="font-size:144%;">[</span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title=""><span class="ltx_text" style="font-size:90%;">10</span></a><span class="ltx_text" id="A1.SS1.SSS0.Px1.p1.1.4.2" style="font-size:144%;">]</span></cite><span class="ltx_text" id="A1.SS1.SSS0.Px1.p1.1.5" style="font-size:144%;">, enabling it to leverage prior knowledge and produce more detailed reconstructions than GAN inversion alone. This approach enhances the accuracy and expressiveness of the domain adaptation process.</span></p> </div> </section> <section class="ltx_paragraph" id="A1.SS1.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph" style="font-size:144%;">Multi-view Training Data.</h4> <div class="ltx_para" id="A1.SS1.SSS0.Px2.p1"> <p class="ltx_p" id="A1.SS1.SSS0.Px2.p1.1"><span class="ltx_text" id="A1.SS1.SSS0.Px2.p1.1.1" style="font-size:144%;">Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F10" style="font-size:144%;" title="Figure 10 ‣ Multi-view Training Data. ‣ A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">10</span></a><span class="ltx_text" id="A1.SS1.SSS0.Px2.p1.1.2" style="font-size:144%;"> visualizes training samples from the Bitmoji dataset used for training our 3D Generation Network. Each avatar is rendered from 10 spherically distributed viewpoints around the head and is posed with random blendshape weights to simulate diverse facial expressions. The dataset features a wide range of hairstyles, skin tones, and accessories such as glasses, hats, and earrings. 
Although the U-Net is trained exclusively on Bitmoji-style avatars, it effectively reconstructs dual-stylized avatars that exhibit distinct appearances and textures, demonstrating the network’s versatility and generalization capability.</span></p> </div> <figure class="ltx_figure" id="A1.F10"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="598" id="A1.F10.g1" src="extracted/6275685/appendix/bitmoji_mv.jpg" width="598"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="A1.F10.5.1.1" style="font-size:63%;">Figure 10</span>: </span><span class="ltx_text ltx_font_bold" id="A1.F10.6.2" style="font-size:63%;">Multi-view Bitmoji Training Data.<span class="ltx_text ltx_font_medium" id="A1.F10.6.2.1"> Samples from the Bitmoji dataset used in training the 3D Generation Network. Avatars are rendered from the front and multiple random angles around the head, with random blendshapes applied to simulate various expressions.</span></span></figcaption> </figure> </section> </section> <section class="ltx_subsection" id="A1.SS2"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> <span class="ltx_tag ltx_tag_subsection">A.2 </span>Facial Action Coding System</h3> <div class="ltx_para" id="A1.SS2.p1"> <p class="ltx_p" id="A1.SS2.p1.1"><span class="ltx_text" id="A1.SS2.p1.1.1" style="font-size:144%;">We implement the following 16 blendshapes from the Facial Action Coding System. 
These blendshapes are compatible with most facial blendshape predictors like </span><span class="ltx_text ltx_font_italic" id="A1.SS2.p1.1.2" style="font-size:144%;">Apple ARKit<span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">*</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">*</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_upright" id="footnote1.1.1.1" style="font-size:69%;">*</span></span><a class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright" href="https://arkit-face-blendshapes.com/" style="font-size:69%;" title="">https://arkit-face-blendshapes.com/</a></span></span></span></span><span class="ltx_text" id="A1.SS2.p1.1.3" style="font-size:144%;"> or </span><span class="ltx_text ltx_font_italic" id="A1.SS2.p1.1.4" style="font-size:144%;">Google Mediapipe<span class="ltx_note ltx_role_footnote" id="footnote2"><sup class="ltx_note_mark">*</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">*</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_upright" id="footnote2.1.1.1" style="font-size:69%;">*</span></span><a class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright" href="https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker" style="font-size:69%;" title="">https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker</a></span></span></span></span><span class="ltx_text" id="A1.SS2.p1.1.5" style="font-size:144%;">.</span></p> </div> <div class="ltx_para" id="A1.SS2.p2"> <ul class="ltx_itemize" id="A1.I1"> <li class="ltx_item" id="A1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i1.p1"> <p class="ltx_p" id="A1.I1.i1.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i1.p1.1.1" style="font-size:144%;">browDownLeft</span><span class="ltx_text" id="A1.I1.i1.p1.1.2" 
style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i2.p1"> <p class="ltx_p" id="A1.I1.i2.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i2.p1.1.1" style="font-size:144%;">browDownRight</span><span class="ltx_text" id="A1.I1.i2.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i3.p1"> <p class="ltx_p" id="A1.I1.i3.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i3.p1.1.1" style="font-size:144%;">browUpLeft</span><span class="ltx_text" id="A1.I1.i3.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i4" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i4.p1"> <p class="ltx_p" id="A1.I1.i4.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i4.p1.1.1" style="font-size:144%;">browUpRight</span><span class="ltx_text" id="A1.I1.i4.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i5" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i5.p1"> <p class="ltx_p" id="A1.I1.i5.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i5.p1.1.1" style="font-size:144%;">eyeBlinkLeft</span><span class="ltx_text" id="A1.I1.i5.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i6" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i6.p1"> <p class="ltx_p" id="A1.I1.i6.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i6.p1.1.1" style="font-size:144%;">eyeBlinkRight</span><span class="ltx_text" id="A1.I1.i6.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" 
id="A1.I1.i7" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i7.p1"> <p class="ltx_p" id="A1.I1.i7.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i7.p1.1.1" style="font-size:144%;">jawOpen</span><span class="ltx_text" id="A1.I1.i7.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i8" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i8.p1"> <p class="ltx_p" id="A1.I1.i8.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i8.p1.1.1" style="font-size:144%;">jawLeft</span><span class="ltx_text" id="A1.I1.i8.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i9" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i9.p1"> <p class="ltx_p" id="A1.I1.i9.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i9.p1.1.1" style="font-size:144%;">jawRight</span><span class="ltx_text" id="A1.I1.i9.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i10" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i10.p1"> <p class="ltx_p" id="A1.I1.i10.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i10.p1.1.1" style="font-size:144%;">lipsPucker</span><span class="ltx_text" id="A1.I1.i10.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i11" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i11.p1"> <p class="ltx_p" id="A1.I1.i11.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i11.p1.1.1" style="font-size:144%;">mouthFrownLeft</span><span class="ltx_text" id="A1.I1.i11.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i12" style="list-style-type:none;"> <span class="ltx_tag 
ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i12.p1"> <p class="ltx_p" id="A1.I1.i12.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i12.p1.1.1" style="font-size:144%;">mouthFrownRight</span><span class="ltx_text" id="A1.I1.i12.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i13" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i13.p1"> <p class="ltx_p" id="A1.I1.i13.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i13.p1.1.1" style="font-size:144%;">mouthSmileLeft</span><span class="ltx_text" id="A1.I1.i13.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i14" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i14.p1"> <p class="ltx_p" id="A1.I1.i14.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i14.p1.1.1" style="font-size:144%;">mouthSmileRight</span><span class="ltx_text" id="A1.I1.i14.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i15" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i15.p1"> <p class="ltx_p" id="A1.I1.i15.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i15.p1.1.1" style="font-size:144%;">mouthStretchLeft</span><span class="ltx_text" id="A1.I1.i15.p1.1.2" style="font-size:144%;"></span></p> </div> </li> <li class="ltx_item" id="A1.I1.i16" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="A1.I1.i16.p1"> <p class="ltx_p" id="A1.I1.i16.p1.1"><span class="ltx_text ltx_font_typewriter" id="A1.I1.i16.p1.1.1" style="font-size:144%;">mouthStretchRight</span><span class="ltx_text" id="A1.I1.i16.p1.1.2" style="font-size:144%;"></span></p> </div> </li> </ul> </div> <div class="ltx_para" id="A1.SS2.p3"> <p class="ltx_p" id="A1.SS2.p3.1"><span class="ltx_text" 
id="A1.SS2.p3.1.1" style="font-size:144%;">To animate the Bitmoji avatars for training data, we used the public Bitmoji rig available here: </span><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://developers.snap.com/lens-studio/features/bitmoji-avatar/animating-bitmoji-3d" style="font-size:144%;" title="">https://developers.snap.com/lens-studio/features/bitmoji-avatar/animating-bitmoji-3d</a><span class="ltx_text" id="A1.SS2.p3.1.2" style="font-size:144%;">.</span></p> </div> <div class="ltx_para" id="A1.SS2.p4"> <p class="ltx_p" id="A1.SS2.p4.1"><span class="ltx_text" id="A1.SS2.p4.1.1" style="font-size:144%;">At inference time, we can use a real-time blendshape predictor like </span><span class="ltx_text ltx_font_italic" id="A1.SS2.p4.1.2" style="font-size:144%;">ARKit</span><span class="ltx_text" id="A1.SS2.p4.1.3" style="font-size:144%;"> or </span><span class="ltx_text ltx_font_italic" id="A1.SS2.p4.1.4" style="font-size:144%;">Mediapipe</span><span class="ltx_text" id="A1.SS2.p4.1.5" style="font-size:144%;"> to puppet the avatars from a real video. Please see the attached HTML gallery for a demonstration.</span></p> </div> </section> <section class="ltx_subsection" id="A1.SS3"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> <span class="ltx_tag ltx_tag_subsection">A.3 </span>User Interfaces</h3> <div class="ltx_para" id="A1.SS3.p1"> <p class="ltx_p" id="A1.SS3.p1.1"><span class="ltx_text" id="A1.SS3.p1.1.1" style="font-size:144%;">To showcase the intuitive features of the </span><span class="ltx_text" id="A1.SS3.p1.1.2" style="font-size:144%;">Snapmoji</span><span class="ltx_text" id="A1.SS3.p1.1.3" style="font-size:144%;"> system, we present videos of the interface interactions in the HTML gallery. 
The avatar generation interface, built with </span><span class="ltx_text ltx_font_italic" id="A1.SS3.p1.1.4" style="font-size:144%;">Gradio</span><span class="ltx_text" id="A1.SS3.p1.1.5" style="font-size:144%;">, enables users to effortlessly create dual-stylized avatars from their own photos. In addition, the blendshape editor, developed using </span><span class="ltx_text ltx_font_italic" id="A1.SS3.p1.1.6" style="font-size:144%;">Viser</span><span class="ltx_text" id="A1.SS3.p1.1.7" style="font-size:144%;">, allows users to pose their avatars in 3D by adjusting blendshape weights, thereby controlling facial expressions.</span></p> </div> <div class="ltx_para" id="A1.SS3.p2"> <p class="ltx_p" id="A1.SS3.p2.1"><span class="ltx_text" id="A1.SS3.p2.1.1" style="font-size:144%;">During the diffusion stylization process, we provide users with key parameters to balance identity preservation with style diversity. These controls are listed in order of significance: 1) Style Transition Strength; 2) Edge Preservation Level; 3) Identity Consistency Factor.</span></p> </div> <div class="ltx_para ltx_noindent" id="A1.SS3.p3"> <p class="ltx_p" id="A1.SS3.p3.1"><span class="ltx_text ltx_font_italic" id="A1.SS3.p3.1.1" style="font-size:144%;">Style Transition Strength</span><span class="ltx_text" id="A1.SS3.p3.1.2" style="font-size:144%;">: This parameter, inspired by methods similar to SDEdit, regulates the extent of the stylization transition. 
Lower values enable the dual-stylized avatar to retain more details from the original single-styled avatar input.</span></p> </div> <div class="ltx_para ltx_noindent" id="A1.SS3.p4"> <p class="ltx_p" id="A1.SS3.p4.1"><span class="ltx_text ltx_font_italic" id="A1.SS3.p4.1.1" style="font-size:144%;">Edge Preservation Level</span><span class="ltx_text" id="A1.SS3.p4.1.2" style="font-size:144%;">: This setting influences how accurately the system maintains the structure of the avatar by preserving the edges from the single-styled input.</span></p> </div> <div class="ltx_para ltx_noindent" id="A1.SS3.p5"> <p class="ltx_p" id="A1.SS3.p5.1"><span class="ltx_text ltx_font_italic" id="A1.SS3.p5.1.1" style="font-size:144%;">Identity Consistency Factor</span><span class="ltx_text" id="A1.SS3.p5.1.2" style="font-size:144%;">: This controls the strength of identity features from the initial input photo, ensuring that essential facial characteristics remain recognizable.</span></p> </div> <div class="ltx_para" id="A1.SS3.p6"> <p class="ltx_p" id="A1.SS3.p6.1"><span class="ltx_text" id="A1.SS3.p6.1.1" style="font-size:144%;">We encourage users to view the videos in the HTML gallery to observe how these parameters affect avatar generation, enhancing both creativity and user experience.</span></p> </div> </section> </section> <section class="ltx_appendix ltx_centering" id="A2"> <h2 class="ltx_title ltx_title_appendix" style="font-size:144%;"> <span class="ltx_tag ltx_tag_appendix">Appendix B </span>Additional Results</h2> <section class="ltx_subsection" id="A2.SS1"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> <span class="ltx_tag ltx_tag_subsection">B.1 </span>Results Gallery</h3> <div class="ltx_para" id="A2.SS1.p1"> <p class="ltx_p" id="A2.SS1.p1.1"><span class="ltx_text" id="A2.SS1.p1.1.1" style="font-size:144%;">We invite you to explore the HTML gallery, which features videos of </span><span class="ltx_text" id="A2.SS1.p1.1.2" 
style="font-size:144%;">Snapmoji</span><span class="ltx_text" id="A2.SS1.p1.1.3" style="font-size:144%;"> avatars animated in 3D. Access the gallery by opening the </span><span class="ltx_text ltx_font_typewriter" id="A2.SS1.p1.1.4" style="font-size:144%;">index.html</span><span class="ltx_text" id="A2.SS1.p1.1.5" style="font-size:144%;"> file in your web browser. The gallery includes the following highlights:</span></p> </div> <div class="ltx_para" id="A2.SS1.p2"> <ol class="ltx_enumerate" id="A2.I1"> <li class="ltx_item" id="A2.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">1.</span> <div class="ltx_para" id="A2.I1.i1.p1"> <p class="ltx_p" id="A2.I1.i1.p1.1"><span class="ltx_text" id="A2.I1.i1.p1.1.1" style="font-size:144%;">Dual-stylized avatars with dynamic facial animations displayed from various novel viewpoints.</span></p> </div> </li> <li class="ltx_item" id="A2.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">2.</span> <div class="ltx_para" id="A2.I1.i2.p1"> <p class="ltx_p" id="A2.I1.i2.p1.1"><span class="ltx_text" id="A2.I1.i2.p1.1.1" style="font-size:144%;">A demonstration of the avatars’ capabilities in facial puppeting for augmented reality applications.</span></p> </div> </li> <li class="ltx_item" id="A2.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">3.</span> <div class="ltx_para" id="A2.I1.i3.p1"> <p class="ltx_p" id="A2.I1.i3.p1.1"><span class="ltx_text" id="A2.I1.i3.p1.1.1" style="font-size:144%;">Screen captures of the </span><span class="ltx_text" id="A2.I1.i3.p1.1.2" style="font-size:144%;">Snapmoji</span><span class="ltx_text" id="A2.I1.i3.p1.1.3" style="font-size:144%;"> user interfaces, showcasing the ease of creating dual-stylized avatars and posing them using blendshapes.</span></p> </div> </li> </ol> </div> </section> <section class="ltx_subsection" id="A2.SS2"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> <span class="ltx_tag 
ltx_tag_subsection">B.2 </span>More Applications</h3> <div class="ltx_para ltx_noindent" id="A2.SS2.p1"> <p class="ltx_p" id="A2.SS2.p1.1"><span class="ltx_text ltx_font_bold" id="A2.SS2.p1.1.1" style="font-size:144%;">3D Avatar Animation.</span><span class="ltx_text" id="A2.SS2.p1.1.2" style="font-size:144%;"> Dual-stylization offers the ability to swiftly visualize avatars in various scenarios, unlocking numerous applications. As illustrated in Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S0.F1" style="font-size:144%;" title="Figure 1 ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">1</span></a><span class="ltx_text" id="A2.SS2.p1.1.3" style="font-size:144%;">, </span><span class="ltx_text" id="A2.SS2.p1.1.4" style="font-size:144%;">Snapmoji</span><span class="ltx_text" id="A2.SS2.p1.1.5" style="font-size:144%;"> avatars can be employed to create personalized comics and stickers, offering users a unique way to express themselves. Another promising application lies in augmented reality (AR), where avatars can be controlled and animated with real-time tracked facial expressions. Examples of this application are shown in Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F11" style="font-size:144%;" title="Figure 11 ‣ B.2 More Applications ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">11</span></a><span class="ltx_text" id="A2.SS2.p1.1.6" style="font-size:144%;"> and within the HTML gallery. 
By utilizing </span><span class="ltx_text ltx_font_italic" id="A2.SS2.p1.1.7" style="font-size:144%;">Mediapipe</span><span class="ltx_text" id="A2.SS2.p1.1.8" style="font-size:144%;">’s real-time blendshape tracker, we animate the 3D avatars and seamlessly integrate them with video content, enabling them to be rendered in an AR environment through alpha compositing.</span></p> </div> <div class="ltx_para ltx_noindent" id="A2.SS2.p2"> <p class="ltx_p" id="A2.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="A2.SS2.p2.1.1" style="font-size:144%;">Real-time Web Rendering.</span><span class="ltx_text" id="A2.SS2.p2.1.2" style="font-size:144%;"> Our choice to represent avatars using Gaussian Splats enables efficient real-time rendering on mobile devices. As demonstrated in Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F12" style="font-size:144%;" title="Figure 12 ‣ B.2 More Applications ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">12</span></a><span class="ltx_text" id="A2.SS2.p2.1.3" style="font-size:144%;"> and in the HTML gallery, the avatars achieve a rendering rate of 90-100 FPS on a laptop, and 30-40 FPS on a phone. When paired with a face tracker, these avatars can be used to generate engaging filters and augmented reality effects. 
The demonstration showcases an avatar rendered in Google Chrome on a MacBook, entirely on the client side.</span></p> </div> <figure class="ltx_figure" id="A2.F11"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="544" id="A2.F11.g1" src="extracted/6275685/appendix/ar-moji.jpeg" width="598"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="A2.F11.7.1.1" style="font-size:63%;">Figure 11</span>: </span><span class="ltx_text ltx_font_bold" id="A2.F11.8.2" style="font-size:63%;">Augmented Reality Puppeting.<span class="ltx_text ltx_font_medium" id="A2.F11.8.2.1"> This example demonstrates the use of <span class="ltx_text ltx_font_italic" id="A2.F11.8.2.1.1">Mediapipe</span>’s real-time face detection to animate avatars based on estimated blendshape weights. By alpha-compositing the avatars with the original input, we enable dynamic puppeting in augmented reality. For live demonstrations, please refer to the HTML gallery.</span></span></figcaption> </figure> <figure class="ltx_figure" id="A2.F12"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="337" id="A2.F12.g1" src="extracted/6275685/appendix/instamoj-web.png" width="598"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="A2.F12.5.1.1" style="font-size:63%;">Figure 12</span>: </span><span class="ltx_text ltx_font_bold" id="A2.F12.6.2" style="font-size:63%;">Real-time Web Rendering.<span class="ltx_text ltx_font_medium" id="A2.F12.6.2.1"> Leveraging Gaussian Splats, our avatars efficiently render at 90–100 FPS on laptops and 30–40 FPS on mobile devices while occupying only 3 MB of disk space. In conjunction with a mobile face tracker, these avatars facilitate the creation of engaging filters and AR effects. 
The example shown features an avatar rendered in Google Chrome on a MacBook, developed in JavaScript.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="A2.SS2.p3"> <p class="ltx_p" id="A2.SS2.p3.1"><span class="ltx_text ltx_font_bold" id="A2.SS2.p3.1.1" style="font-size:144%;">GDA Generalization.</span><span class="ltx_text" id="A2.SS2.p3.1.2" style="font-size:144%;"> GDA demonstrates that the features learned from few-shot 3D reconstruction models are transferable to new tasks. As shown in Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F13" style="font-size:144%;" title="Figure 13 ‣ B.2 More Applications ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">13</span></a><span class="ltx_text" id="A2.SS2.p3.1.3" style="font-size:144%;">, GDA can also be applied to other domains, such as cats. We hope that GDA can inspire future work on using Gaussian features for other tasks.</span></p> </div> <figure class="ltx_figure" id="A2.F13"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="290" id="A2.F13.g1" src="x10.png" width="830"/> <figcaption class="ltx_caption" style="font-size:144%;"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="A2.F13.5.1.1" style="font-size:63%;">Figure 13</span>: </span><span class="ltx_text ltx_font_bold" id="A2.F13.6.2" style="font-size:63%;">GDA Generalization Across Domains.<span class="ltx_text ltx_font_medium" id="A2.F13.6.2.1"> This illustration showcases the versatility of Gaussian Domain Adaptation (GDA) as an image-to-image translation method. 
Demonstrated above is GDA’s capability to transform realistic cat images into anime-style representations and vice versa, highlighting its potential for a wide range of applications beyond avatar creation.</span></span></figcaption> </figure> </section> <section class="ltx_subsection" id="A2.SS3"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> <span class="ltx_tag ltx_tag_subsection">B.3 </span>Ablation Studies</h3> <figure class="ltx_figure" id="A2.F14"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="1005" id="A2.F14.g1" src="x11.png" width="830"/> <figcaption class="ltx_caption" style="font-size:144%;"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="A2.F14.5.1.1" style="font-size:63%;">Figure 14</span>: </span><span class="ltx_text ltx_font_bold" id="A2.F14.6.2" style="font-size:63%;">Ablation Study on 3DMM Tracking.<span class="ltx_text ltx_font_medium" id="A2.F14.6.2.1"> This figure demonstrates the effects of using 3DMM features in conjunction with FACS blendshape weights. The combination enhances the expressiveness and fidelity of avatar animation, accommodating both realistic and stylized facial expressions.</span></span></figcaption> </figure> <div class="ltx_para ltx_noindent" id="A2.SS3.p1"> <p class="ltx_p" id="A2.SS3.p1.1"><span class="ltx_text ltx_font_bold" id="A2.SS3.p1.1.1" style="font-size:144%;">3DMM Tracking.</span><span class="ltx_text" id="A2.SS3.p1.1.2" style="font-size:144%;"> As shown in Fig. </span><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F14" style="font-size:144%;" title="Figure 14 ‣ B.3 Ablation Studies ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars"><span class="ltx_text ltx_ref_tag">14</span></a><span class="ltx_text" id="A2.SS3.p1.1.3" style="font-size:144%;">, our ablation study highlights the complementary strengths of 3DMM tracking and FACS-based blendshape features in avatar animation. 
3DMM is adept at capturing realistic facial expressions, making it ideal for animating real faces, but it struggles with the exaggerated features typical of cartoon avatars. Conversely, FACS blendshapes excel in stylizing facial elements, such as eye and mouth shapes, crucial for cartoon animation. By integrating the precision of 3DMM with the expressive capability of FACS blendshapes, we enhance the overall animation quality, enabling our avatars to faithfully portray both realistic and stylized expressions, thus delivering a more versatile and convincing animation experience.</span></p> </div> </section> </section> <section class="ltx_appendix ltx_centering" id="A3"> <h2 class="ltx_title ltx_title_appendix" style="font-size:144%;"> <span class="ltx_tag ltx_tag_appendix">Appendix C </span>Ethical Discussion</h2> <div class="ltx_para" id="A3.p1"> <p class="ltx_p" id="A3.p1.1"><span class="ltx_text" id="A3.p1.1.1" style="font-size:144%;">The use of photorealistic avatars has raised significant privacy and ethical concerns, particularly in relation to their potential misuse in creating deep fakes and spreading misinformation. In contrast, stylized cartoon avatars offer a safer alternative as they are not easily exploited for direct impersonation. In our work, we have prioritized user privacy by ensuring that no real person’s images are used to train our models. Instead, the realistic images employed for training the Gaussian Domain Adaptation (GDA) system are generated by a GAN. We recognize, however, that GAN-generated data can reflect the biases present in the original datasets used for training. 
Consequently, we remain vigilant about these limitations and are committed to continuous evaluation and improvement to mitigate any unintended biases.</span></p> </div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Mar 13 21:10:50 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>
