<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars</title>  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2503.11978v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S1" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">1 Introduction</a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S2" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">2 Related Work</a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3 Method</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS1" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable 
Dual-Stylized Avatars">3.1 Datasets</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS2" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.2 2D Dual-Stylized Avatar Generation</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS3" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.3 3D Animatable Stylized Avatar Generation</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS4" title="In 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.4 Training and Losses</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4 Experiments</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.SS1" title="In 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4.1 2D Stylized Avatar Generation</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.SS2" title="In 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4.2 3D Dual-Stylized Avatar Generation</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.SS3" title="In 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4.3 3D Stylized Avatar Animation</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S5" title="In Snapmoji: Instant Generation of Animatable 
Dual-Stylized Avatars">5 Conclusion</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">A Implementation Details</a> <ol class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS1" title="In Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">A.1 Bitmoji Training Data</a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS1.SSS0.Px1" title="In A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">GDA Training Data.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS1.SSS0.Px2" title="In A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">Multi-view Training Data.</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS2" title="In Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">A.2 Facial Action Coding System</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.SS3" title="In Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">A.3 User Interfaces</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"> <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">B Additional Results</a> <ol 
class="ltx_toclist ltx_toclist_appendix"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.SS1" title="In Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">B.1 Results Gallery</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.SS2" title="In Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">B.2 More Applications</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.SS3" title="In Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">B.3 Ablation Studies</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A3" title="In Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">C Ethical Discussion</a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line ltx_pruned_first"> <h1 class="ltx_title ltx_title_document"> Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars</h1> <div class="ltx_authors"> Eric Ming Chen1,2∗ Di Liu1,3∗ Sizhuo Ma1 Michael Vasilkovsky1 Bing Zhou1 Qiang Gao1 Wenzhou Wang1 Jiahao Luo1,4 Dimitris N. Metaxas3 Vincent Sitzmann2 Jian Wang1 1Snap Inc 2MIT 3Rutgers University 4University of California, Santa Cruz </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> The increasing popularity of personalized avatar systems, such as Snapchat Bitmojis and Apple Memojis, highlights the growing demand for digital self-representation. 
Despite their widespread use, existing avatar platforms face significant limitations, including restricted expressivity due to predefined assets, tedious customization processes, or inefficient rendering requirements. Addressing these shortcomings, we introduce Snapmoji, an avatar generation system that instantly creates animatable, dual-stylized avatars from a selfie. We propose Gaussian Domain Adaptation (GDA), which is pre-trained on large-scale Gaussian models using 3D data from sources like Objaverse and fine-tuned with 2D style transfer tasks, endowing it with a rich 3D prior. This enables Snapmoji to transform a selfie into a primary stylized avatar (e.g., Bitmoji style) and apply a secondary style (e.g., Plastic Toy or Alien), all while preserving the user’s identity and the primary style’s integrity. Our system is capable of producing 3D Gaussian avatars that support dynamic animation, including accurate facial expression transfer. Designed for efficiency, Snapmoji achieves selfie-to-avatar conversion in a mere 0.9 seconds and supports real-time interactions on mobile devices at 30–40 FPS. Extensive testing confirms that Snapmoji outperforms existing methods in versatility and speed, making it a convenient tool for automatic avatar creation in various styles. </div> <div class="ltx_para" id="p2"> <img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="178" id="p2.g1" src="x1.png" width="830"/> </div> <figure class="ltx_figure" id="S0.F1"> <figcaption class="ltx_caption">Figure 1: We introduce Snapmoji, a system that can instantly generate animatable dual-stylized avatars. Our dual stylization process reimagines avatars in various artistic styles, enabling users to visualize themselves in diverse scenarios and create personalized stories. Our approach also enables the generation of stylized 3D Gaussian avatars and the animation of their expressions. 
Snapmoji accomplishes the selfie-to-avatar conversion in just 0.9 seconds, and offers real-time functionality for mobile applications. <a class="ltx_ref ltx_href" href="https://echen01.github.io/instamoji-supp/" title="">Project page.</a></figcaption> </figure> ∗ Equal contribution <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> 1 Introduction</h2> <div class="ltx_para" id="S1.p1"> Personalized cartoon avatars such as Snapchat Bitmojis <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib5" title="">5</a>]</cite>, Apple Memojis <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib1" title="">1</a>]</cite>, and Meta Avatars <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib35" title="">35</a>]</cite> have become popular digital self-representations. The broader digital avatar market, encompassing both stylized and photorealistic avatars, was valued at over $18 billion in 2023 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib41" title="">41</a>]</cite>. Current stylized avatar platforms, although offering some level of customization, are often restricted by predefined traits, which makes it difficult to adapt avatars to varied styles without developing new 3D assets <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib57" title="">57</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title="">56</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib29" title="">29</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib46" title="">46</a>]</cite>. 
Moreover, navigating extensive trait lists can be tedious, and efficiency demands frequently lead to compromises in texture detail and polygon count. Asset-free methods, such as StyleAvatar3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib61" title="">61</a>]</cite>, TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>]</cite> and DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title="">23</a>]</cite>, lack support for real-time operation or animatability for mobile augmented reality (AR). Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S1.T1" title="Table 1 ‣ 1 Introduction ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">1</a> provides a comparative overview of existing stylized avatar generation methods. While photorealistic avatar techniques <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib44" title="">44</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title="">40</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib32" title="">32</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib30" title="">30</a>]</cite> excel in creating realistic representations and animation, they fall short in adapting to the uniquely stylized geometries of cartoon avatars. </div> <div class="ltx_para" id="S1.p2"> In pursuit of a more expressive stylized avatar creation platform, we introduce Snapmoji, a system to generate 3D avatars, represented by 3D Gaussian Splats <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib22" title="">22</a>]</cite>, in only 0.9 seconds. 
Snapmoji is built upon the Bitmoji platform, leveraging its public API and robust developer support <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib52" title="">52</a>]</cite>. Unlike traditional avatars limited to predefined assets, our solution allows for stylization through text prompts, offering greater flexibility and creativity. Our system is designed with three core objectives in mind: 1) Dual Stylization: The system should generate avatars in the Bitmoji art style and in a secondary style, such as Plastic Toy or Alien, while preserving user identity; 2) User Convenience: For ease of use, the system should require only a single image input and produce results instantly; 3) Efficiency: The avatars should be optimized for real-time rendering on mobile devices, supporting applications like AR, with minimal compute requirements. </div> <div class="ltx_para" id="S1.p3"> Following these design objectives, we propose a two-stage pipeline for Snapmoji. First, Gaussian Domain Adaptation (GDA) transforms realistic selfies into 2D Bitmoji-style images, and then a diffusion-based model further stylizes these images based on user-specified text prompts. Second, the image is lifted to an animatable 3D Gaussian avatar that faithfully captures the user’s identity and chosen styles. To facilitate AR applications, these avatars can be animated in real time using facial parameter estimators. Although showcased with Bitmojis, our approach is applicable to other avatar platforms as well. In summary, our contributions include: </div> <div class="ltx_para" id="S1.p4"> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i1.p1"> Introducing an advanced avatar generation system that produces dual-stylized avatars instantly from a selfie. 
</div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i2.p1"> Developing Gaussian Domain Adaptation for enriching Snapmoji with a 3D prior, enabling dual-style transformations of selfies while preserving identity and style. </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i3.p1"> Creating an animatable model by leveraging driving signals from 3DMM and blendshape priors combined with 3D Gaussians, enabling efficient, real-time rendering of dual-stylized avatars. </div> </li> <li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i4.p1"> Demonstrating, through extensive experiments, that our method outperforms existing solutions in both generation and animation performance, enabling real-time applications on mobile devices. </div> </li> </ul> </div> <figure class="ltx_table" id="S1.T1"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S1.T1.2" style="width:433.6pt;height:117.6pt;vertical-align:-0.8pt;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S1.T1.2.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S1.T1.2.1.1.1"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">Method</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">Selfie Input</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">Mobile AR</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.4" 
style="padding-left:0.0pt;padding-right:0.0pt;">Asset-free</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S1.T1.2.1.1.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">Animatable</th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S1.T1.2.1.1.1.6" style="padding-left:0.0pt;padding-right:0.0pt;">Dual Style</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S1.T1.2.1.2.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">StyleAvatar3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib61" title="">61</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S1.T1.2.1.2.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S1.T1.2.1.2.1.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.3.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title="">23</a>]</cite> </td> <td class="ltx_td ltx_nopad_l 
ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.3" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.3.2.5" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.3.2.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.4.3"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.2" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.4.3.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.4.3.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.5.4"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.1" style="padding-left:0.0pt;padding-right:0.0pt;">EasyCraft <cite class="ltx_cite 
ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib57" title="">57</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.2" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.5.4.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.5.4.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.6.5"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.1" style="padding-left:0.0pt;padding-right:0.0pt;">SwiftAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title="">56</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.6.5.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.6.5.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.7.6"> <td class="ltx_td 
ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.1" style="padding-left:0.0pt;padding-right:0.0pt;">AgileAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib46" title="">46</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S1.T1.2.1.7.6.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S1.T1.2.1.7.6.6" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> </tr> <tr class="ltx_tr" id="S1.T1.2.1.8.7"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.1" style="padding-left:0.0pt;padding-right:0.0pt;"> Snapmoji (ours)</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.2" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.3" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S1.T1.2.1.8.7.5" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S1.T1.2.1.8.7.6" 
style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> </tr> </tbody> </table> </div> <figcaption class="ltx_caption ltx_centering">Table 1: Feature comparison among various stylized avatar generation methods.</figcaption> </figure> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> 2 Related Work</h2> <figure class="ltx_figure" id="S2.F2"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="444" id="S2.F2.g1" src="x2.png" width="830"/> <figcaption class="ltx_caption">Figure 2: The Snapmoji Inference Pipeline. The pipeline has two stages. First, the Gaussian Domain Adaptation network <math alttext="\mathcal{E}_{\text{GDA}}" class="ltx_Math" display="inline" id="S2.F2.11.1.1.m1.1"><semantics id="S2.F2.11.1.1.m1.1b"><msub id="S2.F2.11.1.1.m1.1.1" xref="S2.F2.11.1.1.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.F2.11.1.1.m1.1.1.2" xref="S2.F2.11.1.1.m1.1.1.2.cmml">ℰ</mi><mtext id="S2.F2.11.1.1.m1.1.1.3" xref="S2.F2.11.1.1.m1.1.1.3a.cmml">GDA</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.11.1.1.m1.1c"><apply id="S2.F2.11.1.1.m1.1.1.cmml" xref="S2.F2.11.1.1.m1.1.1"><csymbol cd="ambiguous" id="S2.F2.11.1.1.m1.1.1.1.cmml" xref="S2.F2.11.1.1.m1.1.1">subscript</csymbol><ci id="S2.F2.11.1.1.m1.1.1.2.cmml" xref="S2.F2.11.1.1.m1.1.1.2">ℰ</ci><ci id="S2.F2.11.1.1.m1.1.1.3a.cmml" xref="S2.F2.11.1.1.m1.1.1.3"><mtext id="S2.F2.11.1.1.m1.1.1.3.cmml" mathsize="70%" xref="S2.F2.11.1.1.m1.1.1.3">GDA</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.11.1.1.m1.1d">\mathcal{E}_{\text{GDA}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.11.1.1.m1.1e">caligraphic_E start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT</annotation></semantics></math> converts a facial image into a primary-style avatar <math alttext="I_{\text{sty}}" class="ltx_Math" display="inline" id="S2.F2.12.2.2.m2.1"><semantics id="S2.F2.12.2.2.m2.1b"><msub id="S2.F2.12.2.2.m2.1.1" 
xref="S2.F2.12.2.2.m2.1.1.cmml"><mi id="S2.F2.12.2.2.m2.1.1.2" xref="S2.F2.12.2.2.m2.1.1.2.cmml">I</mi><mtext id="S2.F2.12.2.2.m2.1.1.3" xref="S2.F2.12.2.2.m2.1.1.3a.cmml">sty</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.12.2.2.m2.1c"><apply id="S2.F2.12.2.2.m2.1.1.cmml" xref="S2.F2.12.2.2.m2.1.1"><csymbol cd="ambiguous" id="S2.F2.12.2.2.m2.1.1.1.cmml" xref="S2.F2.12.2.2.m2.1.1">subscript</csymbol><ci id="S2.F2.12.2.2.m2.1.1.2.cmml" xref="S2.F2.12.2.2.m2.1.1.2">𝐼</ci><ci id="S2.F2.12.2.2.m2.1.1.3a.cmml" xref="S2.F2.12.2.2.m2.1.1.3"><mtext id="S2.F2.12.2.2.m2.1.1.3.cmml" mathsize="70%" xref="S2.F2.12.2.2.m2.1.1.3">sty</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.12.2.2.m2.1d">I_{\text{sty}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.12.2.2.m2.1e">italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT</annotation></semantics></math>. This avatar undergoes further personalization using a text-guided diffusion process with <math alttext="T" class="ltx_Math" display="inline" id="S2.F2.13.3.3.m3.1"><semantics id="S2.F2.13.3.3.m3.1b"><mi id="S2.F2.13.3.3.m3.1.1" xref="S2.F2.13.3.3.m3.1.1.cmml">T</mi><annotation-xml encoding="MathML-Content" id="S2.F2.13.3.3.m3.1c"><ci id="S2.F2.13.3.3.m3.1.1.cmml" xref="S2.F2.13.3.3.m3.1.1">𝑇</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.13.3.3.m3.1d">T</annotation><annotation encoding="application/x-llamapun" id="S2.F2.13.3.3.m3.1e">italic_T</annotation></semantics></math> steps for additional stylization. 
Second, expression codes extracted via a 3DMM and FACS are combined with identity features <math alttext="f_{\text{id}}" class="ltx_Math" display="inline" id="S2.F2.14.4.4.m4.1"><semantics id="S2.F2.14.4.4.m4.1b"><msub id="S2.F2.14.4.4.m4.1.1" xref="S2.F2.14.4.4.m4.1.1.cmml"><mi id="S2.F2.14.4.4.m4.1.1.2" xref="S2.F2.14.4.4.m4.1.1.2.cmml">f</mi><mtext id="S2.F2.14.4.4.m4.1.1.3" xref="S2.F2.14.4.4.m4.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.14.4.4.m4.1c"><apply id="S2.F2.14.4.4.m4.1.1.cmml" xref="S2.F2.14.4.4.m4.1.1"><csymbol cd="ambiguous" id="S2.F2.14.4.4.m4.1.1.1.cmml" xref="S2.F2.14.4.4.m4.1.1">subscript</csymbol><ci id="S2.F2.14.4.4.m4.1.1.2.cmml" xref="S2.F2.14.4.4.m4.1.1.2">𝑓</ci><ci id="S2.F2.14.4.4.m4.1.1.3a.cmml" xref="S2.F2.14.4.4.m4.1.1.3"><mtext id="S2.F2.14.4.4.m4.1.1.3.cmml" mathsize="70%" xref="S2.F2.14.4.4.m4.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.14.4.4.m4.1d">f_{\text{id}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.14.4.4.m4.1e">italic_f start_POSTSUBSCRIPT id end_POSTSUBSCRIPT</annotation></semantics></math> from a reference image <math alttext="I_{\text{ref}}" class="ltx_Math" display="inline" id="S2.F2.15.5.5.m5.1"><semantics id="S2.F2.15.5.5.m5.1b"><msub id="S2.F2.15.5.5.m5.1.1" xref="S2.F2.15.5.5.m5.1.1.cmml"><mi id="S2.F2.15.5.5.m5.1.1.2" xref="S2.F2.15.5.5.m5.1.1.2.cmml">I</mi><mtext id="S2.F2.15.5.5.m5.1.1.3" xref="S2.F2.15.5.5.m5.1.1.3a.cmml">ref</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.15.5.5.m5.1c"><apply id="S2.F2.15.5.5.m5.1.1.cmml" xref="S2.F2.15.5.5.m5.1.1"><csymbol cd="ambiguous" id="S2.F2.15.5.5.m5.1.1.1.cmml" xref="S2.F2.15.5.5.m5.1.1">subscript</csymbol><ci id="S2.F2.15.5.5.m5.1.1.2.cmml" xref="S2.F2.15.5.5.m5.1.1.2">𝐼</ci><ci id="S2.F2.15.5.5.m5.1.1.3a.cmml" xref="S2.F2.15.5.5.m5.1.1.3"><mtext id="S2.F2.15.5.5.m5.1.1.3.cmml" mathsize="70%" 
xref="S2.F2.15.5.5.m5.1.1.3">ref</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.15.5.5.m5.1d">I_{\text{ref}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.15.5.5.m5.1e">italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT</annotation></semantics></math> and positional maps <math alttext="f_{\text{pos}}" class="ltx_Math" display="inline" id="S2.F2.16.6.6.m6.1"><semantics id="S2.F2.16.6.6.m6.1b"><msub id="S2.F2.16.6.6.m6.1.1" xref="S2.F2.16.6.6.m6.1.1.cmml"><mi id="S2.F2.16.6.6.m6.1.1.2" xref="S2.F2.16.6.6.m6.1.1.2.cmml">f</mi><mtext id="S2.F2.16.6.6.m6.1.1.3" xref="S2.F2.16.6.6.m6.1.1.3a.cmml">pos</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.16.6.6.m6.1c"><apply id="S2.F2.16.6.6.m6.1.1.cmml" xref="S2.F2.16.6.6.m6.1.1"><csymbol cd="ambiguous" id="S2.F2.16.6.6.m6.1.1.1.cmml" xref="S2.F2.16.6.6.m6.1.1">subscript</csymbol><ci id="S2.F2.16.6.6.m6.1.1.2.cmml" xref="S2.F2.16.6.6.m6.1.1.2">𝑓</ci><ci id="S2.F2.16.6.6.m6.1.1.3a.cmml" xref="S2.F2.16.6.6.m6.1.1.3"><mtext id="S2.F2.16.6.6.m6.1.1.3.cmml" mathsize="70%" xref="S2.F2.16.6.6.m6.1.1.3">pos</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.16.6.6.m6.1d">f_{\text{pos}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.16.6.6.m6.1e">italic_f start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT</annotation></semantics></math> from a driving image <math alttext="I_{\text{drive}}" class="ltx_Math" display="inline" id="S2.F2.17.7.7.m7.1"><semantics id="S2.F2.17.7.7.m7.1b"><msub id="S2.F2.17.7.7.m7.1.1" xref="S2.F2.17.7.7.m7.1.1.cmml"><mi id="S2.F2.17.7.7.m7.1.1.2" xref="S2.F2.17.7.7.m7.1.1.2.cmml">I</mi><mtext id="S2.F2.17.7.7.m7.1.1.3" xref="S2.F2.17.7.7.m7.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.17.7.7.m7.1c"><apply id="S2.F2.17.7.7.m7.1.1.cmml" xref="S2.F2.17.7.7.m7.1.1"><csymbol cd="ambiguous" id="S2.F2.17.7.7.m7.1.1.1.cmml" 
xref="S2.F2.17.7.7.m7.1.1">subscript</csymbol><ci id="S2.F2.17.7.7.m7.1.1.2.cmml" xref="S2.F2.17.7.7.m7.1.1.2">𝐼</ci><ci id="S2.F2.17.7.7.m7.1.1.3a.cmml" xref="S2.F2.17.7.7.m7.1.1.3"><mtext id="S2.F2.17.7.7.m7.1.1.3.cmml" mathsize="70%" xref="S2.F2.17.7.7.m7.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.17.7.7.m7.1d">I_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.17.7.7.m7.1e">italic_I start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math>. The unposed dual-stylized avatar <math alttext="I_{\text{unposed}}" class="ltx_Math" display="inline" id="S2.F2.18.8.8.m8.1"><semantics id="S2.F2.18.8.8.m8.1b"><msub id="S2.F2.18.8.8.m8.1.1" xref="S2.F2.18.8.8.m8.1.1.cmml"><mi id="S2.F2.18.8.8.m8.1.1.2" xref="S2.F2.18.8.8.m8.1.1.2.cmml">I</mi><mtext id="S2.F2.18.8.8.m8.1.1.3" xref="S2.F2.18.8.8.m8.1.1.3a.cmml">unposed</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.18.8.8.m8.1c"><apply id="S2.F2.18.8.8.m8.1.1.cmml" xref="S2.F2.18.8.8.m8.1.1"><csymbol cd="ambiguous" id="S2.F2.18.8.8.m8.1.1.1.cmml" xref="S2.F2.18.8.8.m8.1.1">subscript</csymbol><ci id="S2.F2.18.8.8.m8.1.1.2.cmml" xref="S2.F2.18.8.8.m8.1.1.2">𝐼</ci><ci id="S2.F2.18.8.8.m8.1.1.3a.cmml" xref="S2.F2.18.8.8.m8.1.1.3"><mtext id="S2.F2.18.8.8.m8.1.1.3.cmml" mathsize="70%" xref="S2.F2.18.8.8.m8.1.1.3">unposed</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.18.8.8.m8.1d">I_{\text{unposed}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.18.8.8.m8.1e">italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT</annotation></semantics></math> is then processed by an asymmetric UNet <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S2.F2.19.9.9.m9.1"><semantics id="S2.F2.19.9.9.m9.1b"><mrow id="S2.F2.19.9.9.m9.1.2" xref="S2.F2.19.9.9.m9.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S2.F2.19.9.9.m9.1.2.2" 
xref="S2.F2.19.9.9.m9.1.2.2.cmml">𝒢</mi><mo id="S2.F2.19.9.9.m9.1.2.1" xref="S2.F2.19.9.9.m9.1.2.1.cmml">⁢</mo><mrow id="S2.F2.19.9.9.m9.1.2.3.2" xref="S2.F2.19.9.9.m9.1.2.cmml"><mo id="S2.F2.19.9.9.m9.1.2.3.2.1" stretchy="false" xref="S2.F2.19.9.9.m9.1.2.cmml">(</mo><mo id="S2.F2.19.9.9.m9.1.1" lspace="0em" rspace="0em" xref="S2.F2.19.9.9.m9.1.1.cmml">⋅</mo><mo id="S2.F2.19.9.9.m9.1.2.3.2.2" stretchy="false" xref="S2.F2.19.9.9.m9.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.F2.19.9.9.m9.1c"><apply id="S2.F2.19.9.9.m9.1.2.cmml" xref="S2.F2.19.9.9.m9.1.2"><times id="S2.F2.19.9.9.m9.1.2.1.cmml" xref="S2.F2.19.9.9.m9.1.2.1"></times><ci id="S2.F2.19.9.9.m9.1.2.2.cmml" xref="S2.F2.19.9.9.m9.1.2.2">𝒢</ci><ci id="S2.F2.19.9.9.m9.1.1.cmml" xref="S2.F2.19.9.9.m9.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F2.19.9.9.m9.1d">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.F2.19.9.9.m9.1e">caligraphic_G ( ⋅ )</annotation></semantics></math>, conditioned on the driving codes <math alttext="f_{\text{drive}}" class="ltx_Math" display="inline" id="S2.F2.20.10.10.m10.1"><semantics id="S2.F2.20.10.10.m10.1b"><msub id="S2.F2.20.10.10.m10.1.1" xref="S2.F2.20.10.10.m10.1.1.cmml"><mi id="S2.F2.20.10.10.m10.1.1.2" xref="S2.F2.20.10.10.m10.1.1.2.cmml">f</mi><mtext id="S2.F2.20.10.10.m10.1.1.3" xref="S2.F2.20.10.10.m10.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S2.F2.20.10.10.m10.1c"><apply id="S2.F2.20.10.10.m10.1.1.cmml" xref="S2.F2.20.10.10.m10.1.1"><csymbol cd="ambiguous" id="S2.F2.20.10.10.m10.1.1.1.cmml" xref="S2.F2.20.10.10.m10.1.1">subscript</csymbol><ci id="S2.F2.20.10.10.m10.1.1.2.cmml" xref="S2.F2.20.10.10.m10.1.1.2">𝑓</ci><ci id="S2.F2.20.10.10.m10.1.1.3a.cmml" xref="S2.F2.20.10.10.m10.1.1.3"><mtext id="S2.F2.20.10.10.m10.1.1.3.cmml" mathsize="70%" xref="S2.F2.20.10.10.m10.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S2.F2.20.10.10.m10.1d">f_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S2.F2.20.10.10.m10.1e">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math> through cross-attention, to generate animated, dual-stylized 3D avatars.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S2.p1"> 2D Stylized Avatar Generation. In the realm of 2D avatar generation, neural networks like StyleGAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib21" title="">21</a>]</cite> are renowned for producing realistic images with interpretable latent spaces, as explored in works like <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib15" title="">15</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib58" title="">58</a>]</cite>. StyleGAN’s versatility enables transformations into various styles, including Disney cartoons, paintings, and vintage photos <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib39" title="">39</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib31" title="">31</a>]</cite>. A significant advantage of StyleGAN is its capability to perform these transformations without requiring paired domain images, a feature also utilized in SwiftAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title="">56</a>]</cite>, which creates paired data between realistic and stylized avatars. Diffusion models represent another prominent approach for 2D stylization, known for their larger architectures and enhanced diversity in generated content. 
Models such as Stable Diffusion <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib43" title="">43</a>]</cite> allow for image generation conditioned on text prompts, thereby increasing user control. Further advancements, including SDEdit <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib34" title="">34</a>]</cite>, ControlNet <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib63" title="">63</a>]</cite>, and IP Adapter <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib60" title="">60</a>]</cite>, provide additional control through noise introduction or conditioning on structural inputs. Both GANs and diffusion models effectively convert real images into target styles, such as Bitmojis. </div> <div class="ltx_para ltx_noindent" id="S2.p2"> 3D Content and Avatar Generation. Recent advances in 3D content creation have also significantly impacted avatar generation <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib62" title="">62</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib27" title="">27</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib26" title="">26</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib28" title="">28</a>]</cite>. 3D representations like NeRFs <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib36" title="">36</a>]</cite> and Gaussian Splats <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib22" title="">22</a>]</cite> have been integrated with generative models to automate the creation of 3D assets. 
For instance, DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title="">23</a>]</cite> and StyleAvatar3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib61" title="">61</a>]</cite> employ 3D GANs to generate and stylize 3D facial models. More recent developments utilize text-to-image diffusion models for avatar stylization <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib37" title="">37</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib33" title="">33</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib16" title="">16</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib14" title="">14</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib65" title="">65</a>]</cite>, though this process tends to be slow, taking approximately 10 minutes. In the realm of photorealistic avatar creation, techniques using Gaussian Splats <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib44" title="">44</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title="">40</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib32" title="">32</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib30" title="">30</a>]</cite> fit real faces to 3D Morphable Models (3DMMs) like FLAME <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib25" title="">25</a>]</cite>, but these do not adapt well to the geometry of cartoon avatars. 
Beyond avatars, work in general 3D object generation includes models like LRM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib17" title="">17</a>]</cite> and LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite>, which train end-to-end neural networks for mapping 2D images to 3D objects. These models provide much faster inference compared to diffusion models but rely on extensive internet-scale multi-view datasets, such as Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title="">10</a>]</cite>, for training. This approach has yet to be applied to the specific challenge of 3D stylized avatar generation, which is a gap we seek to address. </div> <div class="ltx_para ltx_noindent" id="S2.p3"> Production Systems for Stylized 3D Avatar Creation. Developments in avatar creation platforms have introduced automated processes for selecting avatars by training classifiers that can predict avatar traits from a user’s photograph <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib45" title="">45</a>]</cite>. Initially, these systems generate a basic version of an avatar, which users can then personalize by adjusting various traits to their preference. A significant challenge in training these classifiers is the need for paired data that links real faces to specific avatar traits, a requirement that is difficult to meet on a large scale. 
To address this challenge, approaches like AgileAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib46" title="">46</a>]</cite>, F2P <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib48" title="">48</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib49" title="">49</a>]</cite>, EasyCraft <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib57" title="">57</a>]</cite> and SwiftAvatar <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title="">56</a>]</cite> have been developed, utilizing self-supervised learning techniques. While these methods are efficient, they are currently limited to creating avatars from pre-existing 3D asset libraries, rather than generating entirely new styles. </div> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> 3 Method</h2> <div class="ltx_para" id="S3.p1"> Our method begins with the creation of datasets for real face images and their primary-style avatars (Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS1" title="3.1 Datasets ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.1</a>), facilitating avatar dual-stylization via Gaussian Domain Adaptation (Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS2" title="3.2 2D Dual-Stylized Avatar Generation ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.2</a>). This framework supports 3D generation and animation of dual-stylized avatars (Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS3" title="3.3 3D Animatable Stylized Avatar Generation ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.3</a>). 
Loss functions and training details are provided in Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS4" title="3.4 Training and Losses ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.4</a>. </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> 3.1 Datasets</h3> <div class="ltx_para" id="S3.SS1.p1"> Training our image-to-avatar GDA model requires paired datasets of real faces and avatars in a primary style, which are not available at scale. To overcome this, we employ GAN inversion techniques inspired by unsupervised domain adaptation <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title="">56</a>]</cite> to create synthetic paired data. By aligning the latent spaces of a source GAN and a fine-tuned target GAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib59" title="">59</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib56" title="">56</a>]</cite>, we generate corresponding pairs of realistic and primary-stylized images. Specifically, avatar images are inverted into the target GAN’s latent space to obtain latent codes, which are then applied to the source GAN to produce realistic-face counterparts: <table class="ltx_equation ltx_eqn_table" id="S3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="w:=\text{argmin}_{w\in\mathcal{W}}\|G_{\text{tgt}}(w)-I_{\text{tgt}}\|,\\ I_{\text{src}}=G_{\text{src}}(w)." 
class="ltx_Math" display="block" id="S3.E1.m1.3"><semantics id="S3.E1.m1.3a"><mrow id="S3.E1.m1.3.3.1"><mrow id="S3.E1.m1.3.3.1.1.2" xref="S3.E1.m1.3.3.1.1.3.cmml"><mrow id="S3.E1.m1.3.3.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.3" xref="S3.E1.m1.3.3.1.1.1.1.3.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.1.1.2" lspace="0.278em" rspace="0.278em" xref="S3.E1.m1.3.3.1.1.1.1.2.cmml">:=</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.cmml"><msub id="S3.E1.m1.3.3.1.1.1.1.1.3" xref="S3.E1.m1.3.3.1.1.1.1.1.3.cmml"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.3.2a.cmml">argmin</mtext><mrow id="S3.E1.m1.3.3.1.1.1.1.1.3.3" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.1.3.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.2.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.1.1.1.3.3.1" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.1.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S3.E1.m1.3.3.1.1.1.1.1.3.3.3" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.3.cmml">𝒲</mi></mrow></msub><mo id="S3.E1.m1.3.3.1.1.1.1.1.2" xref="S3.E1.m1.3.3.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.2.cmml"><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E1.m1.3.3.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.cmml"><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml"><msub id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2.cmml">G</mi><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3a.cmml">tgt</mtext></msub><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1.cmml">⁢</mo><mrow id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml"><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.3.2.1" stretchy="false" 
xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml">(</mo><mi id="S3.E1.m1.1.1" xref="S3.E1.m1.1.1.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.3.2.2" stretchy="false" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2.cmml">I</mi><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3a.cmml">tgt</mtext></msub></mrow><mo id="S3.E1.m1.3.3.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E1.m1.3.3.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow></mrow></mrow><mo id="S3.E1.m1.3.3.1.1.2.3" xref="S3.E1.m1.3.3.1.1.3a.cmml">,</mo><mrow id="S3.E1.m1.3.3.1.1.2.2" xref="S3.E1.m1.3.3.1.1.2.2.cmml"><msub id="S3.E1.m1.3.3.1.1.2.2.2" xref="S3.E1.m1.3.3.1.1.2.2.2.cmml"><mi id="S3.E1.m1.3.3.1.1.2.2.2.2" xref="S3.E1.m1.3.3.1.1.2.2.2.2.cmml">I</mi><mtext id="S3.E1.m1.3.3.1.1.2.2.2.3" xref="S3.E1.m1.3.3.1.1.2.2.2.3a.cmml">src</mtext></msub><mo id="S3.E1.m1.3.3.1.1.2.2.1" xref="S3.E1.m1.3.3.1.1.2.2.1.cmml">=</mo><mrow id="S3.E1.m1.3.3.1.1.2.2.3" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml"><msub id="S3.E1.m1.3.3.1.1.2.2.3.2" xref="S3.E1.m1.3.3.1.1.2.2.3.2.cmml"><mi id="S3.E1.m1.3.3.1.1.2.2.3.2.2" xref="S3.E1.m1.3.3.1.1.2.2.3.2.2.cmml">G</mi><mtext id="S3.E1.m1.3.3.1.1.2.2.3.2.3" xref="S3.E1.m1.3.3.1.1.2.2.3.2.3a.cmml">src</mtext></msub><mo id="S3.E1.m1.3.3.1.1.2.2.3.1" xref="S3.E1.m1.3.3.1.1.2.2.3.1.cmml">⁢</mo><mrow id="S3.E1.m1.3.3.1.1.2.2.3.3.2" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml"><mo id="S3.E1.m1.3.3.1.1.2.2.3.3.2.1" stretchy="false" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml">(</mo><mi id="S3.E1.m1.2.2" xref="S3.E1.m1.2.2.cmml">w</mi><mo id="S3.E1.m1.3.3.1.1.2.2.3.3.2.2" stretchy="false" xref="S3.E1.m1.3.3.1.1.2.2.3.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="S3.E1.m1.3.3.1.2" lspace="0em">.</mo></mrow><annotation-xml 
encoding="MathML-Content" id="S3.E1.m1.3b"><apply id="S3.E1.m1.3.3.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.2"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.3a.cmml" xref="S3.E1.m1.3.3.1.1.2.3">formulae-sequence</csymbol><apply id="S3.E1.m1.3.3.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1"><csymbol cd="latexml" id="S3.E1.m1.3.3.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.2">assign</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.3">𝑤</ci><apply id="S3.E1.m1.3.3.1.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1"><times id="S3.E1.m1.3.3.1.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.2"></times><apply id="S3.E1.m1.3.3.1.1.1.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.1.1.1.3.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.1.3.2a.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.2"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.3.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.2">argmin</mtext></ci><apply id="S3.E1.m1.3.3.1.1.1.1.1.3.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3"><in id="S3.E1.m1.3.3.1.1.1.1.1.3.3.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.1"></in><ci id="S3.E1.m1.3.3.1.1.1.1.1.3.3.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.2">𝑤</ci><ci id="S3.E1.m1.3.3.1.1.1.1.1.3.3.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.3.3.3">𝒲</ci></apply></apply><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S3.E1.m1.3.3.1.1.1.1.1.1.2.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1"><minus id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.1"></minus><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2"><times id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.1"></times><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" 
id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.2">𝐺</ci><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.2.2.3">tgt</mtext></ci></apply><ci id="S3.E1.m1.1.1.cmml" xref="S3.E1.m1.1.1">𝑤</ci></apply><apply id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.2">𝐼</ci><ci id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.1.1.1.1.1.1.3.3">tgt</mtext></ci></apply></apply></apply></apply></apply><apply id="S3.E1.m1.3.3.1.1.2.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2"><eq id="S3.E1.m1.3.3.1.1.2.2.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.1"></eq><apply id="S3.E1.m1.3.3.1.1.2.2.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.2.2.2.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.2.2.2.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2.2">𝐼</ci><ci id="S3.E1.m1.3.3.1.1.2.2.2.3a.cmml" xref="S3.E1.m1.3.3.1.1.2.2.2.3"><mtext id="S3.E1.m1.3.3.1.1.2.2.2.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.2.2.2.3">src</mtext></ci></apply><apply id="S3.E1.m1.3.3.1.1.2.2.3.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3"><times id="S3.E1.m1.3.3.1.1.2.2.3.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.1"></times><apply id="S3.E1.m1.3.3.1.1.2.2.3.2.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.2"><csymbol cd="ambiguous" id="S3.E1.m1.3.3.1.1.2.2.3.2.1.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.2">subscript</csymbol><ci id="S3.E1.m1.3.3.1.1.2.2.3.2.2.cmml" 
xref="S3.E1.m1.3.3.1.1.2.2.3.2.2">𝐺</ci><ci id="S3.E1.m1.3.3.1.1.2.2.3.2.3a.cmml" xref="S3.E1.m1.3.3.1.1.2.2.3.2.3"><mtext id="S3.E1.m1.3.3.1.1.2.2.3.2.3.cmml" mathsize="70%" xref="S3.E1.m1.3.3.1.1.2.2.3.2.3">src</mtext></ci></apply><ci id="S3.E1.m1.2.2.cmml" xref="S3.E1.m1.2.2">𝑤</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E1.m1.3c">w:=\text{argmin}_{w\in\mathcal{W}}\|G_{\text{tgt}}(w)-I_{\text{tgt}}\|,\\ I_{\text{src}}=G_{\text{src}}(w).</annotation><annotation encoding="application/x-llamapun" id="S3.E1.m1.3d">italic_w := argmin start_POSTSUBSCRIPT italic_w ∈ caligraphic_W end_POSTSUBSCRIPT ∥ italic_G start_POSTSUBSCRIPT tgt end_POSTSUBSCRIPT ( italic_w ) - italic_I start_POSTSUBSCRIPT tgt end_POSTSUBSCRIPT ∥ , italic_I start_POSTSUBSCRIPT src end_POSTSUBSCRIPT = italic_G start_POSTSUBSCRIPT src end_POSTSUBSCRIPT ( italic_w ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(1)</td> </tr></tbody> </table> Using this method, we generated 13,000 synthetic image pairs from Bitmoji avatars, forming the basis for GDA training. (See Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F9" title="Figure 9 ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">9</a> and Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F10" title="Figure 10 ‣ Multi-view Training Data. ‣ A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">10</a> in Suppl. for details). 
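The inversion in Eq. (1) can be illustrated with a minimal sketch. It is not the pipeline used here: the two generators are replaced by hypothetical linear maps (the names a_src, a_tgt, make_toy_generators and invert are illustrative assumptions), and the squared norm is minimized by plain gradient descent, which has the same minimizer as the norm in Eq. (1). A real setup would use StyleGAN generators and typically perceptual losses.

```python
import numpy as np


def make_toy_generators(latent_dim=8, image_dim=32, seed=0):
    """Build hypothetical linear stand-ins for G_src and G_tgt.

    Both generators read the same latent space; the target matrix is a
    perturbed copy of the source matrix, mimicking a fine-tuned GAN.
    """
    rng = np.random.default_rng(seed)
    a_src = rng.normal(size=(image_dim, latent_dim))
    a_tgt = a_src + 0.3 * rng.normal(size=(image_dim, latent_dim))
    return a_src, a_tgt


def invert(a_tgt, i_tgt, steps=2000):
    """Approximate argmin_w ||G_tgt(w) - I_tgt|| by gradient descent.

    The squared norm is minimized; it has the same minimizer as the norm.
    """
    w = np.zeros(a_tgt.shape[1])
    # Step size 1/L, where L = 2 * sigma_max^2 is the Lipschitz constant
    # of the gradient of the squared-norm objective.
    lr = 1.0 / (2.0 * np.linalg.norm(a_tgt, 2) ** 2)
    for _ in range(steps):
        residual = a_tgt @ w - i_tgt
        w = w - lr * 2.0 * (a_tgt.T @ residual)
    return w


a_src, a_tgt = make_toy_generators()
w_true = np.random.default_rng(1).normal(size=a_tgt.shape[1])
i_tgt = a_tgt @ w_true          # toy stylized "avatar" image I_tgt
w_hat = invert(a_tgt, i_tgt)    # inversion into the target latent space
i_src = a_src @ w_hat           # paired "realistic" image I_src = G_src(w)
```

Because both toy generators share one latent space, the recovered code w_hat can be fed directly to the source generator, which is the property the paired-data construction relies on.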
</div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> 3.2 2D Dual-Stylized Avatar Generation</h3> <div class="ltx_para ltx_noindent" id="S3.SS2.p1"> Gaussian Domain Adaptation <math alttext="\mathcal{E}_{\text{GDA}}(\cdot)" class="ltx_Math" display="inline" id="S3.SS2.p1.1.m1.1"><semantics id="S3.SS2.p1.1.m1.1a"><mrow id="S3.SS2.p1.1.m1.1.2" xref="S3.SS2.p1.1.m1.1.2.cmml"><msub id="S3.SS2.p1.1.m1.1.2.2" xref="S3.SS2.p1.1.m1.1.2.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS2.p1.1.m1.1.2.2.2" xref="S3.SS2.p1.1.m1.1.2.2.2.cmml">ℰ</mi><mtext id="S3.SS2.p1.1.m1.1.2.2.3" xref="S3.SS2.p1.1.m1.1.2.2.3a.cmml">GDA</mtext></msub><mo id="S3.SS2.p1.1.m1.1.2.1" xref="S3.SS2.p1.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS2.p1.1.m1.1.2.3.2" xref="S3.SS2.p1.1.m1.1.2.cmml"><mo id="S3.SS2.p1.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS2.p1.1.m1.1.2.cmml">(</mo><mo id="S3.SS2.p1.1.m1.1.1" lspace="0em" rspace="0em" xref="S3.SS2.p1.1.m1.1.1.cmml">⋅</mo><mo id="S3.SS2.p1.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS2.p1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.1.m1.1b"><apply id="S3.SS2.p1.1.m1.1.2.cmml" xref="S3.SS2.p1.1.m1.1.2"><times id="S3.SS2.p1.1.m1.1.2.1.cmml" xref="S3.SS2.p1.1.m1.1.2.1"></times><apply id="S3.SS2.p1.1.m1.1.2.2.cmml" xref="S3.SS2.p1.1.m1.1.2.2"><csymbol cd="ambiguous" id="S3.SS2.p1.1.m1.1.2.2.1.cmml" xref="S3.SS2.p1.1.m1.1.2.2">subscript</csymbol><ci id="S3.SS2.p1.1.m1.1.2.2.2.cmml" xref="S3.SS2.p1.1.m1.1.2.2.2">ℰ</ci><ci id="S3.SS2.p1.1.m1.1.2.2.3a.cmml" xref="S3.SS2.p1.1.m1.1.2.2.3"><mtext id="S3.SS2.p1.1.m1.1.2.2.3.cmml" mathsize="70%" xref="S3.SS2.p1.1.m1.1.2.2.3">GDA</mtext></ci></apply><ci id="S3.SS2.p1.1.m1.1.1.cmml" xref="S3.SS2.p1.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.1.m1.1c">\mathcal{E}_{\text{GDA}}(\cdot)</annotation><annotation encoding="application/x-llamapun" 
id="S3.SS2.p1.1.m1.1d">caligraphic_E start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT ( ⋅ )</annotation></semantics></math>. To bridge the domain gap between real photos and 3D-aware cartoonish avatars (i.e., primary-style avatars), we propose Gaussian Domain Adaptation (GDA). Surprisingly, we find that features learned by Large Multi-view Gaussian models (LGMs) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite> can be quickly and robustly adapted for style transfer. We believe this is because they retain internet-scale information from multi-view training datasets such as Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title="">10</a>]</cite>. We begin with a U-Net backbone from an LGM trained on 3D objects. We then perform GDA by fine-tuning the network to map real face photos to primary-style avatar images in the frontal view. At inference time, an input selfie is first preprocessed with face alignment and background removal. 
The preprocessed image is then passed through the asymmetric U-Net that takes the input reference image <math alttext="I_{\text{ref}}\in\mathbb{R}^{3\times 512\times 512}" class="ltx_Math" display="inline" id="S3.SS2.p1.2.m2.1"><semantics id="S3.SS2.p1.2.m2.1a"><mrow id="S3.SS2.p1.2.m2.1.1" xref="S3.SS2.p1.2.m2.1.1.cmml"><msub id="S3.SS2.p1.2.m2.1.1.2" xref="S3.SS2.p1.2.m2.1.1.2.cmml"><mi id="S3.SS2.p1.2.m2.1.1.2.2" xref="S3.SS2.p1.2.m2.1.1.2.2.cmml">I</mi><mtext id="S3.SS2.p1.2.m2.1.1.2.3" xref="S3.SS2.p1.2.m2.1.1.2.3a.cmml">ref</mtext></msub><mo id="S3.SS2.p1.2.m2.1.1.1" xref="S3.SS2.p1.2.m2.1.1.1.cmml">∈</mo><msup id="S3.SS2.p1.2.m2.1.1.3" xref="S3.SS2.p1.2.m2.1.1.3.cmml"><mi id="S3.SS2.p1.2.m2.1.1.3.2" xref="S3.SS2.p1.2.m2.1.1.3.2.cmml">ℝ</mi><mrow id="S3.SS2.p1.2.m2.1.1.3.3" xref="S3.SS2.p1.2.m2.1.1.3.3.cmml"><mn id="S3.SS2.p1.2.m2.1.1.3.3.2" xref="S3.SS2.p1.2.m2.1.1.3.3.2.cmml">3</mn><mo id="S3.SS2.p1.2.m2.1.1.3.3.1" lspace="0.222em" rspace="0.222em" xref="S3.SS2.p1.2.m2.1.1.3.3.1.cmml">×</mo><mn id="S3.SS2.p1.2.m2.1.1.3.3.3" xref="S3.SS2.p1.2.m2.1.1.3.3.3.cmml">512</mn><mo id="S3.SS2.p1.2.m2.1.1.3.3.1a" lspace="0.222em" rspace="0.222em" xref="S3.SS2.p1.2.m2.1.1.3.3.1.cmml">×</mo><mn id="S3.SS2.p1.2.m2.1.1.3.3.4" xref="S3.SS2.p1.2.m2.1.1.3.3.4.cmml">512</mn></mrow></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.2.m2.1b"><apply id="S3.SS2.p1.2.m2.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1"><in id="S3.SS2.p1.2.m2.1.1.1.cmml" xref="S3.SS2.p1.2.m2.1.1.1"></in><apply id="S3.SS2.p1.2.m2.1.1.2.cmml" xref="S3.SS2.p1.2.m2.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p1.2.m2.1.1.2.1.cmml" xref="S3.SS2.p1.2.m2.1.1.2">subscript</csymbol><ci id="S3.SS2.p1.2.m2.1.1.2.2.cmml" xref="S3.SS2.p1.2.m2.1.1.2.2">𝐼</ci><ci id="S3.SS2.p1.2.m2.1.1.2.3a.cmml" xref="S3.SS2.p1.2.m2.1.1.2.3"><mtext id="S3.SS2.p1.2.m2.1.1.2.3.cmml" mathsize="70%" xref="S3.SS2.p1.2.m2.1.1.2.3">ref</mtext></ci></apply><apply id="S3.SS2.p1.2.m2.1.1.3.cmml" xref="S3.SS2.p1.2.m2.1.1.3"><csymbol 
cd="ambiguous" id="S3.SS2.p1.2.m2.1.1.3.1.cmml" xref="S3.SS2.p1.2.m2.1.1.3">superscript</csymbol><ci id="S3.SS2.p1.2.m2.1.1.3.2.cmml" xref="S3.SS2.p1.2.m2.1.1.3.2">ℝ</ci><apply id="S3.SS2.p1.2.m2.1.1.3.3.cmml" xref="S3.SS2.p1.2.m2.1.1.3.3"><times id="S3.SS2.p1.2.m2.1.1.3.3.1.cmml" xref="S3.SS2.p1.2.m2.1.1.3.3.1"></times><cn id="S3.SS2.p1.2.m2.1.1.3.3.2.cmml" type="integer" xref="S3.SS2.p1.2.m2.1.1.3.3.2">3</cn><cn id="S3.SS2.p1.2.m2.1.1.3.3.3.cmml" type="integer" xref="S3.SS2.p1.2.m2.1.1.3.3.3">512</cn><cn id="S3.SS2.p1.2.m2.1.1.3.3.4.cmml" type="integer" xref="S3.SS2.p1.2.m2.1.1.3.3.4">512</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.2.m2.1c">I_{\text{ref}}\in\mathbb{R}^{3\times 512\times 512}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.2.m2.1d">italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 512 × 512 end_POSTSUPERSCRIPT</annotation></semantics></math> and maps it to pixel-aligned Gaussian parameters, including scaling <math alttext="\boldsymbol{s}" class="ltx_Math" display="inline" id="S3.SS2.p1.3.m3.1"><semantics id="S3.SS2.p1.3.m3.1a"><mi id="S3.SS2.p1.3.m3.1.1" xref="S3.SS2.p1.3.m3.1.1.cmml">𝒔</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.3.m3.1b"><ci id="S3.SS2.p1.3.m3.1.1.cmml" xref="S3.SS2.p1.3.m3.1.1">𝒔</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.3.m3.1c">\boldsymbol{s}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.3.m3.1d">bold_italic_s</annotation></semantics></math>, position <math alttext="\boldsymbol{t}" class="ltx_Math" display="inline" id="S3.SS2.p1.4.m4.1"><semantics id="S3.SS2.p1.4.m4.1a"><mi id="S3.SS2.p1.4.m4.1.1" xref="S3.SS2.p1.4.m4.1.1.cmml">𝒕</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.4.m4.1b"><ci id="S3.SS2.p1.4.m4.1.1.cmml" xref="S3.SS2.p1.4.m4.1.1">𝒕</ci></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS2.p1.4.m4.1c">\boldsymbol{t}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.4.m4.1d">bold_italic_t</annotation></semantics></math>, color <math alttext="\boldsymbol{c}" class="ltx_Math" display="inline" id="S3.SS2.p1.5.m5.1"><semantics id="S3.SS2.p1.5.m5.1a"><mi id="S3.SS2.p1.5.m5.1.1" xref="S3.SS2.p1.5.m5.1.1.cmml">𝒄</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.5.m5.1b"><ci id="S3.SS2.p1.5.m5.1.1.cmml" xref="S3.SS2.p1.5.m5.1.1">𝒄</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.5.m5.1c">\boldsymbol{c}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.5.m5.1d">bold_italic_c</annotation></semantics></math>, opacity <math alttext="\boldsymbol{o}" class="ltx_Math" display="inline" id="S3.SS2.p1.6.m6.1"><semantics id="S3.SS2.p1.6.m6.1a"><mi id="S3.SS2.p1.6.m6.1.1" xref="S3.SS2.p1.6.m6.1.1.cmml">𝒐</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.6.m6.1b"><ci id="S3.SS2.p1.6.m6.1.1.cmml" xref="S3.SS2.p1.6.m6.1.1">𝒐</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.6.m6.1c">\boldsymbol{o}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.6.m6.1d">bold_italic_o</annotation></semantics></math>, and orientation <math alttext="\boldsymbol{q}" class="ltx_Math" display="inline" id="S3.SS2.p1.7.m7.1"><semantics id="S3.SS2.p1.7.m7.1a"><mi id="S3.SS2.p1.7.m7.1.1" xref="S3.SS2.p1.7.m7.1.1.cmml">𝒒</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.7.m7.1b"><ci id="S3.SS2.p1.7.m7.1.1.cmml" xref="S3.SS2.p1.7.m7.1.1">𝒒</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.7.m7.1c">\boldsymbol{q}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.7.m7.1d">bold_italic_q</annotation></semantics></math>: <table class="ltx_equation ltx_eqn_table" id="S3.E2"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td 
class="ltx_eqn_cell ltx_align_center"><math alttext="\theta_{\text{GDA}}=\left\{\boldsymbol{s}^{k},\boldsymbol{t}^{k},\boldsymbol{q% }^{k},\boldsymbol{c}^{k},\boldsymbol{o}^{k}\right\}_{k=1}^{M}=\mathcal{E}_{% \text{GDA}}(I_{\text{ref}};\Phi_{\text{GDA}})," class="ltx_Math" display="block" id="S3.E2.m1.1"><semantics id="S3.E2.m1.1a"><mrow id="S3.E2.m1.1.1.1" xref="S3.E2.m1.1.1.1.1.cmml"><mrow id="S3.E2.m1.1.1.1.1" xref="S3.E2.m1.1.1.1.1.cmml"><msub id="S3.E2.m1.1.1.1.1.9" xref="S3.E2.m1.1.1.1.1.9.cmml"><mi id="S3.E2.m1.1.1.1.1.9.2" xref="S3.E2.m1.1.1.1.1.9.2.cmml">θ</mi><mtext id="S3.E2.m1.1.1.1.1.9.3" xref="S3.E2.m1.1.1.1.1.9.3a.cmml">GDA</mtext></msub><mo id="S3.E2.m1.1.1.1.1.10" xref="S3.E2.m1.1.1.1.1.10.cmml">=</mo><msubsup id="S3.E2.m1.1.1.1.1.5" xref="S3.E2.m1.1.1.1.1.5.cmml"><mrow id="S3.E2.m1.1.1.1.1.5.5.5.5" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml"><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.6" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">{</mo><msup id="S3.E2.m1.1.1.1.1.1.1.1.1.1" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E2.m1.1.1.1.1.1.1.1.1.1.2" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.2.cmml">𝒔</mi><mi id="S3.E2.m1.1.1.1.1.1.1.1.1.1.3" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.7" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.2.2.2.2.2" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.cmml"><mi id="S3.E2.m1.1.1.1.1.2.2.2.2.2.2" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.2.cmml">𝒕</mi><mi id="S3.E2.m1.1.1.1.1.2.2.2.2.2.3" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.8" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.3.3.3.3.3" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.cmml"><mi id="S3.E2.m1.1.1.1.1.3.3.3.3.3.2" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.2.cmml">𝒒</mi><mi id="S3.E2.m1.1.1.1.1.3.3.3.3.3.3" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.9" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.4.4.4.4.4" 
xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.cmml"><mi id="S3.E2.m1.1.1.1.1.4.4.4.4.4.2" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.2.cmml">𝒄</mi><mi id="S3.E2.m1.1.1.1.1.4.4.4.4.4.3" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.10" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">,</mo><msup id="S3.E2.m1.1.1.1.1.5.5.5.5.5" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.cmml"><mi id="S3.E2.m1.1.1.1.1.5.5.5.5.5.2" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.2.cmml">𝒐</mi><mi id="S3.E2.m1.1.1.1.1.5.5.5.5.5.3" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.3.cmml">k</mi></msup><mo id="S3.E2.m1.1.1.1.1.5.5.5.5.11" xref="S3.E2.m1.1.1.1.1.5.5.5.6.cmml">}</mo></mrow><mrow id="S3.E2.m1.1.1.1.1.5.5.7" xref="S3.E2.m1.1.1.1.1.5.5.7.cmml"><mi id="S3.E2.m1.1.1.1.1.5.5.7.2" xref="S3.E2.m1.1.1.1.1.5.5.7.2.cmml">k</mi><mo id="S3.E2.m1.1.1.1.1.5.5.7.1" xref="S3.E2.m1.1.1.1.1.5.5.7.1.cmml">=</mo><mn id="S3.E2.m1.1.1.1.1.5.5.7.3" xref="S3.E2.m1.1.1.1.1.5.5.7.3.cmml">1</mn></mrow><mi id="S3.E2.m1.1.1.1.1.5.7" xref="S3.E2.m1.1.1.1.1.5.7.cmml">M</mi></msubsup><mo id="S3.E2.m1.1.1.1.1.11" xref="S3.E2.m1.1.1.1.1.11.cmml">=</mo><mrow id="S3.E2.m1.1.1.1.1.7" xref="S3.E2.m1.1.1.1.1.7.cmml"><msub id="S3.E2.m1.1.1.1.1.7.4" xref="S3.E2.m1.1.1.1.1.7.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E2.m1.1.1.1.1.7.4.2" xref="S3.E2.m1.1.1.1.1.7.4.2.cmml">ℰ</mi><mtext id="S3.E2.m1.1.1.1.1.7.4.3" xref="S3.E2.m1.1.1.1.1.7.4.3a.cmml">GDA</mtext></msub><mo id="S3.E2.m1.1.1.1.1.7.3" xref="S3.E2.m1.1.1.1.1.7.3.cmml">⁢</mo><mrow id="S3.E2.m1.1.1.1.1.7.2.2" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml"><mo id="S3.E2.m1.1.1.1.1.7.2.2.3" stretchy="false" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml">(</mo><msub id="S3.E2.m1.1.1.1.1.6.1.1.1" xref="S3.E2.m1.1.1.1.1.6.1.1.1.cmml"><mi id="S3.E2.m1.1.1.1.1.6.1.1.1.2" xref="S3.E2.m1.1.1.1.1.6.1.1.1.2.cmml">I</mi><mtext id="S3.E2.m1.1.1.1.1.6.1.1.1.3" xref="S3.E2.m1.1.1.1.1.6.1.1.1.3a.cmml">ref</mtext></msub><mo id="S3.E2.m1.1.1.1.1.7.2.2.4" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml">;</mo><msub 
id="S3.E2.m1.1.1.1.1.7.2.2.2" xref="S3.E2.m1.1.1.1.1.7.2.2.2.cmml"><mi id="S3.E2.m1.1.1.1.1.7.2.2.2.2" mathvariant="normal" xref="S3.E2.m1.1.1.1.1.7.2.2.2.2.cmml">Φ</mi><mtext id="S3.E2.m1.1.1.1.1.7.2.2.2.3" xref="S3.E2.m1.1.1.1.1.7.2.2.2.3a.cmml">GDA</mtext></msub><mo id="S3.E2.m1.1.1.1.1.7.2.2.5" stretchy="false" xref="S3.E2.m1.1.1.1.1.7.2.3.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E2.m1.1.1.1.2" xref="S3.E2.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E2.m1.1b"><apply id="S3.E2.m1.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1"><and id="S3.E2.m1.1.1.1.1a.cmml" xref="S3.E2.m1.1.1.1"></and><apply id="S3.E2.m1.1.1.1.1b.cmml" xref="S3.E2.m1.1.1.1"><eq id="S3.E2.m1.1.1.1.1.10.cmml" xref="S3.E2.m1.1.1.1.1.10"></eq><apply id="S3.E2.m1.1.1.1.1.9.cmml" xref="S3.E2.m1.1.1.1.1.9"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.9.1.cmml" xref="S3.E2.m1.1.1.1.1.9">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.9.2.cmml" xref="S3.E2.m1.1.1.1.1.9.2">𝜃</ci><ci id="S3.E2.m1.1.1.1.1.9.3a.cmml" xref="S3.E2.m1.1.1.1.1.9.3"><mtext id="S3.E2.m1.1.1.1.1.9.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.9.3">GDA</mtext></ci></apply><apply id="S3.E2.m1.1.1.1.1.5.cmml" xref="S3.E2.m1.1.1.1.1.5"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.5.6.cmml" xref="S3.E2.m1.1.1.1.1.5">superscript</csymbol><apply id="S3.E2.m1.1.1.1.1.5.5.cmml" xref="S3.E2.m1.1.1.1.1.5"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.5.5.6.cmml" xref="S3.E2.m1.1.1.1.1.5">subscript</csymbol><set id="S3.E2.m1.1.1.1.1.5.5.5.6.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5"><apply id="S3.E2.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.2">𝒔</ci><ci id="S3.E2.m1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E2.m1.1.1.1.1.1.1.1.1.1.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.2.2.2.2.2.cmml" 
xref="S3.E2.m1.1.1.1.1.2.2.2.2.2"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.2.2.2.2.2.1.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.2.2.2.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.2">𝒕</ci><ci id="S3.E2.m1.1.1.1.1.2.2.2.2.2.3.cmml" xref="S3.E2.m1.1.1.1.1.2.2.2.2.2.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.3.3.3.3.3.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.3.3.3.3.3.1.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.3.3.3.3.3.2.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.2">𝒒</ci><ci id="S3.E2.m1.1.1.1.1.3.3.3.3.3.3.cmml" xref="S3.E2.m1.1.1.1.1.3.3.3.3.3.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.4.4.4.4.4.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.4.4.4.4.4.1.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.4.4.4.4.4.2.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.2">𝒄</ci><ci id="S3.E2.m1.1.1.1.1.4.4.4.4.4.3.cmml" xref="S3.E2.m1.1.1.1.1.4.4.4.4.4.3">𝑘</ci></apply><apply id="S3.E2.m1.1.1.1.1.5.5.5.5.5.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.5.5.5.5.5.1.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5">superscript</csymbol><ci id="S3.E2.m1.1.1.1.1.5.5.5.5.5.2.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.2">𝒐</ci><ci id="S3.E2.m1.1.1.1.1.5.5.5.5.5.3.cmml" xref="S3.E2.m1.1.1.1.1.5.5.5.5.5.3">𝑘</ci></apply></set><apply id="S3.E2.m1.1.1.1.1.5.5.7.cmml" xref="S3.E2.m1.1.1.1.1.5.5.7"><eq id="S3.E2.m1.1.1.1.1.5.5.7.1.cmml" xref="S3.E2.m1.1.1.1.1.5.5.7.1"></eq><ci id="S3.E2.m1.1.1.1.1.5.5.7.2.cmml" xref="S3.E2.m1.1.1.1.1.5.5.7.2">𝑘</ci><cn id="S3.E2.m1.1.1.1.1.5.5.7.3.cmml" type="integer" xref="S3.E2.m1.1.1.1.1.5.5.7.3">1</cn></apply></apply><ci id="S3.E2.m1.1.1.1.1.5.7.cmml" xref="S3.E2.m1.1.1.1.1.5.7">𝑀</ci></apply></apply><apply id="S3.E2.m1.1.1.1.1c.cmml" xref="S3.E2.m1.1.1.1"><eq id="S3.E2.m1.1.1.1.1.11.cmml" 
xref="S3.E2.m1.1.1.1.1.11"></eq><share href="https://arxiv.org/html/2503.11978v1#S3.E2.m1.1.1.1.1.5.cmml" id="S3.E2.m1.1.1.1.1d.cmml" xref="S3.E2.m1.1.1.1"></share><apply id="S3.E2.m1.1.1.1.1.7.cmml" xref="S3.E2.m1.1.1.1.1.7"><times id="S3.E2.m1.1.1.1.1.7.3.cmml" xref="S3.E2.m1.1.1.1.1.7.3"></times><apply id="S3.E2.m1.1.1.1.1.7.4.cmml" xref="S3.E2.m1.1.1.1.1.7.4"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.7.4.1.cmml" xref="S3.E2.m1.1.1.1.1.7.4">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.7.4.2.cmml" xref="S3.E2.m1.1.1.1.1.7.4.2">ℰ</ci><ci id="S3.E2.m1.1.1.1.1.7.4.3a.cmml" xref="S3.E2.m1.1.1.1.1.7.4.3"><mtext id="S3.E2.m1.1.1.1.1.7.4.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.7.4.3">GDA</mtext></ci></apply><list id="S3.E2.m1.1.1.1.1.7.2.3.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2"><apply id="S3.E2.m1.1.1.1.1.6.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.6.1.1.1.1.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.6.1.1.1.2.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1.2">𝐼</ci><ci id="S3.E2.m1.1.1.1.1.6.1.1.1.3a.cmml" xref="S3.E2.m1.1.1.1.1.6.1.1.1.3"><mtext id="S3.E2.m1.1.1.1.1.6.1.1.1.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.6.1.1.1.3">ref</mtext></ci></apply><apply id="S3.E2.m1.1.1.1.1.7.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2"><csymbol cd="ambiguous" id="S3.E2.m1.1.1.1.1.7.2.2.2.1.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2">subscript</csymbol><ci id="S3.E2.m1.1.1.1.1.7.2.2.2.2.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2.2">Φ</ci><ci id="S3.E2.m1.1.1.1.1.7.2.2.2.3a.cmml" xref="S3.E2.m1.1.1.1.1.7.2.2.2.3"><mtext id="S3.E2.m1.1.1.1.1.7.2.2.2.3.cmml" mathsize="70%" xref="S3.E2.m1.1.1.1.1.7.2.2.2.3">GDA</mtext></ci></apply></list></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E2.m1.1c">\theta_{\text{GDA}}=\left\{\boldsymbol{s}^{k},\boldsymbol{t}^{k},\boldsymbol{q% }^{k},\boldsymbol{c}^{k},\boldsymbol{o}^{k}\right\}_{k=1}^{M}=\mathcal{E}_{% 
\text{GDA}}(I_{\text{ref}};\Phi_{\text{GDA}}),</annotation><annotation encoding="application/x-llamapun" id="S3.E2.m1.1d">italic_θ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT = { bold_italic_s start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_t start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_q start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_c start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT , bold_italic_o start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT = caligraphic_E start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ; roman_Φ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(2)</td> </tr></tbody> </table> where <math alttext="M" class="ltx_Math" display="inline" id="S3.SS2.p1.8.m1.1"><semantics id="S3.SS2.p1.8.m1.1a"><mi id="S3.SS2.p1.8.m1.1.1" xref="S3.SS2.p1.8.m1.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.8.m1.1b"><ci id="S3.SS2.p1.8.m1.1.1.cmml" xref="S3.SS2.p1.8.m1.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.8.m1.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.8.m1.1d">italic_M</annotation></semantics></math> is the number of Gaussians and <math alttext="\Phi_{\text{GDA}}" class="ltx_Math" display="inline" id="S3.SS2.p1.9.m2.1"><semantics id="S3.SS2.p1.9.m2.1a"><msub id="S3.SS2.p1.9.m2.1.1" xref="S3.SS2.p1.9.m2.1.1.cmml"><mi id="S3.SS2.p1.9.m2.1.1.2" mathvariant="normal" xref="S3.SS2.p1.9.m2.1.1.2.cmml">Φ</mi><mtext id="S3.SS2.p1.9.m2.1.1.3" xref="S3.SS2.p1.9.m2.1.1.3a.cmml">GDA</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.9.m2.1b"><apply id="S3.SS2.p1.9.m2.1.1.cmml" 
xref="S3.SS2.p1.9.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p1.9.m2.1.1.1.cmml" xref="S3.SS2.p1.9.m2.1.1">subscript</csymbol><ci id="S3.SS2.p1.9.m2.1.1.2.cmml" xref="S3.SS2.p1.9.m2.1.1.2">Φ</ci><ci id="S3.SS2.p1.9.m2.1.1.3a.cmml" xref="S3.SS2.p1.9.m2.1.1.3"><mtext id="S3.SS2.p1.9.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p1.9.m2.1.1.3">GDA</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.9.m2.1c">\Phi_{\text{GDA}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.9.m2.1d">roman_Φ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT</annotation></semantics></math> denotes the learnable parameters. The 3D Gaussians are then rendered in the frontal view, giving <math alttext="I_{\text{sty}}=\mathcal{E}^{\text{render}}_{\text{3D}}(\theta_{\text{GDA}})" class="ltx_Math" display="inline" id="S3.SS2.p1.10.m3.1"><semantics id="S3.SS2.p1.10.m3.1a"><mrow id="S3.SS2.p1.10.m3.1.1" xref="S3.SS2.p1.10.m3.1.1.cmml"><msub id="S3.SS2.p1.10.m3.1.1.3" xref="S3.SS2.p1.10.m3.1.1.3.cmml"><mi id="S3.SS2.p1.10.m3.1.1.3.2" xref="S3.SS2.p1.10.m3.1.1.3.2.cmml">I</mi><mtext id="S3.SS2.p1.10.m3.1.1.3.3" xref="S3.SS2.p1.10.m3.1.1.3.3a.cmml">sty</mtext></msub><mo id="S3.SS2.p1.10.m3.1.1.2" xref="S3.SS2.p1.10.m3.1.1.2.cmml">=</mo><mrow id="S3.SS2.p1.10.m3.1.1.1" xref="S3.SS2.p1.10.m3.1.1.1.cmml"><msubsup id="S3.SS2.p1.10.m3.1.1.1.3" xref="S3.SS2.p1.10.m3.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS2.p1.10.m3.1.1.1.3.2.2" xref="S3.SS2.p1.10.m3.1.1.1.3.2.2.cmml">ℰ</mi><mtext id="S3.SS2.p1.10.m3.1.1.1.3.3" xref="S3.SS2.p1.10.m3.1.1.1.3.3a.cmml">3D</mtext><mtext id="S3.SS2.p1.10.m3.1.1.1.3.2.3" xref="S3.SS2.p1.10.m3.1.1.1.3.2.3a.cmml">render</mtext></msubsup><mo id="S3.SS2.p1.10.m3.1.1.1.2" xref="S3.SS2.p1.10.m3.1.1.1.2.cmml">⁢</mo><mrow id="S3.SS2.p1.10.m3.1.1.1.1.1" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml"><mo id="S3.SS2.p1.10.m3.1.1.1.1.1.2" stretchy="false" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml">(</mo><msub 
id="S3.SS2.p1.10.m3.1.1.1.1.1.1" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml"><mi id="S3.SS2.p1.10.m3.1.1.1.1.1.1.2" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.2.cmml">θ</mi><mtext id="S3.SS2.p1.10.m3.1.1.1.1.1.1.3" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.3a.cmml">GDA</mtext></msub><mo id="S3.SS2.p1.10.m3.1.1.1.1.1.3" stretchy="false" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p1.10.m3.1b"><apply id="S3.SS2.p1.10.m3.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1"><eq id="S3.SS2.p1.10.m3.1.1.2.cmml" xref="S3.SS2.p1.10.m3.1.1.2"></eq><apply id="S3.SS2.p1.10.m3.1.1.3.cmml" xref="S3.SS2.p1.10.m3.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.3.1.cmml" xref="S3.SS2.p1.10.m3.1.1.3">subscript</csymbol><ci id="S3.SS2.p1.10.m3.1.1.3.2.cmml" xref="S3.SS2.p1.10.m3.1.1.3.2">𝐼</ci><ci id="S3.SS2.p1.10.m3.1.1.3.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.3.3"><mtext id="S3.SS2.p1.10.m3.1.1.3.3.cmml" mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.3.3">sty</mtext></ci></apply><apply id="S3.SS2.p1.10.m3.1.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1"><times id="S3.SS2.p1.10.m3.1.1.1.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.2"></times><apply id="S3.SS2.p1.10.m3.1.1.1.3.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.1.3.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3">subscript</csymbol><apply id="S3.SS2.p1.10.m3.1.1.1.3.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.1.3.2.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3">superscript</csymbol><ci id="S3.SS2.p1.10.m3.1.1.1.3.2.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3.2.2">ℰ</ci><ci id="S3.SS2.p1.10.m3.1.1.1.3.2.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3.2.3"><mtext id="S3.SS2.p1.10.m3.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.1.3.2.3">render</mtext></ci></apply><ci id="S3.SS2.p1.10.m3.1.1.1.3.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.1.3.3"><mtext id="S3.SS2.p1.10.m3.1.1.1.3.3.cmml" mathsize="70%" 
xref="S3.SS2.p1.10.m3.1.1.1.3.3">3D</mtext></ci></apply><apply id="S3.SS2.p1.10.m3.1.1.1.1.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p1.10.m3.1.1.1.1.1.1.1.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1">subscript</csymbol><ci id="S3.SS2.p1.10.m3.1.1.1.1.1.1.2.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.2">𝜃</ci><ci id="S3.SS2.p1.10.m3.1.1.1.1.1.1.3a.cmml" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.3"><mtext id="S3.SS2.p1.10.m3.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p1.10.m3.1.1.1.1.1.1.3">GDA</mtext></ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p1.10.m3.1c">I_{\text{sty}}=\mathcal{E}^{\text{render}}_{\text{3D}}(\theta_{\text{GDA}})</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p1.10.m3.1d">italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT = caligraphic_E start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 3D end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT )</annotation></semantics></math>. This process transforms real face photos into the primary avatar domain while preserving identity-related features, enabling seamless 3D-aware avatar generation and animation. </div> <div class="ltx_para ltx_noindent" id="S3.SS2.p2"> Avatar Dual-Stylization. While GDA efficiently generates avatars in a single primary style, our dual-stylization approach allows for additional customization. 
We employ a diffusion-based pipeline using SDEdit <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib34" title="">34</a>]</cite> to refine the GDA output image <math alttext="I_{\text{sty}}" class="ltx_Math" display="inline" id="S3.SS2.p2.1.m1.1"><semantics id="S3.SS2.p2.1.m1.1a"><msub id="S3.SS2.p2.1.m1.1.1" xref="S3.SS2.p2.1.m1.1.1.cmml"><mi id="S3.SS2.p2.1.m1.1.1.2" xref="S3.SS2.p2.1.m1.1.1.2.cmml">I</mi><mtext id="S3.SS2.p2.1.m1.1.1.3" xref="S3.SS2.p2.1.m1.1.1.3a.cmml">sty</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.1.m1.1b"><apply id="S3.SS2.p2.1.m1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.1.m1.1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1">subscript</csymbol><ci id="S3.SS2.p2.1.m1.1.1.2.cmml" xref="S3.SS2.p2.1.m1.1.1.2">𝐼</ci><ci id="S3.SS2.p2.1.m1.1.1.3a.cmml" xref="S3.SS2.p2.1.m1.1.1.3"><mtext id="S3.SS2.p2.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.1.m1.1.1.3">sty</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.1.m1.1c">I_{\text{sty}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.1.m1.1d">italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT</annotation></semantics></math>: a small amount of noise is added, and denoising guided by text prompts then introduces features such as art style and accessories. To preserve the avatar’s primary style, we use ControlNet <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib63" title="">63</a>]</cite> conditioned on Canny edges, which maintains geometric integrity. Additionally, IP-Adapter <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib60" title="">60</a>]</cite> is integrated with facial similarity embeddings to ensure resemblance to the original user. 
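As a rough illustration of the SDEdit-style refinement described above, the following sketch perturbs an image to an intermediate diffusion timestep, which is the point from which text-guided denoising would start. It is a minimal sketch under stated assumptions, not the authors' implementation: the linear noise schedule, the 1000-step horizon, the strength value of 0.3, and the names linear_betas, alpha_bar, and sdedit_noise are all hypothetical choices made for illustration, and the pretrained denoiser, ControlNet, and IP-Adapter are not modeled.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, DDPM-style choice)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def sdedit_noise(pixels, strength, betas, rng):
    """Perturb an image to the intermediate timestep t0 = strength * T.

    Returns (noisy_pixels, t0). A pretrained denoiser (not shown) would then
    run the reverse process from t0 down to 0 under text guidance.
    """
    strength = min(max(strength, 0.0), 1.0)  # clamp to [0, 1]
    t0 = int(round(strength * len(betas)))
    a_bar = alpha_bar(betas, t0)
    signal, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    noisy = [signal * p + sigma * rng.gauss(0.0, 1.0) for p in pixels]
    return noisy, t0


# Hypothetical stand-in for the rendered avatar I_sty: a tiny flat "image".
i_sty = [0.5] * 16
betas = linear_betas(1000)
noisy, t0 = sdedit_noise(i_sty, strength=0.3, betas=betas, rng=random.Random(0))
kept = math.sqrt(alpha_bar(betas, t0))  # scale applied to the original signal
print(t0, round(kept, 3))
```

A smaller strength keeps more of the original avatar, since less noise is injected before denoising, while a larger strength gives the text prompt more freedom to change the image. 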
The cross-attention outputs for each layer, <math alttext="f_{\text{out}}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.2.m2.1"><semantics id="S3.SS2.p2.2.m2.1a"><msubsup id="S3.SS2.p2.2.m2.1.1" xref="S3.SS2.p2.2.m2.1.1.cmml"><mi id="S3.SS2.p2.2.m2.1.1.2.2" xref="S3.SS2.p2.2.m2.1.1.2.2.cmml">f</mi><mtext id="S3.SS2.p2.2.m2.1.1.2.3" xref="S3.SS2.p2.2.m2.1.1.2.3a.cmml">out</mtext><mi id="S3.SS2.p2.2.m2.1.1.3" xref="S3.SS2.p2.2.m2.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.2.m2.1b"><apply id="S3.SS2.p2.2.m2.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1">superscript</csymbol><apply id="S3.SS2.p2.2.m2.1.1.2.cmml" xref="S3.SS2.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.2.m2.1.1.2.1.cmml" xref="S3.SS2.p2.2.m2.1.1">subscript</csymbol><ci id="S3.SS2.p2.2.m2.1.1.2.2.cmml" xref="S3.SS2.p2.2.m2.1.1.2.2">𝑓</ci><ci id="S3.SS2.p2.2.m2.1.1.2.3a.cmml" xref="S3.SS2.p2.2.m2.1.1.2.3"><mtext id="S3.SS2.p2.2.m2.1.1.2.3.cmml" mathsize="70%" xref="S3.SS2.p2.2.m2.1.1.2.3">out</mtext></ci></apply><ci id="S3.SS2.p2.2.m2.1.1.3.cmml" xref="S3.SS2.p2.2.m2.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.2.m2.1c">f_{\text{out}}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.2.m2.1d">italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>, combine features from the reference image, text prompts, and the primary-stylized avatar: <table class="ltx_equation ltx_eqn_table" id="S3.E3"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\begin{split}f_{\text{out}}^{d}&=\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q% }^{d})(f_{\text{ref}}W_{K}^{d})^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{ref}}W_% {V}^{d})\\ 
&+\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q}^{d})(f_{\text{txt}}W_{K}^{d})% ^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{txt}}W_{V}^{d}),\end{split}" class="ltx_Math" display="block" id="S3.E3.m1.31"><semantics id="S3.E3.m1.31a"><mtable columnspacing="0pt" displaystyle="true" id="S3.E3.m1.31.31.3" rowspacing="0pt"><mtr id="S3.E3.m1.31.31.3a"><mtd class="ltx_align_right" columnalign="right" id="S3.E3.m1.31.31.3b"><msubsup id="S3.E3.m1.3.3.3.3.3"><mi id="S3.E3.m1.1.1.1.1.1.1" xref="S3.E3.m1.1.1.1.1.1.1.cmml">f</mi><mtext id="S3.E3.m1.2.2.2.2.2.2.1" xref="S3.E3.m1.2.2.2.2.2.2.1a.cmml">out</mtext><mi id="S3.E3.m1.3.3.3.3.3.3.1" xref="S3.E3.m1.3.3.3.3.3.3.1.cmml">d</mi></msubsup></mtd><mtd class="ltx_align_left" columnalign="left" id="S3.E3.m1.31.31.3c"><mrow id="S3.E3.m1.30.30.2.29.16.13"><mi id="S3.E3.m1.30.30.2.29.16.13.14" xref="S3.E3.m1.29.29.1.1.1.cmml"></mi><mo id="S3.E3.m1.4.4.4.4.1.1" xref="S3.E3.m1.4.4.4.4.1.1.cmml">=</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13"><mtext id="S3.E3.m1.5.5.5.5.2.2" xref="S3.E3.m1.5.5.5.5.2.2a.cmml">softmax</mtext><mo id="S3.E3.m1.30.30.2.29.16.13.13.2" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13.3"><mo id="S3.E3.m1.6.6.6.6.3.3" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mfrac id="S3.E3.m1.7.7.7.7.4.4" xref="S3.E3.m1.7.7.7.7.4.4.cmml"><mrow id="S3.E3.m1.7.7.7.7.4.4.2" xref="S3.E3.m1.7.7.7.7.4.4.2.cmml"><mrow id="S3.E3.m1.7.7.7.7.4.4.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml"><mo id="S3.E3.m1.7.7.7.7.4.4.1.1.1.2" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml"><msub id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3a.cmml">sty</mtext></msub><mo id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1" 
xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3.cmml">Q</mi><mi id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.7.7.7.7.4.4.1.1.1.3" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.7.7.7.7.4.4.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.3.cmml">⁢</mo><msup id="S3.E3.m1.7.7.7.7.4.4.2.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.cmml"><mrow id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml"><mo id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.2" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml"><msub id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3a.cmml">ref</mtext></msub><mo id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3.cmml">K</mi><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.3" stretchy="false" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml">)</mo></mrow><mi id="S3.E3.m1.7.7.7.7.4.4.2.2.3" xref="S3.E3.m1.7.7.7.7.4.4.2.2.3.cmml">T</mi></msup></mrow><msqrt 
id="S3.E3.m1.7.7.7.7.4.4.4" xref="S3.E3.m1.7.7.7.7.4.4.4.cmml"><msubsup id="S3.E3.m1.7.7.7.7.4.4.4.2" xref="S3.E3.m1.7.7.7.7.4.4.4.2.cmml"><mi id="S3.E3.m1.7.7.7.7.4.4.4.2.2.2" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.2.cmml">d</mi><mi id="S3.E3.m1.7.7.7.7.4.4.4.2.2.3" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.3.cmml">k</mi><mi id="S3.E3.m1.7.7.7.7.4.4.4.2.3" xref="S3.E3.m1.7.7.7.7.4.4.4.2.3.cmml">d</mi></msubsup></msqrt></mfrac><mo id="S3.E3.m1.8.8.8.8.5.5" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.30.30.2.29.16.13.13.2a" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13.1.1"><mo id="S3.E3.m1.9.9.9.9.6.6" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1"><msub id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1.2"><mi id="S3.E3.m1.10.10.10.10.7.7" xref="S3.E3.m1.10.10.10.10.7.7.cmml">f</mi><mtext id="S3.E3.m1.11.11.11.11.8.8.1" xref="S3.E3.m1.11.11.11.11.8.8.1a.cmml">ref</mtext></msub><mo id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1.1" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.30.30.2.29.16.13.13.1.1.1.3"><mi id="S3.E3.m1.12.12.12.12.9.9" xref="S3.E3.m1.12.12.12.12.9.9.cmml">W</mi><mi id="S3.E3.m1.13.13.13.13.10.10.1" xref="S3.E3.m1.13.13.13.13.10.10.1.cmml">V</mi><mi id="S3.E3.m1.14.14.14.14.11.11.1" xref="S3.E3.m1.14.14.14.14.11.11.1.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.15.15.15.15.12.12" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow></mrow></mrow></mtd></mtr><mtr id="S3.E3.m1.31.31.3d"><mtd id="S3.E3.m1.31.31.3e" xref="S3.E3.m1.29.29.1.1.1.cmml"></mtd><mtd class="ltx_align_left" columnalign="left" id="S3.E3.m1.31.31.3f"><mrow id="S3.E3.m1.31.31.3.30.14.14.14"><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1"><mo id="S3.E3.m1.31.31.3.30.14.14.14.1a" xref="S3.E3.m1.29.29.1.1.1.cmml">+</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1"><mtext id="S3.E3.m1.17.17.17.2.2.2" xref="S3.E3.m1.17.17.17.2.2.2a.cmml">softmax</mtext><mo id="S3.E3.m1.31.31.3.30.14.14.14.1.1.2" 
xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1.3"><mo id="S3.E3.m1.18.18.18.3.3.3" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mfrac id="S3.E3.m1.19.19.19.4.4.4" xref="S3.E3.m1.19.19.19.4.4.4.cmml"><mrow id="S3.E3.m1.19.19.19.4.4.4.2" xref="S3.E3.m1.19.19.19.4.4.4.2.cmml"><mrow id="S3.E3.m1.19.19.19.4.4.4.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml"><mo id="S3.E3.m1.19.19.19.4.4.4.1.1.1.2" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml"><msub id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3a.cmml">sty</mtext></msub><mo id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3.cmml">Q</mi><mi id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.19.19.19.4.4.4.1.1.1.3" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.19.19.19.4.4.4.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.3.cmml">⁢</mo><msup id="S3.E3.m1.19.19.19.4.4.4.2.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.cmml"><mrow id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml"><mo id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.2" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml"><msub 
id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2.cmml">f</mi><mtext id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3a.cmml">txt</mtext></msub><mo id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3.cmml">K</mi><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.3" stretchy="false" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml">)</mo></mrow><mi id="S3.E3.m1.19.19.19.4.4.4.2.2.3" xref="S3.E3.m1.19.19.19.4.4.4.2.2.3.cmml">T</mi></msup></mrow><msqrt id="S3.E3.m1.19.19.19.4.4.4.4" xref="S3.E3.m1.19.19.19.4.4.4.4.cmml"><msubsup id="S3.E3.m1.19.19.19.4.4.4.4.2" xref="S3.E3.m1.19.19.19.4.4.4.4.2.cmml"><mi id="S3.E3.m1.19.19.19.4.4.4.4.2.2.2" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.2.cmml">d</mi><mi id="S3.E3.m1.19.19.19.4.4.4.4.2.2.3" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.3.cmml">k</mi><mi id="S3.E3.m1.19.19.19.4.4.4.4.2.3" xref="S3.E3.m1.19.19.19.4.4.4.4.2.3.cmml">d</mi></msubsup></msqrt></mfrac><mo id="S3.E3.m1.20.20.20.5.5.5" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow><mo id="S3.E3.m1.31.31.3.30.14.14.14.1.1.2a" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1"><mo id="S3.E3.m1.21.21.21.6.6.6" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">(</mo><mrow id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1"><msub id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1.2"><mi id="S3.E3.m1.22.22.22.7.7.7" xref="S3.E3.m1.22.22.22.7.7.7.cmml">f</mi><mtext 
id="S3.E3.m1.23.23.23.8.8.8.1" xref="S3.E3.m1.23.23.23.8.8.8.1a.cmml">txt</mtext></msub><mo id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1.1" xref="S3.E3.m1.29.29.1.1.1.cmml">⁢</mo><msubsup id="S3.E3.m1.31.31.3.30.14.14.14.1.1.1.1.1.3"><mi id="S3.E3.m1.24.24.24.9.9.9" xref="S3.E3.m1.24.24.24.9.9.9.cmml">W</mi><mi id="S3.E3.m1.25.25.25.10.10.10.1" xref="S3.E3.m1.25.25.25.10.10.10.1.cmml">V</mi><mi id="S3.E3.m1.26.26.26.11.11.11.1" xref="S3.E3.m1.26.26.26.11.11.11.1.cmml">d</mi></msubsup></mrow><mo id="S3.E3.m1.27.27.27.12.12.12" stretchy="false" xref="S3.E3.m1.29.29.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E3.m1.28.28.28.13.13.13" xref="S3.E3.m1.29.29.1.1.1.cmml">,</mo></mrow></mtd></mtr></mtable><annotation-xml encoding="MathML-Content" id="S3.E3.m1.31b"><apply id="S3.E3.m1.29.29.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><eq id="S3.E3.m1.4.4.4.4.1.1.cmml" xref="S3.E3.m1.4.4.4.4.1.1"></eq><apply id="S3.E3.m1.29.29.1.1.1.4.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.4.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">superscript</csymbol><apply id="S3.E3.m1.29.29.1.1.1.4.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.4.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.1.1.1.1.1.1">𝑓</ci><ci id="S3.E3.m1.2.2.2.2.2.2.1a.cmml" xref="S3.E3.m1.2.2.2.2.2.2.1"><mtext id="S3.E3.m1.2.2.2.2.2.2.1.cmml" mathsize="70%" xref="S3.E3.m1.2.2.2.2.2.2.1">out</mtext></ci></apply><ci id="S3.E3.m1.3.3.3.3.3.3.1.cmml" xref="S3.E3.m1.3.3.3.3.3.3.1">𝑑</ci></apply><apply id="S3.E3.m1.29.29.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><plus id="S3.E3.m1.16.16.16.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></plus><apply id="S3.E3.m1.29.29.1.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><ci id="S3.E3.m1.5.5.5.5.2.2a.cmml" 
xref="S3.E3.m1.5.5.5.5.2.2"><mtext id="S3.E3.m1.5.5.5.5.2.2.cmml" xref="S3.E3.m1.5.5.5.5.2.2">softmax</mtext></ci><apply id="S3.E3.m1.7.7.7.7.4.4.cmml" xref="S3.E3.m1.7.7.7.7.4.4"><divide id="S3.E3.m1.7.7.7.7.4.4.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4"></divide><apply id="S3.E3.m1.7.7.7.7.4.4.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2"><times id="S3.E3.m1.7.7.7.7.4.4.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.3"></times><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1"><times id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.1"></times><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3a.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3"><mtext id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.2.3">sty</mtext></ci></apply><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.1.1.1.1.3.3">𝑑</ci></apply></apply><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.2.cmml" 
xref="S3.E3.m1.7.7.7.7.4.4.2.2">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1"><times id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.1"></times><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3a.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3"><mtext id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.2.3">ref</mtext></ci></apply><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.2.3">𝐾</ci></apply><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.1.1.1.3.3">𝑑</ci></apply></apply><ci id="S3.E3.m1.7.7.7.7.4.4.2.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.2.2.3">𝑇</ci></apply></apply><apply id="S3.E3.m1.7.7.7.7.4.4.4.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4"><root id="S3.E3.m1.7.7.7.7.4.4.4a.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4"></root><apply id="S3.E3.m1.7.7.7.7.4.4.4.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.4.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2">superscript</csymbol><apply 
id="S3.E3.m1.7.7.7.7.4.4.4.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.7.7.7.7.4.4.4.2.2.1.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2">subscript</csymbol><ci id="S3.E3.m1.7.7.7.7.4.4.4.2.2.2.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.2">𝑑</ci><ci id="S3.E3.m1.7.7.7.7.4.4.4.2.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2.2.3">𝑘</ci></apply><ci id="S3.E3.m1.7.7.7.7.4.4.4.2.3.cmml" xref="S3.E3.m1.7.7.7.7.4.4.4.2.3">𝑑</ci></apply></apply></apply><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.10.10.10.10.7.7.cmml" xref="S3.E3.m1.10.10.10.10.7.7">𝑓</ci><ci id="S3.E3.m1.11.11.11.11.8.8.1a.cmml" xref="S3.E3.m1.11.11.11.11.8.8.1"><mtext id="S3.E3.m1.11.11.11.11.8.8.1.cmml" mathsize="70%" xref="S3.E3.m1.11.11.11.11.8.8.1">ref</mtext></ci></apply><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">superscript</csymbol><apply id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.12.12.12.12.9.9.cmml" xref="S3.E3.m1.12.12.12.12.9.9">𝑊</ci><ci id="S3.E3.m1.13.13.13.13.10.10.1.cmml" xref="S3.E3.m1.13.13.13.13.10.10.1">𝑉</ci></apply><ci id="S3.E3.m1.14.14.14.14.11.11.1.cmml" xref="S3.E3.m1.14.14.14.14.11.11.1">𝑑</ci></apply></apply></apply><apply id="S3.E3.m1.29.29.1.1.1.2.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.2.2.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><ci 
id="S3.E3.m1.17.17.17.2.2.2a.cmml" xref="S3.E3.m1.17.17.17.2.2.2"><mtext id="S3.E3.m1.17.17.17.2.2.2.cmml" xref="S3.E3.m1.17.17.17.2.2.2">softmax</mtext></ci><apply id="S3.E3.m1.19.19.19.4.4.4.cmml" xref="S3.E3.m1.19.19.19.4.4.4"><divide id="S3.E3.m1.19.19.19.4.4.4.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4"></divide><apply id="S3.E3.m1.19.19.19.4.4.4.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2"><times id="S3.E3.m1.19.19.19.4.4.4.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.3"></times><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1"><times id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.1"></times><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3a.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3"><mtext id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.2.3">sty</mtext></ci></apply><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.1.1.1.1.3.3">𝑑</ci></apply></apply><apply 
id="S3.E3.m1.19.19.19.4.4.4.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1"><times id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.1"></times><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.2">𝑓</ci><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3a.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3"><mtext id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.2.3">txt</mtext></ci></apply><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.2">𝑊</ci><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.2.3">𝐾</ci></apply><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.1.1.1.3.3">𝑑</ci></apply></apply><ci id="S3.E3.m1.19.19.19.4.4.4.2.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.2.2.3">𝑇</ci></apply></apply><apply id="S3.E3.m1.19.19.19.4.4.4.4.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4"><root id="S3.E3.m1.19.19.19.4.4.4.4a.cmml" 
xref="S3.E3.m1.19.19.19.4.4.4.4"></root><apply id="S3.E3.m1.19.19.19.4.4.4.4.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.4.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2">superscript</csymbol><apply id="S3.E3.m1.19.19.19.4.4.4.4.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2"><csymbol cd="ambiguous" id="S3.E3.m1.19.19.19.4.4.4.4.2.2.1.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2">subscript</csymbol><ci id="S3.E3.m1.19.19.19.4.4.4.4.2.2.2.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.2">𝑑</ci><ci id="S3.E3.m1.19.19.19.4.4.4.4.2.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2.2.3">𝑘</ci></apply><ci id="S3.E3.m1.19.19.19.4.4.4.4.2.3.cmml" xref="S3.E3.m1.19.19.19.4.4.4.4.2.3">𝑑</ci></apply></apply></apply><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><times id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"></times><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.22.22.22.7.7.7.cmml" xref="S3.E3.m1.22.22.22.7.7.7">𝑓</ci><ci id="S3.E3.m1.23.23.23.8.8.8.1a.cmml" xref="S3.E3.m1.23.23.23.8.8.8.1"><mtext id="S3.E3.m1.23.23.23.8.8.8.1.cmml" mathsize="70%" xref="S3.E3.m1.23.23.23.8.8.8.1">txt</mtext></ci></apply><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">superscript</csymbol><apply id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.2.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14"><csymbol cd="ambiguous" id="S3.E3.m1.29.29.1.1.1.2.2.1.1.1.3.2.1.cmml" xref="S3.E3.m1.30.30.2.29.16.13.14">subscript</csymbol><ci id="S3.E3.m1.24.24.24.9.9.9.cmml" xref="S3.E3.m1.24.24.24.9.9.9">𝑊</ci><ci id="S3.E3.m1.25.25.25.10.10.10.1.cmml" xref="S3.E3.m1.25.25.25.10.10.10.1">𝑉</ci></apply><ci 
id="S3.E3.m1.26.26.26.11.11.11.1.cmml" xref="S3.E3.m1.26.26.26.11.11.11.1">𝑑</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E3.m1.31c">\begin{split}f_{\text{out}}^{d}&=\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q% }^{d})(f_{\text{ref}}W_{K}^{d})^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{ref}}W_% {V}^{d})\\ &+\text{softmax}\left(\frac{(f_{\text{sty}}W_{Q}^{d})(f_{\text{txt}}W_{K}^{d})% ^{T}}{\sqrt{d_{k}^{d}}}\right)(f_{\text{txt}}W_{V}^{d}),\end{split}</annotation><annotation encoding="application/x-llamapun" id="S3.E3.m1.31d">start_ROW start_CELL italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_CELL start_CELL = softmax ( divide start_ARG ( italic_f start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) ( italic_f start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG end_ARG ) ( italic_f start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + softmax ( divide start_ARG ( italic_f start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) ( italic_f start_POSTSUBSCRIPT txt end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT 
italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG end_ARG ) ( italic_f start_POSTSUBSCRIPT txt end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) , end_CELL end_ROW</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(3)</td> </tr></tbody> </table> where <math alttext="f_{\text{ref}}" class="ltx_Math" display="inline" id="S3.SS2.p2.3.m1.1"><semantics id="S3.SS2.p2.3.m1.1a"><msub id="S3.SS2.p2.3.m1.1.1" xref="S3.SS2.p2.3.m1.1.1.cmml"><mi id="S3.SS2.p2.3.m1.1.1.2" xref="S3.SS2.p2.3.m1.1.1.2.cmml">f</mi><mtext id="S3.SS2.p2.3.m1.1.1.3" xref="S3.SS2.p2.3.m1.1.1.3a.cmml">ref</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.3.m1.1b"><apply id="S3.SS2.p2.3.m1.1.1.cmml" xref="S3.SS2.p2.3.m1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m1.1.1.1.cmml" xref="S3.SS2.p2.3.m1.1.1">subscript</csymbol><ci id="S3.SS2.p2.3.m1.1.1.2.cmml" xref="S3.SS2.p2.3.m1.1.1.2">𝑓</ci><ci id="S3.SS2.p2.3.m1.1.1.3a.cmml" xref="S3.SS2.p2.3.m1.1.1.3"><mtext id="S3.SS2.p2.3.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.3.m1.1.1.3">ref</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.3.m1.1c">f_{\text{ref}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.3.m1.1d">italic_f start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="f_{\text{txt}}" class="ltx_Math" display="inline" id="S3.SS2.p2.4.m2.1"><semantics id="S3.SS2.p2.4.m2.1a"><msub id="S3.SS2.p2.4.m2.1.1" xref="S3.SS2.p2.4.m2.1.1.cmml"><mi id="S3.SS2.p2.4.m2.1.1.2" xref="S3.SS2.p2.4.m2.1.1.2.cmml">f</mi><mtext id="S3.SS2.p2.4.m2.1.1.3" xref="S3.SS2.p2.4.m2.1.1.3a.cmml">txt</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.4.m2.1b"><apply id="S3.SS2.p2.4.m2.1.1.cmml" 
xref="S3.SS2.p2.4.m2.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.4.m2.1.1.1.cmml" xref="S3.SS2.p2.4.m2.1.1">subscript</csymbol><ci id="S3.SS2.p2.4.m2.1.1.2.cmml" xref="S3.SS2.p2.4.m2.1.1.2">𝑓</ci><ci id="S3.SS2.p2.4.m2.1.1.3a.cmml" xref="S3.SS2.p2.4.m2.1.1.3"><mtext id="S3.SS2.p2.4.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.4.m2.1.1.3">txt</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.4.m2.1c">f_{\text{txt}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.4.m2.1d">italic_f start_POSTSUBSCRIPT txt end_POSTSUBSCRIPT</annotation></semantics></math>, and <math alttext="f_{\text{sty}}" class="ltx_Math" display="inline" id="S3.SS2.p2.5.m3.1"><semantics id="S3.SS2.p2.5.m3.1a"><msub id="S3.SS2.p2.5.m3.1.1" xref="S3.SS2.p2.5.m3.1.1.cmml"><mi id="S3.SS2.p2.5.m3.1.1.2" xref="S3.SS2.p2.5.m3.1.1.2.cmml">f</mi><mtext id="S3.SS2.p2.5.m3.1.1.3" xref="S3.SS2.p2.5.m3.1.1.3a.cmml">sty</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.5.m3.1b"><apply id="S3.SS2.p2.5.m3.1.1.cmml" xref="S3.SS2.p2.5.m3.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.5.m3.1.1.1.cmml" xref="S3.SS2.p2.5.m3.1.1">subscript</csymbol><ci id="S3.SS2.p2.5.m3.1.1.2.cmml" xref="S3.SS2.p2.5.m3.1.1.2">𝑓</ci><ci id="S3.SS2.p2.5.m3.1.1.3a.cmml" xref="S3.SS2.p2.5.m3.1.1.3"><mtext id="S3.SS2.p2.5.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS2.p2.5.m3.1.1.3">sty</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.5.m3.1c">f_{\text{sty}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.5.m3.1d">italic_f start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT</annotation></semantics></math> correspond to the features of the reference image, text prompts, and the unstylized avatar, respectively. 
<math alttext="W_{Q}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.6.m4.1"><semantics id="S3.SS2.p2.6.m4.1a"><msubsup id="S3.SS2.p2.6.m4.1.1" xref="S3.SS2.p2.6.m4.1.1.cmml"><mi id="S3.SS2.p2.6.m4.1.1.2.2" xref="S3.SS2.p2.6.m4.1.1.2.2.cmml">W</mi><mi id="S3.SS2.p2.6.m4.1.1.2.3" xref="S3.SS2.p2.6.m4.1.1.2.3.cmml">Q</mi><mi id="S3.SS2.p2.6.m4.1.1.3" xref="S3.SS2.p2.6.m4.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.6.m4.1b"><apply id="S3.SS2.p2.6.m4.1.1.cmml" xref="S3.SS2.p2.6.m4.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.6.m4.1.1.1.cmml" xref="S3.SS2.p2.6.m4.1.1">superscript</csymbol><apply id="S3.SS2.p2.6.m4.1.1.2.cmml" xref="S3.SS2.p2.6.m4.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.6.m4.1.1.2.1.cmml" xref="S3.SS2.p2.6.m4.1.1">subscript</csymbol><ci id="S3.SS2.p2.6.m4.1.1.2.2.cmml" xref="S3.SS2.p2.6.m4.1.1.2.2">𝑊</ci><ci id="S3.SS2.p2.6.m4.1.1.2.3.cmml" xref="S3.SS2.p2.6.m4.1.1.2.3">𝑄</ci></apply><ci id="S3.SS2.p2.6.m4.1.1.3.cmml" xref="S3.SS2.p2.6.m4.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.6.m4.1c">W_{Q}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.6.m4.1d">italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>, <math alttext="W_{K}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.7.m5.1"><semantics id="S3.SS2.p2.7.m5.1a"><msubsup id="S3.SS2.p2.7.m5.1.1" xref="S3.SS2.p2.7.m5.1.1.cmml"><mi id="S3.SS2.p2.7.m5.1.1.2.2" xref="S3.SS2.p2.7.m5.1.1.2.2.cmml">W</mi><mi id="S3.SS2.p2.7.m5.1.1.2.3" xref="S3.SS2.p2.7.m5.1.1.2.3.cmml">K</mi><mi id="S3.SS2.p2.7.m5.1.1.3" xref="S3.SS2.p2.7.m5.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.7.m5.1b"><apply id="S3.SS2.p2.7.m5.1.1.cmml" xref="S3.SS2.p2.7.m5.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.7.m5.1.1.1.cmml" xref="S3.SS2.p2.7.m5.1.1">superscript</csymbol><apply 
id="S3.SS2.p2.7.m5.1.1.2.cmml" xref="S3.SS2.p2.7.m5.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.7.m5.1.1.2.1.cmml" xref="S3.SS2.p2.7.m5.1.1">subscript</csymbol><ci id="S3.SS2.p2.7.m5.1.1.2.2.cmml" xref="S3.SS2.p2.7.m5.1.1.2.2">𝑊</ci><ci id="S3.SS2.p2.7.m5.1.1.2.3.cmml" xref="S3.SS2.p2.7.m5.1.1.2.3">𝐾</ci></apply><ci id="S3.SS2.p2.7.m5.1.1.3.cmml" xref="S3.SS2.p2.7.m5.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.7.m5.1c">W_{K}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.7.m5.1d">italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT</annotation></semantics></math>, and <math alttext="W_{V}^{d}" class="ltx_Math" display="inline" id="S3.SS2.p2.8.m6.1"><semantics id="S3.SS2.p2.8.m6.1a"><msubsup id="S3.SS2.p2.8.m6.1.1" xref="S3.SS2.p2.8.m6.1.1.cmml"><mi id="S3.SS2.p2.8.m6.1.1.2.2" xref="S3.SS2.p2.8.m6.1.1.2.2.cmml">W</mi><mi id="S3.SS2.p2.8.m6.1.1.2.3" xref="S3.SS2.p2.8.m6.1.1.2.3.cmml">V</mi><mi id="S3.SS2.p2.8.m6.1.1.3" xref="S3.SS2.p2.8.m6.1.1.3.cmml">d</mi></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.8.m6.1b"><apply id="S3.SS2.p2.8.m6.1.1.cmml" xref="S3.SS2.p2.8.m6.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.8.m6.1.1.1.cmml" xref="S3.SS2.p2.8.m6.1.1">superscript</csymbol><apply id="S3.SS2.p2.8.m6.1.1.2.cmml" xref="S3.SS2.p2.8.m6.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.8.m6.1.1.2.1.cmml" xref="S3.SS2.p2.8.m6.1.1">subscript</csymbol><ci id="S3.SS2.p2.8.m6.1.1.2.2.cmml" xref="S3.SS2.p2.8.m6.1.1.2.2">𝑊</ci><ci id="S3.SS2.p2.8.m6.1.1.2.3.cmml" xref="S3.SS2.p2.8.m6.1.1.2.3">𝑉</ci></apply><ci id="S3.SS2.p2.8.m6.1.1.3.cmml" xref="S3.SS2.p2.8.m6.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.8.m6.1c">W_{V}^{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.8.m6.1d">italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d 
end_POSTSUPERSCRIPT</annotation></semantics></math> are the weight matrices. In each cross-attention term of Eq. (3), the attention logits are divided by <math alttext="\sqrt{d_{k}^{d}}" class="ltx_Math" display="inline" id="S3.SS2.p2.9.m7.1"><semantics id="S3.SS2.p2.9.m7.1a"><msqrt id="S3.SS2.p2.9.m7.1.1" xref="S3.SS2.p2.9.m7.1.1.cmml"><msubsup id="S3.SS2.p2.9.m7.1.1.2" xref="S3.SS2.p2.9.m7.1.1.2.cmml"><mi id="S3.SS2.p2.9.m7.1.1.2.2.2" xref="S3.SS2.p2.9.m7.1.1.2.2.2.cmml">d</mi><mi id="S3.SS2.p2.9.m7.1.1.2.2.3" xref="S3.SS2.p2.9.m7.1.1.2.2.3.cmml">k</mi><mi id="S3.SS2.p2.9.m7.1.1.2.3" xref="S3.SS2.p2.9.m7.1.1.2.3.cmml">d</mi></msubsup></msqrt><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.9.m7.1b"><apply id="S3.SS2.p2.9.m7.1.1.cmml" xref="S3.SS2.p2.9.m7.1.1"><root id="S3.SS2.p2.9.m7.1.1a.cmml" xref="S3.SS2.p2.9.m7.1.1"></root><apply id="S3.SS2.p2.9.m7.1.1.2.cmml" xref="S3.SS2.p2.9.m7.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p2.9.m7.1.1.2.1.cmml" xref="S3.SS2.p2.9.m7.1.1.2">superscript</csymbol><apply id="S3.SS2.p2.9.m7.1.1.2.2.cmml" xref="S3.SS2.p2.9.m7.1.1.2"><csymbol cd="ambiguous" id="S3.SS2.p2.9.m7.1.1.2.2.1.cmml" xref="S3.SS2.p2.9.m7.1.1.2">subscript</csymbol><ci id="S3.SS2.p2.9.m7.1.1.2.2.2.cmml" xref="S3.SS2.p2.9.m7.1.1.2.2.2">𝑑</ci><ci id="S3.SS2.p2.9.m7.1.1.2.2.3.cmml" xref="S3.SS2.p2.9.m7.1.1.2.2.3">𝑘</ci></apply><ci id="S3.SS2.p2.9.m7.1.1.2.3.cmml" xref="S3.SS2.p2.9.m7.1.1.2.3">𝑑</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.9.m7.1c">\sqrt{d_{k}^{d}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.9.m7.1d">square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG</annotation></semantics></math>, the square root of the key dimension, which keeps the softmax well-scaled for stable gradient flow. 
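As an illustration of Eq. (3), the following is a minimal single-head sketch in plain Python, not the authors' implementation. The helper names (matmul, softmax_rows, attend, dual_cross_attention) are hypothetical, matrices are lists of rows, and the same projection matrices are shared by the image and text branches exactly as the equation is written, which assumes the reference and text features have the same width.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attend(f_q, f_kv, w_q, w_k, w_v):
    """One term of Eq. (3): softmax((f_q W_Q)(f_kv W_K)^T / sqrt(d_k)) (f_kv W_V)."""
    q, k, v = matmul(f_q, w_q), matmul(f_kv, w_k), matmul(f_kv, w_v)
    d_k = len(k[0])  # key dimension used for the logit scaling
    k_t = [list(col) for col in zip(*k)]
    logits = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(logits), v)

def dual_cross_attention(f_sty, f_ref, f_txt, w_q, w_k, w_v):
    """Sum of the reference-image branch and the text branch (hypothetical helper)."""
    img = attend(f_sty, f_ref, w_q, w_k, w_v)
    txt = attend(f_sty, f_txt, w_q, w_k, w_v)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(img, txt)]

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy projections, an assumption for the demo
    f_sty = [[1.0, 0.0]]                 # one query token
    f_ref = [[2.0, 3.0]]                 # one reference-image token
    f_txt = [[4.0, 5.0]]                 # one text token
    print(dual_cross_attention(f_sty, f_ref, f_txt, identity, identity, identity))
```

With a single key per branch, each softmax weight is exactly one, so the output is simply the sum of the two projected value vectors. With several keys, each branch returns a convex combination of its values before the two are added.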
Using the DDIM scheduler <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib53" title="">53</a>]</cite>, we perform only <math alttext="T=10" class="ltx_Math" display="inline" id="S3.SS2.p2.10.m8.1"><semantics id="S3.SS2.p2.10.m8.1a"><mrow id="S3.SS2.p2.10.m8.1.1" xref="S3.SS2.p2.10.m8.1.1.cmml"><mi id="S3.SS2.p2.10.m8.1.1.2" xref="S3.SS2.p2.10.m8.1.1.2.cmml">T</mi><mo id="S3.SS2.p2.10.m8.1.1.1" xref="S3.SS2.p2.10.m8.1.1.1.cmml">=</mo><mn id="S3.SS2.p2.10.m8.1.1.3" xref="S3.SS2.p2.10.m8.1.1.3.cmml">10</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.10.m8.1b"><apply id="S3.SS2.p2.10.m8.1.1.cmml" xref="S3.SS2.p2.10.m8.1.1"><eq id="S3.SS2.p2.10.m8.1.1.1.cmml" xref="S3.SS2.p2.10.m8.1.1.1"></eq><ci id="S3.SS2.p2.10.m8.1.1.2.cmml" xref="S3.SS2.p2.10.m8.1.1.2">𝑇</ci><cn id="S3.SS2.p2.10.m8.1.1.3.cmml" type="integer" xref="S3.SS2.p2.10.m8.1.1.3">10</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.10.m8.1c">T=10</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.10.m8.1d">italic_T = 10</annotation></semantics></math> denoising steps to rapidly integrate the secondary style in about one second. </div> </section> <section class="ltx_subsection" id="S3.SS3"> <h3 class="ltx_title ltx_title_subsection"> 3.3 3D Animatable Stylized Avatar Generation</h3> <div class="ltx_para ltx_noindent" id="S3.SS3.p1"> Expression Encoder. 
Current avatar animation techniques using Gaussian Splats often depend solely on 3D Morphable Models (3DMM) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib8" title="">8</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib19" title="">19</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title="">40</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib47" title="">47</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>]</cite>, which limits generalization beyond realistic faces, especially for stylized or cartoon avatars. To overcome this limitation, we condition the 3D generation network on a blend of 3DMM features and blendshape weights derived from the Facial Action Coding System (FACS) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib13" title="">13</a>]</cite>, which is widely used in cartoon animation to control facial features such as eye position and mouth shape. As depicted in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S2.F2" title="Figure 2 ‣ 2 Related Work ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">2</a>, for generating expressive avatars, we extract expression codes <math alttext="f_{\text{mm}}\in\mathbb{R}^{100}" class="ltx_Math" display="inline" id="S3.SS3.p1.1.m1.1"><semantics id="S3.SS3.p1.1.m1.1a"><mrow id="S3.SS3.p1.1.m1.1.1" xref="S3.SS3.p1.1.m1.1.1.cmml"><msub id="S3.SS3.p1.1.m1.1.1.2" xref="S3.SS3.p1.1.m1.1.1.2.cmml"><mi id="S3.SS3.p1.1.m1.1.1.2.2" xref="S3.SS3.p1.1.m1.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p1.1.m1.1.1.2.3" xref="S3.SS3.p1.1.m1.1.1.2.3a.cmml">mm</mtext></msub><mo id="S3.SS3.p1.1.m1.1.1.1" xref="S3.SS3.p1.1.m1.1.1.1.cmml">∈</mo><msup id="S3.SS3.p1.1.m1.1.1.3" xref="S3.SS3.p1.1.m1.1.1.3.cmml"><mi id="S3.SS3.p1.1.m1.1.1.3.2" xref="S3.SS3.p1.1.m1.1.1.3.2.cmml">ℝ</mi><mn id="S3.SS3.p1.1.m1.1.1.3.3" xref="S3.SS3.p1.1.m1.1.1.3.3.cmml">100</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.1.m1.1b"><apply id="S3.SS3.p1.1.m1.1.1.cmml" xref="S3.SS3.p1.1.m1.1.1"><in id="S3.SS3.p1.1.m1.1.1.1.cmml" xref="S3.SS3.p1.1.m1.1.1.1"></in><apply id="S3.SS3.p1.1.m1.1.1.2.cmml" xref="S3.SS3.p1.1.m1.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p1.1.m1.1.1.2.1.cmml" xref="S3.SS3.p1.1.m1.1.1.2">subscript</csymbol><ci id="S3.SS3.p1.1.m1.1.1.2.2.cmml" xref="S3.SS3.p1.1.m1.1.1.2.2">𝑓</ci><ci id="S3.SS3.p1.1.m1.1.1.2.3a.cmml" xref="S3.SS3.p1.1.m1.1.1.2.3"><mtext id="S3.SS3.p1.1.m1.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p1.1.m1.1.1.2.3">mm</mtext></ci></apply><apply id="S3.SS3.p1.1.m1.1.1.3.cmml" xref="S3.SS3.p1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.1.m1.1.1.3.1.cmml" xref="S3.SS3.p1.1.m1.1.1.3">superscript</csymbol><ci id="S3.SS3.p1.1.m1.1.1.3.2.cmml" xref="S3.SS3.p1.1.m1.1.1.3.2">ℝ</ci><cn id="S3.SS3.p1.1.m1.1.1.3.3.cmml" type="integer" xref="S3.SS3.p1.1.m1.1.1.3.3">100</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS3.p1.1.m1.1c">f_{\text{mm}}\in\mathbb{R}^{100}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.1.m1.1d">italic_f start_POSTSUBSCRIPT mm end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 100 end_POSTSUPERSCRIPT</annotation></semantics></math> from the driving image using a 3DMM estimator. These codes are concatenated with the blendshape vector <math alttext="f_{\text{bs}}\in\mathbb{R}^{16}" class="ltx_Math" display="inline" id="S3.SS3.p1.2.m2.1"><semantics id="S3.SS3.p1.2.m2.1a"><mrow id="S3.SS3.p1.2.m2.1.1" xref="S3.SS3.p1.2.m2.1.1.cmml"><msub id="S3.SS3.p1.2.m2.1.1.2" xref="S3.SS3.p1.2.m2.1.1.2.cmml"><mi id="S3.SS3.p1.2.m2.1.1.2.2" xref="S3.SS3.p1.2.m2.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p1.2.m2.1.1.2.3" xref="S3.SS3.p1.2.m2.1.1.2.3a.cmml">bs</mtext></msub><mo id="S3.SS3.p1.2.m2.1.1.1" xref="S3.SS3.p1.2.m2.1.1.1.cmml">∈</mo><msup id="S3.SS3.p1.2.m2.1.1.3" xref="S3.SS3.p1.2.m2.1.1.3.cmml"><mi id="S3.SS3.p1.2.m2.1.1.3.2" xref="S3.SS3.p1.2.m2.1.1.3.2.cmml">ℝ</mi><mn id="S3.SS3.p1.2.m2.1.1.3.3" xref="S3.SS3.p1.2.m2.1.1.3.3.cmml">16</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.2.m2.1b"><apply id="S3.SS3.p1.2.m2.1.1.cmml" xref="S3.SS3.p1.2.m2.1.1"><in id="S3.SS3.p1.2.m2.1.1.1.cmml" xref="S3.SS3.p1.2.m2.1.1.1"></in><apply id="S3.SS3.p1.2.m2.1.1.2.cmml" xref="S3.SS3.p1.2.m2.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p1.2.m2.1.1.2.1.cmml" xref="S3.SS3.p1.2.m2.1.1.2">subscript</csymbol><ci id="S3.SS3.p1.2.m2.1.1.2.2.cmml" xref="S3.SS3.p1.2.m2.1.1.2.2">𝑓</ci><ci id="S3.SS3.p1.2.m2.1.1.2.3a.cmml" xref="S3.SS3.p1.2.m2.1.1.2.3"><mtext id="S3.SS3.p1.2.m2.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p1.2.m2.1.1.2.3">bs</mtext></ci></apply><apply id="S3.SS3.p1.2.m2.1.1.3.cmml" xref="S3.SS3.p1.2.m2.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.2.m2.1.1.3.1.cmml" xref="S3.SS3.p1.2.m2.1.1.3">superscript</csymbol><ci id="S3.SS3.p1.2.m2.1.1.3.2.cmml" xref="S3.SS3.p1.2.m2.1.1.3.2">ℝ</ci><cn id="S3.SS3.p1.2.m2.1.1.3.3.cmml" 
type="integer" xref="S3.SS3.p1.2.m2.1.1.3.3">16</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.2.m2.1c">f_{\text{bs}}\in\mathbb{R}^{16}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.2.m2.1d">italic_f start_POSTSUBSCRIPT bs end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT</annotation></semantics></math>, producing a comprehensive expression feature. A learnable projection layer <math alttext="\mathcal{E}_{\text{proj}}" class="ltx_Math" display="inline" id="S3.SS3.p1.3.m3.1"><semantics id="S3.SS3.p1.3.m3.1a"><msub id="S3.SS3.p1.3.m3.1.1" xref="S3.SS3.p1.3.m3.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p1.3.m3.1.1.2" xref="S3.SS3.p1.3.m3.1.1.2.cmml">ℰ</mi><mtext id="S3.SS3.p1.3.m3.1.1.3" xref="S3.SS3.p1.3.m3.1.1.3a.cmml">proj</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.3.m3.1b"><apply id="S3.SS3.p1.3.m3.1.1.cmml" xref="S3.SS3.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.3.m3.1.1.1.cmml" xref="S3.SS3.p1.3.m3.1.1">subscript</csymbol><ci id="S3.SS3.p1.3.m3.1.1.2.cmml" xref="S3.SS3.p1.3.m3.1.1.2">ℰ</ci><ci id="S3.SS3.p1.3.m3.1.1.3a.cmml" xref="S3.SS3.p1.3.m3.1.1.3"><mtext id="S3.SS3.p1.3.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.3.m3.1.1.3">proj</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.3.m3.1c">\mathcal{E}_{\text{proj}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.3.m3.1d">caligraphic_E start_POSTSUBSCRIPT proj end_POSTSUBSCRIPT</annotation></semantics></math> then projects this combined feature into a 16-dimensional expression vector <math alttext="f_{\text{exp}}=\mathcal{E}_{\text{proj}}([f_{\text{bs}};f_{\text{mm}}])" class="ltx_Math" display="inline" id="S3.SS3.p1.4.m4.1"><semantics id="S3.SS3.p1.4.m4.1a"><mrow id="S3.SS3.p1.4.m4.1.1" xref="S3.SS3.p1.4.m4.1.1.cmml"><msub id="S3.SS3.p1.4.m4.1.1.3" xref="S3.SS3.p1.4.m4.1.1.3.cmml"><mi 
id="S3.SS3.p1.4.m4.1.1.3.2" xref="S3.SS3.p1.4.m4.1.1.3.2.cmml">f</mi><mtext id="S3.SS3.p1.4.m4.1.1.3.3" xref="S3.SS3.p1.4.m4.1.1.3.3a.cmml">exp</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.2" xref="S3.SS3.p1.4.m4.1.1.2.cmml">=</mo><mrow id="S3.SS3.p1.4.m4.1.1.1" xref="S3.SS3.p1.4.m4.1.1.1.cmml"><msub id="S3.SS3.p1.4.m4.1.1.1.3" xref="S3.SS3.p1.4.m4.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p1.4.m4.1.1.1.3.2" xref="S3.SS3.p1.4.m4.1.1.1.3.2.cmml">ℰ</mi><mtext id="S3.SS3.p1.4.m4.1.1.1.3.3" xref="S3.SS3.p1.4.m4.1.1.1.3.3a.cmml">proj</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.1.2" xref="S3.SS3.p1.4.m4.1.1.1.2.cmml">⁢</mo><mrow id="S3.SS3.p1.4.m4.1.1.1.1.1" xref="S3.SS3.p1.4.m4.1.1.1.cmml"><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.2" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.cmml">(</mo><mrow id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml"><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.3" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml">[</mo><msub id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.cmml"><mi id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2.cmml">f</mi><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3a.cmml">bs</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.4" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml">;</mo><msub id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.cmml"><mi id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2.cmml">f</mi><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3a.cmml">mm</mtext></msub><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.5" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml">]</mo></mrow><mo id="S3.SS3.p1.4.m4.1.1.1.1.1.3" stretchy="false" xref="S3.SS3.p1.4.m4.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.4.m4.1b"><apply id="S3.SS3.p1.4.m4.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1"><eq 
id="S3.SS3.p1.4.m4.1.1.2.cmml" xref="S3.SS3.p1.4.m4.1.1.2"></eq><apply id="S3.SS3.p1.4.m4.1.1.3.cmml" xref="S3.SS3.p1.4.m4.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.3.1.cmml" xref="S3.SS3.p1.4.m4.1.1.3">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.3.2.cmml" xref="S3.SS3.p1.4.m4.1.1.3.2">𝑓</ci><ci id="S3.SS3.p1.4.m4.1.1.3.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.3.3"><mtext id="S3.SS3.p1.4.m4.1.1.3.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.3.3">exp</mtext></ci></apply><apply id="S3.SS3.p1.4.m4.1.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1"><times id="S3.SS3.p1.4.m4.1.1.1.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.2"></times><apply id="S3.SS3.p1.4.m4.1.1.1.3.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.1.3.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.1.3.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3.2">ℰ</ci><ci id="S3.SS3.p1.4.m4.1.1.1.3.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.1.3.3"><mtext id="S3.SS3.p1.4.m4.1.1.1.3.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.1.3.3">proj</mtext></ci></apply><list id="S3.SS3.p1.4.m4.1.1.1.1.1.1.3.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2"><apply id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.2">𝑓</ci><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3"><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.1.1.3">bs</mtext></ci></apply><apply id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.1.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.2">𝑓</ci><ci 
id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3"><mtext id="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.SS3.p1.4.m4.1.1.1.1.1.1.2.2.3">mm</mtext></ci></apply></list></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.4.m4.1c">f_{\text{exp}}=\mathcal{E}_{\text{proj}}([f_{\text{bs}};f_{\text{mm}}])</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.4.m4.1d">italic_f start_POSTSUBSCRIPT exp end_POSTSUBSCRIPT = caligraphic_E start_POSTSUBSCRIPT proj end_POSTSUBSCRIPT ( [ italic_f start_POSTSUBSCRIPT bs end_POSTSUBSCRIPT ; italic_f start_POSTSUBSCRIPT mm end_POSTSUBSCRIPT ] )</annotation></semantics></math>, where <math alttext="[\cdot]" class="ltx_Math" display="inline" id="S3.SS3.p1.5.m5.1"><semantics id="S3.SS3.p1.5.m5.1a"><mrow id="S3.SS3.p1.5.m5.1.2.2" xref="S3.SS3.p1.5.m5.1.2.1.cmml"><mo id="S3.SS3.p1.5.m5.1.2.2.1" stretchy="false" xref="S3.SS3.p1.5.m5.1.2.1.1.cmml">[</mo><mo id="S3.SS3.p1.5.m5.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p1.5.m5.1.1.cmml">⋅</mo><mo id="S3.SS3.p1.5.m5.1.2.2.2" stretchy="false" xref="S3.SS3.p1.5.m5.1.2.1.1.cmml">]</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.5.m5.1b"><apply id="S3.SS3.p1.5.m5.1.2.1.cmml" xref="S3.SS3.p1.5.m5.1.2.2"><csymbol cd="latexml" id="S3.SS3.p1.5.m5.1.2.1.1.cmml" xref="S3.SS3.p1.5.m5.1.2.2.1">delimited-[]</csymbol><ci id="S3.SS3.p1.5.m5.1.1.cmml" xref="S3.SS3.p1.5.m5.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.5.m5.1c">[\cdot]</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.5.m5.1d">[ ⋅ ]</annotation></semantics></math> indicates feature concatenation. 
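As a minimal sketch of this projection step, the snippet below concatenates a 16-dimensional blendshape vector with a 100-dimensional 3DMM code and maps the result to a 16-dimensional expression vector. The text only states that the projection layer is learnable; the single linear layer, the random initialization, and the names concat and LinearProjection are assumptions made purely for illustration.

```python
import random

BS_DIM = 16    # blendshape vector f_bs (dimension stated in the text)
MM_DIM = 100   # 3DMM expression code f_mm (dimension stated in the text)
EXP_DIM = 16   # projected expression vector f_exp (dimension stated in the text)


def concat(f_bs, f_mm):
    """Feature concatenation [f_bs; f_mm]."""
    return list(f_bs) + list(f_mm)


class LinearProjection:
    """Hypothetical stand-in for E_proj: one linear layer y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = in_dim ** -0.5
        # weight has shape (out_dim, in_dim); values are placeholders, not trained
        self.weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        if len(x) != len(self.weight[0]):
            raise ValueError("unexpected input dimension")
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


f_bs = [0.0] * BS_DIM            # placeholder blendshape weights
f_mm = [0.1] * MM_DIM            # placeholder 3DMM expression code
e_proj = LinearProjection(BS_DIM + MM_DIM, EXP_DIM)
f_exp = e_proj(concat(f_bs, f_mm))
```

In a trained system the weights would be learned jointly with the rest of the network; the sketch only fixes the input and output dimensions given in the text.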
To integrate expressiveness with identity, the driving signal is formulated as: <table class="ltx_equation ltx_eqn_table" id="S3.E4"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="f_{\text{drive}}=(\mathcal{E}_{\text{mlp}}([f_{\text{exp}},f_{\text{id}}]),f_{% \text{pos}})." class="ltx_Math" display="block" id="S3.E4.m1.1"><semantics id="S3.E4.m1.1a"><mrow id="S3.E4.m1.1.1.1" xref="S3.E4.m1.1.1.1.1.cmml"><mrow id="S3.E4.m1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.cmml"><msub id="S3.E4.m1.1.1.1.1.4" xref="S3.E4.m1.1.1.1.1.4.cmml"><mi id="S3.E4.m1.1.1.1.1.4.2" xref="S3.E4.m1.1.1.1.1.4.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.4.3" xref="S3.E4.m1.1.1.1.1.4.3a.cmml">drive</mtext></msub><mo id="S3.E4.m1.1.1.1.1.3" xref="S3.E4.m1.1.1.1.1.3.cmml">=</mo><mrow id="S3.E4.m1.1.1.1.1.2.2" xref="S3.E4.m1.1.1.1.1.2.3.cmml"><mo id="S3.E4.m1.1.1.1.1.2.2.3" stretchy="false" xref="S3.E4.m1.1.1.1.1.2.3.cmml">(</mo><mrow id="S3.E4.m1.1.1.1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml"><msub id="S3.E4.m1.1.1.1.1.1.1.1.3" xref="S3.E4.m1.1.1.1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E4.m1.1.1.1.1.1.1.1.3.2" xref="S3.E4.m1.1.1.1.1.1.1.1.3.2.cmml">ℰ</mi><mtext id="S3.E4.m1.1.1.1.1.1.1.1.3.3" xref="S3.E4.m1.1.1.1.1.1.1.1.3.3a.cmml">mlp</mtext></msub><mo id="S3.E4.m1.1.1.1.1.1.1.1.2" xref="S3.E4.m1.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E4.m1.1.1.1.1.1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml"><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml"><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.3" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">[</mo><msub id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2" 
xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml">exp</mtext></msub><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.4" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">,</mo><msub id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml"><mi id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3a.cmml">id</mtext></msub><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.5" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml">]</mo></mrow><mo id="S3.E4.m1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E4.m1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S3.E4.m1.1.1.1.1.2.2.4" xref="S3.E4.m1.1.1.1.1.2.3.cmml">,</mo><msub id="S3.E4.m1.1.1.1.1.2.2.2" xref="S3.E4.m1.1.1.1.1.2.2.2.cmml"><mi id="S3.E4.m1.1.1.1.1.2.2.2.2" xref="S3.E4.m1.1.1.1.1.2.2.2.2.cmml">f</mi><mtext id="S3.E4.m1.1.1.1.1.2.2.2.3" xref="S3.E4.m1.1.1.1.1.2.2.2.3a.cmml">pos</mtext></msub><mo id="S3.E4.m1.1.1.1.1.2.2.5" stretchy="false" xref="S3.E4.m1.1.1.1.1.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E4.m1.1.1.1.2" lspace="0em" xref="S3.E4.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E4.m1.1b"><apply id="S3.E4.m1.1.1.1.1.cmml" xref="S3.E4.m1.1.1.1"><eq id="S3.E4.m1.1.1.1.1.3.cmml" xref="S3.E4.m1.1.1.1.1.3"></eq><apply id="S3.E4.m1.1.1.1.1.4.cmml" xref="S3.E4.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.4.1.cmml" xref="S3.E4.m1.1.1.1.1.4">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.4.2.cmml" xref="S3.E4.m1.1.1.1.1.4.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.4.3a.cmml" xref="S3.E4.m1.1.1.1.1.4.3"><mtext id="S3.E4.m1.1.1.1.1.4.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.4.3">drive</mtext></ci></apply><interval closure="open" id="S3.E4.m1.1.1.1.1.2.3.cmml" xref="S3.E4.m1.1.1.1.1.2.2"><apply id="S3.E4.m1.1.1.1.1.1.1.1.cmml" 
xref="S3.E4.m1.1.1.1.1.1.1.1"><times id="S3.E4.m1.1.1.1.1.1.1.1.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.2"></times><apply id="S3.E4.m1.1.1.1.1.1.1.1.3.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3.2">ℰ</ci><ci id="S3.E4.m1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E4.m1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.1.1.1.3.3">mlp</mtext></ci></apply><interval closure="closed" id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2"><apply id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.1.1.3">exp</mtext></ci></apply><apply id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3"><mtext id="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.1.1.1.1.1.1.2.2.3">id</mtext></ci></apply></interval></apply><apply id="S3.E4.m1.1.1.1.1.2.2.2.cmml" xref="S3.E4.m1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E4.m1.1.1.1.1.2.2.2.1.cmml" xref="S3.E4.m1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E4.m1.1.1.1.1.2.2.2.2.cmml" 
xref="S3.E4.m1.1.1.1.1.2.2.2.2">𝑓</ci><ci id="S3.E4.m1.1.1.1.1.2.2.2.3a.cmml" xref="S3.E4.m1.1.1.1.1.2.2.2.3"><mtext id="S3.E4.m1.1.1.1.1.2.2.2.3.cmml" mathsize="70%" xref="S3.E4.m1.1.1.1.1.2.2.2.3">pos</mtext></ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E4.m1.1c">f_{\text{drive}}=(\mathcal{E}_{\text{mlp}}([f_{\text{exp}},f_{\text{id}}]),f_{% \text{pos}}).</annotation><annotation encoding="application/x-llamapun" id="S3.E4.m1.1d">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT = ( caligraphic_E start_POSTSUBSCRIPT mlp end_POSTSUBSCRIPT ( [ italic_f start_POSTSUBSCRIPT exp end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT id end_POSTSUBSCRIPT ] ) , italic_f start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(4)</td> </tr></tbody> </table> Here, <math alttext="f_{\text{id}}" class="ltx_Math" display="inline" id="S3.SS3.p1.6.m1.1"><semantics id="S3.SS3.p1.6.m1.1a"><msub id="S3.SS3.p1.6.m1.1.1" xref="S3.SS3.p1.6.m1.1.1.cmml"><mi id="S3.SS3.p1.6.m1.1.1.2" xref="S3.SS3.p1.6.m1.1.1.2.cmml">f</mi><mtext id="S3.SS3.p1.6.m1.1.1.3" xref="S3.SS3.p1.6.m1.1.1.3a.cmml">id</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.6.m1.1b"><apply id="S3.SS3.p1.6.m1.1.1.cmml" xref="S3.SS3.p1.6.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.6.m1.1.1.1.cmml" xref="S3.SS3.p1.6.m1.1.1">subscript</csymbol><ci id="S3.SS3.p1.6.m1.1.1.2.cmml" xref="S3.SS3.p1.6.m1.1.1.2">𝑓</ci><ci id="S3.SS3.p1.6.m1.1.1.3a.cmml" xref="S3.SS3.p1.6.m1.1.1.3"><mtext id="S3.SS3.p1.6.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p1.6.m1.1.1.3">id</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.6.m1.1c">f_{\text{id}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.6.m1.1d">italic_f start_POSTSUBSCRIPT id 
end_POSTSUBSCRIPT</annotation></semantics></math> is the global identity feature extracted from a reference image <math alttext="I_{r}" class="ltx_Math" display="inline" id="S3.SS3.p1.7.m2.1"><semantics id="S3.SS3.p1.7.m2.1a"><msub id="S3.SS3.p1.7.m2.1.1" xref="S3.SS3.p1.7.m2.1.1.cmml"><mi id="S3.SS3.p1.7.m2.1.1.2" xref="S3.SS3.p1.7.m2.1.1.2.cmml">I</mi><mi id="S3.SS3.p1.7.m2.1.1.3" xref="S3.SS3.p1.7.m2.1.1.3.cmml">r</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.7.m2.1b"><apply id="S3.SS3.p1.7.m2.1.1.cmml" xref="S3.SS3.p1.7.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.7.m2.1.1.1.cmml" xref="S3.SS3.p1.7.m2.1.1">subscript</csymbol><ci id="S3.SS3.p1.7.m2.1.1.2.cmml" xref="S3.SS3.p1.7.m2.1.1.2">𝐼</ci><ci id="S3.SS3.p1.7.m2.1.1.3.cmml" xref="S3.SS3.p1.7.m2.1.1.3">𝑟</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.7.m2.1c">I_{r}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.7.m2.1d">italic_I start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT</annotation></semantics></math> via a frozen DINOv2 backbone <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib38" title="">38</a>]</cite>, and <math alttext="f_{\text{pos}}" class="ltx_Math" display="inline" id="S3.SS3.p1.8.m3.1"><semantics id="S3.SS3.p1.8.m3.1a"><msub id="S3.SS3.p1.8.m3.1.1" xref="S3.SS3.p1.8.m3.1.1.cmml"><mi id="S3.SS3.p1.8.m3.1.1.2" xref="S3.SS3.p1.8.m3.1.1.2.cmml">f</mi><mtext id="S3.SS3.p1.8.m3.1.1.3" xref="S3.SS3.p1.8.m3.1.1.3a.cmml">pos</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p1.8.m3.1b"><apply id="S3.SS3.p1.8.m3.1.1.cmml" xref="S3.SS3.p1.8.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p1.8.m3.1.1.1.cmml" xref="S3.SS3.p1.8.m3.1.1">subscript</csymbol><ci id="S3.SS3.p1.8.m3.1.1.2.cmml" xref="S3.SS3.p1.8.m3.1.1.2">𝑓</ci><ci id="S3.SS3.p1.8.m3.1.1.3a.cmml" xref="S3.SS3.p1.8.m3.1.1.3"><mtext id="S3.SS3.p1.8.m3.1.1.3.cmml" mathsize="70%" 
xref="S3.SS3.p1.8.m3.1.1.3">pos</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p1.8.m3.1c">f_{\text{pos}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p1.8.m3.1d">italic_f start_POSTSUBSCRIPT pos end_POSTSUBSCRIPT</annotation></semantics></math> denotes the position map from 3DMM vertices. </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p2"> 3D Generation Network <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.1.1.m1.1"><semantics id="S3.SS3.p2.1.1.m1.1a"><mrow id="S3.SS3.p2.1.1.m1.1.2" xref="S3.SS3.p2.1.1.m1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.1.1.m1.1.2.2" xref="S3.SS3.p2.1.1.m1.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.1.1.m1.1.2.1" xref="S3.SS3.p2.1.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.1.1.m1.1.2.3.2" xref="S3.SS3.p2.1.1.m1.1.2.cmml"><mo id="S3.SS3.p2.1.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.1.1.m1.1.2.cmml">(</mo><mo id="S3.SS3.p2.1.1.m1.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.1.1.m1.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.1.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.1.1.m1.1b"><apply id="S3.SS3.p2.1.1.m1.1.2.cmml" xref="S3.SS3.p2.1.1.m1.1.2"><times id="S3.SS3.p2.1.1.m1.1.2.1.cmml" xref="S3.SS3.p2.1.1.m1.1.2.1"></times><ci id="S3.SS3.p2.1.1.m1.1.2.2.cmml" xref="S3.SS3.p2.1.1.m1.1.2.2">𝒢</ci><ci id="S3.SS3.p2.1.1.m1.1.1.cmml" xref="S3.SS3.p2.1.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.1.1.m1.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.1.1.m1.1d">caligraphic_G ( ⋅ )</annotation></semantics></math>. 
Given the generated unposed avatars <math alttext="I_{\text{unposed}}" class="ltx_Math" display="inline" id="S3.SS3.p2.2.m1.1"><semantics id="S3.SS3.p2.2.m1.1a"><msub id="S3.SS3.p2.2.m1.1.1" xref="S3.SS3.p2.2.m1.1.1.cmml"><mi id="S3.SS3.p2.2.m1.1.1.2" xref="S3.SS3.p2.2.m1.1.1.2.cmml">I</mi><mtext id="S3.SS3.p2.2.m1.1.1.3" xref="S3.SS3.p2.2.m1.1.1.3a.cmml">unposed</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.2.m1.1b"><apply id="S3.SS3.p2.2.m1.1.1.cmml" xref="S3.SS3.p2.2.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.2.m1.1.1.1.cmml" xref="S3.SS3.p2.2.m1.1.1">subscript</csymbol><ci id="S3.SS3.p2.2.m1.1.1.2.cmml" xref="S3.SS3.p2.2.m1.1.1.2">𝐼</ci><ci id="S3.SS3.p2.2.m1.1.1.3a.cmml" xref="S3.SS3.p2.2.m1.1.1.3"><mtext id="S3.SS3.p2.2.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p2.2.m1.1.1.3">unposed</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.2.m1.1c">I_{\text{unposed}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.2.m1.1d">italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT</annotation></semantics></math> and driving features <math alttext="f_{\text{drive}}" class="ltx_Math" display="inline" id="S3.SS3.p2.3.m2.1"><semantics id="S3.SS3.p2.3.m2.1a"><msub id="S3.SS3.p2.3.m2.1.1" xref="S3.SS3.p2.3.m2.1.1.cmml"><mi id="S3.SS3.p2.3.m2.1.1.2" xref="S3.SS3.p2.3.m2.1.1.2.cmml">f</mi><mtext id="S3.SS3.p2.3.m2.1.1.3" xref="S3.SS3.p2.3.m2.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.3.m2.1b"><apply id="S3.SS3.p2.3.m2.1.1.cmml" xref="S3.SS3.p2.3.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.3.m2.1.1.1.cmml" xref="S3.SS3.p2.3.m2.1.1">subscript</csymbol><ci id="S3.SS3.p2.3.m2.1.1.2.cmml" xref="S3.SS3.p2.3.m2.1.1.2">𝑓</ci><ci id="S3.SS3.p2.3.m2.1.1.3a.cmml" xref="S3.SS3.p2.3.m2.1.1.3"><mtext id="S3.SS3.p2.3.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p2.3.m2.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S3.SS3.p2.3.m2.1c">f_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.3.m2.1d">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math> from the expression encoder, we employ an asymmetric U-Net architecture akin to Large Multi-view Gaussian Models <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite> and incorporate cross-attention layers to merge the driving features seamlessly: <table class="ltx_equation ltx_eqn_table" id="S3.E5"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="I_{\text{posed}}=\mathcal{E}_{\text{2D}}^{\text{render}}(\mathcal{G}(I_{\text{% unposed}},f_{\text{drive}};\Phi_{g}))," class="ltx_Math" display="block" id="S3.E5.m1.1"><semantics id="S3.E5.m1.1a"><mrow id="S3.E5.m1.1.1.1" xref="S3.E5.m1.1.1.1.1.cmml"><mrow id="S3.E5.m1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.cmml"><msub id="S3.E5.m1.1.1.1.1.3" xref="S3.E5.m1.1.1.1.1.3.cmml"><mi id="S3.E5.m1.1.1.1.1.3.2" xref="S3.E5.m1.1.1.1.1.3.2.cmml">I</mi><mtext id="S3.E5.m1.1.1.1.1.3.3" xref="S3.E5.m1.1.1.1.1.3.3a.cmml">posed</mtext></msub><mo id="S3.E5.m1.1.1.1.1.2" xref="S3.E5.m1.1.1.1.1.2.cmml">=</mo><mrow id="S3.E5.m1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.cmml"><msubsup id="S3.E5.m1.1.1.1.1.1.3" xref="S3.E5.m1.1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E5.m1.1.1.1.1.1.3.2.2" xref="S3.E5.m1.1.1.1.1.1.3.2.2.cmml">ℰ</mi><mtext id="S3.E5.m1.1.1.1.1.1.3.2.3" xref="S3.E5.m1.1.1.1.1.1.3.2.3a.cmml">2D</mtext><mtext id="S3.E5.m1.1.1.1.1.1.3.3" xref="S3.E5.m1.1.1.1.1.1.3.3a.cmml">render</mtext></msubsup><mo id="S3.E5.m1.1.1.1.1.1.2" xref="S3.E5.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E5.m1.1.1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E5.m1.1.1.1.1.1.1.1.2" stretchy="false" 
xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E5.m1.1.1.1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E5.m1.1.1.1.1.1.1.1.1.5" xref="S3.E5.m1.1.1.1.1.1.1.1.1.5.cmml">𝒢</mi><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.4" xref="S3.E5.m1.1.1.1.1.1.1.1.1.4.cmml">⁢</mo><mrow id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml"><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.4" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">(</mo><msub id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2.cmml">I</mi><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml">unposed</mtext></msub><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.5" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">,</mo><msub id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.cmml"><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2.cmml">f</mi><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3a.cmml">drive</mtext></msub><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.6" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">;</mo><msub id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.cmml"><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2" mathvariant="normal" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2.cmml">Φ</mi><mi id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3.cmml">g</mi></msub><mo id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.7" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml">)</mo></mrow></mrow><mo id="S3.E5.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E5.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E5.m1.1.1.1.2" xref="S3.E5.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E5.m1.1b"><apply id="S3.E5.m1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1"><eq id="S3.E5.m1.1.1.1.1.2.cmml" 
xref="S3.E5.m1.1.1.1.1.2"></eq><apply id="S3.E5.m1.1.1.1.1.3.cmml" xref="S3.E5.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.3.1.cmml" xref="S3.E5.m1.1.1.1.1.3">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.3.2.cmml" xref="S3.E5.m1.1.1.1.1.3.2">𝐼</ci><ci id="S3.E5.m1.1.1.1.1.3.3a.cmml" xref="S3.E5.m1.1.1.1.1.3.3"><mtext id="S3.E5.m1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.3.3">posed</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1"><times id="S3.E5.m1.1.1.1.1.1.2.cmml" xref="S3.E5.m1.1.1.1.1.1.2"></times><apply id="S3.E5.m1.1.1.1.1.1.3.cmml" xref="S3.E5.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.3.1.cmml" xref="S3.E5.m1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E5.m1.1.1.1.1.1.3.2.cmml" xref="S3.E5.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.3.2.1.cmml" xref="S3.E5.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.3.2.2.cmml" xref="S3.E5.m1.1.1.1.1.1.3.2.2">ℰ</ci><ci id="S3.E5.m1.1.1.1.1.1.3.2.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.3.2.3"><mtext id="S3.E5.m1.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.3.2.3">2D</mtext></ci></apply><ci id="S3.E5.m1.1.1.1.1.1.3.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.3.3"><mtext id="S3.E5.m1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.3.3">render</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1"><times id="S3.E5.m1.1.1.1.1.1.1.1.1.4.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.4"></times><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.5.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.5">𝒢</ci><vector id="S3.E5.m1.1.1.1.1.1.1.1.1.3.4.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3"><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.2">𝐼</ci><ci 
id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.1.1.1.1.1.1.3">unposed</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.2">𝑓</ci><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3a.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3"><mtext id="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3.cmml" mathsize="70%" xref="S3.E5.m1.1.1.1.1.1.1.1.1.2.2.2.3">drive</mtext></ci></apply><apply id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.1.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3">subscript</csymbol><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.2">Φ</ci><ci id="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3.cmml" xref="S3.E5.m1.1.1.1.1.1.1.1.1.3.3.3.3">𝑔</ci></apply></vector></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E5.m1.1c">I_{\text{posed}}=\mathcal{E}_{\text{2D}}^{\text{render}}(\mathcal{G}(I_{\text{% unposed}},f_{\text{drive}};\Phi_{g})),</annotation><annotation encoding="application/x-llamapun" id="S3.E5.m1.1d">italic_I start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT = caligraphic_E start_POSTSUBSCRIPT 2D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT ( caligraphic_G ( italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT ; roman_Φ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(5)</td> </tr></tbody> 
</table> where <math alttext="\Phi_{g}" class="ltx_Math" display="inline" id="S3.SS3.p2.4.m1.1"><semantics id="S3.SS3.p2.4.m1.1a"><msub id="S3.SS3.p2.4.m1.1.1" xref="S3.SS3.p2.4.m1.1.1.cmml"><mi id="S3.SS3.p2.4.m1.1.1.2" mathvariant="normal" xref="S3.SS3.p2.4.m1.1.1.2.cmml">Φ</mi><mi id="S3.SS3.p2.4.m1.1.1.3" xref="S3.SS3.p2.4.m1.1.1.3.cmml">g</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.4.m1.1b"><apply id="S3.SS3.p2.4.m1.1.1.cmml" xref="S3.SS3.p2.4.m1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.4.m1.1.1.1.cmml" xref="S3.SS3.p2.4.m1.1.1">subscript</csymbol><ci id="S3.SS3.p2.4.m1.1.1.2.cmml" xref="S3.SS3.p2.4.m1.1.1.2">Φ</ci><ci id="S3.SS3.p2.4.m1.1.1.3.cmml" xref="S3.SS3.p2.4.m1.1.1.3">𝑔</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.4.m1.1c">\Phi_{g}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.4.m1.1d">roman_Φ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT</annotation></semantics></math> denotes the learnable parameters of <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.5.m2.1"><semantics id="S3.SS3.p2.5.m2.1a"><mrow id="S3.SS3.p2.5.m2.1.2" xref="S3.SS3.p2.5.m2.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.5.m2.1.2.2" xref="S3.SS3.p2.5.m2.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.5.m2.1.2.1" xref="S3.SS3.p2.5.m2.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.5.m2.1.2.3.2" xref="S3.SS3.p2.5.m2.1.2.cmml"><mo id="S3.SS3.p2.5.m2.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.5.m2.1.2.cmml">(</mo><mo id="S3.SS3.p2.5.m2.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.5.m2.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.5.m2.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.5.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.5.m2.1b"><apply id="S3.SS3.p2.5.m2.1.2.cmml" xref="S3.SS3.p2.5.m2.1.2"><times id="S3.SS3.p2.5.m2.1.2.1.cmml" xref="S3.SS3.p2.5.m2.1.2.1"></times><ci id="S3.SS3.p2.5.m2.1.2.2.cmml" 
xref="S3.SS3.p2.5.m2.1.2.2">𝒢</ci><ci id="S3.SS3.p2.5.m2.1.1.cmml" xref="S3.SS3.p2.5.m2.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.5.m2.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.5.m2.1d">caligraphic_G ( ⋅ )</annotation></semantics></math> and <math alttext="\mathcal{E}_{\text{2D}}^{\text{render}}" class="ltx_Math" display="inline" id="S3.SS3.p2.6.m3.1"><semantics id="S3.SS3.p2.6.m3.1a"><msubsup id="S3.SS3.p2.6.m3.1.1" xref="S3.SS3.p2.6.m3.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.6.m3.1.1.2.2" xref="S3.SS3.p2.6.m3.1.1.2.2.cmml">ℰ</mi><mtext id="S3.SS3.p2.6.m3.1.1.2.3" xref="S3.SS3.p2.6.m3.1.1.2.3a.cmml">2D</mtext><mtext id="S3.SS3.p2.6.m3.1.1.3" xref="S3.SS3.p2.6.m3.1.1.3a.cmml">render</mtext></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.6.m3.1b"><apply id="S3.SS3.p2.6.m3.1.1.cmml" xref="S3.SS3.p2.6.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.6.m3.1.1.1.cmml" xref="S3.SS3.p2.6.m3.1.1">superscript</csymbol><apply id="S3.SS3.p2.6.m3.1.1.2.cmml" xref="S3.SS3.p2.6.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.6.m3.1.1.2.1.cmml" xref="S3.SS3.p2.6.m3.1.1">subscript</csymbol><ci id="S3.SS3.p2.6.m3.1.1.2.2.cmml" xref="S3.SS3.p2.6.m3.1.1.2.2">ℰ</ci><ci id="S3.SS3.p2.6.m3.1.1.2.3a.cmml" xref="S3.SS3.p2.6.m3.1.1.2.3"><mtext id="S3.SS3.p2.6.m3.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p2.6.m3.1.1.2.3">2D</mtext></ci></apply><ci id="S3.SS3.p2.6.m3.1.1.3a.cmml" xref="S3.SS3.p2.6.m3.1.1.3"><mtext id="S3.SS3.p2.6.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p2.6.m3.1.1.3">render</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.6.m3.1c">\mathcal{E}_{\text{2D}}^{\text{render}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.6.m3.1d">caligraphic_E start_POSTSUBSCRIPT 2D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT</annotation></semantics></math> is a 2DGS 
renderer <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib20" title="">20</a>]</cite>. <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.7.m4.1"><semantics id="S3.SS3.p2.7.m4.1a"><mrow id="S3.SS3.p2.7.m4.1.2" xref="S3.SS3.p2.7.m4.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.7.m4.1.2.2" xref="S3.SS3.p2.7.m4.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.7.m4.1.2.1" xref="S3.SS3.p2.7.m4.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.7.m4.1.2.3.2" xref="S3.SS3.p2.7.m4.1.2.cmml"><mo id="S3.SS3.p2.7.m4.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.7.m4.1.2.cmml">(</mo><mo id="S3.SS3.p2.7.m4.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.7.m4.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.7.m4.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.7.m4.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.7.m4.1b"><apply id="S3.SS3.p2.7.m4.1.2.cmml" xref="S3.SS3.p2.7.m4.1.2"><times id="S3.SS3.p2.7.m4.1.2.1.cmml" xref="S3.SS3.p2.7.m4.1.2.1"></times><ci id="S3.SS3.p2.7.m4.1.2.2.cmml" xref="S3.SS3.p2.7.m4.1.2.2">𝒢</ci><ci id="S3.SS3.p2.7.m4.1.1.cmml" xref="S3.SS3.p2.7.m4.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.7.m4.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.7.m4.1d">caligraphic_G ( ⋅ )</annotation></semantics></math> consists of an encoder with five down-sampling blocks, a middle block, and a decoder with three up-sampling blocks. Each block contains two ResNet layers with group normalization and SiLU activation. Cross-attention modules are strategically placed in the deeper layers of the network: the last two down-sampling blocks, the middle block, and the first two up-sampling blocks. 
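The block layout described above can be written down as a small configuration table. This is only a sketch under stated assumptions: the names BlockSpec and build_layout are hypothetical, and channel widths, spatial resolutions, and attention head counts are omitted because the text does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSpec:
    name: str
    resnet_layers: int      # two ResNet layers per block (GroupNorm + SiLU)
    cross_attention: bool   # whether f_drive is injected in this block


def build_layout(num_down=5, num_up=3, attn_down=2, attn_up=2):
    """List the blocks in forward order; attention sits in the deepest blocks."""
    blocks = []
    for i in range(num_down):
        # the last `attn_down` down-sampling blocks carry cross-attention
        blocks.append(BlockSpec(f"down{i}", 2, i >= num_down - attn_down))
    # the middle block always carries cross-attention
    blocks.append(BlockSpec("mid", 2, True))
    for i in range(num_up):
        # the first `attn_up` up-sampling blocks carry cross-attention
        blocks.append(BlockSpec(f"up{i}", 2, attn_up > i))
    return blocks


layout = build_layout()
attn_blocks = [b.name for b in layout if b.cross_attention]
```

With the default arguments the layout has nine blocks, and attn_blocks lists down3, down4, mid, up0, and up1, matching the placement stated in the text. Fewer up-sampling than down-sampling blocks is what makes the U-Net asymmetric.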
The cross-attention is defined as: <table class="ltx_equation ltx_eqn_table" id="S3.E6"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="f_{\text{out}}^{g}=\text{softmax}\left(\frac{(f_{\text{in}}^{g}W_{Q}^{g})(f_{% \text{drive}}W_{K}^{g})^{T}}{\sqrt{d_{k}^{g}}}\right)(f_{\text{drive}}W_{V}^{g% })," class="ltx_Math" display="block" id="S3.E6.m1.3"><semantics id="S3.E6.m1.3a"><mrow id="S3.E6.m1.3.3.1" xref="S3.E6.m1.3.3.1.1.cmml"><mrow id="S3.E6.m1.3.3.1.1" xref="S3.E6.m1.3.3.1.1.cmml"><msubsup id="S3.E6.m1.3.3.1.1.3" xref="S3.E6.m1.3.3.1.1.3.cmml"><mi id="S3.E6.m1.3.3.1.1.3.2.2" xref="S3.E6.m1.3.3.1.1.3.2.2.cmml">f</mi><mtext id="S3.E6.m1.3.3.1.1.3.2.3" xref="S3.E6.m1.3.3.1.1.3.2.3a.cmml">out</mtext><mi id="S3.E6.m1.3.3.1.1.3.3" xref="S3.E6.m1.3.3.1.1.3.3.cmml">g</mi></msubsup><mo id="S3.E6.m1.3.3.1.1.2" xref="S3.E6.m1.3.3.1.1.2.cmml">=</mo><mrow id="S3.E6.m1.3.3.1.1.1" xref="S3.E6.m1.3.3.1.1.1.cmml"><mtext id="S3.E6.m1.3.3.1.1.1.3" xref="S3.E6.m1.3.3.1.1.1.3a.cmml">softmax</mtext><mo id="S3.E6.m1.3.3.1.1.1.2" xref="S3.E6.m1.3.3.1.1.1.2.cmml">⁢</mo><mrow id="S3.E6.m1.3.3.1.1.1.4.2" xref="S3.E6.m1.2.2.cmml"><mo id="S3.E6.m1.3.3.1.1.1.4.2.1" xref="S3.E6.m1.2.2.cmml">(</mo><mfrac id="S3.E6.m1.2.2" xref="S3.E6.m1.2.2.cmml"><mrow id="S3.E6.m1.2.2.2" xref="S3.E6.m1.2.2.2.cmml"><mrow id="S3.E6.m1.1.1.1.1.1" xref="S3.E6.m1.1.1.1.1.1.1.cmml"><mo id="S3.E6.m1.1.1.1.1.1.2" stretchy="false" xref="S3.E6.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E6.m1.1.1.1.1.1.1" xref="S3.E6.m1.1.1.1.1.1.1.cmml"><msubsup id="S3.E6.m1.1.1.1.1.1.1.2" xref="S3.E6.m1.1.1.1.1.1.1.2.cmml"><mi id="S3.E6.m1.1.1.1.1.1.1.2.2.2" xref="S3.E6.m1.1.1.1.1.1.1.2.2.2.cmml">f</mi><mtext id="S3.E6.m1.1.1.1.1.1.1.2.2.3" xref="S3.E6.m1.1.1.1.1.1.1.2.2.3a.cmml">in</mtext><mi id="S3.E6.m1.1.1.1.1.1.1.2.3" xref="S3.E6.m1.1.1.1.1.1.1.2.3.cmml">g</mi></msubsup><mo id="S3.E6.m1.1.1.1.1.1.1.1" 
xref="S3.E6.m1.1.1.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E6.m1.1.1.1.1.1.1.3" xref="S3.E6.m1.1.1.1.1.1.1.3.cmml"><mi id="S3.E6.m1.1.1.1.1.1.1.3.2.2" xref="S3.E6.m1.1.1.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E6.m1.1.1.1.1.1.1.3.2.3" xref="S3.E6.m1.1.1.1.1.1.1.3.2.3.cmml">Q</mi><mi id="S3.E6.m1.1.1.1.1.1.1.3.3" xref="S3.E6.m1.1.1.1.1.1.1.3.3.cmml">g</mi></msubsup></mrow><mo id="S3.E6.m1.1.1.1.1.1.3" stretchy="false" xref="S3.E6.m1.1.1.1.1.1.1.cmml">)</mo></mrow><mo id="S3.E6.m1.2.2.2.3" xref="S3.E6.m1.2.2.2.3.cmml">⁢</mo><msup id="S3.E6.m1.2.2.2.2" xref="S3.E6.m1.2.2.2.2.cmml"><mrow id="S3.E6.m1.2.2.2.2.1.1" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml"><mo id="S3.E6.m1.2.2.2.2.1.1.2" stretchy="false" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml">(</mo><mrow id="S3.E6.m1.2.2.2.2.1.1.1" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml"><msub id="S3.E6.m1.2.2.2.2.1.1.1.2" xref="S3.E6.m1.2.2.2.2.1.1.1.2.cmml"><mi id="S3.E6.m1.2.2.2.2.1.1.1.2.2" xref="S3.E6.m1.2.2.2.2.1.1.1.2.2.cmml">f</mi><mtext id="S3.E6.m1.2.2.2.2.1.1.1.2.3" xref="S3.E6.m1.2.2.2.2.1.1.1.2.3a.cmml">drive</mtext></msub><mo id="S3.E6.m1.2.2.2.2.1.1.1.1" xref="S3.E6.m1.2.2.2.2.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E6.m1.2.2.2.2.1.1.1.3" xref="S3.E6.m1.2.2.2.2.1.1.1.3.cmml"><mi id="S3.E6.m1.2.2.2.2.1.1.1.3.2.2" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E6.m1.2.2.2.2.1.1.1.3.2.3" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.3.cmml">K</mi><mi id="S3.E6.m1.2.2.2.2.1.1.1.3.3" xref="S3.E6.m1.2.2.2.2.1.1.1.3.3.cmml">g</mi></msubsup></mrow><mo id="S3.E6.m1.2.2.2.2.1.1.3" stretchy="false" xref="S3.E6.m1.2.2.2.2.1.1.1.cmml">)</mo></mrow><mi id="S3.E6.m1.2.2.2.2.3" xref="S3.E6.m1.2.2.2.2.3.cmml">T</mi></msup></mrow><msqrt id="S3.E6.m1.2.2.4" xref="S3.E6.m1.2.2.4.cmml"><msubsup id="S3.E6.m1.2.2.4.2" xref="S3.E6.m1.2.2.4.2.cmml"><mi id="S3.E6.m1.2.2.4.2.2.2" xref="S3.E6.m1.2.2.4.2.2.2.cmml">d</mi><mi id="S3.E6.m1.2.2.4.2.2.3" xref="S3.E6.m1.2.2.4.2.2.3.cmml">k</mi><mi id="S3.E6.m1.2.2.4.2.3" 
xref="S3.E6.m1.2.2.4.2.3.cmml">g</mi></msubsup></msqrt></mfrac><mo id="S3.E6.m1.3.3.1.1.1.4.2.2" xref="S3.E6.m1.2.2.cmml">)</mo></mrow><mo id="S3.E6.m1.3.3.1.1.1.2a" xref="S3.E6.m1.3.3.1.1.1.2.cmml">⁢</mo><mrow id="S3.E6.m1.3.3.1.1.1.1.1" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml"><mo id="S3.E6.m1.3.3.1.1.1.1.1.2" stretchy="false" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E6.m1.3.3.1.1.1.1.1.1" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml"><msub id="S3.E6.m1.3.3.1.1.1.1.1.1.2" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.cmml"><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.2.2" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.E6.m1.3.3.1.1.1.1.1.1.2.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.3a.cmml">drive</mtext></msub><mo id="S3.E6.m1.3.3.1.1.1.1.1.1.1" xref="S3.E6.m1.3.3.1.1.1.1.1.1.1.cmml">⁢</mo><msubsup id="S3.E6.m1.3.3.1.1.1.1.1.1.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.cmml"><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2.cmml">W</mi><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3.cmml">V</mi><mi id="S3.E6.m1.3.3.1.1.1.1.1.1.3.3" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.3.cmml">g</mi></msubsup></mrow><mo id="S3.E6.m1.3.3.1.1.1.1.1.3" stretchy="false" xref="S3.E6.m1.3.3.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E6.m1.3.3.1.2" xref="S3.E6.m1.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E6.m1.3b"><apply id="S3.E6.m1.3.3.1.1.cmml" xref="S3.E6.m1.3.3.1"><eq id="S3.E6.m1.3.3.1.1.2.cmml" xref="S3.E6.m1.3.3.1.1.2"></eq><apply id="S3.E6.m1.3.3.1.1.3.cmml" xref="S3.E6.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.3.1.cmml" xref="S3.E6.m1.3.3.1.1.3">superscript</csymbol><apply id="S3.E6.m1.3.3.1.1.3.2.cmml" xref="S3.E6.m1.3.3.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.3.2.1.cmml" xref="S3.E6.m1.3.3.1.1.3">subscript</csymbol><ci id="S3.E6.m1.3.3.1.1.3.2.2.cmml" xref="S3.E6.m1.3.3.1.1.3.2.2">𝑓</ci><ci id="S3.E6.m1.3.3.1.1.3.2.3a.cmml" xref="S3.E6.m1.3.3.1.1.3.2.3"><mtext 
id="S3.E6.m1.3.3.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E6.m1.3.3.1.1.3.2.3">out</mtext></ci></apply><ci id="S3.E6.m1.3.3.1.1.3.3.cmml" xref="S3.E6.m1.3.3.1.1.3.3">𝑔</ci></apply><apply id="S3.E6.m1.3.3.1.1.1.cmml" xref="S3.E6.m1.3.3.1.1.1"><times id="S3.E6.m1.3.3.1.1.1.2.cmml" xref="S3.E6.m1.3.3.1.1.1.2"></times><ci id="S3.E6.m1.3.3.1.1.1.3a.cmml" xref="S3.E6.m1.3.3.1.1.1.3"><mtext id="S3.E6.m1.3.3.1.1.1.3.cmml" xref="S3.E6.m1.3.3.1.1.1.3">softmax</mtext></ci><apply id="S3.E6.m1.2.2.cmml" xref="S3.E6.m1.3.3.1.1.1.4.2"><divide id="S3.E6.m1.2.2.3.cmml" xref="S3.E6.m1.3.3.1.1.1.4.2"></divide><apply id="S3.E6.m1.2.2.2.cmml" xref="S3.E6.m1.2.2.2"><times id="S3.E6.m1.2.2.2.3.cmml" xref="S3.E6.m1.2.2.2.3"></times><apply id="S3.E6.m1.1.1.1.1.1.1.cmml" xref="S3.E6.m1.1.1.1.1.1"><times id="S3.E6.m1.1.1.1.1.1.1.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.1"></times><apply id="S3.E6.m1.1.1.1.1.1.1.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.2.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2">superscript</csymbol><apply id="S3.E6.m1.1.1.1.1.1.1.2.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.2.2.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E6.m1.1.1.1.1.1.1.2.2.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2.2.2">𝑓</ci><ci id="S3.E6.m1.1.1.1.1.1.1.2.2.3a.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2.2.3"><mtext id="S3.E6.m1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="S3.E6.m1.1.1.1.1.1.1.2.2.3">in</mtext></ci></apply><ci id="S3.E6.m1.1.1.1.1.1.1.2.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.2.3">𝑔</ci></apply><apply id="S3.E6.m1.1.1.1.1.1.1.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.3.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E6.m1.1.1.1.1.1.1.3.2.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E6.m1.1.1.1.1.1.1.3.2.2.cmml" 
xref="S3.E6.m1.1.1.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E6.m1.1.1.1.1.1.1.3.2.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3.2.3">𝑄</ci></apply><ci id="S3.E6.m1.1.1.1.1.1.1.3.3.cmml" xref="S3.E6.m1.1.1.1.1.1.1.3.3">𝑔</ci></apply></apply><apply id="S3.E6.m1.2.2.2.2.cmml" xref="S3.E6.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.2.cmml" xref="S3.E6.m1.2.2.2.2">superscript</csymbol><apply id="S3.E6.m1.2.2.2.2.1.1.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1"><times id="S3.E6.m1.2.2.2.2.1.1.1.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.1"></times><apply id="S3.E6.m1.2.2.2.2.1.1.1.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.1.1.1.2.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2">subscript</csymbol><ci id="S3.E6.m1.2.2.2.2.1.1.1.2.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2.2">𝑓</ci><ci id="S3.E6.m1.2.2.2.2.1.1.1.2.3a.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.2.3"><mtext id="S3.E6.m1.2.2.2.2.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E6.m1.2.2.2.2.1.1.1.2.3">drive</mtext></ci></apply><apply id="S3.E6.m1.2.2.2.2.1.1.1.3.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.1.1.1.3.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3">superscript</csymbol><apply id="S3.E6.m1.2.2.2.2.1.1.1.3.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.2.2.1.1.1.3.2.1.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3">subscript</csymbol><ci id="S3.E6.m1.2.2.2.2.1.1.1.3.2.2.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.2">𝑊</ci><ci id="S3.E6.m1.2.2.2.2.1.1.1.3.2.3.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3.2.3">𝐾</ci></apply><ci id="S3.E6.m1.2.2.2.2.1.1.1.3.3.cmml" xref="S3.E6.m1.2.2.2.2.1.1.1.3.3">𝑔</ci></apply></apply><ci id="S3.E6.m1.2.2.2.2.3.cmml" xref="S3.E6.m1.2.2.2.2.3">𝑇</ci></apply></apply><apply id="S3.E6.m1.2.2.4.cmml" xref="S3.E6.m1.2.2.4"><root id="S3.E6.m1.2.2.4a.cmml" xref="S3.E6.m1.2.2.4"></root><apply id="S3.E6.m1.2.2.4.2.cmml" xref="S3.E6.m1.2.2.4.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.4.2.1.cmml" 
xref="S3.E6.m1.2.2.4.2">superscript</csymbol><apply id="S3.E6.m1.2.2.4.2.2.cmml" xref="S3.E6.m1.2.2.4.2"><csymbol cd="ambiguous" id="S3.E6.m1.2.2.4.2.2.1.cmml" xref="S3.E6.m1.2.2.4.2">subscript</csymbol><ci id="S3.E6.m1.2.2.4.2.2.2.cmml" xref="S3.E6.m1.2.2.4.2.2.2">𝑑</ci><ci id="S3.E6.m1.2.2.4.2.2.3.cmml" xref="S3.E6.m1.2.2.4.2.2.3">𝑘</ci></apply><ci id="S3.E6.m1.2.2.4.2.3.cmml" xref="S3.E6.m1.2.2.4.2.3">𝑔</ci></apply></apply></apply><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1"><times id="S3.E6.m1.3.3.1.1.1.1.1.1.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.1"></times><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.1.1.1.1.2.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.2.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.2">𝑓</ci><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.2.3a.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.3"><mtext id="S3.E6.m1.3.3.1.1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E6.m1.3.3.1.1.1.1.1.1.2.3">drive</mtext></ci></apply><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.3.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.1.1.1.1.3.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.2">𝑊</ci><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.2.3">𝑉</ci></apply><ci id="S3.E6.m1.3.3.1.1.1.1.1.1.3.3.cmml" xref="S3.E6.m1.3.3.1.1.1.1.1.1.3.3">𝑔</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E6.m1.3c">f_{\text{out}}^{g}=\text{softmax}\left(\frac{(f_{\text{in}}^{g}W_{Q}^{g})(f_{% 
\text{drive}}W_{K}^{g})^{T}}{\sqrt{d_{k}^{g}}}\right)(f_{\text{drive}}W_{V}^{g% }),</annotation><annotation encoding="application/x-llamapun" id="S3.E6.m1.3d">italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = softmax ( divide start_ARG ( italic_f start_POSTSUBSCRIPT in end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) ( italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG end_ARG ) ( italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(6)</td> </tr></tbody> </table> where <math alttext="f_{\text{in}}^{g},f_{\text{out}}^{g}" class="ltx_Math" display="inline" id="S3.SS3.p2.8.m1.2"><semantics id="S3.SS3.p2.8.m1.2a"><mrow id="S3.SS3.p2.8.m1.2.2.2" xref="S3.SS3.p2.8.m1.2.2.3.cmml"><msubsup id="S3.SS3.p2.8.m1.1.1.1.1" xref="S3.SS3.p2.8.m1.1.1.1.1.cmml"><mi id="S3.SS3.p2.8.m1.1.1.1.1.2.2" xref="S3.SS3.p2.8.m1.1.1.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p2.8.m1.1.1.1.1.2.3" xref="S3.SS3.p2.8.m1.1.1.1.1.2.3a.cmml">in</mtext><mi id="S3.SS3.p2.8.m1.1.1.1.1.3" xref="S3.SS3.p2.8.m1.1.1.1.1.3.cmml">g</mi></msubsup><mo id="S3.SS3.p2.8.m1.2.2.2.3" xref="S3.SS3.p2.8.m1.2.2.3.cmml">,</mo><msubsup id="S3.SS3.p2.8.m1.2.2.2.2" xref="S3.SS3.p2.8.m1.2.2.2.2.cmml"><mi id="S3.SS3.p2.8.m1.2.2.2.2.2.2" 
xref="S3.SS3.p2.8.m1.2.2.2.2.2.2.cmml">f</mi><mtext id="S3.SS3.p2.8.m1.2.2.2.2.2.3" xref="S3.SS3.p2.8.m1.2.2.2.2.2.3a.cmml">out</mtext><mi id="S3.SS3.p2.8.m1.2.2.2.2.3" xref="S3.SS3.p2.8.m1.2.2.2.2.3.cmml">g</mi></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.8.m1.2b"><list id="S3.SS3.p2.8.m1.2.2.3.cmml" xref="S3.SS3.p2.8.m1.2.2.2"><apply id="S3.SS3.p2.8.m1.1.1.1.1.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.1.1.1.1.1.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1">superscript</csymbol><apply id="S3.SS3.p2.8.m1.1.1.1.1.2.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.1.1.1.1.2.1.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1">subscript</csymbol><ci id="S3.SS3.p2.8.m1.1.1.1.1.2.2.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1.2.2">𝑓</ci><ci id="S3.SS3.p2.8.m1.1.1.1.1.2.3a.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1.2.3"><mtext id="S3.SS3.p2.8.m1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p2.8.m1.1.1.1.1.2.3">in</mtext></ci></apply><ci id="S3.SS3.p2.8.m1.1.1.1.1.3.cmml" xref="S3.SS3.p2.8.m1.1.1.1.1.3">𝑔</ci></apply><apply id="S3.SS3.p2.8.m1.2.2.2.2.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.2.2.2.2.1.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2">superscript</csymbol><apply id="S3.SS3.p2.8.m1.2.2.2.2.2.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.8.m1.2.2.2.2.2.1.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2">subscript</csymbol><ci id="S3.SS3.p2.8.m1.2.2.2.2.2.2.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2.2.2">𝑓</ci><ci id="S3.SS3.p2.8.m1.2.2.2.2.2.3a.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2.2.3"><mtext id="S3.SS3.p2.8.m1.2.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS3.p2.8.m1.2.2.2.2.2.3">out</mtext></ci></apply><ci id="S3.SS3.p2.8.m1.2.2.2.2.3.cmml" xref="S3.SS3.p2.8.m1.2.2.2.2.3">𝑔</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.8.m1.2c">f_{\text{in}}^{g},f_{\text{out}}^{g}</annotation><annotation encoding="application/x-llamapun" 
id="S3.SS3.p2.8.m1.2d">italic_f start_POSTSUBSCRIPT in end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_f start_POSTSUBSCRIPT out end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT</annotation></semantics></math> are the input and output features of the cross-attention module in <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p2.9.m2.1"><semantics id="S3.SS3.p2.9.m2.1a"><mrow id="S3.SS3.p2.9.m2.1.2" xref="S3.SS3.p2.9.m2.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p2.9.m2.1.2.2" xref="S3.SS3.p2.9.m2.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p2.9.m2.1.2.1" xref="S3.SS3.p2.9.m2.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p2.9.m2.1.2.3.2" xref="S3.SS3.p2.9.m2.1.2.cmml"><mo id="S3.SS3.p2.9.m2.1.2.3.2.1" stretchy="false" xref="S3.SS3.p2.9.m2.1.2.cmml">(</mo><mo id="S3.SS3.p2.9.m2.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p2.9.m2.1.1.cmml">⋅</mo><mo id="S3.SS3.p2.9.m2.1.2.3.2.2" stretchy="false" xref="S3.SS3.p2.9.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.9.m2.1b"><apply id="S3.SS3.p2.9.m2.1.2.cmml" xref="S3.SS3.p2.9.m2.1.2"><times id="S3.SS3.p2.9.m2.1.2.1.cmml" xref="S3.SS3.p2.9.m2.1.2.1"></times><ci id="S3.SS3.p2.9.m2.1.2.2.cmml" xref="S3.SS3.p2.9.m2.1.2.2">𝒢</ci><ci id="S3.SS3.p2.9.m2.1.1.cmml" xref="S3.SS3.p2.9.m2.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.9.m2.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.9.m2.1d">caligraphic_G ( ⋅ )</annotation></semantics></math>, respectively. 
<math alttext="W_{Q}^{g},W_{K}^{g},W_{V}^{g}" class="ltx_Math" display="inline" id="S3.SS3.p2.10.m3.3"><semantics id="S3.SS3.p2.10.m3.3a"><mrow id="S3.SS3.p2.10.m3.3.3.3" xref="S3.SS3.p2.10.m3.3.3.4.cmml"><msubsup id="S3.SS3.p2.10.m3.1.1.1.1" xref="S3.SS3.p2.10.m3.1.1.1.1.cmml"><mi id="S3.SS3.p2.10.m3.1.1.1.1.2.2" xref="S3.SS3.p2.10.m3.1.1.1.1.2.2.cmml">W</mi><mi id="S3.SS3.p2.10.m3.1.1.1.1.2.3" xref="S3.SS3.p2.10.m3.1.1.1.1.2.3.cmml">Q</mi><mi id="S3.SS3.p2.10.m3.1.1.1.1.3" xref="S3.SS3.p2.10.m3.1.1.1.1.3.cmml">g</mi></msubsup><mo id="S3.SS3.p2.10.m3.3.3.3.4" xref="S3.SS3.p2.10.m3.3.3.4.cmml">,</mo><msubsup id="S3.SS3.p2.10.m3.2.2.2.2" xref="S3.SS3.p2.10.m3.2.2.2.2.cmml"><mi id="S3.SS3.p2.10.m3.2.2.2.2.2.2" xref="S3.SS3.p2.10.m3.2.2.2.2.2.2.cmml">W</mi><mi id="S3.SS3.p2.10.m3.2.2.2.2.2.3" xref="S3.SS3.p2.10.m3.2.2.2.2.2.3.cmml">K</mi><mi id="S3.SS3.p2.10.m3.2.2.2.2.3" xref="S3.SS3.p2.10.m3.2.2.2.2.3.cmml">g</mi></msubsup><mo id="S3.SS3.p2.10.m3.3.3.3.5" xref="S3.SS3.p2.10.m3.3.3.4.cmml">,</mo><msubsup id="S3.SS3.p2.10.m3.3.3.3.3" xref="S3.SS3.p2.10.m3.3.3.3.3.cmml"><mi id="S3.SS3.p2.10.m3.3.3.3.3.2.2" xref="S3.SS3.p2.10.m3.3.3.3.3.2.2.cmml">W</mi><mi id="S3.SS3.p2.10.m3.3.3.3.3.2.3" xref="S3.SS3.p2.10.m3.3.3.3.3.2.3.cmml">V</mi><mi id="S3.SS3.p2.10.m3.3.3.3.3.3" xref="S3.SS3.p2.10.m3.3.3.3.3.3.cmml">g</mi></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.10.m3.3b"><list id="S3.SS3.p2.10.m3.3.3.4.cmml" xref="S3.SS3.p2.10.m3.3.3.3"><apply id="S3.SS3.p2.10.m3.1.1.1.1.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.1.1.1.1.1.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1">superscript</csymbol><apply id="S3.SS3.p2.10.m3.1.1.1.1.2.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.1.1.1.1.2.1.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1">subscript</csymbol><ci id="S3.SS3.p2.10.m3.1.1.1.1.2.2.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1.2.2">𝑊</ci><ci id="S3.SS3.p2.10.m3.1.1.1.1.2.3.cmml" 
xref="S3.SS3.p2.10.m3.1.1.1.1.2.3">𝑄</ci></apply><ci id="S3.SS3.p2.10.m3.1.1.1.1.3.cmml" xref="S3.SS3.p2.10.m3.1.1.1.1.3">𝑔</ci></apply><apply id="S3.SS3.p2.10.m3.2.2.2.2.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.2.2.2.2.1.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2">superscript</csymbol><apply id="S3.SS3.p2.10.m3.2.2.2.2.2.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.2.2.2.2.2.1.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2">subscript</csymbol><ci id="S3.SS3.p2.10.m3.2.2.2.2.2.2.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2.2.2">𝑊</ci><ci id="S3.SS3.p2.10.m3.2.2.2.2.2.3.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2.2.3">𝐾</ci></apply><ci id="S3.SS3.p2.10.m3.2.2.2.2.3.cmml" xref="S3.SS3.p2.10.m3.2.2.2.2.3">𝑔</ci></apply><apply id="S3.SS3.p2.10.m3.3.3.3.3.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.3.3.3.3.1.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3">superscript</csymbol><apply id="S3.SS3.p2.10.m3.3.3.3.3.2.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS3.p2.10.m3.3.3.3.3.2.1.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3">subscript</csymbol><ci id="S3.SS3.p2.10.m3.3.3.3.3.2.2.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3.2.2">𝑊</ci><ci id="S3.SS3.p2.10.m3.3.3.3.3.2.3.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3.2.3">𝑉</ci></apply><ci id="S3.SS3.p2.10.m3.3.3.3.3.3.cmml" xref="S3.SS3.p2.10.m3.3.3.3.3.3">𝑔</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.10.m3.3c">W_{Q}^{g},W_{K}^{g},W_{V}^{g}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.10.m3.3d">italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT</annotation></semantics></math> are the learnable weight matrices and <math 
alttext="\sqrt{d_{k}^{g}}" class="ltx_Math" display="inline" id="S3.SS3.p2.11.m4.1"><semantics id="S3.SS3.p2.11.m4.1a"><msqrt id="S3.SS3.p2.11.m4.1.1" xref="S3.SS3.p2.11.m4.1.1.cmml"><msubsup id="S3.SS3.p2.11.m4.1.1.2" xref="S3.SS3.p2.11.m4.1.1.2.cmml"><mi id="S3.SS3.p2.11.m4.1.1.2.2.2" xref="S3.SS3.p2.11.m4.1.1.2.2.2.cmml">d</mi><mi id="S3.SS3.p2.11.m4.1.1.2.2.3" xref="S3.SS3.p2.11.m4.1.1.2.2.3.cmml">k</mi><mi id="S3.SS3.p2.11.m4.1.1.2.3" xref="S3.SS3.p2.11.m4.1.1.2.3.cmml">g</mi></msubsup></msqrt><annotation-xml encoding="MathML-Content" id="S3.SS3.p2.11.m4.1b"><apply id="S3.SS3.p2.11.m4.1.1.cmml" xref="S3.SS3.p2.11.m4.1.1"><root id="S3.SS3.p2.11.m4.1.1a.cmml" xref="S3.SS3.p2.11.m4.1.1"></root><apply id="S3.SS3.p2.11.m4.1.1.2.cmml" xref="S3.SS3.p2.11.m4.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p2.11.m4.1.1.2.1.cmml" xref="S3.SS3.p2.11.m4.1.1.2">superscript</csymbol><apply id="S3.SS3.p2.11.m4.1.1.2.2.cmml" xref="S3.SS3.p2.11.m4.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p2.11.m4.1.1.2.2.1.cmml" xref="S3.SS3.p2.11.m4.1.1.2">subscript</csymbol><ci id="S3.SS3.p2.11.m4.1.1.2.2.2.cmml" xref="S3.SS3.p2.11.m4.1.1.2.2.2">𝑑</ci><ci id="S3.SS3.p2.11.m4.1.1.2.2.3.cmml" xref="S3.SS3.p2.11.m4.1.1.2.2.3">𝑘</ci></apply><ci id="S3.SS3.p2.11.m4.1.1.2.3.cmml" xref="S3.SS3.p2.11.m4.1.1.2.3">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p2.11.m4.1c">\sqrt{d_{k}^{g}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p2.11.m4.1d">square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT end_ARG</annotation></semantics></math> is the scale factor. </div> <div class="ltx_para ltx_noindent" id="S3.SS3.p3"> Mobile AR Application. Our model is also designed for real-time animation on mobile devices, balancing advanced animation techniques with efficient rendering. 
Offline, we use the 3D Generation Network <math alttext="\mathcal{G}(\cdot)" class="ltx_Math" display="inline" id="S3.SS3.p3.1.m1.1"><semantics id="S3.SS3.p3.1.m1.1a"><mrow id="S3.SS3.p3.1.m1.1.2" xref="S3.SS3.p3.1.m1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS3.p3.1.m1.1.2.2" xref="S3.SS3.p3.1.m1.1.2.2.cmml">𝒢</mi><mo id="S3.SS3.p3.1.m1.1.2.1" xref="S3.SS3.p3.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS3.p3.1.m1.1.2.3.2" xref="S3.SS3.p3.1.m1.1.2.cmml"><mo id="S3.SS3.p3.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS3.p3.1.m1.1.2.cmml">(</mo><mo id="S3.SS3.p3.1.m1.1.1" lspace="0em" rspace="0em" xref="S3.SS3.p3.1.m1.1.1.cmml">⋅</mo><mo id="S3.SS3.p3.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS3.p3.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.1.m1.1b"><apply id="S3.SS3.p3.1.m1.1.2.cmml" xref="S3.SS3.p3.1.m1.1.2"><times id="S3.SS3.p3.1.m1.1.2.1.cmml" xref="S3.SS3.p3.1.m1.1.2.1"></times><ci id="S3.SS3.p3.1.m1.1.2.2.cmml" xref="S3.SS3.p3.1.m1.1.2.2">𝒢</ci><ci id="S3.SS3.p3.1.m1.1.1.cmml" xref="S3.SS3.p3.1.m1.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.1.m1.1c">\mathcal{G}(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.1.m1.1d">caligraphic_G ( ⋅ )</annotation></semantics></math> from the pipeline shown in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S2.F2" title="Figure 2 ‣ 2 Related Work ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">2</a> to initially generate a base set of Gaussians for the avatar in a rest pose <math alttext="\theta_{\text{rest}}" class="ltx_Math" display="inline" id="S3.SS3.p3.2.m2.1"><semantics id="S3.SS3.p3.2.m2.1a"><msub id="S3.SS3.p3.2.m2.1.1" xref="S3.SS3.p3.2.m2.1.1.cmml"><mi id="S3.SS3.p3.2.m2.1.1.2" xref="S3.SS3.p3.2.m2.1.1.2.cmml">θ</mi><mtext id="S3.SS3.p3.2.m2.1.1.3" xref="S3.SS3.p3.2.m2.1.1.3a.cmml">rest</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.2.m2.1b"><apply id="S3.SS3.p3.2.m2.1.1.cmml" xref="S3.SS3.p3.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS3.p3.2.m2.1.1.1.cmml" xref="S3.SS3.p3.2.m2.1.1">subscript</csymbol><ci id="S3.SS3.p3.2.m2.1.1.2.cmml" xref="S3.SS3.p3.2.m2.1.1.2">𝜃</ci><ci id="S3.SS3.p3.2.m2.1.1.3a.cmml" xref="S3.SS3.p3.2.m2.1.1.3"><mtext id="S3.SS3.p3.2.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p3.2.m2.1.1.3">rest</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.2.m2.1c">\theta_{\text{rest}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.2.m2.1d">italic_θ start_POSTSUBSCRIPT rest end_POSTSUBSCRIPT</annotation></semantics></math>, along with specific Gaussian sets corresponding to each component of the expression features <math alttext="f_{\text{drive}}" class="ltx_Math" display="inline" id="S3.SS3.p3.3.m3.1"><semantics id="S3.SS3.p3.3.m3.1a"><msub id="S3.SS3.p3.3.m3.1.1" xref="S3.SS3.p3.3.m3.1.1.cmml"><mi id="S3.SS3.p3.3.m3.1.1.2" xref="S3.SS3.p3.3.m3.1.1.2.cmml">f</mi><mtext id="S3.SS3.p3.3.m3.1.1.3" xref="S3.SS3.p3.3.m3.1.1.3a.cmml">drive</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.3.m3.1b"><apply id="S3.SS3.p3.3.m3.1.1.cmml" xref="S3.SS3.p3.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p3.3.m3.1.1.1.cmml" xref="S3.SS3.p3.3.m3.1.1">subscript</csymbol><ci 
id="S3.SS3.p3.3.m3.1.1.2.cmml" xref="S3.SS3.p3.3.m3.1.1.2">𝑓</ci><ci id="S3.SS3.p3.3.m3.1.1.3a.cmml" xref="S3.SS3.p3.3.m3.1.1.3"><mtext id="S3.SS3.p3.3.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p3.3.m3.1.1.3">drive</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.3.m3.1c">f_{\text{drive}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.3.m3.1d">italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT</annotation></semantics></math>. On the mobile device, we use a face tracker, such as MediaPipe’s BlazeFace tracker <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib4" title="">4</a>]</cite>, to generate a list of blendshape weights <math alttext="f_{\text{bs}}\in\mathbb{R}^{16}" class="ltx_Math" display="inline" id="S3.SS3.p3.4.m4.1"><semantics id="S3.SS3.p3.4.m4.1a"><mrow id="S3.SS3.p3.4.m4.1.1" xref="S3.SS3.p3.4.m4.1.1.cmml"><msub id="S3.SS3.p3.4.m4.1.1.2" xref="S3.SS3.p3.4.m4.1.1.2.cmml"><mi id="S3.SS3.p3.4.m4.1.1.2.2" xref="S3.SS3.p3.4.m4.1.1.2.2.cmml">f</mi><mtext id="S3.SS3.p3.4.m4.1.1.2.3" xref="S3.SS3.p3.4.m4.1.1.2.3a.cmml">bs</mtext></msub><mo id="S3.SS3.p3.4.m4.1.1.1" xref="S3.SS3.p3.4.m4.1.1.1.cmml">∈</mo><msup id="S3.SS3.p3.4.m4.1.1.3" xref="S3.SS3.p3.4.m4.1.1.3.cmml"><mi id="S3.SS3.p3.4.m4.1.1.3.2" xref="S3.SS3.p3.4.m4.1.1.3.2.cmml">ℝ</mi><mn id="S3.SS3.p3.4.m4.1.1.3.3" xref="S3.SS3.p3.4.m4.1.1.3.3.cmml">16</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.4.m4.1b"><apply id="S3.SS3.p3.4.m4.1.1.cmml" xref="S3.SS3.p3.4.m4.1.1"><in id="S3.SS3.p3.4.m4.1.1.1.cmml" xref="S3.SS3.p3.4.m4.1.1.1"></in><apply id="S3.SS3.p3.4.m4.1.1.2.cmml" xref="S3.SS3.p3.4.m4.1.1.2"><csymbol cd="ambiguous" id="S3.SS3.p3.4.m4.1.1.2.1.cmml" xref="S3.SS3.p3.4.m4.1.1.2">subscript</csymbol><ci id="S3.SS3.p3.4.m4.1.1.2.2.cmml" xref="S3.SS3.p3.4.m4.1.1.2.2">𝑓</ci><ci id="S3.SS3.p3.4.m4.1.1.2.3a.cmml" xref="S3.SS3.p3.4.m4.1.1.2.3"><mtext 
id="S3.SS3.p3.4.m4.1.1.2.3.cmml" mathsize="70%" xref="S3.SS3.p3.4.m4.1.1.2.3">bs</mtext></ci></apply><apply id="S3.SS3.p3.4.m4.1.1.3.cmml" xref="S3.SS3.p3.4.m4.1.1.3"><csymbol cd="ambiguous" id="S3.SS3.p3.4.m4.1.1.3.1.cmml" xref="S3.SS3.p3.4.m4.1.1.3">superscript</csymbol><ci id="S3.SS3.p3.4.m4.1.1.3.2.cmml" xref="S3.SS3.p3.4.m4.1.1.3.2">ℝ</ci><cn id="S3.SS3.p3.4.m4.1.1.3.3.cmml" type="integer" xref="S3.SS3.p3.4.m4.1.1.3.3">16</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.4.m4.1c">f_{\text{bs}}\in\mathbb{R}^{16}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.4.m4.1d">italic_f start_POSTSUBSCRIPT bs end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 16 end_POSTSUPERSCRIPT</annotation></semantics></math>. We leverage these weights to animate the avatar through linear interpolation between the parameters of each feature component: <table class="ltx_equation ltx_eqn_table" id="S3.E7"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\theta_{\text{mobile}}=\theta_{\text{rest}}+\sum_{i=1}^{K}f_{\text{drive}}^{i}% (\theta_{i}-\theta_{\text{rest}})" class="ltx_Math" display="block" id="S3.E7.m1.1"><semantics id="S3.E7.m1.1a"><mrow id="S3.E7.m1.1.1" xref="S3.E7.m1.1.1.cmml"><msub id="S3.E7.m1.1.1.3" xref="S3.E7.m1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.3.2" xref="S3.E7.m1.1.1.3.2.cmml">θ</mi><mtext id="S3.E7.m1.1.1.3.3" xref="S3.E7.m1.1.1.3.3a.cmml">mobile</mtext></msub><mo id="S3.E7.m1.1.1.2" xref="S3.E7.m1.1.1.2.cmml">=</mo><mrow id="S3.E7.m1.1.1.1" xref="S3.E7.m1.1.1.1.cmml"><msub id="S3.E7.m1.1.1.1.3" xref="S3.E7.m1.1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.1.3.2" xref="S3.E7.m1.1.1.1.3.2.cmml">θ</mi><mtext id="S3.E7.m1.1.1.1.3.3" xref="S3.E7.m1.1.1.1.3.3a.cmml">rest</mtext></msub><mo id="S3.E7.m1.1.1.1.2" rspace="0.055em" xref="S3.E7.m1.1.1.1.2.cmml">+</mo><mrow id="S3.E7.m1.1.1.1.1" 
xref="S3.E7.m1.1.1.1.1.cmml"><munderover id="S3.E7.m1.1.1.1.1.2" xref="S3.E7.m1.1.1.1.1.2.cmml"><mo id="S3.E7.m1.1.1.1.1.2.2.2" movablelimits="false" xref="S3.E7.m1.1.1.1.1.2.2.2.cmml">∑</mo><mrow id="S3.E7.m1.1.1.1.1.2.2.3" xref="S3.E7.m1.1.1.1.1.2.2.3.cmml"><mi id="S3.E7.m1.1.1.1.1.2.2.3.2" xref="S3.E7.m1.1.1.1.1.2.2.3.2.cmml">i</mi><mo id="S3.E7.m1.1.1.1.1.2.2.3.1" xref="S3.E7.m1.1.1.1.1.2.2.3.1.cmml">=</mo><mn id="S3.E7.m1.1.1.1.1.2.2.3.3" xref="S3.E7.m1.1.1.1.1.2.2.3.3.cmml">1</mn></mrow><mi id="S3.E7.m1.1.1.1.1.2.3" xref="S3.E7.m1.1.1.1.1.2.3.cmml">K</mi></munderover><mrow id="S3.E7.m1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.cmml"><msubsup id="S3.E7.m1.1.1.1.1.1.3" xref="S3.E7.m1.1.1.1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.1.1.1.3.2.2" xref="S3.E7.m1.1.1.1.1.1.3.2.2.cmml">f</mi><mtext id="S3.E7.m1.1.1.1.1.1.3.2.3" xref="S3.E7.m1.1.1.1.1.1.3.2.3a.cmml">drive</mtext><mi id="S3.E7.m1.1.1.1.1.1.3.3" xref="S3.E7.m1.1.1.1.1.1.3.3.cmml">i</mi></msubsup><mo id="S3.E7.m1.1.1.1.1.1.2" xref="S3.E7.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S3.E7.m1.1.1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E7.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E7.m1.1.1.1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml"><msub id="S3.E7.m1.1.1.1.1.1.1.1.1.2" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S3.E7.m1.1.1.1.1.1.1.1.1.2.2" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.2.cmml">θ</mi><mi id="S3.E7.m1.1.1.1.1.1.1.1.1.2.3" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.3.cmml">i</mi></msub><mo id="S3.E7.m1.1.1.1.1.1.1.1.1.1" xref="S3.E7.m1.1.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="S3.E7.m1.1.1.1.1.1.1.1.1.3" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E7.m1.1.1.1.1.1.1.1.1.3.2" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.2.cmml">θ</mi><mtext id="S3.E7.m1.1.1.1.1.1.1.1.1.3.3" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.3a.cmml">rest</mtext></msub></mrow><mo id="S3.E7.m1.1.1.1.1.1.1.1.3" stretchy="false" 
xref="S3.E7.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.E7.m1.1b"><apply id="S3.E7.m1.1.1.cmml" xref="S3.E7.m1.1.1"><eq id="S3.E7.m1.1.1.2.cmml" xref="S3.E7.m1.1.1.2"></eq><apply id="S3.E7.m1.1.1.3.cmml" xref="S3.E7.m1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.3.2">𝜃</ci><ci id="S3.E7.m1.1.1.3.3a.cmml" xref="S3.E7.m1.1.1.3.3"><mtext id="S3.E7.m1.1.1.3.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.3.3">mobile</mtext></ci></apply><apply id="S3.E7.m1.1.1.1.cmml" xref="S3.E7.m1.1.1.1"><plus id="S3.E7.m1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.2"></plus><apply id="S3.E7.m1.1.1.1.3.cmml" xref="S3.E7.m1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.1.3.2">𝜃</ci><ci id="S3.E7.m1.1.1.1.3.3a.cmml" xref="S3.E7.m1.1.1.1.3.3"><mtext id="S3.E7.m1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.1.3.3">rest</mtext></ci></apply><apply id="S3.E7.m1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1"><apply id="S3.E7.m1.1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.2.1.cmml" xref="S3.E7.m1.1.1.1.1.2">superscript</csymbol><apply id="S3.E7.m1.1.1.1.1.2.2.cmml" xref="S3.E7.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.2.2.1.cmml" xref="S3.E7.m1.1.1.1.1.2">subscript</csymbol><sum id="S3.E7.m1.1.1.1.1.2.2.2.cmml" xref="S3.E7.m1.1.1.1.1.2.2.2"></sum><apply id="S3.E7.m1.1.1.1.1.2.2.3.cmml" xref="S3.E7.m1.1.1.1.1.2.2.3"><eq id="S3.E7.m1.1.1.1.1.2.2.3.1.cmml" xref="S3.E7.m1.1.1.1.1.2.2.3.1"></eq><ci id="S3.E7.m1.1.1.1.1.2.2.3.2.cmml" xref="S3.E7.m1.1.1.1.1.2.2.3.2">𝑖</ci><cn id="S3.E7.m1.1.1.1.1.2.2.3.3.cmml" type="integer" xref="S3.E7.m1.1.1.1.1.2.2.3.3">1</cn></apply></apply><ci id="S3.E7.m1.1.1.1.1.2.3.cmml" xref="S3.E7.m1.1.1.1.1.2.3">𝐾</ci></apply><apply 
id="S3.E7.m1.1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1.1"><times id="S3.E7.m1.1.1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.1.1.2"></times><apply id="S3.E7.m1.1.1.1.1.1.3.cmml" xref="S3.E7.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.1.1.1.3">superscript</csymbol><apply id="S3.E7.m1.1.1.1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.3.2.1.cmml" xref="S3.E7.m1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.1.1.1.3.2.2.cmml" xref="S3.E7.m1.1.1.1.1.1.3.2.2">𝑓</ci><ci id="S3.E7.m1.1.1.1.1.1.3.2.3a.cmml" xref="S3.E7.m1.1.1.1.1.1.3.2.3"><mtext id="S3.E7.m1.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.1.1.1.3.2.3">drive</mtext></ci></apply><ci id="S3.E7.m1.1.1.1.1.1.3.3.cmml" xref="S3.E7.m1.1.1.1.1.1.3.3">𝑖</ci></apply><apply id="S3.E7.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1"><minus id="S3.E7.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.1"></minus><apply id="S3.E7.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.2">𝜃</ci><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.2.3">𝑖</ci></apply><apply id="S3.E7.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E7.m1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.2">𝜃</ci><ci id="S3.E7.m1.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E7.m1.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E7.m1.1.1.1.1.1.1.1.1.3.3">rest</mtext></ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.E7.m1.1c">\theta_{\text{mobile}}=\theta_{\text{rest}}+\sum_{i=1}^{K}f_{\text{drive}}^{i}% (\theta_{i}-\theta_{\text{rest}})</annotation><annotation encoding="application/x-llamapun" id="S3.E7.m1.1d">italic_θ start_POSTSUBSCRIPT mobile end_POSTSUBSCRIPT = italic_θ start_POSTSUBSCRIPT rest end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT drive end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT rest end_POSTSUBSCRIPT )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(7)</td> </tr></tbody> </table> <math alttext="K" class="ltx_Math" display="inline" id="S3.SS3.p3.5.m1.1"><semantics id="S3.SS3.p3.5.m1.1a"><mi id="S3.SS3.p3.5.m1.1.1" xref="S3.SS3.p3.5.m1.1.1.cmml">K</mi><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.5.m1.1b"><ci id="S3.SS3.p3.5.m1.1.1.cmml" xref="S3.SS3.p3.5.m1.1.1">𝐾</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.5.m1.1c">K</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.5.m1.1d">italic_K</annotation></semantics></math> represents the number of driving features, and can be tuned to balance expression detail and speed. 
For compatibility with MediaPipe, we choose <math alttext="K=16" class="ltx_Math" display="inline" id="S3.SS3.p3.6.m2.1"><semantics id="S3.SS3.p3.6.m2.1a"><mrow id="S3.SS3.p3.6.m2.1.1" xref="S3.SS3.p3.6.m2.1.1.cmml"><mi id="S3.SS3.p3.6.m2.1.1.2" xref="S3.SS3.p3.6.m2.1.1.2.cmml">K</mi><mo id="S3.SS3.p3.6.m2.1.1.1" xref="S3.SS3.p3.6.m2.1.1.1.cmml">=</mo><mn id="S3.SS3.p3.6.m2.1.1.3" xref="S3.SS3.p3.6.m2.1.1.3.cmml">16</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.6.m2.1b"><apply id="S3.SS3.p3.6.m2.1.1.cmml" xref="S3.SS3.p3.6.m2.1.1"><eq id="S3.SS3.p3.6.m2.1.1.1.cmml" xref="S3.SS3.p3.6.m2.1.1.1"></eq><ci id="S3.SS3.p3.6.m2.1.1.2.cmml" xref="S3.SS3.p3.6.m2.1.1.2">𝐾</ci><cn id="S3.SS3.p3.6.m2.1.1.3.cmml" type="integer" xref="S3.SS3.p3.6.m2.1.1.3">16</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.6.m2.1c">K=16</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.6.m2.1d">italic_K = 16</annotation></semantics></math>. 
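As a minimal sketch of Eq. (7), the NumPy snippet below blends the parameters of the feature components using K driving weights. The function name blend_gaussian_params, the flattened parameter layout of length P, and the random toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def blend_gaussian_params(theta_rest, theta_components, f_drive):
    """Eq. (7): theta_rest + sum_i f_drive[i] * (theta_components[i] - theta_rest).

    theta_rest:       (P,)   rest-pose Gaussian parameters, flattened (assumed layout)
    theta_components: (K, P) parameters of each feature component
    f_drive:          (K,)   driving weights, e.g. K = 16
    """
    theta_rest = np.asarray(theta_rest, dtype=np.float64)
    theta_components = np.asarray(theta_components, dtype=np.float64)
    f_drive = np.asarray(f_drive, dtype=np.float64)
    offsets = theta_components - theta_rest  # (K, P) per-component deltas
    return theta_rest + f_drive @ offsets    # weighted sum over the K components

# Toy data (hypothetical sizes): K = 16 driving features, P = 8 parameters.
K, P = 16, 8
rng = np.random.default_rng(0)
theta_rest = rng.normal(size=P)
theta_components = rng.normal(size=(K, P))

f_zero = np.zeros(K)     # no expression active
f_one_hot = np.zeros(K)  # only component 3 fully active
f_one_hot[3] = 1.0
```

With all weights at zero the function returns the rest parameters unchanged, and a one-hot weight vector reproduces the corresponding component exactly, which is the behavior Eq. (7) implies.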
The final rendering of the Gaussians <math alttext="\theta_{\text{mobile}}" class="ltx_Math" display="inline" id="S3.SS3.p3.7.m3.1"><semantics id="S3.SS3.p3.7.m3.1a"><msub id="S3.SS3.p3.7.m3.1.1" xref="S3.SS3.p3.7.m3.1.1.cmml"><mi id="S3.SS3.p3.7.m3.1.1.2" xref="S3.SS3.p3.7.m3.1.1.2.cmml">θ</mi><mtext id="S3.SS3.p3.7.m3.1.1.3" xref="S3.SS3.p3.7.m3.1.1.3a.cmml">mobile</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS3.p3.7.m3.1b"><apply id="S3.SS3.p3.7.m3.1.1.cmml" xref="S3.SS3.p3.7.m3.1.1"><csymbol cd="ambiguous" id="S3.SS3.p3.7.m3.1.1.1.cmml" xref="S3.SS3.p3.7.m3.1.1">subscript</csymbol><ci id="S3.SS3.p3.7.m3.1.1.2.cmml" xref="S3.SS3.p3.7.m3.1.1.2">𝜃</ci><ci id="S3.SS3.p3.7.m3.1.1.3a.cmml" xref="S3.SS3.p3.7.m3.1.1.3"><mtext id="S3.SS3.p3.7.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS3.p3.7.m3.1.1.3">mobile</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS3.p3.7.m3.1c">\theta_{\text{mobile}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS3.p3.7.m3.1d">italic_θ start_POSTSUBSCRIPT mobile end_POSTSUBSCRIPT</annotation></semantics></math> takes place in WebGL, offering efficient rendering while retaining high visual fidelity. To demonstrate this capability, we developed a JavaScript application that allows users to control their avatars directly in their browsers. </div> </section> <section class="ltx_subsection" id="S3.SS4"> <h3 class="ltx_title ltx_title_subsection"> 3.4 Training and Losses</h3> <div class="ltx_para ltx_noindent" id="S3.SS4.p1"> 2D Dual-Stylized Avatar Generation. 
Our training process involves using GDA to map the input reference image <math alttext="I_{\text{ref}}" class="ltx_Math" display="inline" id="S3.SS4.p1.1.m1.1"><semantics id="S3.SS4.p1.1.m1.1a"><msub id="S3.SS4.p1.1.m1.1.1" xref="S3.SS4.p1.1.m1.1.1.cmml"><mi id="S3.SS4.p1.1.m1.1.1.2" xref="S3.SS4.p1.1.m1.1.1.2.cmml">I</mi><mtext id="S3.SS4.p1.1.m1.1.1.3" xref="S3.SS4.p1.1.m1.1.1.3a.cmml">ref</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.1.m1.1b"><apply id="S3.SS4.p1.1.m1.1.1.cmml" xref="S3.SS4.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.1.m1.1.1.1.cmml" xref="S3.SS4.p1.1.m1.1.1">subscript</csymbol><ci id="S3.SS4.p1.1.m1.1.1.2.cmml" xref="S3.SS4.p1.1.m1.1.1.2">𝐼</ci><ci id="S3.SS4.p1.1.m1.1.1.3a.cmml" xref="S3.SS4.p1.1.m1.1.1.3"><mtext id="S3.SS4.p1.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.1.m1.1.1.3">ref</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.1.m1.1c">I_{\text{ref}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.1.m1.1d">italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT</annotation></semantics></math> to Gaussian parameters <math alttext="\theta_{\text{GDA}}" class="ltx_Math" display="inline" id="S3.SS4.p1.2.m2.1"><semantics id="S3.SS4.p1.2.m2.1a"><msub id="S3.SS4.p1.2.m2.1.1" xref="S3.SS4.p1.2.m2.1.1.cmml"><mi id="S3.SS4.p1.2.m2.1.1.2" xref="S3.SS4.p1.2.m2.1.1.2.cmml">θ</mi><mtext id="S3.SS4.p1.2.m2.1.1.3" xref="S3.SS4.p1.2.m2.1.1.3a.cmml">GDA</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.2.m2.1b"><apply id="S3.SS4.p1.2.m2.1.1.cmml" xref="S3.SS4.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.2.m2.1.1.1.cmml" xref="S3.SS4.p1.2.m2.1.1">subscript</csymbol><ci id="S3.SS4.p1.2.m2.1.1.2.cmml" xref="S3.SS4.p1.2.m2.1.1.2">𝜃</ci><ci id="S3.SS4.p1.2.m2.1.1.3a.cmml" xref="S3.SS4.p1.2.m2.1.1.3"><mtext id="S3.SS4.p1.2.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.2.m2.1.1.3">GDA</mtext></ci></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S3.SS4.p1.2.m2.1c">\theta_{\text{GDA}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.2.m2.1d">italic_θ start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT</annotation></semantics></math>, which are then used to render the unposed primary-style avatar <math alttext="I_{\text{unposed}}" class="ltx_Math" display="inline" id="S3.SS4.p1.3.m3.1"><semantics id="S3.SS4.p1.3.m3.1a"><msub id="S3.SS4.p1.3.m3.1.1" xref="S3.SS4.p1.3.m3.1.1.cmml"><mi id="S3.SS4.p1.3.m3.1.1.2" xref="S3.SS4.p1.3.m3.1.1.2.cmml">I</mi><mtext id="S3.SS4.p1.3.m3.1.1.3" xref="S3.SS4.p1.3.m3.1.1.3a.cmml">unposed</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.3.m3.1b"><apply id="S3.SS4.p1.3.m3.1.1.cmml" xref="S3.SS4.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.3.m3.1.1.1.cmml" xref="S3.SS4.p1.3.m3.1.1">subscript</csymbol><ci id="S3.SS4.p1.3.m3.1.1.2.cmml" xref="S3.SS4.p1.3.m3.1.1.2">𝐼</ci><ci id="S3.SS4.p1.3.m3.1.1.3a.cmml" xref="S3.SS4.p1.3.m3.1.1.3"><mtext id="S3.SS4.p1.3.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.3.m3.1.1.3">unposed</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.3.m3.1c">I_{\text{unposed}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.3.m3.1d">italic_I start_POSTSUBSCRIPT unposed end_POSTSUBSCRIPT</annotation></semantics></math> from the frontal view via a 3DGS renderer <math alttext="\mathcal{E}_{\text{3D}}^{\text{render}}" class="ltx_Math" display="inline" id="S3.SS4.p1.4.m4.1"><semantics id="S3.SS4.p1.4.m4.1a"><msubsup id="S3.SS4.p1.4.m4.1.1" xref="S3.SS4.p1.4.m4.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p1.4.m4.1.1.2.2" xref="S3.SS4.p1.4.m4.1.1.2.2.cmml">ℰ</mi><mtext id="S3.SS4.p1.4.m4.1.1.2.3" xref="S3.SS4.p1.4.m4.1.1.2.3a.cmml">3D</mtext><mtext id="S3.SS4.p1.4.m4.1.1.3" xref="S3.SS4.p1.4.m4.1.1.3a.cmml">render</mtext></msubsup><annotation-xml encoding="MathML-Content" id="S3.SS4.p1.4.m4.1b"><apply 
id="S3.SS4.p1.4.m4.1.1.cmml" xref="S3.SS4.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.4.m4.1.1.1.cmml" xref="S3.SS4.p1.4.m4.1.1">superscript</csymbol><apply id="S3.SS4.p1.4.m4.1.1.2.cmml" xref="S3.SS4.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS4.p1.4.m4.1.1.2.1.cmml" xref="S3.SS4.p1.4.m4.1.1">subscript</csymbol><ci id="S3.SS4.p1.4.m4.1.1.2.2.cmml" xref="S3.SS4.p1.4.m4.1.1.2.2">ℰ</ci><ci id="S3.SS4.p1.4.m4.1.1.2.3a.cmml" xref="S3.SS4.p1.4.m4.1.1.2.3"><mtext id="S3.SS4.p1.4.m4.1.1.2.3.cmml" mathsize="70%" xref="S3.SS4.p1.4.m4.1.1.2.3">3D</mtext></ci></apply><ci id="S3.SS4.p1.4.m4.1.1.3a.cmml" xref="S3.SS4.p1.4.m4.1.1.3"><mtext id="S3.SS4.p1.4.m4.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p1.4.m4.1.1.3">render</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p1.4.m4.1c">\mathcal{E}_{\text{3D}}^{\text{render}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p1.4.m4.1d">caligraphic_E start_POSTSUBSCRIPT 3D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT render end_POSTSUPERSCRIPT</annotation></semantics></math>. The rendered image is supervised using a combination of Mean Squared Error (MSE) and perceptual LPIPS <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib64" title="">64</a>]</cite> losses: </div> <div class="ltx_para" id="S3.SS4.p2"> <table class="ltx_equation ltx_eqn_table" id="S3.E8"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{GDA}}=\mathcal{L}_{\text{mse}}(I_{\text{ref}},I_{\text{sty}% })+\mathcal{L}_{\text{LPIPS}}(I_{\text{ref}},I_{\text{sty}})." 
class="ltx_Math" display="block" id="S3.E8.m1.1"><semantics id="S3.E8.m1.1a"><mrow id="S3.E8.m1.1.1.1" xref="S3.E8.m1.1.1.1.1.cmml"><mrow id="S3.E8.m1.1.1.1.1" xref="S3.E8.m1.1.1.1.1.cmml"><msub id="S3.E8.m1.1.1.1.1.6" xref="S3.E8.m1.1.1.1.1.6.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E8.m1.1.1.1.1.6.2" xref="S3.E8.m1.1.1.1.1.6.2.cmml">ℒ</mi><mtext id="S3.E8.m1.1.1.1.1.6.3" xref="S3.E8.m1.1.1.1.1.6.3a.cmml">GDA</mtext></msub><mo id="S3.E8.m1.1.1.1.1.5" xref="S3.E8.m1.1.1.1.1.5.cmml">=</mo><mrow id="S3.E8.m1.1.1.1.1.4" xref="S3.E8.m1.1.1.1.1.4.cmml"><mrow id="S3.E8.m1.1.1.1.1.2.2" xref="S3.E8.m1.1.1.1.1.2.2.cmml"><msub id="S3.E8.m1.1.1.1.1.2.2.4" xref="S3.E8.m1.1.1.1.1.2.2.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E8.m1.1.1.1.1.2.2.4.2" xref="S3.E8.m1.1.1.1.1.2.2.4.2.cmml">ℒ</mi><mtext id="S3.E8.m1.1.1.1.1.2.2.4.3" xref="S3.E8.m1.1.1.1.1.2.2.4.3a.cmml">mse</mtext></msub><mo id="S3.E8.m1.1.1.1.1.2.2.3" xref="S3.E8.m1.1.1.1.1.2.2.3.cmml">⁢</mo><mrow id="S3.E8.m1.1.1.1.1.2.2.2.2" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml"><mo id="S3.E8.m1.1.1.1.1.2.2.2.2.3" stretchy="false" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml">(</mo><msub id="S3.E8.m1.1.1.1.1.1.1.1.1.1" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S3.E8.m1.1.1.1.1.1.1.1.1.1.2" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.1.1.1.1.1.3" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.3a.cmml">ref</mtext></msub><mo id="S3.E8.m1.1.1.1.1.2.2.2.2.4" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml">,</mo><msub id="S3.E8.m1.1.1.1.1.2.2.2.2.2" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.cmml"><mi id="S3.E8.m1.1.1.1.1.2.2.2.2.2.2" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.2.2.2.2.2.3" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.3a.cmml">sty</mtext></msub><mo id="S3.E8.m1.1.1.1.1.2.2.2.2.5" stretchy="false" xref="S3.E8.m1.1.1.1.1.2.2.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E8.m1.1.1.1.1.4.5" xref="S3.E8.m1.1.1.1.1.4.5.cmml">+</mo><mrow id="S3.E8.m1.1.1.1.1.4.4" xref="S3.E8.m1.1.1.1.1.4.4.cmml"><msub 
id="S3.E8.m1.1.1.1.1.4.4.4" xref="S3.E8.m1.1.1.1.1.4.4.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E8.m1.1.1.1.1.4.4.4.2" xref="S3.E8.m1.1.1.1.1.4.4.4.2.cmml">ℒ</mi><mtext id="S3.E8.m1.1.1.1.1.4.4.4.3" xref="S3.E8.m1.1.1.1.1.4.4.4.3a.cmml">LPIPS</mtext></msub><mo id="S3.E8.m1.1.1.1.1.4.4.3" xref="S3.E8.m1.1.1.1.1.4.4.3.cmml">⁢</mo><mrow id="S3.E8.m1.1.1.1.1.4.4.2.2" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml"><mo id="S3.E8.m1.1.1.1.1.4.4.2.2.3" stretchy="false" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml">(</mo><msub id="S3.E8.m1.1.1.1.1.3.3.1.1.1" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.cmml"><mi id="S3.E8.m1.1.1.1.1.3.3.1.1.1.2" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.3.3.1.1.1.3" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.3a.cmml">ref</mtext></msub><mo id="S3.E8.m1.1.1.1.1.4.4.2.2.4" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml">,</mo><msub id="S3.E8.m1.1.1.1.1.4.4.2.2.2" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.cmml"><mi id="S3.E8.m1.1.1.1.1.4.4.2.2.2.2" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.2.cmml">I</mi><mtext id="S3.E8.m1.1.1.1.1.4.4.2.2.2.3" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.3a.cmml">sty</mtext></msub><mo id="S3.E8.m1.1.1.1.1.4.4.2.2.5" stretchy="false" xref="S3.E8.m1.1.1.1.1.4.4.2.3.cmml">)</mo></mrow></mrow></mrow></mrow><mo id="S3.E8.m1.1.1.1.2" lspace="0em" xref="S3.E8.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E8.m1.1b"><apply id="S3.E8.m1.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1"><eq id="S3.E8.m1.1.1.1.1.5.cmml" xref="S3.E8.m1.1.1.1.1.5"></eq><apply id="S3.E8.m1.1.1.1.1.6.cmml" xref="S3.E8.m1.1.1.1.1.6"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.6.1.cmml" xref="S3.E8.m1.1.1.1.1.6">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.6.2.cmml" xref="S3.E8.m1.1.1.1.1.6.2">ℒ</ci><ci id="S3.E8.m1.1.1.1.1.6.3a.cmml" xref="S3.E8.m1.1.1.1.1.6.3"><mtext id="S3.E8.m1.1.1.1.1.6.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.6.3">GDA</mtext></ci></apply><apply id="S3.E8.m1.1.1.1.1.4.cmml" xref="S3.E8.m1.1.1.1.1.4"><plus id="S3.E8.m1.1.1.1.1.4.5.cmml" 
xref="S3.E8.m1.1.1.1.1.4.5"></plus><apply id="S3.E8.m1.1.1.1.1.2.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2"><times id="S3.E8.m1.1.1.1.1.2.2.3.cmml" xref="S3.E8.m1.1.1.1.1.2.2.3"></times><apply id="S3.E8.m1.1.1.1.1.2.2.4.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.2.2.4.1.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.2.2.4.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4.2">ℒ</ci><ci id="S3.E8.m1.1.1.1.1.2.2.4.3a.cmml" xref="S3.E8.m1.1.1.1.1.2.2.4.3"><mtext id="S3.E8.m1.1.1.1.1.2.2.4.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.2.2.4.3">mse</mtext></ci></apply><interval closure="open" id="S3.E8.m1.1.1.1.1.2.2.2.3.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2"><apply id="S3.E8.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.1.1.1.1.1.3a.cmml" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.3"><mtext id="S3.E8.m1.1.1.1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.1.1.1.1.1.3">ref</mtext></ci></apply><apply id="S3.E8.m1.1.1.1.1.2.2.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.2.2.2.2.2.1.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.2.2.2.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.2.2.2.2.2.3a.cmml" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.3"><mtext id="S3.E8.m1.1.1.1.1.2.2.2.2.2.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.2.2.2.2.2.3">sty</mtext></ci></apply></interval></apply><apply id="S3.E8.m1.1.1.1.1.4.4.cmml" xref="S3.E8.m1.1.1.1.1.4.4"><times id="S3.E8.m1.1.1.1.1.4.4.3.cmml" xref="S3.E8.m1.1.1.1.1.4.4.3"></times><apply id="S3.E8.m1.1.1.1.1.4.4.4.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.4.4.4.1.cmml" 
xref="S3.E8.m1.1.1.1.1.4.4.4">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.4.4.4.2.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4.2">ℒ</ci><ci id="S3.E8.m1.1.1.1.1.4.4.4.3a.cmml" xref="S3.E8.m1.1.1.1.1.4.4.4.3"><mtext id="S3.E8.m1.1.1.1.1.4.4.4.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.4.4.4.3">LPIPS</mtext></ci></apply><interval closure="open" id="S3.E8.m1.1.1.1.1.4.4.2.3.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2"><apply id="S3.E8.m1.1.1.1.1.3.3.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.3.3.1.1.1.1.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.3.3.1.1.1.2.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.3.3.1.1.1.3a.cmml" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.3"><mtext id="S3.E8.m1.1.1.1.1.3.3.1.1.1.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.3.3.1.1.1.3">ref</mtext></ci></apply><apply id="S3.E8.m1.1.1.1.1.4.4.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2"><csymbol cd="ambiguous" id="S3.E8.m1.1.1.1.1.4.4.2.2.2.1.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2">subscript</csymbol><ci id="S3.E8.m1.1.1.1.1.4.4.2.2.2.2.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.2">𝐼</ci><ci id="S3.E8.m1.1.1.1.1.4.4.2.2.2.3a.cmml" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.3"><mtext id="S3.E8.m1.1.1.1.1.4.4.2.2.2.3.cmml" mathsize="70%" xref="S3.E8.m1.1.1.1.1.4.4.2.2.2.3">sty</mtext></ci></apply></interval></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E8.m1.1c">\mathcal{L}_{\text{GDA}}=\mathcal{L}_{\text{mse}}(I_{\text{ref}},I_{\text{sty}% })+\mathcal{L}_{\text{LPIPS}}(I_{\text{ref}},I_{\text{sty}}).</annotation><annotation encoding="application/x-llamapun" id="S3.E8.m1.1d">caligraphic_L start_POSTSUBSCRIPT GDA end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT mse end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT ) + caligraphic_L start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT ( italic_I 
start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT sty end_POSTSUBSCRIPT ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(8)</td> </tr></tbody> </table> Despite potential noise introduced by low-quality GAN inversion, the extensive pre-training on 3D datasets including Objaverse equips our network with strong generalization capabilities, enabling effective real-to-avatar domain adaptation. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.F3" title="Figure 3 ‣ 3.4 Training and Losses ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3</a>, GDA efficiently transforms realistic faces into a primary style while preserving the subjects’ identity and enhancing features, such as eye size. </div> <figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="403" id="S3.F3.g1" src="x3.png" width="830"/> <figcaption class="ltx_caption ltx_centering">Figure 3: Gaussian Domain Adaptation. We show the outputs of the GDA network over several training epochs to visualize the domain shifts from natural images to cartoon avatars.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S3.SS4.p3"> 3D Animatable Stylized Avatar Generation. To improve the surface geometry of avatars, our model incorporates normal consistency and depth distortion losses. 
The normal consistency loss <math alttext="\mathcal{L}_{\text{normal}}" class="ltx_Math" display="inline" id="S3.SS4.p3.1.m1.1"><semantics id="S3.SS4.p3.1.m1.1a"><msub id="S3.SS4.p3.1.m1.1.1" xref="S3.SS4.p3.1.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.1.m1.1.1.2" xref="S3.SS4.p3.1.m1.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.1.m1.1.1.3" xref="S3.SS4.p3.1.m1.1.1.3a.cmml">normal</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.1.m1.1b"><apply id="S3.SS4.p3.1.m1.1.1.cmml" xref="S3.SS4.p3.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.1.m1.1.1.1.cmml" xref="S3.SS4.p3.1.m1.1.1">subscript</csymbol><ci id="S3.SS4.p3.1.m1.1.1.2.cmml" xref="S3.SS4.p3.1.m1.1.1.2">ℒ</ci><ci id="S3.SS4.p3.1.m1.1.1.3a.cmml" xref="S3.SS4.p3.1.m1.1.1.3"><mtext id="S3.SS4.p3.1.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.1.m1.1.1.3">normal</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.1.m1.1c">\mathcal{L}_{\text{normal}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.1.m1.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT</annotation></semantics></math> aligns the normals of 2D Gaussians <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib20" title="">20</a>]</cite> with surface normals determined through finite differences from rendered depths, thereby reducing noise. 
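The idea of comparing rendered normals with depth-derived normals can be sketched as follows. The exact form of the loss is not spelled out here, so this sketch assumes a cosine penalty averaged uniformly over pixels, an orthographic camera with unit pixel spacing, and normals oriented toward positive z; the function names are hypothetical.

```python
import numpy as np

def normals_from_depth(depth):
    """Unit surface normals from an (H, W) depth map via finite differences.

    Assumes an orthographic camera with unit pixel spacing (a simplification).
    """
    dz_dy, dz_dx = np.gradient(depth)  # derivatives along rows (y) and columns (x)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_loss(rendered_normals, depth):
    """Mean of (1 - cosine similarity) between rendered and depth-derived normals."""
    target = normals_from_depth(depth)
    cosine = np.sum(rendered_normals * target, axis=-1)
    return float(np.mean(1.0 - cosine))

# Toy example: a tilted plane as the rendered depth.
H, W = 6, 8
ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
depth = 0.5 * xs + 0.25 * ys + 2.0

plane_normals = normals_from_depth(depth)  # normals that agree with the depth
flat_normals = np.zeros((H, W, 3))
flat_normals[..., 2] = 1.0                 # normals that ignore the tilt

loss_aligned = normal_consistency_loss(plane_normals, depth)
loss_flat = normal_consistency_loss(flat_normals, depth)
```

In this toy setup the loss vanishes when the rendered normals agree with the depth-derived ones and is positive for the flat normals, so minimizing it pulls the two estimates together.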
Meanwhile, the depth distortion loss <math alttext="\mathcal{L}_{\text{dist}}" class="ltx_Math" display="inline" id="S3.SS4.p3.2.m2.1"><semantics id="S3.SS4.p3.2.m2.1a"><msub id="S3.SS4.p3.2.m2.1.1" xref="S3.SS4.p3.2.m2.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.2.m2.1.1.2" xref="S3.SS4.p3.2.m2.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.2.m2.1.1.3" xref="S3.SS4.p3.2.m2.1.1.3a.cmml">dist</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.2.m2.1b"><apply id="S3.SS4.p3.2.m2.1.1.cmml" xref="S3.SS4.p3.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.2.m2.1.1.1.cmml" xref="S3.SS4.p3.2.m2.1.1">subscript</csymbol><ci id="S3.SS4.p3.2.m2.1.1.2.cmml" xref="S3.SS4.p3.2.m2.1.1.2">ℒ</ci><ci id="S3.SS4.p3.2.m2.1.1.3a.cmml" xref="S3.SS4.p3.2.m2.1.1.3"><mtext id="S3.SS4.p3.2.m2.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.2.m2.1.1.3">dist</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.2.m2.1c">\mathcal{L}_{\text{dist}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.2.m2.1d">caligraphic_L start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT</annotation></semantics></math>, implemented following <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib2" title="">2</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib3" title="">3</a>]</cite>, encourages Gaussians to cluster closely along camera rays, effectively enhancing surface representation. This optimization allows our network to output avatars with detailed geometry, suitable for applications such as animation and relighting. 
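To illustrate how a distortion term encourages clustering along a ray, the sketch below assumes the pairwise form sum over i, j of w_i * w_j * |z_i - z_j|, where w are blending weights and z are depths of the Gaussians hit by one ray. This form and the name depth_distortion are assumptions made for illustration; the formulation used in the cited works may differ in detail.

```python
import numpy as np

def depth_distortion(weights, depths):
    """Pairwise distortion along one ray: sum_ij w_i * w_j * |z_i - z_j|."""
    w = np.asarray(weights, dtype=np.float64)
    z = np.asarray(depths, dtype=np.float64)
    pairwise_gap = np.abs(z[:, None] - z[None, :])  # (N, N) absolute depth gaps
    return float(w @ pairwise_gap @ w)

# Toy ray with three contributing Gaussians (hypothetical values).
weights = np.array([0.5, 0.3, 0.2])
spread = depth_distortion(weights, [1.0, 2.0, 3.0])       # Gaussians far apart
clustered = depth_distortion(weights, [2.0, 2.01, 2.02])  # Gaussians close together
```

Under this assumed form, the penalty shrinks as the contributing Gaussians move closer together in depth, which matches the clustering behavior described above.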
The total loss function for the 3D generation network is defined as: <table class="ltx_equation ltx_eqn_table" id="S3.E9"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{3DGen}}=\mathcal{L}_{\text{render}}+\lambda_{\text{LPIPS}}% \mathcal{L}_{\text{LPIPS}}+\lambda_{\text{n}}\mathcal{L}_{\text{normal}}+% \lambda_{\text{d}}\mathcal{L}_{\text{dist}}," class="ltx_Math" display="block" id="S3.E9.m1.1"><semantics id="S3.E9.m1.1a"><mrow id="S3.E9.m1.1.1.1" xref="S3.E9.m1.1.1.1.1.cmml"><mrow id="S3.E9.m1.1.1.1.1" xref="S3.E9.m1.1.1.1.1.cmml"><msub id="S3.E9.m1.1.1.1.1.2" xref="S3.E9.m1.1.1.1.1.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.2.2" xref="S3.E9.m1.1.1.1.1.2.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.2.3" xref="S3.E9.m1.1.1.1.1.2.3a.cmml">3DGen</mtext></msub><mo id="S3.E9.m1.1.1.1.1.1" xref="S3.E9.m1.1.1.1.1.1.cmml">=</mo><mrow id="S3.E9.m1.1.1.1.1.3" xref="S3.E9.m1.1.1.1.1.3.cmml"><msub id="S3.E9.m1.1.1.1.1.3.2" xref="S3.E9.m1.1.1.1.1.3.2.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.2.2" xref="S3.E9.m1.1.1.1.1.3.2.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.2.3" xref="S3.E9.m1.1.1.1.1.3.2.3a.cmml">render</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.1" xref="S3.E9.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E9.m1.1.1.1.1.3.3" xref="S3.E9.m1.1.1.1.1.3.3.cmml"><msub id="S3.E9.m1.1.1.1.1.3.3.2" xref="S3.E9.m1.1.1.1.1.3.3.2.cmml"><mi id="S3.E9.m1.1.1.1.1.3.3.2.2" xref="S3.E9.m1.1.1.1.1.3.3.2.2.cmml">λ</mi><mtext id="S3.E9.m1.1.1.1.1.3.3.2.3" xref="S3.E9.m1.1.1.1.1.3.3.2.3a.cmml">LPIPS</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.3.1" xref="S3.E9.m1.1.1.1.1.3.3.1.cmml">⁢</mo><msub id="S3.E9.m1.1.1.1.1.3.3.3" xref="S3.E9.m1.1.1.1.1.3.3.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.3.3.2" xref="S3.E9.m1.1.1.1.1.3.3.3.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.3.3.3" 
xref="S3.E9.m1.1.1.1.1.3.3.3.3a.cmml">LPIPS</mtext></msub></mrow><mo id="S3.E9.m1.1.1.1.1.3.1a" xref="S3.E9.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E9.m1.1.1.1.1.3.4" xref="S3.E9.m1.1.1.1.1.3.4.cmml"><msub id="S3.E9.m1.1.1.1.1.3.4.2" xref="S3.E9.m1.1.1.1.1.3.4.2.cmml"><mi id="S3.E9.m1.1.1.1.1.3.4.2.2" xref="S3.E9.m1.1.1.1.1.3.4.2.2.cmml">λ</mi><mtext id="S3.E9.m1.1.1.1.1.3.4.2.3" xref="S3.E9.m1.1.1.1.1.3.4.2.3a.cmml">n</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.4.1" xref="S3.E9.m1.1.1.1.1.3.4.1.cmml">⁢</mo><msub id="S3.E9.m1.1.1.1.1.3.4.3" xref="S3.E9.m1.1.1.1.1.3.4.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.4.3.2" xref="S3.E9.m1.1.1.1.1.3.4.3.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.4.3.3" xref="S3.E9.m1.1.1.1.1.3.4.3.3a.cmml">normal</mtext></msub></mrow><mo id="S3.E9.m1.1.1.1.1.3.1b" xref="S3.E9.m1.1.1.1.1.3.1.cmml">+</mo><mrow id="S3.E9.m1.1.1.1.1.3.5" xref="S3.E9.m1.1.1.1.1.3.5.cmml"><msub id="S3.E9.m1.1.1.1.1.3.5.2" xref="S3.E9.m1.1.1.1.1.3.5.2.cmml"><mi id="S3.E9.m1.1.1.1.1.3.5.2.2" xref="S3.E9.m1.1.1.1.1.3.5.2.2.cmml">λ</mi><mtext id="S3.E9.m1.1.1.1.1.3.5.2.3" xref="S3.E9.m1.1.1.1.1.3.5.2.3a.cmml">d</mtext></msub><mo id="S3.E9.m1.1.1.1.1.3.5.1" xref="S3.E9.m1.1.1.1.1.3.5.1.cmml">⁢</mo><msub id="S3.E9.m1.1.1.1.1.3.5.3" xref="S3.E9.m1.1.1.1.1.3.5.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E9.m1.1.1.1.1.3.5.3.2" xref="S3.E9.m1.1.1.1.1.3.5.3.2.cmml">ℒ</mi><mtext id="S3.E9.m1.1.1.1.1.3.5.3.3" xref="S3.E9.m1.1.1.1.1.3.5.3.3a.cmml">dist</mtext></msub></mrow></mrow></mrow><mo id="S3.E9.m1.1.1.1.2" xref="S3.E9.m1.1.1.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E9.m1.1b"><apply id="S3.E9.m1.1.1.1.1.cmml" xref="S3.E9.m1.1.1.1"><eq id="S3.E9.m1.1.1.1.1.1.cmml" xref="S3.E9.m1.1.1.1.1.1"></eq><apply id="S3.E9.m1.1.1.1.1.2.cmml" xref="S3.E9.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.2.1.cmml" xref="S3.E9.m1.1.1.1.1.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.2.2.cmml" 
xref="S3.E9.m1.1.1.1.1.2.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.2.3"><mtext id="S3.E9.m1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.2.3">3DGen</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.cmml" xref="S3.E9.m1.1.1.1.1.3"><plus id="S3.E9.m1.1.1.1.1.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.1"></plus><apply id="S3.E9.m1.1.1.1.1.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.2.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.2.3">render</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.3.cmml" xref="S3.E9.m1.1.1.1.1.3.3"><times id="S3.E9.m1.1.1.1.1.3.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.3.1"></times><apply id="S3.E9.m1.1.1.1.1.3.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.3.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.3.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2.2">𝜆</ci><ci id="S3.E9.m1.1.1.1.1.3.3.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.3.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.3.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.3.2.3">LPIPS</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.3.3.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.3.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.3.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.3.3.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.3.3.3"><mtext id="S3.E9.m1.1.1.1.1.3.3.3.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.3.3.3">LPIPS</mtext></ci></apply></apply><apply id="S3.E9.m1.1.1.1.1.3.4.cmml" xref="S3.E9.m1.1.1.1.1.3.4"><times id="S3.E9.m1.1.1.1.1.3.4.1.cmml" xref="S3.E9.m1.1.1.1.1.3.4.1"></times><apply id="S3.E9.m1.1.1.1.1.3.4.2.cmml" 
xref="S3.E9.m1.1.1.1.1.3.4.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.4.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.4.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.4.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.4.2.2">𝜆</ci><ci id="S3.E9.m1.1.1.1.1.3.4.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.4.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.4.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.4.2.3">n</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.4.3.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.4.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.4.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.4.3.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.4.3.3"><mtext id="S3.E9.m1.1.1.1.1.3.4.3.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.4.3.3">normal</mtext></ci></apply></apply><apply id="S3.E9.m1.1.1.1.1.3.5.cmml" xref="S3.E9.m1.1.1.1.1.3.5"><times id="S3.E9.m1.1.1.1.1.3.5.1.cmml" xref="S3.E9.m1.1.1.1.1.3.5.1"></times><apply id="S3.E9.m1.1.1.1.1.3.5.2.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.5.2.1.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.5.2.2.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2.2">𝜆</ci><ci id="S3.E9.m1.1.1.1.1.3.5.2.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.5.2.3"><mtext id="S3.E9.m1.1.1.1.1.3.5.2.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.5.2.3">d</mtext></ci></apply><apply id="S3.E9.m1.1.1.1.1.3.5.3.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3"><csymbol cd="ambiguous" id="S3.E9.m1.1.1.1.1.3.5.3.1.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3">subscript</csymbol><ci id="S3.E9.m1.1.1.1.1.3.5.3.2.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3.2">ℒ</ci><ci id="S3.E9.m1.1.1.1.1.3.5.3.3a.cmml" xref="S3.E9.m1.1.1.1.1.3.5.3.3"><mtext id="S3.E9.m1.1.1.1.1.3.5.3.3.cmml" mathsize="70%" xref="S3.E9.m1.1.1.1.1.3.5.3.3">dist</mtext></ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.E9.m1.1c">\mathcal{L}_{\text{3DGen}}=\mathcal{L}_{\text{render}}+\lambda_{\text{LPIPS}}% \mathcal{L}_{\text{LPIPS}}+\lambda_{\text{n}}\mathcal{L}_{\text{normal}}+% \lambda_{\text{d}}\mathcal{L}_{\text{dist}},</annotation><annotation encoding="application/x-llamapun" id="S3.E9.m1.1d">caligraphic_L start_POSTSUBSCRIPT 3DGen end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT render end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT n end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT d end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(9)</td> </tr></tbody> </table> where <math alttext="\mathcal{L}_{\text{render}}" class="ltx_Math" display="inline" id="S3.SS4.p3.3.m1.1"><semantics id="S3.SS4.p3.3.m1.1a"><msub id="S3.SS4.p3.3.m1.1.1" xref="S3.SS4.p3.3.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.3.m1.1.1.2" xref="S3.SS4.p3.3.m1.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.3.m1.1.1.3" xref="S3.SS4.p3.3.m1.1.1.3a.cmml">render</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.3.m1.1b"><apply id="S3.SS4.p3.3.m1.1.1.cmml" xref="S3.SS4.p3.3.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.3.m1.1.1.1.cmml" xref="S3.SS4.p3.3.m1.1.1">subscript</csymbol><ci id="S3.SS4.p3.3.m1.1.1.2.cmml" xref="S3.SS4.p3.3.m1.1.1.2">ℒ</ci><ci id="S3.SS4.p3.3.m1.1.1.3a.cmml" xref="S3.SS4.p3.3.m1.1.1.3"><mtext id="S3.SS4.p3.3.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.3.m1.1.1.3">render</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.3.m1.1c">\mathcal{L}_{\text{render}}</annotation><annotation encoding="application/x-llamapun" 
id="S3.SS4.p3.3.m1.1d">caligraphic_L start_POSTSUBSCRIPT render end_POSTSUBSCRIPT</annotation></semantics></math> combines RGB and alpha mask losses: <table class="ltx_equation ltx_eqn_table" id="S3.E10"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{render}}=\|{I}_{\text{posed}}-{I}^{\text{gt}}_{\text{posed}% }\|_{2}^{2}+\|\mathbf{\alpha}^{\text{pred}}-\mathbf{M}^{\text{gt}}\|_{2}^{2}." class="ltx_Math" display="block" id="S3.E10.m1.1"><semantics id="S3.E10.m1.1a"><mrow id="S3.E10.m1.1.1.1" xref="S3.E10.m1.1.1.1.1.cmml"><mrow id="S3.E10.m1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.cmml"><msub id="S3.E10.m1.1.1.1.1.4" xref="S3.E10.m1.1.1.1.1.4.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E10.m1.1.1.1.1.4.2" xref="S3.E10.m1.1.1.1.1.4.2.cmml">ℒ</mi><mtext id="S3.E10.m1.1.1.1.1.4.3" xref="S3.E10.m1.1.1.1.1.4.3a.cmml">render</mtext></msub><mo id="S3.E10.m1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.3.cmml">=</mo><mrow id="S3.E10.m1.1.1.1.1.2" xref="S3.E10.m1.1.1.1.1.2.cmml"><msubsup id="S3.E10.m1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.cmml"><mrow id="S3.E10.m1.1.1.1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.1.1.2.cmml"><mo id="S3.E10.m1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E10.m1.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo><mrow id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.cmml"><msub id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml">I</mi><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3a.cmml">posed</mtext></msub><mo id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1.cmml">−</mo><msubsup id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2" 
xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2.cmml">I</mi><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3a.cmml">posed</mtext><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3a.cmml">gt</mtext></msubsup></mrow><mo id="S3.E10.m1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E10.m1.1.1.1.1.1.1.1.1.2.1.cmml">‖</mo></mrow><mn id="S3.E10.m1.1.1.1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.1.1.1.3.cmml">2</mn><mn id="S3.E10.m1.1.1.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.1.1.3.cmml">2</mn></msubsup><mo id="S3.E10.m1.1.1.1.1.2.3" xref="S3.E10.m1.1.1.1.1.2.3.cmml">+</mo><msubsup id="S3.E10.m1.1.1.1.1.2.2" xref="S3.E10.m1.1.1.1.1.2.2.cmml"><mrow id="S3.E10.m1.1.1.1.1.2.2.1.1.1" xref="S3.E10.m1.1.1.1.1.2.2.1.1.2.cmml"><mo id="S3.E10.m1.1.1.1.1.2.2.1.1.1.2" stretchy="false" xref="S3.E10.m1.1.1.1.1.2.2.1.1.2.1.cmml">‖</mo><mrow id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.cmml"><msup id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.cmml"><mi id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2.cmml">α</mi><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3a.cmml">pred</mtext></msup><mo id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1.cmml">−</mo><msup id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.cmml"><mi id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2.cmml">𝐌</mi><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3a.cmml">gt</mtext></msup></mrow><mo id="S3.E10.m1.1.1.1.1.2.2.1.1.1.3" stretchy="false" xref="S3.E10.m1.1.1.1.1.2.2.1.1.2.1.cmml">‖</mo></mrow><mn id="S3.E10.m1.1.1.1.1.2.2.1.3" xref="S3.E10.m1.1.1.1.1.2.2.1.3.cmml">2</mn><mn id="S3.E10.m1.1.1.1.1.2.2.3" xref="S3.E10.m1.1.1.1.1.2.2.3.cmml">2</mn></msubsup></mrow></mrow><mo id="S3.E10.m1.1.1.1.2" lspace="0em" 
xref="S3.E10.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E10.m1.1b"><apply id="S3.E10.m1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1"><eq id="S3.E10.m1.1.1.1.1.3.cmml" xref="S3.E10.m1.1.1.1.1.3"></eq><apply id="S3.E10.m1.1.1.1.1.4.cmml" xref="S3.E10.m1.1.1.1.1.4"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.4.1.cmml" xref="S3.E10.m1.1.1.1.1.4">subscript</csymbol><ci id="S3.E10.m1.1.1.1.1.4.2.cmml" xref="S3.E10.m1.1.1.1.1.4.2">ℒ</ci><ci id="S3.E10.m1.1.1.1.1.4.3a.cmml" xref="S3.E10.m1.1.1.1.1.4.3"><mtext id="S3.E10.m1.1.1.1.1.4.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.4.3">render</mtext></ci></apply><apply id="S3.E10.m1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2"><plus id="S3.E10.m1.1.1.1.1.2.3.cmml" xref="S3.E10.m1.1.1.1.1.2.3"></plus><apply id="S3.E10.m1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1">superscript</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1">subscript</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S3.E10.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.2">norm</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1"><minus id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.1"></minus><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.2">𝐼</ci><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3a.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3"><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3.cmml" mathsize="70%" 
xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.2.3">posed</mtext></ci></apply><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><apply id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.1.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3">superscript</csymbol><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.2">𝐼</ci><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3a.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3"><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.2.3">gt</mtext></ci></apply><ci id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.1.1.1.1.1.1.3.3">posed</mtext></ci></apply></apply></apply><cn id="S3.E10.m1.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.1.1.1.3">2</cn></apply><cn id="S3.E10.m1.1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.1.1.3">2</cn></apply><apply id="S3.E10.m1.1.1.1.1.2.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2">superscript</csymbol><apply id="S3.E10.m1.1.1.1.1.2.2.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2">subscript</csymbol><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1"><csymbol cd="latexml" id="S3.E10.m1.1.1.1.1.2.2.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.2">norm</csymbol><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1"><minus id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1.cmml" 
xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.1"></minus><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2">superscript</csymbol><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.2">𝛼</ci><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3a.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3"><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.2.3">pred</mtext></ci></apply><apply id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.1.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3">superscript</csymbol><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.2">𝐌</ci><ci id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3a.cmml" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3"><mtext id="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E10.m1.1.1.1.1.2.2.1.1.1.1.3.3">gt</mtext></ci></apply></apply></apply><cn id="S3.E10.m1.1.1.1.1.2.2.1.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.2.2.1.3">2</cn></apply><cn id="S3.E10.m1.1.1.1.1.2.2.3.cmml" type="integer" xref="S3.E10.m1.1.1.1.1.2.2.3">2</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E10.m1.1c">\mathcal{L}_{\text{render}}=\|{I}_{\text{posed}}-{I}^{\text{gt}}_{\text{posed}% }\|_{2}^{2}+\|\mathbf{\alpha}^{\text{pred}}-\mathbf{M}^{\text{gt}}\|_{2}^{2}.</annotation><annotation encoding="application/x-llamapun" id="S3.E10.m1.1d">caligraphic_L start_POSTSUBSCRIPT render end_POSTSUBSCRIPT = ∥ italic_I start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT - italic_I start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_α start_POSTSUPERSCRIPT pred 
end_POSTSUPERSCRIPT - bold_M start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(10)</td> </tr></tbody> </table> <math alttext="\mathcal{L}_{\text{normal}}" class="ltx_Math" display="inline" id="S3.SS4.p3.4.m1.1"><semantics id="S3.SS4.p3.4.m1.1a"><msub id="S3.SS4.p3.4.m1.1.1" xref="S3.SS4.p3.4.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.4.m1.1.1.2" xref="S3.SS4.p3.4.m1.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.4.m1.1.1.3" xref="S3.SS4.p3.4.m1.1.1.3a.cmml">normal</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.4.m1.1b"><apply id="S3.SS4.p3.4.m1.1.1.cmml" xref="S3.SS4.p3.4.m1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.4.m1.1.1.1.cmml" xref="S3.SS4.p3.4.m1.1.1">subscript</csymbol><ci id="S3.SS4.p3.4.m1.1.1.2.cmml" xref="S3.SS4.p3.4.m1.1.1.2">ℒ</ci><ci id="S3.SS4.p3.4.m1.1.1.3a.cmml" xref="S3.SS4.p3.4.m1.1.1.3"><mtext id="S3.SS4.p3.4.m1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.4.m1.1.1.3">normal</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.4.m1.1c">\mathcal{L}_{\text{normal}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.4.m1.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT</annotation></semantics></math> aligns predicted normals with surface normals: <table class="ltx_equation ltx_eqn_table" id="S3.E11"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="\mathcal{L}_{\text{normal}}=1-(\mathbf{n}^{\text{pred}}\cdot\mathbf{n}^{\text{% surf}})." 
class="ltx_Math" display="block" id="S3.E11.m1.1"><semantics id="S3.E11.m1.1a"><mrow id="S3.E11.m1.1.1.1" xref="S3.E11.m1.1.1.1.1.cmml"><mrow id="S3.E11.m1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.cmml"><msub id="S3.E11.m1.1.1.1.1.3" xref="S3.E11.m1.1.1.1.1.3.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.E11.m1.1.1.1.1.3.2" xref="S3.E11.m1.1.1.1.1.3.2.cmml">ℒ</mi><mtext id="S3.E11.m1.1.1.1.1.3.3" xref="S3.E11.m1.1.1.1.1.3.3a.cmml">normal</mtext></msub><mo id="S3.E11.m1.1.1.1.1.2" xref="S3.E11.m1.1.1.1.1.2.cmml">=</mo><mrow id="S3.E11.m1.1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.1.cmml"><mn id="S3.E11.m1.1.1.1.1.1.3" xref="S3.E11.m1.1.1.1.1.1.3.cmml">1</mn><mo id="S3.E11.m1.1.1.1.1.1.2" xref="S3.E11.m1.1.1.1.1.1.2.cmml">−</mo><mrow id="S3.E11.m1.1.1.1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S3.E11.m1.1.1.1.1.1.1.1.2" stretchy="false" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S3.E11.m1.1.1.1.1.1.1.1.1" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml"><msup id="S3.E11.m1.1.1.1.1.1.1.1.1.2" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.cmml"><mi id="S3.E11.m1.1.1.1.1.1.1.1.1.2.2" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.2.cmml">𝐧</mi><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.2.3" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.3a.cmml">pred</mtext></msup><mo id="S3.E11.m1.1.1.1.1.1.1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.E11.m1.1.1.1.1.1.1.1.1.1.cmml">⋅</mo><msup id="S3.E11.m1.1.1.1.1.1.1.1.1.3" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S3.E11.m1.1.1.1.1.1.1.1.1.3.2" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.2.cmml">𝐧</mi><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.3.3" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.3a.cmml">surf</mtext></msup></mrow><mo id="S3.E11.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S3.E11.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S3.E11.m1.1.1.1.2" lspace="0em" xref="S3.E11.m1.1.1.1.1.cmml">.</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E11.m1.1b"><apply id="S3.E11.m1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1"><eq id="S3.E11.m1.1.1.1.1.2.cmml" 
xref="S3.E11.m1.1.1.1.1.2"></eq><apply id="S3.E11.m1.1.1.1.1.3.cmml" xref="S3.E11.m1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E11.m1.1.1.1.1.3.1.cmml" xref="S3.E11.m1.1.1.1.1.3">subscript</csymbol><ci id="S3.E11.m1.1.1.1.1.3.2.cmml" xref="S3.E11.m1.1.1.1.1.3.2">ℒ</ci><ci id="S3.E11.m1.1.1.1.1.3.3a.cmml" xref="S3.E11.m1.1.1.1.1.3.3"><mtext id="S3.E11.m1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E11.m1.1.1.1.1.3.3">normal</mtext></ci></apply><apply id="S3.E11.m1.1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1.1.1"><minus id="S3.E11.m1.1.1.1.1.1.2.cmml" xref="S3.E11.m1.1.1.1.1.1.2"></minus><cn id="S3.E11.m1.1.1.1.1.1.3.cmml" type="integer" xref="S3.E11.m1.1.1.1.1.1.3">1</cn><apply id="S3.E11.m1.1.1.1.1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1"><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.1">⋅</ci><apply id="S3.E11.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S3.E11.m1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2">superscript</csymbol><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.2">𝐧</ci><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.2.3a.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.3"><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.2.3.cmml" mathsize="70%" xref="S3.E11.m1.1.1.1.1.1.1.1.1.2.3">pred</mtext></ci></apply><apply id="S3.E11.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S3.E11.m1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3">superscript</csymbol><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.2">𝐧</ci><ci id="S3.E11.m1.1.1.1.1.1.1.1.1.3.3a.cmml" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.3"><mtext id="S3.E11.m1.1.1.1.1.1.1.1.1.3.3.cmml" mathsize="70%" xref="S3.E11.m1.1.1.1.1.1.1.1.1.3.3">surf</mtext></ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E11.m1.1c">\mathcal{L}_{\text{normal}}=1-(\mathbf{n}^{\text{pred}}\cdot\mathbf{n}^{\text{% 
surf}}).</annotation><annotation encoding="application/x-llamapun" id="S3.E11.m1.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT = 1 - ( bold_n start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT ⋅ bold_n start_POSTSUPERSCRIPT surf end_POSTSUPERSCRIPT ) .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(11)</td> </tr></tbody> </table> Here, <math alttext="{I}_{\text{posed}},{I}^{\text{gt}}_{\text{posed}}" class="ltx_Math" display="inline" id="S3.SS4.p3.5.m1.2"><semantics id="S3.SS4.p3.5.m1.2a"><mrow id="S3.SS4.p3.5.m1.2.2.2" xref="S3.SS4.p3.5.m1.2.2.3.cmml"><msub id="S3.SS4.p3.5.m1.1.1.1.1" xref="S3.SS4.p3.5.m1.1.1.1.1.cmml"><mi id="S3.SS4.p3.5.m1.1.1.1.1.2" xref="S3.SS4.p3.5.m1.1.1.1.1.2.cmml">I</mi><mtext id="S3.SS4.p3.5.m1.1.1.1.1.3" xref="S3.SS4.p3.5.m1.1.1.1.1.3a.cmml">posed</mtext></msub><mo id="S3.SS4.p3.5.m1.2.2.2.3" xref="S3.SS4.p3.5.m1.2.2.3.cmml">,</mo><msubsup id="S3.SS4.p3.5.m1.2.2.2.2" xref="S3.SS4.p3.5.m1.2.2.2.2.cmml"><mi id="S3.SS4.p3.5.m1.2.2.2.2.2.2" xref="S3.SS4.p3.5.m1.2.2.2.2.2.2.cmml">I</mi><mtext id="S3.SS4.p3.5.m1.2.2.2.2.3" xref="S3.SS4.p3.5.m1.2.2.2.2.3a.cmml">posed</mtext><mtext id="S3.SS4.p3.5.m1.2.2.2.2.2.3" xref="S3.SS4.p3.5.m1.2.2.2.2.2.3a.cmml">gt</mtext></msubsup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.5.m1.2b"><list id="S3.SS4.p3.5.m1.2.2.3.cmml" xref="S3.SS4.p3.5.m1.2.2.2"><apply id="S3.SS4.p3.5.m1.1.1.1.1.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.5.m1.1.1.1.1.1.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1">subscript</csymbol><ci id="S3.SS4.p3.5.m1.1.1.1.1.2.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1.2">𝐼</ci><ci id="S3.SS4.p3.5.m1.1.1.1.1.3a.cmml" xref="S3.SS4.p3.5.m1.1.1.1.1.3"><mtext id="S3.SS4.p3.5.m1.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.5.m1.1.1.1.1.3">posed</mtext></ci></apply><apply id="S3.SS4.p3.5.m1.2.2.2.2.cmml" 
xref="S3.SS4.p3.5.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.5.m1.2.2.2.2.1.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2">subscript</csymbol><apply id="S3.SS4.p3.5.m1.2.2.2.2.2.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.5.m1.2.2.2.2.2.1.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2">superscript</csymbol><ci id="S3.SS4.p3.5.m1.2.2.2.2.2.2.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2.2.2">𝐼</ci><ci id="S3.SS4.p3.5.m1.2.2.2.2.2.3a.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2.2.3"><mtext id="S3.SS4.p3.5.m1.2.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.5.m1.2.2.2.2.2.3">gt</mtext></ci></apply><ci id="S3.SS4.p3.5.m1.2.2.2.2.3a.cmml" xref="S3.SS4.p3.5.m1.2.2.2.2.3"><mtext id="S3.SS4.p3.5.m1.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.5.m1.2.2.2.2.3">posed</mtext></ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.5.m1.2c">{I}_{\text{posed}},{I}^{\text{gt}}_{\text{posed}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.5.m1.2d">italic_I start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT , italic_I start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT start_POSTSUBSCRIPT posed end_POSTSUBSCRIPT</annotation></semantics></math> are the predicted and ground-truth images; <math alttext="\mathbf{\alpha}^{\text{pred}}" class="ltx_Math" display="inline" id="S3.SS4.p3.6.m2.1"><semantics id="S3.SS4.p3.6.m2.1a"><msup id="S3.SS4.p3.6.m2.1.1" xref="S3.SS4.p3.6.m2.1.1.cmml"><mi id="S3.SS4.p3.6.m2.1.1.2" xref="S3.SS4.p3.6.m2.1.1.2.cmml">α</mi><mtext id="S3.SS4.p3.6.m2.1.1.3" xref="S3.SS4.p3.6.m2.1.1.3a.cmml">pred</mtext></msup><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.6.m2.1b"><apply id="S3.SS4.p3.6.m2.1.1.cmml" xref="S3.SS4.p3.6.m2.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.6.m2.1.1.1.cmml" xref="S3.SS4.p3.6.m2.1.1">superscript</csymbol><ci id="S3.SS4.p3.6.m2.1.1.2.cmml" xref="S3.SS4.p3.6.m2.1.1.2">𝛼</ci><ci id="S3.SS4.p3.6.m2.1.1.3a.cmml" xref="S3.SS4.p3.6.m2.1.1.3"><mtext id="S3.SS4.p3.6.m2.1.1.3.cmml" mathsize="70%" 
xref="S3.SS4.p3.6.m2.1.1.3">pred</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.6.m2.1c">\mathbf{\alpha}^{\text{pred}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.6.m2.1d">italic_α start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT</annotation></semantics></math> and <math alttext="\mathbf{M}^{\text{gt}}" class="ltx_Math" display="inline" id="S3.SS4.p3.7.m3.1"><semantics id="S3.SS4.p3.7.m3.1a"><msup id="S3.SS4.p3.7.m3.1.1" xref="S3.SS4.p3.7.m3.1.1.cmml"><mi id="S3.SS4.p3.7.m3.1.1.2" xref="S3.SS4.p3.7.m3.1.1.2.cmml">𝐌</mi><mtext id="S3.SS4.p3.7.m3.1.1.3" xref="S3.SS4.p3.7.m3.1.1.3a.cmml">gt</mtext></msup><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.7.m3.1b"><apply id="S3.SS4.p3.7.m3.1.1.cmml" xref="S3.SS4.p3.7.m3.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.7.m3.1.1.1.cmml" xref="S3.SS4.p3.7.m3.1.1">superscript</csymbol><ci id="S3.SS4.p3.7.m3.1.1.2.cmml" xref="S3.SS4.p3.7.m3.1.1.2">𝐌</ci><ci id="S3.SS4.p3.7.m3.1.1.3a.cmml" xref="S3.SS4.p3.7.m3.1.1.3"><mtext id="S3.SS4.p3.7.m3.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.7.m3.1.1.3">gt</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.7.m3.1c">\mathbf{M}^{\text{gt}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.7.m3.1d">bold_M start_POSTSUPERSCRIPT gt end_POSTSUPERSCRIPT</annotation></semantics></math> are the predicted alpha mask and its ground truth counterpart; <math alttext="\mathbf{n}^{\text{pred}},\mathbf{n}^{\text{surf}}" class="ltx_Math" display="inline" id="S3.SS4.p3.8.m4.2"><semantics id="S3.SS4.p3.8.m4.2a"><mrow id="S3.SS4.p3.8.m4.2.2.2" xref="S3.SS4.p3.8.m4.2.2.3.cmml"><msup id="S3.SS4.p3.8.m4.1.1.1.1" xref="S3.SS4.p3.8.m4.1.1.1.1.cmml"><mi id="S3.SS4.p3.8.m4.1.1.1.1.2" xref="S3.SS4.p3.8.m4.1.1.1.1.2.cmml">𝐧</mi><mtext id="S3.SS4.p3.8.m4.1.1.1.1.3" xref="S3.SS4.p3.8.m4.1.1.1.1.3a.cmml">pred</mtext></msup><mo id="S3.SS4.p3.8.m4.2.2.2.3" 
xref="S3.SS4.p3.8.m4.2.2.3.cmml">,</mo><msup id="S3.SS4.p3.8.m4.2.2.2.2" xref="S3.SS4.p3.8.m4.2.2.2.2.cmml"><mi id="S3.SS4.p3.8.m4.2.2.2.2.2" xref="S3.SS4.p3.8.m4.2.2.2.2.2.cmml">𝐧</mi><mtext id="S3.SS4.p3.8.m4.2.2.2.2.3" xref="S3.SS4.p3.8.m4.2.2.2.2.3a.cmml">surf</mtext></msup></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.8.m4.2b"><list id="S3.SS4.p3.8.m4.2.2.3.cmml" xref="S3.SS4.p3.8.m4.2.2.2"><apply id="S3.SS4.p3.8.m4.1.1.1.1.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.8.m4.1.1.1.1.1.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1">superscript</csymbol><ci id="S3.SS4.p3.8.m4.1.1.1.1.2.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1.2">𝐧</ci><ci id="S3.SS4.p3.8.m4.1.1.1.1.3a.cmml" xref="S3.SS4.p3.8.m4.1.1.1.1.3"><mtext id="S3.SS4.p3.8.m4.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.8.m4.1.1.1.1.3">pred</mtext></ci></apply><apply id="S3.SS4.p3.8.m4.2.2.2.2.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.8.m4.2.2.2.2.1.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2">superscript</csymbol><ci id="S3.SS4.p3.8.m4.2.2.2.2.2.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2.2">𝐧</ci><ci id="S3.SS4.p3.8.m4.2.2.2.2.3a.cmml" xref="S3.SS4.p3.8.m4.2.2.2.2.3"><mtext id="S3.SS4.p3.8.m4.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.8.m4.2.2.2.2.3">surf</mtext></ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.8.m4.2c">\mathbf{n}^{\text{pred}},\mathbf{n}^{\text{surf}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.8.m4.2d">bold_n start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT , bold_n start_POSTSUPERSCRIPT surf end_POSTSUPERSCRIPT</annotation></semantics></math> are predicted and surface normal vectors. 
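As a minimal sketch of how Eqs. (9) to (11) fit together, the dependency-free Python snippet below evaluates the combined objective on flattened pixel lists. All function and parameter names, the per-pixel averaging of the normal term, the placeholder weights, and the geometry_on switch (which mirrors the delayed start of the normal and distortion terms) are illustrative assumptions rather than the authors' implementation. The LPIPS and distortion terms are passed in as precomputed scalars because this section does not define them.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equally long sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def render_loss(img_pred, img_gt, alpha_pred, mask_gt):
    """Eq. (10): squared L2 error on RGB values plus squared L2 error on the alpha mask."""
    return squared_l2(img_pred, img_gt) + squared_l2(alpha_pred, mask_gt)


def normal_loss(n_pred, n_surf):
    """Eq. (11) per pixel, 1 - (n_pred . n_surf), averaged over pixels (assumed reduction)."""
    terms = [1.0 - sum(p * s for p, s in zip(a, b)) for a, b in zip(n_pred, n_surf)]
    return sum(terms) / len(terms)


def geometry_enabled(step, total_steps, start_fraction=0.2):
    """True once training has passed the fraction at which the geometry terms start."""
    return step >= start_fraction * total_steps


def total_loss(l_render, l_lpips, l_normal, l_dist,
               lam_lpips, lam_n, lam_d, geometry_on=True):
    """Eq. (9); the normal and distortion terms are dropped while geometry_on is False."""
    loss = l_render + lam_lpips * l_lpips
    if geometry_on:
        loss += lam_n * l_normal + lam_d * l_dist
    return loss


# Toy usage with made-up values (the weights are placeholders, not the paper's settings).
l_r = render_loss([0.5, 0.5, 0.5], [0.5, 0.0, 0.5], [1.0], [1.0])
l_n = normal_loss([(0.0, 0.0, 1.0)], [(0.0, 0.0, 1.0)])
on = geometry_enabled(step=100, total_steps=1000)
print(total_loss(l_r, 0.1, l_n, 0.0, lam_lpips=1.0, lam_n=0.05, lam_d=0.1, geometry_on=on))
```

Keeping the schedule in a separate predicate leaves the loss itself a pure function of its inputs, so the switch-on point can be changed without touching Eq. (9).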
<math alttext="\lambda_{\text{LPIPS}},\lambda_{\text{n}},\lambda_{\text{d}}" class="ltx_Math" display="inline" id="S3.SS4.p3.9.m5.3"><semantics id="S3.SS4.p3.9.m5.3a"><mrow id="S3.SS4.p3.9.m5.3.3.3" xref="S3.SS4.p3.9.m5.3.3.4.cmml"><msub id="S3.SS4.p3.9.m5.1.1.1.1" xref="S3.SS4.p3.9.m5.1.1.1.1.cmml"><mi id="S3.SS4.p3.9.m5.1.1.1.1.2" xref="S3.SS4.p3.9.m5.1.1.1.1.2.cmml">λ</mi><mtext id="S3.SS4.p3.9.m5.1.1.1.1.3" xref="S3.SS4.p3.9.m5.1.1.1.1.3a.cmml">LPIPS</mtext></msub><mo id="S3.SS4.p3.9.m5.3.3.3.4" xref="S3.SS4.p3.9.m5.3.3.4.cmml">,</mo><msub id="S3.SS4.p3.9.m5.2.2.2.2" xref="S3.SS4.p3.9.m5.2.2.2.2.cmml"><mi id="S3.SS4.p3.9.m5.2.2.2.2.2" xref="S3.SS4.p3.9.m5.2.2.2.2.2.cmml">λ</mi><mtext id="S3.SS4.p3.9.m5.2.2.2.2.3" xref="S3.SS4.p3.9.m5.2.2.2.2.3a.cmml">n</mtext></msub><mo id="S3.SS4.p3.9.m5.3.3.3.5" xref="S3.SS4.p3.9.m5.3.3.4.cmml">,</mo><msub id="S3.SS4.p3.9.m5.3.3.3.3" xref="S3.SS4.p3.9.m5.3.3.3.3.cmml"><mi id="S3.SS4.p3.9.m5.3.3.3.3.2" xref="S3.SS4.p3.9.m5.3.3.3.3.2.cmml">λ</mi><mtext id="S3.SS4.p3.9.m5.3.3.3.3.3" xref="S3.SS4.p3.9.m5.3.3.3.3.3a.cmml">d</mtext></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.9.m5.3b"><list id="S3.SS4.p3.9.m5.3.3.4.cmml" xref="S3.SS4.p3.9.m5.3.3.3"><apply id="S3.SS4.p3.9.m5.1.1.1.1.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.9.m5.1.1.1.1.1.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1">subscript</csymbol><ci id="S3.SS4.p3.9.m5.1.1.1.1.2.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1.2">𝜆</ci><ci id="S3.SS4.p3.9.m5.1.1.1.1.3a.cmml" xref="S3.SS4.p3.9.m5.1.1.1.1.3"><mtext id="S3.SS4.p3.9.m5.1.1.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.9.m5.1.1.1.1.3">LPIPS</mtext></ci></apply><apply id="S3.SS4.p3.9.m5.2.2.2.2.cmml" xref="S3.SS4.p3.9.m5.2.2.2.2"><csymbol cd="ambiguous" id="S3.SS4.p3.9.m5.2.2.2.2.1.cmml" xref="S3.SS4.p3.9.m5.2.2.2.2">subscript</csymbol><ci id="S3.SS4.p3.9.m5.2.2.2.2.2.cmml" xref="S3.SS4.p3.9.m5.2.2.2.2.2">𝜆</ci><ci id="S3.SS4.p3.9.m5.2.2.2.2.3a.cmml" 
xref="S3.SS4.p3.9.m5.2.2.2.2.3"><mtext id="S3.SS4.p3.9.m5.2.2.2.2.3.cmml" mathsize="70%" xref="S3.SS4.p3.9.m5.2.2.2.2.3">n</mtext></ci></apply><apply id="S3.SS4.p3.9.m5.3.3.3.3.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3"><csymbol cd="ambiguous" id="S3.SS4.p3.9.m5.3.3.3.3.1.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3">subscript</csymbol><ci id="S3.SS4.p3.9.m5.3.3.3.3.2.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3.2">𝜆</ci><ci id="S3.SS4.p3.9.m5.3.3.3.3.3a.cmml" xref="S3.SS4.p3.9.m5.3.3.3.3.3"><mtext id="S3.SS4.p3.9.m5.3.3.3.3.3.cmml" mathsize="70%" xref="S3.SS4.p3.9.m5.3.3.3.3.3">d</mtext></ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.9.m5.3c">\lambda_{\text{LPIPS}},\lambda_{\text{n}},\lambda_{\text{d}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.9.m5.3d">italic_λ start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT n end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT d end_POSTSUBSCRIPT</annotation></semantics></math> are the weights for <math alttext="\mathcal{L}_{\text{LPIPS}}" class="ltx_Math" display="inline" id="S3.SS4.p3.10.m6.1"><semantics id="S3.SS4.p3.10.m6.1a"><msub id="S3.SS4.p3.10.m6.1.1" xref="S3.SS4.p3.10.m6.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.10.m6.1.1.2" xref="S3.SS4.p3.10.m6.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.10.m6.1.1.3" xref="S3.SS4.p3.10.m6.1.1.3a.cmml">LPIPS</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.10.m6.1b"><apply id="S3.SS4.p3.10.m6.1.1.cmml" xref="S3.SS4.p3.10.m6.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.10.m6.1.1.1.cmml" xref="S3.SS4.p3.10.m6.1.1">subscript</csymbol><ci id="S3.SS4.p3.10.m6.1.1.2.cmml" xref="S3.SS4.p3.10.m6.1.1.2">ℒ</ci><ci id="S3.SS4.p3.10.m6.1.1.3a.cmml" xref="S3.SS4.p3.10.m6.1.1.3"><mtext id="S3.SS4.p3.10.m6.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.10.m6.1.1.3">LPIPS</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS4.p3.10.m6.1c">\mathcal{L}_{\text{LPIPS}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.10.m6.1d">caligraphic_L start_POSTSUBSCRIPT LPIPS end_POSTSUBSCRIPT</annotation></semantics></math>, <math alttext="\mathcal{L}_{\text{normal}}" class="ltx_Math" display="inline" id="S3.SS4.p3.11.m7.1"><semantics id="S3.SS4.p3.11.m7.1a"><msub id="S3.SS4.p3.11.m7.1.1" xref="S3.SS4.p3.11.m7.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.11.m7.1.1.2" xref="S3.SS4.p3.11.m7.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.11.m7.1.1.3" xref="S3.SS4.p3.11.m7.1.1.3a.cmml">normal</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.11.m7.1b"><apply id="S3.SS4.p3.11.m7.1.1.cmml" xref="S3.SS4.p3.11.m7.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.11.m7.1.1.1.cmml" xref="S3.SS4.p3.11.m7.1.1">subscript</csymbol><ci id="S3.SS4.p3.11.m7.1.1.2.cmml" xref="S3.SS4.p3.11.m7.1.1.2">ℒ</ci><ci id="S3.SS4.p3.11.m7.1.1.3a.cmml" xref="S3.SS4.p3.11.m7.1.1.3"><mtext id="S3.SS4.p3.11.m7.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.11.m7.1.1.3">normal</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.11.m7.1c">\mathcal{L}_{\text{normal}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.11.m7.1d">caligraphic_L start_POSTSUBSCRIPT normal end_POSTSUBSCRIPT</annotation></semantics></math>, and <math alttext="\mathcal{L}_{\text{dist}}" class="ltx_Math" display="inline" id="S3.SS4.p3.12.m8.1"><semantics id="S3.SS4.p3.12.m8.1a"><msub id="S3.SS4.p3.12.m8.1.1" xref="S3.SS4.p3.12.m8.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S3.SS4.p3.12.m8.1.1.2" xref="S3.SS4.p3.12.m8.1.1.2.cmml">ℒ</mi><mtext id="S3.SS4.p3.12.m8.1.1.3" xref="S3.SS4.p3.12.m8.1.1.3a.cmml">dist</mtext></msub><annotation-xml encoding="MathML-Content" id="S3.SS4.p3.12.m8.1b"><apply id="S3.SS4.p3.12.m8.1.1.cmml" xref="S3.SS4.p3.12.m8.1.1"><csymbol cd="ambiguous" id="S3.SS4.p3.12.m8.1.1.1.cmml" 
xref="S3.SS4.p3.12.m8.1.1">subscript</csymbol><ci id="S3.SS4.p3.12.m8.1.1.2.cmml" xref="S3.SS4.p3.12.m8.1.1.2">ℒ</ci><ci id="S3.SS4.p3.12.m8.1.1.3a.cmml" xref="S3.SS4.p3.12.m8.1.1.3"><mtext id="S3.SS4.p3.12.m8.1.1.3.cmml" mathsize="70%" xref="S3.SS4.p3.12.m8.1.1.3">dist</mtext></ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS4.p3.12.m8.1c">\mathcal{L}_{\text{dist}}</annotation><annotation encoding="application/x-llamapun" id="S3.SS4.p3.12.m8.1d">caligraphic_L start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT</annotation></semantics></math>, respectively. The normal and distortion losses commence after 20% of training to first establish basic appearance convergence. </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> 4 Experiments</h2> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> 4.1 2D Stylized Avatar Generation </h3> <div class="ltx_para ltx_noindent" id="S4.SS1.p1"> Baselines. We evaluate GDA for 2D stylized avatar generation and compare with GAN inversion and diffusion-based methods, both fine-tuned on our Bitmoji dataset. For GAN inversion, we use a SemanticStyleGAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib50" title="">50</a>]</cite> model, translating real faces into avatars by inverting them into the latent space of a fine-tuned model. 
The diffusion-based method employs Stable Diffusion 1.5 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib43" title="">43</a>]</cite> fine-tuned with LoRA <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib18" title="">18</a>]</cite>, using BLIP-2 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib24" title="">24</a>]</cite> for avatar captioning and IP Adapter Plus Face <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib60" title="">60</a>]</cite> for identity conditioning. Both methods aim to efficiently create primary-style avatars that retain the identity of the input images. </div> <figure class="ltx_figure" id="S4.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="419" id="S4.F4.g1" src="x4.png" width="875"/> <figcaption class="ltx_caption ltx_centering">Figure 4: 2D Stylized Avatar Generation. This figure showcases the transformation of photos from eight individuals into the Bitmoji domain using various methods. GAN inversion produces overly generic avatars, struggling with unique features such as beards, glasses, and headwear. Diffusion-based models inaccurately add features, making them inconsistent for targeted styles. In contrast, our GDA method excels in creating high-quality avatars, effectively retaining the original identity features.</figcaption> </figure> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="447" id="S4.F5.g1" src="x5.png" width="875"/> <figcaption class="ltx_caption ltx_centering">Figure 5: 2D Stylized Avatar to 3D Generation. We demonstrate the process of converting dual-stylized avatar images, derived from the single-stylized avatars in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F4" title="Figure 4 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4</a>, into 3D avatars. PTI inversion with EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib42" title="">42</a>]</cite> struggles to accurately reproduce 3D geometry, while LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite> produces artifacts in both geometry and texture. Despite being trained exclusively on the Bitmoji style, our method successfully generates high-quality 3D avatars in previously unseen styles.</figcaption> </figure> <figure class="ltx_figure" id="S4.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="380" id="S4.F6.g1" src="x6.png" width="789"/> <figcaption class="ltx_caption ltx_centering">Figure 6: Single Portrait to 3D Generation. We compare Snapmoji with DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title="">23</a>]</cite> in the context of 3D toonification. For each method and style, we render outputs from two viewpoints alongside a normal map. DATID-3D exhibits typical GAN-related issues, such as poor identity preservation and limited stylistic diversity, resulting in similar outputs across different identities. Conversely, Snapmoji effectively maintains identity and produces distinct styles, showcasing superior image quality and sharper geometry.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S4.SS1.p2"> Evaluation. 
We evaluated each method on 100 randomly selected faces from the FFHQ dataset, assessing visual quality, identity retention, and speed. Visual quality was measured through FID and KID scores, comparing the transformed images to the Bitmoji dataset. Identity retention was evaluated using ArcFace <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib11" title="">11</a>]</cite>, while speed was benchmarked on an Nvidia L4 GPU. As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.T2" title="Table 2 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">2</a>, GDA outperforms the other methods across all metrics, achieving an FID more than 20 points lower than those of GAN inversion and diffusion. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F4" title="Figure 4 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4</a> visually highlights these quality differences: GAN inversion produces avatars with limited diversity, struggling to avoid generic outputs due to challenges in generating out-of-distribution images. The diffusion approach fails to maintain a consistent style and often incorrectly introduces features like glasses, undermining both style and identity preservation. In contrast, GDA excels at producing avatars with a consistent style that retain key identity features such as eye color, sunglasses, and hairstyles. GDA is also efficient: it requires only a single forward pass through a U-Net and completes a translation in 0.080 seconds, which is about 44× faster than diffusion (3.54 s) and more than 1,200× faster than GAN inversion (98.14 s). 
Surprisingly, even though GDA uses data generated from GAN inversion for training, it produces images that are more detailed due to the learned 3D prior from the Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title="">10</a>]</cite> dataset, which enhances its generalization capability. </div> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> 4.2 3D Dual-Stylized Avatar Generation </h3> <figure class="ltx_table" id="S4.T2"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T2.4" style="width:433.6pt;height:173.7pt;vertical-align:-0.0pt;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T2.4.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T2.4.4.4"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.4.4.4.5" style="padding-left:0.0pt;padding-right:0.0pt;"></th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.1.1.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">FID <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T2.1.1.1.1.m1.1"><semantics id="S4.T2.1.1.1.1.m1.1a"><mo id="S4.T2.1.1.1.1.m1.1.1" stretchy="false" xref="S4.T2.1.1.1.1.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T2.1.1.1.1.m1.1b"><ci id="S4.T2.1.1.1.1.m1.1.1.cmml" xref="S4.T2.1.1.1.1.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.1.1.1.1.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.1.1.1.1.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.2.2.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">KID <math alttext="\downarrow" class="ltx_Math" display="inline" 
id="S4.T2.2.2.2.2.m1.1"><semantics id="S4.T2.2.2.2.2.m1.1a"><mo id="S4.T2.2.2.2.2.m1.1.1" stretchy="false" xref="S4.T2.2.2.2.2.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T2.2.2.2.2.m1.1b"><ci id="S4.T2.2.2.2.2.m1.1.1.cmml" xref="S4.T2.2.2.2.2.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.2.2.2.2.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.2.2.2.2.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T2.3.3.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">ID <math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.3.3.3.3.m1.1"><semantics id="S4.T2.3.3.3.3.m1.1a"><mo id="S4.T2.3.3.3.3.m1.1.1" stretchy="false" xref="S4.T2.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.3.3.3.3.m1.1b"><ci id="S4.T2.3.3.3.3.m1.1.1.cmml" xref="S4.T2.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.4.4.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">Speed <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T2.4.4.4.4.m1.1"><semantics id="S4.T2.4.4.4.4.m1.1a"><mo id="S4.T2.4.4.4.4.m1.1.1" stretchy="false" xref="S4.T2.4.4.4.4.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T2.4.4.4.4.m1.1b"><ci id="S4.T2.4.4.4.4.m1.1.1.cmml" xref="S4.T2.4.4.4.4.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.4.4.4.4.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.4.4.4.4.m1.1d">↓</annotation></semantics></math> </th> </tr> </thead> <tbody 
class="ltx_tbody"> <tr class="ltx_tr" id="S4.T2.4.4.5.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">GAN Inversion</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">93.73</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.0603</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T2.4.4.5.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.16</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T2.4.4.5.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">98.14s</td> </tr> <tr class="ltx_tr" id="S4.T2.4.4.6.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">Diffusion</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">93.63</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.0457</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T2.4.4.6.2.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.19</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S4.T2.4.4.6.2.5" style="padding-left:0.0pt;padding-right:0.0pt;">3.54s</td> </tr> <tr class="ltx_tr" id="S4.T2.4.4.7.3"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">GDA (Ours)</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" 
id="S4.T2.4.4.7.3.2" style="padding-left:0.0pt;padding-right:0.0pt;">72.94</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.0346</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T2.4.4.7.3.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.25</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T2.4.4.7.3.5" style="padding-left:0.0pt;padding-right:0.0pt;">0.080s</td> </tr> </tbody> </table> </div> <figcaption class="ltx_caption ltx_centering">Table 2: 2D Stylized Avatar Generation. We compare different methods of generating 2D stylized avatars. Our GDA significantly outperforms GAN inversion and diffusion in terms of image quality (FID, KID), identity preservation (ID), and execution speed.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S4.SS2.p1"> 2D Stylized Avatar to 3D Generation. To evaluate the performance of generating a 3D avatar from a 2D stylized image, we compare our method against two other single-image 3D reconstruction techniques: EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title="">6</a>]</cite> and LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite>. EG3D, a 3D GAN based on the StyleGAN framework, generates a 3D neural radiance field and is fine-tuned on a multi-view 3D avatar dataset. 
It inverts a front-facing image into the GAN’s <math alttext="\mathcal{W}+" class="ltx_Math" display="inline" id="S4.SS2.p1.1.m1.1"><semantics id="S4.SS2.p1.1.m1.1a"><mrow id="S4.SS2.p1.1.m1.1.1" xref="S4.SS2.p1.1.m1.1.1.cmml"><mi class="ltx_font_mathcaligraphic" id="S4.SS2.p1.1.m1.1.1.2" xref="S4.SS2.p1.1.m1.1.1.2.cmml">𝒲</mi><mo id="S4.SS2.p1.1.m1.1.1.3" xref="S4.SS2.p1.1.m1.1.1.3.cmml">+</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.1.m1.1b"><apply id="S4.SS2.p1.1.m1.1.1.cmml" xref="S4.SS2.p1.1.m1.1.1"><csymbol cd="latexml" id="S4.SS2.p1.1.m1.1.1.1.cmml" xref="S4.SS2.p1.1.m1.1.1">limit-from</csymbol><ci id="S4.SS2.p1.1.m1.1.1.2.cmml" xref="S4.SS2.p1.1.m1.1.1.2">𝒲</ci><plus id="S4.SS2.p1.1.m1.1.1.3.cmml" xref="S4.SS2.p1.1.m1.1.1.3"></plus></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.1.m1.1c">\mathcal{W}+</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.1.m1.1d">caligraphic_W +</annotation></semantics></math> space to render outputs from various viewpoints. LGM uses MVDream <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib51" title="">51</a>]</cite> to transform a single image into multiple viewpoints for input into a U-Net, which outputs 3D Gaussians. We assessed each method using 100 random 3D Bitmojis, each rendered from one front-facing view and ten additional views distributed spherically around the head. By inputting the front-facing view into each model, we calculated PSNR, SSIM, LPIPS <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib64" title="">64</a>]</cite>, and speed metrics. 
As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.T3" title="Table 3 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3</a>, our method surpasses all baselines, demonstrating superior capability in accurately converting 2D images to 3D, while being significantly faster, needing only a single U-Net pass. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F5" title="Figure 5 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">5</a> provides visual comparisons on 3D dual-stylized avatars that fall outside the training distribution. The top row features eight stylized avatars generated using diffusion stylization from the identities in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F4" title="Figure 4 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4</a>, with different style prompts as described in Sec. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S3.SS2" title="3.2 2D Dual-Stylized Avatar Generation ‣ 3 Method ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">3.2</a>. Subsequent rows show each method’s performance in translating these images to 3D. Even when employing PTI <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib42" title="">42</a>]</cite> for out-of-distribution images, EG3D struggles with high-fidelity geometry generation. Similarly, due to the diffusion process in MVDream, LGM often produces incorrect 3D head geometries. In contrast, our method successfully creates high-quality textures and geometry, even accommodating out-of-distribution accessories like turbans and sunglasses. 
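As a rough illustration of the protocol above, the following Python sketch computes PSNR for one avatar by averaging over its rendered views. It is not the authors' evaluation code: the function names, the flat-list image representation, and the [0, 1] pixel range are assumptions made only for this example.

```python
import math


def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equally sized flat lists of pixel values.

    Assumes (hypothetically) that pixels are floats in [0, max_value].
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and equally sized")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr_over_views(pred_views, target_views):
    """Average PSNR across paired (predicted, ground-truth) views."""
    if len(pred_views) != len(target_views) or not pred_views:
        raise ValueError("view lists must be non-empty and equally sized")
    scores = [psnr(p, t) for p, t in zip(pred_views, target_views)]
    return sum(scores) / len(scores)


# Toy usage: two "views" of four pixels each, both off by 0.1 everywhere.
targets = [[0.5, 0.5, 0.5, 0.5], [0.2, 0.2, 0.2, 0.2]]
preds = [[0.6, 0.6, 0.6, 0.6], [0.3, 0.3, 0.3, 0.3]]
avg_db = mean_psnr_over_views(preds, targets)
```

In this toy case every pixel differs by 0.1, so the mean squared error is 0.01 and the PSNR is about 20 dB per view. Higher values indicate closer agreement with the ground-truth renders.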
</div> <figure class="ltx_table" id="S4.T3"> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T3.4" style="width:433.6pt;height:100.9pt;vertical-align:-0.0pt;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T3.4.4"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T3.4.4.4"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.4.4.4.5" style="padding-left:0.0pt;padding-right:0.0pt;"></th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.1.1.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">PSNR <math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.1.1.1.1.m1.1"><semantics id="S4.T3.1.1.1.1.m1.1a"><mo id="S4.T3.1.1.1.1.m1.1.1" stretchy="false" xref="S4.T3.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.1.1.1.1.m1.1b"><ci id="S4.T3.1.1.1.1.m1.1.1.cmml" xref="S4.T3.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.2.2.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">SSIM <math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T3.2.2.2.2.m1.1"><semantics id="S4.T3.2.2.2.2.m1.1a"><mo id="S4.T3.2.2.2.2.m1.1.1" stretchy="false" xref="S4.T3.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T3.2.2.2.2.m1.1b"><ci id="S4.T3.2.2.2.2.m1.1.1.cmml" xref="S4.T3.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l 
ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="S4.T3.3.3.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">LPIPS <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T3.3.3.3.3.m1.1"><semantics id="S4.T3.3.3.3.3.m1.1a"><mo id="S4.T3.3.3.3.3.m1.1.1" stretchy="false" xref="S4.T3.3.3.3.3.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T3.3.3.3.3.m1.1b"><ci id="S4.T3.3.3.3.3.m1.1.1.cmml" xref="S4.T3.3.3.3.3.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.3.3.3.3.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.3.3.3.3.m1.1d">↓</annotation></semantics></math> </th> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T3.4.4.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">Speed <math alttext="\downarrow" class="ltx_Math" display="inline" id="S4.T3.4.4.4.4.m1.1"><semantics id="S4.T3.4.4.4.4.m1.1a"><mo id="S4.T3.4.4.4.4.m1.1.1" stretchy="false" xref="S4.T3.4.4.4.4.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S4.T3.4.4.4.4.m1.1b"><ci id="S4.T3.4.4.4.4.m1.1.1.cmml" xref="S4.T3.4.4.4.4.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.4.4.4.4.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S4.T3.4.4.4.4.m1.1d">↓</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T3.4.4.5.1"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.1" style="padding-left:0.0pt;padding-right:0.0pt;">EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title="">6</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">10.92</td> <td 
class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.68</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_t" id="S4.T3.4.4.5.1.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.50</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T3.4.4.5.1.5" style="padding-left:0.0pt;padding-right:0.0pt;">95.1s</td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.6.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite> </td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">12.16</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.69</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T3.4.4.6.2.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.53</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S4.T3.4.4.6.2.5" style="padding-left:0.0pt;padding-right:0.0pt;">2.82s</td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.7.3"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">Ours</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.2" style="padding-left:0.0pt;padding-right:0.0pt;">18.73</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">0.81</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r 
ltx_align_center ltx_border_bb ltx_border_r" id="S4.T3.4.4.7.3.4" style="padding-left:0.0pt;padding-right:0.0pt;">0.24</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T3.4.4.7.3.5" style="padding-left:0.0pt;padding-right:0.0pt;">0.091s</td> </tr> </tbody> </table> </div> <figcaption class="ltx_caption ltx_centering">Table 3: 2D Stylized Avatar to 3D Generation. Our approach outperforms EG3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib6" title="">6</a>]</cite> and LGM <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib55" title="">55</a>]</cite> on all metrics, providing superior texture and geometry accuracy with faster processing.</figcaption> </figure> <figure class="ltx_figure" id="S4.F7"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="898" id="S4.F7.g1" src="x7.png" width="830"/> <figcaption class="ltx_caption">Figure 7: 3D Stylized Avatar Animation. (a) A Snapmoji showcasing various emotions using blendshape weights. (b) Snapmoji effectively transfers expressions from driving images, outperforming Portrait4D-v2 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib12" title="">12</a>]</cite> in accuracy and visual appeal.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S4.SS2.p2"> Single Portrait to 3D Generation. We evaluate Snapmoji’s capability to generate a 3D avatar from a single portrait, with a comparison against DATID-3D <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib23" title="">23</a>]</cite>. DATID-3D uses a GAN to derive a latent code from the user’s portrait, followed by domain adaptation for styling. Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F6" title="Figure 6 ‣ 4.1 2D Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">6</a> shows that DATID-3D struggles to maintain the original identity in avatars with distinct styles, like Plastic Toy and Alien. Snapmoji, however, achieves a robust balance of identity preservation and style versatility, allowing for enhanced user customization. Our approach produces sharp images and detailed geometries, processing each image in just 0.9 seconds, a significant improvement over DATID-3D’s 90-second processing time for GAN inversion. </div> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> 4.3 3D Stylized Avatar Animation </h3> <div class="ltx_para ltx_noindent" id="S4.SS3.p1"> Expression Animation. Snapmoji enables avatars to express a wide range of emotions, such as neutrality, happiness, frustration, playfulness, anger, and surprise, by using blendshape weights, as shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F7" title="Figure 7 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">7</a>(a). Additionally, Snapmoji can perform expression transfer from driving images, producing 3D-consistent and visually appealing avatars. Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F7" title="Figure 7 ‣ 4.2 3D Dual-Stylized Avatar Generation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">7</a>(b) shows this capability, where Snapmoji outperforms Portrait4D-v2 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib12" title="">12</a>]</cite> by generating avatars with more accurate expressions derived from the target image. 
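To make the idea of driving expressions with blendshape weights concrete, here is a minimal Python sketch of a generic linear blendshape model, in which each expression contributes a weighted per-point offset to a neutral shape. This is a textbook formulation and only an assumption about how such weights are typically applied; it is not taken from Snapmoji's implementation, and the names apply_blendshapes, smile, and jaw_open are hypothetical.

```python
def apply_blendshapes(neutral, deltas, weights):
    """Deform a neutral shape with a weighted sum of blendshape offsets.

    neutral: list of (x, y, z) points in the neutral expression.
    deltas:  dict mapping a blendshape name to a list of (dx, dy, dz)
             offsets, one per point (hypothetical layout).
    weights: dict mapping a blendshape name to its activation weight.
    """
    result = [list(point) for point in neutral]
    for name, weight in weights.items():
        offsets = deltas[name]
        if len(offsets) != len(result):
            raise ValueError("blendshape " + name + " has the wrong point count")
        for point, offset in zip(result, offsets):
            for axis in range(3):
                # Linear model: position = neutral + sum_i w_i * delta_i
                point[axis] += weight * offset[axis]
    return [tuple(point) for point in result]


# Toy usage with two points and two hypothetical blendshapes.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = {
    "smile": [(0.0, 1.0, 0.0), (0.0, 0.5, 0.0)],
    "jaw_open": [(0.0, 0.0, -1.0), (0.0, 0.0, 0.0)],
}
posed = apply_blendshapes(neutral, deltas, {"smile": 0.5, "jaw_open": 1.0})
```

Because the model is linear, an emotion can be stored as a small vector of weights, and intermediate expressions follow from interpolating those weights.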
</div> <div class="ltx_para ltx_noindent" id="S4.SS3.p2"> Mobile AR Application. We showcase a web-based AR app that demonstrates the efficient rendering of avatars on mobile devices. As illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.F8" title="Figure 8 ‣ 4.3 3D Stylized Avatar Animation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">8</a>, an avatar animated using a user’s facial expressions achieves rendering speeds of 30–40 FPS on an iPhone 13 Pro. These avatars are highly compact, occupying only 3 MB of disk space, enabling the creation of dynamic filters and engaging AR effects directly within a mobile web browser. To highlight the advantages of our animation technique, we compare Snapmoji against TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>]</cite>. Like many other avatar generation methods <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib19" title="">19</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib40" title="">40</a>, <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib8" title="">8</a>]</cite>, TextToon relies solely on 3DMM features, achieving only 15–18 FPS on an M1 MacBook. In contrast, our method consistently runs at 90–100 FPS on the same device. Moreover, TextToon’s dependence on 3DMMs limits its practicality on phones, whereas our cross-platform solution retains performance over 30 FPS. Table <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S4.T4" title="Table 4 ‣ 4.3 3D Stylized Avatar Animation ‣ 4 Experiments ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">4</a> offers a detailed feature comparison. 
</div> <figure class="ltx_table" id="S4.T4"> <div class="ltx_inline-block ltx_transformed_outer" id="S4.T4.2" style="width:433.6pt;height:63.4pt;vertical-align:-0.0pt;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T4.2.1"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T4.2.1.1.1"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_tt" id="S4.T4.2.1.1.1.1" rowspan="2" style="padding-left:0.0pt;padding-right:0.0pt;">Method</th> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r ltx_border_tt" colspan="2" id="S4.T4.2.1.1.1.2" style="padding-left:0.0pt;padding-right:0.0pt;">Frame Rate (FPS)</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_tt" id="S4.T4.2.1.1.1.3" rowspan="2" style="padding-left:0.0pt;padding-right:0.0pt;">Cross-Platform</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_tt" id="S4.T4.2.1.1.1.4" rowspan="2" style="padding-left:0.0pt;padding-right:0.0pt;">Driving Signal</td> </tr> <tr class="ltx_tr" id="S4.T4.2.1.2.2"> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center" id="S4.T4.2.1.2.2.1" style="padding-left:0.0pt;padding-right:0.0pt;">M1 MacBook</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_r" id="S4.T4.2.1.2.2.2" style="padding-left:0.0pt;padding-right:0.0pt;">iPhone 13 Pro</td> </tr> <tr class="ltx_tr" id="S4.T4.2.1.3.3"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T4.2.1.3.3.1" style="padding-left:0.0pt;padding-right:0.0pt;">TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>]</cite> </th> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T4.2.1.3.3.2" style="padding-left:0.0pt;padding-right:0.0pt;">15–18</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center 
ltx_border_r ltx_border_t" id="S4.T4.2.1.3.3.3" style="padding-left:0.0pt;padding-right:0.0pt;">N/A</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T4.2.1.3.3.4" style="padding-left:0.0pt;padding-right:0.0pt;">×</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_t" id="S4.T4.2.1.3.3.5" style="padding-left:0.0pt;padding-right:0.0pt;">3DMM</td> </tr> <tr class="ltx_tr" id="S4.T4.2.1.4.4"> <th class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" id="S4.T4.2.1.4.4.1" style="padding-left:0.0pt;padding-right:0.0pt;"> Snapmoji (Ours)</th> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T4.2.1.4.4.2" style="padding-left:0.0pt;padding-right:0.0pt;">90–100</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb ltx_border_r" id="S4.T4.2.1.4.4.3" style="padding-left:0.0pt;padding-right:0.0pt;">30–40</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T4.2.1.4.4.4" style="padding-left:0.0pt;padding-right:0.0pt;">✓</td> <td class="ltx_td ltx_nopad_l ltx_nopad_r ltx_align_center ltx_border_bb" id="S4.T4.2.1.4.4.5" style="padding-left:0.0pt;padding-right:0.0pt;">3DMM + Blendshapes</td> </tr> </tbody> </table> </div> <figcaption class="ltx_caption">Table 4: Mobile AR Application Comparison. We compare various features of our mobile AR application and TextToon <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib54" title="">54</a>]</cite>.</figcaption> </figure> <figure class="ltx_figure" id="S4.F8"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="434" id="S4.F8.g1" src="x8.png" width="830"/> <figcaption class="ltx_caption">Figure 8: Mobile AR Application. Snapmoji’s efficient mobile animation method enables a user to puppet their avatar in augmented reality. 
</figcaption> </figure> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> 5 Conclusion</h2> <div class="ltx_para" id="S5.p1"> We introduce Snapmoji, an easy-to-use system for generating animatable, dual-stylized avatars from selfies instantly. Leveraging Gaussian Domain Adaptation, Snapmoji first converts selfies into primary stylized avatars, then applies a diffusion process for a secondary style while preserving identity integrity. The system supports 3D Gaussian avatars with dynamic animations and precise facial expression transfer, achieving selfie-to-avatar conversion in just 0.9 seconds, with real-time interactions at 30–40 FPS. Extensive testing confirms Snapmoji’s versatility and speed, highlighting its value in creating diverse avatar styles. </div> <div class="ltx_para ltx_noindent" id="S5.p2"> Limitations and Future Work. Snapmoji relies on paired data from GAN inversion, which can sometimes yield low-quality images, and requires extensive 3D avatar datasets and text prompt engineering. Future improvements could include using multiple images for more accurate user head geometry and enabling post-stylization edits to specific facial features, like eye color or eyeglasses. </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography" style="font-size:90%;">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> Apple [2018] Apple. Use memoji on your iphone or ipad pro. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://support.apple.com/en-us/111115" style="font-size:90%;" title="">https://support.apple.com/en-us/111115</a>, 2018. Accessed: 2024-11-03. </li> <li class="ltx_bibitem" id="bib.bib2"> Barron et al. [2021] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. 
In Proceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864, 2021. </li> <li class="ltx_bibitem" id="bib.bib3"> Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022. </li> <li class="ltx_bibitem" id="bib.bib4"> Bazarevsky et al. [2019] Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann. Blazeface: Sub-millisecond neural face detection on mobile gpus. ArXiv, abs/1907.05047, 2019. </li> <li class="ltx_bibitem" id="bib.bib5"> Bitstrips [2007] Bitstrips. Bitmoji. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.bitmoji.com/" style="font-size:90%;" title="">https://www.bitmoji.com/</a>, 2007. Accessed: 2024-11-03. </li> <li class="ltx_bibitem" id="bib.bib6"> Chan et al. [2022] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In CVPR, 2022. </li> <li class="ltx_bibitem" id="bib.bib7"> Chen et al. [2022] Eric Chen, Jin Sun, Apoorv Khandelwal, Dani Lischinski, Noah Snavely, and Hadar Averbuch-Elor. What’s in a decade? transforming faces through time. Computer Graphics Forum, 42, 2022. </li> <li class="ltx_bibitem" id="bib.bib8"> Chu and Harada [2024] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. </li> <li class="ltx_bibitem" id="bib.bib9"> Dao et al. [2025] Quan Dao, Khanh Doan, Di Liu, Trung Le, and Dimitris Metaxas. Improved training technique for latent consistency models. 
Proceedings of the International Conference on Learning Representations (ICLR), 2025. </li> <li class="ltx_bibitem" id="bib.bib10"> Deitke et al. [2022] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2022. </li> <li class="ltx_bibitem" id="bib.bib11"> Deng et al. [2019] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019. </li> <li class="ltx_bibitem" id="bib.bib12"> Deng et al. [2024] Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. In European Conference on Computer Vision, pages 316–333. Springer, 2024. </li> <li class="ltx_bibitem" id="bib.bib13"> Ekman and Friesen [1978] Paul Ekman and Wallace V. Friesen. Facial action coding system: a technique for the measurement of facial movement. 1978. </li> <li class="ltx_bibitem" id="bib.bib14"> Han et al. [2024] Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, et al. Proxedit: Improving tuning-free real image editing with proximal guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4291–4301, 2024. </li> <li class="ltx_bibitem" id="bib.bib15"> Härkönen et al. [2020] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. ArXiv, abs/2004.02546, 2020. </li> <li class="ltx_bibitem" id="bib.bib16"> He et al. 
[2025] Xiaoxiao He, Ligong Han, Quan Dao, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, et al. Dice: Discrete inversion enabling controllable editing for multinomial diffusion and masked generative models. 2025. </li> <li class="ltx_bibitem" id="bib.bib17"> Hong et al. [2023] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ArXiv, abs/2311.04400, 2023. </li> <li class="ltx_bibitem" id="bib.bib18"> Hu et al. [2021] J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021. </li> <li class="ltx_bibitem" id="bib.bib19"> Hu et al. [2024] Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. </li> <li class="ltx_bibitem" id="bib.bib20"> Huang et al. [2024] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024. </li> <li class="ltx_bibitem" id="bib.bib21"> Karras et al. [2019] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8107–8116, 2019. </li> <li class="ltx_bibitem" id="bib.bib22"> Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42:1 – 14, 2023. 
</li> <li class="ltx_bibitem" id="bib.bib23"> Kim and Chun [2023] Gwanghyun Kim and Se Young Chun. Datid-3d: Diversity-preserved domain adaptation using text-to-image diffusion for 3d generative model. In CVPR, 2023. </li> <li class="ltx_bibitem" id="bib.bib24"> Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023. </li> <li class="ltx_bibitem" id="bib.bib25"> Li et al. [2017] Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. </li> <li class="ltx_bibitem" id="bib.bib26"> Liu et al. [2023a] Di Liu, Xiang Yu, Meng Ye, Qilong Zhangli, Zhuowei Li, Zhixing Zhang, and Dimitris N Metaxas. Deformer: Integrating transformers with deformable models for 3d shape abstraction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14236–14246, 2023a. </li> <li class="ltx_bibitem" id="bib.bib27"> Liu et al. [2023b] Di Liu, Long Zhao, Qilong Zhangli, Yunhe Gao, Ting Liu, and Dimitris N Metaxas. Deep deformable models: Learning 3d shape abstractions with part consistency. arXiv preprint arXiv:2309.01035, 2023b. </li> <li class="ltx_bibitem" id="bib.bib28"> Liu et al. [2024a] Di Liu, Qilong Zhangli, Yunhe Gao, and Dimitris Metaxas. Lepard: Learning explicit part discovery for 3d articulated shape reconstruction. Advances in Neural Information Processing Systems, 36, 2024a. </li> <li class="ltx_bibitem" id="bib.bib29"> Liu et al. [2024b] Di Liu, Bingbing Zhuang, Dimitris N. Metaxas, and Manmohan Chandraker. Instantaneous perception of moving objects in 3d. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024b. 
</li> <li class="ltx_bibitem" id="bib.bib30"> Liu et al. [2025] Di Liu, Teng Deng, Giljoo Nam, Yu Rong, Stanislav Pidhorskyi, Junxuan Li, Jason Saragih, Dimitris N. Metaxas, and Chen Cao. Lucas: Layered universal codec avatars, 2025. </li> <li class="ltx_bibitem" id="bib.bib31"> Luo et al. [2021] Xuan Luo, Xuaner Zhang, Paul Yoo, Ricardo Martin-Brualla, Jason Lawrence, and Steven M. Seitz. Time-travel rephotography. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2021), 40(6), 2021. </li> <li class="ltx_bibitem" id="bib.bib32"> Ma et al. [2024] Shengjie Ma, Yanlin Weng, Tianjia Shao, and Kun Zhou. 3d gaussian blendshapes for head avatar animation. In ACM SIGGRAPH Conference Proceedings, Denver, CO, United States, July 28 - August 1, 2024, 2024. </li> <li class="ltx_bibitem" id="bib.bib33"> Men et al. [2024] Yifang Men, Hanxi Liu, Yuan Yao, Miaomiao Cui, Xuansong Xie, and Zhouhui Lian. 3dtoonify: Creating your high-fidelity 3d stylized avatar easily from 2d portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10127–10137, 2024. </li> <li class="ltx_bibitem" id="bib.bib34"> Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021. </li> <li class="ltx_bibitem" id="bib.bib35"> Meta [2024] Meta. Express yourself with Meta avatars. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.meta.com/avatars/" style="font-size:90%;" title="">https://www.meta.com/avatars/</a>, 2024. Accessed: 2024-11-03. </li> <li class="ltx_bibitem" id="bib.bib36"> Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 
</li> <li class="ltx_bibitem" id="bib.bib37"> Nguyen-Phuoc et al. [2023] Thu Nguyen-Phuoc, Gabriel Schwartz, Yuting Ye, Stephen Lombardi, and Lei Xiao. Alteredavatar: Stylizing dynamic 3d avatars with fast style adaptation. ArXiv, abs/2305.19245, 2023. </li> <li class="ltx_bibitem" id="bib.bib38"> Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. </li> <li class="ltx_bibitem" id="bib.bib39"> Pinkney and Adler [2020] Justin N. M. Pinkney and Doron Adler. Resolution dependent gan interpolation for controllable image synthesis between domains. ArXiv, abs/2010.05334, 2020. </li> <li class="ltx_bibitem" id="bib.bib40"> Qian et al. [2023] Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv preprint arXiv:2312.02069, 2023. </li> <li class="ltx_bibitem" id="bib.bib41"> Research [2023] Grand View Research. Digital avatar market size, share and growth report, 2030. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.grandviewresearch.com/industry-analysis/digital-avatar-market-report/" style="font-size:90%;" title="">https://www.grandviewresearch.com/industry-analysis/digital-avatar-market-report/</a>, 2023. </li> <li class="ltx_bibitem" id="bib.bib42"> Roich et al. [2021] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 42:1 – 13, 2021. </li> <li class="ltx_bibitem" id="bib.bib43"> Rombach et al. [2021] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021. </li> <li class="ltx_bibitem" id="bib.bib44"> Saito et al. [2024] Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. Relightable gaussian codec avatars. In CVPR, 2024. </li> <li class="ltx_bibitem" id="bib.bib45"> Samsung [2018] Samsung. How to use the AR Emoji feature on your Galaxy phone. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://www.samsung.com/sg/support/mobile-devices/how-to-use-the-emoji-feature-on-your-galaxy-phone/" style="font-size:90%;" title="">https://www.samsung.com/sg/support/mobile-devices/how-to-use-the-emoji-feature-on-your-galaxy-phone/</a>, 2018. Accessed: 2024-11-03. </li> <li class="ltx_bibitem" id="bib.bib46"> Sang et al. [2022] Shen Sang, Tiancheng Zhi, Guoxian Song, Minghao Liu, Chun-Pong Lai, Jing Liu, Xiang Wen, James Davis, and Linjie Luo. Agileavatar: Stylized 3d avatar creation via cascaded domain bridging. SIGGRAPH Asia 2022 Conference Papers, 2022. </li> <li class="ltx_bibitem" id="bib.bib47"> Shao et al. [2024] Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. </li> <li class="ltx_bibitem" id="bib.bib48"> Shi et al. [2019] Tianyang Shi, Yi Yuan, Changjie Fan, Zhengxia Zou, Zhen Xia Shi, and Yong Liu. Face-to-parameter translation for game character auto-creation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 161–170, 2019. </li> <li class="ltx_bibitem" id="bib.bib49"> Shi et al. [2020] Tianyang Shi, Yi Yuan, Changjie Fan, Zhengxia Zou, Zhenwei Shi, and Yong Liu. Fast and robust face-to-parameter translation for game character auto-creation. ArXiv, abs/2008.07132, 2020. </li> <li class="ltx_bibitem" id="bib.bib50"> Shi et al. 
[2021] Yichun Shi, Xiao Yang, Yangyue Wan, and Xiaohui Shen. Semanticstylegan: Learning compositional generative priors for controllable image synthesis and editing. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11244–11254, 2021. </li> <li class="ltx_bibitem" id="bib.bib51"> Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512, 2023. </li> <li class="ltx_bibitem" id="bib.bib52"> [52] Snap Inc. Bitmoji 3d avatar platform solutions. <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://developers.snap.com/lens-studio/platform-solutions/bitmoji-avatar/bitmoji-3d" style="font-size:90%;" title="">https://developers.snap.com/lens-studio/platform-solutions/bitmoji-avatar/bitmoji-3d</a>. </li> <li class="ltx_bibitem" id="bib.bib53"> Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. ArXiv, abs/2010.02502, 2020. </li> <li class="ltx_bibitem" id="bib.bib54"> Song et al. [2024] Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, and Chenliang Xu. Texttoon: Real-time text toonify head avatar from single video. 2024. </li> <li class="ltx_bibitem" id="bib.bib55"> Tang et al. [2024] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, 2024. </li> <li class="ltx_bibitem" id="bib.bib56"> Wang et al. [2023] Shizun Wang, Weihong Zeng, Xu Wang, Han Yang, Li Chen, Chuang Zhang, Ming Wu, Yi Yuan, Yunzhao Zeng, and Minghang Zheng. Swiftavatar: Efficient auto-creation of parameterized stylized character on arbitrary avatar engines. In AAAI Conference on Artificial Intelligence, 2023. </li> <li class="ltx_bibitem" id="bib.bib57"> Wang et al. [2025] Suzhen Wang, Weijie Chen, Wei Zhang, Minda Zhao, Lincheng Li, Rongsheng Zhang, Zhipeng Hu, and Xin Yu. 
Easycraft: A robust and efficient framework for automatic avatar crafting, 2025. </li> <li class="ltx_bibitem" id="bib.bib58"> Wu et al. [2020] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12858–12867, 2020. </li> <li class="ltx_bibitem" id="bib.bib59"> Wu et al. [2021] Zongze Wu, Yotam Nitzan, Eli Shechtman, and Dani Lischinski. Stylealign: Analysis and applications of aligned stylegan models. ArXiv, abs/2110.11323, 2021. </li> <li class="ltx_bibitem" id="bib.bib60"> Ye et al. [2023] Hu Ye, Jun Zhang, Siyi Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. ArXiv, abs/2308.06721, 2023. </li> <li class="ltx_bibitem" id="bib.bib61"> Zhang et al. [2023a] Chi Zhang, Yiwen Chen, Yijun Fu, Zhenglin Zhou, Gang YU, Billzb Wang, Bin Fu, Tao Chen, Guosheng Lin, and Chunhua Shen. Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation, 2023a. </li> <li class="ltx_bibitem" id="bib.bib62"> Zhang et al. [2024] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. ArXiv, abs/2404.19702, 2024. </li> <li class="ltx_bibitem" id="bib.bib63"> Zhang et al. [2023b] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3813–3824, 2023b. </li> <li class="ltx_bibitem" id="bib.bib64"> Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018. </li> <li class="ltx_bibitem" id="bib.bib65"> Zhangli et al. 
[2024] Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N Metaxas, and Praveen Krishnan. Layout-agnostic scene text image synthesis with diffusion models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7496–7506. IEEE Computer Society, 2024. </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <div class="ltx_pagination ltx_role_newpage"></div> <div class="ltx_para ltx_align_center" id="p3"> Supplementary Material for Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars </div> <section class="ltx_appendix ltx_centering" id="A1"> <h2 class="ltx_title ltx_title_appendix" style="font-size:144%;"> Appendix A Implementation Details</h2> <figure class="ltx_figure" id="A1.F9"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="814" id="A1.F9.g1" src="x9.png" width="830"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;">Figure 9: GDA Training Data. Visualization of data pairs used to train the GDA network. Each avatar is inverted into the latent space of a GAN, and a generator trained on realistic faces creates a corresponding realistic face, preserving features such as hairstyles, hair colors, and general facial characteristics.</figcaption> </figure> <section class="ltx_subsection" id="A1.SS1"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> A.1 Bitmoji Training Data</h3> <section class="ltx_paragraph" id="A1.SS1.SSS0.Px1"> <h4 class="ltx_title ltx_title_paragraph" style="font-size:144%;">GDA Training Data.</h4> <div class="ltx_para" id="A1.SS1.SSS0.Px1.p1"> To train our Gaussian Domain Adaptation network, we start with a dataset of random Bitmojis and use GAN inversion to generate corresponding realistic images. Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F9" style="font-size:144%;" title="Figure 9 ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">9</a> showcases some examples of the training data generated through this process. The resulting faces mirror the hairstyles, hair colors, and facial features of the original avatars. While the generated images may contain artifacts and exhibit limited diversity, our GDA model benefits from pre-training on Objaverse <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#bib.bib10" title="">10</a>]</cite>, enabling it to leverage prior knowledge and produce more detailed reconstructions than GAN inversion alone. This approach enhances the accuracy and expressiveness of the domain adaptation process. </div> </section> <section class="ltx_paragraph" id="A1.SS1.SSS0.Px2"> <h4 class="ltx_title ltx_title_paragraph" style="font-size:144%;">Multi-view Training Data.</h4> <div class="ltx_para" id="A1.SS1.SSS0.Px2.p1"> Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A1.F10" style="font-size:144%;" title="Figure 10 ‣ Multi-view Training Data. ‣ A.1 Bitmoji Training Data ‣ Appendix A Implementation Details ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">10</a> visualizes training samples from the Bitmoji dataset used for training our 3D Generation Network. Each avatar is rendered from 10 spherically distributed viewpoints around the head and is posed with random blendshape weights to simulate diverse facial expressions. The dataset features a wide range of hairstyles, skin tones, and accessories such as glasses, hats, and earrings. Although the U-Net is trained exclusively on Bitmoji-style avatars, it effectively reconstructs dual-stylized avatars that exhibit distinct appearances and textures, demonstrating the network’s versatility and generalization capability. 
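The sampling described in this paragraph can be illustrated with a minimal sketch. It assumes a Fibonacci lattice for the 10 spherically distributed viewpoints, uniform blendshape weights in [0, 1], and a camera radius of 2.0; none of these choices is stated in the paper, and the function names are hypothetical. Only the view count and the 16 blendshape names (Sec. A.2) come from the text.

```python
import math
import random

# The 16 blendshape names listed in Sec. A.2.
BLENDSHAPES = [
    "browDownLeft", "browDownRight", "browUpLeft", "browUpRight",
    "eyeBlinkLeft", "eyeBlinkRight", "jawOpen", "jawLeft", "jawRight",
    "lipsPucker", "mouthFrownLeft", "mouthFrownRight",
    "mouthSmileLeft", "mouthSmileRight",
    "mouthStretchLeft", "mouthStretchRight",
]


def fibonacci_sphere(n, radius=1.0):
    """Return n roughly evenly spaced points on a sphere (assumed layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        y = 1.0 - 2.0 * (i + 0.5) / n      # height, strictly inside (-1, 1)
        ring = math.sqrt(1.0 - y * y)      # radius of the horizontal ring
        theta = golden_angle * i           # rotate by the golden angle
        points.append((radius * ring * math.cos(theta),
                       radius * y,
                       radius * ring * math.sin(theta)))
    return points


def sample_training_item(num_views=10, radius=2.0, seed=None):
    """Hypothetical helper: one random expression plus camera positions."""
    rng = random.Random(seed)
    # Assumption: each weight is drawn independently and uniformly from [0, 1].
    weights = {name: rng.random() for name in BLENDSHAPES}
    return {"cameras": fibonacci_sphere(num_views, radius),
            "blendshape_weights": weights}


if __name__ == "__main__":
    item = sample_training_item(seed=0)
    print(len(item["cameras"]), "views,",
          len(item["blendshape_weights"]), "blendshape weights")
```

Each camera would look at the head center; the renderer and the rig evaluation are outside the scope of this sketch.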
</div> <figure class="ltx_figure" id="A1.F10"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="598" id="A1.F10.g1" src="extracted/6275685/appendix/bitmoji_mv.jpg" width="598"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;">Figure 10: Multi-view Bitmoji Training Data. Samples from the Bitmoji dataset used in training the 3D Generation Network. Avatars are rendered from the front and multiple random angles around the head, with random blendshapes applied to simulate various expressions.</figcaption> </figure> </section> </section> <section class="ltx_subsection" id="A1.SS2"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> A.2 Facial Action Coding System</h3> <div class="ltx_para" id="A1.SS2.p1"> We implement the following 16 blendshapes from the Facial Action Coding System. These blendshapes are compatible with most facial blendshape predictors, such as Apple ARKit (<a class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright" href="https://arkit-face-blendshapes.com/" style="font-size:69%;" title="">https://arkit-face-blendshapes.com/</a>) or Google Mediapipe (<a class="ltx_ref ltx_url ltx_font_typewriter ltx_font_upright" href="https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker" style="font-size:69%;" title="">https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker</a>). 
</div> <div class="ltx_para" id="A1.SS2.p2"> <ul class="ltx_itemize" id="A1.I1"> <li class="ltx_item" id="A1.I1.i1" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i1.p1"> browDownLeft </div> </li> <li class="ltx_item" id="A1.I1.i2" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i2.p1"> browDownRight </div> </li> <li class="ltx_item" id="A1.I1.i3" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i3.p1"> browUpLeft </div> </li> <li class="ltx_item" id="A1.I1.i4" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i4.p1"> browUpRight </div> </li> <li class="ltx_item" id="A1.I1.i5" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i5.p1"> eyeBlinkLeft </div> </li> <li class="ltx_item" id="A1.I1.i6" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i6.p1"> eyeBlinkRight </div> </li> <li class="ltx_item" id="A1.I1.i7" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i7.p1"> jawOpen </div> </li> <li class="ltx_item" id="A1.I1.i8" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i8.p1"> jawLeft </div> </li> <li class="ltx_item" id="A1.I1.i9" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i9.p1"> jawRight </div> </li> <li class="ltx_item" id="A1.I1.i10" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i10.p1"> lipsPucker </div> </li> <li class="ltx_item" id="A1.I1.i11" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i11.p1"> mouthFrownLeft </div> </li> <li class="ltx_item" id="A1.I1.i12" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i12.p1"> mouthFrownRight </div> </li> <li class="ltx_item" id="A1.I1.i13" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i13.p1"> mouthSmileLeft </div> </li> <li class="ltx_item" id="A1.I1.i14" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i14.p1"> mouthSmileRight </div> </li> <li class="ltx_item" id="A1.I1.i15" 
style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i15.p1"> mouthStretchLeft </div> </li> <li class="ltx_item" id="A1.I1.i16" style="list-style-type:none;"> • <div class="ltx_para" id="A1.I1.i16.p1"> mouthStretchRight </div> </li> </ul> </div> <div class="ltx_para" id="A1.SS2.p3"> To animate the Bitmoji avatars for training data, we used the publicly available Bitmoji rig: <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://developers.snap.com/lens-studio/features/bitmoji-avatar/animating-bitmoji-3d" style="font-size:144%;" title="">https://developers.snap.com/lens-studio/features/bitmoji-avatar/animating-bitmoji-3d</a>. </div> <div class="ltx_para" id="A1.SS2.p4"> At inference time, we can use a real-time blendshape predictor like ARKit or Mediapipe to puppet the avatars from a real video. Please see the attached HTML gallery for a demonstration. </div> </section> <section class="ltx_subsection" id="A1.SS3"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> A.3 User Interfaces</h3> <div class="ltx_para" id="A1.SS3.p1"> To showcase the intuitive features of the Snapmoji system, we present videos of the interface interactions in the HTML gallery. The avatar generation interface, built with Gradio, lets users create dual-stylized avatars from their own photos. In addition, the blendshape editor, developed using Viser, allows users to pose their avatars in 3D by adjusting blendshape weights, thereby controlling facial expressions. </div> <div class="ltx_para" id="A1.SS3.p2"> During the diffusion stylization process, we provide users with key parameters to balance identity preservation with style diversity. These controls are listed in order of significance: 1) Style Transition Strength; 2) Edge Preservation Level; 3) Identity Consistency Factor. 
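For intuition about the first control, the sketch below shows how an SDEdit-style strength value is commonly turned into a denoising schedule in image-to-image diffusion: the input is noised part of the way, and only the remaining steps are denoised. The function name, the default of 50 steps, and the exact mapping are assumptions for illustration; the paper does not specify how Snapmoji implements this parameter.

```python
def sdedit_steps(strength, num_steps=50):
    """Return the indices of the denoising steps that would actually run.

    Assumed convention: strength 0.0 keeps the input untouched (no steps run),
    strength 1.0 starts from pure noise (all steps run).
    """
    # Clamp to the valid range instead of raising, to keep the sketch short.
    strength = min(max(strength, 0.0), 1.0)
    steps_to_run = min(int(num_steps * strength), num_steps)
    first_step = num_steps - steps_to_run
    return list(range(first_step, num_steps))


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(s, "runs", len(sdedit_steps(s)), "of 50 steps")
```

Under this convention, a lower strength skips more of the early, high-noise steps, which is consistent with lower values retaining more detail from the single-styled input.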
</div> <div class="ltx_para ltx_noindent" id="A1.SS3.p3"> Style Transition Strength: This parameter, inspired by methods similar to SDEdit, regulates the extent of the stylization transition. Lower values enable the dual-stylized avatar to retain more details from the original single-styled avatar input. </div> <div class="ltx_para ltx_noindent" id="A1.SS3.p4"> Edge Preservation Level: This setting influences how accurately the system maintains the structure of the avatar by preserving the edges from the single-styled input. </div> <div class="ltx_para ltx_noindent" id="A1.SS3.p5"> Identity Consistency Factor: This controls the strength of identity features from the initial input photo, ensuring that essential facial characteristics remain recognizable. </div> <div class="ltx_para" id="A1.SS3.p6"> We encourage users to view the videos in the HTML gallery to observe how these parameters affect avatar generation. </div> </section> </section> <section class="ltx_appendix ltx_centering" id="A2"> <h2 class="ltx_title ltx_title_appendix" style="font-size:144%;"> Appendix B Additional Results</h2> <section class="ltx_subsection" id="A2.SS1"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> B.1 Results Gallery</h3> <div class="ltx_para" id="A2.SS1.p1"> We invite you to explore the HTML gallery, which features videos of Snapmoji avatars animated in 3D. Access the gallery by opening the index.html file in your web browser. The gallery includes the following highlights: </div> <div class="ltx_para" id="A2.SS1.p2"> <ol class="ltx_enumerate" id="A2.I1"> <li class="ltx_item" id="A2.I1.i1" style="list-style-type:none;"> 1. <div class="ltx_para" id="A2.I1.i1.p1"> Dual-stylized avatars with dynamic facial animations displayed from various novel viewpoints. </div> </li> <li class="ltx_item" id="A2.I1.i2" style="list-style-type:none;"> 2. 
<div class="ltx_para" id="A2.I1.i2.p1"> A demonstration of the avatars’ capabilities in facial puppeting for augmented reality applications. </div> </li> <li class="ltx_item" id="A2.I1.i3" style="list-style-type:none;"> 3. <div class="ltx_para" id="A2.I1.i3.p1"> Screen captures of the Snapmoji user interfaces, showcasing the ease of creating dual-stylized avatars and posing them using blendshapes. </div> </li> </ol> </div> </section> <section class="ltx_subsection" id="A2.SS2"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> B.2 More Applications</h3> <div class="ltx_para ltx_noindent" id="A2.SS2.p1"> 3D Avatar Animation. Dual-stylization makes it possible to quickly visualize avatars in various scenarios, which opens up several applications. As illustrated in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#S0.F1" style="font-size:144%;" title="Figure 1 ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">1</a>, Snapmoji avatars can be employed to create personalized comics and stickers, offering users a unique way to express themselves. Another promising application lies in augmented reality (AR), where avatars can be controlled and animated with real-time tracked facial expressions. Examples of this application are shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F11" style="font-size:144%;" title="Figure 11 ‣ B.2 More Applications ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">11</a> and within the HTML gallery. Using Mediapipe’s real-time blendshape tracker, we animate the 3D avatars and alpha-composite them over the video content, so that they are rendered in an AR environment. </div> <div class="ltx_para ltx_noindent" id="A2.SS2.p2"> Real-time Web Rendering. Our choice to represent avatars using Gaussian Splats enables efficient real-time rendering on mobile devices. As demonstrated in Fig. 
<a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F12" style="font-size:144%;" title="Figure 12 ‣ B.2 More Applications ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">12</a> and in the HTML gallery, the avatars achieve a rendering rate of 90–100 FPS on a laptop and 30–40 FPS on a phone. When paired with a face tracker, these avatars can be used to generate engaging filters and augmented reality effects. The demonstration showcases an avatar rendered in Google Chrome on a MacBook, entirely on the client side. </div> <figure class="ltx_figure" id="A2.F11"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="544" id="A2.F11.g1" src="extracted/6275685/appendix/ar-moji.jpeg" width="598"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;">Figure 11: Augmented Reality Puppeting. This example demonstrates the use of Mediapipe’s real-time face detection to animate avatars based on estimated blendshape weights. By alpha-compositing the avatars with the original input, we enable dynamic puppeting in augmented reality. For live demonstrations, please refer to the HTML gallery.</figcaption> </figure> <figure class="ltx_figure" id="A2.F12"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="337" id="A2.F12.g1" src="extracted/6275685/appendix/instamoj-web.png" width="598"/> <figcaption class="ltx_caption ltx_centering" style="font-size:144%;">Figure 12: Real-time Web Rendering. Leveraging Gaussian Splats, our avatars efficiently render at 90–100 FPS on laptops and 30–40 FPS on mobile devices while occupying only 3 MB of disk space. In conjunction with a mobile face tracker, these avatars facilitate the creation of engaging filters and AR effects. 
The example shown features an avatar rendered in Google Chrome on a MacBook, developed in JavaScript.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="A2.SS2.p3"> GDA Generalization. GDA demonstrates that the features learned from few-shot 3D reconstruction models are transferable to new tasks. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F13" style="font-size:144%;" title="Figure 13 ‣ B.2 More Applications ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">13</a>, GDA can also be applied to other domains, such as cats. We hope that GDA can inspire future work on using Gaussian features for other tasks. </div> <figure class="ltx_figure" id="A2.F13"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="290" id="A2.F13.g1" src="x10.png" width="830"/> <figcaption class="ltx_caption" style="font-size:144%;">Figure 13: GDA Generalization Across Domains. This illustration showcases the versatility of Gaussian Domain Adaptation (GDA) as an image-to-image translation method. Demonstrated above is GDA’s capability to transform realistic cat images into anime-style representations and vice versa, highlighting its potential for a wide range of applications beyond avatar creation.</figcaption> </figure> </section> <section class="ltx_subsection" id="A2.SS3"> <h3 class="ltx_title ltx_title_subsection" style="font-size:144%;"> B.3 Ablation Studies</h3> <figure class="ltx_figure" id="A2.F14"><img alt="Refer to caption" class="ltx_graphics ltx_img_square" height="1005" id="A2.F14.g1" src="x11.png" width="830"/> <figcaption class="ltx_caption" style="font-size:144%;">Figure 14: Ablation Study on 3DMM Tracking. This figure demonstrates the effects of using 3DMM features in conjunction with FACS blendshape weights. 
The combination enhances the expressiveness and fidelity of avatar animation, accommodating both realistic and stylized facial expressions.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="A2.SS3.p1"> 3DMM Tracking. As shown in Fig. <a class="ltx_ref" href="https://arxiv.org/html/2503.11978v1#A2.F14" style="font-size:144%;" title="Figure 14 ‣ B.3 Ablation Studies ‣ Appendix B Additional Results ‣ Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars">14</a>, our ablation study highlights the complementary strengths of 3DMM tracking and FACS-based blendshape features in avatar animation. 3DMM is adept at capturing realistic facial expressions, making it ideal for animating real faces, but it struggles with the exaggerated features typical of cartoon avatars. Conversely, FACS blendshapes excel in stylizing facial elements, such as eye and mouth shapes, crucial for cartoon animation. By integrating the precision of 3DMM with the expressive capability of FACS blendshapes, we enhance the overall animation quality, enabling our avatars to faithfully portray both realistic and stylized expressions, thus delivering a more versatile and convincing animation experience. </div> </section> </section> <section class="ltx_appendix ltx_centering" id="A3"> <h2 class="ltx_title ltx_title_appendix" style="font-size:144%;"> Appendix C Ethical Discussion</h2> <div class="ltx_para" id="A3.p1"> The use of photorealistic avatars has raised significant privacy and ethical concerns, particularly in relation to their potential misuse in creating deep fakes and spreading misinformation. In contrast, stylized cartoon avatars offer a safer alternative as they are not easily exploited for direct impersonation. In our work, we have prioritized user privacy by ensuring that no real person’s images are used to train our models. Instead, the realistic images employed for training the Gaussian Domain Adaptation (GDA) system are generated by a GAN. 
We recognize, however, that GAN-generated data can reflect the biases present in the original datasets used for training. Consequently, we remain vigilant about these limitations and are committed to continuous evaluation and improvement to mitigate any unintended biases. </div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Mar 13 21:10:50 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/">LaTeXML</a> </div></footer> </div> </body> </html>