<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>Estimating the completeness of discrete speech units</title> <!--Generated on Sun Sep 22 18:33:24 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2409.06109v2/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S1" title="In Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2" title="In Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Methods</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.SS1" title="In 2 Methods ‣ Estimating the completeness 
of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.1 </span>Discrete speech units with RVQ</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.SS2" title="In 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.2 </span>Completeness as mutual information</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.SS3" title="In 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.3 </span>Information completeness and accessibility</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S3" title="In Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Related work</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S3.SS1" title="In 3 Related work ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Information-theoretic probing</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S3.SS2" title="In 3 Related work ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Measuring information completeness</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S4" title="In Estimating the completeness of discrete speech 
units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Experimental settings</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S4.SS1" title="In 4 Experimental settings ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Discrete speech units</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S4.SS2" title="In 4 Experimental settings ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Completeness task</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S4.SS3" title="In 4 Experimental settings ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.3 </span>Accessibility tasks</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5" title="In Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Results and Discussions</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.SS1" title="In 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.1 </span>Information in the residuals</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.SS2" title="In 5 Results and Discussions ‣ Estimating the 
completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.2 </span>Information disentanglement?</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.SS3" title="In 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.3 </span>Information completeness and accessibility</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.SS4" title="In 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.4 </span>Fine-tuning RVQ on the lower bound of MI</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.SS5" title="In 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.5 </span>Rate-distortion and rate-accessibility</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.SS6" title="In 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5.6 </span>Information in the last layer</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S6" title="In Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Conclusion</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title 
ltx_title_document">Estimating the completeness of discrete speech units</h1> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id1.id1">Discrete units have been widely used to represent speech in speech codecs and speech generation. However, there are several unverified claims about self-supervised discrete units, such as that k-means disentangles phonetic and speaker information, or that information is lost after k-means. In this work, we take an information-theoretic perspective to quantify how much information is present (information completeness) and how much information is accessible (information accessibility), before and after residual vector quantization. We show a lower bound for information completeness and estimate completeness on discretized HuBERT representations after residual vector quantization. We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement. Our results offer a comprehensive assessment of the choice of discrete units, and suggest that much of the information in the residual should be mined rather than discarded.</p> </div> <div class="ltx_para" id="p1"> <p class="ltx_p" id="p1.1"><span class="ltx_text ltx_font_bold ltx_font_italic" id="p1.1.1">Index Terms<span class="ltx_text ltx_font_upright" id="p1.1.1.1">— </span></span> discrete speech units, self-supervised learning, information theory, completeness</p> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Previous work has proposed using discrete speech units as an alternative representation for a variety of speech tasks, offering lower computational and storage costs at some loss in performance. 
Of particular interest are discrete units derived from self-supervised speech representations, because these representations have demonstrated strong performance in many downstream tasks <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib1" title="">1</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib2" title="">2</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib3" title="">3</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib4" title="">4</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib5" title="">5</a>]</cite>. For example, discrete units, usually obtained with k-means on self-supervised representations, have been applied to automatic speech recognition <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib6" title="">6</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib7" title="">7</a>]</cite>, owing to their strong phonetic prominence <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib4" title="">4</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib8" title="">8</a>]</cite>. 
Recent work has also considered synthesizing speech with discrete speech units <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib11" title="">11</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib7" title="">7</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib12" title="">12</a>]</cite>, claiming either that quantization has a disentanglement effect, or that the speaker identity is lost if not explicitly modeled. We ask how much information is present (information completeness) and how much information is accessible (information accessibility) before and after vector quantization of speech representations.</p> </div> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Information accessibility is understood as how easily certain information can be extracted from the representations, while information completeness indicates how much information from the original signals is encoded in the representations. Accessibility has inspired the development of many probing tasks <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib14" title="">14</a>]</cite>, which use the accuracy of a simple classifier as a proxy for how accessible the target information is <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib15" title="">15</a>]</cite>. However, there is not yet a comprehensive study of the completeness of a representation, and how it relates to information accessibility. 
This question has received considerable attention when it comes to speech generation relying solely on discrete speech units from k-means <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib16" title="">16</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib7" title="">7</a>]</cite> or residual vector quantization (RVQ) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib17" title="">17</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib18" title="">18</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib19" title="">19</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib20" title="">20</a>]</cite>, in which information is highly likely to be lost.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Although recent approaches have proposed evaluating the information completeness of discrete speech units on synthesized speech, the synthesized speech may not faithfully reflect the encoded information <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite>. For example, the synthesizer could hallucinate, especially when using generative adversarial networks (GANs) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib22" title="">22</a>]</cite>. 
The additional speech recognition and speaker embedding systems can amplify the effect of hallucination <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib23" title="">23</a>]</cite>. Rather than synthesizing speech, in this work we directly evaluate completeness on the discrete speech units.</p> </div> <div class="ltx_para" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">To answer how complete a representation is, we show a lower bound on mutual information for information completeness through the lens of information theory, with which we estimate completeness on discretized HuBERT representations after RVQ. More specifically, we pose information completeness as the minimum distortion between the representations and the associated log Mel spectrograms. While estimating mutual information is known to be difficult (if not impossible) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib24" title="">24</a>]</cite>, the lower bound has an important interpretation: it is the amount of information that is at least present in the representations.</p> </div> <div class="ltx_para" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">We further connect information completeness to information accessibility, adopting higher-performing probes to achieve a tighter lower bound on mutual information <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib15" title="">15</a>]</cite>. We then use the proposed lower bound to examine several design choices and unverified claims on speech representations and discrete speech units. 
For example, Zhang <span class="ltx_text ltx_font_italic" id="S1.p5.1.1">et al.</span> <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite> claim that “there is significant information redundancy between semantic tokens and acoustic tokens”, with semantic tokens (a misnomer itself <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib8" title="">8</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib25" title="">25</a>]</cite>) being quantized HuBERT units. We show that the amount of information in HuBERT units can be quantitatively measured, and that a lot of information is in fact present in the discrete units. We also show that information is likely to be less complete in the later layers, despite more accessible phonetic information, confirming the choice of WavLM layer <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib5" title="">5</a>]</cite> in voice conversion <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib26" title="">26</a>]</cite>. We reveal that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement.</p> </div> <div class="ltx_para" id="S1.p6"> <p class="ltx_p" id="S1.p6.1">In our experiments, we empirically evaluate information completeness and accessibility on HuBERT representations, along with their discrete units at different RVQ depths. The evaluation of accessibility includes phone classification, pitch estimation and speaker verification. Our analyses provide insight into the choice of discrete speech units for different speech applications, and show that information is largely present in the residual. 
We remark that the discrete units from HuBERT can achieve higher completeness and accessibility if we further quantize the residuals, showing better reconstructed log Mels.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Methods</h2> <div class="ltx_para" id="S2.p1"> <p class="ltx_p" id="S2.p1.1">In the following, we describe the quantization scheme to extract discrete speech units. We then formally define completeness from an information theory point of view. Finally, we draw connections between completeness and accessibility.</p> </div> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.1 </span>Discrete speech units with RVQ</h3> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.5">We denote <math alttext="R" class="ltx_Math" display="inline" id="S2.SS1.p1.1.m1.1"><semantics id="S2.SS1.p1.1.m1.1a"><mi id="S2.SS1.p1.1.m1.1.1" xref="S2.SS1.p1.1.m1.1.1.cmml">R</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.1.m1.1b"><ci id="S2.SS1.p1.1.m1.1.1.cmml" xref="S2.SS1.p1.1.m1.1.1">𝑅</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.1.m1.1c">R</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.1.m1.1d">italic_R</annotation></semantics></math> the speech representations, and <math alttext="\hat{R}" class="ltx_Math" display="inline" id="S2.SS1.p1.2.m2.1"><semantics id="S2.SS1.p1.2.m2.1a"><mover accent="true" id="S2.SS1.p1.2.m2.1.1" xref="S2.SS1.p1.2.m2.1.1.cmml"><mi id="S2.SS1.p1.2.m2.1.1.2" xref="S2.SS1.p1.2.m2.1.1.2.cmml">R</mi><mo id="S2.SS1.p1.2.m2.1.1.1" xref="S2.SS1.p1.2.m2.1.1.1.cmml">^</mo></mover><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.2.m2.1b"><apply id="S2.SS1.p1.2.m2.1.1.cmml" xref="S2.SS1.p1.2.m2.1.1"><ci id="S2.SS1.p1.2.m2.1.1.1.cmml" xref="S2.SS1.p1.2.m2.1.1.1">^</ci><ci 
id="S2.SS1.p1.2.m2.1.1.2.cmml" xref="S2.SS1.p1.2.m2.1.1.2">𝑅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.2.m2.1c">\hat{R}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.2.m2.1d">over^ start_ARG italic_R end_ARG</annotation></semantics></math> the quantized representations after residual vector quantization (RVQ) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib17" title="">17</a>]</cite>, also known as multi-stage VQ <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib27" title="">27</a>]</cite>. RVQ consists of a cascade of <math alttext="L" class="ltx_Math" display="inline" id="S2.SS1.p1.3.m3.1"><semantics id="S2.SS1.p1.3.m3.1a"><mi id="S2.SS1.p1.3.m3.1.1" xref="S2.SS1.p1.3.m3.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.3.m3.1b"><ci id="S2.SS1.p1.3.m3.1.1.cmml" xref="S2.SS1.p1.3.m3.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.3.m3.1c">L</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.3.m3.1d">italic_L</annotation></semantics></math> codebooks, each of size <math alttext="N" class="ltx_Math" display="inline" id="S2.SS1.p1.4.m4.1"><semantics id="S2.SS1.p1.4.m4.1a"><mi id="S2.SS1.p1.4.m4.1.1" xref="S2.SS1.p1.4.m4.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.4.m4.1b"><ci id="S2.SS1.p1.4.m4.1.1.cmml" xref="S2.SS1.p1.4.m4.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.4.m4.1c">N</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.4.m4.1d">italic_N</annotation></semantics></math>, which successively quantize the residuals of the previous quantization stage using the nearest neighbor principle to capture finer details. 
Unlike <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib17" title="">17</a>]</cite>, which updates the codebooks with an exponential moving average, we iteratively optimize each codebook using k-means until the loss converges. Unless otherwise specified, codebooks are not fine-tuned, following the common practice for discrete speech units derived from k-means, where centroids are usually not fine-tuned. Note that, when <math alttext="L=1" class="ltx_Math" display="inline" id="S2.SS1.p1.5.m5.1"><semantics id="S2.SS1.p1.5.m5.1a"><mrow id="S2.SS1.p1.5.m5.1.1" xref="S2.SS1.p1.5.m5.1.1.cmml"><mi id="S2.SS1.p1.5.m5.1.1.2" xref="S2.SS1.p1.5.m5.1.1.2.cmml">L</mi><mo id="S2.SS1.p1.5.m5.1.1.1" xref="S2.SS1.p1.5.m5.1.1.1.cmml">=</mo><mn id="S2.SS1.p1.5.m5.1.1.3" xref="S2.SS1.p1.5.m5.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.5.m5.1b"><apply id="S2.SS1.p1.5.m5.1.1.cmml" xref="S2.SS1.p1.5.m5.1.1"><eq id="S2.SS1.p1.5.m5.1.1.1.cmml" xref="S2.SS1.p1.5.m5.1.1.1"></eq><ci id="S2.SS1.p1.5.m5.1.1.2.cmml" xref="S2.SS1.p1.5.m5.1.1.2">𝐿</ci><cn id="S2.SS1.p1.5.m5.1.1.3.cmml" type="integer" xref="S2.SS1.p1.5.m5.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.5.m5.1c">L=1</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.5.m5.1d">italic_L = 1</annotation></semantics></math>, our RVQ reduces to vanilla k-means.</p> </div> <div class="ltx_para" id="S2.SS1.p2"> <p class="ltx_p" id="S2.SS1.p2.4">In practice, we can represent a quantized frame <math alttext="\hat{r}_{t}" class="ltx_Math" display="inline" id="S2.SS1.p2.1.m1.1"><semantics id="S2.SS1.p2.1.m1.1a"><msub id="S2.SS1.p2.1.m1.1.1" xref="S2.SS1.p2.1.m1.1.1.cmml"><mover accent="true" id="S2.SS1.p2.1.m1.1.1.2" xref="S2.SS1.p2.1.m1.1.1.2.cmml"><mi id="S2.SS1.p2.1.m1.1.1.2.2" xref="S2.SS1.p2.1.m1.1.1.2.2.cmml">r</mi><mo id="S2.SS1.p2.1.m1.1.1.2.1" 
xref="S2.SS1.p2.1.m1.1.1.2.1.cmml">^</mo></mover><mi id="S2.SS1.p2.1.m1.1.1.3" xref="S2.SS1.p2.1.m1.1.1.3.cmml">t</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.1.m1.1b"><apply id="S2.SS1.p2.1.m1.1.1.cmml" xref="S2.SS1.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.1.m1.1.1.1.cmml" xref="S2.SS1.p2.1.m1.1.1">subscript</csymbol><apply id="S2.SS1.p2.1.m1.1.1.2.cmml" xref="S2.SS1.p2.1.m1.1.1.2"><ci id="S2.SS1.p2.1.m1.1.1.2.1.cmml" xref="S2.SS1.p2.1.m1.1.1.2.1">^</ci><ci id="S2.SS1.p2.1.m1.1.1.2.2.cmml" xref="S2.SS1.p2.1.m1.1.1.2.2">𝑟</ci></apply><ci id="S2.SS1.p2.1.m1.1.1.3.cmml" xref="S2.SS1.p2.1.m1.1.1.3">𝑡</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.1.m1.1c">\hat{r}_{t}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.1.m1.1d">over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT</annotation></semantics></math> with discrete speech units <math alttext="c_{t}=(c_{t,1},\dots,c_{t,L})" class="ltx_Math" display="inline" id="S2.SS1.p2.2.m2.7"><semantics id="S2.SS1.p2.2.m2.7a"><mrow id="S2.SS1.p2.2.m2.7.7" xref="S2.SS1.p2.2.m2.7.7.cmml"><msub id="S2.SS1.p2.2.m2.7.7.4" xref="S2.SS1.p2.2.m2.7.7.4.cmml"><mi id="S2.SS1.p2.2.m2.7.7.4.2" xref="S2.SS1.p2.2.m2.7.7.4.2.cmml">c</mi><mi id="S2.SS1.p2.2.m2.7.7.4.3" xref="S2.SS1.p2.2.m2.7.7.4.3.cmml">t</mi></msub><mo id="S2.SS1.p2.2.m2.7.7.3" xref="S2.SS1.p2.2.m2.7.7.3.cmml">=</mo><mrow id="S2.SS1.p2.2.m2.7.7.2.2" xref="S2.SS1.p2.2.m2.7.7.2.3.cmml"><mo id="S2.SS1.p2.2.m2.7.7.2.2.3" stretchy="false" xref="S2.SS1.p2.2.m2.7.7.2.3.cmml">(</mo><msub id="S2.SS1.p2.2.m2.6.6.1.1.1" xref="S2.SS1.p2.2.m2.6.6.1.1.1.cmml"><mi id="S2.SS1.p2.2.m2.6.6.1.1.1.2" xref="S2.SS1.p2.2.m2.6.6.1.1.1.2.cmml">c</mi><mrow id="S2.SS1.p2.2.m2.2.2.2.4" xref="S2.SS1.p2.2.m2.2.2.2.3.cmml"><mi id="S2.SS1.p2.2.m2.1.1.1.1" xref="S2.SS1.p2.2.m2.1.1.1.1.cmml">t</mi><mo id="S2.SS1.p2.2.m2.2.2.2.4.1" xref="S2.SS1.p2.2.m2.2.2.2.3.cmml">,</mo><mn 
id="S2.SS1.p2.2.m2.2.2.2.2" xref="S2.SS1.p2.2.m2.2.2.2.2.cmml">1</mn></mrow></msub><mo id="S2.SS1.p2.2.m2.7.7.2.2.4" xref="S2.SS1.p2.2.m2.7.7.2.3.cmml">,</mo><mi id="S2.SS1.p2.2.m2.5.5" mathvariant="normal" xref="S2.SS1.p2.2.m2.5.5.cmml">…</mi><mo id="S2.SS1.p2.2.m2.7.7.2.2.5" xref="S2.SS1.p2.2.m2.7.7.2.3.cmml">,</mo><msub id="S2.SS1.p2.2.m2.7.7.2.2.2" xref="S2.SS1.p2.2.m2.7.7.2.2.2.cmml"><mi id="S2.SS1.p2.2.m2.7.7.2.2.2.2" xref="S2.SS1.p2.2.m2.7.7.2.2.2.2.cmml">c</mi><mrow id="S2.SS1.p2.2.m2.4.4.2.4" xref="S2.SS1.p2.2.m2.4.4.2.3.cmml"><mi id="S2.SS1.p2.2.m2.3.3.1.1" xref="S2.SS1.p2.2.m2.3.3.1.1.cmml">t</mi><mo id="S2.SS1.p2.2.m2.4.4.2.4.1" xref="S2.SS1.p2.2.m2.4.4.2.3.cmml">,</mo><mi id="S2.SS1.p2.2.m2.4.4.2.2" xref="S2.SS1.p2.2.m2.4.4.2.2.cmml">L</mi></mrow></msub><mo id="S2.SS1.p2.2.m2.7.7.2.2.6" stretchy="false" xref="S2.SS1.p2.2.m2.7.7.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.2.m2.7b"><apply id="S2.SS1.p2.2.m2.7.7.cmml" xref="S2.SS1.p2.2.m2.7.7"><eq id="S2.SS1.p2.2.m2.7.7.3.cmml" xref="S2.SS1.p2.2.m2.7.7.3"></eq><apply id="S2.SS1.p2.2.m2.7.7.4.cmml" xref="S2.SS1.p2.2.m2.7.7.4"><csymbol cd="ambiguous" id="S2.SS1.p2.2.m2.7.7.4.1.cmml" xref="S2.SS1.p2.2.m2.7.7.4">subscript</csymbol><ci id="S2.SS1.p2.2.m2.7.7.4.2.cmml" xref="S2.SS1.p2.2.m2.7.7.4.2">𝑐</ci><ci id="S2.SS1.p2.2.m2.7.7.4.3.cmml" xref="S2.SS1.p2.2.m2.7.7.4.3">𝑡</ci></apply><vector id="S2.SS1.p2.2.m2.7.7.2.3.cmml" xref="S2.SS1.p2.2.m2.7.7.2.2"><apply id="S2.SS1.p2.2.m2.6.6.1.1.1.cmml" xref="S2.SS1.p2.2.m2.6.6.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.2.m2.6.6.1.1.1.1.cmml" xref="S2.SS1.p2.2.m2.6.6.1.1.1">subscript</csymbol><ci id="S2.SS1.p2.2.m2.6.6.1.1.1.2.cmml" xref="S2.SS1.p2.2.m2.6.6.1.1.1.2">𝑐</ci><list id="S2.SS1.p2.2.m2.2.2.2.3.cmml" xref="S2.SS1.p2.2.m2.2.2.2.4"><ci id="S2.SS1.p2.2.m2.1.1.1.1.cmml" xref="S2.SS1.p2.2.m2.1.1.1.1">𝑡</ci><cn id="S2.SS1.p2.2.m2.2.2.2.2.cmml" type="integer" xref="S2.SS1.p2.2.m2.2.2.2.2">1</cn></list></apply><ci 
id="S2.SS1.p2.2.m2.5.5.cmml" xref="S2.SS1.p2.2.m2.5.5">…</ci><apply id="S2.SS1.p2.2.m2.7.7.2.2.2.cmml" xref="S2.SS1.p2.2.m2.7.7.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.2.m2.7.7.2.2.2.1.cmml" xref="S2.SS1.p2.2.m2.7.7.2.2.2">subscript</csymbol><ci id="S2.SS1.p2.2.m2.7.7.2.2.2.2.cmml" xref="S2.SS1.p2.2.m2.7.7.2.2.2.2">𝑐</ci><list id="S2.SS1.p2.2.m2.4.4.2.3.cmml" xref="S2.SS1.p2.2.m2.4.4.2.4"><ci id="S2.SS1.p2.2.m2.3.3.1.1.cmml" xref="S2.SS1.p2.2.m2.3.3.1.1">𝑡</ci><ci id="S2.SS1.p2.2.m2.4.4.2.2.cmml" xref="S2.SS1.p2.2.m2.4.4.2.2">𝐿</ci></list></apply></vector></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.2.m2.7c">c_{t}=(c_{t,1},\dots,c_{t,L})</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.2.m2.7d">italic_c start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ( italic_c start_POSTSUBSCRIPT italic_t , 1 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_t , italic_L end_POSTSUBSCRIPT )</annotation></semantics></math> only at the cost of <math alttext="L\log_{2}N" class="ltx_Math" display="inline" id="S2.SS1.p2.3.m3.1"><semantics id="S2.SS1.p2.3.m3.1a"><mrow id="S2.SS1.p2.3.m3.1.1" xref="S2.SS1.p2.3.m3.1.1.cmml"><mi id="S2.SS1.p2.3.m3.1.1.2" xref="S2.SS1.p2.3.m3.1.1.2.cmml">L</mi><mo id="S2.SS1.p2.3.m3.1.1.1" lspace="0.167em" xref="S2.SS1.p2.3.m3.1.1.1.cmml">⁢</mo><mrow id="S2.SS1.p2.3.m3.1.1.3" xref="S2.SS1.p2.3.m3.1.1.3.cmml"><msub id="S2.SS1.p2.3.m3.1.1.3.1" xref="S2.SS1.p2.3.m3.1.1.3.1.cmml"><mi id="S2.SS1.p2.3.m3.1.1.3.1.2" xref="S2.SS1.p2.3.m3.1.1.3.1.2.cmml">log</mi><mn id="S2.SS1.p2.3.m3.1.1.3.1.3" xref="S2.SS1.p2.3.m3.1.1.3.1.3.cmml">2</mn></msub><mo id="S2.SS1.p2.3.m3.1.1.3a" lspace="0.167em" xref="S2.SS1.p2.3.m3.1.1.3.cmml">⁡</mo><mi id="S2.SS1.p2.3.m3.1.1.3.2" xref="S2.SS1.p2.3.m3.1.1.3.2.cmml">N</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.3.m3.1b"><apply id="S2.SS1.p2.3.m3.1.1.cmml" xref="S2.SS1.p2.3.m3.1.1"><times id="S2.SS1.p2.3.m3.1.1.1.cmml" 
xref="S2.SS1.p2.3.m3.1.1.1"></times><ci id="S2.SS1.p2.3.m3.1.1.2.cmml" xref="S2.SS1.p2.3.m3.1.1.2">𝐿</ci><apply id="S2.SS1.p2.3.m3.1.1.3.cmml" xref="S2.SS1.p2.3.m3.1.1.3"><apply id="S2.SS1.p2.3.m3.1.1.3.1.cmml" xref="S2.SS1.p2.3.m3.1.1.3.1"><csymbol cd="ambiguous" id="S2.SS1.p2.3.m3.1.1.3.1.1.cmml" xref="S2.SS1.p2.3.m3.1.1.3.1">subscript</csymbol><log id="S2.SS1.p2.3.m3.1.1.3.1.2.cmml" xref="S2.SS1.p2.3.m3.1.1.3.1.2"></log><cn id="S2.SS1.p2.3.m3.1.1.3.1.3.cmml" type="integer" xref="S2.SS1.p2.3.m3.1.1.3.1.3">2</cn></apply><ci id="S2.SS1.p2.3.m3.1.1.3.2.cmml" xref="S2.SS1.p2.3.m3.1.1.3.2">𝑁</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.3.m3.1c">L\log_{2}N</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.3.m3.1d">italic_L roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_N</annotation></semantics></math> bits. More formally, let <math alttext="V=(v_{1},\dots,v_{L})" class="ltx_Math" display="inline" id="S2.SS1.p2.4.m4.3"><semantics id="S2.SS1.p2.4.m4.3a"><mrow id="S2.SS1.p2.4.m4.3.3" xref="S2.SS1.p2.4.m4.3.3.cmml"><mi id="S2.SS1.p2.4.m4.3.3.4" xref="S2.SS1.p2.4.m4.3.3.4.cmml">V</mi><mo id="S2.SS1.p2.4.m4.3.3.3" xref="S2.SS1.p2.4.m4.3.3.3.cmml">=</mo><mrow id="S2.SS1.p2.4.m4.3.3.2.2" xref="S2.SS1.p2.4.m4.3.3.2.3.cmml"><mo id="S2.SS1.p2.4.m4.3.3.2.2.3" stretchy="false" xref="S2.SS1.p2.4.m4.3.3.2.3.cmml">(</mo><msub id="S2.SS1.p2.4.m4.2.2.1.1.1" xref="S2.SS1.p2.4.m4.2.2.1.1.1.cmml"><mi id="S2.SS1.p2.4.m4.2.2.1.1.1.2" xref="S2.SS1.p2.4.m4.2.2.1.1.1.2.cmml">v</mi><mn id="S2.SS1.p2.4.m4.2.2.1.1.1.3" xref="S2.SS1.p2.4.m4.2.2.1.1.1.3.cmml">1</mn></msub><mo id="S2.SS1.p2.4.m4.3.3.2.2.4" xref="S2.SS1.p2.4.m4.3.3.2.3.cmml">,</mo><mi id="S2.SS1.p2.4.m4.1.1" mathvariant="normal" xref="S2.SS1.p2.4.m4.1.1.cmml">…</mi><mo id="S2.SS1.p2.4.m4.3.3.2.2.5" xref="S2.SS1.p2.4.m4.3.3.2.3.cmml">,</mo><msub id="S2.SS1.p2.4.m4.3.3.2.2.2" xref="S2.SS1.p2.4.m4.3.3.2.2.2.cmml"><mi id="S2.SS1.p2.4.m4.3.3.2.2.2.2" 
xref="S2.SS1.p2.4.m4.3.3.2.2.2.2.cmml">v</mi><mi id="S2.SS1.p2.4.m4.3.3.2.2.2.3" xref="S2.SS1.p2.4.m4.3.3.2.2.2.3.cmml">L</mi></msub><mo id="S2.SS1.p2.4.m4.3.3.2.2.6" stretchy="false" xref="S2.SS1.p2.4.m4.3.3.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.4.m4.3b"><apply id="S2.SS1.p2.4.m4.3.3.cmml" xref="S2.SS1.p2.4.m4.3.3"><eq id="S2.SS1.p2.4.m4.3.3.3.cmml" xref="S2.SS1.p2.4.m4.3.3.3"></eq><ci id="S2.SS1.p2.4.m4.3.3.4.cmml" xref="S2.SS1.p2.4.m4.3.3.4">𝑉</ci><vector id="S2.SS1.p2.4.m4.3.3.2.3.cmml" xref="S2.SS1.p2.4.m4.3.3.2.2"><apply id="S2.SS1.p2.4.m4.2.2.1.1.1.cmml" xref="S2.SS1.p2.4.m4.2.2.1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.4.m4.2.2.1.1.1.1.cmml" xref="S2.SS1.p2.4.m4.2.2.1.1.1">subscript</csymbol><ci id="S2.SS1.p2.4.m4.2.2.1.1.1.2.cmml" xref="S2.SS1.p2.4.m4.2.2.1.1.1.2">𝑣</ci><cn id="S2.SS1.p2.4.m4.2.2.1.1.1.3.cmml" type="integer" xref="S2.SS1.p2.4.m4.2.2.1.1.1.3">1</cn></apply><ci id="S2.SS1.p2.4.m4.1.1.cmml" xref="S2.SS1.p2.4.m4.1.1">…</ci><apply id="S2.SS1.p2.4.m4.3.3.2.2.2.cmml" xref="S2.SS1.p2.4.m4.3.3.2.2.2"><csymbol cd="ambiguous" id="S2.SS1.p2.4.m4.3.3.2.2.2.1.cmml" xref="S2.SS1.p2.4.m4.3.3.2.2.2">subscript</csymbol><ci id="S2.SS1.p2.4.m4.3.3.2.2.2.2.cmml" xref="S2.SS1.p2.4.m4.3.3.2.2.2.2">𝑣</ci><ci id="S2.SS1.p2.4.m4.3.3.2.2.2.3.cmml" xref="S2.SS1.p2.4.m4.3.3.2.2.2.3">𝐿</ci></apply></vector></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.4.m4.3c">V=(v_{1},\dots,v_{L})</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.4.m4.3d">italic_V = ( italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_v start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT )</annotation></semantics></math> be the codebooks of RVQ; a quantized frame is then</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S6.EGx1"> <tbody id="S2.E1"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td 
ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle\hat{r}_{t}=\sum_{i=1}^{L}V_{i}\mathbf{1}_{c_{t,i}}," class="ltx_Math" display="inline" id="S2.E1.m1.3"><semantics id="S2.E1.m1.3a"><mrow id="S2.E1.m1.3.3.1" xref="S2.E1.m1.3.3.1.1.cmml"><mrow id="S2.E1.m1.3.3.1.1" xref="S2.E1.m1.3.3.1.1.cmml"><msub id="S2.E1.m1.3.3.1.1.2" xref="S2.E1.m1.3.3.1.1.2.cmml"><mover accent="true" id="S2.E1.m1.3.3.1.1.2.2" xref="S2.E1.m1.3.3.1.1.2.2.cmml"><mi id="S2.E1.m1.3.3.1.1.2.2.2" xref="S2.E1.m1.3.3.1.1.2.2.2.cmml">r</mi><mo id="S2.E1.m1.3.3.1.1.2.2.1" xref="S2.E1.m1.3.3.1.1.2.2.1.cmml">^</mo></mover><mi id="S2.E1.m1.3.3.1.1.2.3" xref="S2.E1.m1.3.3.1.1.2.3.cmml">t</mi></msub><mo id="S2.E1.m1.3.3.1.1.1" xref="S2.E1.m1.3.3.1.1.1.cmml">=</mo><mrow id="S2.E1.m1.3.3.1.1.3" xref="S2.E1.m1.3.3.1.1.3.cmml"><mstyle displaystyle="true" id="S2.E1.m1.3.3.1.1.3.1" xref="S2.E1.m1.3.3.1.1.3.1.cmml"><munderover id="S2.E1.m1.3.3.1.1.3.1a" xref="S2.E1.m1.3.3.1.1.3.1.cmml"><mo id="S2.E1.m1.3.3.1.1.3.1.2.2" movablelimits="false" xref="S2.E1.m1.3.3.1.1.3.1.2.2.cmml">∑</mo><mrow id="S2.E1.m1.3.3.1.1.3.1.2.3" xref="S2.E1.m1.3.3.1.1.3.1.2.3.cmml"><mi id="S2.E1.m1.3.3.1.1.3.1.2.3.2" xref="S2.E1.m1.3.3.1.1.3.1.2.3.2.cmml">i</mi><mo id="S2.E1.m1.3.3.1.1.3.1.2.3.1" xref="S2.E1.m1.3.3.1.1.3.1.2.3.1.cmml">=</mo><mn id="S2.E1.m1.3.3.1.1.3.1.2.3.3" xref="S2.E1.m1.3.3.1.1.3.1.2.3.3.cmml">1</mn></mrow><mi id="S2.E1.m1.3.3.1.1.3.1.3" xref="S2.E1.m1.3.3.1.1.3.1.3.cmml">L</mi></munderover></mstyle><mrow id="S2.E1.m1.3.3.1.1.3.2" xref="S2.E1.m1.3.3.1.1.3.2.cmml"><msub id="S2.E1.m1.3.3.1.1.3.2.2" xref="S2.E1.m1.3.3.1.1.3.2.2.cmml"><mi id="S2.E1.m1.3.3.1.1.3.2.2.2" xref="S2.E1.m1.3.3.1.1.3.2.2.2.cmml">V</mi><mi id="S2.E1.m1.3.3.1.1.3.2.2.3" xref="S2.E1.m1.3.3.1.1.3.2.2.3.cmml">i</mi></msub><mo id="S2.E1.m1.3.3.1.1.3.2.1" xref="S2.E1.m1.3.3.1.1.3.2.1.cmml">⁢</mo><msub id="S2.E1.m1.3.3.1.1.3.2.3" xref="S2.E1.m1.3.3.1.1.3.2.3.cmml"><mn id="S2.E1.m1.3.3.1.1.3.2.3.2" xref="S2.E1.m1.3.3.1.1.3.2.3.2.cmml">𝟏</mn><msub 
id="S2.E1.m1.2.2.2" xref="S2.E1.m1.2.2.2.cmml"><mi id="S2.E1.m1.2.2.2.4" xref="S2.E1.m1.2.2.2.4.cmml">c</mi><mrow id="S2.E1.m1.2.2.2.2.2.4" xref="S2.E1.m1.2.2.2.2.2.3.cmml"><mi id="S2.E1.m1.1.1.1.1.1.1" xref="S2.E1.m1.1.1.1.1.1.1.cmml">t</mi><mo id="S2.E1.m1.2.2.2.2.2.4.1" xref="S2.E1.m1.2.2.2.2.2.3.cmml">,</mo><mi id="S2.E1.m1.2.2.2.2.2.2" xref="S2.E1.m1.2.2.2.2.2.2.cmml">i</mi></mrow></msub></msub></mrow></mrow></mrow><mo id="S2.E1.m1.3.3.1.2" xref="S2.E1.m1.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m1.3b"><apply id="S2.E1.m1.3.3.1.1.cmml" xref="S2.E1.m1.3.3.1"><eq id="S2.E1.m1.3.3.1.1.1.cmml" xref="S2.E1.m1.3.3.1.1.1"></eq><apply id="S2.E1.m1.3.3.1.1.2.cmml" xref="S2.E1.m1.3.3.1.1.2"><csymbol cd="ambiguous" id="S2.E1.m1.3.3.1.1.2.1.cmml" xref="S2.E1.m1.3.3.1.1.2">subscript</csymbol><apply id="S2.E1.m1.3.3.1.1.2.2.cmml" xref="S2.E1.m1.3.3.1.1.2.2"><ci id="S2.E1.m1.3.3.1.1.2.2.1.cmml" xref="S2.E1.m1.3.3.1.1.2.2.1">^</ci><ci id="S2.E1.m1.3.3.1.1.2.2.2.cmml" xref="S2.E1.m1.3.3.1.1.2.2.2">𝑟</ci></apply><ci id="S2.E1.m1.3.3.1.1.2.3.cmml" xref="S2.E1.m1.3.3.1.1.2.3">𝑡</ci></apply><apply id="S2.E1.m1.3.3.1.1.3.cmml" xref="S2.E1.m1.3.3.1.1.3"><apply id="S2.E1.m1.3.3.1.1.3.1.cmml" xref="S2.E1.m1.3.3.1.1.3.1"><csymbol cd="ambiguous" id="S2.E1.m1.3.3.1.1.3.1.1.cmml" xref="S2.E1.m1.3.3.1.1.3.1">superscript</csymbol><apply id="S2.E1.m1.3.3.1.1.3.1.2.cmml" xref="S2.E1.m1.3.3.1.1.3.1"><csymbol cd="ambiguous" id="S2.E1.m1.3.3.1.1.3.1.2.1.cmml" xref="S2.E1.m1.3.3.1.1.3.1">subscript</csymbol><sum id="S2.E1.m1.3.3.1.1.3.1.2.2.cmml" xref="S2.E1.m1.3.3.1.1.3.1.2.2"></sum><apply id="S2.E1.m1.3.3.1.1.3.1.2.3.cmml" xref="S2.E1.m1.3.3.1.1.3.1.2.3"><eq id="S2.E1.m1.3.3.1.1.3.1.2.3.1.cmml" xref="S2.E1.m1.3.3.1.1.3.1.2.3.1"></eq><ci id="S2.E1.m1.3.3.1.1.3.1.2.3.2.cmml" xref="S2.E1.m1.3.3.1.1.3.1.2.3.2">𝑖</ci><cn id="S2.E1.m1.3.3.1.1.3.1.2.3.3.cmml" type="integer" xref="S2.E1.m1.3.3.1.1.3.1.2.3.3">1</cn></apply></apply><ci id="S2.E1.m1.3.3.1.1.3.1.3.cmml" 
xref="S2.E1.m1.3.3.1.1.3.1.3">𝐿</ci></apply><apply id="S2.E1.m1.3.3.1.1.3.2.cmml" xref="S2.E1.m1.3.3.1.1.3.2"><times id="S2.E1.m1.3.3.1.1.3.2.1.cmml" xref="S2.E1.m1.3.3.1.1.3.2.1"></times><apply id="S2.E1.m1.3.3.1.1.3.2.2.cmml" xref="S2.E1.m1.3.3.1.1.3.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.3.3.1.1.3.2.2.1.cmml" xref="S2.E1.m1.3.3.1.1.3.2.2">subscript</csymbol><ci id="S2.E1.m1.3.3.1.1.3.2.2.2.cmml" xref="S2.E1.m1.3.3.1.1.3.2.2.2">𝑉</ci><ci id="S2.E1.m1.3.3.1.1.3.2.2.3.cmml" xref="S2.E1.m1.3.3.1.1.3.2.2.3">𝑖</ci></apply><apply id="S2.E1.m1.3.3.1.1.3.2.3.cmml" xref="S2.E1.m1.3.3.1.1.3.2.3"><csymbol cd="ambiguous" id="S2.E1.m1.3.3.1.1.3.2.3.1.cmml" xref="S2.E1.m1.3.3.1.1.3.2.3">subscript</csymbol><cn id="S2.E1.m1.3.3.1.1.3.2.3.2.cmml" type="integer" xref="S2.E1.m1.3.3.1.1.3.2.3.2">1</cn><apply id="S2.E1.m1.2.2.2.cmml" xref="S2.E1.m1.2.2.2"><csymbol cd="ambiguous" id="S2.E1.m1.2.2.2.3.cmml" xref="S2.E1.m1.2.2.2">subscript</csymbol><ci id="S2.E1.m1.2.2.2.4.cmml" xref="S2.E1.m1.2.2.2.4">𝑐</ci><list id="S2.E1.m1.2.2.2.2.2.3.cmml" xref="S2.E1.m1.2.2.2.2.2.4"><ci id="S2.E1.m1.1.1.1.1.1.1.cmml" xref="S2.E1.m1.1.1.1.1.1.1">𝑡</ci><ci id="S2.E1.m1.2.2.2.2.2.2.cmml" xref="S2.E1.m1.2.2.2.2.2.2">𝑖</ci></list></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1.m1.3c">\displaystyle\hat{r}_{t}=\sum_{i=1}^{L}V_{i}\mathbf{1}_{c_{t,i}},</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m1.3d">over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_1 start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" 
rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS1.p2.6">where <math alttext="\mathbf{1}_{c_{i}}" class="ltx_Math" display="inline" id="S2.SS1.p2.5.m1.1"><semantics id="S2.SS1.p2.5.m1.1a"><msub id="S2.SS1.p2.5.m1.1.1" xref="S2.SS1.p2.5.m1.1.1.cmml"><mn id="S2.SS1.p2.5.m1.1.1.2" xref="S2.SS1.p2.5.m1.1.1.2.cmml">𝟏</mn><msub id="S2.SS1.p2.5.m1.1.1.3" xref="S2.SS1.p2.5.m1.1.1.3.cmml"><mi id="S2.SS1.p2.5.m1.1.1.3.2" xref="S2.SS1.p2.5.m1.1.1.3.2.cmml">c</mi><mi id="S2.SS1.p2.5.m1.1.1.3.3" xref="S2.SS1.p2.5.m1.1.1.3.3.cmml">i</mi></msub></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.5.m1.1b"><apply id="S2.SS1.p2.5.m1.1.1.cmml" xref="S2.SS1.p2.5.m1.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.5.m1.1.1.1.cmml" xref="S2.SS1.p2.5.m1.1.1">subscript</csymbol><cn id="S2.SS1.p2.5.m1.1.1.2.cmml" type="integer" xref="S2.SS1.p2.5.m1.1.1.2">1</cn><apply id="S2.SS1.p2.5.m1.1.1.3.cmml" xref="S2.SS1.p2.5.m1.1.1.3"><csymbol cd="ambiguous" id="S2.SS1.p2.5.m1.1.1.3.1.cmml" xref="S2.SS1.p2.5.m1.1.1.3">subscript</csymbol><ci id="S2.SS1.p2.5.m1.1.1.3.2.cmml" xref="S2.SS1.p2.5.m1.1.1.3.2">𝑐</ci><ci id="S2.SS1.p2.5.m1.1.1.3.3.cmml" xref="S2.SS1.p2.5.m1.1.1.3.3">𝑖</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.5.m1.1c">\mathbf{1}_{c_{i}}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.5.m1.1d">bold_1 start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT</annotation></semantics></math> is a one-hot vector with <math alttext="c_{i}" class="ltx_Math" display="inline" id="S2.SS1.p2.6.m2.1"><semantics id="S2.SS1.p2.6.m2.1a"><msub id="S2.SS1.p2.6.m2.1.1" xref="S2.SS1.p2.6.m2.1.1.cmml"><mi id="S2.SS1.p2.6.m2.1.1.2" xref="S2.SS1.p2.6.m2.1.1.2.cmml">c</mi><mi id="S2.SS1.p2.6.m2.1.1.3" xref="S2.SS1.p2.6.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.6.m2.1b"><apply 
id="S2.SS1.p2.6.m2.1.1.cmml" xref="S2.SS1.p2.6.m2.1.1"><csymbol cd="ambiguous" id="S2.SS1.p2.6.m2.1.1.1.cmml" xref="S2.SS1.p2.6.m2.1.1">subscript</csymbol><ci id="S2.SS1.p2.6.m2.1.1.2.cmml" xref="S2.SS1.p2.6.m2.1.1.2">𝑐</ci><ci id="S2.SS1.p2.6.m2.1.1.3.cmml" xref="S2.SS1.p2.6.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.6.m2.1c">c_{i}</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.6.m2.1d">italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math>-th entry being 1.</p> </div> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.2 </span>Completeness as mutual information</h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.1">Given (quantized) speech representations, we then define completeness as the mutual information between log Mel spectrograms <math alttext="X" class="ltx_Math" display="inline" id="S2.SS2.p1.1.m1.1"><semantics id="S2.SS2.p1.1.m1.1a"><mi id="S2.SS2.p1.1.m1.1.1" xref="S2.SS2.p1.1.m1.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.1.m1.1b"><ci id="S2.SS2.p1.1.m1.1.1.cmml" xref="S2.SS2.p1.1.m1.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.1.m1.1c">X</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.1.m1.1d">italic_X</annotation></semantics></math> and the representations. We choose log Mel spectrograms (log Mels) instead of raw waveforms because log Mels are sufficient for many speech processing tasks; the argument applies equally to waveforms. We argue that, if a representation is complete, it should be able to preserve <span class="ltx_text ltx_font_italic" id="S2.SS2.p1.1.1">all</span> information in the log Mel. 
The completeness is formally defined as</p> <table class="ltx_equationgroup ltx_eqn_table" id="S2.E2"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E2X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle I(R,X)" class="ltx_Math" display="inline" id="S2.E2X.2.1.1.m1.2"><semantics id="S2.E2X.2.1.1.m1.2a"><mrow id="S2.E2X.2.1.1.m1.2.3" xref="S2.E2X.2.1.1.m1.2.3.cmml"><mi id="S2.E2X.2.1.1.m1.2.3.2" xref="S2.E2X.2.1.1.m1.2.3.2.cmml">I</mi><mo id="S2.E2X.2.1.1.m1.2.3.1" xref="S2.E2X.2.1.1.m1.2.3.1.cmml">⁢</mo><mrow id="S2.E2X.2.1.1.m1.2.3.3.2" xref="S2.E2X.2.1.1.m1.2.3.3.1.cmml"><mo id="S2.E2X.2.1.1.m1.2.3.3.2.1" stretchy="false" xref="S2.E2X.2.1.1.m1.2.3.3.1.cmml">(</mo><mi id="S2.E2X.2.1.1.m1.1.1" xref="S2.E2X.2.1.1.m1.1.1.cmml">R</mi><mo id="S2.E2X.2.1.1.m1.2.3.3.2.2" xref="S2.E2X.2.1.1.m1.2.3.3.1.cmml">,</mo><mi id="S2.E2X.2.1.1.m1.2.2" xref="S2.E2X.2.1.1.m1.2.2.cmml">X</mi><mo id="S2.E2X.2.1.1.m1.2.3.3.2.3" stretchy="false" xref="S2.E2X.2.1.1.m1.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E2X.2.1.1.m1.2b"><apply id="S2.E2X.2.1.1.m1.2.3.cmml" xref="S2.E2X.2.1.1.m1.2.3"><times id="S2.E2X.2.1.1.m1.2.3.1.cmml" xref="S2.E2X.2.1.1.m1.2.3.1"></times><ci id="S2.E2X.2.1.1.m1.2.3.2.cmml" xref="S2.E2X.2.1.1.m1.2.3.2">𝐼</ci><interval closure="open" id="S2.E2X.2.1.1.m1.2.3.3.1.cmml" xref="S2.E2X.2.1.1.m1.2.3.3.2"><ci id="S2.E2X.2.1.1.m1.1.1.cmml" xref="S2.E2X.2.1.1.m1.1.1">𝑅</ci><ci id="S2.E2X.2.1.1.m1.2.2.cmml" xref="S2.E2X.2.1.1.m1.2.2">𝑋</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2X.2.1.1.m1.2c">\displaystyle I(R,X)</annotation><annotation encoding="application/x-llamapun" id="S2.E2X.2.1.1.m1.2d">italic_I ( italic_R , italic_X )</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=H(X)-H(X|R)" class="ltx_Math" display="inline" 
id="S2.E2X.3.2.2.m1.2"><semantics id="S2.E2X.3.2.2.m1.2a"><mrow id="S2.E2X.3.2.2.m1.2.2" xref="S2.E2X.3.2.2.m1.2.2.cmml"><mi id="S2.E2X.3.2.2.m1.2.2.3" xref="S2.E2X.3.2.2.m1.2.2.3.cmml"></mi><mo id="S2.E2X.3.2.2.m1.2.2.2" xref="S2.E2X.3.2.2.m1.2.2.2.cmml">=</mo><mrow id="S2.E2X.3.2.2.m1.2.2.1" xref="S2.E2X.3.2.2.m1.2.2.1.cmml"><mrow id="S2.E2X.3.2.2.m1.2.2.1.3" xref="S2.E2X.3.2.2.m1.2.2.1.3.cmml"><mi id="S2.E2X.3.2.2.m1.2.2.1.3.2" xref="S2.E2X.3.2.2.m1.2.2.1.3.2.cmml">H</mi><mo id="S2.E2X.3.2.2.m1.2.2.1.3.1" xref="S2.E2X.3.2.2.m1.2.2.1.3.1.cmml">⁢</mo><mrow id="S2.E2X.3.2.2.m1.2.2.1.3.3.2" xref="S2.E2X.3.2.2.m1.2.2.1.3.cmml"><mo id="S2.E2X.3.2.2.m1.2.2.1.3.3.2.1" stretchy="false" xref="S2.E2X.3.2.2.m1.2.2.1.3.cmml">(</mo><mi id="S2.E2X.3.2.2.m1.1.1" xref="S2.E2X.3.2.2.m1.1.1.cmml">X</mi><mo id="S2.E2X.3.2.2.m1.2.2.1.3.3.2.2" stretchy="false" xref="S2.E2X.3.2.2.m1.2.2.1.3.cmml">)</mo></mrow></mrow><mo id="S2.E2X.3.2.2.m1.2.2.1.2" xref="S2.E2X.3.2.2.m1.2.2.1.2.cmml">−</mo><mrow id="S2.E2X.3.2.2.m1.2.2.1.1" xref="S2.E2X.3.2.2.m1.2.2.1.1.cmml"><mi id="S2.E2X.3.2.2.m1.2.2.1.1.3" xref="S2.E2X.3.2.2.m1.2.2.1.1.3.cmml">H</mi><mo id="S2.E2X.3.2.2.m1.2.2.1.1.2" xref="S2.E2X.3.2.2.m1.2.2.1.1.2.cmml">⁢</mo><mrow id="S2.E2X.3.2.2.m1.2.2.1.1.1.1" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.cmml"><mo id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.2" stretchy="false" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.cmml">(</mo><mrow id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.cmml"><mi id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.2" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.2.cmml">X</mi><mo fence="false" id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.1" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.1.cmml">|</mo><mi id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.3" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.3.cmml">R</mi></mrow><mo id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.3" stretchy="false" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E2X.3.2.2.m1.2b"><apply 
id="S2.E2X.3.2.2.m1.2.2.cmml" xref="S2.E2X.3.2.2.m1.2.2"><eq id="S2.E2X.3.2.2.m1.2.2.2.cmml" xref="S2.E2X.3.2.2.m1.2.2.2"></eq><csymbol cd="latexml" id="S2.E2X.3.2.2.m1.2.2.3.cmml" xref="S2.E2X.3.2.2.m1.2.2.3">absent</csymbol><apply id="S2.E2X.3.2.2.m1.2.2.1.cmml" xref="S2.E2X.3.2.2.m1.2.2.1"><minus id="S2.E2X.3.2.2.m1.2.2.1.2.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.2"></minus><apply id="S2.E2X.3.2.2.m1.2.2.1.3.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.3"><times id="S2.E2X.3.2.2.m1.2.2.1.3.1.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.3.1"></times><ci id="S2.E2X.3.2.2.m1.2.2.1.3.2.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.3.2">𝐻</ci><ci id="S2.E2X.3.2.2.m1.1.1.cmml" xref="S2.E2X.3.2.2.m1.1.1">𝑋</ci></apply><apply id="S2.E2X.3.2.2.m1.2.2.1.1.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1"><times id="S2.E2X.3.2.2.m1.2.2.1.1.2.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1.2"></times><ci id="S2.E2X.3.2.2.m1.2.2.1.1.3.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1.3">𝐻</ci><apply id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1"><csymbol cd="latexml" id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.1.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.1">conditional</csymbol><ci id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.2.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.2">𝑋</ci><ci id="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.3.cmml" xref="S2.E2X.3.2.2.m1.2.2.1.1.1.1.1.3">𝑅</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2X.3.2.2.m1.2c">\displaystyle=H(X)-H(X|R)</annotation><annotation encoding="application/x-llamapun" id="S2.E2X.3.2.2.m1.2d">= italic_H ( italic_X ) - italic_H ( italic_X | italic_R )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="2"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(2)</span></td> </tr> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E2Xa"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td 
ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\geq I(\hat{R},X)," class="ltx_Math" display="inline" id="S2.E2Xa.2.1.1.m1.3"><semantics id="S2.E2Xa.2.1.1.m1.3a"><mrow id="S2.E2Xa.2.1.1.m1.3.3.1" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.cmml"><mrow id="S2.E2Xa.2.1.1.m1.3.3.1.1" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.cmml"><mi id="S2.E2Xa.2.1.1.m1.3.3.1.1.2" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.2.cmml"></mi><mo id="S2.E2Xa.2.1.1.m1.3.3.1.1.1" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.1.cmml">≥</mo><mrow id="S2.E2Xa.2.1.1.m1.3.3.1.1.3" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.cmml"><mi id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.2" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.2.cmml">I</mi><mo id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.1" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.1.cmml">⁢</mo><mrow id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.2" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.1.cmml"><mo id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.2.1" stretchy="false" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.1.cmml">(</mo><mover accent="true" id="S2.E2Xa.2.1.1.m1.1.1" xref="S2.E2Xa.2.1.1.m1.1.1.cmml"><mi id="S2.E2Xa.2.1.1.m1.1.1.2" xref="S2.E2Xa.2.1.1.m1.1.1.2.cmml">R</mi><mo id="S2.E2Xa.2.1.1.m1.1.1.1" xref="S2.E2Xa.2.1.1.m1.1.1.1.cmml">^</mo></mover><mo id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.2.2" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.1.cmml">,</mo><mi id="S2.E2Xa.2.1.1.m1.2.2" xref="S2.E2Xa.2.1.1.m1.2.2.cmml">X</mi><mo id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.2.3" stretchy="false" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.1.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E2Xa.2.1.1.m1.3.3.1.2" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E2Xa.2.1.1.m1.3b"><apply id="S2.E2Xa.2.1.1.m1.3.3.1.1.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1"><geq id="S2.E2Xa.2.1.1.m1.3.3.1.1.1.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.1"></geq><csymbol cd="latexml" id="S2.E2Xa.2.1.1.m1.3.3.1.1.2.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.2">absent</csymbol><apply id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3"><times 
id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.1.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.1"></times><ci id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.2.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.2">𝐼</ci><interval closure="open" id="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.1.cmml" xref="S2.E2Xa.2.1.1.m1.3.3.1.1.3.3.2"><apply id="S2.E2Xa.2.1.1.m1.1.1.cmml" xref="S2.E2Xa.2.1.1.m1.1.1"><ci id="S2.E2Xa.2.1.1.m1.1.1.1.cmml" xref="S2.E2Xa.2.1.1.m1.1.1.1">^</ci><ci id="S2.E2Xa.2.1.1.m1.1.1.2.cmml" xref="S2.E2Xa.2.1.1.m1.1.1.2">𝑅</ci></apply><ci id="S2.E2Xa.2.1.1.m1.2.2.cmml" xref="S2.E2Xa.2.1.1.m1.2.2">𝑋</ci></interval></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2Xa.2.1.1.m1.3c">\displaystyle\geq I(\hat{R},X),</annotation><annotation encoding="application/x-llamapun" id="S2.E2Xa.2.1.1.m1.3d">≥ italic_I ( over^ start_ARG italic_R end_ARG , italic_X ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr> </tbody> </table> <p class="ltx_p" id="S2.SS2.p1.3">where the inequality in the second line follows from the data processing inequality, because the quantized representation is a function of the unquantized one. 
Because <math alttext="H(X)" class="ltx_Math" display="inline" id="S2.SS2.p1.2.m1.1"><semantics id="S2.SS2.p1.2.m1.1a"><mrow id="S2.SS2.p1.2.m1.1.2" xref="S2.SS2.p1.2.m1.1.2.cmml"><mi id="S2.SS2.p1.2.m1.1.2.2" xref="S2.SS2.p1.2.m1.1.2.2.cmml">H</mi><mo id="S2.SS2.p1.2.m1.1.2.1" xref="S2.SS2.p1.2.m1.1.2.1.cmml">⁢</mo><mrow id="S2.SS2.p1.2.m1.1.2.3.2" xref="S2.SS2.p1.2.m1.1.2.cmml"><mo id="S2.SS2.p1.2.m1.1.2.3.2.1" stretchy="false" xref="S2.SS2.p1.2.m1.1.2.cmml">(</mo><mi id="S2.SS2.p1.2.m1.1.1" xref="S2.SS2.p1.2.m1.1.1.cmml">X</mi><mo id="S2.SS2.p1.2.m1.1.2.3.2.2" stretchy="false" xref="S2.SS2.p1.2.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.2.m1.1b"><apply id="S2.SS2.p1.2.m1.1.2.cmml" xref="S2.SS2.p1.2.m1.1.2"><times id="S2.SS2.p1.2.m1.1.2.1.cmml" xref="S2.SS2.p1.2.m1.1.2.1"></times><ci id="S2.SS2.p1.2.m1.1.2.2.cmml" xref="S2.SS2.p1.2.m1.1.2.2">𝐻</ci><ci id="S2.SS2.p1.2.m1.1.1.cmml" xref="S2.SS2.p1.2.m1.1.1">𝑋</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.2.m1.1c">H(X)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.2.m1.1d">italic_H ( italic_X )</annotation></semantics></math> remains constant given different representations, we only have to compute the conditional entropy <math alttext="H(X|R)" class="ltx_Math" display="inline" id="S2.SS2.p1.3.m2.1"><semantics id="S2.SS2.p1.3.m2.1a"><mrow id="S2.SS2.p1.3.m2.1.1" xref="S2.SS2.p1.3.m2.1.1.cmml"><mi id="S2.SS2.p1.3.m2.1.1.3" xref="S2.SS2.p1.3.m2.1.1.3.cmml">H</mi><mo id="S2.SS2.p1.3.m2.1.1.2" xref="S2.SS2.p1.3.m2.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p1.3.m2.1.1.1.1" xref="S2.SS2.p1.3.m2.1.1.1.1.1.cmml"><mo id="S2.SS2.p1.3.m2.1.1.1.1.2" stretchy="false" xref="S2.SS2.p1.3.m2.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS2.p1.3.m2.1.1.1.1.1" xref="S2.SS2.p1.3.m2.1.1.1.1.1.cmml"><mi id="S2.SS2.p1.3.m2.1.1.1.1.1.2" xref="S2.SS2.p1.3.m2.1.1.1.1.1.2.cmml">X</mi><mo fence="false" id="S2.SS2.p1.3.m2.1.1.1.1.1.1" 
xref="S2.SS2.p1.3.m2.1.1.1.1.1.1.cmml">|</mo><mi id="S2.SS2.p1.3.m2.1.1.1.1.1.3" xref="S2.SS2.p1.3.m2.1.1.1.1.1.3.cmml">R</mi></mrow><mo id="S2.SS2.p1.3.m2.1.1.1.1.3" stretchy="false" xref="S2.SS2.p1.3.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p1.3.m2.1b"><apply id="S2.SS2.p1.3.m2.1.1.cmml" xref="S2.SS2.p1.3.m2.1.1"><times id="S2.SS2.p1.3.m2.1.1.2.cmml" xref="S2.SS2.p1.3.m2.1.1.2"></times><ci id="S2.SS2.p1.3.m2.1.1.3.cmml" xref="S2.SS2.p1.3.m2.1.1.3">𝐻</ci><apply id="S2.SS2.p1.3.m2.1.1.1.1.1.cmml" xref="S2.SS2.p1.3.m2.1.1.1.1"><csymbol cd="latexml" id="S2.SS2.p1.3.m2.1.1.1.1.1.1.cmml" xref="S2.SS2.p1.3.m2.1.1.1.1.1.1">conditional</csymbol><ci id="S2.SS2.p1.3.m2.1.1.1.1.1.2.cmml" xref="S2.SS2.p1.3.m2.1.1.1.1.1.2">𝑋</ci><ci id="S2.SS2.p1.3.m2.1.1.1.1.1.3.cmml" xref="S2.SS2.p1.3.m2.1.1.1.1.1.3">𝑅</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p1.3.m2.1c">H(X|R)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p1.3.m2.1d">italic_H ( italic_X | italic_R )</annotation></semantics></math> to measure completeness.</p> </div> <div class="ltx_para" id="S2.SS2.p2"> <p class="ltx_p" id="S2.SS2.p2.2">However, this conditional entropy generally cannot be computed exactly <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib24" title="">24</a>]</cite>. 
To estimate <math alttext="H(X|R)" class="ltx_Math" display="inline" id="S2.SS2.p2.1.m1.1"><semantics id="S2.SS2.p2.1.m1.1a"><mrow id="S2.SS2.p2.1.m1.1.1" xref="S2.SS2.p2.1.m1.1.1.cmml"><mi id="S2.SS2.p2.1.m1.1.1.3" xref="S2.SS2.p2.1.m1.1.1.3.cmml">H</mi><mo id="S2.SS2.p2.1.m1.1.1.2" xref="S2.SS2.p2.1.m1.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p2.1.m1.1.1.1.1" xref="S2.SS2.p2.1.m1.1.1.1.1.1.cmml"><mo id="S2.SS2.p2.1.m1.1.1.1.1.2" stretchy="false" xref="S2.SS2.p2.1.m1.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS2.p2.1.m1.1.1.1.1.1" xref="S2.SS2.p2.1.m1.1.1.1.1.1.cmml"><mi id="S2.SS2.p2.1.m1.1.1.1.1.1.2" xref="S2.SS2.p2.1.m1.1.1.1.1.1.2.cmml">X</mi><mo fence="false" id="S2.SS2.p2.1.m1.1.1.1.1.1.1" xref="S2.SS2.p2.1.m1.1.1.1.1.1.1.cmml">|</mo><mi id="S2.SS2.p2.1.m1.1.1.1.1.1.3" xref="S2.SS2.p2.1.m1.1.1.1.1.1.3.cmml">R</mi></mrow><mo id="S2.SS2.p2.1.m1.1.1.1.1.3" stretchy="false" xref="S2.SS2.p2.1.m1.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.1.m1.1b"><apply id="S2.SS2.p2.1.m1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1"><times id="S2.SS2.p2.1.m1.1.1.2.cmml" xref="S2.SS2.p2.1.m1.1.1.2"></times><ci id="S2.SS2.p2.1.m1.1.1.3.cmml" xref="S2.SS2.p2.1.m1.1.1.3">𝐻</ci><apply id="S2.SS2.p2.1.m1.1.1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1"><csymbol cd="latexml" id="S2.SS2.p2.1.m1.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1.1.1">conditional</csymbol><ci id="S2.SS2.p2.1.m1.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1.1.2">𝑋</ci><ci id="S2.SS2.p2.1.m1.1.1.1.1.1.3.cmml" xref="S2.SS2.p2.1.m1.1.1.1.1.1.3">𝑅</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.1.m1.1c">H(X|R)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.1.m1.1d">italic_H ( italic_X | italic_R )</annotation></semantics></math>, we upper-bound it with cross entropy estimation, introducing a variational distribution <math alttext="q(x|r)" class="ltx_Math" display="inline" id="S2.SS2.p2.2.m2.1"><semantics 
id="S2.SS2.p2.2.m2.1a"><mrow id="S2.SS2.p2.2.m2.1.1" xref="S2.SS2.p2.2.m2.1.1.cmml"><mi id="S2.SS2.p2.2.m2.1.1.3" xref="S2.SS2.p2.2.m2.1.1.3.cmml">q</mi><mo id="S2.SS2.p2.2.m2.1.1.2" xref="S2.SS2.p2.2.m2.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p2.2.m2.1.1.1.1" xref="S2.SS2.p2.2.m2.1.1.1.1.1.cmml"><mo id="S2.SS2.p2.2.m2.1.1.1.1.2" stretchy="false" xref="S2.SS2.p2.2.m2.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS2.p2.2.m2.1.1.1.1.1" xref="S2.SS2.p2.2.m2.1.1.1.1.1.cmml"><mi id="S2.SS2.p2.2.m2.1.1.1.1.1.2" xref="S2.SS2.p2.2.m2.1.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.SS2.p2.2.m2.1.1.1.1.1.1" xref="S2.SS2.p2.2.m2.1.1.1.1.1.1.cmml">|</mo><mi id="S2.SS2.p2.2.m2.1.1.1.1.1.3" xref="S2.SS2.p2.2.m2.1.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.SS2.p2.2.m2.1.1.1.1.3" stretchy="false" xref="S2.SS2.p2.2.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.2.m2.1b"><apply id="S2.SS2.p2.2.m2.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1"><times id="S2.SS2.p2.2.m2.1.1.2.cmml" xref="S2.SS2.p2.2.m2.1.1.2"></times><ci id="S2.SS2.p2.2.m2.1.1.3.cmml" xref="S2.SS2.p2.2.m2.1.1.3">𝑞</ci><apply id="S2.SS2.p2.2.m2.1.1.1.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1.1.1"><csymbol cd="latexml" id="S2.SS2.p2.2.m2.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.2.m2.1.1.1.1.1.1">conditional</csymbol><ci id="S2.SS2.p2.2.m2.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.2.m2.1.1.1.1.1.2">𝑥</ci><ci id="S2.SS2.p2.2.m2.1.1.1.1.1.3.cmml" xref="S2.SS2.p2.2.m2.1.1.1.1.1.3">𝑟</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.2.m2.1c">q(x|r)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.2.m2.1d">italic_q ( italic_x | italic_r )</annotation></semantics></math> <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib24" title="">24</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib15" title="">15</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib28" 
title="">28</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib29" title="">29</a>]</cite>. This leads to a lower bound on the mutual information</p> <table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="S6.EGx2"> <tbody id="S2.Ex1"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle I(R,X)" class="ltx_Math" display="inline" id="S2.Ex1.m1.2"><semantics id="S2.Ex1.m1.2a"><mrow id="S2.Ex1.m1.2.3" xref="S2.Ex1.m1.2.3.cmml"><mi id="S2.Ex1.m1.2.3.2" xref="S2.Ex1.m1.2.3.2.cmml">I</mi><mo id="S2.Ex1.m1.2.3.1" xref="S2.Ex1.m1.2.3.1.cmml">⁢</mo><mrow id="S2.Ex1.m1.2.3.3.2" xref="S2.Ex1.m1.2.3.3.1.cmml"><mo id="S2.Ex1.m1.2.3.3.2.1" stretchy="false" xref="S2.Ex1.m1.2.3.3.1.cmml">(</mo><mi id="S2.Ex1.m1.1.1" xref="S2.Ex1.m1.1.1.cmml">R</mi><mo id="S2.Ex1.m1.2.3.3.2.2" xref="S2.Ex1.m1.2.3.3.1.cmml">,</mo><mi id="S2.Ex1.m1.2.2" xref="S2.Ex1.m1.2.2.cmml">X</mi><mo id="S2.Ex1.m1.2.3.3.2.3" stretchy="false" xref="S2.Ex1.m1.2.3.3.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.Ex1.m1.2b"><apply id="S2.Ex1.m1.2.3.cmml" xref="S2.Ex1.m1.2.3"><times id="S2.Ex1.m1.2.3.1.cmml" xref="S2.Ex1.m1.2.3.1"></times><ci id="S2.Ex1.m1.2.3.2.cmml" xref="S2.Ex1.m1.2.3.2">𝐼</ci><interval closure="open" id="S2.Ex1.m1.2.3.3.1.cmml" xref="S2.Ex1.m1.2.3.3.2"><ci id="S2.Ex1.m1.1.1.cmml" xref="S2.Ex1.m1.1.1">𝑅</ci><ci id="S2.Ex1.m1.2.2.cmml" xref="S2.Ex1.m1.2.2">𝑋</ci></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.Ex1.m1.2c">\displaystyle I(R,X)</annotation><annotation encoding="application/x-llamapun" id="S2.Ex1.m1.2d">italic_I ( italic_R , italic_X )</annotation></semantics></math></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=H(X)-H(X|R)" class="ltx_Math" display="inline" id="S2.Ex1.m2.2"><semantics id="S2.Ex1.m2.2a"><mrow id="S2.Ex1.m2.2.2" 
xref="S2.Ex1.m2.2.2.cmml"><mi id="S2.Ex1.m2.2.2.3" xref="S2.Ex1.m2.2.2.3.cmml"></mi><mo id="S2.Ex1.m2.2.2.2" xref="S2.Ex1.m2.2.2.2.cmml">=</mo><mrow id="S2.Ex1.m2.2.2.1" xref="S2.Ex1.m2.2.2.1.cmml"><mrow id="S2.Ex1.m2.2.2.1.3" xref="S2.Ex1.m2.2.2.1.3.cmml"><mi id="S2.Ex1.m2.2.2.1.3.2" xref="S2.Ex1.m2.2.2.1.3.2.cmml">H</mi><mo id="S2.Ex1.m2.2.2.1.3.1" xref="S2.Ex1.m2.2.2.1.3.1.cmml">⁢</mo><mrow id="S2.Ex1.m2.2.2.1.3.3.2" xref="S2.Ex1.m2.2.2.1.3.cmml"><mo id="S2.Ex1.m2.2.2.1.3.3.2.1" stretchy="false" xref="S2.Ex1.m2.2.2.1.3.cmml">(</mo><mi id="S2.Ex1.m2.1.1" xref="S2.Ex1.m2.1.1.cmml">X</mi><mo id="S2.Ex1.m2.2.2.1.3.3.2.2" stretchy="false" xref="S2.Ex1.m2.2.2.1.3.cmml">)</mo></mrow></mrow><mo id="S2.Ex1.m2.2.2.1.2" xref="S2.Ex1.m2.2.2.1.2.cmml">−</mo><mrow id="S2.Ex1.m2.2.2.1.1" xref="S2.Ex1.m2.2.2.1.1.cmml"><mi id="S2.Ex1.m2.2.2.1.1.3" xref="S2.Ex1.m2.2.2.1.1.3.cmml">H</mi><mo id="S2.Ex1.m2.2.2.1.1.2" xref="S2.Ex1.m2.2.2.1.1.2.cmml">⁢</mo><mrow id="S2.Ex1.m2.2.2.1.1.1.1" xref="S2.Ex1.m2.2.2.1.1.1.1.1.cmml"><mo id="S2.Ex1.m2.2.2.1.1.1.1.2" stretchy="false" xref="S2.Ex1.m2.2.2.1.1.1.1.1.cmml">(</mo><mrow id="S2.Ex1.m2.2.2.1.1.1.1.1" xref="S2.Ex1.m2.2.2.1.1.1.1.1.cmml"><mi id="S2.Ex1.m2.2.2.1.1.1.1.1.2" xref="S2.Ex1.m2.2.2.1.1.1.1.1.2.cmml">X</mi><mo fence="false" id="S2.Ex1.m2.2.2.1.1.1.1.1.1" xref="S2.Ex1.m2.2.2.1.1.1.1.1.1.cmml">|</mo><mi id="S2.Ex1.m2.2.2.1.1.1.1.1.3" xref="S2.Ex1.m2.2.2.1.1.1.1.1.3.cmml">R</mi></mrow><mo id="S2.Ex1.m2.2.2.1.1.1.1.3" stretchy="false" xref="S2.Ex1.m2.2.2.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.Ex1.m2.2b"><apply id="S2.Ex1.m2.2.2.cmml" xref="S2.Ex1.m2.2.2"><eq id="S2.Ex1.m2.2.2.2.cmml" xref="S2.Ex1.m2.2.2.2"></eq><csymbol cd="latexml" id="S2.Ex1.m2.2.2.3.cmml" xref="S2.Ex1.m2.2.2.3">absent</csymbol><apply id="S2.Ex1.m2.2.2.1.cmml" xref="S2.Ex1.m2.2.2.1"><minus id="S2.Ex1.m2.2.2.1.2.cmml" xref="S2.Ex1.m2.2.2.1.2"></minus><apply id="S2.Ex1.m2.2.2.1.3.cmml" 
xref="S2.Ex1.m2.2.2.1.3"><times id="S2.Ex1.m2.2.2.1.3.1.cmml" xref="S2.Ex1.m2.2.2.1.3.1"></times><ci id="S2.Ex1.m2.2.2.1.3.2.cmml" xref="S2.Ex1.m2.2.2.1.3.2">𝐻</ci><ci id="S2.Ex1.m2.1.1.cmml" xref="S2.Ex1.m2.1.1">𝑋</ci></apply><apply id="S2.Ex1.m2.2.2.1.1.cmml" xref="S2.Ex1.m2.2.2.1.1"><times id="S2.Ex1.m2.2.2.1.1.2.cmml" xref="S2.Ex1.m2.2.2.1.1.2"></times><ci id="S2.Ex1.m2.2.2.1.1.3.cmml" xref="S2.Ex1.m2.2.2.1.1.3">𝐻</ci><apply id="S2.Ex1.m2.2.2.1.1.1.1.1.cmml" xref="S2.Ex1.m2.2.2.1.1.1.1"><csymbol cd="latexml" id="S2.Ex1.m2.2.2.1.1.1.1.1.1.cmml" xref="S2.Ex1.m2.2.2.1.1.1.1.1.1">conditional</csymbol><ci id="S2.Ex1.m2.2.2.1.1.1.1.1.2.cmml" xref="S2.Ex1.m2.2.2.1.1.1.1.1.2">𝑋</ci><ci id="S2.Ex1.m2.2.2.1.1.1.1.1.3.cmml" xref="S2.Ex1.m2.2.2.1.1.1.1.1.3">𝑅</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.Ex1.m2.2c">\displaystyle=H(X)-H(X|R)</annotation><annotation encoding="application/x-llamapun" id="S2.Ex1.m2.2d">= italic_H ( italic_X ) - italic_H ( italic_X | italic_R )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="S2.Ex2"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=H(X)+\mathbb{E}_{(x,r)\sim p}[\log q(x|r)+\log\frac{p(x|r)}{q(x|% r)}]" class="ltx_Math" display="inline" id="S2.Ex2.m1.6"><semantics id="S2.Ex2.m1.6a"><mrow id="S2.Ex2.m1.6.6" xref="S2.Ex2.m1.6.6.cmml"><mi id="S2.Ex2.m1.6.6.3" xref="S2.Ex2.m1.6.6.3.cmml"></mi><mo id="S2.Ex2.m1.6.6.2" xref="S2.Ex2.m1.6.6.2.cmml">=</mo><mrow id="S2.Ex2.m1.6.6.1" xref="S2.Ex2.m1.6.6.1.cmml"><mrow id="S2.Ex2.m1.6.6.1.3" xref="S2.Ex2.m1.6.6.1.3.cmml"><mi id="S2.Ex2.m1.6.6.1.3.2" xref="S2.Ex2.m1.6.6.1.3.2.cmml">H</mi><mo id="S2.Ex2.m1.6.6.1.3.1" xref="S2.Ex2.m1.6.6.1.3.1.cmml">⁢</mo><mrow id="S2.Ex2.m1.6.6.1.3.3.2" 
xref="S2.Ex2.m1.6.6.1.3.cmml"><mo id="S2.Ex2.m1.6.6.1.3.3.2.1" stretchy="false" xref="S2.Ex2.m1.6.6.1.3.cmml">(</mo><mi id="S2.Ex2.m1.5.5" xref="S2.Ex2.m1.5.5.cmml">X</mi><mo id="S2.Ex2.m1.6.6.1.3.3.2.2" stretchy="false" xref="S2.Ex2.m1.6.6.1.3.cmml">)</mo></mrow></mrow><mo id="S2.Ex2.m1.6.6.1.2" xref="S2.Ex2.m1.6.6.1.2.cmml">+</mo><mrow id="S2.Ex2.m1.6.6.1.1" xref="S2.Ex2.m1.6.6.1.1.cmml"><msub id="S2.Ex2.m1.6.6.1.1.3" xref="S2.Ex2.m1.6.6.1.1.3.cmml"><mi id="S2.Ex2.m1.6.6.1.1.3.2" xref="S2.Ex2.m1.6.6.1.1.3.2.cmml">𝔼</mi><mrow id="S2.Ex2.m1.2.2.2" xref="S2.Ex2.m1.2.2.2.cmml"><mrow id="S2.Ex2.m1.2.2.2.4.2" xref="S2.Ex2.m1.2.2.2.4.1.cmml"><mo id="S2.Ex2.m1.2.2.2.4.2.1" stretchy="false" xref="S2.Ex2.m1.2.2.2.4.1.cmml">(</mo><mi id="S2.Ex2.m1.1.1.1.1" xref="S2.Ex2.m1.1.1.1.1.cmml">x</mi><mo id="S2.Ex2.m1.2.2.2.4.2.2" xref="S2.Ex2.m1.2.2.2.4.1.cmml">,</mo><mi id="S2.Ex2.m1.2.2.2.2" xref="S2.Ex2.m1.2.2.2.2.cmml">r</mi><mo id="S2.Ex2.m1.2.2.2.4.2.3" stretchy="false" xref="S2.Ex2.m1.2.2.2.4.1.cmml">)</mo></mrow><mo id="S2.Ex2.m1.2.2.2.3" xref="S2.Ex2.m1.2.2.2.3.cmml">∼</mo><mi id="S2.Ex2.m1.2.2.2.5" xref="S2.Ex2.m1.2.2.2.5.cmml">p</mi></mrow></msub><mo id="S2.Ex2.m1.6.6.1.1.2" xref="S2.Ex2.m1.6.6.1.1.2.cmml">⁢</mo><mrow id="S2.Ex2.m1.6.6.1.1.1.1" xref="S2.Ex2.m1.6.6.1.1.1.2.cmml"><mo id="S2.Ex2.m1.6.6.1.1.1.1.2" stretchy="false" xref="S2.Ex2.m1.6.6.1.1.1.2.1.cmml">[</mo><mrow id="S2.Ex2.m1.6.6.1.1.1.1.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.cmml"><mrow id="S2.Ex2.m1.6.6.1.1.1.1.1.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.cmml"><mrow id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.cmml"><mi id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.1.cmml">log</mi><mo id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3a" lspace="0.167em" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.cmml">⁡</mo><mi id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.2" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.2.cmml">q</mi></mrow><mo id="S2.Ex2.m1.6.6.1.1.1.1.1.1.2" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.2.cmml">⁢</mo><mrow 
id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.2" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.1.cmml">|</mo><mi id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.3" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.Ex2.m1.6.6.1.1.1.1.1.2" xref="S2.Ex2.m1.6.6.1.1.1.1.1.2.cmml">+</mo><mrow id="S2.Ex2.m1.6.6.1.1.1.1.1.3" xref="S2.Ex2.m1.6.6.1.1.1.1.1.3.cmml"><mi id="S2.Ex2.m1.6.6.1.1.1.1.1.3.1" xref="S2.Ex2.m1.6.6.1.1.1.1.1.3.1.cmml">log</mi><mo id="S2.Ex2.m1.6.6.1.1.1.1.1.3a" lspace="0.167em" xref="S2.Ex2.m1.6.6.1.1.1.1.1.3.cmml">⁡</mo><mstyle displaystyle="true" id="S2.Ex2.m1.4.4" xref="S2.Ex2.m1.4.4.cmml"><mfrac id="S2.Ex2.m1.4.4a" xref="S2.Ex2.m1.4.4.cmml"><mrow id="S2.Ex2.m1.3.3.1" xref="S2.Ex2.m1.3.3.1.cmml"><mi id="S2.Ex2.m1.3.3.1.3" xref="S2.Ex2.m1.3.3.1.3.cmml">p</mi><mo id="S2.Ex2.m1.3.3.1.2" xref="S2.Ex2.m1.3.3.1.2.cmml">⁢</mo><mrow id="S2.Ex2.m1.3.3.1.1.1" xref="S2.Ex2.m1.3.3.1.1.1.1.cmml"><mo id="S2.Ex2.m1.3.3.1.1.1.2" stretchy="false" xref="S2.Ex2.m1.3.3.1.1.1.1.cmml">(</mo><mrow id="S2.Ex2.m1.3.3.1.1.1.1" xref="S2.Ex2.m1.3.3.1.1.1.1.cmml"><mi id="S2.Ex2.m1.3.3.1.1.1.1.2" xref="S2.Ex2.m1.3.3.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.Ex2.m1.3.3.1.1.1.1.1" xref="S2.Ex2.m1.3.3.1.1.1.1.1.cmml">|</mo><mi id="S2.Ex2.m1.3.3.1.1.1.1.3" xref="S2.Ex2.m1.3.3.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.Ex2.m1.3.3.1.1.1.3" stretchy="false" xref="S2.Ex2.m1.3.3.1.1.1.1.cmml">)</mo></mrow></mrow><mrow id="S2.Ex2.m1.4.4.2" xref="S2.Ex2.m1.4.4.2.cmml"><mi id="S2.Ex2.m1.4.4.2.3" 
xref="S2.Ex2.m1.4.4.2.3.cmml">q</mi><mo id="S2.Ex2.m1.4.4.2.2" xref="S2.Ex2.m1.4.4.2.2.cmml">⁢</mo><mrow id="S2.Ex2.m1.4.4.2.1.1" xref="S2.Ex2.m1.4.4.2.1.1.1.cmml"><mo id="S2.Ex2.m1.4.4.2.1.1.2" stretchy="false" xref="S2.Ex2.m1.4.4.2.1.1.1.cmml">(</mo><mrow id="S2.Ex2.m1.4.4.2.1.1.1" xref="S2.Ex2.m1.4.4.2.1.1.1.cmml"><mi id="S2.Ex2.m1.4.4.2.1.1.1.2" xref="S2.Ex2.m1.4.4.2.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.Ex2.m1.4.4.2.1.1.1.1" xref="S2.Ex2.m1.4.4.2.1.1.1.1.cmml">|</mo><mi id="S2.Ex2.m1.4.4.2.1.1.1.3" xref="S2.Ex2.m1.4.4.2.1.1.1.3.cmml">r</mi></mrow><mo id="S2.Ex2.m1.4.4.2.1.1.3" stretchy="false" xref="S2.Ex2.m1.4.4.2.1.1.1.cmml">)</mo></mrow></mrow></mfrac></mstyle></mrow></mrow><mo id="S2.Ex2.m1.6.6.1.1.1.1.3" stretchy="false" xref="S2.Ex2.m1.6.6.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.Ex2.m1.6b"><apply id="S2.Ex2.m1.6.6.cmml" xref="S2.Ex2.m1.6.6"><eq id="S2.Ex2.m1.6.6.2.cmml" xref="S2.Ex2.m1.6.6.2"></eq><csymbol cd="latexml" id="S2.Ex2.m1.6.6.3.cmml" xref="S2.Ex2.m1.6.6.3">absent</csymbol><apply id="S2.Ex2.m1.6.6.1.cmml" xref="S2.Ex2.m1.6.6.1"><plus id="S2.Ex2.m1.6.6.1.2.cmml" xref="S2.Ex2.m1.6.6.1.2"></plus><apply id="S2.Ex2.m1.6.6.1.3.cmml" xref="S2.Ex2.m1.6.6.1.3"><times id="S2.Ex2.m1.6.6.1.3.1.cmml" xref="S2.Ex2.m1.6.6.1.3.1"></times><ci id="S2.Ex2.m1.6.6.1.3.2.cmml" xref="S2.Ex2.m1.6.6.1.3.2">𝐻</ci><ci id="S2.Ex2.m1.5.5.cmml" xref="S2.Ex2.m1.5.5">𝑋</ci></apply><apply id="S2.Ex2.m1.6.6.1.1.cmml" xref="S2.Ex2.m1.6.6.1.1"><times id="S2.Ex2.m1.6.6.1.1.2.cmml" xref="S2.Ex2.m1.6.6.1.1.2"></times><apply id="S2.Ex2.m1.6.6.1.1.3.cmml" xref="S2.Ex2.m1.6.6.1.1.3"><csymbol cd="ambiguous" id="S2.Ex2.m1.6.6.1.1.3.1.cmml" xref="S2.Ex2.m1.6.6.1.1.3">subscript</csymbol><ci id="S2.Ex2.m1.6.6.1.1.3.2.cmml" xref="S2.Ex2.m1.6.6.1.1.3.2">𝔼</ci><apply id="S2.Ex2.m1.2.2.2.cmml" xref="S2.Ex2.m1.2.2.2"><csymbol cd="latexml" id="S2.Ex2.m1.2.2.2.3.cmml" xref="S2.Ex2.m1.2.2.2.3">similar-to</csymbol><interval 
closure="open" id="S2.Ex2.m1.2.2.2.4.1.cmml" xref="S2.Ex2.m1.2.2.2.4.2"><ci id="S2.Ex2.m1.1.1.1.1.cmml" xref="S2.Ex2.m1.1.1.1.1">𝑥</ci><ci id="S2.Ex2.m1.2.2.2.2.cmml" xref="S2.Ex2.m1.2.2.2.2">𝑟</ci></interval><ci id="S2.Ex2.m1.2.2.2.5.cmml" xref="S2.Ex2.m1.2.2.2.5">𝑝</ci></apply></apply><apply id="S2.Ex2.m1.6.6.1.1.1.2.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1"><csymbol cd="latexml" id="S2.Ex2.m1.6.6.1.1.1.2.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.2">delimited-[]</csymbol><apply id="S2.Ex2.m1.6.6.1.1.1.1.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1"><plus id="S2.Ex2.m1.6.6.1.1.1.1.1.2.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.2"></plus><apply id="S2.Ex2.m1.6.6.1.1.1.1.1.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1"><times id="S2.Ex2.m1.6.6.1.1.1.1.1.1.2.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.2"></times><apply id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3"><log id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.1"></log><ci id="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.2.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.3.2">𝑞</ci></apply><apply id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.1">conditional</csymbol><ci id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.2">𝑥</ci><ci id="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.1.1.1.1.3">𝑟</ci></apply></apply><apply id="S2.Ex2.m1.6.6.1.1.1.1.1.3.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.3"><log id="S2.Ex2.m1.6.6.1.1.1.1.1.3.1.cmml" xref="S2.Ex2.m1.6.6.1.1.1.1.1.3.1"></log><apply id="S2.Ex2.m1.4.4.cmml" xref="S2.Ex2.m1.4.4"><divide id="S2.Ex2.m1.4.4.3.cmml" xref="S2.Ex2.m1.4.4"></divide><apply id="S2.Ex2.m1.3.3.1.cmml" xref="S2.Ex2.m1.3.3.1"><times id="S2.Ex2.m1.3.3.1.2.cmml" xref="S2.Ex2.m1.3.3.1.2"></times><ci id="S2.Ex2.m1.3.3.1.3.cmml" xref="S2.Ex2.m1.3.3.1.3">𝑝</ci><apply id="S2.Ex2.m1.3.3.1.1.1.1.cmml" 
xref="S2.Ex2.m1.3.3.1.1.1"><csymbol cd="latexml" id="S2.Ex2.m1.3.3.1.1.1.1.1.cmml" xref="S2.Ex2.m1.3.3.1.1.1.1.1">conditional</csymbol><ci id="S2.Ex2.m1.3.3.1.1.1.1.2.cmml" xref="S2.Ex2.m1.3.3.1.1.1.1.2">𝑥</ci><ci id="S2.Ex2.m1.3.3.1.1.1.1.3.cmml" xref="S2.Ex2.m1.3.3.1.1.1.1.3">𝑟</ci></apply></apply><apply id="S2.Ex2.m1.4.4.2.cmml" xref="S2.Ex2.m1.4.4.2"><times id="S2.Ex2.m1.4.4.2.2.cmml" xref="S2.Ex2.m1.4.4.2.2"></times><ci id="S2.Ex2.m1.4.4.2.3.cmml" xref="S2.Ex2.m1.4.4.2.3">𝑞</ci><apply id="S2.Ex2.m1.4.4.2.1.1.1.cmml" xref="S2.Ex2.m1.4.4.2.1.1"><csymbol cd="latexml" id="S2.Ex2.m1.4.4.2.1.1.1.1.cmml" xref="S2.Ex2.m1.4.4.2.1.1.1.1">conditional</csymbol><ci id="S2.Ex2.m1.4.4.2.1.1.1.2.cmml" xref="S2.Ex2.m1.4.4.2.1.1.1.2">𝑥</ci><ci id="S2.Ex2.m1.4.4.2.1.1.1.3.cmml" xref="S2.Ex2.m1.4.4.2.1.1.1.3">𝑟</ci></apply></apply></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.Ex2.m1.6c">\displaystyle=H(X)+\mathbb{E}_{(x,r)\sim p}[\log q(x|r)+\log\frac{p(x|r)}{q(x|% r)}]</annotation><annotation encoding="application/x-llamapun" id="S2.Ex2.m1.6d">= italic_H ( italic_X ) + blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_r ) ∼ italic_p end_POSTSUBSCRIPT [ roman_log italic_q ( italic_x | italic_r ) + roman_log divide start_ARG italic_p ( italic_x | italic_r ) end_ARG start_ARG italic_q ( italic_x | italic_r ) end_ARG ]</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr></tbody> <tbody id="S2.E3"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\geq H(X)+\mathbb{E}_{(x,r)\sim p}[\log q(x|r)]," class="ltx_Math" display="inline" id="S2.E3.m1.4"><semantics id="S2.E3.m1.4a"><mrow id="S2.E3.m1.4.4.1" xref="S2.E3.m1.4.4.1.1.cmml"><mrow id="S2.E3.m1.4.4.1.1" xref="S2.E3.m1.4.4.1.1.cmml"><mi 
id="S2.E3.m1.4.4.1.1.3" xref="S2.E3.m1.4.4.1.1.3.cmml"></mi><mo id="S2.E3.m1.4.4.1.1.2" xref="S2.E3.m1.4.4.1.1.2.cmml">≥</mo><mrow id="S2.E3.m1.4.4.1.1.1" xref="S2.E3.m1.4.4.1.1.1.cmml"><mrow id="S2.E3.m1.4.4.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.3.cmml"><mi id="S2.E3.m1.4.4.1.1.1.3.2" xref="S2.E3.m1.4.4.1.1.1.3.2.cmml">H</mi><mo id="S2.E3.m1.4.4.1.1.1.3.1" xref="S2.E3.m1.4.4.1.1.1.3.1.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1.1.1.3.3.2" xref="S2.E3.m1.4.4.1.1.1.3.cmml"><mo id="S2.E3.m1.4.4.1.1.1.3.3.2.1" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.3.cmml">(</mo><mi id="S2.E3.m1.3.3" xref="S2.E3.m1.3.3.cmml">X</mi><mo id="S2.E3.m1.4.4.1.1.1.3.3.2.2" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.3.cmml">)</mo></mrow></mrow><mo id="S2.E3.m1.4.4.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.2.cmml">+</mo><mrow id="S2.E3.m1.4.4.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.cmml"><msub id="S2.E3.m1.4.4.1.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.1.3.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.3.2" xref="S2.E3.m1.4.4.1.1.1.1.3.2.cmml">𝔼</mi><mrow id="S2.E3.m1.2.2.2" xref="S2.E3.m1.2.2.2.cmml"><mrow id="S2.E3.m1.2.2.2.4.2" xref="S2.E3.m1.2.2.2.4.1.cmml"><mo id="S2.E3.m1.2.2.2.4.2.1" stretchy="false" xref="S2.E3.m1.2.2.2.4.1.cmml">(</mo><mi id="S2.E3.m1.1.1.1.1" xref="S2.E3.m1.1.1.1.1.cmml">x</mi><mo id="S2.E3.m1.2.2.2.4.2.2" xref="S2.E3.m1.2.2.2.4.1.cmml">,</mo><mi id="S2.E3.m1.2.2.2.2" xref="S2.E3.m1.2.2.2.2.cmml">r</mi><mo id="S2.E3.m1.2.2.2.4.2.3" stretchy="false" xref="S2.E3.m1.2.2.2.4.1.cmml">)</mo></mrow><mo id="S2.E3.m1.2.2.2.3" xref="S2.E3.m1.2.2.2.3.cmml">∼</mo><mi id="S2.E3.m1.2.2.2.5" xref="S2.E3.m1.2.2.2.5.cmml">p</mi></mrow></msub><mo id="S2.E3.m1.4.4.1.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.2.cmml"><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.2" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.cmml">[</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.cmml"><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3" 
xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.1.cmml">log</mi><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3a" lspace="0.167em" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.cmml">⁡</mo><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.2.cmml">q</mi></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.1.cmml">|</mo><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.3" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow></mrow><mo id="S2.E3.m1.4.4.1.2" xref="S2.E3.m1.4.4.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E3.m1.4b"><apply id="S2.E3.m1.4.4.1.1.cmml" xref="S2.E3.m1.4.4.1"><geq id="S2.E3.m1.4.4.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.2"></geq><csymbol cd="latexml" id="S2.E3.m1.4.4.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.3">absent</csymbol><apply id="S2.E3.m1.4.4.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1"><plus id="S2.E3.m1.4.4.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.2"></plus><apply id="S2.E3.m1.4.4.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.3"><times id="S2.E3.m1.4.4.1.1.1.3.1.cmml" xref="S2.E3.m1.4.4.1.1.1.3.1"></times><ci id="S2.E3.m1.4.4.1.1.1.3.2.cmml" xref="S2.E3.m1.4.4.1.1.1.3.2">𝐻</ci><ci id="S2.E3.m1.3.3.cmml" 
xref="S2.E3.m1.3.3">𝑋</ci></apply><apply id="S2.E3.m1.4.4.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1"><times id="S2.E3.m1.4.4.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.2"></times><apply id="S2.E3.m1.4.4.1.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.1.1.1.1.3.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.3">subscript</csymbol><ci id="S2.E3.m1.4.4.1.1.1.1.3.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.3.2">𝔼</ci><apply id="S2.E3.m1.2.2.2.cmml" xref="S2.E3.m1.2.2.2"><csymbol cd="latexml" id="S2.E3.m1.2.2.2.3.cmml" xref="S2.E3.m1.2.2.2.3">similar-to</csymbol><interval closure="open" id="S2.E3.m1.2.2.2.4.1.cmml" xref="S2.E3.m1.2.2.2.4.2"><ci id="S2.E3.m1.1.1.1.1.cmml" xref="S2.E3.m1.1.1.1.1">𝑥</ci><ci id="S2.E3.m1.2.2.2.2.cmml" xref="S2.E3.m1.2.2.2.2">𝑟</ci></interval><ci id="S2.E3.m1.2.2.2.5.cmml" xref="S2.E3.m1.2.2.2.5">𝑝</ci></apply></apply><apply id="S2.E3.m1.4.4.1.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E3.m1.4.4.1.1.1.1.1.2.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1"><times id="S2.E3.m1.4.4.1.1.1.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.2"></times><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3"><log id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.1"></log><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.3.2">𝑞</ci></apply><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.1">conditional</csymbol><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.2">𝑥</ci><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.3">𝑟</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S2.E3.m1.4c">\displaystyle\geq H(X)+\mathbb{E}_{(x,r)\sim p}[\log q(x|r)],</annotation><annotation encoding="application/x-llamapun" id="S2.E3.m1.4d">≥ italic_H ( italic_X ) + blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_r ) ∼ italic_p end_POSTSUBSCRIPT [ roman_log italic_q ( italic_x | italic_r ) ] ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td> </tr></tbody> </table> <p class="ltx_p" id="S2.SS2.p2.4">where <math alttext="\mathbb{E}_{p}[\log q(x|r)]" class="ltx_Math" display="inline" id="S2.SS2.p2.3.m1.1"><semantics id="S2.SS2.p2.3.m1.1a"><mrow id="S2.SS2.p2.3.m1.1.1" xref="S2.SS2.p2.3.m1.1.1.cmml"><msub id="S2.SS2.p2.3.m1.1.1.3" xref="S2.SS2.p2.3.m1.1.1.3.cmml"><mi id="S2.SS2.p2.3.m1.1.1.3.2" xref="S2.SS2.p2.3.m1.1.1.3.2.cmml">𝔼</mi><mi id="S2.SS2.p2.3.m1.1.1.3.3" xref="S2.SS2.p2.3.m1.1.1.3.3.cmml">p</mi></msub><mo id="S2.SS2.p2.3.m1.1.1.2" xref="S2.SS2.p2.3.m1.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p2.3.m1.1.1.1.1" xref="S2.SS2.p2.3.m1.1.1.1.2.cmml"><mo id="S2.SS2.p2.3.m1.1.1.1.1.2" stretchy="false" xref="S2.SS2.p2.3.m1.1.1.1.2.1.cmml">[</mo><mrow id="S2.SS2.p2.3.m1.1.1.1.1.1" xref="S2.SS2.p2.3.m1.1.1.1.1.1.cmml"><mrow id="S2.SS2.p2.3.m1.1.1.1.1.1.3" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3.cmml"><mi id="S2.SS2.p2.3.m1.1.1.1.1.1.3.1" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3.1.cmml">log</mi><mo id="S2.SS2.p2.3.m1.1.1.1.1.1.3a" lspace="0.167em" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3.cmml">⁡</mo><mi id="S2.SS2.p2.3.m1.1.1.1.1.1.3.2" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3.2.cmml">q</mi></mrow><mo id="S2.SS2.p2.3.m1.1.1.1.1.1.2" xref="S2.SS2.p2.3.m1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.2" stretchy="false" 
xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.2" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.1" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.1.cmml">|</mo><mi id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.3" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.SS2.p2.3.m1.1.1.1.1.3" stretchy="false" xref="S2.SS2.p2.3.m1.1.1.1.2.1.cmml">]</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.3.m1.1b"><apply id="S2.SS2.p2.3.m1.1.1.cmml" xref="S2.SS2.p2.3.m1.1.1"><times id="S2.SS2.p2.3.m1.1.1.2.cmml" xref="S2.SS2.p2.3.m1.1.1.2"></times><apply id="S2.SS2.p2.3.m1.1.1.3.cmml" xref="S2.SS2.p2.3.m1.1.1.3"><csymbol cd="ambiguous" id="S2.SS2.p2.3.m1.1.1.3.1.cmml" xref="S2.SS2.p2.3.m1.1.1.3">subscript</csymbol><ci id="S2.SS2.p2.3.m1.1.1.3.2.cmml" xref="S2.SS2.p2.3.m1.1.1.3.2">𝔼</ci><ci id="S2.SS2.p2.3.m1.1.1.3.3.cmml" xref="S2.SS2.p2.3.m1.1.1.3.3">𝑝</ci></apply><apply id="S2.SS2.p2.3.m1.1.1.1.2.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1"><csymbol cd="latexml" id="S2.SS2.p2.3.m1.1.1.1.2.1.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.2">delimited-[]</csymbol><apply id="S2.SS2.p2.3.m1.1.1.1.1.1.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1"><times id="S2.SS2.p2.3.m1.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.2"></times><apply id="S2.SS2.p2.3.m1.1.1.1.1.1.3.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3"><log id="S2.SS2.p2.3.m1.1.1.1.1.1.3.1.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3.1"></log><ci id="S2.SS2.p2.3.m1.1.1.1.1.1.3.2.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.3.2">𝑞</ci></apply><apply id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.1">conditional</csymbol><ci 
id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.2">𝑥</ci><ci id="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.SS2.p2.3.m1.1.1.1.1.1.1.1.1.3">𝑟</ci></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.3.m1.1c">\mathbb{E}_{p}[\log q(x|r)]</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.3.m1.1d">blackboard_E start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT [ roman_log italic_q ( italic_x | italic_r ) ]</annotation></semantics></math> is the negative of the empirical cross entropy, and the inequality follows from the non-negativity of the KL divergence. Assuming a Gaussian form for <math alttext="q(x|r)" class="ltx_Math" display="inline" id="S2.SS2.p2.4.m2.1"><semantics id="S2.SS2.p2.4.m2.1a"><mrow id="S2.SS2.p2.4.m2.1.1" xref="S2.SS2.p2.4.m2.1.1.cmml"><mi id="S2.SS2.p2.4.m2.1.1.3" xref="S2.SS2.p2.4.m2.1.1.3.cmml">q</mi><mo id="S2.SS2.p2.4.m2.1.1.2" xref="S2.SS2.p2.4.m2.1.1.2.cmml">⁢</mo><mrow id="S2.SS2.p2.4.m2.1.1.1.1" xref="S2.SS2.p2.4.m2.1.1.1.1.1.cmml"><mo id="S2.SS2.p2.4.m2.1.1.1.1.2" stretchy="false" xref="S2.SS2.p2.4.m2.1.1.1.1.1.cmml">(</mo><mrow id="S2.SS2.p2.4.m2.1.1.1.1.1" xref="S2.SS2.p2.4.m2.1.1.1.1.1.cmml"><mi id="S2.SS2.p2.4.m2.1.1.1.1.1.2" xref="S2.SS2.p2.4.m2.1.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.SS2.p2.4.m2.1.1.1.1.1.1" xref="S2.SS2.p2.4.m2.1.1.1.1.1.1.cmml">|</mo><mi id="S2.SS2.p2.4.m2.1.1.1.1.1.3" xref="S2.SS2.p2.4.m2.1.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.SS2.p2.4.m2.1.1.1.1.3" stretchy="false" xref="S2.SS2.p2.4.m2.1.1.1.1.1.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.4.m2.1b"><apply id="S2.SS2.p2.4.m2.1.1.cmml" xref="S2.SS2.p2.4.m2.1.1"><times id="S2.SS2.p2.4.m2.1.1.2.cmml" xref="S2.SS2.p2.4.m2.1.1.2"></times><ci id="S2.SS2.p2.4.m2.1.1.3.cmml" xref="S2.SS2.p2.4.m2.1.1.3">𝑞</ci><apply id="S2.SS2.p2.4.m2.1.1.1.1.1.cmml" xref="S2.SS2.p2.4.m2.1.1.1.1"><csymbol cd="latexml" id="S2.SS2.p2.4.m2.1.1.1.1.1.1.cmml" 
xref="S2.SS2.p2.4.m2.1.1.1.1.1.1">conditional</csymbol><ci id="S2.SS2.p2.4.m2.1.1.1.1.1.2.cmml" xref="S2.SS2.p2.4.m2.1.1.1.1.1.2">𝑥</ci><ci id="S2.SS2.p2.4.m2.1.1.1.1.1.3.cmml" xref="S2.SS2.p2.4.m2.1.1.1.1.1.3">𝑟</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.4.m2.1c">q(x|r)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.4.m2.1d">italic_q ( italic_x | italic_r )</annotation></semantics></math>, we obtain the proposed lower bound</p> <table class="ltx_equationgroup ltx_eqn_table" id="S2.E4"> <tbody> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E4X"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle H(X)+\mathbb{E}_{(x,r)\sim p}[\log q(x|r)]" class="ltx_Math" display="inline" id="S2.E4X.2.1.1.m1.4"><semantics id="S2.E4X.2.1.1.m1.4a"><mrow id="S2.E4X.2.1.1.m1.4.4" xref="S2.E4X.2.1.1.m1.4.4.cmml"><mrow id="S2.E4X.2.1.1.m1.4.4.3" xref="S2.E4X.2.1.1.m1.4.4.3.cmml"><mi id="S2.E4X.2.1.1.m1.4.4.3.2" xref="S2.E4X.2.1.1.m1.4.4.3.2.cmml">H</mi><mo id="S2.E4X.2.1.1.m1.4.4.3.1" xref="S2.E4X.2.1.1.m1.4.4.3.1.cmml">⁢</mo><mrow id="S2.E4X.2.1.1.m1.4.4.3.3.2" xref="S2.E4X.2.1.1.m1.4.4.3.cmml"><mo id="S2.E4X.2.1.1.m1.4.4.3.3.2.1" stretchy="false" xref="S2.E4X.2.1.1.m1.4.4.3.cmml">(</mo><mi id="S2.E4X.2.1.1.m1.3.3" xref="S2.E4X.2.1.1.m1.3.3.cmml">X</mi><mo id="S2.E4X.2.1.1.m1.4.4.3.3.2.2" stretchy="false" xref="S2.E4X.2.1.1.m1.4.4.3.cmml">)</mo></mrow></mrow><mo id="S2.E4X.2.1.1.m1.4.4.2" xref="S2.E4X.2.1.1.m1.4.4.2.cmml">+</mo><mrow id="S2.E4X.2.1.1.m1.4.4.1" xref="S2.E4X.2.1.1.m1.4.4.1.cmml"><msub id="S2.E4X.2.1.1.m1.4.4.1.3" xref="S2.E4X.2.1.1.m1.4.4.1.3.cmml"><mi id="S2.E4X.2.1.1.m1.4.4.1.3.2" xref="S2.E4X.2.1.1.m1.4.4.1.3.2.cmml">𝔼</mi><mrow id="S2.E4X.2.1.1.m1.2.2.2" xref="S2.E4X.2.1.1.m1.2.2.2.cmml"><mrow id="S2.E4X.2.1.1.m1.2.2.2.4.2" 
xref="S2.E4X.2.1.1.m1.2.2.2.4.1.cmml"><mo id="S2.E4X.2.1.1.m1.2.2.2.4.2.1" stretchy="false" xref="S2.E4X.2.1.1.m1.2.2.2.4.1.cmml">(</mo><mi id="S2.E4X.2.1.1.m1.1.1.1.1" xref="S2.E4X.2.1.1.m1.1.1.1.1.cmml">x</mi><mo id="S2.E4X.2.1.1.m1.2.2.2.4.2.2" xref="S2.E4X.2.1.1.m1.2.2.2.4.1.cmml">,</mo><mi id="S2.E4X.2.1.1.m1.2.2.2.2" xref="S2.E4X.2.1.1.m1.2.2.2.2.cmml">r</mi><mo id="S2.E4X.2.1.1.m1.2.2.2.4.2.3" stretchy="false" xref="S2.E4X.2.1.1.m1.2.2.2.4.1.cmml">)</mo></mrow><mo id="S2.E4X.2.1.1.m1.2.2.2.3" xref="S2.E4X.2.1.1.m1.2.2.2.3.cmml">∼</mo><mi id="S2.E4X.2.1.1.m1.2.2.2.5" xref="S2.E4X.2.1.1.m1.2.2.2.5.cmml">p</mi></mrow></msub><mo id="S2.E4X.2.1.1.m1.4.4.1.2" xref="S2.E4X.2.1.1.m1.4.4.1.2.cmml">⁢</mo><mrow id="S2.E4X.2.1.1.m1.4.4.1.1.1" xref="S2.E4X.2.1.1.m1.4.4.1.1.2.cmml"><mo id="S2.E4X.2.1.1.m1.4.4.1.1.1.2" stretchy="false" xref="S2.E4X.2.1.1.m1.4.4.1.1.2.1.cmml">[</mo><mrow id="S2.E4X.2.1.1.m1.4.4.1.1.1.1" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.cmml"><mrow id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.cmml"><mi id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.1" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.1.cmml">log</mi><mo id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3a" lspace="0.167em" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.cmml">⁡</mo><mi id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.2" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.2.cmml">q</mi></mrow><mo id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.2" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.cmml"><mo id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.2" stretchy="false" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.cmml"><mi id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.2" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.2.cmml">x</mi><mo fence="false" id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.1" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.1.cmml">|</mo><mi id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.3" 
xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.3.cmml">r</mi></mrow><mo id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.3" stretchy="false" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E4X.2.1.1.m1.4.4.1.1.1.3" stretchy="false" xref="S2.E4X.2.1.1.m1.4.4.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E4X.2.1.1.m1.4b"><apply id="S2.E4X.2.1.1.m1.4.4.cmml" xref="S2.E4X.2.1.1.m1.4.4"><plus id="S2.E4X.2.1.1.m1.4.4.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.2"></plus><apply id="S2.E4X.2.1.1.m1.4.4.3.cmml" xref="S2.E4X.2.1.1.m1.4.4.3"><times id="S2.E4X.2.1.1.m1.4.4.3.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.3.1"></times><ci id="S2.E4X.2.1.1.m1.4.4.3.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.3.2">𝐻</ci><ci id="S2.E4X.2.1.1.m1.3.3.cmml" xref="S2.E4X.2.1.1.m1.3.3">𝑋</ci></apply><apply id="S2.E4X.2.1.1.m1.4.4.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.1"><times id="S2.E4X.2.1.1.m1.4.4.1.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.2"></times><apply id="S2.E4X.2.1.1.m1.4.4.1.3.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.3"><csymbol cd="ambiguous" id="S2.E4X.2.1.1.m1.4.4.1.3.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.3">subscript</csymbol><ci id="S2.E4X.2.1.1.m1.4.4.1.3.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.3.2">𝔼</ci><apply id="S2.E4X.2.1.1.m1.2.2.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.2"><csymbol cd="latexml" id="S2.E4X.2.1.1.m1.2.2.2.3.cmml" xref="S2.E4X.2.1.1.m1.2.2.2.3">similar-to</csymbol><interval closure="open" id="S2.E4X.2.1.1.m1.2.2.2.4.1.cmml" xref="S2.E4X.2.1.1.m1.2.2.2.4.2"><ci id="S2.E4X.2.1.1.m1.1.1.1.1.cmml" xref="S2.E4X.2.1.1.m1.1.1.1.1">𝑥</ci><ci id="S2.E4X.2.1.1.m1.2.2.2.2.cmml" xref="S2.E4X.2.1.1.m1.2.2.2.2">𝑟</ci></interval><ci id="S2.E4X.2.1.1.m1.2.2.2.5.cmml" xref="S2.E4X.2.1.1.m1.2.2.2.5">𝑝</ci></apply></apply><apply id="S2.E4X.2.1.1.m1.4.4.1.1.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1"><csymbol cd="latexml" id="S2.E4X.2.1.1.m1.4.4.1.1.2.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.2">delimited-[]</csymbol><apply id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.cmml" 
xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1"><times id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.2"></times><apply id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3"><log id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.1"></log><ci id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.3.2">𝑞</ci></apply><apply id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.1.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.1">conditional</csymbol><ci id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.2.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.2">𝑥</ci><ci id="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.3.cmml" xref="S2.E4X.2.1.1.m1.4.4.1.1.1.1.1.1.1.3">𝑟</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E4X.2.1.1.m1.4c">\displaystyle H(X)+\mathbb{E}_{(x,r)\sim p}[\log q(x|r)]</annotation><annotation encoding="application/x-llamapun" id="S2.E4X.2.1.1.m1.4d">italic_H ( italic_X ) + blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_r ) ∼ italic_p end_POSTSUBSCRIPT [ roman_log italic_q ( italic_x | italic_r ) ]</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="3"><span class="ltx_tag ltx_tag_equationgroup ltx_align_right">(4)</span></td> </tr> <tr class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E4Xa"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=H(X)-\frac{1}{2}\mathbb{E}_{(x,r)\sim p}[(x-f(r))^{2}]+\frac{d}{% 2}\log(2\pi e)" class="ltx_Math" display="inline" id="S2.E4Xa.2.1.1.m1.7"><semantics id="S2.E4Xa.2.1.1.m1.7a"><mrow id="S2.E4Xa.2.1.1.m1.7.7" xref="S2.E4Xa.2.1.1.m1.7.7.cmml"><mi 
id="S2.E4Xa.2.1.1.m1.7.7.4" xref="S2.E4Xa.2.1.1.m1.7.7.4.cmml"></mi><mo id="S2.E4Xa.2.1.1.m1.7.7.3" xref="S2.E4Xa.2.1.1.m1.7.7.3.cmml">=</mo><mrow id="S2.E4Xa.2.1.1.m1.7.7.2" xref="S2.E4Xa.2.1.1.m1.7.7.2.cmml"><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.cmml"><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.3" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.cmml"><mi id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.2.cmml">H</mi><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.1.cmml">⁢</mo><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.3.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.cmml"><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.3.2.1" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.cmml">(</mo><mi id="S2.E4Xa.2.1.1.m1.3.3" xref="S2.E4Xa.2.1.1.m1.3.3.cmml">X</mi><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.3.2.2" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.cmml">)</mo></mrow></mrow><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.2.cmml">−</mo><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.cmml"><mstyle displaystyle="true" id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.cmml"><mfrac id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3a" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.cmml"><mn id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.2.cmml">1</mn><mn id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.3" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.3.cmml">2</mn></mfrac></mstyle><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.2.cmml">⁢</mo><msub id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.cmml"><mi id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.2.cmml">𝔼</mi><mrow id="S2.E4Xa.2.1.1.m1.2.2.2" xref="S2.E4Xa.2.1.1.m1.2.2.2.cmml"><mrow id="S2.E4Xa.2.1.1.m1.2.2.2.4.2" xref="S2.E4Xa.2.1.1.m1.2.2.2.4.1.cmml"><mo id="S2.E4Xa.2.1.1.m1.2.2.2.4.2.1" stretchy="false" xref="S2.E4Xa.2.1.1.m1.2.2.2.4.1.cmml">(</mo><mi id="S2.E4Xa.2.1.1.m1.1.1.1.1" 
xref="S2.E4Xa.2.1.1.m1.1.1.1.1.cmml">x</mi><mo id="S2.E4Xa.2.1.1.m1.2.2.2.4.2.2" xref="S2.E4Xa.2.1.1.m1.2.2.2.4.1.cmml">,</mo><mi id="S2.E4Xa.2.1.1.m1.2.2.2.2" xref="S2.E4Xa.2.1.1.m1.2.2.2.2.cmml">r</mi><mo id="S2.E4Xa.2.1.1.m1.2.2.2.4.2.3" stretchy="false" xref="S2.E4Xa.2.1.1.m1.2.2.2.4.1.cmml">)</mo></mrow><mo id="S2.E4Xa.2.1.1.m1.2.2.2.3" xref="S2.E4Xa.2.1.1.m1.2.2.2.3.cmml">∼</mo><mi id="S2.E4Xa.2.1.1.m1.2.2.2.5" xref="S2.E4Xa.2.1.1.m1.2.2.2.5.cmml">p</mi></mrow></msub><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.2a" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.2.cmml">⁢</mo><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.2.cmml"><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.2" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.2.1.cmml">[</mo><msup id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.cmml"><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.2.cmml">x</mi><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.cmml">−</mo><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.2.cmml">f</mi><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.1" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.1.cmml">⁢</mo><mrow id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.3.2" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml"><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.3.2.1" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml">(</mo><mi id="S2.E4Xa.2.1.1.m1.4.4" 
xref="S2.E4Xa.2.1.1.m1.4.4.cmml">r</mi><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.3.2.2" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow><mn id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.3" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.3.cmml">2</mn></msup><mo id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.3" stretchy="false" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><mo id="S2.E4Xa.2.1.1.m1.7.7.2.3" xref="S2.E4Xa.2.1.1.m1.7.7.2.3.cmml">+</mo><mrow id="S2.E4Xa.2.1.1.m1.7.7.2.2" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.cmml"><mstyle displaystyle="true" id="S2.E4Xa.2.1.1.m1.7.7.2.2.3" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3.cmml"><mfrac id="S2.E4Xa.2.1.1.m1.7.7.2.2.3a" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3.cmml"><mi id="S2.E4Xa.2.1.1.m1.7.7.2.2.3.2" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3.2.cmml">d</mi><mn id="S2.E4Xa.2.1.1.m1.7.7.2.2.3.3" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3.3.cmml">2</mn></mfrac></mstyle><mo id="S2.E4Xa.2.1.1.m1.7.7.2.2.2" lspace="0.167em" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.2.cmml">⁢</mo><mrow id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.2.cmml"><mi id="S2.E4Xa.2.1.1.m1.5.5" xref="S2.E4Xa.2.1.1.m1.5.5.cmml">log</mi><mo id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1a" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.2.cmml">⁡</mo><mrow id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.2.cmml"><mo id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.2" stretchy="false" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.2.cmml">(</mo><mrow id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.cmml"><mn id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.2" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.2.cmml">2</mn><mo id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.1" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.1.cmml">⁢</mo><mi id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.3" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.3.cmml">π</mi><mo 
id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.1a" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.1.cmml">⁢</mo><mi id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.4" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.4.cmml">e</mi></mrow><mo id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.3" stretchy="false" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.2.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E4Xa.2.1.1.m1.7b"><apply id="S2.E4Xa.2.1.1.m1.7.7.cmml" xref="S2.E4Xa.2.1.1.m1.7.7"><eq id="S2.E4Xa.2.1.1.m1.7.7.3.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.3"></eq><csymbol cd="latexml" id="S2.E4Xa.2.1.1.m1.7.7.4.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.4">absent</csymbol><apply id="S2.E4Xa.2.1.1.m1.7.7.2.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2"><plus id="S2.E4Xa.2.1.1.m1.7.7.2.3.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.3"></plus><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1"><minus id="S2.E4Xa.2.1.1.m1.6.6.1.1.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.2"></minus><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3"><times id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.1"></times><ci id="S2.E4Xa.2.1.1.m1.6.6.1.1.3.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.3.2">𝐻</ci><ci id="S2.E4Xa.2.1.1.m1.3.3.cmml" xref="S2.E4Xa.2.1.1.m1.3.3">𝑋</ci></apply><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1"><times id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.2"></times><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3"><divide id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3"></divide><cn id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.2.cmml" type="integer" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.2">1</cn><cn id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.3.cmml" type="integer" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.3.3">2</cn></apply><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4"><csymbol cd="ambiguous" id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.1.cmml" 
xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4">subscript</csymbol><ci id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.4.2">𝔼</ci><apply id="S2.E4Xa.2.1.1.m1.2.2.2.cmml" xref="S2.E4Xa.2.1.1.m1.2.2.2"><csymbol cd="latexml" id="S2.E4Xa.2.1.1.m1.2.2.2.3.cmml" xref="S2.E4Xa.2.1.1.m1.2.2.2.3">similar-to</csymbol><interval closure="open" id="S2.E4Xa.2.1.1.m1.2.2.2.4.1.cmml" xref="S2.E4Xa.2.1.1.m1.2.2.2.4.2"><ci id="S2.E4Xa.2.1.1.m1.1.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.1.1.1.1">𝑥</ci><ci id="S2.E4Xa.2.1.1.m1.2.2.2.2.cmml" xref="S2.E4Xa.2.1.1.m1.2.2.2.2">𝑟</ci></interval><ci id="S2.E4Xa.2.1.1.m1.2.2.2.5.cmml" xref="S2.E4Xa.2.1.1.m1.2.2.2.5">𝑝</ci></apply></apply><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1"><csymbol cd="latexml" id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.2.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1">superscript</csymbol><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1"><minus id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1"></minus><ci id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.2">𝑥</ci><apply id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3"><times id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.1"></times><ci id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.3.2">𝑓</ci><ci id="S2.E4Xa.2.1.1.m1.4.4.cmml" xref="S2.E4Xa.2.1.1.m1.4.4">𝑟</ci></apply></apply><cn id="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.3.cmml" type="integer" 
xref="S2.E4Xa.2.1.1.m1.6.6.1.1.1.1.1.1.3">2</cn></apply></apply></apply></apply><apply id="S2.E4Xa.2.1.1.m1.7.7.2.2.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2"><times id="S2.E4Xa.2.1.1.m1.7.7.2.2.2.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.2"></times><apply id="S2.E4Xa.2.1.1.m1.7.7.2.2.3.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3"><divide id="S2.E4Xa.2.1.1.m1.7.7.2.2.3.1.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3"></divide><ci id="S2.E4Xa.2.1.1.m1.7.7.2.2.3.2.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3.2">𝑑</ci><cn id="S2.E4Xa.2.1.1.m1.7.7.2.2.3.3.cmml" type="integer" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.3.3">2</cn></apply><apply id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.2.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1"><log id="S2.E4Xa.2.1.1.m1.5.5.cmml" xref="S2.E4Xa.2.1.1.m1.5.5"></log><apply id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1"><times id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.1.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.1"></times><cn id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.2.cmml" type="integer" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.2">2</cn><ci id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.3.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.3">𝜋</ci><ci id="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.4.cmml" xref="S2.E4Xa.2.1.1.m1.7.7.2.2.1.1.1.1.4">𝑒</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E4Xa.2.1.1.m1.7c">\displaystyle=H(X)-\frac{1}{2}\mathbb{E}_{(x,r)\sim p}[(x-f(r))^{2}]+\frac{d}{% 2}\log(2\pi e)</annotation><annotation encoding="application/x-llamapun" id="S2.E4Xa.2.1.1.m1.7d">= italic_H ( italic_X ) - divide start_ARG 1 end_ARG start_ARG 2 end_ARG blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_r ) ∼ italic_p end_POSTSUBSCRIPT [ ( italic_x - italic_f ( italic_r ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + divide start_ARG italic_d end_ARG start_ARG 2 end_ARG roman_log ( 2 italic_π italic_e )</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr> <tr 
class="ltx_equation ltx_eqn_row ltx_align_baseline" id="S2.E4Xb"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_td ltx_eqn_cell"></td> <td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle\geq H(X)-\frac{1}{2}\mathbb{E}_{(x,r)\sim p}[(x-f(\hat{r}))^{2}]% +\frac{d}{2}\log(2\pi e)," class="ltx_Math" display="inline" id="S2.E4Xb.2.1.1.m1.6"><semantics id="S2.E4Xb.2.1.1.m1.6a"><mrow id="S2.E4Xb.2.1.1.m1.6.6.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.cmml"><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.cmml"><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.4" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.4.cmml"></mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.3.cmml">≥</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.cmml"><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.cmml"><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.cmml"><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.2.cmml">H</mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.1.cmml">⁢</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.3.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.cmml"><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.3.2.1" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.cmml">(</mo><mi id="S2.E4Xb.2.1.1.m1.3.3" xref="S2.E4Xb.2.1.1.m1.3.3.cmml">X</mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.3.2.2" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.cmml">)</mo></mrow></mrow><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.2.cmml">−</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.cmml"><mstyle displaystyle="true" id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.cmml"><mfrac id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3a" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.cmml"><mn id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.2" 
xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.2.cmml">1</mn><mn id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.3.cmml">2</mn></mfrac></mstyle><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.2.cmml">⁢</mo><msub id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.cmml"><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.2.cmml">𝔼</mi><mrow id="S2.E4Xb.2.1.1.m1.2.2.2" xref="S2.E4Xb.2.1.1.m1.2.2.2.cmml"><mrow id="S2.E4Xb.2.1.1.m1.2.2.2.4.2" xref="S2.E4Xb.2.1.1.m1.2.2.2.4.1.cmml"><mo id="S2.E4Xb.2.1.1.m1.2.2.2.4.2.1" stretchy="false" xref="S2.E4Xb.2.1.1.m1.2.2.2.4.1.cmml">(</mo><mi id="S2.E4Xb.2.1.1.m1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.1.1.1.1.cmml">x</mi><mo id="S2.E4Xb.2.1.1.m1.2.2.2.4.2.2" xref="S2.E4Xb.2.1.1.m1.2.2.2.4.1.cmml">,</mo><mi id="S2.E4Xb.2.1.1.m1.2.2.2.2" xref="S2.E4Xb.2.1.1.m1.2.2.2.2.cmml">r</mi><mo id="S2.E4Xb.2.1.1.m1.2.2.2.4.2.3" stretchy="false" xref="S2.E4Xb.2.1.1.m1.2.2.2.4.1.cmml">)</mo></mrow><mo id="S2.E4Xb.2.1.1.m1.2.2.2.3" xref="S2.E4Xb.2.1.1.m1.2.2.2.3.cmml">∼</mo><mi id="S2.E4Xb.2.1.1.m1.2.2.2.5" xref="S2.E4Xb.2.1.1.m1.2.2.2.5.cmml">p</mi></mrow></msub><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.2a" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.2.cmml"><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.2.1.cmml">[</mo><msup id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.cmml"><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.cmml"><mi 
id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.2.cmml">x</mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.1.cmml">−</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.2.cmml">f</mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.1.cmml">⁢</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.3.2" xref="S2.E4Xb.2.1.1.m1.4.4.cmml"><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.3.2.1" stretchy="false" xref="S2.E4Xb.2.1.1.m1.4.4.cmml">(</mo><mover accent="true" id="S2.E4Xb.2.1.1.m1.4.4" xref="S2.E4Xb.2.1.1.m1.4.4.cmml"><mi id="S2.E4Xb.2.1.1.m1.4.4.2" xref="S2.E4Xb.2.1.1.m1.4.4.2.cmml">r</mi><mo id="S2.E4Xb.2.1.1.m1.4.4.1" xref="S2.E4Xb.2.1.1.m1.4.4.1.cmml">^</mo></mover><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.3.2.2" stretchy="false" xref="S2.E4Xb.2.1.1.m1.4.4.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow><mn id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.3.cmml">2</mn></msup><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.2.1.cmml">]</mo></mrow></mrow></mrow><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.3.cmml">+</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.cmml"><mstyle displaystyle="true" id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.cmml"><mfrac id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3a" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.cmml"><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.2" 
xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.2.cmml">d</mi><mn id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.3.cmml">2</mn></mfrac></mstyle><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.2" lspace="0.167em" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.2.cmml">⁢</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.2.cmml"><mi id="S2.E4Xb.2.1.1.m1.5.5" xref="S2.E4Xb.2.1.1.m1.5.5.cmml">log</mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1a" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.2.cmml">⁡</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.2.cmml"><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.2" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.2.cmml">(</mo><mrow id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.cmml"><mn id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.2.cmml">2</mn><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.1" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.1.cmml">⁢</mo><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.3" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.3.cmml">π</mi><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.1a" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.1.cmml">⁢</mo><mi id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.4" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.4.cmml">e</mi></mrow><mo id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.3" stretchy="false" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.2.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><mo id="S2.E4Xb.2.1.1.m1.6.6.1.2" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.E4Xb.2.1.1.m1.6b"><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1"><geq id="S2.E4Xb.2.1.1.m1.6.6.1.1.3.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.3"></geq><csymbol cd="latexml" id="S2.E4Xb.2.1.1.m1.6.6.1.1.4.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.4">absent</csymbol><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.cmml" 
xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2"><plus id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.3.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.3"></plus><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1"><minus id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.2"></minus><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3"><times id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.1"></times><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.3.2">𝐻</ci><ci id="S2.E4Xb.2.1.1.m1.3.3.cmml" xref="S2.E4Xb.2.1.1.m1.3.3">𝑋</ci></apply><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1"><times id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.2"></times><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3"><divide id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3"></divide><cn id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.2.cmml" type="integer" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.2">1</cn><cn id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.3.cmml" type="integer" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.3.3">2</cn></apply><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4"><csymbol cd="ambiguous" id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4">subscript</csymbol><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.4.2">𝔼</ci><apply id="S2.E4Xb.2.1.1.m1.2.2.2.cmml" xref="S2.E4Xb.2.1.1.m1.2.2.2"><csymbol cd="latexml" id="S2.E4Xb.2.1.1.m1.2.2.2.3.cmml" xref="S2.E4Xb.2.1.1.m1.2.2.2.3">similar-to</csymbol><interval closure="open" id="S2.E4Xb.2.1.1.m1.2.2.2.4.1.cmml" xref="S2.E4Xb.2.1.1.m1.2.2.2.4.2"><ci id="S2.E4Xb.2.1.1.m1.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.1.1.1.1">𝑥</ci><ci id="S2.E4Xb.2.1.1.m1.2.2.2.2.cmml" 
xref="S2.E4Xb.2.1.1.m1.2.2.2.2">𝑟</ci></interval><ci id="S2.E4Xb.2.1.1.m1.2.2.2.5.cmml" xref="S2.E4Xb.2.1.1.m1.2.2.2.5">𝑝</ci></apply></apply><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1"><csymbol cd="latexml" id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.2.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.2">delimited-[]</csymbol><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1">superscript</csymbol><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1"><minus id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.1"></minus><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.2">𝑥</ci><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3"><times id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.1"></times><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.2">𝑓</ci><apply id="S2.E4Xb.2.1.1.m1.4.4.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.1.1.1.3.3.2"><ci id="S2.E4Xb.2.1.1.m1.4.4.1.cmml" xref="S2.E4Xb.2.1.1.m1.4.4.1">^</ci><ci id="S2.E4Xb.2.1.1.m1.4.4.2.cmml" xref="S2.E4Xb.2.1.1.m1.4.4.2">𝑟</ci></apply></apply></apply><cn id="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.3.cmml" type="integer" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.1.1.1.1.1.1.3">2</cn></apply></apply></apply></apply><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2"><times id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.2"></times><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.cmml" 
xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3"><divide id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3"></divide><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.2">𝑑</ci><cn id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.3.cmml" type="integer" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.3.3">2</cn></apply><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.2.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1"><log id="S2.E4Xb.2.1.1.m1.5.5.cmml" xref="S2.E4Xb.2.1.1.m1.5.5"></log><apply id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1"><times id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.1.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.1"></times><cn id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.2.cmml" type="integer" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.2">2</cn><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.3.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.3">𝜋</ci><ci id="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.4.cmml" xref="S2.E4Xb.2.1.1.m1.6.6.1.1.2.2.1.1.1.1.4">𝑒</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E4Xb.2.1.1.m1.6c">\displaystyle\geq H(X)-\frac{1}{2}\mathbb{E}_{(x,r)\sim p}[(x-f(\hat{r}))^{2}]% +\frac{d}{2}\log(2\pi e),</annotation><annotation encoding="application/x-llamapun" id="S2.E4Xb.2.1.1.m1.6d">≥ italic_H ( italic_X ) - divide start_ARG 1 end_ARG start_ARG 2 end_ARG blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_r ) ∼ italic_p end_POSTSUBSCRIPT [ ( italic_x - italic_f ( over^ start_ARG italic_r end_ARG ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + divide start_ARG italic_d end_ARG start_ARG 2 end_ARG roman_log ( 2 italic_π italic_e ) ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> </tr> </tbody> </table> <p class="ltx_p" id="S2.SS2.p2.7">where <math alttext="d" class="ltx_Math" display="inline" id="S2.SS2.p2.5.m1.1"><semantics id="S2.SS2.p2.5.m1.1a"><mi 
id="S2.SS2.p2.5.m1.1.1" xref="S2.SS2.p2.5.m1.1.1.cmml">d</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.5.m1.1b"><ci id="S2.SS2.p2.5.m1.1.1.cmml" xref="S2.SS2.p2.5.m1.1.1">𝑑</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.5.m1.1c">d</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.5.m1.1d">italic_d</annotation></semantics></math> is the dimension of <span class="ltx_text ltx_font_italic">x</span>, <math alttext="f(\cdot)" class="ltx_Math" display="inline" id="S2.SS2.p2.6.m2.1"><semantics id="S2.SS2.p2.6.m2.1a"><mrow id="S2.SS2.p2.6.m2.1.2" xref="S2.SS2.p2.6.m2.1.2.cmml"><mi id="S2.SS2.p2.6.m2.1.2.2" xref="S2.SS2.p2.6.m2.1.2.2.cmml">f</mi><mo id="S2.SS2.p2.6.m2.1.2.1" xref="S2.SS2.p2.6.m2.1.2.1.cmml">⁢</mo><mrow id="S2.SS2.p2.6.m2.1.2.3.2" xref="S2.SS2.p2.6.m2.1.2.cmml"><mo id="S2.SS2.p2.6.m2.1.2.3.2.1" stretchy="false" xref="S2.SS2.p2.6.m2.1.2.cmml">(</mo><mo id="S2.SS2.p2.6.m2.1.1" lspace="0em" rspace="0em" xref="S2.SS2.p2.6.m2.1.1.cmml">⋅</mo><mo id="S2.SS2.p2.6.m2.1.2.3.2.2" stretchy="false" xref="S2.SS2.p2.6.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.6.m2.1b"><apply id="S2.SS2.p2.6.m2.1.2.cmml" xref="S2.SS2.p2.6.m2.1.2"><times id="S2.SS2.p2.6.m2.1.2.1.cmml" xref="S2.SS2.p2.6.m2.1.2.1"></times><ci id="S2.SS2.p2.6.m2.1.2.2.cmml" xref="S2.SS2.p2.6.m2.1.2.2">𝑓</ci><ci id="S2.SS2.p2.6.m2.1.1.cmml" xref="S2.SS2.p2.6.m2.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.6.m2.1c">f(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.6.m2.1d">italic_f ( ⋅ )</annotation></semantics></math> is a regression network, and the third line follows from the data processing inequality applied to the discrete speech units. 
The bound implies that, to obtain a tighter lower bound on the mutual information, i.e., a better estimate of information completeness, we should minimize the mean squared error using a powerful regression network <math alttext="f(\cdot)" class="ltx_Math" display="inline" id="S2.SS2.p2.7.m3.1"><semantics id="S2.SS2.p2.7.m3.1a"><mrow id="S2.SS2.p2.7.m3.1.2" xref="S2.SS2.p2.7.m3.1.2.cmml"><mi id="S2.SS2.p2.7.m3.1.2.2" xref="S2.SS2.p2.7.m3.1.2.2.cmml">f</mi><mo id="S2.SS2.p2.7.m3.1.2.1" xref="S2.SS2.p2.7.m3.1.2.1.cmml">⁢</mo><mrow id="S2.SS2.p2.7.m3.1.2.3.2" xref="S2.SS2.p2.7.m3.1.2.cmml"><mo id="S2.SS2.p2.7.m3.1.2.3.2.1" stretchy="false" xref="S2.SS2.p2.7.m3.1.2.cmml">(</mo><mo id="S2.SS2.p2.7.m3.1.1" lspace="0em" rspace="0em" xref="S2.SS2.p2.7.m3.1.1.cmml">⋅</mo><mo id="S2.SS2.p2.7.m3.1.2.3.2.2" stretchy="false" xref="S2.SS2.p2.7.m3.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.p2.7.m3.1b"><apply id="S2.SS2.p2.7.m3.1.2.cmml" xref="S2.SS2.p2.7.m3.1.2"><times id="S2.SS2.p2.7.m3.1.2.1.cmml" xref="S2.SS2.p2.7.m3.1.2.1"></times><ci id="S2.SS2.p2.7.m3.1.2.2.cmml" xref="S2.SS2.p2.7.m3.1.2.2">𝑓</ci><ci id="S2.SS2.p2.7.m3.1.1.cmml" xref="S2.SS2.p2.7.m3.1.1">⋅</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.p2.7.m3.1c">f(\cdot)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.p2.7.m3.1d">italic_f ( ⋅ )</annotation></semantics></math>.</p> </div> <figure class="ltx_figure" id="S2.F1"><svg class="ltx_picture ltx_centering" height="151.73" id="S2.F1.pic1" overflow="visible" version="1.1" width="233.15"><g fill="#000000" stroke="#000000" stroke-width="0.4pt" transform="translate(0,151.73) matrix(1 0 0 -1 0 0) translate(19.28,0) translate(0,25.03) matrix(1.0 0.0 0.0 1.0 -19.28 -25.03)"><g class="ltx_nestedsvg" transform="matrix(1 0 0 1 0 0) translate(19.28,0) translate(0,25.03)"><g color="#808080" fill="#808080" stroke="#808080" stroke-width="0.2pt"><path d="M 50.79 -2.95 L 50.79 2.95" 
style="fill:none"></path></g><g fill="#000000" stroke="#000000" stroke-width="0.8pt"><path d="M 0 0 L 209.73 0" style="fill:none"></path><g transform="matrix(1.0 0.0 0.0 1.0 209.73 0)"><path d="M 3.6 0 L -2.16 2.88 L 0 0 L -2.16 -2.88" style="stroke:none"></path></g><path d="M 0 0 L 0 107.36" style="fill:none"></path><g transform="matrix(0.0 1.0 -1.0 0.0 0 107.36)"><path d="M 3.6 0 L -2.16 2.88 L 0 0 L -2.16 -2.88" style="stroke:none"></path></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 14.43 -17.73)"><foreignobject height="12.3" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="72.72"><span class="ltx_text" id="S2.F1.pic1.3.3.3.3.3.1.1">linear probe</span></foreignobject></g><clippath id="pgfcp1"><path d="M 0 0 L 213.33 0 L 213.33 110.96 L 0 110.96 Z"></path></clippath><g clip-path="url(#pgfcp1)"><g color="#6666FF" fill="#6666FF" stroke="#6666FF" stroke-width="1.2pt"><path d="M 0 142.86 L 2.05 137.37 L 4.1 132.1 L 6.16 127.03 L 8.21 122.16 L 10.26 117.49 L 12.31 113 L 14.37 108.69 L 16.42 104.55 L 18.47 100.58 L 20.52 96.76 L 22.57 93.09 L 24.63 89.57 L 26.68 86.19 L 28.73 82.94 L 30.78 79.82 L 32.84 76.83 L 34.89 73.95 L 36.94 71.19 L 38.99 68.53 L 41.04 65.98 L 43.1 63.53 L 45.15 61.18 L 47.2 58.92 L 49.25 56.76 L 51.3 54.67 L 53.36 52.67 L 55.41 50.75 L 57.46 48.91 L 59.51 47.13 L 61.57 45.43 L 63.62 43.8 L 65.67 42.23 L 67.72 40.72 L 69.77 39.27 L 71.83 37.88 L 73.88 36.55 L 75.93 35.27 L 77.98 34.03 L 80.04 32.85 L 82.09 31.72 L 84.14 30.62 L 86.19 29.58 L 88.24 28.57 L 90.3 27.6 L 92.35 26.68 L 94.4 25.78 L 96.45 24.93 L 98.51 24.1 L 100.56 23.32 L 102.61 22.56 L 104.66 21.83 L 106.71 21.13 L 108.77 20.46 L 110.82 19.81 L 112.87 19.19 L 114.92 18.6 L 116.98 18.02 L 119.03 17.48 L 121.08 16.95 L 123.13 16.44 L 125.18 15.96 L 127.24 15.49 L 129.29 15.04 L 131.34 14.61 L 133.39 14.2 L 135.44 13.8 L 137.5 13.42 L 139.55 13.05 L 141.6 12.7 L 143.65 12.36 L 145.71 12.04 L 147.76 11.72 L 149.81 11.42 L 151.86 11.14 L 153.91 
10.86 L 155.97 10.6 L 158.02 10.34 L 160.07 10.1 L 162.12 9.86 L 164.18 9.63 L 166.23 9.42 L 168.28 9.21 L 170.33 9.01 L 172.38 8.82 L 174.44 8.63 L 176.49 8.46 L 178.54 8.29 L 180.59 8.12 L 182.65 7.97 L 184.7 7.82 L 186.75 7.67 L 188.8 7.53 L 190.85 7.4 L 192.91 7.27 L 194.96 7.15 L 197.01 7.03 L 199.06 6.92 L 201.12 6.81 L 203.17 6.7" style="fill:none"></path></g><g></g><g color="#FF6666" fill="#FF6666" stroke="#FF6666" stroke-width="1.2pt"><path d="M 0 142.86 L 2.05 142.77 L 4.1 142.5 L 6.16 142.05 L 8.21 141.42 L 10.26 140.62 L 12.31 139.64 L 14.37 138.49 L 16.42 137.18 L 18.47 135.72 L 20.52 134.1 L 22.57 132.33 L 24.63 130.42 L 26.68 128.37 L 28.73 126.2 L 30.78 123.91 L 32.84 121.51 L 34.89 119.01 L 36.94 116.41 L 38.99 113.73 L 41.04 110.98 L 43.1 108.16 L 45.15 105.28 L 47.2 102.35 L 49.25 99.38 L 51.3 96.38 L 53.36 93.36 L 55.41 90.33 L 57.46 87.29 L 59.51 84.25 L 61.57 81.22 L 63.62 78.22 L 65.67 75.23 L 67.72 72.28 L 69.77 69.36 L 71.83 66.49 L 73.88 63.67 L 75.93 60.9 L 77.98 58.18 L 80.04 55.53 L 82.09 52.95 L 84.14 50.44 L 86.19 48 L 88.24 45.63 L 90.3 43.34 L 92.35 41.13 L 94.4 38.99 L 96.45 36.94 L 98.51 34.97 L 100.56 33.08 L 102.61 31.27 L 104.66 29.54 L 106.71 27.89 L 108.77 26.32 L 110.82 24.82 L 112.87 23.4 L 114.92 22.06 L 116.98 20.78 L 119.03 19.58 L 121.08 18.45 L 123.13 17.38 L 125.18 16.37 L 127.24 15.43 L 129.29 14.55 L 131.34 13.72 L 133.39 12.95 L 135.44 12.23 L 137.5 11.56 L 139.55 10.93 L 141.6 10.35 L 143.65 9.82 L 145.71 9.32 L 147.76 8.86 L 149.81 8.43 L 151.86 8.04 L 153.91 7.68 L 155.97 7.35 L 158.02 7.05 L 160.07 6.77 L 162.12 6.52 L 164.18 6.28 L 166.23 6.07 L 168.28 5.88 L 170.33 5.7 L 172.38 5.54 L 174.44 5.4 L 176.49 5.27 L 178.54 5.15 L 180.59 5.04 L 182.65 4.95 L 184.7 4.86 L 186.75 4.78 L 188.8 4.71 L 190.85 4.65 L 192.91 4.59 L 194.96 4.54 L 197.01 4.5 L 199.06 4.46 L 201.12 4.42 L 203.17 4.39" style="fill:none"></path></g><g></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 5086.29 
9025.35)"><foreignobject height="12.1" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="16.42"><math alttext="R_{A}" class="ltx_Math" display="inline" id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1"><semantics id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1a"><msub id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1" xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.cmml"><mi id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.2" mathcolor="#FF4D4D" xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.2.cmml">R</mi><mi id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.3" mathcolor="#FF4D4D" xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.3.cmml">A</mi></msub><annotation-xml encoding="MathML-Content" id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1b"><apply id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.cmml" xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.1.cmml" xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.2.cmml" xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.2">𝑅</ci><ci id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.3.cmml" 
xref="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1c">R_{A}</annotation><annotation encoding="application/x-llamapun" id="S2.F1.pic1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.m1.1d">italic_R start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT</annotation></semantics></math></foreignobject></g><g></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 2038.78 3477.29)"><foreignobject height="12.1" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="16.88"><math alttext="R_{B}" class="ltx_Math" display="inline" id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1"><semantics id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1a"><msub id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1" xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.cmml"><mi id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.2" mathcolor="#4D4DFF" xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.2.cmml">R</mi><mi id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.3" mathcolor="#4D4DFF" xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.3.cmml">B</mi></msub><annotation-xml encoding="MathML-Content" id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1b"><apply id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.cmml" 
xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.1.cmml" xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.2.cmml" xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.2">𝑅</ci><ci id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.3.cmml" xref="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1.1.3">𝐵</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1c">R_{B}</annotation><annotation encoding="application/x-llamapun" id="S2.F1.pic1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.1.1.1.1.1.1.1.1.1.m1.1d">italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT</annotation></semantics></math></foreignobject></g><g></g><g stroke-dasharray="0.8pt,2.0pt" stroke-dashoffset="0.0pt" stroke-width="0.8pt"><path d="M 50.79 97.14 L 50.79 0" style="fill:none"></path></g><g></g><g color="#6666FF" fill="#6666FF" stroke="#6666FF"><path d="M 52.75 55.19 C 52.75 56.27 51.87 57.14 50.79 57.14 C 49.71 57.14 48.83 56.27 48.83 55.19 C 48.83 54.11 49.71 53.23 50.79 53.23 C 51.87 53.23 52.75 54.11 52.75 55.19 Z M 50.79 55.19" style="stroke:none"></path></g><g></g><g color="#FF6666" fill="#FF6666" stroke="#FF6666"><path d="M 52.75 97.14 C 52.75 98.22 51.87 99.09 50.79 99.09 C 49.71 99.09 48.83 98.22 48.83 97.14 C 48.83 96.05 49.71 95.18 50.79 95.18 C 51.87 95.18 52.75 96.05 52.75 97.14 Z M 50.79 97.14" style="stroke:none"></path></g><g></g></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 
117.07 -14.77)"><foreignobject height="12.3" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="90.71"><span class="ltx_text" id="S2.F1.pic1.4.4.4.4.4.1.1">model capacity</span></foreignobject></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 -14.66 116.13)"><foreignobject height="5.96" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="29.33"><span class="ltx_text" id="S2.F1.pic1.5.5.5.5.5.1.1">error</span></foreignobject></g></g></g></g></svg> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text ltx_font_bold" id="S2.F1.10.5.1" style="font-size:90%;">Fig. 1</span>: </span><span class="ltx_text" id="S2.F1.8.4" style="font-size:90%;">An illustration of information accessibility. <math alttext="R_{A}" class="ltx_Math" display="inline" id="S2.F1.5.1.m1.1"><semantics id="S2.F1.5.1.m1.1b"><msub id="S2.F1.5.1.m1.1.1" xref="S2.F1.5.1.m1.1.1.cmml"><mi id="S2.F1.5.1.m1.1.1.2" xref="S2.F1.5.1.m1.1.1.2.cmml">R</mi><mi id="S2.F1.5.1.m1.1.1.3" xref="S2.F1.5.1.m1.1.1.3.cmml">A</mi></msub><annotation-xml encoding="MathML-Content" id="S2.F1.5.1.m1.1c"><apply id="S2.F1.5.1.m1.1.1.cmml" xref="S2.F1.5.1.m1.1.1"><csymbol cd="ambiguous" id="S2.F1.5.1.m1.1.1.1.cmml" xref="S2.F1.5.1.m1.1.1">subscript</csymbol><ci id="S2.F1.5.1.m1.1.1.2.cmml" xref="S2.F1.5.1.m1.1.1.2">𝑅</ci><ci id="S2.F1.5.1.m1.1.1.3.cmml" xref="S2.F1.5.1.m1.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F1.5.1.m1.1d">R_{A}</annotation><annotation encoding="application/x-llamapun" id="S2.F1.5.1.m1.1e">italic_R start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="R_{B}" class="ltx_Math" display="inline" id="S2.F1.6.2.m2.1"><semantics id="S2.F1.6.2.m2.1b"><msub id="S2.F1.6.2.m2.1.1" xref="S2.F1.6.2.m2.1.1.cmml"><mi id="S2.F1.6.2.m2.1.1.2" xref="S2.F1.6.2.m2.1.1.2.cmml">R</mi><mi id="S2.F1.6.2.m2.1.1.3" 
xref="S2.F1.6.2.m2.1.1.3.cmml">B</mi></msub><annotation-xml encoding="MathML-Content" id="S2.F1.6.2.m2.1c"><apply id="S2.F1.6.2.m2.1.1.cmml" xref="S2.F1.6.2.m2.1.1"><csymbol cd="ambiguous" id="S2.F1.6.2.m2.1.1.1.cmml" xref="S2.F1.6.2.m2.1.1">subscript</csymbol><ci id="S2.F1.6.2.m2.1.1.2.cmml" xref="S2.F1.6.2.m2.1.1.2">𝑅</ci><ci id="S2.F1.6.2.m2.1.1.3.cmml" xref="S2.F1.6.2.m2.1.1.3">𝐵</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F1.6.2.m2.1d">R_{B}</annotation><annotation encoding="application/x-llamapun" id="S2.F1.6.2.m2.1e">italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT</annotation></semantics></math> are two representations, and their probing errors differ depending on the model capacity of the probes. Under a linear probe, information in <math alttext="R_{B}" class="ltx_Math" display="inline" id="S2.F1.7.3.m3.1"><semantics id="S2.F1.7.3.m3.1b"><msub id="S2.F1.7.3.m3.1.1" xref="S2.F1.7.3.m3.1.1.cmml"><mi id="S2.F1.7.3.m3.1.1.2" xref="S2.F1.7.3.m3.1.1.2.cmml">R</mi><mi id="S2.F1.7.3.m3.1.1.3" xref="S2.F1.7.3.m3.1.1.3.cmml">B</mi></msub><annotation-xml encoding="MathML-Content" id="S2.F1.7.3.m3.1c"><apply id="S2.F1.7.3.m3.1.1.cmml" xref="S2.F1.7.3.m3.1.1"><csymbol cd="ambiguous" id="S2.F1.7.3.m3.1.1.1.cmml" xref="S2.F1.7.3.m3.1.1">subscript</csymbol><ci id="S2.F1.7.3.m3.1.1.2.cmml" xref="S2.F1.7.3.m3.1.1.2">𝑅</ci><ci id="S2.F1.7.3.m3.1.1.3.cmml" xref="S2.F1.7.3.m3.1.1.3">𝐵</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F1.7.3.m3.1d">R_{B}</annotation><annotation encoding="application/x-llamapun" id="S2.F1.7.3.m3.1e">italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT</annotation></semantics></math> is more accessible than <math alttext="R_{A}" class="ltx_Math" display="inline" id="S2.F1.8.4.m4.1"><semantics id="S2.F1.8.4.m4.1b"><msub id="S2.F1.8.4.m4.1.1" xref="S2.F1.8.4.m4.1.1.cmml"><mi id="S2.F1.8.4.m4.1.1.2" xref="S2.F1.8.4.m4.1.1.2.cmml">R</mi><mi id="S2.F1.8.4.m4.1.1.3" 
xref="S2.F1.8.4.m4.1.1.3.cmml">A</mi></msub><annotation-xml encoding="MathML-Content" id="S2.F1.8.4.m4.1c"><apply id="S2.F1.8.4.m4.1.1.cmml" xref="S2.F1.8.4.m4.1.1"><csymbol cd="ambiguous" id="S2.F1.8.4.m4.1.1.1.cmml" xref="S2.F1.8.4.m4.1.1">subscript</csymbol><ci id="S2.F1.8.4.m4.1.1.2.cmml" xref="S2.F1.8.4.m4.1.1.2">𝑅</ci><ci id="S2.F1.8.4.m4.1.1.3.cmml" xref="S2.F1.8.4.m4.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.F1.8.4.m4.1d">R_{A}</annotation><annotation encoding="application/x-llamapun" id="S2.F1.8.4.m4.1e">italic_R start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT</annotation></semantics></math> with a lower error.</span></figcaption> </figure> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.3 </span>Information completeness and accessibility</h3> <div class="ltx_para" id="S2.SS3.p1"> <p class="ltx_p" id="S2.SS3.p1.2">The information accessibility of a representation describes how easily a target property can be predicted from it. Accessibility depends on the capacity of the model used to extract the information, as illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.F1" title="Figure 1 ‣ 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">1</span></a>. A probe with higher capacity is more likely to achieve a lower error. 
To measure information accessibility, previous work has developed various downstream speech tasks <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib1" title="">1</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib13" title="">13</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib30" title="">30</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib14" title="">14</a>]</cite>. It is widely accepted that if a speech property encoded in a representation is predictable with linear probes (low model capacity), the information about that property in the representation is highly accessible <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib1" title="">1</a>]</cite>. For instance, the information in <math alttext="R_{B}" class="ltx_Math" display="inline" id="S2.SS3.p1.1.m1.1"><semantics id="S2.SS3.p1.1.m1.1a"><msub id="S2.SS3.p1.1.m1.1.1" xref="S2.SS3.p1.1.m1.1.1.cmml"><mi id="S2.SS3.p1.1.m1.1.1.2" xref="S2.SS3.p1.1.m1.1.1.2.cmml">R</mi><mi id="S2.SS3.p1.1.m1.1.1.3" xref="S2.SS3.p1.1.m1.1.1.3.cmml">B</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.1.m1.1b"><apply id="S2.SS3.p1.1.m1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.1.m1.1.1.1.cmml" xref="S2.SS3.p1.1.m1.1.1">subscript</csymbol><ci id="S2.SS3.p1.1.m1.1.1.2.cmml" xref="S2.SS3.p1.1.m1.1.1.2">𝑅</ci><ci id="S2.SS3.p1.1.m1.1.1.3.cmml" xref="S2.SS3.p1.1.m1.1.1.3">𝐵</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.1.m1.1c">R_{B}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.1.m1.1d">italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT</annotation></semantics></math> is more accessible than in <math alttext="R_{A}" class="ltx_Math" display="inline" id="S2.SS3.p1.2.m2.1"><semantics 
id="S2.SS3.p1.2.m2.1a"><msub id="S2.SS3.p1.2.m2.1.1" xref="S2.SS3.p1.2.m2.1.1.cmml"><mi id="S2.SS3.p1.2.m2.1.1.2" xref="S2.SS3.p1.2.m2.1.1.2.cmml">R</mi><mi id="S2.SS3.p1.2.m2.1.1.3" xref="S2.SS3.p1.2.m2.1.1.3.cmml">A</mi></msub><annotation-xml encoding="MathML-Content" id="S2.SS3.p1.2.m2.1b"><apply id="S2.SS3.p1.2.m2.1.1.cmml" xref="S2.SS3.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S2.SS3.p1.2.m2.1.1.1.cmml" xref="S2.SS3.p1.2.m2.1.1">subscript</csymbol><ci id="S2.SS3.p1.2.m2.1.1.2.cmml" xref="S2.SS3.p1.2.m2.1.1.2">𝑅</ci><ci id="S2.SS3.p1.2.m2.1.1.3.cmml" xref="S2.SS3.p1.2.m2.1.1.3">𝐴</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p1.2.m2.1c">R_{A}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p1.2.m2.1d">italic_R start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT</annotation></semantics></math> with a linear probe in Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.F1" title="Figure 1 ‣ 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">1</span></a>.</p> </div> <div class="ltx_para" id="S2.SS3.p2"> <p class="ltx_p" id="S2.SS3.p2.3">On the other hand, information completeness lies at the opposite end of the spectrum, requiring a higher model capacity to reach a tighter lower bound. For example, Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.F1" title="Figure 1 ‣ 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">1</span></a> tells another story if we focus on the high model capacity region. A similar finding is also noted in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib31" title="">31</a>]</cite>. 
To better estimate the mutual information between a speech property and the representations, as in (<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.E3" title="In 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">3</span></a>), the cross entropy should be minimized using a powerful <math alttext="q" class="ltx_Math" display="inline" id="S2.SS3.p2.1.m1.1"><semantics id="S2.SS3.p2.1.m1.1a"><mi id="S2.SS3.p2.1.m1.1.1" xref="S2.SS3.p2.1.m1.1.1.cmml">q</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.1.m1.1b"><ci id="S2.SS3.p2.1.m1.1.1.cmml" xref="S2.SS3.p2.1.m1.1.1">𝑞</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.1.m1.1c">q</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.1.m1.1d">italic_q</annotation></semantics></math>. This fact is also noted in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib32" title="">32</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib15" title="">15</a>]</cite>. 
While it is generally not possible to find the optimal <math alttext="q" class="ltx_Math" display="inline" id="S2.SS3.p2.2.m2.1"><semantics id="S2.SS3.p2.2.m2.1a"><mi id="S2.SS3.p2.2.m2.1.1" xref="S2.SS3.p2.2.m2.1.1.cmml">q</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.2.m2.1b"><ci id="S2.SS3.p2.2.m2.1.1.cmml" xref="S2.SS3.p2.2.m2.1.1">𝑞</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.2.m2.1c">q</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.2.m2.1d">italic_q</annotation></semantics></math> that maximizes the lower bound, we consider parameterizing <math alttext="q" class="ltx_Math" display="inline" id="S2.SS3.p2.3.m3.1"><semantics id="S2.SS3.p2.3.m3.1a"><mi id="S2.SS3.p2.3.m3.1.1" xref="S2.SS3.p2.3.m3.1.1.cmml">q</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.p2.3.m3.1b"><ci id="S2.SS3.p2.3.m3.1.1.cmml" xref="S2.SS3.p2.3.m3.1.1">𝑞</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.p2.3.m3.1c">q</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.p2.3.m3.1d">italic_q</annotation></semantics></math> with a deeper network to obtain a tighter lower bound, treating downstream performance as information accessibility.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>Related work</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">Several strands of the literature relate to ours. 
Given how widely discrete units are applied, especially in speech language models and voice conversion, in this section we focus on work related to the completeness of discrete units.</p> </div> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1 </span>Information-theoretic probing</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">In this work, we focus on information completeness, another aspect of a representation, through the lens of information theory. Several recent approaches have used information-theoretic techniques to evaluate BERT representations <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib32" title="">32</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib15" title="">15</a>]</cite>. Similar techniques have also inspired the evaluation of speech representations <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib29" title="">29</a>]</cite>, connecting mutual information to downstream speech tasks. 
However, the information completeness of speech representations has not been well studied.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>Measuring information completeness</h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">Other work claims that certain speech properties are disentangled in self-supervised speech representations <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib33" title="">33</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib34" title="">34</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite> or in the extracted discrete units <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite>. In these studies, the presence of information is not verified through an information-theoretic measure. Instead, they borrow evaluation metrics from voice conversion to measure whether content and speaker information are preserved in synthesized speech. 
The content and speaker information are analyzed with a speech recognition system and a speaker encoder to compare word error rates and speaker similarity between synthesized and original speech <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib19" title="">19</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite>.</p> </div> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1">Although this evaluation protocol is widely adopted, whether the synthesized speech faithfully reflects the information carried in discrete speech units is questionable. On the one hand, speech synthesized from the representations may contain hallucinated content if the generation involves GANs <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib22" title="">22</a>]</cite> or diffusion models <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib35" title="">35</a>]</cite>. In GANs, for example, the discriminator only estimates whether a generated sample is real or fake, and does not estimate the actual distribution <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib36" title="">36</a>]</cite>. 
This prevents information completeness from being justified through the lower bound of mutual information (<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.E3" title="In 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">3</span></a>).<span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_tag ltx_tag_note">1</span>Although vocoders such as HiFiGAN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib37" title="">37</a>]</cite> have a Mel spectrogram loss to promote more realistic synthesized speech, our arguments on the justification of the lower bound still hold.</span></span></span> In addition to hallucination in the synthesized speech, the external speech recognition model can potentially suffer from hallucination as well <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib38" title="">38</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib39" title="">39</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib23" title="">23</a>]</cite>, further weakening the justification of information completeness.</p> </div> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>Experimental settings</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">Given the lower bound of mutual information (<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.E4" title="In 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">4</span></a>), we empirically evaluate the completeness of 
HuBERT representations and the derived discrete units. We also present accessibility measurements on phonetic classification, pitch estimation and speaker verification, considering a higher model capacity region. While one can always argue whether a probing model is sufficiently strong or not, the aim is <span class="ltx_text ltx_font_bold" id="S4.p1.1.1">not</span> to estimate information content (in fact it is barely possible), but rather to identify at least how much information is present in the representations.</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>Discrete speech units</h3> <div class="ltx_para" id="S4.SS1.p1"> <p class="ltx_p" id="S4.SS1.p1.5">We choose HuBERT layer 4 and layer 9 for all experiments, which are the best-performing layers for speaker-related and content-related tasks, respectively <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib5" title="">5</a>]</cite>. Discrete speech units are obtained by running RVQ on these two layers. We randomly sample 5000 utterances from LibriSpeech <span class="ltx_text ltx_font_typewriter" id="S4.SS1.p1.5.1">train-clean-360</span> <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib40" title="">40</a>]</cite> to train each codebook using k-means. Codebooks are successively optimized to minimize the Euclidean distance between the quantized and original HuBERT representations. Unless stated otherwise, RVQ codebooks are not further fine-tuned. 
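</p> </div> <div class="ltx_para"> <p class="ltx_p">The successive codebook training described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (<span class="ltx_text ltx_font_typewriter">kmeans</span>, <span class="ltx_text ltx_font_typewriter">train_rvq</span>, <span class="ltx_text ltx_font_typewriter">quantize</span>) are hypothetical, k-means is plain Lloyd's algorithm with random initialization (the initialization is not specified in the text), and the toy sizes are far smaller than the 1024-entry codebooks used here.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed variant); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in data:
            buckets[nearest(v, centroids)].append(v)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid when a cluster is empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train_rvq(data, num_codebooks, codebook_size, seed=0):
    """Fit codebooks one by one, each on the residual left by the previous ones."""
    residuals = [list(v) for v in data]
    codebooks = []
    for level in range(num_codebooks):
        cb = kmeans(residuals, codebook_size, seed=seed + level)
        codebooks.append(cb)
        residuals = [
            [x - c for x, c in zip(r, cb[nearest(r, cb)])] for r in residuals
        ]
    return codebooks


def quantize(vec, codebooks):
    """Return (codes, reconstruction) for a single vector."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes = []
    for cb in codebooks:
        k = nearest(residual, cb)
        codes.append(k)
        recon = [r + c for r, c in zip(recon, cb[k])]
        residual = [x - c for x, c in zip(residual, cb[k])]
    return codes, recon


# Toy stand-in for HuBERT frames: 200 random 4-dimensional vectors.
rng = random.Random(1)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
for num_codebooks in (1, 2, 3):
    cbs = train_rvq(frames, num_codebooks, codebook_size=8)
    err = sum(sq_dist(f, quantize(f, cbs)[1]) for f in frames) / len(frames)
    print(num_codebooks, round(err, 4))
```
</pre>
<p class="ltx_p">In this sketch each additional codebook is fitted to what the earlier codebooks failed to reconstruct, so the mean squared reconstruction error on the training frames does not increase as codebooks are added.</p> </div> <div class="ltx_para"> <p class="ltx_p">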
We experiment with <math alttext="L" class="ltx_Math" display="inline" id="S4.SS1.p1.1.m1.1"><semantics id="S4.SS1.p1.1.m1.1a"><mi id="S4.SS1.p1.1.m1.1.1" xref="S4.SS1.p1.1.m1.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.1.m1.1b"><ci id="S4.SS1.p1.1.m1.1.1.cmml" xref="S4.SS1.p1.1.m1.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.1.m1.1c">L</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.1.m1.1d">italic_L</annotation></semantics></math> from 1 to 8, denoting <math alttext="\text{RVQ}_{L}" class="ltx_Math" display="inline" id="S4.SS1.p1.2.m2.1"><semantics id="S4.SS1.p1.2.m2.1a"><msub id="S4.SS1.p1.2.m2.1.1" xref="S4.SS1.p1.2.m2.1.1.cmml"><mtext id="S4.SS1.p1.2.m2.1.1.2" xref="S4.SS1.p1.2.m2.1.1.2a.cmml">RVQ</mtext><mi id="S4.SS1.p1.2.m2.1.1.3" xref="S4.SS1.p1.2.m2.1.1.3.cmml">L</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.2.m2.1b"><apply id="S4.SS1.p1.2.m2.1.1.cmml" xref="S4.SS1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.2.m2.1.1.1.cmml" xref="S4.SS1.p1.2.m2.1.1">subscript</csymbol><ci id="S4.SS1.p1.2.m2.1.1.2a.cmml" xref="S4.SS1.p1.2.m2.1.1.2"><mtext id="S4.SS1.p1.2.m2.1.1.2.cmml" xref="S4.SS1.p1.2.m2.1.1.2">RVQ</mtext></ci><ci id="S4.SS1.p1.2.m2.1.1.3.cmml" xref="S4.SS1.p1.2.m2.1.1.3">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.2.m2.1c">\text{RVQ}_{L}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.2.m2.1d">RVQ start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT</annotation></semantics></math> the RVQ with <math alttext="L" class="ltx_Math" display="inline" id="S4.SS1.p1.3.m3.1"><semantics id="S4.SS1.p1.3.m3.1a"><mi id="S4.SS1.p1.3.m3.1.1" xref="S4.SS1.p1.3.m3.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.3.m3.1b"><ci id="S4.SS1.p1.3.m3.1.1.cmml" xref="S4.SS1.p1.3.m3.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" 
id="S4.SS1.p1.3.m3.1c">L</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.3.m3.1d">italic_L</annotation></semantics></math> codebooks. Note that <math alttext="\text{RVQ}_{1}" class="ltx_Math" display="inline" id="S4.SS1.p1.4.m4.1"><semantics id="S4.SS1.p1.4.m4.1a"><msub id="S4.SS1.p1.4.m4.1.1" xref="S4.SS1.p1.4.m4.1.1.cmml"><mtext id="S4.SS1.p1.4.m4.1.1.2" xref="S4.SS1.p1.4.m4.1.1.2a.cmml">RVQ</mtext><mn id="S4.SS1.p1.4.m4.1.1.3" xref="S4.SS1.p1.4.m4.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.4.m4.1b"><apply id="S4.SS1.p1.4.m4.1.1.cmml" xref="S4.SS1.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.4.m4.1.1.1.cmml" xref="S4.SS1.p1.4.m4.1.1">subscript</csymbol><ci id="S4.SS1.p1.4.m4.1.1.2a.cmml" xref="S4.SS1.p1.4.m4.1.1.2"><mtext id="S4.SS1.p1.4.m4.1.1.2.cmml" xref="S4.SS1.p1.4.m4.1.1.2">RVQ</mtext></ci><cn id="S4.SS1.p1.4.m4.1.1.3.cmml" type="integer" xref="S4.SS1.p1.4.m4.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.4.m4.1c">\text{RVQ}_{1}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.4.m4.1d">RVQ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math> is identical to k-means. 
Each codebook has 1024 entries, so each code costs <math alttext="{\log_{2}}1024=10" class="ltx_Math" display="inline" id="S4.SS1.p1.5.m5.1"><semantics id="S4.SS1.p1.5.m5.1a"><mrow id="S4.SS1.p1.5.m5.1.1" xref="S4.SS1.p1.5.m5.1.1.cmml"><mrow id="S4.SS1.p1.5.m5.1.1.2" xref="S4.SS1.p1.5.m5.1.1.2.cmml"><msub id="S4.SS1.p1.5.m5.1.1.2.1" xref="S4.SS1.p1.5.m5.1.1.2.1.cmml"><mi id="S4.SS1.p1.5.m5.1.1.2.1.2" xref="S4.SS1.p1.5.m5.1.1.2.1.2.cmml">log</mi><mn id="S4.SS1.p1.5.m5.1.1.2.1.3" xref="S4.SS1.p1.5.m5.1.1.2.1.3.cmml">2</mn></msub><mo id="S4.SS1.p1.5.m5.1.1.2a" lspace="0.167em" xref="S4.SS1.p1.5.m5.1.1.2.cmml">⁡</mo><mn id="S4.SS1.p1.5.m5.1.1.2.2" xref="S4.SS1.p1.5.m5.1.1.2.2.cmml">1024</mn></mrow><mo id="S4.SS1.p1.5.m5.1.1.1" xref="S4.SS1.p1.5.m5.1.1.1.cmml">=</mo><mn id="S4.SS1.p1.5.m5.1.1.3" xref="S4.SS1.p1.5.m5.1.1.3.cmml">10</mn></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.5.m5.1b"><apply id="S4.SS1.p1.5.m5.1.1.cmml" xref="S4.SS1.p1.5.m5.1.1"><eq id="S4.SS1.p1.5.m5.1.1.1.cmml" xref="S4.SS1.p1.5.m5.1.1.1"></eq><apply id="S4.SS1.p1.5.m5.1.1.2.cmml" xref="S4.SS1.p1.5.m5.1.1.2"><apply id="S4.SS1.p1.5.m5.1.1.2.1.cmml" xref="S4.SS1.p1.5.m5.1.1.2.1"><csymbol cd="ambiguous" id="S4.SS1.p1.5.m5.1.1.2.1.1.cmml" xref="S4.SS1.p1.5.m5.1.1.2.1">subscript</csymbol><log id="S4.SS1.p1.5.m5.1.1.2.1.2.cmml" xref="S4.SS1.p1.5.m5.1.1.2.1.2"></log><cn id="S4.SS1.p1.5.m5.1.1.2.1.3.cmml" type="integer" xref="S4.SS1.p1.5.m5.1.1.2.1.3">2</cn></apply><cn id="S4.SS1.p1.5.m5.1.1.2.2.cmml" type="integer" xref="S4.SS1.p1.5.m5.1.1.2.2">1024</cn></apply><cn id="S4.SS1.p1.5.m5.1.1.3.cmml" type="integer" xref="S4.SS1.p1.5.m5.1.1.3">10</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.5.m5.1c">{\log_{2}}1024=10</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.5.m5.1d">roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT 1024 = 10</annotation></semantics></math> bits of storage.</p> </div> </section> <section class="ltx_subsection" 
id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2 </span>Completeness task</h3> <div class="ltx_para" id="S4.SS2.p1"> <p class="ltx_p" id="S4.SS2.p1.4">We conduct experiments on information completeness using LibriSpeech. Models are trained on <span class="ltx_text ltx_font_typewriter" id="S4.SS2.p1.4.1">train-clean-360</span> and evaluated on <span class="ltx_text ltx_font_typewriter" id="S4.SS2.p1.4.2">dev-clean</span>. The sampling rate is 16000 Hz. We use 80-band log Mels as <math alttext="X" class="ltx_Math" display="inline" id="S4.SS2.p1.1.m1.1"><semantics id="S4.SS2.p1.1.m1.1a"><mi id="S4.SS2.p1.1.m1.1.1" xref="S4.SS2.p1.1.m1.1.1.cmml">X</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.1.m1.1b"><ci id="S4.SS2.p1.1.m1.1.1.cmml" xref="S4.SS2.p1.1.m1.1.1">𝑋</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.1.m1.1c">X</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.1.m1.1d">italic_X</annotation></semantics></math> in (<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.E4" title="In 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">4</span></a>), the target of completeness. To match the frame rate of HuBERT representations (50 Hz), we set a hop size of 320. We set the length of the FFT to 1024. We do not normalize log Mels with global mean and variance. 
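</p> </div> <div class="ltx_para"> <p class="ltx_p">As a quick check of these settings, the arithmetic below confirms that a hop size of 320 samples at 16000 Hz gives the 50 Hz frame rate of HuBERT. The helper names are hypothetical, and the bin count assumes a one-sided spectrum of a real-valued signal, which is a common convention rather than something stated in the text.</p>
<pre class="ltx_verbatim ltx_font_typewriter">
```python
SAMPLE_RATE = 16000  # Hz, from the text
HOP_SIZE = 320       # samples between consecutive frames, from the text
N_FFT = 1024         # FFT length, from the text
N_MELS = 80          # number of Mel bands, from the text


def frame_rate_hz(sample_rate, hop_size):
    """Number of feature frames produced per second of audio."""
    return sample_rate / hop_size


def hop_duration_ms(sample_rate, hop_size):
    """Time shift between consecutive frames, in milliseconds."""
    return 1000.0 * hop_size / sample_rate


def num_fft_bins(n_fft):
    """Bins in a one-sided spectrum of a real signal (assumed convention)."""
    return n_fft // 2 + 1


print(frame_rate_hz(SAMPLE_RATE, HOP_SIZE))    # 50.0
print(hop_duration_ms(SAMPLE_RATE, HOP_SIZE))  # 20.0
print(num_fft_bins(N_FFT), N_MELS)             # 513 80
```
</pre>
<p class="ltx_p">Under these assumptions, each log Mel frame summarizes 513 frequency bins into 80 bands and is aligned one-to-one with a HuBERT frame every 20 ms.</p> </div> <div class="ltx_para"> <p class="ltx_p">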
We parameterize <math alttext="f" class="ltx_Math" display="inline" id="S4.SS2.p1.2.m2.1"><semantics id="S4.SS2.p1.2.m2.1a"><mi id="S4.SS2.p1.2.m2.1.1" xref="S4.SS2.p1.2.m2.1.1.cmml">f</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.2.m2.1b"><ci id="S4.SS2.p1.2.m2.1.1.cmml" xref="S4.SS2.p1.2.m2.1.1">𝑓</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.2.m2.1c">f</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.2.m2.1d">italic_f</annotation></semantics></math> as convolutional networks. It consists of two parts. The first part contains 6 convolutions with channels <math alttext="(256,256,256,256,512,512)" class="ltx_Math" display="inline" id="S4.SS2.p1.3.m3.6"><semantics id="S4.SS2.p1.3.m3.6a"><mrow id="S4.SS2.p1.3.m3.6.7.2" xref="S4.SS2.p1.3.m3.6.7.1.cmml"><mo id="S4.SS2.p1.3.m3.6.7.2.1" stretchy="false" xref="S4.SS2.p1.3.m3.6.7.1.cmml">(</mo><mn id="S4.SS2.p1.3.m3.1.1" xref="S4.SS2.p1.3.m3.1.1.cmml">256</mn><mo id="S4.SS2.p1.3.m3.6.7.2.2" xref="S4.SS2.p1.3.m3.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.3.m3.2.2" xref="S4.SS2.p1.3.m3.2.2.cmml">256</mn><mo id="S4.SS2.p1.3.m3.6.7.2.3" xref="S4.SS2.p1.3.m3.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.3.m3.3.3" xref="S4.SS2.p1.3.m3.3.3.cmml">256</mn><mo id="S4.SS2.p1.3.m3.6.7.2.4" xref="S4.SS2.p1.3.m3.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.3.m3.4.4" xref="S4.SS2.p1.3.m3.4.4.cmml">256</mn><mo id="S4.SS2.p1.3.m3.6.7.2.5" xref="S4.SS2.p1.3.m3.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.3.m3.5.5" xref="S4.SS2.p1.3.m3.5.5.cmml">512</mn><mo id="S4.SS2.p1.3.m3.6.7.2.6" xref="S4.SS2.p1.3.m3.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.3.m3.6.6" xref="S4.SS2.p1.3.m3.6.6.cmml">512</mn><mo id="S4.SS2.p1.3.m3.6.7.2.7" stretchy="false" xref="S4.SS2.p1.3.m3.6.7.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.3.m3.6b"><vector id="S4.SS2.p1.3.m3.6.7.1.cmml" xref="S4.SS2.p1.3.m3.6.7.2"><cn id="S4.SS2.p1.3.m3.1.1.cmml" type="integer" xref="S4.SS2.p1.3.m3.1.1">256</cn><cn 
id="S4.SS2.p1.3.m3.2.2.cmml" type="integer" xref="S4.SS2.p1.3.m3.2.2">256</cn><cn id="S4.SS2.p1.3.m3.3.3.cmml" type="integer" xref="S4.SS2.p1.3.m3.3.3">256</cn><cn id="S4.SS2.p1.3.m3.4.4.cmml" type="integer" xref="S4.SS2.p1.3.m3.4.4">256</cn><cn id="S4.SS2.p1.3.m3.5.5.cmml" type="integer" xref="S4.SS2.p1.3.m3.5.5">512</cn><cn id="S4.SS2.p1.3.m3.6.6.cmml" type="integer" xref="S4.SS2.p1.3.m3.6.6">512</cn></vector></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.3.m3.6c">(256,256,256,256,512,512)</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.3.m3.6d">( 256 , 256 , 256 , 256 , 512 , 512 )</annotation></semantics></math>, strides <math alttext="(1,1,1,1,2,2)" class="ltx_Math" display="inline" id="S4.SS2.p1.4.m4.6"><semantics id="S4.SS2.p1.4.m4.6a"><mrow id="S4.SS2.p1.4.m4.6.7.2" xref="S4.SS2.p1.4.m4.6.7.1.cmml"><mo id="S4.SS2.p1.4.m4.6.7.2.1" stretchy="false" xref="S4.SS2.p1.4.m4.6.7.1.cmml">(</mo><mn id="S4.SS2.p1.4.m4.1.1" xref="S4.SS2.p1.4.m4.1.1.cmml">1</mn><mo id="S4.SS2.p1.4.m4.6.7.2.2" xref="S4.SS2.p1.4.m4.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.4.m4.2.2" xref="S4.SS2.p1.4.m4.2.2.cmml">1</mn><mo id="S4.SS2.p1.4.m4.6.7.2.3" xref="S4.SS2.p1.4.m4.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.4.m4.3.3" xref="S4.SS2.p1.4.m4.3.3.cmml">1</mn><mo id="S4.SS2.p1.4.m4.6.7.2.4" xref="S4.SS2.p1.4.m4.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.4.m4.4.4" xref="S4.SS2.p1.4.m4.4.4.cmml">1</mn><mo id="S4.SS2.p1.4.m4.6.7.2.5" xref="S4.SS2.p1.4.m4.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.4.m4.5.5" xref="S4.SS2.p1.4.m4.5.5.cmml">2</mn><mo id="S4.SS2.p1.4.m4.6.7.2.6" xref="S4.SS2.p1.4.m4.6.7.1.cmml">,</mo><mn id="S4.SS2.p1.4.m4.6.6" xref="S4.SS2.p1.4.m4.6.6.cmml">2</mn><mo id="S4.SS2.p1.4.m4.6.7.2.7" stretchy="false" xref="S4.SS2.p1.4.m4.6.7.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S4.SS2.p1.4.m4.6b"><vector id="S4.SS2.p1.4.m4.6.7.1.cmml" xref="S4.SS2.p1.4.m4.6.7.2"><cn id="S4.SS2.p1.4.m4.1.1.cmml" type="integer" 
xref="S4.SS2.p1.4.m4.1.1">1</cn><cn id="S4.SS2.p1.4.m4.2.2.cmml" type="integer" xref="S4.SS2.p1.4.m4.2.2">1</cn><cn id="S4.SS2.p1.4.m4.3.3.cmml" type="integer" xref="S4.SS2.p1.4.m4.3.3">1</cn><cn id="S4.SS2.p1.4.m4.4.4.cmml" type="integer" xref="S4.SS2.p1.4.m4.4.4">1</cn><cn id="S4.SS2.p1.4.m4.5.5.cmml" type="integer" xref="S4.SS2.p1.4.m4.5.5">2</cn><cn id="S4.SS2.p1.4.m4.6.6.cmml" type="integer" xref="S4.SS2.p1.4.m4.6.6">2</cn></vector></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p1.4.m4.6c">(1,1,1,1,2,2)</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p1.4.m4.6d">( 1 , 1 , 1 , 1 , 2 , 2 )</annotation></semantics></math> and a kernel size of 3. The second part contains 8 ConvNeXt blocks as used in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib41" title="">41</a>]</cite>. We use a batch size of 16 and a learning rate of 0.0002. Models are trained for up to 60 epochs. The objective is to predict log Mels by minimizing MSE, which is equivalent to maximizing the lower bound on the mutual information.</p> </div> </section> <section class="ltx_subsection" id="S4.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.3 </span>Accessibility tasks</h3> <div class="ltx_para" id="S4.SS3.p1"> <p class="ltx_p" id="S4.SS3.p1.1">We design three tasks to evaluate information accessibility, taking into account probes with higher model capacity, as opposed to <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib14" title="">14</a>]</cite>. We choose phone classification (PC) to evaluate the presence of phonetic information. In particular, discrete speech units have shown strong phonetic prominence in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib8" title="">8</a>]</cite>. 
To see whether prosodic information is preserved in discrete units, we conduct experiments on pitch estimation (<math alttext="f_{0}" class="ltx_Math" display="inline" id="S4.SS3.p1.1.m1.1"><semantics id="S4.SS3.p1.1.m1.1a"><msub id="S4.SS3.p1.1.m1.1.1" xref="S4.SS3.p1.1.m1.1.1.cmml"><mi id="S4.SS3.p1.1.m1.1.1.2" xref="S4.SS3.p1.1.m1.1.1.2.cmml">f</mi><mn id="S4.SS3.p1.1.m1.1.1.3" xref="S4.SS3.p1.1.m1.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S4.SS3.p1.1.m1.1b"><apply id="S4.SS3.p1.1.m1.1.1.cmml" xref="S4.SS3.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS3.p1.1.m1.1.1.1.cmml" xref="S4.SS3.p1.1.m1.1.1">subscript</csymbol><ci id="S4.SS3.p1.1.m1.1.1.2.cmml" xref="S4.SS3.p1.1.m1.1.1.2">𝑓</ci><cn id="S4.SS3.p1.1.m1.1.1.3.cmml" type="integer" xref="S4.SS3.p1.1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p1.1.m1.1c">f_{0}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p1.1.m1.1d">italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math>). For the third task, we use speaker verification (SV) to measure whether speaker-related information is present.</p> </div> <div class="ltx_para" id="S4.SS3.p2"> <p class="ltx_p" id="S4.SS3.p2.2">We use 3-layer feedforward networks followed by a linear layer for phone classification and pitch estimation on Wall Street Journal (WSJ) <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib42" title="">42</a>]</cite>. We set the hidden dimension to 3076 and use ReLU as the activation function for each feedforward network. We adopt the setup in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib30" title="">30</a>]</cite> and train models on the WSJ training set, using 90% of <span class="ltx_text ltx_font_typewriter" id="S4.SS3.p2.2.1">si284</span>. 
We select the best model based on its performance on the development set, the remaining 10% of <span class="ltx_text ltx_font_typewriter" id="S4.SS3.p2.2.2">si284</span>. We report numbers on <span class="ltx_text ltx_font_typewriter" id="S4.SS3.p2.2.3">eval92</span> after training. Models are trained with a learning rate of <math alttext="0.001" class="ltx_Math" display="inline" id="S4.SS3.p2.1.m1.1"><semantics id="S4.SS3.p2.1.m1.1a"><mn id="S4.SS3.p2.1.m1.1.1" xref="S4.SS3.p2.1.m1.1.1.cmml">0.001</mn><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.1.m1.1b"><cn id="S4.SS3.p2.1.m1.1.1.cmml" type="float" xref="S4.SS3.p2.1.m1.1.1">0.001</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.1.m1.1c">0.001</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.1.m1.1d">0.001</annotation></semantics></math> using a batch size of <math alttext="12" class="ltx_Math" display="inline" id="S4.SS3.p2.2.m2.1"><semantics id="S4.SS3.p2.2.m2.1a"><mn id="S4.SS3.p2.2.m2.1.1" xref="S4.SS3.p2.2.m2.1.1.cmml">12</mn><annotation-xml encoding="MathML-Content" id="S4.SS3.p2.2.m2.1b"><cn id="S4.SS3.p2.2.m2.1.1.cmml" type="integer" xref="S4.SS3.p2.2.m2.1.1">12</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p2.2.m2.1c">12</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p2.2.m2.1d">12</annotation></semantics></math>.</p> </div> <div class="ltx_para" id="S4.SS3.p3"> <p class="ltx_p" id="S4.SS3.p3.1">For phone classification, we use forced alignments extracted by a speaker-adaptive GMM-HMM as targets. Performance is measured using phone error rate (PER). For pitch estimation, we extract the fundamental frequency using PYIN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib43" title="">43</a>]</cite> and treat it as the ground truth. The minimum and maximum frequencies are set to 50 Hz and 600 Hz, respectively. 
We use root-mean-square error (RMSE) in Hz and predict pitch only on sonorants obtained from the forced alignments.</p> </div> <div class="ltx_para" id="S4.SS3.p4"> <p class="ltx_p" id="S4.SS3.p4.2">Finally, we evaluate whether representations and discrete speech units encode speaker information by performing speaker verification (SV) on VoxCeleb1 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib44" title="">44</a>]</cite>. We employ a variant of ECAPA-TDNN <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib45" title="">45</a>]</cite> to learn speaker embeddings, where we do not concatenate the mean and standard deviation before attentive pooling. Unlike the original paper, we also do not use AAM-Softmax. Speaker encoders are trained with a learning rate of <math alttext="0.0005" class="ltx_Math" display="inline" id="S4.SS3.p4.1.m1.1"><semantics id="S4.SS3.p4.1.m1.1a"><mn id="S4.SS3.p4.1.m1.1.1" xref="S4.SS3.p4.1.m1.1.1.cmml">0.0005</mn><annotation-xml encoding="MathML-Content" id="S4.SS3.p4.1.m1.1b"><cn id="S4.SS3.p4.1.m1.1.1.cmml" type="float" xref="S4.SS3.p4.1.m1.1.1">0.0005</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p4.1.m1.1c">0.0005</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p4.1.m1.1d">0.0005</annotation></semantics></math> using a batch size of <math alttext="8" class="ltx_Math" display="inline" id="S4.SS3.p4.2.m2.1"><semantics id="S4.SS3.p4.2.m2.1a"><mn id="S4.SS3.p4.2.m2.1.1" xref="S4.SS3.p4.2.m2.1.1.cmml">8</mn><annotation-xml encoding="MathML-Content" id="S4.SS3.p4.2.m2.1b"><cn id="S4.SS3.p4.2.m2.1.1.cmml" type="integer" xref="S4.SS3.p4.2.m2.1.1">8</cn></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.p4.2.m2.1c">8</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.p4.2.m2.1d">8</annotation></semantics></math>. 
We crop input utterances to at most 12 seconds due to memory constraints. We train all models for 10 epochs.</p> </div> <figure class="ltx_figure" id="S4.F2"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_figure_panel ltx_img_square" height="145" id="S4.F2.g1" src="extracted/5871689/img/Cluster_145.png" width="166"/></div> <div class="ltx_flex_cell ltx_flex_size_2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_figure_panel ltx_img_square" height="145" id="S4.F2.g2" src="extracted/5871689/img/Cluster_272.png" width="166"/></div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text ltx_font_bold" id="S4.F2.3.1.1" style="font-size:90%;">Fig. 2</span>: </span><span class="ltx_text" id="S4.F2.4.2" style="font-size:90%;">Frames of HuBERT representations assigned to two example k-means clusters, visualized using the first two principal components from PCA. 
Colors represent speaker identities.</span></figcaption> </figure> <figure class="ltx_table ltx_figure_panel" id="S4.T1"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table"><span class="ltx_text ltx_font_bold" id="S4.T1.4.2.1" style="font-size:90%;">Table 1</span>: </span><span class="ltx_text" id="S4.T1.2.1" style="font-size:90%;">Accessibility of phone identities, <math alttext="f_{0}" class="ltx_Math" display="inline" id="S4.T1.2.1.m1.1"><semantics id="S4.T1.2.1.m1.1b"><msub id="S4.T1.2.1.m1.1.1" xref="S4.T1.2.1.m1.1.1.cmml"><mi id="S4.T1.2.1.m1.1.1.2" xref="S4.T1.2.1.m1.1.1.2.cmml">f</mi><mn id="S4.T1.2.1.m1.1.1.3" xref="S4.T1.2.1.m1.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S4.T1.2.1.m1.1c"><apply id="S4.T1.2.1.m1.1.1.cmml" xref="S4.T1.2.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T1.2.1.m1.1.1.1.cmml" xref="S4.T1.2.1.m1.1.1">subscript</csymbol><ci id="S4.T1.2.1.m1.1.1.2.cmml" xref="S4.T1.2.1.m1.1.1.2">𝑓</ci><cn id="S4.T1.2.1.m1.1.1.3.cmml" type="integer" xref="S4.T1.2.1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.2.1.m1.1d">f_{0}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.2.1.m1.1e">italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math>, and speaker identities on the 4th and the 9th HuBERT layers. 
Residuals are computed by subtracting the centroids from the associated representations.</span></figcaption> <div class="ltx_inline-block ltx_figure_panel ltx_align_center ltx_transformed_outer" id="S4.SS3.3.3" style="width:403.3pt;height:187.5pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(46.8pt,-21.8pt) scale(1.30215448977308,1.30215448977308) ;"> <table class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.SS3.3.3.3"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.SS3.1.1.1.1"> <th class="ltx_td ltx_th ltx_th_row ltx_border_r ltx_border_tt" id="S4.SS3.1.1.1.1.2"></th> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.SS3.1.1.1.1.3">PC</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.SS3.1.1.1.1.1"><math alttext="f_{0}" class="ltx_Math" display="inline" id="S4.SS3.1.1.1.1.1.m1.1"><semantics id="S4.SS3.1.1.1.1.1.m1.1a"><msub id="S4.SS3.1.1.1.1.1.m1.1.1" xref="S4.SS3.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.SS3.1.1.1.1.1.m1.1.1.2" xref="S4.SS3.1.1.1.1.1.m1.1.1.2.cmml">f</mi><mn id="S4.SS3.1.1.1.1.1.m1.1.1.3" xref="S4.SS3.1.1.1.1.1.m1.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S4.SS3.1.1.1.1.1.m1.1b"><apply id="S4.SS3.1.1.1.1.1.m1.1.1.cmml" xref="S4.SS3.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS3.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.SS3.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.SS3.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.SS3.1.1.1.1.1.m1.1.1.2">𝑓</ci><cn id="S4.SS3.1.1.1.1.1.m1.1.1.3.cmml" type="integer" xref="S4.SS3.1.1.1.1.1.m1.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.1.1.1.1.1.m1.1c">f_{0}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.1.1.1.1.1.m1.1d">italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.SS3.1.1.1.1.4">SV</td> </tr> <tr class="ltx_tr" id="S4.SS3.3.3.3.4.1"> <th class="ltx_td ltx_th 
ltx_th_row ltx_border_r" id="S4.SS3.3.3.3.4.1.1"></th> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.4.1.2">PER (%)</td> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.4.1.3">RMSE (Hz)</td> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.4.1.4">EER (%)</td> </tr> <tr class="ltx_tr" id="S4.SS3.3.3.3.5.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.SS3.3.3.3.5.2.1">HuBERT L4</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.SS3.3.3.3.5.2.2">11.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.SS3.3.3.3.5.2.3">35.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.SS3.3.3.3.5.2.4">4.4</td> </tr> <tr class="ltx_tr" id="S4.SS3.2.2.2.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.SS3.2.2.2.2.1">k-means (<math alttext="\text{RVQ}_{1}" class="ltx_Math" display="inline" id="S4.SS3.2.2.2.2.1.m1.1"><semantics id="S4.SS3.2.2.2.2.1.m1.1a"><msub id="S4.SS3.2.2.2.2.1.m1.1.1" xref="S4.SS3.2.2.2.2.1.m1.1.1.cmml"><mtext id="S4.SS3.2.2.2.2.1.m1.1.1.2" xref="S4.SS3.2.2.2.2.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S4.SS3.2.2.2.2.1.m1.1.1.3" xref="S4.SS3.2.2.2.2.1.m1.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S4.SS3.2.2.2.2.1.m1.1b"><apply id="S4.SS3.2.2.2.2.1.m1.1.1.cmml" xref="S4.SS3.2.2.2.2.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS3.2.2.2.2.1.m1.1.1.1.cmml" xref="S4.SS3.2.2.2.2.1.m1.1.1">subscript</csymbol><ci id="S4.SS3.2.2.2.2.1.m1.1.1.2a.cmml" xref="S4.SS3.2.2.2.2.1.m1.1.1.2"><mtext id="S4.SS3.2.2.2.2.1.m1.1.1.2.cmml" xref="S4.SS3.2.2.2.2.1.m1.1.1.2">RVQ</mtext></ci><cn id="S4.SS3.2.2.2.2.1.m1.1.1.3.cmml" type="integer" xref="S4.SS3.2.2.2.2.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.2.2.2.2.1.m1.1c">\text{RVQ}_{1}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.2.2.2.2.1.m1.1d">RVQ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math>)</th> 
<td class="ltx_td ltx_align_center" id="S4.SS3.2.2.2.2.2">29.8</td> <td class="ltx_td ltx_align_center" id="S4.SS3.2.2.2.2.3">67.1</td> <td class="ltx_td ltx_align_center" id="S4.SS3.2.2.2.2.4">18.8</td> </tr> <tr class="ltx_tr" id="S4.SS3.3.3.3.6.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.SS3.3.3.3.6.3.1">Residual</th> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.6.3.2">13.2</td> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.6.3.3">37.8</td> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.6.3.4">5.5</td> </tr> <tr class="ltx_tr" id="S4.SS3.3.3.3.7.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.SS3.3.3.3.7.4.1">HuBERT L9</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.SS3.3.3.3.7.4.2">7.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.SS3.3.3.3.7.4.3">41.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.SS3.3.3.3.7.4.4">6.5</td> </tr> <tr class="ltx_tr" id="S4.SS3.3.3.3.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.SS3.3.3.3.3.1">k-means (<math alttext="\text{RVQ}_{1}" class="ltx_Math" display="inline" id="S4.SS3.3.3.3.3.1.m1.1"><semantics id="S4.SS3.3.3.3.3.1.m1.1a"><msub id="S4.SS3.3.3.3.3.1.m1.1.1" xref="S4.SS3.3.3.3.3.1.m1.1.1.cmml"><mtext id="S4.SS3.3.3.3.3.1.m1.1.1.2" xref="S4.SS3.3.3.3.3.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S4.SS3.3.3.3.3.1.m1.1.1.3" xref="S4.SS3.3.3.3.3.1.m1.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S4.SS3.3.3.3.3.1.m1.1b"><apply id="S4.SS3.3.3.3.3.1.m1.1.1.cmml" xref="S4.SS3.3.3.3.3.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS3.3.3.3.3.1.m1.1.1.1.cmml" xref="S4.SS3.3.3.3.3.1.m1.1.1">subscript</csymbol><ci id="S4.SS3.3.3.3.3.1.m1.1.1.2a.cmml" xref="S4.SS3.3.3.3.3.1.m1.1.1.2"><mtext id="S4.SS3.3.3.3.3.1.m1.1.1.2.cmml" xref="S4.SS3.3.3.3.3.1.m1.1.1.2">RVQ</mtext></ci><cn id="S4.SS3.3.3.3.3.1.m1.1.1.3.cmml" type="integer" 
xref="S4.SS3.3.3.3.3.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS3.3.3.3.3.1.m1.1c">\text{RVQ}_{1}</annotation><annotation encoding="application/x-llamapun" id="S4.SS3.3.3.3.3.1.m1.1d">RVQ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math>)</th> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.3.2">23.4</td> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.3.3">72.9</td> <td class="ltx_td ltx_align_center" id="S4.SS3.3.3.3.3.4">21.5</td> </tr> <tr class="ltx_tr" id="S4.SS3.3.3.3.8.5"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" id="S4.SS3.3.3.3.8.5.1">Residual</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.SS3.3.3.3.8.5.2">8.3</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.SS3.3.3.3.8.5.3">41.7</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.SS3.3.3.3.8.5.4">7.3</td> </tr> </tbody> </table> </span></div> </figure> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Results and Discussions</h2> <section class="ltx_subsection" id="S5.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.1 </span>Information in the residuals</h3> <div class="ltx_para" id="S5.SS1.p1"> <p class="ltx_p" id="S5.SS1.p1.1">We first show evidence of the presence of speaker information in the residuals of k-means (<math alttext="\text{RVQ}_{1}" class="ltx_Math" display="inline" id="S5.SS1.p1.1.m1.1"><semantics id="S5.SS1.p1.1.m1.1a"><msub id="S5.SS1.p1.1.m1.1.1" xref="S5.SS1.p1.1.m1.1.1.cmml"><mtext id="S5.SS1.p1.1.m1.1.1.2" xref="S5.SS1.p1.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS1.p1.1.m1.1.1.3" xref="S5.SS1.p1.1.m1.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS1.p1.1.m1.1b"><apply id="S5.SS1.p1.1.m1.1.1.cmml" xref="S5.SS1.p1.1.m1.1.1"><csymbol cd="ambiguous" 
id="S5.SS1.p1.1.m1.1.1.1.cmml" xref="S5.SS1.p1.1.m1.1.1">subscript</csymbol><ci id="S5.SS1.p1.1.m1.1.1.2a.cmml" xref="S5.SS1.p1.1.m1.1.1.2"><mtext id="S5.SS1.p1.1.m1.1.1.2.cmml" xref="S5.SS1.p1.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.SS1.p1.1.m1.1.1.3.cmml" type="integer" xref="S5.SS1.p1.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS1.p1.1.m1.1c">\text{RVQ}_{1}</annotation><annotation encoding="application/x-llamapun" id="S5.SS1.p1.1.m1.1d">RVQ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math>). We randomly pick 300 utterances from 6 speakers in LibriSpeech <span class="ltx_text ltx_font_typewriter" id="S5.SS1.p1.1.1">dev-clean</span> and present the frames of the 9th HuBERT layer assigned to two sample clusters. Frames belonging to the same speaker are shown in the same color. Representations assigned to the two clusters are visualized in Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S4.F2" title="Figure 2 ‣ 4.3 Accessibility tasks ‣ 4 Experimental settings ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">2</span></a> using PCA and reveal the presence of speaker information. Similar patterns are also observed in several other k-means clusters, as noted in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib34" title="">34</a>]</cite>. This evidence indicates that the information in the residuals should be further mined, either by increasing the number of clusters or, more efficiently, by using RVQ.</p> </div> <figure class="ltx_table" id="S5.T2"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table"><span class="ltx_text ltx_font_bold" id="S5.T2.20.1.1" style="font-size:90%;">Table 2</span>: </span><span class="ltx_text" id="S5.T2.21.2" style="font-size:90%;">Results of information completeness and information accessibility. The information rate, i.e., the storage cost per frame, is also included. 
A lower MSE means the representations are closer to complete. </span></figcaption> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S5.T2.18.18" style="width:693.8pt;height:585.9pt;vertical-align:-1.3pt;"><span class="ltx_transformed_inner" style="transform:translate(81.0pt,-68.3pt) scale(1.30475009580769,1.30475009580769) ;"> <table class="ltx_tabular ltx_align_middle" id="S5.T2.18.18.18"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S5.T2.18.18.18.19.1"> <td class="ltx_td ltx_border_r ltx_border_tt" id="S5.T2.18.18.18.19.1.1" style="padding-left:10.0pt;padding-right:10.0pt;"></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="S5.T2.18.18.18.19.1.2" style="padding-left:10.0pt;padding-right:10.0pt;"><span class="ltx_text ltx_inline-block" id="S5.T2.18.18.18.19.1.2.1" style="width:0.0pt;">Information completeness</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="3" id="S5.T2.18.18.18.19.1.3" style="padding-left:10.0pt;padding-right:10.0pt;">Information accessibility</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S5.T2.18.18.18.19.1.4" style="padding-left:10.0pt;padding-right:10.0pt;">Information rate</td> </tr> <tr class="ltx_tr" id="S5.T2.6.6.6.6"> <td class="ltx_td ltx_border_r" id="S5.T2.6.6.6.6.7" style="padding-left:10.0pt;padding-right:10.0pt;"></td> <td class="ltx_td ltx_align_center" id="S5.T2.1.1.1.1.1" style="padding-left:10.0pt;padding-right:10.0pt;">MSE <math alttext="\downarrow" class="ltx_Math" display="inline" id="S5.T2.1.1.1.1.1.m1.1"><semantics id="S5.T2.1.1.1.1.1.m1.1a"><mo id="S5.T2.1.1.1.1.1.m1.1.1" stretchy="false" xref="S5.T2.1.1.1.1.1.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S5.T2.1.1.1.1.1.m1.1b"><ci id="S5.T2.1.1.1.1.1.m1.1.1.cmml" xref="S5.T2.1.1.1.1.1.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.1.1.1.1.1.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" 
id="S5.T2.1.1.1.1.1.m1.1d">↓</annotation></semantics></math> </td> <td class="ltx_td ltx_align_center" id="S5.T2.2.2.2.2.2" style="padding-left:10.0pt;padding-right:10.0pt;">SNR (dB) <math alttext="\uparrow" class="ltx_Math" display="inline" id="S5.T2.2.2.2.2.2.m1.1"><semantics id="S5.T2.2.2.2.2.2.m1.1a"><mo id="S5.T2.2.2.2.2.2.m1.1.1" stretchy="false" xref="S5.T2.2.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S5.T2.2.2.2.2.2.m1.1b"><ci id="S5.T2.2.2.2.2.2.m1.1.1.cmml" xref="S5.T2.2.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.2.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S5.T2.2.2.2.2.2.m1.1d">↑</annotation></semantics></math> </td> <td class="ltx_td ltx_align_center" id="S5.T2.3.3.3.3.3" style="padding-left:10.0pt;padding-right:10.0pt;">PER (%) <math alttext="\downarrow" class="ltx_Math" display="inline" id="S5.T2.3.3.3.3.3.m1.1"><semantics id="S5.T2.3.3.3.3.3.m1.1a"><mo id="S5.T2.3.3.3.3.3.m1.1.1" stretchy="false" xref="S5.T2.3.3.3.3.3.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S5.T2.3.3.3.3.3.m1.1b"><ci id="S5.T2.3.3.3.3.3.m1.1.1.cmml" xref="S5.T2.3.3.3.3.3.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.3.3.3.3.3.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S5.T2.3.3.3.3.3.m1.1d">↓</annotation></semantics></math> </td> <td class="ltx_td ltx_align_center" id="S5.T2.4.4.4.4.4" style="padding-left:10.0pt;padding-right:10.0pt;">RMSE (Hz) <math alttext="\downarrow" class="ltx_Math" display="inline" id="S5.T2.4.4.4.4.4.m1.1"><semantics id="S5.T2.4.4.4.4.4.m1.1a"><mo id="S5.T2.4.4.4.4.4.m1.1.1" stretchy="false" xref="S5.T2.4.4.4.4.4.m1.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S5.T2.4.4.4.4.4.m1.1b"><ci id="S5.T2.4.4.4.4.4.m1.1.1.cmml" xref="S5.T2.4.4.4.4.4.m1.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" 
id="S5.T2.4.4.4.4.4.m1.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S5.T2.4.4.4.4.4.m1.1d">↓</annotation></semantics></math> </td> <td class="ltx_td ltx_align_center" id="S5.T2.6.6.6.6.6" style="padding-left:10.0pt;padding-right:10.0pt;">EER (<math alttext="\%" class="ltx_Math" display="inline" id="S5.T2.5.5.5.5.5.m1.1"><semantics id="S5.T2.5.5.5.5.5.m1.1a"><mo id="S5.T2.5.5.5.5.5.m1.1.1" xref="S5.T2.5.5.5.5.5.m1.1.1.cmml">%</mo><annotation-xml encoding="MathML-Content" id="S5.T2.5.5.5.5.5.m1.1b"><csymbol cd="latexml" id="S5.T2.5.5.5.5.5.m1.1.1.cmml" xref="S5.T2.5.5.5.5.5.m1.1.1">percent</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.5.5.5.5.5.m1.1c">\%</annotation><annotation encoding="application/x-llamapun" id="S5.T2.5.5.5.5.5.m1.1d">%</annotation></semantics></math>) <math alttext="\downarrow" class="ltx_Math" display="inline" id="S5.T2.6.6.6.6.6.m2.1"><semantics id="S5.T2.6.6.6.6.6.m2.1a"><mo id="S5.T2.6.6.6.6.6.m2.1.1" stretchy="false" xref="S5.T2.6.6.6.6.6.m2.1.1.cmml">↓</mo><annotation-xml encoding="MathML-Content" id="S5.T2.6.6.6.6.6.m2.1b"><ci id="S5.T2.6.6.6.6.6.m2.1.1.cmml" xref="S5.T2.6.6.6.6.6.m2.1.1">↓</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.6.6.6.6.6.m2.1c">\downarrow</annotation><annotation encoding="application/x-llamapun" id="S5.T2.6.6.6.6.6.m2.1d">↓</annotation></semantics></math> </td> <td class="ltx_td ltx_align_center" id="S5.T2.6.6.6.6.8" style="padding-left:10.0pt;padding-right:10.0pt;">Per frame bits</td> </tr> <tr class="ltx_tr" id="S5.T2.8.8.8.8"> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S5.T2.8.8.8.8.3" style="padding-left:10.0pt;padding-right:10.0pt;">Log Mel</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.8.8.8.8.4" style="padding-left:10.0pt;padding-right:10.0pt;">0.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.7.7.7.7.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math 
alttext="\inf" class="ltx_Math" display="inline" id="S5.T2.7.7.7.7.1.m1.1"><semantics id="S5.T2.7.7.7.7.1.m1.1a"><mo id="S5.T2.7.7.7.7.1.m1.1.1" xref="S5.T2.7.7.7.7.1.m1.1.1.cmml">inf</mo><annotation-xml encoding="MathML-Content" id="S5.T2.7.7.7.7.1.m1.1b"><csymbol cd="latexml" id="S5.T2.7.7.7.7.1.m1.1.1.cmml" xref="S5.T2.7.7.7.7.1.m1.1.1">infimum</csymbol></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.7.7.7.7.1.m1.1c">\inf</annotation><annotation encoding="application/x-llamapun" id="S5.T2.7.7.7.7.1.m1.1d">roman_inf</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.8.8.8.8.5" style="padding-left:10.0pt;padding-right:10.0pt;">37.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.8.8.8.8.6" style="padding-left:10.0pt;padding-right:10.0pt;">38.4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.8.8.8.8.7" style="padding-left:10.0pt;padding-right:10.0pt;">13.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.8.8.8.8.2" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="32\times 80" class="ltx_Math" display="inline" id="S5.T2.8.8.8.8.2.m1.1"><semantics id="S5.T2.8.8.8.8.2.m1.1a"><mrow id="S5.T2.8.8.8.8.2.m1.1.1" xref="S5.T2.8.8.8.8.2.m1.1.1.cmml"><mn id="S5.T2.8.8.8.8.2.m1.1.1.2" xref="S5.T2.8.8.8.8.2.m1.1.1.2.cmml">32</mn><mo id="S5.T2.8.8.8.8.2.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.8.8.8.8.2.m1.1.1.1.cmml">×</mo><mn id="S5.T2.8.8.8.8.2.m1.1.1.3" xref="S5.T2.8.8.8.8.2.m1.1.1.3.cmml">80</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.8.8.8.8.2.m1.1b"><apply id="S5.T2.8.8.8.8.2.m1.1.1.cmml" xref="S5.T2.8.8.8.8.2.m1.1.1"><times id="S5.T2.8.8.8.8.2.m1.1.1.1.cmml" xref="S5.T2.8.8.8.8.2.m1.1.1.1"></times><cn id="S5.T2.8.8.8.8.2.m1.1.1.2.cmml" type="integer" xref="S5.T2.8.8.8.8.2.m1.1.1.2">32</cn><cn id="S5.T2.8.8.8.8.2.m1.1.1.3.cmml" type="integer" 
xref="S5.T2.8.8.8.8.2.m1.1.1.3">80</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.8.8.8.8.2.m1.1c">32\times 80</annotation><annotation encoding="application/x-llamapun" id="S5.T2.8.8.8.8.2.m1.1d">32 × 80</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.9.9.9.9"> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S5.T2.9.9.9.9.2" style="padding-left:10.0pt;padding-right:10.0pt;">HuBERT L4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.9.9.9.9.3" style="padding-left:10.0pt;padding-right:10.0pt;">39.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.9.9.9.9.4" style="padding-left:10.0pt;padding-right:10.0pt;">18.5</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.9.9.9.9.5" style="padding-left:10.0pt;padding-right:10.0pt;">11.6</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.9.9.9.9.6" style="padding-left:10.0pt;padding-right:10.0pt;">35.7</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.9.9.9.9.7" style="padding-left:10.0pt;padding-right:10.0pt;">4.4</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.9.9.9.9.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="32\times 768" class="ltx_Math" display="inline" id="S5.T2.9.9.9.9.1.m1.1"><semantics id="S5.T2.9.9.9.9.1.m1.1a"><mrow id="S5.T2.9.9.9.9.1.m1.1.1" xref="S5.T2.9.9.9.9.1.m1.1.1.cmml"><mn id="S5.T2.9.9.9.9.1.m1.1.1.2" xref="S5.T2.9.9.9.9.1.m1.1.1.2.cmml">32</mn><mo id="S5.T2.9.9.9.9.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.9.9.9.9.1.m1.1.1.1.cmml">×</mo><mn id="S5.T2.9.9.9.9.1.m1.1.1.3" xref="S5.T2.9.9.9.9.1.m1.1.1.3.cmml">768</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.9.9.9.9.1.m1.1b"><apply id="S5.T2.9.9.9.9.1.m1.1.1.cmml" xref="S5.T2.9.9.9.9.1.m1.1.1"><times id="S5.T2.9.9.9.9.1.m1.1.1.1.cmml" xref="S5.T2.9.9.9.9.1.m1.1.1.1"></times><cn id="S5.T2.9.9.9.9.1.m1.1.1.2.cmml" type="integer" 
xref="S5.T2.9.9.9.9.1.m1.1.1.2">32</cn><cn id="S5.T2.9.9.9.9.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.9.9.9.9.1.m1.1.1.3">768</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.9.9.9.9.1.m1.1c">32\times 768</annotation><annotation encoding="application/x-llamapun" id="S5.T2.9.9.9.9.1.m1.1d">32 × 768</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.11.11.11.11"> <td class="ltx_td ltx_align_left ltx_border_r" id="S5.T2.10.10.10.10.1" style="padding-left:10.0pt;padding-right:10.0pt;"> <math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.T2.10.10.10.10.1.m1.1"><semantics id="S5.T2.10.10.10.10.1.m1.1a"><msub id="S5.T2.10.10.10.10.1.m1.1.1" xref="S5.T2.10.10.10.10.1.m1.1.1.cmml"><mtext id="S5.T2.10.10.10.10.1.m1.1.1.2" xref="S5.T2.10.10.10.10.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.T2.10.10.10.10.1.m1.1.1.3" xref="S5.T2.10.10.10.10.1.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.T2.10.10.10.10.1.m1.1b"><apply id="S5.T2.10.10.10.10.1.m1.1.1.cmml" xref="S5.T2.10.10.10.10.1.m1.1.1"><csymbol cd="ambiguous" id="S5.T2.10.10.10.10.1.m1.1.1.1.cmml" xref="S5.T2.10.10.10.10.1.m1.1.1">subscript</csymbol><ci id="S5.T2.10.10.10.10.1.m1.1.1.2a.cmml" xref="S5.T2.10.10.10.10.1.m1.1.1.2"><mtext id="S5.T2.10.10.10.10.1.m1.1.1.2.cmml" xref="S5.T2.10.10.10.10.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.T2.10.10.10.10.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.10.10.10.10.1.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.10.10.10.10.1.m1.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.T2.10.10.10.10.1.m1.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math>(fine-tuned)</td> <td class="ltx_td ltx_align_center" id="S5.T2.11.11.11.11.3" style="padding-left:10.0pt;padding-right:10.0pt;">49.5</td> <td class="ltx_td ltx_align_center" id="S5.T2.11.11.11.11.4" 
style="padding-left:10.0pt;padding-right:10.0pt;">17.5</td> <td class="ltx_td ltx_align_center" id="S5.T2.11.11.11.11.5" style="padding-left:10.0pt;padding-right:10.0pt;">13.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.11.11.11.11.6" style="padding-left:10.0pt;padding-right:10.0pt;">49.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.11.11.11.11.7" style="padding-left:10.0pt;padding-right:10.0pt;">5.9</td> <td class="ltx_td ltx_align_center" id="S5.T2.11.11.11.11.2" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="10\times 8" class="ltx_Math" display="inline" id="S5.T2.11.11.11.11.2.m1.1"><semantics id="S5.T2.11.11.11.11.2.m1.1a"><mrow id="S5.T2.11.11.11.11.2.m1.1.1" xref="S5.T2.11.11.11.11.2.m1.1.1.cmml"><mn id="S5.T2.11.11.11.11.2.m1.1.1.2" xref="S5.T2.11.11.11.11.2.m1.1.1.2.cmml">10</mn><mo id="S5.T2.11.11.11.11.2.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.11.11.11.11.2.m1.1.1.1.cmml">×</mo><mn id="S5.T2.11.11.11.11.2.m1.1.1.3" xref="S5.T2.11.11.11.11.2.m1.1.1.3.cmml">8</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.11.11.11.11.2.m1.1b"><apply id="S5.T2.11.11.11.11.2.m1.1.1.cmml" xref="S5.T2.11.11.11.11.2.m1.1.1"><times id="S5.T2.11.11.11.11.2.m1.1.1.1.cmml" xref="S5.T2.11.11.11.11.2.m1.1.1.1"></times><cn id="S5.T2.11.11.11.11.2.m1.1.1.2.cmml" type="integer" xref="S5.T2.11.11.11.11.2.m1.1.1.2">10</cn><cn id="S5.T2.11.11.11.11.2.m1.1.1.3.cmml" type="integer" xref="S5.T2.11.11.11.11.2.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.11.11.11.11.2.m1.1c">10\times 8</annotation><annotation encoding="application/x-llamapun" id="S5.T2.11.11.11.11.2.m1.1d">10 × 8</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.13.13.13.13"> <td class="ltx_td ltx_align_left ltx_border_r" id="S5.T2.12.12.12.12.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.T2.12.12.12.12.1.m1.1"><semantics 
id="S5.T2.12.12.12.12.1.m1.1a"><msub id="S5.T2.12.12.12.12.1.m1.1.1" xref="S5.T2.12.12.12.12.1.m1.1.1.cmml"><mtext id="S5.T2.12.12.12.12.1.m1.1.1.2" xref="S5.T2.12.12.12.12.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.T2.12.12.12.12.1.m1.1.1.3" xref="S5.T2.12.12.12.12.1.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.T2.12.12.12.12.1.m1.1b"><apply id="S5.T2.12.12.12.12.1.m1.1.1.cmml" xref="S5.T2.12.12.12.12.1.m1.1.1"><csymbol cd="ambiguous" id="S5.T2.12.12.12.12.1.m1.1.1.1.cmml" xref="S5.T2.12.12.12.12.1.m1.1.1">subscript</csymbol><ci id="S5.T2.12.12.12.12.1.m1.1.1.2a.cmml" xref="S5.T2.12.12.12.12.1.m1.1.1.2"><mtext id="S5.T2.12.12.12.12.1.m1.1.1.2.cmml" xref="S5.T2.12.12.12.12.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.T2.12.12.12.12.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.12.12.12.12.1.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.12.12.12.12.1.m1.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.T2.12.12.12.12.1.m1.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S5.T2.13.13.13.13.3" style="padding-left:10.0pt;padding-right:10.0pt;">63.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.13.13.13.13.4" style="padding-left:10.0pt;padding-right:10.0pt;">16.4</td> <td class="ltx_td ltx_align_center" id="S5.T2.13.13.13.13.5" style="padding-left:10.0pt;padding-right:10.0pt;">22.5</td> <td class="ltx_td ltx_align_center" id="S5.T2.13.13.13.13.6" style="padding-left:10.0pt;padding-right:10.0pt;">60.7</td> <td class="ltx_td ltx_align_center" id="S5.T2.13.13.13.13.7" style="padding-left:10.0pt;padding-right:10.0pt;">9.9</td> <td class="ltx_td ltx_align_center" id="S5.T2.13.13.13.13.2" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="10\times 8" class="ltx_Math" display="inline" id="S5.T2.13.13.13.13.2.m1.1"><semantics id="S5.T2.13.13.13.13.2.m1.1a"><mrow 
id="S5.T2.13.13.13.13.2.m1.1.1" xref="S5.T2.13.13.13.13.2.m1.1.1.cmml"><mn id="S5.T2.13.13.13.13.2.m1.1.1.2" xref="S5.T2.13.13.13.13.2.m1.1.1.2.cmml">10</mn><mo id="S5.T2.13.13.13.13.2.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.13.13.13.13.2.m1.1.1.1.cmml">×</mo><mn id="S5.T2.13.13.13.13.2.m1.1.1.3" xref="S5.T2.13.13.13.13.2.m1.1.1.3.cmml">8</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.13.13.13.13.2.m1.1b"><apply id="S5.T2.13.13.13.13.2.m1.1.1.cmml" xref="S5.T2.13.13.13.13.2.m1.1.1"><times id="S5.T2.13.13.13.13.2.m1.1.1.1.cmml" xref="S5.T2.13.13.13.13.2.m1.1.1.1"></times><cn id="S5.T2.13.13.13.13.2.m1.1.1.2.cmml" type="integer" xref="S5.T2.13.13.13.13.2.m1.1.1.2">10</cn><cn id="S5.T2.13.13.13.13.2.m1.1.1.3.cmml" type="integer" xref="S5.T2.13.13.13.13.2.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.13.13.13.13.2.m1.1c">10\times 8</annotation><annotation encoding="application/x-llamapun" id="S5.T2.13.13.13.13.2.m1.1d">10 × 8</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.14.14.14.14"> <td class="ltx_td ltx_align_left ltx_border_r" id="S5.T2.14.14.14.14.2" style="padding-left:10.0pt;padding-right:10.0pt;">k-means</td> <td class="ltx_td ltx_align_center" id="S5.T2.14.14.14.14.3" style="padding-left:10.0pt;padding-right:10.0pt;">82.6</td> <td class="ltx_td ltx_align_center" id="S5.T2.14.14.14.14.4" style="padding-left:10.0pt;padding-right:10.0pt;">15.3</td> <td class="ltx_td ltx_align_center" id="S5.T2.14.14.14.14.5" style="padding-left:10.0pt;padding-right:10.0pt;">29.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.14.14.14.14.6" style="padding-left:10.0pt;padding-right:10.0pt;">67.1</td> <td class="ltx_td ltx_align_center" id="S5.T2.14.14.14.14.7" style="padding-left:10.0pt;padding-right:10.0pt;">18.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.14.14.14.14.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="10\times 1" class="ltx_Math" 
display="inline" id="S5.T2.14.14.14.14.1.m1.1"><semantics id="S5.T2.14.14.14.14.1.m1.1a"><mrow id="S5.T2.14.14.14.14.1.m1.1.1" xref="S5.T2.14.14.14.14.1.m1.1.1.cmml"><mn id="S5.T2.14.14.14.14.1.m1.1.1.2" xref="S5.T2.14.14.14.14.1.m1.1.1.2.cmml">10</mn><mo id="S5.T2.14.14.14.14.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.14.14.14.14.1.m1.1.1.1.cmml">×</mo><mn id="S5.T2.14.14.14.14.1.m1.1.1.3" xref="S5.T2.14.14.14.14.1.m1.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.14.14.14.14.1.m1.1b"><apply id="S5.T2.14.14.14.14.1.m1.1.1.cmml" xref="S5.T2.14.14.14.14.1.m1.1.1"><times id="S5.T2.14.14.14.14.1.m1.1.1.1.cmml" xref="S5.T2.14.14.14.14.1.m1.1.1.1"></times><cn id="S5.T2.14.14.14.14.1.m1.1.1.2.cmml" type="integer" xref="S5.T2.14.14.14.14.1.m1.1.1.2">10</cn><cn id="S5.T2.14.14.14.14.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.14.14.14.14.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.14.14.14.14.1.m1.1c">10\times 1</annotation><annotation encoding="application/x-llamapun" id="S5.T2.14.14.14.14.1.m1.1d">10 × 1</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.15.15.15.15"> <td class="ltx_td ltx_align_left ltx_border_r ltx_border_t" id="S5.T2.15.15.15.15.2" style="padding-left:10.0pt;padding-right:10.0pt;">HuBERT L9</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.15.15.15.15.3" style="padding-left:10.0pt;padding-right:10.0pt;">54.2</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.15.15.15.15.4" style="padding-left:10.0pt;padding-right:10.0pt;">17.1</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.15.15.15.15.5" style="padding-left:10.0pt;padding-right:10.0pt;">7.3</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.15.15.15.15.6" style="padding-left:10.0pt;padding-right:10.0pt;">41.0</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.15.15.15.15.7" 
style="padding-left:10.0pt;padding-right:10.0pt;">6.5</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S5.T2.15.15.15.15.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="32\times 768" class="ltx_Math" display="inline" id="S5.T2.15.15.15.15.1.m1.1"><semantics id="S5.T2.15.15.15.15.1.m1.1a"><mrow id="S5.T2.15.15.15.15.1.m1.1.1" xref="S5.T2.15.15.15.15.1.m1.1.1.cmml"><mn id="S5.T2.15.15.15.15.1.m1.1.1.2" xref="S5.T2.15.15.15.15.1.m1.1.1.2.cmml">32</mn><mo id="S5.T2.15.15.15.15.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.15.15.15.15.1.m1.1.1.1.cmml">×</mo><mn id="S5.T2.15.15.15.15.1.m1.1.1.3" xref="S5.T2.15.15.15.15.1.m1.1.1.3.cmml">768</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.15.15.15.15.1.m1.1b"><apply id="S5.T2.15.15.15.15.1.m1.1.1.cmml" xref="S5.T2.15.15.15.15.1.m1.1.1"><times id="S5.T2.15.15.15.15.1.m1.1.1.1.cmml" xref="S5.T2.15.15.15.15.1.m1.1.1.1"></times><cn id="S5.T2.15.15.15.15.1.m1.1.1.2.cmml" type="integer" xref="S5.T2.15.15.15.15.1.m1.1.1.2">32</cn><cn id="S5.T2.15.15.15.15.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.15.15.15.15.1.m1.1.1.3">768</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.15.15.15.15.1.m1.1c">32\times 768</annotation><annotation encoding="application/x-llamapun" id="S5.T2.15.15.15.15.1.m1.1d">32 × 768</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.17.17.17.17"> <td class="ltx_td ltx_align_left ltx_border_r" id="S5.T2.16.16.16.16.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.T2.16.16.16.16.1.m1.1"><semantics id="S5.T2.16.16.16.16.1.m1.1a"><msub id="S5.T2.16.16.16.16.1.m1.1.1" xref="S5.T2.16.16.16.16.1.m1.1.1.cmml"><mtext id="S5.T2.16.16.16.16.1.m1.1.1.2" xref="S5.T2.16.16.16.16.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.T2.16.16.16.16.1.m1.1.1.3" xref="S5.T2.16.16.16.16.1.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" 
id="S5.T2.16.16.16.16.1.m1.1b"><apply id="S5.T2.16.16.16.16.1.m1.1.1.cmml" xref="S5.T2.16.16.16.16.1.m1.1.1"><csymbol cd="ambiguous" id="S5.T2.16.16.16.16.1.m1.1.1.1.cmml" xref="S5.T2.16.16.16.16.1.m1.1.1">subscript</csymbol><ci id="S5.T2.16.16.16.16.1.m1.1.1.2a.cmml" xref="S5.T2.16.16.16.16.1.m1.1.1.2"><mtext id="S5.T2.16.16.16.16.1.m1.1.1.2.cmml" xref="S5.T2.16.16.16.16.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.T2.16.16.16.16.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.16.16.16.16.1.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.16.16.16.16.1.m1.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.T2.16.16.16.16.1.m1.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S5.T2.17.17.17.17.3" style="padding-left:10.0pt;padding-right:10.0pt;">75.1</td> <td class="ltx_td ltx_align_center" id="S5.T2.17.17.17.17.4" style="padding-left:10.0pt;padding-right:10.0pt;">15.7</td> <td class="ltx_td ltx_align_center" id="S5.T2.17.17.17.17.5" style="padding-left:10.0pt;padding-right:10.0pt;">14.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.17.17.17.17.6" style="padding-left:10.0pt;padding-right:10.0pt;">64.7</td> <td class="ltx_td ltx_align_center" id="S5.T2.17.17.17.17.7" style="padding-left:10.0pt;padding-right:10.0pt;">12.8</td> <td class="ltx_td ltx_align_center" id="S5.T2.17.17.17.17.2" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="10\times 8" class="ltx_Math" display="inline" id="S5.T2.17.17.17.17.2.m1.1"><semantics id="S5.T2.17.17.17.17.2.m1.1a"><mrow id="S5.T2.17.17.17.17.2.m1.1.1" xref="S5.T2.17.17.17.17.2.m1.1.1.cmml"><mn id="S5.T2.17.17.17.17.2.m1.1.1.2" xref="S5.T2.17.17.17.17.2.m1.1.1.2.cmml">10</mn><mo id="S5.T2.17.17.17.17.2.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.17.17.17.17.2.m1.1.1.1.cmml">×</mo><mn id="S5.T2.17.17.17.17.2.m1.1.1.3" 
xref="S5.T2.17.17.17.17.2.m1.1.1.3.cmml">8</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.17.17.17.17.2.m1.1b"><apply id="S5.T2.17.17.17.17.2.m1.1.1.cmml" xref="S5.T2.17.17.17.17.2.m1.1.1"><times id="S5.T2.17.17.17.17.2.m1.1.1.1.cmml" xref="S5.T2.17.17.17.17.2.m1.1.1.1"></times><cn id="S5.T2.17.17.17.17.2.m1.1.1.2.cmml" type="integer" xref="S5.T2.17.17.17.17.2.m1.1.1.2">10</cn><cn id="S5.T2.17.17.17.17.2.m1.1.1.3.cmml" type="integer" xref="S5.T2.17.17.17.17.2.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.17.17.17.17.2.m1.1c">10\times 8</annotation><annotation encoding="application/x-llamapun" id="S5.T2.17.17.17.17.2.m1.1d">10 × 8</annotation></semantics></math></td> </tr> <tr class="ltx_tr" id="S5.T2.18.18.18.18"> <td class="ltx_td ltx_align_left ltx_border_bb ltx_border_r" id="S5.T2.18.18.18.18.2" style="padding-left:10.0pt;padding-right:10.0pt;">k-means</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S5.T2.18.18.18.18.3" style="padding-left:10.0pt;padding-right:10.0pt;">91.6</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S5.T2.18.18.18.18.4" style="padding-left:10.0pt;padding-right:10.0pt;">14.8</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S5.T2.18.18.18.18.5" style="padding-left:10.0pt;padding-right:10.0pt;">23.3</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S5.T2.18.18.18.18.6" style="padding-left:10.0pt;padding-right:10.0pt;">72.9</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S5.T2.18.18.18.18.7" style="padding-left:10.0pt;padding-right:10.0pt;">21.5</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S5.T2.18.18.18.18.1" style="padding-left:10.0pt;padding-right:10.0pt;"><math alttext="10\times 1" class="ltx_Math" display="inline" id="S5.T2.18.18.18.18.1.m1.1"><semantics id="S5.T2.18.18.18.18.1.m1.1a"><mrow id="S5.T2.18.18.18.18.1.m1.1.1" xref="S5.T2.18.18.18.18.1.m1.1.1.cmml"><mn id="S5.T2.18.18.18.18.1.m1.1.1.2" 
xref="S5.T2.18.18.18.18.1.m1.1.1.2.cmml">10</mn><mo id="S5.T2.18.18.18.18.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S5.T2.18.18.18.18.1.m1.1.1.1.cmml">×</mo><mn id="S5.T2.18.18.18.18.1.m1.1.1.3" xref="S5.T2.18.18.18.18.1.m1.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.T2.18.18.18.18.1.m1.1b"><apply id="S5.T2.18.18.18.18.1.m1.1.1.cmml" xref="S5.T2.18.18.18.18.1.m1.1.1"><times id="S5.T2.18.18.18.18.1.m1.1.1.1.cmml" xref="S5.T2.18.18.18.18.1.m1.1.1.1"></times><cn id="S5.T2.18.18.18.18.1.m1.1.1.2.cmml" type="integer" xref="S5.T2.18.18.18.18.1.m1.1.1.2">10</cn><cn id="S5.T2.18.18.18.18.1.m1.1.1.3.cmml" type="integer" xref="S5.T2.18.18.18.18.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.T2.18.18.18.18.1.m1.1c">10\times 1</annotation><annotation encoding="application/x-llamapun" id="S5.T2.18.18.18.18.1.m1.1d">10 × 1</annotation></semantics></math></td> </tr> </tbody> </table> </span></div> </figure> <figure class="ltx_figure" id="S5.F3"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_figure_panel ltx_img_square" height="208" id="S5.F3.g1" src="x1.png" width="237"/></div> <div class="ltx_flex_cell ltx_flex_size_4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_figure_panel ltx_img_square" height="208" id="S5.F3.g2" src="x2.png" width="237"/></div> <div class="ltx_flex_cell ltx_flex_size_4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_figure_panel ltx_img_square" height="208" id="S5.F3.g3" src="x3.png" width="237"/></div> <div class="ltx_flex_cell ltx_flex_size_4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_figure_panel ltx_img_square" height="208" id="S5.F3.g4" src="x4.png" width="237"/></div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_1"><svg class="ltx_picture ltx_centering ltx_figure_panel" height="18.68" 
id="S5.F3.pic1" overflow="visible" version="1.1" width="181.31"><g transform="translate(0,18.68) matrix(1 0 0 -1 0 0) translate(0.83,0) translate(0,-128.46)"><g color="#916CAD" fill="#916CAD" stroke="#916CAD" stroke-width="1.2pt"><path d="M 0 137.8 L 11.81 137.8" style="fill:none"></path></g><g fill="#000000" stroke="#000000" stroke-width="0.4pt"><g color="#916CAD" fill="#916CAD" stroke="#916CAD"><path d="M 5.91 137.8" style="fill:none"></path><path d="M 3.83 135.72 h 4.15 v 4.15 h -4.15 Z"></path></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 16.7 133.07)"><foreignobject height="9.46" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="77.64"><span class="ltx_text" id="S5.F3.pic1.1.1.1.1.1.1">HuBERT L4</span></foreignobject></g><g color="#DC7168" fill="#DC7168" stroke="#DC7168" stroke-width="1.2pt"><path d="M 82.68 137.8 L 94.49 137.8" style="fill:none"></path></g><g color="#DC7168" fill="#DC7168" stroke="#DC7168"><path d="M 88.58 137.8" style="fill:none"></path><path d="M 86.51 135.72 h 4.15 v 4.15 h -4.15 Z"></path></g><g fill="#000000" stroke="#000000" transform="matrix(1.0 0.0 0.0 1.0 99.38 133.07)"><foreignobject height="9.46" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="77.64"><span class="ltx_text" id="S5.F3.pic1.2.2.2.2.1.1">HuBERT L9</span></foreignobject></g></g></g></svg></div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text ltx_font_bold" id="S5.F3.8.4.1" style="font-size:90%;">Fig. 3</span>: </span><span class="ltx_text" id="S5.F3.6.3" style="font-size:90%;">The completeness and accessibility of representations at different rates (bits per frame). 
We vary the depth of RVQ from <math alttext="L=1" class="ltx_Math" display="inline" id="S5.F3.4.1.m1.1"><semantics id="S5.F3.4.1.m1.1b"><mrow id="S5.F3.4.1.m1.1.1" xref="S5.F3.4.1.m1.1.1.cmml"><mi id="S5.F3.4.1.m1.1.1.2" xref="S5.F3.4.1.m1.1.1.2.cmml">L</mi><mo id="S5.F3.4.1.m1.1.1.1" xref="S5.F3.4.1.m1.1.1.1.cmml">=</mo><mn id="S5.F3.4.1.m1.1.1.3" xref="S5.F3.4.1.m1.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.F3.4.1.m1.1c"><apply id="S5.F3.4.1.m1.1.1.cmml" xref="S5.F3.4.1.m1.1.1"><eq id="S5.F3.4.1.m1.1.1.1.cmml" xref="S5.F3.4.1.m1.1.1.1"></eq><ci id="S5.F3.4.1.m1.1.1.2.cmml" xref="S5.F3.4.1.m1.1.1.2">𝐿</ci><cn id="S5.F3.4.1.m1.1.1.3.cmml" type="integer" xref="S5.F3.4.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.F3.4.1.m1.1d">L=1</annotation><annotation encoding="application/x-llamapun" id="S5.F3.4.1.m1.1e">italic_L = 1</annotation></semantics></math> to <math alttext="L=8" class="ltx_Math" display="inline" id="S5.F3.5.2.m2.1"><semantics id="S5.F3.5.2.m2.1b"><mrow id="S5.F3.5.2.m2.1.1" xref="S5.F3.5.2.m2.1.1.cmml"><mi id="S5.F3.5.2.m2.1.1.2" xref="S5.F3.5.2.m2.1.1.2.cmml">L</mi><mo id="S5.F3.5.2.m2.1.1.1" xref="S5.F3.5.2.m2.1.1.1.cmml">=</mo><mn id="S5.F3.5.2.m2.1.1.3" xref="S5.F3.5.2.m2.1.1.3.cmml">8</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.F3.5.2.m2.1c"><apply id="S5.F3.5.2.m2.1.1.cmml" xref="S5.F3.5.2.m2.1.1"><eq id="S5.F3.5.2.m2.1.1.1.cmml" xref="S5.F3.5.2.m2.1.1.1"></eq><ci id="S5.F3.5.2.m2.1.1.2.cmml" xref="S5.F3.5.2.m2.1.1.2">𝐿</ci><cn id="S5.F3.5.2.m2.1.1.3.cmml" type="integer" xref="S5.F3.5.2.m2.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.F3.5.2.m2.1d">L=8</annotation><annotation encoding="application/x-llamapun" id="S5.F3.5.2.m2.1e">italic_L = 8</annotation></semantics></math>. 
Representations are quantized at a cost of 10 bits per codebook, corresponding to a codebook size of <math alttext="N=1024" class="ltx_Math" display="inline" id="S5.F3.6.3.m3.1"><semantics id="S5.F3.6.3.m3.1b"><mrow id="S5.F3.6.3.m3.1.1" xref="S5.F3.6.3.m3.1.1.cmml"><mi id="S5.F3.6.3.m3.1.1.2" xref="S5.F3.6.3.m3.1.1.2.cmml">N</mi><mo id="S5.F3.6.3.m3.1.1.1" xref="S5.F3.6.3.m3.1.1.1.cmml">=</mo><mn id="S5.F3.6.3.m3.1.1.3" xref="S5.F3.6.3.m3.1.1.3.cmml">1024</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.F3.6.3.m3.1c"><apply id="S5.F3.6.3.m3.1.1.cmml" xref="S5.F3.6.3.m3.1.1"><eq id="S5.F3.6.3.m3.1.1.1.cmml" xref="S5.F3.6.3.m3.1.1.1"></eq><ci id="S5.F3.6.3.m3.1.1.2.cmml" xref="S5.F3.6.3.m3.1.1.2">𝑁</ci><cn id="S5.F3.6.3.m3.1.1.3.cmml" type="integer" xref="S5.F3.6.3.m3.1.1.3">1024</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.F3.6.3.m3.1d">N=1024</annotation><annotation encoding="application/x-llamapun" id="S5.F3.6.3.m3.1e">italic_N = 1024</annotation></semantics></math>. Codebooks are not fine-tuned.</span></figcaption> </figure> </section> <section class="ltx_subsection" id="S5.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.2 </span>Information disentanglement?</h3> <div class="ltx_para" id="S5.SS2.p1"> <p class="ltx_p" id="S5.SS2.p1.2">Previous work has claimed that self-supervised representations and their discrete units after k-means exhibit disentanglement properties <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib9" title="">9</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite>.
To test this claim, we evaluate the original representations, including the 4th HuBERT layer (HuBERT L4) and the 9th HuBERT layer (HuBERT L9), their k-means (<math alttext="\text{RVQ}_{1}" class="ltx_Math" display="inline" id="S5.SS2.p1.1.m1.1"><semantics id="S5.SS2.p1.1.m1.1a"><msub id="S5.SS2.p1.1.m1.1.1" xref="S5.SS2.p1.1.m1.1.1.cmml"><mtext id="S5.SS2.p1.1.m1.1.1.2" xref="S5.SS2.p1.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS2.p1.1.m1.1.1.3" xref="S5.SS2.p1.1.m1.1.1.3.cmml">1</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS2.p1.1.m1.1b"><apply id="S5.SS2.p1.1.m1.1.1.cmml" xref="S5.SS2.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S5.SS2.p1.1.m1.1.1.1.cmml" xref="S5.SS2.p1.1.m1.1.1">subscript</csymbol><ci id="S5.SS2.p1.1.m1.1.1.2a.cmml" xref="S5.SS2.p1.1.m1.1.1.2"><mtext id="S5.SS2.p1.1.m1.1.1.2.cmml" xref="S5.SS2.p1.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.SS2.p1.1.m1.1.1.3.cmml" type="integer" xref="S5.SS2.p1.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p1.1.m1.1c">\text{RVQ}_{1}</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p1.1.m1.1d">RVQ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT</annotation></semantics></math>) units, and their residuals (<math alttext="R-\hat{R}" class="ltx_Math" display="inline" id="S5.SS2.p1.2.m2.1"><semantics id="S5.SS2.p1.2.m2.1a"><mrow id="S5.SS2.p1.2.m2.1.1" xref="S5.SS2.p1.2.m2.1.1.cmml"><mi id="S5.SS2.p1.2.m2.1.1.2" xref="S5.SS2.p1.2.m2.1.1.2.cmml">R</mi><mo id="S5.SS2.p1.2.m2.1.1.1" xref="S5.SS2.p1.2.m2.1.1.1.cmml">−</mo><mover accent="true" id="S5.SS2.p1.2.m2.1.1.3" xref="S5.SS2.p1.2.m2.1.1.3.cmml"><mi id="S5.SS2.p1.2.m2.1.1.3.2" xref="S5.SS2.p1.2.m2.1.1.3.2.cmml">R</mi><mo id="S5.SS2.p1.2.m2.1.1.3.1" xref="S5.SS2.p1.2.m2.1.1.3.1.cmml">^</mo></mover></mrow><annotation-xml encoding="MathML-Content" id="S5.SS2.p1.2.m2.1b"><apply id="S5.SS2.p1.2.m2.1.1.cmml" xref="S5.SS2.p1.2.m2.1.1"><minus id="S5.SS2.p1.2.m2.1.1.1.cmml" xref="S5.SS2.p1.2.m2.1.1.1"></minus><ci
id="S5.SS2.p1.2.m2.1.1.2.cmml" xref="S5.SS2.p1.2.m2.1.1.2">𝑅</ci><apply id="S5.SS2.p1.2.m2.1.1.3.cmml" xref="S5.SS2.p1.2.m2.1.1.3"><ci id="S5.SS2.p1.2.m2.1.1.3.1.cmml" xref="S5.SS2.p1.2.m2.1.1.3.1">^</ci><ci id="S5.SS2.p1.2.m2.1.1.3.2.cmml" xref="S5.SS2.p1.2.m2.1.1.3.2">𝑅</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS2.p1.2.m2.1c">R-\hat{R}</annotation><annotation encoding="application/x-llamapun" id="S5.SS2.p1.2.m2.1d">italic_R - over^ start_ARG italic_R end_ARG</annotation></semantics></math>) after k-means. We emphasize that a gap between the original representations and the residuals does not by itself imply information loss after quantization, because we cannot tell how tight the lower bound on mutual information is. Nonetheless, we can verify whether the information is present or even disentangled.</p> </div> <div class="ltx_para" id="S5.SS2.p2"> <p class="ltx_p" id="S5.SS2.p2.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S4.SS3" title="4.3 Accessibility tasks ‣ 4 Experimental settings ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">4.3</span></a> reports the results on the accessibility tasks. We first note that speaker and phonetic information is sufficiently present in HuBERT discrete units on both layers. The strong performance on the residuals indicates that information remains present after vector quantization. We observe little disentanglement of the speech properties, and instead a likely loss of information in general. Although information loss is difficult to establish theoretically, we find it hard to recover performance even with stronger probes.
We also observe that pitch is less accessible with k-means units of HuBERT L9.</p> </div> <figure class="ltx_figure" id="S5.F4"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_1"> <div class="ltx_inline-block ltx_figure_panel ltx_align_center ltx_transformed_outer" id="S5.F4.2" style="width:657.1pt;height:593.9pt;vertical-align:-593.9pt;"><span class="ltx_transformed_inner" style="transform:translate(0.0pt,0.0pt) scale(1,1) ;"> <div class="ltx_inline-block ltx_transformed_outer" id="S5.F4.2.1" style="width:742.5pt;height:671.1pt;vertical-align:-671.1pt;"><span class="ltx_transformed_inner" style="transform:translate(42.7pt,0.0pt) scale(1.13,1.13) ;"> <p class="ltx_p" id="S5.F4.2.1.1"><span class="ltx_text" id="S5.F4.2.1.1.1"></span></p> </span></div> </span></div> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S5.F4.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="498" id="S5.F4.sf1.g1" src="x5.png" width="830"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S5.F4.sf1.2.1.1" style="font-size:90%;">(a)</span> </span><span class="ltx_text" id="S5.F4.sf1.3.2" style="font-size:90%;">k-means</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S5.F4.sf2"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="498" id="S5.F4.sf2.g1" src="x6.png" width="830"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S5.F4.sf2.4.1.1" style="font-size:90%;">(b)</span> </span><math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.F4.sf2.2.m1.1"><semantics id="S5.F4.sf2.2.m1.1b"><msub id="S5.F4.sf2.2.m1.1.1" xref="S5.F4.sf2.2.m1.1.1.cmml"><mtext id="S5.F4.sf2.2.m1.1.1.2" mathsize="90%" 
xref="S5.F4.sf2.2.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.F4.sf2.2.m1.1.1.3" mathsize="90%" xref="S5.F4.sf2.2.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.F4.sf2.2.m1.1c"><apply id="S5.F4.sf2.2.m1.1.1.cmml" xref="S5.F4.sf2.2.m1.1.1"><csymbol cd="ambiguous" id="S5.F4.sf2.2.m1.1.1.1.cmml" xref="S5.F4.sf2.2.m1.1.1">subscript</csymbol><ci id="S5.F4.sf2.2.m1.1.1.2a.cmml" xref="S5.F4.sf2.2.m1.1.1.2"><mtext id="S5.F4.sf2.2.m1.1.1.2.cmml" mathsize="90%" xref="S5.F4.sf2.2.m1.1.1.2">RVQ</mtext></ci><cn id="S5.F4.sf2.2.m1.1.1.3.cmml" type="integer" xref="S5.F4.sf2.2.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.F4.sf2.2.m1.1d">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.F4.sf2.2.m1.1e">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math></figcaption> </figure> </div> <div class="ltx_flex_break"></div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S5.F4.sf3"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="498" id="S5.F4.sf3.g1" src="x7.png" width="830"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S5.F4.sf3.4.1.1" style="font-size:90%;">(c)</span> </span><math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.F4.sf3.2.m1.1"><semantics id="S5.F4.sf3.2.m1.1b"><msub id="S5.F4.sf3.2.m1.1.1" xref="S5.F4.sf3.2.m1.1.1.cmml"><mtext id="S5.F4.sf3.2.m1.1.1.2" mathsize="90%" xref="S5.F4.sf3.2.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.F4.sf3.2.m1.1.1.3" mathsize="90%" xref="S5.F4.sf3.2.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.F4.sf3.2.m1.1c"><apply id="S5.F4.sf3.2.m1.1.1.cmml" xref="S5.F4.sf3.2.m1.1.1"><csymbol cd="ambiguous" id="S5.F4.sf3.2.m1.1.1.1.cmml" xref="S5.F4.sf3.2.m1.1.1">subscript</csymbol><ci id="S5.F4.sf3.2.m1.1.1.2a.cmml" xref="S5.F4.sf3.2.m1.1.1.2"><mtext 
id="S5.F4.sf3.2.m1.1.1.2.cmml" mathsize="90%" xref="S5.F4.sf3.2.m1.1.1.2">RVQ</mtext></ci><cn id="S5.F4.sf3.2.m1.1.1.3.cmml" type="integer" xref="S5.F4.sf3.2.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.F4.sf3.2.m1.1d">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.F4.sf3.2.m1.1e">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math><span class="ltx_text" id="S5.F4.sf3.5.2" style="font-size:90%;"> (fine-tuned)</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S5.F4.sf4"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="498" id="S5.F4.sf4.g1" src="x8.png" width="830"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text" id="S5.F4.sf4.2.1.1" style="font-size:90%;">(d)</span> </span><span class="ltx_text" id="S5.F4.sf4.3.2" style="font-size:90%;">HuBERT L4</span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text ltx_font_bold" id="S5.F4.3.1.1" style="font-size:90%;">Fig. 4</span>: </span><span class="ltx_text" id="S5.F4.4.2" style="font-size:90%;">An example of the reconstructed log Mels with HuBERT L4 representations and their discrete units. The distortion (MSE) decreases from left to right. Details over 20 Mel bands are better captured in (c) and (d). The ground truth is shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.F5" title="Figure 5 ‣ 5.2 Information disentanglement? ‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">5</span></a>. 
</span></figcaption> </figure> <figure class="ltx_figure" id="S5.F5"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_middle" id="S5.F5.1" style="width:260.2pt;"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="168" id="S5.F5.1.g1" src="x9.png" width="279"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure"><span class="ltx_text ltx_font_bold" id="S5.F5.1.1.1.1" style="font-size:90%;">Fig. 5</span>: </span><span class="ltx_text" id="S5.F5.1.2.2" style="font-size:90%;">The ground truth utterance for Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.F4" title="Figure 4 ‣ 5.2 Information disentanglement? ‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">4</span></a>.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_minipage ltx_align_middle" id="S5.F5.fig1" style="width:173.4pt;"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_table ltx_figure_panel" id="S5.T3"> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_table"><span class="ltx_text ltx_font_bold" id="S5.T3.2.1.1" style="font-size:90%;">Table 3</span>: </span><span class="ltx_text" id="S5.T3.3.2" style="font-size:90%;">The completeness of the 4th, 9th and 12th HuBERT layers.</span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <table class="ltx_tabular ltx_centering ltx_figure_panel ltx_guessed_headers ltx_align_middle" id="S5.F5.fig1.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S5.F5.fig1.1.1.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_tt" id="S5.F5.fig1.1.1.1.1">HuBERT</th> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_border_tt" id="S5.F5.fig1.1.1.1.2">MSE</th> </tr> 
</thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S5.F5.fig1.1.2.1"> <td class="ltx_td ltx_align_left ltx_border_t" id="S5.F5.fig1.1.2.1.1">L4</td> <td class="ltx_td ltx_align_left ltx_border_t" id="S5.F5.fig1.1.2.1.2">39.6</td> </tr> <tr class="ltx_tr" id="S5.F5.fig1.1.3.2"> <td class="ltx_td ltx_align_left" id="S5.F5.fig1.1.3.2.1">L9</td> <td class="ltx_td ltx_align_left" id="S5.F5.fig1.1.3.2.2">54.2</td> </tr> <tr class="ltx_tr" id="S5.F5.fig1.1.4.3"> <td class="ltx_td ltx_align_left ltx_border_bb" id="S5.F5.fig1.1.4.3.1">L12</td> <td class="ltx_td ltx_align_left ltx_border_bb" id="S5.F5.fig1.1.4.3.2">52.8</td> </tr> </tbody> </table> </div> </div> </figure> </div> </div> </figure> </section> <section class="ltx_subsection" id="S5.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.3 </span>Information completeness and accessibility</h3> <div class="ltx_para" id="S5.SS3.p1"> <p class="ltx_p" id="S5.SS3.p1.1">The previous sections showed that the residuals contain much information that should not be discarded. We conduct RVQ with up to 8 codebooks to capture the information in the residuals. Codebooks in RVQ are optimized with iterative k-means and are held fixed unless otherwise stated. 
The number of bits used to encode a single frame is <math alttext="L{\log_{2}}N=L\times 10" class="ltx_Math" display="inline" id="S5.SS3.p1.1.m1.1"><semantics id="S5.SS3.p1.1.m1.1a"><mrow id="S5.SS3.p1.1.m1.1.1" xref="S5.SS3.p1.1.m1.1.1.cmml"><mrow id="S5.SS3.p1.1.m1.1.1.2" xref="S5.SS3.p1.1.m1.1.1.2.cmml"><mi id="S5.SS3.p1.1.m1.1.1.2.2" xref="S5.SS3.p1.1.m1.1.1.2.2.cmml">L</mi><mo id="S5.SS3.p1.1.m1.1.1.2.1" lspace="0.167em" xref="S5.SS3.p1.1.m1.1.1.2.1.cmml">⁢</mo><mrow id="S5.SS3.p1.1.m1.1.1.2.3" xref="S5.SS3.p1.1.m1.1.1.2.3.cmml"><msub id="S5.SS3.p1.1.m1.1.1.2.3.1" xref="S5.SS3.p1.1.m1.1.1.2.3.1.cmml"><mi id="S5.SS3.p1.1.m1.1.1.2.3.1.2" xref="S5.SS3.p1.1.m1.1.1.2.3.1.2.cmml">log</mi><mn id="S5.SS3.p1.1.m1.1.1.2.3.1.3" xref="S5.SS3.p1.1.m1.1.1.2.3.1.3.cmml">2</mn></msub><mo id="S5.SS3.p1.1.m1.1.1.2.3a" lspace="0.167em" xref="S5.SS3.p1.1.m1.1.1.2.3.cmml">⁡</mo><mi id="S5.SS3.p1.1.m1.1.1.2.3.2" xref="S5.SS3.p1.1.m1.1.1.2.3.2.cmml">N</mi></mrow></mrow><mo id="S5.SS3.p1.1.m1.1.1.1" xref="S5.SS3.p1.1.m1.1.1.1.cmml">=</mo><mrow id="S5.SS3.p1.1.m1.1.1.3" xref="S5.SS3.p1.1.m1.1.1.3.cmml"><mi id="S5.SS3.p1.1.m1.1.1.3.2" xref="S5.SS3.p1.1.m1.1.1.3.2.cmml">L</mi><mo id="S5.SS3.p1.1.m1.1.1.3.1" lspace="0.222em" rspace="0.222em" xref="S5.SS3.p1.1.m1.1.1.3.1.cmml">×</mo><mn id="S5.SS3.p1.1.m1.1.1.3.3" xref="S5.SS3.p1.1.m1.1.1.3.3.cmml">10</mn></mrow></mrow><annotation-xml encoding="MathML-Content" id="S5.SS3.p1.1.m1.1b"><apply id="S5.SS3.p1.1.m1.1.1.cmml" xref="S5.SS3.p1.1.m1.1.1"><eq id="S5.SS3.p1.1.m1.1.1.1.cmml" xref="S5.SS3.p1.1.m1.1.1.1"></eq><apply id="S5.SS3.p1.1.m1.1.1.2.cmml" xref="S5.SS3.p1.1.m1.1.1.2"><times id="S5.SS3.p1.1.m1.1.1.2.1.cmml" xref="S5.SS3.p1.1.m1.1.1.2.1"></times><ci id="S5.SS3.p1.1.m1.1.1.2.2.cmml" xref="S5.SS3.p1.1.m1.1.1.2.2">𝐿</ci><apply id="S5.SS3.p1.1.m1.1.1.2.3.cmml" xref="S5.SS3.p1.1.m1.1.1.2.3"><apply id="S5.SS3.p1.1.m1.1.1.2.3.1.cmml" xref="S5.SS3.p1.1.m1.1.1.2.3.1"><csymbol cd="ambiguous" id="S5.SS3.p1.1.m1.1.1.2.3.1.1.cmml" 
xref="S5.SS3.p1.1.m1.1.1.2.3.1">subscript</csymbol><log id="S5.SS3.p1.1.m1.1.1.2.3.1.2.cmml" xref="S5.SS3.p1.1.m1.1.1.2.3.1.2"></log><cn id="S5.SS3.p1.1.m1.1.1.2.3.1.3.cmml" type="integer" xref="S5.SS3.p1.1.m1.1.1.2.3.1.3">2</cn></apply><ci id="S5.SS3.p1.1.m1.1.1.2.3.2.cmml" xref="S5.SS3.p1.1.m1.1.1.2.3.2">𝑁</ci></apply></apply><apply id="S5.SS3.p1.1.m1.1.1.3.cmml" xref="S5.SS3.p1.1.m1.1.1.3"><times id="S5.SS3.p1.1.m1.1.1.3.1.cmml" xref="S5.SS3.p1.1.m1.1.1.3.1"></times><ci id="S5.SS3.p1.1.m1.1.1.3.2.cmml" xref="S5.SS3.p1.1.m1.1.1.3.2">𝐿</ci><cn id="S5.SS3.p1.1.m1.1.1.3.3.cmml" type="integer" xref="S5.SS3.p1.1.m1.1.1.3.3">10</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS3.p1.1.m1.1c">L{\log_{2}}N=L\times 10</annotation><annotation encoding="application/x-llamapun" id="S5.SS3.p1.1.m1.1d">italic_L roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_N = italic_L × 10</annotation></semantics></math>. Here, we explore how complete and accessible the information encoded in the discrete speech units is.</p> </div> <div class="ltx_para" id="S5.SS3.p2"> <p class="ltx_p" id="S5.SS3.p2.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.T2" title="Table 2 ‣ 5.1 Information in the residuals ‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">2</span></a> summarizes the completeness and accessibility of representations before and after vector quantization. Besides MSE, our completeness objective, we report the signal-to-noise ratio (SNR) in dB to give intuition about the reconstruction quality. Rate is the number of bits per frame to be stored or transmitted in speech coding. We provide log Mels as the upper bound of completeness. 
Although log Mels are the most complete baseline, the phonetic information they encode is far less accessible in phone classification than that in HuBERT representations, and even than that in their discrete units. Compared to HuBERT L9, L4 is closer to complete and performs better in pitch estimation and speaker verification. On the other hand, HuBERT L9 exhibits higher phone accessibility, outperforming log Mels at a rate of 10 bits.</p> </div> <div class="ltx_para" id="S5.SS3.p3"> <p class="ltx_p" id="S5.SS3.p3.1">The results provide a detailed assessment of speech representations and discrete speech units. For example, HuBERT L4 is preferable to L9 for voice conversion <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib26" title="">26</a>]</cite>, speech codecs <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib17" title="">17</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib46" title="">46</a>]</cite>, and discrete units for speech language modeling <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib10" title="">10</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib18" title="">18</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib19" title="">19</a>]</cite>. The lower bound of mutual information can also be used to quantify the redundancy between two signals <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib28" title="">28</a>]</cite>. 
Based on our results, the claim made in <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib21" title="">21</a>]</cite> about the significant redundancy between HuBERT units and speech properties is less a matter of whether the units are semantic or disentangled and more likely a consequence of information loss or a lack of model capacity. In fact, HuBERT units adequately capture information in acoustic features.</p> </div> </section> <section class="ltx_subsection" id="S5.SS4"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.4 </span>Fine-tuning RVQ on the lower bound of MI</h3> <div class="ltx_para" id="S5.SS4.p1"> <p class="ltx_p" id="S5.SS4.p1.4">The proposed lower bound can be used not only to measure information completeness but also to improve the learned discrete units. We experiment with fine-tuning the codebooks of <math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.SS4.p1.1.m1.1"><semantics id="S5.SS4.p1.1.m1.1a"><msub id="S5.SS4.p1.1.m1.1.1" xref="S5.SS4.p1.1.m1.1.1.cmml"><mtext id="S5.SS4.p1.1.m1.1.1.2" xref="S5.SS4.p1.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS4.p1.1.m1.1.1.3" xref="S5.SS4.p1.1.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS4.p1.1.m1.1b"><apply id="S5.SS4.p1.1.m1.1.1.cmml" xref="S5.SS4.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S5.SS4.p1.1.m1.1.1.1.cmml" xref="S5.SS4.p1.1.m1.1.1">subscript</csymbol><ci id="S5.SS4.p1.1.m1.1.1.2a.cmml" xref="S5.SS4.p1.1.m1.1.1.2"><mtext id="S5.SS4.p1.1.m1.1.1.2.cmml" xref="S5.SS4.p1.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.SS4.p1.1.m1.1.1.3.cmml" type="integer" xref="S5.SS4.p1.1.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS4.p1.1.m1.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.SS4.p1.1.m1.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math> by maximizing the lower bound (<a 
class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S2.E4" title="In 2.2 Completeness as mutual information ‣ 2 Methods ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">4</span></a>) with convolutional networks <math alttext="f" class="ltx_Math" display="inline" id="S5.SS4.p1.2.m2.1"><semantics id="S5.SS4.p1.2.m2.1a"><mi id="S5.SS4.p1.2.m2.1.1" xref="S5.SS4.p1.2.m2.1.1.cmml">f</mi><annotation-xml encoding="MathML-Content" id="S5.SS4.p1.2.m2.1b"><ci id="S5.SS4.p1.2.m2.1.1.cmml" xref="S5.SS4.p1.2.m2.1.1">𝑓</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS4.p1.2.m2.1c">f</annotation><annotation encoding="application/x-llamapun" id="S5.SS4.p1.2.m2.1d">italic_f</annotation></semantics></math>; we denote the result as <math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.SS4.p1.3.m3.1"><semantics id="S5.SS4.p1.3.m3.1a"><msub id="S5.SS4.p1.3.m3.1.1" xref="S5.SS4.p1.3.m3.1.1.cmml"><mtext id="S5.SS4.p1.3.m3.1.1.2" xref="S5.SS4.p1.3.m3.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS4.p1.3.m3.1.1.3" xref="S5.SS4.p1.3.m3.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS4.p1.3.m3.1b"><apply id="S5.SS4.p1.3.m3.1.1.cmml" xref="S5.SS4.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S5.SS4.p1.3.m3.1.1.1.cmml" xref="S5.SS4.p1.3.m3.1.1">subscript</csymbol><ci id="S5.SS4.p1.3.m3.1.1.2a.cmml" xref="S5.SS4.p1.3.m3.1.1.2"><mtext id="S5.SS4.p1.3.m3.1.1.2.cmml" xref="S5.SS4.p1.3.m3.1.1.2">RVQ</mtext></ci><cn id="S5.SS4.p1.3.m3.1.1.3.cmml" type="integer" xref="S5.SS4.p1.3.m3.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS4.p1.3.m3.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.SS4.p1.3.m3.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math> (fine-tuned). We fine-tune the codebooks only once with log Mels. 
Unlike SoundStream <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib17" title="">17</a>]</cite>, which updates codebooks with an exponential moving average, we simply use Gumbel-Softmax with a constant temperature of 1 <cite class="ltx_cite ltx_citemacro_cite">[<a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib47" title="">47</a>, <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#bib.bib48" title="">48</a>]</cite>. Quantizer dropout is not applied. We find that fine-tuning the codebooks improves both completeness and accessibility on all tasks for <math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.SS4.p1.4.m4.1"><semantics id="S5.SS4.p1.4.m4.1a"><msub id="S5.SS4.p1.4.m4.1.1" xref="S5.SS4.p1.4.m4.1.1.cmml"><mtext id="S5.SS4.p1.4.m4.1.1.2" xref="S5.SS4.p1.4.m4.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS4.p1.4.m4.1.1.3" xref="S5.SS4.p1.4.m4.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS4.p1.4.m4.1b"><apply id="S5.SS4.p1.4.m4.1.1.cmml" xref="S5.SS4.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S5.SS4.p1.4.m4.1.1.1.cmml" xref="S5.SS4.p1.4.m4.1.1">subscript</csymbol><ci id="S5.SS4.p1.4.m4.1.1.2a.cmml" xref="S5.SS4.p1.4.m4.1.1.2"><mtext id="S5.SS4.p1.4.m4.1.1.2.cmml" xref="S5.SS4.p1.4.m4.1.1.2">RVQ</mtext></ci><cn id="S5.SS4.p1.4.m4.1.1.3.cmml" type="integer" xref="S5.SS4.p1.4.m4.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS4.p1.4.m4.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.SS4.p1.4.m4.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math> (fine-tuned). 
Moreover, it outperforms HuBERT L9 in completeness and speaker verification with 80 bits of storage.</p> </div> </section> <section class="ltx_subsection" id="S5.SS5"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.5 </span>Rate-distortion and rate-accessibility</h3> <div class="ltx_para" id="S5.SS5.p1"> <p class="ltx_p" id="S5.SS5.p1.3">We carry out experiments to study the effect of RVQ depth on information completeness and accessibility, showing the importance of mining the residuals and the trade-off between compression rate and performance. Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.F3" title="Figure 3 ‣ 5.1 Information in the residuals ‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">3</span></a> shows the rate-distortion and rate-accessibility curves from 10 bits (<math alttext="L=1" class="ltx_Math" display="inline" id="S5.SS5.p1.1.m1.1"><semantics id="S5.SS5.p1.1.m1.1a"><mrow id="S5.SS5.p1.1.m1.1.1" xref="S5.SS5.p1.1.m1.1.1.cmml"><mi id="S5.SS5.p1.1.m1.1.1.2" xref="S5.SS5.p1.1.m1.1.1.2.cmml">L</mi><mo id="S5.SS5.p1.1.m1.1.1.1" xref="S5.SS5.p1.1.m1.1.1.1.cmml">=</mo><mn id="S5.SS5.p1.1.m1.1.1.3" xref="S5.SS5.p1.1.m1.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.SS5.p1.1.m1.1b"><apply id="S5.SS5.p1.1.m1.1.1.cmml" xref="S5.SS5.p1.1.m1.1.1"><eq id="S5.SS5.p1.1.m1.1.1.1.cmml" xref="S5.SS5.p1.1.m1.1.1.1"></eq><ci id="S5.SS5.p1.1.m1.1.1.2.cmml" xref="S5.SS5.p1.1.m1.1.1.2">𝐿</ci><cn id="S5.SS5.p1.1.m1.1.1.3.cmml" type="integer" xref="S5.SS5.p1.1.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS5.p1.1.m1.1c">L=1</annotation><annotation encoding="application/x-llamapun" id="S5.SS5.p1.1.m1.1d">italic_L = 1</annotation></semantics></math>) to 80 bits (<math alttext="L=8" class="ltx_Math" display="inline" id="S5.SS5.p1.2.m2.1"><semantics 
id="S5.SS5.p1.2.m2.1a"><mrow id="S5.SS5.p1.2.m2.1.1" xref="S5.SS5.p1.2.m2.1.1.cmml"><mi id="S5.SS5.p1.2.m2.1.1.2" xref="S5.SS5.p1.2.m2.1.1.2.cmml">L</mi><mo id="S5.SS5.p1.2.m2.1.1.1" xref="S5.SS5.p1.2.m2.1.1.1.cmml">=</mo><mn id="S5.SS5.p1.2.m2.1.1.3" xref="S5.SS5.p1.2.m2.1.1.3.cmml">8</mn></mrow><annotation-xml encoding="MathML-Content" id="S5.SS5.p1.2.m2.1b"><apply id="S5.SS5.p1.2.m2.1.1.cmml" xref="S5.SS5.p1.2.m2.1.1"><eq id="S5.SS5.p1.2.m2.1.1.1.cmml" xref="S5.SS5.p1.2.m2.1.1.1"></eq><ci id="S5.SS5.p1.2.m2.1.1.2.cmml" xref="S5.SS5.p1.2.m2.1.1.2">𝐿</ci><cn id="S5.SS5.p1.2.m2.1.1.3.cmml" type="integer" xref="S5.SS5.p1.2.m2.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS5.p1.2.m2.1c">L=8</annotation><annotation encoding="application/x-llamapun" id="S5.SS5.p1.2.m2.1d">italic_L = 8</annotation></semantics></math>). As expected, increasing <math alttext="L" class="ltx_Math" display="inline" id="S5.SS5.p1.3.m3.1"><semantics id="S5.SS5.p1.3.m3.1a"><mi id="S5.SS5.p1.3.m3.1.1" xref="S5.SS5.p1.3.m3.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="S5.SS5.p1.3.m3.1b"><ci id="S5.SS5.p1.3.m3.1.1.cmml" xref="S5.SS5.p1.3.m3.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="S5.SS5.p1.3.m3.1c">L</annotation><annotation encoding="application/x-llamapun" id="S5.SS5.p1.3.m3.1d">italic_L</annotation></semantics></math> generally improves information completeness and accessibility. The variation in pitch estimation is relatively large for HuBERT L9 below 60 bits. The trade-off between rate and distortion is important for deciding the information processing capabilities of representations for different applications.</p> </div> <div class="ltx_para" id="S5.SS5.p2"> <p class="ltx_p" id="S5.SS5.p2.2">The distortion is reflected in the predicted log Mels, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.F4" title="Figure 4 ‣ 5.2 Information disentanglement? 
‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">4</span></a>. Discrete units with k-means struggle to capture the first two harmonics, while <math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.SS5.p2.1.m1.1"><semantics id="S5.SS5.p2.1.m1.1a"><msub id="S5.SS5.p2.1.m1.1.1" xref="S5.SS5.p2.1.m1.1.1.cmml"><mtext id="S5.SS5.p2.1.m1.1.1.2" xref="S5.SS5.p2.1.m1.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS5.p2.1.m1.1.1.3" xref="S5.SS5.p2.1.m1.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS5.p2.1.m1.1b"><apply id="S5.SS5.p2.1.m1.1.1.cmml" xref="S5.SS5.p2.1.m1.1.1"><csymbol cd="ambiguous" id="S5.SS5.p2.1.m1.1.1.1.cmml" xref="S5.SS5.p2.1.m1.1.1">subscript</csymbol><ci id="S5.SS5.p2.1.m1.1.1.2a.cmml" xref="S5.SS5.p2.1.m1.1.1.2"><mtext id="S5.SS5.p2.1.m1.1.1.2.cmml" xref="S5.SS5.p2.1.m1.1.1.2">RVQ</mtext></ci><cn id="S5.SS5.p2.1.m1.1.1.3.cmml" type="integer" xref="S5.SS5.p2.1.m1.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS5.p2.1.m1.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.SS5.p2.1.m1.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math> starts to capture the rises and falls of the first three harmonics. 
By fine-tuning codebooks on the lower bound of completeness, <math alttext="\text{RVQ}_{8}" class="ltx_Math" display="inline" id="S5.SS5.p2.2.m2.1"><semantics id="S5.SS5.p2.2.m2.1a"><msub id="S5.SS5.p2.2.m2.1.1" xref="S5.SS5.p2.2.m2.1.1.cmml"><mtext id="S5.SS5.p2.2.m2.1.1.2" xref="S5.SS5.p2.2.m2.1.1.2a.cmml">RVQ</mtext><mn id="S5.SS5.p2.2.m2.1.1.3" xref="S5.SS5.p2.2.m2.1.1.3.cmml">8</mn></msub><annotation-xml encoding="MathML-Content" id="S5.SS5.p2.2.m2.1b"><apply id="S5.SS5.p2.2.m2.1.1.cmml" xref="S5.SS5.p2.2.m2.1.1"><csymbol cd="ambiguous" id="S5.SS5.p2.2.m2.1.1.1.cmml" xref="S5.SS5.p2.2.m2.1.1">subscript</csymbol><ci id="S5.SS5.p2.2.m2.1.1.2a.cmml" xref="S5.SS5.p2.2.m2.1.1.2"><mtext id="S5.SS5.p2.2.m2.1.1.2.cmml" xref="S5.SS5.p2.2.m2.1.1.2">RVQ</mtext></ci><cn id="S5.SS5.p2.2.m2.1.1.3.cmml" type="integer" xref="S5.SS5.p2.2.m2.1.1.3">8</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S5.SS5.p2.2.m2.1c">\text{RVQ}_{8}</annotation><annotation encoding="application/x-llamapun" id="S5.SS5.p2.2.m2.1d">RVQ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT</annotation></semantics></math> (fine-tuned) predicts clearer spectrograms at a cost of only 80 bits. We present the ground truth in Figure <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.F5" title="Figure 5 ‣ 5.2 Information disentanglement? ‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">5</span></a> as a reference.</p> </div> </section> <section class="ltx_subsection" id="S5.SS6"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">5.6 </span>Information in the last layer</h3> <div class="ltx_para" id="S5.SS6.p1"> <p class="ltx_p" id="S5.SS6.p1.1">With the lower bound, it is also interesting to see how much information is preserved in the last layer. 
As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2409.06109v2#S5.T3" title="Table 3 ‣ 5.2 Information disentanglement? ‣ 5 Results and Discussions ‣ Estimating the completeness of discrete speech units"><span class="ltx_text ltx_ref_tag">3</span></a>, HuBERT L12 achieves a larger lower bound (lower MSE) than L9. By the data processing inequality, L9 is at least as complete as HuBERT L12, so the bound obtained for L12 also holds for L9. Similarly, the first three layers are at least as complete as layer 4.</p> </div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Conclusion</h2> <div class="ltx_para" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">We present an information-theoretic approach to estimating the completeness of speech representations before and after vector quantization. In addition, we establish connections between information completeness and information accessibility, providing a lower bound of completeness with a stronger justification. We then use the concepts of completeness and accessibility to validate claims on the information encoded in HuBERT representations, including the disentanglement and the redundancy of discrete units.</p> </div> <div class="ltx_para" id="S6.p2"> <p class="ltx_p" id="S6.p2.1">We further explore the relationships among information completeness, accessibility and rate, showing the trade-off between the depth of the residual vector quantizer (the rate) and the other two quantities. 
Our results re-position the role of self-supervised discrete units in speech applications, showing that in addition to phonetic information, prosody and speaker information can also be captured by quantizing the residuals.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography" style="font-size:90%;">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib1.1.1" style="font-size:90%;"> Aäron van den Oord, Yazhe Li, and Oriol Vinyals, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib1.2.1" style="font-size:90%;">“Representation learning with contrastive predictive coding,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib1.3.1" style="font-size:90%;">arXiv:1807.03748</span><span class="ltx_text" id="bib.bib1.4.2" style="font-size:90%;">, 2018. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib2.1.1" style="font-size:90%;"> Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib2.2.1" style="font-size:90%;">“wav2vec 2.0: A framework for self-supervised learning of speech representations,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib2.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib2.4.2" style="font-size:90%;">NeurIPS</span><span class="ltx_text" id="bib.bib2.5.3" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib3.1.1" style="font-size:90%;"> Yu-An Chung, Hao Tang, and James R. 
Glass, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib3.2.1" style="font-size:90%;">“Vector-quantized autoregressive predictive coding,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib3.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib3.4.2" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib3.5.3" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib4.1.1" style="font-size:90%;"> Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib4.2.1" style="font-size:90%;">“Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib4.3.1" style="font-size:90%;">IEEE/ACM Transactions on Audio, Speech, and Language Processing</span><span class="ltx_text" id="bib.bib4.4.2" style="font-size:90%;">, 2021. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib5.1.1" style="font-size:90%;"> Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib5.2.1" style="font-size:90%;">“Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib5.3.1" style="font-size:90%;">IEEE Journal of Selected Topics in Signal Processing</span><span class="ltx_text" id="bib.bib5.4.2" style="font-size:90%;">, 2022. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib6.1.1" style="font-size:90%;"> Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, and Shinji Watanabe, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib6.2.1" style="font-size:90%;">“Exploration of efficient end-to-end asr using discretized input from self-supervised learning,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib6.3.1" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib6.4.2" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib7.1.1" style="font-size:90%;"> Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, and Xie Chen, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib7.2.1" style="font-size:90%;">“Towards universal speech discrete tokens: A case study for ASR and TTS,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib7.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib7.4.2" style="font-size:90%;">ICASSP</span><span class="ltx_text" id="bib.bib7.5.3" style="font-size:90%;">. IEEE, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib8.1.1" style="font-size:90%;"> Dan Wells, Hao Tang, and Korin Richmond, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib8.2.1" style="font-size:90%;">“Phonetic analysis of self-supervised representations of English speech,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib8.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib8.4.2" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib8.5.3" style="font-size:90%;">, 2022. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib9.1.1" style="font-size:90%;"> Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib9.2.1" style="font-size:90%;">“Speech resynthesis from discrete disentangled self-supervised representations,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib9.3.1" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib9.4.2" style="font-size:90%;">, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib10.1.1" style="font-size:90%;"> Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al., </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib10.2.1" style="font-size:90%;">“On generative spoken language modeling from raw audio,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib10.3.1" style="font-size:90%;">Transactions of the Association for Computational Linguistics</span><span class="ltx_text" id="bib.bib10.4.2" style="font-size:90%;">, 2021. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib11.1.1" style="font-size:90%;"> Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, and Yossi Adi, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib11.2.1" style="font-size:90%;">“Textless speech emotion conversion using discrete and decomposed representations,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib11.3.1" style="font-size:90%;">arXiv preprint arXiv:2111.07402</span><span class="ltx_text" id="bib.bib11.4.2" style="font-size:90%;">, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib12.1.1" style="font-size:90%;"> Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, and Kai Yu, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib12.2.1" style="font-size:90%;">“Unicats: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib12.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib12.4.2" style="font-size:90%;">AAAI</span><span class="ltx_text" id="bib.bib12.5.3" style="font-size:90%;">, 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib13.1.1" style="font-size:90%;"> Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y Lin, Andy T Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, et al., </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib13.2.1" style="font-size:90%;">“SUPERB: Speech processing universal performance benchmark,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib13.3.1" style="font-size:90%;">arXiv preprint arXiv:2105.01051</span><span class="ltx_text" id="bib.bib13.4.2" style="font-size:90%;">, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib14.1.1" style="font-size:90%;"> Guan-Ting Lin, Chi-Luen Feng, Wei-Ping Huang, Yuan Tseng, Tzu-Han Lin, Chen-An Li, Hung-yi Lee, and Nigel G Ward, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib14.2.1" style="font-size:90%;">“On the utility of self-supervised models for prosody-related tasks,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib14.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib14.4.2" style="font-size:90%;">IEEE Spoken Language Technology Workshop (SLT)</span><span class="ltx_text" id="bib.bib14.5.3" style="font-size:90%;">. IEEE, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib15.1.1" style="font-size:90%;"> Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib15.2.1" style="font-size:90%;">“Information-theoretic probing for linguistic structure,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib15.3.1" style="font-size:90%;">ACL</span><span class="ltx_text" id="bib.bib15.4.2" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib16.1.1" style="font-size:90%;"> Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Sravya Popuri, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, et al., </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib16.2.1" style="font-size:90%;">“Direct speech-to-speech translation with discrete units,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib16.3.1" style="font-size:90%;">ACL</span><span class="ltx_text" id="bib.bib16.4.2" style="font-size:90%;">, 2022. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib17.1.1" style="font-size:90%;"> Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib17.2.1" style="font-size:90%;">“SoundStream: An end-to-end neural audio codec,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib17.3.1" style="font-size:90%;">IEEE/ACM Transactions on Audio, Speech, and Language Processing</span><span class="ltx_text" id="bib.bib17.4.2" style="font-size:90%;">, 2021. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib18.1.1" style="font-size:90%;"> Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al., </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib18.2.1" style="font-size:90%;">“AudioLM: A language modeling approach to audio generation,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib18.3.1" style="font-size:90%;">IEEE/ACM Transactions on Audio, Speech, and Language Processing</span><span class="ltx_text" id="bib.bib18.4.2" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib19.1.1" style="font-size:90%;"> Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al., </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib19.2.1" style="font-size:90%;">“Neural codec language models are zero-shot text to speech synthesizers,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib19.3.1" style="font-size:90%;">arXiv preprint arXiv:2301.02111</span><span class="ltx_text" id="bib.bib19.4.2" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib20.1.1" style="font-size:90%;"> Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, and David Harwath, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib20.2.1" style="font-size:90%;">“VoiceCraft: Zero-shot speech editing and text-to-speech in the wild,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib20.3.1" style="font-size:90%;">arXiv preprint arXiv:2403.16973</span><span class="ltx_text" id="bib.bib20.4.2" style="font-size:90%;">, 2024. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib21.1.1" style="font-size:90%;"> Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib21.2.1" style="font-size:90%;">“SpeechTokenizer: Unified speech tokenizer for speech language models,” 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib22.1.1" style="font-size:90%;"> Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, and Jakob Verbeek, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib22.2.1" style="font-size:90%;">“Adaptive density estimation for generative models,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib22.3.1" style="font-size:90%;">NeurIPS</span><span class="ltx_text" id="bib.bib22.4.2" style="font-size:90%;">, 2019. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib23.1.1" style="font-size:90%;"> Rita Frieske and Bertram E Shi, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib23.2.1" style="font-size:90%;">“Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib23.3.1" style="font-size:90%;">arXiv preprint arXiv:2401.01572</span><span class="ltx_text" id="bib.bib23.4.2" style="font-size:90%;">, 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib24.1.1" style="font-size:90%;"> David McAllester and Karl Stratos, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib24.2.1" style="font-size:90%;">“Formal limitations on the measurement of mutual information,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib24.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib24.4.2" style="font-size:90%;">International Conference on Artificial Intelligence and Statistics</span><span class="ltx_text" id="bib.bib24.5.3" style="font-size:90%;">. PMLR, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib25.1.1" style="font-size:90%;"> Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, and Shinji Watanabe, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib25.2.1" style="font-size:90%;">“Self-supervised speech representations are more phonetic than semantic,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib25.3.1" style="font-size:90%;">arXiv preprint arXiv:2406.08619</span><span class="ltx_text" id="bib.bib25.4.2" style="font-size:90%;">, 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib26.1.1" style="font-size:90%;"> Matthew Baas, Benjamin van Niekerk, and Herman Kamper, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib26.2.1" style="font-size:90%;">“Voice conversion with just nearest neighbors,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib26.3.1" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib26.4.2" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib27.1.1" style="font-size:90%;"> Biing-Hwang Juang and A Gray, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib27.2.1" style="font-size:90%;">“Multiple stage vector quantization for speech coding,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib27.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib27.4.2" style="font-size:90%;">ICASSP</span><span class="ltx_text" id="bib.bib27.5.3" style="font-size:90%;">. IEEE, 1982. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib28.1.1" style="font-size:90%;"> Lukas Wolf, Tiago Pimentel, Evelina Fedorenko, Ryan Cotterell, Alex Warstadt, Ethan Wilcox, and Tamar Regev, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib28.2.1" style="font-size:90%;">“Quantifying the redundancy between prosody and text,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib28.3.1" style="font-size:90%;">ACL</span><span class="ltx_text" id="bib.bib28.4.2" style="font-size:90%;">, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib29.1.1" style="font-size:90%;"> Alexander H Liu, Sung-Lin Yeh, and James R Glass, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib29.2.1" style="font-size:90%;">“Revisiting self-supervised learning of speech representation from a mutual information perspective,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib29.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib29.4.2" style="font-size:90%;">ICASSP</span><span class="ltx_text" id="bib.bib29.5.3" style="font-size:90%;">. IEEE, 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib30.1.1" style="font-size:90%;"> Gene-Ping Yang, Sung-Lin Yeh, Yu-An Chung, James Glass, and Hao Tang, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib30.2.1" style="font-size:90%;">“Autoregressive predictive coding: A comprehensive study,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib30.3.1" style="font-size:90%;">IEEE Journal of Selected Topics in Signal Processing</span><span class="ltx_text" id="bib.bib30.4.2" style="font-size:90%;">, 2022. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib31.1.1" style="font-size:90%;"> Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, and Mirco Ravanelli, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib31.2.1" style="font-size:90%;">“Speech self-supervised representations benchmarking: a case for larger probing heads,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib31.3.1" style="font-size:90%;">arXiv preprint arXiv:2308.14456</span><span class="ltx_text" id="bib.bib31.4.2" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib32.1.1" style="font-size:90%;"> Elena Voita and Ivan Titov, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib32.2.1" style="font-size:90%;">“Information-theoretic probing with minimum description length,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib32.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib32.4.2" style="font-size:90%;">EMNLP</span><span class="ltx_text" id="bib.bib32.5.3" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib33.1.1" style="font-size:90%;"> Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, and Shiyu Chang, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib33.2.1" style="font-size:90%;">“ContentVec: An improved self-supervised speech representation by disentangling speakers,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib33.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib33.4.2" style="font-size:90%;">ICML</span><span class="ltx_text" id="bib.bib33.5.3" style="font-size:90%;">. PMLR, 2022. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib34.1.1" style="font-size:90%;"> Weiwei Lin, Chenhang He, Man-Wai Mak, and Youzhi Tu, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib34.2.1" style="font-size:90%;">“Self-supervised neural factor analysis for disentangling utterance-level speech representations,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib34.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib34.4.2" style="font-size:90%;">ICML</span><span class="ltx_text" id="bib.bib34.5.3" style="font-size:90%;">. PMLR, 2023. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib35.1.1" style="font-size:90%;"> Sumukh K Aithal, Pratyush Maini, Zachary C Lipton, and J Zico Kolter, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib35.2.1" style="font-size:90%;">“Understanding hallucinations in diffusion models through mode interpolation,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib35.3.1" style="font-size:90%;">arXiv preprint arXiv:2406.09358</span><span class="ltx_text" id="bib.bib35.4.2" style="font-size:90%;">, 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib36.1.1" style="font-size:90%;"> Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib36.2.1" style="font-size:90%;">“Generative adversarial networks,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib36.3.1" style="font-size:90%;">Communications of the ACM</span><span class="ltx_text" id="bib.bib36.4.2" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib37.1.1" style="font-size:90%;"> Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib37.2.1" style="font-size:90%;">“HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib37.3.1" style="font-size:90%;">NeurIPS</span><span class="ltx_text" id="bib.bib37.4.2" style="font-size:90%;">, 2020. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_tag_bibitem">[38]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib38.1.1" style="font-size:90%;"> Jan Anguita, Javier Hernando, Stéphane Peillon, and Alexandre Bramoullé, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib38.2.1" style="font-size:90%;">“Detection of confusable words in automatic speech recognition,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib38.3.1" style="font-size:90%;">IEEE Signal Processing Letters</span><span class="ltx_text" id="bib.bib38.4.2" style="font-size:90%;">, 2005. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_tag_bibitem">[39]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib39.1.1" style="font-size:90%;"> Ashish Mittal, Rudra Murthy, Vishwajeet Kumar, and Riyaz Bhat, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib39.2.1" style="font-size:90%;">“Towards understanding and mitigating the hallucinations in NLP and speech,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib39.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib39.4.2" style="font-size:90%;">Proceedings of the 7th Joint International Conference on Data Science &amp; Management of Data (11th ACM IKDD CODS and 29th COMAD)</span><span class="ltx_text" id="bib.bib39.5.3" style="font-size:90%;">, 2024. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_tag_bibitem">[40]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib40.1.1" style="font-size:90%;"> Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib40.2.1" style="font-size:90%;">“LibriSpeech: An ASR corpus based on public domain audio books,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib40.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib40.4.2" style="font-size:90%;">ICASSP</span><span class="ltx_text" id="bib.bib40.5.3" style="font-size:90%;">, 2015. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_tag_bibitem">[41]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib41.1.1" style="font-size:90%;"> Hubert Siuzdak, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib41.2.1" style="font-size:90%;">“Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib41.3.1" style="font-size:90%;">arXiv preprint arXiv:2306.00814</span><span class="ltx_text" id="bib.bib41.4.2" style="font-size:90%;">, 2023. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_tag_bibitem">[42]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib42.1.1" style="font-size:90%;"> Douglas B Paul and Janet Baker, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib42.2.1" style="font-size:90%;">“The design for the Wall Street Journal-based CSR corpus,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib42.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib42.4.2" style="font-size:90%;">Speech and Natural Language Workshop</span><span class="ltx_text" id="bib.bib42.5.3" style="font-size:90%;">, 1992. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_tag_bibitem">[43]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib43.1.1" style="font-size:90%;"> Matthias Mauch and Simon Dixon, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib43.2.1" style="font-size:90%;">“pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib43.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib43.4.2" style="font-size:90%;">ICASSP</span><span class="ltx_text" id="bib.bib43.5.3" style="font-size:90%;">. IEEE, 2014. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_tag_bibitem">[44]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib44.1.1" style="font-size:90%;"> Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib44.2.1" style="font-size:90%;">“VoxCeleb: A large-scale speaker identification dataset,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib44.3.1" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib44.4.2" style="font-size:90%;">, 2017. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_tag_bibitem">[45]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib45.1.1" style="font-size:90%;"> Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib45.2.1" style="font-size:90%;">“ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib45.3.1" style="font-size:90%;">Interspeech</span><span class="ltx_text" id="bib.bib45.4.2" style="font-size:90%;">, 2020. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_tag_bibitem">[46]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib46.1.1" style="font-size:90%;"> Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib46.2.1" style="font-size:90%;">“High fidelity neural audio compression,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib46.3.1" style="font-size:90%;">arXiv preprint arXiv:2210.13438</span><span class="ltx_text" id="bib.bib46.4.2" style="font-size:90%;">, 2022. 
</span> </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_tag_bibitem">[47]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib47.1.1" style="font-size:90%;"> Eric Jang, Shixiang Gu, and Ben Poole, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib47.2.1" style="font-size:90%;">“Categorical reparameterization with Gumbel-softmax,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib47.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib47.4.2" style="font-size:90%;">ICLR</span><span class="ltx_text" id="bib.bib47.5.3" style="font-size:90%;">, 2017. </span> </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_tag_bibitem">[48]</span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib48.1.1" style="font-size:90%;"> Chris J. Maddison, Andriy Mnih, and Yee Whye Teh, </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib48.2.1" style="font-size:90%;">“The concrete distribution: A continuous relaxation of discrete random variables,” </span> </span> <span class="ltx_bibblock"><span class="ltx_text" id="bib.bib48.3.1" style="font-size:90%;">in </span><span class="ltx_text ltx_font_italic" id="bib.bib48.4.2" style="font-size:90%;">ICLR</span><span class="ltx_text" id="bib.bib48.5.3" style="font-size:90%;">, 2017. 
</span> </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> </article> </div> </div> </body> </html>
