<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <title>[2401.06408] AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters</title> <meta name="viewport" content="width=device-width, initial-scale=1"> <link rel="apple-touch-icon" sizes="180x180" href="/static/browse/0.3.4/images/icons/apple-touch-icon.png"> <link rel="icon" type="image/png" sizes="32x32" href="/static/browse/0.3.4/images/icons/favicon-32x32.png"> <link rel="icon" type="image/png" sizes="16x16" href="/static/browse/0.3.4/images/icons/favicon-16x16.png"> <link rel="manifest" href="/static/browse/0.3.4/images/icons/site.webmanifest"> <link rel="mask-icon" href="/static/browse/0.3.4/images/icons/safari-pinned-tab.svg" color="#5bbad5"> <meta name="msapplication-TileColor" content="#da532c"> <meta name="theme-color" content="#ffffff"> <link rel="stylesheet" type="text/css" media="screen" href="/static/browse/0.3.4/css/arXiv.css?v=20240822" /> <link rel="stylesheet" type="text/css" media="print" href="/static/browse/0.3.4/css/arXiv-print.css?v=20200611" /> <link rel="stylesheet" type="text/css" media="screen" href="/static/browse/0.3.4/css/browse_search.css" /> <script language="javascript" src="/static/browse/0.3.4/js/accordion.js" /></script> <link rel="canonical" href="https://arxiv.org/abs/2401.06408"/> <meta name="description" content="Abstract page for arXiv paper 2401.06408: AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters"><meta property="og:type" content="website" /> <meta property="og:site_name" content="arXiv.org" /> <meta property="og:title" content="AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters" /> <meta property="og:url" 
content="https://arxiv.org/abs/2401.06408v3" /> <meta property="og:image" content="/static/browse/0.3.4/images/arxiv-logo-fb.png" /> <meta property="og:image:secure_url" content="/static/browse/0.3.4/images/arxiv-logo-fb.png" /> <meta property="og:image:width" content="1200" /> <meta property="og:image:height" content="700" /> <meta property="og:image:alt" content="arXiv logo"/> <meta property="og:description" content="Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and its social implications."/> <meta name="twitter:site" content="@arxiv"/> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="AboutMe: Using Self-Descriptions in Webpages to Document the..."/> <meta name="twitter:description" content="Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. 
However, decisions around what data is retained or removed during..."/> <meta name="twitter:image" content="https://static.arxiv.org/icons/twitter/arxiv-logo-twitter-square.png"/> <meta name="twitter:image:alt" content="arXiv logo"/> <link rel="stylesheet" media="screen" type="text/css" href="/static/browse/0.3.4/css/tooltip.css"/><link rel="stylesheet" media="screen" type="text/css" href="https://static.arxiv.org/js/bibex-dev/bibex.css?20200709"/> <script src="/static/browse/0.3.4/js/mathjaxToggle.min.js" type="text/javascript"></script> <script src="//code.jquery.com/jquery-latest.min.js" type="text/javascript"></script> <script src="//cdn.jsdelivr.net/npm/js-cookie@2/src/js.cookie.min.js" type="text/javascript"></script> <script src="//cdn.jsdelivr.net/npm/dompurify@2.3.5/dist/purify.min.js"></script> <script src="/static/browse/0.3.4/js/toggle-labs.js?20241022" type="text/javascript"></script> <script src="/static/browse/0.3.4/js/cite.js" type="text/javascript"></script><meta name="citation_title" content="AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters" /><meta name="citation_author" content="Lucy, Li" /><meta name="citation_author" content="Gururangan, Suchin" /><meta name="citation_author" content="Soldaini, Luca" /><meta name="citation_author" content="Strubell, Emma" /><meta name="citation_author" content="Bamman, David" /><meta name="citation_author" content="Klein, Lauren F." /><meta name="citation_author" content="Dodge, Jesse" /><meta name="citation_date" content="2024/01/12" /><meta name="citation_online_date" content="2024/06/20" /><meta name="citation_pdf_url" content="http://arxiv.org/pdf/2401.06408" /><meta name="citation_arxiv_id" content="2401.06408" /><meta name="citation_abstract" content="Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. 
However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and its social implications." /> </head> <body class="with-cu-identity"> <div class="flex-wrap-footer"> <header> <a href="#content" class="is-sr-only">Skip to main content</a> <!-- start desktop header --> <div class="columns is-vcentered is-hidden-mobile" id="cu-identity"> <div class="column" id="cu-logo"> <a href="https://www.cornell.edu/"><img src="/static/browse/0.3.4/images/icons/cu/cornell-reduced-white-SMALL.svg" alt="Cornell University" /></a> </div><div class="column" id="support-ack"> <span id="support-ack-url">We gratefully acknowledge support from the Simons Foundation, <a href="https://info.arxiv.org/about/ourmembers.html">member institutions</a>, and all contributors.</span> <a href="https://info.arxiv.org/about/donate.html" class="btn-header-donate">Donate</a> </div> </div> <div id="header" class="is-hidden-mobile"> <a aria-hidden="true" href="{url_path('ignore_me')}"></a> <div class="header-breadcrumbs is-hidden-mobile"> <a href="/"><img src="/static/browse/0.3.4/images/arxiv-logo-one-color-white.svg" alt="arxiv logo" 
style="height:40px;"/></a> <span>></span> <a href="/list/cs/recent">cs</a> <span>></span> arXiv:2401.06408 </div> <div class="search-block level-right"> <form class="level-item mini-search" method="GET" action="https://arxiv.org/search"> <div class="field has-addons"> <div class="control"> <input class="input is-small" type="text" name="query" placeholder="Search..." aria-label="Search term or terms" /> <p class="help"><a href="https://info.arxiv.org/help">Help</a> | <a href="https://arxiv.org/search/advanced">Advanced Search</a></p> </div> <div class="control"> <div class="select is-small"> <select name="searchtype" aria-label="Field to search"> <option value="all" selected="selected">All fields</option> <option value="title">Title</option> <option value="author">Author</option> <option value="abstract">Abstract</option> <option value="comments">Comments</option> <option value="journal_ref">Journal reference</option> <option value="acm_class">ACM classification</option> <option value="msc_class">MSC classification</option> <option value="report_num">Report number</option> <option value="paper_id">arXiv identifier</option> <option value="doi">DOI</option> <option value="orcid">ORCID</option> <option value="author_id">arXiv author ID</option> <option value="help">Help pages</option> <option value="full_text">Full text</option> </select> </div> </div> <input type="hidden" name="source" value="header"> <button class="button is-small is-cul-darker">Search</button> </div> </form> </div> </div><!-- /end desktop header --> <div class="mobile-header"> <div class="columns is-mobile"> <div class="column logo-arxiv"><a href="https://arxiv.org/"><img src="/static/browse/0.3.4/images/arxiv-logomark-small-white.svg" alt="arXiv logo" style="height:60px;" /></a></div> <div class="column logo-cornell"><a href="https://www.cornell.edu/"> <picture> <source media="(min-width: 501px)" srcset="/static/browse/0.3.4/images/icons/cu/cornell-reduced-white-SMALL.svg 400w" sizes="400w" /> 
<source srcset="/static/browse/0.3.4/images/icons/cu/cornell_seal_simple_black.svg 2x" /> <img src="/static/browse/0.3.4/images/icons/cu/cornell-reduced-white-SMALL.svg" alt="Cornell University Logo" /> </picture> </a></div> <div class="column nav" id="toggle-container" role="menubar"> <button class="toggle-control"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" class="icon filter-white"><title>open search</title><path d="M505 442.7L405.3 343c-4.5-4.5-10.6-7-17-7H372c27.6-35.3 44-79.7 44-128C416 93.1 322.9 0 208 0S0 93.1 0 208s93.1 208 208 208c48.3 0 92.7-16.4 128-44v16.3c0 6.4 2.5 12.5 7 17l99.7 99.7c9.4 9.4 24.6 9.4 33.9 0l28.3-28.3c9.4-9.4 9.4-24.6.1-34zM208 336c-70.7 0-128-57.2-128-128 0-70.7 57.2-128 128-128 70.7 0 128 57.2 128 128 0 70.7-57.2 128-128 128z"/></svg></button> <div class="mobile-toggle-block toggle-target"> <form class="mobile-search-form" method="GET" action="https://arxiv.org/search"> <div class="field has-addons"> <input class="input" type="text" name="query" placeholder="Search..." 
aria-label="Search term or terms" /> <input type="hidden" name="source" value="header"> <input type="hidden" name="searchtype" value="all"> <button class="button">GO</button> </div> </form> </div> <button class="toggle-control"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512" class="icon filter-white" role="menu"><title>open navigation menu</title><path d="M16 132h416c8.837 0 16-7.163 16-16V76c0-8.837-7.163-16-16-16H16C7.163 60 0 67.163 0 76v40c0 8.837 7.163 16 16 16zm0 160h416c8.837 0 16-7.163 16-16v-40c0-8.837-7.163-16-16-16H16c-8.837 0-16 7.163-16 16v40c0 8.837 7.163 16 16 16zm0 160h416c8.837 0 16-7.163 16-16v-40c0-8.837-7.163-16-16-16H16c-8.837 0-16 7.163-16 16v40c0 8.837 7.163 16 16 16z"/ ></svg></button> <div class="mobile-toggle-block toggle-target"> <nav class="mobile-menu" aria-labelledby="mobilemenulabel"> <h2 id="mobilemenulabel">quick links</h2> <ul> <li><a href="https://arxiv.org/login">Login</a></li> <li><a href="https://info.arxiv.org/help">Help Pages</a></li> <li><a href="https://info.arxiv.org/about">About</a></li> </ul> </nav> </div> </div> </div> </div><!-- /end mobile-header --> </header> <main> <div id="content"> <!-- rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"> <rdf:Description rdf:about="/abs/2401.06408" dc:identifier="/abs/2401.06408" dc:title="AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters" trackback:ping="/trackback/2401.06408" /> </rdf:RDF> --><div id="abs-outer"> <div class="leftcolumn"> <div class="subheader"> <h1>Computer Science > Computation and Language</h1> </div> <div class="header-breadcrumbs-mobile"> <strong>arXiv:2401.06408</strong> (cs) </div> <link rel="stylesheet" type="text/css" href="/static/base/1.0.1/css/abs.css"> <div id="content-inner"> <div id="abs"> <div class="dateline"> [Submitted on 12 Jan 2024 (<a 
href="https://arxiv.org/abs/2401.06408v1">v1</a>), last revised 20 Jun 2024 (this version, v3)]</div> <h1 class="title mathjax"><span class="descriptor">Title:</span>AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters</h1> <div class="authors"><span class="descriptor">Authors:</span><a href="https://arxiv.org/search/cs?searchtype=author&query=Lucy,+L" rel="nofollow">Li Lucy</a>, <a href="https://arxiv.org/search/cs?searchtype=author&query=Gururangan,+S" rel="nofollow">Suchin Gururangan</a>, <a href="https://arxiv.org/search/cs?searchtype=author&query=Soldaini,+L" rel="nofollow">Luca Soldaini</a>, <a href="https://arxiv.org/search/cs?searchtype=author&query=Strubell,+E" rel="nofollow">Emma Strubell</a>, <a href="https://arxiv.org/search/cs?searchtype=author&query=Bamman,+D" rel="nofollow">David Bamman</a>, <a href="https://arxiv.org/search/cs?searchtype=author&query=Klein,+L+F" rel="nofollow">Lauren F. Klein</a>, <a href="https://arxiv.org/search/cs?searchtype=author&query=Dodge,+J" rel="nofollow">Jesse Dodge</a></div> <div id="download-button-info" hidden>View a PDF of the paper titled AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters, by Li Lucy and 6 other authors</div> <a class="mobile-submission-download" href="/pdf/2401.06408">View PDF</a> <a class="mobile-submission-download" href="https://arxiv.org/html/2401.06408v3">HTML (experimental)</a> <blockquote class="abstract mathjax"> <span class="descriptor">Abstract:</span>Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. 
We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and its social implications. </blockquote> <!--CONTEXT--> <div class="metatable"> <table summary="Additional metadata"> <tr> <td class="tablecell label">Comments:</td> <td class="tablecell comments mathjax">28 pages, 13 figures. Association for Computational Linguistics (ACL) 2024</td> </tr> <tr> <td class="tablecell label">Subjects:</td> <td class="tablecell subjects"> <span class="primary-subject">Computation and Language (cs.CL)</span></td> </tr><tr> <td class="tablecell label">Cite as:</td> <td class="tablecell arxivid"><span class="arxivid"><a href="https://arxiv.org/abs/2401.06408">arXiv:2401.06408</a> [cs.CL]</span></td> </tr> <tr> <td class="tablecell label"> </td> <td class="tablecell arxividv">(or <span class="arxivid"> <a href="https://arxiv.org/abs/2401.06408v3">arXiv:2401.06408v3</a> [cs.CL]</span> for this version) </td> </tr> <tr> <td class="tablecell label"> </td> <td class="tablecell arxivdoi"> <a href="https://doi.org/10.48550/arXiv.2401.06408" id="arxiv-doi-link">https://doi.org/10.48550/arXiv.2401.06408</a><div class="button-and-tooltip"> <button class="more-info" aria-describedby="more-info-desc-1"> <svg height="15" role="presentation" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><path fill="currentColor" d="M256 8C119.043 8 8 
119.083 8 256c0 136.997 111.043 248 248 248s248-111.003 248-248C504 119.083 392.957 8 256 8zm0 110c23.196 0 42 18.804 42 42s-18.804 42-42 42-42-18.804-42-42 18.804-42 42-42zm56 254c0 6.627-5.373 12-12 12h-88c-6.627 0-12-5.373-12-12v-24c0-6.627 5.373-12 12-12h12v-64h-12c-6.627 0-12-5.373-12-12v-24c0-6.627 5.373-12 12-12h64c6.627 0 12 5.373 12 12v100h12c6.627 0 12 5.373 12 12v24z" class=""></path></svg> <span class="visually-hidden">Focus to learn more</span> </button> <!-- tooltip description --> <div role="tooltip" id="more-info-desc-1"> <span class="left-corner"></span> arXiv-issued DOI via DataCite</div> </div> </td> </tr></table> </div> </div> </div> <div class="submission-history"> <h2>Submission history</h2> From: Li Lucy [<a href="/show-email/72fe3a30/2401.06408" rel="nofollow">view email</a>] <br/> <strong><a href="/abs/2401.06408v1" rel="nofollow">[v1]</a></strong> Fri, 12 Jan 2024 07:10:10 UTC (4,379 KB)<br/> <strong><a href="/abs/2401.06408v2" rel="nofollow">[v2]</a></strong> Tue, 16 Jan 2024 19:35:28 UTC (4,379 KB)<br/> <strong>[v3]</strong> Thu, 20 Jun 2024 18:21:49 UTC (4,313 KB)<br/> </div> </div> <!--end leftcolumn--> <div class="extra-services"> <div class="full-text"> <a name="other"></a> <span class="descriptor">Full-text links:</span> <h2>Access Paper:</h2> <ul> <div id="download-button-info" hidden> View a PDF of the paper titled AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters, by Li Lucy and 6 other authors</div><li><a href="/pdf/2401.06408" aria-describedby="download-button-info" accesskey="f" class="abs-button download-pdf">View PDF</a></li><li><a href="https://arxiv.org/html/2401.06408v3" class="abs-button" id="latexml-download-link">HTML (experimental)</a></li><li><a href="/src/2401.06408" class="abs-button download-eprint">TeX Source</a></li><li><a href="/format/2401.06408" class="abs-button download-format">Other Formats</a></li></ul> <div class="abs-license"><a 
href="http://creativecommons.org/licenses/by/4.0/" title="Rights to this article" class="has_license"> <img alt="license icon" role="presentation" src="https://arxiv.org/icons/licenses/by-4.0.png"/> <span>view license</span> </a></div> </div> <!--end full-text--> <div class="browse"> Current browse context: <div class="current">cs.CL</div> <div class="prevnext"> <span class="arrow"> <a class="abs-button prev-url" href="/prevnext?id=2401.06408&function=prev&context=cs.CL" accesskey="p" title="previous in cs.CL (accesskey p)" rel="nofollow">< prev</a> </span> <span class="is-hidden-mobile"> | </span> <span class="arrow"> <a class="abs-button next-url" href="/prevnext?id=2401.06408&function=next&context=cs.CL" accesskey="n" title="next in cs.CL (accesskey n)" rel="nofollow">next ></a> </span><br/> </div><div class="list"> <a class="abs-button abs-button-grey abs-button-small context-new" href="/list/cs.CL/new" rel="nofollow">new</a> <span class="is-hidden-mobile"> | </span> <a class="abs-button abs-button-grey abs-button-small context-recent" href="/list/cs.CL/recent" rel="nofollow">recent</a> <span class="is-hidden-mobile"> | </span><a class="abs-button abs-button-grey abs-button-small context-id" href="/list/cs.CL/2024-01" rel="nofollow">2024-01</a> </div><div class="abs-switch-cat"> Change to browse by: <div class="switch context-change"> <a href="/abs/2401.06408?context=cs" rel="nofollow">cs</a><br class="is-hidden-mobile"> </div> </div> </div> <div class="extra-ref-cite"> <h3>References & Citations</h3> <ul> <li><a class="abs-button abs-button-small cite-ads" href="https://ui.adsabs.harvard.edu/abs/arXiv:2401.06408">NASA ADS</a></li><li><a class="abs-button abs-button-small cite-google-scholar" href="https://scholar.google.com/scholar_lookup?arxiv_id=2401.06408" target="_blank" rel="noopener">Google Scholar</a></li> <li><a class="abs-button abs-button-small cite-semantic-scholar" href="https://api.semanticscholar.org/arXiv:2401.06408" target="_blank" 
rel="noopener">Semantic Scholar</a></li> </ul> <div style="clear:both;"></div> </div> <div class='extra-ref-cite'> <a id='bib-cite-css' hidden='true' href='/static/browse/0.3.4/css/cite.css'>a</a> <span id='bib-cite-trigger' class="bib-cite-button abs-button">export BibTeX citation</span> <span id='bib-cite-loading' hidden='true'>Loading...</span> </div> <div id='bib-cite-modal' class='bib-modal' hidden='true'> <div class='bib-modal-content'> <div class='bib-modal-title'> <h2>BibTeX formatted citation</h2> <span class='bib-modal-close' >×</span> </div> <div> <textarea id='bib-cite-target' class="bib-citation-content" aria-label="loading the citation">loading...</textarea> </div> <div> <span>Data provided by: </span> <a id='bib-cite-source-api'></a> </div> </div> </div><div class="bookmarks"> <div><h3>Bookmark</h3></div><a class="abs-button abs-button-grey abs-button-small" href="http://www.bibsonomy.org/BibtexHandler?requTask=upload&url=https://arxiv.org/abs/2401.06408&description=AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters" title="Bookmark on BibSonomy"> <img src="/static/browse/0.3.4/images/icons/social/bibsonomy.png" alt="BibSonomy logo"/> </a> <a class="abs-button abs-button-grey abs-button-small" href="https://reddit.com/submit?url=https://arxiv.org/abs/2401.06408&title=AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters" title="Bookmark on Reddit"> <img src="/static/browse/0.3.4/images/icons/social/reddit.png" alt="Reddit logo"/> </a> </div> </div> <!--end extra-services--> <!-- LABS AREA --> <div id="labstabs"> <div class="labstabs"><input type="radio" name="tabs" id="tabone"checked="checked"> <label for="tabone">Bibliographic Tools</label> <div class="tab labs-display-bib"> <h1>Bibliographic and Citation Tools</h1> <div class="toggle"> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input 
id="bibex-toggle" type="checkbox" class="lab-toggle"> <span class="slider"></span> <span class="is-sr-only">Bibliographic Explorer Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-bibex">Bibliographic Explorer</span> <em>(<a href="https://info.arxiv.org/labs/showcase.html#arxiv-bibliographic-explorer">What is the Explorer?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="connectedpapers-toggle" type="checkbox" class="lab-toggle" data-script-url="/static/browse/0.3.4/js/connectedpapers.js" aria-labelledby="label-for-connected-papers"> <span class="slider"></span> <span class="is-sr-only">Connected Papers Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-connected-papers">Connected Papers</span> <em>(<a href="https://www.connectedpapers.com/about" target="_blank">What is Connected Papers?</a>)</em> </div> </div><div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="litmaps-toggle" type="checkbox" class="lab-toggle" data-script-url="/static/browse/0.3.4/js/litmaps.js?20210617" aria-labelledby="label-for-litmaps"> <span class="slider"></span> <span class="is-sr-only">Litmaps Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-litmaps">Litmaps</span> <em>(<a href="https://www.litmaps.co/" target="_blank">What is Litmaps?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="scite-toggle" type="checkbox" class="lab-toggle" data-script-url="/static/browse/0.3.4/js/scite.js?20210617" aria-labelledby="label-for-scite"> <span class="slider"></span> <span class="is-sr-only">scite.ai Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-scite">scite Smart Citations</span> <em>(<a href="https://www.scite.ai/" target="_blank">What are Smart Citations?</a>)</em> 
</div> </div> </div> <div class="labs-content-placeholder labs-display" style="display: none;"></div> <div style="min-height: 15px" id="connectedpapers-output"></div> <div style="min-height: 15px" id="litmaps-open-in"></div> <div style="min-height: 15px" id="scite-open-in"></div> </div> <input type="radio" name="tabs" id="tabtwo"> <label for="tabtwo">Code, Data, Media</label> <div class="tab"> <h1>Code, Data and Media Associated with this Article</h1> <div class="toggle"> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="alphaxiv-toggle" data-script-url="/static/browse/0.3.4/js/alphaxiv.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-alphaxiv"> <span class="slider"></span> <span class="is-sr-only">alphaXiv Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-alphaxiv">alphaXiv</span> <em>(<a href="https://alphaxiv.org/" target="_blank">What is alphaXiv?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="catalyzex-toggle" data-script-url="/static/browse/0.3.4/js/catalyzex.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-cx"> <span class="slider"></span> <span class="is-sr-only">Links to Code Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-cx">CatalyzeX Code Finder for Papers</span> <em>(<a href="https://www.catalyzex.com" target="_blank">What is CatalyzeX?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="dagshub-toggle" data-script-url="/static/browse/0.3.4/js/dagshub.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-dagshub"> <span class="slider"></span> <span class="is-sr-only">DagsHub Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-dagshub">DagsHub</span> <em>(<a href="https://dagshub.com/" target="_blank">What 
is DagsHub?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="gotitpub-toggle" data-script-url="/static/browse/0.3.4/js/gotitpub.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-gotitpub"> <span class="slider"></span> <span class="is-sr-only">GotitPub Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-gotitpub">Gotit.pub</span> <em>(<a href="http://gotit.pub/faq" target="_blank">What is GotitPub?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="huggingface-toggle" data-script-url="/static/browse/0.3.4/js/huggingface.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-huggingface"> <span class="slider"></span> <span class="is-sr-only">Huggingface Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-huggingface">Hugging Face</span> <em>(<a href="https://huggingface.co/huggingface" target="_blank">What is Huggingface?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="paperwithcode-toggle" data-script-url="/static/browse/0.3.4/js/paperswithcode.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-pwc"> <span class="slider"></span> <span class="is-sr-only">Links to Code Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-pwc">Papers with Code</span> <em>(<a href="https://paperswithcode.com/" target="_blank">What is Papers with Code?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="sciencecast-toggle" data-script-url="/static/browse/0.3.4/js/sciencecast.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-sciencecast"> <span class="slider"></span> <span class="is-sr-only">ScienceCast Toggle</span> </label> </div> 
<div class="column lab-name"> <span id="label-for-sciencecast">ScienceCast</span> <em>(<a href="https://sciencecast.org/welcome" target="_blank">What is ScienceCast?</a>)</em> </div> </div> </div> <div id="alphaxiv-output" style="display:none"></div> <div id="catalyzex-output" style="display:none"></div> <div id="dagshub-output" style="display:none"></div> <div id="gotitpub-output" style="display:none"></div> <div id="pwc-output" style="display:none"></div> <div id="pwc-data-output" style="display:none"></div> <div id="sciencecast-output" style="display:none"></div> <div id="huggingface-output" style="display:none"></div> </div> <input type="radio" name="tabs" id="labstabs-demos-input"> <label for="labstabs-demos-input" id="labstabs-demos-label">Demos</label> <div class="tab"> <h1>Demos</h1> <div class="toggle"> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="replicate-toggle" data-script-url="/static/browse/0.3.4/js/replicate.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-replicate"> <span class="slider"></span> <span class="is-sr-only">Replicate Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-replicate">Replicate</span> <em>(<a href="https://replicate.com/docs/arxiv/about" target="_blank">What is Replicate?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="spaces-toggle" data-script-url="/static/browse/0.3.4/js/spaces.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-spaces"> <span class="slider"></span> <span class="is-sr-only">Spaces Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-spaces">Hugging Face Spaces</span> <em>(<a href="https://huggingface.co/docs/hub/spaces" target="_blank">What is Spaces?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input 
id="txyz-toggle" data-script-url="/static/browse/0.3.4/js/txyz.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-txyz"> <span class="slider"></span> <span class="is-sr-only">Spaces Toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-txyz">TXYZ.AI</span> <em>(<a href="https://txyz.ai" target="_blank">What is TXYZ.AI?</a>)</em> </div> </div> </div> <div id="replicate-output"></div> <div id="spaces-output"></div> <div id="txyz-output"></div> </div> <input type="radio" name="tabs" id="tabfour"> <label for="tabfour">Related Papers</label> <div class="tab"> <h1>Recommenders and Search Tools</h1> <div class="toggle"> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="influenceflower-toggle" data-script-url="/static/browse/0.3.4/js/influenceflower.js" type="checkbox" class="lab-toggle" aria-labelledby="label-for-influenceflower"> <span class="slider"></span> <span class="is-sr-only">Link to Influence Flower</span> </label> </div> <div class="column lab-name"> <span id="label-for-influenceflower">Influence Flower</span> <em>(<a href="https://influencemap.cmlab.dev/" target="_blank">What are Influence Flowers?</a>)</em> </div> </div> <div class="columns is-mobile lab-row"> <div class="column lab-switch"> <label class="switch"> <input id="core-recommender-toggle" type="checkbox" class="lab-toggle" aria-labelledby="label-for-core"> <span class="slider"></span> <span class="is-sr-only">Core recommender toggle</span> </label> </div> <div class="column lab-name"> <span id="label-for-core">CORE Recommender</span> <em>(<a href="https://core.ac.uk/services/recommender">What is CORE?</a>)</em> </div> </div></div> <div id="influenceflower-output"></div> <div id="influenceflower-output-graph" style="display:none"> <ul class="flower-tabs"> <li class="active"><a class="btn tab-btn" onclick="openTab(event, 'tab-author')">Author</a></li> <li><a class="btn tab-btn" onclick="openTab(event, 
'tab-venue')">Venue</a></li> <li><a class="btn tab-btn" onclick="openTab(event, 'tab-inst')">Institution</a></li> <li><a class="btn tab-btn" onclick="openTab(event, 'tab-topic')">Topic</a></li> </ul> <div class="flower-tab-content"> <div class="tab-flower active" id="tab-author"><svg id="flower-graph-author"></svg></div> <div class="tab-flower" id="tab-venue"><svg id="flower-graph-venue"></svg></div> <div class="tab-flower" id="tab-inst"><svg id="flower-graph-inst"></svg></div> <div class="tab-flower" id="tab-topic"><svg id="flower-graph-topic"></svg></div> </div> </div> <div id="coreRecommenderOutput"></div> <div id="iarxivOutput"></div> </div> <input type="radio" name="tabs" id="tabfive"> <label for="tabfive"> About arXivLabs </label> <div class="tab"> <div class="columns"> <div class="column"> <h1>arXivLabs: experimental projects with community collaborators</h1> <p>arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.</p> <p>Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.</p> <p>Have an idea for a project that will add value for arXiv's community? 
<a href="https://info.arxiv.org/labs/index.html"><strong>Learn more about arXivLabs</strong></a>.</p> </div> <div class="column is-narrow is-full-mobile"> <p class="icon-labs"><svg xmlns="http://www.w3.org/2000/svg" role="presentation" viewBox="0 0 635.572 811"><path d="M175.6 676v27h-27v-27zm-54 27v27h27v-27zm-27 27v27h27v-27zm396-54v27h-27v-27zm0 27v27h27v-27zm27 27v27h27v-27zm-27-414h27v27h-27zm27 0h27v-27h-27zm27-27h27v-27h-27zm-396 45h-27v-27h27zm-27-54h-27v27h27zm-27-27h-27v27h27z"/><path d="M94.6 730v27h-27v-27zm477 0v27h-27v-27zm-27-495h27v27h-27zm-450 18h-27v-27h27zm477 9h27v27h-27zm-54 495h27v27h-27zm-423 0h27v27h-27zm-54-504h27v27h-27z" fill="#666"/><path d="M67.6 730v27h-27v-27zm54 54v27h-27v-27zm0-108v27h27v-27zm-27 27v27h27v-27zm-81 0v27h27v-27zm585 27v27h-27v-27zm-108-54v27h27v-27zm27 27v27h27v-27zm81 0v27h27v-27zm-54-495h27v27h-27zm-54 108h27v-27h-27zm27-27h27v-27h-27zm0-81h27v-27h-27zm-423 18h-27v-27h27zm54 54h-27v27h27zm-27-27h-27v27h27zm0-81h-27v27h27zm423 612v27h-27v-27zm81-522v27h-27v-27zm-585-9v27h-27v-27z" fill="#999"/><path d="M94.6 784v27h-27v-27zm-27-27v27h27v-27zm-27-54v27h27v-27zm27 0v27h27v-27zm0-27v27h27v-27zm27 0v27h27v-27zm0-27v27h27v-27zm27 0v27h27v-27zm-108 81v27h27v-27zm558 54v27h-27v-27zm-27-27v27h27v-27zm27-54v27h27v-27zm-27 0v27h27v-27zm0-27v27h27v-27zm-27 0v27h27v-27zm0-27v27h27v-27zm-27 0v27h27v-27zm108 81v27h27v-27zm0-495h27v27h-27zm-27 27h27v-27h-27zm-54-27h27v-27h-27zm0 27h27v-27h-27zm-27 0h27v-27h-27zm0 27h27v-27h-27zm-27 0h27v-27h-27zm0 27h27v-27h-27zm81-108h27v-27h-27zm-504 45h-27v-27h27zm27-27h-27v27h27zm54-27h-27v27h27zm0 27h-27v27h27zm27 0h-27v27h27zm0 27h-27v27h27zm27 0h-27v27h27zm0 27h-27v27h27zm-81-108h-27v27h27z" fill="#ccc"/><path d="M598.6 665.1H41.5C-76.5 667 176 280.2 176 280.2h53a46.5 46.5 0 0162.8-56.3 29.2 29.2 0 1128.5 35.9h-1a46.5 46.5 0 01-1.5 20.3l142.5-.1s255.3 387 138.3 385.1zM291 181a29.3 29.3 0 10-29.2-29.3A29.3 29.3 0 00291 181zm65.4-66.8a22.4 22.4 0 10-22.5-22.4 22.4 22.4 0 0022.5 22.4z" 
fill="#fc0"/><path d="M245.5 172V10h153v162s324 495 198 495h-558c-126 0 207-495 207-495zm126 54h56m-13 72h56m-9 72h56m-20 72h56m-22 72h56m-29 72h56m-457-45c20.8 41.7 87.3 81 160.7 81 72.1 0 142.1-38.2 163.4-81" fill="none" stroke="#000" stroke-miterlimit="10" stroke-width="20"/><path d="M273.3 421.7c0 31-9.8 56.3-21.9 56.3s-21.8-25.2-21.8-56.3 9.8-56.3 21.8-56.3 21.9 25.2 21.9 56.3zm114.4-56.3c-12 0-21.8 25.2-21.8 56.3s9.7 56.3 21.8 56.3 21.9-25.2 21.9-56.3-9.8-56.3-21.9-56.3zM150.1 526.6c-18.2 6.7-27.5 22.9-23.2 30.2s14.8-5.5 33-12.2 37.4-4.9 33-12.2-24.5-12.6-42.8-5.8zm296 5.8c-4.2 7.3 14.9 5.5 33.1 12.2s28.7 19.5 33 12.2-5-23.5-23.2-30.2-38.5-1.5-42.8 5.8z"/></svg></p> </div> </div> </div> </div> </div> <!-- END LABS AREA --> <div class="endorsers"> <a href="/auth/show-endorsers/2401.06408" class="endorser-who" rel="nofollow">Which authors of this paper are endorsers?</a> | <a id="mathjax_toggle" href="javascript:setMathjaxCookie()">Disable MathJax</a> (<a href="https://info.arxiv.org/help/mathjax.html">What is MathJax?</a>) <span class="help" style="font-style: normal; float: right; margin-top: 0; margin-right: 1em;"></span> </div> <script type="text/javascript" language="javascript">mathjaxToggle();</script> </div> </div> </main> <footer style="clear: both;"> <div class="columns is-desktop" role="navigation" aria-label="Secondary" style="margin: -0.75em -0.75em 0.75em -0.75em"> <!-- Macro-Column 1 --> <div class="column" style="padding: 0;"> <div class="columns"> <div class="column"> <ul style="list-style: none; line-height: 2;"> <li><a href="https://info.arxiv.org/about">About</a></li> <li><a href="https://info.arxiv.org/help">Help</a></li> </ul> </div> <div class="column"> <ul style="list-style: none; line-height: 2;"> <li> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" class="icon filter-black" role="presentation"><title>contact arXiv</title><desc>Click here to contact arXiv</desc><path d="M502.3 190.8c3.9-3.1 9.7-.2 9.7 4.7V400c0 26.5-21.5 
48-48 48H48c-26.5 0-48-21.5-48-48V195.6c0-5 5.7-7.8 9.7-4.7 22.4 17.4 52.1 39.5 154.1 113.6 21.1 15.4 56.7 47.8 92.2 47.6 35.7.3 72-32.8 92.3-47.6 102-74.1 131.6-96.3 154-113.7zM256 320c23.2.4 56.6-29.2 73.4-41.4 132.7-96.3 142.8-104.7 173.4-128.7 5.8-4.5 9.2-11.5 9.2-18.9v-19c0-26.5-21.5-48-48-48H48C21.5 64 0 85.5 0 112v19c0 7.4 3.4 14.3 9.2 18.9 30.6 23.9 40.7 32.4 173.4 128.7 16.8 12.2 50.2 41.8 73.4 41.4z"/></svg> <a href="https://info.arxiv.org/help/contact.html"> Contact</a> </li> <li> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" class="icon filter-black" role="presentation"><title>subscribe to arXiv mailings</title><desc>Click here to subscribe</desc><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> <a href="https://info.arxiv.org/help/subscribe"> Subscribe</a> </li> </ul> </div> </div> </div> <!-- End Macro-Column 1 --> <!-- Macro-Column 2 --> <div class="column" style="padding: 0;"> <div class="columns"> <div class="column"> <ul style="list-style: none; line-height: 2;"> <li><a href="https://info.arxiv.org/help/license/index.html">Copyright</a></li> <li><a href="https://info.arxiv.org/help/policies/privacy_policy.html">Privacy Policy</a></li> </ul> </div> <div class="column sorry-app-links"> <ul style="list-style: none; line-height: 2;"> <li><a href="https://info.arxiv.org/help/web_accessibility.html">Web Accessibility Assistance</a></li> <li> <p class="help"> <a class="a11y-main-link" href="https://status.arxiv.org" target="_blank">arXiv Operational Status <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 512" class="icon filter-dark_grey" role="presentation"><path d="M224.3 273l-136 136c-9.4 9.4-24.6 9.4-33.9 0l-22.6-22.6c-9.4-9.4-9.4-24.6 0-33.9l96.4-96.4-96.4-96.4c-9.4-9.4-9.4-24.6 0-33.9L54.3 103c9.4-9.4 24.6-9.4 33.9 0l136 136c9.5 9.4 9.5 24.6.1 
34z"/></svg></a><br> Get status notifications via <a class="is-link" href="https://subscribe.sorryapp.com/24846f03/email/new" target="_blank"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" class="icon filter-black" role="presentation"><path d="M502.3 190.8c3.9-3.1 9.7-.2 9.7 4.7V400c0 26.5-21.5 48-48 48H48c-26.5 0-48-21.5-48-48V195.6c0-5 5.7-7.8 9.7-4.7 22.4 17.4 52.1 39.5 154.1 113.6 21.1 15.4 56.7 47.8 92.2 47.6 35.7.3 72-32.8 92.3-47.6 102-74.1 131.6-96.3 154-113.7zM256 320c23.2.4 56.6-29.2 73.4-41.4 132.7-96.3 142.8-104.7 173.4-128.7 5.8-4.5 9.2-11.5 9.2-18.9v-19c0-26.5-21.5-48-48-48H48C21.5 64 0 85.5 0 112v19c0 7.4 3.4 14.3 9.2 18.9 30.6 23.9 40.7 32.4 173.4 128.7 16.8 12.2 50.2 41.8 73.4 41.4z"/></svg>email</a> or <a class="is-link" href="https://subscribe.sorryapp.com/24846f03/slack/new" target="_blank"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512" class="icon filter-black" role="presentation"><path d="M94.12 315.1c0 25.9-21.16 47.06-47.06 47.06S0 341 0 315.1c0-25.9 21.16-47.06 47.06-47.06h47.06v47.06zm23.72 0c0-25.9 21.16-47.06 47.06-47.06s47.06 21.16 47.06 47.06v117.84c0 25.9-21.16 47.06-47.06 47.06s-47.06-21.16-47.06-47.06V315.1zm47.06-188.98c-25.9 0-47.06-21.16-47.06-47.06S139 32 164.9 32s47.06 21.16 47.06 47.06v47.06H164.9zm0 23.72c25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06H47.06C21.16 243.96 0 222.8 0 196.9s21.16-47.06 47.06-47.06H164.9zm188.98 47.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06h-47.06V196.9zm-23.72 0c0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06V79.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06V196.9zM283.1 385.88c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06v-47.06h47.06zm0-23.72c-25.9 0-47.06-21.16-47.06-47.06 0-25.9 21.16-47.06 47.06-47.06h117.84c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06H283.1z"/></svg>slack</a> </p> </li> </ul> </div> </div> </div> 
<!-- End Macro-Column 2 --> </div> </footer> </div> <script src="/static/base/1.0.1/js/member_acknowledgement.js"></script> </body> </html>