Top Influential Papers Cited by Popular CS Conferences | Semantic Scholar

Scholar's Hub
Top Influential Papers Cited by Popular CS Conferences

Welcome to a curated collection of papers that are frequently cited by work published in three leading computer science conferences: the Association for Computational Linguistics (ACL), Empirical Methods in Natural Language Processing (EMNLP), and the International Conference on Machine Learning (ICML). Explore this collection of fundamental works that have shaped the field of natural language processing.

This list collects the papers that earned the most citations from these venues, reflecting their significant impact on computational linguistics research.
If you have any feedback or suggestions, please contact us.

Last updated: December 21st, 2023

ACL

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin et al. · NAACL · October 11, 2018
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
TLDR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
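The abstract's key claim, that the pre-trained encoder needs only one additional output layer for downstream tasks, can be sketched with the Hugging Face Transformers library. The checkpoint name, two-label head, and toy batch below are illustrative assumptions, not the authors' original code:

```python
# Minimal sketch: fine-tuning BERT with one extra output layer for
# sentence classification. Illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds a randomly initialized classification head
)

batch = tokenizer(["a delightful read", "tedious and overlong"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # cross-entropy loss over the new head
outputs.loss.backward()                  # fine-tunes encoder and head end to end
```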
Attention is All you Need
Vaswani et al. · NIPS · June 12, 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
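The core operation behind the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch with illustrative shapes:

```python
# Scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)         # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over key positions
    return weights @ V                                       # weighted sum of values

Q = np.random.randn(4, 64)   # 4 query positions, d_k = 64
K = np.random.randn(6, 64)   # 6 key positions
V = np.random.randn(6, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)
```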
Language Models are Few-Shot Learners
Brown et al. · NeurIPS · May 28, 2020
Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
TLDR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
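The few-shot setting described here means the task is specified purely as text: a natural-language instruction plus a handful of demonstrations, with no gradient updates. A schematic prompt; the demonstrations are made up for illustration and are not from the paper:

```python
# Few-shot prompting: the task and demonstrations are given as plain text,
# and the model's continuation is taken as its answer. No fine-tuning.
demonstrations = ["cheese => fromage", "house => maison", "cat => chat"]

prompt = "Translate English to French:\n"
prompt += "\n".join(demonstrations)
prompt += "\nbook =>"
print(prompt)
# The language model continues this text; the completion ("livre")
# is read off as the prediction.
```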
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Raffel et al. · J. Mach. Learn. Res. · October 23, 2019
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
TLDR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
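The text-to-text framing means every task is expressed as text in, text out, with a task prefix selecting the behavior. A minimal sketch using a released T5 checkpoint through the Hugging Face library; the checkpoint name and prefixes follow common documentation and should be treated as assumptions:

```python
# Every task as text-to-text: a task prefix plus input text in, target text out.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

for text in [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
]:
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```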
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Lewis et al. · ACL · October 29, 2019
Abstract: We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.
TLDR: BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
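The two noising operations the abstract singles out, sentence-order shuffling and span infilling with a single mask token, can be illustrated with a toy corruption function. Whitespace tokenization and the random span choice below are simplifications, not the paper's exact procedure:

```python
# Toy version of BART's pretraining corruption: shuffle sentence order and
# replace a contiguous token span with one mask token. Simplified illustration.
import random

def corrupt(text, mask_token="<mask>", seed=0):
    rng = random.Random(seed)
    # (1) permute sentence order
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    shuffled = ". ".join(sentences) + "."
    # (2) text infilling: replace a random span with a single mask token
    tokens = shuffled.split()
    start = rng.randrange(len(tokens))
    span_len = rng.randint(1, 3)
    tokens[start:start + span_len] = [mask_token]
    return " ".join(tokens)

original = "BART is a denoising autoencoder. It corrupts text. It learns to reconstruct it."
print(corrupt(original))
# Training pairs are (corrupted text -> original text) for a seq2seq model.
```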
Language Models are Unsupervised Multitask Learners
Radford et al.
Abstract: Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset, matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
TLDR: It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Transformers: State-of-the-Art Natural Language Processing
Wolf et al. · EMNLP · October 9, 2019
Abstract: Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models, and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available online.
TLDR: Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
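The unified API the abstract describes is most visible in the library's pipeline interface, which pairs a pretrained model with its tokenizer behind one call. A short sketch; the task names are standard pipeline tasks, and the gpt2 checkpoint is an illustrative choice:

```python
# The library's unified API in a few lines: the same pipeline() entry point
# loads a pretrained model and tokenizer for different tasks.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pretrained model
print(classifier("This library makes pretrained models easy to use."))

generator = pipeline("text-generation", model="gpt2")
print(generator("Recent progress in natural language processing", max_new_tokens=20))
```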
Bleu: a Method for Automatic Evaluation of Machine Translation
Papineni et al. · ACL · July 6, 2002
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
TLDR: This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
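At its core, BLEU combines clipped (modified) n-gram precisions with a brevity penalty. A simplified single-reference, sentence-level sketch, not the official corpus-level implementation:

```python
# Simplified sentence BLEU: geometric mean of clipped n-gram precisions
# (n = 1..4) times a brevity penalty. Illustrative only.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())  # clipped counts
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat is on the mat"))
```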
Adam: A Method for Stochastic Optimization
Kingma et al. · ICLR · December 22, 2014
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
TLDR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
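The update rule behind Adam keeps exponential moving averages of the gradient and its square, corrects their initialization bias, and scales each step accordingly. A minimal NumPy sketch using the commonly cited default hyper-parameters:

```python
# Adam's update rule: moving averages of the gradient (m) and squared
# gradient (v), bias-corrected, scale the parameter step.
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2.
theta, m, v = np.array([2.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)   # close to 0 on this toy quadratic
```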
EMNLP

Papers that also appear in the ACL list above are listed here by title; their abstracts are given there.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. · NAACL · October 11, 2018)
Attention is All you Need (Vaswani et al. · NIPS · June 12, 2017)

RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu et al. · ArXiv · July 26, 2019
Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
TLDR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al. · J. Mach. Learn. Res. · October 23, 2019)
Language Models are Few-Shot Learners (Brown et al. · NeurIPS · May 28, 2020)
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (Lewis et al. · ACL · October 29, 2019)
Adam: A Method for Stochastic Optimization (Kingma et al. · ICLR · December 22, 2014)
Transformers: State-of-the-Art Natural Language Processing (Wolf et al. · EMNLP · October 9, 2019)
Language Models are Unsupervised Multitask Learners (Radford et al.)
Bleu: a Method for Automatic Evaluation of Machine Translation (Papineni et al. · ACL · July 6, 2002)

ICML

Adam: A Method for Stochastic Optimization (Kingma et al. · ICLR · December 22, 2014)
Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/2c03df8b48bf3fa39054345bafabfeff15bfd11d?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=2" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">Deep Residual Learning for Image Recognition</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">He et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">December 10, 2015</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers - 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. 
Deep residual nets are foundations of our submissions to ILSVRC &amp; COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=3" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">Attention is All you Need</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Vaswani et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">NIPS</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">June 12, 2017</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. 
We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/5d90f06bb70a0a3dced62413346235c02b1aa086?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=4" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">Learning Multiple Layers of Features from Tiny Images</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Krizhevsky et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item w-dyn-bind-empty"></p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item w-dyn-bind-empty"></p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. 
Using these labels, we show that object recognition is significantly improved by pre-training a layer of features on a large set of unlabeled tiny images.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/6b85b63579a916f705a8e10a49bd8d849d91b1fc?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=5" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">Language Models are Few-Shot Learners</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Brown et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">NeurIPS</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">May 28, 2020</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3&#x27;s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. 
We discuss broader societal impacts of this finding and of GPT-3 in general.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/3c8a456509e6c0805354bd40a35e3f2dbf8069b1?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=6" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">PyTorch: An Imperative Style, High-Performance Deep Learning Library</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Paszke et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">NeurIPS</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">December 3, 2019</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it was designed from first principles to support an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. 
We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several commonly used benchmarks.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/268d347e8a55b5eb82fb5e7d2f800e33c75ab18a?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=7" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Dosovitskiy et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">ICLR</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">October 22, 2020</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. 
When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/d2c733e34d48784a37d717fe43d9e93277a8c53e?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=8" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">ImageNet: A large-scale hierarchical image database</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Deng et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">2009 IEEE Conference on Computer Vision and Pattern Recognition</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">June 20, 2009</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. 
We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/5c126ae3421f05768d8edd97ecd44b1364e2c99a?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=9" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">Denoising Diffusion Probabilistic Models</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Ho et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">NeurIPS</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">June 19, 2020</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. 
Our implementation is available at this https URL</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.</p></div></a></div><div role="listitem" class="paper-list__paper paper-list__paper--condensed w-dyn-item"><a href="https://www.semanticscholar.org/paper/6f870f7f02a8c59c3e23f407f3ef00dd1dcf8fc4?utm_source=conference-references&amp;utm_medium=hubpage&amp;utm_campaign=ICML&amp;utm_term=10" target="_blank" class="paper-list__paper-link w-inline-block"><h4 class="paper-list__paper-title">Learning Transferable Visual Models From Natural Language Supervision</h4><ul role="list" class="paper-list__paper-meta"><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">Radford et al.</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">ICML</p></li><li class="paper-list__paper-meta-holder"><p class="paper-list__paper-meta-item">February 26, 2021</p></li></ul><p class="paper-list__paper-abstract w-condition-invisible">State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. 
We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.</p><div class="paper-list__paper-tldr-holder w-clearfix"><div class="pill pill--gray">TLDR</div><p class="paper-list__paper-abstract">It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.</p></div></a></div></div></div></div></div></div></div></main>
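<div><p>The most frequently listed reference above, &quot;Adam: A Method for Stochastic Optimization,&quot; describes an update built from bias-corrected estimates of the gradient&#x27;s first and second moments. As a rough illustration of what that abstract summarizes, here is a minimal Python sketch of a single Adam step applied to a toy objective. The function name, variable names, and use of NumPy are illustrative choices rather than code from the paper or from this page; the default hyper-parameters are the values commonly quoted for the method.</p><pre><code class="language-python">
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Adam update for parameters theta given gradient grad.

    m and v are running first/second moment estimates; t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad        # update biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # update biased second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias-correct the first moment
    v_hat = v / (1 - beta2**t)                # bias-correct the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example usage on the toy quadratic f(x) = ||x||^2, whose gradient is 2x.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # theta has moved toward the minimizer at the origin
</code></pre><p>In a real training loop the same per-parameter state (m, v, t) would be kept for every parameter tensor and the step applied once per minibatch gradient.</p></div>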
</body></html>