<!DOCTYPE html> <html lang="en"> <head> <title>arXiv Dataset | Kaggle</title> <meta charset="utf-8" /> <meta name="robots" content="index, follow" /> <meta name="description" content="arXiv dataset and metadata of 1.7M+ scholarly papers across STEM" /> <meta name="keywords" content="earth and nature,education" /> <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0"> <meta name="theme-color" content="#008ABC" /> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" type="text/javascript"> window["pageRequestStartTime"] = 1739801066323; window["pageRequestEndTime"] = 1739801066462; window["initialPageLoadStartTime"] = new Date().getTime(); </script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" id="gsi-client" src="https://accounts.google.com/gsi/client" async defer></script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==">window.KAGGLE_JUPYTERLAB_PATH = "/static/assets/jupyterlab-v4/jupyterlab-index-491dbbb2de495d414be4.html";</script> <link rel="preconnect" href="https://www.google-analytics.com" crossorigin="anonymous" /><link rel="preconnect" href="https://stats.g.doubleclick.net" /><link rel="preconnect" href="https://storage.googleapis.com" /><link rel="preconnect" href="https://apis.google.com" /> <link href="/static/images/favicon.ico" rel="shortcut icon" type="image/x-icon" id="dynamic-favicon" /> <link rel="manifest" href="/static/json/manifest.json" crossorigin="use-credentials"> <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /> <link href="https://fonts.googleapis.com/css?family=Inter:400,400i,500,500i,600,600i,700,700i&display=swap" rel="preload" as="style" /> <link href="https://fonts.googleapis.com/css2?family=Google+Symbols:FILL@0..1&display=block" rel="preload" as="style" /> <link href="https://fonts.googleapis.com/css?family=Inter:400,400i,500,500i,600,600i,700,700i&display=swap" rel="stylesheet" media="print" id="async-google-font-1" /> <link 
href="https://fonts.googleapis.com/css2?family=Google+Symbols:FILL@0..1&display=block" rel="stylesheet" media="print" id="async-google-font-2" /> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" type="text/javascript"> const styleSheetIds = ["async-google-font-1", "async-google-font-2"]; styleSheetIds.forEach(function (id) { document.getElementById(id).addEventListener("load", function() { this.media = "all"; }); }); </script> <link rel="canonical" href="https://www.kaggle.com/datasets/Cornell-University/arxiv" /> <link rel="stylesheet" type="text/css" href="/static/assets/app.css?v=2d4e7ec4eb689d926191" /> <script nonce="CNJACrjhTEpaIQKQsQMlgQ=="> try{(function(a,s,y,n,c,h,i,d,e){d=s.createElement("style"); d.appendChild(s.createTextNode(""));s.head.appendChild(d);d=d.sheet; y=y.map(x => d.insertRule(x + "{ opacity: 0 !important }")); h.start=1*new Date;h.end=i=function(){y.forEach(x => x<d.cssRules.length ? d.deleteRule(x) : {})}; (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c; })(window,document,['.site-header-react__nav'],'dataLayer',2000,{'GTM-52LNT9S':true});}catch(ex){} </script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ=="> window.dataLayer = window.dataLayer || []; function gtag() { dataLayer.push(arguments); } gtag('js', new Date()); gtag('config', 'G-T7QHS60L4Q', { 'optimize_id': 'GTM-52LNT9S', 'displayFeaturesTask': null, 'send_page_view': false, 'content_group1': 'Datasets' }); </script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" async src="https://www.googletagmanager.com/gtag/js?id=G-T7QHS60L4Q"></script> <meta property="og:url" content="https://www.kaggle.com/datasets/Cornell-University/arxiv" /> <meta property="og:title" content="arXiv Dataset" /> <meta property="og:description" content="arXiv dataset and metadata of 1.7M+ scholarly papers across STEM" /> <meta property="og:type" content="website" /> <meta property="og:image" 
content="https://storage.googleapis.com/kaggle-datasets-images/612177/5023398/ed4470482e04000d11d3de82e2fb4389/dataset-card.png?t=2023-02-22-23-00-49" /> <meta property="fb:app_id" content="2665027677054710" /> <meta name="twitter:card" content="summary" /> <meta name="twitter:site" content="@kaggledatasets" /> <meta name="twitter:site" content="@Kaggle" /> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" type="application/ld+json">{"@context":"http://schema.org/","@type":"Dataset","name":"arXiv Dataset","description":"## About ArXiv\n\nFor nearly 30 years, [ArXiv](https://arxiv.org) has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth. \n\nIn these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more. \n\nOur hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces. \n\nThe dataset is freely available via Google Cloud Storage buckets ([more info here](https://cloud.google.com/storage/docs/json_api/v1/buckets)). 
Stay tuned for weekly updates to the dataset!\n\nArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by [Cornell University](https://www.cornell.edu).\n\nThe release of this dataset was featured further in a Kaggle blog post [here](https://medium.com/@kaggleteam/leveraging-ml-to-fuel-new-discoveries-with-the-arxiv-dataset-981a95bfe365).\n\n![](https://storage.googleapis.com/kaggle-public-downloads/arXiv.JPG)\n\n[See here for more information.](https://arxiv.org/about)\n\n## ArXiv On Kaggle\n\n### Metadata\nThis dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the `json` format. This file contains one entry per paper, with the following fields:\n- `id`: ArXiv ID (can be used to access the paper, see below)\n- `submitter`: Who submitted the paper\n- `authors`: Authors of the paper\n- `title`: Title of the paper\n- `comments`: Additional info, such as number of pages and figures\n- `journal-ref`: Information about the journal the paper was published in\n- `doi`: [Digital Object Identifier](https://www.doi.org)\n- `abstract`: The abstract of the paper\n- `categories`: Categories / tags in the ArXiv system\n- `versions`: A version history\n\nYou can access each paper directly on [ArXiv](https://arxiv.org) using these links:\n- `https://arxiv.org/abs/{id}`: Page for this paper including its abstract and further links\n- `https://arxiv.org/pdf/{id}`: Direct link to download the PDF\n\n### Bulk access\nThe full set of PDFs is available for free in the GCS bucket `gs://arxiv-dataset` or [through Google API](https://storage.googleapis.com/arxiv-dataset) ([json documentation](https://cloud.google.com/storage/docs/json_api/v1/buckets) and [xml documentation](https://cloud.google.com/storage/docs/xml-api/reference-methods)). 
\n\nYou can use, for example, [gsutil](https://cloud.google.com/storage/docs/gsutil) to download the data to your local machine.\n``` \n# List files:\ngsutil ls gs://arxiv-dataset/arxiv/\n\n# Download pdfs from March 2020:\ngsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/\n\n# Download all the source files\ngsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/\n``` \n\n### Update Frequency\nWe're automatically updating the metadata as well as the GCS bucket on a weekly basis.\n\n## License\n[Creative Commons CC0 1.0 Universal Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0) applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.\n\n## Acknowledgements\nThe original data is maintained by [ArXiv](https://arxiv.org); huge thanks to the [team](https://arxiv.org/about/people/leadership_team) for building and maintaining this dataset.\n\nWe're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data; thanks to [Matt Bierbaum](https://github.com/mattbierbaum) for providing this tool.\n","url":"https://www.kaggle.com/Cornell-University/arxiv","version":219,"keywords":["subject, earth and nature","subject, people and society, education"],"license":{"@type":"CreativeWork","name":"CC0: Public Domain","url":"https://creativecommons.org/publicdomain/zero/1.0/"},"identifier":["612177"],"includedInDataCatalog":{"@type":"DataCatalog","name":"Kaggle","url":"https://www.kaggle.com"},"creator":{"@type":"Organization","name":"Cornell 
University","url":"https://www.kaggle.com/organizations/Cornell-University","image":"https://storage.googleapis.com/kaggle-organizations/848/thumbnail.png%3Fr=794"},"distribution":[{"@type":"DataDownload","requiresSubscription":true,"encodingFormat":"zip","fileFormat":"zip","contentUrl":"https://www.kaggle.com/datasets/Cornell-University/arxiv/download?datasetVersionNumber=219","contentSize":"1500270940 bytes"}],"commentCount":63,"dateModified":"2025-02-16T00:51:11.333Z","discussionUrl":"https://www.kaggle.com/Cornell-University/arxiv/discussion","alternateName":"arXiv dataset and metadata of 1.7M+ scholarly papers across STEM","isAccessibleForFree":true,"thumbnailUrl":"https://storage.googleapis.com/kaggle-datasets-images/612177/5023398/ed4470482e04000d11d3de82e2fb4389/dataset-card.png?t=2023-02-22-23-00-49","interactionStatistic":[{"@type":"InteractionCounter","interactionType":"http://schema.org/CommentAction","userInteractionCount":63},{"@type":"InteractionCounter","interactionType":"http://schema.org/DownloadAction","userInteractionCount":56765},{"@type":"InteractionCounter","interactionType":"http://schema.org/ViewAction","userInteractionCount":599791},{"@type":"InteractionCounter","interactionType":"http://schema.org/LikeAction","userInteractionCount":1432}]}</script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==">window['useKaggleAnalytics'] = true;</script> <script id="gapi-target" nonce="CNJACrjhTEpaIQKQsQMlgQ==" src="https://apis.google.com/js/api.js" defer async></script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" src="/static/assets/runtime.js?v=17a49a744c00a3be3c1e"></script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" src="/static/assets/vendor.js?v=a62013a985d655b5d6e4"></script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" src="/static/assets/app.js?v=b9c8f99555ee537aa2d2"></script> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" type="text/javascript"> window.kaggleStackdriverConfig = { key: 'AIzaSyA4eNqUdRRskJsCZWVz-qL655Xa5JEMreE', projectId: 'kaggle-161607', 
service: 'web-fe', version: 'ci', userId: '0' } </script> </head> <body> <div id="root"> <script nonce="CNJACrjhTEpaIQKQsQMlgQ==" type="text/x-mathjax-config"> MathJax.Hub.Config({ "HTML-CSS": { preferredFont: "TeX", availableFonts: ["STIX", "TeX"], linebreaks: { automatic: true }, EqnChunk: (MathJax.Hub.Browser.isMobile ? 10 : 50) }, tex2jax: { inlineMath: [["\\(", "\\)"], ["\\\\(", "\\\\)"]], displayMath: [["$$", "$$"], ["\\[", "\\]"]], processEscapes: true, ignoreClass: "tex2jax_ignore|dno" }, TeX: { noUndefined: { attributes: { mathcolor: "red", mathbackground: "#FFEEEE", mathsize: "90%" } } }, Macros: { href: "{}" }, skipStartupTypeset: true, messageStyle: "none", extensions: ["Safe.js"], }); </script> <script type="text/javascript" nonce="CNJACrjhTEpaIQKQsQMlgQ=="> window.addEventListener("DOMContentLoaded", () => { const head = document.getElementsByTagName("head")[0]; const lib = document.createElement("script"); lib.type = "text/javascript"; // Always use the production asset in local dev, which is served from GCS. We tried to proxy and / or serve this // in a better way in localhost, but it didn't work out. See b/328073416#comment8 for details. const forceProdHost = window.location.hostname === "localhost"; lib.src = `${forceProdHost ? "https://www.kaggle.com" : ""}/static/mathjax/2.7.9/MathJax.js?config=TeX-AMS-MML_HTMLorMML`; head.appendChild(lib); }); </script> </div> </body> </html>