<!DOCTYPE html> <html lang="en" itemscope itemtype="https://schema.org/Article"> <head> <title>GH Archive</title> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta itemprop="name" content="GH Archive"> <meta name="description" content="GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis."> <meta itemprop="image" content="https://web.archive.org/web/20210308082316im_/https://gems.github.com/octocat.png"/> <link rel="canonical" href="https://web.archive.org/web/20210308082316/https://www.gharchive.org/"> <link rel="shortcut icon" href="https://web.archive.org/web/20210308082316im_/https://github.com/favicon.ico"> <link href="/web/20210308082316cs_/https://www.gharchive.org/assets/css/bootstrap.min.css" rel="stylesheet"> <style type="text/css"> body { padding-top: 20px; padding-bottom: 40px; } p, li { font-size:16px; line-height:25px; } </style> </head> <body> <div class="container"> <h1 style="margin-bottom:0.5em"> <img src="/web/20210308082316im_/https://www.gharchive.org/assets/img/github.png" 
style="vertical-align:middle;padding-right:0.25em"/> GH <span style="color:#c60000">Archive</span> <span style="float:right; margin-top:0.25em"> <span style="margin-right:1em"><a class="github-button" href="https://web.archive.org/web/20210308082316/https://github.com/igrigorik/gharchive.org" data-count-href="/igrigorik/gharchive.org/stargazers" data-count-api="/repos/igrigorik/gharchive.org#stargazers_count">Star</a></span> <g:plusone size="medium" href="https://www.gharchive.org/"></g:plusone> <a href="https://web.archive.org/web/20210308082316/https://twitter.com/share" class="twitter-share-button" data-text="GitHub public timeline archive @ " data-url="https://web.archive.org/web/20210308082316/https://www.gharchive.org/">Tweet</a> </span> </h1> <div class="hero-unit" align="center"> <script src="https://web.archive.org/web/20210308082316js_/https://www.stathat.com/javascripts/embed.js"></script> <script>StatHatEmbed.render({s1: 'lxoC', w: 840, h: 260});</script> </div> <div class="row"> <div class="span12"> <p style="font-size:20px;font-weight:200;line-height:27px;"> Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GH Archive is a project to <strong>record</strong> the public GitHub timeline, <strong>archive it</strong>, and <strong>make it easily accessible</strong> for further analysis. </p> </div> </div> <hr/> <div class="row"> <div class="span12"> <p>GitHub provides <a href="https://web.archive.org/web/20210308082316/https://developer.github.com/v3/activity/events/types/">20+ event types</a>, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. 
These events are aggregated into hourly archives, which you can access with any HTTP client:</p> <table class="table table-striped"> <thead> <tr> <th>Query</th> <th>Command</th> </tr> </thead> <tbody> <tr> <td>Activity for 1/1/2015 @ 3PM UTC</td> <td><code>wget https://data.gharchive.org/2015-01-01-15.json.gz</code></td> </tr> <tr> <td>Activity for 1/1/2015</td> <td><code>wget https://data.gharchive.org/2015-01-01-{0..23}.json.gz</code></td> </tr> <tr> <td>Activity for all of January 2015</td> <td><code>wget https://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz</code></td> </tr> </tbody> </table> <br/> <p>Each archive contains JSON encoded events as reported by the GitHub API. You can download the raw data and apply your own processing to it - e.g. write a custom aggregation script, import it into a database, and so on! An example Ruby script to download and iterate over a single archive:</p> <script src="https://web.archive.org/web/20210308082316js_/https://gist.github.com/igrigorik/2017506.js"></script> <ul> <li>Activity archives are available starting 2/12/2011.</li> <li>Activity archives for dates between 2/12/2011 and 12/31/2014 were recorded from the (now deprecated) Timeline API.</li> <li>Activity archives for dates starting 1/1/2015 are recorded from the Events API.</li> </ul> <p>For the curious, check out <a href="https://web.archive.org/web/20210308082316/https://changelog.com/podcast/144">The Changelog episode #144</a> for an in-depth interview about the history of GH Archive, integration with BigQuery, where the project is heading, and more.</p> </div> </div> <hr/> <div class="row"> <div class="span12"> <h2 id="bigquery"><a href="#bigquery">Analyzing event data with BigQuery</a></h2> <br/> <p>The entire GH Archive is also available as a public dataset on <a href="https://web.archive.org/web/20210308082316/https://developers.google.com/bigquery/">Google BigQuery</a>: the dataset is automatically updated every hour and enables you to run <a 
href="https://web.archive.org/web/20210308082316/https://developers.google.com/bigquery/docs/query-reference">arbitrary SQL-like queries</a> over the entire dataset in seconds. To get started:</p> <ol> <li>If you don't already have a Google project... <ul> <li><a href="https://web.archive.org/web/20210308082316/https://console.developers.google.com/">Log in to the Google Developer Console</a></li> <li><a href="https://web.archive.org/web/20210308082316/https://developers.google.com/console/help/#creatingdeletingprojects">Create a project</a> and <a href="https://web.archive.org/web/20210308082316/https://developers.google.com/console/help/#activatingapis">activate the BigQuery API</a></li> </ul> </li> <li>Open the public dataset: <a href="https://web.archive.org/web/20210308082316/https://console.cloud.google.com/bigquery?project=githubarchive&page=project">https://console.cloud.google.com/bigquery?project=githubarchive&amp;page=project</a></li> <li>Execute your first query...</li> </ol> <script src="https://web.archive.org/web/20210308082316js_/https://gist.github.com/igrigorik/90fab69946d27842882c.js"></script> <p>For convenience, note that there are multiple tables that you can use for your analysis:</p> <ol> <li><strong>year dataset:</strong> <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:year.2011">2011</a></code>, <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:year.2012">2012</a></code>, <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:year.2013">2013</a></code>, <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:year.2014">2014</a></code>, and <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:year.2015">2015</a></code> tables contain all activities for 
each respective year.</li> <li><strong>month dataset:</strong> contains activity for each respective month - e.g. <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:month.201501">201501</a></code>.</li> <li><strong>day dataset:</strong> contains activity for each day - e.g. <code><a href="https://web.archive.org/web/20210308082316/https://bigquery.cloud.google.com/table/githubarchive:day.20150101">20150101</a></code>.</li> </ol> <p>The <a href="https://web.archive.org/web/20210308082316/https://github.com/igrigorik/gharchive.org/blob/master/bigquery/schema.js">schema of the above datasets</a> contains distinct columns for common activity fields (see <a href="https://web.archive.org/web/20210308082316/https://developer.github.com/v3/activity/events/">same response format</a>), a <code>"payload"</code> string field which contains the JSON encoded activity description, and an <code>"other"</code> string field containing all other fields.</p> <ul> <li>The content of the <code>"payload"</code> field is different for each event type and may be updated by GitHub at any point, hence it is kept as a serialized JSON string value in BigQuery. Use the provided <a href="https://web.archive.org/web/20210308082316/https://cloud.google.com/bigquery/query-reference#jsonfunctions">JSON functions</a> (e.g. see query example above with <code>JSON_EXTRACT()</code>) to extract and access data in this field.</li> <li>The content of the <code>"other"</code> field is a JSON string which contains all other data provided by GitHub that does not match the predefined BigQuery schema - e.g. if GitHub adds a new field, it will show up in "other" until and unless the schema is extended to support it.</li> </ul> <p>Note that you get <a href="https://web.archive.org/web/20210308082316/https://cloud.google.com/bigquery/pricing#queries">1 TB of data processed per month free of charge</a>. 
To make the best use of it, restrict your queries to relevant time ranges to minimize the amount of scanned data. To scan multiple tables at once, you can use <a href="https://web.archive.org/web/20210308082316/https://cloud.google.com/bigquery/query-reference#tablewildcardfunctions">table wildcards</a>: <script src="https://web.archive.org/web/20210308082316js_/https://gist.github.com/igrigorik/1cf65f6528e27ed1a580.js"></script> </div> </div> <hr/> <div class="row"> <div class="span12"> <h2 id="daily-reports"><a href="#daily-reports">Daily reports</a></h2> <br/> <p><a href="https://web.archive.org/web/20210308082316/https://changelog.com/nightly">Changelog Nightly</a> is the new and improved version of the daily email reports powered by the GH Archive data. These reports ship each day at 10pm CT and unearth the hottest new repos on GitHub. Alternatively, if you want something curated and less frequent, subscribe to <a href="https://web.archive.org/web/20210308082316/https://changelog.com/weekly">Changelog Weekly</a>.</p> </div> </div> <hr/> <div class="row"> <div class="span12"> <h2 id="resources"><a href="#resources">Research, visualizations, talks...</a></h2> <br/> <ul> <li><a href="https://web.archive.org/web/20210308082316/https://www.baresquare.com/github-devtrends/">GitHub DevTrends</a> is a freely available dynamic report for the developer community based on trends identified from GitHub data.</li> <li><a href="https://web.archive.org/web/20210308082316/https://solutionshub.epam.com/OSCI/">Open Source Contributor Index</a> (OSCI) - a tool that ranks the top open source contributors by commercial organizations.</li> <li><a href="https://web.archive.org/web/20210308082316/https://blog.antoine-augusti.fr/2019/04/analysing-commits-on-github-by-gouv-fr-authors/">Analysing commits on github by @.gouv.fr authors</a> - how French public servants publish and contribute to open source projects on GitHub</li> <li><a 
href="https://web.archive.org/web/20210308082316/https://medium.com/@hamelhusain/mlapp-419f90e8f007?source=friends_link&sk=760e18a2d6e60999d7eb2887352a92a8">How to Automate Tasks on GitHub With Machine Learning for Fun and Profit</a></li> <li>How to detect GitHub trending repo API, using GH Archive, Heroku, MongoDB and GitHub API? <a href="https://web.archive.org/web/20210308082316/https://maxday.github.io/trending/#JavaScript">Demo</a> - <a href="https://web.archive.org/web/20210308082316/https://medium.com/@max.day/how-to-detect-github-trending-repo-api-using-githubarchive-heroku-mongodb-and-github-api-b3489efd9f3e">Medium blog post</a>.</li> <li>Semantic code search with deep learning: <a href="https://web.archive.org/web/20210308082316/https://medium.com/@hamelhusain/semantic-code-search-3cd6d244a39c">Medium blog post</a>.</li> <li>How to use deep learning to extract features from GitHub data, an end-to-end example: <a href="https://web.archive.org/web/20210308082316/https://medium.com/@hamelhusain/how-to-create-data-products-that-are-magical-using-sequence-to-sequence-models-703f86a231f8">Medium blog post</a>. 
An interactive demo of this model: <a href="https://web.archive.org/web/20210308082316/https://gh-demo.kubeflow.org/">https://gh-demo.kubeflow.org</a>.</li> <li>GitHub Data Challenge: <a href="https://web.archive.org/web/20210308082316/https://github.com/blog/1162-github-data-challenge-winners">2012</a>, <a href="https://web.archive.org/web/20210308082316/https://github.com/blog/1544-data-challenge-ii-results">2013</a>, and <a href="https://web.archive.org/web/20210308082316/https://github.com/blog/1892-third-annual-data-challenge-winners">2014 winners</a>.</li> <li><a href="https://web.archive.org/web/20210308082316/https://www.youtube.com/watch?v=U_LNo_cSc70">Analyzing Millions of GitHub Commits (O'Reilly Strata talk)</a></li> <li> <a href="https://web.archive.org/web/20210308082316/https://githut.info/">GitHut</a> is an attempt to visualize and explore the complexity of the universe of programming languages.</li> <li> <a href="https://web.archive.org/web/20210308082316/https://danielvdende.com/projects/gdc2014/index.html">Who speaks what on GitHub?</a> Three visualizations provide insight into the language skills of users on GitHub.</li> <li> <a href="https://web.archive.org/web/20210308082316/https://blog.coderstats.net/github/2013/event-types/">GitHub in 2013</a> is a brief visual overview of GitHub event types in 2013.</li> <li><a href="https://web.archive.org/web/20210308082316/https://geeksta.net/geeklog/exploring-expressions-emotions-github-commit-messages/">Exploring Expressions of Emotions in GitHub Commit Messages</a></li> <li><a href="https://web.archive.org/web/20210308082316/https://www.fastcolabs.com/3015178/the-top-10-hottest-github-projects-right-now?partner=rss">The Top 11 Hottest GitHub Projects Right Now</a></li> <li><a href="https://web.archive.org/web/20210308082316/https://www.gitlogs.com/">GitLogs</a> - GitHub Daily Newsletter curated with a peak detection algorithm. 
Also a sexy interface to search topics and trends on GitHub</li> <li><a href="https://web.archive.org/web/20210308082316/https://changelog.com/nightly/">Subscribe to Changelog Nightly</a> (the new and improved GH Archive daily email reports). It ships every night at 10pm CT and unearths the hottest new repos on GitHub before they blow up. It's nerd to the core and in your inbox each night.</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/harishvc/githubanalytics">GitHub Analytics</a> - search the latest GitHub timeline</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/neonichu/githop">GitHop</a> - see your contributions from a year ago.</li> <li><a href="https://web.archive.org/web/20210308082316/https://gitmostwanted.com/">GitMostWanted</a> - Advanced explorer of github.com.</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/tenex/opensourcecontributors">OpenSourceContributo.rs</a> - Search engine for contributions to GitHub</li> <li><a href="https://web.archive.org/web/20210308082316/https://www.gitlive.net/">GitLive</a> - View what's happening on GitHub in real time - Made at Dragon Hacks 2016 Hackathon</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/guiman/archive_observer">ArchiveObserver</a> - Build users' open source report cards, refreshed hourly!</li> <li><a href="https://web.archive.org/web/20210308082316/https://public.tableau.com/profile/alysonla#!/vizhome/OpenSourceMonthlyNovember2015/OrgDashboard">Open Source Monthly</a> - An activity dashboard for Open Source Organizations.</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/yodasco/graphub">GrapHub</a> - GitHub connections as a D3 graph. 
Find your GH Erdős number from other GH users by co-contributions.</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/frankyjuang/Octomender">Octomender</a> - Get repo recommendations based on your GitHub star history.</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/vitalets/github-trending-repos">GitHub Trending Repos</a> - Stay notified about new trending repositories in your favorite programming language via GitHub notifications.</li> <li><a href="https://web.archive.org/web/20210308082316/https://github.com/cncf/devstats">DevStats</a> - CNCF-created tool for analyzing and graphing developer contributions</li> <li><a href="https://web.archive.org/web/20210308082316/https://gh.clickhouse.tech/explorer/">GH Explorer</a> - GitHub insights on ClickHouse (structured dataset for download + interactive queries)</li> </ul> <p>Have a cool project that should be on this list? <a href="https://web.archive.org/web/20210308082316/https://github.com/igrigorik/gharchive.org/tree/gh-pages">Send a pull request</a>!</p> </div> </div> <hr/> <footer> <p> <a href="https://web.archive.org/web/20210308082316/https://github.com/igrigorik/gharchive.org" style="vertical-align:top; margin-right:1em"><strong>igrigorik/gharchive.org</strong></a> <span style="float:right;"> <a class="GitHub-button" href="https://web.archive.org/web/20210308082316/https://github.com/igrigorik/gharchive.org" data-count-href="/igrigorik/gharchive.org/stargazers" data-style="mega" data-count-api="/repos/igrigorik/gharchive.org#stargazers_count">Star</a> </span> </p> </footer> </div> <!-- /container --> </body> <script> !function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="https://web.archive.org/web/20210308082316/https://platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs"); </script> <script type="text/javascript"> var _gaq = _gaq || 
[]; _gaq.push(['_setAccount', 'UA-71196-6']); _gaq.push(['_trackPageview']); (function() { var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; ga.src = ('https:' == document.location.protocol ? 'https://web.archive.org/web/20210308082316/https://ssl' : 'https://web.archive.org/web/20210308082316/https://www') + '.google-analytics.com/ga.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); })(); </script> <script async defer src="https://web.archive.org/web/20210308082316js_/https://apis.google.com/js/platform.js"></script> <script async defer id="github-bjs" src="https://web.archive.org/web/20210308082316js_/https://buttons.github.io/buttons.js"></script> </html>