<?xml version="1.0" encoding="utf-8"?> <rss version="2.0"><channel><title>David Kane</title><link>https://dkane.net/</link><description></description><lastBuildDate>Mon, 07 Oct 2024 11:08:47 +0100</lastBuildDate><item><title>Why PostgreSQL + Django is my data pipeline of choice</title><link>https://dkane.net/2024/why-postgresql-django-is-my-data-pipeline-of-choice/</link><description><p><em>This blog post is based partly on an interesting conversation at <a href="https://www.odcamp.uk/">Open Data camp 9 in Manchester</a> this year, where Giles from Open Innovations led a session called <a href="https://docs.google.com/document/d/1zY1KZGlpac-rUnfW3pRwncXotmXe4e1FHzo2VECHQ6M/edit">DIY FTW! What’s in YOUR toolbox?</a>. As well as showcasing a bunch of tools that I'd not come across, it also made me think about what my typical workflow looks like.</em></p> <p>A large proportion of my work at the moment involves creating data pipelines. These data pipelines start with "inputs" - generally some public data source, like the Charity Commission Register of Charities. There is then some cleaning or processing that happens - merging data together, manually adding or amending data. And then finally some "outputs" happen - this might be reports on an individual charity, or a set of aggregate data around a particular group of charities.</p> <p>There are a few particular requirements for these pipelines - I generally need them to be repeatable, including re-running with updated data. This means I need to carefully separate the data cleaning and additions so that they are not lost when new data comes in.</p> <p>For a small-scale data pipeline - maybe running a quick piece of analysis from data fetched in an SQL query - I'll generally turn to a Python notebook. 
I've also experimented with new tools like <a href="https://duckdb.org/">DuckDB</a> for fetching and manipulating data.</p> <p>But for anything larger than an afternoon's work, I've found that I'm reaching for <a href="https://www.djangoproject.com/">Django</a>, the python web framework. This might seem like an unintuitive choice at first - Django is a big sprawling piece of software, and creating data pipelines is not what it is designed for. But I've found that its "batteries-included" approach means that it has all the pieces I need in place.</p> <div class="toc"><span class="toctitle">Table of Contents</span><ul> <li><a href="#what-you-get-out-of-the-box">What you get out of the box</a><ul> <li><a href="#data-model">Data model</a></li> <li><a href="#admin-area">Admin area</a></li> <li><a href="#user-accessauthentication">User access/authentication</a></li> <li><a href="#management-commands">Management commands</a></li> </ul> </li> <li><a href="#what-i-add-with-add-ons">What I add with add-ons</a><ul> <li><a href="#django-sql-dashboard">Django SQL Dashboard</a></li> <li><a href="#dokku-for-deployment">Dokku for deployment</a></li> <li><a href="#django-ninja-api">Django Ninja API</a></li> </ul> </li> <li><a href="#whats-missing-what-are-the-issues">What's missing / what are the issues</a><ul> <li><a href="#no-easy-ways-of-getting-data-out">No easy ways of getting data out</a></li> <li><a href="#overheads-are-quite-high">Overheads are quite high</a></li> <li><a href="#dealing-with-large-datasets">Dealing with large datasets</a></li> </ul> </li> </ul> </div> <h2 id="what-you-get-out-of-the-box">What you get out of the box</h2> <p>A big attraction for using Django is that it comes fully loaded - a lot of the components I might need in a data pipeline are included. Even if you don't need every feature for every project, there's a lot of value in having them at your fingertips anyway. 
I often find that I start a project thinking, for example, that I won't need user authentication - but then if the requirements change then it's easy to add.</p> <p>The out of the box parts I find most useful are:</p> <h3 id="data-model">Data model</h3> <p>A django app is structured around the <a href="https://docs.djangoproject.com/en/5.1/topics/db/models/">database models</a> you create. It's easy to define your models in code, and there are powerful features for managing relationships between them.</p> <p>A key positive is the flexibility of <a href="https://docs.djangoproject.com/en/5.1/topics/migrations/">django migrations</a>. These mean that I can change the data model, test the effect these changes will have on the data, and then deploy these changes.</p> <p>If the rigid structure of a data model ends up being too prescriptive for the data you're dealing with, there are options to have something more flexible. I've recently been using a pattern of storing data in JSON fields, and then <a href="generated-field.md">using Generated Fields</a> to access the data.</p> <h3 id="admin-area">Admin area</h3> <p>After you've defined your database models, Django makes it quick and easy to add an <a href="https://docs.djangoproject.com/en/5.1/ref/contrib/admin/">admin interface</a> that can be accessed from the web to view the data. You can make some complex views with the built-in admin tools - adding filters, search, the ability to link records together. You can also add and edit data - though you do need to carefully think about how this works when overwriting a data import.</p> <h3 id="user-accessauthentication">User access/authentication</h3> <p>When working with clients, it is invaluable to be able to <a href="https://docs.djangoproject.com/en/5.1/topics/auth/">set up user accounts</a> out of the box, in order to let them see the data. 
This might be about giving them access to the <a href="#admin-area">admin area</a> or setting up some views to show them what the outputs might look like, or filter and download the data themselves.</p> <p>Adding user authentication, particularly when data can be edited in the admin area, means that I can also add an audit trail easily, to see who changed the data and why.</p> <h3 id="management-commands">Management commands</h3> <p>Often a data pipeline has a set of reusable scripts that don't fit nicely with the website paradigm that Django works with. But this is where <a href="https://docs.djangoproject.com/en/5.1/howto/custom-management-commands/">management commands</a> come in - you can easily define python scripts that run on the data. These are useful for data management tasks like import, export or bulk cleaning processes.</p> <p>I've started making use of <a href="https://github.com/GaretJax/django-click">django-click</a> which I slightly prefer over the native django commands.</p> <p>My one gripe with management commands is that a bit of work is needed in order to define logging, transaction management, etc - but it is possible to do that.</p> <h2 id="what-i-add-with-add-ons">What I add with add-ons</h2> <p>Although Django comes pretty fully-featured out of the box, there are a few add-ons and other tools that I typically install at the same time. These add specific functionality that isn't present in standard Django, but add particular features around data science.</p> <h3 id="django-sql-dashboard">Django SQL Dashboard</h3> <p><a href="https://django-sql-dashboard.datasette.io/en/stable/">Django SQL Dashboard</a> is a great add on created by <a href="https://simonwillison.net/">Simon Willison</a> the creator of <a href="https://datasette.io/">Datasette</a> (and one of the co-creators of Django itself!)</p> <p>The plugin basically lets you write custom SQL queries against your database, which are then displayed as webpages. 
These are great for making always-updated dashboards showing the latest state of your data, without needing to wrangle the Django models into the right shape for a custom view (creating aggregate queries with Django models can be a bit fiddly).</p> <p>I use these extensively with <a href="https://findthatcharity.uk/">findthatcharity.uk</a> - for example, <a href="https://findthatcharity.uk/dashboard/data-completeness/">here's a dashboard showing how up-to-date the Charity register is</a>.</p> <p>I've actually created <a href="https://github.com/drkane/django-sql-dashboard">my own version of the plugin</a>, which adds a bit more customisability, for example by allowing titles and descriptions for individual queries within a dashboard.</p> <p>There's a bit of overhead to setting this up - you need to be using a PostgreSQL database, you need to create a read-only SQL user, etc. And it does have potential security implications if not done correctly.</p> <h3 id="dokku-for-deployment">Dokku for deployment</h3> <p>Obviously if you're running Django partly because it's a web server, you need to be able to deploy it to the web. I use <a href="https://dokku.com/docs/getting-started/installation/">dokku</a> for this - it's like running your own heroku instance.</p> <p>Dokku makes it easy to add things like a database (I generally use PostgreSQL) or even a search engine like elasticsearch. I can easily test changes locally, then deploy simply using <code>git push</code>.</p> <p>I also use dokku's scheduling feature to run scripts periodically, for example setting up a nightly data import.</p> <h3 id="django-ninja-api">Django Ninja API</h3> <p>This is a rarer requirement, but it can be useful to have an API to access the data. I really like <a href="https://django-ninja.dev/">Django Ninja</a> for defining APIs. It fits easily with the OpenAPI standard, which means you get nice documentation and testing out of the box. 
I generally add some sort of API Key based authentication to aid with sharing with others.</p> <h2 id="whats-missing-what-are-the-issues">What's missing / what are the issues</h2> <p>Although it's my tool of choice for data pipelines, Django isn't perfect for this purpose (which after all, it wasn't specifically designed for). There are a few gaps and downsides to this approach:</p> <h3 id="no-easy-ways-of-getting-data-out">No easy ways of getting data out</h3> <p>Often the output of a data pipeline is a bulk output that needs to go somewhere else - a spreadsheet download, a set of tables for a report, etc. I usually end up producing some django views to output or download these, but it can quickly get complex or inflexible.</p> <p>One approach that I've had some success with is adding functionality to export (and import) from admin views. There are some add-ons that do this, or it's relatively straightforward to create your own using custom admin templates.</p> <h3 id="overheads-are-quite-high">Overheads are quite high</h3> <p>This approach does mean there are quite high setup costs, compared to something like a python Notebook. For a full deployment, you need to have a server, a domain name, a database. You need to manage SSL certificates, users, downtime etc. You need to be able to get command line access to run scripts, etc on the server. This is all doable, particularly with something like dokku, but it does require a bit of effort.</p> <h3 id="dealing-with-large-datasets">Dealing with large datasets</h3> <p>Django, particularly in terms of data models, is designed around manually creating and editing individual instances of a model. I find with a lot of my work I am working on bulk data imported from a source. Sometimes these can run to hundreds of thousands or even millions of rows.</p> <p>While this is all possible to handle, and you can bypass Django to a certain extent, it can be tricky when running bulk updates &amp; imports. 
For example, a pattern I've not quite cracked is bulk upserts - where data is either created or updated depending on whether it already exists. This is possible to do, but can be a bit fiddly.</p> <hr> <p>I hope I've successfully shown how Django can work well as a tool for managing data pipelines. I have a <a href="https://github.com/kanedata/cookiecutter-django-kd">cookiecutter template</a> that forms the basis for a new django project, which shows some of this approach in practice.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Mon, 07 Oct 2024 11:08:47 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2024-10-07:/2024/why-postgresql-django-is-my-data-pipeline-of-choice/</guid><category>Blog</category></item><item><title>Where to get boundary data on new parliamentary constituencies</title><link>https://dkane.net/2024/where-to-get-boundary-data-on-new-parliamentary-constituencies/</link><description><p>The 2024 UK General Election happens on July 4th. This election will be fought on a new set of constituency boundaries, with the makeup of constituencies changing for the first time since 2010. You can find out more information about these changes <a href="https://www.parliament.uk/about/how/elections-and-voting/constituencies/">on the parliament.uk website</a>.</p> <p>If you're analysing or visualising data by parliamentary constituency, then you probably want to use the new 2024 boundaries. I thought it might be useful to show how to get data on these new boundaries, particularly as lots of the <a href="./where-to-find-boundary-data.md">main sources of boundary data</a> haven't yet updated to take them into account.</p> <h2 id="source-data">Source data</h2> <p>The source I tend to use for boundaries is the ONS Geoportal. 
They've <a href="https://geoportal.statistics.gov.uk/search?q=Future_Parlicons&amp;sort=Title%7Ctitle%7Casc">produced a category bringing together all the data on the new constituencies</a>. Some of the useful files are:</p> <ul> <li>List of the <a href="https://geoportal.statistics.gov.uk/maps/9a876e4777bc47e392e670a7b8bc3f5c">names and codes</a> of the new constituencies</li> <li>Boundary data can be downloaded as GeoJSON or in other formats at a variety of resolutions (the higher resolution files are larger): <a href="https://geoportal.statistics.gov.uk/maps/0d698c9712de4afcac9377367d831c1a">Highest resolution 1</a>, <a href="https://geoportal.statistics.gov.uk/maps/e489d4e5fe9f4f9caa6af161da5442af">Highest resolution 2</a>, <a href="https://geoportal.statistics.gov.uk/maps/7f5c8a729c2f4ccbb8edc3a067845823">Lower resolution</a> and <a href="https://geoportal.statistics.gov.uk/maps/ef63f363ac824b79ae9670744fcc4307">lowest resolution</a>.</li> <li><a href="https://geoportal.statistics.gov.uk/datasets/6f2f35a9a0b94e7e949eeba7785911d4">Postcode lookup</a> - mapping every UK postcode to the new constituencies</li> <li><a href="https://geoportal.statistics.gov.uk/maps/c776b66c0e534b849cae5a5121b7a16a">Lookups from old constituencies to new ones</a> - note that these aren't one-to-one, so the lookup can show, for example, two old constituencies that overlap to form a new one</li> <li>Lookups from smaller geographies to the new constituencies: <a href="https://geoportal.statistics.gov.uk/maps/5968b5b2c0f14dd29ba277beaae6dec3">Output Area 2021</a>, <a href="https://geoportal.statistics.gov.uk/maps/4491820e4f274755a2553c42fc6a250c">Output Area 2011</a>, <a href="https://geoportal.statistics.gov.uk/maps/0d62b50d90d14b8d8b2ccf9f770b5751">LSOA 2021</a>, <a href="https://geoportal.statistics.gov.uk/maps/098360c460dd41beacbdfad83bc4fea2">MSOA 2021</a>, <a href="https://geoportal.statistics.gov.uk/maps/62eb9df29a2f4521b5076a419ff9a47e">Ward</a> - note that these lookups are generally done through a 
"best fit" method, because the areas don't match exactly.</li> </ul> <h2 id="find-that-postcode">Find that Postcode</h2> <p>I've updated the <a href="https://findthatpostcode.uk/">Find that Postcode</a> tool to use parliamentary constituencies as the default. <a href="https://findthatpostcode.uk/areas/E14001444.html">Here's an example constituency</a>.</p> <p>In particular, you can use the <a href="https://findthatpostcode.uk/addtocsv/">Add fields to CSV tool</a> if you have a list of postcodes in CSV format, to add data about parliamentary constituencies (or other areas) to that list.</p> <h2 id="other-resources">Other resources</h2> <ul> <li>Open Innovations have created a <a href="https://open-innovations.org/projects/hexmaps/builder.html?https://raw.githubusercontent.com/odileeds/hexmaps/gh-pages/maps/uk-constituencies-2023.hexjson">hex map layout</a> for the new constituencies.</li> <li>Open Innovations have also then used this to create a <a href="https://constituencies.open-innovations.org/">resource of maps using the new constituency data</a>.</li> <li>MySociety have a <a href="https://pages.mysociety.org/2025-constituencies/datasets/parliament_con_2025/0_1_4">data resource with data on the new files</a></li> <li><a href="http://www.statsmapsnpix.com/2023/07/a-new-uk-constituency-hex-map.html">Alasdair Rae and Philip Brown have also produced a hex map</a> along with some nifty 3-letter codes</li> <li>The <a href="https://commonslibrary.parliament.uk/constituency-boundary-review-data-for-new-constituencies/">House of Commons library</a> has some data plus links to other resources</li> <li><a href="https://www.ordnancesurvey.co.uk/blog/os-election-maps-boundary-lines">Ordnance Survey have also released boundary data</a></li> <li>For more politics-focused data, the <a href="https://www.instituteforgovernment.org.uk/explainer/2023-boundary-changes">Institute for Government</a> have some data on the implications of the boundary changes.</li> </ul> 
<p><strong>This map shows an example of a boundary change, produced by the House of Commons Library</strong></p> <p><img alt="A map produced by the House of Commons Library showing an example boundary change" src="/images/boundaries/ogmore-1536x1152.webp"></p> <p>Source: <a href="https://commonslibrary.parliament.uk/boundary-review-2023-which-seats-will-change/">House of Commons Library</a></p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Fri, 07 Jun 2024 10:45:00 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2024-06-07:/2024/where-to-get-boundary-data-on-new-parliamentary-constituencies/</guid><category>Blog</category></item><item><title>Getting VueJS 3 to work nicely with Django</title><link>https://dkane.net/2024/getting-vuejs-3-to-work-nicely-with-django/</link><description><p>Most of the tutorials for using Django and VueJS assume that you want to completely remove Django views and instead just use VueJS to serve the entire frontend.</p> <p>My use case is usually different - I mainly find that I want to use VueJS sparingly to provide some enhancement to a particular page, while still serving all the pages through Django in the usual way.</p> <p>I've not found a good tutorial for doing this, so I thought I would write my own. This tutorial assumes:</p> <ul> <li>You know your way around Django and VueJS, and just want to connect them.</li> <li>You've already got a working Django site that you want to add to.</li> <li>You've got NPM installed to add VueJS.</li> </ul> <p>There are also a couple of requirements that are specific to how I like to set up Django, but these are optional. 
In particular:</p> <ul> <li>This is based on VueJS 3 (specifically )</li> <li>I prefer to use Jinja2 templates rather than Django's built in templating</li> <li>I use Whitenoise to serve static files</li> <li>I use Dokku for app deployment (the tips here may also work with Heroku)</li> </ul> <p><strong>Edit: 2024-06-19</strong> I've changed some of the details below based on the experience of deploying to dokku</p> <h2 id="initial-setup">Initial setup</h2> <h3 id="1-install-django-vite">1. Install <a href="https://github.com/MrBin99/django-vite">django-vite</a></h3> <p><code>django-vite</code> provides some useful template tags to automatically find the scripts to add to a page.</p> <ol> <li>Run <code>pip install django-vite</code></li> <li>Add <code>django_vite</code> to <code>INSTALLED_APPS</code> in <code>settings.py</code></li> </ol> <h3 id="2-create-the-vue-app-in-your-directory">2. Create the vue app in your directory</h3> <div class="highlight"><pre><span></span><code>npm<span class="w"> </span>create<span class="w"> </span>vite@latest<span class="w"> </span>frontend<span class="w"> </span>--<span class="w"> </span>--template<span class="w"> </span>vue </code></pre></div> <p>This creates a directory called <code>./frontend/</code> in the root directory of your django project. If you want to call it something different you can change <code>frontend</code> in the command.</p> <p>Move the <code>package.json</code> file from <code>./frontend/package.json</code> to <code>./package.json</code> (ie into the root directory of your project). Change the three scripts created in <code>package.json</code> so they use the <code>frontend/vite.config.js</code> config file. 
This will look like:</p> <div class="highlight"><pre><span></span><code><span class="err">...</span> <span class="nt">&quot;scripts&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nt">&quot;dev&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;vite -c frontend/vite.config.js&quot;</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;build&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;vite build -c frontend/vite.config.js&quot;</span><span class="p">,</span> <span class="w"> </span><span class="nt">&quot;preview&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;vite preview -c frontend/vite.config.js&quot;</span> <span class="p">},</span> <span class="err">...</span> </code></pre></div> <p>Delete the <code>./frontend/index.html</code> file that's created here - we'll be inserting the scripts into our django templates so we don't need this file. You might also want to delete more of the sample files that are created at this stage.</p> <h3 id="3-update-settingspy-again">3. Update <code>settings.py</code> again</h3> <p>We need to tell django where to look for the static files. 
Add a value to your <code>STATICFILES_DIRS</code> setting:</p> <div class="highlight"><pre><span></span><code><span class="n">STATICFILES_DIRS</span> <span class="o">=</span> <span class="p">[</span> <span class="c1"># ...existing dirs (probably BASE_DIR/static)</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">BASE_DIR</span><span class="p">,</span> <span class="s2">&quot;frontend&quot;</span><span class="p">,</span> <span class="s2">&quot;dist&quot;</span><span class="p">),</span> <span class="c1"># or</span> <span class="n">BASE_DIR</span> <span class="o">/</span> <span class="s2">&quot;frontend&quot;</span> <span class="o">/</span> <span class="s2">&quot;dist&quot;</span> <span class="p">]</span> </code></pre></div> <p>Also add a setting called <code>DJANGO_VITE</code></p> <div class="highlight"><pre><span></span><code><span class="n">DJANGO_VITE</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">&quot;default&quot;</span><span class="p">:</span> <span class="p">{</span> <span class="s2">&quot;dev_mode&quot;</span><span class="p">:</span> <span class="n">DEBUG</span><span class="p">,</span> <span class="s2">&quot;manifest_path&quot;</span><span class="p">:</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">BASE_DIR</span><span class="p">,</span> <span class="s2">&quot;frontend&quot;</span><span class="p">,</span> <span class="s2">&quot;dist&quot;</span><span class="p">,</span> <span class="s2">&quot;manifest.json&quot;</span><span class="p">),</span> <span class="p">},</span> <span class="p">}</span> </code></pre></div> <p>This will use the existing <code>DEBUG</code> setting to work out whether django vite operates in dev mode or production mode.</p> <p>It also adds a path telling Django Vite 
where to find the manifest for production deployment. In production, django-vite reads this manifest to work out which built asset files belong to each entrypoint.</p> <p>In dev mode you must be running a static server using <code>npm run dev</code>. This will run a server at <code>http://localhost:5173/static/</code> which serves the files.</p> <p>In production mode, you must have generated the files using <code>npm run build</code> and then run <code>python manage.py collectstatic</code> to collect the files. This will make sure that the files are added to your <code>staticfiles</code> directory.</p> <p>If you're using Whitenoise to serve your static files, you may also need to add a custom immutable file test (Whitenoise's <code>WHITENOISE_IMMUTABLE_FILE_TEST</code> setting) that takes into account the URL format generated by Vite. This is described in the <a href="https://github.com/MrBin99/django-vite?tab=readme-ov-file#whitenoise">django-vite documentation</a>. I added the snippet to <code>settings.py</code>.</p> <h3 id="4-update-frontendviteconfigjs">4. Update <code>frontend/vite.config.js</code></h3> <p>The <code>vite.config.js</code> file tells vite both which files to build and where to put the outputs. 
This should look like:</p> <div class="highlight"><pre><span></span><code><span class="k">import</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nx">defineConfig</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s2">&quot;vite&quot;</span><span class="p">;</span> <span class="k">import</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nx">resolve</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s2">&quot;path&quot;</span><span class="p">;</span> <span class="k">import</span><span class="w"> </span><span class="nx">vue</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s2">&quot;@vitejs/plugin-vue&quot;</span><span class="p">;</span> <span class="k">export</span><span class="w"> </span><span class="k">default</span><span class="w"> </span><span class="nx">defineConfig</span><span class="p">({</span> <span class="w"> </span><span class="nx">base</span><span class="o">:</span><span class="w"> </span><span class="s2">&quot;/static/&quot;</span><span class="p">,</span> <span class="w"> </span><span class="nx">plugins</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="nx">vue</span><span class="p">()],</span> <span class="w"> </span><span class="nx">build</span><span class="o">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nx">manifest</span><span class="o">:</span><span class="w"> </span><span class="s2">&quot;manifest.json&quot;</span><span class="p">,</span> <span class="w"> </span><span class="nx">outDir</span><span class="o">:</span><span class="w"> </span><span class="nx">resolve</span><span class="p">(</span><span class="s2">&quot;./frontend/dist&quot;</span><span 
class="p">),</span> <span class="w"> </span><span class="nx">rollupOptions</span><span class="o">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nx">input</span><span class="o">:</span><span class="w"> </span><span class="p">{</span> <span class="w"> </span><span class="nx">main</span><span class="o">:</span><span class="w"> </span><span class="nx">resolve</span><span class="p">(</span><span class="s2">&quot;./frontend/src/main.js&quot;</span><span class="p">),</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="p">},</span> <span class="w"> </span><span class="p">},</span> <span class="p">});</span> </code></pre></div> <p>This sets the following options:</p> <ul> <li><code>base: "/static/"</code> - this makes sure that the vite development server serves files from the <code>/static/</code> URL, as this is where they will be served in production.</li> <li><code>manifest: "manifest.json"</code> - the manifest file shows where all the outputs can be found</li> <li><code>outDir: resolve("./frontend/dist")</code> - this sets the directory where outputs are stored to <code>/frontend/dist/</code>. This is the same as is defined in <code>settings.py</code> above. I chose this directory because it's already excluded from git by the default .gitignore.</li> <li><code>main: resolve("./frontend/src/main.js")</code> - this is an example of an entrypoint for an app. This will create an app called "main" using the script found at <code>/frontend/src/main.js</code>. You can create one of these entries for each entrypoint within your app.</li> </ul> <h3 id="4-optional-set-up-jinja-to-use-the-django_vite-template-tags">5. [optional] Set up Jinja to use the django_vite template tags</h3> <p>I prefer to use Jinja over the default Django templates. 
This extra step is needed to make the django-vite template tags available to Jinja templates.</p> <p>I've set up a Jinja2 environment <a href="https://docs.djangoproject.com/en/5.0/topics/templates/#django.template.backends.jinja2.Jinja2">as described in the Django docs</a> so I add the django-vite template tags to the global environment there.</p> <details><summary>Show how to add django_vite tags to jinja global environment</summary> <div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">django_vite.templatetags.django_vite</span> <span class="kn">import</span> <span class="p">(</span> <span class="n">vite_asset</span><span class="p">,</span> <span class="n">vite_asset_url</span><span class="p">,</span> <span class="n">vite_hmr_client</span><span class="p">,</span> <span class="n">vite_legacy_asset</span><span class="p">,</span> <span class="n">vite_legacy_polyfills</span><span class="p">,</span> <span class="n">vite_preload_asset</span><span class="p">,</span> <span class="n">vite_react_refresh</span><span class="p">,</span> <span class="p">)</span> <span class="kn">from</span> <span class="nn">jinja2</span> <span class="kn">import</span> <span class="n">Environment</span> <span class="k">def</span> <span class="nf">environment</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">):</span> <span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">)</span> <span class="n">env</span><span class="o">.</span><span class="n">globals</span><span class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="p">{</span> <span class="c1"># ... 
any other globals</span> <span class="s2">&quot;vite_hmr_client&quot;</span><span class="p">:</span> <span class="n">vite_hmr_client</span><span class="p">,</span> <span class="s2">&quot;vite_asset&quot;</span><span class="p">:</span> <span class="n">vite_asset</span><span class="p">,</span> <span class="s2">&quot;vite_preload_asset&quot;</span><span class="p">:</span> <span class="n">vite_preload_asset</span><span class="p">,</span> <span class="s2">&quot;vite_asset_url&quot;</span><span class="p">:</span> <span class="n">vite_asset_url</span><span class="p">,</span> <span class="s2">&quot;vite_legacy_polyfills&quot;</span><span class="p">:</span> <span class="n">vite_legacy_polyfills</span><span class="p">,</span> <span class="s2">&quot;vite_legacy_asset&quot;</span><span class="p">:</span> <span class="n">vite_legacy_asset</span><span class="p">,</span> <span class="s2">&quot;vite_react_refresh&quot;</span><span class="p">:</span> <span class="n">vite_react_refresh</span><span class="p">,</span> <span class="p">}</span> <span class="p">)</span> <span class="n">env</span><span class="o">.</span><span class="n">filters</span><span class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="p">{</span> <span class="c1"># ... any filters</span> <span class="p">}</span> <span class="p">)</span> <span class="k">return</span> <span class="n">env</span> </code></pre></div> </details> <h2 id="adding-a-frontend-app">Adding a frontend app</h2> <p>This setup allows you to serve different VueJS apps from different pages in your app. To add a VueJS app to a page, you need to do the following:</p> <h3 id="1-create-a-vuejs-app-in-the-frontendsrc-directory">1. Create a VueJS app in the frontend/src directory</h3> <p>This should look something like the <code>/frontend/src/main.js</code> script, e.g. it should mount a VueJS component to a particular HTML element, usually identified by its ID. 
A calculator app saved in <code>calculator.js</code> might look something like:</p> <div class="highlight"><pre><span></span><code><span class="k">import</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nx">createApp</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s2">&quot;vue&quot;</span><span class="p">;</span>
<span class="k">import</span><span class="w"> </span><span class="nx">CalculatorApp</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s2">&quot;./CalculatorApp.vue&quot;</span><span class="p">;</span>
<span class="k">import</span><span class="w"> </span><span class="s2">&quot;vite/modulepreload-polyfill&quot;</span><span class="p">;</span>

<span class="nx">createApp</span><span class="p">(</span><span class="nx">CalculatorApp</span><span class="p">).</span><span class="nx">mount</span><span class="p">(</span><span class="s2">&quot;#calculator&quot;</span><span class="p">);</span>
</code></pre></div> <h3 id="2-add-the-new-app-to-viteconfigjs">2.
Add the new app to <code>vite.config.js</code></h3> <p>Within your existing <code>vite.config.js</code> file add another entry to <code>build.rollupOptions.input</code>, alongside any existing entries:</p> <div class="highlight"><pre><span></span><code><span class="k">export</span><span class="w"> </span><span class="k">default</span><span class="w"> </span><span class="nx">defineConfig</span><span class="p">({</span>
<span class="w">  </span><span class="nx">build</span><span class="o">:</span><span class="w"> </span><span class="p">{</span>
<span class="w">    </span><span class="nx">rollupOptions</span><span class="o">:</span><span class="w"> </span><span class="p">{</span>
<span class="w">      </span><span class="nx">input</span><span class="o">:</span><span class="w"> </span><span class="p">{</span>
<span class="w">        </span><span class="nx">main</span><span class="o">:</span><span class="w"> </span><span class="nx">resolve</span><span class="p">(</span><span class="s2">&quot;./src/main.js&quot;</span><span class="p">),</span>
<span class="w">        </span><span class="nx">calculator</span><span class="o">:</span><span class="w"> </span><span class="nx">resolve</span><span class="p">(</span><span class="s2">&quot;./src/calculator.js&quot;</span><span class="p">),</span>
<span class="w">      </span><span class="p">},</span>
<span class="w">    </span><span class="p">},</span>
<span class="w">  </span><span class="p">},</span>
<span class="p">});</span>
</code></pre></div> <h3 id="3-insert-the-template-tags-into-your-templates">3. Insert the template tags into your templates</h3> <p>As I mentioned above, I use Jinja2 for compiling templates. My base templates tend to have a <code>headscripts</code> block which is where <code>&lt;script&gt;</code> tags go.
So my template has a block like this:</p> <div class="highlight"><pre><span></span><code><span class="cp">{%</span> <span class="k">block</span> <span class="nv">headscripts</span> <span class="cp">%}</span>
<span class="cp">{{</span> <span class="nb">super</span><span class="o">()</span> <span class="cp">}}</span>
<span class="cp">{{</span> <span class="nv">vite_hmr_client</span><span class="o">()</span> <span class="cp">}}</span>
<span class="cp">{{</span> <span class="nv">vite_asset</span><span class="o">(</span><span class="s1">&#39;frontend/src/calculator.js&#39;</span><span class="o">)</span> <span class="cp">}}</span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="cp">%}</span>
</code></pre></div> <p>The <code>vite_hmr_client()</code> call creates script tags in development mode to load the development script. And the <code>vite_asset('frontend/src/calculator.js')</code> call loads the actual calculator script.</p> <p>In the django template language it would look more like:</p> <div class="highlight"><pre><span></span><code><span class="cp">{%</span> <span class="k">load</span> <span class="nv">django_vite</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">block</span> <span class="nv">headscripts</span> <span class="cp">%}</span>
<span class="cp">{{</span> <span class="nb">block</span><span class="nv">.super</span> <span class="cp">}}</span>
<span class="cp">{%</span> <span class="k">vite_hmr_client</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">vite_asset</span> <span class="s1">&#39;frontend/src/calculator.js&#39;</span> <span class="cp">%}</span>
<span class="cp">{%</span> <span class="k">endblock</span> <span class="cp">%}</span>
</code></pre></div> <p>At this point you'll also need to add a <code>&lt;div id="calculator"&gt;&lt;/div&gt;</code> element to your template to provide somewhere for your app to live.</p> <h2 id="serving-in-development-mode">Serving in development
mode</h2> <p>When developing your site locally, you'll need to run two different servers.</p> <p>I usually use django's built-in development server by running <code>python manage.py runserver</code>.</p> <p>You'll also need to run the vite server by running <code>npm run dev</code>.</p> <h2 id="serving-the-files-in-production-mode">Serving the files in production mode</h2> <p>When your site is in production, the built files will need to be available for the static file storage engine to use. You can build the files by running <code>npm run build</code>.</p> <p>This will create the files in <code>/frontend/dist/</code>. By default this folder won't be included in Git, so if you're using something like Heroku or Dokku to deploy, then these files won't make it into your staticfiles. There are two ways to get them there:</p> <h3 id="1-use-a-nodejs-buildpack-recommended">1. Use a nodejs buildpack (recommended)</h3> <p>You can add a nodejs buildpack, which will install the needed modules and build your files to the right place. To do this add a file called <code>.buildpacks</code> in the root directory of your app. This file should contain two lines:</p> <div class="highlight"><pre><span></span><code>https://github.com/heroku/heroku-buildpack-nodejs.git
https://github.com/heroku/heroku-buildpack-python.git
</code></pre></div> <p>When you deploy your app to dokku/heroku, it will then install the needed packages from <code>package.json</code> and then automatically run the <code>npm run build</code> command. These files will then be available for django's collectstatic process.</p> <h3 id="2-manually-build-files-every-time-you-deploy">2.
Manually build files every time you deploy</h3> <p>Or you can unignore the <code>/frontend/dist</code> folder from git (or use a different folder to output) and make sure you build the files every time you deploy.</p> <h2 id="appendix-i-example-file-structure">Appendix I - example file structure</h2> <p>This shows an indicative file structure of the key files that need to be in the right place.</p> <div class="highlight"><pre><span></span><code>- config/ # django configuration directory
  - admin.py
  - apps.py
  - asgi.py
  - jinja2.py
  - settings.py # main django settings file
  - urls.py
  - wsgi.py
- frontend/
  - dist/ # created by vite build - should be git ignored
    - assets/
      - main-1a2b3c4e.js # compiled main JS
    - manifest.json
  - src/
    - components/
      - Calculator.vue
      - ... any other files
    - calculator.js
  - vite.config.js # where vite configuration lives
- my-app/ # normal django app
  - migrations/
  - models.py
  - views.py
  - ...etc
- static/
  - ... other static files like CSS, images, etc
- manage.py
</code></pre></div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Wed, 05 Jun 2024 11:05:50 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2024-06-05:/2024/getting-vuejs-3-to-work-nicely-with-django/</guid><category>Code</category><category>Django</category></item><item><title>Using Django's GeneratedField</title><link>https://dkane.net/2024/using-djangos-generatedfield/</link><description><p>Django (the python web framework) version 5.0, released in December 2023, introduced a new field type for models - the <a href="https://docs.djangoproject.com/en/5.0/releases/5.0/#database-generated-model-field"><code>GeneratedField</code></a>.
This is potentially a very handy tool - it allows you to define fields that are based on other fields, but gives you the flexibility and speed that comes with a field belonging to the database, without the overhead of needing to store duplicate data.</p> <p>I recently had an opportunity to use this field in a real life project, so I wanted to test how well it worked.</p> <h2 id="my-use-case">My use case</h2> <p>I work a lot with financial data that comes from regulators, and that data is often wrong or needs cleaning in some way. However, I'm also consuming a "firehose" of data from those regulators, with updates coming every day.</p> <p>What I need is a way of storing the original value that comes from the regulator, along with the ability to overwrite it with a manually checked value. That way I can keep storing the original value from regulator data without needing to re-enter the manual value every time.</p> <p>So if I'm gathering an organisation's turnover in a given year I end up with two fields in the database:</p> <ul> <li><code>turnover_regulator</code> - the value of Turnover from the regulator's data. This could be null if not provided, but would typically be available for most organisations. This field is updated by a nightly data fetching process - any data in this field is overwritten every night.</li> <li><code>turnover_manual</code> - this is a field that is null by default but can be manually changed to a correct value if needed. This field isn't touched by the data fetching process so any changes will be kept.</li> </ul> <p>Both these fields are integer, float or decimal depending on the exact need, and they can both have null values.
So in a django model they would look something like:</p> <div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">models</span>


<span class="k">class</span> <span class="nc">FinancialYear</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
    <span class="n">organisation</span> <span class="o">=</span> <span class="o">...</span>
    <span class="n">financial_year_end</span> <span class="o">=</span> <span class="o">...</span>
    <span class="n">turnover_regulator</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DecimalField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">turnover_manual</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">DecimalField</span><span class="p">(</span><span class="n">null</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div> <p>Once this data is stored, I need some way of finding the best available estimate of turnover for an organisation. This is found by checking whether <code>turnover_manual</code> has a value in it - if it does then use that, if not then use the value in <code>turnover_regulator</code>.</p> <h2 id="the-old-way">The old way</h2> <p>With this setup I would have two options for calculating the final <code>turnover</code> figure:</p> <h3 id="1-django-model-property">1. Django model property</h3> <p>I could set up a property on the model which fetches both values from the database and then uses Python to choose which one to use.
It might look something like:</p> <div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">FinancialYear</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
    <span class="o">...</span>

    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">turnover</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">turnover_manual</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">turnover_manual</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">turnover_regulator</span>
</code></pre></div> <p>This works fine for looking at a single record, and has the advantage of not needing unnecessary storage in the database. However, when dealing with thousands of rows it requires every row to be fetched from the database and then processed in Python, so you can't take advantage of the optimisations that come with calculating things in the database. For example, I wouldn't be able to create a sum of turnover for a set of organisations using the database.</p> <h3 id="2-updating-a-separate-field">2. Updating a separate field</h3> <p>The other way to store this is to create a separate <code>turnover</code> field on the model which you then need to make sure is updated with the right value.</p> <p>You could update it by patching into the model's <code>save()</code> method, although this isn't always triggered when the model is updated in bulk. Or you could have some kind of bulk update process that runs periodically - although this would mean that there are times when the field is out of sync.
Or you may be able to set up a trigger to update the data in the database - but this would decouple the database code from Django and could have performance concerns.</p> <h2 id="using-generatedfield-instead">Using <code>GeneratedField</code> instead</h2> <p>The new <code>GeneratedField</code> gives us a third option. We can use <code>GeneratedField</code> to define an SQL expression, and then the database will make sure that the value is always available to use as if it were any normal field.</p> <p>I did this using the following definition:</p> <div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">FinancialYear</span><span class="p">(</span><span class="n">models</span><span class="o">.</span><span class="n">Model</span><span class="p">):</span>
    <span class="o">...</span>
    <span class="n">turnover</span> <span class="o">=</span> <span class="n">models</span><span class="o">.</span><span class="n">GeneratedField</span><span class="p">(</span>
        <span class="n">expression</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">functions</span><span class="o">.</span><span class="n">Coalesce</span><span class="p">(</span><span class="s2">&quot;turnover_manual&quot;</span><span class="p">,</span> <span class="s2">&quot;turnover_regulator&quot;</span><span class="p">),</span>
        <span class="n">output_field</span><span class="o">=</span><span class="n">models</span><span class="o">.</span><span class="n">DecimalField</span><span class="p">(),</span>
        <span class="n">db_persist</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div> <p>What I'm doing here is defining a field called <code>turnover</code> that uses the <code>Coalesce</code> SQL function to choose the first non-null value from the two fields.
It's a pretty simple use-case, but it is possible to use more complex expressions there too.</p> <p>I've also defined an <code>output_field</code> which tells Django/the database what type of data will be returned, and <code>db_persist</code> which sets whether it's permanently stored in the database or not (I only used this when it was set to <code>True</code> so I haven't seen what the performance impact might be).</p> <p>I found this worked really well - I could access the value in Django, for example adding it as a (read-only) field in admin pages, but it was also available in more complex SQL queries that I was creating to aggregate and filter data. I haven't done any proper benchmarking but I didn't notice any big impacts on performance when using it. And it seemed to work - Django and the database take care of the details of keeping the field up-to-date, so I didn't need to worry about out-of-date data being used.</p> <h2 id="limitations">Limitations</h2> <p>My experience was generally great, but there were a few limitations I found. I've also only tested it in a fairly narrow way - I only used it with SQLite, and only with a narrow set of generated field expressions, so there may be other issues.</p> <h3 id="bugs">Bugs</h3> <p>This is quite a new feature, so I imagine there are a few bugs around that might cause issues when using it.</p> <p>I particularly found <a href="https://code.djangoproject.com/ticket/35350">an issue with creating new objects when the primary key was defined rather than generated</a> - this happened to match my exact use-case! Fortunately, this bug has now been fixed in version 5.0.5 (released this week), but as with any new features there may be others out there.</p> <h3 id="generating-migrations">Generating migrations</h3> <p>At the moment, <code>GeneratedField</code> will work correctly when first created in a migration, but changing the <code>GeneratedField</code> will result in a migration that doesn't work.
There is a simple workaround for this - replace the <code>migrations.AlterField</code> operation that is created with two new migrations - one that does <code>migrations.RemoveField</code> and one that does <code>migrations.AddField</code> (because the field is generated from other data you don't need to worry about losing data).</p> <p>It isn't ideal to have to do this manually every time you change the <code>GeneratedField</code>, but it's not a big problem.</p> <h3 id="using-with-relationships">Using with relationships</h3> <p>As <a href="https://docs.djangoproject.com/en/5.0/ref/models/fields/#generatedfield">the docs explain</a>, <code>GeneratedField</code>s can only access fields from the same model, so it's not possible to access fields in related tables. This does block off some of the potential use cases for a generated field in aggregating data from a related model. But again, not a major blocker.</p> <h2 id="further-reading">Further reading</h2> <ul> <li><a href="https://docs.djangoproject.com/en/5.0/releases/5.0/#database-generated-model-field">Django 5.0 Release notes announcing the feature</a></li> <li><a href="https://docs.djangoproject.com/en/5.0/ref/models/fields/#generatedfield">Django model Documentation</a></li> <li><a href="https://modern-sql.com/caniuse/generated-always-as">sql <code>GENERATED ALWAYS</code> support</a> - this is what Django uses behind the scenes to store the value</li> <li><a href="https://www.paulox.net/2023/11/07/database-generated-columns-part-1-django-and-sqlite/">Another blog post explaining the feature</a></li> </ul></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Wed, 08 May 2024 10:03:01 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2024-05-08:/2024/using-djangos-generatedfield/</guid><category>Code</category><category>Django</category></item><item><title>Blocking bots by UserAgent in dokku using
nginx</title><link>https://dkane.net/2023/blocking-bots-by-useragent-in-dokku-using-nginx/</link><description><h2 id="the-problem">The problem</h2> <p>A website I ran was being aggressively scraped by someone using a particular UserAgent string. It wasn't a normal search engine or similar bot (e.g. GoogleBot) so adding something to robots.txt wasn't going to work.</p> <p>This site is using <a href="https://dokku.com/">dokku</a> to serve the web app, and dokku uses <a href="https://www.nginx.com/">nginx</a> to route requests to the app.</p> <h2 id="the-solution">The solution</h2> <p>You can use dokku's server configuration to add a rule to nginx telling it to return HTTP status 444 when given a particular User Agent string.</p> <p><a href="https://dokku.com/docs/networking/proxies/nginx/#customizing-via-configuration-files-included-by-the-default-templates">Dokku loads any nginx conf files</a> which are stored in the <code>/home/dokku/$APPNAME/nginx.conf.d/</code> directory (replace <code>$APPNAME</code> with the name of your app). These files are loaded into the server block of the main configuration for the app.</p> <p>On your server, first make sure the directory exists by running:</p> <div class="highlight"><pre><span></span><code>mkdir<span class="w"> </span>/home/dokku/<span class="nv">$APPNAME</span>/nginx.conf.d/ </code></pre></div> <p>(again, replace <code>$APPNAME</code> with the app you're using it on).</p> <p>Then create the file. I called mine <code>bots.conf</code>, and used nano to edit it:</p> <div class="highlight"><pre><span></span><code>nano<span class="w"> </span>/home/dokku/<span class="nv">$APPNAME</span>/nginx.conf.d/bots.conf </code></pre></div> <p>Add the following rule to the file, and then save it.</p> <div class="highlight"><pre><span></span><code>if ($http_user_agent = &quot;bot-name&quot;){
    return 444;
}
</code></pre></div> <p><a href="https://http.dev/444">444 is an unofficial status code</a> that nginx uses.
Instead of returning anything, nginx will simply close the connection without a response.</p> <p>My example will close the connection if the user agent string exactly equals "bot-name". You could also use regular expressions here for more flexible matching (see the <a href="https://nginx.org/en/docs/http/ngx_http_rewrite_module.html#if">documentation on <code>if</code> for more details</a>).</p> <p>To exclude more than one bot, simply add multiple rules to the file.</p> <p>Next, make sure that dokku owns the file you've created:</p> <div class="highlight"><pre><span></span><code>chown<span class="w"> </span>dokku:dokku<span class="w"> </span>/home/dokku/<span class="nv">$APPNAME</span>/nginx.conf.d/bots.conf </code></pre></div> <p>Finally, reload nginx - the rule should then be applied.</p> <div class="highlight"><pre><span></span><code>service<span class="w"> </span>nginx<span class="w"> </span>reload </code></pre></div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Wed, 25 Oct 2023 15:16:06 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2023-10-25:/2023/blocking-bots-by-useragent-in-dokku-using-nginx/</guid><category>Code</category></item><item><title>Where to find geographic boundaries for the UK</title><link>https://dkane.net/2022/where-to-find-geographic-boundaries-for-the-uk/</link><description><p>Geographic boundary data is a requirement for any geographic analysis or map-making.
But finding boundary data is <a href="https://twitter.com/mapstack_io/status/1600142044986372097">often a nightmare</a>.</p> <blockquote class="tweet-embed" id="tweet-embed-1600142044986372097" lang="" > <header class="tweet-embed-header" > <a href="https://twitter.com/mapstack_io" class="tweet-embed-user" > <img class="tweet-embed-avatar tweet-embed-avatar-circle" src="" alt="" > <div class="tweet-embed-user-names"> <p class="tweet-embed-user-names-name" >Mapstack </p> @mapstack_io </div> </a> <img class="tweet-embed-logo" alt="" src=''> </header> <section class="tweet-embed-text" > Just Googled for "UK county boundaries as GeoJSON". Results:<br><br>1. <a href='http://data.gov.uk'>data.gov.uk</a> no joy (see screenshot).<br><br>2. <a href='http://gis.stackexchange.com'>gis.stackexchange.com</a> post from 2015, suggesting I use ogr2ogr...<br><br>3. A site that has data, unfortunately it's not UK counties or <a href='https://twitter.com/hashtag/geojson'>#geojson</a>.<br><br><a href='https://twitter.com/hashtag/opendata'>#opendata</a> <a href='https://twitter.com/hashtag/toohard'>#toohard</a> <a href='https://twitter.com/mapstack_io/status/1600142044986372097/photo/1'>pic.x.com/bi5gvhd61d</a> <div class="tweet-embed-media-grid"><a href="https://pbs.twimg.com/media/FjTYsd3WAAEa5yp.png" class="tweet-embed-media-link"><img class="tweet-embed-media" alt="" src=""></a></div> </section> <hr class="tweet-embed-hr"> <footer class="tweet-embed-footer"> <a href="" aria-label="4 likes" class="tweet-embed-meta">❤️ 4</a> <a href="" aria-label="4 replies" class="tweet-embed-meta">💬 4</a> <a href="" aria-label="0 retweets" class="tweet-embed-meta">🔁 0</a> <a href=""><time datetime="" >14:56 - Tue 06 December 2022</time></a> </footer></blockquote> <p>This is a problem I've faced too, so I thought it might be useful to outline some of the places to look for boundary data:</p> <h2 id="ons-open-geography-portal"><a href="https://geoportal.statistics.gov.uk/">ONS Open Geography 
Portal</a></h2> <p>This site - which leaves a lot to be desired in terms of usability &amp; reliability<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup> - is the main place the ONS shares their boundary data. What it lacks in user experience it makes up for in breadth and depth - there are boundaries for lots of different types of area, each with several variations to download.</p> <p>For a given boundary, there are usually four or five different versions at different resolutions. The lower resolution files are generally smaller, but will produce simplified versions of the boundaries rather than the exact boundaries. The more generalised versions are better for things like choropleth maps where you're comparing lots of different areas, and don't need the boundaries to be exactly right. The exact versions are better for when you show just one area in a lot of detail.</p> <p>The different versions included are:</p> <ul> <li>Full resolution - extent of the realm</li> <li>Full resolution - clipped to the coastline</li> <li>Generalised (20m)</li> <li>Super generalised (200m)</li> <li>Ultra generalised (500m)</li> </ul> <p>The "extent of the realm" files include some offshore islands and use the mean low water mark; all the other files use the mean high water mark as a boundary.</p> <p>Most boundaries can be downloaded in GeoJSON format - you can also get KML, Shapefile or other formats.</p> <p>One helpful thing the geoportal provides is historic versions - so if you've used local authority data from 2015 you can get a file that includes local authorities that were active at that time.</p> <p><strong>NOTE:</strong> Stuart Lowe at Open Innovations has found an issue with some of the GeoJSON produced by the Geoportal, and has <a href="https://open-innovations.github.io/ONS-GeoJSON-Fixer/">created a tool to fix the files</a>.
<a href="https://open-innovations.org/blog/2024-08-22-fixing-ons-geoportal-geojson">There's also a blog post with more information about the issue</a>.</p> <h2 id="ordance-survey-boundary-line"><a href="https://www.ordnancesurvey.co.uk/business-government/products/boundaryline">Ordnance Survey Boundary-Line</a></h2> <figure> <p><img alt="A snippet of boundary-line from Newcastle" src="/images/boundaries/boundaryline-newcastle.jpg"></p> <figcaption>Image &copy; Ordnance Survey</figcaption> </figure> <p>Part of the OS's Open Data products, the Boundary-Line product contains a range of administrative boundaries across Great Britain. The boundaries it includes are:</p> <ul> <li>Civil parishes</li> <li>Wards</li> <li>Communities</li> <li>Districts</li> <li>Counties</li> <li>Metropolitan districts</li> <li>Unitary authorities</li> <li>Parliamentary constituencies</li> <li>Electoral divisions</li> <li>London boroughs</li> <li>Greater London authorities</li> <li>Greater London authority assembly constituencies</li> <li>European electoral regions</li> <li>Welsh assembly constituencies</li> <li>Welsh assembly electoral regions</li> <li>Scottish parliamentary constituencies</li> <li>Scottish parliamentary electoral regions</li> </ul> <p>Note that it doesn't include ONS-specific statistical geographies like Lower Super Output Areas and Middle Super Output Areas. There are lots of other boundary types that aren't included either - like health geography.</p> <p>A direct link to <a href="https://osdatahub.os.uk/downloads/open/BoundaryLine">download Boundary-Line can be found on the OS Data Hub</a>. You can get it in a few different formats - GeoPackage is probably the most versatile.
It's quite a big file - over 600MB - and you'll probably need some specialist GIS software (like <a href="https://qgis.org/">QGIS</a>) to use it.</p> <p>The data is available under the Open Government Licence, so you can use it for any purpose.</p> <h2 id="geolookups-by-david-kane"><a href="https://github.com/drkane/geo-lookups/tree/master/boundaries">Geolookups by David Kane</a></h2> <figure> <p><img alt="A map of the Lower Super Output Areas in Cambridgeshire and Peterborough" src="/images/boundaries/geolookup.png"></p> </figure> <p>This is a repository of boundaries that I've put together, using data from the ONS geoportal.</p> <p>The aim of this is to produce 3 GeoJSON files for each Local Authority, Combined Authority or Region. Each file shows the boundaries of either Lower Super Output Areas (LSOA), Middle Super Output Areas (MSOA) or Wards within that area. They're meant to help people who want to map data in a small area. They use a fairly high resolution so you can pick out detail in the areas.</p> <p>If you want to combine GeoJSON files together into one file, you can use my <a href="https://findthatpostcode.uk/tools/merge-geojson">GeoJSON Merge Tool</a>.</p> <h2 id="other-resources">Other Resources</h2> <ul> <li><a href="https://bothness.github.io/geo-draw/"><strong>Draw your own geography</strong></a> - A nice site that allows you to download a set of output areas based on either drawing on the map or choosing an existing area.</li> <li><a href="https://automaticknowledge.co.uk/resources/#UKLocal"><strong>UK local Authorities with insets</strong></a> - a readymade UK map with insets of urban areas. It's worth looking at the rest of the resources on that page - Alasdair Rae's resources are very useful.</li> <li><a href="https://automaticknowledge.org/gb/"><strong>Local authority GIS data</strong></a> - a set of useful GIS data for Local Authorities in Great Britain.
Includes boundaries but also includes roads, surface water, buildings, greenspace, etc - everything you need for making a map. More useful resources from Alasdair.</li> <li><a href="https://flourish.studio/examples/"><strong>Flourish</strong></a> - this is a great tool for making lots of different types of visualisations, and their mapping visualisation has lots of UK boundaries already available. You can also upload GeoJSON files from other sources.</li> <li><a href="https://mapit.mysociety.org/"><strong>MapIt</strong></a> by MySociety and <a href="https://findthatpostcode.uk/"><strong>Find that Postcode</strong></a> by me both allow you to look up an individual area's GeoJSON boundary and access it through an API.</li> <li><a href="https://github.com/open-innovations/geography-bits/tree/master/data"><strong>Open Innovations</strong></a> maintain a collection of GeoJSON files for different areas. (Thanks to <a href="https://mastodon.social/@slowe@mastodon.me.uk">Stuart from OI on mastodon.social</a> for the suggestion).</li> </ul> <h2 id="whats-missing">What's missing?</h2> <p>I'd love to hear from you if there are other resources you find useful - I'm sure to have missed some!
I'm particularly conscious that lots of the resources cover only either England and Wales or Great Britain, and that resources are harder to find for Northern Ireland.</p> <div class="footnote"> <hr> <ol> <li id="fn:1"> <p>The site wasn't working at all when I wrote this blog post!&#160;<a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p> </li> </ol> </div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Tue, 06 Dec 2022 15:08:59 +0000</pubDate><guid isPermaLink="false">tag:dkane.net,2022-12-06:/2022/where-to-find-geographic-boundaries-for-the-uk/</guid><category>Blog</category></item><item><title>What powers charity websites?</title><link>https://dkane.net/2022/what-powers-charity-websites/</link><description><p>I wanted to understand what technologies charities are using with their websites. What <abbr title="Content Management Systems">CMSs</abbr> do they use to create content and lay out the sites? What web technologies are they using - to what extent have they embraced the latest tech on their sites?
And how well are they complying with some of the regulatory requirements for their sites?</p> <p>There's lots more detail below, but here are the headlines:</p> <ul> <li><strong>Nearly half</strong> of UK charities don't have a website, although larger charities are much more likely to have one</li> <li><strong>At least 10%</strong> of charity websites don't work at a given moment</li> <li><strong>At least one third</strong> of charities don't have their charity number on their website</li> <li><strong>WordPress</strong> powers one third of charities' websites</li> <li><strong>jQuery</strong> is present on 71% of charity websites</li> <li><strong>Nearly two thirds</strong> of charities use Google Analytics</li> </ul> <h2 id="which-charities-have-websites">Which charities have websites?</h2> <p>I took a random sample of 1,137 charities from the UK, using data from the three UK charity regulators. The sample was based on those charities with a valid website address. I collected 200 random websites for each group of charities in 5 income bands, plus websites for all 137 charities with over £100m.</p> <p>Only just over half (57%) of UK charities supply a website address to their regulator, although larger organisations are much more likely to - 84% of charities with more than £100,000 annual income have a valid website address.</p> <div class="flourish-embed flourish-chart" data-src="visualisation/8655291"></div> <p>I excluded those charities that use a Facebook page as their website. 750 charities give a facebook.com address - less than 1% of charities with websites, mainly concentrated in the smaller charities.</p> <h2 id="do-the-websites-actually-work">Do the websites actually work?</h2> <p>With my sample of charities, I could check which of these actually return a response. Valid responses were received for 82% of charity sites. Of the remainder, 6% returned a "403 Forbidden" response. 
This suggests that they are aiming to only respond to valid web browser requests, rather than attempts at web scraping (although some of these responses could be because they have given a site address that requires a login).</p> <p>The remaining 11% either returned no response at all, or gave an error of some kind. This was much higher for smaller charities - over 15% of websites for charities under £100,000 did not work at all.</p> <p>Even for larger charities with income over £10m, a worrying proportion of websites just don't work, although some of this may be due to anti-scraping measures.</p> <div class="flourish-embed flourish-chart" data-src="visualisation/8655406"></div> <h2 id="do-charities-include-their-charity-number-on-their-website">Do charities include their charity number on their website?</h2> <p>Charities should <a href="https://www.wrighthassall.co.uk/knowledge-base/guide-to-advertising-charitable-status-on-all-written-materials">display their charity number</a> on their website, so that members of the public can see that they are regulated as a charity.</p> <p>Most charities complied with this, but a significant minority (40% of the working websites checked) don't have their registered charity number on their website. There are some caveats to this - I only checked the HTML of the website homepage, so if a site is entirely powered by JavaScript then I won't pick up the content of the page as a real user might see it. 
But I think it's a reasonable estimate.</p> <p>And although larger charities do slightly better at having their charity number visible on the website, it's still nearly a quarter of the largest charities with incomes over £10m that don't.</p> <div class="flourish-embed flourish-chart" data-src="visualisation/8655836"></div> <h2 id="what-content-management-systems-do-charities-use">What Content Management Systems do charities use?</h2> <p>I did some keyword searching of the website homepage to try and detect which content management system (CMS) the website uses. For example, if I found the text <code>wp-content</code> in the HTML of the homepage, that's a good indication that they are using WordPress.</p> <p>These keywords aren't perfect - I detected the CMS for about 58% of websites. The CMS was harder to detect for larger charities - they are more likely to use a custom CMS, or one that isn't easily revealed through the HTML.</p> <p>WordPress was the most popular CMS, powering 36% of sites. This is followed by Drupal, which powered 9% of sites. WordPress is more commonly used by smaller charities, whereas Drupal is more popular with larger ones.</p> <div class="flourish-embed flourish-chart" data-src="visualisation/8606184"></div> <h2 id="which-web-technologies-do-charities-use">Which web technologies do charities use?</h2> <p>I also searched the HTML text for a set of web technologies that could be present on pages. 
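</p> <p>As a rough illustration of this kind of keyword search, here is a minimal sketch in Python. It is not the code used for the analysis: <code>MARKERS</code> and <code>detect_technologies</code> are hypothetical names, and apart from <code>wp-content</code> (mentioned above) the marker strings are assumptions about what tends to show up in a page's source.</p>

```python
# Hypothetical marker strings: only "wp-content" comes from the post itself;
# the others are assumptions about what commonly appears in page source.
MARKERS = {
    "WordPress": ["wp-content"],
    "jQuery": ["jquery"],
    "Google Fonts": ["fonts.googleapis.com"],
    "Bootstrap": ["bootstrap"],
}


def detect_technologies(page_source):
    """Return the sorted names of technologies whose markers appear in the source."""
    lowered = page_source.lower()  # make the substring search case-insensitive
    return sorted(
        name
        for name, keywords in MARKERS.items()
        if any(keyword in lowered for keyword in keywords)
    )


# A fragment of homepage source (tag brackets omitted for brevity).
sample_source = (
    'link href="/wp-content/themes/example/style.css" '
    'script src="/wp-includes/js/jquery/jquery.min.js"'
)
print(detect_technologies(sample_source))
```

<p>Plain substring matching like this is cheap to run across a whole sample, but it only sees what is in the initial HTML, so it can miss technologies that are loaded later by scripts.</p> <p>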
The set of technologies I searched for was a fairly arbitrary list, based on existing knowledge and a little bit of checking of some of the responses - so any suggestions for what else to search for would be very useful.</p> <p>The tools I looked for, and the proportion of sites that appeared to use the tool, were:</p> <h3 id="web-technologies-and-javascriptcss-frameworks">Web technologies and JavaScript/CSS frameworks</h3> <ul> <li>site claims to be in HTML5 (<code>&lt;!DOCTYPE html&gt;</code> - <strong>91%</strong> of sites)</li> <li>uses jQuery (<strong>71%</strong>)</li> <li>uses Google fonts (<strong>44%</strong>)</li> <li>uses Bootstrap CSS framework (<strong>13%</strong>)</li> <li>uses React JS framework (<strong>6%</strong>)</li> </ul> <h3 id="analytics-and-social-media">Analytics and social media</h3> <ul> <li>Mentions cookies - potentially a proxy for whether a cookie popup is shown (<strong>72%</strong> of sites)</li> <li>Google Analytics is enabled (<strong>62%</strong>)</li> <li>Twitter cards or Open Graph metadata available (<strong>62%</strong>)</li> <li>Facebook tracking enabled (<strong>14%</strong>)</li> </ul> <h2 id="assistive-technology">Assistive technology</h2> <p>Note that these are pretty poor proxies for how accessible a website is for a range of users.</p> <ul> <li>Includes <code>alt=</code> somewhere on the page (could indicate that alt-text is included for some images - <strong>93%</strong>)</li> <li>Includes an <code>aria</code> property somewhere on the page (<strong>72%</strong>)</li> <li>Uses the browsealoud extension (<strong>1.2%</strong>)</li> </ul> <p>As shown below, use of these tools changes based on the size of the charity. 
Larger charities are more likely to use Google Analytics &amp; Facebook tracking, more likely to mention cookies and use Open Graph or Twitter Cards.</p> <div class="flourish-embed flourish-chart" data-src="visualisation/8606173"></div> <script src="https://public.flourish.studio/resources/embed.js"></script></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Wed, 09 Feb 2022 16:06:00 +0000</pubDate><guid isPermaLink="false">tag:dkane.net,2022-02-09:/2022/what-powers-charity-websites/</guid><category>Blog</category><category>Website</category></item><item><title>Better horizontal bar charts with plotly</title><link>https://dkane.net/2020/better-horizontal-bar-charts-with-plotly/</link><description><p>I often find that horizontal bar charts are a great way of visualising comparisons between different categories. They're easy to understand and make, and provide a really simple way of displaying data.</p> <p>But I've found the default way of labelling them often doesn't make sense. Labels for the bars are usually presented to the left of the axis, and often have to be crammed together, split across multiple lines or truncated to make them readable. </p> <p>Some charting tools can produce a much easier to read version, where the labels for the bars appear on a separate line to the bars themselves. This gives the data itself more room to breathe and makes the labels less cramped.</p> <p>Here's an example from <a href="https://www.datawrapper.de/">DataWrapper</a> using some of their sample data. 
First using their default settings:</p> <figure> <p><img alt="Bar chart showing labels to the left of the vertical axis" src="/images/horizontal-bar-1.png"></p> </figure> <p>And then with the "labels on separate line" option selected:</p> <figure> <p><img alt="Bar chart showing labels above each of the bars" src="/images/horizontal-bar-2.png"></p> </figure> <h2 id="implementing-in-plotly">Implementing in plotly</h2> <p>I've been using <a href="https://plotly.com/python/">Plotly</a> for charting a lot recently, particularly in conjunction with <a href="https://dash.plotly.com/">Dash</a> for making rich interactive dashboards. It's a very powerful combination.</p> <p>I built <a href="">360Insights</a> using Dash, and plotly powers the charts on <a href="https://ngoexplorer.org/">NGO Explorer</a>. I wanted to record how I created the charts with labels above the bars in NGO Explorer. You can see some examples of these on a <a href="https://ngoexplorer.org/country/fra">country page</a>.</p> <figure> <p><img alt="Bar chart from NGO Explorer showing labels above each of the bars" src="/images/horizontal-bar-3.png"></p> </figure> <h2 id="making-subplots">Making subplots</h2> <p>The only way I could find to accomplish the same effect in plotly was to use <a href="https://plotly.com/python/subplots/">subplots</a>. These essentially split the figure into a number of different charts. What we're going to do is use a different subplot for each bar, using the <code>make_subplots</code> function. 
Start by importing the <code>make_subplots</code> function.</p> <div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">plotly.subplots</span> <span class="kn">import</span> <span class="n">make_subplots</span> </code></pre></div> <p>Add our data to the <code>categories</code> variable.</p> <div class="highlight"><pre><span></span><code><span class="n">categories</span> <span class="o">=</span> <span class="p">[</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;Musée du Louvre, Paris&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">10200000</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;National Museum of China, Beijing&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">8610092</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;Metropolitan Museum of Art, New York City&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">6953927</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;Vatican Museums, Vatican City&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">6756186</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;Tate Modern, London&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">5868562</span><span class="p">},</span> <span class="p">{</span><span 
class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;British Museum, London&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">5820000</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;National Gallery, London&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">5735831</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;National Gallery of Art, Washington D.C.&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">4404212</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;State Hermitage Museum, Saint Petersburg&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">4220000</span><span class="p">},</span> <span class="p">{</span><span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;Victoria and Albert Museum, London&quot;</span><span class="p">,</span> <span class="s2">&quot;value&quot;</span><span class="p">:</span> <span class="mi">3967566</span><span class="p">},</span> <span class="p">]</span> </code></pre></div> <p>Create as many subplots as there are categories. For each subplot we'll set the title to the name of the museum using the <code>subplot_titles</code> parameter. 
We also set <code>shared_xaxes=True</code> to ensure that all the bars are still in proportion to each other, and <code>print_grid=False</code> so that plotly doesn't print a text summary of the subplot grid when the figure is created.</p> <p>The <code>vertical_spacing</code> parameter is important, but can be tricky to get right. I've found that 0.45 divided by the number of categories gives a good spacing between them, but you may need to play around with it.</p> <p>I've also removed the background colour from the subplots, and set the width of the plots.</p> <div class="highlight"><pre><span></span><code><span class="n">subplots</span> <span class="o">=</span> <span class="n">make_subplots</span><span class="p">(</span> <span class="n">rows</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">),</span> <span class="n">cols</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">subplot_titles</span><span class="o">=</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;name&quot;</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">categories</span><span class="p">],</span> <span class="n">shared_xaxes</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">print_grid</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">vertical_spacing</span><span class="o">=</span><span class="p">(</span><span class="mf">0.45</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">)),</span> <span class="p">)</span> <span class="n">_</span> <span class="o">=</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">]</span><span 
class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="n">width</span><span class="o">=</span><span class="mi">550</span><span class="p">,</span> <span class="n">plot_bgcolor</span><span class="o">=</span><span class="s1">&#39;#fff&#39;</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> <p>Next we go through each category and add it to the relevant subplot, using the <code>add_trace</code> function.</p> <p>This function takes a dictionary with the values and settings for the bar (notice that the <code>y</code> and <code>x</code> values are wrapped in lists because plotly would normally expect a list of values here). </p> <p>We'll set the marker so they're all the same colour, and add a text label for the amount to the inside of the bar.</p> <div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">categories</span><span class="p">):</span> <span class="n">subplots</span><span class="o">.</span><span class="n">add_trace</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span> <span class="nb">type</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">,</span> <span class="n">orientation</span><span class="o">=</span><span class="s1">&#39;h&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;name&quot;</span><span class="p">]],</span> <span class="n">x</span><span class="o">=</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">]],</span> <span class="n">text</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;</span><span 
class="si">{:,.0f}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">])],</span> <span class="n">hoverinfo</span><span class="o">=</span><span class="s1">&#39;text&#39;</span><span class="p">,</span> <span class="n">textposition</span><span class="o">=</span><span class="s1">&#39;auto&#39;</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span> <span class="n">color</span><span class="o">=</span><span class="s2">&quot;#7030a0&quot;</span><span class="p">,</span> <span class="p">),</span> <span class="p">),</span> <span class="n">k</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> </code></pre></div> <figure> <p><img alt="After adding traces" src="/images/plotly-bar-stage-1.png"></p> </figure> <p>This gives us subplots that are much closer to what we're looking for, but they look a bit of a mess as there's still quite a bit of the layout we need to adjust.</p> <p>First we remove the legend on the right hand side and make sure the titles (stored by plotly as annotations) are aligned to the left hand side, and have smaller font sizes.</p> <div class="highlight"><pre><span></span><code><span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="n">showlegend</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">subplots</span><span class="p">[</span><span class="s2">&quot;layout&quot;</span><span class="p">][</span><span 
class="s1">&#39;annotations&#39;</span><span class="p">]:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;x&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;xanchor&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;left&#39;</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;align&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;left&#39;</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;font&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> <figure> <p><img alt="After removing annotations and images" src="/images/plotly-bar-stage-2.png"></p> </figure> <p>Next we remove all the axes to give a clean look to the chart. If you wanted, you could keep the x-axis to help see the values. 
</p> <div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">axis</span> <span class="ow">in</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">]:</span> <span class="k">if</span> <span class="n">axis</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;yaxis&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="n">axis</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;xaxis&#39;</span><span class="p">):</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="n">axis</span><span class="p">][</span><span class="s1">&#39;visible&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">False</span> </code></pre></div> <p>Finally we remove the margins at the left hand side, and calculate the height so that the categories fit nicely.</p> <div class="highlight"><pre><span></span><code><span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="s1">&#39;margin&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="s1">&#39;l&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;t&#39;</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="p">}</span> <span class="n">height_calc</span> <span class="o">=</span> <span class="mi">45</span> <span class="o">*</span> <span class="nb">len</span><span 
class="p">(</span><span class="n">categories</span><span class="p">)</span> <span class="n">height_calc</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="n">height_calc</span><span class="p">,</span> <span class="mi">350</span><span class="p">])</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="s1">&#39;height&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">height_calc</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="s1">&#39;width&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">height_calc</span> <span class="n">subplots</span> </code></pre></div> <h2 id="the-finished-product">The finished product</h2> <script type="text/javascript">window.PlotlyConfig = {MathJaxConfig: 'local'};</script> <script src="https://cdn.plot.ly/plotly-latest.min.js"></script> <div id="6165b01c-fa55-4212-99f5-9b48ce3a75d5" class="plotly-graph-div" style="height:450px; width:450px;"></div> <script type="text/javascript"> window.PLOTLYENV=window.PLOTLYENV || {}; if (document.getElementById("6165b01c-fa55-4212-99f5-9b48ce3a75d5")) { Plotly.newPlot( '6165b01c-fa55-4212-99f5-9b48ce3a75d5', [{"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["10,200,000"], "textposition": "auto", "type": "bar", "x": [10200000], "xaxis": "x", "y": ["Mus\u00e9e du Louvre, Paris"], "yaxis": "y"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["8,610,092"], "textposition": "auto", "type": "bar", "x": [8610092], "xaxis": "x2", "y": ["National Museum of China, Beijing"], "yaxis": "y2"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["6,953,927"], "textposition": "auto", "type": "bar", "x": [6953927], "xaxis": "x3", "y": ["Metropolitan 
Museum of Art, New York City"], "yaxis": "y3"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["6,756,186"], "textposition": "auto", "type": "bar", "x": [6756186], "xaxis": "x4", "y": ["Vatican Museums, Vatican City"], "yaxis": "y4"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["5,868,562"], "textposition": "auto", "type": "bar", "x": [5868562], "xaxis": "x5", "y": ["Tate Modern, London"], "yaxis": "y5"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["5,820,000"], "textposition": "auto", "type": "bar", "x": [5820000], "xaxis": "x6", "y": ["British Museum, London"], "yaxis": "y6"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["5,735,831"], "textposition": "auto", "type": "bar", "x": [5735831], "xaxis": "x7", "y": ["National Gallery, London"], "yaxis": "y7"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["4,404,212"], "textposition": "auto", "type": "bar", "x": [4404212], "xaxis": "x8", "y": ["National Gallery of Art, Washington D.C."], "yaxis": "y8"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["4,220,000"], "textposition": "auto", "type": "bar", "x": [4220000], "xaxis": "x9", "y": ["State Hermitage Museum, Saint Petersburg"], "yaxis": "y9"}, {"hoverinfo": "text", "marker": {"color": "#7030a0"}, "orientation": "h", "text": ["3,967,566"], "textposition": "auto", "type": "bar", "x": [3967566], "xaxis": "x10", "y": ["Victoria and Albert Museum, London"], "yaxis": "y10"}], {"annotations": [{"align": "left", "font": {"size": 12}, "showarrow": false, "text": "Mus\u00e9e du Louvre, Paris", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.9999999999999999, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "National Museum of China, Beijing", "x": 0, "xanchor": "left", "xref": "paper", "y": 
0.8955, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "Metropolitan Museum of Art, New York City", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.791, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "Vatican Museums, Vatican City", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.6865, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "Tate Modern, London", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.582, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "British Museum, London", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.4775, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "National Gallery, London", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.373, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "National Gallery of Art, Washington D.C.", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.26849999999999996, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "State Hermitage Museum, Saint Petersburg", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.16399999999999998, "yanchor": "bottom", "yref": "paper"}, {"align": "left", "font": {"size": 12}, "showarrow": false, "text": "Victoria and Albert Museum, London", "x": 0, "xanchor": "left", "xref": "paper", "y": 0.0595, "yanchor": "bottom", "yref": "paper"}], "height": 450, "margin": {"b": 1, "l": 0, "r": 0, "t": 20}, "plot_bgcolor": "#fff", "showlegend": false, "template": {"data": {"bar": [{"error_x": {"color": "#2a3f5f"}, "error_y": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}, "type": "bar"}], "barpolar": [{"marker": {"line": {"color": "#E5ECF6", 
"width": 0.5}}, "type": "barpolar"}], "carpet": [{"aaxis": {"endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f"}, "baxis": {"endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f"}, "type": "carpet"}], "choropleth": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "type": "choropleth"}], "contour": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "type": "contour"}], "contourcarpet": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "type": "contourcarpet"}], "heatmap": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "type": "heatmap"}], "heatmapgl": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "type": "heatmapgl"}], "histogram": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "histogram"}], "histogram2d": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, 
"#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "type": "histogram2d"}], "histogram2dcontour": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "type": "histogram2dcontour"}], "mesh3d": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "type": "mesh3d"}], "parcoords": [{"line": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "parcoords"}], "pie": [{"automargin": true, "type": "pie"}], "scatter": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatter"}], "scatter3d": [{"line": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatter3d"}], "scattercarpet": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattercarpet"}], "scattergeo": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattergeo"}], "scattergl": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattergl"}], "scattermapbox": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattermapbox"}], "scatterpolar": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatterpolar"}], "scatterpolargl": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatterpolargl"}], "scatterternary": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatterternary"}], "surface": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, 
"#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "type": "surface"}], "table": [{"cells": {"fill": {"color": "#EBF0F8"}, "line": {"color": "white"}}, "header": {"fill": {"color": "#C8D4E3"}, "line": {"color": "white"}}, "type": "table"}]}, "layout": {"annotationdefaults": {"arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1}, "coloraxis": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "colorscale": {"diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequential": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]], "sequentialminus": [[0.0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1.0, "#f0f921"]]}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"], "font": {"color": "#2a3f5f"}, "geo": {"bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white"}, "hoverlabel": {"align": "left"}, "hovermode": "closest", "mapbox": {"style": "light"}, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": {"angularaxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "linecolor": "white", 
"ticks": ""}}, "scene": {"xaxis": {"backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white"}, "yaxis": {"backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white"}, "zaxis": {"backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white"}}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "ternary": {"aaxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}, "baxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}}, "title": {"x": 0.05}, "xaxis": {"automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "white", "zerolinewidth": 2}, "yaxis": {"automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "white", "zerolinewidth": 2}}}, "width": 450, "xaxis": {"anchor": "y", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis10": {"anchor": "y10", "domain": [0.0, 1.0], "visible": false}, "xaxis2": {"anchor": "y2", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis3": {"anchor": "y3", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis4": {"anchor": "y4", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis5": {"anchor": "y5", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis6": {"anchor": "y6", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis7": {"anchor": "y7", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, 
"visible": false}, "xaxis8": {"anchor": "y8", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "xaxis9": {"anchor": "y9", "domain": [0.0, 1.0], "matches": "x10", "showticklabels": false, "visible": false}, "yaxis": {"anchor": "x", "domain": [0.9404999999999999, 0.9999999999999999], "visible": false}, "yaxis10": {"anchor": "x10", "domain": [0.0, 0.0595], "visible": false}, "yaxis2": {"anchor": "x2", "domain": [0.836, 0.8955], "visible": false}, "yaxis3": {"anchor": "x3", "domain": [0.7315, 0.791], "visible": false}, "yaxis4": {"anchor": "x4", "domain": [0.627, 0.6865], "visible": false}, "yaxis5": {"anchor": "x5", "domain": [0.5225, 0.582], "visible": false}, "yaxis6": {"anchor": "x6", "domain": [0.418, 0.4775], "visible": false}, "yaxis7": {"anchor": "x7", "domain": [0.3135, 0.373], "visible": false}, "yaxis8": {"anchor": "x8", "domain": [0.209, 0.26849999999999996], "visible": false}, "yaxis9": {"anchor": "x9", "domain": [0.1045, 0.16399999999999998], "visible": false}}, {"responsive": true} ) }; </script> <p>I think the final result is a lot clearer than the default chart for the data. 
I've put all of this together in a function to make these types of charts easier to create.</p> <p>You can play around with the settings, either as part of <code>make_subplots</code>, the individual bars (<code>add_trace</code>) or the layout (make changes to <code>subplots['layout']</code>).</p> <div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">plotly.subplots</span> <span class="kn">import</span> <span class="n">make_subplots</span> <span class="k">def</span> <span class="nf">horizontal_bar_labels</span><span class="p">(</span><span class="n">categories</span><span class="p">):</span> <span class="n">subplots</span> <span class="o">=</span> <span class="n">make_subplots</span><span class="p">(</span> <span class="n">rows</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">),</span> <span class="n">cols</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">subplot_titles</span><span class="o">=</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;name&quot;</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">categories</span><span class="p">],</span> <span class="n">shared_xaxes</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">print_grid</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">vertical_spacing</span><span class="o">=</span><span class="p">(</span><span class="mf">0.45</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">)),</span> <span class="p">)</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">]</span><span 
class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="n">width</span><span class="o">=</span><span class="mi">550</span><span class="p">,</span> <span class="n">plot_bgcolor</span><span class="o">=</span><span class="s1">&#39;#fff&#39;</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># add bars for the categories</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">categories</span><span class="p">):</span> <span class="n">subplots</span><span class="o">.</span><span class="n">add_trace</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span> <span class="nb">type</span><span class="o">=</span><span class="s1">&#39;bar&#39;</span><span class="p">,</span> <span class="n">orientation</span><span class="o">=</span><span class="s1">&#39;h&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;name&quot;</span><span class="p">]],</span> <span class="n">x</span><span class="o">=</span><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">]],</span> <span class="n">text</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;</span><span class="si">{:,.0f}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">])],</span> <span class="n">hoverinfo</span><span class="o">=</span><span class="s1">&#39;text&#39;</span><span class="p">,</span> <span class="n">textposition</span><span class="o">=</span><span class="s1">&#39;auto&#39;</span><span class="p">,</span> <span 
class="n">marker</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span> <span class="n">color</span><span class="o">=</span><span class="s2">&quot;#7030a0&quot;</span><span class="p">,</span> <span class="p">),</span> <span class="p">),</span> <span class="n">k</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># update the layout</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">update</span><span class="p">(</span> <span class="n">showlegend</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">subplots</span><span class="p">[</span><span class="s2">&quot;layout&quot;</span><span class="p">][</span><span class="s1">&#39;annotations&#39;</span><span class="p">]:</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;x&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;xanchor&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;left&#39;</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;align&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;left&#39;</span> <span class="n">x</span><span class="p">[</span><span class="s1">&#39;font&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">size</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="p">)</span> <span class="c1"># hide the axes</span> <span class="k">for</span> <span class="n">axis</span> <span 
class="ow">in</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">]:</span> <span class="k">if</span> <span class="n">axis</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;yaxis&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="n">axis</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;xaxis&#39;</span><span class="p">):</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="n">axis</span><span class="p">][</span><span class="s1">&#39;visible&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="kc">False</span> <span class="c1"># update the margins and size</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="s1">&#39;margin&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="s1">&#39;l&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;t&#39;</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span> <span class="s1">&#39;b&#39;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="p">}</span> <span class="n">height_calc</span> <span class="o">=</span> <span class="mi">45</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">)</span> <span class="n">height_calc</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="n">height_calc</span><span class="p">,</span> <span 
class="mi">350</span><span class="p">])</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="s1">&#39;height&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">height_calc</span> <span class="n">subplots</span><span class="p">[</span><span class="s1">&#39;layout&#39;</span><span class="p">][</span><span class="s1">&#39;width&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">height_calc</span> <span class="k">return</span> <span class="n">subplots</span> </code></pre></div> <h2 id="a-note-on-log-axes">A note on log axes</h2> <p>If you're using a log axis, as some of the charts on <a href="https://ngoexplorer.org/">ngoexplorer.org</a> do, then the implementation above may not work due to the different way that log ranges are calculated.</p> <p>Adding something similar to this to the function above should help (where <code>log_axis</code> is a boolean variable, <code>hb_plot</code> is the figure and <code>max_value</code> is the largest value being charted; it also needs <code>import math</code>):</p> <div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">hb_plot</span><span class="p">[</span><span class="s2">&quot;layout&quot;</span><span class="p">]:</span> <span class="k">if</span> <span class="n">x</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">&quot;xaxis&quot;</span><span class="p">):</span> <span class="k">if</span> <span class="n">log_axis</span><span class="p">:</span> <span class="n">hb_plot</span><span class="p">[</span><span class="s2">&quot;layout&quot;</span><span class="p">][</span><span class="n">x</span><span class="p">][</span><span class="s2">&quot;type&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;log&quot;</span> <span class="n">hb_plot</span><span class="p">[</span><span class="s2">&quot;layout&quot;</span><span class="p">][</span><span class="n">x</span><span class="p">][</span><span 
class="s2">&quot;range&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="n">max_value</span><span class="p">))</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">else</span><span class="p">:</span> <span class="n">hb_plot</span><span class="p">[</span><span class="s2">&quot;layout&quot;</span><span class="p">][</span><span class="n">x</span><span class="p">][</span><span class="s2">&quot;range&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_value</span> <span class="o">*</span> <span class="mf">1.1</span><span class="p">]</span> </code></pre></div> <h2 id="updated-2021-11-25">Updated 2021-11-25</h2> <p>Thanks to Chris Caudill for pointing out a missing part of the function - this is now included. I've also added the section at the end on log axes.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Tue, 16 Jun 2020 15:46:51 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2020-06-16:/2020/better-horizontal-bar-charts-with-plotly/</guid><category>Code</category><category>Charts</category><category>Plotly</category></item><item><title>How many charity employees have been furloughed?</title><link>https://dkane.net/2020/how-many-charity-employees-have-been-furloughed/</link><description><p><em>This blog post is a write-up of a <a href="https://twitter.com/kanedr/status/1271395238331011080">twitter thread</a> exploring data available. 
These are very rough estimates, based on data that's available at the time of writing.</em></p> <p>It's frustratingly difficult to find out how many charities have used the government's Coronavirus Job Retention Scheme, or how many employees have been furloughed through it.</p> <p>For the rest of the economy, HMRC has <a href="https://www.gov.uk/government/statistics/coronavirus-job-retention-scheme-statistics-june-2020">produced "experimental statistics"</a> with estimates of the number of employers and employees that have used the scheme. These statistics are based on the data HMRC uses to administer the scheme. As well as overall totals, they've been able to produce estimates of how use of the scheme has varied across sectors of the economy (aka "industries", using SIC2007) and geography (local authorities and regions).</p> <p>Unfortunately it's not possible to directly extract charities from this data - charity is a legal form, and charities work in lots of different industries as classified by SIC2007 - health, arts, social care, education, etc. So the industry breakdown doesn't allow charities to be identified.</p> <h2 id="estimating-using-sector-data">Estimating using sector data</h2> <p>However, even though the industry data doesn't allow us to directly estimate the numbers of charity employers and employees using the scheme, we can use the industry breakdowns as a base for some rough estimates. </p> <p>If we make the assumption that charities behave similarly to other organisations in their industry (which may not be true) then if we know the proportion of organisations in each industry that are charities, we can apply that proportion to the number of employers and employees to make an estimate of the figures for charities.</p> <p>But how to get an estimate of the proportion of companies that are charities? Here I've turned to Companies House and Charity Commission data. 
Using data extracts from these two regulators, it's possible to get a list of companies, their SIC codes (for their industry) and whether or not they are a registered charity. From this we can produce a list of industries and the proportion of organisations in each that are charities.</p> <p>But is this a fair picture? Is the number of companies a proxy for the number of employees? Well possibly, but it's probably more complicated. We know that employees aren't evenly distributed in any sector of the economy - <a href="https://data.ncvo.org.uk/workforce/#by-size">NCVO's data shows that 40-50% of employees in the private &amp; voluntary sectors work in organisations with more than 50 employees</a>.</p> <p>There are lots of small charities without paid staff, and similarly there are hundreds of thousands of small registered companies that don't employ people, or are currently dormant. To strip these out, I used a field in the company data that describes the type of accounts the company last submitted - this is based on their size. By limiting it to accounts returned by active companies, we get a better sense of the proportion of "active" companies that are charities. </p> <h2 id="results-of-these-estimates">Results of these estimates</h2> <p>The two methods - using all companies and then using larger "active" companies - give two different figures for the proportion of registered companies that are charities. Across all companies, charities represent 0.7% of companies, compared to 2% of "active" companies. The latter figure perhaps feels instinctively more correct - we know from NCVO that charity employees are around 3% of the workforce.</p> <p>The figures give the proportions for industries. These proportions feel correct - charities are concentrated in 4 SIC industries: education (20% of companies), health and social work (12% of companies), arts &amp; recreation (12% of companies), and trade union, religious, political and repair (8% of companies). 
They're also absent where you would expect - mining, manufacturing and construction show negligible numbers of charities (the SIC classification system is definitely not perfect in every case).</p> <iframe title="% of companies that are charities, by industry" aria-label="Range Plot" id="datawrapper-chart-oCE6u" src="https://datawrapper.dwcdn.net/oCE6u/1/" scrolling="no" frameborder="0" style="border: none;" width="600" height="605"></iframe> <p>And using these proportions we can make estimates of the number of charities that are using the furlough scheme, and the number of employees. Using the proportions for all companies gives a figure of 8,000 charities and 69,000 employees, but using just larger active companies gives <strong>22,000 charities</strong> and <strong>188,000 employees</strong>.</p> <p>While there are a number of assumptions going into these figures, I think they feel like sensible estimates of the boundaries of the actual number, though the larger one is perhaps closer to the truth.</p> <iframe title="Estimated number of charity employees furloughed" aria-label="Range Plot" id="datawrapper-chart-oX8S0" src="https://datawrapper.dwcdn.net/oX8S0/1/" scrolling="no" frameborder="0" style="border: none;" width="600" height="586"></iframe> <h2 id="comparing-with-survey-results">Comparing with survey results</h2> <p>When I shared these results on Twitter, there were some helpful pointers to attempts to answer the same questions using surveys.</p> <p>Nick Temple of the Social Investment Business estimates that 13% of their customer base's employees are on furlough.</p> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Thanks for doing this David; even a rough guess is helpful. Recent analysis of our customer base came up with around 13% of all employees on furlough. 
Though SIB customer base tends to be medium, rather than small/micro.</p>&mdash; Nick Temple (@nicktemple1) <a href="https://twitter.com/nicktemple1/status/1271438133473705986?ref_src=twsrc%5Etfw">June 12, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <p>Charity Tax Group &amp; Charity Retail suggest that many of the 26,000 charity shop employees have been furloughed - the unique nature of these employees (similar private businesses do not have retail employees) means that they are potentially missed from my estimate.</p> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Based on recent discussions with <a href="https://twitter.com/CharityRetail?ref_src=twsrc%5Etfw">@CharityRetail</a> there are 26,000 FTE charity shop employees &amp; many of them have been furloughed. <a href="https://twitter.com/CivilSocietyUK?ref_src=twsrc%5Etfw">@CivilSocietyUK</a> estimated 20,000 from the 20 largest charities some time back. Very difficult to be certain though <a href="https://t.co/FZhj3jdC0e">https://t.co/FZhj3jdC0e</a></p>&mdash; Charity Tax Group (@CharityTaxGroup) <a href="https://twitter.com/CharityTaxGroup/status/1271440561581498370?ref_src=twsrc%5Etfw">June 12, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <p>A Charity Tax Group survey showed 70% of respondents had made a claim.</p> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">This does not provide specific figures but a recent CTG snapshot poll showed high level of take-up organisationally (without specific responses on furloughed staff per charity) <a href="https://t.co/6Z7gxsMeab">pic.twitter.com/6Z7gxsMeab</a></p>&mdash; Charity Tax Group (@CharityTaxGroup) <a href="https://twitter.com/CharityTaxGroup/status/1271441766068166656?ref_src=twsrc%5Etfw">June 12, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <p>66% of 
respondents to the <a href="https://www.probonoeconomics.com/news/pbe-launches-new-survey-understand-impact-covid-19-charity-sector">Pro Bono Economics</a> survey (9-10 June) said they had furloughed staff or used the job retention scheme.</p> <h2 id="what-next">What next?</h2> <p>These estimates are probably wrong. They're based on imperfect data and assumptions that are unlikely to be correct. However, given the absence of better data, I think they give an indication of the scale we would expect to see in the figures.</p> <p>ONS and HMRC have done a fantastic job during this crisis of producing statistics to understand the environment and measure responses. But charities can often fall through the cracks in official statistics - due to lack of data, or because they're not thought about when the measures are designed.</p> <p>I'd like to see better official data on charities in this crisis, to help the charity sector measure the impact and respond. </p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Fri, 12 Jun 2020 16:06:00 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2020-06-12:/2020/how-many-charity-employees-have-been-furloughed/</guid><category>Blog</category><category>Covid19</category></item><item><title>Working with XBRL</title><link>https://dkane.net/2018/working-with-xbrl/</link><description><p>Some notes on the XBRL data format, as I've been working with it this week. I'm working with Power to Change to help them extract data from accounts.</p> <h2 id="whats-xbrlixbrl">What's XBRL/iXBRL?</h2> <p>XBRL stands for eXtensible Business Reporting Language. iXBRL is the same but "inline". They describe a data standard for financial reporting, based on XML. 
In the UK it's used by Companies House and HMRC for reporting - instead of submitting a PDF or paper copy of their accounts (which contains no machine-readable version of the data within) companies can (and in some cases must) submit in XBRL format.</p> <p>The iXBRL format differs slightly from XBRL in that the machine-readable financial data is contained within a human-readable HTML file. That way you get the best of both worlds - documents that can be read by people interested in that company, with data that can be extracted in a consistent way.</p> <p>That's the theory, anyway.</p> <h2 id="whats-in-xbrl">What's in XBRL?</h2> <p>In my mind, XBRL has two main parts:</p> <ol> <li>The XBRL <strong>format</strong>, describing how to represent various concepts within an XML or HTML document. The formats for XBRL and iXBRL are slightly different - you should be able to convert an iXBRL document to an XBRL one, but that wouldn't work the other way round.</li> <li><strong>Schemas</strong> which are applied on top of the XBRL format to represent financial data in a specific framework - for example the FRS accounting framework or the Charity SORP. </li> </ol> <p>The format describes what the data should look like, while the schema describes what each individual data item refers to. </p> <h2 id="how-do-you-make-xbrl">How do you make XBRL?</h2> <p>This isn't a particularly relevant question for me - for this exercise I'm more interested in how to get data out of the documents.</p> <p>But I think most XBRL documents are produced by standard accounting packages. There are also specialist firms like <a href="http://ixbrl-assist.co.uk/">Hypercube Consultancy</a> who provide tagging services for accounts. 
Stewart from Hypercube was kind enough to get in touch with me and share his experience of creating and using XBRL and some of the tools and resources out there.</p> <h2 id="what-can-you-do-with-xbrl">What can you do with XBRL?</h2> <p>This is what I'm interested in - how do you get data out of XBRL documents? My aim in this is to look for particular values (total assets, for example) and compare them across different organisations. </p> <p>This is trickier than I had thought. I had hoped that there would be a nice suite of (open source) tools for parsing and extracting data from XBRL - but as far as I can tell there isn't. I did know from trying beforehand that the flexibility of XBRL and iXBRL means it is very complex to deal with - often the same value can be represented in different ways depending on the choices made.</p> <p><strong>Resources</strong></p> <p>There are some resources out there though:</p> <ul> <li><a href="https://uk-taxonomies-tdp.corefiling.com/yeti/resources/yeti-gwt/Yeti.jsp">Yeti</a> - a tool created by CoreFiling, who I think were one of the original authors of XBRL. One issue I've had looking at XBRL is navigating the schemas, and this tool makes it much easier to do so.</li> <li><a href="https://github.com/XBRLInternational/ixbrl-samples">Sample iXBRL files</a> - This seems to be the only GitHub presence of XBRL International, who look after the standard, but it's really useful for testing tools.</li> <li>Companies House - There are a few ways you can get real-life iXBRL files for companies:<ul> <li>Via the <a href="https://beta.companieshouse.gov.uk/">Companies House register</a> - if a company has filed iXBRL accounts they're available for download on the filing history page for that company. Not all companies file iXBRL accounts, so if they're not available then PDF copies are provided instead.</li> <li>You can also download accounts in bulk via the <a href="http://download.companieshouse.gov.uk/en_accountsdata.html">Free Accounts Data Product</a>. 
This has daily and monthly dumps of any XBRL or iXBRL accounts filed with Companies House. This is useful for getting accounts to test on.</li> </ul> </li> <li>You can also view the official specifications for the format - I found these very difficult to read and get a sense of what the resulting data looks like:<ul> <li><a href="http://www.xbrl.org/Specification/XBRL-2.1/REC-2003-12-31/XBRL-2.1-REC-2003-12-31+corrected-errata-2013-02-20.html#_4.7">XBRL 2.1</a> - released 2003, corrected in 2013</li> <li><a href="https://www.xbrl.org/specification/inlinexbrl-part1/rec-2013-11-18/inlinexbrl-part1-rec-2013-11-18.html">iXBRL 1.1</a> - released 2013</li> <li><a href="http://www.xbrl.org/Specification/xbrl-csv/PWD-2017-05-02/xbrl-csv-PWD-2017-05-02.html">xBRL-CSV</a> - a proposal for converting XBRL to CSV format. I've not seen whether this is implemented anywhere, but this would be very useful for my purposes.</li> </ul> </li> </ul> <h2 id="what-have-i-done">What have I done?</h2> <p>As these notes suggest, a lot of what I've done is trying to make sense of the landscape and explore the data. I did most of this through jupyter notebooks, looking through iXBRL files of a selection of community businesses. I've found the python <a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> library helpful for parsing the XML.</p> <p>Using what I'd learnt from exploring the iXBRL files I've started to put together a python module for parsing iXBRL and extracting the data. I'll be aiming to develop this further, and hopefully use it in anger: <a href="https://github.com/drkane/ixbrl-parse">https://github.com/drkane/ixbrl-parse</a>.</p> <h2 id="who-else-is-doing-things-with-xbrl">Who else is doing things with XBRL</h2> <p>I did a search of GitHub to find XBRL repositories. There were a few people who'd tried to do similar things over the years. 
But most interestingly, the <a href="https://github.com/ONSBigData/parsing_company_accounts">ONS Big Data Unit</a> has an active repository with some code, approaching it in a very similar way to me. I've got in touch with them to see if they want to compare notes.</p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Tue, 18 Dec 2018 11:06:07 +0000</pubDate><guid isPermaLink="false">tag:dkane.net,2018-12-18:/2018/working-with-xbrl/</guid><category>Blog</category><category>XBRL</category><category>iXBRL</category><category>accounts</category></item><item><title>Text analysis in the voluntary sector</title><link>https://dkane.net/2018/text-analysis-in-the-voluntary-sector/</link><description><p><img alt="Chart from Google n-grams showing results for data and cheese since 1800. Data overtakes cheese in 1970ish" src="/images/data-cheese-chart.png"> </p> <p>Last week at the <a href="https://www.data4goodconf.org.uk/">Data 4 Good event</a> in Birmingham I gave a talk on the opportunities and challenges that text analysis offers to the voluntary sector. It was a short lightning talk so I was only able to give a quick overview of the possibilities, and it's a big area so there's lots more to explore.</p> <p>You can find the slides on Google drive.</p> <iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTvN-R9g5Dtcr72YKIe5RWSSBRqQlpUmaVpQfRDxd5EIPVzwPmzMPbVGCnJapJPjOMnIqhBCii7Eq30/embed?start=false&loop=false&delayms=3000" frameborder="0" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true" style="width:100%;height:389px;"></iframe> <p>In the last slide I suggest some tools you can use to take your data further with text analysis. 
They are:</p> <ul> <li><a href="https://databasic.io/en/samediff/"><strong>DataBasic – SameDiff</strong></a> - a simple tool for comparing two sets of text</li> <li><a href="https://databasic.io/en/wordcounter/"><strong>DataBasic – word counter</strong></a> - create word clouds and look in more detail at the frequency of words in a piece of text.</li> <li><a href="http://www.opencalais.com/opencalais-demo/"><strong>Reuters OpenCalais</strong></a> - a proprietary tool for getting started with entity recognition. I'd be interested to know if there are any similar open-source projects - one I've seen is <a href="https://spacy.io/usage/spacy-101#annotations-ner">spacy</a> but it requires some programming knowledge to use.</li> <li><a href="http://openrefine.org/"><strong>OpenRefine</strong></a> - a useful data wrangling tool with lots of useful features for working with text.</li> </ul></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Mon, 19 Nov 2018 09:58:43 +0000</pubDate><guid isPermaLink="false">tag:dkane.net,2018-11-19:/2018/text-analysis-in-the-voluntary-sector/</guid><category>Blog</category><category>Text analysis</category></item><item><title>A map of World Cup stadia using wikidata</title><link>https://dkane.net/2018/a-map-of-world-cup-stadia-using-wikidata/</link><description><p><a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">Wikidata</a> is an amazing project that aims to turn the unstructured text of Wikipedia into a database of facts and figures that allows you to go beyond just presenting a page about something to using data about it.</p> <p>I've been wanting to try out using it, and "SPARQL", the language used to query it, so I decided to try and create a map of every stadium that has hosted a game at the Fifa World Cup finals - a topical query as the 2018 World Cup in Russia has just started.</p> <h2 id="step-1-querying-the-data">Step 1. 
Querying the data</h2> <p>I used <a href="https://query.wikidata.org/">query.wikidata.org</a> to come up with a query that got me the data I was looking for. Having never used SPARQL before it took a bit of tweaking to get the query I needed - I found the interface helpful for finding the right entities and the included examples for how to structure it.</p> <p>Here's the query I came up with. I'll go through what each part does below.</p> <div class="highlight"><pre><span></span><code><span class="n">wc_sparql</span> <span class="o">=</span> <span class="s2">&quot;&quot;&quot;</span> <span class="s2">SELECT ?FIFA_World_CupLabel ?location ?locationLabel ?coord ?countryLabel WHERE {</span> <span class="s2"> SERVICE wikibase:label { bd:serviceParam wikibase:language &quot;[AUTO_LANGUAGE],en&quot;. }</span> <span class="s2"> ?FIFA_World_Cup wdt:P3450 wd:Q19317.</span> <span class="s2"> ?FIFA_World_Cup wdt:P276 ?location.</span> <span class="s2"> ?location wdt:P625 ?coord.</span> <span class="s2"> ?location wdt:P17 ?country</span> <span class="s2">}</span> <span class="s2">ORDER BY ?FIFA_World_CupLabel</span> <span class="s2">&quot;&quot;&quot;</span> </code></pre></div> <p>The first part sets up the fields we want to return - the name of the World Cup, the location ID (a stadium), the name of the stadium, the latitude and longitude and the name of the country</p> <div class="highlight"><pre><span></span><code>SELECT ?FIFA_World_CupLabel ?location ?locationLabel ?coord ?countryLabel WHERE { </code></pre></div> <p>This next part allows you to fetch labels for each of the items, which is more helpful than the URI that gets returned.</p> <div class="highlight"><pre><span></span><code><span class="n">SERVICE</span><span class="w"> </span><span class="nl">wikibase</span><span class="p">:</span><span class="n">label</span><span class="w"> </span><span class="err">{</span><span class="w"> </span><span class="nl">bd</span><span class="p">:</span><span 
class="n">serviceParam</span><span class="w"> </span><span class="nl">wikibase</span><span class="p">:</span><span class="k">language</span><span class="w"> </span><span class="ss">&quot;[AUTO_LANGUAGE],en&quot;</span><span class="p">.</span><span class="w"> </span><span class="err">}</span> </code></pre></div> <p>Then we start off by adding a variable called "FIFA_World_Cup", which matches every item whose "sports season of league or competition" property (<code>wdt:P3450</code>) has the value "FIFA World Cup" (<code>wd:Q19317</code>):</p> <div class="highlight"><pre><span></span><code>?FIFA_World_Cup wdt:P3450 wd:Q19317. </code></pre></div> <p>Then we look for the locations (<code>wdt:P276</code>) attached to each of these competitions:</p> <div class="highlight"><pre><span></span><code>?FIFA_World_Cup wdt:P276 ?location. </code></pre></div> <p>And for each location we want the co-ordinates (<code>wdt:P625</code>) and country (<code>wdt:P17</code>).</p> <div class="highlight"><pre><span></span><code>?location wdt:P625 ?coord. 
?location wdt:P17 ?country </code></pre></div> <p>I then used a python library called <a href="https://rdflib.github.io/sparqlwrapper/">SPARQLWrapper</a> to send the query to the WikiData sparql endpoint, and get JSON data back.</p> <div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">SPARQLWrapper</span> <span class="kn">import</span> <span class="n">SPARQLWrapper</span><span class="p">,</span> <span class="n">JSON</span> <span class="n">sparql</span> <span class="o">=</span> <span class="n">SPARQLWrapper</span><span class="p">(</span><span class="s2">&quot;https://query.wikidata.org/sparql&quot;</span><span class="p">)</span> <span class="n">sparql</span><span class="o">.</span><span class="n">setQuery</span><span class="p">(</span><span class="n">wc_sparql</span><span class="p">)</span> <span class="n">sparql</span><span class="o">.</span><span class="n">setReturnFormat</span><span class="p">(</span><span class="n">JSON</span><span class="p">)</span> <span class="n">results</span> <span class="o">=</span> <span class="n">sparql</span><span class="o">.</span><span class="n">query</span><span class="p">()</span><span class="o">.</span><span class="n">convert</span><span class="p">()</span> </code></pre></div> <p>Here's an example of what one of the results looks like - a stadium used in the very first World Cup in Uruguay.</p> <div class="highlight"><pre><span></span><code><span class="n">results</span><span class="p">[</span><span class="s1">&#39;results&#39;</span><span class="p">][</span><span class="s1">&#39;bindings&#39;</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="p">{</span><span class="err">&#39;</span><span class="nx">FIFA_World_CupLabel</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">&#39;</span><span 
class="k">type</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">literal</span><span class="err">&#39;</span><span class="p">,</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">value</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="mi">1930</span><span class="w"> </span><span class="nx">FIFA</span><span class="w"> </span><span class="nx">World</span><span class="w"> </span><span class="nx">Cup</span><span class="err">&#39;</span><span class="p">,</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">xml</span><span class="p">:</span><span class="nx">lang</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">en</span><span class="err">&#39;</span><span class="p">},</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">coord</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">&#39;</span><span class="nx">datatype</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">http</span><span class="p">:</span><span class="c1">//www.opengis.net/ont/geosparql#wktLiteral&#39;,</span> <span class="w"> </span><span class="err">&#39;</span><span class="k">type</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">literal</span><span class="err">&#39;</span><span class="p">,</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">value</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">Point</span><span class="p">(</span><span class="o">-</span><span 
class="m m-Double">56.152778</span><span class="w"> </span><span class="o">-</span><span class="m m-Double">34.894444</span><span class="p">)</span><span class="err">&#39;</span><span class="p">},</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">countryLabel</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">&#39;</span><span class="k">type</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">literal</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">value</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">Uruguay</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">xml</span><span class="p">:</span><span class="nx">lang</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">en</span><span class="err">&#39;</span><span class="p">},</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">location</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">&#39;</span><span class="k">type</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">uri</span><span class="err">&#39;</span><span class="p">,</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">value</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">http</span><span class="p">:</span><span class="c1">//www.wikidata.org/entity/Q498245&#39;},</span> <span 
class="w"> </span><span class="err">&#39;</span><span class="nx">locationLabel</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="err">&#39;</span><span class="k">type</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">literal</span><span class="err">&#39;</span><span class="p">,</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">value</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">Estadio</span><span class="w"> </span><span class="nx">Centenario</span><span class="err">&#39;</span><span class="p">,</span> <span class="w"> </span><span class="err">&#39;</span><span class="nx">xml</span><span class="p">:</span><span class="nx">lang</span><span class="err">&#39;</span><span class="p">:</span><span class="w"> </span><span class="err">&#39;</span><span class="nx">en</span><span class="err">&#39;</span><span class="p">}}</span> </code></pre></div> <h2 id="step-2-processing-the-results">Step 2: processing the results</h2> <p>I then want to turn the results into nicely formatted data for plotting on a map. I'm looking for data that contains one record for each stadium, even if it has hosted games at more than one World Cup (e.g. Mexico in 1970 and 1986). </p> <p>The co-ordinates for each location come in WKT format, as <code>Point(longitude latitude)</code>, so I use a library called <a href="https://pypi.org/project/Shapely/">Shapely</a> to extract the longitude and latitude.</p> <div class="highlight"><pre><span></span><code><span class="c1"># for converting coordinates</span> <span class="kn">import</span> <span class="nn">shapely.wkt</span> </code></pre></div> <p>Then I go through each of the results and add it to a Python dictionary. 
If the stadium is already in the dictionary I just add the extra World Cup year to the dictionary, rather than adding a new record.</p> <div class="highlight"><pre><span></span><code><span class="n">stadia</span> <span class="o">=</span> <span class="p">{}</span> <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s2">&quot;results&quot;</span><span class="p">][</span><span class="s2">&quot;bindings&quot;</span><span class="p">]:</span> <span class="n">stadium_id</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;location&quot;</span><span class="p">][</span><span class="s2">&quot;value&quot;</span><span class="p">]</span> <span class="n">worldcup</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;FIFA_World_CupLabel&quot;</span><span class="p">][</span><span class="s2">&quot;value&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&quot; FIFA World Cup&quot;</span><span class="p">,</span><span class="s2">&quot;&quot;</span><span class="p">)</span> <span class="k">if</span> <span class="n">stadium_id</span> <span class="ow">in</span> <span class="n">stadia</span><span class="p">:</span> <span class="n">stadia</span><span class="p">[</span><span class="n">stadium_id</span><span class="p">][</span><span class="s2">&quot;worldcups&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">worldcup</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">stadia</span><span class="p">[</span><span class="n">stadium_id</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">&quot;lat_lng&quot;</span><span class="p">:</span> <span 
class="n">shapely</span><span class="o">.</span><span class="n">wkt</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s2">&quot;coord&quot;</span><span class="p">][</span><span class="s2">&quot;value&quot;</span><span class="p">])</span><span class="o">.</span><span class="n">coords</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s2">&quot;worldcups&quot;</span><span class="p">:</span> <span class="p">[</span><span class="n">worldcup</span><span class="p">],</span> <span class="s2">&quot;stadium&quot;</span><span class="p">:</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;locationLabel&quot;</span><span class="p">][</span><span class="s2">&quot;value&quot;</span><span class="p">],</span> <span class="s2">&quot;country&quot;</span><span class="p">:</span> <span class="n">result</span><span class="p">[</span><span class="s2">&quot;countryLabel&quot;</span><span class="p">][</span><span class="s2">&quot;value&quot;</span><span class="p">],</span> <span class="p">}</span> </code></pre></div> <p>Here's what an entry in the processed data looks like. I've used the wikidata URI as an identifier for each stadium.</p> <div class="highlight"><pre><span></span><code><span class="n">stadia</span><span class="p">[</span><span class="s1">&#39;http://www.wikidata.org/entity/Q498245&#39;</span><span class="p">]</span> </code></pre></div> <div class="highlight"><pre><span></span><code>{&#39;country&#39;: &#39;Uruguay&#39;, &#39;lat_lng&#39;: (-56.152778, -34.894444), &#39;stadium&#39;: &#39;Estadio Centenario&#39;, &#39;worldcups&#39;: [&#39;1930&#39;]} </code></pre></div> <h2 id="step-3-mapping-the-results">Step 3: Mapping the results</h2> <p>I really like <a href="http://python-visualization.github.io/folium/index.html">folium</a> for easily producing Leaflet-based maps in python. 
I'm going to also use the MarkerCluster plugin to cluster the markers to make it easier to view all the stadia on one map - with clusters based on countries.</p> <div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">folium</span> <span class="kn">from</span> <span class="nn">folium.plugins</span> <span class="kn">import</span> <span class="n">MarkerCluster</span> <span class="kn">import</span> <span class="nn">html</span> </code></pre></div> <p>First I initialise the map and zoom out so you can see the whole world.</p> <div class="highlight"><pre><span></span><code><span class="n">m</span> <span class="o">=</span> <span class="n">folium</span><span class="o">.</span><span class="n">Map</span><span class="p">(</span> <span class="n">location</span><span class="o">=</span><span class="p">[</span><span class="mi">20</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="n">zoom_start</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">tiles</span><span class="o">=</span><span class="s1">&#39;Stamen Toner&#39;</span><span class="p">,</span> <span class="n">attr</span><span class="o">=</span><span class="s1">&#39;&#39;&#39;&lt;a id=&quot;home-link&quot; target=&quot;_top&quot; href=&quot;../&quot;&gt;Map tiles&lt;/a&gt; by </span> <span class="s1"> &lt;a target=&quot;_top&quot; href=&quot;http://stamen.com&quot;&gt;Stamen Design&lt;/a&gt;, </span> <span class="s1"> under &lt;a target=&quot;_top&quot; href=&quot;http://creativecommons.org/licenses/by/3.0&quot;&gt;CC BY 3.0&lt;/a&gt;. 
</span> <span class="s1"> Data by &lt;a target=&quot;_top&quot; href=&quot;http://openstreetmap.org&quot;&gt;OpenStreetMap&lt;/a&gt;, </span> <span class="s1"> under &lt;a target=&quot;_top&quot; href=&quot;http://creativecommons.org/licenses/by-sa/3.0&quot;&gt;CC BY SA&lt;/a&gt;.</span> <span class="s1"> | Locations powered by &lt;a href=&quot;https://query.wikidata.org/&quot;&gt;Wikidata&lt;/a&gt;.&#39;&#39;&#39;</span> <span class="p">)</span> </code></pre></div> <p>Then we go through the stadia and add each one to a cluster based on its country. I've also added a little popup which tells you the stadium's name and which World Cups it hosted games at. I also set a football icon for the pins.</p> <div class="highlight"><pre><span></span><code><span class="n">clusters</span> <span class="o">=</span> <span class="p">{}</span> <span class="k">for</span> <span class="n">stadium_id</span> <span class="ow">in</span> <span class="n">stadia</span><span class="p">:</span> <span class="n">s</span> <span class="o">=</span> <span class="n">stadia</span><span class="p">[</span><span class="n">stadium_id</span><span class="p">]</span> <span class="k">if</span> <span class="n">s</span><span class="p">[</span><span class="s2">&quot;country&quot;</span><span class="p">]</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">clusters</span><span class="p">:</span> <span class="n">clusters</span><span class="p">[</span><span class="n">s</span><span class="p">[</span><span class="s2">&quot;country&quot;</span><span class="p">]]</span> <span class="o">=</span> <span class="n">MarkerCluster</span><span class="p">()</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="n">folium</span><span class="o">.</span><span class="n">Marker</span><span class="p">(</span> <span class="p">[</span><span class="n">s</span><span class="p">[</span><span 
class="s2">&quot;lat_lng&quot;</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">s</span><span class="p">[</span><span class="s2">&quot;lat_lng&quot;</span><span class="p">][</span><span class="mi">0</span><span class="p">]],</span> <span class="n">popup</span><span class="o">=</span><span class="s1">&#39;</span><span class="si">{}</span><span class="s1">, </span><span class="si">{}</span><span class="s1"> - &lt;i&gt;</span><span class="si">{}</span><span class="s1">&lt;/i&gt;&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">html</span><span class="o">.</span><span class="n">escape</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="s2">&quot;stadium&quot;</span><span class="p">]),</span> <span class="n">html</span><span class="o">.</span><span class="n">escape</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="s2">&quot;country&quot;</span><span class="p">]),</span> <span class="n">html</span><span class="o">.</span><span class="n">escape</span><span class="p">(</span><span class="s2">&quot;, &quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="s2">&quot;worldcups&quot;</span><span class="p">]))</span> <span class="p">),</span> <span class="n">icon</span><span class="o">=</span><span class="n">folium</span><span class="o">.</span><span class="n">Icon</span><span class="p">(</span><span class="n">icon</span><span class="o">=</span><span class="s1">&#39;soccer-ball-o&#39;</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="s1">&#39;fa&#39;</span><span class="p">)</span> <span class="p">)</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">clusters</span><span class="p">[</span><span 
class="n">s</span><span class="p">[</span><span class="s2">&quot;country&quot;</span><span class="p">]])</span> </code></pre></div> <p>Finally we show the resulting map, which can be zoomed and panned to look at particular countries.</p> <div class="highlight"><pre><span></span><code><span class="n">m</span> </code></pre></div> <div style="width:100%;"><div style="position:relative;width:100%;height:0;padding-bottom:60%;"><iframe src="/data/world-cup-stadia-map.html" style="position:absolute;width:100%;height:100%;left:0;top:0;border:none !important;" allowfullscreen webkitallowfullscreen mozallowfullscreen></iframe></div></div> <p>As an extra I wanted to convert the data into GeoJSON format so it's easy to use elsewhere.</p> <ul> <li><a href="https://dkane.net/data/world_cup_stadia.geojson">world_cup_stadia.geojson</a></li> </ul> <div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">geojson</span> <span class="kn">import</span> <span class="n">Feature</span><span class="p">,</span> <span class="n">Point</span><span class="p">,</span> <span class="n">FeatureCollection</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="n">wc_geojson</span> <span class="o">=</span> <span class="n">FeatureCollection</span><span class="p">(</span> <span class="p">[</span><span class="n">Feature</span><span class="p">(</span><span class="n">geometry</span><span class="o">=</span><span class="n">Point</span><span class="p">(</span><span class="n">stadia</span><span class="p">[</span><span class="n">s</span><span class="p">][</span><span class="s2">&quot;lat_lng&quot;</span><span class="p">]),</span> <span class="n">properties</span><span class="o">=</span><span class="n">stadia</span><span class="p">[</span><span class="n">s</span><span class="p">])</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">stadia</span><span class="p">]</span> <span class="p">)</span> 
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">geojson</span>  <span class="c1"># needed for geojson.dump below</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;world_cup_stadia.geojson&#39;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">a</span><span class="p">:</span>
    <span class="n">geojson</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">wc_geojson</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
</code></pre></div> <h2 id="step-4-taking-it-further">Step 4: taking it further</h2> <p>This was just a quick exercise to try and get data out of Wikidata and then use it. There are a few things that could be done to take it further:</p> <ul> <li>add filters to the map to filter by country, World Cup, etc.</li> <li>see if Wikidata has data on the matches that took place at each location and the teams that have played there, allowing you to filter by team or stage of the competition.</li> <li>visualise the data by adding in details like the maximum attendance </li> </ul> <h2 id="acknowledgements">Acknowledgements</h2> <ul> <li>The data is from Wikidata and is used under the <a href="https://creativecommons.org/about/cc0">CC0 Public Domain licence</a>. </li> <li><a href="https://rdflib.github.io/sparqlwrapper/">sparqlwrapper</a> for querying the data.</li> <li><a href="http://python-visualization.github.io/folium/index.html">Folium</a> mapping using <a href="https://leafletjs.com/">Leaflet</a> and <a href="http://fontawesome.com/icons?d=gallery&amp;c=sports&amp;m=free">FontAwesome</a> icons.</li> <li>Map tiles by <a href="http://stamen.com">Stamen Design</a>, under <a href="http://creativecommons.org/licenses/by/3.0">CC BY 3.0</a>. 
Data by <a href="http://openstreetmap.org">OpenStreetMap</a>, under <a href="http://creativecommons.org/licenses/by-sa/3.0">CC BY SA</a>.</li> <li><a href="https://pypi.org/project/Shapely/">Shapely</a> and <a href="https://pypi.org/project/geojson/">geojson</a> python libraries for manipulating the geodata</li> <li><a href="https://nbviewer.jupyter.org/urls/dkane.net/data/world_cup_stadia.ipynb">This blog post was written alongside the code in a Jupyter notebook</a>.</li> </ul> <p><a title="By Charlie Kritschmar (WMDE) [CC0], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Wikidata_stamp.png"><img width="128" alt="Wikidata stamp" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Wikidata_stamp.png/512px-Wikidata_stamp.png"></a></p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Sun, 17 Jun 2018 17:23:04 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2018-06-17:/2018/a-map-of-world-cup-stadia-using-wikidata/</guid><category>Code</category><category>Football</category><category>Wikidata</category></item><item><title>Grants in Kingston</title><link>https://dkane.net/2018/grants-in-kingston/</link><description><p>On Tuesday I was lucky enough to take part in a Kingston "Data Day", organised by the fantastic <a href="https://superhighways.org.uk/">Superhighways</a> and <a href="https://kva.org.uk/">Kingston Voluntary Action</a>.</p> <p>The aim of the day was to showcase some of the work that charities in Kingston are already doing to use data in their work, and see some of the tools and resources out there. We heard from charities like <a href="https://www.leah.org.uk/">LEAH</a>, <a href="https://www.manandboy.org/">Man and Boy</a> and <a href="https://kingstoncarers.org.uk/young-carers/">Kingston Young Carers</a> about how they were using data to measure outcomes and improve their services. 
We also heard from <a href="https://data.kingston.gov.uk/">Kingston Borough Council</a>, 360 Giving and others on resources out there that local charities could tap into.</p> <p>I gave a quick presentation on what the 360 Giving data on grants tells us about voluntary sector funding in Kingston. Here's a few things I took away from the data:</p> <h2 id="1-grants-to-kingston-are-lower-than-other-areas">1. Grants to Kingston are lower than other areas</h2> <p>I didn't actually produce any comparisons for Kingston and other London boroughs, but in Katherine from 360 Giving's presentation she showed that looking at the Grantnav 360 Giving data, Kingston comes right at the bottom of the table in terms of the number of grants made. </p> <p>You can see this for yourself when looking at a <a href="http://grantnav.threesixtygiving.org/search?page=1&amp;json_query=%7B%22extra_context%22%3A+%7B%22amountAwardedFixed_facet_size%22%3A+3%2C+%22awardYear_facet_size%22%3A+3%7D%2C+%22sort%22%3A+%7B%22_score%22%3A+%7B%22order%22%3A+%22desc%22%7D%7D%2C+%22aggs%22%3A+%7B%22recipientOrganization%22%3A+%7B%22terms%22%3A+%7B%22field%22%3A+%22recipientOrganization.id_and_name%22%2C+%22size%22%3A+3%7D%7D%2C+%22fundingOrganization%22%3A+%7B%22terms%22%3A+%7B%22field%22%3A+%22fundingOrganization.id_and_name%22%2C+%22size%22%3A+3%7D%7D%2C+%22currency%22%3A+%7B%22terms%22%3A+%7B%22field%22%3A+%22currency%22%2C+%22size%22%3A+3%7D%7D%2C+%22recipientRegionName%22%3A+%7B%22terms%22%3A+%7B%22field%22%3A+%22recipientRegionName%22%2C+%22size%22%3A+3%7D%7D%2C+%22recipientDistrictName%22%3A+%7B%22terms%22%3A+%7B%22field%22%3A+%22recipientDistrictName%22%2C+%22size%22%3A+50%7D%7D%7D%2C+%22query%22%3A+%7B%22bool%22%3A+%7B%22filter%22%3A+%5B%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%2C+%22must%22%3A+%7B%7D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%7B%22range%22%3A+%7B%22amountAwarded%22%3A+%7B%7D%7D%7
D%2C+%22must%22%3A+%7B%7D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%7D%7D%2C+%7B%22bool%22%3A+%7B%22should%22%3A+%5B%5D%7D%7D%5D%2C+%22must%22%3A+%7B%22query_string%22%3A+%7B%22query%22%3A+%22London%22%2C+%22default_field%22%3A+%22recipientLocation%22%7D%7D%7D%7D%7D&amp;page_y=684">list of all the grants to London</a> (see the "Recipient District" filter on the left hand side for counts in each Borough).</p> <h2 id="2-the-big-lottery-fund-is-important-to-the-area">2. The Big Lottery Fund is important to the area</h2> <p>The Big Lottery Fund is by far the largest funder that has published 360Giving data, and this shows in Kingston. Half of the total grant amount received by organisations in the borough comes from the Big Lottery Fund.</p> <h2 id="3-but-other-funders-play-important-roles">3. But other funders play important roles</h2> <p>Funders like the Co-operative Group, Sport England and Kingston Voluntary Action (through the <a href="https://www.lovekingston.org.uk/">Love Kingston</a> programme) also provide key funding, particularly to smaller groups.</p> <h2 id="4-grants-are-going-to-medium-sized-groups">4. Grants are going to medium-sized groups</h2> <p>Most grants that we found are going to organisations with between £100,000 and £1m annual income. 
These organisations provide vital services in the local area, but there could be scope for increasing the funding available for smaller groups.</p> <hr> <p>I've created a map of all the grants to Kingston that is embedded below:</p> <figure> <iframe src="/data/kingston-grants-map.html" width="100%" height="500" frameborder="0"></iframe> </figure> <p>You can also find my presentation on Google slides:</p> <figure> <iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTaYI58fQlEc2pY66Y_3Y9MEUMpYUEw-cCTDHI6vYtIWhpt8hTO3dqXgjx7XY6urIzarSIrozhgVeDJ/embed?start=false&loop=false&delayms=3000" frameborder="0" width="100%" height="500" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe> </figure> <p>Finally you can also download the full dataset to use yourself:</p> <ul> <li><a href="/data/all_kingston_grants.csv">Kingston grants data [CSV - 263Kb]</a></li> </ul> <p>This data is released under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution Sharealike license (CC-BY-SA)</a> in accordance with the terms under which the data was used from GrantNav.</p> <hr> <h3 id="acknowledgements-and-data-licensing">Acknowledgements and data licensing</h3> <p>All data has been used under the license applied by the data publisher. 
<a href="http://grantnav.threesixtygiving.org/datasets/#copyright">A list of the licenses used</a> can be found on GrantNav.</p> <p>Contains data from <a href="http://grantnav.threesixtygiving.org/datasets/">GrantNav</a> a <a href="http://www.threesixtygiving.org/">360Giving</a> application released under the terms of the <a href="https://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution Sharealike license (CC-BY-SA)</a></p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Thu, 14 Jun 2018 12:57:19 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2018-06-14:/2018/grants-in-kingston/</guid><category>Blog</category><category>Grants</category><category>Kingston</category></item><item><title>Adding charity details using findthatcharity - Part II</title><link>https://dkane.net/2018/adding-charity-details-using-findthatcharity-part-ii/</link><description><p><a href="/2018/adding-charity-details-using-findthatcharityuk-part-i/">...continued from part 1</a></p> <h2 id="part-ii-add-charity-details">Part II - add charity details</h2> <p>Once you've got a list of organisations in OpenRefine with charity numbers you can then add more details about the organisations using <a href="https://findthatcharity.uk/">findthatcharity.uk</a>. You can add:</p> <ul> <li>postcode</li> <li>website</li> <li>latest income</li> <li>link to Charity Commission register</li> <li>date registered/removed</li> <li>Company Number</li> </ul> <h3 id="1-fetch-json-data">1: Fetch JSON data</h3> <p>First you fetch data from the findthatcharity page for that charity. We'll be getting the data in "JSON" format - this means we can get the fields we need out of the data.</p> <p>To fetch the data, click the arrow menu next to the column with the charity numbers and select <code>Edit Column &gt; Add column by fetching URLs...</code>. 
</p> <figure> <p><img alt="Screenshot of OpenRefine option for adding column based on a URL" src="/images/reconcile/add-column-by-url.png"></p> </figure> <p>In the resulting dialog you need to add a formula which will create the URL that needs to be fetched. This means adding the charity number value into a template. The formula you need is:</p> <div class="highlight"><pre><span></span><code>&quot;https://findthatcharity.uk/charity/&quot; + value + &quot;.json&quot; </code></pre></div> <p>Make sure you include the quotation marks in the correct places. Add a new column name (something like "Charity Details"). The "Throttle delay" setting helps to lessen the load on the site being fetched from - the default value is 5000 milliseconds (5 seconds). findthatcharity can cope with more requests than that (within reason!), so I'll generally set it to 50-100 milliseconds, which works out as between 10 and 20 requests per second.</p> <p>The resulting dialog should look something like the following. You'll see a preview of the URL which should look sensible. You can test that the URL is correct by visiting one of the preview URLs. You'll also see that rows with a <code>null</code> value produce a <code>null</code> URL, so won't be fetched.</p> <figure> <p><img alt="Preview of the Add column by fetching from URL dialog" src="/images/reconcile/add-column-dialog.png"></p> </figure> <h3 id="2-extract-details-from-the-json-data">2. Extract details from the JSON data</h3> <p>You should see a new column filled with JSON data - it looks very hard to read, but it contains all the data we're looking for. 
You'll need to get OpenRefine to look through this data and extract the fields you're looking for.</p> <p>I'll demo this with the <code>latest_income</code> and <code>company_number</code> fields, but the same principle can be used to extract anything from the data.</p> <p>Use the arrow menu next to the new column, and then select <code>Edit Column &gt; Add column based on this column...</code>. Again, we'll enter a formula to parse the JSON and find the right value within. We're using the <code>parseJson</code> OpenRefine function. Enter the following formula:</p> <div class="highlight"><pre><span></span><code>parseJson(value)[&quot;latest_income&quot;] </code></pre></div> <p>Enter a name for the column and press OK to add it.</p> <figure> <p><img alt="Dialog for adding a new column by parsing JSON" src="/images/reconcile/parse-json-1.png"></p> </figure> <h4 id="company-number">Company number</h4> <p>Adding a company number is slightly more complicated. This is because the <code>company_number</code> field can contain multiple values (if a charity is linked with more than one company). So the formula we'll use in this instance is:</p> <div class="highlight"><pre><span></span><code>parseJson(value)[&quot;company_number&quot;][0][&quot;number&quot;] </code></pre></div> <p>The <code>[0]</code> part selects the first value within the list of values of <code>company_number</code>, and then <code>["number"]</code> selects the company number within that.</p> <h3 id="3-save-the-data">3. Save the data</h3> <p>While OpenRefine is useful for this kind of data cleaning and linking, you'll want to export the data to another program to perform analysis. You can do this using the <code>Export</code> menu found in the top right hand corner. 
The default formats are pretty straightforward, and it's possible to create more sophisticated exports (including a JSON export) using the <code>Custom tabular exporter...</code> option.</p> <p>One thing to remember with export - it will use any active filters when exporting, so make sure you don't have restrictive filters that exclude data you actually need.</p> <h3 id="4-optional-using-charity-base">4. Optional - using Charity Base</h3> <p>The data found on findthatcharity.uk is minimal as it is only included to support the site's main purpose of being a search engine. If there are other details you need then you may need to look for other data sources.</p> <p>A good place to look is <a href="https://charitybase.uk/">CharityBase</a> which has more detailed data extracted from the Charity Commission.</p> <p>A sample formula for getting a CharityBase URL for extracting data might look like:</p> <div class="highlight"><pre><span></span><code><span class="s">&quot;https://charitybase.uk/api/v0.2.0/charities?subNumber=0&amp;charityNumber=&quot;</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="s">&quot;&amp;fields=mainCharity.income,registration,beta.people.employees&quot;</span> </code></pre></div> <p>(Note that you need to specify the fields you want to return - here we're looking for the number of employees).</p> <p>And the sample formula for parsing the resulting JSON data might look like:</p> <div class="highlight"><pre><span></span><code>parseJson(value)[&quot;charities&quot;][0][&quot;beta&quot;][&quot;people&quot;][&quot;employees&quot;] </code></pre></div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Sat, 31 Mar 2018 13:25:00 +0100</pubDate><guid 
isPermaLink="false">tag:dkane.net,2018-03-31:/2018/adding-charity-details-using-findthatcharity-part-ii/</guid><category>Blog</category><category>Charities</category><category>findthatcharity</category></item><item><title>Adding charity details using findthatcharity.uk - Part I</title><link>https://dkane.net/2018/adding-charity-details-using-findthatcharityuk-part-i/</link><description><h2 id="part-i-add-charity-numbers">Part I - add charity numbers</h2> <p><em><a href="/2018/adding-charity-details-using-findthatcharity-part-ii/">See part II</a></em></p> <p><a href="https://findthatcharity.uk/">findthatcharity.uk</a> is a site I've created to help access data about charities. One of the ways you can use it is to add details about a list of charities if you only know their names.</p> <p>This is often called "reconciliation" - turning a list of names into something that unambiguously identifies the charity. Charities are often known by different names, or have variations like "Ltd" rather than "Limited", so by adding charity numbers we can make sure we're talking about the same charity whether we say "Comic Relief" or "Charity Projects".</p> <p>In this post I'm going to show you how to use <a href="http://openrefine.org/">OpenRefine</a> and <a href="https://findthatcharity.uk/">findthatcharity.uk</a> to turn charity names into numbers. In the next part, I'll show how to use charity numbers to fetch extra information about a charity - like their web address or latest income.</p> <h3 id="1-get-open-refine">1. Get Open Refine</h3> <p><a href="http://openrefine.org/">OpenRefine</a> is a great data cleaning/wrangling tool. It was originally started by Google (and so is sometimes still called Google Refine), but is now run by a team of volunteer developers.</p> <p>You can find the download for your platform on the <a href="http://openrefine.org/download.html">download page</a>. 
Note that if you're a Windows user you won't "install" it in the same way you might for other programs - instead you extract the zip file into a directory and then run <code>refine.bat</code> from that directory.</p> <p>Again, OpenRefine might be a bit different from other programs you've seen in that it doesn't open in its own window - instead it has a "server" that runs in the background, and you interact with it through a web browser. The address you visit (normally <a href="http://localhost:3333">http://localhost:3333</a>) is only accessible through your computer though - other parts of the internet can't see it.</p> <h3 id="2-load-your-data">2. Load your data</h3> <p>When you first open OpenRefine you'll see a page that looks a bit like this:</p> <p><img alt="Web page screenshot showing the OpenRefine screen for importing data" src="/images/reconcile/first-screen.png"> </p> <p>You can follow the instructions for importing your data. OpenRefine has some very flexible options for importing data. The important thing about your data is that it should be a "flat file" - a simple table with rows and columns. You might need to use another program (like Excel) to get your data in this format first.</p> <p>I've chosen to use some data published by the <a href="http://cheshirecommunityfoundation.org.uk/wp-content/uploads/2017/06/Grants-Awarded-Fiscal-Year-2016.xlsx">Cheshire Community Foundation [xlsx - 57Kb]</a> on the grants they've awarded. This data is published to the <a href="https://www.threesixtygiving.org/">360 Giving grant data standard</a> and I found it in the <a href="https://www.threesixtygiving.org/data/data-registry/">360 Giving data registry</a>. It already includes charity numbers so we wouldn't need to do this reconciliation process, but it's useful to have them there to check how good the reconciliation process is.</p> <p>I can point OpenRefine at this URL and it will download and load it itself. 
Here's what the data looks like once it's been imported. I've hidden the <code>description</code> field to get more rows on the screen.</p> <p><img alt="Data loaded into OpenRefine" src="/images/reconcile/data-ready.png"></p> <h3 id="3-reconcile-charity-names">3. Reconcile charity names</h3> <p>We're going to be using the <code>Recipient Org:Name</code> field to reconcile - this contains the name of the organisation that received the grant. Click on the arrow next to the column title to bring up the menu, and click <code>Reconcile &gt; Start reconciling...</code>.</p> <p><img alt="Location of the reconcile menu" src="/images/reconcile/reconcile-menu.png"></p> <p>Now you need to add the Find that Charity reconciliation service. This is a web service that receives data from OpenRefine (in this case the name of a charity) and tries to match it to an entity. It returns a list of possible matches, and allows us to extract the ID of each match (in this case the charity number).</p> <p>Click <code>Add Standard Service...</code> and then paste <code>https://findthatcharity.uk/reconcile</code> in the textbox, and click <code>Add Service</code>. This will then (possibly after a short delay) add a service called "charitysearch".</p> <p><img alt="Dialog box for adding a reconciliation service" src="/images/reconcile/add-service.png"></p> <p>You'll see a list with just "charity" on the left hand side (you might need to hide the sidebar). You can leave the default options (at the moment findthatcharity doesn't take any additional properties to help matching, but this might be something we add later).</p> <p>Click <code>Start Reconciling</code> to make the magic happen. This might take a while if you have lots of records. OpenRefine will batch up charities and send a few at a time to findthatcharity, but the server can only cope with so many requests at a time. 
If you need to do lots of requests (&gt;5,000 in one go), speak to me directly.</p> <p><img alt="Completed options for running reconciliation" src="/images/reconcile/start-reconciling.png"></p> <p>You'll see a counter with the percentage complete at the top, and when finished each cell will show the matches that have been made. </p> <p><img alt="Reconciliation results, showing a series of checkboxes for each entry with possible matches" src="/images/reconcile/reconcile-results.png"></p> <h3 id="4-check-the-matches">4. Check the matches</h3> <p>You now need to check the matches. Depending on your time and the number of records, there are a few options for this. </p> <h4 id="manually-check-every-record">Manually check every record</h4> <p>You can manually check each record - just click on the single tick next to the correct match. You can also click the box with two ticks if you'd like every instance of this record to be matched to this entity.</p> <h4 id="match-all-records-to-the-best-result">Match all records to the best result</h4> <p>If you trust the algorithm, or have done some manual checking and think they're all correct, you can just take the best match for each record. To do this, click the arrow menu next to the <code>Recipient Org:Name</code> column heading, and select <code>Reconcile &gt; Actions &gt; Match each cell to its best candidate</code>.</p> <h4 id="filter-and-match">Filter and match</h4> <p>You can also combine these two approaches using filters. On the left hand side of the screen, OpenRefine has added two "facets" based on the results of the reconciliation. 
</p> <p>The <code>judgement</code> facet at the top shows the overall result - either <code>(blank)</code> where no matches have been found, <code>none</code> where there are some matches, or <code>matched</code> where a single match has been selected (some records are marked as <code>matched</code> already if findthatcharity thinks they are an exact match).</p> <p>The <code>best candidate's score</code> facet shows the score of the best match candidate, and allows you to select different ranges by dragging the range sliders. You can also remove <code>Error</code> values: these are the ones with no match.</p> <p>You can drag the score facet to select only those with high matching scores - this will filter to show only those rows with the selected scores. The <code>Match each cell to its best candidate</code> action also respects this filter, so you can use that to match all the records found by the filter, leaving the rest unmatched.</p> <p><img alt="Results of combining the score filter with matching each cell to best candidate" src="/images/reconcile/filtered-matches.png"></p> <h3 id="5-add-charity-number">5. Add charity number</h3> <p>Once you're happy with the matches you can add a new column with the matched charity number in. To do this, use the arrow menu next to the column heading, and click <code>Edit column &gt; Add column based on this column...</code></p> <p>The box that pops up allows you to create lots of different kinds of column. Add a name for your column at the top ("Charity Number" would make sense), and then add <code>cell.recon.match.id</code> in the "Expression" box. 
This will look at the cell reconciliation object, and find the <code>id</code> of the matched candidate.</p> <p><img alt="Entering an expression for getting the matched cell id" src="/images/reconcile/add-charity-number.png"></p> <p>Press <code>OK</code> to add this column.</p> <hr> <p>That's it for part one - hopefully you've now got a charity number for all the charities in your data. In <a href="/2018/adding-charity-details-using-findthatcharity-part-ii/">the next part</a> I'll take a look at using the charity number to add more details about the charity, like its web address or latest income.</p> <p><a href="/2018/adding-charity-details-using-findthatcharity-part-ii/">Part II continues here...</a></p></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Mon, 26 Mar 2018 13:25:00 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2018-03-26:/2018/adding-charity-details-using-findthatcharityuk-part-i/</guid><category>Blog</category><category>Charities</category><category>findthatcharity</category></item><item><title>Back in the GDPR</title><link>https://dkane.net/2018/back-in-the-gdpr/</link><description><p>G-day (the 25 May) is fast approaching, when charities will need to be GDPR-compliant. 
While <a href="https://www.thirdsector.co.uk/less-half-charities-heard-new-data-protection-rule/fundraising/article/1455490">half of charities apparently haven't heard of it</a> there's no shortage of resources so I've put together <a href="https://github.com/drkane/charity-gdpr-resources">a list of the resources I've seen produced</a> on github.</p> <p>If you see anything missing then please add an issue or make a pull request (or <a href="https://twitter.com/kanedr">tweet me @kanedr</a>) and I'll add it.</p> <p>My view on GDPR is that, as a consumer, it's a welcome rebalancing of data rights from organisations to individuals, and I think organisations can embrace it positively to think better about how they handle data about their clients, staff, donors and beneficiaries.</p> <p>With that in mind I'd highly recommend these resources by <a href="https://projectsbyif.com/">IF</a>. They show ways in which GDPR doesn't just have to be an annoyance that means you have to reword your forms, but instead something that can help you fundamentally rethink the way your users need you to handle their data.</p> <ul> <li><a href="https://newdigitalrights.projectsbyif.com/">Designing for new Digital Rights</a></li> </ul></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Thu, 01 Mar 2018 13:25:00 +0000</pubDate><guid isPermaLink="false">tag:dkane.net,2018-03-01:/2018/back-in-the-gdpr/</guid><category>Blog</category><category>GDPR</category><category>Data Protection</category></item><item><title>Acrostic football league tables</title><link>https://dkane.net/2018/acrostic-football-league-tables/</link><description><p>Following from a <a href="https://www.theguardian.com/football/2018/feb/14/footballers-joining-and-leaving-a-club-in-the-same-transfer-window-the-knowledge">question to the Guardian's "The Knowledge" column</a> I've tried to find a longer acrostic than "TABLE".</p> <blockquote> <p>“Has there ever been a longer 
acrostic spelled out in a table than the oft-recurring ‘TABLE’ from the Premier League this season – or ‘LAWNS’ from League Two?” asks Marco Jackson.</p> </blockquote> <p>To do this I'm going to fetch data on every match played in English leagues since 1994-95 and use it to generate league tables for the end of every day that a match has been played. I'll then check the first letter of every team in the league table against a dictionary of English words to find the longest.</p> <h2 id="import-needed-modules">Import needed modules</h2> <div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">sys</span> <span class="kn">import</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="nn">os</span> <span class="kn">import</span> <span class="nn">csv</span> <span class="kn">import</span> <span class="nn">re</span> <span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span> <span class="kn">import</span> <span class="nn">requests</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="n">pd</span><span class="o">.</span><span class="n">options</span><span class="o">.</span><span class="n">display</span><span class="o">.</span><span class="n">float_format</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="si">{:,.0f}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span> </code></pre></div> <h2 id="fetch-league-data">Fetch league data</h2> <p>From <a href="http://www.football-data.co.uk/englandm.php">http://www.football-data.co.uk/englandm.php</a> - series of CSV tables with data on matches played in English leagues (Premier League, Division One/Championship, Division Two/League One, Division Three/League Two, Conference) since 1994/95.</p> 
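<p>As a rough sketch of the link handling used in the download step in this section: relative CSV links are resolved against the page URL, and slashes in the link are swapped for underscores to make a flat local filename. The <code>csv_targets</code> helper and the example hrefs are made up for illustration (they are not copied from the real page), and skipping links with no <code>href</code> is an extra safety check of my own.</p>

```python
from urllib.parse import urljoin


def csv_targets(page_url, hrefs, out_dir="data"):
    """Return (absolute URL, local path) pairs for the hrefs that point at CSV files.

    Hypothetical helper for illustration only.
    """
    targets = []
    for href in hrefs:
        # <a> tags without an href attribute yield None, so skip those
        if not href or not href.endswith(".csv"):
            continue
        # resolve the (possibly relative) link against the page it came from
        file_url = urljoin(page_url, href)
        # flatten directory separators so every file sits directly in out_dir
        local_path = out_dir + "/" + href.replace("/", "_")
        targets.append((file_url, local_path))
    return targets


# made-up hrefs, including a missing one and a non-CSV link
example = csv_targets(
    "http://www.football-data.co.uk/englandm.php",
    ["mmz4281/1718/E0.csv", None, "notes.txt"],
)
print(example)
```

<p>Keeping this step separate from the actual download means the filename logic can be checked without making any network requests.</p>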
<p>I haven't checked the accuracy of this data against any other source. In particular, it doesn't include any data on points deductions which do happen and affect the league table.</p> <div class="highlight"><pre><span></span><code><span class="c1"># download the main page and extract the links to CSV files</span> <span class="n">data_url</span> <span class="o">=</span> <span class="s2">&quot;http://www.football-data.co.uk/englandm.php&quot;</span> <span class="n">r</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">data_url</span><span class="p">)</span> <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s1">&#39;html.parser&#39;</span><span class="p">)</span> <span class="c1"># go through all the links in the document</span> <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">soup</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s1">&#39;a&#39;</span><span class="p">):</span> <span class="n">href</span> <span class="o">=</span> <span class="n">link</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;href&#39;</span><span class="p">)</span> <span class="c1"># check for those that are CSV files</span> <span class="k">if</span> <span class="n">href</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s2">&quot;.csv&quot;</span><span class="p">):</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&quot;data/&quot;</span> <span class="o">+</span> <span class="n">href</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span 
class="s2">&quot;/&quot;</span><span class="p">,</span> <span class="s2">&quot;_&quot;</span><span class="p">),</span> <span class="s2">&quot;wb&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">fd</span><span class="p">:</span> <span class="n">data_file_url</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">parse</span><span class="o">.</span><span class="n">urljoin</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">url</span><span class="p">,</span> <span class="n">href</span><span class="p">)</span> <span class="n">r_data</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">data_file_url</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="c1"># write the file to disk in the `/data` directory</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">r_data</span><span class="o">.</span><span class="n">iter_content</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">128</span><span class="p">):</span> <span class="n">fd</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Downloaded </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">data_file_url</span><span class="p">))</span> </code></pre></div> <h2 id="fetch-dictionary-data">Fetch dictionary data</h2> <p>From <a 
href="https://github.com/dwyl/english-words/blob/master/words_alpha.txt">https://github.com/dwyl/english-words/blob/master/words_alpha.txt</a></p> <div class="highlight"><pre><span></span><code><span class="n">dictionary</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s1">&#39;https://github.com/dwyl/english-words/blob/master/words_alpha.txt?raw=true&#39;</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&quot;data/dictionary.txt&quot;</span><span class="p">,</span> <span class="s2">&quot;wb&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">c</span><span class="p">:</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">dictionary</span><span class="o">.</span><span class="n">iter_content</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">128</span><span class="p">):</span> <span class="n">c</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span> </code></pre></div> <h2 id="useful-functions">Useful functions</h2> <p><code>matches_to_table()</code> takes a list of matches (as a pandas dataframe) and a list of teams and turns them into a league table. The optional list of teams ensures that any teams that haven't yet played are included in the table (if not provided, it's worked out from the matches). 
Tables are sorted according to the following criteria: points, then goal difference, then goals scored.</p> <p>The matches dataframe requires the following columns for each match:</p> <ul> <li><code>HomeTeam</code> - team playing at home</li> <li><code>AwayTeam</code> - team playing away</li> <li><code>FTR</code> - full time result: "H" = Home win, "A" = Away win, "D" = Draw</li> <li><code>FTHG</code> - goals scored by full time by the home team</li> <li><code>FTAG</code> - goals scored by full time by the away team</li> </ul> <div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">matches_to_table</span><span class="p">(</span><span class="n">matches</span><span class="p">,</span> <span class="n">teams</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">points_for_win</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">points_for_draw</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">teams</span><span class="p">:</span> <span class="n">teams</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">matches</span><span class="o">.</span><span class="n">HomeTeam</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">matches</span><span class="o">.</span><span class="n">AwayTeam</span><span class="o">.</span><span class="n">unique</span><span class="p">())))</span> <span class="n">fields</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;played&quot;</span><span class="p">,</span> <span class="s2">&quot;won&quot;</span><span class="p">,</span> <span 
class="s2">&quot;drawn&quot;</span><span class="p">,</span> <span class="s2">&quot;lost&quot;</span><span class="p">,</span> <span class="s2">&quot;for&quot;</span><span class="p">,</span> <span class="s2">&quot;against&quot;</span><span class="p">,</span> <span class="s2">&quot;gd&quot;</span><span class="p">,</span> <span class="s2">&quot;points&quot;</span><span class="p">,</span> <span class="s2">&quot;points_deduction&quot;</span><span class="p">]</span> <span class="n">table</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">teams</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">fields</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;won&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;won&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="p">[</span><span class="n">matches</span><span class="o">.</span><span class="n">FTR</span> <span class="o">==</span> <span class="s2">&quot;H&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">HomeTeam</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Home wins</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span 
class="p">[:,</span> <span class="s2">&quot;lost&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;lost&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="p">[</span><span class="n">matches</span><span class="o">.</span><span class="n">FTR</span> <span class="o">==</span> <span class="s2">&quot;A&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">HomeTeam</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Home losses</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;drawn&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;drawn&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="p">[</span><span class="n">matches</span><span class="o">.</span><span class="n">FTR</span> <span class="o">==</span> <span class="s2">&quot;D&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">HomeTeam</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Home draws</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;won&quot;</span><span class="p">]</span> <span 
class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;won&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="p">[</span><span class="n">matches</span><span class="o">.</span><span class="n">FTR</span> <span class="o">==</span> <span class="s2">&quot;A&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">AwayTeam</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Away wins</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;lost&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;lost&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="p">[</span><span class="n">matches</span><span class="o">.</span><span class="n">FTR</span> <span class="o">==</span> <span class="s2">&quot;H&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">AwayTeam</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Away losses</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;drawn&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span 
class="p">[:,</span> <span class="s2">&quot;drawn&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="p">[</span><span class="n">matches</span><span class="o">.</span><span class="n">FTR</span> <span class="o">==</span> <span class="s2">&quot;D&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">AwayTeam</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Away draws</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;for&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;for&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">&quot;HomeTeam&quot;</span><span class="p">)[</span><span class="s2">&quot;FTHG&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># home for</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;for&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;for&quot;</span><span class="p">]</span><span class="o">.</span><span 
class="n">add</span><span class="p">(</span> <span class="n">matches</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">&quot;AwayTeam&quot;</span><span class="p">)[</span><span class="s2">&quot;FTAG&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># away for</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;against&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;against&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">matches</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">&quot;HomeTeam&quot;</span><span class="p">)[</span><span class="s2">&quot;FTAG&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># home against</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;against&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;against&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">matches</span><span class="o">.</span><span class="n">groupby</span><span 
class="p">(</span><span class="s2">&quot;AwayTeam&quot;</span><span class="p">)[</span><span class="s2">&quot;FTHG&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(),</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># away against</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;points&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;won&quot;</span><span class="p">]</span> <span class="o">*</span> <span class="n">points_for_win</span><span class="p">)</span> <span class="o">+</span> \ <span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;drawn&quot;</span><span class="p">]</span> <span class="o">*</span> <span class="n">points_for_draw</span><span class="p">)</span> <span class="o">-</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;points_deduction&quot;</span><span class="p">]</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;gd&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;for&quot;</span><span class="p">]</span> <span class="o">-</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;against&quot;</span><span class="p">]</span> <span class="n">table</span><span 
class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;played&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;won&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;drawn&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="n">table</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;lost&quot;</span><span class="p">]</span> <span class="k">return</span> <span class="n">table</span><span class="o">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s2">&quot;points&quot;</span><span class="p">,</span> <span class="s2">&quot;gd&quot;</span><span class="p">,</span> <span class="s2">&quot;for&quot;</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> </code></pre></div> <p><code>table_to_string()</code> takes the table returned by <code>matches_to_table()</code> and turns it into a string made up of the first letter of each team, in table order.</p> <div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">table_to_string</span><span class="p">(</span><span class="n">table</span><span class="p">):</span> <span class="k">return</span> <span class="s2">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">table</span><span class="o">.</span><span class="n">index</span><span 
class="p">])</span> </code></pre></div> <p><code>get_char_ngrams()</code> takes a string and returns a list of the character ngrams found within it, with lengths running from <code>min_n</code> up to, but not including, <code>max_n</code>.</p> <div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_char_ngrams</span><span class="p">(</span><span class="n">input_string</span><span class="p">,</span> <span class="n">min_n</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span> <span class="n">ngrams</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">min_n</span><span class="p">,</span> <span class="n">max_n</span><span class="p">):</span> <span class="n">ngrams</span> <span class="o">+=</span> <span class="p">[</span><span class="n">input_string</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">n</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">input_string</span><span class="p">)</span><span class="o">-</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span> <span class="k">return</span> <span class="n">ngrams</span> </code></pre></div> <p><code>get_division_from_filename()</code> returns the season and division name based on the filename given.</p> <div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_division_from_filename</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span> <span 
class="n">filename</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sa">r</span><span class="s1">&#39;_|\.&#39;</span><span class="p">,</span> <span class="n">filename</span><span class="p">)</span> <span class="n">year</span> <span class="o">=</span> <span class="n">filename</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">year</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span> <span class="o">&gt;</span> <span class="mi">90</span><span class="p">:</span> <span class="n">year</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">year</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span> <span class="o">+</span> <span class="mi">1900</span> <span class="k">else</span><span class="p">:</span> <span class="n">year</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">year</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span> <span class="o">+</span> <span class="mi">2000</span> <span class="k">if</span> <span class="n">year</span> <span class="o">&lt;</span> <span class="mi">2004</span><span class="p">:</span> <span class="n">division</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">&quot;E0&quot;</span><span class="p">:</span> <span class="s2">&quot;Premier League&quot;</span><span class="p">,</span> <span class="s2">&quot;E1&quot;</span><span class="p">:</span> <span class="s2">&quot;Division One&quot;</span><span class="p">,</span> <span class="s2">&quot;E2&quot;</span><span class="p">:</span> <span class="s2">&quot;Division 
Two&quot;</span><span class="p">,</span> <span class="s2">&quot;E3&quot;</span><span class="p">:</span> <span class="s2">&quot;Division Three&quot;</span><span class="p">,</span> <span class="s2">&quot;EC&quot;</span><span class="p">:</span> <span class="s2">&quot;Conference&quot;</span> <span class="p">}[</span><span class="n">filename</span><span class="p">[</span><span class="mi">2</span><span class="p">]]</span> <span class="k">else</span><span class="p">:</span> <span class="n">division</span> <span class="o">=</span> <span class="p">{</span> <span class="s2">&quot;E0&quot;</span><span class="p">:</span> <span class="s2">&quot;Premier League&quot;</span><span class="p">,</span> <span class="s2">&quot;E1&quot;</span><span class="p">:</span> <span class="s2">&quot;Championship&quot;</span><span class="p">,</span> <span class="s2">&quot;E2&quot;</span><span class="p">:</span> <span class="s2">&quot;League One&quot;</span><span class="p">,</span> <span class="s2">&quot;E3&quot;</span><span class="p">:</span> <span class="s2">&quot;League Two&quot;</span><span class="p">,</span> <span class="s2">&quot;EC&quot;</span><span class="p">:</span> <span class="s2">&quot;Conference&quot;</span> <span class="p">}[</span><span class="n">filename</span><span class="p">[</span><span class="mi">2</span><span class="p">]]</span> <span class="n">year</span> <span class="o">=</span> <span class="s2">&quot;</span><span class="si">{}</span><span class="s2">-</span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="nb">str</span><span class="p">(</span><span class="n">year</span><span class="p">),</span> <span class="nb">str</span><span class="p">(</span><span class="n">year</span><span class="o">+</span><span class="mi">1</span><span class="p">)[</span><span class="mi">2</span><span class="p">:</span><span class="mi">4</span><span class="p">])</span> <span class="k">return</span> <span 
class="p">(</span><span class="n">year</span><span class="p">,</span> <span class="n">division</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">find_word</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">found_words</span><span class="p">,</span> <span class="n">full</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span> <span class="n">t</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">found_words</span> <span class="k">if</span> <span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span><span class="o">==</span><span class="n">word</span><span class="o">.</span><span class="n">lower</span><span class="p">()][</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">])</span> <span class="nb">print</span><span class="p">(</span><span class="n">get_division_from_filename</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">+</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">],))</span> <span class="n">table</span> <span class="o">=</span> <span class="n">tables</span><span class="p">[</span><span class="n">t</span><span class="p">]</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;index&quot;</span><span class="p">:</span> <span 
class="s2">&quot;Team&quot;</span><span class="p">})</span> <span class="n">table</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">full</span><span class="p">:</span> <span class="k">return</span> <span class="n">table</span> <span class="n">word_location</span> <span class="o">=</span> <span class="s2">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="nb">list</span><span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">Team</span><span class="p">)])</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">word</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span> <span class="k">return</span> <span class="n">table</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="nb">slice</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">word_location</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">word_location</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="kc">None</span><span class="p">)]</span> </code></pre></div> <h2 id="generate-tables">Generate tables</h2> <p>Here I go through each file downloaded in the <code>/data</code> directory, and carry out 
the following steps:</p> <ol> <li>read the file into a pandas dataframe</li> <li>find the unique dates on which matches were played</li> <li>go through each date and extract the matches played up to and including that date</li> <li>create a league table based on those matches</li> <li>build the string made up of the first letter of each team, in table order, as it stood on that date</li> </ol> <p>The strings are added to a list, and the tables to a dictionary keyed by file and date, to make it easy to find them again.</p> <div class="highlight"><pre><span></span><code><span class="n">tstrings</span> <span class="o">=</span> <span class="p">[]</span> <span class="n">files_checked</span> <span class="o">=</span> <span class="p">[]</span> <span class="n">tables</span> <span class="o">=</span> <span class="p">{}</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">dfile</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s2">&quot;data&quot;</span><span class="p">):</span> <span class="k">if</span> <span class="n">dfile</span> <span class="ow">in</span> <span class="n">files_checked</span><span class="p">:</span> <span class="k">continue</span> <span class="k">if</span> <span class="n">dfile</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s2">&quot;.csv&quot;</span><span class="p">):</span> <span class="k">try</span><span class="p">:</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">&quot;data/</span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">dfile</span><span class="p">))</span> <span class="c1"># step 1</span> <span class="c1"># print(dfile)</span> <span class="k">except</span> <span class="n">pd</span><span 
class="o">.</span><span class="n">errors</span><span class="o">.</span><span class="n">ParserError</span><span class="p">:</span> <span class="nb">print</span><span class="p">(</span><span class="n">dfile</span> <span class="o">+</span> <span class="s2">&quot; [SKIPPED]&quot;</span><span class="p">)</span> <span class="k">continue</span> <span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;Date&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">Date</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s2">&quot;</span><span class="si">%d</span><span class="s2">/%m/%y&quot;</span><span class="p">)</span> <span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s2">&quot;all&quot;</span><span class="p">)</span> <span class="n">teams</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">HomeTeam</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">AwayTeam</span><span class="o">.</span><span class="n">unique</span><span class="p">())))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span 
class="n">Date</span><span class="o">.</span><span class="n">unique</span><span class="p">():</span> <span class="c1"># step 2</span> <span class="n">played_matches</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">Date</span> <span class="o">&lt;=</span> <span class="n">i</span><span class="p">]</span> <span class="c1"># step 3</span> <span class="n">table</span> <span class="o">=</span> <span class="n">matches_to_table</span><span class="p">(</span><span class="n">played_matches</span><span class="p">,</span> <span class="n">teams</span><span class="p">)</span> <span class="c1"># step 4</span> <span class="n">tables</span><span class="p">[(</span><span class="n">dfile</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">10</span><span class="p">])]</span> <span class="o">=</span> <span class="n">table</span> <span class="n">tstring</span> <span class="o">=</span> <span class="n">table_to_string</span><span class="p">(</span><span class="n">table</span><span class="p">)</span> <span class="c1"># step 5</span> <span class="n">tstrings</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">dfile</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">10</span><span class="p">],</span> <span class="n">tstring</span><span class="p">))</span> <span class="n">files_checked</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">dfile</span><span class="p">)</span> <span class="c1"># print(&quot;{} - {} matches played by {}&quot;.format(tstring, len(played_matches), i))</span> </code></pre></div> <div 
class="highlight"><pre><span></span><code><span class="c1"># save the strings to a CSV file for later use</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&quot;tablestrings.csv&quot;</span><span class="p">,</span> <span class="s2">&quot;w&quot;</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">&#39;&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">b</span><span class="p">:</span> <span class="n">writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">writer</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">([</span><span class="s2">&quot;file&quot;</span><span class="p">,</span> <span class="s2">&quot;date&quot;</span><span class="p">,</span> <span class="s2">&quot;tstring&quot;</span><span class="p">])</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">tstrings</span><span class="p">:</span> <span class="n">writer</span><span class="o">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="s2">&quot;</span><span class="si">{:,}</span><span class="s2"> files checked (each file contains matches for one year for one division)&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">files_checked</span><span class="p">))</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="s1">&#39;113 files checked (each file contains matches for one year for one division)&#39;</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span 
class="s2">&quot;</span><span class="si">{:,}</span><span class="s2"> table strings generated. Example string: </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tstrings</span><span class="p">),</span> <span class="n">tstrings</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">2</span><span class="p">])</span> </code></pre></div> <div class="highlight"><pre><span></span><code>&#39;10,268 table strings generated. Example string: CCMTLLSSDALNMABWICEM&#39; </code></pre></div> <h2 id="import-dictionary-data">Import dictionary data</h2> <div class="highlight"><pre><span></span><code><span class="n">dictionary</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&quot;data/dictionary.txt&quot;</span><span class="p">,</span> <span class="s2">&quot;r&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">c</span><span class="p">:</span> <span class="n">dictionary</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">readlines</span><span class="p">()</span> <span class="n">dictionary</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">d</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">dictionary</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">d</span><span class="o">.</span><span class="n">strip</span><span class="p">())</span> <span class="o">&gt;=</span> <span class="mi">5</span><span class="p">])</span> 
</code></pre></div> <h2 id="find-long-words-in-the-table-strings">Find long words in the table strings</h2> <p>Split the string for each iteration of the league table into ngrams and check whether each ngram is a word found in the dictionary.</p> <div class="highlight"><pre><span></span><code><span class="n">found_words</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tstrings</span><span class="p">):</span> <span class="n">ngrams</span> <span class="o">=</span> <span class="n">get_char_ngrams</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">lower</span><span class="p">(),</span> <span class="n">min_n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="n">ngrams</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">ngrams</span><span class="p">)</span> <span class="n">found</span> <span class="o">=</span> <span class="n">ngrams</span><span class="o">.</span><span class="n">intersection</span><span class="p">(</span><span class="n">dictionary</span><span class="p">)</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">found</span><span class="p">:</span> <span class="n">found_words</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="n">f</span><span class="p">])</span> <span class="k">if</span> <span class="n">k</span> <span class="o">%</span> <span class="mi">1000</span> <span class="o">==</span> <span class="mi">0</span><span 
class="p">:</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Iteration </span><span class="si">{}</span><span class="s2">, found </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">found_words</span><span class="p">)))</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Iteration </span><span class="si">{}</span><span class="s2">, found </span><span class="si">{}</span><span class="s2"> [COMPLETE]&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">found_words</span><span class="p">)))</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="n">Iteration</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">0</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">73</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">2000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">109</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">3000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">146</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">4000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span 
class="w"> </span><span class="mi">195</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">5000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">242</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">6000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">330</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">7000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">358</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">8000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">411</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">9000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">450</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">10000</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">478</span> <span class="n">Iteration</span><span class="w"> </span><span class="mi">10267</span><span class="p">,</span><span class="w"> </span><span class="k">found</span><span class="w"> </span><span class="mi">481</span><span class="w"> </span><span class="o">[</span><span class="n">COMPLETE</span><span class="o">]</span> </code></pre></div> <p>Words with six or more letters</p> <div class="highlight"><pre><span></span><code><span class="p">[</span><span class="nb">list</span><span class="p">(</span><span class="n">get_division_from_filename</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> <span 
class="o">+</span> <span class="p">[</span><span class="n">w</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">])]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">found_words</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span><span class="o">&gt;</span><span class="mi">5</span><span class="p">]</span>
</code></pre></div> <div class="highlight"><pre><span></span><code>[[&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2000-12-17&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2000-12-18&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2000-12-22&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-01-30&#39;, &#39;maslin&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-04-02&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-04-04&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-04-10&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-04-11&#39;, &#39;maills&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-04-30&#39;, &#39;scants&#39;, 6],
 [&#39;2000-01&#39;, &#39;Premier League&#39;, &#39;2001-05-01&#39;, &#39;scants&#39;, 6],
 [&#39;2001-02&#39;, &#39;Division Three&#39;, &#39;2001-10-27&#39;, &#39;blokes&#39;, 6],
 [&#39;2001-02&#39;, &#39;Division Three&#39;, &#39;2001-10-28&#39;, &#39;blokes&#39;, 6],
 [&#39;2006-07&#39;, &#39;Premier League&#39;, &#39;2006-08-22&#39;, &#39;rental&#39;, 6],
 [&#39;2006-07&#39;, &#39;Premier League&#39;, &#39;2007-03-14&#39;, &#39;albert&#39;, 6],
 [&#39;2006-07&#39;, &#39;League One&#39;, &#39;2006-12-30&#39;, &#39;snobby&#39;, 6],
 [&#39;2008-09&#39;, &#39;League One&#39;, &#39;2009-01-27&#39;, &#39;smolts&#39;, 6],
 [&#39;2008-09&#39;, &#39;League Two&#39;, &#39;2009-01-03&#39;, &#39;begall&#39;, 6],
 [&#39;2010-11&#39;, &#39;League Two&#39;, &#39;2010-09-11&#39;, &#39;cobcab&#39;, 6],
 [&#39;2011-12&#39;, &#39;League Two&#39;, &#39;2011-08-20&#39;, &#39;monasa&#39;, 6],
 [&#39;2012-13&#39;, &#39;Conference&#39;, &#39;2013-04-09&#39;, &#39;chants&#39;, 6],
 [&#39;2013-14&#39;, &#39;Premier League&#39;, &#39;2014-02-08&#39;, &#39;camlet&#39;, 6],
 [&#39;2013-14&#39;, &#39;Premier League&#39;, &#39;2014-03-08&#39;, &#39;amtmen&#39;, 6],
 [&#39;2014-15&#39;, &#39;League Two&#39;, &#39;2015-02-24&#39;, &#39;peacod&#39;, 6],
 [&#39;2015-16&#39;, &#39;League Two&#39;, &#39;2015-09-12&#39;, &#39;cancha&#39;, 6],
 [&#39;2015-16&#39;, &#39;League Two&#39;, &#39;2015-10-21&#39;, &#39;blanch&#39;, 6],
 [&#39;1993-94&#39;, &#39;Premier League&#39;, &#39;1994-05-07&#39;, &#39;tmesis&#39;, 6],
 [&#39;1993-94&#39;, &#39;Premier League&#39;, &#39;1994-05-08&#39;, &#39;tmesis&#39;, 6],
 [&#39;1996-97&#39;, &#39;Premier League&#39;, &#39;1997-02-15&#39;, &#39;wastel&#39;, 6],
 [&#39;1998-99&#39;, &#39;Division Three&#39;, &#39;1998-12-18&#39;, &#39;cherts&#39;, 6],
 [&#39;1999-00&#39;, &#39;Premier League&#39;, &#39;2000-03-05&#39;, &#39;calesa&#39;, 6]]
</code></pre></div> <p>Words with five letters</p> <div class="highlight"><pre><span></span><code><span class="n">five_letters</span> <span class="o">=</span> <span class="p">[</span><span class="nb">list</span><span class="p">(</span><span class="n">get_division_from_filename</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> <span class="o">+</span> <span class="p">[</span><span class="n">w</span><span
class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">])]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">found_words</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span><span class="o">==</span><span class="mi">5</span><span class="p">]</span> <span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">five_letters</span><span class="p">))</span> <span class="c1"># five_letters</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="mf">451</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="s2">&quot;, &quot;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="n">w</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">five_letters</span><span class="p">])))</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="p">&#39;</span><span class="n">abama</span><span class="p">,</span><span class="w"> </span><span class="n">abamp</span><span class="p">,</span><span class="w"> </span><span class="n">accel</span><span class="p">,</span><span class="w"> </span><span class="n">alans</span><span class="p">,</span><span class="w"> </span><span class="n">allan</span><span class="p">,</span><span class="w"> </span><span 
class="n">allis</span><span class="p">,</span><span class="w"> </span><span class="n">amahs</span><span class="p">,</span><span class="w"> </span><span class="n">ambos</span><span class="p">,</span><span class="w"> </span><span class="n">amlet</span><span class="p">,</span><span class="w"> </span><span class="n">amsel</span><span class="p">,</span><span class="w"> </span><span class="n">ancha</span><span class="p">,</span><span class="w"> </span><span class="n">ancle</span><span class="p">,</span><span class="w"> </span><span class="n">ankhs</span><span class="p">,</span><span class="w"> </span><span class="n">anlas</span><span class="p">,</span><span class="w"> </span><span class="n">appal</span><span class="p">,</span><span class="w"> </span><span class="n">apter</span><span class="p">,</span><span class="w"> </span><span class="n">ascan</span><span class="p">,</span><span class="w"> </span><span class="n">astel</span><span class="p">,</span><span class="w"> </span><span class="n">atlas</span><span class="p">,</span><span class="w"> </span><span class="n">awacs</span><span class="p">,</span><span class="w"> </span><span class="n">baccy</span><span class="p">,</span><span class="w"> </span><span class="n">baled</span><span class="p">,</span><span class="w"> </span><span class="n">banal</span><span class="p">,</span><span class="w"> </span><span class="n">barat</span><span class="p">,</span><span class="w"> </span><span class="n">barms</span><span class="p">,</span><span class="w"> </span><span class="n">basan</span><span class="p">,</span><span class="w"> </span><span class="n">bassa</span><span class="p">,</span><span class="w"> </span><span class="n">beant</span><span class="p">,</span><span class="w"> </span><span class="n">becco</span><span class="p">,</span><span class="w"> </span><span class="n">belam</span><span class="p">,</span><span class="w"> </span><span class="n">belch</span><span class="p">,</span><span class="w"> </span><span 
class="n">betas</span><span class="p">,</span><span class="w"> </span><span class="n">bibbs</span><span class="p">,</span><span class="w"> </span><span class="n">blams</span><span class="p">,</span><span class="w"> </span><span class="n">blanc</span><span class="p">,</span><span class="w"> </span><span class="n">blens</span><span class="p">,</span><span class="w"> </span><span class="n">bless</span><span class="p">,</span><span class="w"> </span><span class="n">blobs</span><span class="p">,</span><span class="w"> </span><span class="n">bloke</span><span class="p">,</span><span class="w"> </span><span class="n">blows</span><span class="p">,</span><span class="w"> </span><span class="n">bolty</span><span class="p">,</span><span class="w"> </span><span class="n">brent</span><span class="p">,</span><span class="w"> </span><span class="n">cabda</span><span class="p">,</span><span class="w"> </span><span class="n">cable</span><span class="p">,</span><span class="w"> </span><span class="n">caleb</span><span class="p">,</span><span class="w"> </span><span class="n">camel</span><span class="p">,</span><span class="w"> </span><span class="n">camla</span><span class="p">,</span><span class="w"> </span><span class="n">canch</span><span class="p">,</span><span class="w"> </span><span class="n">canis</span><span class="p">,</span><span class="w"> </span><span class="n">canos</span><span class="p">,</span><span class="w"> </span><span class="n">cants</span><span class="p">,</span><span class="w"> </span><span class="n">capsa</span><span class="p">,</span><span class="w"> </span><span class="n">casts</span><span class="p">,</span><span class="w"> </span><span class="n">catel</span><span class="p">,</span><span class="w"> </span><span class="n">celts</span><span class="p">,</span><span class="w"> </span><span class="n">chant</span><span class="p">,</span><span class="w"> </span><span class="n">chats</span><span class="p">,</span><span class="w"> </span><span 
class="n">cheap</span><span class="p">,</span><span class="w"> </span><span class="n">cheek</span><span class="p">,</span><span class="w"> </span><span class="n">chert</span><span class="p">,</span><span class="w"> </span><span class="n">chics</span><span class="p">,</span><span class="w"> </span><span class="n">claes</span><span class="p">,</span><span class="w"> </span><span class="n">clans</span><span class="p">,</span><span class="w"> </span><span class="n">clape</span><span class="p">,</span><span class="w"> </span><span class="n">clast</span><span class="p">,</span><span class="w"> </span><span class="n">cleam</span><span class="p">,</span><span class="w"> </span><span class="n">clefs</span><span class="p">,</span><span class="w"> </span><span class="n">clomb</span><span class="p">,</span><span class="w"> </span><span class="n">clyde</span><span class="p">,</span><span class="w"> </span><span class="n">cobby</span><span class="p">,</span><span class="w"> </span><span class="n">coman</span><span class="p">,</span><span class="w"> </span><span class="n">comby</span><span class="p">,</span><span class="w"> </span><span class="n">comdt</span><span class="p">,</span><span class="w"> </span><span class="n">compd</span><span class="p">,</span><span class="w"> </span><span class="n">copes</span><span class="p">,</span><span class="w"> </span><span class="n">copps</span><span class="p">,</span><span class="w"> </span><span class="n">copra</span><span class="p">,</span><span class="w"> </span><span class="n">cordy</span><span class="p">,</span><span class="w"> </span><span class="n">corms</span><span class="p">,</span><span class="w"> </span><span class="n">cotes</span><span class="p">,</span><span class="w"> </span><span class="n">crabs</span><span class="p">,</span><span class="w"> </span><span class="n">crost</span><span class="p">,</span><span class="w"> </span><span class="n">dampy</span><span class="p">,</span><span class="w"> </span><span 
class="n">dobra</span><span class="p">,</span><span class="w"> </span><span class="n">dwarf</span><span class="p">,</span><span class="w"> </span><span class="n">egall</span><span class="p">,</span><span class="w"> </span><span class="n">elamp</span><span class="p">,</span><span class="w"> </span><span class="n">ental</span><span class="p">,</span><span class="w"> </span><span class="n">etwas</span><span class="p">,</span><span class="w"> </span><span class="n">fable</span><span class="p">,</span><span class="w"> </span><span class="n">facet</span><span class="p">,</span><span class="w"> </span><span class="n">fawns</span><span class="p">,</span><span class="w"> </span><span class="n">feces</span><span class="p">,</span><span class="w"> </span><span class="n">forbs</span><span class="p">,</span><span class="w"> </span><span class="n">fpsps</span><span class="p">,</span><span class="w"> </span><span class="n">gasts</span><span class="p">,</span><span class="w"> </span><span class="n">gecks</span><span class="p">,</span><span class="w"> </span><span class="n">ghast</span><span class="p">,</span><span class="w"> </span><span class="n">gotra</span><span class="p">,</span><span class="w"> </span><span class="n">hants</span><span class="p">,</span><span class="w"> </span><span class="n">herls</span><span class="p">,</span><span class="w"> </span><span class="n">hewgh</span><span class="p">,</span><span class="w"> </span><span class="n">hoers</span><span class="p">,</span><span class="w"> </span><span class="n">hyleg</span><span class="p">,</span><span class="w"> </span><span class="n">kohls</span><span class="p">,</span><span class="w"> </span><span class="n">laban</span><span class="p">,</span><span class="w"> </span><span class="n">laics</span><span class="p">,</span><span class="w"> </span><span class="n">lamas</span><span class="p">,</span><span class="w"> </span><span class="n">lames</span><span class="p">,</span><span class="w"> </span><span 
class="n">lanas</span><span class="p">,</span><span class="w"> </span><span class="n">lanch</span><span class="p">,</span><span class="w"> </span><span class="n">lansa</span><span class="p">,</span><span class="w"> </span><span class="n">lawns</span><span class="p">,</span><span class="w"> </span><span class="n">leban</span><span class="p">,</span><span class="w"> </span><span class="n">lessn</span><span class="p">,</span><span class="w"> </span><span class="n">liwan</span><span class="p">,</span><span class="w"> </span><span class="n">llama</span><span class="p">,</span><span class="w"> </span><span class="n">macan</span><span class="p">,</span><span class="w"> </span><span class="n">maced</span><span class="p">,</span><span class="w"> </span><span class="n">maces</span><span class="p">,</span><span class="w"> </span><span class="n">macle</span><span class="p">,</span><span class="w"> </span><span class="n">madge</span><span class="p">,</span><span class="w"> </span><span class="n">maill</span><span class="p">,</span><span class="w"> </span><span class="n">mails</span><span class="p">,</span><span class="w"> </span><span class="n">malam</span><span class="p">,</span><span class="w"> </span><span class="n">masts</span><span class="p">,</span><span class="w"> </span><span class="n">mated</span><span class="p">,</span><span class="w"> </span><span class="n">mates</span><span class="p">,</span><span class="w"> </span><span class="n">meach</span><span class="p">,</span><span class="w"> </span><span class="n">melds</span><span class="p">,</span><span class="w"> </span><span class="n">merak</span><span class="p">,</span><span class="w"> </span><span class="n">metal</span><span class="p">,</span><span class="w"> </span><span class="n">mewls</span><span class="p">,</span><span class="w"> </span><span class="n">molts</span><span class="p">,</span><span class="w"> </span><span class="n">monas</span><span class="p">,</span><span class="w"> </span><span 
class="n">mosso</span><span class="p">,</span><span class="w"> </span><span class="n">nabal</span><span class="p">,</span><span class="w"> </span><span class="n">nable</span><span class="p">,</span><span class="w"> </span><span class="n">nambe</span><span class="p">,</span><span class="w"> </span><span class="n">nants</span><span class="p">,</span><span class="w"> </span><span class="n">nasab</span><span class="p">,</span><span class="w"> </span><span class="n">nasch</span><span class="p">,</span><span class="w"> </span><span class="n">neaps</span><span class="p">,</span><span class="w"> </span><span class="n">nobby</span><span class="p">,</span><span class="w"> </span><span class="n">norit</span><span class="p">,</span><span class="w"> </span><span class="n">orang</span><span class="p">,</span><span class="w"> </span><span class="n">parma</span><span class="p">,</span><span class="w"> </span><span class="n">plebs</span><span class="p">,</span><span class="w"> </span><span class="n">plote</span><span class="p">,</span><span class="w"> </span><span class="n">ponds</span><span class="p">,</span><span class="w"> </span><span class="n">praam</span><span class="p">,</span><span class="w"> </span><span class="n">prate</span><span class="p">,</span><span class="w"> </span><span class="n">proms</span><span class="p">,</span><span class="w"> </span><span class="n">pyche</span><span class="p">,</span><span class="w"> </span><span class="n">regal</span><span class="p">,</span><span class="w"> </span><span class="n">resaw</span><span class="p">,</span><span class="w"> </span><span class="n">rests</span><span class="p">,</span><span class="w"> </span><span class="n">rotas</span><span class="p">,</span><span class="w"> </span><span class="n">rybat</span><span class="p">,</span><span class="w"> </span><span class="n">sabes</span><span class="p">,</span><span class="w"> </span><span class="n">samba</span><span class="p">,</span><span class="w"> </span><span 
class="n">sawed</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="p">,</span><span class="w"> </span><span class="n">scams</span><span class="p">,</span><span class="w"> </span><span class="n">scant</span><span class="p">,</span><span class="w"> </span><span class="n">scase</span><span class="p">,</span><span class="w"> </span><span class="n">sceat</span><span class="p">,</span><span class="w"> </span><span class="n">scobs</span><span class="p">,</span><span class="w"> </span><span class="n">scrob</span><span class="p">,</span><span class="w"> </span><span class="n">segos</span><span class="p">,</span><span class="w"> </span><span class="n">sensa</span><span class="p">,</span><span class="w"> </span><span class="n">shawn</span><span class="p">,</span><span class="w"> </span><span class="n">sherd</span><span class="p">,</span><span class="w"> </span><span class="n">slabs</span><span class="p">,</span><span class="w"> </span><span class="n">slaws</span><span class="p">,</span><span class="w"> </span><span class="n">smash</span><span class="p">,</span><span class="w"> </span><span class="n">smolt</span><span class="p">,</span><span class="w"> </span><span class="n">snows</span><span class="p">,</span><span class="w"> </span><span class="n">snowy</span><span class="p">,</span><span class="w"> </span><span class="n">stale</span><span class="p">,</span><span class="w"> </span><span class="n">stops</span><span class="p">,</span><span class="w"> </span><span class="n">stoss</span><span class="p">,</span><span class="w"> </span><span class="n">swail</span><span class="p">,</span><span class="w"> </span><span class="n">swash</span><span class="p">,</span><span class="w"> </span><span class="n">swats</span><span class="p">,</span><span class="w"> </span><span class="n">swing</span><span class="p">,</span><span class="w"> </span><span class="n">swobs</span><span class="p">,</span><span class="w"> </span><span 
class="n">swots</span><span class="p">,</span><span class="w"> </span><span class="n">syces</span><span class="p">,</span><span class="w"> </span><span class="k">table</span><span class="p">,</span><span class="w"> </span><span class="n">taces</span><span class="p">,</span><span class="w"> </span><span class="n">taels</span><span class="p">,</span><span class="w"> </span><span class="n">tamas</span><span class="p">,</span><span class="w"> </span><span class="n">teals</span><span class="p">,</span><span class="w"> </span><span class="n">thaws</span><span class="p">,</span><span class="w"> </span><span class="n">tramp</span><span class="p">,</span><span class="w"> </span><span class="n">twaes</span><span class="p">,</span><span class="w"> </span><span class="n">twale</span><span class="p">,</span><span class="w"> </span><span class="n">wants</span><span class="p">,</span><span class="w"> </span><span class="n">wasco</span><span class="p">,</span><span class="w"> </span><span class="n">wasnt</span><span class="p">,</span><span class="w"> </span><span class="n">waste</span><span class="p">,</span><span class="w"> </span><span class="n">whang</span><span class="p">,</span><span class="w"> </span><span class="n">wises</span><span class="p">,</span><span class="w"> </span><span class="n">worms</span><span class="p">&#39;</span> </code></pre></div> <h2 id="words-in-context">Words in context</h2> <p>Display the league table centered around the teams making the words.</p> <div class="highlight"><pre><span></span><code><span class="n">find_word</span><span class="p">(</span><span class="s2">&quot;blokes&quot;</span><span class="p">,</span> <span class="n">found_words</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code>(&#39;2001-02&#39;, &#39;Division Three&#39;, &#39;2001-10-27&#39;) </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe 
tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>14</th> <td>Darlington</td> <td>15</td> <td>5</td> <td>4</td> <td>6</td> <td>19</td> <td>17</td> <td>2</td> <td>19</td> <td>0</td> </tr> <tr> <th>15</th> <td>Bristol Rvs</td> <td>14</td> <td>5</td> <td>4</td> <td>5</td> <td>13</td> <td>15</td> <td>-2</td> <td>19</td> <td>0</td> </tr> <tr> <th>16</th> <td>Lincoln</td> <td>16</td> <td>4</td> <td>6</td> <td>6</td> <td>16</td> <td>17</td> <td>-1</td> <td>18</td> <td>0</td> </tr> <tr> <th>17</th> <td>Oxford</td> <td>16</td> <td>4</td> <td>6</td> <td>6</td> <td>15</td> <td>16</td> <td>-1</td> <td>18</td> <td>0</td> </tr> <tr> <th>18</th> <td>Kidderminster</td> <td>16</td> <td>4</td> <td>5</td> <td>7</td> <td>10</td> <td>17</td> <td>-7</td> <td>17</td> <td>0</td> </tr> <tr> <th>19</th> <td>Exeter</td> <td>16</td> <td>4</td> <td>5</td> <td>7</td> <td>18</td> <td>31</td> <td>-13</td> <td>17</td> <td>0</td> </tr> <tr> <th>20</th> <td>Swansea</td> <td>16</td> <td>4</td> <td>4</td> <td>8</td> <td>22</td> <td>29</td> <td>-7</td> <td>16</td> <td>0</td> </tr> <tr> <th>21</th> <td>Hartlepool</td> <td>15</td> <td>4</td> <td>3</td> <td>8</td> <td>14</td> <td>16</td> <td>-2</td> <td>15</td> <td>0</td> </tr> </tbody> </table> </div> <div class="highlight"><pre><span></span><code><span class="n">find_word</span><span class="p">(</span><span class="s2">&quot;chants&quot;</span><span class="p">,</span> <span class="n">found_words</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code>(&#39;2012-13&#39;, &#39;Conference&#39;, &#39;2013-04-09&#39;) </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } 
.dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>12</th> <td>Braintree Town</td> <td>42</td> <td>17</td> <td>7</td> <td>18</td> <td>59</td> <td>70</td> <td>-11</td> <td>58</td> <td>0</td> </tr> <tr> <th>13</th> <td>Cambridge</td> <td>44</td> <td>14</td> <td>14</td> <td>16</td> <td>65</td> <td>66</td> <td>-1</td> <td>56</td> <td>0</td> </tr> <tr> <th>14</th> <td>Hyde United</td> <td>44</td> <td>16</td> <td>7</td> <td>21</td> <td>61</td> <td>67</td> <td>-6</td> <td>55</td> <td>0</td> </tr> <tr> <th>15</th> <td>Alfreton Town</td> <td>43</td> <td>14</td> <td>12</td> <td>17</td> <td>64</td> <td>71</td> <td>-7</td> <td>54</td> <td>0</td> </tr> <tr> <th>16</th> <td>Nuneaton Town</td> <td>44</td> <td>12</td> <td>15</td> <td>17</td> <td>51</td> <td>62</td> <td>-11</td> <td>51</td> <td>0</td> </tr> <tr> <th>17</th> <td>Tamworth</td> <td>42</td> <td>14</td> <td>9</td> <td>19</td> <td>51</td> <td>62</td> <td>-11</td> <td>51</td> <td>0</td> </tr> <tr> <th>18</th> <td>Southport</td> <td>44</td> <td>13</td> <td>12</td> <td>19</td> <td>70</td> <td>83</td> <td>-13</td> <td>51</td> <td>0</td> </tr> <tr> <th>19</th> <td>Gateshead</td> <td>43</td> <td>12</td> <td>14</td> <td>17</td> <td>54</td> <td>59</td> <td>-5</td> <td>50</td> <td>0</td> </tr> </tbody> </table> </div> <div class="highlight"><pre><span></span><code><span class="n">find_word</span><span class="p">(</span><span class="s2">&quot;snobby&quot;</span><span class="p">,</span> <span class="n">found_words</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code>(&#39;2006-07&#39;, &#39;League One&#39;, &#39;2006-12-30&#39;) </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: 
right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>1</th> <td>Scunthorpe</td> <td>25</td> <td>14</td> <td>6</td> <td>5</td> <td>38</td> <td>20</td> <td>18</td> <td>48</td> <td>0</td> </tr> <tr> <th>2</th> <td>Nott'm Forest</td> <td>25</td> <td>14</td> <td>6</td> <td>5</td> <td>35</td> <td>20</td> <td>15</td> <td>48</td> <td>0</td> </tr> <tr> <th>3</th> <td>Oldham</td> <td>25</td> <td>13</td> <td>6</td> <td>6</td> <td>40</td> <td>21</td> <td>19</td> <td>45</td> <td>0</td> </tr> <tr> <th>4</th> <td>Bristol City</td> <td>25</td> <td>13</td> <td>6</td> <td>6</td> <td>34</td> <td>24</td> <td>10</td> <td>45</td> <td>0</td> </tr> <tr> <th>5</th> <td>Blackpool</td> <td>25</td> <td>11</td> <td>8</td> <td>6</td> <td>40</td> <td>27</td> <td>13</td> <td>41</td> <td>0</td> </tr> <tr> <th>6</th> <td>Yeovil</td> <td>24</td> <td>11</td> <td>8</td> <td>5</td> <td>31</td> <td>22</td> <td>9</td> <td>41</td> <td>0</td> </tr> <tr> <th>7</th> <td>Swansea</td> <td>25</td> <td>11</td> <td>6</td> <td>8</td> <td>40</td> <td>29</td> <td>11</td> <td>39</td> <td>0</td> </tr> </tbody> </table> </div> <div class="highlight"><pre><span></span><code><span class="n">find_word</span><span class="p">(</span><span class="s2">&quot;albert&quot;</span><span class="p">,</span> <span class="n">found_words</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code>(&#39;2006-07&#39;, &#39;Premier League&#39;, &#39;2007-03-14&#39;) </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" 
class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>2</th> <td>Chelsea</td> <td>29</td> <td>20</td> <td>6</td> <td>3</td> <td>51</td> <td>19</td> <td>32</td> <td>66</td> <td>0</td> </tr> <tr> <th>3</th> <td>Arsenal</td> <td>28</td> <td>16</td> <td>7</td> <td>5</td> <td>51</td> <td>23</td> <td>28</td> <td>55</td> <td>0</td> </tr> <tr> <th>4</th> <td>Liverpool</td> <td>29</td> <td>16</td> <td>5</td> <td>8</td> <td>44</td> <td>20</td> <td>24</td> <td>53</td> <td>0</td> </tr> <tr> <th>5</th> <td>Bolton</td> <td>29</td> <td>14</td> <td>5</td> <td>10</td> <td>34</td> <td>34</td> <td>0</td> <td>47</td> <td>0</td> </tr> <tr> <th>6</th> <td>Everton</td> <td>29</td> <td>11</td> <td>10</td> <td>8</td> <td>37</td> <td>26</td> <td>11</td> <td>43</td> <td>0</td> </tr> <tr> <th>7</th> <td>Reading</td> <td>29</td> <td>13</td> <td>4</td> <td>12</td> <td>43</td> <td>38</td> <td>5</td> <td>43</td> <td>0</td> </tr> <tr> <th>8</th> <td>Tottenham</td> <td>29</td> <td>12</td> <td>6</td> <td>11</td> <td>40</td> <td>43</td> <td>-3</td> <td>42</td> <td>0</td> </tr> <tr> <th>9</th> <td>Portsmouth</td> <td>29</td> <td>11</td> <td>8</td> <td>10</td> <td>36</td> <td>31</td> <td>5</td> <td>41</td> <td>0</td> </tr> </tbody> </table> </div> <div class="highlight"><pre><span></span><code><span class="n">find_word</span><span class="p">(</span><span class="s2">&quot;table&quot;</span><span class="p">,</span> <span class="n">found_words</span><span class="p">)</span> </code></pre></div> <div class="highlight"><pre><span></span><code>(&#39;2017-18&#39;, &#39;Premier League&#39;, &#39;2017-12-23&#39;) </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table 
border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>4</th> <td>Liverpool</td> <td>19</td> <td>9</td> <td>8</td> <td>2</td> <td>41</td> <td>23</td> <td>18</td> <td>35</td> <td>0</td> </tr> <tr> <th>5</th> <td>Tottenham</td> <td>19</td> <td>10</td> <td>4</td> <td>5</td> <td>34</td> <td>18</td> <td>16</td> <td>34</td> <td>0</td> </tr> <tr> <th>6</th> <td>Arsenal</td> <td>19</td> <td>10</td> <td>4</td> <td>5</td> <td>34</td> <td>23</td> <td>11</td> <td>34</td> <td>0</td> </tr> <tr> <th>7</th> <td>Burnley</td> <td>19</td> <td>9</td> <td>5</td> <td>5</td> <td>16</td> <td>15</td> <td>1</td> <td>32</td> <td>0</td> </tr> <tr> <th>8</th> <td>Leicester</td> <td>19</td> <td>7</td> <td>6</td> <td>6</td> <td>29</td> <td>28</td> <td>1</td> <td>27</td> <td>0</td> </tr> <tr> <th>9</th> <td>Everton</td> <td>19</td> <td>7</td> <td>5</td> <td>7</td> <td>24</td> <td>30</td> <td>-6</td> <td>26</td> <td>0</td> </tr> <tr> <th>10</th> <td>Watford</td> <td>19</td> <td>6</td> <td>4</td> <td>9</td> <td>27</td> <td>34</td> <td>-7</td> <td>22</td> <td>0</td> </tr> </tbody> </table> </div> <h2 id="teams-in-alphabetical-order">Teams in alphabetical order</h2> <p>I also wanted to look for the table in which a run of teams were in alphabetical order. 
I've ignored tables from August and September, when the season has barely started and teams are therefore likely to appear in a default alphabetical order.</p> <div class="highlight"><pre><span></span><code><span class="n">alpha_tables</span> <span class="o">=</span> <span class="p">[]</span>
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tables</span><span class="p">:</span>
    <span class="n">division</span> <span class="o">=</span> <span class="n">get_division_from_filename</span><span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">team_order</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">tables</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="p">)</span>
    <span class="n">ngrams</span> <span class="o">=</span> <span class="n">get_char_ngrams</span><span class="p">(</span><span class="n">team_order</span><span class="p">,</span> <span class="n">min_n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">ngrams</span><span class="p">:</span>
        <span class="c1"># keep runs that are in alphabetical or reverse alphabetical order</span>
        <span class="k">if</span> <span class="n">n</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="ow">or</span> <span class="n">n</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
            <span class="n">alpha_tables</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">i</span><span class="p">,</span> <span class="n">division</span><span class="p">,</span> <span class="n">n</span><span class="p">])</span>
</code></pre></div> <div 
class="highlight"><pre><span></span><code><span class="n">gt_seven_teams</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">alpha_tables</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="o">&gt;</span><span class="mi">7</span> <span class="ow">and</span> <span class="n">a</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">==</span><span class="nb">sorted</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">2</span><span class="p">])]</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="n">gt_nine_teams</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">alpha_tables</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span><span class="o">&gt;</span><span class="mi">9</span><span class="p">]</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">0</span><span 
class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">2</span><span class="p">])</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">tables</span><span class="p">[</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]]</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;index&quot;</span><span class="p">:</span> <span class="s2">&quot;Team&quot;</span><span class="p">})</span>
<span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">t</span>
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="mf">2004</span><span class="o">-</span><span class="mf">05</span><span class="w"> </span><span class="n">League</span><span class="w"> </span><span class="kr">On</span><span class="n">e</span><span class="w"> </span><span class="mf">2004</span><span class="o">-</span><span class="mf">11</span><span class="o">-</span><span class="mf">09</span>
<span class="err">[&#39;</span><span class="n">Bournemouth</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Bradford</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Brentford</span><span 
class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Bristol</span><span class="w"> </span><span class="n">City</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Chesterfield</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Hartlepool</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Huddersfield</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Swindon</span><span class="err">&#39;]</span> </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>1</th> <td>Luton</td> <td>17</td> <td>12</td> <td>2</td> <td>3</td> <td>34</td> <td>18</td> <td>16</td> <td>38</td> <td>0</td> </tr> <tr> <th>2</th> <td>Hull</td> <td>17</td> <td>10</td> <td>2</td> <td>5</td> <td>28</td> <td>22</td> <td>6</td> <td>32</td> <td>0</td> </tr> <tr> <th>3</th> <td>Tranmere</td> <td>17</td> <td>9</td> <td>5</td> <td>3</td> <td>26</td> <td>20</td> <td>6</td> <td>32</td> <td>0</td> </tr> <tr> <th>4</th> <td>Bournemouth</td> <td>17</td> <td>9</td> <td>3</td> <td>5</td> <td>31</td> <td>21</td> <td>10</td> <td>30</td> <td>0</td> </tr> <tr> <th>5</th> <td>Bradford</td> <td>17</td> <td>9</td> <td>2</td> <td>6</td> <td>28</td> <td>25</td> <td>3</td> <td>29</td> <td>0</td> </tr> <tr> <th>6</th> <td>Brentford</td> <td>17</td> 
<td>8</td> <td>4</td> <td>5</td> <td>21</td> <td>22</td> <td>-1</td> <td>28</td> <td>0</td> </tr> <tr> <th>7</th> <td>Bristol City</td> <td>17</td> <td>7</td> <td>6</td> <td>4</td> <td>34</td> <td>23</td> <td>11</td> <td>27</td> <td>0</td> </tr> <tr> <th>8</th> <td>Chesterfield</td> <td>17</td> <td>7</td> <td>5</td> <td>5</td> <td>23</td> <td>17</td> <td>6</td> <td>26</td> <td>0</td> </tr> <tr> <th>9</th> <td>Hartlepool</td> <td>17</td> <td>8</td> <td>2</td> <td>7</td> <td>26</td> <td>28</td> <td>-2</td> <td>26</td> <td>0</td> </tr> <tr> <th>10</th> <td>Huddersfield</td> <td>17</td> <td>7</td> <td>4</td> <td>6</td> <td>28</td> <td>23</td> <td>5</td> <td>25</td> <td>0</td> </tr> <tr> <th>11</th> <td>Swindon</td> <td>17</td> <td>7</td> <td>4</td> <td>6</td> <td>28</td> <td>25</td> <td>3</td> <td>25</td> <td>0</td> </tr> <tr> <th>12</th> <td>Sheffield Weds</td> <td>17</td> <td>6</td> <td>6</td> <td>5</td> <td>25</td> <td>23</td> <td>2</td> <td>24</td> <td>0</td> </tr> <tr> <th>13</th> <td>Port Vale</td> <td>17</td> <td>7</td> <td>2</td> <td>8</td> <td>20</td> <td>24</td> <td>-4</td> <td>23</td> <td>0</td> </tr> <tr> <th>14</th> <td>Doncaster</td> <td>17</td> <td>6</td> <td>5</td> <td>6</td> <td>21</td> <td>26</td> <td>-5</td> <td>23</td> <td>0</td> </tr> <tr> <th>15</th> <td>Colchester</td> <td>17</td> <td>6</td> <td>4</td> <td>7</td> <td>26</td> <td>22</td> <td>4</td> <td>22</td> <td>0</td> </tr> <tr> <th>16</th> <td>Wrexham</td> <td>16</td> <td>5</td> <td>6</td> <td>5</td> <td>18</td> <td>23</td> <td>-5</td> <td>21</td> <td>0</td> </tr> <tr> <th>17</th> <td>Barnsley</td> <td>17</td> <td>4</td> <td>6</td> <td>7</td> <td>21</td> <td>23</td> <td>-2</td> <td>18</td> <td>0</td> </tr> <tr> <th>18</th> <td>Blackpool</td> <td>17</td> <td>4</td> <td>6</td> <td>7</td> <td>21</td> <td>25</td> <td>-4</td> <td>18</td> <td>0</td> </tr> <tr> <th>19</th> <td>Walsall</td> <td>17</td> <td>4</td> <td>6</td> <td>7</td> <td>24</td> <td>30</td> <td>-6</td> <td>18</td> <td>0</td> </tr> 
<tr> <th>20</th> <td>Peterboro</td> <td>17</td> <td>4</td> <td>5</td> <td>8</td> <td>19</td> <td>19</td> <td>0</td> <td>17</td> <td>0</td> </tr> <tr> <th>21</th> <td>Oldham</td> <td>17</td> <td>4</td> <td>4</td> <td>9</td> <td>21</td> <td>27</td> <td>-6</td> <td>16</td> <td>0</td> </tr> <tr> <th>22</th> <td>Torquay</td> <td>16</td> <td>3</td> <td>6</td> <td>7</td> <td>20</td> <td>27</td> <td>-7</td> <td>15</td> <td>0</td> </tr> <tr> <th>23</th> <td>Milton Keynes Dons</td> <td>17</td> <td>3</td> <td>5</td> <td>9</td> <td>20</td> <td>31</td> <td>-11</td> <td>14</td> <td>0</td> </tr> <tr> <th>24</th> <td>Stockport</td> <td>17</td> <td>2</td> <td>4</td> <td>11</td> <td>16</td> <td>35</td> <td>-19</td> <td>10</td> <td>0</td> </tr> </tbody> </table> </div> <div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">2</span><span class="p">])</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">tables</span><span class="p">[</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]]</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;index&quot;</span><span class="p">:</span> <span class="s2">&quot;Team&quot;</span><span class="p">})</span>
<span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">oteams</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="n">t</span><span class="o">.</span><span class="n">Team</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">gt_seven_teams</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">2</span><span class="p">])]</span><span class="o">.</span><span class="n">index</span>
<span class="n">t</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="nb">slice</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">oteams</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">oteams</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="kc">None</span><span class="p">),</span> <span class="p">:]</span>
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="mf">2011</span><span class="o">-</span><span class="mf">12</span><span class="w"> </span><span class="n">Premier</span><span class="w"> 
</span><span class="n">League</span><span class="w"> </span><span class="mf">2011</span><span class="o">-</span><span class="mf">11</span><span class="o">-</span><span class="mf">27</span>
<span class="err">[&#39;</span><span class="n">Arsenal</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Aston</span><span class="w"> </span><span class="n">Villa</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Everton</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Norwich</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">QPR</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Stoke</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Swansea</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">West</span><span class="w"> </span><span class="n">Brom</span><span class="err">&#39;]</span>
</code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>6</th> <td>Liverpool</td> <td>13</td> <td>6</td> <td>5</td> <td>2</td> <td>17</td> <td>12</td> <td>5</td> <td>23</td> <td>0</td> </tr> <tr> <th>7</th> <td>Arsenal</td> <td>13</td> <td>7</td> 
<td>2</td> <td>4</td> <td>26</td> <td>23</td> <td>3</td> <td>23</td> <td>0</td> </tr> <tr> <th>8</th> <td>Aston Villa</td> <td>13</td> <td>3</td> <td>7</td> <td>3</td> <td>16</td> <td>17</td> <td>-1</td> <td>16</td> <td>0</td> </tr> <tr> <th>9</th> <td>Everton</td> <td>12</td> <td>5</td> <td>1</td> <td>6</td> <td>15</td> <td>16</td> <td>-1</td> <td>16</td> <td>0</td> </tr> <tr> <th>10</th> <td>Norwich</td> <td>13</td> <td>4</td> <td>4</td> <td>5</td> <td>19</td> <td>21</td> <td>-2</td> <td>16</td> <td>0</td> </tr> <tr> <th>11</th> <td>QPR</td> <td>13</td> <td>4</td> <td>3</td> <td>6</td> <td>14</td> <td>24</td> <td>-10</td> <td>15</td> <td>0</td> </tr> <tr> <th>12</th> <td>Stoke</td> <td>13</td> <td>4</td> <td>3</td> <td>6</td> <td>13</td> <td>23</td> <td>-10</td> <td>15</td> <td>0</td> </tr> <tr> <th>13</th> <td>Swansea</td> <td>13</td> <td>3</td> <td>5</td> <td>5</td> <td>12</td> <td>16</td> <td>-4</td> <td>14</td> <td>0</td> </tr> <tr> <th>14</th> <td>West Brom</td> <td>13</td> <td>4</td> <td>2</td> <td>7</td> <td>12</td> <td>20</td> <td>-8</td> <td>14</td> <td>0</td> </tr> <tr> <th>15</th> <td>Fulham</td> <td>13</td> <td>2</td> <td>6</td> <td>5</td> <td>15</td> <td>16</td> <td>-1</td> <td>12</td> <td>0</td> </tr> </tbody> </table> </div> <div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">gt_nine_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">gt_nine_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">gt_nine_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span> 
<span class="nb">print</span><span class="p">(</span><span class="n">gt_nine_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">2</span><span class="p">])</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">tables</span><span class="p">[</span><span class="n">gt_nine_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]]</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;index&quot;</span><span class="p">:</span> <span class="s2">&quot;Team&quot;</span><span class="p">})</span>
<span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">oteams</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="n">t</span><span class="o">.</span><span class="n">Team</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">gt_nine_teams</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">2</span><span class="p">])]</span><span class="o">.</span><span class="n">index</span>
<span class="n">t</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="nb">slice</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">oteams</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">oteams</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="kc">None</span><span class="p">),</span> <span class="p">:]</span>
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="mf">2011</span><span class="o">-</span><span class="mf">12</span><span class="w"> </span><span class="n">Conference</span><span class="w"> </span><span class="mf">2012</span><span class="o">-</span><span class="mf">03</span><span class="o">-</span><span class="mf">31</span>
<span class="err">[&#39;</span><span class="n">York</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Southport</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Luton</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Kidderminster</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Grimsby</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Gateshead</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="kr">For</span><span class="n">est</span><span class="w"> </span><span class="n">Green</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Cambridge</span><span class="err">&#39;</span><span class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Braintree</span><span class="w"> </span><span class="kr">To</span><span class="n">wn</span><span class="err">&#39;</span><span 
class="p">,</span><span class="w"> </span><span class="err">&#39;</span><span class="n">Barrow</span><span class="err">&#39;]</span> </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Team</th> <th>played</th> <th>won</th> <th>drawn</th> <th>lost</th> <th>for</th> <th>against</th> <th>gd</th> <th>points</th> <th>points_deduction</th> </tr> </thead> <tbody> <tr> <th>3</th> <td>Mansfield</td> <td>41</td> <td>20</td> <td>14</td> <td>7</td> <td>74</td> <td>46</td> <td>28</td> <td>74</td> <td>0</td> </tr> <tr> <th>4</th> <td>York</td> <td>39</td> <td>19</td> <td>13</td> <td>7</td> <td>74</td> <td>41</td> <td>33</td> <td>70</td> <td>0</td> </tr> <tr> <th>5</th> <td>Southport</td> <td>41</td> <td>20</td> <td>10</td> <td>11</td> <td>63</td> <td>60</td> <td>3</td> <td>70</td> <td>0</td> </tr> <tr> <th>6</th> <td>Luton</td> <td>39</td> <td>18</td> <td>13</td> <td>8</td> <td>67</td> <td>37</td> <td>30</td> <td>67</td> <td>0</td> </tr> <tr> <th>7</th> <td>Kidderminster</td> <td>41</td> <td>19</td> <td>10</td> <td>12</td> <td>71</td> <td>55</td> <td>16</td> <td>67</td> <td>0</td> </tr> <tr> <th>8</th> <td>Grimsby</td> <td>41</td> <td>18</td> <td>10</td> <td>13</td> <td>75</td> <td>57</td> <td>18</td> <td>64</td> <td>0</td> </tr> <tr> <th>9</th> <td>Gateshead</td> <td>40</td> <td>18</td> <td>10</td> <td>12</td> <td>61</td> <td>56</td> <td>5</td> <td>64</td> <td>0</td> </tr> <tr> <th>10</th> <td>Forest Green</td> <td>41</td> <td>16</td> <td>13</td> <td>12</td> <td>59</td> <td>42</td> <td>17</td> <td>61</td> <td>0</td> </tr> <tr> <th>11</th> <td>Cambridge</td> <td>39</td> <td>15</td> <td>12</td> <td>12</td> <td>47</td> <td>37</td> <td>10</td> <td>57</td> <td>0</td> </tr> <tr> <th>12</th> <td>Braintree Town</td> <td>41</td> <td>16</td> <td>9</td> 
<td>16</td> <td>70</td> <td>71</td> <td>-1</td> <td>57</td> <td>0</td> </tr> <tr> <th>13</th> <td>Barrow</td> <td>41</td> <td>16</td> <td>7</td> <td>18</td> <td>55</td> <td>67</td> <td>-12</td> <td>55</td> <td>0</td> </tr> <tr> <th>14</th> <td>Ebbsfleet</td> <td>40</td> <td>13</td> <td>10</td> <td>17</td> <td>59</td> <td>69</td> <td>-10</td> <td>49</td> <td>0</td> </tr> </tbody> </table> </div></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Tue, 20 Feb 2018 13:45:00 +0000</pubDate><guid isPermaLink="false">tag:dkane.net,2018-02-20:/2018/acrostic-football-league-tables/</guid><category>Code</category><category>Football</category></item><item><title>Names shared by genders</title><link>https://dkane.net/2018/names-shared-by-genders/</link><description><p>Building the gender classifier I've <a href="/a-name-gender-classifier/">written about here</a> got me interested in ambiguous names - those that are shared by people of both genders.</p> <p>I realised I could use the list of male and female charity trustee names I'd gathered to look into this in a bit more detail. Bearing in mind the limitations and bias of the source dataset, I think it can generate some insight.</p> <p>This blog post was written as a Jupyter notebook, so it can be re-run to reproduce the research. 
<a href="https://github.com/">You can find the notebook code on GitHub</a>.</p> <p>We start by importing the libraries we need - two built-in Python libraries (<code>csv</code> and <code>collections</code>), <code>pandas</code> for analysing the data and <code>matplotlib</code> for making charts.</p> <div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
</code></pre></div> <p>To start, we go through all the names in our source data and use the titles (Mr, Mrs, Sir and so on) to sort first names into male and female. To do this I've just taken the first word after the title, provided it's more than two characters long and contains only letters. This is fairly crude, and will be wrong in some cases (e.g. someone who uses two first names). 
But it should be good enough for our purposes.</p> <div class="highlight"><pre><span></span><code><span class="n">names</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s2">&quot;female&quot;</span><span class="p">:</span> <span class="p">[],</span>
    <span class="s2">&quot;male&quot;</span><span class="p">:</span> <span class="p">[]</span>
<span class="p">}</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&quot;extract_trustee.csv&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">a</span><span class="p">:</span>
    <span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>

        <span class="c1"># ignore rows that don&#39;t have exactly two fields</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="n">name</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&quot; &quot;</span><span class="p">)</span>  <span class="c1"># split the name by spaces</span>

        <span class="c1"># if there&#39;s only one word then ignore the name</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="c1"># if the second word (taken to be the first name) is two characters or fewer, ignore the row</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">&lt;=</span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="c1"># if there are non-alpha characters (numbers, symbols) then ignore the row</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">isalpha</span><span class="p">():</span>
            <span class="k">continue</span>

        <span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">title</span><span class="p">()</span>  <span class="c1"># get the first name (assuming name[0] is a title)</span>

        <span class="c1"># if the first word is one of these titles it&#39;s a female name</span>
        <span class="k">if</span> <span class="n">name</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;miss&quot;</span><span class="p">,</span> <span class="s2">&quot;mrs&quot;</span><span class="p">,</span> <span class="s2">&quot;ms&quot;</span><span class="p">,</span> <span class="s2">&quot;dame&quot;</span><span class="p">]:</span>
            <span class="n">names</span><span class="p">[</span><span class="s2">&quot;female&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
        <span class="c1"># if the first word is one of these titles it&#39;s a male name</span>
        <span class="k">elif</span> <span class="n">name</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;mr&quot;</span><span class="p">,</span> <span class="s2">&quot;sir&quot;</span><span class="p">]:</span>
            <span class="n">names</span><span class="p">[</span><span class="s2">&quot;male&quot;</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">name</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div> <p>Let's print the first ten female and male names to check we're on the right track.</p> <div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">names</span><span class="p">[</span><span class="s2">&quot;female&quot;</span><span class="p">][</span><span class="mi">0</span><span class="p">:</span><span class="mi">10</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">names</span><span class="p">[</span><span class="s2">&quot;male&quot;</span><span class="p">][</span><span class="mi">0</span><span class="p">:</span><span class="mi">10</span><span class="p">])</span>
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="k">[&#39;Felicity&#39;, &#39;Tessa&#39;, &#39;Elizabeth&#39;, &#39;Julie&#39;, &#39;Rosemary&#39;, &#39;Catherine&#39;, &#39;Eileen&#39;, &#39;Christa&#39;, &#39;Roberta&#39;, &#39;Beverley&#39;]</span>
<span class="k">[&#39;Oliver&#39;, &#39;Kenneth&#39;, &#39;Neil&#39;, &#39;Keith&#39;, &#39;John&#39;, &#39;Herschel&#39;, &#39;Alex&#39;, &#39;David&#39;, &#39;Christopher&#39;, &#39;Daniel&#39;]</span>
</code></pre></div> <div class="highlight"><pre><span></span><code><span class="n">female</span> <span class="o">=</span> <span 
class="n">Counter</span><span class="p">(</span><span class="n">names</span><span class="p">[</span><span class="s2">&quot;female&quot;</span><span class="p">])</span> <span class="n">male</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">names</span><span class="p">[</span><span class="s2">&quot;male&quot;</span><span class="p">])</span> <span class="n">female_names</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">female</span><span class="p">)</span> <span class="n">male_names</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">male</span><span class="p">)</span> <span class="n">all_names</span> <span class="o">=</span> <span class="n">female_names</span> <span class="o">+</span> <span class="n">male_names</span> <span class="n">female_people</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">female</span><span class="o">.</span><span class="n">elements</span><span class="p">()))</span> <span class="n">male_people</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">male</span><span class="o">.</span><span class="n">elements</span><span class="p">()))</span> <span class="n">all_people</span> <span class="o">=</span> <span class="n">female_people</span> <span class="o">+</span> <span class="n">male_people</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Female: </span><span class="si">{:,.0f}</span><span class="s2"> (from </span><span class="si">{:,.0f}</span><span class="s2"> people)&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">female_names</span><span class="p">,</span> <span class="n">female_people</span><span 
class="p">,</span> <span class="p">))</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Male: </span><span class="si">{:,.0f}</span><span class="s2"> (from </span><span class="si">{:,.0f}</span><span class="s2"> people)&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">male_names</span><span class="p">,</span> <span class="n">male_people</span><span class="p">,</span> <span class="p">))</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">{:,.1f}</span><span class="s2"> </span><span class="si">% o</span><span class="s2">f names are female, </span><span class="si">{:,.1f}</span><span class="s2"> of people in sample&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="p">(</span><span class="n">female_names</span> <span class="o">/</span> <span class="n">all_names</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span> <span class="p">(</span><span class="n">female_people</span> <span class="o">/</span> <span class="n">all_people</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span> <span class="p">))</span> <span class="n">weighting</span> <span class="o">=</span> <span class="p">((</span><span class="n">male_people</span> <span class="o">/</span> <span class="mf">0.489</span><span class="p">)</span> <span class="o">-</span> <span class="n">male_people</span><span class="p">)</span> <span class="o">/</span> <span class="n">female_people</span> <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Use a weighting of </span><span class="si">{:,.3f}</span><span class="s2"> to bring female population to 51.1</span><span class="si">% o</span><span class="s2">f sample&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span 
class="n">weighting</span> <span class="p">))</span> <span class="c1"># 100 = X + Y</span> <span class="c1"># X = male</span> <span class="c1"># Y = female</span> <span class="c1"># Z = female weighted</span> <span class="c1"># X / (X + Y) = 0.520</span> <span class="c1"># X / (X + Z) = 0.489</span> <span class="c1"># X = 0.489 * (X + Z)</span> <span class="c1"># X / 0.489 = X + Z</span> <span class="c1"># Z = (X / 0.489) - X</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="n">Female</span><span class="o">:</span><span class="w"> </span><span class="mi">15</span><span class="o">,</span><span class="mi">367</span><span class="w"> </span><span class="o">(</span><span class="n">from</span><span class="w"> </span><span class="mi">335</span><span class="o">,</span><span class="mi">237</span><span class="w"> </span><span class="n">people</span><span class="o">)</span> <span class="n">Male</span><span class="o">:</span><span class="w"> </span><span class="mi">16</span><span class="o">,</span><span class="mi">085</span><span class="w"> </span><span class="o">(</span><span class="n">from</span><span class="w"> </span><span class="mi">363</span><span class="o">,</span><span class="mi">296</span><span class="w"> </span><span class="n">people</span><span class="o">)</span> <span class="mf">48.9</span><span class="w"> </span><span class="o">%</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">names</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">female</span><span class="o">,</span><span class="w"> </span><span class="mf">48.0</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">people</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">sample</span> <span class="n">Use</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span 
class="n">weighting</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mf">1.132</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">bring</span><span class="w"> </span><span class="n">female</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="mf">51.1</span><span class="o">%</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">sample</span> </code></pre></div> <p>This gives us 15,367 female first names (from 335,237 people) and 16,085 male ones (from 363,296 people). The only issue is that the ratio is slightly off - 48.0% of our sample are female, compared to <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland">51.1% of the 18+ population in England and Wales</a>. So we can use a weighting of 1.132 to correct this when we use the counts.</p> <p>Next we work out the count of how many times each name appears for men and women. 
To do this the data is put into pandas dataframes.</p> <div class="highlight"><pre><span></span><code><span class="n">female_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">female</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s1">&#39;index&#39;</span><span class="p">)</span> <span class="n">female_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;female&quot;</span><span class="p">]</span> <span class="n">male_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">male</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s1">&#39;index&#39;</span><span class="p">)</span> <span class="n">male_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;male&quot;</span><span class="p">]</span> </code></pre></div> <p>We then apply our weighting to the female figures to adjust for the lower number in our sample.</p> <div class="highlight"><pre><span></span><code><span class="n">female_df</span> <span class="o">=</span> <span class="p">(</span><span class="n">female_df</span> <span class="o">*</span> <span class="n">weighting</span><span class="p">)</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> </code></pre></div> <p>These dataframes are merged together, joining on the name, to give us a list of first names found in the dataset alongside the number of females and males found with that name.</p> <div 
class="highlight"><pre><span></span><code><span class="n">both</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">female_df</span><span class="p">,</span> <span class="n">male_df</span><span class="p">],</span> <span class="n">join</span><span class="o">=</span><span class="s1">&#39;outer&#39;</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;total&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;female&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;male&quot;</span><span class="p">]</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;female_pc&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;female&quot;</span><span class="p">]</span> <span class="o">/</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;total&quot;</span><span class="p">]</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;male_pc&quot;</span><span class="p">]</span> <span class="o">=</span> <span 
class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;male&quot;</span><span class="p">]</span> <span class="o">/</span> <span class="n">both</span><span class="o">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s2">&quot;total&quot;</span><span class="p">]</span> <span class="n">both</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s2">&quot;total&quot;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> </code></pre></div> <p>We can see a list of the top 10 most common first names in the dataset (only two female names make the list - even after applying our weighting).</p> <div class="highlight"><pre><span></span><code><span class="n">both</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span> </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>female</th> <th>male</th> <th>total</th> <th>female_pc</th> <th>male_pc</th> </tr> </thead> <tbody> <tr> <th>David</th> <td>8.0</td> <td>21480.0</td> <td>21488.0</td> <td>0.000372</td> <td>0.999628</td> </tr> <tr> <th>John</th> <td>9.0</td> <td>21155.0</td> <td>21164.0</td> <td>0.000425</td> <td>0.999575</td> </tr> <tr> <th>Peter</th> <td>6.0</td> <td>12421.0</td> <td>12427.0</td> <td>0.000483</td> <td>0.999517</td> </tr> <tr> <th>Michael</th> <td>3.0</td> <td>12021.0</td> <td>12024.0</td> <td>0.000250</td> <td>0.999750</td> </tr> <tr> <th>Richard</th> <td>1.0</td> <td>9702.0</td> 
<td>9703.0</td> <td>0.000103</td> <td>0.999897</td> </tr> <tr> <th>Susan</th> <td>8872.0</td> <td>3.0</td> <td>8875.0</td> <td>0.999662</td> <td>0.000338</td> </tr> <tr> <th>Andrew</th> <td>1.0</td> <td>8746.0</td> <td>8747.0</td> <td>0.000114</td> <td>0.999886</td> </tr> <tr> <th>Paul</th> <td>3.0</td> <td>8492.0</td> <td>8495.0</td> <td>0.000353</td> <td>0.999647</td> </tr> <tr> <th>Robert</th> <td>1.0</td> <td>7879.0</td> <td>7880.0</td> <td>0.000127</td> <td>0.999873</td> </tr> <tr> <th>Margaret</th> <td>7866.0</td> <td>5.0</td> <td>7871.0</td> <td>0.999365</td> <td>0.000635</td> </tr> </tbody> </table> </div> <p>And a list of some of the least common - these names only appear once.</p> <div class="highlight"><pre><span></span><code><span class="n">both</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">:]</span> </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>female</th> <th>male</th> <th>total</th> <th>female_pc</th> <th>male_pc</th> </tr> </thead> <tbody> <tr> <th>Jibola</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>1.0</td> </tr> <tr> <th>Jibi</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>1.0</td> </tr> <tr> <th>Jiba</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>1.0</td> </tr> <tr> <th>Jiaokun</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> </tr> <tr> <th>Jiann</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>1.0</td> </tr> <tr> <th>Jianmin</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> </tr> <tr> <th>Jiale</th> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> <td>1.0</td> </tr> <tr> <th>Jia</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> </tr> <tr> 
<th>Jhumar</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> </tr> <tr> <th>Zyta</th> <td>1.0</td> <td>0.0</td> <td>1.0</td> <td>1.0</td> <td>0.0</td> </tr> </tbody> </table> </div> <h2 id="shared-names">Shared names</h2> <p>We now move on to what we were trying to do - get a list of names that are commonly shared between people of different genders. You can see from the top 10s above that it's not a perfect dataset - some names that you might assume are unambiguously male or female have some counterparts - there are 8 female Davids, and 3 male Susans.</p> <p>This could be a mistake in the way the algorithm was applied, a typo in the data, or people who have an unusual name. But to look only at names we would expect to be more common, I've filtered to show only names held by more than 30 men and more than 30 women. This threshold should also ensure we have a decent sample of people for each name.</p> <div class="highlight"><pre><span></span><code><span class="n">shared</span> <span class="o">=</span> <span class="n">both</span><span class="p">[(</span><span class="n">both</span><span class="p">[</span><span class="s2">&quot;male&quot;</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">30</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">both</span><span class="p">[</span><span class="s2">&quot;female&quot;</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">30</span><span class="p">)]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s2">&quot;female_pc&quot;</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">shared</span><span class="p">))</span> <span 
class="n">shared</span> </code></pre></div> <div class="highlight"><pre><span></span><code><span class="mf">23</span> </code></pre></div> <div> <style> .dataframe thead tr:only-child th { text-align: right; } .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>female</th> <th>male</th> <th>total</th> <th>female_pc</th> <th>male_pc</th> </tr> </thead> <tbody> <tr> <th>Jean</th> <td>3787.0</td> <td>50.0</td> <td>3837.0</td> <td>0.986969</td> <td>0.013031</td> </tr> <tr> <th>Pat</th> <td>1154.0</td> <td>67.0</td> <td>1221.0</td> <td>0.945127</td> <td>0.054873</td> </tr> <tr> <th>Kerry</th> <td>530.0</td> <td>45.0</td> <td>575.0</td> <td>0.921739</td> <td>0.078261</td> </tr> <tr> <th>Lyn</th> <td>337.0</td> <td>36.0</td> <td>373.0</td> <td>0.903485</td> <td>0.096515</td> </tr> <tr> <th>Kim</th> <td>711.0</td> <td>80.0</td> <td>791.0</td> <td>0.898862</td> <td>0.101138</td> </tr> <tr> <th>Jan</th> <td>802.0</td> <td>108.0</td> <td>910.0</td> <td>0.881319</td> <td>0.118681</td> </tr> <tr> <th>Lindsay</th> <td>347.0</td> <td>62.0</td> <td>409.0</td> <td>0.848411</td> <td>0.151589</td> </tr> <tr> <th>Sandy</th> <td>126.0</td> <td>42.0</td> <td>168.0</td> <td>0.750000</td> <td>0.250000</td> </tr> <tr> <th>Mel</th> <td>62.0</td> <td>54.0</td> <td>116.0</td> <td>0.534483</td> <td>0.465517</td> </tr> <tr> <th>Vivian</th> <td>87.0</td> <td>79.0</td> <td>166.0</td> <td>0.524096</td> <td>0.475904</td> </tr> <tr> <th>Leigh</th> <td>88.0</td> <td>92.0</td> <td>180.0</td> <td>0.488889</td> <td>0.511111</td> </tr> <tr> <th>Jose</th> <td>43.0</td> <td>60.0</td> <td>103.0</td> <td>0.417476</td> <td>0.582524</td> </tr> <tr> <th>Sam</th> <td>215.0</td> <td>340.0</td> <td>555.0</td> <td>0.387387</td> <td>0.612613</td> </tr> <tr> <th>Laurie</th> <td>32.0</td> <td>71.0</td> <td>103.0</td> <td>0.310680</td> <td>0.689320</td> </tr> <tr> <th>Alex</th> 
<td>177.0</td> <td>506.0</td> <td>683.0</td> <td>0.259151</td> <td>0.740849</td> </tr> <tr> <th>Chris</th> <td>419.0</td> <td>1856.0</td> <td>2275.0</td> <td>0.184176</td> <td>0.815824</td> </tr> <tr> <th>Ali</th> <td>52.0</td> <td>268.0</td> <td>320.0</td> <td>0.162500</td> <td>0.837500</td> </tr> <tr> <th>Lee</th> <td>92.0</td> <td>524.0</td> <td>616.0</td> <td>0.149351</td> <td>0.850649</td> </tr> <tr> <th>Ashley</th> <td>43.0</td> <td>257.0</td> <td>300.0</td> <td>0.143333</td> <td>0.856667</td> </tr> <tr> <th>Leslie</th> <td>78.0</td> <td>729.0</td> <td>807.0</td> <td>0.096654</td> <td>0.903346</td> </tr> <tr> <th>Francis</th> <td>51.0</td> <td>533.0</td> <td>584.0</td> <td>0.087329</td> <td>0.912671</td> </tr> <tr> <th>Terry</th> <td>54.0</td> <td>973.0</td> <td>1027.0</td> <td>0.052580</td> <td>0.947420</td> </tr> <tr> <th>Robin</th> <td>36.0</td> <td>1438.0</td> <td>1474.0</td> <td>0.024423</td> <td>0.975577</td> </tr> </tbody> </table> </div> <p>This identifies 23 names meeting our criteria. They range from Jean (99% female) to Robin (98% male). But the most interesting names come in the middle. The closest to a 50-50 split are Leigh (49% female), Vivian (52% female) and Mel (53% female). 
</p> <div class="highlight"><pre><span></span><code><span class="n">shared</span><span class="p">[[</span><span class="s2">&quot;female_pc&quot;</span><span class="p">,</span> <span class="s2">&quot;male_pc&quot;</span><span class="p">]]</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">iloc</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s2">&quot;barh&quot;</span><span class="p">,</span> <span class="n">stacked</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span> </code></pre></div> <div class="highlight"><pre><span></span><code>&lt;matplotlib.axes._subplots.AxesSubplot at 0x23fe0924e80&gt; </code></pre></div> <p><img alt="png" src="/images/output_20_1.png"></p> <p>Charting the data gives us a few clusters:</p> <ul> <li>Names that are mostly female, but with some male use: <strong>Jean, Pat, Kerry, Lyn, Kim, Jan, Lindsay</strong></li> <li>Names that are pretty close to 50-50: <strong>Mel, Vivian, Leigh, Jose, Sam</strong></li> <li>Names that are mostly male, but with some female use: <strong>Alex, Chris, Ali, Lee, Ashley, Leslie, Terry, Robin</strong></li> </ul></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Sat, 10 Feb 2018 21:32:35 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2018-02-10:/2018/names-shared-by-genders/</guid><category>Blog</category></item><item><title>A name gender 
classifier</title><link>https://dkane.net/2018/a-name-gender-classifier/</link><description><p>Something I've needed to do a couple of times is take a long list of names and classify them into male and female. For example, I've looked at lists of people who attended events to see whether they were reaching more men or women - this then helps target future events.</p> <p><em>Note: there are lots of reasons why this is a terrible idea. For a start, gender is more complicated than male/female, and there are lots of reasons why a name is a bad guide to someone's gender (<a href="http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/">and other things programmers wrongly believe about names</a>). But I've justified it here on the basis that I'm not interested in getting 100% of the genders exactly right, but instead to get a general idea of a split in a population.</em></p> <p>This is a good place to use a naïve Bayesian classifier. The best example of this technique is a spam classifier - this will work out whether an email is spam, based on comparing it to a model generated from a training data set. The "naïve" part comes from the fact that it's simply based on words that appear in the training data set, without considering the meaning of or connections between those words.</p> <p>To do this I'm going to use Python (3.6 to be exact) and the <a href="http://scikit-learn.org/stable/">scikit-learn</a> library, which has a naïve Bayesian classifier built in. It's based on the tutorial found here.</p> <h2 id="step-1-get-some-training-data">Step 1 - get some training data</h2> <p>The training data needs to be a list of names with their genders. 
I've found this a tricky thing to come by, but luckily my work with the <a href="http://data.charitycommission.gov.uk/">Charity Commission Register of Charities</a> throws up a good source.</p> <p>The large data extract they publish, under an open license, contains a list of trustees for each charity. 
It's just a list of names - they hold information like address, date of birth, etc but don't publish it for data protection reasons - but it's possible to work out the gender of a proportion of the names.</p> <p>Out of the X,000 names, X,000 (X%) include a title (Mrs, Miss, Ms, Mr). I used those four titles, which are gender-specific and cover X% of titles used. I wrote a Python script to parse the name CSV file, look for names that start with Mr, Mrs, etc and put them into a male list or female list. This produces lists of X male names and X female names.</p> <p>A couple of notes here. Like any training data set, this is not unbiased. Charity trustees represent a particular slice of the population - people from all walks of life are trustees, but the average trustee is older, whiter and richer than the average person. So the names we're using are going to reflect those demographics.</p> <p>Note too that I'm not doing anything to separate out first names and last names - I'm just using the whole string. This means that last names will be included in the model - with the hope that they'll be equally distributed and so won't make a difference to the final total. 
But this isn't a guarantee.</p> <h2 id="step-2-training-the-model">Step 2: training the model</h2> <h2 id="step-3-testing-the-model">Step 3: testing the model</h2> <h2 id="step-4-using-the-model">Step 4: using the model</h2></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Sat, 10 Feb 2018 21:19:35 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2018-02-10:/2018/a-name-gender-classifier/</guid><category>Blog</category></item><item><title>Twitter bot – random charity</title><link>https://dkane.net/2016/twitter-bot-random-charity/</link><description><p>I’ve been playing about with twitter bots, and following <a href="https://videlais.com/2015/03/02/how-to-create-a-basic-twitterbot-in-python/">some instructions on using the twitter API in python</a>, I’ve created a bot that tweets a link to a random charity every half an hour.</p> <p>The <a href="https://github.com/drkane/random-charity-bot">code and more details on the bot can be found on github</a>. From the readme:</p> <blockquote> <p>Twitter bot that tweets the name and website of a random charity. The data is based on the register held by the Charity Commission for England and Wales, <a href="http://data.charitycommission.gov.uk/">released as open data</a>.</p> <p>The bot randomly selects an active charity from the register, and tweets it. 
If the website of the charity is known it tweets that too, otherwise it gives a link to the official register entry for that charity.</p> <p>This bot isn’t affiliated with the Charity Commission, and gives no endorsement of the charities tweeted.</p> </blockquote> <p>Credit due to the following places:</p> <ul> <li>Fetch charity data script adapted from <a href="https://github.com/OpenDataServices/grantnav/blob/master/dataload/fetch_charity_data.py">https://github.com/OpenDataServices/grantnav/blob/master/dataload/fetch_charity_data.py</a></li> <li>Twitter API use from <a href="https://videlais.com/2015/03/02/how-to-create-a-basic-twitterbot-in-python/">https://videlais.com/2015/03/02/how-to-create-a-basic-twitterbot-in-python/</a></li> <li>Charity data from <a href="http://data.charitycommission.gov.uk/">http://data.charitycommission.gov.uk/</a> – used under Open Government Licence v3.0</li> </ul></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Sun, 16 Oct 2016 10:20:00 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2016-10-16:/2016/twitter-bot-random-charity/</guid><category>Blog</category></item><item><title>Development version of WordPress on Windows 10</title><link>https://dkane.net/2016/development-version-of-wordpress-on-windows-10/</link><description><p><strong>The problem</strong>: I need a local development version of WordPress to help test problems with the live site and also test updates. Previously I’ve used WAMP but that tends to get a bit messy, particularly on Windows 10.</p> <p><strong>The solution</strong>: use docker to create a container with all the needed files that can be run when necessary. 
The steps described here create a local copy of WordPress that gives you direct access to the installation files and the database, so you can make an exact copy of the live site.</p> <h2 id="step-1-install-docker">Step 1 – install Docker</h2> <p>Download and install the <a href="https://www.docker.com/products/docker-toolbox">Docker Toolbox</a>. We’ll be mostly using <code>docker-compose</code>.</p> <h2 id="step-2-configure-the-container">Step 2 – configure the container</h2> <p>This step was adapted from the <a href="https://docs.docker.com/compose/wordpress/">instructions shown in the Docker WordPress docs</a>. The main adaptations are to allow for access to the filesystem and direct database access.</p> <p>Navigate to an empty directory and create a file called <code>docker-compose.yaml</code>. Open and edit the file to include the following text:</p> <div class="highlight"><pre><span></span><code><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s">&#39;2&#39;</span>
<span class="nt">services</span><span class="p">:</span>
<span class="w">  </span><span class="nt">db</span><span class="p">:</span>
<span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mysql:5.7</span>
<span class="w">    </span><span class="nt">volumes</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">&quot;./.data/db:/var/lib/mysql&quot;</span>
<span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">always</span>
<span class="w">    </span><span class="nt">ports</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">&quot;3306:3306&quot;</span>
<span class="w">    </span><span class="nt">environment</span><span class="p">:</span>
<span class="w">      </span><span class="nt">MYSQL_ROOT_PASSWORD</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wordpress</span>
<span class="w">      </span><span class="nt">MYSQL_DATABASE</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wordpress</span>
<span class="w">      </span><span class="nt">MYSQL_USER</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wordpress</span>
<span class="w">      </span><span class="nt">MYSQL_PASSWORD</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wordpress</span>
<span class="w">  </span><span class="nt">wordpress</span><span class="p">:</span>
<span class="w">    </span><span class="nt">volumes</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">&quot;./html:/var/www/html&quot;</span>
<span class="w">    </span><span class="nt">depends_on</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">db</span>
<span class="w">    </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wordpress:latest</span>
<span class="w">    </span><span class="nt">links</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">db</span>
<span class="w">    </span><span class="nt">ports</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">&quot;8000:80&quot;</span>
<span class="w">    </span><span class="nt">restart</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">always</span>
<span class="w">    </span><span class="nt">environment</span><span class="p">:</span>
<span class="w">      </span><span class="nt">WORDPRESS_DB_HOST</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">db:3306</span>
<span class="w">      </span><span class="nt">WORDPRESS_DB_PASSWORD</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wordpress</span>
</code></pre></div> <h2 id="step-3-copy-the-live-files">Step 3 – copy the live files</h2> <p>You now need to make a copy of the files from your live site. I did this by downloading them via FTP from the server. Make sure you have a copy of all three main WordPress directories (<code>wp-admin</code>, <code>wp-content</code>, <code>wp-includes</code>) plus the <code>index.php</code> file and any files matching <code>wp-*.php</code> in the main directory.</p> <p>Copy these files into a directory called <code>html</code> in your project directory.</p> <h2 id="step-4-edit-wp-configphp">Step 4 – edit wp-config.php</h2> <p>Because you’ve directly copied the files from the live site, your configuration settings won’t work.
Open <code>wp-config.php</code> from the <code>html</code> directory in your text editor, and change the following configuration values to match the values specified in the <code>docker-compose.yaml</code> file.</p> <div class="highlight"><pre><span></span><code><span class="x">/** The name of the database for WordPress */</span>
<span class="x">define(&#39;DB_NAME&#39;, &#39;wordpress&#39;);</span>
<span class="x">/** MySQL database username */</span>
<span class="x">define(&#39;DB_USER&#39;, &#39;wordpress&#39;);</span>
<span class="x">/** MySQL database password */</span>
<span class="x">define(&#39;DB_PASSWORD&#39;, &#39;wordpress&#39;);</span>
<span class="x">/** MySQL hostname */</span>
<span class="x">define(&#39;DB_HOST&#39;, &#39;db:3306&#39;);</span>
</code></pre></div> <p>You’ll also need to add or edit variables that give the URL of your site so that it doesn’t redirect to the live one. Add the following lines (removing any existing definitions of <code>WP_SITEURL</code> and <code>WP_HOME</code>):</p> <div class="highlight"><pre><span></span><code><span class="x">define(&#39;WP_HOME&#39;,&#39;http://192.168.99.100:8000/&#39;);</span>
<span class="x">define(&#39;WP_SITEURL&#39;,&#39;http://192.168.99.100:8000/&#39;);</span>
</code></pre></div> <p>(Note: I think <code>192.168.99.100</code> is the default IP address Docker Toolbox gives to its virtual machine, rather than to the container itself, but yours may be different; running <code>docker-machine ip</code> in the Docker terminal should show it. The <code>:8000</code> port is configured in the <code>docker-compose.yaml</code> file.)</p> <h2 id="step-5-run-the-container">Step 5 – run the container</h2> <p>To run the container I ran the “Docker Quickstart Terminal”, which is installed to the Start menu.
This gives access to the Docker virtual machine command line so you can run commands.</p> <p>Navigate to the directory where the <code>docker-compose.yaml</code> you created above is located (using <code>cd xxxx</code>) and then run</p> <div class="highlight"><pre><span></span><code>docker-compose up -d </code></pre></div> <p>This will create the database and WordPress containers and start them. You can now access the installation in three ways:</p> <ul> <li>View the website (and access the admin backend) through your web browser at <a href="http://192.168.99.100:8000/">http://192.168.99.100:8000/</a> (nb don’t do this yet as we haven’t copied the database)</li> <li>Edit theme and plugin files through your file system using the <code>html/wp-content/</code> directory in your project</li> <li>Directly access the database by connecting through host <code>192.168.99.100</code> on port <code>3306</code> using the user <code>wordpress</code> with password <code>wordpress</code></li> </ul> <h2 id="step-6-copy-the-database">Step 6 – copy the database</h2> <p>The next stage is to take a copy of your live MySQL database and apply it to the local version. 
I did this by:</p> <ul> <li>Connecting to the database using <a href="http://www.heidisql.com/">HeidiSQL</a> with the credentials given above</li> <li>Dropping or emptying all the tables created when the container was started</li> <li>Getting an SQL dump of the live site as a single file (I used <a href="https://wordpress.org/plugins/wordpress-backup-to-dropbox/">wpb2d</a> for this)</li> <li>Editing the SQL dump file so that the <code>USE database;</code> statement at the top of the file was changed to <code>USE wordpress;</code> – this makes sure the SQL is executed in the right database</li> <li>Loading the SQL dump file in HeidiSQL and then running it.</li> </ul> <p>Depending on the size of your file this might take a little while.</p> <h2 id="step-7-use-the-site">Step 7 – use the site</h2> <p>Now if you go to <a href="http://192.168.99.100:8000/">http://192.168.99.100:8000/</a> you should be able to access your site, and the admin backend should be available at <a href="http://192.168.99.100:8000/wp-admin/">http://192.168.99.100:8000/wp-admin/</a>. You should be able to log in using the same credentials you use on the live site.</p> <h2 id="notes">Notes</h2> <ul> <li>I found that if I made changes to the <code>docker-compose.yaml</code> file when tweaking the setup I would then need to run the <code>docker-compose down</code> command before re-running <code>docker-compose up -d</code>. There may be a better way to do this (<code>docker-compose build</code>?)</li> <li>You may find that some URLs (particularly in pages and posts) are hardcoded to point to the live site – this can mean you switch to the live version without realising it, so be careful! You can use <a href="https://interconnectit.com/products/search-and-replace-for-wordpress-databases/">something like this tool from interconnectit</a> to change these values in the database.</li> <li>Be careful about user data – any personal data held in the live site is now also held in the local site.
You may want to replace or anonymise the data in the <code>wp_users</code> table.</li> <li>You can <a href="http://blog.pavelsklenar.com/5-useful-docker-tip-and-tricks-on-windows/">add your docker IP address to your hosts file</a> to make it into a nicer URL. If you do this don’t forget to change the <code>WP_SITEURL</code> and <code>WP_HOME</code> variables to reflect the change.</li> </ul></description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David Kane</dc:creator><pubDate>Mon, 13 Jun 2016 11:19:35 +0100</pubDate><guid isPermaLink="false">tag:dkane.net,2016-06-13:/2016/development-version-of-wordpress-on-windows-10/</guid><category>Code</category></item></channel></rss>