<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <title>Artificial Informer</title>
  <meta name='description' content='Artificial Informer - Issue One, April 2019'>
  <meta name='subtitle' content='An electronic journal of machine learning applied to investigative reporting'>
  <meta name='keywords' content='artificial intelligence, investigative journalism'>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" href="https://artificialinformer.com/theme/images/ai.png">
  <link href="https://artificialinformer.com/feeds/artificial-informer-all.atom.xml" type="application/atom+xml" rel="alternate" title="Artificial Informer Full Atom Feed" />
  <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.0/build/base-min.css">
  <link rel="stylesheet" type="text/css" href="https://artificialinformer.com/theme/css/style.css" />
</head>
<body id="index" class="home">
<header id="banner" class="body">
  <a href="https://artificialinformer.com/" id="banner-link" style="text-decoration: none">
<pre id="placeholder">
 ARTIFICIAL
    ____ _   __ ______ ____   ____   __  ___ ______ ____
   /  _// | / // ____// __ \ / __ \ /  |/  // ____// __ \
   / /  /  |/ // /_   / / / // /_/ // /|_/ // __/  / /_/ /
 _/ /  / /|  // __/  / /_/ // _, _// /  / // /___ / _, _/
/___/ /_/ |_//_/     \____//_/ |_|/_/  /_//_____//_/ |_|

 An electronic journal of machine learning
 applied to investigative reporting
</pre>
  </a>
</header><!-- /#banner -->
<div id="main">
<section id="content">
<nav id="table-of-contents">
  <h2>Issue One: Table of Contents</h2>
  <article class="introduction">
    <p>
      <em>Artificial Informer</em>.
April 2019.<br/> Edited by Lisa van Dam-Bates.<br/> Written by <a target="_blank" href="https://twitter.com/bxroberts">Brandon Roberts</a>.<br/> <a href="https://artificialinformer.com/files/artificial-informer-issue-one.pdf">Print version (PDF) available here.</a> </p> </article> <ol class="entries"> <!--<li class="entry">--> <!--<div class="link">--> <!--<a href="">Introduction to Issue 1</a>--> <!--</div>--> <!--<div class="desc">A short introduction to <em>The Informer</em>'s first issue.</div>--> <!--</li>--> <li class="entry"> <div class="link"> <a href="https://artificialinformer.com/issue-one/an-interview-with-chase-davis-journalism-machine-learning-pioneer.html">An Interview with Chase Davis</a> </div> <div class="desc">A pioneer in applying machine learning techniques to investigative journalism.</div> </li> <li class="entry"> <div class="link"> <a href="https://artificialinformer.com/issue-one/dissecting-a-machine-learning-powered-investigation.html">Dissecting a Machine Learning Powered Investigation</a> </div> <div class="desc"> Uncovering property tax evasion in Austin, Texas, using machine learning and statistical modeling. An investigative recipe. </div> </li> <li class="entry"> <div class="link"> <a href="https://artificialinformer.com/issue-one/rethinking-web-scraping-with-autoscrape.html">Rethinking Web Scraping with AutoScrape</a> </div> <div class="desc"> Web scraping is tedious. In this article, we explore patterns for building scrapers and introduce a new tool, AutoScrape. </div> </li> <li class="entry"> <div class="link"> <a href="https://artificialinformer.com/issue-one/glossary-of-terminology.html">Glossary of Terminology</a> </div> <div class="desc"> Explanations of the various technical terms and concepts found in this issue. </div> </li> </ol><!-- /.entries --> </nav><!-- /#table-of-contents --><div class="separator" style="background-image: url('https://artificialinformer.com/theme/images/noise.gif');"> </div> <div id="post-list"> <div> <article class="hentry"> <footer class="article-info"> <span class="issue"> <em>Artificial Informer</em> - Issue One </span> <time class="published" datetime="2019-04-02T12:00:00-07:00"> April 2019 </time> </footer><!-- /.post-info --> <header class="article-header"> <h2 class="entry-title"> <a href="https://artificialinformer.com/issue-one/an-interview-with-chase-davis-journalism-machine-learning-pioneer.html" rel="bookmark" title="Permalink to An Interview with Chase Davis: Journalism + Machine Learning Pioneer"> An Interview with Chase Davis: Journalism + Machine Learning Pioneer </a> </h2> </header> <div class="entry-content article interview"> <p><a href="http://www.chasedavis.com/">Chase Davis</a> is a journalist based in his hometown, Minneapolis, where he's the Senior Digital Editor at the <em>Star Tribune</em>. He got his start working at <em>The Houston Chronicle</em>, eventually co-founding a media technology consultancy, Hot Type Consulting, which brought him to help launch <a href="https://www.texastribune.org/"><em>The Texas Tribune</em></a>. He's probably best known for leading the Interactive Desk at <em>The New York Times</em> and for being a major presence at NICAR conferences. 
He's spoken on topics including advanced computational techniques, like machine learning (ML), and <a href="https://www.ire.org/events-and-training/event/3433/4157/">innovation at small, local journalism organizations</a>.</p> <p>We talked to Chase for this first issue not only because he was way ahead of the curve on using ML in the newsroom but also because he has a wide range of practical experience in both the trenches of journalism and in the tech industry. Additionally, he was at <em>The New York Times</em> at a pivotal moment during its <a href="https://web.archive.org/web/20190130040741/https://www.buzzfeednews.com/article/mylestanzer/exclusive-times-internal-report-painted-dire-digital-picture">storied and successful transition to digital</a>. We reached him via email for this interview. Our questions are in bold.</p> <p><strong>Just to get started, can you give us some background on how you got into journalism and where your interest in it comes from?</strong></p> <p>I followed a pretty typical path into journalism: I was the editor of my high school paper, went to journalism school at Mizzou, worked at student papers and did a bunch of internships, and kind of moved on from there. As for where the interest came from, I think I just liked the idea of having a career where constant learning was part of the job description.</p> <p><strong>As far as I understand, you don't have a formal computer science degree/education. How did you pick up your data and programming skills?</strong></p> <p>Not only do I not have a formal computer science degree — I basically failed the only computer science classes I ever took. I started learning to code for fun in middle school, and I kept it up by pursuing personal projects. In college I figured out that using data, basic programming, web scraping, etc., could help me get scoops as a reporter. And then, almost by accident, I discovered IRE and NICAR, which gave me an outlet for refining those skills and applying them more broadly to journalism.</p> <p><strong>Your ML presentation at NICAR 2013 with Jeff Larson was really far ahead of its time. Can you explain how you got into ML and applying it to journalism?</strong></p> <p>I started getting interested in machine learning pretty soon after I graduated college in 2006. It just seemed like an interesting new frontier with big potential in the world of data journalism. So I started studying it in my spare time, checking out books from libraries, trying to make sense of them, and making little prototypes. At some point I ended up on a couple working groups, notably one put together by Brant Houston at the University of Illinois, between the journalism school and the National Center for Supercomputing Applications there. Those really convinced me there was some potential there.</p> <p>For a while, machine learning in journalism felt like a solution in search of a problem. But there have been a few problems in recent years where knowing some machine learning has really been helpful. Having some background in the subject has made it easier to step in, see some of those opportunities, and put together solutions that make sense in the context of news.</p> <p><strong>A lot of focus is on where ML should be used in journalism. Are there any areas in journalism that you feel ML <em>shouldn't</em> be used?</strong></p> <p>Most of them. 
At this point I think the problem space for supervised learning, in particular, is relatively small and well defined: namely a handful of data cleaning, parsing and classification problems that defy rules-based approaches. Unsupervised learning is more of a question mark. But for most problems in journalism at this point, machine learning is both overkill and even a little dangerous — especially if you don't understand the techniques well enough to adjust for some of their weaknesses.</p> <p><strong>Are there any media organizations that you think are doing the best in using these new technologies?</strong></p> <p>Beyond the WaPo, NYT, <em>ProPublica</em>, etc., I think the Data Desk at the <em>LA Times</em> has done some incredible stuff.</p> <p><strong>Do you see understanding statistics and computer science as becoming a requirement for future journalists, if it's not already?</strong></p> <p>We're getting to the point where a basic understanding of things like probability and technology should be a requirement for any professional human, journalist or not. But do I think every journalist needs to be able to code? No. Should they all be able to run a regression analysis? No.</p> <p>That said, I do think we need to stop with the willful ignorance: "I'm a journalist, I'm bad at math!" or "I don't do computers!" Not only are assertions like that self-defeating, they sound increasingly absurd as technology's role in society continues to grow.</p> <p>Plus, our job is basically to learn things for a living. We don't just get to throw up our hands when the learning gets hard.</p> <p><strong>It's not ML, but I like to ask people who know about advanced computational techniques about this since journalism is still dominated by it: what do you think about the prevalence of Excel in the newsroom?</strong></p> <p>Excel has been the gateway drug for literally thousands of journalists to begin exploring data analysis in the newsroom. Without the right people learning Excel at the right times — and organizations like IRE/NICAR being there to teach them — I'm almost horrified to think of all incredible stories that never would have been told.</p> <p><strong>Any advice for people looking to get into journalism in 2019 who can't go to college?</strong></p> <p>Do good work and don't be a jerk.</p> <p>Seriously. In journalism more than most professions, it's all about the work. If you have a demonstrated history of doing good stuff, you're a humble, curious and decent human being, and you plug into the communities of people who do this kind of thing on a daily basis, there's a place for you in this industry.</p> </div><!-- /.entry-content --> </article> </div> <div class="separator" style="background-image: url('https://artificialinformer.com/theme/images/noise.gif');"> </div> <div> <article class="hentry"> <footer class="article-info"> <span class="issue"> <em>Artificial Informer</em> - Issue One </span> <time class="published" datetime="2019-04-02T11:00:00-07:00"> April 2019 </time> </footer><!-- /.post-info --> <header class="article-header"> <h2 class="entry-title"> <a href="https://artificialinformer.com/issue-one/dissecting-a-machine-learning-powered-investigation.html" rel="bookmark" title="Permalink to Dissecting a Machine Learning Powered Investigation"> Dissecting a Machine Learning Powered Investigation </a> </h2> </header> <div class="subtitle"> Uncovering local property tax evasion using machine learning and statistical modeling. An investigative recipe. 
</div> <div class="author-container"> <address class="vcard author"> By Brandon Roberts </address> </div> <div class="entry-content article article"> <p>If there's one universal investigative template I've come across in my journalism career, it's this: take a list of names or organizations, find those names in another <em>dataset</em><a href="#glossary-dataset" class="gentry">[1]</a> and identify the most suspicious ones for further investigation. I call this a "data in, data out" investigation. While I've personally only applied this pattern to property tax related stories, it could also be extended to areas like campaign finance and background research.</p> <p>Before I get into the technical dirt, I want to talk about why data in, data out investigations are ideal. Going into a dataset essentially blind and letting the data provide leads helps keep things less biased. Smaller newsrooms can benefit, too, because they don't need whistleblowers telling them exactly who or what to look for. To me, the best part is these kinds of investigations lend themselves well to advanced computer assisted methods—eliminating a lot of the tedium that goes into investigative work.</p> <p>I'm going to dissect a data in, data out investigation and the <em>machine learning (ML)</em><a href="#glossary-ml" class="gentry">[10]</a> tools that went into it.</p> <h3><a href="#investigation-overview">The Investigation: An Overview</a></h3> <p>The City of Austin, like many cities across the US, requires people who list their homes on sites like HomeAway and Airbnb to register and pay taxes on each stay, just like hotels do. These types of rentals are called "short-term rentals" (STRs). Properties used as STRs generally aren't eligible for most tax exemptions. In Texas, home owners can apply for a so-called "homestead exemption". These add up to significant tax savings. Because of this, there are lots of rules around them. For example, only one property per person or couple can be exempt. It needs to be their primary residence and it can't be used commercially.</p> <p>We wanted to see whether the appropriate rental taxes were being paid and if the appraisal district was letting STR operators apply for and receive tax exemptions—a clear violation of the law.</p> <p>A list of STRs registered with the city was the primary dataset. Secondarily, we got datasets of appraisal district properties and electric utility subscribers. We took the STR list and used a ML algorithm to look for matches in the other two datasets. This identified STR operators (individuals or couples) who owned multiple properties and had applied for exemptions.</p> <p>Next, we checked each joint STR/property record for exemptions and ranked them by suspiciousness. This ranking was accomplished using ML. From there, we filed a series of public information requests to verify our findings.</p> <p>Joining two datasets based on one of their columns is a basic strategy common to many investigations. Relational databases were created to do exactly this. But in our case, we had low quality data and a lot of it: roughly a million records, combined. Names were spelled inconsistently and addresses were often abbreviated or were missing apartment numbers. Standard tools didn't cut it, so we turned to ML.</p> <h3><a href="#linking-records">Linking Records with Machine Learning</a></h3> <p>The foundation of my approach to this investigation is a ML algorithm called Locality Sensitive Hashing (LSH). 
This is part of the <em>unsupervised</em><a href="#glossary-unsupervised-ml" class="gentry">[19]</a> machine learning family of algorithms. Unsupervised learning gets its name because it learns directly from the data itself. Unlike other forms of ML, unsupervised methods don't require people to tediously sort through data and tag individual records. These algorithms are inexpensive to use, can work in a variety of situations and lend themselves to straightforward, practical applications. Unsupervised learning excels in problems like search, uncovering structure, similarity, grouping and segmentation. It's my opinion that unsupervised machine learning holds the most promise for the future of local and nonprofit investigative journalism.</p> <p>LSH is an algorithmic method for taking large amounts of data and efficiently grouping together records which are similar to each other. It can handle inexact or "fuzzy" matches, greatly reducing the amount of <em>preprocessing</em><a href="#glossary-preprocessing" class="gentry">[8]</a> needed.</p> <p class="center caption"></p> <p><img alt="LSH: A Visual Description" src="/images/lsh_visual_description.gif" title="LSH: A Visual Description" class="img_anim"></p> <p class="center caption">An animation of the process of grouping three short random words into groups using LSH.</p> <p>In short, LSH operates by taking data points and assigning them to "buckets" or groups. When using LSH to link records from two datasets, we take the larger dataset, build an LSH <em>index</em> and then take the smaller dataset and <em>query</em> that index. The result of a query is a list of records which should be similar to the record we're querying.</p> <p>I'm going to give a technical overview of the specific LSH algorithm I used. Feel free to skip this part if you don't care.</p> <div class="highlight"><pre><span></span>--------------- START TECHNICAL DISCUSSION --------------- </pre></div> <p>In order to understand LSH, we first need to discuss the k-Nearest Neighbors (k-NN) algorithm. What k-NN does is take one set of records and allow you to find some number of most similar records. That number is the <em>k</em> in k-NN. So a k-NN algorithm that only finds the single most similar record is known as k-NN with <code>k=1</code> or (sometimes) <em>1-NN</em>. In its most basic form, this algorithm checks every point against each other and gets a score for how similar they are. We then rank these and take the k-number of closest matches. While this algorithm has its place, it is problematic in our investigation for several reasons:</p> <ol> <li>I had a dataset of 20,000 properties and another dataset of 800,000 utilities records. Doing the math, 20,000 records times 800,000, tells us 16,000,000,000 checks need to be performed to find groups using plain old k-NN.</li> <li>Even if doing that number of checks was feasible, I'd need to come up with a <em>distance</em><a href="#glossary-distance" class="gentry">[2]</a> threshold—a number that establishes how similar two records need to be in order to be considered the same.</li> </ol> <p>Both of these problems are annoying to solve and can be avoided by using LSH, which is a form of approximate nearest neighbors.</p> <p>LSH goes about finding similar records using what I mentally categorize as a mathematical cheat code: if two records, A and B, are similar to a third record, then the two are likely similar to each other. It's a massive simplification of how LSH works, but it serves as a decent mental model. 
This gets around having to check every record against every other record in order to find groups. We only need to check all records against that third thing.</p>

<p>In reality, LSH works by <em>projecting</em><a href="#glossary-projection" class="gentry">[9]</a> our records onto a randomized <em>hyperplane</em> (basically a random <em>matrix</em><a href="#glossary-matrix" class="gentry">[11]</a> of data), doing a little rounding, and assigning them to a bucket. The smaller the hyperplane is, the higher the chance that records will end up in the same bucket, despite dissimilarity.</p>

<p>Generally, this is a lot quicker than k-NN because, for each record, we only need to do a multiplication with the hyperplane. Mathematically speaking, LSH scales linearly with the number of records, as opposed to brute-force k-NN, which scales quadratically: the number of checks grows with the product of the two dataset sizes. In my example above of 20,000 and 800,000 records, I only need to do 820,000 checks (800,000 plus 20,000 checks against our hyperplane). Obviously, this is far less than the 16,000,000,000 checks with plain k-NN.</p>

<div class="highlight"><pre><span></span>--------------- END TECHNICAL DISCUSSION --------------- </pre></div>

<p>Grouping records in LSH is done probabilistically, meaning you probably won't find every match and some non-matches might also fall into your results. Fortunately, there are ways to tune LSH in order to avoid this. LSH has two main configuration options, if you will. The most important is the <em>hash</em><a href="#glossary-hash" class="gentry">[5]</a> size. The larger the hash size, the more similar records must be to end up in the same bucket. Another way to improve accuracy is to use multiple LSH indexes. This increases the chances of records falling into the same bucket if they are indeed similar. I tend to use three LSH indexes with a hash size of either 512 or 1024 bits. (Another strategy for finding a good hash size is to start with the square root of the number of records and go up from there. <a href="https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set">Wikipedia has a decent page on this</a>.)</p>

<p>The process for grouping records between two datasets using LSH goes like this:</p>

<ol>
  <li>Choose a column to match records with and extract that column from each dataset (e.g., we want to match against owner name, so get the "owner name" column from one dataset and "name" on the other).</li>
  <li>Take only one of our extracted columns for matching and build an LSH index with it (e.g., build an LSH index on the "owner name" column).</li>
  <li>Take the second, remaining column and query (get bucket assignments from) our LSH index (e.g., for each value in the "name" column, get the most similar names from the "owner name" column). The results of each query form a "group" of similar records.</li>
  <li>For each match in each group, grab the original full record. We will use the combined group of records to continue our investigation.</li>
</ol>

<p>This process results in a list of grouped records. I tend to flatten these groups into wide rows by combining fields from both datasets. I also add some extra columns describing things like the number of matches found in the bucket and some averages from the records. We can use this data to filter out groups that aren't worth investigating further and then rank the remaining records by investigative interest.</p>
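<p>To make the mechanics concrete, here is a minimal, illustrative Python sketch of the random-hyperplane idea: vectorize each name, project it against a random matrix, keep only the signs, and bucket records whose sign patterns collide. The hash size, band count, shingle length and example names are assumptions for demonstration, not the configuration used in this investigation; for real workloads a maintained library (e.g., datasketch or annoy) is a better starting point.</p>

<div class="highlight"><pre>
# Illustrative random-hyperplane LSH for fuzzy name matching.
# All constants and column values here are made up for the example.
import zlib
from collections import defaultdict
import numpy as np

HASH_BITS = 64     # the "hash size"
BANDS = 8          # splitting the signature acts like several smaller indexes
DIM = 2 ** 12      # size of the character n-gram vectors

rng = np.random.default_rng(0)
hyperplanes = rng.normal(size=(HASH_BITS, DIM))  # the random "hyperplane" matrix

def vectorize(text, n=3):
    """Bag of character 3-grams, hashed into a fixed-size vector."""
    v = np.zeros(DIM)
    s = " ".join(text.lower().split())
    for i in range(max(len(s) - n + 1, 1)):
        v[zlib.crc32(s[i:i + n].encode()) % DIM] += 1
    return v

def bucket_keys(text):
    """Project onto the hyperplanes, keep the signs, split into band keys."""
    signs = vectorize(text) @ hyperplanes.T > 0
    return [(b, signs[b::BANDS].tobytes()) for b in range(BANDS)]

# Build the index on the larger dataset (the "owner name" column) ...
index = defaultdict(set)
for owner in ["SMITH JOHN A", "ACME PROPERTY HOLDINGS LLC", "DOE JANE"]:
    for key in bucket_keys(owner):
        index[key].add(owner)

# ... then query it with the smaller dataset (the "name" column).
for name in ["John A Smith", "Jane Doe"]:
    candidates = set().union(*(index[key] for key in bucket_keys(name)))
    print(name, "->", sorted(candidates))  # candidate matches, if any
</pre></div>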
<p>But instead of doing that last ranking step by hand, I propose using another ML technique: regression.</p>

<h3><a href="#ranking-records">Ranking Records by Suspiciousness with Lasso Regression</a></h3>

<p>So, we've successfully identified groups of records and now want to uncover ones that warrant intensive, manual research. We've also built a few aggregates from each group to work with. In the investigation described above, we had these aggregate variables (or <em>features</em><a href="#glossary-feature" class="gentry">[4]</a>) to work with:</p>

<ol>
  <li>Number of properties owned.</li>
  <li>Total number of exemptions held.</li>
  <li>Total monetary amount of combined exemptions.</li>
  <li>Property type (there were three types of properties, with different rules about how exemptions may legally be applied).</li>
  <li>Average valuation of all properties owned.</li>
</ol>

<p>Typically, one might consider writing a program to take these values and use them to rank grouped records. We could do this by assigning an importance value, or <em>weight</em><a href="#glossary-weight" class="gentry">[21]</a>, to each variable, combining them and obtaining a score for each group.</p>

<p>This may sound good, but there are still some open questions. How do we combine the variables for each group? How do we compare variables with different ranges? The valuation of a property, for example, is in the hundreds of thousands, while the number of exemptions applied to a property is in the single digits. How do we combine everything once we've decided on importance? What do we do if some of the variables are redundant and overpower other, more important, variables? Luckily, there is an ML algorithm that can help with all of these problems.</p>

<p>Regression is a statistical modeling technique in the <em>supervised</em><a href="#glossary-supervised-ml" class="gentry">[17]</a> family of ML algorithms. If you've heard about ML before, chances are it involved a supervised learning algorithm. This family of algorithms is useful in cases where you have some task to replicate and examples of it being done. Some problems and algorithms need more examples than others.</p>

<p>A classic example is using regression to predict the price of a home based on the number of rooms, bathrooms, existence of a garage, etc. If we give a regression algorithm enough data about homes along with the price of each (we call this <em>training data</em><a href="#glossary-training" class="gentry">[18]</a>), the model can learn to predict price when given information about new homes.</p>

<p>Lasso, short for Least Absolute Shrinkage and Selection Operator, is a specific type of regression that not only makes predictions based on training data, but can also automatically assign importance values and eliminate useless features from consideration.</p>

<p>In order to use regression to rank our properties by suspiciousness, we need to score some of our records by hand. So I browsed my dataset, found a few seemingly egregious examples, and gave them high scores (I chose zero through ten, with ten being very suspicious). Then I found a few records I wanted to be sure to ignore and gave them a low score. With a few of these hand labeled records, I was ready to build my Lasso model.</p>

<p><img alt="Lasso regression model" src="/images/regression_explanation.png#center" title="Lasso regression model scatter plot matrix"></p>

<p class="center caption">Visual description of a Lasso regression model. In this scatter matrix, each feature is plotted against the other features, forming a scatter plot for each combination. For example, the bottom left plot shows "property type" on the y-axis vs. "number of exemptions" on the x-axis. Points are colored by the suspiciousness scores obtained by the combination of the two features. Red dots are the most suspicious; dark blue/purple dots are the least suspicious.</p>
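<p>As a rough sketch of what this modeling step can look like in code, here is one way to fit such a model with scikit-learn. The feature rows and hand-assigned scores below are invented for illustration; only the workflow (put the features on a comparable scale, fit the Lasso, read off the weights, score everything) mirrors the process described here.</p>

<div class="highlight"><pre>
# Illustrative only: fit a Lasso model on a few hand-scored groups,
# then use it to rank every grouped record by suspiciousness.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

features = ["n_properties", "n_exemptions", "total_exemption_amt",
            "property_type", "avg_valuation"]

# Hand-labeled examples: aggregate features plus a 0-10 suspiciousness score.
X_labeled = np.array([
    [ 2, 3,  45000.0, 1, 350000.0],
    [ 1, 1,  12000.0, 1, 210000.0],
    [40, 2,  30000.0, 2, 180000.0],   # looks like a whole apartment complex
    [ 3, 4,  80000.0, 3, 900000.0],
])
y_labeled = np.array([8, 1, 0, 10])

# Scaling answers the "different ranges" problem; Lasso assigns the weights
# and zeroes out redundant features on its own.
scaler = StandardScaler().fit(X_labeled)
model = Lasso(alpha=0.1).fit(scaler.transform(X_labeled), y_labeled)

for name, weight in zip(features, model.coef_):
    label = f"{weight: .4f}" if abs(weight) > 1e-12 else "Ignored"
    print(f"{name:>22} => {label}")

# Score the full set of grouped records (here reusing the toy matrix)
# and rank them, most suspicious first.
scores = model.predict(scaler.transform(X_labeled))
ranking = np.argsort(-scores)
</pre></div>

<p>The learned coefficients (<code>model.coef_</code>) play the same role as the importance values in the model breakdown shown below.</p>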
<p>There's a major benefit to using regression models: the weights they assign to your input variables are straightforward to understand. Other ML models are basically <a href="https://en.wikipedia.org/wiki/Black_box">black boxes</a>. This is so much the case that there's an entire subfield of ML dedicated to building models that can interpret other models. In journalism, we need to quickly check our model's assumptions and biases. Regression gives us a way to do this.</p>

<p>As an example, here's a breakdown of a Lasso model I trained:</p>

<div class="highlight"><pre>
                 Lasso model explanation

  Feature                      Importance
  _____________________________________________________
  Property type           =>   2.3802236455789316
  Total exemption amount  =>   0.9168560761627526
  Number of properties    =>  -0.052699999380756465
  Number of exemptions    =>   Ignored
  Valuation               =>   Ignored

  (Prop type * 2.38) - (No. props * 0.05) + (Total exempt amt * 0.91)
  ===================================================================
                          Suspiciousness
</pre></div>

<p>In general, this model thought "property type" was the most important feature. But it found records with large numbers of properties owned by a single person worth ignoring. This is because the data lacked details like unit/apartment numbers, so searches returned properties for every apartment in a complex. Because of this, the regression model decided to ignore buckets with huge numbers of properties. The total exemption amount would bring up a record's suspiciousness score, while number of exemptions and valuation were ignored.</p>

<p>In regression, each of these weights gets multiplied by the corresponding value from each record. When summed, we get a suspiciousness score for each record. Data in, data out.</p>

<p>There were three property types: 1, 2 and 3. Since there were very few type 2 and 3 properties in my dataset, it was important to look into them. Anything rare in a dataset is usually worth looking into. Some of the addresses pointed to large apartment complexes. Unfortunately, our data was lacking apartment or suite numbers, so searches for these properties would bring back every unit in the building. If a bucket contained a lot of properties, it was probably because of a search result like this. Total exemption amount was also an important variable. In this case, my model figured out that "number of exemptions" and "valuation" were directly related to the "total exemption amount". Because of this, we only needed the total exemption amount and could ignore the other two variables.</p>

<p>Using this, I was able to run my grouped records through the trained Lasso regression model and get back a ranked list of records. The top scoring records are the most likely to be worth manual investigative effort.</p>

<h3><a href="#conclusion">Conclusion</a></h3>

<p>In one investigation, when the algorithms were done spitting out results, we filed public information requests for hundreds of property tax applications. We got a stack of paper four inches thick that my girlfriend and I carefully spread across our living room floor, cataloged and verified.
<em>The Austin</em> <em>Bulldog</em> used this analysis to produce a two-part <a href="https://web.archive.org/web/20151004213528/https://www.theaustinbulldog.org/index.php?option=com_content&view=article&id=266:homestead-exemptions-rife-with-abuse&catid=3:main-articles">investigative</a> <a href="https://web.archive.org/web/20160329231301/https://www.theaustinbulldog.org/index.php?option=com_content&view=article&id=268:homestead-exemptions-a-tax-loophole&catid=3:main-articles">story</a> about the appraisal district's reluctance to vet tax exemption applications. Hundreds of thousands of dollars in back taxes were reclaimed for taxpayers.</p> <p>In another investigation, we identified properties that were likely running large, illegal online STRs and getting unwarranted tax breaks. The fallout from this one is still ongoing.</p> <p>Both investigations wouldn't have been possible without algorithmic assistance. But no matter how many fancy computational methods you use during an investigation, you can never escape the need to go back to the primary, physical sources and meticulously check your work.</p> <p>Machine learning is just a tool, not magic. It won't do our investigation for us and it won't follow the leads it provides. But when used correctly, it can unlock leads from data and help get us to the part we’re good at: writing stories that expose wrongdoing and effecting change.</p> </div><!-- /.entry-content --> </article> </div> <div class="separator" style="background-image: url('https://artificialinformer.com/theme/images/noise.gif');"> </div> <div> <article class="hentry"> <footer class="article-info"> <span class="issue"> <em>Artificial Informer</em> - Issue One </span> <time class="published" datetime="2019-04-02T10:00:00-07:00"> April 2019 </time> </footer><!-- /.post-info --> <header class="article-header"> <h2 class="entry-title"> <a href="https://artificialinformer.com/issue-one/rethinking-web-scraping-with-autoscrape.html" rel="bookmark" title="Permalink to Rethinking Web Scraping with AutoScrape"> Rethinking Web Scraping with AutoScrape </a> </h2> </header> <div class="author-container"> <address class="vcard author"> By Brandon Roberts </address> </div> <div class="entry-content article article"> <p>In 2011, living in a city with more dented cars than I'd ever seen in my life, I wanted traffic data for a story I was working on, so I wrote my first real web scraper. I still remember it pretty well: it was written in PHP (barf), ran every five minutes, grabbed the current collisions police were responding to across the city and dumped them into a database. Immediately after writing a story based on the data, I threw the scraper on GitHub and never touched it again (RIP). It got me wondering how many journalists had rewritten and shared scrapers for the exact same sites and never maintained them. I pictured a mountain of useless, time consuming code floating around in cyberspace. Journalists write too many scrapers and we're generally terrible at maintaining them.</p> <p>There are plenty of reasons why we don't maintain our scrapers. In 2013 I built a web tool to help journalists do background research on political candidates. By entering a name into a search form, the tool would scrape up to fifteen local government sites containing public information. It didn't take long to figure out maintaining this tool was a full time job. The tiniest change on one of the sites would break the whole thing. 
Fixing it was a tedious process that went something like this: read the logs to get an idea of which website broke the system, visit the website in my web browser, manually test the scrape process, fix the broken code and re-deploy the web application.</p> <p class="center caption"></p> <p><img alt="Typical scraper development process" src="/images/typical-scraper-design-workflow.png#center" title="A flowchart of the typical web scraper development process"></p> <p class="center caption">A flowchart of the typical web scraper development process. Development starts at the top left and continues with manual steps: analyzing the website, reading the source code, extracting XPaths and pasting them into code. The single automated step consists of running the scraper. Because scrapers built this way are prone to breaking or extracting subtly incorrect data, the output needs to be checked after every scrape. This increases the overall maintenance requirements of operating a web scraper.</p> <p>Leap forward to 2018—I was running recurring, scheduled scrapes on open government sites for several media organizations—and I was still building my scrapers the old fashioned way. The fundamental problem with these scrapers was the way they identified web page elements to interact with or extract from: XPath selectors.</p> <p>To understand XPath, you need to understand that browsers "see" webpages as a hierarchical tree of elements, known as the <em>Document Object Model (DOM)</em><a href="#glossary-dom" class="gentry">[3]</a>, starting from the topmost element. Here's an example:</p> <div class="highlight"><pre><span></span> Example webpage DOM html | --------------- | | head body | | -------- ---------- | | | | | title style h1 div footer | button </pre></div> <p>XPath is a language for identifying parts of a webpage by tracing the path from the top of the DOM to the specified elements. For example, to identify the button in our example DOM above, we'd start at the top html tag and work down to the button, which gives us the following XPath selector:</p> <div class="highlight"><pre><span></span>/html/body/div/button </pre></div> <p>As you can see, the XPath selector leading to the button depends on the elements above it. So, if a web developer decides to change the layout and switches the <code>div</code> to something else, then our XPath—and our scraper—is broken.</p> <p>I wanted a tool that would let me describe a scrape using a set of standardized options. After partnering with a few journalism organizations, I came up with three common investigative scrape patterns:</p> <ol> <li>Crawl a site and download documents (e.g., download all PDFs on a site).</li> <li>Query and grab a single result page (e.g., enter a package tracking number and get a page saying where it is).</li> <li>Query and save all result pages (e.g., a Google search with many results pages).</li> </ol> <p>The result of this work is <a href="https://github.com/brandonrobertz/autoscrape-py">AutoScrape</a>, a resilient web scraper that can perform these kinds of scrapes and survive a variety of site changes. Using a set of basic configuration options, most common journalistic scraping problems can be solved. It operates on three main principles in order to reduce the amount of maintenance required when running scrapes:</p> <p><strong>Navigate like people do.</strong></p> <p>When people use websites they look for visual cues. These are generally standardized and rarely change. 
For example, search forms typically have a "Search" or "Submit" button. AutoScrape takes advantage of this. Instead of collecting DOM element selectors, users only need to provide the text of pages, buttons and forms to interact with while crawling a site.</p> <p><strong>Extraction as a separate step.</strong></p> <p>Web scrapers spend most of their time interacting with sites, submitting forms and following links. Piling data extraction onto that process increases the risk of having the scraper break. For this reason, AutoScrape separates site navigation and data extraction into two separate tasks. While AutoScrape is crawling and querying a site, it saves the rendered HTML for each page visited.</p> <p><strong>Protect extraction from page changes as much as possible.</strong></p> <p>Extracting data from HTML pages typically involves taking XPaths, extracting data into columns, converting these columns of data into record rows and exporting them. In AutoScrape, we avoid this entirely by using an existing domain-specific template language for extracting JSON from HTML called <a href="https://hext.thomastrapp.com/">Hext</a>. Hext templates, as they're called, look a lot like HTML but include syntax for data extraction. To ease the construction of these templates, the AutoScrape system includes a Hext template builder. Users load one of the scraped pages containing data and, for a single record, click on and label each of the values in it. Once this is done, the annotated HTML of the selected record can be converted into a Hext template. JSON data extraction is then a matter of bulk processing the scraped pages with the Hext tool and template.</p> <p class="center caption"></p> <p><img alt="Hext Extraction Template Example" src="/images/hext-extraction-example.gif#center" title="Hext extraction template example"></p> <p class="center caption">An example extraction of structured data from a source HTML document (left), using a simple Hext template (center) and the resulting JSON (right). If the HTML source document contained more records in place of the comment, the Hext template would have extracted them as additional JSON objects.</p> <p>Hext templates are superior to the traditional XPath method of data extraction on pages that contain few class names or IDs, as is common on primitive government sites. While an XPath to an element can be broken by changes made to ancestor elements, breaking a Hext template requires changing the actual chunk of HTML that contains data.</p> <h3><a href="#using-autoscrape">Using AutoScrape</a></h3> <p>Here are a few examples of what various scrape configurations look like using AutoScrape. In all of these cases, we're just going to use the command line version of the tool: <code>scrape.py</code>.</p> <p>Let's say you want to crawl an entire website, saving all HTML and style sheets (no screenshots):</p> <div class="highlight"><pre><span></span>./scrape.py \ --maxdepth -1 \ --output crawled_site \ 'https://some.page/to-crawl' </pre></div> <p>In the above case, we've set the scraper to crawl infinitely deep into the site. 
But if you want to only archive a single webpage, grabbing both the code and a full length screenshot (PNG) for future reference, you could do this:</p> <div class="highlight"><pre><span></span>./scrape.py \ --full-page-screenshots \ --load-images \ --maxdepth 0 \ --save-screenshots \ --driver Firefox \ --output archived_webpage \ 'https://some.page/to-archive' </pre></div> <p>Finally, we have the real magic: interactively querying web search portals. In this example, we want AutoScrape to do a few things: load a webpage, look for a search form containing the text "SEARCH HERE", select a date (January 20, 1992) from a date picker, enter "Brandon" into an input box and then click "NEXT ->" buttons on the result pages. Such a scrape is described like this:</p> <div class="highlight"><pre><span></span>./scrape.py \ --output search_query_data \ --form-match "SEARCH HERE" \ --input "i:0:Brandon,d:1:1992-01-20" \ --next-match "Next ->" \ 'https://some.page/search?s=newquery' </pre></div> <p>All of these commands create a folder that contains the HTML pages encountered during the scrape. The pages are categorized by the general type of page they are: pages viewed when crawling, search form pages, search result pages and downloads.</p> <div class="highlight"><pre><span></span>autoscrape-data/ ├── crawl_pages/ ├── data_pages/ ├── downloads/ └── search_pages/ </pre></div> <p>Extracting data from these HTML pages is a matter of building a Hext template and then running the extraction tool. Hext templates can either be written from scratch or by using the Hext builder web tool included with AutoScrape. This is best illustrated in the <a href="https://www.youtube.com/watch?v=D0Mchcf6THE">quickstart video</a>.</p> <p>Currently, I'm working with the <a href="http://workbenchdata.com/">Computational Journalism Workbench</a> team to integrate AutoScrape into their web platform so that you won't need to use the command line at all. Until that happens, you can go to the <a href="https://github.com/brandonrobertz/autoscrape-py">GitHub repo</a> to learn how to set up a graphical version of AutoScrape.</p> <h3><a href="#conclusion">Fully Automated Scrapers and the Future</a></h3> <p>In addition to building a simple, straightforward tool for scraping websites, I had a secondary, more lofty motive for creating AutoScrape: using it as a testbed for fully automated scrapers.</p> <p>Ultimately, I want to be able to hand AutoScrape a URL and have it automatically figure out what to do. Search forms contain clues about how to use them in both their source code and displayed text. This information can likely be used to by a <em>machine learning</em><a href="#glossary-ml" class="gentry">[10]</a> model to figure out how to search a form, opening up the possibility for fully-automated scrapers.</p> <p>If you're interested in improving web scraping, or just want to chat, feel free to reach out. I'm in this for the long haul. 
So stay tuned.</p> <p><img alt="AutoScrape Logo" src="/images/autoscraper.png#center" title="AutoScraper"></p> </div><!-- /.entry-content --> </article> </div> <div class="separator" style="background-image: url('https://artificialinformer.com/theme/images/noise.gif');"> </div> <div> <article class="hentry"> <footer class="article-info"> <span class="issue"> <em>Artificial Informer</em> - Issue One </span> <time class="published" datetime="2019-04-02T09:00:00-07:00"> April 2019 </time> </footer><!-- /.post-info --> <header class="article-header"> <h2 class="entry-title"> <a href="https://artificialinformer.com/issue-one/glossary-of-terminology.html" rel="bookmark" title="Permalink to Glossary of Terminology"> Glossary of Terminology </a> </h2> </header> <div class="entry-content article glossary"> <p><strong><a name="glossary-dataset">1.</a> <em>Dataset</em></strong> A collection of machine readable records, typically from a single source. A dataset can be a single file (Excel or CSV), a database table, or a collection of documents. In machine learning, a dataset is commonly called a corpus. When the dataset is being used to <em>train</em><a href="#glossary-training" class="gentry">[18]</a> a machine learning <em>model</em><a href="#glossary-model" class="gentry">[12]</a>, it can be called a training dataset (a.k.a. a training set). Datasets need to be transformed into a <em>matrix</em><a href="#glossary-matrix" class="gentry">[11]</a> before they can be used by a machine learning model.</p> <p>Further reading: <a href="https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets">Training, Validation and Test Sets - Wikipedia</a></p> <p><strong><a name="glossary-distance">2.</a> <em>Distance Function, Distance Metric</em></strong> A method for quantifying how dissimilar, or far apart, two records are. Euclidean distance, the simplest distance metric used, is attributed to the Ancient Greek mathematician Euclid. This distance metric finds the length of a straight line between two points, as if using a ruler. Cosine distance is another popular metric which measures the angle between two points using trigonometry.</p> <p><strong><a name="glossary-dom">3.</a> <em>Document Object Model, DOM</em></strong> A representation of a HTML page using a hierarchical tree. This is the way that browsers "see" web pages. As an example, we have a simple HTML page and its corresponding DOM tree:</p> <div class="highlight"><pre><span></span> HTML DOM --------------------------------------------------------------- <span class="nt"><html></span> | html <span class="nt"><head></span> | | <span class="nt"><title></span>Example DOM<span class="nt"></title></span> | --------------- <span class="nt"><style></span>*{margin: 0;}<span class="nt"></style></span> | | | <span class="nt"></head></span> | head body <span class="nt"><body></span> | | | <span class="nt"><h1></span>Example Page!<span class="nt"></h1></span> | -------- ---------- <span class="nt"><div></span> | | | | | | <span class="nt"><button></span>Save<span class="nt"></button></span> | title style h1 div footer <span class="nt"></div></span> | | <span class="nt"><footer></span>A footer<span class="nt"></footer></span> | button <span class="nt"></body></span> | <span class="nt"></html></span> | </pre></div> <p><strong><a name="glossary-feature">4.</a> <em>Feature</em></strong> A column in a dataset representing a specific type of values. A feature is typically represented as a variable in a machine learning model. 
For example, in a campaign finance dataset, a feature might be "contribution amount" or "candidate name". The number of features in a dataset determines its dimensionality. In many machine learning algorithms, high dimensional data (data with lots of features) is notoriously difficult to work with.</p>

<p><strong><a name="glossary-hash">5.</a> <em>Hash</em></strong> A short label or string of characters identifying a piece of data. Hashes are generated by a hash function. An example of this comes from the most common use case: password hashes. Instead of storing passwords in a database for anyone to read (and steal), password hashes are stored. For example, the password "Thing99" might get turned into something like <code>b3aca92c793ee0e9b1a9b0a5f5fc044e05140df3</code> by a hash function and saved in a database. When logging in, the website will hash the provided password and check it against the one in the database. A strong <a href="https://en.wikipedia.org/wiki/Cryptographic_hash_function">cryptographic hash function</a> can't feasibly be reversed and uniquely identifies a record. In other usages, such as in <em>LSH</em>, a hash may identify a group of similar records. Hashes are a fixed length, unlike the input data used to create them.</p>

<p>Further reading: <a href="https://en.wikipedia.org/wiki/Hash_function">"Hash Function" - Wikipedia</a>, <a href="http://matthewcasperson.blogspot.com/2013/11/minhash-for-dummies.html">"MinHash for dummies"</a>, <a href="http://www.dev-hq.net/posts/4--intro-to-cryptographic-hash-functions">"An Introduction to Cryptographic Hash Functions" - Dev-HQ</a></p>

<p><strong><a name="glossary-knn">6.</a> <em>k-NN, k-Nearest Neighbors</em></strong> An algorithm for finding the <em>k</em> most similar records to a given record, or query point. k-NN can use a variety of <em>distance metrics</em><a href="#glossary-distance" class="gentry">[2]</a> to measure dissimilarity, or <em>distance</em>, between points. In k-NN, when <em>k</em> is equal to 1, the algorithm will return the single most similar record. When <em>k</em> is greater than 1, the algorithm will return multiple records. A common practice is to take the similar records, average them and make educated guesses about the query point.</p>

<p><strong><a name="glossary-lsh">7.</a> <em>Locality Sensitive Hashing, LSH</em></strong> A method, similar in application to <em>k-NN</em><a href="#glossary-knn" class="gentry">[6]</a>, for identifying similar records given a query record. LSH uses some statistical tricks like <em>hashing</em><a href="#glossary-hash" class="gentry">[5]</a> and <em>projection</em><a href="#glossary-projection" class="gentry">[9]</a> to do this as a performance optimization. Because of this, it can be used on large amounts of data. The penalty is that false matches can turn up in the results and, conversely, genuinely similar records can be missed.</p>

<p><strong><a name="glossary-preprocessing">8.</a> <em>Preprocessing</em></strong> A step in data analysis that happens before any actual analysis occurs, to transform the data into a specific format or to clean it. A common preprocessing task is lowercasing and stripping symbols. <em>Vectorization</em><a href="#glossary-vectorization" class="gentry">[20]</a> is a common preprocessing step found in machine learning and statistics.</p>
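<p>As a small, purely illustrative example, a preprocessing routine for owner names might look like this:</p>

<div class="highlight"><pre>
# Illustrative preprocessing: normalize a name before matching on it.
import re

def preprocess(name):
    name = name.lower()                      # lowercase
    name = re.sub(r"[^a-z0-9 ]", " ", name)  # strip symbols
    return " ".join(name.split())            # collapse extra whitespace

print(preprocess("Smith,  John A."))   # -> "smith john a"
</pre></div>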
<p><strong><a name="glossary-projection">9.</a> <em>Projection</em></strong> A mathematical method for taking an input <em>vector</em><a href="#glossary-vectorization" class="gentry">[20]</a> and transforming it into another dimension. Typically, this is done by taking high dimensional data (data with a large number of columns) and converting it to a lower dimension. A simple example of this would be taking a 3D coordinate and turning it into a 2D point. This is one of the key concepts behind <em>LSH</em><a href="#glossary-lsh" class="gentry">[7]</a>.</p>

<p>Further reading: <a href="https://en.wikipedia.org/wiki/Random_projection">Random Projection - Wikipedia</a></p>

<p><strong><a name="glossary-ml">10.</a> <em>Machine Learning, ML</em></strong> A field of statistics and computer science focused on building or using algorithms that can perform tasks, without being told specifically how to accomplish them, by learning from data. The type of data required by the machine learning algorithm, labeled or unlabeled, splits the field into two major groups: <em>supervised</em><a href="#glossary-supervised-ml" class="gentry">[17]</a> and <em>unsupervised</em><a href="#glossary-unsupervised-ml" class="gentry">[19]</a>, respectively. Machine learning is a subfield of Artificial Intelligence.</p>

<p><strong><a name="glossary-matrix">11.</a> <em>Matrix</em></strong> Rectangularly arranged data made up of rows and columns. In the machine learning context, every cell in a matrix is a number. The numbers in a matrix may represent a letter, number or category.</p>

<div class="highlight"><pre>
           Example m-by-n matrix (2x3)

                n columns (3)
       .---------------------------------.
  m    | 11.347271944951725 | 2203 | 2.0 |  <- row vector
 rows  |--------------------+------+-----|
  (2)  |  9.411296351528783 | 1867 | 1.0 |
       `---------------------------------'
            \
             This is element (2,1)
</pre></div>

<p>Each row, which represents a single record, is known as a <em>vector</em><a href="#glossary-vectorization" class="gentry">[20]</a>.</p>
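<p>The same example matrix, written as a NumPy array (an illustrative snippet, not part of the original analysis):</p>

<div class="highlight"><pre>
# The 2x3 matrix above; each row is one record (a row vector).
import numpy as np

X = np.array([
    [11.347271944951725, 2203, 2.0],
    [ 9.411296351528783, 1867, 1.0],
])
print(X.shape)   # (2, 3): m=2 rows, n=3 columns
print(X[1, 0])   # element (2,1) in the diagram's 1-based notation
</pre></div>

<p>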
The process of turning source data into a matrix is known as vectorization.</p> <p><strong><a name="glossary-model">12.</a> <em>Model, Statistical Model</em></strong> A collection of assumptions that a machine learning algorithm has learned from a dataset. Fundamentally, a model consists of numbers, known as weights, that can be be plugged into a machine learning algorithm. We use models to get data out of machine learning algorithms.</p> <p><strong><a name="glossary-outlier">13.</a> <em>Outlier Detection</em></strong> A method for identifying records that are out of place in the context of a dataset. These outlying data points can be thought of as strange, suspicious, fraudulent, rare, unique, etc. Outlier detection is a major subfield of machine learning with applications in fraud detection, quality assurance and alert systems.</p> <p><strong><a name="glossary-regression">14.</a> <em>Regression</em></strong> A statistical method for identifying relationships among the <em>features</em><a href="#glossary-feature" class="gentry">[4]</a> in a dataset.</p> <p><strong><a name="glossary-scraping">15.</a> <em>Scraping, Web Scraping</em></strong> The process of loading a web page, extracting information and collecting it into a specific structure (a database, spreadsheet, etc). Typically web scraping is done automatically with a program, or tool, known as a web scraper.</p> <p><strong><a name="glossary-string">16.</a> <em>String</em></strong> A piece of data, arranged sequentially, made up of letters, numbers or symbols. Technically speaking, computers represent everything as numbers, but they are converted to letters when needed. Numbers, words, sentences, paragraphs and even entire documents can be represented as strings.</p> <p><strong><a name="glossary-supervised-ml">17.</a> <em>Supervised Machine Learning, Supervised Learning</em></strong> A subfield of machine learning where algorithms learn to predict values or categories from human-labeled data. Examples of supervised machine learning problems: (1) predicting the temperature of a future day using a dataset of historical weather readings and (2) classifying emails by whether or not they are spam from a set of categorized emails. The goal of supervised machine learning is to learn from one dataset and then make accurate predictions on new data (this is known as generalization).</p> <p><strong><a name="glossary-training">18.</a> <em>Training</em></strong> The process of feeding a statistical or machine learning algorithm data for the purpose of learning to predict, identifying structure, or extracting knowledge. As an example, consider a list of legitimate campaign contributions. Once an algorithm has been shown this data, it generates a <em>model</em><a href="#glossary-model" class="gentry">[12]</a> representing how these contributions typically look. This model can be used to spot unusual contributions, since the model has learned what normal ones look like. There are many different methods for training models, but most of them are iterative, step-based procedures that slowly improve over time. A common analogy for how models are trained is hill climbing: knowing that a flat area (a good solution) is at the top of a hill, but only being able to see a short distance due to thick fog, the top can be found by following steep paths. 
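</p>

<p>As a toy illustration of this kind of iterative, step-based training (not tied to any model in this issue), a few lines of gradient descent can fit a single weight:</p>

<div class="highlight"><pre>
# Fit w so that y is approximately w * x, by repeatedly stepping "downhill".
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]   # roughly y = 2x

w, lr = 0.0, 0.01
for step in range(200):
    # slope of the average squared error with respect to w
    grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
    w -= lr * grad          # take a small step in the downhill direction
print(round(w, 2))          # close to 2.0
</pre></div>

<p>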
Training is also known as model fitting.</p> <p><strong><a name="glossary-unsupervised-ml">19.</a> <em>Unsupervised Machine Learning, Unsupervised Learning</em></strong> A subfield of machine learning where algorithms learn to identify the structure or find patterns within a dataset. Unsupervised algorithms don't require human labeling or organization, and therefore can be used on a wide variety of datasets and in many situations. Examples of unsupervised use cases: (1) discovering natural groups of records in a dataset, (2) finding similar documents in a dataset and (3) identifying the way that events normally occur and using this to detect unusual events (a.k.a. <em>outlier detection</em> and <em>anomaly detection</em><a href="#glossary-outlier" class="gentry">[13]</a>).</p> <p><strong><a name="glossary-vectorization">20.</a> <em>Vectorization, Vector</em></strong> The process of turning a raw source dataset into a numerical <em>matrix</em><a href="#glossary-matrix" class="gentry">[11]</a>. Each record becomes a row of the matrix, known as a vector.</p> <p><strong><a name="glossary-weight">21.</a> <em>Weight</em></strong> A number that is used to either increase or decrease the importance of a <em>feature</em><a href="#glossary-feature" class="gentry">[4]</a>. Weights are used in supervised machine learning to quantify how well one variable predicts another; in unsupervised learning, weights are used to emphasize features that segment a dataset into groups.</p> <p><strong><a name="glossary-xpath">22.</a> <em>XPath</em></strong> A description of the location of an element on a web page. From the browser's perspective, a web page is represented as a hierarchical tree known as the <em>Document Object Model (DOM)</em><a href="#glossary-dom" class="gentry">[3]</a>. An XPath selector describes a route through this tree that leads to a specific part of the page.</p> </div><!-- /.entry-content --> </article> </div> <div class="separator" style="background-image: url('https://artificialinformer.com/theme/images/noise.gif');"> </div> </div><!-- /#posts-list --> </section><!-- /#content --> <footer id="main-footer" class="body"> <div id="footer-content"> <address id="about" class="vcard body"> <h2>Artificial Informer</h2> <p> <b>Issue One, April 2019</b><br/> Edited by Lisa van Dam-Bates.<br/> Written by <a target="_blank" href="https://twitter.com/bxroberts">Brandon Roberts</a>.<br/> An Artificial Informer Labs project.<br/> <a href="mailto:artificialinformer@nym.hush.com">artificialinformer@nym.hush.com</a> </p> <p id="footer-copyright"> Copyright 2019<br/> <a href="https://creativecommons.org/licenses/by-sa/4.0/"> <img alt="CC BY-SA" src="https://artificialinformer.com/theme/images/cc-by-sa.png" /> </a> </p> </address><!-- /#about --> <section id="follow"> <h2>Follow</h2> <div><a href="https://twitter.com/bxroberts">@bxroberts</a></div> <div><a href="https://twitter.com/lisavandambates">@lisavandambates</a></div> <div><a href="http://bxroberts.org/">B.X. 
Roberts.org</a></div>
      </section>
    </div>
  </footer><!-- /#contentinfo -->
</div><!-- /#main -->
</body>
</html>