<!doctype html> <html lang="en" dir="ltr" class="blog-wrapper blog-post-page plugin-blog plugin-id-default" data-has-hydrated="false"> <head> <meta charset="UTF-8"> <meta name="generator" content="Docusaurus v3.0.0-rc.1"> <title data-rh="true">The State of Web Scraping 2022 | ScrapeOps</title><meta data-rh="true" name="viewport" content="width=device-width,initial-scale=1"><meta data-rh="true" name="twitter:card" content="summary_large_image"><meta data-rh="true" property="og:url" content="https://scrapeops.io/blog/the-state-of-web-scraping-2022/"><meta data-rh="true" property="og:locale" content="en"><meta data-rh="true" name="docusaurus_locale" content="en"><meta data-rh="true" name="docusaurus_tag" content="default"><meta data-rh="true" name="docsearch:language" content="en"><meta data-rh="true" name="docsearch:docusaurus_tag" content="default"><meta data-rh="true" name="robots" content="max-image-preview:large"><meta data-rh="true" property="og:title" content="The State of Web Scraping 2022 | ScrapeOps"><meta data-rh="true" name="description" content="What are the biggest trends and developments in web scraping? What does 2022, likely have in store for web scraping?"><meta data-rh="true" property="og:description" content="What are the biggest trends and developments in web scraping? 
What does 2022, likely have in store for web scraping?"><meta data-rh="true" property="og:image" content="https://scrapeops.io/assets/images/state-of-web-scraping-2022-525e3d3deefc857e0bbb01f258d6ce16.jpg"><meta data-rh="true" name="twitter:image" content="https://scrapeops.io/assets/images/state-of-web-scraping-2022-525e3d3deefc857e0bbb01f258d6ce16.jpg"><meta data-rh="true" property="og:type" content="article"><meta data-rh="true" property="article:published_time" content="2022-01-12T00:00:00.000Z"><meta data-rh="true" property="article:tag" content="web scraping"><link data-rh="true" rel="icon" href="/img/favicon.svg"><link data-rh="true" rel="canonical" href="https://scrapeops.io/blog/the-state-of-web-scraping-2022/"><link data-rh="true" rel="alternate" href="https://scrapeops.io/blog/the-state-of-web-scraping-2022/" hreflang="en"><link data-rh="true" rel="alternate" href="https://scrapeops.io/blog/the-state-of-web-scraping-2022/" hreflang="x-default"><link rel="alternate" type="application/rss+xml" href="/blog/rss.xml" title="ScrapeOps RSS Feed"> <link rel="alternate" type="application/atom+xml" href="/blog/atom.xml" title="ScrapeOps Atom Feed"> <link rel="preconnect" href="https://www.google-analytics.com"> <link rel="preconnect" href="https://www.googletagmanager.com"> <script async src="https://www.googletagmanager.com/gtag/js?id=G-QJSW9S9YH4"></script> <script>function gtag(){dataLayer.push(arguments)}window.dataLayer=window.dataLayer||[],gtag("js",new Date),gtag("config","G-QJSW9S9YH4",{anonymize_ip:!0})</script> <script src="/get-refferer.js"></script> <script src="https://cdn.firstpromoter.com/fpr.js" async></script><link rel="stylesheet" href="/assets/css/styles.0a6705c8.css"> <script src="/assets/js/runtime~main.8a89d22b.js" defer="defer"></script> <script src="/assets/js/main.2a878338.js" defer="defer"></script> </head> <body class="navigation-with-keyboard"> <script>!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var 
e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return localStorage.getItem("theme")}catch(t){}}();t(null!==e?e:"light")}(),function(){try{const a=new URLSearchParams(window.location.search).entries();for(var[t,e]of a)if(t.startsWith("docusaurus-data-")){var n=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(n,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())</script><div id="__docusaurus"><div role="region" aria-label="Skip to main content"><a class="skipToContent_fXgn" href="#__docusaurus_skipToContent_fallback">Skip to main content</a></div><div class="announcementBar_s0pr" style="background-color:#0d53d7;color:#fff" role="banner"><div class="announcementBarContent_dpRF">Need a proxy solution? Try ScrapeOps and get <a target="_self" href="https://scrapeops.io/app/register/proxy/">1,000 free requests here</a>, or compare all proxy providers <a target="_self" href="https://scrapeops.io/proxy-providers/comparison/">here</a>!</div></div><nav class="navbar navbar--fixed-top"><div class="navbar__inner"><div class="navbar__items"><button aria-label="Navigation bar toggle" class="navbar__toggle clean-btn" type="button" tabindex="0"><svg width="30" height="30" viewBox="0 0 30 30" aria-hidden="true"><path stroke="currentColor" stroke-linecap="round" stroke-miterlimit="10" stroke-width="2" d="M4 7h22M4 15h22M4 23h22"></path></svg></button><a class="navbar__brand" href="/"><div class="navbar__logo"><img src="/img/scrapeops-logo.svg" alt="ScrapeOps Logo" class="themedComponent_mlkZ themedComponent--light_NVdE" height="24px" width="18px"><img src="/img/scrapeops-logo.svg" alt="ScrapeOps Logo" class="themedComponent_mlkZ themedComponent--dark_xIcU" height="24px" width="18px"></div><b class="navbar__title 
text--truncate">ScrapeOps</b></a></div><div class="navbar__items navbar__items--right"><div class="navbar__item dropdown dropdown--hoverable dropdown--right"><a href="#" aria-haspopup="true" aria-expanded="false" role="button" class="navbar__link">Solutions</a><ul class="dropdown__menu"><li><a class="dropdown__link" href="/proxy-aggregator/">Proxy Aggregator</a></li><li><a class="dropdown__link" href="/monitoring-scheduling/">Monitoring & Scheduler</a></li></ul></div><a class="navbar__item navbar__link" href="/docs/intro/">Docs</a><a href="https://scrapeops.io/proxy-providers/comparison/" target="_self" rel="noopener noreferrer" class="navbar__item navbar__link">Proxy Comparison<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a><div class="navbar__item dropdown dropdown--hoverable dropdown--right"><a href="#" aria-haspopup="true" aria-expanded="false" role="button" class="navbar__link">Guides</a><ul class="dropdown__menu"><li><a class="dropdown__link" href="/web-scraping-playbook/">Web Scraping Playbook</a></li><li><a class="dropdown__link" href="/python-web-scraping-playbook/">Python Web Scraping Playbook</a></li><li><a class="dropdown__link" href="/nodejs-web-scraping-playbook/">NodeJs Web Scraping Playbook</a></li><li><a class="dropdown__link" href="/python-scrapy-playbook/">Python Scrapy Playbook</a></li><li><a class="dropdown__link" href="/selenium-web-scraping-playbook/">Selenium Web Scraping Playbook</a></li><li><a class="dropdown__link" href="/puppeteer-web-scraping-playbook/">Puppeteer Web Scraping Playbook</a></li><li><a class="dropdown__link" href="/playwright-web-scraping-playbook/">Playwright Web Scraping Playbook</a></li></ul></div><a href="https://scrapeops.io/app/login/" target="_self" rel="noopener noreferrer" class="navbar__item 
navbar__link">Login<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a><a href="https://scrapeops.io/app/register/main/" target="_self" rel="noopener noreferrer" class="navbar__item navbar__link">Signup<svg width="13.5" height="13.5" aria-hidden="true" viewBox="0 0 24 24" class="iconExternalLink_nPIU"><path fill="currentColor" d="M21 13v10h-21v-19h12v2h-10v15h17v-8h2zm3-12h-10.988l4.035 4-6.977 7.07 2.828 2.828 6.977-7.07 4.125 4.172v-11z"></path></svg></a></div></div><div role="presentation" class="navbar-sidebar__backdrop"></div></nav><div class="main-wrapper"><div class="container margin-vert--lg"><div class="row"><main class="col col--9 col--offset-1" itemscope="" itemtype="https://schema.org/Blog"><article itemprop="blogPost" itemscope="" itemtype="https://schema.org/BlogPosting"><meta itemprop="description" content="What are the biggest trends and developments in web scraping? 
What does 2022, likely have in store for web scraping?"><link itemprop="image" href="https://scrapeops.io/assets/images/state-of-web-scraping-2022-525e3d3deefc857e0bbb01f258d6ce16.jpg"><header><h1 class="title_f1Hy" itemprop="headline">The State of Web Scraping 2022</h1><div class="container_mt6G margin-vert--md"><time datetime="2022-01-12T00:00:00.000Z" itemprop="datePublished">January 12, 2022</time> · <!-- -->17 min read</div></header><div id="__blog-post-container" class="markdown" itemprop="articleBody"><p><img alt="The State of Web Scraping 2022" src="/assets/images/state-of-web-scraping-2022-525e3d3deefc857e0bbb01f258d6ce16.jpg" width="2000" height="1050" class="img_CujE"></p> <p>With 2021 having come to an end, now is the time to look back at the big events & trends in the world of web scraping, and try to project what 2022 will look like for web scraping.</p> <p>Including the good, the bad, and the ugly:</p> <ul> <li><a href="#web-scraping-falling-out-of-fashion">"Web Scraping" Falling Out Of Fashion</a></li> <li><a href="#anti-bots-the-arms-race-continues">Anti-Bots: The Arms Race Continues</a></li> <li><a href="#legal-issues-maybe-a-little-less-grey">Legal Issues: Maybe A Little Less Grey?</a></li> <li><a href="#fighting-dirty-increased-use-of-pressure-tactics">Fighting Dirty: Increased Use of Pressure Tactics</a></li> <li><a href="#ai-data-extractors">AI Data Extractors</a></li> <li><a href="#web-scraping-libraries">Web Scraping Libraries</a></li> </ul> <!-- --> <!-- --> <div style="background-color:#f6fafe;border-radius:2px;border:1px solid #f0f0f3;padding:1.5rem;padding-bottom:2.5rem;padding-top:2.5rem;text-align:center;margin-top:2rem"><h3>Need help scraping the web?</h3><p style="margin-bottom:2rem">Then check out <a href="https://scrapeops.io/" style="color:#000;text-decoration:none;border-bottom:2px dashed #0d53d7;font-weight:600">ScrapeOps</a>, the complete toolkit for web scraping.</p><div 
style="display:flex;flex-direction:row;flex-wrap:wrap;justify-content:center"><div><div><img style="height:80px;width:80px" src="https://assets-scrapeops.nyc3.digitaloceanspaces.com/Icons/scrapeops-proxy-aggregator-icon.svg" alt="ScrapeOps Proxy Manager"></div><div><a href="https://scrapeops.io/proxy-aggregator/" style="color:#000;text-decoration:none;font-weight:600">Proxy Manager</a></div></div><div style="margin-left:3.5rem;margin-right:3.5rem"><div><img style="height:80px;width:80px" src="https://assets-scrapeops.nyc3.digitaloceanspaces.com/Icons/scrapeops-monitoring-icon.svg" alt="ScrapeOps Monitoring"></div><div><a href="https://scrapeops.io/monitoring-scheduling/" style="color:#000;text-decoration:none;font-weight:600">Scraper Monitoring</a></div></div><div><div><img style="height:80px;width:80px" src="https://assets-scrapeops.nyc3.digitaloceanspaces.com/Icons/scrapeops-scheduler-icon.svg" alt="ScrapeOps Job Scheduling"></div><div><a href="https://scrapeops.io/monitoring-scheduling/" style="color:#000;text-decoration:none;font-weight:600">Job Scheduling</a></div></div></div></div> <hr> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="web-scraping-falling-out-of-fashion">"Web Scraping" Falling Out Of Fashion<a href="#web-scraping-falling-out-of-fashion" class="hash-link" aria-label="Direct link to "Web Scraping" Falling Out Of Fashion" title="Direct link to "Web Scraping" Falling Out Of Fashion"></a></h2> <p>After consistent growth in interest every year over the last 10 years, 2021 seems to have been a year when <strong>"web scraping"</strong> became <strong>uncool</strong>.</p> <p>Google searches for the term <strong>"web scraping"</strong> have dropped <strong>30-40%</strong> compared to 2020 volumes.</p> <p><img alt="State of Web Scraping 2022: Web Scraping Trends" src="/assets/images/state-of-web-scraping-2022-web-scraping-trend-c2d64b01bbae3079b47e0d22c4e2ec5c.png" width="1430" height="426" class="img_CujE"></p> <p>It's hard to know for sure, but it 
is likely a combination of:</p> <ul> <li>A growing number of companies offering industry-specific data feeds (product info, SERP results, etc.), so people don't have to scrape the data themselves.</li> <li>More companies offering ready-made product monitoring tools for e-commerce, etc., so companies aren't building their own in-house web scraping infrastructures.</li> <li>The term <strong>"web scraping"</strong> is falling out of fashion in favour of terms like <strong>"data feeds"</strong> and <strong>"data extraction"</strong>.</li> </ul> <p>In 2021, two of the largest players in web scraping, <strong>Luminati</strong> and <strong>Scrapinghub</strong>, rebranded to <strong>"Bright Data"</strong> and <strong>"Zyte"</strong> respectively, highlighting the shift away from web scraping towards a greater focus on data.</p> <p>Or maybe people and companies just don't want web data as much anymore. Who knows...</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="2022-outlook">2022 Outlook<a href="#2022-outlook" class="hash-link" aria-label="Direct link to 2022 Outlook" title="Direct link to 2022 Outlook"></a></h3> <p>Expect to see more existing web scraping players and new entrants focus on becoming data feed providers, instead of being web-scraping-as-a-service or proxy providers. Not only are the optics better, but the profit margins are significantly higher too!</p> <hr> <p><img alt="State of Web Scraping 2022: Web Scrapers vs Anti-Bots" src="/assets/images/state-of-web-scraping-2022-anti-bots-520f47c24888040d6143870a5a065395.jpg" width="2000" height="1050" class="img_CujE"></p> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="anti-bots-the-arms-race-continues">Anti-Bots: The Arms Race Continues!<a href="#anti-bots-the-arms-race-continues" class="hash-link" aria-label="Direct link to Anti-Bots: The Arms Race Continues!" 
title="Direct link to Anti-Bots: The Arms Race Continues!"></a></h2> <p>The endless war between web scrapers and websites trying to block them continued unabated in 2021, with web scrapers still largely staying one step ahead.</p> <p>Websites and anti-bot providers have continued to develop more sophisticated anti-bot measures. They are increasingly moving away from simple header and IP fingerprinting to more complicated browser and TCP fingerprinting, using WebRTC, canvas fingerprinting and mouse movement analysis to differentiate automated scrapers from real users.</p> <p>But as yet, no anti-bot has found the magic bullet to completely prevent web scrapers.</p> <p>With the right combination of proxies, user agents and browsers, you can scrape every website, even those that seem unscrapable.</p> <p>However, whilst scraping a website might still be possible, anti-bots can make it not worth the effort and cost if you have to resort to ever more expensive web scraping setups (using headless browsers with residential/mobile IP networks, etc).</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="cloudflare-datadome-perimeterx">Cloudflare, DataDome, PerimeterX<a href="#cloudflare-datadome-perimeterx" class="hash-link" aria-label="Direct link to Cloudflare, DataDome, PerimeterX" title="Direct link to Cloudflare, DataDome, PerimeterX"></a></h3> <p>Of all the anti-bot solutions out there, <strong>Cloudflare</strong> was probably the most widespread <strong>pain in the a**</strong> for web scrapers. Not necessarily because it is the best anti-bot, but because it is the most widely used.</p> <p>Whilst the vanilla Cloudflare anti-bot can still be bypassed relatively easily, when a website is using the advanced version it can be quite challenging to deal with. 
Most of the time, you need to fall back to a headless browser and good proxy/user agent management.</p> <p><strong>DataDome</strong>, <strong>PerimeterX</strong>, and others have also proven themselves to be challenging to bypass in 2021.</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="moving-data-behind-logins">Moving Data Behind Logins<a href="#moving-data-behind-logins" class="hash-link" aria-label="Direct link to Moving Data Behind Logins" title="Direct link to Moving Data Behind Logins"></a></h3> <p>Following in the footsteps of <strong>LinkedIn</strong>, in 2021 <strong>Facebook</strong> and <strong>Instagram</strong> led the way in a growing trend of websites moving their data behind login pages.</p> <p>This not only provides websites with an easy, low-barrier way to block scrapers that has limited impact on real user UX, but also tilts the legal situation in their favour if someone explicitly decides to scrape behind the login and violate their T&Cs.</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="2022-outlook-1">2022 Outlook<a href="#2022-outlook-1" class="hash-link" aria-label="Direct link to 2022 Outlook" title="Direct link to 2022 Outlook"></a></h3> <p>Expect to see ever more sophisticated anti-bots being used on websites. 
More complicated browser, TCP, and IP fingerprinting techniques are going to require you to use:</p> <ul> <li>Higher-quality proxies.</li> <li>Better user agent/cookie management techniques.</li> <li>Headless browsers.</li> <li>Or one of the growing number of purpose-built anti-bot bypassing solutions becoming available like <a href="https://brightdata.grsm.io/scrapeops-web-unlocker" target="_blank" rel="noopener noreferrer">Web Unlocker</a>, <a href="https://www.zyte.com/smart-browser-api-anti-fingerprinting/?rfsn=6335521.8097b3" target="_blank" rel="noopener noreferrer">Zyte Smart Browser</a> or <a href="https://www.scraperapi.com/?fp_ref=scrapeops" target="_blank" rel="noopener noreferrer">ScraperAPI</a>.</li> </ul> <p>E-commerce scrapers should be safe from websites moving their data behind their logins. However, expect Facebook, Instagram, and other social networks to become harder to scrape as they expand and refine their usage of login pop-ups to block scrapers.</p> <p><strong>Need a Proxy?</strong> Then check out our <a href="https://scrapeops.io/proxy-providers/comparison/" target="_blank" rel="noopener noreferrer"><strong>Proxy Comparison Tool</strong></a> to compare the pricing, features and limits of every proxy provider on the market so you can find the one that best suits your needs.</p> <hr> <p><img alt="State of Web Scraping 2022: Web Scraping Legal Issues" src="/assets/images/state-of-web-scraping-2022-legal-issues-57ad6705a696ced1e691267ee7b34d29.jpg" width="2000" height="1050" class="img_CujE"></p> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="legal-issues-maybe-a-little-less-grey">Legal Issues: Maybe A Little Less Grey?<a href="#legal-issues-maybe-a-little-less-grey" class="hash-link" aria-label="Direct link to Legal Issues: Maybe A Little Less Grey?" title="Direct link to Legal Issues: Maybe A Little Less Grey?"></a></h2> <p>Web scraping has always been in a grey area, and probably will always be. 
However, in 2021 web scraping became a little less grey with the outcome of the <a href="https://www.natlawreview.com/article/supreme-court-vacates-linkedin-hiq-scraping-decision-remands-to-ninth-circuit" target="_blank" rel="noopener noreferrer">Van Buren v. United States</a> case.</p> <p>In this case, the <strong>US Supreme Court</strong> ruled that accessing permissible areas of a computer system with prior authorization, even for an improper or prohibited purpose, is not a violation of the <strong>CFAA</strong> (Computer Fraud and Abuse Act).</p> <p>This outcome was great news for web scrapers, as it suggests that so long as a website has made its data public, you are not in violation of the CFAA when you scrape the data, even if it is prohibited in some other way (T&Cs, robots.txt, etc).</p> <p>The US Supreme Court has sent the <strong>LinkedIn v. HiQ case</strong> back to the Ninth Circuit for further consideration in light of the Van Buren ruling, so the LinkedIn case will give us further insight into how the courts will view web scraping moving forward.</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="2022-outlook-2">2022 Outlook<a href="#2022-outlook-2" class="hash-link" aria-label="Direct link to 2022 Outlook" title="Direct link to 2022 Outlook"></a></h3> <p>Given the <strong>US Supreme Court's Van Buren</strong> ruling, and the Ninth Circuit's previous ruling in the <strong>LinkedIn v. HiQ</strong> case, the odds are that the Ninth Circuit's next ruling in the LinkedIn v. HiQ case will be in web scraping's favour (I'm not a lawyer), giving companies and developers more clarity and certainty over web scraping's legal position.</p> <p>Companies and developers can continue to operate on the assumption that scraping publicly available information is okay. 
But if you log in to a website and scrape data only available whilst logged in, then you have explicitly agreed to the website's T&Cs when creating your account and could be in breach of them if they prohibit web scraping.</p> <hr> <p><img alt="State of Web Scraping 2022: Pressure Tactics" src="/assets/images/state-of-web-scraping-2022-pressure-tactics-49017f52198e821f614fde43c6adb4cb.jpg" width="2000" height="1050" class="img_CujE"></p> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="fighting-dirty-increased-use-of-pressure-tactics">Fighting Dirty: Increased Use of Pressure Tactics<a href="#fighting-dirty-increased-use-of-pressure-tactics" class="hash-link" aria-label="Direct link to Fighting Dirty: Increased Use of Pressure Tactics" title="Direct link to Fighting Dirty: Increased Use of Pressure Tactics"></a></h2> <p>With the legal situation continuing to favour web scrapers, some big companies have started to resort to pressure tactics to try and prevent their sites being scraped, threatening scrapers with costly legal cases that a small company can't afford.</p> <p>There has been an increasing number of stories where <strong>Facebook/Instagram</strong>, <strong>LinkedIn</strong>, etc. have suspended the accounts of any employee or company associated with scraping companies, and sent cease and desist letters to them threatening legal action if they don't ban all scraping of their sites on the scrapers' platforms.</p> <p>This is despite the fact that in most cases, the scraping company only allows users to scrape data that is publicly available and not behind a login, which has been deemed legal in the recent court cases.</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="2022-outlook-3">2022 Outlook<a href="#2022-outlook-3" class="hash-link" aria-label="Direct link to 2022 Outlook" title="Direct link to 2022 Outlook"></a></h3> <p>Expect to see more of this. 
If the courts continue to rule in favour of an open public internet, and anti-bot techniques aren't good enough to stop web scraping, then some big companies will try to prevent scraping by pressuring companies with costly legal cases or the loss of personal/professional access to their platforms.</p> <hr> <p><img alt="State of Web Scraping 2022: AI Data Extractors" src="/assets/images/state-of-web-scraping-2022-ai-data-extractors-9f81aed02073c39038d135a41650d854.jpg" width="2000" height="1050" class="img_CujE"></p> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="ai-data-extractors">AI Data Extractors<a href="#ai-data-extractors" class="hash-link" aria-label="Direct link to AI Data Extractors" title="Direct link to AI Data Extractors"></a></h2> <p>One of the most exciting trends in web scraping at the moment is the growing capability of AI data parsers and crawlers.</p> <p>Maintaining the parsing logic within spiders can be a pain, especially when you are scraping hundreds or thousands of websites.</p> <p>So having an AI-based parser that can automatically extract your required data and deal with any HTML structure changes would be a huge value-add for many developers.</p> <p>As of 2021, most of the big web scraping providers have launched their own data API that uses AI to parse the data:</p> <ul> <li><a href="https://brightdata.grsm.io/scrapeops-data-collector" target="_blank" rel="noopener noreferrer">Bright Data: Data Collector</a></li> <li><a href="https://www.zyte.com/automatic-extraction/?rfsn=6335521.8097b3" target="_blank" rel="noopener noreferrer">Zyte: Automatic Extraction</a></li> <li><a href="https://oxylabs.io/products/scraper-api/web" target="_blank" rel="noopener noreferrer">Oxylabs: Web Scraper API</a></li> <li><a href="https://www.diffbot.com/" target="_blank" rel="noopener noreferrer">Diffbot</a></li> </ul> <p>However, as of now and depending on the use case, the website coverage, data quality or pricing of these AI data extractors often doesn't 
meet the requirements for many developers looking to build data feeds.</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="2022-outlook-4">2022 Outlook<a href="#2022-outlook-4" class="hash-link" aria-label="Direct link to 2022 Outlook" title="Direct link to 2022 Outlook"></a></h3> <p>Expect to see the performance of existing AI data extractors improve, more AI data extractor tools launched by existing players and new entrants into the web scraping market.</p> <p>Hopefully, we will see some open source or unbundled AI data parsing tools, so you can use the AI parsing functionality without having to use a specific proxy network.</p> <hr> <p><img alt="State of Web Scraping 2022: Web Scraping Libraries" src="/assets/images/state-of-web-scraping-2022-web-scraping-libraries-dd402ec54a1de1752aff4e9353d9e511.jpg" width="2000" height="1050" class="img_CujE"></p> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="web-scraping-libraries">Web Scraping Libraries<a href="#web-scraping-libraries" class="hash-link" aria-label="Direct link to Web Scraping Libraries" title="Direct link to Web Scraping Libraries"></a></h2> <p>When it comes to web scraping libraries & frameworks, <strong>Python is still king!</strong></p> <p>However, with the growing shift to scraping with headless browsers, Node.js is gaining ground fast.</p> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="python">Python<a href="#python" class="hash-link" aria-label="Direct link to Python" title="Direct link to Python"></a></h3> <p>Web scraping with Python is still dominated by the popular Python Requests/BeautifulSoup combo and Python Scrapy, with their dominance looking unlikely to change.</p> <p><img alt="State of Web Scraping 2022: Web Scraping Trends" src="/assets/images/state-of-web-scraping-2022-scrapy-vs-python-requests-fc9a2ce02a70f6a9b891e3a26380107d.png" width="1166" height="584" class="img_CujE"></p> <ul> <li><strong>Python Requests/BeautifulSoup -</strong> Due to its large community, ease of use 
and short learning curve, Python Requests/BeautifulSoup dwarfs Python Scrapy when it comes to interest and downloads (~23M vs ~700k monthly downloads).</li> <li><strong>Python Scrapy -</strong> Although not as popular as it once was, Scrapy is still the go-to option for many Python developers looking to build large-scale web scraping infrastructures because of all the functionality ready to use right out of the box.</li> </ul> <p>There are also many other great web scraping libraries for Python that aren't as well known as BeautifulSoup & Scrapy but are very powerful:</p> <ul> <li><strong>Requests-HTML -</strong> <a href="https://github.com/psf/requests-html" target="_blank" rel="noopener noreferrer">Requests-HTML</a> is a combined Python requests and parsing library that supports async requests, CSS/XPath selectors, JavaScript rendering and more. As it uses lxml under the hood, it is much faster at parsing than BeautifulSoup (~600k monthly downloads).</li> <li><strong>Selectolax -</strong> <a href="https://github.com/rushter/selectolax" target="_blank" rel="noopener noreferrer">Selectolax</a> is an HTML5 parser with CSS selectors that is a great alternative to BeautifulSoup as it is much faster (~34k monthly downloads).</li> <li><strong>pyspider -</strong> <a href="https://github.com/binux/pyspider" target="_blank" rel="noopener noreferrer">pyspider</a> is a powerful spider framework, a simpler alternative to Scrapy, that has a built-in scheduler that manages concurrency, retries, request queueing, etc. 
behind the scenes (~6k monthly downloads).</li> <li><strong>AutoScraper -</strong> <a href="https://github.com/alirezamika/autoscraper" target="_blank" rel="noopener noreferrer">AutoScraper</a> is a lightweight scraper for Python that identifies the correct CSS selectors for your parsers after you give it a couple of examples of the data you would like to scrape.</li> </ul> <p>There are also a number of Python wrappers for <strong>Selenium</strong>, <strong>Puppeteer</strong> and <strong>Playwright</strong> that allow you to create headless browser scrapers using Python.</p> <p>For years, <strong>Selenium</strong> was the go-to headless browser tool for web scrapers. However, with the launch of <strong>Puppeteer</strong> and <strong>Playwright</strong>, developers have increasingly started building their scrapers with Node.js due to the ease of use and better functionality of these libraries.</p> <hr> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="nodejs">Node.js<a href="#nodejs" class="hash-link" aria-label="Direct link to Node.js" title="Direct link to Node.js"></a></h3> <p>With the increasing popularity of Node.js and the growing need to use headless browsers to bypass anti-bots, Node.js has become the 2nd most popular language used in web scraping.</p> <p>Although Node.js isn't as popular as Python for normal HTTP request web scraping, there are a number of great web scraping libraries on offer:</p> <ul> <li><strong>Fetch, Request, Axios, SuperAgent -</strong> There are a number of popular HTTP clients used for web scraping. They range from Fetch and Request, lightweight HTTP clients that cover all the basics, to Axios and SuperAgent, which have a bit more functionality and support promises and async/await.</li> <li><strong>Cheerio -</strong> A lightweight implementation of the core jQuery API that runs on the server side. 
With <a href="https://cheerio.js.org/" target="_blank" rel="noopener noreferrer">Cheerio</a> you can use the jQuery API to parse, manipulate and extract data from a web page.</li> <li><strong>JSDOM -</strong> Unlike Cheerio, which uses a jQuery-style API, <a href="https://github.com/jsdom/jsdom" target="_blank" rel="noopener noreferrer">JSDOM</a> is a pure JavaScript implementation of the DOM that allows you to interact with and scrape HTML websites. JSDOM parses the provided HTML and emulates the browser to create a DOM you can query with the normal JavaScript DOM API.</li> <li><strong>Apify SDK -</strong> Although primarily designed to be used with Apify's paid service, the <a href="https://sdk.apify.com/" target="_blank" rel="noopener noreferrer">Apify SDK</a> is a powerful library that simplifies the development and management of web crawlers, scrapers, data extractors and web automation jobs. It is a scraper management tool that provides tools to manage and automatically scale a pool of headless browsers, maintain queues of URLs to crawl, store crawling results to a local filesystem or in the cloud, rotate proxies, etc. It can be used by itself or run on <a href="https://apify.com/" target="_blank" rel="noopener noreferrer">Apify Cloud</a>.</li> </ul> <h4 class="anchor anchorWithStickyNavbar_LWe7" id="headless-browsers">Headless Browsers<a href="#headless-browsers" class="hash-link" aria-label="Direct link to Headless Browsers" title="Direct link to Headless Browsers"></a></h4> <p>The area where JavaScript web scraping really stands out is the numerous headless browser options it has available. 
Headless browsers request the data from a website, and render the HTML and any browser-side JavaScript so your scraper sees the page the same way you would see it in your browser.</p> <p>Headless browsers are great for:</p> <ul> <li>Small or beginner projects.</li> <li>Single-page apps using React, AngularJS, Vue.js, etc.</li> <li>Jobs that require a lot of interaction with the page to get the data you need (fill out forms/search bars, click buttons, scroll pages, etc).</li> <li>Websites using a sophisticated anti-bot that you need to bypass.</li> </ul> <p>There are numerous headless browser libraries available that make it very easy to spin up a headless browser to scrape a site.</p> <ul> <li><strong>Puppeteer -</strong> <a href="https://github.com/puppeteer/puppeteer" target="_blank" rel="noopener noreferrer">Puppeteer</a> is the most popular browser automation library which, just as the name implies, allows you to manipulate a web page like a puppet and scrape the data you need using a Chrome browser.</li> <li><strong>Playwright -</strong> Released by Microsoft in 2020, <a href="https://github.com/microsoft/playwright" target="_blank" rel="noopener noreferrer">Playwright.js</a> is the new kid on the block and is very similar to Puppeteer in many ways (many of the Puppeteer team left Google, and created Playwright at Microsoft). 
However, it has been gaining a lot of traction because of its cross-browser support (it can drive Chromium, WebKit, and Firefox browsers, whilst Puppeteer primarily drives Chromium) and some developer experience improvements that would have been breaking changes with Puppeteer.</li> <li><strong>Nightmare & PhantomJS -</strong> <a href="https://github.com/segmentio/nightmare" target="_blank" rel="noopener noreferrer">Nightmare.js</a> and <a href="https://phantomjs.org/" target="_blank" rel="noopener noreferrer">PhantomJS</a> are two headless browser drivers that are still in use, but which fell out of popularity with the launch of Puppeteer. Both are out of date now and no longer actively maintained, so it's recommended to use either Puppeteer or Playwright instead.</li> </ul> <hr> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="other-languages">Other Languages<a href="#other-languages" class="hash-link" aria-label="Direct link to Other Languages" title="Direct link to Other Languages"></a></h3> <p>Although Python and Node.js are the most popular languages for web scraping, if you would prefer to use a different language then there are lots of libraries to choose from for pretty much every language out there.</p> <h4 class="anchor anchorWithStickyNavbar_LWe7" id="java">Java<a href="#java" class="hash-link" aria-label="Direct link to Java" title="Direct link to Java"></a></h4> <ul> <li><strong>JSoup -</strong> <a href="https://jsoup.org/" target="_blank" rel="noopener noreferrer">JSoup</a> is a powerful yet simple Java library designed to make fetching pages, parsing HTML and extracting data simple by using DOM traversal or CSS selectors to find the data.</li> <li><strong>Jaunt -</strong> <a href="https://jaunt-api.com/" target="_blank" rel="noopener noreferrer">Jaunt</a> is a Java library for web scraping, web automation and JSON querying. 
It provides a light-weight headless browser that allows you to fetch pages, parse the data and work with forms. However, it doesn't support Javascript.</li> <li><strong>HTMLUnit -</strong> <a href="https://htmlunit.sourceforge.io/" target="_blank" rel="noopener noreferrer">HTMLUnit</a> is another headless browser for Java which was originally designed for HTML unit testing but which can also be used for web scraping.</li> <li><strong>Apache Nutch -</strong> <a href="https://nutch.apache.org/" target="_blank" rel="noopener noreferrer">Apache Nutch</a> is a very powerful and scalable web crawling library that is easily extensible with out-of-the-box parsers, HTML filters, etc.</li> </ul> <h4 class="anchor anchorWithStickyNavbar_LWe7" id="golang">Golang<a href="#golang" class="hash-link" aria-label="Direct link to Golang" title="Direct link to Golang"></a></h4> <ul> <li><strong>goquery -</strong> <a href="https://github.com/PuerkitoBio/goquery" target="_blank" rel="noopener noreferrer">goquery</a> is the jQuery of Golang. It allows you to traverse the DOM and extract data with CSS selectors like you would with jQuery, making it a great library for web scraping.</li> <li><strong>Colly -</strong> <a href="https://github.com/gocolly/colly" target="_blank" rel="noopener noreferrer">Colly</a> is a powerful web scraping framework that offers similar functionality to Python's Scrapy. With Colly you can write any kind of crawler or spider, parse the data you need and control the request frequency and concurrency. 
If you need more complicated HTML parsing then it can be paired with goquery.</li> </ul> <h4 class="anchor anchorWithStickyNavbar_LWe7" id="php">PHP<a href="#php" class="hash-link" aria-label="Direct link to PHP" title="Direct link to PHP"></a></h4> <ul> <li><strong>Guzzle -</strong> <a href="https://docs.guzzlephp.org/en/stable/" target="_blank" rel="noopener noreferrer">Guzzle</a> is a popular HTTP client that makes it simple to send both synchronous and asynchronous HTTP requests, making it a great option to request a web page from your target website.</li> <li><strong>Goutte -</strong> <a href="https://github.com/FriendsOfPHP/Goutte" target="_blank" rel="noopener noreferrer">Goutte</a> is a web scraping library for PHP, built by the creator of the <a href="https://symfony.com/" target="_blank" rel="noopener noreferrer">Symfony Framework</a>, that allows you to make requests, parse the response, interact with the page and extract the data. It builds on Symfony's <a href="https://symfony.com/doc/current/components/browser_kit.html" target="_blank" rel="noopener noreferrer">BrowserKit Component</a>, which allows you to simulate the behavior of a web browser.</li> </ul> <h4 class="anchor anchorWithStickyNavbar_LWe7" id="ruby">Ruby<a href="#ruby" class="hash-link" aria-label="Direct link to Ruby" title="Direct link to Ruby"></a></h4> <ul> <li><strong>Nokogiri -</strong> <a href="https://nokogiri.org/" target="_blank" rel="noopener noreferrer">Nokogiri</a> is a popular HTML and XML parsing gem for Ruby that makes it very easy to parse data from HTML responses using XPath and CSS selectors.</li> <li><strong>Kimurai -</strong> <a href="https://github.com/vifreefly/kimuraframework" target="_blank" rel="noopener noreferrer">Kimurai</a> is a purpose-built web scraping framework for Ruby that allows you to make HTTP requests and scrape the responses. 
It also works out of the box with Headless Chromium, Firefox and PhantomJS.</li> </ul> <h3 class="anchor anchorWithStickyNavbar_LWe7" id="2022-outlook-5">2022 Outlook<a href="#2022-outlook-5" class="hash-link" aria-label="Direct link to 2022 Outlook" title="Direct link to 2022 Outlook"></a></h3> <p>Whilst there is always room for improvement, making requests and parsing HTML content is a solved problem in pretty much every programming language at this point, so we're unlikely to see any major changes here.</p> <p>The big areas for improvement are the creation of extensions for these existing libraries/frameworks to simplify some of the common problems developers face when scraping.</p> <p><strong><a href="https://scrapeops.io/" target="_blank" rel="noopener noreferrer">ScrapeOps</a> is a good example.</strong> ScrapeOps is a new monitoring and alerting tool dedicated to web scraping. With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.</p> <p><strong>Live demo here:</strong> <a href="https://scrapeops.io/app/login/demo/" target="_blank" rel="noopener noreferrer">ScrapeOps Demo</a></p> <p><img src="https://assets-scrapeops.nyc3.digitaloceanspaces.com/Images%2FScrapeOps_Assets%2Fscrapeops-promo.png" alt="ScrapeOps Promo" fetchpriority="high" loading="eager" class="top-header-img img_CujE"></p> <p>The primary goal with ScrapeOps is to give every developer the same level of scraping monitoring capabilities as the most sophisticated web scrapers, without any of the hassle of setting up your own custom solution.</p> <p>Once you have completed the simple install (3 lines in your scraper), ScrapeOps will:</p> <ul> <li>🕵️♂️ <strong>Monitor -</strong> Automatically monitor all your scrapers.</li> <li>📈 <strong>Dashboards -</strong> Visualise your job data in dashboards, so you see real-time & historical stats.</li> <li>💯 <strong>Data Quality -</strong> 
Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.</li> <li>📉 <strong>Auto Health Checks -</strong> Automatically check every job's performance data versus its 7 day moving average to see if it's healthy or not.</li> <li>✔️ <strong>Custom Health Checks -</strong> Check each job with any custom health checks you have enabled for it.</li> <li>⏰ <strong>Alerts -</strong> Alert you via email, Slack, etc. if any of your jobs are unhealthy.</li> <li>📑 <strong>Reports -</strong> Generate daily (periodic) reports that check all jobs versus your criteria and let you know if everything is healthy or not.</li> </ul> <p>It is only available for Python Scrapy at the moment, but we will be releasing integrations for Python Requests/BeautifulSoup, Node.js, Puppeteer, Java, etc. scrapers over the coming weeks.</p> <hr> <h2 class="anchor anchorWithStickyNavbar_LWe7" id="2022-is-looking-good">2022 Is Looking Good!<a href="#2022-is-looking-good" class="hash-link" aria-label="Direct link to 2022 Is Looking Good!" 
title="Direct link to 2022 Is Looking Good!"></a></h2> <p>Sure, web scraping faces some anti-bot and legal challenges in 2022, however, it has faced those challenges every year for the past few years and came out stronger because of it.</p> <p>The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches so the future is looking bright.</p> <p><strong>Need a Proxy?</strong> Then check out our <a href="https://scrapeops.io/proxy-providers/comparison/" target="_blank" rel="noopener noreferrer"><strong>Proxy Comparison Tool</strong></a> to compare the pricing, features and limits of every proxy provider on the market so you can find the one that best suits your needs.</p></div><footer class="row docusaurus-mt-lg blogPostFooterDetailsFull_mRVl"><div class="col"><b>Tags:</b><ul class="tags_jXut padding--none margin-left--sm"><li class="tag_QGVx"><a class="tag_zVej tagRegular_sFm0" href="/blog/tags/web-scraping/">web scraping</a></li></ul></div></footer></article></main></div></div></div><footer class="footer footer--dark"><div class="container container-fluid"><div class="footer__bottom text--center"><div class="footer__copyright">Copyright © 2024 ScrapeOps.</div></div></div></footer></div> </body> </html>