<!DOCTYPE html><html lang="en"> <head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><link rel="icon" href="/assets/favicon.ico"/><script defer="" data-domain="star-history.com" src="https://plausible.io/js/script.js"></script><meta name="next-head-count" content="4"/><link data-next-font="size-adjust" rel="preconnect" href="/" crossorigin="anonymous"/><link rel="preload" href="/_next/static/css/f94657194d4c857a.css" as="style" crossorigin=""/><link rel="stylesheet" href="/_next/static/css/f94657194d4c857a.css" crossorigin="" data-n-g=""/><noscript data-n-css=""></noscript><script defer="" crossorigin="" nomodule="" src="/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js"></script><script src="/_next/static/chunks/webpack-38cee4c0e358b1a3.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/framework-fda0a023b274c574.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/main-001c9e19b1894c7d.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/pages/_app-915effad870aa62e.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/6c86d9ce-d8b7531786dd65a5.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/472-8057db644de3d496.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/590-d0a3c67c09cc0662.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/pages/blog/%5Bslug%5D-7b378153203b51eb.js" defer="" crossorigin=""></script><script src="/_next/static/xKX4ZiOi_N7h3OBOEsSZu/_buildManifest.js" defer="" crossorigin=""></script><script src="/_next/static/xKX4ZiOi_N7h3OBOEsSZu/_ssgManifest.js" defer="" crossorigin=""></script></head><body><div id="__next"><div class="relative w-full h-auto min-h-screen overflow-auto flex flex-col"><title>Star History Monthly May 2024 | Open Source AI Web Scrapers</title><meta name="description" content="With the help of AI web scrapers, the limitations associated with traditional scrapers 
can be addressed: dynamic or unstructured websites can easily be handled."/><meta property="og:type" content="website"/><meta property="og:url" content="https://star-history.com/blog/ai-web-scraper"/><meta property="og:title" content="Star History Monthly May 2024 | Open Source AI Web Scrapers"/><meta property="og:description" content="With the help of AI web scrapers, the limitations associated with traditional scrapers can be addressed: dynamic or unstructured websites can easily be handled."/><meta property="og:image" content="https://star-history.com/assets/blog/ai-web-scraper/banner.webp"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:url" content="https://star-history.com/blog/ai-web-scraper"/><meta name="twitter:title" content="Star History Monthly May 2024 | Open Source AI Web Scrapers"/><meta name="twitter:description" content="With the help of AI web scrapers, the limitations associated with traditional scrapers can be addressed: dynamic or unstructured websites can easily be handled."/><meta name="twitter:image" content="https://star-history.com/assets/blog/ai-web-scraper/banner.webp"/><nav><div class="flex justify-center items-center gap-x-6 bg-green-600 px-6 py-1 sm:px-3.5 "><p class="text-sm leading-6 text-white"><a href="/blog/list-your-open-source-project">Want to promote your open source project? 
Be on our ⭐️Starlet List⭐️ for FREE →</a></p></div></nav><header class="w-full h-14 shrink-0 flex flex-row justify-center items-center bg-[#363636] text-light"><div class="w-full md:max-w-5xl lg:max-w-7xl h-full flex flex-row justify-between items-center px-0 sm:px-4"><div class="h-full bg-dark flex flex-row justify-start items-center"><a class="h-full flex flex-row justify-center items-center px-3 hover:bg-zinc-800" href="/"><img class="w-7 h-auto" src="/assets/icon.png" alt="Logo"/></a><a class="h-full flex flex-row justify-center items-center text-base px-3 hover:bg-zinc-800" href="/blog"><span class="text-white font-semibold -2">Blog</span></a><span class="h-full flex flex-row justify-center items-center cursor-pointer text-white text-base px-3 font-semibold mr-2 hover:bg-zinc-800">Add Access Token</span></div><div class="hidden h-full md:flex flex-row justify-start items-center"><a target="_blank" rel="noopener noreferrer" class="h-full flex text-white text-base flex-row justify-center items-center px-4 hover:bg-zinc-800" href="https://www.bytebase.com/?source=star-history"><img class="h-6 mt-1 mr-2" src="/assets/craft-by-bytebase.webp" alt=""/></a></div><div class="h-full hidden md:flex flex-row justify-end items-center space-x-2"><a class="h-full flex flex-row justify-center items-center px-2 hover:bg-zinc-800" href="https://twitter.com/StarHistoryHQ" target="_blank" rel="noopener noreferrer"><i class="fab fa-twitter text-2xl text-blue-300"></i></a></div><div class="h-full flex md:hidden flex-row justify-end items-center"><span class="relative h-full w-10 px-3 flex flex-row justify-center items-center cursor-pointer font-semibold text-light hover:bg-zinc-800"><span class="w-4 transition-all h-px bg-light absolute top-1/2 -mt-1"></span><span class="w-4 transition-all h-px bg-light absolute top-1/2 "></span><span class="w-4 transition-all h-px bg-light absolute top-1/2 mt-1"></span></span></div></div></header><div class="w-full h-auto py-2 flex md:hidden 
flex-col justify-start items-start shadow-lg border-b hidden"><a class="h-12 text-base px-3 w-full flex flex-row justify-start items-center cursor-pointer font-semibold text-dark mr-2 hover:bg-gray-100 hover:text-blue-500" href="/blog/how-to-use-github-star-history">📕 How to use this site</a><span class="h-12 px-3 text-base w-full flex flex-row justify-start items-center cursor-pointer font-semibold text-dark mr-2 hover:bg-gray-100 hover:text-blue-500">Add Access Token</span><span class="h-12 text-base px-3 w-full flex flex-row justify-start items-center"><a class="github-button -mt-1" href="https://github.com/star-history/star-history" data-show-count="true" aria-label="Star star-history/star-history on GitHub" target="_blank" rel="noopener noreferrer">Star</a></span></div><div class="w-full h-auto grow lg:grid lg:grid-cols-[256px_1fr_256px]"><div class="w-full hidden lg:block"><div class="flex flex-col justify-start items-start w-full mt-2 p-4 pl-8"><a class="hover:opacity-75" href="/blog/list-your-open-source-project"><img class="w-auto max-w-full" src="/assets/starlet-icon.webp"/></a><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Playbook</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/how-to-use-github-star-history"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">📕 How to Use this Site</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/playbook-for-more-github-stars"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">⭐️ How to Get More Stars</span></a></li></ul></div><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Monthly Pick</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" 
rel="noopener noreferrer" href="/blog/ai-devtools"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Nov (AI DevTools)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/homelab"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Oct (Homelab)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-agents"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Sep (AI Agents)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/rag-frameworks"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Aug (RAG frameworks)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-generators"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Jul (AI Generators)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-search"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Jun (AI Searches)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-web-scraper"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 May (AI Web Scraper)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/prompt-engineering"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Apr (AI Prompt)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/non-ai"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Mar (Non-AI)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/most-underrated"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Feb (Most 
Underrated)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/text2sql"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Jan (Text2SQL)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/gpt-wrappers"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Dec (GPT Wrappers)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/tts"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Nov (TTS)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-for-postgres"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Oct (AI for Postgres)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/coding-ai"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Sept (Coding AI)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/cli-tool-for-llm"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Aug (CLI tool for LLMs)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/llama2"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 July (Llama 2 Edition)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202306"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 June</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202305"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 May</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" 
href="/blog/star-history-monthly-pick-202304"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Apr</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202303"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Mar (ChatGPT Edition)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202302"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Feb</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202301"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Jan</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202212"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Dec</span></a></li></ul></div><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Yearly Pick</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/best-of-2023"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-yearly-pick-2022-data-infra-devtools"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Data, Infra &amp; DevTools</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-open-source-2022-platform-engineering"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Platform Engineering</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" 
href="/blog/star-history-open-source-2022-open-source-alternatives"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 OSS Alternatives</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-yearly-pick-2022-frontend"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Front-end</span></a></li></ul></div><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Starlet List</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/list-your-open-source-project"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">🎁 Prompt yours for FREE</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/trench"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #28 - Trench</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/langfuse"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #27 - langfuse</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/thepipe"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #26 - thepi.pe</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/taipy"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #25 - Taipy</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/superlinked"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #24 - Superlinked</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/tea-tasting"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue 
#23 - tea-tasting</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/giskard"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #22 - Giskard</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/khoj"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #21 - Khoj</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/paradedb"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #20 - ParadeDB</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/skyvern"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #19 - Skyvern</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/prisma"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #18 - Prisma</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/spicedb"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #17 - SpiceDB</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/answer"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #16 - Apache Answer</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/infinity"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #15 - Infinity</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/proton"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #14 - Proton</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/earthly"><span class="inline -ml-2 text-sm text-blue-700 
hover:underline">Issue #13 - Earthly</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/wasp"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #12 - Wasp</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/libsql"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #11 - libSQL</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/postgresml"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #10 - PostgresML</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/electricsql"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #9 - ElectricSQL</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/prompt-flow"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #8 - Prompt flow</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/clipboard"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #7 - Clipboard</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/hoppscotch"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #6 - Hoppscotch</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/metisfl"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #5 - MetisFL</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/chatgpt-js"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #4 - chatgpt.js</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/mockoon"><span class="inline 
-ml-2 text-sm text-blue-700 hover:underline">Issue #3 - Mockoon</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/dlta-ai"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #2 - DLTA-AI</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/sniffnet"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #1 - Sniffnet</span></a></li></ul></div></div></div><div class="w-full flex flex-col justify-start items-center"><div class="w-full p-4 md:p-0 mt-6 md:w-5/6 lg:max-w-6xl h-full flex flex-col justify-start items-center self-center"><img class="hidden md:block w-auto max-w-full object-scale-down" src="/assets/blog/ai-web-scraper/banner.webp" alt=""/><div class="w-auto max-w-6xl mt-4 md:mt-12 prose prose-indigo prose-xl md:prose-2xl flex flex-col justify-center items-center"><h1 class="leading-16">Star History Monthly May 2024 | Open Source AI Web Scrapers</h1></div><div class="w-full mt-8 mb-2 max-w-6xl px-2 flex flex-row items-center justify-center text-sm text-gray-900 font-semibold trackingwide uppercase"><div class="flex space-x-1 text-gray-500"><span class="text-gray-900">Mila</span><span aria-hidden="true"> · </span><time dateTime="2024-05-23:00:00.000Z">May 23, 2024</time><span aria-hidden="true"> · </span><span> <!-- -->3<!-- --> min read </span></div></div><div class="mt-8 w-full max-w-5xl prose prose-indigo prose-xl md:prose-2xl"><p>Web scraping, in simple terms, means extracting data and content from websites; the data is then saved in a format such as XML, Excel, or SQL. 
Beyond lead generation, competitor monitoring, and market research, web scrapers can also be used to automate your data collection process.</p> <p>With the help of AI web scraping tools, the limitations associated with manual or purely code-based scraping tools can be addressed: dynamic or unstructured websites can easily be handled, all without human intervention.</p> <p>Here, we present a few open-source AI web scraping tools to choose from.</p> <ul> <li><a href="#reader">Reader</a></li> <li><a href="#llm-scraper">LLM Scraper</a></li> <li><a href="#firecrawl">Firecrawl</a></li> <li><a href="#scrapegraphai">ScrapeGraphAI</a></li> <li><a href="#langchain">LangChain</a></li> </ul> <h2>Reader</h2> <p><img src="/assets/blog/ai-web-scraper/reader-star-history.webp" alt="reader-star-history"></p> <p><a href="https://github.com/jina-ai/reader">Reader</a> is an offering by Jina AI. It converts any URL into LLM-friendly input when you prepend <a href="https://r.jina.ai/">https://r.jina.ai/</a> to it, giving you structured output for your agent and RAG systems at no cost.</p> <p>Since its first release just last month (April 15th, to be exact), it has served <a href="https://jina.ai/news/jina-reader-for-search-grounding-to-improve-factuality-of-llms/">over 18M</a> requests worldwide, and the project itself has already gained 4.5K stargazers.</p> <p><img src="/assets/blog/ai-web-scraper/reader.webp" alt="reader"></p> <p>Aside from scraping any URL, Jina just released another feature: you can use <a href="https://s.jina.ai/YOUR_SEARCH_QUERY">https://s.jina.ai/YOUR_SEARCH_QUERY</a> to search the Internet for up-to-date knowledge. 
The result includes a title, LLM-friendly markdown, and a URL that attributes the source.</p> <p>Together, these features let you construct a comprehensive solution for LLMs, agents, and RAG systems.</p> <p><img src="/assets/blog/ai-web-scraper/reader-knowledge.webp" alt="reader-knowledge"></p> <h2>LLM Scraper</h2> <p><img src="/assets/blog/ai-web-scraper/llm-scraper-star-history.webp" alt="llm-scraper-star-history"></p> <p><a href="https://github.com/mishushakov/llm-scraper">LLM Scraper</a> is a TypeScript library that converts any webpage into structured data using LLMs. Under the hood, it relies on function calling to do the conversion.</p> <p>Like Reader, it was open-sourced just last month. It currently supports local (GGUF), OpenAI, and Groq chat models. Apparently, the author is <a href="https://news.ycombinator.com/item?id=40100824">working on</a> supporting local LLMs via llama.cpp to lower the cost of using LLMs for web scraping.</p> <p><img src="/assets/blog/ai-web-scraper/llm-scraper.webp" alt="llm-scraper"></p> <h2>Firecrawl</h2> <p><img src="/assets/blog/ai-web-scraper/firecrawl-star-history.webp" alt="firecrawl-star-history"></p> <p><a href="https://github.com/mendableai/firecrawl">Firecrawl</a> is an API service that converts a URL into clean, well-formatted markdown. This format is great for LLM applications, offering a structured yet flexible way to represent web content.</p> <p><img src="/assets/blog/ai-web-scraper/firecrawl.webp" alt="firecrawl"></p> <p>This tool is tailored for LLM engineers, data scientists, AI researchers, and developers looking to harness web data for training machine learning models, market research, and content aggregation. 
It simplifies the data preparation process, allowing professionals to focus on insights and model development. You can also self-host it to suit your needs.</p> <h2>ScrapeGraphAI</h2> <p><img src="/assets/blog/ai-web-scraper/scrapegraphai-star-history.webp" alt="scrapegraphai-star-history"></p> <p><a href="https://github.com/VinciGit00/Scrapegraph-ai">ScrapeGraphAI</a> is a Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.). With ScrapeGraphAI, you get to specify exactly what sort of data you want to extract.</p> <p><img src="/assets/blog/ai-web-scraper/scrapegraphai.webp" alt="scrapegraphai"></p> <p>ScrapeGraphAI leverages the power of LLMs and can thus adapt to changes in website structures, reducing the need for constant developer intervention. This flexibility helps scrapers remain functional even when website layouts change.</p> <p>The LLMs it currently supports include GPT, Gemini, Groq, Azure, and Hugging Face, as well as local models.</p> <h2>LangChain</h2> <p><img src="/assets/blog/ai-web-scraper/langchain-star-history.webp" alt="langchain-star-history"></p> <p>What is LangChain not capable of? Not <a href="https://python.langchain.com/v0.1/docs/use_cases/web_scraping/">web scraping</a>.</p> <p>One of web scraping&#39;s biggest challenges is the changing nature of modern websites&#39; layouts and content, which requires modifying scraping scripts to accommodate the changes. LangChain uses function calling (e.g., OpenAI functions) with an extraction chain, so that you don&#39;t have to change your code constantly when websites change.</p> <p>If you are doing research and want to scrape only the titles and summaries of news articles from The Wall Street Journal website, it&#39;s got you covered.</p> <p><img src="/assets/blog/ai-web-scraper/langchain.webp" alt="langchain"></p> <h2>To Sum Up</h2> <p>Of course, there is no one-size-fits-all web scraper. 
Do you prefer old-school traditional web scrapers or LLM-empowered ones?</p> <p>📧 <em>Subscribe to our <a href="https://star-history.beehiiv.com/subscribe">weekly newsletter here</a>.</em></p> </div></div><div class="mt-12"><iframe src="https://embeds.beehiiv.com/2803dbaa-d8dd-4486-8880-4b843f3a7da6?slim=true" data-test-id="beehiiv-embed" height="52" frameBorder="0" scrolling="no" style="margin:0;border-radius:0px !important;background-color:transparent"></iframe></div></div><div class="w-full hidden lg:block"></div></div><footer class="relative w-full shrink-0 h-auto mt-6 flex flex-col justify-end items-center"><div class="w-full py-2 px-3 md:w-5/6 lg:max-w-7xl flex flex-row flex-wrap justify-between items-center text-neutral-700 border-t"><div class="text-sm leading-8 flex flex-row flex-wrap justify-start items-center"><div class="h-full text-gray-600">The missing GitHub star history graph</div><a class="h-full flex flex-row justify-center items-center ml-3 text-lg hover:opacity-80" href="https://twitter.com/StarHistoryHQ" target="_blank" rel="noopener noreferrer"><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 512 512" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 
20.791-32.161 39.308-52.628 54.253z"></path></svg></a><a class="h-full flex flex-row justify-center items-center mx-3 text-lg hover:opacity-80" href="mailto:star@bytebase.com" target="_blank" rel="noopener noreferrer"><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 512 512" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M502.3 190.8c3.9-3.1 9.7-.2 9.7 4.7V400c0 26.5-21.5 48-48 48H48c-26.5 0-48-21.5-48-48V195.6c0-5 5.7-7.8 9.7-4.7 22.4 17.4 52.1 39.5 154.1 113.6 21.1 15.4 56.7 47.8 92.2 47.6 35.7.3 72-32.8 92.3-47.6 102-74.1 131.6-96.3 154-113.7zM256 320c23.2.4 56.6-29.2 73.4-41.4 132.7-96.3 142.8-104.7 173.4-128.7 5.8-4.5 9.2-11.5 9.2-18.9v-19c0-26.5-21.5-48-48-48H48C21.5 64 0 85.5 0 112v19c0 7.4 3.4 14.3 9.2 18.9 30.6 23.9 40.7 32.4 173.4 128.7 16.8 12.2 50.2 41.8 73.4 41.4z"></path></svg></a><a class="h-full flex flex-row justify-center items-center mr-3 text-lg hover:opacity-80" href="https://github.com/star-history/star-history" target="_blank" rel="noopener noreferrer"><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 496 512" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 
22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg></a></div><div class="flex flex-row flex-wrap items-center space-x-4"><div class="flex flex-row text-sm leading-8 underline text-blue-700 hover:opacity-80"><img class="h-6 mt-1 mr-2" src="/assets/sqlchat.webp" alt="SQL Chat"/><a href="https://sqlchat.ai" target="_blank" rel="noopener noreferrer"> <!-- -->SQL Chat<!-- --> </a></div><div class="flex flex-row text-sm leading-8 underline text-blue-700 hover:opacity-80"><img class="h-6 mt-1 mr-2" src="/assets/dbcost.webp" alt="DB Cost"/><a href="https://dbcost.com" target="_blank" rel="noopener noreferrer">DB Cost</a></div></div><div class="text-xs leading-8 flex flex-row flex-nowrap justify-end items-center"><span class="text-gray-600">Maintained by<!-- --> <a class="text-blue-500 font-bold hover:opacity-80" href="https://bytebase.com" target="_blank" rel="noopener noreferrer">Bytebase</a>, originally built by<!-- --> <a class="bg-blue-400 text-white p-1 pl-2 pr-2 rounded-l-2xl rounded-r-2xl hover:opacity-80" href="https://twitter.com/tim_qian" target="_blank" rel="noopener noreferrer">@tim_qian</a></span></div></div></footer><div class="fixed right-0 top-32 hidden lg:flex flex-col justify-start items-start transition-all bg-white w-48 xl:w-56 p-2 z-10 "><div class="w-full flex justify-between items-center mb-2"><p class="text-xs text-gray-400">Sponsors (random order)</p><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 352 512" class="fas fa-times text-xs text-gray-400 cursor-pointer 
hover:text-gray-500" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M242.72 256l100.07-100.07c12.28-12.28 12.28-32.19 0-44.48l-22.24-22.24c-12.28-12.28-32.19-12.28-44.48 0L176 189.28 75.93 89.21c-12.28-12.28-32.19-12.28-44.48 0L9.21 111.45c-12.28 12.28-12.28 32.19 0 44.48L109.28 256 9.21 356.07c-12.28 12.28-12.28 32.19 0 44.48l22.24 22.24c12.28 12.28 32.2 12.28 44.48 0L176 322.72l100.07 100.07c12.28 12.28 32.2 12.28 44.48 0l22.24-22.24c12.28-12.28 12.28-32.19 0-44.48L242.72 256z"></path></svg></div><a href="https://dify.ai/?utm_source=star-history" class="bg-gray-50 p-2 rounded w-full flex flex-col justify-center items-center mb-2 text-zinc-600 hover:opacity-80 hover:text-blue-600 hover:underline" target="_blank"><img class="w-auto max-w-full" src="/assets/sponsors/dify/logo.webp" alt="Dify"/><span class="text-xs mt-2">Dify: Open-source platform for building LLM apps, from agents to AI workflows.</span></a><a href="https://bytebase.com?utm_source=star-history" class="bg-gray-50 p-2 rounded w-full flex flex-col justify-center items-center mb-2 text-zinc-600 hover:opacity-80 hover:text-blue-600 hover:underline" target="_blank"><img class="w-auto max-w-full" src="/assets/sponsors/bytebase/logo.webp" alt="Bytebase"/><span class="text-xs mt-2">Bytebase: Database DevOps and CI/CD for MySQL, PG, Oracle, SQL Server, Snowflake, ClickHouse, Mongo, Redis</span></a><a href="mailto:star@bytebase.com?subject=I&#x27;m interested in sponsoring star-history.com" target="_blank" class="w-full p-2 text-center bg-gray-50 text-xs leading-6 text-gray-400 rounded hover:underline hover:text-blue-600">Your logo</a></div></div></div><script id="__NEXT_DATA__" type="application/json" crossorigin="">{"props":{"pageProps":{"blog":{"title":"Star History Monthly May 2024 | Open Source AI Web Scrapers","slug":"ai-web-scraper","author":"Mila","featured":true,"featureImage":"/assets/blog/ai-web-scraper/banner.webp","publishedDate":"2024-05-23:00:00.000Z","excerpt":"With the 
help of AI web scrapers, the limitations associated with traditional scrapers can be addressed: dynamic or unstructured websites can easily be handled.","readingTime":3},"parsedBlogHTML":"\u003cp\u003eWeb scraping, in simple terms, means extracting data and content from websites; the data is then saved in a format such as XML, Excel, or SQL. Beyond lead generation, competitor monitoring, and market research, web scrapers can also be used to automate your data collection process.\u003c/p\u003e\n\u003cp\u003eWith the help of AI web scraping tools, the limitations associated with manual or purely code-based scraping tools can be addressed: dynamic or unstructured websites can easily be handled, all without human intervention.\u003c/p\u003e\n\u003cp\u003eHere, we present a few open-source AI web scraping tools to choose from.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"#reader\"\u003eReader\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#llm-scraper\"\u003eLLM Scraper\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#firecrawl\"\u003eFirecrawl\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#scrapegraphai\"\u003eScrapeGraphAI\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"#langchain\"\u003eLangChain\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2\u003eReader\u003c/h2\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/reader-star-history.webp\" alt=\"reader-star-history\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://github.com/jina-ai/reader\"\u003eReader\u003c/a\u003e is an offering by Jina AI. 
It can convert any URL to an LLM-friendly input when you prepend \u003ca href=\"https://r.jina.ai/\"\u003ehttps://r.jina.ai/\u003c/a\u003e to it, and you can get structured output for your agent and RAG systems at no cost.\u003c/p\u003e\n\u003cp\u003eSince its first release just last month (April 15th, to be exact), it has served \u003ca href=\"https://jina.ai/news/jina-reader-for-search-grounding-to-improve-factuality-of-llms/\"\u003eover 18M\u003c/a\u003e requests from around the world, and the project itself has already gained 4.5K stargazers.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/reader.webp\" alt=\"reader\"\u003e\u003c/p\u003e\n\u003cp\u003eAside from reading any URL, Jina just released another feature: you can use \u003ca href=\"https://s.jina.ai/YOUR_SEARCH_QUERY\"\u003ehttps://s.jina.ai/YOUR_SEARCH_QUERY\u003c/a\u003e to search the Internet for up-to-date knowledge. Each result includes a title, LLM-friendly markdown, and a URL that attributes the source.\u003c/p\u003e\n\u003cp\u003eTogether, these two features let you construct a comprehensive solution for LLMs, agents, and RAG systems.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/reader-knowledge.webp\" alt=\"reader-knowledge\"\u003e\u003c/p\u003e\n\u003ch2\u003eLLM Scraper\u003c/h2\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/llm-scraper-star-history.webp\" alt=\"llm-scraper-star-history\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://github.com/mishushakov/llm-scraper\"\u003eLLM Scraper\u003c/a\u003e is a TypeScript library that can convert any webpage into structured data using LLMs. Under the hood, it relies on function calling to do the conversion.\u003c/p\u003e\n\u003cp\u003eLike Reader, it was open-sourced just last month. It currently supports Local (GGUF), OpenAI, and Groq chat models. 
Apparently, the author is \u003ca href=\"https://news.ycombinator.com/item?id=40100824\"\u003eworking on\u003c/a\u003e supporting local LLMs via llama.cpp to lower the cost of using LLMs for web scraping.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/llm-scraper.webp\" alt=\"llm-scraper\"\u003e\u003c/p\u003e\n\u003ch2\u003eFirecrawl\u003c/h2\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/firecrawl-star-history.webp\" alt=\"firecrawl-star-history\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://github.com/mendableai/firecrawl\"\u003eFirecrawl\u003c/a\u003e is an API service that can convert a URL into clean, well-formatted markdown. This format is great for LLM applications, offering a structured yet flexible way to represent web content.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/firecrawl.webp\" alt=\"firecrawl\"\u003e\u003c/p\u003e\n\u003cp\u003eThis tool is tailored for LLM engineers, data scientists, AI researchers, and developers looking to harness web data for training machine learning models, market research, and content aggregation. It simplifies the data preparation process, allowing professionals to focus on insights and model development, and you can also self-host it.\u003c/p\u003e\n\u003ch2\u003eScrapeGraphAI\u003c/h2\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/scrapegraphai-star-history.webp\" alt=\"scrapegraphai-star-history\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://github.com/VinciGit00/Scrapegraph-ai\"\u003eScrapeGraphAI\u003c/a\u003e is a Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.). 
With ScrapeGraphAI, you get to specify exactly what sort of data you want to extract.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/scrapegraphai.webp\" alt=\"scrapegraphai\"\u003e\u003c/p\u003e\n\u003cp\u003eScrapeGraphAI leverages the power of LLMs, and can thus adapt to changes in website structures, reducing the need for constant developer intervention. This flexibility helps scrapers remain functional even when website layouts change.\u003c/p\u003e\n\u003cp\u003eThe LLMs it currently supports include GPT, Gemini, Groq, Azure, Hugging Face, as well as local models.\u003c/p\u003e\n\u003ch2\u003eLangChain\u003c/h2\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/langchain-star-history.webp\" alt=\"langchain-star-history\"\u003e\u003c/p\u003e\n\u003cp\u003eWhat is LangChain not capable of? Not \u003ca href=\"https://python.langchain.com/v0.1/docs/use_cases/web_scraping/\"\u003eweb scraping\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eOne of web scraping\u0026#39;s biggest challenges is the changing nature of modern websites\u0026#39; layouts and content, which requires modifying scraping scripts to accommodate the changes. LangChain addresses this by using function calling (e.g., OpenAI functions) with an extraction chain, so that you don\u0026#39;t have to change your code constantly when websites change.\u003c/p\u003e\n\u003cp\u003eIf you are doing research and want to scrape only the titles and summaries of news articles from The Wall Street Journal website, it\u0026#39;s got you covered.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/ai-web-scraper/langchain.webp\" alt=\"langchain\"\u003e\u003c/p\u003e\n\u003ch2\u003eTo Sum Up\u003c/h2\u003e\n\u003cp\u003eOf course, there is no one-size-fits-all web scraper. 
Do you prefer old-school traditional web scrapers or LLM-empowered ones?\u003c/p\u003e\n\u003cp\u003e📧 \u003cem\u003eSubscribe to our \u003ca href=\"https://star-history.beehiiv.com/subscribe\"\u003eweekly newsletter here\u003c/a\u003e.\u003c/em\u003e\u003c/p\u003e\n"},"__N_SSG":true},"page":"/blog/[slug]","query":{"slug":"ai-web-scraper"},"buildId":"xKX4ZiOi_N7h3OBOEsSZu","isFallback":false,"gsp":true,"scriptLoader":[]}</script></body></html>
