CINXE.COM

Starlet #26 thepi.pe: Intelligent Document Scraping with Vision-Language Models

<!DOCTYPE html><html lang="en"> <head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><link rel="icon" href="/assets/favicon.ico"/><script defer="" data-domain="star-history.com" src="https://plausible.io/js/script.js"></script><meta name="next-head-count" content="4"/><link data-next-font="size-adjust" rel="preconnect" href="/" crossorigin="anonymous"/><link rel="preload" href="/_next/static/css/f94657194d4c857a.css" as="style" crossorigin=""/><link rel="stylesheet" href="/_next/static/css/f94657194d4c857a.css" crossorigin="" data-n-g=""/><noscript data-n-css=""></noscript><script defer="" crossorigin="" nomodule="" src="/_next/static/chunks/polyfills-c67a75d1b6f99dc8.js"></script><script src="/_next/static/chunks/webpack-38cee4c0e358b1a3.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/framework-fda0a023b274c574.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/main-001c9e19b1894c7d.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/pages/_app-915effad870aa62e.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/6c86d9ce-d8b7531786dd65a5.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/472-8057db644de3d496.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/590-d0a3c67c09cc0662.js" defer="" crossorigin=""></script><script src="/_next/static/chunks/pages/blog/%5Bslug%5D-7b378153203b51eb.js" defer="" crossorigin=""></script><script src="/_next/static/xKX4ZiOi_N7h3OBOEsSZu/_buildManifest.js" defer="" crossorigin=""></script><script src="/_next/static/xKX4ZiOi_N7h3OBOEsSZu/_ssgManifest.js" defer="" crossorigin=""></script></head><body><div id="__next"><div class="relative w-full h-auto min-h-screen overflow-auto flex flex-col"><title>Starlet #26 thepi.pe: Intelligent Document Scraping with Vision-Language Models</title><meta name="description" content="thepi.pe is an open-source package for intelligent document scraping at scale using vision-language models (VLMs)."/><meta property="og:type" content="website"/><meta property="og:url" content="https://star-history.com/blog/thepipe"/><meta property="og:title" content="Starlet #26 thepi.pe: Intelligent Document Scraping with Vision-Language Models"/><meta property="og:description" content="thepi.pe is an open-source package for intelligent document scraping at scale using vision-language models (VLMs)."/><meta property="og:image" content="https://star-history.com/assets/blog/thepipe/banner.webp"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:url" content="https://star-history.com/blog/thepipe"/><meta name="twitter:title" content="Starlet #26 thepi.pe: Intelligent Document Scraping with Vision-Language Models"/><meta name="twitter:description" content="thepi.pe is an open-source package for intelligent document scraping at scale using vision-language models (VLMs)."/><meta name="twitter:image" content="https://star-history.com/assets/blog/thepipe/banner.webp"/><nav><div class="flex justify-center items-center gap-x-6 bg-green-600 px-6 py-1 sm:px-3.5 "><p class="text-sm leading-6 text-white"><a href="/blog/list-your-open-source-project">Want to promote your open source project? Be on our ⭐️Starlet List⭐️ for FREE →</a></p></div></nav><header class="w-full h-14 shrink-0 flex flex-row justify-center items-center bg-[#363636] text-light"><div class="w-full md:max-w-5xl lg:max-w-7xl h-full flex flex-row justify-between items-center px-0 sm:px-4"><div class="h-full bg-dark flex flex-row justify-start items-center"><a class="h-full flex flex-row justify-center items-center px-3 hover:bg-zinc-800" href="/"><img class="w-7 h-auto" src="/assets/icon.png" alt="Logo"/></a><a class="h-full flex flex-row justify-center items-center text-base px-3 hover:bg-zinc-800" href="/blog"><span class="text-white font-semibold -2">Blog</span></a><span class="h-full flex flex-row justify-center items-center cursor-pointer text-white text-base px-3 font-semibold mr-2 hover:bg-zinc-800">Add Access Token</span></div><div class="hidden h-full md:flex flex-row justify-start items-center"><a target="_blank" rel="noopener noreferrer" class="h-full flex text-white text-base flex-row justify-center items-center px-4 hover:bg-zinc-800" href="https://www.bytebase.com/?source=star-history"><img class="h-6 mt-1 mr-2" src="/assets/craft-by-bytebase.webp" alt=""/></a></div><div class="h-full hidden md:flex flex-row justify-end items-center space-x-2"><a class="h-full flex flex-row justify-center items-center px-2 hover:bg-zinc-800" href="https://twitter.com/StarHistoryHQ" target="_blank" rel="noopener noreferrer"><i class="fab fa-twitter text-2xl text-blue-300"></i></a></div><div class="h-full flex md:hidden flex-row justify-end items-center"><span class="relative h-full w-10 px-3 flex flex-row justify-center items-center cursor-pointer font-semibold text-light hover:bg-zinc-800"><span class="w-4 transition-all h-px bg-light absolute top-1/2 -mt-1"></span><span class="w-4 transition-all h-px bg-light absolute top-1/2 "></span><span class="w-4 transition-all h-px bg-light absolute top-1/2 mt-1"></span></span></div></div></header><div class="w-full h-auto py-2 flex md:hidden flex-col justify-start items-start shadow-lg border-b hidden"><a class="h-12 text-base px-3 w-full flex flex-row justify-start items-center cursor-pointer font-semibold text-dark mr-2 hover:bg-gray-100 hover:text-blue-500" href="/blog/how-to-use-github-star-history">📕 How to use this site</a><span class="h-12 px-3 text-base w-full flex flex-row justify-start items-center cursor-pointer font-semibold text-dark mr-2 hover:bg-gray-100 hover:text-blue-500">Add Access Token</span><span class="h-12 text-base px-3 w-full flex flex-row justify-start items-center"><a class="github-button -mt-1" href="https://github.com/star-history/star-history" data-show-count="true" aria-label="Star star-history/star-history on GitHub" target="_blank" rel="noopener noreferrer">Star</a></span></div><div class="w-full h-auto grow lg:grid lg:grid-cols-[256px_1fr_256px]"><div class="w-full hidden lg:block"><div class="flex flex-col justify-start items-start w-full mt-2 p-4 pl-8"><a class="hover:opacity-75" href="/blog/list-your-open-source-project"><img class="w-auto max-w-full" src="/assets/starlet-icon.webp"/></a><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Playbook</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/how-to-use-github-star-history"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">📕 How to Use this Site</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/playbook-for-more-github-stars"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">⭐️ How to Get More Stars</span></a></li></ul></div><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Monthly Pick</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-devtools"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Nov (AI DevTools)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/homelab"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Oct (Homelab)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-agents"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Sep (AI Agents)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/rag-frameworks"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Aug (RAG frameworks)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-generators"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Jul (AI Generators)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-search"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Jun (AI Searches)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-web-scraper"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 May (AI Web Scraper)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/prompt-engineering"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Apr (AI Prompt)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/non-ai"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Mar (Non-AI)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/most-underrated"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Feb (Most Underrated)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/text2sql"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2024 Jan (Text2SQL)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/gpt-wrappers"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Dec (GPT Wrappers)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/tts"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Nov (TTS)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/ai-for-postgres"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Oct (AI for Postgres)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/coding-ai"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Sept (Coding AI)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/cli-tool-for-llm"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Aug (CLI tool for LLMs)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/llama2"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 July (Llama 2 Edition)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202306"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 June</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202305"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 May</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202304"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Apr</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202303"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Mar (ChatGPT Edition)</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202302"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Feb</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202301"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023 Jan</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-monthly-pick-202212"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Dec</span></a></li></ul></div><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Yearly Pick</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/best-of-2023"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2023</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-yearly-pick-2022-data-infra-devtools"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Data, Infra &amp; DevTools</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-open-source-2022-platform-engineering"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Platform Engineering</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-open-source-2022-open-source-alternatives"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 OSS Alternatives</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/star-history-yearly-pick-2022-frontend"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">2022 Front-end</span></a></li></ul></div><div><div class="w-full flex flex-row justify-between items-center my-2"><h3 class="text-sm font-medium text-gray-400 leading-6">Starlet List</h3></div><ul class="list-disc list-inside"><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/list-your-open-source-project"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">🎁 Prompt yours for FREE</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/trench"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #28 - Trench</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/langfuse"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #27 - langfuse</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/thepipe"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #26 - thepi.pe</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/taipy"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #25 - Taipy</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/superlinked"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #24 - Superlinked</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/tea-tasting"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #23 - tea-tasting</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/giskard"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #22 - Giskard</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/khoj"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #21 - Khoj</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/paradedb"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #20 - ParadeDB</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/skyvern"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #19 - Skyvern</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/prisma"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #18 - Prisma</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/spicedb"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #17 - SpiceDB</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/answer"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #16 - Apache Answer</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/infinity"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #15 - Infinity</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/proton"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #14 - Proton</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/earthly"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #13 - Earthly</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/wasp"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #12 - Wasp</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/libsql"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #11 - libSQL</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/postgresml"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #10 - PostgresML</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/electricsql"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #9 - ElectricSQL</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/prompt-flow"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #8 - Prompt flow</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/clipboard"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #7 - Clipboard</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/hoppscotch"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #6 - Hoppscotch</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/metisfl"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #5 - MetisFL</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/chatgpt-js"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #4 - chatgpt.js</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/mockoon"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #3 - Mockoon</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/dlta-ai"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #2 - DLTA-AI</span></a></li><li class="mb-2 leading-3"><a class="cursor-pointer" rel="noopener noreferrer" href="/blog/sniffnet"><span class="inline -ml-2 text-sm text-blue-700 hover:underline">Issue #1 - Sniffnet</span></a></li></ul></div></div></div><div class="w-full flex flex-col justify-start items-center"><div class="w-full p-4 md:p-0 mt-6 md:w-5/6 lg:max-w-6xl h-full flex flex-col justify-start items-center self-center"><img class="hidden md:block w-auto max-w-full object-scale-down" src="/assets/blog/thepipe/banner.webp" alt=""/><div class="w-auto max-w-6xl mt-4 md:mt-12 prose prose-indigo prose-xl md:prose-2xl flex flex-col justify-center items-center"><h1 class="leading-16">Starlet #26 thepi.pe: Intelligent Document Scraping with Vision-Language Models</h1></div><div class="w-full mt-8 mb-2 max-w-6xl px-2 flex flex-row items-center justify-center text-sm text-gray-900 font-semibold trackingwide uppercase"><div class="flex space-x-1 text-gray-500"><span class="text-gray-900">Emmett McFarlane</span><span aria-hidden="true"> · </span><time dateTime="2024-09-25:00:00.000Z">Sep 25, 2024</time><span aria-hidden="true"> · </span><span> <!-- -->2<!-- --> min read </span></div></div><div class="mt-8 w-full max-w-5xl prose prose-indigo prose-xl md:prose-2xl"><p><em>This is the twenty-sixth issue of The Starlet List. If you want to prompt your open source project on star-history.com for free, please check out our <a href="/blog/list-your-open-source-project">announcement</a>.</em></p> <hr> <p><img src="/assets/blog/thepipe/thepipe-logo.webp" alt="thepi.pe logo"></p> <p><a href="https://thepi.pe/"><strong>thepi.pe</strong></a> is an open-source package for intelligent document scraping at scale using vision-language models (VLMs). It works for dozens of file types. Unlike traditional scrapers, thepi.pe has no problems with poorly scanned PDFs, highly complex visuals, and irregularly formatted tables. It&#39;s designed to work with state-of-the-art models like GPT-4o and it integrates easily with LLM APIs, vector databases, and RAG frameworks.</p> <h2>How it works</h2> <p>thepi.pe uses a combination of computer vision models and carefully crafted heuristics to extract clean content from various sources. It processes this content for downstream use with language models or vision transformers.</p> <p><img src="/assets/blog/thepipe/thepipe-workflow.webp" alt="thepi.pe workflow diagram"></p> <p>The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can be easily converted to a prompt format compatible with any LLM or multimodal model using <code>thepipe.core.chunks_to_messages</code>.</p> <h2>Key Features</h2> <ol> <li><strong>Multimodal Extraction</strong>: Extract clean markdown, tables, and images from documents or webpages.</li> <li><strong>Structured Data Extraction</strong>: Extract structured data as JSON or CSV from any document or webpage.</li> <li><strong>VLM-based Document Analysis</strong>: Utilizes AI for filetype detection, layout analysis, and data extraction.</li> <li><strong>Wide Format Support</strong>: Works with PDFs, word docs, powerpoints, webpages, videos, images, and more.</li> <li><strong>LLM Integration</strong>: Seamlessly interfaces with any language model (GPT4, Claude, LLaMa) and vector databases like LlamaIndex.</li> </ol> <h2>Getting Started</h2> <p>Install thepi.pe:</p> <pre><code class="language-bash">pip install thepipe-api </code></pre> <p>Here&#39;s a quick example of how to use thepi.pe with the hosted API:</p> <pre><code class="language-python">from thepipe.scraper import scrape_file from thepipe.core import chunks_to_messages from openai import OpenAI # Scrape markdown, tables, visuals chunks = scrape_file(filepath=&quot;paper.pdf&quot;, ai_extraction=True) # Call LLM with clean, comprehensive data client = OpenAI() response = client.chat.completions.create( model=&quot;gpt-4o&quot;, messages=chunks_to_messages(chunks), ) </code></pre> <h2>Supported File Types</h2> <p>thepi.pe supports a wide range of file types, including:</p> <ul> <li>Webpages (URLs)</li> <li>PDFs</li> <li>Word Documents</li> <li>PowerPoint Presentations</li> <li>Videos and Audio files</li> <li>Jupyter Notebooks</li> <li>Spreadsheets</li> <li>Images</li> </ul> <p>And it&#39;s easily extensible to support any custom file type of your choice.</p> <h2>Learn More and Contribute</h2> <p>If you&#39;re excited about intelligent document processing and want to try thepi.pe for yourself, visit our GitHub repository!</p> <ul> <li>GitHub: <a href="https://github.com/emcf/thepipe">https://github.com/emcf/thepipe</a></li> <li>Website: <a href="https://thepi.pe/">https://thepi.pe/</a></li> </ul> <p>Join us in shaping the future of intelligent document processing! Thank you to the organizers of Star History&#39;s Starlet program for this opportunity to present thepi.pe to the best minds in the industry.</p> </div></div><div class="mt-12"><iframe src="https://embeds.beehiiv.com/2803dbaa-d8dd-4486-8880-4b843f3a7da6?slim=true" data-test-id="beehiiv-embed" height="52" frameBorder="0" scrolling="no" style="margin:0;border-radius:0px !important;background-color:transparent"></iframe></div></div><div class="w-full hidden lg:block"></div></div><footer class="relative w-full shrink-0 h-auto mt-6 flex flex-col justify-end items-center"><div class="w-full py-2 px-3 md:w-5/6 lg:max-w-7xl flex flex-row flex-wrap justify-between items-center text-neutral-700 border-t"><div class="text-sm leading-8 flex flex-row flex-wrap justify-start items-center"><div class="h-full text-gray-600">The missing GitHub star history graph</div><a class="h-full flex flex-row justify-center items-center ml-3 text-lg hover:opacity-80" href="https://twitter.com/StarHistoryHQ" target="_blank" rel="noopener noreferrer"><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 512 512" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg></a><a class="h-full flex flex-row justify-center items-center mx-3 text-lg hover:opacity-80" href="mailto:star@bytebase.com" target="_blank" rel="noopener noreferrer"><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 512 512" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M502.3 190.8c3.9-3.1 9.7-.2 9.7 4.7V400c0 26.5-21.5 48-48 48H48c-26.5 0-48-21.5-48-48V195.6c0-5 5.7-7.8 9.7-4.7 22.4 17.4 52.1 39.5 154.1 113.6 21.1 15.4 56.7 47.8 92.2 47.6 35.7.3 72-32.8 92.3-47.6 102-74.1 131.6-96.3 154-113.7zM256 320c23.2.4 56.6-29.2 73.4-41.4 132.7-96.3 142.8-104.7 173.4-128.7 5.8-4.5 9.2-11.5 9.2-18.9v-19c0-26.5-21.5-48-48-48H48C21.5 64 0 85.5 0 112v19c0 7.4 3.4 14.3 9.2 18.9 30.6 23.9 40.7 32.4 173.4 128.7 16.8 12.2 50.2 41.8 73.4 41.4z"></path></svg></a><a class="h-full flex flex-row justify-center items-center mr-3 text-lg hover:opacity-80" href="https://github.com/star-history/star-history" target="_blank" rel="noopener noreferrer"><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 496 512" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg></a></div><div class="flex flex-row flex-wrap items-center space-x-4"><div class="flex flex-row text-sm leading-8 underline text-blue-700 hover:opacity-80"><img class="h-6 mt-1 mr-2" src="/assets/sqlchat.webp" alt="SQL Chat"/><a href="https://sqlchat.ai" target="_blank" rel="noopener noreferrer"> <!-- -->SQL Chat<!-- --> </a></div><div class="flex flex-row text-sm leading-8 underline text-blue-700 hover:opacity-80"><img class="h-6 mt-1 mr-2" src="/assets/dbcost.webp" alt="DB Cost"/><a href="https://dbcost.com" target="_blank" rel="noopener noreferrer">DB Cost</a></div></div><div class="text-xs leading-8 flex flex-row flex-nowrap justify-end items-center"><span class="text-gray-600">Maintained by<!-- --> <a class="text-blue-500 font-bold hover:opacity-80" href="https://bytebase.com" target="_blank" rel="noopener noreferrer">Bytebase</a>, originally built by<!-- --> <a class="bg-blue-400 text-white p-1 pl-2 pr-2 rounded-l-2xl rounded-r-2xl hover:opacity-80" href="https://twitter.com/tim_qian" target="_blank" rel="noopener noreferrer">@tim_qian</a></span></div></div></footer><div class="fixed right-0 top-32 hidden lg:flex flex-col justify-start items-start transition-all bg-white w-48 xl:w-56 p-2 z-10 "><div class="w-full flex justify-between items-center mb-2"><p class="text-xs text-gray-400">Sponsors (random order)</p><svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 352 512" class="fas fa-times text-xs text-gray-400 cursor-pointer hover:text-gray-500" height="1em" width="1em" xmlns="http://www.w3.org/2000/svg"><path d="M242.72 256l100.07-100.07c12.28-12.28 12.28-32.19 0-44.48l-22.24-22.24c-12.28-12.28-32.19-12.28-44.48 0L176 189.28 75.93 89.21c-12.28-12.28-32.19-12.28-44.48 0L9.21 111.45c-12.28 12.28-12.28 32.19 0 44.48L109.28 256 9.21 356.07c-12.28 12.28-12.28 32.19 0 44.48l22.24 22.24c12.28 12.28 32.2 12.28 44.48 0L176 322.72l100.07 100.07c12.28 12.28 32.2 12.28 44.48 0l22.24-22.24c12.28-12.28 12.28-32.19 0-44.48L242.72 256z"></path></svg></div><a href="https://dify.ai/?utm_source=star-history" class="bg-gray-50 p-2 rounded w-full flex flex-col justify-center items-center mb-2 text-zinc-600 hover:opacity-80 hover:text-blue-600 hover:underline" target="_blank"><img class="w-auto max-w-full" src="/assets/sponsors/dify/logo.webp" alt="Dify"/><span class="text-xs mt-2">Dify: Open-source platform for building LLM apps, from agents to AI workflows.</span></a><a href="https://bytebase.com?utm_source=star-history" class="bg-gray-50 p-2 rounded w-full flex flex-col justify-center items-center mb-2 text-zinc-600 hover:opacity-80 hover:text-blue-600 hover:underline" target="_blank"><img class="w-auto max-w-full" src="/assets/sponsors/bytebase/logo.webp" alt="Bytebase"/><span class="text-xs mt-2">Bytebase: Database DevOps and CI/CD for MySQL, PG, Oracle, SQL Server, Snowflake, ClickHouse, Mongo, Redis</span></a><a href="mailto:star@bytebase.com?subject=I&#x27;m interested in sponsoring star-history.com" target="_blank" class="w-full p-2 text-center bg-gray-50 text-xs leading-6 text-gray-400 rounded hover:underline hover:text-blue-600">Your logo</a></div></div></div><script id="__NEXT_DATA__" type="application/json" crossorigin="">{"props":{"pageProps":{"blog":{"title":"Starlet #26 thepi.pe: Intelligent Document Scraping with Vision-Language Models","slug":"thepipe","author":"Emmett McFarlane","featured":true,"featureImage":"/assets/blog/thepipe/banner.webp","publishedDate":"2024-09-25:00:00.000Z","excerpt":"thepi.pe is an open-source package for intelligent document scraping at scale using vision-language models (VLMs).","readingTime":2},"parsedBlogHTML":"\u003cp\u003e\u003cem\u003eThis is the twenty-sixth issue of The Starlet List. If you want to prompt your open source project on star-history.com for free, please check out our \u003ca href=\"/blog/list-your-open-source-project\"\u003eannouncement\u003c/a\u003e.\u003c/em\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/thepipe/thepipe-logo.webp\" alt=\"thepi.pe logo\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://thepi.pe/\"\u003e\u003cstrong\u003ethepi.pe\u003c/strong\u003e\u003c/a\u003e is an open-source package for intelligent document scraping at scale using vision-language models (VLMs). It works for dozens of file types. Unlike traditional scrapers, thepi.pe has no problems with poorly scanned PDFs, highly complex visuals, and irregularly formatted tables. It\u0026#39;s designed to work with state-of-the-art models like GPT-4o and it integrates easily with LLM APIs, vector databases, and RAG frameworks.\u003c/p\u003e\n\u003ch2\u003eHow it works\u003c/h2\u003e\n\u003cp\u003ethepi.pe uses a combination of computer vision models and carefully crafted heuristics to extract clean content from various sources. It processes this content for downstream use with language models or vision transformers.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"/assets/blog/thepipe/thepipe-workflow.webp\" alt=\"thepi.pe workflow diagram\"\u003e\u003c/p\u003e\n\u003cp\u003eThe output from thepi.pe is a list of chunks containing all content within the source document. These chunks can be easily converted to a prompt format compatible with any LLM or multimodal model using \u003ccode\u003ethepipe.core.chunks_to_messages\u003c/code\u003e.\u003c/p\u003e\n\u003ch2\u003eKey Features\u003c/h2\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003eMultimodal Extraction\u003c/strong\u003e: Extract clean markdown, tables, and images from documents or webpages.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eStructured Data Extraction\u003c/strong\u003e: Extract structured data as JSON or CSV from any document or webpage.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eVLM-based Document Analysis\u003c/strong\u003e: Utilizes AI for filetype detection, layout analysis, and data extraction.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eWide Format Support\u003c/strong\u003e: Works with PDFs, word docs, powerpoints, webpages, videos, images, and more.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eLLM Integration\u003c/strong\u003e: Seamlessly interfaces with any language model (GPT4, Claude, LLaMa) and vector databases like LlamaIndex.\u003c/li\u003e\n\u003c/ol\u003e\n\u003ch2\u003eGetting Started\u003c/h2\u003e\n\u003cp\u003eInstall thepi.pe:\u003c/p\u003e\n\u003cpre\u003e\u003ccode class=\"language-bash\"\u003epip install thepipe-api\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eHere\u0026#39;s a quick example of how to use thepi.pe with the hosted API:\u003c/p\u003e\n\u003cpre\u003e\u003ccode class=\"language-python\"\u003efrom thepipe.scraper import scrape_file\nfrom thepipe.core import chunks_to_messages\nfrom openai import OpenAI\n\n# Scrape markdown, tables, visuals\nchunks = scrape_file(filepath=\u0026quot;paper.pdf\u0026quot;, ai_extraction=True)\n\n# Call LLM with clean, comprehensive data\nclient = OpenAI()\nresponse = client.chat.completions.create(\n model=\u0026quot;gpt-4o\u0026quot;,\n messages=chunks_to_messages(chunks),\n)\n\u003c/code\u003e\u003c/pre\u003e\n\u003ch2\u003eSupported File Types\u003c/h2\u003e\n\u003cp\u003ethepi.pe supports a wide range of file types, including:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eWebpages (URLs)\u003c/li\u003e\n\u003cli\u003ePDFs\u003c/li\u003e\n\u003cli\u003eWord Documents\u003c/li\u003e\n\u003cli\u003ePowerPoint Presentations\u003c/li\u003e\n\u003cli\u003eVideos and Audio files\u003c/li\u003e\n\u003cli\u003eJupyter Notebooks\u003c/li\u003e\n\u003cli\u003eSpreadsheets\u003c/li\u003e\n\u003cli\u003eImages\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAnd it\u0026#39;s easily extensible to support any custom file type of your choice.\u003c/p\u003e\n\u003ch2\u003eLearn More and Contribute\u003c/h2\u003e\n\u003cp\u003eIf you\u0026#39;re excited about intelligent document processing and want to try thepi.pe for yourself, visit our GitHub repository!\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eGitHub: \u003ca href=\"https://github.com/emcf/thepipe\"\u003ehttps://github.com/emcf/thepipe\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003eWebsite: \u003ca href=\"https://thepi.pe/\"\u003ehttps://thepi.pe/\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eJoin us in shaping the future of intelligent document processing!\nThank you to the organizers of Star History\u0026#39;s Starlet program for this opportunity to present thepi.pe to the best minds in the industry.\u003c/p\u003e\n"},"__N_SSG":true},"page":"/blog/[slug]","query":{"slug":"thepipe"},"buildId":"xKX4ZiOi_N7h3OBOEsSZu","isFallback":false,"gsp":true,"scriptLoader":[]}</script></body></html>

Pages: 1 2 3 4 5 6 7 8 9 10