<!DOCTYPE html> <html lang="en-us"> <head> <meta name="generator" content="Hugo 0.125.1"> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title> Apache Nutch™ </title> <meta name="description" content="Nutch is a highly extensible, highly scalable, mature, production-ready Web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks."> <meta name="author" content="Apache Nutch Project Management Committee"> <meta property="og:url" content="/"> <meta property="og:site_name" content="Apache Nutch™"> <meta property="og:title" content="Apache Nutch™"> <meta property="og:description" content="Nutch is a highly extensible, highly scalable, mature, production-ready Web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks."> <meta property="og:locale" content="en-us"> <meta property="og:type" content="website"> <link href="/index.xml" rel="alternate" type="application/rss+xml" title="Apache Nutch™" /> <link rel="canonical" href="/"> <link rel="shortcut icon" type="image/ico" href="/favicon.ico"> <link href="/css/font.css" rel="stylesheet" type="text/css"> <link href="/css/kube.min.css" rel="stylesheet" type="text/css"> <link href="/css/kube.legenda.css" rel="stylesheet" type="text/css"> <link href="/css/highlight.css" rel="stylesheet" type="text/css"> <link href="/css/main.css" rel="stylesheet" type="text/css"> <link href="/css/custom.css" rel="stylesheet" type="text/css"> <script src="/js/jquery-2.1.4.min.js" type="text/javascript"> </script> <script type="text/javascript" src="/js/tocbot.min.js"></script> <script src="https://www.apachecon.com/event-images/snippet.js"></script> </head> <body class="page-kube"> <header> <div class="show-sm"> <div id="nav-toggle-box"> <div id="nav-toggle-brand"> <a href="/">Apache Nutch™</a> </div><a data-component="toggleme" 
data-target="#top" href="#" id="nav-toggle"><i class="kube-menu"></i></a> </div> </div> <div class="hide-sm" id="top"> <div id="top-brand"> <a href="/" title="home">Apache Nutch™</a> </div> <nav id="top-nav-main"> <ul> <li><a href="/community/" >Community</a></li> <li><a href="/development/" >Development</a></li> <li><a href="/documentation/" >Docs</a></li> <li><a href="/download/" >Download</a></li> <li><a href="/news/" >News</a></li> <li><a href="/apache/" >The Apache Software Foundation</a></li> </ul> </nav> <nav id="top-nav-extra"> <ul> </ul> </nav> </div> </header> <main> <div id="main"> <div id="hero"> <h1>Apache Nutch™</h1> <p><b>Nutch</b> is a highly extensible, highly scalable, mature, production-ready <a href="https://en.wikipedia.org/wiki/Web_crawler" target="_blank">Web crawler</a> that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks.</p> </div> <div id="action-buttons"> <a class="button primary big" href="/download" onclick="_gaq.push(['_trackEvent', 'kube', 'download']);">Download</a> <a class="button outline big" href="https://github.com/apache/nutch" target="_blank" rel="noopener noreferrer" onclick="_gaq.push(['_trackEvent', 'kube', 'github']);">View on GitHub</a> <a class="button primary big" href="https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial" target="_blank" rel="noopener noreferrer" onclick="_gaq.push(['_trackEvent', 'kube', 'github']);">Get Started</a> </div> <div id="kube-features"> <div class="row gutters"> <div class="col col-4 item"> <figure> <img alt="Baseline" height="48" src="/img/kube/icon-baseline.png" width="48"> </figure> <h3>Scalable</h3> <p>Relying on <a href="https://hadoop.apache.org" target="_blank" rel="noopener noreferrer">Apache Hadoop™</a> data structures, Nutch is great for batch processing large data volumes but can also be tailored to smaller jobs.</p> </div> <div class="col col-4 item"> <figure> <img alt="Typography" height="48" src="/img/plug.svg" 
width="48"> </figure> <h3>Pluggable</h3> <p>Out of the box, Nutch offers powerful plugins, e.g., parsing with <a href="https://tika.apache.org" target="_blank" rel="noopener noreferrer">Apache Tika™</a>, indexing with <a href="https://solr.apache.org" target="_blank" rel="noopener noreferrer">Apache Solr™</a>, <a href="https://www.elastic.co/elasticsearch" target="_blank" rel="noopener noreferrer">Elasticsearch</a> and more!</p> </div> <div class="col col-4 item"> <figure> <img alt="Minimalism" height="48" src="/img/plus-square.svg" width="48"> </figure> <h3>Extensible</h3> <p>Provides intuitive and stable interfaces for popular functions, e.g., <a href="https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/parse/Parser.html" target="_blank" rel="noopener noreferrer">Parsers</a>, <a href="https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/parse/HtmlParseFilter.html" target="_blank" rel="noopener noreferrer">HTML Filtering</a>, <a href="https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/indexer/IndexingFilter.html" target="_blank" rel="noopener noreferrer">Indexing</a> and <a href="https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/javadoc/org/apache/nutch/scoring/ScoringFilter.html" target="_blank" rel="noopener noreferrer">Scoring</a> for custom implementations.</p> </div> </div> </div> </div> </main> <footer> <footer id="footer"> <p>© 2004-2024 The Apache Software Foundation. Built using the <a href="https://github.com/jeblister/kube" target="_blank" rel="noopener noreferrer">kube Theme for Hugo</a>. Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache Nutch project logo are trademarks of The Apache Software Foundation.</p> </footer> </footer> <script src="/js/kube.js" type="text/javascript"> </script> <script src="/js/kube.legenda.js" type="text/javascript"> </script> <script src="/js/main.js" type="text/javascript"> </script> </body> </html>