<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> <meta name="robots" content="index,nofollow"> <title>OCR - Community Help Wiki</title> <script type="text/javascript" src="/moin_static198/common/js/common.js"></script> <script type="text/javascript"> <!-- var search_hint = "Search"; //--> </script> <link rel="stylesheet" type="text/css" charset="utf-8" media="all" href="/moin_static198/light/css/common.css"> <link rel="stylesheet" type="text/css" charset="utf-8" media="screen" href="/moin_static198/light/css/screen.css"> <link rel="stylesheet" type="text/css" charset="utf-8" media="print" href="/moin_static198/light/css/print.css"> <link rel="stylesheet" type="text/css" charset="utf-8" media="projection" href="/moin_static198/light/css/projection.css"> <!-- css only for MS IE6/IE7 browsers --> <!--[if lt IE 8]> <link rel="stylesheet" type="text/css" charset="utf-8" media="all" href="/moin_static198/light/css/msie.css"> <![endif]--> <link rel="alternate" title="Community Help Wiki: OCR" href="/community/OCR?diffs=1&show_att=1&action=rss_rc&unique=0&page=OCR&ddiffs=1" type="application/rss+xml"> <link rel="Start" href="/community/CommunityHelpWiki"> <link rel="Alternate" title="Wiki Markup" href="/community/OCR?action=raw"> <link rel="Alternate" media="print" title="Print View" href="/community/OCR?action=print"> <link rel="Search" href="/community/FindPage"> <link rel="Index" href="/community/TitleIndex"> <link rel="Glossary" href="/community/WordIndex"> <link rel="Help" href="/community/HelpOnFormatting"> </head> <body lang="en" dir="ltr"> <div id="leftbanner"> </div> <div id="rightbanner"> </div> <div id="container"> <div id="container-inner"> <div id="mothership"> <ul> <li><a href="https://canonical.com/partners">Partners</a> </li> <li><a href="https://ubuntu.com/community/support">Support</a></li> <li><a 
href="https://ubuntu.com/community">Community</a></li> <li><a href="https://ubuntu.com">Ubuntu.com</a></li> </ul> </div> <div id="header"> <h1 id="ubuntu-header"><a href="https://help.ubuntu.com/">Ubuntu Documentation</a></h1> <ul id="main-menu"> <li><a href="https://help.ubuntu.com/">Official Documentation</a></li> <li><a class="main-menu-item current" href="https://help.ubuntu.com/community/CommunityHelpWiki">Community Help Wiki</a></li> <li><a href="https://ubuntu.com/community/contribute">Contribute</a></li> </ul> </div> <div id="menu-search"> <div id="subheader"> <div class="subheader-menu"> <div id="second-nav"> <ul id="username"><li><a class="nbinfo" href="/community/OCR?action=info" rel="nofollow">Page History</a></li><li><a href="/community/OCR?action=login&login=1" id="login" rel="nofollow">Login to edit</a></li></ul> </div> <div id="user-info"> </div> </div> </div> <div id="search-box"> <noscript> <form action="http://www.google.com/cse" id="cse-search-box"> <div> <input type="hidden" name="cx" value="003883529982892832976:e2vwumte3fq" /> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" name="q" size="21" /> <input type="submit" name="sa" value="Search" /> </div> </form> </noscript> <script> document.write('<form action="https://help.ubuntu.com/search.html" id="cse-search-box">'); document.write(' <div>'); document.write(' <input type="hidden" name="cof" value="FORID:9" />'); document.write(' <input type="hidden" name="cx" value="003883529982892832976:e2vwumte3fq" />'); document.write(' <input type="hidden" name="ie" value="UTF-8" />'); document.write(' <input type="text" name="q" size="21" />'); document.write(' <input type="submit" name="sa" value="Search" />'); document.write(' </div>'); document.write('</form>'); </script> </div> </div> <div id="title"><h1> <span id="pagelocation"><a href="/community/OCR">OCR</a></span> </h1> </div> <div id="cwt-nav3"> <hr class="clear" /> </div> <div id="cwt-content" class="clearfix content-area"> 
<div id="page" lang="en" dir="ltr"> <div dir="ltr" id="content" lang="en"><span class="anchor" id="top"></span> <span class="anchor" id="line-1"></span><div><table style="&quot; float:right; font-size: 0.9em; width:40%; background:#F1F1ED; margin: 0 0 1em 1em; &quot;"><tbody><tr> <td style="&quot; padding:0.5em; &quot;"><p class="line891"><div class="table-of-contents"><p class="table-of-contents-heading">Contents<ol><li> <a href="#OCR_-_Optical_Character_Recognition">OCR - Optical Character Recognition</a></li><li> <a href="#Available_OCR_tools">Available OCR tools</a><ol><li> <a href="#OCRFeeder">OCRFeeder</a></li><li> <a href="#Tesseract">Tesseract</a></li><li> <a href="#CuneiForm">CuneiForm</a></li></ol></li><li> <a href="#OCR_on_a_Multi_Page_PDF">OCR on a Multi Page PDF</a><ol><li> <a href="#gscan2pdf">gscan2pdf</a></li><li> <a href="#OCRFeeder-1">OCRFeeder</a></li><li> <a href="#pdfocr">pdfocr</a></li></ol></li><li> <a href="#Further_Reading">Further Reading</a></li></ol></div></td> </tr> </tbody></table></div><span class="anchor" id="line-2"></span><span class="anchor" id="line-3"></span><p class="line867"> <h1 id="OCR_-_Optical_Character_Recognition">OCR - Optical Character Recognition</h1> <span class="anchor" id="line-4"></span><span class="anchor" id="line-5"></span><p class="line874">OCR is a technology that allows you to convert scanned images of text into plain text. This enables you to save space, edit the text and search/index it. 
<span class="anchor" id="line-6"></span><span class="anchor" id="line-7"></span><p class="line867"> <h1 id="Available_OCR_tools">Available OCR tools</h1> <span class="anchor" id="line-8"></span><p class="line874">The Ubuntu universe repositories contain the following OCR tools: <span class="anchor" id="line-9"></span><ul><li><p class="line891"><a class="http" href="http://fuzzyocr.own-hero.net/">fuzzyocr</a> - SpamAssassin plugin to check image attachments <span class="anchor" id="line-10"></span></li><li><p class="line891"><a class="http" href="http://jocr.sourceforge.net/">gocr</a> - a command-line OCR program <span class="anchor" id="line-11"></span></li><li><p class="line891"><a class="http" href="http://hocr.berlios.de/">libhocr0</a> - Hebrew OCR <span class="anchor" id="line-12"></span></li><li><p class="line891"><a class="http" href="http://www.gnu.org/software/ocrad/ocrad.html">ocrad</a> - OCR program <span class="anchor" id="line-13"></span></li><li><p class="line891"><a class="http" href="http://live.gnome.org/OCRFeeder">ocrfeeder</a> - document layout analysis and optical character recognition system <span class="anchor" id="line-14"></span></li><li><p class="line891"><a class="http" href="http://code.google.com/p/ocropus/">ocropus</a> - document analysis and OCR system <span class="anchor" id="line-15"></span></li><li><p class="line891"><a class="http" href="http://code.google.com/p/tesseract-ocr/">tesseract-ocr</a> - command-line OCR program <span class="anchor" id="line-16"></span><span class="anchor" id="line-17"></span></li></ul><p class="line874">The Ubuntu multiverse repositories also contain: <span class="anchor" id="line-18"></span><ul><li><p class="line891"><a class="http" href="http://launchpad.net/cuneiform-linux/">cuneiform</a> - multi-language OCR system <span class="anchor" id="line-19"></span><span class="anchor" id="line-20"></span></li></ul><p class="line867"> <h2 id="OCRFeeder">OCRFeeder</h2> <span class="anchor" id="line-21"></span><p
class="line862">While Tesseract and <a class="nonexistent" href="/community/CuneiForm">CuneiForm</a> are the most accurate engines, on Linux they currently lack a graphical interface (GUI), which is an important usability feature for a typical desktop user. <span class="anchor" id="line-22"></span><span class="anchor" id="line-23"></span><p class="line862">The OCRFeeder suite provides a handy GUI that is essentially a front end for several image, OCR and text tools (such as unpaper or a spell checker). It does not perform character recognition itself but delegates it to other OCR applications, configured through its "OCR engines" settings. It ships with predefined settings for Tesseract, <a class="nonexistent" href="/community/CuneiForm">CuneiForm</a>, GOCR and Ocrad, so you don't need to know how to invoke them. You only have to install one or more OCR engines of your choice in Ubuntu and then let OCRFeeder detect them in its settings. You can also add other engines or change these options manually, and more than one engine entry can use the same application. The main OCRFeeder window lets you choose on the fly which engine to use for a particular area, and there is also a setting that makes one engine the default. <span class="anchor" id="line-24"></span><span class="anchor" id="line-25"></span><p class="line862">As of version 0.7.3 there is no easy way to choose the language of the text to be recognized. For Tesseract and <a class="nonexistent" href="/community/CuneiForm">CuneiForm</a> you have to add the "-l" switch followed by the appropriate language/script code (for example "-l pol" for Polish or "-l dan-frak" for Danish Fraktur) to the given engine's settings.
You can also create a separate entry for each combination of language and application you need, named for example "Traditional Chinese - Tesseract", "German - Tesseract" and "German - <a class="nonexistent" href="/community/CuneiForm">CuneiForm</a>" (the same language may be worth recognizing with different applications), and then select them from the pull-down "OCR engines" list in the main OCRFeeder window. <span class="anchor" id="line-26"></span><span class="anchor" id="line-27"></span><p class="line874">OCRFeeder can also be run purely from the command line: <span class="anchor" id="line-28"></span><span class="anchor" id="line-29"></span><p class="line867"><tt>$ ocrfeeder-cli -i input1.jpg input2.jpg -f html -o output.htm</tt> <span class="anchor" id="line-30"></span><span class="anchor" id="line-31"></span><p class="line867"> <h2 id="Tesseract">Tesseract</h2> <span class="anchor" id="line-32"></span><p class="line867"> <h3 id="Introduction">Introduction</h3> <span class="anchor" id="line-33"></span><span class="anchor" id="line-34"></span><p class="line874">Tesseract arguably produces the most accurate results of these tools. It was originally developed at HP Labs between 1985 and 1995 and released as open source in 2005. <span class="anchor" id="line-35"></span><span class="anchor" id="line-36"></span><p class="line874">Version 2.x did not support layout analysis, so multi-column text, images, equations and the like tend to produce garbled output. It also accepted only TIFF images as input. Version 3.x includes layout analysis and, if compiled with Leptonica, accepts every image format that Leptonica supports. <span class="anchor" id="line-37"></span><span class="anchor" id="line-38"></span><p class="line862">Originally Tesseract could recognize English text only; version 2.x extended this to seven languages: English, German, French, Italian, Spanish, Brazilian Portuguese and Dutch.
You can install more than one dictionary if needed. Newer versions can recognize text in the <a class="https" href="https://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/Makefile.in">following languages/scripts</a> (codes loosely based on <a class="http" href="http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes">ISO 639-2</a>): <span class="anchor" id="line-39"></span><ul><li>ara - Arabic <span class="anchor" id="line-40"></span></li><li>eng - English <span class="anchor" id="line-41"></span></li><li>bul - Bulgarian <span class="anchor" id="line-42"></span></li><li>cat - Catalan <span class="anchor" id="line-43"></span></li><li>ces - Czech <span class="anchor" id="line-44"></span></li><li>chi_sim - Chinese [Simplified] <span class="anchor" id="line-45"></span></li><li>chi_tra - Chinese [Traditional] <span class="anchor" id="line-46"></span></li><li>dan - Danish <span class="anchor" id="line-47"></span></li><li>dan-frak - Danish [Fraktur] <span class="anchor" id="line-48"></span></li><li>deu - German <span class="anchor" id="line-49"></span></li><li>ell - Greek [Modern] <span class="anchor" id="line-50"></span></li><li>fin - Finnish <span class="anchor" id="line-51"></span></li><li>fra - French <span class="anchor" id="line-52"></span></li><li>heb - Hebrew <span class="anchor" id="line-53"></span></li><li>hrv - Croatian <span class="anchor" id="line-54"></span></li><li>hun - Hungarian <span class="anchor" id="line-55"></span></li><li>ind - Indonesian <span class="anchor" id="line-56"></span></li><li>ita - Italian <span class="anchor" id="line-57"></span></li><li>jpn - Japanese <span class="anchor" id="line-58"></span></li><li>kor - Korean <span class="anchor" id="line-59"></span></li><li>lav - Latvian <span class="anchor" id="line-60"></span></li><li>lit - Lithuanian <span class="anchor" id="line-61"></span></li><li>nld - Dutch <span class="anchor" id="line-62"></span></li><li>nor - Norwegian <span class="anchor" id="line-63"></span></li><li>osd - [Orientation
and Script Detection] <span class="anchor" id="line-64"></span></li><li>pol - Polish <span class="anchor" id="line-65"></span></li><li>por - Portuguese <span class="anchor" id="line-66"></span></li><li>ron - Romanian <span class="anchor" id="line-67"></span></li><li>rus - Russian <span class="anchor" id="line-68"></span></li><li>slk - Slovak <span class="anchor" id="line-69"></span></li><li>slk-frak - Slovak [Fraktur] <span class="anchor" id="line-70"></span></li><li>slv - Slovenian <span class="anchor" id="line-71"></span></li><li>spa - Spanish <span class="anchor" id="line-72"></span></li><li>srp - Serbian <span class="anchor" id="line-73"></span></li><li>swe - Swedish <span class="anchor" id="line-74"></span></li><li>tgl - Tagalog <span class="anchor" id="line-75"></span></li><li>tha - Thai <span class="anchor" id="line-76"></span></li><li>tur - Turkish <span class="anchor" id="line-77"></span></li><li>ukr - Ukrainian <span class="anchor" id="line-78"></span></li><li>vie - Vietnamese <span class="anchor" id="line-79"></span><span class="anchor" id="line-80"></span></li></ul><p class="line867"> <h3 id="Usage">Usage</h3> <span class="anchor" id="line-81"></span><span class="anchor" id="line-82"></span><p class="line862">The current version of Tesseract in the Ubuntu repository is a command-line-only tool. After successful installation, the command to use is <tt>tesseract &lt;path to image&gt; &lt;basename of output file&gt;</tt>. Tesseract will automatically give the output file a .txt extension. If you have installed the language-specific data files from one of the <tt>tesseract-ocr-???</tt> packages, you can pass the <tt>-l</tt> option followed by the language code. <span class="anchor" id="line-83"></span><span class="anchor" id="line-84"></span><p class="line874">For versions of Tesseract older than 3 it is critical that the image is in Tagged Image File Format and has a ".tif" extension, not a ".tiff" extension.
The command line should look like this example: <span class="anchor" id="line-85"></span><span class="anchor" id="line-86"></span><p class="line867"><tt>$ tesseract ~/input.tif output</tt> <span class="anchor" id="line-87"></span><span class="anchor" id="line-88"></span><p class="line862">Here <tt>input.tif</tt> is the document to be converted, located in your home folder, and <tt>output</tt> is the base name of the file that Tesseract will create as <tt>output.txt</tt>. The <tt>.txt</tt> file extension will be added by Tesseract automatically. <span class="anchor" id="line-89"></span><span class="anchor" id="line-90"></span><p class="line867"> <h3 id="Preparing_images_for_old_versions_of_Tesseract">Preparing images for old versions of Tesseract</h3> <span class="anchor" id="line-91"></span><span class="anchor" id="line-92"></span><p class="line874">Tesseract 2.x is not very flexible about the format of its input images. It will only accept TIFF images. According to user reports, compressed TIFF images are quite problematic, and the same goes for grey-scale and colour images. So you're better off with single-bit uncompressed TIFF images. <span class="anchor" id="line-93"></span><span class="anchor" id="line-94"></span><p class="line874">Preparing them with GIMP is straightforward: <span class="anchor" id="line-95"></span><span class="anchor" id="line-96"></span><ol type="1"><li><p class="line862">Go to the Image→Mode menu and make sure the image is in RGB or Grayscale mode. <span class="anchor" id="line-97"></span></li><li><p class="line862">Select from the menu Tools→Color Tools→Threshold and choose an adequate threshold value. <span class="anchor" id="line-98"></span></li><li><p class="line862">Select from the menu Image→Mode→Indexed and from the options choose 1-bit and no dithering. <span class="anchor" id="line-99"></span></li><li>Save the image in TIFF format with a .tif extension.
<span class="anchor" id="line-100"></span><span class="anchor" id="line-101"></span></li></ol><p class="line867"> <h3 id="Using_Tesseract_With_a_Multi_Page_PDF">Using Tesseract With a Multi Page PDF</h3> <span class="anchor" id="line-102"></span><span class="anchor" id="line-103"></span><p class="line862">Scanned documents are often stored as raster images inside a large PDF file. With <a href="/community/ImageMagick">ImageMagick</a>, the individual pages can be extracted as TIFF files and then processed with Tesseract. The following script can help automate this process: <span class="anchor" id="line-104"></span><span class="anchor" id="line-105"></span><p class="line867"><span class="anchor" id="line-106"></span><span class="anchor" id="line-107"></span><span class="anchor" id="line-108"></span><span class="anchor" id="line-109"></span><span class="anchor" id="line-110"></span><span class="anchor" id="line-111"></span><span class="anchor" id="line-112"></span><span class="anchor" id="line-113"></span><span class="anchor" id="line-114"></span><span class="anchor" id="line-115"></span><span class="anchor" id="line-116"></span><span class="anchor" id="line-117"></span><span class="anchor" id="line-118"></span><span class="anchor" id="line-119"></span><span class="anchor" id="line-120"></span><pre><span class="anchor" id="line-1"></span>#!/bin/sh
<span class="anchor" id="line-2"></span>STARTPAGE=6      # set to pagenumber of the first page of PDF you wish to convert
<span class="anchor" id="line-3"></span>ENDPAGE=255      # set to pagenumber of the last page of PDF you wish to convert
<span class="anchor" id="line-4"></span>SOURCE=book.pdf  # set to the file name of the PDF
<span class="anchor" id="line-5"></span>OUTPUT=book.txt  # set to the final output file
<span class="anchor" id="line-6"></span>RESOLUTION=600   # set to the resolution the scanner used (the higher, the better)
<span class="anchor" id="line-7"></span>
<span class="anchor" id="line-8"></span>touch "$OUTPUT"
<span class="anchor" id="line-9"></span>for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
<span class="anchor" id="line-10"></span>    # ImageMagick counts PDF pages from 0, hence the "- 1"
    convert -monochrome -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" page.tif
<span class="anchor" id="line-11"></span>    echo "processing page $i"
<span class="anchor" id="line-12"></span>    tesseract page.tif tempoutput
<span class="anchor" id="line-13"></span>    cat tempoutput.txt >> "$OUTPUT"
<span class="anchor" id="line-14"></span>done</pre><span class="anchor" id="line-121"></span><span class="anchor" id="line-122"></span><p class="line862">After running this script, the OCR text should be contained in <tt>book.txt</tt> (or whatever you set <tt>$OUTPUT</tt> to be). <span class="anchor" id="line-123"></span><span class="anchor" id="line-124"></span><p class="line867"> <h2 id="CuneiForm">CuneiForm</h2> <span class="anchor" id="line-125"></span><p class="line867"> <h3 id="Introduction-1">Introduction</h3> <span class="anchor" id="line-126"></span><span class="anchor" id="line-127"></span><p class="line867"><a class="nonexistent" href="/community/CuneiForm">CuneiForm</a> is another OCR system, which was originally developed and open-sourced by Cognitive Technologies. <span class="anchor" id="line-128"></span><span class="anchor" id="line-129"></span><p class="line862">The Windows version, which has its own graphical interface, can be run <a class="http" href="http://appdb.winehq.org/objectManager.php?sClass=version&iId=10327">with some results</a> under <a href="/community/Wine">Wine</a>. Its Linux port is being developed on <a class="http" href="http://launchpad.net/cuneiform-linux">Launchpad</a> <span class="anchor" id="line-130"></span>and, while it currently doesn't have its own GUI, <a class="nonexistent" href="/community/CuneiForm">CuneiForm</a> can be run successfully from within the OCRFeeder graphical interface.
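<p class="line874">Because the Linux port is driven from the command line, a direct call can be sketched as a small shell function. This is only an illustration: it assumes the <tt>cuneiform</tt> binary is installed and accepts the <tt>-l</tt> (language code) and <tt>-o</tt> (output file) switches mentioned elsewhere on this page. The function name <tt>cuneiform_ocr</tt>, the <tt>CUNEIFORM</tt> variable and the file names are hypothetical.</p>
<pre>#!/bin/sh
# Hypothetical wrapper around the cuneiform command line (a sketch, not an official tool).
# CUNEIFORM can point at another binary, or at "echo" to preview the arguments.
CUNEIFORM=${CUNEIFORM:-cuneiform}

# Usage: cuneiform_ocr IMAGE LANGCODE
# Writes the recognized text next to the image, e.g. scan.tif becomes scan.txt.
cuneiform_ocr() {
    image=$1
    lang=$2
    "$CUNEIFORM" -l "$lang" -o "${image%.*}.txt" "$image"
}</pre>
<p class="line874">For example, <tt>cuneiform_ocr scan.tif ger</tt> would ask CuneiForm to read <tt>scan.tif</tt> as German text and write the result to <tt>scan.txt</tt>.</p>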
<span class="anchor" id="line-131"></span><span class="anchor" id="line-132"></span><p class="line867"><a class="nonexistent" href="/community/CuneiForm">CuneiForm</a> recognizes Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, German, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Russian-English bilingual, Serbian, Slovene, Spanish, Swedish, Turkish, and Ukrainian text. <span class="anchor" id="line-133"></span><span class="anchor" id="line-134"></span><p class="line862">List of <a class="http" href="http://bazaar.launchpad.net/~jpakkane/cuneiform-linux/trunk/files/head:/datafiles/">language/script</a> codes (loosely based on <a class="http" href="http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes">ISO 639-2</a>): <span class="anchor" id="line-135"></span><ul><li>bul - Bulgarian <span class="anchor" id="line-136"></span></li><li>cro - Croatian <span class="anchor" id="line-137"></span></li><li>cze - Czech <span class="anchor" id="line-138"></span></li><li>dan - Danish <span class="anchor" id="line-139"></span></li><li>dut - Dutch <span class="anchor" id="line-140"></span></li><li>est - Estonian <span class="anchor" id="line-141"></span></li><li>frn - French <span class="anchor" id="line-142"></span></li><li>ger - German <span class="anchor" id="line-143"></span></li><li>hun - Hungarian <span class="anchor" id="line-144"></span></li><li>ita - Italian <span class="anchor" id="line-145"></span></li><li>lat - Latvian <span class="anchor" id="line-146"></span></li><li>lit - Lithuanian <span class="anchor" id="line-147"></span></li><li>pol - Polish <span class="anchor" id="line-148"></span></li><li>por - Portuguese <span class="anchor" id="line-149"></span></li><li>rom - Romanian <span class="anchor" id="line-150"></span></li><li>rus - Russian <span class="anchor" id="line-151"></span></li><li>ser - Serbian <span class="anchor" id="line-152"></span></li><li>slo - Slovene <span class="anchor"
id="line-153"></span></li><li>spa - Spanish <span class="anchor" id="line-154"></span></li><li>swe - Swedish <span class="anchor" id="line-155"></span></li><li>tur - Turkish <span class="anchor" id="line-156"></span></li><li>ukr - Ukrainian <span class="anchor" id="line-157"></span><span class="anchor" id="line-158"></span></li></ul><p class="line867"> <h3 id="From_JPEG_to_TXT">From JPEG to TXT</h3> <span class="anchor" id="line-159"></span><span class="anchor" id="line-160"></span><p class="line874">The following is an anecdotal example: this script successfully converted some JPEG screenshots of an internet message board into usable plain-text files. <span class="anchor" id="line-161"></span><span class="anchor" id="line-162"></span><span class="anchor" id="line-163"></span><span class="anchor" id="line-164"></span><span class="anchor" id="line-165"></span><span class="anchor" id="line-166"></span><span class="anchor" id="line-167"></span><span class="anchor" id="line-168"></span><span class="anchor" id="line-169"></span><span class="anchor" id="line-170"></span><span class="anchor" id="line-171"></span><span class="anchor" id="line-172"></span><span class="anchor" id="line-173"></span><span class="anchor" id="line-174"></span><span class="anchor" id="line-175"></span><span class="anchor" id="line-176"></span><span class="anchor" id="line-177"></span><span class="anchor" id="line-178"></span><span class="anchor" id="line-179"></span><span class="anchor" id="line-180"></span><span class="anchor" id="line-181"></span><span class="anchor" id="line-182"></span><span class="anchor" id="line-183"></span><span class="anchor" id="line-184"></span><pre><span class="anchor" id="line-1-1"></span>#!/bin/bash
<span class="anchor" id="line-2-1"></span>if [ "$1" ] && [ -e "$1" ]; then
<span class="anchor" id="line-3-1"></span>    DEST="$2"
<span class="anchor" id="line-4-1"></span>    if [ ! "$DEST" ]; then
<span class="anchor" id="line-5-1"></span>        DEST="${1%.*}.txt"   # default: image name with a .txt extension
<span class="anchor" id="line-6-1"></span>        if [ -e "$DEST" ]; then
<span class="anchor" id="line-7-1"></span>            echo "$DEST already exists; please provide a new textfile name" >&2
<span class="anchor" id="line-8-1"></span>            exit 1
<span class="anchor" id="line-9-1"></span>        fi
<span class="anchor" id="line-10-1"></span>    fi
<span class="anchor" id="line-11-1"></span>    TMPF=$(mktemp XXXXXXXX.tif)   # created after the checks so an early exit leaves no stray file
<span class="anchor" id="line-12-1"></span>    /usr/bin/convert "$1" -colorspace Gray -depth 8 -resample 200x200 "$TMPF" \
<span class="anchor" id="line-13-1"></span>        && /usr/bin/cuneiform -o "$DEST" "$TMPF"
<span class="anchor" id="line-14-1"></span>    EX=$?
<span class="anchor" id="line-15"></span>    /bin/rm -f "$TMPF"
<span class="anchor" id="line-16"></span>    [ "$EX" -eq 0 ] && [ "$TERM" ] && echo "created $DEST"
<span class="anchor" id="line-17"></span>    exit "$EX"
<span class="anchor" id="line-18"></span>else
<span class="anchor" id="line-19"></span>    echo "Usage: $0 imagefile [textfile]" >&2
<span class="anchor" id="line-20"></span>    echo "  creates a plain text file with the text found in imagefile" >&2
<span class="anchor" id="line-21"></span>    exit 1
<span class="anchor" id="line-22"></span>fi</pre><span class="anchor" id="line-185"></span><span class="anchor" id="line-186"></span><span class="anchor" id="line-187"></span><p class="line867"> <h1 id="OCR_on_a_Multi_Page_PDF">OCR on a Multi Page PDF</h1> <span class="anchor" id="line-188"></span><p class="line874">If you have a multi-page PDF file and want to make it searchable, use one of the following methods. <span class="anchor" id="line-189"></span><span class="anchor" id="line-190"></span><p class="line867"> <h2 id="gscan2pdf">gscan2pdf</h2> <span class="anchor" id="line-191"></span><p class="line874">This is probably the easiest method. gscan2pdf is a graphical tool that lets you not only scan documents but also import existing files and perform OCR on them.
<span class="anchor" id="line-192"></span><span class="anchor" id="line-193"></span><ul><li>Install gscan2pdf, either from the Ubuntu Software Center or by running this command in a terminal: <span class="anchor" id="line-194"></span><span class="anchor" id="line-195"></span><p class="line891"><tt>$ sudo apt-get install gscan2pdf</tt> <span class="anchor" id="line-196"></span><span class="anchor" id="line-197"></span></li><li class="gap">Run gscan2pdf <span class="anchor" id="line-198"></span></li><li>Import the PDF (Ctrl+i) <span class="anchor" id="line-199"></span></li><li><p class="line862">Choose Tools→OCR <span class="anchor" id="line-200"></span></li><li>Save (Ctrl+s) <span class="anchor" id="line-201"></span><span class="anchor" id="line-202"></span></li></ul><p class="line874">It may take some time if you have many pages. This is normal. <span class="anchor" id="line-203"></span><span class="anchor" id="line-204"></span><p class="line867"> <h2 id="OCRFeeder-1">OCRFeeder</h2> <span class="anchor" id="line-205"></span><p class="line874">OCRFeeder can do this too. Sadly it doesn't seem to work very well yet. <span class="anchor" id="line-206"></span><span class="anchor" id="line-207"></span><span class="anchor" id="line-208"></span><p class="line867"> <h2 id="pdfocr">pdfocr</h2> <span class="anchor" id="line-209"></span><p class="line862">pdfocr is a script that performs OCR on multi-page PDF files and embeds the recognized text back into the PDF file as a searchable text layer. It can use either tesseract or cuneiform as the OCR engine. The script itself can be obtained from <a class="http" href="http://github.com/gkovacs/pdfocr/raw/master/pdfocr.rb">GitHub</a> or from the <a class="http" href="http://launchpad.net/~gezakovacs/+archive/pdfocr">PPA</a>.
To use, simply enter this command in a terminal: <span class="anchor" id="line-210"></span><span class="anchor" id="line-211"></span><p class="line867"><tt>pdfocr -i input.pdf -o output.pdf</tt> <span class="anchor" id="line-212"></span><span class="anchor" id="line-213"></span><p class="line867"> <h1 id="Further_Reading">Further Reading</h1> <span class="anchor" id="line-214"></span><span class="anchor" id="line-215"></span><ul><li><p class="line891"><a class="http" href="http://www.linuxjournal.com/article/9676">A LinuxJournal article on Tesseract</a> <span class="anchor" id="line-216"></span><span class="anchor" id="line-217"></span></li></ul><p class="line867"><hr /><p class="line874"> <span class="anchor" id="line-218"></span><a href="/community/CategoryGraphicsApplications">CategoryGraphicsApplications</a> <span class="anchor" id="line-219"></span><span class="anchor" id="bottom"></span></div><div id="pagebottom"></div> </div> </div> <p id="pageinfo" class="info" lang="en" dir="ltr">OCR (last edited 2015-03-31 12:07:20 by <span title="??? @ p4FC25736.dip0.t-ipconnect.de[79.194.87.54]">p4FC25736</span>)</p> </div> <div id="footer"> <p> The material on this wiki is available under a free license, see <a href="https://help.ubuntu.com/community/ License">Copyright / License</a> for details<br /><b>You</b> can contribute to this wiki, see <a href="https://help.ubuntu.com/community/WikiGuide">Wiki Guide</a> for details </p> </div> </div></body> </html>