<html><head> <title>Europarl Parallel Corpus</title> </head> <body bgcolor="White" text="Black"><FONT FACE="Georgia"> <center> <h2>European Parliament Proceedings Parallel Corpus 1996-2011</h2> </CENTER> <hr size=1 noshade> <P> For a detailed description of this corpus, please read:<P> <B>Europarl: A Parallel Corpus for Statistical Machine Translation</B>, <I>Philipp Koehn</I>, MT Summit 2005, <a href="http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl-mtsummit05.pdf">pdf</a>. <P> Please cite the paper if you use this corpus in your work. See also the extended (but earlier) version of the report (<a href="http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl.ps">ps</a>, <a href="http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl.pdf">pdf</a>). <P> The Europarl parallel corpus is extracted from the proceedings of the <A HREF="http://www3.europarl.eu.int/omk/omnsapir.so/calendar?APP=CRE&LANGUE=EN">European Parliament</A>. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.<P> The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence-aligned the data using a tool based on the <A HREF="http://people.csail.mit.edu/koehn/publications/de-news/church_and_gale.ps">Church and Gale algorithm</A>.<P> <P> <hr size=1 noshade> <P> <h3>Release v7</h3> On 15 May 2012 we released a further expanded and improved version of the corpus. Previous versions are available <a href="archives.html">here</a>. 
The corpus is released as a source release, containing the document files and a sentence aligner, and as parallel corpora for language pairs that include English. <P> <b>Changes since v6</b> <ul><li>added 01/2011 - 11/2011 data, now up to around 60 million words per language <li>further refined preprocessing, cleaning </ul> All formats contain document (&lt;CHAPTER id&gt;), speaker (&lt;SPEAKER id name language&gt;), and paragraph (&lt;P&gt;) mark-up, each on a separate line. The data is stored in one file per day, and in smaller units for newer data. <P> Some documents have the SPEAKER tag attribute LANGUAGE, which indicates the language the original speaker was using. <P> To use the parallel corpora with tools like GIZA++, you should: <UL><LI> tokenize the text (required) <LI> lowercase the text (recommended) <LI> strip empty lines and their correspondences (required) <LI> remove lines with XML tags (starting with "&lt;") (required) </UL> <B>Download</B> <UL><LI><A HREF="v7/europarl.tgz">source release</A> (text files), 1.5 GB <LI><A HREF="v7/tools.tgz">tools</A> (preprocessing tools and sentence aligner only), 8.6 KB <LI><A HREF="v7/bg-en.tgz">parallel corpus Bulgarian-English</A>, 41 MB, 01/2007-11/2011 <LI><A HREF="v7/cs-en.tgz">parallel corpus Czech-English</A>, 60 MB, 01/2007-11/2011 <LI><A HREF="v7/da-en.tgz">parallel corpus Danish-English</A>, 179 MB, 04/1996-11/2011 <LI><A HREF="v7/de-en.tgz">parallel corpus German-English</A>, 189 MB, 04/1996-11/2011 <LI><A HREF="v7/el-en.tgz">parallel corpus Greek-English</A>, 145 MB, 04/1996-11/2011 <LI><A HREF="v7/es-en.tgz">parallel corpus Spanish-English</A>, 187 MB, 04/1996-11/2011 <LI><A HREF="v7/et-en.tgz">parallel corpus Estonian-English</A>, 57 MB, 01/2007-11/2011 <LI><A HREF="v7/fi-en.tgz">parallel corpus Finnish-English</A>, 179 MB, 01/1997-11/2011 <LI><A HREF="v7/fr-en.tgz">parallel corpus French-English</A>, 194 MB, 04/1996-11/2011 <LI><A HREF="v7/hu-en.tgz">parallel corpus Hungarian-English</A>, 59 MB, 01/2007-11/2011 <LI><A 
HREF="v7/it-en.tgz">parallel corpus Italian-English</A>, 188 MB, 04/1996-11/2011 <LI><A HREF="v7/lt-en.tgz">parallel corpus Lithuanian-English</A>, 57 MB, 01/2007-11/2011 <LI><A HREF="v7/lv-en.tgz">parallel corpus Latvian-English</A>, 57 MB, 01/2007-11/2011 <LI><A HREF="v7/nl-en.tgz">parallel corpus Dutch-English</A>, 190 MB, 04/1996-11/2011 <LI><A HREF="v7/pl-en.tgz">parallel corpus Polish-English</A>, 59 MB, 01/2007-11/2011 <LI><A HREF="v7/pt-en.tgz">parallel corpus Portuguese-English</A>, 189 MB, 04/1996-11/2011 <LI><A HREF="v7/ro-en.tgz">parallel corpus Romanian-English</A>, 37 MB, 01/2007-11/2011 <LI><A HREF="v7/sk-en.tgz">parallel corpus Slovak-English</A>, 59 MB, 01/2007-11/2011 <LI><A HREF="v7/sl-en.tgz">parallel corpus Slovene-English</A>, 54 MB, 01/2007-11/2011 <LI><A HREF="v7/sv-en.tgz">parallel corpus Swedish-English</A>, 171 MB, 01/1997-11/2011 </UL> <P> <hr size=1 noshade> <P> <H3>Size of the Corpus</H3> Sizes for single-language data after removing XML.<P> <table border=1 cellpadding=5 style="text-align: center; font-family: georgia"> <TR><TH>Language</TH><TH>Sentences</TH><TH>Words</TH></TR> <TR><TD>Bulgarian</TD> <TD> 411,636</TD><TD>-</TD></TR> <TR><TD>Czech</TD> <TD> 668,595</TD><TD>13,195,311</TD></TR> <TR><TD>Danish</TD> <TD>2,323,099</TD><TD>47,761,381</TD></TR> <TR><TD>German</TD> <TD>2,176,537</TD><TD>47,236,849</TD></TR> <TR><TD>Greek</TD> <TD>1,517,141</TD><TD>-</TD></TR> <TR><TD>English</TD> <TD>2,218,201</TD><TD>53,974,751</TD></TR> <TR><TD>Spanish</TD> <TD>2,123,835</TD><TD>54,806,927</TD></TR> <TR><TD>Estonian</TD> <TD> 692,210</TD><TD>11,358,009</TD></TR> <TR><TD>Finnish</TD> <TD>2,119,515</TD><TD>33,708,706</TD></TR> <TR><TD>French</TD> <TD>2,190,579</TD><TD>54,202,850</TD></TR> <TR><TD>Hungarian</TD> <TD> 658,824</TD><TD>12,606,986</TD></TR> <TR><TD>Italian</TD> <TD>2,081,669</TD><TD>50,259,169</TD></TR> <TR><TD>Lithuanian</TD><TD> 678,665</TD><TD>11,512,131</TD></TR> <TR><TD>Latvian</TD> <TD> 666,026</TD><TD>12,085,228</TD></TR> 
<TR><TD>Dutch</TD> <TD>2,333,816</TD><TD>53,487,257</TD></TR> <TR><TD>Polish</TD> <TD> 387,490</TD><TD> 7,087,016</TD></TR> <TR><TD>Portuguese</TD><TD>2,121,889</TD><TD>52,300,149</TD></TR> <TR><TD>Romanian</TD> <TD> 402,904</TD><TD> 9,663,544</TD></TR> <TR><TD>Slovak</TD> <TD> 674,359</TD><TD>13,116,301</TD></TR> <TR><TD>Slovene</TD> <TD> 634,488</TD><TD>12,665,974</TD></TR> <TR><TD>Swedish</TD> <TD>2,241,386</TD><TD>45,665,947</TD></TR> </table> <P> Sizes for parallel corpora after sentence aligning and removing XML.<P> <table border=1 cellpadding=5 style="text-align: center; font-family: georgia"> <TR><TH>Parallel Corpus (L1-L2)</TH><TH>Sentences</TH><TH>L1 Words</TH><TH>English Words</TH></TR> <TR><TD>Bulgarian-English </TD><TD> 406,934</TD><TD> -</TD><TD> 9,886,291</TD></TR> <TR><TD>Czech-English </TD><TD> 646,605</TD><TD>12,999,455</TD><TD>15,625,264</TD></TR> <TR><TD>Danish-English </TD><TD>1,968,800</TD><TD>44,654,417</TD><TD>48,574,988</TD></TR> <TR><TD>German-English </TD><TD>1,920,209</TD><TD>44,548,491</TD><TD>47,818,827</TD></TR> <TR><TD>Greek-English </TD><TD>1,235,976</TD><TD> -</TD><TD>31,929,703</TD></TR> <TR><TD>Spanish-English </TD><TD>1,965,734</TD><TD>51,575,748</TD><TD>49,093,806</TD></TR> <TR><TD>Estonian-English </TD><TD> 651,746</TD><TD>11,214,221</TD><TD>15,685,733</TD></TR> <TR><TD>Finnish-English </TD><TD>1,924,942</TD><TD>32,266,343</TD><TD>47,460,063</TD></TR> <TR><TD>French-English </TD><TD>2,007,723</TD><TD>51,388,643</TD><TD>50,196,035</TD></TR> <TR><TD>Hungarian-English </TD><TD> 624,934</TD><TD>12,420,276</TD><TD>15,096,358</TD></TR> <TR><TD>Italian-English </TD><TD>1,909,115</TD><TD>47,402,927</TD><TD>49,666,692</TD></TR> <TR><TD>Lithuanian-English</TD><TD> 635,146</TD><TD>11,294,690</TD><TD>15,341,983</TD></TR> <TR><TD>Latvian-English </TD><TD> 637,599</TD><TD>11,928,716</TD><TD>15,411,980</TD></TR> <TR><TD>Dutch-English </TD><TD>1,997,775</TD><TD>50,602,994</TD><TD>49,469,373</TD></TR> <TR><TD>Polish-English </TD><TD> 
632,565</TD><TD>12,815,544</TD><TD>15,268,824</TD></TR> <TR><TD>Portuguese-English</TD><TD>1,960,407</TD><TD>49,147,826</TD><TD>49,216,896</TD></TR> <TR><TD>Romanian-English </TD><TD> 399,375</TD><TD> 9,628,010</TD><TD> 9,710,331</TD></TR> <TR><TD>Slovak-English </TD><TD> 640,715</TD><TD>12,942,434</TD><TD>15,442,233</TD></TR> <TR><TD>Slovene-English </TD><TD> 623,490</TD><TD>12,525,644</TD><TD>15,021,497</TD></TR> <TR><TD>Swedish-English </TD><TD>1,862,234</TD><TD>41,508,712</TD><TD>45,703,795</TD></TR> </table> <P> <hr size=1 noshade> <p> <H3>Test Sets</H3> Several test sets have been released for the Europarl corpus. In general, the Q4/2000 portion of the data (2000-10 to 2000-12) should be reserved for testing. All released test sets have been selected from this quarter. The shared tasks for the <a href="../wmt06/shared-task/">2006</a> and <a href = "../wmt07/shared-task.html">2007</a> ACL Workshops on Statistical Machine Translation provide test sets from the Europarl corpus. <P> The original common test set from the Koehn/Och/Marcu ACL 2003 paper is available in the <a href="archives.html">archives</a>. <P> Extended versions of these test sets are available in the <a href="http://matrix.statmt.org/test_sets/list">Evaluation Matrix</a> of the EuroMatrix project. <H3>Known Bugs</H3> <UL> <LI>Some special HTML entities and noisy characters are not removed from the data. <LI>Some recent Greek data has only parts of transcripts in the files. </UL> <H3>Terms of Use</H3> We are not aware of any copyright restrictions on the material. If you use this data in your research, please contact <A HREF="mailto:phi@jhu.edu">phi@jhu.edu</A>. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 to 2000-12) for consistency in reporting research results on this data. 
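<P> The preparation steps listed under Release v7 for tools like GIZA++ (tokenize, lowercase, strip empty lines together with their counterparts, remove XML lines) can be sketched in Python as follows. This is a minimal illustration, not the code shipped in tools.tgz: the function names <code>tokenize</code> and <code>prepare_pair</code> are hypothetical, and the regular-expression tokenizer is an assumed stand-in for a proper language-aware tokenizer.

```python
import re

# Assumed stand-in tokenizer: runs of word characters, or single
# punctuation marks. A real pipeline would use a language-aware tokenizer.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(line):
    """Separate words and punctuation with single spaces."""
    return " ".join(_TOKEN.findall(line))


def prepare_pair(src_lines, tgt_lines, lowercase=True):
    """Clean two line-aligned lists and return them, still line-aligned."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("parallel inputs must have the same number of lines")
    src_out, tgt_out = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Drop the whole pair if either side is empty, so that the
        # remaining lines keep corresponding to each other.
        if not src or not tgt:
            continue
        # Drop mark-up lines (CHAPTER, SPEAKER and P tags start with "<").
        if src.startswith("<") or tgt.startswith("<"):
            continue
        src, tgt = tokenize(src), tokenize(tgt)
        if lowercase:
            src, tgt = src.lower(), tgt.lower()
        src_out.append(src)
        tgt_out.append(tgt)
    return src_out, tgt_out
```

Both returned lists always have the same length, because a pair is dropped as a whole whenever either side is empty or is a mark-up line.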
<H3>Acknowledgments</H3> The work was in part supported by the <A HREF="http://www.euromatrixplus.net/">EuroMatrixPlus</A> project funded by the European Commission (7th Framework Programme). <P> <hr size=1 noshade> <p> </body></html>