<!DOCTYPE html> <html lang="en"> <head> <title>CiteSeerX Software | CiteSeerX</title> <link rel="shortcut icon" href="#"> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <!-- Bootstrap --> <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css"> <link href="https://fonts.googleapis.com/css?family=Roboto:400,700" rel="stylesheet"> <link href="https://fonts.googleapis.com/css?family=Open+Sans" rel="stylesheet"> <link type="text/css" href="/resources/css/navigation.css" rel="stylesheet"> <link type="text/css" href="/resources/css/textstyles.css" rel="stylesheet"> <link type="text/css" rel="stylesheet" href="/resources/css/footer-distributed-with-address-and-phones.css"> <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css"> <link type="text/css" href="/resources/css/content.css" rel="stylesheet"> <!-- jQuery (necessary for Bootstrap's JavaScript plugins) --> <script src="/js/jquery-3.2.1.min.js" type="text/javascript"></script> <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script> <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries --> <!--[if lt IE 9]> <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script> <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> <![endif]--> <!-- JQuery --> <link rel="stylesheet" href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css"> <script src="https://code.jquery.com/jquery-1.12.4.js"></script> <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script> </head> <body> <div class="DMCA psunav"> <!-- DMCA --> <ul> <li class="li-2"><a href="https://www.psu.edu/copyright-information" target="_blank"> <div class="colorchange">DMCA</div> </a></li> </ul> </div> <div class="container header"> <header> <a 
name="top"></a> <!-- navigation tabs --> <a href="https://citeseerx.ist.psu.edu"> <img src="/resources/img/citeseerx.png" title="CiteSeerX" class="citeseerx_logo"/> </a> <div class="psunav"> <ul> <!-- About --> <li class="li-1"><a href="../index.html"> <div class="colorchange">About</div> </a></li> <!-- People Dropdown --> <li class="li-2"> <div class="colorchange"> <div class="dropbtn" id="dropbtn-people"> People <i class="fa fa-caret-down"></i> <div class="dropmenu-content" id="dropmenu-content-people"> <a href="../people/team.html">Team</a> <a href="../people/collaborators.html">Collaborators</a> </div> </div> </div> </li> <!-- Publications --> <li class="li-2"><a href="http://clgiles.ist.psu.edu/citeseer-related.pdf" target="_blank"> <div class="colorchange">Publications</div> </a></li> <!-- Data Dropdown --> <li class="li-2"> <div class="colorchange"> <div class="dropbtn" id="dropbtn-downloads"> Downloads <i class="fa fa-caret-down"></i> <div class="dropmenu-content" id="dropmenu-content-downloads"> <a href="../downloads/data.html">Data</a> <a href="../downloads/software.html">Software</a> </div> </div> </div> </li> <!-- Contact Form --> <li class="li-2"> <div class="colorchange"> <div class="dropbtn" id="dropbtn-contact"> Contact <i class="fa fa-caret-down"></i> <div class="dropmenu-content" id="dropmenu-content-contact"> <a href="../contact/contact.html">Contact Us</a> <a href="http://csxcrawlweb01.ist.psu.edu/submit_pub/" target="_blank">Crawler</a> </div> </div> </div> </li> <!-- Donate --> <li class="li-2"><a href="http://www.givenow.psu.edu/CiteseerxFund" target="_blank"> <div class="colorchange">Donate</div> </a></li> </ul> </div> </header> </div> <div> <div class="pagetitle"> <div class="container"> <h1 class="titletext">CiteSeerX Software</h1> </div> </div> </div> <div class="pagebody"> <div class="container bodytext"> <div class="row"> <section class="col-sm-12 col-md-12 col-lg-12 col-x1-12"> <div class="ptext"> <p>The purpose of this page is to 
maintain a list of tools, publications, and Web services that are related to extracting information from scholarly documents so as to provide a point of reference for anyone interested in exploring this topic. The main focus is on header (title, authors, institutions, venue, etc.) and citation metadata extraction, though other types of information extraction are covered as well.</p> <p>This page was created and is maintained by <a href="http://www.personal.psu.edu/kiw5209">Kyle Williams</a> and <a href="http://www.personal.psu.edu/szr163">Sagnik Ray Choudhury</a>.</p> <p>For changes and additions to this page, please contact kwilliams (at) psu (dot) edu or sagnik (at) psu (dot) edu.</p> <h3>Contents</h3> <ul> <li> <a href="#Extraction Tools">Extraction Tools</a> <ul> <li> <a href="#Header Extraction">Header Extraction</a> </li> <li> <a href="#Citation Extraction">Citation Extraction</a> </li> <li> <a href="#Other Extraction">Other Extraction</a> </li> </ul> </li> <li> <a href="#Publications">Publications</a> <ul> <li> <a href="#Header Extraction">Header Extraction</a> </li> <li> <a href="#Citation Extraction">Citation Extraction</a> </li> <li> <a href="#Other Extraction">Other Extraction</a> </li> <li> <a href="#Comparisons">Comparisons</a> </li> <li> <a href="#Datasets">Datasets</a> </li> </ul> </li> <li> <a href="#Services">Services</a> <ul> <li> <a href="#Web Services">Web Services</a> </li> </ul> </li> </ul> <h2 class="section-header"><a name="Extraction Tools" class="sie-section-header">Extraction Tools</a></h2> <a href="#top">[Top]</a> <p>These are publicly available tools for information extraction.</p> <h3><a name="Header Extraction" class="sie-section-header">Header Extraction</a></h3> <a href="#top">[Top]</a> <p>This list is based on Lipinski et al. (JCDL 2013). 
A big thanks to the authors for identifying all of these tools.</p> <ul> <li> <u>SVM Header Parse</u> <br> <a href="http://sourceforge.net/projects/citeseerx/">http://sourceforge.net/projects/citeseerx/</a> <br> License: Apache License v2.0 <br> <i>SVM Header Parse is a tool for metadata extraction based on SVMs and is part of the SeerSuite package. It was developed at the Pennsylvania State University.</i> </li> <li> <u>Grobid</u> <br> <a href="https://github.com/kermitt2/grobid">https://github.com/kermitt2/grobid</a> <br> License: Apache License v2.0 <br> <i>Grobid performs header and citation extraction using CRFs</i> </li> <li> <u>ParsCit</u> <br> <a href="http://aye.comp.nus.edu.sg/parsCit/">http://aye.comp.nus.edu.sg/parsCit/</a> <br> License: GNU Lesser General Public License <br> <i>ParsCit performs header and citation extraction using CRFs</i> </li> <li> <u>Docear's PDF Inspector</u> <br> <a href="http://www.docear.org/">http://www.docear.org/</a> <br> License: Apache License v2.0, GPLv2, GPLv3 <br> <i>Extracts document metadata based on stylistic analysis</i> </li> <li> <u>Mendeley</u> <br> <a href="http://www.mendeley.com/">http://www.mendeley.com/</a> <br> License: Commercial <br> <i>Mendeley is a software package for managing collections of academic documents; it also performs automatic extraction of metadata using SVMs.</i> </li> <li> <u>PDFMeat</u> <br> <a href="http://code.google.com/p/pdfmeat/">http://code.google.com/p/pdfmeat/</a> <br> License: GPLv2 <br> <i>Extracts appropriate terms from a paper and then queries Google Scholar to retrieve the metadata.</i> </li> <li> <u>SciPlore Xtract</u> <br> <a href="http://sciplore.org/">http://sciplore.org/</a> <br> License: Unsure <br> <i>Extracts header information based on a stylistic analysis of XML.</i> </li> </ul> <h3><a name="Citation Extraction" class="sie-section-header">Citation Extraction</a></h3> <a href="#top">[Top]</a> <p></p> <p> </p> <ul> <li> <u>ParsCit</u> <br> <a 
href="http://aye.comp.nus.edu.sg/parsCit/">http://aye.comp.nus.edu.sg/parsCit/</a> <br> License: GNU Lesser General Public License <br> <i>ParsCit performs header and citation extraction using CRFs</i> </li> <li> <u>HMM Metadata Extractor</u> <br> <a href="http://gales.cdlib.org/~egh/hmm-citation-extractor/">http://gales.cdlib.org/~egh/hmm-citation-extractor/</a> <br> License: Free for use <br> <i>A citation parsing tool based on Hidden Markov Models</i> </li> </ul> <h3><a name="Other Extraction" class="sie-section-header">Other Extraction</a></h3> <a href="#top">[Top]</a> <p></p> <p> </p> <ul> <li> <u>TableSeer</u> <br> <a href="http://sourceforge.net/projects/tableseer/">http://sourceforge.net/projects/tableseer/</a> <br> License: Unspecified, but open source <br> <i>Automatically extracts tables and table data</i> </li> </ul> <ul> <li> <u>Pdffigures</u> <br> <a href="https://github.com/allenai/pdffigures2">pdffigures project by AllenAI</a> <br> License: Apache <br> <i>Automatically extracts figures and tables from PDF documents</i> </li> </ul> <h2 class="section-header"><a name="Publications" class="sie-section-header">Publications</a></h2> <a href="#top">[Top]</a> <p></p> <p>A list of publications related to metadata extraction, grouped by the type of extraction performed. I have NOT read all of these papers, but this might be a good place to start for someone interested in this topic. The references are also in different formats since they come from different sources.</p> <h3><a name="Header Extraction" class="sie-section-header">Header Extraction</a></h3> <a href="#top">[Top]</a> <ul> <li> GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. P. Lopez. Proceedings of the 13th European Conference on Digital Library (ECDL), Corfu, Greece, 2009. </li> <li> J. Beel, B. Gipp, A. Shaker, and N. 
Friedrich, SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size), in Research and Advanced Technology for Digital Libraries: Proceedings of the 14th European Conference on Digital Libraries (ECDL'10), Glasgow, UK, 2010. </li> <li> Huy Hoang Nhat Do, Muthu Kumar Chandrasekaran, Philip S. Cho, and Min-Yen Kan. (2013) Extracting and Matching Authors and Affiliations in Scholarly Documents. In Proceedings of the Thirteenth Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL'13), Indianapolis: ACM. 2013. </li> <li> Han, H., Giles, C., Manavoglu, E., Zha, H., Zhang, Z., Fox, E. (2003). Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries. </li> <li> Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan (2010) Logical Structure Recovery in Scholarly Articles with Rich Document Features. International Journal of Digital Library Systems (IJDLS), 1(4), 1-23. </li> <li> Cui, Binge. "Scientific literature metadata extraction based on HMM." Cooperative Design, Visualization, and Engineering. Springer Berlin Heidelberg, 2009. 64-68. </li> </ul> <h3><a name="Citation Extraction" class="sie-section-header">Citation Extraction</a></h3> <a href="#top">[Top]</a> <p></p> <p> </p> <ul> <li> Erik Hetzner. 2008. A simple method for citation metadata extraction using hidden Markov models. In Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries (JCDL '08). ACM, New York, NY, USA, 280-284. </li> <li> Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC 08), Marrakesh, Morocco, May. </li> <li> Guido Sautter and Klemens Bohm. 2012. Improved bibliographic reference parsing based on repeated patterns. 
In Proceedings of the Second international conference on Theory and Practice of Digital Libraries (TPDL'12), Panayiotis Zaphiris, George Buchanan, Edie Rasmussen, and Fernando Loizides (Eds.). Springer-Verlag, Berlin, Heidelberg, 370-382. </li> <li> Eli Cortez, Altigran S. da Silva, Marcos Andre Goncalves, Filipe Mesquita, Edleno S. de Moura, FLUX-CIM: flexible unsupervised extraction of citation metadata, Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada </li> </ul> <h3><a name="Other Extraction" class="sie-section-header">Other Extraction</a></h3> <a href="#top">[Top]</a> <p></p> <ul> <li> Khabsa, M., Treeratpituk, P., and Giles, C. L. (2012). AckSeer: A Repository and Search Engine for Automatically Extracted Acknowledgments from Digital Libraries, 185-194. </li> <li> Liu, Y., Bai, K., Mitra, P., and Giles, C. (2007). Tableseer: automatic table metadata extraction and searching in digital libraries. Proceedings of the 7th annual international ACM/IEEE joint conference on Digital libraries - JCDL '07, 91-10. </li> <li> Sagnik Ray Choudhury, Suppawong Tuarob, Prasenjit Mitra, Lior Rokach, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, and Clyde Lee Giles. 2013. A figure search engine architecture for a chemistry digital library. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries (JCDL '13). ACM, New York, NY, USA, 369-370. </li> <li> Sagnik Ray Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, C. Lee Giles: Figure Metadata Extraction from Digital Documents. ICDAR 2013: 135-139 </li> </ul> <h3><a name="Comparisons" class="sie-section-header">Comparisons</a></h3> <a href="#top">[Top]</a> <p></p> <p> </p> <ul> <li> M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. 
Gipp, Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents, in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, USA, 2013. </li> </ul> <h3><a name="Datasets" class="sie-section-header">Datasets</a></h3> <a href="#top">[Top]</a> <p></p> <p> </p> <ul> <li> Anzaroot, S., and McCallum, A. (2013). A New Dataset for Fine-Grained Citation Field Extraction. ICML Workshop on Peer Reviewing and Publishing Models, 28. </li> </ul> <h2 class="section-header"><a name="Services" class="sie-section-header">Services</a></h2> <a href="#top">[Top]</a> <p></p> <p></p> <h3><a name="Web Services" class="sie-section-header">Web Services</a></h3> <a href="#top">[Top]</a> <p>These are web services that you can use for extracting metadata without running any software locally </p> <ul> <li> <u>CiteSeerExtractor</u> <br> <a href="http://citeseerextractor.ist.psu.edu:8080">http://citeseerextractor.ist.psu.edu:8080</a> <br> License: Apache License v2.0 <br> <i>Provides a RESTful API to the tools used for extraction in CiteSeerX</i> </li> <li> <u>ParsCit Web Service</u> <br> <a href="http://aye.comp.nus.edu.sg/parsCit/#ws">http://aye.comp.nus.edu.sg/parsCit/#ws</a> <br> License: N/A <br> <i>A Web service for parsing citations. 
Also provides an online demo</i> </li> <li> <u>FreeCite</u> <br> <a href="http://freecite.library.brown.edu/">http://freecite.library.brown.edu/</a> <br> License: MIT License <br> <i>A Web service for parsing citations based on ParsCit</i> </li> </ul> </section> </div> </div> </div> <footer class="footer-distributed"> <div class="footer-left"> <img src="/resources/img/footer_logo.png" width="510" height="150"> <p class="footer-links"> <a href="../privacy-policy/privacy-policy.html">Privacy Policy</a> · <a href="../help/help.html">Help</a> · <a href="https://github.com/SeerLabs/CiteSeerX">Source</a> · <a href="../contact/contact.html">Contact Us</a> </p> <p class="footer-company-name">Developed at and hosted by <a href="http://ist.psu.edu/">The College of Information Sciences and Technology</a></p><br/> <p class="footer-company-name"><a href="https://www.psu.edu/">Pennsylvania State University</a> © 2007-2016 </p> </div> <div class="footer-center"> <div> <div> <i class="fa fa-map-marker"></i> <p><span>Westgate Building</span> Pennsylvania State University <br/>University Park, PA 16802 </p> </div> <div> <i class="fa fa-phone"></i> <p>+(814) 865 7884</p> </div> </div> </div> </footer> </body> </html>