The OpenAIRE Literature Broker Service for Institutional Repositories
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> <meta name="DOI" content="10.1045/november2015-artini" /> <meta name="description" content="D-Lib Magazine" /> <meta name="keywords" content="OpenAIRE, OpenAIRE Literature Broker Service, Institutional Repositories" /> <link rel="metadata" href="11artini.meta.xml" /> <link rel="metadata" href="../11bib.meta.bib" /> <link rel="metadata" href="../11ris.meta.ris" /> <link href="../../../style/style1.css" rel="stylesheet" type="text/css" /> <title>The OpenAIRE Literature Broker Service for Institutional Repositories</title> </head> <body> <form action="/cgi-bin/search.cgi" method="get"> <table width="100%" border="0" cellpadding="0" cellspacing="0" bgcolor="#2b538e"> <tr> <td><img src="../../../img2/space.gif" alt="" width="10" height="2" /></td></tr> </table> <table width="100%" border="0" cellpadding="0" cellspacing="0"> <tr> <td valign="bottom" colspan="4" align="right" bgcolor="#4078b1"> <table border="0"> <tr> <td align="right" class="search"><img src="../../../img2/search2.gif" alt="" width="51" height="20" align="middle" />Search D-Lib:</td> <td> <input type="text" name="words" value="" size="25" /> </td> <td align="left" valign="middle"> <input type="submit" name="search" value="Go!" 
/> <input type="hidden" name="config" value="htdig" /> <input type="hidden" name="restrict" value="" /> <input type="hidden" name="exclude" value="" /> </td> </tr> </table> </td></tr></table> <table width="100%" border="0" cellpadding="0" cellspacing="0"> <tr> <td valign="bottom" colspan="4"> <table width="100%" border="0" cellpadding="0" cellspacing="0" bgcolor="#e04c1e" id="outer" summary="Main Table"> <tr> <td><img src="../../../img2/space.gif" alt="" width="10" height="1" /></td></tr> </table> <table width="100%" border="0" cellpadding="0" cellspacing="0" bgcolor="#F6F6F6" id="bannertable"> <tr> <td width="830" bgcolor="#4078b1" class="backBannerImage" align="left"><img src="../../../img2/D-Lib-blocks.gif" alt="D-Lib Magazine" width="450" height="100" border="0" /></td> </tr> <tr> <td width="830" bgcolor="#e04c1e"><img src="../../../img2/transparent.gif" alt="spacer" height="1" /></td> </tr> <tr> <td width="830" bgcolor="#eda443" align="left"><img src="../../../img2/magazine.gif" alt="The Magazine of Digital Library Research" width="830" height="24" border="0" /></td> </tr> <tr> <td width="830" bgcolor="#e04c1e"><img src="../../../img2/transparent.gif" alt="spacer" height="1" /></td> </tr> </table> <table width="100%" border="0" cellpadding="0" cellspacing="0" id="navtable"> <tr> <td width="5" height="20" bgcolor="#2b538e"> </td> <td width="24" height="20" bgcolor="#2b538e"><img src="../../../img2/transparent.gif" alt="" width="24" height="20" /></td> <td height="20" align="left" bgcolor="#2b538e" class="navtext" nowrap="nowrap"><a href="../../../dlib.html">HOME</a> | <a href="../../../about.html">ABOUT D-LIB</a> | <a href="../../../contents.html" class="navtext">CURRENT ISSUE</a> | <a href="../../../back.html">ARCHIVE</a> | <a href="../../../author-index.html">INDEXES</a> | <a href="http://www.dlib.org/groups.html">CALENDAR</a> | <a href="../../author-guidelines.html">AUTHOR GUIDELINES</a> | <a 
href="http://www.dlib.org/mailman/listinfo/dlib-subscribers">SUBSCRIBE</a> | <a href="../../letters.html">CONTACT D-LIB</a></td> <td width="5" height="20" bgcolor="#2b538e"> </td> </tr> </table> <table width="100%" border="0" cellpadding="0" cellspacing="0"> <tr> <td width="55" height="1" bgcolor="#e04c1e"><img src="../../../img2/space.gif" alt="transparent image" width="1" height="1" /></td></tr> </table> <!-- CONTENT TABLE --> <table width="100%" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td> <!-- BEGIN MAIN CONTENT TABLE --> <table width="100%" border="0" cellspacing="0" cellpadding="10" bgcolor="#ffffff"> <tr> <td width="10"><img src="../../../img2/space.gif" alt="" width="1" height="1" /></td> <td valign="top"> <h3 class="blue-space">D-Lib Magazine</h3> <p class="blue">November/December 2015<br /> Volume 21, Number 11/12<br /> <a href="../11contents.html">Table of Contents</a> </p> <div class="divider-full"> </div> <h3 class="blue-space">The OpenAIRE Literature Broker Service for Institutional Repositories</h3> <p class="blue"> Michele Artini, Claudio Atzori, Alessia Bardi, Sandro La Bruzzo, Paolo Manghi and Andrea Mannocci<br /> Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo" — CNR, Pisa, Italy<br /> {michele.artini, claudio.atzori, alessia.bardi, sandro.labruzzo, paolo.manghi, andrea.mannocci}@isti.cnr.it <br /><br />DOI: 10.1045/november2015-artini </p> <div class="divider-full"> </div> <p class="blue"><a href="11artini.print.html" class="fc">Printer-friendly Version</a></p> <div class="divider-full"> </div> <!-- Abstract or TOC goes here --> <h3 class="blue">Abstract</h3> <p class="blue">OpenAIRE is the European infrastructure for Open Access scholarly communication. It populates and provides access to a graph of objects relative to publications, datasets, people, organizations, projects, and funders aggregated from a variety of data sources, such as institutional repositories, data archives, journals, and CRIS systems. 
Thanks to infrastructure services, objects in the graph are harmonized to achieve semantic homogeneity, de-duplicated to avoid ambiguities, and enriched with missing properties and/or relationships. OpenAIRE data sources interested in enhancing or incrementing their content may benefit in a number of ways from this graph. This paper presents the high-level architecture behind the realization of an institutional repository Literature Broker Service for OpenAIRE. The Service implements a subscription and notification paradigm supporting institutional repositories willing to: (i) learn about publication objects in OpenAIRE that do not appear in their collection but may be pertinent to it, and (ii) learn about extra properties or relationships relative to publication objects in their collection.</p> <!-- Article goes next --> <div class="divider-full"> </div> <h3>1 Introduction</h3> <p>The OpenAIRE infrastructure [<a href="#5">5</a>] is both a networking and technological infrastructure whose mission is to advocate and monitor the adoption of the European Commission Open Access mandates, and to evaluate the impact of EC funding and National funders. </p> <p>Its networking infrastructure consists of the National Open Access Desks (NOADs), providing OpenAIRE contact points for each of the EC countries. The NOADs monitor and advocate the adoption of Open Access EC policies at the level of the countries, support researchers at the implementation of the <a href="https://www.openaire.eu/ordp/ordp/pilot">EC Data Pilot</a>, and function as a bidirectional communication channel between the Commission, OpenAIRE and the countries.</p> <p>Its technological infrastructure provides services [<a href="#6">6</a>] to monitor funders, project research impact, and track Open Access trends in terms of related publications and datasets. 
To this aim, the services offer functionalities to populate a European (and beyond) graph-like information space that aggregates information about publications, datasets, organizations, persons, projects and several funders (e.g. European Commission, Wellcome Trust, Fundação para a Ciência e a Tecnologia, Australian Research Council) collected from hundreds of online data sources (e.g. publication repositories, dataset repositories, and CRIS systems, journals, publishers). To facilitate the harvesting process as well as interoperability between publication repositories, dataset repositories, and CRIS systems, OpenAIRE has released specific "metadata export guidelines" for the managers of such data sources [<a href="#1">1</a>][<a href="#8">8</a>]. The guidelines describe the expected structure (i.e. fields) and semantics (i.e. vocabularies and formats) of the metadata records, as they should be exposed by the data sources. Their aim is to reach a community consensus on how to homogenize and therefore facilitate exchange of information across scholarly communication data sources in Europe (and beyond, thanks to the synergies with <a href="https://www.coar-repositories.org/">COAR</a>, US SHARE [<a href="#3">3</a>], and UK <a href="https://jisc.ac.uk/">JISC</a>). 
The typologies and the number of data sources currently included in OpenAIRE are summarized in Table 1.</p> <table align="center" border="0" cellpadding="6" cellspacing="0" width="80%"> <tr> <td class="topLeft" align="left" bgcolor="#cccccc"><b>Data Source Typology</b></td> <td class="topLeft" align="left" bgcolor="#cccccc"><b>Number of Data Sources</b></td> <td class="topLeftRight" align="left" bgcolor="#cccccc"><b>Type of Objects</b></td> </tr> <tr> <td class="topLeft" align="left">Journal Platform</td> <td class="topLeft" align="left">5,582</td> <td class="topLeftRight" align="left">Publications and persons</td> </tr> <tr> <td class="topLeft" align="left">Publication Repository</td> <td class="topLeft" align="left">512 (Total)</td> <td class="topLeftRight" align="left" rowspan="4">Publications and persons</td> </tr> <tr> <td class="topLeft" align="right">Institutional</td> <td class="topLeft" align="left">426</td> </tr> <tr> <td class="topLeft" align="right">Thematic</td> <td class="topLeft" align="left">36</td> </tr> <tr> <td class="topLeft" align="right">Other/Unknown</td> <td class="topLeft" align="left">50</td> </tr> <tr> <td class="topLeft" align="left">Data Repository</td> <td class="topLeft" align="left">38</td> <td class="topLeftRight" align="left">Datasets, publications, and persons</td> </tr> <tr> <td class="topLeft" align="left">Aggregator of Publication Repositories</td> <td class="topLeft" align="left">8</td> <td class="topLeftRight" align="left">Publications and persons</td> </tr> <tr> <td class="topLeft" align="left">Aggregator of Data Repositories</td> <td class="topLeft" align="left">1</td> <td class="topLeftRight" align="left">Datasets and persons</td> </tr> <tr> <td class="topLeft" align="left">Aggregator/Publisher of Journals</td> <td class="topLeft" align="left">6</td> <td class="topLeftRight" align="left">Publications, persons, and data sources (i.e. 
journals)</td> </tr> <tr> <td class="topLeft" align="left">Entity Registry (data sources offering authoritative lists of entities)</td> <td class="topLeft" align="left">13</td> <td class="topLeftRight" align="left">Data sources (i.e. publication repositories, data repositories), projects, funders, persons</td> </tr> <tr> <td class="topLeftBottom" align="left">CRIS systems</td> <td class="topLeftBottom" align="left">0 (the first CRIS systems will be aggregated by the end of 2015)</td> <td class="all" align="left">Publications, datasets, projects, persons</td> </tr> </table> <p align="center">Table 1: Data source typologies in the OpenAIRE federation (updated 2015-10-16)</p> <p>The OpenAIRE infrastructure collects metadata records from data sources and derives from them objects and relationships that form the information space graph. For example, a bibliographic metadata record describing a scientific article will yield one publication object and a set of person objects (one per author) related to it. After aggregation, dedicated services clean and enrich the graph as depicted in Figure 1:</p> <ul> <li style="padding-bottom: .5em;">Harmonization (aggregation sub-system): objects of given entities are transformed from their native data models (e.g. physically represented as XML records, HTML responses, CSV files) onto the OpenAIRE data model [<a href="#7">7</a>] in order to build a homogeneous information space.</li> <li style="padding-bottom: .5em;">Merge (de-duplication subsystem): objects of the same entity type are de-duplicated in order to remove ambiguities that may compromise statistics (e.g. 
the same publication may be collected from different repositories as supposedly different objects).</li> <li>Enrichment (information inference sub-system): publication full-texts are collected and processed by text mining services [<a href="#9">9</a>] capable of inferring new property values or new relationships between objects.</li> </ul> <div align="center"> <img src="artini-fig1.png" alt="artini-fig1" width="628" height="313" class="borderGray" /> <p><i>Figure 1: OpenAIRE services high-level architecture</i></p> </div> <p>The enriched information space graph is then made available for programmatic access via several APIs (Search HTTP APIs, OAI-PMH, and soon Linked Open Data) [<a href="#2">2</a>] and for search, browse and statistics consultation via the <a href="http://www.openaire.eu">OpenAIRE</a> portal.</p> <p>Needless to say, data sources that are providing content to OpenAIRE and are interested in augmenting their local collections may benefit in a number of ways from the OpenAIRE information space. This is particularly true for institutional repositories, whose mission is to grow a complete collection of the scientific publications produced by the authors affiliated with the institution they serve. The repository managers' goal is twofold: to bring into the collection all articles produced by affiliated authors, and to make sure that the metadata is as complete and up-to-date as possible.</p> <p>This paper presents the functional requirements driving the realization of a Literature Broker Service for the OpenAIRE infrastructure. The Service implements a subscription and notification mechanism supporting repository managers who are enhancing the content of their repositories by taking advantage of the OpenAIRE information space. 
Using the Service, repository managers can subscribe to special "addition" or "enrichment" events in order to be notified about: (i) publication objects in OpenAIRE that do not appear in their collection but may be pertinent to it, or (ii) properties or relationships relative to publication objects in their collection that do not appear in their local metadata.</p> <p>Section 2 describes two initiatives for the brokerage of publication metadata: the US SHARE Notify and the JISC/EDINA Publications Router. Section 3 presents the OpenAIRE graph data model and its approach to modelling the provenance of the original metadata records and of the inferred properties and relationships. The opportunities of data exchange between the OpenAIRE infrastructure and institutional repositories are discussed in Section 4, where the OpenAIRE Broker Service and its subscription and notification mechanisms are also presented. Section 5 offers conclusions and discusses future work.</p> <div class="divider-full"> </div> <h3>2 Repository Literature Brokers in the literature</h3> <p>The literature deluge makes the reporting and tracking of research results harder for all stakeholders in scholarly communication. Researchers often feel they lose precious time when they are asked to provide detailed metadata information about their articles multiple times at different locations, e.g. the institutional repository and funders. As a consequence, publication metadata can be poor, subject to mistakes, and found at different locations. Publishers own publication metadata information, but a direct interaction with repositories is rare, for both technical (e.g. lack of shared author identifiers) and cost reasons. As a consequence, a number of initiatives started working on approaches favoring single-deposition of publication metadata with subsequent automated delivery to other repositories. 
Some approaches focused on techniques for automatic deposition into a repository (SWORD project [<a href="#4">4</a>]), while others focused on the complementary aspects of how to broker publication information from publishers to relevant/interested repositories. SHARE and JISC/EDINA are two such initiatives, based respectively in the US and UK.</p> <p>SHARE (SHared Access Research Ecosystem) [<a href="#3">3</a>] is a higher education and research community established in 2013 that supports preservation, access and re-use of research results across the United States. The first project set up by SHARE is <a href="http://www.share-research.org/projects/share-notify/">SHARE Notify</a>. A public beta version of the service has been available since April 2015 and counts approximately 600,000 metadata records about articles and datasets from more than 30 providers. SHARE Notify allows interested stakeholders (e.g. researchers, repositories, funders) to subscribe to notifications about research release events such as the publication of an article in a peer-reviewed journal, the deposition of a pre-print version in an institutional repository or the deposition of a dataset. Notifications are distributed as Atom feeds, consumable via common RSS readers, containing metadata summaries about the research results matching the subscription query. The subscription query may include any metadata field of the SHARE schema, also in combination with boolean operators according to the Lucene syntax. While it is possible to subscribe to be notified of events related to one or more data sources (journals, repositories, etc.), it is not yet possible to subscribe to receive events related to authors' institutions, as the majority of metadata records collected by SHARE do not contain explicit authors' affiliations. A JSON API is also available to build dedicated applications by consuming the content collected by SHARE.</p> <p>JISC is a UK initiative that promotes ICT in education and research. 
Built in collaboration with <a href="http://broker.edina.ac.uk/">EDINA</a>, the prototype of the JISC Publications Router offers a notification system (PostCards) and an automatic mechanism based on the SWORD protocol to transfer metadata and files from one location to another. Upon subscription, users can select one or more repositories of interest. The Postcards system will send them emails with a list of metadata records suitable for the selected repositories. The "suitability" of a record with respect to a given repository is automatically calculated by extracting the authors' affiliations from the metadata records. Though the subscription criteria are static, the Postcard system of the JISC Publications Router is very flexible in terms of the format of the notifications: citations in ASCII, bibtex, endnote, and Dublin Core metadata records are only a subset of the formats that a subscriber can choose to receive. Currently, the service is undergoing a full revision to improve its quality and make it a production system in 2016 [<a href="#10">10</a>].</p> <p>The JISC/EDINA Publications Router is more mature than SHARE Notify and its capabilities of detecting authors' affiliations and of sending notifications in different formats are valuable. Metadata records collected by the router can be bulk downloaded via the standard OAI-PMH protocol. On the other hand, the "young" SHARE Notify gives more control to users on their subscription topics and the availability of a JSON API allows IT-skilled users to build applications on top of the SHARE content.</p> <div class="divider-full"> </div> <h3>3 OpenAIRE information space</h3> <p>The OpenAIRE information space data model [<a href="#7">7</a>] (see Figure 2) builds on the OpenAIRE guidelines and is inspired by the <a href="http://www.datacite.org">DataCite</a> and <a href="http://www.eurocris.org/cerif/main-features-cerif">CERIF</a> initiatives. 
Its main entities are: <i>Results</i> (datasets and publications), <i>Persons</i>, <i>Organizations</i>, <i>Funders</i>, <i>Funding Streams</i>, <i>Projects</i>, and <i>Data Sources</i>.</p> <div align="center"> <img src="artini-fig2.png" alt="artini-fig2" width="820" height="443" vspace="10" class="borderGray" /> <p><i>Figure 2: The OpenAIRE data model</i></p> </div> <p><i>Results</i> are intended as the outcome of research activities and may be related to <i>Projects</i>. OpenAIRE supports two kinds of research outcome: <i>Datasets</i> (e.g. experimental data) and <i>Publications</i> (<i>Patents</i> and <i>Software</i> entity types will be introduced soon). As a result of merging equivalent objects collected from separate data sources, a Result object may have several physical manifestations, called <i>instances</i>; instances indicate URL(s) of the payload file, access rights (i.e. open, embargo, restricted, closed), and a relationship to the data source that hosts the file (i.e. provenance). </p> <p><i>Persons</i> are individuals that have one (or more) role(s) in the research domain, such as author of a Result or coordinator of a Project.</p> <p><i>Organizations</i> include companies, research centers or institutions involved as project partners or that are responsible for operating data sources.</p> <p><i>Funders</i> (e.g. European Commission, Wellcome Trust, FCT Portugal, Australian Research Council) are <i>Organizations</i> responsible for a list of Funding Streams (e.g. FP7 and H2020 for the EC), which are strands of investment. </p> <p><i>Funding Streams</i> identify the strands of funding managed by a <i>Funder</i> and can be nested to form a tree of sub-funding streams (e.g. FP7 — SP1 — HEALTH). </p> <p><i>Projects</i> are research projects funded by a <i>Funding Stream</i> of a <i>Funder</i>. 
Investigations and studies conducted in the context of a <i>Project</i> may lead to one or more <i>Results</i>.</p> <p>Finally, OpenAIRE objects are created out of metadata records (e.g. XML, CSV, TXT, XLS, JSON, HTML) collected from various <i>Data Sources</i> (see Table 1). Data Sources are associated with all objects collected from them. </p> <p>In order to give visibility to the original data sources, OpenAIRE keeps provenance information about each piece of aggregated information. Specifically, since de-duplication merges objects collected from different sources and inference enriches such objects, provenance information is kept at the granularity of the object itself, its properties, and its relationships. Object-level provenance records the origin of the object, that is, the data sources from which its different manifestations were collected. Property-level and relationship-level provenance records the origin of a specific property or relationship when inference algorithms derive it, e.g. algorithm name and version. Examples of inferred properties and relationships are:</p> <ul> <li style="padding-bottom: .5em;">Document classification properties: e.g. subjects from a set of standard classification schemes, such as the Dewey Decimal Classification and Medical Subject Headings;</li> <li style="padding-bottom: .5em;">Research initiative properties: e.g. information about the research initiatives, such as the European Grid Infrastructure, related to the research results presented in the publication;</li> <li style="padding-bottom: .5em;">Citation properties: e.g. the list of references cited by the publication, extracted from the bibliography or reference section of the full-text;</li> <li>Relationships to projects, datasets, and similar publications.</li> </ul> <div class="divider-full"> </div> <h3>4 The OpenAIRE Literature Broker Service</h3> <p>The OpenAIRE enriched information graph offers a great opportunity for managers of institutional repositories to improve their collections. 
However, the current access APIs provided by OpenAIRE to third-party services (i.e. HTTP APIs, OAI-PMH) are not intended to support these needs. To this end, the infrastructure is in the process of realizing a Literature Broker Service ("the Service"), learning from other experiences and targeting the specifics of the OpenAIRE setting. The Service will allow repository managers to subscribe to (potential) "enrichment" and (potential) "addition" events occurring in the OpenAIRE information space graph with respect to the scope of their repository. "Enrichment" events identify objects fed by the repository to OpenAIRE that have been enriched by OpenAIRE inference algorithms or de-duplication merges (i.e. merged with objects describing the same publication but with richer or different metadata, for example the open access version of an article). "Addition" events identify objects that enter the OpenAIRE information space graph, are not present in the repository, but may be part of its collection. Repository managers will then receive notifications about the events they are subscribed to, according to various notification strategies. </p> <p>Figure 3 shows how the Service will integrate with the existing OpenAIRE infrastructure. Objects collected from data sources are aggregated, de-duplicated and enriched by inference algorithms to form the OpenAIRE information space graph. Whenever a new information space is generated, the Service explores the graph to detect whether any of the active subscriptions are matched and, if so, notifications are generated and delivered. 
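</p> <p>The matching step described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the Service's implementation: the names <i>Publication</i>, <i>Subscription</i>, <i>matches</i> and <i>notifications</i> are hypothetical, a flat list stands in for the graph, and the set of repositories a publication is "relevant to" is taken as already computed.</p>
<pre>
```python
from dataclasses import dataclass

# All names below are hypothetical; they only mirror the event definitions in the text.


@dataclass(frozen=True)
class Publication:
    pub_id: str
    collected_from: frozenset             # repositories that fed a record for this object
    enriched: bool = False                # True if inference or a merge added metadata
    relevant_to: frozenset = frozenset()  # repositories whose collection it may belong to


@dataclass(frozen=True)
class Subscription:
    repository: str
    event_class: str                      # "enrichment" or "addition"


def matches(sub, pub):
    """Return True if the publication triggers the subscribed event class."""
    if sub.event_class == "enrichment":
        # Object fed by the repository and later enriched by OpenAIRE.
        return sub.repository in pub.collected_from and pub.enriched
    if sub.event_class == "addition":
        # Object possibly pertinent to the repository but absent from it.
        return (sub.repository in pub.relevant_to
                and sub.repository not in pub.collected_from)
    raise ValueError("unknown event class: " + sub.event_class)


def notifications(subscriptions, graph):
    """Scan a graph snapshot and list (repository, event class, publication id)."""
    return [(sub.repository, sub.event_class, pub.pub_id)
            for sub in subscriptions
            for pub in graph
            if matches(sub, pub)]


graph = [
    Publication("pub1", frozenset({"Repo1"}), enriched=True),
    Publication("pub2", frozenset({"Repo2"}), relevant_to=frozenset({"Repo1"})),
    Publication("pub3", frozenset({"Repo1"})),
]
subs = [Subscription("Repo1", "enrichment"), Subscription("Repo1", "addition")]
print(notifications(subs, graph))
```
</pre>
<p>In this sketch, calling <i>notifications</i> on each newly generated information space yields one tuple per matched pair of subscription and publication; here "Repo1" would be told about the enriched "pub1" and the missing "pub2", while "pub3" triggers nothing.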
</p> <div align="center"> <img src="artini-fig3.png" alt="artini-fig3" width="625" height="264" class="borderGray" /> <p><i>Figure 3: The Scholarly Communication Broker Service in the OpenAIRE infrastructure</i></p> </div> <div class="divider-white"> </div> <div class="divider-dot"> </div> <h4>4.1 Subscriptions</h4> <p>Repository managers will be able to subscribe to two main classes of events: "enrichment" and "addition". </p> <p>The first class refers to notifications about publications that (i) were collected from the repository by OpenAIRE and (ii) have been enriched with properties or relationships to other objects by OpenAIRE inference algorithms (e.g. relationships to projects and datasets, citation lists, document classification properties) or by the side effect of being merged with richer publication objects (e.g. DOI of a publication, Open Access version of the publication). The identification of these events is straightforward as it is based on provenance of collection (i.e. it selects publication objects collected from the given repository) and of enrichment (i.e. it further selects objects of the given repository involved in a merge or enriched by inference algorithms). Repository managers will be able to fine-tune their subscriptions based on the typology of enrichment. </p> <p>The second class refers to notifications relative to publications that are "relevant to" the repository at hand, but are not present in the repository. The identification of these events is less trivial as it requires devising a criterion of "publication <i>relevant to</i> a repository". 
Three strategies have been proposed, according to which a publication is relevant to a repository if one of the following chains of relationships exists in the OpenAIRE information graph: </p> <ul> <li style="padding-bottom: .5em;"><i>publication-author-organization-repository</i>: the publication has an author whose organization (affiliation) has a given institutional repository of reference;</li> <li style="padding-bottom: .5em;"><i>publication-author-repository</i>: the publication has an author with a given institutional repository of reference; </li> <li><i>publication-project-organization-repository</i>: the publication has been funded by a project whose participants (beneficiaries of the grant) have a given institutional repository of reference. </li> </ul> <p>Given a publication, if such relationships are found in the graph (whether collected or inferred), the Service may notify the interested repositories of the publication. The challenge is that such relationships are generally not provided by data sources but must be inferred by OpenAIRE services. As a consequence, subscriptions and notifications achieve levels of "correctness" that depend on the level of trust of the inference algorithms and that repository managers can fine-tune at subscription time. </p> <p><b><i>Relationships: publication-author-organization-repository</i></b></p> <p>The most intuitive criterion of publication <i>relevant to</i> a repository is that based on the relationships <i>authorship</i>, i.e. the publication has a given author, <i>author affiliation</i>, i.e. the author of the publication is affiliated with an organization, and <i>organizationRepositoryOfReference</i>, i.e. the institutional repository of reference of all authors of an organization. While OpenAIRE can collect from data sources relationships between publication-author (e.g. publication metadata) and data source-organizations (e.g. 
OpenDOAR returns the list of European publication repositories), affiliation relationships between authors and organizations are generally not available in collected publication metadata. In fact, publication records provided by data sources (according to the OpenAIRE guidelines, but also according to accepted use of Dublin Core) do not provide author affiliation information or, when they do, follow patterns that vary from case to case and are hard to match automatically. </p> <p>The inference service of OpenAIRE features a module for affiliation inference, which mines the publication full-texts to identify and extract pairs &lt;author-organization&gt;. If the algorithm is able to determine which author is associated with which organization, then a relationship <i>affiliation</i> between the author and the organization is added to the graph; otherwise the <i>authoringOrganization</i> relationship is created between the publication and the organization. For the purpose of the Service there is no difference between the two (see Figure 4), as what matters is the identification of a relationship between the publication and repositories. </p> <div align="center"> <img src="artini-fig4.png" alt="artini-fig4" width="557" height="389" class="borderGray" /> <p><i>Figure 4: Detection of "relevant to" criterion via full-text mining</i></p> </div> <p><b><i>Relationships: publication-author-repository</i></b></p> <p>The second criterion to detect which publications may be "relevant to" a repository is based on the relationships <i>authorship</i>, i.e. the publication has an author, and <i>authorRepositoryOfReference</i>, i.e. the author deposits her publications in the given repository. As previously mentioned, <i>authorship</i> is generally provided by the collected metadata, while <i>authorRepositoryOfReference</i> needs to be inferred by OpenAIRE services. To this end, the services exploit the results of the de-duplication algorithms over authors and publications. 
Harvested metadata records contain authors' names in <i>dc:creator</i> fields, as simple strings. In the OpenAIRE information space, such "raw" author objects are initially created with a stateless identifier that makes them unique in the graph. Author identifiers are obtained from the OpenAIRE identifier of the repository, the OpenAIRE identifier of the publication that contains them (obtained from publication identifiers such as DOIs or OAI-PMH identifiers), and the author name string (see Figure 5 for an example). </p> <p>As such, before de-duplication, each occurrence of an author name in a publication from a given data source is considered to be a unique author, which carries a pointer to the data source (e.g. the repository) and to the publication that brought it into the system. The result of de-duplication over author objects is a set of "anchor" authors obtained as the merge of several "raw" authors. Starting from "anchor" authors, and exploiting the pointers to institutional repositories of the raw authors they merge, OpenAIRE inference services calculate the notion of "author submission frequency" by counting the number of publications of the author across different repository data sources. In the majority of cases, the repository with the highest submission frequency turns out to be the repository of reference for the author, namely the one to which she is supposed to report her publications (in future work, this process will be further refined to identify the "migration" of an author to another institution, therefore to a different repository of reference; this condition may conflict with the "highest number of submissions" criterion, but may be identified using submission dates). </p> <p>Accordingly, with a given degree of approximation, when OpenAIRE collects a new publication from a given repository it is possible to state whether some of its authors have a different repository of reference. An exemplification is shown in Figure 5. 
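</p> <p>The "author submission frequency" computation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the OpenAIRE inference service: the function name <i>repository_of_reference</i>, its parameters, the restriction of candidates to institutional repositories, and the tie-breaking rule are all hypothetical. The sample counts are those of Table 2, and the stricter thresholds mirror the ones used in the preliminary analysis reported later in this section.</p>
<pre>
```python
from collections import Counter

# Hypothetical helper: names and thresholds are illustrative, not OpenAIRE's API.


def repository_of_reference(deposits, institutional,
                            min_total=1, min_in_reference=1):
    """Pick the institutional repository holding most of an anchor author's publications.

    deposits: one data source id per publication of the raw authors merged
        into the anchor author.
    institutional: ids of the data sources that are institutional repositories
        (assumption: only these can be a repository of reference).
    Returns None when a threshold is not met; ties go to the first source seen.
    """
    counts = Counter(deposits)
    total = sum(counts.values())
    candidates = [(n, repo) for repo, n in counts.items() if repo in institutional]
    if not candidates:
        return None
    best_n, best_repo = max(candidates, key=lambda pair: pair[0])
    if total >= min_total and best_n >= min_in_reference:
        return best_repo
    return None


# Counts from Table 2: three publications in Repo1, one each in Repo2, Repo3 and DS.
turing = ["Repo1", "Repo1", "Repo1", "Repo2", "Repo3", "DS"]
repos = {"Repo1", "Repo2", "Repo3"}
print(repository_of_reference(turing, repos))
print(repository_of_reference(turing, repos, min_total=10, min_in_reference=4))
```
</pre>
<p>With the Table 2 counts and the default thresholds the sketch selects "Repo1"; with a minimum of 10 publications overall and 4 in the repository of reference it returns no repository, since this author has only six publications.</p> <p>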
The same author string ("A. Turing") collected from four different data sources (three institutional repositories and one data source of a different type) results in four different person objects. When de-duplication is run, the four persons are merged into one new "anchor" object ("anchor::A. Turing", in the example). Table 2 shows the number of submissions of "anchor::A. Turing" per data source, based on the provenance of the four "raw" authors it merges. The table shows that the author deposited mostly in repository "Repo1", which can then be considered the repository of reference for the author. Consequently, "Repo1" may be interested in being notified about the publications the author deposited in "Repo2", "Repo3" and "DS".</p> <p>In order to avoid useless notifications, publication de-duplication makes it possible to determine whether the repository of reference already holds the publications or should be notified.</p> <div align="center"> <img src="artini-fig5.png" alt="artini-fig5" width="694" height="446" class="borderGray" /> <p><i>Figure 5: Affiliation detection: using de-duplication to compute the closeness of an author to a repository</i></p> </div> <div class="divider-gray"> </div> <table align="center" border="0" cellpadding="6" cellspacing="0"> <tr> <td class="topLeft" align="left"> </td> <td class="topLeft" align="center"><b>Repo1</b></td> <td class="topLeft" align="center"><b>Repo2</b></td> <td class="topLeft" align="center"><b>Repo3</b></td> <td class="topLeftRight" align="center"><b>DS</b></td> </tr> <tr> <td class="topLeftBottom" align="left" bgcolor="#cccccc">Anchor::A. 
Turing</td> <td class="topLeftBottom" align="center" bgcolor="#cccccc">3</td> <td class="topLeftBottom" align="center" bgcolor="#cccccc">1</td> <td class="topLeftBottom" align="center" bgcolor="#cccccc">1</td> <td class="all" align="center" bgcolor="#cccccc">1</td> </tr> </table> <p align="center">Table 2: Submission frequency for the graph in Figure 5</p> <p>The accuracy of the de-duplication algorithms is very important for the correct implementation of this strategy. In addition, repository managers should be able to fine-tune the parameters for the selection of "author submission frequency" (e.g. minimum number of submissions per data source or in total) in order to limit the number of false-positive notifications. </p> <p>A preliminary analysis of the OpenAIRE information space graph for the detection of "frequent submitters" has been carried out considering authors with at least 10 publications and with at least 4 publications in the repository of reference. The analysis is summarized in Table 3 and in the graph in Figure 6. From a total of 157,549 anchor authors from 426 institutional repositories, about 19% submit their articles to a single repository (i.e. 100% of each author's publications have been collected from the same data source). Interestingly, about 60% submitted publications to different repositories, but their repository of reference hosts from 50% to 99% of their publications. Most likely, repository managers will be interested in this subset of authors, as they are those who mostly deposit papers in one repository, while some of their papers can also be found in other locations. Finally, about 20% submitted fewer than 50% of their publications to their repository of reference. 
</p> <table align="center" border="0" cellpadding="6" cellspacing="0"> <tr> <td class="topLeft" align="left"><b>Author publications appearing<br />in repository of reference (%)</b></td> <td class="topLeftRight" align="left"><b>Number of authors in this category</b></td> </tr> <tr> <td class="topLeft" align="center" bgcolor="#cccccc">10-19%</td> <td class="topLeftRight" align="center" bgcolor="#cccccc">70</td> </tr> <tr> <td class="topLeft" align="center">20-29%</td> <td class="topLeftRight" align="center">4,469</td> </tr> <tr> <td class="topLeft" align="center" bgcolor="#cccccc">30-39%</td> <td class="topLeftRight" align="center" bgcolor="#cccccc">12,316</td> </tr> <tr> <td class="topLeft" align="center">40-49%</td> <td class="topLeftRight" align="center">15,802</td> </tr> <tr> <td class="topLeft" align="center" bgcolor="#cccccc">50-59%</td> <td class="topLeftRight" align="center" bgcolor="#cccccc">20,659</td> </tr> <tr> <td class="topLeft" align="center" >60-69%</td> <td class="topLeftRight" align="center">16,832</td> </tr> <tr> <td class="topLeft" align="center" bgcolor="#cccccc">70-79%</td> <td class="topLeftRight" align="center" bgcolor="#cccccc">15,805</td> </tr> <tr> <td class="topLeft" align="center">80-89%</td> <td class="topLeftRight" align="center">18,823</td> </tr> <tr> <td class="topLeft" align="center" bgcolor="#cccccc">90-99%</td> <td class="topLeftRight" align="center" bgcolor="#cccccc">22,572 </td> </tr> <tr> <td class="topLeftBottom" align="center">100%</td> <td class="all" align="center">30,201</td> </tr> </table> <p align="center">Table 3: Authors, repositories of reference and percentage of deposition</p> <div class="divider-gray"> </div> <div class="divider-white"> </div> <div align="center"> <img src="artini-fig6.png" alt="artini-fig6" width="564" height="281" class="borderGray" /> <p><i>Figure 6: Preliminary analysis of the OpenAIRE graph for "frequent submitters"</i></p> </div> <p><b><i>Relationships: 
publication-project-organization-repository</i></b></p> <p>The third criterion available for subscriptions on "relevant to" exploits the relationships <i>beneficiaryOf</i>, i.e. organizations involved in research projects, and <i>organizationRepositoryOfReference</i>, i.e. the institutional repository of reference for all authors of an organization. OpenAIRE collects these relationships from publication metadata (e.g. repositories, journals), project metadata (funders), and repository metadata (OpenDOAR). Figure 7 provides an example of the concept: CNR-ISTI is an Italian research institute whose institutional repository is PUblication MAnagement (PUMA). Researchers from CNR-ISTI should deposit their publications in PUMA. CNR-ISTI is involved in the EC-funded project OpenAIRE2020; therefore some of the publications linked to the OpenAIRE2020 project may be "relevant to" PUMA because they may be co-authored by researchers working at CNR-ISTI.</p> <div align="center"> <img src="artini-fig7.png" alt="artini-fig7" width="555" height="287" class="borderGray" /> <p><i>Figure 7: Detection of publications' affiliation: exploiting links to projects</i></p> </div> <p>The approach has a high chance of yielding false-positive notifications because some projects involve a considerable number of organizations (e.g. the OpenAIRE2020 EC-H2020 project involves 49 organizations). Repository managers will be able to fine-tune their subscription in order to include, for example, only projects from a given list, or projects with a limited number of participants. </p> <div class="divider-dot"> </div> <h4>4.2 Notifications</h4> <p>Different notification strategies are under evaluation in order to meet the diverse requirements of subscribers:</p> <ul> <li style="padding-bottom: .5em;"><b>Mail postcards</b>: Following the example of the Jisc Publications Router [<a href="#10">10</a>], subscribers may opt to be notified by email at given intervals (e.g. 
daily, weekly, monthly) and with given granularity (individual records, digests, URL to user interface), together with instructions for the retrieval of the complete metadata records and full-texts. </li> <li style="padding-bottom: .5em;"><b>Programmatic access</b>: APIs will be provided to retrieve notifications by status (e.g. read/unread), subscription type, and filters (e.g. criteria on the metadata fields). A prototype solution based on OAI-PMH has already been implemented on top of the OpenAIRE OAI-PMH Publisher and is currently undergoing testing. For subscribing repositories, the OAI-PMH publisher service gives access to the OAI set of records collected from the repository that have been enriched by OpenAIRE. </li> <li><b>Web interface</b>: A web application will offer a dashboard where repository managers can find the tools to: <ul style="list-style: disc;"> <li style="padding-bottom: .5em; padding-top: .5em;">Manage their notifications, i.e. create, suspend, resume and delete</li> <li style="padding-bottom: .5em;">View, download or re-send old notifications</li> <li>Select the format of the email notification digests from a set of supported formats (e.g. Dublin Core XML, BibTeX and citations in ASCII)</li> </ul> </li> </ul> <p>Finally, existing APIs for the automatic ingestion of records into repositories will be evaluated (SWORD, [<a href="#4">4</a>]) and the development of software modules for integration and ingestion into known repository platforms will be considered (e.g. DSpace, EPrints).</p> <div class="divider-full"> </div> <h3>5 Conclusions</h3> <p>OpenAIRE populates, cleans, and enriches a graph of objects related to publications, datasets, people, organizations, projects, and funders aggregated from a variety of data sources. The OpenAIRE graph is a great opportunity for repository managers to improve their repository collections, as it may feature information that is not otherwise available to them. 
The OpenAIRE Literature Broker Service will offer subscription and notification functionalities explicitly targeting their needs. By exploiting the provenance information tracked by the OpenAIRE infrastructure, it will be possible to subscribe to "enrichment" events and be notified whenever OpenAIRE enriches a publication metadata record with new properties (subjects, citation list, research initiatives) or new relationships to projects or datasets. By enriching the information space graph with relationships and analyzing it, the service will also be able to notify repository managers about "addition" events whenever a publication metadata record relevant to their repository is aggregated from another data source.</p> <p>A first prototype for the export of enriched metadata records per data source via OAI-PMH has already been implemented. In the near future, the procedure for subscription and email notification, together with a first implementation of the Dashboard Web UI, will be made available to selected repository managers for testing purposes.</p> <div class="divider-full"> </div> <h3>Acknowledgements</h3> <p>This work is partially funded by the European Commission H2020 project OpenAIRE2020 (Grant agreement: 643410, Call: H2020-EINFRA-2014-1).</p> <div class="divider-full"> </div> <h3>References</h3> <p><a name="1">[1]</a> <a href="https://guidelines.openaire.eu">The OpenAIRE guidelines</a>.</p> <p><a name="2">[2]</a> <a href="https://api.openaire.eu">The OpenAIRE API</a>.</p> <p><a name="3">[3]</a> Walters, T., & Ruttenberg, J. (2014). <a href="http://www.educause.edu/ero/article/shared-access-research-ecosystem">SHared Access Research Ecosystem</a>. <i>Educause Review</i>, 49(2), 56-57.</p> <p><a name="4">[4]</a> Lewis, S., de Castro, P., & Jones, R. (2012). SWORD: Facilitating deposit scenarios. <i>D-Lib Magazine</i>, 18(1), 4. 
<a href="http://doi.org/10.1045/january2012-lewis">http://doi.org/10.1045/january2012-lewis</a></p> <p><a name="5">[5]</a> Manghi, P., Bolikowski, L., Manola, N., Schirrwagen, J., & Smith, T. (2012). OpenAIREplus: the European scholarly communication data infrastructure. <i>D-Lib Magazine</i>, 18(9), 1. <a href="http://doi.org/10.1045/september2012-manghi">http://doi.org/10.1045/september2012-manghi</a></p> <p><a name="6">[6]</a> Manghi, P., Artini, M., Atzori, C., Bardi, A., Mannocci, A., La Bruzzo, S., Candela, L., Castelli, D., & Pagano, P. (2014). The D-NET software toolkit: A framework for the realization, maintenance, and operation of aggregative infrastructures. <i>Program: electronic library and information systems</i>, 48(4), 322-354. <a href="http://doi.org/10.1108/PROG-08-2013-0045">http://doi.org/10.1108/PROG-08-2013-0045</a></p> <p><a name="7">[7]</a> Manghi, P., Houssos, N., Mikulicic, M., & Jörg, B. (2012). The data model of the OpenAIRE scientific communication e-infrastructure. In <i>Metadata and Semantics Research</i> (pp. 168-180). Springer Berlin Heidelberg. <a href="http://doi.org/10.1007/978-3-642-35233-1_18">http://doi.org/10.1007/978-3-642-35233-1_18</a></p> <p><a name="8">[8]</a> Houssos, N., Jörg, B., Dvořák, J., Príncipe, P., Rodrigues, E., Manghi, P., & Elbæk, M. K. (2014). OpenAIRE guidelines for CRIS managers: supporting interoperability of open research information through established standards. <i>Procedia Computer Science</i>, 33, 33-38. <a href="http://doi.org/10.1016/j.procs.2014.06.006">http://doi.org/10.1016/j.procs.2014.06.006</a></p> <p><a name="9">[9]</a> Kobos, M., Bolikowski, Ł., Horst, M., Manghi, P., Manola, N., & Schirrwagen, J. (2014). Information Inference in Scholarly Communication Infrastructures: The OpenAIREplus Project Experience. <i>Procedia Computer Science</i>, 38, 92-99. 
<a href="http://doi.org/10.1016/j.procs.2014.10.016">http://doi.org/10.1016/j.procs.2014.10.016</a></p> <p><a name="10">[10]</a> Jisc Blog, "<a href="http://scholarlycommunications.jiscinvolve.org/wp/2015/07/01/jisc-publications-router-enters-a-new-phase">Jisc Publications Router enters a new phase</a>".</p> <div class="divider-full"> </div> <h3>About the Authors</h3> <table border="0" cellpadding="6" bgcolor="#FFFFFF"> <tr> <td align="center"><img src="artini.jpg" class="border" alt="artini" width="100" height="120" /></td> <td> <p class="blue"><b>Michele Artini</b> is a research fellow at the Networked Multimedia Information Systems laboratory of the Istituto di Scienza e Tecnologie dell'Informazione (ISTI), Consiglio Nazionale delle Ricerche, Pisa, Italy. Since 2005 he has been involved in EC-funded projects for the realisation of aggregative data infrastructures like DRIVER, DRIVER II, BELIEF, HOPE, EFG, EFG1914, OpenAIRE, OpenAIREplus and OpenAIRE2020. He is interested in digital libraries, service-oriented infrastructures, database systems and workflow management systems.</p> </td> </tr> </table> <div class="divider-full"> </div> <table border="0" cellpadding="6" bgcolor="#FFFFFF"> <tr> <td align="center"><img src="atzori.jpg" class="border" alt="atzori" width="100" height="122" /></td> <td> <p class="blue"><b>Claudio Atzori</b> received his MSc in "Information Technology" in 2009 at the University of Cagliari. He is a PhD student in Information Engineering at the Engineering School "Leonardo da Vinci" of the University of Pisa. He works as a research fellow in the InfraScience research group, part of the Networked Multimedia Information Systems Laboratory (NeMIS), at the "Istituto di Scienza e Tecnologie dell'Informazione", National Research Council, Pisa, Italy. He works on the realisation of aggregative data infrastructures for e-science and scholarly communication. 
He has also participated in the EC-funded R&D projects DRIVER-II, EFG, EFG1914, HOPE, EAGLE, OpenAIRE, OpenAIREplus and OpenAIRE2020.</p> </td> </tr> </table> <div class="divider-full"> </div> <table border="0" cellpadding="6" bgcolor="#FFFFFF"> <tr> <td align="center"><img src="bardi.png" class="border" alt="bardi" width="100" height="136" /></td> <td> <p class="blue"><b>Alessia Bardi</b> received her MSc in Information Technologies in 2009 at the University of Pisa, Italy. She is a PhD student in Information Engineering at the Engineering Ph.D. School "Leonardo da Vinci" of the University of Pisa and works as a graduate fellow at the Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy. Today she is a member of the InfraScience research group, part of the Networked Multimedia Information Systems Laboratory (NeMIS). She is involved in EC-funded projects for the realisation of aggregative data infrastructures. Her research interests include service-oriented architectures and data infrastructures for e-science and scholarly communication.</p> </td> </tr> </table> <div class="divider-full"> </div> <table border="0" cellpadding="6" bgcolor="#FFFFFF"> <tr> <td align="center"><img src="labruzzo.png" class="border" alt="labruzzo" width="100" height="146" /></td> <td> <p class="blue"><b>Sandro La Bruzzo</b> is a research fellow at Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy. He received his MSc in Information Technologies in 2010 at the University of Pisa, Italy. Today he is a member of the InfraScience research group, part of the Networked Multimedia Information Systems Laboratory (NeMIS). His current research interests are in the areas of service-oriented infrastructures for digital libraries, protocols for metadata exchange, databases, and indexing. 
He is currently working on the development of digital library and data infrastructures for the European Commission projects OpenAIRE, OpenAIREplus, OpenAIRE2020, EFG1914, HOPE, and EAGLE.</p> </td> </tr> </table> <div class="divider-full"> </div> <table border="0" cellpadding="6" bgcolor="#FFFFFF"> <tr> <td align="center"><img src="manghi.png" class="border" alt="manghi" width="100" height="130" /></td> <td> <p class="blue"><b>Paolo Manghi</b> is a Researcher in computer science at Istituto di Scienza e Tecnologie dell'Informazione (ISTI) of Consiglio Nazionale delle Ricerche (CNR), in Pisa, Italy. He is the acting technical manager and researcher for the EU-H2020 infrastructure projects OpenAIRE2020, <a href="http://www.sobigdata.eu/">SoBigData.eu</a>, PARTHENOS, RDA Europe, and EAGLE. He is an active member of a number of Data Citation and Data Publishing Working groups of the Research Data Alliance. In addition, he is an invited member of the advisory boards of the Research Object initiative (Carole Goble, University of Manchester) and of the Europeana Cloud project. His current research interests are data e-infrastructures for science, scholarly communication infrastructures, and publishing/interlinking of data and experiments, with a focus on technologies supporting open science and digital scholarly communication, i.e. reusing, sharing, and assessing all research products, be they articles, datasets or experiments. </p> </td> </tr> </table> <div class="divider-full"> </div> <table border="0" cellpadding="6" bgcolor="#FFFFFF"> <tr> <td align="center"><img src="mannocci.png" class="border" alt="mannocci" width="100" height="146" /></td> <td> <p class="blue"><b>Andrea Mannocci</b> is a research fellow at the Networked Multimedia Information Systems (NeMIS) laboratory of the Istituto di Scienza e Tecnologie dell'Informazione (ISTI), Consiglio Nazionale delle Ricerche, Pisa, Italy. 
He is also a PhD student in Information Science Engineering at the University of Pisa. He is involved in EC-funded projects for the realization of data integration infrastructures (OpenAIREplus, OpenAIRE2020 and EAGLE). His research activities span data quality and application monitoring, and service-oriented infrastructures for e-science. In 2010 he received an MSc in Computer Science Engineering at the University of Pisa and in 2011 an MSc in Telematic Engineering at the University Carlos III of Madrid, Spain.</p> </td> </tr> </table> <div class="divider-full"> </div> <!-- Standard Copyright line here --> <div class="center"> <p class="footer">Copyright © 2015 Michele Artini, Claudio Atzori, Alessia Bardi, Sandro La Bruzzo, Paolo Manghi and Andrea Mannocci</p> </div> </td> </tr> </table> <table width="100%" border="0" align="center" cellpadding="0" cellspacing="0"> <tr> <td height="1" bgcolor="#2b538e"><img src="../../../img2/transparent.gif" alt="transparent image" width="100" height="2" /></td> </tr> </table> </td></tr></table> </td></tr></table> </form> </body> </html>