<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"> <!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--> <!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--> <!--[if !IE]><!--> <html lang="en"> <!--<![endif]--> <head> <title>Benefits from Extending KBpedia with Private Datasets</title> <link rel="alternate" type="application/rss+xml" title="" href="/resources/feeds/news.xml" /> <!-- Meta --> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="description" content=""> <meta name="author" content=""> <!-- Favicon --> <link rel="shortcut icon" href="/favicon.ico"> <!-- CSS Global Compulsory --> <link rel="stylesheet" href="/assets/plugins/bootstrap/css/bootstrap.min.css"> <link rel="stylesheet" href="/assets/css/style.css"> <!-- CSS Implementing Plugins --> <link rel="stylesheet" href="/assets/plugins/line-icons/line-icons.css"> <link rel="stylesheet" href="/assets/plugins/font-awesome/css/font-awesome.min.css"> <!-- CSS Theme --> <link rel="stylesheet" href="/assets/css/theme-colors/blue.css"> <link rel="stylesheet" href="/assets/css/footers/footer-v1.css"> <link rel="stylesheet" href="/assets/css/headers/header-v1.css"> <!-- CSS Customization --> <link rel="stylesheet" href="/css/custom.css"> <meta property="og:site_name" content="KBpedia"/> <meta property="og:title" content="Mapping Use Cases: Extend KBpedia for Domains"> <meta property="og:type" content="article"/> <meta property="og:description" content="This use case describes how to extend KBpedia for new domains with private datasets."> <meta property="og:image" content="/imgs/kbpedia-logo-350.png"> <meta property="og:image:width" content="350"> <meta property="og:image:height" content="390"> </head> <body> <div class="wrapper"> <div class="header-v1"> <!-- Topbar --> <div class="topbar"> <div class="container"> <div class="col-md-1"> </div> <a class="navbar-brand" href="/"> <img id="logo-header" src="/imgs/kbpedia-logo-420-horz.png" height="75" 
alt="KBpedia Knowledge Structure" name="logo-header"> </a> </div> </div><!-- End Topbar --> <!-- Navbar --> <div class="navbar navbar-default mega-menu" role="navigation"> <div class="container"> <!-- Brand and toggle get grouped for better mobile display --> <div class="navbar-header"> <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-responsive-collapse"> <span class="sr-only">Toggle navigation</span> </button> </div> <div style="clear:both; height: 1px;"> </div> <!-- Collect the nav links, forms, and other content for toggling --> <div class="col-md-1"> </div><!--/col-md-1--> <div class="collapse navbar-collapse navbar-responsive-collapse col-md-10"> <ul class="nav navbar-nav pull-left"> <!-- Demo --> <li> <a href="/">Home</a> </li> <!-- Home --> <li> <a href="/knowledge-graph/">Knowledge Graph</a> </li> <li> <a href="http://sparql.kbpedia.org/">SPARQL</a> </li> <!-- Background --> <li class="dropdown"> <a href="/background/">Background</a> <ul class="dropdown-menu"> <li> <a href="/background/overview/">Overview</a> </li> <li> <a href="/background/features-and-benefits/">Features & Benefits</a> </li> <li> <a href="/background/data-and-knowledge-structures/">Data and Knowledge Structures</a> </li> <li> <a href="/background/technology/">Technology</a> </li> <li> <a href="/background/machine-learning/">Machine Learning</a> </li> <li> <a href="/background/uses/">Uses</a> </li> </ul> </li><!-- End Uses --> <li class="dropdown"> <a href="/use-cases/">Use Cases</a> <ul class="dropdown-menu"> <li class="dropdown-submenu"> <a href="/use-cases/knowledge-graph/">Knowledge Graph (KG)</a> <ul class="dropdown-menu"> <li><a href="/use-cases/browse-the-knowledge-graph/">Browse the Knowledge Graph</a></li> <li><a href="/use-cases/search-the-knowledge-graph/">Search the Knowledge Graph</a></li> <li><a href="/use-cases/expand-queries-using-semsets/">Expand Queries Using Semsets</a></li> <li><a 
href="/use-cases/use-and-control-of-inferencing/">Uses and Control of Inferencing</a></li> </ul> </li> <li class="dropdown-submenu"> <a href="/use-cases/machine-learning-use-case/">Machine Learning (KBAI)</a> <ul class="dropdown-menu"> <li><a href="/use-cases/text-classification-using-esa-and-svm/">Create Supervised Learning Training Sets</a></li> <li><a href="/use-cases/document-specific-word2vec-training-corpuses/">Create Word Embedding Corpuses</a></li> <li><a href="/use-cases/extending-kbpedia-with-kbpedia-categories/">Create Graph Embedding Corpuses</a></li> <li><a href="/use-cases/text-classification-using-esa-and-svm/">Classify Text</a></li> <li><a href="/use-cases/dynamic-machine-learning/">Create 'Gold Standards' for Tuning Learners</a></li> <li><a href="/use-cases/disambiguating-kbpedia-knowledge-graph-concepts/">Disambiguate KG Concepts</a></li> <li><a href="/use-cases/dynamic-machine-learning/">Dynamic Machine Learning Using the KG</a></li> </ul> </li> <li class="dropdown-submenu"> <a href="/use-cases/mapping-use-case/">Mapping</a> <ul class="dropdown-menu"> <li><a href="/use-cases/mapping-external-data-and-schema/">Map Concepts</a></li> <li><a href="/use-cases/extending-kbpedia-with-kbpedia-categories/">Extend KBpedia with Wikipedia</a></li> <li><a href="/use-cases/benefits-from-extending-kbpedia-with-private-datasets/">Extend KBpedia for Domains</a></li> <li><a href="/use-cases/mapping-external-data-and-schema/">General Use of the Mapper</a></li> </ul> </li> </ul> </li> <li class="dropdown"> <a href="/resources/">Resources</a> <ul class="dropdown-menu"> <li><a href="/resources/downloads/">Download KBpedia</a></li> <li><a href="/resources/about/">About KBpedia</a></li> <li><a href="/resources/faq/">KBpedia FAQ</a></li> <li><a href="/resources/news/">News About KBpedia</a></li> <li><a href="/resources/statistics/">KBpedia Statistics</a></li> <li><a href="/resources/documentation/">Additional Documentation</a></li> <li><a 
href="/resources/support/">Support for KBpedia</a></li> </ul> </li> </ul> </div><!--/navbar-collapse--> <div class="col-md-1"> </div><!--/col-md-1--> </div> </div><!-- End Navbar --> </div><!--=== End Header ===--> <!--=== Breadcrumbs ===--> <div class="breadcrumbs"> <div class="container"> <div class="col-md-1"> </div><!--/col-md-1--> <div class="col-md-10"> <h1 class="pull-left"></h1> <ul class="pull-right breadcrumb"> <li>Use Cases</li> <li class="active">Integrating Private Data</li> </ul> </div><!--/col-md-10--> <div class="col-md-1"> </div><!--/col-md-1--> </div> </div> <!--/breadcrumbs--> <!--=== End Breadcrumbs ===--> <!--=== Content Part ===--> <div class="container content"> <div class="row"> <div class="col-md-2"> </div> <div class="col-md-8"> <div class="use-cases-header"> <table border="0" cellpadding="4" cellspacing="2"> <tbody> <tr> <td colspan="2" align="center"> <h2> <b>USE CASE</b> </h2> </td> </tr> <tr> <td style="width: 140px;" valign="top"> <b>Title:</b> </td> <td valign="top"> <span style="font-weight: bold; padding-left: 25px;">Benefits from Extending KBpedia with Private Datasets</span> </td> </tr> <tr> <td valign="top"> <b>Short Description:</b> </td> <td valign="top" style="padding-left: 25px;"> Significant improvements in tagging accuracy may be obtained by adding private (enterprise- or domain-specific) datasets to the standard public knowledge bases already in KBpedia. <br> </td> </tr> <tr> <td valign="top"> <b>Problem:</b> </td> <td valign="top" style="padding-left: 25px;"> We want to obtain as comprehensive and accurate tagging of entities as possible for our specific enterprise needs. <br> </td> </tr> <tr> <td valign="top"> <b>Approach:</b> </td> <td valign="top" style="padding-left: 25px;"> KBpedia provides a rich set of 30 million entities in its standard configuration. 
However, by identifying and including relevant entity lists already in the possession of the enterprise, or from specialty datasets in the relevant domain, significant improvements can be achieved across all of the standard metrics used for entity recognition and tagging. Further, our standard methodology includes the creation of reference, or "gold", standards for measuring the benefits of adding more data or performing other tweaks to the entity extraction algorithms. </td> </tr> <tr> <td valign="top"> <b>Key Findings:</b> </td> <td valign="top"> <ul> <li>In this specific example, adding private enterprise data results in more than a doubling of accuracy (a 108% gain) over the standard, baseline KBpedia for identifying the publishing organization of a Web page</li> <li>Some datasets may have a more significant impact than others, but each dataset contributes to the overall improvement of the predictions. Generally, adding more data improves results across all measured metrics</li> <li>Approx. 500 training cases are sufficient to build a useful "gold standard" for entity tagging; negative training examples are also advisable</li> <li>"Gold standards" are an essential component for testing the value of adding specific datasets or refining machine learning parameters<br> </li> <li>Even if all specific entities are not identified, flagging a potential "unknown" entity is an important means for targeting next efforts to extend the current knowledge base<br> </li> <li>KBpedia is a very useful structure and starting point for an entity tagging effort, but adding domain data is probably essential to gain the overall accuracy desired for enterprise requirements<br> </li> <li>This use case is broadly applicable to any entity recognition and tagging initiative. 
</li> </ul> </td> </tr> </tbody> </table> </div> </div> <div class="col-md-2"> </div><!--/col-md-2--> </div> <div class="row"> </div> <div class="row"> </div> <div class="row"> <div class="col-md-2"> </div> <div class="col-md-8"> <p> This use case demonstrates two aspects of working with the KBpedia knowledge structure. First, we demonstrate the benefits of adding private datasets to the standard knowledge bases included with KBpedia. And, second, we highlight our standard use of reference, or "gold", standards as a way of measuring progress in tweaking datasets and parameters when doing machine learning tasks. </p> <p> The basis for this use case is an enterprise that is monitoring information published on the Web and wants to be able to identify the organization responsible for publishing a given page. The enterprise monitors the Web on a daily basis in its domain of interest and is able to identify new Web pages it has not seen before. Further, the enterprise also has two of its own datasets that contain possible candidate organizations that might be publishers of such pages. These two datasets are private and not available to the general public. </p> <p> In this use case, we describe a publisher analyzer used for organization identification, the standard KBpedia datasets available for the task, the enterprise's own private datasets, and the approach we take to "gold standards" and the specifics of that standard for this case. Once these component parts are described, we proceed to give the results of adding or using different datasets. We then summarize some conclusions. </p> <p> Note this use case is broadly applicable to any entity recognition and tagging initiative. 
</p> <div id="outline-container-orgheadline1" class="outline-2"> <br /> <h2 id="orgheadline1">The Analysis Framework</h2> <div class="outline-text-2" id="text-orgheadline1"> <p> The analysis framework comprises general platform code, the publisher analyzer, the standard KBpedia knowledge structure and its public knowledge bases, reference "gold standards", and, for the test, external (private enterprise) data. </p> </div> <div id="outline-container-orgheadline2" class="outline-3"> <br /> <h3 id="orgheadline2">Overview of the Publisher Analyzer</h3> <div class="outline-text-3" id="text-orgheadline2"> <p> The publisher analyzer attempts to determine the publisher of a web page by analyzing the web page's content. There are multiple moving parts to this analyzer, but its general internal workflow is as follows: </p> <ol class="org-ol"> <li>It crawls a given webpage URL</li> <li>It extracts the page's content and meta-data (including "defluffing", which is the removal of navigation, ads, and normal Web page boilerplate)</li> <li>It tags all of the organizations (anything that is considered an <a href="http://kbpedia.org/knowledge-graph/reference-concept/?uri=Organization">organization in KBpedia</a>) across the extracted content using the organization entities that exist in the knowledge base</li> <li>It conducts certain specialty analyses related to page "signals" that might indicate an organizational entity</li> <li>It detects unknown entities that will eventually be added to the knowledge base after curation</li> <li>It performs an in-depth analysis of the organization entities (known or unknown) that were tagged in the content of the web page, and analyzes which of these is most likely to be the publisher of the web page.</li> </ol> <p> The machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. 
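As a rough illustration, the six-step workflow above can be sketched in a few lines of Python. Every function name and the frequency-based scoring here are simplified stand-ins for illustration only, not the actual analyzer's code:

```python
# Toy sketch of the analyzer workflow; all names and heuristics here are
# simplified stand-ins, not the actual publisher analyzer's API.

def defluff(html):
    """Step 2 stand-in: a real implementation strips navigation, ads,
    and boilerplate; here we pass the content through unchanged."""
    return html

def tag_organizations(text, kb_orgs):
    """Step 3 stand-in: tag known organization entities by string match
    against the organizations in the knowledge base."""
    return [org for org in kb_orgs if org.lower() in text.lower()]

def likely_publisher(html, kb_orgs):
    """Steps 2, 3, and 6: defluff, tag, then pick the candidate most
    likely to be the publisher (here, simply the most-mentioned one)."""
    content = defluff(html)
    candidates = tag_organizations(content, kb_orgs)
    scores = {c: content.lower().count(c.lower()) for c in candidates}
    return max(scores, key=scores.get) if scores else None

page = "Acme Corp News -- Copyright Acme Corp. Syndicated via Reuters."
print(likely_publisher(page, {"Acme Corp", "Reuters"}))  # prints Acme Corp
```

A real analyzer, of course, weighs page "signals" and knowledge-base relations rather than raw mention counts, and returns an entity URI rather than a label.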
These are conventional uses of these algorithms. What differentiates the publisher analyzer is its knowledge base. We leverage KBpedia to detect known organization entities. We use the knowledge in KBpedia's combined KBs for each of these entities to improve the analysis process. We constrain the analysis to certain types (by inference) of named entities, etc. The special sauce of this entire process is the fully integrated set of knowledge bases that comprise KBpedia, including its hundreds of thousands of concepts, 39,000 reference concepts, and 20 million known entities. </p> </div> </div> <div id="outline-container-orgheadline3" class="outline-3"> <br /> <h3 id="orgheadline3">Public Datasets</h3> <div class="outline-text-3" id="text-orgheadline3"> <p> This use case begins with three public datasets already in KBpedia: <a href="http://wikipedia.org/">Wikipedia</a> (via <a href="http://dbpedia.org/">DBpedia</a>), <a href="http://freebase.com/">Freebase</a> and <a href="http://www.uspto.gov/">USPTO</a>. </p> </div> </div> <div id="outline-container-orgheadline4" class="outline-3"> <br /> <h3 id="orgheadline4">Private Datasets</h3> <div class="outline-text-3" id="text-orgheadline4"> <p> These public datasets are then compared to two private datasets, which contain high-quality, curated, and domain-related listings of organizations. The numbers of organizations contained in these private datasets are much smaller than those in the public ones, but they are also more relevant to the domain. These private datasets are fairly typical of the specific information that an enterprise may have available in its own domain. 
</p> </div> </div> <div id="outline-container-orgheadline5" class="outline-3"> <br /> <h3 id="orgheadline5">Gold Standard</h3> <div class="outline-text-3" id="text-orgheadline5"> <p> The reference standard, or <a href="https://en.wikipedia.org/wiki/Gold_standard_(test)">"gold standard"</a>, employed in this use case is composed of 511 randomly selected Web pages that are manually vetted and characterized. (As a general rule of thumb we find about 500 examples in the positive training set to be adequate.) </p> <p> The gold standard itself is simple. For each of the URLs in the standard, we determine the publishing organization manually. Once the organization is determined, we search each dataset to see if the entity already exists. If it does, we add the entity's URI (unique identifier) in the knowledge base to the gold standard. It is this URI reference that is used to determine if the publisher analyzer properly detects the actual publisher of the web page. </p> <p> We also manually add a set of 10 web pages for which we are sure that <b>no</b> publisher can be determined for the web page. These are the 10 <code>True Negative</code> (see below) instances of the gold standard. </p> <p> The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system. </p> </div> </div> <div id="outline-container-orgheadline6" class="outline-3"> <br /> <h3 id="orgheadline6">Analysis Metrics</h3> <div class="outline-text-3" id="text-orgheadline6"> <p> The goal of the analysis is to determine how well the analyzer performs the task (detecting the organization that published a given web page). To do so, we use a set of metrics that help us understand the performance of the system. 
The metrics calculation is based on the <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>. </p> <div class="figure"> <p><img src="confusion-matrix-wikipedia.png" alt="confusion-matrix-wikipedia.png" /> </p> </div> <p> When we are processing a new run, all results are characterized according to four possible scoring values: </p> <ol class="org-ol"> <li><code>True Positive (TP)</code>: test identifies the same entity as in the gold standard</li> <li><code>False Positive (FP)</code>: test identifies a different entity than what is in the gold standard</li> <li><code>True Negative (TN)</code>: test identifies no entity; gold standard has no entity</li> <li><code>False Negative (FN)</code>: test identifies no entity, but the gold standard has one</li> </ol> <p> The <code>True Positive</code>, <code>False Positive</code>, <code>True Negative</code> and <code>False Negative</code> (see <a href="https://en.wikipedia.org/wiki/Type_I_and_type_II_errors">Type I and type II</a> errors for definitions) should be interpreted in the manner common to <a href="https://en.wikipedia.org/wiki/Named-entity_recognition#Formal_evaluation">named entity recognition tasks</a>. 
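A minimal sketch of how a single test result maps onto these four scoring values; the handling of a predicted entity where the gold standard records none is our assumption (the list above does not spell that case out), and we treat it as a false positive:

```python
def score_prediction(predicted, gold):
    """Map one (predicted, gold) pair onto a confusion-matrix cell.
    `None` means 'no entity identified'."""
    if predicted is None:
        return "TN" if gold is None else "FN"
    if gold is None:
        return "FP"  # assumption: a prediction against an empty gold entry counts as FP
    return "TP" if predicted == gold else "FP"

# e.g. comparing the URI from a run against the gold-standard URI:
print(score_prediction("http://dbpedia.org/resource/Acme",
                       "http://dbpedia.org/resource/Acme"))  # prints TP
```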
</p> <p> This simple scoring method allows us to apply a series of metrics based on the four possible scoring values: </p> <ol class="org-ol"> <li><a href="https://en.wikipedia.org/wiki/Precision_and_recall#Precision">Precision</a>: is the proportion of properly predicted publishers amongst all of the predictions that have been made <code>(TP / (TP + FP))</code></li> <li><a href="https://en.wikipedia.org/wiki/Precision_and_recall#Recall">Recall</a>: is the proportion of properly predicted publishers amongst all of the publishers that exist in the gold standard <code>(TP / (TP + FN))</code></li> <li><a href="https://en.wikipedia.org/wiki/Accuracy_and_precision">Accuracy</a>: is the proportion of correctly classified test instances, covering both the publishers correctly identified by the system and the web pages correctly identified as having no publisher <code>((TP + TN) / (TP + TN + FP + FN))</code></li> <li><a href="https://en.wikipedia.org/wiki/F1_score">f1</a>: is the test's equally weighted combination of precision and recall</li> <li><a href="https://en.wikipedia.org/wiki/Precision_and_recall#F-measure">f2</a>: is the test's weighted combination of precision and recall, with a preference for recall</li> <li><a href="https://en.wikipedia.org/wiki/Precision_and_recall#F-measure">f0.5</a>: is the test's weighted combination of precision and recall, with a preference for precision.</li> </ol> <p> The <a href="https://en.wikipedia.org/wiki/F1_score">F-score</a> is a common combined score of the general prediction system. It combines precision and recall via their <a href="https://en.wikipedia.org/wiki/Harmonic_mean">harmonic mean</a>. The <code>f2</code> measure weighs recall higher than precision (by placing more emphasis on false negatives), and the <code>f0.5</code> measure weighs recall lower than precision (by attenuating the influence of false negatives). 
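These definitions can be computed directly from the four counts; a small sketch in Python (the general F-beta form subsumes f1, f2, and f0.5):

```python
def metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F-measures from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    def f(beta):  # general F-beta; beta > 1 favors recall, beta < 1 precision
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {"precision": precision,
            "recall": recall,
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "f1": f(1), "f2": f(2), "f0.5": f(0.5)}

# The baseline run reported later (TP=2, FP=5, TN=19, FN=485) reproduces,
# e.g., accuracy = 21/511 ~= 0.0411 and f1 ~= 0.0081:
print(metrics(2, 5, 19, 485))
```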
Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall. Some clients prefer limiting false positives at the cost of lower recall; others want fuller coverage. </p> <p> Still, in most instances, we have found that customers consider <code>accuracy</code> the most useful metric. We particularly emphasize that metric in the results below. </p> </div> </div> </div> <div id="outline-container-orgheadline7" class="outline-2"> <br /> <h2 id="orgheadline7">Running The Tests</h2> <div class="outline-text-2" id="text-orgheadline7"> <p> The goal with these tests is to run the gold standard calculation procedure against various combinations of the available datasets in order to determine their comparative contribution to improved accuracy (or another metric of choice). Here is the general run procedure; note that each run has a standard presentation of the run statistics, beginning with the four scoring values, followed by the standard metrics: </p> </div> <div id="outline-container-orgheadline8" class="outline-3"> <br /> <h3 id="orgheadline8">Baseline: No Dataset</h3> <div class="outline-text-3" id="text-orgheadline8"> <p> The first step is to create the starting basis, which includes no dataset. Then we will add different datasets, and try different combinations, when computing against the gold standard so that we know the impact of each on the metrics. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 2 False positives: 5 True negatives: 19 False negatives: 485 +--------------+--------------+ | key | value | +--------------+--------------+ | :precision | 0.2857143 | | :recall | 0.0041067763 | | :accuracy | 0.04109589 | | :f1 | 0.008097166 | | :f2 | 0.0051150895 | | :f0.5 | 0.019417476 | +--------------+--------------+ </pre> </div> </div> <div id="outline-container-orgheadline9" class="outline-3"> <br /> <h3 id="orgheadline9">One Dataset Only</h3> <div class="outline-text-3" id="text-orgheadline9"> <p> Now, let's see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indication of the inherent impact of each dataset on the prediction task. 
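The runs in the following subsections just vary the <code>:datasets</code> list passed to <code>generate-stats</code>. The sweep over combinations can be scripted generically; a sketch in Python, where <code>evaluate_run</code> is a hypothetical stand-in for the gold-standard calculation procedure:

```python
from itertools import combinations

# Public dataset URIs as used in the runs that follow.
DATASETS = ["http://dbpedia.org/resource/",
            "http://rdf.freebase.com/ns/",
            "http://www.uspto.gov"]

def evaluate_run(datasets):
    """Hypothetical stand-in: would run the analyzer over the gold
    standard with only these datasets enabled and return its metrics."""
    return {"datasets": list(datasets)}

# Sweep every combination, from none (the baseline) up to all three:
runs = [evaluate_run(c) for n in range(len(DATASETS) + 1)
        for c in combinations(DATASETS, n)]
print(len(runs))  # 8 runs: baseline, 3 singles, 3 pairs, 1 triple
```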
</p> </div> <div id="outline-container-orgheadline10" class="outline-4"> <br /> <h4 id="orgheadline10">Wikipedia (via DBpedia) Only</h4> <div class="outline-text-4" id="text-orgheadline10"> <p> Let's test the impact of adding a single general purpose dataset, the publicly available: <a href="http://wikipedia.org/">Wikipedia</a> (via <a href="http://dbpedia.org/">DBpedia</a>): </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 121 False positives: 57 True negatives: 19 False negatives: 314 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.6797753 | | :recall | 0.27816093 | | :accuracy | 0.2739726 | | :f1 | 0.39477977 | | :f2 | 0.31543276 | | :f0.5 | 0.52746296 | +--------------+------------+ </pre> </div> </div> <div id="outline-container-orgheadline11" class="outline-4"> <br /> <h4 id="orgheadline11">Freebase Only</h4> <div class="outline-text-4" id="text-orgheadline11"> <p> Now, let's test the impact of adding another single general purpose dataset, this one the publicly available: <a href="http://freebase.com/">Freebase</a>: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: 
#c7254e;">"http://rdf.freebase.com/ns/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 11 False positives: 14 True negatives: 19 False negatives: 467 +--------------+-------------+ | key | value | +--------------+-------------+ | :precision | 0.44 | | :recall | 0.023012552 | | :accuracy | 0.058708414 | | :f1 | 0.043737575 | | :f2 | 0.028394425 | | :f0.5 | 0.09515571 | +--------------+-------------+ </pre> </div> </div> <div id="outline-container-orgheadline12" class="outline-4"> <br /> <h4 id="orgheadline12">USPTO Only</h4> <div class="outline-text-4" id="text-orgheadline12"> <p> Now, let's test the impact of adding still a different publicly available specialized dataset: <a href="http://www.uspto.gov/">USPTO</a>: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://www.uspto.gov"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 6 False positives: 13 True negatives: 19 False negatives: 473 +--------------+-------------+ | key | value | +--------------+-------------+ | :precision | 0.31578946 | | :recall | 0.012526096 | | :accuracy | 0.04892368 | | :f1 | 0.024096385 | | :f2 | 0.015503876 | | :f0.5 | 0.054054055 | +--------------+-------------+ </pre> </div> </div> <div id="outline-container-orgheadline13" class="outline-4"> <br /> <h4 id="orgheadline13">Private Dataset #1</h4> <div class="outline-text-4" id="text-orgheadline13"> <p> Now, let's test the first private dataset: </p> <div 
class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://kbpedia.org/datasets/private/1/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 231 False positives: 109 True negatives: 19 False negatives: 152 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.67941177 | | :recall | 0.60313314 | | :accuracy | 0.4892368 | | :f1 | 0.6390042 | | :f2 | 0.61698717 | | :f0.5 | 0.6626506 | +--------------+------------+ </pre> </div> </div> <div id="outline-container-orgheadline14" class="outline-4"> <br /> <h4 id="orgheadline14">Private Dataset #2</h4> <div class="outline-text-4" id="text-orgheadline14"> <p> And, then, the second private dataset: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://kbpedia.org/datasets/private/2/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 24 False positives: 21 True negatives: 19 False negatives: 447 +--------------+-------------+ | key | value | +--------------+-------------+ | :precision | 0.53333336 | | :recall | 0.050955415 | | :accuracy | 0.08414873 | | :f1 | 0.093023255 | | :f2 | 0.0622084 | | :f0.5 | 0.1843318 | 
+--------------+-------------+ </pre> </div> </div> </div> <div id="outline-container-orgheadline15" class="outline-3"> <br /> <h3 id="orgheadline15">Combined Public Datasets</h3> <div class="outline-text-3" id="text-orgheadline15"> <p> A more realistic analysis is to use a combination of datasets. Let's see what happens to the performance metrics if we start combining the <b>public</b> datasets only. </p> </div> <div id="outline-container-orgheadline16" class="outline-4"> <br /> <h4 id="orgheadline16">Wikipedia + Freebase</h4> <div class="outline-text-4" id="text-orgheadline16"> <p> First, let's start by combining Wikipedia and Freebase. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span> <span style="color: #c7254e;">"http://rdf.freebase.com/ns/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 126 False positives: 60 True negatives: 19 False negatives: 306 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.67741936 | | :recall | 0.29166666 | | :accuracy | 0.28375733 | | :f1 | 0.407767 | | :f2 | 0.3291536 | | :f0.5 | 0.53571427 | +--------------+------------+ </pre> <p> Adding the Freebase dataset to the DBpedia one has the following effects on the different metrics: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in 
%</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">-0.03%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+4.85%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+3.57%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+3.29%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+4.34%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+1.57%</td> </tr> </tbody> </table> <p> </p> <p> As we can see, the impact of adding Freebase to the knowledge base is positive, even if not groundbreaking considering the size of the dataset. </p> </div> </div> <div id="outline-container-orgheadline17" class="outline-4"> <br /> <h4 id="orgheadline17">Wikipedia + USPTO</h4> <div class="outline-text-4" id="text-orgheadline17"> <p> Let's switch Freebase for the other specialized public dataset, USPTO (organizations with trademarks in the US Patent and Trademark Office dataset). 
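As an aside, percentage impacts like those in the table above are simply per-metric relative changes between two runs. A sketch using the Wikipedia-only and Wikipedia + Freebase figures reported earlier (rounding may differ slightly from the table):

```python
def impact(base, combined):
    """Per-metric percent change of a combined run relative to a base run."""
    return {k: 100.0 * (combined[k] - base[k]) / base[k] for k in base}

# Figures taken from the runs above (Wikipedia only vs. Wikipedia + Freebase):
wikipedia_only = {"recall": 0.27816093, "accuracy": 0.2739726, "f1": 0.39477977}
wiki_freebase  = {"recall": 0.29166666, "accuracy": 0.28375733, "f1": 0.407767}

for metric, change in impact(wikipedia_only, wiki_freebase).items():
    print(f"{metric}: {change:+.2f}%")  # e.g. recall: +4.86%
```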
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span> <span style="color: #c7254e;">"http://www.uspto.gov"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 122 False positives: 59 True negatives: 19 False negatives: 311 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.67403316 | | :recall | 0.2817552 | | :accuracy | 0.27592954 | | :f1 | 0.39739415 | | :f2 | 0.31887087 | | :f0.5 | 0.52722555 | +--------------+------------+ </pre> <p> Adding the USPTO dataset to the DBpedia one instead of Freebase has the following effects on the different metrics: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">-0.83%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+1.29%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+0.73%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+0.65%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+1.07%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 
40px">+0.03%</td> </tr> </tbody> </table> <p> </p> <p> As we might have expected, the gains are smaller than with Freebase. This may be partly because the USPTO dataset is smaller and more specialized than Freebase. Because it is more specialized (enterprises with trademarks registered in the US), the gold standard may not represent well the organizations belonging to this dataset. In any case, there are still gains. </p> </div> </div> <div id="outline-container-orgheadline18" class="outline-4"> <br /> <h4 id="orgheadline18">Wikipedia + Freebase + USPTO</h4> <div class="outline-text-4" id="text-orgheadline18"> <p> Let's continue and now include all three datasets. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span> <span style="color: #c7254e;">"http://www.uspto.gov"</span> <span style="color: #c7254e;">"http://rdf.freebase.com/ns/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 127 False positives: 62 True negatives: 19 False negatives: 303 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.6719577 | | :recall | 0.29534882 | | :accuracy | 0.2857143 | | :f1 | 0.41033927 | | :f2 | 0.3326349 | | :f0.5 | 0.53541315 | +--------------+------------+ </pre> <p> Now let's see the impact of adding both Freebase and USPTO to the Wikipedia dataset: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" 
class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">+1.14%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+6.18%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+4.30%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+3.95%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+5.45%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+1.51%</td> </tr> </tbody> </table> <p> </p> <p> This combination of public datasets is a key baseline for the conclusions below. </p> </div> </div> </div> <div id="outline-container-orgheadline19" class="outline-3"> <br /> <h3 id="orgheadline19">Combined Public and Private Datasets</h3> <div class="outline-text-3" id="text-orgheadline19"> <p> Now let's see the impact of adding the private datasets. We will continue to use the combination of the three public datasets (Wikipedia, Freebase and USPTO), to which we will add the private datasets (PD #1 and PD #2). </p> </div> <div id="outline-container-orgheadline20" class="outline-4"> <br /> <h4 id="orgheadline20">Wikipedia + Freebase + USPTO + PD #1</h4> <div class="outline-text-4" id="text-orgheadline20"> <p> We will first add one of the private datasets (PD #1). 
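</p>
<p> As an aside, every metric reported by <code>generate-stats</code> can be recomputed from the four confusion-matrix counts it prints. The following sketch is plain Python, independent of the Clojure tooling and not part of the actual pipeline; it uses the counts from the Wikipedia + Freebase + USPTO baseline run above: </p>

```python
# Recompute precision, recall, and accuracy from the confusion-matrix counts
# printed by generate-stats (Wikipedia + Freebase + USPTO baseline run).
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)           # correct positives among predicted positives
    recall = tp / (tp + fn)              # correct positives among actual positives
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct predictions overall
    return precision, recall, accuracy

precision, recall, accuracy = metrics(tp=127, fp=62, tn=19, fn=303)
print(f"precision={precision:.7f} recall={recall:.7f} accuracy={accuracy:.7f}")
# prints precision=0.6719577 recall=0.2953488 accuracy=0.2857143
```

<p> Recomputing the metrics this way is a handy sanity check when comparing runs. 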
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span> <span style="color: #c7254e;">"http://www.uspto.gov"</span> <span style="color: #c7254e;">"http://rdf.freebase.com/ns/"</span> <span style="color: #c7254e;">"http://kbpedia.org/datasets/private/1/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 279 False positives: 102 True negatives: 19 False negatives: 111 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.7322835 | | :recall | 0.7153846 | | :accuracy | 0.58317024 | | :f1 | 0.7237354 | | :f2 | 0.7187017 | | :f0.5 | 0.7288401 | +--------------+------------+ </pre> <p> When we compare these results to just the combination of the three public datasets, we get these percentage improvements: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">+8.97%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+142.22%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+104.09%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+76.38%</td> </tr> <tr> <td class="org-left">f2</td> 
<td class="org-right" style="padding-left: 40px">+116.08%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+36.12%</td> </tr> </tbody> </table> <p> </p> <p> If we run the private dataset #1 alone (not in combination with the public ones), we get these lesser improvements: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">+7.77%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+18.60%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+19.19%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+13.25%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+16.50%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+9.99%</td> </tr> </tbody> </table> <p> </p> <p> So, while the highly targeted private dataset #1 performs better than the three combined public datasets, the combination of private dataset #1 and the three public ones shows still further improvements. </p> </div> </div> <div id="outline-container-orgheadline21" class="outline-4"> <br /> <h4 id="orgheadline21">Wikipedia + Freebase + USPTO + PD #2</h4> <div class="outline-text-4" id="text-orgheadline21"> <p> We can repeat this analysis, only now focusing on the second private dataset (PD #2). 
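</p>
<p> The "Impact in %" columns in these tables are simple relative changes between a baseline run and an augmented run. A minimal sketch of the calculation, in plain Python rather than the actual tooling (the small last-digit differences versus the tables come from starting with the rounded metric values printed above): </p>

```python
# Relative change of each metric between a baseline run and an augmented run,
# expressed in percent, as in the "Impact in %" tables.
def impact(baseline, augmented):
    return {k: (augmented[k] - baseline[k]) / baseline[k] * 100.0
            for k in baseline}

# Baseline: Wikipedia + Freebase + USPTO; augmented: the same plus PD #1.
baseline = {"precision": 0.6719577, "recall": 0.29534882, "accuracy": 0.2857143}
with_pd1 = {"precision": 0.7322835, "recall": 0.7153846, "accuracy": 0.58317024}
for metric, pct in impact(baseline, with_pd1).items():
    print(f"{metric}: {pct:+.2f}%")   # e.g. recall comes out at +142.22%
```

<p>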
This first run combines the three public datasets with PD #2: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span> <span style="color: #c7254e;">"http://www.uspto.gov"</span> <span style="color: #c7254e;">"http://rdf.freebase.com/ns/"</span> <span style="color: #c7254e;">"http://kbpedia.org/datasets/private/2/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 138 False positives: 69 True negatives: 19 False negatives: 285 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.6666667 | | :recall | 0.32624114 | | :accuracy | 0.3072407 | | :f1 | 0.43809524 | | :f2 | 0.36334914 | | :f0.5 | 0.55155873 | +--------------+------------+ </pre> <p> We can see that PD #2 in combination with the three public datasets does not perform as well as PD #1 added to the three public ones. This observation just affirms that not all of the private datasets have equivalent impact. Here are the percent differences for when PD #2 is added to the three public datasets vs. 
the three public datasets alone: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">-0.78%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+10.46%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+7.52%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+6.75%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+9.23%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+3.00%</td> </tr> </tbody> </table> <p> </p> <p> In this case, we actually see that precision drops when adding PD #2, though accuracy is still improved. </p> </div> </div> <div id="outline-container-orgheadline22" class="outline-4"> <br /> <h4 id="orgheadline22">Wikipedia + Freebase + USPTO + PD #1 + PD #2</h4> <div class="outline-text-4" id="text-orgheadline22"> <p> Now that we have seen the impact of PD #1 and PD #2 in isolation, let's see what happens when we combine <b>all</b> of the public and private datasets. 
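</p>
<p> The three F-scores in these tables are all instances of the general F-beta measure, which weighs recall beta times as much as precision: F<sub>beta</sub> = (1 + beta&sup2;) &middot; P &middot; R / (beta&sup2; &middot; P + R). A quick check in plain Python (again, not part of the tooling), using the precision and recall from the PD #2 run above: </p>

```python
# General F-beta score: beta > 1 favours recall, beta < 1 favours precision.
def f_beta(precision, recall, beta):
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall from the Wikipedia + Freebase + USPTO + PD #2 run;
# the results agree with the :f1, :f2, and :f0.5 values printed for that run.
p, r = 0.6666667, 0.32624114
for beta in (1.0, 2.0, 0.5):
    print(f"f{beta:g} = {f_beta(p, r, beta):.7f}")
```

<p>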
First, let's look at the raw metrics of the run: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #A6E22E;">[</span><span style="color: #c7254e;">"http://dbpedia.org/resource/"</span> <span style="color: #c7254e;">"http://www.uspto.gov"</span> <span style="color: #c7254e;">"http://rdf.freebase.com/ns/"</span> <span style="color: #c7254e;">"http://kbpedia.org/datasets/private/1/"</span> <span style="color: #c7254e;">"http://kbpedia.org/datasets/private/2/"</span><span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 285 False positives: 102 True negatives: 19 False negatives: 105 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.7364341 | | :recall | 0.7307692 | | :accuracy | 0.59491193 | | :f1 | 0.7335907 | | :f2 | 0.7318952 | | :f0.5 | 0.7352941 | +--------------+------------+ </pre> <p> As before, let's look at the percentage changes due to adding both of the private datasets #1 and #2 to the three public datasets: </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">+9.60%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+147.44%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+108.22%</td> </tr> 
<tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+78.77%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+120.02%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+37.31%</td> </tr> </tbody> </table> <p> </p> <p> Note that, for all metrics, this total combination of <b>all</b> datasets performs best among the tested combinations. </p> </div> </div> </div> <div id="outline-container-orgheadline23" class="outline-3"> <br /> <h3 id="orgheadline23">Adding Unknown Entities Tagger</h3> <div class="outline-text-3" id="text-orgheadline23"> <p> There is one last feature of the publisher analyzer that we should highlight. The analyzer can also identify <code>unknown entities</code> on a web page. (An "unknown entity" is one identified as a likely organization, but which does not already exist in the knowledge base.) Sometimes the unknown entity is itself the publisher of the web page. The value of unknown entity identification is to flag possible new entities (organizations, in this case) that should be considered for addition to the overall knowledge base. 
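</p>
<p> Conceptually, this step partitions the tagger's detected organization mentions into those that link to an entity in the loaded datasets and those that do not. The sketch below is purely illustrative; the function name and data structures are hypothetical, not the actual analyzer API: </p>

```python
# Hypothetical illustration of unknown-entity flagging: any detected
# organization mention that cannot be linked to an entity in the loaded
# datasets is flagged as a candidate addition to the knowledge base.
def flag_unknown(detected_orgs, known_entities):
    known, unknown = [], []
    for org in detected_orgs:
        (known if org in known_entities else unknown).append(org)
    return known, unknown

kb = {"Acme Corp", "Globex"}          # entities already in the knowledge base
detected = ["Acme Corp", "Initech"]   # organizations the tagger detected
known, unknown = flag_unknown(detected, kb)
print(unknown)  # ['Initech'] -> flagged for possible addition to the KB
```

<p> The run below enables this feature over all of the datasets at once. 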
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>table <span style="color: #66D9EF;">(</span>generate-stats <span style="color: #AE81FF;">:js</span> <span style="color: #AE81FF;">:execute</span> <span style="color: #AE81FF;">:datasets</span> <span style="color: #AE81FF;">:all</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positives: 345 False positives: 104 True negatives: 19 False negatives: 43 +--------------+------------+ | key | value | +--------------+------------+ | :precision | 0.76837415 | | :recall | 0.88917524 | | :accuracy | 0.7123288 | | :f1 | 0.82437277 | | :f2 | 0.86206895 | | :f0.5 | 0.78983516 | +--------------+------------+ </pre> <p> As we can see, the overall accuracy improved by <code>19.73%</code> when considering the unknown entities compared to the public and private datasets. </p> <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides"> <colgroup> <col class="org-left" /> <col class="org-right" /> </colgroup> <thead> <tr> <th scope="col" class="org-left">metric</th> <th scope="col" class="org-right" style="padding-left: 40px">Impact in %</th> </tr> </thead> <tbody> <tr> <td class="org-left">precision</td> <td class="org-right" style="padding-left: 40px">+4.33%</td> </tr> <tr> <td class="org-left">recall</td> <td class="org-right" style="padding-left: 40px">+21.67%</td> </tr> <tr> <td class="org-left">accuracy</td> <td class="org-right" style="padding-left: 40px">+19.73%</td> </tr> <tr> <td class="org-left">f1</td> <td class="org-right" style="padding-left: 40px">+12.37%</td> </tr> <tr> <td class="org-left">f2</td> <td class="org-right" style="padding-left: 40px">+17.79%</td> </tr> <tr> <td class="org-left">f0.5</td> <td class="org-right" style="padding-left: 40px">+7.42%</td> </tr> </tbody> </table> <p> </p> </div> </div> </div> <div id="outline-container-orgheadline24" class="outline-2"> 
<br /> <h2 id="orgheadline24">Discussion of Results</h2> <div class="outline-text-2" id="text-orgheadline24"> <p> When we first tested the system with single datasets, some scored better than others on most of the metrics. Does that mean we could simply use the best of them and be done with it? No. What this analysis tells us is that some datasets score better for this particular set of web pages because they cover more of the entities found in those pages. Even a lower-scoring dataset is not useless: that <i>worse</i> dataset may cover a prediction area not covered by a <i>better</i> one, which means that by combining the two we can improve the general prediction power of the system. This is what we see when adding the private datasets to the public ones. </p> <p> Even though the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still benefits greatly from the contribution of the public ones. We achieve a gain of <code>108%</code> in accuracy by adding the private datasets to KBpedia's public ones. This tells us that KBpedia is a very useful structure and starting point for an entity tagging effort, but that adding domain data is probably essential to reach the overall accuracy required by enterprises. </p> <p> Another thing this series of tests demonstrates is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets does not appear to lower the overall performance of the system (though we did see one slight decrease in precision for PD #2, even as all its other metrics improved). Generally, the more data available for a given task, the better the results. </p> <p> Finally, adding a feature to the system can also greatly improve its overall accuracy. 
In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improves the overall accuracy by another <code>20%</code>. How is that possible? To understand this we have to consider the domain: random web pages on the Web. A web page can be published by any person or any organization, so the <a href="https://en.wikipedia.org/wiki/Long_tail">long tail</a> of web page publishers is probably quite long. Given this, it is natural that existing knowledge bases do not contain all of the obscure organizations that publish web pages, which is most likely why a system that can detect and predict unknown entities as the publishers of web pages has a significant impact on the overall accuracy of the system. The flagging of such "unknown" entities also tells us where to focus efforts when adding to the known roster of existing publishers. </p> </div> </div> <div id="outline-container-orgheadline25" class="outline-2"> <br /> <h2 id="orgheadline25">Conclusion</h2> <div class="outline-text-2" id="text-orgheadline25"> <p> As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others, but overall, each dataset contributes to the improvement of the predictions. 
</p> </div> </div> </div> <div class="col-md-2"> </div><!--/col-md-2--> </div> </div><!--/container--> <!--=== End Content Part ===--> <div class="footer-v1"> <div class="footer"> <div class="container"> <div class="row"> <!-- About --> <div class="col-md-3 md-margin-bottom-40"> <table> <tbody><tr> <td> <a href="/"> <img id="logo-footer" class="footer-logo" src="/imgs/logo-simple-purple.png" alt="" name="logo-footer"> </a> </td> </tr> <tr> <td> <center> <p> KBpedia </p> </center> </td> </tr> </tbody></table> <p style="font-size: 0.85em;"> KBpedia exploits large-scale knowledge bases and semantic technologies for machine learning, data interoperability and mapping, and fact extraction and tagging. </p> </div><!--/col-md-3--> <!-- End About --> <!-- Latest --> <div class="col-md-3 md-margin-bottom-40"> <div class="posts"> <div class="headline"> <h2> Latest News </h2> </div> <ul class="list-unstyled latest-list"> <!-- HTML generated from an RSS Feed by rss2html.php, http://www.FeedForAll.com/ a NotePage, Inc. product (http://www.notepage.com/) --> <li> <a href="https://kbpedia.org/resources/news/kbpedia-adds-ecommerce/">KBpedia Adds Major eCommerce Capabilities</a> <small>06/15/2020</small> </li> <!-- HTML generated from an RSS Feed by rss2html.php, http://www.FeedForAll.com/ a NotePage, Inc. product (http://www.notepage.com/) --> <li> <a href="http://kbpedia.org/resources/news/kbpedia-continues-quality-improvements/">KBpedia Continues Quality Improvements</a> <small>12/04/2019</small> </li> <!-- HTML generated from an RSS Feed by rss2html.php, http://www.FeedForAll.com/ a NotePage, Inc. 
product (http://www.notepage.com/) --> <li> <a href="http://kbpedia.org/resources/news/wikidata-coverage-nearly-complete/">Wikidata Coverage Nearly Complete (98%)</a> <small>04/08/2019</small> </li> </ul> </div> </div><!--/col-md-3--><!-- End Latest --><!-- Link List --> <div class="col-md-3 md-margin-bottom-40"> <div class="headline"> <h2> Other Resources </h2> </div> <ul class="list-unstyled link-list"> <li> <a href="/resources/about/">About</a> </li> <li> <a href="/resources/faq/">FAQ</a> </li> <li> <a href="/resources/news/">News</a> </li> <li> <a href="/use-cases/">Use Cases</a> </li> <li> <a href="/resources/documentation/">Documentation</a> </li> <li> <a href="/resources/privacy/">Privacy</a> </li> <li> <a href="/resources/terms-of-use/">Terms of Use</a> </li> </ul> </div><!--/col-md-3--> <!-- End Link List --><!-- Address --> <div class="col-md-3 map-img md-margin-bottom-40"> <div class="headline"> <h2> Contact Us </h2> </div> <address class="md-margin-bottom-40"> c/o <a href="mailto:info@mkbergman.com?subject=KBpedia%20Inquiry">Michael K. Bergman</a> <br> 380 Knowling Drive <br> Coralville, IA 52241 <br> U.S.A. <br> Voice: +1 319 621 5225 </address> </div><!--/col-md-3--> <!-- End Address --> </div> </div> </div><!--/footer--> <div class="copyright"> <div class="container"> <div class="row"> <div class="col-md-7"> <p class="copyright" style="font-size: 10px;"> 2016-2022 &copy; <a href="http://kbpedia.org" style="font-size: 10px;">Michael K. Bergman.</a> All Rights Reserved. 
</p> </div> <!-- Social Links --> <div class="col-md-5"> <ul class="footer-socials list-inline"> <li> <a href="/resources/feeds/news.xml" class="tooltips" data-toggle="tooltip" data-placement="top" title="" data-original-title="RSS feed"> <i class="fa fa-rss-square"></i> </a> <br></li> <li> <a href="http://github.com/Cognonto" class="tooltips" data-toggle="tooltip" data-placement="top" title="" data-original-title="Github"> <i class="fa fa-github"></i> </a> <br></li> <li> <a href="http://twitter.com/cognonto" class="tooltips" data-toggle="tooltip" data-placement="top" title="" data-original-title="Twitter"> <i class="fa fa-twitter"></i> </a> <br></li> </ul> </div> <!-- End Social Links --> </div> </div> </div><!--/copyright--> </div><!--=== End Footer Version 1 ===--> <!--/wrapper--> <!-- JS Global Compulsory --> <script type="text/javascript" src="/assets/plugins/jquery/jquery.min.js"></script> <script type="text/javascript" src="/assets/plugins/jquery/jquery-migrate.min.js"></script> <script type="text/javascript" src="/assets/plugins/bootstrap/js/bootstrap.min.js"></script> <!-- JS Implementing Plugins --> <script type="text/javascript" src="/assets/plugins/back-to-top.js"></script> <!-- JS Customization --> <script type="text/javascript" src="/assets/js/custom.js"></script> <!-- JS Page Level --> <script type="text/javascript" src="/assets/js/app.js"></script> <!-- JS Implementing Plugins --> <script type="text/javascript" src="/assets/plugins/smoothScroll.js"></script> <script type="text/javascript" src="/assets/plugins/owl-carousel/owl-carousel/owl.carousel.js"></script> <script type="text/javascript" src="/assets/plugins/layer-slider/layerslider/js/greensock.js"></script> <script type="text/javascript" src="/assets/plugins/layer-slider/layerslider/js/layerslider.transitions.js"></script> <script type="text/javascript" src="/assets/plugins/layer-slider/layerslider/js/layerslider.kreaturamedia.jquery.js"></script> <!-- JS Customization --> <script 
type="text/javascript" src="/assets/js/custom.js"></script> <!-- JS Page Level --> <script type="text/javascript" src="/assets/js/plugins/layer-slider.js"></script> <script type="text/javascript" src="/assets/js/plugins/style-switcher.js"></script> <script type="text/javascript" src="/assets/js/plugins/owl-carousel.js"></script> <script type="text/javascript" src="/assets/js/plugins/owl-recent-works.js"></script> <script type="text/javascript"> jQuery(document).ready(function() { App.init(); LayerSlider.initLayerSlider(); StyleSwitcher.initStyleSwitcher(); OwlCarousel.initOwlCarousel(); OwlRecentWorks.initOwlRecentWorksV2(); }); </script> <!--[if lt IE 9]> <script src="assets/plugins/respond.js"></script> <script src="assets/plugins/html5shiv.js"></script> <script src="assets/plugins/placeholder-IE-fixes.js"></script> <![endif]--> <!--[if lt IE 9]> <script src="/assets/plugins/respond.js"></script> <script src="/assets/plugins/html5shiv.js"></script> <script src="/assets/js/plugins/placeholder-IE-fixes.js"></script> <![endif]--> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-84405507-1', 'auto'); ga('send', 'pageview'); </script>