<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"> <!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--> <!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--> <!--[if !IE]><!--> <html lang="en"> <!--<![endif]--> <head> <title>Dynamic Machine Learning Using the KBpedia Knowledge Graph</title> <link rel="alternate" type="application/rss+xml" title="" href="/resources/feeds/news.xml" /> <!-- Meta --> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="description" content=""> <meta name="author" content=""> <!-- Favicon --> <link rel="shortcut icon" href="/favicon.ico"> <!-- CSS Global Compulsory --> <link rel="stylesheet" href="/assets/plugins/bootstrap/css/bootstrap.min.css"> <link rel="stylesheet" href="/assets/css/style.css"> <!-- CSS Implementing Plugins --> <link rel="stylesheet" href="/assets/plugins/line-icons/line-icons.css"> <link rel="stylesheet" href="/assets/plugins/font-awesome/css/font-awesome.min.css"> <!-- CSS Theme --> <link rel="stylesheet" href="/assets/css/theme-colors/blue.css"> <link rel="stylesheet" href="/assets/css/footers/footer-v1.css"> <link rel="stylesheet" href="/assets/css/headers/header-v1.css"> <!-- CSS Customization --> <link rel="stylesheet" href="/css/custom.css"> <meta property="og:site_name" content="KBpedia"/> <meta property="og:title" content="Machine Learning Use Cases: Dynamic Machine Learning Using the KG"> <meta property="og:type" content="article"/> <meta property="og:description" content="The automated ways to select training sets and corpuses inherent in KBpedia, particularly in conjunction with setting up gold standards for analyzing test runs, enable much more time to be spent on refining the input data and machine learning parameters to obtain &quot;best&quot; results."> <meta property="og:image" content="/imgs/kbpedia-logo-350.png"> <meta property="og:image:width" content="350"> <meta property="og:image:height" content="390"> </head> <body> <div class="wrapper"> 
<div class="header-v1"> <!-- Topbar --> <div class="topbar"> <div class="container"> <div class="col-md-1"> </div> <a class="navbar-brand" href="/"> <img id="logo-header" src="/imgs/kbpedia-logo-420-horz.png" height="75" alt="KBpedia Knowledge Structure" name="logo-header"> </a> </div> </div><!-- End Topbar --> <!-- Navbar --> <div class="navbar navbar-default mega-menu" role="navigation"> <div class="container"> <!-- Brand and toggle get grouped for better mobile display --> <div class="navbar-header"> <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-responsive-collapse"> <span class="sr-only">Toggle navigation</span> </button> </div> <div style="clear:both; height: 1px;"> &nbsp; </div> <!-- Collect the nav links, forms, and other content for toggling --> <div class="col-md-1"> &nbsp; </div><!--/col-md-1--> <div class="collapse navbar-collapse navbar-responsive-collapse col-md-10"> <ul class="nav navbar-nav pull-left"> <!-- Demo --> <li> <a href="/">Home</a> </li> <!-- Home --> <li> <a href="/knowledge-graph/">Knowledge Graph</a> </li> <li> <a href="http://sparql.kbpedia.org/">SPARQL</a> </li> <!-- Background --> <li class="dropdown"> <a href="/background/">Background</a> <ul class="dropdown-menu"> <li> <a href="/background/overview/">Overview</a> </li> <li> <a href="/background/features-and-benefits/">Features &amp; Benefits</a> </li> <li> <a href="/background/data-and-knowledge-structures/">Data and Knowledge Structures</a> </li> <li> <a href="/background/technology/">Technology</a> </li> <li> <a href="/background/machine-learning/">Machine Learning</a> </li> <li> <a href="/background/uses/">Uses</a> </li> </ul> </li><!-- End Uses --> <li class="dropdown"> <a href="/use-cases/">Use Cases</a> <ul class="dropdown-menu"> <li class="dropdown-submenu"> <a href="/use-cases/knowledge-graph/">Knowledge Graph (KG)</a> <ul class="dropdown-menu"> <li><a href="/use-cases/browse-the-knowledge-graph/">Browse the Knowledge Graph</a></li> 
<li><a href="/use-cases/search-the-knowledge-graph/">Search the Knowledge Graph</a></li> <li><a href="/use-cases/expand-queries-using-semsets/">Expand Queries Using Semsets</a></li> <li><a href="/use-cases/use-and-control-of-inferencing/">Uses and Control of Inferencing</a></li> </ul> </li> <li class="dropdown-submenu"> <a href="/use-cases/machine-learning-use-case/">Machine Learning (KBAI)</a> <ul class="dropdown-menu"> <li><a href="/use-cases/text-classification-using-esa-and-svm/">Create Supervised Learning Training Sets</a></li> <li><a href="/use-cases/document-specific-word2vec-training-corpuses/">Create Word Embedding Corpuses</a></li> <li><a href="/use-cases/extending-kbpedia-with-kbpedia-categories/">Create Graph Embedding Corpuses</a></li> <li><a href="/use-cases/text-classification-using-esa-and-svm/">Classify Text</a></li> <li><a href="/use-cases/dynamic-machine-learning/">Create 'Gold Standards' for Tuning Learners</a></li> <li><a href="/use-cases/disambiguating-kbpedia-knowledge-graph-concepts/">Disambiguate KG Concepts</a></li> <li><a href="/use-cases/dynamic-machine-learning/">Dynamic Machine Learning Using the KG</a></li> </ul> </li> <li class="dropdown-submenu"> <a href="/use-cases/mapping-use-case/">Mapping</a> <ul class="dropdown-menu"> <li><a href="/use-cases/mapping-external-data-and-schema/">Map Concepts</a></li> <li><a href="/use-cases/extending-kbpedia-with-kbpedia-categories/">Extend KBpedia with Wikipedia</a></li> <li><a href="/use-cases/benefits-from-extending-kbpedia-with-private-datasets/">Extend KBpedia for Domains</a></li> <li><a href="/use-cases/mapping-external-data-and-schema/">General Use of the Mapper</a></li> </ul> </li> </ul> </li> <li class="dropdown"> <a href="/resources/">Resources</a> <ul class="dropdown-menu"> <li><a href="/resources/downloads/">Download KBpedia</a></li> <li><a href="/resources/about/">About KBpedia</a></li> <li><a href="/resources/faq/">KBpedia FAQ</a></li> <li><a href="/resources/news/">News About 
KBpedia</a></li> <li><a href="/resources/statistics/">KBpedia Statistics</a></li> <li><a href="/resources/documentation/">Additional Documentation</a></li> <li><a href="/resources/support/">Support for KBpedia</a></li> </ul> </li> </ul> </div><!--/navbar-collapse--> <div class="col-md-1"> &nbsp; </div><!--/col-md-1--> </div> </div><!-- End Navbar --> </div><!--=== End Header ===--> <!--=== Breadcrumbs ===--> <div class="breadcrumbs"> <div class="container"> <div class="col-md-1"> &nbsp; </div><!--/col-md-1--> <div class="col-md-10"> <h1 class="pull-left"></h1> <ul class="pull-right breadcrumb"> <li>Use Cases</li> <li class="active">Dynamic Machine Learning</li> </ul> </div><!--/col-md-10--> <div class="col-md-1"> &nbsp; </div><!--/col-md-1--> </div> </div> <!--/breadcrumbs--> <!--=== End Breadcrumbs ===--> <!--=== Content Part ===--> <div class="container content"> <div class="row"> <div class="col-md-2"> &nbsp; </div> <div class="col-md-8"> <div class="use-cases-header"> <table border="0" cellpadding="4" cellspacing="2"> <tbody> <tr> <td colspan="2" align="center"> <h2> <b>USE CASE</b> </h2> </td> </tr> <tr> <td style="width: 140px;" valign="top"> <b>Title:</b> </td> <td style="padding-left: 25px;" valign="top"> <span style="font-weight: bold;">Dynamic Machine Learning Using the KBpedia Knowledge Graph</span> </td> </tr> <tr> <td valign="top"> <b>Short Description:</b> </td> <td style="padding-left: 25px;" valign="top"> The automated ways to select training sets and corpuses inherent in KBpedia, particularly in conjunction with setting up gold standards for analyzing test runs, enable much more time to be spent on refining the input data and machine learning parameters to obtain "best" results. </td> </tr> <tr> <td valign="top"> <b>Problem:</b> </td> <td style="padding-left: 25px;" valign="top"> After initial set-up, or due to a change in the input data, we want to test and refine the parameters of our machine learners to obtain the best results. 
</td> </tr> <tr> <td valign="top"> <b>Approach:</b> </td> <td style="padding-left: 25px;" valign="top"> Because of the nearly automatic way KBpedia's knowledge structure can be used to generate training sets and corpuses for machine learners, more time can be spent in the critical phases of testing and refining the actual choice and use of the learners. This efficiency also allows multiple learners to be combined for ensemble learning, and means that we can also devote time to improving the input data used to generate the training sets for the learners. This use case highlights these capabilities by rapid testing of feature selection, hyperparameter optimization, and ensemble learning. </td> </tr> <tr> <td valign="top"> <b>Key Findings:</b> </td> <td valign="top"> <ul> <li>Rapid selection and creation of training sets and corpuses removes a time-consuming bottleneck in machine learning </li> <li>Reduced effort for training-set selection means more time can be spent on refining the input parameters to the machine learners or on combining learners for ensemble learning </li> <li>Depending on the domain of interest, different strategies and techniques can lead to better predictions </li> <li>More often than not, multiple different training corpuses, learners, and hyperparameters need to be tested before ending up with the best prediction models </li> <li>A combination of techniques can raise <span style="font-weight: bold;">both</span> <span style="font-style: italic;">precision</span> and <span style="font-style: italic;">recall</span>, leading to higher overall performance. 
</li> </ul> </td> </tr> </tbody> </table> </div> </div> <div class="col-md-2"> &nbsp; </div><!--/col-md-2--> </div> <div class="row">&nbsp;</div> <div class="row">&nbsp;</div> <div class="row"> <div class="col-md-2"> &nbsp; </div> <div class="col-md-8"> <p> Another use case, <a href="/use-cases/text-classification-using-esa-and-svm/">Text Classification Using ESA and SVM</a>, explains how one can use KBpedia to create positive and negative training sets automatically for different machine learning tasks. That use case explains how SVM classifiers may be trained and used to check whether an input text belongs to the defined domain or not. </p> <p> The current use case extends this idea to explain how KBpedia can be used, along with other machine learning techniques, to cope with dynamic situations that may alter data or input assumptions. The variations we investigate are feature selection, hyperparameter optimization, and ensemble learning. The emphasis here is on the testing and refining of machine learners, versus the set-up and configuration perspectives covered in other use cases. </p> <p> Depending on the domain of interest, and depending on the required <code>precision</code> or <code>recall</code>, different strategies and techniques can lead to better predictions. More often than not, multiple different training corpuses, learners, and hyperparameters need to be tested before ending up with the best possible prediction model. The key takeaway from this use case is that KBpedia can be used to fully automate the creation of a wide range of different training corpuses, to create models, to optimize their hyperparameters, and to evaluate those models. 
</p> <div id="outline-container-org6e9a330" class="outline-2"> <br /> <h2 id="org6e9a330">New Knowledge Graph and Reasoning</h2> <div class="outline-text-2" id="text-org6e9a330"> <p> One of the variations in this investigation is to look at the potential impact of a new version of KBpedia (<code>version 1.10</code> in this case). A knowledge graph such as KBpedia is not static. It constantly evolves, gets fixed, and improves. New concepts are created, deprecated concepts are removed, new linkages to external data sources are created, etc. This growth means that any of these changes can have a positive impact on the creation of the positive and negative training sets. Applications based on KBpedia should be tested against any new knowledge graph that is released to see if their models improve. Better concepts, better structure, and more linkages will often lead to better training sets as well. </p> <p> Such growth in KBpedia (or in combination with domain information linked to it) is also why automating, and more importantly testing, this process is crucial. Upon the release of major new versions, we are able to automate all of these steps to see the final impact of upgrading the knowledge graph: </p> <ol class="org-ol"> <li>Aggregate all the reference concepts that scope the specified domain (by inference)</li> <li>Create the positive and negative training corpuses</li> <li>Prune the training corpuses</li> <li>Configure the classifier (in this case, create the semantic vectors for ESA)</li> <li>Train the model (in this case, the SVM model)</li> <li>Optimize the hyperparameters of the algorithm (in this case, the linear SVM hyperparameters), and</li> <li>Evaluate the model on multiple gold standards.</li> </ol> <p> Because each of these steps belongs to an automated workflow, we can easily check the impact of updating the KBpedia Knowledge Graph on our models. 
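</p> <p> The automated workflow can be pictured as a simple step runner. The sketch below is illustrative Python, not the project's actual (Clojure) tooling; the step names mirror the numbered list above, and the step bodies are hypothetical placeholders. </p>

```python
# Hypothetical sketch of the automated upgrade check: thread a shared
# context dict through a fixed sequence of steps, logging what ran, so
# the same pipeline can be replayed against each new KBpedia release.
def run_workflow(steps, ctx):
    """Run each (name, step) pair in order, recording the step names."""
    for name, step in steps:
        ctx = step(ctx)
        ctx.setdefault("log", []).append(name)
    return ctx

# Placeholder steps; the real pipeline has seven, from concept
# aggregation through evaluation against the gold standards.
steps = [
    ("aggregate-concepts", lambda c: {**c, "concepts": ["Music", "Musician"]}),
    ("create-corpuses",    lambda c: {**c, "corpus": list(c["concepts"])}),
    ("evaluate-model",     lambda c: {**c, "evaluated": True}),
]
result = run_workflow(steps, {})
```

<p> Because every step is an ordinary function, swapping in a new knowledge graph release changes only the inputs, not the workflow itself.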
</p> </div> </div> <div id="outline-container-orgaf7519c" class="outline-2"> <br /> <h2 id="orgaf7519c">Reasoning Over The Knowledge Graph</h2> <div class="outline-text-2" id="text-orgaf7519c"> <p> A new step we have added to this current use case is to use a reasoner to reason over the KBpedia knowledge graph. The reasoner is used when we define the scope of the domain to classify. We will browse the knowledge graph to see which <i>seed</i> reference concepts we should add to the scope. Then we will use a reasoner to extend the models to include any new sub-classes relevant to the scope of the domain. This means that we may add additional, more specific features to the final model. </p> </div> </div> <div id="outline-container-org1353d5f" class="outline-2"> <br /> <h2 id="org1353d5f">Update Domain Training Corpus Using KBpedia 1.10 and a Reasoner</h2> <div class="outline-text-2" id="text-org1353d5f"> <p> Recall that a prior <a href="http://kbpedia.org/use-cases/text-classification-using-esa-and-svm/">use case</a> used <i>Music</i> as its domain scope. The first step is to use the updated KBpedia version <code>1.10</code> along with a reasoner to create the full scope of this updated Music domain. </p> <p> The result of using this new version and a reasoner is that we now end up with <code>196</code> features (reference documents) instead of <code>64</code> with the previous version. This also means that we will have 196 documents in our positive training set if we only use the Wikipedia pages linked to these reference concepts (and not their related named entities). 
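</p> <p> To make the reasoning step concrete: extending the seed concepts with their inferred sub-classes amounts to a transitive closure over the sub-class relation. The actual pipeline does this with the OWL API and a reasoner from Clojure; the toy Python sketch below, with a purely hypothetical hierarchy fragment, only illustrates the idea. </p>

```python
# Toy sketch of extending seed reference concepts with all of their
# (transitive) sub-classes. The hierarchy fragment below is purely
# illustrative, not actual KBpedia structure.
def expand_with_subclasses(seeds, subclass_of):
    """Return the seeds plus every concept that is a direct or indirect
    sub-class of a seed, via a simple worklist traversal."""
    # Invert the child -> parents map into parent -> direct children.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    scope, queue = set(seeds), list(seeds)
    while queue:
        concept = queue.pop()
        for sub in children.get(concept, ()):
            if sub not in scope:
                scope.add(sub)
                queue.append(sub)
    return scope

# Hypothetical fragment: two seeds pull in four more specific concepts.
hierarchy = {
    "Guitarist": ["Musician"],
    "JazzMusician": ["Musician"],
    "StringInstrument": ["MusicalInstrument"],
    "Violin": ["StringInstrument"],
}
scope = expand_with_subclasses(["Musician", "MusicalInstrument"], hierarchy)
```

<p> This closure over sub-classes is how a handful of seed concepts can grow into a much larger feature set once inference is applied.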
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>use '<span style="color: #66D9EF;">cognonto-esa.core</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>require '<span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">cognonto-owl.core</span> <span style="color: #AE81FF;">:as</span> owl<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>require '<span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">cognonto-owl.reasoner</span> <span style="color: #AE81FF;">:as</span> reasoner<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">kbpedia-manager</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">owl</span><span style="color: #c7254e;">/</span>make-ontology-manager<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">kbpedia</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">owl</span><span style="color: #c7254e;">/</span>load-ontology <span style="color: #E6DB74;">"resources/kbpedia_reference_concepts_linkage.n3"</span> <span style="color: #AE81FF;">:manager</span> kbpedia-manager<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">kbpedia-reasoner</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">reasoner</span><span style="color: #c7254e;">/</span>make-reasoner kbpedia<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: 
#F92672;">define-domain-corpus</span> <span style="color: #66D9EF;">[</span><span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/Music"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/Musician"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/MusicPerformanceOrganization"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/MusicalInstrument"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/Album-CW"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/Album-IBO"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/MusicalComposition"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/MusicalText"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"</span> <span style="color: #E6DB74;">"http://kbpedia.org/kko/rc/MusicalPerformer"</span><span style="color: #66D9EF;">]</span> kbpedia <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.csv"</span> <span style="color: #AE81FF;">:reasoner</span> kbpedia-reasoner<span style="color: #AE81FF;">)</span> </pre> </div> </div> </div> <div id="outline-container-orgc60be2f" class="outline-2"> <br /> <h2 id="orgc60be2f">Create Training Corpuses</h2> <div class="outline-text-2" id="text-orgc60be2f"> <p> The next step is to create the actual training corpuses: the general and domain ones. We have to load the dictionaries we created in the previous step, and then to locally cache and normalize the corpuses. Remember that the normalization steps are: </p> <ol class="org-ol"> <li>Defluff the raw HTML page. 
We convert the HTML into text, and we only keep the body of the page</li> <li>Normalize the text with the following rules: <ol class="org-ol"> <li>remove diacritical characters</li> <li>remove everything between brackets like: [edit] [show]</li> <li>remove punctuation</li> <li>remove all numbers</li> <li>remove all invisible control characters</li> <li>remove all [math] symbols</li> <li>remove all words with 2 characters or fewer</li> <li>remove line and paragraph separators</li> <li>remove anything that is not an alpha character</li> <li>normalize spaces</li> <li>put everything in lower case, and</li> <li>remove stop words.</li> </ol></li> </ol> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>load-dictionaries <span style="color: #E6DB74;">"resources/general-corpus-dictionary.csv"</span> <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.csv"</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>cache-corpus<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>normalize-cached-corpus <span style="color: #E6DB74;">"resources/corpus/"</span> <span style="color: #E6DB74;">"resources/corpus-normalized/"</span><span style="color: #AE81FF;">)</span> </pre> </div> </div> </div> <div id="outline-container-orgdefc681" class="outline-2"> <br /> <h2 id="orgdefc681">Create New Gold Standard</h2> <div class="outline-text-2" id="text-orgdefc681"> <p> Because we never have enough instances in our gold standards to test against, let's create a third one, but this time adding a music-related news feed that will add more positive examples to the gold standard. 
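</p> <p> As an aside, the normalization rules listed in the previous section can be approximated in a few lines. The Python sketch below is only an approximation of the pipeline's actual (Clojure) implementation, and its stop-word list is an illustrative subset. </p>

```python
import re
import unicodedata

STOP_WORDS = {"the", "and", "for", "with"}  # illustrative subset only

def normalize(text):
    """Approximate the normalization rules listed above."""
    # Remove diacritics: decompose, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Remove everything between square brackets, e.g. [edit] [show].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Lower-case, then keep alphabetic characters only; this also drops
    # punctuation, numbers, control characters and math symbols.
    text = re.sub(r"[^a-z]+", " ", text.lower())
    # Drop words of 2 characters or fewer and stop words; the re-join
    # normalizes spacing.
    return " ".join(t for t in text.split()
                    if len(t) > 2 and t not in STOP_WORDS)

print(normalize("The Violins [edit] were tuned in 1959 à Paris!"))
# → violins were tuned paris
```

<p>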
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">create-gold-standard-from-feeds</span> <span style="color: #66D9EF;">[</span>name<span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>feeds <span style="color: #E6DB74;">[</span><span style="color: #E6DB74;">"http://www.music-news.com/rss/UK/news"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/topstories.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/world.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/politics.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/business.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/health.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/arts.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/technology.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/offbeat.xml"</span> <span style="color: #E6DB74;">"http://www.cbc.ca/cmlink/rss-cbcaboriginal"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/sports.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-calgary.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-montreal.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-pei.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-ottawa.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-toronto.xml"</span> <span style="color: #E6DB74;">"http://rss.cbc.ca/lineup/canada-north.xml"</span> <span style="color: 
#E6DB74;">"http://rss.cbc.ca/lineup/canada-manitoba.xml"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/news/artsculture"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/businessNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/entertainment"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/companyNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/lifestyle"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/healthNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/MostRead"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/peopleNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/scienceNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/technologyNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/Reuters/domesticNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/Reuters/worldNews"</span> <span style="color: #E6DB74;">"http://feeds.reuters.com/reuters/USmediaDiversifiedNews"</span><span style="color: #E6DB74;">]</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #E6DB74;">[</span>out-file <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">io</span><span style="color: #c7254e;">/</span>writer <span style="color: #F92672;">(</span>str <span style="color: #E6DB74;">"resources/"</span> name <span style="color: #E6DB74;">".csv"</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #c7254e;">/</span>write-csv out-file <span style="color: #FD971F;">[</span><span style="color: #F92672;">[</span><span style="color: 
#E6DB74;">"class"</span> <span style="color: #E6DB74;">"title"</span> <span style="color: #E6DB74;">"url"</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">]</span><span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #FD971F;">[</span>feed-url feeds<span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #F92672;">[</span>item <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:entries</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">feed</span><span style="color: #c7254e;">/</span>parse-feed feed-url<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #F92672;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #c7254e;">/</span>write-csv out-file <span style="color: #E6DB74;">""</span> <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:title</span> item<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:link</span> item<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">:append</span> <span style="color: #AE81FF;">true</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> This routine creates this third gold standard. Remember, we use the gold standard to evaluate different methods and models to classify an input text to see if it belongs to the domain or not. </p> <p> For each piece of news aggregated in this manner, we manually determined if the candidate document belongs to the domain or not. 
This task is always the most time-consuming part of the case. This task can be tricky, and requires a clear understanding of the proper scope for the domain. In this example, we consider an article to belong to the <i>music domain</i> if it mentions music concepts such as musical albums, songs, multiple music-related topics, etc. If only a singer is mentioned in an article because he broke up with his girlfriend, without further mention of anything related to music, we don't classify it as being part of the domain. </p> <p> [However, under a different interpretation of what should be in the domain, wherein any mention of a singer qualifies, we could extend the classification process to include named entity (the singer) extraction to help properly classify those articles. This revised scope is not used in this article, but it does indicate how your exact domain needs should inform such scoping and classification (tagging) decisions.] </p> <p> You can download this new <a href="gold-standard-3.csv">third gold standard from here</a>. </p> </div> </div> <div id="outline-container-orgc140c4c" class="outline-2"> <br /> <h2 id="orgc140c4c">Evaluate Initial Domain Model</h2> <div class="outline-text-2" id="text-orgc140c4c"> <p> Now that we have updated the training corpuses using the revised scope of the domain compared to the <a href="http://fgiasson.com/blog/index.php/2016/10/24/create-a-domain-text-classifier-using-cognonto/">previous use case</a>, let's analyze the impact of using a new version of KBpedia and of using a reasoner to increase the number of features in our model. Let's run our automatic process to evaluate the new models. 
The remaining steps that need to be run are: </p> <ol class="org-ol"> <li>Configure the classifier (in this case, create the semantic vectors for ESA)</li> <li>Train the model (in this case, the SVM model), and</li> <li>Evaluate the model on multiple gold standards.</li> </ol> <p> Note: to see the full explanation of how the ESA and SVM classifiers work, please refer to the <a href="http://kbpedia.org/use-cases/text-classification-using-esa-and-svm/">Text Classification Using ESA and SVM</a> use case for more background information. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Load positive and negative training corpuses</span> <span style="color: #AE81FF;">(</span>load-dictionaries <span style="color: #E6DB74;">"resources/general-corpus-dictionary.csv"</span> <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.csv"</span><span style="color: #AE81FF;">)</span> <span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Build the ESA semantic interpreter </span> <span style="color: #AE81FF;">(</span>build-semantic-interpreter <span style="color: #E6DB74;">"base"</span> <span style="color: #E6DB74;">"resources/semantic-interpreters/base/"</span> <span style="color: #66D9EF;">(</span>distinct <span style="color: #A6E22E;">(</span>concat <span style="color: #E6DB74;">(</span>get-domain-pages<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span>get-general-pages<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> <span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Build the vectors to feed to a SVM classifier using ESA</span> <span style="color: #AE81FF;">(</span>build-svm-model-vectors <span 
style="color: #E6DB74;">"resources/svm/base/"</span> <span style="color: #AE81FF;">:corpus-folder-normalized</span> <span style="color: #E6DB74;">"resources/corpus-normalized/"</span><span style="color: #AE81FF;">)</span> <span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Train the SVM using the best parameters discovered in the previous tests</span> <span style="color: #AE81FF;">(</span>train-svm-model <span style="color: #E6DB74;">"svm.w50"</span> <span style="color: #E6DB74;">"resources/svm/base/"</span> <span style="color: #AE81FF;">:weights</span> <span style="color: #66D9EF;">{</span>1 50.0<span style="color: #66D9EF;">}</span> <span style="color: #AE81FF;">:v</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:c</span> 1 <span style="color: #AE81FF;">:algorithm</span> <span style="color: #AE81FF;">:l2l2</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Let's evaluate this model using our three gold standards: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-model <span style="color: #E6DB74;">"svm.goldstandard.1.w50"</span> <span style="color: #E6DB74;">"resources/gold-standard-1.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 21 False positive: 3 True negative: 306 False negative: 6 Precision: 0.875 Recall: 0.7777778 Accuracy: 0.97321427 F1: 0.8235294 </pre> <p> The performance changes related to the previous results (using KBpedia <code>1.02</code>) are: </p> <ul class="org-ul"> <li>Precision: <code>+10.33%</code></li> <li>Recall: <code>-12.16%</code></li> <li>Accuracy: <code>+0.31%</code></li> <li>F1: <code>+0.26%</code></li> </ul> <p> The results for the second gold standard are: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-model <span style="color: 
#E6DB74;">"svm.goldstandard.2.w50"</span> <span style="color: #E6DB74;">"resources/gold-standard-2.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 16 False positive: 3 True negative: 317 False negative: 9 Precision: 0.84210527 Recall: 0.64 Accuracy: 0.9652174 F1: 0.72727275 </pre> <p> The performance changes relative to the previous results (using KBpedia <code>1.02</code>) are: </p> <ul class="org-ul"> <li>Precision: <code>+6.18%</code></li> <li>Recall: <code>-29.35%</code></li> <li>Accuracy: <code>-1.19%</code></li> <li>F1: <code>-14.63%</code></li> </ul> <p> What we can say is that the new scope for the domain greatly improved the <code>precision</code> of the model. This happens because the new model is probably more complex and better scoped, which leads it to be more selective. However, because of this the <code>recall</code> of the model suffers, since some of the positive cases in our gold standard are no longer classified as positive but negative, which creates new <code>false negatives</code>. As you can see, there is almost always a tradeoff between <code>precision</code> and <code>recall</code>. You could have 100% <code>precision</code> by only having one result right, but then the <code>recall</code> would suffer greatly. This is why the <a href="https://en.wikipedia.org/wiki/F1_score">F1</a> score is important: it is the harmonic mean of the <code>precision</code> <b>and</b> the <code>recall</code>. 
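</p> <p> To make these trade-offs concrete, the metrics can be recomputed directly from the raw counts reported for the first gold standard (TP = 21, FP = 3, TN = 306, FN = 6): </p>

```python
# Recompute precision, recall, accuracy and F1 from raw counts; F1 is
# the harmonic mean of precision and recall.
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Counts from the first gold standard evaluation above.
p, r, a, f1 = metrics(21, 3, 306, 6)
# p = 0.875, r ≈ 0.7778, a ≈ 0.9732, f1 ≈ 0.8235, matching the
# reported figures.
```

<p> With only 27 positive cases out of 336 documents, accuracy stays high even when recall drops, which is why F1 is the more informative summary here.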
</p> <p> Now let's look at the results of our new gold standard: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-model <span style="color: #E6DB74;">"svm.goldstandard.3.w50"</span> <span style="color: #E6DB74;">"resources/gold-standard-3.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 28 False positive: 3 True negative: 355 False negative: 22 Precision: 0.9032258 Recall: 0.56 Accuracy: 0.9387255 F1: 0.69135803 </pre> <p> Again, with this new gold standard, we can see the same pattern: the <code>precision</code> is quite good, but the <code>recall</code> is not that great, since nearly half of the actual positive cases were missed by the model. </p> <p> Now, what could we do to improve this situation? The next thing we will investigate is feature selection and pruning. </p> </div> </div> <div id="outline-container-org64304a7" class="outline-2"> <br /> <h2 id="org64304a7">Feature Selection Using Pruning and Training Corpus Pruning</h2> <div class="outline-text-2" id="text-org64304a7"> <p> A new method that we will investigate to try to improve the performance of the models is called <a href="https://en.wikipedia.org/wiki/Feature_selection">feature selection</a>. As its name suggests, this means selecting a subset of the available features to create our prediction model. The idea here is that not all features are created equal, and different features may have different (positive or negative) impacts on the model. </p> <p> In our specific use case, we want to do feature selection using a pruning technique. We will count the number of tokens in the Wikipedia page related to each of our features. If the number of tokens in an article is too small (below 100), then we will drop that feature. </p> <p> [Note: feature selection is a complex topic; other options and nuances are not further discussed here.] 
</p> <p> The idea here is not to give undue importance to a feature for which we lack proper positive documents in the training corpus. Depending on the feature, this may or may not have an impact on the overall model's performance. </p> <p> Pruning the general and domain specific dictionaries is really simple. We only have to read the current dictionaries, read each of the documents mentioned in the dictionary from the cache, calculate the number of tokens in each, and then keep or drop each document depending on whether it meets the token threshold. Finally we write a new dictionary with the pruned features and documents: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">create-pruned-pages-dictionary-csv</span> <span style="color: #66D9EF;">[</span>dictionary-file prunned-file normalized-corpus-folder &amp; <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:keys</span> <span style="color: #E6DB74;">[</span>min-tokens<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:or</span> <span style="color: #E6DB74;">{</span>min-tokens 100<span style="color: #E6DB74;">}</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>dictionary <span style="color: #E6DB74;">(</span>rest <span style="color: #FD971F;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #F92672;">[</span>in-file <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">io</span><span style="color: #c7254e;">/</span>reader dictionary-file<span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">doall</span> <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">csv</span><span 
style="color: #c7254e;">/</span>read-csv in-file<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #E6DB74;">[</span>out-file <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">io</span><span style="color: #c7254e;">/</span>writer prunned-file<span style="color: #FD971F;">)</span><span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #c7254e;">/</span>write-csv out-file <span style="color: #FD971F;">(</span><span style="color: #F92672;">-&gt;&gt;</span> dictionary <span style="color: #F92672;">(</span>mapv <span style="color: #AE81FF;">(</span><span style="color: #F92672;">fn</span> <span style="color: #66D9EF;">[</span><span style="color: #A6E22E;">[</span>title rc<span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">when</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">.exists</span> <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">io</span><span style="color: #c7254e;">/</span>as-file <span style="color: #66D9EF;">(</span>str normalized-corpus-folder title <span style="color: #E6DB74;">".txt"</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">when</span> <span style="color: #AE81FF;">(</span>&gt; <span style="color: #66D9EF;">(</span><span style="color: #F92672;">-&gt;&gt;</span> <span style="color: #A6E22E;">(</span>slurp <span style="color: #E6DB74;">(</span>str normalized-corpus-folder title <span style="color: #E6DB74;">".txt"</span><span style="color: 
#E6DB74;">)</span><span style="color: #A6E22E;">)</span> tokenize count<span style="color: #66D9EF;">)</span> min-tokens<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>title rc<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span>apply concat<span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span>into <span style="color: #AE81FF;">[</span><span style="color: #AE81FF;">]</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Then we can prune the general and domain specific dictionaries using this simple function: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>create-pruned-pages-dictionary-csv <span style="color: #E6DB74;">"resources/general-corpus-dictionary.csv"</span> <span style="color: #E6DB74;">"resources/general-corpus-dictionary.pruned.csv"</span> <span style="color: #E6DB74;">"resources/corpus-normalized/"</span> <span style="color: #AE81FF;">:min-tokens</span> 100<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>create-pruned-pages-dictionary-csv <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.csv"</span> <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.pruned.csv"</span> <span style="color: #E6DB74;">"resources/corpus-normalized/"</span> <span style="color: #AE81FF;">:min-tokens</span> 100<span style="color: #AE81FF;">)</span> </pre> </div> <p> As a result of this specific pruning approach, the number of features drops from <code>197</code> to <code>175</code>. 
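The pruning rule itself is simple enough to sketch in a few lines of Python (a hypothetical illustration, independent of the Clojure tooling above; the in-memory `pages` map stands in for the normalized-corpus files, and the dictionary rows are simplified to `(title, rc)` pairs):

```python
def prune_dictionary(dictionary, pages, min_tokens=100):
    """Keep only (title, rc) rows whose cached page has more than min_tokens tokens."""
    return [(title, rc)
            for title, rc in dictionary
            if title in pages and len(pages[title].split()) > min_tokens]

# Hypothetical toy data: one page is long enough, one is a stub, one has no cached page.
pages = {"Airplane": "word " * 150, "Stub": "word " * 10}
dictionary = [("Airplane", "kbpedia:Airplane"),
              ("Stub", "kbpedia:Stub"),
              ("Missing", "kbpedia:Missing")]
print(prune_dictionary(dictionary, pages))
# → [('Airplane', 'kbpedia:Airplane')]
```

As in the Clojure function, a feature is dropped both when its cached page is too short and when no cached page exists for it.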
</p> </div> <div id="outline-container-orgcdd6d17" class="outline-3"> <br /> <h3 id="orgcdd6d17">Evaluating Pruned Training Corpuses and Selected Features</h3> <div class="outline-text-3" id="text-orgcdd6d17"> <p> Now that the training corpuses have been pruned, let's load them and then evaluate their performance on the gold standards. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Load positive and negative pruned training corpuses</span> <span style="color: #AE81FF;">(</span>load-dictionaries <span style="color: #E6DB74;">"resources/general-corpus-dictionary.pruned.csv"</span> <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.pruned.csv"</span><span style="color: #AE81FF;">)</span> <span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Build the ESA semantic interpreter </span> <span style="color: #AE81FF;">(</span>build-semantic-interpreter <span style="color: #E6DB74;">"base"</span> <span style="color: #E6DB74;">"resources/semantic-interpreters/base-pruned/"</span> <span style="color: #66D9EF;">(</span>distinct <span style="color: #A6E22E;">(</span>concat <span style="color: #E6DB74;">(</span>get-domain-pages<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span>get-general-pages<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> <span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Build the vectors to feed to a SVM classifier using ESA</span> <span style="color: #AE81FF;">(</span>build-svm-model-vectors <span style="color: #E6DB74;">"resources/svm/base-pruned/"</span> <span style="color: #AE81FF;">:corpus-folder-normalized</span> <span style="color: 
#E6DB74;">"resources/corpus-normalized/"</span><span style="color: #AE81FF;">)</span> <span style="color: #75715E; font-style: italic;">;; </span><span style="color: #75715E; font-style: italic;">Train the SVM using the best parameters discovered in the previous tests</span> <span style="color: #AE81FF;">(</span>train-svm-model <span style="color: #E6DB74;">"svm.w50"</span> <span style="color: #E6DB74;">"resources/svm/base-pruned/"</span> <span style="color: #AE81FF;">:weights</span> <span style="color: #66D9EF;">{</span>1 50.0<span style="color: #66D9EF;">}</span> <span style="color: #AE81FF;">:v</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:c</span> 1 <span style="color: #AE81FF;">:algorithm</span> <span style="color: #AE81FF;">:l2l2</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Let's evaluate this model using our three gold standards: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-model <span style="color: #E6DB74;">"svm.pruned.goldstandard.1.w50"</span> <span style="color: #E6DB74;">"resources/gold-standard-1.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 21 False positive: 2 True negative: 307 False negative: 6 Precision: 0.9130435 Recall: 0.7777778 Accuracy: 0.97619045 F1: 0.84000003 </pre> <p> The performance changes relative to the initial results (using KBpedia <code>1.02</code>) are: </p> <ul class="org-ul"> <li>Precision: <code>+18.75%</code></li> <li>Recall: <code>-12.08%</code></li> <li>Accuracy: <code>+0.61%</code></li> <li>F1: <code>+2.26%</code></li> </ul> <p> In this case, compared with the previous results (non-pruned with KBpedia <code>1.10</code>), we improved the <code>precision</code> without decreasing the <code>recall</code>, which is the ultimate goal. This means that the <code>F1</code> score increased by <code>2.26%</code> just by pruning, for this gold standard. 
</p> <p> The results for the second gold standard are: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-model <span style="color: #E6DB74;">"svm.goldstandard.2.w50"</span> <span style="color: #E6DB74;">"resources/gold-standard-2.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 16 False positive: 3 True negative: 317 False negative: 9 Precision: 0.84210527 Recall: 0.64 Accuracy: 0.9652174 F1: 0.72727275 </pre> <p> The performance changes relative to the previous results (using KBpedia <code>1.02</code>) are: </p> <ul class="org-ul"> <li>Precision: <code>+6.18%</code></li> <li>Recall: <code>-29.35%</code></li> <li>Accuracy: <code>-1.19%</code></li> <li>F1: <code>-14.63%</code></li> </ul> <p> In this case, the results are identical to the non-pruned results with KBpedia <code>1.10</code>: pruning did not change anything. Considering the relatively small size of the gold standard, this is to be expected since the model also did not drastically change. 
</p> <p> Now let's look at the results of our new gold standard: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-model <span style="color: #E6DB74;">"svm.goldstandard.3.w50"</span> <span style="color: #E6DB74;">"resources/gold-standard-3.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 27 False positive: 7 True negative: 351 False negative: 23 Precision: 0.7941176 Recall: 0.54 Accuracy: 0.9264706 F1: 0.64285713 </pre> <p> Now let's check how these compare to the non-pruned version of the training corpus: </p> <ul class="org-ul"> <li>Precision: <code>-12.08%</code></li> <li>Recall: <code>-3.7%</code></li> <li>Accuracy: <code>-1.31%</code></li> <li>F1: <code>-7.02%</code></li> </ul> <p> Both <code>false positives</code> and <code>false negatives</code> increased with this change, which also led to a decrease in the overall metrics. What happened? </p> <p> Several things may have happened. Maybe the new set of features is not optimal, or maybe the hyperparameters of the SVM classifier are off. This is what we will try to figure out using two new methods to continue improving our model: hyperparameter optimization using grid search, and ensemble learning. </p> </div> </div> <div id="outline-container-org6161f5c" class="outline-2"> <br /> <h2 id="org6161f5c">Hyperparameter Optimization Using Grid Search</h2> <div class="outline-text-2" id="text-org6161f5c"> <p> Hyperparameters are parameters that are not learned by the estimators. They are a kind of configuration option for an algorithm. In the case of a linear SVM, the hyperparameters are <code>C</code>, <code>epsilon</code>, the <code>weight</code> and the <code>algorithm</code> used. Hyperparameter optimization is the task of finding the parameter values that optimize the performance of the model. 
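To make the idea concrete, the search strategy we will use, grid search, can be sketched generically in a few lines of Python (a hypothetical illustration, separate from the Clojure function shown later; the toy `score` function stands in for training and evaluating a model with a given parameter combination):

```python
from itertools import product

def grid_search(score, grid):
    """Exhaustively evaluate every parameter combination and keep the best one.

    `score` maps a parameter dict to a number (e.g. the F1 score of a model
    trained and evaluated with those hyperparameters); `grid` maps parameter
    names to lists of candidate values."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        s = score(params)
        if s > best_score:           # strict >: keep the first best combination found
            best_params, best_score = params, s
    return best_params, best_score

# Toy score function standing in for train + evaluate; peaks at c=2, weight=30.
toy = lambda p: 0.7 + 0.01 * (p["c"] == 2) + 0.005 * (p["weight"] == 30)
params, best = grid_search(toy, {"c": [1, 2, 4, 16, 256],
                                 "e": [0.001, 0.01, 0.1],
                                 "weight": [1, 15, 30]})
print(params["c"], params["weight"])
# → 2 30
```

The cost is the product of the candidate-list sizes (here 5 × 3 × 3 = 45 train/evaluate runs), which is why the grid is kept coarse at first and refined later.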
</p> <p> There are multiple strategies we can use to find the best values for these hyperparameters, but the one we will highlight here is called grid search, which exhaustively searches across a manually specified list of possible hyperparameter values. </p> <p> The grid search function we want to define will enable us to specify the <code>algorithm(s)</code>, the <code>weight(s)</code>, <code>C</code> and the <code>stopping tolerance</code>. Then we will want the grid search to keep the hyperparameters that optimize the score of the metric we want to focus on. We also have to specify the gold standard we want to use to evaluate the performance of the different models. </p> <p> Here is the function that implements that grid search algorithm: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">svm-grid-search</span> <span style="color: #66D9EF;">[</span>name model-path gold-standard &amp; <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:keys</span> <span style="color: #E6DB74;">[</span>grid-parameters selection-metric<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:or</span> <span style="color: #E6DB74;">{</span>grid-parameters <span style="color: #FD971F;">[</span><span style="color: #F92672;">{</span><span style="color: #AE81FF;">:c</span> <span style="color: #AE81FF;">[</span>1 2 4 16 256<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:e</span> <span style="color: #AE81FF;">[</span>0.001 0.01 0.1<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:algorithm</span> <span style="color: #AE81FF;">[</span><span style="color: #AE81FF;">:l2l2</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:weight</span> <span style="color: #AE81FF;">[</span>1 15 30<span style="color: #AE81FF;">]</span><span style="color: 
#F92672;">}</span><span style="color: #FD971F;">]</span> selection-metric <span style="color: #AE81FF;">:f1</span><span style="color: #E6DB74;">}</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>best <span style="color: #E6DB74;">(</span>atom <span style="color: #FD971F;">{</span><span style="color: #AE81FF;">:gold-standard</span> gold-standard <span style="color: #AE81FF;">:selection-metric</span> selection-metric <span style="color: #AE81FF;">:score</span> 0.0 <span style="color: #AE81FF;">:c</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:e</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:algorithm</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:weight</span> <span style="color: #AE81FF;">nil</span><span style="color: #FD971F;">}</span><span style="color: #E6DB74;">)</span> model-vectors <span style="color: #E6DB74;">(</span>read-string <span style="color: #FD971F;">(</span>slurp <span style="color: #F92672;">(</span>str model-path <span style="color: #E6DB74;">"model.vectors"</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #E6DB74;">[</span>parameters grid-parameters<span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #FD971F;">[</span>algo <span style="color: #F92672;">(</span><span style="color: #AE81FF;">:algorithm</span> parameters<span style="color: #F92672;">)</span><span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">doseq</span> <span style="color: 
#F92672;">[</span>weight <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:weight</span> parameters<span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #AE81FF;">[</span>e <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:e</span> parameters<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #66D9EF;">[</span>c <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:c</span> parameters<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span>train-svm-model name model-path <span style="color: #AE81FF;">:weights</span> <span style="color: #A6E22E;">{</span>1 <span style="color: #AE81FF;">(</span>double weight<span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">}</span> <span style="color: #AE81FF;">:v</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:c</span> c <span style="color: #AE81FF;">:e</span> e <span style="color: #AE81FF;">:algorithm</span> algo <span style="color: #AE81FF;">:model-vectors</span> model-vectors<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>results <span style="color: #AE81FF;">(</span>evaluate-model name gold-standard <span style="color: #AE81FF;">:output</span> <span style="color: #AE81FF;">false</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"Algorithm:"</span> algo<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"C:"</span> c<span style="color: 
#A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"Epsilon:"</span> e<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"Weight:"</span> weight<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println selection-metric <span style="color: #E6DB74;">":"</span> <span style="color: #AE81FF;">(</span>get results selection-metric<span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">when</span> <span style="color: #AE81FF;">(</span>&gt; <span style="color: #66D9EF;">(</span>get results selection-metric<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:score</span> @best<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>reset! 
best <span style="color: #66D9EF;">{</span><span style="color: #AE81FF;">:gold-standard</span> gold-standard <span style="color: #AE81FF;">:selection-metric</span> selection-metric <span style="color: #AE81FF;">:score</span> <span style="color: #A6E22E;">(</span>get results selection-metric<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:c</span> c <span style="color: #AE81FF;">:e</span> e <span style="color: #AE81FF;">:algorithm</span> algo <span style="color: #AE81FF;">:weight</span> weight<span style="color: #66D9EF;">}</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span> @best<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> The possible algorithms are: </p> <ol class="org-ol"> <li><code>:l2lr_primal</code></li> <li><code>:l2l2</code></li> <li><code>:l2l2_primal</code></li> <li><code>:l2l1</code></li> <li><code>:multi</code></li> <li><code>:l1l2_primal</code></li> <li><code>:l1lr</code></li> <li><code>:l2lr</code></li> </ol> <p> To simplify things a little bit for this task, we will merge the three gold standards we have into one and use that merged gold standard moving forward. The merged gold standard can be <a href="gold-standard-full.csv">downloaded from here</a>. We now have a single gold standard with 1017 manually vetted web pages. </p> <p> Now that we have a new consolidated gold standard, let's calculate the performance of the models when the training corpuses are pruned or not. This will become the new baseline against which we compare subsequent results for this use case. 
The metrics when the training corpuses <b>are</b> pruned: </p> <blockquote> <p> True positive: 56 False positive: 10 True negative: 913 False negative: 38 </p> <p> Precision: 0.8484849 Recall: 0.59574467 Accuracy: 0.95280236 F1: 0.7 </p> </blockquote> <p> Now, let's run the grid search that will try to optimize the <code>F1</code> score of the model using the pruned training corpuses and the full gold standard: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>svm-grid-search <span style="color: #E6DB74;">"grid-search-base-pruned-tests"</span> <span style="color: #E6DB74;">"resources/svm/base-pruned/"</span> <span style="color: #E6DB74;">"resources/gold-standard-full.csv"</span> <span style="color: #AE81FF;">:selection-metric</span> <span style="color: #AE81FF;">:f1</span> <span style="color: #AE81FF;">:grid-parameters</span> <span style="color: #66D9EF;">[</span><span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:c</span> <span style="color: #E6DB74;">[</span>1 2 4 16 256<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:e</span> <span style="color: #E6DB74;">[</span>0.001 0.01 0.1<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:algorithm</span> <span style="color: #E6DB74;">[</span><span style="color: #AE81FF;">:l2l2</span><span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:weight</span> <span style="color: #E6DB74;">[</span>1 15 30<span style="color: #E6DB74;">]</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> {:gold-standard "resources/gold-standard-full.csv" :selection-metric :f1 :score 0.7096774 :c 2 :e 0.001 :algorithm :l2l2 :weight 30} </pre> <p> With a simple subset of the possible hyperparameter space, we found that by increasing the <code>c</code> parameter to 2 we could improve the performance of the 
<code>F1</code> score on the gold standard by <code>1.37%</code>. It is not a huge gain, but it is still an appreciable gain given the minimal effort invested so far (basically: waiting for the grid search to finish). Subsequently we could tweak the subset of parameters to try to improve a little further. Let's try with <code>c = [1.5, 2, 2.5]</code> and <code>weight = [30, 40]</code>. Let's also check other algorithms, like <code>L2-regularized L1-loss support vector regression (dual)</code>. </p> <p> The goal here is to configure the initial grid search with general parameters covering a wide range of possible values, and then to use the same tool to fine-tune the parameters that returned good results. In any case, the more computing power and time you have, the more tests you will be able to perform. </p> </div> </div> <div id="outline-container-orgff9c5b7" class="outline-2"> <br /> <h2 id="orgff9c5b7">Ensemble Learning With SVM</h2> <div class="outline-text-2" id="text-orgff9c5b7"> <p> Now that we have good hyperparameters for a single linear SVM classifier, let's try another technique to improve the performance of the system: ensemble learning. </p> <p> So far, we have reached <code>95%</code> accuracy by tweaking the hyperparameters and the training corpuses, but the <code>F1</code> score is still around <code>~70%</code> on the full gold standard, which leaves room for improvement. There are also situations when <code>precision</code> should be nearly perfect (because false positives are not acceptable) or when the <code>recall</code> should be optimized. </p> <p> Here we will try to improve this situation by using <a href="https://en.wikipedia.org/wiki/Ensemble_learning">ensemble learning</a>. It uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. 
In our examples, each model will have a vote and the weight of the vote will be equal for each model. We will use five different strategies to create the models that will belong to the ensemble: </p> <ol class="org-ol"> <li><a href="https://en.wikipedia.org/wiki/Bootstrap_aggregating">Bootstrap aggregating</a> <i>(bagging)</i></li> <li>Asymmetric bagging <sup><a id="fnr.1.100" class="footref" href="#fn.1">1</a></sup></li> <li><a href="https://en.wikipedia.org/wiki/Random_subspace_method">Random subspace method</a> <i>(feature bagging)</i></li> <li>Asymmetric bagging + random subspace method (ABRS) <sup><a id="fnr.1.100" class="footref" href="#fn.1">1</a></sup></li> <li>Bootstrap aggregating + random subspace method (BRS)</li> </ol> <p> Which strategy to use depends on factors such as whether the positive and negative training documents are unbalanced, how many features the model has, and so on. Let's introduce each of these strategies. </p> <p> Note that in this use case we are only creating ensembles with linear SVM learners. However, an ensemble can be composed of multiple different kinds of learners, like SVMs with non-linear kernels, decision trees, etc. To simplify this use case, we will stick to a single linear SVM with multiple different training corpuses and features. </p> </div> <div id="outline-container-orgf23a505" class="outline-3"> <br /> <h3 id="orgf23a505">Strategies</h3> <div class="outline-text-3" id="text-orgf23a505"> </div><div id="outline-container-org8b792f3" class="outline-4"> <br /> <h4 id="org8b792f3">Bootstrap Aggregating (bagging)</h4> <div class="outline-text-4" id="text-org8b792f3"> <p> The idea behind bagging is to draw a subset of positive and negative training samples at random and with replacement. Each model of the ensemble will have a different training set, but some of the training samples may appear in multiple different training sets. 
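Drawing the bootstrap sample for one member of the ensemble can be sketched as follows (a hypothetical Python illustration; each model would then be trained on its own `bootstrap_sample` of the training documents):

```python
import random

def bootstrap_sample(training_set, k=None):
    """Draw k samples at random, with replacement (k defaults to the set's size)."""
    k = k if k is not None else len(training_set)
    return [random.choice(training_set) for _ in range(k)]

docs = [f"doc-{i}" for i in range(10)]
sample = bootstrap_sample(docs)
# Same size as the original set, but duplicates are possible
# and some documents may be absent from a given sample.
print(len(sample), set(sample) <= set(docs))
# → 10 True
```

Because each model sees a slightly different resampling of the data, their errors are partly decorrelated, which is what the ensemble's vote exploits.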
</p> </div> </div> <div id="outline-container-org57984c3" class="outline-4"> <br /> <h4 id="org57984c3">Asymmetric Bagging</h4> <div class="outline-text-4" id="text-org57984c3"> <p> Asymmetric bagging was proposed by Tao, Tang, Li and Wu <sup><a id="fnr.1" class="footref" href="#fn.1">1</a></sup>. It is meant for situations where the number of positive training samples is largely unbalanced relative to the number of negative training samples. The idea is to create a random subset (drawn with replacement) of the negative training samples, while always keeping the full set of positive training samples. </p> </div> </div> <div id="outline-container-org2c78f71" class="outline-4"> <br /> <h4 id="org2c78f71">Random Subspace method <i>(feature bagging)</i></h4> <div class="outline-text-4" id="text-org2c78f71"> <p> The idea behind feature bagging is the same as bagging, but it works on the features of the model instead of the training sets. It attempts to reduce the correlation between the estimators in an ensemble by training each of them on a random sample of the features instead of the entire feature set. </p> </div> </div> <div id="outline-container-orge687ad6" class="outline-4"> <br /> <h4 id="orge687ad6">Asymmetric Bagging + Random Subspace method (ABRS)</h4> <div class="outline-text-4" id="text-orge687ad6"> <p> Asymmetric bagging and the random subspace method have also been proposed by Tao, Tang, Li and Wu <sup><a id="fnr.1.100" class="footref" href="#fn.1">1</a></sup>. 
The problems they had with their content-based image retrieval system are the same ones we have with this kind of automatic training corpus generated from a knowledge graph: </p> <ol class="org-ol"> <li>SVM is unstable on small-sized training sets</li> <li>SVM's optimal hyperplane may be biased when the number of positive training samples is much smaller than the number of negative samples (this is why we used weights in this case), and</li> <li>The training set is smaller than the number of features in the SVM model.</li> </ol> <p> The third point is not immediately an issue for us (except if you have a domain with many more features than we had in our example), but it becomes one when we start using asymmetric bagging. </p> <p> What we want to do here is to implement asymmetric bagging and the random subspace method to create <img src="ltximg/dynamic_machine_learning_use_case_859c7c3c1099242193bc675bd7b1bf25c900754e.png" alt="dynamic_machine_learning_use_case_859c7c3c1099242193bc675bd7b1bf25c900754e.png" /> individual models. This method is called ABRS-SVM, which stands for Asymmetric Bagging Random Subspace Support Vector Machines. 
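The sampling that ABRS performs for a single member of the ensemble can be sketched as follows (a hypothetical Python illustration with made-up sizes; both the negative documents and the features are drawn with replacement, while all positives are kept):

```python
import random

def abrs_training_set(positives, negatives, features, n_neg, n_feat):
    """Asymmetric bagging + random subspace sample for ONE model of the ensemble:
    keep all positive documents, and draw n_neg negative documents and
    n_feat features at random, with replacement."""
    neg_sample  = [random.choice(negatives) for _ in range(n_neg)]
    feat_sample = [random.choice(features) for _ in range(n_feat)]
    return positives + neg_sample, feat_sample

# Hypothetical sizes: 30 positives, 300 negatives, 175 features.
pos  = [f"pos-{i}" for i in range(30)]
neg  = [f"neg-{i}" for i in range(300)]
feat = [f"feature-{i}" for i in range(175)]

docs, feats = abrs_training_set(pos, neg, feat, n_neg=30, n_feat=50)
print(len(docs), len(feats))
# → 60 50
```

Each model of the ensemble would call this with its own random draw, giving balanced training sets whose size no longer dwarfs the (reduced) feature count.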
</p> <p> The algorithm we will use is: </p> <ol class="org-ol"> <li>Let the number of positive training documents be <img src="ltximg/dynamic_machine_learning_use_case_edc31b30a8bd1852c35517549bcac8b4a7af7fc8.png" alt="dynamic_machine_learning_use_case_edc31b30a8bd1852c35517549bcac8b4a7af7fc8.png" />, the number of negative training documents be <img src="ltximg/dynamic_machine_learning_use_case_a957f8e8350deff65b7a8982eb9a29c95f5e7773.png" alt="dynamic_machine_learning_use_case_a957f8e8350deff65b7a8982eb9a29c95f5e7773.png" /> and the number of features in the training data be <img src="ltximg/dynamic_machine_learning_use_case_8541d15dc6a8dbbfe4c3da369275938094ab9a70.png" alt="dynamic_machine_learning_use_case_8541d15dc6a8dbbfe4c3da369275938094ab9a70.png" />.</li> <li>Choose <img src="ltximg/dynamic_machine_learning_use_case_859c7c3c1099242193bc675bd7b1bf25c900754e.png" alt="dynamic_machine_learning_use_case_859c7c3c1099242193bc675bd7b1bf25c900754e.png" /> to be the number of individual models in the ensemble.</li> <li>For each individual model <img src="ltximg/dynamic_machine_learning_use_case_0faca97934e9db8e9f056c94b7613c45cb12e1ef.png" alt="dynamic_machine_learning_use_case_0faca97934e9db8e9f056c94b7613c45cb12e1ef.png" />, choose <img src="ltximg/dynamic_machine_learning_use_case_81af6d98760653700a014a8f1362a186300f0207.png" alt="dynamic_machine_learning_use_case_81af6d98760653700a014a8f1362a186300f0207.png" /> (where <img src="ltximg/dynamic_machine_learning_use_case_90ae52c890d2e8bd9b3a9376696d65719d104954.png" alt="dynamic_machine_learning_use_case_90ae52c890d2e8bd9b3a9376696d65719d104954.png" />) to be the number of negative training documents for <img src="ltximg/dynamic_machine_learning_use_case_d15a4e4ae61385fcd2221a2be30a7f59da7bd4ca.png" alt="dynamic_machine_learning_use_case_d15a4e4ae61385fcd2221a2be30a7f59da7bd4ca.png" />.</li> <li>For each individual model <img src="ltximg/dynamic_machine_learning_use_case_0faca97934e9db8e9f056c94b7613c45cb12e1ef.png"
alt="dynamic_machine_learning_use_case_0faca97934e9db8e9f056c94b7613c45cb12e1ef.png" />, choose <img src="ltximg/dynamic_machine_learning_use_case_36954911a1d2865eaf400325a9f0b3d9d08a9993.png" alt="dynamic_machine_learning_use_case_36954911a1d2865eaf400325a9f0b3d9d08a9993.png" /> (where <img src="ltximg/dynamic_machine_learning_use_case_cd852c5f4f131fdce3c46aa64249678cb4456717.png" alt="dynamic_machine_learning_use_case_cd852c5f4f131fdce3c46aa64249678cb4456717.png" />) to be the number of input variables for <img src="ltximg/dynamic_machine_learning_use_case_8af9860053a468761786b279cda937b39be994c5.png" alt="dynamic_machine_learning_use_case_8af9860053a468761786b279cda937b39be994c5.png" />.</li> <li>For each individual model <img src="ltximg/dynamic_machine_learning_use_case_0faca97934e9db8e9f056c94b7613c45cb12e1ef.png" alt="dynamic_machine_learning_use_case_0faca97934e9db8e9f056c94b7613c45cb12e1ef.png" />, create a training set by choosing <img src="ltximg/dynamic_machine_learning_use_case_36954911a1d2865eaf400325a9f0b3d9d08a9993.png" alt="dynamic_machine_learning_use_case_36954911a1d2865eaf400325a9f0b3d9d08a9993.png" /> features from <img src="ltximg/dynamic_machine_learning_use_case_8541d15dc6a8dbbfe4c3da369275938094ab9a70.png" alt="dynamic_machine_learning_use_case_8541d15dc6a8dbbfe4c3da369275938094ab9a70.png" /> <a href="https://en.wikipedia.org/wiki/Sampling_(statistics)#Replacement_of_selected_units">with replacement</a>, by choosing <img src="ltximg/dynamic_machine_learning_use_case_81af6d98760653700a014a8f1362a186300f0207.png" alt="dynamic_machine_learning_use_case_81af6d98760653700a014a8f1362a186300f0207.png" /> negative training documents from <img src="ltximg/dynamic_machine_learning_use_case_a957f8e8350deff65b7a8982eb9a29c95f5e7773.png" alt="dynamic_machine_learning_use_case_a957f8e8350deff65b7a8982eb9a29c95f5e7773.png" /> <a href="https://en.wikipedia.org/wiki/Sampling_(statistics)#Replacement_of_selected_units">with replacement</a>, and by choosing all
positive training documents <img src="ltximg/dynamic_machine_learning_use_case_edc31b30a8bd1852c35517549bcac8b4a7af7fc8.png" alt="dynamic_machine_learning_use_case_edc31b30a8bd1852c35517549bcac8b4a7af7fc8.png" />, and then train the model.</li> </ol> </div> </div> <div id="outline-container-org9d51df8" class="outline-4"> <br /> <h4 id="org9d51df8">Bootstrap Aggregating + Random Subspace method (BRS)</h4> <div class="outline-text-4" id="text-org9d51df8"> <p> Bootstrap aggregating with feature bagging works the same way as asymmetric bagging with the random subspace method, except that ordinary bagging replaces asymmetric bagging. (<code>ABRS</code> should be used if your positive training sample is severely unbalanced compared to your negative training sample; otherwise <code>BRS</code> should be used.) </p> </div> </div> <div id="outline-container-org40e4723" class="outline-4"> <br /> <h4 id="org40e4723">SVM Learner</h4> <div class="outline-text-4" id="text-org40e4723"> <p> We use the linear Support Vector Machine (SVM) as the learner for the ensemble. We will be creating a series of SVM models that differ depending on the ensemble method(s) used to create the ensemble. </p> </div> </div> </div> <div id="outline-container-orgdf40d30" class="outline-3"> <br /> <h3 id="orgdf40d30">Build Training Document Vectors</h3> <div class="outline-text-3" id="text-orgdf40d30"> <p> The first step is to create a structure in which all of the positive and negative training documents have their vector representation. Since this is the most computationally expensive task in the whole process, we calculate the vectors using the <code>(build-svm-model-vectors)</code> function and serialize the structure to the file system. That way, to create the ensemble's models, we only have to load it from the file system without having to re-calculate it each time.
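The five-step ABRS construction above amounts to straightforward random sampling over documents and features. Here is a minimal sketch in Python (the article's own code is Clojure in the <code>cognonto-esa</code> library; the function and argument names below are hypothetical, for illustration only):

```python
import random

def abrs_training_sets(positives, negatives, num_features, n_models,
                       n_negatives, n_features, seed=42):
    """For each of the n_models individual models, sample (with replacement)
    a feature subset and a negative-document subset, keeping all positives."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_models):
        # Random subspace: n_features feature indices, drawn with replacement.
        features = [rng.randrange(num_features) for _ in range(n_features)]
        # Asymmetric bagging: bootstrap only the negative documents.
        neg_sample = [rng.choice(negatives) for _ in range(n_negatives)]
        # All positive documents are kept for every model.
        plans.append({"features": features,
                      "positives": list(positives),
                      "negatives": neg_sample})
    return plans

# Toy numbers; the run below uses 174 features and 3500 negative documents.
plans = abrs_training_sets(positives=["p1", "p2"],
                           negatives=[f"n{i}" for i in range(100)],
                           num_features=174, n_models=3,
                           n_negatives=10, n_features=65)
print(len(plans), len(plans[0]["features"]), len(plans[0]["negatives"]))
```

Each resulting plan would then train one linear SVM. Under plain BRS, the positives would be bootstrapped in the same way as the negatives instead of being copied whole.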
</p> </div> </div> <div id="outline-container-org96c5779" class="outline-3"> <br /> <h3 id="org96c5779">Train, Classify and Evaluate Ensembles</h3> <div class="outline-text-3" id="text-org96c5779"> <p> The goal is to create a set of <code>X</code> SVM classifiers, each of which uses a different model. The models can differ in their features or in their training corpus. Each classifier then classifies an input text according to its own model. Finally, the classifiers vote to determine whether or not that input text belongs to the domain. </p> <p> There are four hyperparameters related to ensemble learning: </p> <ol class="org-ol"> <li>The mode to use</li> <li>The number of models we want to create in the ensemble</li> <li>The number of training documents we want in the training corpus, and</li> <li>The number of features.</li> </ol> <p> Other hyperparameters could include those of the linear SVM classifier, but in this example we simply reuse the best parameters we found above. We now train the ensemble using the <code>(train-ensemble-svm)</code> function. </p> <p> Once the ensemble is created and trained, we use the <code>(classify-ensemble-text)</code> function to classify an input text with the ensemble. That function takes two parameters: <code>:mode</code>, which is the ensemble's mode, and <code>:vote-acceptance-ratio</code>, which defines the proportion of positive votes required for the ensemble to classify the input text as positive. By default the ratio is <code>50%</code>, but if you want to optimize the <code>precision</code> of the ensemble, you may want to increase that ratio to <code>70%</code> or even <code>95%</code>, as we will see below.
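The vote-acceptance rule, and the metrics that the evaluation reports are built from, are simple enough to state precisely. The sketch below is illustrative Python, not the actual Clojure implementation; the stub models and function names are hypothetical:

```python
def classify_ensemble(models, text, vote_acceptance_ratio=0.5):
    """Each model votes True/False on the text; the ensemble classifies the
    text positively only if the share of positive votes reaches the ratio."""
    votes = [model(text) for model in models]
    return sum(votes) / len(votes) >= vote_acceptance_ratio

def report(tp, fp, tn, fn):
    """Compute the four figures reported for each evaluation run."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Three stub "models": two vote positive, one votes negative.
models = [lambda t: True, lambda t: True, lambda t: False]
print(classify_ensemble(models, "some text"))        # 2/3 of the votes meet a 50% ratio
print(classify_ensemble(models, "some text", 0.9))   # but not a 90% ratio

# Counts from the 50%-ratio asymmetric-bagging run shown below.
print(report(48, 6, 917, 46))
```

Raising the ratio makes the ensemble more conservative, which is exactly the precision-up, recall-down trade-off seen in the results that follow.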
</p> <p> Finally the ensemble, configured with all its hyperparameters, is evaluated using the <code>(evaluate-ensemble)</code> function, which is the same as the <code>(evaluate-model)</code> function except that it uses the ensemble, instead of a single SVM model, to classify all of the articles. As before, we characterize the assignments in relation to the gold standard. </p> <p> Let's now train different ensembles to try to improve the performance of the system. </p> </div> <div id="outline-container-orgadb6b58" class="outline-4"> <br /> <h4 id="orgadb6b58">Asymmetric Bagging</h4> <div class="outline-text-4" id="text-orgadb6b58"> <p> The current training corpus is highly unbalanced, which is why our first test applies the asymmetric bagging strategy. Each of the SVM classifiers uses the same positive training set, with the same number of positive documents; however, each takes its own random sample of negative training documents (with replacement). </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>use '<span style="color: #66D9EF;">cognonto-esa.core</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>use '<span style="color: #66D9EF;">cognonto-esa.ensemble-svm</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>load-dictionaries <span style="color: #E6DB74;">"resources/general-corpus-dictionary.pruned.csv"</span> <span style="color: #E6DB74;">"resources/domain-corpus-dictionary.pruned.csv"</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>load-semantic-interpreter <span style="color: #E6DB74;">"base-pruned"</span> <span style="color: #E6DB74;">"resources/semantic-interpreters/base-pruned/"</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>reset!
ensemble <span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>train-ensemble-svm <span style="color: #E6DB74;">"ensemble.base.pruned.ab.c2.w30"</span> <span style="color: #E6DB74;">"resources/ensemble-svm/base-pruned/"</span> <span style="color: #AE81FF;">:mode</span> <span style="color: #AE81FF;">:ab</span> <span style="color: #AE81FF;">:weight</span> <span style="color: #66D9EF;">{</span>1 30.0<span style="color: #66D9EF;">}</span> <span style="color: #AE81FF;">:c</span> 2 <span style="color: #AE81FF;">:e</span> 0.001 <span style="color: #AE81FF;">:nb-models</span> 100 <span style="color: #AE81FF;">:nb-training-documents</span> 3500<span style="color: #AE81FF;">)</span> </pre> </div> <p> Now let's evaluate this ensemble with a vote acceptance ratio of <code>50%</code> </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-ensemble <span style="color: #E6DB74;">"ensemble.base.pruned.ab.c2.w30"</span> <span style="color: #E6DB74;">"resources/gold-standard-full.csv"</span> <span style="color: #AE81FF;">:mode</span> <span style="color: #AE81FF;">:ab</span> <span style="color: #AE81FF;">:vote-acceptance-ratio</span> 0.50<span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 48 False positive: 6 True negative: 917 False negative: 46 Precision: 0.8888889 Recall: 0.5106383 Accuracy: 0.9488692 F1: 0.6486486 </pre> <p> Let's increase the vote acceptance ratio to <code>90%</code>: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-ensemble <span style="color: #E6DB74;">"ensemble.base.pruned.ab.c2.w30"</span> <span style="color: #E6DB74;">"resources/gold-standard-full.csv"</span> <span style="color: #AE81FF;">:mode</span> <span style="color: #AE81FF;">:ab</span> <span style="color: #AE81FF;">:vote-acceptance-ratio</span> 
0.90<span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 37 False positive: 2 True negative: 921 False negative: 57 Precision: 0.94871795 Recall: 0.39361703 Accuracy: 0.94198626 F1: 0.556391 </pre> <p> In both cases, the <code>precision</code> increases considerably compared to the non-ensemble learning results. However, the <code>recall</code> dropped at the same time, which lowered the <code>F1</code> score as well. Let's now try the <code>ABRS</code> method. </p> </div> </div> <div id="outline-container-orgba411b5" class="outline-4"> <br /> <h4 id="orgba411b5">Asymmetric Bagging + Random Subspace method (ABRS)</h4> <div class="outline-text-4" id="text-orgba411b5"> <p> The goal of the random subspace method is to select a random set of features, so each model has its own feature set and makes its predictions according to it. With the ABRS strategy, we end up with highly heterogeneous models, since no two share the same negative training set or the same features. </p> <p> Here we test defining each classifier with <code>65</code> features randomly chosen out of <code>174</code>, and restricting the negative training corpus to 3500 randomly selected documents. We then create 300 models to get a really heterogeneous population of models. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>reset!
ensemble <span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>train-ensemble-svm <span style="color: #E6DB74;">"ensemble.base.pruned.abrs.c2.w30"</span> <span style="color: #E6DB74;">"resources/ensemble-svm/base-pruned/"</span> <span style="color: #AE81FF;">:mode</span> <span style="color: #AE81FF;">:abrs</span> <span style="color: #AE81FF;">:weight</span> <span style="color: #66D9EF;">{</span>1 30.0<span style="color: #66D9EF;">}</span> <span style="color: #AE81FF;">:c</span> 2 <span style="color: #AE81FF;">:e</span> 0.001 <span style="color: #AE81FF;">:nb-models</span> 300 <span style="color: #AE81FF;">:nb-features</span> 65 <span style="color: #AE81FF;">:nb-training-documents</span> 3500<span style="color: #AE81FF;">)</span> </pre> </div> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-ensemble <span style="color: #E6DB74;">"ensemble.base.pruned.abrs.c2.w30"</span> <span style="color: #E6DB74;">"resources/gold-standard-full.csv"</span> <span style="color: #AE81FF;">:mode</span> <span style="color: #AE81FF;">:abrs</span> <span style="color: #AE81FF;">:vote-acceptance-ratio</span> 0.50<span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 41 False positive: 3 True negative: 920 False negative: 53 Precision: 0.9318182 Recall: 0.43617022 Accuracy: 0.9449361 F1: 0.59420294 </pre> <p> For these features and training sets, using the <code>ABRS</code> method did not improve on the <code>AB</code> method we tried above. </p> </div> </div> </div> </div> <div id="outline-container-org67bbb95" class="outline-2"> <br /> <h2 id="org67bbb95">Conclusion</h2> <div class="outline-text-2" id="text-org67bbb95"> <p> This use case shows three totally different ways to use KBpedia (and any domain extensions that may be employed) to create positive and negative training sets automatically. 
We demonstrated how the full process can be automated, where the only requirement is a list of seed KBpedia reference concepts. </p> <p> We also quantified the impact of using new versions of KBpedia, and how different strategies, techniques or algorithms can have different impacts on the prediction models. </p> <p> Creating prediction models using supervised machine learning algorithms (currently the bulk of the learners in use) has two global steps: </p> <ol class="org-ol"> <li>Label training sets and generate gold standards, and</li> <li>Test, compare, and optimize different learners, ensembles and hyperparameters.</li> </ol> <p> Unfortunately, given the manual effort required by the first step, the overwhelming portion of the time and budget needed to create a prediction model is spent there today. By automating much of this process, KBpedia <b>substantially</b> reduces this effort. Time and budget can now be re-directed to the second step of "dialing in" the learners, where the real payoff occurs. </p> <p> Further, as we also demonstrated, once we automate this process of labeling and creating reference standards, we can also automate the testing and optimization of multiple prediction algorithms, hyperparameter configurations, etc. In short, for both steps, KBpedia provides significant reductions in the time and effort needed to get to desired results. </p> </div> </div> <div id="footnotes"> <br /> <h2 class="footnotes">Footnotes: </h2> <div id="text-footnotes"> <div class="footdef"><sup><a id="fn.1" class="footnum" href="#fnr.1">1</a></sup> Tao, D., Tang, X., Li, X. and Wu, X., (2006). Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7), pp. 1088-1099.
</div> </div> </div> </div> <div class="col-md-2"> &nbsp; </div><!--/col-md-2--> </div> </div><!--/container--> <!--=== End Content Part ===--> <div class="footer-v1"> <div class="footer"> <div class="container"> <div class="row"> <!-- About --> <div class="col-md-3 md-margin-bottom-40"> <table> <tbody><tr> <td> <a href="/"> <img id="logo-footer" class="footer-logo" src="/imgs/logo-simple-purple.png" alt="" name="logo-footer"> </a> </td> </tr> <tr> <td> <center> <p> KBpedia </p> </center> </td> </tr> </tbody></table> <p style="font-size: 0.85em;"> KBpedia exploits large-scale knowledge bases and semantic technologies for machine learning, data interoperability and mapping, and fact extraction and tagging. </p> </div><!--/col-md-3--> <!-- End About --> <!-- Latest --> <div class="col-md-3 md-margin-bottom-40"> <div class="posts"> <div class="headline"> <h2> Latest News </h2> </div> <ul class="list-unstyled latest-list"> <!-- HTML generated from an RSS Feed by rss2html.php, http://www.FeedForAll.com/ a NotePage, Inc. product (http://www.notepage.com/) --> <li> <a href="https://kbpedia.org/resources/news/kbpedia-adds-ecommerce/">KBpedia Adds Major eCommerce Capabilities</a> <small>06/15/2020</small> </li> <!-- HTML generated from an RSS Feed by rss2html.php, http://www.FeedForAll.com/ a NotePage, Inc. product (http://www.notepage.com/) --> <li> <a href="http://kbpedia.org/resources/news/kbpedia-continues-quality-improvements/">KBpedia Continues Quality Improvements</a> <small>12/04/2019</small> </li> <!-- HTML generated from an RSS Feed by rss2html.php, http://www.FeedForAll.com/ a NotePage, Inc. 
product (http://www.notepage.com/) --> <li> <a href="http://kbpedia.org/resources/news/wikidata-coverage-nearly-complete/">Wikidata Coverage Nearly Complete (98%)</a> <small>04/08/2019</small> </li> </ul> </div> </div><!--/col-md-3--><!-- End Latest --><!-- Link List --> <div class="col-md-3 md-margin-bottom-40"> <div class="headline"> <h2> Other Resources </h2> </div> <ul class="list-unstyled link-list"> <li> <a href="/resources/about/">About</a> </li> <li> <a href="/resources/faq/">FAQ</a> </li> <li> <a href="/resources/news/">News</a> </li> <li> <a href="/use-cases/">Use Cases</a> </li> <li> <a href="/resources/documentation/">Documentation</a> </li> <li> <a href="/resources/privacy/">Privacy</a> </li> <li> <a href="/resources/terms-of-use/">Terms of Use</a> </li> </ul> </div><!--/col-md-3--> <!-- End Link List --><!-- Address --> <div class="col-md-3 map-img md-margin-bottom-40"> <div class="headline"> <h2> Contact Us </h2> </div> <address class="md-margin-bottom-40"> c/o <a href="mailto:info@mkbergman.com?subject=KBpedia%20Inquiry">Michael K. Bergman</a> <br> 380 Knowling Drive <br> Coralville, IA 52241 <br> U.S.A. <br> Voice: +1 319 621 5225 </address> </div><!--/col-md-3--> <!-- End Address --> </div> </div> </div><!--/footer--> <div class="copyright"> <div class="container"> <div class="row"> <div class="col-md-7"> <p class="copyright" style="font-size: 10px;"> 2016-2022 &copy; <a href="http://kbpedia.org" style="font-size: 10px;">Michael K. Bergman.</a> All Rights Reserved.
</p> </div> <!-- Social Links --> <div class="col-md-5"> <ul class="footer-socials list-inline"> <li> <a href="/resources/feeds/news.xml" class="tooltips" data-toggle="tooltip" data-placement="top" title="" data-original-title="RSS feed"> <i class="fa fa-rss-square"></i> </a> <br></li> <li> <a href="http://github.com/Cognonto" class="tooltips" data-toggle="tooltip" data-placement="top" title="" data-original-title="Github"> <i class="fa fa-github"></i> </a> <br></li> <li> <a href="http://twitter.com/cognonto" class="tooltips" data-toggle="tooltip" data-placement="top" title="" data-original-title="Twitter"> <i class="fa fa-twitter"></i> </a> <br></li> </ul> </div> <!-- End Social Links --> </div> </div> </div><!--/copyright--> </div><!--=== End Footer Version 1 ===--> <!--/wrapper--> <!-- JS Global Compulsory --> <script type="text/javascript" src="/assets/plugins/jquery/jquery.min.js"></script> <script type="text/javascript" src="/assets/plugins/jquery/jquery-migrate.min.js"></script> <script type="text/javascript" src="/assets/plugins/bootstrap/js/bootstrap.min.js"></script> <!-- JS Implementing Plugins --> <script type="text/javascript" src="/assets/plugins/back-to-top.js"></script> <!-- JS Customization --> <script type="text/javascript" src="/assets/js/custom.js"></script> <!-- JS Page Level --> <script type="text/javascript" src="/assets/js/app.js"></script> <!-- JS Implementing Plugins --> <script type="text/javascript" src="/assets/plugins/smoothScroll.js"></script> <script type="text/javascript" src="/assets/plugins/owl-carousel/owl-carousel/owl.carousel.js"></script> <script type="text/javascript" src="/assets/plugins/layer-slider/layerslider/js/greensock.js"></script> <script type="text/javascript" src="/assets/plugins/layer-slider/layerslider/js/layerslider.transitions.js"></script> <script type="text/javascript" src="/assets/plugins/layer-slider/layerslider/js/layerslider.kreaturamedia.jquery.js"></script> <!-- JS Customization --> <script 
type="text/javascript" src="/assets/js/custom.js"></script> <!-- JS Page Level --> <script type="text/javascript" src="/assets/js/plugins/layer-slider.js"></script> <script type="text/javascript" src="/assets/js/plugins/style-switcher.js"></script> <script type="text/javascript" src="/assets/js/plugins/owl-carousel.js"></script> <script type="text/javascript" src="/assets/js/plugins/owl-recent-works.js"></script> <script type="text/javascript"> jQuery(document).ready(function() { App.init(); LayerSlider.initLayerSlider(); StyleSwitcher.initStyleSwitcher(); OwlCarousel.initOwlCarousel(); OwlRecentWorks.initOwlRecentWorksV2(); }); </script> <!--[if lt IE 9]> <script src="assets/plugins/respond.js"></script> <script src="assets/plugins/html5shiv.js"></script> <script src="assets/plugins/placeholder-IE-fixes.js"></script> <![endif]--> <!--[if lt IE 9]> <script src="/assets/plugins/respond.js"></script> <script src="/assets/plugins/html5shiv.js"></script> <script src="/assets/js/plugins/placeholder-IE-fixes.js"></script> <![endif]--> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-84405507-1', 'auto'); ga('send', 'pageview'); </script>
