# Extending KBpedia With Wikipedia Categories
## USE CASE

**Title:** Extending KBpedia With Wikipedia Categories

**Short Description:** This use case describes how knowledge graphs, such as KBpedia, which need to be kept current and extended based on new knowledge and new mappings, can be so maintained with acceptable effort and accuracy.

**Problem:** Knowledge graphs are under constant change and need to be extended with specific domain information for particular domain purposes. The combinatorial aspects of adding new external schema or concepts to an existing store of concepts can be extensive. Effective means at acceptable time and cost must be found for enhancing or updating these knowledge graphs.

**Approach:** We extend KBpedia's knowledge graph under this use case by adding more concepts from the Wikipedia category structure, "cleaned" to produce its most natural classes. These extensions are made using an [SVM](https://en.wikipedia.org/wiki/Support_vector_machine) classifier trained over graph-based embedding vectors generated using the [DeepWalk](https://arxiv.org/abs/1403.6652) method. The source graph is based on the KBpedia knowledge graph structure linked to the Wikipedia categories. Means are put in place to test and optimize the parameters used in the machine learning methods. These mapping techniques are then visualized using the [TensorFlow Projector](http://projector.tensorflow.org/) web application to help build confidence that the mapping clusters are correct. The overall process is captured by a repeatable pipeline with statistical reporting, enabling rapid refinements in parameters and methods to achieve the best-performing model. Once appropriate candidate categories are generated using this optimized model, the results are then inspected by a human to make the final selection decisions. The semi-automatic methods in this use case can be applied to extending KBpedia with any external schema, ontology or vocabulary.

**Key Findings:**

- General methods are explored and documented for how to extend the KBpedia knowledge graph
- A variety of machine learning methods can reduce the effort required to add new concepts by 95% or more
- A workable and reusable pipeline leads to fast methods for testing and optimizing parameters used in the machine learning methods
- Care should be taken when using visualizations to validate relationships, especially when using dimension reduction techniques
- To our knowledge, this use case is a unique combination of relatively new artificial intelligence methods
- The approach documented in this use case is applicable to extending a knowledge graph with any external schema, ontologies or vocabularies.

A knowledge graph is an ever evolving structure. It needs to be extended to be able to cope with new kinds of knowledge; it needs to be fixed and improved in all kinds of different ways.
It also needs to be linked to other sources of data and to other knowledge representations such as schema, ontologies and vocabularies. One of the core tasks related to knowledge graphs is to extend their scope. This idea seems simple enough, but how can we extend a general knowledge graph that has nearly 40,000 concepts with potentially multiple thousands more? How can we do this while keeping it consistent, coherent and meaningful? How can we do this without spending undue effort on such a task? These are the questions we will try to answer with the methods we cover in this use case.

The methods we present herein describe how we can extend the KBpedia knowledge graph using an external source of knowledge, one which has a completely different structure than KBpedia and one which has been built completely differently, with a different purpose in mind, than KBpedia. In this use case, this external resource is the [Wikipedia category](https://en.wikipedia.org/wiki/Help:Category) structure. What we will show in this use case is how we may automatically select the right Wikipedia categories that could lead to new KBpedia concepts. These selections are made using an [SVM](https://en.wikipedia.org/wiki/Support_vector_machine) classifier trained over graph embedding vectors generated by a [DeepWalk](https://arxiv.org/abs/1403.6652) model based on the KBpedia knowledge graph structure linked to the Wikipedia categories. Once appropriate candidate categories are selected using this model, the results are then inspected by a human to make the final selection decisions. This semi-automated process takes 5% of the time it would normally take to conduct this task by comparable manual means.

Like other KBpedia use cases, the code examples provided herein are written in [Clojure](https://clojure.org/).

## Extending KBpedia Using Wikipedia Categories

What we want to accomplish is to extend the KBpedia knowledge graph by leveraging its own graph structure and its linkage to the external Wikipedia category structure. The goal is to find the sub-graph structure surrounding each of the existing links between these two structures and then to use that graph structure to find new Wikipedia categories that share the same kind of sub-graph structure with KBpedia, but that are not currently linked to any existing KBpedia concept. These candidates could then lead to new KBpedia concepts that do not currently exist in the structure.

### The Process

Thousands of KBpedia reference concepts are currently linked to related Wikipedia categories. We want to use this linkage to propose a series of **new** sub-classes that we could add to KBpedia based on the sub-categories that exist in Wikipedia for each of these links. The result is a list of new KBpedia concept candidates that comes from the Wikipedia category structure.
This new list includes Wikipedia categories that come from:

1. The sub-categories of Wikipedia categories that are linked to a leaf KBpedia reference concept,
2. The sub-categories of Wikipedia categories that are linked to a KBpedia reference concept that is the parent of a leaf concept, and
3. The sub-categories of Wikipedia categories that are linked to a KBpedia reference concept that is neither `1.` nor `2.`

The idea is that `1.` and `2.` would use the Wikipedia category structure to specialize the KBpedia conceptual structure, and `3.` would fill potential gaps in the scope and coverage of the structure with new general concepts.

The challenge we face by proceeding in this way is that our procedure potentially creates tens of thousands of new candidates. Because the Wikipedia category structure has a completely different purpose than the KBpedia knowledge graph, and because Wikipedia's creation rules are completely different than KBpedia's, many candidates are inconsistent or incoherent to include in KBpedia. Most of the candidate categories need to be dropped. Reviewing tens of thousands of new candidates manually is not tenable without an automatic way to rank potential candidates.

An objective of the process is to greatly reduce the cost of such a Herculean task by using machine learning techniques to help the human reviewer by pre-categorizing each of the proposed new `sub-class-of` relationships.

We thus split the problem into three distinct tasks:

1. The first thing we have to do is to learn the sub-category patterns that exist in the Wikipedia category structure. These patterns will be learned in an unsupervised manner using the [DeepWalk algorithm](https://arxiv.org/abs/1403.6652). Graph embedding vectors will be created from this task;
2. Then we create a training set with thousands of pre-classified sub-categories. `75%` of the training set is used for training, and `25%` of it is used for cross-validation. The classifier we will use for this task is SVM with an RBF kernel; and
3. Employ hyperparameter optimization of the previous two steps.

Once these three steps are completed, we classify all of the proposed sub-categories and create a list of potential `sub-class-of` candidates to add into KBpedia, which is then validated by a human. These steps significantly reduce the time required to add new reference concepts using an external structure such as the Wikipedia category structure.

### Cleaning Wikipedia Categories

KBpedia presently links to what we call the "clean" categories within the Wikipedia category structure. These same "clean" categories are used in this use case.
The reason for "cleaning" the Wikipedia categories is to remove internal administrative categories of use to Wikipedia alone and to remove "compound" or "artificial" categories frequently found in Wikipedia that do not conform to natural classes but are matters of grouping convenience (such as [Films directed by Pedro Almodóvar](http://en.wikipedia.org/wiki/Category:Films_directed_by_Pedro_Almod%C3%B3var) or [Ambassadors of the United States to Mexico](http://en.wikipedia.org/wiki/Category:Ambassadors_of_the_United_States_to_Mexico)).

The process that creates this "clean" category listing for Wikipedia consists of the following:

1. Read all Wikipedia categories,
2. Remove all administrative categories that were collected by hand inspection,
3. Remove all article-related categories that were collected by hand inspection,
4. Remove all date-related categories that were collected by hand inspection,
5. Remove all categories that had some preposition patterns that were collected by hand inspection,
6. Remove all the list categories that were collected by hand inspection, and
7. Remove other categories tagged for removal that do not fall into any of the categories listed above.

The result of this category filtering process is to produce a "clean" list of 88,691 Wikipedia categories. Note this is a significant reduction from the total category listing found on Wikipedia itself. This list is also used as input to our Mapper service, which provides the candidate pool for possible matches between any of these clean Wikipedia categories and existing KBpedia reference concepts. Final selections, wherever made, are drawn from this "clean" candidate pool and vetted by a human prior to acceptance.

### Introducing DeepWalk

[DeepWalk](https://arxiv.org/abs/1403.6652) was created to learn *social representations* of a graph's vertices that capture neighborhood similarity and community membership. DeepWalk generalizes neural language models to process a special language composed of a set of randomly-generated walks.

With KBpedia, we want to use DeepWalk not to learn *social representations* but to learn the relationship (that is, the similarity) between all of the concepts existing in a knowledge graph given different kinds of relationships such as `sub-class-of`, `super-class-of`, `equivalent-class` or other relationships such as KBpedia's `80 aspects relationships`.

For this use case we use the DeepWalk algorithm to select concepts from an external conceptual structure that share the same kind of graph structure and so could be added into the KBpedia knowledge graph. Other tasks that could be performed using DeepWalk in a similar manner are:

1. Content recommendation,
2. Anomaly detection [in the knowledge graph], or
3. Missing link prediction [in the knowledge graph].

Note that we randomly walk the graphs as stated in DeepWalk's original paper [1]. However, more experiments could be performed that replace the random walk with other graph walk strategies such as depth-first or breadth-first walks.
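To make the random-walk idea concrete, the following is a small illustrative sketch only, not the Deeplearning4j implementation used later: a toy graph held as an adjacency map, and a uniform random-walk generator whose output sequences play the role of the "sentences" a skip-gram model consumes. The toy graph and helper names are ours, not part of the KBpedia code.

```clojure
;; Illustrative toy example: uniform random walks over a small adjacency map.
;; DeepWalk feeds such walks to a skip-gram model as if they were sentences.
(def toy-graph
  {:Thing  [:Animal :Plant]
   :Animal [:Thing :Mammal :Bird]
   :Plant  [:Thing]
   :Mammal [:Animal]
   :Bird   [:Animal]})

(defn random-walk
  "Start at `vertex` and take uniform random steps over `graph` until the
   walk contains `length` vertices."
  [graph vertex length]
  (loop [walk [vertex]]
    (if (= (count walk) length)
      walk
      (recur (conj walk (rand-nth (get graph (peek walk))))))))

;; Five walks of length 4 starting from :Animal, e.g. [:Animal :Mammal :Animal :Thing]
(repeatedly 5 #(random-walk toy-graph :Animal 4))
```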
### Finding Candidates

The first step is to find all of the potential Wikipedia category candidates that could become new KBpedia reference concepts based on their graph structure. We list all these candidates according to these heuristics:

1. The sub-categories of Wikipedia categories that are linked to a leaf KBpedia reference concept (that is, *leaf*),
2. The sub-categories of Wikipedia categories that are linked to a KBpedia reference concept that is the parent of a leaf concept (that is, *near-leaf*),
3. The sub-categories of Wikipedia categories that are linked to a KBpedia reference concept that is neither `1.` nor `2.` (that is, *core*).

Each of these steps will lead to a different list of candidates.

We first load the KBpedia knowledge graph to analyze its *leaf*, *near-leaf* and *core* concept structure:

```clojure
(require '[cognonto-owl.model :as model])
(require '[cognonto-owl.query :as query])
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def onto-iri (str "file:/d:/cognonto-git/cognonto-deepwalk/resources/kbpedia_reference_concepts.n3"))

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology onto-iri
                                :ontology-manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))
```

```clojure
(defn is-near-leaf?
  "Return true if the reference concept is a leaf or is the parent of a leaf concept
   of the graph."
  ([rc] (is-near-leaf? rc 1))
  ([rc depth]
   (let [sub-classes (query/sub-classes rc kbpedia :direct true)]
     (if (empty? sub-classes)
       true
       (if (= 0 depth)
         false
         (->> sub-classes
              (map (fn [sub-class]
                     (is-near-leaf? sub-class (dec depth))))
              (apply = true)))))))

(defn is-leaf?
  "Return true if the reference concept is a leaf concept in the graph."
  [rc]
  (if (empty? (query/sub-classes rc kbpedia :direct true))
    true
    false))
```

Then we create the three lists of KBpedia reference concepts:

```clojure
(def near-leaf-rcs (->> (query/get-classes kbpedia)
                        (map (fn [rc]
                               (when (is-near-leaf? rc)
                                 (when-not (is-leaf? rc)
                                   rc))))
                        (remove nil?)
                        (into [])))

(def leaf-rcs (->> (query/get-classes kbpedia)
                   (map (fn [rc]
                          (when (is-leaf? rc)
                            rc)))
                   (remove nil?)
                   (into [])))

(def core-rcs (->> (query/get-classes kbpedia)
                   (map (fn [rc]
                          (when-not (is-near-leaf? rc)
                            rc)))
                   (remove nil?)
                   (into [])))
```

Each of these lists is composed of the following number of reference concepts:

```clojure
(println "Leaf reference concepts: " (count leaf-rcs))
(println "Near-leaf reference concepts: " (count near-leaf-rcs))
(println "Core reference concepts: " (count core-rcs))
```

```
Leaf reference concepts: 29,782
Near-leaf reference concepts: 4,779
Core reference concepts: 4,533
```

Finally, we query the Wikipedia category structure to get the list of all the sub-categories linked to each of the *leaf*, *near-leaf* and *core* KBpedia reference concepts. We serialize this list of candidates into three distinct CSV files.
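The `get-immediate-sub-categories` helper called below is not reproduced in this write-up. As a rough, hypothetical sketch of its general shape, assuming stand-in helpers `linked-wikipedia-category` (the Wikipedia category mapped to a reference concept) and `immediate-sub-categories` (its direct sub-categories in the category structure), it might look like this:

```clojure
;; Hypothetical sketch only: `linked-wikipedia-category` and
;; `immediate-sub-categories` stand in for the actual linkage and
;; category-structure lookups, which are not shown in this use case.
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])

(defn get-immediate-sub-categories
  [rcs csv-file]
  (with-open [out-file (io/writer csv-file)]
    (doseq [rc rcs]
      (when-let [category (linked-wikipedia-category rc)]
        (doseq [sub-category (immediate-sub-categories category)]
          (csv/write-csv out-file [[(.toString (.getIRI rc)) sub-category]]))))))
```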
```clojure
(get-immediate-sub-categories leaf-rcs "resources/leaf-rcs-narrower-concepts.csv")
(get-immediate-sub-categories near-leaf-rcs "resources/near-leaf-rcs-narrower-concepts.csv")
(get-immediate-sub-categories core-rcs "resources/core-rcs-narrower-concepts.csv")
```

These lists contain 34,957 *leaf* category candidates, 6,104 *near-leaf* category candidates and 6,066 *core* category candidates, respectively, for a grand total of 47,127 new KBpedia reference concept candidates to review.

### Create Graph Embedding Vectors

The next step is to create the `graph embedding` for each of the Wikipedia categories. We use the Wikipedia category structure along with the linked KBpedia knowledge graph, and then generate the graph embedding for each of the Wikipedia categories that exists in that structure.

The graph embeddings are generated using the [DeepWalk](https://arxiv.org/abs/1403.6652) algorithm over that linked structure. It randomly walks the linked graph hundreds of times to generate the graph embeddings for each category.

#### Create Deeplearning4j Graph

To generate the graph embeddings, we use [Deeplearning4j's](https://deeplearning4j.org/) DeepWalk implementation. The first step is to create a Deeplearning4j `graph` structure that is used by its DeepWalk implementation to generate the embeddings.

The graph we have to create is composed of the latest version of the Wikipedia category structure along with the latest version of the KBpedia knowledge graph. The Wikipedia category structure comes from the DBpedia file [skos_categories_en.ttl.bz2](http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_skos_categories_en.ttl.bz2) (2015-10 release, core-i18n/en), and the KBpedia knowledge graph is version `1.20`.

Because of the size of the file, we simply read the [Turtle](https://en.wikipedia.org/wiki/Turtle_(syntax)) file and create the vertices and the edges in the graph on the fly with some regular expression operations on each line.
In the process, we create two intermediary CSV files: one that lists all the edges, and one that lists all the vertices used to create the Deeplearning4j graph structure:

```clojure
(import org.deeplearning4j.graph.graph.Graph)
(import org.deeplearning4j.graph.api.Vertex)
(import org.deeplearning4j.graph.models.deepwalk.DeepWalk$Builder)

(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])

(defn create-dl4j-graph-vertices-index-csv
  [n3-file csv-file & {:keys [base-uri]
                       :or {base-uri nil}}]
  (with-open [in-file (io/reader n3-file)]
    (with-open [out-file (io/writer csv-file)]
      (doseq [vertice (->> (line-seq in-file)
                           (mapcat (fn [line]
                                     (when-let [matches (re-matches #"^<.*resource/Category:(.*?)>.*<.*broader.*>.*<.*resource/Category:(.*)>.*$" line)]
                                       [(nth matches 1) (nth matches 2)])))
                           (distinct)
                           (sort)
                           (into []))]
        (csv/write-csv out-file [[(if-not (nil? base-uri)
                                    (str base-uri vertice)
                                    vertice)]])))))

(defn create-dl4j-graph-edges-index-csv
  [n3-file csv-file & {:keys [base-uri]
                       :or {base-uri nil}}]
  (with-open [in-file (io/reader n3-file)]
    (with-open [out-file (io/writer csv-file)]
      (doseq [line (line-seq in-file)]
        (when-let [matches (re-matches #"^<.*resource/Category:(.*?)>.*<.*broader.*>.*<.*resource/Category:(.*)>.*$" line)]
          (csv/write-csv out-file [[(if-not (nil? base-uri)
                                      (str base-uri (nth matches 1))
                                      (nth matches 1))
                                    (if-not (nil? base-uri)
                                      (str base-uri (nth matches 2))
                                      (nth matches 2))]]))))))
```

Now we generate the two intermediary CSV files:

```clojure
(create-dl4j-graph-vertices-index-csv "resources/skos_categories_en.ttl"
                                      "resources/skos_categories_vertices.csv"
                                      :base-uri "http://wikipedia.org/wiki/Category:")

(create-dl4j-graph-edges-index-csv "resources/skos_categories_en.ttl"
                                   "resources/skos_categories_edges.csv"
                                   :base-uri "http://wikipedia.org/wiki/Category:")
```

Finally we generate the initial Deeplearning4j graph structure composed of the Wikipedia category structure and the *inferred* KBpedia knowledge graph. The resulting linked structure constitutes the graph used by the DeepWalk algorithm to generate the embedding vectors for each of the Wikipedia categories.
```clojure
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def onto-iri (str "file:/d:/cognonto-git/cognonto-deepwalk/resources/kbpedia_reference_concepts_linkage_inferrence_extended.n3"))

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia-graph (owl/load-ontology onto-iri
                                      :ontology-manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia-graph))
```

```clojure
(defn create-deepwalk-graph-from-wikipedia-categories-and-kbpedia
  [vertices-file edges-file knowledge-graph & {:keys [directed?]
                                               :or {directed? true}}]
  ;; Build an index that maps every vertex URI (KBpedia classes plus the
  ;; Wikipedia categories from the vertices CSV file) to a unique integer id
  (let [index (->> (into (->> (query/get-classes knowledge-graph)
                              (mapv (fn [class]
                                      (.toString (.getIRI class)))))
                         (->> (with-open [in-file (io/reader vertices-file)]
                                (doall (csv/read-csv in-file)))
                              (mapv (fn [[class]] class))))
                   distinct
                   (map-indexed (fn [i vertice]
                                  {vertice (inc i)}))
                   (apply merge))
        index (merge {"http://www.w3.org/2002/07/owl#Thing" 0} index)
        index (into (sorted-map-by (fn [key1 key2]
                                     (compare [(get index key1) key1]
                                              [(get index key2) key2])))
                    index)
        graph (new Graph (mapv (fn [[class i]]
                                 (new Vertex i class))
                               index)
                   false)]
    ;; Add the Wikipedia category structure edges from the edges CSV file
    (doseq [[vertice-1 vertice-2] (with-open [in-file (io/reader edges-file)]
                                    (doall (csv/read-csv in-file)))]
      (.addEdge graph (get index vertice-1) (get index vertice-2) "sub-class-of" directed?))
    ;; Add the (inferred) KBpedia sub-class-of edges
    (doseq [class (query/get-classes knowledge-graph)]
      (doseq [super-class (query/super-classes class knowledge-graph :direct true :reasoner kbpedia-reasoner)]
        (try
          (.addEdge graph
                    (get index (.toString (.getIRI class)))
                    (get index (.toString (.getIRI super-class)))
                    "sub-class-of" directed?)
          (catch Exception e))))
    ;; Make sure that all the nodes are at least connected to owl:Thing
    (doseq [i (take (.numVertices graph) (iterate inc 0))]
      (when (<= (.getVertexDegree graph i) 0)
        (.addEdge graph i 0 "sub-class-of" directed?)))
    graph))
```

Next we generate the final Deeplearning4j graph used by the DeepWalk algorithm to generate the Wikipedia categories graph embeddings.
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">graph</span> <span style="color: #66D9EF;">(</span>create-deepwalk-graph-from-wikipedia-categories-and-kbpedia <span style="color: #E6DB74;">"resources</span><span style="color: #E6DB74; font-weight: bold;">\\</span><span style="color: #E6DB74;">skos_categories_vertices.csv"</span> <span style="color: #E6DB74;">"resources</span><span style="color: #E6DB74; font-weight: bold;">\\</span><span style="color: #E6DB74;">skos_categories_edges.csv"</span> kbpedia-graph <span style="color: #AE81FF;">:directed?</span> <span style="color: #AE81FF;">true</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> </div> </div> <div id="outline-container-org17b64d4" class="outline-4"> <br /> <h4 id="org17b64d4">Train DeepWalk</h4> <div class="outline-text-4" id="text-org17b64d4"> <p> Once the Deeplearning4j graph is created, the next step is to create and train the DeepWalk algorithm. What the <code>(create-deep-walk)</code> function does is to create and initialize a <code>DeepWalk</code> object with the <code>Graph</code> we created above and with some hyperparameters. </p> <p> The <code>:window-size</code> hyperparameter is the size of the window used by the continuous <a href="https://en.wikipedia.org/wiki/N-gram#Skip-gram">Skip-gram</a> algorithm used in DeepWalk. The <code>:vector-size</code> hyperparameter is the size of the embedding vectors we want the DeepWalk to generate (it is the number of dimensions of our model). The <code>:learning-rate</code> is the initial leaning rate of the <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">Stochastic gradient descent</a>. </p> <p> For this task, we initially use a window of <code>15</code> and <code>3</code> dimensions to make visualizations simpler to interpret, and an initial learning rate of <code>2.5%</code>. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>use '<span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">cognonto-deepwalk.core</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">deep-walk</span> <span style="color: #66D9EF;">(</span>create-deep-walk graph <span style="color: #AE81FF;">:window-size</span> 15 <span style="color: #AE81FF;">:vector-size</span> 3 <span style="color: #AE81FF;">:learning-rate</span> 0.025<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Once the DeepWalk object is created and initialized with the graph, the next step is to train that model to generate the embedding vectors for each vertice in the graph. </p> <p> The training is performed using a random walk iterator. The two hyperparameters related to the training process are the <code>walk-length</code> and the <code>walks-per-vertex</code>. The <code>walk-length</code> is the number of vertices we want to visit for each iteration. The <code>walks-per-vertex</code> is the number of timex we want to create random walks for each vertex in the graph. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">train</span> <span style="color: #66D9EF;">(</span><span style="color: #A6E22E;">[</span>deep-walk iterator<span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span>train deep-walk iterator 1<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #A6E22E;">[</span>deep-walk iterator walks-per-vertex<span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">.fit</span> deep-walk iterator<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">dotimes</span> <span style="color: #E6DB74;">[</span>n walks-per-vertex<span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">.reset</span> iterator<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">.fit</span> deep-walk iterator<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #A6E22E;">[</span>deep-walk graph walk-length walks-per-vertex<span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span>train deep-walk <span style="color: #E6DB74;">(</span><span style="color: #F92672;">new</span> <span style="color: #66D9EF;">RandomWalkIterator</span> graph walk-length<span style="color: #E6DB74;">)</span> walks-per-vertex<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> For the initial setup, we want to have a <code>walk-length</code> of <code>15</code> and we want to iterate the process <code>175</code> times per vertex. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>train deep-walk graph 15 175<span style="color: #AE81FF;">)</span> </pre> </div> </div> </div> <div id="outline-container-org2573265" class="outline-4"> <br /> <h4 id="org2573265">Create Training Sets</h4> <div class="outline-text-4" id="text-org2573265"> <p> Now that the DeepWalk algorithm has been created and trained, we then create the training sets for the SVM classification model. The training set is created from the manually vetted linkages between the KBpedia knowledge graph and the Wikipedia category structure. <code>75%</code> of the vetted linkages is used for <i>training</i> and <code>25%</code> for <i>cross validation</i>. This sampling is performed randomly. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">build-svm-model-vectors</span> <span style="color: #75715E;">"Build the vectors used to train the SVM model that is used to determine </span> <span style="color: #75715E;"> if a sub-category is likely to create a good new reference concept </span> <span style="color: #75715E;"> candidate based on its graph embedding vector."</span> <span style="color: #66D9EF;">[</span>training-set-csv-file deep-walk index & <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:keys</span> <span style="color: #E6DB74;">[</span>base-uri<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:or</span> <span style="color: #E6DB74;">{</span>base-uri <span style="color: #AE81FF;">nil</span><span style="color: #E6DB74;">}</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>training-csv <span style="color: #E6DB74;">(</span>rest <span style="color: #FD971F;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #F92672;">[</span>in-file <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">io</span><span style="color: #66D9EF;">/</span>reader training-set-csv-file<span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">doall</span> <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>read-csv in-file<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> sets <span style="color: #E6DB74;">(</span><span style="color: #F92672;">->></span> training-csv <span style="color: #FD971F;">(</span>map <span style="color: #F92672;">(</span><span style="color: #F92672;">fn</span> <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>kbpedia-rc wikipedia-category possible-new-sub-class-of is-sub-class-of?<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">when</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">and</span> <span style="color: #A6E22E;">(</span>not <span style="color: #AE81FF;">(</span>empty? possible-new-sub-class-of<span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>not <span style="color: #AE81FF;">(</span>nil? <span style="color: #66D9EF;">(</span>get index <span style="color: #A6E22E;">(</span><span style="color: #F92672;">if</span> <span style="color: #E6DB74;">(</span>nil? 
base-uri<span style="color: #E6DB74;">)</span> possible-new-sub-class-of <span style="color: #E6DB74;">(</span>str base-uri possible-new-sub-class-of<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>list <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:name</span> possible-new-sub-class-of <span style="color: #AE81FF;">:class</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">if</span> <span style="color: #66D9EF;">(</span>= is-sub-class-of? <span style="color: #E6DB74;">"x"</span><span style="color: #66D9EF;">)</span> 1 0<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">:f</span> <span style="color: #AE81FF;">(</span>into <span style="color: #66D9EF;">(</span>sorted-map-by <<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">->></span> <span style="color: #A6E22E;">(</span>read-string <span style="color: #E6DB74;">(</span><span style="color: #F92672;">.toString</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">.data</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">.getVector</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">.lookupTable</span> deep-walk<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>get index <span style="color: #66D9EF;">(</span><span style="color: #F92672;">if</span> <span style="color: #A6E22E;">(</span>nil? base-uri<span style="color: #A6E22E;">)</span> possible-new-sub-class-of <span style="color: #A6E22E;">(</span>str base-uri possible-new-sub-class-of<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>map-indexed <span style="color: #E6DB74;">(</span><span style="color: #F92672;">fn</span> <span style="color: #FD971F;">[</span>feature-id value<span style="color: #FD971F;">]</span> <span style="color: #FD971F;">{</span>feature-id value<span style="color: #FD971F;">}</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>apply merge<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>apply concat<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>remove nil?<span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> sets <span style="color: #E6DB74;">(</span>shuffle sets<span style="color: #E6DB74;">)</span> sets <span style="color: #E6DB74;">(</span>split-at <span style="color: #FD971F;">(</span>int <span style="color: #F92672;">(</span><span style="color: #66D9EF;">java.lang.Math</span><span style="color: #66D9EF;">/</span>floor <span style="color: #AE81FF;">(</span>* <span style="color: #66D9EF;">(</span>count sets<span style="color: #66D9EF;">)</span> 0.75<span style="color: 
#AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> sets<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:training</span> <span style="color: #E6DB74;">(</span>first sets<span style="color: #E6DB74;">)</span> <span style="color: #AE81FF;">:validation</span> <span style="color: #E6DB74;">(</span>second sets<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">model-vectors</span> <span style="color: #66D9EF;">(</span>build-svm-model-vectors <span style="color: #E6DB74;">"resources/core-wikipedia-subclass-mapped--extended.csv"</span> deep-walk index <span style="color: #AE81FF;">:base-uri</span> <span style="color: #E6DB74;">"http://wikipedia.org/wiki/Category:"</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> The training set is composed of 7,124 items and the validation set is composed of 2,375. There are 7,646 positive training examples and 1,853 negative ones. </p> </div> </div> <div id="outline-container-orga763791" class="outline-4"> <br /> <h4 id="orga763791">Train SVM Classifier</h4> <div class="outline-text-4" id="text-orga763791"> <p> Now that the training and the validation sets have been generated, the next step is to train the SVM classifier using the training set. The initial evaluation of our model uses the following hyperparameter values. Also, the SVM implementation we use is <a href="https://github.com/cjlin1/libsvm/tree/master/java">LIBSVM implemented in Java</a>. We use the RBF kernel. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>use '<span style="color: #66D9EF;">svm.core</span><span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">libsvm-model</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>train-model <span style="color: #A6E22E;">(</span><span style="color: #F92672;">->></span> <span style="color: #E6DB74;">(</span><span style="color: #AE81FF;">:training</span> model-vectors<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span>map <span style="color: #FD971F;">(</span><span style="color: #F92672;">fn</span> <span style="color: #F92672;">[</span>item<span style="color: #F92672;">]</span> <span style="color: #F92672;">[</span><span style="color: #AE81FF;">(</span><span style="color: #F92672;">if</span> <span style="color: #66D9EF;">(</span>= <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:class</span> item<span style="color: #A6E22E;">)</span> 0<span style="color: #66D9EF;">)</span> -1 1<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">->></span> <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:f</span> item<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>map <span style="color: #A6E22E;">(</span><span style="color: #F92672;">fn</span> <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>k v<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">{</span><span style="color: #66D9EF;">(</span>inc k<span style="color: #66D9EF;">)</span> v<span style="color: #AE81FF;">}</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>apply merge<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> concat<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:eps</span> 1e-3 <span style="color: #AE81FF;">:gamma</span> 0 <span style="color: #AE81FF;">:kernel-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:rbf</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>kernel-types<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:svm-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:one-class</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>svm-types<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> </div> </div> <div id="outline-container-org713ce59" class="outline-4"> <br /> <h4 id="org713ce59">Evaluate Model</h4> <div class="outline-text-4" id="text-org713ce59"> <p> Once the SVM model is trained, the next step is to evaluate its performance. 
The different performance metrics are calculated using the following function: </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">evaluate-libsvm-model</span> <span style="color: #66D9EF;">[</span>model-vectors svm-model<span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>validation-set <span style="color: #E6DB74;">(</span><span style="color: #AE81FF;">:validation</span> model-vectors<span style="color: #E6DB74;">)</span> true-positive <span style="color: #E6DB74;">(</span>atom 0<span style="color: #E6DB74;">)</span> false-positive <span style="color: #E6DB74;">(</span>atom 0<span style="color: #E6DB74;">)</span> true-negative <span style="color: #E6DB74;">(</span>atom 0<span style="color: #E6DB74;">)</span> false-negative <span style="color: #E6DB74;">(</span>atom 0<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #E6DB74;">[</span>case validation-set<span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">let</span> <span style="color: #FD971F;">[</span>predicted-class <span style="color: #F92672;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>predict svm-model <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:f</span> case<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">when</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">and</span> <span style="color: #AE81FF;">(</span>= <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:class</span> case<span style="color: #66D9EF;">)</span> 1<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>= predicted-class 1.0<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span>swap! true-positive inc<span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">when</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">and</span> <span style="color: #AE81FF;">(</span>= <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:class</span> case<span style="color: #66D9EF;">)</span> 0<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>= predicted-class 1.0<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span>swap! false-positive inc<span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">when</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">and</span> <span style="color: #AE81FF;">(</span>= <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:class</span> case<span style="color: #66D9EF;">)</span> 0<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>= predicted-class -1.0<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span>swap! 
true-negative inc<span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">when</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">and</span> <span style="color: #AE81FF;">(</span>= <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:class</span> case<span style="color: #66D9EF;">)</span> 1<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>= predicted-class -1.0<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span>swap! false-negative inc<span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"True positive: "</span> @true-positive<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"false positive: "</span> @false-positive<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"True negative: "</span> @true-negative<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println <span style="color: #E6DB74;">"False negative: "</span> @false-negative<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>println<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">if</span> <span style="color: #E6DB74;">(</span>= 0 @true-positive<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">let</span> <span style="color: #FD971F;">[</span>precision 0 recall 0 accuracy 0 f1 0<span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"Precision: "</span> precision<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"Recall: "</span> recall<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"Accuracy: "</span> accuracy<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"F1: "</span> f1<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">{</span><span style="color: #AE81FF;">:precision</span> precision <span style="color: #AE81FF;">:recall</span> recall <span style="color: #AE81FF;">:accuracy</span> accuracy <span style="color: #AE81FF;">:f1</span> f1<span style="color: #FD971F;">}</span><span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">let</span> <span style="color: #FD971F;">[</span>precision <span style="color: #F92672;">(</span>float <span style="color: #AE81FF;">(</span>/ @true-positive <span style="color: #66D9EF;">(</span>+ @true-positive @false-positive<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> recall <span style="color: #F92672;">(</span>float <span style="color: #AE81FF;">(</span>/ @true-positive <span style="color: #66D9EF;">(</span>+ @true-positive @false-negative<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> accuracy <span style="color: #F92672;">(</span>float <span style="color: 
#AE81FF;">(</span>/ <span style="color: #66D9EF;">(</span>+ @true-positive @true-negative<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>+ @true-positive @false-negative @false-positive @true-negative<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span> f1 <span style="color: #F92672;">(</span>float <span style="color: #AE81FF;">(</span>* 2 <span style="color: #66D9EF;">(</span>/ <span style="color: #A6E22E;">(</span>* precision recall<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>+ precision recall<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"Precision: "</span> precision<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"Recall: "</span> recall<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"Accuracy: "</span> accuracy<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>println <span style="color: #E6DB74;">"F1: "</span> f1<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">{</span><span style="color: #AE81FF;">:precision</span> precision <span style="color: #AE81FF;">:recall</span> recall <span style="color: #AE81FF;">:accuracy</span> accuracy <span style="color: #AE81FF;">:f1</span> f1<span style="color: #FD971F;">}</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-libsvm-model model-vectors libsvm-model<span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 1000 false positive: 207 True negative: 227 False negative: 941 Precision: 0.8285004 Recall: 0.51519835 Accuracy: 0.5166316 F1: 0.635324 </pre> </div> </div> <div id="outline-container-org7d8a146" class="outline-4"> <br /> <h4 id="org7d8a146">SVM Hyperparameters Optimization</h4> <div class="outline-text-4" id="text-org7d8a146"> <p> The last step is to optimize the hyperparameters of the prediction workflow. A grid search algorithm is used to optimize the following SVM hyperparameters that uses a <code>RBF kernel</code> with the <code>one-class</code> algorithm: </p> <ol class="org-ol"> <li><code>gamma</code>,</li> <li><code>eps</code>,</li> <li>Usage of <code>shrinking</code>, and</li> <li>Usage of <code>probability</code></li> </ol> <p> This algorithm simply iterates over all of the possible combinations of hyperparameter values as specified in the input grid parameters. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">svm-grid-search-one-class-rbf</span> <span style="color: #66D9EF;">[</span>model-vectors & <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:keys</span> <span style="color: #E6DB74;">[</span>grid-parameters selection-metric<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:or</span> <span style="color: #E6DB74;">{</span>grid-parameters <span style="color: #FD971F;">[</span><span style="color: #F92672;">{</span><span style="color: #AE81FF;">:gamma</span> <span style="color: #AE81FF;">[</span>1 1/2 1/3 1/4 1/5 1/7 1/10 1/50 1/100<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:eps</span> <span style="color: #AE81FF;">[</span>0.000001 0.00001 0.0001 0.001 0.01 0.1 0.2<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:nu</span> <span style="color: #AE81FF;">[</span>0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:probability</span> <span style="color: #AE81FF;">[</span>0 1<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:shrinking</span> <span style="color: #AE81FF;">[</span>0 1<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">:selection-metric</span> <span style="color: #AE81FF;">:f1</span><span style="color: #F92672;">}</span><span style="color: #FD971F;">]</span><span style="color: #E6DB74;">}</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>best <span style="color: #E6DB74;">(</span>atom <span style="color: #FD971F;">{</span><span style="color: #AE81FF;">:score</span> 0.0 <span style="color: #AE81FF;">:eps</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:nu</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:gamma</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:probability</span> <span style="color: #AE81FF;">nil</span> <span style="color: #AE81FF;">:shrinking</span> <span style="color: #AE81FF;">nil</span><span style="color: #FD971F;">}</span><span style="color: #E6DB74;">)</span> training-vector <span style="color: #E6DB74;">(</span><span style="color: #F92672;">->></span> <span style="color: #FD971F;">(</span><span style="color: #AE81FF;">:training</span> model-vectors<span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>map <span style="color: #F92672;">(</span><span style="color: #F92672;">fn</span> <span style="color: #AE81FF;">[</span>item<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">(</span><span style="color: #F92672;">if</span> <span style="color: #A6E22E;">(</span>= <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:class</span> item<span style="color: #AE81FF;">)</span> 0<span style="color: #A6E22E;">)</span> -1 1<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">->></span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:f</span> item<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>map <span style="color: #AE81FF;">(</span><span style="color: #F92672;">fn</span> <span style="color: 
#66D9EF;">[</span><span style="color: #A6E22E;">[</span>k v<span style="color: #A6E22E;">]</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">{</span><span style="color: #A6E22E;">(</span>inc k<span style="color: #A6E22E;">)</span> v<span style="color: #66D9EF;">}</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>apply merge<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">]</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> concat<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #E6DB74;">[</span>parameters grid-parameters<span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #FD971F;">[</span>probability <span style="color: #F92672;">(</span><span style="color: #AE81FF;">:probability</span> parameters<span style="color: #F92672;">)</span><span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #F92672;">[</span>shrinking <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:shrinking</span> parameters<span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #AE81FF;">[</span>gamma <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:gamma</span> parameters<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #66D9EF;">[</span>nu <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:nu</span> parameters<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #A6E22E;">[</span>eps <span style="color: #AE81FF;">(</span><span style="color: #AE81FF;">:eps</span> parameters<span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">let</span> <span style="color: #AE81FF;">[</span>svm-model <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>train-model training-vector <span style="color: #AE81FF;">:eps</span> eps <span style="color: #AE81FF;">:gamma</span> gamma <span style="color: #AE81FF;">:nu</span> nu <span style="color: #AE81FF;">:shrinking</span> shrinking <span style="color: #AE81FF;">:probability</span> probability <span style="color: #AE81FF;">:kernel-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:rbf</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>kernel-types<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:svm-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:one-class</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>svm-types<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> results <span style="color: #66D9EF;">(</span>evaluate-libsvm-model model-vectors 
svm-model<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">(</span>println <span style="color: #E6DB74;">"Probability:"</span> probability<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>println <span style="color: #E6DB74;">"Shrinking:"</span> shrinking<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>println <span style="color: #E6DB74;">"Gamma:"</span> gamma<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>println <span style="color: #E6DB74;">"Eps:"</span> eps<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>println <span style="color: #E6DB74;">"Nu:"</span> nu<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>println<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">when</span> <span style="color: #66D9EF;">(</span>> <span style="color: #A6E22E;">(</span>get results <span style="color: #E6DB74;">(</span><span style="color: #AE81FF;">:selection-metric</span> parameters<span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:score</span> @best<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>reset! best <span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:score</span> <span style="color: #E6DB74;">(</span>get results <span style="color: #FD971F;">(</span><span style="color: #AE81FF;">:selection-metric</span> parameters<span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> <span style="color: #AE81FF;">:gamma</span> gamma <span style="color: #AE81FF;">:eps</span> eps <span style="color: #AE81FF;">:nu</span> nu <span style="color: #AE81FF;">:probability</span> probability <span style="color: #AE81FF;">:shrinking</span> shrinking<span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span> @best<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Let's run the grid search to find the best values for each of the hyperparameters of the SVM classifier that uses a RBF kernel while optimizing the result of the <code>F1</code> score. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>svm-grid-search-one-class-rbf model-vectors <span style="color: #AE81FF;">:grid-parameters</span> <span style="color: #66D9EF;">[</span><span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:gamma</span> <span style="color: #E6DB74;">[</span>1 1/2 1/3 1/4 1/5 1/7 1/10 1/50 1/100<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:eps</span> <span style="color: #E6DB74;">[</span>0.000001 0.00001 0.0001 0.001 0.01 0.1 0.2<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:nu</span> <span style="color: #E6DB74;">[</span>0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:probability</span> <span style="color: #E6DB74;">[</span>0 1<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:shrinking</span> <span style="color: #E6DB74;">[</span>0 1<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:selection-metric</span> <span style="color: #AE81FF;">:f1</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> {:score 0.85630643 :gamma 1 :eps 0.2 :nu 0.1 :probability 0 :shrinking 0} </pre> <p> Finally, we create the final optimized SVM model that uses these optimal hyperparameters. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">libsvm-model-optimized</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>train-model <span style="color: #A6E22E;">(</span><span style="color: #F92672;">->></span> <span style="color: #E6DB74;">(</span><span style="color: #AE81FF;">:training</span> model-vectors<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span>map <span style="color: #FD971F;">(</span><span style="color: #F92672;">fn</span> <span style="color: #F92672;">[</span>item<span style="color: #F92672;">]</span> <span style="color: #F92672;">[</span><span style="color: #AE81FF;">(</span><span style="color: #F92672;">if</span> <span style="color: #66D9EF;">(</span>= <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:class</span> item<span style="color: #A6E22E;">)</span> 0<span style="color: #66D9EF;">)</span> -1 1<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">->></span> <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:f</span> item<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>map <span style="color: #A6E22E;">(</span><span style="color: #F92672;">fn</span> <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>k v<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">{</span><span style="color: #66D9EF;">(</span>inc k<span style="color: #66D9EF;">)</span> v<span style="color: #AE81FF;">}</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>apply merge<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> concat<span style="color: #A6E22E;">)</span> <span 
style="color: #AE81FF;">:eps</span> 0.2 <span style="color: #AE81FF;">:gamma</span> 1 <span style="color: #AE81FF;">:probability</span> 0 <span style="color: #AE81FF;">:shrinking</span> 0 <span style="color: #AE81FF;">:nu</span> 0.1 <span style="color: #AE81FF;">:kernel-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:rbf</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>kernel-types<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:svm-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:one-class</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>svm-types<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> </div> </div> <div id="outline-container-org8851133" class="outline-4"> <br /> <h4 id="org8851133">Visualize With TensorFlow Projector</h4> <div class="outline-text-4" id="text-org8851133"> <p> What we want want to do next is to visualize the apparent relationship between each of the positive and negative training examples based on their latent representation vectors as computed by the DeepWalk algorithm. We visualize the training set using two different methods: <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> (PCA) and <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed Stochastic Neighbor Embedding</a> (t-SNE). </p> <p> The goal of trying to visualize the graph using these techniques is to try to find insights into the models we created and see if our intuition holds. </p> <p> Note that the following visualizations are used to get <code>insights</code> into the model(s) we are trying to create. The visualizations help find possible correlations, boundaries and outliers in the model. By the nature of the algorithms used, these visualizations should not be overinterpreted. It is to be noted that "The assumption that the data lies along a low-dimentional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct." <sup><a id="fnr.2" class="footref" href="#fn.2">2</a></sup> </p> <p> The tool we use to create these visualizations is the <a href="http://projector.tensorflow.org/">TensorFlow Projector</a> web application. This visualization tool requires two input CSV files: one that lists all the vertices of the graph with their embedding vectors, and one that lists all metadata (name, class, etc.) of each of these vertices. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">generate-tensorflow-projector-vectors-libsvm</span> <span style="color: #66D9EF;">[</span>model-vectors svm-model vectors-file-name metadata-file-name<span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #A6E22E;">[</span>out-file-vectors <span style="color: #E6DB74;">(</span><span style="color: #66D9EF;">io</span><span style="color: #66D9EF;">/</span>writer <span style="color: #FD971F;">(</span>str <span style="color: #E6DB74;">"resources/"</span> vectors-file-name<span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #E6DB74;">[</span>out-file-metadata <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">io</span><span style="color: #66D9EF;">/</span>writer <span style="color: #F92672;">(</span>str <span style="color: #E6DB74;">"resources/"</span> metadata-file-name<span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file-metadata <span style="color: #FD971F;">[</span><span style="color: #F92672;">[</span><span style="color: #E6DB74;">"name"</span> <span style="color: #E6DB74;">"class"</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">]</span> <span style="color: #AE81FF;">:separator</span> <span style="color: #E6DB74;">\tab</span><span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #FD971F;">[</span>model <span style="color: #F92672;">(</span><span style="color: #AE81FF;">:training</span> model-vectors<span style="color: #F92672;">)</span><span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file-vectors <span style="color: #F92672;">[</span><span style="color: #AE81FF;">(</span>into <span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span>vals <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:f</span> model<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #AE81FF;">:separator</span> <span style="color: #E6DB74;">\tab</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file-metadata <span style="color: #F92672;">[</span><span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:name</span> model<span style="color: #66D9EF;">)</span> <span style="color: #E6DB74;">"n/a"</span><span style="color: #AE81FF;">]</span><span style="color: #F92672;">]</span> <span style="color: #AE81FF;">:separator</span> <span style="color: #E6DB74;">\tab</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: 
#F92672;">doseq</span> <span style="color: #FD971F;">[</span>model <span style="color: #F92672;">(</span><span style="color: #AE81FF;">:validation</span> model-vectors<span style="color: #F92672;">)</span><span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file-vectors <span style="color: #F92672;">[</span><span style="color: #AE81FF;">(</span>into <span style="color: #66D9EF;">[</span><span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span>vals <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:f</span> model<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #AE81FF;">:separator</span> <span style="color: #E6DB74;">\tab</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file-metadata <span style="color: #F92672;">[</span><span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:name</span> model<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">if</span> <span style="color: #A6E22E;">(</span>= <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>predict svm-model <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:f</span> model<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> 1.0<span style="color: #A6E22E;">)</span> 1 0<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">]</span><span style="color: #F92672;">]</span> <span style="color: #AE81FF;">:separator</span> <span style="color: #E6DB74;">\tab</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>generate-tensorflow-projector-vectors-libsvm model-vectors libsvm-model-optimized <span style="color: #E6DB74;">"kbpedia-projector-vectors-libsvm-optimized.csv"</span> <span style="color: #E6DB74;">"kbpedia-projector-metadata-libsvm-optimized.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> </div> <br /> <h4>Principal Component Analysis</h4> <div class="outline-text-5" id="text-orgcc8634e"> <p> <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> can be viewed as an unsupervised learning algorithm that learns the representation of data through patterns in the data, so as to detect correlations between variables. Before starting to look into these visualization and get insights from them, let's review how they get created in the first place. </p> <p> Each node in the scatter plots below is a concept that belongs to the training and validation sets. The positioning of these nodes in the plot is determined by their respective graph embeddings. These graph embeddings are the latent representations of the concepts within the graph composed of the KBpedia knowledge graph linked to their respective Wikipedia categories. 
Each of these nodes is also labeled with metadata comprising: </p> <ol class="org-ol"> <li>The name of the concept, coming from the Wikipedia category, that we consider adding to the KBpedia knowledge graph, and</li> <li>The SVM classification for each of these concepts. The SVM classification can be one of three values: <ol class="org-ol"> <li><code>0</code> - which means that the node doesn't belong to the class,</li> <li><code>1</code> - which means that the node does belong to the class, or</li> <li><code>n/a</code> - which means that the node belongs to the training set. Note that only the validation set is classified by the SVM.</li> </ol></li> </ol> <p> The SVM classification of each of these nodes is unrelated to the way the PCA or the t-SNE algorithm works. However, we want to see whether there are correlations between the two by checking the positioning of the nodes against how the SVM classified them. </p> <p> All of the scatter plots below have been generated using the <a href="http://projector.tensorflow.org/">TensorFlow Projector</a> application. For each of them, we also provide the vectors and metadata files that enable independent visualization and exploration of these graphs. </p> <p> To load this scatter plot, <a href="tsne-wl15-p25-l10-i8500-3d.zip">download this package</a>, follow the instructions shown under <a href="http://projector.tensorflow.org/">http://projector.tensorflow.org/</a>, then load the <code>vectors</code> and <code>metadata</code> files. Finally, load the bookmark using the <code>txt</code> file from the package. </p> <p> In the following scatter plot, we can see all the 9,499 examples of the <code>training</code> and <code>validation</code> sets as visualized by the PCA algorithm. </p> <div class="figure"> <p><img src="pca-wl15-3d-call.png" alt="pca-wl15-3d-call.png" /> </p> </div> <p> Now we only highlight the negative examples of the <code>validation</code> set. What we can observe in this graph is that these negative examples, as classified by the SVM model we created above, appear at the edges of the PCA plot. </p> <div class="figure"> <p><img src="pca-wl15-3d-c0.png" alt="pca-wl15-3d-c0.png" /> </p> </div> <p> Then if we highlight the positive examples of the <code>validation</code> set, we can observe that these positive examples have been arranged at the center of the PCA plot. </p> <div class="figure"> <p><img src="pca-wl15-3d-c1.png" alt="pca-wl15-3d-c1.png" /> </p> </div> <p> This material shows the structure that emerges from the creation of the graph embeddings, how the examples are classified by the SVM classifier trained on the <code>training</code> set, and how they are arranged by the PCA algorithm. Let's investigate further with the t-SNE algorithm. </p> </div> <br /> <h4>t-distributed Stochastic Neighbor Embedding</h4> <div class="outline-text-5" id="text-orgecbec0b"> <p> <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> is another unsupervised learning algorithm that tries to learn the representation of the data to get insights about it. In the following graphs, we show a few 3-dimensional scatter plots to see what they look like, but we focus on the 2-dimensional renderings of the same plots since they are easier to visualize and interpret. We encourage you to load the graphs in the TensorFlow Projector to play with and manipulate the 3-dimensional expressions of the plot. 
</p> <p> The following 3D scatter plot shows all the 9,499 examples from the <code>training</code> and the <code>validation</code> sets as rendered by the t-SNE algorithm. This graph uses a <code>perplexity</code> of <code>25</code> and a <code>learning rate</code> of 10, and we ran the algorithm for 8,500 iterations. Remember that the graph embedding vectors of these examples have three features, which means that they are represented in three dimensions. However, when we visualize the same t-SNE graphs in two dimensions, the algorithm reduces the dimensionality by one. </p> <p> What we can observe is that a few big clusters emerge. Theoretically, the features learned by the DeepWalk algorithm represent the inner structure of the graph surrounding the vertices (concepts). With the current graph we created, we have the full KBpedia knowledge graph structure linked to the full Wikipedia category structure. These clusters represent concepts that have a similar graph structure when the DeepWalk algorithm randomly walks from each vertex, following the <code>sub-class-of</code> directed edges of the graph up to a depth of 15. </p> <p> Intuitively, we assume that these clusters are created because their concepts follow a similar conceptual parental chain up to shared <a href="/docs/kko-upper-structure/">SuperTypes</a>. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-call.png" alt="tsne-wl15-p25-l10-i8500-3d-call.png" /> </p> </div> <p> Now let's highlight the negative examples of the <code>validation</code> set. What we can observe in this graph is that these negative examples, as classified by the SVM model we created above, appear at the center of each cluster created by the t-SNE algorithm. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-c0.png" alt="tsne-wl15-p25-l10-i8500-3d-c0.png" /> </p> </div> <p> When highlighting the positive examples of the validation set, we can see that almost all of these vertices belong to the clusters, but they appear mostly at the outer edges of the clusters. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-c1.png" alt="tsne-wl15-p25-l10-i8500-3d-c1.png" /> </p> </div> <p> Now let's reduce the plot to two dimensions to have an overall easier scatter plot to visualize. Here are all the <code>training</code> and <code>validation</code> set examples. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-to-2d-call.png" alt="tsne-wl15-p25-l10-i8500-3d-to-2d-call.png" /> </p> </div> <p> Here we only focus on the negative examples of the <code>validation</code> set. We can clearly see that each of the negative examples appears at the center of each cluster. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-to-2d-c0.png" alt="tsne-wl15-p25-l10-i8500-3d-to-2d-c0.png" /> </p> </div> <p> Here we only focus on the positive examples of the <code>validation</code> set. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-to-2d-c1.png" alt="tsne-wl15-p25-l10-i8500-3d-to-2d-c1.png" /> </p> </div> <p> Now let's focus on specific Wikipedia categories that can belong to the <code>training</code> or the <code>validation</code> set. In the following scatter plot, we highlight all the categories that have the word "police" in them. What we can observe is that these categories, all related to the general topic of <code>police</code>, belong to different clusters. At the same time, they refer to different kinds of concepts related to the <code>police</code> topic. 
<p> Now let's highlight the negative examples of the <code>validation</code> set. What we can observe in this graph is that all the examples that have been classified by the SVM model we created above appear at the center of each cluster created by the t-SNE algorithm. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-c0.png" alt="tsne-wl15-p25-l10-i8500-3d-c0.png" /> </p> </div> <p> When highlighting the positive examples of the <code>validation</code> set, we can see that almost all of the vertices belong to clusters, but they tend to appear at the outer edges of those clusters. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-c1.png" alt="tsne-wl15-p25-l10-i8500-3d-c1.png" /> </p> </div> <p> Now let's reduce the number of dimensions to two to obtain a scatter plot that is easier to visualize. Here are all the <code>training</code> and <code>validation</code> set examples. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-to-2d-call.png" alt="tsne-wl15-p25-l10-i8500-3d-to-2d-call.png" /> </p> </div> <p> Here we only focus on the negative examples of the <code>validation</code> set. We can clearly see that each of the negative examples appears at the center of each cluster. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-to-2d-c0.png" alt="tsne-wl15-p25-l10-i8500-3d-to-2d-c0.png" /> </p> </div> <p> Here we only focus on the positive examples of the <code>validation</code> set. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-to-2d-c1.png" alt="tsne-wl15-p25-l10-i8500-3d-to-2d-c1.png" /> </p> </div> <p> Now let's focus on specific Wikipedia categories that can belong to the <code>training</code> or the <code>validation</code> set. In the following scatter plot, we highlight all the categories that have the word "police" in them. What we can observe is that these categories, all related to the general topic of <code>police</code>, belong to different clusters; at the same time, they refer to different kinds of concepts related to the <code>police</code> topic. This suggests, and may validate, our earlier intuition about the nature of these clusters: they are related to some <a href="/docs/kko-upper-structure/">SuperType</a> and are clustered as such. Let's see if this holds with other kinds of topics as well. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-focus_police.png" alt="tsne-wl15-p25-l10-i8500-3d-focus_police.png" /> </p> </div> <p> Here we highlight all the Wikipedia categories that have the word "fire" in them. The same pattern appears again: most of the concepts belong to distinct clusters and appear to be related to different upper structure concepts. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-focus_fire.png" alt="tsne-wl15-p25-l10-i8500-3d-focus_fire.png" /> </p> </div> <p> As another example, here we highlight all the Wikipedia categories that have the word "wine" in them. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i8500-3d-focus_wine.png" alt="tsne-wl15-p25-l10-i8500-3d-focus_wine.png" /> </p> </div> </div> </div> <div id="outline-container-orga71e656" class="outline-4"> <br /> <h4 id="orga71e656">Experimenting With Multiple DeepWalk Hyperparameters</h4> <div class="outline-text-4" id="text-orga71e656"> <p> Now that we have optimized the SVM hyperparameters to find the best classification performance, let's see how some of the hyperparameters of the DeepWalk algorithm can influence the performance of the model. The DeepWalk hyperparameters we want to experiment with are: </p> <ol class="org-ol"> <li>The number of dimensions of the graph embedding vectors (<code>:vector-size</code>), and</li> <li>The depth of the random walks (the <code>walk-length</code>).</li> </ol> </div> <br /> <h4>Walk-length of Five</h4> <div class="outline-text-5" id="text-orgbac43ad"> <p> The first experiment we want to perform is to visualize how the DeepWalk algorithm reacts if we change the depth of the random walks from 15 steps to only five. We begin by creating the <code>deep-walk</code> object with a <code>window-size</code> of 15 and a <code>vector-size</code> of three; nothing else changes in this experiment. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">deep-walk</span> <span style="color: #66D9EF;">(</span>create-deep-walk graph <span style="color: #AE81FF;">:window-size</span> 15 <span style="color: #AE81FF;">:vector-size</span> 3 <span style="color: #AE81FF;">:learning-rate</span> 0.025<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Then, when we train the model, we want the algorithm to walk five steps from each vertex in the graph instead of 15. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>train deep-walk graph 5 175<span style="color: #AE81FF;">)</span> </pre> </div>
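<p> As an aside, once the model is trained, the learned embedding of any indexed category can be inspected directly. The following sketch is purely illustrative: it reuses the same lookup-table access pattern as the <code>classify-categories</code> function shown later in this article, and the returned values are invented. </p> <div class="org-src-container"> <pre class="src src-clojure">;; Illustrative only: read back the learned embedding vector for one
;; category, using the same lookup-table access pattern as the
;; classify-categories function further below.
(-> (.lookupTable deep-walk)
    (.getVector (get index "http://wikipedia.org/wiki/Category:Chiefs_of_police"))
    .data
    .toString
    read-string)
;; => e.g. [0.41 -0.88 0.27]   ; one value per :vector-size dimension
</pre> </div>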
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">model-vectors</span> <span style="color: #66D9EF;">(</span>build-svm-model-vectors <span style="color: #E6DB74;">"resources/core-wikipedia-subclass-mapped--extended.csv"</span> deep-walk index <span style="color: #AE81FF;">:base-uri</span> <span style="color: #E6DB74;">"http://wikipedia.org/wiki/Category:"</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> We create a new SVM classification model trained on the new graph embedding vectors that have been created by DeepWalk with a <code>walk-length</code> of 20. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">libsvm-model</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>train-model <span style="color: #A6E22E;">(</span><span style="color: #F92672;">->></span> <span style="color: #E6DB74;">(</span><span style="color: #AE81FF;">:training</span> model-vectors<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span>map <span style="color: #FD971F;">(</span><span style="color: #F92672;">fn</span> <span style="color: #F92672;">[</span>item<span style="color: #F92672;">]</span> <span style="color: #F92672;">[</span><span style="color: #AE81FF;">(</span><span style="color: #F92672;">if</span> <span style="color: #66D9EF;">(</span>= <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:class</span> item<span style="color: #A6E22E;">)</span> 0<span style="color: #66D9EF;">)</span> -1 1<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">->></span> <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:f</span> item<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>map <span style="color: #A6E22E;">(</span><span style="color: #F92672;">fn</span> <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>k v<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">{</span><span style="color: #66D9EF;">(</span>inc k<span style="color: #66D9EF;">)</span> v<span style="color: #AE81FF;">}</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>apply merge<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> concat<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:eps</span> 0.2 <span style="color: #AE81FF;">:gamma</span> 1 <span style="color: #AE81FF;">:probability</span> 0 <span style="color: #AE81FF;">:shrinking</span> 0 <span style="color: #AE81FF;">:nu</span> 0.1 <span style="color: #AE81FF;">:kernel-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:rbf</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>kernel-types<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:svm-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:one-class</span> <span style="color: #66D9EF;">svm.core</span><span style="color: 
#66D9EF;">/</span>svm-types<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Finally we evaluate the new model. Note that we used the optimal SVM hyperparameter values that we found with the grid search above. The results are somewhat equivalent with the previous ones we experienced. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-libsvm-model model-vectors libsvm-model<span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 1723 false positive: 395 True negative: 48 False negative: 209 Precision: 0.8135033 Recall: 0.8918219 Accuracy: 0.7456842 F1: 0.8508642 </pre> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>generate-tensorflow-projector-vectors-libsvm model-vectors libsvm-model <span style="color: #E6DB74;">"kbpedia-projector-vectors-libsvm-wl5.csv"</span> <span style="color: #E6DB74;">"kbpedia-projector-metadata-libsvm-wl5.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> However what we want to do is to visualize the new scatter plot generated by this new DeepWalk model. The following models have been created using DeepWalk with a <code>walk-length</code> of five, 175 <code>iterations</code> and embedding vectors with three dimensions. The visualization has been created using the t-SNE algorithm with a <code>perplexity</code> of 25, a <code>learning rate</code> of 10 and 8500 <code>iterations</code>. </p> <p> To load the following scatter plots, <a href="tsne-wl5-p25-l10-i8500-3d.zip">download this package</a>, follow the instructions at <a href="http://projector.tensorflow.org/">http://projector.tensorflow.org/</a>, then load the <code>vectors</code> and <code>metadata</code> files. Finally load the bookmark using the <code>txt</code> file from the package. </p> <p> Here is the 3-dimensional rendering of the scatter plot. </p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-call.png" alt="tsne-wl5-p25-l10-i8500-3d-call.png" /> </p> </div> <p> Here is the 2-dimensional rendering of the same graph composed of all the examples of the <code>training</code> and <code>validation</code> sets. If we compare this plot with the one that had a <code>walk-length</code> of 15 (see above), we can observe that most of the clusters are much closer to the other, which suggests that their distinctiveness is less pronounced than it was with a <code>walk-length</code> of 15. This intuitively makes sense since we only walk five vertices of the graph instead of 15, then the upper structure of the knowledge graph has less impact on the overall structure (as exemplified by the graph embeddings) of each vertex. </p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-to-2d-call.png" alt="tsne-wl5-p25-l10-i8500-3d-to-2d-call.png" /> </p> </div> <p> Here we only focus on the negative examples of the <code>validation</code> set. We can see that each of the negative examples appears at the center of each cluster. </p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-to-2d-c0.png" alt="tsne-wl5-p25-l10-i8500-3d-to-2d-c0.png" /> </p> </div> <p> Here we only focus on the positive examples of the <code>validation</code> set. 
</p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-to-2d-c1.png" alt="tsne-wl5-p25-l10-i8500-3d-to-2d-c1.png" /> </p> </div> <p> Now let's focus on specific Wikipedia categories that can belong to the <code>training</code> or the <code>validation</code> set. In the following scatter plot, we highlight all the categories that have the word "police" in them. What we can observe here compared to the previous version of this graph is that these Wikipedia categories appear closer to each other which suggest that the upper structure has less impact on how the over all categories are clustered (by their graph latent structure). Also, the <code>Chiefs_of_police</code> does not belong to any cluster, but it was when we used a <code>walk-length</code> of 15. </p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-focus_police.png" alt="tsne-wl5-p25-l10-i8500-3d-focus_police.png" /> </p> </div> <p> The same behavior can be observed for the categories that have "fire" in their name. </p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-focus_fire.png" alt="tsne-wl5-p25-l10-i8500-3d-focus_fire.png" /> </p> </div> <p> Finally exactly the same behavior can be observed for the categories that have "wine" in their name. </p> <div class="figure"> <p><img src="tsne-wl5-p25-l10-i8500-3d-focus_wine.png" alt="tsne-wl5-p25-l10-i8500-3d-focus_wine.png" /> </p> </div> <p> What we can conclude from changing the <code>walk-length</code> of the DeepWalk algorithm on this knowledge graph is that it appears that it will have an impact on how the vertices will be represented and clustered by the t-SNE algorithm and how each other will be related one between the other considering their graph embeddings. A higher number for <code>walk-length</code> appears to be more beneficial, but takes more time to compute. </p> </div></li> <br /> <h4>Nine Dimensions</h4> <div class="outline-text-5" id="text-orgf548100"> <p> Now let's experiment when we increase the number of dimensions for the graph embeddings. Let's increase the number of dimensions from three to nine. However let's use the <code>walk-length</code> of 15. The first thing we have to do is to create the DeepWalk object with a <code>:vector-size</code> of nine. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">deep-walk</span> <span style="color: #66D9EF;">(</span>create-deep-walk graph <span style="color: #AE81FF;">:window-size</span> 15 <span style="color: #AE81FF;">:vector-size</span> 9 <span style="color: #AE81FF;">:learning-rate</span> 0.025<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Then we train the model with a <code>walk-length</code> of 15 and and 175 iterations. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>train deep-walk graph 15 175<span style="color: #AE81FF;">)</span> </pre> </div> <p> We re-create the <code>training</code> and <code>validation</code> sets to train the SVM classifier. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">model-vectors</span> <span style="color: #66D9EF;">(</span>build-svm-model-vectors <span style="color: #E6DB74;">"resources/core-wikipedia-subclass-mapped--extended.csv"</span> deep-walk index <span style="color: #AE81FF;">:base-uri</span> <span style="color: #E6DB74;">"http://wikipedia.org/wiki/Category:"</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> Then we re-run the grid search to find the optimal hyperparameters to properly configure the SVM classifier with these new graph embedding vectors as computed by the new DeepWalk model. </p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>svm-grid-search-one-class-rbf model-vectors <span style="color: #AE81FF;">:grid-parameters</span> <span style="color: #66D9EF;">[</span><span style="color: #A6E22E;">{</span><span style="color: #AE81FF;">:gamma</span> <span style="color: #E6DB74;">[</span>1 1/2 1/3 1/4 1/5 1/7 1/10 1/50 1/100<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:eps</span> <span style="color: #E6DB74;">[</span>0.000001 0.00001 0.0001 0.001 0.01 0.1 0.2<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:nu</span> <span style="color: #E6DB74;">[</span>0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:probability</span> <span style="color: #E6DB74;">[</span>0 1<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:shrinking</span> <span style="color: #E6DB74;">[</span>0 1<span style="color: #E6DB74;">]</span> <span style="color: #AE81FF;">:selection-metric</span> <span style="color: #AE81FF;">:f1</span><span style="color: #A6E22E;">}</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> {:score 0.85384995 :gamma 1 :eps 0.1 :nu 0.1 :probability 0 shrinking 0} </pre> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">def</span> <span style="color: #FD971F;">libsvm-model-optimized</span> <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>train-model <span style="color: #A6E22E;">(</span><span style="color: #F92672;">->></span> <span style="color: #E6DB74;">(</span><span style="color: #AE81FF;">:training</span> model-vectors<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span>map <span style="color: #FD971F;">(</span><span style="color: #F92672;">fn</span> <span style="color: #F92672;">[</span>item<span style="color: #F92672;">]</span> <span style="color: #F92672;">[</span><span style="color: #AE81FF;">(</span><span style="color: #F92672;">if</span> <span style="color: #66D9EF;">(</span>= <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:class</span> item<span style="color: #A6E22E;">)</span> 0<span style="color: #66D9EF;">)</span> -1 1<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">->></span> <span style="color: #66D9EF;">(</span><span style="color: #AE81FF;">:f</span> item<span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>map <span style="color: #A6E22E;">(</span><span style="color: #F92672;">fn</span> <span 
style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>k v<span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">{</span><span style="color: #66D9EF;">(</span>inc k<span style="color: #66D9EF;">)</span> v<span style="color: #AE81FF;">}</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> <span style="color: #66D9EF;">(</span>apply merge<span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span> concat<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:eps</span> 0.1 <span style="color: #AE81FF;">:gamma</span> 1 <span style="color: #AE81FF;">:nu</span> 0.1 <span style="color: #AE81FF;">:probability</span> 0 <span style="color: #AE81FF;">:shrinking</span> 0 <span style="color: #AE81FF;">:kernel-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:rbf</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>kernel-types<span style="color: #A6E22E;">)</span> <span style="color: #AE81FF;">:svm-type</span> <span style="color: #A6E22E;">(</span><span style="color: #AE81FF;">:one-class</span> <span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>svm-types<span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>evaluate-libsvm-model model-vectors libsvm-model-optimized<span style="color: #AE81FF;">)</span> </pre> </div> <pre class="example"> True positive: 1741 false positive: 422 True negative: 38 False negative: 174 Precision: 0.8049006 Recall: 0.9091384 Accuracy: 0.74905264 F1: 0.85384995 </pre> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>generate-tensorflow-projector-vectors-libsvm model-vectors libsvm-model <span style="color: #E6DB74;">"kbpedia-projector-vectors-libsvm-wl15-d9.csv"</span> <span style="color: #E6DB74;">"kbpedia-projector-metadata-libsvm-wl15-d9.csv"</span><span style="color: #AE81FF;">)</span> </pre> </div> <p> OK, so let us now visualize this new model that uses nine dimensions instead of three. The following models have been created using DeepWalk with a <code>walk-length</code> of 15, 175 <code>iterations</code> and embedding vectors with nine dimensions. The visualization has been created using the t-SNE algorithm with a <code>perplexity</code> of 25, a <code>learning rate</code> of 10 and 30,000 <code>iterations</code>. </p> <p> To load the following scatter plots, <a href="tsne-wl15-p25-l10-i30000-9d.zip">download this package</a>, follow the instructions at <a href="http://projector.tensorflow.org/">http://projector.tensorflow.org/</a>, then load the <code>vectors</code> and <code>metadata</code> files. Finally load the bookmark using the <code>txt</code> file from the package. </p> <p> Now let's visualize the graph of all the <code>training</code> and <code>validation</code> set examples reduced from nine dimensions to two. As you can see, this graph is quite different from the previous ones. We can see some kind of blurred clusters but everything is quite scattered. 
</p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i30000-9d-to-2d-call.png" alt="tsne-wl15-p25-l10-i30000-9d-to-2d-call.png" /> </p> </div> <p> The following graph highlight all the positive examples of the <code>validation</code> set. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i30000-9d-to-2d-c1.png" alt="tsne-wl15-p25-l10-i30000-9d-to-2d-c1.png" /> </p> </div> <p> Here are all the categories that belong to the negative examples of the <code>validation</code> set. There is no apparent structure for these two graphs; vertices just appear to be "randomly" displayed in the scatter plot. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i30000-9d-to-2d-c0.png" alt="tsne-wl15-p25-l10-i30000-9d-to-2d-c0.png" /> </p> </div> <p> Finally let's highlight the Wikipedia categories that have the word "wine" in them. </p> <div class="figure"> <p><img src="tsne-wl15-p25-l10-i30000-9d-to-2d-focus-wine.png" alt="tsne-wl15-p25-l10-i30000-9d-to-2d-focus-wine.png" /> </p> </div> <p> What this experiment shows is how care should be taken when visualizing, and more importantly interpreting, these kinds of visualizations created by dimension reduction techniques such as PCA and t-SNE algorithms. </p> </div> </div> </div> <div id="outline-container-orgb53ab62" class="outline-3"> <br /> <h3 id="orgb53ab62">Classifying Candidates</h3> <div class="outline-text-3" id="text-orgb53ab62"> <p> Now that we have an adequate model in place, the last step of the process is to classify each of the candidates we previously found using our best optimized model. Remember that the goal is to find Wikipedia category candidates that can be placed within the existing KBpedia reference concept (knowledge) graph by either extending the graph or filling gaps within it. </p> <p> So, what we have to do is to iterate over all the candidates we found and to classify each of them according to their graph embedding vectors as classified by a SVM model configured using the best hyperparameters. 
</p> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span><span style="color: #F92672;">defn</span> <span style="color: #A6E22E;">classify-categories</span> <span style="color: #66D9EF;">[</span>svm-model possible-new-rc-csv classified-csv index deep-walk<span style="color: #66D9EF;">]</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">let</span> <span style="color: #A6E22E;">[</span>categories-csv <span style="color: #E6DB74;">(</span>rest <span style="color: #FD971F;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #F92672;">[</span>in-file <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">io</span><span style="color: #66D9EF;">/</span>reader possible-new-rc-csv<span style="color: #AE81FF;">)</span><span style="color: #F92672;">]</span> <span style="color: #F92672;">(</span><span style="color: #F92672;">doall</span> <span style="color: #AE81FF;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>read-csv in-file<span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">]</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">with-open</span> <span style="color: #E6DB74;">[</span>out-file <span style="color: #FD971F;">(</span><span style="color: #66D9EF;">io</span><span style="color: #66D9EF;">/</span>writer classified-csv<span style="color: #FD971F;">)</span><span style="color: #E6DB74;">]</span> <span style="color: #E6DB74;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file <span style="color: #FD971F;">[</span><span style="color: #F92672;">[</span><span style="color: #E6DB74;">"kbpedia-rc"</span> <span style="color: #E6DB74;">"wikipedia-category"</span> <span style="color: #E6DB74;">"possible-new-sub-class-of"</span> <span style="color: #E6DB74;">"cognonto-classification"</span><span style="color: #F92672;">]</span><span style="color: #FD971F;">]</span><span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">doseq</span> <span style="color: #FD971F;">[</span><span style="color: #F92672;">[</span>kbpedia-rc wikipedia-category possible-new-sub-class-of<span style="color: #F92672;">]</span> categories-csv<span style="color: #FD971F;">]</span> <span style="color: #FD971F;">(</span><span style="color: #F92672;">if</span> <span style="color: #F92672;">(</span>empty? 
possible-new-sub-class-of<span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>kbpedia-rc wikipedia-category possible-new-sub-class-of <span style="color: #E6DB74;">""</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span><span style="color: #F92672;">)</span> <span style="color: #F92672;">(</span><span style="color: #66D9EF;">csv</span><span style="color: #66D9EF;">/</span>write-csv out-file <span style="color: #AE81FF;">[</span><span style="color: #66D9EF;">[</span>kbpedia-rc wikipedia-category possible-new-sub-class-of <span style="color: #A6E22E;">(</span><span style="color: #F92672;">if</span> <span style="color: #AE81FF;">(</span>= <span style="color: #66D9EF;">(</span><span style="color: #66D9EF;">svm.core</span><span style="color: #66D9EF;">/</span>predict svm-model <span style="color: #A6E22E;">(</span>into <span style="color: #E6DB74;">(</span>sorted-map-by <<span style="color: #E6DB74;">)</span> <span style="color: #E6DB74;">(</span><span style="color: #F92672;">->></span> <span style="color: #FD971F;">(</span>read-string <span style="color: #F92672;">(</span><span style="color: #F92672;">.toString</span> <span style="color: #AE81FF;">(</span><span style="color: #F92672;">.data</span> <span style="color: #66D9EF;">(</span><span style="color: #F92672;">.getVector</span> <span style="color: #A6E22E;">(</span><span style="color: #F92672;">.lookupTable</span> deep-walk<span style="color: #A6E22E;">)</span> <span style="color: #A6E22E;">(</span>get index <span style="color: #AE81FF;">(</span>str <span style="color: #E6DB74;">"http://wikipedia.org/wiki/Category:"</span> possible-new-sub-class-of<span style="color: #AE81FF;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>map-indexed <span style="color: #F92672;">(</span><span style="color: #F92672;">fn</span> <span style="color: #AE81FF;">[</span>feature-id value<span style="color: #AE81FF;">]</span> <span style="color: #AE81FF;">{</span>feature-id value<span style="color: #AE81FF;">}</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span> <span style="color: #FD971F;">(</span>apply merge<span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span> 1.0<span style="color: #AE81FF;">)</span> <span style="color: #E6DB74;">"x"</span> <span style="color: #E6DB74;">""</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">]</span><span style="color: #AE81FF;">]</span><span style="color: #F92672;">)</span><span style="color: #FD971F;">)</span><span style="color: #E6DB74;">)</span><span style="color: #A6E22E;">)</span><span style="color: #66D9EF;">)</span><span style="color: #AE81FF;">)</span> </pre> </div> <div class="org-src-container"> <pre class="src src-clojure"><span style="color: #AE81FF;">(</span>classify-categories libsvm-model-optimized <span style="color: #E6DB74;">"resources/leaf-rcs-narrower-concepts.csv"</span> <span style="color: #E6DB74;">"resources/leaf-rcs-narrower-concepts--libsvm-classified.csv"</span> index deep-walk<span style="color: #AE81FF;">)</span> <span style="color: 
#AE81FF;">(</span>classify-categories libsvm-model-optimized <span style="color: #E6DB74;">"resources/near-leaf-rcs-narrower-concepts.csv"</span> <span style="color: #E6DB74;">"resources/near-leaf-rcs-narrower-concepts--libsvm-classified.csv"</span> index deep-walk<span style="color: #AE81FF;">)</span> <span style="color: #AE81FF;">(</span>classify-categories libsvm-model-optimized <span style="color: #E6DB74;">"resources/core-wikipedia-subclass-mapped.csv"</span> <span style="color: #E6DB74;">"resources/core-wikipedia-subclass-mapped--libsvm-classified.csv"</span> index deep-walk<span style="color: #AE81FF;">)</span> </pre> </div> <p> The result of this classification task is that among all the candidates we identified, 20,653 have been classified to be possible new KBpedia reference concepts, a potential expansion of our starting knowledge graph of about 50%. These final candidates will then be reviewed by a KBpedia maintainer to make the final decision to determine if that concept should be added to KBpedia or not. </p> </div> </div> </div> <div id="outline-container-orgbc9e274" class="outline-2"> <br /> <h2 id="orgbc9e274">Conclusion</h2> <div class="outline-text-2" id="text-orgbc9e274"> <p> What we demonstrated in this use case is how the inner structure of the KBpedia knowledge graph and its linkage to external conceptual structures such as the Wikipedia category structure can be leveraged to extend the scope of the knowledge graph. </p> <p> Additionally we demonstrated how machine learning techniques such as DeepWalk and SVM classifiers can be used to filter potential candidates automatically to reduce greatly the time a human reviewer has to spend to make a final decision as to which new concepts will make it into the knowledge graph. </p> <p> While the example herein is based on the Wikipedia category structure, any external schema or ontology may be handled in a similar way. A multitude of machine learning techniques are available at each step in the evaluation. The essential point is to create a system and workflow that enables huge numbers of combinatorial candidates to be winnowed down into likely candidates for manual approval. The systematic approach to the pipeline and the use of positive and negative training sets means that tuning the approach can be fully automated and rapidly vetted. </p> <p> Many of these machine learning techniques are relatively new within the scientific literature. Applying them to these kinds of tasks is novel to our knowledge. But more importantly, based on our experience performing these tasks, the time spent by a human to extend such a big coherent and consistent knowledge graph can significantly be reduced by leveraging such techniques, which ultimately greatly reduces the development and maintenance costs of such knowledge graphs. </p> </div> </div> <div id="footnotes"> <br /> <h2 class="footnotes">Footnotes: </h2> <div id="text-footnotes"> <div class="footdef"><sup><a id="fn.1" class="footnum" href="#fnr.1">1</a></sup> Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM. </div> <div class="footdef"><sup><a id="fn.2" class="footnum" href="#fnr.2">2</a></sup> Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press. 
</div> </div> </div> </div> <div class="col-md-2"> </div><!--/col-md-2--> </div> </div><!--/container--> <!--=== End Content Part ===-->