Measuring the Structural Similarity of Web-based Documents: A Novel Approach
<!DOCTYPE html> <html lang="en" dir="ltr"> <head> <title>Measuring the Structural Similarity of Web-based Documents: A Novel Approach</title> <meta name="description" content="Measuring the Structural Similarity of Web-based Documents: A Novel Approach"> <meta name="keywords" content="Graph similarity, hierarchical and directed graphs, hypertext, generalized trees, web structure mining."> <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no"> <meta charset="utf-8"> <meta name="citation_title" content="Measuring the Structural Similarity of Web-based Documents: A Novel Approach"> <meta name="citation_author" content="Matthias Dehmer"> <meta name="citation_author" content="Frank Emmert Streib"> <meta name="citation_author" content="Alexander Mehler"> <meta name="citation_author" content="Jürgen Kilian"> <meta name="citation_publication_date" content="2007/10/20"> <meta 
name="citation_journal_title" content="International Journal of Computer and Information Engineering"> <meta name="citation_volume" content="1"> <meta name="citation_issue" content="10"> <meta name="citation_firstpage" content="3070"> <meta name="citation_lastpage" content="3076"> <meta name="citation_pdf_url" content="https://publications.waset.org/15928/pdf"> <link href="https://cdn.waset.org/favicon.ico" type="image/x-icon" rel="shortcut icon"> <link href="https://cdn.waset.org/static/plugins/bootstrap-4.2.1/css/bootstrap.min.css" rel="stylesheet"> <link href="https://cdn.waset.org/static/plugins/fontawesome/css/all.min.css" rel="stylesheet"> <link href="https://cdn.waset.org/static/css/site.css?v=150220211555" rel="stylesheet"> </head> <body> <header> <div class="container"> <nav class="navbar navbar-expand-lg navbar-light"> <a class="navbar-brand" href="https://waset.org"> <img src="https://cdn.waset.org/static/images/wasetc.png" alt="Open Science Research Excellence" title="Open Science Research Excellence" /> </a> <button class="d-block d-lg-none navbar-toggler ml-auto" type="button" data-toggle="collapse" data-target="#navbarMenu" aria-controls="navbarMenu" aria-expanded="false" aria-label="Toggle navigation"> <span class="navbar-toggler-icon"></span> </button> <div class="w-100"> <div class="d-none d-lg-flex flex-row-reverse"> <form method="get" action="https://waset.org/search" class="form-inline my-2 my-lg-0"> <input class="form-control mr-sm-2" type="search" placeholder="Search Conferences" value="" name="q" aria-label="Search"> <button class="btn btn-light my-2 my-sm-0" type="submit"><i class="fas fa-search"></i></button> </form> </div> <div class="collapse navbar-collapse mt-1" id="navbarMenu"> <ul class="navbar-nav ml-auto align-items-center" id="mainNavMenu"> <li class="nav-item"> <a class="nav-link" href="https://waset.org/conferences" title="Conferences in 2024/2025/2026">Conferences</a> </li> <li class="nav-item"> <a class="nav-link" 
href="https://waset.org/disciplines" title="Disciplines">Disciplines</a> </li> <li class="nav-item"> <a class="nav-link" href="https://waset.org/committees" rel="nofollow">Committees</a> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="navbarDropdownPublications" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false"> Publications </a> <div class="dropdown-menu" aria-labelledby="navbarDropdownPublications"> <a class="dropdown-item" href="https://publications.waset.org/abstracts">Abstracts</a> <a class="dropdown-item" href="https://publications.waset.org">Periodicals</a> <a class="dropdown-item" href="https://publications.waset.org/archive">Archive</a> </div> </li> <li class="nav-item"> <a class="nav-link" href="https://waset.org/page/support" title="Support">Support</a> </li> </ul> </div> </div> </nav> </div> </header> <main> <div class="container mt-4"> <div class="row"> <div class="col-md-9 mx-auto"> <form method="get" action="https://publications.waset.org/search"> <div id="custom-search-input"> <div class="input-group"> <i class="fas fa-search"></i> <input type="text" class="search-query" name="q" placeholder="Author, Title, Abstract, Keywords" value=""> <input type="submit" class="btn_search" value="Search"> </div> </div> </form> </div> </div> <div class="row mt-3"> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Commenced</strong> in January 2007</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Frequency:</strong> Monthly</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Edition:</strong> International</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Paper Count:</strong> 33093</div> </div> </div> </div> <div class="card publication-listing mt-3 mb-3"> <h5 class="card-header" style="font-size:.9rem">Measuring the Structural Similarity of 
Web-based Documents: A Novel Approach</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/search?q=Matthias%20Dehmer">Matthias Dehmer</a>, <a href="https://publications.waset.org/search?q=Frank%20Emmert%20Streib"> Frank Emmert Streib</a>, <a href="https://publications.waset.org/search?q=Alexander%20Mehler"> Alexander Mehler</a>, <a href="https://publications.waset.org/search?q=J%C3%BCrgen%20Kilian"> Jürgen Kilian</a> </p> <p class="card-text"><strong>Abstract:</strong></p> <p>Most known methods for measuring the structural similarity of documents operate on their DOM-Trees and rely on, e.g., tag measures, path metrics, and tree measures. Other methods measure similarity within the framework of the well-known vector space model. In contrast to these, we present a new approach to measuring the structural similarity of web-based documents represented by so-called generalized trees, which are more general than DOM-Trees because the latter represent only directed rooted trees. We design a new similarity measure for graphs representing web-based hypertext structures. Our similarity measure is mainly based on a novel representation of a graph as linear integer strings whose components represent structural properties of the graph. The similarity of two graphs is then defined in terms of the optimal alignment of the underlying property strings. In this paper we apply the well-known technique of sequence alignment to solve a novel and challenging problem: measuring the structural similarity of generalized trees. More precisely, we first transform our graphs, considered as high-dimensional objects, into linear structures. Then we derive similarity values from the alignments of the property strings in order to measure the structural similarity of generalized trees. Hence, we transform a graph similarity problem into a string similarity problem. 
We demonstrate that our similarity measure captures important structural information by applying it to two different test sets consisting of graphs representing web-based documents.</p> <iframe src="https://publications.waset.org/15928.pdf" style="width:100%; height:400px;" frameborder="0"></iframe> <p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/search?q=Graph%20similarity" title="Graph similarity">Graph similarity</a>, <a href="https://publications.waset.org/search?q=hierarchical%20and%20directed%20graphs" title=" hierarchical and directed graphs"> hierarchical and directed graphs</a>, <a href="https://publications.waset.org/search?q=hypertext" title=" hypertext"> hypertext</a>, <a href="https://publications.waset.org/search?q=generalized%20trees" title=" generalized trees"> generalized trees</a>, <a href="https://publications.waset.org/search?q=web%20structure%20mining." title=" web structure mining."> web structure mining.</a> </p> <p class="card-text"><strong>Digital Object Identifier (DOI):</strong> <a href="https://doi.org/10.5281/zenodo.1086031" target="_blank">doi.org/10.5281/zenodo.1086031</a> </p> <a href="https://publications.waset.org/15928/measuring-the-structural-similarity-of-web-based-documents-a-novel-approach" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/15928/apa" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">APA</a> <a href="https://publications.waset.org/15928/bibtex" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">BibTeX</a> <a href="https://publications.waset.org/15928/chicago" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">Chicago</a> <a href="https://publications.waset.org/15928/endnote" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">EndNote</a> <a href="https://publications.waset.org/15928/harvard" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">Harvard</a> <a 
href="https://publications.waset.org/15928/json" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">JSON</a> <a href="https://publications.waset.org/15928/mla" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">MLA</a> <a href="https://publications.waset.org/15928/ris" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">RIS</a> <a href="https://publications.waset.org/15928/xml" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">XML</a> <a href="https://publications.waset.org/15928/iso690" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">ISO 690</a> <a href="https://publications.waset.org/15928.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">2557</span> </span> <p class="card-text"><strong>References:</strong></p> <br>[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957 <br>[2] R. A. Botafogo, B. Shneiderman: Structural analysis of hypertexts: Identifying hierarchies and useful metrics, ACM Trans. Inf. Syst. 10 (2), 1992, 142-180 <br>[3] S. Chakrabarti: Mining the Web. Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2003 <br>[4] S. Chakrabarti: Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction, Proc. of the 10th International World Wide Web Conference, Hong Kong, 2001, 211-220 <br>[5] I. F. Cruz, S. Borisov, M. A. Marks, T. R. Webb: Measuring Structural Similarity Among Web Documents: Preliminary Results, Lecture Notes In Computer Science, Vol. 1375, 1998 <br>[6] M. Dehmer, Strukturelle Analyse web-basierter Dokumente, Ph.D. Thesis, Department of Computer Science, Technische Universität Darmstadt, 2005, unpublished <br>[7] M. Dehmer, R. Gleim, A. 
Mehler: Aspekte der Kategorisierung von Webseiten, GI-Edition - Lecture Notes in Informatics (LNI) - Proceedings, Jahrestagung der Gesellschaft für Informatik, Informatik 2004, Ulm/Germany, 2004, 39-43 <br>[8] R. Gleim: HyGraph - Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertextstrukturen, Beiträge zur GLDV-Tagung 2005, Bonn/Germany, 2005 <br>[9] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997 <br>[10] T. Jiang, L. Wang, K. Zhang: Alignment of trees - An alternative to tree edit, Theoretical Computer Science, Elsevier, Vol. 143, 1995, 137-148 <br>[11] S. Joshi, N. Agrawal, R. Krishnapuram, S. Negi: Bag of Paths Model for Measuring Structural Similarity in Web Documents, Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003, 577-582 <br>[12] A. Mehler: Textbedeutung. Zur prozeduralen Analyse und Repräsentation struktureller Ähnlichkeiten von Texten, Peter Lang, Europäischer Verlag der Wissenschaften, 2001 <br>[13] A. Mehler, M. Dehmer, R. Gleim: Towards logical hypertext structure. A graph-theoretic perspective, Proc. of I2CS-04, Guadalajara/Mexico, Lecture Notes in Computer Science, Berlin-New York: Springer, 2004 <br>[14] A. Mehler, R. Gleim, M. Dehmer: Towards structure-sensitive hypertext categorization, to appear in: Proceedings of the 29th Annual Conference of the German Classification Society, 2005 <br>[15] S. M. Selkow: The tree-to-tree editing problem, Information Processing Letters, Vol. 6 (6), 1977, 184-186 <br>[16] T. F. Smith, M. S. Waterman: Identification of common molecular subsequences, Journal of Molecular Biology, Vol. 147 (1), 1981, 195-197 <br>[17] F. Sobik, Graphmetriken und Klassifikation strukturierter Objekte, ZKI-Informationen, Akad. Wiss. DDR, Vol. 2 (82), 1982, 63-122 <br>[18] J. R. Ullman, An algorithm for subgraph isomorphism, J. ACM, Vol. 
23 (1), 1976, 31-42 <br>[19] P. H. Winne, L. Gupta, J. C. Nesbit: Exploring individual differences in studying strategies using graph theoretic statistics, The Alberta Journal of Educational Research, Vol. 40, 1994, 177-193 <br>[20] A. Winter: Exchanging Graphs with GXL, http://www.gupro.de/GXL <br>[21] Y. Yang, S. Slattery, R. Ghani: A study of approaches to hypertext categorization, Journal of Intelligent Information Systems, Vol. 18 (2-3), 2002, 219-241 <br>[22] K. Zhang, D. Shasha: Simple fast algorithms for the editing distance between trees and related problems, SIAM Journal on Computing, Vol. 18 (6), 1989, 1245-1262 <br>[23] B. Zelinka, On a certain distance between isomorphism classes of graphs, Časopis pro pěst. Matematiky, Vol. 100, 1975, 371-373 </div> </div> </div> </main> <footer> <div id="infolinks" class="pt-3 pb-2"> <div class="container"> <div style="background-color:#f5f5f5;" class="p-3"> <div class="row"> <div class="col-md-2"> <ul class="list-unstyled"> About <li><a href="https://waset.org/page/support">About Us</a></li> <li><a href="https://waset.org/page/support#legal-information">Legal</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/WASET-16th-foundational-anniversary.pdf">WASET celebrates its 16th foundational anniversary</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Account <li><a href="https://waset.org/profile">My Account</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Explore <li><a href="https://waset.org/disciplines">Disciplines</a></li> <li><a href="https://waset.org/conferences">Conferences</a></li> <li><a href="https://waset.org/conference-programs">Conference Program</a></li> <li><a href="https://waset.org/committees">Committees</a></li> <li><a href="https://publications.waset.org">Publications</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Research <li><a 
href="https://publications.waset.org/abstracts">Abstracts</a></li> <li><a href="https://publications.waset.org">Periodicals</a></li> <li><a href="https://publications.waset.org/archive">Archive</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Open Science <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Science-Philosophy.pdf">Open Science Philosophy</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Science-Award.pdf">Open Science Award</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Society-Open-Science-and-Open-Innovation.pdf">Open Innovation</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Postdoctoral-Fellowship-Award.pdf">Postdoctoral Fellowship Award</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Scholarly-Research-Review.pdf">Scholarly Research Review</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Support <li><a href="https://waset.org/page/support">Support</a></li> <li><a href="https://waset.org/profile/messages/create">Contact Us</a></li> <li><a href="https://waset.org/profile/messages/create">Report Abuse</a></li> </ul> </div> </div> </div> </div> </div> <div class="container text-center"> <hr style="margin-top:0;margin-bottom:.3rem;"> <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" class="text-muted small">Creative Commons Attribution 4.0 International License</a> <div id="copy" class="mt-2">© 2024 World Academy of Science, Engineering and Technology</div> </div> </footer> <a href="javascript:" id="return-to-top"><i class="fas fa-arrow-up"></i></a> <div class="modal" id="modal-template"> <div class="modal-dialog"> <div class="modal-content"> <div class="row m-0 mt-1"> <div class="col-md-12"> <button type="button" class="close" 
data-dismiss="modal" aria-label="Close"><span aria-hidden="true">×</span></button> </div> </div> <div class="modal-body"></div> </div> </div> </div> <script src="https://cdn.waset.org/static/plugins/jquery-3.3.1.min.js"></script> <script src="https://cdn.waset.org/static/plugins/bootstrap-4.2.1/js/bootstrap.bundle.min.js"></script> <script src="https://cdn.waset.org/static/js/site.js?v=150220211556"></script> <script> jQuery(document).ready(function() { /*jQuery.get("https://publications.waset.org/xhr/user-menu", function (response) { jQuery('#mainNavMenu').append(response); });*/ jQuery.get({ url: "https://publications.waset.org/xhr/user-menu", cache: false }).then(function(response){ jQuery('#mainNavMenu').append(response); }); }); </script> </body> </html>