CINXE.COM
{"title":"Application of a Similarity Measure for Graphs to Web-based Document Structures","authors":"Matthias Dehmer, Frank Emmert Streib, Alexander Mehler, J\u00fcrgen Kilian, Max M\u00fchlhauser","volume":8,"journal":"International Journal of Mathematical and Computational Sciences","pagesStart":361,"pagesEnd":366,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/15299","abstract":"Due to the tremendous amount of information provided\r\nby the World Wide Web (WWW) developing methods for mining\r\nthe structure of web-based documents is of considerable interest. In\r\nthis paper we present a similarity measure for graphs representing\r\nweb-based hypertext structures. Our similarity measure is mainly\r\nbased on a novel representation of a graph as linear integer strings,\r\nwhose components represent structural properties of the graph. The\r\nsimilarity of two graphs is then defined as the optimal alignment of\r\nthe underlying property strings. In this paper we apply the well known\r\ntechnique of sequence alignments for solving a novel and challenging\r\nproblem: Measuring the structural similarity of generalized trees.\r\nIn other words: We first transform our graphs considered as high\r\ndimensional objects in linear structures. Then we derive similarity\r\nvalues from the alignments of the property strings in order to\r\nmeasure the structural similarity of generalized trees. Hence, we\r\ntransform a graph similarity problem to a string similarity problem for\r\ndeveloping a efficient graph similarity measure. We demonstrate that\r\nour similarity measure captures important structural information by\r\napplying it to two different test sets consisting of graphs representing\r\nweb-based document structures.","references":"[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957\r\n[2] R. A. Botafogo, B. Shneiderman: Structural analysis of hypertexts:\r\nIdentifying hierarchies and useful metrics, ACM Trans. Inf. Syst. 10\r\n(2), 1992, 142-180\r\n[3] S. Chakrabarti: Mining the Web. Discovering Knowledge from Hypertext\r\nData, Morgen and Kaufmann Publishers, 2003\r\n[4] S. Chakrabarti: Integrating the document object model with hyperlinks\r\nfor enhanced topic distillation and information extraction, Proc. of the\r\n10th International World Wide Web Conference, Hong Kong, 2001, 211-\r\n220\r\n[5] I. F. Cruz, S. Borisov, M. A. Marks, T. R. Webb: Measuring Structural\r\nSimilarity Among Web Documents: Preliminary Results , Lecture Notes\r\nIn Computer Science, Vol. 1375, 1998\r\n[6] M. Dehmer, Strukturelle Analyse web-basierter Dokumente, Ph.D Thesis,\r\nDepartment of Computer Science, Technische Universit\u252c\u00bfat Darmstadt,\r\n2005, unpublished\r\n[7] R. Gleim: HyGraph - Ein Framework zur Extraktion, Repr\u252c\u00bfasentation\r\nund Analyse webbasierter Hypertextstrukturen, Beitrage zur GLDVTagung\r\n2005, Bonn\/Germany, 2005\r\n[8] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer\r\nScience and Computational Biology, Cambridge University Press, 1997\r\n[9] T. Jiang, L. Wang, K. Zhang: Alignment of trees - An alternative to tree\r\nedit, Theoretical Computer Science, Elsevier, Vol. 143, 1995, 137-148\r\n[10] S. Joshi, N. Agrawal, R. Krishnapuram, S. Negi,: Bag of Paths Model\r\nfor Measuring Structural Similarity in Web Documents, Proceedings of\r\nthe ACM International Conference on Knowledge Discovery and Data\r\nMining (SIGKDD), 2003, 577-582.\r\n[11] A. Mehler, M. Dehmer, R. Gleim: Towards logical hypertext structure.\r\nA graph-theoretic perspective, Proc. of I2CS-04, Guadalajara\/Mexico,\r\nLecture Notes in Computer Science, Berlin-New York: Springer, 2004\r\n[12] A. Mehler, R. Gleim, M. Dehmer: Towards structure-sensitive hypertext\r\ncategorization, to appear in: Proceedings of the 29-th Annual Conference\r\nof the German Classification Society, 2005\r\n[13] S. M. Selkow: The tree-to-tree editing problem, Information Processing\r\nLetters, Vol. 6 (6), 1977, 184-186\r\n[14] T. F. Smith, M. S. Waterman: Identification of common molecular\r\nsubsequences, Journal of Molecular Biology, Vol. 147 (1), 1981, 195-\r\n197\r\n[15] F. Sobik, Graphmetriken und Klassifikation strukturierter Objekte, ZKIInformationen,\r\nAkad. Wiss. DDR, Vol. 2 (82), 1982, 63-122\r\n[16] J. R. Ullman, An algorithm for subgraph isomorphism, J. ACM, Vol. 23\r\n(1), 1976, 31-42\r\n[17] P. H. Winne., L. Gupta, J. C. Nesbit: Exploring individual differences in\r\nstudying strategies using graph theoretic statistics, The Alberta Journal\r\nof Educational Research, Vol. 40, 177-193, 1994\r\n[18] A. Winter: Exchanching Graphs with GXL, http:\/\/www.gupro.\r\nde\/GXL\r\n[19] K. Zhang, D. Shasha: Simple fast algorithms for the editing distance\r\nbetween trees and related problems, SIAM Journal of Computing, Vol.\r\n18 (6), 1989, 1245-1262\r\n[20] B. Zelinka, On a certain distance between isomorphism classes of\r\ngraphs, \u2566\u00e7Casopis pro \u2566\u00e7pest. Mathematiky, Vol. 100, 1975, 371-373","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 8, 2007"}