Measuring the Structural Similarity of Web-based Documents: A Novel Approach

Published October 20, 2007 | Version 15928
Journal article
Matthias Dehmer
Frank Emmert Streib
Alexander Mehler
Jürgen Kilian

Most known methods for measuring the structural similarity of document structures are based on, e.g., tag measures, path metrics, and tree measures defined on their DOM-Trees. Other methods measure similarity within the framework of the well-known vector space model. In contrast to these, we present a new approach to measuring the structural similarity of web-based documents represented by so-called generalized trees, which are more general than DOM-Trees because the latter represent only directed rooted trees. We design a new similarity measure for graphs representing web-based hypertext structures. Our similarity measure is based mainly on a novel representation of a graph as linear integer strings whose components represent structural properties of the graph. The similarity of two graphs is then defined via the optimal alignment of the underlying property strings. In this paper we apply the well-known technique of sequence alignment to a novel and challenging problem: measuring the structural similarity of generalized trees. More precisely, we first transform our graphs, considered as high-dimensional objects, into linear structures. We then derive similarity values from the alignments of the property strings. Hence, we transform a graph similarity problem into a string similarity problem. We demonstrate that our similarity measure captures important structural information by applying it to two different test sets consisting of graphs representing web-based documents.
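The pipeline described above (graph, then integer property string, then alignment, then similarity value) can be illustrated with a minimal Python sketch. It is not the authors' measure: the abstract does not say which structural properties are encoded or how alignments are scored. The sketch assumes, purely for illustration, that the property string is the sequence of out-degrees in breadth-first order and that alignment uses unit gap and mismatch costs. The names `property_string`, `alignment_cost`, and `similarity` are hypothetical.

```python
from collections import deque


def property_string(adj, root):
    """Encode a graph as a list of integers (ASSUMPTION: out-degrees in BFS order)."""
    seen = {root}
    queue = deque([root])
    string = []
    while queue:
        node = queue.popleft()
        successors = adj.get(node, [])
        # Out-degree counts every outgoing edge, including edges that
        # point back to already visited nodes (non-tree edges).
        string.append(len(successors))
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return string


def alignment_cost(a, b, gap=1.0, mismatch=1.0):
    """Minimum cost of a global alignment of a and b via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * gap          # align a[:i] against gaps only
    for j in range(1, cols):
        cost[0][j] = j * gap          # align b[:j] against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = cost[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(pair, cost[i - 1][j] + gap, cost[i][j - 1] + gap)
    return cost[-1][-1]


def similarity(a, b):
    """Map the alignment cost to [0, 1]; 1.0 means identical property strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    # With unit costs the alignment cost never exceeds the longer length.
    return 1.0 - alignment_cost(a, b) / longest


# Two small hypothetical hypertext graphs as adjacency lists.
g1 = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}  # c -> a is a non-tree edge
g2 = {"a": ["b", "c"], "b": [], "c": []}

s1 = property_string(g1, "a")
s2 = property_string(g2, "a")
print(s1, s2, similarity(s1, s2))
```

Unit costs keep the example easy to follow. A real measure would presumably weight gaps and mismatches according to the structural property being compared, and might use several property strings per graph rather than one.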
