XML heterogeneity has recently become a new challenge. This paper proposes a novel clustering strategy that regroups heterogeneous XML sources, since searching a smaller space of similar documents reduces cost. The strategy consists of four steps. First, we extract path features and map them into a High-Dimension Vector Space (HDVS). Second, during data preprocessing, two algorithms are applied to reduce redundancy in the XML sources. Third, the heterogeneous documents are clustered. Finally, Multivalued Dependency (MVD) is introduced, because MVD can be redefined according to the range of XML constraints. The paper also proposes a novel algorithm, based on rough sets, for discovering minimal MVDs in the presence of incomplete (non-integrity) data. It addresses the problem that incomplete XML data interferes with finding the MVDs of XML, so that patterns can be extracted from each cluster.
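The first step, extracting path features and mapping them into a vector space, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `extract_paths` and `to_vectors`, the use of full root-to-node tag paths as features, and the binary (present/absent) weighting are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def extract_paths(xml_text):
    """Return the set of root-to-node tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def to_vectors(docs):
    """Map each document to a binary vector over the shared path vocabulary.

    Assumption: one dimension per distinct path, with value 1 if the path
    occurs in the document and 0 otherwise.
    """
    all_paths = [extract_paths(d) for d in docs]
    vocab = sorted(set().union(*all_paths))
    vectors = [[1 if p in ps else 0 for p in vocab] for ps in all_paths]
    return vocab, vectors


docs = ["<a><b/><c/></a>", "<a><b><d/></b></a>"]
vocab, vectors = to_vectors(docs)
print(vocab)
print(vectors)
```

Each distinct path becomes one dimension, so the vocabulary grows quickly on heterogeneous collections, which is one motivation for a redundancy-reducing preprocessing step before clustering.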
This paper presents a novel clustering method for XML documents. Much current research on document clustering is devoted to supporting the storage and retrieval of large collections of XML documents. However, traditional text clustering approaches cannot capture the structural information of semi-structured documents. Our technique first extracts relative path features to represent each document. We then transform these documents into the Vector Space Model (VSM) and propose a similarity computation. Before clustering, we apply Independent Component Analysis (ICA) to reduce the dimensionality of the VSM. To the best of the authors' knowledge, ICA has not been used for XML clustering before. The standard C-means partition algorithm is also improved: when a solution can no longer be improved, the algorithm applies an appropriate perturbation to the local minimum solution before the next iteration. The algorithm can thus escape the local minimum while still exploring the whole search space. Experimental results on two real datasets and one synthetic dataset show that the proposed approach is efficient and outperforms a naive clustering method without ICA.
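The perturbation idea behind the improved C-means algorithm can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the perturbation is taken to be Gaussian noise added to the converged centers, a perturbed run is kept only if it lowers the total squared error, and the names `lloyd`, `perturbed_c_means`, `rounds`, and `scale` are hypothetical. The ICA dimensionality reduction step is omitted.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Standard C-means iterations until the centers stop changing."""
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute labels and cost against the final centers.
    labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
              for p in points]
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, cost


def perturbed_c_means(points, k, rounds=20, scale=1.0, seed=0):
    """C-means that shakes a converged solution to try to leave a local minimum.

    Assumption: the perturbation is Gaussian noise of standard deviation
    `scale` on every center coordinate; only improving runs are accepted.
    """
    rng = random.Random(seed)
    best = lloyd(points, [list(p) for p in rng.sample(points, k)])
    for _ in range(rounds):
        shaken = [[x + rng.gauss(0, scale) for x in c] for c in best[0]]
        candidate = lloyd(points, shaken)
        if candidate[2] < best[2]:
            best = candidate
    return best


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels, cost = perturbed_c_means(points, k=2)
print(labels, round(cost, 3))
```

Accepting only improving runs keeps the cost from ever increasing, while the noise scale controls how far from the current local minimum each restart can land.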