18 April 2006 Clustering method via independent components for semi-structured documents
Proceedings Volume 6241, Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2006; 62410V (2006); doi: 10.1117/12.665427
Event: Defense and Security Symposium, 2006, Orlando (Kissimmee), Florida, United States
Abstract
This paper presents a novel clustering method for XML documents. Much current research on document clustering is devoted to supporting the storage and retrieval of large collections of XML documents. However, traditional text clustering approaches cannot capture the structural information of semi-structured documents. Our technique first extracts relative path features to represent each document. We then map these documents into the Vector Space Model (VSM) and propose a similarity computation. Before clustering, we apply Independent Component Analysis (ICA) to reduce the dimensionality of the VSM. To the best of the authors' knowledge, ICA has not been used for XML clustering before. We also improve the standard C-means partition algorithm: when a solution can no longer be improved, the algorithm applies an appropriate disturbance to the local-minimum solution and then continues with the next iteration. The algorithm can thus escape the local minimum while still covering the whole search space. Experimental results on two real datasets and one synthetic dataset show that the proposed approach is efficient and outperforms a naive clustering method that does not apply ICA.
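The first two steps of the abstract (path features, then a vector space with a similarity measure) can be illustrated with a minimal sketch. The abstract does not define "relative path" or the proposed similarity computation, so the code below makes two assumptions: a relative path is taken to be a chain of up to `max_len` consecutive tag names ending at an element, and cosine similarity stands in for the paper's measure. The function names and the three sample documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def relative_paths(xml_text, max_len=2):
    """Count tag chains of length 1..max_len ending at each element.

    Assumption: this is one plausible reading of "relative path
    features"; the paper's exact definition is not in the abstract.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, ancestors):
        chain = ancestors + [node.tag]
        for k in range(1, min(max_len, len(chain)) + 1):
            counts["/".join(chain[-k:])] += 1
        for child in node:
            walk(child, chain)

    walk(root, [])
    return counts


def to_vectors(feature_counts):
    """Map per-document path counts onto a shared vocabulary (the VSM)."""
    vocab = sorted(set().union(*feature_counts))
    return vocab, [[c[f] for f in vocab] for c in feature_counts]


def cosine(u, v):
    """Cosine similarity, used here as a stand-in similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample documents: two book records and one CD record.
doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><author><name/></author><year/></book>"
doc_c = "<cd><artist><name/></artist><track/></cd>"

vocab, vectors = to_vectors([relative_paths(d) for d in (doc_a, doc_b, doc_c)])
sim_ab = cosine(vectors[0], vectors[1])  # structurally similar documents
sim_ac = cosine(vectors[0], vectors[2])  # share only the "name" tag
```

In this sketch the two book records score much higher against each other than against the CD record, because they share most of their tag chains. The resulting vectors are the kind of input that a dimension-reduction step such as ICA, followed by a partition clustering algorithm, would operate on.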
© (2006) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Tong Wang, Da-Xin Liu, Xuanzuo Lin, Wei Sun, "Clustering method via independent components for semi-structured documents", Proc. SPIE 6241, Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2006, 62410V (18 April 2006); doi: 10.1117/12.665427; https://doi.org/10.1117/12.665427
PROCEEDINGS
8 PAGES


KEYWORDS
Independent component analysis
Principal component analysis
Vector spaces
Data modeling
Dimension reduction
Feature extraction
Data mining
