22 May 2017 Automatic similarity detection and clustering of data
Abstract
An algorithm was created that identifies the number of unique clusters in a dataset and assigns the data to those clusters. A cluster is defined as a group of data that share similar characteristics. The data are input as vectors, and similarity is measured using the dot product between two vectors. Unlike other clustering algorithms such as K-means, no prior knowledge of the number of clusters is required, which allows for an unbiased analysis of the data. The automatic cluster detection (ACD) algorithm is executed in two phases: an averaging phase and a clustering phase. In the averaging phase, the number of unique clusters is detected. In the clustering phase, each datum is matched to the cluster to which it is most similar. The ACD algorithm takes a matrix of vectors as input and outputs a 2D array of the clustered data. Each index of the output corresponds to a cluster, and the elements in each cluster are the positions of its data in the dataset. Clusters are vectors in N-dimensional space, where N is the length of the input vectors that make up the matrix. The algorithm is distributed, increasing computational efficiency.
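The two phases described in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' implementation: the abstract does not say how the averaging phase decides that two vectors belong together, so the sketch assumes the vectors are normalized to unit length and that a hypothetical `threshold` on the dot product separates "similar" from "dissimilar". The function name `acd_sketch` and its helpers are likewise illustrative, and the distributed execution mentioned in the abstract is not modeled.

```python
import math


def normalize(v):
    """Scale v to unit length so the dot product acts as a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)  # leave an all-zero vector unchanged
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def acd_sketch(data, threshold=0.9):
    """Return a list of clusters, each a list of row indices into `data`.

    `threshold` is an assumed parameter, not taken from the paper.
    """
    unit = [normalize(v) for v in data]

    # Averaging phase (assumed form): fold each vector into the most similar
    # existing cluster if the similarity reaches the threshold, otherwise
    # start a new cluster. The number of clusters is discovered, not supplied.
    sums = []  # running sum of member vectors, one entry per cluster
    for v in unit:
        best, best_sim = None, threshold
        for k, s in enumerate(sums):
            sim = dot(v, normalize(s))
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            sums.append(list(v))
        else:
            sums[best] = [a + b for a, b in zip(sums[best], v)]

    # Each cluster is itself a vector in N-dimensional space.
    centers = [normalize(s) for s in sums]

    # Clustering phase: match every datum to its most similar cluster.
    clusters = [[] for _ in centers]
    for i, v in enumerate(unit):
        k = max(range(len(centers)), key=lambda j: dot(v, centers[j]))
        clusters[k].append(i)
    return clusters
```

For example, `acd_sketch([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])` returns `[[0, 1], [2, 3]]`: two clusters are found without the count being given, and each inner list holds the positions of its members in the input.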
© (2017) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Craig Einstein, Peter Chin, "Automatic similarity detection and clustering of data", Proc. SPIE 10185, Cyber Sensing 2017, 101850K (22 May 2017); doi: 10.1117/12.2267844; https://doi.org/10.1117/12.2267844
PROCEEDINGS
7 PAGES


