We present a probability-density-based data stream clustering approach that requires only the newly arrived data, not the entire historical data, to be kept in memory. The approach incrementally updates the density estimate using only the newly arrived data and the previously estimated density. The idea is rooted in a theorem on estimator updating, and it works naturally with Gaussian mixture models. We implement it using the expectation maximization (EM) algorithm and a cluster merging strategy based on multivariate statistical tests for equality of covariance and mean. Compared to the standard EM algorithm, our approach is highly efficient at clustering voluminous online data streams. We demonstrate the performance of our algorithm by clustering a simulated Gaussian mixture data stream and real noisy spike signals extracted from neuronal recordings.
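As a rough illustration of the incremental update described above, the following Python sketch assumes that the updated density is a sample-count-weighted combination of the previous estimate and a Gaussian mixture fitted to the new chunk alone. The abstract does not state the update rule, so this form is an assumption. The function names `combine_mixtures` and `merge_components` are hypothetical. The merge shown is an ordinary moment-preserving merge of two components, which would be applied only after a test for equality of mean and covariance (not shown here) accepts the pair.

```python
import numpy as np


def combine_mixtures(old, n_old, new, n_new):
    """Count-weighted union of two Gaussian mixtures (assumed update rule).

    Each mixture is a tuple (weights, means, covs) with shapes
    (K,), (K, D) and (K, D, D). n_old and n_new are the numbers of
    samples behind each estimate; only these counts and the mixture
    parameters are needed, not the historical data.
    """
    w_old, mu_old, cov_old = old
    w_new, mu_new, cov_new = new
    total = n_old + n_new
    weights = np.concatenate([
        np.asarray(w_old, dtype=float) * n_old / total,
        np.asarray(w_new, dtype=float) * n_new / total,
    ])
    means = np.concatenate([mu_old, mu_new], axis=0)
    covs = np.concatenate([cov_old, cov_new], axis=0)
    return weights, means, covs


def merge_components(w1, mu1, cov1, w2, mu2, cov2):
    """Merge two weighted Gaussian components, preserving the first two moments."""
    w = w1 + w2
    a, b = w1 / w, w2 / w
    mu = a * mu1 + b * mu2
    d1, d2 = mu1 - mu, mu2 - mu
    # Within-component covariance plus the spread of the two means.
    cov = a * (cov1 + np.outer(d1, d1)) + b * (cov2 + np.outer(d2, d2))
    return w, mu, cov


# Previous estimate: one cluster seen over 300 samples.
old = (np.array([1.0]), np.zeros((1, 2)), np.eye(2)[None])
# New chunk of 100 samples: one cluster matching the old one, one novel cluster.
new = (
    np.array([0.5, 0.5]),
    np.array([[0.0, 0.0], [5.0, 5.0]]),
    np.stack([np.eye(2), np.eye(2)]),
)
weights, means, covs = combine_mixtures(old, 300, new, 100)
# Components 0 and 1 describe the same cluster, so they are merged.
merged = merge_components(weights[0], means[0], covs[0],
                          weights[1], means[1], covs[1])
```

In this sketch the memory cost depends on the number of mixture components and the size of the current chunk, not on the length of the stream, which is the property the abstract emphasizes.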
In non-parametric pattern recognition, the probability density function is approximated by means of many parameters, each representing the density value in a small hyper-rectangular volume of the space. The hyper-rectangles are determined by appropriately quantizing the range of each variable. Optimal quantization determines a compact and efficient representation of the probability density of the data by optimizing a global quantizer performance measure. The measure used here is a weighted combination of average log likelihood, entropy and correct classification probability. In multiple dimensions, we study a grid-based quantization technique. Smoothing is an important aspect of optimal quantization because it affects the generalization ability of the quantized density estimates. We use a fast generalized k-nearest-neighbor smoothing algorithm. We illustrate the effectiveness of optimal quantization on a set of Gaussian mixture models whose components are not well separated, and compare it with the expectation maximization (EM) algorithm. On these models, optimal quantization produces better results than the EM algorithm, because the convergence of EM to the true parameters can be extremely slow when the mixture components are not well separated.
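To make the quantized density representation concrete, the Python sketch below builds a grid of hyper-rectangular cells and evaluates two of the three terms of the performance measure, average log likelihood and entropy. It is only an illustration under simplifying assumptions: the grid is fixed and equal-width rather than optimized, no smoothing is applied, the correct classification probability is omitted, and the entropy is taken over the cell probabilities. The function names `grid_density` and `quantizer_terms` are hypothetical.

```python
import numpy as np


def grid_density(data, bins_per_dim):
    """Equal-width grid (histogram) density estimate for (N, D) data."""
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, None]
    counts, edges = np.histogramdd(data, bins=bins_per_dim)
    probs = counts / counts.sum()
    # Volume of each hyper-rectangular cell: product of its side lengths.
    volume = np.diff(edges[0])
    for e in edges[1:]:
        volume = np.multiply.outer(volume, np.diff(e))
    density = probs / volume
    return probs, density, edges


def quantizer_terms(data, probs, density, edges):
    """Average log likelihood of the data and entropy of the cell probabilities."""
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, None]
    # Locate the cell of each sample; the last cell includes its right edge.
    idx = tuple(
        np.clip(np.searchsorted(e, data[:, d], side="right") - 1, 0, len(e) - 2)
        for d, e in enumerate(edges)
    )
    # Every sample lies in a cell it helped fill, so the density there is positive.
    avg_log_lik = np.mean(np.log(density[idx]))
    p = probs[probs > 0]
    entropy = -np.sum(p * np.log(p))
    return avg_log_lik, entropy


rng = np.random.default_rng(0)
# Two overlapping Gaussian clusters, i.e. a mixture that is not well separated.
sample = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 2)),
    rng.normal(1.5, 1.0, size=(500, 2)),
])
probs, density, edges = grid_density(sample, bins_per_dim=8)
avg_log_lik, entropy = quantizer_terms(sample, probs, density, edges)
```

A finer grid raises the average log likelihood on the fitting data but leaves more cells empty or sparsely filled, which is where the smoothing step mentioned above would matter for generalization.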