Proc. SPIE. 11198, Fourth International Workshop on Pattern Recognition
KEYWORDS: Data mining, Facial recognition systems, Detection and tracking algorithms, Data modeling, Medical diagnostics, Precision measurement, Machine learning, Software engineering, Data centers, Performance modeling
The class imbalance problem is one of the key challenges in machine learning and data mining. Imbalanced data can result in sub-optimal performance of classification models. To address the problem, a variety of data sampling methods have been proposed in previous studies. However, there is no universal solution, and it is worth exploring which kind of data sampling technique balances the class distribution most effectively for a given type of data and classifier. In this work, we present an experimental study based on a number of real-world data sets obtained from different disciplines. The goal is to investigate how effectively different sampling techniques improve classification performance on imbalanced data sets. In particular, we study ten sampling methods of different types, including random sampling, cluster-based sampling, and ensemble sampling. The C4.5 decision tree algorithm is used to train the base classifiers, and performance is measured using precision, G-Measure, and Cohen's Kappa statistic.
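
As an illustration of the kind of pipeline the abstract describes, the following minimal Python sketch shows random oversampling (one of the sampling families mentioned) and the three evaluation measures for a binary problem. It is not the authors' implementation. The function names `random_oversample` and `binary_metrics` are hypothetical, and the G-Measure is assumed here to be the geometric mean of precision and recall; some works instead use the geometric mean of sensitivity and specificity, and the abstract does not say which definition is used. The classifier itself (C4.5) is omitted.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, G-Measure (assumed: sqrt(precision * recall)) and
    Cohen's kappa for a binary problem (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n = tp + fp + fn + tn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    g_measure = math.sqrt(precision * recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"precision": precision, "g_measure": g_measure, "kappa": kappa}
```

In use, the training split would be passed through `random_oversample` before fitting a classifier, and `binary_metrics` would be applied to predictions on the untouched test split, so that the evaluation still reflects the original class distribution.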