K-means clustering based filter feature selection on high dimensional data

(1) * Dewi Pramudi Ismi (Universitas Ahmad Dahlan, Indonesia)
(2) Shireen Panchoo (University of Technology Mauritius, Mauritius)
(3) Murinto Murinto (Universitas Ahmad Dahlan, Indonesia)
*corresponding author


With hundreds or thousands of features, high dimensional data imposes a heavy computational workload. In classification, features that do not contribute significantly to class prediction add to this workload. The aim of this paper is therefore to use feature selection to reduce the computational load by reducing the dimensionality of the data. The approach selects a subset of features that represents the full feature set. The process is two-fold: discarding irrelevant features and choosing one feature to represent each group of redundant features. Many feature selection methods have been studied, for example backward feature selection and forward feature selection. In this study, a k-means clustering based feature selection is proposed. It is assumed that redundant features are located in the same cluster, whereas irrelevant features do not belong to any cluster. Two high dimensional datasets are used: 1) the Human Activity Recognition Using Smartphones (HAR) Dataset, containing 7352 data points with 561 features each, and 2) the National Classification of Economic Activities (CNAE-9) Dataset, containing 1080 data points with 857 features each. Both datasets provide a class label for each data point. Our experiment shows that k-means clustering based feature selection can produce a subset of features, and that classification using this subset achieves more than 80% accuracy.
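The redundancy part of the idea described above can be illustrated with a minimal sketch: treat each feature (column) as a point, cluster the features with k-means, and keep the one feature closest to each centroid as that cluster's representative. This is not the authors' implementation. The function names (`standardize`, `farthest_first`, `kmeans`, `select_features`), the standardization step, the deterministic farthest-first initialization, and the nearest-to-centroid selection rule are all assumptions made for illustration. The abstract does not say how irrelevant features are detected, so that step is not modeled here.

```python
import math


def standardize(column):
    """Scale one feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0.0:
        return [0.0] * n  # constant feature: nothing to scale
    return [(v - mean) / std for v in column]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization (an assumption, not from the paper):
    start at the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centroids


def select_features(rows, k):
    """Cluster the columns of `rows` and return the sorted indices of one
    representative feature per non-empty cluster."""
    features = [standardize(list(col)) for col in zip(*rows)]
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of features")
    labels, centroids = kmeans(features, k)
    selected = []
    for c in range(k):
        members = [j for j, lab in enumerate(labels) if lab == c]
        if members:
            # Representative: the feature nearest to the cluster centroid.
            selected.append(min(members, key=lambda j: sq_dist(features[j], centroids[c])))
    return sorted(selected)
```

Calling `select_features(rows, k)` on a list of data points (one list of feature values per point) returns up to `k` column indices. Because features are standardized first, two features that differ only by a positive scale and offset map to the same point and fall into the same cluster, which matches the assumption that redundant features share a cluster.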


feature selection; dimensionality reduction; clustering; k-means clustering; classification; high dimensional data








Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan, and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: ijain@uad.ac.id (paper handling issues)
    info@ijain.org, andri.pranolo.id@ieee.org (publication issues)
