An improved K-Nearest neighbour with grasshopper optimization algorithm for imputation of missing data

(1) Nadzurah Zainal Abidin (Department of Computer Science, International Islamic University Malaysia, Malaysia)
(2) * Amelia Ritahani Ismail (Department of Computer Science, International Islamic University Malaysia, Malaysia)
*corresponding author

Abstract


K-nearest neighbors (KNN) has been used extensively as an imputation algorithm that substitutes missing data with plausible values. One of its strengths is the ability to estimate a missing value robustly from its nearest neighbors. Despite these advantages, KNN has drawbacks: high time complexity, the difficulty of choosing the right k, and sensitivity to the choice of distance function. This paper therefore proposes a novel method for imputing missing data, named KNNGOA, which optimizes the KNN imputation technique using the grasshopper optimization algorithm (GOA). The GOA is designed to find the best value of k and to optimize the value imputed by KNN so that imputation accuracy is maximized. The method is evaluated experimentally on different types of datasets collected from UCI, with missing-value rates of 10%, 30%, and 50%. In these experiments, the proposed algorithm achieved promising results and outperformed other methods, especially in terms of accuracy.
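
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of letting a grasshopper-style search pick k for KNN imputation. It is not the authors' KNNGOA implementation. The function names (`knn_impute`, `imputation_rmse`, `goa_select_k`), the RMSE-on-hidden-entries fitness, the folding of distances into [2, 4), and the parameter values (f = 0.5, l = 1.5, c decreasing from 1 to 0.00004, which are values commonly reported for GOA) are all assumptions made for illustration. The sketch tunes k only and does not optimize the imputed values themselves.

```python
import numpy as np

def knn_impute(X, k):
    """Fill NaNs with the mean of the k nearest rows that observe the feature."""
    X = np.asarray(X, dtype=float)
    filled, missing = X.copy(), np.isnan(X)
    col_means = np.nanmean(X, axis=0)      # fallback if no usable neighbour exists
    for i in np.where(missing.any(axis=1))[0]:
        diff = X - X[i]
        shared = ~np.isnan(diff)           # features observed in both rows
        counts = shared.sum(axis=1)
        sq = np.where(shared, diff, 0.0) ** 2
        dist = np.sqrt(sq.sum(axis=1) / np.maximum(counts, 1))
        dist[counts == 0] = np.inf         # no shared feature: not comparable
        dist[i] = np.inf                   # a row is not its own neighbour
        order = np.argsort(dist, kind="stable")
        for j in np.where(missing[i])[0]:
            donors = [r for r in order
                      if not missing[r, j] and np.isfinite(dist[r])][:k]
            filled[i, j] = X[donors, j].mean() if donors else col_means[j]
    return filled

def imputation_rmse(X_complete, mask, k):
    """Assumed fitness: hide the entries in `mask`, impute, return the RMSE."""
    X_complete = np.asarray(X_complete, dtype=float)
    X_missing = X_complete.copy()
    X_missing[mask] = np.nan
    X_hat = knn_impute(X_missing, k)
    return float(np.sqrt(np.mean((X_hat[mask] - X_complete[mask]) ** 2)))

def social_force(r, f=0.5, l=1.5):
    """GOA attraction/repulsion function s(r) = f * exp(-r / l) - exp(-r)."""
    return f * np.exp(-r / l) - np.exp(-r)

def goa_select_k(fitness, k_min, k_max, n_agents=6, n_iter=15, seed=0):
    """Simplified one-dimensional GOA that minimises fitness(k) over integers."""
    rng = np.random.default_rng(seed)
    cache = {}

    def to_k(x):
        return int(np.clip(round(float(x)), k_min, k_max))

    def score(x):
        k = to_k(x)
        if k not in cache:                 # each k is evaluated at most once
            cache[k] = fitness(k)
        return cache[k]

    pos = rng.uniform(k_min, k_max, n_agents)
    scores = np.array([score(p) for p in pos])
    best_pos, best_score = pos[scores.argmin()], scores.min()
    c_max, c_min = 1.0, 4e-5
    for t in range(1, n_iter + 1):
        c = c_max - t * (c_max - c_min) / n_iter   # shrinking comfort zone
        new_pos = np.empty_like(pos)
        for i in range(n_agents):
            social = 0.0
            for j in range(n_agents):
                if i == j:
                    continue
                d = abs(pos[j] - pos[i])
                r = 2.0 + d % 2.0          # assumption: fold distance into [2, 4)
                social += (c * (k_max - k_min) / 2.0
                           * social_force(r) * np.sign(pos[j] - pos[i]))
            # Move relative to the best position found so far (the target).
            new_pos[i] = np.clip(c * social + best_pos, k_min, k_max)
        pos = new_pos
        scores = np.array([score(p) for p in pos])
        if scores.min() < best_score:
            best_pos, best_score = pos[scores.argmin()], scores.min()
    return to_k(best_pos), float(best_score)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(30, 3))        # synthetic stand-in for a UCI dataset
    mask = rng.random(data.shape) < 0.1    # hide roughly 10% of the entries
    mask[0, 0] = True                      # guarantee at least one hidden entry
    print(goa_select_k(lambda k: imputation_rmse(data, mask, k), 1, 10))
```

Because k is a small integer, an exhaustive search over k would also be feasible here. The swarm search is shown only to mirror the approach named in the abstract, and the cache keeps the number of KNN runs bounded by the number of distinct k values.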

Keywords


Grasshopper; KNN; Imputation accuracy; GOA; Missing data

   

DOI

https://doi.org/10.26555/ijain.v7i3.696
      

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan, and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: ijain@uad.ac.id (paper handling issues)
    info@ijain.org, andri.pranolo.id@ieee.org (publication issues)
