Multivariable classification with machine learning has significant implications for healthcare and disease prediction. Accurate classification of medical conditions such as hepatitis is critical for early diagnosis and timely intervention. This study applies machine learning and statistical techniques to classify individuals according to key hepatitis-related characteristics, using a real dataset to build a reliable predictive model for early detection. Through this model, we aim to raise awareness and guide affected individuals toward timely treatment. The paper focuses on comprehensive data preprocessing, comprising missing-value handling, outlier removal, class-imbalance correction, and selection of highly correlated features, in order to improve model performance. Missing values are filled by mean/mode imputation, outliers are detected and removed from the dataset using the z-score approach, and class imbalance is addressed by oversampling. Highly correlated features are identified with an embedded feature selection approach. Three classic machine learning algorithms, K-Nearest Neighbors (KNN), Naive Bayes (NB), and Random Forest (RF), are employed to predict whether a person is affected by hepatitis. Model performance is assessed using 10-fold cross-validation. RF achieves the highest classification accuracy, 97.44%, with precision, recall, F1 score, and ROC values of 0.99, 0.96, 0.97, and 1.00, respectively.
Keywords: Hepatitis, Missing values, Class imbalance problem, Early-stage prediction, Machine learning, KNN, NB, RF, Classification.
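As an illustration of the preprocessing steps named in the abstract, the following is a minimal, dependency-free Python sketch, not the authors' implementation. It assumes the order imputation, then z-score filtering, then random oversampling, a z-score threshold of 3, and a tiny synthetic table with hypothetical columns (age, bilirubin, sex). None of these specifics are stated in the abstract, and all function names are illustrative.

```python
import random
import statistics


def impute(rows, categorical):
    """Fill None entries: mode for categorical columns, mean for numeric ones."""
    fills = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if j in categorical:
            fills.append(statistics.mode(observed))
        else:
            fills.append(statistics.mean(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def drop_outliers(rows, labels, numeric, threshold=3.0):
    """Drop rows in which any numeric column has |z-score| above the threshold."""
    stats = {}
    for j in numeric:
        col = [r[j] for r in rows]
        stats[j] = (statistics.mean(col), statistics.pstdev(col))
    kept = [
        (r, y)
        for r, y in zip(rows, labels)
        if all(sd == 0 or abs(r[j] - mu) / sd <= threshold
               for j, (mu, sd) in stats.items())
    ]
    return [r for r, _ in kept], [y for _, y in kept]


def oversample(rows, labels, seed=0):
    """Random oversampling: duplicate minority rows until all classes match."""
    rng = random.Random(seed)
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([y] * target)
    return out_rows, out_labels


# Synthetic toy table (NOT the study's data): [age, bilirubin, sex]
rows = [
    [30, 1.0, 1], [35, 0.9, 1], [40, None, 2], [45, 1.1, 1],
    [50, 1.0, None], [None, 1.2, 1], [38, 0.8, 1], [42, 1.0, 2],
    [36, 1.1, 1], [44, 0.9, 1], [41, 25.0, 1], [39, 1.0, 2],
]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # imbalanced: nine vs three

filled = impute(rows, categorical={2})
clean_rows, clean_labels = drop_outliers(filled, labels, numeric=[0, 1])
balanced_rows, balanced_labels = oversample(clean_rows, clean_labels)

print(len(filled), len(clean_rows), len(balanced_rows))  # prints: 12 11 18
```

On this toy table, the row with bilirubin 25.0 is removed as an outlier, and the two remaining minority rows are duplicated until both classes have nine rows. When oversampling is combined with cross-validation, applying it inside each training fold rather than to the whole table keeps duplicated rows from appearing in both training and test folds; the abstract does not state which was done.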
Source of Funding:
This study did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.
Competing Interests Statement:
The authors declare no competing financial, professional, or personal interests.
Consent for publication:
The authors declare that they consented to the publication of this study.
Authors' contributions:
All authors contributed equally to the conception and design of the work, data collection, simulation analysis, drafting of the article, and critical revision of the article. All authors have read and approved the final copy of the manuscript.
Availability of data and material:
The authors are willing to share the data and material as needed.