Interpretable Machine Learning with SHAP and XGBoost for Lung Cancer Prediction Insights

  • Taufik Kurniawan, Universitas Sultan Fatah Demak
  • Laily Hermawanti, Universitas Sultan Fatah Demak
  • Achmad Nuruddin Safriandono, Universitas Sultan Fatah Demak

Abstract

Lung cancer remains one of the leading causes of death worldwide, and early detection through accurate and reliable methods is essential to improve patient prognosis. This study proposes a lung cancer classification model that combines XGBoost with Random Over Sampling (ROS) to address class imbalance and with SHapley Additive exPlanations (SHAP) to interpret the model's predictions. With hyperparameters tuned using Optuna, the model achieved an average accuracy of 96.84%, precision of 99.23%, recall of 94.51%, F1-score of 96.74%, specificity of 99.17%, and AUC of 96.84% in 10-fold cross-validation. SHAP analysis identified gender, smoking habits, and the physical sign of yellow fingers as the features that most influence the model's predictions. These results indicate that the proposed model is both accurate and interpretable, which can support better clinical decision making in lung cancer diagnosis.
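As an illustration of two ingredients named in the abstract, the following is a minimal sketch, not the authors' implementation, of random oversampling and of the reported evaluation metrics computed from confusion-matrix counts. It uses only the Python standard library; the function names `random_oversample` and `binary_metrics` and the toy data are hypothetical, introduced purely for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many samples as the largest class.
    Hypothetical helper, not the authors' code."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, specificity, and F1 from the
    confusion-matrix counts of a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy, made-up data: 8 positive and 2 negative samples.
toy_rows = [[i] for i in range(10)]
toy_labels = [1] * 8 + [0] * 2
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels, seed=1)
print(Counter(balanced_labels))

# Toy predictions, unrelated to the study's results.
print(binary_metrics([1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1]))
```

In a cross-validation setting, oversampling is normally applied to the training portion of each fold only, so that duplicated samples do not leak into the held-out portion; whether the study did so is not stated in the abstract.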

Published
2024-11-01
How to Cite
KURNIAWAN, Taufik; HERMAWANTI, Laily; SAFRIANDONO, Achmad Nuruddin. Interpretable Machine Learning with SHAP and XGBoost for Lung Cancer Prediction Insights. JOURNAL OF APPLIED INFORMATICS AND COMPUTING, [S.l.], v. 8, n. 2, p. 296-303, nov. 2024. ISSN 2548-6861. Available at: <http://704209.wb34atkl.asia/index.php/JAIC/article/view/8395>. Date accessed: 28 nov. 2024. doi: https://doi.org/10.30871/jaic.v8i2.8395.
Section
Articles
