The Impact of Oversampling with SMOTE on the Performance of 3 Classifiers in Prediction of Type 2 Diabetes

Azra Ramezankhani; Omid Pournik; Jamal Shahrabi; Fereidoun Azizi; Farzad Hadaegh; Davood Khalili

doi:10.1177/0272989X14560647

Med Decis Making. 2016 Jan;36(1):137-44. doi: 10.1177/0272989X14560647. Epub 2014 Dec 1.

Authors

Azra Ramezankhani¹, Omid Pournik², Jamal Shahrabi³, Fereidoun Azizi⁴, Farzad Hadaegh¹, Davood Khalili¹,⁵

Affiliations

¹ Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (AR, FH, DK)
² Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran (OP)
³ Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran (JS)
⁴ Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (FA)
⁵ Department of Epidemiology, School of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran (DK)

PMID: 25449060
DOI: 10.1177/0272989X14560647

Abstract

Objective: To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS).

Methods: Data of the 6647 nondiabetic participants, aged 20 years or older with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using the SMOTE technique, at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to establish the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden's index were used to evaluate the performance of classifiers in the test dataset. To compare the performance of the 3 classification models, we used the ROC convex hull (ROCCH).
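The oversampling step described in the Methods can be illustrated with a minimal sketch of the basic SMOTE idea: each synthetic case is placed at a random point on the line segment between a minority-class case and one of its nearest minority-class neighbours. This is not the authors' implementation. The function name `smote`, the neighbour count `k=5`, the fixed seed, the toy data, and the assumption that all features are numeric and that `percent` is a multiple of 100 are illustrative choices, not details taken from the study.

```python
import random


def smote(minority, percent, k=5, seed=0):
    """Return synthetic minority samples (a sketch of basic SMOTE).

    minority: list of numeric feature vectors (assumption: no categorical features)
    percent:  oversampling amount, assumed to be a multiple of 100
              (100 -> one synthetic sample per original minority sample)
    k:        number of nearest minority neighbours to interpolate towards
    """
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy data (hypothetical, not from the TLGS cohort): four minority cases.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote(toy_minority, percent=200, k=2)
oversampled_minority = toy_minority + toy_synthetic
```

In this sketch, oversampling at 200% adds two synthetic cases per original minority case, so the minority class grows to three times its original size. Only the training data would be oversampled; the test data stay untouched, as in the Methods.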

Results: Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of the 3 classification methods. NB had the best Youden's index before and after oversampling. The ROCCH showed that PNN is suboptimal for any class and cost conditions.
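The measures reported above can be made concrete with a short sketch. It computes the listed measures from a 2x2 confusion matrix and builds a ROC convex hull from (false-positive rate, true-positive rate) points; a classifier whose point lies below the hull is suboptimal for every class distribution and cost ratio. The function names and all numbers are hypothetical illustrations, not results or code from the study.

```python
def metrics(tp, fn, fp, tn):
    """Performance measures from confusion-matrix counts (positive = diabetes)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "youden": sensitivity + specificity - 1,
    }


def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive means a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points, always including (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Hypothetical counts and ROC points for illustration only.
example = metrics(tp=40, fn=10, fp=20, tn=130)
roc_points = {"A": (0.2, 0.7), "B": (0.5, 0.6)}  # (FPR, TPR) per classifier
hull = roc_convex_hull(list(roc_points.values()))
suboptimal = [name for name, pt in roc_points.items() if pt not in hull]
```

With these made-up points, classifier B falls below the segment joining A to (1, 1), so it is off the hull and would never be preferred, whatever the class ratio or misclassification costs.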

Conclusions: When building a classifier with a machine learning algorithm such as the PNN or DT, class skew in the data should be considered. The NB and DT were the optimal classifiers for a prediction task in an imbalanced medical database.

Keywords: SMOTE; classification; data mining; diabetes.

MeSH terms

Adult
Algorithms
Bayes Theorem
Decision Trees*
Diabetes Mellitus, Type 2 / epidemiology*
Female
Humans
Iran
Male
Middle Aged
Minority Groups*
Models, Theoretical
Neural Networks, Computer*
Prospective Studies
Risk Factors
Selection Bias*
Sensitivity and Specificity
Socioeconomic Factors