The Impact of Oversampling with SMOTE on the Performance of 3 Classifiers in Prediction of Type 2 Diabetes

Med Decis Making. 2016 Jan;36(1):137-44. doi: 10.1177/0272989X14560647. Epub 2014 Dec 1.

Abstract

Objective: To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS).

Methods: Data from the 6647 nondiabetic participants, aged 20 years or older with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using the SMOTE technique at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to establish the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden's index were used to evaluate the performance of the classifiers in the test dataset. To compare the performance of the 3 classification models, we used the ROC convex hull (ROCCH).
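As a rough illustration of this workflow (not the study's code), the sketch below oversamples the minority class with imbalanced-learn's SMOTE at 100%-700% of its original size and evaluates decision tree and naïve Bayes classifiers on a held-out test set. Synthetic data stands in for the TLGS cohort, the class coding and variable names are assumptions, and the probabilistic neural network is omitted because it has no standard scikit-learn implementation.

```python
# Illustrative sketch only: SMOTE oversampling at 100%-700% of the minority
# class size, followed by evaluation of DT and NB classifiers on a test set.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Stand-in for 6647 participants, 21 risk factors, ~10% incident diabetes.
X, y = make_classification(n_samples=6647, n_features=21, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
n_minority = Counter(y_train)[1]  # class 1 = incident diabetes (assumed coding)

for pct in (0, 100, 200, 300, 400, 500, 600, 700):
    if pct == 0:
        X_res, y_res = X_train, y_train                    # original, imbalanced data
    else:
        target = n_minority + n_minority * pct // 100      # minority size after SMOTE
        smote = SMOTE(sampling_strategy={1: target}, random_state=0)
        X_res, y_res = smote.fit_resample(X_train, y_train)

    for name, clf in (("DT", DecisionTreeClassifier(random_state=0)),
                      ("NB", GaussianNB())):
        y_pred = clf.fit(X_res, y_res).predict(X_test)
        sens = recall_score(y_test, y_pred)                # sensitivity
        spec = recall_score(y_test, y_pred, pos_label=0)   # specificity
        print(f"SMOTE {pct:3d}% {name}: "
              f"acc={accuracy_score(y_test, y_pred):.3f} "
              f"sens={sens:.3f} spec={spec:.3f} "
              f"prec={precision_score(y_test, y_pred):.3f} "
              f"F1={f1_score(y_test, y_pred):.3f} "
              f"Youden={sens + spec - 1:.3f}")
```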

Results: Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of all 3 classification methods. NB had the best Youden's index both before and after oversampling. The ROCCH showed that the PNN was suboptimal under all class and cost conditions.
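For context, Youden's index is conventionally defined as J = sensitivity + specificity - 1, so it rewards a gain in sensitivity only when it is not outweighed by a larger loss in specificity; this is why a classifier can have the best index even without the largest sensitivity increase after oversampling.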

Conclusions: When building a classifier with a machine learning algorithm such as the PNN or DT, class imbalance in the data should be taken into account. The NB and DT were the optimal classifiers for this prediction task in an imbalanced medical dataset.

Keywords: SMOTE; classification; data mining; diabetes.

MeSH terms

  • Adult
  • Algorithms
  • Bayes Theorem
  • Decision Trees*
  • Diabetes Mellitus, Type 2 / epidemiology*
  • Female
  • Humans
  • Iran
  • Male
  • Middle Aged
  • Minority Groups*
  • Models, Theoretical
  • Neural Networks, Computer*
  • Prospective Studies
  • Risk Factors
  • Selection Bias*
  • Sensitivity and Specificity
  • Socioeconomic Factors