Construction of a 26‑feature gene support vector machine classifier for smoking and non‑smoking lung adenocarcinoma sample classification

Mol Med Rep. 2018 Feb;17(2):3005-3013. doi: 10.3892/mmr.2017.8220. Epub 2017 Dec 7.

Abstract

The present study aimed to identify the feature genes associated with smoking in lung adenocarcinoma (LAC) samples and to explore the underlying mechanism. Three gene expression datasets of LAC samples were downloaded from the Gene Expression Omnibus database according to pre‑set criteria, and the expression data were processed using meta‑analysis. Differentially expressed genes (DEGs) between LAC samples of smokers and non‑smokers were identified using the limma package in R. The classification accuracy of the selected DEGs was visualized using hierarchical clustering analysis in R. A protein‑protein interaction (PPI) network was constructed for the DEGs using gene interaction data from the Human Protein Reference Database. Betweenness centrality (BC) was calculated for each node in the network, and genes with the greatest BC values were used to construct the support vector machine (SVM) classifier. The dataset GSE43458 was used as the training dataset, and the other datasets (GSE12667 and GSE10072) were used as the validation datasets. The performance of the classifier was assessed using sensitivity, specificity, positive predictive value, negative predictive value and area under the curve, with the pROC package in R. The feature genes in the SVM classifier were subjected to pathway enrichment analysis using Fisher's exact test. A total of 347 genes were identified as differentially expressed between samples of smokers and non‑smokers. The PPI network of DEGs comprised 202 nodes and 300 edges. An SVM classifier comprising 26 feature genes was constructed to distinguish between LAC samples of smokers and non‑smokers, with prediction accuracies of 100, 100 and 94.83% for the GSE43458, GSE12667 and GSE10072 datasets, respectively. Furthermore, the 26 feature genes were significantly enriched in 9 overrepresented biological pathways, including extracellular matrix‑receptor interaction, proteoglycans in cancer, cell adhesion molecules, p53 signaling pathway, microRNAs in cancer and apoptosis, and were identified as smoking‑related genes in LAC. In conclusion, an SVM classifier with a high prediction accuracy for smoking and non‑smoking samples was obtained. The genes in the classifier may be feature genes associated with the development of LAC in patients who smoke.
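
The evaluation measures named in the abstract (sensitivity, specificity, positive predictive value, negative predictive value and area under the curve) can be illustrated with a minimal sketch. The study computed them with the pROC package in R; the Python below is a stand-in for illustration only, not the authors' code. The function names `binary_metrics` and `rank_auc`, the 0.5 decision threshold, and all labels and scores are hypothetical toy values and are not data from the study.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix measures for a two-class problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "accuracy": ratio(tp + tn, len(pairs)),
    }


def rank_auc(y_true, scores, positive):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive sample scores higher; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 smoker and 6 non-smoker samples with made-up
# classifier scores (higher score = more likely "smoker").
y_true = ["smoker"] * 4 + ["non-smoker"] * 6
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.45, 0.05]
# Assumed decision threshold of 0.5, chosen only for this example.
y_pred = ["smoker" if s > 0.5 else "non-smoker" for s in scores]

metrics = binary_metrics(y_true, y_pred, positive="smoker")
auc_value = rank_auc(y_true, scores, positive="smoker")
print(metrics)
print(auc_value)
```

Counting pairs directly keeps the sketch dependency-free and is adequate for small sample sets; a dedicated package is the better choice when confidence intervals or full ROC curves are needed.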

Publication types

  • Meta-Analysis

MeSH terms

  • Adenocarcinoma / etiology*
  • Adenocarcinoma / genetics*
  • Adenocarcinoma / metabolism
  • Adenocarcinoma of Lung
  • Cluster Analysis
  • Databases, Genetic
  • Gene Expression Profiling
  • Gene Expression Regulation, Neoplastic*
  • Humans
  • Lung Neoplasms / etiology*
  • Lung Neoplasms / genetics*
  • Lung Neoplasms / metabolism
  • Protein Interaction Maps*
  • Smoking / adverse effects*
  • Support Vector Machine*