Towards a better detection of horizontally transferred genes by combining unusual properties effectively

PLoS One. 2012;7(8):e43126. doi: 10.1371/journal.pone.0043126. Epub 2012 Aug 14.

Abstract

Background: Horizontal gene transfer (HGT) is one of the major mechanisms contributing to microbial genome diversification. A number of computational methods for finding horizontally transferred genes have been proposed in the past decades; however, none of them has yet provided a reliable detector. Existing parametric approaches either use a single compositional property in the detection process or simply combine the results obtained from each property separately. Different properties carry different information, so a single property cannot sufficiently capture the information encoded by gene sequences. In addition, the class imbalance problem in the datasets, which also causes large errors in gene detection, has not been considered by published methods. Here we developed an effective classifier system (Hgtident) that uses a support vector machine (SVM) and combines unusual properties effectively for HGT detection.
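To illustrate why class imbalance can distort detector evaluation, the following is a minimal sketch on made-up labels, not the authors' code or data. The function names are hypothetical, and `balanced_error` (the mean of the false negative rate and the false positive rate) is an assumption chosen for illustration; it is not necessarily the "Mean error" measure used in the paper.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = HGT and 0 = native."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def recall(y_true, y_pred):
    """Fraction of true HGT genes that were detected."""
    tp, _, _, fn = confusion(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_error(y_true, y_pred):
    """Mean of false negative rate and false positive rate (illustrative assumption)."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return (fnr + fpr) / 2


# Toy imbalanced dataset: 95 native genes, 5 HGT genes (made-up numbers).
y_true = [0] * 95 + [1] * 5
# A trivial detector that labels every gene as native.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))        # high, despite finding no HGT genes
print(recall(y_true, y_pred))          # no HGT gene is recovered
print(balanced_error(y_true, y_pred))  # exposes the failure on the minority class
```

In this toy case the trivial detector looks accurate overall while missing every transferred gene, which is why measures that weight both classes are more informative on imbalanced datasets.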

Results: Our approach, Hgtident, includes the introduction of more representative datasets, optimization of the SVM model, feature selection, handling of the imbalance problem in the datasets, and extensive performance evaluation via systematic cross-validation methods. Through feature selection, we found that JS-DN and JS-CB have higher discriminating power for HGT detection, while GC1-GC3 and k-mer (k = 1, 2, …, 7) make the least contribution. Extensive experiments indicated that the new classifier could reduce Mean error dramatically and also improve Recall to a certain extent. For the testing genomes, compared with the existing popular multiple-threshold approach, on average, Recall was improved by 2.81% and Mean error was reduced by 26.32%, which means that numerous false positives were identified correctly.
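The abstract does not define the JS-DN feature. Assuming it denotes a Jensen-Shannon divergence between the dinucleotide frequencies of a gene and those of its host genome, a compositional feature of this kind could be sketched as follows. The function names and toy sequences are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter
from itertools import product


def kmer_freqs(seq, k=2):
    """Relative frequencies of overlapping k-mers over ACGT (k=2: dinucleotides)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms, bounded in [0, 1]."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy sequences for illustration only (not real genomic data).
genome = "ATGCGCGCATATGCGCGGCCGCGATATCGCGC" * 4
gene = "ATATATTTAAATATATTAATATAAATTTATAT"

score = js_divergence(kmer_freqs(gene), kmer_freqs(genome))
print(score)  # larger values mean the gene's composition differs more from the genome
```

A gene whose dinucleotide usage matches its genome scores near 0, while a compositionally atypical gene scores closer to 1. Scores of this kind could serve as one input dimension to a classifier such as an SVM.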

Conclusions: Hgtident, introduced here, is an effective approach for better detection of HGT. Combining multiple features of HGT is also essential for detecting a wider range of HGT events.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computational Biology / methods
  • DNA, Bacterial / genetics
  • Databases, Genetic
  • False Negative Reactions
  • False Positive Reactions
  • Gene Transfer, Horizontal*
  • Genome, Bacterial
  • Genomics / methods
  • Models, Genetic
  • Models, Statistical
  • Phylogeny
  • Reproducibility of Results
  • Software
  • Support Vector Machine

Substances

  • DNA, Bacterial

Grants and funding

This work was supported by grants from the National Natural Science Foundation of China under Grant No. 61172171 and the Chinese Academy of Science Knowledge Innovation Project No. KSCX2-EW-R-01-02. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.