Evaluation of methods for differential expression analysis on multi-group RNA-seq count data

BMC Bioinformatics. 2015 Nov 4:16:361. doi: 10.1186/s12859-015-0794-7.

Abstract

Background: RNA-seq is a powerful tool for measuring transcriptomes, especially for identifying differentially expressed genes or transcripts (DEGs) between sample groups. A number of methods have been developed for this task, and several evaluation studies have been reported. However, those evaluations have so far been restricted to two-group comparisons. Comparative studies on multi-group data are also needed.

Methods: We compare 12 pipelines available in nine R packages for detecting differential expression (DE) in multi-group RNA-seq count data, focusing on three-group data with or without replicates. We evaluate those pipelines using both simulated data and real count data.

Results: The pipelines in the TCC package performed comparably to or better than other pipelines under various simulation scenarios. TCC implements a multi-step normalization strategy (called DEGES) that internally uses functions provided by other representative packages (edgeR, DESeq2, and so on). For the same real dataset, the number of identified DEGs differed considerably among the pipelines (18.5-45.7% of all genes), but the distributions of the classified expression patterns were similar. We also found that DE results can be roughly estimated from the hierarchical dendrogram obtained by clustering the samples on the raw count data.
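The sample-clustering observation can be illustrated with a minimal sketch. It assumes a distance of 1 minus the Spearman correlation and average linkage; the abstract does not state either choice, and the counts, sample names, and function names below are hypothetical. When replicates of the same group merge first and the groups join only at large distances, many DEGs would be expected.

```python
import math
from itertools import combinations

# Hypothetical raw counts: 6 genes per sample, three groups with two replicates each.
COUNTS = {
    "G1_rep1": [100, 90, 10, 12, 50, 55],
    "G1_rep2": [110, 95, 12, 11, 48, 60],
    "G2_rep1": [10, 12, 100, 95, 52, 58],
    "G2_rep2": [11, 14, 105, 90, 49, 57],
    "G3_rep1": [50, 55, 52, 48, 5, 200],
    "G3_rep2": [53, 60, 50, 45, 6, 210],
}


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation (undefined if either vector is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_distance(x, y):
    """Assumed distance: 1 - Spearman correlation (Pearson on ranks)."""
    return 1.0 - pearson(ranks(x), ranks(y))


def average_linkage(samples):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, distance)."""
    dist = {
        frozenset(pair): spearman_distance(samples[pair[0]], samples[pair[1]])
        for pair in combinations(samples, 2)
    }

    def cluster_distance(a, b):
        # Average linkage: mean of all pairwise distances between members.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = average_linkage(COUNTS)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d:.3f}")
```

The merge list is a text form of the dendrogram: each entry records which clusters were joined and at what height.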

Conclusion: We confirmed that the DEGES-based pipelines implemented in TCC performed well in three-group comparisons as well as in two-group comparisons. We recommend using the DEGES-based pipeline that internally uses edgeR (here called the EEE-E pipeline) for count data with replicates (especially for small sample sizes). For data without replicates, we recommend the DEGES-based pipeline that uses DESeq2 (called SSS-S).
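The multi-step idea behind DEGES (normalize, flag putative DEGs, then renormalize without them) can be illustrated with a toy sketch. This is not the TCC implementation: total-count scaling stands in for the package's normalization methods, a ranking by the spread of log2 group means stands in for a proper DE test, and the `n_flag` parameter, function names, and counts are all hypothetical.

```python
import math

# Hypothetical toy counts: 10 genes x 6 samples (groups A, B, C; two replicates each).
GROUPS = ["A", "A", "B", "B", "C", "C"]
COUNTS = [[100] * 6 for _ in range(8)]                           # genes 0-7: not DE
COUNTS += [[800, 800, 100, 100, 100, 100] for _ in range(2)]     # genes 8-9: 8x higher in A


def library_size_factors(counts, keep):
    """Total-count size factors over the genes in `keep`, scaled to mean 1."""
    n_samples = len(counts[0])
    totals = [sum(counts[g][s] for g in keep) for s in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [t / mean_total for t in totals]


def flag_putative_degs(counts, groups, factors, n_flag):
    """Crude stand-in for a DE test: rank genes by the spread of log2 group means."""
    labels = sorted(set(groups))
    scores = []
    for g, row in enumerate(counts):
        means = []
        for label in labels:
            vals = [row[s] / factors[s] for s in range(len(row)) if groups[s] == label]
            means.append(math.log2(sum(vals) / len(vals) + 1))
        scores.append((max(means) - min(means), g))
    scores.sort(reverse=True)
    return {g for _, g in scores[:n_flag]}


def deges_like_factors(counts, groups, n_flag):
    """Three-step sketch: normalize, flag putative DEGs, renormalize without them."""
    all_genes = range(len(counts))
    step1 = library_size_factors(counts, all_genes)              # step 1
    degs = flag_putative_degs(counts, groups, step1, n_flag)     # step 2
    kept = [g for g in all_genes if g not in degs]
    step3 = library_size_factors(counts, kept)                   # step 3
    return step1, degs, step3


step1, degs, step3 = deges_like_factors(COUNTS, GROUPS, n_flag=2)
print("initial factors:", [round(f, 3) for f in step1])
print("flagged genes:", sorted(degs))
print("factors after eliminating flagged genes:", [round(f, 3) for f in step3])
```

In this toy data the two up-regulated genes inflate the initial factors for group A; once they are set aside, the remaining genes give equal factors across samples, which is the bias the elimination step is meant to reduce.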

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Area Under Curve
  • Computer Simulation
  • Female
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation
  • Humans
  • Macaca mulatta / genetics
  • Male
  • Pan troglodytes / genetics
  • Reproducibility of Results
  • Sequence Analysis, RNA / methods*
  • Software
  • Transcriptome / genetics