Assessment of data transformations for model-based clustering of RNA-Seq data

Janelle R Noel-MacDonnell; Joseph Usset; Ellen L Goode; Brooke L Fridley

doi:10.1371/journal.pone.0191758

PLoS One. 2018 Feb 27;13(2):e0191758. doi: 10.1371/journal.pone.0191758. eCollection 2018.

Authors

Janelle R Noel-MacDonnell¹,², Joseph Usset¹, Ellen L Goode³, Brooke L Fridley¹,⁴

Affiliations

¹ Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, United States of America.
² Department of Health Services and Outcomes Research, Children's Mercy Hospital, Kansas City, MO, United States of America.
³ Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States of America.
⁴ Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, Tampa, FL, United States of America.

Abstract

Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations affect Gaussian model-based clustering of RNA-Seq data, both in clustering performance and in accuracy in estimating the correct number of clusters. In this article, we investigate both aspects by applying four data transformations (i.e., naïve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in terms of how genes were selected for inclusion in the clustering analyses, the size of the clusters, and the number of clusters. After the different transformations were applied to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each data transformation, the adjusted Rand index, clustering error rate, and concordance index were used. As expected, clustering performance improved in scenarios where data transformations were applied to make the data appear "more" Gaussian in distribution.
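As a rough illustration of two of the transformations and one of the performance measures named in the abstract, the sketch below uses only the Python standard library. It is not the authors' code. The pseudo-count of 1 in the logarithmic transformation, the average-rank handling of ties, and all function names are assumptions made for this example. The variance stabilizing transformation and the Gaussian model-based clustering step itself are not shown, since both depend on dedicated statistical software.

```python
import math
from collections import Counter
from statistics import NormalDist


def log_transform(counts, pseudo_count=1.0):
    """log2(count + pseudo_count); the pseudo-count of 1 is an assumption."""
    return [math.log2(c + pseudo_count) for c in counts]


def blom_transform(values):
    """Rank-based inverse normal transform with Blom's offset (c = 3/8).

    Each value is replaced by the standard normal quantile of
    (rank - 3/8) / (n + 1/4). Ties get the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the run
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no sample pairs to compare
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(v, 2) for v in cells.values())
    sum_rows = sum(math.comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(math.comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)
```

In this sketch the naïve transformation corresponds to passing the counts through unchanged. A value of 1 from `adjusted_rand_index` means the estimated partition matches the true one up to relabeling, and values near 0 indicate agreement no better than chance.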

MeSH terms

Cluster Analysis
Female
Humans
Likelihood Functions
Models, Genetic*
Ovarian Neoplasms / genetics
Ovarian Neoplasms / pathology
Sequence Analysis, RNA*

Grants and funding

The authors received no specific funding for this work.