Feature selection followed by a residuals-based normalization simplifies and improves single-cell gene expression analysis

bioRxiv [Preprint]. 2024 Jan 22:2023.03.02.530891. doi: 10.1101/2023.03.02.530891.

Abstract

Normalization is a critical step in the computational analysis of single-cell RNA-sequencing (scRNA-seq) counts data. The objective is to reduce systematic biases introduced by technical sources that can obscure underlying biological differences. This is typically accomplished by re-scaling the observed counts to reduce the differences in total counts between the cells and then transforming the scaled counts to stabilize the variances. In the standard scRNA-seq workflow, this is followed by feature selection to identify genes that capture most of the biologically meaningful variation across the cells. Here, we propose a simple feature selection method and show that we can perform feature selection before normalization. We also propose a novel residuals-based normalization method that includes a monotonic non-linear transformation to ensure effective variance stabilization of the residuals. We demonstrate significant improvements in downstream clustering analyses through the application of our feature selection and normalization methods to truth-known biological as well as simulated counts data sets. Based on these results, we make the case for a revised scRNA-seq analysis workflow wherein feature selection precedes and in fact informs our residuals-based normalization. This novel workflow has been implemented in an R package called Piccolo.
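For orientation, the conventional normalization that the abstract describes (re-scaling counts to reduce differences in total counts between cells, then transforming to stabilize variances) can be sketched as below. This is only an illustration of that standard step, not the residuals-based method proposed in the preprint and not the Piccolo API. The function name `normalize_counts`, the choice of the median total count as the scaling target, and the use of log(1 + x) as the variance-stabilizing transform are all assumptions made for the example.

```python
import math
from statistics import median


def normalize_counts(counts):
    """Size-factor scaling followed by a log1p transform.

    counts: list of cells, each a list of raw gene counts (cells x genes).
    Hypothetical helper for illustration; not part of any package.
    """
    totals = [sum(cell) for cell in counts]
    if any(total == 0 for total in totals):
        raise ValueError("every cell needs at least one count")

    # Assumed convention: scale every cell to the median total count.
    target = median(totals)

    normalized = []
    for cell, total in zip(counts, totals):
        size_factor = total / target
        # Re-scale, then apply log(1 + x) to dampen the mean-variance trend.
        normalized.append([math.log1p(c / size_factor) for c in cell])
    return normalized


# Toy data: the second cell has the same composition as the first,
# sequenced twice as deeply.
counts = [
    [10, 0, 5, 5],
    [20, 0, 10, 10],
    [5, 5, 0, 0],
]
normalized = normalize_counts(counts)
```

In this toy input the first two cells differ only in sequencing depth, so after scaling they receive the same normalized values, which is the kind of technical difference the re-scaling step is meant to remove.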

Publication types

  • Preprint