EPS: automated feature selection in case-control studies using extreme pseudo-sampling

Ruhollah Shemirani; Stephane Wenric; Eimear Kenny; José Luis Ambite

doi:10.1093/bioinformatics/btab214

Bioinformatics. 2021 Oct 11;37(19):3372-3373. doi: 10.1093/bioinformatics/btab214.

Authors

Ruhollah Shemirani¹, Stephane Wenric², Eimear Kenny², José Luis Ambite¹

Affiliations

¹ Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA.
² Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

PMID: 33774671
DOI: 10.1093/bioinformatics/btab214

Abstract

Summary: Finding informative predictive features in high-dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm addresses this feature selection problem by combining deep learning with linear regression models. First, it uses a variational autoencoder to generate complex latent representations of the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space and selects the most significant variables in this regression. We present an open-source implementation of the algorithm that is easy to set up, use and customize. Our package extends the original algorithm with new features and customization options for data preparation, model training and classification. We believe these additions will enable adoption of the algorithm for a diverse range of datasets.
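
The steps above can be illustrated with a minimal, standard-library sketch. It is not the API of the eps package, and every function name and parameter in it (fit_logistic, eps_sketch, n_extreme, n_pseudo, noise) is hypothetical. Three simplifying assumptions are made: the variational autoencoder is omitted, so the raw features stand in for the latent representation; pseudo-samples are drawn as Gaussian noise around the extreme samples; and features are ranked by absolute coefficient as a stand-in for statistical significance.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def eps_sketch(X, y, n_extreme=5, n_pseudo=20, noise=0.1, seed=0):
    """Return feature indices ordered from most to least influential.

    Assumption: X is used directly as the "latent" space (no autoencoder).
    """
    rng = random.Random(seed)

    # Step 2: classify cases (1) and controls (0) with logistic regression.
    w, b = fit_logistic(X, y)
    scores = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]

    # Step 3: pick the controls with the lowest scores and the cases with
    # the highest scores, then sample noisy copies around each of them.
    order = sorted(range(len(X)), key=lambda i: scores[i])
    extreme_controls = [i for i in order if y[i] == 0][:n_extreme]
    extreme_cases = [i for i in reversed(order) if y[i] == 1][:n_extreme]
    X_up, y_up = [list(row) for row in X], list(y)
    for i in extreme_controls + extreme_cases:
        for _ in range(n_pseudo):
            X_up.append([v + rng.gauss(0.0, noise) for v in X[i]])
            y_up.append(y[i])

    # Step 4: refit on the upsampled data and rank features by |coefficient|.
    w_up, _ = fit_logistic(X_up, y_up)
    return sorted(range(len(w_up)), key=lambda j: -abs(w_up[j]))


# Toy data: feature 0 separates cases from controls, features 1 and 2 are noise.
data_rng = random.Random(42)
labels = [i % 2 for i in range(60)]
samples = [
    [(2 * label - 1) + data_rng.gauss(0.0, 0.3),
     data_rng.gauss(0.0, 0.5),
     data_rng.gauss(0.0, 0.5)]
    for label in labels
]
ranking = eps_sketch(samples, labels)
print(ranking)
```

In a fuller implementation, the pseudo-samples would be generated in the autoencoder's latent space rather than in feature space, and a significance test on the final regression would replace the coefficient ranking used here.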

Availability and implementation: The software package for Python is available online at https://github.com/roohy/eps.

Supplementary information: Supplementary data are available at Bioinformatics online.