EPS: automated feature selection in case-control studies using extreme pseudo-sampling

Bioinformatics. 2021 Oct 11;37(19):3372-3373. doi: 10.1093/bioinformatics/btab214.

Abstract

Summary: Finding informative predictive features in high-dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm addresses this feature selection problem by combining deep learning with regression models. First, it uses a variational autoencoder to generate latent representations of the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls under the regression model. Finally, it trains a new regression model over the upsampled space and selects the most significant variables in this regression. We present an open-source implementation of the algorithm that is easy to set up, use and customize. Our package extends the original algorithm with new features and customization options for data preparation, model training and classification. We believe these additions will enable adoption of the algorithm across a diverse range of datasets.
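The four steps can be illustrated with a minimal NumPy sketch. It is not the EPS package and does not use its API. To stay dependency-free it makes several assumptions that are not in the abstract: a linear PCA encoder/decoder stands in for the variational autoencoder, pseudo-samples are drawn as Gaussian perturbations in latent space and decoded back to feature space, and absolute coefficient size stands in for statistical significance. All function and parameter names (`fit_logistic`, `eps_sketch`, `make_toy_data`, `n_extreme`, `n_pseudo`, `noise`) are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, steps=2000, l2=1e-3):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def eps_sketch(X, y, latent_dim=2, n_extreme=10, n_pseudo=20,
               noise=0.1, top_k=2, seed=0):
    """Return indices of the top_k features under the four-step outline."""
    rng = np.random.default_rng(seed)

    # Step 1 (ASSUMPTION: PCA stand-in, not a VAE): encode to a latent space.
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:latent_dim]  # shape (latent_dim, n_features)
    Z = (X - mean) @ components.T

    # Step 2: classify the latent representations with logistic regression.
    w, b = fit_logistic(Z, y)
    scores = Z @ w + b

    # Step 3: take the most confidently classified cases and controls ...
    case_idx = np.flatnonzero(y == 1)
    ctrl_idx = np.flatnonzero(y == 0)
    extreme_cases = case_idx[np.argsort(scores[case_idx])[-n_extreme:]]
    extreme_ctrls = ctrl_idx[np.argsort(scores[ctrl_idx])[:n_extreme]]
    centers = np.concatenate([Z[extreme_cases], Z[extreme_ctrls]])
    labels = np.concatenate([np.ones(len(extreme_cases)),
                             np.zeros(len(extreme_ctrls))])
    # ... and draw Gaussian pseudo-samples around each of them (ASSUMPTION).
    Z_pseudo = np.repeat(centers, n_pseudo, axis=0)
    Z_pseudo = Z_pseudo + noise * rng.standard_normal(Z_pseudo.shape)
    y_pseudo = np.repeat(labels, n_pseudo)

    # Decode to mean-centered feature space with the linear decoder (ASSUMPTION).
    X_pseudo = Z_pseudo @ components

    # Step 4: fit a new regression on the pseudo-samples and rank features by
    # absolute coefficient (ASSUMPTION: a proxy for significance).
    w_feat, _ = fit_logistic(X_pseudo, y_pseudo)
    return np.argsort(-np.abs(w_feat))[:top_k]


def make_toy_data(n=400, n_features=20, shift=3.0, seed=1):
    """Synthetic case-control data: only features 0 and 1 differ by status."""
    rng = np.random.default_rng(seed)
    y = (np.arange(n) % 2).astype(float)  # alternate controls (0) and cases (1)
    X = rng.standard_normal((n, n_features))
    X[:, :2] += shift * y[:, None]
    return X, y
```

Calling `eps_sketch(*make_toy_data())` returns the indices of the two highest-ranked features on the synthetic data. A nonlinear autoencoder would replace the PCA encoder and decoder in step 1 and in the decoding step. The remaining steps do not depend on how the latent space is produced.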

Availability and implementation: The software package for Python is available online at https://github.com/roohy/eps.

Supplementary information: Supplementary data are available at Bioinformatics online.