Sparse prediction informed by genetic annotations using the logit normal prior for Bayesian regression tree ensembles

Charles Spanbauer; Wei Pan; ADNI, The Alzheimer's Disease Neuroimaging Initiative

doi:10.1002/gepi.22505

Genet Epidemiol. 2023 Feb;47(1):26-44. doi: 10.1002/gepi.22505. Epub 2022 Nov 9.

Authors

Charles Spanbauer¹, Wei Pan¹; ADNI, The Alzheimer's Disease Neuroimaging Initiative¹

Affiliation

¹ Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA.

Abstract

Using high-dimensional genetic variants such as single nucleotide polymorphisms (SNPs) to predict complex diseases and traits has important applications in basic research and clinical settings. For example, predicting gene expression is a necessary first step in identifying (putative) causal genes in transcriptome-wide association studies. Due to weak signals, high dimensionality, and linkage disequilibrium (correlation) among SNPs, building such a prediction model is challenging. However, functional annotations at the SNP level (e.g., epigenomic data across multiple cell or tissue types) are available and could be used to inform predictor importance and aid in outcome prediction. Existing approaches to incorporating annotations have been based mainly on (generalized) linear models. Bayesian additive regression trees (BART), in contrast, is a reliable method for obtaining high-quality nonlinear out-of-sample predictions without overfitting. Unfortunately, the default prior from BART may be too inflexible to handle sparse situations where the number of predictors approaches or surpasses the number of observations. Motivated by our real data application, this article proposes an alternative prior based on the logit normal distribution because it provides a framework that is adaptive to sparsity and can model informative functional annotations. It also provides a framework for incorporating prior information about the between-SNP correlations. Computational details for carrying out inference are presented along with the results from a simulation study and a genome-wide prediction analysis of the Alzheimer's Disease Neuroimaging Initiative data.
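
To make the idea of a logit normal prior concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that the probability of selecting each SNP as a splitting variable is obtained by applying a softmax to a multivariate normal vector, that the normal mean is a linear function of the annotations, and that the covariance encodes between-SNP correlation. The function name `sample_split_probs`, the coefficient vector, the toy annotation matrix, and the AR(1)-style covariance are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_split_probs(annotations, coef, cov, rng):
    """Draw one vector of predictor-selection probabilities.

    Assumption (not taken from the article): latent scores are multivariate
    normal with mean annotations @ coef and covariance cov, and a softmax
    maps the scores onto the probability simplex.
    """
    mean = annotations @ coef
    latent = rng.multivariate_normal(mean, cov)
    latent = latent - latent.max()  # subtract the max for numerical stability
    weights = np.exp(latent)
    return weights / weights.sum()

rng = np.random.default_rng(0)
n_snps = 5

# Hypothetical binary annotation: the first two SNPs are "functional".
annotations = np.array([[1.0], [1.0], [0.0], [0.0], [0.0]])
coef = np.array([2.0])  # hypothetical effect of the annotation on the latent mean

# Hypothetical AR(1)-style covariance standing in for linkage disequilibrium.
idx = np.arange(n_snps)
cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])

probs = sample_split_probs(annotations, coef, cov, rng)

# Average over many draws to see how the annotation shifts prior mass.
draws = np.array([sample_split_probs(annotations, coef, cov, rng)
                  for _ in range(2000)])
mean_probs = draws.mean(axis=0)
```

Under these assumptions, each draw is a valid probability vector over the SNPs, and the annotated SNPs receive more prior mass on average. A larger latent variance would concentrate each draw on fewer SNPs, which is one way such a prior can adapt to sparsity.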

Keywords: ensemble learning; genetics; high-dimensional prediction; sparsity.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Bayes Theorem
Computer Simulation
Genome-Wide Association Study / methods
Humans
Models, Genetic*
Neuroimaging / methods
Polymorphism, Single Nucleotide
