Unsupervised representation learning improves genomic discovery and risk prediction for respiratory and circulatory functions and diseases

medRxiv [Preprint]. 2023 Aug 29:2023.04.28.23289285. doi: 10.1101/2023.04.28.23289285.

Abstract

High-dimensional clinical data are becoming more accessible in biobank-scale datasets. However, effectively utilizing high-dimensional clinical data for genetic discovery remains challenging. Here we introduce a general deep learning-based framework, REpresentation learning for Genetic discovery on Low-dimensional Embeddings (REGLE), for discovering associations between genetic variants and high-dimensional clinical data. REGLE uses convolutional variational autoencoders to compute a non-linear, low-dimensional, disentangled embedding of the data with highly heritable individual components. REGLE can incorporate expert-defined or clinical features and provides a framework to create accurate disease-specific polygenic risk scores (PRS) in datasets which have minimal expert phenotyping. We apply REGLE to both respiratory and circulatory systems: spirograms which measure lung function and photoplethysmograms (PPG) which measure blood volume changes. Genome-wide association studies on REGLE embeddings identify more genome-wide significant loci than existing methods and replicate known loci for both spirograms and PPG, demonstrating the generality of the framework. Furthermore, these embeddings are associated with overall survival. Finally, we construct a set of PRSs that improve predictive performance of asthma, chronic obstructive pulmonary disease, hypertension, and systolic blood pressure in multiple biobanks. Thus, REGLE embeddings can quantify clinically relevant features that are not currently captured in a standardized or automated way.

Publication types

  • Preprint