Leveraging summary statistics to make inferences about complex phenotypes in large biobanks

Angela Gasdaska; Derek Friend; Rachel Chen; Jason Westra; Matthew Zawistowski; William Lindsey; Nathan Tintle

Pac Symp Biocomput. 2019:24:391-402.

Authors

Angela Gasdaska^#¹, Derek Friend^#², Rachel Chen³, Jason Westra⁴, Matthew Zawistowski⁵, William Lindsey⁶, Nathan Tintle⁷

Affiliations

¹ Department of Mathematics and Computer Science and Department of Quantitative Theory and Methods, Emory University, Atlanta, GA 30322, USA, aegasdaska@gmail.com.
² Department of Geography, University of Nevada, Reno, NV 89557, USA, derekfriend@outlook.com.
³ Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA, rschen@ncsu.edu.
⁴ Department of Math, Computer Science, and Statistics, Dordt College, Sioux Center, IA 51250, USA, westrajason@hotmail.com.
⁵ Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA, mattz@umich.edu.
⁶ Department of Math, Computer Science, and Statistics, Dordt College, Sioux Center, IA 51250, USA, William.Lindsey@dordt.edu.
⁷ Department of Math, Computer Science, and Statistics, Dordt College, Sioux Center, IA 51250, USA, Nathan.Tintle@dordt.edu.

^# Contributed equally.

PMID: 30963077
PMCID: PMC6417828

Abstract

As genetic sequencing becomes less expensive and datasets linking genetic data to medical records (e.g., biobanks) become larger and more common, data privacy and computational challenges must increasingly be addressed to realize the benefits of these datasets. One way to alleviate these issues is to use already-computed summary statistics (e.g., slopes and standard errors from a regression model of a phenotype on a genotype). If groups share summary statistics from their analyses of biobanks, many of the privacy issues and computational challenges of accessing these data could be bypassed. In this paper we explore using summary statistics from simple linear models of phenotype on genotype to make inferences about more complex phenotypes (those derived from two or more simple phenotypes). We provide exact formulas for the slope, intercept, and standard error of the slope for linear regressions when combining phenotypes. The derived equations are validated via simulation and tested on a real dataset exploring the genetics of fatty acids.
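The idea of combining phenotypes from summary statistics alone can be illustrated with a minimal sketch. It relies only on a standard property of ordinary least squares: when two phenotypes are regressed on the same genotype in the same individuals, the slope and intercept for a weighted sum of the phenotypes equal the same weighted sum of the individual slopes and intercepts. The function names (`simple_ols`, `combine_summary`), the weights, and the simulated data are assumptions made for illustration and are not taken from the paper. The sketch does not reproduce the paper's formulas, and it omits the standard error of the slope, which is not a simple weighted sum because it also depends on how the phenotypes covary.

```python
import numpy as np


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    slope = np.dot(x_centered, y - y.mean()) / np.dot(x_centered, x_centered)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


def combine_summary(stats1, stats2, w1=1.0, w2=1.0):
    """Intercept and slope for w1*y1 + w2*y2, computed from summary statistics only.

    Hypothetical helper: assumes both regressions used the same genotype
    values in the same individuals.
    """
    a1, b1 = stats1
    a2, b2 = stats2
    return w1 * a1 + w2 * a2, w1 * b1 + w2 * b2


# Simulated data (illustrative assumption, not the paper's data).
rng = np.random.default_rng(0)
n = 5000
genotype = rng.binomial(2, 0.3, size=n)  # minor-allele counts 0, 1, or 2
pheno1 = 1.0 + 0.5 * genotype + rng.normal(size=n)
pheno2 = -2.0 + 0.2 * genotype + rng.normal(size=n)

# Summary statistics that two groups might share.
stats1 = simple_ols(genotype, pheno1)
stats2 = simple_ols(genotype, pheno2)

# Derived phenotype: the difference pheno1 - pheno2.
combined = combine_summary(stats1, stats2, w1=1.0, w2=-1.0)

# Check against a regression run on the individual-level data.
direct = simple_ols(genotype, pheno1 - pheno2)
print(np.allclose(combined, direct))
```

In this sketch the individual-level regression is used only as a check; the combined intercept and slope are computed without touching the raw phenotype values.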

Keywords: biobank; computational challenges; data security; genetics; genome-wide association study; phenotypes; privacy; single nucleotide variant.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Biological Specimen Banks / statistics & numerical data*
Computational Biology
Computer Simulation
Fatty Acids / genetics
Genetic Privacy
Genotype
Humans
Linear Models
Models, Genetic
Phenotype*
Polymorphism, Single Nucleotide

Substances

Fatty Acids

Grants and funding

R15 HG006915/HG/NHGRI NIH HHS/United States