Penalized regression for genome-wide association screening of sequence data

Pac Symp Biocomput. 2011:106-17. doi: 10.1142/9789814335058_0012.

Abstract

Whole exome and whole genome sequencing are likely to be potent tools in the study of common diseases and complex traits. Despite this promise, some very difficult issues in data management and statistical analysis must be squarely faced. The number of rare variants identified by sequencing is apt to be much larger than the number of common variants encountered in current association studies. The low frequencies of rare variants alone will make association testing difficult. This article extends the penalized regression framework for model selection in genome-wide association data to sequencing data with both common and rare variants. Previous research has shown that lasso penalties discourage irrelevant predictors from entering a model. The Euclidean penalties dealt with here group variants by gene or pathway. Pertinent biological information can be incorporated by calibrating penalties by weights. The current paper examines some of the tradeoffs in using pure lasso penalties, pure group penalties, and mixtures of the two types of penalty. All of the computational and statistical advantages of lasso penalized estimation are retained in this richer setting. The overall strategy is implemented in the free statistical genetics analysis software MENDEL and illustrated on both simulated and real data.
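As a rough illustration of the mixed penalty the abstract describes, the sketch below evaluates a weighted lasso term plus a Euclidean (group) term for a coefficient vector. The abstract does not state the penalty formula, so the exact form, the function name `mixed_penalty`, the tuning constants `lam_lasso` and `lam_group`, and the choice to apply per-variant weights to the lasso term only are all assumptions made for illustration. This is not the MENDEL implementation.

```python
import math


def mixed_penalty(beta, groups, lam_lasso, lam_group, weights=None):
    """Evaluate an assumed mixed lasso + Euclidean group penalty.

    beta      : list of regression coefficients, one per variant
    groups    : list of index lists, one per gene or pathway (hypothetical layout)
    lam_lasso : tuning constant for the lasso (L1) term
    lam_group : tuning constant for the Euclidean (group L2) term
    weights   : optional per-variant weights on the lasso term (assumed form)
    """
    if weights is None:
        weights = [1.0] * len(beta)

    # Lasso term: weighted sum of absolute coefficient values.
    lasso_term = sum(w * abs(b) for w, b in zip(weights, beta))

    # Group term: sum over groups of the Euclidean norm of that group's coefficients.
    group_term = sum(
        math.sqrt(sum(beta[j] ** 2 for j in idx)) for idx in groups
    )

    return lam_lasso * lasso_term + lam_group * group_term


# Three variants: the first two share a gene, the third sits alone.
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [2]]
penalty = mixed_penalty(beta, groups, lam_lasso=1.0, lam_group=2.0)
```

Setting `lam_group` to zero gives a pure lasso penalty and setting `lam_lasso` to zero gives a pure group penalty, which are the two extremes the abstract compares against mixtures.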

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Breast Neoplasms / genetics
  • Computational Biology
  • Data Interpretation, Statistical
  • Female
  • Genetic Predisposition to Disease
  • Genetic Variation
  • Genome-Wide Association Study / statistics & numerical data*
  • High-Throughput Nucleotide Sequencing / statistics & numerical data*
  • Humans
  • Logistic Models
  • Polymorphism, Single Nucleotide
  • Regression Analysis
  • Software