SHARE: an adaptive algorithm to select the most informative set of SNPs for candidate genetic association

James Y Dai; Michael Leblanc; Nicholas L Smith; Bruce Psaty; Charles Kooperberg

doi:10.1093/biostatistics/kxp023

Biostatistics. 2009 Oct;10(4):680-93. doi: 10.1093/biostatistics/kxp023. Epub 2009 Jul 15.

Authors

James Y Dai¹, Michael Leblanc, Nicholas L Smith, Bruce Psaty, Charles Kooperberg

Affiliation

¹ Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, M2-C200, Seattle, WA 98109, USA. jdai@fhcrc.org

Abstract

Association studies have been widely used to identify genetic liability variants for complex diseases. While scanning the chromosomal region one single nucleotide polymorphism (SNP) at a time may not fully explore linkage disequilibrium, haplotype analyses tend to require a fairly large number of parameters, thus potentially losing power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes by one SNP at a time in a stepwise fashion, and comparing the prediction errors of different models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.
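
The grow-and-shrink idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes haplotype phase is already known (SHARE itself treats haplotype reconstruction as part of the learning procedure), a quantitative trait fitted by ordinary least squares, and a simple greedy search that accepts whichever single-SNP addition or removal most lowers the cross-validated error. The function names `haplotype_design`, `cv_error`, and `grow_shrink_select`, and the `(n_subjects, 2, n_snps)` input layout, are hypothetical choices made for illustration.

```python
import numpy as np


def haplotype_design(hap, snps, levels=None):
    """Additive haplotype-count design matrix for the SNPs in `snps`.

    hap: int array of shape (n_subjects, 2, n_snps) holding phased 0/1
    alleles (ASSUMPTION: phase is known, unlike in SHARE itself).
    """
    n = hap.shape[0]
    if not snps:
        return np.ones((n, 1)), []  # intercept-only (null) model
    sub = hap[:, :, sorted(snps)]
    # Encode each chromosome's haplotype over the chosen SNPs as an integer.
    codes = sub @ (2 ** np.arange(len(snps)))  # shape (n_subjects, 2)
    if levels is None:
        vals, counts = np.unique(codes, return_counts=True)
        baseline = vals[np.argmax(counts)]  # most frequent haplotype
        levels = [v for v in vals if v != baseline]
    # One column per non-baseline haplotype: number of copies (0, 1, or 2).
    cols = [np.ones(n)] + [(codes == v).sum(axis=1) for v in levels]
    return np.column_stack(cols), levels


def cv_error(hap, y, snps, folds):
    """Cross-validated mean squared prediction error for one SNP set."""
    sq_err = 0.0
    for f in np.unique(folds):
        train, test = folds != f, folds == f
        X_tr, levels = haplotype_design(hap[train], snps)
        # Reuse the training haplotype levels; unseen test haplotypes
        # fall back to the baseline.
        X_te, _ = haplotype_design(hap[test], snps, levels)
        beta, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)
        sq_err += np.sum((y[test] - X_te @ beta) ** 2)
    return sq_err / len(y)


def grow_shrink_select(hap, y, n_folds=5, seed=0):
    """Greedy search: add or drop one SNP while the CV error improves."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    current = frozenset()
    best = cv_error(hap, y, current, folds)
    while True:
        n_snps = hap.shape[2]
        moves = [current | {j} for j in range(n_snps) if j not in current]
        moves += [current - {j} for j in current]
        scored = [(cv_error(hap, y, s, folds), sorted(s)) for s in moves]
        err, snps = min(scored)
        if err >= best:  # no single-SNP move helps: stop
            return sorted(current), best
        current, best = frozenset(snps), err
```

Calling `grow_shrink_select(hap, y)` returns the selected SNP indices and their cross-validated error. Because every accepted move strictly lowers the error, the search always terminates, but it is only a local search and makes no claim to reproduce the power results reported in the abstract.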

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Aged
Algorithms*
Biostatistics / methods
Female
Genome-Wide Association Study / statistics & numerical data*
Haplotypes
Humans
Linkage Disequilibrium
Lipoproteins / genetics
Middle Aged
Models, Statistical
Polymorphism, Single Nucleotide*
Regression Analysis
Venous Thrombosis / genetics

Substances

Lipoproteins
lipoprotein-associated coagulation inhibitor
