Participant identification in genetic association studies: improved methods and practical implications

Int J Epidemiol. 2011 Dec;40(6):1629-42. doi: 10.1093/ije/dyr149.

Abstract

Background: A recent paper by Homer et al. (Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008;4:e1000167) proposed a method for detecting whether a given individual is a contributor to a particular genomic mixture. This prompted grave concern about the public dissemination of aggregate statistics from genome-wide association studies. It is of clear scientific importance that such data be shared widely, but the confidentiality of study participants must not be compromised. The question of which summary genomic data can safely be posted on the web can be answered satisfactorily only once the theoretical underpinnings of the proposed method have been clarified and its dependence on underlying assumptions has been evaluated.

Methods: The original method raised a number of concerns and several alternatives have since been proposed, including a simple linear regression approach. In our proposed generalized estimating equation approach, we maintain the simplicity of the linear regression model but obtain inferences that are more robust to approximation of the variance/covariance structure and can accommodate linkage disequilibrium.
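The simple linear regression idea mentioned above can be illustrated with a small simulation. This is a minimal sketch under strong simplifying assumptions, not the authors' generalized estimating equation approach: SNPs are simulated as independent (no linkage disequilibrium), the reference allele frequencies are taken to be the true population frequencies, and all names (`regression_slope_test`, `simulate`, `n_study`, `n_snps`) are hypothetical. The candidate's genotype deviation from the reference frequency is regressed on the study's aggregate deviation from the same reference; a slope clearly above zero suggests the candidate contributed to the aggregate.

```python
import numpy as np


def regression_slope_test(candidate, study_freq, reference_freq):
    """Ordinary least-squares slope and z-statistic across SNPs.

    candidate      : candidate's allele dosage per SNP, coded 0, 0.5 or 1
    study_freq     : published allele frequency per SNP in the study
    reference_freq : reference-population allele frequency per SNP
    """
    y = candidate - reference_freq
    x = study_freq - reference_freq
    x_c = x - x.mean()
    y_c = y - y.mean()
    sxx = np.sum(x_c ** 2)
    slope = np.sum(x_c * y_c) / sxx
    resid = y_c - slope * x_c
    # Standard error assumes independent SNPs with a common residual variance.
    se = np.sqrt(np.sum(resid ** 2) / (len(y) - 2) / sxx)
    return slope, slope / se


def simulate(seed=0, n_study=100, n_snps=20000):
    """Simulate one study and test a true member and a non-member."""
    rng = np.random.default_rng(seed)
    # Assumed-known population allele frequencies (hypothetical values).
    pop_freq = rng.uniform(0.1, 0.9, size=n_snps)
    # Genotypes coded as allele dosage divided by 2, so values are 0, 0.5, 1.
    study = rng.binomial(2, pop_freq, size=(n_study, n_snps)) / 2
    study_freq = study.mean(axis=0)
    member = study[0]
    non_member = rng.binomial(2, pop_freq) / 2
    slope_m, z_m = regression_slope_test(member, study_freq, pop_freq)
    slope_n, z_n = regression_slope_test(non_member, study_freq, pop_freq)
    return slope_m, z_m, slope_n, z_n


slope_m, z_m, slope_n, z_n = simulate()
print(f"member:     slope={slope_m:.3f}, z={z_m:.2f}")
print(f"non-member: slope={slope_n:.3f}, z={z_n:.2f}")
```

Under these assumptions the expected slope is about 1 for a true participant and about 0 for a non-participant, and the z-statistic for a participant grows roughly with the square root of the number of SNPs divided by the number of study participants. With correlated SNPs the naive standard error used here would be too small, which is the kind of problem a more robust variance estimate is meant to address.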

Results: We affirm that, in principle, it is possible to determine that a 'candidate' individual has participated in a study, given a subset of aggregate statistics from that study. However, the methods depend critically on a number of key factors including: the ancestry of participants in the study; the absolute and relative numbers of cases and controls; and the number of single nucleotide polymorphisms.

Conclusions: Simple guidelines for publication that are based on a single criterion are therefore unlikely to suffice. In particular, 'directed' summary statistics should not be posted openly on the web but could be protected by an internet-based access check as proposed by the P3G Consortium et al. (Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet 2009;5:e1000665).

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cohort Studies
  • Ethics, Research*
  • Genetic Association Studies / methods*
  • Genetic Privacy*
  • Genotype
  • Human Experimentation / ethics*
  • Humans
  • Linear Models
  • Polymorphism, Single Nucleotide
  • Research Design*