Identifying multiple changepoints in heterogeneous binary data with an application to molecular genetics

Biostatistics. 2004 Oct;5(4):515-29. doi: 10.1093/biostatistics/kxh005.

Abstract

Identifying changepoints is an important problem in molecular genetics. Our motivating example is from cancer genetics, where interest focuses on identifying areas of a chromosome with an increased likelihood of containing a tumor suppressor gene. Loss of heterozygosity (LOH) is a binary measure of allelic loss, and abrupt changes in LOH frequency along the chromosome may mark the boundaries of a region containing a tumor suppressor gene. Our interest is in testing for the presence of multiple changepoints in order to identify regions of increased LOH frequency. A complicating factor is the substantial heterogeneity in LOH frequency across patients: some patients have a very high LOH frequency while others have a low frequency. We develop a procedure for identifying multiple changepoints in heterogeneous binary data. We propose both approximate and full maximum-likelihood approaches and compare them with a naive approach that ignores the heterogeneity in the binary data. The methodology is used to estimate the pattern in LOH frequency on chromosome 13 in esophageal cancer patients and to isolate an area of inflated LOH frequency on chromosome 13 which may contain a tumor suppressor gene. Using simulations, we show that our approach works well and that it is robust to departures from some key modeling assumptions.
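
To make the idea of a changepoint in binary data concrete, the following is a minimal sketch of a naive likelihood-ratio search for a single changepoint, in which all patients are pooled and between-patient heterogeneity is ignored. It is an illustration only, not the procedure developed in the paper: the function names, the restriction to one changepoint, and the toy counts are all assumptions made for this example.

```python
import math


def bernoulli_loglik(successes, trials):
    """Maximized binomial log-likelihood for a segment with one constant frequency."""
    if trials == 0 or successes == 0 or successes == trials:
        return 0.0  # the MLE is 0 or 1 (or the segment is empty), so the log-likelihood is 0
    p = successes / trials
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def naive_single_changepoint(loh_counts, n_informative):
    """Hypothetical helper: best single split and its likelihood-ratio statistic.

    loh_counts[j]    -- number of patients showing LOH at marker j (pooled)
    n_informative[j] -- number of informative patients at marker j
    Returns (k, stat), where markers 0..k-1 form the left segment.
    """
    m = len(loh_counts)
    total_s, total_n = sum(loh_counts), sum(n_informative)
    null_ll = bernoulli_loglik(total_s, total_n)  # no changepoint: one frequency

    best_k, best_ll = None, -math.inf
    left_s = left_n = 0
    for k in range(1, m):  # try every split between adjacent markers
        left_s += loh_counts[k - 1]
        left_n += n_informative[k - 1]
        ll = (bernoulli_loglik(left_s, left_n)
              + bernoulli_loglik(total_s - left_s, total_n - left_n))
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k, 2.0 * (best_ll - null_ll)


# Toy data (made up): LOH frequency jumps after the third marker.
loh_counts = [2, 3, 2, 9, 8, 9]
n_informative = [10] * 6
split, lr_stat = naive_single_changepoint(loh_counts, n_informative)
print(split, round(lr_stat, 2))
```

Because the statistic is maximized over all candidate splits, it does not follow a standard chi-squared reference distribution, and pooling patients treats their observations as exchangeable. The abstract identifies exactly this neglect of between-patient heterogeneity as the weakness of the naive approach.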

MeSH terms

  • Chromosomes, Human, Pair 13 / genetics
  • Computer Simulation
  • Esophageal Neoplasms / genetics*
  • Genes, Tumor Suppressor*
  • Humans
  • Loss of Heterozygosity*
  • Models, Genetic*
  • Molecular Biology / methods*