Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm

Eran Elhaik; Dan Graur; Kresimir Josić; Giddy Landan

doi:10.1093/nar/gkq532

Nucleic Acids Res. 2010 Aug;38(15):e158. doi: 10.1093/nar/gkq532. Epub 2010 Jun 22.

Authors

Eran Elhaik¹, Dan Graur, Kresimir Josić, Giddy Landan

Affiliation

¹ McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA. eelhaik1@jhmi.edu

Abstract

It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen-Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and which criteria to use to judge whether a detected boundary between two segments is real. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and by testing the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, D(JS), using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas D(JS) failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remainder is a mixture of many short compositionally homogeneous domains and relatively few long ones.
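To make the general idea of recursive segmentation by Jensen-Shannon divergence concrete, the following is a minimal Python sketch. It is not IsoPlotter and not the D(JS) algorithm as published: it treats the sequence as a two-symbol source (GC versus AT), picks the split point that maximizes the divergence, and halts on a fixed `threshold`. That fixed threshold is a hypothetical stand-in, exactly the kind of static halting criterion the abstract describes as biased; the dynamic criterion and the homogeneity test used by IsoPlotter are not described here and are not reproduced. The names `entropy`, `best_split`, and `segment` are illustrative assumptions.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a two-symbol source with GC fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def is_gc(base):
    """True for G or C (either case)."""
    return base in "GCgc"


def best_split(seq):
    """Return (index, divergence) for the split maximizing the
    length-weighted Jensen-Shannon divergence between seq[:index]
    and seq[index:]. Returns (None, 0.0) if no split has positive
    divergence."""
    n = len(seq)
    total_gc = sum(is_gc(b) for b in seq)
    h_whole = entropy(total_gc / n)
    best_i, best_d = None, 0.0
    left_gc = 0
    for i in range(1, n):
        left_gc += is_gc(seq[i - 1])
        right_gc = total_gc - left_gc
        h_left = entropy(left_gc / i)
        h_right = entropy(right_gc / (n - i))
        # Entropy of the whole minus the weighted entropies of the parts.
        d = h_whole - (i / n) * h_left - ((n - i) / n) * h_right
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d


def segment(seq, threshold, offset=0):
    """Recursively split seq; return the sorted list of boundary positions.
    `threshold` is a fixed, hypothetical halting value (an assumption of
    this sketch, not the criterion used in the paper)."""
    if len(seq) < 2:
        return []
    i, d = best_split(seq)
    if i is None or d <= threshold:
        return []
    left = segment(seq[:i], threshold, offset)
    right = segment(seq[i:], threshold, offset + i)
    return left + [offset + i] + right
```

For example, `segment("A" * 50 + "G" * 50, threshold=0.1)` places a single boundary at position 50, because each half is perfectly homogeneous and no further split has positive divergence. On real sequences the result depends strongly on the chosen threshold, which is the difficulty the abstract raises.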

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Base Composition
Computer Simulation
Genome, Human*
Genomics / methods*
Humans
Isochores
Models, Genetic

Substances

Isochores

Grants and funding

LM010009-01/LM/NLM NIH HHS/United States