A statistical approach designed for finding mathematically defined repeats in shotgun data and determining the length distribution of clone-inserts

Genomics Proteomics Bioinformatics. 2003 Feb;1(1):43-51. doi: 10.1016/s1672-0229(03)01006-4.

Abstract

The large number of repeats, especially high-copy repeats, in the genomes of higher animals and plants makes whole genome assembly (WGA) quite difficult. To address this problem, we sought to identify repeats and mask them prior to assembly, even at the genome survey stage. Repeats of different copy number are known to have different probabilities of appearing in shotgun data. Based on this principle, we constructed a statistical model and inferred criteria for mathematically defined repeats (MDRs) at different shotgun coverages. Using these criteria, we developed the software MDRmasker to identify and mask MDRs in shotgun data. Masking repeats prior to assembly increased the speed of assembly and lowered the probability of error. In addition, because clone-insert size affects the accuracy of repeat assembly and scaffold construction, we also used our model to design the length distribution of clone-inserts. The length distributions of repeats differed between our simulated human and rice genomes, so their optimal clone-insert length distributions were not the same. Thus, with an optimal length distribution of clone-inserts, a given genome could be assembled better at lower coverage.
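
The abstract does not give the model's formulas, so the following is only an illustrative sketch of the general principle that copy number changes how often a sequence shows up in shotgun reads. It assumes, purely for illustration, that occurrences of a single-copy word follow a Poisson distribution whose mean is the effective coverage, and it flags words seen more often than a single-copy word plausibly would be. The function names, the fixed word length `k`, and the significance level `alpha` are hypothetical and are not taken from MDRmasker.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) (assumed occurrence model)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def repeat_threshold(coverage, alpha=1e-3):
    """Smallest count t that a single-copy word reaches with probability < alpha."""
    t = 1
    while poisson_sf(t, coverage) >= alpha:
        t += 1
    return t


def flag_repeats(reads, k, threshold):
    """Return the set of k-letter words occurring at least `threshold` times in the reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {word for word, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # The cutoff rises with coverage, so the criterion depends on sequencing depth.
    for c in (0.5, 1.0, 2.0, 5.0):
        print(c, repeat_threshold(c))
```

Under these assumptions the cutoff grows with coverage, which matches the abstract's point that the repeat criteria must be inferred separately for each shotgun coverage. A real implementation would also need to account for read length, sequencing errors, and both strands.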

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Cloning, Molecular
  • Genome*
  • Genome, Human
  • Genomics / methods*
  • Humans
  • Models, Genetic*
  • Models, Statistical
  • Models, Theoretical
  • Oryza / genetics
  • Sequence Analysis, DNA