NucBreak: location of structural errors in a genome assembly by using paired-end Illumina reads

BMC Bioinformatics. 2020 Feb 21;21(1):66. doi: 10.1186/s12859-020-3414-0.

Abstract

Background: Advances in whole genome sequencing strategies have provided the opportunity for genomic and comparative genomic analysis of a vast variety of organisms. The analysis results are highly dependent on the quality of the genome assemblies used. Assessment of the assembly accuracy may significantly increase the reliability of the analysis results and is therefore of great importance.

Results: Here, we present a new tool called NucBreak aimed at localizing structural errors in assemblies, including insertions, deletions, duplications, inversions, and different inter- and intra-chromosomal rearrangements. The approach taken by existing alternative tools is based on analysing reads that do not map properly to the assembly, for instance discordantly mapped reads, soft-clipped reads and singletons. NucBreak uses an entirely different and unique method to localise the errors. It is based on analysing the alignments of reads that are properly mapped to an assembly and exploit information about the alternative read alignments. It does not annotate detected errors. We have compared NucBreak with other existing assembly accuracy assessment tools, namely Pilon, REAPR, and FRCbam as well as with several structural variant detection tools, including BreakDancer, Lumpy, and Wham, by using both simulated and real datasets.

Conclusions: The benchmarking results have shown that NucBreak in general predicts assembly errors of different types and sizes with relatively high sensitivity and with lower false discovery rate than the other tools. Such a balance between sensitivity and false discovery rate makes NucBreak a good alternative to the existing assembly accuracy assessment tools and SV detection tools. NucBreak is freely available at https://github.com/uio-bmi/NucBreak under the MPL license.

Keywords: Assembly accuracy assessment; Assembly errors; Genome assembly; Illumina paired-end reads; Structural variant detection.

Publication types

  • Evaluation Study

MeSH terms

  • Genome
  • Genomics / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Reproducibility of Results
  • Sequence Analysis, DNA / methods*
  • Software