KmerAperture: Retaining k-mer synteny for alignment-free extraction of core and accessory differences between bacterial genomes

PLoS Genet. 2024 Apr 29;20(4):e1011184. doi: 10.1371/journal.pgen.1011184. eCollection 2024 Apr.

Abstract

By decomposing genome sequences into k-mers, it is possible to estimate genome differences without alignment. Techniques such as k-mer minimisers, for example MinHash, have been developed and often accurately approximate distances based on full k-mer sets. These and other alignment-free methods avoid the large temporal and computational expense of alignment. However, these k-mer set comparisons are not entirely accurate within species and can be completely inaccurate within a lineage. This is due, in part, to their inability to distinguish core polymorphism from accessory differences. Here we present a new approach, KmerAperture, which uses information on the relative genomic positions of k-mers to determine the type of polymorphism causing differences in k-mer presence and absence between pairs of genomes. A single SNP is expected to result in k unique contiguous k-mers per genome. On the other hand, a contiguous series of S unique k-mers, with S > k, may be caused by an accessory difference of length S-k+1 when the start and end of the sequence are contiguous with homologous sequence. Alternatively, such a series may be caused by multiple SNPs within k bp of each other, and KmerAperture can determine whether that is the case. To demonstrate use cases, KmerAperture was benchmarked using datasets including a very low diversity simulated population with accessory content independent of the number of SNPs, a simulated population where SNPs are spatially dense, a moderately diverse real cluster of genomes (Escherichia coli ST1193) with a large accessory genome, and a low diversity real genome cluster (Salmonella Typhimurium ST34). We show that KmerAperture can accurately distinguish both core and accessory sequence diversity without alignment, outperforming other k-mer-based tools.
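
The run-length reasoning in the abstract can be illustrated with a small Python sketch. This is not KmerAperture's implementation: the function names (`kmer_list`, `unique_runs`, `interpret`) and the toy genomes are assumptions made for illustration, and the sketch ignores reverse complements, repeats and contig boundaries. It only shows that a single substitution yields a run of k consecutive k-mers absent from the other genome, while an insertion of L bp flanked by homologous sequence yields a run of S = L + k - 1, so that L = S - k + 1.

```python
import random


def kmer_list(seq, k):
    """Return the k-mers of seq in genomic order (index = start position)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unique_runs(query, reference, k):
    """Find runs of consecutive query k-mers that are absent from reference.

    Returns (start_position, run_length) tuples; run_length is the number
    of contiguous unique k-mers in the run (S in the abstract).
    """
    ref_kmers = set(kmer_list(reference, k))
    runs, start = [], None
    for pos, kmer in enumerate(kmer_list(query, k)):
        if kmer not in ref_kmers:
            if start is None:
                start = pos              # a new run of unique k-mers begins
        elif start is not None:
            runs.append((start, pos - start))
            start = None
    if start is not None:                # run reaches the end of the sequence
        runs.append((start, len(query) - k + 1 - start))
    return runs


def interpret(run_length, k):
    """Toy labelling of a run by its length alone (hypothetical helper)."""
    if run_length == k:
        return "single SNP"
    if run_length > k:
        # Length alone cannot separate these two cases; doing so needs the
        # corresponding run in the other genome, which is not sketched here.
        return f"accessory ({run_length - k + 1} bp) or clustered SNPs"
    return "run shorter than k"


def other_base(base):
    """Return a base guaranteed to differ from the given one."""
    return "ACGT"[("ACGT".index(base) + 1) % 4]


# Toy data (assumed): a random 300 bp 'core' sequence and two variants of it.
rng = random.Random(1)
k = 15
core = "".join(rng.choice("ACGT") for _ in range(300))

# Variant 1: one substitution at position 100.
snp_genome = core[:100] + other_base(core[100]) + core[101:]

# Variant 2: a 40 bp insertion before position 200. Its first and last bases
# are forced to differ from the flanking core bases, because a chance match
# at either boundary would shorten the run of unique k-mers.
middle = "".join(rng.choice("ACGT") for _ in range(38))
insertion = other_base(core[200]) + middle + other_base(core[199])
ins_genome = core[:200] + insertion + core[200:]

for name, genome in [("SNP", snp_genome), ("insertion", ins_genome)]:
    for start, length in unique_runs(genome, core, k):
        print(name, start, length, interpret(length, k))
```

With k = 15, the substitution is expected to produce one run of 15 k-mers and the insertion one run of 54 k-mers (40 + 15 - 1). The comment on the insertion boundaries is worth noting: when an inserted base happens to match its homologous neighbour, the observed run is shorter, so S - k + 1 is best read as an estimate of the accessory length.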

MeSH terms

  • Algorithms
  • Escherichia coli / genetics
  • Genome, Bacterial*
  • Genomics / methods
  • Phylogeny
  • Polymorphism, Single Nucleotide* / genetics
  • Sequence Alignment / methods
  • Software
  • Synteny

Grants and funding

This work was supported by the National Institute for Health and Care Research Health Protection Research Unit in Gastrointestinal Infections (MM; XD) and Genomics and Enabling Data (PR; XD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.