Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity

Dominik Schrempf; Nicolas Lartillot; Gergely Szöllősi

doi:10.1093/molbev/msaa145

Mol Biol Evol. 2020 Dec 16;37(12):3616-3631. doi: 10.1093/molbev/msaa145.

Authors

Dominik Schrempf¹, Nicolas Lartillot², Gergely Szöllősi¹,³,⁴

Affiliations

¹ Department of Biological Physics, Eötvös University, Budapest, Hungary.
² Laboratoire de Biométrie et Biologie Evolutive UMR 5558, CNRS, Université de Lyon, Villeurbanne, France.
³ ELTE-MTA "Lendület" Evolutionary Genomics Research Group, Budapest, Hungary.
⁴ Evolutionary Systems Research Group, Centre for Ecological Research, Hungarian Academy of Sciences, Tihany, Hungary.

Abstract

Biochemical demands constrain the range of amino acids acceptable at specific sites, resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10-C60 models). Here, we present a new, scalable method, EDCluster, for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations that allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4,096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared with the C10-C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).
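The general idea of deriving mixture components by clustering site-wise amino acid distributions can be illustrated with a minimal sketch. This is not the EDCluster implementation. The abstract names neither the coordinate transformation nor the clustering algorithm, so the centered log-ratio transform, the plain k-means loop, the pseudocount, and all function names below are assumptions chosen for illustration. A three-letter toy alphabet stands in for the 20 amino acids.

```python
import math
import random


def clr(freqs, pseudo=1e-6):
    """Centered log-ratio transform of one frequency vector.

    Assumption: the abstract does not say which transformation is used.
    The pseudocount keeps log() defined for zero frequencies.
    """
    smoothed = [f + pseudo for f in freqs]
    total = sum(smoothed)
    logs = [math.log(f / total) for f in smoothed]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mixture_from_sites(site_freqs, k, seed=0):
    """Cluster site distributions; return (weights, component distributions).

    Each component is the mean of the untransformed member distributions,
    and its weight is the fraction of sites assigned to it.
    """
    labels = kmeans([clr(f) for f in site_freqs], k, seed=seed)
    weights, components = [], []
    for c in range(k):
        members = [f for f, lab in zip(site_freqs, labels) if lab == c]
        if not members:
            continue  # drop empty clusters
        weights.append(len(members) / len(site_freqs))
        components.append([sum(col) / len(members) for col in zip(*members)])
    return weights, components


# Toy input: four sites dominated by the first state, four by the last.
toy_sites = [[0.9, 0.05, 0.05]] * 4 + [[0.05, 0.05, 0.9]] * 4
toy_weights, toy_components = mixture_from_sites(toy_sites, k=2)
print(toy_weights, toy_components)
```

The returned weights and component distributions have the shape that an empirical distribution mixture model needs: a set of stationary distributions with associated mixture weights. How these are passed to IQ-TREE, Phylobayes, or RevBayes is outside the scope of this sketch.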

Keywords: empirical distribution mixture models; empirical profile mixture models; long-branch attraction; microsporidia; phylogenetics.

Publication types

Evaluation Study
Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Substitution*
Cluster Analysis
Genetic Techniques*
Models, Genetic*
Phylogeny*
Software*