Friday, October 10, 2014

A "parsimonious" Bayesian supertree model for estimating species trees

When we have sequence alignments for several genes from a group of taxa, we usually want to extract the phylogenetic information common to all of them. However, in many cases such phylogenomic analyses depend on selecting one sequence from each species per gene family (=alignment), excluding paralogs, partitioning the paralogous sequences into loci, or using only gene families without apparent paralogs. If we want to analyse all our data at once, without excluding sequences or whole alignments, we are left with few options.

We just published one such alternative, based on the idea that we can measure the disagreement between the phylogenetic trees representing each gene and a putative tree representing the species. By using disagreement measures that allow for arbitrary mappings between the trees, we can handle gene trees with paralogs, multiple individuals from the same population, or missing data. We call these measures "distances" [1], and we developed a probability distribution describing how these distances can work as penalties against very dissimilar gene and species tree pairs.

We can include any combination of the reconciliation distances, the recently developed mulRF distance, and (very experimentally) an approximate SPR distance in our multivariate penalty distribution. We are also experimenting with other distances as we implement them. This penalty distribution is then incorporated into a hierarchical Bayesian model, which I call "parsimonious" since it doesn't use a fully probabilistic model to describe the coalescent processes, or the birth and death of new loci. It assumes instead that only the most parsimonious reconciliations are relevant to the model. (I was advised, however, that calling it a "parsimonious Bayesian model" could be confusing...)

The distance supertree model is based on several distance measures d(G,S) between each gene family tree G and the species tree S. A species tree that is more similar to all gene trees is more likely than a more distant one. Notice that d(G,S) can in fact be a vector with several distances.
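
To make the idea concrete, here is a minimal Python sketch of how a vector of distances could act as a penalty. It is not guenomu's code: the exponential form of the penalty, the function names `log_penalty` and `species_tree_log_score`, and the fixed `rates` are all assumptions made for illustration (in a hierarchical model such parameters would themselves be estimated).

```python
from typing import Sequence


def log_penalty(distances: Sequence[float], rates: Sequence[float]) -> float:
    """Log of an (unnormalised) penalty for one gene tree / species tree pair.

    ASSUMPTION: each distance d_k is penalised independently as exp(-rate_k * d_k).
    The post only says that larger distances are penalised, not which form is used.
    """
    if len(distances) != len(rates):
        raise ValueError("need exactly one rate per distance")
    return -sum(rate * d for rate, d in zip(rates, distances))


def species_tree_log_score(per_gene_distances: Sequence[Sequence[float]],
                           rates: Sequence[float]) -> float:
    """Sum the log penalties over all gene families for one candidate species tree."""
    return sum(log_penalty(d, rates) for d in per_gene_distances)


# Hypothetical distance vectors d(G,S) for two gene families and two candidate
# species trees; each vector holds two different distances.
candidate_a = [[1, 2], [0, 1]]
candidate_b = [[3, 4], [2, 2]]
rates = [0.5, 1.0]

score_a = species_tree_log_score(candidate_a, rates)
score_b = species_tree_log_score(candidate_b, rates)
```

With these made-up numbers, candidate A is closer to both gene trees and therefore gets the higher (less negative) score, which is the behaviour described above.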

We implemented this model in the software guenomu, which is available under a GPL license. The input to the software is a set of files with the distribution of gene trees as estimated for each gene family, independently, and the output is the posterior distribution of these gene family trees together with the distribution of species trees. We tested our model on many data sets simulated with the SimPhy software -- which is able to simulate the evolution of gene families with duplications, losses, and the multispecies coalescent fully probabilistically -- followed by a quick-and-dirty emulation of a Bayesian phylogenetic inference [2].

The difference between the input and output (posterior) distribution of trees for each gene family is that the input trees were estimated independently -- let's say, by running MrBayes for each alignment representing a gene family -- while the posterior takes into account the other gene families through their common species tree. Therefore the posterior distribution is a re-sampled version of the input, and as we see in the figure below it improves the gene tree estimation.
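
One rough way to picture this re-sampling is as a re-weighting of the input trees. In the Python sketch below, each input gene tree keeps its original frequency but is down-weighted according to its distance to a fixed species tree. This is only an illustration under strong assumptions: a single distance per tree, an exponential penalty, a fixed species tree, and the hypothetical names `reweight_gene_trees` and `rate`. The actual sampler also updates the species tree and the penalty parameters.

```python
import math
from typing import Dict


def reweight_gene_trees(input_freqs: Dict[str, float],
                        distances: Dict[str, float],
                        rate: float = 1.0) -> Dict[str, float]:
    """Re-weight input gene tree frequencies by a distance penalty.

    ASSUMPTION: weight = input frequency * exp(-rate * distance to species tree).
    Returns the weights normalised to sum to one.
    """
    weights = {tree: freq * math.exp(-rate * distances[tree])
               for tree, freq in input_freqs.items()}
    total = sum(weights.values())
    return {tree: w / total for tree, w in weights.items()}


# Two hypothetical topologies that were equally frequent in the independent
# analysis; T1 agrees with the species tree (distance 0), T2 does not.
input_freqs = {"T1": 0.5, "T2": 0.5}
distances = {"T1": 0.0, "T2": 2.0}
reweighted = reweight_gene_trees(input_freqs, distances, rate=1.0)
```

Here the tree that agrees with the species tree gains weight at the expense of the other one, even though the alignment alone could not tell them apart.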

Input and posterior distributions of gene trees across many simulations (average values over gene families, per simulated data set). The simulations were pooled by species tree size, where we can see that guenomu can reduce the uncertainty of the gene trees. The accuracy is the fraction of splits (=branches) successfully reconstructed. Figure adapted from doi:10.1093/sysbio/syu082

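
For readers unfamiliar with this accuracy measure, the Python sketch below computes the fraction of true splits recovered by an estimated tree. The nested-tuple tree representation and the names `leaf_set`, `splits` and `split_accuracy` are assumptions of this example, not the code used for the figures, and it requires unique leaf labels (so several tips from the same species would need distinct names).

```python
def leaf_set(tree):
    """All leaf labels below a node; leaves are strings, internal nodes are tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial splits of a tree, each stored as the side lacking an anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf, so each split has one encoding
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if anchor in leaves else leaves
        # keep only splits with at least two leaves on each side
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return leaves

    visit(tree)
    return found


def split_accuracy(true_tree, estimated_tree):
    """Fraction of the true tree's non-trivial splits present in the estimate."""
    true_splits = splits(true_tree)
    if not true_splits:
        return 1.0  # nothing to recover
    return len(true_splits & splits(estimated_tree)) / len(true_splits)


true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
accuracy = split_accuracy(true_tree, estimated)  # one of two true splits recovered
```

In this toy case the estimate recovers the split separating D and E from the rest but misplaces B and C, so only half of the true splits are found.
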
Our model was successful in reconstructing the species tree even for high levels of incomplete lineage sorting (short species tree branches, in coalescent units) coupled with duplications and losses. It also fared a bit better than iGTP, and much better than our implementation of distance matrix-based species tree inference methods [3]. Notice that only software that accepts gene trees with several tips from the same species can be compared. We were pleasantly surprised to see that iGTP under the duplication-loss cost also performed well, provided we use the gene tree frequencies as weights.

Violin plots showing the distribution of accuracies in species tree estimation, over all simulations. The two red distributions are for the consensus and MAP tree estimates using guenomu, while the brown and blue plots are for other reconstruction methods. The dendrogram at the top classifies the methods by accuracy. Figure adapted from doi:10.1093/sysbio/syu082

[1] They are not proper metrics since they are not symmetric, for instance.

[2] Since our simulated gene families have hundreds of tips, simulating the alignments and then sampling the gene tree distributions with MrBayes or friends would take too long (we did this for smaller data sets only). We therefore created a program (available with guenomu) that copies each tree many times, randomly replacing short branches with one of their alternative bipartitions.

[3] We must take into account that these matrix-based methods (like GLASS, SD, etc.) assume that all disagreement is due to the coalescent, which is not true under our simulations. Furthermore our implementation may not be as good as some established software. Therefore our results are not evidence against these methods. (I particularly love their idea of being able to work with the distance matrices.)

de Oliveira Martins L., Mallo D. & Posada D. (2014). A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction, Systematic Biology, DOI:

(The supplementary material is not available yet at DataDryad, as of today Oct 10, 2014. I assume it will go online soon, but if you want it please drop me a line)
