We just published such an alternative, which is based on the idea that we can measure the disagreement between the phylogenetic trees representing each gene and a putative tree representing the species. By using disagreement measures that allow for arbitrary mappings between the trees, we can handle gene trees with paralogs, multiple individuals from the same population, or missing data. We call these measures "distances" [1], and we developed a probability distribution describing how they can work as penalties against very dissimilar gene and species tree pairs.
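To make the idea of a distance that tolerates paralogs more concrete, here is a minimal sketch of one classic reconciliation measure: the number of duplications implied by the LCA mapping of a gene tree onto a species tree. This is the textbook algorithm and not guenomu's code; the nested-tuple tree representation and the function names (`clades`, `lca`, `count_duplications`) are assumptions made just for this illustration.

```python
def clades(tree):
    """Return the leaf set (a frozenset) of every node in a nested-tuple tree."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = set()
    leaves = frozenset()
    for child in tree:
        sub = clades(child)
        result |= sub
        leaves |= max(sub, key=len)  # the largest set is the child's own clade
    result.add(leaves)
    return result


def lca(species_clades, leaf_set):
    """Smallest species-tree clade containing every species in leaf_set."""
    return min((c for c in species_clades if leaf_set <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Duplications implied by the LCA mapping of gene_tree onto species_tree.

    Gene-tree leaves are labelled by the species they were sampled from,
    so the same label may appear several times (paralogs).
    """
    sp = clades(species_tree)
    dups = 0

    def visit(node):
        nonlocal dups
        if isinstance(node, str):
            return lca(sp, frozenset([node]))
        child_maps = [visit(child) for child in node]
        here = lca(sp, frozenset().union(*child_maps))
        # A gene node mapped to the same species node as one of its
        # children can only be explained by a duplication.
        if here in child_maps:
            dups += 1
        return here

    visit(gene_tree)
    return dups


species = (("A", "B"), "C")
print(count_duplications((("A", "B"), "C"), species))         # congruent gene tree
print(count_duplications((("A", "B"), ("A", "C")), species))  # two copies from species A
```

Note that the measure goes from the gene tree to the species tree and not the other way around, which is one reason why such "distances" are not symmetric.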
Our multivariate penalty distribution can include any combination of the reconciliation distances, the recently developed mulRF distance, and (very experimentally) an approximate SPR distance. We are also experimenting with other distances as we implement them. This penalty distribution is then incorporated into a hierarchical Bayesian model, which I call "parsimonious" since it doesn't use a fully probabilistic model to describe the coalescent processes, or the birth and death of new loci. It assumes instead that only the most parsimonious reconciliations are relevant to the model. (I was advised, however, that calling it a "parsimonious Bayesian model" could be confusing...)
We implemented this model in the software guenomu, which is available under a GPL license at http://bitbucket.org/leomrtns/guenomu. The input to the software is a set of files, one per gene family, each with the distribution of gene trees estimated independently for that family. The output is the posterior distribution of these gene family trees together with the distribution of species trees. We tested our model on many data sets simulated with the SimPhy software -- which is able to simulate the evolution of gene families with duplications, losses, and the multispecies coalescent fully probabilistically -- followed by a quick-and-dirty emulation of a Bayesian phylogenetic inference [2].
The difference between the input and output (posterior) distribution of trees for each gene family is that the input trees were estimated independently -- let's say, by running MrBayes for each alignment representing a gene family -- while the posterior takes into account the other gene families through their common species tree. Therefore, the posterior distribution is a re-sampled version of the input and, as we see in the figure below, it improves the gene tree estimation.
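The re-sampling idea can be sketched in a few lines. The toy function below takes the input frequencies of the trees of one gene family and down-weights each tree by a penalty that grows with its distance to a single, fixed species tree. Everything here is an assumption for illustration: the exponential form of the penalty, the `rate` parameter, and the function name `reweight` are not taken from guenomu, and the real model also samples the species tree instead of fixing it.

```python
import math


def reweight(input_freqs, distances, rate=1.0):
    """Re-weight input tree frequencies by a distance-based penalty.

    input_freqs: dict tree_id -> frequency in the input sample
    distances:   dict tree_id -> distance to the (fixed) species tree
    rate:        hypothetical penalty strength; 0 returns the input unchanged
    """
    weights = {
        tree: freq * math.exp(-rate * distances[tree])
        for tree, freq in input_freqs.items()
    }
    total = sum(weights.values())
    return {tree: w / total for tree, w in weights.items()}


# Two gene trees equally supported by the alignment alone;
# t1 agrees with the species tree, t2 needs two extra events.
posterior = reweight({"t1": 0.5, "t2": 0.5}, {"t1": 0, "t2": 2})
print(posterior)
```

Trees that were rare in the input can still end up with a high posterior weight if they agree well with the species tree, which is how information is shared across gene families.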
Input and posterior distributions of gene trees across many simulations (average values over gene families, per simulated data set). The simulations were pooled by species tree size, where we can see that guenomu can reduce the uncertainty of the gene trees. The accuracy is the fraction of splits (=branches) successfully reconstructed. Figure adapted from doi:10.1093/sysbio/syu082
Violin plots showing the distribution of accuracies in species tree estimation, over all simulations. The two red distributions are for the consensus and MAP tree estimates using guenomu, while the brown and blue plots are for other reconstruction methods. The dendrogram at the top classifies the methods by accuracy. Figure adapted from doi:10.1093/sysbio/syu082
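The accuracy used in the captions above (fraction of splits successfully reconstructed) can be computed along these lines. This is only a sketch under some assumptions: trees are nested tuples with unique leaf labels (so it fits species trees, not gene trees with paralogs), the denominator is the number of non-trivial splits in the true tree, and the function names are made up for this example.

```python
def splits(tree):
    """Non-trivial bipartitions (splits) of a nested-tuple tree, seen as unrooted."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        return_set = frozenset().union(*(visit(child) for child in node))
        sides.append(return_set)
        return return_set

    sides = []
    all_leaves = visit(tree)
    for below in sides:
        other = all_leaves - below
        # Skip the root and branches leading to a single leaf.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
    return found


def split_accuracy(estimated, true):
    """Fraction of the true tree's splits that also appear in the estimate."""
    true_splits = splits(true)
    if not true_splits:
        return 1.0
    return len(true_splits & splits(estimated)) / len(true_splits)


true_tree = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(split_accuracy(estimate, true_tree))  # one of the two true splits is recovered
```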
Notes:
[1] They are not proper metrics since they are not symmetric, for instance.
[2] Since our simulated gene families have hundreds of tips, simulating the alignments and then sampling the gene tree distributions with MrBayes or friends would take too long (we did this for smaller data sets only). We therefore created a program (available with guenomu) that copies each tree many times, randomly replacing short branches with one of their alternative bipartitions.
[3] We must take into account that these matrix-based methods (like GLASS, SD, etc.) assume that all disagreement is due to the coalescent, which is not true under our simulations. Furthermore our implementation may not be as good as some established software. Therefore our results are not evidence against these methods. (I particularly love their idea of being able to work with the distance matrices.)
Reference:
de Oliveira Martins L., Mallo D. & Posada D. (2014). A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction, Systematic Biology, DOI: http://dx.doi.org/10.1093/sysbio/syu082
(The supplementary material is not available yet at DataDryad, as of today Oct 10, 2014. I assume it will go online soon, but if you want it please drop me a line)