Sunday, September 11, 2016

When our intuition fails us with collections of trees

Recently I realised that some ideas regarding the distribution of phylogenetic trees are not as straightforward as they seem. The first is that the frequency of the most common tree does not tell us anything about the dispersion of the distribution. I tried to represent this in the figure below, where we have four very different distributions with the same modal frequency. What makes this harder to process is that the dispersion depends on a measure of distance between trees. Therefore, even if we have only two alternative trees in total, both with the same frequency, we may still need to know how similar they are. On the other hand, the total number of distinct trees in the distribution may also tell us very different stories about the data generating them. (Of course one can always dismiss this discussion altogether by claiming that the "support" of the modal tree is enough information.)

Histogram representing a collection of trees. The green bar is the frequency of the modal tree (the best supported, most frequent), while blue bars are frequencies of other trees. The distance is an arbitrary representation of how far apart they are. We can assume that the modal frequency is the same in all four distributions.
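
To make the point about dispersion concrete, here is a minimal sketch in Python. It assumes one particular dispersion measure that I am choosing only for illustration: the expected distance between two trees drawn independently from the collection. The frequencies and distances are invented, and the distance could stand for any tree metric.

```python
def mean_pairwise_distance(freqs, dist):
    """Expected distance between two trees sampled independently
    from a collection with the given frequencies (illustrative measure)."""
    n = len(freqs)
    return sum(freqs[i] * freqs[j] * dist[i][j]
               for i in range(n) for j in range(n))

def constant_distance_matrix(n, d):
    """Hypothetical helper: n trees, all at distance d from each other."""
    return [[0 if i == j else d for j in range(n)] for i in range(n)]

# Three made-up collections, all with modal frequency 0.5.
two_close = mean_pairwise_distance([0.5, 0.5], constant_distance_matrix(2, 2))
two_far = mean_pairwise_distance([0.5, 0.5], constant_distance_matrix(2, 10))
many_close = mean_pairwise_distance([0.5] + [0.1] * 5,
                                    constant_distance_matrix(6, 2))

print(two_close, two_far, many_close)
```

Under these assumptions the three collections have dispersions of about 1.0, 5.0 and 1.4, even though the modal tree has frequency 0.5 in all of them.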

Another tempting but inaccurate idea is that the tree frequencies (e.g. from a bootstrap analysis) represent the fraction of sites favouring one tree over the others. This misinterpretation may become more common in the future, with the widespread use of concatenated genes, where we might think that the (bootstrapped) trees estimated from the resulting supermatrix will recapitulate the individual genes. For example, under the multispecies coalescent, concatenating the genes can be misleading in terms of the optimal tree. Even similar gene phylogenies, which may differ only in branch lengths, can lead to a distinct tree when concatenated.

To see why this correspondence is not true, it may be enough to realize that each site can favour several trees at the same time (e.g. a non-segregating site or a singleton). For the concatenation case, it is enough to remember that each individual gene tree was estimated from N sites (let's assume all m genes have the same size), while every bootstrap replicate used mN sites. Furthermore, assume that every gene favours a different tree, but only slightly over a "second best" tree that is the same for all genes. In this case, our supermatrix might very strongly favour this next-to-best tree, and would rarely choose any of the individually best trees...
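
The "second best" scenario can be sketched with a toy calculation. The numbers are invented, and I am assuming that the score of a tree on the supermatrix is simply the sum of its per-gene log-likelihoods, which ignores shared branch lengths and model parameters.

```python
m = 5  # number of genes (arbitrary choice)

# Candidate trees: T0..T4 are each the best tree for one gene,
# S is the shared "second best" tree.
trees = [f"T{g}" for g in range(m)] + ["S"]

def gene_loglik(gene, tree):
    """Made-up log-likelihood of a tree for a given gene."""
    if tree == f"T{gene}":
        return -100.0   # the gene's own best tree
    if tree == "S":
        return -101.0   # only slightly worse, for every gene
    return -110.0       # best trees of the other genes fit poorly

# Best tree for each gene analysed on its own.
best_per_gene = [max(trees, key=lambda t: gene_loglik(g, t))
                 for g in range(m)]

# Concatenation, under the additivity assumption stated above.
concat_loglik = {t: sum(gene_loglik(g, t) for g in range(m)) for t in trees}
best_concat = max(concat_loglik, key=concat_loglik.get)

print(best_per_gene)   # each gene picks its own tree
print(best_concat)     # the supermatrix picks S
```

With these values each Ti scores -540 on the concatenated data while S scores -505, so S wins clearly although no single gene prefers it.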
