Wednesday, November 30, 2016

Comparative methods in the genomic era

Every modern biologist should know about phylogenetic comparative methods, even if they are not familiar with the term. The idea is that we cannot compare biological traits without taking into account the evolutionary relations between the species/populations used as "sample points". This is because individuals are not independent samples of whatever variables we are trying to correlate (they are "pseudoreplicates").

However, not everybody is an up-to-date biologist, as we see in the figure below: birds with similar body masses would also have similar flight speeds not because these two traits are related, but because these birds are closely related. That is, not long ago they were a single species (with a single value for each trait). The solution to avoid these spurious correlations is, in abstract terms, to think about what trait values do we expect for their ancestors. Hopefully we will have this in mind in upcoming large-scale genomic analyses, where sophisticated statistical algorithms for ever-increasing amounts of data might obfuscate their biological appropriateness.

from doi:10.1063/1.4886855
Here I include a few links to good comments and papers discussing comparative methods, starting with Felsenstein's reaction to the paper above (which is about airplanes, by the way). I also include at the bottom the slides of an impromptu journal club I presented a few years ago about things we can model over a tree, with strong emphasis on phylogenetic contrasts.

Physicists and engineers decide how to analyze evolution (by Joe Felsenstein):
They make allometric plots of features of new airplane models, log-log plots over many orders of magnitude. The airplanes show allometry: did you know that a 20-foot-long airplane won’t have 100-foot-long wings? That you need more fuel to carry a bigger load?
But permit me a curmudgeonly point: This paper would have been rejected in any evolutionary biology journal. Most of its central citations to biological allometry are to 1980s papers on allometry that failed to take the the phylogeny of the organisms into account. The points plotted in those old papers are thus not independently sampled, a requirement of the statistics used. (More precisely, their error residuals are correlated).
Steps towards understanding comparative methods – Wainwright Lab:
Felsenstein elaborates on his method of calculating standardized contrasts (phylogenetically independent contrasts) to help overcome the non-independence of character traits.  These contrasts are basically the differences between trait values of species pairs weighted by the evolutionary change separating them; they are estimates of the rate of change over time. A common use of standardized contrasts is to look for correlation in this rate between two traits; if standardized contrasts of traits X and Y are compared in a regression analysis, a linear trend suggests correlated rates of evolution between the traits.
Strong phylogenetic inertia on genome size and transposable element content among 26 species of flies | Biology Letters doi:10.1098/rsbl.2016.0407:
To date, quantifying the importance of phylogenetic inertia in TE content distribution remains a key question as the dynamic of TE accumulation is still poorly understood. Here, we analysed the evolution of genome size and genomic TE content in 26 Drosophila, using a phylogenetic framework. We estimated genomic TE content using a de novo TE assembly approach, tested the correlation between TE content and genome size among closely related species and finally estimated the phylogenetic inertia.
(...)
Comparative analyses were performed using ape [21], nlme [22] and phytools [23] packages in R. Ancestral trait reconstruction of genome size was calculated using phylogenetic independent contrasts. We tested the phylogenetic signal using Pagel's λ [24]. Best-fitting model to the trait evolution and its covariance structure was tested among (i) absence of phylogenetic signal, (ii) neutral Brownian motion and (iii) constrained evolution Ornstein–Uhlenbeck (OU) models using generalized least squares (GLS) and selected according to minimum Akaike information criterion (AIC).
Controlling for non-independence in comparative analysis of patterns across populations within species | Philosophical Transactions of the Royal Society B: Biological Sciences doi:10.1098/rstb.2010.0311:
How do we quantify patterns (such as responses to local selection) sampled across multiple populations within a single species? Key to this question is the extent to which populations within species represent statistically independent data points in our analysis. Comparative analyses across species and higher taxa have long recognized the need to control for the non-independence of species data that arises through patterns of shared common ancestry among them (phylogenetic non-independence), as have quantitative genetic studies of individuals linked by a pedigree.
The Unsolved Challenge to Phylogenetic Correlation Tests for Categorical Characters | Systematic Biology doi:10.1093/sysbio/syu070:
When comparative biologists observe that animal species living in caves also tend to have reduced eyes, they may see such correlation as evidence that the traits are adaptively or functionally linked: for instance, selection to maintain eye function is relaxed when light is unavailable.
(...)
However, the last few decades have taught us that among-species correlative tests should take into account evolutionary relationships (Felsenstein 1985; Ridley 1989; Harvey and Pagel 1991). If phylogeny is not taken into account, an interpreted correlation may have a trivial explanation different from the biological relationship we claim. There is a correlation among species in the distribution of fur and bones in the middle ear—species with fur also have three bones in the middle ear, and vice versa. These two traits are characteristics of mammals, and absent outside the mammals. Using their shared distribution as evidence of an interesting biological relationship between fur and middle ear bones would be considered a mistake, however, for reasons understood long ago by Darwin (1872)

Tuesday, November 29, 2016

Bar charts must start at zero (or something)

The other day I was mentioning to a colleague that a bar chart should start at zero, and I may have given the impression that it was just my personal taste. It is not. It is a universal standard in statistical visualisation. However, since it is very easy to get distracted or to misinterpret it, I'll summarise it here very quickly, and include some links below if you want to read more.
  1. It is OK to set the Y axis to whatever range you want (min an max are good choices). However, in this case, do not use bar charts. Use mean-and-error plots, boxplots, lines, scatterplots, violin plots, etc.
  2. Bar charts start at the bottom of the plot for a reason: you want to easily compare the heights between bars. Our eyes try to find the proportion of increase/decrease between them. Then, for example, if your Y axis starts at  a higher value, a 0.1% increase may look twice the height.
  3. The Y axis doesn't always need to start at zero, but it must be a value that makes sense as a baseline. Therefore, if you are comparing ratios maybe one is a better baseline, or for a time series the baseline can be the value at day one (as in climate change graphs). The idea is that there are "natural" baseline values, such that the proportions between bars are meaningful.
  4. In all cases, ask yourself if a bar chart is really necessary, and if you can justify it against other options. This may help you pay attention to the interpretation of the baseline. 
I believe that the same rules apply to an impulse plot or a stem plot -- like the ones used in ACF plots -- but I am not so sure. Anyway, here are some links:

It’s OK not to start your y-axis at zero — Quartz
Of course column and bar charts should always have zeroed axes, since that is the only way for the visualization to accurately represent the data. Bar and column charts rely on bars that stretch to zero to accurately mirror the ratios between data points. Truncating the axis breaks the relationship between the size of the rectangle and the value of the data. There is no debating this one (except for a few exceptions).
(The exception linked above is about a ratio, where the natural baseline is one instead of zero.)

Bar Chart Baselines Start at Zero | FlowingData
The main argument for bar charts without a zero baseline is this: There’s no point in extending the range of the value axis if the range of the data never includes zero. Ok.
Now instead of weight, let’s look at height. I think we can agree that it’s difficult to find people who are zero inches tall. I’m 70 inches tall, and my son is half my height at 35 inches. The bar chart on the left shows the comparison with a zero baseline, and as expected, the bar for me is twice the length of the bar for my son. On the right, I take it to the extreme and set the baseline to 35 inches, and the bar for my son disappears.

height-comparison

Maybe the latter communicates that I’m much taller, but the magnitude is infinitely exaggerated.
(The link above also provides an elegant solution if you insist on using barplots: explicitly model your baseline such that you can compare all values to it.)

Kick the bar chart habit : Nature Methods : Nature Research
Instead of comparing to an abstract zero level, scientists often compare multiple experimental samples to one another. Because the samples are usually generated from populations with a potentially large and irregular underlying variation, graphing their means using bar charts misleadingly assigns importance to the distance of the means from zero and poorly represents the distribution of the data used to calculate the means. Instead of bar charts, mean-and-error plots and box plots should be used for statistical sample data.
Zero is zero - Statistical Modeling, Causal Inference, and Social Science
The idea is that the area of the bar represents “how many” or “how much.” The bar has to go down to 0 for that to work. You don’t have to have your y-axis go to zero, but if you want the axis to go anywhere else, don’t use a bar graph, use a line graph. Usually line graphs are better anyway.  I’m sure this is all in a book somewhere.
10 Ways Charts Can Lie, Cheat & Lead Astray | QVDesign
Here we’re looking at exactly the same basic dataset covering Visitor numbers per year, the chart on the left seems very placid and visitor numbers appear to be relatively steady year on year; not bad but not great, whilst the chart on the right looks much more positive; visitor numbers took a huge leap in 2010; cue extra investment and pay rises all round! The charts show identical data so what’s going on?


The ONLY difference between these 2 charts is that the one on the left has the Properties > Axis > ‘Forced 0’ checked for the Y-Axis whilst the other doesn’t; one check box and we’re completely changing the message of this chart. Now of course if you spend a little time looking at this chart and read off the numbers it becomes clear that the increase isn’t all that marked, however; (...)  To accentuate the effect even further you can use the ‘Static Min/Max’ settings to zero in even more on the difference making the smaller value seem even smaller and the increase all the larger.

Sunday, September 11, 2016

When our intuition fails us with collections of trees

Recently I realised that some ideas regarding the distribution of phylogenetic trees are not as straightforward as they seem. The first case is that the frequency of the most common tree does not give us information about the dispersion of the distribution. I tried to represent this in the figure below, where we have four very different distributions with the same  modal frequency. What makes it harder for us to process is that the dispersion depends on a measure of distance between trees. Therefore even if we have only two alternative trees in total and with same frequencies, we may still need to know how similar they are. On the other hand, the total number of distinct trees in this distribution may also tell us very different stories about the data generating them. (Of course one can always dismiss this discussion altogether by claiming that the "support" of the modal tree is enough information.)


Histogram representing a collection of trees. The green bar is the frequency of the modal tree (the best supported, most frequent), while blue bars are frequencies of other trees. The distance is an arbitrary representation of how far apart they are. We can assume that the modal frequency is the same in all four distributions.

Another tempting but inaccurate idea is that the tree frequencies (e.g. from a bootstrap analysis) represent the fraction of sites favouring one tree over the others. This misinterpretation may become more common in the future, with the widespread use of concatenated genes, where we might think that the (bootstrapped) trees estimated from the resulting supermatrix will recapitulate the individual genes. For example, under the multispecies coalescent, concatenating the genes can be misleading in terms of the optimal tree. Even similar phylogenies, which may differ only on branch lengths, can lead to distinct trees.

To see why this correspondence is not true, it may be enough to realize that each site can favour several trees at the same time (e.g. an non-segregating site or a singleton). For the concatenation case, it is enough to remember that each individual gene tree was estimated from N sites (let's assume all m genes have same size), while every bootstrap replicate used mN sites. Furthermore, assume that every gene favours a different tree but only slightly over a "second best" tree, that however is the same for all genes. In this case, our supermatrix might favour very strongly this next-to-best tree, and would rarely chose any of the individually best trees...

Friday, October 10, 2014

A "parsimonious" Bayesian supertree model for estimating species trees

When we have sequence alignments regarding several genes from a group of taxa, we usually want to extract the phylogenetic information common to all of them. However, in many cases such phylogenomic analyses depend on selecting one sequence from each species per gene family (=alignment), or excluding paralogs, or partitioning these paralogous sequences into loci, or utilizing only gene families without apparent paralogs. If we want to analyse all our data at once, without excluding sequences or whole alignments, we are left with few options.

We just published such an alternative, which is based on the idea that we can measure the disagreement between the phylogenetic trees representing each gene and a putative tree representing the species. Therefore, by using disagreement measures that allow for arbitrary mappings between the trees, we can handle gene trees with paralogs, multiple individuals from the same population or missing data. These measures we call "distances" [1], and we developed a probability distribution describing how these distances can work as penalties against very dissimilar gene and species tree pairs.

We can use any combination of the reconciliation distances, the recently developed mulRF distance, and (very experimentally) an approximate SPR distance to include into our multivariate penalty distribution. We are also experimenting with other distances, as we implement them. This penalty distribution is then incorporated into a hierarchical Bayesian model, which I call "parsimonious" since it doesn't use a fully probabilistic model to describe the coalescent processes, or the birth and death of new loci. It assumes instead that only the most parsimonious reconciliations are relevant to the model. (I was advised, however, that calling it a "parsimonious Bayesian model" could be confusing...)

The distance supertree model is based on several distance measures d(G,S) between each gene family tree G and the species tree S. A species trees that is more similar to all gene trees is more likely than a more distant one. Notice that d(G,S) can in fact be a vector with several distances.

We implemented this model into the software guenomu, which is available under a GPL license at http://bitbucket.org/leomrtns/guenomu. The input to the software is a set of files with the distribution of gene trees as estimated for each gene family, independently, and the output will be the posterior distribution of these gene family trees together with the distribution of species trees.  We tested our model on many data sets simulated with the SimPhy software -- which is able to simulate the evolution of gene families with duplications, losses, and the multispecies coalescent fully probabilistically -- followed by a quick-and-dirty emulation of a Bayesian phylogenetic inference [2].

The difference between the input and output (posterior) distribution of trees for each gene family is that the input trees were estimated independently -- let's say, by running MrBayes for each alignment representing a gene family -- while the posterior takes into account the other gene families through their common species tree. Therefore the posterior distribution is a re-sampled version of the input, and as we see in the figure below it improves the gene tree estimation.

Input and posterior distributions if gene trees across many simulations (average values over gene families, per simulated data set). The simulations were pooled by species tree size, where we can see that guenomu can reduce the uncertainty of the gene trees. The accuracy is the fraction of splits (=branches) successfully reconstructed. Figure adapted from doi:10.1093/sysbio/syu082
Our model was successful in reconstructing the species tree even for high levels of incomplete lineage sorting (short species tree branches, in coalescent units) coupled with duplications and losses. It also fared a bit better than iGTP, and much better than our implementation of distance matrix-based species tree inference methods [3]. Notice that only software that accepts gene trees with several tips from the same species can be compared. We were gladly surprised to see that iGTP under the duplication-loss cost also performed well, provided we use the gene tree frequencies as weights.


Violin plots showing the distribution of accuracies in species tree estimation, over all simulations. The two red distributions are for the consensus and MAP tree estimates using guenomu, while the brown and blue plots are for other reconstruction methods. The dendrogram at the top classifies the methods by accuracy. Figure adapted from doi:10.1093/sysbio/syu082


Notes:
[1] They are not proper metrics since they are not symmetric, for instance.

[2] Since our simulated gene families have hundreds of tips, simulating the alignments and then sampling the gene tree distributions with MrBayes or friends would take too long (we did this for smaller data sets only). We therefore created a program (available with guenomu) that would copy many times each tree, replacing randomly short branches by one of its alternative bipartitions.

[3] We must take into account that these matrix-based methods (like GLASS, SD, etc.) assume that all disagreement is due to the coalescent, which is not true under our simulations. Furthermore our implementation may not be as good as some established software. Therefore our results are not evidence against these methods. (I particularly love their idea of being able to work with the distance matrices.)

Reference:
de Oliveira Martins L., Mallo D. & Posada D. (2014). A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction, Systematic Biology, DOI: http://dx.doi.org/10.1093/sysbio/syu082

(The supplementary material is not available yet at DataDryad, as of today Oct 10, 2014. I assume it will go online soon, but if you want it please drop me a line)

Tuesday, December 31, 2013

If your code is not open, you're not doing science

This is a list of articles explicitly suggesting or tangentially indorsing Open Source Software in the context of academic publications. The list is far from complete and in no particular order, but it creates a strong defense of source code as an integral part of the scientific process (code is data after all!). 




Friday, February 8, 2013

The difference between the RF and the NNI distance

Just to complement my answer to a blog post, where I maintain that the Nearest-Neighbor Interchange (NNI) distance is not equivalent to the Robinson-Foulds (RF) distance, a simple example:



Where we can see that trees T1 and T2 differ only in the location of nodes A and B -- on these trees, we can naturally think of the nodes A, B, 1,..., 6 as representing leaves, but they might also be large subtrees.

The RF distance is the number of edges (=branches) that are unique to each tree (that's why it's also called the symmetric difference), and it may be normalized to one. If we highlight the unique edges on trees T1 and T2


We see that the (unnormalized) RF distance is 10. For dichotomic trees, the number of unique edges is the same on both trees.

The NNI distance is the minimum number of NNIs that must be applied to one tree such that it becomes equal to the other. One NNI branch swap will change exactly one edge, thus is very tempting to assume that the NNI distance can be found by looking at the distinct edges.

But the problem is when the same branch is involved in more than one path of the "NNI walk". The RF distance (divided by two, for fully resolved trees) is then a lower bound on the minimum number of NNIs. In our example:


The NNI distance between T1 and T2 is 6, one more than the RF distance since the edge splitting (1,2,3) and (4,5,6) is used twice in the NNI computation. The problem, as explained by Liam, is that simulating trees with a specified distance is hard, and the solution of using very large trees masks the cases where the distances disagree...

Reference:
Bryant D. (2004). The Splits in the Neighborhood of a Tree, Annals of Combinatorics, 8 (1) 1-11. DOI:

(Crossposted from Bioinformatics News and Reviews, my personal blog)

Friday, June 15, 2012

Postdoctoral positions to work with Phylogenomics in Vigo

[crossposted from BioinfoNews]

There are two postdoc positions available in our lab -- the Phylogenomics group of the University of Vigo, in Spain. You can read the announcement in evoldir or on the link below, from the European Commission. Basically we are looking for someone with a good theoretical/empirical background in phylogenomics (NGS, phylogenetics, HPC, gene-species trees discordance, etc.) The applicant should have a good combination of computational, statistical and practical analysis skills. In any case, please get in contact with David and keep in mind that a honest evaluation of your skills will be greatly appreciated.

EURAXESS - Postdoctoral positions in Phylogenomics -- University of Vigo, Spain
Two postdoctoral positions are available to work on phylogenomics methodology and applications under a project funded by the European Research Council (ERC) in David Posada’s lab at the University of Vigo, Spain (http://darwin.uvigo.es).

DESCRIPTION: Two postdoctoral positions are available to work on phylogenomics methodology and applications under a project funded by the European Research Council (ERC) in David Posada’s lab at the University of Vigo, Spain (http://darwin.uvigo.es).


Initial appointments will be made for one year, with possible extension. Gross annual salary including benefits will be around 24-30 kEuros, commensurate with experience. Starting date should be as soon as possible .

REQUIREMENTS: Candidates should have a doctoral degree in Biology, Computer Science, Statistics or Mathematics. Excellent bioinformatic skills and prior experience with phylogenomics are essential. Strong communication and teamwork abilities are fundamental. Familiarity with parallel and distributed computing environments is convenient. Advanced statistical skills would be a plus.

APPLICATION: Please send a letter of interest, C.V., and the names and contact details of two referees to David Posada at dposada@uvigo.es, indicating “postdoc phylogenomics” in the subject of the email. Questions and requests for more information should be directed at the same address. Review of applications will begin immediately, and continue until the positions are filled.
Random picture of the Vigo bay

PS: In September we will also have a Computational and Statistical Phylogenomics Meeting in town. Besides being a great opportunity for knowing Vigo, the invited talks represent some of the most interesting topics in theoretical phylogenomics.

Tuesday, May 15, 2012

Testing for common ancestry

ResearchBlogging.orgOur commentary on Douglas Theobald's test from Universal Common Ancestry (UCA) just went online. The original idea was to make a user-friendly review of his analysis described in "A formal test of the theory of universal common ancestry", but after a long e-mail exchange between Douglas and us -- actually between him and David, I didn't say much -- we decided to expand the article to include some remaining points of skepticism and spell out the basic problem with his approach.

His work

His test for UCA compares the hypothesis of one single ancestral lineage diverging into all living forms, against scenarios where more than one ancestral populations are needed to explain the current diversity of life. That is, he tries to quantify the possibility that there were two or more ancient life forms still represented today in the major domains, in comparison with the most natural possibility that only one ancestral life form prevailed (with other life forms eventually going extinct, for instance).

These scenarios can be nicely represented by phylogenies: under the UCA hypothesis all existing species can be connected by a single phylogeny (connected network or tree), while under the hypothesis of Independent Origins (IO) the species can be partitioned into disconnected groups (without a branch joining them). And then, once we have the best phylogenies under each hypothesis, we can use the arsenal of model selection methods available to chose between the hypotheses.

Using a curated set of genes highly conserved among all three domains, his test indicated that UCA is a much better explanation than IO, and that each gene is better represented by its own phylogeny than by forcing all genes to follow the same tree -- that is, horizontal gene transfer (HGT) cannot be neglected.

Our comments

Imagine that I developed a method capable of predicting a candidate's academic success based on a questionnaire. But it works only for PhDs on their mid-twenties who have at least a few high-impact single-author publications. And notice: I'm not assuming at all that such a young genius has a secured place in a university. I guess this half-baked analogy summarizes our contention with Douglas' paper.

What called our attention is the fact that the phylogenetic inference methods assume homology at each site, so it is not surprising that it favors UCA for sufficiently good alignments -- your mileage may vary on the definition of "good". These methods are delegating to the alignment the responsibility of handling homology. And he used a data set with a particularly convincing evidence of common ancestry. To transform our argument into a picture we simulated sequences under both UCA and IO scenarios (that is, using one or two phylogenies) and looked at how the resulting alignments would look like. As expected they were very different.

average sequence identity for alignments simulated under UCA and IO
(from doi:10.4081/eb.2012.e5)

The motivation for our skepticism in the UCA test is how it would perform on a blind experiment like Assemblaton or CASP: given a group of sequences of unknown homology status, can the model selection devised by Douglas Theobald tell us if they share a single ancestry? Our impression is that there would be several decisions before doing the actual test -- like optimizing the alignment, possible removal of poorly aligned regions, refusing to do the test if alignment is bad --that might undermine its applicability. So we cannot yet recommend the test for arbitrary data sets.

Frequencies of average identity per column, for alignments simulated under UCA and IO, with real data set values in gray
(from doi:10.4081/eb.2012.e5)

In our article we also wonder about the effect of HGT under the hypothesis of multiple ancestry: what if we find one, and only one gene that strongly supports independent origins? Even if all others fit nicely into a single phylogeny, wouldn't it be evidence of this otherwise extinct lineage?

The publication

We all have dreadful stories about the feedback from peer reviewers, but this manuscript was not such a case. All reviewers seemed to know very well the original work, and could make precise comments on what we were missing or mistaken. The editor, David Liberles, also joined the discussion and gave us some good advice. So we thank them for being fortiter in re, suaviter in modo.

We are preparing another manuscript that is more centered on the examples given in D. Theobald's paper. Actually we started to write this "counter-examples" manuscript prior to our present work, but we had to reorganize it since: 1) we realized that it would be harder to understand it without the current work; 2) D. Theobald recently published a reply that is relevant to our discussion, since it contains a response to a former version of our "counter-examples" manuscript. Our present paper became then necessary to minimize future misunderstandings.

The scripts necessary to reconstruct the simulations and graphics used in our study are available at our home page: http://darwin.uvigo.es/common_origin/. Please let me know if you have any trouble running the scripts, or if you want some more information.

References

Leonardo de Oliveira Martins, David Posada (2012). Proving universal common ancestry with similar sequences Trends in Evolutionary Biology, 14 (1) : 10.4081/eb.2012.e5
Theobald, D. (2010). A formal test of the theory of universal common ancestry Nature, 465 (7295), 219-222 DOI: 10.1038/nature09014

(with thanks to Jonathan Eisen for noticing the article).

Friday, April 13, 2012

The first to describe phylogenetic maximum likelihood methods, according to Joe Felsenstein

(correction: my title is incomplete, it should include for molecular sequence data :)
    Theoretical work in Berkeley has been less widely noted. It started with the Jukes­Cantor distance and the stochastic model derived from it. Shortly afterward, Berkeley's famous statistician Jerzy Neyman became involved, owing to contact with Allan Wilson. He was the first to describe maximum likelihood phylogeny methods for molecular sequence data, a development for which I often mistakenly get the credit.
Taking Variation of Evolutionary Rates Between Sites into Account in Inferring Phylogenies -- Joseph Felsenstein at the Journal of Molecular Evolution, Volume 53, Numbers 4-5

Sunday, July 24, 2011

How to summarise a collection of trees that came from a Bayesian analysis

ResearchBlogging.orgAfter running a Bayesian phylogenetic analysis we are usually left with a large collection of trees, that came from the posterior distribution of the model given our data. Then if we want to work with a single tree - that is, to have a point estimate of this posterior distribution of trees - the most usual ways are to calculate the consensus tree or to select the most frequent tree. There are other ways, but let's fix on those by now.

We might not be aware of it, but when we choose for one or another summary we are in fact deciding for the tree estimate that minimizes its distance to all other trees in the set, and in expectation this will be the closest to the true tree under this distance metric (the so called Bayes estimator). This depends on what exactly do we mean by "distance" between trees, and that's what the article "Bayes Estimators for Phylogenetic Reconstruction" (doi 10.1093/sysbio/syr021) is about.

For example, the majority-rule consensus tree is the best we can get if we assume that the Robinson-Foulds distance (RF distance) is a good way of penalizing trees far away from the true one (I won't dwell into the meaning of "truth"; for us, the True tree® is the one that originated the data). To be more explicit, the consensus tree is the one whose RF distance to all trees in the sample is the shortest possible. This will be the closest we can get to the true tree for this sample, if by "close" we mean "with a small RF distance".

Now suppose I don't like the RF metric because I can only count to two: if the trees are the same the distance is zero, but if they are different then the distance is not zero, and I don't care how different they are (think of apples and oranges). In this case the best representative of my sample is the one that appears more often, known as modal value or Maximum A Posteriori (MAP) value, since our sample comes from a posterior distribution. Is it the closest I can get to the true tree for this distribution? Yes, for this particular definition of distance: the MAP tree is the tree that maximizes the expected coincidence with the true tree.

In the article they also mention that if you want to find the tree that minimizes the expected quartet distance to the true value, then the quartet puzzling method will find this tree for you. But the quartet puzzling tree is not as easy to calculate as the consensus or MAP tree, and there is no straightforward way to find the tree that minimizes other distances in general (e.g. the dSPR, the geodesic distance or the Gene Tree Parsimony). Therefore the authors offer the well-known hill-climbing heuristics for finding the best tree, and use the squared path difference as an example of distance.

Below you can find the presentation I gave to my group last week about this paper, it contains basically some background information and a summary of their method. One thing that is absent from the slides are the results, which I briefly summarize below:
  • their method (called "Bayes" in the figures or "BE") always used the path difference as distance measure; this is the overall distance they were trying to minimize.
  • they simulated many data sets with several levels of sequence divergence, and reconstructed the phylogeny using Maximum Likelihood, Neighbor-Joining, and Bayesian analysis. From the Bayesian posterior distribution they elected as point estimates the consensus tree, MAP tree, and used their method to find the BE under the path difference.
  • Figures 3 and 5 show the distance between the inferred and the true trees, where on figure 3 this distance is the path difference and in figure 5 it is the RF distance. As expected,  the Bayes estimator is better than any other measure at minimizing the path difference distance to the true tree, while the consensus tree wins if we want the closest in terms of RF distance.
  • this result is rephrased in figures 8 and 9, which now look specifically at the distances between BE or MAP trees and the true tree. What they plot is distance(BE, true) - distance(MAP, true) for a different definition of distance(,) in each case. The MAP tree is correlated to the consensus tree (if the MAP frequency is larger than 50% they are equal, for instance). Therefore it should come as no surprise that if we define closeness to the true tree in terms of RF distance, the MAP tree will be closer than the BE as shown in figure 9. Because BE assumes that closeness to true is calculated in terms of the path difference, which is reinforced in figure 8.
  • The authors wisely avoid offering the "best" Bayes estimator, since it depends on your judgment of how to penalize trees different from the true one.




OBS: This was my first time using beamer for Latex (after all these years, I know), so the slides are not prime time material. This is also my first submission to slideshare, and I like the idea of an embedded presentation within the blog post. I use latex a lot, and I think it would be easier for me to prepare a post with figures, equations and text within a presentation, and then simply embed it here with a minimum of extra text.

Maybe I'll try this next time, a presentation but with much more text than the recommended - in real life presentations the slides should support and complement but not replace the lecturer. Then you tell me if you would like to read on such a format or if you prefer a more traditional article-ish post.

Reference:
Huggins, P., Li, W., Haws, D., Friedrich, T., Liu, J., & Yoshida, R. (2011). Bayes Estimators for Phylogenetic Reconstruction Systematic Biology, 60 (4), 528-540 DOI: 10.1093/sysbio/syr021