Saturday, December 26, 2009

New article describing priors on biomc2



Our new paper is out!
Leonardo de Oliveira Martins and Hirohisa Kishino (2010) Distribution of distances between topologies and its effect on detection of phylogenetic recombination. Annals of the Institute of Statistical Mathematics 62(1): 145--159. doi:10.1007/s10463-009-0259-8
Unfortunately I had to transfer the copyright to the ISM, but I am still allowed to keep the accepted manuscript on my personal homepage [1]. In this article we describe in more detail the theoretical aspects that we couldn't explore before (because they would have diverted the audience's attention). The novelties are:
  • describing the topology distance parameter as an augmented variable. It may seem contradictory that our method assumes that the number of recombination break-points is variable while we claim that the number of parameters is constant. The explanation is that we have a variable (we could think of it as an indicator variable) that tells whether there is recombination or not. Actually it is more than an indicator variable, since it doesn't simply say whether there is recombination but also approximates the amount of recombination. I like the analogy with linear models (the infamous Y = B0 + B1 X1 + B2 X2 + ... + Bn Xn), where although the number of parameters is constant (n+1), those too close to zero mean that the "effective" number of variables is smaller.
  • describing the mini-sampler procedure. There is a strategy in reversible-jump MCMC (rjMCMC) that allows the chain to walk a little before accepting or rejecting a move, which we refer to as the mini-sampler. As we just saw, our model doesn't need rjMCMC since the number of parameters is constant, but we still employ the same strategy, making sure that the moves respect the detailed balance of the chain. This mini-sampler is necessary because, on the one hand, each step changes the topology just a little and, on the other hand, we want to be able to handle cases where neighboring topologies are very different.
  • showing the importance of the modified Poisson distribution as a prior for the distances. We explicitly work with two scenarios:
    1. forcing the penalty hyperparameter to a fixed value. This shows the relevance of the hierarchical modelling.
    2. using a simplified distance that can only detect the presence or absence of recombination without being able to quantify it. This model is analogous to other procedures that don't explicitly take the distance into account, and is equivalent to the "indicator variable" described above.
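
To make the difference between the full distance and the "indicator variable" concrete, here is a minimal Python sketch. It is not code from biomc2: the function names and the toy numbers are made up for illustration, and it only assumes that each potential break-point carries one non-negative distance, with zero meaning no recombination.

```python
# Toy illustration (hypothetical names, not part of biomc2).
# One distance per potential break-point; 0 means "no recombination here".

def effective_breakpoints(distances):
    """Count the break-points whose topology distance is nonzero."""
    return sum(1 for d in distances if d > 0)

def to_indicator(distances):
    """Collapse each distance to presence (1) or absence (0) of recombination."""
    return [1 if d > 0 else 0 for d in distances]

sample = [0, 2, 0, 0, 1]                      # made-up distances for one MCMC sample
n_parameters = len(sample)                    # constant, whatever the values are
n_effective = effective_breakpoints(sample)   # changes from sample to sample
indicators = to_indicator(sample)             # keeps where, loses how much
```

The number of parameters never changes, only how many of them are nonzero, which is the same idea as the near-zero coefficients in the linear model. The indicator version still locates the break-points but can no longer tell a distance of 2 from a distance of 1.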

  • describing the ensemble of mosaics and how to choose the most representative one. Each MCMC sample is one set of topologies (one per segment) that we call the mosaic structure. We then devise a distance between these mosaics to quantify how similar two samples are, and find the centroid mosaic - the sample most similar to all other samples. It is worth noting that
    1. this is done after the MCMC sampling is finished, and is not part of the Bayesian model per se
    2. this distance between samples is completely unrelated to the distance between topologies (that we call dSPR).
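
The centroid idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the mosaic distance used here (the number of segments whose topologies differ) is a hypothetical placeholder and not the one defined in the paper, all samples are assumed to share the same segments, and topologies are reduced to plain labels.

```python
# Post-processing sketch (hypothetical names, not part of biomc2).

def mosaic_distance(a, b):
    """Placeholder distance: number of segments where the topologies differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def centroid_mosaic(samples, distance=mosaic_distance):
    """Return (index, sample) of the sample with the smallest total distance
    to all other samples; ties go to the earliest sample."""
    totals = [sum(distance(s, t) for t in samples) for s in samples]
    best = min(range(len(samples)), key=totals.__getitem__)
    return best, samples[best]

# Made-up posterior samples: one topology label per segment.
samples = [
    ("A", "A", "B"),
    ("A", "B", "B"),
    ("A", "A", "B"),
    ("C", "A", "B"),
]
index, centroid = centroid_mosaic(samples)
```

Any other distance between mosaics can be passed through the `distance` argument. Note that the search compares every pair of samples, so its cost grows quadratically with the number of samples.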

In the meantime I'm updating the software site and upgrading the program: nothing special, but I realized that many libraries were unnecessary and that biomc2.summarise was painfully slow for large topologies. Now it is only slow.

[1] The first time that you try to access any file hosted on https://corn.ab.a.u-tokyo.ac.jp/ your browser will complain about my self-signed certificate - some lobby is not happy about it. Please ignore the terrorist warnings.

