(Post exported from my old blog, now defunct -- published originally on April 27, 2010)
Recently a paper about the software MANTiS caught my attention, and I've been meaning to write about it for a while. This announcement on the EvolDir list seemed like the perfect opportunity. I must warn you, though, that I've never used the software and I'm not familiar with the underlying databases, but the article is easy to follow.
The main result of the paper, published in Genome Biology and Evolution, is a correlation between the mean number of anatomical systems (human tissues or cell types) in which a gene is expressed and the time when the gene appeared on the phylogeny of the species. In other words, recent gene families are expressed in fewer anatomical systems (are more specific) than ancient ones. The anatomical systems come from a hierarchical classification of human tissues (e.g. the first level of the hierarchy: nervous, dermal, embryo, etc.) available from gene expression data. So the age of appearance of a gene is an indicator of its specificity. Since genes are subject to duplication, we may have more than one member of a gene family in the same species, and the authors show that the correlation holds whether we consider the appearance of the gene itself (as a result of duplication) or the appearance of the whole gene family to which the gene belongs.
They worked with gene families identified by MANTiS, which is a pipeline that 1) downloads data from metazoan genomes at Ensembl, 2) infers the gene tree based on the protein alignment of the gene family and 3) detects duplications through a reconciliation with a given species tree. Each gene tree is produced by EnsemblCompara which, as I understand it, employs an extension of "reciprocal best hits" (one that allows for many-to-many relations) to find the members of the family, and then maximum likelihood to find the tree itself. I will talk more about gene tree/species tree reconciliation in the future; for now it is enough to say that it gives the minimal list of nodes on the gene tree that represent duplications. We have an example of such a reconciled gene tree below, where the duplications are represented by the red boxes:
(Figure extracted from Bioinformatics 2008 24(2):151-157)
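To make the reconciliation step a bit more concrete, here is a minimal Python sketch of the classic LCA-mapping idea: each gene tree node is mapped to the lowest common ancestor (in the species tree) of the species found below it, and a node is called a duplication when it maps to the same species tree node as one of its children. This is my own toy illustration, not the MANTiS or EnsemblCompara code; the species tree, the gene tree and the names `SPECIES_PARENT`, `lca` and `reconcile` are all made up for the example.

```python
# Hypothetical species tree, stored as a child -> parent map.
SPECIES_PARENT = {
    "human": "mammals",
    "mouse": "mammals",
    "mammals": "vertebrates",
    "zebrafish": "vertebrates",
    "vertebrates": None,  # root
}


def ancestors(node):
    """Path from a species-tree node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = SPECIES_PARENT[node]
    return path


def lca(a, b):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes do not belong to the same tree")


def reconcile(gene_tree, duplications):
    """Map a gene-tree node onto the species tree, collecting duplications.

    A leaf is the name of the species the gene comes from (str);
    an internal node is a (left, right) tuple.
    """
    if isinstance(gene_tree, str):
        return gene_tree
    left, right = gene_tree
    mapped_left = reconcile(left, duplications)
    mapped_right = reconcile(right, duplications)
    here = lca(mapped_left, mapped_right)
    # Same mapping as a child: the split cannot be a speciation.
    if here in (mapped_left, mapped_right):
        duplications.append((gene_tree, here))
    return here


# Toy gene family with two copies in human and mouse, one in zebrafish.
gene_tree = (("human", "mouse"), (("human", "mouse"), "zebrafish"))
dups = []
root = reconcile(gene_tree, dups)
```

In this toy gene tree both subtrees of the root contain mammalian genes, so the root is flagged as a duplication placed on the vertebrate ancestor, while the remaining internal nodes are treated as speciations.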
MANTiS creates a new character (the brown polygons, which I think of as an orthologous group) for each duplication event, and the phylogenetic profile generated by these characters is then used to calculate the branch lengths of the species tree through a least squares approach. The phylogenetic profiles are represented by 0's and 1's in the inset of the figure above, and a distance matrix must be calculated from them to obtain the branch lengths.
In the study, two datasets were created for the presence/absence of genes: one called "families only", composed of one character for each single gene and for each protein family, and another called "with duplications", where a new character is created for each duplication event. Both analyses were necessary since gene gain through duplications is important in explaining genome size increase.
MANTiS creates a database relating each gene to its biological function and anatomical system: the biological processes and molecular functions (ontology terms) of protein families are given by the PANTHER database for human, mouse, rat and D. melanogaster, while the gene expression data (related to the anatomical systems) come from eGenetics, GNF and HMDEG. Comparing the time of appearance of each gene (as explained above) with its expression data gives a figure like the following:
(Figure modified from Genome Biology and Evolution Vol. 2010:13)
Notice that in this graph the X axis is inverted (that is, older is at the left and the present day at the right), which gives the impression of a negative correlation. In fact, older gene families (or duplications) are expressed in more cell types in humans. Similar results were obtained using rat expression data, since the expression datasets had information for both species, or using the other expression datasets.
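The quantity behind such a plot is simple to compute: group the genes by their time of appearance and average the number of anatomical systems in which they are expressed. The sketch below does this for a handful of invented records; the gene names, ages and counts are purely hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records:
# (gene, age of appearance in Myr, number of anatomical systems with expression)
genes = [
    ("g1", 900, 12),
    ("g2", 900, 10),
    ("g3", 450, 7),
    ("g4", 450, 9),
    ("g5", 90, 3),
    ("g6", 90, 2),
]


def mean_breadth_by_age(records):
    """Mean number of anatomical systems per age class, oldest first."""
    by_age = defaultdict(list)
    for _name, age, n_systems in records:
        by_age[age].append(n_systems)
    return {
        age: mean(counts)
        for age, counts in sorted(by_age.items(), reverse=True)
    }


breadth = mean_breadth_by_age(genes)
for age, value in breadth.items():
    print(f"{age} Myr ago: expressed in {value} systems on average")
```

Listing the oldest class first mirrors the inverted axis of the figure: reading from left to right we move towards the present, and in this made-up example the mean expression breadth drops along the way.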
The authors say that a possible explanation for this behaviour is the increase in the number of distinct cell types over time (blue line, notice the inverted axis again :D): new genes are likely to be more specific to a cell type (which may itself have appeared recently). Associated with this explanation is the subfunctionalization of duplicated genes, since the tendency to subfunctionalize ("specialize") can explain the decreased extent of expression. The subfunctionalization process itself might be related to the generation of a new cell phenotype.
One shortcoming of the analysis is that the gene family inference might fail to detect distantly related genes, and therefore what appears to be a gene gain (the "birth" of a new gene family) might in fact be a duplication within a more ancient single gene family. For example, if the sequences diverged too much after duplication number 3 in the first figure, we might wrongly classify them as two gene families. But avoiding this problem entirely is a tall order. The authors also draw attention to the problems of low coverage of some genomes and of taxonomic bias.
Milinkovitch, M., Helaers, R., & Tzika, A. (2009). Historical Constraints on Vertebrate Genome Evolution. Genome Biology and Evolution, 2010, 13-18. DOI: 10.1093/gbe/evp052
Tzika, A., Helaers, R., Van de Peer, Y., & Milinkovitch, M. (2007). MANTIS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics, 24 (2), 151-157. DOI: 10.1093/bioinformatics/btm567