Saturday, July 23, 2011

Fault-tolerant conversion between sequence alignments

(Post exported from my old blog, now defunct -- published originally on May 17, 2010)

Although I'm very lenient when testing my own programs, I'm not so nice when asked to scrutinize other people's work. That's why I was happy to see the announcement that the ALTER web server has been published in Nucleic Acids Research (open access!).

I am not involved in the project, but I was in the very comfortable position of being one of the beta testers: all I needed to do was find the largest and most obscure datasets I had, try them, and then complain to the authors about the smallest details. I tried some big datasets (I think they were influenza H3N2 HA and HIV-1 complete genomes from South America, around 2 and 4 Mbytes each) and my simulated alignments created "by hand" from PAML. And ALTER could handle them in the end: the authors even sent me a report explaining how each of my comments was used to improve the software, and asking me to try again until I was satisfied.

The ALTER web server is a converter between multiple sequence alignment (MSA) formats, for DNA or protein, that focuses not only on the format itself (like FASTA or NEXUS) but also on the program that generated the alignment and the program where the alignment is going to be used (e.g. Clustal or MrBayes). The authors argue that this program-oriented format conversion is necessary because all useful programs eventually violate the (outdated) format specification. In their own words:
[D]uring the last years MSA's formats have `evolved' very much like the sequences they contain, with mutational events consisting of long names, extra spaces, additional carriage returns, etc.
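To make those "mutational events" concrete, here is a minimal Python sketch of the kind of tolerance involved: reading a FASTA-like file despite stray spaces, carriage returns and blank lines, and then shortening long names for a strict target (the 10-character name field of classic PHYLIP is used as the example limit). This is not ALTER's code; the function names `read_fasta_tolerant` and `short_unique_names` are my own, hypothetical ones.

```python
def read_fasta_tolerant(text):
    """Parse FASTA-like text into (name, sequence) pairs, forgiving layout noise."""
    records = []
    name, chunks = None, []
    # splitlines() copes with \n, \r\n and bare \r line endings
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore additional blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # drop any spaces or tabs inside the sequence line
            chunks.append("".join(line.split()))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def short_unique_names(names, width=10):
    """Truncate names to `width` characters, keeping them unique with a numeric suffix."""
    used, result = set(), []
    for name in names:
        base = "_".join(name.split())[:width]
        candidate, n = base, 1
        while candidate in used:
            suffix = str(n)
            candidate = base[: width - len(suffix)] + suffix
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

A real converter also has to keep a map from the shortened names back to the originals; the sketch leaves that out.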

The web service can automatically recognize the input format and generate output for several programs, in several formats. I found it very easy to use: as you proceed, it automatically shows you the possible next steps on the same page. Another very nice feature is the possibility of collapsing duplicate (identical) sequences, so that you then work only with the haplotypes (unique sequences). If you later need the information about the collapsed duplicates, check the "info" panel at the bottom of the screen (inside the "log" window).
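For readers who want to see what collapsing to haplotypes amounts to, here is a minimal Python sketch. It assumes the alignment is already in memory as (name, sequence) pairs and treats two sequences as identical only if they match character by character, ignoring case; how ALTER itself handles gaps or missing data may well differ. The function name `collapse_duplicates` is hypothetical.

```python
def collapse_duplicates(records):
    """Return (haplotypes, collapsed) for a list of (name, sequence) pairs.

    haplotypes: one (name, sequence) pair per unique sequence, first seen wins.
    collapsed:  representative name -> names of the duplicates it absorbed.
    """
    groups = {}  # upper-cased sequence -> (sequence as first seen, [names])
    for name, seq in records:
        key = seq.upper()
        if key not in groups:
            groups[key] = (seq, [])
        groups[key][1].append(name)
    haplotypes = [(names[0], seq) for seq, names in groups.values()]
    collapsed = {names[0]: names[1:] for _, names in groups.values() if len(names) > 1}
    return haplotypes, collapsed
```

The second return value plays the role of the "info" panel: it records which names each haplotype stands for, so the duplicates can be restored later.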

The obvious case where eliminating duplicates is useful is phylogenetic reconstruction (in many cases you can safely remove identical sequences), but another option offered by ALTER is to remove very similar sequences, with a similarity threshold that you define. Sometimes, when I'm doing a preliminary analysis of a dataset, I want to discard sequences that are too similar in order to get an overall picture of the data, and at other times I must remove closely related sequences because my recombination-detection program has a limit on the number of taxa...
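The thinning idea can be sketched in a few lines of Python. The version below is a simple greedy filter of my own, not ALTER's algorithm: it measures dissimilarity as the proportion of differing sites (p-distance), skips columns with a gap or "?" in either sequence, and keeps a sequence only if it is at least `min_distance` away from everything already kept. The names `p_distance` and `thin_alignment` are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?" or y in "-?":
            continue  # assumption: gaps and missing data are not compared
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else 0.0


def thin_alignment(records, min_distance):
    """Greedily keep sequences at least `min_distance` from all kept ones."""
    kept = []
    for name, seq in records:
        if all(p_distance(seq, kept_seq) >= min_distance for _, kept_seq in kept):
            kept.append((name, seq))
    return kept
```

Because the filter is greedy, the result depends on the input order: the first sequence of each cluster of near-identical sequences is the one that survives.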

Besides the user-friendly web service, they also offer a geek-friendly API - if you want your program to communicate directly with the service - and the source code, licensed under the LGPL.

Glez-Pena, D., Gomez-Blanco, D., Reboiro-Jato, M., Fdez-Riverola, F., & Posada, D. (2010). ALTER: program-oriented conversion of DNA and protein alignments. Nucleic Acids Research. DOI: 10.1093/nar/gkq321
