Tuesday, September 25, 2012

A study of Phylogenetic Tree generation in QIIME

1. QIIME introduction

QIIME stands for Quantitative Insights Into Microbial Ecology. It takes raw sequencing files (usually FASTA files) as input and outputs initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from the representative sequences of OTUs. It also provides downstream statistical analysis, visualization, and production of publication-quality graphics.


2. QIIME pipeline

Currently, our pipeline, DACIDR, includes:
Pairwise Sequence Alignment (PSA) -> Deterministic Annealing Pairwise Clustering (DA-PWC) / Multidimensional Scaling (MDS) -> Region Refinement -> Interpolation -> Generate OTUs -> Select Center Sequence (Reference Sequences) -> Generate Phylo Tree
The QIIME pipeline includes:
Clustering (Picking OTUs) -> Choosing Reference Sequence -> Multiple Sequence Alignment -> Alignment Filter (Removing Unnecessary Gaps) -> Build Phylo Tree -> Build an OTU table
The clustering step in QIIME serves much the same function as the one in DACIDR: both take FASTA files as input and output a clustering result with reference sequences. QIIME's clustering choices are usearch, usearch_ref, prefix_suffix, mothur, trie, blast, uclust_ref, cdhit, and uclust. The reference sequence is usually the most abundant sequence in the cluster.
QIIME, however, takes more steps to generate the phylogenetic tree; the details are presented below:

2.1 Multiple Sequence Alignment

QIIME offers three choices for multiple sequence alignment: PyNAST (Caporaso et al., 2009), MUSCLE (Edgar, 2004), and Infernal (Nawrocki, Kolbe, & Eddy, 2009).
PyNAST is a Python implementation of the NAST alignment algorithm. The NAST algorithm aligns each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence). Candidate sequences are not permitted to introduce new gap characters into the template database, so the algorithm introduces local misalignments to preserve the existing template alignment. This means that to use PyNAST we need a template alignment, whose entries look like this:

>114239 
............................................................................................................aGAGTTT-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T--AACACA-T-GC-A-AGT-CGA-A-CG----------G-TAA-CA-G-----------------------------CAG-A-AG----------------------------------------------------CTT-G----------------------------------------------------------------------------------CTT-CT------------------GGCT--G--AC--G--AG-T-GG-C-GG-A--C-------------GGG-TGAGT-A--AC-GC-G-T-A-GG---A-A--T-CT-G--C-CTTA---CA-G------------------------------------------------------------------T-GG----GGG-AT-AG-CCC-------------------------G-G-T-----------------------GAA-A---ACC-GGA-TTAA-TA---CC-G--C-AT-A----------C--------------------G-------------------------------------CC-C-----------------------------------------------------------------------------------------------------------------------T-AC-G--------------------------------------------------------------------------------------------------------------------------------------G-G-G---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------GAAA--G-CTG--G-----G--GA-T--C--------------------------------------------------------------------------------------------------------------------TTC-G----------------------------------------------------------------------------------------------------------------------G-A--CC-TG--G---C-A--------------C----T-G---T-TA-G---AT---G-A-----G-CCT-GCG--T-GAG--A------TT--A--G-CT-T----G---TTGG-T-G-GG-G-T----AAT-GG---C-CTACCA--A-GG-C-A--A-CG-A------------TCT-C-T------AG-CT-G-G-TCT-G-AG----A--GG-AC--G-AT-C-AG-CCAC-A-CTGGG--A-C-TG-A-GA-C-AC-G-GCCCAG-A-CTCC-TAC-G--G-G-A-G-GC-A-GC-A-G-TG---GG-G-A-ATA-TTGCA-C-AA-T-GG--GC-GA-G----A-G-CC-T-GA-TG-CA-GCAA-TACC-G-CG-T---G-T-G--T--GA-A-G-
-A--A-G-G--C----CTG-AG---------G-G-T-T-G-T--A---AA-G-CAC--------TT-TC-A-A--T--TG---TGA-A--G---AAAAGCT---T-TT-GG----T--T--AA-T---A----------AC-C-TTGAGTC-TT-GA-CA-TTAA-C-A--A-TA-C---------AA-----------GAAGC-ACC-GG-C-TAA---C--T-CCGT--GCCA--G-C---A--GCCG---C-GG--TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GAGT-AC--G-TA-G-G-T-G------------G--T-TC-G-T-T-AA----GTC---A---G-ATG-TG-A-AA-GC--CC-CGG-G--------------------------------------------------------------------CT-C-AA-------------------------------------------------------------------------CC-T-G-GG-AA-C----T-G-C-A-T-T--------T---GAA-A-C-T-G-GCA--A-A-C---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------T-T-G-A-G-T-A-----T-GG--TA-G-A------------G-GC-G-GG-T----AG--AA-T-TCA-TAGT--GT-A-GCG-GTGAAA-TG-CGT-AGAT-ATTA-T-G-A--GG-A-AT-A-CC-AG--T--G--GC-GAA-G--G-C---G----G--C-C-CGCTG------G-AC-CA--------------------------------------------------------------AT-A-C-C--GA--CA-----CT-GA-GG--T
-A-CGA--AA-G-C--------------G-TGGG-GAG-C-G-AACA--GG-ATTA-G-ATA-C-----CC-T-G-GTA-G-T----C-CA--C-G-CCG-T-AAA--C-GATG-TC--AA-CT---------A-GC--C--G-T-TG-G-GA-C-----------------------------------------------------------------------------------------CTT-GA--------------------------------------------------------------------------------------------------------------------------------------------------G-G-T-CT--C-A-G-T-GG-T------GC--A----GC-TAA--CG-C---G--T-GAA-GT--T----G-ACC-GCC-T-G-GG-GAG-TA---CGG-----C-C--G-C-A-A-GGC-T--AAA-ACTC-AAA---------TGAA-TTG-ACGAG-G-G-CCCG----C-A--C-A-A-GCG-GT-G--G--AG-CA-T--GT-GGT-TT-AATT-C-G-ATG-CAAC-G-CG-A-AG-A-A-CC-TT-A-CC-ATCCC-TT-G-AC-ATC-A---------------TC-A-G-------------A-A--CTT-G--TT--A-GA-G-A-T--A-A-C----T-CGG--T-G-----CC-------------------------------------T--TC-G------------------------------------------GG------------AA--CTGA-AT--GA---------------------------------------------------C-A-G-G-T-GCTG-CA-TGG-CT--GTC-GTC-A-GC-TC---G-TG-TC-G--TGA-GA-TGT-T-GG-G-TT-AA-GT-CCCGC-AA--------C-GAG-CGC-A-ACC-C-T-TA--TC--C-TTAG--T-T-G-C-C---AG-C-A----CG------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TAAT------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------GG---T----G-G-----------------G---A-A--CT---------------C-T-A-A-G-GA-G--AC-T-G-CCG--G-T------------------------------------G-A---TAA----------------------------------A-C-C-G--G-A-GG-A--AGG-T--GGGG-A-TGAC-GTC--AAGT-C---ATC-A-T-G-G-C-C-CTT----AT-G--GG-A-T-GG-GC-TA-CAC-ACGTG-C--TA--CAATG---G-T-CA-GTA--C-AAA-GG-GC------------------------------------------------------------
--------------------------------------A-G-C-C-A--A-CTCG-C--G---------------------------------------A-GA-G-T-----------G--C-G-CA---A----------A--TCC-C------A-T-AAAGCTGA---T-C-G-TAG-TCC--------GGA-T-TGGAG-TC--T-GCAA-CT-C-------------------------------------------------------------------------------------------------G-ACTCC-A-T-G-AA-G-TT-GGAAT-CG-C-TA--G-TA-AT-C-G-T----GAA-TC-A-G--A------AT--GTC-AC-G-GT-G-AAT-ACGT-T-CCCGGGCCT-TGCA----CACACCG-CCC-GTC-----a---ca--cca-tg-gg-a--g---tgg-g-ct-gc-aaa--a-gaa------g--t-agg-ta-g-t-t-t-aa-c-c--------------------------------------------------------------ttc-g------------------------------------------------------------------------------------------------------gg-a--ga-a--c---gc-tta--cc--act-t----t-gtg-gt-tca------------------------tg--act-gggg-tg-aag-tcgtaacaa-ggtag-ccct-aggggaa-cctg-gggc-tggatcacctcctt.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
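To make the template constraint concrete, here is a minimal Python sketch. It assumes the template is ordinary FASTA in which '.' and '-' are padding characters and long sequences may be wrapped across lines, as in the entry above. The function names and the toy records are hypothetical; this is not PyNAST code. It shows two properties: stripping the padding recovers the raw sequence, and a NAST-style aligned candidate occupies exactly the same number of columns as its template.

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}, joining wrapped lines."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def strip_gaps(aligned):
    """Remove alignment padding ('.' and '-') to recover the unaligned sequence."""
    return aligned.replace(".", "").replace("-", "")


def fits_template(template_aligned, candidate_aligned):
    """True if the candidate uses exactly the template's columns (no new ones)."""
    return len(candidate_aligned) == len(template_aligned)


# Toy data, invented for illustration only.
toy = """>template_1
..aGAG-TTT-GA..
>candidate_1
---GAG-TCT-GA--
"""
records = read_fasta(toy)
raw_template = strip_gaps(records["template_1"])
same_width = fits_template(records["template_1"], records["candidate_1"])
```

Because every aligned candidate has the template's width, the later steps (lanemask filtering, tree building) can treat column positions as comparable across all sequences.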
MUSCLE (MUltiple Sequence Comparison by Log-Expectation) can be used directly for multiple sequence alignment; I used it to experiment with a very small set of sequences. However, MUSCLE is slower than PyNAST when the number of sequences is large.
Infernal (“INFERence of RNA ALignment”) is an alignment method that uses RNA structure and sequence similarities.
Infernal takes as input a multiple sequence alignment with a corresponding secondary structure annotation. This input file must be in Stockholm alignment format. Like PyNAST, it requires a template, and it is the slowest of the three methods.



2.2 Alignment Filter

This component removes positions that are gaps in every sequence. Additionally, the user can supply a lanemask file that defines which positions should be included when building the tree and which should be ignored. A lanemask file looks like this:
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001101111010110011101001011010110010000101001111110101101011101110101110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000110101101011010010000000000000111011111010011011010101011000111001011010010101110011010000000000000000000000000000000000000000000000000000000000000000001011000011101101101110000000000000000000000000101010000000000000000000000011101000111011101111011000110100101101000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000010100010110100011000101000001011101110010111001000000110010010110100001000111101010110101000011101101010101111001011010100101101000000000000111010100000011011010101110
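The two filtering operations described in this section can be sketched in a few lines of Python. This is an illustration of the idea only, not QIIME's implementation. It assumes the alignment is held as a dict of equal-length strings and that the lanemask is a string with one '0' or '1' per alignment column, with '1' meaning "keep". All names and the toy alignment are hypothetical.

```python
def drop_all_gap_columns(aligned, gap_chars="-."):
    """Remove every column in which all sequences have a gap character."""
    seqs = list(aligned.values())
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(width) if any(s[i] not in gap_chars for s in seqs)]
    return {name: "".join(s[i] for i in keep) for name, s in aligned.items()}


def apply_lanemask(aligned, lanemask):
    """Keep only the columns whose lanemask character is '1'."""
    keep = [i for i, flag in enumerate(lanemask) if flag == "1"]
    filtered = {}
    for name, seq in aligned.items():
        if len(seq) != len(lanemask):
            raise ValueError("lanemask length must equal alignment length")
        filtered[name] = "".join(seq[i] for i in keep)
    return filtered


# Toy alignment, invented for illustration only.
toy_alignment = {"a": "AC-G-T", "b": "A--GCT", "c": "AT-G-T"}
no_empty_columns = drop_all_gap_columns(toy_alignment)
masked = apply_lanemask(toy_alignment, "110101")
```

In the toy alignment, only the third column is a gap in every sequence, so it is the only one removed by the first function. The lanemask is stricter and also discards the fifth column, even though one sequence has a base there.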


2.3 Generate Phylogenetic Tree

After the multiple sequence alignment is done on the reference sequence set, the phylogenetic tree can be produced. The methods provided in QIIME include clearcut, clustalw, raxml, fasttree_v1, fasttree, raxml_v730, and muscle. The default is FastTree (Price, Dehal, & Arkin, 2009), which took only a few seconds to generate a tree for around 200 sequences in my test.
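Each QIIME step in section 2 maps onto one command-line script. The Python sketch below only builds and prints the six command lines; it does not run them. The script names and flags are given as I recall them from the QIIME 1.x script documentation and should be checked against each script's -h output. The function name, all file paths, and the intermediate file names are assumptions made for illustration.

```python
import shlex


def qiime_tree_commands(seqs_fp, template_fp, lanemask_fp, out_dir="qiime_out"):
    """Return the QIIME 1.x commands for the pipeline, in order, as argument lists."""
    # Intermediate file names are assumptions; check what each script really writes.
    otu_map = f"{out_dir}/otus/seqs_otus.txt"
    rep_set = f"{out_dir}/rep_set.fna"
    aligned = f"{out_dir}/aligned/rep_set_aligned.fasta"
    filtered = f"{out_dir}/filtered/rep_set_aligned_pfiltered.fasta"
    return [
        # 1. Clustering (picking OTUs)
        ["pick_otus.py", "-i", seqs_fp, "-m", "uclust", "-o", f"{out_dir}/otus"],
        # 2. Choosing the reference sequence of each OTU
        ["pick_rep_set.py", "-i", otu_map, "-f", seqs_fp,
         "-m", "most_abundant", "-o", rep_set],
        # 3. Multiple sequence alignment against a template
        ["align_seqs.py", "-i", rep_set, "-m", "pynast",
         "-t", template_fp, "-o", f"{out_dir}/aligned"],
        # 4. Alignment filter with a lanemask
        ["filter_alignment.py", "-i", aligned, "-m", lanemask_fp,
         "-o", f"{out_dir}/filtered"],
        # 5. Build the phylogenetic tree
        ["make_phylogeny.py", "-i", filtered, "-t", "fasttree",
         "-o", f"{out_dir}/rep_set.tre"],
        # 6. Build the OTU table
        ["make_otu_table.py", "-i", otu_map, "-o", f"{out_dir}/otu_table.biom"],
    ]


commands = qiime_tree_commands("seqs.fna", "template.fasta", "lanemask.txt")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than one shell string makes it easy to swap a method, for example replacing "pynast" with "muscle", before handing them to a job runner.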

Reference:
  • Yang Ruan, Geoffrey House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox. Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. Proceedings of C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, May 26-29, 2014.
  • Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. Proceedings of IEEE eScience 2013, Beijing, China, Oct. 22-Oct. 25, 2013. (Best Student Innovation Award)
  • Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, Geoffrey Fox. DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences. Proceedings of ACM-BCB 2012, Orlando, Florida, ACM, Oct. 7-Oct. 10, 2012.
  • Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox. HyMR: a Hybrid MapReduce Workflow System. Proceedings of ECMLS’12 of ACM HPDC 2012, Delft, Netherlands, ACM, Jun. 18-Jun. 22, 2012
  • Adam Hughes, Yang Ruan, Saliya Ekanayake, Seung-Hee Bae, Qunfeng Dong, Mina Rho, Judy Qiu, Geoffrey Fox. Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets, BMC Bioinformatics 2012, 13(Suppl 2):S9.
Monday, September 24, 2012

    Phy Tree on Side Plot


    Description

    Environment: Tempest; 32 nodes (768 cores)
    Aligner: SmithWaterman
    ScoringMatrix: EDNAFULL
    GapOpen: -16
    GapExt: -4
    DistanceType: (1-Normalized Score*), Percentage Identity
    Method: 1291+123+74+420 all varied
    WithReverse: dynamic determine algorithm
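As a rough illustration of the percentage-identity (pid) distance listed above, here is a small Python sketch. The post does not say which denominator is used, so this version assumes identity is counted over all aligned columns of one pairwise alignment (columns that are gaps in both sequences are ignored) and takes the distance as 1 minus that fraction. The function names are hypothetical. The "Normalized Score*" distance is not sketched because its definition is not given here.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of aligned columns in which both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if not (x == gap and y == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return matches / len(columns)


def pid_distance(aligned_a, aligned_b):
    """Distance derived from percentage identity: 0 for identical sequences."""
    return 1.0 - percent_identity(aligned_a, aligned_b)


# Toy pairwise alignment, invented for illustration only.
d = pid_distance("ACGT-A", "ACCTTA")
```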

    Dataset

    1) 1291 new GenBank sequences (id: 617~1907)
    2) 123 consensus sequences (id: 420~542)
    3) 74 sequences from GenBank (id: 543~616)
    4) 420 center sequences (id: 0~419)

    Tree Configuration

    New sequences from GenBank: Hexagon
    Consensus sequences in Haixu 123: Triangle
    Center sequences in 420: original points
    Root: Sphere
    
    Color Scheme:  1) Phylogenetic tree generated here
                   2) Mega Region coloring

    Final Result

    Phylogenetic tree coloring:

    1. Plot File in PVIZ format, with pid distance.
    2. Plot File in PVIZ format, with old* score distance.

    Mega Region coloring:

    3. Plot File in PVIZ format, with old* score distance.

    Screen Shot

    1. pid distance

    2. old* score distance


    3. old* score distance with mega region