1. QIIME introduction
QIIME stands for Quantitative Insights Into Microbial Ecology. It takes raw sequencing files (usually FASTA files) as input and performs initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from the representative sequences of OTUs. It also provides downstream statistical analysis, visualization, and production of publication-quality graphics.
2. QIIME pipeline
Currently, our pipeline, DACIDR, includes:
Pairwise Sequence Alignment (PSA) -> Deterministic Annealing Pairwise Clustering (DA-PWC) / Multidimensional Scaling (MDS) -> Region Refinement -> Interpolation -> Generate OTUs -> Select Center Sequence (Reference Sequences) -> Generate Phylo Tree
And QIIME includes:
Clustering (Picking OTUs) -> Choosing Reference Sequence -> Multiple Sequence Alignment -> Alignment Filter (Removing Unnecessary Gaps) -> Build Phylo Tree
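As a rough orientation, the QIIME stages listed above correspond to separate QIIME 1 command-line scripts. The sketch below only assembles the command lines and does not run them. The script names and flags are written as I recall them from QIIME 1.x and should be verified with each script's -h option. The function name, all file and directory names, and the intermediate output names are hypothetical placeholders.

```python
# Sketch only: assemble (but do not run) QIIME 1 command lines for each stage.
# File names, directory names, and intermediate output names are hypothetical;
# check every flag against the script's -h output for your QIIME version.

def qiime_tree_commands(seqs_fp, template_fp, out_dir="qiime_out"):
    """Return one argument list per pipeline stage, in execution order."""
    otu_dir = f"{out_dir}/otus"
    otu_map = f"{otu_dir}/seqs_otus.txt"            # assumed output name
    rep_set = f"{out_dir}/rep_set.fna"
    aln_dir = f"{out_dir}/aligned"
    aligned = f"{aln_dir}/rep_set_aligned.fasta"    # assumed output name
    flt_dir = f"{out_dir}/filtered"
    filtered = f"{flt_dir}/rep_set_aligned_pfiltered.fasta"  # assumed output name
    tree = f"{out_dir}/rep_set.tre"
    return [
        # Clustering (picking OTUs)
        ["pick_otus.py", "-i", seqs_fp, "-m", "uclust", "-o", otu_dir],
        # Choosing the reference sequence of each OTU
        ["pick_rep_set.py", "-i", otu_map, "-f", seqs_fp,
         "-m", "most_abundant", "-o", rep_set],
        # Multiple sequence alignment against a template
        ["align_seqs.py", "-i", rep_set, "-m", "pynast",
         "-t", template_fp, "-o", aln_dir],
        # Alignment filter
        ["filter_alignment.py", "-i", aligned, "-o", flt_dir],
        # Build the phylogenetic tree
        ["make_phylogeny.py", "-i", filtered, "-t", "fasttree", "-o", tree],
    ]


if __name__ == "__main__":
    for cmd in qiime_tree_commands("seqs.fna", "core_set_aligned.fasta"):
        print(" ".join(cmd))
```

Keeping each stage as an argument list makes it easy to print the commands for review first and only later hand them to something like subprocess.run.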
The clustering step in QIIME serves much the same function as the clustering in DACIDR: both take FASTA files as input and output a clustering result with reference sequences. The QIIME clustering choices are usearch, usearch_ref, prefix_suffix, mothur, trie, blast, uclust_ref, cdhit and uclust. The reference sequence of each OTU is usually chosen as its most abundant sequence.
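To make the "most abundant sequence" rule concrete, here is a minimal sketch of one way to pick a reference read per OTU. It is not QIIME's own code. The function name and the two input dictionaries are assumptions made for illustration, and ties are broken by whichever sequence appears first in the cluster.

```python
from collections import Counter


def pick_most_abundant(otu_to_seq_ids, seqs):
    """Map each OTU id to the id of a read carrying its most frequent sequence.

    otu_to_seq_ids: dict of OTU id -> list of read ids in that cluster
    seqs:           dict of read id -> nucleotide string
    """
    reps = {}
    for otu_id, seq_ids in otu_to_seq_ids.items():
        # Count how often each distinct nucleotide string occurs in the cluster.
        counts = Counter(seqs[s] for s in seq_ids)
        best_seq, _ = counts.most_common(1)[0]
        # Report the first read id that carries the winning sequence.
        reps[otu_id] = next(s for s in seq_ids if seqs[s] == best_seq)
    return reps
```

For example, a cluster holding the reads ACGT, ACGA and ACGT would be represented by the first ACGT read.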
However, QIIME uses more steps to generate the phylogenetic tree. The details are presented below:
2.1 Multiple Sequence Alignment
PyNAST is a Python implementation of the NAST alignment algorithm. The NAST algorithm aligns each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence). Candidate sequences are not permitted to introduce new gap characters into the template database, so the algorithm introduces local misalignments to preserve the existing template sequence. This means that to use PyNAST we need a template alignment, in which each entry looks like:
>114239
............................................................................................................aGAGTTT-GA--T-CC-T-G-GCTC-AG-AT-TGAA-C-GC--TGG-C--G-GT-A-TG--C----T-T--AACACA-T-GC-A-AGT-CGA-A-CG----------G-TAA-CA-G-----------------------------CAG-A-AG----------------------------------------------------CTT-G----------------------------------------------------------------------------------CTT-CT------------------GGCT--G--AC--G--AG-T-GG-C-GG-A--C-------------GGG-TGAGT-A--AC-GC-G-T-A-GG---A-A--T-CT-G--C-CTTA---CA-G------------------------------------------------------------------T-GG----GGG-AT-AG-CCC-------------------------G-G-T-----------------------GAA-A---ACC-GGA-TTAA-TA---CC-G--C-AT-A----------C--------------------G-------------------------------------CC-C-----------------------------------------------------------------------------------------------------------------------T-AC-G--------------------------------------------------------------------------------------------------------------------------------------G-G-G---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------GAAA--G-CTG--G-----G--GA-T--C--------------------------------------------------------------------------------------------------------------------TTC-G----------------------------------------------------------------------------------------------------------------------G-A--CC-TG--G---C-A--------------C----T-G---T-TA-G---AT---G-A-----G-CCT-GCG--T-GAG--A------TT--A--G-CT-T----G---TTGG-T-G-GG-G-T----AAT-GG---C-CTACCA--A-GG-C-A--A-CG-A------------TCT-C-T------AG-CT-G-G-TCT-G-AG----A--GG-AC--G-AT-C-AG-CCAC-A-CTGGG--A-C-TG-A-GA-C-AC-G-GCCCAG-A-CTCC-TAC-G--G-G-A-G-GC-A-GC-A-G-TG---GG-G-A-ATA-TTGCA-C-AA-T-GG--GC-GA-G----A-G-CC-T-GA-TG-CA-GCAA-TACC-G-CG-T---G-T-G--T--GA-A-G-
-A--A-G-G--C----CTG-AG---------G-G-T-T-G-T--A---AA-G-CAC--------TT-TC-A-A--T--TG---TGA-A--G---AAAAGCT---T-TT-GG----T--T--AA-T---A----------AC-C-TTGAGTC-TT-GA-CA-TTAA-C-A--A-TA-C---------AA-----------GAAGC-ACC-GG-C-TAA---C--T-CCGT--GCCA--G-C---A--GCCG---C-GG--TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GAGT-AC--G-TA-G-G-T-G------------G--T-TC-G-T-T-AA----GTC---A---G-ATG-TG-A-AA-GC--CC-CGG-G--------------------------------------------------------------------CT-C-AA-------------------------------------------------------------------------CC-T-G-GG-AA-C----T-G-C-A-T-T--------T---GAA-A-C-T-G-GCA--A-A-C---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------T-T-G-A-G-T-A-----T-GG--TA-G-A------------G-GC-G-GG-T----AG--AA-T-TCA-TAGT--GT-A-GCG-GTGAAA-TG-CGT-AGAT-ATTA-T-G-A--GG-A-AT-A-CC-AG--T--G--GC-GAA-G--G-C---G----G--C-C-CGCTG------G-AC-CA--------------------------------------------------------------AT-A-C-C--GA--CA-----CT-GA-GG--T
-A-CGA--AA-G-C--------------G-TGGG-GAG-C-G-AACA--GG-ATTA-G-ATA-C-----CC-T-G-GTA-G-T----C-CA--C-G-CCG-T-AAA--C-GATG-TC--AA-CT---------A-GC--C--G-T-TG-G-GA-C-----------------------------------------------------------------------------------------CTT-GA--------------------------------------------------------------------------------------------------------------------------------------------------G-G-T-CT--C-A-G-T-GG-T------GC--A----GC-TAA--CG-C---G--T-GAA-GT--T----G-ACC-GCC-T-G-GG-GAG-TA---CGG-----C-C--G-C-A-A-GGC-T--AAA-ACTC-AAA---------TGAA-TTG-ACGAG-G-G-CCCG----C-A--C-A-A-GCG-GT-G--G--AG-CA-T--GT-GGT-TT-AATT-C-G-ATG-CAAC-G-CG-A-AG-A-A-CC-TT-A-CC-ATCCC-TT-G-AC-ATC-A---------------TC-A-G-------------A-A--CTT-G--TT--A-GA-G-A-T--A-A-C----T-CGG--T-G-----CC-------------------------------------T--TC-G------------------------------------------GG------------AA--CTGA-AT--GA---------------------------------------------------C-A-G-G-T-GCTG-CA-TGG-CT--GTC-GTC-A-GC-TC---G-TG-TC-G--TGA-GA-TGT-T-GG-G-TT-AA-GT-CCCGC-AA--------C-GAG-CGC-A-ACC-C-T-TA--TC--C-TTAG--T-T-G-C-C---AG-C-A----CG------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TAAT------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------GG---T----G-G-----------------G---A-A--CT---------------C-T-A-A-G-GA-G--AC-T-G-CCG--G-T------------------------------------G-A---TAA----------------------------------A-C-C-G--G-A-GG-A--AGG-T--GGGG-A-TGAC-GTC--AAGT-C---ATC-A-T-G-G-C-C-CTT----AT-G--GG-A-T-GG-GC-TA-CAC-ACGTG-C--TA--CAATG---G-T-CA-GTA--C-AAA-GG-GC------------------------------------------------------------
--------------------------------------A-G-C-C-A--A-CTCG-C--G---------------------------------------A-GA-G-T-----------G--C-G-CA---A----------A--TCC-C------A-T-AAAGCTGA---T-C-G-TAG-TCC--------GGA-T-TGGAG-TC--T-GCAA-CT-C-------------------------------------------------------------------------------------------------G-ACTCC-A-T-G-AA-G-TT-GGAAT-CG-C-TA--G-TA-AT-C-G-T----GAA-TC-A-G--A------AT--GTC-AC-G-GT-G-AAT-ACGT-T-CCCGGGCCT-TGCA----CACACCG-CCC-GTC-----a---ca--cca-tg-gg-a--g---tgg-g-ct-gc-aaa--a-gaa------g--t-agg-ta-g-t-t-t-aa-c-c--------------------------------------------------------------ttc-g------------------------------------------------------------------------------------------------------gg-a--ga-a--c---gc-tta--cc--act-t----t-gtg-gt-tca------------------------tg--act-gggg-tg-aag-tcgtaacaa-ggtag-ccct-aggggaa-cctg-gggc-tggatcacctcctt.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
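Because a template entry like the one above is wrapped over several lines and padded with gap characters, it can help to sanity-check a template file before aligning against it. The sketch below is one possible check and is not part of PyNAST. The function names are hypothetical, and it assumes that both '-' and '.' mark gap positions and that every template entry must have the same aligned length.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {id: sequence}, joining wrapped lines."""
    records, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def check_template(records, gap_chars="-."):
    """Return (aligned_length, {id: ungapped_length}); raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(
            f"template is not a rectangular alignment: {sorted(lengths)}"
        )
    ungapped = {
        name: sum(ch not in gap_chars for ch in seq)
        for name, seq in records.items()
    }
    return lengths.pop(), ungapped
```

A typical call would be check_template(read_fasta(open(path))) for some template path, which returns the shared alignment width and the number of real bases in each entry.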
MUSCLE (MUltiple Sequence Comparison by Log-Expectation) is another alignment method. It can be used directly for multiple sequence alignment without a template; I used it to experiment with a very small set of sequences. However, MUSCLE is slower than PyNAST when the number of sequences is large.
Infernal (“INFERence of RNA ALignment”) is an alignment method that uses RNA structure and sequence similarities. Infernal takes a multiple sequence alignment with a corresponding secondary structure annotation as input. This input file must be in Stockholm alignment format. Like PyNAST, it requires a template alignment, and it is the slowest of these three methods.
2.2 Alignment Filter
This component removes positions that are gaps in every sequence. Additionally, the user can supply a lanemask file that defines which positions should be included when building the tree and which should be ignored. A lanemask file looks like this:
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001101111010110011101001011010110010000101001111110101101011101110101110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000110101101011010010000000000000111011111010011011010101011000111001011010010101110011010000000000000000000000000000000000000000000000000000000000000000001011000011101101101110000000000000000000000000101010000000000000000000000011101000111011101111011000110100101101000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000010100010110100011000101000001011101110010111001000000110010010110100001000111101010110101000011101101010101111001011010100101101000000000000111010100000011011010101110
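The filtering described in this section can be sketched in a few lines. This is an illustration rather than QIIME's implementation. The function name is hypothetical, and it assumes that a '1' in the lanemask means "keep this column", that a '0' means "ignore it", and that both '-' and '.' count as gap characters.

```python
def filter_alignment(aligned, lanemask=None, gap_chars="-."):
    """Drop masked columns, then columns that are gaps in every sequence.

    aligned:  dict of id -> aligned sequence (all the same length)
    lanemask: optional string of '0'/'1', one character per alignment column
    """
    seqs = list(aligned.values())
    if not seqs:
        return {}
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("sequences must all have the same aligned length")
    if lanemask is not None and len(lanemask) != width:
        raise ValueError("lanemask length must match the alignment width")
    # A column survives if the mask allows it and at least one sequence
    # has a non-gap character there.
    keep = [
        col for col in range(width)
        if (lanemask is None or lanemask[col] == "1")
        and any(s[col] not in gap_chars for s in seqs)
    ]
    return {name: "".join(seq[c] for c in keep) for name, seq in aligned.items()}
```

Without a lanemask only the all-gap columns are removed; with one, the mask is applied first and any remaining all-gap columns are dropped as well.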
2.3 Generate Phylogenetic Tree
After the multiple sequence alignment of the reference sequence set is done, the phylogenetic tree can be produced. The methods provided in QIIME include clearcut, clustalw, raxml, fasttree_v1, fasttree, raxml_v730 and muscle. The default is FastTree (Price, Dehal, & Arkin, 2009), which took only a few seconds to generate a tree from around 200 sequences in my test.
Reference: