The Fungi Phylogenetic Project: 2013

Friday, October 11, 2013

Revised 599nts and 999nts: RAxML vs Clustering Using Spherical Phylogenetic Tree

Dataset

1. revised 599nts
This work follows up on the earlier post at http://salsafungiphy.blogspot.com/2013/09/pairwise-distances-from-msa-vs-pairwise.html
This dataset contains 831 sequences.
2. 999nts
This work follows up on the earlier post at http://salsafungiphy.blogspot.com/2013/06/pairwise-distances-from-multiple.html
This dataset contains 1306 sequences.

Alignment

The clustering was based on distances from
1) SWG (Smith-Waterman-Gotoh) pairwise local alignment, using the EDNAFULL scoring matrix with gap open = -16 and gap extension = -4
2) Multiple sequence alignment (MSA), using percent identity (PID).
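
As a rough illustration of the SWG step, here is a minimal Python sketch of local alignment scoring with affine gaps, using the gap values listed above. It is not the code used for these runs. The function name swg_score is made up for this post, the scoring is a simplified stand-in for EDNAFULL (+5 for identical bases, -4 otherwise, with no special handling of ambiguity codes), and it assumes that a gap of length k costs gap open plus (k - 1) times gap extension. It returns only the best local score; the PID step would additionally need a traceback, which is left out here.

```python
# Simplified stand-in for EDNAFULL (assumption): identical bases score +5,
# everything else scores -4. Ambiguity codes are not treated specially.
MATCH, MISMATCH = 5, -4
GAP_OPEN, GAP_EXTEND = -16, -4


def swg_score(a, b, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Best local alignment score of a and b with affine gaps (score only).

    Assumed gap model: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # same, but ending with a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # best score ending with a horizontal gap
        for j in range(1, cols):
            # Either open a new gap or extend the one already running.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            # The 0 lets a local alignment restart anywhere.
            h_cur[j] = max(0, h_prev[j - 1] + sub, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


if __name__ == "__main__":
    print(swg_score("ACGT", "ACGT"))  # four matches at +5 each: 20
```

Only two rows of the score matrix are kept at a time, so memory grows with the length of one sequence rather than with the product of both lengths.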

RAxML was run on the multiple sequence alignments:
1) revised 599nts, Newick file
2) 999nts, Newick file

Spherical Tree

Information on the spherical tree can be found in the previous posts at http://salsafungiphy.blogspot.com/2012/11/phylogenetic-tree-generation-for.html and http://salsafungiphy.blogspot.com/2012/11/phylogenetic-tree-mega-table.html

Dimension Reduction

Manxcat SMACOF:
The pviz file for the SWG clustering result with revised 599nts is here, and for 999nts is here.
The pviz file for the MSA clustering result with revised 599nts is here, and for 999nts is here.
WDA-SMACOF:
alpha set to 0.95
The pviz file for the SWG clustering result with revised 599nts is here, and for 999nts is here.
The pviz file for the MSA clustering result with revised 599nts is here, and for 999nts is here.

Sum of branch lengths (edge sum)

(Note that the difference in edge sum between SWG and MSA is most likely due to the difference in the original distances behind the clustering plots.)

Method   Manxcat SMACOF               WDA-SMACOF
         revised 599nts   999nts      revised 599nts   999nts
SWG      19.37            19.89       16.01            12.74
MSA      16.62            16.36       15.20            13.65
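
For readers who want to reproduce an edge sum, the following is a minimal Python sketch. It assumes (this post does not spell out the storage format) that the spherical tree is available as a set of 3D node coordinates plus a list of edges between node names. The names edge_sum, coords and edges, and the toy coordinates, are hypothetical.

```python
import math


def edge_sum(coords, edges):
    """Sum of Euclidean branch lengths over all edges of a tree.

    coords maps a node name to its (x, y, z) position;
    edges is a list of (node, node) pairs. Both layouts are assumptions.
    """
    return sum(math.dist(coords[u], coords[v]) for u, v in edges)


if __name__ == "__main__":
    # Toy tree, not real data: two branches of length 5 and 12.
    coords = {
        "root": (0.0, 0.0, 0.0),
        "inner": (3.0, 4.0, 0.0),
        "leafA": (3.0, 4.0, 12.0),
    }
    edges = [("root", "inner"), ("inner", "leafA")]
    print(edge_sum(coords, edges))  # 5 + 12 = 17.0
```

Because the branch lengths depend on the scale of the underlying distances, edge sums are only comparable between trees built from the same distance type.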

1) revised 599nts spherical tree

MSA Result

SWG Result

2) 999nts spherical tree

MSA Result

SWG Result

Sunday, September 8, 2013

Pairwise Distances from MSA Vs Pairwise Local Alignment Distances for Revised 599Nts

The following heatmap presents the correlation between distances computed from multiple sequence alignment (MSA) and distances computed from pairwise local alignment for the revised 599Nts dataset (MSA is available here and sequence file is here).

Details on MSA based distance computation are at http://salsafungiphy.blogspot.com/2013/06/pairwise-distance-calculation.html

Details on pairwise local alignment based distance computation are at http://salsafungiphy.blogspot.com/2012/10/pairwise-distances-from-multiple.html

Wednesday, September 4, 2013

Summary of Multi-Dimensional Scaling Runs

Monday, June 17, 2013

Pairwise Distances from Multiple Sequence Alignment (MSA) Vs Pairwise Local Alignment Distances

The following heatmaps present the correlation between distances computed from multiple sequence alignment (MSA) and distances computed from pairwise local alignment. Distance computation from MSA is explained in the previous post at http://salsafungiphy.blogspot.com/2013/06/pairwise-distance-calculation.html. Distances from pairwise local alignment follow the usual recipe: Smith-Waterman alignment with -16/-4 gap penalties and the EDNAFULL scoring matrix, followed by percent identity based distance computation. A similar study was done with previous data as well (see http://salsafungiphy.blogspot.com/2012/10/pairwise-distances-from-multiple.html).
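
The post does not say how the heatmaps were generated, so the following Python sketch shows just one plausible way to prepare such a comparison: bin each (MSA distance, local alignment distance) pair into a 2D grid of counts and compute a Pearson correlation. The function names, the bin count and the assumption that distances lie in [0, 1] are illustrative choices, not the original pipeline.

```python
import math


def bin_distance_pairs(msa_d, sw_d, bins=10, max_d=1.0):
    """Count (msa, sw) distance pairs per cell of a bins x bins grid."""
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(msa_d, sw_d):
        # Clamp so that a distance equal to max_d falls in the last bin.
        i = min(int(x / max_d * bins), bins - 1)
        j = min(int(y / max_d * bins), bins - 1)
        grid[i][j] += 1
    return grid


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values, not real distances.
    msa = [0.0, 0.1, 0.5, 1.0]
    sw = [0.0, 0.2, 0.5, 1.0]
    print(bin_distance_pairs(msa, sw, bins=2))  # [[2, 0], [0, 2]]
    print(pearson(msa, sw))
```

Counts concentrated along the diagonal of the grid indicate that the two distance measures agree; the grid can then be passed to any plotting tool as a heatmap.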

599Nts data set

999Nts data set

Thursday, June 6, 2013

Pairwise Distance Calculation

The distance between each pair of sequences is the input to our multidimensional scaling and clustering algorithms. The usual form of distance we've been using is based on percent identity, as described here. However, since the sequences are already aligned with respect to each other, our collaborators were interested in computing pairwise distances from the multiple sequence alignment (MSA) itself, using the strategy below. This appears to be identical to the method we adopted previously, described as PIDNG with Strict DNA Alphabet here.

The following content is copied from the email sent by Geoffrey House.

Pairwise gap deletion strategy for AMF clustering project

May 21, 2013

Yellow highlight – Same base; base position is counted in length

Red highlight – Different base; base position is counted in length

Green highlight – Skip this position (there is a gap in one or both of the sequences); base position not counted in length

Pink highlight - Skip this position (there is an ambiguous nucleotide in one or both of the sequences); base position not counted in length

Sequence 1: -ATG---WCT

Sequence 2: -TTGC--W-A

Sequence 3: GATGY---GT

Pairwise comparison 1 between sequences 1 and 2 (the length of the comparison is 4; 2 base positions are different):

-ATG---WCT

-TTGC--W-A

Pairwise comparison 2 between sequences 1 and 3 (the length of the comparison is 5; 1 base position is different):

-ATG---WCT

GATGY---GT

Pairwise comparison 3 between sequences 2 and 3 (the length of the comparison is 4; 2 base positions are different):

-TTGC--W-A

GATGY---GT
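
The three comparisons above can be checked with a short Python sketch of the pairwise gap deletion rule. The function names are made up for this post, "unambiguous" is assumed to mean the strict DNA alphabet A, C, G, T, and the final distance is assumed to be the fraction of counted positions that differ.

```python
STRICT_DNA = set("ACGT")  # assumption: anything else is a gap or ambiguous


def compare_pair(s1, s2):
    """Return (comparison length, number of differing positions).

    A column is counted only when both sequences have an unambiguous
    base there; gaps and ambiguity codes in either sequence are skipped.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must come from the same alignment")
    length = diffs = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a not in STRICT_DNA or b not in STRICT_DNA:
            continue
        length += 1
        if a != b:
            diffs += 1
    return length, diffs


def pairwise_distance(s1, s2):
    """Fraction of counted positions that differ (None if nothing is counted)."""
    length, diffs = compare_pair(s1, s2)
    return diffs / length if length else None


if __name__ == "__main__":
    seq1, seq2, seq3 = "-ATG---WCT", "-TTGC--W-A", "GATGY---GT"
    print(compare_pair(seq1, seq2))  # (4, 2)
    print(compare_pair(seq1, seq3))  # (5, 1)
    print(compare_pair(seq2, seq3))  # (4, 2)
```

Note that the comparison length differs from pair to pair, so each distance is normalized by its own count of usable columns rather than by the alignment length.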

Tuesday, June 4, 2013

Phase Two of Phylogenetic Work

In this phase we received two sets of refined sequences. The sequences in these sets come from what we have been calling the Kruger and Wittaya sequences. Posts on this blog from here on refer to these new data sets unless noted otherwise.

The sequences are already aligned using multiple sequence alignment, and we name the two data sets 599Nts and 999Nts because they contain 599 and 999 base spans, respectively, in the aligned form. The total numbers of sequences in these sets are:

599Nts -- 1821 sequences
999Nts -- 1306 sequences

The description of this data that we received from our collaborators (copied from an email conversation with Geoffrey House and modified for the blog) is given below, along with the graph.

We created the full sequence data set (for the large subunit (LSU) sequences only) by taking the aligned sequences that were used in Kruger et al. 2012 (and that were 400-1300 bp long), and adding Wittaya's sequences to the existing alignment using MAFFT and its --add option.

We then made line graphs of the number of sequences that contained a base at each sequence position. The top panel is the sequence coverage of the combined data set (Kruger and Wittaya), the middle panel graphs the sequence coverage of only the Kruger sequences within the combined data set, and the bottom panel graphs the sequence coverage of only the Wittaya sequences within the combined data set.

From the graphs, we saw that a large number of sequences contained nucleotides along a 999 base span. Even more sequences contained nucleotides along a 599 base span that was nested within the 999 base span. The red rectangles above the top panel of the graph denote the approximate sequence locations of these two different base length spans.

This is how we derived the two sets of sequences: one is all of the sequences that span the 599 bases, and the other is all of the sequences that span the 999 bases (which essentially represents a subset of the sequences spanning the 599 base length).

Finally, for adding Wittaya's sequences to Kruger's, we used the default alignment option in MAFFT, which is very fast but possibly a bit rough. We may need to slightly refine the MAFFT alignments later in the project to improve their accuracy, but that should have a fairly small effect on the length of the aligned sequences.
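
The coverage counting and span selection described above can be sketched in a few lines of Python. This is only an illustration, not the collaborators' script: the function names are hypothetical, "-" is assumed to be the only gap character, and a sequence is assumed to "span" a window when its first base lies at or before the window start and its last base lies at or after the window end.

```python
def coverage(aligned):
    """Number of sequences with a base (non-gap) at each alignment column."""
    width = len(aligned[0])
    return [sum(1 for s in aligned if s[i] != "-") for i in range(width)]


def spans_window(seq, start, end):
    """True if seq has bases on both sides of the window [start, end).

    Columns are 0-based. Internal gaps inside the window are tolerated,
    which is an assumption about what "spanning" means.
    """
    bases = [i for i, c in enumerate(seq) if c != "-"]
    return bool(bases) and bases[0] <= start and bases[-1] >= end - 1


if __name__ == "__main__":
    # Toy alignment, not real data.
    aligned = ["--ACGT--", "AACG-TTT", "----GT--"]
    print(coverage(aligned))  # [1, 1, 2, 2, 2, 3, 1, 1]
    print([s for s in aligned if spans_window(s, 2, 6)])  # first two sequences
```

Under this definition, any sequence that spans a wider window also spans every window nested inside it, which matches the subset relationship between the two data sets described above.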