Showing posts with label Phylogenetic. Show all posts

Sunday, September 8, 2013

Pairwise Distances from MSA Vs Pairwise Local Alignment Distances for Revised 599Nts

The following heatmap presents the correlation between distances computed from a multiple sequence alignment (MSA) and distances computed from pairwise local alignment for the revised 599Nts dataset (MSA is available here and sequence file is here).

Details on the MSA-based distance computation are at http://salsafungiphy.blogspot.com/2013/06/pairwise-distance-calculation.html

Details on the pairwise local alignment-based distance computation are at http://salsafungiphy.blogspot.com/2012/10/pairwise-distances-from-multiple.html
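
A heatmap of this kind can be thought of as a 2D histogram over pairs of distances: for every pair of sequences, one distance goes on each axis and the cell it lands in is incremented. The following is a minimal sketch of that binning, assuming both distance matrices are square, symmetric, and scaled to [0, 1]. The function names and the toy values are hypothetical and are not taken from the plotting code used for this blog.

```python
from itertools import combinations


def paired_distances(dist_a, dist_b):
    """Yield (a, b) for every unordered pair i < j of two square matrices."""
    n = len(dist_a)
    for i, j in combinations(range(n), 2):
        yield dist_a[i][j], dist_b[i][j]


def density_grid(pairs, bins=4, lo=0.0, hi=1.0):
    """Count pairs per cell of a bins x bins grid over [lo, hi] on both axes."""
    grid = [[0] * bins for _ in range(bins)]
    width = (hi - lo) / bins
    for a, b in pairs:
        # Clamp so that a value equal to hi falls into the last bin.
        col = min(int((a - lo) / width), bins - 1)
        row = min(int((b - lo) / width), bins - 1)
        grid[row][col] += 1
    return grid


# Toy 3-sequence example (made-up numbers, not from the 599Nts data).
msa = [[0.0, 0.10, 0.60], [0.10, 0.0, 0.55], [0.60, 0.55, 0.0]]
local = [[0.0, 0.12, 0.70], [0.12, 0.0, 0.52], [0.70, 0.52, 0.0]]
grid = density_grid(paired_distances(msa, local))
```

If the two distance measures agree well, most of the counts fall on or near the diagonal of the grid.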



Tuesday, June 4, 2013

Phase Two of Phylogenetic Work

In this phase we received two sets of refined sequences. The sequences in these sets come from what we have been calling the Kruger and Wittaya sequences. From here on, posts on this blog refer to these new data sets unless noted otherwise.

The sequences are already aligned using multiple sequence alignment, and we name the two data sets 599Nts and 999Nts because they contain 599 and 999 base spans, respectively, in the aligned form. The total numbers of sequences in these sets are:

  • 599Nts -- 1821 sequences
  • 999Nts -- 1306 sequences

The description of this data that we received from our collaborators (copied from an email conversation with Geoffrey House and modified for the blog) is given below, along with the graph.






We created the full sequence data set (for the large subunit (LSU) sequences only) by taking the aligned sequences that were used in Kruger et al. 2012 (and that were 400-1300 bp long), and adding Wittaya's sequences to the existing alignment using MAFFT with its --add option.

We then made line graphs of the number of sequences that contained a base at each sequence position. The top panel is the sequence coverage of the combined data set (Kruger and Wittaya), the middle panel graphs the sequence coverage of only the Kruger sequences within the combined data set, and the bottom panel graphs the sequence coverage of only the Wittaya sequences within the combined data set.
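
The coverage curve described here is a per-column count over the alignment. Below is a minimal sketch of that count, assuming the aligned sequences are available as equal-length strings and that gaps are written as "-" or ".". The function name and the toy alignment are hypothetical, and this is not the script used to draw the graphs.

```python
def column_coverage(aligned, gap_chars="-."):
    """Return, for each alignment column, how many sequences have a base there."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [
        sum(1 for seq in aligned if seq[col] not in gap_chars)
        for col in range(length)
    ]


# Toy alignment of three sequences over eight columns (made-up data).
toy = ["--ACGT--", "-AACGTT-", "---CG---"]
coverage = column_coverage(toy)
```

Plotting `coverage` against the column index gives one panel. Restricting `aligned` to only the Kruger or only the Wittaya sequences would give the other two.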


From the graphs, we saw that a large number of sequences contained nucleotides along a 999 base span.  Even more sequences contained nucleotides along a 599 base span that was nested within the 999 base span.  The red rectangles above the top panel of the graph denote the approximate sequence locations of these two different base length spans.


This is how we derived the two sets of sequences: one is all of the sequences that span the 599 bases, and the other is all of the sequences that span the 999 bases (which essentially represents a subset of the sequences spanning the 599 base length).
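
The selection step can be sketched as a filter over the aligned sequences. The sketch below assumes that "spanning" a window means a sequence has its first base at or before the window start and its last base at or after the window end, and that selected sequences are then trimmed to the window. Both are assumptions made for illustration, as are the function names and the toy data. The collaborators' exact criterion is not stated in the description above.

```python
def spans_window(seq, start, end, gap_chars="-."):
    """True if seq has a base at or before start and at or after end (0-based, inclusive)."""
    positions = [i for i, ch in enumerate(seq) if ch not in gap_chars]
    return bool(positions) and positions[0] <= start and positions[-1] >= end


def select_spanning(records, start, end):
    """Keep sequences that span [start, end] and trim them to that window."""
    return {
        name: seq[start:end + 1]
        for name, seq in records.items()
        if spans_window(seq, start, end)
    }


# Toy alignment (made-up data): a short window nested inside a longer one.
toy = {"s1": "--ACGT--", "s2": "-AACGTT-", "s3": "---CG---"}
short_set = select_spanning(toy, 3, 4)
long_set = select_spanning(toy, 2, 5)
```

With nested windows, every sequence that spans the longer window also spans the shorter one, which is why the longer-span set comes out as a subset of the shorter-span set.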

Finally, to add Wittaya's sequences to Kruger's, we used the default alignment option in MAFFT, which is very fast but possibly a bit rough.  We may need to slightly refine the MAFFT alignments later in the project to improve their accuracy, but that should have a fairly small effect on the length of the aligned sequences.

Wednesday, November 28, 2012

Robinson-Foulds Distances

Here we present the Robinson-Foulds metric computed for trees with 2133 and 200 Fungi taxa. Details on these datasets are presented at http://salsafungiphy.blogspot.com/2012/11/tree-distance-heatmaps.html

Metric for Fungi 2133

We have the following five trees generated from this dataset.

  • Tree A
    • Pairwise sequence alignment with the Smith-Waterman algorithm, using the EDNAFULL scoring matrix and gap penalties of -16 (gap open) and -4 (gap extension).
    • Distance measure between sequences is percent identity
    • Tree created with Ninja
  • Tree B
    • Multiple sequence alignment with MUSCLE
    • Tree created with RAxML
  • Tree C
    • Distances between sequences are taken as the 3D distances resulting from DA-SMACOF multidimensional scaling of the Smith-Waterman percent identity distances (used in Tree A).
    • Tree created with Ninja
  • Tree D
    • Distances between sequences are taken as the 10D distances resulting from Manxcat (a multidimensional scaling program) running in SMACOF mode on the Smith-Waterman percent identity distances (used in Tree A).
    • Tree created with Ninja
  • Tree E
    • Distances between sequences are taken as the 10D distances resulting from DA-SMACOF multidimensional scaling of the Smith-Waterman percent identity distances (used in Tree A).
    • Tree created with Ninja
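
Several of these trees start from a percent identity distance between pairwise-aligned sequences. As a rough illustration only, the sketch below computes one common form of it from an already aligned pair: the fraction of compared columns that match, turned into a distance as 1 minus that fraction. Whether gap columns count toward the denominator is left as a switch, because the exact definition used for these trees is not given in this post. The function names and the example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, count_gaps=True, gap="-"):
    """Fraction of compared columns in which the two aligned sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        has_gap = a == gap or b == gap
        if has_gap and not count_gaps:
            continue  # skip gap columns entirely in the "no gaps" variant
        compared += 1
        if not has_gap and a == b:
            matches += 1
    return matches / compared if compared else 0.0


def pid_distance(aligned_a, aligned_b, count_gaps=True):
    """Distance defined as 1 - percent identity (an assumed convention)."""
    return 1.0 - percent_identity(aligned_a, aligned_b, count_gaps)


# Made-up aligned pair: four matching columns and one gap column.
with_gaps = pid_distance("ACG-T", "ACGAT")
without_gaps = pid_distance("ACG-T", "ACGAT", count_gaps=False)
```

Counting the gap column gives an identity of 4/5, while ignoring it gives 4/4, so the choice of definition changes the resulting distance matrix and therefore the tree.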

Robinson-Foulds metric for trees

          Tree A  Tree B  Tree C  Tree D  Tree E
Tree A         0    3484    3966    3726    3740
Tree B      3484       0    3956    3796    3806
Tree C      3966    3956       0    3438    3392
Tree D      3726    3796    3438       0    1282
Tree E      3740    3806    3392    1282       0
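
The Robinson-Foulds distance between two trees on the same taxa counts the bipartitions (splits) of the leaf set that appear in one tree but not in the other. The sketch below is a minimal illustration under stated assumptions: trees are written as nested Python tuples of leaf names (a hypothetical representation chosen for brevity, not the output format of Ninja or RAxML), and they are compared as unrooted trees. It is not the program used to produce the values in this post.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)  # fixed leaf used to pick one canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits that separate zero, one, or all-but-one leaves.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Size of the symmetric difference of the two trees' split sets."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


# Toy five-taxon trees (made-up topologies).
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = ((("A", "D"), "C"), ("B", "E"))
rf_12 = robinson_foulds(t1, t2)
rf_13 = robinson_foulds(t1, t3)
```

For two unrooted trees on n taxa the value is at most 2(n - 3), which is 4260 for n = 2133. The recursive traversal is fine for small examples, but very deep trees may need an iterative version to stay within Python's recursion limit.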

         

Monday, November 5, 2012

Tree Distance Heatmaps

This is an attempt to find a “goodness” measure for phylogenetic trees generated by programs such as Ninja (http://nimbletwist.com/software/ninja/index.html) or RAxML (http://www.phylo.org/news/RAxML). For each pair of sequences we calculate a distance based on the structure of the tree, which we refer to as the “Tree Distance”. We then compare it against the original pairwise distances computed for the sequences from either local alignment or multiple sequence alignment. The definition of tree distance is not unique; we currently use the “Edge Sum” definition given below. Edge Count is another possible definition, though we have not tested it yet.
  • Edge Sum
    • Given two sequences A and B, we find the shortest path from A to B in the tree and sum the values on the edges along that path.
  • Edge Count
    • Given two sequences A and B, we find the shortest path from A to B in the tree and count the number of edges in it.
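
The two definitions above can be sketched as follows. This is a minimal illustration, assuming the tree is available as a list of (node, node, length) edges. The helper names and the toy tree are hypothetical, and this is not the code used to generate the heatmaps. Because a tree has exactly one path between any two nodes, a simple depth-first search that never steps back to the node it came from is enough to find it.

```python
def build_adjacency(edges):
    """Turn (u, v, length) edges into a symmetric adjacency mapping."""
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, {})[v] = length
        adjacency.setdefault(v, {})[u] = length
    return adjacency


def tree_path(adjacency, start, goal):
    """Return the edge lengths along the unique path from start to goal."""
    stack = [(start, None, [])]
    while stack:
        node, parent, lengths = stack.pop()
        if node == goal:
            return lengths
        for neighbor, length in adjacency[node].items():
            if neighbor != parent:  # never walk back along the edge we came from
                stack.append((neighbor, node, lengths + [length]))
    raise ValueError("no path between the given nodes")


def edge_sum(adjacency, a, b):
    """Tree distance as the sum of edge lengths on the path from a to b."""
    return sum(tree_path(adjacency, a, b))


def edge_count(adjacency, a, b):
    """Tree distance as the number of edges on the path from a to b."""
    return len(tree_path(adjacency, a, b))


# Toy tree with four leaves (A-D) and two internal nodes (made-up lengths).
edges = [("A", "n1", 0.1), ("B", "n1", 0.2), ("n1", "n2", 0.05),
         ("C", "n2", 0.3), ("D", "n2", 0.4)]
adj = build_adjacency(edges)
ab_sum, ab_count = edge_sum(adj, "A", "B"), edge_count(adj, "A", "B")
ac_sum, ac_count = edge_sum(adj, "A", "C"), edge_count(adj, "A", "C")
```

Computing this for every pair of leaves gives a tree distance matrix that can be compared cell by cell with the original alignment-based distance matrix.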
We have performed this analysis for two datasets, Fungi 200 and Fungi 2133.

Heatmaps for Fungi 200

  • 3D DA-SMACOF of SWG PID Vs SWG PID (heatmap)
  • Edge Sum Ninja from SWG PID Vs 3D DA-SMACOF of SWG PID (heatmap)
  • Edge Sum Ninja from SWG PID Vs SWG PID (heatmap)

Heatmaps of Fungi 2133

  • 3D DA-SMACOF of SWG PID Vs SWG PID (heatmap)
  • Edge Sum Ninja from 3D DA-SMACOF of SWG PID Vs 3D DA-SMACOF of SWG PID (heatmap)
  • Edge Sum Ninja from 3D DA-SMACOF of SWG PID Vs SWG PID (heatmap)
  • Edge Sum RAxML (20 iterations) from ClustalO MSA Vs SWG PID (heatmap)
  • Edge Sum RAxML (20 iterations) from ClustalO MSA (distances from 0 to 0.15) Vs SWG PID (heatmap)
  • Edge Sum RAxML (20 iterations) from ClustalO MSA Vs Strict PIDNG (heatmap)
  • Edge Sum RAxML (20 iterations) from ClustalO MSA (distances from 0 to 0.15) Vs Strict PIDNG (heatmap)