

6 Testing

This chapter was written by Peter van Heusden.

This section was written with respect to an early version of wcd 0.3. The code and performance have changed significantly since that time.

wcd was compared against d2_cluster on a dual-processor Pentium III system, with processors running at 1 GHz and 1.5 GB of RAM.

The test dataset was created from 23300 ESTs and 315 mRNAs known to be associated with 27 genes on chromosome 22 that are known to be alternatively spliced. The average length of the mRNAs was 4923 bases and that of the ESTs was 633 bases.

Input sequences were masked for repeats and contamination using cross_match.

Three test runs were done: the first with only ESTs, the second with only mRNAs, and the third with a dataset comprising all ESTs and mRNAs. Only a single CPU was used for processing, and run times were as follows:

ESTs only:
    wcd: 1 hour 4 minutes (3870 seconds user cpu, 16 seconds system cpu)
    d2_cluster: 1 hour 2 minutes (3704 seconds user cpu, 14 seconds system cpu)

mRNAs only:
    wcd: 5 hours 16 minutes (19014 seconds user cpu, 24 seconds system cpu)
    d2_cluster: 3 minutes 20 seconds (198 seconds user cpu, 0.7 seconds system cpu)

ESTs and mRNAs combined:
    wcd: 5 hours 22 minutes (19353 seconds user cpu, 2.23 seconds system cpu)
    d2_cluster: 1 hour 21 minutes (4905 seconds user cpu, 1 second system cpu)
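To make the relative cost easier to read off, the short Python sketch below divides the total CPU time (user plus system) of wcd by that of d2_cluster for each run. The figures are copied from the run times above; the dictionary layout and the function name slowdown are only illustrative.

```python
# CPU times in seconds (user + system), copied from the run times above.
cpu_seconds = {
    "ESTs only": {"wcd": 3870 + 16, "d2_cluster": 3704 + 14},
    "mRNAs only": {"wcd": 19014 + 24, "d2_cluster": 198 + 0.7},
    "ESTs and mRNAs combined": {"wcd": 19353 + 2.23, "d2_cluster": 4905 + 1},
}


def slowdown(times):
    """Return how many times more CPU time wcd used than d2_cluster."""
    return times["wcd"] / times["d2_cluster"]


ratios = {name: slowdown(times) for name, times in cpu_seconds.items()}

for name, ratio in ratios.items():
    print(f"{name}: wcd used {ratio:.2f}x the CPU time of d2_cluster")
```

On these figures the two programs are close to parity on ESTs, while the ratio is far larger on the mRNA-only run.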

As can be seen from these results, wcd is currently significantly slower than d2_cluster when dealing with mRNA data. wcd's performance can be improved by increasing the number of words that two sequences must have in common before wcd will do a detailed comparison. This parameter is set with the -H flag. The price of this increase in performance will be a decrease in sensitivity.
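The trade-off between speed and sensitivity can be illustrated with a minimal Python sketch of a common-word pre-filter. This is not wcd's implementation: the word length, the choice to count distinct shared words, the function names, and the toy sequences are all assumptions made for illustration. Only the idea of requiring a minimum number of common words before doing a detailed comparison comes from the text above.

```python
def words(seq, k):
    """Return the set of all length-k words (substrings) that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_enough_words(seq_a, seq_b, k, min_common):
    """Cheap pre-filter: True if the sequences share at least min_common distinct words.

    Only pairs that pass this test would go on to the expensive detailed comparison.
    """
    return len(words(seq_a, k) & words(seq_b, k)) >= min_common


# Hypothetical toy sequences; a and b overlap, c is unrelated.
a = "ACGTACGTGACC"
b = "TTACGTGACCAA"
c = "GGGGGGGGGGGG"

# With a low threshold the overlapping pair is passed on for detailed comparison.
print(shares_enough_words(a, b, k=4, min_common=3))
# A higher threshold skips more pairs (faster), but here it also drops a related pair.
print(shares_enough_words(a, b, k=4, min_common=7))
# The unrelated pair is skipped even at the lowest threshold.
print(shares_enough_words(a, c, k=4, min_common=1))
```

Raising the threshold reduces the number of detailed comparisons, which is where the time goes, but pairs whose overlap is short can then be missed.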

In terms of sensitivity, the following results show that wcd has comparable sensitivity to d2_cluster in finding similarities between EST and mRNA sequences.

For the dataset of EST sequences, the following clusters were found: wcd found 125 clusters consisting of 12520 sequences; d2_cluster found 129 clusters consisting of 12690 sequences.

A detailed comparison of the clustering results shows that wcd joined together 6 clusters into 3 in its results, whereas d2_cluster joined together 8 clusters into 2 in turn. In the results from wcd, 179 sequences were singletons that were in clusters in the d2_cluster results, whereas in the d2_cluster results, 9 sequences were singletons that were in clusters in the wcd results.

While these results suggest that d2_cluster is marginally more successful than wcd in assigning sequences to clusters, the difference between the results is not significant (only 0.76% of sequences were singletons in the wcd results but in a cluster in the d2_cluster results).
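The singleton comparison described above can be sketched in Python as follows. The representation of a clustering as a list of sets of sequence identifiers, the function names, and the toy data are assumptions for illustration; the scripts actually used for the comparison are not described in this chapter.

```python
def clustered_ids(clusters):
    """Return the ids that belong to a cluster with at least two members."""
    return {seq_id for cluster in clusters if len(cluster) > 1 for seq_id in cluster}


def singletons_clustered_elsewhere(clusters_a, clusters_b):
    """Return ids that clustering A leaves as singletons but clustering B puts in a cluster."""
    return clustered_ids(clusters_b) - clustered_ids(clusters_a)


# Hypothetical toy clusterings of six sequences.
result_a = [{"e1", "e2"}, {"e3"}, {"e4"}, {"e5", "e6"}]
result_b = [{"e1", "e2", "e3"}, {"e4"}, {"e5"}, {"e6"}]

only_clustered_in_b = singletons_clustered_elsewhere(result_a, result_b)
only_clustered_in_a = singletons_clustered_elsewhere(result_b, result_a)

print(sorted(only_clustered_in_b))
print(sorted(only_clustered_in_a))
```

Dividing the size of either set by the number of sequences in the dataset gives a percentage of the kind quoted above.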

For the dataset of mRNA sequences, the following clusters were found: wcd found 26 clusters consisting of 265 sequences; d2_cluster found 26 clusters consisting of 270 sequences.

As in the EST results, d2_cluster assigned more sequences (5 or 1.58% of the dataset) to clusters than wcd did. Again, however, the results are not significantly different in terms of sensitivity.

For the dataset of all (EST combined with mRNA) sequences, the following clusters were found: wcd found 83 clusters consisting of 12852 sequences; d2_cluster found 82 clusters consisting of 13026 sequences.

Comparing the combined dataset with the EST-only dataset shows that adding mRNAs reduces fragmentation: wcd found 83 clusters instead of 125, and d2_cluster found 82 instead of 129.

A detailed comparison of the clustering results shows that wcd joined together 4 clusters into 2 in its results, whereas d2_cluster joined together 9 clusters into 2 in turn. In the results from wcd, 181 sequences (0.76% of the dataset) were singletons that were in clusters in the d2_cluster results, whereas in the d2_cluster results, 7 sequences were singletons that were in clusters in the wcd results.

Again, the results show that d2_cluster is marginally more successful at assigning sequences to clusters than wcd is, but that overall the difference in results between the two programs is not significant.

Comment by SH

Testing of the difference in quality between the d2_cluster program and wcd is a little tricky. In principle, it is highly unlikely there is any real difference. Also note that wcd 0.4 is significantly faster than 0.3.

Two methodological points: cluster size is only a very rough measure of correctness; and the only valid comparison is with a known correct answer.
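As one possible way to compare against a known correct answer, the Python sketch below scores a clustering by the pairs of sequences it places together. Pairwise sensitivity and precision are chosen here purely as an illustration; they are not necessarily the measures used by the authors, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return every unordered pair of ids that is placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(pair) for pair in combinations(sorted(cluster), 2))
    return pairs


def pairwise_scores(predicted, reference):
    """Return (sensitivity, precision) of a predicted clustering against a reference.

    sensitivity: fraction of reference pairs that the prediction also joins.
    precision:   fraction of predicted pairs that the reference also joins.
    """
    pred = co_clustered_pairs(predicted)
    ref = co_clustered_pairs(reference)
    true_pos = len(pred & ref)
    sensitivity = true_pos / len(ref) if ref else 1.0
    precision = true_pos / len(pred) if pred else 1.0
    return sensitivity, precision


# Hypothetical reference (known correct) clustering and a predicted one.
reference = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c", "d", "e"}]

sensitivity, precision = pairwise_scores(predicted, reference)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Unlike a count of clusters or of clustered sequences, a pair-based score penalises both wrongly split and wrongly merged clusters.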

Research we have done using different distance measures has shown that the parameters used can be more important than which distance measure is used. wcd and d2_cluster have slightly different default parameters. Changing the parameters will change the results. If you use the right parameters, you will get good answers; if you don't, you won't.

Moreover, changing the heuristics slightly can change the performance dramatically. Changing some of the heuristic parameters will speed up clustering by more than a factor of 2 with little impact on quality.