This section was written with respect to an early version of wcd
0.3. The code and performance have changed significantly since that time.
wcd was compared against d2_cluster on a dual-processor Pentium III system, with processors running at 1 GHz and 1.5 GB of RAM.
The test dataset was created from 23300 ESTs and 315 mRNAs associated with 27 genes on chromosome 22 that are known to be alternatively spliced. The average length of the mRNAs was 4923 bases, and that of the ESTs was 633 bases.
Input sequences were masked for repeats and contamination using
cross_match.
Three test runs were done, first with only ESTs, second with only mRNAs and finally with a dataset comprising all ESTs and mRNAs. Only a single CPU was used for processing, and run times were as follows:
ESTs only:
wcd 1 hour 4 minutes (3870 seconds user cpu, 16 seconds system cpu)
d2_cluster 1 hour 2 minutes (3704 seconds user cpu, 14 seconds system cpu)
mRNAs only:
wcd 5 hours 16 minutes (19014 seconds user cpu, 24 seconds system cpu)
d2_cluster 3 minutes 20 seconds (198 seconds user cpu, 0.7 seconds system cpu)
ESTs and mRNAs combined:
wcd 5 hours 22 minutes (19353 seconds user cpu, 2.23 seconds system cpu)
d2_cluster 1 hour 21 minutes (4905 seconds user cpu, 1 second system cpu)
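To make the relative cost easier to see, the user CPU times reported above can be turned into ratios. The following is only a small arithmetic sketch; the dictionary layout and the function name `slowdown` are chosen here for illustration and are not part of either program.

```python
# User CPU seconds copied from the run times reported above.
user_cpu = {
    "ESTs only": {"wcd": 3870, "d2_cluster": 3704},
    "mRNAs only": {"wcd": 19014, "d2_cluster": 198},
    "ESTs and mRNAs combined": {"wcd": 19353, "d2_cluster": 4905},
}


def slowdown(times):
    """Return how many times more user CPU wcd used than d2_cluster."""
    return times["wcd"] / times["d2_cluster"]


for dataset, times in user_cpu.items():
    print(f"{dataset}: wcd / d2_cluster = {slowdown(times):.2f}")
```

On these figures the two programs are close on ESTs alone, while the mRNA-only run accounts for almost all of the gap.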
As can be seen from these results, wcd is currently significantly
slower than d2_cluster when dealing with mRNA data. wcd's
performance can be improved by increasing the number of words that two
sequences must have in common before wcd will do a detailed
comparison. This parameter is set with the -H flag. The price of this
increase in performance will be a decrease in sensitivity.
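The trade-off behind this kind of threshold can be illustrated with a small sketch. This is not wcd's code: the word length of 6, the function names and the `min_common` parameter are assumptions made for the example, and wcd's actual heuristic may count words differently. The idea is only that a cheap count of shared words decides whether the expensive comparison is run at all.

```python
def words(seq, k):
    """Return the set of all length-k substrings (words) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_filter(seq_a, seq_b, k=6, min_common=5):
    """Cheap pre-filter: True if the two sequences share at least
    min_common distinct words of length k.

    min_common plays the role of the threshold described above
    (an assumption about its meaning, not wcd's implementation).
    """
    return len(words(seq_a, k) & words(seq_b, k)) >= min_common


a = "ACGTTGCAAGGCT"
b = "TTTTTTTTTTTTT"

# A pair that fails the filter is never compared in detail.
for threshold in (1, 5, 9):
    print(threshold, passes_filter(a, a, min_common=threshold),
          passes_filter(a, b, min_common=threshold))
```

Raising the threshold rejects more pairs before the detailed comparison, which saves time, but pairs with only a short region of similarity are then missed.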
In terms of sensitivity, the following results show that wcd has
comparable sensitivity to d2_cluster in finding similarities
between EST and mRNA sequences.
For the dataset of EST sequences, the following clusters were found:
wcd 125 clusters consisting of 12520 sequences.
d2_cluster 129 clusters consisting of 12690 sequences.
A detailed comparison of the clustering results shows that wcd
joined together 6 clusters into 3 in its results, whereas
d2_cluster joined together 8 clusters into 2 in turn. In the
results from wcd, 179 sequences were singletons that were in
clusters in the d2_cluster results, whereas in the
d2_cluster results, 9 sequences were singletons that were in
clusters in the wcd results.
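A comparison of this kind can be sketched as follows. The sketch assumes each clustering is available as a list of clusters, each cluster being a list of sequence identifiers, and that both clusterings cover the same sequences; the identifiers and function names are hypothetical and the data is a toy example, not the test dataset.

```python
def clustered_ids(clusters):
    """Return the ids of sequences that sit in a cluster of two or more."""
    return {seq for cluster in clusters if len(cluster) > 1 for seq in cluster}


def singleton_in_first_only(clusters_a, clusters_b):
    """Return ids that are singletons in clustering A but belong to a
    multi-sequence cluster in clustering B."""
    return clustered_ids(clusters_b) - clustered_ids(clusters_a)


# Toy clusterings of four sequences (hypothetical ids).
result_a = [["s1", "s2"], ["s3"], ["s4"]]
result_b = [["s1", "s2", "s3"], ["s4"]]

print(sorted(singleton_in_first_only(result_a, result_b)))
print(sorted(singleton_in_first_only(result_b, result_a)))
```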
While these results suggest that d2_cluster is marginally more
successful than wcd in assigning sequences to clusters, the
difference between the results is not significant (only 0.76% of sequences
were singletons in the wcd results but in a cluster in the
d2_cluster results).
For the dataset of mRNA sequences, the following clusters were found:
wcd 26 clusters consisting of 265 sequences.
d2_cluster 26 clusters consisting of 270 sequences.
As in the EST results, d2_cluster assigned more sequences (5, or
1.58% of the dataset) to clusters than wcd did. Again, however,
the results are not significantly different in terms of sensitivity.
For the dataset of all (EST combined with mRNA) sequences, the following clusters were found:
wcd 83 clusters consisting of 12852 sequences.
d2_cluster 82 clusters consisting of 13026 sequences.
As can be seen by comparing the combined dataset with the EST-only dataset, adding mRNAs reduces fragmentation: both programs place more sequences into fewer clusters.
A detailed comparison of the clustering results shows that wcd
joined together 4 clusters into 2 in its results, whereas
d2_cluster joined together 9 clusters into 2 in turn. In the
results from wcd, 181 sequences (0.76% of the dataset) were
singletons that were in clusters in the d2_cluster results,
whereas in the d2_cluster results, 7 sequences were singletons
that were in clusters in the wcd results.
Again, the results show that d2_cluster is marginally more
successful at assigning sequences to clusters than wcd is, but
that overall the difference in results between the two programs is not
significant.
Testing of the difference in quality between the d2_cluster program and
wcd is a little tricky. In principle, it is highly unlikely there
is any real difference. Also note that wcd 0.4 is significantly
faster than 0.3.
Two methodological points: cluster size is only a very rough measure of correctness; and the only valid comparison is with a known correct answer.
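One common way to compare a clustering with a known correct answer is to score pairs of sequences rather than cluster sizes. The sketch below is a generic illustration of that idea, not the method used for the results in this section; the function names and the toy data are assumptions.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered sequence pairs placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pair_scores(predicted, reference):
    """Return (sensitivity, precision) of the predicted pairs measured
    against the pairs of the reference (known correct) clustering."""
    pred = co_clustered_pairs(predicted)
    ref = co_clustered_pairs(reference)
    true_pos = len(pred & ref)
    sensitivity = true_pos / len(ref) if ref else 1.0
    precision = true_pos / len(pred) if pred else 1.0
    return sensitivity, precision


# Toy example: the prediction splits "c" off from its true cluster.
reference = [["a", "b", "c"], ["d", "e"]]
predicted = [["a", "b"], ["c"], ["d", "e"]]
print(pair_scores(predicted, reference))
```

Here a clustering that wrongly merges clusters loses precision, and one that fragments clusters loses sensitivity, which cluster counts alone do not distinguish.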
Research we have done using different distance measures has shown that
the parameters used can be more important than which distance measure is used.
wcd and d2_cluster have slightly different default
parameters. Changing the parameters will change the results. If you use
the right parameters, you will get good answers; if you don't, you won't.
Moreover, changing the heuristics slightly can change the performance dramatically. Changing some of the heuristic parameters will speed up clustering by more than a factor of 2 with little impact on quality.