Next: Installing wcd, Previous: Top, Up: Top
This manual describes wcd (pronounced `wicked'), a program that
performs clustering of sequences. Essentially, it takes a set of DNA sequences
(typically Expressed Sequence Tags, ESTs), and clusters the sequences
into the related groups.
It reads in a set of DNA sequences (typically ESTs or mRNAs) in FASTA format and organises them into clusters of similar sequences, producing a list of clusters and their members on standard output.
The clustering performed is single-linkage clustering. Sequences are
put in the same cluster if they have regions of close
similarity. Similarity is determined using one of the standard
distance functions. In versions > 0.2.0 of wcd, three distance
functions are supported: the d2 function, edit distance and a common
word heuristic. This may be expanded in later versions of
wcd.
When using the d2 distance function, wcd's results are similar to
those generated by the d2_cluster program, which also implements
clustering based on the d2 distance function. For a comparison of the
two programs, See Testing.
The use of the d2 distance function in DNA sequence comparison is described in Hide et al. (1994) and Burke et al. (1999), and has been shown to be sensitive to sequence similarity even in the absence of direct sequence alignments. This corresponds with the nature of EST data which typically contains single-base read errors and may contain sequence from different splice forms of the same transcript (leading to local dissimilarity).
Thanks to the properties of the d2 distance function, sequence
clusters created by wcd correspond to closely gene classes
i.e. all the ESTs originating from a specific gene will be clustered
into a single class.
wcd is written in C and has been successfully installed and tested on
Linux, FreeBSD, SGI Irix, Sun Solaris and Windows 2000 (under Cygwin).
Users of wcd are encouraged to subscribe to the wcd-users mailing
list using the form at:
http://swing.sanbi.ac.za/mailman/listinfo/wcd-users.
The remainder of this manual assumes that a user understand the basic
principles of what EST clustering is about, and so the basic concepts
are only briefly reviewed in this chapter. Subsequent chapterd discuss
installing and running wcd, as well as giving some technical
background so someone who wants to use some of the code.
wcdIf you cite wcd please use the following reference
wcd EST clustering
Bioinformatics. doi:
10.1093/bioinformatics/btn203. Early access, 14 May 2008.
A description of the underlying algorithms can be found in
wcd tool. South African Computer Journal,
40, June 2008.
The basic idea is that we take a set of sequences and cluster them so that sequences that are close to each other in the same cluster. There are many clustering algorithms, and one review can be found in
@article{cluster2,
author = "A. K. Jain and M. N. Murty and P. J. Flynn",
title = "Data clustering: a review.",
journal = "ACM Comp. Surveys",
volume = "31",
number = "3",
pages = "264--323",
month = sep,
year = "1999"
}
The type of clustering we do is single linkage clustering. We create the largest number of clusters (and hence smallest clusters) such that if two sequences are very close they are put in the same clusters. Note the definition is transitive: if A and B are close, and B and C are close then A and C will be in the same cluster even if A and C are far apart.
The basic clustering algorithm is shown below (obviously, there are performance enhancements).
foreach sequence i
put i in its own cluster
foreach sequence i
foreach sequence j
if i and j are close to each other
merge the clusters to which i and j belong
A distance function takes two sequences and returns a number that says how far apart the sequences are mathematically. Many distance functions have been proposed with different biological and computational motivations: hopefully a well-chosen distance function will return a mathematical value that has some biological distance.
One of the purposes of wcd is to allow different distance
functions to be used. The distance functions which wcd supports
The d2 distance function is based on the word count.
We use the notation x' \in x to mean that x' is a window or substring of x of some specified length.
To measure the distance between sequences we compute the word frequencies and then take the sum of the square of the differences
For a fixed k, we examine all the words w of length k, and compute the frequencies which they occur in the strings x and y, and then sum the square of the differences. Formally,
d^2_k(x,y) = \sum_|w|=k(c_x(w)-c_y(w))
Then in general
d^2(x,y) = \sum_k=l^Ld^2_k(x,y)
for constants l and L. In practice (and this is what
wcd does), we compute the d2 score for a fixed k. The
default value of k is 6, but this is a parameter of the program.
The goal of EST clustering is to see whether there is an overlap between two sequences: thus we are no interested in the d2 score between the whole of one sequence and the whole of another, but whether there are two regions (called windows) that have a low d2 score.
Thus the d2 score between two sequences is the minimum of the d2 score between all pairs of windows of those two sequences. d^2_k(x,y) = \min_x' \in x, y' \in y\sum_|w|=k(c_x'(w)-c_y'(w))
The consequence of this is that d2 applied like this does not obey the triangle inequality. It is very common that
d^2(x,z) > d^2(x,y)+d^2(y,z)
since there is a window that x shares with y, another that y shares with z but no window common to both x and z.
As an aside, d2 even when applied to sequences is not a metric since d^2(x,y)=0 does not necessarily imply that x=y.
We don't compare two sequences directly. Rather we compare all windows of fixed length in one sequence with all the windows in the other. The default window length is 100 (again it's parameterisable).
The paper by Burke et al describes the basic d2 method and justifies
its use. The paper by Hazelhurst describes the algorithm that is
implemented in wcd. This paper should be available from
the ACM Digital library http://www.acm.org/dl.
@article{burke99,
author = "J. Burke and D. Davison and W. Hide",
title = "{D2\_cluster: A Validated Method for Clustering EST and Full-length cDNA Sequences}",
journal = "{Genome Research}",
year = 1999,
volume = 9,
number = 11,
pages = "1135--1142"
}
@InProceedings{hazel2004c,
author = "S. Hazelhurst",
title = "An Efficient Implementation of the $d^2$ Distance Function for {EST} Clustering",
booktitle = {Proceedings of {SAICSIT 2004}: Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists},
pages = "229-223",
year = 2004,
editor = "G. Marsden and P. Kotz\'e and A. Adesina-Ojo",
series = "{ACM} International Conference Proceedings Series",
month = oct
}
Edit distance is well known; the reference below gives a very thorough introduction to the subject.
The version of edit distance implemented in wcd does local
alignment (Smith-Waterman). Penalties (positive numbers) are given for
mismatches, deleltions or gaps, and a reward (a negative number) is
given for a match. The default values are:
The user can change these values.
@Book{ gusfield97,
author = "D. Gusfield",
title = "Algorithms on strings, trees, and sequences",
address = "Cambridge, United Kingdom",
publisher = "Cambridge University Press",
year = 1997
}