Description - wcd Manual 0.

Next: Installing wcd, Previous: Top, Up: Top

1 Description

This manual describes wcd (pronounced `wicked'), a program that performs clustering of sequences. Essentially, it takes a set of DNA sequences (typically Expressed Sequence Tags, ESTs), and clusters the sequences into the related groups.

It reads in a set of DNA sequences (typically ESTs or mRNAs) in FASTA format and organises them into clusters of similar sequences, producing a list of clusters and their members on standard output.

The clustering performed is single-linkage clustering. Sequences are put in the same cluster if they have regions of close similarity. Similarity is determined using one of the standard distance functions. In versions > 0.2.0 of wcd, three distance functions are supported: the d2 function, edit distance and a common word heuristic. This may be expanded in later versions of wcd.

When using the d2 distance function, wcd's results are similar to those generated by the d2_cluster program, which also implements clustering based on the d2 distance function. For a comparison of the two programs, See Testing.

The use of the d2 distance function in DNA sequence comparison is described in Hide et al. (1994) and Burke et al. (1999), and has been shown to be sensitive to sequence similarity even in the absence of direct sequence alignments. This corresponds with the nature of EST data which typically contains single-base read errors and may contain sequence from different splice forms of the same transcript (leading to local dissimilarity).

Thanks to the properties of the d2 distance function, sequence clusters created by wcd correspond to closely gene classes i.e. all the ESTs originating from a specific gene will be clustered into a single class.

wcd is written in C and has been successfully installed and tested on Linux, FreeBSD, SGI Irix, Sun Solaris and Windows 2000 (under Cygwin).

Users of wcd are encouraged to subscribe to the wcd-users mailing list using the form at: http://swing.sanbi.ac.za/mailman/listinfo/wcd-users.

The remainder of this manual assumes that a user understand the basic principles of what EST clustering is about, and so the basic concepts are only briefly reviewed in this chapter. Subsequent chapterd discuss installing and running wcd, as well as giving some technical background so someone who wants to use some of the code.

1.0.1 Citing `wcd`

If you cite wcd please use the following reference

S. Hazelhurst, W. Hide, Z. Liptak, R. Nogueira. R. Starfield An overview of the wcd EST clustering Bioinformatics. doi: 10.1093/bioinformatics/btn203. Early access, 14 May 2008.

A description of the underlying algorithms can be found in

wcd

South African Computer Journal

1.1 Clustering

The basic idea is that we take a set of sequences and cluster them so that sequences that are close to each other in the same cluster. There are many clustering algorithms, and one review can be found in

 @article{cluster2,
   author =       "A. K. Jain and M. N. Murty and P. J. Flynn",
   title =        "Data clustering: a review.",
   journal =      "ACM Comp. Surveys",
   volume =       "31",
   number =       "3",
   pages =        "264--323",
   month =        sep,
   year =         "1999"
 }

The type of clustering we do is single linkage clustering. We create the largest number of clusters (and hence smallest clusters) such that if two sequences are very close they are put in the same clusters. Note the definition is transitive: if A and B are close, and B and C are close then A and C will be in the same cluster even if A and C are far apart.

The basic clustering algorithm is shown below (obviously, there are performance enhancements).

foreach sequence i
  put i in its own cluster
foreach sequence i
  foreach sequence j
     if i and j are close to each other
        merge the clusters to which i and j belong

1.2 Distance functions

A distance function takes two sequences and returns a number that says how far apart the sequences are mathematically. Many distance functions have been proposed with different biological and computational motivations: hopefully a well-chosen distance function will return a mathematical value that has some biological distance.

One of the purposes of wcd is to allow different distance functions to be used. The distance functions which wcd supports

d2, edit distance

1.2.1 The D2 distance function

The d2 distance function is based on the word count.

Let x be a sequence and w a word
We use the notation x' \in x to mean that x' is a window or substring of x of some specified length.
c_x(w) is the number of times that w appears in x.

1.2.1.1 General formulation

To measure the distance between sequences we compute the word frequencies and then take the sum of the square of the differences

For a fixed k, we examine all the words w of length k, and compute the frequencies which they occur in the strings x and y, and then sum the square of the differences. Formally,

d^2_k(x,y) = \sum_|w|=k(c_x(w)-c_y(w))

Then in general

d^2(x,y) = \sum_k=l^Ld^2_k(x,y)

for constants l and L. In practice (and this is what wcd does), we compute the d2 score for a fixed k. The default value of k is 6, but this is a parameter of the program.

1.2.1.2 Windows

The goal of EST clustering is to see whether there is an overlap between two sequences: thus we are no interested in the d2 score between the whole of one sequence and the whole of another, but whether there are two regions (called windows) that have a low d2 score.

Thus the d2 score between two sequences is the minimum of the d2 score between all pairs of windows of those two sequences. d^2_k(x,y) = \min_x' \in x, y' \in y\sum_|w|=k(c_x'(w)-c_y'(w))

The consequence of this is that d2 applied like this does not obey the triangle inequality. It is very common that

d^2(x,z) > d^2(x,y)+d^2(y,z)

since there is a window that x shares with y, another that y shares with z but no window common to both x and z.

As an aside, d2 even when applied to sequences is not a metric since d^2(x,y)=0 does not necessarily imply that x=y.

We don't compare two sequences directly. Rather we compare all windows of fixed length in one sequence with all the windows in the other. The default window length is 100 (again it's parameterisable).

1.2.1.3 References

The paper by Burke et al describes the basic d2 method and justifies its use. The paper by Hazelhurst describes the algorithm that is implemented in wcd. This paper should be available from the ACM Digital library http://www.acm.org/dl.

@article{burke99,
   author = "J. Burke and D. Davison and W. Hide",
   title  = "{D2\_cluster: A Validated Method for Clustering EST and Full-length cDNA Sequences}",
   journal = "{Genome Research}",
   year =  1999,
   volume = 9,
   number = 11,
   pages = "1135--1142"
}

@InProceedings{hazel2004c,
  author = 	 "S. Hazelhurst",
  title = 	 "An Efficient Implementation of the $d^2$ Distance Function for {EST} Clustering",
  booktitle =	 {Proceedings of {SAICSIT 2004}: Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists},
  pages =	 "229-223",
  year =	 2004,
  editor =	 "G. Marsden and P. Kotz\'e and A. Adesina-Ojo",
  series =	 "{ACM} International Conference Proceedings Series",
  month =	 oct
}

1.2.2 Edit distance

Edit distance is well known; the reference below gives a very thorough introduction to the subject.

The version of edit distance implemented in wcd does local alignment (Smith-Waterman). Penalties (positive numbers) are given for mismatches, deleltions or gaps, and a reward (a negative number) is given for a match. The default values are:

-1 for a match
+3 for mismatch
+5 for opening a gap (indel)
+2 for extending a gap

The user can change these values.

@Book{		  gusfield97,
  author	= "D. Gusfield",
  title		= "Algorithms on strings, trees, and sequences",
  address	= "Cambridge, United Kingdom",
  publisher	= "Cambridge University Press",
  year		= 1997
}