Doing clustering - wcd Manual 0.


3.7 Doing clustering

wcd [opts] <file>

3.7.1 Arguments

When used in this way, wcd takes one argument: the name of the input file.

3.7.2 Options

[--output | -o] fname
By default all wcd output goes to standard output. Use this option to send the output to another file instead.
[--num_seqs | -C] val
By default all the sequences in the input file are processed. If you only want to process the first part of a file, you can use the -C option; e.g. --num_seqs 100 will only process the first 100 sequences.
If you specify a number greater than the number of sequences actually in the file, then the whole file is processed.
--show_clusters, -c
Prints the results of the clustering in a compact way. Each cluster is printed on a line by itself. The sequences that make up the cluster are separated by commas. See Output for a fuller description.
--histogram, -g
Show a histoGram of the results of the clustering. For each cluster size up to some maximum size, it shows how many clusters there are of that size.
--show_ext, -t
Prints the clustering in extended format. See Output for a description of the format.
[--function | -F] fun
wcd decides whether two sequences should be clustered together on the basis of a distance function. The distance functions that can be used are:
- --function d2: use the d2 function. This is the default. The default threshold is 40.
- --function ed: use edit distance (local alignment). The default threshold is -20. See the --parameter option below for how to specify other options.
- --function heuristic: use the common word heuristic described below. The common word heuristic gives a crude and fast membership criterion.
- --function suffix: a pair of sequences is clustered together if they share at least one word. A good word length to use is 25; note that the default word length of 6 is a very bad value for this function.
[--parameter | -P] fname
This specifies a parameter file that parameterises the distance function. In the current version this is only used by the edit distance option. The first four lines specify a 4x4 matrix which gives the penalties for substitutions (a, c, g, t vs a, c, g, t). There are four integers per line, which should be separated by a single space. The fifth line gives two integers, separated by a space, which give the costs for opening a gap and extending a gap. The file that would be used for the default parameters is shown below.
```
     
     -1 3 3 3 
     3 -1 3 3
     3 3 -1 3
     3 3 3 -1
     5 2
```
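To make the layout concrete, here is a minimal Python sketch that reads text in the five-line format described above. It is not part of wcd: the helper name `parse_parameter_text` is hypothetical, and the parser is more forgiving about whitespace and blank lines than wcd itself may be.

```python
def parse_parameter_text(text):
    """Parse four matrix rows plus one gap-cost row (hypothetical helper, not wcd code)."""
    # Blank lines are skipped; this leniency is an assumption of the sketch.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != 5:
        raise ValueError("expected 4 matrix rows and 1 gap-cost row")
    matrix = [[int(v) for v in row] for row in rows[:4]]
    if any(len(row) != 4 for row in matrix):
        raise ValueError("each matrix row needs four integers")
    if len(rows[4]) != 2:
        raise ValueError("the fifth line needs two integers")
    gap_open, gap_extend = (int(v) for v in rows[4])
    return matrix, gap_open, gap_extend


# The default parameters shown above.
DEFAULT_PARAMETERS = """\
-1 3 3 3
3 -1 3 3
3 3 -1 3
3 3 3 -1
5 2
"""

matrix, gap_open, gap_extend = parse_parameter_text(DEFAULT_PARAMETERS)
print(matrix[0], gap_open, gap_extend)
```

Checking a hand-edited parameter file this way before a long run may catch a missing row or a stray value early.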
[--common_word | -H] val
Set the common word heuristic threshold (the default is 65). Before running a d2 check between two sequences, wcd first checks how many distinct 6-words are shared between the sequences (NB: the whole sequences, not some windows). This can be done in linear rather than quadratic time and so is probably two orders of magnitude faster than checking d2. If not enough common words are found, a d2 check will not be done. [NB: this has changed from version 0.3]
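The idea behind the common word heuristic can be sketched in a few lines of Python. This is an illustration only, not wcd's implementation: the function names are hypothetical, reverse complements are ignored, and treating "enough" as "greater than or equal to the threshold" is an assumption.

```python
def distinct_words(seq, word_len=6):
    """Return the set of distinct words of length word_len in seq."""
    return {seq[p:p + word_len] for p in range(len(seq) - word_len + 1)}


def shared_word_count(seq_a, seq_b, word_len=6):
    """Count distinct words that occur in both whole sequences."""
    return len(distinct_words(seq_a, word_len) & distinct_words(seq_b, word_len))


def passes_common_word_filter(seq_a, seq_b, threshold=65, word_len=6):
    """True if the pair shares enough words to justify the expensive check.

    Assumption: 'enough' means count >= threshold.
    """
    return shared_word_count(seq_a, seq_b, word_len) >= threshold


a = "ACGTACGTAC"
b = "TTACGTACGG"
print(shared_word_count(a, b))          # these two toy sequences share 3 distinct 6-words
print(passes_common_word_filter(a, b))  # far below the default threshold of 65
```

Building each set takes time linear in the sequence length, which is where the saving over a full window-by-window comparison comes from.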
[--window_len | -l] val
Set the window length to the given value. The default is 100.
[--skip_val | -S] val
Set the skip value: how far the window along the second sequence is advanced at each step. The default is 1. Don't be too aggressive. Setting the common word threshold at its default value is probably better than changing the skip value (IMO).
[--threshold_val | -T] val
Set the distance threshold. The default is 40 for d2 and -20 for edit distance.
[--word_len | -w] val
Set the d2 word length (default 6).
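To show how the window length, skip value and word length interact, here is a brute-force Python sketch of d2 as it is commonly defined: the sum of squared differences in word counts between two windows, minimised over pairs of windows. It is an assumption that this matches wcd in every detail. The function names are hypothetical, reverse complements are ignored, and wcd's real implementation is far more efficient.

```python
from collections import Counter


def word_counts(window, word_len):
    """Count every word of length word_len in the window."""
    return Counter(window[p:p + word_len] for p in range(len(window) - word_len + 1))


def d2_windows(win_a, win_b, word_len=6):
    """Sum of squared differences in word counts between two windows."""
    counts_a = word_counts(win_a, word_len)
    counts_b = word_counts(win_b, word_len)
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in counts_a.keys() | counts_b.keys())


def d2_min(seq_a, seq_b, window_len=100, word_len=6, skip=1):
    """Smallest window-to-window score; skip steps the window on the second sequence."""
    best = None
    for a in range(max(len(seq_a) - window_len, 0) + 1):
        window_a = seq_a[a:a + window_len]
        for b in range(0, max(len(seq_b) - window_len, 0) + 1, skip):
            score = d2_windows(window_a, seq_b[b:b + window_len], word_len)
            if best is None or score < best:
                best = score
    return best


# Toy data: the second sequence contains the first at offset 2.
print(d2_min("ACGTACGTAC", "TTACGTACGTAC", window_len=10))          # finds the exact match
print(d2_min("ACGTACGTAC", "TTACGTACGTAC", window_len=10, skip=4))  # a large skip misses it
```

The second call illustrates why an aggressive skip value can be risky: the best-matching window is simply never examined.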
--performance, -s
Show performance stats.
--no_rc, -n
Don't do the reverse complementation check (rc-checking is done by default).
[--constraint | -k] filename
Give the constraint file for the first input data file. This is optional. The constraint file enables you to ensure that certain sequences are not clustered together or to ignore certain sequences while clustering. See Format of Constraint File, which gives more details on the required format of the constraint file, and the semantics.
[--sample_word_len | -B] val
Word length used in the sample heuristic. See below for use.
[--sample_thresh | -K] val
This is the threshold for the sample heuristic. Suppose B and K are the sample word length and threshold parameters. When comparing two sequences i and j, the sample test described below is done first. If it passes, a more rigorous test for similarity is done; if it fails, the pair is declared not to overlap.
The sample test: when comparing two sequences i and j, every 8th word of length B is sampled from j; at least K of these must also occur among all the words in i. The defaults are K=7, B=8, which is conservative. [NB: this has changed from version 0.3]
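The sample test can be sketched as follows. This is a minimal Python illustration, not wcd's code: the function names are hypothetical, and it assumes that sampling starts at the first position of j and that each sampled position counts separately.

```python
def sample_hits(seq_i, seq_j, sample_word_len=8, step=8):
    """Count sampled words of seq_j that also occur somewhere in seq_i."""
    b = sample_word_len
    words_i = {seq_i[p:p + b] for p in range(len(seq_i) - b + 1)}
    hits = 0
    # Assumption: sample every step-th word, starting at position 0 of seq_j.
    for p in range(0, len(seq_j) - b + 1, step):
        if seq_j[p:p + b] in words_i:
            hits += 1
    return hits


def passes_sample_test(seq_i, seq_j, sample_word_len=8, sample_thresh=7):
    """True if at least sample_thresh sampled words of seq_j occur in seq_i."""
    return sample_hits(seq_i, seq_j, sample_word_len) >= sample_thresh


seq = "ACGTTGCA" * 10              # toy sequence of length 80
print(sample_hits(seq, seq))        # 10 words are sampled and all of them match
print(passes_sample_test(seq, "C" * 80))
```

Only about one word in eight of j is looked up, so the test is cheap; pairs that fail it never reach the more rigorous check.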
-X: Do clone linking
wcd will use the clone information in the sequence headers to put sequences together. If a sequence header contains the word Clone followed by a string, then that sequence is identified as matching a particular clone. All ESTs matching a particular clone will be clustered together. (The current implementation is very simplistic and probably adds about 25% to the cost of clustering. It will be improved in future.)
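As an illustration of clone linking, the Python sketch below groups sequence indices by clone identifier. The exact header layout wcd expects is not spelled out here, so the pattern (the word Clone, whitespace, then a run of non-blank characters) is an assumption, and the headers and the name `group_by_clone` are made up for the example.

```python
import re
from collections import defaultdict

# Assumed layout: "Clone", whitespace, then the clone identifier.
CLONE_PATTERN = re.compile(r"\bClone\s+(\S+)")


def group_by_clone(headers):
    """Map each clone identifier to the indices of the headers that mention it."""
    groups = defaultdict(list)
    for index, header in enumerate(headers):
        match = CLONE_PATTERN.search(header)
        if match:
            groups[match.group(1)].append(index)
    return dict(groups)


# Hypothetical headers, for illustration only.
headers = [
    ">est1 Clone c100 5' end",
    ">est2 no clone information",
    ">est3 Clone c100 3' end",
    ">est4 Clone c200",
]
print(group_by_clone(headers))
```

Sequences that land in the same group would then be placed in the same cluster regardless of their measured distance.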

3.7.3 Clustering based upon suffix arrays

This provides a coarse clustering very quickly. It puts two sequences in the same cluster if they share at least one word (of the specified word length). You need to create a suffix array of the input data file. wcd expects a certain naming convention to be used.

In order to use this facility you must create some auxiliary data files in the same directory as your main sequence file. I assume you have available the mksary suffix array package, though in principle others should do. ‘mksary’ can be found at http://sary.sourceforge.net/

Note that in the current implementation, the suffix clustering is not parallelised.

Once ‘mksary’ has been installed, do the following:

   ./fasta2sary data.fasta -o data.fasta.nlc
   mksary data.fasta.nlc

This will leave three files data.fasta, data.fasta.nlc and data.fasta.nlc.ary. Again: wcd expects this naming convention to be met.

To cluster

wcd -F suffix -w 30 -c data.fasta
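The criterion itself (two sequences join the same cluster if they share at least one word) is easy to illustrate. The Python sketch below uses a plain dictionary and union-find rather than a suffix array, so it shows only the clustering rule, not how wcd computes it. The function name is hypothetical and reverse complements are ignored.

```python
def cluster_by_shared_word(seqs, word_len):
    """Cluster sequence indices that are linked by at least one shared word."""
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # word -> index of the first sequence containing it
    for idx, seq in enumerate(seqs):
        for p in range(len(seq) - word_len + 1):
            word = seq[p:p + word_len]
            if word in first_seen:
                root_a, root_b = find(idx), find(first_seen[word])
                if root_a != root_b:
                    parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_seen[word] = idx

    clusters = {}
    for idx in range(len(seqs)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Toy data: 0 and 1 share CCCC, 1 and 3 share GGGG, 2 shares nothing.
seqs = ["AAAACCCC", "CCCCGGGG", "TTTTTTTT", "GGGGAAAT"]
print(cluster_by_shared_word(seqs, word_len=4))
```

Note that sequences 0 and 3 end up together although they share no word directly. This transitivity is one reason a short word length gives very coarse clusters.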

3.8 Clustering a range

Usage: wcd [--range | -R] dataf i j

The range option allows clustering only a range or slice of the input data file in the following way.

Each sequence k in the file with an index from i (inclusive) to j-1 (inclusive) is compared to each sequence from k+1 to the end. More precisely:

     
     for (k = i; k < j; k++)
       for (m = k+1; m < num_seqs; m++)
         compare sequences k and m and cluster if necessary

If you think of all the comparisons to be done as the upper half of a matrix, the range option restricts the comparisons to a slice of this matrix. The purpose of the option is to allow a simplistic and crude parallelisation of the work: we can run multiple wcd processes with the same data but different slices. Typically we would also use the --dump option with this. See the section on parallelisation in the technical manual.
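The following Python sketch enumerates the comparisons a given range performs and shows one naive way of choosing slices. Both helper names are hypothetical, and the even split by row count is just an illustration, not a recommendation from wcd.

```python
def range_pairs(i, j, num_seqs):
    """List the (k, m) comparisons made for the range i..j-1."""
    return [(k, m) for k in range(i, j) for m in range(k + 1, num_seqs)]


def split_into_slices(num_seqs, num_slices):
    """Split the rows 0..num_seqs-1 into consecutive (i, j) ranges of similar length."""
    bounds = [num_seqs * s // num_slices for s in range(num_slices + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


slices = split_into_slices(6, 3)
print(slices)
for i, j in slices:
    # Earlier slices contain more comparisons than later ones.
    print((i, j), len(range_pairs(i, j, 6)))
```

Taken together the slices cover every pair in the upper half of the matrix exactly once. Because early rows are longer than late ones, an even split by row count gives the first process the most work; uneven ranges may balance the load better.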

If the dump option is chosen, just before completion wcd creates a file with an -FIN suffix. For example if the dump file name is range10, then a file called range10-FIN is created. This enables monitoring programs which run asynchronously and cannot control the wcd process directly to use the file system to determine whether the job has completed.
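A monitoring program of the kind described could be as simple as the Python sketch below, which polls the file system until the -FIN file appears. The function name, polling interval and timeout are all assumptions made for the example.

```python
import os
import time


def wait_for_fin(dump_name, poll_seconds=5.0, timeout_seconds=None):
    """Poll until <dump_name>-FIN exists; return False if the timeout expires first."""
    fin_path = dump_name + "-FIN"
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not os.path.exists(fin_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

For the example above, `wait_for_fin("range10")` would return once range10-FIN has been created. With several slices running, one such call per dump file tells you when all of the jobs are done.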