wcd [opts] <file>
When used in this way, wcd takes one argument: the name of the
input file.
[ --output | -o] fname
By default all wcd output goes to standard output. Using this
option allows you to specify another file.
[ --num_seqs | -C] val
By default all the sequences in the input file will be processed. If you only want to process part of a file, you can use the -C option.
e.g. --num_seqs 100
will only process the first 100 sequences.
If you specify a number greater than the number of sequences actually in the file, then the whole file will be processed.
--show_clusters, -c
Prints the results of the clustering in a compact way. Each cluster is printed on a line by itself. The sequences that make up the cluster are separated by commas. See Output, for more description.
--histogram, -g
Show a histoGram of the results of the clustering. For each cluster size, it shows how many clusters there are that size up to some maximum size.
--show_ext, -t
Prints the clustering in extended format. See Output, for a description of the format.
[--output | -o] fname
By default any output gets sent to standard output. You can send output to a given file.
[--function | -F] fun
wcd decides whether two sequences should be clustered together
on the basis of a distance function. The distance functions that can
be used are:
--function d2: use the d2 function. This is the
default. The default threshold is 40.
--function ed: use edit distance (local
alignment). The default threshold is -20. See the
--parameter option below for how to specify other
options.
--function heuristic: use the common word heuristic
described below. The common word heuristic gives a crude and
fast membership criterion.
--function suffix: A pair of sequences is clustered
together if they share at least one word. A good word length (-w) to use is 25:
note that the default word length of 6 is a very bad value here.
[--parameter | -P] fname
This specifies a parameter file that parameterises the distance function. In the current version this is only used by the edit distance option. The first four lines specify a 4x4 matrix which gives the penalties for substitutions (a, c, g, t vs a, c, g, t). There are four integers per line, which should be separated by a single space. The fifth line gives two integers, separated by a space, which give the costs for opening a gap and extending a gap. The file that would be used for the default parameters is shown below:
-1 3 3 3
3 -1 3 3
3 3 -1 3
3 3 3 -1
5 2
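The file layout described above can be checked with a small reader. This is only a sketch of the format as described here, not wcd's own parser; the function name parse_params is hypothetical, and the reader is slightly more lenient about spacing than the description requires.

```python
def parse_params(text):
    """Parse a 5-line parameter file: a 4x4 substitution matrix
    (rows and columns in the order a, c, g, t) followed by one line
    holding the gap-open and gap-extend costs.
    Hypothetical helper for illustration; not part of wcd."""
    rows = [line.split() for line in text.strip().splitlines()]
    if len(rows) != 5:
        raise ValueError("expected exactly 5 lines")
    if any(len(row) != 4 for row in rows[:4]):
        raise ValueError("each matrix line needs 4 integers")
    if len(rows[4]) != 2:
        raise ValueError("last line needs 2 integers")
    matrix = [[int(v) for v in row] for row in rows[:4]]
    gap_open, gap_extend = (int(v) for v in rows[4])
    return matrix, gap_open, gap_extend


# The default parameters listed above.
DEFAULT_PARAMS = """-1 3 3 3
3 -1 3 3
3 3 -1 3
3 3 3 -1
5 2
"""

matrix, gap_open, gap_extend = parse_params(DEFAULT_PARAMS)
```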
[--common_word | -H] val
Set the common word heuristic threshold (the default is 65). Before running a d2 check between two sequences, wcd first counts how many distinct 6-words the two sequences share (NB: the whole sequences, not windows). This can be done in linear rather than quadratic time and so is probably two orders of magnitude faster than checking d2. If not enough common words are found, the d2 check is not done. [NB: this has changed from version 0.3]
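The idea behind the common word heuristic can be sketched as follows. This is an illustration only, not wcd's implementation: the function names are hypothetical, reverse complements are ignored, and it is an assumption that a pair passes when the count is greater than or equal to the threshold.

```python
def distinct_words(seq, k=6):
    """Set of distinct k-words in seq, built in one linear pass."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(seq1, seq2, k=6):
    """Number of distinct k-words that occur in both whole sequences."""
    return len(distinct_words(seq1, k) & distinct_words(seq2, k))


def passes_common_word(seq1, seq2, threshold=65, k=6):
    """True if the pair shares enough words to justify a d2 check.
    Assumption: 'enough' means count >= threshold."""
    return shared_words(seq1, seq2, k) >= threshold
```

Only pairs for which passes_common_word returns True would go on to the more expensive d2 comparison.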
[--window_len| -l] val
Set the window length to the given value. The default is 100.
[--skip_val|-S] val
Set the skip value: how far the window along the second sequence is moved at each step. The default is 1. Don't be too aggressive. Setting the common word threshold at its default value is probably better than changing the skip value (IMO).
[--threshold_val, -T] val
Set the distance threshold. The default is 40 for d2 and -20 for
edit distance.
[--word_len|-w] val
Set the d2 word length (default 6)
--performance, -s
Show performance stats
--no_rc, -n
Don't do the reverse complementation check (rc-checking is done by default).
[--constraint, -k] filename
Give the constraint file for the first input data file. This is optional. The constraint file enables you to ensure that certain sequences are not clustered together or to ignore certain sequences while clustering. See Format of Constraint File, which gives more details on the required format of the constraint file, and the semantics.
[--sample_word_len | -B] val
Word length used in the sample heuristic. See below for use.
[--sample_thresh | -K] val
This is the threshold for the sample heuristic. Suppose B and K are the sample word length and threshold parameters. When comparing two sequences i and j, the sample test described below is done first. If it passes, a more rigorous test is done for similarity; if it fails, the pair is declared not to overlap.
The sample test: When comparing two sequences i and j, every 8th word of length B is sampled from j; at least K of these must also occur among all the words in i. The defaults are K=7, B=8, which is conservative. [NB: this has changed from version 0.3]
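The sample test can be sketched as below. This is an illustration under stated assumptions rather than wcd's code: the name sample_test is hypothetical, sampling is assumed to start at the first position of j, and reverse complements are ignored.

```python
def sample_test(seq_i, seq_j, K=7, B=8, step=8):
    """Sample every step-th word of length B from seq_j and count how
    many of them occur anywhere in seq_i; pass if at least K do."""
    words_i = {seq_i[p:p + B] for p in range(len(seq_i) - B + 1)}
    hits = sum(
        1
        for p in range(0, len(seq_j) - B + 1, step)
        if seq_j[p:p + B] in words_i
    )
    return hits >= K
```

Because only about one word in eight of j is looked up, the test is cheap, and a failing pair never reaches the more rigorous comparison.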
-X: Do clone linking
wcd will use the clone information in the sequence headers to
put sequences together. If a sequence header contains the word
Clone followed by a string, then that sequence is identified
as matching a particular clone. All ESTs matching a particular clone
will be clustered together. (The current implementation is very
simplistic and probably adds about 25% cost to clustering. It will
be improved in future).
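As a rough illustration of clone linking, the sketch below groups sequence indices by clone identifier. The exact header syntax wcd expects is not spelled out here, so the pattern used (the word Clone, whitespace, then a run of non-space characters) is an assumption, as are the function name and the example headers.

```python
import re
from collections import defaultdict

# Assumed header convention: "Clone" followed by whitespace and an identifier.
CLONE_RE = re.compile(r"Clone\s+(\S+)")


def group_by_clone(headers):
    """Map each clone identifier to the indices of the sequences whose
    header mentions it. Headers without clone information are skipped."""
    groups = defaultdict(list)
    for index, header in enumerate(headers):
        match = CLONE_RE.search(header)
        if match:
            groups[match.group(1)].append(index)
    return dict(groups)


# Hypothetical headers, for illustration only.
example_headers = [
    ">est1 Clone ABC123",
    ">est2 no clone information",
    ">est3 Clone ABC123",
]
clone_groups = group_by_clone(example_headers)
```

Every group with more than one member corresponds to a set of ESTs that would be placed in the same cluster.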
Suffix array clustering (--function suffix) provides a coarse clustering
very quickly. It puts two sequences in the same cluster if they share at
least one word (of the specified word length). You need to create a
suffix array of the input data file. wcd expects a certain naming
convention to be used.
In order to use this facility you must create some auxiliary data files
in the same directory as your main sequence file. I assume you have
available the mksary suffix array package, though in principle others
should do. ‘mksary’ can be found at
http://sary.sourceforge.net/
Note that in the current implementation, the suffix clustering is not parallelised.
Once ‘mksary’ has been installed, do the following:
./fasta2sary data.fasta -o data.fasta.nlc
mksary data.fasta.nlc
This will leave three files data.fasta, data.fasta.nlc
and data.fasta.nlc.ary. Again: wcd expects this naming
convention to be met.
To cluster
wcd -F suffix -w 30 -c data.fasta
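The clustering criterion itself (two sequences go together if they share at least one word) can be illustrated without a suffix array. The sketch below substitutes a hash table and a union-find structure for the suffix array, so it shows the criterion only, not how wcd computes it; the function name is hypothetical and reverse complements are ignored.

```python
def cluster_by_shared_word(seqs, w):
    """Cluster sequence indices so that any two sequences sharing a
    word of length w end up in the same cluster (transitively)."""
    parent = list(range(len(seqs)))

    def find(x):
        # Follow parent links to the cluster representative,
        # halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # word -> index of the first sequence containing it
    for index, seq in enumerate(seqs):
        for p in range(len(seq) - w + 1):
            owner = first_owner.setdefault(seq[p:p + w], index)
            if owner != index:
                parent[find(index)] = find(owner)

    clusters = {}
    for index in range(len(seqs)):
        clusters.setdefault(find(index), []).append(index)
    return sorted(clusters.values())
```

With a short word length almost any two sequences share some word, which is why a small value such as 6 gives a useless clustering here.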
Usage: wcd [--range, -R] dataf i j
The range option allows clustering only a range or slice of the input data file in the following way.
for (k = i; k < j; k++)
  for (m = k+1; m < num_seqs; m++)
    compare sequences k and m and cluster if necessary
If you think of all the comparisons to be done as the upper half of
a matrix, the range option restricts the comparisons to be done to a
slice of this matrix. The purpose of the option is to allow a simplistic
and crude parallelisation of work. We can run multiple wcd
processes with the same data but different slices. Typically we would
also use the --dump option with this. See the section on
parallelisation in the technical manual.
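To see how slices divide the work, the sketch below enumerates the comparisons covered by a range i, j. It mirrors the loop shown above; the function name range_pairs and the choice of five sequences are illustrative assumptions.

```python
def range_pairs(i, j, num_seqs):
    """Yield the (k, m) comparisons covered by the slice i <= k < j."""
    for k in range(i, j):
        for m in range(k + 1, num_seqs):
            yield (k, m)


num_seqs = 5
all_pairs = set(range_pairs(0, num_seqs, num_seqs))
slice_a = set(range_pairs(0, 2, num_seqs))
slice_b = set(range_pairs(2, 5, num_seqs))

# The two slices are disjoint and together cover every comparison.
assert slice_a | slice_b == all_pairs
assert not (slice_a & slice_b)
```

Note that slices of equal width do not carry equal work: rows near the top of the matrix contain more comparisons, so in this example the slice 0, 2 covers 7 pairs and the slice 2, 5 only 3.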
If the dump option is chosen, just before completion wcd creates
a file with an -FIN suffix. For example if the dump file name is
range10, then a file called range10-FIN is created. This
enables monitoring programs which run asynchronously and cannot control
the wcd process directly to use the file system to determine
whether the job has completed.
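A monitoring program could watch for the -FIN file along these lines. This is a minimal sketch: the function name, the polling interval and the timeout are assumptions, and only the -FIN naming comes from the description above.

```python
import os
import time


def wait_for_fin(dump_name, poll_seconds=5.0, timeout=None):
    """Block until the file dump_name + '-FIN' exists.
    Returns True once it appears, or False if timeout (in seconds)
    expires first. With timeout=None it waits indefinitely."""
    fin_path = dump_name + "-FIN"
    start = time.monotonic()
    while not os.path.exists(fin_path):
        if timeout is not None and time.monotonic() - start >= timeout:
            return False
        time.sleep(poll_seconds)
    return True
```

For the example above, wait_for_fin("range10") would return once range10-FIN has been created.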