

3.7 Doing clustering

wcd [opts] <file>

3.7.1 Arguments

When used in this way, wcd takes one argument: the name of the input file.

3.7.2 Options

3.7.3 Clustering based upon suffix arrays

This provides a coarse clustering very quickly. It puts two sequences in the same cluster if they share at least one word (of the specified word length). You need to create a suffix array of the input data file. wcd expects a certain naming convention to be used.

To use this facility you must create some auxiliary data files in the same directory as your main sequence file. The instructions below assume that the mksary suffix array package is available, though in principle other suffix array packages should do. ‘mksary’ can be found at http://sary.sourceforge.net/

Note that in the current implementation, the suffix clustering is not parallelised.

Once ‘mksary’ has been installed, do the following:

   ./fasta2sary data.fasta -o data.fasta.nlc
   mksary data.fasta.nlc

This will leave three files: data.fasta, data.fasta.nlc and data.fasta.nlc.ary. Again: wcd expects this naming convention to be met.
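Since wcd relies on this naming convention, it can be useful to check that all three files are in place before starting a run. The following is a minimal sketch only; check_suffix_files is a hypothetical helper, not part of wcd or mksary, and it tests only that the files exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of wcd): check that the files needed for
# suffix-array clustering exist alongside the given sequence file.
# Prints one "missing: <path>" line per absent file.
# Returns 0 only if all three files are present.
check_suffix_files() {
    seqfile=$1
    missing=0
    for f in "$seqfile" "$seqfile.nlc" "$seqfile.nlc.ary"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running check_suffix_files data.fasta before clustering reports any of the three files that has not yet been created.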

To cluster:

   wcd -F suffix -w 30 -c data.fasta

3.8 Clustering a range

Usage: wcd [--range, -R]  dataf i j

The range option allows clustering only a range or slice of the input data file in the following way.

If you think of all the comparisons to be done as the upper half of a matrix, the range option restricts the comparisons to a slice of this matrix. The purpose of the option is to allow a simple and crude parallelisation of the work: we can run multiple wcd processes with the same data but different slices. Typically we would also use the --dump option with this. See the section on parallelisation in the technical manual.
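As an illustration of how the slices for several processes might be chosen, the sketch below divides n sequences into k contiguous index ranges. It rests on assumptions that are not stated in this section: that indices start at 0 and that the upper bound j is exclusive. range_slices is a hypothetical helper, not part of wcd; consult the technical manual for the actual meaning of i and j before using it.

```shell
# Hypothetical helper (not part of wcd): split the indices 0..n-1 into
# k contiguous slices and print one "i j" pair per line.
# ASSUMPTION: indices are 0-based and j is exclusive; the real wcd
# convention may differ.
# $1 = number of sequences (n), $2 = number of slices (k).
range_slices() {
    n=$1
    k=$2
    s=0
    while [ "$s" -lt "$k" ]; do
        i=$(( s * n / k ))
        j=$(( (s + 1) * n / k ))
        echo "$i $j"
        s=$(( s + 1 ))
    done
}
```

Each printed pair could then supply the i and j arguments of a separate wcd -R run on the same data file. Note that slices of equal width do not necessarily carry equal work, since the comparisons form the upper half of a matrix rather than a full square.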

If the dump option is chosen, wcd creates a file with a -FIN suffix just before completion. For example, if the dump file name is range10, then a file called range10-FIN is created. This enables monitoring programs that run asynchronously and cannot control the wcd process directly to use the file system to determine whether the job has completed.
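A monitoring script could use this marker file roughly as follows. This is a sketch under stated assumptions: wait_for_fin is a hypothetical helper, not part of wcd, and the polling interval and retry limit are arbitrary illustrative parameters.

```shell
# Hypothetical helper (not part of wcd): poll until "<dumpfile>-FIN" exists.
# $1 = dump file name (e.g. range10)
# $2 = seconds to sleep between polls
# $3 = maximum number of polls
# Returns 0 once the marker file exists, 1 if it never appeared.
wait_for_fin() {
    marker="$1-FIN"
    interval=$2
    tries=$3
    while [ "$tries" -gt 0 ]; do
        [ -e "$marker" ] && return 0
        sleep "$interval"
        tries=$(( tries - 1 ))
    done
    # Final check after the last sleep.
    [ -e "$marker" ]
}
```

For example, wait_for_fin range10 30 120 would check for range10-FIN every 30 seconds for up to an hour and return success as soon as the file appears.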