All output goes to standard output. The format of the output is determined by the options described in the arguments section. Note that if you give no format options, nothing is printed and the run is wasted.
A convenient way (for humans and probably for many programs) to show the
output of the clustering process is to use the --show_clusters
option. The format of the compressed cluster table is very simple. Each
cluster appears on a line by itself. The cluster is given by listing the
indices of the sequences that make up the cluster. The indices are
separated by a space, and the last sequence in the cluster is followed
by a full stop `.'.
0 1 3.
2.
4.
5 7.
6 8 9.
10.
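As a minimal sketch of how a program might read this format, the Python below parses compressed cluster lines into lists of sequence indices. The function name parse_compressed is hypothetical and not part of the wcd distribution; it assumes exactly the layout described above (one cluster per line, indices separated by spaces, a trailing full stop).

```python
def parse_compressed(text):
    """Parse compressed cluster output into a list of clusters.

    Hypothetical helper, not part of wcd. Assumes one cluster per line,
    indices separated by spaces, and a trailing full stop.
    """
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if not line.endswith("."):
            raise ValueError("cluster line lacks trailing full stop: %r" % line)
        # Drop the full stop, then convert each index to an integer.
        clusters.append([int(tok) for tok in line[:-1].split()])
    return clusters


example = "0 1 3.\n2.\n4.\n5 7.\n6 8 9.\n10.\n"
clusters = parse_compressed(example)
print(clusters)
```

Applied to the example table above, this yields six clusters, the first containing sequences 0, 1 and 3.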
When the --ext_show option is chosen, the clustering is
given in table format. The columns of the table are as follows:
The orient field requires a little more explanation. It gives the
orientation of the sequence with respect to the root of its
cluster. Formally, this is described as follows. Let x and
y be two sequences in a cluster. While they may not overlap, we
know that there is an ordered list or path of sequences x=x_0,
x_1, \ldots, x_{n-1}, x_n=y such that for each i either
d^2(x_i,x_{i+1})\leq \theta (positive match) or
d^2(x_i,rc(x_{i+1}))\leq \theta (reverse-complement match), where
\theta is the threshold. In particular for every sequence there
is such a path from the root of the cluster to that sequence. In
a path from the root to a sequence x, we compute the number of
times the match is a positive one, and the number of times it is a
reverse-complement one. The orient field is 1 if the number of
reverse-complement matches is even, and -1 if the number is odd.
In principle it is possible for there to be two paths from the root to a sequence that yield different orient values. First, this is unlikely to happen. Second, all the orient field says is that such an orientation of the sequence is legitimate. The fact that another orientation is also legitimate does not affect the correctness of the result.
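The parity rule for the orient field can be sketched in a few lines of Python. This is an illustration only, not wcd's own code: the function name orient and the encoding of a path as a list of 1 (positive match) and -1 (reverse-complement match) values are assumptions made for the example.

```python
def orient(path_matches):
    """Return the orient value for a path from the cluster root.

    path_matches holds one entry per step on the path: 1 for a positive
    match, -1 for a reverse-complement match (an assumed encoding).
    """
    rc_count = sum(1 for m in path_matches if m == -1)
    # An even number of reverse-complement steps leaves the sequence in
    # the same orientation as the root; an odd number flips it.
    return 1 if rc_count % 2 == 0 else -1


print(orient([1, -1, 1]))   # one reverse-complement step
print(orient([-1, 1, -1]))  # two reverse-complement steps
print(orient([]))           # the root itself
```

With this encoding the result is simply the product of the entries on the path, which is why two different paths to the same sequence can, in principle, disagree.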
Usage: wcd -d dump_file seq_file
When used with this option, wcd will open the given dump file for
writing and then perform clustering. Whenever it finds two sequences
that should be clustered, it writes the match to the dump file: the
output is the indices of the two sequences, followed by 1 (if there is a
positive match) or -1 (if there is an RC match).
This was introduced into wcd to support our simplistic
parallelisation (see the parallelisation section in the technical
manual).
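To illustrate how a clustering can be rebuilt from such match records, here is a hedged Python sketch using a union-find structure. It is not the code shipped with wcd. The function name clusters_from_dump is hypothetical, and the layout of a dump line (two indices and a 1 or -1, separated by whitespace) is an assumption, since the text above does not fix the exact layout.

```python
def clusters_from_dump(lines, num_seqs):
    """Build clusters from match records using union-find.

    Hypothetical helper. Each non-blank line is assumed to hold
    "i j s": two sequence indices and s = 1 or -1.
    """
    parent = list(range(num_seqs))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for line in lines:
        fields = line.split()
        if not fields:
            continue
        i, j, direction = (int(f) for f in fields)
        if direction not in (1, -1):
            raise ValueError("unexpected match direction: %d" % direction)
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root of the merged cluster.
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(num_seqs):
        groups.setdefault(find(i), []).append(i)
    return [groups[root] for root in sorted(groups)]


records = ["0 1 1", "1 3 -1", "5 7 1"]
print(clusters_from_dump(records, 8))
```

Because merging is order-independent, records from several dump files can be fed in one after another, which is what makes this representation convenient for combining the results of separate runs.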
A number of auxiliary programs come with the wcd distribution.
rindex.py
This Python program takes two arguments, the names of two files each containing (compact) clusterings. It computes the sensitivity, specificity, Jaccard index, Rand index and correlation coefficient between the two clusterings.
If you use the --index rand option, only the Rand index is shown. If you use the --index jaccard option, only the Jaccard index is shown.
If you use the --diff n option, the indices above are not printed but the mismatches between the two clusterings are shown. First the pairs that are clustered by the first clustering but not the second are shown, and then the ones clustered by the second but not the first. If n==1, then all such pairs are shown; otherwise only the pairs that belong to clusters with n or fewer sequences are shown. This is helpful for exploring differences between clusterings.
ext2comp.pl
This Perl program converts the extended cluster table format to the compressed table format.
comp2ext.pl
This Perl program converts the compressed table format to the extended table format. Since the compressed format does not contain all the information in the extended table, you will find 0s in the orient column and -1 in the witness field.
For both programs, input and output are standard input and standard output, so you would probably run the programs thus:
./ext2comp.pl < cluster.ext > cluster.com
./comp2ext.pl < cluster.com > cluster.ext
fasta2sary.py
This takes as input a FASTA file and writes it out in a format suitable for building a suffix array. It can do simple clean-up as well.
python fasta2sary.py -x -d 11 myfile.fasta -o myfile.fasta.nlc
Note the naming convention that should be used: the output file name must be the input file name with .nlc appended.
analysecluster.py
Takes as input a clt file and produces a histogram of the cluster
tables. If you use the -t N option, any clusters with more than N
sequences are output instead of the histograms.
combine.c
This takes as input a list of names of dump files, reads in each dump
file in turn, and constructs the clustering from that. To make the
executable, say make combine.
wcd_wrapper.sh
This script allows wcd to be used as a replacement
for d2_cluster in the stackPACK analysis pipeline.
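To make concrete the kind of comparison that rindex.py reports, the following Python sketch computes the Rand and Jaccard indices of two clusterings using the standard pair-counting definitions. It is not the rindex.py source, and whether rindex.py uses exactly these formulas is an assumption. The function names are hypothetical, and both clusterings are assumed to cover the same set of sequences.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of index pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for i, j in combinations(sorted(cluster), 2):
            pairs.add((i, j))
    return pairs


def rand_and_jaccard(first, second):
    """Pair-counting Rand and Jaccard indices of two clusterings."""
    n = sum(len(c) for c in first)
    total = n * (n - 1) // 2  # number of distinct sequence pairs
    p1 = co_clustered_pairs(first)
    p2 = co_clustered_pairs(second)
    both = len(p1 & p2)          # together in both clusterings
    only_first = len(p1 - p2)    # together in the first only
    only_second = len(p2 - p1)   # together in the second only
    neither = total - both - only_first - only_second
    rand = (both + neither) / total if total else 1.0
    union = both + only_first + only_second
    jaccard = both / union if union else 1.0
    return rand, jaccard


a = [[0, 1, 3], [2], [4]]
b = [[0, 1], [3], [2, 4]]
print(rand_and_jaccard(a, b))
```

The sets p1 - p2 and p2 - p1 are the pairs clustered by one clustering but not the other. Enumerating all pairs is quadratic in cluster size, so this sketch is only suitable for small inputs.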
wcd has support for both shared and distributed memory
parallelisation. The parallel version supports straightforward
clustering only.
There are, however, major restrictions should you use these options.
In version 0.4, the wcd options for suffix clustering, merging,
reclustering, dealing with constraints, etc. are NOT supported when you
use the parallel options. It is my intention that future versions will
fix these problems. The following options are not supported if you use
the parallel options.
If you are running wcd on a shared memory processor with multiple
threads, the --num_threads or -N option can be used to
specify how many threads should be used. If there's a close match to the
number of CPUs that are available and unloaded, you should see a
performance improvement, though the current version is not very scalable.
By enabling MPI support when installing, wcd can be used in a
cluster of workstations. A description of MPI is beyond the scope of
this document. Use mpirun to run wcd (which takes the
normal parameters). This code has been tested using LAMMPI
(RedHat, Suse, MacOS X), MPICH (Ubuntu) and MVAPICH (Suse).
For example, using LAMMPI the lamboot command specifies what
processors are available (the list is given in the hosts file – in its
simplest form a list of the machines or their IP addresses). The
mpirun command is then used to run wcd. A simple example
follows.
lamboot hosts
mpirun -np 4 wcd -c sample.fas
This will run wcd on 4 different processors (these processors may
be real or virtual, depending on what's available on the machines
specified by the hosts file). When wcd runs like this with
multiple processors available, one copy of wcd runs as the
master, and the rest as slaves. The sequence input file must be
available on the master node, but need not be on the others.
The master process does not do any clustering itself, but merely coordinates the clustering process. In the above example, this means you would be running a master and three slaves and so could expect a 3-fold improvement in performance at best. The computational load on the master is fairly small, so it is safe (memory being available) to schedule both a master and a slave on the same processor.
In future versions of wcd, the behaviour is likely to change so
that the master does do clustering (to make it more memory efficient).
NOTE: When you install wcd you can enable both Pthreads
and MPI so that the executable can do both. BUT: Do NOT try to use
Pthreads and MPI at the same time (this will be something that goes into
a later version of wcd).