

3.11 Output

All output goes to standard output. The format of the output is chosen with the format options described in the arguments section. Note that if you give no format option, nothing is printed and the run is wasted.

Format of the Compressed Cluster Table

A convenient way (for humans and probably for many programs) to show the output of the clustering process is to use the --show_clusters option. The format of the compressed cluster table is very simple. Each cluster appears on a line by itself and is given by listing the indices of the sequences that make up the cluster. The indices are separated by a space, and the last index in the cluster is followed by a full stop (.). For example:

     0 1 3.
     2.
     4.
     5 7.
     6 8 9.
     10.
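The format above is simple enough to read back with a few lines of code. The following is a minimal sketch in Python, not part of the wcd distribution; the function name parse_compressed_clusters is hypothetical, and the sketch assumes that leading spaces and blank lines can be ignored.

```python
def parse_compressed_clusters(text):
    """Parse a compressed cluster table into a list of clusters.

    Each non-blank line is expected to hold space-separated sequence
    indices, with a full stop after the last index.
    """
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if not line.endswith("."):
            raise ValueError(f"missing terminating full stop: {line!r}")
        # Drop the full stop, then split the indices on whitespace.
        clusters.append([int(tok) for tok in line[:-1].split()])
    return clusters


sample = """\
0 1 3.
2.
4.
5 7.
6 8 9.
10.
"""
clusters = parse_compressed_clusters(sample)
print(len(clusters))  # 6
print(clusters[0])    # [0, 1, 3]
```

Applied to the example table, this gives six clusters, the first of which holds sequences 0, 1 and 3.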

Format of Extended Cluster Table

When the --ext_show option is chosen, the clustering is given in table format. The columns of the table are as follows:

The orient field requires a little more explanation. It gives the orientation of the sequence with respect to the root of its cluster. Formally, let $x$ and $y$ be two sequences in a cluster. While they may not overlap, we know that there is an ordered list, or path, of sequences $x = x_0, x_1, \ldots, x_{n-1}, x_n = y$ such that for each $i$ either $d^2(x_i, x_{i+1}) \leq \theta$ (positive match) or $d^2(x_i, rc(x_{i+1})) \leq \theta$ (reverse-complement match), where $\theta$ is the threshold. In particular, for every sequence there is such a path from the root of the cluster to that sequence. Along a path from the root to a sequence $x$, we count the number of positive matches and the number of reverse-complement matches. The orient field is 1 if the number of reverse-complement matches is even, and -1 if it is odd.

In principle, two paths from the root to a sequence could yield different orient values. First, this is unlikely to happen. Second, all the orient field says is that such an orientation of the sequence is legitimate. The fact that another orientation is also legitimate does not affect the correctness of the result.
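The parity rule for the orient field can be illustrated with a short sketch. This is not wcd's own code; the function name orient_of_path and the encoding of a path as a list of 1 (positive match) and -1 (reverse-complement match) values are assumptions made for the example.

```python
def orient_of_path(path_matches):
    """Return the orient value for a path from the cluster root.

    path_matches holds one entry per step on the path:
    1 for a positive match, -1 for a reverse-complement match
    (a hypothetical encoding chosen for this sketch).
    """
    rc_count = sum(1 for m in path_matches if m == -1)
    # An even number of reverse-complement steps leaves the orientation
    # unchanged relative to the root; an odd number flips it.
    return 1 if rc_count % 2 == 0 else -1


print(orient_of_path([1, -1, 1]))   # -1 (one reverse-complement match)
print(orient_of_path([-1, 1, -1]))  # 1 (two reverse-complement matches)
```

The root itself corresponds to the empty path, which contains zero reverse-complement matches and therefore has orient 1.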

The Dump option

Usage: wcd -d dump_file  seq_file

When used with this option, wcd opens the given dump file for writing and then performs clustering. Whenever it finds two sequences that should be clustered, it writes the match to the dump file: the indices of the two sequences, followed by 1 (if there is a positive match) or -1 (if there is an RC match).

This was introduced into wcd to support our simplistic parallelisation (see the parallelisation section in the technical manual).
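To show how such a list of matches determines a clustering, here is a minimal sketch that rebuilds clusters from dump records with a union-find structure. It is an illustration only, not the method wcd uses internally. It assumes one match per line, written as two indices and a 1 or -1 separated by whitespace; the exact layout of the dump file is not specified above, and the function name clusters_from_dump is hypothetical.

```python
def clusters_from_dump(lines, num_seqs):
    """Group sequence indices into clusters from dump-style match records.

    Assumed record layout (not confirmed by the manual): "i j sign",
    whitespace separated, one match per line.
    """
    parent = list(range(num_seqs))

    def find(i):
        # Follow parent links to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        i, j, sign = (int(f) for f in fields)
        if sign not in (1, -1):
            raise ValueError(f"unexpected match type: {sign}")
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the representative of the merged set.
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for s in range(num_seqs):
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


records = ["0 1 1", "1 3 -1", "5 7 1"]
print(clusters_from_dump(records, 8))
# [[0, 1, 3], [2], [4], [5, 7], [6]]
```

The match type is validated but not otherwise used here, since cluster membership depends only on which pairs matched.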

3.12 Auxiliary programs

A number of auxiliary programs come with the wcd distribution.

Both programs read from standard input and write to standard output, so you would probably run them as follows:

     ./ext2comp.pl < cluster.ext > cluster.com
     ./comp2ext.pl < cluster.com > cluster.ext

3.13 Running wcd in parallel

wcd has support for both shared and distributed memory parallelisation. The parallel version supports straightforward clustering only.

There are, however, major restrictions should you use these options.

In version 0.4, the wcd options for suffix clustering, merging, reclustering, dealing with constraints, etc. are NOT supported when you use the parallel options. It is my intention that future versions will fix these problems. The following options are not supported if you use the parallel options.

3.13.1 Shared Memory Parallelisation

If you are running wcd on a shared-memory machine that can run multiple threads, the --num_threads or -N option can be used to specify how many threads should be used. If the number of threads closely matches the number of CPUs that are available and unloaded, you should see a performance improvement, though the current version is not very scalable.

3.13.2 MPI Parallelisation

If MPI support is enabled when wcd is installed, wcd can be used on a cluster of workstations. A description of MPI is beyond the scope of this document. Use mpirun to run wcd (which takes the normal parameters). This code has been tested using LAMMPI (RedHat, Suse, MacOS X), MPICH (Ubuntu) and MVAPICH (Suse).

For example, using LAMMPI, the lamboot command specifies which processors are available (the list is given in the hosts file, which in its simplest form is a list of the machines or their IP addresses). The mpirun command is then used to run wcd. A simple example follows.

     lamboot hosts
     mpirun -np 4 wcd -c sample.fas

This will run wcd on 4 different processors (these processors may be real or virtual, depending on what's available on the machines specified by the hosts file). When wcd runs like this with multiple processors available, one instance of wcd runs as the master, and the rest as slaves. The sequence input file must be available on the master node, but need not be on the others.

The master process does not do any clustering itself, but merely coordinates the clustering process. In the above example, this means you would be running a master and three slaves, and so could expect a 3-fold improvement in performance at best. The computational load on the master is fairly small, so it is safe (memory being available) to schedule both a master and a slave on the same processor.

In future versions of wcd, the behaviour is likely to change so that the master also does clustering (to make it more memory efficient).

NOTE: When you install wcd you can enable both Pthreads and MPI so that the executable can do both. BUT: Do NOT try to use Pthreads and MPI at the same time (support for this is planned for a later version of wcd).