Output - wcd Manual 0.

3.11 Output

All output goes to standard output. The format of the output is determined by the format arguments described in the arguments section. Note that if you do not give any format arguments, nothing is printed and the run is wasted.

Format of the Compressed Cluster Table

A convenient way (for humans and probably for many programs) to show the output of the clustering process is to use the --show_clusters option. The format of the compressed cluster table is very simple. Each cluster appears on a line by itself. The cluster is given by listing the indices of the sequences that make up the cluster. The indices are separated by a space, and the last sequence in the cluster is followed by a full stop `.'.
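To make the format concrete, here is a minimal Python sketch of reading a compressed cluster table into lists of sequence indices. The function name `parse_compressed_clusters` and the sample text are illustrative assumptions, not part of wcd; the sketch relies only on the format described above (space-separated indices, one cluster per line, terminated by a full stop).

```python
def parse_compressed_clusters(text):
    """Parse a compressed cluster table into a list of clusters.

    Each returned cluster is a list of integer sequence indices.
    """
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Each cluster line ends with a full stop; drop it before splitting.
        if line.endswith("."):
            line = line[:-1]
        clusters.append([int(tok) for tok in line.split()])
    return clusters


# Hypothetical table with three clusters (made-up data, not real wcd output).
sample = "0 3 5.\n1.\n2 4.\n"
print(parse_compressed_clusters(sample))
```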

Format of Extended Cluster Table

When the --ext_show option is chosen, the clustering is given in table format. The columns of the table are as follows:

the sequence identifier (note that the sequences are numbered from 0, in the order that they appear in the input file);
cluster number: in each cluster, one sequence is chosen as the representative of the cluster and its index is used for the cluster. You can identify the roots because their indices are the same as their cluster numbers.
link: the number of another sequence in the cluster. It is guaranteed that if you start at the root, or representative sequence in the cluster, you can traverse the entire cluster using the link field. It is not guaranteed that two adjacent nodes are within the d2 threshold.
orient: the orientation of that witness (positive or RC) with respect to the root or representative sequence in the cluster.
witness: the number of another sequence in the cluster which is within the d2 threshold of that sequence. This may or may not be the same as the link field. Note that the value of the link field is an artifact, merely a convenient way in which wcd can list all the sequences in one cluster, whereas the witness field tells us about two sequences that do overlap.
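As a rough illustration of how the columns can be used, the following Python sketch groups sequences by cluster number and picks out the roots (rows whose sequence index equals their cluster number). It assumes, without confirmation from wcd itself, that each row holds the five columns as whitespace-separated integers in the order listed above; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def clusters_from_extended(text):
    """Group sequence indices by cluster number and collect the roots.

    Assumes five whitespace-separated integer columns per row:
    sequence, cluster, link, orient, witness (an assumption about layout).
    """
    clusters = defaultdict(list)
    roots = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a table row
        try:
            seq, cluster, _link, _orient, _witness = (int(f) for f in fields[:5])
        except ValueError:
            continue  # skip any non-numeric line, such as a header
        clusters[cluster].append(seq)
        if seq == cluster:
            roots.append(seq)  # a root's index equals its cluster number
    return dict(clusters), roots


# Made-up rows for five sequences in two clusters (not real wcd output).
sample = """\
0 0 2 1 2
1 1 3 1 3
2 0 4 1 0
3 1 1 -1 1
4 0 0 -1 2
"""
clusters, roots = clusters_from_extended(sample)
print(clusters, roots)
```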

The orient field requires a little more explanation. It gives the orientation of the sequence with respect to the root of its cluster. Formally, this is described as follows. Let $x$ and $y$ be two sequences in a cluster. While they may not overlap, we know that there is an ordered list or path of sequences $x = x_0, x_1, \ldots, x_{n-1}, x_n = y$ such that for each $i$ either $d^2(x_i, x_{i+1}) \leq \theta$ (positive match) or $d^2(x_i, rc(x_{i+1})) \leq \theta$ (reverse-complement match), where $\theta$ is the threshold. In particular, for every sequence there is such a path from the root of the cluster to that sequence. In a path from the root to a sequence $x$, we count the number of times the match is a positive one and the number of times it is a reverse-complement one. The orient field is 1 if the number of reverse-complement matches is even, and -1 if the number is odd.
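The parity rule can be written down in a few lines. The sketch below is only an illustration of the definition, not wcd's code: the function name `orient_from_path` and the encoding of a path as a list of +1 (positive match) and -1 (reverse-complement match) values are assumptions made for the example.

```python
def orient_from_path(matches):
    """Return the orient value for a path from the root to a sequence.

    `matches` lists the match type of each step on the path:
    +1 for a positive match, -1 for a reverse-complement match.
    """
    rc_matches = sum(1 for m in matches if m == -1)
    # An even number of reverse-complement steps cancels out.
    return 1 if rc_matches % 2 == 0 else -1


print(orient_from_path([1, -1, 1]))   # one RC match on the path
print(orient_from_path([-1, 1, -1]))  # two RC matches on the path
```

The root itself has an empty path, so under this rule its orient value is 1.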

In principle it is possible for there to be two paths from the root to a sequence which would yield different orient values. First, this is unlikely to happen. Second, all the orient field is saying is that such an orientation of the sequence is legitimate. The fact that the other orientation is also legitimate does not affect the correctness of the result.

The Dump option

Usage: wcd -d dump_file  seq_file

When used with this option, wcd opens the given dump file for writing and then performs clustering. Whenever it finds two sequences that should be clustered, it writes the match to the dump file: the output is the indices of the two sequences, and a 1 (if there is a positive match) or -1 (if there is an RC match).
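To show how such match records determine a clustering, here is a small union-find sketch in Python. It is not the method used by wcd or its auxiliary programs; it assumes each record is one line of three whitespace-separated integers (two sequence indices and the 1 or -1 flag), which the text above does not spell out. The function name `cluster_from_dump` and the sample records are hypothetical.

```python
def cluster_from_dump(lines, num_seqs):
    """Build clusters from match records using union-find.

    Assumes each record is "i j sign" on one line (an assumed layout).
    """
    parent = list(range(num_seqs))

    def find(i):
        # Follow parent pointers to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not a match record
        i, j, _sign = (int(f) for f in fields)
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the representative.
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for s in range(num_seqs):
        clusters.setdefault(find(s), []).append(s)
    return [clusters[r] for r in sorted(clusters)]


# Made-up match records for six sequences.
records = ["0 3 1", "3 5 -1", "2 4 1"]
print(cluster_from_dump(records, 6))
```

Because only the pairs matter for cluster membership, records from several dump files can be fed through the same function in any order.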

This was introduced into wcd to support our simplistic parallelisation (see the parallelisation section in the technical manual).

3.12 Auxiliary programs

A number of auxiliary programs come with the wcd distribution.

rindex.py
This Python program takes two arguments, the names of two files each containing (compact) clusterings. It computes the sensitivity, specificity, Jaccard index, Rand index and correlation coefficient between the two clusterings.
If you use the --index rand option, only the Rand index is shown. If you use the --index jaccard option, only the Jaccard index is shown.
If you use the --diff n option, the indices above are not printed but the mismatches between the two clusterings are shown. First the pairs that are clustered by the first clustering but not the second are shown, and then the ones clustered by the second but not the first. If n==1, then all such pairs are shown; otherwise only the pairs that belong to clusters with n or fewer sequences are shown. This is helpful for exploring differences between clusterings.
ext2comp.pl
This Perl program converts the extended cluster table format to the compressed table format.
comp2ext.pl
This Perl program converts the compressed table format to the extended table format. Since the compressed format does not contain all the information of the extended table, you will find 0s in the orient column and -1 in the witness field.

For both programs, input and output are standard input and standard output, so you would probably run the programs thus:

     ./ext2comp.pl < cluster.ext > cluster.com
     ./comp2ext.pl < cluster.com > cluster.ext

fasta2sary.py
This takes a FASTA file as input and writes it out in a format suitable for building a suffix array. It can do simple clean-up as well.
```
python fasta2sary.py -x -d 11 myfile.fasta -o myfile.fasta.nlc
```
Note the naming convention that should be used: the name of the output file must be the name of the input file with .nlc appended.
analysecluster.py
Takes as input a clt file and produces a histogram of the cluster tables. If you use the -t N option, any clusters with more than N sequences are output instead of the histogram.
combine.c
This takes as input a list of names of dump files, reads in each dump file in turn, and constructs the clustering from that. To make the executable, say make combine.
In addition to these programs, there is a shell script wcd_wrapper.sh that allows wcd to be used as a replacement for d2_cluster in the stackPACK analysis pipeline.
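As an illustration of two of the pair-counting indices that rindex.py reports, the following Python sketch computes the Rand and Jaccard indices using their standard definitions. It is not taken from rindex.py, which may compute them differently; the function names and sample clusterings are assumptions, and both clusterings are assumed to cover exactly the same sequences. The loop visits every pair, so it is only suitable for small inputs.

```python
from itertools import combinations


def pair_counts(clustering_a, clustering_b):
    """Count sequence pairs by whether each clustering puts them together.

    Returns (both, only_a, only_b, neither).
    """
    label_a = {s: k for k, cl in enumerate(clustering_a) for s in cl}
    label_b = {s: k for k, cl in enumerate(clustering_b) for s in cl}
    both = only_a = only_b = neither = 0
    for x, y in combinations(sorted(label_a), 2):
        same_a = label_a[x] == label_a[y]
        same_b = label_b[x] == label_b[y]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(clustering_a, clustering_b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(clustering_a, clustering_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(clustering_a, clustering_b):
    """Pairs clustered by both, over pairs clustered by at least one."""
    both, only_a, only_b, _ = pair_counts(clustering_a, clustering_b)
    return both / (both + only_a + only_b)


# Two made-up clusterings of five sequences.
a = [[0, 1, 2], [3, 4]]
b = [[0, 1], [2, 3, 4]]
print(pair_counts(a, b), rand_index(a, b), jaccard_index(a, b))
```

The `only_a` and `only_b` counts correspond to the two groups of mismatched pairs described for the diff option of rindex.py above.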

3.13 Running wcd in parallel

wcd has support for both shared and distributed memory parallelisation. The parallel version supports straightforward clustering only.

There are, however, major restrictions should you use these options.

In version 0.4, the wcd options for suffix clustering, merging, reclustering, dealing with constraints etc. are NOT supported when you use the parallel options. It is my intention that future versions will fix these problems. The following options are not supported if you use the parallel options:

suffix-last based clustering
--show_seq, --show_rc_seq
-E, --compare: show seqs, number common words, and d2scores
-e, --abbrev_compare: show min of d2scores (pos + rc)
-p, --pairwise: show pairwise d2 scores of all windows
-D, --cluster_compare: compare two clusters
-m, --merge: merge two clusterings
-a, --add
-r, --recluster

3.13.1 Shared Memory Parallelisation

If you are running wcd on a shared-memory multiprocessor, the --num_threads or -N option can be used to specify how many threads should be used. If the number of threads is close to the number of CPUs that are available and unloaded, you should see a performance improvement, though the current version is not very scalable.

3.13.2 MPI Parallelisation

By enabling MPI support when installing, wcd can be used in a cluster of workstations. A description of MPI is beyond the scope of this document. Use mpirun to run wcd (which takes the normal parameters). This code has been tested using LAMMPI (RedHat, Suse, MacOS X), MPICH (Ubuntu) and MVAPICH (Suse).

For example, using LAMMPI, the lamboot command specifies which processors are available (the list is given in the hosts file, which in its simplest form is a list of the machines or their IP addresses). The mpirun command is then used to run wcd. A simple example follows.

     lamboot hosts
     mpirun -np 4 wcd -c sample.fas

This will run wcd on 4 different processors (these processors may be real or virtual, depending on what is available on the machines specified by the hosts file). When wcd runs like this with multiple processors available, one instance of wcd runs as the master, and the rest as slaves. The sequence input file must be available on the master node, but need not be on the others.

The master process does not do any clustering itself, but merely coordinates the clustering process. In the above example, this means you would be running a master and three slaves, and so could expect a 3-fold improvement in performance at best. The computational load on the master is fairly small, so it is safe (provided memory is available) to schedule both a master and a slave on the same processor.

In future versions of wcd, the behaviour is likely to change so that the master does do clustering (to make it more memory efficient).

NOTE: When you install wcd you can enable both Pthreads and MPI so that the executable can do both. BUT: Do NOT try to use Pthreads and MPI at the same time (this will be something that goes into a later version of wcd).