

5 Technical manual

Hopefully this will fill out a bit and become a little more coherent.

5.1 Program structure

Originally wcd was intended to be a small program that would fit into a single file. However, it grew and we needed to break it up. The main files are

5.2 Data structures

The key data structure used is a find-union data structure. This discussion assumes that you know how that works; if you don't, look in a standard algorithms text such as Cormen, Leiserson and Rivest.

All the sequences in one cluster will, by the end, belong to the same tree in a find-union forest. Each sequence points to the root of its tree.

To represent sequences we use an array of the following structs:


typedef struct SeqStruct  {
  seqType seq;     // the sequence
  int    len;     // length
  string id;      // sequence id (from data file)
  int    cluster; // refers to index of root
  int    rank;    // for find-union 
  int    next;    // for find-union: next in linked list of cluster
  int    last;    // for find-union: last ditto
  int    match;   // index of another seq for which a match was made
  short  orient;  // pos or rc for that match
  int    flag;    // flags for fix, reset etc
} SeqStruct;


typedef SeqStruct * SeqPtr;

SeqPtr seq;

The key variable is seq, which is declared globally and statically. The main function opens the data file(s) and computes the number of sequences in them. Memory is then allocated for seq, which is treated as an array thereafter. The individual array elements are actually filled in in the function read_sequences.
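To make the role of the cluster and rank fields concrete, here is a minimal standalone sketch of the find and union operations over such an array. The Node struct, find_root and merge_clusters are illustrative names, not wcd's actual code; wcd keeps these fields inside SeqStruct as shown above.

```c
#include <stdlib.h>

/* Illustrative stand-in for SeqStruct, keeping only the
   find-union fields discussed here. */
typedef struct {
    int cluster;  /* index of parent; a root points to itself */
    int rank;     /* rank, for union by rank */
} Node;

static Node *seq;

/* find with path halving: returns the root of i's tree */
static int find_root(int i) {
    while (seq[i].cluster != i) {
        seq[i].cluster = seq[seq[i].cluster].cluster;
        i = seq[i].cluster;
    }
    return i;
}

/* union by rank: merge the clusters containing i and j */
static void merge_clusters(int i, int j) {
    int ri = find_root(i), rj = find_root(j);
    if (ri == rj) return;
    if (seq[ri].rank < seq[rj].rank) { int t = ri; ri = rj; rj = t; }
    seq[rj].cluster = ri;                 /* attach smaller rank under larger */
    if (seq[ri].rank == seq[rj].rank) seq[ri].rank++;
}
```

After all merges, two sequences are in the same cluster exactly when find_root returns the same index for both.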

The fields of the struct are:

Storage of sequences.

Different versions of this program have taken different approaches to the storage of sequences. Initially, I went for one base per byte. Then, I went for four bases per byte. The reason for this (explained below) is to save memory. However, it does make the code more complicated and causes recomputation, as we need to extract the information back out. I then moved to a redundant representation of the data, with each word stored in one machine word. However, this was too memory-intensive for large data sets, so I went back to the compact representation.

The data is stored in compressed form. The sequences are read in in ASCII format, one base per byte. However, we only need two bits to represent the four bases, so we compress the data, storing four bases per byte. There is a little complexity in this (the macros GETWORD and so on look like witchcraft) and some performance penalty in time, but if we have 1M ESTs it may make a few hundred megabytes of memory difference, which may be useful.

A downside of the compressed format is that we can only treat Ns in the data as some arbitrary symbol. It would be better to ignore all words that contain an N, but there is no way of detecting this in the compressed format.

The key data structures are shown below. Each sequence is an array of integers: the definition of seqElt is key. Have a look at the wcd code to see the current version (this may change). In 0.1.8 at least we used int32_t. If you want to change this to some other type, change the definition of seqElt and change SEQELTWIDTH to reflect the new bit width.

typedef int32_t   seqElt;
typedef seqElt  * seqType;
#define  SEQELTWIDTH  32
#define  BASESPERELT   (SEQELTWIDTH/2)
In read_sequences each sequence is read into memory in turn. Once we know the length of the sequence, we divide through by BASESPERELT to compute the length of the needed array and allocate memory accordingly.
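As a rough illustration of the packing arithmetic, here is a hedged sketch of storing and retrieving bases two bits at a time. The functions base2bits, pack_seq and get_base are invented names; the real GETWORD macros in wcd may assign codes and order bits differently.

```c
#include <stdint.h>

/* Definitions as in the manual above. */
typedef int32_t seqElt;
#define SEQELTWIDTH 32
#define BASESPERELT (SEQELTWIDTH/2)

/* Illustrative 2-bit encoding; Ns and other symbols collapse to 3. */
static int base2bits(char c) {
    switch (c) {
        case 'a': case 'A': return 0;
        case 'c': case 'C': return 1;
        case 'g': case 'G': return 2;
        default:            return 3;
    }
}

/* Pack an ASCII sequence, BASESPERELT bases per seqElt. */
static void pack_seq(const char *ascii, int len, seqElt *out) {
    for (int i = 0; i < len; i++) {
        int elt = i / BASESPERELT, off = 2 * (i % BASESPERELT);
        if (off == 0) out[elt] = 0;   /* clear each element on entry */
        out[elt] |= (seqElt)((uint32_t)base2bits(ascii[i]) << off);
    }
}

/* Extract base i again as a 2-bit code. */
static int get_base(const seqElt *s, int i) {
    return ((uint32_t)s[i / BASESPERELT] >> (2 * (i % BASESPERELT))) & 3;
}
```

This also shows where the recomputation cost mentioned above comes from: every access to a base pays a shift and a mask.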

The maximum EST length is set by a macro constant, MAX_SEQUENCE_LENGTH. It's currently 64K, but can be changed to any reasonable value. The only real limitation is that the next_seq variable in read_sequences is declared inside the function, and on some machines there may be a limit on how much memory can be allocated on the stack. In version 0.1.8, for example, this would mean that the actual limit on sequence length is 64K*8, which is probably sufficient. If you run into this problem you may consider moving the declaration to global scope.

5.3 The algorithms

5.3.1 Heuristic

Since d2 is an expensive test, a heuristic test is done before d2 is called. This is based upon the idea that for a d2 score to be under the threshold, the two sequences must share a number of non-overlapping common words.

I have not had time to work on this analytically, though I think it can be shown. However, I have done some empirical work which showed that if you graph the d2 score (y axis) against the common word score (x axis), all the data points lie above the line d2 = -12.5cw + 150. This means that at a d2 threshold of 40, all matches have over 8 common words. Thus the default setting of 5 is very conservative.
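The arithmetic behind that claim can be sketched as follows; min_common_words is an illustrative helper, not a wcd function, and it simply solves the empirical line above for cw.

```c
/* Solve d2 = -12.5*cw + 150 for cw at a given d2 threshold:
   any match below the threshold must have at least this many
   common words, according to the empirical bound. */
static double min_common_words(double d2_threshold) {
    return (150.0 - d2_threshold) / 12.5;
}
```

At a threshold of 40 this gives (150 - 40) / 12.5 = 8.8, i.e. over 8 common words, so a cut-off of 5 discards very little.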

The heuristic is a little more complicated than this. If we are comparing sequences u and v, then

5.3.2 The parameters of do_cluster, and implementing merging and add

Normally, to do a clustering we need to compare each sequence to each other. The obvious way of doing this is by using a double loop:

   for(i=0; i<num_seqs; i++) {
      for(j=i+1; j<num_seqs; j++) {
         blah, blah
      }
   }

When we do a merge, we have two separate clusterings, so there is no need to compare the sequences in the first half with each other, nor the sequences in the second half with each other; we only compare sequences in the first half with those in the second. Assuming there are n1 sequences in the first part, the code becomes

   for(i=0; i<n1; i++) {
      for(j=n1; j<num_seqs; j++) {
         blah, blah
      }
   }

When we do an add, we have a clustering that we add sequences to. The clustered data is in the first part and the new data is in the second. There is no need to compare the sequences in the first half with each other. The code becomes

   for(i=0; i<num_seqs; i++) {
      for(j=max(n1,i+1); j<num_seqs; j++) {
         blah, blah
      }
   }

Each of these three cases can be dealt with by passing as parameters of do_cluster the bounds of these two loops. The indices are set in the main function based on the program's arguments.
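The three loop shapes above can be unified in a single parameterised loop. This is only a sketch of the idea: the names do_cluster_sketch, i_end and j_from are illustrative and do not match wcd's actual do_cluster signature.

```c
/* One loop covers all three cases: i runs to i_end, and j runs
   from max(j_from, i+1) to num_seqs. */
static void do_cluster_sketch(int i_end, int j_from, int num_seqs,
                              void (*compare)(int, int)) {
    for (int i = 0; i < i_end; i++) {
        int j0 = (j_from > i + 1) ? j_from : i + 1;
        for (int j = j0; j < num_seqs; j++)
            compare(i, j);
    }
}

/* full clustering: do_cluster_sketch(num_seqs, 0,  num_seqs, cmp)
   merge:           do_cluster_sketch(n1,       n1, num_seqs, cmp)
   add:             do_cluster_sketch(num_seqs, n1, num_seqs, cmp) */
```

Setting j_from to 0 recovers the plain double loop, since max(0, i+1) is always i+1.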

In addition, sometimes we only want to cluster some of the sequences. There are two ways in which this can be done. One is to set the ignore flag of each sequence to an appropriate value. The other is to pass the do_cluster function an array containing the indices of the sequences to be clustered. This index array can then be used to extract the indices of the sequences.
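The index-array indirection can be sketched as follows; cluster_subset is an invented name used only to illustrate the idea of looping over positions in the index array rather than over sequence numbers directly.

```c
/* Cluster only the sequences whose indices appear in index[0..count-1].
   ii and jj range over positions in the index array; the real
   sequence numbers are index[ii] and index[jj]. */
static void cluster_subset(const int *index, int count,
                           void (*compare)(int, int)) {
    for (int ii = 0; ii < count; ii++)
        for (int jj = ii + 1; jj < count; jj++)
            compare(index[ii], index[jj]);
}
```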

Thus there are four parameters for the do_cluster function:

5.4 How to add distance functions

If you want to add a new distance function, say farfun, you need to implement three functions

The code should be placed in a file called farfun.c and the prototype should be put in farfun.h.

Then, in wcd.c, add it to chooseFun.

In common.c, call your initialisation code from initialise.

5.5 Parallelisation

This section first discusses work allocation in MPI and Pthreads and then discusses some more primitive support that wcd has for parallelisation.

Different work allocation methods are provided for MPI and Pthreads. This is to allow for experimentation and future releases of wcd are likely to change the way in which they do things.

5.6 Work allocation in MPI

In MPI, work allocation is done statically. The input file is read in and estimates are made of the clustering costs based upon sequence length and the number of sequences. The workload is then allocated to the slaves and the slaves are launched. This may well change in future, so if you write a script that uses this option, please check it against future releases.

5.7 Work allocation using Pthreads

In Pthreads, the work allocation is done dynamically. Conceptually the work is broken up into a d x d grid of work and the work is then allocated to slaves. As there are many more blocks than slaves, each slave will have to ask for work many times.
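A minimal sketch of this scheme is shown below, assuming a shared counter protected by a mutex: each slave repeatedly takes the next block index until none remain. The names get_block, slave and total_blocks are illustrative; wcd's actual implementation differs.

```c
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_block = 0;
static int total_blocks;   /* d*d blocks in the conceptual grid */

/* Returns the next block to work on, or -1 when all are taken. */
static int get_block(void) {
    pthread_mutex_lock(&work_lock);
    int b = (next_block < total_blocks) ? next_block++ : -1;
    pthread_mutex_unlock(&work_lock);
    return b;
}

/* Each slave asks for work until the grid is exhausted. */
static void *slave(void *arg) {
    (void)arg;
    long done = 0;
    int b;
    while ((b = get_block()) != -1) {
        /* ... compare the sequence ranges that block b covers ... */
        done++;
    }
    return (void *)(intptr_t)done;   /* blocks processed by this slave */
}
```

Because blocks are handed out on demand, a slave that lands on cheap blocks simply asks again sooner, which balances the load without any up-front cost estimate.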

5.7.1 Distributed Parallelisation

wcd has a number of options to support a simple parallelisation on a local network. The approach is very simplistic: we had a very large real job (110k mRNAs, 250M of data) that needed to be clustered in a hurry, so we hacked an inelegant solution together. Now that MPI support is available, these options will become less important and may be removed in future versions of wcd. (Let me know if you rely on them.)

In our labs we have about 150 machines on one NFS server. We wrote a Perl script that launched wcd on each of the machines, using the --range and --dump options to specify the workload each machine had to take on and to say where the results should go.

NFS is used for communication. Each process reads in the entire sequence file, which is obviously hideously expensive. To make what we did socially acceptable and to prevent our NFS server from falling over, the Perl script that launches the wcd processes delays for about 60s between launches, which means that it takes over 2 hours to start all the wcd processes.

The auxiliary program combine can be used to read in all the final dump files to produce the overall clustering.