Next: Testing, Previous: wcd inside stackPACK, Up: Top
Hopefully this will fill out a bit and become a little more coherent.
Originally wcd was intended to be one small program that would fit into one file. However, it grew and we needed to break the program up. The main files are:
wcd.h: This contains the critical type definitions of the program.
wcd.c: This is the main program. It contains some utility and auxiliary code, processes all the command-line arguments and then calls the right routines. The key function in this code is do_cluster. This does the pair-wise comparison of the sequences and, where the comparison falls below the threshold, merges the clusters, calling the appropriate find-union data structures.
ed.c: this contains the routines for computing the edit distance.
d2.c: this contains the routines for computing the d2 score.
common.c: this contains code and declarations for manipulating words and sequences.
The key data structure used is a find-union data structure. This discussion assumes that you know how that works; if you don't, look in a standard algorithms text such as Cormen, Leiserson and Rivest.
All the sequences in one cluster will belong at the end to the same tree in a find-union forest. Each sequence will point to the root of the tree.
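The "find" half of this structure can be sketched as below. This is a generic illustration with hypothetical names, not the actual wcd code; in wcd the parent link lives in the cluster field of SeqStruct (described next). Path compression makes every node visited on the way up point directly at the root, so later finds are nearly constant time.

```c
#include <assert.h>

/* Hypothetical standalone sketch of find with path compression.
   parent[i] == i marks a root. */
static int parent[8];

int find_root(int i) {
    int root = i;
    while (parent[root] != root)    /* first pass: walk up to the root */
        root = parent[root];
    while (parent[i] != root) {     /* second pass: compress the path */
        int up = parent[i];
        parent[i] = root;
        i = up;
    }
    return root;
}
```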
To represent sequences we use an array of the following structs:
typedef struct SeqStruct {
seqType seq; // the sequence
int len; // length
string id; // sequence id (from data file)
int cluster; // refers to index of root
int rank; // for find-union
int next; // for find-union: next in linked list of cluster
int last; // for find-union: last ditto
int match; // index of another seq for which a match was made
short orient; // pos or rc for that match
int flag; // flags for fix, reset etc
} SeqStruct;
typedef SeqStruct * SeqPtr;
SeqPtr seq;
The key variable is seq, which is declared globally and
statically. The main function opens the data file(s) and computes the
number of sequences in the file. Memory is then allocated for seq,
which is treated as an array thereafter. The values are actually
given to the individual array elements in the function read_sequences.
The fields of the struct are documented in the comments above.
However, if the orientations of i and j relative to their respective roots differ, then we know that the orientations of the two clusters are not consistent. We must therefore invert the orientations of all the sequences in one of the two clusters before merging. The choice of cluster is arbitrary, but we choose the one that will become the `child' cluster.
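The fix-up step can be sketched as follows. This is a hypothetical reconstruction, not wcd's actual code: orient[] holds each sequence's orientation relative to its root (0 for positive, 1 for reverse complement), and next[] chains the members of a cluster as in SeqStruct, with -1 marking the end of the list.

```c
/* Flip the orientation of every sequence in the child cluster before
   its root is attached under the new parent (hypothetical sketch). */
void flip_cluster(int child_root, short orient[], int next[]) {
    for (int s = child_root; s != -1; s = next[s])
        orient[s] ^= 1;   /* invert pos <-> rc */
}
```

Walking the cluster's linked list is why SeqStruct keeps the next field: the whole membership can be visited without scanning all sequences.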
Different versions of this program have taken different approaches to the storage of sequences. Initially, I went for one base per byte. Then, I went for four bases per byte; the reason for this (explained below) is to save memory. However, this makes the code more complicated and causes recomputation, as we need to extract the information back out. Then I went to a redundant representation of the data, each word being stored in one machine word. However, this was too memory intensive for large data sets, so I went back to the compact representation.
The data is stored in compressed form. The sequences are read in in ASCII format, one base per byte. However, we only need two bits to represent the four bases, and so we compress the data, storing four bases per byte. There is a little complexity in this (the macros GETWORD and so on look like witchcraft) and some performance penalty in time, but if we have 1M ESTs it may make a few hundred MB of memory difference, which may be useful.
A downside of the compressed format is that we can only treat Ns that are in the data as some arbitrary symbol. It would be better to ignore all words that contain an N, but there is no way of knowing this in the compressed format.
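The loss of information can be seen in the encoding step, sketched here with a hypothetical helper (not wcd's actual code). With only four two-bit codes available, an N has no code of its own and must be mapped onto one of the real bases, after which it is indistinguishable from that base.

```c
/* Hypothetical base encoder for the 2-bit scheme: A=0, C=1, G=2, T=3.
   There is no spare code for N, so it is arbitrarily mapped to the
   code for A; this is why N-containing words cannot be recognised
   and skipped once the data is compressed. */
int encode_base(char c) {
    switch (c) {
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 0;   /* A, N and anything else */
    }
}
```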
The key data structures are shown below. Each sequence is an array of
integers: the definition of seqElt is key. Have a look at the
wcd code to see the current version (this may change). In 0.1.8
at least we used int32_t. If you want to change this to some
other type, change the definition of seqElt and change the
definition of SEQELTWIDTH to reflect this change (i.e. the bit
width).
typedef int32_t seqElt;
typedef seqElt * seqType;
#define SEQELTWIDTH 32
#define BASESPERELT (SEQELTWIDTH/2)

In read_sequences each sequence is read into memory in turn. Once
we know the length of the sequence, we divide through by BASESPERELT to
compute the length of the needed array and allocate memory accordingly.
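Using the 0.1.8 typedefs above, the allocation arithmetic and per-base access can be sketched as below. The division is a ceiling division (a partial final element still needs storing), and get_base/put_base are hypothetical stand-ins for wcd's GETWORD-style macros, shown as functions for readability.

```c
#include <stdint.h>

typedef int32_t seqElt;
#define SEQELTWIDTH 32
#define BASESPERELT (SEQELTWIDTH/2)

/* number of seqElts needed for a sequence of len bases (ceiling) */
int elts_needed(int len) {
    return (len + BASESPERELT - 1) / BASESPERELT;
}

/* hypothetical stand-ins for the GETWORD-style macros: base i
   occupies two bits at offset 2*(i % BASESPERELT) of element
   i / BASESPERELT; arithmetic is done unsigned to avoid shifting
   into the sign bit */
void put_base(seqElt *s, int i, int b) {
    int e = i / BASESPERELT, off = 2 * (i % BASESPERELT);
    uint32_t w = (uint32_t)s[e];
    s[e] = (seqElt)((w & ~(3u << off)) | ((uint32_t)b << off));
}

int get_base(const seqElt *s, int i) {
    int off = 2 * (i % BASESPERELT);
    return (int)(((uint32_t)s[i / BASESPERELT] >> off) & 3u);
}
```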
The maximum EST length is set by a macro constant definition
MAX_SEQUENCE_LENGTH. It's currently 64K, but can be changed to any
reasonable value. The only real limitation is that the next_seq
variable in read_sequences is declared inside the function, and there
may be a limit on how much memory can be allocated on the stack on
some machines. In version 0.1.8, for example, this would mean that the
actual limit on sequence length is 64K*8, which is probably sufficient.
If you run into this problem you may consider making the declaration
global.
Since d2 is an expensive test, a heuristic test is done before d2 is called. This is based upon the idea that for a d2 score to be under the threshold, the two sequences must share a number of non-overlapping common words.
I have not had time to work on this analytically, though I think it can be shown. However, I have done some empirical work which showed that if you graph the d2 score (y axis) against the common word score (x axis), all the data points lie above the line d2 = -12.5cw + 150. This means that at a d2 threshold of 40, all matches have over 8 common words (solving 40 = -12.5cw + 150 gives cw = 8.8). Thus the default setting of 5 is very conservative.
The heuristic is a little more complicated than this. If we are comparing sequences u and v, then
(Note that this is not particularly expensive to do. When we do the clustering we have a double loop:
for(i=0; i<n; i++) {
  build table for sequence i
  for(j=i+1; j<n; j++) {
    compare i and j
  }
}
The building of the table is done in the outer loop, outside of the inner loop, and so its cost is negligible when amortised over all values of j.)
Normally, to do a clustering we need to compare each sequence with every other. The obvious way of doing this is with a double loop:
for(i=0; i<num_seqs; i++) {
for(j=i+1; j<num_seqs; j++) {
blah, blah
}
}
When we do a merge, we have two separate clusterings, and there is no need to compare the sequences in the first part with each other, nor the sequences in the second part with each other; we need only compare the sequences in the first part with those in the second. Assuming there are n1 sequences in the first part, the code becomes:
for(i=0; i<n1; i++) {
for(j=n1; j<num_seqs; j++) {
blah, blah
}
}
When we do an add, we have an existing clustering to which we add new sequences. The clustered data is in the first part and the new data is in the second. There is no need to compare the sequences in the first part with each other. The code becomes:
for(i=0; i<num_seqs; i++) {
for(j=max(n1,i+1); j<num_seqs; j++) {
blah, blah
}
}
Each of these three cases can be dealt with by passing the bounds of these two loops as parameters to do_cluster. The bounds are set in the main function based on the program's arguments.
In addition, sometimes we only want to cluster some of the
sequences. There are two ways in which this can be done. One is just
to set the ignore flag of the relevant sequences to
an appropriate value. The other is to pass the do_cluster
function an array which contains the indices of those sequences to be
clustered. This index array can then be used to extract the
indices of the sequences.
Thus there are four parameters for the do_cluster function:
i_end: where the i-loop should finish looping.
j_beg: where the j-loop should start looping.
j_end: where the j-loop should finish looping.
index: an array of the indices of the sequences to be clustered.
If this parameter has a NULL value, then all the
sequences from 0 to i_end will be compared with all
the sequences from j_beg to j_end.
If this parameter has a non-NULL value, then all the
sequences index[0] to index[i_end] will be
compared with all the sequences index[j_beg] to
index[j_end].
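Putting the three loop patterns and the index array together, the loop skeleton implied by these parameters could look like the sketch below. This is a hypothetical reconstruction, not the actual do_cluster code: the max(j_beg, i+1) lower bound reproduces the normal case (j_beg = 0), the merge case and the add case in one loop, and here each pair that would be compared is simply recorded, where the real code would call the distance function.

```c
/* Hypothetical reconstruction of the do_cluster loop skeleton.
   Writes each (u, v) pair that would be compared into pairs[] and
   returns the number of pairs. index may be NULL, in which case
   loop positions are sequence numbers directly. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

int cluster_pairs(int i_end, int j_beg, int j_end, const int *index,
                  int pairs[][2]) {
    int n = 0;
    for (int i = 0; i < i_end; i++)
        for (int j = MAX(j_beg, i + 1); j < j_end; j++) {
            pairs[n][0] = index ? index[i] : i;  /* map via index if given */
            pairs[n][1] = index ? index[j] : j;
            n++;
        }
    return n;
}
```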
If you want to add a new distance function, say farfun, you need to implement three functions.
The code should be placed in a file called farfun.c and the
prototypes should be put in farfun.h.
Then, in wcd.c, add the new function to chooseFun.
In common.c, call its initialisation code from initialise.
This section first discusses work allocation in MPI and Pthreads and
then discusses some more primitive support that wcd has for
parallelisation.
Different work allocation methods are provided for MPI and
Pthreads. This is to allow for experimentation and future releases of
wcd are likely to change the way in which they do things.
In MPI, work allocation is done statically. The input file is read in and estimates are made of the clustering costs based upon sequence length and the number of sequences. The workload is then allocated to the slaves and the slaves are launched. This may well change in future, so if you write a script that uses this option, please check it against future releases.
In Pthreads, the work allocation is done dynamically. Conceptually the work is broken up into a d x d grid of blocks which are then allocated to slaves. As there are many more blocks than slaves, each slave will have to ask for work many times.
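The claim step of such a dynamic scheme can be sketched as follows. This is a hypothetical illustration of the technique, not wcd's actual scheduler: each slave thread repeatedly asks for the next unassigned block, with a mutex serialising access to the shared counter.

```c
#include <pthread.h>

/* Hypothetical sketch of dynamic block allocation over the d x d
   grid: a shared counter hands out block ids under a mutex. */
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_block = 0;

/* returns the next block id in [0, num_blocks), or -1 when no
   work remains; safe to call from many threads concurrently */
int claim_block(int num_blocks) {
    pthread_mutex_lock(&work_lock);
    int b = (next_block < num_blocks) ? next_block++ : -1;
    pthread_mutex_unlock(&work_lock);
    return b;
}
```

A slave's main loop would then be `while ((b = claim_block(nb)) != -1) { process block b; }`, which naturally balances load: fast slaves simply come back for more blocks sooner.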
wcd has a number of options to support a simple parallelisation
on a local network. The approach is very simplistic: we had a very large
real job (110k mRNAs, 250M of data) that needed to be clustered
urgently, so we hacked an inelegant solution together. Now
that MPI support is available, these options will become less important
and may be removed in future versions of wcd. (Let me know if you
rely on them.)
In our labs we have about 150 machines on one NFS server. We wrote a
Perl script that launched wcd on each of the machines, using the
--range and --dump options to specify the workload that
machine had to take on and to say where the results should go.
NFS is used for communication. Each process reads in the entire sequence
file, which is obviously hideously expensive. To make what we did socially
acceptable and to prevent our NFS server from falling over, the Perl
script that launches the wcd processes delays for about 60s
between launches, which means that it takes over 2 hours to start all the
wcd processes.
The auxiliary program combine can be used to read in all the final dump
files to produce the overall clustering.