Next: Testing, Previous: wcd inside stackPACK, Up: Top
Hopefully this will fill out a bit and become a little more coherent.
Originally wcd was intended to be one small program that would fit into one file. However, it grew and we needed to break the program up. The main files are:
wcd.h: This contains the critical type definitions of the program.
wcd.c: This is the main program. It contains some utility and auxiliary code, processes all the command-line arguments and then calls the right routines. The key function in this code is do_cluster. This does the pair-wise comparison of the sequences and, where the comparison falls below the threshold, merges the clusters, calling the appropriate find-union data structures.
ed.c: this contains the routines for computing the edit distance.
d2.c: this contains the routines for computing the d2 score.
common.c: this contains code and declarations for manipulating words and sequences.
The key data structure used is a find-union data structure. This discussion assumes that you know how that works; if you don't, look in a standard algorithms text such as Cormen, Leiserson and Rivest.
All the sequences in one cluster will belong at the end to the same tree in a find-union forest. Each sequence will point to the root of the tree.
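The "find" half of this structure can be sketched as below. This is a generic illustration with hypothetical names, not the actual wcd code; in wcd the parent link lives in the cluster field of SeqStruct (described next). Path compression makes every node visited on the way up point directly at the root, so later finds are nearly constant time.

```c
#include <assert.h>

/* Hypothetical standalone sketch of find with path compression.
   parent[i] == i marks a root. */
static int parent[8];

int find_root(int i) {
    int root = i;
    while (parent[root] != root)    /* first pass: walk up to the root */
        root = parent[root];
    while (parent[i] != root) {     /* second pass: compress the path */
        int up = parent[i];
        parent[i] = root;
        i = up;
    }
    return root;
}
```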
To represent sequences we use an array of the following structs:
typedef struct SeqStruct {
seqType seq; // the sequence
int len; // length
string id; // sequence id (from data file)
int cluster; // refers to index of root
int rank; // for find-union
int next; // for find-union: next in linked list of cluster
int last; // for find-union: last ditto
int match; // index of another seq for which a match was made
short orient; // pos or rc for that match
int flag; // flags for fix, reset etc
} SeqStruct;
typedef SeqStruct * SeqPtr;
SeqPtr seq;
The key variable is seq, which is declared globally and
statically. The main function opens the data file(s) and computes the
number of sequences in the file. Memory is then allocated for seq,
which is treated as an array thereafter. The values are actually
given to the individual array elements in the function read_sequences.
The fields of the struct are documented in the comments above.
However, if the orientations of i and j relative to their respective roots differ, then we know that the orientations of the two clusters are not consistent. We must therefore invert the orientations of all the sequences in one of the two clusters before merging. The choice of cluster is arbitrary, but we choose the one that will become the `child' cluster.
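The fix-up step can be sketched as follows. This is a hypothetical reconstruction, not wcd's actual code: orient[] holds each sequence's orientation relative to its root (0 for positive, 1 for reverse complement), and next[] chains the members of a cluster as in SeqStruct, with -1 marking the end of the list.

```c
/* Flip the orientation of every sequence in the child cluster before
   its root is attached under the new parent (hypothetical sketch). */
void flip_cluster(int child_root, short orient[], int next[]) {
    for (int s = child_root; s != -1; s = next[s])
        orient[s] ^= 1;   /* invert pos <-> rc */
}
```

Walking the cluster's linked list is why SeqStruct keeps the next field: the whole membership can be visited without scanning all sequences.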
Different versions of this program have taken different approaches to the storage of sequences. Initially, I went for one base per byte. Then, I went for four bases per byte; the reason for this (explained below) is to save memory. However, this makes the code more complicated and causes recomputation, as we need to extract the information back out. Then I went to a redundant representation of the data, each word being stored in one machine word. However, this was too memory intensive for large data sets, so I went back to the compact representation.
The data is stored in compressed form. The sequences are read in in ASCII format, one base per byte. However, we only need two bits to represent the four bases, and so we compress the data, storing four bases per byte. There is a little complexity in this (the macros GETWORD and so on look like witchcraft) and some performance penalty in time, but if we have 1M ESTs it may make a few hundred MB of memory difference, which may be useful.
A downside of the compressed format is that we can only treat Ns that are in the data as some arbitrary symbol. It would be better to ignore all words that contain an N, but there is no way of knowing this in the compressed format.
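The loss of information can be seen in the encoding step, sketched here with a hypothetical helper (not wcd's actual code). With only four two-bit codes available, an N has no code of its own and must be mapped onto one of the real bases, after which it is indistinguishable from that base.

```c
/* Hypothetical base encoder for the 2-bit scheme: A=0, C=1, G=2, T=3.
   There is no spare code for N, so it is arbitrarily mapped to the
   code for A; this is why N-containing words cannot be recognised
   and skipped once the data is compressed. */
int encode_base(char c) {
    switch (c) {
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 0;   /* A, N and anything else */
    }
}
```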
The key data structures are shown below. Each sequence is an array of
integers: the definition of seqElt is key. Have a look at the
wcd code to see the current version (this may change). In 0.1.8
at least we used int32_t. If you want to change this to some
other type, change the definition of seqElt and change the
definition of SEQELTWIDTH to reflect this change (i.e. the bit
width).
typedef int32_t seqElt;
typedef seqElt * seqType;
#define SEQELTWIDTH 32
#define BASESPERELT (SEQELTWIDTH/2)

In read_sequences each sequence is read into memory in turn. Once
we know the length of the sequence, we divide through by BASESPERELT to
compute the length of the needed array and allocate memory accordingly.
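Using the 0.1.8 typedefs above, the allocation arithmetic and per-base access can be sketched as below. The division is a ceiling division (a partial final element still needs storing), and get_base/put_base are hypothetical stand-ins for wcd's GETWORD-style macros, shown as functions for readability.

```c
#include <stdint.h>

typedef int32_t seqElt;
#define SEQELTWIDTH 32
#define BASESPERELT (SEQELTWIDTH/2)

/* number of seqElts needed for a sequence of len bases (ceiling) */
int elts_needed(int len) {
    return (len + BASESPERELT - 1) / BASESPERELT;
}

/* hypothetical stand-ins for the GETWORD-style macros: base i
   occupies two bits at offset 2*(i % BASESPERELT) of element
   i / BASESPERELT; arithmetic is done unsigned to avoid shifting
   into the sign bit */
void put_base(seqElt *s, int i, int b) {
    int e = i / BASESPERELT, off = 2 * (i % BASESPERELT);
    uint32_t w = (uint32_t)s[e];
    s[e] = (seqElt)((w & ~(3u << off)) | ((uint32_t)b << off));
}

int get_base(const seqElt *s, int i) {
    int off = 2 * (i % BASESPERELT);
    return (int)(((uint32_t)s[i / BASESPERELT] >> off) & 3u);
}
```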
The maximum EST length is set by a macro constant definition
MAX_SEQUENCE_LENGTH. It's currently 64K, but can be changed to any
reasonable value. The only real limitation is that the next_seq
variable in read_sequences is declared inside the function, and there
may be a limit on how much memory can be allocated on the stack on
some machines. In version 0.1.8, for example, this would mean that the
actual limit on sequence length is 64K*8, which is probably sufficient.
If you run into this problem you may consider making the declaration
global.
Since d2 is an expensive test, a heuristic test is done before d2 is called. This is based upon the idea that for a d2 score to be under the threshold, the two sequences must share a number of non-overlapping common words.
I have not had time to work on this analytically, though I think it can be shown. However, I have done some empirical work which showed that if you graph the d2 score (y axis) against the common word score (x axis), all the data points lie above the line d2 = -12.5cw + 150. This means that at a d2 threshold of 40, all matches have over 8 common words (solving 40 = -12.5cw + 150 gives cw = 8.8). Thus the default setting of 5 is very conservative.
The heuristic is a little more complicated than this. If we are comparing sequences u and v, then
(Note that this is not particularly expensive to do. When we do the clustering we have a double loop:
for(i=0; i<n; i++) {
  build table for sequence i
  for(j=i+1; j<n; j++) {
    compare i and j
  }
}
The building of the table is done in the outer loop, outside of the inner loop, and so its cost is negligible when amortised over all values of j.)
Normally, to do a clustering we need to compare each sequence with every other. The obvious way of doing this is with a double loop:
for(i=0; i<num_seqs; i++) {
for(j=i+1; j<num_seqs; j++) {
blah, blah
}
}
When we do a merge, we have two separate clusterings, and there is no need to compare the sequences in the first part with each other, nor the sequences in the second part with each other; we need only compare the sequences in the first part with those in the second. Assuming there are n1 sequences in the first part, the code becomes:
for(i=0; i<n1; i++) {
for(j=n1; j<num_seqs; j++) {
blah, blah
}
}
When we do an add, we have an existing clustering to which we add new sequences. The clustered data is in the first part and the new data is in the second. There is no need to compare the sequences in the first part with each other. The code becomes:
for(i=0; i<num_seqs; i++) {
for(j=max(n1,i+1); j<num_seqs; j++) {
blah, blah
}
}
Each of these three cases can be dealt with by passing the bounds of these two loops as parameters to do_cluster. The bounds are set in the main function based on the program's arguments.
In addition, sometimes we only want to cluster some of the
sequences. There are two ways in which this can be done. One is just
to set the ignore flag of the relevant sequences to
an appropriate value. The other is to pass the do_cluster
function an array which contains the indices of those sequences to be
clustered. This index array can then be used to extract the
indices of the sequences.
Thus there are four parameters for the do_cluster function:
i_end: where the i-loop should finish looping.
j_beg: where the j-loop should start looping.
j_end: where the j-loop should finish looping.
index: an array of the indices of the sequences to be clustered.
If this parameter has a NULL value, then all the
sequences from 0 to i_end will be compared with all
the sequences from j_beg to j_end.
If this parameter has a non-NULL value, then all the
sequences index[0] to index[i_end] will be
compared with all the sequences index[j_beg] to
index[j_end].
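Putting the three loop patterns and the index array together, the loop skeleton implied by these parameters could look like the sketch below. This is a hypothetical reconstruction, not the actual do_cluster code: the max(j_beg, i+1) lower bound reproduces the normal case (j_beg = 0), the merge case and the add case in one loop, and here each pair that would be compared is simply recorded, where the real code would call the distance function.

```c
/* Hypothetical reconstruction of the do_cluster loop skeleton.
   Writes each (u, v) pair that would be compared into pairs[] and
   returns the number of pairs. index may be NULL, in which case
   loop positions are sequence numbers directly. */
#define MAX(a, b) ((a) > (b) ? (a) : (b))

int cluster_pairs(int i_end, int j_beg, int j_end, const int *index,
                  int pairs[][2]) {
    int n = 0;
    for (int i = 0; i < i_end; i++)
        for (int j = MAX(j_beg, i + 1); j < j_end; j++) {
            pairs[n][0] = index ? index[i] : i;  /* map via index if given */
            pairs[n][1] = index ? index[j] : j;
            n++;
        }
    return n;
}
```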
If you want to add a new distance function, say farfun, you need to implement three functions.
The code should be placed in a file called farfun.c and the
prototypes should be put in farfun.h.
Then, in wcd.c, add the new function to chooseFun.
In common.c, call its initialisation code from initialise.
This section first discusses work allocation in MPI and Pthreads and
then discusses some more primitive support that wcd has for
parallelisation.
Different work allocation methods are provided for MPI and
Pthreads. This is to allow for experimentation and future releases of
wcd are likely to change the way in which they do things.
In MPI, work allocation is done statically. The input file is read in and estimates are made of the clustering costs based upon sequence length and the number of sequences. The workload is then allocated to the slaves and the slaves are launched. This may well change in future, so if you write a script that uses this option, please check it against future releases.
In Pthreads, the work allocation is done dynamically. Conceptually the work is broken up into a d x d grid of blocks which are then allocated to slaves. As there are many more blocks than slaves, each slave will have to ask for work many times.
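The claim step of such a dynamic scheme can be sketched as follows. This is a hypothetical illustration of the technique, not wcd's actual scheduler: each slave thread repeatedly asks for the next unassigned block, with a mutex serialising access to the shared counter.

```c
#include <pthread.h>

/* Hypothetical sketch of dynamic block allocation over the d x d
   grid: a shared counter hands out block ids under a mutex. */
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_block = 0;

/* returns the next block id in [0, num_blocks), or -1 when no
   work remains; safe to call from many threads concurrently */
int claim_block(int num_blocks) {
    pthread_mutex_lock(&work_lock);
    int b = (next_block < num_blocks) ? next_block++ : -1;
    pthread_mutex_unlock(&work_lock);
    return b;
}
```

A slave's main loop would then be `while ((b = claim_block(nb)) != -1) { process block b; }`, which naturally balances load: fast slaves simply come back for more blocks sooner.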
wcd has a number of options to support a simple parallelisation
on a local network. The approach is very simplistic: we had a very large
real job (110k mRNAs, 250M of data) that needed to be clustered
urgently, so we hacked an inelegant solution together. Now
that MPI support is available, these options will become less important
and may be removed in future versions of wcd. (Let me know if you
rely on them.)
In our labs we have about 150 machines on one NFS server. We wrote a
Perl script that launched wcd on each of the machines, using the
--range and --dump options to specify the workload that
machine had to take on and to say where the results should go.
NFS is used for communication. Each process reads in the entire sequence
file, which is obviously hideously expensive. To make what we did socially
acceptable and to prevent our NFS server from falling over, the Perl
script that launches the wcd processes delays for about 60s
between launches, which means that it takes over 2 hours to start all the
wcd processes.
The auxiliary program combine can be used to read in all the final dump
files to produce the overall clustering.