This section shows common ways in which wcd is likely to be invoked.
The following are straightforward clustering examples.
wcd --show_clusters data/5000.seq
Cluster the sequences found in the file data/5000.seq. Print
the clusters on standard output in compact form. Use the d2
function to determine cluster membership.
wcd --histogram --show_clusters data/5000.seq
As above, but also print a histogram that shows the size of the clusters found.
wcd --histogram --function ed data/5000.seq
Cluster as above, but use edit distance as the distance function.
wcd --output ans/5000.ans --histogram --show_ext data/5000.seq
wcd -o ans/5000.ans -g -t data/5000.seq
As above, but print the clusters in extended, table format and save the output in a file. The second line is the same command written with short options.
wcd -c -N 5 data/5000.seq
If wcd has been installed with the PTHREADS
option, run wcd on 5 processors at the same time.
mpirun -np 16 wcd -c data/5000.seq
If wcd has been installed with the MPI option, run
wcd on 16 processors using the MPI libraries.
wcd -X -c data/5000.seq
Cluster, but also use clone information. If two ESTs come from the same clone, they'll also be put together. The clone information comes from the FASTA file directly – it's the symbol that follows the word “clone” in the header. This is a convenient option, but for larger files it would be better to put this information in a constraints file.
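As a rough illustration of the idea, the following Python sketch groups sequences by the token that follows the word "clone" in each FASTA header. The function name, the regular expression and the sample headers are assumptions made for this example; wcd's own header parsing may differ in detail.

```python
import re
from collections import defaultdict

# Assumption: the clone identifier is the first whitespace-delimited token
# after the word "clone" in a FASTA header line.
CLONE_RE = re.compile(r"\bclone\s+(\S+)")

def group_by_clone(fasta_lines):
    """Map each clone identifier to the indices of the sequences carrying it."""
    groups = defaultdict(list)
    index = -1
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        index += 1    # sequences are numbered from 0 in file order
        match = CLONE_RE.search(line)
        if match:
            groups[match.group(1)].append(index)
    return dict(groups)

# Hypothetical headers, made up for this example.
example = [
    ">seq0 clone ABC123 5'",
    "ACGTACGT",
    ">seq1 unknown origin",
    "GGCCTTAA",
    ">seq2 clone ABC123 3'",
    "TTGACCAA",
]
print(group_by_clone(example))
```

Sequences that share an identifier (here 0 and 2) are the ones that would be placed together on the basis of clone information.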
An initial clustering can be used to seed the clustering. Instead of
starting each sequence in its own cluster, we do some preallocation of
clusters. wcd will then continue clustering using this as a start:
no clusters will be broken up, but some of the clusters will be
merged. This might be useful when you know for some reason that some
sequences should be clustered together regardless of d2-score (e.g. from
an annotation or from biological knowledge).
In this case, create a clustering file. Each cluster should be on a line by itself, terminated by a full stop `.'. This line can be as long as you like, but don't break it. For example, suppose we know that sequences 0, 2, 10, and 11 should be clustered together; and so should 6, 17, 107, and 120; and so should 151 and 152. Your cluster file (let's assume it's called init.cl) would look like this:
0 2 10 11.
6 17 107 120.
151 152.
Those sequences that are not mentioned will each be put in their own cluster. Clustering is then done by saying:
wcd --show_clusters -f init.cl data/5000.seq
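To make the file format concrete, here is a minimal Python sketch that reads a clustering file of the form described above and fills in singleton clusters for the sequences that are not mentioned. The function name and the figure of 200 sequences are assumptions for illustration only; this is not wcd's own reader.

```python
def parse_cluster_file(text, num_sequences):
    """Return a list of clusters (lists of sequence indices).

    Each non-empty line holds one cluster and ends with a full stop.
    Sequences not mentioned anywhere become singleton clusters.
    """
    clusters = []
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.endswith("."):
            raise ValueError(f"cluster line not terminated by '.': {line!r}")
        members = [int(tok) for tok in line[:-1].split()]
        clusters.append(members)
        seen.update(members)
    # Every sequence that was not listed starts in a cluster of its own.
    clusters.extend([i] for i in range(num_sequences) if i not in seen)
    return clusters

init_cl = "0 2 10 11.\n6 17 107 120.\n151 152.\n"
clusters = parse_cluster_file(init_cl, 200)  # 200 is a hypothetical file size
print(clusters[:3])
print(len(clusters))
```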
A constraint file enables you to specify additional knowledge about the data and so help wcd do clustering more efficiently and more correctly. Each line in the constraint file gives a directive. There are three basic directives (fix, cluster-only and reset); the last two also have -others variants.
fix. Suppose you know that two sequences definitely should
not be clustered together (you might know this from previous experiments).
You can then tell wcd never to merge two clusters containing
these sequences. For example, suppose sequences 1, 17 and 325
definitely do not belong in the same cluster. Then you would have as a line in
the constraint file:
fix 1 17 325.
Note that fixedness is all or nothing. The current version gives no way of saying “don't cluster 1 with 325, and don't cluster 45 with 360, but it's OK to cluster 1 with 45 or 360”. If there turns out to be a need for it, it might be included in a later version of the program.
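One way to read the fix directive is as a property of the final clustering: no cluster may contain more than one of the fixed sequences. The Python sketch below checks that property on a clustering given as lists of sequence indices. It is an after-the-fact check written for illustration, with assumed names and made-up clusterings, and says nothing about how wcd enforces the constraint internally.

```python
def violates_fix(clusters, fixed):
    """Return the clusters that contain two or more of the fixed sequences."""
    fixed = set(fixed)
    return [c for c in clusters if len(fixed.intersection(c)) > 1]

fixed = [1, 17, 325]                      # from the line "fix 1 17 325."
ok = [[1, 2, 3], [17, 40], [325]]         # hypothetical clustering, respects fix
bad = [[1, 2, 325], [17, 40]]             # hypothetical clustering, breaks it

print(violates_fix(ok, fixed))
print(violates_fix(bad, fixed))
```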
cluster-only
The clustering table allows you to provide an initial clustering,
which wcd can then refine. Sometimes you may only want to
refine the clustering of some of the sequences. In that case you can
make major performance savings by using the cluster-only
directive to tell wcd to refine the clustering of only some of
the sequences and to leave the clustering of all the others as given
initially by the clustering table. For example, if we had the
following clusters: [0, 1, 4] [2, 3, 7] [5] [6, 8] [9]; and we were
generally happy with the clustering but wanted to see whether [2, 3,
7] should be merged with [6, 8], the following directive could be used:
cluster-only 2 3 6 7 8.
wcd would then check whether those clusters should be merged.
NB: It only makes sense to have one cluster-only
directive. It can be as long as you like.
reset. This is similar to cluster-only except that you
want to say that you are happy with the clustering of the other
sequences but not happy with the clustering of the specified
sequences. Typically, you would be concerned that the clustering of
the specified sequences was too lenient (i.e. that some sequences had
wrongly been put together). So taking the above example, if you said
reset 2 3 6 7 8.
you would be saying that you wanted to leave the clustering of 0, 1, 4, 5, and 9 as it is, but wanted to cluster the other sequences de novo, completely ignoring their initial clustering.
The major difference between cluster-only and reset is
that with reset you are saying that you want to recluster the
specified sequences de novo: you think that some of the sequences
specified that have been clustered already should not be and you want
to check again (probably using other parameters). With
cluster-only you are happy with what clustering has been done,
but you want to check whether there should be even more clustering.
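The difference can be pictured by looking at the clusters each directive starts from. The Python sketch below is a simplified model with assumed names, using the example clustering from above; it only shows the starting state and does not perform any sequence comparison.

```python
def starting_clusters(initial, directive, specified):
    """Clusters as they stand before refinement begins (illustrative model)."""
    specified = set(specified)
    if directive == "cluster-only":
        # The initial clustering is kept; further merging is only explored
        # among the clusters that contain the specified sequences.
        return [list(c) for c in initial]
    if directive == "reset":
        # The specified sequences are pulled out and start as singletons;
        # everything else keeps its initial cluster.
        kept = [[s for s in c if s not in specified] for c in initial]
        kept = [c for c in kept if c]
        return kept + [[s] for s in sorted(specified)]
    raise ValueError(f"unsupported directive: {directive}")

initial = [[0, 1, 4], [2, 3, 7], [5], [6, 8], [9]]
print(starting_clusters(initial, "cluster-only", [2, 3, 6, 7, 8]))
print(starting_clusters(initial, "reset", [2, 3, 6, 7, 8]))
```

With cluster-only, [2, 3, 7] and [6, 8] survive and can only grow; with reset, sequences 2, 3, 6, 7 and 8 begin again on their own.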
cluster-others, reset-others. These have the same
semantics as the previous two, except that the specified sequences are
left as they are and not processed, and the sequences not specified are
clustered again.
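Put another way, the -others variants act on the complement of the listed sequences. A small Python sketch, with assumed names and a hypothetical total of 10 sequences, makes the selection explicit:

```python
def processed_sequences(directive, listed, num_sequences):
    """Sequences that the directive asks to be clustered again."""
    listed = set(listed)
    if directive in ("cluster-only", "reset"):
        return sorted(listed)
    if directive in ("cluster-others", "reset-others"):
        # Everything that is NOT listed is processed.
        return [i for i in range(num_sequences) if i not in listed]
    raise ValueError(f"unknown directive: {directive}")

print(processed_sequences("reset", [2, 3, 6, 7, 8], 10))
print(processed_sequences("reset-others", [2, 3, 6, 7, 8], 10))
```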
We had a very large, heterogeneous data set to cluster. Some of the long sequences had very large overlaps. We did the following.
fasta2sary -x -d 10 bp.fasta -o bp.fasta.nlc
mksary bp.fasta.nlc
wcd -c -F suffix -w 120 bp.fasta > bp120.clt
This creates a cluster table which contains one very large supercluster (of about 43k sequences) and lots of other very small clusters. We copy the line that contains the supercluster into a file bp120.con and put “reset-others” in front of it:
reset-others 6355 29988 2 9282 71821 4 .....
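Picking out the supercluster line by hand is awkward for a table of this size. The Python sketch below shows one way it might be automated, assuming the cluster table is in the compact form used elsewhere in this section (one cluster per line, terminated by a full stop). The function names and the toy table are assumptions for illustration.

```python
def largest_cluster_line(cluster_table_text):
    """Return the line of the cluster table that lists the most sequences."""
    lines = [ln.strip() for ln in cluster_table_text.splitlines() if ln.strip()]
    return max(lines, key=lambda ln: len(ln.rstrip(".").split()))

def reset_others_line(cluster_line):
    """Turn a cluster line into a reset-others directive for a constraint file."""
    return "reset-others " + cluster_line

# Toy cluster table standing in for bp120.clt.
toy_table = "0 1.\n2 3 4 5 6.\n7.\n"
directive = reset_others_line(largest_cluster_line(toy_table))
print(directive)
```

The resulting line is what would be written to the constraint file (bp120.con in the text).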
We cluster again, less stringently. This time we initialise the clustering with the clustering given by bp120.clt subject to the constraint file. This says: leave the super-cluster as-is. Recluster all the other sequences from scratch.
wcd --init_cluster bp120.clt --constraint1 bp120.con -c -F suffix -w 90 bp.fasta > bp90.clt
This shows no new superclusters, so we throw away bp90.clt and continue with bp120.clt. (But if it did, you would use that information.)
If your data set size is reasonable or you have lots of CPUs, you can just cluster as normal:
wcd --init_cluster bp120.clt --constraint1 bp120.con -c bp.fasta > bp.clt
This phase is done if you are concerned that the final phase will be too computationally expensive. Here, the suffix array algorithm is used to speed up the process. We recluster leniently (25-30 is probably a good choice of word size) and create a new cluster table bp30.clt. To recap what this does: leave the supercluster as is, and recluster everything else from scratch on the basis that two sequences should be put in the same cluster if they share a common word of length 30.
wcd --init_cluster bp120.clt --constraint1 bp120.con -c -F suffix -w 30 bp.fasta > bp30.clt
This clustering is probably too lenient and so the clusters, except for the super-cluster, are probably too big.
Now we want to cluster again more strictly, using bp30.clt as the starting point. Within each cluster of bp30.clt we recluster afresh using the standard d2-clustering algorithm. We do not compare sequences from different clusters of bp30.clt to see whether they should be put together, but only compare sequences within the clusters of bp30.clt. If we didn't have to worry about the supercluster we would just say:
wcd --recluster bp30.clt -c bp.fasta
But we do have to worry about the supercluster. So there are two things we need to do: first, tell wcd not to look at elements of the supercluster; second, tell wcd to put all the elements of the supercluster together. To do the first we use the constraint file. To do the second we extract out of bp120.clt the line with the supercluster and save it in a file, say ‘super.clt’. This is exactly the same as ‘bp120.con’ without the ‘reset-others’ directive, and we initialise the clustering with it. So all in all we say:
wcd --recluster bp30.clt --init_cluster super.clt --constraint1 bp120.con -c bp.fasta
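The logic of this final phase can be modelled in a few lines of Python. The sketch below reclusters inside each coarse cluster only, never compares sequences from different coarse clusters, and keeps a frozen set (the supercluster) together without examining it. It is a simplified model under stated assumptions: `related` is a hypothetical stand-in for the real d2 test, and all names and the toy data are invented for illustration.

```python
from itertools import combinations

def recluster_within(coarse_clusters, related, frozen=()):
    """Recluster each coarse cluster on its own; never compare across clusters.

    related(a, b) is a hypothetical stand-in for the real similarity test.
    Sequences in `frozen` are not examined and are emitted as one cluster.
    """
    frozen = set(frozen)
    result = [sorted(frozen)] if frozen else []
    for cluster in coarse_clusters:
        members = [s for s in cluster if s not in frozen]
        parent = {s: s for s in members}  # each member starts on its own

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for a, b in combinations(members, 2):
            if related(a, b):
                parent[find(a)] = find(b)
        groups = {}
        for s in members:
            groups.setdefault(find(s), []).append(s)
        result.extend(sorted(g) for g in groups.values())
    return sorted(result)

# Toy data: 3 and 4 are "related" but sit in different coarse clusters,
# so that pair is never compared.
pairs = ({0, 1}, {2, 3}, {3, 4})
coarse = [[0, 1, 2, 3], [4, 5], [6, 7, 8]]
refined = recluster_within(coarse, lambda a, b: {a, b} in pairs, frozen=[6, 7, 8])
print(refined)
```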
These features of wcd enable you to combine two clusterings. You
could do this de novo, but there are performance benefits of using these
wcd features.
Suppose you have two files, data1.seq and data2.seq. The clusters of data1.seq (produced by wcd and saved in a file ‘data1.cl’) are:
0 1 2 12 13 14.
3.
4 5 6 8.
7 9 10 11.
The clusters of data2.seq (produced by wcd and saved in ‘data2.cl’) are:
0 2 4 10.
1 3 5.
6 7 8 9 11.
You now want to merge the two files. You are happy with the clustering of the two files with respect to themselves, but you now need to see whether the sequences in the one file are related to the sequences in the second file. You would do this by saying:
wcd --show_clusters --merge data1.seq data1.cl data2.seq data2.cl
This merges the two clusterings. All the sequences in the first file will be compared to all the sequences in the second. The new clustering would be output. The sequences in the second file will be renumbered 15 to 26. For example a possible output might be:
0 1 2 12 13 14.
3 21 22 23 24 26.
4 5 6 8 16 18 20 7 9 10 11.
15 17 19 25.
This could happen if sequence 3 in file 1 is related to sequence 21 (6 in file 2); sequence 4 in file 1 is related to sequence 16 (1 in file 2); and sequence 7 in file 1 is related to sequence 18 (3 in file 2).
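The renumbering and the resulting clusters can be checked with a short Python sketch. It shifts the indices of the second clustering by the number of sequences in the first file (15 here, assuming every sequence of file 1 appears in data1.cl) and then joins clusters across the three links named above. The function names are assumptions, and the links are supplied by hand rather than computed from sequence data.

```python
def renumber_second(clusters2, offset):
    """Shift every sequence index of the second clustering by `offset`."""
    return [[s + offset for s in c] for c in clusters2]

def merge_clusters(clusters, links):
    """Join the clusters connected by the given (a, b) links (union-find)."""
    parent = {}
    for c in clusters:
        for s in c:
            parent[s] = c[0]  # first member acts as the cluster's root

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for s in parent:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

data1 = [[0, 1, 2, 12, 13, 14], [3], [4, 5, 6, 8], [7, 9, 10, 11]]
data2 = [[0, 2, 4, 10], [1, 3, 5], [6, 7, 8, 9, 11]]
offset = sum(len(c) for c in data1)          # 15 sequences in file 1
shifted = renumber_second(data2, offset)
merged = merge_clusters(data1 + shifted, [(3, 21), (4, 16), (7, 18)])
for cluster in merged:
    print(" ".join(map(str, cluster)) + ".")
```

The four clusters printed contain the same members as the example output, although the order of members within a line differs because this sketch sorts them.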
Adding allows you to add unclustered sequences to an existing clustering. For example,
wcd --show_clusters --add data1.seq data1.cl data2.seq
would add the sequences in the file data2.seq to the ones in
data1.seq, clustering as appropriate.