Examples - wcd Manual 0.

3.3 Examples — A Quick Introduction to `wcd`

This section shows common ways in which wcd is likely to be invoked.

3.3.1 Basic Clustering

The following examples show straightforward uses of clustering.

wcd --show_clusters data/5000.seq
Cluster the sequences found in the file data/5000.seq. Print the clusters on standard output in compact form. Use the d2 function to determine cluster membership.
wcd --histogram --show_clusters data/5000.seq
As above, but also print a histogram that shows the size of the clusters found.
wcd --histogram --function ed data/5000.seq
Cluster as above, but use edit distance as the distance function.
wcd --output ans/5000.ans --histogram --show_ext data/5000.seq
wcd -o ans/5000.ans -g -t data/5000.seq
As above, but print the clusters in extended, table format. Also save the output in a file.
wcd -c -N 5 data/5000.seq
If wcd has been installed with the PTHREADS option, run wcd on 5 processors at the same time.
mpirun -np 16 wcd -c data/5000.seq
If wcd has been installed with the MPI option, run wcd on 16 processors using the MPI libraries.
wcd -X -c data/5000.seq
Cluster, but also use clone information. If two ESTs come from the same clone, they'll also be put together. The clone information comes from the FASTA file directly – it's the symbol that follows the word “clone” in the header. This is a convenient option, but for larger files it would be better to put this information in a constraints file.
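To see which sequences the clone rule would tie together before running wcd, the headers can be scanned with a short script. This is a minimal Python sketch, assuming the clone symbol is the whitespace-delimited token right after the word "clone"; the helper `clone_groups`, the regular expression, and the sample headers are illustrative assumptions and are not taken from wcd itself.

```python
import re
from collections import defaultdict

# Assumption: the clone symbol is the token that follows the word "clone".
CLONE_RE = re.compile(r"\bclone\s+(\S+)")


def clone_groups(headers):
    """Map each clone symbol to the indices of the headers that mention it.

    Only clones shared by at least two sequences are returned, since those
    are the ones that would force sequences into the same cluster.
    """
    groups = defaultdict(list)
    for index, header in enumerate(headers):
        match = CLONE_RE.search(header)
        if match:
            groups[match.group(1)].append(index)
    return {clone: idxs for clone, idxs in groups.items() if len(idxs) > 1}


# Made-up FASTA headers, for illustration only.
headers = [
    ">est0 clone A12 5' end",
    ">est1 unannotated",
    ">est2 clone A12 3' end",
    ">est3 clone B07 5' end",
]
print(clone_groups(headers))  # {'A12': [0, 2]}
```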

3.3.2 More advanced clustering

3.3.2.1 Using a clustering file

A clustering file can be used to seed the clustering. Instead of starting each sequence in its own cluster, we preallocate some clusters. wcd then continues clustering from this starting point: no clusters will be broken up, but some of the clusters may be merged. This might be useful when you know for some reason that some sequences should be clustered together regardless of d2-score (e.g. from an annotation or from biological knowledge).

In this case, create a clustering file. Each cluster should be on a line by itself, terminated by a full stop ‘.’. This line can be as long as you like, but don't break it. For example, suppose we know that sequences 0, 2, 10, and 11 should be clustered together; and so should 6, 17, 107, and 120; and so should 151 and 152. Your cluster file (let's assume it's called init.cl) would look like this:

     0 2 10 11.
     6 17 107 120.
     151 152.

Those sequences that are not mentioned will each be put in their own cluster. Clustering is then done by saying:

     wcd --show_clusters -f init.cl data/5000.seq
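When the initial clusters come from another tool or an annotation, the file format described above is simple to generate from a script. This is a minimal Python sketch, assuming the clusters are held as lists of sequence indices; the helper name `format_cluster_file` is hypothetical and is not part of wcd.

```python
def format_cluster_file(clusters):
    """Render clusters in the clustering-file format described above.

    One cluster per line, indices separated by spaces, each line terminated
    by a full stop. A sequence may appear in at most one cluster.
    """
    seen = set()
    lines = []
    for cluster in clusters:
        for idx in cluster:
            if idx in seen:
                raise ValueError(f"sequence {idx} appears in more than one cluster")
            seen.add(idx)
        lines.append(" ".join(str(i) for i in cluster) + ".")
    return "\n".join(lines) + "\n"


text = format_cluster_file([[0, 2, 10, 11], [6, 17, 107, 120], [151, 152]])
print(text, end="")
```

Writing `text` to init.cl gives the same three lines as the hand-written example.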

3.3.2.2 Using a constraint file

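A constraint file enables you to specify additional knowledge about the data and so help wcd do clustering more efficiently and more correctly. Each line in the constraint file gives a directive. There are three basic directives (fix, cluster-only and reset), and the last two also have complementary forms (cluster-others and reset-others).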

fix. Suppose you know that two sequences definitely should not be clustered together (you might know this from previous experiments). You can then tell wcd never to merge two clusters containing these sequences. For example, suppose sequences 1, 17 and 325 definitely do not belong in the same cluster. Then you would have as a line in the constraint file:
```
     
     fix 1 17 325.
```
Note that fixedness is either all or nothing. In the current version there is no way of saying: don't cluster 1 with 325, and don't cluster 45 with 360, but it's OK to cluster 1 with 45 or 360. If there turns out to be a need for it, it might be included in a later version of the program.
cluster-only. The clustering table allows you to provide an initial clustering, which wcd can then refine. Sometimes you may only want to refine the clustering of some of the sequences. In that case you can make major performance savings by using the cluster-only directive to tell wcd to refine the clustering of only some of the sequences and to leave the clustering of all the others as given initially by the clustering table. For example, if we had the clusters [0, 1, 4] [2, 3, 7] [5] [6, 8] [9], and we were generally happy with the clustering but wanted to see whether [2, 3, 7] should be merged with [6, 8], the following directive could be used:
```
     
     cluster-only 2 3 6 7 8.
```
wcd would then check whether those clusters should be merged.
NB: It only makes sense to have one cluster-only directive. It can be as long as you like.
reset. This is similar to cluster-only, except that you are happy with the clustering of the other sequences but not happy with the clustering of the specified sequences. Typically, you would be concerned that the clustering of the specified sequences was too lenient (i.e. that some sequences had wrongly been put together). So taking the above example, if you said:
```
     
     reset 2 3 6 7 8.
```
you would be saying that you wanted to keep the clustering of 0, 1, 4, 5, and 9, but cluster the other sequences de novo, completely ignoring the initial clustering.
The major difference between cluster-only and reset is that with reset you are saying that you want to recluster the specified sequences de novo: you think that some of the sequences specified that have been clustered already should not be and you want to check again (probably using other parameters). With cluster-only you are happy with what clustering has been done, but you want to check whether there should be even more clustering.
cluster-others, reset-others. These have the same semantics as the previous two, except that the specified sequences are left as is and not processed, and the sequences not specified are clustered again.
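The difference between the refining directives (cluster-only, cluster-others) and the de novo directives (reset, reset-others) can be made concrete with a small model. This is a Python sketch of the semantics as described in the text above, not of wcd's implementation: the function `starting_state` is hypothetical, the fix directive is not modelled, and the handling of a cluster that is only partly named by a reset directive (the unnamed members stay together) is an assumption.

```python
def starting_state(initial, directive, named):
    """Model the start of a constrained run.

    Returns (clusters, active): the partition the run starts from and the
    set of sequences that are compared again. This mirrors the prose
    description only; it is not taken from the wcd source.
    """
    named = set(named)
    everything = {s for cluster in initial for s in cluster}
    if directive in ("cluster-others", "reset-others"):
        active = everything - named      # process the sequences NOT listed
    elif directive in ("cluster-only", "reset"):
        active = everything & named      # process only the listed sequences
    else:
        raise ValueError(f"unsupported directive: {directive}")

    if directive.startswith("reset"):
        # De novo: active sequences forget their initial clusters.
        clusters = [[s for s in c if s not in active] for c in initial]
        clusters = [c for c in clusters if c]
        clusters += [[s] for s in sorted(active)]
    else:
        # Refinement: the initial clusters are kept and may only grow.
        clusters = [list(c) for c in initial]
    return clusters, active


initial = [[0, 1, 4], [2, 3, 7], [5], [6, 8], [9]]
only_clusters, only_active = starting_state(initial, "cluster-only", [2, 3, 6, 7, 8])
reset_clusters, reset_active = starting_state(initial, "reset", [2, 3, 6, 7, 8])
print(only_clusters)   # unchanged: [[0, 1, 4], [2, 3, 7], [5], [6, 8], [9]]
print(reset_clusters)  # [[0, 1, 4], [5], [9], [2], [3], [6], [7], [8]]
```

In both calls the same five sequences are examined again; only the partition they start from differs.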

3.3.2.3 An example of reclustering and constraint files

We had a very large data set to cluster with heterogeneous data. Some of the long sequences had very large overlaps. We did the following.

Prepare a suffix array of the data file (this part is explained in more detail later).

     
     fasta2sary -x -d 10 bp.fasta -o bp.fasta.nlc
     mksary bp.fasta.nlc

Cluster with a very high degree of stringency
```
     
     wcd -c -F suffix -w 120 bp.fasta > bp120.clt
```
This creates a cluster table which contains one very large supercluster (of about 43k sequences) and lots of other very small clusters. We copy the line that contains the supercluster into a file bp120.con and put “reset-others” in front of it:
```
     
     reset-others  6355 29988 2 9282 71821 4 .....  
```
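Copying a line of about 43k entries by hand is awkward, so this step could be scripted. This is a minimal Python sketch, assuming the supercluster is simply the line of the cluster table that lists the most sequences; the helper names and the toy table are made up for illustration.

```python
def largest_cluster_line(cluster_table_text):
    """Return the line of a cluster table that lists the most sequences."""
    lines = [ln.strip() for ln in cluster_table_text.splitlines() if ln.strip()]
    return max(lines, key=lambda ln: len(ln.rstrip(".").split()))


def make_constraint(cluster_line, directive="reset-others"):
    """Prefix a cluster-table line with a constraint directive."""
    return f"{directive} {cluster_line}"


# Toy cluster table standing in for bp120.clt (not real data).
table = "0 1.\n2 3 4 5 6.\n7.\n"
supercluster = largest_cluster_line(table)
print(make_constraint(supercluster))  # reset-others 2 3 4 5 6.
```

In practice the table text would be read from bp120.clt and the printed line written to bp120.con.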
Cluster less stringently
We cluster again, this time initialising the clustering with the one given by bp120.clt, subject to the constraint file. The constraint says: leave the supercluster as is and recluster all the other sequences from scratch.
```
     wcd --init_cluster bp120.clt --constraint1 bp120.con -c -F suffix -w 90 bp.fasta > bp90.clt
```
This shows no new superclusters, so we throw away bp90.clt and continue with bp120.clt. (If it had, we would use that information.)
Cluster “normally”
If your data set size is reasonable or you have lots of CPUs, you can just cluster as normal:
```
     wcd --init_cluster bp120.clt --constraint1 bp120.con -c bp.fasta > bp.clt
```
Or use the suffix-array algorithm to speed things up
This phase is done if you are concerned that the final phase will be too computationally expensive. Here, the suffix array algorithm is used to speed up the process. We recluster leniently (25-30 is probably a good choice of word size) and create a new cluster table bp30.clt. To recap what this does: leave the supercluster as is, and recluster everything else from scratch on the basis that two sequences should be put in the same cluster if they share a common word of length 30.
```
     wcd --init_cluster bp120.clt --constraint1 bp120.con -c -F suffix -w 30 bp.fasta > bp30.clt
```
This clustering is probably too lenient, and so the clusters, except for the supercluster, are probably too big.
Now we want to cluster again more strictly, using bp30.clt as the starting point. Within each cluster of bp30.clt we recluster afresh using the standard d2-clustering algorithm. We do not compare sequences from different clusters of bp30.clt to see whether they should be put together; we only compare sequences within the clusters of bp30.clt. If we didn't have to worry about the supercluster we would just say:
```
     
     wcd --recluster bp30.clt -c bp.fasta
```
But we do have to worry about the supercluster. So there are two things we need to do: first, tell wcd not to look at elements of the supercluster; second, tell wcd to put all the elements of the supercluster together. To do the first we use the constraint file. To do the second we extract the line with the supercluster out of bp120.clt, save it in a file, say ‘super.clt’ (this is exactly the same as ‘bp120.con’ without the leading ‘reset-others’ directive), and initialise the clustering with it. So all in all we say:
```
     
     wcd --recluster bp30.clt --init_cluster super.clt --constraint1 bp120.con -c bp.fasta
```

3.3.3 Merging and adding

These features of wcd enable you to combine two clusterings. You could do this de novo, but there are performance benefits of using these wcd features.

Suppose you have two files

File 1 (‘data1.seq’): 15 sequences, numbered 0 through 14. The clusters (as produced by wcd and saved in ‘data1.cl’) are:
```
     
     0 1 2 12 13 14.
     3.
     4 5 6 8.
     7 9 10 11.
```
File 2 (‘data2.seq’): 12 sequences, numbered 0 through 11. The clusters (as produced by wcd and saved in ‘data2.cl’) are:
```
     
     0 2 4 10.
     1 3 5.
     6 7 8 9 11.
```

You now want to merge the two files. You are happy with the clustering of the two files with respect to themselves, but you now need to see whether the sequences in the one file are related to the sequences in the second file. You would do this by saying:

wcd --show_clusters --merge data1.seq data1.cl data2.seq data2.cl

This merges the two clusterings. All the sequences in the first file will be compared to all the sequences in the second. The new clustering would be output. The sequences in the second file will be renumbered 15 to 26. For example a possible output might be:

0 1 2 12 13 14.
3 21 22 23 24 26.
4 5 6 8 16 18 20 7 9 10 11.
15 17 19 25.

This could happen if sequence 3 in file 1 is related to sequence 21 (6 in file 2); sequence 4 in file 1 is related to sequence 16 (1 in file 2); and sequence 7 in file 1 is related to sequence 18 (3 in file 2).
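The renumbering and the resulting cluster unions in this example can be checked with a short script. This is a Python sketch under stated assumptions: the related pairs are supplied by hand (wcd would find them by comparing sequences across the two files), the function `merge_clusterings` is hypothetical, and members are printed in sorted order, so the order within a cluster may differ from wcd's output.

```python
def merge_clusterings(clusters1, n1, clusters2, related_pairs):
    """Combine two clusterings as in the worked example.

    File-2 sequences are renumbered by adding n1 (the number of sequences in
    file 1). related_pairs holds (index in file 1, index in file 2) pairs
    that are assumed to be related; their clusters are merged.
    """
    renumbered = [[s + n1 for s in c] for c in clusters2]
    clusters = [list(c) for c in clusters1] + renumbered

    # Union-find: every sequence initially points at its cluster's first member.
    parent = {}
    for cluster in clusters:
        for s in cluster:
            parent[s] = cluster[0]

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in related_pairs:
        ra, rb = find(a), find(b + n1)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    merged = {}
    for s in sorted(parent):
        merged.setdefault(find(s), []).append(s)
    return sorted(merged.values())


data1 = [[0, 1, 2, 12, 13, 14], [3], [4, 5, 6, 8], [7, 9, 10, 11]]
data2 = [[0, 2, 4, 10], [1, 3, 5], [6, 7, 8, 9, 11]]
result = merge_clusterings(data1, 15, data2, [(3, 6), (4, 1), (7, 3)])
for cluster in result:
    print(" ".join(map(str, cluster)) + ".")
```

The four clusters printed contain the same members as the sample output above.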

Adding

Adding allows you to add unclustered sequences to an existing clustering.

wcd --show_clusters --add data1.seq data1.cl data2.seq

would add the sequences in the file data2.seq to the ones in data1.seq, clustering as appropriate.
