3.3 Examples — A Quick Introduction to wcd

This section shows common ways in which wcd is likely to be invoked.

3.3.1 Basic Clustering

The following examples show straightforward clustering.

3.3.2 More advanced clustering

3.3.2.1 Using a clustering file

A clustering file can be used to seed the clustering. Instead of starting each sequence in its own cluster, we preallocate some clusters. wcd will then continue clustering using this as a start: no clusters will be broken up, but some of the clusters may be merged. This is useful when you know that some sequences should be clustered together regardless of d2-score (e.g. from an annotation or from biological knowledge).

In this case, create a clustering file. Each cluster should be on a line by itself, terminated by a full stop `.'. This line can be as long as you like, but don't break it. For example, suppose we know that sequences 0, 2, 10, and 11 should be clustered together; and so should 6, 17, 107, and 120; and so should 151 and 152. Your cluster file (let's assume it's called init.cl) would look like this:

     0 2 10 11.
     6 17 107 120.
     151 152.

Those sequences that are not mentioned will each be put in a cluster of their own. Clustering is then done by saying:

     wcd --show_clusters -f init.cl data/5000.seq
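
If the groupings come from another program (an annotation pipeline, say), the clustering file can be generated rather than typed by hand. The following is only a sketch in Python of the file format described above; the function name format_cluster_file is hypothetical and is not part of wcd.

```python
def format_cluster_file(clusters):
    """Return clustering-file text for a list of clusters.

    Each cluster is a list of sequence indices. As described above, every
    cluster goes on one unbroken line and ends with a full stop.
    """
    lines = []
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster has nothing to write
        lines.append(" ".join(str(index) for index in cluster) + ".")
    return "\n".join(lines) + "\n"


# The example from the text: three known groups of sequences.
text = format_cluster_file([[0, 2, 10, 11], [6, 17, 107, 120], [151, 152]])
print(text, end="")
```

Writing the returned text to init.cl gives a file that can be passed to wcd as shown above.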

3.3.2.2 Using a constraint file

A constraint file enables you to specify additional knowledge about the data and so help wcd cluster more efficiently and more correctly. Each line in the constraint file gives a directive. There are three directives.

3.3.2.3 An example of reclustering and constraint files

We had a very large data set to cluster with heterogeneous data. Some of the long sequences had very large overlaps. We did the following.

  1. Prepare a suffix array of the data file (this part is explained in more detail later).
         
         fasta2sary -x -d 10 bp.fasta -o bp.fasta.nlc
         mksary bp.fasta.nlc
    
  2. Cluster with a very high degree of stringency
         
         wcd -c -F suffix -w 120 bp.fasta > bp120.clt
    

    This creates a cluster table which contains one very large supercluster (of about 43k sequences) and lots of other very small clusters. We copy the line that contains the supercluster into a file bp120.con and put “reset-others” in front of it:

         
         reset-others  6355 29988 2 9282 71821 4 .....  
    
  3. Cluster less stringently

    We cluster again, less stringently. This time we initialise the clustering with the clustering given by bp120.clt subject to the constraint file. This says: leave the super-cluster as-is. Recluster all the other sequences from scratch.

         
         wcd --init-cluster bp120.clt --constraint1 bp120.con -c -F suffix -w 90 bp.fasta > bp90.clt
    

    This shows no new superclusters, so we throw away bp90.clt and continue with bp120.clt. (If it had shown new superclusters, you would use that information.)

  4. Cluster “normally”

    If your data set size is reasonable or you have lots of CPUs, you can just cluster as normal:

         
         wcd --init-cluster bp120.clt --constraint1 bp120.con -c  bp.fasta > bp.clt
    
  5. Or use the suffix-array algorithm to speed up

    This phase is done if you are concerned that the final phase will be too computationally expensive. Here, the suffix array algorithm is used to speed up the process. We recluster leniently (25-30 is probably a good choice of word size) and create a new cluster table bp30.clt. To recap what this does: leave the supercluster as is, and recluster everything else from scratch on the basis that two sequences should be put in the same cluster if they share a common word of length 30.

         
         wcd --init-cluster bp120.clt --constraint1 bp120.con -c -F suffix -w 30 bp.fasta > bp30.clt
    

    This clustering is probably too lenient and so the clusters, except for the super-cluster, are probably too big.

    Now we want to cluster again more strictly, using bp30.clt as the starting point. Within each cluster of bp30.clt we recluster afresh using the standard d2-clustering algorithm. We do not compare sequences from different clusters of bp30.clt to see whether they should be put together, but only compare sequences within the clusters of bp30.clt. If we didn't have to worry about the supercluster we would just say:

         
         wcd --recluster bp30.clt -c bp.fasta
    

    But we do have to worry about the supercluster. So there are two things we need to do: first, tell wcd not to look at elements of the supercluster; second, tell wcd to put all the elements of the supercluster together. To do the first we use the constraint file. To do the second we extract the line with the supercluster from bp120.clt and save it in a file, say ‘super.clt’, which is exactly the same as ‘bp120.con’ without the leading directive, and initialise the clustering with it. So all in all we say:

         
         wcd --recluster bp30.clt --init_cluster super.clt --constraint1 bp120.con -c bp.fasta
    
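
Copying the supercluster line by hand is awkward when it holds tens of thousands of indices. The Python sketch below shows one way the lines for bp120.con and super.clt might be derived from a cluster table. It assumes the table has one cluster per line, with space-separated indices terminated by a full stop, and that the supercluster is simply the line with the most indices; the function names are hypothetical and not part of wcd.

```python
def parse_cluster_line(line):
    """Return the sequence indices on one cluster-table line."""
    return [int(token) for token in line.strip().rstrip(".").split()]


def largest_cluster(table_text):
    """Return the cluster with the most members (assumed to be the supercluster)."""
    clusters = [parse_cluster_line(line)
                for line in table_text.splitlines() if line.strip()]
    return max(clusters, key=len)


def supercluster_lines(table_text, directive="reset-others"):
    """Return (constraint_line, init_line) for the supercluster.

    constraint_line is the supercluster line with the directive in front
    (the content of bp120.con); init_line is the same line without the
    directive (the content of super.clt).
    """
    members = " ".join(str(index) for index in largest_cluster(table_text))
    init_line = members + "."
    return directive + " " + init_line, init_line


# A toy cluster table standing in for bp120.clt.
table = "0 1.\n2 3 4 5.\n6.\n"
constraint_line, init_line = supercluster_lines(table)
print(constraint_line)
print(init_line)
```

In practice the table text would be read from bp120.clt and the two returned lines written to bp120.con and super.clt respectively.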

3.3.3 Merging and adding

These features of wcd enable you to combine two clusterings. You could do this de novo, but there are performance benefits of using these wcd features.

Suppose you have two files

You now want to merge the two files. You are happy with the clustering of the two files with respect to themselves, but you now need to see whether the sequences in the one file are related to the sequences in the second file. You would do this by saying:

     wcd --show_clusters --merge data1.seq data1.cl data2.seq data2.cl

This merges the two clusterings. All the sequences in the first file will be compared to all the sequences in the second, and the new clustering is output. The sequences in the second file will be renumbered 15 to 26. For example, a possible output might be:

     0 1 2 12 13 14.
     3 21 22 23 24 26.
     4 5 6 8 16 18 20 7 9 10 11.
     15 17 19 25.

This could happen if sequence 3 in file 1 is related to sequence 21 (6 in file 2); sequence 4 in file 1 is related to sequence 16 (1 in file 2); and sequence 7 in file 1 is related to sequence 18 (3 in file 2).
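
The renumbering is a fixed offset: a sequence from the second file gets its old index plus the number of sequences in the first file. The Python sketch below works through that arithmetic for the example above, assuming the first file holds 15 sequences (0 to 14), which is what the renumbering to 15 to 26 implies. The function names are hypothetical and not part of wcd.

```python
def merged_index(index_in_file2, n_file1):
    """Index that a file-2 sequence receives in the merged clustering."""
    return n_file1 + index_in_file2


def split_merged_cluster(cluster, n_file1):
    """Split a merged cluster into (file-1 indices, original file-2 indices)."""
    from_file1 = [i for i in cluster if i < n_file1]
    from_file2 = [i - n_file1 for i in cluster if i >= n_file1]
    return from_file1, from_file2


N_FILE1 = 15  # assumed number of sequences in the first file

# Sequence 6 of file 2 appears as 21 in the merged output.
print(merged_index(6, N_FILE1))

# The second cluster of the sample output, mapped back to per-file indices.
print(split_merged_cluster([3, 21, 22, 23, 24, 26], N_FILE1))
```

Mapping a merged cluster back in this way makes it easy to see which sequences of the original files ended up together.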

Adding

Adding allows you to add unclustered sequences into a cluster.

     wcd --show_clusters --add data1.seq data1.cl data2.seq

would add the sequences in the file data2.seq to the ones in data1.seq, clustering as appropriate.