Refining clusters - wcd Manual 0.

3.9 Refining clusters

3.9.1 Merging clusters

This can be used to merge two known clusterings. The input is two FASTA files with the sequences and two files that give the clusterings.

It is assumed that the two FASTA files are disjoint, that is, no sequence appears in both files.

Usage: wcd [--merge,-m] <seqf1> <clf1> <seqf2> <clf2>
           merge two clusterings

Here you merge two clusterings that have already been computed. The four arguments are: the first FASTA file, the first clustering file, the second FASTA file, and the second clustering file. These are mandatory.

The --constraint option may be particularly useful here: it constrains the first input file and its related clustering. Use the --constraint2 option to constrain the second input file.

The files that specify the clustering must be in the compressed clustering format. The sequences are referred to by index number (the position of the sequence in the input file), numbered from 0. Each cluster is given on a line by itself, terminated by a full stop: the line lists the indices of the sequences in the cluster, separated by spaces.
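
As an illustration of the format just described, the following Python sketch reads a cluster table into a list of clusters. It is not part of wcd; the function name parse_cluster_table and the sample data are made up for this example, and it assumes the full stop may or may not be preceded by a space.

```python
def parse_cluster_table(text):
    """Parse a cluster table: one cluster per line, ending in a full stop.

    Returns a list of clusters, each a list of 0-based sequence indices.
    """
    clusters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if not line.endswith("."):
            raise ValueError(f"cluster line lacks terminating full stop: {raw!r}")
        # Drop the full stop, then split the remaining indices on whitespace.
        clusters.append([int(tok) for tok in line[:-1].split()])
    return clusters


# Hypothetical sample: six sequences in three clusters.
example = "0 2 5.\n1.\n3 4.\n"
print(parse_cluster_table(example))  # [[0, 2, 5], [1], [3, 4]]
```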

The output is a new cluster table in the same format as the input cluster table. The indices shown in the table are:

The same as the input index if the sequence came from the first file specified.
n + input index if the sequence came from the second file, where n is the number of sequences in the first file.
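
The numbering rule above can be sketched in Python as follows. This only illustrates how indices in the merged table relate to the two input files; it does not reproduce how wcd decides which clusters to join. The function names shift_second_file and origin are hypothetical.

```python
def shift_second_file(clusters2, n):
    """Renumber clusters of the second file for the merged table.

    n is the number of sequences in the first file.
    """
    return [[n + i for i in cluster] for cluster in clusters2]


def origin(merged_index, n):
    """Map a merged-table index back to (file number, index within that file)."""
    if merged_index < n:
        return (1, merged_index)
    return (2, merged_index - n)


# With 4 sequences in the first file, index 1 of the second file becomes 5.
print(shift_second_file([[0, 1], [2]], 4))  # [[4, 5], [6]]
print(origin(5, 4))                         # (2, 1)
```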

Another useful option for merging is:

[--constraint2, -k] filename
Give the constraint file for the second input data file. This is optional. The constraint file enables you to ensure that certain sequences are not clustered together or to ignore certain sequences while clustering. See Format of Constraint File, which gives more details on the required format of the constraint file, and the semantics.

3.9.2 Adding sequences

This can be used to add a number of new sequences to an existing clustering. It is assumed that the new sequences do not exist in the original file.

The input is two FASTA files and a cluster table for the first file. The remarks made above for merging clusters apply here too.

3.9.3 Reclustering

Usage: wcd [--recluster,-r] <clf1> <seqf1>
           recluster from a more stringent clustering

This takes a clustering based on a more lenient (or just different) criterion and reclusters using d2-scores as the basis for clustering. The clustering given by the input cluster table serves as the starting scheme. For each cluster of the initial cluster table, wcd does a d2-clustering on the sequences in that cluster, ignoring all the other sequences. wcd will never compare the sequences in one cluster with the sequences in another. The resulting clustering is therefore a finer partition: every output cluster lies entirely within one input cluster.
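
The idea of reclustering within each input cluster can be sketched in Python as below. This is a minimal illustration, not wcd's implementation: the predicate similar(a, b) is a hypothetical stand-in for the d2 test, and the sketch assumes that sequences judged similar are joined transitively (here with a small union-find).

```python
def recluster(clusters, similar):
    """Refine each input cluster separately; pairs from different
    input clusters are never compared."""
    refined = []
    for cluster in clusters:
        parent = {i: i for i in cluster}

        def find(i):
            # Follow parent links to the representative, halving paths.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Compare every pair inside this cluster only.
        for pos, a in enumerate(cluster):
            for b in cluster[pos + 1:]:
                if similar(a, b):
                    parent[find(a)] = find(b)

        # Collect the sub-clusters found within this input cluster.
        groups = {}
        for i in cluster:
            groups.setdefault(find(i), []).append(i)
        refined.extend(groups.values())
    return refined


# Hypothetical similarity relation: 0~1 and 2~3.
pairs = [{0, 1}, {2, 3}]
result = recluster([[0, 1, 2], [3, 4]], lambda a, b: {a, b} in pairs)
print(result)  # [[0, 1], [2], [3], [4]]
```

Although 2 and 3 are similar under the made-up relation, they start in different input clusters and so are never compared, which is why the output can only split input clusters, never join them.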