Sequence input files must be in FASTA format.
What is meant by FASTA format? Each sequence MUST be preceded by an identification line. The sequence itself may be on one line, or it may span several lines. If it spans several lines, each line should end with a newline and there must be NO spaces on any line.
The identification line starts with a `greater-than' sign (>). This is all that is required. If there is an alphanumeric sequence (a string with no blanks) IMMEDIATELY following the greater-than sign, it is treated as a sequence ID, which a few of the options use for display purposes. The rest of the identification line is completely ignored.
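The rules above can be sketched as a small reader. This is only an illustration in Python, not the code wcd uses; the function name read_fasta is hypothetical, and the sketch assumes that blank lines are skipped and that the ID is the run of non-blank characters directly after the greater-than sign.

```python
def read_fasta(lines):
    """Return a list of (seq_id, sequence) pairs from an iterable of text lines.

    Hypothetical helper for illustration; it is not part of wcd.
    """
    records = []
    seq_id, chunks = None, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new identification line closes the previous record, if any.
            if chunks is not None:
                records.append((seq_id, "".join(chunks)))
            rest = line[1:]
            # The ID must follow '>' immediately; otherwise there is no ID.
            if rest and not rest[0].isspace():
                seq_id = rest.split(None, 1)[0]
            else:
                seq_id = ""
            chunks = []
        elif line:
            if chunks is None:
                raise ValueError("sequence data before first identification line")
            if " " in line:
                raise ValueError("spaces are not allowed in sequence lines")
            chunks.append(line)
        # Assumption: empty lines are silently skipped.
    if chunks is not None:
        records.append((seq_id, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = [">EST1 some description\n", "ACGT\n", "TTGA\n", ">\n", "GGCC\n"]
    for seq_id, seq in read_fasta(sample):
        print(repr(seq_id), seq)
```

A sequence split over several lines is joined into one string, and a record whose identification line is a bare greater-than sign gets an empty ID.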
The merge and add options require input files that specify a clustering. These files must use the compressed format described below.
Constraint files consist of a sequence of constraints, each on a line by itself. Each line in the constraint file is a directive followed by a list of indices, terminated by a full stop `.'. There are three directives and their semantics are described below.
fix
This directive can be used to specify a list of sequences which should be labelled fixed. Any cluster that contains a fixed sequence will be labelled as fixed. This is useful when the user has some external knowledge about the clustering and wants to ensure that some sequences aren't clustered together (e.g. linked by a poor-quality EST).
Normally, when wcd starts, each sequence is put into a cluster. By default, a sequence is put into a cluster by itself, but if a clustering file is given then the clustering specified by that file will be used.
Thereafter, clustering starts. However, if the fix directive
is used, two sequences that are labelled as fixed will never be
merged. If an EST matches more than 1 fixed cluster it will be added
to at most 1 of them. Note that sequences that are not fixed can be
added to fixed clusters, and a non-fixed cluster can be added to a
fixed cluster.
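The merging rule for fixed clusters can be illustrated with a small union-find sketch. This is an assumption-laden illustration in Python, not wcd's implementation; the class name Clusters and its methods are hypothetical.

```python
class Clusters:
    """Union-find over sequence indices with a 'fixed' flag per cluster.

    Hypothetical illustration of the fix directive; not wcd's own code.
    """

    def __init__(self, n, fixed=()):
        # By default every sequence starts in a cluster by itself.
        self.parent = list(range(n))
        # The flag is only meaningful on cluster roots.
        self.fixed = [False] * n
        for i in fixed:
            self.fixed[i] = True

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def merge(self, a, b):
        """Merge the clusters of a and b; return False if the merge is refused."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if self.fixed[ra] and self.fixed[rb]:
            return False  # two fixed clusters are never merged
        self.parent[rb] = ra
        # A cluster containing a fixed sequence is itself fixed.
        self.fixed[ra] = self.fixed[ra] or self.fixed[rb]
        return True


if __name__ == "__main__":
    c = Clusters(4, fixed=[0, 1])
    print(c.merge(0, 2))  # non-fixed sequence joins a fixed cluster
    print(c.merge(2, 1))  # refused: would join two fixed clusters
```

In the usage above, sequence 2 matches both fixed sequences but ends up in only one fixed cluster, which mirrors the "at most 1 of them" behaviour described in the text.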
cluster-only
This tells wcd to try to cluster only those sequences that are
in the list (and to ignore the rest). This is useful if you only want to
cluster a part of an input file (e.g. you might know the clustering
for the rest of the file).
NB: It is an error to put more than one cluster-only
or reset in a constraint file.
reset
This is similar to cluster-only but in addition, the
clustering of the sequences in this list is reset to the default
clustering (i.e. each sequence in the list is put in its own
cluster).
This is used where you are given a clustering file as input, but while you are generally happy with the clustering given for some sequences, you would like others to be reclustered. The implication is that the sequences in the reset list are reclustered among themselves only: wcd will not attempt to cluster them with any of the other sequences.
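As an illustration of the constraint file layout, the following Python sketch reads such a file and enforces the rule about cluster-only and reset. It is not wcd's parser. The function name parse_constraints is hypothetical, and the sketch assumes that indices are integers separated by white space and that "more than one cluster-only or reset" means at most one such directive in total.

```python
DIRECTIVES = ("fix", "cluster-only", "reset")


def parse_constraints(text):
    """Return a list of (directive, indices) pairs, one per constraint line.

    Hypothetical helper; the index syntax is an assumption, not taken
    from the wcd sources.
    """
    constraints = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if not line.endswith("."):
            raise ValueError(f"line {lineno}: missing terminating full stop")
        tokens = line[:-1].split()
        if not tokens or tokens[0] not in DIRECTIVES:
            raise ValueError(f"line {lineno}: unknown directive")
        indices = [int(t) for t in tokens[1:]]
        constraints.append((tokens[0], indices))
    # At most one cluster-only or reset directive is allowed per file.
    restricting = [d for d, _ in constraints if d in ("cluster-only", "reset")]
    if len(restricting) > 1:
        raise ValueError("more than one cluster-only or reset directive")
    return constraints


if __name__ == "__main__":
    example = "fix 3 7 12.\nreset 20 21 .\n"
    for directive, indices in parse_constraints(example):
        print(directive, indices)
```

The example input places each constraint on a line by itself, with the directive first, then the indices, then the full stop.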