Format of input files - wcd Manual 0.


3.10 Format of input files

The format of input files is:

FASTA format
wcd will treat Ns randomly.

What is meant by FASTA format? Each sequence MUST be preceded by an identification line. The sequence itself may be on one line or spread over several lines. If it spans several lines, each line should end with a newline and there must be NO spaces on any line.

The identification line starts with a `greater-than' sign (>). This is all that is required. If an alphanumeric sequence (a string with no blanks) IMMEDIATELY follows the greater-than sign, it is treated as a sequence ID, which a few of the options use for display purposes. The rest of the identification line is completely ignored.
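The rules above can be illustrated with a minimal reader sketch in Python. This is not wcd's own parser; the function name `read_fasta` is hypothetical, and skipping blank lines is an assumption that the text above does not address.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (sequence_id, sequence) pairs; sequence_id is None if absent."""
    seq_id: Optional[str] = None
    chunks = []
    in_record = False
    for lineno, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                # A new identification line closes the previous record.
                yield seq_id, "".join(chunks)
            rest = line[1:]
            # An ID exists only if a non-blank string IMMEDIATELY follows '>'.
            # Everything after that first token is ignored.
            if rest and not rest[0].isspace():
                seq_id = rest.split()[0]
            else:
                seq_id = None
            chunks = []
            in_record = True
        elif line == "":
            continue  # assumption: blank lines are tolerated and skipped
        else:
            if not in_record:
                raise ValueError(f"line {lineno}: sequence before identification line")
            if " " in line:
                raise ValueError(f"line {lineno}: spaces are not allowed in a sequence")
            chunks.append(line)  # a sequence may span several lines
    if in_record:
        yield seq_id, "".join(chunks)


# Usage: the second record has a bare '>' and therefore no ID.
sample = ">EST001 some description\nACGT\nTTGA\n>\nGGCC\n"
records = list(read_fasta(io.StringIO(sample)))
```

Here `records` holds one entry with ID `EST001` and the joined sequence `ACGTTTGA`, followed by one entry whose ID is `None`.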

Format of clustering input

The merge and add options take, as input, files that specify a clustering. These files must use the compressed format described below.

Format of constraint file

Constraint files consist of a sequence of constraints, each on a line by itself. Each line in the constraint file is a directive followed by a list of indices, terminated by a full stop `.'. There are three directives and their semantics are described below.

fix
This directive can be used to specify a list of sequences which should be labelled as fixed. Any cluster that contains a fixed sequence will itself be labelled as fixed. This is useful when the user has some external knowledge about the clustering and wants to ensure that some sequences aren't clustered together (e.g. by a poor quality EST).
Normally when wcd starts, each sequence is put into a cluster. By default, a sequence is put into a cluster by itself, but if a clustering file is given then the clustering specified by that file is used.
Thereafter, clustering starts. However, if the fix directive is used, two sequences that are labelled as fixed will never be merged. If an EST matches more than one fixed cluster, it will be added to at most one of them. Note that sequences that are not fixed can be added to fixed clusters, and a non-fixed cluster can be added to a fixed cluster.
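The merging rule for fixed clusters can be sketched with a small union-find structure in Python. This only illustrates the rule as stated above and is not wcd's implementation; the class name `FixedAwareClusters`, the 0-based indices, and the default starting point (each sequence in its own cluster) are assumptions.

```python
class FixedAwareClusters:
    """Union-find in which two fixed clusters are never merged."""

    def __init__(self, n, fixed=()):
        self.parent = list(range(n))   # each sequence starts in its own cluster
        self.is_fixed = [False] * n    # meaningful at cluster roots only
        for i in fixed:
            self.is_fixed[i] = True

    def find(self, i):
        # Follow parent links to the root, shortening the path on the way.
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def merge(self, a, b):
        """Try to merge the clusters of a and b; return False if refused."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True                # already in the same cluster
        if self.is_fixed[ra] and self.is_fixed[rb]:
            return False               # two fixed clusters never merge
        if self.is_fixed[rb]:
            ra, rb = rb, ra            # keep the fixed root so the flag survives
        self.parent[rb] = ra
        return True


# Usage: sequences 0 and 1 are fixed; sequence 2 matches both.
clusters = FixedAwareClusters(4, fixed=[0, 1])
joined_first = clusters.merge(0, 2)    # 2 joins the fixed cluster of 0
joined_second = clusters.merge(1, 2)   # refused: would merge two fixed clusters
joined_third = clusters.merge(2, 3)    # a non-fixed sequence joins a fixed cluster
```

In this run sequence 2 ends up in only one of the two fixed clusters, which mirrors the "at most one" behaviour described above.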
cluster-only
This tells wcd to try to cluster only those sequences that are in the list (and to ignore the rest). This is useful if you only want to cluster a part of an input file (e.g. you might know the clustering for the rest of the file).
NB: It is an error to put more than one cluster-only or reset in a constraint file.
reset
This is similar to cluster-only but in addition, the clustering of the sequences in this list is reset to the default clustering (i.e. each sequence in the list is put in its own cluster).
This is used where you are given a clustering file as input and, while you are generally happy with the clustering given for some sequences, you would like others to be reclustered. The implication is that the sequences in the reset list
- will be clustered using d2 into one or more clusters;
- will not be clustered by wcd with any of the other sequences.
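As a rough illustration of the constraint file syntax, the following Python sketch parses and validates such a file. It is not taken from wcd: the function name `parse_constraints` is hypothetical, and it assumes that indices are whitespace-separated integers, that blank lines may be skipped, and that the rule on cluster-only and reset means at most one such line in total.

```python
from typing import List, Tuple

DIRECTIVES = ("fix", "cluster-only", "reset")


def parse_constraints(text: str) -> List[Tuple[str, List[int]]]:
    """Return a list of (directive, indices) pairs, one per constraint line."""
    constraints = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if not line.endswith("."):
            raise ValueError(f"line {lineno}: missing terminating full stop")
        tokens = line[:-1].split()
        if not tokens or tokens[0] not in DIRECTIVES:
            raise ValueError(f"line {lineno}: unknown directive")
        # int() raises ValueError for anything that is not an integer index.
        indices = [int(tok) for tok in tokens[1:]]
        constraints.append((tokens[0], indices))
    # Assumed reading of the rule: cluster-only and reset together
    # may appear at most once per file.
    exclusive = sum(1 for name, _ in constraints if name in ("cluster-only", "reset"))
    if exclusive > 1:
        raise ValueError("more than one cluster-only or reset directive")
    return constraints


# Usage with a hypothetical two-line constraint file.
parsed = parse_constraints("fix 3 7 12.\nreset 20 21.\n")
```

The exact separator between indices is not specified in this section, so the whitespace splitting should be checked against real constraint files before relying on it.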