NAME
icatool - cluster similar DNA sequences
SYNOPSIS
icatool [-mode Query|Update|q|u] [-anti-sense Yes|No|y|n]
[-index filename] [-ini filename] [-ktup n] [-print n]
[-screen n] -seq filename ... [-threshold n]
DESCRIPTION
ICAtool takes files of DNA sequence information and produces
an index file which links similar sequences together in
clusters. ICA is an acronym for Incremental Clustering
Algorithm which describes the way the program builds its
index one sequence at a time. Sequences can be spread
amongst any number of files and new files can be added at
any time to increase the number of sequences clustered.
Various sequence formats are supported, including GenBank,
EMBL, plain (unformatted sequence files), Staden's semicolon
and Experiment file formats, and two NBRF/FASTA-style
formats: one with the description on the same line as
'>sequence-name', the other with the description on the line
immediately following the sequence name.
USAGE
ICAtool can get its configuration parameters from the command
line, from a user configuration file, or from built-in
defaults. Settings override each other: defaults are applied
first, then the configuration file, and finally the command
line. Options can be listed in any order for all the ICAtools.
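
The override order described above can be sketched in Python. This is
an illustration only, not ICAtool's own code: the names DEFAULTS,
parse_ini and resolve are hypothetical, and the parser assumes the
simple key=value layout shown in the ICAtool.ini example under FILES.
The default values are the ones documented under OPTIONS.

```python
# Hypothetical sketch of "defaults, then ini file, then command line".
# None of these names come from ICAtool itself.

DEFAULTS = {
    "mode": "Query",
    "anti-sense": "No",
    "index": "cluster.index",
    "ktup": "6",
    "print": "4",
    "screen": "80",
    "threshold": "20",
}


def parse_ini(text):
    """Parse key=value lines; blank lines and lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def resolve(defaults, ini_settings, cli_settings):
    """Layer the three sources; later sources override earlier ones."""
    merged = dict(defaults)
    merged.update(ini_settings)
    merged.update(cli_settings)
    return merged


ini = parse_ini("ktup=8\nthreshold=25\n")
cli = {"threshold": "30"}  # as if "-threshold 30" had been given
config = resolve(DEFAULTS, ini, cli)
print(config["ktup"], config["threshold"], config["screen"])
```

Here ktup comes from the configuration file, threshold from the
command line, and screen from the built-in default.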
OPTIONS
-anti-sense Yes|No|y|n
Determines whether sequences should also be compared in the
opposite sense to how they are entered. Default is no.
-index filename
Defines the name of the index file, existing or to be
created. Default is "cluster.index" in the current directory.
-ini filename
Defines the name of the user's initial configuration file.
Default is "ICAtool.ini" in the current directory.
-ktup n
Defines the minimum identity length needed before a sequence
pair becomes suitable for further analysis. The range is
4..10; bigger values take more memory and, on big hosts,
make the program run faster. Default is 6.
-mode Query|Update|q|u
Determines whether an index file is being updated/created or
queried. Default is QUERY.
-print n
In query mode, determines the number of alignments to print
per query sequence. Default is 4.
-screen n
In query mode, determines the number of characters per
printed line. Default is 80.
-seq filename1 filename2 filenameN
This flag denotes the start of a list of space-separated
filenames which hold DNA sequence information. No default;
always required.
-threshold n
In UPDATE mode, this flag sets the subsequence similarity
score threshold at which two sequences are said to be
similar. Minimum value is 4, default is 20. The score is
defined as #matches - #mismatches in an ungapped alignment
(HSP).
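
To make the roles of -ktup, -threshold and -anti-sense concrete, here
is a minimal Python sketch. It is not ICAtool's implementation and
rests on several assumptions: that "identity length" means an exact
shared run of ktup bases, that the HSP score is the best-scoring
segment along the diagonal through that run (+1 per match, -1 per
mismatch), that a pair counts as similar when the score is at least
the threshold, and that "opposite sense" means the reverse complement.
All function names are hypothetical.

```python
# Hypothetical sketch; assumptions are listed in the text above.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Assumed meaning of comparing in the 'opposite sense'."""
    return seq.translate(COMPLEMENT)[::-1]


def shares_ktup(a, b, ktup=6):
    """Return positions (i, j) of one exact ktup-length match, or None."""
    seen = {}
    for i in range(len(a) - ktup + 1):
        seen.setdefault(a[i:i + ktup], i)
    for j in range(len(b) - ktup + 1):
        i = seen.get(b[j:j + ktup])
        if i is not None:
            return i, j
    return None


def best_ungapped_score(a, b, i, j):
    """Best segment score (#matches - #mismatches) on the diagonal via (i, j)."""
    shift = min(i, j)
    i0, j0 = i - shift, j - shift
    best = current = 0
    for x, y in zip(a[i0:], b[j0:]):
        # Restart the segment whenever the running score drops below zero.
        current = max(0, current + (1 if x == y else -1))
        best = max(best, current)
    return best


def similar(a, b, ktup=6, threshold=20, anti_sense=False):
    """True if a and b (or b's reverse complement) pass both tests."""
    candidates = [b]
    if anti_sense:
        candidates.append(reverse_complement(b))
    for cand in candidates:
        hit = shares_ktup(a, cand, ktup)
        if hit is not None and best_ungapped_score(a, cand, *hit) >= threshold:
            return True
    return False
```

In this sketch the k-tuple test is a cheap filter and the score is
computed only for pairs that pass it, which is one plausible reading
of "suitable for further analysis".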
FILES
ICAtool.ini
If this file is present then all startup details in it will
be read. An example would be:
mode=U
ktup=8
threshold=25
print=20
screen=100
anti-sense=yes
index=cluster.23rdJuly
cluster.index
If this file is present when in UPDATE mode then any extra
sequences are added to this existing index. This file must
be present when in QUERY mode.
SEE ALSO
N2tool(1), ICAass(1), ICAprint(1), ICAstats(1),
ICAmatches(1), tofasta(1), ssort(1), just30(1)
BUGS
Doesn't use base ambiguity symbols properly: use only 'n' or
'N', which are converted to random bases.
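
One possible workaround is to mask other ambiguity symbols before
running ICAtool. The Python sketch below is not part of ICAtool; the
function name mask_ambiguity is hypothetical, and it assumes the
IUPAC nucleotide ambiguity codes R, Y, S, W, K, M, B, D, H and V. It
should be applied to sequence data only, not to name or description
lines.

```python
import re

# IUPAC ambiguity codes other than N, in both cases.
AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_ambiguity(seq):
    """Replace each ambiguity symbol with N (or n), preserving case."""
    return AMBIGUOUS.sub(lambda m: "N" if m.group().isupper() else "n", seq)
```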