NAME
icaass - Whole DNA database clustering and querying
SYNOPSIS
icaass [-mode Update|U|u|Query|Q|q|Orphans|O|o ]
[-anti-sense Yes|No|y|n ] [-index filename ]
[-ini filename ] [-threshold nn.n ] [-score n ]
[ -extmin n ] [-screen n ] -seq filename ...
DESCRIPTION
ICAass takes files of DNA sequence information and produces
an index file which links similar sequences together in
clusters. ICAass differs from N2tool and ICAtool because it
uses a novel type of global sequence comparison algorithm to
determine when two sequences are considered similar. The
comparison is asymmetric because the algorithm searches for
instances where one sequence is approximately repeated
within the length of another. To facilitate the rapid detec-
tion of such matches, the program needs to be provided with
a size sorted file of DNA sequences (longest sequence
first). This allows the program to use the rapid incremental
clustering approach also used in ICAtool. In this mode
(Update), ICAass has been used to reduce the redundancy
in copies of the separate divisions of Release 28 of the
EMBL DNA database. The size-sorted sequence file can be
produced using the program ssort.
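
The size-sorted input described above can be illustrated with a
short sketch. It is not the implementation of ssort; it assumes
FASTA-style input with '>' header lines, and the function names
read_fasta and sort_longest_first are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-style text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sort_longest_first(records):
    """Order records by decreasing sequence length (ties keep input order)."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)
```

With the longest sequence first, every later sequence can only
be a candidate subsequence of the ones already seen, which is
what makes the incremental approach possible.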
ICAass also has a query mode that allows it to be used for
rapid database searches. ICAass is especially quick when
querying a database with a whole batch of query sequences
because the database is only loaded a single time and then
stored in a compressed form in memory for repeated scanning.
The comparison algorithm used when querying is different
from that used when creating a cluster index. The sequence
comparisons are potentially less sensitive than those per-
formed by ICAtool but always more sensitive than those per-
formed by BLASTN using its default settings. When in query
mode, it is possible to select how many matches are returned
by either a simple count (-print option), or by their score
(-score option).
The program also has a special 'orphans' mode which allows a
simple index to be created almost instantly without perform-
ing any sequence comparisons. This mode enables local data-
bases to be indexed and searched with a minimum of effort
and resources. Indexes produced on the basis of sequence
similarity are quicker to search but the indexing can take a
significant time. If the database is only going to be
searched a hundred or so times then the effort of proper
cluster indexing is not worthwhile.
Sequences can be spread amongst any number of files and new
files can be added at any time to increase the number of
sequences clustered. Various sequence formats are supported
including GenBank, EMBL, plain (unformatted sequence files),
Staden's semi-colon and Experiment file formats, and
also 2 NBRF/FASTA style formats with the description either
on the same line as '>sequence-name' or with the description
on the line immediately following the sequence name. Extra
files of sequences can be added at any time without any
penalty of recalculation but no sequences referenced by an
index should ever be deleted.
USAGE
ICAass can take its configuration parameters from the command
line, from a user initial configuration file, or from
built-in defaults. Parameter settings override each other:
defaults are set first, then the configuration file, and
finally the command line.
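
The override order can be sketched as follows. This is an
illustration only, not the ICAass source: it assumes the
configuration file holds simple name=value lines and that -seq
ends the option list. The function names are hypothetical.

```python
def parse_ini(text):
    """Read name=value lines; anything without '=' is ignored (assumption)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def parse_command_line(argv):
    """Collect '-name value' pairs; everything after -seq is a filename."""
    settings, seq_files = {}, []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "-seq":
            seq_files = list(argv[i + 1:])
            break
        if arg.startswith("-") and i + 1 < len(argv):
            # The next token is taken as the value, even if negative.
            settings[arg[1:]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return settings, seq_files


def resolve(defaults, ini_text, argv):
    """Apply defaults first, then the ini file, then the command line."""
    merged = dict(defaults)
    merged.update(parse_ini(ini_text))
    cli, seq_files = parse_command_line(argv)
    merged.update(cli)
    return merged, seq_files
```

Each later layer only replaces the names it actually sets, so a
value given in the configuration file survives unless the same
name also appears on the command line.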
OPTIONS
-anti-sense Yes|No|y|n
Determines whether sequences should also be compared in the
opposite sense to how they are entered. Default is no.
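
As an illustration, comparing a sequence "in the opposite sense"
is taken here to mean comparing its reverse complement. That
reading is an assumption, and reverse_complement is a
hypothetical helper, not part of ICAass.

```python
# Translation table for the four bases; other symbols (e.g. 'N') are kept.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the complementary strand read in the opposite direction."""
    return sequence.translate(_COMPLEMENT)[::-1]
```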
-index filename
Defines the name of the index file, whether it already exists
or is to be created. Default is "cluster.index" in the
current directory.
-ini filename
Defines the name of the user's initial configuration file.
Default is "ICAtool.ini" in the current directory.
-mode Update|U|u|Query|Q|q|Orphans|O|o
Defines how the program will operate. In Update mode ICAass
will perform full database clustering on the basis of pair-
wise sequence comparisons. In Orphans mode ICAass will
almost instantly index all the sequences separately and
without any sequence comparisons being performed. In Query
mode ICAass will use an existing cluster or orphan index and
search for local sequence similarities between the indexed
and query sequences.
-seq filename1 filename2 filenameN
This flag denotes the start of a list of space-separated
filenames which hold DNA sequence information. No default;
always required.
-threshold nn.n
When creating a cluster index, this flag determines the
subsequence similarity score that defines the threshold at
which one sequence is said to be an Approximate SubSequence
of another. The threshold corresponds to the percentage of
the putative ASS that is also represented in the superse-
quence. A fixed gap-start penalty is subtracted from the
number of matching bases for alignment gaps in either the
ASS or the supersequence. This gap-start cost is equivalent
to 8 bases of the ASS not being present in the longer
sequence. The minimum value is 25.0 and the maximum is 100.
Note that 'N' characters are randomized, so choose the
threshold with care.
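
The arithmetic described for this flag can be sketched as
below. The exact formula used by ICAass is not given here, so
this is one plausible reading: matching bases minus 8 per gap
start, as a percentage of the ASS length. Clamping negative
values to zero and the function names are assumptions.

```python
GAP_START_COST = 8  # bases charged per gap start, as stated in the text


def ass_similarity(matching_bases, gap_starts, ass_length):
    """Percentage of the putative ASS represented in the supersequence."""
    if ass_length <= 0:
        raise ValueError("ass_length must be positive")
    adjusted = matching_bases - GAP_START_COST * gap_starts
    return 100.0 * max(adjusted, 0) / ass_length  # clamp is an assumption


def is_approximate_subsequence(matching_bases, gap_starts, ass_length, threshold):
    """Apply the -threshold test, enforcing the documented 25.0..100 range."""
    if not 25.0 <= threshold <= 100.0:
        raise ValueError("threshold must lie between 25.0 and 100")
    return ass_similarity(matching_bases, gap_starts, ass_length) >= threshold
```

For example, a 200-base sequence with 190 matching bases and one
gap start scores (190 - 8) / 200, so it would pass a threshold
of 80.0 but fail one of 95.0 under this reading.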
-score n
When querying it is useful to select matches on the basis of
their scores. ICAass uses a simple +1 (match), -1 (mismatch)
scoring scheme so that score screening of ungapped matching
segments is easy.
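
The +1/-1 scheme can be illustrated for a single ungapped
segment. This sketch assumes the two segments are already
aligned and of equal length, that comparison ignores case, and
that -score acts as a minimum cutoff; the function names are
hypothetical.

```python
def ungapped_score(segment_a, segment_b):
    """Score an ungapped alignment: +1 per match, -1 per mismatch."""
    if len(segment_a) != len(segment_b):
        raise ValueError("ungapped segments must have equal length")
    pairs = zip(segment_a.upper(), segment_b.upper())
    return sum(1 if x == y else -1 for x, y in pairs)


def passes_score(segment_a, segment_b, minimum):
    """True when the segment score reaches the chosen cutoff (assumption)."""
    return ungapped_score(segment_a, segment_b) >= minimum
```

Under this scheme a segment of length L with m mismatches scores
L - 2m, so a cutoff translates directly into a minimum length
for a given number of mismatches.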
-screen n
If in query mode then this determines the number of charac-
ters per printed line. Default is 80.
-extmin n
This is a kind of sensitivity control. It limits how far
the program searches past mismatches before deciding that
there is no more alignment to be found. The more negative
the number, the further beyond the edges of a good alignment
the program will search. Useful values range from 8
(most insensitive) to -20. The default is approximately
zero.
FILES
ICAtool.ini
If this file is present then all startup details present in
it will be read. An example would be
threshold=80.0
anti-sense=yes
index=cluster.23rdJuly
cluster.index
If this file or an equivalent is present when in Update mode
then any extra sequences are added to this existing index.
A cluster or orphan index file is needed to perform database
querying.
SEE ALSO
N2tool(1), ICAass(1), ICAprint(1), ICAstats(1),
ICAmatches(1), tofasta(1), ssort(1), just30(1)
BUGS
None of the ICAtools check their command line parameters
fully. Only those parameters that are recognized are
checked.
Does not handle base ambiguity symbols properly: use only 'n'
or 'N', which are converted to random bases.
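
The conversion of 'N' to random bases can be sketched as
follows. How ICAass chooses its random bases is not documented
here; this sketch assumes a uniform choice among A, C, G and T,
and randomize_n is a hypothetical name.

```python
import random


def randomize_n(sequence, rng=None):
    """Replace each 'n' or 'N' with a base drawn uniformly from ACGT."""
    rng = rng or random.Random()
    return "".join(
        rng.choice("ACGT") if base in "nN" else base
        for base in sequence
    )
```

Under the uniform assumption, two independently randomized 'N'
positions agree only one time in four, so sequences with many
'N' characters can fall below a high -threshold even when they
are otherwise identical.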