NAME

     icatool - cluster similar DNA sequences


SYNOPSIS

     icatool [-mode Query|Update|q|u] [-anti-sense Yes|No|y|n]
             [-index filename] [-ini filename] [-ktup n]
             [-print n] [-screen n] -seq filename ...
             [-threshold n]


DESCRIPTION

     ICAtool takes files of DNA sequence information and produces
     an  index  file  which  links  similar sequences together in
     clusters.  ICA is  an  acronym  for  Incremental  Clustering
     Algorithm  which  describes  the  way the program builds its
     index one sequence at  a  time.   Sequences  can  be  spread
     amongst  any  number  of files and new files can be added at
     any time to increase  the  number  of  sequences  clustered.
     Various sequence formats are supported, including  GenBank,
     EMBL,  plain  (unformatted  sequence  files),  Staden's
     semicolon and Experiment file formats, and  two  NBRF/FASTA
     style  formats:  one  with  the description on the same line
     as '>sequence-name', and one with  the  description  on  the
     line immediately following the sequence name.


USAGE

     ICAtool can take its  configuration  parameters  from  the
     command  line,  from  a user's initial configuration file, or
     from built-in defaults.  Later sources override earlier ones:
     defaults  are  applied  first,  then the configuration file,
     and finally the command line. Options can be listed  in  any
     order for all the ICAtools.
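
     The override order described above can be sketched as  three
     layers  merged  in  turn.  This  is  only an illustration in
     Python, not ICAtool's own code: the  names  DEFAULTS  and
     resolve_settings  are  hypothetical, and the default values
     are the ones listed under OPTIONS.

```python
# Built-in defaults, copied from the OPTIONS section of this page.
# The dictionary name and layout are assumptions made for illustration.
DEFAULTS = {
    "mode": "Query",
    "anti-sense": "No",
    "index": "cluster.index",
    "ktup": "6",
    "print": "4",
    "screen": "80",
    "threshold": "20",
}


def resolve_settings(ini_settings, cli_settings):
    """Layer settings: defaults first, then the ini file, then the command line."""
    settings = dict(DEFAULTS)
    settings.update(ini_settings)  # the configuration file overrides defaults
    settings.update(cli_settings)  # the command line overrides both
    return settings


if __name__ == "__main__":
    merged = resolve_settings(
        ini_settings={"ktup": "8", "threshold": "25"},
        cli_settings={"threshold": "30"},
    )
    print(merged["ktup"], merged["threshold"], merged["screen"])
```

     Here ktup comes from the  configuration  file,  threshold
     from the command line, and screen from the defaults.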


OPTIONS

  -anti-sense Yes|No|y|n
     Determines whether sequences should also be compared in  the
     opposite sense to how they are entered. Default is no.

  -index filename
     Defines the name  of  the  index  file  existing  or  to  be
     created.  Default  is  "cluster.index" in the current direc-
     tory.

  -ini filename
     Defines the name of the user's initial  configuration  file.
     Default is "ICAtool.ini" in the current directory.

  -ktup n
     Defines the minimum length of identical sequence needed  be-
     fore  a  sequence  pair becomes suitable for further analysis.
     The range is 4..10; bigger values take more memory  and,  on
     big hosts, make the program run faster. Default is 6.

  -mode Query|Update|q|u
     Determines whether an index file is being updated/created or
     queried. Default is QUERY.

  -print n
     If in query mode then this determines the number  of  align-
     ments to print per query sequence. Default is 4.

  -screen n
     If in query mode then this determines the number of  charac-
     ters per printed line. Default is 80.

  -seq filename1 filename2 filenameN
     This flag denotes the start of a  list  of  space  separated
     filenames  which  hold DNA sequence information. No default,
     always required.

  -threshold n
     When in UPDATE mode this  flag  determines  the  subsequence
     similarity  score  which  defines  the  threshold at which 2
     sequences are said  to  be  similar.  Minimum  value  is  4,
     default is 20. Score is defined as #matches - #mismatches in
     an ungapped alignment (HSP).
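
     How the -ktup, -threshold and -anti-sense options  might  in-
     teract  can be sketched as follows. This is a minimal Python
     illustration under stated assumptions, not  ICAtool's  actual
     algorithm:  it assumes that -ktup selects exact shared words
     of length k as candidates, that the score is  the  best  run
     of  #matches  -  #mismatches  on  the diagonal through such a
     word, that a score equal to the threshold  counts  as  simi-
     lar,  and  that  "opposite  sense" means the reverse comple-
     ment. All function names are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Opposite-strand reading of seq (assumed meaning of -anti-sense)."""
    return seq.translate(_COMPLEMENT)[::-1]


def shared_ktuples(a, b, k=6):
    """Yield (i, j) wherever a[i:i+k] == b[j:j+k] (assumed role of -ktup)."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            yield i, j


def diagonal_score(a, b, i, j):
    """Best (#matches - #mismatches) over any contiguous ungapped run
    on the diagonal that passes through positions (i, j)."""
    shift = min(i, j)  # walk back to where the diagonal starts
    best = current = 0
    for x, y in zip(a[i - shift:], b[j - shift:]):
        current = max(0, current + (1 if x == y else -1))
        best = max(best, current)
    return best


def is_similar(a, b, k=6, threshold=20, anti_sense=False):
    """True if some k-tuple seeded diagonal scores at least threshold."""
    candidates = [b]
    if anti_sense:
        candidates.append(reverse_complement(b))
    for cand in candidates:
        # Keep one seed per diagonal; the diagonal is scored as a whole.
        diagonals = {i - j: (i, j) for i, j in shared_ktuples(a, cand, k)}
        for i, j in diagonals.values():
            if diagonal_score(a, cand, i, j) >= threshold:
                return True
    return False
```

     In this sketch a pair with no shared word of length  k  is
     never  scored  at all, which is one plausible reading of why
     -ktup gates "further analysis".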


FILES

  ICAtool.ini
     If this file is present then all startup details present  in
     it will be read. An example would be:
     mode=U
     ktup=8
     threshold=25
     print=20
     screen=100
     anti-sense=yes
     index=cluster.23rdJuly

  cluster.index
     If this file is present when in UPDATE mode then  any  extra
     sequences  are  added to this existing index. This file must
     be present when in QUERY mode.
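
     The ICAtool.ini example above uses one key=value  pair  per
     line.  A  minimal  Python  sketch of reading such a file is
     given below. It assumes only what the  example  shows;  whe-
     ther  the real parser accepts comments, blank lines or spaces
     around '=' is not stated here, so the tolerance shown is  an
     assumption. The function name parse_ini is hypothetical.

```python
def parse_ini(text):
    """Parse key=value lines, as in the ICAtool.ini example, into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # assumption: silently skip anything that is not key=value
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    example = """mode=U
ktup=8
threshold=25
print=20
screen=100
anti-sense=yes
index=cluster.23rdJuly
"""
    for key, value in parse_ini(example).items():
        print(key, "->", value)
```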


SEE ALSO

     N2tool(1),     ICAass(1),     ICAprint(1),      ICAstats(1),
     ICAmatches(1), tofasta(1), ssort(1), just30(1)


BUGS

     Doesn't use base ambiguity symbols properly: use only 'n' or
     'N', which are converted to random bases.
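
     The effect of converting 'n' or 'N' to random bases  can  be
     pictured  with  the  short Python sketch below. It is an il-
     lustration only: the function name replace_n is  hypotheti-
     cal,  and  the  choice  of  a uniform draw from A, C, G and T
     is an assumption, since the distribution ICAtool uses is not
     documented here.

```python
import random


def replace_n(seq, rng=None):
    """Replace each 'n' or 'N' with a base drawn at random from ACGT."""
    rng = rng or random.Random()
    return "".join(rng.choice("ACGT") if ch in "nN" else ch for ch in seq)


if __name__ == "__main__":
    print(replace_n("ACGTNNNNACGT"))
```

     One consequence worth keeping in mind is that a stretch  of
     such  symbols  may  match  differently from one run to the
     next.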