Introduction to the ICAtools

The ICAtools are a set of programs that could be of use to anyone doing medium-to-large scale DNA sequencing projects. By structuring otherwise amorphous sets of DNA sequences, the ICAtools can quickly provide useful information that might otherwise lie undiscovered.

For example, when used for their primary task of guiding the efficient use of cDNA libraries, the programs can estimate the amount of redundancy (or, conversely, normalization success) in a given library, and can predict the number of as yet unfound sequences that remain in any particular library. More generally, the ICAtools can be used to discover when similar subsequences, such as Alu repeats, are present in any set of sequences, or when vector or linker sequences have not been removed from otherwise dissimilar sequences. The programs can list the names and descriptions of those sequences that contain shared subsequences and then display alignments that detail the nature and extent of what the sequences have in common.

The range of uses to which clustering can be put is vast, and the ICAtools are not going to be perfect for every case, but often a combination of tools using different styles of clustering can reveal different kinds of sequence relationship that would not otherwise be apparent.

Clustering ESTs works best with anchored reads. If your ESTs come from all over your mRNA sequences, then you have a much harder problem to solve perfectly, and you will probably need an assembly program such as Gap or Phrap, though the ICAtools can still be used for imperfect, relative cDNA library comparisons.

Note that the ICAtools do not work properly with base ambiguity symbols. Use the UNIX sed command to change them to 'n' or 'N'. The tools can be adjusted to specialized areas of inquiry, or to speed them up, by changing run-time and compile-time constants. Check the individual program descriptions and study the programs' source.
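
As a minimal sketch of the sed substitution mentioned above, the following shell function replaces the standard IUPAC ambiguity codes with 'N' on sequence lines while leaving header lines (those starting with '>') untouched. The function name mask_ambiguity and the file names are hypothetical, and the sketch assumes the input is already in FASTA format.

```shell
# Hypothetical helper (not part of the ICAtools): replace IUPAC ambiguity
# codes with 'N' on sequence lines only.
mask_ambiguity() {
    # '/^>/!' restricts the substitution to lines that are not FASTA headers.
    sed '/^>/!s/[RYKMSWBDHVrykmswbdhv]/N/g' "$@"
}

# Example use (file names are placeholders):
# mask_ambiguity myseqs.fasta > myseqs.clean.fasta
```

If your data uses other non-base symbols, extend the bracketed character list accordingly.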

Downloading:

Watch for the latest versions, as these may not be in sync.


The ICAtools are:

N2tool

Compares all the submitted sequences with each other to find those which share any region of similarity. These similar sequences are grouped together into clusters. One cluster is formed for each unique combination of matching sequences. This can result in many similar clusters if local repeats are common; in that case, turn to the other programs, e.g. icamatches for summaries, and to icaprint's selective printing options. n2tool is the best program to use when searching for sequencing artefacts.

ICAtool

When in update mode, indexes DNA sequences into clusters which share local sequence similarities. On the whole, n2tool does a better job when the resources are available. When in query mode, icatool can also be used to perform database searches. Querying is a quick way to display the best subsequence alignment between any sequences. ICAtool's querying is the slowest and most sensitive of all the ICAtools. Note that it only returns the single best alignment between any pair of sequences.

ICAass

Takes a size-sorted (longest first) file of sequences and searches for those sequences which are approximately repeated within the length of another. This can be useful when searching for globally similar sequences within a library or when attempting to reduce the redundancy in a particular sequence database. ICAass also has a query mode, like icatool, but icaass is much quicker though potentially less sensitive. When used for database querying with batches of sequences, icaass should be quicker than FASTA and more sensitive than BLASTN when run with default parameters.

ICAprint

Takes an index produced by icatool, icaass or n2tool and produces various ASCII text representations of the clustering. Has selective printing options that can be very useful.

ICAstats

Takes icaprint output and produces a table of summary statistics. If being used for the analysis of cDNA library sequences then icastats also makes predictions that aid efficient management of the libraries by helping to define when any particular library is effectively exhausted. This feature is best used in combination with indexes produced by icaass.

ICAmatches

ICAmatches attempts to explain why sequences have been clustered together by using a novel style of sequence alignment. The clusters produced by the ICAtools can be very large and complex, and are not, therefore, amenable to the traditional styles of multiple alignment analysis. Only a single type example sequence is displayed in an icamatches alignment; matching regions are shown by cumulative local match counts displayed underneath the listed type sequence. Artefact sequence normally has a positional and frequency bias that is easily detected in an icamatches alignment.

tofasta

General utility that can convert many different DNA sequence formats into FASTA format.

ssort

Loads sequence files into an index, then prints the sequences out, longest to shortest, in FASTA format.
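
For illustration only, the longest-first ordering described here (the order icaass expects for its input) can be approximated with standard UNIX tools. This sketch is not how ssort itself is implemented; the function name sort_fasta_longest_first is hypothetical, and it assumes FASTA input whose header lines contain no tab characters.

```shell
# Hypothetical helper (not part of the ICAtools): print FASTA records
# longest first, with each sequence joined onto a single line.
sort_fasta_longest_first() {
    awk '
        # On each header, emit the previous record as: length, header, sequence.
        /^>/ { if (hdr != "") print length(seq) "\t" hdr "\t" seq
               hdr = $0; seq = ""; next }
        # Sequence lines are concatenated so wrapped records count correctly.
        { seq = seq $0 }
        END  { if (hdr != "") print length(seq) "\t" hdr "\t" seq }
    ' "$@" |
    # Sort numerically on the length column, largest first.
    sort -t "$(printf '\t')" -k1,1nr |
    # Drop the length column and restore the two-line FASTA layout.
    awk -F '\t' '{ print $2; print $3 }'
}
```

Note that the output is unwrapped: each sequence is written on one line after its header.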

just30

Natty command to just print the start of any supplied sequences in FASTA format. Great for investigating artefacts.


Example Commands

n2tool -seq myseqs/*       Prepares an index by clustering sequences found in all the files in directory myseqs.
icaprint > cluster.list    Prints an ASCII text file showing which sequences have been clustered together.
icastats cluster.list      Prints a table of statistics on the screen.

Related Software

Miropeats

Miropeats displays DNA sequence similarity information graphically. The program uses icaass to discover regions of similarity among any set of DNA sequences and then draws a PostScript (TM) graphic that summarizes the length, location and relative orientations of any repeated sequences.

ESTcluster

A Perl script that wraps up calls to icaass to simplify its repeated use on a large EST sequencing project. In all my experience of various EST sequencing projects, I have never discovered a single cDNA library that can be deeply sampled and found to be well normalized. ESTcluster assumes a normalized cDNA library so please treat its efforts at prediction with great scepticism. The icaass program can be compiled with special options for short sequences to improve its speed when called from inside ESTcluster.

Disclaimer

Permission is granted to any individual or institution to use, copy, or redistribute this software so long as it is not sold for profit, provided that this notice and the original copyright notices are retained. Jeremy Parsons makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty. It's academic software, OK!

Author: Jeremy D. Parsons

Address: EBI, EMBL-Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, England
