Authors: Jeremy D. Parsons and Patricia Rodriguez-Tomé

EMBL-Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK;

Corresponding Author:
Jeremy D. Parsons
Boinformatics Consultancy
37 Walpole Road,
Cambridge,
CB1 3TH
England

Tel. +44 7986 534110

http://www.littlest.co.uk/

JESAM: CORBA software components to create and publish EST alignments and clusters

Abstract

Motivation:

Expressed Sequence Tags (ESTs) are cheap, easy and quick to obtain relative to full genomic sequencing and currently sample more eukaryotic genes than any other data source. ESTs are particularly useful for developing Sequence Tag Sites (STSs for mapping), polymorphism discovery, disease gene hunting, mass spectrometer proteomics, and most ironically for finding genes and predicting gene structure after the great effort of genomic sequencing. However, ESTs have many problems and the public EST databases contain all the errors and high redundancy intrinsic to the submitted data so it is often found that derived database views, which reduce both errors and redundancy, are more effective starting points for research than the original raw submissions. Existing derived views such as EST cluster databases and consensus databases have never published supporting evidence or intermediary results leading to difficulties trusting, correcting, and customizing the final published database. These difficulties have lead many groups to wastefully repeat the complex intermediary work of others in order to offer slightly different final views. A better approach might be to discover the most expensive common calculations used by all the approaches and then publish all intermediary results. Given a globally accessible database with a suitable component interface, like the JESAM software described in this paper, the creation of customized EST-derived databases could be achieved with a minimum effort.

Results:

Databases of EST and full-length mRNA sequences for four model organisms have been self-compared by searching for overlaps consistent with contiguity. The sequence comparisons are performed in parallel using a PVM process farm and previous results are stored to allow incremental updates with minimal effort. The overlap databases have been published with CORBA interfaces to enable flexible global access as demonstrated by example Java applet browsers. Simple cDNA supercluster databases built as alignment database clients are themselves published via CORBA interfaces browsable with prototypical applets. A comparison with UniGene Mouse and Rat databases revealed undesirable features in both and the advantages of contrasting perspectives on complex data.

Availability:

The software is packaged as two jar files available from: URL: http://www.littlest.co.uk/ . One jar contains all the Java source code, and the other contains all the C, C++ and IDL code. Links to working examples of the alignment and cluster viewers (if remote firewall permits) are no longer available at http://www.ebi.ac.uk/EST.

Contact:

http://www.littlest.co.uk/

Introduction

CORBA and biological data

The use of CORBA in a biological context was introduced by Hu et al. (1998) and Lijnzaad et al. (1998) where they explain that CORBA can be a good solution to the problem of creating applications for distributed heterogeneous environments. The Internet is the extreme example of both distribution and heterogeneity and, therefore, the hardest to support with conventional client/server systems. Orfali and Harkey (1998) see CORBA as a good middleware solution for the Internet and refer to the creation of the "Object Web" where CORBA hides some of the most complex interoperability issues behind well specified interfaces fronting globally accessible software components. Principal among the benefits that CORBA brings to biological data distribution and interaction, are: the Interface Definition Language (IDL) to define interfaces between objects, scalability via language, location and Operating System independence, and, finally, a rich set of 15 object services. CORBA is often described as being middleware because it sits between software components, or data sources, and lets them all interact without regard to intervening distance or data conversions. In 1996, the ease with which client/server object-oriented applications could be written, distributed, and supported across the Internet increased when Netscape (http://www.netscape.com/) announced (Orfali and Harkey, 1997) that its browsers were all going to include a CORBA Object Request Broker (ORB) alongside the built-in Java Virtual Machine (JVM), and when it subsequently decided to distribute its browsers for free.

Sequence databases (for example the EMBL NSDB Stoesser et al, 1998) are currently the most familiar biological database and a typical source for EST sequences. Traditionally these databases are presented and distributed using "flat file" views of the sequence originally submitted, but these static raw views are being superseded by novel interfaces better suited to the needs of academic and industrial researchers. The new browsing interfaces are needed because current research may need to investigate nucleotide or protein sequences grouped by their function, phylogeny, map position, cellular location, expression regulation, disease association, annotation similarity or metabolic pathway etc. To support these new investigative cross-database clients, it is useful to have standardized API level access to both data and services - a requirement where CORBA is starkly appropriate. In summary, ESTs are inherently difficult data to handle (yet extremely useful) so they are a natural target for new efforts to improve data quality, cross database browsing and novel visualization tools.

EST databases and clustering extensions

Of all the biological databases, Expressed Sequence Tag (EST) databases have been the most conspicuous in recent years both because of their prodigious growth rates and because of their increase in intrinsic importance connecting different areas of biological research. Expressed Sequence Tags are cheap, easy and quick to obtain relative to full genomic sequencing and currently sample more eukaryotic genes than any other data source. They have great current utility for disease gene hunting (Banfi et al. 1996), mass spectrometer proteomics (Kuster and Mann, 1998), and for finding genes and predicting gene structure after genomic sequencing (Bailey et al, 1998, and Jiang and Jacob, 1998). The patterns of early public work with human ESTs (Boguski et al, 1993; Hillier et al, 1996; Gerhold and Caskey, 1996) are being repeated again now for mouse and other eukaryotic model species. The Mouse Genome Project is advancing quickly (Bettey et al, 1999), with a large EST project (Marra et al, 1999) and various maps (Blake et al, 1998, McCarthy et al, 1997) completed and novel databases available for unique analyses of mammalian development (Davidson et al, 1997), and inter-species comparisons (Hardison, et al, 1997, Makalowski et al, 1996).

ESTs have many problems that stem from their means of production. The cDNA library from which any ESTs are drawn will sample the levels of expression in a particular tissue at a particular time: rare transcripts will be missed and highly expressed genes will be overly abundant. This latter problem, redundancy, is both wasteful and difficult to handle due to the extreme volume of error prone data. Redundancy can be reduced in the laboratory using normalization techniques but these are imperfect though additional subtractive techniques are being investigated (Bonaldo et al, 1996).

To address the redundancy problem using computer techniques, a variety of EST clustering programs have been implemented for example Parsons et al (1992), Boguski and Schuler (1995), Parsons (1995), Eckman et al. (1998), and Yee and Conklin (1998). Once clustered, gene fragments can be assembled using sequence assembly programs, or both steps can be combined in large-scale EST assembly and gene indexing protocols. An assembly stage adds two main benefits: first, it produces contigs and consensus sequences which can completely hide EST redundancy and, second, it should also improve the length and quality of the gene reconstructions beyond that available from any one EST (usually single pass reads). For some applications, the clustering is all that is required, for example, the early stages of STS mapping but for others, sophisticated cluster partitioning (Burke et al, 1998) or complex assemblies are more suitable.

Whilst the redundancy problem can be mitigated, other deficiencies of EST databases are computationally insoluble and most EST based projects have centered on pragmatic scientific goals cognizant that perfect results are impossible. Early projects that used EST clustering include TIGR's THC database (Adams et al, 1995), Geneexpress (Houlgatte et al, 1995), and UniGene (Boguski and Schuler, 1995) but new genomes (e.g. Arabidopsis, Rounsley et al., 1996), new methods (e.g. SANBI STACK, http://www.sanbi.ac.za/Dbases.html), and new public data (Aaronson, 1996) have kept the field busy with important discoveries and enhanced views (Eckman et al, 1998).

Assembly and Gene Indexing

Assembling gene transcript fragments is more complex than assembling shotgun clones of genomic DNA. One gene does not lead to one EST nor even a statistically predictable sampling depth and, complicating further, there are also variations in the underlying reconstruction target. The cDNA libraries that have been used in the public EST projects were made from many different individuals/genotypes and tissues so the public EST collections show polymorphism and tissue-specific variation. Additionally, errors of many forms (basecalling, misnaming, vector and cross-species contamination, chimeras, poor template, etc.) are more likely with broadly sampled complex cDNA collections than with tightly focussed genomic projects which can operate more internal checks and aim for a simpler single target. Tissue variation may take the form of alternative polyadenylation (Gautheret et al, 1998), or alternative splicing (Lopez, 1998). In all cases, the best possible EST comparisons will require knowledge of the likelihood of the various types of error, and the exact tissue-specific origins of the source cDNA library.

EST clustering on typical cDNA/EST collections will produce clusters which may vary in size from one to approximately 20,000 which is a size range commonly encountered with larger shotgun assembly projects. Some DNA shotgun assembly programs (necessary for all large-scale genomic sequencing) have been applied to EST assembly, however there is a significant danger of producing artificial sequences by incorrectly joining, or homogenizing transcripts from different genes. In summary, postulated transcription products will ultimately need experimental biological confirmation and the totality of products from a single gene may not fit well into a FASTA format file: instead, a description is needed with the potential to record all of the reconstructed biological complexity.

EST clusters can be combined with other sources of gene sequence information, (for example full-length mRNA, and genomic DNA) to better sample an organisms entire genome, (especially for rare transcripts) to create "gene indexes", and consensus sequence databases. As gene mapping information accumulates, so the gene indexes lead to the creation of "gene maps" which link genes and their functions to locations on a physical map. Gene maps supported by polymorphism information make an extremely valuable resource for positional candidate disease gene hunting.

Introduction to EST Clustering Algorithms

Computational Complexity

A typical EST sequencing project may produce between fifty thousand and a few million ESTs which need to be compared against each other (order N²). This formidable scaling problem must be handled by whichever algorithm is chosen and before user considerations such as input requirements and presentation style can be considered. Algorithmic tools with the potential to reduce the comparison complexity include suffix trees, hashes, Finite State Machines, indexing and statistical DNA-word frequency analyses. These methods could be applied sequentially for each sequence versus the rest, or (for some of the algorithms) a simultaneous all-against-all result emerges inside some data structure.

Experimental and Biological Complexity

Deciding if and how two ESTs are related is complicated by errors in the process of EST manufacture: short single pass reads (Hillier et al. 1996) and by many natural causes of variation (Wolfsberg and Landsman, 1997). The program that best estimates the likelihood of combinations of all these complicating events should generate the best set of clusters or contigs, and the best overall result. Given the scale and difficulty of the challenges faced, even the best programs will make many mistakes and be unable to assess the significance of, or just plain miss, much of importance so experimental confirmation is necessary. This should not be a surprise as the same is true of even fully finished genomes (Oliver, 1996).

Comparisons of Existing Programs

Unfortunately, there are as yet no detailed comparisons of the quality of any current clustered EST databases and direct comparisons of the clustering tools themselves are complicated by potentially important issues such as results presentation style, software portability, memory, disk and CPU requirements. Miller and Powell, (1994) published a quantitative comparison of genomic shotgun assembly packages but this work is not directly relevant because all the programs have since changed and the paper did not address issues specific to EST assembly. Burke et al.(1998) did publish two clustering comparisons: first a subpartitioning of UniGene101 but this used the existing (not independent) UniGene clusters as its starting point and, second, a contrasting comparison with TIGR Gene Index but this was summary and non-rigorous using different EST data.

Published references for EST comparison programs show a variety of specified goals and assumptions. The simplest use scripting languages (e.g. Perl) to run and parse sequence database searches by BLASTN (Altschul et al, 1997, http://www3.ncbi.nlm.nih.gov/BLAST/) or FASTA (Pearson et al., 1997, http://www.cs.virginia.edu/brochure/profs/pearson.html). EST sequences may be clipped to various levels of quality, vector trimmed, and low complexity masked before being sent for BLASTN/FASTA comparisons. BLASTN and FASTA hash the query sequence before scanning a sequence database. JESAM and ACEmbly hash the whole database (indexing) and find hash hits for the query against all database sequences simultaneously before checking for alignments. Construction of Compugen's EST clusters (LEADS, http://www.cgen.com/solutions/index.html) uses a proprietary non-alignment based algorithm. Other (published) non-alignment algorithms include EMBLSCAN (Bishop and Thompson, 1984) D-squared Cluster, and RAPID (Miller et al, 1999) try to estimate the significance of discovering oligos (DNA words) in common between two sequences without the cost of a full alignment.

The simple ideal that all cognate ESTs should cluster together could be achieved easily if all 3-prime ESTs were anchored to the same region of the cDNA and gene. In this situation (ignoring sequencing errors) any EST from the same gene would have the same basecalled sequence . The original ICAtools (Parsons et al., 1992) were the first tools built to automate this simple analysis. Unfortunately, as described by Gautheret et al. (1998), alternate polyadenylation is common (in humans) so ESTs may not be well anchored and discovering cognate ESTs reverts to solving a peculiar massive shotgun assembly. The anchored EST assumption will, however, still give useful results for strictly relative comparisons of the complexity within individual cDNA libraries, especially if transcript heterogeneity is temporally and anatomically restricted (Gautheret et al, Burke et al).

BLAST and FASTA based scripts

Examples of the BLASTN or FASTA clusterers include: INCA (Graul and Sadee, 1997, http://itsa.ucsf.edu/~gram/home/inca/) SEALS (http://www.ncbi.nlm.nih.gov/Walker/SEALS/index.html), WHALE (unpublished), Zymogenetics' REX (Yee and Conklin, 1998) and the anonymous tools used to produce Geneexpress, and the Merck Gene Index. The BLASTN/FASTA algorithms were designed for searches for local similarity and homology, which is a different goal from a search for alignments consistent with overlap/contiguity. In consequence, the script wrappers put around the homology searches have three functions: to run the search, to parse the search results and to enforce user-defined rules of when two sequences should actually be considered to overlap. These extra match criteria cannot be handled directly by the database searching programs so later, more sensitive and slower pairwise sequence alignments may be run to improve the comparison accuracy (Merck Gene Index). Typical extra heuristics include (for Geneexpress) 90% overall nucleotide identity in the overlapping region, no more than 35 unmatched (outside local alignment) dangling sequence ends (UniGene) etc.

Glaxo's "Dynamic" assembler (Gill et al, 1997) uses interactive rounds of BLASTN searches of EST databases to identify possible overlapping sequences which are fed to Gap (Bonfield et al, 1995) for Greedy assembly. The approach taken by Glaxo was quick to implement, and effective for simple assemblies but suffers pauses between each round of assembly, the Greedy assembly is prone to errors with repetitive sequence regions, the static HTML output limits interactivity/editing and alternative BLASTN alignments displayed simultaneously in one alignment may cause confusion for the user.

The original UniGene package (Boguski and Schuler, 1995, Schuler et al, 1996) relied on a BLASTN-like hash based search to identify pairs of sequences with at least two 13-mer words separated by no more than two bases. These sequence pairs were then aligned more accurately within a 10-base window either side of the diagonal identified by the oligo match. The alignment was scored +1 for a nucleotide identity, -2 for a mis-match, -1 for a gap, and 0 for an N ambiguous base. An alignment quality was calculated by dividing the score by the alignment length and had to exceed 0.91 to be accepted. An extra constraint was that the aligned region had to extend within 35 bases of the end of either sequence. UniGene itself only accepts data of known good quality, (high quality base count, presence of poly-A tail etc.) so not all ESTs that might pass the above alignment criteria are ever checked for matches. Explicit lists of excluded sequences (>200,000) are not published nor assemblies, alignments or consensus sequences. UniGene is available online at http://www.ncbi.nlm.nih.gov/UniGene/index.html with extra descriptions at http://www.ncbi.nlm.nih.gov/Schuler/Papers/ESTtransmap/

Specialized Clustering Programs

The ICAtools and CLEANUP are specialized hashed-query sequence clustering packages. ICAass (one of the latest ICAtools - http://www.ebi.ac.uk/~jparsons) uses a FASTA like search but uses an asymmetric scoring scheme designed to measure redundancy, rather than directly discover overlaps. A variety of specialized cluster browsing tools assist extraction of non-redundant sequence sets, or allow accelerated database searches where an increased portion of the computation time is spent comparing the query with very similar sequences. All code is ANSI C, runs on many UNIX variants (and ported to MacOS), and is free to academics and industry. It has been used to cluster 180,000 ESTs on one shared 143 MHz UltraSPARC in 9 days. Memory usage scales linearly with database size, but computation time scales quadratically.

CLEANUP (Grillo et al, 1996) is similar to the ICAtools but performs a unique hashed-query pairwise sequence comparison which makes it slower than ICAass. Query sequences are encoded as hashed oligos along with one base mutated versions of the oligos to enhance query sensitivity. The largest published data set: 2400 Drosophila sequences, was self-compared in 160 seconds. All code is written in C, and publicly available.

Small-Scale Assembly Programs

Conventional shotgun sequence assembly relies on overlapping alignment of the called nucleotide bases to recreate a representation of the original DNA progenitor sequence. All the shotgun assemblers referenced except ACEmbly use a hashed query sequence algorithm for finding matches. Some of the more sophisticated programs are able to make use of ancillary information such as basecall probability estimates, clone orientation and length estimates to improve the quality of the contigs produced, especially in the presence of repetitive DNA. Relevant programs include Gap4, (Bonfield et al, 1995), CAP3 (http://genome.cs.mtu.edu/cap/cap3.html), Phrap and Gene Meyer's FAK (Meyers, 1995). The conventional shotgun assemblers can be used to assemble typically-sized clusters of ESTs.

Gap4 ( http://www.mrc-lmb.cam.ac.uk/pubseq/overview.html ) implements a simple algorithm (relatively slow) where new shotgun reads are compared against contig consensus sequences looking for hash hits that extend the contigs in a Greedy fashion. The program is memory efficient and scales to megabase-sized YAC assembly projects with more than 10,000 reads. Gap4 can read and extend contigs produced by other programs e.g.. Phrap, and CAP2.

Phrap (unpublished, http://bozeman.genome.washington.edu/index.html) has incorporated error probability information deeper into assembly and editing than any other assembly package. Alignments are stored in detail, as a directed graph in RAM, leading to large memory requirements (1 GB for a YAC assembly). A consensus sequence is extracted by using a mosaic approach where the best read in any multiple alignment is directly copied to the result. Phrap has no specific features to handle EST idiosyncrasies like alternative splicing etc. Assemblies are fast (6 hours for 49,000 sequences on a single 195 MHz R10000) however a single sequence addition requires all previous calculations to be repeated. An optimized and parallelised version of Phrap is commercially available (http://www.spsoft.com/).

CAP3 (unpublished, http://genome.cs.mtu.edu/cap/cap3.html) is an enhancement of CAP2 (Huang, 1996) - both are relatively slow. CAP3 can utilize ancillary information to guide sequence assembly: mate-pairs, clone size, orientation/location, and basecall error estimates. CAP3 is used at MIPS to assemble UniGene clusters (http://www.mips.biochem.mpg.de/proj/human/)

TIGR Assembler, (Sutton et al, 1995) is a relatively fast sequence assembly program that has been proved assembling a diverse collection of repetitive microbial genomes. Memory usage for the query hashing stage is 32 * 4ⁿ bytes where n is the word size (oligo size) e.g. 512 MB for 12-mer oligos. Four criteria control the match decision: minimum length of overlap, maximum number of local errors, percentage of best possible score, and maximum length of overhang. TIGR use a combination of BLASTN and TIGR Assembler to produce their own consensus sequences. This combination is relatively slow: multiple-months to produce THC (Tentative Human Consensus) sequences for all the human ESTs. The THC database is biased towards splitting EST clusters due to this stringent assembly stage (Burke et al.,1998). TIGR Assembler on its own has been used to assemble 186,000 ESTs on a single SPARC processor, with 512 MB RAM, in seven days.

ACEmbly (http://alpha.crbm.cnrs-mop.fr/acembly/) is a fast hashed-database program where oligos are sampled from all sequences and used to prime alignments and then assemblies. ACEmbly fragment ordering is initially based on a distance metric calculated from the number of oligos two sequences have in common divided by their total number of sampled oligos. This "zipping" is fast (< 10 seconds for 600 reads) but prone to errors with repeated sequences. JESAM, the subject of this paper is another hashed-database sequence comparison program.

Word-Based Clustering

Word-based clustering is inherently fast because it omits all alignment steps before clustering. Eliminating initial alignments raises the potential for poor quality clustering either through poor sensitivity, or poor selectivity. Word-based clustering uses less information and so has greater error potential, especially when compared to sequence alignments that can use position-dependent sequence quality measures. Good specificity depends on refinements such as statistical weighting schemes where common oligos are given reduced significance. The sensitivity of word based alignments has not been discussed to any depth in the literature.

The published work on D-Squared Cluster (Hide et al, 1994, http://www.bchs.uh.edu/stemple/D2/d2.html) was preliminary and experimental with only limited exploration of the kinds of optimizations available that might make improvements to the programs specificity. D-Squared cluster uses windowed (i.e. local) n-mer oligo frequency comparisons between a query sequence and a compressed database. Using a Cray Y/MP-48 supercomputer, 884 bases could be compared against 19 million bases in less than one minute.

RAPID (Rapid analysis of Pre-Indexed Data structures) Miller et al., 1999 and http://www.bioinf.man.ac.uk/rapid/) uses oligo frequencies to alter the significance of oligo matches before scoring particular pairwise matches. No clustering is offered but the program has been tested on the analogous problem of removing redundancy and contamination from DNA sequence databases. RAPID uses Receiver Operator Characteristic (ROC) curves to tune parameters such as oligo size to get database search results with maximal specificity to the query. Rapid is memory intensive but relatively fast.

Systems and Methods

The JESAM sequence alignment calculation and storage code is written in ANSI C with a C++ wrapper for the alignment server. Sequence alignments were stored in Berkeley DB libraries supplied by Sleepycat: http://www.sleepycat.com/db/. All the client software is written in JAVA and compiled using Sun Microsystems/Javasoft's Javac JAVA compilers (http://www.javasoft.com/) and runs with the Java 1.1 class libraries and Java Virtual Machine JVM including the implementation in Netscape Communicator 4.5. The IDL interface specifications were compiled by Object Oriented Concepts' (URL http://www.ooc.com/) ORBacus IDL to Java and IDL to C++ compilers which have free run-time ORB licenses. Many ORBs and IDL compilers are available free (URL: http://industry.ebi.ac.uk/~corba/) as are Sun's Java and IDL compilers. Documentation is distributed throughout the Java source code in Javadoc comments.

Algorithm

Sequence overlap detection

A multi-stage pipeline was developed to discover all the arrangements of all pairs of sequences where the alignment could be consistent with the sequences being cognate and contiguous. The algorithm's first stages reduce the total number of pairwise sequence alignments whilst aiming to maintain overlap sensitivity and alignment accuracy.

To ensure that the published alignments were both biologically useful and mathematically optimal, it was thought necessary to use a dynamic programming algorithm with a sophisticated gap penalty scheme (Altschul and Erickson, 1986). However, computation time makes it impractical to compare all sequences against each other for large EST datasets with workstation implementations of derivatives of the Smith-Waterman algorithm. Specialized hardware was an unacceptable solution due to perceived problems of cost, availability, portability, and ease of algorithm development. The JESAM alignment algorithm therefore uses dynamic programming only for the final alignments relying on the gross overall overlap being easy to find because the goal was only to discover potentially overlapping subsequences, not distant homologues mutated apart though millennia

Alignment Algorithm Stages

Load all non-overlapping 12-mers from all sequences into a hash table
Delete all low-complexity and over-abundant hash table entries
For each sequence
        Using a 12-mer window sliding in single base steps down query
                Find the positions of all oligo matches retrieved using 12-mer key
        Sort all the oligo matches by relative start positions (phase) to remove redundancy
        Use the representative oligo matches to seed a windowed optimal alignment
Store all alignments consistent with contiguity

For all the alignments described in this paper, an affine gap scoring scheme (+5 match, -15 mismatch, -15 gap-start, -15 per-gap-base) was used with end gaps only penalized if not-consistent with contiguity and an overall score of 200 needed for an alignment to be stored. To ensure sensitive alignments are calculated and stored, no predefined repeats are masked. Furthermore, repeat discovery is easier to do accurately after all alignments are calculated.

Supercluster creation

A simple clustering algorithm was chosen to summarize groups of overlapping sequences both as an extension of the alignment database and to prove the ease of writing client CORBA components. The algorithm aims to create a series of independent clusters through recursive calls that follow all alignments for every sequence. Each cluster ultimately contains all the sequences that can be linked by any series of stored overlaps to every cluster member. The clustering is performed without regard for any inconsistent alignment combinations making this algorithm much simpler and less selective than those used to produce highly optimized clustering databases like UniGene. These simple superclusters have a bias towards sensitivity and overclumping that aids further processing (e.g.. reduced likelihood of data loss for cluster assembly) and directly offer a useful and easily understood benchmark to contrast against other clustering schemes.

Implementation

JESAM takes novel approaches to both sequence comparison and clustering: alignments are calculated using a PVM process farm, and published via CORBA. A Java program acts as a CORBA client to request alignments, and make sequence clusters and then turns itself into a CORBA server for the cluster data.

Alignment calculation and storage.

All the sequences can be loaded from flat files or from a CORBA sequence server. The latter would be more convenient but the former offers higher performance with current sequence server implementations. All sequences are used to extract non-overlapping 12-mers which are recorded in a hash-table of linked lists. Abnormally common oligos and low-complexity oligos may have all their linked lists freed but no predefined repeats are masked. New sequences are retrieved in batches and overlapping 12-mers are used to look up oligo hits in the hash table. Oligo hit results are partitioned between the slaves which do a banded Smith-Waterman-based sequence comparison looking for matches consistent with sequence overlap. The parallel computer architecture and all affine gap penalties can be independently adjusted on the command-line. The slaves receive computational tasks via Parallel Virtual Machine (PVM) message transfers which are asynchronous and fully buffered. Buffering is controlled by requiring feedback to the oligo scanner and limiting the number of outstanding comparisons. To spread the network load, slaves ask for batches of sequences from a sequence server only when needed and then cache their set locally. The slaves may need a large amount of virtual memory, (to cache sequences: one byte per base) but little RAM (typically 10 to 20 MB) for the actual comparisons because backtracking memory usage grows linearly with sequence length. The largest memory requirements are for the oligo hash-table and the Berkeley DB libraries which both have a random data access pattern: see tables 2 and 3 for actual figures. Cache sizing is configured on the command line. The minimum cache size is impossible to estimate without experimentation. A larger cache is needed for alignments of very repetitive/redundant sequence sets (more matches), slower disk access and for larger PVM configurations.

The distributed architecture of JESAM provides a flexible framework to allow extension to more computationally intensive sequence comparisons (for example base call error probabilities) and could also support other costly calculations like assembly contig building via directed graph explorations. The PVM slaves running sequence comparisons are typically CPU-limited so the performance of the PVM implementation on networks of cheap uniprocessor PCs is equivalent to that of more expensive multi-processor servers.

Updates do not require recomputation of known alignments because all sequence overlaps are automatically stored by the Berkeley DB routines. If a new alignment crosses a user-determined threshold score, then it is saved complete with the position of all indels, starts and ends. The alignments are offered by a CORBA interface provided by a C++ wrapper around the same ANSI C code used to create the alignments.

A Java client recursively steps through the matches to make superclusters which are stored in a serialized Java Hashtable object which loads and saves quickly. The Supercluster server took only one afternoon to write thanks to the alignment data CORBA interface, simple algorithm, and the innate productivity of Java. Both the alignment and supercluster data can be viewed interactively using web-accessible Java applets or larger datasets can be saved to client file systems for use in other file based scripts etc. The JESAM clusters have been integrated with other databases at the European Bioinformatics Institute, see URL http://corba.ebi.ac.uk/EST for browser demonstrations.

Figure 1

Diagram showing how PVM links are used to distribute data and calculations across a number of processes which may be running on different processors in different computers.
PVM comparison process farm

Figure 2

Diagrammatic representation of how an IDL file (centre) is used to create both the server skeleton code (C++), and the client stub code which in this case is Java. Other vendor's CORBA products could replace the examples shown.
CORBA alignment view

Results and Discussion

JESAM Performance

Performing EST comparisons using JESAM can be fast and inexpensive due to the underlying design: storing results reduces computational needs, and PVM can spread the incremental calculations across a network of cheap personal computers, or idle workstations. A total of greater than 120,000 ESTs were aligned and clustered in five hours using a total of 11 processors (UltraSPARC and Compaq Alpha at EBI, all UltraSPARC at Cereon). Scaling efficiency has been tested in detail (Tables 1,2 and 3) with up to 40 slaves on a small dataset (120,000 ESTs) but larger datasets allow greater scaling efficiency (PVM daemon, or network communication restrictions will appear with less than 100 slaves with many PVM configurations). The European Bioinformatics Institute can maintain fully updated alignment and cluster databases (with their Internet accessible interfaces) for four model organisms (mouse, rat, Aspergillus nidulans, and Zebrafish) using only six distributed processors (2 in server, 4 in separate slaves) for one evening every two weeks. These on-line databases are accessible via multiple applets to allow browsing of clustering statistics, pairwise alignments, cluster details, and downloads of FASTA format sequences for individual clusters.

Table 1

Number of Slaves	10	20	40
Wall Time (minutes)	328	171	107

Total time taken to align all 123459 Rat ESTs ( in EMBL and EMBLnew on 15th September 1999, totaling 48461025 bases) against each other and then store the alignments when using different numbers of slaves. The master processes were run on a two processor (400 MHz) UltraSPARC server and the slaves were run on separate (248 MHz) UltraSPARC servers all running Solaris 2.5 or 2.6.

Table 2

	Number of Slaves
Process	10	20	40
Master pvmd3	8360	29560	267280
jesam	26904	26784	25016
jesamscanner	191672	192656	193288
jesamdistributer	292912	309656	455432
Slave pvmd3	18224	23504	61144
jesamslave	20248	17328	21896

Peak Core Memory Usage (KBytes) for all processes creating the alignments for Table 1. All processes were sampled at 10-minute intervals. Figures shown are for the largest memory usage by any instance of each process at any sample time so actual memory usage is less than shown.

Table 3

	Number of Slaves
Process	10	20	40
Master pvmd3	9000	30200	267960
jesam	27304	27184	27232
jesamscanner	192048	193032	193744
jesamdistributer	565128	588208	757264
Slave pvmd3	18560	24000	61640
jesamslave	20712	17576	22120

Peak Virtual Memory Usage (KBytes) calculated as per Table 2

Clustering Comparisons

A full comparison of all the mentioned EST alignment and clustering programs on multiple representative datasets is a formidable task and certainly beyond the scope of this paper. Instead, samples of UniGene Mouse and Rat databases were studied by comparison to superclusters and alignments produced by JESAM. The UniGene databases are frequently referenced and important resources that have guided many large-scale experiments so an investigation into their contents has potentially significant implications. An initial investigation suggested it was impossible to use identical datasets for this comparison because the UniGene databases include sequences which have been intentionally deleted from the EMBL and GenBank nucleotide databases and were therefore not publicly retrievable. (A separate ftp site for this information (ftp.ncbi.nlm.nih.gov/repository/UniGene/ was, however, identified in review) Unfortunately, the non-identical input sequences led to frequent differences in the cluster sets and this complicated and limited the scale of investigations into the basis of the overall cluster discrepancies. It is worth noting that the more intensive EST/genomic comparisons performed by Wolfsberg and Landsman were limited (by the inherent complexity of the underlying data) to analyzing just 15 genes and their study concluded that this small sample gave indications of uncharacterized alleles, partial splicing, undocumented splicing, uncoupled clones, and possibly novel biological phenomena !

The UniGene investigation detailed in Fig. 3 and Fig. 4 was small but it sampled examples of many of the known or expected properties of the UniGene databases, and some of the advantages and disadvantages of JESAM's simplistic superclustering client. UniGene Rat excluded >20,000 sequences included in the corresponding JESAM alignments and clusters, and UniGene mouse similarly excluded >150,000 sequences. The two rat databases were generally more similar than the two mouse databases (data not shown). Perhaps most significantly, JESAM both clustered more sequences and created more clusters for both data sets despite the use of genomic clones in UniGene (excluded from JESAM) that helped clustering in two of the ten mouse clusters examined (but no rat cluster). However, one rat UniGene cluster (Rn. 34368) did probably correctly combine two large divergent cDNAs which may encode alternative transcripts that JESAM rejected as being clearly non-overlapping sequences. The worst feature of the current published JESAM superclusters was demonstrated with the 12 sequences in UniGene Mm. 2477 which were combined into one supercluster of size 116977 with many other sequences presumably joined on repetitive DNA. This supercluster would not form with a more selective scoring scheme but sensitivity would then be lost. Biasing towards sensitivity minimizes information loss yet still significantly reduces the problem scale for later cluster partitioning software components. With all the alignments available on-line, it should be relatively easy to develop and test some quite sophisticated partitioning or assembly algorithms so these superclusters should not be seen as final answers in contrast to the other EST cluster databases which cannot be programmatically interrogated.

Figure 3

Comparison between Rat UniGene and EBI Superclusters. First 10 "interesting" (defined by UniGene) clusters taken from largest (best sampled) Rat cDNA library: Lib.53, UI-R-C0 containing 7024 sequences. UniGene build 42, uploaded on Wed Feb 17 1999 (80755 sequences, 23003 clusters including 9520 orphans). EBI superclusters were made on 19th February 1999 (101383 sequences, 31863 clusters, including 17765 orphans). Comparison done on 22nd February 1999. NR = Not Retrievable from EMBL, GenBank, or dbEST (i.e. withdrawn entries, or not available at the time of the study - Nicole Redaschi, personal communication)

UniGene ClusterID	UniGene Cluster Size	Super Cluster Size	Missed ESTs	Best Missed	Explanations
Rn.12107	15	17	AI228202, AI137581, AA997212, -AA963421	AI228202 against AI231460, 124 base overlap with 3 mismatches	AI228202 was prepended, for unknown reasons, with 5 extra bases "actaa" on 9th January 1999. These 5 bases reduce the match quality. AA963421 was NR
Rn.32936	10	11	AA998090	AA998090 against X53725, 138 base overlap with two mismatches	AA998090 contains a low complexity region that may have been masked in UniGene
Rn.12821	7	4	-AI029388, -AI045806, -AI029890,	None	AI029890 was NR so other two form a separate cluster
Rn.11720	8	13	AA899962, AI236745, AI234665, AI170199, AA945230, AA943491, -AI029893	AI236745 against AA963979, 75 base overlap with one indel, and one mismatch	All found in separate cluster Rn.7578 with extra sequence AI030576 which was NR AI029893 was NR
Rn.13083	6	7	AB006882	AB006882 against AB020504, 510 base overlap with 4 mismatches	AB020504 has replaced AB006882
Rn.12740	10	12	AI407336, AI407876	AI407336 against AI138051, 384 base overlap with one mismatch	Both entered GenBank on 9th February, and EMBL on 11th February 1999 but not in dbEST version used (on 13 February according to release notes) for UniGene Rat.
Rn.11948	6	9	AI406632, AI408422, AI409932	AI406632 against AA997064, 454 base overlap with one indel and 5 mismatches	Too recent to appear in this UniGene release
Rn.9416	8	33	Rn.2375	AI029424 against U35775, 137 base overlap with one indel and one mismatch	AI029424 appears to join two alternative 3' ends ?
Rn.8514	6	7	AI407620	AI407620 against AI408791, 392 base overlap with one indel and 15 mismatches	Too recent to appear in this UniGene release
Rn.34368	4	2	-AI043650, -AA964642	AA998702 matches AI02983 correctly	Sequences share a pattern of local similarities inconsistent with contiguity. UniGene cluster may represent alternative splicing variants.

Figure 4

Comparison between Mouse UniGene and EBI Superclusters. First 10 "interesting" (defined by UniGene) clusters taken from largest (best sampled) Mouse cDNA library: Lib.30, Soares Mouse Embryo NbME13.5 containing 39996 sequences. UniGene build 46, uploaded on Mon 15th February 1999 (268247 sequences, 15275 clusters including 2761 orphans). EBI superclusters were made on 23rd February 1999 (424080 sequences, 65588 clusters including 42418 orphans). Comparison done on 24th February 1999. NR = Not Retrievable from EMBL, GenBank, or dbEST at the time of the study.

UniGene ClusterID	UniGene Cluster Size	EBI Cluster Size	Missed ESTs	Best Missed	Explanation
Mm.2496	25	30	AA048065, AA198448, AA230545, AA967785, W58964, W88178, W98102 -L27220, -AI425958	AA967785 against AA218283, 432 base overlap with 2 indels and 29 mismatches	All 7 missing ESTs were derived from poor quality traces and therefore excluded from UniGene. L27220 is genomic DNA. AI425958 was NR
Mm.11285	14	12	AA052635, -AI427383, -AI430589, -AI431070	AA052635 against AA058152, 140 base overlap with 1 indel and 5 mismatches.	AA052635 is only 171 bases long and contains a low complexity region. AI427383, AI430589, and AI431070 were NR
Mm.3501	11	11	AF013117, AA693148, -AI430391, -AI430109	AF013117 against D49544, 465 base overlap with 5 mismatches	AF013117 has been excluded for unknown reasons. AA693148 is annotated as low quality. AI430391 and AI430109 were NR
Mm.3277	12	6	-AA030514, -AA171281, -AI323427, -AI427499,	None	Z22923 is genomic DNA so AA030514, AA171281, AI323427, AI427499 were split off into separate clusters.
Mm.30219	9	12	AI317572, W78547, W66677, AA237557, AA510343, AI181413, -AI430530, -AA110526, -AI429692	AI317572 against AA049045, 314 base overlap with perfect match	AI317572, W78547, W66677, AA237557, AA510343, AI181413, were excluded for unknown reasons. AI430530 andAI429692 were NR. AA110526 is only 84 bases long
Mm.20964	11	13	AA450931, W64306, W64209, AA268149, AA170119, -M36778, -W43961	W64306 against W64299, 238 base overlap with 2 indels.	AA450931, W64306, W64209, AA268149 were excluded for unknown reasons. M36777 and M36778 are two alternatively spliced mRNAs that differ in over 1KB and are non overlapping.
Mm.3127	11	16	AA271310, AA028577, AA028535, AA062181, AI414081	AA271310 against W83829, 263 base overlap with 2 indels, and 8 mismatches	AA271310, AA028577, AA028535, and AA062181 were excluded for unknown reasons. AI414081 was entered on 11th February 1999 .
Mm.2477	12	116977	Not relevant	Not relevant	Joined on repetitive DNA
Mm.4758	9	59	-X86368, -AI430408, -AI429593	AA048965 against L38607, 272 base overlap with 20 mismatches	Possibly matching on low-copy repetitive DNA. X86368 is genomic DNA, AI430408 and AI429593 were NR.
Mm.30224	7	5	-AI430492, -AI430463,	None	AI430492 and AI430463 were NR

Figure 5 optional suggestions:

Chooser applet window

Rat Cluster Size Histogram

Extensions

Memory usage could be drastically reduced by switching to a partitioned sequence database: this would both reduce the number of sequences and oligos in memory at any one time and also reduce the subset of sorted alignment results that would need to be cached for random update access. With this enhancement, JESAM could scale easily to any N² sequence comparison but also shrink to fit any parallel architecture. If access to the oligo hash table became limiting with large numbers of processors, the single oligo hash table lookup function could be rewritten to use multi-threading but it would then have to run on a multi-processor computer. If greater speed were needed on very limited hardware, the slave sequence comparisons could be significantly optimized relative to the current generic implementation, and multiple oligo hits per pair of sequences (currently just a single 12-mer) could be required before launching a slave comparison (the existing code currently used to prevent multiple searches on the same diagonal could be easily extended for this purpose).

The mouse EST and mRNA clusters (downloaded directly from the CORBA servers) have been successfully assembled using CAP3 by Juha Muilu (personal communication) as a useful extension of the JESAM superclusters. CAP3 however ignores the pairwise alignment data so an opportunity still exists for a more integrated, higher performance assembly tool tailored to EST data. A suitable post-assembly goal would be something like the Pangea CRAW tools (Burke et al., 1998 and Chou and Burke, 1999) which demonstrate how complicated patterns of expression variations can be summarized and presented in a Java applet alongside supporting analyses.

Linking the EST clusters with other databases would make them more valuable. The EMBL outstation European Bioinformatics Institute (EBI) amongst others (including the OMG's Life Science Research (LSR) Domain Task Force (DTF) (URL: http://lsr.ebi.ac.uk/) are working towards standards for object interfaces and components for many kinds of biological data. Some of the EBI's effort's include physical and genetic maps (Jungfer and Rodriguez-Tomé, 1998), annotated DNA and protein sequences, and sequence traces (Parsons et al., 1999).

JAVA RMI could have offered a CORBA alternative but was not investigated due to the lack of relevant biological standards efforts, frameworks, language independence, services and local support. AcePerl (Stein and Mieg, 1998), is another potential alternative.

Conclusion

As explained in the introduction, no EST clustering database can be expected to provide a perfect view of an organism's genes but the imperfect result is still valuable. The human UniGene database has been a notably effective resource for many large and varied scientific studies however this small study shows the mouse and rat UniGene clusters have questionable features: the databases sometimes contain deleted sequences, and conversely may exclude apparently unique and high-quality sequences, and many with perfect matches to those included sequences. The simple JESAM superclusters have problems of their own and consequently complement rather than supersede the UniGene databases but this is a significant role. Furthermore, the public CORBA interface to model organism alignment data offers great potential for future enhancements through new public software components interacting over the Internet, or (since all the code is available and portable) combining public and private data behind corporate firewalls.

The use of CORBA interfaces in the JESAM package requires more software libraries (and greater expertise to install and maintain them) than would a normal file-based Input/Output system but they also offer greater flexibility. Furthermore, by maintaining all information accessible online, it is possible to give users an evidence trail that they can follow all the way from cluster, to alignment, to sequence and even back to the original sequence trace.

There are currently a few problems deploying CORBA based applications over the Internet. These problems include: firewalls blocking the IIOP protocol, the need to download ORB classes to clients due to the lack of a guaranteed local ORB, and a lack of support for multiple applet signing which would allow applets to follow object references to objects on computers other than the original applet's host. Some of the Java and CORBA problems should decrease once JAVA 2 from Javasoft (URL:http://www.javasoft.com/) becomes a standard component of every client workstation. Amongst the many improvements over Java 1.1 is a built-in CORBA ORB allowing distributed computing to extend out of the Web browser and directly into a user's general molecular biology application set.

References

Aaronson,J.S., Eckman,B, Blevins,R.A., Borkowski,J.A.,Myerson,J.,Imran,S. and Elliston,K.O. (1996) Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res. 6, 829-845.
Adams et al., (1995), Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence, Nature, 377: Supplement 3-174
Altschul, S.F., and Erickson, B.W., 1986, Optimal Sequence Alignments using Affine Gap Costs, Bulletin of Mathematical Biology, 48, 603-616.
Altschul S.F., Madden T.L. Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J., 1997, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 3389-3402.
Bailey, L.C, Searls, D.B, and Overton, G.C., 1998, Analysis of EST-Driven Gene Annotation in Human Genomic Sequence, Genome Res., 8, 362-376.
Battey, J., Jordan, E., Cox, D., and Dove, W., 1999, An action plan for mouse genomics, Nature Genetics, 21, 73-75
Banfi, S., Borsani, G., Rossi, E., Bernard, L., Guffanti, A., Rubboli, F., Marchitiello, A., Giglio, S., Coluccia, E., Zollo, M., Zuffardi, O., and Ballabio, A., 1996, Identification and mapping of human cDNAs homologous to Drosophila mutant genes through EST database searching. Nature Genetics, 13, 167-174
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., and Ouellette, B.F.F., 1998, GenBank, Nucleic Acids Res., 26: 1-7
Bishop, M., and Thompson, E., 1984, Fast Computer Search for Similar DNA Sequences, Nucleic Acids Res., 12, 5471-5474
Blake, J.A., Eppig, J.T., Richardson, J.E., Davisson, M.T. and the Mouse Genome Informatics Group, 1998, The Mouse Genome Database (MGD): a community resource, Status and enhancements., Nucleic Acids Res., 26, 130-137
Boguski, M., Lowe, T.M.J, and Tolstoshev, C.M., 1993, dbEST - database for "Expressed Sequence Tags", Nature Genetics, 4, 332
Boguski, M. and Schuler G., (1995). ESTablishing a human transcript map. Nature Genetics, 10, 369-371
Bonaldo, M.F, Lennon G., and Soares, M.B., (1996), Normalization and Subtraction: Two Approaches to facilitate Gene Discovery, Genome Res., 8, 791-806
Bonfield, J.K., Smith, K.F., and Staden R., 1995, A new DNA sequence assembly program. Nucleic Acids Res., 24: 4992-4999
Burke, J., Wang, H., Hide, W., and. Davison, D.B., (1998), Alternative Gene Form Discovery and Candidate Gene Selection from Gene Indexing Projects, Genome Res. 8, 276-290.
Davidson, D., Bard, J., Brune, R., Burger, A., Dubreuil, C., Hill, W., Kaufman, M., Quinn, J., and Baldock, R., 1997, The mouse atlas and graphical gene-expression database, Cell and Developmental Biology, 8, 509-517.
Chou, A. and Burke J., 1999, CRAWview: for viewing splicing variation, gene families, and
polymorphism in clusters of ESTs and full-length sequences. Bioinformatics, 15, 376-381
Eckman, BA, Aaronson, JS, Borkowski, JA, Bailey, WJ, Elliston, KO, Williamson, AR, Blevins,
RA, (1998), The Merck Gene Index browser: an extensible data integration system for gene finding,
gene characterization and EST data mining, Bioinformatics, 14: 2-13
Ewing, B. and Green, P., 1998, Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res., 8: 186-194
Gautheret, D., Poirot, O., Lopez, F., Audic, S., and Claverie, J., (1998), Alternate Polyadenylation in Human mRNAs: A Large-Scale Analysis by EST Clustering, Genome Res., 8: 524-530.
Gerhold, D. and Caskey, T., 1996, It's the genes! EST access to human genome content, BioEssays, 18, 973-981
Gill, RW, Hodgman, TC, Littler, CB, Oxer, MD, Montgomery, DS, Taylor, S, Sanseau, P (1997), A new dynamic tool to perform assembly of expressed sequence tags (ESTs) Comput. Applic. Biosciences 13: 453-457
Graul RC, Sadée W., (1997), Evolutionary relationships among proteins probed by an iterative neighborhood cluster analysis (INCA). Alignment of bacteriorhodopsins with the yeast sequence YRO2. Pharmaceutical Research, 14: 1533-1541
G. Grillo, M. Attimonelli, S. Liuni and G. Pesole, (1996), Cleanup: a fast computer program for removing redundancies from nucleotide sequence datatbases, Comput. Applic. Biosci., 12, 1-8
Hardison, R.C., Oeltjen, J., and Miller W., 1997, Long Human-Mouse Sequence Alignments Reveal Novel Regulatory Elements: A Reason to Sequence the Mouse Genome, Genome Res., 7, 959-966
Hillier L., Lennon G. Becker, M., Bonaldo, M.F., Chiapelli, B., Chissoe, S., Dietrich, N., DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman, M., Kucaba, T., Lacy,M., Le, M., Le, N., Mardis, E., Moore, B., Morris, M., Parsons, J., Prange, C., Rifkin, L., Rohlfing, T., Schellenberg, K., Bento Soares, B., Tan, F., Thierry-Mieg, J., Trevaskis, E., Underwood, K., Wohldman, P., Waterston, R., Wilson, R., Marra, M. (1996), Generation and Analysis of 280,000 Human Expressed Sequence Tags. Genome Res., 6, 807-828
Hide W, Burke J, Davison DB, (1994), Biological evaluation of d2, an algorithm for high-performance sequence comparison, J. Comput Biol 1: 199-215
Houlgatte, R., Mariage-Samson, R., Duprat, S., Tessier, A., Bentolila, S., Lamy, B., and Auffray, C., (1995), The Genexpress Index: A Resource for Gene Discovery and the Genic Map of the Human Genome, Genome Res., 5, 272-304
Hunkapiller, T., Kaiser, R.J., Koop, B.F., and Hood, L., 1991, Large-Scale and Automated DNA Sequence Determination, Science, 254, 59-67
Lijnzaad, P., Helgesen, C., and Rodriguez-Tomé, P., 1998, The Radiation Hybrid Database, Nucleic Acids Res., 26, 102-105
Hu,J., Mungall, C., Nicholson,D., and Archibald,A.L., 1998, Design and implementation of a CORBA-based genome mapping system prototype, Bioinformatics, 14, 112-120.
Huang, X., 1996, An Improved Sequence Assembly Program, Genomics, 33, 21-31.
Jungfer, K., and Rodriguez-Tomé, P., 1998, Mapplet: a CORBA-based genome map viewer, Bioinformatics, 14, 734-738
McCarthy, L.C., Terret, J., Davis, M.E., Knights, C.J., Smith, A.L., Critcher, R., Schmitt, K., Hudson, J., Spurr, N.K., and Goodfellow P.N., 1997, A first-Generation Whole Genome-Radiation Hybrid Map Spanning the Mouse Genome, Genome Res., 7, 1153-1161
Makalowski, W., Zhang, J., and Boguski, M.S., 1996, Comparative Analysis of 1196 Orthologous Mouse and Human Full-length mRNA and Protein Sequences, Genome Res., 6, 846-857
Jiang, J, and Jacob, H.J., 1998, EbEST: An Automated Tool Using Expressed Sequence Tags to Delineate Gene Structure, Genome Res., 8, 268-275
Küster, B. and Mann, M., 1998, Identifying proteins and post-translational modifications by mass sprectrometry, Current Opinion in Structural Biol., 8, 393-400
Lopez, A.J., 1998, Alternative Splicing of pre-mRNA: Developmental Consequences and Mechanisms of Regulation, Annu. Rev. Genetic., 32, 279-305
Marra, M., Hillier, L., Kucaba, T., Allen,M., Barstead, R., Beck, C., Blistain, A., Bonaldo, M., Bowers, Y., Bowles, L., Cardenas, M., Chamberlain, A., Chappell, J., Clifton, S., Favello, A., Geisel, S., Gibbons, M., Harvey, N., Hill, F., Jackson, Y., Kohn, S., Lennon, G., Mardis, E., Martin, J., Mila, L., McCann, R., Morales, R., Pape, D., Person, B., Prange C., Ritter., E., Soares, M., Schurk, R., Shin, T., Steptoe, M., Swaller, T., Theising, B., Underwood, K., Wylie, T., Yount, T., Wilson, R., Waterston, R., 1999, An encyclopedia of mouse genes, Nature Genetics, 21, 191-199
Miller, C., Gurd, J and Brass, A, 1999, A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases, Bioinformatics, 15, 111-121
Miller, M.J., and Powell, J.I., (1994), A Quantitative Comparison of DNA Sequence Assembly Programs, J. Comput. Biol., 4, 257-269
Myers, Eugene W., (1995) Toward Simplifying and Accurately Formulating Fragment Assembly, Journal of Computational Biology 2: 275-290.
Orfali,R., and Harkey,D. (1997) Client/Server Programming with JAVA and CORBA. John Wiley & Sons, New York, N.Y. 10158-0012.
Oliver, S.G., 1996, From DNA sequence to biological function, Nature, 379, 597-600
Orfali,R., and Harkey,D. (1998) Client/Server Programming with JAVA and CORBA. 2nd Edition, John Wiley & Sons, New York, N.Y. 10158-0012.
Parsons, J.D., (1995) Improved Tools for DNA Comparison and Clustering, Comput. Applic. Biosci., 11, 603-613
Parsons, J.D., Buehler, E., and Hillier, L., (1999) DNA Sequence Chromatogram Browsing Using JAVA and CORBA., Genome Res., 9, 277-281.
Parsons, J.D., Brenner, S. and Bishop, M.J., (1992), Clustering cDNA Sequences, Comput. Applic. Biosci., 8, 461-466
Pearson, W. R., Wood, T., Zhang, Z., and Miller, W. (1997) Comparison of DNA Sequences with Protein Sequences, Genomics 46, 24-36
Rounsley_SD, Glodek_A, Sutton_G, Adams_MD, Somerville_CR, Venter_JC, Kerlavage_AR, (1996), The construction of Arabidopsis expressed sequence tag assemblies - A new resource to facilitate gene identification, Plant Pysiology, 112, 1177-1183
Schuler et al. 1996, A Gene Map of the Human Genome, Science, 274, 540-546
Stein L.D. and Thierry-Mieg, 1998, Scriptable Access to the Caenorhabditis elegans Genome Sequence and Other ACEDB Databases, Genome Res., 8:1308-1315
Stoesser, G., Moseley, M.A., Sleep, J., McGowran, M., Gracia-Pastor, M., and Sterk, P. (1998) The EMBL Nucleotide Sequence Database, Nucleic Acids Res., 26, 8-15
Sutton G., White O., Adams M. and Kerlavage A. (1995). TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Technology, 1, 9-19.
Wolfsberg, T.G. and Landsman, D., 1997, A comparison of expressed sequence tags (ESTs) to human genomic sequences, Nucleic Acids Res., 25, 1626-1632
Yee, D.P., and Conklin, D., (1998), Automated clustering and assembly of large EST collections, proceedings ISMB-98

Acknowledgements

The authors are grateful to Tom Flores for helping to start the CORBA element of this project. The authors would also like to thank David Starks-Browning at the EBI, and the Systems staff at Cereon for their support. This work was funded by EU grant BIO 4 CT 960346