EST Exploitation

Jeremy D. Parsons (1997)

Introduction

Expressed Sequence Tags (ESTs) have grown in utility and importance in recent years as they have become more numerous and better connected to other biological data. ESTs have been mapped onto portions of the human and other genomes, leading to transcript maps (and on to gene maps), or simply assembled into sequence contigs with similarity to better-understood biological sequences. Many groups are doing the sequencing work, and others add value as the information flows into the public databases. Whilst the exploitation of ESTs has been improving, there are still many opportunities to push the public data further by increasing its intrinsic quality (e.g. via assembly), its connectedness (to other databases), and its presentation.

Planned Work

Quality

To improve the intrinsic quality of EST sequence information, I propose to run a new base-calling program, Phred, on all the publicly available EST trace data. Phred was written by Phil Green at Washington University and has been improved over the last four years to the point where it has now been shown to produce fewer errors than the original ABI basecalls, which until recently were almost universally accepted. Of greater significance, Phred can produce a base quality file that defines which basecalls can be trusted and which cannot. Where the original traces are not available, it should be possible to estimate basecall quality by counting 'N' calls and by comparing sample sequences to the EMBL mRNA databases to measure typical error rates and distributions.
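As a rough illustration of the fallback estimate when no trace is available, the following minimal sketch counts 'N' calls over a whole read and over fixed windows. The function names and the window size are assumptions made for this example, not part of any existing tool.

```python
def n_fraction(seq: str) -> float:
    """Fraction of basecalls reported as 'N' (ambiguous) in a read."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


def windowed_n_fraction(seq: str, window: int = 50) -> list:
    """N fraction in consecutive non-overlapping windows along the read.

    The window size is a hypothetical choice; the last window may be shorter.
    """
    return [n_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]
```

A windowed profile of this kind would show where along a read the ambiguous calls cluster, which a single per-read fraction hides.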

EST Overlap-Detection

ESTs, as samples from a larger cDNA, can come from many positions and in either 5' or 3' orientation. In addition, each gene may have many alternative transcripts, so it is not a trivial task to assign each EST to its progenitor gene even with perfect EST sequencing. As this is a difficult problem, it makes sense to use as much information as possible when trying to determine the relationship of EST to gene. The main sources of ancillary information are basecall quality measures or estimates, forward and reverse read linkage, clone sizing experiments, and library source data, which together help distinguish polymorphisms and sequencing errors from gene homologues.

I propose to use a new dynamic programming-based sequence alignment algorithm that can use much of the ancillary information directly in its calculations. Based on a global similarity comparison without end-gap penalties, and with penalties moderated by quality and other ancillary data, it should be possible to obtain sequence alignments and scores that reflect the true likelihood of any two ESTs being derived from the same original transcript. This should be better than the local similarity sequence alignment used in Phil Green's Phrap assembly program, which is seen as the current benchmark. The overlap detection process should not be confused with normal database similarity searching: there the focus is on sensitivity, whereas selectivity is more important when identifying and screening putative overlaps.
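The general idea of a no-end-gap global comparison with quality-moderated penalties can be sketched as follows. This is an illustration under stated assumptions, not the proposed algorithm: the scoring values, the linear scaling of penalties by basecall quality, and the name overlap_score are all hypothetical, and the sketch returns only a score (no traceback) for reads already in the same orientation.

```python
def overlap_score(a, b, qa, qb, match=1.0, mismatch=-2.0, gap=-3.0, qmax=40.0):
    """Best no-end-gap (overlap) alignment score between reads a and b.

    qa and qb hold one quality value per base. Mismatch and gap penalties
    are scaled down when the bases involved have low quality, so that
    likely sequencing errors cost less than confident differences.
    """
    n, m = len(a), len(b)
    prev = [0.0] * (m + 1)           # row 0 is zero: leading overhang is free
    best = 0.0
    for i in range(1, n + 1):
        cur = [0.0] * (m + 1)        # column 0 is zero: leading overhang is free
        wa = min(qa[i - 1], qmax) / qmax
        for j in range(1, m + 1):
            wb = min(qb[j - 1], qmax) / qmax
            if a[i - 1] == b[j - 1]:
                diag = prev[j - 1] + match
            else:
                diag = prev[j - 1] + mismatch * min(wa, wb)
            up = prev[j] + gap * wa          # a[i-1] aligned to a gap
            left = cur[j - 1] + gap * wb     # b[j-1] aligned to a gap
            cur[j] = max(diag, up, left)
        best = max(best, cur[m])     # alignment may end anywhere in the last column
        prev = cur
    return max(best, max(prev))      # ... or anywhere in the last row
```

For example, with uniform quality 40, overlap_score("AAAACGTACGT", "CGTACGTTTTT", [40] * 11, [40] * 11) scores the seven-base suffix/prefix overlap as 7.0, and lowering the quality of a mismatching base raises the score of an otherwise identical overlap.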

Dynamic programming is slow and the potential number of sequence overlaps is huge (of order N*N), so the comparisons will be pre-filtered by a partial indexing scheme similar to that used in Acembly (and in the creation of UniGene). The detailed alignments will be distributed using PVM, so CPU power should not be limiting.
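One simple way to pre-filter candidate pairs is a shared k-mer index, sketched below. This is not the Acembly or UniGene scheme, whose details are not given here; it indexes every k-mer rather than a partial subset, ignores reverse complements, and the values of k and min_shared are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(seqs, k=12, min_shared=2):
    """Return pairs of read IDs that share at least min_shared distinct k-mers.

    seqs maps a read ID to its sequence. Only the returned pairs would be
    passed on to the slower dynamic programming comparison.
    """
    index = defaultdict(set)                 # k-mer -> IDs of reads containing it
    for sid, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(sid)

    shared = defaultdict(int)                # (id1, id2) -> number of shared k-mers
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```

A production index would need to cap or skip very common k-mers (repeats, poly-A tails), since each one contributes a quadratic number of pairs.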

To test the sequence alignment process, the initial focus is just the detection of EST to EST overlaps, but if the tests show that this is producing high-quality alignments (relative to EMBL mRNA sequences), the overlaps idea could be extended to other EMBL DNA sequences. Prescreening of ESTs similar to known mRNAs is more likely to be useful and is a simpler means of data integration.

Overlap Database

The ultimate goal of this work is to connect as many different biological data sources as possible, as fully as possible, through ESTs, from maps to functions and whole genome views or specialized views. At the moment, however, much of the biological information and many of the database links are not present. One step above the raw, submitted data view is a database of sequence overlaps. If this database is up to date, accurate and easily accessible, it may provide a new base for contig assembly programs or any other application where sequence overlap, not homology, is of interest. One example of such a program is the Glaxo WWW interface to BLASTN, which allows EST assembly at remote sites that have no database searching or assembly facilities of their own. Most of the problems with the Glaxo approach could be fixed by a pre-computed database of validated overlaps, some putative assemblies and some local client interactivity. To accommodate a variety of client interactions with the database, it should be built to allow CORBA access. Storing all the calculated overlaps allows the database to be kept up to date without repeated computation, so the database would exist in some form anyway, and any extra burden for external access should be minimal.

Of direct interest to researchers may be a classification of ESTs to reveal those that have no known overlaps on either of their ends. These reads should directly lead on to novel sequence information if they are re-sequenced from the publicly available clones or via some form of PCR.
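Given a table of stored overlaps, the classification described above reduces to a set lookup. The record layout below is purely hypothetical (the proposal does not define a schema); "left" and "right" refer to the two ends of the read as stored, not to 5' or 3' of the clone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    """One validated overlap; the fields are assumptions for this sketch."""
    read_a: str
    end_a: str      # "left" or "right": which end of read_a lies in the overlap
    read_b: str
    end_b: str
    score: float


def free_ends(reads, overlaps):
    """Map each read ID to the list of its ends with no known overlap."""
    covered = set()
    for ov in overlaps:
        covered.add((ov.read_a, ov.end_a))
        covered.add((ov.read_b, ov.end_b))
    return {r: [e for e in ("left", "right") if (r, e) not in covered]
            for r in reads}


def isolated_reads(reads, overlaps):
    """Reads with no known overlap on either end."""
    return sorted(r for r, ends in free_ends(reads, overlaps).items()
                  if len(ends) == 2)
```

Reads with a single free end mark the current edge of a contig, whereas reads returned by isolated_reads have no partners at all.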

Contig and Consensus Databases

It can be relatively simple to determine that two DNA sequences share enough similarity to suggest that they could be derived from the same original mRNA, but it is a much more difficult problem to determine whether they are genuinely cognate. To work towards that goal, various additional constraints can be applied to the overlap information to determine which of the possible EST assemblies is the most likely. Ultimately, biological experiments may be necessary to confirm any contig constructions, but suggesting best guesses and ranges of options could still be extremely valuable. Any whole-database assembly can enforce the constraint that each EST may appear in only one contig. Adding restriction digest and cDNA sizing information, and protein coding region predictions or constraints, can further guide the assembly towards the correct answer. These guiding data can be combined through the use of a directed graph and some kind of weighted simulated annealing algorithm.
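To make the combination of weighted evidence concrete, here is a toy simulated annealing sketch. It is an assumption-laden illustration, not the proposed method: evidence is reduced to one signed weight per EST pair (positive for support, negative for conflict), edge direction is ignored, and the cooling schedule and parameter values are arbitrary. Each EST carries exactly one contig label, which enforces the one-contig constraint by construction.

```python
import math
import random


def anneal_contigs(ests, weights, steps=5000, t_start=5.0, t_end=0.05, seed=0):
    """Assign each EST to one contig, maximising the summed within-contig weight.

    ests is a list of IDs; weights maps (id1, id2) to a signed evidence weight.
    Returns the best labelling seen and its score.
    """
    rng = random.Random(seed)
    labels = {e: i for i, e in enumerate(ests)}      # start: every EST alone

    def score(lab):
        return sum(w for (x, y), w in weights.items() if lab[x] == lab[y])

    cur = score(labels)
    best, best_score = dict(labels), cur
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        e = rng.choice(ests)
        old = labels[e]
        labels[e] = rng.randrange(len(ests))         # propose moving one EST
        new = score(labels)
        if new >= cur or rng.random() < math.exp((new - cur) / t):
            cur = new                                # accept the move
            if cur > best_score:
                best, best_score = dict(labels), cur
        else:
            labels[e] = old                          # reject and restore
    return best, best_score
```

Recomputing the full score at every step keeps the sketch short; a realistic implementation would update the score incrementally from the edges touching the moved EST.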

Once a set of putative assemblies has been created, they can be given to the user community for annotation and improvement, or used directly to generate consensus sequences that are then translated to generate an adjunct to the TREMBL database. This will require new applets or applications. Links to Washington University's trace server, combined with the applets and the overlap database, will provide biologists at remote sites with functionality equivalent to having their own bioinformatics team on site. By allowing users to review all the decisions made by the automatic assembly in a contig viewer and a pairwise alignment editor, their confidence can be increased, or errors more quickly discovered and corrected.
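The consensus-then-translate step can be sketched in a few lines. This is a simplified illustration: it assumes the reads are already padded with '-' to a common length, uses an unweighted majority vote (a real consensus would weight by basecall quality), and translates a single forward frame with the standard genetic code. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def consensus(rows):
    """Majority-vote consensus of equal-length reads padded with '-'.

    Columns with no A/C/G/T call are skipped; tied columns become 'N'.
    """
    out = []
    for col in zip(*rows):
        counts = Counter(base for base in col if base in "ACGT")
        if not counts:
            continue
        top = counts.most_common(2)
        if len(top) == 2 and top[0][1] == top[1][1]:
            out.append("N")
        else:
            out.append(top[0][0])
    return "".join(out)


def translate(dna, frame=0):
    """Translate one forward reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))
```

Because the reading frame and strand of an EST assembly are generally unknown, a fuller version would translate all six frames and keep the most plausible open reading frame.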

If an assembly joins both 5' and 3' reads of a long enough original clone, it should connect the 3' untranslated region, which is often used as the basis of STSs, with the upstream sequence, which is more likely to be coding and therefore more likely to have been ascribed a putative function. Thus, the better the assemblies created, the better the linkage between functions, genes and map locations. For those contigs that are shorter than their putative cDNA, it may still be possible to use 3' clustering to confirm the relationship between the sequence fragments and create gapped assemblies where some of the target cDNA is missing.

Any assemblies that reveal alternative contig constructions or high quality sequence mismatches would be ideal starting points for researchers interested in human polymorphism detection.
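A first pass at finding such high-quality mismatches could look like the sketch below. It assumes two reads already aligned column by column without gaps, with one quality value per column; the threshold of 30 and the function name are assumptions for illustration.

```python
def high_quality_mismatches(a, b, qa, qb, min_q=30):
    """Positions where two aligned reads disagree and both calls are confident.

    a and b are equal-length, ungapped aligned segments; qa and qb give the
    basecall quality at each position.
    """
    return [i for i, (x, y) in enumerate(zip(a, b))
            if x != y and qa[i] >= min_q and qb[i] >= min_q]
```

Mismatches where either call has low quality are more likely sequencing errors and are left out of the candidate list.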

EST Data Links

In 1995 Francis Collins predicted (in Nature) that the positional candidate approach would become the "predominant method for cloning human diseases". Since the desire to understand and ultimately cure human diseases is driving most of the current wave of genome science, it seems appropriate to guide the views of the available human genetic information towards greater support of the Positional Candidate (PC) approach. This requires genes to be both mapped, to localize them approximately in the genome, and sequenced sufficiently to suggest their cellular function or at least to guide experiments towards it. It has been estimated (Schuler et al, Science, 1996) that the majority of the estimated 100,000 human genes have already been sampled and made available in the public EST databases, and over 16,000 EST-derived, gene-based STSs have already been mapped, so ESTs currently provide the best link between gene function and location. This link can be improved by creating better EST assemblies that connect unique untranslated portions of cDNAs to their coding regions, which are more likely to have similarity to genes of known function in the same or other organisms.

Physical maps can be prepared from end-sequencing, fingerprinting, cross hybridization, STS amplimer detection, and radiation hybrid mapping. An immediate goal is a gene map that integrates the sequence map and the physical maps into a single consistent view. Through the use of CORBA it should be possible to standardise access to all the component data sources and thereby ease the data integration task. Putative translations of EST assemblies can also be linked to the protein databases by amino acid sequence similarity.

EST Data Presentation

Using interactive Java clients to view complex biological data has become the norm, with viewers for GDB, microbial genomes (JMGD), physical maps (e.g. X), and transcript maps (e.g. human Chr.21). As many of these tools share similar functions, a BioWidgets consortium has been established to maximise the code reuse, quality and availability of the Java clients. I expect most of the required viewers will be written by others as Java Beans, minimising the actual work needed.

The three basic tools needed are a contig assembly editor, a trace viewer and a pairwise alignment viewer. A more sophisticated map browsing tool will follow, to summarise and display all the information combined in an integrated gene map. One desirable feature not currently available anywhere would be a multiple-genome gene map. This could provide an ideal mechanism for browsing areas of synteny between the human and mouse genomes, both of which have large stores of EST, mapping and disease genetics information that are currently kept and displayed separately.
