Alternative Gene Form Discovery and Candidate Gene Selection from Gene Indexing Projects

1998 
Several efforts are under way to partition single-read expressed sequence tag (EST) data, as well as full-length transcript data, into large-scale gene indices, where transcripts are in common index classes if and only if they share a common progenitor gene. Accurate gene indexing facilitates gene expression studies, as well as inexpensive and early gene sequence discovery through assembly of ESTs that are derived from genes that have not been sequenced by classical methods. We extend, correct, and enhance the information obtained from index groups by splitting index classes into subclasses based on sequence dissimilarity (diversity). Two applications of this are highlighted in this report. First, it is shown that our method can ameliorate the damage that artifacts, such as chimerism, inflict on index integrity. Additionally, we demonstrate how the organization imposed by an effective subpartition can greatly increase the sensitivity of gene expression studies by accounting for the existence and tissue- or pathology-specific regulation of novel gene isoforms and polymorphisms. We apply our subpartitioning treatment to the UniGene gene indexing project to measure a marked increase in information quality and abundance (in terms of assembly length and insertion/deletion error) after treatment and demonstrate cases where new levels of information concerning differential expression of alternate gene forms, such as regulated alternative splicing, are discovered.
[Tables 2 and 3 can be viewed in their entirety as Online Supplements at http://www.genome.org.]
Table 2. Putative Cases of Gene Overlap in the 3′ UTR (Truncated)
Table 3. Some Disease-State/Tissue-Specific Gene Forms Isolated through CRAW Analysis
The exploitation of single-read sequencing from the ends of sufficiently expressed mRNAs (popularly referred to as expressed sequence tags or ESTs; Adams et al. 1991; Okubo et al. 1991; Wilcox et al. 1991) has brought to light the existence of many genes well before the projected completion of the human genome project in the year 2005 and before the completion of sequencing efforts in other organisms (Adams et al. 1992; Matsubara and Okubo 1993; Venter 1993). Additionally, EST data have facilitated large-scale expression studies (Okubo et al. 1992, 1994; Adams et al. 1995). EST sequencing has enabled the construction of a physical map of the human genome (Hudson et al. 1995), as well as a gene map that localizes many genes with respect to the markers of the physical map (Schuler et al. 1996). The utility of EST data has also been increased greatly by the establishment of centralized databases (Boguski et al. 1993; Benson et al. 1994).
Because they are primed to hybridize to the poly(A) tail of mRNAs, 3′ ESTs usually capture regions of the mRNA untranslated region (UTR) that have been thought to contain less conservation than the coding regions. The goal has been that genes could then be reliably indexed using the 3′ UTR/EST as a gene fingerprint; however, the vast quantity of EST data and its fragmented nature pose an obstacle to harvesting the full potential from this data source. Hence, several projects are in progress to construct information frameworks, called gene indices, where the EST data and the known gene sequence data can be consolidated and placed in a correct pathologic and mapping context. A few of the more widely known efforts in this area are UniGene (Boguski and Schuler 1995; Schuler et al. 1996) from NCBI; the TIGR Human Gene Index (HGI) from the Institute for Genomic Research (http://www.tigr.org/tdb/hgi/hgi.html); the Merck–Washington University Gene Index (Williamson et al. 1995; Eckman et al. 1998; http://www.merck.com/mrl/merck_gene_index.2.html); and the GenExpress project (Houlgatte et al. 1995).
Algorithmically, these projects all comprise some form of cluster analysis where the sequence similarity of ESTs is used to place or link the sequences into index classes. Below, in our discussion on the creation of index classes and the partitioning of class members into subclasses, we use the terms group, class, and cluster interchangeably. The structures of current gene indexing projects follow one of two patterns. Strict gene indices, of which the primary example is the TIGR HGI, are generally constructed using sequence assemblers (Sutton et al. 1995). These assemblers apply stringent matching criteria when joining sequences into common classes, and hence they effectively prevent chimerism and contamination from tainting most index groups. On the other hand, this strictness results in a more fragmented representation of the data that often prevents divergent ESTs that sample alternative forms of the same gene from being folded into the same index class (http://www.tigr.org/hgi/hgi_info.html). In HGI these are linked as being splice variants only in those cases where the ESTs match fully sequenced genes with known isoforms in a full-length gene sequence database, the Expressed Gene Anatomy Database (EGAD; White and Kerlavage 1996). In loose gene indexing projects (of which UniGene, Merck Gene Index, and GenExpress are examples) sequences are grouped into common classes if they share overlap above a certain threshold. Sequence similarity searching programs such as BLAST (Altschul et al. 1990), FASTA (Pearson 1990), or variants of the Smith–Waterman algorithm (Schuler et al. 1996) are used to find and quantify sequence overlap. The benefits and drawbacks of loose methods complement those of the strict methods: A single index class can contain multiple splice forms of the same gene, but chimeras and other artifacts may cause sequences from different genes to be in the same class (Houlgatte et al. 1995).
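The loose indexing idea described above can be illustrated with a minimal Python sketch. It is not the procedure used by UniGene or any other project named here: the longest exact common substring stands in for a BLAST, FASTA, or Smith–Waterman overlap score, and the names `shared_overlap`, `loose_index`, and `min_overlap`, as well as the toy sequences, are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def shared_overlap(a, b):
    """Length of the longest exact common substring of a and b.

    Toy stand-in (an assumption of this sketch) for a BLAST, FASTA, or
    Smith-Waterman overlap score.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def loose_index(seqs, min_overlap):
    """Group sequence names into classes by single linkage on shared overlap."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if shared_overlap(seqs[a], seqs[b]) >= min_overlap:
            parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for name in seqs:
        classes.setdefault(find(name), set()).add(name)
    return list(classes.values())


# Hypothetical toy data: two unrelated "genes" and overlapping ESTs from each.
gene_a = "ACGTTGCAAGCTTAGGCTAACCGT"
gene_b = "TTTGGGCCCAAATTTCCCGGGAAA"
ests = {
    "a1": gene_a[:16], "a2": gene_a[8:],
    "b1": gene_b[:16], "b2": gene_b[8:],
}
print(loose_index(ests, min_overlap=8))

# A chimeric clone that fuses part of gene A to part of gene B.
ests["chimera"] = gene_a[12:] + gene_b[:12]
print(loose_index(ests, min_overlap=8))
```

In this toy data the four ESTs form two classes, one per gene. Adding the single chimeric sequence links them into one class, which is the drawback of loose methods noted above.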
In addition to these gene indexing projects, other tools have been developed that cluster DNA sequence or remove redundancies from sequence sets (Parsons 1995; Grillo et al. 1996). Some of us (J. Burke and W. Hide) are involved in the development of STACKdb, a hybrid approach to gene index construction (see Discussion). Significant research has also been put into the grouping of protein sequence where domain structure complicates the analysis (Sonnhammer and Kahn 1994; Worley et al. 1995; Adams et al. 1996; Sonnhammer et al. 1997). Several studies have been performed on small data sets of ESTs where corresponding full-length sequence was available (multipass or fully sequenced transcripts, positionally cloned genes, and full-length genomic sequence). These studies noted the presence of chimerism, clone reversal, internal priming, introns, and alternative splicing within groups of transcripts. Error rates were estimated for lane tracking, chimerism, clone reversal, internal priming, insert size annotation, and other features (Aaronson et al. 1996; Hillier et al. 1996; Wolfsberg and Landsman 1997). In contrast, our analysis does not assume the availability of full-length sequence. We leverage the fact that ESTs that contain artifacts, or that sample polymorphic loci or gene isoforms, often introduce sequence that is unalignable (inconsistent) with the rest of an index class. Instead of relying on sequence similarity to known genes for feature detection, these inconsistencies can be used to partition the index class members such that inconsistent transcripts are in different subclasses. Damage is contained when transcripts that are improperly joined due to the presence of artifact are segregated into disparate subclasses.
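A minimal Python sketch of the subpartitioning idea follows. It is not the authors' algorithm: it assumes each class member has already been placed on a shared coordinate frame, treats a high mismatch fraction in the overlapping window as "inconsistent", and assigns members greedily, so the result can depend on input order. The names `mismatch_fraction`, `subpartition`, and `max_mismatch`, and the toy sequences, are hypothetical.

```python
def mismatch_fraction(a, b):
    """Fraction of differing bases where two placed sequences overlap.

    a and b are (start, sequence) placements on a shared coordinate frame
    (an assumption of this sketch). Returns None if they do not overlap.
    """
    (start_a, seq_a), (start_b, seq_b) = a, b
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    if hi <= lo:
        return None
    diffs = sum(seq_a[i - start_a] != seq_b[i - start_b] for i in range(lo, hi))
    return diffs / (hi - lo)


def subpartition(placed, max_mismatch=0.1):
    """Greedily split one index class so inconsistent members are separated."""

    def consistent(x, y):
        frac = mismatch_fraction(placed[x], placed[y])
        return frac is None or frac <= max_mismatch

    subclasses = []
    for name in placed:
        for sub in subclasses:
            # Join the first subclass whose members are all consistent with us.
            if all(consistent(name, member) for member in sub):
                sub.append(name)
                break
        else:
            subclasses.append([name])  # no compatible subclass: start a new one
    return subclasses


# Hypothetical index class: est3 and est4 diverge from est1 and est2 after
# the first four bases, as an alternative gene form or a chimera might.
index_class = {
    "est1": (0, "ACGTACGTAC"),
    "est2": (4, "ACGTACGTACGT"),
    "est3": (0, "ACGTTTTTTT"),
    "est4": (2, "GTTTTTTTAA"),
}
print(subpartition(index_class))
```

With these toy placements the class splits into two subclasses, `est1` with `est2` and `est3` with `est4`, without reference to any full-length sequence.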
When cDNA library information is associated with subclass membership, the subclass structure becomes a powerful method for candidate gene selection because the library composition of a subclass is often tissue-, developmental state-, or disease-specific, even when the composition of the greater index class is diverse.
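As a rough illustration of how library composition could flag candidate subclasses, the following Python sketch tallies the source library of each subclass member and reports subclasses dominated by one library. The function names, the thresholds `min_fraction` and `min_size`, and the library labels are assumptions for illustration and are not taken from the analysis reported here.

```python
from collections import Counter


def library_composition(subclasses, est_library):
    """Count, for each subclass, how many member ESTs come from each library."""
    return {
        sub_id: Counter(est_library[est] for est in members)
        for sub_id, members in subclasses.items()
    }


def specific_subclasses(composition, min_fraction=0.8, min_size=3):
    """Return subclasses in which one library supplies most of the members."""
    hits = {}
    for sub_id, counts in composition.items():
        total = sum(counts.values())
        if total < min_size:
            continue  # too few ESTs to call the subclass library-specific
        library, n = counts.most_common(1)[0]
        if n / total >= min_fraction:
            hits[sub_id] = library
    return hits


# Hypothetical annotations: EST name -> cDNA library of origin.
est_library = {
    "e1": "colon tumor", "e2": "colon tumor", "e3": "colon tumor",
    "e4": "fetal brain", "e5": "adult liver", "e6": "placenta",
}
# Hypothetical subclasses of a single index class.
subclasses = {"form1": ["e4", "e5", "e6"], "form2": ["e1", "e2", "e3"]}

composition = library_composition(subclasses, est_library)
print(specific_subclasses(composition))
```

Here the index class as a whole draws on four libraries, yet `form2` is composed entirely of ESTs from one hypothetical tumor library and would be reported as a candidate.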