Diversification of transcriptional modulation: Large-scale identification and characterization of putative alternative promoters of human genes

2005 
One of the most striking findings of the Human Genome Project is that the human genome contains only 20,000-25,000 protein-coding genes (International Human Genome Sequencing Consortium 2004). This number is unexpectedly small compared with the total gene numbers in the yeast, fly, and worm genomes, which are estimated to be 6,000, 14,000, and 19,000, respectively (Goffeau et al. 1996; C. elegans Sequencing Consortium 1998; Adams et al. 2000). Factors other than mere gene number must therefore account for the ability of the human genome to build such highly elaborate systems as the brain and the immune system. To explain this, it has been hypothesized that multifaceted use of genes plays a pivotal role in the functional diversification of human genes without affecting the total gene number (Ewing and Green 2000). Multifaceted use of genes could be enabled by the production of slightly different transcripts from a single gene locus, each finely tuned for a specific purpose; by the use of essentially the same transcript in different circumstances; or by a combination of these mechanisms. As for the first possibility, recent reports showed that alternative splicing (AS) is employed in about half of all human genes, producing more than three different transcript variants per locus on average (Lander et al. 2001). The various transcripts produced by AS are translated into proteins with slightly different structures and functions, and AS is thus thought to provide a molecular mechanism for fine-tuning the gene functions of a single locus (Lopez 1998; Black 2000). As for the second possibility, the use of alternative promoters (APs) has been proposed. APs consist of different modules of transcriptional regulatory elements, so their use should enable diversified transcriptional regulation at a single locus (Landry et al. 2003). 
Combined use of these two mechanisms (AS and APs) would further increase the potential complexity of the products expressed from a single gene; for example, multiple separate promoters might independently direct transcription from different genomic positions, and the resulting variation in the first exons might lead to the production of proteins that differ at their N termini. Indeed, for some human genes of particular interest, in vitro and in vivo experiments have verified that such complex diversification takes place within a cell. For example, the SHC1 gene has two APs and produces three different transcripts encoding protein isoforms of 46, 52, and 66 kD (Luzi et al. 2000). The transcript encoding p46/p52 is transcribed from the proximal promoter with a ubiquitous expression pattern. On the other hand, the transcript encoding p66, whose biological functions are completely different from those of p46/p52 because of the presence of one additional collagen homology domain at its N terminus, is driven by a distal promoter and is expressed only in limited types of cells. The promoters of these two isoforms are approximately 4 kb apart, and their repertoires of predicted potential cis-acting regulatory elements are completely different. In addition, recent studies demonstrated that the histone acetylation and cytosine methylation statuses differ significantly between the two APs (Ventura et al. 2002). Although our understanding of the comprehensive features of AS has been advancing rapidly with the compilation of EST data (Lee et al. 2003; http://www.bioinformatics.ucla.edu/ASAP/), very little is understood so far about the genome-wide features of APs. 
Indeed, in spite of increasing general interest and need, almost no reports or databases have provided a genome-wide view of which population of human genes is regulated by APs and what biological consequences such diversification of transcriptional modulation would bring about. In our opinion, this is because of a general lack of information about the transcriptional start sites (TSSs) and the adjacent putative promoter regions (PPRs). Systematic identification of APs requires highly redundant sequence data. However, the coverage of EST data at the 5′-ends has generally been low, since conventional cDNAs are constructed using the 3′-end poly(A) without any selection method for the opposite end, the 5′-end cap structure (Suzuki et al. 1997). Moreover, even when available cDNA sequences cover the corresponding regions, it cannot be assumed without in-depth analysis that their 5′-ends correspond to the real TSSs. For these reasons, the massively accumulated current EST data could not be used directly for the identification and analysis of APs. Although a pioneering study was done by Zavolan et al. (2003) making use of full-length cDNAs, their data were limited to about 60,000 mouse cDNAs. We have been collecting full-length cDNAs of human genes by constructing cDNA libraries using the oligo-capping method (Suzuki and Sugano 2003; Ota et al. 2004). Using the full-length cDNA sequence data, we have also demonstrated that the exact positions of the TSSs and the adjacent putative promoters can be retrieved from the human genomic sequences in a high-throughput manner (Suzuki et al. 2001b). The human full-length cDNA data we have accumulated so far and the retrieved adjacent upstream PPR information are integrated in our database, DBTSS (Suzuki et al. 2004; http://dbtss.hgc.jp), and are now publicly available. 
In this study, we further expand our 5′-end sequence data to 1.8 million cDNAs (all of the sequence data have been registered in DDBJ and the physical cDNA clones are available on request). These data increase the entire dbEST collection (6 million entries as of March 2005) by about 30% and extensively complement the 5′-ends in the current EST collection. Using our new 5′-end oligo-cap cDNA data, we were able to obtain, for the first time, an overview of how the TSSs are clustered and in what manner the adjacent putative promoters are used in multifaceted ways in humans. Here we report our first genome-wide analysis of the alternative use of putative promoters using our unprecedented collection of 5′-end cDNA data.