SPATA: An Accurate GUI Tool for De Novo Transcriptome Assembly

2010 
Human transcriptomes are highly diverse, overlapping, complex, and dynamic. Alternative splicing and structural variations play pivotal roles in increasing transcriptome complexity ([1–7]). The first level of complexity is introduced by splicing and alternative splicing of annotated exons, including untranslated regions (UTRs) ([1–3, 7]). The second level of complexity is a more general problem, introduced by ubiquitous structural variations, such as deletions, insertions, and translocations, in human genomes and transcriptomes ([4–6]). Both mapping and assembling reads from altered genomic regions pose enormous challenges. We design a novel divide-and-conquer strategy and develop a new de novo assembly algorithm to tackle both levels of complexity in transcriptome reconstruction. Our algorithm reconstructs transcript sequences directly from short reads and exhaustively finds the sequences of all expressed mRNA transcripts. The only input required from users is the raw read sequence file in standard FASTQ or FASTA format. The tool exports the sequence of each expressed mRNA transcript and reports tissue-specific splicing and structural variation events. Availability: http://asammate.sourceforge.net/.
Short reads localization
An innovation of our approach is to first localize reads, with or without structural variations, to each annotated gene or un-annotated transcriptional unit. After the localization, we reconstruct transcript structures de novo within each annotated gene or un-annotated transcriptional unit. Our approach proceeds as follows:
Step 1: We first align the entire set of reads to the reference genome using Bowtie ([8]). For single-end reads, this aligns most of the human exonic reads, e.g. 60%–70%, to the reference genome. For paired-end reads, we localize a pair if at least one of its reads is aligned to the reference genome, including pairs with structural variations.
Step 2: For the reads left unaligned after Step 1, if they are single-end, we split each read into three equal partitions and localize the split reads to annotated genes or un-annotated transcriptional units using anchors. An anchor is any read partition that aligns completely to the reference genome. For paired-end reads, we split each read into two equal partitions and localize them using anchors.
De novo transcriptome assembly
The central idea of our algorithm is to use overlaps between pairs of reads. The algorithm proceeds in two consecutive stages, described below.
The seeding and growing stage: The algorithm uses the first read in the set as a seed and extends it to the left and to the right until no further extension reads can be found. As the seed grows, a merged sequence is generated and all short reads covered by the merged sequence are removed from the input set. If uncovered short reads remain after the first round of growth, another seed (the first read in the set) is used to grow a second sequence. This process is repeated until all reads are covered by some sequence and the input set is therefore empty.
The patching and cutting stage: When applicable, every sequence that passes a minimum overlap cut-off is patched to another sequence to form the whole set of transcripts. After the sequences have been patched to each other, they are cut into small segments. Redundant segments are removed, and the remaining segments are organized into an isoform graph to reconstruct isoform structures.
Performance evaluation
We compare the performance of SPATA with Trans-ABySS (TA) ([9, 10]) using simulated short reads generated from known transcript structures in Ensembl.
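As an illustration of the seeding and growing stage described above, the following Python sketch greedily extends a seed by exact suffix/prefix overlaps. It is a minimal sketch under stated assumptions, not SPATA's implementation: the function names, the `min_overlap` parameter and its default, and the use of exact string matching (no sequencing errors, no reverse complements) are all assumptions made for the example.

```python
from typing import List


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest such overlap is shorter than `min_overlap`.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def grow(seed: str, pool: List[str], min_overlap: int) -> str:
    """Extend `seed` to the right and left using reads from `pool`.

    Reads covered by the growing sequence are removed from `pool` in place.
    """
    contig = seed
    extended = True
    while extended:
        extended = False
        # Remove reads already covered by the merged sequence.
        pool[:] = [r for r in pool if r not in contig]
        for read in pool:
            k = suffix_prefix_overlap(contig, read, min_overlap)
            if k:  # read extends the sequence to the right
                contig = contig + read[k:]
                extended = True
                break
            k = suffix_prefix_overlap(read, contig, min_overlap)
            if k:  # read extends the sequence to the left
                contig = read[:len(read) - k] + contig
                extended = True
                break
    pool[:] = [r for r in pool if r not in contig]
    return contig


def seed_and_grow(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly take the first remaining read as a seed until no reads remain."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    pool = list(reads)
    contigs = []
    while pool:
        seed = pool.pop(0)
        contigs.append(grow(seed, pool, min_overlap))
    return contigs


if __name__ == "__main__":
    toy_reads = ["TTGCAA", "ACGTTG", "CAAGCT", "GGGGGG"]
    print(seed_and_grow(toy_reads, min_overlap=3))
```

On the toy reads, the first seed grows in both directions into one merged sequence, and the unrelated read becomes a second seed of its own. A practical assembler would also need to tolerate mismatches and choose among competing extensions, which this sketch does not attempt.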
We evaluate the performance of an assembler using the following metrics: the percentage of assembled contigs fully covered by the true contigs; the percentage of assembled contigs partially covered by the true contigs; and the percentage of covered bases (total length of covered contigs divided by total length of contigs). We use SSAHA2 ([11]), a pairwise sequence alignment tool, to align the two sets of contig sequences, which are stored in FASTA format. The figure below compares the two assemblers against the ground truth. The results of SPATA are more consistent with the ground truth than those of TA.
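The three evaluation metrics can be made concrete with a small Python sketch. It is only one possible reading of the definitions above: the `ContigCoverage` record and the `coverage_metrics` function are hypothetical names, the per-contig covered-base counts are assumed to have been derived beforehand from the pairwise alignments, and "covered bases" is interpreted here as the number of assembled bases that align to a true contig.

```python
from typing import Dict, List, NamedTuple


class ContigCoverage(NamedTuple):
    """Hypothetical per-contig summary derived from pairwise alignments."""
    length: int   # length of the assembled contig in bases
    covered: int  # bases of the contig covered by true contigs


def coverage_metrics(contigs: List[ContigCoverage]) -> Dict[str, float]:
    """Compute the three percentages for a set of assembled contigs."""
    if not contigs:
        raise ValueError("at least one contig is required")
    n = len(contigs)
    fully = sum(1 for c in contigs if c.covered == c.length)
    partially = sum(1 for c in contigs if 0 < c.covered < c.length)
    total_bases = sum(c.length for c in contigs)
    covered_bases = sum(c.covered for c in contigs)
    return {
        "pct_fully_covered": 100 * fully / n,
        "pct_partially_covered": 100 * partially / n,
        "pct_covered_bases": 100 * covered_bases / total_bases,
    }


if __name__ == "__main__":
    example = [
        ContigCoverage(100, 100),  # fully covered
        ContigCoverage(200, 50),   # partially covered
        ContigCoverage(100, 0),    # not covered
        ContigCoverage(100, 100),  # fully covered
    ]
    print(coverage_metrics(example))
```

For the four example contigs, two of four are fully covered, one of four is partially covered, and 250 of 500 assembled bases are covered.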