Hybrid genome assembly

In bioinformatics, hybrid genome assembly refers to the use of multiple sequencing technologies to assemble a genome from the fragmented DNA reads produced by shotgun sequencing. Genome assembly is one of the most challenging tasks in genome sequencing because most modern DNA sequencing technologies produce reads that average only 25-300 base pairs in length. This is orders of magnitude smaller than most genomes (for comparison, the genome of the octoploid plant Paris japonica, one of the largest known, is 149 billion base pairs).
Assembly is computationally difficult and has inherent challenges. One is that genomes often contain complex tandem repeats that can be thousands of base pairs long. When a repeat is longer than a second-generation sequencing read, no single read can bridge it, so determining the location of each repeat in the genome is difficult. These tandem repeats can be resolved with long third-generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These reads average 10,000-15,000 base pairs in length and are long enough to span most repeated regions. A hybrid approach can increase the fidelity of assembling tandem repeats by placing them accurately along a linear scaffold, and can make the process more computationally efficient.
The term genome assembly refers to the process of taking the large number of DNA fragments generated during shotgun sequencing and placing them in the correct order to reconstruct the original genome. Sequencing uses automated machines to determine the order of the nucleotide bases in the DNA of interest (the bases in DNA are adenine, cytosine, guanine and thymine) so that genomic analyses can be carried out on an organism of interest. The advent of next-generation sequencing has brought significant improvements in the speed, accuracy and cost of DNA sequencing and has made sequencing entire genomes feasible. Biotechnology companies have developed many different sequencing technologies, each of which produces reads that differ in accuracy and length. These technologies include Roche 454, Illumina, SOLiD, and IonTorrent, which produce relatively short reads (50-700 bases) with high accuracy (>98%).
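The repeat problem described above can be illustrated with a small sketch. It assumes error-free reads and uses made-up toy sequences and hypothetical helper names (`make_genome`, `reads_of_length`); it is not taken from any assembler. Two toy genomes that differ only in the number of copies of a tandem repeat yield exactly the same set of distinct short reads, while reads long enough to span the repeat tell them apart.

```python
def reads_of_length(genome, read_len):
    """Return the set of distinct error-free reads of a fixed length."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


def make_genome(copies):
    """Build a toy genome: unique flank + tandem repeat + unique flank."""
    left_flank = "ACGTTGCA"    # illustrative sequence, not real data
    right_flank = "TTAGGCCA"   # illustrative sequence, not real data
    repeat_unit = "GAT"
    return left_flank + repeat_unit * copies + right_flank


genome_a = make_genome(copies=6)   # 18 bp repeat tract
genome_b = make_genome(copies=9)   # 27 bp repeat tract

# Reads shorter than the repeat tract never see both flanks at once,
# so both genomes produce the same set of distinct short reads.
short_reads_identical = (
    reads_of_length(genome_a, 10) == reads_of_length(genome_b, 10)
)

# Reads longer than the shorter repeat tract can span it, flank to flank,
# which exposes the difference in copy number.
long_reads_identical = (
    reads_of_length(genome_a, 30) == reads_of_length(genome_b, 30)
)

print(short_reads_identical, long_reads_identical)
```

In practice read depth gives some evidence of copy number, and real reads contain errors, so this only shows why spanning reads make repeat placement easier.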
Third-generation sequencing includes technologies such as the PacBio RS system, which can produce long reads (maximum of 23 kb) but has relatively low accuracy.
Genome assembly is normally done by one of two methods: assembly using a reference genome as a scaffold, or de novo assembly. The scaffolding approach can be useful if the genome of a similar organism has been previously sequenced; the genome of interest is assembled by comparing it to that known genome or scaffold. De novo genome assembly is used when the genome to be assembled is not similar to that of any previously sequenced organism. It is carried out by assembling single reads into contiguous sequences (contigs), which are then extended in the 3’ and 5’ directions by overlapping other sequences. De novo assembly is preferred because it allows more sequences to be conserved.
The de novo assembly of DNA sequences is very computationally challenging and can fall into the NP-hard class of problems if the Hamiltonian-cycle approach is used. This is because millions of sequences must be assembled to reconstruct a genome. Within genomes there are often tandem repeats of DNA segments, thousands of base pairs in length, which can cause problems during assembly. Although next-generation sequencing technology can now produce millions of reads, assembling these reads can be a bottleneck in the overall genome assembly process. Extensive research is therefore being done to develop new techniques and algorithms that make genome assembly more computationally efficient and more accurate.
One hybrid approach to genome assembly supplements short, accurate second-generation sequencing data (e.g. from IonTorrent, Illumina or Roche 454) with long, less accurate third-generation sequencing data (e.g. from PacBio RS) to resolve complex repeated DNA segments. The main limitation that prevents single-molecule third-generation sequencing from being used alone is its relatively low accuracy, which introduces errors into the sequenced DNA. Using solely second-generation sequencing technologies, on the other hand, can miss important parts of the genome or leave them incompletely assembled. Supplementing third-generation reads with short, high-accuracy second-generation sequences can overcome these errors and complete crucial details of the genome. This approach has been used to sequence the genomes of some bacterial species, including a strain of Vibrio cholerae. Algorithms specific to this type of hybrid genome assembly have been developed, such as the PacBio corrected Reads algorithm.
Using sequence reads from different technologies to assemble a genome brings its own challenges, because data from different sequencers can have different characteristics. For example, the overlap-layout-consensus (OLC) method of genome assembly can be difficult to apply to reads of substantially different lengths. Currently, this challenge is overcome by using multiple genome assembly programs. In Goldberg et al., the authors paired 454 reads with Sanger reads: the 454 reads were first assembled using the Newbler assembler (which is optimized for short reads), generating pseudo-reads that were then paired with the longer Sanger reads and assembled using the Celera assembler.
Hybrid genome assembly can also be accomplished using the Eulerian path approach. Here the lengths of the reads do not matter: once a k-mer spectrum has been constructed, the lengths of the reads it came from are irrelevant.
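The point about the k-mer spectrum can be sketched as follows. This is a minimal illustration under strong assumptions (error-free reads, a made-up 13-base toy sequence, and a graph with no branches or cycles); the function names are hypothetical and do not come from any assembler. Reads of uniform length and reads of mixed lengths give the same k-mer spectrum, and therefore the same graph.

```python
from collections import defaultdict


def kmer_spectrum(reads, k):
    """Collect the set of distinct k-mers found in reads of any length."""
    spectrum = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            spectrum.add(read[i:i + k])
    return spectrum


def de_bruijn_edges(spectrum):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for kmer in sorted(spectrum):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_simple_path(graph):
    """Spell out an unbranched, acyclic graph (toy case only).

    A general assembler would search for an Eulerian path instead.
    """
    successors = {s for targets in graph.values() for s in targets}
    node = next(n for n in graph if n not in successors)
    sequence = node
    while node in graph:
        node = graph[node][0]
        sequence += node[-1]
    return sequence


genome = "ATGGCGTGCAATC"  # illustrative sequence, not real data
k = 4

# Uniform short reads: every 5-base window of the toy genome.
short_reads = [genome[i:i + 5] for i in range(len(genome) - 5 + 1)]
# Mixed-length reads covering the same sequence.
mixed_reads = [genome[0:9], genome[5:13], genome[3:8]]

spectrum_short = kmer_spectrum(short_reads, k)
spectrum_mixed = kmer_spectrum(mixed_reads, k)
rebuilt = walk_simple_path(de_bruijn_edges(spectrum_mixed))

print(spectrum_short == spectrum_mixed, rebuilt == genome)
```

Because only the k-mers are kept, read length stops mattering after this step; the price is that repeats longer than k collapse into shared nodes, which this toy sequence deliberately avoids.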
The PacBio corrected Reads (PBcR) algorithm is a correction algorithm implemented as part of the Celera assembly program. It calculates an accurate hybrid consensus sequence by mapping higher-accuracy short reads (from second-generation sequencing technologies) to individual lower-accuracy long reads (from third-generation sequencing technologies). This mapping allows the long reads to be trimmed and corrected, improving read accuracy from as low as 80% to over 99.9%. In the best example reported by the algorithm's authors, contig size was quintupled compared with assemblies using only second-generation reads.
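The idea of correcting a long read with mapped short reads can be sketched with a simple majority vote. This is not the PBcR algorithm: it assumes the short reads are already mapped at known, non-negative offsets, handles substitution errors only (real long-read errors also include insertions and deletions), and does no trimming. The function name, the `min_votes` parameter and the sequences are all hypothetical.

```python
from collections import Counter


def correct_long_read(long_read, mapped_short_reads, min_votes=2):
    """Replace each base of a noisy long read with the majority base
    of the short reads covering that position.

    mapped_short_reads is a list of (offset, sequence) pairs; the
    offsets are assumed to be known and non-negative.
    """
    votes = [Counter() for _ in long_read]
    for offset, short_read in mapped_short_reads:
        for i, base in enumerate(short_read):
            if offset + i < len(long_read):
                votes[offset + i][base] += 1

    corrected = []
    for original_base, column in zip(long_read, votes):
        if column:
            base, count = column.most_common(1)[0]
            # Only overrule the long read when enough short reads agree.
            corrected.append(base if count >= min_votes else original_base)
        else:
            corrected.append(original_base)
    return "".join(corrected)


truth = "ACGTACGGTCAATGC"  # illustrative sequence, not real data
# Simulate a long read with substitution errors at positions 3 and 10.
noisy_long_read = truth[:3] + "A" + truth[4:10] + "G" + truth[11:]

# Accurate short reads, given here with their true offsets.
short_reads = [
    (0, truth[0:8]),
    (2, truth[2:10]),
    (6, truth[6:14]),
    (7, truth[7:15]),
]

corrected = correct_long_read(noisy_long_read, short_reads)
print(noisy_long_read, corrected)
```

Requiring at least `min_votes` agreeing short reads is one way to avoid replacing a correct long-read base with a single short-read error; the threshold of 2 is arbitrary.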
