TGS-GapCloser: fast and accurately passing through the Bermuda in large genome using error-prone third-generation long reads

2019 
The completeness and accuracy of genome assemblies determine the quality of subsequent bioinformatics analyses. Although current gap-closing tools benefit from the medium- and long-range information of third-generation sequencing, they suffer from multiple alignments and high error rates, which incur large costs in time and money. We developed TGS-GapCloser, a software tool that closes gaps using low-depth (>=10X) single-molecule sequencing long reads without any prior error correction. The algorithm identifies gap regions from the alignments of long reads against the original scaffolds, corrects only the candidate regions, and assigns the best sequence to each gap. We demonstrate that TGS-GapCloser improves the contig N50 of draft assemblies by 25-fold on average, updating more than 90% of gaps with a positive predictive value of 93.96%. Despite the high error rate of raw long reads, the improved assemblies achieve Q50 (99.999%) single-base accuracy, only an 11.8% decrement relative to the inputs. It also closes more gaps than mainstream gap-closing tools and is ~29-fold faster. BUSCO analysis revealed that 3.4%-13.1% more expected genes were complete. TGS-GapCloser also fills gaps in the ultra-large genome assembly of ginkgo (~12 Gb), closing 71.6% of gaps. Inserted or merged gap sequences were validated with NGS reads and reference genomes, respectively. The updated genome assemblies may improve gene annotation and structural variant calling, and thus the downstream analysis of ontogeny, phylogeny, and evolution.
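The two headline metrics in the abstract, contig N50 and Phred-scaled single-base accuracy, can be illustrated with a minimal Python sketch. This is not code from TGS-GapCloser. The function names (`split_contigs`, `n50`, `phred_to_accuracy`, `accuracy_to_phred`) and the toy scaffold are hypothetical. The sketch also assumes that gaps are represented as runs of `N` in the scaffold sequence, and that contig N50 means the length of the contig at which the cumulative length, taken from the longest contig downward, first reaches half the total.

```python
import math
import re


def split_contigs(scaffold: str) -> list[str]:
    """Split a scaffold into contigs at runs of N (assumed gap placeholders)."""
    return [c for c in re.split(r"[Nn]+", scaffold) if c]


def n50(lengths: list[int]) -> int:
    """Length of the sequence at which the cumulative length, taken from the
    longest sequence downward, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality value Q into single-base accuracy."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert single-base accuracy into a Phred quality value Q."""
    return -10 * math.log10(1.0 - accuracy)


# Toy scaffold (hypothetical data): three contigs separated by two gaps.
draft = "ACGT" + "N" * 10 + "ACGTAC" + "NN" + "AC"
# The same scaffold after the first gap is filled with a 2-base sequence.
filled = "ACGT" + "GG" + "ACGTAC" + "NN" + "AC"

draft_n50 = n50([len(c) for c in split_contigs(draft)])
filled_n50 = n50([len(c) for c in split_contigs(filled)])
```

In this toy case, closing one gap merges two contigs into one, so the contig N50 rises even though the scaffold itself is unchanged. The Phred helpers show why Q50 corresponds to 99.999% accuracy: an error probability of 10^-5 per base.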