Scalable Assembly for Massive Genomic Graphs

2017 
Scientists increasingly need to assemble large genomes, metagenomes, and large cohorts of individual genomes. Parallel genome assembly is therefore essential for processing these huge datasets, and among parallel assemblers, de Bruijn graph based tools are the most widely used. The size of a de Bruijn graph is determined by the number of distinct k-mers in the input, so redundant k-mers do not contribute to the graph size; the scalability of an assembler is thus governed by the number of distinct k-mers (i.e., the de Bruijn graph size) rather than by the raw input size. To evaluate assembly at this scale, we artificially created 16 datasets totaling 4 TB from the human reference genome: the reference was first mutated at a 5% mutation rate and then fed to the sequencing read simulator ART. The number of distinct k-mers in the combined datasets grows linearly as more datasets are added. We then evaluated the five most time-consuming steps of SWAP-Assembler 2.0 (SWAP2) on these 16 simulated datasets. In contrast to our previous experiment on a 1000-human dataset with a fixed de Bruijn graph size, this weak-scaling test shows that SWAP2 scales well from 1,024 cores on one dataset to 16,384 cores on all 16. The proportion of time spent in each of the five steps remains fixed, and the total runtime stays nearly constant. Graph simplification accounts for almost 75% of the total runtime and will be the target of further optimization in future work.
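
To make the central point concrete, here is a minimal Python sketch (purely illustrative, not code from SWAP2 or ART; the function names and parameters are our own): it mutates a toy reference at a 5% rate, slices it into overlapping reads, and shows that duplicating the reads tenfold grows the input volume but leaves the distinct k-mer set, and hence the de Bruijn graph size, unchanged.

```python
import random

def mutate(seq, rate=0.05, seed=0):
    """Apply point substitutions at the given rate (substitutions only, for brevity)."""
    rng = random.Random(seed)
    bases = "ACGT"
    return "".join(
        rng.choice([x for x in bases if x != b]) if rng.random() < rate else b
        for b in seq
    )

def distinct_kmers(reads, k=31):
    """Collect the distinct k-mers across all reads; the size of this set
    bounds the number of de Bruijn graph nodes (k=31 is a common choice)."""
    kmers = set()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmers.add(r[i:i + k])
    return kmers

# Toy reference in place of the human genome, mutated at a 5% rate.
reference = "".join(random.Random(1).choice("ACGT") for _ in range(10_000))
mutated = mutate(reference, rate=0.05)

# Overlapping 100 bp "reads" at 2x coverage, standing in for ART output.
reads = [mutated[i:i + 100] for i in range(0, len(mutated) - 100, 50)]

once = distinct_kmers(reads)
tenfold = distinct_kmers(reads * 10)  # 10x the input volume, all redundant
assert once == tenfold                # ...but the same graph size
```

This is why the simulated datasets are built from independently mutated copies of the reference: each 5% mutation round introduces new distinct k-mers, so the graph size grows roughly linearly with the number of datasets, which is what a weak-scaling test requires.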