ntHits: de novo repeat identification of genomics data using a streaming approach

2020 
Motivation: Repeat elements such as satellites, transposons, high-copy-number genes, and segmental duplications are abundant in eukaryotic genomes. They often induce many local alignments, complicating sequence assembly, comparisons between genomes, and the analysis of large-scale duplications and rearrangements. Hence, identification and classification of repeats is a fundamental step in many genomics applications and their downstream analysis tools.

Results: In this work, we present an efficient streaming algorithm and software tool, ntHits, for de novo repeat identification based on the statistical analysis of the k-mer content profile of large-scale DNA sequencing data. In the proposed algorithm, we first obtain the k-mer coverage histograms of the input datasets using ntCard, an efficient streaming algorithm for estimating k-mer coverage histograms. In the resulting histogram, repetitive k-mers form the long tail of the k-mer coverage distribution. Experimental results show that ntHits can efficiently and accurately identify the repeat content in large-scale DNA sequencing data. For example, ntHits accurately identifies the repeat k-mers in the white spruce sequencing data set with 96x sequencing coverage in about 12 hours using less than 150 GB of memory, whereas reporting the repeated k-mers with exact methods takes several days and terabytes of memory and disk space.

Availability: ntHits is written in C++ and is released under the MIT License. It is freely available at https://github.com/bcgsc/ntHits.

Contact: hmohamadi@bcgsc.ca
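The idea of reading repeats off the long tail of a k-mer coverage histogram can be illustrated with a small sketch. This is not the ntHits implementation: it counts k-mers exactly in memory with a dictionary, whereas the abstract describes a streaming estimate obtained with ntCard. The function names, the use of canonical k-mers (treating a k-mer and its reverse complement as one), and the `threshold` parameter are all assumptions made for illustration; the abstract does not state how the cutoff is derived from the histogram.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: strand-agnostic counting, a common convention for k-mer tools.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Exactly count canonical k-mers across reads (illustrative, not streaming)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def coverage_histogram(counts):
    """Map each coverage value c to the number of distinct k-mers seen c times."""
    return Counter(counts.values())


def repeat_kmers(counts, threshold):
    """Return k-mers whose coverage is at or above a hypothetical cutoff.

    K-mers in the long tail of the histogram are the repeat candidates.
    """
    return {kmer for kmer, c in counts.items() if c >= threshold}


if __name__ == "__main__":
    reads = ["AAAAA", "ACG"]
    counts = count_kmers(reads, k=3)
    print(dict(coverage_histogram(counts)))
    print(repeat_kmers(counts, threshold=2))
```

On real sequencing data the exact counter above is precisely the kind of approach the abstract describes as costing days and terabytes, which is why a streaming histogram estimate is used instead.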