Efficient Mining Closed k-Mers from DNA and Protein Sequences

2020 
Counting frequent k-mers as patterns in a long sequence is a typical data-mining problem and contributes to broad applications in bioinformatics. Existing methods of k-mer counting usually produce all the possible contiguous subsequences of length k that are contained in a given sequence. However, these methods often spawn numerous redundant and possibly meaningless k-mers. This inherently gives rise to excessive runtime and space overhead at sequence analysis. Moreover, most of the k-mer counting algorithms consider only single-length contiguous subsequences, which limits the diversity of data mining and knowledge discovery. Previous studies have demonstrated that closed patterns contain the complete information regarding the corresponding patterns. Inspired by the application of the upperclosure property of sequential patterns, we propose an efficient algorithm, called CloKmer, to find closed k-mers of various lengths that are compact yet lossless representation of k-mers. CloKmer utilizes the inverted index to project the original sequence onto an equivalent set and then mines closed k-mers by exploiting space pruning and upper-closure checking. Our experimental results demonstrate that CloKmer generates average 44% less patters but still achieves a comparable accuracy when classifying transcription factor binding sites (TFBSs), compared to traditional (k-mer-based classification) methods.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    35
    References
    1
    Citations
    NaN
    KQI
    []