Mining Uncertain Sequence Data on Hadoop Platform

2014 
Sequence pattern mining is the mining of special and representative features hidden in sequence data. Recently, it has been attracting a lot of attention, especially in the fields of bioinformatics and spatio-temporal trajectory mining. Observing that many sequence data are born with uncertainties and huge sequence data are increasingly generated and accumulated, this paper aims to discover the hidden features from a large amount of uncertain sequence data. Specifically, Probabilistic Suffix Tree (PST) is an implementation of Variable-length Markov Chain (VMM) that has been widely applied in sequence data mining. However, the conventional PST construction algorithm is not for the mining of uncertain data and cannot bear the computing of huge data. Thus, to mine a large amount of sequence data with uncertainties, this paper proposes the uPST\(_{MR}^+\) algorithm on the Hadoop platform to fully utilize the computing power and storage capacity of cloud computing. The proposed uPST\(_{MR}^+\) algorithm constructs a PST in a progressive, multi-layered, and iterative manner so as to avoid excessive learning patterns and balance the overhead of distributed computing. In addition, to prevent the drag on overall performance owing to multiple scanning of the entire sequence data, we trade space for time by using a NodeArray data structure to store the intermediate statistical results to reduce disk I/O. To verify the performance of uPST\(_{MR}^{+}\), we conduct several experiments. The experimental results show that uPST\(_{MR}^{+}\) outperforms the naive approach significantly and show good scalability and stability. Also, although using NodeArray costs a little extra memory, the execution time is significantly lowered.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    23
    References
    2
    Citations
    NaN
    KQI
    []