Practical String Dictionary Compression Using String Dictionary Encoding

2017 
A string dictionary is a data structure for storing a set of strings that maps them to unique IDs. It can manage string data in compact space by encoding them into integers. However, instances have recently emerged in practice where the size of string dictionaries has become a critical problem for very large datasets in many applications. A number of compressed string dictionaries have been proposed as a solution. In particular, the application of Re-Pair, a powerful text compression technique, to tries and front coding can help to obtain compact string dictionaries that support fast dictionary operations. However, the cost of constructing such dictionaries using Re-Pair is impractical for large datasets. In this paper, we propose an alternative compression strategy using string dictionary encoding and develop several dictionary structures for it. We show that our string dictionaries can be constructed up to 422.5x faster than the Re-Pair versions with competitive space and operation speed, through experiments on real-world datasets.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    27
    References
    6
    Citations
    NaN
    KQI
    []