Natural Language Compression on Edge-Guided Text Preprocessing

2011 
This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression. It transforms the original text into a word net that stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net has a text representation that reflects the natural word order in the text, so it can be used directly for encoding purposes. A specific coding scheme is described on top of the word net. It regards a text as a sequence of word transitions, in such a way that each word is encoded by traversing a specific edge from the vertex that stores the preceding word. This accomplishes an order-1 text preprocessing whose output is an intermediate byte representation that can be effectively encoded with universal techniques. This technique, called E-G¹, supports several variants. This experience is used to revisit the concept of the word net: it is used to identify significant 2-word symbols by performing a specific transformation on frequent edges. The resulting transformed word net appends these 2-word symbols to the original word vocabulary and establishes a specific hierarchical relationship between them and their component words. The transformed approach also enhances the original coding scheme to handle these new features. The resulting technique, called E-G², approximates an order-2 model on words and also supports specific variants. Both techniques are studied empirically. Several compressors are also used to analyze the preprocessing ability of E-G with respect to different compression approaches. Competitive space/time trade-offs are achieved when our approaches are used to compress medium-to-large texts. The best results are achieved when E-G preprocessing is coupled with high-order compressors such as Prediction by Partial Matching (PPM).
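Since the abstract only outlines the idea, the following is a minimal sketch (not the authors' implementation) of what an order-1 edge-guided preprocessing could look like. It builds the directed word net in a first pass, then encodes each word as the rank of the traversed edge among its predecessor's outgoing edges, ordered by decreasing frequency, so that frequent transitions map to small integers that a universal back-end compressor can exploit. All function names are illustrative, and the static two-pass construction is an assumption.

```python
from collections import defaultdict, Counter

def build_word_net(words):
    """Directed word net: one vertex per word; edges[u][v] counts
    how often the transition u -> v occurs between adjoining words."""
    edges = defaultdict(Counter)
    for u, v in zip(words, words[1:]):
        edges[u][v] += 1
    return edges

def encode_order1(words, edges):
    """Encode each word (after the first, which would be emitted by its
    vocabulary index) as the rank of the edge taken from the previous
    word's vertex, with edges ordered by decreasing frequency."""
    codes = []
    for u, v in zip(words, words[1:]):
        ranked = [w for w, _ in edges[u].most_common()]
        codes.append(ranked.index(v))
    return codes

words = "the cat sat on the mat and the cat slept".split()
net = build_word_net(words)
print(encode_order1(words, net))  # small, skewed integers, e.g. many 0s
```

Under this reading, the E-G² extension would merge the most frequent edges into 2-word symbols appended to the vocabulary, so that a traversal can consume two words at once and approximate an order-2 model.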