When Less is More — Using Restricted Repetition Search in Fast Compressors

2016 
In this work we identify a surprising effect by which searching for fewer repetitions in fast compressors can occasionally gain more compression (reaching up to a 30% improvement). Moreover, when executed on the right data, restricting the repetition search improves compression ratio and compression and decompression speeds simultaneously. The rationale behind this phenomenon is that for many highly compressible files, resources are "wasted" on finding short repetitions, and this obscures the longer repetitions; restricting attention solely to longer repetitions remedies this. Furthermore, restricting attention to long repetitions interferes less with the benefits of Huffman encoding, interfering only when long and beneficial repetitions are found.

Following this observation we face the challenge of identifying the aforementioned data, in an online manner, so that the restricted repetition search can be deployed on it. The problem is that while some data types benefit greatly from the restricted search, other data types suffer compression degradation from it, so identifying the data correctly is paramount to achieving an overall improvement. Moreover, since we target high-speed compressors, we are restricted to methods that are extremely fast and do not degrade compression speed on data that is unaffected by this method. Our solution is an adaptive algorithm that tunes the compression to the data at hand in order to reap the benefits on data types where restricted repetition search exhibits gains; on other data types the compressor performs on par with the original fast compressor.

We implemented our approach on two compressors. The first is LZ4 [1], where nice improvements could be seen on a few select files (some files showed a 20% improvement in compression ratio), but for the most part gains were limited to 5-7%.
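In an LZ77-family compressor, the restricted repetition search amounts to raising the minimum match length the parser will accept, emitting literals where short matches would otherwise have been used. The following greedy parser is a hypothetical minimal sketch of that idea, not the LZ4 [1] or RapidZlib [3] implementation; `min_match` and `window` are illustrative parameters:

```python
# Minimal greedy LZ77-style parser (illustrative sketch only).
# Raising min_match restricts the search to longer repetitions.

def lz_parse(data: bytes, min_match: int = 4, window: int = 4096):
    """Return tokens: (byte,) for a literal or (offset, length) for a match."""
    tokens, i, n = [], 0, len(data)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Byte-by-byte comparison allows overlapping (self-referential) copies.
            while i + length < n and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:        # restricted search: skip short matches
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append((data[i],))    # emit a literal instead
            i += 1
    return tokens

def lz_unparse(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if len(tok) == 1:
            out.append(tok[0])
        else:
            off, length = tok
            for _ in range(length):      # copy one byte at a time (handles overlap)
                out.append(out[-off])
    return bytes(out)
```

With a larger `min_match`, short repetitions are left as literals for a downstream entropy coder such as Huffman, which is the interaction the abstract describes.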
The second implementation was for RapidZlib [3], which is Deflate [2] compatible and also applies Huffman encoding on top of the repetition-elimination process. The results here were far better, exhibiting improvements upward of 20% on many files and at times over 30%. We evaluated our adaptive RapidZlib solution on a large collection of data of various types and verified that it improves significantly on the potential files and does not add overhead or degrade on others.