Effective asymmetric XML compression

Authors: Skibinski Przemyslaw; Grabowski Szymon; Swacha Jakub*
Source: Software: Practice and Experience, 2008, 38(10): 1027-1047.
DOI: 10.1002/spe.859

Abstract

The innate verbosity of the extensible markup language (XML) remains one of its main weaknesses, especially when large documents are concerned. This problem can be solved with the aid of dedicated XML compression algorithms.
In this work, we describe the XML word-replacing transform (XML-WRT), a fast and fully reversible XML transform which, when combined with commonly used LZ77-style compression algorithms, makes it possible to attain high compression ratios comparable to those achieved by the current state-of-the-art XML compressors. The resulting compression scheme is asymmetric in the sense that its decoder is much faster than the coder. This is a desirable practical property, as in many XML applications data are read much more often than written. The key features of the transform are dictionary-based encoding of both document structure and content, separation of different content types into multiple streams, and dedicated encoding of specific patterns, including numbers and dates. The test results show that the proposed transform improves the XML compression efficiency of general-purpose compressors on average by 35% in the case of gzip and by 17% in the case of LZMA. Compared with the current state-of-the-art SCMPPM algorithm, XML-WRT with LZMA attains an over 2% better compression ratio while being 55% faster.
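
The general idea of a reversible word-replacing transform followed by an LZ77-style compressor can be illustrated with a minimal sketch. This is not the authors' XML-WRT implementation: the token pattern, the one-byte code scheme (bytes 0x80 to 0xFF), the ASCII-only input restriction, and all function names are assumptions made for illustration, and the sketch omits the multi-stream separation and the dedicated number and date encodings mentioned above. Python's `zlib` (DEFLATE, an LZ77-style method) stands in for the back-end compressor.

```python
import re
import zlib
from collections import Counter

# Hypothetical token pattern: tag openers such as "<item" or "</item",
# and plain words of three or more letters.
TOKEN = re.compile(r"</?[A-Za-z_][\w.-]*|[A-Za-z]{3,}")


def build_dictionary(text, min_count=2, max_size=128):
    """Collect the most frequent tokens; at most 128 fit in codes 0x80..0xFF."""
    counts = Counter(TOKEN.findall(text))
    frequent = [tok for tok, n in counts.most_common() if n >= min_count]
    return frequent[:max_size]


def encode(text, dictionary):
    """Replace each dictionary token with a single byte >= 0x80.

    Assumes ASCII input, so every literal byte stays below 0x80 and the
    mapping is unambiguous; non-ASCII input raises UnicodeEncodeError.
    """
    index = {tok: i for i, tok in enumerate(dictionary)}
    out = bytearray()
    pos = 0
    for m in TOKEN.finditer(text):
        code = index.get(m.group())
        if code is not None:
            out += text[pos:m.start()].encode("ascii")  # literal text
            out.append(0x80 + code)                      # token code
            pos = m.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


def decode(data, dictionary):
    """Invert encode(): expand code bytes, copy literal bytes unchanged."""
    return "".join(
        dictionary[b - 0x80] if b >= 0x80 else chr(b) for b in data
    )


def pack(text):
    """Transform, then compress with zlib (stand-in LZ77-style back end).

    The dictionary is returned separately here; a real format would have
    to store it alongside the compressed data.
    """
    dictionary = build_dictionary(text)
    return dictionary, zlib.compress(encode(text, dictionary), 9)


def unpack(dictionary, blob):
    """Decompress, then undo the transform."""
    return decode(zlib.decompress(blob), dictionary)
```

Decoding in this sketch is a single pass of table lookups, which hints at why a scheme of this kind can be asymmetric: the encoder has to scan the document and build the dictionary, while the decoder only expands codes.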

  • Publication date: 2008-8