摘要

This paper proposes generalized context modeling (GCM) for heterogeneous data compression. The proposed model extends the suffix of predicted subsequences in classic context modeling to arbitrary combinations of symbols in multiple directions. To address the selection of contexts, GCM constructs a model graph with a combinatorial structuring of finite order combination of predicted symbols as its nodes. The estimated probability for prediction is obtained by weighting over a class of context models that contain all the occurrences of nodes in the model graph. Moreover, separable context modeling in each direction is adopted for efficient prediction. To find optimal class of context models for prediction, the normalized maximum likelihood (NML) function is developed to estimate their structures and parameters, especially for heterogeneous data with large sizes. Furthermore, it is refined by context pruning to exclude the redundant models. Such model selection is optimal in the sense of minimum description length (MDL) principle, whose divergence is proven to be consistent with the actual distribution. It is shown that upper bounds of model redundancy for GCM are irrelevant to the size of data. GCM is validated in an extensive field of applications, e.g., Calgary corpus, executable files, and genomic data. Experimental results show that it outperforms most state-of-the-art context modeling algorithms reported.