🤖 AI Summary
This paper addresses lossy text compression by proposing Error-Bounded Predictive Coding (EPC), a codec that employs a Masked Language Model (MLM) as the decompressor. EPC generates residuals via predictive coding, selects a minimal set of corrections via rank-based coding, and provides controllable rate-distortion trade-offs, with rates measured by exact bit counting. Its key contributions are: (i) leveraging the MLM's contextual modeling capability to improve reconstruction fidelity; (ii) introducing a residual channel for fine-grained, adaptive rate-distortion control; and (iii) avoiding the information loss and optimization bottlenecks of conventional masked prediction or vector quantization. Experiments across multiple text datasets show that EPC outperforms the Predictive Masking (PM) and Vector Quantisation with Residual Patch (VQ+RE) baselines at equal or lower bitrates, achieving better rate-distortion performance.
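The summary's "exact bit counting" of rank corrections can be made concrete with a small sketch. This is not the paper's actual coder: it assumes, purely for illustration, that each masked position carries one flag bit and that non-zero ranks are costed with an Elias gamma code; the function names and the example ranks are hypothetical.

```python
# Hedged illustration of bit accounting for rank corrections (illustrative
# coding scheme, not the paper's): 1 flag bit per masked position, plus an
# Elias-gamma-coded rank wherever the MLM's top-1 prediction was wrong.
from math import floor, log2

def elias_gamma_bits(n: int) -> int:
    """Length in bits of the Elias gamma code for a positive integer n."""
    return 2 * floor(log2(n)) + 1

def rate_in_bits(ranks: list[int]) -> int:
    """Total cost of a correction set, where rank 0 means 'top-1 was right'."""
    return sum(1 + (elias_gamma_bits(r) if r > 0 else 0) for r in ranks)

# Example: 6 masked positions, the model is right at 4 of them (rank 0)
# and needs small corrections (ranks 3 and 1) at the other two.
print(rate_in_bits([0, 0, 3, 0, 1, 0]))  # -> 10 bits
```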
📝 Abstract
Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fidelity is traded for higher compression ratios. This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec that leverages a Masked Language Model (MLM) as a decompressor. Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model's top prediction is incorrect. This creates a residual channel that offers continuous rate-distortion control. We compare EPC to a simpler Predictive Masking (PM) baseline and a transform-based Vector Quantisation with a Residual Patch (VQ+RE) approach. Through an evaluation that includes precise bit accounting and rate-distortion analysis, we demonstrate that EPC consistently dominates PM, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model's intrinsic knowledge.
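To make the residual-channel idea tangible, here is a minimal sketch of rank-based correction around an off-the-shelf masked LM. It assumes a HuggingFace BERT model (`bert-base-uncased`) as the shared compressor/decompressor; the fixed masking pattern, the helper names, and the choice to store every non-zero-rank correction are illustrative assumptions, not the authors' exact scheme.

```python
# Minimal sketch of rank-based residual coding with a masked LM.
# Not the paper's implementation; model choice and masking pattern are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def encode(text, mask_every=2):
    """Mask every `mask_every`-th token and record a rank correction
    only where the MLM's top prediction differs from the original token."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    masked = ids.clone()
    positions = list(range(1, len(ids) - 1, mask_every))  # skip [CLS]/[SEP]
    masked[positions] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0]

    corrections = {}
    for p in positions:
        order = torch.argsort(logits[p], descending=True)
        rank = (order == ids[p]).nonzero().item()
        if rank != 0:               # top-1 hit: nothing to transmit
            corrections[p] = rank   # small ranks are cheap to entropy-code
    # Dropping high-rank corrections here would trade fidelity for rate.
    return masked, positions, corrections

def decode(masked, positions, corrections):
    """Reconstruct by taking the MLM's top prediction at each masked position,
    overridden by the stored rank whenever a correction was transmitted."""
    with torch.no_grad():
        logits = model(masked.unsqueeze(0)).logits[0]
    out = masked.clone()
    for p in positions:
        rank = corrections.get(p, 0)
        out[p] = torch.argsort(logits[p], descending=True)[rank]
    return tokenizer.decode(out, skip_special_tokens=True)

masked, pos, corr = encode("the cat sat on the mat because it was warm")
print(decode(masked, pos, corr))
```

In this toy version the stored corrections make the masked positions exactly recoverable; the lossy, rate-controlling knob is which corrections one chooses to drop.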