The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)

📅 2025-10-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses lossy text compression by proposing Error-Bounded Predictive Coding (EPC), which employs a Masked Language Model (MLM) as a learnable decompressor. EPC generates residuals via predictive coding, selects a minimal correction set via rank-based coding, and achieves error-controllable rate-distortion trade-offs through exact bit counting. Its key contributions are: (i) leveraging the MLM's contextual modeling capability to improve reconstruction fidelity; (ii) introducing a residual channel for fine-grained, adaptive rate-distortion control; and (iii) avoiding the information loss and optimization bottlenecks inherent in conventional masked prediction or vector quantization. Experiments demonstrate that EPC consistently outperforms the Predictive Masking (PM) and Vector Quantisation with Residual Patch (VQ+RE) baselines at equal or lower bitrates across multiple text datasets, achieving superior rate-distortion performance and compression efficiency.

📝 Abstract
Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fidelity is traded for higher compression ratios. This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec that leverages a Masked Language Model (MLM) as a decompressor. Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model's top prediction is incorrect. This creates a residual channel that offers continuous rate-distortion control. We compare EPC to a simpler Predictive Masking (PM) baseline and a transform-based Vector Quantisation with a Residual Patch (VQ+RE) approach. Through an evaluation that includes precise bit accounting and rate-distortion analysis, we demonstrate that EPC consistently dominates PM, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model's intrinsic knowledge.
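The abstract's core mechanism — predict masked tokens with the model and store a rank-based correction only when the top-1 prediction is wrong — can be sketched as follows. This is an illustrative toy, not the paper's code: `rank_tokens` is a hypothetical stand-in for the MLM, returning candidate tokens ranked by probability.

```python
def rank_tokens(context, position):
    # Toy ranking: a fixed vocabulary ordered by assumed frequency.
    # A real MLM would condition on `context` with `position` masked.
    return ["the", "cat", "sat", "on", "mat", "dog"]

def epc_encode(tokens, masked_positions):
    """Store a rank correction only where the model's top guess is wrong."""
    corrections = {}
    for pos in masked_positions:
        ranking = rank_tokens(tokens, pos)
        rank = ranking.index(tokens[pos])
        if rank != 0:                  # top-1 wrong -> pay bits for the rank
            corrections[pos] = rank
    return corrections

def epc_decode(received, masked_positions, corrections):
    """Fill masked slots with the model's guess, patched by stored ranks."""
    out = list(received)
    for pos in masked_positions:
        ranking = rank_tokens(out, pos)
        out[pos] = ranking[corrections.get(pos, 0)]
    return out

text = ["the", "dog", "sat", "on", "the", "mat"]
masked = [1, 4]                        # positions the codec does not store
corr = epc_encode(text, masked)        # only position 1 needs a correction
received = [t if i not in masked else None for i, t in enumerate(text)]
assert epc_decode(received, masked, corr) == text
```

Position 4 ("the") is the model's top guess, so it costs nothing; position 1 ("dog") is stored as its rank in the model's ordering. This is the residual channel: the fewer positions need corrections, the closer the rate approaches zero for the masked content.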
Problem

Research questions and friction points this paper is trying to address.

Developing error-bounded lossy compression for text via predictive coding
Trading reconstruction fidelity for higher compression ratios with LLMs
Creating a residual channel for continuous rate-distortion control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a masked language model as the decompressor
Stores rank-based corrections only for incorrect predictions
Provides continuous rate-distortion control via a residual channel
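The rate-distortion knob implied by these bullets can be sketched with explicit bit accounting. The sketch below is an assumption about how such accounting might look, not the paper's scheme: ranks are coded with Elias gamma (a standard universal integer code), and a hypothetical `cutoff` parameter drops expensive high-rank corrections, trading token errors (distortion) for bits (rate).

```python
import math

def elias_gamma_bits(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    return 2 * int(math.log2(n)) + 1

def rate_and_distortion(ranks, cutoff):
    """Bits spent and errors incurred when ranks above `cutoff` are dropped.

    rank 0 means the model's top guess was right (free); a kept rank r
    is coded as the integer r + 1. Dropped corrections become errors.
    """
    bits = sum(elias_gamma_bits(r + 1) for r in ranks if 0 < r <= cutoff)
    errors = sum(1 for r in ranks if r > cutoff)
    return bits, errors

ranks = [3, 0, 0, 7, 1, 0, 2]          # hypothetical correction ranks
print(rate_and_distortion(ranks, 7))   # every correction kept: zero errors
print(rate_and_distortion(ranks, 2))   # cheap corrections only: fewer bits
```

Sweeping `cutoff` traces out a rate-distortion curve, which is the kind of continuous control the residual channel provides.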
🔎 Similar Papers

👥 Authors
Nnamdi Aghanya (Cranfield University)
Jun Li (Cranfield University)
Kewei Wang (Alibaba Cloud)