🤖 AI Summary
This work addresses a limitation of conventional language models: their reliance on a fixed tokenizer, which tightly couples the model to a specific compression scheme and hinders flexibility and robustness in byte-level language modeling. The authors propose a proxy-compression training framework that jointly trains the model on raw byte sequences (e.g., UTF-8) and on compressed views produced by external lossless compressors, while performing inference exclusively on raw bytes. This decouples training efficiency from byte-level generalization: models can be trained efficiently on compressed data yet run end-to-end inference on raw bytes. Evaluated on code language modeling tasks, the method significantly outperforms pure byte-level baselines under identical compute budgets, and its performance approaches or even surpasses that of traditional tokenizer-based methods as model scale increases.
📝 Abstract
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor typically applied to UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and on compressed views generated by external compressors; through this process, the model learns to internally align compressed sequences with raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs, which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines at fixed compute budgets. These gains grow with model scale, and proxy-trained models eventually match or rival tokenizer-based approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.