🤖 AI Summary
This study investigates whether lightweight pre-trained Transformers can deliver byte-level multimodal lossless compression ratios competitive with conventional algorithms (e.g., gzip, LZMA2, PNG, JPEG-XL, FLAC), which have historically been out of reach once model size is accounted for. The authors employ a vanilla Transformer architecture operating directly on raw byte sequences and train million-parameter models on a 165 GB cross-modal byte corpus. Evaluation is conducted on 1 GB of out-of-distribution test data per modality, spanning text, images, audio, and their combinations. Small-scale pre-trained Transformers are shown to outperform both general-purpose and domain-specific compressors even when the parameter count is charged against the compressed size, achieving, for example, a compression ratio of 0.49 on audio versus 0.54 for FLAC. Multimodal joint training lets a single small model compress several modalities well, although, unlike with large-scale foundation models, transfer to modalities unseen during training remains weak. These results position small pre-trained Transformers as a parameter-efficient route to universal, high-ratio lossless compression.
📝 Abstract
Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help, as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165 GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1 GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC) – even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
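The phrase "accounting for parameter size" means the model's weights are counted as part of the compressed output, since a decompressor needs them to reproduce the data. A minimal sketch of this adjusted ratio is below; the function name, byte counts, and the fp16 weight assumption are illustrative, not values or code from the paper.

```python
# Hypothetical sketch of an "adjusted" compression ratio that charges the
# model's weights against the compressed size. All numbers are assumptions
# for illustration, not results reported in the paper.

def adjusted_compression_ratio(compressed_bytes: int,
                               raw_bytes: int,
                               num_parameters: int,
                               bytes_per_parameter: int = 2) -> float:
    """(compressed output + model weights) / raw input size.

    Lower is better; a ratio >= 1.0 means no net compression once the
    weights that the decompressor requires are included.
    """
    model_bytes = num_parameters * bytes_per_parameter  # e.g. fp16 weights
    return (compressed_bytes + model_bytes) / raw_bytes

# Illustrative: a 5M-parameter model compressing 1 GB of raw bytes.
ratio = adjusted_compression_ratio(
    compressed_bytes=480_000_000,  # assumed model-coded size
    raw_bytes=10**9,
    num_parameters=5_000_000,
)
print(round(ratio, 3))  # -> 0.49
```

This is why naively shrinking the model does not automatically win: a smaller `model_bytes` term is easily outweighed by a larger `compressed_bytes` term from worse predictions.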