🤖 AI Summary
This work addresses the high computational overhead in semantic classification caused by processing raw bytes or fully decoding compressed media. We propose TEMPEST, the first method to feed the intrinsic byte-stream structure of compressed files directly into Transformers, bypassing decoding and full media reconstruction. TEMPEST introduces lightweight, compression-aware tokenization and encoding strategies grounded in the statistical and structural properties of compressed data, enabling end-to-end semantic representation learning. Evaluated across multiple modalities (images, audio) and diverse compression formats (JPEG, MP3, etc.), TEMPEST achieves classification accuracy on par with state-of-the-art methods while reducing token count by 62% on average, significantly lowering memory footprint and FLOPs. Its core innovation lies in treating compressed-domain byte streams as inherently semantic-rich, compact representations, establishing a new paradigm for efficient, cross-modal, decoding-free representation learning.
📝 Abstract
Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state-of-the-art while delivering efficiency gains in memory and compute.
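To make the idea of exploiting byte-stream structure concrete, the sketch below shows one plausible way a compressed file could be split into structure-aware units rather than raw bytes: JPEG streams are organized as marker-delimited segments, and each segment can serve as a tokenization unit. This is an illustrative example only; the helper name `jpeg_segments` and the segment-as-token scheme are assumptions for exposition, not the paper's actual tokenizer.

```python
# Illustrative sketch: segment-level splitting of a JPEG byte stream.
# JPEG data is a sequence of segments, each introduced by a 0xFF marker byte;
# most segments carry a 2-byte big-endian length followed by a payload.
# NOTE: `jpeg_segments` is a hypothetical helper, not TEMPEST's tokenizer.

def jpeg_segments(data: bytes):
    """Split a JPEG byte stream into (marker, payload) segments."""
    segs = []
    i = 0
    while i + 1 < len(data):
        assert data[i] == 0xFF, "expected 0xFF marker prefix"
        marker = data[i + 1]
        i += 2
        if marker in (0xD8, 0xD9):            # SOI / EOI carry no payload
            segs.append((marker, b""))
            if marker == 0xD9:
                break
        else:                                  # length-prefixed segment
            length = int.from_bytes(data[i:i + 2], "big")
            segs.append((marker, data[i + 2:i + length]))
            i += length
    return segs

# A minimal synthetic stream: SOI, one COM (comment) segment, EOI.
stream = (bytes([0xFF, 0xD8,                   # SOI
                 0xFF, 0xFE, 0x00, 0x04])      # COM, length 4
          + b"hi"                              # COM payload
          + bytes([0xFF, 0xD9]))               # EOI

print([marker for marker, _ in jpeg_segments(stream)])  # [216, 254, 217]
```

A transformer operating on such units sees one token per structural segment instead of one per byte, which is the kind of compaction the abstract attributes to TEMPEST's compression-aware tokenization.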