Transformers from Compressed Representations

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational overhead in semantic classification caused by processing raw bytes or fully decoding compressed media. We propose TEMPEST, the first method to directly feed the intrinsic byte-stream structure of compressed files into Transformers—bypassing decoding and full media reconstruction. TEMPEST introduces lightweight, compression-aware tokenization and encoding strategies grounded in the statistical and structural properties of compressed data, enabling end-to-end semantic representation learning. Evaluated across multimodal domains (images, audio) and diverse compression formats (JPEG, MP3, etc.), TEMPEST achieves on-par classification accuracy with state-of-the-art methods while reducing token count by 62% on average, significantly lowering memory footprint and FLOPs. Its core innovation lies in treating compressed-domain byte streams as inherently semantic-rich, compact representations—establishing a new paradigm for efficient, cross-modal, decoding-agnostic representation learning.
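The core idea of feeding compressed byte streams to a transformer can be illustrated with a minimal tokenizer sketch. This is not the paper's actual method: `tokenize_bytes`, the byte-pair fusing, and the `max_tokens` cutoff are all hypothetical simplifications, chosen only to show how a compressed file's bytes can become token ids without any media decoding.

```python
def tokenize_bytes(data: bytes, pair: bool = True, max_tokens: int = 512) -> list[int]:
    """Map a compressed byte stream to integer token ids for a transformer.

    With pair=True, adjacent bytes are fused into one 16-bit token
    (vocabulary size 65536), halving the sequence length relative to
    naive byte-level tokenization (vocabulary size 256).
    """
    if pair:
        if len(data) % 2:            # pad odd-length streams with a zero byte
            data += b"\x00"
        ids = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    else:
        ids = list(data)             # one token per raw byte
    return ids[:max_tokens]          # truncate to the model's context length
```

A real compression-aware tokenizer would additionally exploit format structure (e.g. JPEG segment markers or MP3 frame headers) to segment the stream, but the principle of operating on undecoded bytes is the same.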

📝 Abstract
Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state-of-the-art while delivering efficiency gains in memory and compute.
Problem

Research questions and friction points this paper is trying to address.

Learning semantic representations directly from compressed data streams
Reducing token count for semantic classification tasks
Achieving competitive accuracy with improved computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses compressed file byte-stream for tokenization
Applies transformers directly on compressed data
Reduces token count for computational efficiency
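The token-count reduction claimed above can be made concrete with some back-of-the-envelope arithmetic. The functions and the 10 KB file size below are hypothetical illustrations, not figures from the paper (whose reported reduction is 62% on average versus state-of-the-art baselines): they simply contrast the sequence length of raw decoded bytes with that of a compressed stream.

```python
def raw_byte_tokens(height: int, width: int, channels: int = 3) -> int:
    # sequence length if every decoded pixel byte became one token
    return height * width * channels

def compressed_byte_tokens(n_bytes: int, group: int = 2) -> int:
    # sequence length when `group` compressed bytes are fused per token
    return -(-n_bytes // group)  # ceiling division
```

For a 224x224 RGB image, raw byte-level processing yields 150,528 tokens, while a hypothetical 10 KB JPEG of the same image fused two bytes per token yields 5,120 tokens, a sequence roughly 29x shorter before the transformer is ever run.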