🤖 AI Summary
To deploy multimodal vision-language models efficiently on resource-constrained edge devices, this paper proposes BitMar, the first low-bit, memory-augmented multimodal Transformer framework tailored for edge computing. Methodologically, BitMar integrates a human-inspired episodic memory mechanism through a fixed-size external key-value memory, coupled with layer-wise conditional decoding and sliding-window attention. It pairs ultra-low-bit quantization (1.58-bit BitNet-style text encoding and DINOv2-based quantized visual encoding) with a memory-augmented architecture, and adds attention sinks and native support for streaming inference. Evaluated on image captioning and multimodal understanding tasks, BitMar achieves competitive accuracy while significantly reducing model size, memory footprint, and latency, demonstrating its feasibility and effectiveness for edge deployment.
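The "1.58-bit" encoders refer to ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight). A minimal sketch of the BitNet-style absmean ternary quantizer is shown below; the function name and per-tensor scaling granularity are illustrative assumptions, not taken from the paper, which may quantize at a different granularity:

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Quantize a weight tensor to {-1, 0, +1} (~1.58 bits/weight)
    using an absmean scaling rule in the style of BitNet b1.58."""
    scale = np.abs(w).mean() + eps            # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)  # round, then clamp to ternary
    return w_q.astype(np.int8), scale          # dequantize as w_q * scale

# Usage: quantize a random weight matrix and check the codebook.
w = np.random.randn(4, 8).astype(np.float32)
w_q, s = absmean_ternary_quantize(w)
assert set(np.unique(w_q)).issubset({-1, 0, 1})
```

Because every weight is one of three values, matrix multiplies reduce to additions and subtractions scaled by a single float, which is what makes this format attractive on edge hardware.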
📝 Abstract
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation, but their large, full-precision backbones make them difficult to deploy on edge devices. Memory-augmented architectures improve the use of past context, yet they are rarely paired with aggressive, edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that adds an external, human-like episodic memory for effective image-text generation on resource-constrained hardware. BitMar uses two 1.58-bit encoders, one for text (BitNet-style) and one for vision (DINOv2-based), to produce compact embeddings that are fused and used to query a fixed-size key-value episodic memory. The retrieved memory vector conditions the BitNet decoder at every layer, increasing the contextual relevance of the generated text. The decoder also combines attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. Together, per-layer conditioning and sliding-window attention achieve a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well suited for edge deployment.
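The memory mechanism described above can be sketched as attention over a fixed-size key-value store, with the retrieved vector injected into every decoder layer. The sketch below is an assumption-laden illustration, not the paper's implementation: the slot count, the additive per-layer injection, and all names are hypothetical.

```python
import numpy as np

class EpisodicMemory:
    """Fixed-size external key-value memory: a fused image-text query
    retrieves a value vector via softmax attention over stored keys."""
    def __init__(self, num_slots: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.keys = rng.standard_normal((num_slots, dim)) / np.sqrt(dim)
        self.values = rng.standard_normal((num_slots, dim)) / np.sqrt(dim)

    def read(self, query: np.ndarray) -> np.ndarray:
        scores = self.keys @ query / np.sqrt(query.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax over memory slots
        return w @ self.values            # retrieved memory vector

# Per-layer conditioning, shown here as a simple additive injection of
# the retrieved vector into each (toy) decoder layer.
mem = EpisodicMemory(num_slots=32, dim=16)
query = np.ones(16) / 4.0                 # stand-in for the fused embedding
m = mem.read(query)
hidden = np.zeros(16)
for layer in range(4):                    # toy 4-layer decoder
    hidden = np.tanh(hidden + m)          # condition every layer on m
```

Because the memory has a fixed number of slots, its footprint is constant regardless of input length, which matches the tight memory budgets targeted by the abstract.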