🤖 AI Summary
To deploy multimodal vision-language models efficiently on resource-constrained edge devices, this paper proposes BitMar, the first low-bit, memory-augmented multimodal Transformer framework tailored for edge computing. Methodologically, BitMar integrates a human-inspired episodic memory mechanism through a fixed-size external key-value memory, coupled with layer-wise conditional decoding and sliding-window attention. It pairs ultra-low-bit quantization (1.58-bit BitNet-style text encoding and DINOv2-based quantized visual encoding) with a memory-augmented architecture, and adds attention sinks and native support for streaming inference. Evaluated on image captioning and multimodal understanding tasks, BitMar achieves competitive accuracy while significantly reducing model size, memory footprint, and latency, demonstrating its feasibility and effectiveness for edge deployment.
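The "1.58-bit" encoders refer to ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58 bits per weight). A minimal sketch of the BitNet-style absmean ternary quantizer is shown below; the function name and per-tensor scaling granularity are illustrative assumptions, not taken from the paper, which may quantize at a different granularity:

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Quantize a weight tensor to {-1, 0, +1} (~1.58 bits/weight)
    using an absmean scaling rule in the style of BitNet b1.58."""
    scale = np.abs(w).mean() + eps            # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)  # round, then clamp to ternary
    return w_q.astype(np.int8), scale          # dequantize as w_q * scale

# Usage: quantize a random weight matrix and check the codebook.
w = np.random.randn(4, 8).astype(np.float32)
w_q, s = absmean_ternary_quantize(w)
assert set(np.unique(w_q)).issubset({-1, 0, 1})
```

Because every weight is one of three values, matrix multiplies reduce to additions and subtractions scaled by a single float, which is what makes this format attractive on edge hardware.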
📝 Abstract
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation, but their large, full-precision backbones make them difficult to deploy on edge devices. Memory-augmented architectures improve the use of past context, yet they are rarely paired with aggressive, edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that adds an external, human-like episodic memory for effective image-text generation on resource-constrained hardware. BitMar uses two 1.58-bit encoders, one for text (BitNet-style) and one for vision (DINOv2-based), to produce compact embeddings that are fused and used to query a fixed-size key-value episodic memory. The retrieved memory vector conditions the BitNet decoder at every layer, increasing the contextual relevance of the generated text. The decoder also combines attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. Together, per-layer conditioning and sliding-window attention achieve a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well suited for edge deployment.
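The memory mechanism described above can be sketched as attention over a fixed-size key-value store, with the retrieved vector injected into every decoder layer. The sketch below is an assumption-laden illustration, not the paper's implementation: the slot count, the additive per-layer injection, and all names are hypothetical.

```python
import numpy as np

class EpisodicMemory:
    """Fixed-size external key-value memory: a fused image-text query
    retrieves a value vector via softmax attention over stored keys."""
    def __init__(self, num_slots: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.keys = rng.standard_normal((num_slots, dim)) / np.sqrt(dim)
        self.values = rng.standard_normal((num_slots, dim)) / np.sqrt(dim)

    def read(self, query: np.ndarray) -> np.ndarray:
        scores = self.keys @ query / np.sqrt(query.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax over memory slots
        return w @ self.values            # retrieved memory vector

# Per-layer conditioning, shown here as a simple additive injection of
# the retrieved vector into each (toy) decoder layer.
mem = EpisodicMemory(num_slots=32, dim=16)
query = np.ones(16) / 4.0                 # stand-in for the fused embedding
m = mem.read(query)
hidden = np.zeros(16)
for layer in range(4):                    # toy 4-layer decoder
    hidden = np.tanh(hidden + m)          # condition every layer on m
```

Because the memory has a fixed number of slots, its footprint is constant regardless of input length, which matches the tight memory budgets targeted by the abstract.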