O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference cost of multimodal large language models on long audiovisual sequences and the failure of existing benchmarks to isolate audiovisual association in noisy user-generated videos. The authors propose OMAC, a training-free plug-in memory compression method, and O-MARC, a compression distillation framework that makes smaller models robust to compressed inputs. They also introduce UGC-AVQA, a new benchmark of 1,000 videos and 4,816 QA pairs that emphasizes audiovisual co-understanding. On Qwen2.5-Omni-3B, O-MARC achieves an average score of 45.8 across four benchmarks, outperforming full-token inference (44.1) and OmniZip (41.0), while OMAC reduces latency by 34.6% and memory usage by 34.7% relative to full-token inference.
📝 Abstract
Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression-distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full-token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6% (1.53× speedup) and memory by 34.7% compared with full-token inference.
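The abstract reports the efficiency gain both as a 34.6% latency reduction and as a 1.53× speedup. The two figures agree if speedup is taken as the ratio of baseline latency to compressed latency, which is the usual convention but is an assumption here, since the page does not define it. A minimal sketch of that conversion (the function name is hypothetical, not from the paper):

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction into a speedup factor.

    Assumes speedup = baseline_latency / new_latency, so a reduction r
    leaves (1 - r) of the baseline latency and the speedup is 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# 34.6% latency reduction, as reported in the abstract
speedup = speedup_from_latency_reduction(0.346)
print(f"{speedup:.2f}x")  # prints "1.53x"
```

Under the same convention, the 34.7% memory reduction means the compressed model uses about 65.3% of the full-token memory footprint.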
Problem

Research questions and friction points this paper is trying to address.

efficient video understanding
audio-visual association
inference cost
noisy user-generated videos
multimodal compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

O-MARC
memory-augmented compression
audio-visual QA benchmark
compression distillation
efficient video understanding