🤖 AI Summary
Existing large language models for video understanding struggle to process long-duration, high-frame-rate, high-resolution video streams continuously under fixed memory constraints, often resorting to truncation, downsampling, or token compression—leading to critical information loss. To address this, we propose video-SALMONN S, to our knowledge the first test-time training (TTT)-enabled, memory-augmented architecture, featuring a dynamically updated TTT memory module and a prompt-dependent memory retrieval mechanism, coupled with Hessian-free conjugate-gradient optimization (TTT_HF) for efficient adaptation. Our method achieves, for the first time, lossless continuous understanding of video streams up to three hours long at 1 fps and 360p resolution. On long-video benchmarks, our 8B-parameter model attains 74.2% overall accuracy on Video-MME and 67.8% on its long-video split—significantly outperforming both offline and streaming baselines. This work breaks the longstanding trade-off between long-range dependency modeling and memory efficiency.
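The prompt-dependent retrieval idea can be illustrated with a minimal sketch (names, shapes, and the plain cross-attention form are illustrative assumptions, not the paper's actual architecture): the prompt embedding attends over a fixed number of memory slots, so the read cost stays constant no matter how many video frames have been absorbed into memory.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def read_memory(prompt_q, memory, Wk, Wv):
    """Cross-attention read: queries come from the prompt, keys/values
    from a fixed-size memory, so output size is independent of stream length."""
    K = memory @ Wk                                   # (slots, d)
    V = memory @ Wv                                   # (slots, d)
    scores = prompt_q @ K.T / np.sqrt(K.shape[-1])    # (q_len, slots)
    attn = softmax(scores, axis=-1)                   # rows sum to 1
    return attn @ V                                   # (q_len, d) fixed-size context

# Toy dimensions: 8 memory slots regardless of how long the video was.
rng = np.random.default_rng(0)
d, slots, q_len = 16, 8, 4
memory = rng.normal(size=(slots, d))
prompt_q = rng.normal(size=(q_len, d))
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)
ctx = read_memory(prompt_q, memory, Wk, Wv)
```

The key property is that `memory` has a fixed number of slots, so retrieval memory and compute are bounded even for multi-hour streams.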
📝 Abstract
Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline methods that process a fixed number of frames must adapt the frame rate to the stream length; streaming methods constrain memory by merging or discarding tokens, losing information. We propose video-SALMONN S, a streaming audio-visual LLM that, to our knowledge, is the first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget. Our model introduces (i) a test-time-training (TTT) memory module that replaces token merging and continually updates token representations to capture long-range dependencies, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from a fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT_HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding on multi-hour videos with 10k frames and 1M tokens. Our 8B-parameter model achieves 74.2% overall and 67.8% on the Video-MME long split, outperforming both offline and streaming baselines.
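The core idea behind a Hessian-free conjugate-gradient step can be sketched as follows (this is a generic textbook illustration, not the paper's actual TTT_HF implementation): solve H d = -g for an update direction d while touching the Hessian H only through Hessian-vector products, here approximated by finite differences of the gradient.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate H @ v via a central finite difference of the gradient,
    so the full Hessian is never formed."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def cg_solve(grad_fn, w, g, iters=10, tol=1e-10):
    """Conjugate gradient for H d = -g, using only Hessian-vector products."""
    d = np.zeros_like(g)
    r = -g.copy()            # residual = -g - H @ d, with d = 0 initially
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(grad_fn, w, p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy quadratic loss L(w) = 0.5 w^T A w - b^T w, so grad = A w - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = np.zeros(2)
grad_fn = lambda w: A @ w - b
d = cg_solve(grad_fn, w, grad_fn(w))
w_new = w + d                # one Newton-like step lands at the minimiser A^{-1} b
```

On a quadratic, one such step recovers the exact Newton update; in a real TTT loop the same machinery would be applied per chunk to the memory module's parameters, which is why the matrix-free formulation matters at this scale.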