AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video counting benchmarks suffer from short video durations, limited query diversity, a lack of fine-grained spatiotemporal clue annotations, and insufficient coverage of audio-visual modalities, which severely constrains the counting ability of multimodal large language models (MLLMs). To address these limitations, the authors introduce CG-AV-Counting, a clue-grounded audio-visual counting benchmark with explicit spatiotemporal clue annotations, comprising 497 long videos, 1,027 questions, and 5,845 fine-grained clues. They further propose AV-Reasoner, a model trained with GRPO-based reinforcement learning and curriculum learning to generalize counting ability from related tasks. The benchmark supports both black-box (final-answer) and white-box (clue-grounded) evaluation, and experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. AV-Reasoner achieves state-of-the-art performance across multiple benchmarks, and the code and benchmark are publicly released.

📝 Abstract
Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, close-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle with audio-visual counting tasks
Existing benchmarks lack comprehensive multimodal coverage
Need for improved reasoning-based counting models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CG-AV-Counting benchmark with annotated clues
Proposes AV-Reasoner with GRPO and curriculum learning
Achieves state-of-the-art results via reinforcement learning
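The GRPO training mentioned above scores a group of sampled responses per prompt and normalizes each reward against the group's statistics, avoiding a learned value critic. A minimal sketch of that group-relative advantage step (function names are illustrative, not from the paper's code):

```python
import statistics

def grpo_advantages(group_rewards):
    """Compute group-relative advantages as in GRPO:
    each sampled response's reward is standardized against the
    mean and std of its own sampling group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]

# e.g. a counting task where 2 of 4 sampled answers match the ground truth:
# rewards = [1.0, 0.0, 1.0, 0.0] -> positive advantage for correct samples
```

Responses whose reward exceeds the group mean get positive advantages and are reinforced; below-average ones are suppressed, which is what lets accuracy-style counting rewards drive learning without a value network.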