RVTBench: A Benchmark for Visual Reasoning Tasks

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual reasoning benchmarks suffer from limitations inherent to LLM-generated queries: inadequate modeling of spatiotemporal relationships and multi-step reasoning chains in video, coupled with rigid, single-format outputs. To address this, we introduce RVTBench, a multimodal visual reasoning benchmark supporting diverse output formats (segmentation, grounding, VQA, and summary) and covering semantic, spatial, and temporal reasoning across four difficulty levels. It comprises 3,896 high-quality queries totaling over 1.2 million tokens, automatically generated from 200 video sequences via a digital twin-driven pipeline. We propose the unified reasoning visual task (RVT) formulation, in which the digital twin serves as a structured intermediary between perception and language, and design RVTagent, a multimodal agent framework enabling zero-shot cross-task generalization without task-specific fine-tuning. Empirically, RVTagent significantly outperforms pure LLM baselines on RVTBench, demonstrating the efficacy of these structural and methodological contributions.

📝 Abstract
Visual reasoning, the capability to interpret visual input in response to an implicit text query through multi-step reasoning, remains a challenge for deep learning models due to the lack of relevant benchmarks. Previous work in visual reasoning has primarily focused on reasoning segmentation, where models aim to segment objects based on implicit text queries. This paper introduces reasoning visual tasks (RVTs), a unified formulation that extends beyond traditional video reasoning segmentation to a diverse family of visual-language reasoning problems, and can therefore accommodate multiple output formats, including bounding boxes, natural language descriptions, and question-answer pairs. Correspondingly, we identify the limitations of current benchmark construction methods that rely solely on large language models (LLMs): because of their reliance on token representations, they inadequately capture complex spatial-temporal relationships and multi-step reasoning chains in video, resulting in benchmarks with artificially limited reasoning complexity. To address this limitation, we propose a novel automated RVT benchmark construction pipeline that leverages digital twin (DT) representations as structured intermediaries between perception and the generation of implicit text queries. Based on this method, we construct RVTBench, an RVT benchmark containing 3,896 queries of over 1.2 million tokens across four types of RVT (segmentation, grounding, VQA, and summary), three reasoning categories (semantic, spatial, and temporal), and four increasing difficulty levels, derived from 200 video sequences. Finally, we propose RVTagent, an agent framework for RVT that allows for zero-shot generalization across various types of RVT without task-specific fine-tuning.
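The unified RVT formulation described in the abstract can be pictured as a small data model: one implicit query over a video, tagged with a task type that determines the output format. The sketch below is illustrative only; the field names, enum values, and schema are assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(Enum):
    # The four RVT types named in the abstract.
    SEGMENTATION = "segmentation"
    GROUNDING = "grounding"
    VQA = "vqa"
    SUMMARY = "summary"

class ReasoningCategory(Enum):
    # The three reasoning categories named in the abstract.
    SEMANTIC = "semantic"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"

@dataclass
class RVTQuery:
    """One reasoning visual task instance: an implicit text query over
    a video sequence, with a task-dependent expected output format."""
    video_id: str
    query: str                   # implicit query requiring multi-step reasoning
    task: TaskType
    category: ReasoningCategory
    difficulty: int              # 1..4, per the paper's four difficulty levels

def expected_output_kind(task: TaskType) -> str:
    # Map each task type to its answer format, per the unified formulation.
    return {
        TaskType.SEGMENTATION: "per-frame segmentation masks",
        TaskType.GROUNDING: "bounding boxes",
        TaskType.VQA: "question-answer pair",
        TaskType.SUMMARY: "natural language description",
    }[task]
```

Under this framing, a single benchmark loader and a single agent interface can serve all four task types, dispatching on `task` rather than hard-coding one output format.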
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks for visual reasoning in deep learning models
Current benchmarks fail to capture complex spatial-temporal relationships
Need for unified formulation accommodating diverse visual language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified RVT formulation for diverse visual language tasks
Automated benchmark pipeline using digital twin representations
Agent framework enabling zero-shot generalization across RVTs
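To make the digital-twin idea above concrete, the toy sketch below builds a structured scene record (objects plus pairwise spatial relations) from pre-detected boxes and turns one relation into an implicit-query template. Everything here is an assumption for illustration: the function names, the `(x, y, w, h)` box format, and the template step (the paper's pipeline uses LLMs and real perception models, not templates and hand-fed detections).

```python
from itertools import combinations

def build_digital_twin(frames):
    """Toy digital twin: a structured record of objects and spatial
    relations per frame. `frames` is a list of per-frame detections,
    each detection a (object_id, class_name, (x, y, w, h)) tuple."""
    objects, relations = [], []
    for t, dets in enumerate(frames):
        for obj_id, cls, box in dets:
            objects.append({"id": obj_id, "class": cls, "frame": t, "box": box})
        for (a, _, box_a), (b, _, box_b) in combinations(dets, 2):
            if box_a[0] + box_a[2] < box_b[0]:   # a lies entirely left of b
                relations.append((a, "left_of", b, t))
    return {"objects": objects, "relations": relations}

def twin_to_query(twin):
    """Turn one structured relation into an implicit-query template.
    (A stand-in for the LLM-based query generation in the paper.)"""
    a, rel, b, t = twin["relations"][0]
    return f"Find the object that is {rel.replace('_', ' ')} object {b} at frame {t}."
```

The point of the intermediary is visible even in this toy version: queries are generated from explicit, verifiable scene structure rather than from an LLM's token-level guess about the video, which is what lets the pipeline scale reasoning complexity without hallucinated relations.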