🤖 AI Summary
This work addresses a critical gap in multimodal large language models (MLLMs): reasoning about out-of-view (OOV) content—objects, activities, and scenes beyond the image boundaries—in natural images. To this end, we introduce the first OOV visual question answering (VQA) paradigm and propose OpenView, a four-stage synthetic framework that generates spatially localizable, context-rich multiple-choice VQA samples. We further release OpenView-Dataset, the first high-quality synthetic dataset for OOV-VQA, and OpenView-Bench, a dual-dimensional evaluation benchmark assessing both answer correctness and reasoning plausibility. Fine-tuning MLLMs on OpenView yields substantial improvements: average OOV-VQA accuracy rises from 48.6% to 64.1%, significantly narrowing the gap with human performance. Our work establishes a foundational step toward embodied visual reasoning that transcends the visible frame.
📝 Abstract
Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly at reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that generates multiple-choice VQA at scale by leveraging panoramic imagery, enabling context-rich, spatially grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset built from diverse real-world panoramas to empower MLLMs via supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that although MLLMs lag far behind human performance in OOV VQA answer selection, once empowered by OpenView, multiple MLLMs consistently improve, with average accuracy rising from 48.6% to 64.1%. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.