VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing VTR benchmarks suffer from a simulation-to-reality gap, oversimplified tasks, and unidimensional reasoning evaluation. To address these limitations, we introduce VisualTrans, the first VTR benchmark grounded in real-world human-object interactions, covering spatial, procedural, and quantitative reasoning across 12 manipulation tasks, 6 subtask types, and 472 high-quality question-answer pairs. We propose a systematic evaluation framework featuring multi-dimensional reasoning assessment and a scalable data construction pipeline: leveraging first-person manipulation videos, we integrate large language models for automated metadata annotation, image-pair extraction, and structured question generation, followed by rigorous human verification. Experiments reveal that current vision-language models (VLMs) perform reasonably well on static spatial understanding but exhibit fundamental deficiencies in multi-step reasoning, intermediate state recognition, and transformation sequence planning, establishing VisualTrans as a rigorous new benchmark and clarifying critical directions for future VTR research.

📝 Abstract
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image-pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance on static spatial tasks but reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.
Problem

Research questions and friction points this paper is trying to address.

Addresses sim-to-real gap in visual transformation reasoning
Evaluates spatial, procedural, quantitative reasoning dimensions
Improves dynamic multi-step reasoning in VTR systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive real-world benchmark for VTR
Scalable data construction pipeline
Automated metadata annotation with LMMs
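Since the benchmark mixes multiple-choice, open-ended counting, and target-enumeration answers, each format needs its own scoring rule. A minimal sketch of per-format scoring (hypothetical function and rules for illustration, not the authors' released evaluation code) might look like:

```python
def score_answer(fmt: str, pred: str, gold: str) -> float:
    """Score one QA pair by answer format.

    Hypothetical scoring rules: exact match for multiple-choice and
    counting, set-level F1 for enumeration answers.
    """
    if fmt == "multiple_choice":
        # Compare option letters, ignoring case and whitespace.
        return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0
    if fmt == "counting":
        # Open-ended counting: the predicted integer must match exactly.
        return 1.0 if int(pred) == int(gold) else 0.0
    if fmt == "enumeration":
        # Target enumeration: F1 between predicted and gold item sets.
        p = {x.strip().lower() for x in pred.split(",") if x.strip()}
        g = {x.strip().lower() for x in gold.split(",") if x.strip()}
        tp = len(p & g)
        if tp == 0:
            return 0.0
        prec, rec = tp / len(p), tp / len(g)
        return 2 * prec * rec / (prec + rec)
    raise ValueError(f"unknown answer format: {fmt}")
```

For example, a prediction of "cup, plate" against a gold answer of "cup, bowl" shares one of two items on each side, giving an enumeration F1 of 0.5 under these assumed rules.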
Yuheng Ji
Institute of Automation, Chinese Academy of Sciences
Embodied AI, Computer Vision

Yipu Wang
School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences

Yuyang Liu
Institute of Automation, Chinese Academy of Sciences

Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
Vision and Language

Yue Liu
Institute of Automation, Chinese Academy of Sciences

Yuting Zhao
Institute of Automation, Chinese Academy of Sciences
Computer Vision

Huaihai Lyu
Institute of Automation
Multi-modal, Embodied Intelligence

Xiaolong Zheng
School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences