🤖 AI Summary
Existing visual reasoning methods rely on task-specific architectures and supervised fine-tuning, hindering unified modeling across diverse tasks (e.g., segmentation, grounding, captioning, VQA) and modalities (vision and language).
Method: We propose DT-R1—the first reinforcement learning framework for visual reasoning that incorporates digital twin principles. It constructs a structured, interpretable digital twin representation of the input image, enabling heterogeneous outputs ranging from pixel-level masks to natural language descriptions. DT-R1 integrates large language models with the GRPO reinforcement learning algorithm and introduces a novel reward function that jointly optimizes twin structural fidelity and reasoning accuracy—eliminating the need for task-specific architectures.
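The summary says the reward jointly validates the digital twin's structural fidelity and the final output's accuracy. A minimal sketch of such a combined reward is below; the JSON twin schema (`objects`, `relations`), the IoU accuracy term, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual design:

```python
import json

def structural_reward(twin_str):
    """Return 1.0 if the generated digital twin parses as JSON with the
    required top-level fields, else 0.0. Field names are hypothetical."""
    try:
        twin = json.loads(twin_str)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if {"objects", "relations"} <= twin.keys() else 0.0

def accuracy_reward(pred_mask, gt_mask):
    """Task accuracy as IoU between predicted and ground-truth masks,
    represented here as sets of pixel indices (a segmentation example)."""
    union = len(pred_mask | gt_mask)
    return len(pred_mask & gt_mask) / union if union else 0.0

def combined_reward(twin_str, pred_mask, gt_mask, alpha=0.5):
    """Joint reward: weighted sum of twin structural fidelity and output
    accuracy -- the two terms a GRPO objective would score per rollout."""
    return (alpha * structural_reward(twin_str)
            + (1 - alpha) * accuracy_reward(pred_mask, gt_mask))

# Example: a well-formed twin paired with a partially overlapping mask.
twin = '{"objects": [{"id": 0, "label": "cat"}], "relations": []}'
r = combined_reward(twin, pred_mask={1, 2}, gt_mask={2, 3})
```

In GRPO, a scalar reward like this would be computed for each sampled rollout in a group and normalized into advantages, so malformed twins and inaccurate answers are penalized jointly rather than by separate task heads.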
Contribution/Results: DT-R1 achieves state-of-the-art performance across six benchmarks spanning two modalities and four task categories, demonstrating strong cross-task and cross-modality generalization without architectural customization.
📝 Abstract
Visual reasoning requires models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. We therefore propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations, as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and output accuracy. Evaluations on six visual reasoning benchmarks, covering two modalities and four task types, demonstrate that DT-R1 consistently outperforms state-of-the-art task-specific models. DT-R1 opens a new direction in which visual reasoning emerges from reinforcement learning with digital twin representations.