Artemis: Structured Visual Reasoning for Perception Policy Learning

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing natural-language chain-of-thought approaches for vision-perception reinforcement learning are misaligned with the spatial, object-centric nature of visual tasks because they reason in an unstructured linguistic space. To address this, we propose Artemis, a framework that represents each intermediate reasoning step as a structured (label, bounding-box) pair, enabling verifiable state tracking, direct supervision of proposal quality, ambiguity-free interpretation, and tight alignment between reasoning and visual geometry. Artemis is built on Qwen2.5-VL-3B and integrates proposal-driven reasoning, object detection, and semantic labeling to support cross-task generalization. Experiments demonstrate substantial improvements over baselines on localization and detection tasks, successful generalization to counting and geometric-perception tasks, and competitive performance on general multimodal LLM benchmarks. These results validate the effectiveness and broad applicability of spatially structured reasoning for vision-language tasks.
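The summary describes each reasoning step as a (label, bounding-box) pair that can be verified against ground truth. The sketch below is illustrative only, not the paper's implementation: the `Proposal` class, the IoU threshold of 0.5, and the binary `proposal_reward` are assumptions chosen to show how such a step admits direct, unambiguous supervision.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """One structured reasoning step: a semantic label anchored to a box."""
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixel coordinates

def iou(a, b):
    """Intersection-over-union between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def proposal_reward(step, gt_label, gt_box, thresh=0.5):
    """Verifiable per-step supervision: label match gated by spatial overlap.
    A hypothetical reward; the paper's actual reward design is not shown here."""
    return float(step.label == gt_label and iou(step.box, gt_box) >= thresh)
```

Because the step is a geometric object rather than free-form text, its correctness is a simple computation, which is the "direct supervision" property the summary highlights.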

📝 Abstract
Recent reinforcement-learning frameworks for visual perception policies have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states and direct supervision of proposal quality, and avoids the ambiguity introduced by language-based reasoning. Built on Qwen2.5-VL-3B, Artemis achieves strong performance on grounding and detection tasks and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.
Problem

Research questions and friction points this paper is trying to address.

Improves visual perception policy learning with structured reasoning
Replaces linguistic reasoning with spatial, object-centric proposal steps
Enables explicit state tracking and reduces ambiguity in perception tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured proposal-based reasoning with bounding boxes
Explicit tracking of intermediate visual states
Spatial representation alignment for perception policies
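The "explicit tracking of intermediate visual states" listed above can be made concrete with a small sketch. This is an illustrative assumption, not the paper's algorithm: `track_states` greedily matches each proposal step against an unmatched ground-truth object, so the whole reasoning trace is inspectable step by step, unlike a free-form text chain.

```python
def iou(a, b):
    """Intersection-over-union between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def track_states(steps, ground_truth, thresh=0.5):
    """Audit a chain of (label, box) reasoning steps against ground-truth
    objects. Returns one (label, box, verified) record per step, greedily
    consuming each ground-truth object at most once. Hypothetical matching
    scheme; shown only to illustrate verifiable intermediate states."""
    trace, matched = [], set()
    for label, box in steps:
        hit = next(
            (i for i, (gl, gb) in enumerate(ground_truth)
             if i not in matched and label == gl and iou(box, gb) >= thresh),
            None,
        )
        if hit is not None:
            matched.add(hit)
        trace.append((label, box, hit is not None))
    return trace
```

Each record in the trace is a checkable visual state, so failures can be localized to the exact step where a proposal diverged from the scene.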