🤖 AI Summary
Existing visual reasoning approaches are constrained by purely textual chains, fixed schemas, or single-step pipelines, resulting in limited flexibility, poor interpretability, and weak cross-task generalization. This paper proposes a multimodal visual reasoning paradigm centered on executable code as a universal solver. It leverages multimodal large language models (MLLMs) to build a code generation and execution engine that supports dynamic tool invocation, composition, and self-verification; it introduces a balanced, adaptive tool-calling reward within PPO-based reinforcement learning, which elicits emergent capabilities including novel tool discovery, cross-task composition, and zero-shot transfer; and it integrates a visualization rendering module for transparent intermediate computation and traceable outputs. Evaluated on benchmarks spanning visual search, mathematical reasoning, and chart question answering, the method consistently outperforms chain-of-thought and schema-driven baselines and significantly surpasses GPT-4o and leading open-source MLLMs.
📝 Abstract
Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce an adaptive tool-call reward that balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the capabilities taught by atomic supervision, we empirically observe emergent behaviors during RL training: CodeDance exhibits novel tool invocations, unseen tool compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism for executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines but also surpasses advanced closed models such as GPT-4o and larger open-source models.
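The balanced, adaptive tool-call reward could take a shape like the following: reward task correctness, add a small per-call bonus up to a budget to encourage exploration, and penalize calls beyond the budget to discourage overuse. The budget and coefficients here are illustrative assumptions, not the paper's exact formula:

```python
def tool_call_reward(solved, n_calls, budget=4, bonus=0.2, penalty=0.1):
    """Hypothetical shaping term for a balanced tool-call reward.

    solved:  whether the final answer was correct (outcome reward)
    n_calls: number of tool calls the policy made in the rollout
    budget:  calls rewarded as useful exploration before penalties apply
    """
    r = 1.0 if solved else 0.0                  # task-outcome reward
    r += bonus * min(n_calls, budget)           # encourage tool use...
    r -= penalty * max(0, n_calls - budget)     # ...but penalize overuse
    return r
```

Under this shaping, a rollout that solves the task with a few calls scores higher than one that solves it by spamming tools, which is one way to realize the exploration-versus-efficiency balance the abstract describes.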