Thyme: Think Beyond Images

📅 2025-08-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Open-source visual reasoning models lag behind proprietary counterparts (e.g., O3), which can perform diverse image manipulations and strengthen logical reasoning through code. Method: We propose Thyme (Think Beyond Images), a multimodal reasoning paradigm in which the model autonomously generates and executes code to carry out image operations (e.g., cropping, rotation, contrast enhancement) and mathematical computations, deciding for itself when and how to apply them. Training has two stages: supervised fine-tuning (SFT) on a curated dataset of 500K samples to teach code generation, followed by reinforcement learning (RL) on manually collected high-resolution question-answer pairs. For the RL stage we introduce GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), which applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. Results: On nearly 20 benchmarks, Thyme yields significant and consistent gains, particularly in high-resolution perception and complex reasoning tasks.

📝 Abstract

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models (e.g., O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT stage on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
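
The idea of applying distinct temperatures to text and code generation can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the function names, the `in_code` flag, and the specific temperature values (1.0 for text, 0.0 for code) are assumptions chosen for the example, since the abstract does not state them.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a temperature of 0 becomes greedy (one-hot)."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_temperature(in_code, text_temperature=1.0, code_temperature=0.0):
    """Hypothetical rule: explore in text, stay near-deterministic inside code."""
    return code_temperature if in_code else text_temperature

def sample_token(logits, in_code, rng):
    """Sample one token index using the temperature for the current context."""
    probs = softmax_with_temperature(logits, pick_temperature(in_code))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [1.0, 2.0, 3.0]
    print(sample_token(logits, in_code=True, rng=rng))   # always the argmax index
    print(sample_token(logits, in_code=False, rng=rng))  # varies with the seed
```

A lower temperature inside code spans reduces the chance that a single unlucky token breaks an otherwise valid program, while the higher temperature in free text keeps the reasoning diverse across rollouts.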

Problem

Research questions and friction points this paper is trying to address.

Enabling MLLMs to autonomously generate and execute image processing code

Enhancing logical reasoning capabilities through computational operations

Providing rich on-the-fly image manipulations and mathematical computations
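
To make the three points above concrete, here is a minimal sketch of a loop step in which a model-written code string crops an image region and then computes a statistic over it. Everything here is hypothetical: the helpers `crop` and `run_generated_code`, the nested-list image representation, and the convention that generated code stores its answer in `result` are assumptions for illustration. Restricting builtins this way is not a secure sandbox.

```python
def crop(image, left, top, right, bottom):
    """Return rows top..bottom-1 and columns left..right-1 of a 2D pixel grid."""
    return [row[left:right] for row in image[top:bottom]]

def run_generated_code(code, image):
    """Execute a code string that can see `image` and `crop`; return its `result`.

    Assumption: generated code writes its answer to a variable named `result`.
    """
    allowed_builtins = {"sum": sum, "len": len, "max": max, "min": min}
    namespace = {"image": image, "crop": crop, "__builtins__": allowed_builtins}
    try:
        exec(code, namespace)
    except Exception as exc:
        # A failure message can be fed back to the model instead of crashing.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"ok": True, "result": namespace.get("result")}

if __name__ == "__main__":
    image = [
        [0, 0, 0, 0],
        [0, 10, 20, 0],
        [0, 30, 40, 0],
        [0, 0, 0, 0],
    ]
    generated = (
        "patch = crop(image, 1, 1, 3, 3)\n"
        "result = sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))\n"
    )
    print(run_generated_code(generated, image))
```

Returning an error record rather than raising lets the surrounding reasoning loop decide whether to retry with corrected code or continue without the tool result.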

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous code generation for image manipulation

Two-stage training with SFT and RL

GRPO-ATS algorithm that balances reasoning exploration with code execution precision
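
The "group relative" part of Group Relative Policy Optimization refers to scoring each sampled response against the other responses drawn for the same prompt. The sketch below shows that normalization step in its commonly described form; it is not taken from the Thyme paper, and the reward values, the use of the population standard deviation, and the `eps` constant are assumptions for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged relative to their siblings rather than by a learned value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Hypothetical rewards for four rollouts of one question (1 = correct answer).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If every rollout gets the same reward, all advantages are zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When all rollouts in a group receive the same reward, the advantages vanish and that group contributes no learning signal, which is one reason harder questions can be useful during the RL stage.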
