🤖 AI Summary
This work addresses the lack of principled criteria for evaluating the effectiveness of reasoning fine-tuning and for selecting suitable multimodal reasoning data. The authors propose Dual Tuning, a framework that jointly assesses, under a given task and base model, both the contribution of training data to reasoning efficacy and the quality of generated chains of thought. This enables the categorization of data samples into those beneficial for reasoning training, direct answering, or neither, and provides quantitative selection guidelines. Integrating reasoning efficacy evaluation, chain-of-thought analysis, reinforcement learning, and multimodal modeling, Dual Tuning facilitates co-optimization of data selection and training strategies. Experiments demonstrate that the framework substantially improves data utilization efficiency and model reasoning performance on spatial, mathematical, and interdisciplinary tasks.
📝 Abstract
While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.