Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a challenge in vision-language models: imbalanced token counts between reasoning and answer segments during chain-of-thought (CoT) training often lead to verbose reasoning and inaccurate answers. To mitigate this, the authors propose SCALe, which introduces a segment-aware adaptive loss with curriculum scheduling during supervised fine-tuning (SFT). SCALe applies a dynamic, length-independent weighting strategy to supervise the reasoning and answer segments separately, and uses a cosine schedule to gradually shift training emphasis from reasoning to answer generation. The approach is architecture-agnostic and consistently outperforms standard SFT across multiple benchmarks, matching the accuracy of the full two-stage SFT+GRPO pipeline while requiring only about one-seventh of the training time. Further gains are realized when SCALe is combined with GRPO, yielding state-of-the-art performance.
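The summary does not spell out the scheduling formula; below is a minimal Python sketch of how a cosine schedule could shift weight from the reasoning segment to the answer segment over training. The function name, the endpoint weights, and the exact ramp are illustrative assumptions, not details from the paper.

```python
import math

def cosine_segment_weights(step: int, total_steps: int,
                           w_answer_start: float = 0.1,
                           w_answer_end: float = 0.9) -> tuple[float, float]:
    """Illustrative cosine schedule (assumed form): the <answer> weight
    rises from w_answer_start to w_answer_end as training progresses,
    while the <think> weight falls so the two always sum to 1."""
    progress = min(step / max(total_steps, 1), 1.0)
    # Cosine ramp: goes smoothly from 0 at the start to 1 at the end.
    ramp = 0.5 * (1.0 - math.cos(math.pi * progress))
    w_answer = w_answer_start + (w_answer_end - w_answer_start) * ramp
    return 1.0 - w_answer, w_answer  # (w_think, w_answer)
```

With these assumed defaults, early steps weight `<think>` tokens at 0.9 and `<answer>` tokens at 0.1, and the emphasis inverts by the end of training.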

📝 Abstract
Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
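To make the "dynamic, length-independent weighting" concrete, here is a sketch of how a segment-aware loss could be assembled from per-token cross-entropy values and the scheduled weights above. The tensor names and mask construction are assumptions; the key idea, taken from the abstract, is that each segment is normalized by its own token count so long <think> traces cannot dominate the loss.

```python
import torch

def scale_sft_loss(token_losses: torch.Tensor,
                   think_mask: torch.Tensor,
                   answer_mask: torch.Tensor,
                   w_think: float, w_answer: float) -> torch.Tensor:
    """Segment-aware loss sketch (assumed implementation).

    token_losses: per-token cross-entropy values, shape (seq_len,).
    think_mask / answer_mask: 0/1 float tensors marking which tokens
    belong to the <think> and <answer> segments, shape (seq_len,).
    """
    # Normalize each segment by its own length so the weighting is
    # independent of how verbose the reasoning trace is.
    think_loss = (token_losses * think_mask).sum() / think_mask.sum().clamp(min=1)
    answer_loss = (token_losses * answer_mask).sum() / answer_mask.sum().clamp(min=1)
    # Mix the two segment losses with the scheduled weights.
    return w_think * think_loss + w_answer * answer_loss
```

Under this reading, vanilla SFT corresponds to summing all token losses and dividing by the total length, which implicitly weights each segment by its token count; the per-segment normalization is what removes that length dependence.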
Problem

Research questions and friction points this paper is trying to address.

vision-language models
token imbalance
supervised fine-tuning
multimodal reasoning
reasoning verbosity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scheduled Curriculum Adaptive Loss
Chain of Thought
Vision-Language Models
Token-Imbalanced Supervision
Dynamic Loss Weighting
Shaked Perek
IBM Research
Ben Wiesel
IBM Research
Avihu Dekel
IBM Research
Deep Learning · Active Learning · Speech Synthesis
Nimrod Shabtay
IBM Research
Eli Schwartz
IBM Research