Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a challenge in vision-language models: imbalanced token counts between reasoning and answer segments during chain-of-thought (CoT) training often lead to verbose reasoning and inaccurate answers. To mitigate this, the authors propose SCALe, which introduces a segment-aware adaptive loss with curriculum scheduling during supervised fine-tuning (SFT). SCALe applies a dynamic, length-independent weighting strategy to supervise the reasoning and answer segments separately, and uses a cosine schedule to gradually shift training emphasis from reasoning to answer generation. The approach is architecture-agnostic and consistently outperforms standard SFT across multiple benchmarks, matching the accuracy of the full two-stage SFT+GRPO pipeline while requiring only about one-seventh of the training time. Further gains are realized when SCALe is combined with GRPO, yielding state-of-the-art performance.
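The summary does not spell out the scheduling formula; below is a minimal Python sketch of how a cosine schedule could shift weight from the reasoning segment to the answer segment over training. The function name, the endpoint weights, and the exact ramp are illustrative assumptions, not details from the paper.

```python
import math

def cosine_segment_weights(step: int, total_steps: int,
                           w_answer_start: float = 0.1,
                           w_answer_end: float = 0.9) -> tuple[float, float]:
    """Illustrative cosine schedule (assumed form): the <answer> weight
    rises from w_answer_start to w_answer_end as training progresses,
    while the <think> weight falls so the two always sum to 1."""
    progress = min(step / max(total_steps, 1), 1.0)
    # Cosine ramp: goes smoothly from 0 at the start to 1 at the end.
    ramp = 0.5 * (1.0 - math.cos(math.pi * progress))
    w_answer = w_answer_start + (w_answer_end - w_answer_start) * ramp
    return 1.0 - w_answer, w_answer  # (w_think, w_answer)
```

With these assumed defaults, early steps weight `<think>` tokens at 0.9 and `<answer>` tokens at 0.1, and the emphasis inverts by the end of training.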

📝 Abstract
Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
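To make the "dynamic, length-independent weighting" concrete, here is a sketch of how a segment-aware loss could be assembled from per-token cross-entropy values and the scheduled weights above. The tensor names and mask construction are assumptions; the key idea, taken from the abstract, is that each segment is normalized by its own token count so long <think> traces cannot dominate the loss.

```python
import torch

def scale_sft_loss(token_losses: torch.Tensor,
                   think_mask: torch.Tensor,
                   answer_mask: torch.Tensor,
                   w_think: float, w_answer: float) -> torch.Tensor:
    """Segment-aware loss sketch (assumed implementation).

    token_losses: per-token cross-entropy values, shape (seq_len,).
    think_mask / answer_mask: 0/1 float tensors marking which tokens
    belong to the <think> and <answer> segments, shape (seq_len,).
    """
    # Normalize each segment by its own length so the weighting is
    # independent of how verbose the reasoning trace is.
    think_loss = (token_losses * think_mask).sum() / think_mask.sum().clamp(min=1)
    answer_loss = (token_losses * answer_mask).sum() / answer_mask.sum().clamp(min=1)
    # Mix the two segment losses with the scheduled weights.
    return w_think * think_loss + w_answer * answer_loss
```

Under this reading, vanilla SFT corresponds to summing all token losses and dividing by the total length, which implicitly weights each segment by its token count; the per-segment normalization is what removes that length dependence.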
Problem

Research questions and friction points this paper is trying to address.

vision-language models
token imbalance
supervised fine-tuning
multimodal reasoning
reasoning verbosity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scheduled Curriculum Adaptive Loss
Chain of Thought
Vision-Language Models
Token-Imbalanced Supervision
Dynamic Loss Weighting
Shaked Perek
IBM Research
Ben Wiesel
IBM Research
Avihu Dekel
IBM Research
Deep Learning · Active Learning · Speech Synthesis
Nimrod Shabtay
IBM Research
Eli Schwartz
IBM Research