AI Summary
Existing multimodal large language models optimize visual tokens only indirectly through text prediction, lacking fine-grained supervision and resulting in coarse visual understanding. This work proposes DV-SFT, a method that leverages the precise correspondence between image patches and text in OCR scenarios to automatically generate word-level labels for visual tokens. These labels enable end-to-end fine-tuning via the standard next-token prediction objective. DV-SFT is the first approach to provide direct, explicit, and architecture-agnostic fine-grained supervision for visual tokens, without requiring additional decoders or extra forward computations. Evaluated on three in-domain and four out-of-domain benchmarks, DV-SFT consistently and significantly outperforms standard supervised fine-tuning, effectively enhancing the model's fine-grained visual comprehension and multimodal alignment capabilities.
Abstract
Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevitably rely on auxiliary components such as additional decoders or forward passes, because visual tokens lack readily interpretable labels. This limits their practical applicability. In this work, we propose Direct Vision Supervised Fine-Tuning (DV-SFT), which constructs explicit, token-level supervision for visual tokens and trains them through the same next-token prediction objective used for text. Specifically, we exploit the direct vision-text correspondence in OCR-related scenarios and automatically label each visual token with the word in its corresponding image patch. DV-SFT treats the MLLM as a black box, requiring no architectural modifications or additional forward passes. Extensive experiments demonstrate the superiority of direct vision supervision. DV-SFT consistently outperforms standard SFT across three in-domain and four out-of-domain benchmarks. Further analyses show that vision supervision effectively enhances fine-grained visual understanding and achieves higher multimodal alignment efficiency.
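The labeling idea in the abstract (each visual token receives the word found in its image patch) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a square, non-overlapping patch grid in row-major order, OCR output given as word-level pixel boxes, and a largest-overlap rule for choosing a patch's word. The names `label_patches` and `overlap_area` are hypothetical, and the abstract does not say how the paper resolves patches that contain several words or none.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def label_patches(
    image_w: int,
    image_h: int,
    patch: int,
    ocr_words: List[Tuple[str, Box]],
) -> List[Optional[str]]:
    """Assign each patch the OCR word that overlaps it most (assumed rule).

    Patches are visited in row-major order, matching the usual order of
    visual tokens in a ViT-style encoder (an assumption). Patches with no
    overlapping word get None, meaning "no vision label for this token".
    """
    labels: List[Optional[str]] = []
    for top in range(0, image_h, patch):
        for left in range(0, image_w, patch):
            cell = (left, top, left + patch, top + patch)
            best_word, best_area = None, 0.0
            for word, box in ocr_words:
                area = overlap_area(cell, box)
                if area > best_area:
                    best_word, best_area = word, area
            labels.append(best_word)
    return labels


# Toy example: a 4x4 image split into four 2x2 patches, two OCR words.
words = [("cat", (0, 0, 3, 2)), ("dog", (2.5, 0, 4, 2))]
print(label_patches(4, 4, 2, words))
```

In a training loop, the non-empty labels would presumably be tokenized and used as next-token targets at the visual token positions, while unlabeled patches would be excluded from the loss, for example through the `ignore_index` of PyTorch's cross-entropy loss. Both of these details are assumptions about how such supervision could be wired up, not statements from the paper.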