🤖 AI Summary
Current vision-language models (VLMs) are constrained by chain-of-thought reasoning paradigms on complex multimodal tasks, resulting in verbose outputs and poor generalization. This work proposes STELAR-Vision, a topology-aware training framework that transcends linear reasoning limitations, enabling diverse reasoning structures, including tree- and graph-based topologies. Its core contributions are: (1) TopoAug, a topology-guided synthetic data pipeline that enriches training with structured multimodal reasoning paths; and (2) Frugal Learning, a lightweight mechanism that reduces output length with minimal accuracy loss. By combining synthetic data generation, supervised fine-tuning, and reinforcement learning, the framework achieves efficient multimodal reasoning alignment. Experiments demonstrate state-of-the-art performance: on MATH-V and VLM-S2H, accuracy improves by 9.7% over the base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%; on five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4%; and output length is significantly compressed.
📝 Abstract
Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms it across all OOD benchmarks. We have released our datasets, and code will be made available.