STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

📅 2025-08-12
🤖 AI Summary
Current vision-language models (VLMs) are constrained by chain-of-thought reasoning paradigms in complex multimodal tasks, resulting in verbose outputs and poor generalization. This work proposes a topology-aware training framework that transcends linear reasoning limitations, enabling diverse reasoning structures—including tree- and graph-based topologies. Our core contributions are: (1) TopoAug, a topology-guided data augmentation strategy that synthesizes structured multimodal reasoning paths; and (2) Frugal Learning, a lightweight learning mechanism jointly optimizing inference efficiency and accuracy. Leveraging synergistic optimization across synthetic data generation, supervised fine-tuning, and reinforcement learning, the framework achieves efficient multimodal reasoning alignment. Experiments demonstrate state-of-the-art performance: +9.7% over baselines on MATH-V and +7.3% over Qwen2VL-72B-Instruct on VLM-S2H; up to +28.4% improvement over Phi-4-Multimodal-Instruct on five out-of-distribution benchmarks; and significant output length compression.

📝 Abstract
Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks. Our datasets have been released, and code will be made available.
Problem

Research questions and friction points this paper is trying to address.

Improves the efficiency of vision-language models on complex multimodal reasoning tasks
Addresses the limitations of chain-of-thought reasoning via diverse topological structures
Reduces output verbosity while maintaining high accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topology-aware reasoning with TopoAug data pipeline
Frugal Learning reduces output length efficiently
Combines supervised fine-tuning and reinforcement learning
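The paper's central distinction between chain, tree, and graph reasoning topologies can be made concrete with a small sketch. The code below is purely illustrative and not from the paper's released code; the `ReasoningGraph` class and `topology` method are hypothetical names. It models a reasoning trace as a directed graph of steps and classifies it: a linear sequence is a chain, branching without merging is a tree, and any step with multiple parents makes it a general graph.

```python
# Illustrative sketch (not the paper's implementation): reasoning traces
# as directed graphs, with chains and trees as special cases.
from collections import defaultdict

class ReasoningGraph:
    """A reasoning trace: each edge is a parent step -> child step link."""

    def __init__(self):
        self.children = defaultdict(list)  # step -> list of successor steps
        self.indegree = defaultdict(int)   # step -> number of parent steps

    def add_step(self, parent, child):
        self.children[parent].append(child)
        self.indegree[child] += 1

    def topology(self):
        # A step with multiple parents means two reasoning paths merge:
        # the trace is a general graph (DAG).
        if any(d > 1 for d in self.indegree.values()):
            return "graph"
        # A step with multiple children branches without merging: a tree.
        if any(len(c) > 1 for c in self.children.values()):
            return "tree"
        # Otherwise every step has at most one parent and one child: a chain.
        return "chain"
```

For example, a trace `A -> B -> C` classifies as `"chain"`, while `A -> B` plus `A -> C` is a `"tree"`, and adding `B -> D` and `C -> D` (two paths converging on step `D`) makes it a `"graph"`. A topology-aware pipeline in the spirit of TopoAug would synthesize and train on traces from all three classes rather than chains alone.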