AI Summary
Current vision-language models are brittle at spatial reasoning under geometric transformations: they often fail on a transformed input even when they answer the original correctly. This work proposes SAGE, a self-evolving framework that uses geometric-linguistic dual consistency as its training objective. SAGE maintains a dynamic operation pool that continuously probes for model weaknesses, and it adds dual consistency as an auxiliary reward in GRPO-based reinforcement learning. The result is a model-agnostic, data-efficient, lightweight post-training stage. Experiments on video and spatial reasoning benchmarks show consistent improvements over strong baselines and better generalization to unseen data.
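To make the failure mode concrete, the following is a minimal sketch of one paired transformation with a predictable answer mapping. The horizontal flip, the left/right answer vocabulary, and all function names are illustrative assumptions; the summary does not say which operations SAGE actually uses.

```python
# Illustrative assumption: a horizontal flip as one example of a geometric
# operation whose effect on the answer is predictable (left <-> right).
FLIP_ANSWER_MAP = {"left": "right", "right": "left"}


def hflip(image):
    """Mirror an image represented as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def expected_after_flip(answer):
    """Map the original answer to the answer expected on the flipped input."""
    # Answers that do not depend on left/right are expected to stay the same.
    return FLIP_ANSWER_MAP.get(answer, answer)


def is_consistent(answer_original, answer_flipped):
    """True when the two answers agree under the flip's answer mapping."""
    return expected_after_flip(answer_original) == answer_flipped
```

A model that answers "left" on the original image and "left" again on the mirrored image is correct on the first instance but inconsistent across the pair, which is the gap the summary describes.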
Abstract
Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
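The two mechanisms described in the abstract, an auxiliary consistency reward inside GRPO and a dynamic operation pool, can be sketched roughly as follows. This is not the authors' implementation: the additive reward form, the `aux_weight` value, the retirement threshold, the minimum trial count, and every name here are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


def total_reward(correct, consistent, aux_weight=0.5):
    """Task reward plus an auxiliary duality-consistency bonus.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return float(correct) + aux_weight * float(consistent)


def group_advantages(rewards):
    """GRPO-style group-relative advantages: standardize rewards in a group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]


@dataclass
class OperationPool:
    """Tracks per-operation inconsistency and retires mastered operations."""

    retire_below: float = 0.05  # assumed failure-rate threshold for retirement
    min_trials: int = 20        # assumed minimum evidence before retiring
    stats: dict = field(default_factory=dict)  # name -> [failures, trials]

    def record(self, op, consistent):
        """Log one probe of operation `op` and whether the model was consistent."""
        failures, trials = self.stats.get(op, [0, 0])
        self.stats[op] = [failures + (not consistent), trials + 1]

    def failure_rate(self, op):
        failures, trials = self.stats[op]
        return failures / trials if trials else 1.0

    def active(self):
        """Operations still worth training on, hardest first."""
        keep = [
            op for op, (failures, trials) in self.stats.items()
            if trials < self.min_trials or failures / trials >= self.retire_below
        ]
        return sorted(keep, key=self.failure_rate, reverse=True)
```

In this sketch, an operation the model handles consistently drops out of `active()` once enough probes have been recorded, while operations with high failure rates are listed first, so sampling from the front of the list concentrates training on the most informative signals.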