Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning

📅 2026-02-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models exhibit limited performance on geometric reasoning tasks due to insufficient perceptual grounding in fundamental geometric primitives. To address this, this work proposes GeoDPO, a novel framework that introduces translator-guided reinforcement learning for geometric perception. The approach constructs the GeoPerceive benchmark via automatically synthesized data and leverages a natural language-to-domain-specific language (DSL) translator, combined with Direct Preference Optimization (DPO), to decouple perception from reasoning during training. Evaluated across in-distribution, out-of-distribution, and downstream reasoning tasks, GeoDPO achieves performance gains of 26.5%, 8.0%, and 39.0%, respectively, substantially outperforming supervised fine-tuning and significantly enhancing model generalization.
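For reference, the standard DPO objective that the summary alludes to can be written as follows. The summary does not say which variant GeoDPO uses, so this is the textbook form only. The mapping of symbols to this setting is an assumption: $x$ would be the diagram plus prompt, $y_w$ and $y_l$ the descriptions with the higher and lower DSL-level score, $\pi_{\mathrm{ref}}$ the frozen starting model, and $\beta$ a temperature-like hyperparameter.

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Minimizing this loss raises the likelihood of the preferred description relative to the rejected one, measured against the reference model, without training a separate reward model.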

📝 Abstract

Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: $+26.5\%$ on in-domain data, $+8.0\%$ on out-of-domain data, and $+39.0\%$ on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at https://github.com/Longin-Yu/GeoPerceive to ensure reproducibility.
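To make the idea of a DSL-level reward concrete, here is a minimal sketch of how translated model outputs might be scored against a ground-truth DSL program and turned into a preference pair. Everything specific in it is an assumption for illustration: the one-statement-per-line DSL syntax, the use of statement-level F1 as the score, the best-versus-worst pairing rule, and the names `parse_dsl`, `dsl_f1`, and `preference_pair`. The abstract does not specify the actual scoring function, and the NL-to-DSL translator itself is not modeled here; the inputs are assumed to be already translated.

```python
from typing import List, Optional, Set, Tuple


def parse_dsl(program: str) -> Set[str]:
    """Split a DSL program into a set of whitespace-normalized statements.

    Assumes (hypothetically) one statement per line, e.g. "Line(A,B)".
    """
    statements = set()
    for line in program.splitlines():
        normalized = "".join(line.split())  # drop all whitespace
        if normalized:
            statements.add(normalized)
    return statements


def dsl_f1(predicted: str, reference: str) -> float:
    """Statement-level F1 between a predicted and a reference DSL program."""
    pred, ref = parse_dsl(predicted), parse_dsl(reference)
    if not pred and not ref:
        return 1.0
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def preference_pair(
    candidates: List[str], reference: str, min_gap: float = 0.1
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from scored candidates, or None.

    Pairs the highest- and lowest-scoring candidates and skips the sample
    when their scores differ by less than min_gap (an assumed threshold).
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: dsl_f1(c, reference))
    rejected, chosen = ranked[0], ranked[-1]
    if dsl_f1(chosen, reference) - dsl_f1(rejected, reference) < min_gap:
        return None
    return chosen, rejected


if __name__ == "__main__":
    reference = "Line(A,B)\nLine(B,C)\nPerpendicular(AB,BC)"
    good = "Line(A, B)\nLine(B, C)\nPerpendicular(AB, BC)"
    bad = "Line(A,B)\nCircle(O)"
    print(dsl_f1(good, reference), dsl_f1(bad, reference))
    print(preference_pair([bad, good], reference))
```

Scoring at the statement level rather than by exact string match gives partial credit for partly correct perception, which is one plausible reading of "fine-grained" in the abstract. The resulting pairs are the kind of input a preference-optimization trainer would consume.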

Problem

Research questions and friction points this paper is trying to address.

geometric perception

vision-language models

geometric reasoning

diagram understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

translator-guided reinforcement learning

geometric perception

vision-language models

domain-specific language