StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

📅 2026-05-05
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current vision-language models struggle with fine-grained numerical reasoning about object states and manipulable regions in robotic tasks. This work proposes a novel fine-tuning strategy that adds an Auxiliary Regression Loss (ARL), computed from the outputs of a bounding-box decoder, to vision-language model training. The loss improves localization of objects, their states, and graspable regions, while inference still uses standard sequence prediction. The authors also introduce OSAR, an open-source benchmark for object-state affordance reasoning, and report average gains of 1.6% on the RefCOCO suite (RefCOCO, RefCOCO+, RefCOCOg) and 5.2% on OSAR over models trained without ARL. ARL also improves the consistency of model outputs on OSAR's affordance reasoning task.
📝 Abstract
Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
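The idea of adding a regression term to the usual sequence loss during fine-tuning can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the regression loss, the box format, or the weighting, so the L1 loss, the normalized `(x1, y1, x2, y2)` boxes, the `arl_weight` parameter, and all function names below are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy over a sequence of logit vectors."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp of the logits for this token.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)


def l1_box_loss(pred_boxes, gt_boxes):
    """Mean absolute error over box coordinates (assumed normalized x1, y1, x2, y2)."""
    diffs = [
        abs(p - g)
        for pred, gt in zip(pred_boxes, gt_boxes)
        for p, g in zip(pred, gt)
    ]
    return sum(diffs) / len(diffs)


def training_loss(logits, target_ids, pred_boxes, gt_boxes, arl_weight=1.0):
    """Sequence loss plus a weighted auxiliary regression term.

    Hypothetical combination: the weighting scheme is an assumption,
    not taken from the paper.
    """
    seq_loss = cross_entropy(logits, target_ids)
    aux_loss = l1_box_loss(pred_boxes, gt_boxes)
    return seq_loss + arl_weight * aux_loss


# Toy example: one token with two equally likely classes, one box off by 0.1 in x2.
loss = training_loss(
    logits=[[0.0, 0.0]],
    target_ids=[0],
    pred_boxes=[[0.1, 0.2, 0.5, 0.6]],
    gt_boxes=[[0.1, 0.2, 0.4, 0.6]],
)
print(round(loss, 4))
```

In this sketch the auxiliary term only affects training: at inference the box decoder would be unused and the model would emit coordinates as ordinary tokens, which matches the abstract's statement that standard sequence prediction is preserved.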
Problem

Research questions and friction points this paper is trying to address.

numerical reasoning
object-state localization
vision-language models
affordance reasoning
object detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

StateVLM
Auxiliary Regression Loss
object-state localization
affordance reasoning
vision-language model