🤖 AI Summary
This work addresses two interrelated bottlenecks: the low reliability and poor interpretability of vision-language-action (VLA) models, and the limited flexibility of cognitive architectures (CAs). We propose a bidirectional synergistic paradigm integrating VLAs with the symbolic cognitive architecture DIARC. Using inter-layer linear probing, we demonstrate for the first time that OpenVLA's hidden layers encode decodable symbolic representations of object attributes, spatial relations, and action states. Building on this insight, we design an end-to-end DIARC-OpenVLA integration system that enables real-time symbolic state monitoring and closed-loop control. On the LIBERO-spatial benchmark, our approach achieves multi-layer state recognition accuracy exceeding 0.90. By unifying sub-symbolic perception-action learning with symbolic reasoning in a single embodied framework, this work overcomes the traditional dichotomy between black-box decision-making and explicit symbolic inference, establishing a novel architecture for embodied intelligence that simultaneously ensures robustness, interpretability, and adaptability.
📝 Abstract
Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CAs) excel in symbolic reasoning and state monitoring but are constrained by rigid, predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across the layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies (>0.90) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. We demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.
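The layer-wise probing the abstract describes can be sketched as a linear classifier trained on frozen hidden-state activations. The sketch below is illustrative only: the synthetic features, the 64-dimensional size, and the binary "grasped" label are stand-ins for real OpenVLA (Llama backbone) activations and LIBERO-spatial state annotations, which the paper itself defines.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for one layer's hidden states: in the real setup these
# would be per-timestep activations from an OpenVLA layer; here we synthesize
# vectors whose first dimension weakly encodes a binary symbolic state label
# (e.g. "object grasped" vs. "not grasped").
n_samples, hidden_dim = 400, 64
labels = rng.integers(0, 2, size=n_samples)
hidden = rng.normal(size=(n_samples, hidden_dim))
hidden[:, 0] += 3.0 * labels  # inject a linearly decodable signal

X_train, X_test, y_train, y_test = train_test_split(
    hidden, labels, test_size=0.25, random_state=0, stratify=labels
)

# The linear probe: a logistic-regression classifier fit on frozen activations.
# High held-out accuracy is taken as evidence the layer encodes the state.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = probe.score(X_test, y_test)
print(f"probe accuracy: {acc:.2f}")
```

In practice one such probe is trained per layer and per symbolic predicate, and the resulting accuracy profile across layers is what supports claims about where object and action states are encoded.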