Belief-Aware VLM Model for Human-like Reasoning

📅 2026-04-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) lack an explicit mechanism for representing and updating human beliefs as they evolve, which limits their ability to infer intent in long-horizon tasks. This work proposes a belief-aware VLM framework that, instead of learning an explicit belief model, approximates belief with a vector-based memory. A retrieval step pulls relevant multimodal context from that memory and passes it to the VLM for reasoning, and a reinforcement learning policy over the VLM latent space refines decision-making. Evaluated on publicly available VQA datasets such as HD-EPIC, the method shows consistent improvements over zero-shot baselines.

📝 Abstract

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
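
The retrieval step described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the `VectorMemory` class, the `build_prompt` helper, the toy 3-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration. The abstract states only that a vector-based memory retrieves relevant multimodal context, which is then incorporated into the VLM.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class VectorMemory:
    """Hypothetical belief memory holding (embedding, description) pairs."""
    entries: list = field(default_factory=list)

    def add(self, embedding, text):
        self.entries.append((embedding, text))

    def retrieve(self, query, k=2):
        """Return the k stored descriptions most similar to the query."""
        ranked = sorted(
            self.entries, key=lambda e: cosine(query, e[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


def build_prompt(question, retrieved):
    """Prepend retrieved context to the question (assumed prompt format)."""
    context = "\n".join(f"- {t}" for t in retrieved)
    return f"Context from memory:\n{context}\nQuestion: {question}"


memory = VectorMemory()
# Toy embeddings; a real system would store encoder outputs instead.
memory.add([1.0, 0.0, 0.0], "person picked up a knife")
memory.add([0.9, 0.1, 0.0], "person placed an onion on the board")
memory.add([0.0, 0.0, 1.0], "person opened the fridge")

top = memory.retrieve([1.0, 0.05, 0.0], k=2)
prompt = build_prompt("What will the person do next?", top)
print(prompt)
```

In this sketch the two cutting-related observations rank above the unrelated one, so only they reach the prompt. The memory contents stand in for a belief state without any explicit belief model being trained.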

Problem

Research questions and friction points this paper is trying to address.

belief representation

intent inference

Vision Language Models

human-like reasoning

long-horizon reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

belief-aware reasoning

vision-language models

retrieval-based memory