Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the problem of inferring intent from noisy, incomplete behavioral observations when the agent cannot communicate through language, as with household pets or pre-verbal infants. The authors propose CatSignal, a Bayesian-inspired framework that treats spatial context as a prior-like constraint rather than an ordinary input feature. CatSignal combines context, pose dynamics, and acoustic cues through a context-gated Product-of-Experts formulation to produce posterior-like intent distributions. Under leave-one-video-out evaluation on a multimodal domestic cat dataset, CatSignal reaches 77.72% accuracy, ahead of feature concatenation (71.83%) and stronger late-fusion baselines, and it substantially reduces context-driven shortcut failures in ambiguous cases. Simpler fusion strategies remain competitive in Macro-F1 and selective prediction.
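The leave-one-video-out protocol mentioned above can be illustrated with a minimal sketch: every clip from one video is held out for testing while the model trains on clips from all other videos, so no video contributes to both sides of a fold. The function name `leave_one_video_out` and the example video IDs are hypothetical, and the paper's actual splitting code is not shown in this summary.

```python
from collections import defaultdict


def leave_one_video_out(video_ids):
    """Yield (held_out_video, train_indices, test_indices) for each video.

    video_ids: one entry per sample, naming the source video of that sample.
    This is a generic grouped split, not the authors' implementation.
    """
    by_video = defaultdict(list)
    for idx, vid in enumerate(video_ids):
        by_video[vid].append(idx)

    for held_out in sorted(by_video):
        test_idx = by_video[held_out]
        # Training data is every sample whose video differs from the held-out one.
        train_idx = sorted(
            i for vid, idxs in by_video.items() if vid != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical example: five clips drawn from three videos.
example_ids = ["a", "a", "b", "c", "c"]
folds = list(leave_one_video_out(example_ids))
```

Splitting by video rather than by clip avoids leakage between clips of the same recording, which would otherwise inflate accuracy.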
📝 Abstract
Many agents in real-world environments cannot reliably communicate their goals through language, including household pets, pre-verbal infants, and other non-speaking embodied agents. In such settings, intent must be inferred from incomplete behavioral observations in context-rich environments. This creates a core ambiguity: observable behavior is often noisy or underspecified, while context provides strong prior information but can also induce brittle shortcut predictions if used naively. We present CatSignal, a Bayesian-inspired probabilistic framework for multimodal intent inference that models spatial context as a prior-like constraint and behavioral observations as evidence. Rather than treating context as an ordinary input feature, our method uses a context-gated Product-of-Experts formulation to compute posterior-like intent distributions from context, pose dynamics, and acoustic cues. We instantiate this formulation in a household cat setting as a focused proof-of-concept for intent inference in non-speaking agents. Under Leave-One-Video-Out evaluation on a multimodal domestic cat dataset, the proposed prior-guided fusion achieves the best overall accuracy of 77.72%, outperforming feature concatenation (71.83%) and stronger late-fusion baselines. More importantly, it substantially reduces context-driven shortcut failures in ambiguous cases. While simpler fusion strategies remain competitive in Macro-F1 and selective prediction, the proposed model provides the strongest overall accuracy and the best suppression of context-based shortcut collapse.
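To make the idea of a context-gated Product-of-Experts concrete, here is a minimal sketch of one possible reading: the context prior and each modality's evidence are multiplied (added in log space), and a scalar gate controls how strongly the prior counts. The abstract does not give the exact formulation, so the gate definition, the function names, and the three-class example values below are all assumptions for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def context_gated_poe(log_prior, expert_log_scores, gate):
    """Fuse a context prior with per-modality evidence in log space.

    log_prior: (K,) log-probabilities of each intent given context.
    expert_log_scores: list of (K,) log-scores, one per modality expert.
    gate: scalar in [0, 1]; hypothetical weight on the context prior
          (0 ignores context, 1 applies it at full strength).
    """
    log_post = gate * np.asarray(log_prior, dtype=float)
    for scores in expert_log_scores:
        log_post = log_post + np.asarray(scores, dtype=float)
    # Normalizing the product of experts gives a posterior-like distribution.
    return softmax(log_post)


# Hypothetical three-intent example: context strongly favors intent 0,
# while pose and audio evidence mildly favor intent 1.
prior = np.log([0.90, 0.05, 0.05])
pose = np.log([0.30, 0.50, 0.20])
audio = np.log([0.40, 0.40, 0.20])

with_context = context_gated_poe(prior, [pose, audio], gate=1.0)
without_context = context_gated_poe(prior, [pose, audio], gate=0.0)
```

In this toy case the full-strength prior overrides the behavioral evidence, while closing the gate lets pose and audio decide. That is the kind of context-driven shortcut the abstract describes, and a gate that responds to ambiguity is one plausible way to suppress it.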
Problem

Research questions and friction points this paper is trying to address.

intent inference
non-speaking agents
context prior
behavioral ambiguity
multimodal observation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian-inspired inference
context as prior
Product-of-Experts
multimodal intent recognition
non-speaking agents