CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition

📅 2025-06-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenges of dynamic interaction context modeling and inconsistent entity tracking in human-robot collaboration, this paper proposes a spatiotemporally grounded triplet reasoning framework. Methodologically, it introduces the first integration of vision-language models (VLMs) with instance-level object detection, action recognition, and cross-frame instance association to construct actor-action-object triplets—enabling fine-grained role differentiation, unique entity tracking, and temporally consistent contextual modeling. Evaluated on collaborative tasks including pouring, handing-over, and sorting, the system achieves high triplet generation accuracy and robustness, while supporting task generalization and scene scalability. Key contributions are: (1) a VLM-driven multimodal joint triplet inference mechanism; and (2) an end-to-end situational awareness architecture that jointly optimizes semantic understanding and spatiotemporal consistency.

Technology Category

Application Category

📝 Abstract
We introduce CARMA, a system for situational grounding in human-robot group interactions. Effective collaboration in such group settings requires situational awareness based on a consistent representation of present persons and objects coupled with an episodic abstraction of events regarding actors and manipulated objects. This calls for a clear and consistent assignment of instances, ensuring that robots correctly recognize and track actors, objects, and their interactions over time. To achieve this, CARMA uniquely identifies physical instances of such entities in the real world and organizes them into grounded triplets of actors, objects, and actions. To validate our approach, we conducted three experiments, where multiple humans and a robot interact: collaborative pouring, handovers, and sorting. These scenarios allow the assessment of the system's capabilities as to role distinction, multi-actor awareness, and consistent instance identification. Our experiments demonstrate that the system can reliably generate accurate actor-action-object triplets, providing a structured and robust foundation for applications requiring spatiotemporal reasoning and situated decision-making in collaborative settings.
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to recognize and track human-object interactions in group settings
Ensuring consistent identification of actors, objects, and actions over time
Providing structured situational awareness for collaborative human-robot tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines vision-language models with object recognition
Organizes entities into actor-object-action triplets
Ensures consistent instance identification over time
🔎 Similar Papers
No similar papers found.