AI Summary
Current social robots exhibit limitations in multi-turn interactions, social relationship reasoning, and long-context dialogue. This work proposes ARIS, a modular agent framework that explicitly constructs a user-centric social relationship knowledge graph and integrates multimodal reasoning (speech, vision, and action) with retrieval-augmented generation (RAG) to enable cross-session user identification and low-latency, highly relevant dialogue. The system employs a scalable RAG pipeline and structured API integration to support large-scale context understanding. A user study (N=23) with the Pepper robot shows that ARIS scores significantly higher than a large language model baseline on perceived intelligence, animacy, anthropomorphism, and likeability.
Abstract
Foundational models have advanced social robotics, enabling richer perception and communicative interaction with users. However, current systems still struggle with multi-turn engagement, social-relationship reasoning, and contextually grounded dialogue at scale. We present ARIS (Agentic and Relationship Intelligence System), an agentic AI framework that unifies multimodal reasoning, a graph-based Social World Model, and retrieval-augmented generation (RAG) within a single modular architecture for social robots. We evaluate ARIS with the Pepper robot in a robot-mediated dyadic conversational setting, comparing it against a large language model baseline. A user study (N=23) shows that ARIS yields significantly higher perceived intelligence, animacy, anthropomorphism, and likeability. Our contributions are threefold: (1) a Social World Model that explicitly maps and updates social relationships between users through a knowledge graph, enabling social reasoning and re-identification across encounters; (2) an efficient RAG-based conversational pipeline that maintains bounded latency as dialogue histories grow to thousands of exchanges while preserving response relevance; and (3) system integration and empirical validation of these components within a modular agentic architecture that coordinates speech, vision, and physical action through structured APIs. The implementation of ARIS will be released as open source upon publication.
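To make the first two contributions concrete, the following is a minimal illustrative sketch and not the ARIS implementation. It assumes a plain adjacency map for the relationship graph and uses word overlap as a stand-in for a real retriever; the names `SocialGraph`, `relate`, `relations_of`, and `retrieve` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


class SocialGraph:
    """Toy user-centric relationship graph (hypothetical, not the ARIS API)."""

    def __init__(self):
        # Maps each user to {other_user: relation label}.
        self.edges = defaultdict(dict)

    def relate(self, a, b, relation):
        """Add or update an undirected relationship between two users."""
        self.edges[a][b] = relation
        self.edges[b][a] = relation

    def relations_of(self, user):
        """Return the known relationships of a user, sorted by name."""
        return sorted(self.edges.get(user, {}).items())


def retrieve(history, query, k=3):
    """Return at most k past utterances sharing the most words with the query.

    Word overlap is an assumed placeholder for an embedding-based retriever.
    Only k items reach the prompt, however long the history grows.
    """
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(utterance.lower().split())), index, utterance)
        for index, utterance in enumerate(history)
    ]
    # Highest overlap first; ties are broken by the older utterance.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:k]
    return [utterance for score, _, utterance in top if score > 0]
```

In this sketch the prompt size stays bounded by `k`, but the scan over the history is still linear; a system that keeps latency bounded over thousands of exchanges would presumably replace the scan with an indexed vector store.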