AI Summary
This work addresses the difficulty large language models (LLMs) have in interpreting implicit, ambiguous, or conflicting social norms in embodied environments, which often leads to failures in pronoun and reference resolution. The paper introduces the first norm-based reference resolution (NBRR) task and presents SNIC, a scenario-based diagnostic benchmark that integrates physical and social contextual cues across everyday activities such as cleaning, organizing, and serving. The dataset is human-validated so that it can rigorously evaluate socially grounded reasoning. Experiments show that even the most advanced LLMs fall short on tasks requiring inference over implicit social conventions, revealing a gap in their ability to understand and apply unwritten social rules during embodied interaction.
Abstract
Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.