Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a multimodal behavior generation approach intended to help calibrate user trust in socially interactive agents, and it examines the gender stereotypes that such generation can reproduce. It prompts a large language model to produce behaviors across language, prosody, gestures, and facial expressions that reflect chosen levels of the two trust dimensions studied: ability and benevolence. Behavioral sequences generated with GPT-5.4 are analyzed using Random Forest feature importance, and a within-subjects user study on Prolific validates the approach. The feature analysis shows that the generated behaviors align with theoretical expectations, and participants perceived levels of ability and benevolence consistent with the intended instructions. However, when gender is specified in the prompt, the model reproduces societal gender stereotypes, associating male agents with high ability and female agents with high benevolence, a risk that agent design must address.

📝 Abstract

As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent's actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents' behaviors with high ability and female agents' behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors, in line with the intended instructions.
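
The condition grid implied by the abstract (trait levels crossed with an optional gender cue) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' prompts: the two-level `low`/`high` scale, the prompt wording, and the names `build_prompt` and `all_conditions` are hypothetical. Only the four modalities and the two trust dimensions come from the abstract.

```python
from itertools import product

# Assumption: two levels per trait; the abstract only says "varying levels".
ABILITY_LEVELS = ("low", "high")
BENEVOLENCE_LEVELS = ("low", "high")
# None means the prompt does not mention gender at all.
GENDERS = (None, "female", "male")
# Modalities as listed in the abstract.
MODALITIES = ("text", "intonation", "facial expression", "gesture")


def build_prompt(ability, benevolence, gender=None):
    """Return a hypothetical instruction for one experimental condition."""
    agent = f"a {gender} agent" if gender else "an agent"
    return (
        f"Write one conversational turn for {agent} showing {ability} ability "
        f"and {benevolence} benevolence. Annotate each of these modalities: "
        f"{', '.join(MODALITIES)}."
    )


def all_conditions():
    """Enumerate every (ability, benevolence, gender) combination."""
    return list(product(ABILITY_LEVELS, BENEVOLENCE_LEVELS, GENDERS))


# One prompt per condition, keyed by the condition tuple.
prompts = {cond: build_prompt(*cond) for cond in all_conditions()}
```

Keeping the gender cue as a separate, optional factor makes it possible to compare gendered and ungendered generations for the same trait levels, which is the comparison the stereotype finding depends on.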

Problem

Research questions and friction points this paper is trying to address.

Trust Calibration

Socially Interactive Agents

Multimodal Behaviors

Gender Stereotypes

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust Calibration

Multimodal Behavior Generation

Large Language Models