Evaluating Cultural and Social Awareness of LLM Web Agents

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether large language models (LLMs) operating as web agents can perceive and respond appropriately to cultural and social norms in real-world scenarios—specifically online shopping and social forums—with emphasis on detecting norm-violating queries and resisting misleading content. Method: The authors introduce CASA, a benchmark designed to evaluate cultural and social awareness in LLM agents, together with an automated evaluation framework measuring three dimensions: awareness coverage, helpfulness, and violation rate. Contribution/Results: Experiments reveal that current LLM agents achieve less than 10% awareness coverage and over 40% violation rates when facing misleading web content, performing substantially worse than the same models in non-agent settings. To improve performance, the authors combine prompting with fine-tuning on culture-specific datasets and find the two complementary: fine-tuning enhances cross-regional generalization, while prompting improves the agents' ability to navigate complex, norm-sensitive tasks.

📝 Abstract
As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these, we introduce CASA, a benchmark designed to assess LLM agents' sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods: prompting and fine-tuning, and find that combining both methods can offer complementary advantages -- fine-tuning on culture-specific datasets significantly enhances the agents' ability to generalize across different regions, while prompting boosts the agents' ability to navigate complex tasks. These findings highlight the importance of constantly benchmarking LLM agents' cultural and social awareness during the development cycle.
Problem

Research questions and friction points this paper is trying to address.

Assess LLM agents' cultural sensitivity
Evaluate social norm awareness in web tasks
Improve LLM performance in agent environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CASA benchmark
Combines prompting and fine-tuning
Measures awareness and violation rates
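The evaluation framework described above scores agents along awareness coverage (how many norm-violating queries or observations the agent acknowledges) and violation rate (how often the agent itself produces a norm-violating response). A minimal sketch of how such metrics could be aggregated over labeled episodes is shown below; the `EpisodeResult` fields and the `evaluate` helper are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    """Hypothetical per-episode annotations for one norm-sensitive web task."""
    norm_violations_present: int  # norm-violating queries/observations shown to the agent
    norm_violations_flagged: int  # of those, how many the agent acknowledged
    agent_violated: bool          # whether the agent itself produced a violating response

def evaluate(results: list[EpisodeResult]) -> dict[str, float]:
    """Aggregate awareness coverage and violation rate over a batch of episodes."""
    total_present = sum(r.norm_violations_present for r in results)
    total_flagged = sum(r.norm_violations_flagged for r in results)
    coverage = total_flagged / total_present if total_present else 0.0
    violation_rate = sum(r.agent_violated for r in results) / len(results)
    return {"coverage": coverage, "violation_rate": violation_rate}
```

Under this sketch, the paper's headline numbers would correspond to `coverage < 0.10` and `violation_rate > 0.40` for current agent configurations.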
Haoyi Qiu
UCLA
Trustworthy AI · Multimodality
A. R. Fabbri
Salesforce AI Research
Divyansh Agarwal
Salesforce AI Research
Kung-Hsiang Huang
Salesforce AI Research
Sarah Tan
Salesforce / Cornell University
Safety · Interpretability · Fairness · Causal Inference · Healthcare
Nanyun Peng
University of California, Los Angeles
Chien-Sheng Wu
Salesforce AI Research