🤖 AI Summary
This work addresses the limitations of existing evaluations for large language model (LLM) agents, which often focus on isolated capabilities or outcome-based metrics while neglecting the rich process-level information embedded in decision-making and interactive communication. To characterize social behavior more comprehensively, we propose M3-Bench, a multi-stage evaluation benchmark grounded in mixed-motive games, featuring the first process-aware assessment framework that integrates the Big Five personality model with social exchange theory. Our approach synergistically combines Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA) to generate interpretable social behavior profiles. Experiments demonstrate that M3-Bench effectively uncovers significant inconsistencies between models' reasoning, communication, and final behavioral outcomes, moving beyond conventional result-oriented evaluation paradigms.
📝 Abstract
As the capabilities of large language model (LLM) agents continue to advance, their sophisticated social behaviors, such as cooperation, deception, and collusion, call for systematic evaluation. However, existing benchmarks often emphasize a single capability dimension or rely solely on behavioral outcomes, overlooking the rich process information in agents' decision reasoning and communicative interactions. To address this gap, we propose M3-Bench, a multi-stage benchmark for mixed-motive games, together with a process-aware evaluation framework that conducts synergistic analysis across three modules: Behavioral Trajectory Analysis (BTA), Reasoning Process Analysis (RPA), and Communication Content Analysis (CCA). Furthermore, we integrate the Big Five personality model and social exchange theory to aggregate multi-dimensional evidence into interpretable social behavior portraits, characterizing agents' personality traits and capability profiles beyond simple task scores or outcome-based metrics. Experimental results show that M3-Bench reliably distinguishes diverse social behavior competencies across models and reveals that some models achieve seemingly reasonable behavioral outcomes while exhibiting pronounced inconsistencies in their reasoning and communication.
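To make the aggregation idea concrete, below is a minimal sketch of how evidence from the three modules might be combined into a behavior portrait and checked for cross-module inconsistency. The module interfaces, per-trait 0-1 scores, weighted-average rule, and gap metric are all illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean

# Big Five traits used as the portrait's dimensions.
BIG_FIVE = ["openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]

@dataclass
class ModuleEvidence:
    """Scores one analysis module assigns to each Big Five trait.

    Hypothetical interface: the real BTA/RPA/CCA outputs are
    not specified in the abstract.
    """
    name: str          # e.g. "BTA", "RPA", or "CCA"
    trait_scores: dict  # trait -> score in [0, 1]

def aggregate_portrait(evidence, weights=None):
    """Combine module-level trait scores into one portrait.

    `weights` maps module name -> weight; uniform if omitted.
    A weighted average is an assumed aggregation rule.
    """
    weights = weights or {e.name: 1.0 for e in evidence}
    total = sum(weights[e.name] for e in evidence)
    return {
        trait: sum(weights[e.name] * e.trait_scores.get(trait, 0.0)
                   for e in evidence) / total
        for trait in BIG_FIVE
    }

def consistency_gap(evidence):
    """Mean per-trait spread (max - min) across modules.

    A large gap flags disagreement between what an agent does,
    reasons, and says, the kind of inconsistency M3-Bench reports.
    """
    gaps = []
    for trait in BIG_FIVE:
        scores = [e.trait_scores.get(trait, 0.0) for e in evidence]
        gaps.append(max(scores) - min(scores))
    return mean(gaps)

if __name__ == "__main__":
    # Toy numbers for one model: behavior (BTA) looks cooperative,
    # but reasoning (RPA) and communication (CCA) disagree with it.
    bta = ModuleEvidence("BTA", {t: 0.8 for t in BIG_FIVE})
    rpa = ModuleEvidence("RPA", {t: 0.4 for t in BIG_FIVE})
    cca = ModuleEvidence("CCA", {t: 0.5 for t in BIG_FIVE})
    print(aggregate_portrait([bta, rpa, cca]))
    print(f"consistency gap: {consistency_gap([bta, rpa, cca]):.2f}")
```

Under these assumptions, a model can score well on the aggregated portrait while still showing a large consistency gap, which is exactly the outcome-versus-process distinction the abstract highlights.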