🤖 AI Summary
Current large language models predominantly capture collective consensus, failing to model individual-specific reasoning styles and dynamic belief evolution. To address this, we introduce HugAgent, the first benchmark for individualized reasoning adaptation, featuring a dual-track evaluation paradigm: (1) controlled synthetic data and (2) authentic human “think-aloud” transcripts, jointly assessing individual belief updating and reasoning path prediction. Methodologically, we enable personalized reasoning adaptation via individualized prompting and trajectory distillation. Experiments reveal that mainstream LLMs show persistent adaptation gaps on this task. We publicly release the HugAgent benchmark and the TraceYourThinking dialogue system, establishing foundational infrastructure and standardized evaluation protocols for reproducible, scalable research on individualized reasoning.
📝 Abstract
Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).