HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models predominantly capture collective consensus and fail to model individual-specific reasoning styles and dynamic belief evolution. To address this, we introduce HugAgent, the first benchmark for individualized reasoning adaptation. It features a dual-track evaluation paradigm that combines (1) controlled synthetic data and (2) authentic human “think-aloud” transcripts to jointly assess individual belief updating and reasoning path prediction. Methodologically, we enable personalized reasoning adaptation via individualized prompting and trajectory distillation. Experiments reveal substantial adaptation gaps among mainstream LLMs on this task. We publicly release the HugAgent benchmark and the TraceYourThinking dialogue system, establishing foundational infrastructure and standardized evaluation protocols for reproducible, scalable research on individualized reasoning.

📝 Abstract

Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, "out-loud" reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (https://anonymous.4open.science/r/HugAgent) and TraceYourThinking (https://anonymous.4open.science/r/trace-your-thinking).
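To make the "average-to-individual" framing in the abstract concrete, here is a minimal toy sketch in Python. It is not the benchmark's actual data format or metric: the `BeliefItem` record, the 1-5 stance scale, both baseline functions, and the sample values are assumptions invented for illustration. It compares a predictor that applies the population-average belief shift to everyone against one that applies each person's own average shift.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BeliefItem:
    """Hypothetical record: one person's stance before and after new evidence."""
    person_id: str
    prior: float      # stance before the evidence (assumed 1-5 scale)
    posterior: float  # stance the person actually reported afterwards


def population_baseline(items):
    """Predict each posterior as prior plus the average shift over all people."""
    avg_shift = mean(it.posterior - it.prior for it in items)
    return [it.prior + avg_shift for it in items]


def individual_baseline(items):
    """Predict each posterior as prior plus that person's own average shift."""
    shifts = {}
    for it in items:
        shifts.setdefault(it.person_id, []).append(it.posterior - it.prior)
    return [it.prior + mean(shifts[it.person_id]) for it in items]


def mean_abs_error(predictions, items):
    """Mean absolute gap between predicted and reported posteriors."""
    return mean(abs(p - it.posterior) for p, it in zip(predictions, items))


# Made-up data: p1 always moves by +2, p2 never moves.
items = [
    BeliefItem("p1", prior=2.0, posterior=4.0),
    BeliefItem("p1", prior=3.0, posterior=5.0),
    BeliefItem("p2", prior=4.0, posterior=4.0),
    BeliefItem("p2", prior=2.0, posterior=2.0),
]

population_error = mean_abs_error(population_baseline(items), items)
individual_error = mean_abs_error(individual_baseline(items), items)
print(population_error, individual_error)
```

In this toy data the population-level predictor is wrong about everyone, because it averages two very different updating styles, while the per-person predictor fits each individual. The shifts are estimated on the same items they are scored on, so the numbers illustrate the idea only and say nothing about held-out performance.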

Problem

Research questions and friction points this paper is trying to address.

Simulating individual human reasoning styles in open-ended tasks

Predicting personal belief updates from partial historical evidence

Evaluating model fidelity to human reasoning evolution processes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-track design for scalable and ecological evaluation

Predicting individual reasoning evolution from past views

Benchmark for intra-agent fidelity in belief updates
