Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of enabling language model agents to perform human-like rational information seeking and decision-making under resource constraints, this paper proposes a framework integrating Bayesian Experimental Design with Monte Carlo inference. The method optimizes question-asking and action-selection strategies by explicitly modeling expected information gain (EIG), guiding agents to efficiently acquire discriminative information in collaborative dialogue tasks such as Collaborative Battleship and Guess Who?. Its core contribution is embedding experimental-design theory into the LM agent's decision loop, substantially improving reasoning efficiency and accuracy under information bottlenecks. Experiments show significant gains: the Captain agent achieves up to a +0.227-bit EIG improvement in Battleship, with joint targeting F1 rising by 0.303–0.374; Llama-4-Scout's win rate against humans jumps from 8% to 82%; and Guess Who? accuracy increases by 28.3–42.4 percentage points, surpassing both human performance and frontier models at a fraction (~1%) of GPT-5's cost.

📝 Abstract
Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LM agents' rational decision-making in information-seeking tasks
Addressing LM struggles with contextual grounding and strategic questioning
Developing Bayesian methods to enhance agent exploration and action efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monte Carlo inference for language model agents
Bayesian Experimental Design principles for information-seeking
Strategic dialogue tasks to benchmark agent rationality
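To make the EIG objective concrete, here is a minimal sketch of how expected information gain can be scored for a candidate question in a Guess Who?-style setting. It assumes a uniform prior over a discrete hypothesis space and a deterministic answer function; the names and setup are illustrative, not taken from the paper's implementation (which uses Monte Carlo inference over LM-sampled hypotheses).

```python
import math
from collections import Counter

def entropy_bits(n: int) -> float:
    """Entropy (in bits) of a uniform distribution over n hypotheses."""
    return math.log2(n)

def expected_information_gain(hypotheses, question) -> float:
    """EIG of a question whose answer is a deterministic function of the
    hypothesis, under a uniform prior: prior entropy minus the expected
    entropy of the posterior after observing the answer."""
    n = len(hypotheses)
    # How many hypotheses map to each possible answer.
    answer_counts = Counter(question(h) for h in hypotheses)
    # Expected posterior entropy: each answer occurs with prob k/n and
    # leaves a uniform posterior over k surviving hypotheses.
    expected_posterior = sum((k / n) * math.log2(k) for k in answer_counts.values())
    return entropy_bits(n) - expected_posterior

# Toy example: 8 equally likely characters, ask "is the id even?"
chars = list(range(8))
eig = expected_information_gain(chars, lambda h: h % 2 == 0)  # 1.0 bit
```

A perfectly balanced yes/no question yields exactly 1 bit, while an unbalanced one (e.g. `lambda h: h < 2`) yields less; a rational Captain or Guess Who? agent would rank candidate questions by this score and ask the argmax.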