🤖 AI Summary
How well foundation models can strategically gather information to test hypotheses in dynamic, interactive environments remains poorly understood. This paper introduces a hypothesis-driven evaluation framework for active information gathering, built on an iterative loop in which the model reasons over previously gathered evidence and proposes the next exploratory action to maximize information gain, with assessment in both a text-based and a 3D embodied setting. The method integrates LLM-guided action planning, context-aware memory scheduling, self-correction mechanisms, and multimodal interaction interfaces. Key contributions include: (1) the first systematic evaluation of large language models' strategic information acquisition; (2) a task-adaptive exploration strategy; and (3) novel empirical findings, e.g., smaller models outperforming larger ones on single-feature tasks. Experiments show near-optimal performance on single-feature tasks; self-correction substantially improves performance on conjunctive tasks; and performance is comparable across text and 3D environments, where the 3D bottleneck lies in visual perception accuracy rather than reasoning.
📝 Abstract
While problem solving is a standard evaluation task for foundation models, a crucial component of problem solving -- actively and strategically gathering information to test hypotheses -- has not been closely investigated. To assess the information-gathering abilities of foundation models in interactive environments, we introduce a framework in which a model must determine the factors influencing a hidden reward function by iteratively reasoning about its previously gathered information and proposing its next exploratory action to maximize information gain at each step. We implement this framework in both a text-based environment, which offers a tightly controlled setting and enables high-throughput parameter sweeps, and an embodied 3D environment, which requires addressing complexities of multi-modal interaction more relevant to real-world applications. We further investigate whether approaches such as self-correction and increased inference time improve information-gathering efficiency. In a relatively simple task that requires identifying a single rewarding feature, we find that the LLMs' information-gathering capability is close to optimal. However, when the model must identify a conjunction of rewarding features, performance is suboptimal. The performance drop stems partly from the model's difficulty translating the task description into a policy and partly from how effectively the model uses its in-context memory. Performance is comparable in the text and 3D embodied environments, although imperfect visual object recognition reduces the model's accuracy in drawing conclusions from gathered information in the 3D embodied case. For single-feature-based rewards, we find that smaller models curiously perform better; for conjunction-based rewards, incorporating self-correction into the model improves performance.
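To make the evaluation setup concrete, here is a minimal sketch of the single-feature task and an optimal information-gain baseline of the kind the models are compared against. All names (the feature vocabulary, `hidden_reward`, the greedy hypothesis-elimination agent) are illustrative assumptions, not the paper's actual environment or API; in the paper the exploratory actions are proposed by an LLM rather than this hand-coded policy.

```python
import itertools

# Hypothetical single-feature task: the hidden reward depends on exactly one
# feature (e.g. a colour or shape) drawn from a known vocabulary.
FEATURES = ["red", "green", "blue", "square", "circle", "triangle"]

def hidden_reward(obj, secret="blue"):
    """Reward is 1 iff the probed object carries the secret feature."""
    return int(secret in obj)

def eliminate(hypotheses, obj, reward):
    """Keep only hypotheses consistent with the observed reward."""
    return [h for h in hypotheses if int(h in obj) == reward]

def explore(objects, max_steps=10):
    """Greedy information-gain baseline: at each step, probe the object whose
    outcome splits the remaining hypothesis set most evenly (maximal
    expected reduction of the hypothesis space, akin to binary search)."""
    hypotheses = list(FEATURES)
    history = []
    for _ in range(max_steps):
        if len(hypotheses) == 1:
            break  # hidden rewarding feature identified
        obj = min(objects, key=lambda o: abs(
            sum(h in o for h in hypotheses) - len(hypotheses) / 2))
        r = hidden_reward(obj)
        hypotheses = eliminate(hypotheses, obj, r)
        history.append((sorted(obj), r))
    return hypotheses, history

# Objects expose two features each; the agent chooses which one to probe.
objects = [set(pair) for pair in itertools.combinations(FEATURES, 2)]
final, hist = explore(objects)
print(final)  # -> ['blue']: the single surviving hypothesis
```

An LLM agent replaces the `min(...)` selection step with a model-proposed action; comparing the length of its `history` against this baseline's gives the efficiency gap the paper measures. Conjunctive rewards would enlarge the hypothesis space to feature pairs, which is where the abstract reports suboptimal performance.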