Can foundation models actively gather information in interactive environments to test hypotheses?

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
How well foundation models can strategically acquire information to test hypotheses in dynamic, interactive environments remains poorly understood. This paper introduces the first hypothesis-driven active information-gathering evaluation framework, built on an iterative reasoning loop that selects actions to maximize information gain and supporting assessment in both textual and 3D embodied settings. The method integrates LLM-guided action planning, context-aware memory scheduling, self-correction mechanisms, and multimodal interaction interfaces. Key contributions: (1) the first systematic evaluation of large language models' strategic information acquisition; (2) a task-adaptive exploration strategy; and (3) novel empirical findings, e.g., smaller models outperforming larger ones on single-feature tasks. Experiments show near-optimal performance on single-feature tasks; self-correction substantially improves performance on conjunctive tasks; and comparable performance across text and 3D environments, with the 3D bottleneck lying in visual perception accuracy rather than reasoning.

📝 Abstract
While problem solving is a standard evaluation task for foundation models, a crucial component of problem solving -- actively and strategically gathering information to test hypotheses -- has not been closely investigated. To assess the information-gathering abilities of foundation models in interactive environments, we introduce a framework in which a model must determine the factors influencing a hidden reward function by iteratively reasoning about its previously gathered information and proposing its next exploratory action to maximize information gain at each step. We implement this framework in both a text-based environment, which offers a tightly controlled setting and enables high-throughput parameter sweeps, and in an embodied 3D environment, which requires addressing complexities of multi-modal interaction more relevant to real-world applications. We further investigate whether approaches such as self-correction and increased inference time improve information-gathering efficiency. In a relatively simple task that requires identifying a single rewarding feature, we find that LLMs' information-gathering capability is close to optimal. However, when the model must identify a conjunction of rewarding features, performance is suboptimal. The performance drop is due partly to the model translating the task description into a policy and partly to the model's effectiveness in using its in-context memory. Performance is comparable in both text and 3D embodied environments, although imperfect visual object recognition reduces accuracy in drawing conclusions from gathered information in the 3D embodied case. For single-feature-based rewards, we find that smaller models curiously perform better; for conjunction-based rewards, incorporating self-correction into the model improves performance.
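The abstract describes a loop in which the model proposes, at each step, the exploratory action that maximizes information gain about the hidden reward function. The sketch below is not the paper's implementation; it is a minimal illustration of greedy expected-information-gain action selection for the single-feature case, assuming a uniform posterior over candidate rewarding features. All names (`entropy`, `best_probe`, the feature list) are hypothetical.

```python
import math
from itertools import combinations

def entropy(n):
    # Shannon entropy (bits) of a uniform distribution over n hypotheses.
    return math.log2(n) if n > 0 else 0.0

def split(hypotheses, probe):
    # A probe (set of features tested in one action) yields reward
    # iff the hidden rewarding feature is among the probed features.
    hit = [h for h in hypotheses if h in probe]
    miss = [h for h in hypotheses if h not in probe]
    return hit, miss

def expected_info_gain(hypotheses, probe):
    # Prior entropy minus the expected posterior entropy over outcomes.
    hit, miss = split(hypotheses, probe)
    n = len(hypotheses)
    post = sum(len(b) / n * entropy(len(b)) for b in (hit, miss) if b)
    return entropy(n) - post

def best_probe(hypotheses, candidate_probes):
    # Greedily pick the action with maximal expected information gain.
    return max(candidate_probes,
               key=lambda p: expected_info_gain(hypotheses, frozenset(p)))

# Usage: 4 candidate rewarding features; probes test 1 or 2 features at once.
features = ["red", "blue", "cube", "sphere"]
probes = [set(c) for r in (1, 2) for c in combinations(features, r)]
p = best_probe(features, probes)
# A balanced half-split probe is optimal here: it yields exactly 1 bit.
assert expected_info_gain(features, frozenset(p)) == 1.0
```

With four equally likely hypotheses the prior entropy is 2 bits, so a probe that splits the hypothesis set in half gains 1 bit per step, matching the intuition that near-optimal single-feature exploration amounts to binary search over candidate features.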
Problem

Research questions and friction points this paper is trying to address.

Foundation models struggle with multi-turn exploration in dynamic environments
Models fail to integrate knowledge through adaptive strategies over time
Current models cannot improve performance across trials without intervention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Summarizing observations enables emergent meta-learning
Adaptive strategies integrate knowledge across trials
Models re-learn when environment rules change