🤖 AI Summary
This work addresses the challenge of efficiently adapting large language models to new tasks or domains after deployment, a setting where existing context optimization methods fall short because they cannot actively acquire external information. To overcome this limitation, the paper adds active information seeking to the context optimization framework: the optimizer is given Wikipedia search and web browser tools, and a search-based training procedure maintains and prunes multiple candidate contexts, so that context updates are driven by data. The approach improves adaptation in novel domains and low-resource scenarios, with consistent and substantial gains on Flores+, HealthBench, LiveCodeBench, and Humanity's Last Exam. It is also reported to be data-efficient, robust to hyperparameter choices, and able to produce contexts that generalize across models.
📝 Abstract
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. When paired instead with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
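The abstract does not spell out the search-based training procedure, so the following is only a minimal sketch of one way to "maintain and prune multiple candidate contexts", written as a generic beam-style loop. Every name here (`search_contexts`, `toy_propose`, `toy_score`, `FACTS`, `NEEDED`) is hypothetical and not taken from the paper. In a real system, `propose` would presumably be an LLM optimizer that calls Wikipedia search or a browser, and `score` would evaluate a context on training examples; both are replaced by deterministic toy stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Context = Tuple[str, ...]  # a context is modeled as a tuple of text snippets

def search_contexts(
    initial: Context,
    propose: Callable[[Context, int], Context],  # stand-in for a tool-using optimizer
    score: Callable[[Context], int],             # stand-in for evaluation on training data
    beam_width: int = 3,
    branch: int = 2,
    steps: int = 4,
) -> Tuple[int, Context]:
    """Keep several candidate contexts, expand each, and prune to the best few."""
    beam: List[Tuple[int, Context]] = [(score(initial), initial)]
    for _ in range(steps):
        # Parents stay in the pool, so the best score never decreases.
        pool: Dict[Context, int] = {ctx: s for s, ctx in beam}
        for _, ctx in beam:
            for k in range(branch):
                new_ctx = propose(ctx, k)
                if new_ctx not in pool:  # skip duplicate candidates
                    pool[new_ctx] = score(new_ctx)
        # Prune: keep only the top `beam_width` candidates.
        ranked = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)
        beam = [(s, ctx) for ctx, s in ranked[:beam_width]]
    return beam[0]

# Toy stand-ins (assumptions for illustration only).
FACTS = ["fact_a", "fact_b", "noise_x", "fact_c"]   # what "retrieval" can return
NEEDED = {"fact_a", "fact_b", "fact_c"}             # what the task actually needs

def toy_propose(ctx: Context, k: int) -> Context:
    # Pretend a tool call retrieved one snippet and appended it to the context.
    return ctx + (FACTS[(len(ctx) + k) % len(FACTS)],)

def toy_score(ctx: Context) -> int:
    # Reward covered facts, lightly penalize context length.
    return 10 * len(set(ctx) & NEEDED) - len(ctx)

best_score, best_ctx = search_contexts((), toy_propose, toy_score)
print(best_score, best_ctx)
```

In this toy setup, a retrieved snippet can be irrelevant (`noise_x`), so a purely sequential update that always accepts the latest retrieval could carry noise forward. Keeping several candidates and pruning by score lets the loop discard such branches, which is one plausible reading of why the paper pairs tools with a search-based procedure.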