🤖 AI Summary
This study examines the systematic under-exploration exhibited by large language models (LLMs) in interactive environments, a failure mode that prevents them from efficiently discovering optimal solutions within limited interaction budgets. The authors design three parametric tasks with controllable exploration difficulty to evaluate prominent LLMs in both continuous and discrete settings, benchmarking them against simple heuristic baselines. Results reveal that LLMs consistently underperform these heuristics, with only marginal gains as the interaction budget increases. To mitigate this limitation, the work proposes two lightweight intervention strategies: parallel execution with budget partitioning, and periodic summarization of the interaction history. Experiments show that both strategies substantially improve exploration efficiency, with history summarization yielding the larger gains.
📝 Abstract
We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore–exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.