🤖 AI Summary
This paper addresses the inefficient use of episodic experience by large language models (LLMs) in physical planning tasks. Medium-scale models (e.g., 7B parameters) suffer from weak situational grounding, while large-scale models (70-405B parameters), despite superior abstraction capabilities, exhibit a "scale paradox" that impedes effective integration of sequential experience. To bridge this gap, we propose a cross-scale embodied experience transfer framework, introducing the first scalable "weak-to-strong" episodic learning paradigm. Our method integrates MCTS-guided experience acquisition with a memory-distillation procedure that preserves the original model's capabilities, and incorporates hierarchical knowledge distillation and layer-wise probing analysis. Experiments demonstrate that our approach outperforms state-of-the-art closed-source LMs by 3.45% across diverse planning and question-answering benchmarks. It also markedly improves deep-layer representation alignment and yields more stable generalization on complex, unseen scenarios.
📝 Abstract
Language models (LMs) require robust episodic grounding (the capacity to learn from and apply past experiences) to excel at physical planning tasks. Current episodic grounding approaches struggle with scalability and integration, limiting their effectiveness, especially for medium-sized LMs (7B parameters). While larger LMs (70-405B parameters) possess superior hierarchical representations and extensive pre-trained knowledge, they encounter a fundamental scale paradox: despite their advanced abstraction capabilities, they lack efficient mechanisms to leverage experience streams. We propose a scalable weak-to-strong episodic learning framework that effectively transfers episodic behaviors from smaller to larger LMs. This framework integrates Monte Carlo tree search for structured experience collection with a novel distillation method that preserves inherent LM capabilities while embedding episodic memory. Experiments demonstrate that our method surpasses state-of-the-art proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing further indicates significant improvements in task alignment, especially within deeper LM layers, highlighting stable generalization even on previously unseen scenarios with increased planning complexity, conditions under which baseline methods degrade markedly.
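The weak-to-strong transfer with capability preservation described above could be sketched as a combined distillation objective: one term pulls the large (student) model toward the small model's episodic policy, and a second term regularizes it toward the frozen base model so original capabilities are retained. This is a minimal illustrative sketch; the function names, the KL-based loss form, and the `alpha` weighting are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def weak_to_strong_distill_loss(student_logits, weak_teacher_logits,
                                frozen_base_logits, alpha=0.5):
    """Hypothetical combined objective (illustrative, not the paper's loss):
    - an episodic-transfer term matching the small model's policy, and
    - a capability-preservation term matching the frozen base model."""
    s = softmax(student_logits)
    episodic = kl(softmax(weak_teacher_logits), s)   # weak-to-strong transfer
    preserve = kl(softmax(frozen_base_logits), s)    # keep base-model behavior
    return alpha * episodic + (1 - alpha) * preserve
```

In this sketch, `alpha` trades off how strongly the student adopts the small model's episodic behavior versus how tightly it stays anchored to its pre-trained distribution; the loss is zero only when the student already matches both targets.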