GLUE: Global-Local Unified Encoding for Imitation Learning via Key-Patch Tracking

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In complex out-of-distribution (OOD) environments, such as cluttered or occluded scenes, global visual representations are highly susceptible to interference, degrading imitation learning performance. To address this, we propose GLUE, a Global-Local Unified Encoding framework. GLUE introduces a text-guided mechanism for selecting and tracking key image patches, coupled with a fusion architecture in which global features drive the distillation of local information, enabling task-relevant feature alignment and preserving contextual consistency. It further strengthens representation robustness through vision transformers, multi-scale feature fusion, and the construction of a low-heterogeneity representation space. Experiments demonstrate that GLUE outperforms the strongest baseline by 17.6% in simulation, 36.3% in real-world settings, and 58.3% in real-world generalization settings.
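As a concrete illustration of the text-guided selection step described above, here is a minimal sketch in PyTorch, assuming CLIP-style patch and text embeddings that share one feature space; the function name `select_key_patches`, the cosine-similarity scoring, and the choice of k are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def select_key_patches(patch_feats: torch.Tensor, text_feat: torch.Tensor,
                       k: int = 8) -> torch.Tensor:
    """Pick the k image patches best aligned with the task instruction.

    patch_feats: (N, D) ViT patch embeddings for one frame.
    text_feat:   (D,)   embedding of the language instruction.
    """
    # Cosine similarity between every patch and the instruction embedding.
    scores = F.normalize(patch_feats, dim=-1) @ F.normalize(text_feat, dim=-1)
    # The k highest-scoring patches become the tracked key-patches.
    return scores.topk(k).indices
```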

📝 Abstract
In recent years, visual representation learning has gained widespread attention in robotic imitation learning. However, in complex Out-of-Distribution (OOD) settings characterized by clutter and occlusion, the attention of global visual representations can be diluted or interfered with, leading to degraded policy performance. The invariance of local representations for task-relevant objects offers a solution: by efficiently utilizing these local representations, training and testing data can be mapped to a more similar feature space, thereby mitigating the covariate shift problem. Accordingly, we propose GLUE, a global-local unified encoding framework for imitation learning based on key-patch tracking. GLUE selects and tracks key-patches as critical local representations through a text-guided mechanism. It features a novel fusion framework in which global patch features query local patches to distill essential information, yielding fine-grained local features with low heterogeneity relative to the global context. This fused representation steers the robot's visual attention toward task-relevant objects and preserves precise global context, which together align the training and testing distributions into a similar, task-informative feature space, ultimately enhancing the robustness of the imitation learning policy. Experiments demonstrate that GLUE achieves strong performance across diverse tasks in both simulation and real-world settings, outperforming the strongest baseline by 17.6% in simulation, 36.3% in real-world environments, and 58.3% in real-world generalization settings. The project website of GLUE is available at https://GLUE666.github.io/.
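The fusion framework in the abstract, where global patch features query local patches to distill essential information, reads like a cross-attention block. The sketch below is one plausible realization under that assumption; the module name, dimensions, and the residual-plus-LayerNorm layout are guesses rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GlobalQueriesLocalFusion(nn.Module):
    """Global patch tokens attend to tracked key-patch tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_tokens: torch.Tensor,
                local_tokens: torch.Tensor) -> torch.Tensor:
        # global_tokens: (B, N, D) full-frame patch features (queries).
        # local_tokens:  (B, K, D) key-patch features (keys and values).
        fused, _ = self.attn(global_tokens, local_tokens, local_tokens)
        # The residual keeps the precise global context intact while the
        # attention output distills task-relevant detail from key-patches.
        return self.norm(global_tokens + fused)
```

Used per frame, the fused (B, N, D) tokens would then feed the policy head in place of raw global features, which is one way the "similar and task-informative feature space" claim could be realized.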
Problem

Research questions and friction points this paper is trying to address.

Addresses degraded policy performance in imitation learning under complex OOD conditions
Mitigates covariate shift by mapping training and testing data to similar feature spaces
Enhances robustness through global-local unified encoding with key-patch tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Key-patch tracking with text-guided selection (see the tracking sketch after this list)
Global features query local patches for fusion
Unified encoding reduces feature distribution shift
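Neither the summary nor the abstract spells out how key-patches are tracked across frames; a simple nearest-neighbor feature-matching tracker is one way such tracking could work. The function name and matching rule below are hedged assumptions, not the paper's tracker.

```python
import torch
import torch.nn.functional as F

def track_key_patches(prev_feats: torch.Tensor, key_idx: torch.Tensor,
                      next_feats: torch.Tensor) -> torch.Tensor:
    """Re-locate key-patches in the next frame by nearest-neighbor matching.

    prev_feats: (N, D) patch features of the previous frame.
    key_idx:    (K,)   indices of the tracked key-patches in prev_feats.
    next_feats: (N, D) patch features of the current frame.
    """
    keys = F.normalize(prev_feats[key_idx], dim=-1)  # (K, D)
    cand = F.normalize(next_feats, dim=-1)           # (N, D)
    # Each key-patch jumps to its most similar patch in the new frame.
    return (keys @ cand.T).argmax(dim=-1)            # (K,) new indices
```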
Ye Chen
Beijing Institute of Technology
Zichen Zhou
Beijing Institute of Technology
Jianyu Dou
Beijing Institute of Technology
Te Cui
Beijing Institute of Technology
Yi Yang
Beijing Institute of Technology
Yufeng Yue
Beijing Institute of Technology