🤖 AI Summary
This work addresses the limited cross-task generalization capability of existing methods in open-world robotic manipulation, which often rely solely on low-level action sequences and thus struggle to extract composable skill knowledge. To overcome this, the authors propose a skill-reasoning framework that decomposes observed task demonstrations into interpretable atomic skill–action pairs, constructing a hybrid skill demonstration library comprising both dynamic and static components. By integrating vision–language retrieval, coverage-aware static memory, and in-context learning mechanisms, the framework enables compositional skill reasoning and execution sequencing. Evaluated on the AGNOSTOS benchmark and in real-world environments, the approach demonstrates significant improvements in zero-shot cross-task manipulation performance.
📝 Abstract
Cross-task generalization is a core challenge in open-world robotic manipulation, and the key lies in extracting transferable manipulation knowledge from seen tasks. Recent in-context learning approaches leverage seen task demonstrations to generate actions for unseen tasks without parameter updates. However, existing methods provide only low-level continuous action sequences as context, failing to capture composable skill knowledge and causing models to degenerate into superficial trajectory imitation. We propose Decompose and Recompose, a skill reasoning framework using atomic skill-action pairs as intermediate representations. Our approach decomposes seen demonstrations into interpretable skill-action alignments, enabling the model to recompose these skills for unseen tasks through compositional reasoning. Specifically, we construct a task-adaptive dynamic demonstration library via visual-semantic retrieval combined with skill sequences from a planning agent, complemented by a coverage-aware static library to fill missing skill patterns. Together, these yield skill-comprehensive demonstrations that explicitly elicit compositional reasoning for skill composition and execution ordering. Experiments on the AGNOSTOS benchmark and real-world environments validate our method's zero-shot cross-task generalization capability.
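One way to picture the hybrid library described in the abstract is the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Demo` record, the `similarity` field (standing in for a visual-semantic retrieval score), the function name `build_demo_set`, and the greedy coverage rule are all hypothetical, and the skill names and scores are made-up examples. The dynamic part keeps the top-k retrieved demonstrations. The static part then adds demonstrations that cover skills the planner asked for but the retrieved ones lack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """A seen-task demonstration tagged with its atomic skills (hypothetical record)."""
    task: str
    skills: tuple            # atomic skill names, in execution order
    similarity: float = 0.0  # assumed retrieval score against the unseen task


def build_demo_set(planned_skills, retrieval_pool, static_library, k=2):
    """Return (dynamic, static, still_missing) for one unseen task."""
    # Dynamic part: keep the k most similar seen demonstrations.
    dynamic = sorted(retrieval_pool, key=lambda d: d.similarity, reverse=True)[:k]

    # Skills the planner expects that the retrieved demonstrations do not show.
    covered = {s for d in dynamic for s in d.skills}
    missing = set(planned_skills) - covered

    # Static part: greedily pick demonstrations covering the most missing skills.
    static = []
    candidates = list(static_library)
    while missing and candidates:
        best = max(candidates, key=lambda d: len(missing & set(d.skills)))
        if not missing & set(best.skills):
            break  # nothing left in the static library helps
        static.append(best)
        missing -= set(best.skills)
        candidates.remove(best)
    return dynamic, static, missing


# Illustrative data only; tasks, skills, and scores are invented for the example.
pool = [
    Demo("open drawer", ("reach", "grasp", "pull"), 0.9),
    Demo("push button", ("reach", "push"), 0.7),
    Demo("stack blocks", ("reach", "grasp", "lift", "place"), 0.4),
]
static_lib = [
    Demo("turn tap", ("reach", "grasp", "rotate")),
    Demo("put in cupboard", ("reach", "grasp", "lift", "place")),
]
planned = ["reach", "grasp", "lift", "place", "rotate"]

dynamic, static, missing = build_demo_set(planned, pool, static_lib, k=2)
print([d.task for d in dynamic], [d.task for d in static], missing)
```

In this toy run the two retrieved demonstrations show neither lifting, placing, nor rotating, so both static entries are added. The combined set would then be formatted as in-context examples for the model. The greedy rule is a simple stand-in for whatever coverage criterion the paper actually uses.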