FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization and poor transferability of end-to-end imitation learning policies to unseen tasks and objects. To tackle this, we propose FunCanon, a framework built on functional object canonicalization: it uses affordance cues from large vision-language models to map objects from different categories into shared functional frames, and it represents long-horizon tasks as sequences of action chunks, each defined by an actor, a verb, and an object. Together these make policies both pose-aware and category-general. Building on this, we introduce FuncDiffuser, an object-centric and action-centric diffusion policy trained on the canonicalized data produced by functional alignment and automatic trajectory transfer. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim-to-real deployment, indicating improved compositionality and generalization in imitation learning.

📝 Abstract
Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. We therefore introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, a verb, and an object. These chunks focus policy learning on the actions themselves rather than on isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, which simplifies learning and improves generalization. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website: https://sites.google.com/view/funcanon.
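The trajectory-transfer idea in the abstract can be illustrated with plain rigid-body transforms. The sketch below is not the authors' implementation: it assumes a functional frame has already been estimated for both the demonstrated object and a new object (the abstract attributes this step to affordance cues from vision-language models), and it simply re-expresses demonstrated gripper poses in the source object's functional frame before replaying them in the target object's frame. All function names, frames, and numeric poses are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    q = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [q[0]], rt[1] + [q[1]], rt[2] + [q[2]],
            [0.0, 0.0, 0.0, 1.0]]

def pose_z(yaw, x, y, z):
    """Transform with a rotation of `yaw` about z and translation (x, y, z)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def transfer_trajectory(world_T_gripper_traj, world_T_src, world_T_tgt):
    """Replay gripper poses, recorded near a source object, on a target object.

    Each pose is first expressed relative to the source object's functional
    frame, then mapped back to the world through the target object's frame.
    """
    src_T_world = rigid_inverse(world_T_src)
    transferred = []
    for world_T_gripper in world_T_gripper_traj:
        frame_T_gripper = matmul(src_T_world, world_T_gripper)
        transferred.append(matmul(world_T_tgt, frame_T_gripper))
    return transferred

# Hypothetical functional frames (assumed given, e.g. at a handle).
src_frame = pose_z(0.0, 0.5, 0.0, 0.1)
tgt_frame = pose_z(math.pi / 2, 0.2, 0.3, 0.1)

# Hypothetical two-waypoint demonstration recorded on the source object.
demo = [pose_z(0.0, 0.6, 0.0, 0.3), pose_z(0.0, 0.5, 0.0, 0.15)]
transferred = transfer_trajectory(demo, src_frame, tgt_frame)
```

Because every waypoint is stored relative to a functional frame rather than the world, the same demonstration follows the target object wherever it is placed and however it is rotated, which is one way to read the abstract's claim that aligned data makes the policy pose-aware.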
Problem

Research questions and friction points this paper is trying to address.

Converting long-horizon manipulation tasks into reusable action chunks
Enabling pose-aware and category-general robotic manipulation policies
Achieving generalization across object categories and task variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Functional object canonicalization for alignment
Action-centric diffusion policy respecting affordances
Pose-aware action primitives enabling compositionality
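As a rough illustration of how actor, verb, and object chunks could support reuse across tasks, the sketch below models a chunk as a small immutable record and compares two task decompositions. The `ActionChunk` type, the task names, and the chunk contents are all hypothetical and are not taken from the paper; in particular, what counts as the "actor" is an assumption made here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionChunk:
    """One action primitive: who acts, what it does, and on what."""
    actor: str
    verb: str
    obj: str

# Two hypothetical long-horizon tasks, each written as a chunk sequence.
pour_water = [
    ActionChunk("gripper", "grasp", "mug"),
    ActionChunk("mug", "pour", "bowl"),
    ActionChunk("gripper", "place", "mug"),
]
water_plant = [
    ActionChunk("gripper", "grasp", "watering_can"),
    ActionChunk("watering_can", "pour", "plant_pot"),
    ActionChunk("gripper", "place", "watering_can"),
]

def verbs(task):
    """Return the ordered verb sequence of a task."""
    return [chunk.verb for chunk in task]

# Verbs appearing in both tasks: candidates for a shared, reusable policy.
shared_verbs = sorted(set(verbs(pour_water)) & set(verbs(water_plant)))
```

In this toy example the two tasks differ only in their objects, so a policy trained per verb rather than per task would see data from both, which is the kind of cross-task reuse the summary describes.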
Hongli Xu
University of Science and Technology of China
Software Defined Network, Cooperative Communication, Sensor Networks
Lei Zhang
TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.
Xiaoyue Hu
Technical University of Munich, Germany.
Boyang Zhong
Technical University of Munich, Germany.
Kaixin Bai
TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.
Zoltán-Csaba Márton
Agile Robots SE, Munich, Germany.
Zhenshan Bing
Nanjing University / Technical University of Munich
Robotics
Zhaopeng Chen
Agile Robots SE, Munich, Germany.
Alois Christian Knoll
Technical University of Munich, Germany.
Jianwei Zhang
TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.