EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard Transformers offer no built-in guarantee of 3D geometric consistency, so language-conditioned manipulation policies can behave unpredictably when the scene undergoes an SE(3) transformation. To address this, the paper proposes EquAct, a multi-task Transformer that is theoretically guaranteed to be SE(3)-equivariant. It has two key components: (1) an efficient SE(3)-equivariant point-cloud U-Net with spherical Fourier features for policy reasoning; and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. Evaluated on 18 RLBench simulation tasks under both SE(3) and SE(2) scene perturbations, and on 4 physical manipulation tasks, EquAct achieves state-of-the-art performance. The evaluation targets spatial generalization, that is, how well the learned policy holds up under geometric variation in both simulation and the physical world.

📝 Abstract

Transformer architectures can effectively learn language-conditioned, multi-task 3D open-loop manipulation policies from demonstrations by jointly processing natural language instructions and 3D observations. However, although both the robot policy and the language instructions inherently encode rich 3D geometric structure, standard transformers lack built-in guarantees of geometric consistency, which often results in unpredictable behavior under SE(3) transformations of the scene. In this paper, we leverage SE(3) equivariance as a key structural property shared by both policy and language, and propose EquAct, a novel SE(3)-equivariant multi-task transformer. EquAct is theoretically guaranteed to be SE(3)-equivariant and consists of two key components: (1) an efficient SE(3)-equivariant point cloud-based U-net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. To evaluate its spatial generalization ability, we benchmark EquAct on 18 RLBench simulation tasks with both SE(3) and SE(2) scene perturbations, and on 4 physical tasks. EquAct achieves state-of-the-art performance across these simulation and physical tasks.
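
For readers unfamiliar with the term, the SE(3)-equivariance property claimed in the abstract can be written out as follows. This is a sketch in notation chosen here, not taken from the paper: P is the observed point cloud, ℓ is the language instruction (held fixed, which is a simplifying assumption), f is the policy, and the predicted action is assumed to be a gripper pose (R_a, t_a).

```latex
% Assumed notation: P = \{p_i\} point cloud, \ell instruction, f policy,
% g = (R, t) \in SE(3) a rigid-body transformation of the scene.
\begin{aligned}
  g \cdot P &= \{\, R\,p_i + t \,\}_{i=1}^{N}, \\
  g \cdot (R_a, t_a) &= (R\,R_a,\; R\,t_a + t), \\
  f(g \cdot P,\ \ell) &= g \cdot f(P,\ \ell) \qquad \text{for all } g \in SE(3).
\end{aligned}
```

In words: if the whole scene is rotated and translated, an equivariant policy outputs the same action rotated and translated in the same way, by construction rather than by learning it from augmented data.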

Problem

Research questions and friction points this paper is trying to address.

Lack of geometric consistency in standard transformers for robotic manipulation

Need for SE(3)-equivariant models to handle 3D scene transformations

Improving spatial generalization in language-conditioned multi-task robotic policies

Innovation

Methods, ideas, or system contributions that make the work stand out.

SE(3)-equivariant transformer for robotic manipulation

Point cloud U-net with spherical Fourier features

SE(3)-invariant iFiLM layers for language conditioning
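
To make the idea of invariant language conditioning concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's iFiLM implementation: the function `invariant_film`, the split into scalar and 3D-vector channels, and the linear maps from the language embedding are all hypothetical. The point it demonstrates is that a FiLM-style modulation whose scale and shift depend only on the (rotation-invariant) language embedding leaves scalar features invariant and vector features equivariant, provided no shift is added to the vector channels.

```python
import numpy as np

def random_rotation(rng):
    """Sample a 3x3 rotation matrix (orthogonal, det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs; q stays orthogonal
    if np.linalg.det(q) < 0:         # flip one axis if we landed on a reflection
        q[:, 0] = -q[:, 0]
    return q

def invariant_film(scalars, vectors, lang, w_gamma_s, w_beta_s, w_gamma_v):
    """Hypothetical invariant FiLM layer (illustrative, not the paper's code).

    scalars: (N, Cs) rotation-invariant features per point
    vectors: (N, Cv, 3) features that rotate with the scene
    lang:    (D,) language embedding, assumed unaffected by scene rotation
    """
    gamma_s = lang @ w_gamma_s       # (Cs,) per-channel scale for scalars
    beta_s = lang @ w_beta_s         # (Cs,) per-channel shift for scalars
    gamma_v = lang @ w_gamma_v       # (Cv,) per-channel scale for vectors
    out_scalars = gamma_s * scalars + beta_s
    # Scale only: adding a fixed shift to a vector channel would break equivariance.
    out_vectors = gamma_v[None, :, None] * vectors
    return out_scalars, out_vectors

rng = np.random.default_rng(0)
n_points, c_s, c_v, d_lang = 5, 4, 3, 8
scalars = rng.normal(size=(n_points, c_s))
vectors = rng.normal(size=(n_points, c_v, 3))
lang = rng.normal(size=d_lang)
w_gamma_s, w_beta_s = rng.normal(size=(2, d_lang, c_s))
w_gamma_v = rng.normal(size=(d_lang, c_v))

rot = random_rotation(rng)
out_s, out_v = invariant_film(scalars, vectors, lang, w_gamma_s, w_beta_s, w_gamma_v)
# Rotate the input vectors first, then modulate.
out_s_rot, out_v_rot = invariant_film(
    scalars, vectors @ rot.T, lang, w_gamma_s, w_beta_s, w_gamma_v
)

# Modulate-then-rotate should match rotate-then-modulate up to float error.
equivariance_gap = np.abs(out_v_rot - out_v @ rot.T).max()
print("max equivariance gap:", equivariance_gap)
```

The printed gap should sit at floating-point rounding level. A layer that also added a language-dependent shift to the vector channels would fail this check, which is one reason an invariant variant of FiLM is needed inside an equivariant network.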
