FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

📅 2025-08-04

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

To address the privacy-performance trade-off in training large-scale vision-language-action (VLA) models for robotic manipulation at the client edge, this paper proposes the first federated VLA learning framework tailored for robot manipulation. Methodologically, we design an instruction-guided scene parsing mechanism and a dual-gated Mixture-of-Experts (MoE) network enabling expert self-aware activation; further, we introduce an expert-driven federated aggregation strategy that jointly optimizes task-aware representation learning, adaptive expert selection, and vision-language-action alignment. Evaluated in both simulation and real-robot settings, our approach achieves task success rates approaching those of centralized training—outperforming baselines by +12.3% on average—while reducing communication overhead by 37% and strictly ensuring that raw client data never leaves local devices. This work establishes a scalable, high-performance paradigm for collaborative embodied intelligence model training under stringent privacy constraints.

Technology Category

Application Category

📝 Abstract

Vision-language-action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer.Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.

Problem

Research questions and friction points this paper is trying to address.

Privacy concerns in training vision-language-action models for robotics

Need for distributed training without performance compromise

Efficient learning of diverse task patterns while preserving privacy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated VLA learning preserves data privacy

Dual Gating Mixture-of-Experts enhances task learning

Expert-Driven Aggregation ensures cross-client knowledge transfer

🔎 Similar Papers

Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks