SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak spatiotemporal reasoning of lightweight vision-language-action (VLA) models and the difficulty of balancing efficiency and performance, this paper proposes a 4D geometry-aware architecture. The method introduces a 4D visual geometry transformer to explicitly model spatiotemporal geometric relationships, integrates Fusion Tokens and masked reconstruction to enable unified representation learning across 2D images and 4D features, and incorporates a temporal caching mechanism to strengthen sequential modeling. During inference, the 4D branch can be dynamically discarded, significantly reducing computational overhead. Evaluated in both real-world and simulated environments, the approach outperforms lightweight baselines and matches the performance of models with seven times more parameters. On edge devices, it achieves 18× faster inference and reduces the memory footprint to 1/12. To the authors' knowledge, this is the first work to enable efficient and accurate 4D perception-driven action generation in compact VLA models.
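The temporal caching mechanism mentioned above can be pictured as a fixed-size buffer of per-frame features that the 4D branch consumes as a short sequence. The sketch below is an illustrative assumption (the paper does not publish this interface); the window size, feature shapes, and class name are invented for clarity.

```python
from collections import deque
import numpy as np

# Minimal sketch of a temporal cache: keep features from the most recent
# frames so the 4D geometry branch sees a short spatiotemporal window.
# All names and sizes here are illustrative assumptions.

class TemporalCache:
    def __init__(self, max_frames=3):
        self.buffer = deque(maxlen=max_frames)  # oldest frame drops automatically

    def update(self, frame_feats):
        """Store the feature tokens of the newest frame."""
        self.buffer.append(frame_feats)

    def window(self):
        """Stack cached frames into one (T, N, d) sequence for the model."""
        return np.stack(list(self.buffer), axis=0)

cache = TemporalCache(max_frames=3)
for t in range(5):  # stream five frames; only the last three are kept
    cache.update(np.full((16, 32), float(t)))
seq = cache.window()
print(seq.shape)  # (3, 16, 32)
```

Bounding the cache is what keeps the sequential modeling cost constant per step instead of growing with episode length.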

📝 Abstract
Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.
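The Fusion Token idea in the abstract, reduced to its core, is a small set of learnable query tokens that cross-attend over the concatenated 2D image and 4D geometry features to produce a unified representation. The sketch below is a simplified single-head version under assumed shapes; the function names and dimensions are not from the paper.

```python
import numpy as np

# Hypothetical sketch of Fusion Tokens: learnable queries cross-attend
# over concatenated 2D and 4D feature tokens. Single-head attention only;
# the real model would use the VLM's attention layers.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(fusion_tokens, img_feats, geo_feats):
    """Return one fused vector per Fusion Token, attending over all inputs."""
    kv = np.concatenate([img_feats, geo_feats], axis=0)        # (N_img+N_geo, d)
    scores = fusion_tokens @ kv.T / np.sqrt(kv.shape[1])       # (n_fusion, N)
    return softmax(scores) @ kv                                # (n_fusion, d)

rng = np.random.default_rng(0)
d = 32
tokens = rng.normal(size=(4, d))   # 4 learnable Fusion Tokens (illustrative)
img = rng.normal(size=(16, d))     # 2D image features
geo = rng.normal(size=(8, d))      # 4D features from the geometry transformer
fused = fuse(tokens, img, geo)
print(fused.shape)  # (4, 32)
```

Because the fused output has a fixed, small token count, the action head's cost stays constant regardless of how many 2D or 4D tokens feed into it.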
Problem

Research questions and friction points this paper is trying to address.

Enhances lightweight VLA models for spatiotemporal reasoning
Integrates 4D features from 2D images with minimal overhead
Enables efficient action generation on edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pretrained 4D visual geometry transformer with temporal cache
Introduces Fusion Tokens for unified 2D and 4D feature representation
Employs mask-and-reconstruct strategy to enable efficient 4D learning
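The mask-and-reconstruct strategy listed above can be sketched as follows: during training, a random subset of 4D tokens is masked and the model is penalized for failing to reconstruct them, which forces the 2D pathway to internalize 4D structure. The masking ratio, loss, and stand-in prediction below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Illustrative mask-and-reconstruct training step. Because the VLM learns
# to recover masked 4D tokens from 2D context, the 4D branch can be
# dropped at inference with minimal performance loss.

def mask_features(feats, mask_ratio, rng):
    """Zero out a random subset of 4D feature tokens; return masked copy and mask."""
    mask = rng.random(feats.shape[0]) < mask_ratio
    masked = feats.copy()
    masked[mask] = 0.0
    return masked, mask

def reconstruction_loss(pred, target, mask):
    """Mean squared error computed on the masked positions only."""
    if not mask.any():
        return 0.0
    diff = pred[mask] - target[mask]
    return float((diff ** 2).mean())

rng = np.random.default_rng(1)
geo = rng.normal(size=(8, 32))                   # 4D features for one frame
masked_geo, mask = mask_features(geo, 0.5, rng)
pred = geo + 0.1 * rng.normal(size=geo.shape)    # stand-in model prediction
loss = reconstruction_loss(pred, geo, mask)
print(masked_geo.shape, loss >= 0.0)
```

At deployment the 4D branch is simply omitted and the VLM runs on 2D image tokens alone, which is the source of the reported edge-device speed and memory gains.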