TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

📅 2025-08-15
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current Vision-Language-Action (VLA) models process visual inputs frame-by-frame, neglecting the temporal structure of manipulation tasks; this makes them sensitive to visual noise and unable to exploit inter-frame coherence. To address this, we propose a **training-free temporal token fusion method** that selectively fuses historical and current visual representations under pixel-level attention guidance, augmented by dual-dimension detection and keyframe anchoring to suppress error accumulation. We further show that selective reuse of the Query matrix strikes a favorable balance between performance and efficiency. The method is architecture-agnostic and integrates seamlessly with mainstream VLA models. On the LIBERO benchmark it improves the average success rate by 4.0 percentage points (from 68.4% to 72.4%); relative gains of 4.8% and 8.7% are observed on SimplerEnv and real-robot manipulation tasks, respectively. These results demonstrate the method's generality, robustness, and practical utility.

📝 Abstract
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
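The fusion logic described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of dual-dimension detection (grayscale pixel difference plus attention-based relevance), hard fusion, and keyframe anchoring; the function names, thresholds, and fusion rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grayscale_diff(frame_prev, frame_curr):
    """Mean absolute grayscale pixel difference between consecutive frames.

    Frames are H x W x 3 RGB arrays; ITU-R BT.601 luma weights convert
    them to grayscale before differencing.
    """
    w = np.array([0.299, 0.587, 0.114])
    return np.abs(frame_curr @ w - frame_prev @ w).mean()

def attention_relevance(attn_prev, attn_curr):
    """Cosine similarity between per-token attention score vectors,
    a cheap proxy for semantic relevance across timesteps."""
    num = float(attn_prev @ attn_curr)
    den = float(np.linalg.norm(attn_prev) * np.linalg.norm(attn_curr) + 1e-8)
    return num / den

def fuse_tokens(tokens_hist, tokens_curr, frame_prev, frame_curr,
                attn_prev, attn_curr, step, keyframe_every=10,
                pixel_thresh=8.0, attn_thresh=0.9):
    """Hard temporal token fusion with keyframe anchoring (illustrative).

    Reuse historical visual tokens only when both the pixel-level and
    the attention-level tests agree the scene is nearly static; force a
    keyframe periodically so stale tokens cannot accumulate.
    """
    if step % keyframe_every == 0:          # keyframe anchor: always refresh
        return tokens_curr
    static_pixels = grayscale_diff(frame_prev, frame_curr) < pixel_thresh
    static_semantics = attention_relevance(attn_prev, attn_curr) > attn_thresh
    if static_pixels and static_semantics:  # hard fusion: reuse history wholesale
        return tokens_hist
    return tokens_curr
```

Because the detection is training-free and runs outside the backbone, a gate like this can wrap any VLA's vision encoder without retraining, which is what makes the approach model-agnostic.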
Problem

Research questions and friction points this paper is trying to address.

Integrates temporal visual information to overcome frame-by-frame processing limitations
Reduces vulnerability to visual noise through selective historical token fusion
Enhances vision-language-action model performance across simulated and real environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free temporal token fusion for VLA models
Dual-dimension detection with pixel and attention analysis
Selective fusion via hard strategies and keyframe anchoring
Chenghao Liu
School of Advanced Manufacturing and Robotics, Peking University
Jiachen Zhang
School of Advanced Manufacturing and Robotics, Peking University
Chengxuan Li
School of Advanced Manufacturing and Robotics, Peking University
Zhimu Zhou
School of Advanced Manufacturing and Robotics, Peking University
Shixin Wu
School of Advanced Manufacturing and Robotics, Peking University
Songfang Huang
Peking University, Alibaba DAMO, IBM Research, The University of Edinburgh
Huiling Duan
School of Advanced Manufacturing and Robotics, Peking University