BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency of multi-view vision-language-action (VLA) models, which stems from the large number of visual tokens and hinders real-time robotic applications. Existing token pruning methods often overlook inter-view dependencies and task dynamics, leading to performance degradation. To overcome these limitations, we propose BFA++, a dynamic token pruning framework tailored for multi-view VLA models. BFA++ introduces a novel two-level importance prediction mechanism: it identifies task-relevant regions within each view to suppress noise and dynamically selects critical views across the multi-view input to eliminate redundancy, enabling task-aware efficient pruning. Experiments demonstrate that BFA++ improves task success rates by approximately 10% over baselines on both RoboTwin and real-world robotic tasks, while achieving 1.8× and 1.5× speedups on π0 and RDT models, respectively.

📝 Abstract
Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision-Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the π0 and RDT models, achieving speedups of 1.8× and 1.5×, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.
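The hierarchical scheme described in the abstract can be illustrated with a minimal sketch: score tokens within each view and keep the top-k (intra-view), then score the surviving views and keep only the most informative ones (inter-view). Note this is an assumption-laden toy, not the paper's implementation: BFA++ uses trained importance predictors, whereas `w_intra` and `w_inter` below are placeholder linear scorers, and `hierarchical_token_prune` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def hierarchical_token_prune(tokens, w_intra, w_inter, keep_per_view, keep_views):
    """Toy two-level pruning over multi-view visual tokens.

    tokens: (V, N, D) array of N tokens of dim D from V camera views.
    w_intra, w_inter: (D,) stand-in linear scorers (the paper instead
    trains intra-view and inter-view importance predictors).
    """
    # Level 1 (intra-view): score every token, keep the top-k per view
    # to suppress task-irrelevant spatial regions.
    token_scores = tokens @ w_intra                                   # (V, N)
    keep_idx = np.argsort(-token_scores, axis=1)[:, :keep_per_view]   # (V, k)
    pruned = np.take_along_axis(tokens, keep_idx[..., None], axis=1)  # (V, k, D)

    # Level 2 (inter-view): score each view by its mean kept-token
    # feature and drop redundant views.
    view_scores = pruned.mean(axis=1) @ w_inter                       # (V,)
    top_views = np.sort(np.argsort(-view_scores)[:keep_views])
    return pruned[top_views]                                          # (keep_views, k, D)

tokens = rng.standard_normal((3, 8, 4))   # 3 views, 8 tokens each, dim 4
w1 = rng.standard_normal(4)
w2 = rng.standard_normal(4)
out = hierarchical_token_prune(tokens, w1, w2, keep_per_view=4, keep_views=2)
print(out.shape)  # (2, 4, 4): 2 views kept, 4 tokens per view
```

The compute saving comes from the downstream model attending over `keep_views * keep_per_view` tokens instead of `V * N`; in the example, 8 tokens rather than 24.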
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
token pruning
multi-view
real-time robotic manipulation
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning
vision-language-action models
multi-view perception
task-aware pruning
robotic manipulation
Haosheng Li
Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Weixin Mao
LimX Dynamic, Shenzhen, China
Zihan Lan
LimX Dynamic, Shenzhen, China
Hongwei Xiong
LimX Dynamic, Shenzhen, China
Hongan Wang
Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Chenyang Si
Nanjing University, Nanjing, China
Ziwei Liu
Associate Professor, Nanyang Technological University
Computer Vision, Machine Learning, Computer Graphics
Xiaoming Deng
Institute of Software, CAS
Computer Vision, Robotic Manipulation, Natural User Interfaces, Virtual Humans, Hand Tracking
Hua Chen
Assistant Professor, ZJU-UIUC Institute; Co-founder, LimX Dynamics
Robotics, Embodied AI, Robot Learning, Reinforcement Learning, Control