🤖 AI Summary
This work addresses the high inference latency of Vision-Language-Action models, which stems from the large number of visual tokens processed by their language backbones; conventional compression methods reduce this cost but degrade the spatial reasoning essential for manipulation. The authors propose a training-free visual token compression framework, DepthCache, that, for the first time, leverages depth information as a prior for spatial structure, enabling fine-grained preservation of near-field details while efficiently compressing distant regions. The approach further incorporates cross-frame temporal redundancy elimination and a motion-adaptive multi-view compression strategy, substantially improving real-time performance and robustness. Evaluated on the LIBERO benchmark, the method achieves up to 1.28× inference speedup with less than 1% average drop in task success rate. Real-world robotic experiments demonstrate enhanced task throughput and more responsive closed-loop control.
📝 Abstract
Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on end-effector dynamics. The framework requires no model modification, generalizing across diverse VLA architectures. On the LIBERO benchmark, DepthCache achieves up to 1.28× inference speedup with less than 1% average success rate degradation across three VLA models (pi_0.5, OpenVLA, GR00T), whereas pruning and merging baselines incur 4–24% degradation at comparable compression. Real-world experiments on a physical manipulator demonstrate that DepthCache enables faster task throughput and more responsive closed-loop control in latency-sensitive scenarios.
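The core idea of depth-partitioned, spatially differentiated merging can be illustrated with a toy sketch. This is not the paper's implementation: the function name `depth_binned_merge` and the parameters `near_thresh` and `far_ratio` are hypothetical, and a single near/far split with mean pooling stands in for whatever partitioning and merge operator DepthCache actually uses.

```python
import numpy as np

def depth_binned_merge(tokens, depths, near_thresh=1.0, far_ratio=4):
    """Illustrative sketch: keep near-field visual tokens intact and
    average-pool far-field tokens in groups of `far_ratio`.

    tokens : (N, D) array of visual token embeddings
    depths : (N,) per-token depth values (e.g. mean depth of the patch)
    """
    near = tokens[depths <= near_thresh]   # preserve workspace detail
    far = tokens[depths > near_thresh]     # compress distant background
    # Pad by repeating trailing tokens so `far` divides evenly into groups.
    pad = (-len(far)) % far_ratio
    if pad:
        far = np.concatenate([far, far[-pad:]])
    merged_far = far.reshape(-1, far_ratio, far.shape[-1]).mean(axis=1)
    return np.concatenate([near, merged_far])
```

With 4 near-field and 6 far-field tokens at `far_ratio=3`, the output keeps the 4 near tokens verbatim and collapses the background to 2 merged tokens, reducing the sequence the language backbone must process.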