🤖 AI Summary
This work addresses the high computational cost that hinders real-time deployment of vision-language-action (VLA) models. Existing token pruning methods face a fundamental trade-off between compression ratio and the preservation of critical geometric details, such as contact points. To overcome this, the authors propose a geometry-aware, continuous visual token resampling mechanism that operates inside the visual encoder: it predicts task-aware coordinates for salient features and extracts those features through differentiable interpolation. Formulating compression as continuous resampling yields GridS, a plug-and-play differentiable grid sampler designed to break the performance-efficiency trade-off of conventional pruning. Experiments on the LIBERO benchmark and a real robotic platform show no degradation in task success rate while using fewer than 10% of the original visual tokens and reducing FLOPs by 76%, which the authors describe as the lowest feasible visual token count reported to date.
📝 Abstract
Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive pruning inevitably discards critical geometric details such as contact points, leading to severe performance degradation. This forces a compromise that limits the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA models. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
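To make the idea of continuous resampling concrete, the following is a minimal sketch of extracting a small set of tokens from a patch-feature grid at continuous coordinates. It is an illustration under stated assumptions, not the GridS implementation: the abstract does not specify the interpolation scheme, so bilinear interpolation is assumed here, the coordinates are hard-coded rather than predicted by a task-aware network, and the names `bilinear_sample` and `resample_tokens` are hypothetical.

```python
from typing import List, Sequence, Tuple

# An H x W grid of C-dimensional patch features (nested lists for clarity).
Grid = List[List[List[float]]]


def bilinear_sample(grid: Grid, x: float, y: float) -> List[float]:
    """Interpolate one feature vector at normalized (x, y), each in [0, 1].

    ASSUMPTION: bilinear interpolation stands in for the paper's
    "differentiable interpolation"; the actual scheme may differ.
    """
    h, w = len(grid), len(grid[0])
    # Clamp, then map normalized coordinates to continuous index space.
    fx = min(max(x, 0.0), 1.0) * (w - 1)
    fy = min(max(y, 0.0), 1.0) * (h - 1)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = fx - x0, fy - y0  # fractional offsets act as blend weights
    channels = len(grid[0][0])
    return [
        (1 - wy) * ((1 - wx) * grid[y0][x0][c] + wx * grid[y0][x1][c])
        + wy * ((1 - wx) * grid[y1][x0][c] + wx * grid[y1][x1][c])
        for c in range(channels)
    ]


def resample_tokens(grid: Grid, coords: Sequence[Tuple[float, float]]) -> List[List[float]]:
    """Return one interpolated token per continuous coordinate."""
    return [bilinear_sample(grid, x, y) for x, y in coords]


# Toy 16 x 16 grid (256 tokens) whose feature at (row, col) is [col, row].
grid = [[[float(col), float(row)] for col in range(16)] for row in range(16)]

# Hypothetical coordinates; in the paper these would be predicted per task.
coords = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
tokens = resample_tokens(grid, coords)
```

Because the output is a weighted blend of neighboring features, it varies smoothly with the sampling coordinates, which is what allows a coordinate predictor to be trained by gradient descent. Hard pruning, by contrast, makes a discrete keep-or-drop decision per token. In a tensor framework the same operation would typically be expressed with a built-in grid-sampling primitive rather than Python loops.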