See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

📅 2026-05-12
🤖 AI Summary
This work addresses the high computational cost that makes vision-language-action (VLA) models hard to deploy in real time. Existing token pruning methods face a fundamental trade-off between compression ratio and preservation of critical geometric details, such as contact points. To overcome this, the authors formulate compression as geometry-aware, continuous resampling of visual tokens inside the vision encoder: a task-aware module predicts a small set of salient coordinates, and features are extracted at those coordinates by differentiable interpolation. The resulting module, GridS, is a plug-and-play differentiable grid sampler intended to avoid the performance-efficiency trade-off of conventional pruning. Experiments on the LIBERO benchmark and a real robotic platform show no degradation in task success rate while using fewer than 10% of the original visual tokens and 76% fewer FLOPs, which the authors describe as the lowest feasible visual token count reported to date.
📝 Abstract
Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive pruning inevitably discards critical geometric details such as contact points, leading to severe performance degradation. This forces a compromise that limits the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA models. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform demonstrate that GridS achieves a 76% reduction in FLOPs with no degradation in success rate, validating the lowest feasible visual token count reported to date. The code is available at https://github.com/Fediory/Grid-Sampler.
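To make the idea of "extracting features via differentiable interpolation" concrete, here is a minimal sketch of bilinear sampling from a grid of visual tokens at continuous coordinates. It is not the authors' implementation: the function names `bilinear_sample` and `resample_tokens`, the pixel-style coordinate convention, and the hard-coded coordinates are all assumptions for illustration. In GridS the coordinates would come from a learned, task-aware predictor, which is not modeled here.

```python
import math


def bilinear_sample(grid, x, y):
    """Interpolate a feature vector from grid[row][col] at continuous (x, y).

    Assumed convention (hypothetical, not from the paper): x indexes columns
    in [0, W-1], y indexes rows in [0, H-1]; out-of-range values are clamped.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)

    # Integer corners surrounding the sampling point.
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0

    # Each corner contributes in proportion to the opposite sub-rectangle area.
    corners = [
        ((1 - dx) * (1 - dy), grid[y0][x0]),
        (dx * (1 - dy), grid[y0][x1]),
        ((1 - dx) * dy, grid[y1][x0]),
        (dx * dy, grid[y1][x1]),
    ]
    dim = len(grid[0][0])
    return [sum(wt * feat[c] for wt, feat in corners) for c in range(dim)]


def resample_tokens(grid, coords):
    """Replace an H x W token grid with len(coords) interpolated tokens."""
    return [bilinear_sample(grid, x, y) for x, y in coords]


if __name__ == "__main__":
    # Toy 4 x 4 grid (16 tokens); each token's feature is [column, row].
    grid = [[[float(c), float(r)] for c in range(4)] for r in range(4)]
    # Hypothetical "salient" coordinates standing in for predicted ones.
    tokens = resample_tokens(grid, [(1.5, 2.25), (0.0, 0.0)])
    print(len(tokens), tokens[0])
```

Because each output token is a weighted sum whose weights vary piecewise linearly with `x` and `y`, a gradient with respect to the sampling coordinates exists almost everywhere. That property is what would let a coordinate predictor be trained end to end in an autodiff framework, in contrast to hard token pruning, where the keep-or-drop decision is discrete.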
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
token pruning
computational cost
geometric details
real-time deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable Grid Sampling
Vision-Language-Action Models
Token Pruning
Geometry-Aware Compression
Continuous Resampling