SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

📅 2026-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes SCALE, a test-time scaling approach for vision-language-action (VLA) models that addresses perceptual ambiguity by modulating perception and action jointly, driven by the model's own uncertainty. Unlike existing methods, which rely on additional training, external verifiers, and multiple forward passes and intervene only at the action decoding stage, SCALE draws on uncertainty-driven exploration from Active Inference theory to adjust both visual perception and action execution within a single forward pass, with no extra training and no verifier. Evaluated on both simulated and real-world robotic tasks, SCALE outperforms existing test-time scaling methods while keeping the efficiency of single-pass inference.

📝 Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, and test-time scaling (TTS) is gaining attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory. It requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
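
The abstract does not give SCALE's formulas, so the following is only a minimal sketch of the general idea (explore more when uncertain, exploit when confident), not the authors' implementation. It assumes that self-uncertainty can be approximated by the normalized entropy of the action-token distribution and that both action sampling and visual attention can be modulated by a shared temperature. The function names (`normalized_entropy`, `uncertainty_to_temperature`, `modulate`) and the bounds `t_min` and `t_max` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def uncertainty_to_temperature(u, t_min=0.1, t_max=1.5):
    """Assumed linear map: low uncertainty -> sharp, high -> broad."""
    return t_min + u * (t_max - t_min)


def modulate(action_logits, attention_logits):
    """Reuse logits from one forward pass; no verifier, no extra pass.

    Hypothetical proxy: uncertainty is the normalized entropy of the
    unmodulated action distribution. The same temperature then reshapes
    both the action distribution and the visual attention weights.
    """
    u = normalized_entropy(softmax(action_logits))
    t = uncertainty_to_temperature(u)
    action_probs = softmax(action_logits, t)
    attention_weights = softmax(attention_logits, t)
    return u, action_probs, attention_weights


if __name__ == "__main__":
    # Confident case: one action dominates, so both distributions sharpen.
    print(modulate([10.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0]))
    # Ambiguous case: flat action logits, so both distributions broaden.
    print(modulate([0.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0]))
```

In this sketch, a flat action distribution yields an uncertainty of 1 and the widest temperature, which spreads attention across more visual tokens, while a peaked distribution pushes both action selection and attention toward their top choice.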

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models

test-time scaling

perceptual ambiguity

visual representation

robotic control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

Test-Time Scaling

Self-Uncertainty