Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments

📅 2026-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unstable training, poor generalization, and difficult error recovery in vision-language navigation in continuous environments (VLN-CE), problems that stem from sparse rewards and accumulated trajectory errors. The authors propose Step-Aware Contrastive Alignment (SACA), a framework that extracts dense supervisory signals from imperfect trajectories. A perception-grounded step-aware auditor identifies the valid prefix and the divergence point in each failed trajectory, and scenario-conditioned group construction routes batches to specialized resampling and optimization strategies. Together these enable fine-grained credit assignment and stable reinforcement fine-tuning. By combining multimodal large language models, contrastive learning, and reinforcement learning, SACA achieves state-of-the-art performance on the VLN-CE benchmark, improving navigation success rates and robustness.
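The idea of splitting a failed trajectory into a valid prefix and a divergence point can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the per-step scores, the `threshold` value, and the function name `split_at_divergence` are hypothetical, and the rule "first step scoring below a threshold" is a stand-in, not the paper's auditor.

```python
from typing import List, Optional, Tuple


def split_at_divergence(
    step_scores: List[float], threshold: float = 0.5
) -> Tuple[List[int], Optional[int]]:
    """Return (indices of the valid prefix, index of the divergence step).

    Hypothetical rule: the trajectory diverges at the first step whose
    progress score drops below `threshold`. If no step does, the whole
    trajectory counts as valid and the divergence index is None.
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return list(range(i)), i
    return list(range(len(step_scores))), None


# Made-up per-step progress scores for one failed rollout.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
prefix, divergence = split_at_divergence(scores)
print(prefix, divergence)
```

Under this toy rule, the steps before the divergence index could still receive positive credit even though the episode as a whole failed, which is the sense in which a failed trajectory can yield dense supervision.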

📝 Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery, and training stability. Specifically, (i) policies derived from supervised fine-tuning (SFT) suffer from compounding errors and struggle to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods such as GRPO are bottlenecked by sparse outcome rewards: their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure-dominated batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware auditor evaluates progress step by step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, the Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
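The gradient collapse mentioned in point (ii) can be seen in a small numeric sketch. It assumes the commonly used group-relative advantage, in which each rollout's reward is normalized by the mean and standard deviation of its group; the function name, the `eps` constant, and the reward values are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary outcome rewards for two groups of four rollouts (made-up values).
mixed_group = [1.0, 0.0, 0.0, 0.0]   # one success, three failures
failed_group = [0.0, 0.0, 0.0, 0.0]  # every rollout failed

mixed_adv = group_relative_advantages(mixed_group)
failed_adv = group_relative_advantages(failed_group)

print(mixed_adv)   # the success gets a positive advantage, the failures negative
print(failed_adv)  # all zeros: this group contributes no policy-gradient signal
```

When every rollout in a group fails, the rewards are identical, so every advantage is zero and the group carries no learning signal, regardless of how far each rollout progressed before going wrong.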

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation

Continuous Environments

Reinforcement Fine-Tuning

Credit Assignment

Training Stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-Aware Contrastive Alignment

Vision-Language Navigation

Continuous Environments