🤖 AI Summary
In vision-and-language navigation (VLN), an agent that improves from its own policy-induced experience using only standard action supervision must balance behavioral diversity against learning stability; when that balance fails, the self-generated learning signal becomes unreliable. This work treats the trade-off as a tunable balance and introduces Stability-Diversity Balance (SDB), a plug-and-play mechanism. At each decision step, SDB applies controlled shifts to the instruction-conditioned hidden states to produce multiple latent behavioral hypotheses, retains diverse yet instruction-consistent alternatives through reliability-aware soft evaluation and aggregation, and adds a regularizer on hypothesis interactions that prevents hypothesis diversity from collapsing prematurely or drifting excessively. Without additional supervision, the method yields consistent improvements on the R2R, SOON, and REVERIE benchmarks; for example, on REVERIE val-unseen it raises SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.
📝 Abstract
In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.
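The three stages named in the abstract (hypothesis expansion, reliability-aware soft aggregation, and a diversity regularizer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sdb_step`, the linear action head `W`, the Gaussian form of the shifts, the KL-based reliability score, and the band-style regularizer with its `target` value are all assumptions chosen for illustration, since the abstract does not specify any of them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sdb_step(h, W, K=4, sigma=0.1, tau=1.0, target=0.05, rng=None):
    """Illustrative single decision step (all design choices are assumptions).

    h: (d,) instruction-conditioned hidden state
    W: (d, A) hypothetical linear action head
    """
    rng = np.random.default_rng(0) if rng is None else rng
    eps = 1e-12

    # 1) Expand the step into K latent hypotheses via controlled shifts
    #    (assumed here to be Gaussian with scale sigma).
    H = h[None, :] + sigma * rng.standard_normal((K, h.shape[0]))  # (K, d)

    # 2) Action distribution for each hypothesis and for the unshifted state.
    P = softmax(H @ W, axis=-1)   # (K, A)
    base = softmax(h @ W)         # (A,)

    # 3) Reliability-aware soft aggregation (assumed scoring rule):
    #    hypotheses closer to the unshifted policy in KL get more weight,
    #    but none is discarded outright.
    kl = np.sum(P * (np.log(P + eps) - np.log(base + eps)), axis=-1)  # (K,)
    w = softmax(-kl / tau)        # (K,)
    p_agg = w @ P                 # (A,)

    # 4) Regularizer (assumed form): penalize mean divergence that is
    #    either too small (collapse) or too large (drift) versus a target.
    reg = float((kl.mean() - target) ** 2)
    return p_agg, w, reg
```

With `sigma=0` every hypothesis coincides with the unshifted state, so the weights become uniform and the aggregated distribution equals the base policy; larger `sigma` spreads the hypotheses and the regularizer then penalizes divergence beyond the assumed target.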