AI Summary
This work addresses cascading failures in multimodal embodied intelligence tasks, where an error in an intermediate step propagates to later steps and degrades overall performance. To mitigate this, the authors propose a hierarchical prediction-correction framework that performs prediction and contrastive alignment at three levels: actions, subgoals, and trajectories. This allows deviations to be corrected during execution while keeping each step semantically consistent with the agent's high-level intent. The method introduces two modules, a Sinkhorn-based alignment module and a Score-field module, whose alignment signals are used together with the predictive correction to update the action generator during training, so that fine-grained steps can be adjusted without compromising global goal coherence. The authors also devise new metrics that quantify how errors propagate and how the agent recovers over long-horizon execution. Experiments show that the approach outperforms both open-source and proprietary large language model baselines on embodied agent benchmarks, including VisualAgentBench, MineDojo, and AI2-THOR.
Abstract
Vision-Language-Action (VLA) systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose the Predictive Alignment and Planning Architecture (ReCAPA), a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps so that they remain aligned with the overall intent. We further introduce two new metrics that quantify error propagation and recovery, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source large language model baselines.
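The abstract names a Sinkhorn-based alignment module but does not describe its internals. For readers unfamiliar with the term, the following is a minimal sketch of the generic Sinkhorn iteration for entropy-regularised optimal transport, which such a module could plausibly build on. It is not the authors' implementation: the function name `sinkhorn`, the uniform marginals, the `epsilon` and `n_iters` values, and the toy cost matrix (three hypothetical action embeddings versus two hypothetical subgoal embeddings) are all assumptions made for illustration.

```python
import math


def sinkhorn(cost, epsilon=0.5, n_iters=200):
    """Generic Sinkhorn iteration (illustrative, not the paper's module).

    cost[i][j] is a dissimilarity between source item i and target item j.
    Returns a soft assignment (transport plan) whose rows sum to 1/n and
    whose columns sum to 1/m, i.e. uniform marginals are assumed.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> high affinity; epsilon controls softness.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_marginal = [1.0 / n] * n
    col_marginal = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows, then columns, so the plan matches both marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs: 3 action embeddings (rows) vs 2 subgoal embeddings (cols).
cost = [
    [0.1, 0.9],  # action 0 is close to subgoal 0
    [0.8, 0.2],  # action 1 is close to subgoal 1
    [0.5, 0.5],  # action 2 is equidistant from both
]
plan = sinkhorn(cost)
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of `plan` is a soft assignment of one action to the subgoals. In a training setup, a plan like this could be combined with the cost matrix to form a differentiable alignment loss, though how ReCAPA does so is not specified in the abstract.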