Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) exhibit frontier oscillation during step-wise exploration in embodied question answering (EQA), stemming from overconfidence and miscalibration and leading to unstable navigation and degraded answer quality. To address this, the paper proposes the Prune-Then-Plan framework, which introduces a hierarchical pruning mechanism grounded in the Holm–Bonferroni principle. The mechanism decouples the pruning of unreliable frontier action candidates from final decision-making, enabling conservative and interpretable action selection. The authors further integrate a coverage-aware planner and human-judgment-driven step-level calibration into the 3D-Mem EQA architecture. Evaluated on OpenEQA and EXPRESS-Bench, the method improves scene coverage under equal exploration budgets and achieves relative gains of up to 49% in visually grounded Success-weighted by Path Length (SPL) and 33% in LLM-Match over strong baselines, effectively mitigating the exploration oscillation caused by insufficient VLM calibration.

๐Ÿ“ Abstract
Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations: unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni-inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics, respectively, over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both the OpenEQA and EXPRESS-Bench datasets.
Problem

Research questions and friction points this paper is trying to address.

Stabilize frontier oscillations in embodied question answering
Address VLM overconfidence causing inefficient navigation
Improve step-level calibration for better scene coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes implausible frontier choices using Holm-Bonferroni procedure
Delegates final decisions to coverage-based planner
Converts overconfident predictions into conservative interpretable actions
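The Holm-Bonferroni-style pruning named in the bullets above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: it assumes each frontier candidate carries a p-value-like plausibility score (low means implausible), and applies Holm's step-down thresholds to decide which candidates to prune before a planner picks among the survivors. The function name and score semantics are assumptions for illustration.

```python
def holm_bonferroni_prune(scores, alpha=0.05):
    """Prune implausible frontier candidates with Holm's step-down rule.

    `scores` maps frontier id -> a p-value-like plausibility score
    (low = implausible). Candidates are tested in ascending order of
    score against the Holm threshold alpha / (m - k); once a candidate
    survives, all remaining (larger-score) candidates survive too.
    Returns the surviving frontier ids in their original order.
    """
    m = len(scores)
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    pruned = set()
    for k, (fid, p) in enumerate(ordered):
        if p <= alpha / (m - k):  # Holm's step-down threshold
            pruned.add(fid)       # reject: too implausible to explore
        else:
            break                 # first survivor ends the step-down
    return [fid for fid in scores if fid not in pruned]
```

A coverage-based planner would then choose among the returned survivors, e.g. by expected information gain, rather than trusting the raw VLM scores directly.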