VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

📅 2025-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Visual instruction-driven sequential decision-making remains challenging because perception and reasoning are entangled, which limits interpretability, generalization, and performance. Method: This paper proposes VIPER, a multimodal planning framework that decouples perception from reasoning: a frozen vision-language model (e.g., LLaVA) generates textual descriptions of image observations, while a large language model (e.g., Llama-3) performs action reasoning solely from these descriptions and the task goal. It introduces the first modular architecture using text as an intermediate representation, explicitly separating perception from decision-making. The policy is optimized via behavior cloning and PPO-based reinforcement learning. Contribution/Results: On the ALFWorld benchmark, VIPER achieves a 27% absolute improvement in task success rate over existing visual instruction planners, approaching the performance of a pure-text oracle. Because the intermediate representation is text, it also enables fine-grained attribution analysis of the perception and reasoning components, improving decision transparency, interpretability, and cross-task generalization.

📝 Abstract

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.
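
The modular pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ModularAgent`, `fake_vlm`, and `fake_llm` are hypothetical names, and the two stand-in functions replace a real frozen VLM and a fine-tuned LLM policy so the example runs without any model. The point it shows is that the only thing passed from perception to reasoning is text, so every decision can be inspected alongside the description that produced it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    description: str  # text produced by the perception module
    action: str       # action chosen by the reasoning module


@dataclass
class ModularAgent:
    # Hypothetical callables standing in for a frozen VLM and an LLM policy.
    perceive: Callable[[object], str]
    reason: Callable[[str, str], str]
    trace: List[Step] = field(default_factory=list)

    def act(self, goal: str, image: object) -> str:
        description = self.perceive(image)       # perception: image -> text
        action = self.reason(goal, description)  # reasoning: (goal, text) -> action
        self.trace.append(Step(description, action))  # keep the text for inspection
        return action


# Toy stand-ins (assumptions) so the sketch runs without any model.
def fake_vlm(image: object) -> str:
    return f"You see {image}."


def fake_llm(goal: str, description: str) -> str:
    return "take mug" if "mug" in description else "look around"


agent = ModularAgent(perceive=fake_vlm, reason=fake_llm)
first = agent.act("put a mug on the desk", "a shelf")
second = agent.act("put a mug on the desk", "a mug on a counter")

for step in agent.trace:
    print(f"{step.description!r} -> {step.action!r}")
```

When an action is wrong, the trace makes it possible to tell whether the description was faulty (a perception error) or the description was fine and the chosen action was not (a reasoning error).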

Problem

Research questions and friction points this paper is trying to address.

How to integrate VLM-based perception with LLM-based reasoning for visual instruction-based planning, which remains an open problem.

How to improve the agent's decision-making through behavioral cloning and reinforcement learning.

How to make the agent's decisions explainable by using text as an intermediate representation.

Innovation

Methods, ideas, or system contributions that make the work stand out.

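Integrates VLM-based perception with LLM-based reasoning in a modular pipeline: a frozen VLM describes each image observation in text, and an LLM policy predicts actions from that text and the task goal.

Fine-tunes the reasoning module with behavioral cloning and reinforcement learning.

Enhances explainability by using text as the intermediate representation, which allows the perception and reasoning components to be analyzed separately.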
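
As a rough illustration of the behavioral cloning stage, the sketch below computes the standard imitation objective: the mean negative log-likelihood of expert actions under the policy. The page only states that behavioral cloning and reinforcement learning are used, so the exact loss, the prompt format, and the names `bc_loss` and `toy_policy` are assumptions made for illustration; the reinforcement learning stage is not shown.

```python
import math
from typing import Callable, Dict, List, Tuple

# One demonstration step: (text prompt seen by the policy, expert action).
Demo = Tuple[str, str]
Policy = Callable[[str], Dict[str, float]]


def bc_loss(policy: Policy, demos: List[Demo]) -> float:
    """Mean negative log-likelihood of the expert actions under the policy."""
    total = 0.0
    for prompt, expert_action in demos:
        probs = policy(prompt)                    # action -> probability
        total += -math.log(probs[expert_action])  # penalize low expert probability
    return total / len(demos)


# Toy policy (an assumption): a fixed distribution over two actions.
def toy_policy(prompt: str) -> Dict[str, float]:
    if "mug" in prompt:
        return {"take mug": 0.5, "look around": 0.5}
    return {"take mug": 0.25, "look around": 0.75}


demos = [
    ("You see a mug.", "take mug"),
    ("You see a shelf.", "look around"),
]
loss = bc_loss(toy_policy, demos)
print(round(loss, 4))
```

In an actual fine-tuning run the probabilities would come from the LLM's output distribution and the loss would be minimized by gradient descent; the toy policy here is fixed and only shows how the objective is evaluated.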
