VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from limited performance in embodied decision-making within virtual open-world environments due to insufficient domain-specific knowledge. Method: This paper proposes a low-cost, high-efficiency agent construction framework featuring (i) a vision–language cross-modal knowledge graph, (ii) a lightweight customized object detector, (iii) retrieval-augmented reasoning, and (iv) a desktop manipulation skill library. Crucially, retrieval-based information extraction reduces required domain annotation effort from millions to merely hundreds of samples. Contribution/Results: The approach achieves state-of-the-art performance across diverse open-world tasks. It significantly lowers development overhead while markedly enhancing environmental perception and grounded decision-making capabilities—demonstrating both scalability and practicality for real-world embodied AI applications.

📝 Abstract
Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.
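The retrieval-based pooling idea described above can be illustrated with a toy sketch: a handful of Minecraft facts stand in for the cross-modal KG, nodes are scored against the task text and the object detector's labels, and the top facts are pooled into a prompt context. All names, the facts, and the scoring scheme here are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of retrieval-based pooling over a cross-modal knowledge graph.
# Node names, facts, and the overlap-based scoring are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    text: str                                        # textual knowledge (e.g. a recipe)
    visual_tags: list = field(default_factory=list)  # labels a detector could emit

# A few hand-written facts standing in for the mined knowledge graph.
KG = [
    Node("oak_log", "Oak logs are obtained by chopping oak trees.", ["tree", "log"]),
    Node("planks", "Craft 4 oak planks from 1 oak log.", ["log"]),
    Node("crafting_table", "Craft a crafting table from 4 planks.", ["planks"]),
    Node("wooden_pickaxe", "Craft a wooden pickaxe from 3 planks and 2 sticks "
         "at a crafting table.", ["planks", "stick", "crafting_table"]),
]

def retrieve(task: str, detections: list, k: int = 2) -> str:
    """Score nodes by word overlap with the task and by detected objects,
    then pool the top-k facts into a context string for the LLM prompt."""
    words = set(task.lower().split())
    def score(n: Node) -> int:
        return (len(words & set(n.text.lower().split()))
                + 2 * len(set(detections) & set(n.visual_tags)))
    top = sorted(KG, key=score, reverse=True)[:k]
    return "\n".join(n.text for n in top)

context = retrieve("craft a wooden pickaxe", detections=["log", "planks"])
print(context)
```

A real system would replace the word-overlap score with embedding similarity and feed detector output from the actual game frame, but the pooling step (select task-relevant subgraph, serialize into the prompt) has the same shape.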
Problem

Research questions and friction points this paper is trying to address.

Reducing domain-specific data requirements for embodied agents
Integrating cross-modal knowledge for environment understanding
Enabling cost-effective agent operation in Minecraft
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates cross-modal knowledge graph for understanding
Reduces domain data needs from millions to hundreds
Uses retrieval pooling and desktop skill library
Honghao Fu
Assistant Professor, Concordia University
Quantum computing
Junlong Ren
The Hong Kong University of Science and Technology (Guangzhou)
Qi Chai
The Hong Kong University of Science and Technology (Guangzhou)
Deheng Ye
Director of AI, Tencent
Applied machine learning
Yujun Cai
NTU → Meta → Lecturer (Assistant Professor) @ UQ
Multi-Modal Perception, Vision-Language Models
Hao Wang
The Hong Kong University of Science and Technology (Guangzhou)