VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional multi-agent reinforcement learning (MARL) frameworks (e.g., SMAC) rely on abstract state representations, resulting in low ecological validity and poor human-AI alignment in agent behavior. Method: We propose VLM-Attention, the first multimodal StarCraft II environment supporting raw RGB visual input and natural language observations, bringing the agent's perception modalities close to those of human players. Our approach integrates multimodal foundation models (e.g., Qwen-VL and GPT-4o), designs a vision-language-enhanced self-attention mechanism, incorporates a retrieval-augmented generation (RAG)-based tactical decision module, and introduces a dynamic role-aware multi-agent coordination framework for zero-shot execution of complex tactics. Contribution/Results: Evaluated across 21 custom scenarios, VLM-Attention matches the tactical performance of state-of-the-art MARL methods without task-specific training, substantially improving ecological validity and cognitive alignment.
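To make the "human-like perception" idea concrete, the sketch below shows one way an environment could pair a raw RGB frame with a natural-language battlefield description. The `MultimodalObservation` structure and `describe_units` helper are illustrative assumptions, not the paper's actual observation format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultimodalObservation:
    """Human-like observation: raw pixels plus a language description.

    Hypothetical structure; the paper's real observation schema may differ.
    """
    rgb_frame: List[List[List[int]]]   # H x W x 3 raw screen pixels
    text: str                          # natural-language battlefield summary

def describe_units(units: List[dict]) -> str:
    """Render structured unit state into a natural-language observation."""
    lines = [
        f"{u['type']} at {u['pos']} with {u['hp']} HP ({u['side']})"
        for u in units
    ]
    return "Visible units: " + "; ".join(lines)

# Toy example: two units visible on screen.
units = [
    {"type": "Marine", "pos": (12, 30), "hp": 45, "side": "ally"},
    {"type": "Zergling", "pos": (14, 28), "hp": 20, "side": "enemy"},
]
obs = MultimodalObservation(rgb_frame=[[[0, 0, 0]]], text=describe_units(units))
print(obs.text)
```

A VLM agent would consume both fields jointly, whereas a SMAC-style agent sees only an abstract feature vector.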

📝 Abstract
We introduce VLM-Attention, a multimodal StarCraft II environment that aligns artificial agent perception with the human gameplay experience. Traditional frameworks such as SMAC rely on abstract state representations that diverge significantly from human perception, limiting the ecological validity of agent behavior. Our environment addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay. The VLM-Attention framework consists of three integrated components: (1) a vision-language model enhanced with specialized self-attention mechanisms for strategic unit targeting and battlefield assessment, (2) a retrieval-augmented generation system that leverages domain-specific StarCraft II knowledge to inform tactical decisions, and (3) a dynamic role-based task distribution system that enables coordinated multi-agent behavior. Our experimental evaluation across 21 custom scenarios demonstrates that VLM-based agents powered by foundation models (specifically Qwen-VL and GPT-4o) can execute complex tactical maneuvers without explicit training, achieving comparable performance to traditional MARL methods that require substantial training iterations. This work establishes a foundation for developing human-aligned StarCraft II agents and advances the broader research agenda of multimodal game AI. Our implementation is available at https://github.com/camel-ai/VLM-Play-StarCraft2.
Problem

Research questions and friction points this paper is trying to address.

Aligning AI perception with human gameplay experience in StarCraft II.
Overcoming limitations of abstract state representations in traditional frameworks.
Enabling complex tactical maneuvers without explicit training using multimodal AI.
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-Attention fuses raw RGB visual input with natural language observations for decision-making.
A retrieval-augmented generation module grounds tactical decisions in domain-specific StarCraft II knowledge.
Dynamic role-based task distribution coordinates multi-agent behavior without task-specific training.
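The last two innovations can be sketched together: retrieve the most relevant tactic from a small knowledge base, then distribute roles across units. Everything here is a hypothetical illustration; the names, the toy knowledge base, and the keyword-overlap retrieval (standing in for embedding-based RAG) are assumptions, not the paper's implementation.

```python
# Toy StarCraft II tactics knowledge base (illustrative entries only).
KNOWLEDGE_BASE = {
    "stutter step": "Attack, then move during the weapon cooldown to kite enemies.",
    "focus fire": "Concentrate attacks on one target to remove its damage output quickly.",
    "high ground": "Units on high ground gain a vision advantage over attackers below.",
}

def retrieve_tactic(query: str, top_k: int = 1) -> list:
    """Rank knowledge-base entries by word overlap with the query
    (a stand-in for embedding-based retrieval in a real RAG module)."""
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(q_words & set((kv[0] + " " + kv[1]).lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def assign_roles(unit_ids, roles=("attacker", "harasser", "scout")):
    """Round-robin role assignment; a simplification of dynamic
    role-aware coordination."""
    return {uid: roles[i % len(roles)] for i, uid in enumerate(unit_ids)}

tactic = retrieve_tactic("enemy is chasing, kite with attack move cooldown")
print(tactic[0][0])                        # best-matching tactic name
print(assign_roles([101, 102, 103, 104]))  # unit id -> role mapping
```

In the paper's framework the retrieved tactic text would be injected into the VLM prompt before action selection; here it is simply returned for inspection.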