Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

📅 2025-11-12
🤖 AI Summary
Problem: Building generalist agents that can execute long-horizon (multi-hour), complex tasks in real time in 3D open-world environments remains a significant challenge. Method: This paper introduces Lumine, which the authors describe as the first open recipe for such agents. It is an end-to-end, vision-language-model-driven framework that operates directly on raw pixels and jointly performs perception, multimodal reasoning, and action selection: it samples frames at 5 Hz, emits fine-grained keyboard/mouse actions at 30 Hz, and invokes multi-step reasoning only when needed. Contribution/Results: Trained in *Genshin Impact*, it completes the entire five-hour Mondstadt main storyline with efficiency on par with human players and follows natural-language instructions across collection, combat, puzzle-solving, and NPC interaction. A key result is zero-shot cross-game generalization: without any fine-tuning, it completes 100-minute missions in *Wuthering Waves* and the full five-hour first chapter of *Honkai: Star Rail*, two games whose worlds and interaction dynamics differ from those of *Genshin Impact*.

📝 Abstract
We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.

Problem

Research questions and friction points this paper is trying to address.

Develop generalist agents that complete hours-long, complex missions in real time in 3D open worlds
Unify perception, reasoning, and action end-to-end
Achieve cross-game generalization without fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies perception, reasoning, and action end-to-end in a single vision-language model
Processes raw pixels at 5 Hz to produce 30 Hz keyboard-mouse actions
Demonstrates zero-shot cross-game generalization capability
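
The two rates quoted above imply a simple piece of arithmetic: at 5 Hz perception and 30 Hz action output, each observed frame has to account for 30 / 5 = 6 action slots, spaced about 33 ms apart within a 200 ms frame interval. The sketch below only illustrates that timing. The function names are hypothetical, and the assumption that action slots are spaced evenly after each frame is ours; this summary does not describe how Lumine actually packs or schedules its actions.

```python
from fractions import Fraction

PERCEPTION_HZ = 5   # frames observed per second (rate quoted in the abstract)
ACTION_HZ = 30      # keyboard-mouse actions per second (rate quoted in the abstract)


def actions_per_frame(perception_hz: int, action_hz: int) -> int:
    """Number of action slots each observed frame must cover."""
    if action_hz % perception_hz != 0:
        raise ValueError("action rate must be a multiple of the perception rate")
    return action_hz // perception_hz


def action_schedule(num_frames: int,
                    perception_hz: int = PERCEPTION_HZ,
                    action_hz: int = ACTION_HZ):
    """Return (frame_index, action_time_in_seconds) for every action slot.

    Assumes (hypothetically) that slots are evenly spaced after each frame.
    Exact fractions are used to avoid floating-point drift in the timestamps.
    """
    slots = actions_per_frame(perception_hz, action_hz)
    schedule = []
    for frame in range(num_frames):
        frame_time = Fraction(frame, perception_hz)
        for slot in range(slots):
            schedule.append((frame, frame_time + Fraction(slot, action_hz)))
    return schedule


if __name__ == "__main__":
    print(actions_per_frame(PERCEPTION_HZ, ACTION_HZ))  # 6 slots per frame
    for frame, t in action_schedule(2):
        print(frame, float(t))
```

Under these assumptions, one second of play corresponds to 5 observed frames and 30 action slots.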