Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents

📅 2025-10-27
🤖 AI Summary
This paper proposes a generalist multimodal agent framework for cross-platform gaming that addresses key limitations of existing game AI, including poor generalization, heterogeneous action spaces, and low training efficiency. Methodologically, it introduces a unified, extensible action space grounded in native keyboard-mouse inputs, so the same agent can operate across operating systems, web browsers, and simulation games. It also uses a decaying continual loss to reduce causal confusion and a Sparse-Thinking strategy that balances reasoning depth against inference cost, supporting large-scale continual pre-training and efficient inference. Contributions include: (1) cross-domain multimodal trajectory modeling with human-aligned action representations; (2) a ~2× improvement in success rate over the previous SOTA model on open-world Minecraft tasks; (3) performance close to that of novice humans on unseen web-based 3D games; and (4) outperforming GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks.

📝 Abstract
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens of diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous SOTA model on open-world Minecraft tasks, comes close to the generality of novice humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet on FPS benchmarks. Training-time and test-time scaling results confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
Problem

Research questions and friction points this paper is trying to address.

Developing scalable generalist agents for multimodal game environments
Creating unified action space for cross-domain computer interaction
Balancing reasoning depth with computational efficiency in game AI
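
To make the idea of a unified keyboard-mouse action space more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class names (`KeyPress`, `MouseMove`, `MouseClick`), the `serialize` helper, and the text format it emits are all hypothetical, chosen only to show how one small set of device-level primitives could describe actions in an OS, a browser, or a game alike.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical device-level primitives; the source does not specify these names.
@dataclass(frozen=True)
class KeyPress:
    key: str  # e.g. "w" or "space"


@dataclass(frozen=True)
class MouseMove:
    dx: int  # relative horizontal movement in pixels
    dy: int  # relative vertical movement in pixels


@dataclass(frozen=True)
class MouseClick:
    button: str  # "left" or "right"


Action = Union[KeyPress, MouseMove, MouseClick]


def serialize(action: Action) -> str:
    """Render an action as text so a language model could emit it as tokens.

    The output format is an assumption made for illustration only.
    """
    if isinstance(action, KeyPress):
        return f"keyPress({action.key})"
    if isinstance(action, MouseMove):
        return f"mouseMove({action.dx},{action.dy})"
    if isinstance(action, MouseClick):
        return f"mouseClick({action.button})"
    raise TypeError(f"unsupported action: {action!r}")


# The same primitives describe a step in a 3D game or a click in a web page.
trajectory = [KeyPress("w"), MouseMove(12, -3), MouseClick("left")]
encoded = [serialize(a) for a in trajectory]
```

Because every environment is driven through the same primitives, trajectories from different games can, under this assumption, be mixed in one training corpus without per-game action heads.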
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified scalable action space with human-aligned inputs
Decaying continual loss to reduce causal confusion
Sparse-Thinking strategy balancing reasoning and cost
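
The page does not say how the decaying loss listed above is computed. One plausible reading, assumed here purely for illustration, is that consecutive repeats of the same action are down-weighted so that the model cannot lower its loss simply by copying the previous action. The function names and the decay factor `gamma` in this sketch are hypothetical.

```python
def decay_weights(actions, gamma=0.5):
    """Return one loss weight per step.

    Assumption: the k-th consecutive repeat of an action gets weight gamma**k,
    and the weight resets to 1 whenever the action changes.
    """
    weights = []
    run = 0
    prev = None
    for action in actions:
        run = run + 1 if action == prev else 0
        weights.append(gamma ** run)
        prev = action
    return weights


def weighted_loss(per_step_losses, actions, gamma=0.5):
    """Weighted average of per-step losses using the decayed weights."""
    weights = decay_weights(actions, gamma)
    total = sum(w * loss for w, loss in zip(weights, per_step_losses))
    return total / sum(weights)


weights = decay_weights(["w", "w", "w", "click", "w"], gamma=0.5)
# weights is [1.0, 0.5, 0.25, 1.0, 1.0]
```

Under this assumption, long runs of a held key contribute less to the gradient, while the steps where the action changes keep full weight.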
Authors

Zihao Wang, Bytedance Seed
Xujing Li, Bytedance Seed
Yining Ye, Tsinghua University, Bytedance (Tool Learning, Agent, Unified-LM)
Junjie Fang, Bytedance Seed
Haoming Wang, University of Pittsburgh (Federated learning)
Longxiang Liu, Bytedance Seed
Shihao Liang, ByteDance (Multimodal Agent, Agent Evaluation)
Junting Lu, Peking University (Multimodal Agent)
Zhiyong Wu, Bytedance Seed
Jiazhan Feng, University of Oxford; PhD at Peking University (Natural Language Processing, Large Language Models, Multimodal Agent)
Wanjun Zhong, Bytedance Seed Research (NLP)
Zili Li, Bytedance Seed
Yu Wang, Bytedance Seed
Yu Miao, Bytedance Seed
Bo Zhou, Bytedance Seed
Yuanfan Li, Bytedance Seed
Hao Wang, Bytedance Seed
Zhongkai Zhao, Bytedance (Machine Learning Systems, LLM, Software Engineering)
Faming Wu, Bytedance Seed
Zhengxuan Jiang, Bytedance Seed
Weihao Tan, Bytedance Seed
Heyuan Yao, Bytedance Seed
Shi Yan, Eindhoven University of Technology (Optical communication, fiber optics, Signal processing)
Xiangyang Li, Bytedance Seed
Yitao Liang, Peking University (Machine Learning, AI Reasoning, AI Agent)