SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

📅 2026-05-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing world models for first-person shooter (FPS) games struggle to capture high-frequency local action signals and generalize poorly across games. This work proposes SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. By reshaping features into per-pixel temporal sequences, SCOPE separates local action responses inside the weapon-centered scope from generation of the surrounding scene, without requiring segmentation labels. The work also introduces CrossFPS, described as the first multi-game FPS dataset with frame-aligned action telemetry, to support zero-shot transfer. Experiments across seven games report strong action responsiveness, clear scope separation, and cross-game zero-shot generalization.
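The reshaping step mentioned in the summary can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's module: the function names, the sigmoid gate, and the weight shapes (`w_feat`, `w_act`) are hypothetical and chosen only to show how a `(B, T, H, W, C)` feature map becomes one temporal sequence per spatial position, so that each position can scale the action signal using its own local content.

```python
import numpy as np

def to_pixel_sequences(feats):
    """(B, T, H, W, C) -> (B*H*W, T, C): one temporal sequence per spatial position."""
    b, t, h, w, c = feats.shape
    return feats.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)

def from_pixel_sequences(seqs, b, h, w):
    """Inverse of to_pixel_sequences: (B*H*W, T, C) -> (B, T, H, W, C)."""
    _, t, c = seqs.shape
    return seqs.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)

def local_action_response(feats, actions, w_feat, w_act):
    """Hypothetical conditioning step (an assumption, not the authors' design).

    feats:   (B, T, H, W, C) visual features
    actions: (B, T, A) per-frame action vectors
    w_feat:  (C, 1) weights producing a per-position gate from local content
    w_act:   (A, C) weights projecting actions into feature space
    """
    b, t, h, w, c = feats.shape
    seqs = to_pixel_sequences(feats)                 # (B*H*W, T, C)
    # Every position of a given sample sees the same action sequence.
    act = np.repeat(actions, h * w, axis=0)          # (B*H*W, T, A)
    # The gate depends only on that position's own features, so positions
    # can respond strongly (gate near 1) or barely at all (gate near 0).
    gate = 1.0 / (1.0 + np.exp(-(seqs @ w_feat)))    # (B*H*W, T, 1)
    delta = gate * (act @ w_act)                     # (B*H*W, T, C)
    return from_pixel_sequences(seqs + delta, b, h, w)

rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 4, 3, 5, 8))         # B=2, T=4, H=3, W=5, C=8
actions = rng.standard_normal((2, 4, 10))            # 10 action channels per frame
w_feat = rng.standard_normal((8, 1))
w_act = rng.standard_normal((10, 8))
out = local_action_response(feats, actions, w_feat, w_act)
print(out.shape)  # (2, 4, 3, 5, 8)
```

In this toy version a zero action vector leaves the features unchanged, which mirrors the stated goal of not disturbing regions an action does not affect.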
📝 Abstract
Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
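To make "frame-aligned action telemetry" concrete, here is a minimal sketch of how such a clip might be represented and checked. The `Clip` class, its field names, and the `split_by_game` helper are hypothetical; the abstract states only that clips carry 10-DoF controller signals aligned to frames, and it does not describe the storage format, the meaning of each channel, or the evaluation split.

```python
from dataclasses import dataclass
import numpy as np

N_DOF = 10  # controller channels per frame, as stated in the abstract

@dataclass
class Clip:
    """Hypothetical record for one clip (field names are assumptions)."""
    game: str            # title the clip was recorded from
    frames: np.ndarray   # (T, H, W, 3) video frames
    actions: np.ndarray  # (T, N_DOF) controller signals, one row per frame

    def __post_init__(self):
        if self.actions.ndim != 2 or self.actions.shape[1] != N_DOF:
            raise ValueError(f"actions must have shape (T, {N_DOF})")
        # Frame alignment: exactly one action row for every video frame.
        if self.actions.shape[0] != self.frames.shape[0]:
            raise ValueError("actions and frames must have the same length T")

def split_by_game(clips, held_out):
    """One plausible held-out split: train on all titles except `held_out`."""
    train = [c for c in clips if c.game != held_out]
    test = [c for c in clips if c.game == held_out]
    return train, test

frames = np.zeros((4, 2, 2, 3), dtype=np.uint8)
clips = [
    Clip("game_a", frames, np.zeros((4, N_DOF))),
    Clip("game_b", frames, np.zeros((4, N_DOF))),
    Clip("game_a", frames, np.zeros((4, N_DOF))),
]
train, test = split_by_game(clips, held_out="game_b")
print(len(train), len(test))  # 2 1
```

Validating alignment at construction time means a clip whose telemetry drifted from its video fails immediately instead of silently training the model on mismatched frame and action pairs.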
Problem

Research questions and friction points this paper is trying to address.

interactive world models
first-person shooter
action conditioning
cross-game generalization
high-frequency control signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

world models
video diffusion
spatially selective actions
cross-game generalization
zero-shot transfer
Zizhao Tong
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Hongfeng Lai
Tencent
Zeqing Wang
Tencent
Zhaohu Xing
Hong Kong University of Science and Technology (Guangzhou)
Medical Image Analysis, Video Understanding, Image Generation
Kexu Cheng
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Haoran Xu
Zhejiang University
Embodied AI, Robotics, Computer Vision, 3D Vision
Zhao Pu
Shanghai Jiao Tong University
Shangwen Zhu
Shanghai Jiao Tong University
Ruili Feng
University of Waterloo
Jian Zhao
Zhongguancun Institute of Artificial Intelligence
Reinforcement Learning, Multi-Agent System
Yan Zhang
National University of Singapore
Hao Tang
Peking University
Computer Vision
Yeying Jin
Tencent | National University of Singapore
Computer Vision, AIGC, GenAI, MLLM, VLM
Ling Shao
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences