🤖 AI Summary
Problem: Modern imitation learning for video games suffers from high costs due to reliance on game-specific APIs, large-scale datasets, and custom integration. Method: This paper proposes a lightweight, end-to-end image-based imitation learning framework leveraging general-purpose pretrained vision encoders (DINOv2, ViT), eliminating game-specific design and using only low-resolution screen frames as input. It combines behavioral cloning with minimal demonstration data (a few minutes) and is systematically evaluated across Minecraft, CS:GO, and Minecraft Dungeons to assess generalization and data efficiency. Contribution/Results: The approach improves policy performance by over 30% compared to end-to-end convolutional baselines while reducing training cost by an order of magnitude. To the authors' knowledge, this is the first work to empirically demonstrate that general visual representations can substantially alleviate data scarcity and deployment bottlenecks in game AI, establishing a new paradigm for zero-API, few-shot, cross-game imitation learning.
📝 Abstract
Video games have served as useful benchmarks for the decision-making community, but going beyond Atari games towards modern games has been prohibitively expensive for the vast majority of the research community. Prior work in modern video games typically relied on game-specific integration to obtain game features and enable online training, or on existing large datasets. An alternative approach is to train agents using imitation learning to play video games purely from images. However, this setting poses a fundamental question: which visual encoders obtain representations that retain information critical for decision making? To answer this question, we conduct a systematic study of imitation learning with publicly available pre-trained visual encoders compared to the typical task-specific end-to-end training approach in Minecraft, Counter-Strike: Global Offensive, and Minecraft Dungeons. Our results show that end-to-end training can be effective with comparatively low-resolution images and only minutes of demonstrations, but significant improvements can be gained by utilising pre-trained encoders such as DINOv2, depending on the game. In addition to enabling effective decision making, we show that pre-trained encoders can make decision-making research in video games more accessible by significantly reducing the cost of training.
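To make the setting concrete, the following is a minimal sketch of behavioural cloning on top of a frozen encoder, not the paper's implementation. Everything in it is an assumption made for illustration: `frozen_encoder` is a hypothetical hand-written stand-in for a pre-trained model such as DINOv2, the "frames" are synthetic 16-pixel vectors, and the policy is a linear softmax head trained with plain SGD. It shows one plausible reason a frozen encoder can lower training cost: features are computed once and cached, so only the small policy head is optimised.

```python
import math
import random


def frozen_encoder(frame):
    """Hypothetical stand-in for a frozen pre-trained encoder (assumption).

    Maps a flat list of pixel intensities to three features: mean intensity
    of the left half, mean intensity of the right half, and a constant bias.
    """
    half = len(frame) // 2
    left = sum(frame[:half]) / half
    right = sum(frame[half:]) / (len(frame) - half)
    return [left, right, 1.0]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(weights, features):
    """Linear softmax policy head: action probabilities given features."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(logits)


def train_bc(features, actions, n_actions, epochs=200, lr=0.5):
    """Behavioural cloning: minimise cross-entropy to the demonstrated actions."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(n_actions)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, a in zip(features, actions):
            probs = policy(weights, x)
            total -= math.log(probs[a])
            for k in range(n_actions):
                # Gradient of cross-entropy w.r.t. logit k is p_k - 1[k == a].
                grad = probs[k] - (1.0 if k == a else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
        losses.append(total / len(features))
    return weights, losses


def make_demo(rng, n_pixels=16):
    """Synthetic demonstration: the brighter half of the frame determines
    the expert action (0 = left, 1 = right)."""
    action = rng.randrange(2)
    frame = [rng.random() * 0.3 for _ in range(n_pixels)]
    half = n_pixels // 2
    lo, hi = (0, half) if action == 0 else (half, n_pixels)
    for i in range(lo, hi):
        frame[i] += 0.7
    return frame, action


rng = random.Random(0)
demos = [make_demo(rng) for _ in range(64)]

# The encoder is frozen, so its outputs are computed once and reused every epoch.
cached = [frozen_encoder(frame) for frame, _ in demos]
actions = [a for _, a in demos]

weights, losses = train_bc(cached, actions, n_actions=2)
predictions = [max(range(2), key=lambda k: policy(weights, x)[k]) for x in cached]
accuracy = sum(p == a for p, a in zip(predictions, actions)) / len(actions)
```

In an end-to-end setup the encoder parameters would also receive gradients, so frames would have to pass through the full network on every update. Caching is only valid while the encoder stays frozen and no per-epoch image augmentation is applied.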