Can Test-Time Scaling Improve World Foundation Model?

📅 2025-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Addressing the challenges of prohibitively high pretraining computational costs, limited post-training data, and diminishing returns from model scaling or retraining in World Foundation Models (WFMs), this paper proposes SWIFT, a novel inference-time scaling framework. SWIFT establishes the first empirical validation of an inference-time compute scaling law for WFMs, enabling performance gains without parameter modification or architectural expansion. It introduces a lightweight, scalable inference paradigm integrating fast tokenization, probability-guided Top-K pruning, and efficient beam search. Evaluated with a custom-built WFM benchmark suite on the COSMOS model, SWIFT demonstrates that compute-optimal inference-time scaling significantly improves prediction accuracy and robustness while keeping computational overhead controllable. The core contributions are: (1) the first empirical confirmation that inference-time scaling is feasible for WFMs, and (2) the first efficient, physics-aware inference optimization framework tailored to physical intelligence applications.
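The inference strategy the summary describes, probability-guided Top-K pruning combined with beam search over the model's tokenized rollouts, can be sketched generically. The sketch below is an assumption-laden illustration, not SWIFT's actual implementation: `step_logprobs` is a hypothetical stand-in for the WFM's next-token distribution, and the real framework operates on video tokens rather than the toy symbols used here.

```python
def topk_beam_search(step_logprobs, beam_width=3, top_k=5, horizon=4):
    """Beam search with probability-based Top-K pruning (generic sketch).

    step_logprobs(prefix) -> {token: log_prob} for the next step given the
    current token prefix. In a WFM setting, tokens would be discrete video
    tokens produced by a fast tokenizer; here they are opaque symbols.
    """
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(horizon):
        candidates = []
        for prefix, score in beams:
            dist = step_logprobs(prefix)
            # Top-K pruning: keep only the K most probable continuations,
            # so each beam expands into at most top_k candidates.
            pruned = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for token, logp in pruned:
                candidates.append((prefix + (token,), score + logp))
        # Keep the beam_width highest-scoring partial rollouts.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

Widening `beam_width` or `horizon` is what "scaling test-time compute" means in this setting: more candidate rollouts are explored per prediction, trading inference FLOPs for accuracy without touching model parameters.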

๐Ÿ“ Abstract
World foundation models (WFMs), which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. The code is available at https://github.com/Mia-Cong/SWIFT.git.
Problem

Research questions and friction points this paper is trying to address.

Improving world foundation models via test-time scaling
Reducing computational costs without model retraining
Enhancing inference efficiency with scalable strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling framework SWIFT for WFMs
Extensible evaluation toolkit with inference strategies
Compute-optimal scaling without retraining or model enlargement