Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large physics foundation models possess interpretable, abstract representations, akin to those found in large language models, that encode human-understandable physical concepts and support causal interventions. Method: we propose a direction-extraction method based on activation-space differencing, identifying concept-aligned directions from simulation data spanning multiple physical regimes; these directions are then injected during inference to control physical behaviour across domains. Contribution/Results: we demonstrate, for the first time in a physics foundation model, targeted induction and suppression of deep physical principles rather than surface-level statistical patterns, empirically validating that abstract physical concepts exist in the model, can be manipulated, and generalise across domains. Our approach extends the scope of mechanistic interpretability to scientific AI and points toward trustworthy, physics-grounded AI systems.
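To make the activation-differencing step concrete, here is a minimal PyTorch sketch. Everything in it is hypothetical: the `model.blocks[layer_idx]` module layout, the `(batch, positions, hidden)` activation shape, and the two regime datasets stand in for whatever the paper actually uses.

```python
import torch

def mean_activation(model, batches, layer_idx: int) -> torch.Tensor:
    """Average one layer's hidden activations over a simulation dataset."""
    captured = []

    def hook(module, inputs, output):
        captured.append(output.detach())  # record this forward pass's activations

    handle = model.blocks[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        for batch in batches:
            model(batch)
    handle.remove()
    # Collapse the batch and position dimensions into one hidden-size vector.
    return torch.cat(captured).mean(dim=(0, 1))

def concept_direction(model, batches_with, batches_without, layer_idx: int) -> torch.Tensor:
    """"Delta" direction: mean activations with the physical feature minus without it."""
    return (mean_activation(model, batches_with, layer_idx)
            - mean_activation(model, batches_without, layer_idx))
```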

📝 Abstract
Recent advances in mechanistic interpretability have revealed that large language models (LLMs) develop internal representations corresponding not only to concrete entities but also to distinct, human-understandable abstract concepts and behaviours. Moreover, these hidden features can be directly manipulated to steer model behaviour. However, it remains an open question whether this phenomenon is unique to models trained on inherently structured data (i.e., language, images) or whether it is a general property of foundation models. In this work, we investigate the internal representations of a large physics-focused foundation model. Inspired by recent work identifying single directions in activation space for complex behaviours in LLMs, we extract activation vectors from the model during forward passes over simulation datasets for different physical regimes. We then compute "delta" representations between the two regimes. These delta tensors act as concept directions in activation space, encoding specific physical features. By injecting these concept directions back into the model during inference, we can steer its predictions, demonstrating causal control over physical behaviours, such as inducing or removing a particular physical feature from a simulation. These results suggest that scientific foundation models learn generalised representations of physical principles rather than merely relying on superficial correlations and patterns in the simulations. Our findings open new avenues for understanding and controlling scientific foundation models and have implications for AI-enabled scientific discovery.
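The injection step described in the abstract can be pictured as adding the delta vector back into a layer's output at inference time. A minimal sketch using a PyTorch forward hook, assuming the same hypothetical `model.blocks[layer_idx]` layout as above; `alpha` is an illustrative steering strength, not a parameter from the paper:

```python
import torch

def make_steering_hook(delta: torch.Tensor, alpha: float = 1.0):
    """Build a forward hook that adds a scaled concept direction to a layer's output.

    alpha > 0 pushes activations toward the concept (inducing the feature);
    alpha < 0 pushes away from it (suppressing the feature).
    """
    def hook(module, inputs, output):
        # Returning a non-None value from a PyTorch forward hook replaces
        # the module's output, so the shift applies at every position.
        return output + alpha * delta
    return hook
```

Because forward hooks that return a value replace the module's output, the same mechanism supports both induction (positive alpha) and suppression (negative alpha) without touching the model's weights.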
Problem

Research questions and friction points this paper is trying to address.

Investigating whether foundation models learn interpretable physical concepts
Developing methods to extract causal concept directions from physics models
Demonstrating causal control over physical behaviors through activation steering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracting activation vectors from a physics foundation model
Computing delta tensors as concept directions
Injecting directions to steer physical predictions (an end-to-end sketch follows this list)
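An end-to-end sketch tying the two helpers above together. The model, layer index, regime datasets, initial state, and `alpha=2.0` are all hypothetical placeholders:

```python
# Hypothetical usage of concept_direction / make_steering_hook from above.
delta = concept_direction(model, regime_a_batches, regime_b_batches, layer_idx=6)

baseline = model(initial_state)  # unsteered rollout

handle = model.blocks[6].register_forward_hook(make_steering_hook(delta, alpha=2.0))
steered = model(initial_state)   # rollout with the concept direction injected
handle.remove()

# Comparing `baseline` and `steered` shows whether the injected direction
# causally induces (or, with alpha < 0, suppresses) the targeted feature.
```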