Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels

📅 2025-08-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses fast, generalizable inference of the physical properties of 3D scenes (e.g., elasticity, stiffness) from visual input, avoiding the high computational cost and poor generalization of existing per-scene optimization approaches. We propose a feed-forward neural network, trained purely with supervised losses, that predicts material fields from 3D visual features built on pretrained models such as CLIP. Combined with a static scene representation such as Gaussian Splatting, the predicted fields enable physics simulation under external forces. To support training, we introduce PIXIEVERSE, a large dataset of 3D assets paired with physical material annotations. Experiments show that the method is orders of magnitude faster than test-time optimization, is about 1.46-4.39x better than such methods, and transfers zero-shot to real-world scenes despite being trained only on synthetic data.

📝 Abstract

Inferring the physical properties of 3D scenes from visual information is a critical yet challenging task for creating interactive and realistic virtual worlds. While humans intuitively grasp material characteristics such as elasticity or stiffness, existing methods often rely on slow, per-scene optimization, limiting their generalizability and application. To address this problem, we introduce PIXIE, a novel method that trains a generalizable neural network to predict physical properties across multiple scenes from 3D visual features, using purely supervised losses. Once trained, our feed-forward network can perform fast inference of plausible material fields, which, coupled with a learned static scene representation like Gaussian Splatting, enables realistic physics simulation under external forces. To facilitate this research, we also collected PIXIEVERSE, one of the largest known datasets of paired 3D assets and physical material annotations. Extensive evaluations demonstrate that PIXIE is about 1.46-4.39x better and orders of magnitude faster than test-time optimization methods. By leveraging pretrained visual features like CLIP, our method can also generalize zero-shot to real-world scenes despite only ever being trained on synthetic data. https://pixie-3d.github.io/

Problem

Research questions and friction points this paper is trying to address.

Inferring 3D physical properties from visual information

Overcoming slow per-scene optimization limitations

Creating generalizable physics prediction from pixels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizable neural network from 3D features

Fast feed-forward inference of material fields

Leverages pretrained visual features like CLIP
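
The contrast between supervised feed-forward prediction and per-scene optimization can be illustrated with a toy sketch. This is not the PIXIE architecture: the single linear layer, the 4-dimensional feature vectors (standing in for per-point visual features), the two output values (standing in for material parameters), and every name below are hypothetical choices made only for illustration. The point is the workflow: fit once with a supervised loss on annotated points, then predict on unseen points with a single forward pass and no further optimization.

```python
import random


def predict(weights, bias, feature):
    """Feed-forward pass: one linear layer (hypothetical stand-in for the network)."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def mean_squared_error(weights, bias, features, targets):
    """Supervised loss: mean squared error over all points and output channels."""
    total, count = 0.0, 0
    for x, y in zip(features, targets):
        for p, t in zip(predict(weights, bias, x), y):
            total += (p - t) ** 2
            count += 1
    return total / count


def train(features, targets, epochs=200, lr=0.05, seed=0):
    """Fit the layer with plain stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    n_in, n_out = len(features[0]), len(targets[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    for _ in range(epochs):
        for x, y in zip(features, targets):
            pred = predict(weights, bias, x)
            for k in range(n_out):
                err = pred[k] - y[k]  # gradient of 0.5 * (pred - target)^2
                for j in range(n_in):
                    weights[k][j] -= lr * err * x[j]
                bias[k] -= lr * err
    return weights, bias


def make_points(rng, n, true_w, true_b):
    """Synthetic annotated points: random features with labels from a hidden linear map."""
    feats = [[rng.uniform(-1, 1) for _ in range(len(true_w[0]))] for _ in range(n)]
    targs = [predict(true_w, true_b, f) for f in feats]
    return feats, targs


rng = random.Random(42)
# Hidden ground-truth map, invented purely to generate toy labels.
TRUE_W = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 0.1]]
TRUE_B = [0.3, -0.2]

train_x, train_y = make_points(rng, 64, TRUE_W, TRUE_B)  # annotated training points
test_x, test_y = make_points(rng, 16, TRUE_W, TRUE_B)    # unseen points

weights, bias = train(train_x, train_y)
# Inference on unseen points is a single forward pass, with no per-scene fitting.
test_error = mean_squared_error(weights, bias, test_x, test_y)
print(f"held-out MSE: {test_error:.2e}")
```

In this toy setting the labels are noiseless and linear, so the held-out error becomes very small. Real material prediction is far harder, and the sketch makes no claim about the paper's reported numbers.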
