🤖 AI Summary
Current self-supervised visual representations entangle high-level semantic concepts with low-level physical factors such as geometry and illumination, limiting their use in physical reasoning tasks. To address this, we propose Φeat, a physics-aware backbone that disentangles intrinsic material properties from geometric and illumination variations, entirely in a self-supervised manner. Our method leverages high-fidelity rendered data in a physics-enhanced contrastive learning framework: it contrasts spatially cropped views of the same material under diverse shapes and lighting conditions, without requiring explicit physical annotations. Φeat learns representations that are sensitive to reflectance and micro-geometry yet robust to changes in shape and illumination. Empirically, it significantly outperforms existing self-supervised methods on material similarity analysis and selection tasks, demonstrating superior cross-condition invariance and physical consistency. This work advances unsupervised foundation models for physics-aware perception.
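To make the pretraining idea concrete, here is a minimal sketch of what such a contrastive objective could look like, assuming a standard InfoNCE formulation with in-batch negatives; the `encoder`, the pairing of two renders per material, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def material_info_nce(encoder, view_a, view_b, temperature=0.07):
    """Hypothetical InfoNCE loss for material-invariant pretraining.

    view_a, view_b: (B, 3, H, W) crops of the SAME material rendered
    under different shapes/illumination; materials differ across the batch.
    """
    # Encode both views and L2-normalize the embeddings.
    z_a = F.normalize(encoder(view_a), dim=-1)  # (B, D)
    z_b = F.normalize(encoder(view_b), dim=-1)  # (B, D)

    # Pairwise cosine similarities; the diagonal holds the positive pairs
    # (same material, different physical conditions).
    logits = z_a @ z_b.t() / temperature        # (B, B)
    targets = torch.arange(z_a.size(0), device=z_a.device)

    # Symmetric cross-entropy pulls same-material views together and
    # pushes different materials apart, encouraging invariance to
    # shape and lighting while staying sensitive to material identity.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Under this reading, the "physical augmentations" play the role that color jitter or random crops play in standard contrastive learning: the nuisance factors (shape, illumination) vary within a positive pair, so the encoder is pushed to discard them.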
📝 Abstract
Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.