🤖 AI Summary
Current self-supervised visual representations entangle high-level semantic concepts with low-level physical factors such as geometry and illumination, limiting their use in physical reasoning tasks. To address this, we propose Φeat, a physics-aware backbone that disentangles intrinsic material properties from geometric and illumination variations, entirely in a self-supervised manner. Our method leverages high-fidelity rendered data in a physics-enhanced contrastive learning framework: it contrasts spatially cropped views of the same material under diverse shapes and lighting conditions, without requiring explicit physical annotations. Φeat learns representations that are sensitive to reflectance and micro-geometry yet robust to changes in shape and illumination. Empirically, it significantly outperforms existing self-supervised methods on material similarity analysis and selection tasks, demonstrating superior cross-condition invariance and physical consistency. This work advances unsupervised foundation models for physics-aware perception.
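To make the pretraining idea concrete, here is a minimal sketch of what such a contrastive objective could look like, assuming a standard InfoNCE formulation with in-batch negatives; the `encoder`, the pairing of two renders per material, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def material_info_nce(encoder, view_a, view_b, temperature=0.07):
    """Hypothetical InfoNCE loss for material-invariant pretraining.

    view_a, view_b: (B, 3, H, W) crops of the SAME material rendered
    under different shapes/illumination; materials differ across the batch.
    """
    # Encode both views and L2-normalize the embeddings.
    z_a = F.normalize(encoder(view_a), dim=-1)  # (B, D)
    z_b = F.normalize(encoder(view_b), dim=-1)  # (B, D)

    # Pairwise cosine similarities; the diagonal holds the positive pairs
    # (same material, different physical conditions).
    logits = z_a @ z_b.t() / temperature        # (B, B)
    targets = torch.arange(z_a.size(0), device=z_a.device)

    # Symmetric cross-entropy pulls same-material views together and
    # pushes different materials apart, encouraging invariance to
    # shape and lighting while staying sensitive to material identity.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Under this reading, the "physical augmentations" play the role that color jitter or random crops play in standard contrastive learning: the nuisance factors (shape, illumination) vary within a positive pair, so the encoder is pushed to discard them.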
📝 Abstract
Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.