🤖 AI Summary
General-purpose image editors lack a unified benchmark for evaluating how well they predict dense physical property maps (depth, surface normals, albedo, roughness, and metallic) from a single RGB image and a conditioning prompt. This work proposes PhysEditBench, the first protocol-driven benchmark tailored to image editors, which standardizes input and output formats and evaluation procedures. It draws on OpenRooms-FF, InteriorVerse, and high-quality procedurally generated scenes, and it uses validity masks, illumination-stress subsets, and scene-level sampling to make evaluation more reliable. Experiments show that specialized models still outperform image editors on depth, normals, and albedo. Editors are competitive on some roughness and metallic metrics, but remain limited by structural inaccuracies, sparsity issues, and sensitivity to lighting conditions.
📝 Abstract
Can general-purpose image editors predict physical maps from a single RGB image? Unlike standard task-specific dense-prediction models, general-purpose image editors do not directly take an image and output a physical map; they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark that evaluates and standardizes image editors on dense physical-map prediction across five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate: OpenRooms-FF for depth, surface normal, albedo, and roughness; InteriorVerse as an additional source for depth, normal, and albedo; and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score therefore reflects a model's performance under a specified protocol, rather than its best possible performance across all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, while the stronger image editors produce more reasonable map-like outputs. For roughness and metallic maps, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.
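To make the idea of a fixed protocol with valid-region masks concrete, the following is a minimal sketch and not the benchmark's actual code. The `Protocol` class, the `evaluate` helper, the prompt string, and the choice of masked mean absolute error as the score are all illustrative assumptions; the abstract does not state which metrics or prompts PhysEditBench uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Map2D = Sequence[Sequence[float]]   # row-major scalar map, e.g. depth or roughness
Mask2D = Sequence[Sequence[bool]]   # True where the ground truth is valid


def masked_mae(pred: Map2D, target: Map2D, valid: Mask2D) -> float:
    """Mean absolute error over pixels where the validity mask is True.

    Assumed metric for illustration; the benchmark's real scoring may differ.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, valid):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    if count == 0:
        raise ValueError("validity mask selects no pixels")
    return total / count


@dataclass(frozen=True)
class Protocol:
    """Hypothetical container for one fixed evaluation protocol."""
    target: str          # which physical map is requested, e.g. "roughness"
    prompt: str          # the fixed instruction given to the editor
    output_format: str   # description of the expected output encoding
    score: Callable[[Map2D, Map2D, Mask2D], float]


def evaluate(protocol: Protocol, editor, image: Map2D,
             target: Map2D, valid: Mask2D) -> float:
    """Run the editor under the fixed prompt and score only valid pixels."""
    pred = editor(image, protocol.prompt)
    return protocol.score(pred, target, valid)


# Toy usage: a stand-in "editor" that ignores its input and returns 0.5 everywhere.
def constant_editor(image: Map2D, prompt: str) -> Map2D:
    return [[0.5 for _ in row] for row in image]


roughness = Protocol(
    target="roughness",
    prompt="Render the roughness map of this image.",  # assumed wording
    output_format="single-channel map in [0, 1]",      # assumed encoding
    score=masked_mae,
)

image = [[0.0, 0.0], [0.0, 0.0]]
target = [[0.5, 1.0], [0.0, 0.25]]
valid = [[True, True], [False, True]]   # bottom-left pixel is excluded
score = evaluate(roughness, constant_editor, image, target, valid)
```

Because the prompt and the scoring function are frozen inside the protocol object, the resulting number describes the editor under that one protocol only, which mirrors the abstract's point that a score is not the model's best achievable performance over all prompts.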