🤖 AI Summary
Assessing the realism of 3D shapes without ground-truth references remains a fundamental challenge. To address this, we propose the first reference-free shape-realism alignment metric. Our core innovation lies in geometrically encoding 3D mesh structures into language token space, leveraging large language models (LLMs) to bridge low-level geometry and high-level human perceptual realism, and introducing a dedicated realism decoder for end-to-end score prediction. To support training and evaluation, we curate RealismGrading—the first publicly available dataset featuring human-annotated realism scores for diverse 3D shapes. Extensive experiments using k-fold cross-validation demonstrate strong agreement with human judgments (Spearman’s ρ > 0.85), significantly outperforming conventional metrics such as Chamfer Distance and Fréchet Inception Distance (FID). Moreover, our method exhibits robust cross-category generalization, confirming its effectiveness beyond domain-specific assumptions.
📝 Abstract
3D generation and reconstruction techniques are widely used in computer games, film, and other content-creation areas. As these applications expand, demand is growing for 3D shapes that look truly realistic. Traditional evaluation methods rely on a ground truth to measure mesh fidelity. In many practical cases, however, a shape's realism does not depend on having a ground-truth reference. In this work, we propose a Shape-Realism Alignment Metric that uses a large language model (LLM) as a bridge between mesh shape information and realism evaluation. To achieve this, we adopt a mesh encoding approach that converts 3D shapes into the language token space. A dedicated realism decoder aligns the language model's output with human perception of realism. We also introduce a new dataset, RealismGrading, which provides human-annotated realism scores without requiring ground-truth shapes. The dataset includes shapes generated by 16 different algorithms on over a dozen objects, making it more representative of practical 3D shape distributions. We validate our metric's performance and generalizability through k-fold cross-validation across different objects. Experimental results show that our metric correlates well with human perception, outperforms existing methods, and generalizes well.
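The evaluation protocol described above (k-fold cross-validation across objects, followed by a correlation check against human scores) can be illustrated with a minimal sketch. The abstract does not say how folds are built or which statistic is reported; the sketch assumes folds are split by object, so that no object appears in both training and test data, and assumes Spearman's rank correlation as the agreement measure. The function names `average_ranks`, `spearman`, and `object_folds` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Assumes both inputs have at least two distinct values; otherwise the
    rank variance is zero and the division below fails.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def object_folds(object_ids, k):
    """Yield (train_idx, test_idx) pairs with no object on both sides.

    Assumption: objects are assigned to folds round-robin in sorted order.
    """
    by_object = defaultdict(list)
    for idx, obj in enumerate(object_ids):
        by_object[obj].append(idx)
    objects = sorted(by_object)
    for fold in range(k):
        held_out = set(objects[fold::k])
        test = [i for o in objects if o in held_out for i in by_object[o]]
        train = [i for o in objects if o not in held_out for i in by_object[o]]
        yield train, test
```

In use, a metric would be fitted on each `train` split, scored on the matching `test` split, and `spearman(predicted, human_scores)` computed per fold. Splitting by object rather than by individual shape is what makes the result a test of generalization to unseen objects.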