🤖 AI Summary
Existing image generation evaluation metrics—such as BLEU, CIDEr, and CLIPScore—exhibit limited fidelity in domain-specific and context-sensitive scenarios, failing to adequately capture semantic plausibility and structural-physical consistency. To address this, we propose a physics-aware multimodal evaluation framework: first extracting spatial-semantic features, then performing confidence-weighted fusion of outputs from vision-language models, object detectors, and large language models (LLMs). Crucially, we introduce a physics-guided LLM reasoning mechanism that integrates component-level adaptive verification and domain-knowledge mapping to enable cross-modal consistency modeling. Our hierarchical three-tier architecture significantly enhances discriminative capability for both semantic and structural accuracy of synthesized images. Extensive evaluation demonstrates superior correlation with human judgment and greater robustness compared to state-of-the-art metrics, particularly in specialized domains and complex contextual tasks.
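A minimal sketch of how such confidence-weighted fusion of the vision-language model, object detector, and LLM outputs might look is shown below. The `ComponentScore` fields and the simple normalized weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of confidence-weighted component fusion (names and the
# weighting scheme are illustrative assumptions, not the paper's method).
from dataclasses import dataclass

@dataclass
class ComponentScore:
    source: str        # e.g. "vlm", "detector", "llm"
    score: float       # component-level accuracy estimate in [0, 1]
    confidence: float  # the source's confidence in its own estimate

def fuse_scores(components: list[ComponentScore]) -> float:
    """Fuse per-component scores, weighting each by its confidence."""
    total_weight = sum(c.confidence for c in components)
    if total_weight == 0:
        return 0.0
    return sum(c.score * c.confidence for c in components) / total_weight

# Example: the detector is confident about object presence, the VLM is less
# sure about the spatial relation, and the LLM flags a physics inconsistency.
fused = fuse_scores([
    ComponentScore("detector", 0.95, 0.9),
    ComponentScore("vlm", 0.70, 0.6),
    ComponentScore("llm", 0.40, 0.8),
])
print(f"fused component score: {fused:.3f}")
```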
📝 Abstract
Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To address these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) feature extraction, which gathers spatial and semantic information from multimodal features via object detection and VLMs; (2) Confidence-Weighted Component Fusion, which performs adaptive component-level validation; and (3) physics-guided reasoning, in which large language models enforce structural and relational constraints (e.g., alignment, position, consistency).
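The sketch below outlines how these three stages could fit together in code. All interfaces (`detector`, `vlm`, `llm`, `scorers`) and the final combination weight `alpha` are hypothetical placeholders assumed for illustration, not the authors' implementation.

```python
# Illustrative three-stage PCMDE-style pipeline. Every callable passed in is a
# placeholder assumed for this sketch, not the paper's actual components.

def extract_features(image, prompt, detector, vlm):
    """Stage 1: spatial + semantic feature extraction (object detection + VLM)."""
    boxes = detector(image)        # detected objects with bounding boxes
    caption = vlm(image, prompt)   # semantic description of the image content
    return {"boxes": boxes, "caption": caption}

def component_fusion(features, prompt, scorers):
    """Stage 2: confidence-weighted fusion of component-level validations."""
    results = [score(features, prompt) for score in scorers]  # (score, confidence) pairs
    total = sum(conf for _, conf in results) or 1.0
    return sum(s * conf for s, conf in results) / total

def physics_reasoning(features, prompt, llm):
    """Stage 3: LLM checks structural/relational constraints
    (alignment, position, consistency) against the prompt."""
    query = (f"Objects: {features['boxes']}. Description: {features['caption']}. "
             f"Prompt: {prompt}. Rate physical/structural plausibility in [0, 1].")
    return float(llm(query))

def pcmde_score(image, prompt, detector, vlm, llm, scorers, alpha=0.5):
    """Combine fused component validation with physics-guided reasoning.
    The weighting alpha is an assumption made for this illustration."""
    feats = extract_features(image, prompt, detector, vlm)
    return (alpha * component_fusion(feats, prompt, scorers)
            + (1 - alpha) * physics_reasoning(feats, prompt, llm))

# Toy usage with stand-in callables (a real system would plug in an actual
# object detector, VLM, and LLM here).
score = pcmde_score(
    image=None, prompt="a cup resting on a table",
    detector=lambda img: [("cup", (40, 30, 80, 90)), ("table", (0, 80, 200, 160))],
    vlm=lambda img, p: "a cup sits on top of a wooden table",
    llm=lambda q: "0.9",
    scorers=[lambda f, p: (0.85, 0.7), lambda f, p: (0.9, 0.9)],
)
print(f"PCMDE-style score: {score:.3f}")
```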