🤖 AI Summary
This study investigates vision-language models’ (VLMs) ability to discriminate between real objects and “visually similar but non-real” entities (e.g., toys, statues, paintings). To this end, we introduce RoLA—the first systematic benchmark dataset for realness-aware language–vision understanding—and propose a quantifiable semantic direction in CLIP’s embedding space that captures the “real vs. visually similar” distinction. Methodologically, we integrate prompt learning, contrastive learning, and embedding-space direction estimation, evaluating on cross-modal retrieval and image captioning using Conceptual12M. Our key contributions are: (1) the first identification of an implicit, structured semantic direction in CLIP encoding realness; (2) demonstration that this direction is transferable—improving cross-modal retrieval accuracy by +4.2% R@1—and significantly enhancing the semantic fidelity and detail accuracy of CLIP-based prefix captioning models. These findings advance fine-grained semantic understanding and controllable generation in VLMs.
📝 Abstract
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.