SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) struggle to interpret negated prompts (e.g., “street scene without pedestrians”); existing fine-tuning approaches improve negation handling, but they degrade zero-shot performance on affirmative prompts. This paper proposes a training-free geometric modeling paradigm that, for the first time, represents negated semantics as a spherical cap region—not a single point—in the joint image-text embedding space. By measuring the angular distance from the prompt’s central direction, the method quantifies alignment between images and negated prompts, enabling efficient retrieval, selection, and text-to-image generation within CLIP-style VLM embedding spaces. Evaluated across three tasks—negated image retrieval, filtering, and generation—the approach achieves an average 30% improvement, substantially narrowing the performance gap between negated and affirmative prompts while fully preserving zero-shot transfer capability.
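The angular-distance idea in the summary can be sketched in a few lines. This is an illustrative reading, not the paper's released code: embeddings are L2-normalized (as in CLIP), the angle to a prompt's central direction is computed via the arccosine of the cosine similarity, and cap membership is a simple threshold on that angle. The `half_angle` parameter is a hypothetical knob for the cap size.

```python
import numpy as np

def angular_distance(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Angle in radians between an image embedding and a prompt's
    central direction on the unit hypersphere."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    cos_sim = float(np.clip(a @ b, -1.0, 1.0))  # guard against rounding past ±1
    return float(np.arccos(cos_sim))

def in_spherical_cap(image_emb: np.ndarray, center_emb: np.ndarray,
                     half_angle: float) -> bool:
    """Membership test: does the image fall inside the spherical cap of
    the given half-angle around the prompt's central direction?"""
    return angular_distance(image_emb, center_emb) <= half_angle
```

With real CLIP features, `image_emb` and `text_emb` would come from the image and text encoders; here any equal-length vectors work.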

📝 Abstract
Vision-Language Models (VLMs) struggle with negation. Given a prompt like "retrieve (or generate) a street scene without pedestrians," they often fail to respect the "not." Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model's zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as "A but not N," we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with understanding negation in prompts
Fine-tuning for negation harms zero-shot affirmative performance
Need training-free method to handle negation without performance loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models negation as subspace in embedding space
Constructs spherical caps around affirmative and negative embeddings
Scores images by proximity to affirmative and distance from negative
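The scoring rule in these bullets, "close to A and far from N," can be illustrated as a difference of cosine similarities against the two cap centers. This is a hedged sketch of one plausible instantiation, not the paper's exact formula; the trade-off weight `lam` is a hypothetical parameter introduced here for illustration.

```python
import numpy as np

def negation_score(img: np.ndarray, a_emb: np.ndarray, n_emb: np.ndarray,
                   lam: float = 1.0) -> float:
    """Score an image for a prompt like "A but not N": reward cosine
    similarity to the affirmative embedding A, penalize similarity to
    the negated embedding N. `lam` weights the penalty (illustrative)."""
    unit = lambda v: v / np.linalg.norm(v)
    img, a_emb, n_emb = unit(img), unit(a_emb), unit(n_emb)
    return float(img @ a_emb) - lam * float(img @ n_emb)

def rank_images(image_embs, a_emb, n_emb, lam: float = 1.0):
    """Return image indices sorted best-first under negation_score."""
    scores = [negation_score(e, a_emb, n_emb, lam) for e in image_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Under this toy scoring, an image aligned with A ranks above one aligned with N, which is the qualitative behavior the bullets describe for retrieval and filtering.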
Sepehr Kazemi Ranjbar
Independent Researcher
Kumail Alhamoud
PhD Student, MIT
Computer Vision · Machine Learning
Marzyeh Ghassemi
MIT