🤖 AI Summary
This work addresses the challenge of generalizing to unseen category-attribute combinations in open-vocabulary fine-grained segmentation. The authors propose a Decomposed Vision-Language Alignment framework that explicitly factorizes each textual prompt into a concept token and multiple attribute tokens, aligning each component separately across modalities. To make the compositional semantics interpretable, they introduce a Feature-Gated Cross-Attention module at the feature level and a log-space similarity aggregation strategy at the scoring level. The method is presented as the first to explicitly decouple and independently align semantic units in segmentation tasks, and it is reported to significantly improve generalization to novel attribute-category combinations on fine-grained open-vocabulary segmentation benchmarks.
📝 Abstract
Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps and fuses information multiplicatively, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching score. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions on fine-grained open-vocabulary segmentation benchmarks.
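The abstract does not give formulas for the two mechanisms, so the following is only a minimal NumPy sketch of how multiplicative attribute gating and log-space score aggregation could look, not the authors' implementation. The function names (`gated_features`, `log_space_score`), the sigmoid gate, the rescaling of cosine similarity to (0, 1], and the `eps` clamp are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_features(pixel_feats, attr_tokens):
    """Hypothetical multiplicative gating (an assumption, not the paper's module).

    pixel_feats: (N, D) visual features, one row per pixel or patch.
    attr_tokens: (K, D) text embeddings, one row per attribute token.
    Returns the gated features (N, D) and the per-attribute gates (N, K).
    """
    d = pixel_feats.shape[1]
    # One gating map per attribute: scaled dot product squashed into (0, 1).
    gates = sigmoid(pixel_feats @ attr_tokens.T / np.sqrt(d))
    # Multiplicative fusion: a location keeps its feature only to the extent
    # that every attribute gate is open there.
    combined = gates.prod(axis=1, keepdims=True)
    return pixel_feats * combined, gates

def log_space_score(pixel_feats, tokens, eps=1e-6):
    """Hypothetical log-space aggregation of per-token similarities.

    tokens: (K + 1, D) concept token followed by K attribute tokens.
    Returns (N,) log-scores, i.e. the log of the product of per-token
    similarities, computed as a sum of logs for numerical stability.
    """
    f = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + eps)
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    # Assumed mapping: cosine similarity in [-1, 1] rescaled to (0, 1].
    sim = np.clip((f @ t.T + 1.0) / 2.0, eps, 1.0)
    return np.log(sim).sum(axis=1)

# Toy example: token 0 plays the concept, token 1 an attribute.
tokens = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# Pixel 0 agrees with both tokens; pixel 1 agrees with the concept only.
pixels = np.array([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]])

scores = log_space_score(pixels, tokens)
gated, gates = gated_features(pixels, tokens[1:])
print(scores, gates.ravel())
```

Under these assumptions, the sum of logs behaves like a soft logical AND: a single poorly matched token pulls the whole score down, which is the behavior the abstract attributes to compositional matching, and the per-token log terms can be inspected individually to see which semantic unit failed.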