🤖 AI Summary
This work addresses the hierarchical and intra-level semantic misalignment that arises when image-level vision-language models such as CLIP are transferred to pixel-level prediction in open-vocabulary semantic segmentation. The authors propose HyRo, a framework that, for the first time, explicitly decouples hierarchical alignment from semantic alignment in hyperbolic space. HyRo aligns category hierarchies by adjusting the hyperbolic radius in the Poincaré ball, and it refines intra-level semantic relationships through orthogonal transformations that provably preserve that radius, so angular refinement does not disturb the hierarchy. Applied as a fine-tuning method for pre-trained vision-language models, HyRo achieves state-of-the-art performance on standard open-vocabulary semantic segmentation benchmarks.
📝 Abstract
Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. Recent works leverage hyperbolic geometry to model hierarchical relationships, but they align embeddings across hierarchical levels and overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance, outperforming prior methods.
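The radius-preservation property can be checked numerically. The sketch below is not the authors' implementation. It assumes the standard Poincaré ball of curvature -c, where the hyperbolic radius of a point x is its geodesic distance from the origin, (2/√c)·artanh(√c·‖x‖). The function names are hypothetical, and a Householder reflection stands in for whatever orthogonal transformation HyRo actually learns. Because an orthogonal map leaves ‖x‖ unchanged, the radius is unchanged too, while rescaling along the ray from the origin changes the radius without changing the direction.

```python
import math

def hyperbolic_radius(x, c=1.0):
    """Geodesic distance from the origin to x in the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(v * v for v in x))
    return (2.0 / math.sqrt(c)) * math.atanh(math.sqrt(c) * norm)

def householder_reflect(x, v):
    """Apply the orthogonal reflection H = I - 2 v v^T / (v^T v) to x.

    Illustrative stand-in for a learned orthogonal transformation.
    """
    vv = sum(a * a for a in v)
    xv = sum(a * b for a, b in zip(x, v))
    return [a - (2.0 * xv / vv) * b for a, b in zip(x, v)]

def rescale_to_radius(x, target_radius, c=1.0):
    """Move x along its ray from the origin until its hyperbolic radius is target_radius."""
    norm = math.sqrt(sum(v * v for v in x))
    new_norm = math.tanh(math.sqrt(c) * target_radius / 2.0) / math.sqrt(c)
    return [v * new_norm / norm for v in x]

# A point inside the unit ball (Euclidean norm < 1 for c = 1).
x = [0.3, -0.2, 0.5]

# Angular change only: the direction moves, the hyperbolic radius does not.
y = householder_reflect(x, [1.0, 2.0, -1.0])

# Radial change only: the radius moves to 2.0, the direction does not.
z = rescale_to_radius(x, 2.0)

print(hyperbolic_radius(x), hyperbolic_radius(y), hyperbolic_radius(z))
```

The two operations act on independent coordinates of the embedding (radius and direction), which is why the two alignment objectives can be handled separately.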