PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation

📅 2026-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing training-free open-vocabulary semantic segmentation methods, which often neglect cross-modal geometric alignment and rely on heavy post-processing or multi-model architectures, leading to inefficiency. The authors propose PEARL, a training-free two-stage inference framework that first geometrically aligns the text and visual subspaces inside the self-attention layers via an orthogonal Procrustes alignment (solved iteratively with a polar decomposition), and then refines pixel-level predictions through text-aware Laplacian graph propagation on a coarse grid, solved with a few conjugate-gradient iterations. PEARL is the first method to integrate Procrustes alignment with text-guided graph propagation, operating without learnable parameters, additional data, or auxiliary backbones, and is therefore plug-and-play. It sets a new state of the art across multiple training-free open-vocabulary segmentation benchmarks while keeping inference latency very low.
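The summary only names the ingredients of the alignment step. As a rough illustration, here is a minimal NumPy sketch of the orthogonal Procrustes solution computed with an inverse-free Newton-Schulz polar iteration; the function names, per-head shapes, and the particular iteration are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def polar_orthogonal(M, n_iter=12):
    """Orthogonal factor of the polar decomposition of M via Newton-Schulz.

    The orthogonal polar factor of M = K^T Q is exactly the solution of the
    orthogonal Procrustes problem min_R ||K R - Q||_F with R orthogonal.
    Scaling by the spectral norm keeps the singular values in (0, 1], where
    the inverse-free iteration converges.
    """
    X = M / np.linalg.norm(M, ord=2)
    for _ in range(n_iter):
        # Newton-Schulz step: X <- 0.5 * X * (3I - X^T X)
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def procrustes_align_keys(K, Q):
    """Rotate key vectors toward the query subspace for one attention head.

    K, Q: (num_tokens, head_dim) key/query matrices of a self-attention head.
    Returns K @ R with R the orthogonal Procrustes solution.
    """
    M = K.T @ Q                 # (head_dim, head_dim) cross-covariance
    R = polar_orthogonal(M)     # closest orthogonal matrix to M
    return K @ R

# Toy usage with random features (e.g., 14x14 patch tokens, 64-dim head).
rng = np.random.default_rng(0)
K = rng.standard_normal((196, 64))
Q = rng.standard_normal((196, 64))
K_aligned = procrustes_align_keys(K, Q)
```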

📝 Abstract
Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL (Procrustes alignment with text-aware Laplacian propagation), a compact two-step inference procedure that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation step then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. Our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency (a small per-head projection and a few conjugate-gradient steps). PEARL sets a new state of the art in training-free OVSS across standard benchmarks without extra data or auxiliary backbones, achieving superior performance under both with-background and without-background protocols.
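The abstract describes the propagation step only at a high level. Below is a minimal SciPy sketch of one plausible instantiation, assuming a 4-connected coarse grid, an image-gradient edge affinity, hard argmax text gating, and a confidence-weighted screened solve (C + lam*L) z = C z0 per class channel via conjugate gradient. All function names, the gating rule, and the constants are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import cg

def build_grid_laplacian(image_feat, text_gate, beta=10.0):
    """Graph Laplacian on a coarse HxW grid with 4-connected neighbors.

    Edge weights combine an image-gradient affinity (strong edges cut the
    graph) with a text-guided gate that down-weights edges between pixels
    whose coarse class predictions disagree.
    """
    H, W = image_feat.shape
    idx = np.arange(H * W).reshape(H, W)
    rows, cols, vals = [], [], []
    for di, dj in [(0, 1), (1, 0)]:                   # right and down neighbors
        a = idx[: H - di, : W - dj].ravel()
        b = idx[di:, dj:].ravel()
        grad = (image_feat[: H - di, : W - dj] - image_feat[di:, dj:]).ravel()
        w = np.exp(-beta * grad ** 2)                 # boundary-preserving affinity
        same = (text_gate[: H - di, : W - dj] == text_gate[di:, dj:]).ravel()
        w = w * np.where(same, 1.0, 0.1)              # text-aware neighbor gating
        rows += [a, b]; cols += [b, a]; vals += [w, w]
    Wmat = coo_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(H * W, H * W)).tocsr()
    deg = np.asarray(Wmat.sum(axis=1)).ravel()
    return diags(deg) - Wmat                          # L = D - W

def propagate_logits(logits, image_feat, confidence, lam=1.0, beta=10.0):
    """Refine per-class logits by solving (C + lam*L) z = C z0 with CG.

    logits: (H, W, K) coarse per-pixel class logits.
    confidence: (H, W) text-derived data-trust weights in [0, 1].
    """
    H, W, K = logits.shape
    gate = logits.argmax(axis=-1)
    L = build_grid_laplacian(image_feat, gate, beta=beta)
    C = diags(confidence.ravel())
    A = C + lam * L
    refined = np.empty_like(logits)
    for k in range(K):                                # one CG solve per class channel
        z0 = logits[..., k].ravel()
        z, _ = cg(A, C @ z0, maxiter=50)
        refined[..., k] = z.reshape(H, W)
    return refined
```

In this sketch the confidence matrix C anchors pixels the text already trusts, while the Laplacian term spreads labels only along edges that both the image gradients and the text gate permit.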
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
training-free
cross-modal geometry
model complexity
post-processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free
Open-vocabulary semantic segmentation
Procrustes alignment
Text-aware Laplacian propagation
Cross-modal geometry
👥 Authors
Gensheng Pei
Department of Electrical and Computer Engineering, Sungkyunkwan University
Xiruo Jiang
School of Computing and Artificial Intelligence, Southwest Jiaotong University
Xinhao Cai
Nanjing University of Science and Technology
computer vision, machine learning
Tao Chen
Nanjing University of Science and Technology
computer vision
Yazhou Yao
School of Computer Science and Engineering, Nanjing University of Science and Technology
Byeungwoo Jeon
Professor, Sungkyunkwan University
signal processing, video coding, image/video