PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation

📅 2026-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of existing training-free open-vocabulary semantic segmentation methods, which often neglect cross-modal geometric alignment and rely on heavy post-processing or multi-model architectures, leading to inefficiency. The authors propose PEARL, a training-free two-stage inference framework that first geometrically aligns the text and visual subspaces inside the self-attention layers via an orthogonal Procrustes alignment (solved iteratively with a polar decomposition), and then refines pixel-level predictions through text-aware Laplacian graph propagation on a coarse grid, solved with a few conjugate-gradient iterations. PEARL is the first method to integrate Procrustes alignment with text-guided graph propagation, operating without learnable parameters, additional data, or auxiliary backbones, and is therefore plug-and-play. It sets a new state of the art across multiple training-free open-vocabulary segmentation benchmarks while keeping inference latency very low.
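The summary only names the ingredients of the alignment step. As a rough illustration, here is a minimal NumPy sketch of the orthogonal Procrustes solution computed with an inverse-free Newton-Schulz polar iteration; the function names, per-head shapes, and the particular iteration are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def polar_orthogonal(M, n_iter=12):
    """Orthogonal factor of the polar decomposition of M via Newton-Schulz.

    The orthogonal polar factor of M = K^T Q is exactly the solution of the
    orthogonal Procrustes problem min_R ||K R - Q||_F with R orthogonal.
    Scaling by the spectral norm keeps the singular values in (0, 1], where
    the inverse-free iteration converges.
    """
    X = M / np.linalg.norm(M, ord=2)
    for _ in range(n_iter):
        # Newton-Schulz step: X <- 0.5 * X * (3I - X^T X)
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def procrustes_align_keys(K, Q):
    """Rotate key vectors toward the query subspace for one attention head.

    K, Q: (num_tokens, head_dim) key/query matrices of a self-attention head.
    Returns K @ R with R the orthogonal Procrustes solution.
    """
    M = K.T @ Q                 # (head_dim, head_dim) cross-covariance
    R = polar_orthogonal(M)     # closest orthogonal matrix to M
    return K @ R

# Toy usage with random features (e.g., 14x14 patch tokens, 64-dim head).
rng = np.random.default_rng(0)
K = rng.standard_normal((196, 64))
Q = rng.standard_normal((196, 64))
K_aligned = procrustes_align_keys(K, Q)
```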

📝 Abstract
Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL (Procrustes alignment with text-aware Laplacian propagation), a compact two-step inference procedure that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation step then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. Our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency (a small per-head projection and a few conjugate-gradient steps). PEARL sets a new state of the art in training-free OVSS across standard benchmarks without extra data or auxiliary backbones, achieving superior performance under both with-background and without-background protocols.
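The abstract describes the propagation step only at a high level. Below is a minimal SciPy sketch of one plausible instantiation, assuming a 4-connected coarse grid, an image-gradient edge affinity, hard argmax text gating, and a confidence-weighted screened solve (C + lam*L) z = C z0 per class channel via conjugate gradient. All function names, the gating rule, and the constants are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import cg

def build_grid_laplacian(image_feat, text_gate, beta=10.0):
    """Graph Laplacian on a coarse HxW grid with 4-connected neighbors.

    Edge weights combine an image-gradient affinity (strong edges cut the
    graph) with a text-guided gate that down-weights edges between pixels
    whose coarse class predictions disagree.
    """
    H, W = image_feat.shape
    idx = np.arange(H * W).reshape(H, W)
    rows, cols, vals = [], [], []
    for di, dj in [(0, 1), (1, 0)]:                   # right and down neighbors
        a = idx[: H - di, : W - dj].ravel()
        b = idx[di:, dj:].ravel()
        grad = (image_feat[: H - di, : W - dj] - image_feat[di:, dj:]).ravel()
        w = np.exp(-beta * grad ** 2)                 # boundary-preserving affinity
        same = (text_gate[: H - di, : W - dj] == text_gate[di:, dj:]).ravel()
        w = w * np.where(same, 1.0, 0.1)              # text-aware neighbor gating
        rows += [a, b]; cols += [b, a]; vals += [w, w]
    Wmat = coo_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(H * W, H * W)).tocsr()
    deg = np.asarray(Wmat.sum(axis=1)).ravel()
    return diags(deg) - Wmat                          # L = D - W

def propagate_logits(logits, image_feat, confidence, lam=1.0, beta=10.0):
    """Refine per-class logits by solving (C + lam*L) z = C z0 with CG.

    logits: (H, W, K) coarse per-pixel class logits.
    confidence: (H, W) text-derived data-trust weights in [0, 1].
    """
    H, W, K = logits.shape
    gate = logits.argmax(axis=-1)
    L = build_grid_laplacian(image_feat, gate, beta=beta)
    C = diags(confidence.ravel())
    A = C + lam * L
    refined = np.empty_like(logits)
    for k in range(K):                                # one CG solve per class channel
        z0 = logits[..., k].ravel()
        z, _ = cg(A, C @ z0, maxiter=50)
        refined[..., k] = z.reshape(H, W)
    return refined
```

In this sketch the confidence matrix C anchors pixels the text already trusts, while the Laplacian term spreads labels only along edges that both the image gradients and the text gate permit.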
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
training-free
cross-modal geometry
model complexity
post-processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free
Open-vocabulary semantic segmentation
Procrustes alignment
Text-aware Laplacian propagation
Cross-modal geometry
👥 Authors
Gensheng Pei
Department of Electrical and Computer Engineering, Sungkyunkwan University
Xiruo Jiang
School of Computing and Artificial Intelligence, Southwest Jiaotong University
Xinhao Cai
Nanjing University of Science and Technology
computer vision, machine learning
Tao Chen
Nanjing University of Science and Technology
computer vision
Yazhou Yao
School of Computer Science and Engineering, Nanjing University of Science and Technology
Byeungwoo Jeon
Professor, Sungkyunkwan University
signal processing, video coding, image/video