AI Summary
This work addresses the misalignment between vision-language model (VLM) outputs and human subjective preferences in domain-specific tasks such as urban perception. The authors propose a training-free post-processing framework that transforms a frozen VLM into an interpretable, human-aligned evaluation system without fine-tuning or reinforcement learning. The approach comprises three stages: concept mining, structured scoring via an Observer-Debater-Judge multi-agent architecture, and local geometric calibration on a hybrid visual-semantic manifold. The pipeline performs end-to-end dimension-wise optimization while remaining fully interpretable at the level of individual dimensions. Evaluated on Place Pulse 2.0, it attains 72.2% accuracy (Cohen's κ = 0.45), outperforming the best supervised baseline and the original VLM by 15.1 and 16.3 percentage points, respectively.
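The Observer-Debater-Judge chain described above can be sketched as a simple orchestration loop. This is a hypothetical skeleton, not the authors' implementation: the stub functions `observer`, `debater`, and `judge` stand in for prompted calls to a frozen VLM, and the fixed stance scores are placeholders for model outputs.

```python
from statistics import mean

def observer(image_desc: str, concept: str) -> str:
    # Stub: a frozen VLM would be prompted here to describe
    # visual evidence for `concept` in the image.
    return f"evidence of {concept} in {image_desc}"

def debater(evidence: str, stance: str) -> float:
    # Stub: one debater argues the concept is strongly present ("pro"),
    # the other that it is weak ("con"); a VLM would return these scores.
    return 0.7 if stance == "pro" else 0.4

def judge(pro_score: float, con_score: float) -> float:
    # Stub: the judge reconciles both stances into one
    # continuous concept score in [0, 1].
    return mean([pro_score, con_score])

def score_concepts(image_desc: str, concepts: list[str]) -> dict[str, float]:
    """Run the Observer-Debater-Judge chain for each mined dimension."""
    scores = {}
    for c in concepts:
        ev = observer(image_desc, c)
        scores[c] = judge(debater(ev, "pro"), debater(ev, "con"))
    return scores
```

The debate step is what makes the extracted scores robust: instead of trusting a single VLM judgment, opposing stances are elicited and reconciled, damping single-prompt noise.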
Abstract
Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa = 0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
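The calibration stage names a concrete, well-known estimator: locally-weighted ridge regression. A minimal sketch is given below, assuming a Gaussian kernel over distances in the hybrid feature space; the bandwidth `tau` and ridge penalty `lam` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def locally_weighted_ridge(X_train, y_train, x_query, tau=1.0, lam=1e-6):
    """Predict a human-aligned rating at x_query by fitting a ridge
    regression weighted toward nearby training points on the manifold."""
    # Gaussian kernel weights from squared distances to the query point
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))

    # Append a bias column so the local model has an intercept
    Xb = np.hstack([X_train, np.ones((X_train.shape[0], 1))])

    # Solve the weighted ridge normal equations:
    # (X^T W X + lam * I) beta = X^T W y
    A = Xb.T @ (w[:, None] * Xb) + lam * np.eye(Xb.shape[1])
    b = Xb.T @ (w * y_train)
    beta = np.linalg.solve(A, b)

    return float(np.append(x_query, 1.0) @ beta)
```

Because each prediction fits its own locally weighted model, the calibration adapts to regional structure of the visual-semantic manifold rather than imposing one global mapping from concept scores to human ratings, while the ridge term keeps the local solve stable when few neighbors carry weight.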