UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

πŸ“… 2026-02-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the misalignment between vision-language model (VLM) outputs and human subjective preferences in domain-specific tasks such as urban perception. The authors propose a training-free post-processing framework that transforms a frozen VLM into an interpretable, human-aligned evaluation system without fine-tuning or reinforcement learning. The approach comprises three stages: concept mining, structured scoring via an Observer-Debater-Judge multi-agent architecture, and local geometric calibration on a hybrid visual-semantic manifold. The pipeline performs end-to-end dimension-wise optimization while keeping every dimension interpretable. Evaluated on Place Pulse 2.0, it attains 72.2% accuracy (Cohen's κ = 0.45), outperforming the best supervised baseline and the original VLM by 15.1 and 16.3 percentage points, respectively.
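The Observer-Debater-Judge chain described above can be sketched as a simple prompt pipeline. This is a minimal illustration, not the paper's implementation: `query_vlm` is a hypothetical stand-in for a call to any frozen VLM, mocked here with fixed replies so the control flow is runnable end to end, and the prompts and score format are invented for the example.

```python
def query_vlm(role, prompt):
    # Mock: a real system would send `prompt` to a frozen VLM under a
    # role-specific system message and parse a score from the reply.
    mock_replies = {
        "observer": "The street looks well lit and maintained. Score: 7",
        "debater": "Cracked pavement argues against a high score. Score: 5",
        "judge": "Weighing both views, lighting outweighs wear. Score: 6",
    }
    return mock_replies[role]

def parse_score(reply):
    # Pull the trailing number out of "... Score: N"
    return float(reply.rsplit("Score:", 1)[1])

def score_dimension(image_desc, dimension):
    # Observer: extract evidence for the concept dimension
    obs = query_vlm("observer", f"Describe '{dimension}' evidence in: {image_desc}")
    # Debater: challenge the observation to reduce one-sided scoring
    deb = query_vlm("debater", f"Argue against this assessment: {obs}")
    # Judge: reconcile both views into a final continuous concept score
    jud = query_vlm("judge", f"Given '{obs}' and '{deb}', give a final score.")
    return parse_score(jud)

print(score_dimension("a residential street at dusk", "safety"))
```

The debate step exists to make the concept scores robust: a single VLM pass tends to anchor on salient features, while forcing a counter-argument before judgment spreads scores over the full range.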

πŸ“ Abstract
Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
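The calibration stage named in the abstract, locally-weighted ridge regression, can be sketched in a few lines. This is a generic textbook version under stated assumptions, not the paper's code: the Gaussian kernel, bandwidth `tau`, and regularizer `lam` are illustrative choices, and the hybrid visual-semantic features are represented here as plain NumPy vectors.

```python
import numpy as np

def locally_weighted_ridge(X_train, y_train, x_query, tau=1.0, lam=0.1):
    """Calibrate one concept score via locally weighted ridge regression.

    X_train : (n, d) hybrid visual-semantic features of annotated samples
    y_train : (n,)   human ratings for those samples
    x_query : (d,)   feature vector of the sample to calibrate
    """
    # Gaussian kernel weights: samples near x_query on the manifold dominate
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))

    # Augment with a bias column, then solve the weighted ridge normal equations
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    W = np.diag(w)
    theta = np.linalg.solve(A.T @ W @ A + lam * np.eye(A.shape[1]),
                            A.T @ W @ y_train)
    return float(np.append(x_query, 1.0) @ theta)

# Toy example: one feature dimension, ratings roughly linear in the feature
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 3.0])
print(locally_weighted_ridge(X, y, np.array([1.5])))
```

The local weighting is what makes the calibration "geometric": rather than one global mapping from VLM scores to human ratings, each prediction is fit from its neighbourhood on the visual-semantic manifold, so the correction can vary across regions of the feature space.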
Problem

Research questions and friction points this paper is trying to address.

vision-language model
human preference alignment
urban perception
post-hoc calibration
subjective perception tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

post-hoc calibration
concept bottleneck
training-free alignment
multi-agent scoring
geometric calibration
πŸ”Ž Similar Papers
No similar papers found.
Authors
Yecheng Zhang, Tsinghua University
Rong Zhao, University College London
Zhizhou Sha, Tsinghua University (Generative Models)
Yong Li, HKUST (Self-supervised Learning, Urban Visual Intelligence)
Lei Wang, Peking University
Ce Hou, Hong Kong University of Science and Technology
Wen Ji, Southwest Jiaotong University (Drive-by Sensing, Optimization, Transportation)
Hao Huang, Tsinghua University
Yunshan Wan, Zhejiang University
Jian Yu, Auckland University of Technology (Graph Neural Networks, Recommender Systems, Deep Learning, Complex Networks, Internet Computing)
Junhao Xia, Tsinghua University
Yuru Zhang, PhD Candidate in Computer Science, University of Nebraska-Lincoln (Wireless Communication, Machine Learning, Edge Computing)
Chunlei Shi, Southeast University