AI Summary
This work addresses the misalignment between vision-language model (VLM) outputs and human subjective preferences in domain-specific tasks such as urban perception. The authors propose a training-free post-processing framework that transforms a frozen VLM into an interpretable, human-aligned evaluation system without fine-tuning or reinforcement learning. The approach comprises three stages: concept mining, structured scoring via an Observer-Debater-Judge multi-agent architecture, and local geometric calibration on a hybrid visual-semantic manifold. The pipeline performs end-to-end dimension-wise optimization while remaining fully interpretable at the level of individual dimensions. Evaluated on Place Pulse 2.0, it attains 72.2% accuracy (Cohen's κ = 0.45), outperforming the best supervised baseline and the original VLM by 15.1 and 16.3 percentage points, respectively.
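The Observer-Debater-Judge chain described above can be sketched as a simple orchestration loop. This is a hypothetical skeleton, not the authors' implementation: the stub functions `observer`, `debater`, and `judge` stand in for prompted calls to a frozen VLM, and the fixed stance scores are placeholders for model outputs.

```python
from statistics import mean

def observer(image_desc: str, concept: str) -> str:
    # Stub: a frozen VLM would be prompted here to describe
    # visual evidence for `concept` in the image.
    return f"evidence of {concept} in {image_desc}"

def debater(evidence: str, stance: str) -> float:
    # Stub: one debater argues the concept is strongly present ("pro"),
    # the other that it is weak ("con"); a VLM would return these scores.
    return 0.7 if stance == "pro" else 0.4

def judge(pro_score: float, con_score: float) -> float:
    # Stub: the judge reconciles both stances into one
    # continuous concept score in [0, 1].
    return mean([pro_score, con_score])

def score_concepts(image_desc: str, concepts: list[str]) -> dict[str, float]:
    """Run the Observer-Debater-Judge chain for each mined dimension."""
    scores = {}
    for c in concepts:
        ev = observer(image_desc, c)
        scores[c] = judge(debater(ev, "pro"), debater(ev, "con"))
    return scores
```

The debate step is what makes the extracted scores robust: instead of trusting a single VLM judgment, opposing stances are elicited and reconciled, damping single-prompt noise.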
Abstract
Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa = 0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
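The calibration stage names a concrete, well-known estimator: locally-weighted ridge regression. A minimal sketch is given below, assuming a Gaussian kernel over distances in the hybrid feature space; the bandwidth `tau` and ridge penalty `lam` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def locally_weighted_ridge(X_train, y_train, x_query, tau=1.0, lam=1e-6):
    """Predict a human-aligned rating at x_query by fitting a ridge
    regression weighted toward nearby training points on the manifold."""
    # Gaussian kernel weights from squared distances to the query point
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))

    # Append a bias column so the local model has an intercept
    Xb = np.hstack([X_train, np.ones((X_train.shape[0], 1))])

    # Solve the weighted ridge normal equations:
    # (X^T W X + lam * I) beta = X^T W y
    A = Xb.T @ (w[:, None] * Xb) + lam * np.eye(Xb.shape[1])
    b = Xb.T @ (w * y_train)
    beta = np.linalg.solve(A, b)

    return float(np.append(x_query, 1.0) @ beta)
```

Because each prediction fits its own locally weighted model, the calibration adapts to regional structure of the visual-semantic manifold rather than imposing one global mapping from concept scores to human ratings, while the ridge term keeps the local solve stable when few neighbors carry weight.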