🤖 AI Summary
Current preference mechanisms in LLM alignment—human preferences, LLM-as-a-Judge (LaaJ), and reward models—suffer from poor interpretability, hindering trustworthy evaluation and optimization. To address this, we propose the first end-to-end automated framework that discovers and vectorizes interpretable concepts via LLM-based concept mining, then constructs a hierarchical, multi-domain white-box regression model that jointly characterizes domain-general and domain-specific concept effects on preferences. Our method enables both local and global concept-level explanations. Crucially, it achieves bidirectional application-level validation: (1) guiding LLMs to generate significantly more preferred responses (LaaJ preference increase, *p* < 0.01); and (2) improving LaaJ discrimination accuracy by +7.2% on average. Evaluated across eight domains and twelve preference mechanisms, our approach establishes new state-of-the-art predictive performance.
📝 Abstract
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses, and represents them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve preference mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.
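The hierarchical structure described above, in which a shared (domain-general) weight vector over concept scores is combined with a small per-domain offset, can be illustrated with a minimal sketch. All names, the synthetic data, and the plain gradient-descent fit below are illustrative assumptions, not the paper's actual implementation or datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K interpretable concepts, D domains, N preference pairs.
K, D, N = 5, 3, 600

# Synthetic "ground truth": domain-general weights plus small domain offsets.
true_general = rng.normal(size=K)
true_domain = 0.5 * rng.normal(size=(D, K))

# Each example: the difference between the chosen and rejected responses'
# concept-score vectors, plus a domain label. The (noisy) label is 1 when
# the chosen response scores higher under the true weights.
X = rng.normal(size=(N, K))            # concept-score differences
d = rng.integers(0, D, size=N)         # domain of each pair
true_logits = np.einsum("nk,nk->n", X, true_general + true_domain[d])
y = (true_logits + 0.3 * rng.normal(size=N) > 0).astype(float)

# Fit the white-box model by gradient descent on logistic loss. An L2
# penalty shrinks the domain offsets toward zero, so shared structure is
# absorbed by the general weights and offsets capture only domain effects.
wg = np.zeros(K)                       # domain-general weights
wd = np.zeros((D, K))                  # domain-specific offsets
lr, lam = 0.1, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-np.einsum("nk,nk->n", X, wg + wd[d])))
    g = (p - y)[:, None] * X           # per-example logistic-loss gradient
    wg -= lr * g.mean(axis=0)
    for j in range(D):
        wd[j] -= lr * (g[d == j].mean(axis=0) + lam * wd[j])

# Preference prediction: sign of the fitted concept-weighted score.
acc = ((np.einsum("nk,nk->n", X, wg + wd[d]) > 0) == (y > 0.5)).mean()
```

The fitted `wg` then offers a global explanation (how each concept shifts preference across all domains), while each `wd[j]` explains what a specific domain weighs differently; a single pair's concept differences times these weights gives a local explanation.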