Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) employed as automatic evaluators suffer from implicit preference biases, leading to inaccurate discrimination in text quality assessment and commonsense reasoning tasks. To address this, we propose contrastive-prompt-based training of linear classification probes that directly decode discriminative knowledge embedded within LLMs to extract preferences with high fidelity. This is the first work to apply linear probing for LLM preference extraction: the probe exhibits strong generalization under domain shift, outperforms fine-tuned discriminators using equivalent data, and achieves superior accuracy (up to +12.7% in certain settings), robustness, and interpretability. Experiments span four model families, six parameter scales, two task types, and six benchmark datasets—consistently surpassing generative evaluators while maintaining comparable computational cost.
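To make the core idea concrete, here is a minimal sketch of a supervised linear classifying probe. Everything in it is a toy stand-in: the "hidden states" are synthetic vectors, whereas the paper trains probes on real LLM activations, and the dimensions, learning rate, and data generator are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy stand-in for LLM hidden states: in the paper's setting these would be
# activations extracted from the model for each candidate answer.
rng = np.random.default_rng(0)
d = 64                       # hidden-state dimension (hypothetical)
w_true = rng.normal(size=d)  # latent "quality" direction baked into the toy data

def fake_hidden_state(preferred: bool) -> np.ndarray:
    """Synthetic activation: preferred answers lean along w_true."""
    base = rng.normal(size=d)
    return base + (1.0 if preferred else -1.0) * 0.5 * w_true

X = np.stack([fake_hidden_state(i % 2 == 0) for i in range(400)])
y = np.array([1 if i % 2 == 0 else 0 for i in range(400)])

# Linear classifying probe: logistic regression fit by gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    grad_w = X.T @ (p - y) / len(y)         # cross-entropy gradient
    grad_b = float(np.mean(p - y))
    w -= lr * grad_w
    b -= lr * grad_b

# Training accuracy of the learned linear direction.
acc = float(np.mean(((X @ w + b) > 0) == (y == 1)))
```

Because the probe is a single linear layer over cached activations, evaluating it adds almost nothing on top of the forward pass the model already performs, which is consistent with the summary's claim of comparable computational cost.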

📝 Abstract
Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs' latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.
Problem

Research questions and friction points this paper is trying to address.

Extract accurate preferences from LLMs by identifying latent knowledge
Reduce biases in LLMs as automated judges for text evaluation
Improve LLM judgement tasks with efficient and interpretable linear probing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using linear classifying probes for bias reduction
Leveraging contrasting prompts to access latent knowledge
Probes outperform traditional generation-based judgement methods
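The unsupervised side of the contrast-pair idea can also be sketched: for each item, build two prompts asserting opposite judgements, take the difference of their hidden states, and recover a probe direction from those differences without labels. The data below is synthetic and the construction (top principal component of the differences) is one plausible instantiation, not necessarily the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)  # hidden "correctness" direction

def contrast_pair(label: int) -> tuple[np.ndarray, np.ndarray]:
    """Hidden states for two prompts asserting opposite judgements."""
    shared = rng.normal(size=d)  # content shared by both prompt variants
    sign = 1.0 if label == 1 else -1.0
    h_pos = shared + sign * 2.0 * truth_dir + 0.1 * rng.normal(size=d)
    h_neg = shared - sign * 2.0 * truth_dir + 0.1 * rng.normal(size=d)
    return h_pos, h_neg

labels = rng.integers(0, 2, size=200)
pairs = [contrast_pair(int(l)) for l in labels]

# Differencing cancels the shared prompt content, isolating the judgement signal.
diffs = np.stack([h_pos - h_neg for h_pos, h_neg in pairs])

# Top principal direction of the differences; unsupervised probes like this are
# only determined up to sign, so evaluation must allow for a sign flip.
_, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
direction = vt[0]

preds = (diffs @ direction > 0).astype(int)
acc = float(max(np.mean(preds == labels), np.mean(preds != labels)))
```

The design choice here is that subtracting paired activations removes whatever the two prompts have in common, so the dominant remaining variance plausibly tracks the model's latent judgement rather than surface features of the text.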