Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how prompt design and judge selection affect LLM-as-a-Judge evaluation of free-text legal question answering, and how a judge's disposition (lenient or strict) shapes the generalization of optimized prompts. Using the LEXam benchmark, the authors optimize task prompts with the ProTeGi method, drawing feedback from two LLM judges (Qwen3-32B and DeepSeek-V3) across four task models, and then test cross-judge transfer. Automatically optimized prompts consistently outperform the human-designed baseline. Lenient judge feedback yields higher and more consistent gains than strict feedback, and prompts optimized with lenient feedback transfer better to strict judges than the reverse. The results indicate that the judge's disposition during optimization shapes how well a prompt generalizes.

📝 Abstract

This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free-text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate that algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026-llm-judge-gaming.
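
The workflow described in the abstract (optimize a task prompt against one judge, then score it under another) can be illustrated with a toy sketch. This is not the authors' code and not the real ProTeGi implementation: the task model, both judges, the critique step, and all names (`task_model`, `lenient_judge`, `strict_judge`, `critique`, `optimize`, `DATA`, `SEED`) are hypothetical stand-ins. In particular, the LLM-written critique is replaced by a rule that lists missing answer parts, and candidate selection is a plain top-k beam. The toy data is too small to reproduce the paper's findings; it only shows the shape of the loop and of the transfer check.

```python
from typing import Callable, Dict, List

Judge = Callable[[str, str], float]  # (answer, reference) -> score in [0, 1]

# Hypothetical toy data: each question has named answer parts and a reference.
DATA: List[Dict] = [
    {"parts": {"rule": "statute applies", "conclusion": "claim succeeds"},
     "reference": "statute applies claim succeeds"},
    {"parts": {"rule": "contract valid", "conclusion": "damages owed"},
     "reference": "contract valid damages owed"},
]
SEED = "Answer the question."


def task_model(prompt: str, question: Dict) -> str:
    """Toy task model: emits an answer part only if the prompt names it."""
    return " ".join(t for k, t in question["parts"].items() if k in prompt)


def lenient_judge(answer: str, reference: str) -> float:
    """Toy lenient judge: partial credit for each reference word covered."""
    ref, ans = set(reference.split()), set(answer.split())
    return len(ref & ans) / len(ref)


def strict_judge(answer: str, reference: str) -> float:
    """Toy strict judge: credit only when every reference word is covered."""
    return 1.0 if lenient_judge(answer, reference) == 1.0 else 0.0


def mean_score(prompt: str, data: List[Dict], judge: Judge) -> float:
    return sum(judge(task_model(prompt, q), q["reference"]) for q in data) / len(data)


def critique(prompt: str, data: List[Dict], judge: Judge) -> List[str]:
    """Stand-in for the textual critique: parts missing from imperfect answers."""
    missing: List[str] = []
    for q in data:
        if judge(task_model(prompt, q), q["reference"]) < 1.0:
            missing += [k for k in q["parts"] if k not in prompt and k not in missing]
    return missing


def optimize(seed: str, data: List[Dict], judge: Judge,
             steps: int = 3, beam_width: int = 2) -> str:
    """Critique-and-edit loop with a top-k beam, scored by the given judge."""
    beam = [seed]
    for _ in range(steps):
        candidates = list(beam)
        for p in beam:
            for key in critique(p, data, judge):
                candidates.append(f"{p} State the {key}.")  # apply one edit
        candidates = list(dict.fromkeys(candidates))  # dedupe, keep order
        candidates.sort(key=lambda p: mean_score(p, data, judge), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


if __name__ == "__main__":
    judges = {"lenient": lenient_judge, "strict": strict_judge}
    for opt_name, opt_judge in judges.items():
        best = optimize(SEED, DATA, opt_judge)
        for eval_name, eval_judge in judges.items():
            score = mean_score(best, DATA, eval_judge)
            print(f"optimized with {opt_name}, evaluated by {eval_name}: {score:.2f}")
```

In a realistic setup the stubs would be replaced by calls to a task model and a judge model, and the transfer check would be run on held-out questions rather than the optimization data.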

Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge

prompt optimization

legal question answering

judge feedback style

prompt transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-Judge

prompt optimization

legal QA