Phare: A Safety Probe for Large Language Models

📅 2025-05-16

🤖 AI Summary
Existing LLM safety evaluations predominantly focus on performance ranking and neglect systematic identification of failure modes. This paper introduces Phare, a multilingual safety assessment framework centered on failure-mode diagnosis. It probes model vulnerabilities across three dimensions: hallucination and reliability, social bias, and harmful content generation. Phare combines adversarial prompt engineering, controllable behavioral sampling, and fine-grained human annotation to identify concrete risk patterns, including sycophancy, prompt sensitivity, and stereotype reproduction. Evaluated on 17 mainstream models, it reveals vulnerabilities shared across languages and architectures, and it offers actionable, model-agnostic pathways for improvement. By shifting evaluation from aggregate scoring to diagnostic analysis, it supports the development of more robust, aligned, and trustworthy language systems.

📝 Abstract
Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.
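
The contrast the abstract draws between ranking models and diagnosing failure modes can be illustrated with a small sketch. It is not Phare's implementation or API: the `ProbeResult` record, the `failure_profile` and `shared_vulnerabilities` helpers, the 0.5 threshold, and the model names, languages, and outcomes in the usage lines are all hypothetical placeholders. The idea is to keep one failure rate per (dimension, failure mode) for each model instead of collapsing everything into a single score, and then to look for failure modes that every model shares.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One labeled probe outcome (hypothetical schema, not Phare's)."""
    model: str
    language: str
    dimension: str      # e.g. "hallucination", "bias", "harmful_content"
    failure_mode: str   # e.g. "sycophancy", "prompt_sensitivity"
    failed: bool


def failure_profile(results):
    """Return {model: {(dimension, failure_mode): failure_rate}}."""
    # counts[model][key] holds [number_of_failures, number_of_probes]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        cell = counts[r.model][(r.dimension, r.failure_mode)]
        cell[0] += int(r.failed)
        cell[1] += 1
    return {
        model: {key: fails / total for key, (fails, total) in cells.items()}
        for model, cells in counts.items()
    }


def shared_vulnerabilities(profile, threshold=0.5):
    """Failure modes whose rate meets the threshold for every model."""
    per_model = [
        {key for key, rate in cells.items() if rate >= threshold}
        for cells in profile.values()
    ]
    return set.intersection(*per_model) if per_model else set()


# Made-up outcomes, for illustration only.
results = [
    ProbeResult("model_a", "en", "hallucination", "sycophancy", True),
    ProbeResult("model_a", "fr", "hallucination", "sycophancy", True),
    ProbeResult("model_a", "en", "bias", "stereotype_reproduction", False),
    ProbeResult("model_b", "en", "hallucination", "sycophancy", True),
    ProbeResult("model_b", "es", "hallucination", "sycophancy", False),
    ProbeResult("model_b", "en", "bias", "stereotype_reproduction", True),
]
profile = failure_profile(results)
common = shared_vulnerabilities(profile)
```

With these placeholder records, `common` contains only `("hallucination", "sycophancy")`, because that is the one failure mode at or above the threshold for both models. A single aggregate score per model would hide that overlap.
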
Problem

Research questions and friction points this paper is trying to address.

Probes LLM safety across hallucination and reliability, social bias, and harmful content generation
Identifies systematic vulnerabilities across 17 state-of-the-art LLMs
Provides actionable insights for building more robust, aligned models
Innovation

Methods, ideas, or system contributions that make the work stand out.

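Multilingual diagnostic framework for LLM safety
Probes hallucination and reliability, social biases, and harmful content generation
Surfaces specific failure modes (sycophancy, prompt sensitivity, stereotype reproduction) instead of ranking models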