Quantifying Risks in Multi-turn Conversation with Large Language Models

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) can generate catastrophic responses in multi-turn dialogues, posing serious risks to public safety; existing evaluation methods rely on fixed adversarial prompts, lack statistical guarantees, and cannot adequately cover the long-horizon dialogue space. Method: QRLLM, the first framework to formalize multi-turn dialogue as a Markov process on a semantic-similarity query graph, combines confidence-interval estimation with adaptive rejection sampling to certify risk probabilities with statistical guarantees. Contribution/Results: QRLLM supports diverse, practical dialogue distributions and efficiently certifies lower bounds on risk in realistic dialogue flows. Experiments reveal substantial undetected catastrophic risks in state-of-the-art LLMs, with certified lower bounds as high as 70% for the worst-performing model, exposing fundamental limitations of current safety alignment strategies.
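
The summary's pairing of "confidence-interval estimation with adaptive rejection sampling" can be pictured as: draw conversations from a base distribution, but reject samples that a cheap proxy deems unpromising, tightening the acceptance bar as evidence accumulates. Below is a minimal sketch of that idea, assuming a running-quantile threshold and user-supplied `propose` and `risk_score` callables; none of these names or the thresholding rule come from the paper.

```python
import numpy as np

def adaptive_rejection_sample(propose, risk_score, n_keep: int,
                              quantile: float = 0.5,
                              rng: np.random.Generator | None = None):
    """Accept proposed conversations whose proxy risk score clears a
    threshold that adapts to the scores observed so far."""
    rng = rng or np.random.default_rng()
    kept, scores = [], []
    threshold = -np.inf                      # accept everything initially
    while len(kept) < n_keep:
        conv = propose(rng)                  # sample from the base conversation distribution
        s = risk_score(conv)                 # cheap proxy, e.g. a safety-classifier score
        scores.append(s)
        if s >= threshold:
            kept.append(conv)
        threshold = float(np.quantile(scores, quantile))  # adapt the bar
    return kept
```

On this reading, the bias introduced by rejection is the point: it concentrates the sampling budget on the risky tail, and the confidence-interval step then turns accepted-sample statistics into a certified bound.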

📝 Abstract
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose QRLLM, a novel, principled certification framework for catastrophic risks in multi-turn LLM conversations that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions, with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
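
To make the conversation model concrete: queries are nodes, edges link semantically similar queries, and a multi-turn conversation is a walk on that graph. The sketch below is one illustrative reading, not the paper's construction; the embedding source, the cosine-similarity threshold, and the uniform-neighbor transition rule are all assumptions.

```python
import numpy as np

def build_query_graph(embeddings: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Connect queries whose embedding cosine similarity exceeds a
    threshold; an edge means one query is a plausible follow-up to the other."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    return (sim >= threshold) & ~np.eye(len(embeddings), dtype=bool)

def sample_conversation(adj: np.ndarray, turns: int,
                        rng: np.random.Generator) -> list[int]:
    """'Graph path'-style sampling: a Markov walk that moves to a
    uniformly random semantic neighbor each turn."""
    node = int(rng.integers(len(adj)))
    path = [node]
    for _ in range(turns - 1):
        neighbors = np.flatnonzero(adj[node])
        # Dead end: restart at a fresh random query.
        node = int(rng.choice(neighbors)) if len(neighbors) else int(rng.integers(len(adj)))
        path.append(node)
    return path
```

Under this reading, the "random node" distribution would skip the walk and draw each turn's query independently, while "graph path" corresponds to walks like the one above.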
Problem

Research questions and friction points this paper is trying to address.

Quantifying catastrophic risks in multi-turn conversations with large language models
Developing certification framework with statistical guarantees for response safety
Addressing limitations of existing evaluations using Markov-based conversation modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Markov process models multi-turn conversation distributions
Query graph encodes semantic similarity for realistic flow
Confidence intervals quantify catastrophic risks with guarantees (see the sketch after this list)
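
To illustrate how a lower bound "with guarantees" can come out of sampled dialogues, here is a standard one-sided Clopper-Pearson bound. The paper's exact interval construction is not specified on this page, so treat the function name and the numbers as illustrative stand-ins.

```python
from scipy.stats import beta

def certified_lower_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson lower bound on the probability of a
    catastrophic response, given k catastrophic outcomes in n sampled
    conversations."""
    if k == 0:
        return 0.0
    # Exact binomial interval: the lower endpoint is a Beta quantile.
    return float(beta.ppf(1.0 - confidence, k, n - k + 1))

# Example: 140 catastrophic responses out of 200 sampled dialogues
# certifies, with 95% confidence, a risk of at least ~0.64.
print(certified_lower_bound(140, 200))
```

A certified lower bound of 70%, as reported for the worst model, is a statement of this form: even accounting for sampling noise, the true catastrophic-response rate provably exceeds the stated value at the chosen confidence level.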
Chengxiao Wang
University of Illinois, Urbana-Champaign
Isha Chaudhary
University of Illinois, Urbana-Champaign
Qian Hu
Amazon
Weitong Ruan
Applied Scientist, Amazon Alexa AI
Rahul Gupta
Amazon
Gagandeep Singh
University of Illinois, Urbana-Champaign