Quantifying Risks in Multi-turn Conversation with Large Language Models

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) can generate catastrophic responses in multi-turn dialogues, posing serious risks to public safety; existing evaluation methods rely on fixed adversarial prompts, lack statistical guarantees, and cannot adequately cover the long-horizon dialogue space. Method: QRLLM, the first framework to formalize multi-turn dialogue as a Markov process on a semantic-similarity query graph, combines confidence-interval estimation with adaptive rejection sampling to certify risk probabilities with statistical guarantees. Contribution/Results: QRLLM supports diverse, practical dialogue distributions and efficiently certifies lower bounds on risk in realistic dialogue flows. Experiments reveal substantial undetected catastrophic risks in state-of-the-art LLMs, with certified lower bounds as high as 70% for the worst-performing model, exposing fundamental limitations of current safety alignment strategies.
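
The summary's pairing of "confidence-interval estimation with adaptive rejection sampling" can be pictured as: draw conversations from a base distribution, but reject samples that a cheap proxy deems unpromising, tightening the acceptance bar as evidence accumulates. Below is a minimal sketch of that idea, assuming a running-quantile threshold and user-supplied `propose` and `risk_score` callables; none of these names or the thresholding rule come from the paper.

```python
import numpy as np

def adaptive_rejection_sample(propose, risk_score, n_keep: int,
                              quantile: float = 0.5,
                              rng: np.random.Generator | None = None):
    """Accept proposed conversations whose proxy risk score clears a
    threshold that adapts to the scores observed so far."""
    rng = rng or np.random.default_rng()
    kept, scores = [], []
    threshold = -np.inf                      # accept everything initially
    while len(kept) < n_keep:
        conv = propose(rng)                  # sample from the base conversation distribution
        s = risk_score(conv)                 # cheap proxy, e.g. a safety-classifier score
        scores.append(s)
        if s >= threshold:
            kept.append(conv)
        threshold = float(np.quantile(scores, quantile))  # adapt the bar
    return kept
```

On this reading, the bias introduced by rejection is the point: it concentrates the sampling budget on the risky tail, and the confidence-interval step then turns accepted-sample statistics into a certified bound.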

📝 Abstract
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose QRLLM, a novel, principled certification framework for catastrophic risks in multi-turn LLM conversations that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions, with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
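
To make the conversation model concrete: queries are nodes, edges link semantically similar queries, and a multi-turn conversation is a walk on that graph. The sketch below is one illustrative reading, not the paper's construction; the embedding source, the cosine-similarity threshold, and the uniform-neighbor transition rule are all assumptions.

```python
import numpy as np

def build_query_graph(embeddings: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Connect queries whose embedding cosine similarity exceeds a
    threshold; an edge means one query is a plausible follow-up to the other."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    return (sim >= threshold) & ~np.eye(len(embeddings), dtype=bool)

def sample_conversation(adj: np.ndarray, turns: int,
                        rng: np.random.Generator) -> list[int]:
    """'Graph path'-style sampling: a Markov walk that moves to a
    uniformly random semantic neighbor each turn."""
    node = int(rng.integers(len(adj)))
    path = [node]
    for _ in range(turns - 1):
        neighbors = np.flatnonzero(adj[node])
        # Dead end: restart at a fresh random query.
        node = int(rng.choice(neighbors)) if len(neighbors) else int(rng.integers(len(adj)))
        path.append(node)
    return path
```

Under this reading, the "random node" distribution would skip the walk and draw each turn's query independently, while "graph path" corresponds to walks like the one above.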
Problem

Research questions and friction points this paper is trying to address.

Quantifying catastrophic risks in multi-turn conversations with large language models
Developing certification framework with statistical guarantees for response safety
Addressing limitations of existing evaluations using Markov-based conversation modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Markov process models multi-turn conversation distributions
Query graph encodes semantic similarity for realistic flow
Confidence intervals quantify catastrophic risks with guarantees (see the sketch after this list)
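
To illustrate how a lower bound "with guarantees" can come out of sampled dialogues, here is a standard one-sided Clopper-Pearson bound. The paper's exact interval construction is not specified on this page, so treat the function name and the numbers as illustrative stand-ins.

```python
from scipy.stats import beta

def certified_lower_bound(k: int, n: int, confidence: float = 0.95) -> float:
    """One-sided Clopper-Pearson lower bound on the probability of a
    catastrophic response, given k catastrophic outcomes in n sampled
    conversations."""
    if k == 0:
        return 0.0
    # Exact binomial interval: the lower endpoint is a Beta quantile.
    return float(beta.ppf(1.0 - confidence, k, n - k + 1))

# Example: 140 catastrophic responses out of 200 sampled dialogues
# certifies, with 95% confidence, a risk of at least ~0.64.
print(certified_lower_bound(140, 200))
```

A certified lower bound of 70%, as reported for the worst model, is a statement of this form: even accounting for sampling noise, the true catastrophic-response rate provably exceeds the stated value at the chosen confidence level.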
Chengxiao Wang
University of Illinois, Urbana-Champaign
Isha Chaudhary
University of Illinois, Urbana-Champaign
Qian Hu
Amazon
Weitong Ruan
Applied Scientist, Amazon Alexa AI
Rahul Gupta
Amazon
Gagandeep Singh
University of Illinois, Urbana-Champaign