Scalable Delphi: Large Language Models for Structured Risk Estimation

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of the traditional Delphi method for structured risk assessment: its time intensity, heavy reliance on human experts, and poor scalability. To overcome these challenges, the authors propose a Delphi framework powered by large language models (LLMs) that simulates a diverse panel of virtual experts through role-based prompting, multi-round iterative refinement, and shared rationales. The approach positions LLMs as scalable proxies for structured expert judgment, achieving strong calibration, evidence sensitivity, and internal consistency while reducing assessment cycles from months to minutes. In AI-augmented cybersecurity risk assessments, the panel's outputs correlate strongly with ground-truth benchmarks (Pearson r = 0.87-0.95), improve systematically as evidence is added, and align closely with human expert judgments; in one comparison, the LLM panel agreed with a human panel more closely than the two human panels agreed with each other.
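The protocol described above (persona-based estimates, shared panel feedback, iterative revision, aggregation) can be sketched as a simple loop. This is an illustrative stand-in, not the paper's implementation: `expert_estimate` is a hypothetical stub that replaces the actual role-prompted LLM call with a deterministic function, so the convergence mechanics are visible without an API.

```python
import statistics

# Hypothetical stand-in for an LLM call. In the paper's setting this would
# prompt a model with an expert persona, the risk question, prior-round
# estimates, and shared rationales; here a deterministic stub revises each
# persona's prior toward the panel median to mimic Delphi convergence.
def expert_estimate(persona_bias, question, panel_median=None):
    base = 0.5 + persona_bias                # persona-specific prior estimate
    if panel_median is None:
        return base                          # round 1: independent judgment
    return 0.5 * base + 0.5 * panel_median   # later rounds: revise toward panel

def scalable_delphi(question, persona_biases, rounds=3):
    # Round 1: each virtual expert estimates independently.
    estimates = [expert_estimate(b, question) for b in persona_biases]
    # Subsequent rounds: share the panel median as feedback and re-elicit.
    for _ in range(rounds - 1):
        med = statistics.median(estimates)
        estimates = [expert_estimate(b, question, med) for b in persona_biases]
    # Aggregate the final round into a single panel judgment.
    return statistics.median(estimates)

personas = [-0.2, -0.1, 0.0, 0.1, 0.2]  # five hypothetical expert personas
print(scalable_delphi("risk of capability X", personas))
```

The median aggregation and round count are assumptions for the sketch; the paper's actual prompting, rationale-sharing format, and aggregation rule may differ.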

📝 Abstract
Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, adapting the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels - in one comparison, closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.
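The calibration claim above is measured as Pearson correlation between panel estimates and benchmark ground truth (r = 0.87-0.95). As a reminder of the metric, here is a minimal from-scratch Pearson r; the numbers are illustrative and are not the paper's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between panel estimates and benchmark scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (made-up) panel estimates vs. benchmark scores:
panel = [0.2, 0.4, 0.5, 0.7, 0.9]
truth = [0.25, 0.35, 0.55, 0.65, 0.85]
print(round(pearson_r(panel, truth), 3))
```

A value near 1.0 means the panel's relative risk ordering tracks the benchmark closely, which is the necessary-condition test the authors use in place of directly observing the target quantity.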
Problem

Research questions and friction points this paper is trying to address.

structured expert elicitation
Delphi method
quantitative risk assessment
large language models
scalable risk estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable Delphi
Large Language Models
structured expert elicitation
risk estimation
expert judgment alignment