Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in existing red-teaming benchmarks, which often overlook domain-specific risks in finance and fail to capture harmful behaviors of large language models within legitimate or professional contexts. To this end, we propose a risk-aware evaluation framework tailored to the Banking, Financial Services, and Insurance (BFSI) sector, integrating a domain-specific harm taxonomy, multi-turn adaptive red-teaming attacks, an ensemble adjudication protocol, and a Risk-Adjusted Harm Score (RAHS). RAHS quantifies safety failures at fine granularity by incorporating operational severity, mitigation signals, and adjudicator consensus, substantially enhancing assessment sensitivity. Empirical results demonstrate that high-stochasticity decoding combined with sustained interaction not only increases jailbreak success rates but also systematically amplifies the operational harm of model outputs, thereby revealing the limitations of single-turn, general-purpose evaluations.

📝 Abstract
The rapid adoption of large language models (LLMs) in financial services introduces new operational, regulatory, and security risks. Yet most red-teaming benchmarks remain domain-agnostic and fail to capture failure modes specific to regulated BFSI settings, where harmful behavior can be elicited through legally or professionally plausible framing. We propose a risk-aware evaluation framework for LLM security failures in Banking, Financial Services, and Insurance (BFSI), combining a domain-specific taxonomy of financial harms, an automated multi-round red-teaming pipeline, and an ensemble-based judging protocol. We introduce the Risk-Adjusted Harm Score (RAHS), a risk-sensitive metric that goes beyond success rates by quantifying the operational severity of disclosures, accounting for mitigation signals, and leveraging inter-judge agreement. Across diverse models, we find that higher decoding stochasticity and sustained adaptive interaction not only increase jailbreak success, but also drive systematic escalation toward more severe and operationally actionable financial disclosures. These results expose limitations of single-turn, domain-agnostic security evaluation and motivate risk-sensitive assessment under prolonged adversarial pressure for real-world BFSI deployment.
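The abstract describes RAHS as combining operational severity, mitigation signals, and inter-judge agreement, but does not give the formula here. A minimal illustrative sketch of how such a metric could be composed is below; the function name, the multiplicative combination, and the 0.5 mitigation discount are all assumptions for illustration, not the paper's actual definition.

```python
def risk_adjusted_harm_score(severity, mitigated, judge_verdicts):
    """Hypothetical sketch of a Risk-Adjusted Harm Score (RAHS).

    severity:      operational severity of the disclosure, in [0, 1]
    mitigated:     True if the output carries mitigation signals
                   (disclaimers, partial refusals, warnings)
    judge_verdicts: one boolean per ensemble judge; True means the
                   judge labels the output harmful
    """
    if not judge_verdicts:
        return 0.0
    # Consensus weight: fraction of ensemble judges flagging the output.
    agreement = sum(judge_verdicts) / len(judge_verdicts)
    # Mitigation signals discount the effective harm (factor is assumed).
    mitigation_discount = 0.5 if mitigated else 1.0
    return severity * mitigation_discount * agreement

# A severe, unmitigated disclosure flagged by 3 of 4 judges:
score = risk_adjusted_harm_score(0.9, mitigated=False,
                                 judge_verdicts=[True, True, True, False])
```

The multiplicative form makes the score drop whenever any component weakens, e.g. adding mitigation language to the same disclosure halves the score in this sketch, which matches the abstract's claim that RAHS "goes beyond success rates" by grading severity rather than counting binary jailbreaks.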
Problem

Research questions and friction points this paper is trying to address.

red-teaming
financial services
LLM security
risk assessment
domain-specific harms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Risk-Adjusted Harm Score
domain-specific red teaming
financial LLM security
multi-round adversarial evaluation
ensemble-based judging
Fabrizio Dimino
Domyn, New York, US
Bhaskarjit Sarmah
Domyn
Machine Learning · Generative AI · Agentic AI · Responsible AI
Stefano Pasquali
Domyn, New York, US