🤖 AI Summary
Large language models (LLMs) deployed in financial applications harbor latent vulnerabilities: adversaries can circumvent regulatory compliance by eliciting outputs that appear benign yet are substantively noncompliant.
Method: We propose the first red-teaming framework tailored to financial compliance. Departing from conventional harm-focused red-teaming, it stages multi-turn adversarial dialogues, termed Risk-Concealment Attacks (RCA), that progressively obscure malicious intent to elicit responses that are compliant on the surface but materially violate financial regulations.
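RCA's mechanics are only described at a high level here, so the following is a minimal sketch of what such a multi-turn risk-concealment loop could look like; `query_llm`, `is_refusal`, and `conceal` are hypothetical stand-ins for the target model, the compliance judge, and the concealment strategy, not the paper's actual implementation.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}

def risk_concealment_attack(
    target_request: str,
    query_llm: Callable[[list[Message]], str],     # target model under test (assumed interface)
    is_refusal: Callable[[str], bool],             # refusal/compliance judge (assumed interface)
    conceal: Callable[[str, list[Message]], str],  # rewrites the request to hide risk (assumed interface)
    max_turns: int = 5,
) -> tuple[bool, list[Message]]:
    """Iteratively reframe a risky financial request across dialogue turns
    until the target model returns a substantive (non-refusal) answer."""
    history: list[Message] = []
    prompt = target_request
    for _ in range(max_turns):
        history.append({"role": "user", "content": prompt})
        reply = query_llm(history)
        history.append({"role": "assistant", "content": reply})
        if not is_refusal(reply):
            return True, history   # model answered substantively: attack succeeded
        # Conceal the regulatory risk further, e.g. by recasting the request
        # as hypothetical research or splitting it into benign sub-questions.
        prompt = conceal(target_request, history)
    return False, history          # every turn was refused: attack failed
```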
Contribution/Results: We construct FIN-Bench, the first financial-safety evaluation benchmark, combining domain-adapted prompt engineering with systematic human annotation. Experiments across nine state-of-the-art LLMs reveal an average attack success rate (ASR) of 93.18%, including 98.28% on GPT-4.1 and 97.56% on OpenAI o1, exposing critical deficiencies in current alignment techniques for financial regulatory contexts.
📝 Abstract
Large Language Models (LLMs) are increasingly integrated into financial applications, yet existing red-teaming research primarily targets harmful content and largely neglects regulatory risks. In this work, we investigate the vulnerability of financial LLMs through red-teaming. We introduce Risk-Concealment Attacks (RCA), a novel multi-turn framework that iteratively conceals regulatory risks to elicit seemingly compliant yet regulation-violating responses from LLMs. To enable systematic evaluation, we construct FIN-Bench, a domain-specific benchmark for assessing LLM safety in financial contexts. Extensive experiments on FIN-Bench demonstrate that RCA effectively bypasses nine mainstream LLMs, achieving an average attack success rate (ASR) of 93.18%, including 98.28% on GPT-4.1 and 97.56% on OpenAI o1. These findings reveal a critical gap in current alignment techniques and underscore the urgent need for stronger moderation mechanisms in financial domains. We hope this work offers practical insights for advancing robust and domain-aware LLM alignment.
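For concreteness, ASR here is simply the percentage of attack attempts judged to have elicited a regulation-violating response; a minimal computation, assuming per-attempt success labels are already available (e.g., from human annotation or a judge model):

```python
def attack_success_rate(successes: list[bool]) -> float:
    """Percentage of attack attempts that elicited a regulation-violating response."""
    return 100.0 * sum(successes) / len(successes) if successes else 0.0

print(attack_success_rate([True, True, False, True]))  # 75.0 (illustrative data only)
```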