Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM evaluation overemphasizes accuracy while neglecting interactive robustness and strategic adaptability. To address this gap, we propose the first adversarial evaluation framework integrating cognitive psychology and game theory. Our method employs a two-armed bandit task (assessing exploration-exploitation trade-offs) and a multi-round trust game (evaluating social cooperation, fairness perception, and strategic flexibility) to systematically stress-test LLM decision-making under dynamic adversarial conditions. The approach combines interactive adversarial prompting, cognitively grounded task design, and quantitative trajectory analysis of game-play strategies. Results reveal significant cross-model disparities: GPT-4 and Gemini-1.5 exhibit pronounced strategic rigidity, susceptibility to manipulation, and divergent fairness-recognition capabilities. These findings provide interpretable, actionable behavioral diagnostics for LLM alignment and safe deployment.
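
The evaluation protocol is described here only in prose. As a rough illustration of the bandit component, the sketch below sets up a two-armed bandit with a mid-game payoff reversal standing in for the adversarial manipulation. The `ask_model` stub is a placeholder for a real chat-completion call, and all names, parameters, and the 10% exploration heuristic are illustrative assumptions, not the authors' implementation.

```python
import random

ARM_PROBS = {"A": 0.7, "B": 0.3}  # hidden per-arm reward probabilities

def ask_model(history: list[str]) -> str:
    """Stub for an LLM call: greedy on observed means, exploring 10% of the time."""
    if not history or random.random() < 0.1:
        return random.choice(list(ARM_PROBS))
    means = {}
    for arm in ARM_PROBS:
        rewards = [int(line.split()[-1]) for line in history if line.startswith(arm)]
        means[arm] = sum(rewards) / len(rewards) if rewards else 0.0
    return max(means, key=means.get)

def run_bandit(rounds: int = 50, adversarial: bool = False) -> list[str]:
    """Play the bandit; if adversarial, silently reverse payoffs mid-game."""
    history: list[str] = []
    for t in range(rounds):
        arm = ask_model(history)
        p = ARM_PROBS[arm]
        if adversarial and t >= rounds // 2:
            p = 1.0 - p  # payoff reversal probes strategic flexibility
        reward = int(random.random() < p)
        history.append(f"{arm} -> reward {reward}")
    return history

if __name__ == "__main__":
    picks = [line[0] for line in run_bandit(adversarial=True)]
    switches = sum(a != b for a, b in zip(picks, picks[1:]))
    print(f"switch rate: {switches / (len(picks) - 1):.2f}")
```

Comparing switch rates before and after the reversal gives a crude read on the strategic flexibility the summary describes: a rigid agent keeps pulling the formerly better arm long after payoffs flip.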

📝 Abstract
As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioral vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of adapting their strategies in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the multi-round trust task. These tasks capture key aspects of exploration-exploitation trade-offs, social cooperation, and strategic flexibility. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3, revealing model-specific susceptibilities to manipulation and rigidity in strategy adaptation. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment. Rather than offering a performance benchmark, this work proposes a methodology for diagnosing decision-making weaknesses in LLM-based agents, providing actionable insights for alignment and safety research.
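
Likewise, here is a minimal sketch of the multi-round trust game under manipulation: a scripted opponent reciprocates generously for several rounds and then defects, so the model's investment trajectory shows how quickly it adapts. The `model_invest` stub again stands in for an LLM prompt, and the payoff rules (tripled transfers, a ten-unit endowment, a five-round defection point) are standard trust-game conventions chosen for illustration, not the paper's exact setup.

```python
def model_invest(history: list[tuple[float, float]], endowment: float = 10.0) -> float:
    """Stub for an LLM investor: invest more after fair returns, pull back after exploitation."""
    if not history:
        return endowment / 2
    sent, returned = history[-1]
    if returned >= sent:                 # last round perceived as fair
        return min(endowment, sent + 2.0)
    return max(0.0, sent - 4.0)          # exploited: disengage

def opponent_return(sent: float, round_no: int) -> float:
    """Scripted adversary: return half the tripled transfer for 5 rounds, then keep everything."""
    return (3.0 * sent) / 2.0 if round_no < 5 else 0.0

def run_trust_game(rounds: int = 10) -> list[tuple[float, float]]:
    history: list[tuple[float, float]] = []
    for r in range(rounds):
        sent = model_invest(history)
        history.append((sent, opponent_return(sent, r)))
    return history

if __name__ == "__main__":
    for r, (sent, returned) in enumerate(run_trust_game()):
        print(f"round {r}: sent {sent:.1f}, got back {returned:.1f}")
```

The stub's react-to-last-round heuristic is deliberately crude; the interesting question the framework asks is how an actual LLM's trajectory compares, e.g. whether it detects the defection at all and how many rounds it takes to disengage.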
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM vulnerabilities in adversarial decision-making scenarios
Testing robustness to manipulation and capacity for strategy adaptation in dynamic environments
Diagnosing decision-making weaknesses in LLMs for AI safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial evaluation framework for LLMs
Cognitive psychology and game theory methods
Quantitative trajectory analysis of strategies on the bandit and trust tasks (see the metric sketch after this list)
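
The summary mentions quantitative trajectory analysis without defining the metrics; the definitions below are illustrative assumptions, not the authors', showing how rigidity and cooperation could be scored from trajectories like those produced by the sketches above.

```python
def adaptation_lag(investments: list[float], defect_round: int, floor: float = 1.0) -> int:
    """Rounds after the opponent's first defection until investment drops
    below `floor`; larger values suggest strategic rigidity."""
    for lag, amount in enumerate(investments[defect_round:]):
        if amount < floor:
            return lag
    return len(investments) - defect_round  # never disengaged

def mean_cooperation(investments: list[float], endowment: float = 10.0) -> float:
    """Average fraction of the endowment invested across all rounds."""
    return sum(investments) / (len(investments) * endowment) if investments else 0.0

# For the trust-game run sketched above (defection starts at round 5):
trajectory = [5.0, 7.0, 9.0, 10.0, 10.0, 10.0, 6.0, 2.0, 0.0, 2.0]
print(adaptation_lag(trajectory, defect_round=5))  # -> 3
print(f"{mean_cooperation(trajectory):.2f}")       # -> 0.61
```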
👥 Authors
Lili Zhang
School of Computing, Dublin City University; Insight SFI Research Centre for Data Analytics
Haomiaomiao Wang
School of Computing, Dublin City University; Insight SFI Research Centre for Data Analytics
Long Cheng
North China Electric Power University
Libao Deng
Harbin Institute of Technology (Weihai)
Tomas Ward
Professor & Director, Insight Research Ireland Centre for Data Analytics, Dublin City University
neurotechnology · human-centric AI · wearable sensors