Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LLM evaluation overemphasizes accuracy while neglecting interactive robustness and strategic adaptability. To address this gap, we propose the first adversarial evaluation framework integrating cognitive psychology and game theory. Our method employs a two-armed bandit task (assessing exploration-exploitation trade-offs) and a multi-round trust game (evaluating social cooperation, fairness perception, and strategic flexibility) to systematically stress-test LLM decision-making under dynamic adversarial conditions. The approach combines interactive adversarial prompting, cognitively grounded task design, and quantitative trajectory analysis of game-play strategies. Results reveal significant cross-model disparities: GPT-4 and Gemini-1.5 exhibit pronounced strategic rigidity, susceptibility to manipulation, and divergent fairness-recognition capabilities. These findings provide interpretable, actionable behavioral diagnostics for LLM alignment and safe deployment.
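
The evaluation protocol is described here only in prose. As a rough illustration of the bandit component, the sketch below sets up a two-armed bandit with a mid-game payoff reversal standing in for the adversarial manipulation. The `ask_model` stub is a placeholder for a real chat-completion call, and all names, parameters, and the 10% exploration heuristic are illustrative assumptions, not the authors' implementation.

```python
import random

ARM_PROBS = {"A": 0.7, "B": 0.3}  # hidden per-arm reward probabilities

def ask_model(history: list[str]) -> str:
    """Stub for an LLM call: greedy on observed means, exploring 10% of the time."""
    if not history or random.random() < 0.1:
        return random.choice(list(ARM_PROBS))
    means = {}
    for arm in ARM_PROBS:
        rewards = [int(line.split()[-1]) for line in history if line.startswith(arm)]
        means[arm] = sum(rewards) / len(rewards) if rewards else 0.0
    return max(means, key=means.get)

def run_bandit(rounds: int = 50, adversarial: bool = False) -> list[str]:
    """Play the bandit; if adversarial, silently reverse payoffs mid-game."""
    history: list[str] = []
    for t in range(rounds):
        arm = ask_model(history)
        p = ARM_PROBS[arm]
        if adversarial and t >= rounds // 2:
            p = 1.0 - p  # payoff reversal probes strategic flexibility
        reward = int(random.random() < p)
        history.append(f"{arm} -> reward {reward}")
    return history

if __name__ == "__main__":
    picks = [line[0] for line in run_bandit(adversarial=True)]
    switches = sum(a != b for a, b in zip(picks, picks[1:]))
    print(f"switch rate: {switches / (len(picks) - 1):.2f}")
```

Comparing switch rates before and after the reversal gives a crude read on the strategic flexibility the summary describes: a rigid agent keeps pulling the formerly better arm long after payoffs flip.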

📝 Abstract
As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioral vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of adapting their strategies in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the multi-round trust task. These tasks capture key aspects of exploration-exploitation trade-offs, social cooperation, and strategic flexibility. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3, revealing model-specific susceptibilities to manipulation and rigidity in strategy adaptation. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment. Rather than offering a performance benchmark, this work proposes a methodology for diagnosing decision-making weaknesses in LLM-based agents, providing actionable insights for alignment and safety research.
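
Likewise, here is a minimal sketch of the multi-round trust game under manipulation: a scripted opponent reciprocates generously for several rounds and then defects, so the model's investment trajectory shows how quickly it adapts. The `model_invest` stub again stands in for an LLM prompt, and the payoff rules (tripled transfers, a ten-unit endowment, a five-round defection point) are standard trust-game conventions chosen for illustration, not the paper's exact setup.

```python
def model_invest(history: list[tuple[float, float]], endowment: float = 10.0) -> float:
    """Stub for an LLM investor: invest more after fair returns, pull back after exploitation."""
    if not history:
        return endowment / 2
    sent, returned = history[-1]
    if returned >= sent:                 # last round perceived as fair
        return min(endowment, sent + 2.0)
    return max(0.0, sent - 4.0)          # exploited: disengage

def opponent_return(sent: float, round_no: int) -> float:
    """Scripted adversary: return half the tripled transfer for 5 rounds, then keep everything."""
    return (3.0 * sent) / 2.0 if round_no < 5 else 0.0

def run_trust_game(rounds: int = 10) -> list[tuple[float, float]]:
    history: list[tuple[float, float]] = []
    for r in range(rounds):
        sent = model_invest(history)
        history.append((sent, opponent_return(sent, r)))
    return history

if __name__ == "__main__":
    for r, (sent, returned) in enumerate(run_trust_game()):
        print(f"round {r}: sent {sent:.1f}, got back {returned:.1f}")
```

The stub's react-to-last-round heuristic is deliberately crude; the interesting question the framework asks is how an actual LLM's trajectory compares, e.g. whether it detects the defection at all and how many rounds it takes to disengage.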
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM vulnerabilities in adversarial decision-making scenarios
Testing robustness to manipulation and capacity for strategy adaptation in dynamic environments
Diagnosing decision-making weaknesses in LLMs for AI safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial evaluation framework for LLMs
Cognitive psychology and game theory methods
Quantitative trajectory analysis of strategies on the bandit and trust tasks (see the metric sketch after this list)
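
The summary mentions quantitative trajectory analysis without defining the metrics; the definitions below are illustrative assumptions, not the authors', showing how rigidity and cooperation could be scored from trajectories like those produced by the sketches above.

```python
def adaptation_lag(investments: list[float], defect_round: int, floor: float = 1.0) -> int:
    """Rounds after the opponent's first defection until investment drops
    below `floor`; larger values suggest strategic rigidity."""
    for lag, amount in enumerate(investments[defect_round:]):
        if amount < floor:
            return lag
    return len(investments) - defect_round  # never disengaged

def mean_cooperation(investments: list[float], endowment: float = 10.0) -> float:
    """Average fraction of the endowment invested across all rounds."""
    return sum(investments) / (len(investments) * endowment) if investments else 0.0

# For the trust-game run sketched above (defection starts at round 5):
trajectory = [5.0, 7.0, 9.0, 10.0, 10.0, 10.0, 6.0, 2.0, 0.0, 2.0]
print(adaptation_lag(trajectory, defect_round=5))  # -> 3
print(f"{mean_cooperation(trajectory):.2f}")       # -> 0.61
```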
👥 Authors
Lili Zhang
School of Computing, Dublin City University; Insight SFI Research Centre for Data Analytics
Haomiaomiao Wang
School of Computing, Dublin City University; Insight SFI Research Centre for Data Analytics
Long Cheng
North China Electric Power University
Libao Deng
Harbin Institute of Technology (Weihai)
Tomas Ward
Professor & Director, Insight Research Ireland Centre for Data Analytics, Dublin City University
neurotechnology · human-centric AI · wearable sensors