Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

๐Ÿ“… 2026-04-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Black-box large language models often exhibit biases or factual inaccuracies on specific topics, yet their trustworthiness boundaries remain poorly defined, hindering safe deployment. To address this challenge, this work proposes an active probing method that integrates knowledge graphs, multi-agent reinforcement learning, and a bias propagation mechanism to efficiently identify untrustworthy topic boundaries in mainstream modelsโ€”such as Llama2 and Qwen2โ€”using only limited black-box queries. This study represents the first effort to combine bias propagation with multi-agent reinforcement learning for evaluating black-box language models and introduces the first open-source dataset annotated with bias-prone topics across multiple models.
๐Ÿ“ Abstract
Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologically slanted, or incorrect responses, which limits their applications when there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundary (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm employs multiple reinforcement learning agents to efficiently identify topics (nodes in the KG) on which the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we release a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.
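The abstract describes agents exploring a knowledge graph under a query budget while a bias signal spreads from queried topics to their neighbors. The following is a minimal, illustrative sketch of that idea only; the toy graph, the epsilon-greedy policy, the stubbed bias oracle, and all parameter names are assumptions for illustration, not the paper's actual GMRL-BD implementation.

```python
import random

random.seed(0)

# Toy knowledge graph: topic -> neighboring topics (assumed structure).
KG = {
    "politics": ["elections", "history"],
    "elections": ["politics", "statistics"],
    "history": ["politics", "geography"],
    "statistics": ["elections"],
    "geography": ["history"],
}

def query_llm_bias(topic):
    """Stand-in for a black-box LLM query returning a bias score in [0, 1]."""
    return {"politics": 0.9, "elections": 0.7}.get(topic, 0.1)

def detect_boundary(budget=3, epsilon=0.3, diffusion=0.5, threshold=0.4):
    scores = {t: 0.0 for t in KG}
    queried = set()
    for _ in range(budget):
        candidates = [t for t in KG if t not in queried]
        if not candidates:
            break
        # Epsilon-greedy agent: usually pick the unqueried topic with the
        # highest diffused score; occasionally explore at random.
        if random.random() < epsilon:
            topic = random.choice(candidates)
        else:
            topic = max(candidates, key=lambda t: scores[t])
        queried.add(topic)
        bias = query_llm_bias(topic)
        scores[topic] = bias
        # Bias diffusion: propagate a fraction of the observed score to
        # unqueried neighbors, steering later queries toward suspect regions.
        for nb in KG[topic]:
            if nb not in queried:
                scores[nb] = max(scores[nb], diffusion * bias)
    # Topics whose (observed or diffused) score crosses the threshold form
    # the detected untrustworthy boundary.
    return {t for t, s in scores.items() if s >= threshold}

boundary = detect_boundary()
```

With this toy oracle and seed, the loop spends its three-query budget around the high-bias region and flags "politics" and "elections" without ever querying every topic, which is the query-efficiency property the abstract emphasizes.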
Problem

Research questions and friction points this paper is trying to address.

Black-box LLM
Untrustworthy Boundary Detection
Bias Detection
Large Language Models
Trustworthiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box LLM
Untrustworthy Boundary Detection
Bias-Diffusion
Multi-Agent Reinforcement Learning
Knowledge Graph