Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

๐Ÿ“… 2026-04-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Black-box large language models often exhibit biases or factual inaccuracies on specific topics, yet their trustworthiness boundaries remain poorly defined, hindering safe deployment. To address this challenge, this work proposes an active probing method that integrates knowledge graphs, multi-agent reinforcement learning, and a bias propagation mechanism to efficiently identify untrustworthy topic boundaries in mainstream modelsโ€”such as Llama2 and Qwen2โ€”using only limited black-box queries. This study represents the first effort to combine bias propagation with multi-agent reinforcement learning for evaluating black-box language models and introduces the first open-source dataset annotated with bias-prone topics across multiple models.
๐Ÿ“ Abstract
Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologically slanted, or incorrect responses, which limits their applications when there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundary (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm employs multiple reinforcement learning agents to efficiently identify topics (nodes in the KG) on which the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we release a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.
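The abstract describes agents exploring a knowledge graph under a query budget while a bias signal spreads from queried topics to their neighbors. The following is a minimal, illustrative sketch of that idea only; the toy graph, the epsilon-greedy policy, the stubbed bias oracle, and all parameter names are assumptions for illustration, not the paper's actual GMRL-BD implementation.

```python
import random

random.seed(0)

# Toy knowledge graph: topic -> neighboring topics (assumed structure).
KG = {
    "politics": ["elections", "history"],
    "elections": ["politics", "statistics"],
    "history": ["politics", "geography"],
    "statistics": ["elections"],
    "geography": ["history"],
}

def query_llm_bias(topic):
    """Stand-in for a black-box LLM query returning a bias score in [0, 1]."""
    return {"politics": 0.9, "elections": 0.7}.get(topic, 0.1)

def detect_boundary(budget=3, epsilon=0.3, diffusion=0.5, threshold=0.4):
    scores = {t: 0.0 for t in KG}
    queried = set()
    for _ in range(budget):
        candidates = [t for t in KG if t not in queried]
        if not candidates:
            break
        # Epsilon-greedy agent: usually pick the unqueried topic with the
        # highest diffused score; occasionally explore at random.
        if random.random() < epsilon:
            topic = random.choice(candidates)
        else:
            topic = max(candidates, key=lambda t: scores[t])
        queried.add(topic)
        bias = query_llm_bias(topic)
        scores[topic] = bias
        # Bias diffusion: propagate a fraction of the observed score to
        # unqueried neighbors, steering later queries toward suspect regions.
        for nb in KG[topic]:
            if nb not in queried:
                scores[nb] = max(scores[nb], diffusion * bias)
    # Topics whose (observed or diffused) score crosses the threshold form
    # the detected untrustworthy boundary.
    return {t for t, s in scores.items() if s >= threshold}

boundary = detect_boundary()
```

With this toy oracle and seed, the loop spends its three-query budget around the high-bias region and flags "politics" and "elections" without ever querying every topic, which is the query-efficiency property the abstract emphasizes.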
Problem

Research questions and friction points this paper is trying to address.

Black-box LLM
Untrustworthy Boundary Detection
Bias Detection
Large Language Models
Trustworthiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box LLM
Untrustworthy Boundary Detection
Bias-Diffusion
Multi-Agent Reinforcement Learning
Knowledge Graph