AI Summary
Black-box large language models often exhibit biases or factual inaccuracies on specific topics, yet their trustworthiness boundaries remain poorly defined, hindering safe deployment. To address this challenge, this work proposes an active probing method that integrates knowledge graphs, multi-agent reinforcement learning, and a bias propagation mechanism to efficiently identify untrustworthy topic boundaries in mainstream models, such as Llama2 and Qwen2, using only a limited number of black-box queries. This study represents the first effort to combine bias propagation with multi-agent reinforcement learning for evaluating black-box language models, and it introduces the first open-source dataset annotated with bias-prone topics across multiple models.
Abstract
Large Language Models (LLMs) have shown strong capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideological, or incorrect responses, which limits their applications when it is unclear on which topics their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with only black-box access to the LLM and under specific query constraints. Building on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm employs multiple reinforcement learning agents to efficiently identify topics (nodes in the KG) on which the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset covering popular LLMs, including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, with labels indicating the topics on which each LLM is likely to be biased.
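To make the setup concrete, the sketch below illustrates the general idea of budget-limited probing of a topic graph with a bias-propagation heuristic. It is not the paper's GMRL-BD algorithm: the toy graph, the stubbed bias oracle, the single greedy walker (standing in for the multi-agent RL policy), and all names are hypothetical assumptions for illustration only.

```python
# Hedged sketch of budget-limited topic probing on a knowledge graph.
# Assumptions (not from the paper): the toy KG, the stubbed bias scores,
# and a single greedy walker in place of multi-agent RL.

# Toy knowledge graph: topic -> neighboring topics.
KG = {
    "science": ["physics", "history"],
    "physics": ["science", "quantum"],
    "history": ["science", "politics"],
    "politics": ["history", "elections"],
    "quantum": ["physics"],
    "elections": ["politics"],
}

# Stub for a black-box bias check: returns a bias score in [0, 1].
# In practice this would query the LLM and score its answer.
BIAS = {"politics": 0.9, "elections": 0.8, "history": 0.4}

def probe_llm(topic):
    return BIAS.get(topic, 0.1)

def find_biased_topics(start_topics, budget, threshold=0.5):
    """Greedy walk under a shared query budget: neighbors of topics
    that already scored as biased are explored first (a crude
    bias-propagation heuristic)."""
    scores, frontier = {}, list(start_topics)
    while frontier and budget > 0:
        topic = frontier.pop(0)
        if topic in scores:
            continue                      # never re-query a topic
        scores[topic] = probe_llm(topic)  # one black-box query
        budget -= 1
        neighbors = KG.get(topic, [])
        if scores[topic] >= threshold:
            frontier = neighbors + frontier  # expand biased topics first
        else:
            frontier = frontier + neighbors
    return {t for t, s in scores.items() if s >= threshold}

print(find_biased_topics(["science"], budget=5))  # → {'politics'}
```

With a slightly larger budget the walk also reaches the neighboring bias-prone topic ("elections"), which shows how the query constraint bounds how much of the untrustworthy boundary can be mapped.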