🤖 AI Summary
This work addresses a significant yet underexplored disparity in the verifiability of hallucinations generated by multimodal large language models, highlighting the insidious nature of "elusive" hallucinations that are difficult for human users to detect and therefore pose serious safety risks. Existing approaches lack effective mechanisms for regulating the verifiability of such hallucinations. To bridge this gap, the study systematically distinguishes between "obvious" and "elusive" hallucinations for the first time, introduces a human-annotated multimodal hallucination dataset comprising 4,470 instances, and proposes a learnable probing mechanism based on activation-space intervention. This approach enables fine-grained control over the verifiability of model outputs. Experiments demonstrate that the method effectively modulates the verifiability of specific hallucination types, and that hybrid intervention strategies can flexibly balance safety and usability requirements across diverse application scenarios.
📝 Abstract
AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucinated content can be readily detected by human users (i.e., obvious hallucinations), while other content is often missed or requires greater verification effort (i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model's verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating the corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the degree of verifiability required in different scenarios.
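To make the activation-space intervention idea concrete, here is a minimal, hedged sketch of the general technique the abstract describes: a linear probe is trained on hidden activations to separate hallucinated from faithful outputs, and its direction is then added (or subtracted) from hidden states to steer behavior. All data, dimensions, and names here are illustrative stand-ins, not the paper's actual implementation; in practice the activations would come from specific layers of an MLLM, and separate probes would be trained for obvious and elusive hallucinations.

```python
# Hedged sketch (not the paper's implementation): activation-space
# intervention via a learned linear probe. All values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimensionality

# Synthetic stand-ins for hidden activations of "hallucinated" vs.
# "faithful" outputs; real data would be extracted from an MLLM.
halluc = rng.normal(0.5, 1.0, size=(200, d))
faithful = rng.normal(-0.5, 1.0, size=(200, d))
X = np.vstack([halluc, faithful])
y = np.array([1] * 200 + [0] * 200)

# Train a linear probe with plain logistic-regression gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

direction = w / np.linalg.norm(w)  # unit probe direction in activation space

def intervene(h, alpha=-2.0):
    """Shift a hidden state along the probe direction.

    alpha < 0 suppresses the probed behavior; alpha > 0 amplifies it.
    "Mixing" interventions, as in the abstract, would amount to
    summing the scaled directions of several probes.
    """
    return h + alpha * direction

h = halluc[0]
score_before = direction @ h
score_after = direction @ intervene(h)
assert score_after < score_before  # the shift lowers the probe's score
```

Because `direction` is unit-norm, the intervention moves the probe's score by exactly `alpha`, which is what makes the strength of the steering directly tunable.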