Learning on LLM Output Signatures for gray-box LLM Behavior Analysis

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of detecting data contamination and hallucinations in large language models (LLMs) under gray-box settings, this paper proposes LLM Output Signatures (LOS), a unified representation that combines the probabilities of the actually generated tokens with the full token-level distribution at each decoding step. Unlike existing gray-box methods that rely solely on sampled-token probabilities and task-specific heuristics, LOS is processed by a transformer-based architecture trained end-to-end under gray-box constraints, with a theoretical guarantee that it can approximate existing detection techniques while enabling more nuanced analysis. Experiments show that the approach significantly outperforms state-of-the-art gray-box baselines on both hallucination and data contamination detection, and that it transfers well across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. The implementation is publicly available.

📝 Abstract
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited, particularly in detecting data contamination and hallucinations. While recently proposed probing techniques provide insights through activation analysis, they require "white-box" access to model internals, often unavailable. Current "gray-box" approaches typically analyze only the probability of the actual tokens in the sequence with simple task-specific heuristics. Importantly, these methods overlook the rich information contained in the full token distribution at each processing step. To address these limitations, we propose that gray-box analysis should leverage the complete observable output of LLMs, consisting of both the previously used token probabilities as well as the complete token distribution sequences - a unified data type we term LOS (LLM Output Signature). To this end, we develop a transformer-based approach to process LOS that theoretically guarantees approximation of existing techniques while enabling more nuanced analysis. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings, significantly outperforming existing baselines. Furthermore, it demonstrates strong transfer capabilities across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. Our code is available at: https://github.com/BarSGuy/LLM-Output-Signatures-Network.
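The LOS idea described above can be sketched in a few lines: pair the probability of each actually generated token with the full distribution observed at that decoding step. The snippet below is an illustrative approximation, not the paper's implementation; the function name, the top-k truncation of the distribution, and the toy inputs are all assumptions made here for brevity.

```python
import numpy as np

def build_los(step_probs: np.ndarray, token_ids: np.ndarray, top_k: int = 8) -> np.ndarray:
    """Toy LOS-style representation (hypothetical helper, not the paper's code).

    step_probs: (T, V) array; each row is a probability distribution over the vocabulary.
    token_ids:  (T,) array of the token ids actually generated at each step.
    Returns a (T, 1 + top_k) array: column 0 is p(actual token), the remaining
    columns are the k largest probabilities per step, a compact stand-in for
    the complete token distribution.
    """
    T, _ = step_probs.shape
    p_actual = step_probs[np.arange(T), token_ids]           # probability of each emitted token
    top = np.sort(step_probs, axis=1)[:, ::-1][:, :top_k]    # sorted top-k of each distribution
    return np.concatenate([p_actual[:, None], top], axis=1)

# Toy example: 3 decoding steps over a 5-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
tokens = probs.argmax(axis=1)                                # pretend greedy decoding
los = build_los(probs, tokens, top_k=3)
print(los.shape)  # (3, 4)
```

Because the toy decoder is greedy, column 0 (the actual-token probability) equals the top-1 entry of each sorted distribution; with sampled tokens the two columns would generally differ, which is part of the signal LOS exposes.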
Problem

Research questions and friction points this paper is trying to address.

Detect data contamination and hallucinations in LLMs
Analyze LLM behavior without white-box access
Leverage complete token distribution for nuanced analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages complete LLM output for gray-box analysis
Uses transformer-based processing of LLM Output Signatures
Enhances detection of hallucinations and data contamination
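The transformer-based processing mentioned above can be illustrated with a minimal self-attention pooling over the per-step signature vectors, followed by a linear head that emits a detection score. This is a hedged sketch of the general idea only: the weights here are random and the dimensions are toy values, whereas the paper's network is trained end-to-end under gray-box constraints.

```python
import numpy as np

def attention_pool_score(los: np.ndarray, rng: np.random.Generator) -> float:
    """Single-head self-attention over LOS steps, mean-pooled, then a sigmoid
    head producing a score in (0, 1). Randomly initialized, for illustration only."""
    T, d = los.shape
    Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    q, k, v = los @ Wq, los @ Wk, los @ Wv
    att = q @ k.T / np.sqrt(d)                       # scaled dot-product attention
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)            # row-wise softmax
    pooled = (att @ v).mean(axis=0)                  # mean-pool the attended steps
    w = rng.normal(size=d)
    return float(1.0 / (1.0 + np.exp(-pooled @ w)))  # sigmoid -> detection score

rng = np.random.default_rng(1)
los = rng.random((5, 4))  # 5 decoding steps, 4-dim toy signature vectors
score = attention_pool_score(los, rng)
print(0.0 < score < 1.0)  # True
```

In a trained detector, the same score would be thresholded to flag a generation as hallucinated or a sample as contaminated; attention over the step dimension is what lets the model weigh informative decoding steps rather than averaging them uniformly.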