🤖 AI Summary
Existing approaches struggle to systematically compare the outputs of large language models across generation conditions. This work proposes a “visual fingerprint” framework that represents LLM outputs as distributions over linguistic choices spanning content, expression, and structure. Choices are extracted with natural language processing pipelines, and their distributions across repeated samples are visualized so that generative behavior can be compared condition by condition. The resulting distribution-level view surfaces consistent behavioral patterns that are difficult to observe through individual responses or aggregate metrics. The framework is demonstrated through four usage scenarios.
📝 Abstract
Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ways. Understanding how different generation conditions shape model behaviors is essential for tasks such as prompt design and model evaluation, yet it remains challenging due to the stochastic and open-ended nature of text generation. We present an approach to visually compare LLM outputs across generation conditions by modeling responses as collections of linguistic choices, including content, expression, and structure. We extract these choices using natural language processing pipelines and represent their distributions across repeated samples. We then visualize these distributions as visual fingerprints, enabling direct, distribution-level comparison of condition-specific tendencies. Through four usage scenarios, we demonstrate how visual fingerprints reveal consistent patterns in LLM behavior that are difficult to observe through individual responses or aggregate metrics.
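The distribution-level comparison described above can be illustrated with a minimal sketch. It assumes toy regex-based extractors in place of the natural language processing pipelines the abstract mentions, and it covers only expression and structure, omitting content. The function names (`extract_choices`, `choice_distribution`, `js_divergence`), the feature definitions, and the use of Jensen-Shannon divergence as a numeric comparison score are all illustrative assumptions, not the authors' method, which compares distributions visually.

```python
import math
import re
from collections import Counter


def extract_choices(response: str) -> dict[str, str]:
    """Map one response to categorical linguistic choices.

    Hypothetical toy features; a real pipeline would use proper NLP tooling.
    """
    # Structure: does any line start with a bullet or a numbered-list marker?
    is_list = re.search(r"^\s*(?:[-*]|\d+\.)\s", response, re.MULTILINE)
    # Expression: does the response contain a common hedging word?
    is_hedged = re.search(r"\b(?:may|might|could|perhaps)\b", response, re.IGNORECASE)
    return {
        "structure": "list" if is_list else "prose",
        "expression": "hedged" if is_hedged else "direct",
    }


def choice_distribution(responses: list[str], dimension: str) -> dict[str, float]:
    """Relative frequency of each choice along one dimension across repeated samples."""
    counts = Counter(extract_choices(r)[dimension] for r in responses)
    total = sum(counts.values())
    return {choice: n / total for choice, n in counts.items()}


def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence (log base 2) between two categorical distributions."""
    keys = set(p) | set(q)
    # Mixture distribution used as the reference point for both KL terms.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict[str, float]) -> float:
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Repeated samples under two hypothetical generation conditions.
condition_a = ["- one\n- two", "1. first\n2. second"]
condition_b = ["It will rain today.", "It might rain today."]

for dim in ("structure", "expression"):
    dist_a = choice_distribution(condition_a, dim)
    dist_b = choice_distribution(condition_b, dim)
    print(dim, dist_a, dist_b, js_divergence(dist_a, dist_b))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared choices), which gives a rough per-dimension score for how far two conditions diverge. A single number like this is only a complement to inspecting the full distributions.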