Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models

📅 2025-10-13
🤖 AI Summary
This work systematically evaluates large language models’ (LLMs) capability to perform stochastic tasks—such as random number/string generation and sequence shuffling—where genuine statistical randomness is essential. Method: Through controlled-variable experiments, we quantify the statistical randomness of LLM outputs using information-theoretic entropy analysis and the NIST SP 800-22 battery of randomness tests; we further investigate the impact of prompt engineering, tool-use capability, and model state. Contribution/Results: We find that while LLMs can produce superficially random outputs, their statistical properties significantly deviate from ideal randomness, exhibiting strong dependence on prompt design and external intervention, and poor output stability. This study introduces the first multi-dimensional evaluation framework specifically for assessing randomness quality in LLMs, revealing a fundamental limitation in their intrinsic capacity for stochastic processing. The findings provide critical insights and methodological foundations for trustworthy AI and security-critical applications.
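The summary's information-theoretic entropy analysis can be illustrated with a minimal sketch: Shannon entropy measured in bits per symbol, compared against the ideal for uniformly random decimal digits. The digit string below is a hypothetical low-diversity output of the kind the paper attributes to LLMs, not data from the study.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Shannon entropy (bits per symbol) of a sequence of symbols."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical LLM-generated digits, repetitive on purpose for illustration.
llm_digits = "7373717379737571737973"
print(shannon_entropy(llm_digits))  # well below the ideal for base-10 digits
print(math.log2(10))                # ideal: ~3.32 bits per symbol
```

A uniform source over ten digits attains log2(10) ≈ 3.32 bits per symbol; outputs that favor a few digits, as in the sketch, score markedly lower.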

📝 Abstract
The rapid advancement of large language model (LLM) technology has led to diverse applications, many of which inherently require randomness, such as stochastic decision-making, gaming, scheduling, AI agents, and cryptography-related tasks. However, the capabilities of LLMs in handling randomness, particularly in generating and utilizing random numbers effectively, remain unclear. This paper investigates the capacity of LLMs for handling tasks that involve randomness through a series of experiments. We designed a set of experiments that consider various factors that can influence an LLM's performance in tasks involving randomness, such as accessibility to external tools, types of tasks, model states (fresh vs. non-fresh), and prompting strategies. The experiments cover a range of tasks, including generating random numbers, generating random strings such as passwords, shuffling items, and evaluating the quality of randomness using entropy and the NIST randomness test suite. Our findings reveal that while LLMs can generate outputs that exhibit some degree of randomness, their performance is inconsistent and often deviates significantly from the expected behavior. The analysis of the experimental results highlights key limitations and areas where improvement is needed for the LLMs to effectively handle tasks involving randomness.
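One way the shuffling task in the abstract can be scored is by counting how often each permutation appears and measuring deviation from the uniform expectation. The sketch below is a hypothetical harness (the `shuffle_bias` helper is an assumption, not the paper's method), shown here against Python's own Fisher-Yates shuffle as a reference; an LLM-backed shuffle function could be dropped in its place.

```python
import random
from collections import Counter
from itertools import permutations

def shuffle_bias(shuffle_fn, items=("a", "b", "c"), trials=60000):
    """Return the max relative deviation of permutation counts from
    the uniform expectation; near 0 for an unbiased shuffle."""
    counts = Counter(tuple(shuffle_fn(list(items))) for _ in range(trials))
    expected = trials / len(list(permutations(items)))
    return max(abs(c - expected) / expected for c in counts.values())

def fisher_yates(xs):
    random.shuffle(xs)  # Python's stdlib shuffle, unbiased reference
    return xs

print(shuffle_bias(fisher_yates))  # small for an unbiased shuffle
```

An LLM that consistently favors certain orderings, as the abstract's findings suggest, would show a much larger deviation than the reference.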
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' capacity for handling randomness in diverse applications
Investigating factors affecting LLM performance in randomness-related tasks
Assessing limitations in LLM-generated randomness quality and consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLM randomness handling through experiments
Testing randomness in generation, shuffling, and passwords
Analyzing entropy and NIST tests for randomness quality
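The NIST SP 800-22 battery cited above starts with the frequency (monobit) test, which checks whether ones and zeros are equally likely in a bit string. A minimal sketch, following the formula in the NIST specification:

```python
import math

def monobit_test(bits):
    """NIST SP 800-22 frequency (monobit) test: p-value for the
    hypothesis that ones and zeros are equally likely."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))  # p >= 0.01 passes per NIST

# Worked example from the NIST specification: p ~ 0.527089.
print(monobit_test("1011010101"))
```

A p-value below 0.01 rejects the sequence as non-random at NIST's default significance level; the full suite applies fourteen further tests of this kind.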
Rabimba Karanjai
University of Houston
Yang Lu
University of Houston
Ranjith Chodavarapu
Kent State University
Lei Xu
Kent State University
Weidong Shi
University of Houston
Blockchain, Cryptocurrency, Cloud Computing, Computing System Security, Computer Architecture