Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability

📅 2025-10-30

🤖 AI Summary

This work investigates whether Transformers can learn sequences produced by Permuted Congruential Generators (PCGs), a family of pseudorandom number generators that apply bit shifts, XORs, rotations, and truncations to a hidden state, and predict unseen sequences in context. The experiments use models with up to 50 million parameters, datasets of up to 5 billion tokens, and moduli up to $2^{22}$; for moduli $m \geq 2^{20}$, training required a curriculum that includes data from smaller moduli. The model reliably predicts PCG outputs even when they are truncated to a single bit, on tasks beyond published classical attacks. The number of in-context sequence elements needed for near-perfect prediction grows as $\sqrt{m}$ with the modulus $m$. When several distinct PRNG variants are presented together during training, the model learns them jointly. Finally, an analysis of the embedding layer shows that the model spontaneously groups integer inputs into bitwise rotationally-invariant clusters, which indicates how representations transfer from smaller to larger moduli.

📝 Abstract

We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find that even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
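To make the LCG-versus-PCG distinction concrete, here is a minimal Python sketch of a scaled-down generator in the style of the XSH-RR ("xorshift high, random rotation") PCG output function. The 16-bit state, 8-bit output, shift amounts, and the constants `a` and `c` are assumptions chosen for illustration, and the function names are hypothetical; the exact variants and parameters used in the paper are not given in this summary.

```python
def lcg_step(state, a, c, m):
    """One linear congruential update of the hidden state: s -> (a*s + c) mod m."""
    return (a * state + c) % m


def rotr(value, rot, width):
    """Rotate the low `width` bits of `value` right by `rot` positions."""
    mask = (1 << width) - 1
    value &= mask
    rot %= width
    return ((value >> rot) | (value << (width - rot))) & mask


def xsh_rr_output(state, state_bits=16, out_bits=8):
    """Toy XSH-RR-style permutation: xorshift, truncate, then rotate.

    Assumes `out_bits` is a power of two. The shift amounts follow the
    pattern of the 64/32 PCG variant, scaled down (an illustrative choice).
    """
    rot_bits = out_bits.bit_length() - 1          # bits that select the rotation
    xshift = (rot_bits + out_bits) // 2           # xorshift distance
    drop = state_bits - out_bits - rot_bits       # low bits discarded by truncation
    mixed = ((state >> xshift) ^ state) >> drop   # shift + XOR + truncation
    rot = state >> (state_bits - rot_bits)        # top bits of the state pick the rotation
    return rotr(mixed, rot, out_bits)


def generate(seed, n, a=25173, c=13849, state_bits=16, out_bits=8):
    """Emit n permuted outputs; the LCG state itself is never revealed."""
    m = 1 << state_bits
    state = seed % m
    outputs = []
    for _ in range(n):
        outputs.append(xsh_rr_output(state, state_bits, out_bits))
        state = lcg_step(state, a, c, m)
    return outputs


if __name__ == "__main__":
    print(generate(seed=42, n=8))
```

A predictor sees only the values returned by `generate`, never `state`, which is why the permutation step makes the task harder than predicting a plain LCG.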

Problem

Research questions and friction points this paper is trying to address.

Transformers learning pseudorandom sequences from Permuted Congruential Generators

Predicting truncated single-bit outputs beyond classical attack capabilities

Analyzing scaling laws and embedding clusters for modulus generalization
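As a worked reading of the square-root scaling stated in the abstract, assume the required number of in-context elements $n(m)$ is proportional to $\sqrt{m}$ with an unspecified constant $c$ (the symbols $n$ and $c$ are introduced here for illustration and do not come from the source):

$$n(m) \approx c\,\sqrt{m} \quad\Longrightarrow\quad \frac{n(2^{22})}{n(2^{16})} = \sqrt{\frac{2^{22}}{2^{16}}} = \sqrt{2^{6}} = 8$$

Under that assumption, raising the modulus by a factor of 64 multiplies the required context length by 8.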

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformers predict permuted congruential generator sequences

Model scales to moduli up to $2^{22}$ using curriculum learning

Embeddings form rotation-invariant clusters for generalization
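One way to picture a "bitwise rotationally-invariant cluster" is as the set of integers whose fixed-width bit patterns are rotations of one another. The sketch below enumerates those equivalence classes for a small bit width. It is an illustration of the concept under that reading, not the paper's embedding analysis, and the function names are hypothetical.

```python
from collections import defaultdict


def rotations(x, width):
    """All bitwise rotations of the low `width` bits of x."""
    mask = (1 << width) - 1
    x &= mask
    return [((x >> r) | (x << (width - r))) & mask for r in range(width)]


def canonical_rotation(x, width):
    """Smallest rotation of x, used as a label for its rotation class."""
    return min(rotations(x, width))


def rotation_clusters(width):
    """Group every `width`-bit integer by its rotation class."""
    clusters = defaultdict(list)
    for x in range(1 << width):
        clusters[canonical_rotation(x, width)].append(x)
    return dict(clusters)


if __name__ == "__main__":
    for label, members in rotation_clusters(4).items():
        print(f"{label:04b}: {members}")
```

For 4-bit integers this yields six classes, for example `[1, 2, 4, 8]` and `[5, 10]`; integers in the same class differ only by a rotation of their bits.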

🔎 Similar Papers

TokenMark: A Modality-Agnostic Watermark for Pre-trained Transformers