🤖 AI Summary
Alignment techniques, particularly RLHF, systematically degrade model performance on tasks requiring unpredictable outputs, despite improving safety and instruction following. Method: We conduct controlled experiments across multiple generations of open-source LLMs (Llama, Qwen, Phi), employing statistical bias analysis, game-theoretic win-rate evaluation, and human-rated creativity assessments on four unpredictability-intensive tasks: random number generation, rock-paper-scissors, hide-and-seek, and creative writing. Contribution/Results: We consistently observe that base models significantly outperform their aligned counterparts, and that stronger alignment correlates with greater performance degradation. Crucially, we provide the first empirical evidence that higher scores on standard alignment benchmarks correlate strongly and negatively with performance on unpredictability tasks, directly challenging the implicit "alignment-as-universal-improvement" assumption. This establishes alignment-induced capability trade-offs as a fundamental phenomenon in LLM development.
📝 Abstract
Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied, and we demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. In particular, we study tasks that require unpredictable outputs, such as random number generation, mixed-strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate "7" over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across the models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.
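The statistical bias analysis for the random number generation task can be sketched as a chi-square goodness-of-fit test against a uniform distribution. This is a minimal illustration, not the paper's exact pipeline; the sample counts below are hypothetical, chosen only to mimic the reported skew toward "7".

```python
# Minimal sketch: chi-square test for bias in model-generated "random" digits.
# The digit counts below are hypothetical, not taken from the paper's data.
from collections import Counter

def chi_square_uniform(samples, categories):
    """Pearson chi-square statistic of `samples` against a uniform
    distribution over `categories`. Larger values mean stronger bias."""
    counts = Counter(samples)
    n = len(samples)
    expected = n / len(categories)
    return sum((counts.get(c, 0) - expected) ** 2 / expected
               for c in categories)

# Hypothetical digit draws (0-9) from two models:
aligned = [7] * 60 + [3] * 15 + [d for d in range(10) for _ in range(2)]
base = [d for d in range(10) for _ in range(10)]  # exactly uniform

print(chi_square_uniform(aligned, range(10)))  # large: heavily skewed to 7
print(chi_square_uniform(base, range(10)))     # 0.0: no deviation
```

With 9 degrees of freedom, a statistic above roughly 16.9 rejects uniformity at the 5% level, so the skewed sample is flagged as biased while the uniform one is not.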