🤖 AI Summary
This work investigates whether large language models (LLMs) internally encode representations of problem difficulty that align with human judgment, and whether such representations track generalization performance during reinforcement learning (RL) fine-tuning. Using the mathematics and programming subsets of Easy2HardBench, we apply cross-layer, cross-position linear probing across 60 models. We find that human-annotated difficulty is highly linearly decodable (Spearman ρ ≈ 0.88), with decoding accuracy scaling strongly with model size, whereas model-generated difficulty estimates are decoded far less reliably and scale poorly. Crucially, RL fine-tuning amplifies the human-aligned difficulty signal, and the strengthened signal positively correlates with test accuracy, while the model-derived signal degrades and negatively correlates with performance. Moreover, gradient-guided intervention along the "solvability" direction reduces hallucination and improves accuracy. These results establish difficulty representation as a critical latent dimension for understanding LLM generalization.
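A minimal sketch of the probing setup described above, assuming precomputed per-example hidden states at a fixed layer and token position; the function name `probe_difficulty`, the ridge regularization, and the train/test split are illustrative choices, not the paper's exact recipe:

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import spearmanr

def probe_difficulty(hidden_states, difficulty, n_train):
    """Fit a linear probe at one (layer, position) site.

    hidden_states: (n_examples, d_model) activations at a fixed layer/position.
    difficulty:    (n_examples,) human-annotated difficulty labels.
    Returns Spearman rho between predicted and true difficulty on held-out data.
    """
    X_tr, X_te = hidden_states[:n_train], hidden_states[n_train:]
    y_tr, y_te = difficulty[:n_train], difficulty[n_train:]
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    rho, _ = spearmanr(probe.predict(X_te), y_te)
    return rho

# Cross-layer, cross-position probing amounts to sweeping every
# (layer, position) site and keeping the best-decoding one.
# Sanity check on synthetic data (a perfectly linear signal gives rho near 1):
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))
y = X @ rng.normal(size=256)
print(probe_difficulty(X, y, n_train=400))
```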
📝 Abstract
Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on the mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward "easier" representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
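As a companion to the steering result, here is a minimal activation-addition sketch in PyTorch, assuming a probe-derived direction vector; the hook pattern and the HuggingFace-style `model.model.layers[k]` path are assumptions, and the paper's gradient-guided intervention may differ in detail:

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that shifts hidden states along `direction`.

    direction: (d_model,) probe weight vector, normalized below.
    alpha < 0 pushes representations toward "easier" problems.
    """
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first element is the
        # hidden-state tensor; handle both tuple and plain-tensor outputs.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage on a HuggingFace-style decoder layer k:
# handle = model.model.layers[k].register_forward_hook(
#     make_steering_hook(probe_direction, alpha=-4.0))
# ... run generation, measure accuracy and hallucination rate ...
# handle.remove()
```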