🤖 AI Summary
This study investigates the origins of cognitive biases in large language models (LLMs), quantifying the relative causal contributions of pretraining, instruction fine-tuning, and training stochasticity. We introduce cross-tuning -- systematically swapping instruction datasets between models that share or differ in pretraining source -- combined with multi-seed fine-tuning and a causal analysis framework to evaluate over 30 cognitive biases. Results demonstrate that bias patterns are predominantly determined by pretraining: models sharing the same pretraining corpus exhibit highly similar bias profiles, whereas fine-tuning exerts only secondary, modulatory effects, and training randomness contributes negligibly. To our knowledge, this is the first work to isolate and quantify the stage-specific causal effects of training components on cognitive biases. Our methodology provides a reproducible foundation for bias attribution, evaluation, and mitigation in LLMs, advancing both theoretical understanding and practical interventions for responsible AI development.
📝 Abstract
Large language models (LLMs) exhibit cognitive biases -- systematic tendencies toward irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear whether these differences stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce *cross-tuning* -- swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins, beyond finetuning effects alone. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.