🤖 AI Summary
This work addresses two key challenges in evaluating domain-specific large language models (LLMs): unreliable evaluation benchmarks and opaque knowledge representation. We propose a deterministic, LLM- and human-free benchmark generation pipeline that automatically constructs cloze-style test sets from raw corpora using term frequency (TF) and TF-IDF statistics. Coupling this with hierarchical mechanistic interpretability analysis (distinguishing attribute extraction from next-token prediction), we uncover, for the first time, the layer-wise dynamics of knowledge forgetting during domain adaptation: forgetting initiates in the middle layers and is amplified in later layers. The method enables low-cost, high-fidelity domain knowledge assessment. Across multiple models (e.g., Llama-3.1, Qwen-2) and domains, our benchmark achieves a Spearman correlation of ρ > 0.92 with expert-curated benchmarks, substantially outperforming perplexity, and supports early stopping when training base models. It also reveals that small models require only ~500 steps for domain adaptation, with the middle layers serving as the critical locus for both knowledge extraction and the onset of forgetting.
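A minimal sketch of what such a pipeline could look like, assuming scikit-learn's `TfidfVectorizer` for the TF-IDF statistics and naive sentence splitting; the helper names (`extract_domain_keywords`, `build_cloze_pairs`) and thresholds are illustrative, not the paper's implementation:

```python
# Minimal sketch: build cloze-style prompt-target pairs from a raw domain corpus.
# Helper names are hypothetical; scikit-learn's TfidfVectorizer supplies TF-IDF scores.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_domain_keywords(documents, top_k=50):
    """Rank terms by their maximum TF-IDF score across documents."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)        # shape: (n_docs, n_terms)
    scores = tfidf.max(axis=0).toarray().ravel()       # best score per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda x: -x[1])
    return [term for term, _ in ranked[:top_k]]

def build_cloze_pairs(documents, keywords):
    """Turn each sentence containing a ranked keyword into a (prompt, target) pair."""
    keyword_set = {k.lower() for k in keywords}
    pairs = []
    for doc in documents:
        for sentence in doc.split(". "):               # naive sentence splitting
            words = sentence.split()
            for i, word in enumerate(words):
                if word.lower().strip(".,") in keyword_set and i > 2:
                    pairs.append({
                        "prompt": " ".join(words[:i]),  # left context only
                        "target": word.strip(".,"),
                    })
                    break                               # one pair per sentence
    return pairs

if __name__ == "__main__":
    corpus = [
        "Metformin lowers hepatic glucose production in type 2 diabetes. "
        "Insulin resistance is a hallmark of the disease.",
    ]
    keywords = extract_domain_keywords(corpus, top_k=20)
    for pair in build_cloze_pairs(corpus, keywords):
        print(pair)
```

Because the ranking and masking rules are deterministic, the same procedure can be rerun on freshly collected domain text to produce a contamination-free test set with no LLM or human in the loop.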
📝 Abstract
The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using TF and Term TF-IDF methods, then constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge at low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark correlates strongly with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to evaluating domain knowledge in base models during training, enabling early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next-token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens, and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigating catastrophic forgetting.
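To make the evaluation side concrete, here is a hedged sketch of how the prompt-target pairs above could score a causal LM, and how agreement with an expert benchmark might be checked. It assumes Hugging Face `transformers` and SciPy; the top-k completion criterion is one plausible choice rather than the paper's exact metric, and the benchmark scores at the bottom are placeholder numbers, not results from the paper:

```python
# Sketch of the scoring side, assuming a Hugging Face causal LM and the
# (prompt, target) pairs built above; spearmanr then compares our benchmark
# with an expert benchmark across models. The score lists are placeholders.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_accuracy(model_name, pairs, top_k=5):
    """Fraction of prompts whose target starts with one of the top-k next tokens."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    hits = 0
    for pair in pairs:
        inputs = tokenizer(pair["prompt"], return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]      # next-token logits
        top_ids = torch.topk(logits, top_k).indices.tolist()
        candidates = [tokenizer.decode([i]).strip().lower() for i in top_ids]
        # Prefix match is a heuristic: the first predicted token may cover only
        # part of a multi-token target word.
        hits += any(pair["target"].lower().startswith(c) for c in candidates if c)
    return hits / len(pairs)

# Example call (downloads a model):
# acc = completion_accuracy("gpt2", pairs)

# Agreement with an expert-curated benchmark across models (illustrative numbers).
ours   = [0.41, 0.58, 0.63, 0.72]   # our cloze benchmark, one score per model
expert = [0.39, 0.55, 0.66, 0.70]   # expert benchmark on the same models
rho, _ = spearmanr(ours, expert)
print(f"Spearman rho = {rho:.2f}")
```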