IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of instruction-following capability and semantic representation during continual pretraining of large language models (LLMs) on new domains, this paper proposes IKnow—a novel framework that requires neither the original base model weights nor external domain-specific knowledge bases. IKnow introduces an instruction-response dialogue-based self-supervised objective, enabling deep semantic encoding by explicitly uncovering implicit domain knowledge embedded in raw text. Its core innovation lies in jointly modeling instruction awareness and knowledge awareness, thereby simultaneously enhancing domain adaptability and instruction-following robustness under fully unsupervised conditions. Extensive experiments across multiple domain-transfer tasks demonstrate that IKnow significantly outperforms baseline methods while preserving general-purpose capabilities. These results validate its effectiveness and generalizability in resource-constrained real-world deployment scenarios.

📝 Abstract
Continual pretraining promises to adapt large language models (LLMs) to new domains using only unlabeled test-time data, but naively applying standard self-supervised objectives to instruction-tuned models is known to degrade their instruction-following capability and semantic representations. Existing fixes assume access to the original base model or rely on knowledge from an external domain-specific database; both pose a realistic barrier in settings where the base model weights are withheld for safety reasons or reliable external corpora are unavailable. In this work, we propose Instruction-Knowledge-Aware Continual Adaptation (IKnow), a simple and general framework that formulates novel self-supervised objectives in the instruction-response dialogue format. Rather than depending on external resources, IKnow leverages domain knowledge embedded within the text itself and learns to encode it at a deeper semantic level.
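The abstract's core idea, casting raw domain text as self-supervised instruction-response dialogues so the model stays in its chat format during continual pretraining, can be sketched as follows. This is a minimal illustration under assumed details: the prompt template, the special tokens, and the response-only loss masking are hypothetical, not the paper's exact recipe.

```python
# Hypothetical sketch of IKnow-style data construction: wrap a raw domain
# passage into an instruction-response dialogue so continual pretraining
# preserves the instruction-following format. No labels or external
# knowledge base are needed; the passage itself supplies the target.

def build_dialogue_example(passage: str) -> dict:
    """Turn a raw domain passage into a self-supervised instruction-response pair."""
    instruction = (
        "Read the following domain text and restate its key facts:\n\n" + passage
    )
    # Self-supervised target: the passage serves as its own response.
    return {"instruction": instruction, "response": passage}

def to_training_text(example: dict) -> tuple[str, int]:
    """Render the dialogue as one training string; return the index where the
    response begins, so the loss can be masked to response tokens only."""
    prompt = f"<|user|>\n{example['instruction']}\n<|assistant|>\n"
    full = prompt + example["response"]
    return full, len(prompt)

# Example usage on a toy domain sentence.
ex = build_dialogue_example("Aspirin irreversibly inhibits COX-1.")
text, resp_start = to_training_text(ex)
assert text[resp_start:] == ex["response"]
```

In a real training loop, `resp_start` would be mapped to a token offset and labels before it set to the ignore index, so gradients come only from the response span.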
Problem

Research questions and friction points this paper is trying to address.

Adapting LLMs to new domains using only unlabeled data
Preserving instruction-following ability during continual pretraining
Leveraging domain knowledge embedded in the text, without external resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised objectives in instruction-response dialogue format
Leverages domain knowledge embedded within the text itself
Encodes domain knowledge at a deeper semantic level