🤖 AI Summary
This work addresses the problem that large language models struggle to acquire effective representations in specialized domains—such as medicine, chemistry, and law—because contrastive learning alone supplies insufficient domain knowledge. To overcome this limitation, the authors propose LBR, a two-stage framework that first injects and compresses domain knowledge through generative pretraining under an information bottleneck constraint, and then performs contrastive learning with generative fine-tuning on the compressed representations to achieve semantic alignment. By decoupling knowledge acquisition from semantic alignment, LBR mitigates the inherent conflict between generative and contrastive objectives, establishing a novel representation paradigm tailored for vertical domains. Experimental results demonstrate that LBR significantly outperforms strong baselines on medical, chemical, and code retrieval tasks, confirming its effectiveness and robustness.
📝 Abstract
Large Language Models (LLMs) adapted via contrastive learning excel at general representation learning but struggle in vertical domains like chemistry and law, primarily due to a lack of domain-specific knowledge. This work identifies a core bottleneck: the prevailing "LLM+CL" paradigm focuses on semantic alignment but does not perform knowledge acquisition, leading to failures on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge via an Information Bottleneck-Constrained Generative Learning stage, preserving the LLM's causal attention to maximize knowledge acquisition while compressing semantics. It then performs Generative-Refined Contrastive Learning on the compressed representations for alignment. This approach maintains architectural consistency and resolves the objective conflict between generative and contrastive learning. Extensive experiments on medical, chemical, and code retrieval tasks show that LBR significantly outperforms strong baselines. Our work establishes a new paradigm for building accurate and robust representations in vertical domains.
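The abstract only names the two training stages, so the sketch below is a hedged illustration of what such objectives commonly look like, not the paper's actual formulation: stage 1 is modeled as a causal language-modeling loss plus a KL-based information-bottleneck penalty (the `beta` weight and Gaussian-posterior form are assumptions), and stage 2 as a standard InfoNCE contrastive loss over the compressed embeddings.

```python
import numpy as np

def ib_generative_loss(lm_nll, mu, log_var, beta=0.1):
    """Stage 1 (assumed form): next-token NLL plus an information-bottleneck penalty.

    lm_nll: average negative log-likelihood of the causal LM on domain text.
    mu, log_var: Gaussian posterior over the compressed representation; the KL
    term to a standard-normal prior limits retained information (compression).
    """
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return lm_nll + beta * kl

def info_nce_loss(queries, keys, temperature=0.05):
    """Stage 2 (assumed form): InfoNCE contrastive loss on compressed embeddings.

    queries[i] and keys[i] form a positive pair; other keys act as in-batch
    negatives.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # [batch, batch] cosine sims
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # pull i-th pair together
```

Decoupling the stages, as LBR proposes, means the IB-regularized generative loss and the contrastive loss are optimized in sequence rather than summed, avoiding direct gradient conflict between the two objectives.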