ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
General-purpose text embedding models struggle to accurately represent chemical terminology, limiting retrieval performance in chemistry literature RAG systems. To address this, we propose ChEmbed, a family of domain-specific embedding models designed for chemical text. ChEmbed employs a tokenizer extended with chemistry-specific tokens, supports an 8,192-token context length, and is domain-adaptively fine-tuned on PubChem, Semantic Scholar, and ChemRxiv corpora using approximately 1.7 million high-quality query-passage pairs synthesized by large language models. This approach fills a critical gap in chemistry-aware embedding models. Evaluated on our newly constructed ChemRxiv Retrieval benchmark, ChEmbed achieves an nDCG@10 of 0.91, outperforming state-of-the-art general-purpose models (0.82) by 9 percentage points, and significantly enhances retrieval relevance and downstream RAG system performance.
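The retrieval step the summary describes reduces to cosine-similarity ranking of paragraph embeddings against a query embedding. A minimal sketch follows; the random toy vectors stand in for ChEmbed outputs (the actual model is not loaded here):

```python
import numpy as np

def top_k(query_vec, passage_vecs, k=3):
    """Rank passages by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    P = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = P @ q                       # cosine similarity per passage
    order = np.argsort(-sims)[:k]      # indices of the k most similar
    return order, sims[order]

# Toy 4-dim "embeddings" standing in for encoded chemistry paragraphs.
rng = np.random.default_rng(0)
passages = rng.normal(size=(5, 4))
# A query vector constructed to lie close to passage 2.
query = passages[2] + 0.05 * rng.normal(size=4)
idx, scores = top_k(query, passages, k=3)
```

In a real RAG pipeline the vectors would come from encoding the query and corpus with the embedding model; the ranking logic is unchanged.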

📝 Abstract
Retrieval-Augmented Generation (RAG) systems in chemistry heavily depend on accurate and relevant retrieval of chemical literature. However, general-purpose text embedding models frequently fail to adequately represent complex chemical terminology, resulting in suboptimal retrieval quality. Specialized embedding models tailored to chemical literature retrieval have not yet been developed, leaving a substantial performance gap. To address this challenge, we introduce ChEmbed, a domain-adapted family of text embedding models fine-tuned on a dataset comprising chemistry-specific text from the PubChem, Semantic Scholar, and ChemRxiv corpora. To create effective training data, we employ large language models to synthetically generate queries, resulting in approximately 1.7 million high-quality query-passage pairs. Additionally, we augment the tokenizer by adding 900 chemically specialized tokens to previously unused slots, which significantly reduces the fragmentation of chemical entities, such as IUPAC names. ChEmbed also maintains an 8192-token context length, enabling the efficient retrieval of longer passages compared to many other open-source embedding models, which typically have a context length of 512 or 2048 tokens. Evaluated on our newly introduced ChemRxiv Retrieval benchmark, ChEmbed outperforms state-of-the-art general embedding models, raising nDCG@10 from 0.82 to 0.91 (+9 pp). ChEmbed represents a practical, lightweight, and reproducible embedding solution that effectively improves retrieval for chemical literature search.
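The nDCG@10 metric quoted in the abstract follows the standard definition: discounted cumulative gain of the system's top-10 ranking, normalized by the ideal ranking's gain. A minimal pure-Python sketch with a hypothetical binary-relevance ranking:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical binary relevance of the top-10 passages for one query
# (1 = relevant paragraph, 0 = irrelevant); not data from the paper.
ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
score = ndcg_at_k(ranking, k=10)
```

A benchmark score such as 0.91 is this per-query value averaged over all benchmark queries.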
Problem

Research questions and friction points this paper is trying to address.

General text embeddings fail to represent complex chemical terminologies.
No specialized embedding models exist for chemical literature retrieval.
Current models struggle with long passages and chemical entity fragmentation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-adapted text embeddings for chemistry
Synthetic query generation with LLMs
Augmented tokenizer with chemical tokens
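The tokenizer augmentation above can be illustrated with a toy longest-match tokenizer: before a chemistry token is added, an IUPAC-style name fragments into several sub-word pieces; after adding it, the name maps to a single token. The vocabulary and name below are illustrative only, not the paper's actual 900 added tokens:

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization, a simplified stand-in for WordPiece."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

# Hypothetical base vocabulary that fragments the chemical name.
base_vocab = {"meth", "yl", "ben", "zene"}
name = "methylbenzene"
before = greedy_tokenize(name, base_vocab)                      # 4 fragments
after = greedy_tokenize(name, base_vocab | {"methylbenzene"})   # 1 token
```

Fewer fragments per chemical entity means each name occupies fewer of the model's input positions and is represented more coherently, which is the motivation for filling unused vocabulary slots with chemistry tokens.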