ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

📅 2024-11-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing general-purpose text embedding benchmarks (e.g., MTEB) lack domain-specific relevance to chemistry, failing to assess models’ understanding of reaction descriptions, molecular properties, patent abstracts, and other chemically grounded semantics. To address this gap, we introduce ChemTEB—the first dedicated benchmark for evaluating text embeddings in chemistry—comprising standardized multi-task evaluation suites for semantic similarity, retrieval, and classification, all constructed from authentic chemical texts and governed by rigorous evaluation protocols. We systematically evaluate 34 open-source and commercial embedding models, revealing substantial performance degradation of most general-purpose models on chemical tasks and identifying several top-performing models with superior chemical semantic alignment. ChemTEB establishes the first principled framework for chemistry-aware embedding evaluation, and we fully open-source the benchmark data, code, and evaluation scripts—filling a critical void in domain-specific NLP benchmarks and enabling robust development and iteration of specialized language models.

Technology Category

Application Category

📝 Abstract
Recent advancements in language models have started a new era of superior information retrieval and content generation, with embedding models playing an important role in optimizing data representation efficiency and performance. While benchmarks like the Massive Text Embedding Benchmark (MTEB) have standardized the evaluation of general domain embedding models, a gap remains in specialized fields such as chemistry, which require tailored approaches due to domain-specific challenges. This paper introduces a novel benchmark, the Chemical Text Embedding Benchmark (ChemTEB), designed specifically for the chemical sciences. ChemTEB addresses the unique linguistic and semantic complexities of chemical literature and data, offering a comprehensive suite of tasks on chemical domain data. Through the evaluation of 34 open-source and proprietary models using this benchmark, we illuminate the strengths and weaknesses of current methodologies in processing and understanding chemical information. Our work aims to equip the research community with a standardized, domain-specific evaluation framework, promoting the development of more precise and efficient NLP models for chemistry-related applications. Furthermore, it provides insights into the performance of generic models in a domain-specific context. ChemTEB comes with open-source code and data, contributing further to its accessibility and utility.
Problem

Research questions and friction points this paper is trying to address.

Chemical Text Embedding
Evaluation Benchmark
Performance Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChemTEB
Chemical Text Embedding Benchmark
Model Evaluation Standard
🔎 Similar Papers
No similar papers found.
A
Ali Shiraee Kasmaee
Department of Computational Science and Engineering, McMaster University, Canada
Mohammad Khodadad
Mohammad Khodadad
Research Assistant, McMaster University
Machine LearningGraph TheoryBioinformaticsReinforcement LearningComputer Vision
M
Mohammad Arshi Saloot
BASF Canada Inc., Canada
N
Nick Sherck
BASF Corporation, USA
S
Stephen Dokas
BASF Corporation, USA
H
H. Mahyar
Department of Computational Science and Engineering, McMaster University, Canada
Soheila Samiee
Soheila Samiee
Senior Applied Research Scientist, BASF
Large Language ModelsTabular deep learningMachine LearningTime-series analysisNeuroscience