🤖 AI Summary
Existing large language models (LLMs) lack rigorous evaluation of scientific reasoning capabilities in chemical and biological engineering (CBE), particularly for ionic liquid (IL)-based carbon capture—a critical carbon-neutral technology.
Method: We introduce IL-CapBench, the first expert-annotated benchmark of 5,920 instances for IL carbon-capture reasoning. It features a multi-dimensional difficulty framework integrating linguistic understanding and domain-specific knowledge, constructed via expert co-annotation, controllable difficulty design, and domain-consistency validation. We evaluate open-weight models under 10B parameters, including Phi-3, Qwen2, and Llama3, in zero-shot and few-shot settings.
Contribution/Results: Our analysis reveals that while small LLMs possess basic knowledge of ILs, their domain-specific scientific reasoning remains severely limited. We further highlight a potential synergy between improving model performance and reducing carbon footprint, providing empirical grounding for deploying LLMs in carbon-neutral research. This work establishes the first CBE-specialized LLM benchmark and advances trustworthy AI for sustainable chemistry.
📝 Abstract
Although Large Language Models (LLMs) have achieved remarkable performance in diverse general knowledge and reasoning tasks, their utility in the scientific domain of Chemical and Biological Engineering (CBE) remains unclear. Assessing this utility requires challenging evaluation benchmarks that measure LLM performance on knowledge- and reasoning-based tasks, which are currently lacking. As a foundational step, we empirically measure the reasoning capabilities of LLMs in CBE. We construct and share an expert-curated dataset of 5,920 examples for benchmarking LLMs' reasoning capabilities in the niche domain of Ionic Liquids (ILs) for carbon sequestration, an emerging approach to mitigating global warming. The dataset spans multiple difficulty levels, varying along the dimensions of linguistic complexity and domain-specific knowledge. Benchmarking three open-source LLMs with fewer than 10B parameters on the dataset suggests that while smaller general-purpose LLMs are knowledgeable about ILs, they lack domain-specific reasoning capabilities. Based on our results, we further discuss considerations for leveraging LLMs in IL-based carbon capture research. Since LLMs have a high carbon footprint, gearing them toward IL research can symbiotically benefit both fields and help reach the ambitious carbon-neutrality target by 2050. Dataset link: https://github.com/sougata-ub/llms_for_ionic_liquids