🤖 AI Summary
To address the dual challenges of severe hallucination in complex scientific reasoning and inefficient tool invocation (e.g., over-reliance on high-cost tools) in large language models (LLMs), this paper proposes the Adapting While Learning (AWL) framework. The method employs a two-component fine-tuning strategy: first, World Knowledge Learning (WKL) internalizes scientific knowledge from tool-generated solutions; second, Tool Usage Adaptation (TUA) teaches fine-grained, difficulty-aware tool-usage decisions based on the WKL-trained model's accuracy, mimicking how human experts assess a problem before choosing a strategy. The approach combines difficulty-aware modeling, multi-domain scientific tool orchestration, and parameter-efficient fine-tuning. Evaluated on six benchmarks in climate science, epidemiology, and mathematics, it improves answer accuracy by 28.27% and tool-invocation accuracy by 13.76% over an 8B baseline, and outperforms GPT-4 and Claude-3.5 on four custom scientific datasets.
📝 Abstract
Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but, even with domain-specific fine-tuning, often produce hallucinations on complex ones. While integrating LLMs with tools can mitigate this reliability issue, models fine-tuned solely on tool usage often over-rely on tools, incurring unnecessary costs from resource-intensive scientific tools even for simpler problems. Inspired by how human experts assess a problem's complexity before choosing a solution strategy, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we classify questions as easy or hard based on the WKL-trained model's accuracy and train it to maintain direct reasoning on easy problems while switching to tools for hard ones. We validate our method on 6 scientific benchmark datasets in climate science, epidemiology, and mathematics. Compared to the base 8B model, our trained models achieve 28.27% higher answer accuracy and 13.76% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4 and Claude-3.5 on 4 custom-created datasets.
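The difficulty-aware labeling step in TUA can be sketched in a few lines: questions the WKL-trained model already answers reliably are labeled easy (keep direct reasoning), and the rest are labeled hard (train the model to invoke a tool). This is a minimal illustration, not the paper's implementation; the accuracy threshold, field names, and `label_questions` helper are assumptions for the sketch.

```python
# Hypothetical sketch of TUA's easy/hard question labeling.
# eval_results would come from sampling the WKL-trained model k times
# per question and measuring its answer accuracy; the 0.5 threshold
# is illustrative, not taken from the paper.

def label_questions(eval_results, accuracy_threshold=0.5):
    """Build fine-tuning targets from per-question accuracy estimates.

    eval_results: list of {"question": str, "accuracy": float} dicts.
    Returns examples whose target is direct reasoning for easy
    questions and a tool invocation for hard ones.
    """
    training_examples = []
    for result in eval_results:
        if result["accuracy"] >= accuracy_threshold:
            target = "direct_answer"  # easy: the model answers on its own
        else:
            target = "tool_call"      # hard: the model switches to a tool
        training_examples.append(
            {"question": result["question"], "target": target}
        )
    return training_examples

examples = label_questions([
    {"question": "Q1 (simple)", "accuracy": 0.9},
    {"question": "Q2 (complex)", "accuracy": 0.1},
])
```

Under this split, the second fine-tuning stage sees a mixed dataset: easy questions paired with direct solutions, hard questions paired with tool-calling traces, which is what discourages over-reliance on expensive tools for simple problems.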