Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how reasoning capabilities, particularly chain-of-thought (CoT) prompting, affect idiomaticity detection in large language models (LLMs) and how this effect scales with model size. Using the DeepSeek-R1 distillation models (1.5B to 70B parameters) as open-source representatives, the authors evaluate CoT prompting, Math-tuned intermediate models, and explicit injection of idiom definitions into prompts across four idiomaticity detection datasets. The effect of reasoning proves smaller and more varied than expected: larger models (14B, 32B, and 70B) show modest gains under CoT, while smaller models benefit from CoT relative to their Math-tuned intermediates but do not recover base-model performance. Analysis shows that larger models produce accurate definitions of potentially idiomatic expressions, whereas smaller models often fail to output the actual meaning; supplying definitions directly in the prompt can improve smaller-model performance in some cases, making definition injection a lightweight option for resource-constrained idiomaticity detection.

📝 Abstract
The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated, and its meaning can then serve as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open-source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance over the Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.
Problem

Research questions and friction points this paper is trying to address.

Investigating reasoning's effect on idiomaticity detection in LLMs
Examining how model size impacts idiomatic expression understanding
Evaluating chain-of-thought reasoning across different parameter scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated reasoning (CoT) models for idiomaticity detection across four datasets
Compared DeepSeek-R1 distillation models from 1.5B to 70B parameters
Introduced definition injection: supplying idiom definitions in prompts to aid smaller models
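The "definition injection" idea from the paper can be sketched as simple prompt construction: prepend the idiom's definition to the classification prompt so a small model does not have to recall the meaning itself. The prompt wording, function name, and example data below are illustrative assumptions, not the paper's exact setup.

```python
from typing import Optional


def build_prompt(sentence: str, expression: str,
                 definition: Optional[str] = None) -> str:
    """Build an idiomaticity-detection prompt, optionally injecting
    a definition of the potentially idiomatic expression.

    Hypothetical sketch: the actual prompts used in the paper differ.
    """
    lines = []
    if definition is not None:
        # Definition injection: give the model the meaning up front.
        lines.append(f'Definition: "{expression}" can mean: {definition}')
    lines.append(f"Sentence: {sentence}")
    lines.append(
        f'Question: Is "{expression}" used idiomatically or literally '
        "in this sentence? Answer with one word: idiomatic or literal."
    )
    return "\n".join(lines)


# With injection (intended for smaller models):
prompt = build_prompt(
    sentence="After the merger fell through, he decided to throw in the towel.",
    expression="throw in the towel",
    definition="to give up; to admit defeat",
)
print(prompt)
```

The resulting string would then be sent to the model of choice; omitting the `definition` argument yields the plain zero-shot prompt used as the baseline condition.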