Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study compares fine-tuning (FT) and in-context learning (ICL) in large language models with respect to language proficiency and inductive biases. The authors construct formal language learning tasks as a controlled benchmark and propose a discriminative test based on generation probability: a model succeeds if it assigns higher probability to in-language strings than to out-of-language strings. FT shows greater proficiency than ICL on in-distribution generalization, while both perform equally well on out-of-distribution generalization. Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes only partially learn the language but diverge at higher proficiency levels. Unlike FT, ICL performance varies substantially with model size and family and is sensitive to the token vocabulary of the language. The work presents formal languages as a controlled testbed for isolating LLM behaviors that are difficult to study in natural language datasets.

📝 Abstract

Large language models (LLMs) operate in two fundamental learning modes, fine-tuning (FT) and in-context learning (ICL), raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task that offers precise language boundaries, controlled string sampling, and no data contamination, and we introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLM behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
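
The two measurements named in the abstract, the discriminative proficiency test and the correlation between string generation probabilities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pairwise form of the test, the choice of Pearson correlation, and the toy log-probabilities are all assumptions made for illustration. In practice the log-probabilities would come from scoring each string with a fine-tuned or prompted model.

```python
from itertools import product
from math import sqrt


def pairwise_accuracy(in_logps, out_logps):
    """Fraction of (in-language, out-of-language) pairs in which the
    in-language string receives the higher log-probability.
    Assumption: the test is applied pairwise; the paper may aggregate differently."""
    pairs = list(product(in_logps, out_logps))
    wins = sum(1 for p_in, p_out in pairs if p_in > p_out)
    return wins / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.
    Assumption: a stand-in for the paper's correlation measure."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical log-probabilities for three in-language and three
# out-of-language strings, scored under each learning mode.
ft_in, ft_out = [-2.0, -2.5, -3.0], [-6.0, -7.5, -2.8]
icl_in, icl_out = [-3.1, -2.9, -4.0], [-5.0, -3.5, -4.2]

ft_accuracy = pairwise_accuracy(ft_in, ft_out)
icl_accuracy = pairwise_accuracy(icl_in, icl_out)

# Inductive-bias similarity: correlate the two modes' scores on the same strings.
bias_similarity = pearson(ft_in + ft_out, icl_in + icl_out)
```

A high `pairwise_accuracy` indicates that a mode separates the language from its complement, while `bias_similarity` close to 1 would indicate that FT and ICL rank the same strings in a similar order.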

Problem

Research questions and friction points this paper is trying to address.

fine-tuning

in-context learning

large language models

inductive biases

language proficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

formal languages

fine-tuning

in-context learning