🤖 AI Summary
Autoregressive large language models (arLLMs) suffer from low knowledge-injection efficiency and poor generalization during fine-tuning, hindered in particular by the “reversal curse”.
Method: We propose Masked Fine-tuning, a novel fine-tuning paradigm that, for the first time, brings the efficient knowledge-learning mechanism of masked diffusion large language models (dLLMs) to arLLMs. Unlike conventional approaches, it removes the dependence on input token order and requires neither data rewriting nor manual augmentation.
Contribution/Results: Comprehensive evaluation across multiple datasets demonstrates that dLLMs inherently possess superior bidirectional (forward/reverse) question-answering generalization. Our method significantly improves arLLMs’ data efficiency: on reverse QA tasks, their accuracy approaches that of dLLMs, substantially narrowing the performance gap between the two architectures. This establishes a scalable, efficient pathway for knowledge injection into arLLMs.
📝 Abstract
Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, they resist knowledge injection via fine-tuning due to inherent shortcomings such as the "reversal curse" -- the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and freedom from the "reversal curse" in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e., whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward- and backward-style Question Answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and that paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracy on both forward and backward QA without paraphrases; adding paraphrases yields only marginal gains. Lastly, inspired by the dLLMs' performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. The proposed method drastically improves the data efficiency of arLLM fine-tuning, effectively closing the performance gap with dLLMs.
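The exact masked fine-tuning recipe is defined in the paper; the core idea it borrows from dLLM training -- corrupt random positions in the training sequence and compute the loss only on the corrupted tokens, so learning no longer depends on left-to-right information order -- can be sketched minimally as below. `MASK_ID` and `mask_ratio` are illustrative placeholders, not values from the paper:

```python
import random

MASK_ID = -1  # hypothetical mask-token id; a real setup would reuse a reserved vocab id


def mask_for_finetuning(token_ids, mask_ratio=0.3, rng=None):
    """Randomly replace a fraction of tokens with MASK_ID.

    Returns (masked_ids, targets), where targets holds the original token id
    at masked positions and None elsewhere, so that the fine-tuning loss can
    be restricted to masked tokens only (visible tokens contribute no loss).
    """
    rng = rng or random.Random(0)
    masked_ids, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_ratio:
            masked_ids.append(MASK_ID)  # hide this token from the model
            targets.append(tok)         # ...and train it to predict the original
        else:
            masked_ids.append(tok)      # token stays visible
            targets.append(None)        # no loss at this position
    return masked_ids, targets
```

In a full training loop, `masked_ids` would be fed to the pre-trained arLLM and cross-entropy would be taken only where `targets` is not `None` (e.g. via an `ignore_index` on the loss), which is what decouples the learning signal from the original token order.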