Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Instruction-tuned large language models (LLMs) under 14B parameters consistently underperform encoder-only models (e.g., BERT) on NLU benchmarks such as GLUE and SuperGLUE. Method: This work pioneers the systematic integration of Proximal Policy Optimization (PPO) reinforcement learning for aligning LLMs’ NLU capabilities—formulating NLU tasks as sequence decision processes, using label consistency as a token-level reward signal, and performing efficient policy optimization exclusively via LoRA adapters. The approach synergistically combines supervised fine-tuning (SFT) with PPO. Contribution/Results: On LLaMA2-7B, our method achieves a +6.3-point average GLUE gain over SFT alone, outperforming zero-shot and few-shot baselines by 38.7 and 26.1 points, respectively, and surpassing BERT-large across all tasks. Strong generalization is confirmed on Qwen2.5-7B and MPT-7B. Our key contribution is a lightweight, scalable RL alignment paradigm that substantially narrows the NLU performance gap between general-purpose LLMs and task-specialized encoders.

Technology Category

Application Category

📝 Abstract
Large language models (LLMs), built on decoder-only transformers, excel in natural language generation and adapt to diverse tasks using zero-shot and few-shot prompting. However, these prompting methods often struggle on natural language understanding (NLU) tasks, where encoder-only models like BERT-base outperform LLMs on benchmarks like GLUE and SuperGLUE. This paper explores two approaches-supervised fine-tuning (SFT) and proximal policy optimization (PPO)-to enhance LLMs' NLU abilities. To reduce the cost of full-model fine-tuning, we integrate low-rank adaptation (LoRA) layers, limiting updates to these layers during both SFT and PPO. In SFT, task-specific prompts are concatenated with input queries and ground-truth labels, optimizing with next-token prediction. Despite this, LLMs still underperform compared to models like BERT-base on several NLU tasks. To close this gap, we apply PPO, a reinforcement learning technique that treats each token generation as an action and uses a reward function based on alignment with ground-truth answers. PPO then updates the model to maximize these rewards, aligning outputs with correct labels. Our experiments with LLAMA2-7B show that PPO improves performance, with a 6.3-point gain over SFT on GLUE. PPO exceeds zero-shot by 38.7 points and few-shot by 26.1 points on GLUE, while surpassing these by 28.8 and 28.5 points on SuperGLUE. Additionally, PPO outperforms BERT-large by 2.7 points on GLUE and 9.3 points on SuperGLUE. The improvements are consistent across models like Qwen2.5-7B and MPT-7B, highlighting PPO's robustness in enhancing LLMs' NLU capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing NLU in small LLMs via reinforcement learning
Addressing underperformance of sub-14B LLMs on GLUE benchmarks
Improving task adaptation using PPO instead of supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Proximal Policy Optimization (PPO)
Frames NLU as reinforcement learning environment
Optimizes reward signals for label alignment
🔎 Similar Papers
No similar papers found.