Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets

📅 2025-12-16

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

Fine-tuning large language models (LLMs) on consumer-grade GPUs (e.g., 12 GB VRAM) faces severe memory bottlenecks; even state-of-the-art parameter-efficient fine-tuning (PEFT) methods like QLoRA incur high memory overhead during backward propagation. Method: We propose Ladder Side Tuning (LST), a lightweight side-network architecture, and its deep extension xLadder, which jointly reduce peak memory via gradient flow redirection, cross-layer connections, and chain-based inference compression. Contribution/Results: On a 7B model, LST achieves a 50% reduction in peak memory, enabling full-sequence fine-tuning with 2K context length without activation checkpointing. It matches QLoRA’s performance across NLU, mathematical reasoning, and LLM-Critic tasks. To our knowledge, this is the first end-to-end PEFT training of a 7B model on a single 12 GB GPU. Furthermore, LST exhibits the same scalability trends as QLoRA—performance scales linearly with adapter rank—validating its robustness and generalizability.

Technology Category

Application Category

📝 Abstract

Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing extemdash conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.

Problem

Research questions and friction points this paper is trying to address.

Reduces memory usage for fine-tuning large language models on limited GPU memory.

Enables efficient fine-tuning of 7B-parameter models on consumer GPUs without gradient checkpointing.

Enhances reasoning depth via cross-connections while maintaining memory efficiency.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight side network reduces memory usage

Matches QLoRA performance with 50% less memory

Depth-extended variant enhances reasoning without extra memory

🔎 Similar Papers

No similar papers found.

Nvidia

30 USD - 94 USD

US, CA, Santa Clara

Authors to Follow