SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard supervised fine-tuning (SFT) weights all tokens in a sequence uniformly, ignoring that task-critical information is often concentrated in a small subset of tokens, which limits optimization efficacy. To address this, the authors propose SFT-GO, an SFT framework that groups the tokens in each sample by their importance values and optimizes the LLM with a weighted combination of a worst-group loss—targeting the most challenging token groups—and the standard cross-entropy loss, enabling adaptive, group-aware learning. The paper provides a theoretical analysis of SFT-GO's convergence rate. Experiments with three token-grouping strategies show that models trained with SFT-GO consistently outperform SFT baselines across popular LLM benchmarks, and the gains hold across diverse datasets and base models (e.g., Llama, Qwen), demonstrating the method's effectiveness and generalization.

📝 Abstract
Supervised fine-tuning (SFT) has become an essential step in tailoring large language models (LLMs) to align with human expectations and specific downstream tasks. However, existing SFT methods typically treat each training instance as a uniform sequence, giving equal importance to all tokens regardless of their relevance. This overlooks the fact that often only a subset of tokens contains critical, task-specific information. To address this limitation, we introduce Supervised Fine-Tuning with Group Optimization (SFT-GO), a novel approach that treats groups of tokens differently based on their importance. SFT-GO groups tokens in each sample based on their importance values and optimizes the LLM using a weighted combination of the worst-group loss and the standard cross-entropy loss. This mechanism adaptively emphasizes the most challenging token groups and guides the model to better handle different group distributions, thereby improving overall learning dynamics. We provide a theoretical analysis of SFT-GO's convergence rate, demonstrating its efficiency. Empirically, we apply SFT-GO with three different token grouping strategies and show that models trained with SFT-GO consistently outperform baseline approaches across popular LLM benchmarks. These improvements hold across various datasets and base models, demonstrating the robustness and effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Improves token importance weighting in supervised fine-tuning
Optimizes LLMs using adaptive worst-group and cross-entropy loss
Enhances model performance across diverse datasets and benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group tokens by importance for optimization
Combine worst-group loss with cross-entropy loss
Adaptively emphasize challenging token groups
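The objective described above—group tokens by importance, then mix the worst per-group loss with the standard cross-entropy loss—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `sft_go_loss`, the equal-sized rank-based grouping, and the mixing weight `beta` are assumptions for the sketch; the paper itself explores three different grouping strategies.

```python
import torch
import torch.nn.functional as F

def sft_go_loss(logits, targets, importance, num_groups=2, beta=0.5):
    """Illustrative SFT-GO-style objective (hypothetical implementation).

    logits:     (seq_len, vocab_size) token logits
    targets:    (seq_len,) target token ids
    importance: (seq_len,) per-token importance scores
    beta:       mixing weight between worst-group and standard CE loss
    """
    # Per-token cross-entropy, kept unreduced so we can regroup it.
    token_loss = F.cross_entropy(logits, targets, reduction="none")

    # Standard cross-entropy loss over all tokens.
    ce_loss = token_loss.mean()

    # Group tokens by importance rank into equal-sized groups
    # (one simple grouping choice; the paper studies several).
    order = torch.argsort(importance)
    group_losses = torch.stack(
        [chunk.mean() for chunk in token_loss[order].chunk(num_groups)]
    )

    # Worst-group loss: the hardest group dominates this term.
    worst_group_loss = group_losses.max()

    return beta * worst_group_loss + (1 - beta) * ce_loss
```

With equal-sized groups, the worst-group term is at least the overall mean, so the combined loss upweights whichever importance group the model currently handles worst.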
Gyuhak Kim
Center for Advanced AI, Accenture
S. Thakur
Center for Advanced AI, Accenture
Su Min Park
Center for Advanced AI, Accenture
Wei Wei
Center for Advanced AI, Accenture
Yujia Bao
Massachusetts Institute of Technology
Machine Learning · Natural Language Processing