🤖 AI Summary
This work addresses the scarcity of high-quality supervised fine-tuning data in knowledge-intensive domains, where existing rule-based synthetic data generation methods generalize poorly and rely heavily on expert-driven trial and error. The study introduces, for the first time, target-model influence estimation into the synthetic data generation pipeline, proposing a model-feedback-driven framework that automatically optimizes scoring criteria. By leveraging gradient-based influence estimates to quantify each synthetic sample's utility for the downstream task, the method employs this signal as a reward in reinforcement learning. Integrated with an optimizer-aware influence estimator, lightweight guiding prompts, and a dedicated scoring model, the framework enables adaptive, cross-domain data synthesis without task-specific hyperparameter tuning. Experiments demonstrate consistent performance gains across diverse domains, models, and data generators.
📄 Abstract
Large language models (LLMs) achieve strong downstream performance largely because of abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as the humanities, social sciences, medicine, law, and finance is scarce: expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback on how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. The influence score serves as the reward for optimizing the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.
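To make the gradient-based signal concrete, here is a minimal sketch of a first-order influence score in the TracIn style: the influence of a training sample on a held-out target-task sample is approximated by the dot product of their loss gradients at the current parameters, scaled by the learning rate. This is only an illustration of the general idea, not the paper's optimizer-aware estimator; the toy linear model and all names (`grad_sq_loss`, `influence`) are assumptions for the example.

```python
# Hypothetical sketch: first-order influence via gradient dot products.
# Model: linear regression with squared loss L(w; x, y) = 0.5*(w.x - y)^2.

def grad_sq_loss(w, x, y):
    """Gradient of 0.5*(w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def influence(w, train_sample, val_sample, lr=0.1):
    """First-order influence of train_sample on val_sample:
    lr * <grad L(train), grad L(val)>. A positive score means one
    SGD step on train_sample is expected to reduce the loss on
    val_sample (helpful); a negative score means it is harmful."""
    g_train = grad_sq_loss(w, *train_sample)
    g_val = grad_sq_loss(w, *val_sample)
    return lr * sum(a * b for a, b in zip(g_train, g_val))

if __name__ == "__main__":
    w = [0.0, 0.0]
    val = ([1.0, 2.0], 3.0)        # held-out target-task sample
    aligned = ([1.0, 2.0], 3.0)    # synthetic sample with a consistent label
    opposed = ([1.0, 2.0], -3.0)   # synthetic sample with a harmful label
    print(influence(w, aligned, val))  # positive: helpful for the task
    print(influence(w, opposed, val))  # negative: harmful for the task
```

In the framework described above, a score of this kind (computed with the target model's own optimizer state rather than plain SGD) would be aggregated over the synthetic samples admitted by a rubric and fed back as the reinforcement-learning reward for the rubric generator.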