DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low data quality and the weak generalization of single models in Text-to-SQL tasks, this paper proposes a data-centric solution, a direction the authors note has rarely been explored. The method introduces a fully automated data-centric pipeline that combines adaptive data repair with error data augmentation. It is coupled with a multi-model collaboration training scheme, in which several models are trained on different augmented data so that their capabilities complement one another, and with an ensemble strategy that integrates their outputs at inference time. The approach takes first place among lightweight Text-to-SQL models (within 70B parameters). Ablation studies support the contribution of both the data-centric pipeline and the multi-model collaboration mechanism. The work offers a way to co-optimize data and models in Text-to-SQL, in contrast to purely model-centric approaches.

📝 Abstract
Text-to-SQL tasks have seen notable improvements since the release of ChatGPT, and agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks. It includes adaptive data repair, which automatically finds and fixes errors in the training dataset, and error data augmentation, in which we specifically diffuse and enhance the erroneous data predicted by the initially trained models. Because the capability of a single fine-tuned model has been found to be very limited, we also propose a Multi-Model collaboration training scheme that trains multiple models on different augmented data, so that they acquire distinct capabilities and complement each other. Furthermore, we use an ensemble strategy that integrates the capabilities of the multiple models by solving a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. The experimental results and ablation study demonstrate the effectiveness of the data-centric pipeline and the Multi-Model (MM) interactive iterative strategies, achieving first place among lightweight text-to-SQL models (within 70B).
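The abstract does not spell out how erroneous predictions are identified before they are augmented. One plausible reading is an execution-based check: run the predicted and gold SQL against the database and keep the samples whose results differ. The sketch below illustrates that idea only; the function names, the sample format, the toy schema, and `toy_predict` (a stand-in for an initially trained model) are all assumptions made for illustration, not the paper's implementation.

```python
import sqlite3


def execute(conn, sql):
    """Run a query; return its rows sorted for order-insensitive comparison, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)


def collect_error_cases(conn, samples, predict):
    """Return the samples whose predicted SQL does not reproduce the gold result."""
    errors = []
    for sample in samples:
        predicted_sql = predict(sample["question"])
        gold_rows = execute(conn, sample["sql"])
        pred_rows = execute(conn, predicted_sql)
        if pred_rows is None or pred_rows != gold_rows:
            errors.append({**sample, "predicted_sql": predicted_sql})
    return errors


def build_demo_db():
    """Hypothetical toy database used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 30), ("Bob", 45), ("Cy", 52)],
    )
    return conn


DEMO_SAMPLES = [
    {"question": "How many singers are there?",
     "sql": "SELECT COUNT(*) FROM singer"},
    {"question": "List singers older than 40.",
     "sql": "SELECT name FROM singer WHERE age > 40"},
]


def toy_predict(question):
    """Stand-in for a trained model: correct on the first question, wrong on the second."""
    if "How many" in question:
        return "SELECT COUNT(name) FROM singer"
    return "SELECT name FROM singer WHERE age > 50"


if __name__ == "__main__":
    for case in collect_error_cases(build_demo_db(), DEMO_SAMPLES, toy_predict):
        print(case["question"], "->", case["predicted_sql"])
```

Comparing execution results rather than SQL strings means that a prediction such as `COUNT(name)` is not counted as an error when it returns the same rows as the gold `COUNT(*)`. The collected cases would then be the seed set for augmentation.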
Problem

Research questions and friction points this paper is trying to address.

Automating data repair and augmentation for text-to-SQL training datasets
Developing multi-model collaboration training with distinct specialized capabilities
Improving accuracy of lightweight text-to-SQL models through ensemble strategies
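As a rough illustration of the first point, the "find" half of automated data repair can be approximated by cheap execution checks over the training set. The sketch below flags samples whose gold SQL fails to execute or returns no rows, so that a later step (for example, an LLM-based rewriter) could attempt a fix. The two heuristics, the sample format, and the toy schema are assumptions for illustration; the source does not describe the paper's actual detection criteria.

```python
import sqlite3


def find_repair_candidates(conn, samples):
    """Flag samples whose gold SQL fails to execute or returns an empty result.

    Returns a list of (sample_index, reason) pairs.
    """
    candidates = []
    for index, sample in enumerate(samples):
        try:
            rows = conn.execute(sample["sql"]).fetchall()
        except sqlite3.Error as exc:
            candidates.append((index, f"execution error: {exc}"))
            continue
        if not rows:
            candidates.append((index, "empty result"))
    return candidates


def build_demo_db():
    """Hypothetical toy database used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 30), ("Bob", 45), ("Cy", 52)],
    )
    return conn


DEMO_TRAINING_SET = [
    {"question": "List all singer names.",
     "sql": "SELECT name FROM singer"},
    {"question": "List all singer names.",
     "sql": "SELECT nmae FROM singer"},                 # misspelled column
    {"question": "Which singers are older than 40?",
     "sql": "SELECT name FROM singer WHERE age > 100"},  # suspicious empty result
]

if __name__ == "__main__":
    for index, reason in find_repair_candidates(build_demo_db(), DEMO_TRAINING_SET):
        print(index, reason)
```

An empty result is only a weak signal, since some questions legitimately have no answer rows, so flagged samples are best treated as candidates for review or repair rather than as confirmed errors.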
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data-centric pipeline repairs and augments data
Multi-model collaboration training with distinct capabilities
Ensemble strategy integrates multiple models for accuracy
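The ensemble step is described as solving a multiple-choice question over the outputs of several models. A minimal sketch of how such options might be assembled: execute each candidate SQL, discard those that fail, and group the rest by execution result so that each distinct result becomes one option. The final selection below uses a simple majority vote as a stand-in; the paper's actual selector is not described in this summary and should not be assumed to work this way. All names and the toy schema are hypothetical.

```python
import sqlite3
from collections import defaultdict


def build_options(conn, candidate_sqls):
    """Group executable candidates by execution result; each group is one option."""
    groups = defaultdict(list)
    for sql in candidate_sqls:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # drop candidates that do not execute
        key = tuple(sorted(rows, key=repr))  # order-insensitive, hashable result key
        groups[key].append(sql)
    return list(groups.values())


def pick_by_majority(options):
    """Stand-in selector: first SQL of the largest group, or None if there are no options."""
    if not options:
        return None
    return max(options, key=len)[0]


def build_demo_db():
    """Hypothetical toy database used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 30), ("Bob", 45), ("Cy", 52)],
    )
    return conn


# Candidates as they might come from several differently trained models.
CANDIDATES = [
    "SELECT name FROM singer WHERE age > 40",
    "SELECT name FROM singer WHERE age >= 41",
    "SELECT name FROM singer WHERE age > 50",
    "SELECT nmae FROM singer",  # fails to execute and is dropped
]

if __name__ == "__main__":
    options = build_options(build_demo_db(), CANDIDATES)
    print(len(options), "options; selected:", pick_by_majority(options))
```

Grouping by result keeps the number of choices small, because syntactically different but equivalent queries collapse into a single option. In a model-based selector, the options would be presented alongside the question and schema instead of being counted.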
Authors
Yuanzhen Xie
Platform and Content Group, Tencent
Liu Ye
Platform and Content Group, Tencent
Jiqun Chu
Platform and Content Group, Tencent
Mochi Gao
Platform and Content Group, Tencent
Hehuan Liu
Platform and Content Group, Tencent
Yunzhi Tan
Tencent
Recommendation System, Machine Learning
Bo Hu
Platform and Content Group, Tencent
Zang Li
Platform and Content Group, Tencent