QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

📅 2026-03-12
🤖 AI Summary
Existing synthetic code-instruction datasets often suffer from noise and hallucinations, and conventional filtering methods struggle to reliably identify high-quality samples. This work proposes the QAQ framework, which assesses data quality by measuring bidirectional semantic alignment between queries and answers via forward and reverse conditional probabilities. The key innovation is the Reverse Mutual Information (RMI) metric, for which both very low and very high values signal data quality issues. QAQ also incorporates a discrepancy-based strategy that uses the performance gap between strong and weak language models to select samples that are both valid and challenging. Evaluated on the WarriorCoder dataset, QAQ achieves performance comparable to full-dataset training while using only a selected 25% of the data, significantly outperforming existing data selection approaches.
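The summary does not give the exact RMI formula, so the following is only a minimal sketch under an assumed definition: RMI as the length-normalized gain in log-probability of the query tokens once the answer is provided as context, i.e. mean log P(Q|A) minus mean log P(Q). The function names `mean_logprob` and `rmi_score` are hypothetical, and the per-token log-probabilities are assumed to come from some language model scored outside this snippet.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability (length-normalized)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def rmi_score(logp_q_given_a: Sequence[float], logp_q: Sequence[float]) -> float:
    """Assumed RMI: mean log P(Q|A) minus mean log P(Q).

    Both inputs hold log-probabilities of the SAME query tokens, scored once
    with the answer as context and once without it. This definition is an
    illustration, not necessarily the paper's exact formula.
    """
    if len(logp_q_given_a) != len(logp_q):
        raise ValueError("both sequences must score the same query tokens")
    return mean_logprob(logp_q_given_a) - mean_logprob(logp_q)


# Hypothetical log-probabilities for a two-token query.
score = rmi_score([-1.0, -1.0], [-2.0, -3.0])
```

Under this assumed definition, a score near zero or below means the answer tells the model little about the query (possible misalignment), while a very large score would fall into the "excessively high" regime the summary flags as suspicious.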

📝 Abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where a low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may indicate samples containing defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of the data using stratified RMI achieves performance comparable to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
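The abstract does not specify how the stratified RMI selection is carried out, so the following is a rough sketch of one plausible reading: trim both RMI tails, split the remaining samples into contiguous RMI bins, and draw an equal share of the budget from each bin. The function name `stratified_rmi_select` and the parameters `tail_frac`, `n_bins`, and `seed` are all hypothetical; only the 25% budget comes from the abstract.

```python
import random
from typing import List, Sequence


def stratified_rmi_select(
    rmi: Sequence[float],
    budget_frac: float = 0.25,  # fraction of the data to keep (25% in the abstract)
    tail_frac: float = 0.10,    # assumed: share trimmed from EACH RMI extreme
    n_bins: int = 5,            # assumed: number of RMI strata
    seed: int = 0,
) -> List[int]:
    """Return indices of selected samples, spread evenly across RMI strata."""
    n = len(rmi)
    order = sorted(range(n), key=lambda i: rmi[i])  # indices by ascending RMI
    cut = int(n * tail_frac)
    kept = order[cut:n - cut]                       # drop both extremes
    budget = min(int(n * budget_frac), len(kept))
    rng = random.Random(seed)

    selected: List[int] = []
    for b in range(n_bins):
        # Contiguous slice of the kept indices forming stratum b.
        members = kept[len(kept) * b // n_bins: len(kept) * (b + 1) // n_bins]
        # Split the budget across strata so the quotas sum exactly to budget.
        quota = budget * (b + 1) // n_bins - budget * b // n_bins
        selected.extend(rng.sample(members, min(quota, len(members))))
    return sorted(selected)


# Toy scores: sample i has RMI equal to i, so the tails are indices 0-9 and 90-99.
chosen = stratified_rmi_select([float(i) for i in range(100)])
```

The disagreement-based strategy mentioned in the abstract (strong versus weak model) is not covered here; it would need per-sample scores from two models, which the abstract does not detail.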
Problem

Research questions and friction points this paper is trying to address.

synthetic data
code generation
data selection
hallucination
semantic coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Mutual Information
Bidirectional Semantic Coherence
Synthetic Data Selection
Code Generation
Model Disagreement
Jiayin Lei
Beijing University of Technology, Beijing, 100124, China
Ming Ma
Department of Mathematical Sciences, Tsinghua University, Beijing
Yunxi Duan
Beijing University of Technology, Beijing, 100124, China
Chenxi Li
University of Chicago, Chicago, IL, USA
Tianming Yang
Institute of Neuroscience, State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, 200031, China