QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

📅 2026-03-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing synthetic code-instruction datasets often suffer from noise and hallucinations, and conventional filtering methods struggle to reliably identify high-quality samples. This work proposes QAQ, a framework that assesses data quality from the reverse direction: alongside the usual forward probability of the answer given the query, it measures how well the answer predicts the query, so that semantic alignment is checked in both directions. The key contribution is the Reverse Mutual Information (RMI) metric, for which both very low and very high values signal data quality issues. QAQ also adds a selection strategy based on the disagreement between strong and weak language models, which picks samples that are both valid and challenging. On the WarriorCoder dataset, training on just 25% of the data, selected with stratified RMI, achieves performance comparable to training on the full dataset and significantly outperforms existing data selection approaches.

📝 Abstract

Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how difficult it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while samples with excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
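The abstract does not state the exact RMI formula or the stratification rule, so the following is only a minimal sketch under stated assumptions. It assumes RMI can be estimated as the length-normalized difference $\log P(Q|A) - \log P(Q)$, computed from per-token log-probabilities of the query that some language model has already produced, and that "stratified" selection means ranking samples by RMI, cutting the ranking into equal-size strata, and keeping the same fraction from each. The function names `rmi` and `stratified_select`, and the parameters `fraction` and `n_strata`, are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of a token sequence."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return sum(token_logprobs) / len(token_logprobs)


def rmi(q_given_a_logprobs: Sequence[float], q_logprobs: Sequence[float]) -> float:
    """Assumed RMI estimate: mean log P(Q|A) minus mean log P(Q).

    Both inputs are per-token log-probabilities of the same query tokens,
    scored with and without the answer as context. This formula is an
    assumption, not the paper's stated definition.
    """
    return mean_logprob(q_given_a_logprobs) - mean_logprob(q_logprobs)


def stratified_select(
    scores: Sequence[float], fraction: float = 0.25, n_strata: int = 4
) -> List[int]:
    """Return indices of samples kept by a simple stratified rule.

    Samples are ranked by score, split into n_strata contiguous strata,
    and the lowest-ranked `fraction` of each stratum is kept. The rule is
    a hypothetical stand-in; the paper may weight strata differently.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if n_strata < 1:
        raise ValueError("n_strata must be at least 1")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    selected: List[int] = []
    for s in range(n_strata):
        lo = s * len(order) // n_strata
        hi = (s + 1) * len(order) // n_strata
        stratum = order[lo:hi]
        keep = round(len(stratum) * fraction)
        selected.extend(stratum[:keep])
    return sorted(selected)
```

In use, one would score each query twice with the same model, once with the answer in context and once without, pass the two log-probability lists to `rmi`, and then call `stratified_select` on the resulting scores with `fraction=0.25` to mirror the 25% budget reported above. Because the abstract says both RMI extremes signal quality problems, a fuller version might down-weight or drop the outermost strata; this sketch treats all strata equally.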

Problem

Research questions and friction points this paper is trying to address.

synthetic data

code generation

data selection

hallucination

semantic coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse Mutual Information

Bidirectional Semantic Coherence

Synthetic Data Selection