Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
In biology, the scarcity and high cost of wet-lab experimental labels severely constrain the construction of synthetic chain-of-thought (CoT) reasoning data. Method: an unsupervised filtering framework for synthetic reasoning data that requires no ground-truth labels. It leverages model-intrinsic uncertainty metrics, such as self-consistency and predictive perplexity, weighting and fusing these confidence signals to select high-quality CoT trajectories on a per-class basis. Results: the resulting synthetic dataset substantially improves biological perturbation prediction: supervised fine-tuning on it approaches the performance of full supervised training with real labels and surpasses strong baselines. This work is the first to systematically introduce an uncertainty-driven, self-supervised filtering mechanism into biological reasoning data generation, establishing a scalable, low-cost data engineering paradigm for resource-constrained scientific AI.

📝 Abstract
Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs), improving generalization by providing step-level supervision. Yet most approaches require ground-truth labels to seed or filter these traces - an expensive bottleneck in domains like biology where wet-lab data are scarce. We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence - quantified through established uncertainty metrics like self-consistency and predictive perplexity - as a substitute for external labels. We sample multiple reasoning traces and retain only low-uncertainty subsets. Applied to biological perturbation prediction, a domain where wet-lab labels are especially costly, we show that the filtered subset has higher accuracy, and that supervised fine-tuning (SFT) on uncertainty-filtered data outperforms unfiltered synthetic data, narrows the gap to ground-truth training, and surpasses strong LRM baselines. Ablations show that per-class filtering corrects for class-specific uncertainty scales and that hybrid uncertainty metrics yield higher-quality datasets. Our results suggest that model-internal confidence is a powerful signal for efficient reasoning dataset creation, enabling LRMs in domains where supervision is expensive.
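The abstract's filtering recipe can be sketched in a few lines: sample several reasoning traces per input, score each input by self-consistency (agreement of sampled final answers) and predictive perplexity, then fuse the two into one uncertainty value. This is a minimal sketch, not the paper's implementation; the function names, the `1 - 1/ppl` normalization, and the fusion weights are illustrative assumptions, and sampling traces from a model is left outside the snippet.

```python
import math
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled traces that agree with the majority final answer."""
    counts = Counter(answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers)

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over generated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def trace_uncertainty(answers, token_logprobs, w_sc=0.5, w_ppl=0.5):
    """Hybrid uncertainty score in [0, 1): low agreement and high
    perplexity both raise it. Weights are illustrative, not from the paper."""
    disagreement = 1.0 - self_consistency(answers)
    # Map perplexity (>= 1 for logprobs <= 0) into [0, 1) before mixing.
    ppl_term = 1.0 - 1.0 / perplexity(token_logprobs)
    return w_sc * disagreement + w_ppl * ppl_term
```

Filtering then amounts to keeping only inputs whose `trace_uncertainty` falls below a threshold, which is the "retain only low-uncertainty subsets" step in the abstract.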
Problem

Research questions and friction points this paper is trying to address.

Creating synthetic reasoning datasets without expensive ground-truth labels
Filtering biological reasoning data using model uncertainty metrics
Improving biological perturbation prediction with uncertainty-filtered supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uncertainty filtering replaces external labels
Retains low-uncertainty reasoning traces
Uses self-consistency and perplexity metrics
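The ablations highlight per-class filtering: a single global uncertainty threshold is biased when classes have different uncertainty scales, so the selection is done within each predicted class. A minimal sketch of that idea, assuming each trace already carries a predicted class and an uncertainty score (the data layout and `keep_frac` parameter are illustrative, not from the paper):

```python
from collections import defaultdict

def filter_per_class(traces, keep_frac=0.3):
    """traces: list of (predicted_class, uncertainty, trace_text) tuples.
    Keeps the lowest-uncertainty keep_frac within each class, so a class
    whose uncertainties run systematically high is not filtered out
    wholesale by a single global threshold."""
    by_class = defaultdict(list)
    for trace in traces:
        by_class[trace[0]].append(trace)
    kept = []
    for cls, items in by_class.items():
        items.sort(key=lambda t: t[1])          # most confident first
        k = max(1, int(len(items) * keep_frac))  # keep at least one per class
        kept.extend(items[:k])
    return kept
```

The retained traces then serve as the SFT dataset, replacing the label-based filtering step that would otherwise require wet-lab ground truth.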