UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation models rely heavily on large-scale labeled or unlabeled data, often raw source code, whose acquisition is costly and resource-intensive. Method: The paper proposes IPC (Internal Probing of LLMs for Code), the first fully unsupervised code generation framework that requires no external data. Its core idea is an internal-state probing paradigm that models the problem space, interprets test specifications, explores the solution space, and consolidates knowledge, all by uncovering and leveraging implicit correctness and quality signals embedded in the latent states of large language models (LLMs). The method integrates latent-state probing, self-consistency filtering, representation-based quality estimation, and unsupervised knowledge distillation to train UCoder. Contribution/Results: UCoder achieves performance competitive with supervised baselines across multiple benchmarks while drastically reducing dependence on labeled data and computational resources, and it is the first work to empirically validate the feasibility of purely internal-knowledge-driven code generation.

📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem-space probing, test-understanding probing, solution-space probing, and knowledge consolidation and reinforcement to surface the internal knowledge and confidence patterns that exist in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (a coder trained with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation, opening new directions for training code LLMs in resource-constrained scenarios.
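The self-consistency mechanism described in the abstract can be sketched in a few lines. This is an illustrative assumption of how such filtering typically works, not the paper's code: sample several candidate programs, run each on shared probe inputs, and keep the candidates whose observable behavior matches the majority.

```python
# Hedged sketch of self-consistency filtering (all names are illustrative,
# not from the paper): candidates that agree with the majority behavior
# on probe inputs are treated as more likely correct.
from collections import Counter

def self_consistency_filter(candidates, probe_inputs):
    """candidates: list of callables; probe_inputs: shared test inputs.
    Returns the candidates whose output signature matches the majority."""
    signatures = []
    for fn in candidates:
        outputs = []
        for x in probe_inputs:
            try:
                outputs.append(fn(x))
            except Exception:
                outputs.append(None)  # crashing candidates get a distinct signature
        signatures.append(tuple(outputs))
    majority_sig, _ = Counter(signatures).most_common(1)[0]
    return [fn for fn, sig in zip(candidates, signatures) if sig == majority_sig]

# Example: three sampled "solutions" to squaring; two agree and are kept.
cands = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
kept = self_consistency_filter(cands, [2, 3, 4])
```

In this toy run, the first two candidates produce identical outputs on the probe inputs and survive the filter, while the buggy third one is discarded.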
Problem

Research questions and friction points this paper is trying to address.

Unsupervised code generation without external data
Reducing reliance on labeled datasets and resources
Leveraging internal model states for quality estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised code generation via internal LLM probing
Self-consistency and representation-based quality estimation
Reduces dependency on labeled data and resources
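The representation-based quality estimation idea above can be sketched as a linear probe over hidden-state vectors. The data here is synthetic and every name is an illustrative assumption (the paper's implementation is not shown); real inputs would be LLM latent states for generated code.

```python
# Hedged sketch of a quality probe: logistic regression over hidden-state
# vectors, standing in for "representation-based quality estimation".
# Synthetic clusters replace real LLM states, so this is a toy demo only.
import numpy as np

def train_probe(states, labels, lr=0.5, steps=500):
    """Train a logistic probe: states (n, d) floats, labels (n,) in {0, 1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=states.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(states @ w + b)))  # predicted quality
        grad = p - labels                            # gradient of log-loss
        w -= lr * states.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def quality_score(state, w, b):
    """Score in (0, 1): higher means the probe predicts better code."""
    return float(1.0 / (1.0 + np.exp(-(state @ w + b))))

# Synthetic demo: "correct-code" states cluster near +1, "buggy" near -1.
rng = np.random.default_rng(1)
good = rng.normal(+1.0, 0.3, size=(50, 8))
bad = rng.normal(-1.0, 0.3, size=(50, 8))
X = np.vstack([good, bad])
y = np.array([1] * 50 + [0] * 50)
w, b = train_probe(X, y)
```

After training, states near the "correct" cluster score close to 1 and states near the "buggy" cluster close to 0, which is the signal such a probe would contribute when ranking code candidates.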
👥 Authors
Jiajun Wu (Beihang University)
Jian Yang (Beihang University)
Wei Zhang (Beihang University)
Lin Jing (Beihang University)
Yuqing Ma (Huawei)
Ensheng Shi (Huawei)
Yuchi Ma (Huawei)
Zhoujun Li (Beihang University)
Xianglong Liu (Beihang University)