🤖 AI Summary
Conventional language models, trained top-down on general corpora, lack deep domain-specific abstraction and reasoning capabilities—in medicine, this shows up as poor generalization across subdomains. Method: We propose a knowledge graph (KG)-driven bottom-up paradigm for building medical-domain superintelligence: (1) constructing a structured medical KG from ICD and UMLS; (2) designing a path-guided task generation pipeline that automatically synthesizes 24,000 hierarchical reasoning tasks with corresponding chain-of-thought annotations; and (3) introducing a KG-supported curriculum learning framework for progressive training—from foundational concepts to complex multi-step reasoning. Contribution/Results: Building on QwQ-32B, we develop QwQ-Med-3, which achieves significant gains on ICD-Bench, our newly constructed multi-domain medical reasoning benchmark (+12.7% on the most challenging tasks), outperforming all existing models. It also demonstrates strong transferability across multiple medical QA benchmarks.
📝 Abstract
Language models traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring the abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune language models on the resulting KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3, which takes a step toward medical superintelligence. We also introduce ICD-Bench, an evaluation suite that quantifies reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models across ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes the acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answering benchmarks shows that QwQ-Med-3 transfers its acquired expertise to enhance the base model's performance. While the industry's approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.
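To make the path-guided idea concrete, here is a minimal sketch of turning KG paths into reasoning tasks. It is not the paper's actual pipeline: the toy triples, function names, and question template are all illustrative assumptions (the real system builds its KG from ICD and UMLS and generates full chain-of-thought annotations), but it shows the core mechanism—sampling a multi-hop chain of head-relation-tail primitives and using it as both the question's endpoints and the reference thinking trace.

```python
import random

# Hypothetical miniature medical KG: adjacency from head entity to a list of
# (relation, tail) edges. Each edge is one domain "primitive"; real pipelines
# would derive these from ICD/UMLS rather than hand-written placeholders.
KG = {
    "aspirin": [("inhibits", "COX-1")],
    "COX-1": [("produces", "thromboxane A2")],
    "thromboxane A2": [("promotes", "platelet aggregation")],
}

def sample_path(kg, start, hops, rng=random):
    """Sample a multi-hop path of (head, relation, tail) triples from the KG."""
    path, node = [], start
    for _ in range(hops):
        edges = kg.get(node)
        if not edges:
            break  # dead end: stop with a shorter path
        relation, tail = rng.choice(edges)
        path.append((node, relation, tail))
        node = tail
    return path

def path_to_task(path):
    """Turn a KG path into a QA-style task: the question links the path's
    endpoints, and the chained edges serve as the reference thinking trace.
    Path length acts as a natural difficulty knob for curriculum ordering."""
    head, tail = path[0][0], path[-1][2]
    question = f"How does {head} relate to {tail}?"
    trace = " -> ".join(f"{h} {r} {t}" for h, r, t in path)
    return {"question": question, "thinking_trace": trace, "hops": len(path)}

path = sample_path(KG, "aspirin", hops=3)
task = path_to_task(path)
print(task["question"])
print(task["thinking_trace"])
```

Longer sampled paths yield harder multi-step tasks, which is what lets a curriculum progress from single-edge primitives to complex compositional reasoning.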