🤖 AI Summary
This work addresses the inefficiency of large language models on clinical reasoning tasks, such as differential diagnosis, that are inherently parallel but are constrained by sequential autoregressive decoding. To overcome this limitation, the authors model complex medical reasoning as a directed acyclic graph (DAG) grounded in Petri net theory and introduce a fully parallelized inference framework. This framework incorporates topology-aware attention, adaptive positional encoding, and a zero-overhead parallel decoding engine. Experimental results show that the approach improves general-purpose large language models by up to 8.9%, achieving accuracy comparable to specialized medical LLMs while reducing inference latency by 1.3× and increasing generation throughput by 1.7×.
📝 Abstract
Large language models (LLMs) have demonstrated strong performance and rapid progress across a wide range of medical reasoning tasks. However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability on complex medical problems. To address this, we propose MedVerse, a framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri net theory. The framework adopts a full-stack design spanning data, model architecture, and system execution. For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge-grounded medical reasoning paths and transforms them into Petri net-structured representations. At the architectural level, we propose a topology-aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency. At the system level, we develop a customized inference engine that supports parallel execution without additional overhead. Empirical evaluations show that MedVerse improves strong general-purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance while delivering a 1.3x reduction in inference latency and a 1.7x increase in generation throughput, enabled by its parallel decoding capability. Code is available at https://github.com/aiming-lab/MedVerse.
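The core idea of topology-aware attention with adaptive position indices can be sketched in miniature. The following is an illustrative sketch, not the authors' implementation: given a DAG of reasoning steps, each node attends only to itself and its ancestors, so parallel branches (e.g. competing differential hypotheses) stay logically independent, and each node's position index is its DAG depth rather than a linear token index. The toy graph and helper names here are hypothetical.

```python
from collections import deque

def dag_attention_mask(num_nodes, edges):
    """Return (mask, positions): mask[i][j] is True iff node i may attend
    to node j (j is i itself or an ancestor of i); positions[i] is the
    DAG depth of node i, serving as an adaptive position index."""
    parents = {i: [] for i in range(num_nodes)}
    children = {i: [] for i in range(num_nodes)}
    for u, v in edges:  # edge u -> v means v depends on u
        parents[v].append(u)
        children[u].append(v)

    # Propagate ancestor sets and depths in topological order (Kahn's algorithm).
    indegree = {i: len(parents[i]) for i in range(num_nodes)}
    ancestors = {i: set() for i in range(num_nodes)}
    depth = {i: 0 for i in range(num_nodes)}
    queue = deque(i for i in range(num_nodes) if indegree[i] == 0)
    while queue:
        u = queue.popleft()
        for v in children[u]:
            ancestors[v] |= ancestors[u] | {u}
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    mask = [[j == i or j in ancestors[i] for j in range(num_nodes)]
            for i in range(num_nodes)]
    positions = [depth[i] for i in range(num_nodes)]
    return mask, positions

# Toy diagnosis DAG: node 0 = patient findings; nodes 1 and 2 = two parallel
# differential hypotheses; node 3 = final synthesis depending on both branches.
mask, pos = dag_attention_mask(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
```

Because nodes 1 and 2 share the same depth and cannot attend to each other, a decoding engine could generate both branches in the same step; the synthesis node then attends to the full graph, which is the property that enables parallel decoding without breaking logical dependencies.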