Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

📅 2025-11-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM serving frameworks produce non-deterministic inference outputs across varying tensor parallelism (TP) degrees. This is particularly problematic in reinforcement learning (RL), where precision mismatches between TP=1 training and multi-GPU TP inference cause training instability or collapse. Method: The paper proposes Tree-Based Invariant Kernels (TBIK), which enforce bit-level deterministic inference by unifying the computation order across TP sizes via a hierarchical binary-tree reduction mechanism. TP-invariant matrix multiplication and reduction operators are implemented in Triton, and TBIK is integrated into vLLM and FSDP to ensure training-inference consistency under greedy decoding. Contribution/Results: The approach eliminates output divergence across all tested TP configurations, achieving zero bit-wise deviation, and experiments demonstrate substantially improved RL training stability and reproducibility. TBIK establishes critical infrastructure for rigorous LLM evaluation, multi-agent systems, and RL-based LLM alignment, enabling deterministic, scalable, and trustworthy inference.

📝 Abstract
Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and reinforcement learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related non-determinism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (FSDP, i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize inference throughput, creating a natural mismatch between the two. This precision mismatch can lead to suboptimal performance or even collapse in RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. We also achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategies. Code is available at https://github.com/nanomaoli/llm_reproducibility.
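The non-associativity the abstract refers to is easy to reproduce. A minimal standalone Python sketch (not from the paper) shows that regrouping the same three additions changes how intermediate results round, which is exactly why different reduction orders across GPUs can yield different bits:

```python
# Floating-point addition is not associative: regrouping the same
# operands changes how the intermediate results are rounded.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.1 + 0.2 rounds to 0.30000000000000004
right = a + (b + c)  # 0.2 + 0.3 happens to round exactly to 0.5

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: same inputs, different reduction order, different bits
```

A multi-GPU all-reduce is just a large version of this: each TP size groups the partial sums differently, so without a fixed reduction order the final logits differ at the bit level.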
Problem

Research questions and friction points this paper is trying to address.

Eliminates non-deterministic inference across different tensor parallel configurations
Solves training-inference mismatch in RL pipelines with varying parallel strategies
Addresses floating-point arithmetic inconsistencies causing output variations across GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based invariant kernels ensure deterministic tensor parallel inference
Hierarchical binary tree structure aligns GPU reduction orders
Bit-wise identical results across different parallel strategy configurations
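The bullets above can be illustrated with a small Python sketch. This is a hypothetical simulation of the idea, not the paper's Triton kernels: every "rank" reduces its contiguous shard with the same fixed balanced binary tree, and because shard boundaries coincide with subtree boundaries of the full tree (for power-of-two sizes), the overall sequence of additions is identical for any shard count, so the results match bit for bit.

```python
def tree_sum(vals):
    """Reduce a list with a fixed balanced binary tree: add adjacent
    pairs level by level until a single value remains."""
    vals = list(vals)
    assert len(vals) & (len(vals) - 1) == 0, "power-of-two length assumed"
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

def sharded_tree_sum(xs, num_shards):
    """Simulate a tensor-parallel reduction: split xs into contiguous
    shards (one per 'GPU'), tree-reduce each shard locally, then
    tree-reduce the partial sums. Shard boundaries align with subtree
    boundaries of the full tree, so the addition order (and hence every
    rounding step) is the same for every power-of-two shard count."""
    per = len(xs) // num_shards
    partials = [tree_sum(xs[i * per:(i + 1) * per]) for i in range(num_shards)]
    return tree_sum(partials)

# Demo: the same vector reduced as if on 1, 2, 4, or 8 GPUs.
xs = [0.1 * (i + 1) for i in range(16)]
results = [sharded_tree_sum(xs, p) for p in (1, 2, 4, 8)]
print(all(r == results[0] for r in results))  # True: bit-wise identical across "TP sizes"
```

A naive left-to-right sum per shard would not have this property: changing the shard count would regroup the additions and change the rounding, which is the TP-induced inconsistency the paper targets.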