Fibration Policy Optimization

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing proximal policy optimization methods in heterogeneous large language model training: they operate at only a single scale and lack unified control over policy stability across tokens, trajectories, and higher-level structures. To overcome this, the paper proposes FiberPO, a multi-scale policy optimization framework grounded in fiber bundle algebraic structures. FiberPO introduces an Aggregational Policy Censoring Objective and a Fiber Bundle Gating mechanism to decouple and coordinate policy updates across multiple granularities. It establishes, for the first time, an exact unconstrained reconstruction of TV-TRPO, revealing the duality between clipping-based objectives and trust-region optimization. It further incorporates a composable hierarchical fiber bundle gating architecture that enables trust-region control at arbitrary depth. Experiments demonstrate that FiberPO significantly improves token efficiency while preserving on-policy consistency, and that it scales to a four-level hierarchy (domain, prompt group, trajectory, and token), validating its extensibility.

📝 Abstract
Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near on-policy. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories and reduces to the identity at on-policy, and which yields a better update direction, thereby improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.
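The abstract's core mechanism, decomposing importance-ratio gating into a trajectory-level (base) gate on an aggregate and per-token (fiber) gates on residuals, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual objective: the function name, the geometric-mean aggregate, and the PPO-style clip at each level are all assumptions made for the sketch.

```python
# Illustrative two-level ratio gating in the spirit of Fiber Bundle Gating.
# All names and the exact decomposition are assumptions, not the paper's code.
import numpy as np

def two_level_gate(logp_new, logp_old, eps_base=0.2, eps_fiber=0.2):
    """Split per-token importance ratios into a trajectory-level (base)
    factor and per-token (fiber) residuals, clipping each independently."""
    log_ratio = logp_new - logp_old     # per-token log importance ratios
    base_log = log_ratio.mean()         # trajectory aggregate (geometric mean)
    fiber_log = log_ratio - base_log    # per-token residuals, mean zero
    base = np.clip(np.exp(base_log), 1.0 - eps_base, 1.0 + eps_base)
    fiber = np.clip(np.exp(fiber_log), 1.0 - eps_fiber, 1.0 + eps_fiber)
    return base * fiber                 # gated per-token ratios

# At on-policy (identical log-probs), every gated ratio is exactly 1,
# matching the abstract's "reduces to the identity at on-policy" property.
lp = np.log(np.array([0.5, 0.2, 0.3]))
print(two_level_gate(lp, lp))           # -> [1. 1. 1.]
```

The multiplicative composition is what would let the same construction stack to deeper hierarchies (e.g. domain and prompt-group gates above the trajectory level, each with its own clipping budget), as in the FGH the abstract describes.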
Problem

Research questions and friction points this paper is trying to address.

multi-scale policy optimization
hierarchical stability control
large language models
trust-region optimization
heterogeneous systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fiber Bundle Gating
Aggregational Policy Censoring Objective
Fibration Policy Optimization
Trust-Region Optimization
Hierarchical Reinforcement Learning