🤖 AI Summary
This work addresses a limitation of existing proximal policy optimization methods in heterogeneous large language model training: they operate only at a single scale and lack unified control over policy stability across tokens, trajectories, and higher-level structures. To overcome this, we propose FiberPO, a multi-scale policy optimization framework grounded in fiber bundle algebraic structures. FiberPO introduces an Aggregational Policy Censoring Objective and a Fiber Bundle Gating mechanism to decouple and coordinate policy updates across multiple granularities. It establishes, for the first time, an exact unconstrained reformulation of TV-TRPO, revealing the duality between clipping-based objectives and trust-region optimization. Furthermore, it incorporates a composable hierarchical fiber bundle gating architecture that enables trust-region control at arbitrary depths. Experiments demonstrate that FiberPO significantly improves token efficiency while preserving on-policy consistency, and that it scales to a four-level hierarchy (domain, prompt group, trajectory, and token), validating its extensibility.
📝 Abstract
Large language models are increasingly trained as heterogeneous systems spanning multiple domains, expert partitions, and agentic pipelines, yet prevalent proximal objectives operate at a single scale and lack a principled mechanism for coupling token-level, trajectory-level, and higher-level hierarchical stability control. To bridge this gap, we derive the Aggregational Policy Censoring Objective (APC-Obj), the first exact unconstrained reformulation of sample-based TV-TRPO, establishing that clipping-based surrogate design and trust-region optimization are dual formulations of the same problem. Building on this foundation, we develop Fiber Bundle Gating (FBG), an algebraic framework that organizes sampled RL data as a fiber bundle and decomposes ratio gating into a base-level gate on trajectory aggregates and a fiber-level gate on per-token residuals, with provable first-order agreement with the true RL objective near the on-policy point. From APC-Obj and FBG we derive Fibration Policy Optimization (or simply, FiberPO), a concrete objective whose Jacobian is block-diagonal over trajectories, reduces to the identity at on-policy, and yields a better update direction, thereby improving token efficiency. The compositional nature of the framework extends beyond the trajectory-token case: fibrations compose algebraically into a Fibration Gating Hierarchy (FGH) that scales the same gating mechanism to arbitrary hierarchical depth without new primitives, as demonstrated by FiberPO-Domain, a four-level instantiation with independent trust-region budgets at the domain, prompt group, trajectory, and token levels. Together, these results connect trust-region theory, a compositional algebraic structure, and practical multi-scale stability control into a unified framework for LLM policy optimization.
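To make the two-level gating idea concrete, here is a minimal sketch of how a base-level gate on a trajectory aggregate and a fiber-level gate on per-token residuals might look. The function name, the geometric-mean aggregate, and the multiplicative recombination are illustrative assumptions, not the paper's actual definitions of FBG or FiberPO.

```python
import numpy as np

def fiber_bundle_gated_ratios(logp_new, logp_old, eps_base=0.2, eps_fiber=0.2):
    """Illustrative two-level ratio gating for one trajectory.

    logp_new, logp_old: (num_tokens,) per-token log-probabilities under the
    new policy and the behavior policy. Each level gets its own independent
    trust-region budget (eps_base, eps_fiber), mirroring the decoupled
    control the abstract describes. All specifics here are assumptions.
    """
    log_ratio = logp_new - logp_old            # per-token log importance ratios
    base_log = log_ratio.mean()                # trajectory aggregate (geometric mean in ratio space)
    fiber_log = log_ratio - base_log           # per-token residuals; they sum to zero by construction

    # Gate each level within its own budget.
    base_gated = np.clip(np.exp(base_log), 1 - eps_base, 1 + eps_base)
    fiber_gated = np.clip(np.exp(fiber_log), 1 - eps_fiber, 1 + eps_fiber)

    # Recombine. At on-policy (logp_new == logp_old) both gates are inactive
    # and every combined ratio is exactly 1, consistent with the stated
    # identity-at-on-policy property.
    return base_gated * fiber_gated
```

One design point this sketch illustrates: because the aggregate and the residuals are gated separately, a trajectory can be down-weighted as a whole while individual tokens still receive bounded per-token corrections.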