🤖 AI Summary
This work investigates decentralized training of autoregressive generative models while preserving performance. To this end, it introduces the Decentralized Discrete Flow Matching objective, which expresses the probability-generating velocity as a linear combination of expert flows, and formally establishes the first decentralized autoregressive generation framework. Experiments on multimodal language models compare two distinct training paradigms: LLaVA, which uses a fixed CLIP vision encoder, and InternVL 2.5-1B, which performs full-parameter fine-tuning (encompassing the ViT, MLP, and LLM components). Both demonstrate that the proposed approach achieves performance comparable to centralized training across multiple benchmarks. These results validate the equivalence of decentralized and centralized training in multimodal settings and provide both theoretical grounding and practical insights for efficient distributed autoregressive generation.
📝 Abstract
We present a theoretical analysis of the decentralization of autoregressive generation. We define the Decentralized Discrete Flow Matching objective by expressing the probability-generating velocity as a linear combination of expert flows. We also conduct experiments demonstrating the equivalence between decentralized and centralized training settings for multimodal language models across a diverse set of benchmarks. Specifically, we compare two distinct paradigms: LLaVA, which uses a fixed CLIP vision encoder, and InternVL 2.5-1B, which performs full-parameter fine-tuning (ViT+MLP+LLM) during the instruction tuning stage.
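The core construction above — a generative velocity formed as a linear combination of expert flows — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function `combined_velocity`, the mixture weights, and the toy vocabulary are all assumptions introduced here to show the shape of the linear combination.

```python
import numpy as np

def combined_velocity(expert_velocities, weights):
    """Hypothetical sketch: mix K per-expert probability velocities.

    expert_velocities: shape (K, V) -- K experts over a vocabulary of size V.
    weights: shape (K,), nonnegative mixture weights summing to 1.
    Returns the decentralized velocity as the weighted sum of expert flows.
    """
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0), \
        "mixture weights must be a convex combination"
    # Each row sums to ~0 (probability mass is conserved by a flow),
    # so the convex combination also conserves mass.
    return weights @ np.asarray(expert_velocities, dtype=float)

# Two toy experts over a 3-token vocabulary (values chosen for illustration).
u1 = np.array([0.2, -0.1, -0.1])
u2 = np.array([-0.3, 0.2, 0.1])
u = combined_velocity([u1, u2], [0.5, 0.5])
```

Because each expert velocity sums to zero, any convex combination also sums to zero, which is the basic property that lets decentralized experts jointly define a valid generative flow.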