NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows

📅 2025-08-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion-based action decoders for vision-language-action (VLA) models incur high inference latency because they denoise iteratively, which hinders real-time, high-frequency robotic control. NinA replaces the diffusion action decoder with a Normalizing Flow, an invertible transformation that generates an action in a single sampling step rather than through repeated denoising. Integrated into the FLOWER VLA architecture and fine-tuned on the LIBERO benchmark, NinA matches the performance of its diffusion-based counterpart under the same training regime while achieving substantially faster inference, which suggests it is a practical option for real-time robotic control.

📝 Abstract

Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
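To make "one-shot sampling through an invertible transformation" concrete, here is a minimal NumPy sketch of a conditional normalizing flow built from affine coupling layers. It is an illustration under stated assumptions, not the NinA architecture: the abstract does not specify the flow design, so the coupling layers, the class names `ConditionalAffineCoupling` and `ActionFlow`, the 7-dimensional action, the 16-dimensional context vector standing in for VLM features, and the randomly initialized weights are all hypothetical.

```python
import numpy as np


class ConditionalAffineCoupling:
    """One affine coupling layer conditioned on a context vector.

    Hypothetical illustration: the context stands in for VLM features.
    """

    def __init__(self, action_dim, context_dim, hidden_dim, flip, rng):
        self.flip = flip                       # alternate which half is transformed
        self.d_keep = action_dim // 2          # dimensions passed through unchanged
        d_change = action_dim - self.d_keep    # dimensions that get scaled and shifted
        in_dim = self.d_keep + context_dim
        # Small random MLP; a real model would learn these weights.
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, 2 * d_change))
        self.b2 = np.zeros(2 * d_change)

    def _scale_shift(self, kept, context):
        h = np.tanh(np.concatenate([kept, context], axis=-1) @ self.w1 + self.b1)
        log_scale, shift = np.split(h @ self.w2 + self.b2, 2, axis=-1)
        return np.tanh(log_scale), shift       # bounded log-scale for stability

    def _split(self, x):
        if self.flip:
            x = x[..., ::-1]
        return x[..., :self.d_keep], x[..., self.d_keep:]

    def _merge(self, kept, changed):
        x = np.concatenate([kept, changed], axis=-1)
        return x[..., ::-1] if self.flip else x

    def forward(self, z, context):
        """Noise -> action direction. Returns output and log|det Jacobian|."""
        kept, changed = self._split(z)
        log_s, t = self._scale_shift(kept, context)
        return self._merge(kept, changed * np.exp(log_s) + t), log_s.sum(axis=-1)

    def inverse(self, a, context):
        """Action -> noise direction. Returns output and log|det Jacobian|."""
        kept, changed = self._split(a)
        log_s, t = self._scale_shift(kept, context)
        return self._merge(kept, (changed - t) * np.exp(-log_s)), -log_s.sum(axis=-1)


class ActionFlow:
    """Stack of coupling layers mapping Gaussian noise to actions."""

    def __init__(self, action_dim=7, context_dim=16, hidden_dim=32,
                 num_layers=4, seed=0):
        rng = np.random.default_rng(seed)
        self.action_dim = action_dim
        self.layers = [
            ConditionalAffineCoupling(action_dim, context_dim, hidden_dim,
                                      flip=(i % 2 == 1), rng=rng)
            for i in range(num_layers)
        ]

    def sample(self, context, rng):
        """One pass through the flow: no iterative denoising loop."""
        z = rng.standard_normal((context.shape[0], self.action_dim))
        a = z
        for layer in self.layers:
            a, _ = layer.forward(a, context)
        return a, z

    def log_prob(self, action, context):
        """Exact log-likelihood of an action via the inverse pass."""
        z, log_det = action, 0.0
        for layer in reversed(self.layers):
            z, ld = layer.inverse(z, context)
            log_det = log_det + ld
        base = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * self.action_dim * np.log(2 * np.pi)
        return base + log_det, z
```

Calling `ActionFlow().sample(context, rng)` draws one noise vector per context row and pushes it through the layers once, whereas a diffusion decoder would apply its network repeatedly over several denoising steps. The same layers run in reverse give an exact log-likelihood, which is the usual training signal for a normalizing flow (maximum likelihood on demonstrated actions).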

Problem

Research questions and friction points this paper is trying to address.

Replacing diffusion models for faster VLA action decoding

Enabling one-shot sampling to reduce inference time

Achieving efficient high-frequency control without performance loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces diffusion decoder with Normalizing Flow

Enables one-shot sampling through invertible transformation

Achieves faster inference without performance compromise
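
As background for the points above, the standard change-of-variables identity for a conditional normalizing flow can be written as follows. The notation is assumed for illustration and does not come from the paper: $a$ is an action, $c$ is the conditioning representation from the VLM, $z$ is base noise, and $f_\theta$ is the invertible map.

```latex
% Sampling: a single evaluation of the invertible map
z \sim \mathcal{N}(0, I), \qquad a = f_\theta(z;\, c)

% Exact likelihood via the inverse map and its Jacobian
\log p_\theta(a \mid c)
  = \log \mathcal{N}\!\left(f_\theta^{-1}(a;\, c);\, 0, I\right)
  + \log \left| \det \frac{\partial f_\theta^{-1}(a;\, c)}{\partial a} \right|
```

The first line is why sampling needs no iterative refinement; the second is why such a decoder can be trained by directly maximizing the likelihood of demonstrated actions.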

🔎 Similar Papers

On the Universality of Volume-Preserving and Coupling-Based Normalizing Flows