AI Summary
Existing LiDAR generation methods support only single-frame synthesis, while prediction approaches rely on multi-frame historical inputs and produce deterministic, one-shot outputs; both fail to enable long-horizon, interactive scene generation. This work proposes the first autoregressive LiDAR generation framework for autonomous driving that supports long-horizon, interactive 4D point cloud synthesis starting from a single input frame. The method progressively generates high-fidelity 4D point cloud sequences frame by frame. Key innovations include: (1) inter-frame conditional autoregressive modeling; (2) bounding-box-guided conditional injection; (3) a scene-decoupled estimation module for object-level controllable interaction; and (4) a noise modulation strategy to suppress long-range error accumulation. We establish the first long-horizon LiDAR generation evaluation protocol on nuScenes. Experiments demonstrate significant improvements over state-of-the-art generation and prediction models in both distant-frame quality and geometric detail fidelity.
Abstract
Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data support only single-frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen takes a single-frame LiDAR input as a starting point and effectively utilizes bounding box information as a condition to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model's interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate that LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on later frames.