🤖 AI Summary
This work addresses the limitations of prevailing multimodal systems, which are predominantly language-centric and treat non-linguistic modalities as secondary, resulting in architectural fragmentation and insufficient modality fusion. To overcome these issues, the authors propose the Discrete Native Autoregressive (DiNA) framework, which unifies text, vision, and audio in a shared discrete space and enables native multimodal modeling under a single autoregressive objective. Key innovations include the Discrete Native Any-resolution Visual Transformer (dNaViT), which supports arbitrary input resolutions and breaks through the performance ceiling of discrete visual understanding, together with a unified multimodal tokenizer and a native multimodal Transformer architecture that seamlessly integrate understanding and generation tasks. Experiments demonstrate DiNA's strong performance across multiple multimodal benchmarks, achieving unified capabilities in image understanding, generation, and dialogue, with models and tokenizers released to foster community advancement.
📝 Abstract
The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach that effectively reconciles the conflict between understanding and generation. As a step toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
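To make the central idea concrete, here is a minimal sketch of what "a shared discrete space with a single autoregressive objective" can look like. This is a hypothetical illustration, not the LongCat-Next implementation: the vocabulary sizes, ID offsets, and helper functions below are illustrative assumptions.

```python
# Hypothetical sketch (not the actual LongCat-Next code): text, image, and
# audio tokens occupy disjoint ID ranges in ONE shared vocabulary, so a single
# next-token prediction (NTP) objective applies uniformly across modalities.
# All sizes below are made-up for illustration.

TEXT_VOCAB, IMAGE_VOCAB, AUDIO_VOCAB = 32000, 8192, 4096

# Disjoint offsets give each modality its own slice of the shared ID space.
TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
SHARED_VOCAB = AUDIO_OFFSET + AUDIO_VOCAB

def to_shared(ids, offset):
    """Map modality-local token IDs into the shared discrete space."""
    return [offset + i for i in ids]

def ntp_pairs(sequence):
    """One autoregressive objective: predict token t from the prefix 0..t-1,
    regardless of which modality each token came from."""
    return [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]

# Interleave a caption, its image tokens, and an audio token as one sequence.
seq = (to_shared([5, 17, 3], TEXT_OFFSET)
       + to_shared([0, 1, 2], IMAGE_OFFSET)
       + to_shared([7], AUDIO_OFFSET))
pairs = ntp_pairs(seq)  # (prefix, target) training pairs for a single NTP loss
```

Under this framing, a cross-modality boundary (e.g. text followed by image tokens) is just another next-token transition, which is why one Transformer and one loss can cover understanding and generation alike.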