Kelix Technical Report

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models that use discrete visual tokens suffer information loss from limited codebook capacity, leaving their comprehension significantly weaker than that of models built on continuous features. To address this limitation, this work proposes Kelix, the first fully discrete autoregressive unified multimodal model. By refining the visual discretization scheme and improving codebook design within a unified cross-modal sequence-training framework, Kelix enables efficient modeling while preserving strong generative capability. The method substantially improves visual understanding, effectively closing the gap between discrete and continuous representations and reaching parity with state-of-the-art continuous-feature vision-language models.
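As a rough illustration of the information loss described above, the sketch below shows the core step of a discrete visual tokenizer: each continuous feature vector is snapped to its nearest codebook entry, and a small codebook reconstructs the features less faithfully than a large one. All names, sizes, and data here are illustrative, not Kelix's actual design.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest
    codebook entry (squared Euclidean distance); return the quantized
    (reconstructed) features and the discrete token indices."""
    # Pairwise squared distances between features (N, D) and codes (K, D)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d.argmin(axis=1)        # one discrete token per feature
    return codebook[tokens], tokens

rng = np.random.default_rng(0)
features = rng.normal(size=(256, 8))   # stand-in for continuous ViT features

# A small codebook has less capacity than a large one.
small = rng.normal(size=(16, 8))       # 16 codes
large = rng.normal(size=(1024, 8))     # 1024 codes

for name, cb in [("small", small), ("large", large)]:
    recon, _ = quantize(features, cb)
    mse = ((features - recon) ** 2).mean()
    print(f"{name} codebook ({len(cb)} codes): reconstruction MSE = {mse:.3f}")
```

The reconstruction error of the small codebook is visibly higher, which is the capacity bottleneck the report says Kelix mitigates through better discretization and codebook design.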

📝 Abstract
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
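The shared discrete interface the abstract describes can be sketched minimally: image tokens from a visual codebook are offset into the same vocabulary as text tokens, so one next-token objective covers an interleaved text-and-image sequence. The vocabulary sizes and token values below are assumptions for illustration, not Kelix's actual configuration.

```python
TEXT_VOCAB = 50_000   # e.g. a BPE text vocabulary (assumed size)
IMAGE_CODES = 8_192   # e.g. a visual tokenizer codebook (assumed size)

def image_token(code: int) -> int:
    """Offset a visual codebook index past the text vocabulary so text
    and image tokens live in one unified token space."""
    assert 0 <= code < IMAGE_CODES
    return TEXT_VOCAB + code

def next_token_pairs(sequence):
    """Next-token prediction targets: position t predicts token t+1,
    regardless of whether either token is text or image."""
    return list(zip(sequence[:-1], sequence[1:]))

# Interleaved example: a few text tokens followed by discrete image tokens.
seq = [101, 2023, 2003] + [image_token(c) for c in (7, 4091, 12)]
pairs = next_token_pairs(seq)
```

Because every position is a discrete index in one vocabulary, a single cross-entropy loss trains comprehension and generation over both modalities, which is the unification the hybrid text-plus-continuous-ViT interface cannot offer.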
Problem

Research questions and friction points this paper is trying to address.

discrete visual tokens
multimodal understanding
autoregressive models
vision-language models
representation gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete visual tokenization
autoregressive multimodal modeling
unified understanding and generation
vision-language models
self-supervised learning