AceTone: Bridging Words and Colors for Conditional Image Grading

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image color grading methods struggle to accommodate diverse creative intents and often exhibit insufficient alignment with human aesthetic preferences. This work proposes a multimodal conditional generative framework for color transformation that, for the first time, unifies text and reference image guidance within a single architecture to stylize images via 3D lookup tables (3D-LUTs). The core innovations include compressing a 3×32³ LUT into 64 discrete tokens using a vector-quantized variational autoencoder (VQ-VAE) and enhancing perceptual aesthetic alignment through a vision-language model coupled with reinforcement learning. Experiments demonstrate that the proposed method achieves state-of-the-art performance in both text- and reference-guided color grading, improving the LPIPS metric by up to 50% and significantly outperforming existing approaches in human evaluations of visual appeal and style consistency.
📝 Abstract
Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $\Delta E < 2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
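For context on the output representation: a 3D-LUT maps each input RGB value to an output RGB value by sampling a lattice of color entries (here 32³ entries × 3 channels) with trilinear interpolation. A minimal sketch of applying such a LUT to an image — a generic illustration of how 3D-LUTs work, not code from the paper:

```python
import numpy as np

def apply_3d_lut(image, lut):
    """Apply a 3D-LUT of shape (N, N, N, 3) to a float RGB image in [0, 1]
    using trilinear interpolation over the enclosing LUT cell."""
    n = lut.shape[0]
    idx = image * (n - 1)            # scale RGB into LUT index space
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = idx - lo

    out = np.zeros_like(image)
    # Weighted sum over the 8 corners of the cell containing each pixel.
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                r = hi[..., 0] if dr else lo[..., 0]
                g = hi[..., 1] if dg else lo[..., 1]
                b = hi[..., 2] if db else lo[..., 2]
                w = ((frac[..., 0] if dr else 1 - frac[..., 0])
                     * (frac[..., 1] if dg else 1 - frac[..., 1])
                     * (frac[..., 2] if db else 1 - frac[..., 2]))
                out += w[..., None] * lut[r, g, b]
    return out

# Identity LUT: each entry stores its own normalized RGB coordinate,
# so applying it should leave the image unchanged.
n = 32
axis = np.linspace(0.0, 1.0, n)
identity_lut = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

img = np.random.rand(4, 4, 3)
graded = apply_3d_lut(img, identity_lut)
```

A learned grading is just a non-identity lattice: shifting the LUT entries warps the whole color space of the image at once, which is why a 3×32³ table is such a compact grading target.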
Problem

Research questions and friction points this paper is trying to address.

color grading
creative intent
aesthetic preference
generalization
image style
Innovation

Methods, ideas, or system contributions that make the work stand out.

conditional color grading
3D-LUT generation
VQ-VAE tokenizer
vision-language model
reinforcement learning for aesthetics
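On the VQ-VAE tokenizer: vector quantization snaps each encoder latent to its nearest entry in a learned codebook, and the entry indices become the discrete tokens the vision-language model predicts. A minimal sketch of that quantization step, with all sizes except the 64-token count chosen for illustration rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64 tokens per LUT (as in the paper), with an
# illustrative codebook of 512 entries and 16-dimensional latents.
num_tokens, codebook_size, dim = 64, 512, 16

codebook = rng.normal(size=(codebook_size, dim))  # learned embeddings e_k
latents = rng.normal(size=(num_tokens, dim))      # encoder output z_e(x)

# Vector quantization: snap each latent to its nearest codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)    # 64 discrete token ids
quantized = codebook[tokens]     # z_q(x), decoded back into a 3x32^3 LUT
```

Because the LUT is now a short sequence of discrete ids, a vision-language model can predict it autoregressively like text, which is what makes the unified text/reference conditioning possible.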
Tianren Ma
University of Chinese Academy of Sciences
Mingxiang Liao
ByteDance
Xijin Zhang
ByteDance
Qixiang Ye
University of Chinese Academy of Sciences, University of Maryland
Visual Object Detection · Image Processing