RoadTones: Tone Controllable Text Generation from Road Event Videos

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video-language models can describe road events factually but offer little control over the tone, urgency, or style of the generated captions, which limits their use in communication-critical settings. This work presents a dataset-model-evaluation suite for tone-controllable captioning of road event videos. The authors introduce RoadTones-51K, a dataset of 51K samples annotated with multidimensional tone labels and multi-tone captions, built with a human-validated data generation pipeline. They develop RoadTones-VL-CoT, a tone-conditioned video-to-text model that also generates Chain-of-Thought intermediate drafts for interpretability, and RoadTones-Eval, an evaluation suite that jointly measures factual consistency and tone adherence. A user study validates the approach on caption quality, tone control, and factual consistency.

📝 Abstract

Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings, where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To address this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conduct a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive, tone-controllable video captioning.
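
As a rough illustration of the two ideas in the abstract (a tone-conditioned request that asks for an intermediate draft before the final caption, and a score that considers factual consistency and tone adherence together), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ToneSpec` fields, the prompt wording in `build_prompt`, and the use of a harmonic mean in `joint_score` are hypothetical and are not taken from RoadTones-VL-CoT or RoadTones-Eval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToneSpec:
    """Hypothetical tone request; field names are illustrative, not from the paper."""
    tone: str      # e.g. "calm"
    urgency: str   # e.g. "high"
    style: str     # e.g. "concise"


def build_prompt(spec: ToneSpec) -> str:
    """Compose a two-step instruction: a tone-conditioned draft, then the final caption."""
    return (
        "Describe the road event in the video. "
        f"Tone: {spec.tone}. Urgency: {spec.urgency}. Style: {spec.style}.\n"
        "Step 1: write a short draft noting how the tone shapes the wording.\n"
        "Step 2: write the final caption, keeping every fact from the video."
    )


def joint_score(factual: float, tone: float) -> float:
    """Harmonic mean of two scores in [0, 1] (an assumed combination rule).

    The result is low whenever either input is low, so a caption cannot
    score well by matching the tone while getting the facts wrong.
    """
    if not (0.0 <= factual <= 1.0 and 0.0 <= tone <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    if factual + tone == 0.0:
        return 0.0
    return 2.0 * factual * tone / (factual + tone)
```

The harmonic mean is chosen here only because it penalizes an imbalance between the two scores; the paper's actual protocol may combine or report them differently.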

Problem

Research questions and friction points this paper is trying to address.

tone control

video captioning

road events

controllable text generation

factual consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

tone-controllable captioning

video-to-text generation

Chain-of-Thought reasoning