When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs

📅 2026-04-13

🤖 AI Summary

This work addresses the problem that rough, early-stage design sketches often fail to convey the designer’s full intent, because much of that intent stays implicit. The authors introduce TalkSketchD, a sketch-while-speaking dataset that pairs freehand sketches with spontaneous speech temporally aligned to the drawing process, collected during early-stage toaster ideation. Using this dataset, they run a sketch-to-image generation study with multimodal large language models (MLLMs), comparing sketch-only inputs against sketches augmented with concurrent speech transcripts. With a reasoning MLLM acting as judge, images generated from sketch plus speech align significantly better with designers’ self-reported intent across form, function, experience, and overall intent. The results indicate that spontaneous speech helps MLLMs interpret user intent in early-stage design ideation.

📝 Abstract

Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer's intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset's value, we conduct a sketch-to-image generation study comparing sketch-only inputs with sketches augmented by concurrent speech transcripts using multimodal large language models (MLLMs). Generated images are evaluated against designers' self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves judged intent alignment of generated design images across form, function, experience, and overall intent. These findings demonstrate that temporally aligned sketch-and-speech data can enhance MLLMs' ability to interpret user intent in early-stage design ideation.
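As an illustration of what "temporally aligned" sketch and speech data could look like, the following is a minimal sketch in Python. It is not the TalkSketchD schema or the authors' pipeline: the `Stroke` and `Utterance` records, their fields, and the overlap-based `align` rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Stroke:
    """Hypothetical record: one pen stroke with start/end times in seconds."""
    stroke_id: int
    start: float
    end: float


@dataclass
class Utterance:
    """Hypothetical record: one transcribed speech segment with timestamps."""
    text: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(strokes, utterances):
    """Map each stroke id to the utterance texts that overlap it in time.

    Assumed alignment rule: any positive temporal overlap counts as aligned.
    """
    return {
        s.stroke_id: [
            u.text
            for u in utterances
            if overlap(s.start, s.end, u.start, u.end) > 0
        ]
        for s in strokes
    }


# Made-up example session: two strokes and two spoken remarks.
strokes = [Stroke(0, 0.0, 2.0), Stroke(1, 3.0, 5.0)]
utterances = [
    Utterance("a wide slot for bagels", 1.5, 3.5),
    Utterance("lever on the side", 4.0, 6.0),
]
aligned = align(strokes, utterances)
```

Under this assumed rule, a remark that spans two strokes is attached to both, which is one plausible way to give a model the speech that was concurrent with each part of the drawing.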

Problem

Research questions and friction points this paper is trying to address.

intent alignment

multimodal LLMs

sketch

spontaneous speech

design ideation

Innovation

Methods, ideas, or system contributions that make the work stand out.

sketch-and-speech alignment

multimodal LLMs

intent alignment