AI Summary
Existing AR/VR air-drawing tools rely on expensive hardware and external markers, which limits their accessibility, and they demand considerable manual sketching proficiency. This paper proposes the first markerless, head-mounted-display-free method for generating hand-drawn sketches in mid-air. Our approach leverages self-supervised augmentation during training to enable a controllable image diffusion model to robustly extract motion semantics from highly noisy hand-trajectory sequences, directly synthesizing structurally accurate, clean-lined, and stylistically refined sketches. The method requires no physical markers or specialized equipment; a standard RGB camera suffices for hand motion tracking. We introduce two air-drawing datasets (AirSketch-Real/Syn), comprising real-world and synthetic data respectively, to support training and evaluation. Experiments demonstrate strong cross-domain generalization, significantly lowering the barrier to air drawing and establishing a lightweight paradigm for AR/VR content creation.
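To make the capture requirement concrete, the minimal sketch below (not from the paper) shows how a raw mid-air trajectory could be collected with nothing but a webcam. MediaPipe Hands and the index-fingertip landmark are assumptions on my part; the paper only specifies a standard RGB camera.

```python
# Hypothetical capture loop: track the index fingertip with MediaPipe Hands
# and accumulate a 2D mid-air trajectory from a plain RGB webcam.
# MediaPipe is an assumption; the paper only requires a standard RGB camera.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)  # default RGB camera
trajectory = []  # (x, y) fingertip positions in pixel coordinates

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        lm = result.multi_hand_landmarks[0].landmark[8]  # landmark 8 = index fingertip
        h, w, _ = frame.shape
        trajectory.append((int(lm.x * w), int(lm.y * h)))
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to stop drawing
        break

cap.release()
hands.close()
# `trajectory` is the raw, noisy hand-tracking sequence the model must clean up.
```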
Abstract
Illustration is a fundamental mode of human expression and communication. Certain types of hand motion that accompany speech can serve this illustrative purpose. While Augmented and Virtual Reality (AR/VR) technologies have introduced tools for producing drawings with hand motions (air drawing), they typically require costly hardware and additional digital markers, limiting their accessibility and portability. Furthermore, air drawing demands considerable skill to achieve aesthetic results. To address these challenges, we introduce the concept of AirSketch: generating faithful and visually coherent sketches directly from hand motions, eliminating the need for complicated headsets or markers. We devise a simple augmentation-based self-supervised training procedure that enables a controllable image diffusion model to learn to translate highly noisy hand-tracking images into clean, aesthetically pleasing sketches while preserving the essential visual cues of the original tracking data. We present two air-drawing datasets to study this problem. Our findings demonstrate that, beyond producing photo-realistic images from precise spatial inputs, controllable image diffusion can effectively produce a refined, clear sketch from a noisy input. Our work serves as an initial step towards marker-less air drawing and reveals distinct applications of controllable diffusion models to AirSketch and to AR/VR in general.
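As a rough illustration of the augmentation-based self-supervised idea, the sketch below fabricates (noisy input, clean target) training pairs by distorting ground-truth sketch strokes to mimic unsteady hand tracking. The specific distortions (per-point Gaussian jitter plus a low-frequency drift) and their magnitudes are my assumptions, not the paper's exact augmentation recipe.

```python
# Minimal sketch of the augmentation idea (assumptions throughout): perturb a
# clean sketch's strokes to mimic noisy hand tracking, yielding self-supervised
# (noisy input, clean target) pairs for a controllable diffusion model.
import numpy as np
import cv2

def noisy_tracking_image(strokes, size=512, jitter=3.0, wobble=8.0, seed=None):
    """Render strokes (lists of (x, y) points) with hand-tremor-like distortion.

    `jitter` adds per-point Gaussian noise; `wobble` adds a smooth low-frequency
    drift, loosely imitating an unsteady mid-air hand. Magnitudes are made up.
    """
    rng = np.random.default_rng(seed)
    canvas = np.full((size, size), 255, dtype=np.uint8)  # white background
    for pts in strokes:
        pts = np.asarray(pts, dtype=np.float64)
        pts += rng.normal(0.0, jitter, pts.shape)        # high-frequency tremor
        phase = rng.uniform(0.0, 2 * np.pi, 2)
        t = np.linspace(0.0, 2 * np.pi, len(pts))
        pts[:, 0] += wobble * np.sin(t + phase[0])       # slow horizontal drift
        pts[:, 1] += wobble * np.sin(t + phase[1])       # slow vertical drift
        cv2.polylines(canvas, [pts.round().astype(np.int32)], False, 0, 2)
    return canvas
```

In a training loop, the distorted rendering would serve as the conditioning image for a controllable diffusion model (e.g. a ControlNet-style adapter), while the corresponding clean sketch is the denoising target, so no paired human air drawings are needed during training.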