Robot Synesthesia: A Sound and Emotion Guided AI Painter

📅 2023-02-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of enabling embodied robots to generate semantically coherent and emotionally expressive visual art from auditory inputs. The authors propose a robotic auditory–visual synesthesia paradigm, called robot synesthesia, that separates speech content from emotional tone so that the two can independently control a painting's subject matter and its mood. The system combines cross-modal latent space alignment for general sounds, speech emotion recognition, automatic speech recognition (ASR), text-to-image generation, and robot motion control. It is integrated with FRIDA, a robotic painting framework, and supports paintings guided by speech, music, and natural sounds. In two surveys, participants identified the emotion or natural sound used to generate a painting at roughly 2.1 times the random-chance rate, which supports the feasibility and expressiveness of sound-driven robotic painting.
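The separation of speech into content and emotion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `SpeechAnalysis` and `build_guidance` are hypothetical names, and the transcript and emotion scores are assumed to come from an ASR model and an emotion classifier that are not shown here.

```python
from dataclasses import dataclass


@dataclass
class SpeechAnalysis:
    """Hypothetical container for the two signals extracted from one utterance."""
    transcript: str                   # what was said (assumed output of an ASR model)
    emotion_scores: dict[str, float]  # how it was said (assumed output of an emotion classifier)


def build_guidance(analysis: SpeechAnalysis) -> dict[str, str]:
    """Keep content and mood as two separate controls for the painter."""
    if not analysis.emotion_scores:
        raise ValueError("emotion_scores must not be empty")
    # The highest-scoring emotion sets the mood; the transcript sets the subject.
    mood = max(analysis.emotion_scores, key=analysis.emotion_scores.get)
    return {"content": analysis.transcript.strip(), "mood": mood}


if __name__ == "__main__":
    utterance = SpeechAnalysis(
        transcript="a boat on a lake",
        emotion_scores={"happy": 0.2, "sad": 0.7, "angry": 0.1},
    )
    print(build_guidance(utterance))
```

Keeping the two fields separate means the same sentence spoken in a different tone changes only the `mood` entry, which mirrors the independent control the summary describes.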
📝 Abstract
If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have achieved progress in generating visuals from text inputs, the translation of sound into images is largely unexplored. Generally, sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and provide a means to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, known here as robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple speech into its transcribed text and the tone of the speech. We use the text to control the content, and we estimate the emotions from the tone to guide the mood of the painting. Our approach has been fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting more than twice as often as random chance would predict. We discuss the results of our sound-guided image manipulation and music-guided paintings qualitatively.
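Encoding simulated paintings and input sounds into the same latent space makes the two directly comparable. The sketch below shows one common way to score that comparison, cosine distance between embeddings. The abstract does not state which objective is used, so the choice of cosine distance, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def sound_guidance_loss(painting_embedding: Sequence[float],
                        sound_embedding: Sequence[float]) -> float:
    """Assumed objective: 0 when the embeddings point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(painting_embedding, sound_embedding)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real encoders would produce much larger vectors.
    sound = [0.9, 0.1, 0.0]
    close_painting = [0.8, 0.2, 0.0]
    far_painting = [0.0, 0.1, 0.9]
    print(sound_guidance_loss(close_painting, sound))
    print(sound_guidance_loss(far_painting, sound))
```

Under this assumption, a painting plan would be adjusted to lower the loss, pulling the simulated painting's embedding toward the embedding of the input sound.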
Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition
Audio-Visual Translation
Robotic Art Creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Robotics Aesthetics
Audio-Visual Synthesis
Emotional Intelligence