🤖 AI Summary
Labeled event camera data is scarce and expensive to acquire, which hinders progress on event-based vision tasks. To address this, we propose ControlEvents, a controllable diffusion framework for event data generation. Leveraging the prior of a pre-trained image diffusion model (Stable Diffusion), our method requires only lightweight fine-tuning on limited labeled data to synthesize high-fidelity event data under diverse control signals, including class text labels, 2D skeletons, and 3D body poses. It can also generate events for text labels unseen during training, inheriting the zero-shot text-based generation ability of the foundation model. Extensive experiments show that the synthesized events consistently improve downstream performance in visual recognition and 2D/3D pose estimation across multiple benchmarks, validating the quality, generalization ability, and practical utility of the generated event data.
📝 Abstract
In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior of foundation models such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data improves model performance on all three tasks. Additionally, our approach can generate events from text labels unseen during training, illustrating the powerful text-based generation capabilities inherited from foundation models.
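The abstract does not spell out the conditioning architecture. As one illustrative possibility, a ControlNet-style branch on top of a frozen Stable Diffusion backbone could inject a rendered 2D skeleton as the control signal; the minimal sketch below uses the Hugging Face diffusers API under that assumption, and the checkpoint identifier and input file names are hypothetical placeholders, not the authors' released weights.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Hypothetical checkpoint: a ControlNet branch fine-tuned on (2D-skeleton,
# event-frame) pairs, in the spirit of the skeleton-conditioned setting above.
CONTROLNET_ID = "your-org/controlevents-skeleton"  # placeholder identifier

controlnet = ControlNetModel.from_pretrained(CONTROLNET_ID, torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # pre-trained Stable Diffusion prior
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Control signal: a rendered 2D skeleton image (placeholder file name).
skeleton = Image.open("skeleton_render.png").convert("RGB").resize((512, 512))

# Text + skeleton conditioning -> synthesized event-frame image.
result = pipe(
    prompt="event camera frame of a person raising both arms",
    image=skeleton,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("synthesized_event_frame.png")
```

Training such a branch would follow the standard ControlNet recipe (freeze the base U-Net, train only the control encoder), which is consistent with the "minimal fine-tuning" claim but remains an assumption about the implementation details.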