Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

📅 2025-06-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of reconstructing photorealistic scene images from subjective human inputs, namely natural language descriptions paired with coarse sketches. Three bottlenecks make the task hard: user-specific cognitive biases, the modality gap between planar (2D) sketches and the 3D priors of diffusion models, and performance that degrades when sketch quality is low. We propose a human-centric, progressive co-generation framework that treats the sketching process as an ordered sequence, reflecting how the user recalls the scene. The method combines training-free user preference alignment, latent-space modality alignment, and a hierarchical reward mechanism. Text-guided reward optimization first establishes appearance priors; language and sketch cues are then disentangled and fused in the latent space in sketching order. Experiments on multiple datasets are reported to show state-of-the-art semantic and spatial consistency, along with closer alignment to the user's subjective intent.

📝 Abstract

We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through the synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes the dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as a prior, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) a large modality gap between planar sketches and the 3D priors in diffusion models, and (3) performance degradation that is sensitive to sketch quality. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectations in a training-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and the 3D priors in diffusion models. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.
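As a rough illustration of the concept-sequential idea in the abstract, the toy Python sketch below refines a single latent grid one concept at a time, in drawing order. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function name `sequential_latent_optimization`, the use of a bounding box as a stand-in for a rough sketch region, the scalar `target` standing in for a text-derived appearance prior, and the quadratic reward. No diffusion model is involved.

```python
def sequential_latent_optimization(concepts, height=8, width=8, steps=50, lr=0.2):
    """Toy example: refine one latent grid concept by concept, in drawing order.

    concepts: list of ((row_start, row_end, col_start, col_end), target) pairs.
    The box is a hypothetical stand-in for a rough sketch region; the target
    is a hypothetical stand-in for an appearance prior derived from text.
    """
    latent = [[0.0] * width for _ in range(height)]
    for (r0, r1, c0, c1), target in concepts:  # list order = sketching order
        for _ in range(steps):
            for r in range(r0, r1):
                for c in range(c0, c1):
                    # Gradient ascent on the assumed reward
                    # -0.5 * (latent - target)^2, restricted to the region.
                    latent[r][c] += lr * (target - latent[r][c])
    return latent


if __name__ == "__main__":
    # Two overlapping regions; the second one is drawn later.
    drawn_first = ((1, 5, 1, 5), 1.0)
    drawn_second = ((3, 7, 3, 7), -1.0)
    result = sequential_latent_optimization([drawn_first, drawn_second])
    for row in result:
        print(" ".join(f"{value:+.2f}" for value in row))
```

In this toy setup the drawing order matters only where regions overlap: the later concept is optimized starting from the state the earlier one left behind, so it determines the final values in the shared cells, while cells outside every region are never updated.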

Problem

Research questions and friction points this paper is trying to address.

Overcoming language ambiguity and sketch abstraction in scene reconstruction

Bridging the modality gap between planar sketches and the 3D priors in diffusion models

Enabling rough sketch usage without requiring artistic precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence-aware sketch-guided diffusion for scene reconstruction

Text-reward optimization for robust appearance priors

Latent optimization to bridge the gap between planar sketches and 3D diffusion priors

🔎 Similar Papers

ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis