ObjectAlign: Neuro-Symbolic Object Consistency Verification and Correction

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video editing often suffers from inter-frame inconsistencies, causing flickering and identity drift that degrade visual coherence. To address this, we propose the first object-level consistency restoration framework integrating perceptual metrics with symbolic reasoning. Our method introduces: (1) a learnable adaptive threshold to jointly model perceptual fidelity—via CLIP similarity, LPIPS, histogram statistics, and SAM mask IoU—and temporal logical constraints; (2) a neuro-symbolic verification mechanism that combines an SMT solver with probabilistic model checking to simultaneously ensure low-level stability and high-level temporal logic correctness; and (3) neural adaptive frame interpolation to enhance temporal smoothness. Evaluated on DAVIS and Pexels benchmarks, our approach achieves a +1.4 improvement in CLIP Score and a −6.1 reduction in warp error, significantly outperforming state-of-the-art methods.

📝 Abstract
Video editing and synthesis often introduce object inconsistencies, such as frame flicker and identity drift, which degrade perceptual quality. To address these issues, we introduce ObjectAlign, a novel framework that seamlessly blends perceptual metrics with symbolic reasoning to detect, verify, and correct object-level and temporal inconsistencies in edited video sequences. The contributions of ObjectAlign are as follows: First, we propose learnable thresholds for metrics characterizing object consistency (i.e., CLIP-based semantic similarity, LPIPS perceptual distance, histogram correlation, and SAM-derived object-mask IoU). Second, we introduce a neuro-symbolic verifier that combines two components: (a) a formal, SMT-based check that operates on masked object embeddings to provably guarantee that object identity does not drift, and (b) a temporal fidelity check that uses a probabilistic model checker to verify the video's formal representation against a temporal logic specification. A frame transition is subsequently deemed "consistent" based on a single logical assertion that requires satisfying both the learned metric thresholds and this unified neuro-symbolic constraint, ensuring both low-level stability and high-level temporal correctness. Finally, for each contiguous block of flagged frames, we propose neural-network-based interpolation for adaptive frame repair, dynamically choosing the interpolation depth based on the number of frames to be corrected. This enables reconstruction of the corrupted frames from the last valid and next valid keyframes. Our results show up to a 1.4-point improvement in CLIP Score and up to a 6.1-point reduction in warp error compared to SOTA baselines on the DAVIS and Pexels video datasets.
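The abstract's "single logical assertion" — learned metric thresholds conjoined with the neuro-symbolic constraint — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values are hypothetical placeholders (the paper learns them), and `symbolic_ok` stands in for the combined SMT and model-checking verdict.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    # Hypothetical learned per-metric thresholds (the paper learns these).
    clip_sim_min: float = 0.85   # minimum CLIP semantic similarity
    lpips_max: float = 0.20      # maximum LPIPS perceptual distance
    hist_corr_min: float = 0.90  # minimum histogram correlation
    mask_iou_min: float = 0.75   # minimum SAM object-mask IoU

def transition_is_consistent(clip_sim: float, lpips: float,
                             hist_corr: float, mask_iou: float,
                             th: Thresholds, symbolic_ok: bool) -> bool:
    """Single assertion: every learned metric threshold must hold AND
    the neuro-symbolic verifier must accept the frame transition."""
    metrics_ok = (clip_sim >= th.clip_sim_min
                  and lpips <= th.lpips_max
                  and hist_corr >= th.hist_corr_min
                  and mask_iou >= th.mask_iou_min)
    return metrics_ok and symbolic_ok
```

A transition failing any one condition is flagged for repair, which is what makes the check a conjunction rather than a weighted score.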
Problem

Research questions and friction points this paper is trying to address.

Detecting and correcting object inconsistencies in edited video sequences
Verifying object identity consistency via neuro-symbolic reasoning
Repairing corrupted video frames through adaptive neural-network interpolation
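The adaptive repair described above can be sketched as recursive midpoint interpolation between the last valid and next valid keyframes. This is one plausible scheme under the paper's stated constraint that interpolation depth adapts to the gap length; `interpolate` is a stand-in for the neural interpolation network, not the paper's actual model.

```python
import math

def interpolation_depth(num_corrupted: int) -> int:
    # Recursive midpoint interpolation at depth d yields 2**d - 1
    # intermediate frames, so pick the smallest depth covering the gap.
    return max(1, math.ceil(math.log2(num_corrupted + 1)))

def midpoint_frames(last_valid, next_valid, depth, interpolate):
    # Recursively synthesize intermediate frames between two valid keyframes.
    if depth == 0:
        return []
    mid = interpolate(last_valid, next_valid)
    return (midpoint_frames(last_valid, mid, depth - 1, interpolate)
            + [mid]
            + midpoint_frames(mid, next_valid, depth - 1, interpolate))
```

With a toy averaging interpolator on scalar "frames", `midpoint_frames(0.0, 1.0, 2, lambda a, b: (a + b) / 2)` returns `[0.25, 0.5, 0.75]` — three reconstructed frames for a three-frame gap.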
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable thresholds for object consistency metrics
Neuro-symbolic verifier with SMT and probabilistic checks
Neural network interpolation for adaptive frame repair
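The identity-drift check underlying the second contribution bounds how far a masked object embedding may move between frames. The paper discharges this bound formally with an SMT solver; the sketch below is a simplified pure-Python stand-in that checks the same kind of constraint directly, with a hypothetical similarity bound.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identity_preserved(emb_t, emb_t1, min_sim: float = 0.9) -> bool:
    # Stand-in for the SMT-based check: object identity is considered
    # preserved when consecutive masked-object embeddings stay within
    # a similarity bound (0.9 here is a hypothetical value).
    return cosine(emb_t, emb_t1) >= min_sim
```

In the paper's pipeline the bound is asserted as a formal constraint rather than evaluated numerically, which is what yields the "provable" guarantee the abstract claims.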