🤖 AI Summary
Scene Graph Generation (SGG) suffers from coarse subject–predicate–object modeling and from neglect of the bidirectional dependencies between entities and predicates. To address this, we propose a Bidirectional Conditionalized Transformer architecture, the first to enable mutual guidance between visual feature processing and semantic decoding: visual features dynamically modulate semantic decoding, while semantic structures reciprocally constrain visual attention, which facilitates joint optimization of objects and relations. Our method comprises a dual-stream conditional encoder, a learnable relational prior module, and a contrastive relational rescoring mechanism. Evaluated on Visual Genome, our approach improves Recall@100 by 3.2%, with substantial gains in long-tail relation recognition and in modeling multiple co-occurring relations.
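For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@K can be computed for one image's predicted triplets. It is a simplified illustration and not the paper's evaluation code: the function name `recall_at_k`, the string-based triplet representation, and the toy data are all assumptions, and the sketch matches triplets by label only, whereas standard SGG evaluation also requires the predicted boxes to overlap the ground-truth boxes.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a triplet is (subject, predicate, object) labels.
Triplet = Tuple[str, str, str]


def recall_at_k(
    scored: List[Tuple[Triplet, float]],
    ground_truth: Set[Triplet],
    k: int,
) -> float:
    """Fraction of ground-truth triplets found among the k highest-scoring predictions."""
    if not ground_truth:
        return 0.0
    # Rank predictions by confidence, highest first, and keep the top k.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    top_k = {triplet for triplet, _ in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)


# Toy data (invented for illustration only).
predictions = [
    (("man", "riding", "horse"), 0.9),
    (("man", "on", "horse"), 0.8),
    (("horse", "on", "grass"), 0.4),
    (("man", "wearing", "hat"), 0.3),
]
gt = {("man", "riding", "horse"), ("man", "wearing", "hat")}

print(recall_at_k(predictions, gt, 2))  # 0.5
print(recall_at_k(predictions, gt, 4))  # 1.0
```

Because the metric only counts ground-truth triplets that make it into the top K, a rescoring step that promotes a correct but low-ranked relation (such as "wearing" in the toy data) raises recall without changing the set of candidates.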