🤖 AI Summary
This work addresses the limitations of current training-free diffusion-based segmentation methods, which struggle to improve segmentation performance in tandem with the advancing generative capabilities of diffusion models due to inconsistencies in cross-attention maps and imbalanced text token scores. To overcome these issues, the authors propose a novel mechanism that automatically aggregates multi-layer, multi-head cross-attention maps and applies pixel-wise rescaling to balance semantic responses and fully exploit the representational power of strong diffusion models. This approach is the first whose segmentation performance scales with the capacity of the underlying generative model, and it significantly outperforms existing training-free methods on standard semantic segmentation benchmarks. Furthermore, it integrates successfully into generative tasks, confirming its effectiveness and generalizability.
📝 Abstract
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are known as training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.
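The two gaps described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation (see the linked repository for that); it only shows, under assumed shapes, the general idea of (i) aggregating per-layer, per-head cross-attention maps into one global map and (ii) rescaling token scores per pixel so that no token dominates purely because of its overall score magnitude. The weighting scheme and normalization here are illustrative placeholders, not the paper's exact auto-aggregation or rescaling formulas.

```python
import numpy as np

def aggregate_attention(attn_maps, weights=None):
    """Combine cross-attention maps from multiple layers/heads into one global map.

    attn_maps: list of arrays, each of shape (P, T), where P is the number of
    pixels (assumed already resized to a common resolution) and T the number
    of text tokens. `weights` is a per-map weighting; uniform if omitted.
    """
    stacked = np.stack(attn_maps)              # (L, P, T)
    if weights is None:
        weights = np.full(len(attn_maps), 1.0 / len(attn_maps))
    return np.einsum("l,lpt->pt", weights, stacked)

def per_pixel_rescale(global_map, eps=1e-8):
    """Rescale token scores independently at each pixel so they sum to 1.

    This counters score imbalance across tokens: each pixel's class decision
    depends on relative, not absolute, token responses.
    """
    shifted = global_map - global_map.min(axis=1, keepdims=True)
    return shifted / (shifted.sum(axis=1, keepdims=True) + eps)

# Toy demo: 4 attention maps over 16 pixels and 3 text tokens.
rng = np.random.default_rng(0)
maps = [rng.random((16, 3)) for _ in range(4)]
global_map = aggregate_attention(maps)         # (16, 3)
segmentation = per_pixel_rescale(global_map).argmax(axis=1)  # per-pixel token id
```

In practice, a method in this family would take the argmax over tokens at each pixel (as in the last line) to produce a segmentation mask, with the aggregation weights chosen automatically rather than uniformly.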