🤖 AI Summary
Visual manipulation localization (VML) faces two key challenges: weak cross-modal generalization and low efficiency in processing high-resolution images and long videos. To address these, we propose RelayFormer, a unified framework centered on the Global-Local Relay Attention (GLoRA) mechanism, which enables resolution-agnostic processing with only minimal architectural changes. RelayFormer combines flexible local units, lightweight adapter modules, and a query-based mask decoder that supports one-shot inference with linear complexity, and it integrates with existing Transformer backbones (e.g., ViT, SegFormer). Evaluated on multiple image and video VML benchmarks, it achieves state-of-the-art localization performance and establishes a new baseline for scalable, modality-agnostic VML.
📝 Abstract
Visual manipulation localization (VML), across both images and videos, is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently.
We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations.
Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
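To make the scalability claim concrete, the following is a minimal back-of-the-envelope sketch, not the RelayFormer implementation. It only counts query-key pairs, comparing full self-attention over all tokens with attention restricted to fixed-size local units. The function names, the `unit_size` and `relay_tokens` parameters, and the choice to ignore any cross-unit exchange are illustrative assumptions that do not come from the abstract.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: quadratic in num_tokens."""
    return num_tokens * num_tokens


def local_unit_pairs(num_tokens: int, unit_size: int, relay_tokens: int = 1) -> int:
    """Query-key pairs when attention is confined to fixed-size local units.

    Assumptions (hypothetical, for illustration only):
    - tokens are split into units of `unit_size`, with the last unit padded;
    - each unit carries `relay_tokens` extra tokens;
    - any exchange between units is ignored in this count.
    """
    num_units = -(-num_tokens // unit_size)  # ceiling division
    tokens_per_unit = unit_size + relay_tokens
    return num_units * tokens_per_unit * tokens_per_unit


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        print(n, full_attention_pairs(n), local_unit_pairs(n, unit_size=256))
```

With the unit size held fixed, doubling the number of tokens doubles the local-unit count but quadruples the full-attention count, which is the sense in which unit-wise processing scales linearly with input size.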