RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual manipulation localization (VML) faces two key challenges: weak cross-modal generalization and low efficiency in processing high-resolution images and long videos. To address these, we propose RelayFormer, a unified framework centered on the Global-Local Relay Attention (GLoRA) mechanism—enabling resolution-agnostic, linear-complexity single-pass inference without architectural modification. RelayFormer incorporates flexible local units, lightweight adapter modules, and a query-based mask decoder, ensuring seamless integration with diverse Transformer backbones (e.g., ViT, SegFormer). Evaluated across multiple image and video VML benchmarks, it achieves state-of-the-art performance, significantly improving scalability and cross-modal generalization. RelayFormer thus establishes a new generic baseline for VML.

📝 Abstract
Visual manipulation localization (VML) -- across both images and videos -- is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
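The paper itself does not include code here, but the general pattern behind a local-global relay attention scheme can be illustrated with a minimal sketch: tokens attend within small local windows, while a handful of shared "relay" tokens summarize the whole sequence and are re-injected into every window, keeping total cost linear in sequence length. This is a generic illustration under assumed names (`relay_attention`, `attend`), not the authors' GLoRA implementation — it omits learned projections, multi-head splits, and the adapter/decoder components described in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Plain scaled dot-product attention; a real implementation would
    # apply learned Q/K/V projections and multiple heads.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def relay_attention(tokens, window=4, n_relay=2, seed=0):
    """Sketch of local attention bridged by global relay tokens.

    tokens: (N, D) array with N divisible by `window`. Each window
    attends only to itself plus a small set of shared relay tokens,
    so cost grows linearly with N rather than quadratically.
    """
    n, d = tokens.shape
    assert n % window == 0, "sequence length must be divisible by window"
    # Relay tokens would be learned parameters; random init stands in here.
    relay = np.random.default_rng(seed).normal(scale=0.02, size=(n_relay, d))

    # Pass 1 (global): relay tokens summarize the whole sequence.
    relay = relay + attend(relay, tokens, tokens)

    # Pass 2 (local): each window attends to itself plus the relay tokens,
    # which carry cross-window context back into every window.
    out = np.empty_like(tokens)
    for i in range(0, n, window):
        win = tokens[i:i + window]
        kv = np.concatenate([win, relay], axis=0)
        out[i:i + window] = attend(win, kv, kv)
    return out
```

Per window the attention is over `window + n_relay` keys, so the total work is O(N · (window + n_relay)) — linear in N — which matches the scalability claim the abstract makes for high-resolution images and long videos.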
Problem

Research questions and friction points this paper is trying to address.

Localizing tampered regions in images and videos efficiently
Handling high-resolution and long-duration inputs effectively
Achieving cross-modal generalization for manipulation localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified local-global attention framework
Lightweight adaptation for existing Transformers
Query-based mask decoder for videos
Wen Huang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Jiarui Yang
Nankai University, Tianjin, China
Tao Dai
Shenzhen University
image restoration, computer vision, deep learning
Jiawei Li
Huawei Technologies Co., Ltd
Shaoxiong Zhan
Tsinghua University
Natural Language Processing, Large Language Model
Bin Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory, machine learning, computer vision, AI security