RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual manipulation localization (VML) faces two key challenges: weak cross-modal generalization and low efficiency in processing high-resolution images and long videos. To address these, we propose RelayFormer, a unified framework centered on the Global-Local Relay Attention (GLoRA) mechanism—enabling resolution-agnostic, linear-complexity single-pass inference without architectural modification. RelayFormer incorporates flexible local units, lightweight adapter modules, and a query-based mask decoder, ensuring seamless integration with diverse Transformer backbones (e.g., ViT, SegFormer). Evaluated across multiple image and video VML benchmarks, it achieves state-of-the-art performance, significantly improving scalability and cross-modal generalization. RelayFormer thus establishes a new generic baseline for VML.

📝 Abstract
Visual manipulation localization (VML) -- across both images and videos -- is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
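The paper itself does not include code here, but the general pattern behind a local-global relay attention scheme can be illustrated with a minimal sketch: tokens attend within small local windows, while a handful of shared "relay" tokens summarize the whole sequence and are re-injected into every window, keeping total cost linear in sequence length. This is a generic illustration under assumed names (`relay_attention`, `attend`), not the authors' GLoRA implementation — it omits learned projections, multi-head splits, and the adapter/decoder components described in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Plain scaled dot-product attention; a real implementation would
    # apply learned Q/K/V projections and multiple heads.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def relay_attention(tokens, window=4, n_relay=2, seed=0):
    """Sketch of local attention bridged by global relay tokens.

    tokens: (N, D) array with N divisible by `window`. Each window
    attends only to itself plus a small set of shared relay tokens,
    so cost grows linearly with N rather than quadratically.
    """
    n, d = tokens.shape
    assert n % window == 0, "sequence length must be divisible by window"
    # Relay tokens would be learned parameters; random init stands in here.
    relay = np.random.default_rng(seed).normal(scale=0.02, size=(n_relay, d))

    # Pass 1 (global): relay tokens summarize the whole sequence.
    relay = relay + attend(relay, tokens, tokens)

    # Pass 2 (local): each window attends to itself plus the relay tokens,
    # which carry cross-window context back into every window.
    out = np.empty_like(tokens)
    for i in range(0, n, window):
        win = tokens[i:i + window]
        kv = np.concatenate([win, relay], axis=0)
        out[i:i + window] = attend(win, kv, kv)
    return out
```

Per window the attention is over `window + n_relay` keys, so the total work is O(N · (window + n_relay)) — linear in N — which matches the scalability claim the abstract makes for high-resolution images and long videos.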
Problem

Research questions and friction points this paper is trying to address.

Localizing tampered regions in images and videos efficiently
Handling high-resolution and long-duration inputs effectively
Achieving cross-modal generalization for manipulation localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified local-global attention framework
Lightweight adaptation for existing Transformers
Query-based mask decoder for videos
Wen Huang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Jiarui Yang
Nankai University, Tianjin, China
Tao Dai
Shenzhen University
image restoration, computer vision, deep learning
Jiawei Li
Huawei Technologies Co., Ltd
Shaoxiong Zhan
Tsinghua University
Natural Language Processing, Large Language Model
Bin Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory, machine learning, computer vision, AI security