CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost, memory footprint, and energy consumption that hinder Vision Transformer (ViT) deployment on resource-constrained devices, this paper proposes CascadedViT (CViT), a lightweight and efficient architecture. Its core innovations are the Cascaded-Chunk Feed Forward Network (CCFFN) and Cascaded Group Attention (CGA), which jointly improve parameter and FLOP efficiency through feature splitting and a lightweight attention design. The paper also introduces Accuracy-Per-FLOP (APF), a metric that quantifies compute efficiency relative to accuracy. Evaluated on ImageNet-1K, CViT-XL achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across model scales, the CViT family attains the lowest energy consumption and top-ranking APF scores among the methods compared.
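The APF metric amounts to dividing accuracy by compute. A minimal sketch, assuming a per-GFLOP normalization (the paper's exact normalization is not given here); the 75.5% / 15%-FLOP-reduction figures come from the summary, while the 1.00 GFLOP baseline budget is purely hypothetical:

```python
# Illustrative Accuracy-Per-FLOP (APF) computation. The per-GFLOP
# normalization is an assumption; the 1.00 GFLOP baseline is hypothetical.

def accuracy_per_flop(top1_accuracy: float, gflops: float) -> float:
    """Top-1 accuracy points earned per GFLOP of inference compute."""
    return top1_accuracy / gflops

# Same accuracy at 15% fewer FLOPs (as reported for CViT-XL) yields a
# strictly higher APF score.
baseline_apf = accuracy_per_flop(75.5, 1.00)  # hypothetical baseline budget
cvit_apf = accuracy_per_flop(75.5, 0.85)      # 15% FLOP reduction

assert cvit_apf > baseline_apf
```

At equal accuracy, a 15% FLOP reduction improves APF by a factor of 1/0.85, roughly 18%.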

📝 Abstract
Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose *Cascaded-ViT (CViT)*, a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called *Cascaded-Chunk Feed Forward Network (CCFFN)*. By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our *CViT-XL* model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called *Accuracy-Per-FLOP (APF)*, which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Notably, CViT-L is 2.2% more accurate than EfficientViT-M2 while having comparable APF scores.
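Why splitting input features saves FLOPs can be seen with back-of-envelope arithmetic: a two-layer FFN on a d-dimensional input costs O(d²) multiply-accumulates, so processing k chunks of size d/k independently costs k·O((d/k)²) = O(d²/k). The chunk count and 4x expansion ratio below are illustrative assumptions, not CViT's actual configuration:

```python
# Back-of-envelope MAC counts for a standard transformer FFN versus a
# chunked variant where the d-dim input is split into k equal chunks,
# each processed by its own small FFN. k=4 and the 4x expansion ratio
# are illustrative assumptions, not the paper's configuration.

def ffn_macs(dim: int, expansion: int = 4) -> int:
    """MACs per token for a two-layer FFN: d -> expansion*d -> d."""
    hidden = expansion * dim
    return dim * hidden + hidden * dim  # up-projection + down-projection

def chunked_ffn_macs(dim: int, chunks: int, expansion: int = 4) -> int:
    """MACs per token when the input is split into `chunks` equal parts."""
    return chunks * ffn_macs(dim // chunks, expansion)

d = 256
full = ffn_macs(d)              # 2 * 4 * 256**2 = 524288
split = chunked_ffn_macs(d, 4)  # 4 * (2 * 4 * 64**2) = 131072

# Splitting into k chunks cuts FFN MACs (and weights) by a factor of k.
assert full // split == 4
```

The same arithmetic applies to parameter counts, since each small FFN has (d/k)-sized weight matrices; the cascade connections between chunks (not modeled here) recover cross-chunk interaction at negligible extra cost.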
Problem

Research questions and friction points this paper is trying to address.

Reduces computational and memory demands of Vision Transformers
Improves energy efficiency for deployment on mobile devices
Enhances parameter and FLOP efficiency without sacrificing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded-Chunk Feed Forward Network for efficiency
Splitting input features to improve FLOP efficiency
Cascaded Group Attention for lightweight transformer design
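The cascading idea shared by CGA and CCFFN can be sketched structurally: features are split into groups, and each group's output is fed into the next group's input before its own transform is applied. The per-group transform `f` below is a hypothetical stand-in for an attention head or a small FFN (real implementations use learned projections); this shows only the dataflow, not the paper's exact design:

```python
# Structural sketch of a cascaded-group dataflow: split features into
# chunks, add each group's output into the next group's input, then
# concatenate. `f` is a hypothetical stand-in for the per-group
# transform (an attention head in CGA, a small FFN in CCFFN).

from typing import Callable, List

def cascaded_groups(chunks: List[List[float]],
                    f: Callable[[List[float]], List[float]]) -> List[float]:
    outputs: List[List[float]] = []
    carry = [0.0] * len(chunks[0])  # no cascade input for the first group
    for chunk in chunks:
        # Each group sees its own chunk plus the previous group's output.
        x = [a + b for a, b in zip(chunk, carry)]
        carry = f(x)
        outputs.append(carry)
    # Concatenate group outputs back into a full feature vector.
    return [v for out in outputs for v in out]

# Toy transform: identity, so the cascade reduces to chunk-wise prefix sums.
result = cascaded_groups([[1.0, 2.0], [3.0, 4.0]], lambda x: x)
assert result == [1.0, 2.0, 4.0, 6.0]
```

Because later groups operate on progressively refined features, the cascade adds cross-group interaction without the quadratic cost of mixing all features in every group.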