IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025

📅 2025-07-31
🤖 AI Summary
This paper systematically surveys the evolution of neural network design paradigms in computer vision, focusing on image recognition, generative modeling, and self-supervised learning. Through a comprehensive literature review and cross-model comparative analysis, we establish a unified analytical framework that identifies three overarching trends: (i) convolutional architectures giving way to attention-based mechanisms; (ii) supervised learning transitioning toward self-supervised paradigms; and (iii) discriminative modeling shifting to generative modeling. We conduct in-depth analyses of six landmark models (ResNet, ViT, GAN, LDM, DINO, and MAE), highlighting their breakthroughs in training stability at depth, generative fidelity, and reduced reliance on labeled data. Crucially, we unify diverse mechanisms, including residual connections, momentum teachers, masked encoding, diffusion processes, and adversarial training, within a coherent design logic, revealing intrinsic consistency across architectural advances. Our synthesis yields reusable architectural principles and theoretical guidance for next-generation vision foundation models.
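The momentum-teacher mechanism named above can be sketched in a few lines. The update below is a standard exponential moving average; the function name and momentum value are illustrative assumptions, not taken from DINO's released code.

```python
def ema_update(teacher_params, student_params, momentum=0.9):
    """Move each teacher parameter slowly toward its student counterpart:

        teacher <- momentum * teacher + (1 - momentum) * student

    Because the teacher changes slowly, it provides stable targets
    for the student network to match during self-distillation.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy usage with scalar "parameters":
teacher = ema_update([0.0, 1.0], [1.0, 0.0], momentum=0.9)
# teacher moves a small step toward the student: approximately [0.1, 0.9]
```

DINO applies this update per training step with a momentum close to 1, so the teacher averages the student over many recent iterations.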

Technology Category

Application Category

📝 Abstract
This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which pits a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state of the art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
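The residual connection described above is simple enough to show directly. The sketch below is a minimal illustration of the identity shortcut y = F(x) + x; the real ResNet block uses convolutions and batch normalization, whereas F here is an assumed two-layer ReLU map for clarity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute y = F(x) + x with F(x) = W2 @ relu(W1 @ x).

    The "+ x" identity shortcut lets gradients flow around F,
    which is what makes very deep stacks trainable.
    """
    return w2 @ relu(w1 @ x) + x

# With zero weights, F(x) = 0 and the block reduces to the identity map,
# so adding more blocks cannot make the network worse at initialization.
x = np.array([1.0, -2.0, 3.0])
w_zero = np.zeros((3, 3))
y = residual_block(x, w_zero, w_zero)
# y equals x exactly
```

This identity-at-initialization property is one intuition for why residual networks avoid the degradation problem that plagued earlier deep plain networks.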
Problem

Research questions and friction points this paper is trying to address.

Analyzing the evolution of design patterns in computer vision
Investigating generative models for image synthesis
Exploring self-supervised learning to reduce labeled data dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

ResNet uses residual connections to train very deep networks
ViT applies Transformers to image patch sequences
LDMs denoise in latent space for efficient synthesis
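The ViT "image to patch sequence" step listed above can be sketched with plain array operations. The patch and image sizes below are illustrative assumptions (ViT-Base uses 16×16 patches on 224×224 inputs); each flattened patch is then treated like a word token by the Transformer.

```python
import numpy as np

def patchify(img, patch=4):
    """Split an (H, W, C) image into non-overlapping patch vectors.

    Returns an array of shape (num_patches, patch * patch * C),
    i.e. one flattened vector per patch, row-major over the grid.
    """
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    img = img.transpose(0, 2, 1, 3, 4)           # (h/P, w/P, P, P, C)
    return img.reshape(-1, patch * patch * c)    # (num_patches, P*P*C)

img = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
tokens = patchify(img, patch=4)
# tokens.shape == (4, 48): a 2x2 grid of patches, each a 48-dim vector
```

In the full ViT each token is additionally passed through a learned linear projection and summed with a positional embedding before entering the Transformer encoder.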