Scholar

Muhammad Ferjad Naeem

Google Scholar ID: PR2DwYYAAAAJ

Research Scientist, Google

Artificial Intelligence · Computer Vision · Machine Learning · Deep Learning

Citations & Impact

All-time

Citations

1,997

Contact

Email: ferjad.naeem@vision.ee.ethz.ch

Publications

8 items

Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing

2026

DataComp-VLM: Improved Open Datasets for Vision-Language Models

2026

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

2026

RefAM: Attention Magnets for Zero-Shot Referral Segmentation

2025

Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs

2025

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

2025

Active Data Curation Effectively Distills Large-Scale Multimodal Models

arXiv.org · 2024

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

arXiv.org · 2024

Resume (English only)

Academic Achievements

Published multiple papers at top venues such as ICCV, CVPR, NeurIPS, and ECCV, covering areas such as vision-language pretraining, zero-shot image classification, and learning attention propagation.

Research Experience

Works as a Research Consultant with Google in Zurich, collaborating closely with Google DeepMind on foundational vision-language models. Previously interned at NVIDIA.

Education

Ph.D. candidate at ETH Zürich in the Computer Vision Lab, supervised by Prof. Luc Van Gool and PD Dr. Federico Tombari. Master's degree from the Technical University of Munich, with a focus on generative models (Naver AI Lab) and zero-shot learning (University of Tübingen AI Research).

Background

Interested in building strong multimodal foundation models and in distilling their world knowledge into smaller task-specific models that can adapt and generalize to novel classes and environments.

Miscellany