Published in top conferences including CVPR 2025, CVPR 2024, and NeurIPS 2022. Developed the ComposeAnything framework for compositional text-to-image generation, the VELOCITI benchmark for evaluating video-language models, the MICap model for identity-aware movie descriptions, and the Grounded Video Situation Recognition framework.
Research Experience
Research Assistant in the Computer Vision lab at IIT Gandhinagar, working with Shanmuganathan Raman on computational photography, specifically high dynamic range (HDR) image and video reconstruction.
Education
PhD: Willow team at Inria and École Normale Supérieure in Paris, advised by Cordelia Schmid and Shizhe Chen. Master's: CVIT, IIIT Hyderabad, advised by C.V. Jawahar and Makarand Tapaswi; thesis on Situation Recognition for Holistic Video Understanding.
Background
Research Interests: Unified large multimodal diffusion models, particularly at the intersection of vision and language for joint understanding and generation across text, images, and videos. Currently exploring compositional representations for high-fidelity, interpretable text-to-image and text-to-video diffusion models.
Miscellany
Contact: zeeshan.khan@inria.fr; Office: C-412; Address: 2 Rue Simone IFF, 75012 Paris, France