Recipient of the ELLIS PhD Award for doctoral thesis.
Awarded the Google PhD Fellowship during doctoral studies.
Published multiple papers at top-tier conferences, including NeurIPS, ICCV, CVPR, ACL, ECCV, and Interspeech. Selected publications:
— VidChapters-7M (NeurIPS 2023): Introduced a large-scale dataset and tasks for video chapterization.
— PaLI-X (arXiv 2023): Scaled up multilingual vision-language models, achieving SOTA results on 25+ benchmarks.
— UnLoc (ICCV 2023): A unified framework for video localization tasks using image-text models like CLIP.
— AutoAD series (CVPR/ICCV 2023): Automatic audio description for movies, with a focus on character recognition and contextual understanding.
— Vid2Seq (CVPR 2023): A single-stage dense video captioning model pretrained on narrated videos, achieving SOTA performance.
— Modular VQA via Code Generation (ACL 2023): Used LLMs to generate executable code for visual question answering, setting new records on COVR and GQA.
— AVFormer (CVPR 2023): Enabled zero-shot audiovisual ASR by injecting vision into frozen speech models.
— LanSER (Interspeech 2023): Leveraged LLMs to derive emotion labels from speech for improved speech emotion recognition.