Research interests include adapting multi-modal foundation models for vision tasks such as image recognition, object detection, and video action recognition. The goal is to steer these models toward downstream tasks with limited data (few-/zero-shot) while preserving their pre-trained generalization to novel tasks.
Miscellany
Invited talks on multi-modal learning, based on the ProText work, at Amazon Prime Video and Cohere For AI.