Published multiple papers, including 'Visual Lexicon: Rich Image Features in Language Space' (2025), 'Dense Video Object Captioning from Disjoint Supervision' (2025), 'Streaming Dense Video Captioning' (2024), 'Pixel Aligned Language Models' (2024), 'How can objects help action recognition?' (2023), 'Detecting Twenty-thousand Classes using Image-level Supervision' (2022), 'Global Tracking Transformers' (2022), 'Simple multi-dataset detection' (2022), 'Probabilistic two-stage detection' (2021), 'Multimodal Virtual Point 3D Detection' (2021), 'Center-based 3D Object Detection and Tracking' (2021), 'Tracking Objects as Points' (2020), 'Objects as Points' (2019), 'Bottom-up Object Detection by Grouping Extreme and Center Points' (2019), 'StarMap for Category-Agnostic Keypoint and Viewpoint Estimation' (2018), 'Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency' (2018), 'Towards 3D Human Pose Estimation in the Wild: A weakly-supervised Approach' (2017), 'Deep Kinematic Pose Regression' (2016), 'Model-based Deep Hand Pose Estimation' (2016).
Research Experience
Worked at Google DeepMind and interned at Microsoft Research Asia, Google Research, Intel Labs, and Facebook AI Research.
Education
Ph.D. in Computer Science from The University of Texas at Austin, supervised by Prof. Philipp Krähenbühl; Bachelor's degree from the School of Computer Science at Fudan University.
Background
Currently a Research Scientist at Meta GenAI. His research focuses on bringing fine-grained understanding capabilities to vision-language models.
Miscellany
Personal website: https://skylerhallinan.com/. Last updated: February 2025.