Research Intern - Interactive Multimodal Futures Group (Situated & Affective Computing)

Microsoft
San Francisco Bay Area, USA / New York City metropolitan area, USA | 2025-12-02 | Onsite

About the job

Research Internships at Microsoft provide a dynamic environment for research careers, with a network of world-class research labs led by globally recognized scientists and engineers who pursue innovation across a range of scientific and technical disciplines to help solve complex challenges in diverse fields, including computing, healthcare, economics, and the environment.

The Interactive Multimodal Futures (IMF) group at Microsoft Research seeks a PhD-level Research Intern to work on a project at the intersection of situated interaction, affective computing, and human-centered AI systems. The project will include elements of multimodal sensing (physiology, speech, gaze, gestures, olfaction/gas, etc.), signal processing, and real-time interaction.

Responsibilities

Design and implement research prototypes for real-time situated and adaptive interaction.

Explore the latest generative AI techniques for interpreting multimodal interaction, conversation, and behavioral signals.

Conduct user studies and analyze multimodal data.

Contribute to publications and share findings with the research community.

Qualifications

Minimum

Currently enrolled in a PhD or equivalent program in HCI, HRI, Computer Science, Cognitive Science, Robotics, Electrical Engineering, Psychology, or a related STEM field.

At least 2 years of research experience using human-centered approaches in HCI, HRI, ML, CV, or Affective Computing.

Preferred

Experience writing peer-reviewed publications.

Experience with generative AI techniques, ML frameworks (e.g., PyTorch), and real-time interactive systems.

Strong collaboration and communication skills.

Experience conducting human-subjects research.

Experience implementing research prototypes (frontend, backend, or both).

Experience using human-centered design and research methods.

Familiarity with reinforcement learning or time-series signal processing.

Experience working with large datasets (e.g., text, vision, physiology, behavioral).