Model Evaluation & Data Quality Lead

Twelve Labs
San Francisco / Seoul | 2026-05-07 | Hybrid

About the job

You will be a vital member of our ML Data Team, which leads the full spectrum of video-language data preparation and model evaluation. This role comes with high ownership and includes responsibilities such as defining dataset needs and requirements in consultation with our research and product teams; designing and building data pipelines; and driving our post-training model evaluation strategy. You will also be responsible for automating as much of the repetitive partnership, annotation, and quality evaluation work as possible. A desire to work cross-functionally and to build relationships is critical for success in this position.

Responsibilities

- Model Evaluation: Design and build robust model evaluation frameworks, automating repetitive processes and balancing efficiency against depth when gathering evaluation metrics and feedback.

- Portfolio Monitoring: Manage resource allocation and timelines, adjusting direction flexibly based on real-time information across all data streams in your product vertical.

- External Partner Collaboration: Enhance dataset and process quality through seamless collaboration with vendors and outsourcing partners.

- Data Quality & Tooling Advancement: Establish labeling guidelines, monitor data quality, and improve tools and infrastructure to build a sustainable data operations framework.

- Internal Collaboration: Partner with Engineering and AI Model teams to align on top-priority data needs, design tools such as analytical reports and dashboards, and clearly communicate project progress.

Qualifications

Minimum

5+ years of experience working in an AI-focused data operations organization.

A proven track record of designing and executing large-scale data or evaluation projects, including gathering, labeling, and post-processing data.

The ability to analyze messy and complex data, identify overarching patterns, and distill your findings into crisp annotation guidelines or model quality reports.

Proficiency with Python, LLMs, or other popular industry tools for automation.

Excellent communication and project management skills, and the ability to support several projects simultaneously.

A foundational understanding of and interest in LLMs/VLMs and multimodal AI.

Conviction that data is the key ingredient for the performance and assessment of AI models.

Preferred

Experience in data collection and labeling for multimodal language models.

Experience in red teaming, localization testing, or other evaluation-focused fields.

Experience working with research scientists and engineers.

Expertise or interest in video-centric domains, such as sports, advertising, and content creation.