🤖 AI Summary
This work addresses the limitations of current question-answering systems, which prioritize factual correctness but fall short in educational and career guidance contexts that require reflective, pedagogically supportive responses. To bridge this gap, the authors propose a novel "mentor-style" QA paradigm and introduce MentorQA, the first multilingual long-video QA benchmark, comprising nearly 9,000 question-answer pairs drawn from 180 hours of content across four languages. Beyond factual accuracy, they define new evaluation dimensions (clarity, alignment, and learning value) to better capture pedagogical quality. Through systematic comparisons of Single-Agent, Dual-Agent, RAG, and Multi-Agent architectures, the study demonstrates that multi-agent approaches significantly outperform the others on complex topics and lower-resource languages. Finally, the research reveals a notable discrepancy between current LLM-based automatic evaluations and human judgments, highlighting the need for more nuanced assessment frameworks.
📝 Abstract
Question answering systems are typically evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long-form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single-Agent, Dual-Agent, RAG, and Multi-Agent QA architectures under controlled conditions. Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains for complex topics and lower-resource languages. We further analyze the reliability of automated LLM-based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM-SCU/MentorQA.