🤖 AI Summary
This work addresses the limitations of current question-answering systems, which prioritize factual correctness but fall short in educational and career guidance contexts that require reflective, pedagogically supportive responses. To bridge this gap, the authors propose a novel "mentor-style" QA paradigm and introduce MentorQA, the first multilingual long-video QA benchmark, comprising nearly 9,000 question-answer pairs drawn from 180 hours of content across four languages. Beyond factual accuracy, they define new evaluation dimensions (clarity, alignment, and learning value) to better capture pedagogical quality. Through systematic comparisons of Single-Agent, Dual-Agent, RAG, and Multi-Agent architectures, the study demonstrates that multi-agent approaches significantly outperform the others on complex topics and lower-resource languages. Finally, the research reveals a notable discrepancy between current LLM-based automatic evaluations and human judgments, highlighting the need for more nuanced assessment frameworks.
📝 Abstract
Question answering systems are typically evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long-form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single-Agent, Dual-Agent, RAG, and Multi-Agent QA architectures under controlled conditions. Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains for complex topics and lower-resource languages. We further analyze the reliability of automated LLM-based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM-SCU/MentorQA.