AI Summary
Current video large language models (Video-LLMs) lack the capability to discriminate question answerability, frequently generating hallucinated responses to questions beyond the video content. To address this, we propose the **Answerability Alignment framework**, the first systematic formalization and modeling of answerability judgment in video question answering. Methodologically, we design a dedicated evaluation protocol and a data generation pipeline; formulate an end-to-end answerability alignment objective grounded in video-caption pairs; and integrate an explicit refusal mechanism. Experiments demonstrate substantial improvements in both answerability detection accuracy and refusal rate for unanswerable questions. Our approach achieves state-of-the-art performance across multiple benchmarks and exhibits strong cross-dataset generalization.
Abstract
In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions, not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and to appropriately decline to answer when the question exceeds the scope of the video. We also introduce an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.
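The abstract describes an evaluation framework that measures model behavior before and after alignment. As a minimal illustrative sketch (the metric names, the `Example` record, and the exact definitions here are our assumptions, not the paper's published metric set), two natural quantities are the refusal rate on unanswerable questions (higher is better after alignment) and the false-refusal rate on answerable ones (which alignment should keep low):

```python
from dataclasses import dataclass

@dataclass
class Example:
    answerable: bool  # ground truth: can the video answer this question?
    refused: bool     # did the model decline to answer?

def answerability_metrics(examples):
    """Return (refusal rate on unanswerable questions,
    false-refusal rate on answerable questions)."""
    unanswerable = [e for e in examples if not e.answerable]
    answerable = [e for e in examples if e.answerable]
    refusal_rate = (sum(e.refused for e in unanswerable) / len(unanswerable)
                    if unanswerable else 0.0)
    false_refusal_rate = (sum(e.refused for e in answerable) / len(answerable)
                          if answerable else 0.0)
    return refusal_rate, false_refusal_rate

# Toy evaluation set: the model refuses one of two unanswerable
# questions and answers both answerable ones.
examples = [
    Example(answerable=False, refused=True),
    Example(answerable=False, refused=False),
    Example(answerable=True, refused=False),
    Example(answerable=True, refused=False),
]
print(answerability_metrics(examples))  # (0.5, 0.0)
```

A model that never refuses scores 0.0 on the first metric regardless of its video understanding, which is exactly the failure mode the abstract attributes to current Video-LLMs.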