MAviS: A Multimodal Conversational Assistant For Avian Species

📅 2026-03-07
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing general-purpose multimodal large language models exhibit limited performance in fine-grained bird recognition and cross-modal question answering, falling short of the demands of ecological monitoring. To address this gap, this work introduces MAviS-Dataset, the first large-scale multimodal avian dataset encompassing images, audio recordings, and textual descriptions, alongside MAviS-Chat, a domain-adaptive multimodal dialogue model designed for deep integration of audio, visual, and textual modalities to enable fine-grained species understanding. The study also presents MAviS-Bench, a dedicated evaluation benchmark. Experimental results demonstrate that MAviS-Chat significantly outperforms current open-source models, such as MiniCPM-o-2.6, on this benchmark, thereby validating the effectiveness and practical utility of domain-specific multimodal large language models for ecological applications.

📝 Abstract
Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
avian species
fine-grained understanding
species-specific question answering
ecological monitoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal LLM
avian species dataset
instruction tuning
cross-modal reasoning
ecological AI
👥 Authors
Yevheniia Kryklyvets
Mohamed bin Zayed University of Artificial Intelligence
Mohammed Irfan Kurpath
Mohamed bin Zayed University of Artificial Intelligence
Sahal Shaji Mullappilly
PhD Computer Vision Student, MBZUAI
Vision Language Models · Computer Vision · Object Detection · Real-time models
Jinxing Zhou
Mohamed bin Zayed University of Artificial Intelligence
Fahad Shahbaz Khan
Mohamed bin Zayed University of Artificial Intelligence
Rao Anwer
Mohamed bin Zayed University of Artificial Intelligence
Salman Khan
MBZUAI, Australian National University
Computer Vision · Machine Learning · Generative AI · AI4Science
Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision · Large Multimodal Models · LLM · Healthcare Foundation Model · Conversational Assistant