🤖 AI Summary
This work addresses semantic aliasing in animal behavior, where the same external signal (for example, a cat's purr) conveys different intentions depending on the animal's physiological state. It proposes Meow-Omni 1, described as the first open-source quad-modal multimodal large language model purpose-built for computational ethology. The model integrates video, audio, high-frequency physiological time-series signals, and text through a physiologically grounded cross-modal alignment framework, which embeds specialized scientific encoders into a unified backbone to support context-aware intent reasoning. Evaluated on MeowBench, an expert-verified quad-modal benchmark, the model achieves 71.16% intent-recognition accuracy, substantially outperforming vision-language and omni-modal baselines. The authors release the model weights, training framework, and the Meow-10K dataset, aiming to establish a scalable paradigm for inter-species intent understanding.
📝 Abstract
Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat's purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline, including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.
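The idea of semantic aliasing can be illustrated with a toy sketch. This is not the authors' method or data: the observations, the `context_a`/`context_b` labels, the `state_*` labels, and the helper `aliased_keys` are all hypothetical placeholders. The sketch only shows that a mapping keyed on the external signal alone can be ambiguous, while a mapping keyed on signal plus physiological context need not be.

```python
from collections import defaultdict

# Hypothetical toy observations: (external_signal, physiological_context, latent_state).
# The context and state labels are illustrative placeholders, not Meow-10K categories.
OBSERVATIONS = [
    ("purr", "context_a", "state_1"),
    ("purr", "context_b", "state_2"),
    ("hiss", "context_a", "state_3"),
    ("hiss", "context_b", "state_3"),
]


def aliased_keys(observations, key_fn):
    """Return the keys that map to more than one latent state."""
    states_by_key = defaultdict(set)
    for signal, context, state in observations:
        states_by_key[key_fn(signal, context)].add(state)
    return sorted(k for k, states in states_by_key.items() if len(states) > 1)


# Keyed on the external signal only: "purr" is ambiguous (aliased).
signal_only = aliased_keys(OBSERVATIONS, lambda signal, context: signal)

# Keyed on signal plus physiological context: no key is ambiguous in this toy data.
signal_and_context = aliased_keys(OBSERVATIONS, lambda signal, context: (signal, context))

print(signal_only)         # ['purr']
print(signal_and_context)  # []
```

In this toy setting, any predictor that sees only the signal must get at least one of the two "purr" cases wrong, which is the limitation the abstract attributes to models without access to physiological time-series data.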