Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of analyzing uncurated, unlabeled, and highly noisy wild capuchin monkey behavioral videos—characterized by absent manual annotations and extremely low audio-visual signal-to-noise ratios. Method: We propose the first weakly supervised video–text alignment and retrieval framework, featuring a proxy-based data pipeline that automatically extracts semantically aligned video segments and corresponding audio-derived descriptive texts; we adapt the X-CLIP architecture with LoRA for efficient end-to-end cross-modal alignment and retrieval, requiring no human behavioral labels. Contribution/Results: Evaluated on real-world field data, our method achieves +167% and +114% improvements in Hits@5 (at 16-frame and 8-frame resolutions, respectively) over baselines, and significantly outperforms them in NDCG@K, demonstrating robust ranking across multiple behavior classes. It establishes a scalable zero-label learning paradigm for primate behavioral analysis in naturalistic settings.
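The summary reports gains in Hits@5 and NDCG@K. For readers unfamiliar with these retrieval metrics, a minimal sketch of how they are typically computed is shown below (the exact evaluation protocol used in the paper is not specified here; the function names and example inputs are illustrative, not from the paper):

```python
import numpy as np

def hits_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant clip appears among the top-k retrieved results, else 0.0."""
    return float(relevant_id in ranked_ids[:k])

def ndcg_at_k(ranked_relevances, k=5):
    """Normalized Discounted Cumulative Gain over graded relevance scores,
    in the order the system ranked them."""
    rel = np.asarray(ranked_relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(ranked_relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[: ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the relevant clip (id 9) is ranked 4th -> counted as a hit at k=5.
print(hits_at_k([3, 7, 1, 9, 2], relevant_id=9, k=5))  # 1.0
```

Averaging `hits_at_k` over all text queries yields the Hits@5 figures quoted above; NDCG@K additionally rewards placing the most relevant clips near the top of the ranking.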

📝 Abstract
Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundation models for the specific domain of capuchin monkeys, with the goal of developing computational models that help researchers retrieve relevant clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a two-fold approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text pairs from the raw videos, which are subsequently used to fine-tune a pre-trained Microsoft X-CLIP model through Low-Rank Adaptation (LoRA). We obtained an uplift in $Hits@5$ of $167\%$ for the 16-frame model and an uplift of $114\%$ for the 8-frame model on our domain data. Moreover, based on $NDCG@K$ results, our model is able to rank well most of the considered behaviors, while the tested raw pre-trained models are not able to rank them at all. The code will be made available upon acceptance.
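The abstract's key efficiency trick is LoRA: the pre-trained X-CLIP weights stay frozen, and only a low-rank update is trained. Since the paper's code is promised only upon acceptance, here is a minimal numpy sketch of the core LoRA mechanism for a single linear layer (all dimensions and the rank/alpha values below are hypothetical, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 8, 16       # illustrative sizes; rank r << d
W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus the low-rank update B @ A, scaled by alpha / r.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))
# Because B starts at zero, the adapted layer initially matches the frozen one
# exactly; training then updates only A and B (2 * r * d parameters per layer).
assert np.allclose(lora_forward(x), x @ W.T)
```

In practice this update is attached to the attention projection matrices of a transformer such as X-CLIP (e.g. via the HuggingFace `peft` library), so that only a small fraction of parameters is optimized end-to-end.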
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning video-text models for primate behavior retrieval
Training models with unlabeled raw videos and weak audio
Improving noisy video and audio content analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tune video-text models for capuchin monkeys
Use MLLMs and VLMs for noisy video and audio
Apply LoRA to adapt pre-trained X-CLIP model
Giulio Cesare Mastrocinque Santo
Institute of Mathematics and Statistics, University of São Paulo (IME-USP), Rua do Matão, 1010, São Paulo, 05508-090, São Paulo, Brazil
Patrícia Izar
Department of Experimental Psychology, Institute of Psychology, University of São Paulo (IP-USP), Av. Professor Mello Moraes, 1721, São Paulo, 05508-030, São Paulo, Brazil
Irene Delval
Department of Experimental Psychology, Institute of Psychology, University of São Paulo (IP-USP), Av. Professor Mello Moraes, 1721, São Paulo, 05508-030, São Paulo, Brazil
Victor de Napole Gregolin
Institute of Biosciences, University of São Paulo (IB-USP), Rua do Matão, 321, São Paulo, 05508-090, São Paulo, Brazil
Nina S. T. Hirata
Computer Science Department, Institute of Mathematics and Statistics, University of São Paulo
Machine Learning · Deep Learning · Pattern Recognition · Image Understanding · Image Processing