🤖 AI Summary
This study addresses the challenge of generalization in image-based behavior classification for precision livestock farming under severe data scarcity and high computational cost. It evaluates three strategies—training from scratch, frozen feature extraction, and parameter-efficient fine-tuning (PEFT)—applied to DINOv3, a 6.7-billion-parameter vision foundation model, under an extreme 98:1 test-to-train ratio. Focusing on QLoRA and DoRA, the work systematically compares performance across adapter ranks (8/16/64) and target modules (q_proj only versus all linear layers). Results show that increasing adapter capacity substantially improves generalization without causing overfitting, identifying underfitting as the primary bottleneck. Notably, QLoRA applied to all linear layers at rank 64 fine-tunes only 2.72% of parameters (183 million) and reaches 83.16% accuracy in 5.8 hours, significantly outperforming ResNet-18 (72.87%), ViT-Small (61.91%), and frozen DINOv3 (76.56%). The work offers practical guidance for deploying billion-scale vision models in resource-constrained agricultural settings.
📝 Abstract
Automated behavior classification is essential for precision livestock farming but faces the twin challenges of high computational cost and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across configurations varying adapter rank (8, 16, 64) and target modules (q_proj only versus all linear layers).
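As an illustration, the strongest configuration evaluated here (QLoRA on all linear layers at rank 64) could be expressed with the Hugging Face `peft` library roughly as follows. This is a sketch, not the authors' code: the `lora_alpha` and dropout values are assumptions, since the abstract does not report them.

```python
# Sketch of the best-performing adapter setup (QLoRA, all-linear, rank 64)
# using Hugging Face `peft`. Alpha and dropout are assumed values; the
# paper's exact hyperparameters are not given in the abstract.
from peft import LoraConfig

qlora_cfg = LoraConfig(
    r=64,                         # adapter rank (the paper's best setting)
    lora_alpha=128,               # assumed scaling factor, not reported above
    target_modules="all-linear",  # adapt every linear layer, not just q_proj
    lora_dropout=0.05,            # assumed regularization value
)

# QLoRA additionally loads the frozen base model in 4-bit precision
# (e.g. bitsandbytes NF4) before attaching these adapters.

# The DoRA variant compared in the paper is exposed in peft as a flag:
dora_cfg = LoraConfig(r=64, target_modules="all-linear", use_dora=True)
```

The q_proj-only baselines from the comparison correspond to `target_modules=["q_proj"]` with the rank changed accordingly.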
With 2,160 verified training images, we assessed generalization on 211,800 test samples, a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed the alternatives: the best QLoRA configuration (all linear layers, rank 64) achieved 83.16% test accuracy while training only 2.72% of parameters (183.0M) in 5.8 hours, compared with 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but required longer training (11.0 hours).
Notably, increasing adapter capacity consistently improved generalization without inducing overfitting: reducing rank from 16 to 8 lowered test accuracy from 78.38% to 77.17%, while expanding coverage from q_proj only to all linear layers at rank 64 raised accuracy from 78.38% to 83.16%. This suggests that underfitting, rather than overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide practical guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
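The trainable-parameter figures quoted above can be sanity-checked with simple arithmetic: a rank-r LoRA adapter on a (d_in × d_out) linear layer adds r · (d_in + d_out) parameters via its two low-rank factors. The sketch below uses only totals taken from the abstract; the hidden size in the example call is an assumption for illustration, not a figure from the paper.

```python
# Back-of-the-envelope check on the numbers quoted in the abstract.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by a rank-r LoRA adapter on one linear layer
    (an r x d_in down-projection plus a d_out x r up-projection)."""
    return r * (d_in + d_out)

# Example: one square 4096-d projection (assumed size) at rank 64.
print(lora_params(4096, 4096, 64))  # 524288 extra parameters for that layer

# Reported totals: 183.0M trainable adapter parameters on a 6.7B backbone.
fraction = 183.0e6 / 6.7e9
print(f"{fraction:.2%}")  # ~2.73%, consistent with the reported 2.72% up to rounding
```

Summed over all linear layers of the backbone, adapters of this form account for the reported ~183M trainable parameters.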