🤖 AI Summary
This study addresses the challenge of balancing domain-specific expertise and cross-task generalization in Brazilian legal retrieval, where heterogeneous tasks (case law search, legislative document retrieval, and question-based search) demand both legal precision and broader adaptability. To tackle this, the authors propose a mixed fine-tuning strategy based on Qwen3-Embedding-4B that combines legal corpora with the Portuguese SQuAD-pt question-answering data. This approach preserves strong performance on legal tasks while improving generalization for question-based retrieval. Across six datasets (five Brazilian legal datasets from the JUÁ leaderboard plus the Quati benchmark), the mixed model reaches an average NDCG@10 of 0.447, MRR@10 of 0.595, and MAP@10 of 0.308, with the largest gain on Quati compared to the model fine-tuned exclusively on legal data.
📝 Abstract
Brazilian legal retrieval is heterogeneous, covering case law, legislation, and question-based search. This makes training dense retrievers a trade-off between stronger domain specialization and broader robustness across types of search. In this paper, we explore this trade-off using three training setups based on Qwen3-Embedding-4B: a base model with no fine-tuning, a version trained only on legal data, and a mixed setup that combines legal data with the supervised SQuAD-pt dataset. We evaluate these models on five legal datasets from the JUÁ leaderboard, along with the Quati dataset as an extra Portuguese retrieval benchmark to test out-of-domain generalization. The legal-only model performs best on the most specialized legal tasks. The mixed setup keeps strong performance on legal data while offering a better overall balance, improving average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308 across all six datasets. The biggest improvement appears on Quati, where the mixed model clearly outperforms the legal-only one. Overall, the results show that legal-only and mixed training lead to different strengths: the first is better for specialization, while the second is more robust across different types of search, especially question-based ones. Both adapted models are available on Hugging Face.
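For readers less familiar with the three reported metrics, the following is a minimal per-query sketch of NDCG@10, MRR@10, and MAP@10. It is not the evaluation code used in the paper: the function names, the toy document IDs, and the choice to normalize average precision by the total number of relevant documents (one common convention; others divide by min(relevant, k)) are all assumptions made for illustration. Leaderboard-style averages are typically obtained by computing these values per query and taking the mean within each dataset.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query; `relevance` maps doc id -> graded gain."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal ranking: the highest gains placed first.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked_ids, relevance, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k]):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / (rank + 1)
    return 0.0


def ap_at_k(ranked_ids, relevance, k=10):
    """Average precision@k (assumption: normalized by all relevant docs)."""
    n_relevant = sum(1 for gain in relevance.values() if gain > 0)
    if n_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k]):
        if relevance.get(doc_id, 0) > 0:
            hits += 1
            precision_sum += hits / (rank + 1)
    return precision_sum / n_relevant


# Hypothetical toy query: two relevant documents, one retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
qrels = {"d1": 1, "d2": 1}
print(ndcg_at_k(ranked, qrels), mrr_at_k(ranked, qrels), ap_at_k(ranked, qrels))
```

In this toy case the first relevant document appears at rank 2, so the reciprocal rank is 0.5, while NDCG and average precision are lower because the second relevant document is never retrieved.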