Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work presents Pantagruel, a family of self-supervised encoder models for French text and speech that share a single architecture, with separate models pre-trained for each modality. Rather than predicting modality-specific targets such as text tokens or speech units, Pantagruel learns contextualized target representations in the feature space. The text models are pre-trained on Wikipedia, OSCAR, and CroissantLLM; the speech models on Multilingual LibriSpeech, LeBenchmark, and INA-100k, a newly introduced 100,000-hour corpus of French audio drawn from the archives of the Institut National de l'Audiovisuel (INA). Evaluated on downstream tasks in both modalities, including the standard French benchmarks FLUE and LeBenchmark, Pantagruel performs competitively with or better than strong baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0.

📝 Abstract
We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with Multilingual LibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from standard French benchmarks such as FLUE and LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
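
The abstract's idea of predicting targets "in the feature space" can be illustrated with a small sketch. It assumes a teacher-student setup in which the teacher's contextualized features serve as regression targets at masked positions and the teacher's weights track the student's by an exponential moving average. That setup is common in feature-space self-supervised methods but is an assumption here, not something the abstract confirms, and the names `masked_feature_loss` and `ema_update` are hypothetical.

```python
# Illustrative sketch only: the teacher-student setup and the EMA update are
# assumptions borrowed from common feature-space objectives, not details
# confirmed by the Pantagruel abstract.

def masked_feature_loss(student_feats, teacher_feats, mask):
    """Mean squared error between student and teacher feature vectors,
    computed only at positions where mask is True."""
    total, count = 0.0, 0
    for s_vec, t_vec, is_masked in zip(student_feats, teacher_feats, mask):
        if not is_masked:
            continue  # unmasked positions do not contribute to the loss
        total += sum((s - t) ** 2 for s, t in zip(s_vec, t_vec))
        count += len(s_vec)
    return total / count if count else 0.0


def ema_update(teacher_weights, student_weights, decay):
    """Return new teacher weights as an exponential moving average:
    decay * teacher + (1 - decay) * student, element by element."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


if __name__ == "__main__":
    # Three positions (text tokens or speech frames), two feature dimensions.
    teacher = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    student = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
    mask = [False, True, False]  # only the middle position was masked
    print(masked_feature_loss(student, teacher, mask))
    print(ema_update([1.0, 2.0], [0.0, 4.0], decay=0.5))
```

Because the loss compares continuous feature vectors instead of token or unit labels, the same function applies whether the positions are text tokens or speech frames, which is consistent with the shared architecture the abstract describes.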
Problem

Research questions and friction points this paper is trying to address.

self-supervised learning
multimodal representation
French text and speech
feature-space representation
unified encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

feature-space self-supervision
multimodal representation learning
French speech-text encoder
INA-100k corpus
unified architecture
Phuong-Hang Le
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Valentin Pelloin
Researcher, Institut National de l'Audiovisuel (INA)
Speech and Language Processing, Spoken Language Understanding, Machine learning, Artificial
Arnault Chatelain
CREST (École Polytechnique, ENSAE, CNRS), 5 avenue Le Chatelier, 91120 Palaiseau, France
Maryem Bouziane
Avignon Université, LIA, France
Mohammed Ghennai
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Qianwen Guan
LLF (Université Paris Cité and CNRS), UFRL Olympe de Gouges, 13 place Paul Ricoeur, 75013 Paris, France
Kirill Milintsevich
INA (Institut National de l’Audiovisuel), 4 Avenue de l’Europe, 94366 Bry-sur-Marne, France
Salima Mdhaffar
Researcher, University of Avignon
Speech Recognition, Speech Processing, Spoken Language Understanding, Self-supervised
Aidan Mannion
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Nils Defauw
Univ. Grenoble Alpes, EFELIA-MIAI, IUT2 Grenoble, LIG, 38000 Grenoble, France
Shuyue Gu
LLF (Université Paris Cité and CNRS), UFRL Olympe de Gouges, 13 place Paul Ricoeur, 75013 Paris, France
Alexandre Audibert
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Marco Dinarelli
LIG CNRS
Natural Language Processing, Deep and Machine Learning, Text Categorization, Knowledge Management
Yannick Estève
Professor in Computer Science, University of Avignon, France
Natural Language & Speech, Machine learning
Lorraine Goeuriot
Université Grenoble Alpes
Steffen Lalande
INA (Institut National de l’Audiovisuel), 4 Avenue de l’Europe, 94366 Bry-sur-Marne, France
Nicolas Hervé
INA (Institut National de l’Audiovisuel), 4 Avenue de l’Europe, 94366 Bry-sur-Marne, France
Maximin Coavoux
CNRS, LIG, Getalp, Université Grenoble Alpes
NLP, Computational linguistics
François Portet
Professor, Laboratoire d'Informatique de Grenoble, Univ Grenoble Alpes
Natural Language Processing, Ambient Intelligence, Artificial Intelligence, Context-Aware Activity and Situation Recognition
Étienne Ollion
CREST (École Polytechnique, ENSAE, CNRS), 5 avenue Le Chatelier, 91120 Palaiseau, France
Marie Candito
Maîtresse de Conférences, Université Paris Cité
natural language processing, syntactic parsing, semantic parsing, syntactico-semantic resources
Maxime Peyrard
Université Grenoble Alpes
NLP, Machine Learning, Data Science
Solange Rossato
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Benjamin Lecouteux
Professor - GETALP team - LIG - Université Grenoble Alpes
Speech Recognition, Machine Translation, self supervised models, machine learning, Confidence measures
Aurélie Nardy
Laboratoire LIDILEM - Univ. Grenoble Alpes
language acquisition, variation
Gilles Sérasset
Université Grenoble Alpes
Computer Sciences, Natural Language Processing, Computational Linguistics, CA23147 GOBLIN
Vincent Segonne
Post-doc Laboratoire d'Informatique de Grenoble - GETALP
Solène Evain
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Diandra Fabre
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
Didier Schwab
Univ. Grenoble Alpes, LIG-GETALP
Natural Language Processing, Large Language Models, Alternative and Augmentative Communication