Type and Complexity Signals in Multilingual Question Representations

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how multilingual Transformer models (specifically Glot500-m) encode morphosyntactic properties and structural complexity of questions. Method: We introduce QTC—the first multilingual question dataset covering seven languages, annotated for question type and for complexity metrics including dependency distance, tree depth, and lexical density—and propose an extended hierarchical regression probing framework with selectivity controls to systematically compare frozen contextual representations, subword TF-IDF baselines, and fine-tuned models on cross-lingual question understanding. Contribution/Results: Statistical features remain competitive in languages with explicit morphological marking, whereas neural probes better capture fine-grained structural complexity. Parameter updates during fine-tuning do not significantly degrade pretrained linguistic knowledge, indicating that contextual representations remain essential for certain complexity dimensions.

📝 Abstract
This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.
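The probing setup described in the abstract, a regression probe on frozen representations with a selectivity control, can be sketched as follows. This is a minimal illustration, not the paper's implementation: synthetic vectors stand in for Glot500-m hidden states, ridge regression stands in for the probe family, and the target is a single complexity metric such as mean dependency length. The selectivity control refits the same probe on shuffled targets, so the gap between real and control scores estimates how much of the probe's accuracy reflects information in the representations rather than probe capacity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen layer representations; in practice
# these would be hidden states extracted from one Glot500-m layer.
n_sent, dim = 500, 64
X = rng.normal(size=(n_sent, dim))
# Hypothetical complexity target (e.g. mean dependency length),
# linearly recoverable from a few dimensions plus noise.
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=n_sent)

def probe_r2(X, y, seed=0):
    """Fit a ridge regression probe and return held-out R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return r2_score(y_te, probe.predict(X_te))

real = probe_r2(X, y)
# Selectivity control: the same probe on shuffled targets should
# score near zero; the real-minus-control gap is the selectivity.
control = probe_r2(X, rng.permutation(y))
selectivity = real - control
```

Running the same procedure layer by layer, and against a TF-IDF feature matrix in place of `X`, would reproduce the kind of comparison the abstract describes.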
Problem

Research questions and friction points this paper is trying to address.

Analyzing multilingual transformer representations of morphosyntactic question properties
Comparing neural probes with statistical baselines for question classification
Evaluating when contextual representations outperform traditional linguistic features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces QTC dataset with multilingual complexity metrics
Extends probing methods with regression and selectivity controls
Compares neural probes against statistical and fine-tuned baselines
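The statistical baseline named above can be sketched in a few lines. This is an illustrative toy, not the paper's pipeline: character n-gram TF-IDF approximates subword features, the four English examples and the "wh"/"polar" labels are invented stand-ins for QTC annotations, and the classifier is plain logistic regression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy English examples; the actual QTC dataset spans seven languages.
questions = [
    "What is the capital of France?",
    "Where does the train leave from?",
    "Did you finish the report?",
    "Is this seat taken?",
]
labels = ["wh", "wh", "polar", "polar"]

# Character n-gram TF-IDF as a rough stand-in for subword features.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(questions, labels)
pred = baseline.predict(["Who wrote this book?"])[0]
```

In languages with explicit question marking (particles, clitics, fixed wh-words), surface features like these carry much of the type signal, which is consistent with the paper's finding that such baselines remain competitive there.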