SkillRater: Untangling Capabilities in Multimodal Data

📅 2026-02-12

🤖 AI Summary

This work proposes SkillRater, a framework that treats data quality as a multidimensional skill space in order to jointly improve visual understanding, OCR, and STEM reasoning, capabilities that conventional single-score filtering cannot serve simultaneously. SkillRater uses meta-learning to train task-specific scorers, one per capability, and introduces a progressive sample selection strategy that preserves diversity in early training stages and concentrates on high-value samples later. Experiments on a 2B-parameter vision-language model show improvements over an unfiltered baseline of 5.63%, 2.00%, and 3.53% on visual understanding, OCR, and STEM reasoning, respectively. The near-orthogonality of the learned scorer signals supports the independence of these capability axes and the effectiveness of optimizing them together.

📝 Abstract

Data curation methods typically assign samples a single quality score. We argue this scalar framing is fundamentally limited: when training requires multiple distinct capabilities, a monolithic scorer cannot maximize useful signals for all of them simultaneously. Quality is better understood as multidimensional, with each dimension corresponding to a capability the model must acquire. We introduce SkillRater, a framework that decomposes data filtering into specialized raters (one per capability, each trained via meta-learning on a disjoint validation objective) and composes their scores through a progressive selection rule: at each training stage, a sample is retained if any rater ranks it above a threshold that tightens over time, preserving diversity early while concentrating on high-value samples late. We validate this approach on vision-language models, decomposing quality into three capability dimensions: visual understanding, OCR, and STEM reasoning. At 2B parameters, SkillRater improves over unfiltered baselines by 5.63% on visual understanding, 2.00% on OCR, and 3.53% on STEM on held-out benchmarks. The learned rater signals are nearly orthogonal, confirming that the decomposition captures genuinely independent quality dimensions and explaining why it outperforms both unfiltered training and monolithic learned filtering.
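
The progressive selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of percentile ranks within the candidate pool, and the linear threshold schedule (with its `start` and `end` values) are all assumptions made here for illustration. The abstract states only that a sample is kept if any rater ranks it above a threshold that tightens over training.

```python
from typing import List, Sequence


def percentile_ranks(values: Sequence[float]) -> List[float]:
    """Map each score to its rank in (0, 1]; the highest score gets 1.0."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for position, index in enumerate(order):
        ranks[index] = (position + 1) / n
    return ranks


def threshold_at(stage: int, num_stages: int,
                 start: float = 0.0, end: float = 0.9) -> float:
    """Hypothetical linear schedule: the rank cutoff rises from start to end."""
    if num_stages <= 1:
        return end
    return start + (end - start) * stage / (num_stages - 1)


def select(scores_by_rater: Sequence[Sequence[float]],
           stage: int, num_stages: int) -> List[int]:
    """Return indices of samples that ANY rater ranks above the stage cutoff."""
    cutoff = threshold_at(stage, num_stages)
    ranks = [percentile_ranks(scores) for scores in scores_by_rater]
    num_samples = len(scores_by_rater[0])
    return [i for i in range(num_samples)
            if any(rater_ranks[i] > cutoff for rater_ranks in ranks)]


# Toy scores from three hypothetical raters over five samples.
visual = [0.9, 0.1, 0.2, 0.3, 0.4]
ocr = [0.1, 0.8, 0.2, 0.3, 0.4]
stem = [0.1, 0.2, 0.9, 0.3, 0.4]

early = select([visual, ocr, stem], stage=0, num_stages=3)
late = select([visual, ocr, stem], stage=2, num_stages=3)
```

In this toy pool, the early stage keeps every sample, while the late stage keeps only the samples that are the top pick of at least one rater. The "any rater" union is what lets a sample with a low visual score survive if it is valuable for OCR or STEM.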

Problem

Research questions and friction points this paper is trying to address.

multimodal data

data curation

quality assessment

capability decomposition

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional quality

capability decomposition

meta-learned raters