Bounded-Compute Multimodal Regression for Product-Rating Prediction

📅 2026-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of efficiently performing multimodal product rating prediction under stringent latency constraints, where existing vision-language models struggle to balance accuracy and speed. The authors adapt SmolVLM2-256M-Video-Instruct into a lightweight regression system by employing fixed 384×384 image inputs and truncated metadata as deterministic features, replacing the language modeling head with a two-layer MLP regressor that directly predicts scalar ratings from pooled decoder states. By integrating static global image processing with feature-level regression, the proposed approach achieves significantly better performance than dynamic chunking and generative baselines while maintaining low computational overhead. Evaluated on the LoViF 2026 challenge, it attains 0.39 PLCC and 0.40 CES, establishing a strong baseline for multimodal regression in resource-constrained settings.
📝 Abstract
Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
Problem

Research questions and friction points this paper is trying to address.

multimodal regression
product-rating prediction
bounded-compute
vision-language models
scalar regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

bounded-compute regression
vision-language models
feature-based regression
efficient multimodal learning
product-rating prediction