Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

📅 2026-05-28

🤖 AI Summary

This study investigates how well supervised fine-tuning works for screen-conditioned action prediction and how sensitive the outcome is to the choice of base model. Using the PiSAR dataset, the authors fine-tune multimodal large language models, including Qwen3-VL-8B-Instruct and Gemma-4-26B-A4B-IT, and compare them with zero-shot frontier models within a unified evaluation framework, with semantic similarity (sem_sim) as the primary metric. The results indicate that the base model strongly influences fine-tuning outcomes: fine-tuned Qwen3-VL achieves a sem_sim of 0.783 (with 79% of samples at or above 0.7), substantially outperforming Claude Opus 4.7 (0.459) and GPT-5.5 (0.482), whereas fine-tuned Gemma reaches only 0.441, in the same band as the zero-shot frontier baselines. These results suggest that fine-tuning strategies need to be tailored to the specific model being tuned.

📝 Abstract

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. We report two findings. First, the frontier zero-shot baselines Claude Opus 4.7 and GPT-5.5 reach sem_sim 0.459 and 0.482 respectively, whereas a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline; the gap in mean sem_sim is at least 0.30 absolute on the same test set. Second, the same training data and recipe applied to Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.
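The two headline numbers per model, mean sem_sim and the share of rows with sem_sim >= 0.7, can be reproduced from a list of per-row scores. The following is a minimal sketch, not the paper's scoring pipeline: the abstract does not define sem_sim, so the `cosine` helper assumes it is a cosine similarity between embedding vectors of the predicted and reference text, and the function names and toy scores are hypothetical. Only the reported means (0.783, 0.482, 0.459) come from the source.

```python
from math import sqrt
from statistics import mean


def cosine(u, v):
    """Cosine similarity of two equal-length vectors.

    ASSUMPTION: stands in for sem_sim, which the abstract does not define.
    Returns 0.0 if either vector has zero norm.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def summarize(scores, threshold=0.7):
    """Aggregate per-row scores into a mean and a pass rate at `threshold`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "n": len(scores),
        "mean_sem_sim": mean(scores),
        "pass_rate": passed / len(scores),
    }


# Toy per-row scores (hypothetical, not taken from PiSAR).
toy_scores = [0.9, 0.8, 0.75, 0.6]
toy_summary = summarize(toy_scores)

# Gaps between the reported means (values from the abstract).
gap_vs_gpt = round(0.783 - 0.482, 3)
gap_vs_claude = round(0.783 - 0.459, 3)
```

On the reported means, the fine-tuned Qwen leads GPT-5.5 by 0.301 and Claude Opus 4.7 by 0.324, which matches the roughly 0.30 absolute gap quoted in the abstract. Reporting the pass rate alongside the mean separates a model that is moderately close on every row from one that is very close on most rows and far off on the rest.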

Problem

Research questions and friction points this paper is trying to address.

screen-conditioned action prediction

supervised fine-tuning

model architecture sensitivity

behavioral rationale

PiSAR benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

architecture-sensitive fine-tuning

screen-conditioned action prediction

PiSAR benchmark