Weakly-supervised Latent Models for Task-specific Visual-Language Control

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
In hazardous environments, AI agents struggle to accurately map natural language instructions to visual-space control actions; existing large language models (LLMs) directly driving visual control achieve only 58% success. Method: We propose a lightweight, task-specific latent dynamics model that learns action transitions in a shared latent space using only goal-state supervision. To stabilize training, we introduce global action embeddings and auxiliary losses. Our approach integrates weakly supervised learning, latent-space modeling, and vision-language alignment: an LLM parses instructions to guide latent-state prediction for control. Contribution/Results: Unlike conventional world models, our method requires neither massive datasets nor high computational resources. On spatial alignment tasks, it achieves 71% success—significantly outperforming baselines—and demonstrates strong generalization to unseen images and instructions.

📝 Abstract
Autonomous inspection in hazardous environments requires AI agents that can interpret high-level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58% success in this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data and compute intensive. To address this, we propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision. The model leverages global action embeddings and complementary training losses to stabilize learning. In experiments, our approach achieves 71% success and generalizes to unseen images and instructions, highlighting the potential of compact, domain-specific latent dynamics models for spatial alignment in autonomous inspection.
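The abstract's idea of using the world model as a tool, rolling out each candidate action in latent space and picking the one that lands closest to the goal, can be sketched as follows. This is a minimal illustration under assumed names and values: the discrete action set, latent dimension, and the per-action latent shifts are all placeholders, not the paper's learned weights.

```python
import numpy as np

LATENT_DIM = 8
# Hypothetical discrete camera actions and their (pretend-learned)
# global latent shifts -- illustrative unit vectors, not real weights.
action_shift = {
    "up":    np.eye(LATENT_DIM)[0],
    "down":  -np.eye(LATENT_DIM)[0],
    "left":  np.eye(LATENT_DIM)[1],
    "right": -np.eye(LATENT_DIM)[1],
}

def rollout_and_select(z_current, z_goal):
    """Use the latent dynamics model as a tool: simulate each candidate
    action's predicted latent shift, then return the action whose
    predicted next state is closest to the goal latent."""
    scores = {a: np.linalg.norm(z_current + s - z_goal)
              for a, s in action_shift.items()}
    return min(scores, key=scores.get)

# Toy query: the goal latent sits one 'down' shift away from the
# current latent, so the rollout should select 'down'.
z_now = np.zeros(LATENT_DIM)
z_goal = z_now + action_shift["down"]
print(rollout_and_select(z_now, z_goal))  # -> down
```

In the paper's setting the goal latent would come from an LLM-parsed instruction and a visual encoder; here both are reduced to fixed vectors to keep the rollout-and-select loop visible.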
Problem

Research questions and friction points this paper is trying to address.

Improve spatial grounding for autonomous inspection agents
Reduce data/compute requirements of conventional world models
Enable generalization to unseen images and instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-specific latent dynamics model for visual-language control
Goal-state supervised learning in shared latent space
Global action embeddings with complementary training losses
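The three ingredients above, goal-state-only supervision, a shared latent space, and global action embeddings, can be combined in a toy NumPy sketch. Everything here is an assumption for illustration: the dimensions, the synthetic data, the purely linear dynamics, and the manual-gradient training loop stand in for the paper's actual model and losses.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, LATENT_DIM, EMB_DIM = 4, 8, 6  # toy sizes, not the paper's

# Global action embeddings: one learned vector per discrete action,
# shared across all states.
action_emb = rng.normal(scale=0.1, size=(NUM_ACTIONS, EMB_DIM))
# Linear map from an action embedding to a latent-space shift.
W = rng.normal(scale=0.1, size=(EMB_DIM, LATENT_DIM))

# Synthetic data: (start latent, action, goal latent) triples.
# Only the goal state supervises learning -- no intermediate frames.
true_shifts = rng.normal(size=(NUM_ACTIONS, LATENT_DIM))
z0 = rng.normal(size=(64, LATENT_DIM))
acts = rng.integers(0, NUM_ACTIONS, size=64)
z_goal = z0 + true_shifts[acts]

initial_mse = float(np.mean((z0 + action_emb[acts] @ W - z_goal) ** 2))

lr = 0.2
for _ in range(2000):
    pred = z0 + action_emb[acts] @ W          # predicted next latent
    err = pred - z_goal                       # goal-state supervision
    # Manual gradients of the mean squared goal error.
    grad_W = action_emb[acts].T @ err / len(z0)
    grad_emb = np.zeros_like(action_emb)
    np.add.at(grad_emb, acts, err @ W.T / len(z0))
    W -= lr * grad_W
    action_emb -= lr * grad_emb

mse = float(np.mean((z0 + action_emb[acts] @ W - z_goal) ** 2))
print(f"goal-prediction MSE: {initial_mse:.3f} -> {mse:.4f}")
```

The sketch omits the paper's auxiliary stabilization losses and its vision-language encoder; it only shows that action embeddings plus a shared latent shift can be fit from goal states alone.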
Xian Yeow Lee
Hitachi America Ltd.
Machine Learning · Deep Learning · Reinforcement Learning · Data Science · Engineering
Lasitha Vidyaratne
Industrial AI Lab, Hitachi America, Ltd.
Gregory Sin
Industrial AI Lab, Hitachi America, Ltd.
Ahmed Farahat
Industrial AI Lab, Hitachi America, Ltd.
Chetan Gupta
Industrial AI Lab, Hitachi America, Ltd.