Weakly-supervised Latent Models for Task-specific Visual-Language Control

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
In hazardous environments, AI agents struggle to accurately map natural language instructions to visual-space control actions; existing large language models (LLMs) directly driving visual control achieve only 58% success. Method: We propose a lightweight, task-specific latent dynamics model that learns action transitions in a shared latent space using only goal-state supervision. To stabilize training, we introduce global action embeddings and auxiliary losses. Our approach integrates weakly supervised learning, latent-space modeling, and vision-language alignment: an LLM parses instructions to guide latent-state prediction for control. Contribution/Results: Unlike conventional world models, our method requires neither massive datasets nor high computational resources. On spatial alignment tasks, it achieves 71% success—significantly outperforming baselines—and demonstrates strong generalization to unseen images and instructions.

📝 Abstract
Autonomous inspection in hazardous environments requires AI agents that can interpret high-level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58% success in this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data and compute intensive. To address this, we propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision. The model leverages global action embeddings and complementary training losses to stabilize learning. In experiments, our approach achieves 71% success and generalizes to unseen images and instructions, highlighting the potential of compact, domain-specific latent dynamics models for spatial alignment in autonomous inspection.
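The abstract's idea of using the world model as a tool, rolling out each candidate action in latent space and picking the one that lands closest to the goal, can be sketched as follows. This is a minimal illustration under assumed names and values: the discrete action set, latent dimension, and the per-action latent shifts are all placeholders, not the paper's learned weights.

```python
import numpy as np

LATENT_DIM = 8
# Hypothetical discrete camera actions and their (pretend-learned)
# global latent shifts -- illustrative unit vectors, not real weights.
action_shift = {
    "up":    np.eye(LATENT_DIM)[0],
    "down":  -np.eye(LATENT_DIM)[0],
    "left":  np.eye(LATENT_DIM)[1],
    "right": -np.eye(LATENT_DIM)[1],
}

def rollout_and_select(z_current, z_goal):
    """Use the latent dynamics model as a tool: simulate each candidate
    action's predicted latent shift, then return the action whose
    predicted next state is closest to the goal latent."""
    scores = {a: np.linalg.norm(z_current + s - z_goal)
              for a, s in action_shift.items()}
    return min(scores, key=scores.get)

# Toy query: the goal latent sits one 'down' shift away from the
# current latent, so the rollout should select 'down'.
z_now = np.zeros(LATENT_DIM)
z_goal = z_now + action_shift["down"]
print(rollout_and_select(z_now, z_goal))  # -> down
```

In the paper's setting the goal latent would come from an LLM-parsed instruction and a visual encoder; here both are reduced to fixed vectors to keep the rollout-and-select loop visible.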
Problem

Research questions and friction points this paper is trying to address.

Improve spatial grounding for autonomous inspection agents
Reduce data/compute requirements of conventional world models
Enable generalization to unseen images and instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-specific latent dynamics model for visual-language control
Goal-state supervised learning in shared latent space
Global action embeddings with complementary training losses
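The three ingredients above, goal-state-only supervision, a shared latent space, and global action embeddings, can be combined in a toy NumPy sketch. Everything here is an assumption for illustration: the dimensions, the synthetic data, the purely linear dynamics, and the manual-gradient training loop stand in for the paper's actual model and losses.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, LATENT_DIM, EMB_DIM = 4, 8, 6  # toy sizes, not the paper's

# Global action embeddings: one learned vector per discrete action,
# shared across all states.
action_emb = rng.normal(scale=0.1, size=(NUM_ACTIONS, EMB_DIM))
# Linear map from an action embedding to a latent-space shift.
W = rng.normal(scale=0.1, size=(EMB_DIM, LATENT_DIM))

# Synthetic data: (start latent, action, goal latent) triples.
# Only the goal state supervises learning -- no intermediate frames.
true_shifts = rng.normal(size=(NUM_ACTIONS, LATENT_DIM))
z0 = rng.normal(size=(64, LATENT_DIM))
acts = rng.integers(0, NUM_ACTIONS, size=64)
z_goal = z0 + true_shifts[acts]

initial_mse = float(np.mean((z0 + action_emb[acts] @ W - z_goal) ** 2))

lr = 0.2
for _ in range(2000):
    pred = z0 + action_emb[acts] @ W          # predicted next latent
    err = pred - z_goal                       # goal-state supervision
    # Manual gradients of the mean squared goal error.
    grad_W = action_emb[acts].T @ err / len(z0)
    grad_emb = np.zeros_like(action_emb)
    np.add.at(grad_emb, acts, err @ W.T / len(z0))
    W -= lr * grad_W
    action_emb -= lr * grad_emb

mse = float(np.mean((z0 + action_emb[acts] @ W - z_goal) ** 2))
print(f"goal-prediction MSE: {initial_mse:.3f} -> {mse:.4f}")
```

The sketch omits the paper's auxiliary stabilization losses and its vision-language encoder; it only shows that action embeddings plus a shared latent shift can be fit from goal states alone.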
Xian Yeow Lee
Hitachi America Ltd.
Machine Learning · Deep Learning · Reinforcement Learning · Data Science · Engineering
Lasitha Vidyaratne
Industrial AI Lab, Hitachi America, Ltd.
Gregory Sin
Industrial AI Lab, Hitachi America, Ltd.
Ahmed Farahat
Industrial AI Lab, Hitachi America, Ltd.
Chetan Gupta
Industrial AI Lab, Hitachi America, Ltd.