Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling robots to accurately interpret and execute free-form natural language instructions under resource-constrained conditions in real-world environments. The authors propose a lightweight, fully edge-deployable framework that parses high-level instructions into atomic action sequences and generates precise control trajectories through visual perception. The approach integrates a compact BiLSTM with a multi-head attention-based autoencoder for fine-grained instruction parsing, coupled with a YOLOv8-based scene analyzer and a Dynamic Adaptive Trajectory Radial Network (DATRN) for vision-guided, real-time trajectory generation. Evaluated on a custom dataset, the system achieves 91.5% accuracy in sub-action prediction, an overall task success rate of 90% across four categories, sub-action inference within 3.8 seconds, and end-to-end execution in 30–60 seconds.

📝 Abstract
Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire system runs on a modest system with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall 90% success rate; sub-action inference completes in under 3.8 s, with end-to-end execution taking 30–60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
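The first stage described above maps a free-form command to an ordered sequence of atomic actions. A minimal sketch of that interface, assuming hypothetical task templates for the four evaluated task categories (the rule-based lookup here only stands in for the paper's BiLSTM + attention autoencoder):

```python
# Hypothetical sketch of the instruction-to-actions interface.
# The keyword matching below is a stand-in for Instruct2Act's
# BiLSTM + multi-head-attention autoencoder; the atomic-action
# vocabulary and templates are illustrative, not from the paper.

# Ordered sub-action sequences for the four evaluated task categories.
TASK_TEMPLATES = {
    "pick-place": ["reach", "grasp", "move", "place"],
    "pick-pour":  ["reach", "grasp", "move", "pour", "place"],
    "wipe":       ["reach", "grasp", "wipe", "place"],
    "pick-give":  ["reach", "grasp", "move", "give"],
}

def parse_instruction(instruction: str) -> list[str]:
    """Map a free-form command to an ordered atomic-action sequence."""
    text = instruction.lower()
    if "pour" in text:
        return TASK_TEMPLATES["pick-pour"]
    if "wipe" in text:
        return TASK_TEMPLATES["wipe"]
    if "give" in text or "hand" in text:
        return TASK_TEMPLATES["pick-give"]
    # Default: treat the command as pick-and-place.
    return TASK_TEMPLATES["pick-place"]

if __name__ == "__main__":
    print(parse_instruction("Pour the water into the cup"))
```

In the full pipeline, each predicted sub-action would then be grounded by the vision module and handed to DATRN for trajectory generation; that second stage is omitted here.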
Problem

Research questions and friction points this paper is trying to address.

robotic manipulation
human instruction
natural language understanding
on-device execution
action sequencing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruct2Act
Robot Action Network (RAN)
Dynamic Adaptive Trajectory Radial Network (DATRN)
On-device manipulation
Instruction-to-action parsing
Archit Sharma
M.Tech. Student at IIT Mandi
Robotics and AI
Dharmendra Sharma
Indian Institute of Technology Mandi, India
John Rebeiro
Indian Institute of Technology Mandi, India
Peeyush Thakur
Indian Institute of Technology Mandi, India
Narendra Dhar
Indian Institute of Technology Mandi, India
Laxmidhar Behera
IIT Kanpur
Intelligent Systems and Control, Robotics, Soft Computing