🤖 AI Summary
Traditional sequential reasoning architectures struggle to meet the real-time requirements of embodied AI for high-frequency perception and action generation in dynamic environments. To address this, we propose Auras—a co-designed asynchronous inference framework integrating algorithmic and systems innovations. Methodologically, Auras (1) decouples perception and generation modules and implements a controlled pipelined parallel execution mechanism; and (2) introduces a shared-context synchronization strategy to mitigate data staleness under high concurrency while preserving decision accuracy. At the systems level, it incorporates lightweight scheduling and memory optimization to enable end-to-end efficient asynchronous inference. Experimental results demonstrate that Auras achieves a 2.54× average throughput improvement over baseline serial architectures while maintaining 102.7% of the original model’s accuracy—effectively breaking the performance bottleneck inherent in sequential designs.
📝 Abstract
Embodied AI systems operate in dynamic environments that demand seamless integration of perception and generation modules to handle high-frequency inputs and outputs. Traditional sequential computation patterns, while effective at ensuring accuracy, fall short of the "thinking" frequency required for real-world applications. In this work, we present Auras, an algorithm-system co-designed inference framework that optimizes the inference frequency of embodied AI agents. Auras disaggregates perception from generation and executes the two with controlled pipeline parallelism to achieve high, stable throughput. To counter the data staleness that arises as parallelism increases, Auras establishes a shared context between perception and generation, thereby preserving the accuracy of embodied agents. Experimental results show that Auras improves throughput by 2.54x on average while retaining 102.7% of the original accuracy, demonstrating that it overcomes the constraints of sequential computation while delivering high throughput.
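The core idea of the abstract, running perception and generation as decoupled pipeline stages that coordinate through a shared context, can be illustrated with a minimal sketch. This is not the Auras implementation; it is a hypothetical Python analogy using threads, a bounded queue as the controlled pipeline, and a lock-protected latest-value store standing in for the shared context that mitigates staleness:

```python
import threading
import queue

# Hypothetical sketch: perception and generation run concurrently as
# pipeline stages. A bounded queue models "controlled" pipeline
# parallelism; a lock-protected shared context lets the generation
# stage re-read the freshest observation instead of a stale one.

class SharedContext:
    """Latest-wins store shared by perception and generation."""
    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None

    def publish(self, obs):
        with self._lock:
            self._latest = obs

    def latest(self):
        with self._lock:
            return self._latest

def perceive(ctx, work_queue, n_frames):
    # Stage 1: produce observations at high frequency.
    for i in range(n_frames):
        obs = {"frame": i}      # stand-in for a real sensor reading
        ctx.publish(obs)        # refresh the shared context
        work_queue.put(obs)     # hand off to the generation stage
    work_queue.put(None)        # sentinel: no more frames

def generate(ctx, work_queue, actions):
    # Stage 2: consume observations, but re-read the shared context so
    # a stale queue entry is upgraded to the freshest observation.
    while True:
        obs = work_queue.get()
        if obs is None:
            break
        fresh = ctx.latest()    # mitigate staleness under concurrency
        actions.append(fresh["frame"])

ctx = SharedContext()
q = queue.Queue(maxsize=4)      # bounded depth = controlled parallelism
actions = []
t1 = threading.Thread(target=perceive, args=(ctx, q, 8))
t2 = threading.Thread(target=generate, args=(ctx, q, actions))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(actions))             # one action per perceived frame
```

The bounded queue caps how far perception may run ahead of generation, while the shared context ensures each generated action reflects the most recent observation available at decision time, the same trade-off the paper addresses between throughput and staleness.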