Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Luyao Zhang¹, Ke Li¹, Yuan Ding¹, Xulong Zhao¹, Guo Yu¹, Chengwei Yan¹, Fuyu Dong¹, Jiawei Hu¹, Di Wang¹, Nan Luo¹, Gang Liu¹, Quan Wang¹

¹ Xidian University, Xi'an, China

Corresponding Author

Abstract

Quadruped robots can traverse a wide range of complex terrains with high flexibility. As highly mobile ground-based platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, and can thereby serve as real-world deployment platforms for a variety of algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a quadruped robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps users' natural-language instructions to executable embodied task units that the robot can carry out. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. Its extensible architecture supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents.

Introduction

Quadruped robots provide strong mobility, flexibility, and traversability in unstructured environments such as stairs, gravel paths, and grasslands, making them promising platforms for deploying general-purpose embodied agents. However, many existing quadruped robot systems focus on path planning, locomotion control, or task-specific policy learning, a narrow focus that limits multitask execution, extensibility, and stable deployment in open environments.

Y-BotFrame addresses these challenges by combining quadruped mobility, visual perception, and large-language-model-based semantic understanding and task planning. The framework encapsulates navigation, environmental perception, and embodied question answering into executable modules with clearly defined interfaces, while an LLM-based agent generates structured task plans from user instructions, interaction history, and environmental priors.
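
To make the division of labor described above concrete, the following is a minimal Python sketch, not the Y-BotFrame implementation. All names in it (TaskStep, ModuleRegistry, stub_planner, execute, and the module names "navigate" and "answer_question") are hypothetical and chosen for illustration. The LLM-based agent is replaced by a simple rule-based stub so that the example runs without any model or robot; only the shape of the interface is meant to carry over: capability modules registered behind a common call signature, and a planner that turns an instruction, interaction history, and environmental priors into a structured list of task units.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskStep:
    """One executable task unit: which module to call and with what arguments."""
    module: str
    args: Dict[str, str] = field(default_factory=dict)


class ModuleRegistry:
    """Maps module names to callables so new capabilities can be plugged in."""

    def __init__(self) -> None:
        self._modules: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, handler: Callable[..., str]) -> None:
        self._modules[name] = handler

    def run(self, step: TaskStep) -> str:
        if step.module not in self._modules:
            raise KeyError(f"unknown module: {step.module}")
        return self._modules[step.module](**step.args)


def stub_planner(
    instruction: str,
    history: List[str],
    priors: Dict[str, List[str]],
) -> List[TaskStep]:
    """Hypothetical stand-in for the LLM agent.

    A real system would prompt a language model with the instruction,
    history, and priors; here simple keyword rules produce the plan.
    The history argument is accepted but unused by this stub.
    """
    text = instruction.lower()
    plan: List[TaskStep] = []
    # Environmental prior (assumed format): a list of named places.
    for place in priors.get("known_places", []):
        if place in text:
            plan.append(TaskStep("navigate", {"target": place}))
            break
    # Treat anything phrased as a question as embodied question answering.
    if "what" in text or "?" in text:
        plan.append(TaskStep("answer_question", {"question": instruction}))
    return plan


def execute(plan: List[TaskStep], registry: ModuleRegistry) -> List[str]:
    """Run each task unit in order and collect the modules' results."""
    return [registry.run(step) for step in plan]


# Placeholder modules; real ones would drive the robot or a vision model.
registry = ModuleRegistry()
registry.register("navigate", lambda target: f"arrived at {target}")
registry.register("answer_question", lambda question: f"answering: {question}")

plan = stub_planner(
    "Go to the kitchen and tell me what you see",
    history=[],
    priors={"known_places": ["kitchen", "lab"]},
)
results = execute(plan, registry)
```

Under these assumptions, adding a capability amounts to one more register call, and the planner only needs to know each module's name and arguments rather than how the module is implemented.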