Y-BotFrame: An Extensible Embodied Agent Framework for Quadruped Robot Assistants

Luyao Zhang¹, Ke Li¹, Yuan Ding¹, Xulong Zhao¹, Guo Yu¹, Chengwei Yan¹, Fuyu Dong¹, Jiawei Hu¹, Di Wang¹, Nan Luo¹, Gang Liu¹, Quan Wang¹
¹ Xidian University, Xi'an, China
Corresponding Author

Figure 1. Overview of Y-BotFrame. The proposed system integrates multimodal human-robot interaction, an LLM-based task planner, executable tool modules, and robot-side execution feedback to support instruction-driven embodied tasks on a quadruped robot.


Abstract

Quadruped robots can traverse a wide range of complex terrains with high flexibility. As highly mobile ground-based intelligent platforms, they can be equipped with modules for navigation control, environmental perception, and intelligent interaction, thereby serving as real-world mobile deployment platforms for various algorithms. In this paper, we introduce Y-BotFrame, an extensible embodied platform that turns a quadruped robot into an intelligent ground assistant. Y-BotFrame integrates multimodal perception capabilities, including speech, vision, and LiDAR, and employs a large language model as the cognitive core for environmental understanding, contextual reasoning, and task planning. The system maps users' natural-language instructions into executable embodied task units that the robot can carry out. Y-BotFrame supports natural interaction through voice commands and visual feedback, removing the need for a remote controller and enabling efficient human-robot collaboration. Its extensible design supports plug-and-play integration of new functional modules as well as modular upgrades and iterative development, offering a reference implementation for the real-world deployment of general-purpose, instruction-driven embodied agents.

Introduction

Quadruped robots provide strong mobility, flexibility, and traversability in unstructured environments such as stairs, gravel paths, and grasslands, making them promising platforms for deploying general-purpose embodied agents. However, many existing quadruped robot systems focus on path planning, locomotion control, or task-specific policy learning, which limits multitask execution, extensibility, and stable deployment in open environments.

Y-BotFrame addresses these challenges by combining quadruped mobility, visual perception, and large-language-model-based semantic understanding and task planning. The framework encapsulates navigation, environmental perception, and embodied question answering into executable modules with clearly defined interfaces, while an LLM-based agent generates structured task plans from user instructions, interaction history, and environmental priors.
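
To make the idea of executable modules with defined interfaces more concrete, the following is a minimal Python sketch and not the Y-BotFrame implementation. It assumes a hypothetical registry (`TOOLS`, `register_tool`), a hypothetical plan format (`PlanStep`), and stub modules (`navigate`, `answer_question`) that only return placeholder feedback. The hand-written plan stands in for the structured output that an LLM-based agent might produce; no LLM or robot API is called.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolResult:
    """Execution feedback returned by a module (hypothetical structure)."""
    ok: bool
    message: str


# Hypothetical registry: tool name -> callable taking keyword arguments.
TOOLS: Dict[str, Callable[..., ToolResult]] = {}


def register_tool(name: str):
    """Decorator that adds a module to the registry under `name`."""
    def wrap(fn: Callable[..., ToolResult]) -> Callable[..., ToolResult]:
        TOOLS[name] = fn
        return fn
    return wrap


@register_tool("navigate")
def navigate(target: str) -> ToolResult:
    # Stub: a real module would call the robot's navigation stack here.
    return ToolResult(True, f"arrived at {target}")


@register_tool("answer_question")
def answer_question(question: str) -> ToolResult:
    # Stub: a real module would query perception and a language model.
    return ToolResult(True, f"(stub answer to: {question})")


@dataclass
class PlanStep:
    """One step of a structured task plan (assumed format)."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


def execute_plan(plan: List[PlanStep]) -> List[ToolResult]:
    """Run steps in order; stop at the first failure or unknown tool."""
    results: List[ToolResult] = []
    for step in plan:
        fn = TOOLS.get(step.tool)
        if fn is None:
            results.append(ToolResult(False, f"unknown tool: {step.tool}"))
            break
        result = fn(**step.args)
        results.append(result)
        if not result.ok:
            break
    return results


# Hand-written plan standing in for the planner's structured output.
plan = [
    PlanStep("navigate", {"target": "kitchen"}),
    PlanStep("answer_question", {"question": "Is the stove on?"}),
]
feedback = execute_plan(plan)
for item in feedback:
    print(item.ok, item.message)
```

In this sketch, adding a capability amounts to registering one more function, which is one simple way to read the plug-and-play property described above. The returned `ToolResult` list plays the role of execution feedback that a planner could inspect before deciding on further steps.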