
EmbodiedBench

Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

1University of Illinois Urbana-Champaign, 2Northwestern University, 3University of Toronto, 4Toyota Technological Institute at Chicago
*Equal contribution

Successful Examples in EmbodiedBench powered by GPT-4o: EB-Manipulation (left) and EB-Navigation (right).

Failure Examples in EmbodiedBench powered by GPT-4o: EB-Manipulation (left) and EB-Navigation (right).

Overview


We introduce EmbodiedBench, a comprehensive benchmark designed to evaluate Multi-modal Large Language Models (MLLMs) as embodied agents. While existing benchmarks have primarily focused on Large Language Models (LLMs) and high-level tasks, EmbodiedBench takes a leap forward by offering a fine-grained evaluation of MLLM-based agents across both high-level and low-level tasks, as well as six critical agent capabilities.

EmbodiedBench is more than just a benchmark; it is a multifaceted, standardized evaluation platform that not only uncovers the current challenges in embodied AI but also provides actionable insights to push the boundaries of MLLM-driven embodied intelligence.

Figure 1. Overview of EmbodiedBench. Two key features of our benchmark: various action levels and capability-oriented evaluation.

EmbodiedBench is designed with two key features that set it apart from existing benchmarks:

1. Diverse tasks with hierarchical action levels. Among the four environments, EB-ALFRED and EB-Habitat focus on high-level task decomposition and planning (e.g., "put a book on the desk"), while EB-Navigation and EB-Manipulation demand planning with low-level actions (e.g., translational/rotational control) and require precise perception and spatial reasoning; see the sketch after this list for an illustration of the two action levels.

2. Capability-oriented evaluation. Unlike previous benchmarks that primarily emphasize overall accuracy or module-specific performance, EmbodiedBench introduces a fine-grained evaluation framework that assesses six critical capabilities of embodied agents: basic task solving, commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-horizon planning.
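
For concreteness, the two action levels might be represented as in the following minimal Python sketch. The action names and formats below are illustrative assumptions, not EmbodiedBench's actual action schema.

    # Hypothetical examples of the two action levels (illustrative only;
    # not EmbodiedBench's actual action schema).

    # High-level action (EB-ALFRED, EB-Habitat): a discrete skill applied
    # to an object, produced by task decomposition and planning.
    high_level_action = {"skill": "PickUp", "object": "Book"}

    # Low-level action (EB-Navigation, EB-Manipulation): fine-grained
    # translational/rotational control, e.g., meters and degrees.
    low_level_action = {"translation_m": [0.25, 0.0, 0.0], "rotation_deg": 15.0}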



Agent Pipeline

Figure 2. Vision-driven agent pipeline used in EmbodiedBench.

To evaluate MLLMs as agents in EmbodiedBench, we design a unified embodied agent pipeline, illustrated in Figure 2. This pipeline provides a robust framework for processing multimodal inputs, reasoning through interactions, and generating structured, executable plans composed of sequential actions.
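
As a rough illustration, the loop below sketches how such a pipeline could be wired up. The environment and model interfaces here (env.reset, env.step, and the mllm.* calls) are assumptions made for this sketch, not the benchmark's actual API.

    # A minimal sketch of the vision-driven agent loop, assuming hypothetical
    # env/mllm interfaces (env.reset, env.step, mllm.describe_state, etc.).
    def run_episode(env, mllm, instruction, max_steps=30):
        obs = env.reset()
        history = []
        for _ in range(max_steps):
            # 1. Visual state description: summarize the current observation.
            state_desc = mllm.describe_state(obs)
            # 2. Reflection and reasoning over the interaction history.
            thought = mllm.reflect(instruction, state_desc, history)
            # 3. Structured output: a language plan refined into an
            #    executable plan, i.e., a list of sequential actions.
            plan = mllm.generate_plan(instruction, state_desc, thought)
            for action in plan:
                obs, done = env.step(action)
                history.append((action, obs))
                if done:
                    return True  # task completed
        return False  # ran out of steps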

Leaderboard

High-Level Tasks



Low-Level Tasks


Error Analysis

In this section, we conduct an error analysis on GPT-4o to identify its main failure modes. For each environment, we sample 10 failure episodes from each subset, resulting in a total of 110 failed episodes (11 subsets across the four environments). We find three main types of errors: perception errors, reasoning errors, and planning errors. Each error type corresponds to a specific stage of our agent pipeline: perception errors occur during the visual state description stage, reasoning errors arise in the reflection and reasoning stages, and planning errors occur during the language plan and executable plan generation stages.
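
This stage-to-error mapping can be made explicit with a small tally helper. The sketch below is illustrative only; the labels and the tally_errors helper are assumptions, not the analysis code used in the paper.

    from collections import Counter

    # Illustrative mapping of error types to the pipeline stages named above
    # (not the paper's actual analysis code).
    ERROR_STAGE = {
        "perception": "visual state description",
        "reasoning": "reflection and reasoning",
        "planning": "language/executable plan generation",
    }

    def tally_errors(episode_labels):
        """Count annotated failure episodes by error type.

        episode_labels: iterable of labels, one per failed episode,
        each one of "perception", "reasoning", or "planning".
        """
        counts = Counter(episode_labels)
        for error_type, stage in ERROR_STAGE.items():
            print(f"{error_type}: {counts.get(error_type, 0)} (stage: {stage})")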

Figure 3. Error Analysis.

Case Study


Success Examples

Failure Examples

BibTeX

@misc{yang2025embodied,
  title={EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents},
  author={Yang, Rui and Chen, Hanyang and Zhang, Junyu and Zhao, Mark and Qian, Cheng and Wang, Kangrui and Wang, Qineng and Koripella, Teja Venkat and Movahedi, Marziyeh and Li, Manling and Ji, Heng and Zhang, Huan and Zhang, Tong},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
}