EmbodiedBench Challenge

Comprehensive Benchmarking of Vision-Driven Embodied Agents

🏛️ Hosted by: Foundation Models Meet Embodied Agents Workshop, CVPR 2026.

We are organizing the EmbodiedBench Challenge, a competition built on the EmbodiedBench benchmark for evaluating Multi-modal Large Language Models (MLLMs) as vision-driven embodied agents. Participants will first be ranked by their results on the open-source benchmark. The top 5–10 teams will then be re-ranked on a held-out test set by task success rates across four environments (EB-ALFRED, EB-Habitat, EB-Navigation, EB-Manipulation). More details will be released soon.

Quick Links:

Submission Portal: EvalAI (Coming Soon)  |  Dataset & Code: GitHub  |  Dataset: Hugging Face

Challenge Overview

Goal

Given visual observations and language instructions in embodied environments, develop vision-driven agents that can plan and execute tasks. Models will be evaluated on their ability to understand instructions, reason about the environment, and generate executable action sequences across both high-level tasks (EB-ALFRED, EB-Habitat) and low-level tasks (EB-Navigation, EB-Manipulation).

What You Do

  1. Train / fine-tune on the EmbodiedBench training dataset.
  2. Develop and validate on the EmbodiedBench validation set.
  3. Submit your models through EvalAI; they will be evaluated on a hidden, held-out test set (more instructions coming soon).

Data Splits

  • 📚 Train: EmbodiedBench training data across the four environments. Public.
  • 🔬 Validation: EmbodiedBench validation set for development. Public.
  • 🏆 Test (Held-out): Final evaluation set. Coming Soon.

Dataset: Data can be found at huggingface.co/EmbodiedBench
Format & loading: Please refer to the official instructions in the EmbodiedBench repository.
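
For orientation, here is a minimal loading sketch using the Hugging Face datasets library. The dataset identifier and split name below are assumptions for illustration; defer to the repository instructions for the authoritative procedure.

    # Minimal loading sketch using the Hugging Face `datasets` library.
    # The dataset path and split name are assumptions for illustration;
    # follow the EmbodiedBench repository for the authoritative procedure.
    from datasets import load_dataset

    # Hypothetical identifier under the EmbodiedBench organization.
    ds = load_dataset("EmbodiedBench/EB-ALFRED", split="valid")

    for sample in ds.select(range(3)):
        print(sample.keys())  # inspect the available fields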

Evaluation

Primary Metrics:

  • Task Success Rate: Measures the model's ability to correctly complete embodied tasks in each environment.
  • Overall Score: Weighted combination of success rates across EB-ALFRED, EB-Habitat, EB-Navigation, and EB-Manipulation (a worked sketch follows this list).
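
The exact weights have not been announced. As a minimal sketch, assuming equal weighting across the four environments:

    # Sketch of the overall score as a weighted combination of
    # per-environment success rates. Equal weights are an assumption for
    # illustration; the official weighting will be specified by the organizers.
    success_rates = {
        "EB-ALFRED": 0.62,        # placeholder values, not real results
        "EB-Habitat": 0.55,
        "EB-Navigation": 0.48,
        "EB-Manipulation": 0.41,
    }
    weights = {env: 0.25 for env in success_rates}  # assumed equal weighting

    overall = sum(weights[env] * rate for env, rate in success_rates.items())
    print(f"Overall score: {overall:.4f}")  # 0.5150 with the values above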

Capability Subsets: We may additionally report accuracy by capability category (commonsense reasoning, complex instruction, spatial awareness, visual perception, long-horizon planning) for detailed analysis.

Ranking: Teams are ranked by the overall score on the held-out test set.

Tie-break: Higher score on low-level tasks (EB-Navigation + EB-Manipulation), then earlier submission time.
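
For illustration only (not the organizers' scoring code), the ranking and tie-break rules could be applied like this; all field names and values are hypothetical:

    # Sketch of the ranking rule: overall score (descending), then the
    # low-level score EB-Navigation + EB-Manipulation (descending), then
    # submission time (ascending).
    from datetime import datetime

    submissions = [
        {"team": "A", "overall": 0.52, "nav": 0.40, "manip": 0.35,
         "submitted": datetime(2026, 5, 1, 12, 0)},
        {"team": "B", "overall": 0.52, "nav": 0.45, "manip": 0.33,
         "submitted": datetime(2026, 5, 2, 9, 30)},
    ]

    ranked = sorted(
        submissions,
        key=lambda s: (-s["overall"], -(s["nav"] + s["manip"]), s["submitted"]),
    )
    for rank, entry in enumerate(ranked, start=1):
        print(rank, entry["team"])  # B first: same overall, higher low-level score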

Challenge Leaderboard

Performance of submitted methods on the held-out test set.

Rank  Team / Method  Overall  EB-ALFRED  EB-Habitat  EB-Navigation  EB-Manipulation
-     Baseline       -        -          -           -              -

Challenge submissions coming soon...

Leaderboard will be updated after the test set is released and submissions are evaluated.

Submission

Submission File Format (JSONL)

Submit a single .jsonl file with one JSON object per line, containing:

  • sample_id (string): Unique identifier for the test sample
  • prediction (string or object): Your model's prediction (action sequence or task outcome)

Example submission format:

{"sample_id":"eb_000001","prediction":"[PickupObject, PutObject, ...]"} {"sample_id":"eb_000002","prediction":"[MoveAhead, TurnLeft, ...]"}

Requirements

  • Provide exactly one prediction for each sample_id in the test set (a checker sketch follows this list).
  • Duplicate IDs: the submission is rejected as invalid.
  • Missing IDs: the corresponding samples count as incorrect, and the submission may be rejected as invalid.
  • (Recommended) Gzip the file to reduce upload size: predictions.jsonl.gz
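
A minimal pre-submission checker for these requirements, assuming test_ids comes from the released test set (placeholder values below):

    # Check that every test-set sample_id appears exactly once, then
    # optionally gzip the validated file to reduce upload size.
    import gzip
    import json

    test_ids = {"eb_000001", "eb_000002"}  # placeholder test-set IDs

    seen = set()
    with open("predictions.jsonl") as f:
        for line in f:
            sample_id = json.loads(line)["sample_id"]
            assert sample_id not in seen, f"duplicate id: {sample_id}"
            seen.add(sample_id)

    missing = test_ids - seen
    assert not missing, f"missing ids: {sorted(missing)}"

    with open("predictions.jsonl", "rb") as src:
        with gzip.open("predictions.jsonl.gz", "wb") as dst:
            dst.write(src.read())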

How to Submit

  1. Download the held-out test set (coming soon).
  2. Generate your predictions.jsonl following the required format.
  3. Name the file TeamName_MethodName.jsonl (or .jsonl.gz).
  4. Submit via the EvalAI platform (link coming soon).

Submission Limit: Up to 5 submissions per team; the best submission counts.
Deadline: TBD
Results Announcement: TBD

Rules

Detailed challenge rules will be announced soon.

Baselines & Starter Kit

Baselines, environment setup, and evaluation scripts are available in the official EmbodiedBench repository:

github.com/EmbodiedBench/EmbodiedBench

Getting Started: Check out our baseline implementations and starter code to quickly get up and running with the EmbodiedBench dataset.

Contact

For questions, please reach out via:

GitHub Issues  |  Discussions
