logo EmbodiedBench Challenge

Benchmarking Vision-Driven Embodied Agents across EB-ALFRED and EB-Navigation

🏛️ Hosted by: Foundation Models Meet Embodied Agents Workshop, CVPR 2026.

The EmbodiedBench Challenge evaluates Multi-modal Large Language Models (MLLMs) as vision-driven embodied agents across two environments: EB-ALFRED (household task planning) and EB-Navigation (spatial navigation). Stage 1 runs from April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth), and participants can submit trajectory results via EvalAI. The top 5 teams will advance to Stage 2 for held-out evaluation, with full Stage 2 instructions available below.

Timeline

All deadlines use AoE (Anywhere on Earth, UTC−12).

Stage 1 — Qualification Phase
April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth)

Submit trajectory results via EvalAI. Up to 5 submissions per day, 50 total. Leaderboard is private (visible to host only).

Stage 2 — Final Held-out Test Phase
May 25, 2026 to May 28, 2026, 23:59 AoE (Anywhere on Earth)

This stage is described publicly below. The top 5 teams from Stage 1 will receive a private submission link, and each team submits exactly once by May 26, 2026 (AoE). The remaining Stage 2 period is reserved for online evaluation and addressing engineering issues related to customization. Teams may submit either a vLLM-servable model or a modified EmbodiedBench agent framework.

Results Announcement
May 31, 2026

Final rankings and award ceremony at the Foundation Models Meet Embodied Agents Workshop.

Awards & Recognition

Top-performing teams will be recognized with both cash awards and workshop visibility opportunities:

  • 1st place: $500
  • 2nd place: $300
  • 3rd place: $200
  • Certificates: Award certificates will be provided.
  • Workshop opportunities: Selected teams may be invited to submit a technique report and give an on-site talk, subject to the final workshop schedule.

Challenge Overview

Goal

Given visual observations and language instructions, develop vision-driven agents that can plan and execute tasks in two EmbodiedBench environments:

  • EB-ALFRED — Household task planning: multi-step high-level tasks requiring object interaction, task decomposition, and long-horizon planning.
  • EB-Navigation — Spatial navigation: low-level action planning requiring spatial reasoning, visual perception, and precise movement control.

Stage 1 — What You Do

  1. Run your model or agent using the official EmbodiedBench code on EB-ALFRED and EB-Navigation.
  2. Use copy_json.py to extract episode result files and prepare a zip submission.
  3. Submit the zip file to EvalAI. The evaluator automatically computes your scores and updates the leaderboard.

Stage 2 — What You Do (Top 5 Teams Only)

  1. Receive a private invitation and submission link from the organizers.
  2. Submit your model, modified agent framework, or a combined model-plus-agent submission (exactly one submission) following one of the three supported formats (see Stage 2 Submission below).
  3. The organizers run your system on held-out EB-ALFRED and EB-Navigation tasks to determine the final ranking.
  4. These held-out tasks will be released after the end of Stage 2.

Evaluation

Leaderboard Metrics:

  • Overall Score (primary): Average task success rate across EB-ALFRED and EB-Navigation.
  • ALFRED SR: Task success rate on EB-ALFRED (%).
  • ALFRED Steps: Average steps per episode on EB-ALFRED (lower is better).
  • Navigation SR: Task success rate on EB-Navigation (%).
  • Navigation Steps: Average steps per episode on EB-Navigation (lower is better).
  • Average Steps: Average of per-environment avg steps (tiebreaker, lower is better).

Ranking: Teams are ranked by Overall Score. Ties are broken by Average Steps (fewer is better).

Model size: Open-source base models must have fewer than 10B parameters. Commercial APIs are not allowed.

Challenge Leaderboard

Stage 1 results on the validation set. Click on column headers to sort.

Rank ↕ Team / Method ↕ Overall Score ↕ ALFRED SR ↕ ALFRED Steps ↕ Navigation SR ↕ Navigation Steps ↕ Avg Steps ↕
Stage 1 runs from April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth). Leaderboard entries will appear here as submissions are processed.

The live leaderboard is available on the EvalAI challenge page during Stage 1.

Stage 1 Submission

Step 1 — Run EmbodiedBench

Run your model or agent using the official EmbodiedBench code to generate rollout trajectories for EB-ALFRED and EB-Navigation. A typical raw rollout directory looks like:

eb_alfred/model_experiment/ ├── summary_all.json ├── base/ │ ├── episode_1_step_19.json │ ├── episode_2_step_10.json │ ├── images/ │ └── results/ │ ├── episode_1_final_res.json │ └── ... └── common_sense/ ├── episode_1_step_14.json ├── images/ └── results/ └── ...

Step 2 — Extract Episode JSON Files

Use copy_json.py to extract only the episode JSON files. Run the script once per environment:

python copy_json.py --source path/to/running/eb_alfred/model_experiment \ --output submission_folder/eb_alfred python copy_json.py --source path/to/running/eb_nav/model_experiment \ --output submission_folder/eb_nav

This produces a compact directory:

submission_folder/ ├── eb_alfred/ │ └── model_experiment/ │ ├── base/ │ │ ├── episode_1_step_19.json │ │ └── ... │ └── common_sense/ │ └── ... └── eb_nav/ └── model_experiment/ ├── base/ │ ├── episode_1.json │ └── ... └── common_sense/ └── ...

Step 3 — Compress and Submit

cd submission_folder && zip -r ../submission.zip . && cd ..

Upload submission.zip to the EvalAI challenge page. The evaluator will automatically extract the zip, compute your scores, and update the leaderboard.

Submission limits: Up to 5 submissions per day, 50 total during Stage 1.
Stage 1 timeline: April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth).

Stage 2 Submission (Top 5 Teams)

The top 5 teams from Stage 1 will be invited to Stage 2 via a private link provided by the organizers. Each team may submit exactly once. The submission must be a runnable model, modified agent framework, or combined model-plus-agent system for EmbodiedBench. Three submission formats are supported:

Note: The deadline for submitting the Stage 2 model is May 26, 2026 (AoE), to allow sufficient time for online evaluation and to address any engineering issues related to customization from participating teams.
Held-out tasks: Stage 2 uses held-out tasks from EB-ALFRED and EB-Navigation, which will be released after Stage 2 concludes.

Option A — vLLM-Compatible Model Server

Submit a fine-tuned or adapted model that can be served with vLLM. The organizers will launch it as:

vllm serve <your-model> --host 0.0.0.0 --port 8000

Provide the model path or Hugging Face model ID and any required vllm serve flags.

Option B — Modified Agent Framework

Submit a modified EmbodiedBench agent framework. Teams may customize files under embodiedbench/planner/ (for example prompt construction, reasoning, memory, replanning, and action parsing in planner modules such as vlm_planner.py and nav_planner.py).

Constraints: Teams may only modify files under embodiedbench/planner/ and add new supporting modules. Evaluator code, environment code, and metric computation must remain unchanged.

Submit a code zip together with a README that explains how to run your modified framework for Stage 2 evaluation on the held-out EB-ALFRED and EB-Navigation tasks.

Option A + B — Fine-tuned Model with Custom Agent Framework

Teams may combine both options by submitting a fine-tuned model served via vLLM together with a custom planner under embodiedbench/planner/ that calls it. This supports end-to-end optimization of both model weights and agent strategy.

For the planner portion, the same constraints as Option B apply: teams may only modify files under embodiedbench/planner/ and add new supporting modules, while evaluator code, environment code, and metric computation must remain unchanged.

Submission Package

  • Model weights or code repository (Hugging Face model ID, GitHub link, or compressed archive)
  • For Option B, a code zip of the modified agent framework
  • For Option A + B, both the model package or model ID and a code zip of the custom planner framework
  • A README with step-by-step startup instructions and any customization notes needed for reproduction
  • Complete dependency list (requirements.txt or equivalent)
  • Any required vllm serve flags or other startup arguments
  • For Option B and Option A + B, the submitted package must preserve the original evaluator, environment, and metric code unchanged

Rules

Baselines & Starter Kit

Baselines, environment setup, and evaluation scripts are available in the official EmbodiedBench repository:

github.com/EmbodiedBench/EmbodiedBench

The repository includes baseline model implementations, the copy_json.py submission helper, and detailed instructions for running evaluations on EB-ALFRED and EB-Navigation.

Contact

Community Channels

Join the EmbodiedBench community on Slack, or use GitHub Issues for technical questions and announcements.

WeChat Group

Scan the QR code below to join the EmbodiedBench WeChat group.

EmbodiedBench WeChat group QR code

The QR code image may expire periodically. If it stops working, please use Slack or GitHub to contact the organizers.

Back to EmbodiedBench Home