EmbodiedBench Challenge

Benchmarking Vision-Driven Embodied Agents across EB-ALFRED and EB-Navigation

🏛️ Hosted by: Foundation Models Meet Embodied Agents Workshop, CVPR 2026.

The EmbodiedBench Challenge evaluates Multi-modal Large Language Models (MLLMs) as vision-driven embodied agents across two environments: EB-ALFRED (household task planning) and EB-Navigation (spatial navigation). Stage 1 runs from April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth), and participants can submit trajectory results via EvalAI. The top 5 teams will advance to Stage 2 for held-out evaluation, with full Stage 2 instructions available below.

Challenge Instructions Stage 2 Details Awards Contact Dataset & Code Submit on EvalAI Dataset

Timeline

All deadlines use AoE (Anywhere on Earth, UTC−12).

Stage 1 — Qualification Phase

April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth)

Submit trajectory results via EvalAI. Up to 5 submissions per day, 50 total. Leaderboard is private (visible to host only).

Stage 2 — Final Held-out Test Phase

May 25, 2026 to May 28, 2026, 23:59 AoE (Anywhere on Earth)

This stage is described publicly below. The top 5 teams from Stage 1 will receive a private submission link, and each team submits exactly once by May 26, 2026 (AoE). The remaining Stage 2 period is reserved for online evaluation and addressing engineering issues related to customization. Teams may submit either a vLLM-servable model or a modified EmbodiedBench agent framework.

Results Announcement

May 31, 2026

Final rankings and award ceremony at the Foundation Models Meet Embodied Agents Workshop.

Awards & Recognition

Top-performing teams will be recognized with both cash awards and workshop visibility opportunities:

1st place: $500
2nd place: $300
3rd place: $200
Certificates: Award certificates will be provided.
Workshop opportunities: Selected teams may be invited to submit a technique report and give an on-site talk, subject to the final workshop schedule.

Challenge Overview

Goal

Given visual observations and language instructions, develop vision-driven agents that can plan and execute tasks in two EmbodiedBench environments:

EB-ALFRED — Household task planning: multi-step high-level tasks requiring object interaction, task decomposition, and long-horizon planning.
EB-Navigation — Spatial navigation: low-level action planning requiring spatial reasoning, visual perception, and precise movement control.

Stage 1 — What You Do

Run your model or agent using the official EmbodiedBench code on EB-ALFRED and EB-Navigation.
Use copy_json.py to extract episode result files and prepare a zip submission.
Submit the zip file to EvalAI. The evaluator automatically computes your scores and updates the leaderboard.

Stage 2 — What You Do (Top 5 Teams Only)

Receive a private invitation and submission link from the organizers.
Submit your model, modified agent framework, or a combined model-plus-agent submission (exactly one submission) following one of the three supported formats (see Stage 2 Submission below).
The organizers run your system on held-out EB-ALFRED and EB-Navigation tasks to determine the final ranking.
These held-out tasks will be released after the end of Stage 2.

Evaluation

Leaderboard Metrics:

Avg Success Rate (primary): Average task success rate across EB-ALFRED and EB-Navigation.
EB-ALFRED Success Rate: Task success rate on EB-ALFRED (%).
EB-Navigation Success Rate: Task success rate on EB-Navigation (%).

Ranking: Teams are ranked by Avg Success Rate.

Model size: Open-source base models must have fewer than 10B parameters. Commercial APIs are not allowed.

Challenge Leaderboard

Final challenge results across the open validation split and the held-out Stage 2 split.

Open Leaderboard

Rank	Team / Method	EB-ALFRED Success Rate	EB-Navigation Success Rate	Avg Success Rate
1	NJU-LAMDA-SZ	83.33	81.67	82.50
2	Ideal-Embody(Ideal)	66.33	56.33	61.33
3	vla-number-one	65.77	44.33	55.05
4	team233	42.33	43.00	42.67
5	EBSkills	24.67	50.33	37.50

Held-out Leaderboard

Team vla-number-one withdrew from the competition. The held-out EB-Navigation tasks are more challenging than the open split, so the maximum number of steps was increased to 30 for the held-out evaluation.

Rank	Team / Method	EB-ALFRED Success Rate	EB-Navigation Success Rate	Avg Success Rate
1	NJU-LAMDA-SZ	67.3	37.0	52.2
2	Ideal-Embody(Ideal)	75.3	22.0	48.7
3	Team233	44.7	28.0	36.4
4	EBSkills	22.7	27.0	24.9

Stage 1 Submission

Step 1 — Run EmbodiedBench

Run your model or agent using the official EmbodiedBench code to generate rollout trajectories for EB-ALFRED and EB-Navigation. A typical raw rollout directory looks like:

eb_alfred/model_experiment/
├── summary_all.json
├── base/
│   ├── episode_1_step_19.json
│   ├── episode_2_step_10.json
│   ├── images/
│   └── results/
│       ├── episode_1_final_res.json
│       └── ...
└── common_sense/
    ├── episode_1_step_14.json
    ├── images/
    └── results/
        └── ...

Step 2 — Extract Episode JSON Files

Use copy_json.py to extract only the episode JSON files. Run the script once per environment:

python copy_json.py --source path/to/running/eb_alfred/model_experiment \
                    --output submission_folder/eb_alfred
python copy_json.py --source path/to/running/eb_nav/model_experiment \
                    --output submission_folder/eb_nav

This produces a compact directory:

submission_folder/
├── eb_alfred/
│   └── model_experiment/
│       ├── base/
│       │   ├── episode_1_step_19.json
│       │   └── ...
│       └── common_sense/
│           └── ...
└── eb_nav/
    └── model_experiment/
        ├── base/
        │   ├── episode_1.json
        │   └── ...
        └── common_sense/
            └── ...

Step 3 — Compress and Submit

cd submission_folder && zip -r ../submission.zip . && cd ..

Upload submission.zip to the EvalAI challenge page. The evaluator will automatically extract the zip, compute your scores, and update the leaderboard.

Submission limits: Up to 5 submissions per day, 50 total during Stage 1.
Stage 1 timeline: April 15, 2026 to May 25, 2026, 23:59 AoE (Anywhere on Earth).

Stage 2 Submission (Top 5 Teams)

The top 5 teams from Stage 1 will be invited to Stage 2 via a private link provided by the organizers. Each team may submit exactly once. The submission must be a runnable model, modified agent framework, or combined model-plus-agent system for EmbodiedBench. Three submission formats are supported:

Note: The deadline for submitting the Stage 2 model is May 26, 2026 (AoE), to allow sufficient time for online evaluation and to address any engineering issues related to customization from participating teams.
Held-out tasks: Stage 2 uses held-out tasks from EB-ALFRED and EB-Navigation, which will be released after Stage 2 concludes. The held-out EB-Navigation tasks are more challenging than the open split, and the maximum number of steps is increased to 30 for held-out evaluation.

Option A — vLLM-Compatible Model Server

Submit a fine-tuned or adapted model that can be served with vLLM. The organizers will launch it as:

vllm serve <your-model> --host 0.0.0.0 --port 8000

Provide the model path or Hugging Face model ID and any required vllm serve flags.

Option B — Modified Agent Framework

Submit a modified EmbodiedBench agent framework. Teams may customize files under embodiedbench/planner/ (for example prompt construction, reasoning, memory, replanning, and action parsing in planner modules such as vlm_planner.py and nav_planner.py).

Constraints: Teams may only modify files under embodiedbench/planner/ and add new supporting modules. Evaluator code, environment code, and metric computation must remain unchanged.

Submit a code zip together with a README that explains how to run your modified framework for Stage 2 evaluation on the held-out EB-ALFRED and EB-Navigation tasks.

Option A + B — Fine-tuned Model with Custom Agent Framework

Teams may combine both options by submitting a fine-tuned model served via vLLM together with a custom planner under embodiedbench/planner/ that calls it. This supports end-to-end optimization of both model weights and agent strategy.

For the planner portion, the same constraints as Option B apply: teams may only modify files under embodiedbench/planner/ and add new supporting modules, while evaluator code, environment code, and metric computation must remain unchanged.

Submission Package

Model weights or code repository (Hugging Face model ID, GitHub link, or compressed archive)
For Option B, a code zip of the modified agent framework
For Option A + B, both the model package or model ID and a code zip of the custom planner framework
A README with step-by-step startup instructions and any customization notes needed for reproduction
Complete dependency list (requirements.txt or equivalent)
Any required vllm serve flags or other startup arguments
For Option B and Option A + B, the submitted package must preserve the original evaluator, environment, and metric code unchanged

Rules

Commercial APIs not allowed. Models must not rely on commercial API calls (e.g., GPT-4, Claude, Gemini) for inference during evaluation.
Model size limit: Open-source base models must have fewer than 10B parameters to emphasize algorithmic design over model scale.
External data and pre-trained models: Allowed with disclosure. Clearly list all external resources in your submission.
Human-in-the-loop labeling on test: Disallowed. Do not attempt to obtain test labels or manipulate evaluation.
Verification: Top teams will be asked to provide a technical report and reproducibility details. Stage 2 teams must provide a runnable model or framework.
Teams: Team size is limited to 5 members. Each team may only submit under one team name.

Baselines & Starter Kit

Baselines, environment setup, and evaluation scripts are available in the official EmbodiedBench repository:

github.com/EmbodiedBench/EmbodiedBench

The repository includes baseline model implementations, the copy_json.py submission helper, and detailed instructions for running evaluations on EB-ALFRED and EB-Navigation.

Contact

Community Channels

Join the EmbodiedBench community on Slack, or use GitHub Issues for technical questions and announcements.

Join Slack GitHub Issues

WeChat Group

Scan the QR code below to join the EmbodiedBench WeChat group.

The QR code image may expire periodically. If it stops working, please use Slack or GitHub to contact the organizers.

Back to EmbodiedBench Home