Primary Metrics:
- Task Success Rate: Fraction of embodied tasks the model completes correctly, computed separately for each environment.
- Overall Score: Weighted combination of success rates across EB-ALFRED, EB-Habitat, EB-Navigation, and EB-Manipulation.
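As a rough illustration of how the two primary metrics relate, the sketch below computes a per-environment success rate and then a weighted overall score. The weights are not given in this section, so the equal weights in `ASSUMED_WEIGHTS` are purely a placeholder, and the function names `success_rate` and `overall_score` are hypothetical rather than part of any official evaluation toolkit.

```python
from typing import Mapping, Sequence

# ASSUMPTION: the actual weights are not stated in this section.
# Equal weights are used here only as a placeholder.
ASSUMED_WEIGHTS = {
    "EB-ALFRED": 0.25,
    "EB-Habitat": 0.25,
    "EB-Navigation": 0.25,
    "EB-Manipulation": 0.25,
}


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of tasks marked as successful."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return sum(outcomes) / len(outcomes)


def overall_score(
    rates: Mapping[str, float],
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Return the weighted average of per-environment success rates."""
    missing = set(weights) - set(rates)
    if missing:
        raise KeyError(f"missing success rates for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalize by the total weight so the result stays in [0, 1]
    # even if the weights do not sum to exactly 1.
    return sum(weights[env] * rates[env] for env in weights) / total_weight
```

For example, `success_rate([True, False, True, True])` gives 0.75, and that value would be one of the four per-environment inputs to `overall_score`.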
Capability Subsets: We may additionally report accuracy by capability category (commonsense reasoning, complex instruction, spatial awareness, visual perception, long-horizon planning) for detailed analysis.
Ranking: Teams are ranked by the overall score on the held-out test set.
Tie-break: The higher score on low-level tasks (EB-Navigation + EB-Manipulation) ranks first; if still tied, the earlier submission ranks first.
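The ranking and tie-break order can be expressed as a single sort key. This is a minimal sketch, not the organizers' scoring code: the `Submission` record and its field names are hypothetical, `low_level` is assumed to be an already-combined score over EB-Navigation and EB-Manipulation (how the two are combined is not specified here), and ties are assumed to mean exactly equal scores.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for one team's test-set submission."""
    team: str
    overall: float          # overall score on the held-out test set
    low_level: float        # ASSUMED: combined EB-Navigation + EB-Manipulation score
    submitted_at: datetime  # submission timestamp


def rank(submissions: Iterable[Submission]) -> List[Submission]:
    """Sort best-first: overall score, then low-level score, then earlier time."""
    return sorted(
        submissions,
        # Negate the scores so higher values sort first, while
        # earlier timestamps sort first in natural ascending order.
        key=lambda s: (-s.overall, -s.low_level, s.submitted_at),
    )
```

Packing all three criteria into one tuple key keeps the precedence explicit: the second and third elements only matter when every earlier element compares equal.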