Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

1Tongji University, 2Shanghai AI Laboratory, 3Shanghai Jiao Tong University, 4State Key Laboratory of Autonomous Intelligent Unmanned Systems
*Equal Contribution

+Corresponding Author

🤖 Different viewpoints of the robots walking in VLN-PE.

🗺️ Trajectories and instructions from the GRU-VLN10 dataset.

🏢 Trajectories and instructions from the 3DGS-Lab-VLN dataset.

💡 Comparisons of different lighting conditions.

An overview of our method.

We introduce VLN-PE, a realistic VLN platform and benchmark designed to enhance physical deployment across diverse robot embodiments. It enables cross-embodiment data collection, evaluation, and optimization under realistic locomotion and environmental conditions. Through systematic experiments on ego-centric VLN methods, we expose critical physical and visual disparities that challenge existing approaches and benchmarks. VLN-PE offers a grounded framework to foster more generalizable VLN models for future physical embodied AI development.

Abstract

Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a training-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models.

In this paper, we introduce VLN-PE, a physically realistic VLN platform and benchmark that provides a comprehensive environment for cross-embodiment (humanoid, quadruped, and wheeled) data collection and for the systematic evaluation of policies across robot embodiments and environmental conditions. Our experiments on VLN-PE reveal several critical insights that highlight limitations in current approaches and suggest promising directions for improvement:

  1. SoTA Models Struggle in Physical Environments: Existing VLN-CE models exhibit a 34% relative drop in success rate (SR) when transferred to physical settings, revealing a gap between pseudo-motion training and physical deployment (a minimal sketch of this relative-drop computation follows the list).
  2. Cross-embodiment Sensitivity: Model performance varies across different robots, primarily due to viewpoint height differences, highlighting the need for height-adaptive or perspective-invariant representations.
  3. Multi-Modal Robustness: RGB-only models degrade significantly in low-light conditions, whereas RGB + depth models perform more reliably, underscoring the value of multi-modal fusion for robustness.
  4. Limited Generalization of Standard Datasets: MP3D-style datasets cannot fully capture environment shifts. A simple baseline with 6M trainable parameters, fine-tuned on a small-scale dataset collected in the newly introduced scenes, outperforms the previous SoTA method in zero-shot settings, suggesting the importance of more diverse training distributions and a more comprehensive evaluation protocol.
  5. Towards Cross-Embodiment VLN: In our experiments, co-training across different robots enables a single baseline to generalize across embodiments and achieve SoTA results, laying an important foundation for a future unified cross-embodiment VLN model.
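To make the relative-drop number in insight 1 concrete, here is a minimal Python sketch of the computation. The function name relative_sr_drop and the example SR values (0.50 falling to 0.33) are illustrative assumptions for this page, not code or numbers from the VLN-PE release.

def relative_sr_drop(sr_source: float, sr_physical: float) -> float:
    """Relative drop in success rate (SR) between an idealized setting
    (e.g., VLN-CE pseudo-motion) and physical execution in VLN-PE.
    Both arguments use the same scale, e.g., SR in [0, 1]."""
    if sr_source <= 0:
        raise ValueError("source SR must be positive")
    return (sr_source - sr_physical) / sr_source

# Hypothetical example: an SR of 0.50 under idealized transfer that falls
# to 0.33 under physical execution corresponds to a ~34% relative drop.
print(f"{relative_sr_drop(0.50, 0.33):.0%}")  # prints 34%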

Below, we provide visualized results from the VLN-PE simulator.

Using the VLN-PE simulator, we conduct a series of experiments to evaluate the performance of different VLN models.

BibTeX

@article{wang2025rethinking,
  title={Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities},
  author={Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2503.14390},
  year={2025}
}