Abstract
Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent large vision-language models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models on VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance, in simulation we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment we design a LiDAR-based clustering module that generates navigable waypoints, paired with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results, ranking 1st on the VLN-CE leaderboard and significantly improving success rate (SR) and success weighted by path length (SPL) on the test-unseen set over previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.
Contributions
1) Novel Hybrid Architecture. We introduce CLASH, the first VLN-CE framework that synergistically integrates a task-specific small model for reactive planning with an MLLM for reflective reasoning, combining task-specific efficiency with generalizable commonsense reasoning.
2) Enhanced High-Level Reasoning. We propose three key innovations: (i) a causal learning-based dual-branch framework that improves causal reasoning and generalization to unseen environments; (ii) panoramic visual prompting with chain-of-thought reasoning (PVP-CoT) for interpretable spatial reasoning; and (iii) an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models based on confidence estimation.
3) Practical Low-Level Execution. We develop a dual-level solution bridging simulation and real-world deployment: a learnable DDPPO-based policy for simulation with robust obstacle avoidance, and a training-free LiDAR-based waypoint generation method with SLAM-based control for zero-shot real-world transfer.
4) Superior Performance and Real-World Validation. CLASH achieves state-of-the-art results on R2R-CE and REVERIE-CE, ranking 1st on the VLN-CE leaderboard, with extensive real-world experiments validating strong sim-to-real transferability.
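The uncertainty-aware collaboration mechanism (UCM) is described above only at a high level. One common way to realize confidence-gated fusion is entropy gating over the small model's action distribution: act on the reactive planner when it is confident, and defer to the reflective reasoner otherwise. The sketch below is illustrative only; the function names, the entropy criterion, and the threshold `tau` are our assumptions, not the paper's implementation.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats)."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()  # normalize defensively
    return float(-(p * np.log(p + 1e-12)).sum())

def fuse_decisions(small_probs, large_probs, tau=0.8):
    """Confidence-gated fusion (illustrative UCM sketch):
    use the small model's action when its distribution has low
    entropy (high confidence); otherwise defer to the large model."""
    if entropy(small_probs) <= tau:
        return int(np.argmax(small_probs))
    return int(np.argmax(large_probs))

# A peaked small-model distribution keeps the reactive action;
# a flat one hands control to the large-model reasoner.
print(fuse_decisions([0.9, 0.05, 0.05], [0.1, 0.8, 0.1]))  # → 0
print(fuse_decisions([0.4, 0.3, 0.3], [0.1, 0.8, 0.1]))    # → 1
```

More elaborate variants could calibrate the threshold per environment or fuse logits rather than hard decisions, but the gating principle is the same.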
Comparison with other SoTA methods on the R2R-CE dataset.
Model efficiency comparison.
Visualization of navigation trajectories in VLN-CE.
Real-world Experiments
BibTeX
@article{wang2025clash,
  title={CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation},
  author={Wang, Liuyi and He, Zongtao and Li, Jinlong and Qi, Xiaoyan and Hu, Mengxian and Yao, Chenpeng and Liu, Chengju and Chen, Qijun},
  journal={arXiv preprint arXiv:2512.10360},
  year={2025},
}