CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation

Tongji University, College of Electronic and Information Engineering, China
Under Review


Overview of the CLASH framework. CLASH integrates a reactive small model and a reflective vision-language large model through an uncertainty-aware collaboration mechanism, with tailored waypoint and point-goal prediction schemes for simulation and real-world deployment.

Abstract

Vision-and-Language Navigation (VLN) requires robots to follow natural language instructions and navigate complex environments without prior maps. While recent vision-language large models demonstrate strong reasoning abilities, they often underperform task-specific panoramic small models on VLN tasks. To address this, we propose CLASH (Collaborative Large-Small Hierarchy), a VLN-CE framework that integrates a reactive small-model planner (RSMP) with a reflective large-model reasoner (RLMR). RSMP adopts a causal-learning-based dual-branch architecture to enhance generalization, while RLMR leverages panoramic visual prompting with chain-of-thought reasoning to support interpretable spatial understanding and navigation. We further introduce an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models. For obstacle avoidance, in simulation we replace the rule-based controller with a fully learnable point-goal policy, and in real-world deployment we design a LiDAR-based clustering module that generates navigable waypoints, paired with an online SLAM-based local controller. CLASH achieves state-of-the-art (SoTA) results, ranking 1st on the VLN-CE leaderboard and significantly improving success rate (SR) and success weighted by path length (SPL) on the test-unseen split over previous SoTA methods. Real-world experiments demonstrate CLASH's strong robustness, validating its effectiveness in both simulation and deployment scenarios.

Contributions

1) Novel Hybrid Architecture. We introduce CLASH, the first VLN-CE framework that synergistically integrates a task-specific small model for reactive planning with a multimodal large language model (MLLM) for reflective reasoning, combining task-specific efficiency with generalizable commonsense reasoning.

2) Enhanced High-Level Reasoning. We propose three key innovations: (i) a causal learning-based dual-branch framework that improves causal reasoning and generalization to unseen environments; (ii) panoramic visual prompting with chain-of-thought reasoning (PVP-CoT) for interpretable spatial reasoning; and (iii) an uncertainty-aware collaboration mechanism (UCM) that adaptively fuses decisions from both models based on confidence estimation.

3) Practical Low-Level Execution. We develop a dual-level solution bridging simulation and real-world deployment: a learnable DDPPO-based policy for simulation with robust obstacle avoidance, and a training-free LiDAR-based waypoint generation method with SLAM-based control for zero-shot real-world transfer.

4) Superior Performance and Real-World Validation. CLASH achieves state-of-the-art results on R2R-CE and REVERIE-CE, ranking 1st on the VLN-CE leaderboard, with extensive real-world experiments validating strong sim-to-real transferability.
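To make the collaboration idea concrete, here is a minimal sketch of uncertainty-aware decision fusion between the two models. It assumes each model outputs a categorical distribution over candidate waypoints and weights each model inversely by its predictive entropy; the entropy-based weighting and all names (`fuse_decisions`, `p_rsmp`, `p_rlmr`) are illustrative assumptions, not the paper's exact UCM formulation.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a categorical distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def fuse_decisions(p_small, p_large):
    """Fuse two action distributions, weighting each model inversely by
    its predictive entropy (lower entropy = more confident = larger
    weight). Illustrative sketch only."""
    p_small = np.asarray(p_small, dtype=float)
    p_large = np.asarray(p_large, dtype=float)
    # Confidence = inverse entropy (eps avoids division by zero)
    c_s = 1.0 / (entropy(p_small) + 1e-6)
    c_l = 1.0 / (entropy(p_large) + 1e-6)
    w_s = c_s / (c_s + c_l)
    fused = w_s * p_small + (1.0 - w_s) * p_large
    return fused / fused.sum()

# Example: the small model is confident about waypoint 2,
# the large model is nearly uniform, so the small model dominates.
p_rsmp = [0.05, 0.05, 0.85, 0.05]
p_rlmr = [0.30, 0.25, 0.25, 0.20]
print(fuse_decisions(p_rsmp, p_rlmr))
```

The same scheme lets either model take over when its counterpart is unsure, which is the intuition behind adaptive fusion.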

Real-world Experiments
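For real-world deployment, the paper describes a training-free LiDAR-based clustering module for waypoint generation. The sketch below illustrates one plausible reading of that idea under stated assumptions: contiguous runs of 2D scan beams whose range exceeds a clearance threshold are clustered, and each sufficiently wide cluster yields a waypoint along its central bearing. All parameters (`clearance`, `max_step`, `min_beams`) and the function name are hypothetical; the paper's exact clustering procedure may differ.

```python
import math

def lidar_waypoints(ranges, angle_min=-math.pi, angle_inc=None,
                    clearance=1.5, max_step=2.0, min_beams=5):
    """Cluster contiguous free beams of a 2D LiDAR scan into candidate
    waypoints (x, y) in the robot frame. Illustrative sketch only."""
    n = len(ranges)
    if angle_inc is None:
        angle_inc = 2.0 * math.pi / n  # assume a full 360-degree scan
    waypoints, cluster = [], []
    # Append a zero-range sentinel so the last cluster is flushed too.
    for i, r in enumerate(list(ranges) + [0.0]):
        if r > clearance:
            cluster.append(i)
            continue
        if len(cluster) >= min_beams:
            mid = cluster[len(cluster) // 2]         # central beam of the gap
            ang = angle_min + mid * angle_inc
            # Step toward the gap, keeping a 0.5 m safety margin.
            step = min(max_step, min(ranges[j] for j in cluster) - 0.5)
            waypoints.append((step * math.cos(ang), step * math.sin(ang)))
        cluster = []
    return waypoints

# Example: a scan blocked everywhere except one 40-beam opening.
scan = [0.5] * 360
for j in range(80, 120):
    scan[j] = 3.0
print(lidar_waypoints(scan))
```

Each returned waypoint can then be handed to a local controller (SLAM-based in the paper's deployment) for collision-free execution.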

BibTeX

@article{wang2025clash,
  title={CLASH: Collaborative Large-Small Hierarchical Framework for Continuous Vision-and-Language Navigation},
  author={Wang, Liuyi and He, Zongtao and Li, Jinlong and Qi, Xiaoyan and Hu, Mengxian and Yao, Chenpeng and Liu, Chengju and Chen, Qijun},
  journal={arXiv preprint arXiv:2512.10360},
  year={2025},
}