EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device

ICCV 2025

Georgia Tech, Toyota Research Institute
EmbodiedSplat Teaser Image

The sim-to-real gap remains a bottleneck in Embodied AI: synthetic datasets lack realism, and real-world datasets are expensive to collect. Advances in 3D scene representation, such as 3D Gaussian Splatting (GS) and, subsequently, DN-Splatter, enable efficient scene reconstruction from low-effort captures, yet their potential for training and deploying robot policies remains largely unexplored. In this work, we introduce EmbodiedSplat, an efficient and cost-effective pipeline that bridges the sim-to-real gap in navigation by creating high-quality 3D simulations from low-cost iPhone captures using depth-aware 3D Gaussian Splats (GS) and Polycam. Our comprehensive evaluations in both simulated and real-world environments show that EmbodiedSplat substantially improves sim-to-real transfer, owing to the effectiveness of fine-tuning on high-fidelity reconstructions. Our in-depth analysis explores the relationships among reconstruction quality, pre-training scenes, downstream navigation performance, and training strategies. Our findings provide valuable insights into how reconstruction fidelity influences policy generalization and applicability. We provide an open-source codebase and dataset to facilitate reproducibility and further research in this domain.

Abstract

The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform zero-shot baselines pre-trained on both large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success-rate improvements of 20% and 40%, respectively, on a real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87–0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort.

The EmbodiedSplat Pipeline

EmbodiedSplat Pipeline

The EmbodiedSplat pipeline captures university scenes using Polycam and processes them with Nerfstudio, which produces RGB frames, associated iPhone ground-truth depth maps, and camera poses. DN-Splatter is then used to train a Gaussian Splatting model with depth and normal regularization. Meshes are extracted from the trained Gaussians via Poisson surface reconstruction, then processed and loaded into Habitat-Sim for agent training in simulation. Finally, the trained policies are deployed in the same real-world scenes for image-goal navigation. A sketch of the Habitat-Sim loading step follows below.
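To make the final step concrete, below is a minimal sketch of loading a reconstructed mesh into Habitat-Sim and stepping an agent. The scene path, sensor resolution, and camera height are hypothetical placeholders, not the exact configuration used in the paper.

```python
import habitat_sim

# Hypothetical path to a processed EmbodiedSplat reconstruction (.glb).
SCENE_PATH = "data/scenes/lounge.glb"

# Backend configuration: point the simulator at the reconstructed scene.
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = SCENE_PATH

# A single RGB camera; resolution and mounting height are assumed values
# (the height loosely approximates a Stretch robot's camera).
rgb_spec = habitat_sim.CameraSensorSpec()
rgb_spec.uuid = "rgb"
rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
rgb_spec.resolution = [480, 640]
rgb_spec.position = [0.0, 1.3, 0.0]

agent_cfg = habitat_sim.agent.AgentConfiguration(sensor_specifications=[rgb_spec])
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Step with a discrete action from the default action space and read the frame.
obs = sim.step("move_forward")
print(obs["rgb"].shape)
sim.close()
```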

Our Dataset

Reconstructed meshes of the Captured scenes

We capture scenes from a university environment (classroom, community lounges, conference rooms, etc.). For custom data collection, we use an iPhone 13 Pro Max to record RGB-D data with the Polycam app. We then use Nerfstudio to process the RGB-D data and sample 1,000 aligned RGB-depth frames with low blur scores, along with their corresponding poses. Each capture requires 20–30 minutes of recording with Polycam. We repeat this process for different indoor scenes: lounge, classroom, conf_a, and conf_b. For mesh reconstruction, we choose DN-Splatter for its superior mesh reconstruction performance compared to alternatives and for the simplicity of its integration with Habitat-Sim. DN-Splatter leverages depth and normal regularization along with smoothness losses to maintain geometric consistency during Gaussian Splat training. After training the GS model and extracting .ply meshes, we convert the meshes to .glb in Blender, as sketched below. The final meshes are shown above.
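For the .ply-to-.glb conversion, a minimal Blender Python sketch along the following lines can automate the step. The file paths are hypothetical, and the import operator depends on the Blender version (bpy.ops.wm.ply_import in recent releases, bpy.ops.import_mesh.ply in older ones); this is not the paper's exact script.

```python
# Run headlessly inside Blender:
#   blender --background --python convert_ply_to_glb.py
import bpy

# Hypothetical input/output paths for one captured scene.
PLY_PATH = "meshes/lounge.ply"
GLB_PATH = "meshes/lounge.glb"

# Clear the default scene so only the reconstruction is exported.
bpy.ops.object.select_all(action="SELECT")
bpy.ops.object.delete()

# Import the Poisson-reconstructed mesh (use bpy.ops.import_mesh.ply
# on Blender versions before 3.6).
bpy.ops.wm.ply_import(filepath=PLY_PATH)

# Export as binary glTF, the format loaded by Habitat-Sim.
bpy.ops.export_scene.gltf(filepath=GLB_PATH, export_format="GLB")
```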

Zero-Shot Evaluation Results

Fine-tuned Evaluation Results

Real World Performance

Real World Performance Results

We evaluate both zero-shot and fine-tuned policies in the real-world lounge scene on a Stretch robot. Each episode is capped at 100 steps, with 10 distinct start-and-goal locations sampled within the scene. To assess performance, we record the number of steps taken and the final distance to the goal at the end of each episode, which together determine whether the agent successfully completes the task. The zero-shot HM3D policy achieves a 50% success rate, confirming our hypothesized lack of generalization; we attribute this to the structural and semantic differences between the lounge and the apartment-style scenes typically encountered in the HM3D dataset. Fine-tuning on the POLYCAM and DN mesh reconstructions of this scene improves performance, with success rates increasing to 70%. For HSSD, zero-shot performance is significantly lower at 10%, while fine-tuned policies improve success rates to 50% with the POLYCAM mesh and 40% with the DN mesh. The evaluation gains on DN and POLYCAM meshes in simulation translate to improved real-world performance, demonstrating that our approach can efficiently adapt policies to novel real-world environments.
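To make the scoring concrete, the sketch below applies the success criterion just described to per-episode logs and computes a sim-vs-real Pearson correlation of the kind reported in the abstract. The episode logs, the 1 m success-distance threshold, and the per-policy success rates are illustrative assumptions, not the paper's measurements.

```python
import numpy as np

MAX_STEPS = 100       # episode cap from the protocol above
SUCCESS_DIST_M = 1.0  # assumed success threshold in meters (not stated here)

# Hypothetical per-episode logs: steps taken and final distance to the goal.
episodes = [
    {"steps": 62, "final_dist_m": 0.4},
    {"steps": 100, "final_dist_m": 2.7},
    {"steps": 45, "final_dist_m": 0.8},
]

def is_success(ep):
    # An episode succeeds if it ends near the goal within the step budget.
    return ep["steps"] <= MAX_STEPS and ep["final_dist_m"] <= SUCCESS_DIST_M

success_rate = np.mean([is_success(ep) for ep in episodes])
print(f"success rate: {success_rate:.0%}")

# Sim-vs-real predictivity: Pearson correlation between success rates of the
# same policies evaluated in simulation and on the robot (dummy values).
sim_rates = np.array([0.20, 0.45, 0.60, 0.80])
real_rates = np.array([0.10, 0.40, 0.50, 0.70])
r = np.corrcoef(sim_rates, real_rates)[0, 1]
print(f"sim-vs-real Pearson r: {r:.2f}")
```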

Ablations and Analysis

Conclusion

In this work, we introduced a comprehensive pipeline for bridging the gap between real-world and simulated environments when training embodied agents, using 3D Gaussian Splats (GS) and Polycam. By leveraging the MuSHRoom dataset and custom iPhone-captured scenes, we demonstrated an efficient and scalable approach to policy personalization that builds on 3D scene reconstruction from low-effort captures, enabling high-quality training for the ImageNav task. Our pipeline makes scene collection accessible and replicable without specialized hardware or significant cost, making it a practical solution for large-scale embodied AI research. We evaluated overfitted, zero-shot, and fine-tuned policies, showing that fine-tuning pre-trained policies on real-world scene reconstructions improves sim-to-real transfer. We also analyzed the differences between GS-generated DN meshes and POLYCAM meshes, finding that POLYCAM meshes more closely resemble real-world scenes. This work lays the foundation for seamlessly integrating real-world scene captures into simulation, expanding the applicability of embodied AI systems. A promising avenue for future work is integrating GS directly into training, replacing visual observations and goal images to enhance learning efficiency. Furthermore, we aim to extend Gaussian Splats to more complex embodied AI tasks, such as rearrangement and mobile manipulation, broadening their impact across diverse real-world applications.