EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device

ICCV 2025

Georgia Tech, Toyota Research Institute
EmbodiedSplat Teaser Image

The sim-to-real gap remains a bottleneck in Embodied AI: synthetic datasets lack realism, and real-world datasets are expensive to collect. Advances in 3D scene representation, such as 3D Gaussian Splatting (GS) and, subsequently, DN-Splatter, enable efficient scene reconstruction from low-effort captures, yet their potential for training and deploying robot policies remains largely unexplored. In this work, we introduce EmbodiedSplat, an efficient and cost-effective pipeline that bridges the sim-to-real gap in navigation by creating high-quality 3D simulations from low-cost iPhone captures using depth-aware 3D Gaussian Splats (GS) and Polycam. Our comprehensive evaluations in both simulated and real-world environments show that EmbodiedSplat substantially improves sim-to-real transfer, owing to the effectiveness of fine-tuning on high-fidelity reconstructions. Our in-depth analysis explores the relationships among reconstruction quality, pre-training scenes, downstream navigation performance, and training strategies. Our findings provide valuable insights into how reconstruction fidelity influences policy generalization and applicability. We provide an open-source codebase and dataset to facilitate reproducibility and further research in this domain.

Abstract

The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform zero-shot baselines pre-trained on both large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success-rate improvements of 20% and 40%, respectively, on a real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87–0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort.

The EmbodiedSplat Pipeline

EmbodiedSplat Pipeline

The EmbodiedSplat pipeline captures university scenes using Polycam and processes them with Nerfstudio, which produces RGB frames, associated iPhone ground-truth depth maps, and camera poses. DN-Splatter is then used to train a Gaussian Splatting model with depth and normal regularization. Meshes are extracted from the trained Gaussians via Poisson surface reconstruction, then processed and loaded into Habitat-Sim for agent training in simulation. Finally, the trained policies are deployed in the same real-world scenes for image-goal navigation. A sketch of the Habitat-Sim loading step follows below.
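To make the final step concrete, below is a minimal sketch of loading a reconstructed mesh into Habitat-Sim and stepping an agent. The scene path, sensor resolution, and camera height are hypothetical placeholders, not the exact configuration used in the paper.

```python
import habitat_sim

# Hypothetical path to a processed EmbodiedSplat reconstruction (.glb).
SCENE_PATH = "data/scenes/lounge.glb"

# Backend configuration: point the simulator at the reconstructed scene.
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = SCENE_PATH

# A single RGB camera; resolution and mounting height are assumed values
# (the height loosely approximates a Stretch robot's camera).
rgb_spec = habitat_sim.CameraSensorSpec()
rgb_spec.uuid = "rgb"
rgb_spec.sensor_type = habitat_sim.SensorType.COLOR
rgb_spec.resolution = [480, 640]
rgb_spec.position = [0.0, 1.3, 0.0]

agent_cfg = habitat_sim.agent.AgentConfiguration(sensor_specifications=[rgb_spec])
sim = habitat_sim.Simulator(habitat_sim.Configuration(backend_cfg, [agent_cfg]))

# Step with a discrete action from the default action space and read the frame.
obs = sim.step("move_forward")
print(obs["rgb"].shape)
sim.close()
```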

Our Dataset

Reconstructed meshes of the Captured scenes

We capture scenes from a university environment (classroom, community lounges, conference rooms, etc.). For custom data collection, we use an iPhone 13 Pro Max to record RGB-D data with the Polycam app. We then use Nerfstudio to process the RGB-D data and sample 1,000 aligned RGB-depth frames with low blur scores, along with their corresponding poses. Each capture requires 20–30 minutes of recording with Polycam. We repeat this process for different indoor scenes: lounge, classroom, conf_a, and conf_b. For mesh reconstruction, we choose DN-Splatter for its superior mesh reconstruction performance compared to alternatives and for the simplicity of its integration with Habitat-Sim. DN-Splatter leverages depth and normal regularization along with smoothness losses to maintain geometric consistency during Gaussian Splat training. After training the GS model and extracting .ply meshes, we convert the meshes to .glb in Blender, as sketched below. The final meshes are shown above.
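For the .ply-to-.glb conversion, a minimal Blender Python sketch along the following lines can automate the step. The file paths are hypothetical, and the import operator depends on the Blender version (bpy.ops.wm.ply_import in recent releases, bpy.ops.import_mesh.ply in older ones); this is not the paper's exact script.

```python
# Run headlessly inside Blender:
#   blender --background --python convert_ply_to_glb.py
import bpy

# Hypothetical input/output paths for one captured scene.
PLY_PATH = "meshes/lounge.ply"
GLB_PATH = "meshes/lounge.glb"

# Clear the default scene so only the reconstruction is exported.
bpy.ops.object.select_all(action="SELECT")
bpy.ops.object.delete()

# Import the Poisson-reconstructed mesh (use bpy.ops.import_mesh.ply
# on Blender versions before 3.6).
bpy.ops.wm.ply_import(filepath=PLY_PATH)

# Export as binary glTF, the format loaded by Habitat-Sim.
bpy.ops.export_scene.gltf(filepath=GLB_PATH, export_format="GLB")
```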

Zero-Shot Evaluation Results

Fine-tuned Evaluation Results

Real World Performance

Real World Performance Results

We evaluate both zero-shot and fine-tuned policies in the real-world lounge scene on a Stretch robot. Each episode is capped at 100 steps, with 10 distinct start-and-goal locations sampled within the scene. To assess performance, we record the number of steps taken and the final distance to the goal at the end of each episode, which together determine whether the agent successfully completes the task. The zero-shot HM3D policy achieves a 50% success rate, confirming our hypothesized lack of generalization; we attribute this to the structural and semantic differences between the lounge and the apartment-style scenes typically encountered in the HM3D dataset. Fine-tuning on the POLYCAM and DN mesh reconstructions of this scene improves performance, with success rates increasing to 70%. For HSSD, zero-shot performance is significantly lower at 10%, while fine-tuned policies improve success rates to 50% with the POLYCAM mesh and 40% with the DN mesh. The evaluation gains on DN and POLYCAM meshes in simulation translate to improved real-world performance, demonstrating that our approach can efficiently adapt policies to novel real-world environments.
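To make the scoring concrete, the sketch below applies the success criterion just described to per-episode logs and computes a sim-vs-real Pearson correlation of the kind reported in the abstract. The episode logs, the 1 m success-distance threshold, and the per-policy success rates are illustrative assumptions, not the paper's measurements.

```python
import numpy as np

MAX_STEPS = 100       # episode cap from the protocol above
SUCCESS_DIST_M = 1.0  # assumed success threshold in meters (not stated here)

# Hypothetical per-episode logs: steps taken and final distance to the goal.
episodes = [
    {"steps": 62, "final_dist_m": 0.4},
    {"steps": 100, "final_dist_m": 2.7},
    {"steps": 45, "final_dist_m": 0.8},
]

def is_success(ep):
    # An episode succeeds if it ends near the goal within the step budget.
    return ep["steps"] <= MAX_STEPS and ep["final_dist_m"] <= SUCCESS_DIST_M

success_rate = np.mean([is_success(ep) for ep in episodes])
print(f"success rate: {success_rate:.0%}")

# Sim-vs-real predictivity: Pearson correlation between success rates of the
# same policies evaluated in simulation and on the robot (dummy values).
sim_rates = np.array([0.20, 0.45, 0.60, 0.80])
real_rates = np.array([0.10, 0.40, 0.50, 0.70])
r = np.corrcoef(sim_rates, real_rates)[0, 1]
print(f"sim-vs-real Pearson r: {r:.2f}")
```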

Ablations and Analysis

Conclusion

In this work, we introduced a comprehensive pipeline for bridging the gap between real-world and simulated environments when training embodied agents, using 3D Gaussian Splats (GS) and Polycam. By leveraging the MuSHRoom dataset and custom iPhone-captured scenes, we demonstrated an efficient and scalable approach to policy personalization that builds on 3D scene reconstruction from low-effort captures, enabling high-quality training for the ImageNav task. Our pipeline makes scene collection accessible and replicable without specialized hardware or significant cost, making it a practical solution for large-scale embodied AI research. We evaluated overfitted, zero-shot, and fine-tuned policies, showing that fine-tuning pre-trained policies on real-world scene reconstructions improves sim-to-real transfer. We also analyzed the differences between GS-generated DN meshes and POLYCAM meshes, finding that POLYCAM meshes more closely resemble real-world scenes. This work lays the foundation for seamlessly integrating real-world scene captures into simulation, expanding the applicability of embodied AI systems. A promising avenue for future work is integrating GS directly into training, replacing visual observations and goal images to enhance learning efficiency. Furthermore, we aim to extend Gaussian Splats to more complex embodied AI tasks, such as rearrangement and mobile manipulation, broadening their impact across diverse real-world applications.