Curriculum Learning for GPS-Free Indoor Social Navigation

Georgia Tech
*Indicates Equal Contribution

Abstract

Tasks involving human-robot interaction demand seamless collaboration between humans and robots in indoor settings. Habitat 3.0 introduced a novel Social Navigation task in which agents find, track, and follow humans while avoiding collisions. Its baselines show that performance relies heavily on human GPS availability. However, indoor GPS sensors are rarely reliable in real life, and it may be impractical to provide GPS for every human in the scene. In this work, we tackle realistic social navigation by relaxing the requirement for human GPS at every time step. We achieve this via a curriculum learning strategy for training an RL policy capable of finding and tracking humans with sparse or no reliance on human GPS observations. Our experiments demonstrate the effectiveness of our curriculum strategy, achieving performance comparable to the baselines with fewer samples, using a single GPS observation at the beginning of the episode. The project website and videos can be found at gchhablani.github.io/socnav-curr.

Introduction

Embodied navigation in indoor environments poses a significant challenge in robotics and AI. Recent advancements in deep reinforcement learning have been applied to tackle this challenge. However, many existing approaches rely on the assumption that the GPS location of the goal is always available, which is often impractical in real-world scenarios.

In this work, we introduce a novel approach aimed at addressing this limitation. Building upon the Social Navigation (SocialNav) task proposed by Habitat 3.0, where an agent navigates indoor environments alongside humans, we propose a curriculum-based learning strategy. This strategy gradually introduces the agent to increasingly challenging scenarios with infrequent access to human GPS locations, mirroring real-world conditions.

Our approach draws inspiration from curriculum learning techniques, which start with simpler versions of a task and progressively increase complexity. By adapting this concept to focus on human GPS availability, we demonstrate that our agent can achieve comparable performance to those with constant GPS access, reaching peak performance faster.

Methodology

In our methodology, we adopt the standard reinforcement learning (RL) framework, as utilized in Habitat 3.0, to model the Social Navigation (SocialNav) task as a partially-observable Markov Decision Process. Our observation space comprises four simulation sensors: an arm depth camera, an arm RGB camera, a humanoid detector, and a humanoid GPS sensor.
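As a point of reference, a gym-style declaration of such an observation space might look like the sketch below. The sensor key names, value ranges, and image resolutions are illustrative assumptions rather than the exact Habitat 3.0 configuration, which is defined through its own configs.

```python
from gym import spaces
import numpy as np

# A minimal sketch of the agent's observation space (names and shapes assumed).
observation_space = spaces.Dict({
    # Arm-mounted depth camera (H x W x 1).
    "arm_depth": spaces.Box(low=0.0, high=10.0, shape=(224, 171, 1), dtype=np.float32),
    # Arm-mounted RGB camera (H x W x 3).
    "arm_rgb": spaces.Box(low=0, high=255, shape=(224, 171, 3), dtype=np.uint8),
    # Binary flag: is the humanoid currently detected in view?
    "humanoid_detector": spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
    # Relative 2D position of the humanoid (the sensor we mask in this work).
    "humanoid_gps": spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32),
})
```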

To explore scenarios with limited GPS availability, we introduce two strategies for representing human GPS locations when unavailable:

  • ZeroGPS: Providing coordinates (0, 0) instead of GPS data.
  • LastGPS: Utilizing the last known GPS location.
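A minimal sketch of how these two strategies could be implemented as an observation wrapper is shown below. The sensor key `humanoid_gps` and the function interface are our own illustrative assumptions, not the exact implementation.

```python
import numpy as np

def mask_human_gps(obs, step, gps_interval, last_gps, mode="last"):
    """Mask the humanoid GPS observation on steps where it is unavailable.

    obs:          dict of sensor observations at the current step
    step:         index of the current step within the episode
    gps_interval: curriculum level K; the true GPS is exposed every K steps
    last_gps:     most recently exposed GPS reading (cached by the wrapper)
    mode:         "zero" for ZeroGPS, "last" for LastGPS
    """
    if step % gps_interval == 0:
        # GPS is available on this step; cache it for later use.
        last_gps = obs["humanoid_gps"].copy()
    elif mode == "zero":
        # ZeroGPS: replace the reading with the origin (0, 0).
        obs["humanoid_gps"] = np.zeros_like(obs["humanoid_gps"])
    else:
        # LastGPS: repeat the last known reading.
        obs["humanoid_gps"] = last_gps
    return obs, last_gps
```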

In the absence of human GPS information, the finding success rate drops significantly, as shown in the Habitat 3.0 paper. We therefore propose a curriculum-based approach to adaptively train the agent, hypothesizing that a policy that consistently relies on GPS information lacks adaptability.

Our curriculum design gradually reduces the frequency of GPS availability based on predetermined upper and lower success rate thresholds, progressively challenging the agent until it can perform well without GPS assistance. The curriculum level K, set in the range [1, 1500], determines the interval (in steps) at which the agent has access to the current GPS sensor reading within an episode.

Training spans 300 million steps, with curriculum adjustments made every 10 million steps. We incorporate a warm-up phase of 10 million steps to ensure the agent attains satisfactory performance under full GPS conditions. The upper success rate threshold is set at 0.9, while lower thresholds vary from 0.8 to 0.88 across training phases.

To update the curriculum, we explore three approaches (a sketch of these update rules follows the list):

  • Additive: Incrementing (+50) or decrementing (-25) the curriculum level by a fixed value, based on the finding success rate.
  • Multiplicative: Doubling or halving the current GPS frequency based on the finding success rate.
  • Dynamic Additive: Adding or subtracting a dynamic value that scales with the training finding success rate.
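The sketch below illustrates how these update rules could be applied to the curriculum level K. The additive step sizes (+50/-25), thresholds, and 10M-step update cadence come from the text; the exact form of the dynamic scaling and the helper's interface are illustrative assumptions.

```python
def update_curriculum(K, train_fs, upper=0.9, lower=0.8,
                      strategy="additive", K_min=1, K_max=1500):
    """Adjust the GPS interval K from the training finding success rate.

    Called every 10M steps (after a 10M-step full-GPS warm-up). A larger K
    means the agent sees the true human GPS less often, i.e. a harder task.
    """
    if train_fs >= upper:
        # Agent is succeeding: make GPS rarer.
        if strategy == "additive":
            K += 50
        elif strategy == "multiplicative":
            K *= 2
        else:  # dynamic additive: step size scales with success rate (assumed form)
            K += int(50 * train_fs)
    elif train_fs <= lower:
        # Agent is struggling: make GPS more frequent again.
        if strategy == "additive":
            K -= 25
        elif strategy == "multiplicative":
            K //= 2
        else:
            K -= int(50 * (1.0 - train_fs))
    return max(K_min, min(K, K_max))
```

Under this convention, K = 1 corresponds to GPS at every step (the FullGPS regime) and, assuming a maximum episode length of 1500 steps, K = 1500 exposes only a single GPS observation at the start of the episode.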

By employing this methodology, we aim to train an adaptive agent capable of navigating SocialNav tasks with limited reliance on GPS information.

Experiments

Metrics and Evaluation

We assess our policies on 500 unseen episodes, using evaluation metrics from prior work [11]. In ZeroGPS evaluation, the agent receives a fixed GPS observation of (0, 0) at each step, while in LastGPS evaluation it receives the initial human GPS repeated at every time step.



Baselines

We establish two baseline models: FullGPS (human GPS observed at every timestep) and NoGPS (no GPS sensor). The evaluation results in the figure below reveal that while the NoGPS baseline achieves a respectable 0.92 evaluation finding success (eval FS), its performance is inconsistent. Conversely, FullGPS consistently achieves near-perfect performance with a 0.98 eval FS.

Figure 1: Eval FS for best-performing curriculum policies.

Table 1: Evaluation results on checkpoints with the highest FS. FS: finding success; FR: following rate; SPS: finding success weighted by path steps; CR: collision rate; R: episode return.

Experiment      FS (↑)  FR (↑)  SPS (↑)  CR (↓)  R (↑)
NoGPS           0.92    0.72    0.52     0.52    6770.28
FullGPS         0.98    0.68    0.52     0.59    6079.23
Mul (ZGPS)      0.52    0.06    0.02     0.40    31.07
Add (ZGPS)      0.92    0.66    0.44     0.60    4941.44
Dyn-Add (ZGPS)  0.89    0.60    0.41     0.63    4784.58
Mul (LGPS)      0.92    0.63    0.38     0.64    4044.00
Dyn-Add (LGPS)  0.92    0.65    0.44     0.62    5364.39
Add (LGPS)      0.91    0.71    0.51     0.54    6605.48


ZeroGPS vs LastGPS

For brevity, we focus on the top two curriculum strategies in the figure above. LastGPS Dyn-Add outperforms ZeroGPS Add, requiring fewer iterations to reach similar performance, indicating higher sample efficiency. Table 1 confirms the superior performance of LastGPS over ZeroGPS. This result likely stems from ZeroGPS having to implicitly remember the latest human GPS observation, whereas LastGPS continually receives the cached human GPS.

Additive vs Multiplicative vs Dynamic Additive

We observe that FS for Add and Dyn-Add reaches 0.92. The Mul strategy performs poorly with ZeroGPS (0.52 FS) but well with LastGPS (0.92 FS). If the difficulty increases too rapidly, as with Mul, learning can destabilize. Since LastGPS is easier than the ZeroGPS setting (see ZeroGPS vs LastGPS above), combining Mul with ZeroGPS results in a curriculum that becomes too difficult too quickly.

Curriculum vs Baselines

The figure above illustrates that our best curriculum strategies, Add (ZGPS) and Dyn-Add (LGPS), perform as well as the NoGPS baseline while reaching high performance earlier in training. We also observe more stable curves than the NoGPS baseline towards the end of training, indicating that curriculum learning helps develop a robust policy with fewer samples while mitigating reliance on GPS availability.

Evaluation Trajectories


In the above episode, the NoGPS agent keeps moving in circles while waiting for the human; once it starts following the human, it eventually collides while trying to avoid them, ending the episode early. In contrast, the dynamic additive agent trained with curriculum learning finds the human quickly and follows it for longer while avoiding collisions, although the episode ends when it gets stuck between walls and times out.



In the above episode, the NoGPS agent and the dynamic additive LastGPS agent perform similarly: both find the human and follow it for a long duration. However, the ZeroGPS additive agent collides fairly early in the episode, although it still finds the human reasonably quickly. This suggests that the ZeroGPS setting is harder than LastGPS, as the agent receives no GPS information after the first step. The NoGPS agent stays in place, rotating and waiting for the human, while the dynamic additive LastGPS agent shows some navigation ability and finds the human earlier than NoGPS.



In the above episode, the NoGPS agent fails fairly early due to a collision, initially rotating in place without trying to navigate towards the human. In comparison, the additive ZeroGPS and dynamic additive LastGPS agents do a good job of finding the human (turning right instead of left like NoGPS) and then follow it for a good number of steps, albeit also ending in collision.

Conclusion and Future Work

In this work, we use curriculum training to relax the requirement of human GPS availability in the SocialNav task. Our approach achieves success rates comparable to the NoGPS baseline with better stability and fewer training samples, demonstrating the effectiveness of our curriculum learning. In future work, we aim to improve other metrics such as collision rate and SPS, which currently lag behind NoGPS. Additionally, we will explore strategies to encourage active exploration, as we observed instances where the agent moves in circles until a human is visible.