Tasks involving human-robot interaction demand seamless collaboration between the two within indoor settings. Habitat 3.0 introduced a novel Social Navigation task where agents find, track, and follow humans while avoiding collisions. Their baselines show that performance relies heavily on human GPS availability. However, indoor GPS sensors are rarely reliable in real life, and it may be impractical to provide GPS for every human in the scene. In this work, we tackle the issue of realistic social navigation by relaxing the requirement of human GPS at every time step. We achieve this via a curriculum learning strategy for training an RL policy capable of finding and tracking humans with sparse or no reliance on human GPS observations. Our experiments demonstrate the effectiveness of our curriculum strategy, achieving performance comparable to the baselines with fewer samples, using only a single GPS observation at the beginning of the episode. The project website and videos can be found at gchhablani.github.io/socnav-curr.
Embodied navigation in indoor environments poses a significant challenge in robotics and AI. Recent advancements in deep
reinforcement learning have been applied to tackle this challenge. However, many existing approaches rely on the
assumption that the GPS location of the goal is always available, which is often impractical in real-world scenarios.
In this work, we introduce a novel approach aimed at addressing this limitation. Building upon the Social Navigation
(SocialNav) task proposed by Habitat 3.0, where an agent navigates indoor environments alongside humans, we propose a
curriculum-based learning strategy. This strategy gradually introduces the agent to increasingly challenging scenarios
with infrequent access to human GPS locations, mirroring real-world conditions.
Our approach draws inspiration from curriculum learning techniques, which start with simpler versions of a task and
progressively increase complexity. By adapting this concept to focus on human GPS availability, we demonstrate that our
agent achieves performance comparable to agents with constant GPS access while reaching peak performance faster.
In our methodology, we adopt the standard reinforcement learning (RL) framework, as utilized in Habitat 3.0, to
model the Social Navigation (SocialNav) task as a partially-observable Markov Decision Process. Our observation space
comprises four simulation sensors: an arm depth camera, an arm RGB camera, a humanoid detector, and a humanoid GPS
sensor.
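As a rough illustration of this setup, the observation space can be written with Gym-style spaces. The sensor keys, image resolutions, and value ranges below are assumptions for illustration, not the exact Habitat 3.0 configuration.

```python
# Sketch of the SocialNav observation space using Gym-style spaces.
# Sensor keys, resolutions, and ranges are illustrative assumptions,
# not the exact Habitat 3.0 sensor suite configuration.
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    # Depth image from the arm-mounted camera, normalized to [0, 1].
    "arm_depth": spaces.Box(low=0.0, high=1.0, shape=(224, 224, 1), dtype=np.float32),
    # RGB image from the arm-mounted camera.
    "arm_rgb": spaces.Box(low=0, high=255, shape=(224, 224, 3), dtype=np.uint8),
    # Flag indicating whether the humanoid is detected in the current view.
    "humanoid_detector": spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
    # 2D relative position of the human; masked or substituted when unavailable.
    "humanoid_gps": spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32),
})
```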
To explore scenarios with limited GPS availability, we introduce two strategies for representing the human GPS location when it is unavailable: **ZeroGPS**, which substitutes a fixed observation of (0, 0), and **LastGPS**, which reuses the most recently observed human GPS. During training, a curriculum controls how often the human GPS is actually provided, updating the difficulty based on the training finding success rate with one of three rules:

- **Additive**: incrementing (+50) or decrementing (-25) by a fixed value based on the finding success rate.
- **Multiplicative**: doubling or halving the current GPS frequency based on the finding success rate.
- **Dynamic Additive**: adding or subtracting a dynamic value that scales with the training finding success rate.

We assess our policies on 500 unseen episodes, using evaluation metrics borrowed from prior work [11]. In ZeroGPS evaluation, a fixed GPS observation of (0, 0) is provided at every step, while in LastGPS evaluation the agent receives only the initial human GPS, repeated at every time step.
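The sketch below illustrates the three update rules. We assume the curriculum variable is the interval (in environment steps) between human GPS observations, so a larger interval means sparser GPS and a harder task; the success threshold, interval bounds, and dynamic-step scale are assumptions for illustration, not the exact values we use.

```python
# Sketch of the three curriculum update rules. The curriculum variable is assumed to
# be the interval (in env steps) between human GPS observations; the threshold,
# bounds, and scale below are illustrative assumptions.
import numpy as np

SUCCESS_THRESHOLD = 0.9               # assumed: FS above this means "make the task harder"
MIN_INTERVAL, MAX_INTERVAL = 1, 1500  # assumed bounds on the GPS interval

def additive_update(interval: int, finding_success: float) -> int:
    """Additive: add a fixed +50 when the agent is succeeding, subtract 25 otherwise."""
    step = 50 if finding_success >= SUCCESS_THRESHOLD else -25
    return int(np.clip(interval + step, MIN_INTERVAL, MAX_INTERVAL))

def multiplicative_update(interval: int, finding_success: float) -> int:
    """Multiplicative: double the interval when succeeding, halve it otherwise."""
    factor = 2.0 if finding_success >= SUCCESS_THRESHOLD else 0.5
    return int(np.clip(interval * factor, MIN_INTERVAL, MAX_INTERVAL))

def dynamic_additive_update(interval: int, finding_success: float, scale: float = 100.0) -> int:
    """Dynamic Additive: add or subtract a value that scales with the finding success rate."""
    step = scale * (finding_success - SUCCESS_THRESHOLD)
    return int(np.clip(interval + step, MIN_INTERVAL, MAX_INTERVAL))
```

Under this reading, the additive rules change difficulty gradually, while the multiplicative rule can double the interval in a single update, which is consistent with the rapid difficulty growth we discuss for Mul below.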
We establish two baseline models: FullGPS (human GPS observed at every time step) and NoGPS (no GPS sensor). Evaluation results, depicted in the figure below, reveal that while the NoGPS baseline achieves a respectable 0.92 evaluation finding success (eval FS), its performance is inconsistent. Conversely, FullGPS consistently achieves near-perfect performance with 0.98 eval FS.
Experiment | FS (↑) | FR (↑) | SPS (↑) | CR (↓) | R (↑)
---|---|---|---|---|---
NoGPS | 0.92 | 0.72 | 0.52 | 0.52 | 6770.28
FullGPS | 0.98 | 0.68 | 0.52 | 0.59 | 6079.23
Mul (ZGPS) | 0.52 | 0.06 | 0.02 | 0.40 | 31.07
Add (ZGPS) | 0.92 | 0.66 | 0.44 | 0.60 | 4941.44
Dyn-Add (ZGPS) | 0.89 | 0.60 | 0.41 | 0.63 | 4784.58
Mul (LGPS) | 0.92 | 0.63 | 0.38 | 0.64 | 4044.00
Dyn-Add (LGPS) | 0.92 | 0.65 | 0.44 | 0.62 | 5364.39
Add (LGPS) | 0.91 | 0.71 | 0.51 | 0.54 | 6605.48

Table 1: Evaluation on 500 unseen episodes. ZGPS: ZeroGPS; LGPS: LastGPS; FS: finding success; FR: following rate; SPS: finding success weighted by path steps; CR: collision rate; R: episode return.
Dyn-Add outperforms ZeroGPS Add, requiring fewer iterations for similar performance, indicating its higher sample efficiency. Table 1 confirms the superior performance of LastGPS over ZeroGPS. This result likely stems from ZeroGPS needing to implicitly remember the latest human GPS observation, whereas LastGPS continually receives the cached human GPS.
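To make this distinction concrete, the sketch below shows a hypothetical observation wrapper that substitutes the humanoid GPS on steps where it is unavailable; the observation key and helper class are assumptions for illustration, not the actual implementation.

```python
# Hypothetical helper for filling in the humanoid GPS observation on steps where it
# is unavailable. The key "humanoid_gps" and this class are illustrative assumptions.
import numpy as np

class GPSSubstitution:
    def __init__(self, mode: str = "last"):
        assert mode in ("zero", "last")
        self.mode = mode
        self._cached = np.zeros(2, dtype=np.float32)  # last observed human GPS

    def __call__(self, obs: dict, gps_available: bool) -> dict:
        if gps_available:
            # Real reading: pass it through and cache it for later steps.
            self._cached = np.asarray(obs["humanoid_gps"], dtype=np.float32)
        elif self.mode == "zero":
            # ZeroGPS: the policy sees a constant (0, 0) when GPS is unavailable.
            obs["humanoid_gps"] = np.zeros(2, dtype=np.float32)
        else:
            # LastGPS: the policy keeps seeing the most recently cached reading.
            obs["humanoid_gps"] = self._cached.copy()
        return obs
```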
Add and Dyn-Add reach 0.92 FS. The Mul strategy performs poorly with ZeroGPS (0.52 FS) but well with LastGPS (0.92 FS). If the difficulty updates too rapidly, as in Mul, learning becomes substantially harder. Since LastGPS is easier than the ZeroGPS setting (see Section 3.3), the combination of Mul and ZeroGPS results in a curriculum that is too difficult.
Our curriculum-trained policies (Add-Curr and Dyn-Add-Curr-LGPS) perform as well as the NoGPS baseline but reach high performance earlier in training. We also observe more stable curves than the NoGPS baseline towards the end of training, indicating that curriculum learning helps develop a robust policy with fewer samples while mitigating reliance on GPS availability.
In the above episode, we can see that the NoGPS agent keeps moving in circles while waiting for the human; once it starts following the human, it collides while trying to avoid it, ending the episode very early. In contrast, the Dynamic Additive agent trained with curriculum learning finds the human quickly and follows it for longer while avoiding collisions, although the episode ends when the agent gets stuck between walls and times out.
In the above episode, we see that the NoGPS agent and the Dynamic Additive LastGPS agent perform similarly: both find the human and follow it for a long duration. However, the ZeroGPS Additive agent collides fairly early in the episode, although it still finds the human reasonably quickly. This suggests that ZeroGPS is harder than the LastGPS setting, as the agent receives no GPS information of any kind after the first step. The NoGPS agent stays in place, rotating while waiting for the human, whereas the Dynamic Additive LastGPS agent exhibits some navigation ability and finds the human earlier than NoGPS.
In this work, we use curriculum training to relax the requirement of human GPS availability in the SocialNav task. Our approach achieves success rates comparable to the NoGPS baseline with better stability and fewer training samples, demonstrating the effectiveness of our curriculum learning. In future work, we aim to improve other metrics, such as collision rate and SPS, which currently lag behind NoGPS. Additionally, we will explore strategies to encourage active exploration, as we observed instances where the agent moves in circles until a human is visible.