Visually-grounded Humanoid Agents

Visually-grounded Virtual Humans. This system enables humanoid agents to behave actively in novel environments captured in videos.

Abstract

Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits their scalability to novel, unseen environments. We instead ask: how can digital humans behave actively in novel scenes using only visual observations and specified goals? Achieving this would make it possible to populate any 3D environment, at scale, with digital humans that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce **Visually-grounded Humanoid Agents**, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer provides a structured substrate for interaction: it reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware semantic scene reconstruction pipeline and accommodates animatable Gaussian-based human avatars within them. The Agent Layer transforms these avatars into autonomous humanoid agents: it equips them with first-person RGB-D perception and enables accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions that drive their behavior in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction within reconstructed 3D environments. Experimental results demonstrate that our agents achieve robust autonomous behavior through effective planning and action execution, yielding higher task success rates and fewer collisions than both ablations and state-of-the-art planning methods. This work offers a new perspective on actively populating scenes with digital humans, opening new research opportunities for the community and fostering human-centric embodied AI.
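
To make the two layers concrete, the sketch below illustrates, under our own assumptions, the kind of data structures the World Layer is described as providing: a semantically labeled 3D Gaussian scene and an animatable Gaussian avatar placed within it. All class and field names (`SemanticGaussianScene`, `AnimatableAvatar`, `query_label`, etc.) are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal sketch of World Layer data structures (hypothetical, for illustration only).
from dataclasses import dataclass
import numpy as np


@dataclass
class SemanticGaussianScene:
    """Gaussian primitives reconstructed from video, each carrying a semantic label."""
    means: np.ndarray        # (N, 3) Gaussian centers in world coordinates
    covariances: np.ndarray  # (N, 3, 3) anisotropic covariances
    colors: np.ndarray       # (N, 3) RGB appearance
    opacities: np.ndarray    # (N,) per-Gaussian opacity
    labels: np.ndarray       # (N,) integer semantic class per Gaussian

    def query_label(self, point: np.ndarray) -> int:
        """Return the semantic label of the Gaussian nearest to a 3D query point."""
        idx = np.argmin(np.linalg.norm(self.means - point, axis=1))
        return int(self.labels[idx])


@dataclass
class AnimatableAvatar:
    """Gaussian-based human avatar driven by full-body pose parameters."""
    canonical_gaussians: np.ndarray  # (M, 3) Gaussian centers in a canonical pose
    pose: np.ndarray                 # full-body pose parameters (e.g., joint angles)
    root_position: np.ndarray        # (3,) placement of the avatar in the scene
```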

Method Overview

Overview of our framework. The World Layer first creates large-scale, semantically detailed digital environments and animatable human avatars; the Agent Layer then controls these avatars through a perception-action loop for human-scene interaction.
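
As a rough illustration of this perception-action loop, the minimal sketch below cycles through perceive, plan, and act steps until the goal is reported reached. The `perceive`, `plan`, and `act` callables and the dictionary-based plan format are hypothetical placeholders for the first-person RGB-D renderer, the VLM-based planner, and the low-level full-body controller, respectively; they are not the paper's API.

```python
# Minimal sketch of the Agent Layer's perception-action loop (assumed interfaces).
from typing import Any, Callable, Tuple


def agent_loop(
    perceive: Callable[[], Tuple[Any, Any]],  # returns (rgb, depth) from the avatar's first-person camera
    plan: Callable[[Any, Any, str], dict],    # VLM planner: (rgb, depth, goal) -> {"done": bool, "action": ...}
    act: Callable[[Any], None],               # executes one full-body action in the scene
    goal: str,
    max_steps: int = 50,
) -> bool:
    """Run the perceive -> plan -> act cycle until the goal is reported reached."""
    for _ in range(max_steps):
        rgb, depth = perceive()            # perception: egocentric RGB-D observation
        decision = plan(rgb, depth, goal)  # planning: spatially grounded, iterative reasoning
        if decision["done"]:
            return True
        act(decision["action"])            # action: low-level full-body motion execution
    return False
```

Keeping perception, planning, and action as separate callables mirrors the separation the paper describes between the World Layer, which supplies observations, and the Agent Layer, which reasons and acts.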

Comparison with State-of-the-art

Quantitative comparison. Our method significantly outperforms two state-of-the-art vision-and-language navigation (VLN) approaches on both tasks, achieving a better balance between reaching the goal and avoiding collisions.

Qualitative Results

Qualitative results of the ablation study on the VLM-based planning paradigm. We compare our full model with variants that remove spatial-aware visual prompting or iterative reasoning. The variant lacking spatial grounding is prone to navigational errors, deviating from the optimal path and losing track of the goal, which often results in failure. The variant without iterative reasoning exhibits short-sighted behavior: it tends to follow simplistic, straight-line paths and fails to perform complex planning, leading to a significantly higher collision rate. In contrast, our full model produces robust, goal-directed trajectories that successfully navigate around obstacles.
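
For intuition about the two ablated components, the sketch below gives one plausible (assumed, not the paper's actual) realization: spatial-aware visual prompting is approximated by overlaying numbered waypoint markers whose metric distances come from the depth map, and iterative reasoning is approximated by re-querying the planner after every executed step rather than committing to a single straight-line plan. `query_vlm`, `perceive`, and `act` are hypothetical stubs.

```python
# Minimal sketch of spatially grounded, iterative planning (hypothetical interfaces).
import numpy as np


def annotate_waypoints(rgb: np.ndarray, depth: np.ndarray, num_marks: int = 5) -> list:
    """Place numbered markers on evenly spaced image columns and attach metric depth."""
    h, w, _ = rgb.shape
    margin = w // (num_marks + 1)
    cols = np.linspace(margin, w - margin, num_marks, dtype=int)
    row = int(h * 0.75)  # sample near the ground plane in the egocentric view
    return [{"id": i, "pixel": (row, int(c)), "depth_m": float(depth[row, c])}
            for i, c in enumerate(cols)]


def iterative_plan(perceive, query_vlm, act, goal: str, max_steps: int = 30) -> bool:
    """Re-plan after every step so the agent can react to obstacles it uncovers."""
    for _ in range(max_steps):
        rgb, depth = perceive()
        marks = annotate_waypoints(rgb, depth)  # spatial grounding for the visual prompt
        choice = query_vlm(rgb, marks, goal)    # e.g. {"done": False, "mark_id": 2}
        if choice["done"]:
            return True
        act(marks[choice["mark_id"]])           # step toward the chosen marker
    return False
```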