Visually-grounded Humanoid Agents in Realistic 3D Scenes. From monocular videos, our framework reconstructs a high-fidelity 3D environment with rich semantics and instantiates lifelike humanoid agents aligned with the scene. Each agent perceives the world through its own egocentric view and acts autonomously, enabling realistic, purposeful behaviors within the reconstructed environment.
Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments.
We instead ask: how can digital humans behave actively in novel scenes, given only visual observations and specified goals? Achieving this would enable populating any 3D environment, at scale, with digital humans that exhibit spontaneous, natural, goal-directed behaviors.
To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer provides a structured substrate for interaction by reconstructing semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware semantic scene reconstruction pipeline, and by accommodating animatable Gaussian-based human avatars within them.
The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions that drive their behaviors in the scene.
We further introduce a comprehensive benchmark to evaluate humanoid–scene interaction within diverse reconstructed 3D environments. Extensive experimental results demonstrate that our agents achieve robust autonomous behavior through effective planning and action execution, yielding higher task success rates and fewer collisions compared to both ablations and state-of-the-art planning methods.
This work offers a new perspective on populating scenes with digital humans in an active manner, enabling more research opportunities for the community and fostering human-centric embodied AI. Data, code, and models will be open-sourced.
Framework Overview. Our framework consists of two layers. The World Layer processes real-world data (scene videos, object assets, human videos) to build large-scale, semantically detailed environments via occlusion-aware reconstruction, and populates them with GS-based animatable human avatars. Then the Agent Layer drives these avatars for human-scene interaction via a perception-action loop, where visually-grounded agents plan actions from egocentric observations.
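The perception-action loop described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the interfaces `render_egocentric`, `step`, and the trivial placeholder policy are all assumptions made for the sketch.

```python
# Minimal sketch of the Agent Layer's perception-action loop.
# All class and method names here are illustrative assumptions,
# not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class Observation:
    rgb: list = field(default_factory=list)    # egocentric RGB frame (placeholder)
    depth: list = field(default_factory=list)  # aligned depth map (placeholder)

class DummyWorld:
    """Stand-in for the reconstructed 3DGS environment."""
    def __init__(self, goal_x):
        self.goal_x = goal_x
        self.agent_x = 0

    def render_egocentric(self):
        # In the real system this would render RGB-D from the agent's viewpoint.
        return Observation()

    def step(self, action):
        # "forward" moves the agent one unit toward the goal.
        if action == "forward":
            self.agent_x += 1

    def goal_reached(self):
        return self.agent_x >= self.goal_x

def perception_action_loop(world, max_steps=20):
    """Perceive, plan, and act until the goal is reached; return steps taken."""
    for t in range(max_steps):
        obs = world.render_egocentric()  # perceive (egocentric RGB-D)
        action = "forward"               # plan (trivial placeholder policy)
        world.step(action)               # act (full-body motion in the real system)
        if world.goal_reached():
            return t + 1
    return None

steps = perception_action_loop(DummyWorld(goal_x=5))
```

In the actual framework, the planning step is performed by the spatially-aware, iterative reasoning module, and the action is realized as full-body avatar motion rather than a discrete move.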
The World Layer reconstructs large-scale 3D environments from monocular video, producing RGB renderings, collision meshes, depth maps, and semantic features within a unified 3D Gaussian framework.
World Layer reconstruction outputs. From left to right: RGB rendering (3DGS), collision mesh, depth map, and semantic feature visualization.
We explore autonomous humanoid behaviors in open-world reconstructed 3D environments. An effective benchmark requires environments that provide photorealistic rendering, physical collision meshes, and rich open-vocabulary semantics. We conduct a thorough survey of existing large-scale scene datasets and curate a diverse hierarchy of scenes for comprehensive evaluation:
| Scene | World Eval. | Agent Eval. | Universality Eval. | Recon. Quality | Inter. Range | Sem. Richness | Coll. Mesh |
|---|---|---|---|---|---|---|---|
| SmallCity | ✓ | ✓ | ✓ | High | Massive | High | ✓ |
| XGRIDS | | ✓ | ✓ | Ultra | Massive | High | ✓ |
| SAGE-3D | | ✓ | ✓ | High | Room-scale | Medium | ✓ |
| ArgoVerse2, PandaSet | | | ✓ | Medium | Large | Vehicle-only | ✓ |
| Mip-NeRF360, DL3DV-10K | | | ✓ | Medium | Limited | Low/Medium | ✓ |
| MatrixCity, Horizon-GS | | | ✓ | Low | Massive | High | ✓ |
| SuperSplat, Pointcosm | | | ✓ | Ultra | Large | Medium | ✗ |
Settings of the scenes used in our experiments. SmallCity satisfies all criteria and serves as the primary scene for both world-layer and agent-layer evaluation. XGRIDS and SAGE-3D complement it to evaluate the generalization of the Agent Layer across diverse outdoor and indoor environments.
Navigation is one of the most fundamental abilities for autonomous agents in open-world 3D environments, especially in large-scale outdoor scenes where long-horizon planning, visual distractors, and frequent occlusions demand robust perception and adaptive decision-making. We design a multi-level benchmark with increasing complexity: SimNav (L1), ObstNav (L2), and SocialNav (L3), plus a multi-goal setting for long-horizon planning. See figures below for detailed examples.
Level-1 (SimNav): fundamental goal-reaching with a clear path.
Level-2 (ObstNav): static obstacle on the optimal route, requiring adaptive replanning.
Level-3 (SocialNav): dynamic pedestrians intersecting the agent’s path, requiring reactive social avoidance.
Multi-Goal: 5 consecutive landmarks per episode, assessing long-horizon planning.
Top: egocentric departure views; bottom: BEV map with goals and optimal trajectory.
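A benchmark episode across these levels could be specified along the following lines. The schema is a hedged sketch: the field names (`level`, `goals`, `obstacles`, `pedestrians`) are illustrative assumptions, not the released benchmark's format.

```python
# Illustrative episode specification for the multi-level navigation benchmark.
# Field names are assumptions for this sketch, not the benchmark's schema.
from dataclasses import dataclass, field

@dataclass
class Episode:
    level: str                  # "SimNav" | "ObstNav" | "SocialNav" | "MultiGoal"
    start: tuple                # agent spawn position (x, y)
    goals: list                 # single goal for L1-L3; up to 5 landmarks for MultiGoal
    obstacles: list = field(default_factory=list)    # static blockers (ObstNav)
    pedestrians: list = field(default_factory=list)  # dynamic agents (SocialNav)

    def is_multi_goal(self):
        return len(self.goals) > 1

# A long-horizon episode visiting 5 consecutive landmarks.
ep = Episode(level="MultiGoal", start=(0.0, 0.0),
             goals=[(5, 0), (5, 5), (0, 5), (-5, 5), (-5, 0)])
```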
We evaluate the Agent Layer on 200 test scenarios across all three navigation levels in SmallCity, comparing against three SOTA VLN baselines: NaVILA, NaVid, and Uni-NaVid. Our method consistently outperforms all baselines by a substantial margin, achieving a task success rate up to roughly 30 percentage points higher than the strongest baseline, while maintaining comparable or lower collision rates.
Quantitative comparison on SmallCity (200 test cases). SR = Success Rate, SPL = Success weighted by Path Length, CR = Collision Rate. Best in bold, second-best underlined.
| Method | SimNav | | | ObstNav | | | SocialNav | | |
|---|---|---|---|---|---|---|---|---|---|
| | SR↑ | SPL↑ | CR↓ | SR↑ | SPL↑ | CR↓ | SR↑ | SPL↑ | CR↓ |
| NaVILA | 22.5% | 0.199 | 70.8% | 20.8% | 0.176 | 75.0% | 8.3% | 0.072 | 84.1% |
| NaVid | 37.4% | 0.279 | 19.2% | 32.5% | 0.233 | 23.1% | 17.5% | 0.138 | 51.7% |
| Uni-NaVid | 38.8% | 0.370 | 12.8% | 25.3% | 0.242 | 29.4% | 12.5% | 0.112 | 66.7% |
| Ours | 68.3% | 0.640 | 13.3% | 55.8% | 0.516 | 30.8% | 39.2% | 0.366 | 48.3% |
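For reference, SR and SPL are standard navigation metrics; the sketch below shows how such numbers are typically computed from per-episode logs (the `episodes` record format is an assumption for illustration, not the benchmark's evaluation code).

```python
# Standard navigation metrics from per-episode logs.
# SPL = mean over episodes of S * l / max(p, l), where S is the success
# indicator, l the shortest-path length, and p the agent's path length.

def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length."""
    total = 0.0
    for e in episodes:
        l, p = e["shortest_path"], e["agent_path"]
        total += e["success"] * l / max(p, l)
    return total / len(episodes)

episodes = [
    {"success": 1, "shortest_path": 10.0, "agent_path": 12.5},  # reached, detour
    {"success": 1, "shortest_path": 8.0,  "agent_path": 8.0},   # reached, optimal
    {"success": 0, "shortest_path": 9.0,  "agent_path": 20.0},  # failed
]
sr = success_rate(episodes)  # 2/3
sc = spl(episodes)           # (10/12.5 + 1 + 0) / 3 = 0.6
```

SPL penalizes successful episodes by how far the taken path exceeds the shortest path, which is why a method can have a high SR but a noticeably lower SPL.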
Generalization on diverse environments and multi-goal evaluation. XGRIDS (2 outdoor scenes), SAGE-3D (4 indoor scenes), and a multi-goal benchmark with 2–5 consecutive landmarks per episode. PR = Progress Rate, PPL = Progress weighted by Path Length.
| Method | XGRIDS | | | SAGE-3D | | | Multi-Goal | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | SR↑ | SPL↑ | CR↓ | SR↑ | SPL↑ | CR↓ | SR↑ | PR↑ | PPL↑ | CR↓ |
| NaVILA | 17.8% | 0.160 | 71.1% | 31.7% | 0.293 | 67.2% | 0.0% | 21.6% | 0.102 | 78.9% |
| NaVid | 40.0% | 0.316 | 35.6% | 33.1% | 0.324 | 58.4% | 0.0% | 17.5% | 0.096 | 70.2% |
| Uni-NaVid | 31.1% | 0.298 | 48.9% | 25.1% | 0.228 | 35.2% | 12.3% | 8.3% | 0.066 | 48.9% |
| Ours | 74.7% | 0.725 | 14.7% | 58.2% | 0.534 | 33.3% | 38.0% | 73.6% | 0.707 | 34.8% |
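The multi-goal metrics can be computed analogously. The caption defines PR as Progress Rate and PPL as Progress weighted by Path Length; the formulas below are a plausible SPL-style analogue, not necessarily the paper's exact implementation.

```python
# Multi-goal navigation metrics (assumed SPL-style analogue for illustration).

def progress_rate(episodes):
    """Mean fraction of landmarks reached per episode."""
    return sum(e["reached"] / e["total_goals"] for e in episodes) / len(episodes)

def ppl(episodes):
    """Progress weighted by path length: mean of progress * l / max(p, l)."""
    total = 0.0
    for e in episodes:
        progress = e["reached"] / e["total_goals"]
        l, p = e["shortest_path"], e["agent_path"]
        total += progress * l / max(p, l)
    return total / len(episodes)

episodes = [
    {"reached": 5, "total_goals": 5, "shortest_path": 40.0, "agent_path": 50.0},
    {"reached": 2, "total_goals": 5, "shortest_path": 40.0, "agent_path": 30.0},
]
pr = progress_rate(episodes)  # (1.0 + 0.4) / 2 = 0.7
score = ppl(episodes)         # (1.0*40/50 + 0.4*40/40) / 2 = 0.6
```

Unlike SR, which is all-or-nothing per episode, PR gives partial credit for each landmark reached, which is why PR can be high even when multi-goal SR is low.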
We are the first to explore semantic understanding on large-scale outdoor 3DGS street scenes at the dataset, benchmark, and method levels. We compare our World Layer against three representative semantic Gaussian Splatting methods: Feature-3DGS, GW-3DGS, and OpenGaussian, evaluated by mIoU and mAcc on the SmallCity scene.
Quantitative evaluation of semantic scene reconstruction. Comparison with baselines and ablation of World Layer components. (a) Base, (b) +Occ-Aware, (c) +ViewSel (Ours). Best in bold.
| Metric | Baselines | | | Ablation (Ours) | | |
|---|---|---|---|---|---|---|
| | Feature-3DGS | GW-3DGS | OpenGS | (a) Base | (b) +Occ | (c) +ViewSel |
| mIoU↑ | 0.256 | 0.025 | 0.483 | 0.518 | 0.556 | 0.601 |
| mAcc↑ | 0.647 | 0.109 | 0.675 | 0.636 | 0.699 | 0.733 |
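mIoU and mAcc are standard semantic segmentation metrics computed from a class confusion matrix; the sketch below shows the usual definitions on a toy 3-class example (pure illustration, not the benchmark's evaluation code).

```python
# mIoU and mAcc from a confusion matrix, where confusion[i][j] counts points
# with ground-truth class i predicted as class j.

def miou_macc(confusion):
    """Return (mean IoU, mean per-class accuracy) over classes present."""
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # gt c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, gt other
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
        if tp + fn > 0:
            accs.append(tp / (tp + fn))                   # per-class recall
    return sum(ious) / len(ious), sum(accs) / len(accs)

conf = [
    [8, 1, 1],  # class 0: 8 of 10 points correct
    [0, 9, 1],  # class 1: 9 of 10 points correct
    [2, 0, 8],  # class 2: 8 of 10 points correct
]
miou, macc = miou_macc(conf)
```

Because IoU also counts false positives in the denominator, mIoU is always at most mAcc, matching the gap between the two rows in the table.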
Beyond standard goal-directed navigation, our framework readily extends to a wide range of behaviors and environments. The Agent Layer supports diverse locomotion styles, social interactions among multiple agents, and long-horizon multi-goal planning. Meanwhile, the World Layer is compatible with any Gaussian Splatting-based environment, enabling generalization from large-scale outdoor street scenes to indoor rooms. Below we showcase three representative extensions.
Multi-agent social navigation in a town scene. Multiple humanoid agents are simultaneously deployed in a reconstructed outdoor town environment. Each agent autonomously navigates toward its own goal while reactively avoiding collisions with other moving agents, demonstrating socially-aware planning capabilities.
Diverse motion types in a greenbelt scene. Our framework is not restricted to walking-only navigation. By varying the motion type in the action primitives, agents can exhibit diverse locomotion styles such as jogging, jumping, kicking, and other full-body movements, all while maintaining goal-directed behavior.
Multi-goal navigation in an indoor scene. The agent is tasked with visiting a sequence of consecutive landmarks within a room-scale indoor environment. This demonstrates both the long-horizon planning capability of our iterative reasoning mechanism and the generalization of our Agent Layer to indoor scenes beyond large-scale outdoor settings.
Qualitative results of the ablation study on context-aware action planning. The variant without spatial grounding is prone to navigational errors, deviating from the optimal path and losing track of the goal. The variant without iterative reasoning exhibits short-sighted behavior: it tends to follow simplistic, straight-line paths and fails to perform complex planning, leading to a significantly higher collision rate. In contrast, our full model produces robust, goal-directed trajectories that successfully navigate around obstacles.
Qualitative ablation on occlusion-aware semantic scene reconstruction. Our framework achieves precise 3D instance segmentation with well-defined boundaries, demonstrating robustness to severe occlusion while recognizing thin or small objects in large-scale outdoor scenes.
We introduce Visually-grounded Humanoid Agents, a coupled two-layer (world–agent) paradigm for embodied digital humans that can perceive, decide, and act autonomously in complex real-world 3D environments.
We further introduce a comprehensive benchmark to evaluate humanoid–scene interaction across diverse reconstructed 3D environments.
@article{ye2026vghuman,
author = {Ye, Hang and Ma, Xiaoxuan and Lu, Fan and Wu, Wayne and Lin, Kwan-Yee and Wang, Yizhou},
title = {Visually-grounded Humanoid Agents},
journal = {arXiv preprint arXiv:},
year = {2026},
}