Visually-grounded Humanoid Agents in Realistic 3D Scenes. From monocular videos, our framework reconstructs a high-fidelity 3D environment with rich semantics and instantiates lifelike humanoid agents aligned with the scene. Each agent perceives the world through its own egocentric view and acts autonomously, enabling realistic, purposeful behaviors within the reconstructed environment.
Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments.
We instead ask: how can digital humans behave actively in novel scenes, given only visual observations and specified goals? Achieving this would enable populating any 3D environment, at scale, with digital humans that exhibit spontaneous, natural, goal-directed behaviors.
To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer provides a structured substrate for interaction by reconstructing semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware semantic scene reconstruction pipeline, and by accommodating animatable Gaussian-based human avatars within them.
The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions that drive their behaviors in the scene.
We further introduce a comprehensive benchmark to evaluate humanoid–scene interaction within diverse reconstructed 3D environments. Extensive experimental results demonstrate that our agents achieve robust autonomous behavior through effective planning and action execution, yielding higher task success rates and fewer collisions compared to both ablations and state-of-the-art planning methods.
This work offers a new perspective on populating scenes with digital humans in an active manner, enabling more research opportunities for the community and fostering human-centric embodied AI. Data, code, and models will be open-sourced.
Framework Overview. Our framework consists of two layers. The World Layer processes real-world data (scene videos, object assets, human videos) to build large-scale, semantically detailed environments via occlusion-aware reconstruction, and populates them with GS-based animatable human avatars. Then the Agent Layer drives these avatars for human-scene interaction via a perception-action loop, where visually-grounded agents plan actions from egocentric observations.
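The perception-action loop described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the interfaces `render_egocentric`, `step`, and the trivial placeholder policy are all assumptions made for the sketch.

```python
# Minimal sketch of the Agent Layer's perception-action loop.
# All class and method names here are illustrative assumptions,
# not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class Observation:
    rgb: list = field(default_factory=list)    # egocentric RGB frame (placeholder)
    depth: list = field(default_factory=list)  # aligned depth map (placeholder)

class DummyWorld:
    """Stand-in for the reconstructed 3DGS environment."""
    def __init__(self, goal_x):
        self.goal_x = goal_x
        self.agent_x = 0

    def render_egocentric(self):
        # In the real system this would render RGB-D from the agent's viewpoint.
        return Observation()

    def step(self, action):
        # "forward" moves the agent one unit toward the goal.
        if action == "forward":
            self.agent_x += 1

    def goal_reached(self):
        return self.agent_x >= self.goal_x

def perception_action_loop(world, max_steps=20):
    """Perceive, plan, and act until the goal is reached; return steps taken."""
    for t in range(max_steps):
        obs = world.render_egocentric()  # perceive (egocentric RGB-D)
        action = "forward"               # plan (trivial placeholder policy)
        world.step(action)               # act (full-body motion in the real system)
        if world.goal_reached():
            return t + 1
    return None

steps = perception_action_loop(DummyWorld(goal_x=5))
```

In the actual framework, the planning step is performed by the spatially-aware, iterative reasoning module, and the action is realized as full-body avatar motion rather than a discrete move.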
The World Layer reconstructs large-scale 3D environments from monocular video, producing RGB renderings, collision meshes, depth maps, and semantic features within a unified 3D Gaussian framework.
World Layer reconstruction outputs. From left to right: RGB rendering (3DGS), collision mesh, depth map, and semantic feature visualization.
We explore autonomous humanoid behaviors in open-world reconstructed 3D environments. An effective benchmark requires environments that provide photorealistic rendering, physical collision meshes, and rich open-vocabulary semantics. We conduct a thorough survey of existing large-scale scene datasets and curate a diverse hierarchy of scenes for comprehensive evaluation:
| Scene | World Eval. | Agent Eval. | Universality Eval. | Recon. Quality | Inter. Range | Sem. Richness | Coll. Mesh |
|---|---|---|---|---|---|---|---|
| SmallCity | ✓ | ✓ | ✓ | High | Massive | High | ✓ |
| XGRIDS | | ✓ | ✓ | Ultra | Massive | High | ✓ |
| SAGE-3D | | ✓ | ✓ | High | Room-scale | Medium | ✓ |
| ArgoVerse2, PandaSet | | | ✓ | Medium | Large | Vehicle-only | ✓ |
| Mip-NeRF360, DL3DV-10K | | | ✓ | Medium | Limited | Low/Medium | ✓ |
| MatrixCity, Horizon-GS | | | ✓ | Low | Massive | High | ✓ |
| SuperSplat, Pointcosm | | | ✓ | Ultra | Large | Medium | ✗ |
Settings of the scenes used in our experiments. SmallCity satisfies all criteria and serves as the primary scene for both world-layer and agent-layer evaluation. XGRIDS and SAGE-3D complement it to evaluate the generalization of the Agent Layer across diverse outdoor and indoor environments.
Navigation is one of the most fundamental abilities for autonomous agents in open-world 3D environments, especially in large-scale outdoor scenes where long-horizon planning, visual distractors, and frequent occlusions demand robust perception and adaptive decision-making. We design a multi-level benchmark with increasing complexity: SimNav (L1), ObstNav (L2), and SocialNav (L3), plus a multi-goal setting for long-horizon planning. See figures below for detailed examples.
Level-1 (SimNav): fundamental goal-reaching with a clear path.
Level-2 (ObstNav): static obstacle on the optimal route, requiring adaptive replanning.
Level-3 (SocialNav): dynamic pedestrians intersecting the agent’s path, requiring reactive social avoidance.
Multi-Goal: 5 consecutive landmarks per episode, assessing long-horizon planning.
Top: egocentric departure views; bottom: BEV map with goals and optimal trajectory.
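A benchmark episode across these levels could be specified along the following lines. The schema is a hedged sketch: the field names (`level`, `goals`, `obstacles`, `pedestrians`) are illustrative assumptions, not the released benchmark's format.

```python
# Illustrative episode specification for the multi-level navigation benchmark.
# Field names are assumptions for this sketch, not the benchmark's schema.
from dataclasses import dataclass, field

@dataclass
class Episode:
    level: str                  # "SimNav" | "ObstNav" | "SocialNav" | "MultiGoal"
    start: tuple                # agent spawn position (x, y)
    goals: list                 # single goal for L1-L3; up to 5 landmarks for MultiGoal
    obstacles: list = field(default_factory=list)    # static blockers (ObstNav)
    pedestrians: list = field(default_factory=list)  # dynamic agents (SocialNav)

    def is_multi_goal(self):
        return len(self.goals) > 1

# A long-horizon episode visiting 5 consecutive landmarks.
ep = Episode(level="MultiGoal", start=(0.0, 0.0),
             goals=[(5, 0), (5, 5), (0, 5), (-5, 5), (-5, 0)])
```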
We evaluate the Agent Layer on 200 test scenarios across all three navigation levels in SmallCity, comparing against three SOTA VLN baselines: NaVILA, NaVid, and Uni-NaVid. Our method consistently outperforms all baselines by a substantial margin, achieving a task success rate up to roughly 30 percentage points higher than the strongest baseline, while maintaining comparable or lower collision rates.
Quantitative comparison on SmallCity (200 test cases). SR = Success Rate, SPL = Success weighted by Path Length, CR = Collision Rate. Best in bold, second-best underlined.
| Method | SimNav | | | ObstNav | | | SocialNav | | |
|---|---|---|---|---|---|---|---|---|---|
| | SR↑ | SPL↑ | CR↓ | SR↑ | SPL↑ | CR↓ | SR↑ | SPL↑ | CR↓ |
| NaVILA | 22.5% | 0.199 | 70.8% | 20.8% | 0.176 | 75.0% | 8.3% | 0.072 | 84.1% |
| NaVid | 37.4% | 0.279 | 19.2% | 32.5% | 0.233 | 23.1% | 17.5% | 0.138 | 51.7% |
| Uni-NaVid | 38.8% | 0.370 | 12.8% | 25.3% | 0.242 | 29.4% | 12.5% | 0.112 | 66.7% |
| Ours | 68.3% | 0.640 | 13.3% | 55.8% | 0.516 | 30.8% | 39.2% | 0.366 | 48.3% |
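For reference, SR and SPL are standard navigation metrics; the sketch below shows how such numbers are typically computed from per-episode logs (the `episodes` record format is an assumption for illustration, not the benchmark's evaluation code).

```python
# Standard navigation metrics from per-episode logs.
# SPL = mean over episodes of S * l / max(p, l), where S is the success
# indicator, l the shortest-path length, and p the agent's path length.

def success_rate(episodes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length."""
    total = 0.0
    for e in episodes:
        l, p = e["shortest_path"], e["agent_path"]
        total += e["success"] * l / max(p, l)
    return total / len(episodes)

episodes = [
    {"success": 1, "shortest_path": 10.0, "agent_path": 12.5},  # reached, detour
    {"success": 1, "shortest_path": 8.0,  "agent_path": 8.0},   # reached, optimal
    {"success": 0, "shortest_path": 9.0,  "agent_path": 20.0},  # failed
]
sr = success_rate(episodes)  # 2/3
sc = spl(episodes)           # (10/12.5 + 1 + 0) / 3 = 0.6
```

SPL penalizes successful episodes by how far the taken path exceeds the shortest path, which is why a method can have a high SR but a noticeably lower SPL.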
Generalization on diverse environments and multi-goal evaluation. XGRIDS (2 outdoor scenes), SAGE-3D (4 indoor scenes), and a multi-goal benchmark with 2–5 consecutive landmarks per episode. PR = Progress Rate, PPL = Progress weighted by Path Length.
| Method | XGRIDS | | | SAGE-3D | | | Multi-Goal | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | SR↑ | SPL↑ | CR↓ | SR↑ | SPL↑ | CR↓ | SR↑ | PR↑ | PPL↑ | CR↓ |
| NaVILA | 17.8% | 0.160 | 71.1% | 31.7% | 0.293 | 67.2% | 0.0% | 21.6% | 0.102 | 78.9% |
| NaVid | 40.0% | 0.316 | 35.6% | 33.1% | 0.324 | 58.4% | 0.0% | 17.5% | 0.096 | 70.2% |
| Uni-NaVid | 31.1% | 0.298 | 48.9% | 25.1% | 0.228 | 35.2% | 12.3% | 8.3% | 0.066 | 48.9% |
| Ours | 74.7% | 0.725 | 14.7% | 58.2% | 0.534 | 33.3% | 38.0% | 73.6% | 0.707 | 34.8% |
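The multi-goal metrics can be computed analogously. The caption defines PR as Progress Rate and PPL as Progress weighted by Path Length; the formulas below are a plausible SPL-style analogue, not necessarily the paper's exact implementation.

```python
# Multi-goal navigation metrics (assumed SPL-style analogue for illustration).

def progress_rate(episodes):
    """Mean fraction of landmarks reached per episode."""
    return sum(e["reached"] / e["total_goals"] for e in episodes) / len(episodes)

def ppl(episodes):
    """Progress weighted by path length: mean of progress * l / max(p, l)."""
    total = 0.0
    for e in episodes:
        progress = e["reached"] / e["total_goals"]
        l, p = e["shortest_path"], e["agent_path"]
        total += progress * l / max(p, l)
    return total / len(episodes)

episodes = [
    {"reached": 5, "total_goals": 5, "shortest_path": 40.0, "agent_path": 50.0},
    {"reached": 2, "total_goals": 5, "shortest_path": 40.0, "agent_path": 30.0},
]
pr = progress_rate(episodes)  # (1.0 + 0.4) / 2 = 0.7
score = ppl(episodes)         # (1.0*40/50 + 0.4*40/40) / 2 = 0.6
```

Unlike SR, which is all-or-nothing per episode, PR gives partial credit for each landmark reached, which is why PR can be high even when multi-goal SR is low.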
We are the first to explore semantic understanding on large-scale outdoor 3DGS street scenes at the dataset, benchmark, and method levels. We compare our World Layer against three representative semantic Gaussian Splatting methods: Feature-3DGS, GW-3DGS, and OpenGaussian, evaluated by mIoU and mAcc on the SmallCity scene.
Quantitative evaluation of semantic scene reconstruction. Comparison with baselines and ablation of World Layer components. (a) Base, (b) +Occ-Aware, (c) +ViewSel (Ours). Best in bold.
| Metric | Baselines | | | Ablation (Ours) | | |
|---|---|---|---|---|---|---|
| | Feature-3DGS | GW-3DGS | OpenGS | (a) Base | (b) +Occ | (c) +ViewSel |
| mIoU↑ | 0.256 | 0.025 | 0.483 | 0.518 | 0.556 | 0.601 |
| mAcc↑ | 0.647 | 0.109 | 0.675 | 0.636 | 0.699 | 0.733 |
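mIoU and mAcc are standard semantic segmentation metrics computed from a class confusion matrix; the sketch below shows the usual definitions on a toy 3-class example (pure illustration, not the benchmark's evaluation code).

```python
# mIoU and mAcc from a confusion matrix, where confusion[i][j] counts points
# with ground-truth class i predicted as class j.

def miou_macc(confusion):
    """Return (mean IoU, mean per-class accuracy) over classes present."""
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # gt c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, gt other
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
        if tp + fn > 0:
            accs.append(tp / (tp + fn))                   # per-class recall
    return sum(ious) / len(ious), sum(accs) / len(accs)

conf = [
    [8, 1, 1],  # class 0: 8 of 10 points correct
    [0, 9, 1],  # class 1: 9 of 10 points correct
    [2, 0, 8],  # class 2: 8 of 10 points correct
]
miou, macc = miou_macc(conf)
```

Because IoU also counts false positives in the denominator, mIoU is always at most mAcc, matching the gap between the two rows in the table.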
Beyond standard goal-directed navigation, our framework readily extends to a wide range of behaviors and environments. The Agent Layer supports diverse locomotion styles, social interactions among multiple agents, and long-horizon multi-goal planning. Meanwhile, the World Layer is compatible with any Gaussian Splatting-based environment, enabling generalization from large-scale outdoor street scenes to indoor rooms. Below we showcase three representative extensions.
Multi-agent social navigation in a town scene. Multiple humanoid agents are simultaneously deployed in a reconstructed outdoor town environment. Each agent autonomously navigates toward its own goal while reactively avoiding collisions with other moving agents, demonstrating socially-aware planning capabilities.
Diverse motion types in a greenbelt scene. Our framework is not restricted to walking-only navigation. By varying the motion type in the action primitives, agents can exhibit diverse locomotion styles such as jogging, jumping, kicking, and other full-body movements, all while maintaining goal-directed behavior.
Multi-goal navigation in an indoor scene. The agent is tasked with visiting a sequence of consecutive landmarks within a room-scale indoor environment. This demonstrates both the long-horizon planning capability of our iterative reasoning mechanism and the generalization of our Agent Layer to indoor scenes beyond large-scale outdoor settings.
Qualitative results of the ablation study on context-aware action planning. The variant without spatial grounding is prone to navigational errors, deviating from the optimal path and losing track of the goal. The variant without iterative reasoning exhibits short-sighted behavior: it tends to follow simplistic, straight-line paths and fails to perform complex planning, leading to a significantly higher collision rate. In contrast, our full model produces robust, goal-directed trajectories that successfully navigate around obstacles.
Qualitative ablation on occlusion-aware semantic scene reconstruction. Our framework achieves precise 3D instance segmentation with well-defined boundaries, demonstrating robustness to severe occlusion while recognizing thin or small objects in large-scale outdoor scenes.
We introduce Visually-grounded Humanoid Agents, a coupled two-layer (world–agent) paradigm for embodied digital humans that can perceive, decide, and act autonomously in complex real-world 3D environments.
We further introduce a comprehensive benchmark to evaluate humanoid–scene interaction across diverse reconstructed 3D environments.
@article{ye2026vghuman,
author = {Ye, Hang and Ma, Xiaoxuan and Lu, Fan and Wu, Wayne and Lin, Kwan-Yee and Wang, Yizhou},
title = {Visually-grounded Humanoid Agents},
journal = {arXiv preprint arXiv:},
year = {2026},
}