VideoMimic
Visual imitation enables contextual humanoid control
Arthur Allshire*, Hongsuk Choi*, Junyi Zhang*, David McAllister*, Anthony Zhang, Chung Min Kim,
Trevor Darrell, Pieter Abbeel, Jitendra Malik, Angjoo Kanazawa (*equal contribution)
University of California, Berkeley

VideoMimic is a real-to-sim-to-real pipeline that converts monocular videos into transferable humanoid skills, letting robots learn context-aware behaviors (terrain traversal, climbing, sitting) with a single policy.

Abstract.
How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them—casually capture a human motion video and feed it to humanoids. We introduce VideoMimic, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills—all from a single policy, conditioned on the environment and global root commands. VideoMimic offers a scalable path towards teaching humanoids to operate in diverse real-world environments.
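The single policy above is conditioned on the environment and on global root commands. As a rough illustration of what such an interface could look like (not the architecture used in this work), the sketch below assembles a local terrain heightmap, a root command, and proprioception into one observation for a small policy network; all dimensions and the network shape are assumptions.

# Minimal sketch of a terrain- and command-conditioned policy head.
# Dimensions, the heightmap encoding, and the MLP shape are illustrative
# assumptions, not the architecture used in VideoMimic.
import torch
import torch.nn as nn

NUM_JOINTS = 23             # assumed joint count for a G1-class humanoid
HEIGHTMAP_POINTS = 11 * 11  # assumed local terrain scan around the robot

class ContextualPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        obs_dim = (
            HEIGHTMAP_POINTS     # local terrain heights (environment context)
            + 3                  # global root command: target x, y, yaw
            + 2 * NUM_JOINTS     # proprioception: joint positions and velocities
            + 6                  # base orientation (gravity vector) and angular velocity
        )
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, NUM_JOINTS),  # joint-position targets for a PD controller
        )

    def forward(self, heightmap, root_command, proprio, base_state):
        obs = torch.cat([heightmap, root_command, proprio, base_state], dim=-1)
        return self.net(obs)

# One forward pass with batch size 1.
policy = ContextualPolicy()
action = policy(
    torch.zeros(1, HEIGHTMAP_POINTS),
    torch.zeros(1, 3),
    torch.zeros(1, 2 * NUM_JOINTS),
    torch.zeros(1, 6),
)
print(action.shape)  # torch.Size([1, 23])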
Real-world Demo.
Approach.

Overview video: Input Video → Human + Scene Reconstruction → G1 Retargeted Results → Egoview (RGB/Depth) → Training in Simulation.

From a monocular video, we jointly reconstruct metric-scale 4D human trajectories and dense scene geometry. The human motion is retargeted to a humanoid and, with the scene converted to a mesh in the simulator, used as a reference to train a context-aware whole-body control policy. While our policy does not currently use RGB conditioning, we demonstrate the potential of our reconstruction for ego-view rendering.
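To make the "scene converted to a mesh in the simulator" step concrete, the sketch below rasterizes a reconstructed metric-scale point cloud into a heightfield grid that a simulator could load as terrain; the cell size, max-height rasterization, and hole filling are assumptions, not necessarily how the pipeline converts geometry.

# Illustrative sketch: rasterize reconstructed scene points (N x 3, metric
# scale, z-up) into a heightfield grid that a simulator can load as terrain.
# Resolution and filling strategy are assumptions, not VideoMimic's pipeline.
import numpy as np

def points_to_heightfield(points: np.ndarray, cell_size: float = 0.05) -> np.ndarray:
    """Keep the max z in each (x, y) cell; fill empty cells with the minimum height."""
    xy = points[:, :2]
    z = points[:, 2]
    mins = xy.min(axis=0)
    ij = np.floor((xy - mins) / cell_size).astype(int)  # grid indices per point
    shape = ij.max(axis=0) + 1
    heights = np.full(shape, np.nan)
    flat = ij[:, 0] * shape[1] + ij[:, 1]
    order = np.argsort(z)                               # ascending, so the highest z per cell wins
    heights.ravel()[flat[order]] = z[order]
    heights[np.isnan(heights)] = np.nanmin(heights)     # crude hole filling
    return heights

# Usage: a synthetic "staircase" point cloud with three 20 cm steps.
rng = np.random.default_rng(0)
pts = rng.uniform([0, 0, 0], [1.5, 1.0, 0], size=(5000, 3))
pts[:, 2] = np.floor(pts[:, 0] / 0.5) * 0.2             # step height grows with x
print(points_to_heightfield(pts).shape)                 # roughly a (30, 20) grid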

1. Real to Sim.

Figure 1: The Real-to-Sim pipeline reconstructs human motion and scene geometry from video, outputting simulator-ready data.
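As a deliberately simplified illustration of retargeting, the sketch below only rescales the reconstructed human root trajectory to the robot's size and re-zeros it at the simulator origin; full retargeting also maps joint angles and enforces joint limits, and the G1 height constant here is an assumption.

# Illustrative sketch of the simplest piece of kinematic retargeting: scale a
# reconstructed human root trajectory to the robot's size and re-zero it at
# the start. The 1.32 m height constant for the G1 is an assumption.
import numpy as np

def retarget_root_trajectory(human_root_xyz: np.ndarray,
                             human_height: float,
                             robot_height: float = 1.32) -> np.ndarray:
    """human_root_xyz: (T, 3) metric-scale pelvis positions from the reconstruction."""
    scale = robot_height / human_height
    traj = human_root_xyz * scale
    traj[:, :2] -= traj[0, :2]        # start the motion at the simulator origin (x, y)
    return traj

# Usage: a 2 m forward walk for a 1.75 m person, mapped onto the robot.
T = 100
human_traj = np.stack([np.linspace(0, 2.0, T),
                       np.zeros(T),
                       np.full(T, 0.95)], axis=1)   # pelvis ~0.95 m above ground
print(retarget_root_trajectory(human_traj, human_height=1.75)[-1])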


Figure 2: Versatile capabilities include handling internet videos, multi-human reconstruction, and ego-view rendering.
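Once a scene mesh and a camera (head) trajectory are available, ego-view RGB and depth frames can be rendered offline. The sketch below uses trimesh and pyrender with a placeholder box standing in for a reconstructed mesh; the tooling choice is an assumption rather than the renderer used in this work.

# Illustrative sketch of ego-view RGB/depth rendering from a scene mesh using
# trimesh + pyrender (may require an EGL/OSMesa backend on headless machines).
# In practice the camera pose would come from the reconstructed head trajectory.
import numpy as np
import trimesh
import pyrender

scene_mesh = trimesh.creation.box(extents=[1.0, 1.0, 1.0])   # placeholder scene geometry
scene = pyrender.Scene()
scene.add(pyrender.Mesh.from_trimesh(scene_mesh))

# Camera 2 m from the box, looking down -z (pyrender's camera convention).
cam_pose = np.eye(4)
cam_pose[2, 3] = 2.0
camera = pyrender.IntrinsicsCamera(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
scene.add(camera, pose=cam_pose)
scene.add(pyrender.DirectionalLight(intensity=3.0), pose=cam_pose)

renderer = pyrender.OffscreenRenderer(viewport_width=640, viewport_height=480)
color, depth = renderer.render(scene)
print(color.shape, depth.shape, float(depth[depth > 0].min()))  # nearest hit ~1.5 m
renderer.delete()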

2. Training in Sim.

Figure 3: Policy training pipeline in simulation, progressing from MoCap pre-training to environment-aware tracking and distillation.
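For intuition about the tracking stages, the sketch below shows a motion-tracking reward of the exponential form commonly used in motion-imitation RL; the specific terms, temperatures, and weights are assumptions and not the reward used in this work.

# Illustrative sketch of a reference-tracking reward of the kind commonly used
# for MoCap pre-training and environment-aware tracking. The exponential form,
# temperatures, and weights are assumptions, not VideoMimic's exact reward.
import numpy as np

def tracking_reward(q, q_ref, root_pos, root_pos_ref,
                    sigma_joint: float = 0.5, sigma_root: float = 0.25) -> float:
    """Reward in (0, 1]: exponential terms for joint and root tracking error."""
    joint_err = np.sum((q - q_ref) ** 2)
    root_err = np.sum((root_pos - root_pos_ref) ** 2)
    r_joint = np.exp(-joint_err / sigma_joint)
    r_root = np.exp(-root_err / sigma_root)
    return 0.6 * r_joint + 0.4 * r_root       # assumed weighting between the two terms

# Usage: small deviations from the reference still give a high reward.
q_ref = np.zeros(23)                          # 23 assumed joints
q = q_ref + 0.05
print(tracking_reward(q, q_ref, np.zeros(3), np.array([0.02, 0.0, 0.0])))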

Acknowledgements.

We thank Brent Yi for his guidance with Viser, the excellent 3D visualization tool we use. We are grateful to Ritvik Singh, Jason Liu, Ankur Handa, Ilija Radosavovic, Himanshu Gaurav-Singh, Haven Feng, and Kevin Zakka for helpful advice and discussions throughout the project. We thank Lea Müller for helpful discussions at the start of the project. We thank Zhizheng Liu for helpful suggestions on evaluating human and scene reconstruction. We thank Eric Xu, Matthew Liu, Hayeon Jeong, Hyunjoo Lee, Jihoon Choi, Tyler Bonnen, and Yao Tang for their help in capturing and featuring in the video clips used in this project.

BibTeX
@article{videomimic,
  title   = {Visual imitation enables contextual humanoid control},
  author  = {Allshire, Arthur and Choi, Hongsuk and Zhang, Junyi and McAllister, David and
             Zhang, Anthony and Kim, Chung Min and Darrell, Trevor and Abbeel, Pieter and
             Malik, Jitendra and Kanazawa, Angjoo},
  journal = {arXiv preprint arXiv:2505.03729},
  year    = {2025}
}