Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often reuse motion extracted from other videos, which restricts applications to specific motion types and to global scene matching. We propose Move-in-2D, a novel approach that generates human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach uses a diffusion model that accepts both a scene image and a text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities and annotate each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that, after projection, aligns with the scene image. Furthermore, we show that the generated motion sequences improve human motion quality in video synthesis tasks.
Generating realistic human motion in a scene remains a challenging task in video generation due to the complexity of human movement. Many works have improved the quality of human videos by using motion sequences, typically extracted from other videos, as control signals during the generation process. In contrast, we propose a novel task of 2D-conditioned human motion generation, defined as follows: given an image representing the target scene and a text prompt describing the desired motion, we generate a motion sequence that aligns with the text description and can be naturally projected onto the scene image.
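As a concrete sketch of the task interface, the following Python stub shows one possible input/output signature. The MotionSequence fields assume an SMPL-style per-frame parameterization, which is an illustrative choice rather than a specification of our data format, and generate_motion is a placeholder name.

from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSequence:
    # An SMPL-style per-frame parameterization (assumed for illustration only).
    global_orient: np.ndarray  # (T, 3)  root orientation, axis-angle
    body_pose: np.ndarray      # (T, 69) body joint rotations, axis-angle
    transl: np.ndarray         # (T, 3)  root translation in scene coordinates

def generate_motion(scene_image: np.ndarray, text_prompt: str,
                    num_frames: int = 120) -> MotionSequence:
    """2D-conditioned motion generation: a scene image and a text prompt map to
    a motion sequence that can be projected onto the image. Placeholder stub."""
    raise NotImplementedError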
To tackle the proposed task, we collect the Humans-in-Context Motion (HiC-Motion) dataset, a large collection of human videos annotated with pseudo-ground-truth human motions, text prompts, and background scene images. Sourced from open-domain internet videos, the dataset includes 300k videos spanning a wide range of indoor and outdoor scenes and over 1k human activity categories. We apply an inpainting model to remove humans from the video frames and filter the videos to ensure static backgrounds, so that any selected frame consistently represents the scene throughout the motion sequence.
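To make the static-background filtering concrete, below is a minimal heuristic sketch that flags clips with noticeable camera motion using dense optical flow. The threshold, frame stride, and choice of Farneback flow are illustrative assumptions, not the exact filtering procedure used to build HiC-Motion.

import cv2
import numpy as np

def is_static_background(frames, max_mean_flow=1.0, stride=10):
    """frames: list of HxWx3 uint8 BGR images. Returns True if the background
    appears static (small average flow between sampled frame pairs)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[::stride]]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Dense optical flow between sampled frames; human regions could
        # additionally be masked out (e.g., with a person segmenter) so that
        # only background motion is measured.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(np.linalg.norm(flow, axis=2).mean())
    return bool(np.mean(flows) < max_mean_flow) if flows else True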
We propose a transformer-based diffusion model that generates a motion sequence semantically aligned with the input description and physically compatible with the scene. To improve the model's ability to capture interactions between the inputs, we employ an in-context learning strategy, incorporating the scene and text inputs into the model via a shared token space. The transformer then generates a human motion sequence through an iterative diffusion denoising process. The diffusion timestep is injected through AdaLN layers to enhance the temporal smoothness of the generated motions.
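The PyTorch sketch below illustrates the two conditioning mechanisms described above: scene and text features are projected into the same token space as the noisy motion tokens and processed jointly by the transformer, while the diffusion timestep modulates each block through AdaLN. All layer sizes, feature dimensions (e.g., CLIP-sized 768-d scene/text features and a 263-d motion representation), and module names are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # AdaLN: timestep embedding -> per-block scale/shift/gate parameters.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, t_emb):
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim=263, dim=512, depth=8):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, dim)
        self.scene_in = nn.Linear(768, dim)   # e.g., CLIP image tokens (assumed)
        self.text_in = nn.Linear(768, dim)    # e.g., CLIP text tokens (assumed)
        self.t_emb = nn.Sequential(nn.Linear(256, dim), nn.SiLU(),
                                   nn.Linear(dim, dim))
        self.blocks = nn.ModuleList([AdaLNBlock(dim) for _ in range(depth)])
        self.motion_out = nn.Linear(dim, motion_dim)

    def forward(self, noisy_motion, scene_feat, text_feat, t_freq):
        # Shared token space: [scene tokens | text tokens | motion tokens].
        tokens = torch.cat([self.scene_in(scene_feat),
                            self.text_in(text_feat),
                            self.motion_in(noisy_motion)], dim=1)
        t = self.t_emb(t_freq)  # t_freq: sinusoidal timestep features (B, 256)
        for blk in self.blocks:
            tokens = blk(tokens, t)
        n_cond = scene_feat.shape[1] + text_feat.shape[1]
        # Read out only the motion positions as the denoising prediction.
        return self.motion_out(tokens[:, n_cond:])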
Our model generates highly dynamic motions with strong scene compatibility, accurately placing the human sequence in the scene and ensuring natural movement within the environment. It also produces complex human activities such as playing tennis, a case that remains challenging for video generation models but that our approach handles effectively.
Our model generates human poses that are consistent with both the text prompts and the scene context, such as standing at the edge of a cliff, sitting on a chair, and surfing on a board. In addition, our model is capable of generating complex human-scene interactions, including activities such as riding a horse, decorating a tree, and petting a dog.
We compare with existing models conditioned on a single modality or multiple modalities. The text-conditioned MDM (Tevet et al., ICLR 2023) struggles to generate long sequences, while MLD (Chen et al., CVPR 2023) produces running actions but lacks scene compatibility. Among scene-conditioned methods, SceneDiff (Huang et al., CVPR 2023) fails to produce accurate poses, and HUMANISE (Wang et al., NeurIPS 2022) generates static motions. These methods, trained on limited synthetic data, struggle with real-world scenes. In contrast, our model generates motion that aligns with both the scene and the text prompt.
Our approach enables a two-pass human video generation pipeline. In the first pass, our model generates a scene-compatible motion sequence from a scene image and a text prompt. In the second pass, this motion serves as a control signal for video generation, for example guiding Gen-3 (Runway, 2024) to produce a motion-guided video. The motion generated by our model ensures accurate human shapes and smooth movement in the resulting videos.
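Schematically, the two-pass pipeline can be sketched as below. Here motion_model, video_model, and project_to_2d are hypothetical placeholders; the snippet does not reproduce any specific video model's API, and any pose-conditioned generator could fill the second pass.

def two_pass_generation(scene_image, text_prompt, motion_model, video_model,
                        project_to_2d, num_frames=120):
    # Pass 1: scene- and text-conditioned motion generation (our model's role).
    motion = motion_model.sample(scene_image, text_prompt, num_frames)

    # Project the generated motion into the scene image to obtain per-frame
    # 2D pose maps (skeletons or rendered bodies) used as the control signal.
    control_frames = [project_to_2d(pose, scene_image) for pose in motion]

    # Pass 2: motion-guided video generation conditioned on the control frames.
    return video_model.generate(image=scene_image, prompt=text_prompt,
                                control=control_frames)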
@article{huang2024move,
author = {Huang, Hsin-Ping and Zhou, Yang and Wang, Jui-Hsien and Liu, Difan and Liu, Feng and Yang, Ming-Hsuan and Xu, Zhan},
title = {Move-in-2D: 2D-Conditioned Human Motion Generation},
journal = {arXiv preprint arXiv:2412.13185},
year = {2024},
}