Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation

¹Fudan University     ²DAMO Academy, Alibaba Group     ³Nanyang Technological University
ICCV 2025

Figure 1. The rule-based video generation pipeline of the proposed Synthetic Dataset for Free-Form Motion Control (SynFMC). This example generates a synthetic video with three objects: (1) The environment asset and its matching object assets are selected as the scene elements. (2) The motion types of the objects and the camera are randomly selected for trajectory generation. (3) The center region shows the resulting 3D animation sequence used for rendering. The rendered video and annotations are shown in the last row.


Abstract

Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both the camera and objects in a 3D-aware manner, resulting in limited controllability over the generated content. To address this issue and facilitate research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories and covers various motion patterns generated according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information helps models learn to disentangle the motion effects of objects and the camera in a video. To provide precise 3D-aware motion control, we further propose a method trained on SynFMC, Free-Form Motion Control (FMC). FMC can control the 6D poses of objects and the camera independently or simultaneously, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.

1. Visualization of SynFMC

Environment Categories

⭐ The environments in SynFMC span five types: ground, near ground, sky, water surface, and underwater.

Ground
0442a954 d321dde4
Near Ground
02221fb0 bbe97d18
Sky
002b4dce 26ed56e6
Water Surface
c791ddbb e5e9eb29
Underwater
c791ddbb e5e9eb29


Scene Categories

⭐ The scenes in SynFMC span four types: static single-object, static multi-object, dynamic single-object, and dynamic multi-object. Here, static means that object locations are fixed in world space; the camera may still move.

Static Single-Object
26ed56e6
Dynamic Single-Object
c791ddbb
Static Multi-Object
e5e9eb29
Dynamic Multi-Object
c791ddbb
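
The four scene types above follow from two properties: how many objects the scene contains and whether any object location changes in world space. The sketch below illustrates that rule only; it is not the SynFMC tooling. The function name `classify_scene`, the per-frame position format, and the tolerance `tol` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def classify_scene(object_positions: Dict[str, List[Vec3]], tol: float = 1e-6) -> str:
    """Return one of the four scene types listed above.

    `object_positions` maps a hypothetical object id to its non-empty list of
    per-frame world-space locations. Camera motion is deliberately ignored:
    a scene is static when every object stays at its first-frame location.
    """
    if not object_positions:
        raise ValueError("a scene must contain at least one object")

    def moves(track: List[Vec3]) -> bool:
        # Compare every later frame against the first-frame location.
        x0, y0, z0 = track[0]
        return any(
            abs(x - x0) > tol or abs(y - y0) > tol or abs(z - z0) > tol
            for x, y, z in track[1:]
        )

    motion = "dynamic" if any(moves(t) for t in object_positions.values()) else "static"
    count = "single-object" if len(object_positions) == 1 else "multi-object"
    return f"{motion} {count}"
```

Under this rule, a scene with one fixed object and a moving camera is still a static single-object scene, and a scene becomes dynamic as soon as any one of its objects changes location.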


Auxiliary Annotation of SynFMC

⭐ Besides the 6D poses of objects and the camera, SynFMC also provides auxiliary annotations, including instance segmentation maps, depth maps, and descriptions of both visual content and motion.

0442a954

2. Architecture of FMC

⭐ The following figure presents the architecture of FMC, where the Object Motion Controller (OMC) perceives the orientation and size of objects in the camera coordinate system by accepting 6D poses.
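
To make "in the camera coordinate system" concrete, the sketch below expresses an object's world-space pose relative to the camera using the standard rigid-body relation: camera-from-object equals the inverse of world-from-camera composed with world-from-object. The 4x4 matrix representation, the yaw-only helper `pose_from_yaw`, and all function names are assumptions for illustration; the page does not state how FMC parameterizes 6D poses or which axis convention it uses.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a 4x4 rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_from_yaw(yaw_deg: float, position: Tuple[float, float, float]) -> Matrix:
    """Hypothetical helper: a pose rotated about the z axis, then translated."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    x, y, z = position
    return [[c, -s, 0.0, x], [s, c, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def object_in_camera(world_from_camera: Matrix, world_from_object: Matrix) -> Matrix:
    """Express the object's pose in the camera coordinate system."""
    return matmul(invert_rigid(world_from_camera), world_from_object)


# Camera and object share the same yaw, and the object sits 2 units along the
# world y axis from the camera, which is the camera's own x axis after the yaw.
camera = pose_from_yaw(90.0, (1.0, 0.0, 0.0))
obj = pose_from_yaw(90.0, (1.0, 2.0, 0.0))
cam_from_obj = object_in_camera(camera, obj)
```

In this example the relative rotation is the identity up to floating-point error and the relative translation is approximately (2, 0, 0). Moving the camera changes `cam_from_obj` even when the object's world pose is fixed, which is why the apparent motion in a video mixes camera and object effects.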

Figure 2. The architecture of FMC. In the first stage, we randomly sample images from the synthetic videos and update the parameters of the injected Domain LoRA. Next, the modules of the Camera Motion Controller (CMC) are learned. CMC consists of two parts, the Camera Encoder and the Camera Adapter, where the Camera Adapter is introduced into the temporal modules. Finally, we train the Object Encoder of OMC. It receives the 6D object pose features, which are repeated over the corresponding object region. We apply a Gaussian blur kernel centered at the object centroid to avoid the need for precise masks. The output is then multiplied by the coarse masks to modulate the features in the main branch.
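
The last step of the caption (pose features spread over the object region, softened by a Gaussian centered at the centroid, then gated by a coarse mask) can be sketched on a single-channel grid. This is an illustrative reading, not the released implementation: the scalar `pose_feature`, the additive update, the value of `sigma`, and every function name are assumptions, since the page does not specify how the modulation is applied.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def centroid(mask: Grid) -> Tuple[float, float]:
    """Mask-weighted mean (row, column) of a coarse object mask."""
    total = sum(sum(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    cy = sum(y * v for y, row in enumerate(mask) for v in row) / total
    cx = sum(x * v for row in mask for x, v in enumerate(row)) / total
    return cy, cx


def gaussian_weight(height: int, width: int, center: Tuple[float, float], sigma: float) -> Grid:
    """Unnormalized Gaussian falloff that peaks at 1.0 on the centroid."""
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def modulate(features: Grid, pose_feature: float, coarse_mask: Grid, sigma: float = 1.5) -> Grid:
    """Add the pose feature inside the coarse mask, weighted by the Gaussian.

    The additive form is an assumption made for this sketch.
    """
    h, w = len(coarse_mask), len(coarse_mask[0])
    weight = gaussian_weight(h, w, centroid(coarse_mask), sigma)
    return [
        [features[y][x] + pose_feature * weight[y][x] * coarse_mask[y][x] for x in range(w)]
        for y in range(h)
    ]


# A 5x5 feature map with a coarse 3x3 box mask around the center pixel.
mask = [[1.0 if 1 <= y <= 3 and 1 <= x <= 3 else 0.0 for x in range(5)] for y in range(5)]
out = modulate([[0.0] * 5 for _ in range(5)], pose_feature=1.0, coarse_mask=mask)
```

Because the weight decays smoothly away from the centroid, pixels near the edge of the box contribute less than those at the center, so an imprecise box boundary has a limited effect on the modulated features.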

3. Results of FMC

Independent Control of Camera / Object

⭐ The first two examples show independent control of the camera; the last two show independent control of the object:
canyon rim with a view of red rocks
02221fb0 02221fb0
cactus in the garden
002b4dce 002b4dce
a balloon floating over the road
bbe97d18 bbe97d18
a butterfly flying over the ground
bbe97d18 bbe97d18

Simultaneous Control of Camera & Object

⭐ The first two examples are results from static single-object scenes; the last two are from dynamic single-object scenes:
a yellow mushroom on the road
0442a954 0442a954
a cat in the grass covered with leaves
0442a954 0442a954
a butterfly is flying over the ground
0442a954 0442a954
a balloon floating in the cloudy sky
0442a954 0442a954



⭐ The first two examples are results from static multi-object scenes; the last two are from dynamic multi-object scenes:
a deer and a man in the grass
0442a954 0442a954
two birds in meadow
0442a954 0442a954
two UFOs are flying over the city
0442a954 0442a954
a shark and a yellow fish are swimming in the sea
0442a954 0442a954

BibTeX

Please consider citing SynFMC if it helps your research.
@inproceedings{SynFMC,
  title={{Free-Form Motion Control}: Controlling the 6D Poses of Camera and Objects in Video Generation},
  author={Shuai, Xincheng and Ding, Henghui and Qin, Zhenyuan and Luo, Hao and Ma, Xingjun and Tao, Dacheng},
  booktitle={ICCV},
  year={2025}
}