AnyI2V: Animating Any Conditional Image with Motion Control

1Fudan University     2DAMO Academy, Alibaba group     3Hupan Lab    
ICCV 2025

Figure 1. The first-frame conditional control of our training-free architecture AnyI2V. (a) AnyI2V supports diverse types of conditional inputs, including those for which paired training data is difficult to construct, such as mesh and point cloud data. The trajectories serve as input for motion control in subsequent frames. (b) AnyI2V can accept inputs with mixed conditional types, further increasing input flexibility. (c) By using LoRA or different text prompts, AnyI2V can achieve editing effects on the original image.


Abstract

Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional image with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation.

1. Pipeline of AnyI2V

⭐ The following figure presents the architecture of AnyI2V, which utilizes a T2V backbone to achieve the effect of the I2V task while supporting a wider range of modalities.
Pipeline Image

Figure 1. Our pipeline begins by performing DDIM inversion on the conditional image. To do this, we remove the temporal module (i.e., temporal self-attention) from the 3D U-Net and then extract features from its spatial blocks at timestep tα. Next, we optimize the latent representation by substituting the features of the first frame back into the U-Net. This optimization is constrained to a specific region by an auto-generated semantic mask and is performed only within a range of timesteps bounded by tγ.
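
To make the two stages above concrete, here is a minimal Python sketch of the conditioning loop: DDIM inversion with feature caching at timestep tα, followed by masked latent optimization on the first frame during early denoising. This is not the official implementation; `unet`, `scheduler`, and the helpers `ddim_invert_to`, `spatial_block_features`, and `auto_semantic_mask` are assumed placeholders, and hyperparameters such as `opt_steps`, `lr`, and the timestep threshold are illustrative only.

# Non-official sketch of AnyI2V's first-frame conditioning.
# Assumptions: `unet` is a 3D U-Net with its temporal self-attention disabled,
# `scheduler` is a diffusers DDIMScheduler, and `ddim_invert_to`,
# `spatial_block_features`, and `auto_semantic_mask` are hypothetical helpers
# for DDIM inversion, spatial-block feature hooking, and mask generation.
import torch

def condition_first_frame(cond_latent, video_latent, unet, scheduler, text_emb,
                          t_alpha, t_gamma, opt_steps=5, lr=0.05):
    # 1) DDIM-invert the conditional image up to timestep t_alpha and cache the
    #    features produced by the U-Net's spatial blocks at that timestep.
    with torch.no_grad():
        z_alpha = ddim_invert_to(cond_latent, unet, scheduler, text_emb, t_alpha)
        ref_feats = spatial_block_features(unet, z_alpha, t_alpha, text_emb)
        mask = auto_semantic_mask(ref_feats)  # restrict optimization to the object region

    # 2) Denoise the video latent; at early (high-noise) timesteps, optimize it so
    #    the first frame's spatial features match the cached ones inside the mask.
    z = video_latent  # shape (B, C, F, H, W)
    for t in scheduler.timesteps:
        if t > t_gamma:  # the exact timestep range is an assumption
            z = z.detach().requires_grad_(True)
            opt = torch.optim.Adam([z], lr=lr)
            for _ in range(opt_steps):
                feats = spatial_block_features(unet, z[:, :, :1], t, text_emb)
                loss = ((feats - ref_feats) * mask).pow(2).mean()
                opt.zero_grad(); loss.backward(); opt.step()
            z = z.detach()
        with torch.no_grad():
            eps = unet(z, t, encoder_hidden_states=text_emb).sample
            z = scheduler.step(eps, t, z).prev_sample
    return z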


2. Comparison with Previous Methods

⭐ The following figure presents a comparison with previous training-based methods.

Figure 2. Comparison between AnyI2V and previous methods, DragAnything, DragNUWA, and MOFA. ‘First Frame*’ indicates that the condition images for previous methods are generated using AnyI2V to ensure a more consistent and fair comparison.


3. Controlling Multiple Modalities

⭐ AnyI2V supports multiple categories of modalities, including modalities not supported by ControlNet.

Figure 3. This picture demonstrates AnyI2V’s ability to control diverse conditions. AnyI2V not only handles modalities that ControlNet does not support, but also effectively controls mixed modalities, which other methods could previously handle only with additional training.


4. Camera Control

⭐ By forcing motion control on a static object (e.g., the house in the following picture), AnyI2V achieves the effect of camera motion control.

Figure 4. This picture shows the camera control result obtained by forcibly dragging a static object. However, AnyI2V cannot support complex camera control, such as rotating the camera.
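
As an illustration of this trick, the sketch below builds drag trajectories that move every anchor point on a static object by the same per-frame offset, so the scene shifts as a whole and reads as camera motion. The list-of-(x, y)-points trajectory format and the function name are assumptions for illustration, not the project's actual interface.

# Hypothetical helper: uniform drag trajectories on a static object to mimic a camera pan.
def camera_pan_trajectories(anchor_points, num_frames, dx_per_frame, dy_per_frame):
    """Return one trajectory per anchor, each moving by the same offset every frame."""
    return [
        [(x0 + f * dx_per_frame, y0 + f * dy_per_frame) for f in range(num_frames)]
        for (x0, y0) in anchor_points
    ]

# Example: drag three points on the (static) house by 5 px/frame over 16 frames.
trajs = camera_pan_trajectories([(120, 200), (180, 220), (240, 210)],
                                num_frames=16, dx_per_frame=5, dy_per_frame=0)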


5. Visual Editing

⭐ The first frame generated by AnyI2V is not strictly constrained by the structural condition, which means AnyI2V enables flexible structure control even when the structure and prompt conflict.

AnyI2V supports visual editing by modifying the prompt. Even when the structure and prompt conflict, it can still generate harmonious shapes and smooth motion.


BibTeX

Please consider citing AnyI2V if it helps your research.
@inproceedings{AnyI2V,
  title={{AnyI2V}: Animating Any Conditional Image with Motion Control},
  author={Li, Ziye and Luo, Hao and Shuai, Xincheng and Ding, Henghui},
  booktitle={ICCV},
  year={2025}
}