Recent advances in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of the generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional image with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that AnyI2V achieves superior performance and offers a new perspective on spatial- and motion-controlled video generation.
Figure 1. Our pipeline begins by performing DDIM inversion on the conditional image. To do this, we remove the temporal module (i.e., temporal self-attention) from the 3D U-Net and then extract features from its spatial blocks at timestep t_α. Next, we optimize the latent representation by substituting the features from the first frame back into the U-Net. This optimization is constrained to a specific region by an auto-generated semantic mask and is only performed for timesteps t'_γ ≤ t_γ.
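The following is a minimal PyTorch sketch of the feature extraction and first-frame latent optimization described in Figure 1, not the actual AnyI2V implementation. The names unet (the 3D U-Net with its temporal module removed), spatial_block, cond, z_t, target_feat, and mask are assumed placeholders; the real method applies this procedure across several spatial blocks and only for timesteps t'_γ ≤ t_γ.

import torch
import torch.nn.functional as F

def capture_spatial_features(unet, spatial_block, latent, t, cond):
    # Run the (temporal-module-free) 3D U-Net once on the DDIM-inverted
    # latent and record one spatial block's output via a forward hook.
    feats = {}
    handle = spatial_block.register_forward_hook(
        lambda m, i, o: feats.update(value=o.detach()))
    with torch.no_grad():
        unet(latent, t, encoder_hidden_states=cond)
    handle.remove()
    return feats["value"]

def optimize_first_frame_latent(unet, spatial_block, z_t, target_feat,
                                mask, t, cond, steps=5, lr=1e-2):
    # Optimize the noisy first-frame latent so the spatial block's features
    # match those extracted from the conditional image, restricted to the
    # auto-generated semantic mask (assumed resized to the feature shape).
    z_t = z_t.clone().requires_grad_(True)
    opt = torch.optim.Adam([z_t], lr=lr)
    for _ in range(steps):
        feats = {}
        handle = spatial_block.register_forward_hook(
            lambda m, i, o: feats.update(value=o))
        unet(z_t, t, encoder_hidden_states=cond)
        handle.remove()
        loss = F.mse_loss(feats["value"] * mask, target_feat * mask)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z_t.detach()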
Figure 2. Comparison between AnyI2V and the previous methods DragAnything, DragNUWA, and MOFA. ‘First Frame*’ indicates that the conditional images for the previous methods are generated with AnyI2V to ensure a more consistent and fair comparison.
Figure 3. This figure demonstrates AnyI2V’s ability to control diverse conditions. AnyI2V not only handles modalities that ControlNet does not support, but also effectively controls mixed modalities, which previously required additional training in other methods.
Figure 4. This figure shows camera-control results obtained by applying drag trajectories to a static object. However, AnyI2V does not support complex camera control, such as camera rotation.
AnyI2V supports visual editing by modifying the text prompt. Even when the structural condition and the prompt conflict, it still generates harmonious shapes and smooth motion.
@inproceedings{AnyI2V,
title={{AnyI2V}: Animating Any Conditional Image with Motion Control},
author={Li, Ziye and Luo, Hao and Shuai, Xincheng and Ding, Henghui},
booktitle={ICCV},
year={2025}
}