EffectErase: Joint Video Object Removal
and Insertion for High-Quality Effect Erasing

CVPR 2026

Yang Fu  ·  Yike Zheng  ·  Ziyun Dai  ·  Henghui Ding

Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China
† Corresponding author

Interactive demo: drag the vertical handle to compare each input video with its removal result (WILD_ENV sequences).

Demo

Comparison

Side-by-side comparisons on WILD_ENV001_00024:
ObjectClear / OmniPaint / EffectErase
ProPainter / DiffuEraser / VACE / EffectErase
Minmax-Remover / ROSE / EffectErase

Portrait Comparison

Side-by-side portrait comparisons on WILD_ENV003_00088:
ObjectClear / OmniPaint / EffectErase
ProPainter / DiffuEraser / VACE / EffectErase
Minmax-Remover / ROSE / EffectErase

Dataset Pipeline

Dataset Construction Pipeline of VOR. VOR is a hybrid dataset combining synthetic data and real-world captures. Synthetic data are generated in Blender using 3D environments, objects, and animations collected from public sources, together with carefully designed natural object and camera trajectories. Real-world data are recorded with cameras across diverse scenes and object categories, and the Ken Burns effect is then applied to simulate camera motion. All videos are segmented with SAM2 and manually cleaned and refined by human annotators. The final dataset comprises triplets: a video with the target object, the same video without it, and the corresponding object masks.
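The Ken Burns step simulates camera motion by animating a slow pan/zoom crop over each static real-world capture. The exact parameters used for VOR are not given on this page; the sketch below is a minimal Python/OpenCV illustration in which the zoom range, pan amount, and interpolation choice are assumptions, not the authors' settings. Applying the same trajectory to the paired with-object and without-object recordings would keep the resulting triplet spatially aligned.

import cv2
import numpy as np

def ken_burns(frames, zoom_start=1.0, zoom_end=1.2, pan=(0.05, 0.0)):
    """Simulate camera motion by animating a crop window (Ken Burns effect).

    frames: sequence of HxWx3 uint8 images (a static or near-static capture).
    zoom_start/zoom_end: crop scale at the first/last frame (>= 1.0 zooms in).
    pan: total horizontal/vertical drift of the crop, as a fraction of frame size.
    (All default values are illustrative assumptions.)
    """
    h, w = frames[0].shape[:2]
    n = len(frames)
    out = []
    for i, frame in enumerate(frames):
        t = i / max(n - 1, 1)                        # linear progress through the clip
        zoom = zoom_start + t * (zoom_end - zoom_start)
        ch, cw = int(h / zoom), int(w / zoom)        # size of the crop window
        cx = int((w - cw) / 2 + t * pan[0] * w)      # crop origin drifts to create a pan
        cy = int((h - ch) / 2 + t * pan[1] * h)
        cx = int(np.clip(cx, 0, w - cw))
        cy = int(np.clip(cy, 0, h - ch))
        crop = frame[cy:cy + ch, cx:cx + cw]
        out.append(cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR))
    return out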
Dataset Demo

Method

The framework of EffectErase. During training, removal and insertion pairs are encoded into the latent space by a VAE and fused with noise via the Adaptor. Each DiT block performs cross-attention using the fused features as Query and the Task-Aware Region Guidance prompt embeddings as Key/Value, producing attention maps that highlight the affected regions. We aggregate the attention maps from all blocks and apply max pooling to obtain a maximal-activation map, which is supervised by the effect consistency loss (L_EC) to encourage both tasks to focus on the same affected area. At inference, users can switch the model between removal and insertion by modifying the inputs.
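To make the aggregation step concrete, here is a minimal PyTorch sketch of max pooling the per-block cross-attention maps into a maximal-activation map, together with one plausible form of the effect consistency loss. The tensor shapes, the per-block head/token reduction, and the L1 form of L_EC are assumptions for illustration; the page does not state whether the supervision target is the other task's map or a ground-truth affected-region mask, and the sketch uses the former.

import torch
import torch.nn.functional as F

def maximal_activation_map(attn_maps):
    """Aggregate per-block cross-attention maps into one maximal-activation map.

    attn_maps: list of tensors of shape (B, T, H, W), one per DiT block, where each
    entry is the attention of the fused video features (Query) to the Task-Aware
    Region Guidance prompt embeddings (Key/Value), already reduced over heads and
    prompt tokens. (Shapes and reduction are assumptions for this sketch.)
    """
    stacked = torch.stack(attn_maps, dim=1)   # (B, num_blocks, T, H, W)
    return stacked.max(dim=1).values          # max pooling over blocks

def effect_consistency_loss(removal_maps, insertion_maps):
    """A plausible form of L_EC: pull the removal and insertion maximal-activation
    maps together so both tasks focus on the same affected region."""
    m_rem = maximal_activation_map(removal_maps)
    m_ins = maximal_activation_map(insertion_maps)
    return F.l1_loss(m_rem, m_ins)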

BibTeX

@inproceedings{EffectErase,
  title     = {{EffectErase}: Joint Video Object Removal and Insertion for High-Quality Effect Erasing},
  author    = {Fu, Yang and Zheng, Yike and Dai, Ziyun and Ding, Henghui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}