EffectErase: Joint Video Object Removal
and Insertion for High-Quality Effect Erasing

CVPR 2026

Yang Fu  ·  Yike Zheng  ·  Ziyun Dai  ·  Henghui Ding

Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China
† Corresponding author

Interactive demo: drag the vertical handle to compare each input video with its removal result (WILD_ENV sequences).

Demo

Comparison

Side-by-side comparisons on WILD_ENV001_00024:
ObjectClear / OmniPaint / EffectErase
ProPainter / DiffuEraser / VACE / EffectErase
Minmax-Remover / ROSE / EffectErase

Portrait Comparison

Side-by-side portrait comparisons on WILD_ENV003_00088:
ObjectClear / OmniPaint / EffectErase
ProPainter / DiffuEraser / VACE / EffectErase
Minmax-Remover / ROSE / EffectErase

Dataset Pipeline

Dataset Construction Pipeline of VOR. VOR is a hybrid dataset combining synthetic data and real-world captures. Synthetic data are generated in Blender using 3D environments, objects, and animations collected from public sources, together with carefully designed natural object and camera trajectories. Real-world data are recorded with cameras across diverse scenes and object categories, and the Ken Burns effect is then applied to simulate camera motion. All videos are segmented with SAM2 and manually cleaned and refined by human annotators. The final dataset comprises triplets: a video with the target object, the same video without it, and the corresponding object masks.
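The Ken Burns step simulates camera motion by animating a slow pan/zoom crop over each static real-world capture. The exact parameters used for VOR are not given on this page; the sketch below is a minimal Python/OpenCV illustration in which the zoom range, pan amount, and interpolation choice are assumptions, not the authors' settings. Applying the same trajectory to the paired with-object and without-object recordings would keep the resulting triplet spatially aligned.

import cv2
import numpy as np

def ken_burns(frames, zoom_start=1.0, zoom_end=1.2, pan=(0.05, 0.0)):
    """Simulate camera motion by animating a crop window (Ken Burns effect).

    frames: sequence of HxWx3 uint8 images (a static or near-static capture).
    zoom_start/zoom_end: crop scale at the first/last frame (>= 1.0 zooms in).
    pan: total horizontal/vertical drift of the crop, as a fraction of frame size.
    (All default values are illustrative assumptions.)
    """
    h, w = frames[0].shape[:2]
    n = len(frames)
    out = []
    for i, frame in enumerate(frames):
        t = i / max(n - 1, 1)                        # linear progress through the clip
        zoom = zoom_start + t * (zoom_end - zoom_start)
        ch, cw = int(h / zoom), int(w / zoom)        # size of the crop window
        cx = int((w - cw) / 2 + t * pan[0] * w)      # crop origin drifts to create a pan
        cy = int((h - ch) / 2 + t * pan[1] * h)
        cx = int(np.clip(cx, 0, w - cw))
        cy = int(np.clip(cy, 0, h - ch))
        crop = frame[cy:cy + ch, cx:cx + cw]
        out.append(cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR))
    return out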
Dataset Demo

Method

The framework of EffectErase. During training, removal and insertion pairs are encoded into the latent space by a VAE and fused with noise via the Adaptor. Each DiT block performs cross-attention using the fused features as Query and the Task-Aware Region Guidance prompt embeddings as Key/Value, producing attention maps that highlight the affected regions. We aggregate the attention maps from all blocks and apply max pooling to obtain a maximal-activation map, which is supervised by the effect consistency loss (L_EC) to encourage both tasks to focus on the same affected area. At inference, users can switch the model between removal and insertion by modifying the inputs.
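To make the aggregation step concrete, here is a minimal PyTorch sketch of max pooling the per-block cross-attention maps into a maximal-activation map, together with one plausible form of the effect consistency loss. The tensor shapes, the per-block head/token reduction, and the L1 form of L_EC are assumptions for illustration; the page does not state whether the supervision target is the other task's map or a ground-truth affected-region mask, and the sketch uses the former.

import torch
import torch.nn.functional as F

def maximal_activation_map(attn_maps):
    """Aggregate per-block cross-attention maps into one maximal-activation map.

    attn_maps: list of tensors of shape (B, T, H, W), one per DiT block, where each
    entry is the attention of the fused video features (Query) to the Task-Aware
    Region Guidance prompt embeddings (Key/Value), already reduced over heads and
    prompt tokens. (Shapes and reduction are assumptions for this sketch.)
    """
    stacked = torch.stack(attn_maps, dim=1)   # (B, num_blocks, T, H, W)
    return stacked.max(dim=1).values          # max pooling over blocks

def effect_consistency_loss(removal_maps, insertion_maps):
    """A plausible form of L_EC: pull the removal and insertion maximal-activation
    maps together so both tasks focus on the same affected region."""
    m_rem = maximal_activation_map(removal_maps)
    m_ins = maximal_activation_map(insertion_maps)
    return F.l1_loss(m_rem, m_ins)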

BibTeX

@inproceedings{EffectErase,
  title     = {{EffectErase}: Joint Video Object Removal and Insertion for High-Quality Effect Erasing},
  author    = {Fu, Yang and Zheng, Yike and Dai, Ziyun and Ding, Henghui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}