Abstract
Despite impressive advances in image matting, video matting remains challenging because of the gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods attempt to bridge this gap with expensive and narrowly scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink this problem with SAM2Matting, a tracker-to-matting framework that advances VOS trackers to (hence the "2") high-fidelity alpha matting. Specifically, it decouples the task by augmenting a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, so that the uncompromised tracker handles temporal consistency while the matting components are dedicated to resolving fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art video matting performance, supports diverse prompt types, maintains strong temporal consistency, and generalizes robustly across both human-centric and in-the-wild scenarios.
Architecture
SAM2Matting is a generalized matting framework that decouples high-level tracking from dedicated low-level matting components. Specifically, a VOS tracker provides a temporally consistent target mask for each frame. Given the mask and multi-scale image features, an ROI Detector identifies the instance-specific, matting-critical regions, i.e., those containing fine-grained details or semi-transparency. A Progressive Alpha Predictor then iteratively produces and refines the matte through a coarse-to-fine cascade, with intermediate mattes supervised at each scale to progressively capture finer details.
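The data flow described above can be illustrated with a minimal NumPy sketch. It is not the SAM2Matting implementation: the ROI Detector and Progressive Alpha Predictor are learned modules, whereas here they are replaced by hand-written stand-ins (a boundary band around the tracker mask, and a caller-supplied `refine` function). All function names, the scale factors `(4, 2, 1)`, and the L1 loss are assumptions chosen for illustration only.

```python
import numpy as np

def box_filter(x, radius):
    """Mean filter with a (2*radius+1)^2 window and edge padding."""
    k = 2 * radius + 1
    padded = np.pad(x, radius, mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def roi_from_mask(mask, radius):
    """Stand-in for the ROI Detector (assumption): pixels whose local
    window mixes foreground and background, i.e. a band around the boundary."""
    local_mean = box_filter(mask.astype(np.float64), radius)
    return (local_mean > 1e-6) & (local_mean < 1 - 1e-6)

def downsample(x, factor):
    """Block-average downsampling; height and width must divide by factor."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor):
    """Nearest-neighbour upsampling."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def progressive_alpha(mask, roi, refine, factors=(4, 2, 1)):
    """Coarse-to-fine cascade: at each scale, refine the previous matte
    inside the ROI and keep the tracker mask elsewhere.
    Returns the intermediate matte produced at every scale."""
    mask_f = mask.astype(np.float64)
    roi_f = roi.astype(np.float64)
    mattes, alpha, prev_factor = [], None, factors[0]
    for f in factors:
        mask_s = downsample(mask_f, f)
        roi_s = downsample(roi_f, f) > 0          # block touches the ROI
        prev = mask_s if alpha is None else upsample(alpha, prev_factor // f)
        refined = refine(prev, mask_s)            # hypothetical matting head
        alpha = np.clip(np.where(roi_s, refined, mask_s), 0.0, 1.0)
        mattes.append(alpha)
        prev_factor = f
    return mattes

def multi_scale_l1(mattes, gt_alpha, factors=(4, 2, 1)):
    """Supervise every intermediate matte against the ground truth
    downsampled to the same scale (L1 is an assumed choice of loss)."""
    return sum(np.abs(m - downsample(gt_alpha, f)).mean()
               for m, f in zip(mattes, factors))

# Toy example: a square "tracker mask" on a 16x16 frame.
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True
roi = roi_from_mask(mask, radius=2)
# Placeholder refinement: smooth the previous matte (a learned head in practice).
mattes = progressive_alpha(mask, roi, lambda prev, m: box_filter(prev, 1))
loss = multi_scale_l1(mattes, mask.astype(np.float64))
```

In this toy setup the final matte equals the hard tracker mask outside the ROI and takes fractional values only inside it, which mirrors the division of labor in the text: the tracker decides where the target is, and the matting components only resolve the uncertain regions.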
Quantitative Results
We provide three SAM2Matting variants, built on three different trackers: SAM2.1-T, SAM2.1-B+, and SAM3. The best, second-best, and third-best results are highlighted with red, orange, and yellow backgrounds, respectively. SAM2Matting achieves state-of-the-art performance on both image and video matting benchmarks, with its video matting performance evaluated in a zero-shot manner.
Image Matting
Video Matting
Qualitative Results
Interactive Demo
SAM2Matting supports diverse prompt types and enables robust matting of any open-world target throughout a video sequence.
BibTeX
@inproceedings{SAM2Matting,
title={{SAM2Matting}: Generalized Image and Video Matting},
author={Shen, Ruiqi and Jie, Guangquan and Liu, Chang and Ding, Henghui},
booktitle={European Conference on Computer Vision (ECCV)},
year={2026}
}