SAM2Matting
Generalized Image and Video Matting

¹Fudan University    ²Shanghai University of Finance and Economics
ECCV 2026

Abstract

Despite impressive advances in image matting, video matting remains challenging because of the gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods try to bridge this gap with expensive, narrowly scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink the problem with SAM2Matting, a tracker-to-matting framework that advances video object segmentation (VOS) trackers to (hence the "2") high-fidelity alpha matting. Specifically, it decouples the task by augmenting a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, so that the uncompromised tracker handles temporal consistency while the matting components are dedicated to resolving fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art video matting performance, supports diverse prompt types, maintains strong temporal consistency, and generalizes robustly across both human-centric and in-the-wild scenarios.

Architecture

SAM2Matting is a generalized matting framework that decouples high-level tracking from dedicated low-level matting components. Specifically, a VOS tracker provides a temporally-consistent target mask for each frame. Given the mask and multi-scale image features, an ROI Detector identifies instance-specific matting-critical regions with fine-grained details or semi-transparency. A Progressive Alpha Predictor then iteratively produces and refines the matte through a coarse-to-fine cascade, with intermediate mattes supervised at each scale to progressively capture finer details.
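The data flow described above (tracker mask, then ROI detection, then a coarse-to-fine matte) can be illustrated with a small NumPy toy. This is only a sketch under strong assumptions, not the SAM2Matting implementation: the learned ROI Detector is replaced by a simple boundary band around the mask, the learned Progressive Alpha Predictor is replaced by a grayscale estimate from the compositing equation I = αF + (1 − α)B, and every function name (`detect_roi`, `progressive_alpha`, `avg_pool`, `upsample`) is hypothetical.

```python
import numpy as np

def avg_pool(x, s):
    """Downsample by averaging non-overlapping s x s blocks (H, W divisible by s)."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def upsample(x, s):
    """Nearest-neighbour upsampling by an integer factor s."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def detect_roi(mask, radius=2):
    """Toy stand-in for the ROI Detector: pixels within `radius` of the mask boundary."""
    m = mask.astype(np.float64)
    k = 2 * radius + 1
    padded = np.pad(m, radius, mode="edge")
    h, w = m.shape
    local_mean = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)
    ) / (k * k)
    # A window that mixes foreground and background has a mean strictly in (0, 1).
    return (local_mean > 1e-6) & (local_mean < 1 - 1e-6)

def progressive_alpha(image, mask, roi, scales=(4, 2, 1)):
    """Toy stand-in for the cascade: one intermediate matte per scale, coarse to fine.

    Assumes a grayscale image with roughly constant foreground and background
    intensities that differ from each other (otherwise fg - bg is zero).
    """
    fg = image[mask & ~roi].mean()    # intensity of confident foreground
    bg = image[~mask & ~roi].mean()   # intensity of confident background
    alpha = mask.astype(np.float64)   # start from the tracker's hard mask
    mattes = []
    for s in scales:
        coarse = upsample(avg_pool(image, s), s)
        estimate = np.clip((coarse - bg) / (fg - bg), 0.0, 1.0)
        # Only matting-critical pixels are re-estimated; the rest keep the mask value.
        alpha = np.where(roi, estimate, alpha)
        mattes.append(alpha)
    return mattes

# Synthetic frame: a vertical edge with a soft transition over a few columns.
h = w = 32
x = np.arange(w, dtype=np.float64)
alpha_true = np.tile(np.clip((x - 14.0) / 4.0, 0.0, 1.0), (h, 1))
image = 0.9 * alpha_true + 0.1 * (1.0 - alpha_true)  # compositing with F=0.9, B=0.1
mask = alpha_true > 0.5                               # what a hard tracker mask would give
roi = detect_roi(mask, radius=3)
mattes = progressive_alpha(image, mask, roi)
errors = [float(np.abs(m - alpha_true).mean()) for m in mattes]
print(errors)  # mean absolute error of each intermediate matte, coarse to fine
```

In this toy the error of the intermediate mattes shrinks at each finer scale, which mirrors the idea of supervising every stage of the cascade. The real components are learned networks operating on multi-scale image features, so the sketch conveys only the control flow, not the method's accuracy or design.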

SAM2Matting framework overview

Quantitative Results

We provide three SAM2Matting variants, built on the SAM2.1-T, SAM2.1-B+, and SAM3 trackers. The best, second-best, and third-best results are highlighted with red, orange, and yellow backgrounds, respectively. SAM2Matting achieves state-of-the-art performance on both image and video matting benchmarks, with its video matting performance evaluated in a zero-shot manner.

Image Matting

SAM2Matting image matting quantitative results

Video Matting

SAM2Matting video matting quantitative results

Qualitative Results

Qualitative comparison on fast motion
SAM2Matting stably tracks challenging targets and recovers intricate details in the wild and under rapid motion, where baselines fail.

Interactive Demo

SAM2Matting supports diverse prompt types and enables robust matting of any open-world target throughout a video sequence.

BibTeX

@inproceedings{SAM2Matting,
  title={{SAM2Matting}: Generalized Image and Video Matting},
  author={Shen, Ruiqi and Jie, Guangquan and Liu, Chang and Ding, Henghui},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}