1. Cut-VOS Benchmark
Benchmark Statistics
| Dataset        | #Videos | #Objects | #Masks | #Shots | Trans. Frequency | Obj. Categories | Available |
|----------------|---------|----------|--------|--------|------------------|-----------------|-----------|
| YouMVOS-test   | 30      | 78       | 64.6K  | 2.4K   | 0.222/s          | 4               | ✖         |
| Cut-VOS (Ours) | 100     | 174      | 10.2K  | 648    | 0.346/s          | 11              | ✔         |
Table 1: The basic statistics for the Cut-VOS benchmark.
As shown in Table 1, the Cut-VOS benchmark contains 100 videos, 174 annotated objects, and 10.2K high-quality masks overall. Compared with the previous MVOS dataset YouMVOS, Cut-VOS provides more videos (100 vs. 30), objects covering more diverse scenarios, and carefully screened transitions of multiple types at a roughly 1.6× higher frequency (0.346/s vs. 0.222/s). Moreover, unlike YouMVOS, which focuses solely on actors, especially human subjects, Cut-VOS spans 11 categories and 40+ subcategories across both actors and static objects, making it more in-the-wild.
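For clarity on units, the Trans. Frequency column counts shot transitions per second of video. A minimal sketch of such a computation (the function and argument names are ours, not the Cut-VOS annotation format):

```python
# Hypothetical sketch: how a per-second transition frequency like the
# "Trans. Frequency" column could be computed.
def transition_frequency(transition_times_s, video_duration_s):
    """Number of shot transitions per second of video."""
    return len(transition_times_s) / video_duration_s

# e.g. 3 transitions in a 10-second clip -> 0.3 transitions/s
print(transition_frequency([2.1, 5.4, 8.9], 10.0))  # 0.3
```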
Diverse Object Categories
Figure 2: Comparison of object categories. Cut-VOS covers the 4 categories in YouMVOS and 7 new categories.
As shown in Figure 2, the proposed Cut-VOS contains objects across 11 categories: Adult, Child, Virtual, Animal, Vehicle, Tool, Food, Architecture, Furniture, Plants, and Instrument.
The first five categories correspond to actors, while the remaining six belong to static objects, accounting for 62% and 38% of the benchmark, respectively.
Transition Types Analysis
We classify all shot transitions into 9 different types: cut away, cut in, delayed cut in, scene change, pitch transformation, horizon transformation, close-up view, distant view, and insignificancy, as shown in Figure 4.
We analyze the transition types to pinpoint existing bottlenecks. We find that insignificancy and cut away are the easiest for existing VOS methods, while scene change, close-up view, and distant view are the most challenging. The Cut-VOS benchmark therefore emphasizes the challenging types and keeps few insignificancy and long-duration cut-away transitions to preserve complexity.
The relevant analysis and statistics are presented in our paper and technical appendix. Click here to access our paper on arXiv.
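To make the notion of a hard transition concrete, here is a minimal illustration (not the paper's method) that flags candidate hard cuts via grayscale-histogram correlation between consecutive frames; softer types such as close-up and distant views generally require stronger cues:

```python
import cv2

def detect_hard_cuts(video_path, threshold=0.6):
    """Flag frame indices where histogram correlation drops sharply."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation near 1 means similar frames; a low value
            # suggests an abrupt content change, i.e. a hard cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```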
2. TMA Strategy and SAAS Method
Figure 5: We first propose the TMA strategy to enable training on single-shot videos, alleviating the severe data sparsity. TMA automatically generates samples with different patterns to mimic different transition types. Examples: (a) Random strong transforms. (b) Single transition across different segments from the same video. (c) Multiple transitions, constructing a case with cut away and cut in. (d) Single transition to another video, with random replication and gradual translations.
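To illustrate pattern (b) concretely, below is a hedged sketch that splices two temporally distant segments of the same single-shot video to fabricate one cut; the function and its arguments are hypothetical, and the actual TMA implementation may differ:

```python
import random

def splice_single_cut(frames, masks, seg_len):
    """Fabricate a one-cut training clip from a single-shot video.

    frames/masks: aligned per-frame lists for one single-shot video.
    The two halves come from distant time spans, so the target
    reappears after an abrupt 'transition' at index seg_len.
    """
    assert len(frames) >= 2 * seg_len
    # Pick two non-overlapping, temporally distant windows.
    a = random.randint(0, len(frames) // 2 - seg_len)
    b = random.randint(len(frames) // 2, len(frames) - seg_len)
    clip_f = frames[a:a + seg_len] + frames[b:b + seg_len]
    clip_m = masks[a:a + seg_len] + masks[b:b + seg_len]
    return clip_f, clip_m
```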
Figure 6: The architecture of our proposed SAAS (Segment Anything Across Shots) model. It consists of three new components: the Transition Detection Module (TDM), the Transition Comprehension Module (TCM), and a local memory bank. These modules detect and understand the occurring transition and guide the cross-shot segmentation. With the training support of TMA, SAAS achieves strong multi-shot segmentation capacity.
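As a rough schematic of how these components might compose (module internals are placeholders, not the released implementation): the TDM scores each frame for a transition; when one is flagged, the TCM re-anchors the targets and the local memory bank, which holds features from the current shot only, is reset:

```python
import torch.nn as nn

class SAASSketch(nn.Module):
    """Schematic only: tdm/tcm/segmenter are placeholder callables."""
    def __init__(self, tdm, tcm, segmenter, shot_memory_size=8):
        super().__init__()
        self.tdm, self.tcm, self.segmenter = tdm, tcm, segmenter
        self.shot_memory_size = shot_memory_size
        self.local_memory = []  # features from the current shot only

    def step(self, frame_feat):
        if self.tdm(frame_feat, self.local_memory):  # transition detected?
            # Understand the transition, re-anchor the targets,
            # and reset the local memory for the new shot.
            frame_feat = self.tcm(frame_feat, self.local_memory)
            self.local_memory = []
        mask = self.segmenter(frame_feat, self.local_memory)
        self.local_memory = (self.local_memory + [frame_feat])[-self.shot_memory_size:]
        return mask
```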
3. Experiments
Benchmark Results
Table 2: Main results of existing VOS methods and our proposed SAAS on the YouMVOS and Cut-VOS benchmarks.
We evaluate representative VOS methods, including XMem, DEVA, Cutie, and SAM2, along with our proposed SAAS, on both the YouMVOS and Cut-VOS benchmarks, as shown in Table 2. * denotes a model trained directly on the YTVOS dataset without extra data augmentation. Bold and underlined indicate the best and second-best performance among the tested methods. The results show that SAAS achieves state-of-the-art performance on both benchmarks while keeping virtually no degradation in inference speed.
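For reference, VOS benchmarks conventionally report region similarity J (mask IoU) together with contour accuracy F; assuming Table 2 follows this convention, a minimal J computation looks like:

```python
import numpy as np

# Minimal sketch of region similarity J (mask IoU), the standard
# VOS region metric, for one predicted/ground-truth mask pair.
def region_similarity(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union
```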
Qualitative Results
Figure 7: Qualitative comparison of representative cases from Cut-VOS between SAAS and SAM2. (a) shows a case with a delayed cut in transition and an abrupt position shift of the target objects. (b) demonstrates SAAS's stronger capacity in a crowded scene with complex relations: SAAS coherently segments the target object among ten similar objects.
BibTeX
Please consider citing SAAS if it helps your research.
@inproceedings{SAAS2025,
  title={Segment Anything Across Shots: A Method and Benchmark},
  author={Hu, Hengrui and Ying, Kaining and Ding, Henghui},
  booktitle={AAAI},
  year={2026}
}