Segment Anything Across Shots:
A Method and Benchmark

Fudan University, China
AAAI 2026

Figure 1: This work focuses on the underexplored task of Multi-shot Video Object Segmentation (MVOS). As shown in (a), the significant variations in object appearance, spatial location, and background across shots pose major challenges for MVOS. We introduce Cut-VOS, a challenging MVOS benchmark with high transition diversity to support this task. As shown in (b), on Cut-VOS, SAM2-B+ exhibits a 21.4% \( \mathcal{J}\&\mathcal{F} \) drop compared to the challenging single-shot MOSE dataset and a 16.4% \( \mathcal{J}\&\mathcal{F} \) drop compared to YouMVOS. Based on this observation, we propose a new transition-specific segmentation model, Segment Anything Across Shots (SAAS), which effectively segments objects in multi-shot videos.


Abstract

This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video containing multiple shots. Existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, which limits their real-world applicability. To alleviate the severe sparsity of annotated multi-shot data, we propose a transition-mimicking data augmentation strategy (TMA) that enables cross-shot generalization from single-shot data, together with the Segment Anything Across Shots (SAAS) model, which detects and comprehends shot transitions effectively. To support evaluation and future study of MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions.

1. Cut-VOS Benchmark

Benchmark Statistics

| Dataset | #Videos | #Objects | #Masks | #Shots | Trans. Frequency | Obj. Categories | Available |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YouMVOS-test | 30 | 78 | 64.6K | 2.4K | 0.222/s | 4 | |
| Cut-VOS (Ours) | 100 | 174 | 10.2K | 648 | 0.346/s | 11 | |

Table 1: The basic statistics for the Cut-VOS benchmark.

As shown in Table 1, the Cut-VOS benchmark contains 100 videos, 174 annotated objects, and 10.2K high-quality masks overall. Compared to the previous MVOS dataset YouMVOS, Cut-VOS contains more videos (100 vs. 30) and objects covering more diverse scenarios, as well as carefully screened transitions of multiple types at roughly 1.6 times higher frequency (0.346/s vs. 0.222/s). Moreover, unlike YouMVOS, which solely focuses on actors, especially human subjects, Cut-VOS contains 11 categories and 40+ subcategories spanning both actors and static objects, making it closer to in-the-wild scenarios.
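
The transition frequency in Table 1 measures how often shot changes occur per second of video. Below is a minimal sketch of one natural way to compute it, assuming the statistic is the total number of transitions (one fewer than the number of shots per video) divided by the total duration; the field names (num_shots, duration_sec) are hypothetical, not the official Cut-VOS annotation format.

def transition_frequency(videos):
    """Number of shot transitions per second, aggregated over the whole dataset."""
    total_transitions = sum(max(v["num_shots"] - 1, 0) for v in videos)
    total_seconds = sum(v["duration_sec"] for v in videos)
    return total_transitions / total_seconds

# Toy example: two 60 s videos with 11 and 12 shots -> 21 transitions / 120 s = 0.175/s.
videos = [{"num_shots": 11, "duration_sec": 60.0},
          {"num_shots": 12, "duration_sec": 60.0}]
print(f"{transition_frequency(videos):.3f} transitions/s")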

Diverse Object Categories

Figure 2: Comparison of object categories. Cut-VOS covers the 4 categories present in YouMVOS and adds 7 new categories.

As shown in Figure 2, the proposed Cut-VOS contains objects across 11 categories: Adult, Child, Virtual, Animal, Vehicle, Tool, Food, Architecture, Furniture, Plants, and Instrument. The first five categories correspond to actors, while the remaining six belong to static objects, accounting for 62% and 38% of the benchmark, respectively.

Figure 3: Examples of diverse object categories in the Cut-VOS benchmark.

Transition Types Analysis


Figure 4: Visualization of 8 significant transition types.

We classify all shot transitions into 9 different types: cut away, cut in, delayed cut in, scene change, pitch transformation, horizon transformation, close-up view, distant view, and insignificancy, as shown in Figure 4. By analyzing performance per transition type, we aim to pinpoint the existing bottlenecks. We find that insignificancy and cut away are the easiest for existing VOS methods, while scene change, close-up view, and distant view are the most challenging. The Cut-VOS benchmark therefore emphasizes the challenging types and keeps few insignificant or long-duration cut-away transitions to maintain its complexity.
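
To make this per-type analysis concrete, the sketch below groups per-transition scores by transition type and reports the mean for each type; the record format and the example numbers are purely illustrative placeholders, not results from the paper.

from collections import defaultdict

def mean_score_per_type(records):
    """Average a post-transition quality score (e.g., J&F) for each transition type."""
    buckets = defaultdict(list)
    for transition_type, score in records:
        buckets[transition_type].append(score)
    return {t: sum(v) / len(v) for t, v in buckets.items()}

# Placeholder records for illustration only: (transition type, score right after the transition).
records = [("cut away", 0.8), ("insignificancy", 0.9), ("scene change", 0.4)]
for t, s in sorted(mean_score_per_type(records).items(), key=lambda kv: kv[1]):
    print(f"{t}: {s:.2f}")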

The detailed analysis and statistics can be found in our paper and technical appendix. Click here to access our paper on arXiv.

2. TMA Strategy and SAAS Method

Figure 5: We first propose the TMA strategy to enable training on single-shot videos, thereby alleviating the severe sparsity of annotated multi-shot data. TMA automatically generates samples with different patterns to mimic different transition types. Examples: (a) Random strong transforms. (b) A single transition across different segments of the same video. (c) Multiple transitions, constructing a case with both cut away and cut in. (d) A single transition to another video, with random replication and gradual translations.
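
As a concrete illustration of pattern (b), the sketch below builds a pseudo multi-shot training clip by concatenating two temporally disjoint segments of the same single-shot video. The function and its parameters are a simplified assumption for illustration, not the actual TMA implementation (which also covers the other patterns in Figure 5).

import random

def mimic_single_transition(frames, masks, clip_len=8):
    """Build a pseudo multi-shot clip from one single-shot video (pattern (b) in Figure 5).

    Two temporally disjoint segments are concatenated, so the clip contains one
    abrupt "transition" while object identity and mask supervision stay consistent.
    """
    assert len(frames) >= 2 * clip_len, "video too short for this sketch"
    # Sample the first segment from the first half and the second from the second half.
    start_a = random.randint(0, len(frames) // 2 - clip_len)
    start_b = random.randint(len(frames) // 2, len(frames) - clip_len)
    cut = random.randint(1, clip_len - 1)  # position of the mimicked cut inside the clip
    indices = list(range(start_a, start_a + cut)) + \
              list(range(start_b, start_b + clip_len - cut))
    return [frames[i] for i in indices], [masks[i] for i in indices]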

Figure 6: The architecture of our proposed SAAS (Segment Anything Across Shots) model. It consists of three new components: the Transition Detection Module (TDM), the Transition Comprehension Module (TCM), and a local memory bank. These modules detect and understand the occurring transition and guide cross-shot segmentation. Trained with TMA, SAAS achieves strong multi-shot segmentation capability.
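
To illustrate how these components interact at inference time, here is a schematic sketch. The module interfaces (is_transition, relocate, predict, write) are assumptions made for illustration based on the caption, not the released SAAS code.

def segment_video(frames, first_mask, tdm, tcm, segmenter, local_memory):
    """Schematic SAAS-style loop: tdm detects shot transitions, tcm re-localizes the
    target after a transition, and the segmenter predicts masks using a local memory bank."""
    masks = [first_mask]
    local_memory.write(frames[0], first_mask)
    for prev, curr in zip(frames, frames[1:]):
        if tdm.is_transition(prev, curr):
            # A shot change: re-localize the target in the new shot instead of
            # propagating the (now unreliable) spatial prior from the previous frame.
            hint = tcm.relocate(curr, local_memory)
        else:
            hint = local_memory.last_mask()
        mask = segmenter.predict(curr, hint, local_memory)
        local_memory.write(curr, mask)
        masks.append(mask)
    return masks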

3. Experiments

Benchmark Results

Table 2: Main results of existing VOS methods and our proposed SAAS on the YouMVOS and Cut-VOS benchmarks.

We evaluate representative VOS methods, including XMem, DEVA, Cutie, and SAM2, along with our proposed SAAS on both the YouMVOS and Cut-VOS benchmarks, as shown in Table 2. * denotes that the model is trained directly on the YTVOS dataset without extra data augmentation. Bold and underlined indicate the best and second-best performance among the tested methods. The results show that SAAS achieves state-of-the-art performance on both benchmarks with virtually no degradation in inference speed.
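
For reference, the \( \mathcal{J}\&\mathcal{F} \) score used in Table 2 averages region similarity \( \mathcal{J} \) (mask IoU) and contour accuracy \( \mathcal{F} \). A simplified sketch of the \( \mathcal{J} \) term is shown below; the official evaluation additionally computes the boundary-based \( \mathcal{F} \) term.

import numpy as np

def region_similarity(pred_mask, gt_mask):
    """Region similarity J: intersection-over-union between two binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: the prediction is treated as perfect.
        return 1.0
    return np.logical_and(pred, gt).sum() / union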

Qualitative Results

Figure 7: Qualitative comparison of representative cases from Cut-VOS between SAAS and SAM2. (a) shows a case with a delayed cut-in transition and an abrupt position shift of the target objects. (b) demonstrates SAAS's stronger capacity in a crowded scene with complex relations: SAAS coherently segments the target object among ten similar objects.

BibTeX

Please consider citing SAAS if it helps your research.
@inproceedings{SAAS2025,
  title={Segment Anything Across Shots: A Method and Benchmark},
  author={Hu, Hengrui and Ying, Kaining and Ding, Henghui},
  booktitle={AAAI},
  year={2026}
}