MOVE: Motion-Guided Few-Shot Video Object Segmentation

Fudan University, China
ICCV 2025, Honolulu, Hawai'i

Figure 1. We propose a new benchmark for MOtion-guided Few-shot Video object sEgmentation (MOVE). In this example, given two support videos showing distinct motion patterns (S1: Cristiano Ronaldo's signature celebration, S2: hugging), our benchmark aims to segment target objects in the query video that perform the same motions as in the support videos. MOVE provides a platform for advancing few-shot video analysis and perception by enabling the segmentation of diverse objects that exhibit the same motions.


Abstract

This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples sharing the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, a static attribute that ignores the rich temporal dynamics in videos, limiting their applicability in scenarios that require motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, the Decoupled Motion-Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
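To make the task setup concrete, the sketch below outlines the episodic interface implied above: each episode pairs a few support videos, with per-frame masks of an object performing some motion, against a query video, and the model must mask every query object performing that same motion. This is a minimal sketch; the class and field names are illustrative assumptions, not part of the MOVE release.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MotionEpisode:
    """One K-shot episode for motion-guided FSVOS (field names are illustrative)."""
    support_videos: List[np.ndarray]  # K clips, each of shape (T, H, W, 3)
    support_masks: List[np.ndarray]   # K mask stacks, each (T, H, W); 1 marks the object performing the motion
    query_video: np.ndarray           # (T, H, W, 3) query clip
    motion_label: str                 # e.g. "hugging"; used for evaluation only, never shown to the model

def segment_query(model, episode: MotionEpisode) -> np.ndarray:
    """Expected interface: return (T, H, W) masks covering every query object
    that performs the same motion as the annotated support objects."""
    return model(episode.support_videos, episode.support_masks, episode.query_video)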

Dataset: MOVE

Dataset Statistics

Dataset     | Venue   | Label Type | Annotation | Support Type | Categories | Videos | Objects | Frames  | Masks
FSVOD-500   | ECCV'22 | Object     | Box        | Image        | 500        | 4,272  | 96,609  | 104,495 | 104,507
YouTube-VIS | ICCV'19 | Object     | Mask       | Image        | 40         | 2,238  | 3,774   | 61,845  | 97,110
MiniVSPW    | IJCV'25 | Object     | Mask       | Image        | 20         | 2,471  | -       | 541,007 | -
MOVE (ours) | ICCV'25 | Motion     | Mask       | Video        | 224        | 4,300  | 5,135   | 261,920 | 314,619

Table 1: Comparison of MOVE with existing video object segmentation datasets.

As shown in Table 1, our MOVE benchmark contains 224 action categories across four domains (daily actions, sports, entertainment activities, and special actions), with 4,300 video clips, 5,135 moving objects, 261,920 frames, and 314,619 mask annotations. Compared to existing object-centric datasets, MOVE features video-level support samples and motion-based categories while maintaining a comparable scale in terms of videos and annotations. Each video clip is annotated with high-quality pixel-level masks and covers diverse scenes, subjects (person, vehicle, animal, etc.), and motion complexities. For more detailed statistics and analysis, please refer to the supplementary materials.
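Because categories in MOVE are defined by motion rather than object class, a few-shot episode groups clips by motion label, so support and query clips may depict entirely different objects. The toy sampler below illustrates this grouping; the metadata format (a dict mapping motion category to clip ids) is an assumption made for illustration, not the dataset's actual API.

import random

def sample_episode(clips_by_motion, k_shot=1, rng=random):
    """Sample one episode from hypothetical MOVE-style metadata:
    clips_by_motion maps a motion category to a list of clip ids.
    Support and query clips share the motion label, but the objects
    performing it (person, animal, robot, ...) can differ."""
    motion = rng.choice(list(clips_by_motion))
    clips = rng.sample(clips_by_motion[motion], k_shot + 1)
    return {"motion": motion, "support": clips[:k_shot], "query": clips[k_shot]}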

Visualization of Different Motions

(Video gallery) Motion categories shown: Cross, Heimlich, Hugging, Spinning Plate, Stacking Dice, Waacking, Yoga Pigeon, and Siu.

Figure 2: Examples of diverse motion categories in the MOVE dataset.

Visualization of Different Objects with Same Motion

(Video gallery) Motions and performers shown: Hugging (person, fox), Metamorphosis (person, horse), Shake (person, animal), and Siu (person, robot).

Figure 3: Examples of same motion performed by different object categories in the MOVE dataset.

Baseline Method: DMA


Figure 4. The architecture of our proposed DMA (Decoupled Motion-Appearance Network). The method consists of five main components: 1) a shared encoder that extracts multi-scale features from both support and query video frames, 2) a proposal generator that produces coarse mask proposals for the query video, 3) a shared DMA module (right) for extracting decoupled motion-appearance prototypes, 4) a prototype attention module that facilitates interaction between support and query prototypes, and 5) a mask decoder that generates the final segmentation masks for the query video. Given a support video with corresponding masks and a query video, our goal is to segment objects in the query video that exhibit the same motion pattern as objects in the support video.
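To show how the five components above fit together, here is a schematic PyTorch-style forward pass. It is a minimal sketch assuming standard tensor interfaces between components; the module names and signatures are ours, not the authors' implementation.

import torch.nn as nn

class DMASketch(nn.Module):
    """Schematic wiring of the five components described in the caption above.
    All names, shapes, and interfaces are illustrative assumptions."""
    def __init__(self, encoder, proposal_generator, dma_module, prototype_attention, mask_decoder):
        super().__init__()
        self.encoder = encoder                          # 1) shared multi-scale feature encoder
        self.proposal_generator = proposal_generator    # 2) coarse mask proposals for the query video
        self.dma = dma_module                           # 3) shared decoupled motion-appearance prototype extractor
        self.prototype_attention = prototype_attention  # 4) support <-> query prototype interaction
        self.mask_decoder = mask_decoder                # 5) final query segmentation masks

    def forward(self, support_frames, support_masks, query_frames):
        # Shared encoder runs on both branches
        feats_s = self.encoder(support_frames)
        feats_q = self.encoder(query_frames)

        # Coarse proposals localise candidate objects in the query video
        proposals = self.proposal_generator(feats_q)

        # Decoupled motion/appearance prototypes from each branch
        proto_s = self.dma(feats_s, support_masks)
        proto_q = self.dma(feats_q, proposals)

        # Cross-branch prototype interaction, then decode the final masks
        proto_q = self.prototype_attention(proto_q, proto_s)
        return self.mask_decoder(feats_q, proto_q, proposals)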

Experiments

Benchmark Results

OS Results

Table 3. Main results on the MOVE benchmark under the overlapping split (OS) setting. Bold and underline indicate the best and second-best values under the same backbone, respectively. VSwin-T denotes the VideoSwin-T backbone.

NS Results

Table 4. Main results on MOVE benchmark with non-overlapping split (NS) setting.

We evaluate 6 state-of-the-art methods from 3 related tasks across 2 experimental settings on our MOVE benchmark. The two settings, Overlapping Split (OS) and Non-overlapping Split (NS), differ in how motion categories are distributed across folds for cross-validation: OS allows some motion overlap between folds, while NS keeps the motion categories of different folds completely separate, creating a more challenging scenario. Our proposed DMA achieves superior few-shot motion understanding in both settings compared to existing approaches.
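As a rough illustration of the distinction between the two settings (the benchmark's exact split protocol is not reproduced here; the function below is an assumption), an NS split partitions motion categories into disjoint folds, whereas an OS split draws each fold's held-out categories independently and therefore tolerates overlap:

import random

def build_test_folds(motion_categories, num_folds=4, overlapping=True, seed=0):
    """Return the held-out motion categories of each fold (illustrative only).
    NS: a disjoint partition, so no test motion of one fold appears in another.
    OS: independent draws per fold, so some motions may recur across folds."""
    rng = random.Random(seed)
    cats = list(motion_categories)
    rng.shuffle(cats)
    fold_size = len(cats) // num_folds
    if overlapping:
        return [rng.sample(cats, fold_size) for _ in range(num_folds)]
    return [cats[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]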

Qualitative Results

OS Results

Figure 5. Qualitative comparison on representative cases from MOVE between the baseline methods DANet and HPAN and our proposed DMA. The examples show: different object categories performing the same action (a cat and a person playing drums); temporally correlated motions, with fingers transitioning between pinching and opening; and a challenging case with a misleading background, where football is played on a basketball court.

BibTeX

Please consider citing MOVE if it helps your research.
@inproceedings{ying2025move,
  title={{MOVE}: {M}otion-{G}uided {F}ew-{S}hot {V}ideo {O}bject {S}egmentation}, 
  author={Kaining Ying and Hengrui Hu and Henghui Ding},
  year={2025},
  booktitle={ICCV}
}