MOVE: Motion-Guided Few-Shot Video Object Segmentation

Fudan University, China
ICCV 2025, Honolulu, Hawai'i

Figure 1. We propose a new benchmark for MOtion-guided Few-shot Video object sEgmentation (MOVE). In this example, given two support videos showing distinct motion patterns (S1: Cristiano Ronaldo's signature celebration, S2: hugging), our benchmark aims to segment target objects in the query video that perform the same motions as in the support videos. MOVE provides a platform for advancing few-shot video analysis and perception by enabling the segmentation of diverse objects that exhibit the same motions.


Abstract

This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples sharing the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, a static attribute that ignores the rich temporal dynamics in videos, limiting their applicability in scenarios that require motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, the Decoupled Motion-Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
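To make the task setup concrete, the sketch below outlines the episodic interface implied above: each episode pairs a few support videos, with per-frame masks of an object performing some motion, against a query video, and the model must mask every query object performing that same motion. This is a minimal sketch; the class and field names are illustrative assumptions, not part of the MOVE release.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MotionEpisode:
    """One K-shot episode for motion-guided FSVOS (field names are illustrative)."""
    support_videos: List[np.ndarray]  # K clips, each of shape (T, H, W, 3)
    support_masks: List[np.ndarray]   # K mask stacks, each (T, H, W); 1 marks the object performing the motion
    query_video: np.ndarray           # (T, H, W, 3) query clip
    motion_label: str                 # e.g. "hugging"; used for evaluation only, never shown to the model

def segment_query(model, episode: MotionEpisode) -> np.ndarray:
    """Expected interface: return (T, H, W) masks covering every query object
    that performs the same motion as the annotated support objects."""
    return model(episode.support_videos, episode.support_masks, episode.query_video)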

Dataset: MOVE

Dataset Statistics

Dataset     | Venue   | Label Type | Annotation | Support Type | Categories | Videos | Objects | Frames  | Masks
FSVOD-500   | ECCV'22 | Object     | Box        | Image        | 500        | 4,272  | 96,609  | 104,495 | 104,507
YouTube-VIS | ICCV'19 | Object     | Mask       | Image        | 40         | 2,238  | 3,774   | 61,845  | 97,110
MiniVSPW    | IJCV'25 | Object     | Mask       | Image        | 20         | 2,471  | -       | 541,007 | -
MOVE (ours) | ICCV'25 | Motion     | Mask       | Video        | 224        | 4,300  | 5,135   | 261,920 | 314,619

Table 1: Comparison of MOVE with existing video object segmentation datasets.

As shown in Table 1, our MOVE benchmark contains 224 action categories across four domains (daily actions, sports, entertainment activities, and special actions), with 4,300 video clips, 5,135 moving objects, 261,920 frames, and 314,619 mask annotations. Compared to existing object-centric datasets, MOVE features video-level support samples and motion-based categories while maintaining a comparable scale in terms of videos and annotations. Each video clip is annotated with high-quality pixel-level masks and covers diverse scenes, subjects (person, vehicle, animal, etc.), and motion complexities. For more detailed statistics and analysis, please refer to the supplementary materials.
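Because categories in MOVE are defined by motion rather than object class, a few-shot episode groups clips by motion label, so support and query clips may depict entirely different objects. The toy sampler below illustrates this grouping; the metadata format (a dict mapping motion category to clip ids) is an assumption made for illustration, not the dataset's actual API.

import random

def sample_episode(clips_by_motion, k_shot=1, rng=random):
    """Sample one episode from hypothetical MOVE-style metadata:
    clips_by_motion maps a motion category to a list of clip ids.
    Support and query clips share the motion label, but the objects
    performing it (person, animal, robot, ...) can differ."""
    motion = rng.choice(list(clips_by_motion))
    clips = rng.sample(clips_by_motion[motion], k_shot + 1)
    return {"motion": motion, "support": clips[:k_shot], "query": clips[k_shot]}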

Visualization of Different Motions

(Video gallery) Motion categories shown: Cross, Heimlich, Hugging, Spinning Plate, Stacking Dice, Waacking, Yoga Pigeon, and Siu.

Figure 2: Examples of diverse motion categories in the MOVE dataset.

Visualization of Different Objects with Same Motion

(Video gallery) Motions and performers shown: Hugging (person, fox), Metamorphosis (person, horse), Shake (person, animal), and Siu (person, robot).

Figure 3: Examples of same motion performed by different object categories in the MOVE dataset.

Baseline Method: DMA


Figure 4. The architecture of our proposed DMA (Decoupled Motion-Appearance Network). The method consists of five main components: 1) a shared encoder that extracts multi-scale features from both support and query video frames, 2) a proposal generator that produces coarse mask proposals for the query video, 3) a shared DMA module (right) for extracting decoupled motion-appearance prototypes, 4) a prototype attention module that facilitates interaction between support and query prototypes, and 5) a mask decoder that generates the final segmentation masks for the query video. Given a support video with corresponding masks and a query video, our goal is to segment objects in the query video that exhibit the same motion pattern as objects in the support video.
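To show how the five components above fit together, here is a schematic PyTorch-style forward pass. It is a minimal sketch assuming standard tensor interfaces between components; the module names and signatures are ours, not the authors' implementation.

import torch.nn as nn

class DMASketch(nn.Module):
    """Schematic wiring of the five components described in the caption above.
    All names, shapes, and interfaces are illustrative assumptions."""
    def __init__(self, encoder, proposal_generator, dma_module, prototype_attention, mask_decoder):
        super().__init__()
        self.encoder = encoder                          # 1) shared multi-scale feature encoder
        self.proposal_generator = proposal_generator    # 2) coarse mask proposals for the query video
        self.dma = dma_module                           # 3) shared decoupled motion-appearance prototype extractor
        self.prototype_attention = prototype_attention  # 4) support <-> query prototype interaction
        self.mask_decoder = mask_decoder                # 5) final query segmentation masks

    def forward(self, support_frames, support_masks, query_frames):
        # Shared encoder runs on both branches
        feats_s = self.encoder(support_frames)
        feats_q = self.encoder(query_frames)

        # Coarse proposals localise candidate objects in the query video
        proposals = self.proposal_generator(feats_q)

        # Decoupled motion/appearance prototypes from each branch
        proto_s = self.dma(feats_s, support_masks)
        proto_q = self.dma(feats_q, proposals)

        # Cross-branch prototype interaction, then decode the final masks
        proto_q = self.prototype_attention(proto_q, proto_s)
        return self.mask_decoder(feats_q, proto_q, proposals)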

Experiments

Benchmark Results

OS Results

Table 3. Main results on the MOVE benchmark under the overlapping split (OS) setting. Bold and underline indicate the best and second-best values under the same backbone, respectively. VSwin-T denotes the VideoSwin-T backbone.

NS Results

Table 4. Main results on MOVE benchmark with non-overlapping split (NS) setting.

We evaluate 6 state-of-the-art methods from 3 related tasks across 2 experimental settings on our MOVE benchmark. The two settings, Overlapping Split (OS) and Non-overlapping Split (NS), differ in how motion categories are distributed across folds for cross-validation: OS allows some motion overlap between folds, while NS keeps the motion categories of different folds completely separate, creating a more challenging scenario. Our proposed DMA achieves superior few-shot motion understanding in both settings compared to existing approaches.
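As a rough illustration of the distinction between the two settings (the benchmark's exact split protocol is not reproduced here; the function below is an assumption), an NS split partitions motion categories into disjoint folds, whereas an OS split draws each fold's held-out categories independently and therefore tolerates overlap:

import random

def build_test_folds(motion_categories, num_folds=4, overlapping=True, seed=0):
    """Return the held-out motion categories of each fold (illustrative only).
    NS: a disjoint partition, so no test motion of one fold appears in another.
    OS: independent draws per fold, so some motions may recur across folds."""
    rng = random.Random(seed)
    cats = list(motion_categories)
    rng.shuffle(cats)
    fold_size = len(cats) // num_folds
    if overlapping:
        return [rng.sample(cats, fold_size) for _ in range(num_folds)]
    return [cats[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]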

Qualitative Results

OS Results

Figure 5. Qualitative comparison on representative cases from MOVE between the baseline methods DANet and HPAN and our proposed DMA. The examples show: different object categories performing the same action (a cat and a person playing drums); temporally correlated motions, with fingers transitioning between pinching and opening; and a challenging case with a misleading background, where football is played on a basketball court.

BibTeX

Please consider citing MOVE if it helps your research.
@inproceedings{ying2025move,
  title={{MOVE}: {M}otion-{G}uided {F}ew-{S}hot {V}ideo {O}bject {S}egmentation}, 
  author={Kaining Ying and Hengrui Hu and Henghui Ding},
  year={2025},
  booktitle={ICCV}
}