Dataset: MOVE
Dataset Statistics
| Dataset | Venue | Label Type | Annotation | Support Type | Categories | Videos | Objects | Frames | Masks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FSVOD-500 | ECCV'22 | Object | Box | Image | 500 | 4,272 | 96,609 | 104,495 | 104,507 |
| YouTube-VIS | ICCV'19 | Object | Mask | Image | 40 | 2,238 | 3,774 | 61,845 | 97,110 |
| MiniVSPW | IJCV'25 | Object | Mask | Image | 20 | 2,471 | - | 541,007 | - |
| MOVE (ours) | ICCV'25 | Motion | Mask | Video | 224 | 4,300 | 5,135 | 261,920 | 314,619 |
Table 1: Comparison of MOVE with existing video object segmentation datasets.
As shown in Table 1, our MOVE benchmark contains 224 action categories across four domains (daily actions, sports, entertainment activities, and special actions), with 4,300 video clips, 5,135 moving objects, 261,920 frames, and 314,619 mask annotations. Compared to existing object-centric datasets, MOVE features video-level support samples and motion-based categories, while maintaining comparable scale in terms of videos and annotations. Each video clip is equipped with high-quality pixel-level mask annotations, capturing diverse scenes, subjects (person, vehicle, animal, etc.), and motion complexities. For more detailed statistics and analysis, please refer to the supplementary materials.
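For intuition, below is a minimal sketch of how a MOVE-style few-shot episode can be represented: a support video with its masks defines the target motion, and the query video must be segmented accordingly. The class names and array layouts are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VideoClip:
    frames: np.ndarray  # (T, H, W, 3) uint8 RGB frames
    masks: np.ndarray   # (T, H, W) binary masks of the moving object(s)
    motion_label: str   # one of the 224 motion categories, e.g. "Hugging"

@dataclass
class Episode:
    """One few-shot episode: the support clip defines a motion category by
    example; the task is to segment objects in the query clip that perform
    the same motion, regardless of their object category."""
    support: VideoClip  # video-level support sample with its masks
    query: VideoClip    # query video to be segmented
```

Note that, unlike image-level support in prior object-centric benchmarks, the support sample here is an entire video, since a motion category can only be demonstrated over time.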
Visualization of Different Motions
Visualization of Different Objects with Same Motion
[Figure panels: Hugging (Person, Fox); Metamorphosis (Person, Horse); Shake (Person, Animal); Siu (Person, Robot)]
Figure 2: Examples of the same motion performed by different object categories in the MOVE dataset.
Baseline Method: DMA
Figure 3: The architecture of our proposed DMA (Decoupled Motion-Appearance Network). The method consists of five main components: 1) a shared encoder that extracts multi-scale features from both support and query video frames, 2) a proposal generator that produces coarse mask proposals for the query video, 3) a shared DMA module (right) for extracting decoupled motion-appearance prototypes, 4) a prototype attention module that facilitates interaction between support and query prototypes, and 5) a mask decoder that generates the final segmentation masks for the query video. Given a support video with corresponding masks and a query video, our goal is to segment objects in the query video that exhibit the same motion pattern as objects in the support video.
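To make the five-component pipeline easier to follow, here is a schematic PyTorch-style forward pass mirroring the caption above. All submodules are placeholders inferred from the description, not the released DMA implementation.

```python
import torch.nn as nn

class DMANet(nn.Module):
    """Schematic of the five components described in Figure 3; every
    submodule is a placeholder standing in for the actual DMA code."""
    def __init__(self, encoder, proposal_gen, dma, proto_attn, decoder):
        super().__init__()
        self.encoder = encoder            # 1) shared multi-scale encoder
        self.proposal_gen = proposal_gen  # 2) coarse query mask proposals
        self.dma = dma                    # 3) shared decoupled motion-appearance module
        self.proto_attn = proto_attn      # 4) support-query prototype attention
        self.decoder = decoder            # 5) mask decoder

    def forward(self, support_frames, support_masks, query_frames):
        # 1) The shared encoder extracts multi-scale features from both videos.
        feat_s = self.encoder(support_frames)
        feat_q = self.encoder(query_frames)

        # 2) The proposal generator produces coarse masks for the query video.
        proposals = self.proposal_gen(feat_q)

        # 3) The shared DMA module yields decoupled motion-appearance
        #    prototypes, using ground-truth masks on the support side and
        #    the coarse proposals on the query side.
        proto_s = self.dma(feat_s, support_masks)
        proto_q = self.dma(feat_q, proposals)

        # 4) Prototype attention lets query prototypes interact with support ones.
        proto_q = self.proto_attn(proto_q, proto_s)

        # 5) The mask decoder generates the final query segmentation masks.
        return self.decoder(feat_q, proto_q)
```

One consequence of the shared DMA module in the caption is that support and query prototypes are produced by the same operator, so they live in a common space before the prototype attention step.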
Experiments
Benchmark Results
Table 3: Main results on the MOVE benchmark under the overlapping split (OS) setting. Bold and underlined numbers indicate the best and second-best results under the same backbone, respectively. VSwin-T denotes the VideoSwin-T backbone.
Table 4: Main results on the MOVE benchmark under the non-overlapping split (NS) setting.
We evaluate 6 state-of-the-art methods from 3 related tasks under 2 experimental settings on our MOVE benchmark. The two settings, Overlapping Split (OS) and Non-overlapping Split (NS), differ in how motion categories are distributed across cross-validation folds: OS allows some motion categories to overlap between folds, while NS keeps the motion categories of each fold completely disjoint, creating a more challenging scenario. Our proposed DMA achieves superior few-shot motion understanding performance compared to existing approaches in both settings.
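To make the distinction between the two settings concrete, the toy sketch below partitions motion categories into folds under both protocols. The function and the overlap ratio are illustrative assumptions, not the official MOVE fold assignment.

```python
import random

def make_folds(categories, n_folds=4, overlap_ratio=0.0, seed=0):
    """Toy illustration of the two split protocols (not the official folds).
    With overlap_ratio == 0 every motion category appears in exactly one
    fold (NS); with overlap_ratio > 0 a fraction of categories is shared
    across all folds (OS), so test motions may recur during training."""
    rng = random.Random(seed)
    cats = categories[:]
    rng.shuffle(cats)
    n_shared = int(len(cats) * overlap_ratio)
    shared, exclusive = cats[:n_shared], cats[n_shared:]
    # Each fold gets all shared categories plus a disjoint slice of the rest.
    return [shared + exclusive[i::n_folds] for i in range(n_folds)]

motions = [f"motion_{i}" for i in range(224)]
ns_folds = make_folds(motions, overlap_ratio=0.0)   # NS: disjoint folds
os_folds = make_folds(motions, overlap_ratio=0.25)  # OS: partial overlap
```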
Qualitative Results
Figure 4: Qualitative comparison of representative cases from MOVE between the baseline methods (DANet and HPAN) and our proposed DMA. The examples show: different object categories performing the same action (a cat and a person playing drums); temporally correlated motions, with fingers transitioning between pinching and opening positions; and a challenging case with a misleading background, where football is played on a basketball court.
BibTeX
Please consider citing MOVE if it helps your research.
@inproceedings{ying2025move,
  title={{MOVE}: {M}otion-{G}uided {F}ew-{S}hot {V}ideo {O}bject {S}egmentation},
  author={Kaining Ying and Hengrui Hu and Henghui Ding},
  year={2025},
  booktitle={ICCV}
}