Introduction
Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for a selected object. Existing datasets and methods commonly support only single-target expressions, i.e., one expression refers to exactly one object, and do not consider multi-target or no-target expressions. This greatly limits the real-world applications of REx (RES/REC/REG).
This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend classic REx to allow expressions that identify an arbitrary number of objects. We construct the first large-scale GREx dataset, gRefCOCO, which contains multi-target, no-target, and single-target expressions together with their corresponding images and labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments that study the performance gap of existing REx methods on the GREx tasks.
Classic Referring Expression Segmentation (RES), Comprehension (REC), and Generation (REG), collectively denoted as REx, only support expressions that indicate a single target object, e.g., "The kid in red". Compared with REx, the proposed Generalized Referring Expression tasks (GREx), including Generalized RES (GRES), Generalized REC (GREC), and Generalized REG (GREG), extend expressions to multi-target and no-target cases. For example, GREx supports multi-target expressions that indicate several objects by their commonalities or relationships, e.g., category (2) "All people", attribute (3) "Standing people", counting (4) "Two people on the far left", and compound (5) "Everyone except the kid in white". GRES and GREC further support no-target expressions that do not match any object in the image, e.g., (6) "The kid in blue".
Statistics
Total provided expressions
Multi- or no-target expressions
Referred distinct objects
High-quality mask annotations
Tasks
In this work, to overcome the limitations of classic RES, REC, and REG, we introduce three new GREx benchmarks, called Generalized Referring Expression Segmentation (GRES), Generalized Referring Expression Comprehension (GREC), and Generalized Referring Expression Generation (GREG), which allow expressions to indicate any number of target objects. GRES and GREC take an image and a referring expression as input, the same as classic RES and REC.
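For illustration only, the Python sketch below spells out this input/output contract under our assumptions: the model object, its predict/detect calls, and the GRESOutput/GRECOutput names are hypothetical placeholders, not an API released with GREx or gRefCOCO.

# A minimal sketch of the GRES/GREC input/output contract.
# The `model` object and its `predict`/`detect` methods are hypothetical
# placeholders, not an official API shipped with gRefCOCO.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class GRESOutput:
    mask: np.ndarray   # H x W binary mask covering all referred targets (all zeros if no target)
    no_target: bool    # True if the expression matches no object in the image

@dataclass
class GRECOutput:
    boxes: List[Tuple[float, float, float, float]]  # zero or more (x1, y1, x2, y2) boxes
    no_target: bool

def run_gres(model, image: np.ndarray, expression: str) -> GRESOutput:
    """GRES: segment every object the expression refers to (possibly none)."""
    mask, no_target = model.predict(image, expression)  # placeholder call
    return GRESOutput(mask=mask, no_target=no_target)

def run_grec(model, image: np.ndarray, expression: str) -> GRECOutput:
    """GREC: detect every object the expression refers to (possibly none)."""
    boxes, no_target = model.detect(image, expression)  # placeholder call
    return GRECOutput(boxes=boxes, no_target=no_target)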
Expressions
The incorporation of multi-target and no-target expressions, together with GREG, extends the application scope beyond a single object and makes the tasks more practical in real-world scenarios:
(a) Multi-target: selecting multiple objects in a single forward pass.
(b) No-target: determining that no object in an image matches the expression, which enables, e.g., retrieving only the images that actually contain the described object (see the sketch after this list).
(c) GREG: capturing the common semantics of multiple selected objects and generating a concise and natural expression for them at once.
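As an illustration of application (b), the Python sketch below uses a hypothetical no-target-aware model to keep only the images in which the expression actually matches an object; the model object and its detect call are assumed placeholders, not part of any released code.

# A minimal sketch of no-target-based image retrieval (application (b)).
# `model.detect` is a hypothetical placeholder returning (boxes, no_target).
from typing import List
import numpy as np

def retrieve_images(model, images: List[np.ndarray], expression: str) -> List[int]:
    """Return indices of the images in which the expression matches at least one object."""
    kept = []
    for idx, image in enumerate(images):
        boxes, no_target = model.detect(image, expression)  # placeholder call
        if not no_target and len(boxes) > 0:
            kept.append(idx)  # the described object is present in this image
    return kept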
Dataset
Dataset Download
The dataset is available for non-commercial research purposes only. Please use the following links to download it.
People
Henghui Ding
Fudan University
Chang Liu
SUFE
Shuting He
SUFE
Xudong Jiang
NTU
Yu-Gang Jiang
Fudan University
Citation
Please consider citing GREx if it helps your research.
@article{GREx,
title={{GREx}: Generalized Referring Expression Segmentation, Comprehension, and Generation},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang},
journal={International Journal of Computer Vision},
year={2026},
publisher={Springer}
}
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}
Check out our related work on vision-language understanding and more!
@article{MeViSv2,
title={MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation},
author={Ding, Henghui and Liu, Chang and He, Shuting and Ying, Kaining and Jiang, Xudong and Loy, Chen Change and Jiang, Yu-Gang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
publisher={IEEE}
}
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}
@article{MOSEv2,
title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
journal={arXiv preprint arXiv:2508.05630},
year={2025}
}
@inproceedings{MOSE,
title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
booktitle={ICCV},
year={2023}
}