GREx

Generalized Referring Expression Segmentation, Comprehension, and Generation

Introduction

Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for a selected object. Existing datasets and methods commonly support only single-target expressions, i.e., one expression refers to exactly one object, without considering multi-target and no-target expressions. This greatly limits the real-world applications of REx (RES/REC/REG).

This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend classic REx to allow an expression to identify an arbitrary number of objects. We construct the first large-scale GREx dataset, gRefCOCO, which contains multi-target, no-target, and single-target expressions together with their corresponding images and labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments that study the performance gap of existing REx methods on the GREx tasks.

Teaser image

Classic Referring Expression Segmentation (RES), Comprehension (REC), and Generation (REG), collectively denoted as REx, only support expressions that indicate a single target object, e.g., "The kid in red". Compared with REx, the proposed Generalized Referring Expression tasks (GREx), including Generalized RES (GRES), Generalized REC (GREC), and Generalized REG (GREG), extend expressions to the multi-target and no-target cases. For example, GREx supports multi-target expressions that indicate several objects by their commonalities or relationships, e.g., category (2) "All people", attribute (3) "Standing people", counting (4) "Two people on the far left", and compound (5) "Everyone except the kid in white". GRES and GREC further support no-target expressions that do not match any object in the image, e.g., (6) "The kid in blue".

Statistics

Total provided expressions

Multi-target or no-target expressions

Distinct referred objects

High-quality mask annotations

Tasks

In this work, in order to overcome the limitations of classic RES, REC, and REG, we introduce three new GREx benchmarks, called Generalized Referring Expression Segmentation (GRES), Generalized Referring Expression Comprehension (GREC), and Generalized Referring Expression Generation (GREG), which allow an expression to indicate any number of target objects. GRES/GREC take an image and a referring expression as input, the same as classic RES/REC, while GREG takes an image and a group of selected target objects as input and generates an expression referring to them. A minimal sketch of this input/output contract follows the task list below.

GRES

Generalized Referring Expression Segmentation

GREC

Generalized Referring Expression Comprehension

GREG

Generalized Referring Expression Generation
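
The following sketch illustrates this contract for GRES/GREC (the model object, its predict method, and the prediction fields are hypothetical placeholders, not the API of the released baselines): the system maps an image and an expression to a set of predictions, which may be empty for a no-target expression or contain several entries for a multi-target one.

from typing import Any, List

import numpy as np

def grefer(image: np.ndarray, expression: str, model: Any) -> List[np.ndarray]:
    """Sketch of the GRES/GREC interface.

    Input:  an H x W x 3 image and a natural-language referring expression.
    Output: the predicted targets, here binary masks (GRES); for GREC the
            entries would be bounding boxes instead. Unlike classic RES/REC,
            the list may be empty (no-target) or hold several masks (multi-target).
    """
    predictions = model.predict(image, expression)        # hypothetical model API
    return [p.mask for p in predictions if p.is_target]   # keep matched instances only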

Expressions

The incorporation of multi-target and no-target expressions, together with GREG, extends the application scope beyond a single object and makes the tasks more practical in real-world scenarios:


(a) Multi-target: selecting multiple objects in a single forward pass.


(b) No-target: deciding whether the referred object is present at all, e.g., retrieving only images that actually contain the object (see the sketch after this list).


(c) GREG: capturing the common semantics of multiple selected objects and generating a concise, natural expression for them at once.
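
As a minimal sketch of application (b) using the interface from the Tasks section (load_image is a hypothetical placeholder, not a provided utility), an image is kept for retrieval only when the expression matches at least one object, i.e., the prediction is non-empty.

def retrieve(image_paths, expression, model):
    # Keep images in which the expression matches at least one object;
    # an empty prediction (no-target expression) rejects the image.
    kept = []
    for path in image_paths:
        image = load_image(path)                      # hypothetical image loader
        masks = grefer(image, expression, model)      # interface sketched in Tasks
        if masks:                                     # non-empty => object is present
            kept.append(path)
    return kept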

Dataset

Dataset Download

The dataset is available for non-commercial research purposes only. Please use the following links to download it.
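
As a rough illustration only, assuming the annotations come as a JSON list of expression records (the file name and the field names "expression" and "target_ids" are placeholders, not the released schema; please check the downloaded files for the actual format), the three expression types can be told apart by the number of annotated targets.

import json

with open("annotations.json") as f:            # placeholder file name
    refs = json.load(f)

for ref in refs:
    expression = ref["expression"]              # placeholder field names
    targets = ref.get("target_ids", [])
    if not targets:
        kind = "no-target"
    elif len(targets) > 1:
        kind = "multi-target"
    else:
        kind = "single-target"
    print(f"{kind}: {expression!r}")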

Baselines

We provide baseline code and models for GRES and GREC:

People

Henghui Ding

Fudan University

Chang Liu

SUFE

Shuting He

SUFE

Yu-Gang Jiang

Fudan University

Citation

Please consider citing GREx if it helps your research.

@article{GREx,
  title={{GREx}: Generalized Referring Expression Segmentation, Comprehension, and Generation},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang},
  journal={International Journal of Computer Vision},
  year={2026},
  publisher={Springer}
}
@inproceedings{GRES,
  title={{GRES}: Generalized Referring Expression Segmentation},
  author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
  booktitle={CVPR},
  year={2023}
}
@article{VLT,
  title={{VLT}: Vision-language transformer and query generation for referring segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  publisher={IEEE}
}

Check out our related works on vision-language understanding and more!

@article{MeViSv2,
  title={MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Ying, Kaining and Jiang, Xudong and Loy, Chen Change and Jiang, Yu-Gang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  publisher={IEEE}
}
@inproceedings{MeViS,
  title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
  booktitle={ICCV},
  year={2023}
}
@article{MOSEv2,
  title={{MOSEv2}: A More Challenging Dataset for Video Object Segmentation in Complex Scenes},
  author={Ding, Henghui and Ying, Kaining and Liu, Chang and He, Shuting and Jiang, Xudong and Jiang, Yu-Gang and Torr, Philip HS and Bai, Song},
  journal={arXiv preprint arXiv:2508.05630},
  year={2025}
}
@inproceedings{MOSE,
  title={{MOSE}: A New Dataset for Video Object Segmentation in Complex Scenes},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
  booktitle={ICCV},
  year={2023}
}