Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Fudan University, China
ICCV 2025, Honolulu, Hawai'i
βœ‰οΈ Corresponding author

Figure 1. Examples from the proposed benchmark Omnimodal Referring Audio-Visual Segmentation (OmniAVS), illustrating its nature and flexibility. OmniAVS introduces 3 key features: 1) It supports diverse multimodal referring expressions that flexibly combine text, speech, sound, and image for referring audio-visual segmentation; 2) It emphasizes understanding the content of audio rather than merely hearing it; 3) It incorporates complex reasoning and world knowledge in expressions, prompting models to provide explanations for their segmentation decisions. These characteristics make OmniAVS practical for real-world use and well-suited for developing omnimodal models with fine-grained perception.


Abstract

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses an MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.

Dataset: OmniAVS

Dataset Statistics

| Dataset | Venue | Content: Video | Content: Audio | Referring: Text | Referring: Audio | Referring: Image | Reasoning: Avail. | Reasoning: Expl. | #Videos | #Frames | #Objects | #Masks | #Expr. | #Expl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-YouTubeVOS | ECCV'20 | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 3,978 | 116k | 7,451 | 131k | 15,009 | ✗ |
| MeViS | ICCV'23 | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 2,006 | 44k | 8,175 | 443k | 28,570 | ✗ |
| ReVOS | ECCV'24 | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | 1,042 | 116k | 5,535 | 469k | 35,074 | ✗ |
| Ref-AVS | ECCV'24 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | 4,002 | 40k | 6,888 | 78k | 20,261 | ✗ |
| OmniAVS (ours) | ICCV'25 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2,098 | 103k | 5,135 | 206k | 59,458 | 59,458 |
Table 1: Comparison of OmniAVS with existing related segmentation datasets. Notes: Avail. = availability of reasoning; Expl. = explanations; Expr. = expressions.

Table 1 shows a statistical comparison between OmniAVS and related datasets. Refer-YouTubeVOS, MeViS, and ReVOS focus on silent videos without audio and provide only text expressions. While ReVOS further supports reasoning expressions, it lacks explanations. Ref-AVS Bench includes audio-visual videos but supports only single-modality text expressions without reasoning or explanations. Compared to previous datasets, the proposed OmniAVS dataset offers multimodal content (audio-visual videos) and diverse expression modalities (text, speech, sound, image), supports reasoning with explanations, and allows referring to an arbitrary number of target objects.

Visualization

Video 1: Visualization of OmniAVS dataset and task. Note: The video contains audio. Please wear headphones or turn on your speakers for the best experience.

Baseline Method: OISA

To address the challenges of the OmniAVS task, we propose a baseline method called Omnimodal Instructed Segmentation Assistant (OISA). This approach is designed to handle the diverse expression modalities and reasoning requirements of our dataset.

OISA Architecture
Figure 2: Overview of our baseline OISA architecture for the OmniAVS task.

OISA consists of two main components: a Multimodal Large Language Model (MLLM) for understanding and reasoning across different modalities, and a mask head for segmentation and tracking. The MLLM incorporates audio and vision encoders along with a language model, while the mask head comprises a ViT-Adapter, a pixel decoder, and a mask decoder. The architecture processes both the audio-visual content and the omnimodal expressions (text, speech, sound, and images).

A key innovation in OISA is our Audio-Visual Interleaving strategy, which synchronizes audio and video by dividing audio tokens into clips and interleaving them with the vision tokens of the corresponding frames. This keeps audio and visual content properly aligned without requiring additional parameters.

For segmentation, we employ a query propagation mechanism that updates the object representation frame by frame, overcoming the limitations of static representations when tracking objects with rapid or complex motion. The model generates a specialized token representing the target object, which the mask decoder then processes to produce the final segmentation masks across video frames. Together, these designs let OISA handle the unique challenges of the OmniAVS task: multimodal understanding, reasoning with explanations, and segmenting objects referred to through diverse expression modalities.
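The interleaving and propagation steps can be sketched in a few lines. This is a minimal illustration under assumptions, not the released implementation: the tensor shapes, the equal-sized audio clips, and the helper names (`interleave_audio_visual`, `propagate_queries`, the `mask_decoder` callable) are ours for clarity.

```python
import torch

def interleave_audio_visual(vision_tokens, audio_tokens):
    """Audio-Visual Interleaving (sketch): align audio with video frames.

    vision_tokens: list of T tensors, each (N_v, D), one entry per frame.
    audio_tokens:  tensor (N_a, D) covering the whole audio track.

    The audio tokens are split into T roughly equal clips, and the tokens of
    clip t are placed directly after the vision tokens of frame t, so the MLLM
    sees temporally aligned audio-visual content without extra parameters.
    """
    num_frames = len(vision_tokens)
    audio_clips = torch.chunk(audio_tokens, num_frames, dim=0)
    interleaved = []
    for frame_tok, clip_tok in zip(vision_tokens, audio_clips):
        interleaved.extend([frame_tok, clip_tok])
    return torch.cat(interleaved, dim=0)  # token sequence fed to the MLLM


def propagate_queries(mask_decoder, frame_features, object_token):
    """Query propagation (sketch): refresh the object query every frame.

    mask_decoder:   callable (features, query) -> (mask, updated_query)
    frame_features: list of per-frame features from the pixel decoder.
    object_token:   specialized token for the target object emitted by the MLLM.
    """
    masks, query = [], object_token
    for feats in frame_features:
        mask, query = mask_decoder(feats, query)  # query carries object state forward
        masks.append(mask)
    return masks
```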

Experiments

Benchmark Results

We evaluate our OISA model on the OmniAVS dataset and compare it with several state-of-the-art methods. Table 2 presents the comprehensive results across all eight splits of our dataset.

| Method | All | I | II | III | IV | V | VI | VII | VIII | METEOR |
|---|---|---|---|---|---|---|---|---|---|---|
| LMPM | 25.8 | 31.2 | 28.7 | 20.0 | 22.7 | 21.3 | 20.9 | 30.0 | 31.4 | - |
| EEMC | 29.6 | 34.4 | 32.6 | 19.6 | 26.0 | 28.0 | 24.7 | 35.6 | 36.0 | - |
| MUTR | 32.3 | 35.4 | 33.3 | 28.4 | 29.8 | 26.5 | 22.8 | 41.6 | 40.5 | - |
| LISA-7B | 33.6 | 33.3 | 31.2 | 29.2 | 32.7 | 28.6 | 27.3 | 43.4 | 43.1 | 11.6 |
| LISA-13B | 36.1 | 36.4 | 32.1 | 30.4 | 35.7 | 31.6 | 30.2 | 46.7 | 45.7 | 16.5 |
| OISA (ours) | 41.1 | 40.1 | 38.5 | 34.9 | 38.5 | 35.9 | 35.2 | 52.6 | 53.0 | 21.7 |
Table 2: Results on the OmniAVS dataset. Notes: We use \(\mathcal{J}\&\mathcal{F}\) as the default metric; All is the average over the 8 splits. The splits represent different combinations of referring expression modalities: I) Text; II) Speech; III) Text with Sound; IV) Speech with Sound; V) Text with Image; VI) Speech with Image; VII) Text with Sound and Image; VIII) Speech with Sound and Image.

As shown in Table 2, our OISA model significantly outperforms all baseline methods across all splits of the OmniAVS dataset. Compared to the strongest baseline, LISA-13B, our approach achieves a 5.0-point improvement in the overall \(\mathcal{J}\&\mathcal{F}\) metric (41.1 vs. 36.1) and a 5.2-point improvement in METEOR (21.7 vs. 16.5). The gains are consistent across all eight splits, demonstrating the effectiveness of our approach in handling diverse expression modalities and reasoning requirements. Notably, our model performs particularly well on splits VII and VIII, which combine text or speech with both sound and image cues and thus represent the most challenging scenarios.
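For completeness, \(\mathcal{J}\&\mathcal{F}\) is the mean of region similarity \(\mathcal{J}\) (mask IoU) and contour accuracy \(\mathcal{F}\) (boundary F-measure), following the standard video segmentation protocol. The snippet below is a simplified per-frame sketch, not the official evaluation code; in particular, the boundary extraction and the `dilation_radius` tolerance are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary(mask):
    """One-pixel-wide boundary of a binary mask."""
    return np.logical_xor(mask, binary_erosion(mask))

def f_measure(pred, gt, dilation_radius=1):
    """Contour accuracy F: F1 of boundary precision/recall with a small tolerance."""
    pb, gb = boundary(pred), boundary(gt)
    struct = np.ones((2 * dilation_radius + 1,) * 2, dtype=bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """J&F: mean of region similarity and contour accuracy for one frame."""
    return 0.5 * (jaccard(pred, gt) + f_measure(pred, gt))
```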

Qualitative Results

Qualitative Results
Figure 3: Qualitative results of our OISA model on the OmniAVS dataset. The model effectively handles various referring expression modalities and produces accurate segmentation masks with explanations. However, the model performs poorly in noisy environments, which is a direction for future work.

BibTeX

Please consider citing OmniAVS if it helps your research.
@inproceedings{ying2025omniavs,
  title={{T}owards {O}mnimodal {E}xpressions and {R}easoning in {R}eferring {A}udio-{V}isual {S}egmentation},
  author={Kaining Ying and Henghui Ding and Guangquan Jie and Yu-Gang Jiang},
  year={2025},
  booktitle={ICCV}
}