Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Fudan University, China
ICCV 2025, Honolulu, Hawai'i
βœ‰οΈ Corresponding author

Figure 1. Examples from the proposed benchmark Omnimodal Referring Audio-Visual Segmentation (OmniAVS), illustrating its nature and flexibility. OmniAVS introduces 3 key features: 1) It supports diverse multimodal referring expressions that flexibly combine text, speech, sound, and image for referring audio-visual segmentation; 2) It emphasizes understanding the content of audio rather than merely hearing it; 3) It incorporates complex reasoning and world knowledge in expressions, prompting models to provide explanations for their segmentation decisions. These characteristics make OmniAVS practical for real-world use and well-suited for developing omnimodal models with fine-grained perception.


Abstract

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses an MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.

Dataset: OmniAVS

Dataset Statistics

| Dataset | Venue | Content: Video | Content: Audio | Referring: Text | Referring: Audio | Referring: Image | Reasoning: Avail. | Reasoning: Expl. | #Videos | #Frames | #Objects | #Masks | #Expr. | #Expl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-YouTubeVOS | ECCV'20 | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 3,978 | 116k | 7,451 | 131k | 15,009 | ✗ |
| MeViS | ICCV'23 | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 2,006 | 44k | 8,175 | 443k | 28,570 | ✗ |
| ReVOS | ECCV'24 | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | 1,042 | 116k | 5,535 | 469k | 35,074 | ✗ |
| Ref-AVS | ECCV'24 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | 4,002 | 40k | 6,888 | 78k | 20,261 | ✗ |
| OmniAVS (ours) | ICCV'25 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 2,098 | 103k | 5,135 | 206k | 59,458 | 59,458 |
Table 1: Comparison of OmniAVS with existing related segmentation datasets. Notes: Avail. = availability of reasoning; Expl. = explanations; Expr. = expressions.

Table 1 shows a statistical comparison between OmniAVS and related datasets. Refer-YouTubeVOS, MeViS, and ReVOS focus on silent videos without audio and provide only text expressions. While ReVOS further supports reasoning expressions, it lacks explanations. Ref-AVS Bench includes audio-visual videos but supports only single-modality text expressions without reasoning or explanations. Compared to previous datasets, the proposed OmniAVS dataset offers multimodal content (audio-visual videos) and diverse expression modalities (text, speech, sound, image), supports reasoning with explanations, and allows referring to an arbitrary number of target objects.

Visualization

Video 1: Visualization of OmniAVS dataset and task. Note: The video contains audio. Please wear headphones or turn on your speakers for the best experience.

Baseline Method: OISA

To address the challenges of the OmniAVS task, we propose a baseline method called Omnimodal Instructed Segmentation Assistant (OISA). This approach is designed to handle the diverse expression modalities and reasoning requirements of our dataset.

OISA Architecture
Figure 2: Overview of our baseline OISA architecture for the OmniAVS task.

OISA consists of two main components: a Multimodal Large Language Model (MLLM) for understanding and reasoning across different modalities, and a mask head for segmentation and tracking. The MLLM incorporates audio and vision encoders along with a language model, while the mask head comprises a ViT-Adapter, a pixel decoder, and a mask decoder. The architecture processes both the audio-visual content and the omnimodal expressions (text, speech, sound, and images).

A key innovation in OISA is our Audio-Visual Interleaving strategy, which synchronizes audio and video by dividing audio tokens into clips and interleaving them with the vision tokens of the corresponding frames. This keeps audio and visual content properly aligned without requiring additional parameters.

For segmentation, we employ a query propagation mechanism that updates the object representation frame by frame, overcoming the limitations of static representations when tracking objects with rapid or complex motion. The model generates a specialized token representing the target object, which the mask decoder then processes to produce the final segmentation masks across video frames. Together, these designs let OISA handle the unique challenges of the OmniAVS task: multimodal understanding, reasoning with explanations, and segmenting objects referred to through diverse expression modalities.
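The interleaving and propagation steps can be sketched in a few lines. This is a minimal illustration under assumptions, not the released implementation: the tensor shapes, the equal-sized audio clips, and the helper names (`interleave_audio_visual`, `propagate_queries`, the `mask_decoder` callable) are ours for clarity.

```python
import torch

def interleave_audio_visual(vision_tokens, audio_tokens):
    """Audio-Visual Interleaving (sketch): align audio with video frames.

    vision_tokens: list of T tensors, each (N_v, D), one entry per frame.
    audio_tokens:  tensor (N_a, D) covering the whole audio track.

    The audio tokens are split into T roughly equal clips, and the tokens of
    clip t are placed directly after the vision tokens of frame t, so the MLLM
    sees temporally aligned audio-visual content without extra parameters.
    """
    num_frames = len(vision_tokens)
    audio_clips = torch.chunk(audio_tokens, num_frames, dim=0)
    interleaved = []
    for frame_tok, clip_tok in zip(vision_tokens, audio_clips):
        interleaved.extend([frame_tok, clip_tok])
    return torch.cat(interleaved, dim=0)  # token sequence fed to the MLLM


def propagate_queries(mask_decoder, frame_features, object_token):
    """Query propagation (sketch): refresh the object query every frame.

    mask_decoder:   callable (features, query) -> (mask, updated_query)
    frame_features: list of per-frame features from the pixel decoder.
    object_token:   specialized token for the target object emitted by the MLLM.
    """
    masks, query = [], object_token
    for feats in frame_features:
        mask, query = mask_decoder(feats, query)  # query carries object state forward
        masks.append(mask)
    return masks
```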

Experiments

Benchmark Results

We evaluate our OISA model on the OmniAVS dataset and compare it with several state-of-the-art methods. Table 2 presents the comprehensive results across all eight splits of our dataset.

| Method | All | I | II | III | IV | V | VI | VII | VIII | METEOR |
|---|---|---|---|---|---|---|---|---|---|---|
| LMPM | 25.8 | 31.2 | 28.7 | 20.0 | 22.7 | 21.3 | 20.9 | 30.0 | 31.4 | - |
| EEMC | 29.6 | 34.4 | 32.6 | 19.6 | 26.0 | 28.0 | 24.7 | 35.6 | 36.0 | - |
| MUTR | 32.3 | 35.4 | 33.3 | 28.4 | 29.8 | 26.5 | 22.8 | 41.6 | 40.5 | - |
| LISA-7B | 33.6 | 33.3 | 31.2 | 29.2 | 32.7 | 28.6 | 27.3 | 43.4 | 43.1 | 11.6 |
| LISA-13B | 36.1 | 36.4 | 32.1 | 30.4 | 35.7 | 31.6 | 30.2 | 46.7 | 45.7 | 16.5 |
| OISA (ours) | 41.1 | 40.1 | 38.5 | 34.9 | 38.5 | 35.9 | 35.2 | 52.6 | 53.0 | 21.7 |
Table 2: Results on the OmniAVS dataset. Notes: We use \(\mathcal{J}\&\mathcal{F}\) as the default metric; All is the average over the 8 splits. The splits represent different combinations of referring expression modalities: I) Text; II) Speech; III) Text with Sound; IV) Speech with Sound; V) Text with Image; VI) Speech with Image; VII) Text with Sound and Image; VIII) Speech with Sound and Image.

As shown in Table 2, our OISA model significantly outperforms all baseline methods across all splits of the OmniAVS dataset. Compared to the strongest baseline, LISA-13B, our approach achieves a 5.0-point improvement in the overall \(\mathcal{J}\&\mathcal{F}\) metric (41.1 vs. 36.1) and a 5.2-point improvement in METEOR (21.7 vs. 16.5). The gains are consistent across all eight splits, demonstrating the effectiveness of our approach in handling diverse expression modalities and reasoning requirements. Notably, our model performs particularly well on splits VII and VIII, which combine text or speech with both sound and image cues and thus represent the most challenging scenarios.
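For completeness, \(\mathcal{J}\&\mathcal{F}\) is the mean of region similarity \(\mathcal{J}\) (mask IoU) and contour accuracy \(\mathcal{F}\) (boundary F-measure), following the standard video segmentation protocol. The snippet below is a simplified per-frame sketch, not the official evaluation code; in particular, the boundary extraction and the `dilation_radius` tolerance are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary(mask):
    """One-pixel-wide boundary of a binary mask."""
    return np.logical_xor(mask, binary_erosion(mask))

def f_measure(pred, gt, dilation_radius=1):
    """Contour accuracy F: F1 of boundary precision/recall with a small tolerance."""
    pb, gb = boundary(pred), boundary(gt)
    struct = np.ones((2 * dilation_radius + 1,) * 2, dtype=bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """J&F: mean of region similarity and contour accuracy for one frame."""
    return 0.5 * (jaccard(pred, gt) + f_measure(pred, gt))
```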

Qualitative Results

Qualitative Results
Figure 3: Qualitative results of our OISA model on the OmniAVS dataset. The model effectively handles various referring expression modalities and produces accurate segmentation masks with explanations. However, the model performs poorly in noisy environments, which is a direction for future work.

BibTeX

Please consider citing OmniAVS if it helps your research.
@inproceedings{ying2025omniavs,
  title={{T}owards {O}mnimodal {E}xpressions and {R}easoning in {R}eferring {A}udio-{V}isual {S}egmentation},
  author={Kaining Ying and Henghui Ding and Guangquan Jie and Yu-Gang Jiang},
  year={2025},
  booktitle={ICCV}
}