To address the challenges of the OmniAVS task, we propose a baseline method called Omnimodal Interactive Segmentation Architecture (OISA). This approach is designed to handle the diverse expression modalities and reasoning requirements of our dataset.
Figure 2: Overview of our baseline OISA architecture for the OmniAVS task.
OISA consists of two main components: a Multimodal Large Language Model (MLLM) for understanding and reasoning across different modalities, and a mask head for segmentation and tracking. The MLLM incorporates audio and vision encoders together with a language model, while the mask head comprises a ViT-Adapter, a pixel decoder, and a mask decoder.
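To make the division of labor concrete, the following is a minimal structural sketch of this two-component design in PyTorch. All module names and feature dimensions here are illustrative placeholders standing in for the actual encoders, language model, ViT-Adapter, pixel decoder, and mask decoder, not the exact implementation of OISA.

```python
import torch
import torch.nn as nn


class OISASketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # MLLM side: modality encoders feeding a language model (placeholders).
        self.audio_encoder = nn.Linear(128, dim)   # e.g. audio features -> tokens
        self.vision_encoder = nn.Linear(768, dim)  # e.g. patch features -> tokens
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Mask head side: stand-ins for the pixel decoder and mask decoder
        # that operate on dense per-frame features.
        self.pixel_decoder = nn.Conv2d(dim, dim, kernel_size=1)
        self.mask_decoder = nn.Linear(dim, dim)

    def forward(self, audio_tokens, vision_tokens, pixel_features):
        # Fuse multimodal tokens in the language model.
        tokens = torch.cat(
            [self.audio_encoder(audio_tokens), self.vision_encoder(vision_tokens)],
            dim=1,
        )
        hidden = self.language_model(tokens)
        # Use one hidden state as an object query for the mask head.
        query = self.mask_decoder(hidden[:, -1])            # (B, dim)
        feats = self.pixel_decoder(pixel_features)          # (B, dim, H, W)
        masks = torch.einsum("bc,bchw->bhw", query, feats)  # mask logits
        return masks
```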
Our architecture processes both audio-visual content and omnimodal expressions (text, speech, sound, and images). A key innovation in OISA is our Audio-Visual Interleaving strategy, which synchronizes audio and video frames by dividing audio tokens into clips and interleaving them with vision tokens. This ensures temporal alignment between the audio and visual streams without introducing additional parameters.
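The sketch below illustrates the interleaving idea, assuming per-frame token tensors and an audio stream that divides evenly into one clip per frame; the exact clip length and token shapes in OISA may differ. It only demonstrates the temporal ordering, i.e. [audio clip 1, frame 1, audio clip 2, frame 2, ...].

```python
import torch


def interleave_audio_visual(audio_tokens, vision_tokens, num_frames):
    """audio_tokens: (B, Ta, C); vision_tokens: (B, T*Tv, C) for T frames."""
    # Split the audio stream into one clip per video frame.
    audio_clips = audio_tokens.chunk(num_frames, dim=1)
    frame_tokens = vision_tokens.chunk(num_frames, dim=1)
    interleaved = []
    for a_clip, v_frame in zip(audio_clips, frame_tokens):
        interleaved.append(a_clip)    # audio tokens for this time window
        interleaved.append(v_frame)   # vision tokens for the matching frame
    return torch.cat(interleaved, dim=1)  # (B, Ta + T*Tv, C)


# Usage: 8 frames, 4 audio tokens per clip, 16 vision tokens per frame.
audio = torch.randn(2, 8 * 4, 256)
video = torch.randn(2, 8 * 16, 256)
tokens = interleave_audio_visual(audio, video, num_frames=8)
print(tokens.shape)  # torch.Size([2, 160, 256])
```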
For segmentation, we employ a query propagation mechanism that updates the object representation frame by frame. This overcomes the limitations of a static object representation when tracking objects with rapid or complex motion. The model generates a specialized token representing the target object, which the mask decoder then processes to produce segmentation masks across the video frames.
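A minimal sketch of such frame-by-frame query propagation is shown below, again in PyTorch. The per-frame update rule (a residual cross-attention step between the object query and each frame's pixel features) is an illustrative assumption, not necessarily the formulation used in OISA's mask decoder.

```python
import torch
import torch.nn as nn


class QueryPropagation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, object_query, frame_features):
        """object_query: (B, 1, C) object token from the MLLM.
        frame_features: list of (B, C, H, W) per-frame pixel features."""
        masks = []
        query = object_query
        for feats in frame_features:
            mem = feats.flatten(2).transpose(1, 2)  # (B, H*W, C)
            # Update the query with the current frame before predicting,
            # so it follows the object as it moves across frames.
            update, _ = self.attn(query, mem, mem)
            query = query + update
            masks.append(torch.einsum("bqc,bchw->bqhw", query, feats))
        return masks  # per-frame mask logits


# Usage: one object query propagated over 4 frames.
prop = QueryPropagation(dim=256)
q = torch.randn(2, 1, 256)
frames = [torch.randn(2, 256, 32, 32) for _ in range(4)]
out = prop(q, frames)
print(out[0].shape)  # torch.Size([2, 1, 32, 32])
```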
OISA is designed to handle the unique challenges of the OmniAVS task, including multimodal understanding, reasoning with explanations, and segmenting objects referred to through diverse expression modalities.