A vision-language agent that explores 360° scenes to find & segment what you describe.
Fudan University
✉ Corresponding authors
Active Panoramic Referring Segmentation (APRS). An agent searches for and segments a target (e.g., the “floor cabinet opposite the bed”) within a continuous 360° environment, iteratively adjusting its camera pose (θ, φ) to reason about cross-view spatial relations before producing a mask.
Existing referring image segmentation (RIS) models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360° environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS), where an agent adjusts its viewing direction (Δθ, Δφ) to explore the 360° scene, seeking the object specified by a user instruction for segmentation.
To tackle this challenge, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model with EgoSphere — an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360° representation, EgoSphere enables the agent to plan efficient, non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask.
We curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize exploration efficiency. Extensive experiments on our new APRS benchmark show that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
A novel task requiring agents to actively explore 360° environments to seek and segment targets from language. The benchmark spans 4,971 diverse indoor/outdoor scenes with four types of spatial referring expressions and comprehensive evaluation protocols.
A memory-augmented vision-language agent that builds an explicit, geometrically consistent 360° spatial memory (EgoSphere) to plan efficient, non-redundant searches, turning exploration into a visual “fill-in-the-blank” task on the panorama.
An expert-annotated search trajectory dataset with memory timelines for SFT, followed by GRPO-based Reinforcement Learning with efficiency and terminal rewards to explicitly optimize exploration efficiency.
PanoSeeker outperforms strong static, heuristic, and VLM-agent baselines (including GPT-5.2 and Gemini-3-Flash) across success rate, search efficiency, and segmentation quality on the APRS benchmark.
Large-scale panoramic scenes with spatially grounded instructions — no costly 3D reconstruction required. The agent perceives a 120°×90° field of view and turns in 30° increments.
Relative to the agent's orientation. “Turn right to find the sofa.”
Distinctive visual attribute. “The yellow floor lamp in the room.”
Relative to anchor objects. “The chair opposite the sofa.”
Multi-step compositional reasoning. “Turn around and find the chair next to the table.”
A VLM reasons over the local view, the instruction, and an explicit 360° memory — then acts.
PanoSeeker framework. The VLM fuses the current view Vt, instruction I, and EgoSphere memory Mt to predict a viewpoint move (Δθt, Δφt) or [STOP]. On stop, an Active Alignment & Segmentation module realigns the target and produces the mask. Training combines SFT on expert trajectories with GRPO using efficiency (Reff) and terminal (Rterm) rewards.
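The control loop described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_episode`, the `policy` callable, the `max_steps` budget, and the yaw-wrap and pitch-clamp conventions are all hypothetical, and the memory is reduced to a list of visited poses standing in for the EgoSphere canvas.

```python
from typing import Callable, List, Optional, Tuple

Pose = Tuple[float, float]            # (theta, phi) in degrees
Move = Optional[Tuple[float, float]]  # (d_theta, d_phi), or None for [STOP]


def run_episode(policy: Callable[[Pose, str, List[Pose]], Move],
                instruction: str,
                start: Pose = (0.0, 0.0),
                max_steps: int = 12) -> Tuple[Pose, List[Pose], bool]:
    """Roll out a policy until it emits [STOP] or the step budget runs out.

    Returns (final pose, visited poses, whether the policy stopped itself).
    """
    pose = start
    memory: List[Pose] = [start]  # hypothetical stand-in for the memory M_t
    for _ in range(max_steps):
        move = policy(pose, instruction, memory)
        if move is None:  # [STOP]: hand over to alignment and segmentation
            return pose, memory, True
        # Assumed conventions: yaw wraps to [-180, 180), pitch clamps to [-90, 90].
        theta = (pose[0] + move[0] + 180.0) % 360.0 - 180.0
        phi = max(-90.0, min(90.0, pose[1] + move[1]))
        pose = (theta, phi)
        memory.append(pose)
    return pose, memory, False


def turn_right_until_90(pose: Pose, instruction: str, memory: List[Pose]) -> Move:
    """Scripted toy policy: turn right in 30-degree steps, stop at yaw 90."""
    return None if pose[0] >= 90.0 else (30.0, 0.0)


final_pose, visited, stopped = run_episode(turn_right_until_90, "find the sofa")
```

In a real system the scripted policy would be replaced by the VLM call, which receives the rendered view and memory image rather than raw poses.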
Sequential local views are inverse-gnomonic-projected onto a unified ERP canvas, annotated with a crosshair, lat-long grid, and the visited trajectory — giving the agent a persistent global map and preventing dead loops.
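One way to picture the memory update is to ask, for every cell of the equirectangular (ERP) canvas, whether its direction falls inside the current 120°×90° view. The sketch below does exactly that with a gnomonic projection onto the camera's image plane. It tracks only which cells have been seen, not pixel colours, and the grid size, axis conventions, and the names `view_footprint` and `coverage` are assumptions made for illustration.

```python
import math

H_FOV_DEG, V_FOV_DEG = 120.0, 90.0  # field of view stated on this page


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def view_footprint(theta_deg, phi_deg, width=36, height=18):
    """Return the set of (row, col) ERP cells visible from pose (theta, phi).

    Each cell centre is turned into a unit direction, expressed in the
    camera frame, and gnomonically projected onto the image plane z = 1.
    """
    th, ph = math.radians(theta_deg), math.radians(phi_deg)
    # Camera basis in world coordinates (x right, y up, z forward at theta = 0).
    fwd = (math.cos(ph) * math.sin(th), math.sin(ph), math.cos(ph) * math.cos(th))
    right = (math.cos(th), 0.0, -math.sin(th))
    up = (-math.sin(ph) * math.sin(th), math.cos(ph), -math.sin(ph) * math.cos(th))
    half_w = math.tan(math.radians(H_FOV_DEG / 2))
    half_h = math.tan(math.radians(V_FOV_DEG / 2))

    seen = set()
    for row in range(height):
        lat = math.radians(90.0 - (row + 0.5) * 180.0 / height)
        for col in range(width):
            lon = math.radians(-180.0 + (col + 0.5) * 360.0 / width)
            d = (math.cos(lat) * math.sin(lon), math.sin(lat),
                 math.cos(lat) * math.cos(lon))
            z = _dot(d, fwd)
            if z <= 1e-9:
                continue  # direction lies behind the camera
            x, y = _dot(d, right) / z, _dot(d, up) / z
            if abs(x) <= half_w and abs(y) <= half_h:
                seen.add((row, col))
    return seen


def coverage(poses, width=36, height=18):
    """Fraction of ERP cells (not area-weighted) seen over a list of poses."""
    memory = set()
    for theta, phi in poses:
        memory |= view_footprint(theta, phi, width, height)
    return len(memory) / (width * height)
```

Because the test runs per canvas cell, the footprint comes out curved on the ERP grid, and revisiting a pose adds nothing to the coverage, which is the property a planner can exploit to avoid redundant moves.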
Qwen3-VL-8B is fine-tuned on expert trajectories, then optimized with GRPO. Efficiency rewards push the agent toward the shortest geodesic path; terminal rewards grant a bonus on successful, accurate stops.
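To make "shortest geodesic path" concrete, the sketch below computes the great-circle angle between two viewing directions and uses it in a per-step progress reward. The geodesic formula is standard spherical trigonometry; the reward shape in `progress_reward` is purely hypothetical and is not the Reff used to train PanoSeeker, whose exact form is not given on this page.

```python
import math

STEP_DEG = 30.0  # turn increment stated on this page


def geodesic_deg(theta1, phi1, theta2, phi2):
    """Great-circle angle in degrees between two (yaw, pitch) directions."""
    t1, p1, t2, p2 = map(math.radians, (theta1, phi1, theta2, phi2))
    cos_d = (math.sin(p1) * math.sin(p2)
             + math.cos(p1) * math.cos(p2) * math.cos(t2 - t1))
    # Clamp against rounding error before taking acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))


def progress_reward(prev_pose, new_pose, target_pose):
    """Hypothetical reward: geodesic progress toward the target, in steps.

    Positive when the move reduces the angular distance to the target,
    negative when it increases it.
    """
    before = geodesic_deg(*prev_pose, *target_pose)
    after = geodesic_deg(*new_pose, *target_pose)
    return (before - after) / STEP_DEG


toward = progress_reward((0.0, 0.0), (30.0, 0.0), (90.0, 0.0))   # about +1.0
away = progress_reward((0.0, 0.0), (-30.0, 0.0), (90.0, 0.0))    # about -1.0
```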
On [STOP], the VLM grounds a bounding box; a SAM-3-based module performs concurrent center-seeking, tracking, and segmentation to produce a stable, accurate mask.
Pick a real scene and drag inside the panorama to look around, just like PanoSeeker exploring. Each view is back-projected onto the EgoSphere memory map in real time — watch the 360° canvas paint itself in with the true curved field-of-view footprint. Center the target to “find & segment” it, or hit Auto-Search to let the agent do it.
The search process is illustrative — the active alignment and segmentation stages are omitted.
PanoSeeker leads on success, efficiency, and segmentation quality — with the fewest active steps among agents.
| Method | Memory | SR (%) ↑ | AS ↓ | SPL ↑ | mIoU (%) ↑ |
|---|---|---|---|---|---|
| Static Methods (direct panorama input) | | | | | |
| LISA CVPR'24 | — | 64.1 | 1.0* | — | 44.5 |
| VisionReasoner ICLR'26 | — | 66.2 | 1.0* | — | 47.7 |
| Heuristic Methods (pre-defined scanning) | | | | | |
| VisionReasoner ICLR'26 | — | 49.5 | 4.9 | 0.26 | 39.9 |
| VLM-based Agents (active exploration) | | | | | |
| Qwen3-VL-30B-Thinking | Text Log | 62.9 | 8.0 | 0.29 | 51.0 |
| Gemini-3-Flash | Text Log | 64.9 | 6.8 | 0.31 | 51.4 |
| GPT-5.2 | Text Log | 69.1 | 6.2 | 0.36 | 53.2 |
| PanoSeeker (Ours) | EgoSphere | 75.4 | 4.8 | 0.57 | 55.8 |
SR: Success Rate · AS: Average Steps · SPL: Success weighted by Path Length · mIoU: mean IoU. *Static methods process the panorama in a single pass (AS=1), so SPL is not applicable. PanoSeeker improves SPL by ~58% over GPT-5.2 (0.57 vs. 0.36) using only 4.8 steps.
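For readers unfamiliar with SPL, the sketch below uses the definition common in embodied navigation: each successful episode contributes the ratio of the shortest path length to the path actually taken, and failures contribute zero. Whether the benchmark measures path length in steps or in degrees is not stated on this page, so the units here are an assumption, as are the function names. The second helper reproduces the relative improvement quoted above.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_len, actual_len) tuples.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, actual in episodes:
        if not success:
            continue  # failed episodes contribute 0
        longest = max(actual, shortest)
        # A target already in view (zero-length path) counts as fully efficient.
        total += 1.0 if longest == 0 else shortest / longest
    return total / len(episodes)


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, as a fraction."""
    return (ours - baseline) / baseline


toy_spl = spl([(True, 2, 4), (True, 3, 3), (False, 1, 5)])  # (0.5 + 1.0 + 0) / 3 = 0.5
gain = relative_gain(0.57, 0.36)                            # about 0.583, i.e. ~58%
```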
Qualitative results. Red dots mark initial viewpoints; blue boxes highlight the final targets found by PanoSeeker. The EgoSphere is built step-by-step as the agent explores the 360° scene.
@inproceedings{APRS,
title = {Seek to Segment: Active Perception for Panoramic Referring Segmentation},
author = {Tang, Song and Hu, Shuming and Shuai, Xincheng and Ding, Henghui and Jiang, Yu-Gang},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}