ECCV 2026 · Fudan University · Embodied AI · Active Perception

Seek to Segment: Active Perception for Panoramic Referring Segmentation

A vision-language agent that explores 360° scenes to find & segment what you describe.

Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang

Fudan University

Corresponding authors

[Teaser figure: assets/task.png, inserted by running scripts/build_figures.sh]

Active Panoramic Referring Segmentation (APRS). An agent searches for and segments a target (e.g., the “floor cabinet opposite the bed”) within a continuous 360° environment, iteratively adjusting its camera pose (θ, φ) to reason about cross-view spatial relations before producing a mask.

TL;DR: We turn referring segmentation from passive (one fixed image) into active (explore a full 360° sphere). Our agent PanoSeeker uses an explicit spatial memory, EgoSphere, to plan efficient, non-redundant searches, reaching 75.4% Success Rate and 0.57 SPL while outperforming GPT-5.2 in fewer steps (4.8 vs. 6.2 on average).

Overview

From passive observation to active 360° perception

Existing referring image segmentation (RIS) models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360° environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS), where an agent adjusts its viewing direction (Δθ, Δφ) to explore the 360° scene, seeking the object specified by a user instruction for segmentation.

To tackle this challenge, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model with EgoSphere — an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360° representation, EgoSphere enables the agent to plan efficient, non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask.

We curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize exploration efficiency. Extensive experiments on our new APRS benchmark show that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.

4,971 unique 360° scenes
7,420 annotated samples
75.4% Success Rate
0.57 SPL (efficiency)

Contributions

What's new

🎯

The APRS Task & Benchmark

A novel task requiring agents to actively explore 360° environments to seek and segment targets from language. The benchmark spans 4,971 diverse indoor/outdoor scenes with four types of spatial referring expressions and comprehensive evaluation protocols.

🧠

PanoSeeker + EgoSphere

A memory-augmented vision-language agent that builds an explicit, geometrically-consistent 360° spatial memory (EgoSphere) to plan efficient, non-redundant searches — turning exploration into a visual “fill-in-the-blank” task on the panorama.

📈

Trajectory Data + RL Training

An expert-annotated search trajectory dataset with memory timelines for SFT, followed by GRPO-based Reinforcement Learning with efficiency and terminal rewards to explicitly optimize exploration efficiency.

🏆

State-of-the-Art Results

PanoSeeker outperforms strong static, heuristic, and VLM-agent baselines (including GPT-5.2 and Gemini-3-Flash) across success rate, search efficiency, and segmentation quality on the APRS benchmark.

Benchmark

The APRS dataset

Large-scale panoramic scenes with spatially grounded instructions — no costly 3D reconstruction required. The agent perceives a 120°×90° field of view and turns in 30° increments.
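The viewing setup above can be sketched as a small pose-update helper. The 120°×90° field of view and the 30° turn increment come from the text; the `Pose` class, the yaw wrap to [-180, 180), and the pitch clamp to [-90, 90] are illustrative assumptions, not details confirmed by the benchmark.

```python
from dataclasses import dataclass

FOV_H_DEG = 120.0  # horizontal field of view (from the text)
FOV_V_DEG = 90.0   # vertical field of view (from the text)
STEP_DEG = 30.0    # turn increment (from the text)


@dataclass(frozen=True)
class Pose:
    """Hypothetical camera pose: yaw theta and pitch phi, in degrees."""
    theta: float
    phi: float


def wrap_yaw(theta: float) -> float:
    """Wrap a yaw angle into [-180, 180). Assumed convention."""
    return (theta + 180.0) % 360.0 - 180.0


def turn(pose: Pose, yaw_steps: int, pitch_steps: int) -> Pose:
    """Apply a move expressed as whole 30-degree increments."""
    theta = wrap_yaw(pose.theta + yaw_steps * STEP_DEG)
    # Assumed: pitch saturates at the poles instead of flipping over.
    phi = max(-90.0, min(90.0, pose.phi + pitch_steps * STEP_DEG))
    return Pose(theta, phi)
```

With this convention, turning right by one step from a yaw of 170° lands at -160°, so the yaw axis stays continuous across the seam of the panorama.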

Four types of spatial referring expressions

EGO

Egocentric

Relative to the agent's orientation. “Turn right to find the sofa.”

UNIQ

Unique-Attribute

Distinctive visual attribute. “The yellow floor lamp in the room.”

ALLO

Allocentric

Relative to anchor objects. “The chair opposite the sofa.”

MULTIHOP

Multi-hop

Multi-step compositional reasoning. “Turn around and find the chair next to the table.”

Method

How PanoSeeker works

A VLM reasons over the local view, the instruction, and an explicit 360° memory — then acts.

[Method figure: assets/method.png, inserted by running scripts/build_figures.sh]

PanoSeeker framework. The VLM fuses the current view V_t, instruction I, and EgoSphere memory M_t to predict a viewpoint move (Δθ_t, Δφ_t) or [STOP]. On stop, an Active Alignment & Segmentation module realigns the target and produces the mask. Training combines SFT on expert trajectories with GRPO using efficiency (R_eff) and terminal (R_term) rewards.
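The observe, remember, act cycle described in the caption can be sketched as a plain loop. Everything here is a hypothetical stand-in: `policy`, `observe`, and `update_memory` are caller-supplied callables (the real system uses a VLM and EgoSphere), the instruction is assumed to be captured inside `policy`, and `max_steps` is an arbitrary budget.

```python
from typing import Any, Callable, List, Tuple, Union

Move = Tuple[float, float]   # (d_theta, d_phi) in degrees
Action = Union[Move, str]    # a move, or the string "STOP"


def run_episode(
    policy: Callable[[Any, Any, List[Move]], Action],
    observe: Callable[[float, float], Any],
    update_memory: Callable[[Any, Any, float, float], Any],
    start: Move = (0.0, 0.0),
    max_steps: int = 20,      # assumed step budget
) -> Tuple[List[Move], bool]:
    """Run one search episode; return (visited poses, stopped_by_policy)."""
    theta, phi = start
    memory = None
    trajectory = [(theta, phi)]
    for _ in range(max_steps):
        view = observe(theta, phi)                      # current local view
        memory = update_memory(memory, view, theta, phi)
        action = policy(view, memory, trajectory)
        if action == "STOP":
            return trajectory, True                     # hand off to align & segment
        d_theta, d_phi = action
        theta = (theta + d_theta + 180.0) % 360.0 - 180.0  # wrap yaw
        phi = max(-90.0, min(90.0, phi + d_phi))           # clamp pitch
        trajectory.append((theta, phi))
    return trajectory, False                            # budget exhausted
```

Returning the trajectory alongside the stop flag keeps the loop usable both for evaluation (path length, step count) and for logging memory timelines.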

1

EgoSphere Memory

Sequential local views are inverse-gnomonic-projected onto a unified ERP canvas, annotated with a crosshair, lat-long grid, and the visited trajectory — giving the agent a persistent global map and preventing dead loops.
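As a rough illustration of the projection step, the sketch below lifts every pixel of a perspective view to a ray on the unit sphere (the inverse gnomonic mapping), rotates it by the camera pose, and paints it onto an equirectangular (ERP) canvas. The function name, the axis convention (x right, y up, z forward), the pitch-then-yaw rotation order, and the nearest-pixel splatting are assumptions made for this example; the crosshair, grid, and trajectory overlays are not drawn.

```python
import numpy as np


def paint_view_on_erp(canvas, view, theta_deg, phi_deg,
                      fov_h_deg=120.0, fov_v_deg=90.0):
    """Splat a perspective view onto an ERP canvas in place (hypothetical helper)."""
    h, w = view.shape[:2]
    hc, wc = canvas.shape[:2]

    # Tangent-plane coordinates of each pixel centre (x right, y up).
    u = (np.arange(w) + 0.5) / w * 2.0 - 1.0
    v = (np.arange(h) + 0.5) / h * 2.0 - 1.0
    x, y = np.meshgrid(u * np.tan(np.radians(fov_h_deg) / 2.0),
                       -v * np.tan(np.radians(fov_v_deg) / 2.0))

    # Inverse gnomonic step: lift plane points to unit rays on the sphere.
    norm = np.sqrt(x ** 2 + y ** 2 + 1.0)
    x, y, z = x / norm, y / norm, 1.0 / norm

    # Rotate rays from the camera frame to the world frame: pitch, then yaw.
    t, p = np.radians(theta_deg), np.radians(phi_deg)
    y_w = y * np.cos(p) + z * np.sin(p)
    z_p = -y * np.sin(p) + z * np.cos(p)
    x_w = x * np.cos(t) + z_p * np.sin(t)
    z_w = -x * np.sin(t) + z_p * np.cos(t)

    # Longitude/latitude of each ray, then ERP pixel indices.
    lon = np.arctan2(x_w, z_w)
    lat = np.arcsin(np.clip(y_w, -1.0, 1.0))
    col = ((lon / (2.0 * np.pi) + 0.5) * wc).astype(int) % wc
    row = np.clip(((0.5 - lat / np.pi) * hc).astype(int), 0, hc - 1)

    canvas[row, col] = view
    return canvas
```

Calling this once per step on the same canvas accumulates observations into a single 360° map. Forward splatting can leave small holes when the canvas is sampled more densely than the view; a backward warp (looping over canvas pixels and sampling the view) is a common alternative.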

2

Active Search (SFT + RL)

Qwen3-VL-8B is fine-tuned on expert trajectories, then optimized with GRPO. Efficiency rewards push the agent toward the shortest geodesic path; terminal rewards grant a bonus on successful, accurate stops.
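One way to picture the two reward terms is sketched below. The great-circle distance and the group-normalized advantage used by GRPO are standard; the specific reward shapes (ratio of shortest geodesic to travelled path, a fixed `bonus` on success) are assumptions for illustration, since the page does not give the exact formulas.

```python
import math
from statistics import mean, pstdev


def geodesic_deg(a, b):
    """Great-circle angle in degrees between two (theta, phi) view directions."""
    t1, p1 = math.radians(a[0]), math.radians(a[1])
    t2, p2 = math.radians(b[0]), math.radians(b[1])
    c = (math.sin(p1) * math.sin(p2)
         + math.cos(p1) * math.cos(p2) * math.cos(t1 - t2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


def episode_reward(trajectory, target, success, bonus=1.0):
    """Assumed reward: efficiency ratio in [0, 1] plus a terminal bonus."""
    shortest = geodesic_deg(trajectory[0], target)
    travelled = sum(geodesic_deg(a, b)
                    for a, b in zip(trajectory, trajectory[1:]))
    denom = max(travelled, shortest)
    r_eff = shortest / denom if denom > 0 else 1.0
    r_term = bonus if success else 0.0
    return r_eff + r_term


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under these assumptions, a direct path to the target scores an efficiency term of 1, while a rollout that backtracks scores less and receives a negative advantage relative to its group.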

3

Align & Segment

On [STOP], the VLM grounds a bounding box; a SAM-3-based module performs concurrent center-seeking, tracking, and segmentation to produce a stable, accurate mask.
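The center-seeking idea can be illustrated with a little geometry: given a bounding box in the current view, compute the pose that would put the box center on the optical axis. This is a sketch under assumed conventions (pinhole view, x right, y up, z forward, pitch-then-yaw rotation); the function name is hypothetical and the actual module's procedure may differ.

```python
import math


def look_at_bbox_center(bbox, view_w, view_h, theta_deg, phi_deg,
                        fov_h_deg=120.0, fov_v_deg=90.0):
    """Return the (theta, phi) in degrees that centers the bbox in view."""
    left, top, right, bottom = bbox
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0

    # Ray through the box center in the camera frame (on the z = 1 plane).
    x = (cx / view_w * 2.0 - 1.0) * math.tan(math.radians(fov_h_deg) / 2.0)
    y = -(cy / view_h * 2.0 - 1.0) * math.tan(math.radians(fov_v_deg) / 2.0)
    z = 1.0

    # Rotate the ray into the world frame: pitch first, then yaw.
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    y_w = y * math.cos(p) + z * math.sin(p)
    z_p = -y * math.sin(p) + z * math.cos(p)
    x_w = x * math.cos(t) + z_p * math.sin(t)
    z_w = -x * math.sin(t) + z_p * math.cos(t)

    # The ray's longitude and latitude are the pose that looks straight at it.
    n = math.sqrt(x_w ** 2 + y_w ** 2 + z_w ** 2)
    return math.degrees(math.atan2(x_w, z_w)), math.degrees(math.asin(y_w / n))
```

A box already centered in the frame returns the current pose unchanged, which gives a natural stopping test for an iterative alignment loop.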

Try it live

360° Active Search — playable demo

Pick a real scene and drag inside the panorama to look around, just like PanoSeeker exploring. Each view is back-projected onto the EgoSphere memory map in real time — watch the 360° canvas paint itself in with the true curved field-of-view footprint. Center the target to “find & segment” it, or hit Auto-Search to let the agent do it.


The search process is illustrative — the active alignment and segmentation stages are omitted.

Results

State-of-the-art on the APRS benchmark

PanoSeeker leads on success, efficiency, and segmentation quality — with the fewest active steps among agents.

| Method | Memory | SR (%) | AS | SPL | mIoU (%) |
| Static Methods (direct panorama input) |
| LISA (CVPR'24) | - | 64.1 | 1.0* | - | 44.5 |
| VisionReasoner (ICLR'26) | - | 66.2 | 1.0* | - | 47.7 |
| Heuristic Methods (pre-defined scanning) |
| VisionReasoner (ICLR'26) | - | 49.5 | 4.9 | 0.26 | 39.9 |
| VLM-based Agents (active exploration) |
| Qwen3-VL-30B-Thinking | Text Log | 62.9 | 8.0 | 0.29 | 51.0 |
| Gemini-3-Flash | Text Log | 64.9 | 6.8 | 0.31 | 51.4 |
| GPT-5.2 | Text Log | 69.1 | 6.2 | 0.36 | 53.2 |
| PanoSeeker (Ours) | EgoSphere | 75.4 | 4.8 | 0.57 | 55.8 |

SR: Success Rate · AS: Average Steps · SPL: Success weighted by Path Length · mIoU: mean IoU. *Static methods process the panorama in a single pass (AS=1), so SPL is not applicable. PanoSeeker improves SPL by ~58% over GPT-5.2 (0.57 vs. 0.36) using only 4.8 steps.
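For readers unfamiliar with SPL, the sketch below uses the common definition from embodied navigation: each successful episode contributes the ratio of the shortest path length to the path actually taken, and failures contribute zero. Whether the benchmark measures path length in steps or in degrees of rotation is not stated here, so the units in this example are an assumption. The second helper reproduces the relative-gain arithmetic quoted above.

```python
def spl(episodes):
    """Success weighted by Path Length.

    `episodes` is a list of (success, shortest_path, taken_path) tuples.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            denom = max(taken, shortest)
            total += shortest / denom if denom > 0 else 1.0
    return total / len(episodes)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# Three toy episodes: one inefficient success, one optimal success, one failure.
print(spl([(True, 2, 4), (True, 3, 3), (False, 1, 5)]))
# Relative SPL gain of 0.57 over 0.36, in percent.
print(round(relative_gain(0.57, 0.36) * 100))
```

Because failures score zero and detours are penalized, SPL rewards agents that both find the target and reach it without redundant turns.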

[Qualitative results figure: assets/qualitative.png, inserted by running scripts/build_figures.sh]

Qualitative results. Red dots mark initial viewpoints; blue boxes highlight the final targets found by PanoSeeker. The EgoSphere is built step-by-step as the agent explores the 360° scene.

Citation

BibTeX

@inproceedings{APRS,
  title     = {Seek to Segment: Active Perception for Panoramic Referring Segmentation},
  author    = {Tang, Song and Hu, Shuming and Shuai, Xincheng and Ding, Henghui and Jiang, Yu-Gang},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}