APRS: Seek to Segment

TL;DR: We turn referring segmentation from passive (one fixed image) into active (exploring a full 360° sphere). Our agent, PanoSeeker, uses an explicit spatial memory, EgoSphere, to plan efficient, non-redundant searches. It reaches a 75.4% Success Rate and 0.57 SPL, outperforming GPT-5.2 (69.1% SR, 0.36 SPL) while taking fewer steps on average (4.8 vs. 6.2).

Overview

From passive observation to active 360° perception

Existing referring image segmentation (RIS) models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360° environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS), where an agent adjusts its viewing direction (Δθ, Δφ) to explore the 360° scene, seeking the object specified by a user instruction for segmentation.

To tackle this challenge, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model with EgoSphere — an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360° representation, EgoSphere enables the agent to plan efficient, non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask.

We curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize exploration efficiency. Extensive experiments on our new APRS benchmark show that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.

4,971 unique 360° scenes
7,420 annotated samples
75.4% Success Rate
0.57 SPL (efficiency)

Contributions

What's new

🎯

The APRS Task & Benchmark

A novel task requiring agents to actively explore 360° environments to seek and segment targets from language. The benchmark spans 4,971 diverse indoor/outdoor scenes with four types of spatial referring expressions and comprehensive evaluation protocols.

🧠

PanoSeeker + EgoSphere

A memory-augmented vision-language agent that builds an explicit, geometrically consistent 360° spatial memory (EgoSphere) to plan efficient, non-redundant searches, turning exploration into a visual “fill-in-the-blank” task on the panorama.

📈

Trajectory Data + RL Training

An expert-annotated search trajectory dataset with memory timelines for SFT, followed by GRPO-based Reinforcement Learning with efficiency and terminal rewards to explicitly optimize exploration efficiency.

🏆

State-of-the-Art Results

PanoSeeker outperforms strong static, heuristic, and VLM-agent baselines (including GPT-5.2 and Gemini-3-Flash) across success rate, search efficiency, and segmentation quality on the APRS benchmark.

Benchmark

The APRS dataset

Large-scale panoramic scenes with spatially grounded instructions — no costly 3D reconstruction required. The agent perceives a 120°×90° field of view and turns in 30° increments.
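The viewing-direction update can be illustrated with a small sketch. It assumes, beyond what the page states, that an action is a whole number of 30° steps in yaw (θ) and pitch (φ), that yaw wraps around the sphere, and that pitch is clamped at the poles. The names `ViewDirection`, `wrap_yaw`, and `apply_turn` are hypothetical and not part of the released code.

```python
from dataclasses import dataclass

STEP_DEG = 30.0  # turn increment stated in the benchmark description


@dataclass(frozen=True)
class ViewDirection:
    yaw_deg: float    # theta, azimuth kept in [-180, 180)
    pitch_deg: float  # phi, elevation kept in [-90, 90]


def wrap_yaw(yaw_deg: float) -> float:
    """Wrap an azimuth into [-180, 180) so the search can circle the sphere."""
    return (yaw_deg + 180.0) % 360.0 - 180.0


def apply_turn(view: ViewDirection, yaw_steps: int, pitch_steps: int) -> ViewDirection:
    """Apply (delta-theta, delta-phi) given as multiples of the 30 degree step.

    Assumption: yaw wraps around, pitch is clamped at the poles.
    """
    yaw = wrap_yaw(view.yaw_deg + yaw_steps * STEP_DEG)
    pitch = max(-90.0, min(90.0, view.pitch_deg + pitch_steps * STEP_DEG))
    return ViewDirection(yaw, pitch)
```

Under these assumptions, two right turns from yaw 150° wrap around to yaw -150°, and looking up twice from pitch 60° stops at the pole (90°).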

Four types of spatial referring expressions

EGO

Egocentric

Relative to the agent's orientation. “Turn right to find the sofa.”

UNIQ

Unique-Attribute

Distinctive visual attribute. “The yellow floor lamp in the room.”

ALLO

Allocentric

Relative to anchor objects. “The chair opposite the sofa.”

MULTIHOP

Multi-hop

Multi-step compositional reasoning. “Turn around and find the chair next to the table.”

Method

How PanoSeeker works

A VLM reasons over the local view, the instruction, and an explicit 360° memory — then acts.

PanoSeeker framework. The VLM fuses the current view V_t, instruction I, and EgoSphere memory M_t to predict a viewpoint move (Δθ_t, Δφ_t) or [STOP]. On stop, an Active Alignment & Segmentation module realigns the target and produces the mask. Training combines SFT on expert trajectories with GRPO using efficiency (R_eff) and terminal (R_term) rewards.

1

EgoSphere Memory

Sequential local views are mapped by inverse gnomonic projection onto a unified equirectangular (ERP) canvas, annotated with a crosshair, a latitude-longitude grid, and the visited trajectory. This gives the agent a persistent global map and keeps it from looping back over views it has already seen.
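To make the back-projection concrete, here is a minimal pure-Python sketch of an inverse gnomonic mapping from a perspective-view pixel to longitude/latitude and then to an ERP cell. The camera convention (x right, y up, z forward; pitch applied before yaw), the function names, and the forward-sampling coverage estimate are all assumptions for illustration; a production implementation would more likely warp whole images with vectorized code.

```python
import math


def view_pixel_to_lonlat(u, v, width, height, yaw_deg, pitch_deg,
                         hfov_deg=120.0, vfov_deg=90.0):
    """Inverse gnomonic projection: view pixel -> (longitude, latitude) in degrees."""
    # Pixel -> point on the tangent image plane at distance 1 from the camera.
    x = math.tan(math.radians(hfov_deg) / 2) * (2.0 * u / width - 1.0)
    y = math.tan(math.radians(vfov_deg) / 2) * (1.0 - 2.0 * v / height)
    z = 1.0
    pitch, yaw = math.radians(pitch_deg), math.radians(yaw_deg)
    # Rotate the ray by pitch (about the x axis), then by yaw (about the y axis).
    y1 = y * math.cos(pitch) + z * math.sin(pitch)
    z1 = -y * math.sin(pitch) + z * math.cos(pitch)
    x2 = x * math.cos(yaw) + z1 * math.sin(yaw)
    z2 = -x * math.sin(yaw) + z1 * math.cos(yaw)
    lon = math.atan2(x2, z2)
    lat = math.atan2(y1, math.hypot(x2, z2))
    return math.degrees(lon), math.degrees(lat)


def lonlat_to_erp(lon_deg, lat_deg, erp_width, erp_height):
    """Map (longitude, latitude) to a (row, col) cell of an equirectangular canvas."""
    col = int((lon_deg / 360.0 + 0.5) * erp_width) % erp_width
    row = int((0.5 - lat_deg / 180.0) * erp_height)
    return min(erp_height - 1, max(0, row)), col


def mark_footprint(visited, yaw_deg, pitch_deg, samples=64):
    """Mark ERP cells hit by a grid of view samples; return the explored fraction."""
    erp_h, erp_w = len(visited), len(visited[0])
    for i in range(samples + 1):
        for j in range(samples + 1):
            lon, lat = view_pixel_to_lonlat(j, i, samples, samples, yaw_deg, pitch_deg)
            row, col = lonlat_to_erp(lon, lat, erp_w, erp_h)
            visited[row][col] = True
    return sum(map(sum, visited)) / (erp_h * erp_w)
```

The view center always lands on the current (yaw, pitch), while off-center pixels bend toward the poles as pitch grows. That bending is why the footprint on the ERP canvas is curved rather than rectangular.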

2

Active Search (SFT + RL)

Qwen3-VL-8B is fine-tuned on expert trajectories, then optimized with GRPO. Efficiency rewards push the agent toward the shortest geodesic path; terminal rewards grant a bonus on successful, accurate stops.
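The page does not give the reward formulas, so the following is only a sketch of how such a signal could be assembled. It computes the great-circle angle between viewing directions, combines a hypothetical path-efficiency term (standing in for R_eff) with a hypothetical terminal bonus (standing in for R_term), and then applies the group-relative normalization that GRPO uses to turn a group of rollout rewards into advantages. The function names, the ratio-based efficiency term, and the bonus value are assumptions.

```python
import math
from statistics import mean, pstdev


def geodesic_deg(yaw1, pitch1, yaw2, pitch2):
    """Great-circle angle (degrees) between two viewing directions."""
    p1, p2 = math.radians(pitch1), math.radians(pitch2)
    d_yaw = math.radians(yaw2 - yaw1)
    cos_angle = (math.sin(p1) * math.sin(p2)
                 + math.cos(p1) * math.cos(p2) * math.cos(d_yaw))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def episode_reward(path, target, success, terminal_bonus=1.0):
    """Hypothetical reward: efficiency ratio plus a bonus, both only on success.

    path is a list of (yaw, pitch) views; target is the (yaw, pitch) of the object.
    """
    if not success:
        return 0.0
    shortest = geodesic_deg(*path[0], *target)
    travelled = sum(geodesic_deg(*a, *b) for a, b in zip(path, path[1:]))
    denom = max(travelled, shortest)
    r_eff = 1.0 if denom == 0 else shortest / denom
    return r_eff + terminal_bonus


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this shape, a rollout that turns directly toward the target earns an efficiency term near 1, a wandering rollout earns less, and the normalization ranks rollouts against others sampled for the same instruction.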

3

Align & Segment

On [STOP], the VLM grounds a bounding box; a SAM-3-based module performs concurrent center-seeking, tracking, and segmentation to produce a stable, accurate mask.
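The center-seeking idea can be sketched as computing how far the grounded box sits from the view center, in degrees, so the agent knows how much to turn. This is an illustration only: the page does not describe the alignment module's internals. The helper names, the pixel-coordinate box format `(x_min, y_min, x_max, y_max)`, and the 5° tolerance are assumptions. The offsets are measured in the camera's own frame, so they match world yaw exactly only when the current pitch is zero.

```python
import math


def bbox_center_offset_deg(bbox, width, height, hfov_deg=120.0, vfov_deg=90.0):
    """Angular offset (yaw, pitch) of a box center from the view center, in degrees."""
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    # Box center -> point on the tangent image plane at distance 1.
    x = math.tan(math.radians(hfov_deg) / 2) * (2.0 * cx / width - 1.0)
    y = math.tan(math.radians(vfov_deg) / 2) * (1.0 - 2.0 * cy / height)
    d_yaw = math.degrees(math.atan2(x, 1.0))
    d_pitch = math.degrees(math.atan2(y, math.hypot(x, 1.0)))
    return d_yaw, d_pitch


def is_centered(bbox, width, height, tol_deg=5.0):
    """True when the box center lies within tol_deg of the view center."""
    d_yaw, d_pitch = bbox_center_offset_deg(bbox, width, height)
    return abs(d_yaw) <= tol_deg and abs(d_pitch) <= tol_deg
```

A loop built on this would turn by the reported offset, re-ground or track the box in the new view, and stop once `is_centered` holds.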

Try it live

360° Active Search — playable demo

Pick a real scene and drag inside the panorama to look around, just like PanoSeeker exploring. Each view is back-projected onto the EgoSphere memory map in real time — watch the 360° canvas paint itself in with the true curved field-of-view footprint. Center the target to “find & segment” it, or hit Auto-Search to let the agent do it.

🎯 “Find the orange armchair.”

The search process is illustrative — the active alignment and segmentation stages are omitted.

Results

State-of-the-art on the APRS benchmark

PanoSeeker leads on success, efficiency, and segmentation quality — with the fewest active steps among agents.

Method	Memory	SR (%) ↑	AS ↓	SPL ↑	mIoU (%) ↑
Static Methods (direct panorama input)
LISA CVPR'24	—	64.1	1.0*	—	44.5
VisionReasoner ICLR'26	—	66.2	1.0*	—	47.7
Heuristic Methods (pre-defined scanning)
VisionReasoner ICLR'26	—	49.5	4.9	0.26	39.9
VLM-based Agents (active exploration)
Qwen3-VL-30B-Thinking	Text Log	62.9	8.0	0.29	51.0
Gemini-3-Flash	Text Log	64.9	6.8	0.31	51.4
GPT-5.2	Text Log	69.1	6.2	0.36	53.2
PanoSeeker (Ours)	EgoSphere	75.4	4.8	0.57	55.8

SR: Success Rate · AS: Average Steps · SPL: Success weighted by Path Length · mIoU: mean IoU. *Static methods process the panorama in a single pass (AS=1), so SPL is not applicable. PanoSeeker improves SPL by ~58% over GPT-5.2 (0.57 vs. 0.36) using only 4.8 steps.
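For readers unfamiliar with SPL, the sketch below uses the standard definition from embodied navigation: each successful episode contributes the ratio of the shortest path length to the larger of the taken and shortest lengths, and failures contribute zero. How this benchmark measures path length (steps or angular distance) is not stated here, so the episode tuples are an assumed format. The `relative_gain` helper reproduces the ~58% figure quoted above.

```python
def spl(episodes):
    """Success weighted by Path Length over (success, shortest, taken) tuples."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            denom = max(taken, shortest)
            # An episode that starts on the target counts as perfectly efficient.
            total += 1.0 if denom == 0 else shortest / denom
    return total / len(episodes)


def relative_gain(new, old):
    """Relative improvement of new over old, as a fraction."""
    return (new - old) / old


episodes = [(True, 2, 4), (True, 3, 3), (False, 1, 5), (True, 2, 2)]
print(spl(episodes))                            # 0.625
print(round(relative_gain(0.57, 0.36) * 100))   # 58
```

Because failures score zero and detours shrink the ratio, SPL rewards agents that both find the target and get there directly, which is why it separates the methods in the table more sharply than Success Rate alone.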

Qualitative results. Red dots mark initial viewpoints; blue boxes highlight the final targets found by PanoSeeker. The EgoSphere is built step-by-step as the agent explores the 360° scene.

Citation

BibTeX

@inproceedings{APRS,
  title     = {Seek to Segment: Active Perception for Panoramic Referring Segmentation},
  author    = {Tang, Song and Hu, Shuming and Shuai, Xincheng and Ding, Henghui and Jiang, Yu-Gang},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}