MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors

¹Texas A&M University ²Adobe

arXiv 2026

Abstract

Panoramic scene understanding suffers from scarce multi-task annotations and from severe geometric distortions that hinder the direct use of perspective foundation models. To address both problems, we propose MTPano, a robust multi-task panoramic foundation model. MTPano uses a label-free training pipeline to extract dense priors from perspective models and introduces Panoramic Dual BridgeNet (PD-BridgeNet) to untangle task interference.

Label-Free Pipeline: Overcomes data scarcity by extracting accurate pseudo-labels from perspective foundation models via patch-wise projection.
Feature Disentanglement: PD-BridgeNet explicitly separates rotation-invariant (e.g., depth, semantics) and rotation-variant (e.g., normals) features using geometry-aware modulation.
Harmonized Interactions: Uses Truncated Gradient Flow to share beneficial cross-task information while blocking conflicting gradients.
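
The invariant/variant split in the list above can be illustrated with a small numeric sketch. It assumes that depth means Euclidean range from the camera and that normals are expressed in the camera frame; the point, normal, and helper names are hypothetical and are not taken from the paper.

```python
import math

def rotate_yaw(v, angle_rad):
    """Rotate a 3D vector about the vertical (y) axis by angle_rad."""
    x, y, z = v
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)

def norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(c * c for c in v))

# Hypothetical scene point and its surface normal, both in the camera frame.
point = (2.0, 0.5, 3.0)
normal = (0.0, 0.0, -1.0)

# Rotating the panorama camera by 90 degrees re-expresses both vectors.
angle = math.pi / 2
point_rot = rotate_yaw(point, angle)
normal_rot = rotate_yaw(normal, angle)

# Range to the point is unchanged by the rotation (rotation-invariant).
depth_before = norm(point)
depth_after = norm(point_rot)

# The normal's camera-frame components change (rotation-variant).
normal_changed = any(abs(a - b) > 1e-9 for a, b in zip(normal, normal_rot))
```

Under these assumptions the range stays the same while the normal's components change with the camera orientation, which is why the two kinds of features benefit from being handled in separate streams.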

Methodology

1. Label-Free Training Pipeline

To overcome the scarcity of annotations, we integrate knowledge from multiple off-the-shelf perspective foundation models. We first project the panoramic image into multiple perspective patches, where the ERP distortion is absent, and obtain pseudo-labels on these patches using models like InternImage and MoGe-2. We then re-project the pseudo-labels onto the sphere as spherical patches for patch-wise supervision. This strategy allows us to leverage vast foundational knowledge while preventing overfitting to projection artifacts.
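
The projection step can be sketched as a gnomonic mapping from a perspective patch pixel to equirectangular coordinates. This is a minimal sketch under assumed conventions (x right, y up, z forward; longitude 0 at the panorama center); the function name, the patch size, and the field of view are hypothetical, and the pipeline's actual patch layout and interpolation are not specified here.

```python
import math

def patch_pixel_to_erp(u, v, size, fov_deg, yaw_deg, pitch_deg, erp_w, erp_h):
    """Map continuous patch coordinates (u, v) to ERP (col, row) coordinates.

    (size / 2, size / 2) is the patch center; the patch camera looks along
    the direction given by yaw (to the right) and pitch (upward).
    """
    # Pinhole focal length in pixels for a square patch.
    f = (size / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

    # Ray in the patch camera frame (image rows grow downward, y grows upward).
    x, y, z = u - size / 2.0, -(v - size / 2.0), f
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n

    # Rotate by pitch about the x axis, then by yaw about the y axis.
    p, yw = math.radians(pitch_deg), math.radians(yaw_deg)
    y, z = math.cos(p) * y + math.sin(p) * z, -math.sin(p) * y + math.cos(p) * z
    x, z = math.cos(yw) * x + math.sin(yw) * z, -math.sin(yw) * x + math.cos(yw) * z

    # Convert the rotated ray to longitude/latitude, then to ERP pixels.
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y)))
    col = (lon / (2.0 * math.pi) + 0.5) * erp_w
    row = (0.5 - lat / math.pi) * erp_h
    return col, row

# Center of a forward-facing patch lands at the center of a 2048x1024 panorama.
center = patch_pixel_to_erp(128, 128, 256, 90, 0, 0, 2048, 1024)
# The same patch center, with the patch turned 90 degrees to the right.
right = patch_pixel_to_erp(128, 128, 256, 90, 90, 0, 2048, 1024)
# Right edge of the forward-facing patch (45 degrees off-axis for a 90 degree FOV).
edge = patch_pixel_to_erp(256, 128, 256, 90, 0, 0, 2048, 1024)
```

Sampling the panorama at these coordinates yields a perspective patch for the teacher model; the same per-pixel correspondence can be used to carry the patch's pseudo-labels back onto the sphere.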

2. Panoramic Dual BridgeNet (PD-BridgeNet)

To resolve geometric conflicts on the sphere, we design a dual-branch architecture. An ERP Token Mixer handles the distortions of the Equirectangular Projection (ERP), and geometry-aware modulation layers disentangle the feature streams by injecting absolute position and ray direction priors into the rotation-variant branch. Finally, an asymmetric bridge mechanism with Truncated Gradient Flow lets beneficial information flow between the branches while blocking conflicting gradients.
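
As a rough illustration of the geometry priors mentioned above, the sketch below computes a per-pixel ray direction and normalized position for an ERP image and applies a scale-and-shift modulation to a feature vector. The FiLM-style form of the modulation is an assumption (one common choice, not confirmed by the source), and all function names and weights are hypothetical.

```python
import math

def erp_ray_direction(col, row, width, height):
    """Unit viewing ray for the center of ERP pixel (col, row)."""
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (row + 0.5) / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),  # x: right
        math.sin(lat),                  # y: up
        math.cos(lat) * math.cos(lon),  # z: forward
    )

def geometry_prior(col, row, width, height):
    """Concatenate normalized absolute position with the ray direction."""
    pos = ((col + 0.5) / width, (row + 0.5) / height)
    return pos + erp_ray_direction(col, row, width, height)

def modulate(feature, prior, w_gamma, w_beta):
    """Assumed FiLM-style modulation: feature * (1 + gamma) + beta.

    gamma and beta are linear functions of the prior, one weight row per
    feature channel (hypothetical parameters).
    """
    out = []
    for c, f in enumerate(feature):
        gamma = sum(w * p for w, p in zip(w_gamma[c], prior))
        beta = sum(w * p for w, p in zip(w_beta[c], prior))
        out.append(f * (1.0 + gamma) + beta)
    return out

# Prior for one pixel of a tiny 4x2 ERP grid: 2 position + 3 direction values.
prior = geometry_prior(2, 0, 4, 2)
ray = erp_ray_direction(2, 0, 4, 2)

# With all-zero weights the modulation leaves the feature unchanged.
zeros = [[0.0] * 5, [0.0] * 5]
unchanged = modulate([1.5, -2.0], prior, zeros, zeros)
```

Because the prior depends on where a pixel sits on the sphere, a modulation of this kind gives the rotation-variant branch access to orientation information that the invariant branch does not need.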

Results

Extensive Testing on Synthetic Data (DiT360): MTPano produces robust and consistent panoramic scene parsing results, maintaining geometric consistency even in highly imaginative, out-of-distribution synthetic environments.

Real-World Generalization on Google Street View: MTPano generalizes zero-shot to in-the-wild captures, handling complex outdoor lighting and severe perspective distortions while delivering high-fidelity 3D priors.

BibTeX

@article{zhang2026mtpano,
  title={MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors},
  author={Zhang, Jingdong and Zhan, Xiaohang and Zhang, Lingzhi and Wang, Yizhou and Yu, Zhengming and Wang, Jionghao and Wang, Wenping and Li, Xin},
  journal={arXiv preprint arXiv:2602.05330},
  year={2026}
}