Panoramic Scene Understanding via MTPano. MTPano is a multi-task foundation model for panoramic dense scene parsing, capable of jointly estimating semantic segmentation, depth, and surface normals.
Panoramic scene understanding suffers from scarce multi-task annotations and severe geometric distortions that hinder the direct use of perspective foundation models. To address this, we propose MTPano, a robust multi-task panoramic foundation model. MTPano leverages a label-free training pipeline to extract perspective dense priors and introduces Panoramic Dual BridgeNet (PD-BridgeNet) to untangle task interference.
To overcome the scarcity of annotations, we integrate knowledge from multiple off-the-shelf perspective foundation models. We project the panoramic image into multiple perspective patches, obtain distortion-free pseudo-labels on them using models such as InternImage and MoGe-2, and then re-project the pseudo-labels onto the sphere as spherical patches for patch-wise supervision. This strategy lets us draw on the knowledge in these foundation models while preventing overfitting to projection artifacts.
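The projection step above can be illustrated with a minimal sketch of the underlying geometry. It is not the authors' pipeline: the function names, the axis convention (x right, y up, z forward), the 90-degree field of view, and the ERP pixel layout are all assumptions chosen for illustration. It maps each pixel of a perspective patch to continuous ERP coordinates, which is the lookup needed both to cut a patch out of the panorama and to write patch-level pseudo-labels back onto the sphere.

```python
import math

def patch_pixel_to_erp(px, py, patch_size, fov_deg, yaw, pitch, erp_w, erp_h):
    """Map a perspective-patch coordinate to continuous ERP coordinates (u, v).

    Assumed conventions (illustrative only): pixel centers sit at integer + 0.5,
    the patch camera looks along +z with x right and y up, yaw and pitch are in
    radians, and ERP column 0 corresponds to longitude -pi.
    """
    # Focal length in pixels for a square patch with the given field of view.
    f = 0.5 * patch_size / math.tan(math.radians(fov_deg) / 2.0)
    # Ray through the pixel in the camera frame (image y points down).
    x = (px - 0.5 * patch_size) / f
    y = -(py - 0.5 * patch_size) / f
    z = 1.0
    # Pitch: rotate about the x axis (positive pitch looks up).
    y, z = (y * math.cos(pitch) + z * math.sin(pitch),
            -y * math.sin(pitch) + z * math.cos(pitch))
    # Yaw: rotate about the y axis (positive yaw turns toward +x).
    x, z = (x * math.cos(yaw) + z * math.sin(yaw),
            -x * math.sin(yaw) + z * math.cos(yaw))
    # Convert the ray to longitude / latitude on the unit sphere.
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y / norm)))
    # Convert spherical angles to ERP pixel coordinates.
    u = ((lon / (2.0 * math.pi) + 0.5) * erp_w) % erp_w
    v = (0.5 - lat / math.pi) * erp_h
    return u, v

def patch_sampling_grid(patch_size, fov_deg, yaw, pitch, erp_w, erp_h):
    """ERP coordinates for every pixel center of one perspective patch."""
    return [[patch_pixel_to_erp(c + 0.5, r + 0.5, patch_size, fov_deg,
                                yaw, pitch, erp_w, erp_h)
             for c in range(patch_size)]
            for r in range(patch_size)]

# Example: a small patch looking straight ahead in a 1024 x 512 panorama.
grid = patch_sampling_grid(4, 90.0, 0.0, 0.0, 1024, 512)
```

Sampling the panorama at `grid` would yield the perspective patch, and the same coordinates indicate where patch predictions land on the ERP image. For direction-valued labels such as surface normals, the predicted vectors would presumably also need to be rotated from the patch camera frame into the panorama frame before being used as supervision.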
To resolve geometric conflicts on the sphere, we design a dual-branch architecture. It uses an ERP Token Mixer to handle Equirectangular Projection (ERP) distortions and disentangles feature streams by injecting absolute position and ray direction priors into the variant branch via geometry-aware modulation layers. Finally, an asymmetric bridge mechanism with Truncated Gradient Flow ensures beneficial information flows between branches while blocking conflicting gradients.
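As a rough illustration of what absolute position and ray direction priors can look like on an ERP grid, the sketch below computes a five-value prior per pixel: a normalized position and a unit viewing ray. The function names, channel layout, and axis convention (y up, z forward) are assumptions for illustration; how MTPano encodes these priors and feeds them to its modulation layers is not specified here.

```python
import math

def erp_pixel_prior(u, v, width, height):
    """Return (pos_x, pos_y, ray_x, ray_y, ray_z) for one ERP coordinate.

    Assumed conventions (illustrative only): u in [0, width) spans longitude
    [-pi, pi), v in [0, height] spans latitude [pi/2, -pi/2], y is up and
    z is the forward direction at the image center.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    # Unit ray direction on the sphere for this pixel.
    ray = (math.cos(lat) * math.sin(lon),
           math.sin(lat),
           math.cos(lat) * math.cos(lon))
    # Absolute position normalized to [-1, 1] in both image axes.
    position = (2.0 * u / width - 1.0, 1.0 - 2.0 * v / height)
    return position + ray

def erp_prior_map(width, height):
    """Prior values at every pixel center, indexed as prior[row][col]."""
    return [[erp_pixel_prior(c + 0.5, r + 0.5, width, height)
             for c in range(width)]
            for r in range(height)]

# Example: a tiny 8 x 4 prior map.
prior = erp_prior_map(8, 4)
```

Unlike image content, these values depend only on where a pixel sits on the sphere, so they give a network an explicit cue about latitude-dependent ERP stretching, which grows roughly as 1 / cos(latitude) toward the poles.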
Extensive Testing on Synthetic Data (DiT360): MTPano parses panoramic scenes robustly and consistently, preserving geometric consistency even in highly imaginative, out-of-distribution synthetic environments.
Real-World Generalization on Google Street View: MTPano generalizes zero-shot to in-the-wild captures, handling complex outdoor lighting and severe perspective distortions while delivering high-fidelity 3D priors.
@article{zhang2026mtpano,
title={MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors},
author={Zhang, Jingdong and Zhan, Xiaohang and Zhang, Lingzhi and Wang, Yizhou and Yu, Zhengming and Wang, Jionghao and Wang, Wenping and Li, Xin},
journal={arXiv preprint arXiv:2602.05330},
year={2026}
}