UniSER: HALO & Synthetic Haze Datasets

Companion data releases for our CVPR 2026 paper UniSER: A Foundation Model for Unified Soft Effects Removal
WebDataset · CC-BY-NC-SA-4.0 · gated
HALO: each sample ships as a flare-free / flared / flare-only triplet across four flare types (Streak, Reflective, Glare, Shimmer).

Synthetic Haze: one representative scene per upstream source paired with three synthesized haze variants — fog / haze / smoke (HAZESPACE, ITS, OTS) or neutral / yellowish atmospheric haze (WSRD, Flare-R, ISTD). OTS clean images are not bundled (Flickr copyright).

Overview

We release the two image datasets that power UniSER's training pipeline, both sharded in WebDataset format on Hugging Face for fast streaming and partial-download support.

HALO is a new 3D-rendered lens-flare dataset with 4,945 4K-resolution triplets (clean / flared / flare-only) spanning 32 scenes × 4 flare types. The triplet design supports paired flare removal, forward flare synthesis, and additive decomposition (flare ≈ light + separate, where light is the clean image and separate is the flare-only layer).

Synthetic Haze bundles ~80.9k unique clean images with ~2M physically motivated haze variants, generated by a depth-driven atmospheric scattering synthesizer applied across six upstream image sources covering indoor/outdoor, homogeneous/non-homogeneous, and daytime/dense atmospheric conditions.

HALO Lens Flare: 4,945 4K triplets · 32 scenes · ~153 GB · 71 shards
Synthetic Haze: ~80.9k unique GTs · ~2M haze variants · ~2.5 TB · 1,327 shards

HALO — 3D-Rendered Lens Flare

Blender-rendered 4K scenes with physically-based flare triplets.

HALO covers 4 flare types across 32 distinct scenes, with multiple cameras and lighting variants per scene. Each sample carries three aligned 4K RGBA PNGs: the clean scene (gt), the same scene with flare added (flare), and the flare layer rendered against a transparent background (separate), so that it can be re-composited over arbitrary clean images at training time.
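
The re-compositing step can be sketched with Pillow as follows. This is a minimal illustration, not the authors' training code: it assumes standard alpha-over blending via `Image.alpha_composite`, whereas the paper may use additive or another blend. The helper name `composite_flare` and the tiny synthetic images are hypothetical stand-ins for a real clean frame and a `separate` layer.

```python
from PIL import Image

def composite_flare(clean: Image.Image, flare_layer: Image.Image) -> Image.Image:
    """Alpha-composite an RGBA flare layer over a clean image of the same size."""
    base = clean.convert("RGBA")
    layer = flare_layer.convert("RGBA")
    if layer.size != base.size:
        # alpha_composite requires matching sizes; resize or crop beforehand.
        raise ValueError(f"size mismatch: {layer.size} vs {base.size}")
    return Image.alpha_composite(base, layer).convert("RGB")

# Tiny synthetic stand-ins (assumption: real inputs would be 3840x2160 PNGs).
clean = Image.new("RGB", (4, 4), (10, 20, 30))
flare_layer = Image.new("RGBA", (4, 4), (255, 255, 255, 0))  # fully transparent
flare_layer.putpixel((1, 1), (255, 200, 100, 255))           # one opaque flare pixel
out = composite_flare(clean, flare_layer)
```

Where the flare layer is transparent, the clean pixel passes through unchanged; where it is opaque, the flare pixel replaces it.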

Per-scene composition

The 4,945 samples are stratified across 32 scenes; the four flare types share the same per-scene distribution, at a roughly 2 : 2 : 1 : 1 (Streak : Reflective : Glare : Shimmer) ratio.

Effect-type breakdown
Effect type   Samples   Share    Description
Streak        1,656     33.5%    bright streak / anamorphic stretch flares
Reflective    1,655     33.5%    inter-element ghost reflections
Glare         817       16.5%    wide soft halo + bloom
Shimmer       817       16.5%    iridescent / dispersive flare
Total         4,945     100%     32 scenes × 4 effects

Per-sample shard layout
halo/<base_name>.gt.png         clean scene without flare       (RGBA, 3840×2160)
halo/<base_name>.flare.png      same scene with flare added     (RGBA, 3840×2160)
halo/<base_name>.separate.png   flare-only on transparent bg    (RGBA, 3840×2160)
halo/<base_name>.json           per-sample metadata
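
To see how this layout maps onto the sample dictionaries used when streaming, the sketch below mimics WebDataset's grouping rule: the part of a tar member path before the first dot of the file name is the sample key, and everything after it is the field name. The function `split_member` and the base name `scene01_cam0` are hypothetical; this re-implements the rule for illustration and is not the library's own code.

```python
import posixpath

def split_member(path: str) -> tuple[str, str]:
    """Split a tar member path into (sample key, field name) at the first dot
    of the file name."""
    directory, filename = posixpath.split(path)
    stem, _, field = filename.partition(".")
    key = posixpath.join(directory, stem) if directory else stem
    return key, field

# Hypothetical member names following the layout above.
members = [
    "halo/scene01_cam0.gt.png",
    "halo/scene01_cam0.flare.png",
    "halo/scene01_cam0.separate.png",
    "halo/scene01_cam0.json",
]

# Group fields by sample key, as a WebDataset reader would.
samples: dict[str, list[str]] = {}
for member in members:
    key, field = split_member(member)
    samples.setdefault(key, []).append(field)
```

This is why the decoding code indexes a sample with `"gt.png"`, `"flare.png"`, `"separate.png"`, and `"json"`, and why base names should not contain dots.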
Stream from Hugging Face
pip install -U "huggingface_hub[hf_xet]" webdataset pillow
hf auth login   # accept terms at https://huggingface.co/datasets/jdzhang0929/halo-flare-dataset

import io, json
from huggingface_hub import HfFileSystem
import webdataset as wds
from PIL import Image

REPO = "jdzhang0929/halo-flare-dataset"
# List the .tar shards in the repo and turn each repo path into an https resolve URL.
urls = [
    f"https://huggingface.co/datasets/{REPO}/resolve/main/{p[len(f'datasets/{REPO}/'):]}"
    for p in HfFileSystem().ls(f"datasets/{REPO}/shards", detail=False)
    if p.endswith(".tar")
]

def decode(s):
    meta = json.loads(s["json"])
    return {
        "scene":    meta["scene"],
        "effect":   meta["effect_id"],
        "light":    Image.open(io.BytesIO(s["gt.png"])).convert("RGB"),
        "flare":    Image.open(io.BytesIO(s["flare.png"])).convert("RGB"),
        "separate": Image.open(io.BytesIO(s["separate.png"])),  # keep RGBA
    }

pipeline = wds.WebDataset(urls, shardshuffle=True).shuffle(500).map(decode)
for sample in pipeline:
    print(sample["scene"], sample["effect"])
    break
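
Because the repository is gated, plain https shard URLs may be rejected if the download does not carry your access token. One possible workaround, sketched here under that assumption, is to wrap each URL in a WebDataset `pipe:` command that passes a bearer token to `curl`. The helper `authed_pipe_url`, the shard name `example.tar`, and the token `hf_xxx` are hypothetical placeholders.

```python
import shlex

def authed_pipe_url(url: str, token: str) -> str:
    """Wrap an https shard URL in a `pipe:` command that sends a bearer token."""
    header = f"Authorization: Bearer {token}"
    # shlex.quote keeps the header (which contains spaces) as one shell argument.
    return f"pipe:curl -s -L -f {shlex.quote(url)} -H {shlex.quote(header)}"

example = authed_pipe_url(
    # Hypothetical shard name; real names come from the `shards/` listing.
    "https://huggingface.co/datasets/jdzhang0929/halo-flare-dataset/resolve/main/shards/example.tar",
    "hf_xxx",  # placeholder token
)
```

In practice the token would typically come from `huggingface_hub.get_token()` after `hf auth login`, and the wrapped list (`[authed_pipe_url(u, token) for u in urls]`) would be passed to `wds.WebDataset` in place of `urls`.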

Synthetic Haze

Depth-driven atmospheric synthesis over six upstream image sources.

Per the paper's §7 atmospheric model, every clean image is augmented with multiple haze variants whose optical thickness τ follows depth predictions from Marigold, optionally modulated by Perlin noise for non-uniform haze and valley fog. Variants span homogeneous, non-homogeneous, indoor, outdoor, daytime, dense, and smoke-like conditions.
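
As a rough illustration of depth-driven haze synthesis, the sketch below applies the standard atmospheric scattering model I = J·t + A·(1 − t), with transmission t = exp(−τ) and τ = β·depth. This is an assumption about the general form only; the paper's §7 model may differ in its details. The names `add_haze`, `beta`, `airlight`, and `tau_noise` are hypothetical, and the toy depth array stands in for a Marigold prediction.

```python
import numpy as np

def add_haze(clean, depth, beta=1.0, airlight=0.9, tau_noise=None):
    """Apply I = J * t + A * (1 - t), with t = exp(-tau) and tau = beta * depth.

    clean:     float array (H, W, 3) in [0, 1]
    depth:     float array (H, W), larger means farther away
    tau_noise: optional (H, W) multiplier, e.g. a Perlin-like field, to make
               the haze spatially non-uniform
    """
    tau = beta * depth
    if tau_noise is not None:
        tau = tau * tau_noise
    t = np.exp(-tau)[..., None]  # transmission, broadcast over the RGB axis
    return clean * t + airlight * (1.0 - t)

# Toy inputs: a uniform dark image and a 2x2 depth map.
clean = np.full((2, 2, 3), 0.2)
depth = np.array([[0.0, 1.0], [2.0, 4.0]])
hazy = add_haze(clean, depth, beta=1.0, airlight=0.9)
```

At zero depth the pixel is unchanged, and as depth grows the pixel value approaches the airlight colour, which is the qualitative behaviour expected of fog and haze.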

Source composition

Six upstream sources, dominated by HazeSpace2M (~82% of unique GTs). Each GT carries ~19–37 synthesized haze variants, for ~2 million total pairs.

Per-sample shard layout
<source>/<base_name>.gt.<ext>          clean RGB image (absent for OTS)
<source>/<base_name>.haze_NNN.png      synthesized haze variant
<source>/<base_name>.haze_NNN.txt      descriptive tag (e.g. out_fog_120)
<source>/<base_name>.json              per-sample metadata

<source> is one of hazespace, its, ots, wsrd, real_flare, istd. RESIDE-OTS clean GTs carry third-party photographer copyrights and are not bundled; the companion repo ships a fetch script that pulls the originals from the official RESIDE source.

Stream from Hugging Face
import io, json, random
from huggingface_hub import HfFileSystem
import webdataset as wds
from PIL import Image

REPO = "jdzhang0929/uniser-haze-dataset"
urls = [
    f"https://huggingface.co/datasets/{REPO}/resolve/main/{p[len(f'datasets/{REPO}/'):]}"
    for p in HfFileSystem().ls(f"datasets/{REPO}/shards", detail=False)
    if p.endswith(".tar")
]

def decode(s):
    meta = json.loads(s["json"])
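    # Fields are named by extension: "haze_NNN.png", "haze_NNN.txt", "gt.<ext>".
    haze_keys = sorted(k for k in s if k.startswith("haze_") and k.endswith(".png"))
    chosen = random.choice(haze_keys)  # one random variant per clean image
    # OTS samples ship without a clean image, so gt_key may be None.
    gt_key = next((k for k in s if k.startswith("gt.")), None)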
    return {
        "source": meta["source"],
        "gt":     Image.open(io.BytesIO(s[gt_key])).convert("RGB") if gt_key else None,
        "haze":   Image.open(io.BytesIO(s[chosen])).convert("RGB"),
        "tag":    s[chosen.replace(".png", ".txt")].decode(),
    }

pipeline = wds.WebDataset(urls, shardshuffle=True).shuffle(1000).map(decode)
for sample in pipeline:
    print(sample["source"], sample["tag"])
    break
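
The loader above draws one random variant per sample. If you would rather enumerate every variant together with its tag, a sketch along the following lines may help. It assumes the field naming from the shard layout (`haze_NNN.png` paired with `haze_NNN.txt`); the helper `haze_variants`, the fake sample, and the tag `in_haze_080` are hypothetical.

```python
def haze_variants(sample: dict) -> list[tuple[str, bytes, str]]:
    """Return (variant id, PNG bytes, tag) for every haze variant in one sample."""
    out = []
    for key in sorted(sample):
        if key.startswith("haze_") and key.endswith(".png"):
            variant = key[: -len(".png")]
            # A missing tag file yields an empty tag instead of raising.
            tag_bytes = sample.get(variant + ".txt", b"")
            out.append((variant, sample[key], tag_bytes.decode().strip()))
    return out

# Fake raw sample with placeholder bytes in place of real image data.
fake = {
    "__key__": "its/example",
    "gt.png": b"clean",
    "haze_000.png": b"png0",
    "haze_000.txt": b"out_fog_120\n",
    "haze_001.png": b"png1",
    "haze_001.txt": b"in_haze_080",  # hypothetical tag
    "json": b"{}",
}
variants = haze_variants(fake)
```

Each returned tuple keeps the raw PNG bytes, so decoding with `Image.open(io.BytesIO(...))` can be deferred until a variant is actually needed.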

Acknowledgments

HALO HDRI environments are sourced from Poly Haven (CC0 1.0 Universal, public domain) and freepoly.org. The 4,945 HALO samples span 118 distinct HDRIs across coast, mountain, urban, industrial, pastoral, and aerial-landscape categories. The companion HF dataset card lists a representative subset.

Synthetic Haze builds on six upstream image datasets — HazeSpace2M, RESIDE-ITS, RESIDE-OTS, WSRD, Flare7K++ (Flare-R subset), and ISTD. Each subset retains its own upstream license; please consult docs/upstream_licenses.md and cite every subset whose data you use.

BibTeX

@article{zhang2025uniser,
    title={UniSER: A Foundation Model for Unified Soft Effects Removal},
    author={Zhang, Jingdong and Zhang, Lingzhi and Liu, Qing and Chiu, Mang Tik and Barnes, Connelly and Wang, Yizhou and You, Haoran and Liu, Xiaoyang and Zhou, Yuqian and Lin, Zhe and others},
    journal={arXiv preprint arXiv:2511.14183},
    year={2025}
}