HALO & Synthetic Haze Datasets
HALO: each sample ships as a flare-free / flared / flare-only triplet across four flare types (Streak, Reflective, Glare, Shimmer).
Synthetic Haze: one representative scene per upstream source paired with three synthesized haze variants — fog / haze / smoke (HAZESPACE, ITS, OTS) or neutral / yellowish atmospheric haze (WSRD, Flare-R, ISTD). OTS clean images are not bundled (Flickr copyright).
We release the two image datasets that power UniSER's training pipeline, both sharded in WebDataset format on Hugging Face for fast streaming and partial-download support.
HALO is a brand-new 3D-rendered lens-flare dataset with
4,945 4K-resolution triplets (clean / flared / flare-only) spanning
32 scenes × 4 flare types. The triplet design supports paired flare
removal, forward flare synthesis, and additive decomposition
(flare ≈ light + separate).
Synthetic Haze bundles ~80.9k unique clean images with ~2M physically-motivated haze variants — a depth-driven atmospheric scattering synthesizer applied across six upstream image sources covering indoor/outdoor, homogeneous/non-homogeneous, daytime/dense atmospheric conditions.
HALO covers 4 flare types across 32 distinct scenes, with multiple cameras
and lighting variants per scene. Each sample carries three aligned 4K RGBA PNGs — the
clean scene (gt), the same scene with flare added (flare), and
the flare layer rendered against a transparent background (separate) so it can
be re-composited over arbitrary clean images at training time.
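The re-compositing step described above can be sketched as follows. This is a minimal illustration, assuming straight (non-premultiplied) alpha and standard "over" blending on float arrays in [0, 1]; the function name `composite_flare` is hypothetical, and the renderer's exact blend may differ.

```python
import numpy as np

def composite_flare(clean_rgb: np.ndarray, flare_rgba: np.ndarray) -> np.ndarray:
    """Alpha-composite a flare layer (H, W, 4) over a clean image (H, W, 3).

    Both inputs are float arrays in [0, 1]. Straight alpha is an assumption,
    not something the dataset description confirms.
    """
    flare_rgb = flare_rgba[..., :3]
    alpha = flare_rgba[..., 3:4]  # keep the last axis so it broadcasts over RGB
    out = flare_rgb * alpha + clean_rgb * (1.0 - alpha)
    return np.clip(out, 0.0, 1.0)

# Tiny synthetic example: a 2x2 grey image with one half-transparent white flare pixel.
clean = np.full((2, 2, 3), 0.2)
flare = np.zeros((2, 2, 4))
flare[0, 0] = [1.0, 1.0, 1.0, 0.5]
result = composite_flare(clean, flare)
```

With PIL images, `np.asarray(img, dtype=np.float32) / 255.0` gives arrays in the expected range.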
4,945 samples are stratified across 32 scenes; the four flare types share the same per-scene distribution at a 2 : 2 : 1 : 1 (Streak : Reflective : Glare : Shimmer) ratio.
| Effect type | Samples | Share | Description |
|---|---|---|---|
| Streak | 1,656 | 33.5% | bright streak / anamorphic stretch flares |
| Reflective | 1,655 | 33.5% | inter-element ghost reflections |
| Glare | 817 | 16.5% | wide soft halo + bloom |
| Shimmer | 817 | 16.5% | iridescent / dispersive flare |
| Total | 4,945 | 100% | 32 scenes × 4 effects |
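As a quick sanity check, the shares in the table can be recomputed from the per-type counts. The snippet below only uses the numbers printed above.

```python
# Per-type sample counts copied from the table.
counts = {"Streak": 1656, "Reflective": 1655, "Glare": 817, "Shimmer": 817}

total = sum(counts.values())
# Percentage share of each flare type, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

The counts sum to 4,945 and the rounded shares match the table, consistent with the stated 2 : 2 : 1 : 1 ratio.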
halo/<base_name>.gt.png        clean scene without flare (RGBA, 3840×2160)
halo/<base_name>.flare.png     same scene with flare added (RGBA, 3840×2160)
halo/<base_name>.separate.png  flare-only on transparent bg (RGBA, 3840×2160)
halo/<base_name>.json          per-sample metadata
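To show how this layout maps to per-sample fields, here is a small sketch of the usual WebDataset grouping convention: everything up to the first dot of the file name is the sample key, and the remainder is the field name. The helper `split_member` and the base name `example_0001` are hypothetical, and the sketch assumes base names contain no dots.

```python
import re

# Key = directory plus file name up to the first dot; field = everything after it.
_MEMBER_RE = re.compile(r"^((?:.*/)?[^/.]+)\.([^/]*)$")

def split_member(path: str) -> tuple[str, str]:
    """Split a tar member path into (sample key, field name)."""
    match = _MEMBER_RE.match(path)
    if match is None:
        raise ValueError(f"no extension in {path!r}")
    return match.group(1), match.group(2)

# Hypothetical member names following the layout above.
members = [
    "halo/example_0001.gt.png",
    "halo/example_0001.flare.png",
    "halo/example_0001.separate.png",
    "halo/example_0001.json",
]

# Group fields by sample key, as a streaming reader would.
samples = {}
for member in members:
    key, field = split_member(member)
    samples.setdefault(key, []).append(field)
```

This is why a loader can address a sample's files as `gt.png`, `flare.png`, `separate.png`, and `json`.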
pip install -U "huggingface_hub[hf_xet]" webdataset pillow
hf auth login # accept terms at https://huggingface.co/datasets/jdzhang0929/halo-flare-dataset
import io, json

from huggingface_hub import HfFileSystem
import webdataset as wds
from PIL import Image

REPO = "jdzhang0929/halo-flare-dataset"

# List the .tar shards in the repo and turn each path into a resolve URL.
urls = [
    f"https://huggingface.co/datasets/{REPO}/resolve/main/{p[len(f'datasets/{REPO}/'):]}"
    for p in HfFileSystem().ls(f"datasets/{REPO}/shards", detail=False)
    if p.endswith(".tar")
]

def decode(s):
    # Each sample arrives as raw bytes keyed by file extension.
    meta = json.loads(s["json"])
    return {
        "scene": meta["scene"],
        "effect": meta["effect_id"],
        "light": Image.open(io.BytesIO(s["gt.png"])).convert("RGB"),
        "flare": Image.open(io.BytesIO(s["flare.png"])).convert("RGB"),
        "separate": Image.open(io.BytesIO(s["separate.png"])),  # keep RGBA
    }

# Shuffle shard order, then shuffle samples within a 500-sample buffer.
pipeline = wds.WebDataset(urls, shardshuffle=True).shuffle(500).map(decode)
for sample in pipeline:
    print(sample["scene"], sample["effect"])
    break
Per the paper's §7 atmospheric model, every clean image is augmented with multiple
haze variants whose optical thickness τ follows depth predictions from
Marigold, optionally modulated by Perlin noise for non-uniform haze and valley fog.
Variants span homogeneous, non-homogeneous, indoor, outdoor, daytime, dense, and
smoke-like conditions.
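For intuition, depth-driven haze is commonly written with the standard atmospheric scattering (Koschmieder) model, I = J·t + A·(1 − t) with transmission t = exp(−τ). The sketch below assumes τ = β·depth with a uniform scattering coefficient β and a scalar airlight A. These parameter names and values are illustrative assumptions, not the paper's exact synthesizer, and the Perlin-noise modulation is omitted.

```python
import numpy as np

def add_haze(clean: np.ndarray, depth: np.ndarray,
             beta: float = 1.0, airlight: float = 0.9) -> np.ndarray:
    """Apply homogeneous haze to a clean image.

    clean:    (H, W, 3) float array in [0, 1]
    depth:    (H, W) non-negative depth map (arbitrary units)
    beta:     assumed scattering coefficient
    airlight: assumed global atmospheric light
    """
    tau = beta * depth                    # optical thickness grows with depth
    t = np.exp(-tau)[..., None]           # transmission, broadcast over RGB
    return clean * t + airlight * (1.0 - t)

# Synthetic example: a flat grey image with increasing depth per pixel.
clean = np.full((2, 2, 3), 0.3)
depth = np.array([[0.0, 1.0], [2.0, 50.0]])
hazy = add_haze(clean, depth)
```

At zero depth the pixel is unchanged, and at very large depth it converges to the airlight value, which is the expected behaviour of this model.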
The dataset draws on six upstream sources and is dominated by HazeSpace2M (~82% of unique GTs). Each GT carries ~19–37 synthesized haze variants, for ~2 million pairs in total.
<source>/<base_name>.gt.<ext>        clean RGB image (absent for OTS)
<source>/<base_name>.haze_NNN.png    synthesized haze variant
<source>/<base_name>.haze_NNN.txt    descriptive tag (e.g. out_fog_120)
<source>/<base_name>.json            per-sample metadata
<source> is one of hazespace, its,
ots, wsrd, real_flare, istd.
RESIDE-OTS clean GTs carry third-party photographer copyrights and are
not bundled; the companion repo ships a fetch script that pulls the
originals from the official RESIDE source.
import io, json, random

from huggingface_hub import HfFileSystem
import webdataset as wds
from PIL import Image

REPO = "jdzhang0929/uniser-haze-dataset"

# List the .tar shards in the repo and turn each path into a resolve URL.
urls = [
    f"https://huggingface.co/datasets/{REPO}/resolve/main/{p[len(f'datasets/{REPO}/'):]}"
    for p in HfFileSystem().ls(f"datasets/{REPO}/shards", detail=False)
    if p.endswith(".tar")
]

def decode(s):
    meta = json.loads(s["json"])
    # Pick one of the haze variants for this clean image at random.
    haze_keys = sorted(k for k in s if k.startswith("haze_") and k.endswith(".png"))
    chosen = random.choice(haze_keys)
    # The clean image may be absent (OTS), so gt_key can be None.
    gt_key = next((k for k in s if k.startswith("gt.")), None)
    return {
        "source": meta["source"],
        "gt": Image.open(io.BytesIO(s[gt_key])).convert("RGB") if gt_key else None,
        "haze": Image.open(io.BytesIO(s[chosen])).convert("RGB"),
        "tag": s[chosen.replace(".png", ".txt")].decode(),
    }

# Shuffle shard order, then shuffle samples within a 1000-sample buffer.
pipeline = wds.WebDataset(urls, shardshuffle=True).shuffle(1000).map(decode)
for sample in pipeline:
    print(sample["source"], sample["tag"])
    break
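Because OTS samples ship without a clean image, the loader above returns `gt=None` for them. For paired training you may want to drop those samples. The generator below is a minimal sketch of that filter; `paired_only` is a hypothetical helper, and the demo dicts use placeholder strings in place of decoded images.

```python
def paired_only(samples):
    """Yield only samples that carry a clean ground-truth image."""
    for sample in samples:
        if sample.get("gt") is not None:
            yield sample

# Stand-in dicts shaped like the decode() output; strings replace real images.
demo = [
    {"source": "hazespace", "gt": "<image>", "haze": "<image>", "tag": "out_fog_120"},
    {"source": "ots", "gt": None, "haze": "<image>", "tag": "out_fog_120"},
]

kept = [s["source"] for s in paired_only(demo)]
```

The same predicate could likely be attached to the streaming pipeline itself, for example through WebDataset's `select` filter, once the OTS originals are either fetched separately or intentionally excluded.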
HALO HDRI environments are sourced from Poly Haven (CC0 1.0 Universal, public domain) and freepoly.org. The 4,945 HALO samples span 118 distinct HDRIs across coast, mountain, urban, industrial, pastoral, and aerial-landscape categories. The companion HF dataset card lists a representative subset.
Synthetic Haze builds on six upstream image datasets — HazeSpace2M, RESIDE-ITS, RESIDE-OTS, WSRD, Flare7K++ (Flare-R subset), and ISTD. Each subset retains its own upstream license; please consult docs/upstream_licenses.md and cite every subset whose data you use.
@article{zhang2025uniser,
title={UniSER: A Foundation Model for Unified Soft Effects Removal},
author={Zhang, Jingdong and Zhang, Lingzhi and Liu, Qing and Chiu, Mang Tik and Barnes, Connelly and Wang, Yizhou and You, Haoran and Liu, Xiaoyang and Zhou, Yuqian and Lin, Zhe and others},
journal={arXiv preprint arXiv:2511.14183},
year={2025}
}