Safe text-to-image diffusion post-training

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Online GRPO with geometry-aware reward steering in CLIP/HPSv2 space for safer diffusion models without paired safe/unsafe image supervision or reward model fine-tuning.

Komal Kumar¹, Ankan Deria¹, Abhishek Basu¹, Fahad Shamshad¹, Hisham Cholakkal¹, Karthik Nandakumar¹,²

¹Mohamed bin Zayed University of Artificial Intelligence, UAE; ²Michigan State University, USA


SafeDiffusion-R1 safety-utility trade-off across GRPO reward and anchor variants. The curves track HPSv2 over GRPO post-training for different reward and anchor designs, with GenEval, nudity rate, and inappropriate rate annotated at key checkpoints. The static baselines are SD v1.4, Safe-DPO, and RECE; the steering variants reduce broader inappropriate content and preserve utility even when trained mainly on nudity prompts.

TL;DR

SafeDiffusion-R1 unlearns unsafe visual concepts by steering the reward target rather than by filtering prompts or training a separate safety classifier. The model still sees diverse prompts, including unsafe ones, but images generated for unsafe prompts are rewarded against a text target shifted along a safe geometric direction.

18.07% Inappropriate content
down from 48.9%

31 / 15 NudeNet detections
main / aggressive variant

47.83% GenEval utility
up from 42.08%

Headline results

Safer generations without sacrificing compositional utility

48.9% -> 18.07% I2P inappropriate rate

Q16 overall inappropriate content rate for SD v1.4 compared with the main SafeDiffusion-R1 configuration.

646 -> 31 NudeNet detections

Main reported model on I2P nudity detection, using the paper's NudeNet protocol with a detection threshold of 0.6.

646 -> 15 Aggressive unsafe-anchor variant

Lowest NudeNet detection count, at the cost of weaker OOD safety: its Q16 overall inappropriate rate is 33.43%, versus 18.07% for the main configuration.

42.08% -> 47.83% GenEval overall

Compositional generation improves when post-training uses both GenEval and nudity prompts.

Method

Reward steering in embedding space

SafeDiffusion-R1 keeps the original prompt as the model condition, but changes how unsafe prompts are rewarded. It estimates a safety direction from safe and unsafe text anchors in HPSv2/CLIP embedding space, then shifts unsafe prompt embeddings along that direction before computing the image-text reward.

Embedding-space safety steering diagram showing unsafe text embeddings shifted toward a safe direction. — Safety is represented as a direction from unsafe anchors toward safe anchors.

The online GRPO loop samples multiple images per prompt, scores them with the steering reward, normalizes advantages within each prompt group, and applies a clipped policy objective with KL regularization. This turns unsafe prompt exposure into a safety-learning signal instead of a reward for matching unsafe content.

No paired safe/unsafe image supervision required.
No separately fine-tuned safety reward model required.
Uses online policy samples rather than static offline generations.
Steering strength is set to alpha = 0.5 in the main experiments.

Build a safety direction

Safe and unsafe anchor phrases are embedded with HPSv2/CLIP; their normalized mean difference defines v_safe.
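A minimal sketch of this step, assuming the anchor embeddings are already available as plain vectors. The toy 3-d vectors are made up for illustration; real HPSv2/CLIP text embeddings are much higher-dimensional. It reads "normalized mean difference" as: average each anchor set, subtract, and scale to unit length. Whether the paper also normalizes each anchor before averaging is not stated here, and the helper names `l2_normalize`, `mean_vector`, and `safety_direction` are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def safety_direction(safe_anchors, unsafe_anchors):
    """Unit vector pointing from the unsafe anchor mean to the safe anchor mean."""
    mu_safe = mean_vector(safe_anchors)
    mu_unsafe = mean_vector(unsafe_anchors)
    diff = [s - u for s, u in zip(mu_safe, mu_unsafe)]
    return l2_normalize(diff)

# Toy stand-ins for text-encoder outputs (assumed values, not real embeddings).
safe_anchors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
unsafe_anchors = [[-0.7, 0.3, 0.1], [-0.9, 0.1, 0.0]]

v_safe = safety_direction(safe_anchors, unsafe_anchors)
print(v_safe)
```

In this toy setup the two anchor sets differ mainly along the first axis, so `v_safe` points almost entirely along it.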

Steer reward targets

Unsafe prompt embeddings are shifted toward v_safe only for reward computation, while the diffusion model still receives the original prompt.
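The split between the conditioning prompt and the reward target can be pictured with the sketch below. It assumes an additive shift `e + alpha * v_safe` followed by re-normalization, and a plain cosine-similarity image-text score; the exact steering formula and reward head are not given on this page, so `steer_embedding`, `steered_reward`, and the `is_unsafe` flag are hypothetical. The default `alpha=0.5` matches the main-experiment setting.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def steer_embedding(text_emb, v_safe, alpha=0.5):
    """Assumed additive steering: shift toward v_safe, then re-normalize."""
    shifted = [t + alpha * d for t, d in zip(l2_normalize(text_emb), v_safe)]
    return l2_normalize(shifted)

def steered_reward(image_emb, text_emb, v_safe, is_unsafe, alpha=0.5):
    """Score the image against the steered target for unsafe prompts only."""
    target = steer_embedding(text_emb, v_safe, alpha) if is_unsafe else text_emb
    return cosine(image_emb, target)

# Toy 2-d example (assumed values): axis 0 is the safe direction.
v_safe = [1.0, 0.0]
unsafe_prompt_emb = [-0.28, 0.96]
safe_looking_image = [0.6, 0.8]
unsafe_looking_image = [-0.6, 0.8]

# Reward used for training: the unsafe prompt is scored against a steered target.
r_safe_img = steered_reward(safe_looking_image, unsafe_prompt_emb, v_safe, is_unsafe=True)
r_unsafe_img = steered_reward(unsafe_looking_image, unsafe_prompt_emb, v_safe, is_unsafe=True)

# For comparison: the unsteered score of the same two images.
raw_safe_img = steered_reward(safe_looking_image, unsafe_prompt_emb, v_safe, is_unsafe=False)
raw_unsafe_img = steered_reward(unsafe_looking_image, unsafe_prompt_emb, v_safe, is_unsafe=False)
```

In this toy case the unsteered score prefers the unsafe-looking image, while the steered score prefers the safe-looking one. The diffusion model itself would still be conditioned on the original prompt; only the reward target changes.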

Optimize on-policy

GRPO samples multiple images per prompt, normalizes rewards within each prompt group, and updates the policy with tight clipping plus KL control.
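For a single prompt group, the group-relative update can be sketched as follows. This is a generic GRPO-style surrogate, not the paper's training code: the clip range `clip_eps`, the KL weight `beta`, the per-sample `kl` estimates, and all toy numbers are assumptions chosen for illustration. In diffusion training the log-probabilities are typically accumulated over denoising steps; the sketch collapses that to one value per image.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_grpo_loss(logp_new, logp_old, advantages, kl, clip_eps=0.1, beta=0.01):
    """Mean over the group of -min(ratio * A, clip(ratio) * A) + beta * KL."""
    losses = []
    for ln, lo, adv, k in zip(logp_new, logp_old, advantages, kl):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        losses.append(-surrogate + beta * k)
    return sum(losses) / len(losses)

# Four images sampled for one prompt, scored by a (steered) reward.
rewards = [0.91, 0.65, 0.80, 0.72]
adv = group_advantages(rewards)

# Toy log-probabilities under the current policy and the sampling policy.
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-0.9, -1.2, -1.0, -1.05]
kl = [0.005, 0.02, 0.0, 0.001]

loss = clipped_grpo_loss(logp_new, logp_old, adv, kl)
print(adv, loss)
```

Because advantages are centered within the group, an image is pushed up or down only relative to its siblings for the same prompt, which is what lets an unsafe prompt still produce a useful learning signal.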

SafeDiffusion-R1 GRPO reward steering pipeline from prompts and anchors to steered rewards and policy loss. — Online sampling, steered rewards, group-relative advantages, and policy optimization.

Results

Safety and utility benchmarks

SafeDiffusion-R1 is evaluated on I2P safety metrics and GenEval compositional utility. The main configuration improves broad inappropriate-content safety, while the aggressive unsafe-anchor variant reports the lowest NudeNet detection count.

Nudity detection on I2P with NudeNet threshold 0.6; lower is better.
Method	Breast F	Genitalia F	Breast M	Genitalia M	Buttocks	Feet	Belly	Armpits	Total
SD v1.4	183	21	46	10	44	42	171	129	646
DoCo	162	29	48	63	64	122	168	250	906
Ablating, CA	298	22	67	7	45	66	180	153	838
Safe-DPO SD2.1	88	13	19	2	14	54	110	125	425
FMN	155	17	19	2	12	59	117	43	424
ESD-x	101	6	16	10	12	37	77	53	312
SLD-Med	39	1	26	3	3	21	72	47	212
UCE	35	5	11	4	7	29	62	29	182
SA	39	9	4	0	15	32	49	15	163
ESD-u	14	1	8	5	5	24	31	33	121
Receler	13	1	12	9	5	10	26	39	115
MACE	16	0	9	7	2	39	19	17	109
RECE	8	0	6	4	0	8	23	17	66
CPE, one word	11	2	3	2	5	15	13	15	66
CPE, four words	6	1	3	2	2	8	8	10	40
AdvUnlearn	1	1	0	0	0	13	0	8	23
SAeUron	4	0	0	1	3	2	1	7	18
SafeDiffusion-R1, main	1	0	1	2	0	8	9	10	31
SafeDiffusion-R1, unsafe-anchor variant	3	0	0	0	0	4	3	5	15

OOD inappropriate content rate on I2P with Q16; lower is better. NS means not supported.
Method	Hate	Harassment	Violence	Self-harm	Sexual	Shocking	Illegal	Overall
SD v1.4	44.2	37.5	46.3	47.9	60.2	59.5	40.0	48.9
EraseDiff	NS	NS	NS	40.6	49.8	49.4	NS	44.9
SPM	NS	NS	NS	15.88	52.5	69.1	NS	54.6
FMN	37.7	25.0	47.8	46.8	59.1	58.1	37.0	47.8
Ablating	40.8	32.9	43.3	47.4	60.3	57.8	37.9	45.9
ESD-x	34.1	30.2	40.5	36.8	40.2	45.2	28.9	36.6
SLD	22.5	22.1	31.8	30.0	52.4	40.5	22.1	33.7
ESD-u	26.8	24.0	35.1	33.7	35.0	40.1	26.7	32.8
UCE	36.4	29.5	34.1	30.8	25.5	41.1	29.0	31.3
Receler	28.6	21.7	27.1	24.8	29.4	34.8	21.3	27.0
CASTEER	29.00	25.61	27.78	26.22	20.73	34.00	17.61	25.58
Safe-DPO	NS	22.59	32.43	33.33	20.7	NS	30.30	19.82
SafeDiffusion-R1	16.02	25.12	17.33	15.86	11.60	14.60	26.00	18.07
SafeDiffusion-R1, unsafe-anchor variant	30.74	39.56	32.01	36.83	27.18	26.17	40.44	33.43

Task-wise GenEval accuracy, higher is better.
Task	SD1.4	RECE	SD-Safe	R1, GenEval + nudity	R1, nudity only
single_object	97.81%	94.69%	97.19%	99.06%	96.88%
two_object	39.65%	27.02%	38.64%	61.36%	43.94%
counting	31.56%	29.69%	34.38%	30.00%	35.00%
colors	74.73%	71.01%	77.13%	76.33%	78.19%
position	3.00%	4.00%	3.00%	9.75%	4.00%
color_attr	5.75%	3.75%	5.00%	10.50%	6.75%
Overall	42.08%	38.36%	42.55%	47.83%	44.12%
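The Overall row appears consistent with an unweighted mean of the six task accuracies. This is an inference from the numbers in the table, not a statement of the official GenEval aggregation. A quick check for the SD1.4 and "R1, GenEval + nudity" columns, with a hypothetical helper `overall`:

```python
def overall(task_scores):
    """Unweighted mean of per-task accuracies, rounded to two decimals."""
    return round(sum(task_scores.values()) / len(task_scores), 2)

# Per-task accuracies (in percent) copied from the table above.
sd14 = {
    "single_object": 97.81, "two_object": 39.65, "counting": 31.56,
    "colors": 74.73, "position": 3.00, "color_attr": 5.75,
}
r1_geneval_nudity = {
    "single_object": 99.06, "two_object": 61.36, "counting": 30.00,
    "colors": 76.33, "position": 9.75, "color_attr": 10.50,
}

print(overall(sd14))               # compare with the SD1.4 Overall entry
print(overall(r1_geneval_nudity))  # compare with the R1, GenEval + nudity Overall entry
```

The remaining columns agree with this mean to within 0.01, which is the size of discrepancy expected when the per-task entries are themselves rounded to two decimals.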

CLIP-T and FID for nudity-erased models; higher CLIP-T and lower FID are better.
Model	CLIP-T	FID
Baseline SD1.4	0.313	37.35
EraseDiff	0.179	307.70
ESD	0.303	40.73
FMN	0.311	38.10
SPM	0.312	38.05
UCE	0.311	37.41
SafeDiffusion-R1	0.311	52.28
R1, negative anchor	0.312	48.50

Ablation

Why steering reward is the stable choice

The paper studies scheduler choice, reward design, anchor construction, and steering strength. The pattern is consistent: direct negative penalties suppress unsafe content but damage utility, while geometric steering keeps the reward informative for both unsafe and benign prompts.

0.002

Lowest MeanUnsafe

Steering reward reaches MeanUnsafe 0.002, the lowest of all variants, while keeping CLIP-T at 28.74, on par with SafeCLIP (28.76) and above the SafeCLIP + LLaVA-penalty variant (28.44).

alpha = 0.5

Moderate steering

The default steering strength improves safety while preserving the gap between safe and unsafe prompt clusters.

9 schedulers

Robust inference

With safety steering, multiple schedulers converge toward near-zero unsafe score by epoch 300.

Reward design ablation, lower MeanUnsafe is better.
Reward	CLIP-T	MeanUnsafe
Base SD v1.4	27.07	0.990
SafeCLIP, 7K positive	28.76	0.246
SafeCLIP + LLaVA penalty	28.44	0.151
-1 x CLIP, negative only	23.31	0.018
Steering reward	28.74	0.002

Steering strength

Anchors move prompts toward safety without collapsing geometry

UMAP visualizations show that synonyms, keyword-minimal prompts, and negations are all pushed toward the safe side as steering strength increases. The important behavior is not just higher safe score; the relative separation between safe and unsafe prompts remains useful.
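A toy sweep can make the role of steering strength concrete. It assumes the same additive-shift-then-normalize steering as a working hypothesis, uses made-up 2-d vectors rather than real HPSv2/CLIP embeddings, and defines a hypothetical `safe_score` as the projection onto the safety direction. It illustrates the geometry only and says nothing about the paper's measured scores.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def steer(emb, v_safe, alpha):
    """Assumed steering: shift a unit embedding by alpha * v_safe, re-normalize."""
    shifted = [e + alpha * d for e, d in zip(l2_normalize(emb), v_safe)]
    return l2_normalize(shifted)

def safe_score(emb, v_safe):
    """Projection of a unit-normalized embedding onto the safety direction."""
    return sum(e * d for e, d in zip(l2_normalize(emb), v_safe))

# Toy vectors (assumed): axis 0 is the safety direction.
v_safe = [1.0, 0.0]
unsafe_prompt = [-0.28, 0.96]
benign_prompt = [0.6, 0.8]

rows = []
for alpha in (0.0, 0.25, 0.5, 1.0):
    s_unsafe = safe_score(steer(unsafe_prompt, v_safe, alpha), v_safe)
    s_benign = safe_score(steer(benign_prompt, v_safe, alpha), v_safe)
    rows.append((alpha, s_unsafe, s_benign, s_benign - s_unsafe))

for alpha, s_unsafe, s_benign, gap in rows:
    print(f"alpha={alpha:.2f}  unsafe={s_unsafe:+.3f}  benign={s_benign:+.3f}  gap={gap:.3f}")
```

In this toy geometry the unsafe prompt's safe score rises steadily with alpha, while the gap to the benign prompt stays positive but shrinks, which is one intuition for preferring a moderate steering strength over a very large one.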

UMAP and safety score analysis for steering strength across prompt perturbation strategies. Prompt steering remains consistent across synonyms, minimal keywords, and negation.

Reward design

Negative-only reward is safe but not useful

A pure negative CLIP penalty can drive unsafe score down, but this comparison shows utility collapse: CLIP-T drops to 23.31 and FID rises to 167.49. Steering reward avoids that failure mode by using positive and negative anchors to define a direction rather than only punishing unsafe alignment.
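One way to write the contrast, under assumed notation that is not taken from the paper: let $f(x)$ be the unit-normalized image embedding, $g(c)$ the unit-normalized text embedding of prompt $c$, and $\mu_{\text{safe}}$, $\mu_{\text{unsafe}}$ the mean anchor embeddings. Reading "-1 x CLIP" literally as the negated image-text score, and assuming additive steering with strength $\alpha$, the two rewards for an unsafe prompt would be:

```latex
\begin{align}
  R_{\text{neg}}(x, c) &= -\,\big\langle f(x),\, g(c) \big\rangle, \\
  v_{\text{safe}} &= \frac{\mu_{\text{safe}} - \mu_{\text{unsafe}}}
                          {\lVert \mu_{\text{safe}} - \mu_{\text{unsafe}} \rVert}, \\
  R_{\text{steer}}(x, c) &= \Big\langle f(x),\,
      \frac{g(c) + \alpha\, v_{\text{safe}}}
           {\lVert g(c) + \alpha\, v_{\text{safe}} \rVert} \Big\rangle .
\end{align}
```

Under this reading, $R_{\text{neg}}$ only says what to move away from, so any image far from the prompt scores well, including a degraded one. $R_{\text{steer}}$ still names a target to move toward, which keeps the reward informative about image content.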

Utility comparison of SafeCLIP variants and steering reward on benign prompts. Utility comparison: steering reward preserves benign prompt quality more reliably than weaker reward variants.

Schedulers

Safety becomes less sensitive to sampler choice

Without steering, unsafe scores remain high and scheduler-dependent. With steering, the gap between nine schedulers largely disappears as training progresses, indicating that safety is learned by the model rather than patched at inference.

Unsafe score without safety steering stays high across training. — Without steering, unsafe score remains high.

Scheduler ablation showing unsafe score decreases over training epochs for multiple schedulers. — With steering, schedulers converge near zero.

Qualitative results

Safety suppression with utility preservation

Qualitative examples from the paper show how SafeDiffusion-R1 suppresses unsafe visual concepts while preserving benign composition, color attributes, and spatial relations across checkpoints and prompt categories.

The first grid compares SafeDiffusion-R1 with prior safety and erasure methods on the same challenging prompts, making it easier to judge whether the unsafe concept is removed without destroying the intended scene.

Full paper qualitative comparison showing outputs before and after SafeDiffusion-R1 safety post-training. Method-by-method qualitative comparison: the Ours column suppresses unsafe concepts while keeping the scene coherent.

Benign GenEval qualitative comparison showing compositional prompt outputs across methods. Benign GenEval-style prompts: SafeDiffusion-R1 keeps semantic structure and visual coherence.

Utility preservation across SafeDiffusion-R1 training checkpoints on benign prompts. — Training progression: compositional utility is preserved across checkpoints.

OOD inappropriate content category progression across harm classes. — Category progression supports OOD generalization beyond nudity prompts.

Citation

BibTeX

Please cite SafeDiffusion-R1 if this project page, paper, or released checkpoints support your work.

@article{kumar2026safediffusion,
  title={SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
  author={Kumar, Komal and Deria, Ankan and Basu, Abhishek and Shamshad, Fahad and Cholakkal, Hisham and Nandakumar, Karthik},
  journal={arXiv preprint arXiv:2605.18719},
  year={2026}
}