Safe text-to-image diffusion post-training

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Online GRPO with geometry-aware reward steering in CLIP/HPSv2 space for safer diffusion models without paired safe/unsafe image supervision or reward model fine-tuning.

Komal Kumar¹, Ankan Deria¹, Abhishek Basu¹, Fahad Shamshad¹, Hisham Cholakkal¹, Karthik Nandakumar¹,²
¹ Mohamed bin Zayed University of Artificial Intelligence, UAE; ² Michigan State University, USA
SafeDiffusion-R1 safety-utility trade-off across GRPO reward and anchor variants.
The curves track HPSv2 score over GRPO post-training for different reward and anchor designs, with GenEval, nudity rate, and inappropriate rate annotated at key checkpoints. SD v1.4, Safe-DPO, and RECE appear as static baselines. The steering variants reduce broader inappropriate content and preserve utility even when trained mainly on nudity prompts.

TL;DR

SafeDiffusion-R1 unlearns unsafe visual concepts by steering the reward target, not by filtering prompts or training a separate safety classifier. The model still sees diverse prompts, including unsafe ones, but the reward for an unsafe prompt is computed against a text embedding shifted along a safe geometric direction.

18.07% inappropriate content (down from 48.9%)
31 / 15 NudeNet detections (main / aggressive variant)
47.83% GenEval utility (up from 42.08%)
Headline results

Safer generations without sacrificing compositional utility

48.9% -> 18.07% I2P inappropriate rate

Q16 overall inappropriate content rate for SD v1.4 compared with the main SafeDiffusion-R1 configuration.

646 -> 31 NudeNet detections

Main reported model on I2P nudity detection, using the paper's protocol with a NudeNet threshold of 0.6.

646 -> 15 Aggressive unsafe-anchor variant

Lowest NudeNet detection count, with the paper's noted trade-off in broader OOD safety.

42.08% -> 47.83% GenEval overall

Compositional generation improves when post-training with GenEval and nudity prompts.
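
As a quick arithmetic check on the headline numbers, the relative changes can be computed directly from the before/after pairs quoted in the cards. The helper name `relative_change` is ours and does not come from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

# Before/after pairs copied from the headline cards.
headline = {
    "I2P inappropriate rate (%)": (48.9, 18.07),
    "NudeNet detections, main": (646, 31),
    "NudeNet detections, unsafe-anchor variant": (646, 15),
    "GenEval overall (%)": (42.08, 47.83),
}

for name, (before, after) in headline.items():
    print(f"{name}: {relative_change(before, after):+.1f}% relative change")
```

Read this way, the inappropriate rate falls by roughly 63% in relative terms, and GenEval rises by 5.75 points absolute, about 14% relative.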

Method

Reward steering in embedding space

SafeDiffusion-R1 keeps the original prompt as the model condition, but changes how unsafe prompts are rewarded. It estimates a safety direction from safe and unsafe text anchors in HPSv2/CLIP embedding space, then shifts unsafe prompt embeddings along that direction before computing the image-text reward.

Embedding-space safety steering diagram showing unsafe text embeddings shifted toward a safe direction.
Safety is represented as a direction from unsafe anchors toward safe anchors.

The online GRPO loop samples multiple images per prompt, scores them with the steering reward, normalizes advantages within each prompt group, and applies a clipped policy objective with KL regularization. This turns unsafe prompt exposure into a safety-learning signal instead of a reward for matching unsafe content.

  • No paired safe/unsafe image supervision required.
  • No separately fine-tuned safety reward model required.
  • Uses online policy samples rather than static offline generations.
  • Steering strength is set to alpha = 0.5 in the main experiments.
01 Build a safety direction

Safe and unsafe anchor phrases are embedded with HPSv2/CLIP; their normalized mean difference defines v_safe.
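
A minimal sketch of this step, assuming the anchors are already embedded: average each anchor set, subtract, and normalize. The toy 3-D vectors stand in for HPSv2/CLIP text embeddings, and normalizing each anchor before averaging is our assumption; the page only states that v_safe is the normalized mean difference.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def safety_direction(safe_anchors, unsafe_anchors):
    """Unit vector pointing from the unsafe-anchor mean to the safe-anchor mean."""
    mu_safe = mean_vector([normalize(a) for a in safe_anchors])
    mu_unsafe = mean_vector([normalize(a) for a in unsafe_anchors])
    return normalize([s - u for s, u in zip(mu_safe, mu_unsafe)])

# Made-up 3-D stand-ins for text-encoder outputs (illustration only).
safe_anchors = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]
unsafe_anchors = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]

v_safe = safety_direction(safe_anchors, unsafe_anchors)
print([round(x, 4) for x in v_safe])
```

In a real pipeline the anchor lists would come from the text encoder applied to the safe and unsafe anchor phrases; the arithmetic stays the same.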

02 Steer reward targets

Unsafe prompt embeddings are shifted toward v_safe only for reward computation, while the diffusion model still receives the original prompt.
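
One way to picture this step, as a sketch under stated assumptions: the steered target is taken to be normalize(e + alpha * v_safe), cosine similarity stands in for the HPSv2/CLIP image-text score, and the `is_unsafe` flag is a hypothetical input, since the page does not say how prompts are marked unsafe. The exact steering formula in the paper may differ.

```python
import math

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity, standing in for the HPSv2/CLIP image-text score."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def steer(text_emb, v_safe, alpha=0.5):
    """Shift a unit text embedding along v_safe and renormalize (assumed form)."""
    e = normalize(text_emb)
    return normalize([x + alpha * d for x, d in zip(e, v_safe)])

def steered_reward(image_emb, text_emb, v_safe, is_unsafe, alpha=0.5):
    """Score against the steered target only when the prompt is flagged unsafe."""
    target = steer(text_emb, v_safe, alpha) if is_unsafe else text_emb
    return cosine(image_emb, target)

# Toy 3-D embeddings (made-up numbers): axis 0 ~ safe, axis 1 ~ unsafe.
v_safe = normalize([1.0, -1.0, 0.0])
prompt_emb = [0.2, 0.9, 0.1]   # unsafe prompt, passed unchanged to the generator
literal_img = [0.2, 0.9, 0.1]  # image that matches the unsafe prompt
safer_img = [0.6, 0.6, 0.1]    # image shifted toward safe content

for name, img in [("literal", literal_img), ("safer", safer_img)]:
    plain = steered_reward(img, prompt_emb, v_safe, is_unsafe=False)
    steered = steered_reward(img, prompt_emb, v_safe, is_unsafe=True)
    print(f"{name}: unsteered={plain:.3f} steered={steered:.3f}")
```

In this toy setup the steered reward ranks the safer image above the literal one, while the unsteered score does the opposite. Only the reward target changes; the prompt fed to the diffusion model is untouched.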

03 Optimize on-policy

GRPO samples multiple images per prompt, normalizes rewards within each prompt group, and updates the policy with tight clipping plus KL control.
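
The group normalization and clipped update can be sketched in the standard GRPO form. This is not the paper's exact loss: `clip_eps` and `kl_coef` are hypothetical parameters with placeholder values, and the probability ratios and KL terms, which would come from the diffusion policy's per-step log-probabilities, are plain numbers here.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_loss(ratios, advantages, kls, clip_eps, kl_coef):
    """Negative clipped surrogate plus a KL penalty, averaged over the group."""
    total = 0.0
    for rho, adv, kl in zip(ratios, advantages, kls):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, rho))
        surrogate = min(rho * adv, clipped * adv)  # pessimistic (clipped) bound
        total += -surrogate + kl_coef * kl
    return total / len(ratios)

# Four images sampled for the same prompt, scored by the steered reward (toy values).
rewards = [0.30, 0.25, 0.35, 0.30]
advantages = group_advantages(rewards)

# On-policy sampling: ratios start at 1 and the KL to the reference is 0.
loss = grpo_loss(
    ratios=[1.0, 1.0, 1.0, 1.0],
    advantages=advantages,
    kls=[0.0, 0.0, 0.0, 0.0],
    clip_eps=0.2,   # placeholder; the page only says clipping is "tight"
    kl_coef=0.01,   # placeholder
)
print([round(a, 3) for a in advantages], round(loss, 6))
```

Because advantages are centered within each prompt group, an image is only rewarded relative to its siblings for the same prompt, which is what lets unsafe prompts contribute a learning signal without an absolute reward scale.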

SafeDiffusion-R1 GRPO reward steering pipeline from prompts and anchors to steered rewards and policy loss.
Online sampling, steered rewards, group-relative advantages, and policy optimization.
Results

Safety and utility benchmarks

SafeDiffusion-R1 is evaluated on I2P safety metrics and GenEval compositional utility. The main configuration improves broad inappropriate-content safety, while the aggressive unsafe-anchor variant reports the lowest NudeNet detection count.

Nudity detection on I2P with NudeNet threshold 0.6; lower is better.
Method Breast F Genitalia F Breast M Genitalia M Buttocks Feet Belly Armpits Total
SD v1.41832146104442171129646
DoCo16229486364122168250906
Ablating, CA298226774566180153838
Safe-DPO SD2.188131921454110125425
FMN15517192125911743424
ESD-x1016161012377753312
SLD-Med3912633217247212
UCE3551147296229182
SA3994015324915163
ESD-u141855243133121
Receler1311295102639115
MACE160972391917109
RECE806408231766
CPE, one word11232515131566
CPE, four words61322881040
AdvUnlearn11000130823
SAeUron4001321718
SafeDiffusion-R1, main10120891031
SafeDiffusion-R1, unsafe-anchor variant3000043515

OOD inappropriate content rate (%) on I2P with Q16; lower is better. NS means not supported.
Method                                   Hate   Harassment  Violence  Self-harm  Sexual  Shocking  Illegal  Overall
SD v1.4                                  44.2   37.5        46.3      47.9       60.2    59.5      40.0     48.9
EraseDiff                                NS     NS          NS        40.6       49.8    49.4      NS       44.9
SPM                                      NS     NS          NS        15.88      52.5    69.1      NS       54.6
FMN                                      37.7   25.0        47.8      46.8       59.1    58.1      37.0     47.8
Ablating                                 40.8   32.9        43.3      47.4       60.3    57.8      37.9     45.9
ESD-x                                    34.1   30.2        40.5      36.8       40.2    45.2      28.9     36.6
SLD                                      22.5   22.1        31.8      30.0       52.4    40.5      22.1     33.7
ESD-u                                    26.8   24.0        35.1      33.7       35.0    40.1      26.7     32.8
UCE                                      36.4   29.5        34.1      30.8       25.5    41.1      29.0     31.3
Receler                                  28.6   21.7        27.1      24.8       29.4    34.8      21.3     27.0
CASTEER                                  29.00  25.61       27.78     26.22      20.73   34.00     17.61    25.58
Safe-DPO                                 NS     22.59       32.43     33.33      20.7    NS        30.30    19.82
SafeDiffusion-R1                         16.02  25.12       17.33     15.86      11.60   14.60     26.00    18.07
SafeDiffusion-R1, unsafe-anchor variant  30.74  39.56       32.01     36.83      27.18   26.17     40.44    33.43

Task-wise GenEval accuracy, higher is better.
Task           SD1.4   RECE    SD-Safe  R1, GenEval + nudity  R1, nudity only
single_object  97.81%  94.69%  97.19%   99.06%                96.88%
two_object     39.65%  27.02%  38.64%   61.36%                43.94%
counting       31.56%  29.69%  34.38%   30.00%                35.00%
colors         74.73%  71.01%  77.13%   76.33%                78.19%
position       3.00%   4.00%   3.00%    9.75%                 4.00%
color_attr     5.75%   3.75%   5.00%    10.50%                6.75%
Overall        42.08%  38.36%  42.55%   47.83%                44.12%
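
The Overall row appears to be the unweighted mean of the six task accuracies, which is the usual GenEval convention. The check below recomputes it from the rounded table entries; the dictionary names are ours.

```python
# Task-wise accuracies (%) copied from the GenEval table, in the order:
# single_object, two_object, counting, colors, position, color_attr.
task_scores = {
    "SD1.4": [97.81, 39.65, 31.56, 74.73, 3.00, 5.75],
    "RECE": [94.69, 27.02, 29.69, 71.01, 4.00, 3.75],
    "SD-Safe": [97.19, 38.64, 34.38, 77.13, 3.00, 5.00],
    "R1, GenEval + nudity": [99.06, 61.36, 30.00, 76.33, 9.75, 10.50],
    "R1, nudity only": [96.88, 43.94, 35.00, 78.19, 4.00, 6.75],
}

# Overall values as reported in the table.
reported_overall = {
    "SD1.4": 42.08,
    "RECE": 38.36,
    "SD-Safe": 42.55,
    "R1, GenEval + nudity": 47.83,
    "R1, nudity only": 44.12,
}

for model, scores in task_scores.items():
    mean = sum(scores) / len(scores)
    print(f"{model}: recomputed={mean:.3f} reported={reported_overall[model]:.2f}")
```

The recomputed means agree with the reported Overall values to within 0.01 points for all five columns, which is consistent with an unweighted average taken before rounding.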

CLIP-T and FID for nudity-erased models.
Model                CLIP-T  FID
Baseline SD1.4       0.313   37.35
EraseDiff            0.179   307.70
ESD                  0.303   40.73
FMN                  0.311   38.10
SPM                  0.312   38.05
UCE                  0.311   37.41
SafeDiffusion-R1     0.311   52.28
R1, negative anchor  0.312   48.50
Ablation

Why steering reward is the stable choice

The paper studies scheduler choice, reward design, anchor construction, and steering strength. The pattern is consistent: direct negative penalties suppress unsafe content but damage utility, while geometric steering keeps the reward informative for both unsafe and benign prompts.

0.002

Lowest MeanUnsafe

Steering reward reaches MeanUnsafe 0.002 while keeping CLIP-T at 28.74, outperforming SafeCLIP and LLaVA-penalty variants.

alpha = 0.5

Moderate steering

The default steering strength improves safety while preserving the gap between safe and unsafe prompt clusters.

9 schedulers

Robust inference

With safety steering, multiple schedulers converge toward near-zero unsafe score by epoch 300.

Reward design ablation, lower MeanUnsafe is better.
Reward CLIP-T MeanUnsafe
Base SD v1.4 27.07 0.990
SafeCLIP, 7K positive 28.76 0.246
SafeCLIP + LLaVA penalty 28.44 0.151
-1 x CLIP, negative only 23.31 0.018
Steering reward 28.74 0.002
Steering strength

Anchors move prompts toward safety without collapsing geometry

UMAP visualizations show that synonyms, keyword-minimal prompts, and negations are all pushed toward the safe side as steering strength increases. The important behavior is not just higher safe score; the relative separation between safe and unsafe prompts remains useful.
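
The effect of steering strength can be illustrated with a small sweep. The steering form normalize(e + alpha * v_safe) and the score, cosine with v_safe, are assumptions made for this toy, not the paper's definitions, and the two 3-D prompt vectors are made up.

```python
import math

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def safe_score(text_emb, v_safe, alpha):
    """Cosine between the steered embedding and v_safe (illustrative score)."""
    e = normalize(text_emb)
    steered = normalize([x + alpha * d for x, d in zip(e, v_safe)])
    return sum(s * d for s, d in zip(steered, v_safe))

v_safe = normalize([1.0, -1.0, 0.0])
prompts = {
    "unsafe-like": [0.2, 0.9, 0.1],  # made-up embedding near the unsafe side
    "benign-like": [0.9, 0.2, 0.1],  # made-up embedding near the safe side
}
alphas = [0.0, 0.25, 0.5, 1.0]

scores = {
    name: [safe_score(emb, v_safe, a) for a in alphas]
    for name, emb in prompts.items()
}
for name, row in scores.items():
    print(name, [round(s, 3) for s in row])
```

In this toy both prompts move toward the safe side as alpha grows, and the benign-like prompt stays ahead of the unsafe-like one at every alpha, so the ordering between the two is not lost.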

UMAP and safety score analysis for steering strength across prompt perturbation strategies.
Prompt steering remains consistent across synonyms, minimal keywords, and negation.
Reward design

Negative-only reward is safe but not useful

A pure negative CLIP penalty can drive unsafe score down, but this comparison shows utility collapse: CLIP-T drops to 23.31 and FID rises to 167.49. Steering reward avoids that failure mode by using positive and negative anchors to define a direction rather than only punishing unsafe alignment.
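
A toy comparison makes the failure mode concrete. Everything here is illustrative: the 4-D axes are hand-built, the negative-only reward is taken to be minus the cosine between image and prompt (our reading of "-1 x CLIP, negative only"), and the steering form is the assumed normalize(e + alpha * v_safe).

```python
import math

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity, standing in for a CLIP-style image-text score."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Hand-built toy axes: [topic, safe content, unsafe content, unrelated].
prompt = [0.8, 0.0, 0.6, 0.0]  # on-topic prompt with an unsafe component
v_safe = normalize([0.0, 1.0, -1.0, 0.0])
images = {
    "literal": [0.8, 0.0, 0.6, 0.0],        # renders the unsafe prompt faithfully
    "safe_on_topic": [0.8, 0.6, 0.0, 0.0],  # keeps the topic, swaps in safe content
    "off_topic": [0.0, 0.0, 0.0, 1.0],      # ignores the prompt entirely
}

def negative_only_reward(image, prompt):
    """Penalize similarity to the unsafe prompt and nothing else."""
    return -cosine(image, prompt)

def steering_reward(image, prompt, v_safe, alpha=0.5):
    """Reward similarity to the prompt embedding shifted along v_safe."""
    target = [p + alpha * d for p, d in zip(normalize(prompt), v_safe)]
    return cosine(image, target)

neg = {k: negative_only_reward(v, prompt) for k, v in images.items()}
steer = {k: steering_reward(v, prompt, v_safe) for k, v in images.items()}
best_negative = max(neg, key=neg.get)
best_steering = max(steer, key=steer.get)
print("negative-only prefers:", best_negative)
print("steering prefers:", best_steering)
```

Under the negative-only reward the image that ignores the prompt scores highest, which matches the utility collapse described here. The steered target still contains the topic component, so the safe on-topic image wins and the off-topic one scores lowest.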

Utility comparison of SafeCLIP variants and steering reward on benign prompts.
Utility comparison: steering reward preserves benign prompt quality more reliably than weaker reward variants.
Schedulers

Safety becomes less sensitive to sampler choice

Without steering, unsafe scores remain high and scheduler-dependent. With steering, the gap between nine schedulers largely disappears as training progresses, indicating that safety is learned by the model rather than patched at inference.

Unsafe score without safety steering stays high across training.
Without steering, unsafe score remains high.
Scheduler ablation showing unsafe score decreases over training epochs for multiple schedulers.
With steering, schedulers converge near zero.
Qualitative results

Safety suppression with utility preservation

Qualitative examples from the paper show how SafeDiffusion-R1 suppresses unsafe visual concepts while preserving benign composition, color attributes, and spatial relations across checkpoints and prompt categories.

The first grid compares SafeDiffusion-R1 with prior safety and erasure methods on the same challenging prompts, making it easier to judge whether the unsafe concept is removed without destroying the intended scene.

Full paper qualitative comparison showing outputs before and after SafeDiffusion-R1 safety post-training.
Method-by-method qualitative comparison: the Ours column suppresses unsafe concepts while keeping the scene coherent.
Benign GenEval qualitative comparison showing compositional prompt outputs across methods.
Benign GenEval-style prompts: SafeDiffusion-R1 keeps semantic structure and visual coherence.
Utility preservation across SafeDiffusion-R1 training checkpoints on benign prompts.
Training progression: compositional utility is preserved across checkpoints.
OOD inappropriate content category progression across harm classes.
Category progression supports OOD generalization beyond nudity prompts.
Citation

BibTeX

Please cite SafeDiffusion-R1 if this project page, paper, or released checkpoints support your work.

@article{kumar2026safediffusion,
  title={SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training},
  author={Kumar, Komal and Deria, Ankan and Basu, Abhishek and Shamshad, Fahad and Cholakkal, Hisham and Nandakumar, Karthik},
  journal={arXiv preprint arXiv:2605.18719},
  year={2026}
}