Merging Pretrained Experts to Break the Likelihood-Quality Trade-off in Diffusion Models
Diffusion models have revolutionized generative AI, but they often face a fundamental trade-off: improving the likelihood (how well they model the data distribution) can degrade the perceptual quality of generated samples, and vice versa. This article introduces a novel approach using Mixture-of-Experts (MoE) to tackle this challenge by merging pretrained diffusion experts.
Key Takeaways
- Objective Clarity: Merging K=4 pretrained diffusion experts via a gating network can reduce bits-per-pixel (BPP, an NLL measure) while preserving or improving perceptual quality (FID/IS/LPIPS) compared to a single-expert baseline.
- Architecture & Objectives: Utilizes a soft-gated Mixture-of-Experts with a gating network producing weights (w1..wK) at each denoising step. The loss function is L = L_NLL + lambda_load·L_load_balance + lambda_ent·L_entropy, with initial values lambda_load ≈ 0.1 and lambda_ent ≈ 0.01.
- Data Slices & Specialization: Each expert is trained on a distinct data slice (e.g., by class or attributes such as color variance or texture emphasis), and routing is handled by a data-conditioned gate to leverage complementary strengths.
- Evaluation & Benchmarks: Benchmarked on CIFAR-10 (32×32), CelebA-HQ (1024×1024), and LSUN-Church (256×256). Reports bits-per-pixel (NLL), FID, IS, LPIPS, SSIM, PSNR, sampling latency, and total parameter count, including gating overhead and memory footprint.
- Reproducibility & Transparency: Emphasizes providing explicit hyperparameters, data splits, seeds, and a reproducible workflow, including high-level pseudocode.
- Practical Relevance: Anchors the approach in established evaluation practices and integrates perceptual signals to validate realism beyond numeric scores.
Algorithmic Blueprint: Merging Pretrained Experts in Diffusion Models
Diffusion models that leverage a mixture of experts (MoE) split the modeling burden across several specialized pathways. Our approach uses four pretrained diffusion experts, each focused on a distinct slice of the data, and a compact gating network that decides how much each expert contributes at every denoising step. The result is a model that can cover diverse styles and structures without a prohibitive increase in compute or memory.
Experts
We deploy K = 4 pretrained diffusion models as independent experts. Each expert is trained on a distinct data slice, such as:
- Class groups: Different object categories or scenes.
- Color-variance emphasis: Areas with varying color distribution.
- Texture-focused patches: Fine-grained textures and surfaces.
- Semantic subdomains: Coherent semantic regions or styles.
This specialization allows each expert to become a specialist for its slice, while the ensemble covers the full data distribution.
Gating Network
A small encoder, shared across all experts, processes the current noisy state and feeds into a 2-layer MLP that outputs a softmax over K. At each diffusion step t, the gate produces weights w1(t), ..., wK(t) that determine each expert’s contribution to the update:
- The shared encoder extracts a compact, task-relevant representation from the current step-t input.
- The 2-layer MLP maps these features to a K-dimensional vector, which is passed through a softmax to yield the gate weights.
- These gates are time-step dependent, allowing the gating to adapt as denoising progresses.
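The gate described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the feature size, hidden width, and toy weights are all assumptions, and a real model would learn the weights and run on tensors.

```python
import math

K = 4           # number of experts (from the article)
FEAT_DIM = 8    # encoder feature size; illustrative choice
HIDDEN = 16     # MLP hidden width; illustrative choice

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mlp_gate(features, w1, b1, w2, b2):
    """2-layer MLP gate: features -> hidden (ReLU) -> K logits -> softmax."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(h * w for h, w in zip(hidden, row)) + b
              for row, b in zip(w2, b2)]
    return softmax(logits)

# Toy weights (these would be learned in practice).
w1 = [[0.1] * FEAT_DIM for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[0.05 * (k + 1)] * HIDDEN for k in range(K)]
b2 = [0.0] * K

feats = [0.5] * FEAT_DIM   # stand-in for the shared encoder's step-t output
gates = mlp_gate(feats, w1, b1, w2, b2)
assert abs(sum(gates) - 1.0) < 1e-9   # gate weights form a distribution
```

Because the gate consumes the step-t features, the same code naturally produces different weights as denoising progresses.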
Inference Routine
For each denoising step t, we compute the update from each expert and combine them using the gate weights:
Delta_t = sum_{k=1..K} w_k(t) * Delta_t^{(k)}
Here, Delta_t^{(k)} is the update suggested by expert k at step t. A sparse variant further reduces compute by letting only the top-2 gate weights contribute:
Delta_t = sum_{k in Top-2(t)} w_k(t) * Delta_t^{(k)}
In practice, with K = 4, routing through only the top-2 experts roughly halves per-step expert compute with negligible loss in quality, thanks to the experts' complementary strengths.
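The two update rules above can be sketched directly. This is an illustrative reduction to plain vectors (a real model would combine per-pixel noise predictions), and the gate values below are made-up numbers:

```python
def dense_update(gates, expert_updates):
    """Delta_t = sum_k w_k(t) * Delta_t^{(k)} over all K experts."""
    dim = len(expert_updates[0])
    return [sum(w * upd[i] for w, upd in zip(gates, expert_updates))
            for i in range(dim)]

def top2_update(gates, expert_updates):
    """Sparse variant: keep the two largest gates, renormalize, combine."""
    top2 = sorted(range(len(gates)), key=lambda k: gates[k], reverse=True)[:2]
    z = sum(gates[k] for k in top2)
    dim = len(expert_updates[0])
    return [sum((gates[k] / z) * expert_updates[k][i] for k in top2)
            for i in range(dim)]

gates = [0.5, 0.3, 0.15, 0.05]   # example gate weights at some step t
updates = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
dense = dense_update(gates, updates)   # all four experts contribute
sparse = top2_update(gates, updates)   # only experts 0 and 1 contribute
```

Note that the sparse variant renormalizes the surviving gates so the combined update is still a convex combination; whether the original method renormalizes is an assumption here.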
Training Objective
The overall loss combines the standard diffusion objective with two regularizers that shape expert usage:
L = L_NLL + lambda_load * L_load_balance + lambda_ent * L_entropy
- L_NLL is the primary negative log-likelihood / denoising objective.
- L_load_balance encourages uniform usage of experts across a mini-batch, preventing single-expert dominance. It pushes the average gate weight per expert toward an even distribution over the batch.
- L_entropy discourages degenerate gating by promoting higher gate entropy over time steps, preventing gates from collapsing to a single expert. It can be implemented as a term proportional to the negative entropy of the gate distribution p_t(k) over K at each step t.
In essence: L_load_balance promotes fair workload distribution, while L_entropy ensures diverse gating, keeping all experts useful and specialized.
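A minimal sketch of the two regularizers, under the stated interpretation: the load-balance term penalizes squared deviation of per-expert mean gate weight from the uniform 1/K, and the entropy term is the (negative) entropy of each gate distribution. The exact functional forms are assumptions; papers vary here.

```python
import math

K = 4

def load_balance_loss(batch_gates):
    """Squared deviation of each expert's mean gate weight from 1/K.
    batch_gates: list of per-sample gate distributions of length K."""
    n = len(batch_gates)
    means = [sum(g[k] for g in batch_gates) / n for k in range(K)]
    return sum((m - 1.0 / K) ** 2 for m in means)

def entropy_loss(gates):
    """Negative entropy of one gate distribution; minimizing this
    term pushes the gates toward higher entropy."""
    return sum(p * math.log(p) for p in gates if p > 0.0)

lam_load, lam_ent = 0.1, 0.01   # initial values from the article
batch = [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]]
reg = lam_load * load_balance_loss(batch) + lam_ent * sum(
    entropy_loss(g) for g in batch) / len(batch)
```

A perfectly uniform batch gives zero load-balance loss, and a uniform gate attains the minimum of the entropy term, matching the intuition in the paragraph above.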
Parameter Economy
To preserve specialization without ballooning parameter count:
- Share early encoder layers across all experts for efficient input processing.
- Equip each expert with small, dedicated adapters downstream of the shared encoder, adding minimal overhead while maintaining individual specialization.
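The savings from this layout can be made concrete with back-of-the-envelope arithmetic. All counts below are hypothetical round numbers chosen for illustration, not measured model sizes:

```python
# Hypothetical parameter accounting: K independent full models versus
# one shared trunk plus K small per-expert adapters.
K = 4
full_model_params = 40_000_000   # one complete diffusion expert (assumed)
adapter_params = 1_000_000       # one small per-expert adapter (assumed)

naive_total = K * full_model_params                      # no sharing
shared_total = full_model_params + K * adapter_params    # shared trunk

savings = 1.0 - shared_total / naive_total
print(f"shared layout: {shared_total:,} params "
      f"({savings:.0%} smaller than naive duplication)")
```

Even with generous adapter sizes, the shared-trunk layout stays far below the cost of duplicating the full network per expert.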
Data Slices and Training Schedule
Imagine a model learning from several specialized viewpoints, then intelligently combining their wisdom. This section outlines complementary data slices, a two-phase training plan, and practical stability tricks.
Data Slices
We define complementary data slices across the three benchmarks:
- CIFAR-10, low color-variance group: images with narrow color palettes, emphasizing shape and texture.
- CIFAR-10, high color-variance group: images with vibrant, varied colors, encouraging learning of robust color-based features.
- CelebA-HQ, hair color groups: subsets categorized by hair color to test invariance and generalization across appearance variations.
- CelebA-HQ, lighting condition groups: subsets defined by lighting variations to build resilience to illumination changes.
- LSUN-Church, subcategory groups: subcategories capturing different architectural contexts to introduce diversity beyond faces.
Training Schedule
A two-phase training process ensures each expert focuses on its slice before learning to collaborate:
- Phase 1: Warm-start (10–20% of total steps). Warm-start each expert on its dedicated data slice. Monitor per-expert loss and per-slice accuracy.
- Phase 2: Joint training with gating (remaining steps). Train all experts together with the gating network. Tune gating capacity and regularization. Monitor overall accuracy, per-expert contribution, and gating weight distribution.
Practical note: start with a moderate total step count (e.g., 100k), allocating Phase 1 for fair warm-starting, then move to Phase 2 for cross-slice coordination.
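The schedule above reduces to a simple step-budget split. The total step count and warm-start fraction below are the example values from the text:

```python
def phase_for_step(step, total_steps=100_000, warmstart_frac=0.15):
    """Return which training phase a global step belongs to.
    warmstart_frac = 0.15 sits inside the suggested 10-20% range."""
    phase1_steps = int(total_steps * warmstart_frac)
    return "warmstart" if step < phase1_steps else "joint"
```

A training loop would check this once per step to decide whether to sample from a single expert's slice (Phase 1) or train the full gated mixture (Phase 2).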
Regularization and Stability
- Gradient clipping: Clip gradients by global norm (cap around 5.0) to prevent unstable updates.
- Gating dropout: Apply small dropout (0.1–0.2) on the gating module to prevent overreliance and improve robustness.
- Shared seed: Use a single random seed for data shuffling, weight initialization, and dropout for stable comparisons.
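Global-norm clipping, the first trick above, can be sketched in pure Python (frameworks such as PyTorch provide this as a built-in; the list-of-lists gradient representation here is purely illustrative):

```python
import math

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradients jointly so their global L2 norm is at most
    max_norm (the cap of ~5.0 suggested above). grads is a list of
    per-parameter gradient vectors."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads]
```

Clipping by the global norm (rather than per-tensor) preserves the direction of the overall update, which is why it is the usual choice for stabilizing joint expert-plus-gate training.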
Evaluation Protocol and Metrics
Rigorous evaluation is key to understanding efficiency, fairness, and scalability. This section outlines a reproducible protocol and the metrics for evaluating MoE diffusion models.
Quantitative Metrics
- Bits per Pixel (BPP) / Negative Log-Likelihood (NLL): Quantifies data distribution modeling. Report BPP on a held-out test set, specifying the NLL computation method.
- FID (Fréchet Inception Distance) and IS (Inception Score): Measures sample quality and diversity. State the feature extractor, sample count, and any filtering. Report with confidence intervals.
- LPIPS, SSIM, PSNR: Assess perceptual similarity (LPIPS) and pixel-wise fidelity (SSIM, PSNR). Describe preprocessing, color space, and evaluation batch size.
- Sampling latency per image: Measure inference time on a fixed GPU. Report median latency with robust statistics. Specify GPU type and software versions.
- Peak memory usage and total parameter count: Track peak VRAM during sampling and total parameters, including all components.
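As a concrete example of the BPP/NLL reporting above, one common convention converts a per-image NLL in nats to bits per dimension. Whether "pixels" counts color channels varies by paper; the sketch below treats the pixel count as the total number of dimensions, and the example NLL value is made up:

```python
import math

def nll_to_bpp(nll_nats_per_image, num_dims):
    """Convert NLL in nats per image to bits per dimension:
    divide by ln(2) to get bits, then by the dimension count."""
    return nll_nats_per_image / (math.log(2) * num_dims)

# Example: a CIFAR-10 image has 32 * 32 * 3 = 3072 dimensions.
bpp = nll_to_bpp(6600.0, 32 * 32 * 3)
```

Stating this convention explicitly in the report avoids the common off-by-a-factor ambiguity between "per pixel" and "per dimension" numbers.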
Benchmark Datasets and Setup
- CIFAR-10 (32×32): Describe train/test split, preprocessing, and normalization.
- CelebA-HQ (1024×1024 or 256×256): State downsampling, cropping, and alignment steps. Justify resolution choices.
- LSUN-Church (256×256): Describe alignment, cropping, and handling of duplicates.
- Downsampling and preprocessing: Clearly document all data transformations.
Baselines for Comparison
Compare against clear baselines to isolate MoE contributions:
- Baseline DDPM: Single expert (no MoE).
- MoE with Random Gating: Random expert assignment.
- MoE with Data-Slice Alignment but no Load-Balancing: Gating aligned to slices, without explicit balancing loss.
For each baseline, report the full metric suite and relative gains against Baseline DDPM.
Ablations and Ablation Plan
Investigate the impact of key components:
- Remove L_load_balance: assess load-balancing's contribution.
- Remove L_entropy: evaluate the role of the diversity-promoting loss.
- Sparse top-2 gating vs. dense gating: compare routing schemes.
- Vary K (e.g., K = 2, 4, 8): demonstrate the effect of increasing expert count.
For each ablation, report the full metric set with statistical uncertainty.
Experimental Protocol and Reporting Practices
- Reproducibility: Publish code, configurations, scripts, seeds, and hardware details.
- Seeds and statistics: Run multiple seeds (3–5) and report means with standard deviations or confidence intervals.
- Transparency: Document data leakage risks, preprocessing differences, and post-processing.
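The seeds-and-statistics practice above amounts to reporting a mean with dispersion across runs. A minimal sketch using the standard library (the FID values are made-up numbers for illustration):

```python
import statistics

# One metric value per seed, e.g., FID from 4 independent runs.
fid_per_seed = [12.4, 11.9, 12.8, 12.1]

mean = statistics.mean(fid_per_seed)
std = statistics.stdev(fid_per_seed)   # sample standard deviation
print(f"FID = {mean:.2f} ± {std:.2f} over {len(fid_per_seed)} seeds")
```

Reporting the sample standard deviation (or a confidence interval) alongside the mean lets readers judge whether a gap between MoE variants exceeds run-to-run noise.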
Quick guidelines for presenting results: Show compact tables per dataset with baselines, MoE variants, and ablations. Provide narrative explanations for metric changes and discuss when MoE is most beneficial.
Comparison Table: Baseline vs. Merged-Experts Diffusion
| Model | Configuration / Description | Experts | Objective | Evaluation Metrics | Pros | Cons |
|---|---|---|---|---|---|---|
| Model A — Baseline Diffusion (Single Expert) | Baseline diffusion model with a single expert (no MoE). | 1 | L_NLL | BPP, FID, IS | Simple; well-understood | Limited to single-expert capacity; struggles to optimize likelihood and perceptual quality simultaneously |
| Model B — MoE Diffusion (K=4, Soft Gates) | MoE with 4 experts; gating via softmax; joint objective with load-balancing and entropy terms. | 4 | Joint objective with load-balancing and entropy terms | BPP, FID, IS | Improved likelihood coverage; robustness across data slices | Gating adds compute; potential gating misrouting if gate under-trained |
| Model C — MoE Diffusion (K=4, Top-2 Sparse Gates) | MoE with 4 experts; top-2 gating reduces per-step compute. | 4 | Not explicitly stated | BPP, FID, IS | Higher efficiency with near-parity quality | Risk of under-utilization of some experts in edge cases |
| Model D — Data-Slice Specialized MoE (K=4) | Each expert specializes in a distinct data subdomain; routing uses encoder features. | 4 | MoE framework with data-slice specialization | BPP, FID, IS | Stronger specialization; potential large gains on targeted metrics | Data-slice management complexity; risk of overfitting if slices are unbalanced |
Pros and Cons
Pros
- Flexible modeling capacity: each expert captures distinct data modes.
- Potential to break the likelihood–quality trade-off by combining complementary strengths.
- Scalable by adding more experts or data slices.
- Sparse gating reduces inference cost while preserving gains.
Cons
- Higher training complexity and more hyperparameters (K, lambda_load, lambda_ent, gating architecture).
- Gating can misroute samples if data slices are poorly chosen.
- Increased memory footprint and maintenance burden.
- Requires careful evaluation to prevent overfitting to slice-specific artifacts.
This approach offers a compelling direction for advancing diffusion models by effectively merging specialized knowledge. Its potential to overcome the likelihood-quality trade-off makes it a valuable contribution to the field.
