Phased Distribution Matching Distillation for Diffusion Models: Achieving Efficient Few-Step Distillation with Score Matching in Subintervals
This article introduces Phased Distribution Matching Distillation (PDMD), a technique designed to distill large diffusion models into smaller, faster student models. PDMD achieves this by partitioning the diffusion process into multiple subintervals and training a student network to approximate the teacher model’s score function within each of these phases. This approach enables efficient few-step generation with minimal loss in sample fidelity.
Core Concept: Distilling Scores in Subintervals
The core idea of PDMD is to break down the complex diffusion process into manageable segments. We partition the total diffusion horizon (T) into K subintervals, denoted as $I_1, \dots, I_K$. A student score network, $s_\theta$, is trained to closely mimic the teacher score, $s_T$, on each subinterval. This phased approach allows the student model to learn the essential dynamics of the diffusion process more effectively, leading to significant reductions in the number of sampling steps required for generation (from thousands to around 8-16) without substantially degrading sample quality, as measured by metrics like FID.
Key Components and Definitions
- Forward Process: Represented by $q_t(x)$, defining how noise is added to data over time.
- Reverse Score: The score function $s_\theta(x,t) \approx \nabla_x \log p_t(x)$ that the student network aims to learn.
- Teacher Score: $s_T(x,t)$, derived from a pre-trained, high-fidelity diffusion model.
The PDMD Loss Function
The training objective for the student network is defined as a weighted sum of losses across each subinterval:
$$L(\theta) = \sum_{k=1}^K \alpha_k \, \mathbb{E}_{t \in I_k,\, x \sim p_t}\big[\| s_\theta(x,t) - s_T(x,t) \|^2\big]$$
Here, $\alpha_k$ represents the weight assigned to the k-th subinterval. These weights can be determined based on the duration of the subinterval ($\Delta t_k$) or by the variance of the teacher’s score within that subinterval, allowing for a focus on more dynamic or critical phases of the diffusion process.
Detailed Methodology and Algorithms
Notation and Problem Setup
Diffusion models operate over a continuous or discrete time horizon. Understanding the notation is crucial for grasping how the student model learns to emulate the teacher across different time segments for fast sampling.
- T: The total diffusion horizon.
- $t \in \lbrace t_1, \dots, t_T \rbrace$: Discrete timesteps within the diffusion process.
- $I_k = [t_{k-1}, t_k]$: Subintervals partitioning the total horizon [0, T], for $k = 1, …, K$.
- $p_t$: The data distribution at time t.
- $s_T(x, t)$: The teacher’s score function, computed from a pre-trained diffusion model.
- $s_\theta(x, t)$: The student’s score function, which is learned during distillation.
- $\alpha_k$: Weights for subintervals. These can be set proportionally to the time duration of the subinterval ($\alpha_k \propto \Delta t_k$) or weighted based on the estimated score variance within $I_k$ to emphasize regions where the teacher’s score changes rapidly.
The objective is to train $s_\theta$ to imitate $s_T$ across all subintervals, enabling accurate reverse diffusion with significantly fewer steps.
Subinterval Partitioning Strategy
The effectiveness of PDMD hinges on strategically partitioning the diffusion timeline. Not all time points are equally critical for learning.
Choosing the Number of Subintervals (K)
The number of subintervals, K, can be selected from values like {4, 8, 16, 32}. A smaller K leads to faster training but potentially less precise distillation, while a larger K offers more granular control but increases computational cost.
Partitioning Schemes: Uniform vs. Adaptive
- Uniform Partitioning: Divides the time horizon into subintervals of equal length.
- Adaptive Partitioning: Allocates more subintervals to regions where the teacher’s score exhibits rapid changes (high $|ds_T/dt|$). This focuses learning effort on the most informative parts of the diffusion process. The procedure involves iteratively subdividing regions with high score drift until a budget of total steps or a target number of subintervals is met.
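The adaptive scheme can be illustrated with a toy NumPy sketch. The function name and the drift proxy below are illustrative choices, not from the paper: given per-timestep estimates of $|ds_T/dt|$ on a fine grid, it repeatedly bisects the subinterval carrying the largest accumulated drift until K subintervals exist.

```python
import numpy as np

def adaptive_partition(t_grid, drift, K):
    """Split [t_grid[0], t_grid[-1]] into K subintervals by repeatedly
    bisecting the subinterval with the largest accumulated score drift."""
    bounds = [t_grid[0], t_grid[-1]]
    while len(bounds) - 1 < K:
        # accumulated |ds_T/dt| mass inside each current subinterval
        masses = []
        for a, b in zip(bounds[:-1], bounds[1:]):
            in_ab = (t_grid >= a) & (t_grid < b)
            masses.append(drift[in_ab].sum())
        i = int(np.argmax(masses))
        mid = 0.5 * (bounds[i] + bounds[i + 1])
        bounds.insert(i + 1, mid)
    return np.array(bounds)

# toy example: score drift concentrated near t = 0
t = np.linspace(0.0, 1.0, 1000)
drift = np.exp(-10.0 * t)          # stand-in for |ds_T/dt|
bounds = adaptive_partition(t, drift, K=8)
# most of the 8 subintervals end up packed near t = 0
```

With a drift profile decaying away from $t=0$, the resulting boundaries cluster in the low-$t$ region, exactly the "focus learning effort" behavior described above.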
Subinterval Weighting ($\alpha_k$)
Once subintervals are defined, their contribution to the overall loss is determined by weights $\alpha_k$. Common strategies include:
- $\alpha_k \propto \Delta t_k$: Longer intervals receive more weight.
- $\alpha_k \propto V_k$: Intervals with higher estimated score variance within them receive more weight.
The weights are typically normalized such that $\sum_k \alpha_k = 1$.
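Both weighting rules reduce to normalizing a vector of raw scores. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def subinterval_weights(bounds, score_var=None):
    """alpha_k proportional to interval duration Delta t_k, or to an
    estimated teacher-score variance V_k, normalized so sum_k alpha_k = 1."""
    raw = np.diff(bounds) if score_var is None else np.asarray(score_var, float)
    return raw / raw.sum()

bounds = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
alpha = subinterval_weights(bounds)                       # duration-weighted
alpha_v = subinterval_weights(bounds, [4.0, 2.0, 1.0, 1.0])  # variance-weighted
```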
Score Matching Objective in Subintervals
The core of PDMD involves optimizing the student score function within each subinterval $I_k$ using a local score-matching loss:
$$L_k(\theta) = \mathbb{E}_{t \in I_k,\, x \sim p_t}\big[\| s_\theta(x, t) - s_T(x, t) \|^2\big]$$
The total loss $L(\theta)$ is the weighted sum: $L(\theta) = \sum_k \alpha_k L_k(\theta)$.
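A Monte Carlo estimate of this objective can be sketched in NumPy. This is a didactic version with stand-in score functions, not the paper's implementation; each sample $(x, t)$ is routed to the subinterval containing $t$.

```python
import numpy as np

def pdmd_loss(s_student, s_teacher, xs, ts, bounds, alpha):
    """Estimate L(theta) = sum_k alpha_k * E_{t in I_k}[||s_theta - s_T||^2]
    from a minibatch (xs, ts)."""
    sq_err = np.sum((s_student(xs, ts) - s_teacher(xs, ts)) ** 2, axis=-1)
    # index k of the subinterval [bounds[k], bounds[k+1]) containing each t
    k_idx = np.clip(np.searchsorted(bounds, ts, side="right") - 1,
                    0, len(alpha) - 1)
    loss = 0.0
    for k in range(len(alpha)):
        in_k = k_idx == k
        if in_k.any():
            loss += alpha[k] * sq_err[in_k].mean()
    return loss

# toy check: teacher score -x, student slightly off at -0.9x
teacher = lambda x, t: -x
student = lambda x, t: -0.9 * x
xs = np.ones((4, 2))
ts = np.array([0.1, 0.2, 0.6, 0.9])
loss = pdmd_loss(student, teacher, xs, ts,
                 bounds=np.array([0.0, 0.5, 1.0]),
                 alpha=np.array([0.5, 0.5]))   # -> 0.02
```

Each sample's squared error is $\|0.1 \cdot x\|^2 = 0.02$, so both subinterval means are 0.02 and the weighted sum is 0.02.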
Stabilizing Training
To ensure stable training, several regularization techniques can be employed:
- Weight decay (L2 regularization)
- Gradient clipping
- Spectral normalization on $s_\theta$
- Learning-rate scheduling
- Using optimizers like AdamW with standard hyperparameters (e.g., $lr \approx 2 \times 10^{-4}, \beta_1=0.9, \beta_2=0.999$).
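Several of these stabilizers combine into a single update rule. The sketch below is pure NumPy and purely illustrative (not a drop-in replacement for a framework optimizer such as torch.optim.AdamW): it applies global-norm gradient clipping followed by one AdamW step with the hyperparameters quoted above.

```python
import numpy as np

def clipped_adamw_step(theta, grad, m, v, step, lr=2e-4,
                       beta1=0.9, beta2=0.999, eps=1e-8,
                       weight_decay=1e-2, clip_norm=1.0):
    """One AdamW update with global-norm gradient clipping."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)        # gradient clipping
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    # decoupled weight decay (the "W" in AdamW)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = clipped_adamw_step(theta, np.array([100.0]), m, v, step=1)
```

Note that the exploding gradient (norm 100) is clipped to norm 1 before it touches the moment estimates, so a single bad teacher target cannot destabilize the parameters.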
Practical Estimation Tips
- Approximate expectations using minibatches of $(x, t)$ pairs and their corresponding teacher scores.
- Utilize mixed-precision arithmetic for improved memory efficiency and training speed.
Distillation Training Pipeline
The PDMD pipeline effectively distills a high-capacity teacher model (trained with 1000+ steps) into an efficient student model capable of generating samples in as few as 8-16 steps.
Teacher and Student Models
- Teacher Model: A pre-trained diffusion model providing high-fidelity score estimates across many timesteps.
- Student Model: A lighter score network ($s_\theta$) optimized for rapid inference (8-16 steps).
Training Loop Overview
- Sample Data: Select a minibatch of data and sample a timestep $t$. Generate a noisy sample $x_t \sim q_t(x_t \mid x_0)$ via the forward process.
- Compute Teacher Score: Obtain $s_T(x_t, t)$ from the teacher model.
- Compute Student Score: Obtain $s_\theta(x_t, t)$ from the student model.
- Evaluate Loss: Calculate the per-sample loss $\alpha_{k(t)} \| s_\theta(x_t, t) - s_T(x_t, t) \|^2$, where $k(t)$ indexes the subinterval containing $t$; averaged over the minibatch, this estimates $L(\theta) = \sum_k \alpha_k L_k(\theta)$.
- Update Student: Update $\theta$ using an optimizer like AdamW.
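On a one-dimensional toy problem the loop above can be run end to end. Here the "teacher" is the exact score of $p_t = \mathcal{N}(0, 1+t)$ and the "student" is a single scalar $\theta$ that should converge to 1; everything below is a didactic stand-in for real networks, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_score(x, t):          # exact score of p_t = N(0, 1 + t)
    return -x / (1.0 + t)

theta = 0.0                        # student: s_theta(x, t) = -theta * x / (1 + t)
lr = 0.5
for step in range(200):
    x0 = rng.normal(size=256)                       # 1. sample data
    t = rng.uniform(0.0, 1.0, size=256)
    xt = x0 + np.sqrt(t) * rng.normal(size=256)     #    noisy sample x_t
    s_T = teacher_score(xt, t)                      # 2. teacher score
    s_S = -theta * xt / (1.0 + t)                   # 3. student score
    # 4. squared-error loss; gradient of mean((s_S - s_T)^2) w.r.t. theta
    grad = np.mean(2.0 * (s_S - s_T) * (-xt / (1.0 + t)))
    theta -= lr * grad                              # 5. update student
# theta converges to 1, i.e. the student matches the teacher score
```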
Inference
After training, sample generation proceeds from $t=T$ down to $t=0$ using the student model in a reduced number of steps (S, typically 8 or 16).
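One concrete way to realize few-step sampling (an assumption for illustration; the paper may use a different sampler) is Euler integration of the probability-flow ODE. For a toy variance-growing process with $\sigma_t^2 = t$ and $p_t = \mathcal{N}(0, 1+t)$, eight student evaluations already recover the data distribution closely:

```python
import numpy as np

def sample_student(s_theta, n, S=8, T=1.0, seed=0):
    """Euler integration of the probability-flow ODE
    dx/dt = -(1/2) * (d sigma_t^2 / dt) * s_theta(x, t), with sigma_t^2 = t,
    from t = T down to t = 0 in S student evaluations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=np.sqrt(1.0 + T), size=n)   # x_T ~ p_T = N(0, 1 + T)
    ts = np.linspace(T, 0.0, S + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dx_dt = -0.5 * s_theta(x, t_cur)             # d sigma_t^2 / dt = 1
        x = x + (t_next - t_cur) * dx_dt
    return x

# with the exact score of p_t = N(0, 1 + t) standing in for the student,
# the samples should land close to the data distribution N(0, 1)
samples = sample_student(lambda x, t: -x / (1.0 + t), n=20000, S=8)
```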
Conditional Generation
PDMD can be extended to conditional diffusion models (e.g., class-conditional, text-conditional) by providing the same conditioning inputs ($c$) to the student network as used by the teacher. This ensures alignment in guidance across both models.
Practical Considerations and Stability
Ensuring stable targets from the teacher model is paramount for successful PDMD training.
Calibrating Teacher-Student Gap
To mitigate noise from teacher scores, smoothing techniques can be applied:
- Light Ensemble: Averaging predictions from multiple lightweight teachers.
- Short-Horizon Smoothing: Using a moving average or exponential moving average (EMA) on recent teacher outputs.
- History Buffering: Averaging the last few target scores.
These methods provide steadier targets, leading to more consistent gradients and stable training.
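The EMA variant can be sketched in a few lines (illustrative NumPy, with a scalar score standing in for the full score field):

```python
import numpy as np

def ema_smooth_targets(teacher_scores, decay=0.9):
    """Exponential moving average over a sequence of teacher score
    estimates, yielding steadier targets for the student."""
    smoothed, ema = [], None
    for s in teacher_scores:
        ema = s if ema is None else decay * ema + (1 - decay) * s
        smoothed.append(ema)
    return smoothed

# noisy teacher outputs fluctuating around a true score of 1.0
rng = np.random.default_rng(0)
raw = [1.0 + 0.5 * rng.normal() for _ in range(200)]
smooth = ema_smooth_targets(raw)
# the smoothed targets have much lower variance than the raw outputs
```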
Memory and Compute Management
- Caching vs. Re-computation: Cache teacher scores for frequently encountered inputs; re-compute periodically.
- Precision: Employ half-precision (fp16 or bfloat16) for activations and weights to reduce memory usage and accelerate training.
- Gradient Checkpointing: Trade compute for memory by re-computing activations during backpropagation for very large models.
Hyperparameters to Report
For reproducibility and fair comparisons, it is essential to report key hyperparameters and conduct ablation studies. These include:
- K: Number of subintervals.
- $\alpha_k$ schedule: How weights are assigned across subintervals.
- $\Delta t_k$: Subinterval durations, i.e., the time-step gaps between teacher signals used for targets.
- Learning rate, batch size, dataset-specific choices (augmentation, normalization, conditioning).
Ablation studies on K (e.g., K=4, 8, 16) demonstrate the trade-offs between stability, compute, and memory footprint. For instance, increasing K generally improves stability but increases resource demands.
Extensions to Conditional Diffusion
PDMD can be applied to conditional diffusion models by ensuring identical conditioning inputs ($c$) are fed to both teacher and student networks. This maintains alignment and allows the student to learn from the teacher under the same guidance signals (e.g., text prompts).
Comparative Evaluation and Implementation Details
Experimental Setup
PDMD is evaluated against a baseline 1000-step diffusion model across standard datasets like CIFAR-10, CelebA-64, and LSUN-Church-128. Key metrics include FID, LPIPS, inference time, and memory footprint.
Targeted Outcomes
- Sampling Speedup: Achieved speedups of 4-60× compared to the baseline.
- FID Degradation: Bounded FID degradation (e.g., $\leq$ 5-12 FID points for 8-16 steps).
- Resource Reduction: Lower memory and training time compared to the baseline.
Qualitative Results
PDMD configurations, particularly with K=16, offer the best trade-off between fidelity and speed. K=8 provides a simpler alternative with substantial gains when compute is highly constrained.
Figures and Tables
Effective visualization requires figures showing FID vs. steps, inference time tables, and illustrations of subinterval partitioning and loss curves. Specific labels should indicate the K value used (e.g., K=8 or K=16).
Pros and Cons of PDMD
Pros:
- Dramatic reduction in sampling steps (8-16 vs. 1000).
- Significant inference time speedups.
- Modular training, allowing reuse of existing teacher models.
- Explicit score matching within subintervals enhances fidelity control.
- Suitable for deployment on resource-constrained devices.
- Facilitates clearer ablations and is compatible with unconditional/conditional models.
Cons:
- Requires a strong teacher model and careful hyperparameter tuning (K, $\\alpha_k$).
- Introduces complexity in loss design and training stability management.
- Overhead in computing or approximating teacher scores during training.
- Potential diminishing returns on very simple datasets.
- Risk of distribution mismatch if subinterval boundaries are not aligned with score dynamics.
- Requires robust multi-dataset evaluation to prove generality.