DES Year 3 Deep Learning SBI Inference: Analysis Design
Executive Summary and Objectives
Goal: Build a forward-model SBI pipeline to jointly infer wCDM parameters (focus on w0 and wa) from DES Year 3 weak-lensing shear, galaxy clustering (gg), and their cross-correlation (gammat) using a deep-learning normalizing-flow posterior.
Data foundations: Leverage DES Year 3 photometric data (three-year coverage); anchor calibration and validation on the Y3 GOLD release with community usage notes.
Validation pathway: Start with DES Y3 simulations for baseline performance; address domain shift to real DES Y3 data via fine-tuning on GOLD calibration samples and domain-adaptation techniques.
Performance criteria: Recover the true wCDM values within 95% credible intervals in validation; achieve posterior calibration on simulated tests comparable to or better than traditional likelihood analyses; maintain robust uncertainty quantification under forward-model misspecifications.
Reproducibility and accessibility: Publish code, data subsets, containers (Docker/Singularity), and comprehensive documentation; provide a reproducibility package with environment specs, training scripts, and data-generation recipes.
Key weaknesses to address: explicit real-DES Y3 demonstration path; explicit domain adaptation steps to mitigate sim-to-real gaps; documented hyperparameter sensitivity and failure modes; transparent forward-model choices for intrinsic alignments, baryons, photo-z, and shear biases; accessible reproducibility guidance.
Forward-Model Components
In cosmic-shear and galaxy-clustering analyses, the forward model translates theory into observable two-point statistics. Here is a concise, practical blueprint of the components that connect our cosmology to data, including how we handle systematics and nuisances.
Intrinsic Alignments (IA)
We adopt a flexible non-linear alignment (NLA) model. The IA contribution is scaled by an amplitude parameter A_IA, treated as a nuisance parameter. To capture possible evolution, IA(z,L) is modeled as:
A_IA(z, L) = A_IA · (1+z)^η_IA · (L/L0)^β_IA
with η_IA (redshift-dependence) and β_IA (luminosity-dependence) treated as nuisance parameters with broad priors. The overall amplitude is constrained to A_IA ∈ [−2, 2]. This setup lets the data inform whether alignment effects grow with redshift or luminosity, while keeping the parameterization reasonably simple.
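As a minimal illustration of the scaling above (the function name and the convention L0 = 1 are ours, not fixed by the pipeline), the IA amplitude can be written directly:

```python
import numpy as np

def ia_amplitude(z, L, A_IA, eta_IA, beta_IA, L0=1.0):
    """Redshift- and luminosity-dependent IA amplitude:
    A_IA * (1+z)**eta_IA * (L/L0)**beta_IA."""
    return A_IA * (1.0 + z) ** eta_IA * (L / L0) ** beta_IA
```

At z = 0 and L = L0 the expression reduces to A_IA itself, which is why A_IA alone carries the overall-amplitude prior A_IA ∈ [−2, 2].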
Baryonic physics
We include a HMCode‑like baryon-suppression parameter B_baryon ∈ [0.5, 1.5]. B_baryon = 1 corresponds to no extra baryonic suppression. This parameter modulates the small-scale matter power spectrum to reflect baryonic effects (e.g., feedback processes). We place a prior on B_baryon that reflects external constraints from DES Y3 and ancillary data, typically centered near 1 with a moderate width to accommodate uncertainties (e.g., a Gaussian prior around 1 with a few tenths of scatter, truncated to [0.5, 1.5]).
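A truncated Gaussian prior of this kind is easy to sample by rejection; the sketch below assumes the illustrative values mentioned above (center 1, scatter 0.3, bounds [0.5, 1.5]), which are placeholders rather than the pipeline's final prior:

```python
import numpy as np

def sample_b_baryon(n, mu=1.0, sigma=0.3, lo=0.5, hi=1.5, rng=None):
    """Draw n samples from a Gaussian prior truncated to [lo, hi]
    via simple rejection sampling."""
    rng = rng or np.random.default_rng(0)
    out = np.empty(0)
    while out.size < n:
        draws = rng.normal(mu, sigma, size=2 * n)
        out = np.concatenate([out, draws[(draws >= lo) & (draws <= hi)]])
    return out[:n]
```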
Photometric redshift uncertainties
We account for per-bin redshift shifts Δz_i with priors Δz_i ∈ [−0.05, 0.05]. These shifts propagate into uncertainties in the n(z) shapes and, crucially, induce cross-bin covariances. The Δz_i parameters are calibrated against DES Y3 GOLD redshift-validation results. In addition to the shifts, we model potential shape changes in n(z) and propagate their impact through the tomographic cross-correlations.
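The simplest way the Δz_i shifts enter the forward model is as a translation of each bin's n(z). A minimal sketch on a uniform redshift grid (the renormalization convention is our assumption; shape perturbations beyond the mean shift are not shown):

```python
import numpy as np

def shifted_nz(z, nz, dz):
    """Apply a mean redshift shift dz to a tomographic bin:
    n'(z) = n(z - dz), renormalized on a uniform z grid.
    Probability shifted outside the grid is dropped."""
    shifted = np.interp(z - dz, z, nz, left=0.0, right=0.0)
    norm = shifted.sum() * (z[1] - z[0])
    return shifted / norm if norm > 0 else shifted
```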
Shear calibration biases
Each tomographic bin i has a multiplicative shear bias m_i with priors m_i ∈ [−0.05, 0.05]. These biases are calibrated from image simulations and DES Y3 shear catalogs, and their uncertainties are propagated into the predictions of all two-point functions (ξ±, γt, w) through the standard linear response relations.
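The standard linear response is a per-bin-pair rescaling of the predicted two-point functions; a one-line sketch (function name ours):

```python
def apply_shear_bias(xi_theory, m_i, m_j):
    """Multiplicative shear bias for a tomographic bin pair (i, j):
    xi_obs = (1 + m_i) * (1 + m_j) * xi_theory.
    For gamma_t, only the source-bin factor (1 + m_j) applies."""
    return (1.0 + m_i) * (1.0 + m_j) * xi_theory
```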
Tomography and observables
We use four tomographic bins for shear and four to six bins for galaxies. The forward model predicts:
- ξ±(θ) for cosmic shear,
- w(θ) for galaxy clustering,
- γt(θ) cross-correlations between shear and galaxy positions.
All auto- and cross-bin combinations are included to maximize information content, yielding a rich, multi-probe data vector.
Data vector and covariance
The data vector combines (ξ+, ξ−, w, γt) across all chosen tomographic bin pairs. The covariance has two pieces: (i) an analytic model for the Gaussian (or quasi-Gaussian) part and (ii) a simulated sample covariance from mocks. Crucially, cross-correlations between probes (ξ, w, γt) and between tomographic bins are included, so the full multi-probe, multi-bin covariance is captured.
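The text does not pin down how the analytic and mock pieces are combined, but one standard ingredient when a sample covariance is estimated from a finite number of mocks is the Hartlap debiasing of its inverse. A sketch under that assumption, where p is the data-vector length:

```python
import numpy as np

def debiased_precision(sample_cov, n_mocks):
    """Hartlap-corrected inverse of a mock-estimated covariance:
    C^{-1}_unbiased = (n_mocks - p - 2) / (n_mocks - 1) * C^{-1},
    with p the data-vector length. Requires n_mocks > p + 2."""
    p = sample_cov.shape[0]
    hartlap = (n_mocks - p - 2) / (n_mocks - 1)
    return hartlap * np.linalg.inv(sample_cov)
```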
Forward-model fidelity vs. cost
We balance completeness with computational practicality. The default forward model includes the full set of two-point functions and their cross-correlations across bins, but we document all approximations (e.g., surrogate emulators, limited angular scales, or simplified covariance treatments) and quantify their expected impact on parameter inference. The goal is transparent, reproducible modeling with manageable run times.
Observables alignment with DES Y3 GOLD
To ease calibration and reproducibility, we tailor data-vector choices and redshift-bin definitions to match DES Y3 GOLD conventions. This alignment facilitates cross-checks with GOLD-based pipelines and helps ensure consistent interpretation of results across analyses.
Training Architecture for the DL Inference
To turn a flood of cosmological data into fast, trustworthy posteriors, the training setup is the secret sauce. Below is a clear, actionable blueprint that ties model design, data generation, optimization, and validation into a cohesive pipeline.
Backbone model and posterior approximation
- Model: Masked Autoregressive Flow (MAF) used as the normalizing flow for posterior approximation.
- Structure: 12–16 stacked autoregressive (MADE) transform layers, with 256 hidden units per layer.
- Base distribution: standard Gaussian.
Dimensionality and compression
- Goal: encode the physical summary statistics into a 32–64 dimensional latent vector.
- Approach: apply an initial 2-layer encoder (MLP) to map observables to this latent space.
Hyperparameters
- Optimizer: AdamW
- Initial learning rate: 1e-3 with cosine decay
- Batch size: 2048
- Total training steps: 200k
- Gradient clipping: 5.0 to improve stability
- Weight decay: 1e-5
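The learning-rate schedule above is the only hyperparameter whose shape is not a single number; a cosine decay from 1e-3 over 200k steps can be sketched as (the lr_min = 0 floor is our assumption):

```python
import math

def cosine_lr(step, total_steps=200_000, lr0=1e-3, lr_min=0.0):
    """Cosine-decay learning-rate schedule: starts at lr0,
    ends at lr_min after total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * frac))
```

In a framework like PyTorch this would typically be handled by a built-in scheduler rather than hand-rolled, but the functional form is the same.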
Hardware and training time
- Compute: 8× NVIDIA A100 GPUs (80 GB each)
- Estimated wall time: ~48–72 hours for full training and validation on 2 million mock realizations
Data generation
- Realizations: generate 2,000,000 simulated samples spanning the wCDM parameter space
- Priors: broad wCDM parameter ranges centered on a Planck-like fiducial cosmology
- Speed-ups: reuse cosmology emulators to accelerate likelihood-free predictions where possible
Domain adaptation strategy
- Baseline: train on mocks first
- Fine-tuning: perform a targeted pass on DES Y3 GOLD calibration data (thousands of labeled real realizations)
- Goal: align summary statistics and model calibration with the real data distribution to reduce domain shift
Validation hooks
- Posterior predictive checks to assess how well the model captures data variability
- Coverage tests for w0 and wa to verify confidence intervals are well-calibrated
- Out-of-distribution (OOD) detection to flag potential domain shift or model failure cases
Table of key settings (quick reference)
| Component | Specification |
|---|---|
| Backbone | Masked Autoregressive Flow (MAF); 12–16 autoregressive (MADE) layers; 256 hidden units per layer; base dist: standard Gaussian |
| Latent space | 32–64 dimensional (physical summary statistics) |
| Encoder | Initial 2-layer MLP mapping observables to latent space |
| Optimizer | AdamW |
| Learning rate | 1e-3 with cosine decay |
| Batch size | 2048 |
| Training steps | 200,000 |
| Gradient clipping | 5.0 |
| Weight decay | 1e-5 |
| Hardware | 8× NVIDIA A100 (80 GB) |
| Training time (2M mocks) | ~48–72 hours |
| Data generation | 2,000,000 simulations over wCDM; Planck-like priors |
| Domain adaptation | Fine-tune on DES Y3 GOLD calibration data (thousands of labeled realizations) |
| Validation | Posterior predictive checks, w0/wa coverage, OOD detection |
In short, this training architecture blends a powerful, expressive posterior model with a carefully staged data pipeline, robust optimization, and explicit strategies to handle real-world domain shifts. The result is a DL inference setup that not only learns fast but also stays honest to the distributions it aims to constrain in practice.
Real Data Adaptation and Domain Transfer Strategy
Bringing simulations in line with real data is essential for robust cosmology inference. This section outlines a practical, repeatable workflow that combines domain adaptation with careful data curation to minimize the simulation-to-reality gap.
Domain adaptation workflow
We begin by pretraining on simulated realizations to learn the basic data-generating process. Next, we fine-tune the flow and the encoder on a curated subset of DES Y3 GOLD calibration data—real observations with well-understood systematics. This targeted fine-tuning reduces the simulation-to-reality gap while preserving the physical structure learned during pretraining.
Calibration data selection
Choose a representative subset that spans redshift bins, galaxy types, and survey conditions. Include masked regions and survey masks so the calibration data reflect realistic observational conditions and coverage. This ensures the adaptation accounts for both typical and edge cases in the data.
Importance weighting
Apply importance weights to the mocks to better match the empirical distributions observed in DES Y3 GOLD. Target distributions include n(z) (redshift distribution), shear noise characteristics, and the masking pattern. Weighting helps the simulated training data reflect the actual survey statistics more closely, improving transfer performance.
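A simple density-ratio estimate over one summary statistic (e.g. a bin's mean redshift) illustrates the idea; the histogram-based estimator below is one of several reasonable choices, not necessarily the one used in the pipeline:

```python
import numpy as np

def importance_weights(sim_values, real_values, bins=30):
    """Per-mock weights = real-data density / simulation density for one
    summary statistic, estimated with shared histogram bins and
    normalized to unit mean."""
    edges = np.histogram_bin_edges(
        np.concatenate([sim_values, real_values]), bins=bins)
    p_sim, _ = np.histogram(sim_values, bins=edges, density=True)
    p_real, _ = np.histogram(real_values, bins=edges, density=True)
    idx = np.clip(np.digitize(sim_values, edges) - 1, 0, bins - 1)
    w = p_real[idx] / np.maximum(p_sim[idx], 1e-12)
    return w / w.mean()
```

Reweighting the mocks this way pulls their summary distribution toward the empirical one, which is exactly the transfer behavior this step targets.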
Regularization for domain shift
When validation diagnostics reveal residual mismatch, inflate posterior uncertainties to reflect the remaining ambiguity. Run alternate calibrations using different intrinsic alignment (IA) models, baryonic effects, and photo-z priors to quantify robustness and identify sensitivities to modeling choices.
Cross-validation with withheld Y3 GOLD sub-samples
Hold out independent Y3 GOLD sub-samples to test domain transfer performance and guard against overfitting to the calibration data. This cross-validation step provides a realistic check on generalization to data not seen during adaptation.
Performance Benchmarks and Validation Plan
This section spells out how we will quantify the accuracy, robustness, and practicality of DL-SBI in a cosmology inference workflow. The goal is to show not only that we can recover known truths in simulations, but also that the method behaves sensibly on real data and under realistic model perturbations.
Metrics
| Metric | What it measures | How it’s computed | Why it matters |
|---|---|---|---|
| KL divergence between inferred and true posteriors | Distance between the posterior inferred by DL-SBI and the known true posterior from simulated data | Estimate KL(p̂(θ|d) || p(θ|d)) using samples from the inferred posterior and the true simulator distribution over a representative test set | Quantifies how accurately the method recovers the full posterior structure, not just point estimates |
| 95% credible interval coverage on held-out mocks | Calibration of uncertainty: do 95% intervals contain the true parameter values with the expected frequency? | Apply DL-SBI to held-out mock catalogs; compute the fraction of true values falling inside the inferred 95% intervals across mocks | Assesses reliability of uncertainty quantification under realistic mocks |
| Posterior width for w0 and wa vs data-vector truncation | How precise the constraints on the dark-energy equation of state parameters are as the data vector is reduced | Track the posterior width (e.g., credible interval half-width) for w0 and wa as a function of the maximum multipole or angular scale included | Illustrates robustness to information loss and identifies where DL-SBI gains persist or fade with less data |
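In practice the KL metric in the table is estimated from posterior samples; a deliberately simplified 1-D sketch uses Gaussian moment-matching of the two sample sets (an approximation we adopt here for illustration, not the full sample-based estimator):

```python
import numpy as np

def gaussian_kl_1d(samples_p, samples_q):
    """KL(p || q) after fitting a 1-D Gaussian to each sample set:
    KL = log(s_q/s_p) + (s_p^2 + (m_p - m_q)^2) / (2 s_q^2) - 1/2."""
    mp, sp = samples_p.mean(), samples_p.std()
    mq, sq = samples_q.mean(), samples_q.std()
    return np.log(sq / sp) + (sp**2 + (mp - mq)**2) / (2.0 * sq**2) - 0.5
```

Identical sample sets give KL = 0, and a unit-mean shift between unit-variance sets gives KL ≈ 0.5, which makes the metric easy to sanity-check before applying it to real posteriors.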
Baseline comparison
We will benchmark DL-SBI against traditional likelihood analyses that use two-point statistics with standard priors. Specifically, we will:
- Run a conventional likelihood analysis using two-point cosmic shear statistics (and relevant ancillary probes) with widely used priors.
- Compare posterior widths, parameter biases (in simulations), and reported systematics handling between the two approaches.
- Quantify gains in precision (smaller uncertainties) and robustness (stability to reasonable model/forward-model changes) attributable to the DL-SBI workflow.
Real-data readiness
To demonstrate practicality beyond simulations, we will apply the method to DES Y3 GOLD calibration data and assess stability under plausible forward-model perturbations. Key checks include:
- Posterior shifts that are consistent with independent calibrations and prior information.
- Stability of inferences when perturbing forward models for intrinsic alignments (IA), baryonic physics, and photometric redshift (photo-z) errors.
- Consistency across data-processing variants and masking choices to ensure results are not driven by artifacts.
Computational budget
We will document the full compute and storage footprint, and provide a reproducibility package so others can reproduce runtimes on similar hardware. Details include:
- Total CPU/GPU hours for all training, inference, and hyperparameter-tuning runs.
- Storage needs for the simulated mock catalog and intermediate data products (e.g., feature vectors, posterior samples).
- A reproducibility package that includes code, environment specifications (e.g., conda/virtualenv or Docker/Singularity), and a scripted workflow to replicate runtimes on comparable hardware.
Failure modes
We explicitly catalog potential failure modes and outline diagnostics and mitigations to keep the validation plan honest and actionable. Common risks include:
- Mis-specified IA model. Consequence: biased inferences or misestimated uncertainties if IA contributions are wrong.
- Strong domain shift between simulations and real data. Consequence: degraded posterior accuracy in real analyses.
- Degeneracies between multiplicative shear-bias parameters (m_i) and IA amplitude (A_IA). Consequence: inflated uncertainties or biased posteriors.
Diagnostics and mitigations:
- Perform posterior predictive checks to see if simulated observables reproduce real data distributions under diverse forward-model settings.
- Use ablation studies and hierarchical modeling to assess sensitivity to IA choices and to separate cosmology from systematics.
- Incorporate robust priors and/or flexible IA models, and validate with cross-validation on mocks and with external calibrations.
- Monitor domain shift by comparing feature distributions between training mocks and real data; apply domain-adaptation or reweighting if needed.
- Run joint inferences with and without problematic components (e.g., IA), to quantify degeneracies and identify stable combinations of data and model choices.
Together, these benchmarks provide a transparent, repeatable plan to quantify how well DL-SBI performs in practice, where it shines, and where careful scrutiny or methodological adjustments are warranted.
Reproducibility, Code Availability, and Practical Adoption: Addressing Competitors’ Weaknesses
| Item / Aspect | Reproducibility Emphasis | Code Availability & Licensing | Practical Adoption & Implementation Details |
|---|---|---|---|
| Code and data release plan | High: publish full end-to-end pipeline; includes training scripts, data-generation utilities, forward-model modules, and a lightweight data subset for testing. | Permissive license (MIT); release code and data under open terms; includes all components necessary for replication. | Requires a hosted repository and clear versioning; enables testing with a small subset; lowers barrier to entry for replication. |
| Documentation and tutorials | High: detailed README, API docs, architecture diagrams, and a step-by-step tutorial to reproduce baseline results on a standard compute cluster. | Documentation accompanying the release; public access to API docs and diagrams; changes tracked with releases. | Promotes user onboarding and reproducibility; aligns with typical HPC workflows; reduces support burden. |
| Containerization and environments | High: deliver Docker/Singularity containers with all dependencies; provide an environment.yml for local development and HPC deployment. | Container images and environment specification publicly available; supports reproducible environments across platforms. | Simplifies setup on laptops and HPC clusters; improves deployment reliability; requires container registry access and build maintenance. |
| Reproducibility package | Very High: include a lightweight subset of mocks (e.g., 100k realizations) and processed data vectors with a script to reproduce the posterior for a fixed fiducial cosmology. | Scripts and data subset packaged with the release; ensures end-to-end replication of the posterior for a fixed cosmology. | Enables quick validation by newcomers without large data access; encourages external audits and benchmarking. |
| Hyperparameter transparency | High: publish a dedicated hyperparameter sweep report including sensitivity analyses for A_IA, B_baryon, Δz_i, and m_i; include recommended default ranges and their impact on posteriors. | Report published; default ranges documented; supports reproducible sensitivity studies. | Guides users in choosing priors and defaults; reduces guesswork; facilitates cross-dataset comparisons. |
| Forward-model transparency | High: clearly document IA, baryon, photo-z, and shear-bias implementations, priors, and their impact on inference; supply alternate forward-model variants to test robustness. | Documentation of models and priors; variants provided or accessible for testing. | Enables robustness checks; enhances credibility and peer review; supports adaptation to new data or surveys. |
| Real-data demonstration plan | Medium-High: staged path to move from simulations to real DES Y3 data, including a one-click protocol for domain adaptation and posterior recalibration using Y3 GOLD. | Protocol and tooling for domain adaptation; references to Y3 GOLD data; emphasizes controlled data access. | Facilitates transition to real data deployment; accelerates validation with a clear, repeatable process. |
| Availability of baseline data | High: provide a small, fixed data-vector subset derived from DES Y3 GOLD to enable independent novices to validate the pipeline without requiring full-scale DES data access. | Baseline data subset included or easily obtainable; ensures reproducibility without full dataset access. | Lowers entry barrier for new users; supports educational and benchmarking use cases. |
E-E-A-T Alignment: Credibility Boost from DES Y3 GOLD and Year 3 Context
Pros
- The DES Year 3 photometric data set is assembled from the first three years of science operations, providing a mature, well-characterized data source for inference; the Y3 GOLD release offers an expanded and curated data set with enhanced quality and accessibility over Y1 GOLD and DES DR1.
- DES Y3 GOLD includes usage notes aimed at the broad astrophysics community, facilitating external validation, cross-survey comparisons, and broader reproducibility.
Cons
- Dependence on the Y3 GOLD calibration subset means the domain-adaptation step must robustly handle potential biases in the calibration data itself; we plan to validate against multiple Y3 GOLD calibration partitions.
Mitigation
- Explicitly document domain-shift diagnostics, offer multiple forward-model variants, and maintain an open channel for external validation using additional DES Y3 data products or external spectroscopic priors.
Implementation Roadmap and Deliverables: Milestones, Resources, and Deployment
Milestones and Timeline
Here’s a practical, transparent roadmap guiding our forward-modeling effort from setup to community adoption. Each milestone builds a verifiable step toward robust inferences and reproducible science.
- M1 (Months 1–2): Finalize forward-model choices: intrinsic alignments (IA), baryons, photometric redshift (photo-z) errors, and shear biases; establish baseline mocks (2 million realizations); and define the data-vector structure aligned with DES Y3 GOLD conventions.
- M2 (Months 3–4): Implement the DL-SBI architecture (MAF flow, encoder, and training pipeline); run initial training on mocks; establish baseline inference for w0 and wa.
- M3 (Months 5–6): Conduct domain adaptation using DES Y3 GOLD calibration data; perform hyperparameter sensitivity studies; begin posterior-calibration checks with held-out Y3 GOLD samples.
- M4 (Months 7–8): Perform full validation against simulated test sets; assess 95% credible-interval coverage; compare results to classical analyses; prepare reproducibility package and documentation.
- M5 (Month 9): Release code, a data subset, and reproducibility materials; publish a preprint or technical report detailing results and best-practice guidelines for community adoption.
- M6 (Month 10+): Enter an active maintenance window for code updates, user feedback, and integration with DES Y3 data releases and future surveys.
Resource Plan
Turning the forward-model idea into a working, reproducible pipeline comes down to four concrete levers: compute, storage, people, and data access. This plan is sized for a 6–9 month effort with room to scale if we add more complex variants.
- Compute: 8 × NVIDIA A100 GPUs for training. The setup is scalable to larger clusters if needed. Anticipate roughly 2–3× longer training times when we explore more complex forward-model variants, depending on the final architecture and data volume.
- Storage: Several hundred terabytes for mock catalogs and intermediate data products. We will use compressed representations for inference-ready posteriors to keep downstream storage and delivery efficient.
- Personnel: A core team of 2–3 researchers working 6–9 months to implement, validate, and document the pipeline. In addition, one dedicated reproducibility engineer will maintain documentation and containerized environments to ensure repeatable workflows.
- Data access: Coordinate with DES collaboration policies to ensure compliant access to Y3 GOLD calibration subsets. We will also provide a small, shareable data-vector subset for external validation while respecting data-use constraints.
By aligning resources this way, we keep the project focused, scalable, and auditable from day one.
Deliverables and Documentation
We ship a complete, user-friendly package that makes reproducibility, validation, and adaptation straightforward. Below is what you can expect to receive, and how each piece helps you reuse and verify the work in different settings.
| Deliverable | What you get | Why it matters |
|---|---|---|
| D1: End-to-end reproducible SBI pipeline | Public code repository with clean, documented scripts and notebooks. Containerized environments (Docker and/or Singularity) to reproduce results across laptops, clusters, or cloud instances. A comprehensive validation suite (unit, integration, and regression tests) with example data and runbooks. Explicit software versioning and data provenance to ensure exact replayability. Continuous integration hooks that verify reproducibility on code changes. | Ensures you can run the full workflow anywhere, reproduce results exactly, and validate steps independently. |
| D2: Published set of sensitivity analyses and hyperparameter sweeps | Recorded experiments spanning priors, model architectures, training regimes, and inference settings. Accessible figures, tables, and data releases that summarize robustness and performance trade-offs. Documentation on how to interpret results and apply findings to new analyses. | Documents robustness across priors, models, and settings; provides a reference for trust and comparison. |
| D3: User-friendly guide for adapting the pipeline to other DES-like data sets or future surveys | A user-friendly, step-by-step guide to map a new survey to the pipeline’s inputs and outputs. Template data schemas, feature mappings, and recommended preprocessing steps. Examples and checklists to help researchers avoid common pitfalls during domain transfer. | Lowers the barrier to reuse on new data, speeding up adoption and cross-survey analyses. |
| D4: Minimal-yet-powerful data-vector subset and posterior samples for external validation | A minimal, well-documented data-vector subset that captures the essential information needed for validation. Posterior samples released in accessible formats (e.g., CSV, HDF5/NetCDF) with metadata describing the context and assumptions. Clear guidance on how external teams can perform independent checks without full DES access. | Allows external teams to validate methods without requiring full access to DES data. |
| D5: Maintained FAQ and troubleshooting guide addressing common failure modes and domain-adaptation caveats | FAQ covering common failure modes, from data loading issues to numerical instabilities in inference. Domain-adaptation caveats and practical fixes when transferring to new data regimes or surveys. Tips for diagnosing problems, with recommended configurations and reference baselines. | Helps users diagnose issues quickly and understand limitations when transferring to new domains. |
If you have ideas for improving any of these deliverables or want to see additional formats (e.g., Jupyter notebooks, prebuilt tutorials, or quick-start kits), we welcome your feedback to keep the documentation—and the science—as accessible as possible.
