Climate Database

The Climate Database contains a variety of data extracted from different sources. This page provides an overview of the data layout, the sources we've extracted data from, the data harmonization and downscaling process, and the available climate variables.

Data Organization

The root of the climate data is located at /mnt/share/erf/climate_downscale. There are several subdirectories in the root directory, but only the extracted data directory extracted_data and the results directory results are relevant to users of the database.

The file tree with subdirectories is as follows:

/mnt/share/erf/climate_downscale/
├── extracted_data/
│   ├── era5/
│   │   └── {ERA5_DATASET}_{ERA5_VARIABLE}_{YEAR}_{MONTH}.nc
│   ├── cmip6/
│   │   └── {CMIP6_VARIABLE}_{CMIP6_EXPERIMENT}_{CMIP6_SOURCE}_{VARIANT}.nc
│   └── _Other Data Sources_/
└── results/
    ├── annual/
    │   ├── archive/
    │   │   ├── historical/
    │   │   │   └── {ANNUAL_VARIABLE}/
    │   │   │       └── {YEAR}.nc
    │   │   └── {SCENARIO}/
    │   │       └── {ANNUAL_VARIABLE}/
    │   │           └── {YEAR}.nc
    │   ├── raw/
    │   │   ├── compiled/
    │   │   │   └── {SCENARIO}/
    │   │   │       └── {ANNUAL_VARIABLE}/
    │   │   │           └── {GCM_MEMBER}.nc
    │   │   ├── historical/
    │   │   │   └── {ANNUAL_VARIABLE}/
    │   │   │       └── {YEAR}_era5.nc
    │   │   └── {SCENARIO}/
    │   │       └── {ANNUAL_VARIABLE}/
    │   │           └── {YEAR}_{GCM_MEMBER}.nc
    │   └── {SCENARIO}/
    │       └── {ANNUAL_VARIABLE}/
    │           └── {DRAW}.nc
    ├── daily/
    │   └── {SCENARIO}/
    │       └── {DAILY_VARIABLE}/
    │           ├── {YEAR}.nc
    │           └── reference.nc
    └── metadata/

The file patterns will be explained in more detail in the following sections.

Extracted Data

The two primary data sources are historical climate data from the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 dataset and climate forecast data from the Coupled Model Intercomparison Project Phase 6 (CMIP6). There are also some additional data sources that have been extracted to serve as covariates in a forthcoming downscaling model.

ERA5 Data

The ECMWF Reanalysis v5 (ERA5) is the fifth generation ECMWF atmospheric reanalysis of the global climate covering the period from January 1950 to present. ERA5 is produced by the Copernicus Climate Change Service (C3S) at ECMWF. There are three datasets of note:

The Complete ERA5 global atmospheric reanalysis: This dataset contains a wide range of atmospheric, land, and oceanic climate variables on a regular latitude/longitude grid at a roughly 31km resolution. Additionally it splits the atmosphere into 137 pressure levels starting at the Earth's surface and extending to a height of 80km. We do not typically extract from this source as we generally have no need of the oceanic data or the data above the Earth's surface.
ERA5-Land hourly data from 1950 to present: This dataset contains land surface variables at the native model resolution of 9km. Variables not defined over the land surface (such as sea surface temperature) are not present in this dataset. Additionally, this dataset sometimes misses some land area, especially islands and regions. Because of its detailed resolution, this is the model we prefer to extract from wherever possible.
ERA5 hourly data on single levels from 1940 to present: This dataset contains a wide range of atmospheric, land, and oceanic climate variables on a regular latitude/longitude grid at a resolution of roughly 31km. This dataset is similar to the complete ERA5 dataset, but it only contains data at the Earth's surface and at a few fixed pressure levels, making it significantly smaller and faster to work with. This is the dataset we use to supplement the ERA5-Land data over regions where the land data is missing or incomplete. We also use this dataset for variables that are not available in the ERA5-Land dataset.
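
The supplementing rule described above (prefer ERA5-Land, fall back to single-levels where the land data is missing) can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the function name `fill_land_gaps` is hypothetical, missing land values are assumed to be encoded as NaN, and both inputs are assumed to already sit on the same grid, flattened to one dimension.

```python
import math
from typing import Sequence


def fill_land_gaps(land: Sequence[float], single_levels: Sequence[float]) -> list[float]:
    """Prefer ERA5-Land values; use the single-levels value where land is NaN.

    ASSUMPTION: both inputs are on the same grid and flattened to 1-D.
    The two datasets have different native resolutions, so a real
    workflow would need a regridding step before this one.
    """
    if len(land) != len(single_levels):
        raise ValueError("inputs must have the same length")
    return [
        fallback if math.isnan(land_value) else land_value
        for land_value, fallback in zip(land, single_levels)
    ]


# Hypothetical 2m temperatures (K) for four grid cells; two cells lack land data.
land = [281.2, float("nan"), 279.9, float("nan")]
single = [281.0, 284.5, 280.1, 283.7]
filled = fill_land_gaps(land, single)
print(filled)
```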

Storage and naming conventions

File Pattern: /mnt/share/erf/climate_downscale/extracted_data/era5/{ERA5_DATASET}_{ERA5_VARIABLE}_{YEAR}_{MONTH}.nc

Naming Conventions

{ERA5_DATASET}: One of reanalysis-era5-land or reanalysis-era5-single-levels.
{ERA5_VARIABLE}: The variable being extracted (one of 10m_u_component_of_wind, 10m_v_component_of_wind, 2m_dewpoint_temperature, 2m_temperature, surface_pressure, total_precipitation, sea_surface_temperature).
{YEAR} and {MONTH}: The year and month of the data being extracted. {YEAR} ranges from 1950 to 2023.
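
As a small sketch of how the file pattern expands, the helper below lists the twelve monthly files for one dataset, variable, and year. The helper name `era5_paths_for_year` is hypothetical, and the two-digit zero-padded month is an assumption; the pattern above does not state how {MONTH} is formatted, so check a real filename before relying on it.

```python
from pathlib import Path

ERA5_ROOT = Path("/mnt/share/erf/climate_downscale/extracted_data/era5")
ERA5_DATASETS = {"reanalysis-era5-land", "reanalysis-era5-single-levels"}


def era5_paths_for_year(dataset: str, variable: str, year: int) -> list[Path]:
    """Return the twelve monthly ERA5 file paths for one year."""
    if dataset not in ERA5_DATASETS:
        raise ValueError(f"unknown ERA5 dataset: {dataset!r}")
    if not 1950 <= year <= 2023:
        raise ValueError(f"year {year} is outside the documented range 1950-2023")
    # ASSUMPTION: months are written as two-digit, zero-padded strings.
    return [
        ERA5_ROOT / f"{dataset}_{variable}_{year}_{month:02d}.nc"
        for month in range(1, 13)
    ]


paths = era5_paths_for_year("reanalysis-era5-land", "2m_temperature", 1990)
print(paths[0].name)
```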

CMIP6 Data

The Coupled Model Intercomparison Project Phase 6 (CMIP6) is a collaborative effort to compare climate models across the globe. The data is organized into different variables, scenarios, and sources.

Storage and Naming Conventions

File Pattern: /mnt/share/erf/climate_downscale/extracted_data/cmip6/{CMIP6_VARIABLE}_{CMIP6_EXPERIMENT}_{CMIP6_SOURCE}_{VARIANT}.nc

Naming Conventions

{CMIP6_VARIABLE}: The variable being extracted (one of uas, vas, hurs, tas, tasmin, tasmax, tos, pr).
{CMIP6_EXPERIMENT}: The scenario being extracted (one of ssp126, ssp245, ssp585).
{CMIP6_SOURCE}: The source model for the data. A source model is a particular model from a particular institution, e.g. BCC-CSM2-MR.
{VARIANT}: The variant of the model, which is a particular run of the model with specific initial and boundary conditions and forcing scenarios.
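
A filename following this pattern can be split back into its four parts. The sketch below assumes that none of the parts contains an underscore, which holds for the variable, experiment, and source names listed on this page; the helper name `parse_cmip6_filename` and the example filename are hypothetical.

```python
from typing import NamedTuple


class Cmip6File(NamedTuple):
    variable: str
    experiment: str
    source: str
    variant: str


def parse_cmip6_filename(filename: str) -> Cmip6File:
    """Split '{VARIABLE}_{EXPERIMENT}_{SOURCE}_{VARIANT}.nc' into its parts."""
    if not filename.endswith(".nc"):
        raise ValueError(f"expected a .nc file, got {filename!r}")
    # ASSUMPTION: no component contains an underscore, so a plain split is safe.
    parts = filename[: -len(".nc")].split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected CMIP6 filename: {filename!r}")
    return Cmip6File(*parts)


# Hypothetical example filename, built from names that appear on this page.
info = parse_cmip6_filename("tas_ssp245_BCC-CSM2-MR_r1i1p1f1.nc")
print(info.source, info.variant)
```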

Model Inclusion

We use a subset of the CMIP6 data in our analysis following a model evaluation published in Nature that determined which models to include based on their performance. Our inclusion criteria are as follows:

The model must be flagged as "Yes" in the Included in 'Model Subset'? column in the evaluation table.
The model must have daily results (which concretely means there is either a day or Oday table_id associated with the model).
The model must make estimates for the three scenarios we are interested in: ssp126, ssp245, and ssp585.
The model must cover the time range 2019-2099. We project out to 2100, and many models run a 2100 year, but some stop at 2099. They are incorporated here with their last year repeated.
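
The four criteria can be read as a single predicate over a per-model record. The sketch below is illustrative only: the `ModelRecord` fields and the example models are hypothetical, and the real pipeline checks the year range in a later stage rather than at extraction time.

```python
from dataclasses import dataclass

REQUIRED_SCENARIOS = frozenset({"ssp126", "ssp245", "ssp585"})
DAILY_TABLE_IDS = frozenset({"day", "Oday"})


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical summary of what is known about one CMIP6 source model."""
    source: str
    in_model_subset: bool   # "Yes" in the evaluation table
    table_ids: frozenset    # table_ids published for the model
    scenarios: frozenset    # experiments the model provides
    first_year: int
    last_year: int


def is_included(model: ModelRecord) -> bool:
    """Apply the four inclusion criteria listed above."""
    return (
        model.in_model_subset
        and bool(model.table_ids & DAILY_TABLE_IDS)
        and REQUIRED_SCENARIOS <= model.scenarios
        # A model that stops at 2099 still qualifies; its last year is repeated.
        and model.first_year <= 2019
        and model.last_year >= 2099
    )


# Hypothetical records, not real CMIP6 metadata.
model_a = ModelRecord("MODEL-A", True, frozenset({"day", "Amon"}),
                      frozenset({"ssp126", "ssp245", "ssp585"}), 2015, 2099)
model_b = ModelRecord("MODEL-B", True, frozenset({"Amon"}),
                      frozenset({"ssp126", "ssp245", "ssp585"}), 2015, 2100)
print(is_included(model_a), is_included(model_b))
```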

Model Inclusion Caveats

The extraction criteria do not completely capture the model inclusion criteria, as they do not account for the year range available in the data. This determination is made when we process the data in later steps. See the scenario inclusion stage of the processing pipeline for more detail.

Data Availability

The following tables show the number of unique models available for each variable.

Model Count

variable	Source Count	Variant Count
hurs	15	34
pr	18	35
tas	20	44
tasmax	18	39
tasmin	17	37
tos	9	22
uas	16	35
vas	17	35

Source Breakdown

source	hurs	pr	tas	tasmax	tasmin	tos	uas	vas
ACCESS-CM2	3	2	2	3	2	2	1	1
AWI-CM-1-1-MR	1	1	1	1	1	1	1	1
BCC-CSM2-MR	0	1	1	1	1	1	1	1
CAMS-CSM1-0	0	1	1	1	0	0	0	0
CMCC-CM2-SR5	1	1	1	1	1	1	0	1
CMCC-ESM2	1	1	1	1	1	0	1	1
CNRM-CM6-1	6	1	6	1	1	0	6	6
CNRM-CM6-1-HR	1	0	0	0	0	0	1	1
CNRM-ESM2-1	3	1	4	1	1	0	3	3
FGOALS-g3	0	0	4	4	4	0	0	0
GFDL-ESM4	1	1	1	1	1	0	1	1
GISS-E2-1-G	0	0	1	0	0	0	0	0
IITM-ESM	1	1	1	0	0	0	1	1
INM-CM4-8	1	1	1	1	1	0	1	1
INM-CM5-0	1	1	1	1	1	0	1	1
MIROC-ES2L	0	1	1	1	1	0	1	1
MIROC6	0	3	3	3	3	3	3	3
MPI-ESM1-2-HR	2	2	2	2	2	2	2	1
MPI-ESM1-2-LR	10	10	10	10	10	10	10	10
MRI-ESM2-0	1	5	1	5	5	1	1	1
NorESM2-MM	1	1	1	1	1	1	0	0
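
The two tables appear to be related in a simple way: for a given variable, the Source Count is the number of sources with at least one variant and the Variant Count is the column total. This reading is an inference from the numbers rather than something stated by the pipeline; the sketch below checks it for the tos column.

```python
# Variant counts per source for the tos variable, copied from the table above.
tos_variants = {
    "ACCESS-CM2": 2, "AWI-CM-1-1-MR": 1, "BCC-CSM2-MR": 1, "CAMS-CSM1-0": 0,
    "CMCC-CM2-SR5": 1, "CMCC-ESM2": 0, "CNRM-CM6-1": 0, "CNRM-CM6-1-HR": 0,
    "CNRM-ESM2-1": 0, "FGOALS-g3": 0, "GFDL-ESM4": 0, "GISS-E2-1-G": 0,
    "IITM-ESM": 0, "INM-CM4-8": 0, "INM-CM5-0": 0, "MIROC-ES2L": 0,
    "MIROC6": 3, "MPI-ESM1-2-HR": 2, "MPI-ESM1-2-LR": 10, "MRI-ESM2-0": 1,
    "NorESM2-MM": 1,
}

# Sources that contribute at least one variant.
source_count = sum(1 for n in tos_variants.values() if n > 0)
# Total number of variants across all sources.
variant_count = sum(tos_variants.values())

print(source_count, variant_count)  # 9 22
```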

Variant Labels

For a given experiment, the realization_index, initialization_index, physics_index, and forcing_index are used to uniquely identify each simulation of an ensemble of runs contributed by a single model. These indices are defined as follows:

realization_index = an integer (≥1) distinguishing among members of an ensemble of simulations that differ only in their initial conditions (e.g., initialized from different points in a control run). Note that if two different simulations were started from the same initial conditions, the same realization number should be used for both simulations. For example if a historical run with “natural forcing” only and another historical run that includes anthropogenic forcing were both spawned at the same point in a control run, both should be assigned the same realization. Also, each so-called RCP (future scenario) simulation should normally be assigned the same realization integer as the historical run from which it was initiated. This will allow users to easily splice together the appropriate historical and future runs.
initialization_index = an integer (≥1), which should be assigned a value of 1 except to distinguish simulations performed under the same conditions but with different initialization procedures. In CMIP6 this index should invariably be assigned the value “1” except for some hindcast and forecast experiments called for by the DCPP activity. The initialization_index can be used either to distinguish between different algorithms used to impose initial conditions on a forecast or to distinguish between different observational datasets used to initialize a forecast.
physics_index = an integer (≥1) identifying the physics version used by the model. In the usual case of a single physics version of a model, this argument should normally be assigned the value 1, but it is essential that a consistent assignment of physics_index be used across all simulations performed by a particular model. Use of “physics_index” is reserved for closely-related model versions (e.g., as in a “perturbed physics” ensemble) or for the same model run with slightly different parameterizations (e.g., of cloud physics). Model versions that are substantially different from one another should be given a different “source_id” (rather than simply assigning a different value of the physics_index).
forcing_index = an integer (≥1) used to distinguish runs conforming to the protocol of a single CMIP6 experiment, but with different variants of forcing applied. One can, for example, distinguish between two historical simulations, one forced with the CMIP6-recommended forcing data sets and another forced by a different dataset, which might yield information about how forcing uncertainty affects the simulation.
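
In CMIP6 the four indices are combined into a single variant label of the form r<k>i<l>p<m>f<n>, for example r1i1p1f1. The sketch below parses such a label into its indices; the names `VariantLabel` and `parse_variant_label` are hypothetical, and it assumes the {VARIANT} part of the filenames uses this standard form.

```python
import re
from typing import NamedTuple

# Standard CMIP6 variant label: r<k>i<l>p<m>f<n>, each index an integer >= 1.
VARIANT_RE = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


class VariantLabel(NamedTuple):
    realization_index: int
    initialization_index: int
    physics_index: int
    forcing_index: int


def parse_variant_label(label: str) -> VariantLabel:
    """Parse a label such as 'r10i1p1f2' into its four integer indices."""
    match = VARIANT_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid variant label: {label!r}")
    indices = tuple(int(group) for group in match.groups())
    if min(indices) < 1:
        raise ValueError(f"indices must be >= 1: {label!r}")
    return VariantLabel(*indices)


variant = parse_variant_label("r10i1p1f2")
print(variant)
```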

Processed Data

The processed data is stored in the /mnt/share/erf/climate_downscale/results directory, organized by scenario and variable.

There are two types of processed data: daily and annual. Daily data is stored only for the historical period, plus the mean_temperature variable for the CMIP6 scenarios. We generally only generate annual results, as storing daily results for all models and all variables would be prohibitively expensive. Daily data is stored in the /mnt/share/erf/climate_downscale/results/daily directory.

Daily Data Storage and Naming Conventions

File Patterns:

/mnt/share/erf/climate_downscale/results/daily/historical/{DAILY_VARIABLE}/{YEAR}.nc - Daily data for historical variables.
/mnt/share/erf/climate_downscale/results/daily/historical/{DAILY_VARIABLE}/reference.nc - Reference climatology data for historical variables.
/mnt/share/erf/climate_downscale/results/daily/{SCENARIO}/mean_temperature/{YEAR}.nc - Daily data for the mean_temperature variable for CMIP6 scenarios.

Naming Conventions

{SCENARIO}: The CMIP6 scenario being stored (one of ssp126, ssp245, ssp585).
{DAILY_VARIABLE}: The name of the variable being stored (one of mean_temperature, max_temperature, min_temperature, wind_speed, relative_humidity, total_precipitation).
{YEAR}: The year of the data being stored. In historical subdirectories, this runs from 1950 to 2023. In scenario subdirectories, this runs from 2024 to 2100.
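
The rules above (historical files exist for every daily variable from 1950 to 2023, scenario files only for mean_temperature from 2024 to 2100) can be encoded in a small path helper. This is a sketch under those stated rules; the helper name `daily_path` is hypothetical and it does not check that the file exists on disk.

```python
from pathlib import Path

DAILY_ROOT = Path("/mnt/share/erf/climate_downscale/results/daily")
SCENARIOS = {"ssp126", "ssp245", "ssp585"}
DAILY_VARIABLES = {
    "mean_temperature", "max_temperature", "min_temperature",
    "wind_speed", "relative_humidity", "total_precipitation",
}


def daily_path(scenario: str, variable: str, year: int) -> Path:
    """Build the path to one daily file, validating against the documented layout."""
    if variable not in DAILY_VARIABLES:
        raise ValueError(f"unknown daily variable: {variable!r}")
    if scenario == "historical":
        first, last = 1950, 2023
    elif scenario in SCENARIOS:
        # Scenario daily data is only stored for mean_temperature.
        if variable != "mean_temperature":
            raise ValueError("scenario daily data is only stored for mean_temperature")
        first, last = 2024, 2100
    else:
        raise ValueError(f"unknown scenario: {scenario!r}")
    if not first <= year <= last:
        raise ValueError(f"{year} is outside {first}-{last} for {scenario}")
    return DAILY_ROOT / scenario / variable / f"{year}.nc"


print(daily_path("historical", "wind_speed", 2001).as_posix())
```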

The annual data is stored in the /mnt/share/erf/climate_downscale/results/annual directory. Annual data is stored by draw number, with each draw representing a random sample of a Global Climate Model (GCM) and variant from CMIP6. Each draw is a full annual time series from 1950 to 2100 and collates the historical ERA5 data with the CMIP6 scenario data.

Annual Data Storage and Naming Conventions

Archive File Patterns

These store the prior results for the climate database to ease the transition to the new draw-level outputs. They use an older version of the CMIP6 ensemble and represent an ensemble mean. Anything that still depends on them should move to the new draw-level outputs as soon as possible.

/mnt/share/erf/climate_downscale/results/annual/archive/historical/{ANNUAL_VARIABLE}/{YEAR}.nc - Archived historical annual data using the ERA5 dataset.
/mnt/share/erf/climate_downscale/results/annual/archive/{SCENARIO}/{ANNUAL_VARIABLE}/{YEAR}.nc - Archived scenario annual data using the CMIP6 dataset and the original point estimate ensemble.

Raw and Compiled File Patterns

/mnt/share/erf/climate_downscale/results/annual/raw/historical/{ANNUAL_VARIABLE}/{YEAR}_era5.nc - Raw historical annual data using the ERA5 dataset.
/mnt/share/erf/climate_downscale/results/annual/raw/{SCENARIO}/{ANNUAL_VARIABLE}/{YEAR}_{GCM_MEMBER}.nc - Raw scenario annual data using the CMIP6 dataset. Each dataset is a bias-corrected and downscaled GCM-member.
/mnt/share/erf/climate_downscale/results/annual/raw/compiled/{SCENARIO}/{ANNUAL_VARIABLE}/{GCM_MEMBER}.nc - Annual compilations of the raw scenario data for each GCM-member.

Draw File Pattern: /mnt/share/erf/climate_downscale/results/annual/{SCENARIO}/{ANNUAL_VARIABLE}/{DRAW}.nc

Naming Conventions

{ANNUAL_VARIABLE}: The name of the variable being stored (one of mean_temperature, mean_high_temperature, mean_low_temperature, days_over_30C, malaria_suitability, dengue_suitability, wind_speed, relative_humidity, total_precipitation, precipitation_days).
{SCENARIO}: The scenario being stored (one of ssp126, ssp245, ssp585).
{YEAR}: The year of the data being stored. In historical subdirectories, this runs from 1950 to 2023. In scenario subdirectories, this runs from 2024 to 2100.
{GCM_MEMBER}: The GCM member being stored. This is a unique identifier for each GCM member combining the source model and variant.
{DRAW}: The draw number of the data being stored as a three digit string (e.g. 027).
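
A short sketch of building a draw-level path, with the draw number zero-padded to three digits as described above. It assumes draw files live under results/annual/, as shown in the file tree; the helper name `draw_path` is hypothetical, and the total number of draws is not stated here, so the helper only checks that the number fits in three digits.

```python
from pathlib import Path

ANNUAL_ROOT = Path("/mnt/share/erf/climate_downscale/results/annual")
SCENARIOS = {"ssp126", "ssp245", "ssp585"}


def draw_path(scenario: str, variable: str, draw: int) -> Path:
    """Build the path to one annual draw file, e.g. draw 27 -> '027.nc'."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    # ASSUMPTION: only the three-digit format is known, not the draw count.
    if not 0 <= draw <= 999:
        raise ValueError("draw must fit in three digits")
    return ANNUAL_ROOT / scenario / variable / f"{draw:03d}.nc"


path = draw_path("ssp245", "mean_temperature", 27)
print(path.as_posix())
```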

Pipeline Stages

The processing pipelines turn the extracted ERA5 and CMIP6 data into a coherent set of climate variables with a consistent resolution, time scale, and data storage format. The pipeline is run with the cdrun command (see Installation for installation instructions). The pipeline has the following steps:

Historical Daily (cdrun generate historical_daily): This processes the hourly ERA5-Land and ERA5-Single-Level data into a unified daily format, pulling the higher-resolution ERA5-Land data where available and filling in with ERA5-Single-Level data. Historical daily data is produced for all core variables (those not derived by computing annual summary statistics from the daily data).
Historical Reference (cdrun generate historical_reference): This produces a set of reference climatologies from the historical daily data. These reference climatologies are built by averaging the last 5 years of historical data grouped by month. For example, the reference climatology for January mean temperature is the average of all January mean daily temperatures from 2019 to 2023. These climatologies are used to bias-correct the scenario data by serving as a seasonally-aware reference point we can intercept shift to.
Scenario Inclusion (cdrun generate scenario_inclusion): This produces a set of metadata that determines which CMIP sources and variants are used to generate scenario draws. This is the second stage scenario determination. When we extract CMIP6 data, we cannot determine the year range of the data until it is extracted. This stage determines which models are included based on the year range of the data and writes this information to a file in /mnt/share/erf/climate_downscale/results/metadata.
Scenario Daily (cdrun generate scenario_daily): This produces scenario projections from the CMIP6 data by downscaling the daily data to the same resolution as the ERA5 data. Our downscaling process computes the absolute or relative anomaly of a forecast day from a CMIP6 model relative to that model's average prediction for the same month over the reference period (2019-2023). This anomaly is then downscaled using linear interpolation to the ERA5 grid. The downscaled anomaly is then applied back to the reference climatology, which adds in fine-scale detail and provides a seasonally-aware bias correction. This produces draw-level estimates using a two-stage sampling method, first selecting a CMIP source model, and then selecting a variant from that model. This sampling method ensures each model of the ensemble is equally represented despite the large difference in the number of variants produced by each source model. NOTE: This stage is not typically invoked on its own, as storing draw-level daily data for all models and all variables is prohibitively expensive. It is instead invoked indirectly as part of the scenario annual stage.
Scenario Annual (cdrun generate scenario_annual): This produces annual estimates of the climate variables. It invokes the scenario daily stage to produce daily data, then computes annual summaries of the daily data, saving only the annual summary.
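
The core arithmetic of the reference, bias-correction, and sampling steps can be sketched for a single grid cell as follows. This is an illustration under simplifying assumptions and not the cdrun implementation: all function names and example values are hypothetical, the spatial interpolation of the anomaly to the ERA5 grid is omitted, and which variables use absolute versus relative anomalies is left to the caller.

```python
import random
from statistics import mean


def monthly_reference(daily, years=range(2019, 2024)):
    """Average all daily values in the reference years, grouped by month.

    `daily` maps (year, month) -> list of daily values for one grid cell.
    """
    reference = {}
    for month in range(1, 13):
        values = [v for year in years for v in daily.get((year, month), [])]
        if values:
            reference[month] = mean(values)
    return reference


def bias_correct(value, month, model_ref, era5_ref, relative=False):
    """Apply the model's anomaly for `month` to the ERA5 reference climatology."""
    if relative:
        # Relative anomaly: scale the reference by the model's ratio.
        return era5_ref[month] * (value / model_ref[month])
    # Absolute anomaly: shift the reference by the model's difference.
    return era5_ref[month] + (value - model_ref[month])


def sample_member(variants_by_source, rng):
    """Two-stage sampling: pick a source uniformly, then one of its variants."""
    source = rng.choice(sorted(variants_by_source))
    variant = rng.choice(sorted(variants_by_source[source]))
    return source, variant


# Hypothetical January temperatures (degrees C) for one grid cell.
era5_daily = {(2019, 1): [10.0, 12.0], (2020, 1): [11.0, 11.0]}
era5_ref = monthly_reference(era5_daily)           # {1: 11.0}
corrected = bias_correct(15.0, 1, {1: 13.0}, era5_ref)  # 11.0 + (15.0 - 13.0)
scaled = bias_correct(3.0, 1, {1: 2.0}, {1: 4.0}, relative=True)  # 4.0 * 1.5

# Hypothetical ensemble: one source has far more variants than the other.
ensemble = {"MODEL-A": ["r1i1p1f1"], "MODEL-B": [f"r{k}i1p1f1" for k in range(1, 11)]}
member = sample_member(ensemble, random.Random(0))
print(era5_ref, corrected, scaled, member)
```

Because the source is chosen first, MODEL-A and MODEL-B are each selected about half the time in this example, even though MODEL-B contributes ten times as many variants.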