
NetCDF and CF: The Basics


Overview

This notebook will introduce you to the basics of Climate and Forecast (CF) metadata for netCDF data files. Along with an introduction to netCDF, we will introduce the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We will cover the following:

  1. Demonstrating gridded data

  2. Demonstrating observational data

Prerequisites

Concepts     | Importance | Notes
------------ | ---------- | -----
Numpy Basics | Necessary  |
Datetime     | Necessary  |

  • Time to learn: 50 minutes


Imports

from datetime import datetime, timedelta

import numpy as np
from cftime import date2num
from netCDF4 import Dataset
from pyproj import Proj

Gridded Data

Let’s say we’re working with some numerical weather forecast model output. Let’s walk through the steps necessary to store this data in netCDF, using the Climate and Forecast (CF) metadata conventions to ensure that our data are available to as many tools as possible.

To start, let’s assume the following about our data:

  • There are three spatial dimensions (x, y, and press) and one temporal dimension (times)

  • The native coordinate system of the model is on a regular 3km x 3km grid (x and y) that represents the Earth on a Lambert conformal projection.

  • The vertical dimension (press) consists of several discrete pressure levels in units of hPa.

  • The time dimension (times) consists of hourly values spanning a twelve-hour forecast period (thirteen times in total), beginning at 2200 UTC on the current day.

We’ll also go ahead and generate our dimensional arrays:

start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)
times = np.array([start + timedelta(hours=h) for h in range(13)])

x = np.arange(-150, 153, 3)
y = np.arange(-100, 100, 3)

press = np.array([1000, 925, 850, 700, 500, 300, 250])

Next, let’s create a multidimensional array of random values to hold our variable of interest … in this case, temperature (temps). Note that the dimensions correspond to the ones we just created above.

temps = np.random.randn(times.size, press.size, y.size, x.size)

Creating the file and dimensions

The first step is to create a new file in netCDF format and set up the shared dimensions we’ll be using in the file. We’ll be using the netCDF4 library to do all of the requisite netCDF API calls.

nc = Dataset('forecast_model.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)

Info

This creates the netCDF file in memory. If you wish to write the file to disk you can add the persist=True argument or remove the diskless=True argument.
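For example, a minimal variation of the call above that keeps working in memory but also writes the result to disk when the file is closed:

nc = Dataset('forecast_model.nc', 'w', format='NETCDF4_CLASSIC',
             diskless=True, persist=True)  # in-memory while open, persisted to disk on close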

Danger

If you open an existing file with ‘w’ as the second argument, any data already in the file will be overwritten. If you would like to edit or add to the file, open it using ‘a’ as the second argument.
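For instance, a later session could reopen an on-disk copy of this file to add variables or data without losing what is already there (a sketch; it assumes the file was actually persisted to disk):

existing = Dataset('forecast_model.nc', 'a')  # 'a' keeps existing dimensions, variables, and data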

We’re going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they’re easy to add and help users keep the data straight, so let’s go ahead and do it.

nc.Conventions = 'CF-1.7'
nc.title = 'Forecast model run'
nc.institution = 'Unidata'
nc.source = 'WRF-1.5'
nc.history = str(datetime.utcnow()) + ' Python'
nc.references = ''
nc.comment = ''

At this point, this is the plain-text representation of this netCDF dataset:

netcdf forecast_model {
  attributes:
    :Conventions = "CF-1.7" ;
    :title = "Forecast model run" ;
    :institution = "Unidata" ;
    :source = "WRF-1.5" ;
    :history = "2019-07-16 02:21:52.005718 Python" ;
    :references = "" ;
    :comment = "" ;
}

Info

This plain-text representation is known as the network Common Data form Language, or CDL.
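If the file has been written to disk (e.g., with persist=True), the same CDL can be produced with the ncdump utility that ships with netCDF; the -h flag prints just the header (dimensions, variables, and attributes) without the data values:

!ncdump -h forecast_model.nc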

Next, before adding variables to the file to define each of the data fields in this file, we need to define the dimensions that exist in this data set. We set each of x, y, and pressure to the size of the corresponding array. We set forecast_time to be an “unlimited” dimension, which allows the dataset to grow along that dimension if we write additional data to it later.

nc.createDimension('forecast_time', None)
nc.createDimension('x', x.size)
nc.createDimension('y', y.size)
nc.createDimension('pressure', press.size)
nc
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    Conventions: CF-1.7
    title: Forecast model run
    institution: Unidata
    source: WRF-1.5
    history: 2022-11-08 20:05:02.590641 Python
    references: 
    comment: 
    dimensions(sizes): forecast_time(0), x(101), y(67), pressure(7)
    variables(dimensions): 
    groups: 

The CDL representation now shows our dimensions:

netcdf forecast_model {
  dimensions:
    forecast_time = UNLIMITED (currently 0) ;
    x = 101 ;
    y = 67 ;
    pressure = 7 ;
  attributes:
    :Conventions = "CF-1.7" ;
    :title = "Forecast model run" ;
    :institution = "Unidata" ;
    :source = "WRF-1.5" ;
    :history = "2019-07-16 02:21:52.005718 Python" ;
    :references = "" ;
    :comment = "" ;
}

Creating and filling a variable

So far, all we’ve done is outlined basic information about our dataset: broad metadata and the dimensions of our dataset. Now we create a netCDF4 variable to hold one particular data field for our dataset, in this case the forecast air temperature. When defining this variable, we specify the datatype for the values being stored, the relevant dimensions, as well as enable optional zlib-based compression.

temps_var = nc.createVariable(
    'Temperature',
    datatype=np.float32,
    dimensions=('forecast_time', 'pressure', 'y', 'x'),
    zlib=True,
)

Now that we have the variable, we tell Python to write our array of data to it.

temps_var[:] = temps
temps_var
<class 'netCDF4._netCDF4.Variable'>
float32 Temperature(forecast_time, pressure, y, x)
unlimited dimensions: forecast_time
current shape = (13, 7, 67, 101)
filling on, default _FillValue of 9.969209968386869e+36 used

If instead we wanted to write data sporadically, like once per time step, we could do that instead (though the for loop below might actually be at a higher level in the program):

next_slice = 0
for temp_slice in temps:
    temps_var[next_slice] = temp_slice
    next_slice += 1

At this point, this is the CDL representation of our dataset:

netcdf forecast_model {
  dimensions:
    forecast_time = UNLIMITED (currently 13) ;
    x = 101 ;
    y = 67 ;
    pressure = 7 ;
  variables:
    float Temperature(forecast_time, pressure, y, x) ;
  attributes:
    :Conventions = "CF-1.7" ;
    :title = "Forecast model run" ;
    :institution = "Unidata" ;
    :source = "WRF-1.5" ;
    :history = "2019-07-16 02:21:52.005718 Python" ;
    :references = "" ;
    :comment = "" ;
}

We can also add attributes to this variable to define metadata. The CF conventions require a units attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we set it to a value of 'Kelvin'. We also set the standard (optional) attributes of long_name and standard_name. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions. This allows users of data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the missing_value attribute to an appropriate value.

NASA Dataset Interoperability Recommendations:

Section 2.2 - Include Basic CF Attributes

Include where applicable: units, long_name, standard_name, valid_min / valid_max, scale_factor / add_offset and others.

temps_var.units = 'Kelvin'
temps_var.standard_name = 'air_temperature'
temps_var.long_name = 'Forecast air temperature'
temps_var.missing_value = -9999
temps_var
<class 'netCDF4._netCDF4.Variable'>
float32 Temperature(forecast_time, pressure, y, x)
    units: Kelvin
    standard_name: air_temperature
    long_name: Forecast air temperature
    missing_value: -9999.0
unlimited dimensions: forecast_time
current shape = (13, 7, 67, 101)
filling on, default _FillValue of 9.969209968386869e+36 used

The resulting CDL (truncated to the variables only) looks like:

  variables:
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;

Coordinate variables

To properly orient our data in time and space, we need to go beyond dimensions (which define common sizes and alignment) and include values along these dimensions, which are called “Coordinate Variables”. Generally, these are defined by creating a one-dimensional variable with the same name as the respective dimension.

To start, we define variables which define our x and y coordinate values. These variables include standard_names which allow associating them with projections (more on this later) as well as an optional axis attribute to make clear what standard direction this coordinate refers to.

x_var = nc.createVariable('x', np.float32, ('x',))
x_var[:] = x
x_var.units = 'km'
x_var.axis = 'X'  # Optional
x_var.standard_name = 'projection_x_coordinate'
x_var.long_name = 'x-coordinate in projected coordinate system'

y_var = nc.createVariable('y', np.float32, ('y',))
y_var[:] = y
y_var.units = 'km'
y_var.axis = 'Y'  # Optional
y_var.standard_name = 'projection_y_coordinate'
y_var.long_name = 'y-coordinate in projected coordinate system'

We also define a coordinate variable pressure to reference our data in the vertical dimension. The standard_name of 'air_pressure' is sufficient to identify this coordinate variable as the vertical axis, but let’s go ahead and specify the axis as well. We also specify the attribute positive to indicate whether the variable increases when going up or down. In the case of pressure, this is technically optional.

press_var = nc.createVariable('pressure', np.float32, ('pressure',))
press_var[:] = press
press_var.units = 'hPa'
press_var.axis = 'Z'  # Optional
press_var.standard_name = 'air_pressure'
press_var.positive = 'down'  # Optional

Time coordinates must contain a units attribute with a string value with a form similar to 'seconds since 2019-01-06 12:00:00.00'. ‘seconds’, ‘minutes’, ‘hours’, and ‘days’ are the most commonly used units for time. Due to the variable length of months and years, they are not recommended.

Before we can write data, we need to first convert our list of Python datetime instances to numeric values. We can use the date2num method from the cftime library to make this easy to convert using the unit string as defined above.

time_units = f'hours since {times[0]:%Y-%m-%d 00:00}'
time_vals = date2num(times, time_units)
time_vals
array([22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])

Now we can create the forecast_time variable just as we did before for the other coordinate variables:

time_var = nc.createVariable('forecast_time', np.int32, ('forecast_time',))
time_var[:] = time_vals
time_var.units = time_units
time_var.axis = 'T'  # Optional
time_var.standard_name = 'time'  # Optional
time_var.long_name = 'time'

The CDL representation of the variables now contains much more information:

  dimensions:
    forecast_time = UNLIMITED (currently 13) ;
    x = 101 ;
    y = 67 ;
    pressure = 7 ;
  variables:
    float x(x) ;
      x:units = "km" ;
      x:axis = "X" ;
      x:standard_name = "projection_x_coordinate" ;
      x:long_name = "x-coordinate in projected coordinate system" ;
    float y(y) ;
      y:units = "km" ;
      y:axis = "Y" ;
      y:standard_name = "projection_y_coordinate" ;
      y:long_name = "y-coordinate in projected coordinate system" ;
    float pressure(pressure) ;
      pressure:units = "hPa" ;
      pressure:axis = "Z" ;
      pressure:standard_name = "air_pressure" ;
      pressure:positive = "down" ;
    int forecast_time(forecast_time) ;
      forecast_time:units = "hours since 2019-07-16 00:00" ;
      forecast_time:axis = "T" ;
      forecast_time:standard_name = "time" ;
      forecast_time:long_name = "time" ;
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;

Auxiliary Coordinates

Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called “auxiliary coordinate variables”, not because they are extra, but because they are not simple one-dimensional variables.

Below, we first generate longitude and latitude values from our projected coordinates using the pyproj library.

X, Y = np.meshgrid(x, y)
lcc = Proj({'proj': 'lcc', 'lon_0': -105, 'lat_0': 40, 'a': 6371000.0, 'lat_1': 25})
lon, lat = lcc(X * 1000, Y * 1000, inverse=True)

Now we can create the needed variables. Both are dimensioned on y and x and are two-dimensional. The longitude variable is identified as actually containing such information by its required units of 'degrees_east', as well as the optional 'longitude' standard_name attribute. The case is the same for latitude, except the units are 'degrees_north' and the standard_name is 'latitude'.

lon_var = nc.createVariable('lon', np.float64, ('y', 'x'))
lon_var[:] = lon
lon_var.units = 'degrees_east'
lon_var.standard_name = 'longitude'  # Optional
lon_var.long_name = 'longitude'

lat_var = nc.createVariable('lat', np.float64, ('y', 'x'))
lat_var[:] = lat
lat_var.units = 'degrees_north'
lat_var.standard_name = 'latitude'  # Optional
lat_var.long_name = 'latitude'

With the variables created, we identify these variables as containing coordinates for the Temperature variable by setting its coordinates attribute to a space-separated list of the names of the auxiliary coordinate variables:

temps_var.coordinates = 'lon lat'

This yields the following CDL:

  double lon(y, x) ;
    lon:units = "degrees_east" ;
    lon:standard_name = "longitude" ;
    lon:long_name = "longitude" ;
  double lat(y, x) ;
    lat:units = "degrees_north" ;
    lat:standard_name = "latitude" ;
    lat:long_name = "latitude" ;
  float Temperature(forecast_time, pressure, y, x) ;
    Temperature:units = "Kelvin" ;
    Temperature:standard_name = "air_temperature" ;
    Temperature:long_name = "Forecast air temperature" ;
    Temperature:missing_value = -9999.0 ;
    Temperature:coordinates = "lon lat" ;

Coordinate System Information

With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. We can do this using a “grid mapping” variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their grid_mapping attribute.

Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The grid_mapping_name attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings.

proj_var = nc.createVariable('lambert_projection', np.int32, ())
proj_var.grid_mapping_name = 'lambert_conformal_conic'
proj_var.standard_parallel = 25.0
proj_var.latitude_of_projection_origin = 40.0
proj_var.longitude_of_central_meridian = -105.0
proj_var.semi_major_axis = 6371000.0
proj_var
<class 'netCDF4._netCDF4.Variable'>
int32 lambert_projection()
    grid_mapping_name: lambert_conformal_conic
    standard_parallel: 25.0
    latitude_of_projection_origin: 40.0
    longitude_of_central_meridian: -105.0
    semi_major_axis: 6371000.0
unlimited dimensions: 
current shape = ()
filling on, default _FillValue of -2147483647 used

Now that we created the variable, all that’s left is to set the grid_mapping attribute on our Temperature variable to the name of our dummy variable:

temps_var.grid_mapping = 'lambert_projection'  # or proj_var.name

This yields the CDL:

  variables:
    int lambert_projection ;
      lambert_projection:grid_mapping_name = "lambert_conformal_conic" ;
      lambert_projection:standard_parallel = 25.0 ;
      lambert_projection:latitude_of_projection_origin = 40.0 ;
      lambert_projection:longitude_of_central_meridian = -105.0 ;
      lambert_projection:semi_major_axis = 6371000.0 ;
    float Temperature(forecast_time, pressure, y, x) ;
      Temperature:units = "Kelvin" ;
      Temperature:standard_name = "air_temperature" ;
      Temperature:long_name = "Forecast air temperature" ;
      Temperature:missing_value = -9999.0 ;
      Temperature:coordinates = "lon lat" ;
      Temperature:grid_mapping = "lambert_projection" ;

Cell Bounds

NASA Dataset Interoperability Recommendations:

Section 2.3 - Use CF “bounds” attributes

CF conventions state: “When gridded data does not represent the point values of a field but instead represents some characteristic of the field within cells of finite ‘volume,’ a complete description of the variable should include metadata that describes the domain or extent of each cell, and the characteristic of the field that the cell values represent.”

For example, if a rain gauge is read every three hours but only dumped every six hours, the file might look like this:

netcdf precip_bucket_bounds {
  dimensions:
      lat = 12 ;
      lon = 19 ;
      time = 8 ;
      tbv = 2 ;
  variables:
      float lat(lat) ;
      float lon(lon) ;
      float time(time) ;
        time:units = "hours since 2019-07-12 00:00:00.00";
        time:bounds = "time_bounds" ;
      float time_bounds(time, tbv) ;
      float precip(time, lat, lon) ;
        precip:units = "inches" ;
  data:
    time = 3, 6, 9, 12, 15, 18, 21, 24;
    time_bounds = 0, 3, 0, 6, 6, 9, 6, 12, 12, 15, 12, 18, 18, 21, 18, 24;
}

So the time coordinate looks like

|---X
|-------X
        |---X
        |-------X
                |---X
                |-------X
                        |---X
                        |-------X
0   3   6   9  12  15  18  21  24
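Translating that CDL into netCDF4-python calls might look like the sketch below; the file name, sizes, and values are hypothetical and mirror the CDL above, and only the time coordinate and its bounds are shown:

nc_b = Dataset('precip_bucket_bounds.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
nc_b.createDimension('time', 8)
nc_b.createDimension('tbv', 2)  # two bound values (start, end) per cell

time_b = nc_b.createVariable('time', np.float32, ('time',))
time_b.units = 'hours since 2019-07-12 00:00:00.00'
time_b.bounds = 'time_bounds'  # names the variable holding each cell's extent
time_b[:] = [3, 6, 9, 12, 15, 18, 21, 24]

bounds_var = nc_b.createVariable('time_bounds', np.float32, ('time', 'tbv'))
bounds_var[:] = [[0, 3], [0, 6], [6, 9], [6, 12], [12, 15], [12, 18], [18, 21], [18, 24]]
nc_b.close()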

Observational Data

So far we’ve focused on how to handle storing data that are arranged in a grid. What about in-situ data, often termed “observational” data? CF describes this as Conventions for Discrete Sampling Geometries (DSG).

For data that are regularly sampled (e.g., from a vertical profiler site), this is straightforward. First, let’s define some sample profile data for three hypothetical profilers located at Boulder, Norman, and Albany. We’ll assume that each profiler reports data at ten heights, evenly spaced from 10 m up to 1000 m.

lons = np.array([-105, -97.1, -73.8])
lats = np.array([40, 35.25, 42.75])
heights = np.linspace(10, 1000, 10)
temps = np.random.randn(lats.size, heights.size)
stids = ['KBOU', 'KOUN', 'KALB']

Creation and basic setup

First we create a new file and define some dimensions. Since this is profile data, heights will be one dimension, and we use station as our other dimension. We set the global featureType attribute to 'profile' to indicate that this file holds “an ordered set of data points along a vertical line at a fixed horizontal position and fixed time”. Finally, we add a str_len dimension to assist in storing our string station IDs.

nc.close()
nc = Dataset('obs_data.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
nc.createDimension('station', lats.size)
nc.createDimension('heights', heights.size)
nc.createDimension('str_len', 4)
nc.Conventions = 'CF-1.7'
nc.featureType = 'profile'
nc
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    Conventions: CF-1.7
    featureType: profile
    dimensions(sizes): station(3), heights(10), str_len(4)
    variables(dimensions): 
    groups: 

Which gives this CDL:

netcdf obs_data {
  dimensions:
    station = 3 ;
    heights = 10 ;
    str_len = 4 ;
  attributes:
    :Conventions = "CF-1.7" ;
    :featureType = "profile" ;
}

We can create our coordinates with:

lon_var = nc.createVariable('lon', np.float64, ('station',))
lon_var.units = 'degrees_east'
lon_var.standard_name = 'longitude'
lon_var[:] = lons

lat_var = nc.createVariable('lat', np.float64, ('station',))
lat_var.units = 'degrees_north'
lat_var.standard_name = 'latitude'
lat_var[:] = lats

The standard refers to these as “instance variables” because each one refers to an instance of a feature. From here we can create our height coordinate variable:

heights_var = nc.createVariable('heights', np.float32, ('heights',))
heights_var.units = 'meters'
heights_var.standard_name = 'altitude'
heights_var.positive = 'up'
heights_var[:] = heights

Station IDs

Now we can also write our station IDs to a variable. This is a 2D variable, but one of the dimensions is simply there to facilitate treating strings as character arrays. We also assign this an attribute cf_role with a value of 'profile_id' to facilitate software to identify individual profiles:

stid_var = nc.createVariable('stid', 'c', ('station', 'str_len'))
stid_var.cf_role = 'profile_id'
stid_var.long_name = 'Station identifier'
stid_var[:] = stids

Now our CDL looks like:

netcdf obs_data {
  dimensions:
    station = 3 ;
    heights = 10 ;
    str_len = 4 ;
  variables:
    double lon(station) ;
      lon:units = "degrees_east" ;
      lon:standard_name = "longitude" ;
    double lat(station) ;
      lat:units = "degrees_north" ;
      lat:standard_name = "latitude" ;
    float heights(heights) ;
      heights:units = "meters" ;
      heights:standard_name = "altitude";
      heights:positive = "up" ;
    char stid(station, str_len) ;
      stid:cf_role = "profile_id" ;
      stid:long_name = "Station identifier" ;
  attributes:
    :Conventions = "CF-1.7" ;
    :featureType = "profile" ;
}

Writing the field

Now all that’s left is to write our profile data, which looks fairly standard. We also add a scalar variable for the time at which these profiles were captured:

time_var = nc.createVariable('time', np.float32, ())
time_var.units = 'minutes since 2019-07-16 17:00'
time_var.standard_name = 'time'
time_var[:] = [5.0]

temp_var = nc.createVariable('temperature', np.float32, ('station', 'heights'))
temp_var.units = 'celsius'
temp_var.standard_name = 'air_temperature'
temp_var.coordinates = 'lon lat heights time'
temp_var[:] = temps

Note the use of the coordinates attribute to store the names of the auxiliary coordinate variables: lon and lat are dimensioned on station and time is a scalar, so none of them are proper coordinate variables and they must be attached to the data variable this way. This yields the CDL for the variables:

  variables:
    double lon(station) ;
      lon:units = "degrees_east" ;
      lon:standard_name = "longitude" ;
    double lat(station) ;
      lat:units = "degrees_north" ;
      lat:standard_name = "latitude" ;
    float heights(heights) ;
      heights:units = "meters" ;
      heights:standard_name = "altitude";
      heights:positive = "up" ;
    char stid(station, str_len) ;
      stid:cf_role = "profile_id" ;
      stid:long_name = "Station identifier" ;
    float time ;
      time:units = "minutes since 2019-07-16 17:00" ;
      time:standard_name = "time" ;
    float temperature(station, heights) ;
      temperature:units = "celsius" ;
      temperature:standard_name = "air_temperature" ;
      temperature:coordinates = "lon lat heights time" ;

These standards for storing DSG data extend to time series, trajectories, and combinations of them. They can also accommodate differing amounts of data per feature by using ragged arrays, as sketched below. For more information see the main document or the annotated DSG examples.
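As an illustration (separate from the profile file built above), a contiguous ragged array packs every feature’s samples back-to-back along a single obs dimension and uses a count variable, identified by its sample_dimension attribute, to record how many samples belong to each feature. A minimal sketch with hypothetical sizes, showing only the count variable and one data variable:

nc_r = Dataset('ragged_profiles.nc', 'w', format='NETCDF4_CLASSIC', diskless=True)
nc_r.featureType = 'profile'
nc_r.createDimension('profile', 3)
nc_r.createDimension('obs', 16)  # 5 + 8 + 3 samples, packed contiguously

row_size = nc_r.createVariable('row_size', np.int32, ('profile',))
row_size.long_name = 'number of observations in this profile'
row_size.sample_dimension = 'obs'  # marks this as the count variable for the ragged array
row_size[:] = [5, 8, 3]

temp_r = nc_r.createVariable('temperature', np.float32, ('obs',))
temp_r.units = 'celsius'
temp_r[:] = np.random.randn(16)
nc_r.close()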


Summary

We have discussed and created examples of netCDF Datasets, both gridded and in-situ, that follow the Climate and Forecast (CF) Conventions. These Datasets are self-describing, in that their attributes, or metadata, are included. Other libraries in the Python scientific software ecosystem, such as xarray and MetPy, can thus easily read in (and write to) and analyze these Datasets.
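For instance, assuming the gridded file above was written to disk (e.g., with persist=True), xarray can open it directly and will use the CF coordinate, units, and grid-mapping metadata we wrote to label the dataset:

import xarray as xr

ds = xr.open_dataset('forecast_model.nc')
print(ds['Temperature'])  # dimensions, coordinates, and attributes come from the CF metadata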

What’s Next?

In subsequent notebooks, we will work with netCDF Datasets culled from “real-life” model and in-situ sources.