Open In Colab

Introduction to NetCDF and xarray

What is NetCDF?

NetCDF (Network Common Data Form) is a file format designed for storing and sharing multidimensional scientific data. It was developed by Unidata and has become the standard format in oceanography, climatology, and atmospheric sciences.

A NetCDF file is structured around three core concepts:

  • Variables: the actual data arrays (e.g. temperature, salinity, wind speed).

  • Dimensions: the axes along which variables are defined (e.g. time, latitude, longitude, depth).

  • Attributes: metadata that describes the variables and the dataset (units, long name, source, etc…).

For example, a sea surface temperature dataset might have a variable sst with dimensions (time, lat, lon), where each point in the array stores the temperature at a given location and time.

Why NetCDF?

  • Self-describing: the file contains metadata that explains what the data represents.

  • Portable across operating systems and programming languages.

  • Efficient for large multidimensional arrays.

  • Widely supported by scientific software (Python, R, MATLAB, NCO, CDO…).

File extensions

NetCDF files typically use the .nc or .nc4 extension. The most common version is NetCDF-4, which is built on top of HDF5.

The xarray library

xarray is the standard Python library for working with labelled multidimensional arrays. It extends NumPy by attaching dimension names and coordinate labels to arrays, making it much easier to work with NetCDF data.

The two main data structures in xarray are:

  • DataArray: a single labelled multidimensional array (equivalent to one NetCDF variable).

  • Dataset: a collection of DataArray objects sharing the same dimensions (equivalent to a full NetCDF file).

To use this library we need to install two python packages: xarray and netcdf4

[1]:
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading a NetCDF File

We will use a sample dataset included in xarray, the ERSSTV5 sea surface temperature dataset to illustrate the basic operations of this library. In a real workflow, you must replace this with your own file:

ds = xr.open_dataset('your_file.nc')
[2]:
# Load a sample dataset included in xarray
ds = xr.tutorial.load_dataset('ersstv5')
print(ds)
<xarray.Dataset> Size: 40MB
Dimensions:    (time: 624, nbnds: 2, lat: 89, lon: 180)
Coordinates:
  * time       (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01
  * lat        (lat) float32 356B 88.0 86.0 84.0 82.0 ... -84.0 -86.0 -88.0
  * lon        (lon) float32 720B 0.0 2.0 4.0 6.0 ... 352.0 354.0 356.0 358.0
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) float64 10kB 9.969e+36 9.969e+36 ... 9.969e+36
    sst        (time, lat, lon) float32 40MB -1.8 -1.8 -1.8 -1.8 ... nan nan nan
Attributes: (12/37)
    climatology:               Climatology is based on 1971-2000 SST, Xue, Y....
    description:               In situ data: ICOADS2.5 before 2007 and NCEP i...
    keywords_vocabulary:       NASA Global Change Master Directory (GCMD) Sci...
    keywords:                  Earth Science > Oceans > Ocean Temperature > S...
    instrument:                Conventional thermometers
    source_comment:            SSTs were observed by conventional thermometer...
    ...                        ...
    creator_url_original:      https://www.ncei.noaa.gov
    license:                   No constraints on data access or use
    comment:                   SSTs were observed by conventional thermometer...
    summary:                   ERSST.v5 is developed based on v4 after revisi...
    dataset_title:             NOAA Extended Reconstructed SST V5
    data_modified:             2022-06-07

Exploring the Dataset

The printed output of a Dataset shows its dimensions, coordinates, variables, and global attributes. Let’s explore each component.

[ ]:
# Dimensions
print('Dimensions:', dict(ds.dims))

[ ]:
# Coordinates
print('\nCoordinates:', list(ds.coords))

[ ]:
# Variables
print('\nVariables:', list(ds.data_vars))
[ ]:
# Global attributes
print('\nAttributes:', ds.attrs)
[ ]:
# Inspect a single variable (DataArray)
sst = ds['sst']
print(sst)
[ ]:
# Shape and dimensions of the variable
print('Shape:', sst.shape)
print('Dimensions:', sst.dims)
print('Units:', sst.attrs.get('units', 'not specified'))

Selecting Data

xarray provides two methods for selecting data:

  • ``.sel()``: select by label (coordinate value)

  • ``.isel()``: select by integer index

[ ]:
# Select a specific time step by label
sst_2000 = sst.sel(time='2000-01')
print('Selected time step shape:', sst_2000.shape)

# Select a time range
sst_range = sst.sel(time=slice('1990-01', '2000-12'))
print('Time range shape:', sst_range.shape)

# Select by integer index
sst_first = sst.isel(time=0)
print('First time step shape:', sst_first.shape)
[ ]:
# Select a geographic region (Mediterranean Sea)
sst_med = sst.sel(
    lat=slice(30, 46),
    lon=slice(-6, 36)
)
print('Mediterranean region shape:', sst_med.shape)
[ ]:
sst_med

Basic Operations

xarray supports Pandas-like operations along named dimensions, which makes computing spatial or temporal statistics very intuitive.

[ ]:
# Temporal mean (average over all time steps)
sst_mean = sst.mean(dim='time')
print('Temporal mean shape:', sst_mean.shape)

# Spatial mean (average over lat and lon)
sst_global_mean = sst.mean(dim=['lat', 'lon'])
print('Global mean time series shape:', sst_global_mean.shape)

# Seasonal groupby
sst_seasonal = sst.groupby('time.season').mean()
print('Seasonal mean shape:', sst_seasonal.shape)

Visualisation

xarray integrates directly with matplotlib, making it easy to plot maps and time series.

[ ]:
# Map: temporal mean SST
sst_mean.plot(figsize=(12, 5), cmap='RdBu_r')
plt.title('Mean Sea Surface Temperature')
plt.tight_layout()
plt.show()
[ ]:
# Time series: global mean SST
sst_global_mean.plot(figsize=(12, 4))
plt.title('Global Mean SST Over Time')
plt.ylabel('SST (ºC)')
plt.tight_layout()
plt.show()

Converting to NumPy

When the data needs to be passed to scikit-learn or PyTorch, it must be converted to a NumPy array. This is straightforward with .values.

[ ]:
# Convert a DataArray to NumPy
sst_numpy = sst.values
print('Type:', type(sst_numpy))
print('Shape:', sst_numpy.shape)
print('dtype:', sst_numpy.dtype)
[ ]:
# Extract coordinates as NumPy arrays
lat  = ds['lat'].values
lon  = ds['lon'].values
time = ds['time'].values

print('Latitudes:', lat[:5])
print('Longitudes:', lon[:5])
print('Time steps:', time[:5])

Converting to Pandas

For tabular analysis or when working with time series at a specific location, xarray can be converted to a pandas DataFrame.

[ ]:
# Convert the full Dataset to a DataFrame
df = ds.to_dataframe()
print(df.head())
print('\nShape:', df.shape)
[ ]:
# Extract a time series at a specific location
sst_point = sst.sel(lat=40.0, lon=0.0, method='nearest')
df_point  = sst_point.to_dataframe(name='sst').reset_index()

print(df_point.head())
print('\nShape:', df_point.shape)
[ ]:
# Plot the extracted time series
df_point.set_index('time')['sst'].plot(figsize=(12, 4))
plt.title('SST Time Series at lat=40°N, lon=0°E')
plt.ylabel('SST (ºC)')
plt.tight_layout()
plt.show()

Summary

Operation

xarray method

Open a NetCDF file

xr.open_dataset('file.nc')

Select by label

.sel(dim=value)

Select by index

.isel(dim=index)

Compute mean

.mean(dim='dim_name')

Convert to NumPy

.values

Convert to DataFrame

.to_dataframe()

Plot

.plot()

Exercise

We will use the GISS Surface Temperature dataset link.

The GISTEMP v4 dataset (GISS Surface Temperature Analysis) contains monthly surface temperature anomalies on a regular 2°×2° global grid from 1880 to the present. Anomalies are computed as deviations from the 1951–1980 baseline mean.

The NetCDF file has the following structure:

  • ``tempanomaly``: surface temperature anomaly (ºC) — shape (time, lat, lon)

  • ``time``: monthly time steps from January 1880

  • ``lat``: 90 latitude points from -89° to 89° (2° resolution)

  • ``lon``: 180 longitude points from -179° to 179° (2° resolution)

More info

Extract a temperature anomaly time series at a specific geographic location (LAT = 39.0, LON = 3.0), convert it to a pandas dataframe and into a numpy array suitable for machine learning tasks.

[ ]:
ds = xr.open_dataset(PATH, engine='scipy')