{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": [], "id": "7019b808aa039e8f" }, { "cell_type": "markdown", "id": "a1b2c3d4", "metadata": {}, "source": [ "# Introduction to NetCDF and xarray\n", "\n", "## What is NetCDF?\n", "\n", "[NetCDF](https://en.wikipedia.org/wiki/NetCDF) (*Network Common Data Form*) is a file format designed for storing and sharing multidimensional scientific data. It was developed by Unidata and has become a de facto standard format in oceanography, climatology, and atmospheric sciences.\n", "\n", "A NetCDF file is structured around three core concepts:\n", "\n", "- **Variables**: the actual data arrays (e.g. temperature, salinity, wind speed).\n", "- **Dimensions**: the axes along which variables are defined (e.g. time, latitude, longitude, depth).\n", "- **Attributes**: metadata that describes the variables and the dataset (units, long name, source, etc.).\n", "\n", "For example, a sea surface temperature dataset might have a variable `sst` with dimensions `(time, lat, lon)`, where each element of the array stores the temperature at a given time and location.\n", "\n", "### Why NetCDF?\n", "\n", "- Self-describing: the file contains metadata that explains what the data represents.\n", "- Portable across operating systems and programming languages.\n", "- Efficient for large multidimensional arrays.\n", "- Widely supported by scientific software (Python, R, MATLAB, NCO, CDO, and others).\n", "\n", "### File extensions\n", "\n", "NetCDF files typically use the `.nc` or `.nc4` extension. The most common version is **NetCDF-4**, which is built on top of HDF5." ] }, { "cell_type": "markdown", "id": "b2c3d4e5", "metadata": {}, "source": [ "## The `xarray` library\n", "\n", "`xarray` is a widely used Python library for working with labelled multidimensional arrays. 
It extends NumPy by attaching dimension names and coordinate labels to arrays, making it much easier to work with NetCDF data.\n", "\n", "The two main data structures in `xarray` are:\n", "\n", "- `DataArray`: a single labelled multidimensional array (equivalent to one NetCDF variable).\n", "- `Dataset`: a collection of `DataArray` objects that can share dimensions and coordinates (equivalent to a full NetCDF file).\n", "\n", "To use this library, we need to install two Python packages: `xarray` and `netcdf4`.\n" ] }, { "cell_type": "code", "id": "d4e5f6g7", "metadata": { "ExecuteTime": { "end_time": "2026-04-28T13:05:39.767960400Z", "start_time": "2026-04-28T13:05:38.904020300Z" } }, "source": [ "import xarray as xr\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ], "outputs": [], "execution_count": 1 }, { "cell_type": "markdown", "id": "e5f6g7h8", "metadata": {}, "source": [ "## Loading a NetCDF File\n", "\n", "We will use the **ERSSTv5** sea surface temperature dataset, one of the sample datasets available through `xarray.tutorial`, to illustrate the basic operations of this library. In a real workflow, you would open your own file instead:\n", "\n", "```python\n", "ds = xr.open_dataset('your_file.nc')\n", "```" ] }, { "cell_type": "code", "id": "f6g7h8i9", "metadata": { "ExecuteTime": { "end_time": "2026-04-28T13:05:52.986195200Z", "start_time": "2026-04-28T13:05:52.158300300Z" } }, "source": [ "# Load a sample dataset from xarray's tutorial data (downloaded on first use)\n", "ds = xr.tutorial.load_dataset('ersstv5')\n", "print(ds)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<xarray.Dataset> Size: 40MB\n", "Dimensions: (time: 624, nbnds: 2, lat: 89, lon: 180)\n", "Coordinates:\n", " * time (time) datetime64[ns] 5kB 1970-01-01 1970-02-01 ... 2021-12-01\n", " * lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -84.0 -86.0 -88.0\n", " * lon (lon) float32 720B 0.0 2.0 4.0 6.0 ... 
352.0 354.0 356.0 358.0\n", "Dimensions without coordinates: nbnds\n", "Data variables:\n", " time_bnds (time, nbnds) float64 10kB 9.969e+36 9.969e+36 ... 9.969e+36\n", " sst (time, lat, lon) float32 40MB -1.8 -1.8 -1.8 -1.8 ... nan nan nan\n", "Attributes: (12/37)\n", " climatology: Climatology is based on 1971-2000 SST, Xue, Y....\n", " description: In situ data: ICOADS2.5 before 2007 and NCEP i...\n", " keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sci...\n", " keywords: Earth Science > Oceans > Ocean Temperature > S...\n", " instrument: Conventional thermometers\n", " source_comment: SSTs were observed by conventional thermometer...\n", " ... ...\n", " creator_url_original: https://www.ncei.noaa.gov\n", " license: No constraints on data access or use\n", " comment: SSTs were observed by conventional thermometer...\n", " summary: ERSST.v5 is developed based on v4 after revisi...\n", " dataset_title: NOAA Extended Reconstructed SST V5\n", " data_modified: 2022-06-07\n" ] } ], "execution_count": 2 }, { "cell_type": "markdown", "id": "g7h8i9j0", "metadata": {}, "source": [ "## Exploring the Dataset\n", "\n", "The printed output of a `Dataset` shows its dimensions, coordinates, variables, and global attributes. Let's explore each component." 
] }, { "cell_type": "code", "id": "h8i9j0k1", "metadata": {}, "source": [ "# Dimensions (name -> length)\n", "print('Dimensions:', dict(ds.sizes))\n" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# Coordinates\n", "print('\\nCoordinates:', list(ds.coords))\n" ], "id": "34e8a52fa03831a0", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# Variables\n", "print('\\nVariables:', list(ds.data_vars))" ], "id": "e46f077e71024d72", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": [ "# Global attributes\n", "print('\\nAttributes:', ds.attrs)" ], "id": "d613dc93f568e8d9", "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "i9j0k1l2", "metadata": {}, "source": [ "# Inspect a single variable (DataArray)\n", "sst = ds['sst']\n", "print(sst)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "j0k1l2m3", "metadata": {}, "source": [ "# Shape and dimensions of the variable\n", "print('Shape:', sst.shape)\n", "print('Dimensions:', sst.dims)\n", "print('Units:', sst.attrs.get('units', 'not specified'))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "k1l2m3n4", "metadata": {}, "source": [ "## Selecting Data\n", "\n", "`xarray` provides two main methods for selecting data:\n", "\n", "- **`.sel()`**: select by label (coordinate value)\n", "- **`.isel()`**: select by integer index" ] }, { "cell_type": "code", "id": "l2m3n4o5", "metadata": {}, "source": [ "# Select a specific month by label\n", "sst_2000 = sst.sel(time='2000-01')\n", "print('Selected time step shape:', sst_2000.shape)\n", "\n", "# Select a time range (both endpoints are included)\n", "sst_range = sst.sel(time=slice('1990-01', '2000-12'))\n", "print('Time range shape:', sst_range.shape)\n", "\n", "# Select by integer index\n", "sst_first = sst.isel(time=0)\n", "print('First time step shape:', sst_first.shape)" ], "outputs": [], "execution_count": null }, { 
"cell_type": "code", "id": "m3n4o5p6", "metadata": {}, "source": [ "# Select a geographic region (Mediterranean Sea, approx. 30-46°N, 0-36°E)\n", "# lat is stored from north to south, so the slice must run from high to low;\n", "# lon runs from 0 to 358, so the part west of Greenwich (354-358) is left out here\n", "sst_med = sst.sel(\n", "    lat=slice(46, 30),\n", "    lon=slice(0, 36)\n", ")\n", "print('Mediterranean region shape:', sst_med.shape)" ], "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "code", "source": "sst_med", "id": "62470e87628fec28", "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "n4o5p6q7", "metadata": {}, "source": [ "## Basic Operations\n", "\n", "`xarray` supports pandas-like operations along named dimensions, which makes computing spatial or temporal statistics intuitive." ] }, { "cell_type": "code", "id": "o5p6q7r8", "metadata": {}, "source": [ "# Temporal mean (average over all time steps)\n", "sst_mean = sst.mean(dim='time')\n", "print('Temporal mean shape:', sst_mean.shape)\n", "\n", "# Spatial mean (unweighted average over lat and lon; not area-weighted)\n", "sst_global_mean = sst.mean(dim=['lat', 'lon'])\n", "print('Global mean time series shape:', sst_global_mean.shape)\n", "\n", "# Seasonal groupby (DJF, MAM, JJA, SON)\n", "sst_seasonal = sst.groupby('time.season').mean()\n", "print('Seasonal mean shape:', sst_seasonal.shape)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "p6q7r8s9", "metadata": {}, "source": [ "## Visualisation\n", "\n", "`xarray` integrates directly with `matplotlib`, making it easy to plot maps and time series." 
] }, { "cell_type": "code", "id": "q7r8s9t0", "metadata": {}, "source": [ "# Map: temporal mean SST\n", "sst_mean.plot(figsize=(12, 5), cmap='RdBu_r')\n", "plt.title('Mean Sea Surface Temperature')\n", "plt.tight_layout()\n", "plt.show()" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "r8s9t0u1", "metadata": {}, "source": [ "# Time series: global mean SST\n", "sst_global_mean.plot(figsize=(12, 4))\n", "plt.title('Global Mean SST Over Time')\n", "plt.ylabel('SST (°C)')\n", "plt.tight_layout()\n", "plt.show()" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "s9t0u1v2", "metadata": {}, "source": [ "## Converting to NumPy\n", "\n", "When the data needs to be passed to libraries such as scikit-learn or PyTorch, it usually has to be converted to a NumPy array first. This is straightforward with `.values` (or the equivalent `.to_numpy()` method)." ] }, { "cell_type": "code", "id": "t0u1v2w3", "metadata": {}, "source": [ "# Convert a DataArray to NumPy\n", "sst_numpy = sst.values\n", "print('Type:', type(sst_numpy))\n", "print('Shape:', sst_numpy.shape)\n", "print('dtype:', sst_numpy.dtype)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "u1v2w3x4", "metadata": {}, "source": [ "# Extract coordinates as NumPy arrays\n", "lat = ds['lat'].values\n", "lon = ds['lon'].values\n", "time = ds['time'].values\n", "\n", "print('Latitudes:', lat[:5])\n", "print('Longitudes:', lon[:5])\n", "print('Time steps:', time[:5])" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "v2w3x4y5", "metadata": {}, "source": [ "## Converting to pandas\n", "\n", "For tabular analysis, or when working with a time series at a specific location, `xarray` objects can be converted to a `pandas` DataFrame." 
] }, { "cell_type": "code", "id": "w3x4y5z6", "metadata": {}, "source": [ "# Convert the full Dataset to a DataFrame (one row per combination of dimension values)\n", "df = ds.to_dataframe()\n", "print(df.head())\n", "print('\\nShape:', df.shape)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "x4y5z6a7", "metadata": {}, "source": [ "# Extract a time series at the grid point nearest to a specific location\n", "sst_point = sst.sel(lat=40.0, lon=0.0, method='nearest')\n", "df_point = sst_point.to_dataframe(name='sst').reset_index()\n", "\n", "print(df_point.head())\n", "print('\\nShape:', df_point.shape)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "id": "y5z6a7b8", "metadata": {}, "source": [ "# Plot the extracted time series\n", "df_point.set_index('time')['sst'].plot(figsize=(12, 4))\n", "plt.title('SST Time Series at lat=40°N, lon=0°E')\n", "plt.ylabel('SST (°C)')\n", "plt.tight_layout()\n", "plt.show()" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "z6a7b8c9", "metadata": {}, "source": [ "## Summary\n", "\n", "| Operation | `xarray` method |\n", "|-----------|---------------|\n", "| Open a NetCDF file | `xr.open_dataset('file.nc')` |\n", "| Select by label | `.sel(dim=value)` |\n", "| Select by index | `.isel(dim=index)` |\n", "| Compute mean | `.mean(dim='dim_name')` |\n", "| Convert to NumPy | `.values` |\n", "| Convert to DataFrame | `.to_dataframe()` |\n", "| Plot | `.plot()` |\n" ] }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Exercise\n", "\n", "We will use the [GISS Surface Temperature dataset](https://github.com/bmalcover/AppOC/blob/main/docs/_static/02/gistemp1200_GHCNv4_ERSSTv5.nc.gz).\n", "\n", "The **GISTEMP v4** dataset (*GISS Surface Temperature Analysis*) contains monthly surface temperature anomalies on a regular 2°×2° global grid from 1880 to the present. 
Anomalies are computed as deviations from the 1951–1980 baseline mean.\n", "\n", "The NetCDF file has the following structure:\n", "\n", "- **`tempanomaly`**: surface temperature anomaly (°C), with shape `(time, lat, lon)`\n", "- **`time`**: monthly time steps from January 1880\n", "- **`lat`**: 90 latitude points from -89° to 89° (2° resolution)\n", "- **`lon`**: 180 longitude points from -179° to 179° (2° resolution)\n", "\n", "[More info](https://data.giss.nasa.gov/gistemp/data_v4.html)\n", "\n", "Extract a temperature anomaly time series at a specific geographic location (LAT = 39.0, LON = 3.0), then convert it to a pandas DataFrame and to a NumPy array suitable for machine learning tasks." ], "id": "164a76af0ebfb966" }, { "metadata": {}, "cell_type": "code", "source": "# PATH must point to your local copy of the downloaded GISTEMP file (define it before running)\nds = xr.open_dataset(PATH, engine='scipy')\n", "id": "ec93f9456d51b451", "outputs": [], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }