From gridded data to time-series

Aggregating gridded data with Python and xarray

In recent years I have often needed to aggregate the data in a gridded dataset (for example ERA5 meteorological data) into a time-series, according to specific borders (for example administrative regions). In the past I used an R package that I developed, for example for the ERA-NUTS dataset. However, when I recently had to deal with a 5 km resolution hydrological model (1M grid cells per day in Europe) I had to find a solution that allowed me to work out-of-memory. In this short post I will present a simple solution based on open-source Python modules:

import xarray as xr 
import numpy as np
import regionmask
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

In this example we will use the NUTS classification developed by EUROSTAT. All the shapefiles are available on the EUROSTAT website. The files come in different formats and in zipped archives containing multiple projections and spatial types: here I will use the file containing the NUTS0 boundaries with the EPSG:4326 projection and multipolygons as spatial type. The NUTS website provides all the information needed to choose the right files.

# Load the shapefile
PATH_TO_SHAPEFILE = 'shp/NUTS_RG_10M_2016_4326_LEVL_0.shp'
nuts = gpd.read_file(PATH_TO_SHAPEFILE)
nuts.head()

LEVL_CODE NUTS_ID CNTR_CODE NUTS_NAME FID geometry
0 0 AL AL SHQIPËRIA AL POLYGON ((19.82698 42.4695, 19.83939 42.4695, ...
1 0 AT AT ÖSTERREICH AT POLYGON ((15.54245 48.9077, 15.75363 48.85218,...
2 0 BE BE BELGIQUE-BELGIË BE POLYGON ((5.10218 51.429, 5.0878 51.3823, 5.14...
3 0 NL NL NEDERLAND NL (POLYGON ((6.87491 53.40801, 6.91836 53.34529,...
4 0 PL PL POLSKA PL (POLYGON ((18.95003 54.35831, 19.35966 54.3723...

nuts is a GeoDataFrame containing the polygons of the national boundaries of the 37 countries in the NUTS classification.

Now we can load the gridded data. In a folder I have hundreds of ERA5 reanalysis files downloaded from the CDS. Each file contains a single month of the global reanalysis and is about 1.5 GB. I want to load all the temperature files for the year 2018. The chunks parameter is very important: it defines the size of the “pieces” of data moved from disk to memory. With this value the entire computation takes a couple of minutes on a workstation with 32 GB of RAM. With the function assign_coords the longitude is converted from the 0-360 range to -180/180.

# Read NetCDF
FOLDER_WITH_NETCDF = '/emhires-data/data-warehouse/gridded/ERA5/'
# Open all the 2018 monthly files lazily as a single dataset, in dask chunks of 10 timesteps
d = xr.open_mfdataset(FOLDER_WITH_NETCDF+'era5-hourly-2m_temperature-2018-*.nc', chunks = {'time': 10})
# Convert the longitude from 0-360 to -180/180 and sort the coordinate
d = d.assign_coords(longitude=(((d.longitude + 180) % 360) - 180)).sortby('longitude')

d
<xarray.Dataset>
Dimensions:    (latitude: 721, longitude: 1440, time: 8760)
Coordinates:
  * longitude  (longitude) float32 -180.0 -179.75 -179.5 ... 179.25 179.5 179.75
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * time       (time) datetime64[ns] 2018-01-01 ... 2018-12-31T23:00:00
Data variables:
    t2m        (time, latitude, longitude) float32 dask.array<shape=(8760, 721, 1440), chunksize=(10, 721, 1440)>
Attributes:
    Conventions:  CF-1.6
    history:      2019-04-27 23:33:00 GMT by grib_to_netcdf-2.10.0: /opt/ecmw...

As you can see, this xarray Dataset contains a single variable (t2m) which is stored as a dask.array. This is the result of loading files with open_mfdataset. More information and examples of the interaction between dask and xarray can be found in their documentation (here and here).

Now we are ready for a bit of regionmask magic. The module can create a gridded mask with the Regions_cls class documented here. With it we create an object able to mask gridded data.

# CALCULATE MASK
nuts_mask_poly = regionmask.Regions_cls(name = 'nuts_mask',
                                        numbers = list(range(0, 37)),
                                        names = list(nuts.NUTS_ID),
                                        abbrevs = list(nuts.NUTS_ID),
                                        outlines = list(nuts.geometry.values[i] for i in range(0, 37)))
print(nuts_mask_poly)

37 'nuts_mask' Regions ()
AL AT BE NL PL PT DK DE EL ES BG CH CY RO RS CZ EE HU HR SI SK SE IT TR FI NO IE IS LI LT LU LV ME MK UK MT FR

Now we are ready to apply the mask to the gridded dataset d. We select only the first timestep to speed up the process. For the same reason it is better to “zoom in” on the European continent only. We also specify the names of the coordinates (the defaults are lat and lon).

mask = nuts_mask_poly.mask(d.isel(time = 0).sel(latitude = slice(75, 32), longitude = slice(-30, 50)),
                           lat_name = 'latitude', lon_name = 'longitude')
mask
<xarray.DataArray 'region' (latitude: 173, longitude: 321)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])
Coordinates:
  * latitude   (latitude) float32 75.0 74.75 74.5 74.25 ... 32.5 32.25 32.0
  * longitude  (longitude) float32 -30.0 -29.75 -29.5 -29.25 ... 49.5 49.75 50.0

The computation takes a couple of minutes, but the mask can then be saved (for example as a NetCDF) for later use.
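
A minimal sketch of this step (the file name nuts_mask_era5.nc is just a placeholder):

mask.to_netcdf('nuts_mask_era5.nc')              # mask is an xarray DataArray, so it can be written directly
mask = xr.open_dataarray('nuts_mask_era5.nc')    # later: reload it without recomputing the polygons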

plt.figure(figsize=(12,8))
ax = plt.axes()
mask.plot(ax = ax)
nuts.plot(ax = ax, alpha = 0.8, facecolor = 'none', lw = 1)
<matplotlib.axes._subplots.AxesSubplot at 0x7f362edea0b8>

[Figure: the gridded mask plotted with the NUTS national boundaries overlaid]

Extract time-series

We can now aggregate, for each country, the grid cells falling within the national borders. The procedure is rather simple: I will focus on a single region, and it is easy to extend it to multiple regions (a sketch of a loop over all the regions is given at the end of the post).

ID_REGION = 34
print(nuts.NUTS_ID[ID_REGION])
UK

As a first step, I save the latitude and longitude vectors because I will use them later. Then I select the mask points where the value is equal to the target value (the NUTS code). In the numpy array sel_mask all the values are nan except for the selected ones.

lat = mask.latitude.values
lon = mask.longitude.values


sel_mask = mask.where(mask == ID_REGION).values
sel_mask
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]])

To speed up the process I want to crop the xarray Dataset, selecting the smallest box containing the entire mask. To do this I store in id_lon and id_lat the coordinate points where the mask has at least one non-nan value.

id_lon = lon[np.where(~np.all(np.isnan(sel_mask), axis=0))]
id_lat = lat[np.where(~np.all(np.isnan(sel_mask), axis=1))]

The d dataset is reduced by selecting only the target year and the coordinates containing the target region. Then the dataset is loaded from the dask array using compute and filtered with the mask. I came up with this procedure to speed up the computation and reduce the memory use for large datasets: apparently the where function is not really dask-friendly.

out_sel = d.sel(latitude = slice(id_lat[0], id_lat[-1]),
                longitude = slice(id_lon[0], id_lon[-1])).compute().where(mask == ID_REGION)
out_sel
<xarray.Dataset>
Dimensions:    (latitude: 42, longitude: 39, time: 8760)
Coordinates:
  * latitude   (latitude) float64 60.5 60.25 60.0 59.75 ... 50.75 50.5 50.25
  * longitude  (longitude) float64 -8.0 -7.75 -7.5 -7.25 ... 0.75 1.0 1.25 1.5
  * time       (time) datetime64[ns] 2018-01-01 ... 2018-12-31T23:00:00
Data variables:
    t2m        (time, latitude, longitude) float32 nan nan nan ... nan nan nan
Attributes:
    Conventions:  CF-1.6
    history:      2019-04-27 23:33:00 GMT by grib_to_netcdf-2.10.0: /opt/ecmw...
plt.figure(figsize=(12,8))
ax = plt.axes()
out_sel.t2m.isel(time = 0).plot(ax = ax)
nuts.plot(ax = ax, alpha = 0.8, facecolor = 'none')

<matplotlib.axes._subplots.AxesSubplot at 0x7f36285ff278>

[Figure: masked 2m temperature at the first timestep, with the NUTS boundaries overlaid]

Finally we can aggregate by the arithmetic mean using the groupby function to obtain a time-series of national average temperatures.

x = out_sel.groupby('time').mean()
x
<xarray.Dataset>
Dimensions:  (time: 8760)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 ... 2018-12-31T23:00:00
Data variables:
    t2m      (time) float32 277.8664 277.64725 277.61978 ... 281.81863 281.60248

Then we plot the time-series…

x.t2m.plot()
[<matplotlib.lines.Line2D at 0x7f362e274518>]

[Figure: time-series of the spatially averaged 2m temperature for 2018]

and we save it as a CSV:

x.t2m.to_pandas().to_csv('average-temperature.csv', header = ['t2m'])
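
As anticipated, the same steps can be wrapped in a loop to obtain one time-series per region. The following is only a sketch reusing the objects defined above (mask, lon, lat, d and nuts); the output file name is a placeholder and, at this resolution, very small regions may contain no grid cells, hence the check on the cropped coordinates:

ts_list = []
for ID_REGION in range(len(nuts)):
    # select the cells belonging to the current region
    sel_mask = mask.where(mask == ID_REGION).values
    id_lon = lon[np.where(~np.all(np.isnan(sel_mask), axis=0))]
    id_lat = lat[np.where(~np.all(np.isnan(sel_mask), axis=1))]
    if id_lat.size == 0 or id_lon.size == 0:
        continue  # region without any grid cell at this resolution: skip it
    # crop, load in memory, mask and average, as done above for the single region
    out_sel = d.sel(latitude = slice(id_lat[0], id_lat[-1]),
                    longitude = slice(id_lon[0], id_lon[-1])).compute().where(mask == ID_REGION)
    ts = out_sel.t2m.groupby('time').mean().to_pandas().rename(nuts.NUTS_ID[ID_REGION])
    ts_list.append(ts)
# one column per region
pd.concat(ts_list, axis = 1).to_csv('average-temperature-all-regions.csv')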

The procedure is quite simple and can definitely be improved, for example by adding a latitude (cosine) weighting when averaging over a regular grid.
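
A minimal sketch of such a weighting, assuming a recent xarray version providing the weighted method (the weights are simply the cosine of the latitude, which is proportional to the cell area on a regular latitude-longitude grid):

weights = np.cos(np.deg2rad(out_sel.latitude))  # cos(lat) ~ relative grid-cell area on a regular grid
x_weighted = out_sel.t2m.weighted(weights).mean(dim = ['latitude', 'longitude'])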
