Reducing data-cubes using geometry objects

[1]:
from earthkit import data as ekd
from earthkit import plots as ekp
from earthkit import transforms as ekt
from earthkit.transforms._tools import earthkit_remote_test_data_file

Load some test data

All earthkit-transforms methods can be called with earthkit-data objects (Readers and Wrappers) or with the pre-loaded xarray or geopandas objects.

In this example we will use hourly ERA5 2m temperature data on a 0.5x0.5 spatial grid for the year 2015 as our physical data; and we will use the NUTS geometries which are stored in a geojson file.

First we lazily load the ERA5 data and NUTS geometries from our test-data repository.

Note the data is only downloaded when we use it, e.g. at the .to_xarray line, additionally, the download is cached so the next time you run this cell you will not need to re-download the file (unless it has been a very long time since you have run the code, please see tutorials in earthkit-data for more details in cache management).

[2]:
# Get some demonstration ERA5 data, this could be any url or path to an ERA5 grib or netCDF file.
# remote_era5_file = earthkit_remote_test_data_file("era5_temperature_europe_2015.grib") # Large file
remote_era5_file = earthkit_remote_test_data_file("era5_temperature_europe_20150101.grib")
era5_data = ekd.from_source("url", remote_era5_file)
era5_xr = era5_data.to_xarray(time_dims=["valid_time"]).rename({"2t": "t2m"})
era5_xr

[2]:
<xarray.Dataset> Size: 11MB
Dimensions:     (valid_time: 24, latitude: 201, longitude: 281)
Coordinates:
  * valid_time  (valid_time) datetime64[us] 192B 2015-01-01 ... 2015-01-01T23...
  * latitude    (latitude) float64 2kB 80.0 79.75 79.5 79.25 ... 30.5 30.25 30.0
  * longitude   (longitude) float64 2kB -10.0 -9.75 -9.5 ... 59.5 59.75 60.0
Data variables:
    t2m         (valid_time, latitude, longitude) float64 11MB ...
Attributes:
    Conventions:  CF-1.8
    institution:  ECMWF
[3]:
# Use some demonstration polygons stored, this could be any url or path to geojson file
remote_nuts_url = earthkit_remote_test_data_file("NUTS_RG_60M_2021_4326_LEVL_0.geojson")
nuts_data = ekd.from_source("url", remote_nuts_url).to_geopandas()

nuts_data[:5]

[3]:
id NUTS_ID LEVL_CODE CNTR_CODE NAME_LATN NUTS_NAME MOUNT_TYPE URBN_TYPE COAST_TYPE FID geometry
0 DK DK 0 DK Danmark Danmark 0 0 0 DK MULTIPOLYGON (((15.1629 55.0937, 15.094 54.996...
1 RS RS 0 RS Serbia Srbija/Сpбија 0 0 0 RS POLYGON ((21.4792 45.193, 21.3585 44.8216, 22....
2 EE EE 0 EE Eesti Eesti 0 0 0 EE MULTIPOLYGON (((27.357 58.7871, 27.6449 57.981...
3 EL EL 0 EL Elláda Ελλάδα 0 0 0 EL MULTIPOLYGON (((28.0777 36.1182, 27.8606 35.92...
4 ES ES 0 ES España España 0 0 0 ES MULTIPOLYGON (((4.391 39.8617, 4.1907 39.7981,...

Reduce data

Default behaviour

The default behaviour is to reduce the data along the spatial dimensions, only, and return the reduced data in the Xarray format it was provided, i.e. xr.DataArray or xr.Dataset.

The returned object has a new dimension index which has a coordinate variable with the values of the index in the input geodataframe.

[4]:
reduced_data = ekt.spatial.reduce(
    era5_xr,
    nuts_data,
    all_touched=True,
    mask_dim="NAME_LATN",
)
reduced_data
[4]:
<xarray.Dataset> Size: 8kB
Dimensions:     (valid_time: 24, NAME_LATN: 37)
Coordinates:
  * valid_time  (valid_time) datetime64[us] 192B 2015-01-01 ... 2015-01-01T23...
  * NAME_LATN   (NAME_LATN) object 296B 'Danmark' 'Serbia' ... 'Norge'
Data variables:
    t2m         (NAME_LATN, valid_time) float64 7kB 278.7 278.9 ... 272.8 272.8
Attributes:
    Conventions:  CF-1.8
    institution:  ECMWF

Reduce along additional dimension

For example, any time dimension, this is advisable as it ensures correct handling missing values and weights.

The extra_reduce_dims argument takes a single string or a list of strings corresponding to dimensions to include in the reduction.

It is also possible to select a column in the geodataframe to use to populate the dimension and coordinate variable created by the reduction using the mask_dim kwarg, here we choose the "NAME_LATN" column.

[5]:
reduced_data_extra_dims = ekt.spatial.reduce(
    era5_xr, nuts_data, mask_dim="NAME_LATN", extra_reduce_dims="valid_time", all_touched=True
)
reduced_data_extra_dims
[5]:
<xarray.Dataset> Size: 592B
Dimensions:    (NAME_LATN: 37)
Coordinates:
  * NAME_LATN  (NAME_LATN) object 296B 'Danmark' 'Serbia' ... 'Norge'
Data variables:
    t2m        (NAME_LATN) float64 296B 279.5 261.7 276.3 ... 272.7 274.4 273.1
Attributes:
    Conventions:  CF-1.8
    institution:  ECMWF

Weighted reduction

Provide numpy/xarray arrays of weights, or use predefined weights options, i.e. latitude:

[6]:
reduced_data_weighted = ekt.spatial.reduce(
    era5_xr,
    nuts_data,
    weights="latitude",
    mask_dim="NAME_LATN",
    extra_reduce_dims="valid_time",
    all_touched=True,
)
reduced_data_weighted
[6]:
<xarray.Dataset> Size: 592B
Dimensions:    (NAME_LATN: 37)
Coordinates:
  * NAME_LATN  (NAME_LATN) object 296B 'Danmark' 'Serbia' ... 'Norge'
Data variables:
    t2m        (NAME_LATN) float64 296B 279.5 261.6 276.3 ... 272.7 274.3 274.3
Attributes:
    Conventions:  CF-1.8
    institution:  ECMWF

Below we compare the unweighted and weighted means, we do this by adding the difference in t2m to the geopandas and plotting using the geo.choropleth method in earthkit.plots.

[7]:
diff = reduced_data_extra_dims - reduced_data_weighted
mystyle = ekp.Style(
    cmap="RdBu_r", levels=[-0.25, -0.1, -0.05, -0.025, -0.01, 0.01, 0.025, 0.05, 0.1, 0.25], extend="both"
)
ekp.geo.choropleth(
    nuts_data.assign(t2m_diff=diff.t2m),
    z="t2m_diff",
    style=mystyle,
    domain="Europe",
    labels="{t2m_diff:0.2f}",
).show()
../../_images/tutorials_spatial_02-era5-reduce-data-over-geometries_14_0.png
[7]:
<earthkit.plots.components.figures.Figure at 0x70b74b3f6590>

Return as a pandas dataframe

It is possible to return the reduced data in a fully expanded geopandas dataframe which contains the geometry and aggregated data. Additional columns for the data values and rows and indexes added to fully describe the reduced data.

The returned object fully supports pandas indexing and in-built convenience methods (e.g. plotting), but it comes with memory usage cost, hence in this example we reduce along all dimensions.

[8]:
reduced_data_pd = ekt.spatial.reduce(era5_xr, nuts_data, return_as="pandas", extra_reduce_dims="valid_time")
reduced_data_pd[:5]
[8]:
id NUTS_ID LEVL_CODE CNTR_CODE NAME_LATN NUTS_NAME MOUNT_TYPE URBN_TYPE COAST_TYPE FID geometry t2m
index
0 DK DK 0 DK Danmark Danmark 0 0 0 DK MULTIPOLYGON (((15.1629 55.0937, 15.094 54.996... 279.547336
1 RS RS 0 RS Serbia Srbija/Сpбија 0 0 0 RS POLYGON ((21.4792 45.193, 21.3585 44.8216, 22.... 261.353859
2 EE EE 0 EE Eesti Eesti 0 0 0 EE MULTIPOLYGON (((27.357 58.7871, 27.6449 57.981... 276.197991
3 EL EL 0 EL Elláda Ελλάδα 0 0 0 EL MULTIPOLYGON (((28.0777 36.1182, 27.8606 35.92... 274.270998
4 ES ES 0 ES España España 0 0 0 ES MULTIPOLYGON (((4.391 39.8617, 4.1907 39.7981,... 277.095383
[9]:
# reduced_data_pd.plot("t2m")
choropleth = ekp.geo.choropleth(
    reduced_data_pd,
    z="t2m",
    cmap="turbo",
    labels="{t2m:0.1f}",
    domain="Europe",
)
choropleth.set_title("Mean 2m temperature in Europe")
choropleth.show()
print("# Note that the NUTS regions include French foreign territories, hence the extent of the figure.")
../../_images/tutorials_spatial_02-era5-reduce-data-over-geometries_17_0.png
# Note that the NUTS regions include French foreign territories, hence the extent of the figure.

Providing a bespoke function for the reduction

When providing a our own function to the reduce method it must conform the the Xarray requirements defined in their documentation pages. Specifically:

“Function which can be called in the form f(x, axis=axis, **kwargs) to return the result of reducing an np.ndarray over an integer valued axis.”

Here we will calculate the ratio of the standard deviation to the mean for demonstration purposes only. We use nanmean and nanstd so that we ignore the nan values.

[10]:
import numpy as np


def my_method(array, axis=None, **kwargs):
    _data = np.nanmean(array, axis=axis, **kwargs) * np.nanstd(array, axis=axis, **kwargs)
    _normalised = _data * (np.nanmean(array) / np.nanmean(_data))
    return _normalised
[11]:
reduced_data_myfunc = ekt.spatial.reduce(
    era5_xr, nuts_data, how=my_method, extra_reduce_dims=["valid_time"], return_as="pandas"
)
reduced_data_myfunc[:5]
/home/docs/checkouts/readthedocs.org/user_builds/earthkit-transforms/envs/latest/lib/python3.11/site-packages/numpy/lib/_nanfunctions_impl.py:1997: RuntimeWarning: Degrees of freedom <= 0 for slice.
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/home/docs/checkouts/readthedocs.org/user_builds/earthkit-transforms/envs/latest/lib/python3.11/site-packages/numpy/lib/_nanfunctions_impl.py:1997: RuntimeWarning: Degrees of freedom <= 0 for slice.
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
[11]:
id NUTS_ID LEVL_CODE CNTR_CODE NAME_LATN NUTS_NAME MOUNT_TYPE URBN_TYPE COAST_TYPE FID geometry t2m
index
0 DK DK 0 DK Danmark Danmark 0 0 0 DK MULTIPOLYGON (((15.1629 55.0937, 15.094 54.996... 279.547336
1 RS RS 0 RS Serbia Srbija/Сpбија 0 0 0 RS POLYGON ((21.4792 45.193, 21.3585 44.8216, 22.... 261.353859
2 EE EE 0 EE Eesti Eesti 0 0 0 EE MULTIPOLYGON (((27.357 58.7871, 27.6449 57.981... 276.197991
3 EL EL 0 EL Elláda Ελλάδα 0 0 0 EL MULTIPOLYGON (((28.0777 36.1182, 27.8606 35.92... 274.270998
4 ES ES 0 ES España España 0 0 0 ES MULTIPOLYGON (((4.391 39.8617, 4.1907 39.7981,... 277.095383
[12]:
ekp.geo.choropleth(reduced_data_myfunc, z="t2m", cmap="turbo", labels="{t2m:0.1f}", domain="Europe").show()
../../_images/tutorials_spatial_02-era5-reduce-data-over-geometries_21_0.png
[12]:
<earthkit.plots.components.figures.Figure at 0x70b746058250>
[ ]: