{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reducing data-cubes using geometry objects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If first time running, uncomment the line below to install any additional dependencies\n",
    "# !bash requirements-for-notebooks.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from earthkit.data.testing import earthkit_remote_test_data_file\n",
    "\n",
    "from earthkit import data as ekd\n",
    "from earthkit import transforms as ekt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load some test data\n",
    "\n",
    "All `earthkit-transforms` methods can be called with `earthkit-data` objects (Readers and Wrappers) or with\n",
    "pre-loaded `xarray` or `geopandas` objects.\n",
    "\n",
    "In this example we use hourly ERA5 2m temperature data on a 0.5x0.5 degree spatial grid from 2015 as\n",
    "our physical data, and the NUTS geometries, which are stored in a geojson file.\n",
    "\n",
    "First we lazily load the ERA5 data and NUTS geometries from our test-data repository.\n",
    "\n",
    "Note that the data is only downloaded when we use it, e.g. at the `.to_xarray` line. The download is also\n",
    "cached, so the next time you run this cell you will not need to re-download the file (unless it has been a\n",
    "very long time since you last ran the code; see the tutorials in `earthkit-data` for more details on cache\n",
    "management)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get some demonstration ERA5 data; this could be any URL or path to an ERA5 GRIB or netCDF file.\n",
    "# remote_era5_file = earthkit_remote_test_data_file(\"era5_temperature_europe_2015.grib\") # Large file\n",
    "remote_era5_file = earthkit_remote_test_data_file(\"era5_temperature_europe_20150101.grib\")\n",
    "era5_data = ekd.from_source(\"url\", remote_era5_file)\n",
    "era5_xr = era5_data.to_xarray(time_dim_mode=\"valid_time\").rename({\"2t\": \"t2m\"})\n",
    "era5_xr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use some demonstration polygons; this could be any URL or path to a geojson file.\n",
    "remote_nuts_url = earthkit_remote_test_data_file(\"NUTS_RG_60M_2021_4326_LEVL_0.geojson\")\n",
    "nuts_data = ekd.from_source(\"url\", remote_nuts_url)\n",
    "\n",
    "nuts_data.to_pandas()[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reduce data\n",
    "### Default behaviour\n",
    "\n",
    "The default behaviour is to reduce the data along the spatial dimensions only, and to return the reduced data\n",
    "in the Xarray format in which it was provided, i.e. `xr.DataArray` or `xr.Dataset`.\n",
    "\n",
    "The returned object has a new dimension, `FID` (feature id), whose coordinate variable holds the values\n",
    "of the `FID` column in the input `geodataframe`.\n",
    "\n",
    "The new variable name is made up of the original variable name and the method used to reduce, e.g. `t2m_mean`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data = ekt.spatial.reduce(era5_xr, nuts_data)\n",
    "reduced_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reduce along additional dimension\n",
    "\n",
    "The reduction can include additional dimensions, for example any time dimension. Including them in the\n",
    "reduction is advisable, as it ensures correct handling of missing values and weights.\n",
    "\n",
    "The `extra_reduce_dims` argument takes a single string or a list of strings corresponding to dimensions to\n",
    "include in the reduction.\n",
    "\n",
    "It is also possible to select a column in the geodataframe to populate the dimension and coordinate\n",
    "variable created by the reduction, using the `mask_dim` kwarg. Here we choose the `\"FID\"` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data = ekt.spatial.reduce(era5_xr, nuts_data, mask_dim=\"FID\", extra_reduce_dims=\"valid_time\", all_touched=True)\n",
    "reduced_data"
   ]
  },
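  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why reducing over time and space in a single step can matter, here is a minimal sketch in plain\n",
    "Python (not `earthkit-transforms` code; the values and the helper name are made up for illustration). It\n",
    "compares the mean of per-time-step means with the mean of all valid values pooled together. The two differ\n",
    "as soon as time steps contain different numbers of missing values.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "# Hypothetical values inside one region: 2 time steps x 3 grid cells.\n",
    "# One grid cell is missing (NaN) at the second time step.\n",
    "values = [\n",
    "    [1.0, 2.0, 3.0],\n",
    "    [4.0, math.nan, 6.0],\n",
    "]\n",
    "\n",
    "\n",
    "def nanmean(seq):\n",
    "    # Mean of the non-NaN entries of a flat sequence.\n",
    "    valid = [v for v in seq if not math.isnan(v)]\n",
    "    return sum(valid) / len(valid)\n",
    "\n",
    "\n",
    "# Two-step reduction: spatial mean per time step, then mean over time.\n",
    "mean_of_means = nanmean([nanmean(step) for step in values])\n",
    "\n",
    "# One-step reduction: pool every valid value across time and space.\n",
    "pooled_mean = nanmean([v for step in values for v in step])\n",
    "\n",
    "print(mean_of_means, pooled_mean)\n",
    "```"
   ]
  },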
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Weighted reduction\n",
    "\n",
    "Provide numpy/xarray arrays of weights, or use a predefined weights option, i.e. `latitude`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data_xr = ekt.spatial.reduce(\n",
    "    era5_xr, nuts_data, weights=\"latitude\", mask_dim=\"FID\", extra_reduce_dims=\"valid_time\", all_touched=True\n",
    ")\n",
    "reduced_data_xr"
   ]
  },
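  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what latitude weighting does, the sketch below assumes the common convention of\n",
    "weighting each grid cell by the cosine of its latitude. This is an assumption about the `latitude` option\n",
    "rather than something taken from the `earthkit-transforms` source, and the latitudes and temperatures are\n",
    "made up.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "# Hypothetical grid-cell latitudes (degrees north) and temperatures (K).\n",
    "latitudes = [40.0, 50.0, 60.0]\n",
    "temperatures = [290.0, 285.0, 280.0]\n",
    "\n",
    "# Assumed weighting: proportional to cos(latitude), because cells on a\n",
    "# regular latitude-longitude grid cover less area towards the poles.\n",
    "weights = [math.cos(math.radians(lat)) for lat in latitudes]\n",
    "\n",
    "weighted_mean = sum(w * t for w, t in zip(weights, temperatures)) / sum(weights)\n",
    "unweighted_mean = sum(temperatures) / len(temperatures)\n",
    "\n",
    "print(round(unweighted_mean, 2), round(weighted_mean, 2))\n",
    "```\n",
    "\n",
    "In this made-up case the warmer, lower-latitude cells carry more weight, so the weighted mean is higher\n",
    "than the unweighted one."
   ]
  },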
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Return as a pandas dataframe\n",
    "\n",
    "**WARNING: Returning reduced data in pandas format is considered experimental and may change in future versions of earthkit.**\n",
    "\n",
    "It is possible to return the reduced data in a fully expanded geopandas dataframe which contains the geometry\n",
    "and aggregated data.\n",
    "Additional columns for the data values, and rows and indexes, are added to fully describe the reduced data.\n",
    "\n",
    "The returned object fully supports pandas indexing and built-in convenience methods (e.g. plotting),\n",
    "but this comes at a cost in memory usage, hence in this example we reduce along all dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data_pd = ekt.spatial.reduce(\n",
    "    era5_xr, nuts_data, return_as=\"pandas\", mask_dim=\"FID\", extra_reduce_dims=\"valid_time\"\n",
    ")\n",
    "reduced_data_pd[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data_pd.plot(\"t2m\")\n",
    "print(\"# Note that the NUTS regions include French foreign territories, hence the extent of the figure.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix\n",
    "\n",
    "### Unadvised: return_as = 'pandas' for time-series\n",
    "\n",
    "This results in very heavy memory usage, but may be useful in some cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data_pd = ekt.spatial.reduce(era5_xr, nuts_data, return_as=\"pandas\", mask_dim=\"FID\")\n",
    "reduced_data_pd[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_index = \"FID\"\n",
    "plot_var = \"t2m\"\n",
    "# plot_x_vals = reduced_data.attrs[f\"{plot_var}_dims\"][\"time\"]\n",
    "fig, ax = plt.subplots(1)\n",
    "for feature in reduced_data_pd.index.get_level_values(feature_index).unique()[:5]:\n",
    "    temp = reduced_data_pd.xs(feature, level=feature_index)\n",
    "    temp[plot_var].plot(ax=ax, label=feature)\n",
    "fig.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Providing a bespoke function for the reduction\n",
    "\n",
    "When providing our own function to the reduce method, it must conform to the Xarray requirements defined in their [documentation pages](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.reduce.html#xarray-dataarray-reduce). Specifically:\n",
    "\n",
    "*\"Function which can be called in the form f(x, axis=axis, \\*\\*kwargs) to return the result of reducing an np.ndarray over an integer valued axis.\"*\n",
    "\n",
    "Here we will calculate the ratio of the standard deviation to the mean, for demonstration purposes only. We use `nanmean` and `nanstd` so that NaN values are ignored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def std_mean_ratio(x, axis=0, **kwargs):\n",
    "    return np.nanstd(x, axis=axis, **kwargs) / np.nanmean(x, axis=axis, **kwargs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reduced_data = ekt.spatial.reduce(era5_xr, nuts_data, how=std_mean_ratio)\n",
    "reduced_data"
   ]
  },
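  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any other function with the same call signature should work in the same way. As a minimal sketch, the\n",
    "hypothetical `value_range` function below (not part of `earthkit-transforms`) is applied directly to a small\n",
    "NumPy array to show how the `axis` argument selects the dimension that is reduced, with NaN values ignored.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def value_range(x, axis=None, **kwargs):\n",
    "    # Maximum minus minimum along the given axis, ignoring NaN values.\n",
    "    return np.nanmax(x, axis=axis, **kwargs) - np.nanmin(x, axis=axis, **kwargs)\n",
    "\n",
    "\n",
    "# Hypothetical 2 x 3 array with one missing value.\n",
    "x = np.array([[1.0, 5.0, np.nan], [2.0, 2.0, 8.0]])\n",
    "\n",
    "print(value_range(x, axis=1))  # one value per row\n",
    "print(value_range(x, axis=0))  # one value per column\n",
    "```\n",
    "\n",
    "Such a function could then be passed as `how=value_range`, in the same way as `std_mean_ratio` above."
   ]
  },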
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".conda",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}