Modelling (and thus understanding) the generation of renewable energy sources is very important, especially considering their link with meteorological variables. In this blog post I would like to focus on run-of-river hydropower, a common source of electricity in Europe.
This notebook demonstrates the potential in using free/open data to analyse & model run-of-river generation. I will use three different datasets:
- The JRC Open Power Plants Database (JRC-PPDB-OPEN) to know the coordinates of the most important plants in Europe. The dataset is open and it can be downloaded from this link
- The time-series of the single units from the ENTSO-E Transparency Platform. The Transparency Platform can be accessed from the website or via SFTP after a registration
- River discharge historical data from the European Flood Awareness System (EFAS) available on the Copernicus Data Store. The data is freely available after a registration.
I cannot share the entire workflow because the size of the datasets used is too big: except for the JRC-PPDB-OPEN the other two are more than 50 GBs.
Coordinates and generation time-series
The first step is to extract from the Transparency Platform the time-series of the largest run-of-river plants in Europe, finding then their coordinates using the JRC-PPDB-OPEN.
From the power plants database we get the 20 run-of-river power plants
with the highest installed capacity (
units <- read_csv("JRC-PPDB-OPEN.ver0.91/JRC_OPEN_UNITS.csv", col_types = cols( eic_p = col_character(), eic_g = col_character(), name_p = col_character(), name_g = col_character(), capacity_p = col_double(), capacity_g = col_double(), type_g = col_character(), lat = col_double(), lon = col_double(), country = col_character(), NUTS2 = col_character(), status_g = col_character(), year_commissioned = col_double(), year_decommissioned = col_double() ) ) %>% dplyr::filter(type_g == "Hydro Run-of-river and poundage") %>% distinct(name_p, .keep_all = TRUE) %>% arrange(-capacity_p) %>% select(eic_p, eic_g, name_p, country, capacity_p, lat, lon, status_g) %>% top_n(20, capacity_p) kable(units)
|30W-CHE-PDF1—A||30WHIDRPDFE3—H||HPP Portile de Fier I||Romania||1161.0||44.673||22.532||COMMISSIONED|
|34WEHP-DJE-1—R||34WEHP-DJE-1G3-L||HE DJERDAP I||Serbia||1098.9||44.668||22.527||COMMISSIONED|
|34WEHP-BBASH—V||34WEHP-BBASHG4-M||HE BAJINA BASTA||Serbia||420.0||43.965||19.410||COMMISSIONED|
|34WEHP-DJE-2—M||NA||HE DJERDAP II||Serbia||270.0||44.302||22.562||COMMISSIONED|
Then I have extracted the generation time-series from the Transparency Platform. This dataset can be explored and downloaded directly on the website in the section Actual Generation per Generation Unit. Obviously, if you need all the historical data of multiple plants, the only option is the bulk download via SFTP (in principle, this can be achieved also through the restful API provided but I have never tried).
From the SFTP you will obtain a bunch of CSV
files, one for each month of data. From this tabular data the needed
plants can be extracted filtering the rows where the EIC
the unit (
GenerationUnitEIC) is in our list (
eic_g column). A final
postprocess is needed: the data in the Transparency Platform is hourly
while the hydrological data we will use in the next step is daily, then
the time-series are aggregated at daily level obtaining the data in MWh.
I have extracted the data in a tibble and saved to a CSV file (you can download it from this link ).
entsoe_tp_data <- read_csv("ENTSOE-TP-time-series-top-20-ror-europe.csv", col_types = cols( date = col_date(format = ""), eic_g = col_character(), output = col_double() ) ) %>% group_by(eic_g) %>% summarise( `number of daily samples` = n(), `average daily generation (MWh)` = mean(output, na.rm = TRUE) ) kable((entsoe_tp_data))
|eic_g||number of daily samples||average daily generation (MWh)|
Although we have selected the 20 power plants with the highest capacity in the JRC-PPDB-OPEN, in the Transparency Platform we could find the time-series for only 14 power plants (and two of them have only 5 samples).
The rest of this post will focus only on one plant, but the shown procedure can be applied in batch to multiple plants.
Let’s visualise the daily generation time-series for the Aschach hydro-power plant on the Danube in Austria. In the following figure we can see the daily generation (black line) and the rolling average for 30 days (blue line).
selected_plant <- read_csv("ENTSOE-TP-time-series-top-20-ror-europe.csv", col_types = cols( date = col_date(format = ""), eic_g = col_character(), output = col_double() ) ) %>% dplyr::filter(eic_g == "14W-BAS-TU-----Y") %>% ggplot(aes(x = date, y = output)) + geom_line(size = 0.1) + geom_line(aes(y = zoo::rollmean(output, k = 30, fill = NA)), color = "blue") + theme_light() print(selected_plant)
River discharge from EFAS and its correlation with the generation
As specified in the dataset
the EFAS historical data provides estimate of river discharge from 1991
with a resolution of 5 km. Using Python and the powerful
module we can open the entire EFAS
dataset and extract the the time-series for a single power plant. Given
that the coordinate of the power plant can be inaccurate, I don’t select
the grid point containing the power plant’s coordinate but also the 8
neighbours, then for each time step, selecting the maximum discharge of
those 9 grid points.
The Python code is available on this Github gist.
Now, let’s go back to the Aschach hydro-power plant. The Python script saved the time-series of the river discharge in the coordinate 48.385, 14.023 in a CSV file. Now we can join the generation data with the river discharge.
sel_unit <- units %>% dplyr::filter(eic_g == "14W-BAS-TU-----Y") generation_data <- selected_plant <- read_csv("ENTSOE-TP-time-series-top-20-ror-europe.csv", col_types = cols( date = col_date(format = ""), eic_g = col_character(), output = col_double() ) ) %>% dplyr::filter(eic_g == "14W-BAS-TU-----Y") discharge_data <- read_csv(sprintf("out-%s_2010-2018.csv", sel_unit$eic_p), col_types = cols( time = col_datetime(format = ""), step = col_character(), surface = col_double(), valid_time = col_datetime(format = ""), dis24 = col_double() ) ) %>% select(time, dis24) %>% mutate(time = lubridate::floor_date(time, "days") %>% lubridate::as_date()) joined_data <- inner_join( generation_data, discharge_data, by = c("date" = "time") ) %>% dplyr::filter(!is.na(output), !is.na(dis24)) kable(head(joined_data))
Now we can visualise the two time-series, applying a normalisation because they have two different unit measures: MWh for the generation and cubic meters per second for the river discharge.
compare_plot <- ggplot( data = joined_data %>% select(date, generation = output, discharge = dis24) %>% gather(variable, value, -date) %>% group_by(variable) %>% mutate(value = scale(value)), aes(x = date, y = value, color = variable) ) + geom_line() + theme_light() print(compare_plot)
The Spearman correlation of the two variables is 0.88 and also the plots shows the level of their correlation. And the correlation is also high for many other power plants, for example for Laufenburg hydropower station is even 0.92.
What are the potential applications?
- For power system modeling this means that we are able to estimate the generation of run-of-river power plants with also the possibility to create a synthetic dataset for the entire EFAS historical timespan (1990-2018)
- Also, we can extend this correlation to the climate change scenarios to simulate the generation of run-of-river in different scenarios (and for different models)
- Weather and seasonal forecasts might be used to provide(operational) forecasts of run-of-river generation (not an easy task though)