## Public sector information and energy system datasets

Recently I had the occasion to contribute to a document providing feedback on the European Commission public consultation on Directives 2003/98/EC and 2013/37/EU. The work was coordinated by Robbie Morrison, an important contributor to the OpenMod Initiative (and to much more).

The document is available at this link, and it discusses the concerns related to the (re-)use of public sector information and the needs of the energy modelling community. It also relates to my recent experience in the C3S ECEM service where, with a group of smart and enthusiastic people from the energy & meteorology community, I had the chance to really understand the potential and the limitations of publicly available data for modelling European power systems in relation to meteorological variables. The document explores many issues, such as data longevity, copyright, the unclear legal status of datasets, the right to archive, etc. Enjoy the read, and feel free to drop me a message if you want more details!

## MED-GOLD is here

On the first of December the H2020 project MED-GOLD started, with a fantastic kick-off meeting at Eataly in Rome. The project is coordinated by ENEA (I am part of the coordination team) and it aims to develop three climate services for three big industrial partners working in three key Mediterranean sectors: olive oil, pasta and wine. Sixteen partners will work together for four years on some very hard challenges: defining the value of climate information for the three industrial champions, developing three (reproducible!) tools and sharing the findings with a broader community, engaging small and large stakeholders and policy-makers, and a lot more. If you are interested in climate forecasts and projections and, at the same time, you would like a sustainable and resilient agriculture in Europe, save the link (the website www.med-gold.eu is still temporary) and follow us on Twitter.

## Tidyverse and NetCDFs: a first exploration

I am not exaggerating when I say that the main reason to develop in R is the presence of the tidyverse and the plethora of packages that let you explore data and deal easily with the most common tasks (splitting, grouping, temporal aggregation, etc.). Unfortunately, the biggest limitation of R is its capability to manipulate arrays of data (data cubes) effectively, especially nowadays when it is normal to deal with very large datasets (please, don't say "big"). However, the R community is trying to close this gap, and recently I discovered the tidync package by Michael Sumner. This package tries to make the access and manipulation of NetCDF files (the de facto standard for climate datasets) easier, in a modern and clever way, à la tidyverse.

Recently I was analysing the E-RUN dataset of gridded river runoff for my research on hydro-power modelling within the Copernicus contract ECEM. A simple step was the point correlation of daily runoff to "see" the spatial patterns of similar runoff behaviours. I took the chance to create a simple piece of code using only three packages (well, I'm cheating a bit because one of them is a meta-package…).

```r
library(tidync)
library(tidyverse)
library(corrr)
```

The first one can be installed with devtools (look at its GitHub page), the second needs no introduction, and the last one can be found on CRAN and is a very handy tool to explore correlations in R.

Once you have downloaded the E-RUN NetCDF (~73 MB), you can open it with tidync, subset a box, and convert it into a tibble in a few lines:

```r
runoff_data <- tidync(erun_file) %>%  # erun_file: path of the downloaded E-RUN NetCDF
  hyper_filter(
    lat = lat >= 35 & lat < 60,
    lon = lon < 30
  ) %>%
  hyper_tibble()
```

As you can see, the pipe operator makes the code streamlined and easy to read. Moreover, tidync performs the subsetting lazily, and that is, in my opinion, its biggest feature: when working with very large datasets, often bigger than the RAM installed on your workstation, R does not load the entire array into memory up front; the data are read only when hyper_tibble or hyper_slice is called. Now you have a fancy tibble with the gridded runoff data:

```r
head(runoff_data)
# A tibble: 6 x 4
  runoff    lon   lat  time
   <dbl>  <dbl> <dbl> <dbl>
1 0.604  -5.75   35.2   349
2 0.869  -5.25   35.2   349
3 0.0808 -3.25   35.2   349
4 0.0457 -1.25   35.2   349
5 0.0455 -0.750  35.2   349
6 0.0512 -0.250  35.2   349
```

Then with a bit of tidyverse magic you can: 1) filter out all the points where runoff is NA; 2) "merge" the latitude (lat) and longitude (lon) columns into a single one named pos; 3) make the tibble wider with spread; 4) remove the time column, not needed for the correlation; 5) create the correlation matrix; and 6) "stretch" it into a long format.

```r
cor_tbl <- runoff_data %>%
  filter(!is.na(runoff)) %>%
  unite(pos, lat, lon) %>%
  spread(pos, runoff) %>%
  select(-time) %>%
  correlate() %>%
  stretch()
```

Now we have a tibble where each row stores the correlation value between two points. I want to create a map showing all the correlation values with a specific point (in this case the area around Innsbruck in Austria). With the following lines I select all the entries of the correlation matrix related to the chosen coordinates and then I revert the "merge" operation done before with the separate function.

```r
df_toplot <- cor_tbl %>%
  filter(x == "47.25_11.25") %>%
  select(y, r) %>%
  separate(y, c("lat", "lon"), sep = "_") %>%
  mutate(
    lat = as.numeric(lat),
    lon = as.numeric(lon)
  )
```

We are ready to generate the map now:

```r
my_map <- ggplot(df_toplot, aes(x = lon,
                                y = lat,
                                fill = cut(r, breaks = seq(-1, 1, 0.2)))) +
  geom_tile() +
  geom_point(x = 11.25, y = 47.25, shape = 3, show.legend = FALSE) +
  scale_fill_brewer(palette = "RdBu", drop = FALSE) +
  borders() +
  coord_quickmap(xlim = c(-20, 30),
                 ylim = c(35, 60)) +
  theme_light()
```

The map is not really elegant, but it is the final output of a very elegant, robust and streamlined workflow. I hope that in the future we will have more tools like this one, giving climate scientists the possibility to manipulate data arrays easily and efficiently in R too.



## Dplyr and database: a clutter-free approach for your data

I've been working for months on the relationship between energy and climate in Europe. The first mandatory step was to get all the available observed data to calibrate and test a set of models. While for climate data this is quite easy (there are plenty of options for observations and reanalyses), the situation is particularly difficult for the energy sector. However, with a reasonable effort you can download data on electricity demand and generation, for example from the amazing ENTSO-E Transparency portal or from the websites of the European TSOs (Transmission System Operators), possibly in an obscure, malformed, one-file-per-day Excel format (but that's another story…).

So, in the end you have hundreds (if not thousands) of files and probably a set of functions to access them in an organised way. But sometimes you don't remember where-is-what (unfortunately the life of a scientist is rather multi-tasking), or perhaps you need to work on multiple machines (via SSH), replicating your files, or you want to share your data with a collaborator…

I decided to try using a database (PostgreSQL in my case) and, thanks to the dplyr package (part of the excellent tidyverse), this was super-easy.
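A minimal sketch of the idea (self-contained, so it uses an in-memory SQLite database; for PostgreSQL you would open the connection with RPostgres::Postgres() instead, and the demand table and its columns here are made up for the example):

```r
library(DBI)
library(dplyr)

# Connect: SQLite in-memory here; swap in RPostgres::Postgres() for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Push a (toy) table of hourly electricity demand into the database
demand <- data.frame(country = rep(c("IT", "DE"), each = 3),
                     hour    = rep(1:3, 2),
                     load_mw = c(30, 32, 31, 55, 57, 56))
copy_to(con, demand, "demand", temporary = FALSE)

# tbl() gives a lazy reference: the dplyr verbs below are translated to SQL
# and run in the database; nothing is loaded into R until collect()
avg_load <- tbl(con, "demand") %>%
  group_by(country) %>%
  summarise(mean_load = mean(load_mw, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```

The nice part is that exactly the same dplyr code works on a local data frame and on the database, so the analysis scripts do not change when the data move to a server.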

## Experiences from a masterclass on Climate Services

From the 16th to the 20th of May, the second Masterclass on Climate Services was held at EURAC in Bolzano. This year the focus was on the food security, water and health sectors and, as in the first edition, during the entire week the students worked in close contact with the problem-holders to develop a climate service addressing their issues. It was very challenging for the students, who worked on managing climate variability for hydropower in Alto Adige, heat-wave early warning in Emilia-Romagna, and dealing with droughts in Sudan with the World Food Programme.

We organizers were particularly satisfied with the feedback we received (for example this) and with the exciting environment we had for the entire week, with a lot of nice discussions ranging from climate forecasts to data visualization.

## Wrapping up the last months

Thanks to the Storify service I can use all the public social media posts (mainly tweets, I would say) to describe events like conferences and schools. Last May in Bolzano we organized a wonderful school (or better, a Masterclass) on Climate Services, with committed students from different backgrounds and speakers from the climate science, energy and agricultural sectors. The event is described here: https://storify.com/matteodefelice/euporiasmc15

Last month, instead, the biennial conference on Energy and Meteorology (ICEM) was held in Boulder: five days packed with lectures, seminars and talks on energy (mainly renewable energies) and meteorology (considering both weather and climate). The Storify of this event is here: https://storify.com/matteodefelice/icem-2015-boulder

## Making R based research more reproducible

During the last few years a large part of my research has been rather data-intensive: hundreds of gigabytes of binary data saved, storing final analyses and intermediate results. Recently I had an issue with a data file generated by a chain of R functions and I wasn't able to retrace the "history" of those data: the only (meta)information I had was the creation date and the long filename that I normally use to convey information about the analysis and the functions used. Unfortunately, in this case it wasn't enough, and I came up with a partial solution for my R workflow: a save function which stores data together with metadata (when, how, where, etc.).

```r
mySave <- function(..., file) {
  callingF <- sys.call(-1)  # call of the function that invoked mySave
  cTime    <- Sys.time()    # creation date and time
  cWd      <- getwd()       # working directory
  sInfo    <- Sys.info()    # system information (including hostname)
  metadata <- list(callingF, cTime, cWd, sInfo)
  # Save the objects passed in `...` together with the metadata.
  # save() looks objects up by name, so we build an environment
  # containing both the `...` objects and `metadata`
  objs <- list(...)
  names(objs) <- as.character(substitute(list(...)))[-1]
  objs$metadata <- metadata
  save(list = names(objs), file = file, envir = list2env(objs))
}
```

If I use this function instead of save, the saved data will also include the original function call, the full date, the path of the working directory and information about the system (including the hostname). It is far from perfect, but it is a personal first step towards making my research fully reproducible.
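A quick round-trip shows the metadata travelling with the data (a sketch: the analysis function and the temporary file are made up for the example, and mySave is assumed to end by actually saving the metadata alongside the objects, as above):

```r
# mySave, condensed so this example is self-contained
mySave <- function(..., file) {
  metadata <- list(sys.call(-1), Sys.time(), getwd(), Sys.info())
  objs <- list(...)
  names(objs) <- as.character(substitute(list(...)))[-1]
  objs$metadata <- metadata
  save(list = names(objs), file = file, envir = list2env(objs))
}

# A hypothetical analysis function that saves its results
analysis <- function(file) {
  results <- cumsum(rnorm(10))
  mySave(results, file = file)
}

tmp <- tempfile(fileext = ".RData")
analysis(tmp)

load(tmp)       # restores both `results` and `metadata`
metadata[[1]]   # how: the call that generated the file, analysis(tmp)
metadata[[2]]   # when: the creation time
metadata[[4]][["nodename"]]  # where: the hostname
```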

## EUPORIAS Climate Service Master Class in Bolzano

EUPORIAS is launching a climate service masterclass for this coming spring. Climate service development requires a new framework for the interaction between users and providers of climate information. The first EUPORIAS climate service masterclass wants to be a first step in the direction of co-production, where new climate service prototypes can be developed but, more importantly, where new protocols for interaction can be explained and presented in a hands-on fashion. The focus of the masterclass will be on the Energy, Tourism and Agriculture sectors, with several local (and European) stakeholders involved.

The Masterclass will be held at EURAC in Bolzano, Italy, on 18-22 May 2015. The deadline for registration is the 30th of April.

The on-line application form can be accessed at: http://www.euporias.eu/masterclass-registration

## Exploring time-series data with ggplot2 and shiny

Recently I started working with a EUROSTAT dataset with more than 300 variables. I needed to explore the time series to visualize evident trends and relationships. I had at least two options: write a batch script to create a time-series plot for each variable, or create something interactive.

I spent less than an hour creating an R Shiny app. Shiny is a powerful tool that lets you create interactive web applications easily. You only need two files: one describing the user interface (ui.R) and one defining the server-side computations (server.R).

You can see my example application based on artificial data here.

The user interface lets the user select the time scale, i.e. whether to see all the samples or the average over groups of ten samples. Then you can select the range (if you want to "zoom") and the variables to plot. The code is the following:

```r
shinyUI(pageWithSidebar(
  headerPanel('Time-series explorer'),  # pageWithSidebar also needs a header panel
  sidebarPanel(
    selectInput('timescale', 'Time scale', c('1', '10')),
    sliderInput("range", "Range", min = 1,
                max = 200, value = c(1, 200)),
    checkboxGroupInput('var', 'Variable',
                       choices = c('alpha', 'beta', 'gamma'),
                       selected = 'alpha')
  ),
  mainPanel(
    plotOutput('plot1')
  )
))
```

The server part is quite simple. The renderPlot section is just a ggplot command that plots the data returned by the reactive function (read here if you want to know what a reactive function is).

```r
library(reshape2)
library(ggplot2)
library(dplyr)

shinyServer(function(input, output, session) {

  # Combine the selected variables into a new data frame
  selectedData <- reactive({
    # Create artificial data
    set.seed(123)
    dd <- data.frame(time  = 1:200,
                     alpha = cumsum(rnorm(200)),
                     beta  = cumsum(rnorm(200, sd = 1.2)),
                     gamma = cumsum(rnorm(200, sd = 1.1)))
    dd_m <- melt(dd, id.vars = 'time')
    if (input$timescale == '10') {
      dd_m <- dd_m %>%
        group_by(time = floor(time / 10), variable) %>%
        summarise(value = mean(value))
    }
    filter(dd_m, variable %in% input$var,
           time %in% seq(input$range[1], input$range[2]))
  })

  output$plot1 <- renderPlot({
    ggplot(selectedData(), aes(x = time, y = value, color = variable)) +
      geom_line(size = 2, alpha = 0.5) +
      geom_point(size = 3) +
      theme(text = element_text(size = 18),
            legend.position = 'bottom')
  })

})
```

The selectedData function creates the data frame, which is then melted using the reshape2 package to transform the original data frame into a new one where each row represents a single sample. In this way you can unleash the power of dplyr and ggplot2: with dplyr you compute the grouped average (if selected) and filter by the selected variables and time steps. You can see this application live at this link (shinyapps host).

## Measuring the uncertainty between variables

The first thing that comes to mind when you want to measure the association between two variables is probably the correlation coefficient (the Pearson correlation coefficient). This coefficient is the covariance of the two variables over the product of their standard deviations; basically, it is a kind of "normalised" covariance. It is easy to understand, quick to compute and available in almost all software tools. Unfortunately, the correlation coefficient is not able to measure nonlinear relationships between variables, and this is why other metrics are sometimes used.
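Written out, with cov denoting the covariance and $\sigma$ the standard deviation:

$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \, \sigma_Y}$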

The Mutual Information is one of these alternatives. It comes from Information Theory and essentially measures the amount of information that one variable contains about another. The definition of Mutual Information is strictly related to the definition of entropy (H), which quantifies the "unpredictability" of a variable. The Mutual Information (I) between two variables is the sum of their entropies minus their joint entropy:

$I(X,Y) = H(X) + H(Y) - H(X, Y)$

Differently from the correlation coefficient, the value of Mutual Information is not bounded and can be hard to interpret. Thus, we can consider a normalized version of Mutual Information, called the Uncertainty Coefficient (introduced in 1970), which takes the following form:

$U(X|Y) = \frac{I(X,Y)}{H(X)}$

This coefficient can be seen as the fraction of $X$ that can be predicted given $Y$.

Figure 1

Let's try with an example, a sinusoid. The correlation between time and the function value is normally close to zero, differently from the uncertainty coefficient. We start with a sinusoid (Figure 1, upper left) and then randomly reshuffle an increasing fraction of the original samples. Figure 1 shows the original signal with 25%, 50% and 75% of the samples reshuffled. What happens to the correlation coefficient and the mutual information in those cases?
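The unshuffled case can be sketched in a few lines (using the infotheo package for the entropy estimates; the signal length and the bin counts are arbitrary choices for the example):

```r
library(infotheo)

t <- seq(0, 20 * pi, length.out = 2000)  # ten full periods
x <- sin(t)

# Pearson correlation: close to zero over a whole number of periods
pearson <- cor(t, x)

# Mutual information needs discretised variables
dt <- discretize(t, nbins = 100)  # fine bins for time
dx <- discretize(x, nbins = 10)
I  <- mutinformation(dt, dx)
U  <- I / entropy(dx)  # uncertainty coefficient U(X|Y) = I(X,Y) / H(X)

c(pearson = pearson, U = U)
```

The correlation coefficient is small because the linear trend of a sinusoid over full periods is negligible, while the uncertainty coefficient stays well above it: knowing the time still tells you a lot about the value of the signal.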

In Figure 2 we can see how both measures vary, on average over 50 runs.

Figure 2

We can observe how the uncertainty coefficient (based on the mutual information) seems to represent the quantity of "order" inside the signal more consistently than the correlation coefficient, which appears very noisy.
