Latest Entries

dplyr and databases: a clutter-free approach to your data

I’ve been working for months on the relationship between energy and climate in Europe. The first mandatory step was to gather all the available observed data to calibrate and test a set of models. While for climate this is quite easy (there are plenty of options for observations and reanalyses), the situation is particularly difficult for the energy sector. However, with a reasonable effort you can download data on electricity demand and generation, for example from the amazing ENTSO-E Transparency portal or from the websites of the European TSOs (Transmission System Operators), possibly in an obscure, malformed one-file-per-day Excel format (but that’s another story…).

So, in the end you have hundreds (if not thousands) of files and probably a set of functions to access them in an organised way. But sometimes you don’t remember where-is-what (unfortunately, the life of a scientist involves a lot of multi-tasking), or perhaps you need to work on multiple machines (via SSH), replicating your files, or you want to share your data with a collaborator…

I decided to try using a database (PostgreSQL in my case) and, thanks to the dplyr package (part of the excellent tidyverse), this was super easy.
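As a sketch of the approach (the table, column names and values below are made up for illustration, and an in-memory SQLite database stands in for a PostgreSQL server, since dplyr talks to both through DBI in the same way):

```r
library(DBI)
library(dplyr)

# In-memory SQLite stands in for PostgreSQL here; for the real thing you
# would connect with RPostgres::Postgres() and your server's credentials.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# A made-up table of electricity demand samples
DBI::dbWriteTable(con, "demand",
                  data.frame(country = c("IT", "IT", "FR"),
                             load_mw = c(30500, 34200, 52100)))

# tbl() gives a lazy reference: dplyr translates the verbs to SQL and the
# query runs inside the database, not in R
demand <- tbl(con, "demand")
res <- demand %>%
  group_by(country) %>%
  summarise(mean_load = mean(load_mw, na.rm = TRUE)) %>%
  collect()                             # pull the result into R

DBI::dbDisconnect(con)
res
```

The nice part is that the dplyr code is the same whether the table lives in memory, in SQLite or in PostgreSQL: only the connection line changes.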

Continue reading…


Experiences from a masterclass on Climate Services

From the 16th to the 20th of May, the second Masterclass on Climate Services was held at EURAC in Bolzano. This year the focus was on the food security, water and health sectors and, as in the first edition, during the entire week the students worked in close contact with the problem-holders to develop a climate service addressing their issues. It was very challenging for the students, who worked on how to manage climate variability for hydropower in Alto Adige, on heat-wave early warning in Emilia-Romagna and on how to deal with droughts in Sudan with the World Food Program.

We organizers were particularly satisfied with the feedback we received (for example this) and with the exciting environment we had for the entire week, with many good discussions ranging from climate forecasts to data visualization.



Wrapping up the last months

Thanks to the Storify service I can use all the public social media posts (mainly tweets, I would say) to describe events like conferences and schools. Last May in Bolzano we organized a wonderful school (or better, a Masterclass) on Climate Services, with committed students from different backgrounds and speakers from the climate science, energy and agriculture sectors. The event is described here:

Last month, instead, the biennial conference on Energy and Meteorology (ICEM) was held in Boulder: five days packed with lectures, seminars and talks on energy (mainly renewables) and meteorology (covering both weather and climate). The Storify of this event is here:



Making R based research more reproducible

During the last few years a large part of my research has been rather data intensive: hundreds of gigabytes of binary data were saved, storing final analyses and intermediate results. Recently I had an issue with a data file generated by a chain of R functions and I wasn’t able to retrace the “history” of those data: the only (meta)information I had was the creation date and the long filename that I normally use to convey information about the analysis and the functions involved. Unfortunately, in this case that wasn’t enough, and I came up with a partial solution for my R workflow: a save function which stores data together with metadata (when, how, where, etc.).

mySave <- function(..., file) {
  callingF <- match.call()  # the call that generated the data
  cTime    <- Sys.time()    # creation time
  cWd      <- getwd()       # working directory
  sInfo    <- Sys.info()    # system information, including the hostname
  metadata <- list(call = callingF, time = cTime, wd = cWd, sysinfo = sInfo)
  save(..., metadata, file = file)
}

If I use this function instead of save, the saved file will also include the original function call, the full date, the path of the working directory and information about the system (including the hostname). It is far from perfect, but it is a first personal step towards making my research fully reproducible.
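As a quick check of the idea, the sketch below saves an object with the metadata and loads it back into a fresh environment (the function is repeated here so the snippet is self-contained; the object and file names are just examples):

```r
# Same idea as mySave: store the data together with a metadata list
mySave <- function(..., file) {
  metadata <- list(call = match.call(), time = Sys.time(),
                   wd = getwd(), sysinfo = Sys.info())
  save(..., metadata, file = file)
}

f <- tempfile(fileext = ".RData")
x <- rnorm(10)
mySave(x, file = f)

e <- new.env()
load(f, envir = e)   # restores both x and metadata
sort(ls(e))          # "metadata" "x"
e$metadata$call      # mySave(x, file = f)
```

Loading into a separate environment keeps the restored metadata from clobbering objects in the current session.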


EUPORIAS Climate Service Master Class in Bolzano

EUPORIAS is launching a climate service masterclass for this coming spring. Climate service development requires a new framework for the interaction between users and providers of climate information. This first EUPORIAS climate service masterclass wants to be a first step towards co-production, where new climate service prototypes can be developed but, more importantly, where new protocols for interaction can be explained and presented in a hands-on fashion. The focus of the masterclass will be on the Energy, Tourism and Agriculture sectors, with several local (and European) stakeholders involved.

The Masterclass will be held at EURAC, in Bolzano, Italy, on 18–22 May 2015. The deadline for registration is the 30th of April.

The online application form can be accessed at:


Exploring time-series data with ggplot2 and shiny

Recently I started working with a EUROSTAT dataset with more than 300 variables. I needed to explore the time series to visualize evident trends and relationships. I had at least two options: write a batch script to create a time-series plot for each variable, or create something interactive.

I spent less than an hour creating an R Shiny app. Shiny is a powerful tool that lets you create interactive web applications easily. You only need to create two files: one describing the user interface (ui.R) and one defining the server-side computations (server.R).

You can see my example application based on artificial data here.

The user interface lets the user select the time scale, i.e. whether to see all the samples or the average of groups of ten samples. Then you can select the range (if you want to “zoom”) and the variables to plot. The code is the following:

library(shiny)

shinyUI(fluidPage(
  headerPanel('Explore Time-Series Data'),
  sidebarPanel(
    selectInput('timescale', 'Time scale', c('1', '10')),
    sliderInput('range', 'Range', min = 1,
                max = 200, value = c(1, 200)),
    checkboxGroupInput('var', 'Variable',
                       choices = c('alpha', 'beta', 'gamma'),
                       selected = 'alpha')
  ),
  mainPanel(plotOutput('plot1'))
))

The server part is quite simple. The renderPlot section is just a ggplot command that plots the data returned by the reactive function (read here if you want to know what a reactive function is).

library(shiny)
library(reshape2)
library(dplyr)
library(ggplot2)

shinyServer(function(input, output, session) {
  # Combine the selected variables into a new data frame
  selectedData <- reactive({
    # Create artificial data: three random walks over 200 time steps
    dd <- data.frame(time  = 1:200,
                     alpha = cumsum(rnorm(200)),
                     beta  = cumsum(rnorm(200, sd = 1.2)),
                     gamma = cumsum(rnorm(200, sd = 1.1)))
    dd_m <- melt(dd, id.vars = 'time')
    # If requested, average the samples in groups of ten
    if (input$timescale == '10') {
      dd_m <- dd_m %>%
        group_by(time = floor(time / 10), variable) %>%
        summarise(value = mean(value))
    }
    # Keep only the selected variables and the selected time range
    filter(dd_m, variable %in% input$var,
           time %in% seq(input$range[1], input$range[2]))
  })
  output$plot1 <- renderPlot({
    ggplot(selectedData(), aes(x = time, y = value, color = variable)) +
      geom_line(size = 2, alpha = 0.5) + geom_point(size = 3) +
      theme(text = element_text(size = 18),
            legend.position = 'bottom')
  })
})

The selectedData function creates the data frame, which is then melted using the reshape2 package to transform the original data frame into a new one where each row represents a single sample. In this way you can unleash the power of dplyr and ggplot2: with dplyr you can compute the grouped average (if selected) and filter by the selected variables and time steps. You can see this application live at this link (shinyapps host).
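Outside Shiny, the same melt → group → filter pipeline can be tried on its own; the small toy data frame below is just an illustration, not the app’s data:

```r
library(reshape2)
library(dplyr)

dd   <- data.frame(time = 1:20, alpha = rnorm(20), beta = rnorm(20))
dd_m <- melt(dd, id.vars = 'time')   # one row per (time, variable) sample

# Average the samples in groups of ten, as the '10' time scale does
avg <- dd_m %>%
  group_by(time = floor(time / 10), variable) %>%
  summarise(value = mean(value), .groups = 'drop')

# Keep a single variable, as the checkbox selection does
filter(avg, variable == 'alpha')
```

Once the data is in this long format, every interactive control in the app maps onto a single dplyr verb.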


Measuring the uncertainty between variables

The first thing that comes to mind when you want to measure the association between two variables is probably the correlation coefficient (the Pearson correlation coefficient). This coefficient is the covariance of the two variables divided by the product of their standard deviations; basically, it is a kind of “normalised” covariance. It is easy to understand, quick to compute and available in almost all software tools. Unfortunately, the correlation coefficient cannot measure nonlinear relationships between variables, and this is why other metrics are sometimes used.

Mutual Information is one of these alternatives. It comes from Information Theory and essentially measures the amount of information that one variable contains about another. The definition of Mutual Information is strictly related to that of Entropy (H), which tries to quantify the “unpredictability” of a variable. The Mutual Information (I) between two variables is equal to the sum of their entropies minus their joint entropy:

I(X,Y) = H(X) + H(Y) - H(X, Y)

Differently from the correlation coefficient, the value of Mutual Information is not bounded and can be hard to interpret. Thus, we can consider a normalized version of Mutual Information, called the Uncertainty Coefficient (introduced in 1970), which takes the following form:

U(X|Y) = I(X,Y) / H(X)

This coefficient can be seen as the part of X that can be predicted given Y.


Figure 1

Let’s try an example: a sinusoid. The correlation between time and the function value is normally close to zero, differently from the uncertainty coefficient. We start with a sinusoid (Figure 1, upper left) and then randomly reshuffle an increasing fraction of the original samples. Figure 1 shows the original signal with 25%, 50% and 75% of the samples reshuffled. What happens to the correlation coefficient and the mutual information in these cases?
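The first step (the unshuffled sinusoid) can be sketched in a few lines of R; the histogram-based entropy estimate and the bin counts here are my own assumptions for illustration, not the exact method behind the figures:

```r
set.seed(1)
n <- 1000
t <- seq(0, 20 * pi, length.out = n)   # ten full periods
y <- sin(t)

# Entropy (in nats) of a discrete probability vector
entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

# U(X|Y) = I(X,Y) / H(X), with probabilities estimated by binning
uncertainty_coef <- function(x, y, xbins = 16, ybins = 16) {
  pxy <- table(cut(x, xbins), cut(y, ybins)) / length(x)
  I   <- entropy(rowSums(pxy)) + entropy(colSums(pxy)) - entropy(pxy)
  I / entropy(rowSums(pxy))
}

cor(t, y)                           # weak linear association
uncertainty_coef(y, t, ybins = 50)  # yet y is largely predictable from time
```

Reshuffling a growing fraction of y, as in Figure 1, and recomputing both quantities at each step reproduces the experiment described above.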

In Figure 2 we can see how both measures vary, on average over 50 runs.


Figure 2

We can observe how the uncertainty coefficient (based on the mutual information) seems to represent the quantity of “order” inside the signal more consistently than the correlation coefficient, which appears very noisy.


Climate Services and Stakeholders’ needs

I spent the entire last week in Toulouse at the joint assembly of the SPECS and EUPORIAS projects. One of the main challenges of these projects is the need to understand how climate information could help the stakeholders’ (private companies, government agencies, etc.) activities and how to communicate it effectively. We had a stakeholder meeting in Rome, and during the last year there were many discussions about end-users’ needs and two-way communication processes. Within Work Package WP12 (titled “Assessment of Users’ Needs”) several interviews with more than 70 stakeholders in Europe have been carried out, and an online survey has also been set up, reaching more than 400 answers. In Toulouse, WP12, coordinated by the University of Leeds, organised a workshop aimed at assessing the ability of climate impact scientists to fulfil the identified user needs (including existing services and products). The workshop was really interesting: it helped to make the scientists aware of non-scientists’ expectations (sometimes unrealistic) about climate forecasts. If you are interested in this activity you can take a look at the project deliverables here, particularly D12.2 (Report on findings on S2D users’ needs from workshop with meteorological organisations) and D12.3 (Report summarising users’ needs for S2D predictions).





EUPORIAS is an FP7 European project started two years ago. The project focuses on the use of climate forecasts (seasonal and decadal) for regional impacts or, to make a long story short, Climate Services. Within this project, five prototypes of climate services will be developed. I am involved in LEAP ETHIOPIA, which aims to assess the value of using seasonal forecasts for a drought early warning system in Ethiopia. This prototype is not (only) about papers and science: it will be an operational service linked with the LEAP system (Livelihoods, Early Assessment and Protection) developed by the Government of Ethiopia in collaboration with the World Food Program and the World Bank.

It’s a big challenge: analysing the socio-economic value of climate forecasts is difficult and involves many scientific areas, from climate science, agriculture, computer science and data analysis to economics and the social sciences. Analysing climate information to write scientific papers is easy compared with using the same information to deal with complex, ‘real-world’ problems.


The uncertainties in Climate

This is one of the movies produced and funded by the FP7 CLIM-RUN project. The video describes how climate works and where the uncertainty comes from, especially when we talk about climate change. There is also a French version.

The uncertainties in climate change scenarios from Vegas-Deluxe on Vimeo.

