
Which reference forecast for seasonal forecasting?

To evaluate the performance of the probabilistic forecasts provided by seasonal forecasting systems, it is very common to use skill scores. A skill score is normally defined as a ratio between a specific accuracy measure computed on the forecast and the same measure applied to a reference forecast. In the case of seasonal forecasts, the reference forecast is commonly the climatological probability, that is, the observed frequency of the target event (the event predicted by the forecast) in the past. Say I want to predict the probability of a seasonal temperature more than 3 degrees above normal: to assess the quality of my forecast I would use as reference the frequency of seasonal temperatures exceeding that threshold observed over the past 20-40 years (depending on the availability of data). This means I would compare my forecast with a static forecast, something like “ok, the event happened 10% of the time in the last 30 years, so I assume that the probability for the incoming season is exactly that value, 10%“.
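To make the idea concrete, here is a minimal sketch in R (all data invented for illustration) that scores a probabilistic forecast against the climatological reference using the Brier score, one of the most common accuracy measures for this kind of forecast. The resulting Brier skill score is one minus the ratio between the two measures:

```r
# Illustrative sketch with invented data: Brier skill score of a
# probabilistic forecast versus the climatological reference forecast.
brier <- function(p, obs) mean((p - obs)^2)

set.seed(42)
obs  <- rbinom(30, 1, 0.1)                 # 30 past seasons, event ~10% of the time
fcst <- pmin(obs * 0.4 + runif(30, 0, 0.2), 1)  # a toy probabilistic forecast

p_clim <- mean(obs)                        # the "static" climatological probability

bs_fcst <- brier(fcst, obs)                # accuracy of the forecast
bs_ref  <- brier(rep(p_clim, length(obs)), obs)  # accuracy of climatology
bss <- 1 - bs_fcst / bs_ref                # > 0: forecast beats climatology
```

Nothing here is from any real forecasting system; it only shows the mechanics of comparing a forecast against a static reference.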

This is not the only option, IRI in their Descriptions of the IRI Climate Forecast Verification Scores states:

Note that using climatology forecasts as the reference for comparison is not the only reasonable option. Other options may be random non-climatology forecasts, or damped persistence forecasts from the previous season’s observations

However, in the area of seasonal forecasting, skill scores are almost always computed using climatology as the reference forecast. Recently, I have published on the EarthArXiv repository a draft paper written with Carlo Buontempo titled “What is users’ next best alternative to the use of dynamical seasonal predictions?“. The paper could seem a bit controversial, but we are not saying that seasonal forecasts are not good (enough); we are instead convinced that climatology (or even persistence) “does not necessarily represent a good proxy for the value the users may see in these predictions“. We think this is really important from a climate service perspective, because:

the existence of simple alternative models with similar skill [of dynamical seasonal forecasts] could represent a stimulus for further research whilst at the same time providing a natural benchmark for evaluating more complex kind of predictions.

Moreover, although this is not mentioned in the paper, we should also take into account the computational requirements of the seasonal forecasts we provide. To this end, when comparing skill we might also look at the difference in computational time, in order to assess the trade-off between the cost of a forecast and its quality.


Public sector information and energy system datasets

Recently I had the occasion to contribute to a document providing feedback on the European Commission public consultation on directives 2003/98/EC and 2013/37/EU. The work has been coordinated by Robbie Morrison, an important contributor to the OpenMod Initiative (and not only).

The document is available at this link, and it discusses the concerns related to the (re-)use of public information and the needs of the energy modelling community. It is also related to my recent experience in the C3S ECEM service where, with a bunch of smart and enthusiastic people from the energy & meteorology community, I had the possibility to really understand the potential and the limitations of publicly available data for the modelling of European power systems in relation to meteorological variables. The document explores many issues, such as data longevity, copyright and the unclear legal status of datasets, the right to archive, etc. Enjoy the read and feel free to drop me a message if you want more details!


MED-GOLD is here

On the first of December the H2020 project MED-GOLD started, with a fantastic kick-off meeting at Eataly in Rome. The project is coordinated by ENEA (I am part of the coordination team) and it aims to develop three climate services for three big industrial partners working in three key Mediterranean sectors: olive oil, pasta and wine. Sixteen partners will work together for four years facing some very hard challenges: defining the value of climate information for the three industrial champions, developing three (reproducible!) tools, sharing the findings with a broader community, engaging small & large stakeholders and policy-makers, and a lot more. If you are interested in climate forecasts and projections and — at the same time — you would like a sustainable and resilient agriculture in Europe, save the link (the website www.med-gold.eu is still temporary) and follow us on Twitter.


Tidyverse and NetCDFs: a first exploration

I am not exaggerating when I say that the main reason to develop in R is the presence of the tidyverse and the plethora of packages that let you explore data and deal easily with the most common tasks (splitting, grouping, temporal aggregation, etc.). Unfortunately, the biggest limitation of R is its capability to manipulate arrays of data (data cubes) effectively, especially nowadays when it is normal to deal with very large data sets (please, don’t say “big”). However, the R community is trying to close this gap and recently I have discovered the tidync package by Michael Sumner. This package tries to make the access and manipulation of NetCDF files (the de facto standard for climate data sets) easier, in a modern and clever way, à la tidyverse.

Recently I was analysing the E-RUN dataset of gridded river runoff for my research on hydro-power modelling within the Copernicus contract ECEM. A simple step was the point correlation of daily runoff to “see” the spatial patterns of similar runoff behaviours. Then, I took the chance to create a simple piece of code using only three packages (well, I’m cheating a bit because one of them is a meta-package…).


The first one (tidync) can be installed with devtools (look at its GitHub page), the second (the tidyverse) needs no introduction and the last one (corrr) can be found on CRAN: it’s a very handy tool to explore correlations in R.

Once you have downloaded the E-RUN NetCDF file (~73 MB), you can open it with tidync, subset a box, and convert it into a tibble with a few lines:

runoff_data %>%
    hyper_filter(
        lat = lat >= 35 & lat < 60,
        lon = lon < 30 ) %>%
    hyper_tibble()

As you can see, the use of the pipe operator makes the code streamlined and easy to read. Moreover, tidync performs the subsetting operation in a lazy way: this is, in my opinion, its biggest feature, because when working with very large datasets, often bigger than the RAM installed on your workstation, R does not load the entire array in memory as a first step; that happens only when hyper_tibble or hyper_slice is called. Now you have a fancy tibble with the gridded runoff data:

# A tibble: 6 x 4
  runoff    lon   lat  time
1 0.604  -5.75   35.2   349
2 0.869  -5.25   35.2   349
3 0.0808 -3.25   35.2   349
4 0.0457 -1.25   35.2   349
5 0.0455 -0.750  35.2   349
6 0.0512 -0.250  35.2   349

Then with a bit of tidyverse magic you can: 1) filter out all the points where the runoff is NA; 2) “merge” the latitude (lat) and longitude (lon) columns into a single one named pos; 3) make the tibble wider with spread; 4) remove the time column, not needed for the correlation; 5) create the correlation matrix and 6) “stretch” it into a long format.

cor_tbl <- runoff_data %>%
  filter(!is.na(runoff)) %>%
  unite(pos, lat, lon) %>%
  spread(pos, runoff) %>%
  select(-time) %>%
  correlate() %>%
  stretch()
Now we have a tibble where each row stores the correlation value between two points. I want to create a map showing all the correlation values with respect to a specific point (in this case the area around Innsbruck, in Austria). With the following lines I select all the points of the correlation matrix related to the chosen coordinates and then revert the “merge” operation done before, using the separate function.

df_toplot <- cor_tbl %>%
  filter(x == "47.25_11.25") %>%
  select(y, r) %>%
  separate(y, c("lat", "lon"), sep = "_") %>%
  mutate(
    lat = as.numeric(lat),
    lon = as.numeric(lon)
  )

We are ready to generate the map now:

my_map <- ggplot(df_toplot, aes(x = lon, 
                                y = lat, 
                                fill = cut(r, breaks = seq(-1, 1, 0.2)))) + 
  geom_tile() +
  geom_point(x = 11.25, y = 47.25, shape = 3, show.legend = FALSE) +
  scale_fill_brewer(palette = "RdBu", drop = FALSE) + 
  borders() + 
  coord_quickmap(xlim = c(-20, 30), 
                  ylim = c(35, 60))

The map is not really elegant, but it is the final output of a very elegant, robust and streamlined workflow. I hope that in the future we will have more tools like this one, giving climate scientists the possibility to manipulate data arrays easily and efficiently in R too.


Dplyr and database: a clutter-free approach for your data

I’ve been working for months on the relationship between energy and climate in Europe. The first mandatory step was to get all the observed data available to calibrate and test a set of models. While for climate data this is quite easy (there are plenty of options for observations and reanalyses), the situation is particularly difficult for the energy sector. However, with a reasonable effort you can download data about electricity demand and generation, for example from the amazing ENTSO-E Transparency portal or from the websites of the European TSOs (Transmission System Operators), possibly in an obscure, malformed, one-file-per-day Excel format (but that’s another story…).

So, in the end you have hundreds (if not thousands) of files and probably a set of functions to access them in an organised way. But sometimes you don’t remember where-is-what (unfortunately the life of a scientist is rather multi-tasking), or perhaps you need to work on multiple machines (via SSH) replicating your files, or you want to share your data with a collaborator…

I have decided to try using a database (PostgreSQL in my case) and, thanks to the dplyr package (part of the excellent tidyverse), this was super-easy.
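As a flavour of the approach, here is a sketch with invented table and column names, using an in-memory SQLite database so that it is self-contained; with PostgreSQL only the connection line changes:

```r
library(DBI)
library(dplyr)

# In-memory SQLite for this sketch; for PostgreSQL you would use something
# like dbConnect(RPostgres::Postgres(), dbname = "energy", host = "...")
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Invented example data standing in for the downloaded ENTSO-E files
demand <- tibble::tibble(country = c("IT", "IT", "DE"),
                         load_mw = c(35000, 36000, 52000))
copy_to(con, demand, "electricity_demand")

# tbl() returns a lazy reference: dplyr translates the verbs below to SQL
# and nothing is pulled into R until collect() is called
res <- tbl(con, "electricity_demand") %>%
  filter(country == "IT") %>%
  summarise(mean_load = mean(load_mw, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)
```

The nice part is that the dplyr code is the same whether the data lives in a local data frame or in a remote database.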



Experiences from a masterclass on Climate Services

From the 16th to the 20th of May, the second Masterclass on Climate Services was held at EURAC in Bolzano. This year the focus was on the food security, water and health sectors and – as we did in the first edition – during the entire week the students worked in tight contact with the problem-holders to develop a climate service addressing their issues. It was very challenging for the students, working on how to manage climate variability for hydropower in Alto-Adige, heat-wave early warning in Emilia-Romagna, or how to deal with droughts in Sudan with the World Food Programme.

We organizers were particularly satisfied with the feedback we received (for example this) and with the exciting environment we had for the entire week, with a lot of nice discussions ranging from climate forecasts to data visualization.



Wrapping up the last months

Thanks to the Storify service I can use all the public social media posts (mainly tweets, I would say) to describe events like conferences and schools. Last May in Bolzano we organized a wonderful school (or better, a Masterclass) on Climate Services, with committed students from different backgrounds and speakers from the climate science, energy and agricultural sectors. The event is described here: https://storify.com/matteodefelice/euporiasmc15

Last month, instead, the biennial conference on Energy and Meteorology (ICEM) was held in Boulder: five days packed with lectures, seminars and talks on energy (mainly renewables) and meteorology (considering both weather and climate). The Storify of this event is here: https://storify.com/matteodefelice/icem-2015-boulder



Making R based research more reproducible

In recent years a large part of my research has been rather data intensive: hundreds of gigabytes of binary data saved to store final analyses and intermediate results. Recently I had an issue with a data file generated by a chain of R functions and I wasn’t able to retrace the “history” of those data: the only (meta)information I had was the creation date and the long filename that I normally use to convey information about the analysis and the functions involved. Unfortunately, in this case that wasn’t enough, so I came up with a partial solution for my R workflow: a save function which stores data together with metadata (when, how, where, etc.).

mySave <- function(..., file) {
  callingF <- sys.call(-1)   # the call of the function invoking mySave
  cTime <- Sys.time()
  cWd <- getwd()
  sInfo <- Sys.info()
  metadata <- list(call = callingF, time = cTime, wd = cWd, sysinfo = sInfo)
  save(..., metadata, file = file)
}

If I use this function instead of save, the saved data will also include the original function call, the full date, the path of the working directory and information about the system (including the hostname). It is far from perfect, but it is a first personal step towards making my research fully reproducible.
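A quick usage sketch (the object and filename here are invented for illustration):

```r
# Save an object with mySave instead of save, then load it back:
# both the object and the metadata list are restored into the workspace
x <- rnorm(100)
mySave(x, file = "x_analysis.RData")

rm(x)
load("x_analysis.RData")   # restores x together with metadata
str(metadata)              # inspect call, timestamp, working directory, system info
```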


EUPORIAS Climate Service Master Class in Bolzano

EUPORIAS is launching a climate service masterclass this coming spring. Climate service development requires a new framework for the interaction between users and providers of climate information. The first EUPORIAS climate service masterclass wants to be a first step in the direction of co-production, where new climate service prototypes can be developed but, more importantly, where new protocols for interaction can be explained and presented in a hands-on fashion. The focus of the masterclass will be on the Energy, Tourism and Agriculture sectors, with several local (and European) stakeholders involved.

The Masterclass will be held at EURAC, in Bolzano, Italy, on 18-22 May 2015. The deadline for registration is the 30th of April.

The on-line application form can be accessed at: http://www.euporias.eu/masterclass-registration


Exploring time-series data with ggplot2 and shiny

Recently I’ve started working with a EUROSTAT dataset with more than 300 variables. I needed to explore the time-series to see them and visualize evident trends and relationships. I had at least two options: writing a batch script to create a time-series plot for each variable, or creating something interactive.

I’ve spent less than an hour creating an R Shiny app. Shiny is a powerful tool that lets you create interactive web applications easily. You only need to create two files: one describing the user interface (ui.R) and one that defines the server-side computations (server.R).

You can see my example application based on artificial data here.

The user interface lets the user select the time-scale, i.e. whether to see all the samples or the grouped average of ten samples. Then you can select the range (if you want to “zoom”) and the variables you want to plot. The code is the following:

shinyUI(pageWithSidebar(
  headerPanel('Explore Time-Series Data'),
  sidebarPanel(
    selectInput('timescale', 'Time scale', c('1', '10')),
    sliderInput("range", "Range", min = 1, 
                max = 200, value = c(1, 200)),
    checkboxGroupInput('var', 'Variable', 
                       choices = c('alpha','beta','gamma'), 
                       selected = 'alpha')
  ),
  mainPanel(plotOutput('plot1'))
))

The server part is quite simple. The renderPlot section is just a ggplot command that plots the data returned by the reactive function (read here if you want to know what a reactive function is).

shinyServer(function(input, output, session) {
  # Combine the selected variables into a new data frame
  selectedData <- reactive({
    # Create artificial data
    dd = data.frame(time = 1:200, 
                    alpha = cumsum(rnorm(200)), 
                    beta  = cumsum(rnorm(200, sd = 1.2)),
                    gamma = cumsum(rnorm(200, sd = 1.1)))
    dd_m = melt(dd, id.vars = 'time')
    if (input$timescale == '10') {
      dd_m = dd_m %>% 
             group_by(time = floor(time / 10), variable) %>% 
             summarise(value = mean(value)) 
    }
    filter(dd_m, variable %in% input$var, 
           time %in% seq(input$range[1], input$range[2]))
  })
  output$plot1 <- renderPlot({
    ggplot(selectedData(), aes(x = time, y = value, color = variable)) +
      geom_line(size = 2, alpha = 0.5) + geom_point(size = 3) +      
      theme(text = element_text(size = 18), 
            legend.position = 'bottom')
  })
})

The selectedData function creates the data frame, which is then melted using the reshape2 package to transform the original data frame into a new one where each row represents a single sample. In this way you can unleash the power of dplyr and ggplot2: with dplyr you can compute the grouped average (if selected) and filter by the selected variables and time-steps. You can see this application live at this link (shinyapps host).
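For readers unfamiliar with melt, a tiny stand-alone example (toy data, not from the app) of the wide-to-long transformation it performs:

```r
library(reshape2)

# Wide format: one column per variable
dd <- data.frame(time = 1:3, alpha = c(1, 2, 3), beta = c(4, 5, 6))

# Long format: one row per (time, variable, value) sample
dd_m <- melt(dd, id.vars = "time")
head(dd_m)
```

Once the data is in this long format, grouping by variable with dplyr and mapping variable to the colour aesthetic in ggplot2 become one-liners.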