Latest Entries

Wrapping up the last months

Thanks to the Storify service I can use public social media posts (mainly tweets, I would say) to describe events like conferences and schools. Last May in Bolzano we organized a wonderful school (or better, a Masterclass) on Climate Services, with committed students from different backgrounds and speakers from the climate science, energy and agriculture sectors. The event is described here: https://storify.com/matteodefelice/euporiasmc15

Last month, instead, the biennial conference on Energy and Meteorology (ICEM) was held in Boulder: five days packed with lectures, seminars and talks on energy (mainly renewable energy) and meteorology (covering both weather and climate). The Storify of this event is here: https://storify.com/matteodefelice/icem-2015-boulder

 


Making R based research more reproducible

In recent years a large part of my research has been rather data intensive: hundreds of gigabytes of binary data saved as final analyses and intermediate results. Recently I had an issue with a data file generated by a chain of R functions and I wasn’t able to retrace the “history” of those data: the only (meta) information I had was the creation date and the long filename I normally use to convey information about the analysis and the functions involved. Unfortunately, in this case that wasn’t enough, so I came up with a partial solution for my R workflow: a save function which stores data together with metadata (when, how, where, etc.).

mySave <- function(..., file) {
  callingF <- sys.call(-1)   # call of the function that called mySave (NULL at the top level)
  cTime    <- Sys.time()     # when the data were saved
  cWd      <- getwd()        # where: the working directory
  sInfo    <- Sys.info()     # on which machine: hostname, user, OS
  metadata <- list(call = callingF, time = cTime, wd = cWd, sysinfo = sInfo)
  save(..., metadata, file = file)
}

If I use this function instead of save, the saved data also include the original function call, the full date, the path of the working directory and information about the system (including the hostname). It is far from perfect, but it is a personal first step towards making my research fully reproducible.
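A minimal usage sketch (the object and file names are just placeholders): load() restores both the saved objects and the metadata list, so the provenance can be inspected later.

results <- cumsum(rnorm(100))             # some analysis output
mySave(results, file = 'results.RData')

# later, possibly in another R session:
load('results.RData')                     # restores both 'results' and 'metadata'
str(metadata)                             # when, where and on which system it was saved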


EUPORIAS Climate Service Master Class in Bolzano

EUPORIAS is launching a climate service master class this coming spring. Climate service development requires a new framework for the interaction between users and providers of climate information. The first EUPORIAS climate service masterclass wants to be a first step in the direction of co-production, where new climate service prototypes can be developed but, more importantly, where new protocols for interaction can be explained and presented in a hands-on fashion. The focus of the master class will be on the energy, tourism and agriculture sectors, with several local (and European) stakeholders involved.

The Masterclass will be held at EURAC in Bolzano, Italy, on 18-22 May 2015. The deadline for registration is 30 April.

The on-line application form can be accessed at: http://www.euporias.eu/masterclass-registration


Exploring time-series data with ggplot2 and shiny

Recently I’ve started working with a EUROSTAT dataset with more than 300 variables. I needed to explore the time series and visualize evident trends and relationships. I had at least two options: writing a batch script to create a time-series plot for each variable, or creating something interactive.

I’ve spent less than an hour creating an R Shiny app. Shiny is a powerful tool that lets you create interactive web applications easily. You only need to create two files: one describing the user interface (ui.R) and one defining the server-side computations (server.R).

You can see my example application based on artificial data here.

The user interface lets the user select the time scale, i.e. whether to see all the samples or averages over groups of ten samples. Then you can select the range (if you want to “zoom”) and the variables that you want to plot. The code is the following:

shinyUI(pageWithSidebar(
  headerPanel('Explore Time-Series Data'),
  sidebarPanel(
    # time scale: raw samples ('1') or averages over groups of ten ('10')
    selectInput('timescale', 'Time scale', c('1', '10')),
    # range of time steps to display (the "zoom")
    sliderInput("range", "Range", min = 1, 
                max = 200, value = c(1, 200)),
    # variables to plot
    checkboxGroupInput('var', 'Variable', 
                       choices = c('alpha', 'beta', 'gamma'), 
                       selected = 'alpha')
  ),
  mainPanel(
    plotOutput('plot1')
  )
))

The server part is quite simple. The renderPlot section is just a ggplot command that plots the data returned by the reactive function (read here if you want to know what a reactive function is).

library(reshape2)
library(ggplot2)
library(dplyr)

shinyServer(function(input, output, session) {

  # Build the (artificial) data set and apply the user's selections
  selectedData <- reactive({
    # Create artificial data: three random walks over 200 time steps
    set.seed(123)
    dd = data.frame(time = 1:200, 
                    alpha = cumsum(rnorm(200)), 
                    beta  = cumsum(rnorm(200, sd = 1.2)),
                    gamma = cumsum(rnorm(200, sd = 1.1)))
    # Reshape to long format: one row per (time, variable) pair
    dd_m = melt(dd, id.vars = 'time')

    # If the '10' time scale is selected, average over groups of ten samples
    if (input$timescale == '10') {
      dd_m = dd_m %>% 
             group_by(time = floor(time / 10), variable) %>% 
             summarise(value = mean(value)) 
    }

    # Keep only the selected variables and the selected time range
    filter(dd_m, variable %in% input$var, 
           time %in% seq(input$range[1],
                         input$range[2]))
  })

  output$plot1 <- renderPlot({
    ggplot(selectedData(), aes(x = time, y = value, color = variable)) +
      geom_line(size = 2, alpha = 0.5) + geom_point(size = 3) +
      theme(text = element_text(size = 18), 
            legend.position = 'bottom')
  })

})

The selectedData function creates the data frame, which is then melted with the reshape2 package to transform the original wide data frame into a long one where each row represents a single sample. In this way you can unleash the power of dplyr and ggplot2: with dplyr you compute the grouped average (if selected) and filter by the selected variables and time steps. You can see this application live at this link (hosted on shinyapps).
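If you prefer to run the app locally instead of on shinyapps, a minimal sketch (the folder name is just an example): save ui.R and server.R in the same folder and launch it with runApp.

library(shiny)
# ui.R and server.R saved together in a folder, e.g. 'ts-explorer/'
runApp('ts-explorer')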


Measuring the uncertainty between variables

The first thing that comes to mind when you want to measure the association between two variables is probably the correlation coefficient (the Pearson correlation coefficient). This coefficient is the covariance of the two variables divided by the product of their standard deviations, basically a kind of “normalised” covariance. It’s easy to understand, quick to compute and available in almost all computing software. Unfortunately, the correlation coefficient cannot capture nonlinear relationships between the variables, and this is why other metrics are sometimes used.

Mutual Information is one of these alternatives. It comes from Information Theory and essentially measures the amount of information that one variable contains about another. The definition of Mutual Information is strictly related to the definition of Entropy (H), which quantifies the “unpredictability” of a variable. The Mutual Information (I) between two variables is equal to the sum of their entropies minus their joint entropy:

I(X,Y) = H(X) + H(Y) - H(X, Y)

Unlike the correlation coefficient, the value of Mutual Information is not bounded and it can be hard to interpret. Thus, we can consider a normalized version of Mutual Information, called the Uncertainty Coefficient (introduced in 1970), which takes the following form:

U(X|Y) = \frac{I(X,Y)}{H(X)}

This coefficient can be seen as the fraction of the information in X that can be predicted given Y.

Figure 1: the original sinusoid (upper left) and versions with 25%, 50% and 75% of the samples reshuffled.

Let’s try an example: a sinusoid. The correlation between time and the function value is normally close to zero, unlike the uncertainty coefficient. We start with a sinusoid (Figure 1, upper left) and then randomly reshuffle an increasing fraction of the original samples. Figure 1 shows the original signal together with versions where 25%, 50% and 75% of the samples have been reshuffled. What happens to the correlation coefficient and the mutual information in these cases?

In Figure 2 we can see how both measures vary, on average over 50 runs.

Figure 2: correlation coefficient and uncertainty coefficient as the fraction of reshuffled samples increases (average over 50 runs).

We can observe how the uncertainty coefficient (based on the mutual information) tracks the amount of “order” in the signal much more consistently than the correlation coefficient, which appears very noisy.
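If you want to play with a similar experiment, here is a rough sketch in R. It is not the code used for the figures above: it relies on the infotheo package to estimate entropy and mutual information on binned data, and the signal length, the number of bins and the reshuffled fractions are arbitrary choices.

library(infotheo)

# U(X|Y) = I(X,Y) / H(X), estimated on discretized (binned) variables
uncertainty_coef <- function(x, y, nbins = 16) {
  dx <- discretize(x, nbins = nbins)
  dy <- discretize(y, nbins = nbins)
  mutinformation(dx, dy) / entropy(dx)
}

# randomly reshuffle a given fraction of the samples of a signal
reshuffle <- function(x, fraction) {
  idx <- sample(length(x), round(fraction * length(x)))
  x[idx] <- sample(x[idx])
  x
}

set.seed(42)
tt <- seq(0, 4 * pi, length.out = 500)
x  <- sin(tt)

for (f in c(0, 0.25, 0.5, 0.75, 1)) {
  xs <- reshuffle(x, f)
  cat(sprintf("%3.0f%% reshuffled: correlation = %+.3f, U = %.3f\n",
              100 * f, cor(tt, xs), uncertainty_coef(xs, tt)))
}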


Climate Services and Stakeholders’ needs

I spent the entire last week in Toulouse at the joint assembly of the SPECS and EUPORIAS projects. One of the main challenges of these projects is understanding how climate information could help stakeholders’ (private companies, government agencies, etc.) activities and how to communicate it effectively. We had a stakeholder meeting in Rome and during the last year there were many discussions about end-users’ needs and two-way communication processes. Within Work Package WP12 (titled “Assessment of Users’ Needs”) interviews with more than 70 stakeholders in Europe have been carried out, and an online survey has been set up, collecting more than 400 answers. In Toulouse WP12, coordinated by the University of Leeds, organised a workshop aimed at assessing the ability of climate impact scientists to fulfil the identified user needs (including existing services and products). The workshop was really interesting: it helped make the scientists aware of non-scientists’ (sometimes unrealistic) expectations about climate forecasts. If you are interested in this activity you can take a look at the project deliverables here, particularly D12.2 (Report on findings on S2D users’ needs from workshop with meteorological organisations) and D12.3 (Report summarising users’ needs for S2D predictions).



EUPORIAS and LEAP


EUPORIAS is an FP7 European project started two years ago. The project focuses on the use of climate forecasts (seasonal and decadal) for regional impacts; to make a long story short: Climate Services. Within this project, five prototypes of climate services will be developed. I am involved in LEAP ETHIOPIA, which aims to assess the value of using seasonal forecasts for a drought early warning system in Ethiopia. This prototype is not (only) about papers and science: it will be an operational service linked with the LEAP system (Livelihoods, Early Assessment and Protection) developed by the Government of Ethiopia in collaboration with the World Food Program and the World Bank.

It’s a big challenge: analysing the socio-economic value of climate forecasts is difficult and involves many scientific areas, from climate science and agriculture to computer science and data analysis, economics and the social sciences. Analysing climate information to write scientific papers is easy compared with using the same information to deal with complex, ‘real-world’ problems.


The uncertainties in Climate

This is one of the movies produced and funded by the FP7 CLIM-RUN project. The video describes how climate works and where the uncertainty comes from, especially when we talk about climate change. There is also a French version.

The uncertainties in climate change scenarios from Vegas-Deluxe on Vimeo.


2nd CLIM-RUN School: my “Learning from data” lecture

This week the 2nd CLIM-RUN school on Climate Services has been held at the International Centre for Theoretical Physics (ICTP) in Trieste. CLIM-RUN is an FP7 project that aims to develop a protocol for applying new methodologies for the production of adequate climate information at the regional to local scale, relevant to and usable by different sectors (energy, agriculture, etc.).

During this winter school I gave a two-hour lecture on the use of data mining and machine learning methods for energy & meteorology applications.

Slides: “Learning from data: data mining approaches for Energy & Weather/Climate applications”, by matteodefelice.

Three things that changed the way I work

Since my first year of Ph.D. in 2007 a lot of things have changed in information technology and in hardware/software. Some of them did not change anything about my usual workflow, but others did in an impressive way. Here is my personal list:

  1. Cloud data: when Dropbox appeared it was an impressive innovation, but nowadays cloud storage is so cheap and reliable that I don’t have to worry anymore about backups and did-I-forget-to-copy-that-directory moments. Today I have more than 25 GB across Dropbox, Google Drive and Ubuntu One, plus unlimited backup space on my Crashplan account. Amazing.
  2. Easy data visualization: thanks to the availability of cheap computing power, today I can use tools like R + ggplot2 that really help me visualize data in a beautiful way and carry out easy data mining on massive data sets.
  3. RStudio Server: this is the most recent innovation in my workflow. Just using my browser I can work from anywhere with the powerful open-source R IDE RStudio, running on the SGI server (128 CPUs, 512 GB of RAM) in my lab, without any lag in the user interface.

