### Exploring time-series data with ggplot2 and shiny

Recently I’ve started working with a EUROSTAT dataset with more than 300 variables. I needed to explore the time series to spot evident trends and relationships. I had at least two options: write a batch script to create a time-series plot for each variable, or build something interactive.

I spent less than an hour creating an R Shiny app. Shiny is a powerful tool that lets you create interactive web applications easily. You only need two files: one describing the user interface (ui.R) and one defining the server-side computations (server.R).

You can see my example application based on artificial data here.

The user interface lets the user select the time scale (i.e. whether to see every sample or the average of groups of ten samples), the range (if you want to “zoom” in) and the variables to plot. The code is the following:

```r
# ui.R
library(shiny)

shinyUI(pageWithSidebar(
  headerPanel('Explore Time-Series Data'),
  sidebarPanel(
    selectInput('timescale', 'Time scale', c('1', '10')),
    sliderInput("range", "Range", min = 1, max = 200, value = c(1, 200)),
    checkboxGroupInput('var', 'Variable',
                       choices = c('alpha', 'beta', 'gamma'),
                       selected = 'alpha')
  ),
  mainPanel(
    plotOutput('plot1')
  )
))
```

The server part is quite simple. The renderPlot section is just a ggplot command that plots the data returned by the reactive function (read here if you want to know what a reactive function is).

```r
# server.R
library(shiny)
library(reshape2)
library(ggplot2)
library(dplyr)

shinyServer(function(input, output, session) {

  # Combine the selected variables into a new data frame
  selectedData <- reactive({
    # Create artificial data
    set.seed(123)
    dd <- data.frame(time  = 1:200,
                     alpha = cumsum(rnorm(200)),
                     beta  = cumsum(rnorm(200, sd = 1.2)),
                     gamma = cumsum(rnorm(200, sd = 1.1)))
    dd_m <- melt(dd, id.vars = 'time')

    # If requested, average the samples in groups of ten
    if (input$timescale == '10') {
      dd_m <- dd_m %>%
        group_by(time = floor(time / 10), variable) %>%
        summarise(value = mean(value))
    }

    # Keep only the selected variables and time range
    filter(dd_m, variable %in% input$var,
           time %in% seq(input$range[1], input$range[2]))
  })

  output$plot1 <- renderPlot({
    ggplot(selectedData(), aes(x = time, y = value, color = variable)) +
      geom_line(size = 2, alpha = 0.5) +
      geom_point(size = 3) +
      theme(text = element_text(size = 18), legend.position = 'bottom')
  })

})
```

The selectedData function creates the data frame, which is then melted with the reshape2 package to transform the original wide data frame into a long one, where each row represents a single sample. In this way you can unleash the power of dplyr and ggplot2: with dplyr you can compute the grouped average (if selected) and filter by the selected variables and time steps. You can see this application live at this link (shinyapps host).
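To make the reshaping step concrete, here is a minimal sketch (on toy data, not the app’s data) of what melt does to a wide data frame:

```r
library(reshape2)

# A wide data frame: one column per variable
dd <- data.frame(time = 1:3, alpha = c(0.1, 0.2, 0.3), beta = c(1, 2, 3))

# melt() turns it into long format: one row per (time, variable) sample
melt(dd, id.vars = "time")
#   time variable value
# 1    1    alpha   0.1
# 2    2    alpha   0.2
# 3    3    alpha   0.3
# 4    1     beta   1.0
# ...
```

The long format is what lets a single `group_by(time, variable)` or a single `aes(color = variable)` handle any number of variables at once.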

### Measuring the uncertainty between variables

The first thing that comes to mind when you want to measure the association between two variables is probably the correlation coefficient (the Pearson correlation coefficient). It is the covariance of the two variables divided by the product of their standard deviations, basically a kind of “normalised” covariance. It’s easy to understand, quick to compute and available in almost every statistical tool. Unfortunately, the correlation coefficient cannot capture nonlinear relationships between variables, which is why other metrics are sometimes used.
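Written out, the coefficient just described is

$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \, \sigma_Y}$

which lies in $[-1, 1]$ and reaches $\pm 1$ only for a perfectly linear relationship.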

The Mutual Information is one of these alternatives. It comes from Information Theory and essentially it measures the amount of information that one variable contains about another one. The definition of Mutual Information is strictly related to the definition of Entropy (H) which tries to define the “unpredictability” of a variable. The Mutual Information (I) between two variables is equal to the sum of their entropies minus their joint entropy:

$I(X,Y) = H(X) + H(Y) - H(X, Y)$
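For a discrete variable, the entropy appearing above is

$H(X) = -\sum_{x} p(x) \log p(x)$

and the joint entropy $H(X,Y)$ is defined analogously over the joint distribution $p(x,y)$.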

Unlike the correlation coefficient, the value of Mutual Information is not bounded, so it can be hard to interpret. Thus, we can consider a normalised version of Mutual Information, called the Uncertainty Coefficient (introduced in 1970), which takes the following form:

$U(X|Y) = \frac{I(X,Y)}{H(X)}$

This coefficient can be seen as the part of $X$ that can be predicted given $Y$.

Figure 1

Let’s try an example: a sinusoid. The correlation between time and the function value is normally close to zero, unlike the uncertainty coefficient. We start with a sinusoid (Figure 1, upper left) and then randomly reshuffle an increasing fraction of the original samples. Figure 1 shows the original signal with 25%, 50% and 75% of the samples reshuffled. What happens to the correlation coefficient and the mutual information in these cases?

In Figure 2 we can see how both the measures vary on average in 50 runs.

Figure 2

We can observe how the uncertainty coefficient (based on the mutual information) seems to track the amount of “order” in the signal more consistently than the correlation coefficient, which appears very noisy.
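The experiment can be sketched in a few lines of R. Note that the binning-based entropy estimator below is an assumption (the number of bins and the estimator used for the figures are not specified above):

```r
# Estimate the uncertainty coefficient with a simple equal-width binning
# of both variables.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

uncertainty_coef <- function(x, y, bins = 10) {
  xb <- cut(x, bins)
  yb <- cut(y, bins)
  i_xy <- entropy(table(xb)) + entropy(table(yb)) - entropy(table(xb, yb))
  i_xy / entropy(table(xb))          # U(X|Y) = I(X,Y) / H(X)
}

set.seed(123)
t <- seq(0, 4 * pi, length.out = 500)
y <- sin(t)

# Reshuffle an increasing fraction of the samples, as in Figure 1,
# and compare the two measures
for (frac in c(0, 0.25, 0.5, 0.75)) {
  idx <- sample(seq_along(y), frac * length(y))
  ys <- y
  ys[idx] <- sample(ys[idx])
  cat(sprintf("%.0f%% reshuffled: cor = %+.2f, U = %.2f\n",
              100 * frac, cor(t, ys), uncertainty_coef(t, ys)))
}
```

As more samples are reshuffled, the joint entropy grows and the uncertainty coefficient decreases, mirroring the loss of “order” in the signal; averaging over many runs (as in Figure 2) smooths out the sampling noise.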

### Climate Services and Stakeholders’ needs

I spent the entire last week in Toulouse at the joint assembly of the SPECS and EUPORIAS projects. One of the main challenges of these projects is understanding how climate information could help stakeholders’ (private companies, government agencies, etc.) activities and how to communicate it effectively. We had a stakeholder meeting in Rome, and during the last year there were many discussions about end-users’ needs and two-way communication processes. Within Work Package WP12 (titled “Assessment of Users’ Needs”), interviews with more than 70 stakeholders across Europe have been carried out, and an online survey has been set up, collecting more than 400 answers. In Toulouse, WP12, coordinated by the University of Leeds, organised a workshop to assess the ability of climate impact scientists to fulfil the identified user needs (including existing services and products). The workshop was really interesting: it helped make the scientists aware of non-scientists’ expectations (sometimes unrealistic) about climate forecasts. If you are interested in this activity you can take a look at the project deliverables here, particularly D12.2 (Report on findings on S2D users’ needs from workshop with meteorological organisations) and D12.3 (Report summarising users’ needs for S2D predictions).

### EUPORIAS and LEAP

EUPORIAS is an FP7 European project that started two years ago. The project focuses on the use of climate forecasts (seasonal and decadal) for regional impacts, or, to make a long story short: Climate Services. Within this project, five prototypes of climate services will be developed. I am involved in LEAP ETHIOPIA, which aims to assess the value of using seasonal forecasts for a drought early warning system in Ethiopia. This prototype is not (only) about papers and science: it will be an operational service linked with the LEAP system (Livelihoods, Early Assessment and Protection) developed by the Government of Ethiopia in collaboration with the World Food Program and the World Bank.

It’s a big challenge. Analysing the socio-economic value of climate forecasts is difficult and involves many scientific areas: climate science, agriculture, computer science and data analysis, economics and the social sciences. Analysing climate information to write scientific papers is easy compared with using the same information to deal with complex, real-world problems.

### The uncertainties in Climate

This is one of the movies produced and funded by the FP7 CLIM-RUN project. The video describes how climate works and where the uncertainty comes from, especially when we talk about climate change. There is also a French version.

### 2nd CLIM-RUN School: my “Learning from data” lecture

This week the 2nd CLIM-RUN School on Climate Services was held at the International Centre for Theoretical Physics (ICTP) in Trieste. CLIM-RUN is an FP7 project that aims to develop a protocol for applying new methodologies for the production of adequate climate information at regional to local scales, relevant to and usable by different sectors (energy, agriculture, etc.).

During this winter school I gave a two-hour lecture on the use of data mining and machine learning methods for energy & meteorology applications.

### Three things that changed the way I work

Since my first year of Ph.D. in 2007, a lot has changed in information technology and hardware/software. Some of these changes did nothing to my usual workflow, but others changed it in an impressive way. Here is my personal list:

1. Cloud data: when Dropbox appeared it was an impressive innovation, but nowadays cloud storage is so cheap and reliable that I don’t have to worry anymore about backups and did-I-forget-to-copy-that-directory moments. Today I have more than 25 GB across Dropbox, Google Drive and Ubuntu One, plus unlimited backup space on my CrashPlan account. Amazing.
2. Easy data visualization: thanks to cheap computing power, today I can use tools like R + ggplot2 that really help me visualize data beautifully and carry out easy data mining on massive data sets.
3. RStudio Server: this is the most recent innovation in my workflow. Using just my browser, I can run the powerful open-source R IDE RStudio on the SGI server in my lab (128 CPUs, 512 GB of RAM) from anywhere, without any lag in the user interface.

### DRIPping data: from climate data to climate information

Some say that we are moving into the era of DRIP (Data Rich, Information Poor). You can also call it Big Data, definitely a more appealing name, but in both cases you are talking about data, not information. What’s the difference? Merriam-Webster defines data as:

factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation

while information is defined as:

knowledge obtained from investigation, study, or instruction

So the former is just a fact, while the latter is knowledge, understanding. As the image above suggests, we obtain information from data or, better, we transform data into information.

In climate science there are a lot of data: sensor measurements, satellite data, data generated by physical models, etc. In my opinion, we can transform them into information in two ways: through the traditional experimental procedure (hypothesis <-> test) and by transforming data into something directly useful.

### EGU 2013: “Application of seasonal climate forecasts for electricity demand forecasting: a case study on Italy”

These are the slides of the talk I gave at EGU 2013 in Vienna, in session CL 5.8, “Climate Services – Underpinning Science”.

### EUPORIAS: Meeting the stakeholders

As I announced here, two new challenging FP7 projects on climate services have recently started. One of their goals is to improve communication between scientists and end-users on climate information, with specific attention to the most critical sectors, such as energy, tourism, and water & forest management.

Last week in Rome, at the ENEA main centre, the first EUPORIAS stakeholder meeting was held. We had the occasion to exchange ideas for two days with the stakeholders involved in the project (some of whom are also involved in SPECS). One of the main outcomes of this workshop was a list of the main climate information needs of end-users and the barriers they experience in using climate information in their decision-making processes. We tried to define WHAT information they need and WHEN they need it for their decisions, as well as the main issues related to spatial and temporal resolution. This is a first step towards a more effective use of climate information, and it seemed encouraging.