Tidyverse and NetCDFs: a first exploration

I am not exaggerating when I say that the main reason to develop in R is the tidyverse and the plethora of packages that let you explore data and deal easily with the most common tasks (splitting, grouping, temporal aggregation, etc.). Unfortunately, the biggest limitation of R is its capability to manipulate arrays of data (data cubes) effectively, especially nowadays when it is normal to deal with very large data sets (please, don’t say “big”). However, the R community is trying to close this gap, and recently I discovered the tidync package by Michael Sumner. This package tries to make the access and manipulation of NetCDF files (the de facto standard for climate data sets) easier, in a modern and clever way, à la tidyverse.

Recently I was analysing the E-RUN dataset of gridded river runoff for my research on hydro-power modelling within the Copernicus contract ECEM. A simple step was the point correlation of daily runoff to “see” the spatial patterns of similar runoff behaviours. So I took the chance to create a simple piece of code using only three packages (well, I’m cheating a bit because one of them is a meta-package…).

library(tidync)
library(tidyverse)
library(corrr)

The first one can be installed with devtools (look at its GitHub page), the second needs no introduction, and the last one can be found on CRAN: it’s a very handy tool to explore correlations in R.
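For reference, this is a minimal installation sketch (I am assuming the GitHub repository is hypertidy/tidync, so check the package page to be sure; the other two come straight from CRAN):

install.packages(c("tidyverse", "corrr"))
# repository name assumed, see the tidync GitHub page
devtools::install_github("hypertidy/tidync")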

Once you have downloaded the E-RUN NetCDF (~73 MB), you can open it with tidync, subset a box, and convert it into a tibble with a few lines:

# the file name below is just a placeholder for your local copy of the E-RUN NetCDF
runoff_data <- tidync("path/to/erun_runoff.nc") %>%
  hyper_filter(
    lat = lat >= 35 & lat < 60,
    lon = lon < 30
  ) %>%
  hyper_tibble()

As you can see, the use of the pipe operator makes the code streamlined and easy to read. Moreover, tidync performs the subsetting operation in a lazy way, which is, in my opinion, its biggest feature: when working with very large datasets, often bigger than the RAM installed on your workstation, R does not load the entire array into memory as a first step; the data are read only when hyper_tibble or hyper_slice is called. Now you have a fancy tibble with the gridded runoff data:

head(runoff_data)
# A tibble: 6 x 4
  runoff    lon   lat  time
   <dbl>  <dbl> <dbl> <dbl>
1 0.604  -5.75   35.2   349
2 0.869  -5.25   35.2   349
3 0.0808 -3.25   35.2   349
4 0.0457 -1.25   35.2   349
5 0.0455 -0.750  35.2   349
6 0.0512 -0.250  35.2   349
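
By the way, the laziness is easy to appreciate if you split the pipeline in two steps: opening the file and filtering only define the subset, while the actual read into memory happens at the hyper_tibble call. A minimal sketch, using the same placeholder file name as above:

src <- tidync("path/to/erun_runoff.nc")
print(src)   # shows grids, dimensions and variables, no data loaded yet

subset_def <- src %>%
  hyper_filter(lat = lat >= 35 & lat < 60, lon = lon < 30)

runoff_data <- hyper_tibble(subset_def)   # only here the values are read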

Then with a bit of tidyverse magic you can: 1) filter out all the points where runoff is NA; 2) “merge” the columns latitude (lat) and longitude (lon) into a single one named pos; 3) make the tibble wider with spread; 4) remove the column time, not needed for the correlation; 5) create the correlation matrix; and 6) “stretch” it into a long format.

cor_tbl <- runoff_data %>%
  filter(!is.na(runoff)) %>%
  unite(pos, lat, lon) %>%
  spread(pos, runoff) %>%
  select(-time) %>%
  correlate() %>%
  stretch()

Now we have a tibble where each row stores the correlation value between two points. I want to create a map showing all the correlation values with respect to a specific point (in this case the area around Innsbruck in Austria). With the following lines I select all the points of the correlation matrix related to the chosen coordinates and then revert the “merge” operation I did before with the separate function.

df_toplot <- cor_tbl %>%
  filter(x == "47.25_11.25") %>%
  select(y, r) %>%
  separate(y, c("lat", "lon"), sep = "_") %>%
  mutate(
    lat = as.numeric(lat),
    lon = as.numeric(lon)
  )
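
As an aside, corrr also offers focus(), which extracts the correlations with one or more chosen columns directly from the correlation data frame, before any stretch(). A hedged sketch, reusing the pipeline above up to correlate():

innsbruck_cor <- runoff_data %>%
  filter(!is.na(runoff)) %>%
  unite(pos, lat, lon) %>%
  spread(pos, runoff) %>%
  select(-time) %>%
  correlate() %>%
  focus(`47.25_11.25`)   # keep only the correlations with the Innsbruck cell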

We are ready to generate the map now:

my_map <- ggplot(df_toplot, aes(x = lon, 
                                y = lat, 
                                fill = cut(r, breaks = seq(-1, 1, 0.2)))) + 
  geom_tile() +
  geom_point(x = 11.25, y = 47.25, shape = 3, show.legend = FALSE) +
  scale_fill_brewer(palette = "RdBu", drop = FALSE) + 
  borders() + 
  coord_quickmap(xlim = c(-20, 30), 
                  ylim = c(35, 60)) + 
  theme_light()
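
To render the figure in an interactive session it is enough to print the object; if you want to keep it, ggsave does the job (the file name and size below are just an example):

print(my_map)
ggsave("runoff_correlation_innsbruck.png", my_map, width = 8, height = 5)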

The map is not really elegant, but it is the final output of a very elegant, robust, and streamlined workflow. I hope that in the future we will have more tools like this one, giving climate scientists the possibility to manipulate data arrays easily and efficiently in R as well.
