Tidyverse and NetCDFs: a first exploration
I am not exaggerating when I say that the main reason to develop in R is the presence of the tidyverse and the plethora of packages that let you to explore data and deal easily with the most common tasks (splitting, grouping, temporal aggregation, etc.). Unfortunately, the biggest limitation in R is its capability to manipulate effectively arrays of data (data cubes), especially nowadays when it is normal to deal with very large data sets (please, don’t say “big”). However, the R community is trying to reduce this gap and recently I have discovered the tidync package by Michael Summer. This package tries to make easier the access and manipulation of NetCDF files (the standard de facto of climate data sets) in a modern and clever way, a la tidyverse.
Recently I was analysing the E-RUN dataset of gridded river runoffs for my research on hydro-power modelling within the Copernicus contract ECEM. A simple step was the point correlation of daily run-offs to “see” the spatial patterns of similar runoff behaviours. Then, I took the change to create a simple code by using only three packages (well, I’m cheating a bit because one of them is a meta-package…).
The first one can be installed with
devtools (look at its Github page), the second needs no introduction and the last one can be found on CRAN and it’s a very hand tool to explore correlations in R.
Once downloaded the E-RUN NetCDF (~73 MB), you can open it with
tidync, subset a box, and convert into a tibble with a few lines:
runoff_data %>% hyper_filter( lat = lat >= 35 & lat < 60, lon = lon < 30 ) %>% hyper_tibble()
As you can see the use of the pipe operator makes the code streamlined and easy to read, moreover
tidync perform the subsetting operation in a lazy way, that’s in my opinion, its biggest feature because when working with very large datasets, often bigger than the RAM memory you have installed on your workstation, R does not load the entire array in-memory as a first step but it happens only when
hyper_slice is called. Now you have a fancy tibble with the gridded run-off data:
head(runoff_data) # A tibble: 6 x 4 runoff lon lat time 1 0.604 -5.75 35.2 349 2 0.869 -5.25 35.2 349 3 0.0808 -3.25 35.2 349 4 0.0457 -1.25 35.2 349 5 0.0455 -0.750 35.2 349 6 0.0512 -0.250 35.2 349
Then with a bit of tidyverse magic you can: 1) filter out all the points where run-off is NA; 2) “merge” the columns latitude (
lat) and longitude (
lon) into a single one named
position; 3) make the tibble wider with
spread; 4) remove the column time not needed for the correlation; 5) create the correlation matrix and 6) “stretch” it in a wide format.
cor_tbl % filter(!is.na(runoff)) %>% unite(pos, lat, lon) %>% spread(pos, runoff) %>% select(-time) %>% correlate %>% stretch()
Now we have a tibble where each row stores the correlation value between two points. Now I want to create a map showing all the correlation values with a specific point (in this case the area around Innsbruck in Austria). With the following lines I can select all the points of the correlation matrix related to our selected coordinates and then I revert the “merge” operation I did before with the
df_toplot % filter(x == "47.25_11.25") %>% select(y, r) %>% separate(y, c("lat", "lon"), sep = "_") %>% mutate( lat = as.numeric(lat), lon = as.numeric(lon) )
We are ready to generate the map now:
my_map <- ggplot(df_toplot, aes(x = lon, y = lat, fill = cut(r, breaks = seq(-1, 1, 0.2)))) + geom_tile() + geom_point(x = 11.25, y = 47.25, shape = 3, show.legend = FALSE) + scale_fill_brewer(palette = "RdBu", drop = FALSE) + borders() + coord_quickmap(xlim = c(-20, 30), ylim = c(35, 60)) + theme_light()
The map is not really elegant but however it is the final output of a very elegant, robust and streamlined workflow, I hope that in the future we will have more tools like this one, giving to climate scientists the possibility to manipulate data arrays easily and efficiently also in R.