The Copernicus toolbox and the role of software in climate services: why use Python

It is fair to say that the Copernicus Climate Change Service (C3S) is shaping the field of climate services. I could have said “climate science” instead of “climate services”, but I want to focus here on the applied side of climate science.

The Copernicus Climate Change Service (C3S) and the CDS

The best thing about C3S is that it is trying to foster an ecosystem of data services, and, not surprisingly, software (design, development, architecture) plays a critical role here. It has recently released two important tools:

  • The Climate Data Store (CDS) catalogue: a catalogue of climate-related datasets, ranging from glacier extent to seasonal climate forecasts. The CDS also provides a Python-based API for accessing the data.
  • The CDS Toolbox: a sandboxed cloud platform for developing applications based on the CDS catalogue and, later, on all the data that will be generated by the C3S sectoral applications (the so-called Sectoral Information Systems).
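
To give an idea of what the Python-based API looks like in practice, here is a minimal sketch built on the cdsapi package. The dataset name, the request keys and the helper names (build_request, download) are illustrative assumptions of mine, not something taken from the C3S documentation: the exact request for a given dataset should be copied from its download form in the catalogue, and the download step needs a CDS account with a configured API key.

```python
def build_request(year, months, variable="2m_temperature"):
    """Assemble a request dictionary (keys are illustrative, check the dataset form)."""
    return {
        "product_type": "monthly_averaged_reanalysis",
        "variable": variable,
        "year": str(year),
        "month": [f"{m:02d}" for m in months],  # CDS forms use zero-padded strings
        "time": "00:00",
        "format": "netcdf",
    }


def download(dataset, request, target):
    """Send the request to the CDS and save the result to `target`."""
    import cdsapi  # imported lazily so build_request works without the package

    client = cdsapi.Client()  # reads the URL and key from the user's configuration
    client.retrieve(dataset, request, target)
    return target


if __name__ == "__main__":
    request = build_request(2015, [1, 2, 3])
    # Hypothetical choice of dataset, used here only as an example.
    download("reanalysis-era5-single-levels-monthly-means", request, "t2m_2015.nc")
```

Keeping the request as a plain dictionary makes it easy to store it next to the downloaded file, so the download can be reproduced later.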

My goal here is not to give a complete description of these tools. If you are interested, you can look directly at the C3S website and its documentation, read the description in the ECMWF Spring 2017 Newsletter, or search the web for other presentations.

I think the creation of the CDS is unprecedented: it is not just a web portal (normally a mere gateway to a set of data) but a proper ecosystem that uses Python (by Python I mean Python 3) and its impressive xarray library as glue. You may not be impressed by the creation of a software ecosystem, but you should know that all the data produced by the C3S Sectoral Information Systems (operational services that aim to provide ad hoc information for sectors such as energy, agriculture and health) should be included in the CDS. Linking an operational activity to an ecosystem like the CDS therefore means that you have to provide (i.e. develop) an interface between the CDS framework and your application, and this interface should be written in Python. Hence the question you hear from a lot of researchers: why Python and not R or Fortran or put-your-favourite-coding-language-here?

Why Python?

In the earth sciences, as in many other scientific communities, a plethora of development environments is in use: C, Fortran, R, Python and MATLAB, to name the most important ones. Each is used for its own particular advantages and features. So again: why use only Python? I am an R user and have been using it for climate data analysis for years; most of my code uses the tidyverse and the climate4R packages (probably among the most underrated packages for earth sciences). However, I think that Python is better suited to climate data analysis, for several reasons:

  1. Python is a modern and well-designed language
  2. Xarray provides a powerful way to deal with array data and metadata, it supports the CF conventions (http://cfconventions.org/), and it also uses…
  3. …dask. Dask provides out-of-the-box parallelism and the possibility to work seamlessly with larger-than-memory datasets
  4. Python performs very well on linear algebra and matrix operations, especially with the Anaconda distribution, which provides numpy/scipy packages built with Intel MKL support (a very fast math library)
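
As a small illustration of points 2 and 3, the sketch below builds a synthetic monthly temperature cube carrying CF-style metadata and removes the seasonal cycle with xarray's groupby. The data are random numbers and the function names are mine, so take it as an example of the style of code xarray allows rather than as a real analysis.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_synthetic_temperature(n_years=3, seed=0):
    """Build a small monthly temperature cube with CF-style attributes (synthetic data)."""
    rng = np.random.default_rng(seed)
    time = pd.date_range("2000-01-01", periods=12 * n_years, freq="MS")
    lat = np.linspace(35.0, 45.0, 5)
    lon = np.linspace(5.0, 15.0, 6)
    # Seasonal cycle plus random noise, in kelvin
    seasonal = 10.0 * np.sin(2 * np.pi * (time.month.values - 1) / 12.0)
    noise = rng.normal(0.0, 1.0, (time.size, lat.size, lon.size))
    data = 285.0 + seasonal[:, None, None] + noise
    return xr.DataArray(
        data,
        dims=("time", "lat", "lon"),
        coords={"time": time, "lat": lat, "lon": lon},
        name="t2m",
        attrs={"standard_name": "air_temperature", "units": "K"},
    )


def monthly_anomalies(da):
    """Subtract from each time step the mean of its calendar month."""
    climatology = da.groupby("time.month").mean("time")
    return da.groupby("time.month") - climatology


if __name__ == "__main__":
    temperature = make_synthetic_temperature()
    anomalies = monthly_anomalies(temperature)
    print(anomalies.dims, anomalies.shape)
```

If dask is installed, calling temperature.chunk({"time": 12}) before monthly_anomalies makes the same code evaluate lazily, chunk by chunk, which is how xarray handles larger-than-memory datasets.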

Let’s stop here. Are you a scientist who finds this discussion too technical and, honestly, not very interesting? That is the problem. You shouldn’t have to care much about which language to use, just as you shouldn’t spend time developing software interfaces. It’s time to convince ourselves (and funding agencies) that we need software developers in our research teams, because designing and developing software is a complex and serious matter. Over the last decade I have seen the involvement of designers, visualisers, communicators and social scientists in research activities become more and more common, because communicating and presenting data and information is complex and very important. Now it’s time to do the same with software developers and software architects.

Normally, the final output of a scientific workflow is information, commonly presented as a dataset or a set of papers, reports, documents, infographics, etc. In this process software has been just a by-product. But when research is focused on services, software is no longer a side effect: it is part of the final output and, frankly speaking, it should be considered the most important product of a research workflow, because (more than papers, more than datasets) it can really encourage and promote further research and deeper exploration.
