I recently shared a Python notebook via Google Colab (and GitHub). The code implements an entire climate-science workflow, from data retrieval (seasonal forecasts and reanalysis) to the calculation of performance metrics (deterministic and probabilistic). I remember when, years ago, I started working on seasonal forecasts: the same workflow was much longer and more complicated. I retrieved data from the ECMWF ECGATE cluster (based on AIX), post-processed it with GRIB tools and CDO, and then analysed it with MATLAB, using metrics developed by me and my colleagues. Today the same workflow can be implemented 1) using open-source tools and 2) in a reproducible way, with minimal effort.
It is fair to say that today we have a Python-based ecosystem for climate data analysis, which includes:
- xarray for multi-dimensional data manipulation
- dask for out-of-core computation (look here for an example)
- cfgrib to access GRIB data with xarray
- numpy for scientific computing
- cdsapi to retrieve data from the Copernicus Climate Data Store
- eofs for EOF analysis
- xskillscore for forecast verification scores
- matplotlib and cartopy for mapping and data visualisation
This list is not exhaustive and is based on my personal experience. If you think something is missing, feel free to send me a message or a Tweet. You can also take a look at this page for an extended discussion of a Python stack for Atmospheric and Ocean Science, and at Pangeo for a Python ecosystem for Big Data in the geosciences.
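As a tiny illustration of how a couple of these pieces fit together, here is a sketch that verifies a forecast against "observations" with xarray. The data are synthetic stand-ins (the array shapes and the 0.5 noise level are invented for the example; in a real workflow the arrays would come from cdsapi and cfgrib), and the hand-rolled `pearson_r` mirrors the gridpoint-wise correlation that xskillscore provides as `xs.pearson_r`:

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for a seasonal forecast and a reanalysis "truth".
rng = np.random.default_rng(0)
dims = ("time", "lat", "lon")
obs = xr.DataArray(rng.standard_normal((30, 4, 5)), dims=dims)
fcst = obs + 0.5 * xr.DataArray(rng.standard_normal((30, 4, 5)), dims=dims)

def pearson_r(a, b, dim="time"):
    """Gridpoint-wise Pearson correlation along `dim`,
    reimplemented here for illustration (xskillscore ships
    this as xs.pearson_r)."""
    a_anom = a - a.mean(dim)
    b_anom = b - b.mean(dim)
    cov = (a_anom * b_anom).mean(dim)
    return cov / (a.std(dim) * b.std(dim))

# One correlation value per grid point, dims ("lat", "lon").
r = pearson_r(fcst, obs)
print(float(r.mean()))  # high by construction, since fcst = obs + noise
```

The nice thing is that the same few lines work unchanged on a lazily-loaded, dask-backed dataset: the reduction over `time` is expressed once and xarray/dask take care of chunking and out-of-core execution.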
And what about reproducibility? Part of it derives from the openness of the software (and of the data, as with the Copernicus Climate Data Store), but it is also enabled by amazing tools like Jupyter and services like GitHub/GitLab, Binder and Google Colab. Reproducibility is a fundamental topic in science, and building reproducible workflows should be a priority if we want a more transparent and fairer, in other words a better, science.