Scalable Offshore Wind Analysis With Pangeo

Hi All,

In my usual scrolling of the Python-o-sphere I came across the project Scalable Offshore Wind Analysis With Pangeo by D. O’Callaghan and S. McBreen from Earth Observation Offshore (EOOffshore), School of Physics, University College Dublin (UCD).

I thought this may be of great interest to the group - it sure is to me.

The project uses a combination of xarray, Dask and Zarr to enable scalable wind data processing that generates power estimates for offshore renewable energy assessment for the Irish Continental Shelf region. This has involved the creation of a new wind data catalog using Intake and Zarr, which features up to 21 years of available data products from various providers.

Details of the current project outputs are available at https://eooffshore.github.io . This includes a collection of Jupyter notebooks describing:

  • Data retrieval and Zarr store creation
  • Area Of Interest (AOI) assessment - the Irish Continental Shelf region
  • A prototype interactive wind atlas, created using HoloViz libraries

I think the thing that interests me the most is the approach here. We have large quantities of data, and the approach, as far as I can see, uses Zarr, Intake, xarray and Dask to allow this data to be used directly - is this something we think can be used for even larger-scale data sets, without dumbing down to the statistics? @oriol I know when we started PYWRAM this is something you were particularly interested in exploring. @neildavis / @bjarketol NEWA is used here as an example (no. 4 below) - is Zarr something the DTU team has looked at?
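To make the xarray + Dask pattern above concrete, here is a minimal sketch using synthetic data as a stand-in for one of the catalog's Zarr stores. The variable names (`u10`, `v10`) and the simple power-density formula are my assumptions for illustration, not taken from the EOOffshore notebooks:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a wind dataset: hourly 10 m wind components on a small grid.
rng = np.random.default_rng(42)
ds = xr.Dataset(
    {
        "u10": (("time", "lat", "lon"), rng.normal(8, 2, (240, 4, 4))),
        "v10": (("time", "lat", "lon"), rng.normal(1, 2, (240, 4, 4))),
    },
    coords={
        "time": np.arange(240),
        "lat": np.linspace(51, 54, 4),
        "lon": np.linspace(-11, -8, 4),
    },
)

# Chunk with Dask so everything that follows is lazy and parallelisable.
ds = ds.chunk({"time": 48})

# Wind speed and a simple wind power density estimate (0.5 * rho * U^3).
speed = np.hypot(ds["u10"], ds["v10"])
rho = 1.225  # air density in kg/m^3, assumed constant here
power_density = 0.5 * rho * speed**3  # W/m^2

# Nothing has been computed yet; .compute() triggers the Dask task graph.
mean_pd = power_density.mean("time").compute()
print(mean_pd.shape)  # (4, 4)
```

The same code runs unchanged whether the input is this toy array or a multi-terabyte Zarr store opened with `xr.open_zarr` - only the chunk sizes and the scheduler change.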

The processed data sets in Zarr format for the five models can be found below:

  1. EOOffshore: ASCAT Wind Data for the Irish Continental Shelf Region | Zenodo
  2. EOOffshore: CCMP v0.2.1.NRT Wind Data for the Irish Continental Shelf Region | Zenodo
  3. EOOffshore: ERA5 Wind Data for the Irish Continental Shelf Region | Zenodo
  4. EOOffshore: New European Wind Atlas (NEWA) Data for the Irish Continental Shelf Region | Zenodo
  5. EOOffshore: Sentinel-1 Wind Data for the Irish Continental Shelf Region | Zenodo

Further details of the above data sets can be found here: Irish Continental Shelf Data Sets (eooffshore.github.io)

The original post by Derek is from Offshore Wind Analysis with Pangeo - Science - Pangeo.


Here is another example of a similar workflow (Building open source downscaling pipelines for the cloud – CarbonPlan): xarray + Dask + Zarr. In this case they are processing hundreds of terabytes.

From my own (much more modest) tests, I see that processing is done very few times but access happens a ton of times. Dask is trickier, and you have to test different approaches depending on what you are doing. Xarray can be inefficient for some post-processing, and you have to find the right code. You can read about a lot of issues and potential solutions on the Pangeo Discourse or in the pangeo/dask/xarray GitHub issues.

In the end, everything ends up in a Zarr (or NetCDF, or Parquet/Arrow, or… Cloud-optimized USGS Time Series Data - Water Data For The Nation Blog) store. Depending on the use cases, you have to think about how you will store these data sets. Nowadays, most data lakes save data as 2D lat-lon fields per time step. It is painful to get a time series out of that layout.

So, to me, the most interesting part is the storage part (chunking, compression, etc.), as it is the one most users will interact with.
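A minimal sketch of why the chunk layout matters so much for time-series access, using a small synthetic array (the dimension sizes and chunk shapes are arbitrary choices for illustration):

```python
import numpy as np
import xarray as xr

# One variable: 240 time steps on a 10x10 grid.
da = xr.DataArray(np.zeros((240, 10, 10)), dims=("time", "lat", "lon"))

# "Data lake" layout: one full 2D lat-lon field per time step.
map_chunked = da.chunk({"time": 1, "lat": 10, "lon": 10})
# Time-series-friendly layout: long in time, small in space.
ts_chunked = da.chunk({"time": 240, "lat": 2, "lon": 2})

# Extracting the full time series at a single grid point touches
# very different numbers of chunks in the two layouts:
point_map = map_chunked.isel(lat=0, lon=0)
point_ts = ts_chunked.isel(lat=0, lon=0)
print(point_map.data.npartitions)  # 240 chunk reads for one point
print(point_ts.data.npartitions)   # 1 chunk read for the same point
```

With object storage, each chunk is typically a separate request, so the first layout pays 240 round trips for a single-point series while the second pays one. The trade-off is that extracting a full 2D map for one time step is cheaper in the first layout.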


BTW, one very interesting approach to the access side is the one Patrick Zippenfenig (the person behind open-meteo: Free Open-Source Weather API | Open-Meteo.com) is taking.

His posts are gold, and he explains very interesting and smart approaches to compression and storage. Accessing ERA5 data from open-meteo is a breath of fresh air.

He uses Swift under the hood, and the API responses are in the “tens of ms or less” range. For larger requests, like a long-term ERA5 time series, it is also very quick.


Thanks, David, great find!

At DTU Wind we are now defaulting to Zarr for many datasets. For example, we have a 12 TB local copy of ERA5 data in Zarr. We use long chunks in time and smaller chunks in space to benefit time-series extraction, which is our most common use case. As @kikocorreoso mentioned, the main benefits for us are more flexible/suitable chunking, parallel decompression, and out-of-the-box support in Xarray.
