openavalancheproject


- Name: openavalancheproject
- Version: 0.0.4
- Home page: https://github.com/scottcha/openavalancheproject/tree/master/
- Summary: Data Pipeline Processing for Open Avalanche Project from global GFS to ML Data
- Upload time: 2023-10-24 16:47:14
- Author: Scott Chamberlin
- Requires Python: >=3.10
- License: MIT License
- Keywords: openavalancheproject

# Open Avalanche Project



An open source project to bring data and ML to avalanche forecasting.


Webpage is https://openavalancheproject.org
Docs are located at https://scottcha.github.io/OpenAvalancheProject/


Directories are organized as follows:
- Data

    Contains files associated with data inputs, such as geojson definitions of avalanche regions.  Training and label data are linked in the README there as they are too large to host in git.

- DataPipelineNotebooks

    The data prep code which is used to generate the OpenAvalanche pip package.

- Environments

    Conda environment yml files

- Get Training Data

    Files used to pull training data from various web sources

- ML

    Notebooks representing the current state of the art for danger prediction for this project

- TestData

    Sample test data supporting the automated tests

- WebApp

    The bulk of the operational code for the OAP website

    - OpenAvalancheProjectWebApp contains the code for the website

- docs

    Documentation generated with nbdev based on the notebooks in DataPipelineNotebooks

- openavalancheproject.egg-info

    Supporting pip package files generated with nbdev

- openavalancheproject

    Code generated with nbdev from DataPipelineNotebooks for the pip package 

## Tutorial 
### 1. Getting new input data

This aspect of the tutorial will cover how you can obtain new weather input data for a new date range or region.  This part assumes you have avalanche forecast labels for the dates and region (OAP currently has historical forecast labels for three avalanche centers in the US from the 15-16 season through the 20-21 season and is working on expanding that).

Due to the large size of the input GFS data and the fact that it's already hosted by NCAR, OAP isn't currently providing copies of this data.  If you want to start a data processing pipeline from the original data you can start with the process described here.  If you aren't interested in the data processing steps and only in the ML steps you can download the labels here: https://oapstorageprod.blob.core.windows.net/oap-training-data/Data/CleanedForecastsNWAC_CAIC_UAC_CAC.V1.2013-2021.zip and a subset of training data here: [TODO: replace with current link] and skip to the fourth notebook 4.TimeseriesAi

The input data is derived from the 0.25 degree GFS model hosted by NCAR at this site: https://rda.ucar.edu/datasets/ds084.1/

You'll need to create an account.  Once you are logged in, visit the above link and click on the Data Access tab.  One note: I've found that Chromium-based browsers don't work well on this site, so I recommend you use Firefox at least for downloading the data.

Due to the size of the files we are downloading, I recommend downloading only one season and one regional subset at a time.  In this example I'm going to download the data for Colorado.  

![NCAR Get Data](DataPipelineNotebooks/images/NCAR_GetData.png)
(note: the web UI has changed a bit since these screenshots were taken but the navigation remains the same)

Click on the "Get a Subset" link.

The next page allows us to select both the dates and parameters we are interested in.  Currently we read all parameters so please check all parameters.  For dates choose one winter season.  In the screenshot below I've selected dates Nov 1, 2015 through April 30, 2016 for the 15-16 season.  The models assume the season starts Nov 1 and ends April 30 (it wouldn't be too difficult to update the data pipeline for a southern hemisphere winter but it's not something that has been done yet).

![NCAR Date Selection](DataPipelineNotebooks/images/NCAR_DateSelection.png)

Click Continue and wait for it to validate your selections. 

The next page allows you to further subset your data.  There are a few important things here.  

1. Verify that the dates are correct.
2. We want the output as grib (check "same as input").
3. Download all available vertical levels.
4. Select only the 3-24 hour forecasts and accumulation in the gridded products, as currently OAP doesn't use more than this.
5. You can also then select the bounding box for the area you want to download in the "spatial selection". Once you have a bounding box you like, write down the lat/lon values so it's easier to input them when we come back for other date ranges.

![NCAR Subset Selection](DataPipelineNotebooks/images/NCAR_Subset2.png)

Once the selections are correct you can click through to submit your request.  You should get a confirmation page of your selections and the system will start to retrieve your data.  This usually takes a few hours and you will get an email when it's ready for download.  At this point if you want additional date/time ranges you can submit the requests and they will get queued and made available for download when they are ready.  In this example the downloaded files were 1.1 GB.

Extract and decompress all the files until you have one grib file per forecast, and ensure all the files have been moved into a single directory (per season per location). If you are using Linux this Ask Ubuntu post may help: https://askubuntu.com/questions/146634/shell-script-to-move-all-files-from-subfolders-to-parent-folder.
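If you would rather do the flattening step from Python than from a shell script, a minimal sketch could look like the following. The function name `flatten_directory` is hypothetical and not part of the openavalancheproject package; it only uses the standard library and skips files whose names would clash rather than overwriting them.

```python
import shutil
from pathlib import Path


def flatten_directory(root: Path) -> int:
    """Move every file found in subfolders of `root` up into `root` itself.

    Returns the number of files moved. A file whose name already exists
    in `root` is left where it is rather than overwriting data.
    """
    moved = 0
    # Build the full listing first so moving files does not disturb the walk.
    nested_files = [p for p in root.rglob("*") if p.is_file() and p.parent != root]
    for path in nested_files:
        target = root / path.name
        if target.exists():
            continue  # name clash: leave the nested file in place
        shutil.move(str(path), str(target))
        moved += 1
    return moved
```

You would call it on the per season, per location folder, for example `flatten_directory(Path("15-16/Colorado"))`, and then remove the emptied subfolders by hand.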

Once you have all the files as grib files in a single directory for that date and location (e.g., 15-16/Colorado/) there are a couple of final cleaning steps.  Due to the download process, files earlier than 11/1 are sometimes included.  You can just delete those files (the file date is 201510*).
    
_It's worth a brief interlude to understand how these files are named.  Here is a typical file name: gfs.0p25.2015110100.f003.grib2.  Let's break that down.  gfs is the model we are using.  0p25 I believe is the resolution at 0.25 degrees, but I haven't seen this documented.  2015110100 is the encoded date and hour of the model run.  You will see in your dataset that there are four model runs per day: 00, 06, 12, 18.  Currently we are only using the 00 run (the first of the day).  The next component, f003, is the forecast for 3 hours after the model run time.  grib2 is the input file format.  Any trailing component such as chamberlin455705 is the encoded download request._
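The naming scheme described above can be sketched as a small parser. This is only an illustration built from the example file name; the names `GfsFileName` and `parse_gfs_name` are hypothetical, and the pattern assumes the resolution always has the `0p25` shape and that any download-request suffix comes after `.grib2`.

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class GfsFileName(NamedTuple):
    model: str            # e.g. "gfs"
    resolution: str       # e.g. "0p25"
    run_time: datetime    # date and hour of the model run
    forecast_hour: int    # hours after the run time
    valid_time: datetime  # run_time + forecast_hour


# No end-of-string anchor, so a trailing download-request suffix is tolerated.
_PATTERN = re.compile(
    r"^(?P<model>[a-z]+)\.(?P<res>\dp\d+)\.(?P<run>\d{10})\.f(?P<fh>\d{3})\.grib2"
)


def parse_gfs_name(name: str) -> Optional[GfsFileName]:
    """Split a GFS file name into its parts, or return None if it does not match."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    run_time = datetime.strptime(m["run"], "%Y%m%d%H")
    hours = int(m["fh"])
    return GfsFileName(m["model"], m["res"], run_time, hours,
                       run_time + timedelta(hours=hours))
```

For `gfs.0p25.2015110100.f003.grib2` this yields a run time of 2015-11-01 00:00 and a forecast valid at 03:00 the same day.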

Next delete all files which have a model run hour other than 00 (i.e., 06, 12, 18).  Check that you have 1456 files at this point (8 files per day for 182 days; the data processing should be resilient to the occasional missing file, which does happen in these datasets).  The total size of the input files at this point is ~270MB.
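The two cleaning rules and the expected file count can be checked with a short script. This is a sketch under the naming assumptions described earlier; `expected_file_count` and `files_to_delete` are hypothetical helpers, and the second one only reports candidates so that you can review the list before deleting anything.

```python
import re
from datetime import date

# Matches the ".YYYYMMDDHH.fNNN." part of a GFS file name.
RUN_STAMP = re.compile(r"\.(\d{8})(\d{2})\.f\d{3}\.")


def expected_file_count(season_start: date, season_end: date,
                        forecasts_per_day: int = 8) -> int:
    """Number of files for one model run per day, both end dates included."""
    days = (season_end - season_start).days + 1
    return days * forecasts_per_day


def files_to_delete(names, season_start: date, run_hour: str = "00"):
    """Return the names dated before the season start or from another model run."""
    unwanted = []
    for name in names:
        m = RUN_STAMP.search(name)
        if m is None:
            continue  # not a GFS forecast file; leave it alone
        stamp = m[1]
        day = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
        if day < season_start or m[2] != run_hour:
            unwanted.append(name)
    return unwanted


# Nov 1, 2015 through April 30, 2016 is 182 days, so 8 * 182 files.
print(expected_file_count(date(2015, 11, 1), date(2016, 4, 30)))  # 1456
```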

![File List Example](DataPipelineNotebooks/images/files_example2.png)

You will have to repeat the download process for the accumulation variables and place them in a parallel folder (see next; it is possible to add the accumulation variables to the single download but they need to be preprocessed differently so I prefer to keep them separate).

The final step is to ensure the input data is in the correct folder structure.  All data for this project will sit off a path you define as the base path.  The GFS input data then needs to be in subfolders of that path delineated by season and state (or country).
For example, if our base path in this example is:

    /media/scottcha/E1/Data/OAPMLData/

Then place this data in 

    /media/scottcha/E1/Data/OAPMLData/1.RawWeatherData/15-16/Colorado/
    or 
    /media/scottcha/E1/Data/OAPMLData/1.RawWeatherData/15-16/ColoradoAccumulationGrib/
    [for the accumulation files]
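The layout above can be expressed as a small path helper. This is only a sketch that mirrors the two example paths; `raw_weather_dir` is a hypothetical name, and `PurePosixPath` is used so the result looks the same on any operating system.

```python
from pathlib import PurePosixPath


def raw_weather_dir(base_path: str, season: str, state: str,
                    accumulation: bool = False) -> PurePosixPath:
    """Build the raw GFS folder for one season and state (or country)."""
    # Accumulation files live in a parallel "<state>AccumulationGrib" folder.
    folder = f"{state}AccumulationGrib" if accumulation else state
    return PurePosixPath(base_path) / "1.RawWeatherData" / season / folder


base = "/media/scottcha/E1/Data/OAPMLData/"
print(raw_weather_dir(base, "15-16", "Colorado"))
print(raw_weather_dir(base, "15-16", "Colorado", accumulation=True))
```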

Notes:

* There is an option to convert the file to NetCDF in the NCAR/UCAR UI.  Don't use this as it will result in a .nc file which isn't in the same format as the one we are going to use.

### 2. Transforming and Filtering the Data

Now that we have the input file set we can start to go through the initial data pipeline steps to transform and filter the data. Today this is done in a series of Jupyter notebooks.  This format makes it easy to incrementally process and check the outputs while the project is in a development phase (once we have a model which seems to have a reasonable output these steps will be encoded in a set of Python modules and implemented as a processing pipeline).

Assuming you have Anaconda and Jupyter installed, first change directory to the Environments directory at the root level of the repo.  This contains two conda environment definitions, one for the processing steps, oap_ml_datapipeline.yml, and one for the deep learning step, oap_tsai.yml.

    conda env create -f oap_datapipeline.yml

Once the environment has been created you can activate it with

    conda activate oap_datapipeline 

There is one step we need to take before going through the notebooks and that is converting the grib2 files to NetCDF.  We do this for a couple of reasons but primarily that using this tool efficiently collapses the vertical dimensions (called level) in to the variable definitions so we can more easily get it to the ML format we need.  The utility to do this is called wgrib2 and should have been installed in the oap_ml_datapipeline environment.

* While most of the data processing steps work equally well in Windows or Linux, I've found that wgrib2 is much easier to install in a Linux environment, so I tend to use Linux for at least the following step.

Using a terminal prompt change directory to the folder where you downloaded and unpacked the weather model files.  

    /media/scottcha/E1/Data/OAPMLData/1.RawWeatherData/15-16/Colorado/

In that directory you can execute this command to iterate through all the files and transform them:

    for i in *.grib2; do wgrib2 "$i" -netcdf "$i.nc"; done

_There are ways of improving the efficiency by doing this in parallel so feel free to improve on this._
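One way to parallelize the conversion is to launch several wgrib2 processes at once from Python. The sketch below is not part of the package: `wgrib2_command` and `convert_all` are hypothetical names, and it assumes `wgrib2` is on your PATH. It builds the same `wgrib2 <file> -netcdf <file>.nc` command as the shell loop above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def wgrib2_command(grib_path: Path) -> list:
    """Build the same command as the shell loop: wgrib2 <file> -netcdf <file>.nc"""
    return ["wgrib2", str(grib_path), "-netcdf", f"{grib_path}.nc"]


def convert_all(directory: Path, max_workers: int = 4, run=subprocess.run) -> int:
    """Convert every *.grib2 file in `directory`; returns how many were processed.

    `run` can be swapped out (for example in a dry run) but defaults to
    subprocess.run, which starts the real wgrib2 process.
    """
    grib_files = sorted(directory.glob("*.grib2"))

    def convert(path: Path) -> None:
        # check=True raises CalledProcessError if wgrib2 exits with an error.
        run(wgrib2_command(path), check=True)

    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(convert, grib_files))  # list() surfaces any exception
    return len(grib_files)
```

Calling `convert_all(Path("."), max_workers=8)` from the season folder would then replace the shell loop; pick a worker count that suits your CPU and disk.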

To start a new notebook launch jupyter

    jupyter notebook

### 3. ParseGFS 
#### Parsing and filtering the input files

Completing these next few steps basically takes the raw input weather data and leaves us with data slightly transformed but filtered to only the coordinates in the avalanche regions for that location.  For example, here is what a regional view of one of the parameters (U component of wind vector) looks like when both interpolated 4x and viewed across the entire Washington region:

![Washington Wind Component](DataPipelineNotebooks/images/Wind_Example.png)

We've used this geojson definition of the avalanche regions to subset that view in to much smaller views focused on the avalanche forecast regions.  Here are all the US regions.

![US Avalanche Regions](DataPipelineNotebooks/images/US_Avy_Regions.png)

And then this is what it looks like when filtered to only the Olympics avalanche region (the small one in the top left of the US regions):

![Olympics Wind Component](DataPipelineNotebooks/images/Wind_Region_Example.png)



#### Files on disk structure

Training labels can be downloaded here:
https://oapstorageprod.blob.core.windows.net/oap-training-data/Data/CleanedForecastsNWAC_CAIC_UAC_CAC.V1.2013-2021.zip 

    1.RawWeatherData/
        gfs/
            <season>/
                <state or country>/
    2.GFSFiltered(x)Interpolation/
    3.GFSFiltered(x)InterpolationZarr/
    4.MLData

#### These parameters need to be set

```python
#test_ignore
season = '15-16'
state = 'Colorado'

interpolate = 1 #interpolation factor: whether we want to augment the data through lat/lon interpolation; 1 is no interpolation, 4 is 4x interpolation

data_root = '/media/scottcha/E1/Data/OAPMLData/'

n_jobs = 24 #number of parallel processes to run 
```

This code block will iterate through each season and region and produce the files for the next stage of the data pipeline.
```python
seasons = [season]  # seasons to process; here only the single season set in the parameters above
for s in seasons:
    print("On season: " + s)
    pgfs = ParseGFS(s, state, data_root, resample_length='3H')
    results = pgfs.resample_local(jobs=n_jobs)
```

### 4. ConvertToZarr
#### Reformat data in to efficient Zarr format
The next step in our data transformation pipeline is to transform the NetCDF files to Zarr files which are indexed in such a way as to make access to specific dates and lat/lon pairs as efficient as possible. This process can be run entirely end to end once you are sure the parameters are set correctly.  It takes about 2 hours for one season for all regions in Colorado on my workstation (12 core 3900X with data on a Gen3 NVMe drive) using all cores.  The important point about this notebook is that we are essentially indexing the data so it can be accessed efficiently when we create our ML datasets. 


```python
#test_ignore
ctz = ConvertToZarr(seasons, regions, data_root)
```

```python
#test_ignore
ctz.convert_local()
```


### 5. PrepMLData
#### Converting the data in to a memmapped numpy timeseries (samples, feature, timestep)
This step needs to be run once to create a dataset to be used in a subsequent ML step.  The way to think about these methods is that we use the set of valid labels + the valid lat/lon pairs as an index into the data.  It's important to understand that the regions are geographically large and usually cover many lat/lon pairs in our gridded dataset while the labels apply to an entire region (multiple lat/lon pairs).  For example the _WA Cascades East, Central_ region covers 24 lat/lon pairs, so if on Jan 1 there was a label we wanted to predict, our dataset would have 24 lat/lon pairs in that region associated with that label.  There are pros and cons to this approach.

Pros:
1. Reasonable data augmentation approach
2. Aligns with how we ultimately want to provide predictions: more granular, not restricted to established regions

Cons:
1. Could be contributing to overfitting
2. The data becomes very large

That being said, the methods will calculate this index for every label/lat/lon point and then we'll split it into train and test sets.  It's important to ensure that the train/test split is done in time (i.e., I usually use 15-16 through 18-19 as the training set and then 19-20 as the test set); if you don't, there will be data leakage.  
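The indexing idea and the time-based split can be sketched with pandas on toy data. This is an illustration only, not the package's implementation: the rows below are made up (including the Olympics coordinates), and only the column names are borrowed from the label data shown later in this tutorial.

```python
import pandas as pd

# Hypothetical toy inputs: one label per region per day ...
labels = pd.DataFrame({
    "UnifiedRegion": ["Olympics", "Olympics", "Mt Hood"],
    "parsed_date": pd.to_datetime(["2019-01-01", "2019-12-15", "2019-01-01"]),
    "Day1DangerAboveTreeline": ["Moderate", "High", "Low"],
})
# ... and the grid points that fall inside each region.
grid_points = pd.DataFrame({
    "UnifiedRegion": ["Olympics", "Olympics", "Mt Hood"],
    "latitude": [47.75, 47.75, 45.25],
    "longitude": [-123.5, -123.25, -121.75],
})

# One row per (label, lat/lon pair): a region-level label is repeated
# for every grid point inside that region.
label_index = labels.merge(grid_points, on="UnifiedRegion", how="inner")

# Split in time, not at random, so no test date can leak into training.
cutoff = pd.Timestamp("2019-11-01")
train = label_index[label_index["parsed_date"] < cutoff]
test = label_index[label_index["parsed_date"] >= cutoff]

print(len(label_index), len(train), len(test))  # 5 3 2
```

A random row-level split would put neighbouring grid points from the same day and region on both sides of the split, which is the leakage the cutoff date avoids.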

Once the train/test split is done on the labels there is a process to build up the dataset.  This is still a slow process even when doing it in parallel and against the indexed Zarr data.  I've spent a lot of time trying various ways of optimizing this but I'm sure this could use more work.  The primary internal method for doing this is called _get_xr_batch_ and takes several parameters:

1. labels: list of the train or test set labels
2. lookback_days: the number of previous days to include in the dataset.  For example if the label is for Jan 1, then a lookback_days of 14 will also include the previous 14 days.  I've typically been using 180 days as the lookback (if a lookback extends prior to Nov 1 then we just fill in NaN as the data is likely irrelevant) but it's possible that a lower value might give better results.
3. batch_size: the size of the batch you want returned
4. y_column: the label you want to use
5. label_values: the possible values of the label from y_column.  We include this as the method can implement oversampling to adjust for the imbalanced data.
6. oversample: dict which indicates which labels should be oversampled.  
7. random_state: random seed initializer
8. n_jobs: number of processes to use
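To make the lookback_days behaviour concrete, here is a minimal sketch for a single variable stored as a daily pandas Series. It is not the _get_xr_batch_ implementation; `lookback_window` and its arguments are hypothetical, and it only illustrates "label day plus the previous N days, with NaN before the season start".

```python
import numpy as np
import pandas as pd


def lookback_window(daily: pd.Series, label_date: str, lookback_days: int,
                    season_start: str) -> np.ndarray:
    """Values for label_date and the `lookback_days` days before it.

    Days earlier than `season_start`, or missing from `daily`, are NaN.
    """
    end = pd.Timestamp(label_date)
    days = pd.date_range(end - pd.Timedelta(days=lookback_days), end, freq="D")
    window = daily.reindex(days)                              # missing days become NaN
    window = window.where(days >= pd.Timestamp(season_start))  # blank out pre-season days
    return window.to_numpy(dtype=float)
```

With a label on 2015-11-05, a 7 day lookback and a Nov 1 season start, the result has 8 entries and the first 3 (Oct 29 to Oct 31) are NaN.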

In the tutorial the notebook produces one train batch of 1,000 rows and one test batch of 500 rows and then concatenates them into a single memmapped file.
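The idea of concatenating batches into one memmapped file can be sketched with NumPy alone. This is an illustration rather than the package's code: `concat_to_memmap` is a hypothetical helper, and it assumes every batch is an array shaped (samples, features, timesteps) with the same dtype.

```python
import numpy as np
from numpy.lib.format import open_memmap


def concat_to_memmap(batches, out_path):
    """Write batches shaped (samples, features, timesteps) into one .npy memmap."""
    total_rows = sum(b.shape[0] for b in batches)
    out = open_memmap(out_path, mode="w+", dtype=batches[0].dtype,
                      shape=(total_rows,) + batches[0].shape[1:])
    row = 0
    for batch in batches:
        # Each batch is copied straight into the disk-backed array, so only
        # one batch needs to be held in memory at a time.
        out[row:row + batch.shape[0]] = batch
        row += batch.shape[0]
    out.flush()
    return out
```

The resulting file can later be opened without loading it fully, using `np.load(out_path, mmap_mode="r")`.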



### At this point we can generate a train and test dataset from the Zarr data

```python
#test_ignore
pml = PrepML(data_root, interpolate,  date_start='2015-11-01', date_end='2020-04-30', date_train_test_cutoff='2019-11-01')
```

```python
#test_ignore
pml.regions = {            
            'Washington': ['Mt Hood', 'Olympics', 'Snoqualmie Pass', 'Stevens Pass',
            'WA Cascades East, Central', 'WA Cascades East, North', 'WA Cascades East, South',
            'WA Cascades West, Central', 'WA Cascades West, Mt Baker', 'WA Cascades West, South'
            ]}
```

```python
#test_ignore
%time train_labels, test_labels = pml.prep_labels()
```

    Mt Hood
    Olympics
    Snoqualmie Pass
    Stevens Pass
    WA Cascades East, Central
    WA Cascades East, North
    WA Cascades East, South
    WA Cascades West, Central
    WA Cascades West, Mt Baker
    WA Cascades West, South
    CPU times: user 19.6 s, sys: 531 ms, total: 20.2 s
    Wall time: 20.4 s


```python
#test_ignore
train_labels = train_labels[train_labels['UnifiedRegion'].isin(['Mt Hood', 
                                                              'Olympics', 
                                                              'Snoqualmie Pass',
                                                              'Stevens Pass',
                                                              'WA Cascades East, Central',
                                                              'WA Cascades East, North',
                                                              'WA Cascades East, South',
                                                              'WA Cascades West, Central',
                                                              'WA Cascades West, Mt Baker',
                                                              'WA Cascades West, South'])]
```

```python
#test_ignore
test_labels = test_labels[test_labels['UnifiedRegion'].isin(['Mt Hood', 
                                                              'Olympics', 
                                                              'Snoqualmie Pass',
                                                              'Stevens Pass',
                                                              'WA Cascades East, Central',
                                                              'WA Cascades East, North',
                                                              'WA Cascades East, South',
                                                              'WA Cascades West, Central',
                                                              'WA Cascades West, Mt Baker',
                                                              'WA Cascades West, South'])]
```

```python
#test_ignore
train_labels.head()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>UnifiedRegion</th>
      <th>latitude</th>
      <th>longitude</th>
      <th>UnifiedRegionleft</th>
      <th>Cornices_Likelihood</th>
      <th>Cornices_MaximumSize</th>
      <th>Cornices_MinimumSize</th>
      <th>Cornices_OctagonAboveTreelineEast</th>
      <th>Cornices_OctagonAboveTreelineNorth</th>
      <th>Cornices_OctagonAboveTreelineNorthEast</th>
      <th>...</th>
      <th>image_types</th>
      <th>image_urls</th>
      <th>rose_url</th>
      <th>BottomLineSummary</th>
      <th>Day1WarningText</th>
      <th>Day2WarningText</th>
      <th>parsed_date</th>
      <th>season</th>
      <th>Day1DangerAboveTreelineValue</th>
      <th>Day1DangerAboveTreelineWithTrend</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Mt Hood</td>
      <td>45.25</td>
      <td>-121.75</td>
      <td>Mt Hood</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>...</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>2015-12-05</td>
      <td>15-16</td>
      <td>1.0</td>
      <td>Moderate_Initial</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mt Hood</td>
      <td>45.25</td>
      <td>-121.75</td>
      <td>Mt Hood</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>...</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>2015-12-06</td>
      <td>15-16</td>
      <td>1.0</td>
      <td>Moderate_Flat</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Mt Hood</td>
      <td>45.25</td>
      <td>-121.75</td>
      <td>Mt Hood</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>...</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>2015-12-07</td>
      <td>15-16</td>
      <td>2.0</td>
      <td>Considerable_Rising</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Mt Hood</td>
      <td>45.25</td>
      <td>-121.75</td>
      <td>Mt Hood</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>...</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>2015-12-08</td>
      <td>15-16</td>
      <td>2.0</td>
      <td>Considerable_Flat</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Mt Hood</td>
      <td>45.25</td>
      <td>-121.75</td>
      <td>Mt Hood</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>...</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>no-data</td>
      <td>2015-12-09</td>
      <td>15-16</td>
      <td>1.0</td>
      <td>Moderate_Falling</td>
    </tr>
  </tbody>
</table>
<p>5 rows × 302 columns</p>
</div>



### Note the class imbalance and the test set not having all classes.  This isn't a good set for ML (one should use the entire 2015-2020 dataset but you need to ensure you have all the data from those dates available)

```python
#test_ignore
train_labels['Day1DangerAboveTreeline'].value_counts()
```




    Moderate        27982
    Considerable    25588
    High             5715
    Low              2289
    no-data          1272
    Extreme            59
    Name: Day1DangerAboveTreeline, dtype: int64



```python
#test_ignore
test_labels['Day1DangerAboveTreeline'].value_counts()
```




    Series([], Name: Day1DangerAboveTreeline, dtype: int64)
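One common response to the kind of imbalance visible in the training counts above is to oversample the rare classes. The sketch below is a generic pandas approach, not the package's oversampling code; the `oversample` function and the shape of its `factors` dict are assumptions made for illustration.

```python
import pandas as pd


def oversample(labels: pd.DataFrame, y_column: str, factors: dict,
               random_state: int = 0) -> pd.DataFrame:
    """Repeat rare classes by sampling their rows with replacement.

    factors={'Extreme': 5} makes 'Extreme' rows about 5x as common.
    """
    parts = [labels]
    for value, factor in factors.items():
        rows = labels[labels[y_column] == value]
        extra = int(len(rows) * (factor - 1))
        if extra > 0:
            parts.append(rows.sample(n=extra, replace=True,
                                     random_state=random_state))
    return pd.concat(parts, ignore_index=True)
```

Oversampling should only ever be applied to the training labels; applying it to the test set would distort the evaluation.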



### This will generate local files sampling from the datasets (parameters can specify exactly the amount of data to store) in the ML folder which can be used for the next ML process

Modifying the parameters so you don't run out of memory is important, as the process is designed to append to the on-disk files so as to stay within memory constraints: num_train_rows_per_file maxes out at around 50000 on my 48 GB local machine.  If you want more data than that, use the num_train_files parameter, which will create multiple files of num_train_rows_per_file rows each and append them into one file at the end of the process. 

```python
#test_ignore
%time train_labels_remaining, test_labels_remaining = pml.generate_train_test_local(train_labels, test_labels, num_train_rows_per_file=1000, num_test_rows_per_file=500, num_variables=978)
```
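A back-of-the-envelope estimate shows why the row count per file matters. The arithmetic below is a sketch under two assumptions that the text does not state: one timestep per lookback day (180) and 4-byte float values. Only the 978 variables and the 50000 row figure come from the text above.

```python
def dataset_bytes(rows: int, variables: int, timesteps: int,
                  bytes_per_value: int = 4) -> int:
    """Size of a dense (rows, variables, timesteps) array in bytes."""
    return rows * variables * timesteps * bytes_per_value


# Assumed: 180 timesteps and float32 values; 978 variables as in the call above.
estimate = dataset_bytes(rows=50_000, variables=978, timesteps=180)
print(f"{estimate / 1e9:.1f} GB")  # 35.2 GB
```

Under those assumptions a 50000 row file already needs roughly 35 GB, which is in line with that being about the limit on a 48 GB machine.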

### 6. TimeseriesAI
#### Demonstrate using the data as the input to a deep learning training process
Now that our data is in the right format we can try to do some machine learning on it.  The 4.TimeseriesAi notebook in the ML folder only demonstrates the process, as today the results are a proof of concept and not sophisticated at all.  This area has had only minimal investment to date and is where focus is now being applied.  The current issue is overfitting, which will need to be addressed before either expanding the dataset size or training for additional epochs.

The notebook [5.TimeseriesAI](/ML/5.TimeseriesAI.ipynb) leverages the timeseries deep learning library https://github.com/timeseriesAI/tsai, which is based on FastAI https://github.com/fastai/fastai, and it is relatively straightforward to understand, especially if you are familiar with FastAI.  As progress is made here this notebook will be updated to reflect the current state.

This notebook also depends on a different conda environment in the _Environments_ folder.  Create and activate the environment from the timeseriesai.yml file to use this notebook.

## Citations
National Centers for Environmental Prediction/National Weather Service/NOAA/U.S. Department of Commerce. 2015, updated daily. NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory. https://doi.org/10.5065/D65D8PWK. Accessed April, 2020

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/scottcha/openavalancheproject/tree/master/",
    "name": "openavalancheproject",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "",
    "keywords": "openavalancheproject",
    "author": "Scott Chamberlin",
    "author_email": "scott@snowymountainworks.com",
    "download_url": "https://files.pythonhosted.org/packages/fd/ff/777ca5760b7b51477196e0a48edb43986ef6952f6f9daad40ef1cb7c82ba/openavalancheproject-0.0.4.tar.gz",
    "platform": null,
    "description": "# Open Avalanche Project\n\n\n\nOpen source project to bring data and ml to avalanche forecasting\n\n\nWebpage is https://openavalancheproject.org\nDocs are located at https://scottcha.github.io/OpenAvalancheProject/\n\n\nDirectories are organized as follows:\n- Data\n\n    Contains files associated with data inputs, such as geojson definitions of avalanche regions.  Training and label data are linked in the README there as they are too large to host in git.\n\n- DataPipelineNotebooks\n\n    The data prep code which is used to generate the OpenAvalanche pip package.\n\n- Environments\n\n    Conda environment yml files\n\n- Get Training Data\n\n    Files used to pull training data from various web sources\n\n- ML\n\n    Notebooks representing the current state of the art for danger prediction for this project\n\n- TestData\n\n    Sample test data supporting the automated tests\n\n- WebApp is the bulk of the operational code for the OAP website\n\n    - OpenAvalancheProjectWebApp Contains the code for the website\n\n- docs\n\n    Documentation generated with nbdev based on the notebooks in DataPipelineNotebooks\n\n- openavalancheproject.egg-info\n\n    Supporting pip package files generated with nbdev\n\n- openavalancheproject\n\n    Code generated with nbdev from DataPipelineNotebooks for the pip package \n\n## Tutorial \n### 1. Getting new input data\n\nThis aspect of the tutorial will cover how you can obtain new weather input data for a new date range or region.  This part assumes you have avalanche forecast labels for the dates and region (OAP currently has historical forecast labels for three avalanche centers in the US from the 15-16 season through the 20-21 season and is working on expanding that).\n\nDue to the large size of the input GFS data and the fact that its already hosted by NCAR, OAP isn't currently providing copies of this data.  
If you want to start a data processing pipeline from the original data you can start with this process here.  If you aren't interested in the data processing steps and only in the ML steps you can download the labels here: https://oapstorageprod.blob.core.windows.net/oap-training-data/Data/CleanedForecastsNWAC_CAIC_UAC_CAC.V1.2013-2021.zip and a subset of training data here: [TODO: replace with current link] and skip to the fourth notebook 4.TimeseriesAi\n\nThe input data is derived from the .25 degree GFS model hosted by NCAR hosted at this site: https://rda.ucar.edu/datasets/ds084.1/\n\nYou'll need to create an account and once you are logged in you can visit the above link and then click on the Data Access tab.  One note is that I've found that chromium based browsers don't work well on this site so I recommend you use Firefox for at least downloading the data.\n\nDue to the size of the files we are downloading I only recommend downloading one season and for a regional subset at a time.  In this example I'm going to download the data for Colorado.  \n\n![NCAR Get Data](DataPipelineNotebooks/images/NCAR_GetData.png)\n(note: the web ui has changed a bit since these screenshots were taken but the naivgation remains the same)\n\nClick on the \"Get a Subset\" link.\n\nThe next page allows us to select both the dates and parameters we are interested in.  Currently we read all parameters so please check all parameters.  For dates choose one winter season.  In the below screenshot I've selected dates Nov 1, 2015 thorough April 30, 2016 for the 15-16 season.  The models assume the season starts Nov 1 and ends April 30 (it wouldn't be too difficult to update the data pipeline for a southern hemisphere winter but its not something that has been done yet).\n\n![NCAR Date Selection](DataPipelineNotebooks/images/NCAR_DateSelection.png)\n\nClick Continue and wait for it to validate your selections. \n\nThe next page allows you to further subset your data.  
There are a few important things here.  \n\n    1. Verify that the dates are correct.  \n    2. We want the output as grib (check \"same as input\") \n    3. Download all available vertical levels.  \n    4. Select only the 3-24 hour forecasts and accumulation in the gridded products as currently OAP doesn't use more than this.  \n    5. You can also then select the bounding box for the area you want to download in the \"spatial selection\". Once you have a bounding box you like write down the lat/lon values so its easier to input when we come back for other date ranges.\n\n![NCAR Subset Selection](DataPipelineNotebooks/images/NCAR_Subset2.png)\n\nOnce the selections are correct and you can eventually click through to submit your request.  You should get a confirmation page of your selections and the system will start to retrieve your data.  This usually takes a few hours and you will get an email when its ready for download.  At this point if you want additional date/time ranges you can submit the requests and they will get queued and made avalable for download when they are ready.  In this example the downloaded files were 1.1 GB.\n\nExtract and decompress all the files until you have a per forecast grib file and ensure all the files have been moved in to a single directory (per season per location). If you are using Linux this stackoverflow post may help https://askubuntu.com/questions/146634/shell-script-to-move-all-files-from-subfolders-to-parent-folder.\n\nOnce you have all the files as grib files in a single directory for that date and location (i.e., 15-16/Colorado/) there are a couple final cleaning steps.  Due to the download process sometimes some files earlier than 11/1 are included.  You can just delete those files (the file date is 201510*)\n    \n_Its worth a brief interlude in to understanding how these files are encoded.  Here is a typical file name gfs.0p25.2015110100.f003.grib2.  Lets break that down gfs: is the model we are using.  
0p25 I beleive is the resolution at .25 degress but I haven't seen this documented.  2015110100 is the encoded date of the model runtime.  You will see in your dataset that there are four models run per day: 00, 06, 12, 18.  Currently we are only using the 00 model (the first of the day).  The next component is .f003 which is the forecast for 3 hours from the model runtime.  grib2 is the input file format.  chamberlin455705 is the enocded download request. \n\nNext delete all files which have a model run hour other than 00 (i.e., 06, 12, 18).  Check that you have 1456 files at this point (8 files per day for 182 days, the data processing should be resiliant to the occasional missing file which does happen in these datasets).  The total size of the input files at this point is ~270MB.\n\n![File List Example](DataPipelineNotebooks/images/files_example2.png)\n\nYou will have to repeat the download process for the accumulation variables and place them in a parallel folder (see next; it is possible to add the accumulation variables to the single download but they need to be preprocessed differently so I prefer to keep them seperate).\n\nThe final step is to ensure the input data is in the correct folder structure.  All data for this project will sit off a path you define as the base path.  The GFS input data then needs to be in subfolders of that path delineated by season and state (or country).\nFor example if our past path in this example is:\n\n    /media/scottcha/E1/Data/OAPMLData/\n\nThe place this data in \n\n    /media/scottcha/E1/Data/OAPMLData/1.RawWeatherData/15-16/Colorado/\n    or \n    /media/scottcha/E1/Data/OAPMLData/1.RawWeatherData/15-16/ColoradoAccumulationGrib/\n    [for the accumulation files]\n\nNotes:\n\n* There is an option to covert the file to NetCDF in the NCAR/UCAR UI.  Don't use this as it will result in a .nc file which isn't in the same format as the one we are going to use.\n\n### 2. 
Transforming and Filtering the Data\n\nNow that we have the input file set we can start to go through the initial data pipeline steps to transform and filter the data. Today this is done in a series of Jupyter notebooks.  This format makes it easy to incrementally process and check the outputs while the project is in a development phase (once we have a model which seems to have a reasonable output these steps will be encoded in a set of python modules and implemented as a processing pipeline).\n\nAssuming you have Anaconda and Jupyter installed, first change directory to the Environments directory at the root level of the repo.  This contains two conda environment definitions, one for the processing steps, oap_ml_datapipeline.yml, and one for the deep learning step, oap_tsai.yml.\n\n    conda env create -f oap_datapipeline.yml\n\nOnce the environment has been created you can activate it with\n\n    conda activate oap_datapipeline \n\nThere is one step we need to take before going through the notebooks and that is converting the grib2 files to NetCDF.  We do this for a couple of reasons, but primarily because this tool efficiently collapses the vertical dimensions (called level) into the variable definitions so we can more easily get the data to the ML format we need.  The utility to do this is called wgrib2 and should have been installed in the oap_ml_datapipeline environment.\n\n* While most of the data processing steps work equally well in Windows or Linux, I've found that wgrib2 is much easier to install in a Linux environment so I tend to use Linux for at least the following step.\n\nUsing a terminal prompt change directory to the folder where you downloaded and unpacked the weather model files.  
\n\n    /media/scottcha/E1/Data/OAPMLData/1.RawWeatherData/15-16/Colorado/\n\nIn that directory you can execute this command to iterate through all the files and transform them:\n\n    for i in *.grib2; do wgrib2 \"$i\" -netcdf \"$i.nc\"; done\n\n_There are ways of improving the efficiency by doing this in parallel so feel free to improve on this._\n\nTo start a new notebook launch jupyter\n\n    jupyter notebook\n\n### 3. ParseGFS \n#### Parsing and filtering the input files\n\nCompleting these next few steps basically takes the raw input weather data and leaves us with data slightly transformed but filtered to only the coordinates in the avalanche regions for that location.  For example here is what a regional view of one of the parameters (U component of wind vector) looks like when both interpolated 4x and viewed across the entire Washington region:\n\n![Washington Wind Component](DataPipelineNotebooks/images/Wind_Example.png)\n\nWe've used this geojson definition of the avalanche regions to subset that view into much smaller views focused on the avalanche forecast regions.  
Here are all the US regions.\n\n![US Avalanche Regions](DataPipelineNotebooks/images/US_Avy_Regions.png)\n\nAnd then this is what it looks like when filtered to only the Olympics avalanche region (the small one in the top left of the US regions):\n\n![Olympics Wind Component](DataPipelineNotebooks/images/Wind_Region_Example.png)\n\n\n\n# Files on disk structure\n\nTraining labels can be downloaded here:\nhttps://oapstorageprod.blob.core.windows.net/oap-training-data/Data/CleanedForecastsNWAC_CAIC_UAC_CAC.V1.2013-2021.zip \n\n    1.RawWeatherData/\n        gfs/\n            <season>/\n                /<state or country>/\n    2.GFSFiltered(x)Interpolation/\n    3.GFSFiltered(x)InterpolationZarr/\n    4.MLData\n\n## These parameters need to be set\n\n```python\n#test_ignore\nseasons = ['15-16'] #list of seasons to process\nstate = 'Colorado'\n\ninterpolate = 1 #interpolation factor: whether we want to augment the data through lat/lon interpolation; 1 is no interpolation, 4 is 4x interpolation\n\ndata_root = '/media/scottcha/E1/Data/OAPMLData/'\n\nn_jobs = 24 #number of parallel processes to run \n```\n\nThis code block will iterate through each season and region and produce the files for the next stage of the data pipeline.\n```python\nfor s in seasons:\n    print(\"On season: \" + s)\n    pgfs = ParseGFS(s, state, data_root, resample_length='3H')\n    results = pgfs.resample_local(jobs=n_jobs)\n```\n\n### 4. ConvertToZarr\n#### Reformat data in to efficient Zarr format\nThe next step in our data transformation pipeline is to transform the NetCDF files to Zarr files which are indexed in such a way as to make access to specific dates and lat/lon pairs as efficient as possible. This process can be run entirely end to end once you are sure the parameters are set correctly.  It takes about 2 hours for one season for all regions in Colorado on my workstation (12 core 3900X with data on Gen3 NVME drive) using all cores.  
The important point about this notebook is that we are essentially indexing the data so it can be accessed efficiently when we create our ML datasets. \n\n\n```python\n#test_ignore\nctz = ConvertToZarr(seasons, regions, data_root)\n```\n\n```python\n#test_ignore\nctz.convert_local()\n```\n\n\n### 5. PrepMLData\n#### Converting the data in to a memmapped numpy timeseries (samples, feature, timestep)\nThis step needs to be run once to create a dataset to be used in a subsequent ML step.  The way to think about these methods is that we use the set of valid labels + the valid lat/lon pairs as an index into the data.  It's important to understand that the regions are geographically large and usually cover many lat/lon pairs in our gridded dataset while the labels apply to an entire region (multiple lat/lon pairs).  For example the _WA Cascades East, Central_ region covers 24 lat/lon pairs, so if on Jan 1 there was a label we wanted to predict, our dataset would have 24 lat/lon pairs in that region associated with that label.  There are pros and cons for this approach.\n\nPros:\n1. Reasonable data augmentation approach\n2. Aligns with how we ultimately want to provide predictions--more granular, not restricted to established regions\n\nCons:\n1. Could be contributing to overfitting\n2. The data becomes very large\n\nThat being said, the methods will calculate this index for every label/lat/lon point and then we'll split this into train and test sets.  It's important to ensure that the train test split is done in time (e.g., I usually use 15-16 through 18-19 as the training set and then 19-20 as the test set), as otherwise there will be data leakage.  \n\nOnce the train test split is done on the labels there is a process to build up the dataset.  This is still a slow process even when doing it in parallel and against the indexed Zarr data.  I've spent a lot of time trying various ways of optimizing this but I'm sure this could use more work.  
The primary internal method for doing this is called _get_xr_batch_ and takes several parameters:\n\n1. labels: list of the train or test set labels\n2. lookback_days: the number of previous days to include in the dataset.  For example if the label is for Jan 1, then a lookback_days of 14 will also include the previous 14 days.  I've typically been using 180 days as lookback (if a lookback extends prior to Nov 1 then we just fill in NaN as the data is likely irrelevant) but it's possible that a lower value might give better results.\n3. batch_size: the size of the batch you want returned\n4. y_column: the label you want to use\n5. label_values: the possible values of the label from y_column.  We include this as the method can implement oversampling to adjust for the imbalanced data.\n6. oversample: dict which indicates which labels should be oversampled.  \n7. random_state: seed for the random number generator\n8. n_jobs: number of processes to use\n\nIn the tutorial the notebook produces one train batch of 1,000 rows and one test batch of 500 rows and then concatenates them into a single memmapped file.\n\n\n\n### At this point we can generate a train and test dataset from the Zarr data\n\n```python\n#test_ignore\npml = PrepML(data_root, interpolate,  date_start='2015-11-01', date_end='2020-04-30', date_train_test_cutoff='2019-11-01')\n```\n\n```python\n#test_ignore\npml.regions = {            \n            'Washington': ['Mt Hood', 'Olympics', 'Snoqualmie Pass', 'Stevens Pass',\n            'WA Cascades East, Central', 'WA Cascades East, North', 'WA Cascades East, South',\n            'WA Cascades West, Central', 'WA Cascades West, Mt Baker', 'WA Cascades West, South'\n            ]}\n```\n\n```python\n#test_ignore\n%time train_labels, test_labels = pml.prep_labels()\n```\n\n    Mt Hood\n    Olympics\n    Snoqualmie Pass\n    Stevens Pass\n    WA Cascades East, Central\n    WA Cascades East, North\n    WA Cascades East, South\n    WA Cascades West, Central\n    WA Cascades 
West, Mt Baker\n    WA Cascades West, South\n    CPU times: user 19.6 s, sys: 531 ms, total: 20.2 s\n    Wall time: 20.4 s\n\n\n```python\n#test_ignore\ntrain_labels = train_labels[train_labels['UnifiedRegion'].isin(['Mt Hood', \n                                                              'Olympics', \n                                                              'Snoqualmie Pass',\n                                                              'Stevens Pass',\n                                                              'WA Cascades East, Central',\n                                                              'WA Cascades East, North',\n                                                              'WA Cascades East, South',\n                                                              'WA Cascades West, Central',\n                                                              'WA Cascades West, Mt Baker',\n                                                              'WA Cascades West, South'])]\n```\n\n```python\n#test_ignore\ntest_labels = test_labels[test_labels['UnifiedRegion'].isin(['Mt Hood', \n                                                              'Olympics', \n                                                              'Snoqualmie Pass',\n                                                              'Stevens Pass',\n                                                              'WA Cascades East, Central',\n                                                              'WA Cascades East, North',\n                                                              'WA Cascades East, South',\n                                                              'WA Cascades West, Central',\n                                                              'WA Cascades West, Mt Baker',\n                                                              'WA Cascades West, South'])]\n```\n\n```python\n#test_ignore\ntrain_labels.head()\n```\n\n\n\n\n<div>\n<style scoped>\n   
 .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>UnifiedRegion</th>\n      <th>latitude</th>\n      <th>longitude</th>\n      <th>UnifiedRegionleft</th>\n      <th>Cornices_Likelihood</th>\n      <th>Cornices_MaximumSize</th>\n      <th>Cornices_MinimumSize</th>\n      <th>Cornices_OctagonAboveTreelineEast</th>\n      <th>Cornices_OctagonAboveTreelineNorth</th>\n      <th>Cornices_OctagonAboveTreelineNorthEast</th>\n      <th>...</th>\n      <th>image_types</th>\n      <th>image_urls</th>\n      <th>rose_url</th>\n      <th>BottomLineSummary</th>\n      <th>Day1WarningText</th>\n      <th>Day2WarningText</th>\n      <th>parsed_date</th>\n      <th>season</th>\n      <th>Day1DangerAboveTreelineValue</th>\n      <th>Day1DangerAboveTreelineWithTrend</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Mt Hood</td>\n      <td>45.25</td>\n      <td>-121.75</td>\n      <td>Mt Hood</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>...</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>2015-12-05</td>\n      <td>15-16</td>\n      <td>1.0</td>\n      <td>Moderate_Initial</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Mt Hood</td>\n      <td>45.25</td>\n      <td>-121.75</td>\n      <td>Mt Hood</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>...</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      
<td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>2015-12-06</td>\n      <td>15-16</td>\n      <td>1.0</td>\n      <td>Moderate_Flat</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Mt Hood</td>\n      <td>45.25</td>\n      <td>-121.75</td>\n      <td>Mt Hood</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>...</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>2015-12-07</td>\n      <td>15-16</td>\n      <td>2.0</td>\n      <td>Considerable_Rising</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Mt Hood</td>\n      <td>45.25</td>\n      <td>-121.75</td>\n      <td>Mt Hood</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>...</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>2015-12-08</td>\n      <td>15-16</td>\n      <td>2.0</td>\n      <td>Considerable_Flat</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Mt Hood</td>\n      <td>45.25</td>\n      <td>-121.75</td>\n      <td>Mt Hood</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>...</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>no-data</td>\n      <td>2015-12-09</td>\n      <td>15-16</td>\n      <td>1.0</td>\n      <td>Moderate_Falling</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows \u00d7 302 columns</p>\n</div>\n\n\n\n### Note the class imbalance and the test set not having all classes.  
This isn't a good set for ML (one should use the entire 2015-2020 dataset but you need to ensure you have all the data from those dates available)\n\n```python\n#test_ignore\ntrain_labels['Day1DangerAboveTreeline'].value_counts()\n```\n\n\n\n\n    Moderate        27982\n    Considerable    25588\n    High             5715\n    Low              2289\n    no-data          1272\n    Extreme            59\n    Name: Day1DangerAboveTreeline, dtype: int64\n\n\n\n```python\n#test_ignore\ntest_labels['Day1DangerAboveTreeline'].value_counts()\n```\n\n\n\n\n    Series([], Name: Day1DangerAboveTreeline, dtype: int64)\n\n\n\n### This will generate local files sampling from the datasets (parameters can specify exactly the amount of data to store) in the ML folder which can be used for the next ML process\n\nIt is important to set the parameters so you don't run out of memory; the process is designed to append to the on-disk files so as to stay within memory constraints.  num_train_rows_per_file maxes out at around 50000 on my 48 GB local machine.  If you want more data than that, use the num_train_files parameter, which will create multiple files of num_train_rows_per_file rows each and append them into one file at the end of the process. \n\n```python\n#test_ignore\n%time train_labels_remaining, test_labels_remaining = pml.generate_train_test_local(train_labels, test_labels, num_train_rows_per_file=1000, num_test_rows_per_file=500, num_variables=978)\n```\n\n### 6. TimeseriesAI\n#### Demonstrate using the data as the input to a deep learning training process\nNow that our data is in the right format we can try to do some machine learning on it.  The 4.TimeseriesAi notebook in the ML folder is only meant to demonstrate the process, as today the results are a proof of concept and not sophisticated at all.  This area has only had minimal investment to date and is where focus is now being applied.  
The current issue is overfitting and that will need to be addressed before either expanding the dataset size or training for additional epochs.\n\nThe Notebook [5.TimeseriesAI](/ML/5.TimeseriesAI.ipynb) leverages the Timeseries Deep Learning library https://github.com/timeseriesAI/tsai based on FastAI https://github.com/fastai/fastai and is relatively straightforward to understand, especially if you are familiar with FastAI.  As progress is made here this notebook will be updated to reflect the current state.\n\nThis notebook also depends on a different conda environment in the _Environments_ folder.  Create and activate the environment from the timeseriesai.yml file to use this notebook.\n\n## Citations\nNational Centers for Environmental Prediction/National Weather Service/NOAA/U.S. Department of Commerce. 2015, updated daily. NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory. https://doi.org/10.5065/D65D8PWK. Accessed April, 2020\n",
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Data Pipeline Processing for Open Avalanche Project from global GFS to ML Data",
    "version": "0.0.4",
    "project_urls": {
        "Homepage": "https://github.com/scottcha/openavalancheproject/tree/master/"
    },
    "split_keywords": [
        "openavalancheproject"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9fb60a005227cbb31664fd7994997c252f5fb1337fc48bd013b0c880895fd802",
                "md5": "83700a6200026efc07275dfc28690a7e",
                "sha256": "64bf97a3198e62bad397aebf185e7b1fd49f990df17f49cd87d45e159e98aff0"
            },
            "downloads": -1,
            "filename": "openavalancheproject-0.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "83700a6200026efc07275dfc28690a7e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 36388,
            "upload_time": "2023-10-24T16:47:11",
            "upload_time_iso_8601": "2023-10-24T16:47:11.358878Z",
            "url": "https://files.pythonhosted.org/packages/9f/b6/0a005227cbb31664fd7994997c252f5fb1337fc48bd013b0c880895fd802/openavalancheproject-0.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fdff777ca5760b7b51477196e0a48edb43986ef6952f6f9daad40ef1cb7c82ba",
                "md5": "ab1b9bd49b37d972e5fc56b1c5a7a443",
                "sha256": "cf6a79fab98c6e6ae21cb8d853568a9fca9e8b34964460cef1e0dc4d33291446"
            },
            "downloads": -1,
            "filename": "openavalancheproject-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "ab1b9bd49b37d972e5fc56b1c5a7a443",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 43220,
            "upload_time": "2023-10-24T16:47:14",
            "upload_time_iso_8601": "2023-10-24T16:47:14.749010Z",
            "url": "https://files.pythonhosted.org/packages/fd/ff/777ca5760b7b51477196e0a48edb43986ef6952f6f9daad40ef1cb7c82ba/openavalancheproject-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-24 16:47:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "scottcha",
    "github_project": "openavalancheproject",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "openavalancheproject"
}
        