# README
Author: Sally Zhao, Neil Gutkin
## Introduction and Overview
Monitoring Earth system data is important for both scientific and practical reasons: analysis of such data furthers understanding of the planet and better informs political, economic, and policy decisions. To support the Making Earth System Data Records for Use in Research Environments (MEaSUREs) Program, this project developed Python code for fusing Level 2 aerosol data from six satellites, three geostationary (GEO) and three low Earth orbit (LEO), retrieved with the Dark Target Aerosol Retrieval Algorithm. The code emulates an existing workflow that the science team originally wrote in IDL (Interactive Data Language). Quality checking was done by comparing results against Panoply and by matching netCDF output through Python. Fused datasets generated by the program allow for visualization and analysis of the global aerosol data record over specific time periods. They also aid research and analysis, as users can more easily manipulate and work with satellite and sensor data. By making this code and its functionality open source and scalable, the project gives the scientific community easier access to aerosol data processing resources.
## Installation
Required libraries:
1. numpy (statistics and calculations)
2. pyhdf (process HDF files)
3. netCDF4 (process netCDF files)
4. pyyaml (YAML user config)
5. joblib (run pipeline jobs)
6. pandas (data analysis)
7. numba (parallelizing code on GPUs)
Check requirements.txt for the required versions and any additional libraries. numba is not required for the CPU-only package.
## Inputs
Supported input formats include netCDF4 and HDF file formats.
## Outputs
Depending on export selection, the package can create netCDF4 or geoTIFF files.
## User Configuration
### Command Line Input
Users can call functions directly from the command line. User specifications are passed as flags.
#### Flags
-fn (filename): Single filename to read in.
-fl (filelist): Location of file that contains a list of files (and their locations) to read in.
-gl (geolocation variables): Geolocation variables for parsing and calculations. Defaults to latitude and longitude.
-gp (geophysical variables): Geophysical variables for parsing and calculations (e.g. aerosol optical depth land and ocean, solar azimuth, etc.).
-gs (gridding size): Gridding size and pixel resolution level.
-o (output): Output location
-on (output name): Output name. If multiple files are created, the output name is used as a prefix, followed by the time interval associated with the calculations.
-l (limit): Bounding box for latitude and longitude (-90 90 -180 180 would encompass the full Earth).
#### Possible Commands
-r (read): Reads in raw data from L2 file.
-f (filter): Reads in raw data from L2 file and filters based on metadata.
-g (grid): Reads, filters, and grids a single L2 file.
-ns (netCDF single): Reads, filters, and grids a single L2 file and saves the output as a netCDF file.
-nm (netCDF multiple): Reads, filters, and grids a single L2 file and saves the output as a single netCDF file regardless of time interval.
-nmt (netCDF multiple time): Reads, filters, and grids a single L2 file and saves the output as a single netCDF file with the time intervals separated along a layer dimension.
-ss (sensor statistics): Reads sensors and reports statistics and individual gridded data.
-sss (sensor statistic split): Reads sensors and reports statistics and gridded data based on satellite categorization.
-ssi (sensor statistic split id): Reads sensors and reports statistics and gridded data based on satellite categorization.
-cfg (config): Reads in YAML file and executes commands.
### YAML configuration
Command line input to call YAML file:
python3 gtools.py -cfg -fn "C:\LOCATION\CONFIG_FILE_NAME.yml"
A YAML file can also be used while specifying the start and end times on the command line. This avoids editing the YAML file (or creating a new Docker image) for every run.
Command line input to call the YAML file with a start and end time:
python3 gtools.py -cfgtime -fn "C:\LOCATION\CONFIG_FILE_NAME.yml" -ts 2020/01/01/00/00 -te 2020/01/01/00/30
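
When many time windows need to be processed, the `-ts` and `-te` values can be generated rather than typed by hand. The following is a minimal sketch, assuming one `gtools.py` call per window; the helper names `time_windows` and `build_commands` and the path `config.yml` are hypothetical, and only the `-cfgtime`, `-fn`, `-ts`, and `-te` flags and the timestamp format come from the example above.

```python
from datetime import datetime, timedelta

# Timestamp layout taken from the -ts/-te example: YYYY/MM/DD/HH/MM
TIME_FORMAT = "%Y/%m/%d/%H/%M"


def time_windows(start, end, step_minutes=30):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    step = timedelta(minutes=step_minutes)
    current = start
    while current < end:
        yield current, min(current + step, end)
        current += step


def build_commands(config_path, start, end, step_minutes=30):
    """Build one argument list per window (hypothetical helper)."""
    return [
        ["python3", "gtools.py", "-cfgtime", "-fn", config_path,
         "-ts", ws.strftime(TIME_FORMAT), "-te", we.strftime(TIME_FORMAT)]
        for ws, we in time_windows(start, end, step_minutes)
    ]


commands = build_commands("config.yml",
                          datetime(2020, 1, 1, 0, 0),
                          datetime(2020, 1, 2, 0, 0))
print(len(commands))          # 48 half-hour windows in one day
print(" ".join(commands[0]))
```

Each argument list could then be passed to `subprocess.run` from the gridtools folder. Whether the package also splits a longer range internally is not covered by this sketch.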
#### User specifications
The YAML file also has inputs for user specifications. These include:
##### grid_settings:
- gridsize (pixel resolution size)
- limit (rectangular boundaries for gridding - default: [-89.875, 89.875, -179.875, 179.875])
- fill_value (fill value for areas with no calculations or data)
- time_start (start of gridding time)
- time_end (end of gridding time)
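
To illustrate how `gridsize`, `limit`, and `fill_value` interact, here is a minimal pure-Python sketch that averages point values into a regular latitude/longitude grid. It is not the package's implementation. It assumes `limit` is ordered as (lat_min, lat_max, lon_min, lon_max) and that the bounds are the centers of the outermost cells, which is one reading of the default limit; the function name `grid_mean`, the 0.25 degree grid size, and the fill value of -9999.0 are hypothetical.

```python
import math


def grid_mean(lats, lons, values, gridsize=0.25,
              limit=(-89.875, 89.875, -179.875, 179.875),
              fill_value=-9999.0):
    """Average point values into a regular lat/lon grid (illustrative only).

    Assumes `limit` is (lat_min, lat_max, lon_min, lon_max) and that these
    bounds are the centers of the outermost grid cells.
    """
    lat_min, lat_max, lon_min, lon_max = limit
    n_rows = int(round((lat_max - lat_min) / gridsize)) + 1
    n_cols = int(round((lon_max - lon_min) / gridsize)) + 1

    sums = [[0.0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for lat, lon, value in zip(lats, lons, values):
        # Nearest cell center; points outside the limit are skipped.
        row = math.floor((lat - lat_min) / gridsize + 0.5)
        col = math.floor((lon - lon_min) / gridsize + 0.5)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            sums[row][col] += value
            counts[row][col] += 1

    # Cells that received no data keep the fill value.
    return [[sums[r][c] / counts[r][c] if counts[r][c] else fill_value
             for c in range(n_cols)]
            for r in range(n_rows)]


grid = grid_mean([], [], [])
print(len(grid), len(grid[0]))  # 720 1440
```

Under these assumptions, the default limit with a 0.25 degree grid size yields a 720 by 1440 global grid.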
##### variables: (variables to take from input files)
- geo_var (i.e. latitude, longitude)
- phy_var (geophysical variables)
- phy_var_nc (naming for geophysical variables in netCDF files (e.g. ABI_G16, ABI_G17, etc))
- phy_var_hdf (naming for geophysical variables in HDF files (e.g. MODIS))
- aod_range (user settings for aod. Is overwritten)
- pixel_range (user settings for pixel range for single gridded point)
##### file_io: (file inputs and outputs)
- file_directory_folder (Path to a directory. Reads all files in its subdirectories as well. Takes precedence over file_location_folder and file_location_file.)
- file_location_folder (Path to a directory. Only reads files directly in that directory. Takes precedence over file_location_file.)
- file_location_file (Path to a text file that lists individual file paths. Only reads the files listed in it.)
- output_location (Path to folder for outputs)
- output_name (User input name. Default is overwritten. Optional "NA")
- static_file (Path to static file where certain geophysical variable values are copied from)
When reading a directory with subdirectories (e.g. the LAADS archive), set file_directory_folder to the path of the top directory. All files contained in its subdirectories are then read.
When using a text file that lists file paths, set file_location_file to the path of this text file.
The text file should list the paths, for example:
C:\LOCATION\SATELLITE1.nc
C:\LOCATION\SATELLITE2.hdf
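
A file list like the one above can be generated from an existing folder. The sketch below uses only the Python standard library; it assumes one path per line and that input files end in `.nc` or `.hdf`, as in the example paths. The function name `write_file_list` is hypothetical and not part of the package.

```python
from pathlib import Path

# Assumption: input files are recognized by these extensions.
SUPPORTED_SUFFIXES = {".nc", ".hdf"}


def write_file_list(input_dir, list_path):
    """Write the absolute paths of all supported files under input_dir,
    one per line, and return them (hypothetical helper)."""
    paths = sorted(
        p.resolve()
        for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
    text = "".join(f"{p}\n" for p in paths)
    Path(list_path).write_text(text, encoding="utf-8")
    return paths
```

For example, `write_file_list("C:/LOCATION", "C:/LOCATION/file_list.txt")` would produce a text file whose path can then be given in file_location_file.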
#### YAML file format
![image.png](attachment:04ae5dab-4462-4faf-9a80-a1a1879bda71.png)
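
The sections and keys listed under User specifications can also be summarized in code. The following is a minimal sketch that checks a configuration dictionary (for example, the result of `yaml.safe_load`) against the documented keys and applies the stated precedence among the three input options. It assumes the three sections are top-level mappings and that `None`, an empty string, or "NA" mark an unset value; both are assumptions, and `example_config.yml` in the repository remains the authoritative template. The names `missing_keys` and `active_input_source` are hypothetical.

```python
# Keys as documented in the User specifications section.
DOCUMENTED_KEYS = {
    "grid_settings": ["gridsize", "limit", "fill_value",
                      "time_start", "time_end"],
    "variables": ["geo_var", "phy_var", "phy_var_nc", "phy_var_hdf",
                  "aod_range", "pixel_range"],
    "file_io": ["file_directory_folder", "file_location_folder",
                "file_location_file", "output_location", "output_name",
                "static_file"],
}

# Documented precedence: the first of these that is set is used.
INPUT_PRECEDENCE = ("file_directory_folder", "file_location_folder",
                    "file_location_file")

# Assumption: these values mean "not set".
UNSET = (None, "", "NA")


def missing_keys(config):
    """Return 'section.key' names that are documented but absent."""
    missing = []
    for section, keys in DOCUMENTED_KEYS.items():
        present = config.get(section) or {}
        missing.extend(f"{section}.{key}" for key in keys
                       if key not in present)
    return missing


def active_input_source(file_io):
    """Return (key, value) for the input option that takes precedence,
    or None if no input option is set."""
    for key in INPUT_PRECEDENCE:
        value = file_io.get(key)
        if value not in UNSET:
            return key, value
    return None
```

Not every documented key is necessarily required by the package, so a non-empty result from `missing_keys` is a prompt to compare against the template rather than an error.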
## Docker
The repository includes a Dockerfile, which was used to build a Docker image for the package. The image is available here: https://hub.docker.com/repository/docker/neilgutkin/aerosol-data-fusion/general.
A Docker image is essentially a blueprint for the creation of a Docker container. A container run from the image is a host-isolated environment that can be used to execute the data fusion package with provided user inputs.
Running the package with Docker requires a configured YAML file, in which the user specifies the parameters for the package run. The template for this YAML file is available in the source directory of this repository, under the name "example_config.yml". The input, output, and static file location fields in the YAML should be set to the paths as they appear inside the container. The "file_io" section of the example config is already set up for the provided Docker image, so there is no need to change it.
The next step is setting up the file system on the host. The input file directory, output file directory, config.yml file, and static file must all be grouped into one directory on the host machine, referred to as the "ioFiles" directory in the example below.
Finally, it's time to run a container from the Docker image. This step requires the user to specify the location of the ioFiles directory that the package should use. This data will be shared between the container environment and the host, meaning that changes made in the container (e.g. by the package) will be reflected on the host. To run the container, a user can execute the following command:
docker run [flags] -v "/your/host/path/to/ioFilesDirectory:/app/src/ioFiles" [image_name]:[version] python ./gridtools/gtools.py -cfg -fn /app/src/ioFiles/config.yml
Below is an example. Note especially the form of the Windows source path (/c/ instead of C:/):
docker run -it --gpus all -v "/c/Users/Neil/Desktop/Work/s23/ioFiles:/app/src/ioFiles" aerosol-df:v0 python ./gridtools/gtools.py -cfg -fn /app/src/ioFiles/config.yml
If running on s4psci or a similar server environment, permissions might require that you add the :z flag to the command when linking directories. In the above example, you would replace "/c/Users/Neil/Desktop/Work/s23/ioFiles:/app/src/ioFiles" with "/c/Users/Neil/Desktop/Work/s23/ioFiles:/app/src/ioFiles:z". After execution, the package will run and the output files directory on the host machine will be populated with the newly fused outputs.
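
The path conversion and mount argument described above can be sketched in Python. This is a minimal illustration using only the standard library; the helper names `to_docker_host_path` and `build_docker_command` and the `relabel` parameter are hypothetical, while the container paths, the `-v` mount, and the optional `:z` suffix come from the commands above.

```python
from pathlib import PureWindowsPath

CONTAINER_DIR = "/app/src/ioFiles"  # mount target used in the examples above


def to_docker_host_path(windows_path):
    """Convert e.g. C:\\Users\\Neil\\ioFiles to /c/Users/Neil/ioFiles."""
    path = PureWindowsPath(windows_path)
    if not path.drive:
        raise ValueError(f"expected an absolute Windows path: {windows_path!r}")
    drive = path.drive.rstrip(":").lower()
    # parts[0] is the anchor (e.g. 'C:\\'); the rest are folder names.
    return "/" + "/".join([drive, *path.parts[1:]])


def build_docker_command(host_dir, image, relabel=False, extra_flags=()):
    """Build the docker run argument list (hypothetical helper).

    relabel=True appends the :z suffix to the mount.
    """
    mount = f"{host_dir}:{CONTAINER_DIR}" + (":z" if relabel else "")
    return ["docker", "run", *extra_flags, "-v", mount, image,
            "python", "./gridtools/gtools.py",
            "-cfg", "-fn", f"{CONTAINER_DIR}/config.yml"]


host = to_docker_host_path(r"C:\Users\Neil\Desktop\Work\s23\ioFiles")
print(" ".join(build_docker_command(host, "aerosol-df:v0",
                                    extra_flags=("-it", "--gpus", "all"))))
```

The resulting list can be passed to `subprocess.run`. On Linux or macOS hosts the path conversion is unnecessary, and the host path can be used directly.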
## Example Inputs / Outputs
Navigate to the gridtools folder to run commands:
-$ python3 gtools.py [commands] [flags]
One example of this (run the yaml config file):
-$ python3 gtools.py -cfg -fn "PATH/config.yml"
where "-cfg" is the "config yaml" command and "-fn" is the filename flag for the path that follows.
#### Inputs
Inputs can be netCDF4 or HDF4 files. Sample files can be found in the respective folder in the "SampleInputs 0000-0059 01-01-2020" folder. These files can be found on NASA LAADS DAAC:
https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/
One can also request files at the NASA website (look under Atmosphere-Aerosol):
https://ladsweb.modaps.eosdis.nasa.gov/search/
#### Outputs
Outputs are netCDF4 files. Sample output files can be found in the respective folder in the "SampleOutputs 0000-0059 01-01-2020" folder.
Each output file contains the fused statistics and grid for the input files within one time interval. If the input files range from 00:00 to 23:59 for a single day and the time interval is 30 minutes, 48 files are produced (one per 30-minute interval). These times can be changed by the user.
The sample output files were produced from the provided sample input files; they grid, fuse, and provide statistics for Optical_Depth_Land_And_Ocean and Solar_Azumith between 00:00 and 01:00 on Jan 1, 2020.
Raw data
{
"_id": null,
"home_page": "https://github.com/jwei-openscapes/aerosol-data-fusion",
"name": "pyroscopegriddingcpu",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "data fusion, satellite, L2, L3",
"author": "Sally Zhao, Neil Gutkin",
"author_email": "zhaosally0@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/bc/a7/68c6f2b59c1f930cbbbca2ff982298a0a94619c4151e0a6bdd8f30abdf9c/pyroscopegriddingcpu-1.4.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Data fusion package for transforming L2 satellite to L3 spatial-temporal gridded data",
"version": "1.4.1.0",
"project_urls": {
"Homepage": "https://github.com/jwei-openscapes/aerosol-data-fusion"
},
"split_keywords": [
"data fusion",
" satellite",
" l2",
" l3"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1a0ea6eb2f845ac630d6c4fc4c590cd474e48bb573944dbd3ced2ba0a751c74b",
"md5": "d0a90327b5e1cd6c4257e701e45335dd",
"sha256": "fb99e717988039fefe001e2b6cec64bc36345c2f021d927fe4e44bb220b85869"
},
"downloads": -1,
"filename": "pyroscopegriddingcpu-1.4.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d0a90327b5e1cd6c4257e701e45335dd",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 33752,
"upload_time": "2024-09-21T17:45:47",
"upload_time_iso_8601": "2024-09-21T17:45:47.929284Z",
"url": "https://files.pythonhosted.org/packages/1a/0e/a6eb2f845ac630d6c4fc4c590cd474e48bb573944dbd3ced2ba0a751c74b/pyroscopegriddingcpu-1.4.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bca768c6f2b59c1f930cbbbca2ff982298a0a94619c4151e0a6bdd8f30abdf9c",
"md5": "2fe444ff19c055031ff1ac8cd3839389",
"sha256": "3f7954a3f8e7ab184289513d5c6b1d758e3d8371bcd9143ea71dad7e42c5523a"
},
"downloads": -1,
"filename": "pyroscopegriddingcpu-1.4.1.0.tar.gz",
"has_sig": false,
"md5_digest": "2fe444ff19c055031ff1ac8cd3839389",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 33493,
"upload_time": "2024-09-21T17:45:49",
"upload_time_iso_8601": "2024-09-21T17:45:49.758324Z",
"url": "https://files.pythonhosted.org/packages/bc/a7/68c6f2b59c1f930cbbbca2ff982298a0a94619c4151e0a6bdd8f30abdf9c/pyroscopegriddingcpu-1.4.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-21 17:45:49",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jwei-openscapes",
"github_project": "aerosol-data-fusion",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "pyroscopegriddingcpu"
}