ncagg 0.8.18: Utility for aggregation of NetCDF data.

- Home page: https://github.com/5tefan/ncagg
- Documentation: http://ncagg.readthedocs.io/en/latest/
- Author: Stefan Codrescu
- Uploaded: 2024-01-21 19:54:53
- Requirements: cerberus, click, netCDF4, numpy
# NetCDF Aggregation (ncagg)

So... you want to aggregate time series NetCDF files?


## TL;DR

Install the utility with pip:
```
pip install ncagg
```

On the command line, use `ncagg`:

```
Usage: ncagg [OPTIONS] DST [SRC]...

  Aggregate NetCDF files.

Options:
  -v, --version                   Show the version and exit.
  --generate_template FILE        Print the default template generated for
                                  PATH and exit.

  -u TEXT                         Give an Unlimited Dimension Configuration as
                                  udim:ivar[:hz[:hz]]

  -c TEXT                         Give a Chunksize Configuration as
                                  udim:chunksize to chunk the unlimited
                                  dimension udim by chunksize

  -b TEXT                         If -u given, specify bounds for ivar as
                                  min:max or Tstart[:[T]stop]. min and max are
                                  numerical, otherwise T indicates start and
                                  stop are times. start and stop are of the
                                  form YYYY[MM[DD[HH[MM]]]] and if stop is
                                  omitted, it will be inferred to be the least
                                  significantly specified date + 1.

  -l [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  log level
  -t FILENAME                     Specify a configuration template
  --help                          Show this message and exit.
```

Notes:

 - DST is the filename for the NetCDF output; if it already exists, it will be overwritten.
 - SRC is a list of input NetCDF files to aggregate, which can be passed on the command line or piped to ncagg.
 - `-u` can specify an Unlimited Dimension Configuration. See below for details.
 - Aggregating a day of data can take tens of minutes; a progress bar indicates the time remaining.
 - For fine grained control over the output, specify a configuration template (`-t`). See below for details.
 - Set the unlimited dimension chunksize with `-c` for a smaller output file size.

Examples:

1. Explicitly list files to aggregate:
    - `ncagg output_filename.nc file_0.nc file_02.nc #...`
2. Aggregate by globbing files:
    - `ncagg output_filename.nc path_to_files/*.nc`
3. Sort the unlimited dimension record_number, according to the variable time:
    - `ncagg -u record_number:time output_filename.nc path_to_files/*.nc`
4. Sort the unlimited dimension record_number, according to the variable time, and insert or remove
fill values to ensure time occurs at 10hz:
    - `ncagg -u record_number:time:10 output_filename.nc path_to_files/*.nc`
5. Sort the unlimited dimension record_number, according to the variable time, and insert or remove
fill values to ensure time occurs every minute (note: 0.0166666 ~ 1.0/60.0):
    - `ncagg -u record_number:time:0.0166666 output_filename.nc path_to_files/*.nc`
6. Only include time values from 2017-06-01 to 2017-06-02 (bounds), including sorting and
filling, as above:
    - `ncagg -u record_number:time:10 -b T20170601:T20170602 output_filename.nc path_to_files/*.nc`
7. Or equivalently, if only one bound is specified, the end is inferred to be the least significant specified component + 1:
    - `ncagg -u record_number:time:10 -b T20170601 output_filename.nc path_to_files/*.nc`
8. Aggregate more files than fit on the command line (in case of: Argument list too long):
    - `find /path/to/files -type f -name "*.nc" | ncagg output.nc`

For more information, see the Unlimited Dimension Configuration below. The
`ncagg` Command Line Interface (CLI) builds a Config based on the arguments
specified. Fine grained control over the config can be exercised by providing a
config template.

## High level overview

Aggregation works in two stages:

1. Create an Aggregation List describing the steps and order of aggregation.
2. Evaluate the Aggregation List.

The Aggregation List object is just a list that describes the order in which to combine the components of an aggregation.
The objects within the list represent either source files or segments of fill values. Source file objects carry
sorting and filling instructions for the data within the file. Fill segments indicate where, and how many, fill values to create.

During stage 1, the Aggregation List is generated. The level of configuration given determines how much is done here.
At most, each file is inspected according to its unlimited dimension and the variable that indexes it to determine
sorting and filling. During this stage, no data except the index_by variables are read, and nothing is written to disk. If
an expected cadence is not provided, filling is not done. If bounds are provided, the unlimited dimension is clipped
to ensure that only data within the bounds is included. With minimal configuration, files are simply assembled in
order of sorted filename.

During stage 2, the Aggregation List is evaluated. Evaluating the Aggregation List simply means iterating over the
components it contains and copying data from them into the output aggregation file, while keeping track of global attributes.
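
To make the two stages concrete, here is a minimal sketch of driving them from Python. The function
names and signatures (`generate_aggregation_list`, `evaluate_aggregation_list`) are assumptions based
on this description, so check the ncagg source for the actual interface:

```python
# Sketch only: the stage function names/signatures below are assumptions,
# not a verified ncagg API; see ncagg's source for the actual interface.
from ncagg import Config
from ncagg.aggregator import generate_aggregation_list, evaluate_aggregation_list

config = Config.from_nc("file1.nc")

# Stage 1: build the Aggregation List (only index_by variables are read).
agg_list = generate_aggregation_list(config, ["file1.nc", "file2.nc"])

# Stage 2: evaluate it, copying data into the output file.
evaluate_aggregation_list(config, agg_list, "output.nc")
```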

Reasons for using this approach:

 - Possible to aggregate more data than fits in memory.
 - Sort once per unlimited dimension.
 - Modular code, easier to maintain, extend, and debug.


## Configuration

The sophistication of the aggregation is determined by how much configuration information is given when
the Aggregation List is generated.

 - No Config -> agg files along unlimited dims, sorted by filename.
 - Config with index_by -> agg such that index_by is in ascending order.
 - Config with index_by and bounds -> agg such that index_by is in ascending order within bounds.
 - Config with index_by and expected_cadences -> agg and regularize, removing duplicates/inserting fills if needed.

The Config contains the information that a NetCDF CDL specification would, but in JSON format, extended
with aggregation configuration information. If not provided, a default version is created from the first
file in the list to aggregate.

The Config contains three properties (keys):

 - dimensions
 - variables
 - attributes

Each property is associated with a list of objects so as to preserve ordering. The order of the
objects corresponds to the order of appearance in the output. Objects in all sections
have a "name" property.

Dimension objects specify the dimensions of the file and have at minimum a "name" and a "size",
which can be null for an unlimited dimension. Unlimited dimensions may also have an
Unlimited Dimension Configuration, described in a dedicated section below.

Variable objects contain a "name", "dimensions", "datatype", "attributes", and
"chunksizes". The dimensions property is a list of dimension names on which the variable depends; each
must be configured in the dimensions section. The datatype is something like int8, float32, string, etc.
Finally, attributes is another property containing keys and values corresponding to variable attributes,
commonly including "units", "valid_min", "_FillValue", etc.

Attribute objects contain a "name", a "strategy", and optionally a "value" for NetCDF Global Attributes. The
strategies are described below.

### Unlimited Dimension Configuration

The Unlimited Dimension Configuration associates a particular unlimited dimension with a variable by which
it can be indexed. Commonly, a dimension named time is associated with a variable also named time which 
indicates some epoch value for all data associated with that index of the dimension.

For example, a file may have a dimension "record_number" which is indexed by a variable "time". Using
the Unlimited Dimension Configuration, we can specify to aggregate record_number such that the variable
"time" forms a monotonic sequence increasing at some expected frequency.

Here is what a typical Unlimited Dimension Configuration for a GOES-R L1b product looks like:

```json
{
    "name": "report_number",
    "size": null,
    "index_by": "time",
    "expected_cadence": {"report_number": 1},
}
```

In English, the configuration above says: "Order the dimension report_number by the values in the variable time, where
time values are expected to increase along the dimension report_number, incrementing at 1hz." This would be specified
to the ncagg CLI as `ncagg -u report_number:time:1 output.nc in1.nc in2.nc`.

The configuration even allows indexing by a multidimensional time variable (ahem, mag with 10 samples per report). On the command
line this is specified as `-u report_number:OB_time:1:10`, or as json:

```json
{
    "name": "report_number",
    "size": null,
    "index_by": "OB_time",
    "other_dim_indicies": {"samples_per_record": 0},
    "expected_cadence": {"report_number": 1, "number_samples_per_report": 10},
}
```

One design constraint was to not reshape the data, so above, we order the data by looking at index 0 of
samples_per_record for every value along the report_number dimension. We assume that the other timestamps along
samples_per_record are correct. Also, given the configuration above, we only insert fill records of OB_time
if a full report_number record is missing (i.e., all 10 values along the number_samples_per_report dimension are missing).
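
As a small illustration of that ordering rule (toy values, made-up dimension names):

```python
# Toy illustration of ordering records by index 0 of the extra dimension;
# the values and dimension names here are made up, not ncagg internals.
import numpy as np

# OB_time with shape (report_number, samples_per_record)
OB_time = np.array([[10.0, 10.1],
                    [12.0, 12.1],
                    [11.0, 11.1]])
order = np.argsort(OB_time[:, 0])  # look only at sample index 0 per record
print(order)                       # [0 2 1]
```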

------------------------

Indexing an unlimited dimension was described above. In addition to simply indexing by a variable, when the
variable represents time, a common operation is to restrict values to some range, for example to create
a day file. The Unlimited Dimension Configuration would look like:

```json
{
    "name": "report_number",
    "size": null,
    "index_by": "time",
    "min": 14000000,
    "max": 14000060,
    "expected_cadence": {"report_number": 1}
}
```
Here min and max are in the units of the variable "time", expected to be something like
"seconds since 2000-01-01 12:00:00". This would be specified on the command line as
`... -u report_number:time:1 -b 14000000:14000060 ...`, where the `-b` option stands for "bounds".

As min and max almost always indicate datetime values, for convenience they are accepted as numerical,
string, or Python datetime types. In string representation, they must start with "T" and
be of the form "TYYYY[MM[DD[HH[MM]]]]", where brackets indicate optional components which, if omitted, are inferred to be the
minimum valid value, i.e. 01 for MM (month). A units attribute must be available for the index_by variable in the
form "<time units> since <reference time>". On the command line, string times can be given as
`... -u report_number:time:1 -b T20170101:T20170102 ...`, or equivalently the end bound can be omitted and will be
inferred to be the least significant specified component of the start, incremented by one, i.e.:
`... -u report_number:time:1 -b T20170101 ...`.
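
To illustrate the inference rule, here is a sketch of the described behavior. This is not ncagg's
actual parser, just a self-contained restatement of the rule in Python:

```python
# Illustrative sketch of the "TYYYY[MM[DD[HH[MM]]]]" rule described above;
# this is NOT ncagg's actual parser.
from datetime import datetime, timedelta

def parse_bound(s):
    """Parse 'TYYYY[MM[DD[HH[MM]]]]' -> (datetime, number of components given)."""
    digits = s.lstrip("T")
    parts = [int(digits[:4])]          # YYYY is mandatory
    rest = digits[4:]
    while rest:                        # then 2-digit MM, DD, HH, MM
        parts.append(int(rest[:2]))
        rest = rest[2:]
    defaults = [1, 1, 0, 0]            # minimum valid month, day, hour, minute
    return datetime(*(parts + defaults[len(parts) - 1:])), len(parts)

def infer_stop(start_str):
    """Omitted stop: least significant specified component of start, + 1."""
    start, n = parse_bound(start_str)
    if n == 1:
        return start.replace(year=start.year + 1)
    if n == 2:
        extra_years, month0 = divmod(start.month, 12)
        return start.replace(year=start.year + extra_years, month=month0 + 1)
    return start + {3: timedelta(days=1), 4: timedelta(hours=1),
                    5: timedelta(minutes=1)}[n]

print(parse_bound("T20170601")[0])  # 2017-06-01 00:00:00
print(infer_stop("T20170601"))      # 2017-06-02 00:00:00
```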



------------------------

Consider the suvi-l2-flloc (flare location) product, which has two unlimited dimensions: time and feature_number.
At any time record, there can exist an arbitrary number of features. Consider a variable reporting the flux from
each feature at each time: `flux(time, feature_number)`. Although feature_number is unlimited, it is unique to
each time and thus needs to be "flattened":

```
flux([0], [0]) -> [[3.2e-6]]
flux([0], [0, 1]) -> [[3.3e-6, 5.4e-7]]

undesired_aggregated_flux(time, feature_number):
[[3.2e-6,      _,      _],
 [     _, 3.3e-6, 5.4e-7]]

desired_aggregated_flux(time, feature_number):
[[3.2e-6,      _],
 [3.3e-6, 5.4e-7]]
```

The `desired_aggregated_flux` is achieved by setting {"flatten": true} within the unlimited dimension configuration for feature_number.
```json
[{
    "name": "time",
    "size": null,
    "index_by": "time"
}, {
    "name": "feature_number",
    "size": null,
    "flatten": true
}]
```

#### Specify Global Attribute Aggregation Strategies

The aggregated NetCDF file contains global attributes formed from the constituent granules. A number of
strategies exist to aggregate Global Attributes across the granules. Most are quite self-explanatory:

 - "static": use the configured "value" in the template, ignoring any values that may be in the file.
 - "first": first value seen will be taken as the output value for this global attribute
 - "last": the last value seen will be taken as global attribute
 - "unique_list": compile values into a unique list "first, second, etc"
 - "int_sum": resulting in integer sum of the inputs
 - "float_sum": StratFloatSum
 - "constant": StratAssertConst, similar to first, but raises an error if value changes among input files.
 - "date_created": simply yeilds the current date when finalized, standard dt fmt
 - "time_coverage_start": start bound, if specified, standard dt fmt
 - "time_coverage_end": end bound, if specified, standard dt fmt
 - "filename": StratOutputFilename, set attribute to name of output file
 - "remove": remove/do not include this global attribute
 - "first_input": Filename of first file included in aggregate
 - "last_input": Filename of last file included in aggregate
 - "input_count": Number of files included in aggregate
 - "ncagg_version": Version number for the ncagg software running

 
The configuration format expects a key "global attributes" associated with a list of objects, each containing
a global attribute name, strategy, and possibly a value (for static). A list is used to preserve order, as the
order in the configuration will be the resulting order in the output NetCDF.
 

```json
{
    "global attributes": [
        {
            "name": "production_site",
            "strategy": "unique_list"
        }, {
            "name": "creator",
            "strategy": "static",
            "value": "Stefan Codrescu"
        },
        ...
     ]
}
```

#### Specify Dimension Indices to Extract and Flatten

NOT IMPLEMENTED. IN PROGRESS. SUBJECT TO CHANGE.

Consider SEIS SGPS files, which contain data from two sensor units, +X and -X. Most variables are of the form
var[record_number, sensor_unit, channel, ...]. It is possible to create an aggregate file for the +X and -X sensor
units individually using the take_dim_indicies configuration key.

```json
{
    "take_dim_indicies": {
        "sensor_unit": 0
    }
}
```

With the above configuration, sensor_unit must be removed from the dimensions configuration. Please also ensure that
variables do not list sensor_unit as a dimension, and update chunk sizes accordingly. Chunk sizes must be a list
of values of the same length as dimensions.
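
The intended effect (again, not implemented yet) is an index selection along that dimension, as
sketched below with toy data:

```python
# Toy sketch of the intended effect of take_dim_indicies (not yet
# implemented in ncagg): select a single index along one dimension.
import numpy as np

# var(record_number=4, sensor_unit=2, channel=3)
var = np.arange(4 * 2 * 3).reshape(4, 2, 3)
plus_x = var[:, 0, :]  # {"sensor_unit": 0} -> var(record_number, channel)
print(plus_x.shape)    # (4, 3)
```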


### Configuration Template

`ncagg` can be configured to output files in a format specified by a configuration template, a JSON file.
A generic template can be created using the `ncagg --generate_template [SAMPLE_NC]`
command. The output of the template command is the default template that is used internally if no template is specified.

#### Example usage

Use `ncagg --generate_template example_netcdf.nc > my_template.json` to save the default template for an example_netcdf.nc file
into my_template.json. Edit my_template.json to your liking, then run aggregation using `ncagg -t my_template.json [...]`.

#### Template syntax

The template syntax is verbose, but hopefully straightforward and clear. The incoming template is validated
when an aggregation is initiated, but some issues may only be found at runtime.

##### Attributes

The attributes section is a list of objects containing global attributes:

- name: name of global attribute
- strategy: [aggregation strategy](#Specify-Global-Attribute-Aggregation-Strategies) to use for attribute.
- value: value used by the strategy, if required, e.g. for constant, where the value is "test".

##### Dimensions

The dimensions section is a list of objects containing the dimensions of the file. Most configuration options
are covered in the [Unlimited Dimension Configuration](#Unlimited-Dimension-Configuration) section, but to clarify:

- size: integer if dimension has a fixed size. null if it's unlimited.

##### Variables

Similarly, the variables section is a list of objects configuring output variables. Remove the object
corresponding to a variable to remove it from the output.

Important notes:

- The dimensions referenced must exist.
- Chunksizes must be the same number of elements as dimensions.
- Use `copy_from_alt` to specify a list of alternative variables to copy data from if a variable
with `name` isn't found.

Take care that everything is consistent when doing heavy modifications.

## Use from code

In addition to the CLI, `ncagg` exposes an API which makes it possible to call it from Python code:

```python
from ncagg import aggregate
aggregate(["file1.nc", "file2.nc"], "output.nc")
```

`aggregate` optionally accepts a configuration template as a third argument. If none is given,
a default template created from the first input file is used. Thus the code above is equivalent to:

```python
from ncagg import aggregate, Config
config = Config.from_nc("file1.nc")
aggregate(["file1.nc", "file2.nc"], "output.nc", config)
```

This allows the configuration to be manipulated programmatically at runtime before
performing aggregation.
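
For example, an Unlimited Dimension Configuration could be attached before aggregating. The sketch
below assumes Config exposes its dimension objects through a dict-like `dims` attribute; that
attribute name is an assumption, so check the Config class for the actual interface:

```python
# Sketch: attach an Unlimited Dimension Configuration at runtime.
# The `dims` attribute and its update() access pattern are assumptions
# about Config's interface, mirroring the JSON template fields.
from ncagg import aggregate, Config

config = Config.from_nc("file1.nc")
config.dims["report_number"].update({
    "index_by": "time",
    "expected_cadence": {"report_number": 1},
})
aggregate(["file1.nc", "file2.nc"], "output.nc", config)
```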

## Limitations

 - Does not support netCDF4 enum types.

## Technical and Implementation details

An Aggregation List is composed of two types of objects: InputFileNode and FillNode. Both inherit
from an AbstractNode and must implement the `get_size_along(unlimited_dim)` and `data_for(var, dim)`
methods. Evaluating an aggregation list is simply going through the Aggregation List and calling something like:

```python
nc_out.variables[var][write_slice] = node.data_for(var)
```

`data_for` must return data consistent with the size promised by `node.get_size_along(dim)`.
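
Spelled out slightly more, the evaluation loop looks roughly like the following simplified sketch
(illustrative only, not ncagg's actual implementation):

```python
# Simplified sketch of evaluating an Aggregation List; illustrative only,
# not ncagg's actual implementation.
def evaluate(aggregation_list, nc_out, unlim_dim):
    """Copy each node's data into nc_out, tracking the write position."""
    write_index = 0
    for node in aggregation_list:
        n = node.get_size_along(unlim_dim)
        write_slice = slice(write_index, write_index + n)
        for var in nc_out.variables:
            nc_out.variables[var][write_slice] = node.data_for(var)
        write_index += n
```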

The complexity of aggregation comes in handling the dimensions and building the aggregation list. In addition to
the interface exposed by an AbstractNode, InputFileNode and FillNode each implement their own specific functionality.

A FillNode is simpler: it needs to be told how many fills to insert along a certain unlimited dimension, and
it can optionally be configured to return values from `data_for` that increase along multiple dimensions
according to configured `expected_cadence` values from a certain start value.
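
A toy node satisfying the interface described above (illustrative only, not ncagg's FillNode; the
`extra_shape` parameter is a stand-in for the variable's shape beyond the unlimited dimension):

```python
# Toy fill node satisfying the AbstractNode interface described above;
# illustrative only, not ncagg's FillNode.
import numpy as np

class ToyFillNode:
    def __init__(self, unlim_dim, count, fill_value=np.nan):
        self.unlim_dim = unlim_dim
        self.count = count
        self.fill_value = fill_value

    def get_size_along(self, unlimited_dim):
        # Promise `count` records along the configured unlimited dimension.
        return self.count if unlimited_dim == self.unlim_dim else 0

    def data_for(self, var, extra_shape=()):
        # Return fills consistent with the promised size (the real FillNode
        # can instead ramp values at an expected_cadence from a start value).
        return np.full((self.count,) + tuple(extra_shape), self.fill_value)

node = ToyFillNode("record_number", 3)
print(node.get_size_along("record_number"))  # 3
print(node.data_for("time").shape)           # (3,)
```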


An InputFileNode is more complicated. It exposes methods to find the time bounds of the file and, additionally,
is internally capable of sorting itself and inserting fill values into itself. Of course, it doesn't modify the
actual input file; this is all done on the fly as data is read out through `data_for`. Implementation-wise,
an InputFileNode may contain within itself a mini aggregation list containing two types of objects: slice and
FillNode objects. Similarly to the large-scale process of aggregating, an InputFileNode returns data that has
been assembled according to its internal aggregation list and internal sorting.


## Testing

This software is written for aggregation of GOES-R series Space Weather data products (L1b and L2+). As
such, it contains extensive tests against real GOES-16 satellite data. Many "features" in this code are
intended to address "quirks" in the ground processing (implemented by a certain contractor...).

Tests are in the `test` subdirectory. Run all tests with

```bash
python -m unittest discover 
```

The code is compatible with Python 2 (2.7) and Python 3, so unit tests should be run with both. One interesting
thing I've noticed is that the test suite appears to be about 20% faster in Python 3 than in Python 2.

Note: currently it is expected that 1 test fails.
 - test.seis.SEISL1bSGPSEAST_5min.test_SEISL1bSGPS fails because dimension subsetting has not
been reimplemented after a refactor that removed the feature.


## Development

Setting up a virtualenv is recommended for development.

```
virtualenv venv
. venv/bin/activate
pip install --editable .
```

---------------------

Deploy to pip after running the unit tests with both Python 2 and Python 3. The `git stash` is important so that
the build is from a clean repo! We don't want any dev or debug changes that are sitting unstaged to be included.

```bash
git stash
rm -r dist/
python setup.py bdist_wheel --universal
twine upload dist/*
```


            
