# ecml-tools

* Version: 0.6.1
* Home page: https://github.com/ecmwf-lab/ecml-tools
* Upload time: 2024-03-14 17:32:25
* Author: European Centre for Medium-Range Weather Forecasts (ECMWF)
* License: Apache License Version 2.0
* Keywords: tool

A package to hold various functions to support training of ML models on ECMWF data.

## Installation

This package will be a collection of tools, each with its own dependencies. To avoid
installing unnecessary dependencies, the package is split into parts.

For handling datasets, you will need to install the `data` extra:


```bash
pip install "ecml-tools[data]"
```

For provenance tracking, you will need to install the `provenance` extra:

```bash
pip install "ecml-tools[provenance]"
```

To install everything:

```bash
pip install "ecml-tools[all]"
```

# Datasets

A `dataset` wraps a `zarr` file that follows the format used by ECMWF to train its machine learning models.

```python
from ecml_tools.data import open_dataset

ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2")
```

The dataset can be passed as a path or URL to a `zarr` file, or as a name. In the latter case, the package uses the `zarr_root` entry of the `~/.ecml-tool` file to build the full path or URL:

```yaml
zarr_root: /path_or_url/to/the/zarrs
```
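
As a rough illustration of the name lookup, the sketch below joins a dataset name onto a `zarr_root` value. The helper `resolve_dataset` is hypothetical, and both the `.zarr` suffix and the rule used to recognise an existing path or URL are assumptions, not the package's actual logic.

```python
def resolve_dataset(name: str, zarr_root: str) -> str:
    """Hypothetical helper: turn a dataset name into a full path or URL."""
    # Assumption: anything containing a slash or ending in ".zarr"
    # is already a path or URL and is returned unchanged.
    if "/" in name or name.endswith(".zarr"):
        return name
    # Assumption: named datasets are stored as "<zarr_root>/<name>.zarr".
    return zarr_root.rstrip("/") + "/" + name + ".zarr"


print(
    resolve_dataset(
        "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "/path_or_url/to/the/zarrs",
    )
)
```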

## Attributes of a dataset

Like the underlying `zarr`, the `dataset` is iterable:

```python
from ecml_tools.data import open_dataset

ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2")

# Print the number of rows (i.e. dates):

print(len(ds))

# Iterate through the rows,

for row in ds:
    print(row)

# or access an item directly.

print(ds[10])

# You can retrieve the shape of the dataset,

print(ds.shape)

# the list of variables,

print(ds.variables)

# the mapping between variable names and column indices

two_t_index = ds.name_to_index["2t"]
row = ds[10]
print("2t", row[two_t_index])

# Get the list of dates (as NumPy datetime64)

print(ds.dates)

# The number of hours between consecutive dates

print(ds.frequency)

# The resolution of the underlying grid

print(ds.resolution)

# The list of latitudes of the data values (NumPy array)

print(ds.latitudes)

# The same for longitudes

print(ds.longitudes)

# And the statistics

print(ds.statistics)
```

The statistics are a dictionary of NumPy vectors that follow the order of the variables:

```python
{
    "mean": ...,
    "stdev": ...,
    "minimum": ...,
    "maximum": ...,
}
```

To get the statistics for `2t`:

```python
two_t_index = ds.name_to_index["2t"]
stats = ds.statistics
print("Average 2t", stats["mean"][two_t_index])
```
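
One common use of these statistics is to normalise values before training. That is an assumption about the workflow, not something the package is stated to do for you. The sketch below uses made-up numbers and plain lists as stand-ins for `ds.name_to_index`, `ds.statistics` and one value per variable; the `normalise` helper is hypothetical.

```python
# Stand-ins for ds.name_to_index, ds.statistics and one value per variable
# (made-up numbers, plain lists instead of NumPy vectors).
name_to_index = {"2t": 0, "msl": 1}
statistics = {
    "mean": [280.0, 101000.0],
    "stdev": [20.0, 1000.0],
}
values = [300.0, 100000.0]


def normalise(values, statistics):
    """Subtract the mean and divide by the standard deviation, per variable."""
    return [
        (value - mean) / stdev
        for value, mean, stdev in zip(values, statistics["mean"], statistics["stdev"])
    ]


normalised = normalise(values, statistics)
print("Normalised 2t", normalised[name_to_index["2t"]])
```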

## Subsetting datasets

You can create a view on the `zarr` file that selects a subset of dates.

### Changing the frequency

```python
from ecml_tools.data import open_dataset

ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    frequency="12h")
```

The `frequency` parameter can be an integer (in hours) or a string with the suffix `h` (hours) or `d` (days).
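
To make the accepted forms concrete, here is a minimal sketch of how such a value could be converted to hours. The function `frequency_to_hours` is hypothetical and only mirrors the rules stated above; the package may handle more cases or report errors differently.

```python
def frequency_to_hours(frequency) -> int:
    """Hypothetical helper: convert 12, "12h" or "1d" to a number of hours."""
    if isinstance(frequency, int):
        # Integers are already expressed in hours
        return frequency
    value, suffix = int(frequency[:-1]), frequency[-1]
    if suffix == "h":
        return value
    if suffix == "d":
        return value * 24
    raise ValueError(f"Unsupported frequency: {frequency!r}")


for example in (6, "12h", "1d"):
    print(example, "->", frequency_to_hours(example))
```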

### Selecting years

You can select ranges of years using the `start` and `end` keywords:

```python
from ecml_tools.data import open_dataset

training = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    start=1979,
    end=2020)

test = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    start=2021,
    end=2022)
```

The selection includes all the dates of the `end` year.

### Selecting more precise ranges

You can select a few months, or even a few days:

```python
from ecml_tools.data import open_dataset

training = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    start=202306,
    end=202308)

test = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    start=20200301,
    end=20200410)
```

The following are equivalent ways of describing `start` or `end`:

* `2020` and `"2020"`
* `202306`, `"202306"` and `"2023-06"`
* `20200301`, `"20200301"` and `"2020-03-01"`

You can omit either `start` or `end`. In that case, the first or last date of the dataset is used, respectively.
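
The equivalences above can be sketched with a small parser. The helper `parse_bound` is hypothetical. Expanding an `end` month to its last day is an assumption made by analogy with the rule for years; the source only states that all the dates of an `end` year are included.

```python
import calendar
import datetime


def parse_bound(value, is_end: bool) -> datetime.date:
    """Hypothetical helper: expand a year, month or day to a concrete date."""
    digits = str(value).replace("-", "")
    year = int(digits[:4])
    if len(digits) == 4:  # year only, e.g. 2020
        month, day = (12, 31) if is_end else (1, 1)
    elif len(digits) == 6:  # year and month, e.g. 202306
        month = int(digits[4:6])
        # Assumption: an end month runs to its last day
        day = calendar.monthrange(year, month)[1] if is_end else 1
    elif len(digits) == 8:  # full date, e.g. 20200301
        month, day = int(digits[4:6]), int(digits[6:8])
    else:
        raise ValueError(f"Cannot interpret {value!r}")
    return datetime.date(year, month, day)


# The integer, string and dashed forms all give the same date
print(parse_bound(202306, is_end=False))
print(parse_bound("202306", is_end=False))
print(parse_bound("2023-06", is_end=False))
```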

### Combining both

You can combine both subsetting methods:

```python
from ecml_tools.data import open_dataset

training = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    start=1979,
    end=2020,
    frequency="6h")
```

## Combining datasets

You can create a virtual dataset by combining two or more `zarr` files.

```python
from ecml_tools.data import open_dataset

ds = open_dataset(
    "dataset-1",
    "dataset-2",
    "dataset-3",
    ...
)
```

When given a list of `zarr` files, the package will automatically work out whether the files can be _concatenated_ or _joined_ by looking at the range of dates covered by each file.

If the dates are different, the files are concatenated. If the dates are the same, the files are joined. See below for more information.
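
The rule can be sketched as a comparison of date ranges. The function `combine_mode` is hypothetical, and representing each file by a `(first date, last date)` pair is a simplification chosen for this illustration.

```python
def combine_mode(date_ranges):
    """Hypothetical helper: choose "join" or "concat" from (first, last) date pairs."""
    if all(date_range == date_ranges[0] for date_range in date_ranges):
        # Same dates everywhere: the variables are combined
        return "join"
    # Different dates: the files are stacked along the dates dimension
    return "concat"


print(combine_mode([("1979-01-01", "2022-12-31"), ("1979-01-01", "2022-12-31")]))
print(combine_mode([("1940-01-01", "1978-12-31"), ("1979-01-01", "2022-12-31")]))
```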

## Concatenating datasets

You can concatenate two or more datasets along the dates dimension. The package will check that all datasets are compatible (same resolution, same variables, etc.). Currently, the datasets must be given in chronological order with no gaps between them.

```python
from ecml_tools.data import open_dataset

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1940-1978-1h-v2",
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2"
)
```

![Concatenation](concat.png)

Please note that you can pass more than two `zarr` files to the function.
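
The requirement of chronological order with no gaps can be sketched as follows. The function `check_contiguous` is hypothetical, and the package's own compatibility checks may be stricter or differ in detail.

```python
import datetime


def check_contiguous(date_ranges, frequency_hours):
    """Hypothetical check: each dataset must start one step after the previous one ends."""
    step = datetime.timedelta(hours=frequency_hours)
    for (_, previous_last), (next_first, _) in zip(date_ranges, date_ranges[1:]):
        if next_first != previous_last + step:
            raise ValueError(f"Gap or overlap between {previous_last} and {next_first}")


# Hourly data for 1940-1978 followed by hourly data for 1979-2022
ranges = [
    (datetime.datetime(1940, 1, 1, 0), datetime.datetime(1978, 12, 31, 23)),
    (datetime.datetime(1979, 1, 1, 0), datetime.datetime(2022, 12, 31, 23)),
]
check_contiguous(ranges, frequency_hours=1)  # raises ValueError if the ranges do not line up
```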

> **_NOTE:_** When concatenating files, the statistics are not recomputed; the statistics of the first file are returned to the user.

## Joining datasets

You can join two datasets that have the same dates, combining their variables.

```python
from ecml_tools.data import open_dataset

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    "some-extra-parameters-from-another-source-o96-1979-2022-1h-v2",
)
```

![Join](join.png)

If a variable is present in more than one file, the last occurrence of that variable is used, and it is placed at the position of the first occurrence of that name.

![Overlay](overlay.png)
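
This override rule happens to match how a Python `dict` behaves: re-assigning an existing key keeps its original position but replaces its value. The sketch below relies on that to illustrate the rule. The helper `join_variables` and the `file-1`/`file-2` labels are hypothetical, and this is not how the package is implemented.

```python
def join_variables(*datasets):
    """Hypothetical helper: each dataset is a list of (variable name, source label) pairs."""
    joined = {}
    for variables in datasets:
        for name, source in variables:
            # Re-assigning an existing key keeps its original position in the dict
            joined[name] = source
    return list(joined.items())


first = [("2t", "file-1"), ("msl", "file-1"), ("tp", "file-1")]
second = [("sst", "file-2"), ("msl", "file-2")]

# "msl" stays in second position but now comes from file-2
print(join_variables(first, second))
```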

Please note that you can join more than two `zarr` files.

## Selection, ordering and renaming of variables

You can select a subset of variables when opening a `zarr` file. If you pass a `list`, the variables are ordered according to that list. If you pass a `set`, the order of the file is preserved.

```python
from ecml_tools.data import open_dataset

# Select '2t' and 'tp' in that order

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    select = ["2t", "tp"],
)

# Select '2t' and 'tp', but preserve the order in which they are in the file

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    select = {"2t", "tp"},
)
```
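
The difference between the two forms can be sketched on a plain list of variable names. The helper `select_variables` and the file order used here are hypothetical.

```python
def select_variables(file_variables, select):
    """Hypothetical helper mirroring the list/set behaviour described above."""
    missing = set(select) - set(file_variables)
    if missing:
        raise KeyError(f"Unknown variables: {sorted(missing)}")
    if isinstance(select, set):
        # A set carries no order, so keep the order of the file
        return [name for name in file_variables if name in select]
    # A list imposes its own order
    return list(select)


file_variables = ["10u", "10v", "tp", "2t"]  # made-up file order
print(select_variables(file_variables, ["2t", "tp"]))
print(select_variables(file_variables, {"2t", "tp"}))
```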

You can also drop some variables:

```python
from ecml_tools.data import open_dataset


ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    drop = ["10u", "10v"],
)
```

and reorder them:

```python
from ecml_tools.data import open_dataset

# ... using a list

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    reorder = ["2t", "msl", "sp", "10u", "10v"],
)

# ... or using a dictionary

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    reorder = {"2t": 0, "msl": 1, "sp": 2, "10u": 3, "10v": 4},
)
```
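
The list and dictionary forms above describe the same ordering, which the following sketch makes explicit. The helper `reorder_variables` is hypothetical, and requiring every variable to be listed is an assumption of this sketch.

```python
def reorder_variables(file_variables, reorder):
    """Hypothetical helper: return the variable names in the requested order."""
    if isinstance(reorder, dict):
        # Sort the names by their requested position
        reorder = sorted(reorder, key=reorder.get)
    # Assumption: every variable of the file must be mentioned exactly once
    if sorted(reorder) != sorted(file_variables):
        raise ValueError("reorder must mention every variable exactly once")
    return list(reorder)


file_variables = ["10u", "10v", "2t", "msl", "sp"]  # made-up file order
as_list = reorder_variables(file_variables, ["2t", "msl", "sp", "10u", "10v"])
as_dict = reorder_variables(
    file_variables, {"2t": 0, "msl": 1, "sp": 2, "10u": 3, "10v": 4}
)
print(as_list == as_dict)
```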

You can also rename variables:

```python
from ecml_tools.data import open_dataset


ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    rename = {"2t": "t2m"},
)
```

This is useful when you join datasets and do not want variables from one dataset to override those from the other.


## Using all options

You can combine all of the above:


```python
from ecml_tools.data import open_dataset

ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    "some-extra-parameters-from-another-source-o96-1979-2022-1h-v2",
    start=2000,
    end=2001,
    frequency="12h",
    select={"2t", "2d"},
    ...
)
```

## Building a dataset from a configuration

In practice, you will be building datasets from a configuration file, such as a YAML file:


```python
import yaml
from ecml_tools.data import open_dataset

with open("config.yaml") as f:
    config = yaml.safe_load(f)

training = open_dataset(config["training"])
test = open_dataset(config["test"])
```
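
The code above expects a configuration with `training` and `test` entries. The sketch below builds one as a Python dictionary and round-trips it through JSON, which is used here only to stay within the standard library; a YAML file would hold the same structure. The years and frequency are illustrative values reused from earlier examples, and the sketch assumes that `start` and `end` are accepted as dictionary keys in the same way as `frequency`.

```python
import json

# Illustrative configuration; "start" and "end" as dictionary keys are an assumption
config = {
    "training": {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "start": 1979,
        "end": 2020,
        "frequency": "6h",
    },
    "test": {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "start": 2021,
        "end": 2022,
        "frequency": "6h",
    },
}

text = json.dumps(config, indent=2)
loaded = json.loads(text)

# loaded["training"] and loaded["test"] are plain dictionaries,
# the kind of object that would be passed to open_dataset
print(sorted(loaded))
```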

This is possible because a dataset can be built from simple lists and dictionaries:

```python
from ecml_tools.data import open_dataset

# From a string

ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2")

# From a list of strings

ds = open_dataset(
    [
        "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "aifs-ea-an-oper-0001-mars-o96-2023-2023-1h-v2",
    ]
)


# From a dictionary

ds = open_dataset(
    {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "frequency": "6h",
    }
)

# From a list of dictionaries

ds = open_dataset(
    [
        {
            "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
            "frequency": "6h",
        },
        {
            "dataset": "some-extra-parameters-from-another-source-o96-1979-2022-1h-v2",
            "frequency": "6h",
            "select": ["sst", "cape"],
        },
    ]
)

# And even deeper constructs

ds = open_dataset(
    [
        {
            "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
            "frequency": "6h",
        },
        {
            "dataset": [
                {
                    "dataset": "aifs-od-an-oper-8888-mars-o96-1979-2022-6h-v2",
                    "drop": ["ws"],
                },
                {
                    "dataset": "aifs-od-an-oper-9999-mars-o96-1979-2022-6h-v2",
                    "select": ["ws"],
                },
            ],
            "frequency": "6h",
            "select": ["sst", "cape"],
        },
    ]
)
```
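
To show how such nested structures can be read, the sketch below walks a configuration and collects the `zarr` names it mentions. The helper `dataset_names` is hypothetical and ignores every option other than `dataset`; it illustrates the shape of the configuration, not how the package processes it.

```python
def dataset_names(config):
    """Hypothetical helper: list the zarr names mentioned in a nested configuration."""
    if isinstance(config, str):
        return [config]
    if isinstance(config, dict):
        # The "dataset" entry may itself be a string, a list or a dictionary
        return dataset_names(config["dataset"])
    # Otherwise, a list or tuple of configurations
    names = []
    for item in config:
        names.extend(dataset_names(item))
    return names


nested = [
    {"dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2", "frequency": "6h"},
    {
        "dataset": [
            {"dataset": "aifs-od-an-oper-8888-mars-o96-1979-2022-6h-v2", "drop": ["ws"]},
            {"dataset": "aifs-od-an-oper-9999-mars-o96-1979-2022-6h-v2", "select": ["ws"]},
        ],
        "frequency": "6h",
    },
]
print(dataset_names(nested))
```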

            
