# ecml-tools
A package to hold various functions to support training of ML models on ECMWF data.
## Installation
This package will be a collection of tools, each with its own dependencies. To avoid
installing unnecessary dependencies, the package is split into parts.
For handling datasets, you will need to install the `data` extra:
```bash
pip install ecml-tools[data]
```
For provenance tracking, you will need to install the `provenance` extra:
```bash
pip install ecml-tools[provenance]
```
To install everything:
```bash
pip install ecml-tools[all]
```
# Datasets
A `dataset` wraps a `zarr` file that follows the format used by ECMWF to train its machine learning models.
```python
from ecml_tools.data import open_dataset
ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2")
```
The dataset can be passed as a path or URL to a `zarr` file, or as a name. In the latter case, the package uses the `zarr_root` entry of the `~/.ecml-tool` file to create the full path or URL:
```yaml
zarr_root: /path_or_url/to/the/zarrs
```
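The lookup described above can be pictured with the following sketch. It is not the package's code: `resolve_dataset_path` is a hypothetical helper, and both the test for "already a path or URL" and the `.zarr` suffix appended to names are assumptions made for illustration.

```python
def resolve_dataset_path(name, zarr_root):
    """Hypothetical helper: turn a dataset name into a full path or URL."""
    # Assumption: anything containing a slash or ending in ".zarr" is
    # already a path or URL and is returned unchanged.
    if "/" in name or name.endswith(".zarr"):
        return name
    # Assumption: a bare name maps to "<zarr_root>/<name>.zarr".
    return f"{zarr_root.rstrip('/')}/{name}.zarr"


root = "/path_or_url/to/the/zarrs"
print(resolve_dataset_path("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2", root))
print(resolve_dataset_path("/data/my-dataset.zarr", root))
```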
## Attributes of a dataset
Like the underlying `zarr`, the `dataset` is iterable:
```python
from ecml_tools.data import open_dataset
ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2")
# Print the number of rows (i.e. dates):
print(len(ds))
# Iterate through the rows,
for row in ds:
    print(row)
# or access an item directly.
print(ds[10])
# You can retrieve the shape of the dataset,
print(ds.shape)
# the list of variables,
print(ds.variables)
# the mapping between variable names and column indices
two_t_index = ds.name_to_index["2t"]
row = ds[10]
print("2t", row[two_t_index])
# Get the list of dates (as NumPy datetime64)
print(ds.dates)
# The number of hours between consecutive dates
print(ds.frequency)
# The resolution of the underlying grid
print(ds.resolution)
# The list of latitudes of the data values (NumPy array)
print(ds.latitudes)
# The same for longitudes
print(ds.longitudes)
# And the statistics
print(ds.statistics)
```
The statistics are a dictionary of NumPy vectors that follow the order of the variables:
```python
{
    "mean": ...,
    "stdev": ...,
    "minimum": ...,
    "maximum": ...,
}
```
To get the statistics for `2t`:
```python
two_t_index = ds.name_to_index["2t"]
stats = ds.statistics
print("Average 2t", stats["mean"][two_t_index])
```
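A common use of these vectors is to standardise values before training. The sketch below uses made-up numbers and plain Python lists in place of `ds.statistics` and `ds.name_to_index`; `normalise` is a hypothetical helper, not part of the package.

```python
# Stand-in values: plain lists replace the NumPy vectors of a real dataset.
name_to_index = {"2t": 0, "msl": 1}
statistics = {
    "mean": [280.0, 101000.0],
    "stdev": [20.0, 1000.0],
}


def normalise(value, name):
    """Standardise one value of variable `name` using its mean and stdev."""
    i = name_to_index[name]
    return (value - statistics["mean"][i]) / statistics["stdev"][i]


print("Normalised 2t:", normalise(300.0, "2t"))
```

The same index is used for every statistic because all vectors follow the order of the variables.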
## Subsetting datasets
You can create a view on the `zarr` file that selects a subset of dates.
### Changing the frequency
```python
from ecml_tools.data import open_dataset
ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
                  frequency="12h")
```
The `frequency` parameter can be an integer (in hours) or a string ending with the suffix `h` (hours) or `d` (days).
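To make the accepted forms concrete, here is a minimal sketch of how such a value could be converted to hours. `frequency_to_hours` is a hypothetical helper written for this illustration; the package may parse the parameter differently.

```python
def frequency_to_hours(frequency):
    """Hypothetical helper: convert a frequency to a number of hours."""
    if isinstance(frequency, int):
        return frequency  # integers are already expressed in hours
    value, unit = int(frequency[:-1]), frequency[-1].lower()
    if unit == "h":
        return value
    if unit == "d":
        return value * 24
    raise ValueError(f"Unsupported frequency: {frequency!r}")


for f in (6, "12h", "1d"):
    print(f, "->", frequency_to_hours(f))
```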
### Selecting years
You can select ranges of years using the `start` and `end` keywords:
```python
from ecml_tools.data import open_dataset
training = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
                        start=1979,
                        end=2020)
test = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
                    start=2021,
                    end=2022)
```
The selection includes all the dates of the `end` year.
### Selecting more precise ranges
You can select a few months, or even a few days:
```python
from ecml_tools.data import open_dataset
training = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
                        start=202306,
                        end=202308)
test = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
                    start=20200301,
                    end=20200410)
```
The following are equivalent ways of describing `start` or `end`:
* `2020` and `"2020"`
* `202306`, `"202306"` and `"2023-06"`
* `20200301`, `"20200301"` and `"2020-03-01"`
You can omit either `start` or `end`. The first or the last date of the dataset is then used, respectively.
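The equivalences listed above can be illustrated with a small sketch that maps each form to the first and last day it covers. `date_bounds` is a hypothetical helper, not a function of the package; it assumes that a `start` value refers to the first day of the period and an `end` value to the last, which matches the statement that the whole `end` year is included.

```python
import calendar
import datetime


def date_bounds(value):
    """Hypothetical helper: first and last day covered by a start/end value."""
    digits = str(value).replace("-", "")
    year = int(digits[:4])
    if len(digits) == 4:  # year only, e.g. 2020
        return datetime.date(year, 1, 1), datetime.date(year, 12, 31)
    month = int(digits[4:6])
    if len(digits) == 6:  # year and month, e.g. 202306
        last_day = calendar.monthrange(year, month)[1]
        return datetime.date(year, month, 1), datetime.date(year, month, last_day)
    if len(digits) == 8:  # full date, e.g. 20200301
        day = datetime.date(year, month, int(digits[6:8]))
        return day, day
    raise ValueError(f"Cannot interpret {value!r} as a date")


for value in (2020, "2023-06", 20200301):
    first, last = date_bounds(value)
    print(value, "->", first, "to", last)
```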
### Combining both
You can combine both subsetting methods:
```python
from ecml_tools.data import open_dataset
training = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
                        start=1979,
                        end=2020,
                        frequency="6h")
```
## Combining datasets
You can create a virtual dataset by combining two or more `zarr` files.
```python
from ecml_tools.data import open_dataset
ds = open_dataset(
    "dataset-1",
    "dataset-2",
    "dataset-3",
    ...
)
```
When given a list of `zarr` files, the package automatically works out whether the files can be _concatenated_ or _joined_ by looking at the range of dates covered by each file.
If the dates are different, the files are concatenated. If the dates are the same, the files are joined. See below for more information.
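The rule can be summarised in a few lines. This is only a sketch of the decision as described here: `combine_mode` is a hypothetical helper, each dataset is reduced to a pair of first and last dates, and the compatibility checks the package performs are left out.

```python
def combine_mode(date_ranges):
    """Hypothetical helper: decide how datasets would be combined.

    `date_ranges` holds one (first_date, last_date) pair per dataset.
    """
    if all(r == date_ranges[0] for r in date_ranges):
        return "join"  # same dates: the variables are combined
    return "concat"  # different dates: the rows are appended


print(combine_mode([("1940-01-01", "1978-12-31"), ("1979-01-01", "2022-12-31")]))
print(combine_mode([("1979-01-01", "2022-12-31"), ("1979-01-01", "2022-12-31")]))
```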
## Concatenating datasets
You can concatenate two or more datasets along the dates dimension. The package will check that all datasets are compatible (same resolution, same variables, etc.). Currently, the datasets must be given in chronological order with no gaps between them.
```python
from ecml_tools.data import open_dataset
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1940-1978-1h-v2",
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2"
)
```
![Concatenation](concat.png)
Please note that you can pass more than two `zarr` files to the function.
> **_NOTE:_** When concatenating files, the statistics are not recomputed; the statistics of the first file are returned to the user.
## Joining datasets
You can join two datasets that have the same dates, combining their variables.
```python
from ecml_tools.data import open_dataset
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    "some-extra-parameters-from-another-source-o96-1979-2022-1h-v2",
)
```
![Join](join.png)
If a variable is present in more than one file, the last occurrence of that variable is used, and it is placed at the position of the first occurrence of that name.
![Overlay](overlay.png)
Please note that you can join more than two `zarr` files.
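The override rule for repeated variables can be reproduced with a plain dictionary, since assigning to an existing key keeps its position but replaces its value. The sketch below is an illustration only: `joined_variables` is a hypothetical helper and the variable lists are made up.

```python
def joined_variables(*datasets):
    """Hypothetical helper: map each variable to the dataset providing it.

    Each dataset is given as a (label, list_of_variable_names) pair.
    """
    source = {}
    for label, variables in datasets:
        for name in variables:
            # An existing key keeps its original position in the dict,
            # but its value is replaced, so the last dataset wins.
            source[name] = label
    return source


result = joined_variables(
    ("first", ["2t", "10u", "10v"]),
    ("second", ["tp", "10u"]),
)
print("Variable order:", list(result))
print("10u is taken from:", result["10u"])
```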
## Selection, ordering and renaming of variables
You can select a subset of variables when opening a `zarr` file. If you pass a `list`, the variables are ordered according to that list. If you pass a `set`, the order of the file is preserved.
```python
from ecml_tools.data import open_dataset
# Select '2t' and 'tp' in that order
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    select=["2t", "tp"],
)
# Select '2t' and 'tp', but preserve the order in which they are in the file
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    select={"2t", "tp"},
)
```
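The difference between passing a `list` and a `set` can be shown without opening any file. In this sketch, `select_variables` is a hypothetical helper and the list of file variables is invented; it does not check for unknown names, which the package may well do.

```python
def select_variables(file_variables, select):
    """Hypothetical helper mimicking the `select` rules described above."""
    if isinstance(select, (set, frozenset)):
        # A set carries no order: keep the order of the file.
        return [v for v in file_variables if v in select]
    # A list imposes its own order.
    return list(select)


file_variables = ["10u", "10v", "2t", "msl", "tp"]
print(select_variables(file_variables, ["tp", "2t"]))
print(select_variables(file_variables, {"tp", "2t"}))
```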
You can also drop some variables:
```python
from ecml_tools.data import open_dataset
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    drop=["10u", "10v"],
)
```
and reorder them:
```python
from ecml_tools.data import open_dataset
# ... using a list
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    reorder=["2t", "msl", "sp", "10u", "10v"],
)
# ... or using a dictionary
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    reorder={"2t": 0, "msl": 1, "sp": 2, "10u": 3, "10v": 4},
)
```
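Both forms carry the same information, assuming the dictionary maps each name to its target position as the example above suggests. The following sketch converts one form into the other; `reorder_to_list` is a hypothetical helper, not part of the package.

```python
def reorder_to_list(reorder):
    """Hypothetical helper: express a `reorder` argument as an ordered list."""
    if isinstance(reorder, dict):
        # Sort the variable names by their target position.
        return sorted(reorder, key=reorder.get)
    return list(reorder)


print(reorder_to_list({"msl": 1, "2t": 0, "sp": 2}))
print(reorder_to_list(["2t", "msl", "sp"]))
```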
You can also rename variables:
```python
from ecml_tools.data import open_dataset
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    rename={"2t": "t2m"},
)
```
This is useful when you join datasets and do not want variables from one dataset to override those from the other.
## Using all options
You can combine all of the above:
```python
from ecml_tools.data import open_dataset
ds = open_dataset(
    "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
    "some-extra-parameters-from-another-source-o96-1979-2022-1h-v2",
    start=2000,
    end=2001,
    frequency="12h",
    select={"2t", "2d"},
    ...
)
```
## Building a dataset from a configuration
In practice, you will be building datasets from a configuration file, such as a YAML file:
```python
import yaml
from ecml_tools.data import open_dataset
with open("config.yaml") as f:
    config = yaml.safe_load(f)
training = open_dataset(config["training"])
test = open_dataset(config["test"])
```
This is possible because `open_dataset` accepts simple lists and dictionaries, so a dataset can be built from them:
```python
from ecml_tools.data import open_dataset
# From a string
ds = open_dataset("aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2")
# From a list of strings
ds = open_dataset(
    [
        "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "aifs-ea-an-oper-0001-mars-o96-2023-2023-1h-v2",
    ]
)
# From a dictionary
ds = open_dataset(
    {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "frequency": "6h",
    }
)
# From a list of dictionaries
ds = open_dataset(
    [
        {
            "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
            "frequency": "6h",
        },
        {
            "dataset": "some-extra-parameters-from-another-source-o96-1979-2022-1h-v2",
            "frequency": "6h",
            "select": ["sst", "cape"],
        },
    ]
)
# And even deeper constructs
ds = open_dataset(
    [
        {
            "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
            "frequency": "6h",
        },
        {
            "dataset": [
                {
                    "dataset": "aifs-od-an-oper-8888-mars-o96-1979-2022-6h-v2",
                    "drop": ["ws"],
                },
                {
                    "dataset": "aifs-od-an-oper-9999-mars-o96-1979-2022-6h-v2",
                    "select": ["ws"],
                },
            ],
            "frequency": "6h",
            "select": ["sst", "cape"],
        },
    ]
)
```
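Putting this together, a configuration file with `training` and `test` entries could load into a dictionary like the one below. This is a sketch under assumptions: the layout of `config.yaml` is not specified in this document, the split of years is only an example, and it is assumed that `start` and `end` can be given as dictionary keys in the same way as `frequency`.

```python
# Hypothetical content of `config`, as yaml.safe_load might return it.
config = {
    "training": {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "start": 1979,  # assumption: accepted as a dictionary key
        "end": 2020,
        "frequency": "6h",
    },
    "test": {
        "dataset": "aifs-ea-an-oper-0001-mars-o96-1979-2022-1h-v2",
        "start": 2021,
        "end": 2022,
        "frequency": "6h",
    },
}

for name, options in config.items():
    print(name, options["start"], options["end"])
```

Keeping both entries in one file makes it easy to check that the training and test periods do not overlap.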
Raw data
{
"_id": null,
"home_page": "https://github.com/ecmwf-lab/ecml-tools",
"name": "ecml-tools",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "tool",
"author": "European Centre for Medium-Range Weather Forecasts (ECMWF)",
"author_email": "software.support@ecmwf.int",
"download_url": "https://files.pythonhosted.org/packages/4a/48/2430022af064ae17fa788f8f310ecd8d976634b6a9f98dbe1cc240a47e43/ecml-tools-0.6.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License Version 2.0",
"summary": "A package to hold various functions to support training of ML models on ECMWF data.",
"version": "0.6.1",
"project_urls": {
"Homepage": "https://github.com/ecmwf-lab/ecml-tools"
},
"split_keywords": [
"tool"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "4a482430022af064ae17fa788f8f310ecd8d976634b6a9f98dbe1cc240a47e43",
"md5": "23d2c873b74103488be295854d8fd134",
"sha256": "3bdba1787207dfbf7212f2df76a01f20d4ade65817b6ca2b90725303149a6bd4"
},
"downloads": -1,
"filename": "ecml-tools-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "23d2c873b74103488be295854d8fd134",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 71783,
"upload_time": "2024-03-14T17:32:25",
"upload_time_iso_8601": "2024-03-14T17:32:25.195279Z",
"url": "https://files.pythonhosted.org/packages/4a/48/2430022af064ae17fa788f8f310ecd8d976634b6a9f98dbe1cc240a47e43/ecml-tools-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-14 17:32:25",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ecmwf-lab",
"github_project": "ecml-tools",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "ecml-tools"
}