# odc-dscache

- Version: 1.9.1
- Summary: ODC Dataset File Cache
- Home page: https://github.com/opendatacube/odc-dscache/
- Author / Maintainer: Open Data Cube
- Requires Python: >=3.10
- License: Apache License 2.0
- Upload time: 2025-07-10 03:14:58
            # Dataset Cache

Random access cache of `Dataset` objects backed by disk storage.

- Uses `lmdb` as key value store
  - UUID is the key
  - Compressed json blob is value
- Uses `zstandard` compression (with pre-trained dictionaries)
  - Achieves pretty good compression (the db is roughly 3 times larger than a `.tar.gz` of the dataset YAML files) but, unlike a tar archive, allows random access (see the sketch below)
- Keeps track of `Product` and `Metadata` objects
- Has concept of "groups" (used for `GridWorkFlow`)
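
Conceptually the storage described above is just a compressed key-value store. The sketch below is not the library's actual code, only a minimal illustration of the `lmdb` + `zstandard` pattern (UUID bytes as the key, a compressed JSON blob as the value); the file name and document contents are made up.

```python
import json
import uuid

import lmdb
import zstandard

# open (or create) a single-file lmdb environment
env = lmdb.open("example.db", subdir=False, map_size=2**30)

cctx = zstandard.ZstdCompressor(level=6)  # dscache additionally uses pre-trained dictionaries
dctx = zstandard.ZstdDecompressor()

key = uuid.uuid4()
doc = {"id": str(key), "product": "example_product"}

# write: UUID bytes -> zstd-compressed JSON blob
with env.begin(write=True) as txn:
    txn.put(key.bytes, cctx.compress(json.dumps(doc).encode("utf-8")))

# random-access read back by UUID
with env.begin() as txn:
    blob = txn.get(key.bytes)
    print(json.loads(dctx.decompress(blob)))
```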


## Installation

```
pip install odc-dscache
```

## Exporting from Datacube

### Using command line app

There is a CLI tool called `slurpy` that can export a set of products to a file:

```
> slurpy --help
Usage: slurpy [OPTIONS] OUTPUT [PRODUCTS]...

Options:
  -E, --env TEXT  Datacube environment name
  -z INTEGER      Compression setting for zstandard 1-fast, 9+ good but slow
  --help          Show this message and exit.
```

Note that this app is not affected by [issue#542](https://github.com/opendatacube/datacube-core/issues/542), as it implements a properly lazy SQL query using cursors.


### From python code

```python
import datacube
from odc import dscache

dc = datacube.Datacube()

# create a new file db, deleting the old one if it exists
cache = dscache.create_cache('sample.db', truncate=True)

# dataset stream from some query (query arguments omitted)
dss = dc.find_datasets_lazy(...)

# tee the dataset stream off into the db file
dss = cache.tee(dss)

# then just process the stream of datasets
for ds in dss:
    do_stuff_with(ds)

# finally, close the db file
cache.close()
```

## Reading from a file database

By default the database file is assumed to be read-only. If, however, some other process is writing to the db while this one is reading, you have to supply an extra argument: `open_ro(.., lock=True)`. Better not to do that over a network file system.

```python
from odc import dscache

cache = dscache.open_ro("sample.db")

# access individual dataset: returns None if not found
ds = cache.get("005b0ab7-5454-4eef-829d-ed081135aefb")
if ds is not None:
    do_stuff_with(ds)

# stream all datasets
for ds in cache.get_all():
    do_stuff_with(ds)
```
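
As noted above, if another process may still be writing to the db the reader has to pass `lock=True`; a minimal sketch, reusing the same `sample.db` and the placeholder `do_stuff_with` from the earlier examples:

```python
from odc import dscache

# another process may still be appending to sample.db, so take a lock
cache = dscache.open_ro("sample.db", lock=True)

for ds in cache.get_all():
    do_stuff_with(ds)
```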

For more details see [notebook](../notebooks/dscache-example.ipynb).

## Groups

A group is a collection of datasets that are somehow related. It is essentially a simple index: a list of UUIDs stored under some name. For example, we might want to group all datasets that overlap a certain Albers tile into a group named `albers/{x}_{y}`. One can query the list of all group names with the `.groups()` method, add a new group with `.put_group(name, list_of_uuids)`, and read all datasets that belong to a given group with `.stream_group(group_name)`; see the sketch after the list below.

- Get the list of group names and their population counts: `.groups() -> List[(name, count)]`
- Get datasets for a given group: `.stream_group(group_name)` -> lazy sequence of `Dataset` objects
- Get just the UUIDs: `.get_group(group_name) -> List[UUID]`
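
A minimal sketch of the group API listed above. The group name follows the `albers/{x}_{y}` pattern but the particular tile index is made up, and `do_stuff_with` is a placeholder as in the earlier examples:

```python
from odc import dscache

cache = dscache.open_ro("sample.db")

# list group names and how many datasets each one holds
for name, count in cache.groups():
    print(name, count)

# just the UUIDs of one (hypothetical) group
uuids = cache.get_group("albers/15_-40")

# stream the Dataset objects of that group
for ds in cache.stream_group("albers/15_-40"):
    do_stuff_with(ds)

# adding a group requires a writable cache, e.g. one returned by create_cache:
# cache.put_group("albers/15_-40", uuids)
```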

There is a CLI tool, `dstiler`, that can group datasets based on a `GridSpec`:

```
Usage: dstiler [OPTIONS] DBFILE

  Add spatial grouping to file db.

  Default grid is Australian Albers (EPSG:3577) with 100k by 100k tiles. But
  you can also group by Landsat path/row (--native), or Google's map tiling
  regime (--web zoom_level)

Options:
  --native         Use Landsat Path/Row as grouping
  --native-albers  When datasets are in Albers grid already
  --web INTEGER    Use web map tiling regime at supplied zoom level
  --help           Show this message and exit.
```

Note that, unlike tools such as `datacube-stats --save-tasks` that rely on `GridWorkflow.group_into_cells`, `dstiler` can process very large collections: it does not keep the entire `Dataset` object in memory for every dataset observed; only the UUID is kept in RAM until completion, drastically reducing memory usage. There is also an optimization for ingested products: these are already tiled onto the Albers grid, so rather than doing relatively expensive geometry overlap checks we can simply extract the Albers tile index directly from the `Dataset`'s `.metadata.grid_spatial` property. To use this option, supply `--native-albers` to the `dstiler` app.
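
A rough sketch of that shortcut, not `dstiler`'s actual code: assuming an eo-style `grid_spatial` document with `geo_ref_points` in EPSG:3577 and the standard 100km tile size, the tile index falls straight out of a corner coordinate (the helper name is made up).

```python
def albers_tile_index(ds, tile_size=100_000.0):
    """Tile index of an already-Albers-gridded dataset (illustrative only)."""
    # lower-left corner of the dataset footprint, in EPSG:3577 metres
    ll = ds.metadata.grid_spatial["geo_ref_points"]["ll"]
    return int(ll["x"] // tile_size), int(ll["y"] // tile_size)
```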


## Notes on performance

It took 26 minutes to slurp 2,627,779 wofs datasets from a local PostgreSQL server on AWS (`r4.xlarge`); this generated a 1.4 GB database file.

```
Command being timed: "slurpy -E wofs wofs.db :all:"
User time (seconds): 1037.93
System time (seconds): 48.77
Percent of CPU this job got: 69%
Elapsed (wall clock) time (h:mm:ss or m:ss): 26:04.79
```

Adding Albers tile grouping to this took just over 4 minutes, a processing rate of roughly 10.6K datasets per second.

```
Command being timed: "dstiler --native-albers wofs.db"
User time (seconds): 234.57
System time (seconds): 2.65
Percent of CPU this job got: 95%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:08.70
```

A similar workload on a VDI node (2,747,870 wofs datasets from the main db) took 23 minutes to dump all datasets from the DB and 7 minutes to tile onto the Albers grid using the "native grid" optimization. Read throughput from the file db on the VDI node is slower than on AWS, but is still a respectable 6.5K datasets per second. The database file was somewhat bigger too, 2 GB vs 1.4 GB on AWS; perhaps there is a significant difference in the `zstandard` library between the two systems.

```
Command being timed: "slurpy wofs.db wofs_albers"
User time (seconds): 1077.74
System time (seconds): 49.75
Percent of CPU this job got: 81%
Elapsed (wall clock) time (h:mm:ss or m:ss): 23:01.20
```

```
Command being timed: "dstiler --native-albers wofs.db"
User time (seconds): 408.65
System time (seconds): 6.28
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 7:03.22
```

It is worth pointing out that grouping datasets into grids can very well happen during the `slurpy` run without adding much overhead, so two-step processing is not strictly necessary.
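
A sketch of what that single-pass approach could look like from Python, using only the `.tee()` and `.put_group()` methods described above; the Datacube query and the `albers_tile_index` helper (see the earlier sketch) are assumptions, not part of the library:

```python
from collections import defaultdict

import datacube
from odc import dscache

dc = datacube.Datacube()
cache = dscache.create_cache("wofs.db", truncate=True)
groups = defaultdict(list)

# export and group in a single pass over the dataset stream
for ds in cache.tee(dc.find_datasets_lazy(product="wofs_albers")):
    tx, ty = albers_tile_index(ds)            # helper from the sketch above
    groups[f"albers/{tx}_{ty}"].append(ds.id)

for name, uuids in groups.items():
    cache.put_group(name, uuids)

cache.close()
```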

            
