# Registry Sweepers
This package provides supplementary metadata generation for registry documents, which is required for registry-api to function correctly, and for common user queries. Execution is idempotent and should be scheduled on a recurring basis.
### Components
#### [RepairKit](https://github.com/NASA-PDS/registry-sweepers/blob/main/src/pds/registrysweepers/repairkit/__init__.py)
The repairkit sweeper applies idempotent transformations to targeted subsets of properties, for example ensuring that all properties expected to have array-like values are in fact arrays (as opposed to single-element arrays being flattened to strings during harvest). Documents are processed based on whether their `ops:Provenance/ops:registry_sweepers_repairkit_version` metadata value is up-to-date relative to the sweeper codebase.
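As a rough illustration of the idea above, the sketch below wraps flattened scalar values back into lists and skips documents whose version marker is already current. The integer version scheme, the `repair` helper, and the `ref_lid_target` property are assumptions made for the example, not the sweeper's actual implementation.

```python
from typing import Any, Dict, Iterable

# Property name taken from the text above; the integer version is an assumption.
VERSION_KEY = "ops:Provenance/ops:registry_sweepers_repairkit_version"
CURRENT_VERSION = 2  # hypothetical codebase version


def ensure_list(value: Any) -> list:
    """Wrap a scalar in a list; leave lists untouched (idempotent)."""
    return value if isinstance(value, list) else [value]


def repair(doc: Dict[str, Any], array_props: Iterable[str]) -> Dict[str, Any]:
    """Return the partial update for doc, or {} if it is already up-to-date."""
    if doc.get(VERSION_KEY, 0) >= CURRENT_VERSION:
        return {}
    update = {p: ensure_list(doc[p]) for p in array_props if p in doc}
    update[VERSION_KEY] = CURRENT_VERSION
    return update


# "ref_lid_target" is a hypothetical array-like property used for illustration.
doc = {"ref_lid_target": "urn:nasa:pds:context:target:planet.mars"}
update = repair(doc, ["ref_lid_target"])
doc.update(update)
print(update["ref_lid_target"])  # ['urn:nasa:pds:context:target:planet.mars']
print(repair(doc, ["ref_lid_target"]))  # {} (second pass is a no-op)
```

Because `ensure_list` leaves lists untouched and the version marker is written alongside the repaired values, running the sketch twice produces no further changes.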
#### [Provenance](https://github.com/NASA-PDS/registry-sweepers/blob/main/src/pds/registrysweepers/provenance.py)
The provenance sweeper generates metadata for linking each version-superseded product with the versioned product which supersedes it. The value of the successor is stored in the `ops:Provenance/ops:superseded_by` property. This property will not be set for the latest version of any product. All documents are processed, but db writes are optimised based on whether their `ops:Provenance/ops:registry_sweepers_provenance_version` metadata value is up-to-date relative to the sweeper codebase.
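A minimal sketch of the successor computation, assuming lidvids of the form `LID::MAJOR.MINOR`, unique inputs, and that the successor of a product is its next-highest version. The function names are hypothetical and the real sweeper may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Version = Tuple[int, int]


def split_lidvid(lidvid: str) -> Tuple[str, Version]:
    """Split 'LID::MAJOR.MINOR' into the LID and a sortable version tuple."""
    lid, vid = lidvid.split("::")
    major, minor = vid.split(".")
    return lid, (int(major), int(minor))


def compute_successors(lidvids: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each lidvid to the lidvid that supersedes it (None for the latest)."""
    by_lid: Dict[str, List[Tuple[Version, str]]] = defaultdict(list)
    for lidvid in lidvids:
        lid, version = split_lidvid(lidvid)
        by_lid[lid].append((version, lidvid))

    successors: Dict[str, Optional[str]] = {}
    for versions in by_lid.values():
        # Sort numerically so that 1.10 comes after 1.9.
        ordered = [lidvid for _, lidvid in sorted(versions)]
        for current, following in zip(ordered, ordered[1:]):
            successors[current] = following
        successors[ordered[-1]] = None  # latest version: nothing to set
    return successors


result = compute_successors(
    ["urn:nasa:pds:x::1.0", "urn:nasa:pds:x::1.10", "urn:nasa:pds:x::1.9"]
)
print(result["urn:nasa:pds:x::1.9"])  # urn:nasa:pds:x::1.10
```

Entries mapped to `None` correspond to the latest versions, for which `ops:Provenance/ops:superseded_by` is left unset.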
#### [Ancestry](https://github.com/NASA-PDS/registry-sweepers/blob/main/src/pds/registrysweepers/ancestry/__init__.py)
The ancestry sweeper generates membership metadata for each product, i.e. which bundle lidvids and which collection lidvids reference a given product. These values will be stored in properties `ops:Provenance/ops:parent_bundle_identifier` and `ops:Provenance/ops:parent_collection_identifier`, respectively. All bundles/collections are processed to populate a lookup table, but db writes are optimised based on whether their `ops:Provenance/ops:registry_sweepers_provenance_version` metadata value is up-to-date relative to the sweeper codebase, and collection non-aggregate reference pages in registry-refs are skipped entirely if they are marked as up-to-date.
The ancestry sweeper [accepts environment variables to tune performance](./src/pds/registrysweepers/ancestry/runtimeconstants.py), primarily trading increased runtime duration for reduced peak memory usage.
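The lookup-table approach can be sketched as follows. This is an in-memory illustration only: the input dictionaries, the `build_ancestry` name, and the assumption that a product inherits the parent bundles of every collection referencing it are all hypothetical, and the real sweeper reads these relationships from the registry and registry-refs indices.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_ancestry(
    bundle_members: Dict[str, List[str]],
    collection_members: Dict[str, List[str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Map each collection/product lidvid to its parent bundles and collections."""
    ancestry: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: {"bundles": set(), "collections": set()}
    )
    # Lookup table: collection -> bundles that reference it.
    for bundle, collections in bundle_members.items():
        for collection in collections:
            ancestry[collection]["bundles"].add(bundle)
    # Assumption: products inherit the bundles of each referencing collection.
    for collection, products in collection_members.items():
        collection_bundles = ancestry[collection]["bundles"]
        for product in products:
            ancestry[product]["collections"].add(collection)
            ancestry[product]["bundles"] |= collection_bundles
    return dict(ancestry)


ancestry = build_ancestry(
    {"urn:nasa:pds:b::1.0": ["urn:nasa:pds:b:c::1.0"]},
    {"urn:nasa:pds:b:c::1.0": ["urn:nasa:pds:b:c:p::1.0"]},
)
print(ancestry["urn:nasa:pds:b:c:p::1.0"]["bundles"])  # {'urn:nasa:pds:b::1.0'}
```

The resulting sets correspond to the values that would be written to `ops:Provenance/ops:parent_bundle_identifier` and `ops:Provenance/ops:parent_collection_identifier`.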
## Developer Quickstart
### Prerequisites
#### Dependencies
- Python >=3.9
#### Environment Variables
```
MULTITENANCY_NODE_ID= // If running in a multitenant environment, the id of the node, used to distinguish registry/registry-refs index instances
PROV_CREDENTIALS={"admin": "admin"} // OpenSearch username/password, if targeting an OpenSearch host other than AWS AOSS
SWEEPERS_IAM_ROLE_NAME=<value> // AWS IAM role name, if targeting AWS AOSS
PROV_ENDPOINT=https://localhost:9200 // OpenSearch host url and port
LOGLEVEL=INFO // optional; an integer log level or a case-insensitive string matching a Python log level (defaults to `INFO`)
DEV_MODE=1 // disables host verification
// tqdm dependency may cause fatal crashes on some architectures when breakpoints are used in debug mode with Cython speedup extension enabled
PYDEVD_USE_CYTHON=NO // disables Cython speedup extension
```
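For orientation, here is one way these variables could be read and validated in Python. It is a sketch under stated assumptions: the helper names are hypothetical, `DEV_MODE` is assumed to be enabled only by the literal value `1`, and `PROV_CREDENTIALS` is assumed to hold a single username/password pair as in the example above.

```python
import json
import logging
import os
from typing import Mapping, Optional, Tuple


def parse_log_level(raw: Optional[str]) -> int:
    """Accept an integer level or a case-insensitive level name; default to INFO."""
    if not raw:
        return logging.INFO
    if raw.strip().isdigit():
        return int(raw)
    level = logging.getLevelName(raw.strip().upper())
    if not isinstance(level, int):
        raise ValueError(f"Unrecognised LOGLEVEL: {raw!r}")
    return level


def parse_credentials(raw: Optional[str]) -> Optional[Tuple[str, str]]:
    """Parse '{"user": "password"}' into a (user, password) pair."""
    if not raw:
        return None
    creds = json.loads(raw)
    if not isinstance(creds, dict) or len(creds) != 1:
        raise ValueError("expected a JSON object with one username/password entry")
    return next(iter(creds.items()))


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the sweeper-related settings from an environment mapping."""
    return {
        "endpoint": env.get("PROV_ENDPOINT"),
        "credentials": parse_credentials(env.get("PROV_CREDENTIALS")),
        "log_level": parse_log_level(env.get("LOGLEVEL")),
        "dev_mode": env.get("DEV_MODE") == "1",  # assumption: only "1" enables it
    }


settings = load_settings(
    {"PROV_ENDPOINT": "https://localhost:9200", "PROV_CREDENTIALS": '{"admin": "admin"}'}
)
print(settings["credentials"])  # ('admin', 'admin')
print(settings["log_level"])  # 20
```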
With the `--legacy-sync` option, you also need to provide the list of cross-cluster-search connections configured to access all the nodes' OpenSearch domains:
```
CCS_CONN=naif-prod-ccs,rms-prod,sbnumd-prod-ccs,geo-prod-ccs,atm-prod-ccs,sbnpsi-prod-ccs,ppi-prod-ccs,img-prod-ccs
```
Use the connection aliases found in the 'Connections' tab of the Engineering Node OpenSearch Domain on AWS.
https://us-west-2.console.aws.amazon.com/aos/home?region=us-west-2#opensearch/domains/en-prod?tabId=ccs
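A comma-separated value such as `CCS_CONN` can be split into individual connection aliases as sketched below. The `parse_ccs_conn` helper is hypothetical and tolerates stray whitespace and empty entries, which the real driver may or may not do.

```python
from typing import List


def parse_ccs_conn(raw: str) -> List[str]:
    """Split a comma-separated alias list, dropping blanks and stray whitespace."""
    return [alias.strip() for alias in raw.split(",") if alias.strip()]


aliases = parse_ccs_conn("naif-prod-ccs, rms-prod,,geo-prod-ccs")
print(aliases)  # ['naif-prod-ccs', 'rms-prod', 'geo-prod-ccs']
```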
After cloning the repository and setting the repository root as the current working directory, install the package with `pip install -e .`
The wrapper script for the suite of components may be run with `python ./docker/sweepers_driver.py`.
Alternatively, registry-sweepers may be built from its [Dockerfile](./docker/Dockerfile) with `docker image build --file ./docker/Dockerfile .` and run as a container, providing those same environment variables when running the container.
### Performance
#### Rough Benchmarks
When run from a local development machine against the production OpenSearch instance with ~1.1M products, no cross-cluster remotes, and (only) ~1k multi-version products, the runtime is ~20min on the first run and ~12min subsequently. It appears that OpenSearch optimizes away no-op update calls, resulting in a significant speedup even though registry-sweepers reprocesses metadata from scratch on every run.
The overwhelming bottleneck ops are the O(docs_count) db writes in ancestry.
## Code of Conduct
All users and developers of the NASA-PDS software are expected to abide by our [Code of Conduct](https://github.com/NASA-PDS/.github/blob/main/CODE_OF_CONDUCT.md). Please read this to ensure you understand the expectations of our community.
## Development
To develop this project, use your favorite text editor, or an integrated development environment with Python support, such as [PyCharm](https://www.jetbrains.com/pycharm/).
### Contributing
For information on how to contribute to NASA-PDS codebases please take a look at our [Contributing guidelines](https://github.com/NASA-PDS/.github/blob/main/CONTRIBUTING.md).
### Installation
Install in editable mode and with extra developer dependencies into your virtual environment of choice:

    pip install --editable '.[dev]'
Configure the `pre-commit` hooks:

    pre-commit install
    pre-commit install -t pre-push
    pre-commit install -t prepare-commit-msg
    pre-commit install -t commit-msg
These hooks check code formatting and also abort commits that contain secrets such as passwords or API keys. However, a one-time setup is required in your global Git configuration. See [the wiki entry on Git Secrets](https://github.com/NASA-PDS/nasa-pds.github.io/wiki/Git-and-Github-Guide#git-secrets) to learn how.
### Packaging
To isolate and reproduce the environment for this package, you should use a [Python Virtual Environment](https://docs.python.org/3/tutorial/venv.html). To do so, run:

    python -m venv venv
Then exclusively use `venv/bin/python`, `venv/bin/pip`, etc.
If you have `tox` installed and would like it to create your environment and install dependencies for you, run:

    tox --devenv <name you'd like for env> -e dev
Dependencies for development are specified as the `dev` `extras_require` in `setup.cfg`; they are installed into the virtual environment as follows:

    pip install --editable '.[dev]'
All the source code is in a sub-directory under `src`.
### Tests
This section describes testing for your package.
A complete "build" including test execution, linting (`mypy`, `black`, `flake8`, etc.), and documentation build is executed via:

    tox
#### Unit tests
Your project should have built-in unit, functional, validation, and acceptance tests.
For unit testing, check out the [unittest](https://docs.python.org/3/library/unittest.html) module, built into Python 3.
Test objects should live in each package's `test` modules or, preferably, in the project's `tests` directory, which mirrors the project package structure.
Our unit tests are launched with the command:

    pytest
If you want your tests to run automatically as you make changes, start up `pytest` in watch mode with:

    ptw
## Build

    pip install wheel
    python setup.py sdist bdist_wheel
## Publication
NASA PDS packages can publish automatically using the [Roundup Action](https://github.com/NASA-PDS/roundup-action), which leverages GitHub Actions to perform automated continuous integration and continuous delivery. A default workflow that includes the Roundup is provided in the `.github/workflows/unstable-cicd.yaml` file. (Unstable here means an interim release.)
### Manual Publication
Create the package:

    python setup.py bdist_wheel

Publish it as a GitHub release.
Publish on PyPI (you need a PyPI account and a configured `$HOME/.pypirc`):

    pip install twine
    twine upload dist/*
Or publish on the Test PyPI (you need a Test PyPI account and a configured `$HOME/.pypirc`):

    pip install twine
    twine upload --repository testpypi dist/*
## CI/CD
The template repository comes with our two "standard" CI/CD workflows, `stable-cicd` and `unstable-cicd`. The unstable build runs on any push to `main` (apart from changes to specific ignored files), and the stable build runs on push of a release branch of the form `release/<release version>`. Both make use of our GitHub Actions build step, [Roundup](https://github.com/NASA-PDS/roundup-action). The `unstable-cicd` workflow generates (and constantly updates) a SNAPSHOT release. If you haven't done a formal software release, you will end up with a `v0.0.0-SNAPSHOT` release (see NASA-PDS/roundup-action#56 for specifics).
Raw data
{
"_id": null,
"home_page": "https://github.com/NASA-PDS/registry-sweepers",
"name": "pds.registry-sweepers",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "pds, planetary data, registry",
"author": "PDS",
"author_email": "pds_operator@jpl.nasa.gov",
"download_url": "https://github.com/NASA-PDS/registry-sweepers/releases/",
"platform": null,
"bugtrack_url": null,
"license": "apache-2.0",
"summary": "A set of utility scripts which transform/augment PDS registry metadata to support additional capabilities.",
"version": "1.3.0",
"project_urls": {
"Download": "https://github.com/NASA-PDS/registry-sweepers/releases/",
"Homepage": "https://github.com/NASA-PDS/registry-sweepers"
},
"split_keywords": [
"pds",
" planetary data",
" registry"
],
"urls": [
{
"comment_text": "\ud83e\udd20 Yee-haw! This here ar-tee-fact got done uploaded by the Roundup!",
"digests": {
"blake2b_256": "8bc236d7225b2f667f395f66dde0384adc0f7b22a795ed96bcf675274e7673dd",
"md5": "c0194e1281e7c48ae7ca88019a5ad9e5",
"sha256": "4d06bd6e5b755770c006b10ffe66003a882db3254cc26276cdb5555e3814fa86"
},
"downloads": -1,
"filename": "pds.registry_sweepers-1.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c0194e1281e7c48ae7ca88019a5ad9e5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 55133,
"upload_time": "2024-10-14T21:40:22",
"upload_time_iso_8601": "2024-10-14T21:40:22.856295Z",
"url": "https://files.pythonhosted.org/packages/8b/c2/36d7225b2f667f395f66dde0384adc0f7b22a795ed96bcf675274e7673dd/pds.registry_sweepers-1.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-14 21:40:22",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NASA-PDS",
"github_project": "registry-sweepers",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "pds.registry-sweepers"
}