articat

Name	articat
Version	0.2.0
home_page	https://github.com/related-sciences/articat
Summary	articat: data artifact catalog
upload_time	2025-02-17 18:20:27
maintainer	None
docs_url	None
author	Related Sciences LLC
requires_python	>=3.11
license	Apache
keywords	data catalog metadata data-discovery data-catalog
requirements	fire fsspec gcsfs google-cloud-bigquery google-cloud-datastore jupyterlab papermill pydantic

# articat
[![CI](https://github.com/related-sciences/articat/actions/workflows/build.yml/badge.svg?branch=main)](https://github.com/related-sciences/articat/actions/workflows/build.yml)
[![PYPI](https://img.shields.io/pypi/v/articat.svg)](https://pypi.org/project/articat/)

Minimal metadata catalog to store and retrieve metadata about data artifacts.

## Getting started

At a high level, *articat* is simply a key-value store. The value is the Artifact metadata.
The key, a.k.a. the "Artifact Spec", consists of:
 * a globally unique `id`
 * an optional timestamp: `partition`
 * an optional arbitrary string: `version`
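
To make the key-value idea concrete, here is a minimal sketch that does not use articat itself. `ArtifactSpec` and the `catalog` dict are hypothetical names chosen for illustration; the real library's classes and storage differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ArtifactSpec:
    """Hypothetical key type for illustration, not an articat class."""

    id: str
    partition: Optional[datetime] = None
    version: Optional[str] = None


# A plain dict stands in for the catalog backend.
catalog: dict[ArtifactSpec, dict] = {}

key = ArtifactSpec(id="foo", partition=datetime(1643, 1, 4))
catalog[key] = {"description": "example metadata"}

# An equal spec built elsewhere resolves to the same entry,
# because frozen dataclasses compare and hash by field values.
lookup = ArtifactSpec(id="foo", partition=datetime(1643, 1, 4))
found = catalog[lookup]
```

Because the spec is frozen and hashable, the same `id` with a different `partition` or `version` is a distinct key.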

To publish a file system Artifact (`FSArtifact`):

```python
from articat import FSArtifact
from pathlib import Path
from datetime import date

# Apart from being metadata containers, Artifact classes have optional
# convenience methods that help in the data publishing flow:

with FSArtifact.partitioned("foo", partition=date(1643, 1, 4)) as fsa:
    # To create a new Artifact, always use a `with` statement and
    # either the `partitioned` or the `versioned` method. Use:
    # * `partitioned(...)` for Artifacts with an explicit `datetime` partition
    # * `versioned(...)` for Artifacts with an explicit `str` version

    # Next we produce some local data. This could be the output of a
    # Spark job, an ML model, etc.
    data_path = Path("/tmp/data")
    data_path.write_text("42")

    # Now let's stage that data. The temporary and final data
    # directories/buckets are configurable (see below).
    fsa.stage(data_path)

    # Additionally, let's provide a description. Here we could also save
    # extra arbitrary metadata such as model metrics or hyperparameters.
    fsa.metadata.description = "Answer to the Ultimate Question of Life, the Universe, and Everything"
```

To retrieve the metadata about the Artifact above:

```python
from articat.fs_artifact import FSArtifact
from datetime import date
from pathlib import Path

# To retrieve the metadata, use the Artifact object and its `fetch` method:
fsa = FSArtifact.partitioned("foo", partition=date(1643, 1, 4)).fetch()

fsa.id # "foo"
fsa.created # <CREATION-TIMESTAMP>
fsa.partition # <PARTITION-TIMESTAMP>, here 1643-01-04
fsa.metadata.description # "Answer to the Ultimate Question of Life, the Universe, and Everything"
fsa.main_dir # Data directory; this is where the data was stored after staging
Path(fsa.joinpath("data")).read_text() # "42"
```
```
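
The publishing example insists on a `with` statement. One plausible reason, sketched here with a toy stand-in and not taken from articat's source, is that the metadata record should only be written once the block finishes without an error. `publish` and `toy_catalog` are hypothetical names:

```python
from contextlib import contextmanager

# Toy in-memory catalog; stands in for a real metadata backend.
toy_catalog: dict[str, dict] = {}


@contextmanager
def publish(artifact_id: str):
    """Hypothetical helper: record metadata only if the block succeeds."""
    draft = {"id": artifact_id, "description": None}
    yield draft
    # Reached only when the `with` body raised no exception.
    toy_catalog[artifact_id] = draft


with publish("ok") as draft:
    draft["description"] = "published"

try:
    with publish("broken") as draft:
        raise RuntimeError("data job failed")
except RuntimeError:
    pass

published_ids = sorted(toy_catalog)
```

In this sketch a failed data job leaves no record behind, so readers of the catalog never see a half-published Artifact.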

## Features

 * store and retrieve metadata about your data artifacts
 * no long-running services (low maintenance)
 * built-in data publishing utils
 * IO/data format agnostic
 * immutable metadata
 * development mode

## Artifact flavours

Currently available Artifact flavours:
 * `FSArtifact`: metadata/utils for files or objects (supports: local FS, GCS, S3 and more)
 * `BQArtifact`: metadata/utils for BigQuery tables
 * `NotebookArtifact`: metadata/utils for Jupyter Notebooks

## Development mode

To ease development of Artifacts, *articat* supports a development (dev) mode.
A development Artifact can be indicated by the `dev` parameter (preferred) or
by a `_dev` prefix in the Artifact `id`. Dev mode supports:
 * overwriting Artifact metadata
 * configuring separate locations (e.g. `dev_prefix` for `FSArtifact`), with
   potentially different retention periods etc.
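
The two rules above (how a dev Artifact is recognised, and that only dev metadata may be overwritten) can be sketched with a toy in-memory catalog. `is_dev_artifact` and `ToyCatalog` are hypothetical names for illustration and are not part of articat's API:

```python
def is_dev_artifact(artifact_id: str, dev: bool = False) -> bool:
    """Hypothetical check: explicit `dev` flag, or a `_dev` id prefix."""
    return dev or artifact_id.startswith("_dev")


class ToyCatalog:
    """In-memory stand-in for a metadata backend (not articat's API)."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, bool], dict] = {}

    def save(self, artifact_id: str, metadata: dict, dev: bool = False) -> None:
        is_dev = is_dev_artifact(artifact_id, dev)
        key = (artifact_id, is_dev)
        # Production records are immutable; dev records may be replaced.
        if key in self._records and not is_dev:
            raise ValueError(f"{artifact_id!r} already exists and is immutable")
        self._records[key] = metadata

    def get(self, artifact_id: str, dev: bool = False) -> dict:
        return self._records[(artifact_id, is_dev_artifact(artifact_id, dev))]


catalog = ToyCatalog()
catalog.save("foo", {"run": 1}, dev=True)
catalog.save("foo", {"run": 2}, dev=True)  # allowed: dev metadata is overwritten
catalog.save("foo", {"run": 1})            # first production write

try:
    catalog.save("foo", {"run": 2})        # second production write
    overwrite_rejected = False
except ValueError:
    overwrite_rejected = True
```

Keeping dev and production records under separate keys mirrors the idea of separate dev and prod locations, so an experiment cannot clobber a published record.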

## Backend

 * `local`: mostly for testing/demo, metadata is stored locally (configurable, default: `~/.config/articat/local`)
 * `gcp_datastore`: metadata is stored in the Google Cloud Datastore

## Configuration

*articat* configuration can be provided via the API or via configuration files. By default, configuration
is loaded from `~/.config/articat/articat.cfg` and from `articat.cfg` in the current working directory. You
can also point at a configuration file via the environment variable `ARTICAT_CONFIG`.

You can use `local` mode without a configuration file. Available options:

```toml
[main]
# local or gcp_datastore, default: local
# mode =

# local DB directory, default: ~/.config/articat/local
# local_db_dir =

[fs]
# temporary directory/prefix
# tmp_prefix =
# development data directory/prefix
# dev_prefix =
# production data directory/prefix
# prod_prefix =

[gcp]
# GCP project
# project =

[bq]
# development data BigQuery dataset
# dev_dataset =
# production data BigQuery dataset
# prod_dataset =
```
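
As a rough illustration of the lookup described above, the sketch below reads the same file locations with the standard-library `configparser`. The `load_config` function, the override order, and the `my-project` value are assumptions made for this example; articat's actual loader may behave differently.

```python
import configparser
import tempfile
from pathlib import Path


def load_config(cwd: Path, home: Path, env: dict) -> configparser.ConfigParser:
    """Hypothetical loader; the lookup order is an assumption."""
    candidates = [home / ".config" / "articat" / "articat.cfg", cwd / "articat.cfg"]
    if env.get("ARTICAT_CONFIG"):
        candidates.append(Path(env["ARTICAT_CONFIG"]))
    parser = configparser.ConfigParser()
    # Defaults taken from the comments in the options listing.
    parser.read_dict(
        {"main": {"mode": "local", "local_db_dir": "~/.config/articat/local"}}
    )
    # Missing files are skipped silently; later files override earlier ones.
    parser.read(candidates)
    return parser


with tempfile.TemporaryDirectory() as tmp:
    # Write a sample articat.cfg into a scratch "working directory".
    cfg_path = Path(tmp) / "articat.cfg"
    cfg_path.write_text(
        "[main]\nmode = gcp_datastore\n\n[gcp]\nproject = my-project\n"
    )
    config = load_config(cwd=Path(tmp), home=Path(tmp) / "no-home", env={})
    mode = config["main"]["mode"]
    project = config["gcp"]["project"]
    local_db_dir = config["main"]["local_db_dir"]

    # With no files at all, the defaults apply, i.e. `local` mode.
    default_mode = load_config(
        cwd=Path(tmp) / "empty", home=Path(tmp) / "no-home", env={}
    )["main"]["mode"]
```

Passing `cwd`, `home`, and `env` explicitly keeps the sketch independent of the machine it runs on.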

## Our/example setup

Below is a diagram of our setup. Articat is just one piece of our system and solves a specific problem. The diagram should give you an idea of where it might fit into your environment:

<p align="center">
  <img src="https://docs.google.com/drawings/d/1wll4Q_PlKGHVu-C2IN8jUIxzFTD8jwFWnvwgFrvq2ls/export/png" alt="Our setup diagram"/>
</p>

Raw data

{
    "_id": null,
    "home_page": "https://github.com/related-sciences/articat",
    "name": "articat",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "data, catalog, metadata, data-discovery, data-catalog",
    "author": "Related Sciences LLC",
    "author_email": "rav@related.vc",
    "download_url": "https://files.pythonhosted.org/packages/ec/d7/80582120af4b142b406fda29f9bd1f17762e01e5a44f8af5d33776cdb155/articat-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "articat: data artifact catalog",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/related-sciences/articat"
    },
    "split_keywords": [
        "data",
        " catalog",
        " metadata",
        " data-discovery",
        " data-catalog"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "65f71fcd347f549c13f3bfc0754c022a2e6542ac67da50b0747b79009d0c82d2",
                "md5": "d0d2a93faf01dee2eabb9a6093d01940",
                "sha256": "3a16e5e91413069c8a67d2b0055cf7029663c818784cce5f398f9e2d03fd3733"
            },
            "downloads": -1,
            "filename": "articat-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d0d2a93faf01dee2eabb9a6093d01940",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 52019,
            "upload_time": "2025-02-17T18:20:26",
            "upload_time_iso_8601": "2025-02-17T18:20:26.236554Z",
            "url": "https://files.pythonhosted.org/packages/65/f7/1fcd347f549c13f3bfc0754c022a2e6542ac67da50b0747b79009d0c82d2/articat-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ecd780582120af4b142b406fda29f9bd1f17762e01e5a44f8af5d33776cdb155",
                "md5": "a748ab404129f7495a141927fc03e042",
                "sha256": "61ce471cd65e57ed1e77b9d87b0a72922e33860e7a1c9ff41e8a017201b87c9c"
            },
            "downloads": -1,
            "filename": "articat-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a748ab404129f7495a141927fc03e042",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 43422,
            "upload_time": "2025-02-17T18:20:27",
            "upload_time_iso_8601": "2025-02-17T18:20:27.470178Z",
            "url": "https://files.pythonhosted.org/packages/ec/d7/80582120af4b142b406fda29f9bd1f17762e01e5a44f8af5d33776cdb155/articat-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-17 18:20:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "related-sciences",
    "github_project": "articat",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "fire",
            "specs": [
                [
                    ">=",
                    "0.4"
                ]
            ]
        },
        {
            "name": "fsspec",
            "specs": [
                [
                    "<",
                    "2024.3.0"
                ],
                [
                    ">=",
                    "2021.7.0"
                ]
            ]
        },
        {
            "name": "gcsfs",
            "specs": [
                [
                    ">=",
                    "2021.7.0"
                ]
            ]
        },
        {
            "name": "google-cloud-bigquery",
            "specs": [
                [
                    ">=",
                    "1.11"
                ]
            ]
        },
        {
            "name": "google-cloud-datastore",
            "specs": [
                [
                    ">=",
                    "2.1"
                ]
            ]
        },
        {
            "name": "jupyterlab",
            "specs": [
                [
                    "~=",
                    "3.0"
                ]
            ]
        },
        {
            "name": "papermill",
            "specs": [
                [
                    ">=",
                    "2.0"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "~=",
                    "2.0"
                ]
            ]
        }
    ],
    "lcname": "articat"
}
