Name | databeakers |
Version | 0.3.9 |
home_page | |
Summary | |
upload_time | 2023-08-31 04:33:37 |
maintainer | |
docs_url | None |
author | James Turk |
requires_python | >=3.10,<4.0 |
license | |
keywords | |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
# beakers
beakers is an experimental, lightweight, declarative ETL framework for Python.
It is still very much in flux with no correctness or stability guarantees.
No contributions yet please, but feel free to poke around/ask questions.
## Features
- declarative ETL graph comprised of Python functions & Pydantic models
- developer-friendly CLI for running processes
- sync/async task execution
- data checkpoints stored in local database for intermediate caching & resuming interrupted runs
- robust error handling, including retries
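
The checkpointing idea in the list above can be illustrated without beakers itself. The following is a minimal sketch, assuming a SQLite table keyed by step name and item id. The table layout and the `run_step` helper are hypothetical; they are not the beakers API or its storage format.

```python
import json
import sqlite3


def run_step(db: sqlite3.Connection, step: str, items: dict, func) -> dict:
    """Apply func to each item, skipping ids already checkpointed for this step."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint "
        "(step TEXT, item_id TEXT, result TEXT, PRIMARY KEY (step, item_id))"
    )
    results = {}
    for item_id, value in items.items():
        row = db.execute(
            "SELECT result FROM checkpoint WHERE step = ? AND item_id = ?",
            (step, item_id),
        ).fetchone()
        if row is None:
            # Not cached yet: compute the result and persist it before moving on.
            result = func(value)
            db.execute(
                "INSERT INTO checkpoint VALUES (?, ?, ?)",
                (step, item_id, json.dumps(result)),
            )
            db.commit()
        else:
            # Already checkpointed: reuse the stored result.
            result = json.loads(row[0])
        results[item_id] = result
    return results


calls = []  # records which inputs were actually computed


def double(x):
    calls.append(x)
    return x * 2


db = sqlite3.connect(":memory:")
first = run_step(db, "double", {"a": 1, "b": 2}, double)
# Simulates resuming: "a" and "b" come from the checkpoint, only "c" is computed.
second = run_step(db, "double", {"a": 1, "b": 2, "c": 3}, double)
```

Because every result is committed as soon as it is produced, an interrupted run loses at most the item that was in progress.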
## Guiding Principles
* **Lightweight** - Writing a single Python file should be enough to get started. In that sense, it should be as easy to use as a script.
* **Data-centric** - Know what data is added at each step.
* **Modern Python** - Take full advantage of recent additions to Python, including type hints, `asyncio`, and libraries like `pydantic` and `rich`.
* **Developer Experience** - Prioritize the developer's day-to-day workflow: a nice CLI and helpful error messages.
## Anti-Principles
Unlike most tools in this space, this is not a complete "enterprise grade" ETL solution.
It isn't a perfect analogy by any means, but it could be said that `databeakers` is to `luigi` what `flask` is to `Django`.
If you are building your entire business around ETL, it makes sense to invest in the infrastructure & tooling to make that work.
Maybe structuring your code around beakers will make it easier to migrate to one of those tools than if you had written a bespoke script.
Plus, beakers is Python, so you can always start by running it from within a bigger framework.
## Concepts
Like most ETL tools, beakers is built around a directed acyclic graph (DAG).
The nodes on this graph are known as "beakers", and the edges are often called "transforms".
(Note: These names aren't final, suggestions welcome.)
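
To make the graph structure concrete, here is a small sketch that uses only the standard library's `graphlib`. The beaker names are hypothetical, and this is not how beakers itself declares a graph; it only shows nodes, edges, and a valid processing order.

```python
from graphlib import TopologicalSorter

# Hypothetical beaker names; each key maps to the beakers it reads from.
edges = {
    "url": set(),
    "response": {"url"},
    "parsed": {"response"},
    "summary": {"parsed"},
}

# static_order() yields every node after all of its predecessors and
# raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(edges).static_order())
print(order)
```

Any ETL run over a DAG has to visit the nodes in an order like this one, so that each beaker is filled before the edges leaving it are processed.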
### Beakers
Each node in the graph is called a "beaker". A beaker is a container for some data.
Each beaker has a name and a type.
The name is used to refer to the beaker elsewhere in the graph.
The type, represented by a `pydantic` model, defines the structure of the data. By leveraging `pydantic` we get a lot of nice features for free, like validation and serialization.
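
As a rough illustration of "a name and a type", the sketch below uses a standard-library dataclass as a stand-in for a `pydantic` model, so it shows only a basic type check and dictionary serialization rather than full `pydantic` validation. The `Beaker` class and its `add` method are hypothetical and are not the beakers API.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Page:
    """Stand-in for a pydantic model describing one record."""
    url: str
    status: int


@dataclass
class Beaker:
    """Hypothetical container: a name plus the record type it accepts."""
    name: str
    item_type: type
    items: list = field(default_factory=list)

    def add(self, item) -> None:
        # Reject records that do not match the declared type.
        if not isinstance(item, self.item_type):
            raise TypeError(f"{self.name} expects {self.item_type.__name__}")
        self.items.append(item)


pages = Beaker("pages", Page)
pages.add(Page(url="https://example.com", status=200))

# Serialization: every stored record can be turned into a plain dict.
serialized = [asdict(p) for p in pages.items]
```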
### Transform
Edges in the graph represent dataflow between beakers. Each edge has a concept of a "source beaker" and a "destination beaker".
These come in two main flavors:
* **Transforms** - A transform places new data in the destination beaker based on data already in the source beaker.
An example of this might be a transform that takes a list of URLs and downloads the HTML for each one, placing the results in a new beaker.
* **Filters** - A filter stops data from flowing from one beaker to another based on some criteria.
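
The two kinds of edge described above can be sketched as plain functions over lists. The helper names `transform` and `filter_edge` are hypothetical, and the "download" step is faked so that the example needs no network access.

```python
def transform(source: list, func) -> list:
    """Edge that derives one destination item from each source item."""
    return [func(item) for item in source]


def filter_edge(source: list, keep) -> list:
    """Edge that passes along only the items meeting the criterion."""
    return [item for item in source if keep(item)]


urls = [
    "https://example.com/a",
    "http://example.com/b",
    "https://example.com/c",
]

# Filter: stop plain-http URLs from flowing onward.
secure = filter_edge(urls, lambda u: u.startswith("https://"))

# Transform: a stand-in for "download the HTML" that builds a fake page.
pages = transform(secure, lambda u: {"url": u, "html": f"<html>{u}</html>"})
```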
### Seed
A concept somewhat unique to beakers is the "seed". A seed is a function that returns initial data for a beaker.
This is useful for things like starting the graph with a list of URLs to scrape, or a list of images to process.
A beaker can have any number of seeds. For example, one seed might provide a short list of URLs for testing, while another reads from a database.
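
A seed is, at its core, a function that returns initial data. The sketch below keeps several named seeds for one beaker. The `seeds` registry and `register_seed` helper are hypothetical illustrations, not the beakers API, and the "database" seed is replaced by a generator so that the example stays self-contained.

```python
from typing import Callable, Iterable

# Hypothetical registry: beaker name -> {seed name -> seed function}.
seeds: dict[str, dict[str, Callable[[], Iterable[str]]]] = {}


def register_seed(beaker: str, name: str, func: Callable[[], Iterable[str]]) -> None:
    """Attach a named seed function to a beaker."""
    seeds.setdefault(beaker, {})[name] = func


def sample_urls() -> list[str]:
    # Short fixed list for quick test runs.
    return ["https://example.com/1", "https://example.com/2"]


def numbered_urls() -> Iterable[str]:
    # Stand-in for a seed that would read from a database.
    return (f"https://example.com/{n}" for n in range(1, 101))


register_seed("url", "testing", sample_urls)
register_seed("url", "full", numbered_urls)

# Run one seed by name to fill the beaker with its initial data.
url_beaker = list(seeds["url"]["testing"]())
```

Keeping the small seed next to the full one makes it cheap to try the whole graph on a couple of records before committing to a long run.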
Raw data
{
"_id": null,
"home_page": "",
"name": "databeakers",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "James Turk",
"author_email": "dev@jamesturk.net",
"download_url": "https://files.pythonhosted.org/packages/a1/ac/5a89bba7a24ee288d06018d46298d5acb7a072cda48cc5f2b20a58cdf836/databeakers-0.3.9.tar.gz",
"platform": null,
"description": "# beakers\n\nbeakers is an experimental lightweight declarative ETL framework for Python\n\nIt is still very much in flux with no correctness or stability guarantees. \n\nNo contributions yet please, but feel free to poke around/ask questions.\n\n## Features\n\n- declarative ETL graph comprised of Python functions & Pydantic models\n- developer-friendly CLI for running processes\n- sync/async task execution\n- data checkpoints stored in local database for intermediate caching & resuming interrupted runs\n- robust error handling, including retries\n\n## Guiding Principles\n\n* **Lightweight** - Writing a single python file should be enough to get started. It should be as easy to use as a script in that sense.\n* **Data-centric** - Know what data is added at each step.\n* **Modern Python** - Take full advantage of recent additions to Python, including type hints, `asyncio`, and libraries like `pydantic` and `rich`.\n* **Developer Experience** - Focused on the developer experience: a nice CLI, helpful error messages.\n\n## Anti-Principles\n\nUnlike most tools in this space, this is not a complete \"enterprise grade\" ETL solution.\n\nIt isn't a perfect analogy by any means but it could be said `databeakers` is to `luigi` what `flask` is to `Django`.\nIf you are building your entire business around ETL, it makes sense to invest in the infrastructure & tooling to make that work.\nMaybe structuring your code around beakers will make it easier to migrate to one of those tools than if you had written a bespoke script.\nPlus, beakers is Python, so you can always start by running it from within a bigger framework.\n\n## Concepts\n\nLike most ETL tools, beakers is built around a directed acyclic graph (DAG).\n\nThe nodes on this graph are known as \"beakers\", and the edges are often called \"transforms\".\n\n(Note: These names aren't final, suggestions welcome.)\n\n### Beakers\n\nEach node in the graph is called a \"beaker\". 
A beaker is a container for some data.\n\nEach beaker has a name and a type.\nThe name is used to refer to the beaker elsewhere in the graph.\nThe type, represented by a `pydantic` model, defines the structure of the data. By leveraging `pydantic` we get a lot of nice features for free, like validation and serialization.\n\n### Transform\n\nEdges in the graph represent dataflow between beakers. Each edge has a concept of a \"source beaker\" and a \"destination beaker\".\n\n These come in two main flavors:\n\n* **Transforms** - A transform places new data in the destination beaker based on data already in the source beaker.\nAn example of this might be a transform that takes a list of URLs and downloads the HTML for each one, placing the results in a new beaker.\n\n* **Filter** - A filter can be used to stop the flow of data from one beaker to another based on some criteria.\n\n### Seed\n\nA concept somewhat unique to beakers is the \"seed\". A seed is a function that returns initial data for a beaker.\n\nThis is useful for things like starting the graph with a list of URLs to scrape, or a list of images to process.\n\nA beaker can have any number of seeds, for example one might have a short list of URLs to use for testing, and another that reads from a database.",
"bugtrack_url": null,
"license": "",
"summary": "",
"version": "0.3.9",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "19f8e00b3bfa7d7530ceaf174801e12bf51024c925c423584a64d44d4eea94c3",
"md5": "ee37584471568af583ba0fa696a18e38",
"sha256": "9dd5384925af08f2fa3704396ff6eabab74a940af5fc38f4b8c163ce9f644377"
},
"downloads": -1,
"filename": "databeakers-0.3.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ee37584471568af583ba0fa696a18e38",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10,<4.0",
"size": 36334,
"upload_time": "2023-08-31T04:33:36",
"upload_time_iso_8601": "2023-08-31T04:33:36.738551Z",
"url": "https://files.pythonhosted.org/packages/19/f8/e00b3bfa7d7530ceaf174801e12bf51024c925c423584a64d44d4eea94c3/databeakers-0.3.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a1ac5a89bba7a24ee288d06018d46298d5acb7a072cda48cc5f2b20a58cdf836",
"md5": "869685f72b4e3cc7385165f3dce0974a",
"sha256": "3bcc3da151fd75dbfeaad53c7415c0ba40d02a416bc4f36277c3e765e3dccc53"
},
"downloads": -1,
"filename": "databeakers-0.3.9.tar.gz",
"has_sig": false,
"md5_digest": "869685f72b4e3cc7385165f3dce0974a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10,<4.0",
"size": 33906,
"upload_time": "2023-08-31T04:33:37",
"upload_time_iso_8601": "2023-08-31T04:33:37.808124Z",
"url": "https://files.pythonhosted.org/packages/a1/ac/5a89bba7a24ee288d06018d46298d5acb7a072cda48cc5f2b20a58cdf836/databeakers-0.3.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-31 04:33:37",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "databeakers"
}