<p align="center">
<img src="https://raw.githubusercontent.com/JhossePaul/pysetl/main/docs/assets/images/logo_name.png" alt="PySetl" width="200" />
</p>
[Build](https://github.com/JhossePaul/pysetl/actions/workflows/build.yml) · [Coverage](https://codecov.io/gh/JhossePaul/pysetl) · [Docs](https://pysetl.readthedocs.io/en/latest/?badge=latest)

[PyPI](https://pypi.org/project/pysetl) · [Python](https://www.python.org/downloads/) · [Spark](https://spark.apache.org/docs/latest/) · [Downloads](https://pypi.org/project/pysetl)

[License](https://github.com/JhossePaul/pysetl/blob/main/LICENSE) · [Ruff](https://github.com/astral-sh/ruff) · [mypy](http://mypy-lang.org/) · [pre-commit](https://github.com/pre-commit/pre-commit)
## Overview
PySetl is a framework that improves the readability and structure of PySpark ETL
projects. It takes advantage of Python's typing syntax to reduce runtime errors,
both through static linting tools and runtime type verification, enhancing the
stability of large ETL pipelines.
To accomplish this, PySetl provides three core modules:

- **`pysetl.config`**: Type-safe configuration (a hedged sketch follows this list).
- **`pysetl.storage`**: Storage-agnostic, extensible data source connections.
- **`pysetl.workflow`**: Pipeline management and dependency injection.
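For instance, a minimal configuration sketch. The field names `path` and
`inferSchema` below are illustrative assumptions, not the confirmed API; see the
API reference for the exact `CsvConfig` fields:

```python
from pysetl.config import CsvConfig

# Illustrative only: the field names here are assumptions, not the
# confirmed CsvConfig API. Check the API reference for exact fields.
citizens_config = CsvConfig(
    path="data/citizens.csv",  # assumed field name
    inferSchema="true",        # assumed field name
)
```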
PySetl is designed with Python typing syntax at its core. We strongly suggest
using [typedspark](https://typedspark.readthedocs.io/en/latest/) and
[pydantic](https://docs.pydantic.dev/latest/) for development.
## Why use PySetl?
- Model complex data pipelines.
- Reduce production risk with type-safe development.
- Improve the structure and readability of large projects.
## Quick Start
```python
from typedspark import Column, DataSet, Schema, create_partially_filled_dataset
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType

from pysetl.workflow import Factory, Pipeline, Stage

spark = SparkSession.builder.getOrCreate()

# Define your data schema
class Citizen(Schema):
    name: Column[StringType]
    age: Column[IntegerType]
    city: Column[StringType]

# Create a factory implementing the read -> process -> write -> get lifecycle
class CitizensFactory(Factory[DataSet[Citizen]]):
    def read(self):
        self.citizens = create_partially_filled_dataset(
            spark, Citizen,
            [{Citizen.name: "Alice", Citizen.age: 30, Citizen.city: "NYC"}]
        )
        return self

    def process(self):
        return self

    def write(self):
        return self

    def get(self):
        return self.citizens

# Build and run the pipeline
stage = Stage().add_factory_from_type(CitizensFactory)
pipeline = Pipeline().add_stage(stage).run()
```
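Each factory's `get` method returns its typed output. A minimal sketch of
driving the lifecycle by hand and consuming the result, assuming the factory
takes no constructor arguments; column references go through the `Citizen`
schema, so typedspark and mypy can check them:

```python
# Run the factory lifecycle manually and grab its typed output.
citizens = CitizensFactory().read().process().write().get()

# Schema attributes resolve to pyspark Columns, so filters are type-checked.
adults = citizens.filter(Citizen.age >= 18)
adults.show()
```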
## Installation
PySetl is available on PyPI:
```bash
pip install pysetl
```
### Optional Dependencies
PySetl provides several optional dependencies for different use cases:
- **PySpark**: For local development (most production environments come with
their own Spark distribution)
```bash
pip install "pysetl[pyspark]"
```
- **Documentation**: For building documentation locally
```bash
pip install "pysetl[docs]"
```
## Documentation
- 📖 [User Guide](https://pysetl.readthedocs.io/en/latest/user-guide/)
- 🔧 [API Reference](https://pysetl.readthedocs.io/en/latest/api/)
- 🚀 [Getting Started](https://pysetl.readthedocs.io/en/latest/home/quickstart/)
- 🤝 [Contributing](https://pysetl.readthedocs.io/en/latest/development/)
## Development
```bash
git clone https://github.com/JhossePaul/pysetl.git
cd pysetl
hatch env show # Shows available environments and scripts
hatch shell
pre-commit install
```
### Development Commands
- **Type checking**: `hatch run type`
- **Lint code**: `hatch run lint`
- **Format code**: `hatch run format`
- **Run tests (default environment only)**: `hatch test`
- **Run all test matrix**: `hatch test --all`
- **Run tests with coverage (all matrix)**: `hatch test --cover --all`
- **Build documentation**: `hatch run docs:docs`
- **Serve documentation**: `hatch run docs:serve`
- **Security checks**: `hatch run security:all`
## Contributing
We welcome contributions! Please see our
[Contributing Guide](https://pysetl.readthedocs.io/en/latest/development/)
for details.
## License
This project is licensed under the Apache License 2.0 - see the
[LICENSE](https://github.com/JhossePaul/pysetl/blob/main/LICENSE) file for
details.
## Acknowledgments
PySetl is a port of [SETL](https://setl-framework.github.io/setl/). This package
is heavily inspired by the work of the SETL team; we adapted it to work
idiomatically in Python.
## Supported Python Versions
PySetl supports Python 3.9 through 3.13. The typing system and all features are
compatible across these versions. Recent updates have improved compatibility
with Python 3.9, especially around advanced typing and generics.
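For example, code targeting 3.9 should prefer `typing.Optional` and
`typing.Union` over PEP 604 `X | Y` unions (which require 3.10+), while PEP 585
builtin generics such as `list[int]` already work on 3.9:

```python
from typing import Optional, Union

# Python 3.9-compatible annotations: builtin generics (PEP 585) work,
# but "float | None" union syntax (PEP 604) would require Python 3.10+.
def mean_age(ages: list[int], default: Optional[float] = None) -> Union[float, None]:
    return sum(ages) / len(ages) if ages else default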