<p align="center">
<img src="https://raw.githubusercontent.com/JhossePaul/pysetl/main/docs/assets/images/logo_name.png" alt="PySetl" width="200" />
</p>
[Build](https://github.com/JhossePaul/pysetl/actions/workflows/build.yml) · [Coverage](https://codecov.io/gh/JhossePaul/pysetl) · [Docs](https://pysetl.readthedocs.io/en/latest/?badge=latest)

[PyPI](https://pypi.org/project/pysetl) · [Python](https://www.python.org/downloads/) · [Spark](https://spark.apache.org/docs/latest/) · [Downloads](https://pypi.org/project/pysetl)

[License](https://github.com/JhossePaul/pysetl/blob/main/LICENSE) · [Ruff](https://github.com/astral-sh/ruff) · [mypy](http://mypy-lang.org/) · [pre-commit](https://github.com/pre-commit/pre-commit)
## Overview
PySetl is a framework that improves the readability and structure of PySpark ETL
projects. It takes advantage of Python's typing syntax to reduce runtime errors,
both through static linting tools and runtime type verification, enhancing the
stability of large ETL pipelines.
To accomplish this, PySetl provides three core modules:

- **`pysetl.config`**: Type-safe configuration (a hedged sketch follows this list).
- **`pysetl.storage`**: Storage-agnostic, extensible data source connections.
- **`pysetl.workflow`**: Pipeline management and dependency injection.
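For instance, a minimal configuration sketch. The field names `path` and
`inferSchema` below are illustrative assumptions, not the confirmed API; see the
API reference for the exact `CsvConfig` fields:

```python
from pysetl.config import CsvConfig

# Illustrative only: the field names here are assumptions, not the
# confirmed CsvConfig API. Check the API reference for exact fields.
citizens_config = CsvConfig(
    path="data/citizens.csv",  # assumed field name
    inferSchema="true",        # assumed field name
)
```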
PySetl is designed with Python typing syntax at its core. We strongly suggest
using [typedspark](https://typedspark.readthedocs.io/en/latest/) and
[pydantic](https://docs.pydantic.dev/latest/) for development.
## Why use PySetl?
- Model complex data pipelines.
- Reduce production risk with type-safe development.
- Improve the structure and readability of large projects.
## Quick Start
```python
from typedspark import Column, DataSet, Schema, create_partially_filled_dataset
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType

from pysetl.workflow import Factory, Pipeline, Stage

spark = SparkSession.builder.getOrCreate()

# Define your data schema
class Citizen(Schema):
    name: Column[StringType]
    age: Column[IntegerType]
    city: Column[StringType]

# Create a factory implementing the read -> process -> write -> get lifecycle
class CitizensFactory(Factory[DataSet[Citizen]]):
    def read(self):
        self.citizens = create_partially_filled_dataset(
            spark, Citizen,
            [{Citizen.name: "Alice", Citizen.age: 30, Citizen.city: "NYC"}]
        )
        return self

    def process(self):
        return self

    def write(self):
        return self

    def get(self):
        return self.citizens

# Build and run the pipeline
stage = Stage().add_factory_from_type(CitizensFactory)
pipeline = Pipeline().add_stage(stage).run()
```
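Each factory's `get` method returns its typed output. A minimal sketch of
driving the lifecycle by hand and consuming the result, assuming the factory
takes no constructor arguments; column references go through the `Citizen`
schema, so typedspark and mypy can check them:

```python
# Run the factory lifecycle manually and grab its typed output.
citizens = CitizensFactory().read().process().write().get()

# Schema attributes resolve to pyspark Columns, so filters are type-checked.
adults = citizens.filter(Citizen.age >= 18)
adults.show()
```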
## Installation
PySetl is available on PyPI:
```bash
pip install pysetl
```
### Optional Dependencies
PySetl provides several optional dependencies for different use cases:
- **PySpark**: For local development (most production environments come with
their own Spark distribution)
```bash
pip install "pysetl[pyspark]"
```
- **Documentation**: For building documentation locally
```bash
pip install "pysetl[docs]"
```
## Documentation
- 📖 [User Guide](https://pysetl.readthedocs.io/en/latest/user-guide/)
- 🔧 [API Reference](https://pysetl.readthedocs.io/en/latest/api/)
- 🚀 [Getting Started](https://pysetl.readthedocs.io/en/latest/home/quickstart/)
- 🤝 [Contributing](https://pysetl.readthedocs.io/en/latest/development/)
## Development
```bash
git clone https://github.com/JhossePaul/pysetl.git
cd pysetl
hatch env show # Shows available environments and scripts
hatch shell
pre-commit install
```
### Development Commands
- **Type checking**: `hatch run type`
- **Lint code**: `hatch run lint`
- **Format code**: `hatch run format`
- **Run tests (default environment only)**: `hatch test`
- **Run all test matrix**: `hatch test --all`
- **Run tests with coverage (all matrix)**: `hatch test --cover --all`
- **Build documentation**: `hatch run docs:docs`
- **Serve documentation**: `hatch run docs:serve`
- **Security checks**: `hatch run security:all`
## Contributing
We welcome contributions! Please see our
[Contributing Guide](https://pysetl.readthedocs.io/en/latest/development/)
for details.
## License
This project is licensed under the Apache License 2.0 - see the
[LICENSE](https://github.com/JhossePaul/pysetl/blob/main/LICENSE) file for
details.
## Acknowledgments
PySetl is a port of [SETL](https://setl-framework.github.io/setl/). This package
is heavily inspired by the work of the SETL team; we adapted it to work
idiomatically in Python.
## Supported Python Versions
PySetl supports Python 3.9 through 3.13. The typing system and all features are
compatible across these versions. Recent updates have improved compatibility
with Python 3.9, especially around advanced typing and generics.
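For example, code targeting 3.9 should prefer `typing.Optional` and
`typing.Union` over PEP 604 `X | Y` unions (which require 3.10+), while PEP 585
builtin generics such as `list[int]` already work on 3.9:

```python
from typing import Optional, Union

# Python 3.9-compatible annotations: builtin generics (PEP 585) work,
# but "float | None" union syntax (PEP 604) would require Python 3.10+.
def mean_age(ages: list[int], default: Optional[float] = None) -> Union[float, None]:
    return sum(ages) / len(ages) if ages else default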