# Typedspark: column-wise type annotations for pyspark DataFrames
We love Spark! But in production code we're wary when we see:
```python
from pyspark.sql import DataFrame

def foo(df: DataFrame) -> DataFrame:
    # do stuff
    return df
```
Because… How do we know which columns are supposed to be in ``df``?
Using ``typedspark``, we can be more explicit about what these data should look like.
```python
from typedspark import Column, DataSet, Schema
from pyspark.sql.types import LongType, StringType

class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]

def foo(df: DataSet[Person]) -> DataSet[Person]:
    # do stuff
    return df
```
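Because the schema's attributes also act as column references, you can use them directly in transformations. The snippet below is a minimal sketch of that pattern, reusing the ``Person`` schema above; it assumes (per the typedspark docs) that attributes such as ``Person.age`` resolve to pyspark ``Column`` expressions and that ``DataSet[Person](...)`` validates a DataFrame against the schema at runtime.

```python
from typedspark import DataSet

# Uses the Person schema and imports from the example above.
def adults(df: DataSet[Person]) -> DataSet[Person]:
    # Person.age is assumed to resolve to a pyspark Column, so it can be
    # used in expressions; IDEs can auto-complete it, and renaming the
    # schema attribute refactors every usage.
    # DataSet[Person](...) is assumed to re-validate the result against
    # the schema at runtime.
    return DataSet[Person](df.filter(Person.age >= 18))
```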
The advantages include:

* Improved readability of the code
* Type checking, both at runtime and during linting
* Auto-completion of column names
* Easy refactoring of column names
* Easier unit testing through the generation of empty ``DataSets`` based on their schemas (a sketch follows this list)
* Improved documentation of tables
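For the unit-testing point, the typedspark documentation describes helpers that build ``DataSets`` directly from a schema. The sketch below assumes the documented ``create_partially_filled_dataset`` helper and the ``Person`` schema and ``foo`` function from the example above; columns not listed in the dictionary are expected to be filled with nulls.

```python
from pyspark.sql import SparkSession
from typedspark import create_partially_filled_dataset

spark = SparkSession.builder.getOrCreate()

# Build a small, schema-conformant DataSet[Person] for a test;
# column names and types come from the Person schema above.
persons = create_partially_filled_dataset(
    spark,
    Person,
    {
        Person.id: [1, 2],
        Person.name: ["Alice", "Bob"],
        Person.age: [42, 17],
    },
)

result = foo(persons)
```

Since the test data is generated from the schema, renaming a column in ``Person`` propagates to the test automatically.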
## Documentation
Please see our documentation on [readthedocs](https://typedspark.readthedocs.io/en/latest/index.html).
## Installation
You can install ``typedspark`` from [pypi](https://pypi.org/project/typedspark/) by running:
```bash
pip install typedspark
```
By default, ``typedspark`` does not list ``pyspark`` as a dependency, since many platforms (e.g. Databricks) come with ``pyspark`` preinstalled. If you want to install ``typedspark`` with ``pyspark``, you can run:
```bash
pip install "typedspark[pyspark]"
```
## Demo videos
### IDE demo
https://github.com/kaiko-ai/typedspark/assets/47976799/e6f7fa9c-6d14-4f68-baba-fe3c22f75b67
You can find the corresponding code [here](docs/videos/ide.ipynb).
### Jupyter / Databricks notebooks demo
https://github.com/kaiko-ai/typedspark/assets/47976799/39e157c3-6db0-436a-9e72-44b2062df808
You can find the corresponding code [here](docs/videos/notebook.ipynb).
## FAQ
**I found a bug! What should I do?**<br/>
Great! Please make an issue and we'll look into it.

**I have a great idea to improve typedspark! How can we make this work?**<br/>
Awesome, please make an issue and let us know!