# Typedspark: column-wise type annotations for pyspark DataFrames
We love Spark! But in production code we're wary when we see:
```python
from pyspark.sql import DataFrame

def foo(df: DataFrame) -> DataFrame:
    # do stuff
    return df
```
Because… how do we know which columns are supposed to be in ``df``?
Using ``typedspark``, we can be more explicit about what these data should look like.
```python
from typedspark import Column, DataSet, Schema
from pyspark.sql.types import LongType, StringType

class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]

def foo(df: DataSet[Person]) -> DataSet[Person]:
    # do stuff
    return df
```
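To illustrate what such an annotation buys at runtime, here is a minimal, self-contained sketch of the underlying idea (note: ``Column``, ``Schema``, and ``validate`` below are simplified stand-ins, not typedspark's actual implementation): a schema's type hints can be compared against a DataFrame's columns before the data is trusted.

```python
from typing import Generic, TypeVar, get_type_hints

T = TypeVar("T")

class Column(Generic[T]):
    """Stand-in for typedspark.Column."""

class LongType: ...
class StringType: ...

class Schema:
    """Stand-in for typedspark.Schema."""

class Person(Schema):
    id: Column[LongType]
    name: Column[StringType]
    age: Column[LongType]

def validate(columns: list, schema: type) -> None:
    """Raise if any column annotated on the schema is absent."""
    missing = set(get_type_hints(schema)) - set(columns)
    if missing:
        raise TypeError(f"missing columns: {sorted(missing)}")

validate(["id", "name", "age"], Person)  # passes silently
```

The real library performs this kind of check against actual pyspark DataFrames; the sketch only shows why column-level annotations make such a check possible at all.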
The advantages include:
* Improved code readability
* Type checking, both at runtime and during linting
* Auto-completion of column names
* Easy refactoring of column names
* Easier unit testing through the generation of empty ``DataSets`` from their schemas
* Improved documentation of tables
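The unit-testing point deserves a concrete illustration. Below is a minimal, self-contained sketch of the idea using plain Python stand-ins (``Schema`` and ``create_empty_rows`` here are hypothetical simplifications; typedspark's own helpers build real pyspark DataFrames): an "empty" dataset can be generated directly from a schema's annotations, so tests never need hand-written fixture columns.

```python
from typing import get_type_hints

class Schema:
    """Stand-in for typedspark.Schema."""

class Person(Schema):
    id: int
    name: str
    age: int

def create_empty_rows(schema: type, n_rows: int = 3) -> list:
    """Build n_rows rows with every annotated column set to None."""
    columns = get_type_hints(schema)
    return [{col: None for col in columns} for _ in range(n_rows)]

rows = create_empty_rows(Person)
print(rows[0])  # {'id': None, 'name': None, 'age': None}
```

If the schema gains or loses a column, every generated test fixture follows automatically, which is the refactoring benefit listed above.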
## Documentation
Please see our documentation on [readthedocs](https://typedspark.readthedocs.io/en/latest/index.html).
## Installation
You can install ``typedspark`` from [pypi](https://pypi.org/project/typedspark/) by running:
```bash
pip install typedspark
```
By default, ``typedspark`` does not list ``pyspark`` as a dependency, since many platforms (e.g. Databricks) come with ``pyspark`` preinstalled. If you want to install ``typedspark`` with ``pyspark``, you can run:
```bash
pip install "typedspark[pyspark]"
```
## Demo videos
### IDE demo
https://github.com/kaiko-ai/typedspark/assets/47976799/e6f7fa9c-6d14-4f68-baba-fe3c22f75b67
You can find the corresponding code [here](docs/videos/ide.ipynb).
### Jupyter / Databricks notebooks demo
https://github.com/kaiko-ai/typedspark/assets/47976799/39e157c3-6db0-436a-9e72-44b2062df808
You can find the corresponding code [here](docs/videos/notebook.ipynb).
## FAQ
**I found a bug! What should I do?**<br>
Great! Please make an issue and we'll look into it.
**I have a great idea to improve typedspark! How can we make this work?**<br>
Awesome, please make an issue and let us know!