datasaurus


- Name: datasaurus
- Version: 0.0.2.dev4
- home_page: https://www.github.com/surister/datasaurus
- Summary: Data Engineering framework based on Polars.rs
- upload_time: 2023-12-19 12:10:51
- docs_url: None
- author: surister
- requires_python: >=3.8.1,<4.0.0
- license: MIT
- keywords: python3, data, polars, dataframes, framework, data engineering
- requirements: No requirements were recorded.
- Travis-CI: No Travis.

Datasaurus is a Data Engineering framework written in Python, for Python 3.8, 3.9, 3.10 and 3.11.

It is based on Polars and heavily influenced by Django.

Datasaurus offers an opinionated, feature-rich and powerful framework to help you write
data pipelines, ETLs or data manipulation programs.

[Documentation]() (TODO)
## It supports:
- ✅ Full support for read/write operations.
- ⭕ Not yet supported, but will be implemented.
- 💀 Won't be implemented in the near future.

### Storages:
- SQLite ✅
- PostgreSQL ✅
- MySQL ✅
- MariaDB ✅
- Local Storage ✅
- Azure Blob Storage ⭕
- AWS S3 ⭕


### Formats:
- CSV ✅
- JSON ✅
- PARQUET ✅
- EXCEL ✅
- AVRO ✅
- TSV ⭕
- SQL ⭕ (like SQL inserts)

### Features:
- Delta Tables ⭕
- Field validations ⭕

## Simple example
```python
# settings.py
import os

from datasaurus.core.storage import PostgresStorage, StorageGroup, SqliteStorage

# We set the environment that will be used.
os.environ['DATASAURUS_ENVIRONMENT'] = 'dev'

class ProfilesData(StorageGroup):
    dev = SqliteStorage(path='/data/data.sqlite')
    live = PostgresStorage(username='user', password='user', host='localhost', database='postgres')

    
# models.py
from datasaurus.core.models import Model, StringColumn, IntegerColumn
from settings import ProfilesData  # assumes the settings.py above is importable

class ProfileModel(Model):
    id = IntegerColumn()
    username = StringColumn()
    mail = StringColumn()
    sex = StringColumn()

    class Meta:
        storage = ProfilesData
        table_name = 'PROFILE'

```
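
The environment variable decides which attribute of the storage group is used. The following is a minimal sketch of that idea in plain Python, not Datasaurus's actual implementation: `DemoGroup` and `resolve_storage` are hypothetical names, and the storages are represented as plain strings.

```python
import os


class DemoGroup:
    # Hypothetical stand-in for a StorageGroup: one attribute per environment.
    dev = 'sqlite:///data/data.sqlite'
    live = 'postgresql://user@localhost/postgres'


def resolve_storage(group, env_var='DATASAURUS_ENVIRONMENT'):
    """Return the attribute of `group` named by the environment variable."""
    environment = os.environ.get(env_var)
    if environment is None:
        raise RuntimeError(f'{env_var} is not set')
    try:
        return getattr(group, environment)
    except AttributeError:
        raise LookupError(
            f'{group.__name__} has no storage for {environment!r}'
        ) from None


os.environ['DATASAURUS_ENVIRONMENT'] = 'dev'
selected = resolve_storage(DemoGroup)
print(selected)
```

Switching the variable to `'live'` would make the same lookup return the other storage without touching the model code.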

We can access the raw Polars dataframe with `Model.df`. It is lazy: the data is only loaded
when the attribute is accessed.

```py
>>> ProfileModel.df
shape: (100, 4)
┌─────┬────────────────────┬──────────────────────────┬─────┐
│ id  ┆ username           ┆ mail                     ┆ sex │
│ --- ┆ ---                ┆ ---                      ┆ --- │
│ i64 ┆ str                ┆ str                      ┆ str │
╞═════╪════════════════════╪══════════════════════════╪═════╡
│ 1   ┆ ehayes             ┆ colleen63@hotmail.com    ┆ F   │
│ 2   ┆ thompsondeborah    ┆ judyortega@hotmail.com   ┆ F   │
│ 3   ┆ orivera            ┆ iperkins@hotmail.com     ┆ F   │
│ 4   ┆ ychase             ┆ sophia92@hotmail.com     ┆ F   │
│ …   ┆ …                  ┆ …                        ┆ …   │
│ 97  ┆ mary38             ┆ sylvia80@yahoo.com       ┆ F   │
│ 98  ┆ charlessteven      ┆ usmith@gmail.com         ┆ F   │
│ 99  ┆ plee               ┆ powens@hotmail.com       ┆ F   │
│ 100 ┆ elliottchristopher ┆ wilsonbenjamin@yahoo.com ┆ M   │
└─────┴────────────────────┴──────────────────────────┴─────┘

```
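
A class attribute that loads its data only on first access can be built with a Python descriptor. This is a sketch of the general technique, not the way Datasaurus implements `Model.df`: `LazyClassAttribute`, `load_rows` and `DemoModel` are hypothetical, a list of dicts stands in for the dataframe, and caching the loaded value is an assumption.

```python
class LazyClassAttribute:
    """Descriptor that runs `loader` on first access and caches the result per class."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}

    def __get__(self, instance, owner):
        # `owner` is the class the attribute is looked up on.
        if owner not in self._cache:
            self._cache[owner] = self._loader(owner)
        return self._cache[owner]


load_calls = []


def load_rows(model_cls):
    # Hypothetical loader; a real one would read from the configured storage.
    load_calls.append(model_cls.__name__)
    return [{'id': 1, 'username': 'ehayes'}]


class DemoModel:
    df = LazyClassAttribute(load_rows)


assert load_calls == []  # nothing is loaded when the class is defined
first = DemoModel.df     # the first access triggers the load
second = DemoModel.df    # later accesses reuse the cached value
```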

We could now create a new model whose data is derived from `ProfileModel`:

```python
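import polars as pl


class FemaleProfiles(Model):
    id = IntegerColumn()
    profile_id = IntegerColumn()
    mail = StringColumn()

    def calculate_data(self):
        return (
            ProfileModel.df
            .filter(ProfileModel.sex == 'F')
            .with_row_count('new_id')
            # Both expressions are evaluated against the input frame, so
            # `profile_id` keeps the original id and `id` takes the new row count.
            .with_columns(
                pl.col('id').alias('profile_id'),
                pl.col('new_id').alias('id'),
            )
        )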

    class Meta:
        recalculate = 'if_no_data_in_storage'
        storage = ProfilesData
        table_name = 'PROFILE_FEMALES'
```
Et voilà! The columns are selected automatically from the column definitions (id, profile_id and mail).
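
Selecting columns from the declared fields can be pictured as follows. This is an illustrative sketch in plain Python, not the framework's code: `Column`, `DemoModel` and `select_declared` are hypothetical, and a list of dicts stands in for the dataframe.

```python
class Column:
    """Hypothetical marker class for a declared column."""


class DemoModel:
    id = Column()
    profile_id = Column()
    mail = Column()

    @classmethod
    def column_names(cls):
        # vars(cls) preserves the order in which attributes were defined.
        return [name for name, value in vars(cls).items() if isinstance(value, Column)]


def select_declared(rows, model):
    """Keep only the keys that the model declares as columns."""
    names = model.column_names()
    return [{name: row[name] for name in names} for row in rows]


calculated = [
    {'new_id': 0, 'id': 0, 'profile_id': 1, 'username': 'ehayes',
     'mail': 'colleen63@hotmail.com', 'sex': 'F'},
]
selected = select_declared(calculated, DemoModel)
```

Extra columns produced by the calculation, such as `new_id`, `username` and `sex`, are dropped because the model does not declare them.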

If we now call:
```python
FemaleProfiles.df
```

This checks whether the dataframe already exists in the storage. If it does not, the data is
calculated from `calculate_data` and saved to the storage. The `recalculate` option can also be set to 'always'.
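
The decision behind `recalculate` can be sketched as below. This is a minimal illustration under stated assumptions, not the library's implementation: `get_data` is a hypothetical helper, a dict stands in for the storage, and `'always'` is assumed to mean that the data is recalculated on every access.

```python
def get_data(storage, table_name, calculate, recalculate='if_no_data_in_storage'):
    """Return the data for `table_name`, recalculating according to the policy."""
    if recalculate not in ('if_no_data_in_storage', 'always'):
        raise ValueError(f'unknown recalculate policy: {recalculate!r}')
    if recalculate == 'always' or table_name not in storage:
        storage[table_name] = calculate()
    return storage[table_name]


calls = []


def calculate_data():
    calls.append(1)
    return [{'id': 0, 'profile_id': 1}]


storage = {}  # a dict stands in for a real storage backend
get_data(storage, 'PROFILE_FEMALES', calculate_data)            # calculated and saved
get_data(storage, 'PROFILE_FEMALES', calculate_data)            # read back from storage
get_data(storage, 'PROFILE_FEMALES', calculate_data, 'always')  # recalculated
print(len(calls))  # prints 2
```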


You can also move data to different environments or storages, making it easy to change formats or
move data around:

```python
FemaleProfiles.save(to=ProfilesData.live)
```

This effectively moves the data from SQLite (dev) to PostgreSQL (live).

```python
# Can also change formats.
# `otherenvironment` stands for any other storage defined in ProfilesData.
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.JSON)
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.CSV)
FemaleProfiles.save(to=ProfilesData.otherenvironment, format=LocalFormat.PARQUET)
```
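
Changing formats amounts to serialising the same rows with a different writer. The sketch below uses only the Python standard library and is not how Datasaurus implements `save`: `SERIALISERS` and this `save` function are hypothetical, and Parquet is left out because it needs a third-party library.

```python
import csv
import io
import json

rows = [
    {'id': 0, 'profile_id': 1, 'mail': 'colleen63@hotmail.com'},
    {'id': 1, 'profile_id': 2, 'mail': 'judyortega@hotmail.com'},
]


def to_json(rows):
    return json.dumps(rows)


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical registry mapping a format name to its serialiser.
SERIALISERS = {'JSON': to_json, 'CSV': to_csv}


def save(rows, fmt):
    """Serialise `rows` with the writer registered for `fmt`."""
    return SERIALISERS[fmt](rows)


csv_text = save(rows, 'CSV')
json_text = save(rows, 'JSON')
```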

            

Raw data

            {
    "_id": null,
    "home_page": "https://www.github.com/surister/datasaurus",
    "name": "datasaurus",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8.1,<4.0.0",
    "maintainer_email": "",
    "keywords": "python3,data,polars,dataframes,framework,data engineering",
    "author": "surister",
    "author_email": "surister98@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/34/e5/37f1adf2e208b1a93e60d29c2b22ba4b4eb121850c8c0dab77319f7c9d46/datasaurus-0.0.2.dev4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Data Engineering framework based on Polars.rs",
    "version": "0.0.2.dev4",
    "project_urls": {
        "Homepage": "https://www.github.com/surister/datasaurus",
        "Repository": "https://www.github.com/surister/datasaurus"
    },
    "split_keywords": [
        "python3",
        "data",
        "polars",
        "dataframes",
        "framework",
        "data engineering"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a9cf9a68b4764c1a2df2668e8d8141c614654d45a3d55d8ee8f9ceeca03e74f3",
                "md5": "614abde226269b643c3a1531860ebce9",
                "sha256": "9e37584a072adf1184b546fe4f0a66dd13dd55ccaed6ef89102fdd609f780ce7"
            },
            "downloads": -1,
            "filename": "datasaurus-0.0.2.dev4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "614abde226269b643c3a1531860ebce9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8.1,<4.0.0",
            "size": 19971,
            "upload_time": "2023-12-19T12:10:49",
            "upload_time_iso_8601": "2023-12-19T12:10:49.540079Z",
            "url": "https://files.pythonhosted.org/packages/a9/cf/9a68b4764c1a2df2668e8d8141c614654d45a3d55d8ee8f9ceeca03e74f3/datasaurus-0.0.2.dev4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "34e537f1adf2e208b1a93e60d29c2b22ba4b4eb121850c8c0dab77319f7c9d46",
                "md5": "34308388f237ef34c6be5d64ca17c588",
                "sha256": "c8dd2a76fb6d52049232782ec575d2d53f634095478781a4a13c3793ec8a322a"
            },
            "downloads": -1,
            "filename": "datasaurus-0.0.2.dev4.tar.gz",
            "has_sig": false,
            "md5_digest": "34308388f237ef34c6be5d64ca17c588",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.1,<4.0.0",
            "size": 16916,
            "upload_time": "2023-12-19T12:10:51",
            "upload_time_iso_8601": "2023-12-19T12:10:51.294388Z",
            "url": "https://files.pythonhosted.org/packages/34/e5/37f1adf2e208b1a93e60d29c2b22ba4b4eb121850c8c0dab77319f7c9d46/datasaurus-0.0.2.dev4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-19 12:10:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "surister",
    "github_project": "datasaurus",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": false,
    "lcname": "datasaurus"
}
        