bigtabular

- Name: bigtabular
- Version: 0.0.3
- Home page: https://github.com/stefan027/bigtabular
- Summary: Extension of fastai.tabular for larger-than-memory datasets with Dask
- Upload time: 2024-07-21 08:50:35
- Author: stefan027
- Requires Python: >=3.7
- License: Apache Software License 2.0
- Keywords: nbdev, jupyter, notebook, python
# BigTabular


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

This library replicates much of the functionality of the tabular data
application in the [fastai](https://docs.fast.ai/) library so that it
works with larger-than-memory datasets. Pandas, which is used for data
transformations in `fastai.tabular`, is replaced with [Dask
DataFrames](https://docs.dask.org/en/stable/dataframe.html).

Most of the Dask implementations were written as they were needed for a
personal project and then refactored to match the fastai API more
closely. The flow of the Jupyter notebooks closely follows that of
`fastai.tabular`, and most of the examples and tests are replicated.

## When not to use BigTabular

Don’t use this library when you don’t need to use Dask. The [Dask
website](https://docs.dask.org/en/stable/dataframe.html) gives the
following guidance:

> Dask DataFrames are often used either when …
>
> 1.  Your data is too big
> 2.  Your computation is too slow and other techniques don’t work
>
> You should probably stick to just using pandas if …
>
> 1.  Your data is small
> 2.  Your computation is fast (subsecond)
> 3.  There are simpler ways to accelerate your computation, like
>     avoiding .apply or Python for loops and using a built-in pandas
>     method instead.
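Point 3 in the quoted guidance is worth illustrating. Here is a minimal, hypothetical pandas sketch (the column name and values are made up) contrasting a per-row `.apply` with the equivalent vectorized built-in arithmetic, which is usually much faster:

``` python
import pandas as pd

# Made-up column standing in for any numeric feature
df = pd.DataFrame({"hours": [40, 45, 32, 50]})

# Slow: a Python-level function call per row
slow = df["hours"].apply(lambda h: h / 40.0)

# Fast: one vectorized operation over the whole column
fast = df["hours"] / 40.0

assert slow.equals(fast)
```

If the vectorized form is fast enough, there is no need for Dask at all.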

## Install

``` sh
pip install bigtabular
```

## How to use

Refer to the [tutorial](nbs/03_tutorial.ipynb) for a more detailed usage
example.

Get a Dask DataFrame:

``` python
from fastai.tabular.all import untar_data, URLs
import pandas as pd
import dask.dataframe as dd

path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'), npartitions=1)
ddf.head()
```

<div>

|     | age | workclass        | fnlwgt | education   | education-num | marital-status     | occupation      | relationship  | race               | sex    | capital-gain | capital-loss | hours-per-week | native-country | salary |
|-----|-----|------------------|--------|-------------|---------------|--------------------|-----------------|---------------|--------------------|--------|--------------|--------------|----------------|----------------|--------|
| 0   | 49  | Private          | 101320 | Assoc-acdm  | 12.0          | Married-civ-spouse | \<NA\>          | Wife          | White              | Female | 0            | 1902         | 40             | United-States  | \>=50k |
| 1   | 44  | Private          | 236746 | Masters     | 14.0          | Divorced           | Exec-managerial | Not-in-family | White              | Male   | 10520        | 0            | 45             | United-States  | \>=50k |
| 2   | 38  | Private          | 96185  | HS-grad     | NaN           | Divorced           | \<NA\>          | Unmarried     | Black              | Female | 0            | 0            | 32             | United-States  | \<50k  |
| 3   | 38  | Self-emp-inc     | 112847 | Prof-school | 15.0          | Married-civ-spouse | Prof-specialty  | Husband       | Asian-Pac-Islander | Male   | 0            | 0            | 40             | United-States  | \>=50k |
| 4   | 42  | Self-emp-not-inc | 82297  | 7th-8th     | NaN           | Married-civ-spouse | Other-service   | Wife          | Black              | Female | 0            | 0            | 50             | United-States  | \<50k  |

</div>


Create dataloaders. Some of the columns are continuous (like age), and
we will treat them as floats that can be fed to the model directly.
Others are categorical (like workclass or education), and we will
convert them to unique indices that are fed to embedding layers. We can
specify our categorical and continuous column names, as well as the name
of the dependent variable, in
[`DaskDataLoaders`](https://stefan027.github.io/bigtabular/data.html#daskdataloaders)
factory methods:

``` python
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
```
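The category-to-index mapping that `DaskCategorify` performs mirrors fastai's `Categorify`. The underlying idea can be sketched in plain pandas (a toy illustration, not bigtabular's actual implementation; fastai-style vocabularies reserve index 0 for the `#na#` missing-value token):

``` python
import pandas as pd

# Toy column standing in for e.g. 'workclass'
col = pd.Series(["Private", "Self-emp-inc", "Private", "State-gov"])

# Map each category string to an integer code, shifted so that
# 0 stays free for missing values
cat = col.astype("category")
codes = cat.cat.codes + 1
vocab = ["#na#"] + list(cat.cat.categories)
```

The integer codes are what the embedding layers consume; the vocabulary lets you map an index back to its original category string.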

Create a `Learner`:

``` python
learn = dask_learner(dls, metrics=accuracy)
```

Train the model for one epoch:

``` python
learn.fit_one_cycle(1)
```

<div>

| epoch | train_loss | valid_loss | accuracy | time  |
|-------|------------|------------|----------|-------|
| 0     | 0.359618   | 0.356699   | 0.836550 | 00:51 |

</div>
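The `accuracy` metric reported above is just the fraction of predictions that match the targets. A quick sketch with illustrative label values:

``` python
# Illustrative binary labels and predictions (not real model output)
actual = [1, 0, 0, 0, 0, 1, 1, 0, 0]
preds  = [1, 0, 0, 0, 0, 1, 0, 0, 0]

# Fraction of positions where prediction equals target
acc = sum(p == a for p, a in zip(preds, actual)) / len(actual)
```

Here 8 of 9 predictions match, so `acc` is 8/9.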

We can then have a look at some predictions:

``` python
learn.show_results()
```

<div>

|     | workclass | education | marital-status | occupation | relationship | race | education-num_na | age       | fnlwgt    | education-num | salary | salary_pred |
|-----|-----------|-----------|----------------|------------|--------------|------|------------------|-----------|-----------|---------------|--------|-------------|
| 0   | 5         | 13        | 1              | 5          | 2            | 5    | 1                | 0.402007  | 0.446103  | 1.537303      | 1      | 1           |
| 1   | 5         | 16        | 1              | 0          | 3            | 5    | 1                | 0.768961  | -1.380161 | -0.033487     | 0      | 0           |
| 2   | 5         | 12        | 5              | 7          | 4            | 3    | 1                | -0.919026 | 5.286263  | -0.426185     | 0      | 0           |
| 3   | 5         | 16        | 5              | 13         | 2            | 3    | 2                | 0.181835  | -0.467029 | -0.033487     | 0      | 0           |
| 4   | 5         | 13        | 5              | 5          | 2            | 5    | 2                | -0.698853 | -0.308706 | -0.033487     | 0      | 0           |
| 5   | 5         | 10        | 3              | 0          | 1            | 5    | 2                | 0.255226  | -1.457680 | -0.033487     | 1      | 1           |
| 6   | 1         | 10        | 3              | 1          | 1            | 5    | 2                | 2.016603  | -0.117934 | -0.033487     | 1      | 0           |
| 7   | 3         | 12        | 5              | 2          | 4            | 5    | 1                | -1.139198 | -0.574889 | -0.426185     | 0      | 0           |
| 8   | 5         | 1         | 5              | 0          | 4            | 5    | 1                | -1.579542 | -0.441000 | -1.604277     | 0      | 0           |

</div>
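The continuous columns in the table above (age, fnlwgt, education-num) appear as standardized values because of `DaskNormalize`. The transform, and how you would invert it to recover the original scale, is in essence (a minimal pandas sketch with illustrative values, not bigtabular's implementation):

``` python
import pandas as pd

# Illustrative raw ages
age = pd.Series([49.0, 44.0, 38.0, 42.0])
mean, std = age.mean(), age.std()

# Standardize: zero mean, unit standard deviation
normalized = (age - mean) / std

# Inverse transform recovers the original values
restored = normalized * std + mean

assert (restored - age).abs().max() < 1e-9
```

In practice the mean and standard deviation are computed on the training set and reused for validation and inference.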

            
