dutil


| Field | Value |
| --- | --- |
| Name | dutil |
| Version | 0.2.24 |
| Summary | A few useful tools for data wrangling |
| Upload time | 2024-04-23 16:16:20 |
| Author | Mysterious Ben |
| Requires Python | <4.0,>=3.9 |
| License | Apache License, Version 2.0 |
| Requirements | No requirements were recorded. |

# dutil

A few data utilities to make the life of a data scientist easier.

## Installation

```shell
pip install dutil
```

## Modules

- `pipeline` (data caching and pipelines)
- `stats` (statistical functions)
- `string` (string manipulations)
- `transform` (data transformations)
- `jupyter` (tools for Jupyter notebooks)


### Pipeline

```python
import dutil.pipeline as dpipe
import pandas as pd
import numpy as np
from loguru import logger

# --- Define data transformations via step functions (similar to dask.delayed)

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_1():
    df = pd.DataFrame({'a': [1., 2.], 'b': [0.1, np.nan]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_2(timestamp):
    df = pd.DataFrame({'a': [0.9, 3.], 'b': [0.001, 1.]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def compute(x, y, eps):
    assert x.shape == y.shape
    diff = ((x - y).abs() / (y.abs()+eps)).mean().mean()
    logger.info('Difference is computed')
    return diff

# Define pipeline dependencies
ts = pd.Timestamp(2019, 1, 1)
eps = 0.01
s1 = load_1()
s2 = load_2(ts)
diff = compute(s1, s2, eps)

# Trigger pipeline execution
print('diff: {:.3f}'.format(dpipe.delayed_compute((diff, ))[0]))
```
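
To make the "lazy computation + caching on disk" idea concrete, here is a minimal standard-library sketch. It is not dutil's implementation: the names `Delayed`, `delayed_cached_sketch`, and `CACHE_DIR` are hypothetical, and the cache key (a hash of the function name and its pickled arguments) is an assumption chosen for illustration.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location for this sketch (a fresh temporary directory)
CACHE_DIR = Path(tempfile.mkdtemp(prefix="sketch_cache_"))


class Delayed:
    """A deferred call: stores the function and its arguments, runs nothing yet."""

    def __init__(self, func, args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve upstream Delayed arguments first, then this step
        resolved = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        # Assumed cache key: function name plus pickled argument values
        key = hashlib.sha256(pickle.dumps((self.func.__name__, resolved))).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())  # cache hit: skip the function
        result = self.func(*resolved)
        path.write_bytes(pickle.dumps(result))  # cache miss: compute and store
        return result


def delayed_cached_sketch(func):
    """Decorator that turns a call into a Delayed object instead of executing it."""
    @functools.wraps(func)
    def wrapper(*args):
        return Delayed(func, args)
    return wrapper


calls = []  # records which step functions actually ran


@delayed_cached_sketch
def load():
    calls.append("load")
    return [1.0, 2.0]


@delayed_cached_sketch
def total(xs, offset):
    calls.append("total")
    return sum(xs) + offset


step = total(load(), 0.5)  # builds the dependency graph; nothing runs here
first = step.compute()     # runs load and total, writes both results to disk
second = step.compute()    # both results are read back from the cache
```

After the second `compute()`, `calls` still holds only one entry per step, which is the behavior the decorator comments above describe. A real implementation would also need to handle keyword arguments, unpicklable inputs, and cache invalidation when the function body changes.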

### Stats

```python
from dutil.stats import mean_lower, mean_upper
import pandas as pd
ss = pd.Series([0, 1, 5, -1])
mean_lower(ss)  # Compute mean among 50% smallest elements
mean_upper(ss)  # Compute mean among 50% biggest elements
```
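
For reference, the quantity described in the comments above can be reproduced with plain pandas. This is a sketch under one assumption: "50%" is taken as `len(ss) // 2` elements, which may differ from how dutil treats odd-length input or missing values. The helper names are hypothetical.

```python
import pandas as pd


def mean_lower_sketch(ss):
    """Mean of the smallest half of the values (assumes half = len // 2)."""
    return ss.nsmallest(len(ss) // 2).mean()


def mean_upper_sketch(ss):
    """Mean of the largest half of the values (assumes half = len // 2)."""
    return ss.nlargest(len(ss) // 2).mean()


ss = pd.Series([0, 1, 5, -1])
lower = mean_lower_sketch(ss)  # mean of -1 and 0
upper = mean_upper_sketch(ss)  # mean of 1 and 5
```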

### String

```python
from dutil.string import compare_companies
compare_companies("Aarons Holdings Company Inc.", "Aaron's, Inc.")  # Give match rating for two company names
```
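
The README does not say how the match rating is computed. As an illustration of the general idea only, the sketch below normalizes both names (lowercase, punctuation removed, common legal suffixes dropped) and compares them with `difflib.SequenceMatcher` from the standard library. The function names and the `LEGAL_WORDS` list are hypothetical, and dutil's actual algorithm and score scale may be different.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form words to ignore when comparing names
LEGAL_WORDS = {"inc", "co", "company", "corp", "corporation", "holdings", "ltd", "llc"}


def normalize_company(name):
    """Lowercase, strip punctuation, and drop legal-form words."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = [w for w in cleaned.split() if w not in LEGAL_WORDS]
    return " ".join(words)


def company_similarity(a, b):
    """Similarity in [0, 1] between two normalized company names."""
    return SequenceMatcher(None, normalize_company(a), normalize_company(b)).ratio()


score = company_similarity("Aarons Holdings Company Inc.", "Aaron's, Inc.")
other = company_similarity("Aarons Holdings Company Inc.", "Zebra Corp")
```

Both names in the first call normalize to `aarons`, so `score` is 1.0, while the unrelated pair scores much lower.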

### Transform

```python
from dutil.transform import ht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
ht(df)  # Return first and last rows of a DataFrame, a Series, or an array
```
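
A rough pandas-only equivalent of a head-and-tail preview is shown below. It is a sketch, not dutil's code: the helper name `head_tail`, the default `n=2`, and the choice to return short inputs unchanged are all assumptions.

```python
import pandas as pd


def head_tail(df, n=2):
    """Return the first n and last n rows; short inputs are returned unchanged."""
    if len(df) <= 2 * n:
        return df
    return pd.concat([df.head(n), df.tail(n)])


df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
preview = head_tail(df, n=2)  # rows 0, 1, 3, 4 of the original frame
```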

### Jupyter

```python
from dutil.jupyter import dht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
dht(df)  # Display first and last rows of a DataFrame, a Series, or an array in a Jupyter notebook
```

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "dutil",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Mysterious Ben",
    "author_email": "datascience@tuta.io",
    "download_url": "https://files.pythonhosted.org/packages/72/73/4be0bfa7c727459902cddd3fc776b0fbcae0f56814ac005ae229abd869d4/dutil-0.2.24.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License, Version 2.0",
    "summary": "A few useful tools for data wrangling",
    "version": "0.2.24",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1b13a4f92be1ada57fc170de3893c0890b1069f858c5f918e4eef5be00bced99",
                "md5": "1d6e51eb97d6771fc48cceb187298736",
                "sha256": "a7483159cfa99e9da4bca1be6ccd0730a2cb34beb53314c6703a774eea34c4ee"
            },
            "downloads": -1,
            "filename": "dutil-0.2.24-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1d6e51eb97d6771fc48cceb187298736",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 14949,
            "upload_time": "2024-04-23T16:16:19",
            "upload_time_iso_8601": "2024-04-23T16:16:19.132708Z",
            "url": "https://files.pythonhosted.org/packages/1b/13/a4f92be1ada57fc170de3893c0890b1069f858c5f918e4eef5be00bced99/dutil-0.2.24-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "72734be0bfa7c727459902cddd3fc776b0fbcae0f56814ac005ae229abd869d4",
                "md5": "3b8774b2c5d13f8b25e7b302dc16a25f",
                "sha256": "3ecf531003419ce4dfe764cd80fe3526f4c32007524c356e8036e46af06bb8d5"
            },
            "downloads": -1,
            "filename": "dutil-0.2.24.tar.gz",
            "has_sig": false,
            "md5_digest": "3b8774b2c5d13f8b25e7b302dc16a25f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 13333,
            "upload_time": "2024-04-23T16:16:20",
            "upload_time_iso_8601": "2024-04-23T16:16:20.557226Z",
            "url": "https://files.pythonhosted.org/packages/72/73/4be0bfa7c727459902cddd3fc776b0fbcae0f56814ac005ae229abd869d4/dutil-0.2.24.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-23 16:16:20",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "dutil"
}
        