Name | dutil |
Version | 0.2.24 |
home_page | None |
Summary | A few useful tools for data wrangling |
upload_time | 2024-04-23 16:16:20 |
maintainer | None |
docs_url | None |
author | Mysterious Ben |
requires_python | <4.0,>=3.9 |
license | Apache License, Version 2.0 |
requirements | No requirements were recorded. |
# dutil
A few data utilities to make the life of a data scientist easier.
## Installation
```shell
pip install dutil
```
## Modules
- `pipeline` (data caching and pipelines)
- `stats` (statistical functions)
- `string` (string manipulations)
- `transform` (data transformations)
- `jupyter` (tools for jupyter notebooks)
### Pipeline
```python
import dutil.pipeline as dpipe
import pandas as pd
import numpy as np
from loguru import logger

# --- Define data transformations via step functions (similar to dask.delayed)

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_1():
    df = pd.DataFrame({'a': [1., 2.], 'b': [0.1, np.nan]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def load_2(timestamp):
    df = pd.DataFrame({'a': [0.9, 3.], 'b': [0.001, 1.]})
    logger.info('Loaded {} records'.format(len(df)))
    return df

@dpipe.delayed_cached()  # lazy computation + caching on disk
def compute(x, y, eps):
    assert x.shape == y.shape
    diff = ((x - y).abs() / (y.abs() + eps)).mean().mean()
    logger.info('Difference is computed')
    return diff

# Define pipeline dependencies
ts = pd.Timestamp(2019, 1, 1)
eps = 0.01
s1 = load_1()
s2 = load_2(ts)
diff = compute(s1, s2, eps)

# Trigger pipeline execution
print('diff: {:.3f}'.format(dpipe.delayed_compute((diff, ))[0]))
```
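To illustrate the general idea of lazy evaluation combined with on-disk caching, here is a minimal standalone sketch. It is not dutil's implementation: the names `Lazy`, `lazy_cached`, and `run`, as well as the pickle-based cache layout, are hypothetical and chosen only for illustration.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location


class Lazy:
    """A recorded function call; nothing runs until run() is called."""

    def __init__(self, func, args):
        self.func = func
        self.args = args


def lazy_cached(func):
    """Hypothetical decorator: calling the function only records the call."""
    def wrapper(*args):
        return Lazy(func, args)
    return wrapper


def run(node):
    """Evaluate a Lazy node, resolving Lazy arguments first and reusing cached results."""
    if not isinstance(node, Lazy):
        return node  # plain values pass through unchanged
    args = [run(a) for a in node.args]
    # Simplification: the cache key is a hash of the function name and pickled arguments.
    key = hashlib.sha256(pickle.dumps((node.func.__name__, args))).hexdigest()
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = node.func(*args)
    path.write_bytes(pickle.dumps(result))
    return result


calls = []  # records which step functions actually executed


@lazy_cached
def load():
    calls.append("load")
    return [1.0, 2.0]


@lazy_cached
def total(xs, scale):
    calls.append("total")
    return sum(xs) * scale


node = total(load(), 10)  # builds the dependency graph; nothing is executed yet
first = run(node)         # executes both steps and writes their results to disk
second = run(node)        # both steps are served from the on-disk cache
```

After the second `run`, `calls` still contains only one entry per step, which is the behaviour a cached pipeline aims for. A real implementation would also need to account for changes in the function body and for arguments that do not pickle deterministically.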
### Stats
```python
from dutil.stats import mean_lower, mean_upper
import pandas as pd
ss = pd.Series([0, 1, 5, -1])
mean_lower(ss) # Compute mean among 50% smallest elements
mean_upper(ss) # Compute mean among 50% biggest elements
```
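The comments above describe `mean_lower` and `mean_upper` only briefly. One plausible reading, sketched here in plain pandas, is to average the smallest or largest half of the values. The helper names `lower_half_mean` and `upper_half_mean` are hypothetical, and dutil's own handling of ties, odd lengths, and NaN values may differ.

```python
import pandas as pd


def lower_half_mean(ss: pd.Series) -> float:
    """Mean of the smallest half of the values (assumed semantics, half rounded down)."""
    k = len(ss) // 2
    return ss.nsmallest(k).mean()


def upper_half_mean(ss: pd.Series) -> float:
    """Mean of the largest half of the values (assumed semantics, half rounded down)."""
    k = len(ss) // 2
    return ss.nlargest(k).mean()


ss = pd.Series([0, 1, 5, -1])
low = lower_half_mean(ss)   # averages -1 and 0
high = upper_half_mean(ss)  # averages 1 and 5
```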
### String
```python
from dutil.string import compare_companies
compare_companies("Aarons Holdings Company Inc.", "Aaron's, Inc.") # Give match rating for two company names
```
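For readers curious how a match rating between company names could be computed, the following sketch uses only the standard library. It is an assumption-laden illustration, not the algorithm behind `compare_companies`: the functions `normalize` and `name_similarity` and the `IGNORED_TOKENS` list are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumed list of tokens to ignore; not taken from dutil.
IGNORED_TOKENS = {"inc", "corp", "co", "company", "holdings", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate suffix tokens."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(t for t in cleaned.split() if t not in IGNORED_TOKENS)


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same = name_similarity("Aarons Holdings Company Inc.", "Aaron's, Inc.")
other = name_similarity("Aarons Holdings Company Inc.", "Zebra Technologies Corp.")
```

In this sketch both spellings of the first company normalize to `aarons`, so `same` is a perfect match, while `other` scores lower.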
### Transform
```python
from dutil.transform import ht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
ht(df) # Return first and last rows of a DataFrame, a Series, or an array
```
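A head-and-tail preview like this can be approximated in plain pandas, as the sketch below shows. The function `head_tail` and its default of two rows per end are hypothetical; the number of rows `ht` returns by default is not stated here, and this sketch covers DataFrames only.

```python
import pandas as pd


def head_tail(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Return the first n and last n rows, without repeating rows of short frames."""
    if len(df) <= 2 * n:
        return df
    return pd.concat([df.head(n), df.tail(n)])


df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
preview = head_tail(df, n=2)  # keeps rows 0, 1, 3, and 4
```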
### Jupyter
```python
from dutil.jupyter import dht
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})
dht(df) # Display first and last rows of a DataFrame, a Series, or an array in a Jupyter notebook
```
Raw data
{
"_id": null,
"home_page": null,
"name": "dutil",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Mysterious Ben",
"author_email": "datascience@tuta.io",
"download_url": "https://files.pythonhosted.org/packages/72/73/4be0bfa7c727459902cddd3fc776b0fbcae0f56814ac005ae229abd869d4/dutil-0.2.24.tar.gz",
"platform": null,
"description": "# dutil\n\nA few data utilities to make life of a data scientist easier\n\n## Installation\n\n```shell\npip install dutil\n```\n\n## Modules\n\n- `pipeline` (data caching and pipelines)\n- `stats` (statistical functions)\n- `string` (string manipulations)\n- `transform` (data transformations)\n- `jupyter` (tools for jupyter notebooks)\n\n\n### Pipeline\n\n```python\nimport dutil.pipeline as dpipe\nimport pandas as pd\nimport numpy as np\nfrom loguru import logger\n\n# --- Define data transformations via step functions (similar to dask.delayed)\n\n@dpipe.delayed_cached() # lazy computation + caching on disk\ndef load_1():\n df = pd.DataFrame({'a': [1., 2.], 'b': [0.1, np.nan]})\n logger.info('Loaded {} records'.format(len(df)))\n return df\n\n@dpipe.delayed_cached() # lazy computation + caching on disk\ndef load_2(timestamp):\n df = pd.DataFrame({'a': [0.9, 3.], 'b': [0.001, 1.]})\n logger.info('Loaded {} records'.format(len(df)))\n return df\n\n@dpipe.delayed_cached() # lazy computation + caching on disk\ndef compute(x, y, eps):\n assert x.shape == y.shape\n diff = ((x - y).abs() / (y.abs()+eps)).mean().mean()\n logger.info('Difference is computed')\n return diff\n\n# Define pipeline dependencies\nts = pd.Timestamp(2019, 1, 1)\neps = 0.01\ns1 = load_1()\ns2 = load_2(ts)\ndiff = compute(s1, s2, eps)\n\n# Trigger pipeline execution\nprint('diff: {:.3f}'.format(dpipe.delayed_compute((diff, ))[0]))\n```\n\n### Stats\n\n```python\nfrom dutil.stats import mean_lower, mean_upper\nimport pandas as pd\nss = pd.Series([0, 1, 5, -1])\nmean_lower(ss) # Compute mean among 50% smallest elements\nmean_upper(ss) # Compute mean among 50% biggest elements\n```\n\n### String\n\n```python\nfrom dutil.string import compare_companies\ncompare_companies(\"Aarons Holdings Company Inc.\", \"Aaron's, Inc.\") # Give match rating for two company names\n```\n\n### Transform\n\n```python\nfrom dutil.transform import ht\nimport pandas as pd\ndf = pd.DataFrame({'a': [0, 2, 2, 4, 
6], 'b': [1, 1, 1, 1, 1]})\nht(df) # Return first and last rows of a DataFrame, a Series, or an array\n```\n\n### Jupyter\n\n```python\nfrom dutil.jupyter import dht\nimport pandas as pd\ndf = pd.DataFrame({'a': [0, 2, 2, 4, 6], 'b': [1, 1, 1, 1, 1]})\ndht(df) # Display first and last rows of a DataFrame, a Series, or an array in a Jupyter notebook\n```\n",
"bugtrack_url": null,
"license": "Apache License, Version 2.0",
"summary": "A few useful tools for data wrangling",
"version": "0.2.24",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1b13a4f92be1ada57fc170de3893c0890b1069f858c5f918e4eef5be00bced99",
"md5": "1d6e51eb97d6771fc48cceb187298736",
"sha256": "a7483159cfa99e9da4bca1be6ccd0730a2cb34beb53314c6703a774eea34c4ee"
},
"downloads": -1,
"filename": "dutil-0.2.24-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1d6e51eb97d6771fc48cceb187298736",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 14949,
"upload_time": "2024-04-23T16:16:19",
"upload_time_iso_8601": "2024-04-23T16:16:19.132708Z",
"url": "https://files.pythonhosted.org/packages/1b/13/a4f92be1ada57fc170de3893c0890b1069f858c5f918e4eef5be00bced99/dutil-0.2.24-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "72734be0bfa7c727459902cddd3fc776b0fbcae0f56814ac005ae229abd869d4",
"md5": "3b8774b2c5d13f8b25e7b302dc16a25f",
"sha256": "3ecf531003419ce4dfe764cd80fe3526f4c32007524c356e8036e46af06bb8d5"
},
"downloads": -1,
"filename": "dutil-0.2.24.tar.gz",
"has_sig": false,
"md5_digest": "3b8774b2c5d13f8b25e7b302dc16a25f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 13333,
"upload_time": "2024-04-23T16:16:20",
"upload_time_iso_8601": "2024-04-23T16:16:20.557226Z",
"url": "https://files.pythonhosted.org/packages/72/73/4be0bfa7c727459902cddd3fc776b0fbcae0f56814ac005ae229abd869d4/dutil-0.2.24.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-23 16:16:20",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "dutil"
}