Name | dcef
Version | 0.2.3
Summary | Data Cleaning Framework - UI for quickly cleaning pandas dataframes
upload_time | 2023-04-21 10:35:55
home_page |
maintainer |
docs_url | None
author | Paddy Mullen
requires_python | >=3.7
license | Copyright (c) 2019 Bloomberg. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
keywords | ipython, jupyter, widgets
VCS |
bugtrack_url |
requirements | No requirements were recorded.
Travis-CI | No Travis.
coveralls test coverage | No coveralls.
# DCEF - Data Cleaning Exploration Framework
We all know how awkward it is to clean data in Jupyter notebooks: multiple cells of exploratory work, trying different transforms, looking up different transforms, and ad hoc functions that work in one notebook and then have to be copy-pasted into the next notebook or rewritten from scratch. The Data Cleaning Exploration Framework (DCEF) makes all of that better by providing a visual UI for common cleaning operations AND emitting the python code that performs the transformation. Specifically, DCEF is a tool built to interactively explore, clean, and transform pandas dataframes.
![Data Cleaning Exploration Framework Screenshot](static/images/dcf-jupyter.png)
## Installation
If using JupyterLab, `dcef` requires JupyterLab version 3 or higher.

You can install `dcef` using `pip`:
```bash
pip install dcef
```
# Using DCEF
In a JupyterLab notebook, add the following to a cell:
```python
from dcef.dcef_widget import DCEFWidget
DCEFWidget(df=df) #df being the dataframe you want to explore
```
and the UI for DCEF will appear.
## Using commands
At their core, DCEF commands operate on columns. First click on a cell (not a header) in the top pane to select a column.

Next, click on a command such as `dropcol`, `fillna`, or `groupby` to create a new command.

After creating a command, it appears in the commands list; select it by clicking on the bottom cell to edit its details.

At this point you can either delete the command by clicking the `X` button or change its parameters.
## Writing your own commands
Builtin commands are found in [all_transforms.py](dcef/all_transforms.py)
### Simple example
Here is a simple example command:
```python
class DropCol(Transform):
    command_default = [s('dropcol'), s('df'), "col"]
    command_pattern = [None]

    @staticmethod
    def transform(df, col):
        df.drop(col, axis=1, inplace=True)
        return df

    @staticmethod
    def transform_to_py(df, col):
        return "    df.drop('%s', axis=1, inplace=True)" % col
```
`command_default` is the base configuration of the command when it is first added. `s('dropcol')` is a special notation for the function name, and `s('df')` is a symbol notation for the dataframe argument (see the LISP section for details). `"col"` is a placeholder for the selected column.

Since `dropcol` does not take any extra arguments, `command_pattern` is `[None]`.
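For illustration only, here is one hypothetical way the `s()` symbol helper could be written; the real dcef implementation may differ, but the point of the notation is to let the interpreter tell symbols (function and dataframe references) apart from plain string arguments such as a column name.

```python
# Hypothetical sketch, not taken from the dcef source: represent a symbol
# as a tagged name so it cannot be confused with an ordinary string.
def s(name):
    return {"symbol": name}

# Under that assumption, a fully specified dropcol command for a made-up
# column called "price" might look like:
#   [{"symbol": "dropcol"}, {"symbol": "df"}, "price"]
```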
```python
def transform(df, col):
    df.drop(col, axis=1, inplace=True)
    return df
```
This `transform` method is the function that actually manipulates the dataframe. For `dropcol` it takes two arguments: the dataframe and the column name.
```python
def transform_to_py(df, col):
    return "    df.drop('%s', axis=1, inplace=True)" % col
```
`transform_to_py` emits equivalent python code for this transform. The emitted code is indented four spaces so it can be dropped into the body of a generated function.
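To make the two halves concrete, here is a small usage sketch that calls the `DropCol` methods above directly on a throwaway dataframe (the column names are made up for the example):

```python
import pandas as pd

# Throwaway data purely for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# transform drops the column and returns the modified dataframe ...
print(DropCol.transform(df, "b"))

# ... while transform_to_py only emits the equivalent python source.
print(DropCol.transform_to_py(df, "b"))
# prints:     df.drop('b', axis=1, inplace=True)
```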
### Complex example
```python
class GroupBy(Transform):
    command_default = [s("groupby"), s('df'), 'col', {}]
    command_pattern = [[3, 'colMap', 'colEnum', ['null', 'sum', 'mean', 'median', 'count']]]

    @staticmethod
    def transform(df, col, col_spec):
        grps = df.groupby(col)
        df_contents = {}
        for k, v in col_spec.items():
            if v == "sum":
                df_contents[k] = grps[k].apply(lambda x: x.sum())
            elif v == "mean":
                df_contents[k] = grps[k].apply(lambda x: x.mean())
            elif v == "median":
                df_contents[k] = grps[k].apply(lambda x: x.median())
            elif v == "count":
                df_contents[k] = grps[k].apply(lambda x: x.count())
        return pd.DataFrame(df_contents)
```
The `GroupBy` command is more complex: it takes a third argument, `col_spec`, which is of type `colEnum`. A `colEnum` argument tells the UI to display a table with all column names and a drop-down of enum options for each.

In this case each column can have an operation of either `sum`, `mean`, `median`, or `count` applied to it.

Note also the leading `3` in the `command_pattern`. It tells the UI that these are the specs for the third element of the command. Eventually commands will be able to have multiple configured arguments.
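To see what `col_spec` looks like in practice, here is a quick sketch that calls `GroupBy.transform` directly on a made-up sales dataframe:

```python
import pandas as pd

# Made-up data purely for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "units":  [1, 2, 3, 4],
    "price":  [10.0, 12.0, 9.0, 11.0],
})

# col_spec is keyed on column name; the value picks the aggregation.
col_spec = {"units": "sum", "price": "mean"}
result = GroupBy.transform(df, "region", col_spec)
# units: north -> 3, south -> 7; price: north -> 11.0, south -> 10.0
```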
### Argument types
Arguments can currently be configured as:
* `integer` - allowing an integer input
* `enum` - allowing a strict set of options, returned as a string to the transform
* `colEnum` - allowing a strict set of options per column, returned as a dictionary keyed on column with values of enum options
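As an illustration of the `integer` type, here is a hypothetical fillna-style command that follows the same `[position, label, type, options]` pattern seen in `GroupBy` above. It is a sketch under those assumptions, not the builtin `fillna` from all_transforms.py:

```python
# Hypothetical sketch; the class name, labels, and pattern entries are
# assumptions for illustration, not the shipped fillna command.
class FillNAWithInt(Transform):
    command_default = [s('fillna_int'), s('df'), "col", 0]
    command_pattern = [[3, 'fillValue', 'integer', None]]

    @staticmethod
    def transform(df, col, val):
        df.fillna({col: val}, inplace=True)
        return df

    @staticmethod
    def transform_to_py(df, col, val):
        return "    df.fillna({'%s': %d}, inplace=True)" % (col, val)
```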
## Order of Operations for data cleaning
The ideal order of operations is as follows:

* Column level fixes
    * drop (remove this column)
    * fillna (fill NaN/None with a value)
    * safe int (convert a column to integers where possible, and NaN everywhere else)
    * OneHotEncoding (create multiple boolean columns from the possible values of this column; illustrated with plain pandas below)
    * MakeCategorical (convert string values to a Categorical dtype)
    * Quantize
* DataFrame transformations
  These transforms largely keep the shape of the data the same.
    * Resample
    * ManyColdDecoding (the opposite of OneHotEncoding: take multiple boolean columns and combine them into a single categorical column)
    * Index shift (add a column with the value from the previous row's column)
* DataFrame transformations 2
  These result in a single new dataframe with a vastly different shape.
    * Stack/Unstack columns
    * GroupBy (with UI for selecting the group-by function for each column)
* DataFrame transformations 3
  These transforms emit multiple DataFrames.
    * Relational extract (extract one or more columns into a second dataframe that can be joined back via a foreign key column)
    * Split on column (emit separate dataframes for each value of a categorical, no shape editing)
* DataFrame combination
    * concat (concatenate multiple dataframes, with UI affordances to ensure a similar shape)
    * join (join two dataframes on a key, with UI affordances)
DCEF can only work on a single input dataframe shape at a time. Any newly created columns are visible on output, but not available for manipulation in the same DCEF Cell.
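The column-level OneHotEncoding and ManyColdDecoding entries above can be illustrated with plain pandas (this is not DCEF code, just the underlying operations the planned transforms describe):

```python
import pandas as pd

# A made-up categorical column for the illustration.
df = pd.DataFrame({"color": ["red", "blue", "red"]})

# OneHotEncoding: one boolean column per distinct value.
one_hot = pd.get_dummies(df["color"], prefix="color")

# ManyColdDecoding: collapse the boolean columns back into one categorical.
decoded = one_hot.idxmax(axis=1).str.removeprefix("color_").astype("category")
```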
# Components
* A rich table widget that is embeddable into applications and in the Jupyter notebook
* A UI for selecting and trying transforms interactively
* An output table widget showing the transformed dataframe
# What works now, what's coming
## Exists now
* React frontend app
    * Displays a dataframe
    * Simple UI for column level functions
    * Shows generated python code
    * Shows transformed dataframe
* DCEF server
    * Serves up dataframes for use by the frontend
    * Responds to DCEF commands
    * Shows generated python code
* Developer user experience
    * Define DCEF commands in python only
* DCEF Interpreter
    * Based on Peter Norvig's lispy.py, a simple syntax that is easy for the frontend to generate (no parens, just JSON arrays)
* DCEF core (actual transforms supported)
    * dropcol
    * fillna
    * one hot
    * safe int
    * GroupBy
## Next major features
* Jupyter Notebook widget
    * Embed the same UI from the frontend into a jupyter notebook shell
    * No need to fire up a separate server; commands are sent via ipywidgets.comms
    * Add a "send generated python to next cell" function
* React frontend app
    * Styling
    * Server only, some UI for DataFrame selection
    * Pre-filtering concept (only operate on the first 1000 rows, or some sample of all rows)
    * DataFrame joining UI
    * Summary statistics tab for incoming dataframe
    * Multi index columns
    * DateTimeIndex support
* DCEF core
    * MakeCategorical
    * Quantize
    * Resample
    * ManyColdDecoding
    * IndexShift
    * Computed
    * Stack/Unstack
    * RelationalExtract
    * Split
    * concat
    * join
## Development installation
For a development installation:
```bash
git clone https://github.com/paddymul/dcef.git
cd dcef
conda install ipywidgets=8 jupyterlab
pip install -ve .
```
Enabling development install for Jupyter notebook:
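The classic-notebook step is not spelled out here; assuming the standard ipywidgets nbextension workflow (an assumption, not confirmed by this README), it would typically be:

```bash
# Assumed commands following the usual ipywidgets nbextension workflow;
# verify against the project docs before relying on them.
jupyter nbextension install --sys-prefix --symlink --overwrite --py dcef
jupyter nbextension enable --sys-prefix --py dcef
```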
Enabling development install for JupyterLab:
```bash
jupyter labextension develop . --overwrite
```
Note for developers: the `--symlink` argument on Linux or OS X allows one to modify the JavaScript code in place. This feature is not available on Windows.
## Contributions
We :heart: contributions.
Have you had a good experience with this project? Why not share some love and contribute code, or just let us know about any issues you had with it?
We welcome issue reports [here](../../issues); be sure to choose the proper issue template for your issue, so that we can be sure you're providing the necessary information.
Before sending a [Pull Request](../../pulls), please make sure you read our