drpt


- Name: drpt
- Version: 0.8.2
- Home page: https://github.com/ConX/drpt
- Summary: Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.
- Upload time: 2023-01-16 03:59:37
- Author: Constantinos Xanthopoulos
- Requires Python: >=3.9,<4.0
- License: BSD-3-Clause
- Keywords: data, data science, preprocessing, scaling, obfuscation, data release, data publishing

# Data Release Preparation Tool

- [Data Release Preparation Tool](#data-release-preparation-tool)
  - [Description](#description)
  - [Installation](#installation)
  - [Usage](#usage)
    - [CLI](#cli)
    - [Recipe Definition](#recipe-definition)
  - [Example](#example)
  - [Thanks](#thanks)

> :warning: This tool is currently in beta and likely has a lot of bugs. Please use the [issue tracker](https://github.com/ConX/drpt/issues) to report any bugs or feature requests.

## Description

Command-line tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.

After performing the operations defined in the recipe, the tool generates the transformed version of the dataset and a CSV report listing the performed actions.

## Installation

The tool can be installed using pip:

```shell
pip install drpt
```

## Usage

### CLI

```txt
Usage: drpt [OPTIONS] RECIPE_FILE INPUT_FILE

Options:
  -d, --dry-run           Generate only the report without the release dataset
  -v, --verbose           Verbose [Not implemented]
  -n, --nrows TEXT        Number of rows to read from a CSV file. Doesn't work
                          with parquet files.
  -l, --limits-file PATH  Limits file
  -o, --output-dir PATH   Output directory. The default output directory is
                          the same as the location of the recipe_file.
  --version               Show the version and exit.
  --help                  Show this message and exit.
```
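
When the tool is driven from a script, the command line can be assembled from the options listed in the help text. The following is a minimal sketch: `build_drpt_command` and `run_drpt` are hypothetical helpers, and `recipe.json`, `data.csv`, and `release` are placeholder paths, not files shipped with the tool.

```python
import subprocess


def build_drpt_command(recipe_file, input_file, dry_run=False,
                       nrows=None, limits_file=None, output_dir=None):
    """Assemble a drpt argument list using only options from the help text."""
    cmd = ["drpt"]
    if dry_run:
        cmd.append("--dry-run")
    if nrows is not None:
        cmd += ["--nrows", str(nrows)]
    if limits_file is not None:
        cmd += ["--limits-file", str(limits_file)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    # Positional arguments come last: RECIPE_FILE, then INPUT_FILE.
    cmd += [str(recipe_file), str(input_file)]
    return cmd


def run_drpt(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)


command = build_drpt_command("recipe.json", "data.csv",
                             dry_run=True, output_dir="release")
print(" ".join(command))
```

Calling `run_drpt(command)` requires `drpt` to be installed and the placeholder files to exist, so the sketch only prints the command.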

### Recipe Definition

#### Overview
The recipe is a JSON-formatted file that specifies which operations should be performed on the dataset. For versioning purposes, the recipe also contains a `version` key, which is appended to the generated filenames and recorded in the report.

**Default recipe:**
```json
{
  "version": "",
  "actions": {
    "drop": [],
    "drop-constant-columns": false,
    "obfuscate": [],
    "disable-scaling": false,
    "skip-scaling": [],
    "sort-by": [],
    "rename": []
  }
}
```
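
A recipe can also be generated programmatically. The sketch below starts from the default recipe shown above and overrides selected actions; `make_recipe` is a hypothetical helper, not part of drpt. It writes out every key explicitly, so it does not depend on how the tool treats omitted keys.

```python
import json

# Default recipe, copied from the block above.
DEFAULT_RECIPE = {
    "version": "",
    "actions": {
        "drop": [],
        "drop-constant-columns": False,
        "obfuscate": [],
        "disable-scaling": False,
        "skip-scaling": [],
        "sort-by": [],
        "rename": [],
    },
}


def make_recipe(version, **actions):
    """Return a full recipe: the defaults, overridden by the given actions.

    Keyword names use underscores (drop_constant_columns) and are mapped to
    the hyphenated keys of the recipe format.
    """
    recipe = {"version": version, "actions": dict(DEFAULT_RECIPE["actions"])}
    for name, value in actions.items():
        key = name.replace("_", "-")
        if key not in recipe["actions"]:
            raise KeyError(f"unknown action: {key}")
        recipe["actions"][key] = value
    return recipe


recipe = make_recipe("1.0", drop=["test2"], drop_constant_columns=True)
print(json.dumps(recipe, indent=2))
```

The printed JSON can be saved to a file and passed to the tool as `RECIPE_FILE`.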

The currently supported actions, performed in this order, are as follows:
  - `drop`: Column deletion
  - `drop-constant-columns`: Drops all columns that contain only one unique value
  - `obfuscate`: Column obfuscation, where the listed columns are treated as categorical variables and then integer coded.
  - Scaling: By default, all numerical columns are Min/Max scaled
    - `disable-scaling`: Disables scaling for all columns
    - `skip-scaling`: Excludes the listed columns from scaling
  - `sort-by`: Sort rows by the listed columns
  - `rename`: Column renaming

All column definitions above support [regular expressions](https://docs.python.org/3/library/re.html#regular-expression-syntax).
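
To illustrate how a list of names and regular expressions can select columns, here is a minimal sketch. `select_columns` is a hypothetical helper, and matching the whole column name with `re.fullmatch` is an assumption; the source does not state how drpt anchors its patterns.

```python
import re


def select_columns(columns, patterns):
    """Return the columns matched by at least one pattern, in original order."""
    compiled = [re.compile(p) for p in patterns]
    # Assumption: a pattern must match the entire column name.
    return [c for c in columns if any(rx.fullmatch(c) for rx in compiled)]


columns = ["test1", "test2", "test8", "test9", "foo.bar.test2"]
print(select_columns(columns, ["test2", "test[8-9]"]))
# ['test2', 'test8', 'test9']
```

Under this assumption, `test2` does not select `foo.bar.test2`, which agrees with the report in the [example](#example).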

#### Actions

##### _drop_
The `drop` action is defined as a list of column names to be dropped.

##### _drop-constant-columns_
This is a boolean action, which when set to `true` will drop all the columns that have only a single unique value.
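
The check behind this action can be sketched in plain Python as follows. `constant_columns` is a hypothetical helper working on a list of row dictionaries; it is not the tool's own implementation.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single unique value."""
    if not rows:
        return []
    return [name for name in rows[0]
            if len({row[name] for row in rows}) == 1]


rows = [
    {"test4": "2", "const": "1"},
    {"test4": "2", "const": "1"},
    {"test4": "4", "const": "1"},
    {"test4": "4", "const": "1"},
]
print(constant_columns(rows))  # ['const']
```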

##### _obfuscate_
The `obfuscate` action is defined as a list of column names to be obfuscated.
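
Integer coding replaces each distinct value of a column with an integer. The sketch below assigns codes in sorted order of the values, which is an assumption that happens to agree with the mapping recorded in the [example](#example) report; `integer_code` is a hypothetical helper.

```python
def integer_code(values):
    """Map each distinct value to an integer and return (codes, mapping)."""
    # Assumption: codes follow the sorted order of the distinct values.
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


codes, mapping = integer_code(["one", "two", "three", "two"])
print(codes)    # [0, 2, 1, 2]
print(mapping)  # {'one': 0, 'three': 1, 'two': 2}
```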

##### _disable-scaling_, _skip-scaling_
By default, the tool Min/Max scales all numerical columns. This behavior can be disabled for all columns by setting the `disable-scaling` action to `true`. To disable scaling for only some columns, list their names in the `skip-scaling` action.
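
Min/Max scaling maps each value `x` of a column to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. A minimal sketch follows; `min_max_scale` is a hypothetical helper, and returning zeros for a constant column is a choice made here, not documented drpt behavior.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no range, so return zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([1.1, 2.2, 3.3, 2.22])
print([round(s, 4) for s in scaled])  # [0.0, 0.5, 1.0, 0.5091]
```

The input is the `test1` column of the [example](#example), and the rounded results agree with the `test1_renamed` column of the generated file.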

##### _sort-by_
This is a list of column names by which to sort the rows. The order in the list denotes the sorting priority.
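
Sorting by several columns with a priority order can be sketched with a tuple key. `sort_rows` is a hypothetical helper, and ascending order is an assumption that is consistent with the generated file in the [example](#example).

```python
def sort_rows(rows, sort_by):
    """Sort rows by the listed columns; earlier columns take priority."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in sort_by))


rows = [
    {"test3": 2, "test4": 4},
    {"test3": 0, "test4": 2},
    {"test3": 1, "test4": 4},
    {"test3": 2, "test4": 2},
]
ordered = sort_rows(rows, ["test4", "test3"])
print([(r["test4"], r["test3"]) for r in ordered])
# [(2, 0), (2, 2), (4, 1), (4, 2)]
```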

##### _rename_
The `rename` action is defined as a list of objects, each mapping an original name (or regular expression) to a target name. When the target uses groups matched by the regular expression, reference each group by its number preceded by an escaped backslash (`\\1`); see the [example](#example) below.

```json
{
  //...
  "rename": [{"original_name": "target_name"}]
  //...
}
```
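
The doubled backslash exists only because the recipe is JSON: once parsed, `\\1` becomes the ordinary group reference `\1` understood by Python's `re` module. The sketch below demonstrates this; `rename_column` is a hypothetical helper, and requiring the pattern to match the whole column name is an assumption.

```python
import json
import re

# The recipe is JSON, so the backslash of a group reference is doubled on disk.
entry = json.loads(r'{"test([3-4])": "test\\1_regex_renamed"}')
pattern, target = next(iter(entry.items()))
print(target)  # test\1_regex_renamed


def rename_column(name, pattern, target):
    """Return the new name if the whole column name matches, else the old one."""
    if re.fullmatch(pattern, name):
        return re.sub(pattern, target, name)
    return name


print(rename_column("test3", pattern, target))  # test3_regex_renamed
print(rename_column("test5", pattern, target))  # test5
```
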
## Example

**Input CSV file:**
```csv
test1,test2,test3,test4,test5,test6,test7,test8,test9,foo.bar.test,foo.bar.test2,const
1.1,1,one,2,0.234,0.3,-1,a,e,1,1,1
2.2,2,two,2,0.555,0.4,0,b,f,2,2,1
3.3,3,three,4,0.1,5,1,c,g,3,3,1
2.22,2,two,4,1,0,2.5,d,h,4,4,1
```

**Recipe:**
```json
{
  "version": "1.0",
  "actions": {
    "drop": ["test2", "test[8-9]"],
    "drop-constant-columns": true,
    "obfuscate": ["test3"],
    "skip-scaling": ["test4"],
    "sort-by": ["test4", "test3"],
    "rename": [
      { "test1": "test1_renamed" },
      { "test([3-4])": "test\\1_regex_renamed" },
      { "foo[.]bar[.].*": "foo" }
    ]
  }
}
```

**Generated CSV file:**
```csv
test3_regex_renamed,test4_regex_renamed,test1_renamed,test5,test6,test7,foo_1,foo_2
0,2,0.0,0.1488888888888889,0.06,0.0,0.0,0.0
2,2,0.5000000000000001,0.5055555555555556,0.08,0.2857142857142857,0.3333333333333333,0.3333333333333333
1,4,1.0,0.0,1.0,0.5714285714285714,0.6666666666666666,0.6666666666666666
2,4,0.5090909090909091,1.0,0.0,1.0,1.0,1.0
```

**Report:**
```csv
,action,column,details
0,recipe_version,,1.0
1,drpt_version,,0.6.3
2,DROP,test2,
3,DROP,test8,
4,DROP,test9,
5,DROP_CONSTANT,const,
6,OBFUSCATE,test3,"{""one"": 0, ""three"": 1, ""two"": 2}"
7,SCALE_DEFAULT,test1,"[1.1,3.3]"
8,SCALE_DEFAULT,test5,"[0.1,1.0]"
9,SCALE_DEFAULT,test6,"[0.0,5.0]"
10,SCALE_DEFAULT,test7,"[-1.0,2.5]"
11,SCALE_DEFAULT,foo.bar.test,"[1,4]"
12,SCALE_DEFAULT,foo.bar.test2,"[1,4]"
13,SORT,"['test4', 'test3']",
14,RENAME,test1,test1_renamed
15,RENAME,test3,test3_regex_renamed
16,RENAME,test4,test4_regex_renamed
17,RENAME,foo.bar.test,foo_1
18,RENAME,foo.bar.test2,foo_2
```
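
Because the report is plain CSV and its `details` column holds JSON-style values, it can be read back with the standard library. The sketch below inlines a few rows of the report above and collects the obfuscation mapping and the scaling limits; the variable names are illustrative only.

```python
import csv
import io
import json

# A few rows of the report above, inlined so the sketch runs on its own.
REPORT = ''',action,column,details
0,recipe_version,,1.0
6,OBFUSCATE,test3,"{""one"": 0, ""three"": 1, ""two"": 2}"
7,SCALE_DEFAULT,test1,"[1.1,3.3]"
'''

obfuscation = {}
limits = {}
for row in csv.DictReader(io.StringIO(REPORT)):
    if row["action"] == "OBFUSCATE":
        # The details field holds the value-to-code mapping as JSON.
        obfuscation[row["column"]] = json.loads(row["details"])
    elif row["action"] == "SCALE_DEFAULT":
        # The details field holds the [min, max] range used for scaling.
        limits[row["column"]] = json.loads(row["details"])

print(obfuscation)  # {'test3': {'one': 0, 'three': 1, 'two': 2}}
print(limits)       # {'test1': [1.1, 3.3]}
```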

## Thanks

This tool was made possible with [Pandas](https://pandas.pydata.org/), [PyArrow](https://arrow.apache.org/docs/python/index.html), [jsonschema](https://pypi.org/project/jsonschema/), and of course [Python](https://www.python.org/).


  
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ConX/drpt",
    "name": "drpt",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9,<4.0",
    "maintainer_email": "",
    "keywords": "data,data science,preprocessing,scaling,obfuscation,data release,data publishing",
    "author": "Constantinos Xanthopoulos",
    "author_email": "conx@xanthopoulos.info",
    "download_url": "https://files.pythonhosted.org/packages/31/50/08cbf7bab710047be570f9fea765de8373740ae8b1f62f6c0a899bada619/drpt-0.8.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "Tool for preparing a dataset for publishing by dropping, renaming, scaling, and obfuscating columns defined in a recipe.",
    "version": "0.8.2",
    "split_keywords": [
        "data",
        "data science",
        "preprocessing",
        "scaling",
        "obfuscation",
        "data release",
        "data publishing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2868532e982f1d00f9361192029f33bbbf5ededb42f6e49cfcf85484121776e3",
                "md5": "82ceb35f63c122ab5813052d07d7d167",
                "sha256": "18b06a68945f62441b149154092387f3741b627f78ba6ee11a7b35da2d134f70"
            },
            "downloads": -1,
            "filename": "drpt-0.8.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "82ceb35f63c122ab5813052d07d7d167",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9,<4.0",
            "size": 10031,
            "upload_time": "2023-01-16T03:59:36",
            "upload_time_iso_8601": "2023-01-16T03:59:36.392621Z",
            "url": "https://files.pythonhosted.org/packages/28/68/532e982f1d00f9361192029f33bbbf5ededb42f6e49cfcf85484121776e3/drpt-0.8.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "315008cbf7bab710047be570f9fea765de8373740ae8b1f62f6c0a899bada619",
                "md5": "61a435032e083382d6e4545e3238bbcc",
                "sha256": "3ee537d3fe54ff67551b7fdce8e28074fc7cdbd7ee546a32e1b1dc579472e943"
            },
            "downloads": -1,
            "filename": "drpt-0.8.2.tar.gz",
            "has_sig": false,
            "md5_digest": "61a435032e083382d6e4545e3238bbcc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9,<4.0",
            "size": 11996,
            "upload_time": "2023-01-16T03:59:37",
            "upload_time_iso_8601": "2023-01-16T03:59:37.565785Z",
            "url": "https://files.pythonhosted.org/packages/31/50/08cbf7bab710047be570f9fea765de8373740ae8b1f62f6c0a899bada619/drpt-0.8.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-16 03:59:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "ConX",
    "github_project": "drpt",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "drpt"
}
        