- [`pdpp`](#pdpp)
- [Installation Prerequisites](#installation-prerequisites)
- [Installation](#installation)
- [Example](#example)
- [Usage from the Command Line](#usage-from-the-command-line)
- [`pdpp init`](#pdpp-init)
- [`pdpp new`](#pdpp-new)
- [`pdpp custom`](#pdpp-custom)
- [`pdpp sub`](#pdpp-sub)
- [`pdpp rig`](#pdpp-rig)
- [`pdpp run` or `doit`](#pdpp-run-or-doit)
- [`pdpp graph`](#pdpp-graph)
- [`pdpp extant`](#pdpp-extant)
- [`pdpp enable`](#pdpp-enable)
# `pdpp`
`pdpp` is a command-line interface for facilitating the creation and maintenance of transparent and reproducible data workflows. `pdpp` adheres to principles espoused by Patrick Ball in his manifesto on ['Principled Data Processing'](https://www.youtube.com/watch?v=ZSunU9GQdcI). `pdpp` can be used to create 'tasks', populate task directories with the requisite subdirectories, link together tasks' inputs and outputs, and execute the pipeline using the `doit` [suite of automation tools](https://pydoit.org/).
`pdpp` is also capable of producing rich visualizations of the data processing workflows it creates:
![](img/dependencies_all.png)
Every project that conforms with Patrick Ball's ['Principled Data Processing'](https://www.youtube.com/watch?v=ZSunU9GQdcI) guidelines uses the 'task' as the quantum of a workflow. Each task in a workflow takes the form of a directory that houses a discrete data operation, such as extracting records from plaintext documents and storing the results as a `.csv` file. Ideally, each task should be simple and conceptually unified enough that a short 3- or 4-word description is enough to convey what the task accomplishes.^[In practical terms, this implies that PDP-compliant projects tend to use many more distinct script files to perform what would normally be accomplished in the space of a single, longer script.]
Each task directory contains at minimum three subdirectories:
1. `input`, which contains all of the task's local data dependencies
2. `output`, which contains all of the task's local data outputs (also referred to as 'targets')
3. `src`, which contains all of the task's source code^[Which, ideally, would be contained within a single script file.]
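As an illustration of this structure (not `pdpp`'s actual implementation — `pdpp new` handles task creation for you), a PDP-style task skeleton can be sketched in a few lines of Python:

```python
from pathlib import Path

def make_task_skeleton(task_name: str, root: str = ".") -> Path:
    """Create a PDP-style task directory containing input/, output/, and src/.

    Illustrative sketch only; `pdpp new` does this (and more) automatically.
    """
    task_dir = Path(root) / task_name
    for sub in ("input", "output", "src"):
        # parents=True creates the task directory itself on the first pass
        (task_dir / sub).mkdir(parents=True, exist_ok=True)
    return task_dir
```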
The `pdpp` package adds two additional constraints to Patrick Ball's original formulation of PDP:
1. All local data files needed by the workflow but which are not generated by any of the workflow's tasks must be included in the `_import_` directory, which `pdpp` places at the same directory level as the overall workflow during project initialization.
2. All local data files produced by the workflow as project outputs must be routed into the `_export_` directory, which `pdpp` places at the same directory level as the overall workflow during project initialization.
These additional constraints disambiguate the input and output of the overall workflow, which permits `pdpp` workflows to be embedded within one another.
## Installation Prerequisites
Aside from an up-to-date installation of `python` and `pip` (installation instructions for which can be found [here](https://wiki.python.org/moin/BeginnersGuide/Download)), the `pdpp` package depends on `graphviz`, which must be installed before attempting to install `pdpp`. Installation instructions for `graphviz` can be found at the [GraphViz installation instructions page.](https://pygraphviz.github.io/documentation/stable/install.html#windows-install)
## Installation
To install `pdpp`, use `pip`:
```bash
pip install pdpp
```
## Example
The first step when using `pdpp` is to initialize a new project directory, which must be empty. To initialize a new project directory, use the following command:
```bash
pdpp init
```
Doing so should produce a directory tree similar to this one:
![](img/init.png)
For the purposes of this example, a `.csv` file containing some toy data has been added to the `_import_` directory.
At this point, we're ready to add our first task to the project. To do this, we'll use the `new` command:
```bash
pdpp new
```
Upon executing the command, `pdpp` will request a name for the new task. We'll call it 'task_1'. After supplying the name, `pdpp` will display an interactive menu which allows users to specify which other tasks in the project contain files that 'task_1' will depend upon.
![](img/task_1_task_dep.png)
At the moment, this isn't a very difficult decision to make, as there's only one other task in the project from which files can be imported. Select it (using the spacebar) and press the enter/return key. `pdpp` will then display a nested list of all the files available to be nominated as a dependency of 'task_1':
![](img/task_1_file_dep.png)
Select 'example_data.csv' and press enter. `pdpp` will inform you that your new task has been created. At this point, the project's tree diagram should appear similar to this:
![](img/tree_2.png)
The tree diagram shows that 'task_1' exists and that its input directory contains `example_data.csv` (which is hard linked to the `example_data.csv` file in `_import_`, meaning that any changes to one will be reflected in the other). The `src` directory also contains a Python script titled `task_1.py` -- for the sake of convenience, `pdpp` populates new tasks with a Python script of the same name as the task.^[This file should be deleted or renamed if the task requires source code by a different name or using a different programming language.] For this example, we'll populate the Python script with a simple set of instructions for loading `example_data.csv`, adding one to each of the items in it, and then saving it as a new `.csv` file in the `output` subdirectory of the `task_1` directory:
```python
import csv

new_rows = []

with open('../input/example_data.csv', 'r') as f1:
    r = csv.reader(f1)
    for row in r:
        new_row = [int(row[0]) + 1, int(row[1]) + 1]
        new_rows.append(new_row)

# newline='' prevents csv.writer from emitting blank rows on Windows
with open('../output/example_data_plus_one.csv', 'w', newline='') as f2:
    w = csv.writer(f2)
    for row in new_rows:
        w.writerow(row)
```
After running `task_1.py`, a new file called `example_data_plus_one.csv` should be in `task_1`'s `output` subdirectory. If one assumes that producing this new `.csv` is one of the workflow's overall objectives, then it is important to let `pdpp` know as much. This can be accomplished using the `rig` command, which is used to re-configure existing tasks. In this example, it will be used to configure the `_export_` directory (which is a special kind of task). Run the command:
```bash
pdpp rig
```
Select `_export_` from the list of tasks available, then select `task_1` (and not `_import_`); finally, select `example_data_plus_one.csv` as the only dependency for `_export_`.
Once `_export_` has been rigged, this example project is a complete (if exceedingly simple) example of a `pdpp` workflow. The workflow imports a simple `.csv` file, adds one to each number in the file, and exports the resulting modified `.csv` file. `pdpp` workflows can be visualized using the built-in visualization suite like so:
```bash
pdpp graph
```
The above command will prompt users for two pieces of information: the output format for the visualization (defaults to `.png`) and the colour scheme to be used (defaults to colour, can be set to greyscale). Accept both defaults by pressing enter twice. Whenever `pdpp` graphs a workflow, it produces four different visualizations thereof. For now, examine the `dependencies_all.png` file, which looks like this:
![](img/dependencies_all.png)
In `pdpp` visualizations, the box-like nodes represent tasks, the nodes with folded corners represent data files, and the nodes with two tabs on the left-hand side represent source code.
One may execute the entire workflow by using one of the two following commands (both are functionally identical):
```bash
doit
```
```bash
pdpp run
```
If everything worked correctly, running the example workflow should result in the following console message:
```bash
. task_1
```
When a workflow is run, the `doit` automation suite -- atop which `pdpp` is built -- lists the name of each task in the workflow. If the task needed to be executed, a period is displayed before the task name. If the task did *not* need to be executed, two dashes are displayed before the task name, like so:
```bash
-- task_1
```
This is because `doit` checks the relative ages of each task's inputs and outputs at runtime; if any given task has any outputs^[Or 'targets,' in `doit` nomenclature.] that are older than one or more of the task's inputs,^[Or 'dependencies,' in `doit` nomenclature.] that task must be re-run. If all of a task's inputs are older than its outputs, the task does not need to be run. This means that a `pdpp`/`doit` pipeline can be run as often as the user desires without running the risk of needlessly wasting time or computing power: tasks will only be re-run if changes to 'upstream' files necessitate it. You can read more about this impressive feature of the `doit` suite [here](https://pydoit.org/tasks.html).
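The freshness check described above can be sketched in simplified form. This is an illustration based on plain file-modification-time comparison, not `doit`'s actual implementation (which can also hash file contents and track task metadata):

```python
import os

def needs_rerun(dependencies, targets):
    """Return True if any target is missing or older than the newest dependency.

    Simplified sketch of up-to-date checking; doit's real logic is richer.
    """
    # A missing target always forces a re-run
    if not all(os.path.exists(t) for t in targets):
        return True
    oldest_target = min(os.path.getmtime(t) for t in targets)
    newest_dep = max(os.path.getmtime(d) for d in dependencies)
    return newest_dep > oldest_target
```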
## Usage from the Command Line
With the exception of `pdpp init`, all `pdpp` commands must be run from the bottom-most project directory.
### `pdpp init`
Initializes a new `pdpp` project in an empty directory.
### `pdpp new`
Adds a new basic task to a `pdpp` project and launches an interactive rigging session for it (see `pdpp rig` below for more information).
### `pdpp custom`
Adds a new custom task to a `pdpp` project and launches an interactive rigging session for it (see `pdpp rig` below for more information). Custom tasks make no assumptions about the kind of source code or automation required by the task; it is up to the user to write a custom `dodo.py` file that will correctly automate the task in the context of the larger workflow. More information on writing custom `dodo.py` files can be found [here](https://pydoit.org/tasks.html).
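As a minimal sketch, a custom `dodo.py` for a task that runs a script over its input might look like the following (the file names `example.py`, `example_data.csv`, and `example_result.csv` are hypothetical placeholders):

```python
def task_example():
    """Run src/example.py whenever its inputs are newer than its output.

    doit discovers functions named task_* and reads the returned dict:
    'actions' are the commands to run, 'file_dep' are the dependencies,
    and 'targets' are the outputs used for up-to-date checking.
    """
    return {
        "actions": ["python src/example.py"],
        "file_dep": ["input/example_data.csv", "src/example.py"],
        "targets": ["output/example_result.csv"],
    }
```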
### `pdpp sub`
Adds a new sub-project task to a `pdpp` project and launches an interactive rigging session for it (see `pdpp rig` below for more information). Sub-project tasks are distinct `pdpp` projects nested inside the main project -- structurally, they function identically to all other `pdpp` projects. Their dependencies are defined as any local files contained inside their `_import_` directory (which functions as if it were an `input` directory for a task) and their targets are defined as any local files contained inside their `_export_` directory (which functions as if it were an `output` directory for a task).
### `pdpp rig`
Launches an interactive rigging session for a selected task, which allows users to specify the task's dependencies (inputs), targets (outputs), and source code (src).^[In practice, explicit rigging of a task's source code shouldn't be necessary, as doing so is only required if there is more than one file inside a task's `src` directory, or if `pdpp` doesn't recognize the language of one or more of the source files. If this is the case, users are encouraged to consider using a custom task (`pdpp custom`) instead of a basic task.]
### `pdpp run` or `doit`
Runs the project. The `pdpp run` command provides basic functionality; users may pass arguments to the `doit` command, which provides a great deal of control and specificity. More information about the `doit` command can be found [here](https://pydoit.org/cmd-run.html).
### `pdpp graph`
Produces four visualizations of the `pdpp` project:
- `dependencies_sparse` only includes task nodes.
- `dependencies_file` includes task nodes and data files.
- `dependencies_source` includes task nodes and source files.
- `dependencies_all` includes task nodes, source files, and data files.
### `pdpp extant`
Incorporates an already-PDP compliant directory (containing `input`, `output`, and `src` directories) that was not created using `pdpp` into the `pdpp` project.
### `pdpp enable`
Allows users to toggle tasks 'on' or 'off'; tasks that are 'off' will not be executed when `pdpp run` or `doit` is used.