dataplaybook


| Field           | Value                                                        |
| :-------------- | :----------------------------------------------------------- |
| Name            | dataplaybook                                                 |
| Version         | 1.0.16                                                       |
| Home page       | https://github.com/kellerza/data-playbook                    |
| Summary         | Playbooks for data. Open, process and save table based data. |
| Upload time     | 2023-06-01 13:48:48                                          |
| Author          | Johann Kellerman                                             |
| Requires Python | >=3.9                                                        |
| License         | Apache License 2.0                                           |
| Keywords        | data, tables, excel, mongodb, generators                     |

# Data Playbook

:book: Playbooks for data. Open, process and save table based data.
[![Workflow Status](https://github.com/kellerza/data-playbook/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/kellerza/data-playbook/actions)
[![codecov](https://codecov.io/gh/kellerza/data-playbook/branch/master/graph/badge.svg)](https://codecov.io/gh/kellerza/data-playbook)

Automate repetitive tasks on table-based data. Includes various input and output tasks.

Install: `pip install dataplaybook`

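Use the `@task` and `@playbook` decorators. A minimal sketch (the `print_rows` task is illustrative and assumes a table is a list of rows):

```python
from dataplaybook import task, playbook
from dataplaybook.tasks.io_xlsx import read_excel, write_excel


@task
def print_rows(table):
    """Illustrative task: print each row of a table."""
    for row in table:
        print(row)
```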

## Tasks

Tasks are implemented as simple Python functions and the modules can be found in the dataplaybook/tasks folder.

| Module                                                                                     | Functions                                                                                      |
| :----------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------- |
| Generic functions to work on tables<br>`dataplaybook.tasks`                                | build_lookup, build_lookup_var, combine, drop, extend, filter, print, replace, unique, vlookup |
| Fuzzy string matching <br>`dataplaybook.tasks.fuzzy`<br> Requires _pip install fuzzywuzzy_ |                                                                                                |
| Read/write Excel files<br>`dataplaybook.tasks.io_xlsx`                                     | read_excel, write_excel                                                                        |
| Misc IO tasks<br>`dataplaybook.tasks.io_misc`                                              | read_csv, read_tab_delim, read_text_regex, wget, write_csv                                     |
| MongoDB functions<br>`dataplaybook.tasks.io_mongo`                                         | read_mongo, write_mongo, columns_to_list, list_to_columns                                      |
| PDF functions. Requires _pdftotext_ on your path<br>`dataplaybook.tasks.io_pdf`            | read_pdf_pages, read_pdf_files                                                                 |
| Read XML<br>`dataplaybook.tasks.io_xml`                                                    | read_xml                                                                                       |

## Data Playbook v0

The [v0](https://github.com/kellerza/data-playbook/tree/v0) of dataplaybook used YAML files, very similar to Ansible playbooks.

Use: `dataplaybook playbook.yaml`

### Playbook structure

The playbook.yaml file allows you to load additional modules (containing tasks) and specify the tasks to execute in sequence, with all their parameters.

The `tasks` to perform typically follow the structure of read, process, write.

Example YAML (note that YAML is case-sensitive):

```yaml
modules: [list, of, modules]

tasks:
  - task_name: # See a list of tasks below
      task_setting_1: 1
    tables: # The INPUT. One or more tables used by this task
    target: # The OUTPUT. Target table name of this function
    debug: True/False # Print extra debug message, default: False
```
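
The read, process, write sequence can be pictured as a loop over task entries that pulls the input `tables` from a shared environment and stores each result under `target`. The following is a minimal plain-Python sketch of that idea, not dataplaybook's actual runner; `run_playbook`, `TASKS` and the two toy tasks are hypothetical names made up for illustration.

```python
from typing import Any, Callable


def read_rows(*, rows: list[dict]) -> list[dict]:
    """Hypothetical input task: returns the rows it is given."""
    return list(rows)


def keep_positive(table: list[dict], *, column: str) -> list[dict]:
    """Hypothetical processing task: keeps rows where column > 0."""
    return [row for row in table if row[column] > 0]


# Hypothetical registry mapping task names to plain functions.
TASKS: dict[str, Callable[..., Any]] = {
    "read_rows": read_rows,
    "keep_positive": keep_positive,
}


def run_playbook(tasks: list[dict]) -> dict[str, Any]:
    """Run the task entries in order and return the resulting environment."""
    env: dict[str, Any] = {}
    for entry in tasks:
        entry = dict(entry)  # copy, so the caller's playbook is not modified
        tables = [env[name] for name in entry.pop("tables", [])]
        target = entry.pop("target", None)
        entry.pop("debug", None)
        # The one remaining key is the task name; its value holds the settings.
        name, settings = next(iter(entry.items()))
        result = TASKS[name](*tables, **(settings or {}))
        if target is not None:
            env[target] = result
    return env


steps = [
    {"read_rows": {"rows": [{"n": 1}, {"n": -2}, {"n": 3}]}, "target": "raw"},
    {"keep_positive": {"column": "n"}, "tables": ["raw"], "target": "clean"},
]
env = run_playbook(steps)
print(env["clean"])
```

Each entry mirrors the YAML layout above: one key naming the task with its settings, plus the optional `tables`, `target` and `debug` keys.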

### Templating

Jinja2 and JMESPath expressions can be used to create parameters for subsequent tasks. Jinja2 expressions use the `"{{ var[res1] }}"` bracket syntax, and JMESPath expressions should start with the word _jmespath_ followed by a space.

Both the `vars` and `template` tasks achieve a similar result. The example below searches the `test` table for the row whose key column matches the string "2" and returns the value in its value column:

```yaml
- vars:
    res1: jmespath test[?key=='2'].value | [0]
# is equal to
- template:
    jmespath: "test[?key=='2'].value | [0]"
  target: res1
# ... then use it with `{{ var.res1 }}`
```
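
To make the semantics of `test[?key=='2'].value | [0]` concrete, here is a plain-Python equivalent of what the expression computes. It is only a sketch: the sample rows are invented, and it assumes the `test` table is a list of dicts that all have a `value` column.

```python
# Invented stand-in for the `test` table: a list of row dicts.
test = [
    {"key": "1", "value": "one"},
    {"key": "2", "value": "two"},
    {"key": "3", "value": "three"},
]

# test[?key=='2'] keeps the rows whose key equals the string "2";
# .value then projects the value column of those rows.
matches = [row["value"] for row in test if row.get("key") == "2"]

# | [0] takes the first match, or null (None) when nothing matched.
res1 = matches[0] if matches else None
print(res1)
```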

The `template` task has the advantage that it can create new variables **or tables**.

If you have a lookup you use regularly you can do the following:

```yaml
- build_lookup_var:
    key: key
    columns: [value]
  target: lookup1
# and then use it as follows to get a similar result to the previous example
- vars:
    res1: "{{ var['lookup1']['2'].value }}"
```
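
Conceptually, a lookup like this is an index of the table keyed by one column, so that repeated searches become dictionary accesses. The sketch below shows that idea in plain Python; `make_lookup` is a hypothetical helper written for this example and is not the library's `build_lookup_var` implementation, which may behave differently (for instance with duplicate keys).

```python
def make_lookup(table, key, columns):
    """Index rows by the key column, keeping only the listed columns."""
    lookup = {}
    for row in table:
        # In this sketch a later row with the same key replaces an earlier one.
        lookup[row[key]] = {col: row.get(col) for col in columns}
    return lookup


# Invented sample table.
test = [
    {"key": "1", "value": "one", "note": "a"},
    {"key": "2", "value": "two", "note": "b"},
]

lookup1 = make_lookup(test, key="key", columns=["value"])
res1 = lookup1["2"]["value"]
print(res1)
```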

When searching through a table with Jinja, a similar one-liner, using `selectattr`, seems much more complex:

```yaml
- vars:
    res1: "{{ test | selectattr('key', 'equalto', '2') | map(attribute='value') | first }}"
```

### Special yaml functions

- `!re <expression>` Regular expression
- `!es <search string>` Search for a file using Everything by Voidtools

### Install the development version

1. Clone the repo
2. `pip install -e <path>`

### Data Playbook v0 origins

Data Playbook was created to replace various snippets of code I had lying around. They were all written to make some menial task repeatable, and generally followed the same structure: load something, process it and save it. Examples include processing network data into GIS tools, network audits and reporting on router and NMS output, extracting IETF standards to complete SOCs, and reading my bank statements into my Excel budgeting tool.

For many of these tasks I have specific processing code (`tasks_x.py`, loaded with `modules: [tasks_x]` in the playbook), but in almost all cases the input and output tasks (and configuring these names etc.) are common. The idea of modular tasks originally came from Home Assistant, where I started learning Python and encountered the idea of "custom components" for adding your own integrations, although one could argue this also has similarities to Ansible playbooks.

In many cases I have a 'loose' coupling to actual file names, using Everything search (`!es search_pattern` in the playbook) to resolve a search pattern to the correct file used for input.

It has some parts in common with Ansible Playbooks; in particular, the name was chosen after I was introduced to Ansible Playbooks. The task structure was updated in 2019 to match the Ansible Playbooks 2.0/2.5+ format and allow names. This format will also make it easier to introduce loop mechanisms etc.

#### Comparison to Ansible Playbooks

Data Playbook is intended to create and modify variables in the environment (similar to **inventory**). It starts with an empty environment (although you can read the environment from various sources inside the play).
Although new variables can be created using **register:** in Ansible, Data Playbook functions require the output to be captured through `target:`.

Data Playbook tasks are different from Ansible's **actions**:

- They are mostly not idempotent, since the intention is to modify tables as we go along.
- They can return lists containing rows or be Python iterators (that `yield` rows of a table).
- If they don't return any tabular data (a list), the return value will be added to the `var` table in the environment.
- Each has a strict voluptuous schema, evaluated when loading and during runtime (e.g. to expand templates) to allow quick troubleshooting.
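
The two return styles described in the list can be illustrated with a small sketch. It assumes a dict-based environment and uses hypothetical names (`number_rows`, `count_rows`, `store_result`); it shows the described behaviour only and is not the library's code.

```python
import inspect
from typing import Any, Iterator


def number_rows(table: list[dict]) -> Iterator[dict]:
    """Generator-style task: yields one output row per input row."""
    for index, row in enumerate(table, start=1):
        yield {**row, "row_no": index}


def count_rows(table: list[dict]) -> int:
    """Task with a non-tabular result."""
    return len(table)


def store_result(env: dict[str, Any], target: str, result: Any) -> None:
    """Store rows as a table; store anything else in the `var` table."""
    if inspect.isgenerator(result):
        result = list(result)  # consume the yielded rows into a table
    if isinstance(result, list):
        env[target] = result
    else:
        env.setdefault("var", {})[target] = result


# Invented sample environment with one table.
env: dict[str, Any] = {"items": [{"name": "a"}, {"name": "b"}]}
store_result(env, "numbered", number_rows(env["items"]))
store_result(env, "total", count_rows(env["items"]))
print(env["numbered"], env["var"]["total"])
```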

You could argue I can do this with Ansible, but it won't be as elegant with single item hosts files, `gather_facts: no` and `delegate_to: localhost` throughout the playbooks. It will likely only be half as much fun trying to force it into my way of thinking.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/kellerza/data-playbook",
    "name": "dataplaybook",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "data,tables,excel,mongodb,generators",
    "author": "Johann Kellerman",
    "author_email": "kellerza@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/0e/42/ec517861c836708379b94c92e5e3720f7028d749f045372f4923e4db5a0b/dataplaybook-1.0.16.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Playbooks for data. Open, process and save table based data.",
    "version": "1.0.16",
    "project_urls": {
        "Homepage": "https://github.com/kellerza/data-playbook"
    },
    "split_keywords": [
        "data",
        "tables",
        "excel",
        "mongodb",
        "generators"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "42f5ce3401bb207ffca23a271782be10314eddcbb475fecf682aab2acb159473",
                "md5": "550cc0410facf325176e561ac041dd99",
                "sha256": "3baad671a13ae2fa404ebd460487ff493b8015647ce9146a2e8c304ef170906b"
            },
            "downloads": -1,
            "filename": "dataplaybook-1.0.16-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "550cc0410facf325176e561ac041dd99",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 52703,
            "upload_time": "2023-06-01T13:48:46",
            "upload_time_iso_8601": "2023-06-01T13:48:46.116708Z",
            "url": "https://files.pythonhosted.org/packages/42/f5/ce3401bb207ffca23a271782be10314eddcbb475fecf682aab2acb159473/dataplaybook-1.0.16-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0e42ec517861c836708379b94c92e5e3720f7028d749f045372f4923e4db5a0b",
                "md5": "0e891fb26523bd55d15098d5ac6cf10b",
                "sha256": "4d5b56d2b33b029419575951325f24c70a5b35c0854b27a7bedf88a383693f3a"
            },
            "downloads": -1,
            "filename": "dataplaybook-1.0.16.tar.gz",
            "has_sig": false,
            "md5_digest": "0e891fb26523bd55d15098d5ac6cf10b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 43179,
            "upload_time": "2023-06-01T13:48:48",
            "upload_time_iso_8601": "2023-06-01T13:48:48.122567Z",
            "url": "https://files.pythonhosted.org/packages/0e/42/ec517861c836708379b94c92e5e3720f7028d749f045372f4923e4db5a0b/dataplaybook-1.0.16.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-01 13:48:48",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "kellerza",
    "github_project": "data-playbook",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "dataplaybook"
}
        