dataflows

Name: dataflows
Version: 0.5.5
Summary: A nifty data processing framework, based on data packages
Home page: https://github.com/datahq/dataflows
Author: Adam Kariv
License: MIT
Keywords: data
Upload time: 2024-04-01 19:52:01
Requirements: no requirements were recorded.
# ![logo](logo-s.png) DataFlows

[![Travis](https://img.shields.io/travis/datahq/dataflows/master.svg)](https://travis-ci.org/datahq/dataflows)
[![Coveralls](http://img.shields.io/coveralls/datahq/dataflows.svg?branch=master)](https://coveralls.io/r/datahq/dataflows?branch=master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dataflows.svg)
[![Gitter chat](https://badges.gitter.im/dataflows-chat/Lobby.png)](https://gitter.im/dataflows-chat/Lobby)

DataFlows is a simple and intuitive way of building data processing flows.

- It's built for small-to-medium data processing - data that fits on your hard drive, but is too big to load into Excel or as-is into Python, and not big enough to require spinning up a Hadoop cluster...
- It's built upon the foundation of the Frictionless Data project - which means that all data produced by these flows is easily reusable by others.
- It's a pattern, not a heavyweight framework: if you already have a bunch of download and extract scripts, this will be a natural fit

Read more in the [Features section below](#features).
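Before diving in, the core idea can be sketched in a few lines. This is a hedged, minimal example: the `countries` generator is invented for illustration, and `main()` assumes `dataflows` is already installed.

```python
# Minimal sketch of a flow: a data source chained with processing steps.

def countries():
    # Any generator of dicts can act as a data source.
    yield {'name': 'France', 'population_m': 68}
    yield {'name': 'Japan', 'population_m': 124}

def main():
    # Imported lazily so the sketch reads without dataflows installed.
    from dataflows import Flow, printer
    Flow(countries(), printer()).process()
```

Calling `main()` would print the two rows as a table; the QuickStart below shows the same idea driven from the command line.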

## QuickStart 

Install `dataflows` via `pip install dataflows`.

(If you are using a minimal UNIX OS, first run `sudo apt install build-essential`.)

Then use the command-line interface to bootstrap a basic processing script for any remote data file:

```bash

# Install from PyPI
$ pip install dataflows

# Inspect a remote CSV file
$ dataflows init https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv
Writing processing code into academy_csv.py
Running academy_csv.py
academy:
#     Year           Ceremony  Award                                 Winner  Name                            Film
      (string)      (integer)  (string)                            (string)  (string)                        (string)
----  ----------  -----------  --------------------------------  ----------  ------------------------------  -------------------
1     1927/1928             1  Actor                                         Richard Barthelmess             The Noose
2     1927/1928             1  Actor                                      1  Emil Jannings                   The Last Command
3     1927/1928             1  Actress                                       Louise Dresser                  A Ship Comes In
4     1927/1928             1  Actress                                    1  Janet Gaynor                    7th Heaven
5     1927/1928             1  Actress                                       Gloria Swanson                  Sadie Thompson
6     1927/1928             1  Art Direction                                 Rochus Gliese                   Sunrise
7     1927/1928             1  Art Direction                              1  William Cameron Menzies         The Dove; Tempest
...

# dataflows creates a local package of the data and a reusable processing script that you can tinker with
$ tree
.
├── academy_csv
│   ├── academy.csv
│   └── datapackage.json
└── academy_csv.py

1 directory, 3 files

# The resulting 'Data Package' is easy to use in Python
$ python
Python 3.6.1 (default, Mar 27 2017, 00:25:54)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datapackage import Package
>>> pkg = Package('academy_csv/datapackage.json')
>>> it = pkg.resources[0].iter(keyed=True)
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None, 'Name': 'Richard Barthelmess', 'Film': 'The Noose'}
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1', 'Name': 'Emil Jannings', 'Film': 'The Last Command'}

# You can now run `academy_csv.py` to repeat the process
# and modify it to add data processing steps
```
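As a sketch of that last step, the generated script could be extended with a custom row processor. Everything here is illustrative, not part of the generated output: `titlecase_award` is a hypothetical processor, and the paths come from the `tree` listing above.

```python
# Illustrative sketch: extending the generated academy_csv.py with a
# custom row-by-row step.

def titlecase_award(row):
    # Row processors receive one row as a dict and may mutate it in place.
    row['Award'] = row['Award'].title()

def main():
    # Imported lazily so the sketch reads without dataflows installed;
    # call main() after `pip install dataflows` and `dataflows init`.
    from dataflows import Flow, load, dump_to_path
    Flow(
        load('academy_csv/academy.csv'),
        titlecase_award,
        dump_to_path('academy_csv'),
    ).process()
```

Any plain function dropped between steps in a `Flow` is applied to each row, which is what makes the generated script easy to tinker with.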

## Features

* Trivial to get started and easy to scale up
* Set up and run from the command line in seconds ...
    * `dataflows init` => `flow.py`
    * `python flow.py`
* Validate the input (and especially the source) quickly (non-zero length, right structure, etc.)
* Supports caching data from the source, and even between steps
    * so that you can run and test quickly (retrieving is slow)
* Runs an immediate test so you can look at the output ...
    * log, debug, rerun
* Degrades to simple Python
* Convention over configuration
* Log exceptions and/or terminate
* The input to each stage is a Data Package or Data Resource (not a previous task)
    * Data Package based and compatible
* Processors can be a function (or a class) processing row-by-row, resource-by-resource, or a full package
* A decent pre-existing contrib library of Readers (Collectors), Processors, and Writers
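The processor shapes named in the list above can be illustrated in plain Python, with no dataflows required. `add_decade` and `drop_unnamed` below are invented examples of the row-by-row and resource-by-resource patterns, not library processors.

```python
# Plain-Python sketch of the two most common processor shapes.

def add_decade(row):
    # Row-by-row: receives one row as a dict, mutates or returns it.
    row['decade'] = (row['ceremony_year'] // 10) * 10
    return row

def drop_unnamed(resource):
    # Resource-by-resource: receives an iterator of rows, yields rows.
    for row in resource:
        if row.get('name'):
            yield row

rows = [{'ceremony_year': 1927, 'name': 'Actor'},
        {'ceremony_year': 1934, 'name': ''}]
cleaned = list(drop_unnamed(add_decade(dict(r)) for r in rows))
# cleaned keeps only the named row, now carrying a 'decade' field
```

In a real flow these functions would simply be listed as steps; dataflows inspects their signatures and applies them at the right granularity.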

## Learn more

Dive into the [Tutorial](TUTORIAL.md) for a deeper look at everything `dataflows` can do.
Also review the list of [Built-in Processors](PROCESSORS.md), which includes an API reference for each one.


            
