# DataFlows
[Travis CI](https://travis-ci.org/datahq/dataflows)
[Coveralls](https://coveralls.io/r/datahq/dataflows?branch=master)
[Gitter](https://gitter.im/dataflows-chat/Lobby)
DataFlows is a simple and intuitive way of building data processing flows.
- It's built for small-to-medium data processing - data that fits on your hard drive, but is too big to load into Excel or as-is into Python, and not big enough to require spinning up a Hadoop cluster.
- It's built upon the foundation of the Frictionless Data project - which means that all data produced by these flows is easily reusable by others.
- It's a pattern, not a heavyweight framework: if you already have a bunch of download and extract scripts, this will be a natural fit.
Read more in the [Features section below](#features).
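To make the "pattern, not framework" point concrete, here is a minimal, dependency-free sketch of the idea that `dataflows` formalizes: each step is a plain function over a stream of rows, and a flow is just those functions chained together. The step names (`load`, `add_full_name`, `keep_winners`) are hypothetical illustrations, not the library's actual API.

```python
# A minimal, dependency-free sketch of the "flow" pattern that dataflows
# formalizes: each step is a plain function transforming a stream of rows.
# (Illustrative only -- the real dataflows API wraps steps in a Flow object.)

def load(rows):
    # Source step: yields rows (dataflows would load CSV/JSON/Excel here)
    yield from rows

def add_full_name(rows):
    # Row-by-row processor: derive a new field from existing ones
    for row in rows:
        yield dict(row, full_name=f"{row['first']} {row['last']}")

def keep_winners(rows):
    # Filter step: pass through only the rows we care about
    for row in rows:
        if row['winner']:
            yield row

def flow(source, *steps):
    # Chain the steps into one lazy pipeline, then materialize the result
    stream = source
    for step in steps:
        stream = step(stream)
    return list(stream)

data = [
    {'first': 'Emil', 'last': 'Jannings', 'winner': True},
    {'first': 'Richard', 'last': 'Barthelmess', 'winner': False},
]
result = flow(load(data), add_full_name, keep_winners)
```

Because each step only sees an iterator of rows, steps compose freely and the data streams through without being fully loaded into memory.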
## QuickStart
Install `dataflows` via `pip install dataflows`.
(If you are using a minimal UNIX OS, first run `sudo apt install build-essential`.)
Then use the command-line interface to bootstrap a basic processing script for any remote data file:
```bash
# Install from PyPI
$ pip install dataflows
# Inspect a remote CSV file
$ dataflows init https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv
Writing processing code into academy_csv.py
Running academy_csv.py
academy:
  #  Year       Ceremony   Award          Winner    Name                     Film
     (string)   (integer)  (string)       (string)  (string)                 (string)
---  ---------  ---------  -------------  --------  -----------------------  -----------------
  1  1927/1928          1  Actor                    Richard Barthelmess      The Noose
  2  1927/1928          1  Actor          1         Emil Jannings            The Last Command
  3  1927/1928          1  Actress                  Louise Dresser           A Ship Comes In
  4  1927/1928          1  Actress        1         Janet Gaynor             7th Heaven
  5  1927/1928          1  Actress                  Gloria Swanson           Sadie Thompson
  6  1927/1928          1  Art Direction            Rochus Gliese            Sunrise
  7  1927/1928          1  Art Direction  1         William Cameron Menzies  The Dove; Tempest
...
# dataflows creates a local package of the data and a reusable processing script which you can tinker with
$ tree
.
├── academy_csv
│   ├── academy.csv
│   └── datapackage.json
└── academy_csv.py
1 directory, 3 files
# Resulting 'Data Package' is super easy to use in Python
$ python
Python 3.6.1 (default, Mar 27 2017, 00:25:54)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datapackage import Package
>>> pkg = Package('academy_csv/datapackage.json')
>>> it = pkg.resources[0].iter(keyed=True)
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None, 'Name': 'Richard Barthelmess', 'Film': 'The Noose'}
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1', 'Name': 'Emil Jannings', 'Film': 'The Last Command'}
# You can now run `academy_csv.py` to repeat the process
# and modify it to add data processing steps
```
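The generated `datapackage.json` is plain JSON, so even without the `datapackage` library you can inspect a package with the standard library alone. The sketch below builds a tiny hypothetical package on disk (mimicking the `academy_csv` layout above) and reads it back with `json` and `csv` - note that `csv.DictReader` returns everything as strings; it's the `datapackage` library that applies the schema's types, as seen in the session above.

```python
# Build a tiny hypothetical Data Package on disk, then read it back using
# only the standard library. The schema carries the field types; plain
# csv.DictReader does not apply them (the datapackage library does).
import csv
import json
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
(base / 'academy.csv').write_text(
    'Year,Ceremony,Award\n'
    '1927/1928,1,Actor\n'
    '1927/1928,1,Actress\n')
(base / 'datapackage.json').write_text(json.dumps({
    'name': 'academy',
    'resources': [{
        'name': 'academy',
        'path': 'academy.csv',
        'schema': {'fields': [
            {'name': 'Year', 'type': 'string'},
            {'name': 'Ceremony', 'type': 'integer'},
            {'name': 'Award', 'type': 'string'},
        ]},
    }],
}))

# Read the descriptor, locate the resource, and stream its rows
descriptor = json.loads((base / 'datapackage.json').read_text())
resource = descriptor['resources'][0]
with open(base / resource['path'], newline='') as f:
    rows = list(csv.DictReader(f))
```

This is what makes the output "easily reusable by others": the descriptor documents the data's location and schema in one standard, tool-agnostic file.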
## Features
* Trivial to get started and easy to scale up
* Set up and run from the command line in seconds
  * `dataflows init` => `flow.py`
  * `python flow.py`
* Validate input (especially sources) quickly (non-zero length, right structure, etc.)
* Supports caching data from the source, and even between steps
  * so that we can run and test quickly (retrieving is slow)
* Immediate feedback: run and look at the output
  * Log, debug, rerun
* Degrades to simple Python
* Convention over configuration
* Log exceptions and/or terminate
* The input to each stage is a Data Package or Data Resource (not the previous task)
  * Data Package based and compatible
* Processors can be a function (or a class) processing row-by-row, resource-by-resource, or a full package
* A decent pre-existing contrib library of Readers (Collectors), Processors, and Writers
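The three processor granularities mentioned above can be sketched with plain functions. These signatures are hypothetical simplifications (the real `dataflows` processors receive richer context), but the shapes are the same: a row processor sees one row, a resource processor sees an iterator of rows, and a package processor sees all resources.

```python
# Hypothetical sketch of the three processor granularities; the real
# dataflows signatures differ, but the shapes are analogous.

def add_gross(row):
    # Row-level processor: receives one row, returns it (possibly modified)
    row['gross'] = round(row['net'] * 1.2, 2)
    return row

def drop_zero(resource):
    # Resource-level processor: receives an iterator of rows, yields rows
    for row in resource:
        if row['net'] > 0:
            yield row

def process_package(package):
    # Package-level processor: receives all resources at once
    # (modeled here as a dict mapping resource name -> row iterator)
    return {name: drop_zero(map(add_gross, res))
            for name, res in package.items()}

package = {'sales': iter([{'net': 100.0}, {'net': 0.0}, {'net': 50.0}])}
processed = {name: list(res) for name, res in process_package(package).items()}
```

Choosing the narrowest granularity that fits keeps processors small and streamable; the package level is only needed when a step must see schemas or coordinate across resources.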
## Learn more
Dive into the [Tutorial](TUTORIAL.md) for a deeper look at everything that `dataflows` can do.
Also review the list of [Built-in Processors](PROCESSORS.md), which includes an API reference for each of them.