iterabledata

Name	iterabledata
Version	1.0.2
home_page	https://github.com/apicrafter/pyiterable/
Summary	Iterable data processing Python library
upload_time	2022-12-24 07:24:52
maintainer
docs_url	None
author	Ivan Begtin
requires_python	>=3.8
license	MIT
keywords	json jsonl csv bson parquet orc xml xls xlsx dataset etl data-pipelines

# Iterable Data

*Work in progress. Documentation in progress*

Iterable Data is a Python library for reading data files row by row and for writing data files.
Its iterable classes work much like file objects, `csv.DictReader`, or row-by-row Parquet readers.
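
For comparison, the standard-library pattern that the iterable classes resemble looks like the sketch below. It uses only `csv.DictReader` on an in-memory sample (the sample data is made up for illustration) and does not use this library.

```python
import csv
import io

# Made-up sample standing in for a CSV file on disk.
sample = io.StringIO("name,city\nAlice,Paris\nBob,Oslo\n")

rows = []
for row in csv.DictReader(sample):
    # Each row arrives as a dict keyed by the header line.
    rows.append(row)

print(rows[0]["name"], len(rows))
```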

This library was written to simplify data processing and conversion between formats.
 
Supported file types:
* CSV
* BSON
* JSON
* NDJSON (JSON lines)
* XML
* XLS
* XLSX
* Parquet
* ORC
* Avro
* Pickle

Supported file compression: GZip, BZip2, LZMA (.xz), LZ4, ZIP

## Why write this lib?

Python has many high-quality data processing tools and libraries, especially pandas and other data frame libraries. Their main limitation is that they expect flat data: data frames don't support complex (nested) data types, so you must *flatten* the data each time.

pyiterable lets you read any supported data as Python dictionaries instead of flattening it.
This makes it much easier to work with data sources such as JSON, NDJSON, or BSON files.
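
To illustrate the difference, here is a small standard-library sketch (not pyiterable code; the sample record and the `flatten` helper are hypothetical). It reads one JSON lines record as a nested dictionary and shows what flattening it for a data frame would involve.

```python
import json

# Hypothetical NDJSON line with a nested object and a list.
line = '{"name": "Alice", "address": {"city": "Paris", "zip": "75001"}, "tags": ["a", "b"]}'

record = json.loads(line)

# As a dictionary, nested values are reachable directly.
city = record["address"]["city"]


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys (hypothetical helper); lists are kept as-is."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


flat_record = flatten(record)
print(city)
print(sorted(flat_record))
```

The flattened form loses the grouping of the address fields under one key, which is the step that reading records as dictionaries avoids.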

This code is used in several tools written by its author: the command-line tool [undatum](https://github.com/datacoon/undatum) and the data processing ETL engine [datacrafter](https://github.com/apicrafter/datacrafter).


## Requirements

Python 3.8+

## Documentation

In progress

## Usage and examples


### Read compressed CSV file 

Read a CSV file compressed with LZMA (`.csv.xz`) row by row.

```{python}

from iterable.helpers.detect import open_iterable

# Open the compressed CSV file as an iterable of rows
source = open_iterable('data.csv.xz')
n = 0
for row in source:
    n += 1
    # Add data processing code here
    if n % 1000 == 0:
        print('Processing %d' % n)
```
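
For readers who want to try the loop without a sample file, the following standard-library sketch builds a small `.csv.xz` file and reads it row by row with `lzma` and `csv.DictReader`. It only illustrates the row-by-row pattern; it is not how `open_iterable` is implemented, and the file name and sample rows are made up.

```python
import csv
import lzma
import os
import tempfile

# Create a small LZMA-compressed CSV file to read back (made-up data).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.csv.xz")
with lzma.open(path, "wt", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    for i in range(2500):
        writer.writerow([i, "row%d" % i])

# Read it back one row at a time; the file is decompressed as a stream.
n = 0
with lzma.open(path, "rt", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        n += 1
        if n % 1000 == 0:
            print("Processing %d" % n)

print("Total rows: %d" % n)
```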

### Detect encoding and file delimiter

Detect the encoding and delimiter of the selected CSV file and use them to open it as an iterable.

```{python}

from iterable.helpers.detect import open_iterable
from iterable.helpers.utils import detect_encoding, detect_delimiter

delimiter = detect_delimiter('data.csv')
encoding = detect_encoding('data.csv')

# Pass the detected values through to the CSV iterable
source = open_iterable('data.csv', iterableargs={'encoding': encoding['encoding'], 'delimiter': delimiter})
n = 0
for row in source:
    n += 1
    # Add data processing code here
    if n % 1000 == 0:
        print('Processing %d' % n)
```
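
Delimiter detection of this kind can also be sketched with the standard library's `csv.Sniffer`. This is an illustration of the idea, not the implementation behind `detect_delimiter`; the sample text is made up, and encoding detection is left out because it typically relies on a third-party package.

```python
import csv
import io

# Made-up semicolon-delimited sample.
sample = "id;name;city\n1;Alice;Paris\n2;Bob;Oslo\n3;Carol;Rome\n"

# Restrict the candidates so the sniffer only chooses among plausible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")

# Use the detected delimiter to read the rows as dictionaries.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=dialect.delimiter))
print(repr(dialect.delimiter), rows[0])
```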


### Convert Parquet file to JSON lines (NDJSON) compressed with LZMA using pipeline

Uses `pipeline` to iterate through a Parquet file and write its selected fields to JSON lines (NDJSON).

```{python}

from iterable.helpers.detect import open_iterable
from iterable.pipeline import pipeline

source = open_iterable('data/data.parquet')
destination = open_iterable('data/data.jsonl.xz', mode='w')

def extract_fields(record, state):
    # Keep only the selected fields of each record
    out = {}
    record = dict(record)
    print(record)  # debug output: prints every record
    for k in ['name',]:
        out[k] = record[k]
    return out

def print_process(stats, state):
    # Progress callback, passed below as both trigger_func and final_func
    print(stats)

pipeline(source, destination=destination, process_func=extract_fields,
         trigger_on=2, trigger_func=print_process,
         final_func=print_process, start_state={})

```
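
The call above combines a per-record function, a periodic trigger, and a final callback. As a rough mental model only, the control flow might look like the sketch below. `run_pipeline` is a hypothetical stand-in written for this illustration, not the library's `pipeline`; the keys of the `stats` dictionary and the use of a plain list as the destination are assumptions.

```python
def run_pipeline(source, destination, process_func, trigger_on,
                 trigger_func, final_func, start_state):
    """Hypothetical stand-in that mimics the call shape used above."""
    state = start_state
    stats = {"rec_count": 0}  # assumed stats layout, for illustration only
    for record in source:
        result = process_func(record, state)
        if result is not None:
            destination.append(result)  # a real destination would write a row
        stats["rec_count"] += 1
        # Fire the trigger after every `trigger_on` records.
        if stats["rec_count"] % trigger_on == 0:
            trigger_func(stats, state)
    final_func(stats, state)
    return stats


def extract_fields(record, state):
    # Keep only the selected field.
    return {"name": record["name"]}


calls = []


def record_progress(stats, state):
    # Record the counter each time the trigger or final callback fires.
    calls.append(stats["rec_count"])


source = [{"name": "a", "x": 1}, {"name": "b", "x": 2}, {"name": "c", "x": 3}]
destination = []
stats = run_pipeline(source, destination, extract_fields, trigger_on=2,
                     trigger_func=record_progress, final_func=record_progress,
                     start_state={})
print(destination, calls)
```

With three records and `trigger_on=2`, the trigger fires once after the second record and the final callback fires once at the end.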

### Convert gzipped JSON lines (NDJSON) file to BSON compressed with LZMA 

Reads each row from a JSON lines file using the Gzip codec and writes it as BSON using the LZMA codec.

```{python}

from iterable.datatypes import JSONLinesIterable, BSONIterable
from iterable.codecs import GZIPCodec, LZMACodec


# Source: gzipped JSON lines file
codecobj = GZIPCodec('data.jsonl.gz', mode='r', open_it=True)
iterable = JSONLinesIterable(codec=codecobj)
# Destination: BSON file compressed with LZMA
codecobj = LZMACodec('data.bson.xz', mode='wb', open_it=False)
write_iterable = BSONIterable(codec=codecobj, mode='w')
n = 0
for row in iterable:
    n += 1
    if n % 10000 == 0:
        print('Processing %d' % n)
    write_iterable.write(row)
```
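
The same read-then-write loop can be tried with the standard library alone. Because BSON encoding needs a third-party package, this sketch swaps the output format: it reads gzipped JSON lines and writes LZMA-compressed JSON lines instead. File names and records are made up, and nothing here uses pyiterable's codec classes.

```python
import gzip
import json
import lzma
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "data.jsonl.gz")
dst_path = os.path.join(tmpdir, "data.jsonl.xz")

# Prepare a small gzipped JSON lines file (made-up records).
with gzip.open(src_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"id": i, "meta": {"even": i % 2 == 0}}) + "\n")

# Stream records from the gzip file into the xz file, one row at a time.
n = 0
with gzip.open(src_path, "rt", encoding="utf-8") as src, \
        lzma.open(dst_path, "wt", encoding="utf-8") as dst:
    for line in src:
        row = json.loads(line)
        dst.write(json.dumps(row) + "\n")
        n += 1

# Read the result back to check the round trip.
with lzma.open(dst_path, "rt", encoding="utf-8") as f:
    result = [json.loads(line) for line in f]
print(n, result[0])
```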



## Examples and tests

See [tests](tests/) for example usage

Raw data

{
    "_id": null,
    "home_page": "https://github.com/apicrafter/pyiterable/",
    "name": "iterabledata",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "json jsonl csv bson parquet orc xml xls xlsx dataset etl data-pipelines",
    "author": "Ivan Begtin",
    "author_email": "ivan@begtin.tech",
    "download_url": "https://files.pythonhosted.org/packages/f6/ca/9cecfff8829dfbb15d1f54b974e15c6dff76eb4c51e3c10ffdd1aff48380/iterabledata-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Iterable data processing Python library",
    "version": "1.0.2",
    "split_keywords": [
        "json",
        "jsonl",
        "csv",
        "bson",
        "parquet",
        "orc",
        "xml",
        "xls",
        "xlsx",
        "dataset",
        "etl",
        "data-pipelines"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "c8c5e8941933a4918658e3aeb6e002b0",
                "sha256": "c0a05329329145ba9e8b17ed736387d5dcc152dd33f9c90c119782018c442fee"
            },
            "downloads": -1,
            "filename": "iterabledata-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "c8c5e8941933a4918658e3aeb6e002b0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 18459,
            "upload_time": "2022-12-24T07:24:52",
            "upload_time_iso_8601": "2022-12-24T07:24:52.513885Z",
            "url": "https://files.pythonhosted.org/packages/f6/ca/9cecfff8829dfbb15d1f54b974e15c6dff76eb4c51e3c10ffdd1aff48380/iterabledata-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-24 07:24:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "apicrafter",
    "github_project": "pyiterable",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "iterabledata"
}
