iterabledata
============

:Name: iterabledata
:Version: 1.0.5
:Summary: Iterable data processing Python library
:Home page: https://github.com/apicrafter/pyiterable/
:Author: Ivan Begtin
:License: MIT
:Requires Python: >=3.10
:Keywords: json, jsonl, csv, bson, parquet, orc, xml, xls, xlsx, dataset, etl, data-pipelines
:Upload time: 2024-06-14 08:38:31
Iterable Data
=============



*Work in progress. Documentation in progress*



Iterable Data is a Python library for reading and writing data files row by row. Iterable classes behave much like file objects or ``csv.DictReader``, extending the same row-by-row access to formats such as Parquet.



This library was written to simplify data processing and conversion

between formats.



Supported file types:

- BSON
- JSON
- NDJSON (JSON lines)
- XML
- XLS
- XLSX
- Parquet
- ORC
- Avro
- Pickle



Supported file compression: GZip, BZip2, LZMA (.xz), LZ4, ZIP, Brotli,

ZStandard
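As a point of reference, the pattern pyiterable generalises across all these formats and codecs can be sketched with the standard library alone for a single combination: an LZMA-compressed CSV read row by row as dictionaries. This is plain stdlib, not pyiterable, and the file name and sample data are invented for the demo.

```python
import csv
import lzma
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'people.csv.xz')
with lzma.open(path, 'wt', encoding='utf-8') as f:
    f.write('name,age\nalice,30\nbob,25\n')

# Stream rows straight out of the compressed file, each row as a dict.
rows = []
with lzma.open(path, 'rt', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        rows.append(row)

print(rows[0]['name'])  # -> alice
```

pyiterable's value is doing exactly this uniformly, with one interface, for every format/codec pair listed above.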



Why write this library?
-----------------------



Python has many high-quality data-processing tools and libraries, especially pandas and other data-frame libraries. The one issue with most of them is that they assume flat data. Data frames don’t support complex, nested data types, so you must *flatten* your data each time.



pyiterable helps you read any data as a Python dictionary instead of

flattening data. It makes it much easier to work with such data sources

as JSON, NDJSON, or BSON files.
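A small stdlib-only illustration of the difference (not pyiterable itself): a nested NDJSON record stays directly addressable as a dictionary, with no flattening step. The record content is invented for the demo.

```python
import json

# One NDJSON line with nested structure that a flat data frame would
# force you to flatten into columns like contacts.email, contacts.phones.0.
line = '{"name": "alice", "contacts": {"email": "a@example.org", "phones": ["+1-555"]}}'
record = json.loads(line)

# Nested fields remain ordinary dict/list lookups.
print(record['contacts']['email'])      # -> a@example.org
print(record['contacts']['phones'][0])  # -> +1-555
```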



This code is used in several tools written by its author: the command-line tool `undatum <https://github.com/datacoon/undatum>`__ and the data processing ETL engine `datacrafter <https://github.com/apicrafter/datacrafter>`__.



Requirements

------------



Python 3.10+



Installation

------------



``pip install iterabledata``, or install from this repository.



Documentation

-------------



In progress. Please see usage and examples.



Usage and examples

------------------



Read compressed CSV file

~~~~~~~~~~~~~~~~~~~~~~~~



Read a compressed ``csv.xz`` file::

   from iterable.helpers.detect import open_iterable

   source = open_iterable('data.csv.xz')
   n = 0
   for row in source:
       n += 1
       # Add data processing code here
       if n % 1000 == 0:
           print('Processing %d' % (n))



Detect encoding and file delimiter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Detects the encoding and delimiter of the selected CSV file and uses them to open it as an iterable::

   from iterable.helpers.detect import open_iterable
   from iterable.helpers.utils import detect_encoding, detect_delimiter

   delimiter = detect_delimiter('data.csv')
   encoding = detect_encoding('data.csv')

   source = open_iterable('data.csv', iterableargs={'encoding': encoding['encoding'], 'delimiter': delimiter})
   n = 0
   for row in source:
       n += 1
       # Add data processing code here
       if n % 1000 == 0:
           print('Processing %d' % (n))
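For comparison, the standard library offers ``csv.Sniffer`` for the delimiter-detection step. This stdlib-only sketch (with invented sample data) shows the same idea; pyiterable's ``detect_delimiter``/``detect_encoding`` helpers are the library's own implementations.

```python
import csv

# Guess the dialect from a sample of the file's text. Restricting the
# candidate set via `delimiters` makes the guess deterministic here.
sample = 'name;age;city\nalice;30;Berlin\nbob;25;Paris\n'
dialect = csv.Sniffer().sniff(sample, delimiters=';,')
print(dialect.delimiter)  # -> ;
```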



Convert Parquet file to JSON lines compressed with LZMA using a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Uses the ``pipeline`` helper to iterate through a Parquet file and write its selected fields to JSON lines (NDJSON).



::

   from iterable.helpers.detect import open_iterable
   from iterable.pipeline import pipeline

   source = open_iterable('data/data.parquet')
   destination = open_iterable('data/data.jsonl.xz', mode='w')

   def extract_fields(record, state):
       out = {}
       record = dict(record)
       for k in ['name',]:
           out[k] = record[k]
       return out

   def print_process(stats, state):
       print(stats)

   pipeline(source, destination=destination, process_func=extract_fields,
            trigger_on=2, trigger_func=print_process, final_func=print_process,
            start_state={})
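The control flow implied by a ``pipeline`` call like the one above can be sketched in plain Python. All names below are hypothetical stand-ins, not pyiterable's API: a self-contained sketch of pull-process-write with a periodic trigger.

```python
# Hypothetical sketch: pull rows from a source, transform each with
# process_func, write the result, and fire trigger_func every
# `trigger_on` rows.
def run_pipeline(source, destination, process_func, trigger_on, trigger_func, state):
    stats = {'count': 0}
    for record in source:
        destination.append(process_func(record, state))
        stats['count'] += 1
        if stats['count'] % trigger_on == 0:
            trigger_func(stats, state)
    return stats

records = [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': 25}]
out = []
stats = run_pipeline(records, out,
                     process_func=lambda r, s: {'name': r['name']},
                     trigger_on=2,
                     trigger_func=lambda st, s: print(st),
                     state={})
print(out)  # -> [{'name': 'alice'}, {'name': 'bob'}]
```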



Convert gzipped JSON lines (NDJSON) file to BSON compressed with LZMA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reads each row from a JSON lines file using the GZip codec and writes BSON data using the LZMA codec::

   from iterable.datatypes import JSONLinesIterable, BSONIterable
   from iterable.codecs import GZIPCodec, LZMACodec

   codecobj = GZIPCodec('data.jsonl.gz', mode='r', open_it=True)
   iterable = JSONLinesIterable(codec=codecobj)
   codecobj = LZMACodec('data.bson.xz', mode='wb', open_it=False)
   write_iterable = BSONIterable(codec=codecobj, mode='w')
   n = 0
   for row in iterable:
       n += 1
       if n % 10000 == 0:
           print('Processing %d' % (n))
       write_iterable.write(row)
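The same codec swap can be mirrored with the standard library alone: JSON lines in and out instead of BSON, since BSON requires a third-party package. Paths and data are invented for the demo.

```python
import gzip
import json
import lzma
import os
import tempfile

tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, 'rows.jsonl.gz')
dst_path = os.path.join(tmp, 'rows.jsonl.xz')

# Create a hypothetical gzipped NDJSON source file.
with gzip.open(src_path, 'wt', encoding='utf-8') as f:
    f.write('{"n": 1}\n{"n": 2}\n')

# Re-encode row by row: read through the GZip codec, write through LZMA.
with gzip.open(src_path, 'rt', encoding='utf-8') as src, \
     lzma.open(dst_path, 'wt', encoding='utf-8') as dst:
    for line in src:
        dst.write(json.dumps(json.loads(line)) + '\n')

with lzma.open(dst_path, 'rt', encoding='utf-8') as f:
    result = f.read()
print(result)
```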



More examples and tests

-----------------------



See `tests <tests/>`__ for example usage and tests.


            
