pycobol2parquet


- Name: pycobol2parquet
- Version: 0.0.3
- Home page: https://github.com/jasonli-lijie/pycobol2parquet
- Summary: A Python library to convert COBOL ebcdic file to parquet format based on copybook
- Upload time: 2024-03-16 07:26:15
- Author: Jason Li
- License: MIT
- Keywords: cobol, parquet
- Requirements: none recorded
# pycobol2parquet

pycobol2parquet is a Python library that converts COBOL EBCDIC files to Parquet format, using a copybook to describe the record layout.

I released [pycobol2csv](https://pypi.org/project/pycobol2csv/) back in 2021, and it has been deployed to multiple production systems. One piece of feedback I received was a request to convert from COBOL to Parquet directly for analytical workloads.

It is straightforward to reuse the same underlying knowledge and code to generate Parquet files.


Install the Python module:

`pip install pycobol2parquet`

To use the module:

```
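from pycobol2parquet import convert_cobol_file, decode_copybook_file

# Example inputs (taken from the test data below); adjust them to your files.
copybook_file = "testdata/test2/DWSTUB.txt"
data_file = "testdata/test2/DWSTUB_DATA.DAT"
output_file = "DWSTUB_DATA_output.parquet"
codepage = "cp037"  # assumed here; use the code page your data was written in

# Optional: inspect the record length and structure parsed from the copybook.
row_length, cobol_struc = decode_copybook_file(copybook_file)

convert_cobol_file(copybook_file, data_file, output_file, codepage, debug=False)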
```

- copybook_file: copybook filename
- data_file: EBCDIC data filename
- output_file: output Parquet filename
- codepage: code page of the EBCDIC data; refer to https://docs.python.org/3.7/library/codecs.html#standard-encodings for the available encodings
- debug: set to `True` for more debug information; the default is `False`
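
To illustrate what the `codepage` value refers to, here is a minimal sketch that uses only Python's built-in codecs, not pycobol2parquet itself. The sample bytes are made up for the example, and `cp037` and `cp500` are just two of the EBCDIC entries in the standard encodings list; which one applies to your data depends on the system that produced the file.

```python
# EBCDIC bytes for the text "HELLO" in code page 037 (illustrative data only).
raw = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

# Decode with an EBCDIC codec name from Python's standard encodings.
text_037 = raw.decode("cp037")
print(text_037)

# Decoding the same bytes with a non-EBCDIC codec gives unreadable text,
# which is a quick hint that the wrong code page was chosen.
print(raw.decode("latin-1"))

# EBCDIC code pages agree on letters but can differ on punctuation,
# so compare how two of them encode the same character.
print("!".encode("cp037"), "!".encode("cp500"))
```

If the converted output shows garbled characters in text columns, trying a different EBCDIC code page is a reasonable first check.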

Please refer to _convert_cobol_test_main.py_ for details. 

## test 

Two sets of test data have been created from scratch. Each set includes a copybook and an EBCDIC data file.

To test:

```
python convert_cobol_test_main.py --copybook testdata\test2\DWSTUB.txt --data testdata\test2\DWSTUB_DATA.DAT --output DWSTUB_DATA_output.parquet
```

## known issues and limitations

- Check the resources available in your runtime environment and make sure the COBOL data file is not large enough to exceed them or cause performance problems.

To handle large COBOL files, split them into smaller chunks and process the chunks in parallel. Please refer to the [Medium post](https://medium.com/@jasonli.lijie/process-large-cobol-files-efficiently-with-pycobol2csv-pycobol2parquet-f023533607e4) for details.
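
The splitting step could look roughly like the sketch below. It is not the method from the Medium post and not part of pycobol2parquet: `split_fixed_length_file` is a hypothetical helper, and it assumes the data file consists purely of fixed-length records (no record descriptor words or line terminators), with `row_length` being the value returned by `decode_copybook_file`.

```python
from pathlib import Path


def split_fixed_length_file(data_file, row_length, records_per_chunk, out_dir):
    """Split a fixed-length record file into chunk files on record boundaries.

    Hypothetical helper for illustration; assumes every record is exactly
    row_length bytes long.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_bytes = row_length * records_per_chunk
    chunk_paths = []
    with open(data_file, "rb") as src:
        index = 0
        while True:
            # Read a whole number of records so no record is cut in half.
            block = src.read(chunk_bytes)
            if not block:
                break
            chunk_path = out_dir / f"chunk_{index:05d}.dat"
            chunk_path.write_bytes(block)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each chunk path can then be passed to `convert_cobol_file` as `data_file`, for example from worker processes started with `concurrent.futures.ProcessPoolExecutor`, producing one Parquet file per chunk.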

- When creating Parquet files, the library detects data types automatically. This keeps the parameters passed to the conversion function simple.

<!-- Repo: https://github.com/jasonli-lijie/pycobol2parquet -->

            
