dfcsv2parquet

Name: dfcsv2parquet
Version: 0.10
Home page: https://github.com/hansalemaos/dfcsv2parquet
Summary: converts large CSV files into smaller, Pandas-compatible Parquet files
Upload time: 2023-06-25 19:21:19
Author: Johannes Fischer
License: MIT
Keywords: csv, parquet, pandas, dataframe
Requirements: a_pandas_ex_less_memory_more_speed, hackyargparser, pandas, pyarrow
# Converts large CSV files into smaller, Pandas-compatible Parquet files


## pip install dfcsv2parquet

### Tested against Windows 10 / Python 3.10 / Anaconda 


The convert2parquet function converts large CSV files into smaller Parquet files,
reducing memory usage, improving processing speed, and optimizing data types
for more efficient storage.

It is aimed at individuals and organizations working with large CSV datasets
who want to optimize their data storage and processing.
Converting CSV files to Parquet, a columnar storage format, brings several benefits:


### Reduced Memory Usage:

Parquet files are highly compressed and store data in a columnar format,
allowing for efficient memory utilization.
This can significantly reduce the memory footprint compared to traditional row-based CSV files.
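As a rough sanity check (a sketch, not part of the package; the file paths follow the example further down, and the actual savings depend entirely on the data), you can compare the in-memory footprint of the same data loaded from each format:

```python
import pandas as pd

# Load the same dataset from the original CSV and from the converted
# Parquet file, then compare the deep memory usage of each DataFrame.
df_csv = pd.read_csv(r"C:\bigcsv.csv")
df_pqt = pd.read_parquet(r"c:\parquettest4.pqt")

print(f"CSV:     {df_csv.memory_usage(deep=True).sum():,} bytes")
print(f"Parquet: {df_pqt.memory_usage(deep=True).sum():,} bytes")
```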


### Improved Processing Speed:

Parquet files are designed for parallel processing and can be read in a highly efficient manner.
By converting CSV files to Parquet, you can potentially achieve faster data ingestion and query performance.
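One concrete consequence of the columnar layout (a minimal sketch; the column names are placeholders, not columns from the test data): a query that needs only a few columns can skip reading the rest of the file entirely, which pd.read_parquet supports directly.

```python
import pandas as pd

# Columnar storage lets the reader deserialize only the requested columns;
# everything else in the file is never loaded into memory.
df = pd.read_parquet(r"c:\parquettest4.pqt", columns=["col_a", "col_b"])
```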


### Optimized Data Types:

The function includes a data type optimization step (optimize_dtypes) that aims to minimize the memory usage
of the resulting Parquet files. It intelligently selects appropriate data types based on the actual
data values, which can further enhance storage efficiency.
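The optimization itself is delegated to the a_pandas_ex_less_memory_more_speed dependency; the following is only a hedged sketch of the general idea (numeric downcasting), not the package's actual implementation:

```python
import pandas as pd

def shrink_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits the values."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")  # e.g. int64 -> int8
    for col in out.select_dtypes(include="float").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")    # e.g. float64 -> float32
    return out
```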

### Categorical Data Optimization:

The function handles categorical columns efficiently by limiting the number of categories (categorylimit).
It uses the union_categoricals function to merge categorical data from different chunks,
reducing duplication and optimizing memory usage.
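union_categoricals is a public pandas helper (pandas.api.types.union_categoricals); a minimal illustration of the merge step across chunks:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two chunks with overlapping category values: union_categoricals merges
# them into one Categorical with a single shared category index, instead
# of each chunk carrying its own duplicate lookup table.
chunk_a = pd.Categorical(["red", "blue", "red"])
chunk_b = pd.Categorical(["blue", "green"])

merged = union_categoricals([chunk_a, chunk_b])
print(merged.categories)  # one combined index covering both chunks
```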

```python
Args:
    csv_file (str | None): Path to the input CSV file. Default is None.
    parquet_file (str | None): Path to the output Parquet file. Default is None.
    chunksize (int): Number of rows to read from the CSV file per chunk. Default is 1000000.
    categorylimit (int): The minimum number of categories in categorical columns. Default is 4.
    verbose (int | bool): Verbosity level. Set to 1 or True for verbose output, 0 or False for no output. Default is 1.
    zerolen_is_na (int | bool): Whether to treat zero-length strings as NaN values. Set to 1 or True to enable, 0 or False to disable. Default is 0.
    args: Passed on to pd.read_csv (not available from the CLI).
    kwargs: Passed on to pd.read_csv (not available from the CLI).

Returns:
    None

Examples:
    # Download the test CSV:
    # https://github.com/hansalemaos/csv2parquet/raw/main/bigcsv.part03.rar
    # https://github.com/hansalemaos/csv2parquet/raw/main/bigcsv.part02.rar
    # https://github.com/hansalemaos/csv2parquet/raw/main/bigcsv.part01.rar

    # In Python
    from dfcsv2parquet import convert2parquet

    convert2parquet(
        csv_file=r"C:\bigcsv.csv",
        parquet_file=r"c:\parquettest4.pqt",
        chunksize=1000000,
        categorylimit=4,
        verbose=True,
        zerolen_is_na=False,
    )

    # CLI
    # python.exe "...\__init__.py" --csv_file "C:\bigcsv.csv" --parquet_file "c:\parquettest4.pqt" --chunksize 100000 --categorylimit 4 --verbose 1 --zerolen_is_na 1
```
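After conversion, the file reads back with an ordinary pd.read_parquet call (pyarrow, already a dependency, serves as the engine), and the optimized dtypes, including categoricals, should survive the round trip:

```python
import pandas as pd

# Load the converted file and inspect the dtypes the converter chose.
df = pd.read_parquet(r"c:\parquettest4.pqt")
print(df.dtypes)
```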

            
