simple-xlsx-writer

Name: simple-xlsx-writer
Version: 0.3.1
Summary: Save large amounts of data to small xlsx files
Author: Krzysztof Ruta
Homepage: https://github.com/krzys9876/simple_xlsx_writer
Requires Python: >=3.11
License: MIT License, Copyright (c) 2025 Krzysztof Ruta
Keywords: excel, database, export, python, xlsx
Upload time: 2025-01-24 10:16:33
# Yet another Python XLSX writer

... yes, this is reinventing the wheel again and again, but ...

So I decided to write my own *xlsx* export for two reasons:

First and foremost, the two existing engines I use with *pandas* (*openpyxl*, *xlsxwriter*) do not store large files efficiently.
The problem arises when I must load a large number of records, sometimes beyond the Excel limit (2^20 = 1048576 rows), and then send them over email 
(this is quite often the easiest way to share data...). The files get way too big.

Secondly, I just want to understand *xlsx* internals and use the simplest possible code to handle the files. 
As a side effect, it is simpler and faster than some other libraries.

As a simple benchmark, consider a sample file of 700k+ records and 18 columns. 
Standard *pandas* creates a file of about **40**MB. The simple_xlsx_writer's file is as small as **8**MB, which makes it much more *email friendly*. 


(Of course, once such a file is modified and re-saved in Excel it gets much bigger, but that's not the point.) 

## Usage

The project consists of the following submodules: 

- *writer*
- *postgresql_handler*
- *oracle_handler*

### writer

This generic module exposes function(s) to write raw data 
(an array of arrays) into an Excel file. 

This should be clear from the helper function:

    def write_dummy(base_path: str, target_name: str) -> None:
        data = [["A", "B", "C"], ["TEST", 1.23, "2024-10-01 12:34:56"], ["TEST", 200, "2024-10-01 12:34:56"]]
        write_raw_data(base_path, target_name, data)

Note that the only supported data types are *str*, *int* and *float*, which is related to the way data is stored in the xlsx file.

So you may have to prepare the input array yourself or use the other submodules (see below).

The *write_dummy* helper function shown above saves a predefined tiny file under the given name.
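
For instance, a *datetime* column has to be rendered as text before writing. Here is a minimal sketch of such preparation; it assumes *write_raw_data* is exposed as *writer.write_raw_data* (mirroring the helper above), and the column names and target name are made up:

    import os
    from datetime import datetime
    from simple_xlsx_writer import writer

    # values other than str/int/float (here: datetime) must be converted up front
    events = [
        ("start", 1, datetime(2024, 10, 1, 12, 34, 56)),
        ("stop", 2, datetime(2024, 10, 1, 13, 0, 0)),
    ]
    # header row first, then one row per event with the datetime rendered as text
    data = [["NAME", "SEQ", "WHEN"]] + \
        [[name, seq, when.strftime("%Y-%m-%d %H:%M:%S")] for name, seq, when in events]

    base_path = os.path.dirname(__file__)  # or an explicit path
    # assumption: write_raw_data is the public entry point, as in write_dummy above
    writer.write_raw_data(base_path, "events01", data)  # should produce events01.xlsx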

### postgresql_handler 

If you use a PostgreSQL database, you can use a helper function that reads a query result into the required structure.

First of all, you may wish to verify the connection. I prefer to do it this way:

    print("db time: "+postgresql_handler.get_sysdate(username,password,host,port,dbname).strftime("%Y-%m-%d %H:%M:%S"))

To save query results run:

    postgresql_handler.write_query(query, base_path, "all_tables_pg", username, password, host, port, dbname)

### oracle_handler 

If you use an Oracle database, the usage is very similar to PostgreSQL, except for the naming and the parameter list.

You may verify the connection using:

    print("db time: "+oracle_handler.get_sysdate(username,password,dsn).strftime("%Y-%m-%d %H:%M:%S"))

To save query results run:

    oracle_handler.write_query(query, base_path, "all_tables", username, password, dsn)


#### Example

See: *main_pg.py*

    ...

    username = input("username: ")
    password = getpass.getpass()
    host = input("host: ")
    port = int(input("port: "))
    dbname = input("database name: ")

    # verify connection
    print("db time: "+postgresql_handler.get_sysdate(username,password,host,port,dbname).strftime("%Y-%m-%d %H:%M:%S"))

    # fetch all tables' metadata
    query = "select * from information_schema.tables"
    base_path = os.path.dirname(__file__)
    postgresql_handler.write_query(query, base_path, "all_tables_pg", username, password, host, port, dbname)

    ...

See: *main_ora.py*

    ...    

    username = input("username: ")
    password = getpass.getpass()
    dsn = input("DSN: ")
    
    # verify connection
    print("db time: "+oracle_handler.get_sysdate(username,password,dsn).strftime("%Y-%m-%d %H:%M:%S"))

    # fetch all tables' metadata
    query = "select * from all_tables"
    base_path = os.path.dirname(__file__)
    oracle_handler.write_query(query, base_path, "all_tables", username, password, dsn)

    ...

### CSV conversion

From time to time I come across large CSV files. Loading such a file into Excel via Power Query generally works fine,
but what if you have to load it over a slow network? The solution is to convert it to a relatively small *xlsx* file.

#### Example

    ...

    custom_params = {
        "debug_info_every_rows": 100000,
        "row_limit": 1000000,
        "row_limit_exceed_strategy": "sheets" # truncate / files / sheets
    }
    base_path = os.path.dirname(__file__)
    writer.convert_csv("c:\\some_path\\A_VERY_LARGE_CSV.csv", base_path,
                       "converted_from_csv.xlsx", debug=True, custom_params=custom_params)
    
    ...

The *debug* option is particularly useful when you want to make sure that the data load is progressing.


### Exceeding Excel row limit

If your input data exceeds the Excel row limit, or you just want to divide it into smaller chunks, 
you may use one of two strategies (see: *configuration*; a short sketch follows the list below):
    
- use files - save each chunk into a separate *xlsx* file
- use sheets - save each chunk into a separate worksheet within a single *xlsx* file
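
For example, to split the output into separate files rather than sheets, set the strategy accordingly (a sketch; the row limit value is only an example):

    custom_params = {
        "row_limit": 1000000,
        "row_limit_exceed_strategy": "files"  # each chunk goes into a separate, numbered xlsx file
    }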

# Configuration

You may customize the operation using the *custom_params* parameter. The available options are:

| **Parameter**                        | **Default value**  | **Description**                                                       |
|--------------------------------------|--------------------|-----------------------------------------------------------------------|
| sheet_name                           | data               | sheet name or sheet name prefix for multiple sheets (numbered from 1) |
| python_date_format                   | %Y-%m-%d           | date format when converting data loaded from database                 | 
| python_datetime_format               | %Y-%m-%d %H:%M:%S  | as above, but for the datetime format                                 |
| python_datetime_remove_zeros         | True               | should empty time be removed (to save space and improve readability)  |
| python_datetime_remove_zeros_pattern | " 00:00:00"        | the pattern of empty time to be removed                               |
| headers                              | True               | whether the input data contains a header row                          |
| row_limit                            | 1048576-1 (2^20-1) | row limit (unchecked!)                                                |
| row_limit_exceed_strategy            | truncate           | what to do when dealing with data exceeding row limit                 |
|                                      |                    | - truncate - truncate data beyond row limit                           |
|                                      |                    | - files - generate multiple files (numbered from 1)                   |
|                                      |                    | - sheets - generate multiple sheets (numbered from 1)                 |
| debug_info_every_rows                | 10000              | print debug info every X rows, applies only when debug=True           |
| csv_delimiter                        | , (comma)          | CSV file delimiter                                                    |
| csv_quote                            | \" (double quote)  | CSV quote character                                                   |
| csv_encoding                         | utf-8              | CSV file encoding                                                     |

You provide custom options as a Python *dict*:

    custom_params = {
        "sheet_name": "events_",
        "debug_info_every_rows": 100000,
        "row_limit": 1000000,
        "row_limit_exceed_strategy": "sheets"
    }
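
The dict is then passed via the *custom_params* argument, for example to *convert_csv* exactly as in the CSV example above (the path and file names below are placeholders):

    base_path = os.path.dirname(__file__)
    # placeholder input path and output name, reusing the custom_params dict defined above
    writer.convert_csv("c:\\some_path\\EVENTS.csv", base_path,
                       "events.xlsx", debug=True, custom_params=custom_params)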


## Installation

Install package using pip:

    pip install simple-xlsx-writer

If you wish to use *Oracle* and/or *PostgreSQL* connectivity, add the corresponding option(s):

    pip install simple-xlsx-writer[postgresql]
    pip install simple-xlsx-writer[oracle]

These add the required dependencies, *psycopg* and *oracledb* respectively. You can always install them yourself. 
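
If you need both, pip also lets you combine the extras in a single command (quoting the argument so the brackets survive your shell):

    pip install "simple-xlsx-writer[postgresql,oracle]"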

To verify the installation, run:

    import os
    from simple_xlsx_writer import writer

    base_path = os.path.dirname(__file__) # or provide explicit path in interactive mode
    writer.write_dummy(base_path, "dummy01")

You should find a *dummy01.xlsx* file in the given path, containing:

| A    | B    | C                   |
|------|------|---------------------|
| TEST | 1.23 | 2024-10-01 12:34:56 |
| TEST | 200  | 2024-10-01 12:34:56 |

## Memory considerations

Keep in mind that the module is performance- and size-optimized, NOT memory-optimized.
The complete dataset (either from a database or a CSV file) is loaded into memory in order to calculate shared strings. 

For extremely large files (extreme by Excel standards, not in general) the memory requirements are substantial.

One real-life example is the conversion of a 1GB CSV file (2.5M rows and 60 columns with lots of text repetition, Windows 11). 

Loading the file into memory takes **~6GB**; slicing it into separate sheets briefly bumps memory usage up to **10GB**.
The resulting single xlsx file is about **60MB** (6% of the original). The whole process takes about **8 minutes** (13th-gen i5).

            
