arrow-bcp


Namearrow-bcp JSON
Version 0.9.0 PyPI version JSON
download
home_pageNone
SummaryTransfer data between pyarrow and SQL Server using Microsoft's Bulk Copy Program
upload_time2024-09-05 22:01:56
maintainerNone
docs_urlNone
authorFelix Graßl
requires_python>=3.10
licenseCopyright (c) 2024 Felix Graßl. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords bcp arrow mssql sql server
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # arrow-bcp

When loading data into SQL Server from Python, the common methods are based on the ODBC interface and driver. While this is a tried and tested approach, the performance is not optimal. Reasons for this include:

* *minimal logging* cannot be used by regular insert statements unless they use table-valued parameters, which are cumbersome to use.
* General overhead caused by the ODBC specification. Most databases other than SQL Server provide their own connector in addition to an ODBC driver for this reason

Microsoft's Bulk Copy Program (bcp) command-line utility seems to use additional techniques beyond the ODBC specification and achieves better performance.

The purpose of this library is to (de-)serialize `pyarrow` dataframes into SQL Server's native format so that it can be understood and ingested by bcp, or equivalently, SQL Server's `BULK INSERT` statement.

Some wrappers around the bcp CLI are provided as well to simplify interaction with it.

For example usage, see the notebooks in the examples directory.

# Prerequisites

* An ODBC Driver Manager. On Linux/Mac, you may need to install unixODBC. On Windows, a driver manager is preinstalled.
* The ODBC Driver for SQL Server, which also bundles the bcp executable ([Link](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)).
* If the ODBC Driver for SQL Server was not installed using Microsoft's install script (e.g., by copy-pasting the binaries onto a system), you may need to configure `odbc.ini` (located in /etc/ on Linux) such that "ODBC Driver 17 for SQL Server," or whichever version you installed, points to the ODBC driver executable. If bcp is not on the PATH, you will also need to specify its location using `arrow_bcp.set_bcp_executable(path)`.

# Type mapping

## Arrow to SQL Server

| Arrow                    | SQL Server               |
| ------------------------ | ------------------------ |
| utf8                     | varchar(max) (utf8)      |
| binary                   | varbinary(max)           |
| fixedbinary              | varbinary(max)           |
| decimal128(p, s)         | decimal(p, s)            |
| boolean                  | bit                      |
| int8                     | smallint                 |
| uint8                    | tinyint                  |
| int16                    | smallint                 |
| uint16                   | int                      |
| int32                    | int                      |
| uint32                   | bigint                   |
| int64                    | bigint                   |
| uint64                   | decimal(20, 0)           |
| float16                  | real                     |
| float32                  | real                     |
| float64                  | float                    |
| date32                   | date                     |
| date64                   | datetime2                |
| timestamp(s/ms/us/ns)    | datetime2                |
| time32(s/ms)             | time                     |
| time64(us/ns)            | time                     |
| timestamp(s/ms/us/ns+tz) | datetimeoffset           |
| null                     | varbinary(max)           |

## SQL Server to Arrow

| SQL Server               | Arrow                    |
| ------------------------ | ------------------------ |
| (n)(var)char             | utf8                     |
| (n)text                  | utf8                     |
| variant                  | utf8                     |
| (var)binary              | binary                   |
| user defined type        | binary                   |
| uniqueidentifier         | fixedbinary(16)          |
| decimal(p, s)            | decimal128(p, s) or null |
| numeric(p, s)            | decimal128(p, s) or null |
| (small)money             | decimal128               |
| bit                      | boolean                  |
| tinyint                  | uint8                    |
| smallint                 | int16                    |
| int                      | int32                    |
| bigint                   | int64                    |
| real                     | float32                  |
| float                    | float64                  |
| date                     | date32                   |
| (small)datetime          | timestamp(us)            |
| datetime2                | timestamp(us)            |
| time                     | time64(ns)               |
| datetimeoffset           | timestamp(us+tz) or null |

# Limitations
## Reading data from SQL Server
### Decimals / timestamps with timezones

In both Arrow and SQL Server's format, decimal size/precision and timezone offset are part of the column's datatype. However, bcp does not provide this information as metadata and instead specifies it separately for each cell (even though all cells must match). As a consequence, the datatype of an Arrow decimal/timezone column is only known once the first non-`null` cell is read. If all cells are `null`, we don't get this information, so the entire datatype is set to `null` as a workaround. This may lead to issues when another datatype is expected.

### Datetime2 / datetimeoffset

Both of these datatypes store time with an accuracy of 100ns. However Arrow only offers datatype with accuracy 1ns or 1000ns (1us) and both have tradeoffs. The 1ns datatype cannot represent larger values like `'9999-12-31'` and the 1000ns datatype truncates the last digit, i.e. `'9999-12-31 11:11:11.1234567'` turns to `'9999-12-31 11:11:11.1234560'`. This library chooses to truncate.

# Acknowledgements

Thanks to Adam Serafini for figuring out the build process for Zig extensions and providing an [example module](https://github.com/adamserafini/zaml). He explains that it was quite a journey in his [talk](https://www.youtube.com/watch?v=O0MmmZxdct4), which I appreciate, as even just figuring out that targeting `x86_64-windows` instead of `x86_64-windows-msvc` works better for Windows compilation took me several hours.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "arrow-bcp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "bcp, arrow, mssql, sql server",
    "author": "Felix Gra\u00dfl",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/fb/b4/fa1620187d7b2a69bbba298d490d54c02b163c8e6c6c00a4e3ff4916c548/arrow_bcp-0.9.0.tar.gz",
    "platform": null,
    "description": "# arrow-bcp\n\nWhen loading data into SQL Server from Python, the common methods are based on the ODBC interface and driver. While this is a tried and tested approach, the performance is not optimal. Reasons for this include:\n\n* *minimal logging* cannot be used by regular insert statements unless they use table-valued parameters, which are cumbersome to use.\n* General overhead caused by the ODBC specification. Most databases other than SQL Server provide their own connector in addition to an ODBC driver for this reason\n\nMicrosoft's Bulk Copy Program (bcp) command-line utility seems to use additional techniques beyond the ODBC specification and achieves better performance.\n\nThe purpose of this library is to (de-)serialize `pyarrow` dataframes into SQL Server's native format so that it can be understood and ingested by bcp, or equivalently, SQL Server's `BULK INSERT` statement.\n\nSome wrappers around the bcp CLI are provided as well to simplify interaction with it.\n\nFor example usage, see the notebooks in the examples directory.\n\n# Prerequisites\n\n* An ODBC Driver Manager. On Linux/Mac, you may need to install unixODBC. On Windows, a driver manager is preinstalled.\n* The ODBC Driver for SQL Server, which also bundles the bcp executable ([Link](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server)).\n* If the ODBC Driver for SQL Server was not installed using Microsoft's install script (e.g., by copy-pasting the binaries onto a system), you may need to configure `odbc.ini` (located in /etc/ on Linux) such that \"ODBC Driver 17 for SQL Server,\" or whichever version you installed, points to the ODBC driver executable. If bcp is not on the PATH, you will also need to specify its location using `arrow_bcp.set_bcp_executable(path)`.\n\n# Type mapping\n\n## Arrow to SQL Server\n\n| Arrow                    | SQL Server               |\n| ------------------------ | ------------------------ |\n| utf8                     | varchar(max) (utf8)      |\n| binary                   | varbinary(max)           |\n| fixedbinary              | varbinary(max)           |\n| decimal128(p, s)         | decimal(p, s)            |\n| boolean                  | bit                      |\n| int8                     | smallint                 |\n| uint8                    | tinyint                  |\n| int16                    | smallint                 |\n| uint16                   | int                      |\n| int32                    | int                      |\n| uint32                   | bigint                   |\n| int64                    | bigint                   |\n| uint64                   | decimal(20, 0)           |\n| float16                  | real                     |\n| float32                  | real                     |\n| float64                  | float                    |\n| date32                   | date                     |\n| date64                   | datetime2                |\n| timestamp(s/ms/us/ns)    | datetime2                |\n| time32(s/ms)             | time                     |\n| time64(us/ns)            | time                     |\n| timestamp(s/ms/us/ns+tz) | datetimeoffset           |\n| null                     | varbinary(max)           |\n\n## SQL Server to Arrow\n\n| SQL Server               | Arrow                    |\n| ------------------------ | ------------------------ |\n| (n)(var)char             | utf8                     |\n| (n)text                  | utf8                     |\n| variant                  | utf8                     |\n| (var)binary              | binary                   |\n| user defined type        | binary                   |\n| uniqueidentifier         | fixedbinary(16)          |\n| decimal(p, s)            | decimal128(p, s) or null |\n| numeric(p, s)            | decimal128(p, s) or null |\n| (small)money             | decimal128               |\n| bit                      | boolean                  |\n| tinyint                  | uint8                    |\n| smallint                 | int16                    |\n| int                      | int32                    |\n| bigint                   | int64                    |\n| real                     | float32                  |\n| float                    | float64                  |\n| date                     | date32                   |\n| (small)datetime          | timestamp(us)            |\n| datetime2                | timestamp(us)            |\n| time                     | time64(ns)               |\n| datetimeoffset           | timestamp(us+tz) or null |\n\n# Limitations\n## Reading data from SQL Server\n### Decimals / timestamps with timezones\n\nIn both Arrow and SQL Server's format, decimal size/precision and timezone offset are part of the column's datatype. However, bcp does not provide this information as metadata and instead specifies it separately for each cell (even though all cells must match). As a consequence, the datatype of an Arrow decimal/timezone column is only known once the first non-`null` cell is read. If all cells are `null`, we don't get this information, so the entire datatype is set to `null` as a workaround. This may lead to issues when another datatype is expected.\n\n### Datetime2 / datetimeoffset\n\nBoth of these datatypes store time with an accuracy of 100ns. However Arrow only offers datatype with accuracy 1ns or 1000ns (1us) and both have tradeoffs. The 1ns datatype cannot represent larger values like `'9999-12-31'` and the 1000ns datatype truncates the last digit, i.e. `'9999-12-31 11:11:11.1234567'` turns to `'9999-12-31 11:11:11.1234560'`. This library chooses to truncate.\n\n# Acknowledgements\n\nThanks to Adam Serafini for figuring out the build process for Zig extensions and providing an [example module](https://github.com/adamserafini/zaml). He explains that it was quite a journey in his [talk](https://www.youtube.com/watch?v=O0MmmZxdct4), which I appreciate, as even just figuring out that targeting `x86_64-windows` instead of `x86_64-windows-msvc` works better for Windows compilation took me several hours.\n",
    "bugtrack_url": null,
    "license": "Copyright (c) 2024 Felix Gra\u00dfl.  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Transfer data between pyarrow and SQL Server using Microsoft's Bulk Copy Program",
    "version": "0.9.0",
    "project_urls": null,
    "split_keywords": [
        "bcp",
        " arrow",
        " mssql",
        " sql server"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fbb4fa1620187d7b2a69bbba298d490d54c02b163c8e6c6c00a4e3ff4916c548",
                "md5": "dab41450ac8eae054f8028c8162bce81",
                "sha256": "bb65025daa178845d2b23489d7b7f98018e6f31ac713f4ad3b73142f0e86cf31"
            },
            "downloads": -1,
            "filename": "arrow_bcp-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "dab41450ac8eae054f8028c8162bce81",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 24302,
            "upload_time": "2024-09-05T22:01:56",
            "upload_time_iso_8601": "2024-09-05T22:01:56.390439Z",
            "url": "https://files.pythonhosted.org/packages/fb/b4/fa1620187d7b2a69bbba298d490d54c02b163c8e6c6c00a4e3ff4916c548/arrow_bcp-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-05 22:01:56",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "arrow-bcp"
}
        
Elapsed time: 4.35944s