pandas-pyarrow


- Name: pandas-pyarrow
- Version: 0.1.6
- Home page: https://github.com/DanielAvdar/pandas-pyarrow
- Summary: A library for switching pandas backend to pyarrow
- Upload time: 2024-04-11 19:41:33
- Author: DanielAvdar
- Requires Python: <3.13,>=3.9
- License: MIT
- Keywords: python, pandas, pyarrow, arrow, dataframe, bigquery, pandas-pyarrow, pandas-arrow
- Requirements: none recorded

# pandas-pyarrow

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pandas-pyarrow)](https://pypi.org/project/pandas-pyarrow/)
[![version](https://img.shields.io/pypi/v/pandas-pyarrow)](https://pypi.org/project/pandas-pyarrow/)
[![License](https://img.shields.io/:license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
![OS](https://img.shields.io/badge/ubuntu-blue?logo=ubuntu)
![OS](https://img.shields.io/badge/win-blue?logo=windows)
![OS](https://img.shields.io/badge/mac-blue?logo=apple)
[![Tests](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/ci.yml/badge.svg)](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/ci.yml)
[![Code Checks](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/code-checks.yml/badge.svg)](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/code-checks.yml)
[![codecov](https://codecov.io/gh/DanielAvdar/pandas-pyarrow/graph/badge.svg?token=N0V9KANTG2)](https://codecov.io/gh/DanielAvdar/pandas-pyarrow)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

`pandas-pyarrow` simplifies converting a pandas DataFrame from its default backend to pyarrow, so you can switch to
pyarrow-backed dtypes seamlessly.

## Get started

### Installation

To install the package, use pip:

```bash
pip install pandas-pyarrow
```

### Usage

```python
import pandas as pd

from pandas_pyarrow import convert_to_pyarrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = convert_to_pyarrow(df)

print(adf.dtypes)
```

outputs:

```
A     int64[pyarrow]
B    string[pyarrow]
C    double[pyarrow]
D      bool[pyarrow]
dtype: object
```

Furthermore, it's possible to add mappings or override existing ones:

```python
import pandas as pd

from pandas_pyarrow import PandasArrowConverter

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Instantiate a PandasArrowConverter object
pandas_pyarrow_converter = PandasArrowConverter(
    custom_mapper={'int64': 'int32[pyarrow]', 'float64': 'float32[pyarrow]'})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = pandas_pyarrow_converter(df)

print(adf.dtypes)
```

outputs:

```
A     int32[pyarrow]
B    string[pyarrow]
C     float[pyarrow]
D      bool[pyarrow]
dtype: object
```
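
The override behaviour shown above can be pictured as a default mapping with the custom entries layered on top. The following is only an illustrative sketch of that idea using plain dictionaries; `DEFAULT_MAPPER`, `build_mapper`, and `resolve_dtypes` are hypothetical names and do not describe how `PandasArrowConverter` is actually implemented:

```python
# Hypothetical names throughout; this is NOT pandas-pyarrow's internal code.
# It only illustrates "defaults plus overrides" with plain dictionaries.
DEFAULT_MAPPER = {
    "int64": "int64[pyarrow]",
    "float64": "double[pyarrow]",
    "object": "string[pyarrow]",
    "bool": "bool[pyarrow]",
}


def build_mapper(custom_mapper=None):
    """Return a new dict: the defaults with any custom entries layered on top."""
    return {**DEFAULT_MAPPER, **(custom_mapper or {})}


def resolve_dtypes(source_dtypes, custom_mapper=None):
    """Map each column's source dtype name to a target dtype name."""
    mapper = build_mapper(custom_mapper)
    # In this sketch, dtypes without a mapping are passed through unchanged.
    return {col: mapper.get(dtype, dtype) for col, dtype in source_dtypes.items()}


columns = {"A": "int64", "B": "object", "C": "float64", "D": "bool"}
print(resolve_dtypes(columns))
print(resolve_dtypes(columns, {"int64": "int32[pyarrow]", "float64": "float32[pyarrow]"}))
```

Because the custom entries are merged last, they win over the defaults for the same key, while every dtype not mentioned keeps its default target.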

pandas-pyarrow also supports the db-dtypes used by the BigQuery Python SDK:

```bash
pip install pandas-gbq
```

or

```bash
pip install "pandas-pyarrow[bigquery]"
```

```python
import pandas_gbq as gbq

from pandas_pyarrow import PandasArrowConverter

# Specify the public table you want to query
table_id = "bigquery-public-data.austin_311.311_service_requests"

# Construct the query string
query = f"""
    SELECT * FROM `{table_id}` LIMIT 1000
"""

# Use pandas_gbq to read the data from BigQuery
df = gbq.read_gbq(query)
pandas_pyarrow_converter = PandasArrowConverter()
adf = pandas_pyarrow_converter(df)
# Print the dtypes before and after the conversion
print(df.dtypes)
print(adf.dtypes)
```

outputs:

```
unique_key                               object
complaint_description                    object
source                                   object
status                                   object
status_change_date          datetime64[us, UTC]
created_date                datetime64[us, UTC]
last_update_date            datetime64[us, UTC]
close_date                  datetime64[us, UTC]
incident_address                         object
street_number                            object
street_name                              object
city                                     object
incident_zip                              Int64
county                                   object
state_plane_x_coordinate                 object
state_plane_y_coordinate                float64
latitude                                float64
longitude                               float64
location                                 object
council_district_code                     Int64
map_page                                 object
map_tile                                 object
dtype: object
unique_key                         string[pyarrow]
complaint_description              string[pyarrow]
source                             string[pyarrow]
status                             string[pyarrow]
status_change_date          timestamp[us][pyarrow]
created_date                timestamp[us][pyarrow]
last_update_date            timestamp[us][pyarrow]
close_date                  timestamp[us][pyarrow]
incident_address                   string[pyarrow]
street_number                      string[pyarrow]
street_name                        string[pyarrow]
city                               string[pyarrow]
incident_zip                        int64[pyarrow]
county                             string[pyarrow]
state_plane_x_coordinate           string[pyarrow]
state_plane_y_coordinate           double[pyarrow]
latitude                           double[pyarrow]
longitude                          double[pyarrow]
location                           string[pyarrow]
council_district_code               int64[pyarrow]
map_page                           string[pyarrow]
map_tile                           string[pyarrow]
dtype: object
```

## Purposes

- Simplify conversion between the pandas pyarrow and numpy backends.
- Allow a seamless switch to the pyarrow pandas backend, even for problematic dtypes such as float16 or db-dtypes.
- Standardize the db-dtypes used by the BigQuery Python SDK.


For example, with plain pandas:

```python
import pandas as pd

# Create a pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')

df.convert_dtypes(dtype_backend='pyarrow')
```
will raise an error:
```
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from halffloat to double using function cast_double
```
but with pandas-pyarrow:
```python
import pandas as pd

from pandas_pyarrow import convert_to_pyarrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')

# Convert the float16 column to its arrow equivalent
adf = convert_to_pyarrow(df)
print(adf.dtypes)
```
outputs:
```
C    halffloat[pyarrow]
dtype: object
```


## Additional Information

When converting from higher precision numerical dtypes (like float64) to
lower precision (like float32), data precision might be compromised.
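
To make the precision loss concrete, here is a minimal sketch that uses only the Python standard library (not pandas or pandas-pyarrow) to round a float64 value to float32 and widen it back. The helper name `as_float32` is hypothetical; the rounding it shows is the same kind of loss a float64 to float32 mapping would introduce:

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32, then widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


# 0.5 is exactly representable in float32; 1.1 and 2**24 + 1 are not.
for value in (0.5, 1.1, 16777217.0):
    narrowed = as_float32(value)
    print(f"{value!r} -> {narrowed!r} (absolute error {abs(narrowed - value):.3g})")
```

Values that are exact in float32 survive unchanged, while others pick up a small rounding error, and integers above 2**24 can no longer all be told apart.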

            
