# pandas-pyarrow
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pandas-pyarrow)](https://pypi.org/project/pandas-pyarrow/)
[![version](https://img.shields.io/pypi/v/pandas-pyarrow)](https://img.shields.io/pypi/v/pandas-pyarrow)
[![License](https://img.shields.io/:license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
![OS](https://img.shields.io/badge/ubuntu-blue?logo=ubuntu)
![OS](https://img.shields.io/badge/win-blue?logo=windows)
![OS](https://img.shields.io/badge/mac-blue?logo=apple)
[![Tests](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/ci.yml/badge.svg)](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/ci.yml)
[![Code Checks](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/code-checks.yml/badge.svg)](https://github.com/DanielAvdar/pandas-pyarrow/actions/workflows/code-checks.yml)
[![codecov](https://codecov.io/gh/DanielAvdar/pandas-pyarrow/graph/badge.svg?token=N0V9KANTG2)](https://codecov.io/gh/DanielAvdar/pandas-pyarrow)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
`pandas-pyarrow` simplifies converting a pandas DataFrame to the pyarrow backend, allowing a seamless switch from
numpy-backed to pyarrow-backed dtypes.
## Get started:
### Installation
To install the package, use pip:
```bash
pip install pandas-pyarrow
```
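As an optional check after installing, the sketch below looks up the installed version using only the standard library. The helper name `installed_version` is hypothetical and not part of pandas-pyarrow; it assumes the distribution name matches the one used in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "pandas-pyarrow" is the distribution name used in the pip command above.
found = installed_version("pandas-pyarrow")
print(found if found is not None else "pandas-pyarrow is not installed")
```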
### Usage
```python
import pandas as pd
from pandas_pyarrow import convert_to_pyarrow
# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True],
})
# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = convert_to_pyarrow(df)
print(adf.dtypes)
```
outputs:
```
A int64[pyarrow]
B string[pyarrow]
C double[pyarrow]
D bool[pyarrow]
dtype: object
```
Furthermore, it's possible to add mappings or override existing ones:
```python
import pandas as pd
from pandas_pyarrow import PandasArrowConverter
# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True],
})
# Instantiate a PandasArrowConverter object
pandas_pyarrow_converter = PandasArrowConverter(
    custom_mapper={'int64': 'int32[pyarrow]', 'float64': 'float32[pyarrow]'}
)
# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = pandas_pyarrow_converter(df)
print(adf.dtypes)
```
outputs:
```
A int32[pyarrow]
B string[pyarrow]
C float[pyarrow]
D bool[pyarrow]
dtype: object
```
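Mapping `int64` to a 32-bit type only makes sense when the values fit. The following is a minimal sketch of a range check one might run before choosing such a mapping; `fits_in_int32` is a hypothetical helper, not part of pandas-pyarrow, and the `big` column is made-up illustration data.

```python
import numpy as np
import pandas as pd


def fits_in_int32(series: pd.Series) -> bool:
    """Return True if every value in an integer series lies within the int32 range."""
    bounds = np.iinfo(np.int32)
    return bool(series.min() >= bounds.min and series.max() <= bounds.max)


df = pd.DataFrame({
    "A": [1, 2, 3],
    "big": [1, 2, 2**40],  # 2**40 exceeds the int32 maximum of 2**31 - 1
})

# Check each integer column before mapping int64 to int32.
safe_to_downcast = {col: fits_in_int32(df[col]) for col in df.columns}
print(safe_to_downcast)  # {'A': True, 'big': False}
```

Columns that fail the check are probably better left at `int64`.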
pandas-pyarrow also supports the db-dtypes used by the BigQuery Python SDK. To run the example below, install pandas-gbq:
```bash
pip install pandas-gbq
```
or
```bash
pip install "pandas-pyarrow[bigquery]"
```
```python
import pandas_gbq as gbq
from pandas_pyarrow import PandasArrowConverter
# Query a public BigQuery table (Austin 311 service requests)
query = """
SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1000
"""
# Use pandas_gbq to read the data from BigQuery
df = gbq.read_gbq(query)
pandas_pyarrow_converter = PandasArrowConverter()
adf = pandas_pyarrow_converter(df)
# Print the dtypes before and after conversion
print(df.dtypes)
print(adf.dtypes)
```
outputs:
```
unique_key object
complaint_description object
source object
status object
status_change_date datetime64[us, UTC]
created_date datetime64[us, UTC]
last_update_date datetime64[us, UTC]
close_date datetime64[us, UTC]
incident_address object
street_number object
street_name object
city object
incident_zip Int64
county object
state_plane_x_coordinate object
state_plane_y_coordinate float64
latitude float64
longitude float64
location object
council_district_code Int64
map_page object
map_tile object
dtype: object
unique_key string[pyarrow]
complaint_description string[pyarrow]
source string[pyarrow]
status string[pyarrow]
status_change_date timestamp[us][pyarrow]
created_date timestamp[us][pyarrow]
last_update_date timestamp[us][pyarrow]
close_date timestamp[us][pyarrow]
incident_address string[pyarrow]
street_number string[pyarrow]
street_name string[pyarrow]
city string[pyarrow]
incident_zip int64[pyarrow]
county string[pyarrow]
state_plane_x_coordinate string[pyarrow]
state_plane_y_coordinate double[pyarrow]
latitude double[pyarrow]
longitude double[pyarrow]
location string[pyarrow]
council_district_code int64[pyarrow]
map_page string[pyarrow]
map_tile string[pyarrow]
dtype: object
```
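With many columns, two separate dtype listings are hard to compare by eye. The sketch below puts them side by side; `dtype_report` is a hypothetical helper, and the small DataFrame plus the `astype` call are stand-ins so the snippet runs without BigQuery access or pandas-pyarrow installed.

```python
import pandas as pd


def dtype_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype before and after conversion."""
    report = pd.DataFrame({
        "before": before.dtypes.astype(str),
        "after": after.dtypes.astype(str),
    })
    report["changed"] = report["before"] != report["after"]
    return report


# Stand-in data: two columns named after the BigQuery table above.
df = pd.DataFrame({
    "city": ["Austin", None],
    "incident_zip": pd.array([78701, None], dtype="Int64"),
})
# Stand-in for the converter call; in practice use pandas_pyarrow_converter(df).
adf = df.astype({"incident_zip": "Float64"})

report = dtype_report(df, adf)
print(report)
```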
## Purposes
- Simplify conversion between the pandas numpy and pyarrow backends.
- Allow a seamless switch to the pyarrow pandas backend, even for problematic dtypes such as float16 or db-dtypes.
- Standardize the db-dtypes used by the BigQuery Python SDK.
For example, converting a float16 column with plain pandas:
```python
import pandas as pd
# Create a pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')
df.convert_dtypes(dtype_backend='pyarrow')
```
This raises an error:
```
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from halffloat to double using function cast_double
```
With pandas-pyarrow, the conversion succeeds:
```python
import pandas as pd
from pandas_pyarrow import convert_to_pyarrow
# Create a pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')
adf = convert_to_pyarrow(df)
print(adf.dtypes)
```
outputs:
```
C halffloat[pyarrow]
dtype: object
```
## Additional Information
Converting from a higher-precision numeric dtype (such as float64) to a
lower-precision one (such as float32) can lose precision.
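To make the size of that loss concrete, here is a small sketch using plain pandas (no pyarrow needed) that casts float64 values to float32 and measures the error. The sample values are chosen for illustration only.

```python
import pandas as pd

# 16777217 is 2**24 + 1, which float32 cannot represent exactly.
s64 = pd.Series([0.1, 1234567.891, 16777217.0], dtype="float64")
s32 = s64.astype("float32")

# Cast back to float64 so the comparison itself adds no further rounding.
roundtrip = s32.astype("float64")
abs_error = (roundtrip - s64).abs()

print(abs_error.max())  # 1.0
```

The error grows with magnitude, so large identifiers or timestamps stored as floats are the values most at risk when a custom mapping targets `float32[pyarrow]`.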