pyarrow-bigquery


- Name: pyarrow-bigquery
- Version: 0.5.0
- Summary: A simple library to write to and read from BigQuery tables as PyArrow tables.
- Upload time: 2024-06-26 22:07:44
- Author: Sebastian Pawluś
- License: MIT
- Keywords: pyarrow, bigquery

# pyarrow-bigquery

A simple library to **write to** and **read from** BigQuery tables as PyArrow tables.

---

## Installation

```bash
pip install pyarrow-bigquery
```

---

## Quick Start

This guide will help you quickly get started with `pyarrow-bigquery`, a library that allows you to **read** from and **write** to Google BigQuery using PyArrow.

### Reading

`pyarrow-bigquery` offers four methods to read BigQuery tables as PyArrow tables. Depending on your use case and/or the table size, you can choose the most suitable method.

**Read from a Table Location**

When the table is small enough to fit in memory, you can read it directly using `read_table`.

```python
import pyarrow.bigquery as bq

table = bq.read_table("gcp_project.dataset.small_table")

print(table.num_rows)
```

**Read from a Query**

Alternatively, if the query results are small enough to fit in memory, you can read them directly using `read_query`.

```python
import pyarrow.bigquery as bq

table = bq.read_query(
    project="gcp_project",
    query="SELECT * FROM `gcp_project.dataset.small_table`"
)

print(table.num_rows)
```

**Read in Batches**

If the target table is larger than memory or you prefer not to fetch the entire table at once, you can use the `bq.reader` iterator method with the `batch_size` parameter to limit how much data is fetched per iteration.

```python
import pyarrow.bigquery as bq

for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
    print(table.num_rows)
```
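
Batch iteration means only one batch has to be held in memory at a time. The sketch below illustrates that pattern in plain Python. `fake_reader` is a hypothetical stand-in for `bq.reader` that yields lists of rows instead of PyArrow tables, so the snippet runs without a BigQuery connection; it is not part of the library.

```python
def fake_reader(total_rows, batch_size=100):
    """Hypothetical stand-in for bq.reader: yields lists of at most batch_size rows."""
    for start in range(0, total_rows, batch_size):
        stop = min(start + batch_size, total_rows)
        yield [{"integers": i} for i in range(start, stop)]


def count_and_sum(batches):
    """Aggregate batch by batch so the full dataset is never held at once."""
    row_count = 0
    total = 0
    for batch in batches:
        row_count += len(batch)
        total += sum(row["integers"] for row in batch)
    return row_count, total


row_count, total = count_and_sum(fake_reader(total_rows=250, batch_size=100))
print(row_count, total)
```

With a real table, the same loop body would work on each `pa.Table` yielded by `bq.reader`, for example by using `table.num_rows` in place of `len(batch)`.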

**Read Query in Batches**

Similarly, you can read data in batches from a query using `reader_query`.

```python
import pyarrow.bigquery as bq

for table in bq.reader_query(
    project="gcp_project",
    query="SELECT * FROM `gcp_project.dataset.small_table`"
):
    print(table.num_rows)
```

### Writing

The package provides two methods to write to BigQuery. Depending on your use case or the table size, you can choose the appropriate method.

**Write the Entire Table**

To write a complete table at once, use the `bq.write_table` method.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

bq.write_table(table, 'gcp_project.dataset.table')
```

**Write in Batches**

If you need to write data in smaller chunks, use the `bq.writer` method with the `schema` parameter to define the table structure.

```python
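import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([
    ("integers", pa.int64())
])

# Example data matching the schema above.
table = pa.Table.from_pydict({"integers": [1, 2, 3, 4]}, schema=schema)
record_batch = pa.RecordBatch.from_pydict({"integers": [5, 6]}, schema=schema)

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    writer.write_batch(record_batch)
    writer.write_table(table)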
```
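
For readers unfamiliar with the pattern, here is a minimal sketch of how a context-manager writer can split incoming data into chunks and guarantee cleanup on exit. `FakeWriter` and its `write_rows` method are hypothetical and only record what would be sent; this is an illustration of the general pattern, not the library's implementation.

```python
class FakeWriter:
    """Hypothetical stand-in for bq.writer; it only records what would be sent."""

    def __init__(self, where, batch_size=100):
        self.where = where
        self.batch_size = batch_size
        self.sent = []        # chunks that would have been written
        self.closed = False

    def __enter__(self):
        return self

    def write_rows(self, rows):
        # Split the incoming rows into chunks of at most batch_size.
        for start in range(0, len(rows), self.batch_size):
            self.sent.append(rows[start:start + self.batch_size])

    def __exit__(self, exc_type, exc, tb):
        # Cleanup runs even if the body raised; returning False re-raises.
        self.closed = True
        return False


with FakeWriter("gcp_project.dataset.table", batch_size=2) as writer:
    writer.write_rows([1, 2, 3, 4, 5])

print(writer.sent, writer.closed)
```

The `with` block makes the end of the write explicit, which is why a context manager suits data that arrives in several pieces.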

---

## API Reference

### Writing

#### `pyarrow.bigquery.write_table`

Writes a PyArrow Table to a BigQuery Table. No return value.

**Parameters:**

- `table`: `pa.Table`  
  The PyArrow table.

- `where`: `str`  
  The destination location in the BigQuery catalog.

- `project`: `str`, *default* `None`  
  The BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.

- `table_create`: `bool`, *default* `True`  
  Specifies if the BigQuery table should be created.

- `table_expire`: `None | int`, *default* `None`  
  The number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.

- `table_overwrite`: `bool`, *default* `False`  
  If the table already exists, it will be destroyed and a new one will be created.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`  
  The worker backend for writing data.

- `worker_count`: `int`, *default* `os.cpu_count()`  
  The number of threads or processes to use for writing data to BigQuery.

- `batch_size`: `int`, *default* `100`  
  The batch size used for writes. The table will be automatically split into batches of this size.

```python
bq.write_table(table, 'gcp_project.dataset.table')
```
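
The `project` default described above can be pictured with a small sketch. `resolve_project` is a hypothetical helper, and it assumes the location has the `project.dataset.table` form used in the examples; the library's actual parsing may differ.

```python
def resolve_project(where, project=None):
    """Hypothetical helper: pick the execution/billing project for a table location."""
    parts = where.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 'project.dataset.table', got {where!r}")
    # An explicit project wins; otherwise use the first component of the location.
    return project if project is not None else parts[0]


print(resolve_project("gcp_project.dataset.table"))
print(resolve_project("gcp_project.dataset.table", project="billing_project"))
```

Passing `project` explicitly is useful when the table lives in one project but the job should be billed to another.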

#### `pyarrow.bigquery.writer` (Context Manager)

Context manager version of `write_table`. Useful when the data is larger than memory or arrives in chunks.

**Parameters:**

- `schema`: `pa.Schema`  
  The PyArrow schema.

- `where`: `str`  
  The destination location in the BigQuery catalog.

- `project`: `str`, *default* `None`  
  The BigQuery execution project, also the billing project. If not provided, it will be extracted from `where`.

- `table_create`: `bool`, *default* `True`  
  Specifies if the BigQuery table should be created.

- `table_expire`: `None | int`, *default* `None`  
  The number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.

- `table_overwrite`: `bool`, *default* `False`  
  If the table already exists, it will be destroyed and a new one will be created.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`  
  The worker backend for writing data.

- `worker_count`: `int`, *default* `os.cpu_count()`  
  The number of threads or processes to use for writing data to BigQuery.

- `batch_size`: `int`, *default* `100`  
  The batch size used for writes. Tables are automatically split into batches of this size.
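
To make the `table_expire` semantics concrete, the sketch below converts a number of seconds into an absolute expiration time. `expiration_time` is a hypothetical helper written for illustration, and it assumes the countdown starts when the table is created; it does not call BigQuery.

```python
from datetime import datetime, timedelta, timezone


def expiration_time(table_expire, now=None):
    """Hypothetical helper: turn a table_expire value into an absolute UTC time."""
    if table_expire is None:
        return None  # expiration disabled
    now = now or datetime.now(timezone.utc)
    return now + timedelta(seconds=table_expire)


created = datetime(2024, 6, 26, 22, 0, 0, tzinfo=timezone.utc)
print(expiration_time(3600, now=created))   # one hour after creation
print(expiration_time(None, now=created))   # None: the table never expires
```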

Depending on your use case, you might want to use one of the methods below to write your data to a BigQuery table, using either `pa.Table` or `pa.RecordBatch`.

#### `pyarrow.bigquery.writer.write_table` (Context Manager Method)

Context manager method to write a table.

**Parameters:**

- `table`: `pa.Table`  
  The PyArrow table.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    for a in range(1000):
        writer.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))
```

#### `pyarrow.bigquery.writer.write_batch` (Context Manager Method)

Context manager method to write a record batch.

**Parameters:**

- `batch`: `pa.RecordBatch`  
  The PyArrow record batch.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    for a in range(1000):
        writer.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))
```

### Reading

#### `pyarrow.bigquery.read_table`

**Parameters:**

- `source`: `str`  
  The BigQuery table location.

- `project`: `str`, *default* `None`  
  The BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.

- `columns`: `str`, *default* `None`  
  The columns to download. When not provided, all available columns will be downloaded.

- `row_restrictions`: `str`, *default* `None`  
  Row-level filtering executed on the BigQuery side. More information is available in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`  
  The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`  
  The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`  
  The batch size used for fetching. Results are automatically split into batches of this size.
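
As a rough illustration of what `worker_count` controls, the sketch below spreads independent work items over a pool of threads, with the pool size defaulting to `os.cpu_count()`. `fetch_all` and the `parts` it processes are hypothetical; the snippet uses `threading.Thread` only (the documented default backend) and does not reflect how the library schedules its workers.

```python
import os
import queue
import threading


def fetch_all(parts, worker_count=None):
    """Hypothetical sketch: process work items with a pool of threads."""
    worker_count = worker_count or os.cpu_count() or 1
    todo = queue.Queue()
    for part in parts:
        todo.put(part)
    results = []
    lock = threading.Lock()

    def work():
        while True:
            try:
                part = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                results.append(f"rows-from-{part}")

    workers = [threading.Thread(target=work) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # Completion order is not deterministic, so sort before returning.
    return sorted(results)


print(fetch_all(["p0", "p1", "p2"], worker_count=2))
```

A process-based backend would need picklable work functions and `multiprocessing` queues instead of the in-memory ones used here.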

#### `pyarrow.bigquery.read_query`

**Parameters:**

- `project`: `str`  
  The BigQuery query execution (and billing) project.

- `query`: `str`  
  The query to be executed.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`  
  The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`  
  The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`  
  The batch size used for fetching. Results are automatically split into batches of this size.

```python
table = bq.read_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`")
```

#### `pyarrow.bigquery.reader`

**Parameters:**

- `source`: `str`  
  The BigQuery table location.

- `project`: `str`, *default* `None`  
  The BigQuery execution project, also the billing project. If not provided, it will be extracted from `source`.

- `columns`: `str`, *default* `None`  
  The columns to download. When not provided, all available columns will be downloaded.

- `row_restrictions`: `str`, *default* `None`  
  Row-level filtering executed on the BigQuery side. More information is available in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1beta1).

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`  
  The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`  
  The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`  
  The batch size used for fetching. Results are automatically split into batches of this size.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

parts = []
for part in bq.reader("gcp_project.dataset.table"):
    parts.append(part)

table = pa.concat_tables(parts)
```

#### `pyarrow.bigquery.reader_query`

**Parameters:**

- `project`: `str`  
  The BigQuery query execution (and billing) project.

- `query`: `str`  
  The query to be executed.

- `worker_type`: `threading.Thread | multiprocessing.Process`, *default* `threading.Thread`  
  The worker backend for fetching data.

- `worker_count`: `int`, *default* `os.cpu_count()`  
  The number of threads or processes to use for fetching data from BigQuery.

- `batch_size`: `int`, *default* `100`  
  The batch size used for fetching. Results are automatically split into batches of this size.

```python
for batch in bq.reader_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`"):
    print(batch.num_rows)
```

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pyarrow-bigquery",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "pyarrow, bigquery",
    "author": "Sebastian Pawlu\u015b",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/e5/44/b6f3362b6f663f1b582404ae14af7d8a8417ab6392a027298cd550f8f78b/pyarrow_bigquery-0.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A simple library to write to and read from BigQuery tables as PyArrow tables.",
    "version": "0.5.0",
    "project_urls": null,
    "split_keywords": [
        "pyarrow",
        " bigquery"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4692dafce8aef2ecf7215bec8e1947e2026a16b11b6145f33c5b6719f7144454",
                "md5": "714549e9cd7478d197db878e4627d991",
                "sha256": "585e44b10fb19bd32aaea381a32bebc0ffe9587b037c6c4576e8e17047e6f35e"
            },
            "downloads": -1,
            "filename": "pyarrow_bigquery-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "714549e9cd7478d197db878e4627d991",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 11256,
            "upload_time": "2024-06-26T22:07:43",
            "upload_time_iso_8601": "2024-06-26T22:07:43.617686Z",
            "url": "https://files.pythonhosted.org/packages/46/92/dafce8aef2ecf7215bec8e1947e2026a16b11b6145f33c5b6719f7144454/pyarrow_bigquery-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e544b6f3362b6f663f1b582404ae14af7d8a8417ab6392a027298cd550f8f78b",
                "md5": "ca15ed451950b2f62f3575fa9ad779de",
                "sha256": "d6649b3637e24d6c8fef2c39cf7aff79eef36d525cef887d964a987c618a9641"
            },
            "downloads": -1,
            "filename": "pyarrow_bigquery-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ca15ed451950b2f62f3575fa9ad779de",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 11080,
            "upload_time": "2024-06-26T22:07:44",
            "upload_time_iso_8601": "2024-06-26T22:07:44.574955Z",
            "url": "https://files.pythonhosted.org/packages/e5/44/b6f3362b6f663f1b582404ae14af7d8a8417ab6392a027298cd550f8f78b/pyarrow_bigquery-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-26 22:07:44",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pyarrow-bigquery"
}
        