target-bigquery-partition

- Name: target-bigquery-partition
- Version: 0.2.3
- Home page: https://github.com/anelendata/target-bigquery
- Summary: Google BigQuery target of the singer.io framework.
- Author: Daigo Tanaka, Anelen Co., LLC
- Upload time: 2024-10-17 07:36:07
- Requirements: none recorded
# target-bigquery

ANELEN's implementation of target-bigquery.

This is a [Singer](https://singer.io) target that loads JSON-formatted data
following the [Singer spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md)
to Google BigQuery.

## Installation

### Step 0: Acknowledge LICENSE and TERMS

Please note in particular that the author(s) of target-bigquery are not responsible
for any costs (including but not limited to BigQuery costs) incurred by running
this program.

### Step 1: Activate the Google BigQuery API

(originally found in the [Google API docs](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html))

 1. Use [this wizard](https://console.developers.google.com/start/api?id=bigquery-json.googleapis.com) to create or select a project in the Google Developers Console and activate the BigQuery API. Click Continue, then Go to credentials.
 2. On the **Add credentials to your project** page, click the **Cancel** button.
 3. At the top of the page, select the **OAuth consent screen** tab. Select an **Email address**, enter a **Product name** if not already set, and click the **Save** button.
 4. Select the **Credentials** tab, click the **Create credentials** button and select **OAuth client ID**.
 5. Select the application type **Other**, enter the name "Singer BigQuery Tap", and click the **Create** button.
 6. Click **OK** to dismiss the resulting dialog.
 7. Click the Download button to the right of the client ID.
 8. Move this file to your working directory and rename it *client_secrets.json*.


Export the location of the secret file:

```
export GOOGLE_APPLICATION_CREDENTIALS="./client_secrets.json"
```

For other authentication methods, please see the Authentication section.

### Step 2: Install

First, make sure Python 3 is installed on your system, or follow the installation
instructions for macOS or Ubuntu.

```
pip install -U target-bigquery-partition
```

Or you can install the latest development version from GitHub:

```
pip install --no-cache-dir https://github.com/anelendata/target-bigquery/archive/master.tar.gz#egg=target-bigquery
```
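
As with other Singer taps and targets, it can help to install the tap and the target in separate Python virtual environments so their dependencies do not conflict. A minimal sketch (the environment path is illustrative):

```
# Create and use a dedicated virtual environment for the target (path is illustrative)
python3 -m venv ~/.virtualenvs/target-bigquery-partition
source ~/.virtualenvs/target-bigquery-partition/bin/activate
pip install -U target-bigquery-partition
deactivate
```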

## Run

### Step 1: Configure

Create a file called `target_config.json` in your working directory, following
`config.sample.json`:

```
{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    "table_prefix": "optional_table_prefix",
    "table_ext": "optional_table_ext",
    "partition_by": "optional_column_name",
    "partition_type": "day",
    "partition_exp_ms": null,
    "stream": false,
}
```
Notes:
- The table name is derived from the stream name sent by the tap. You can add a prefix and an extension to the name via `table_prefix` and `table_ext` (see the example below).
- Optionally, set `partition_by` to create a partitioned table. Many production-quality taps record an ingestion timestamp, and it is recommended to partition the table by that column: it improves query performance and lowers BigQuery costs. `partition_type` can be `hour`, `day`, `month`, or `year`, and the default is `day`. `partition_exp_ms` sets the partition expiration in milliseconds; the default is `null` (never expire).
- `stream`: Set this to `true` to stream updates to BigQuery. Note that batch loading performs better, so keep this option `false` unless you need streaming.
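
For illustration, assuming the prefix and extension are simply concatenated around the stream name (an assumption about the naming behavior, not verified here):

```
# Hypothetical naming illustration (assumed behavior):
#   stream name from the tap : orders
#   "table_prefix"           : "raw_"
#   "table_ext"              : "_v1"
#   resulting table name     : raw_orders_v1
```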

### Step 2: Run

target-bigquery can be run with any Singer tap. As an example, let's use
[tap-exchangeratesapi](https://github.com/singer-io/tap-exchangeratesapi).

```
pip install tap-exchangeratesapi
```

Run:

```
tap-exchangeratesapi | target-bigquery -c target_config.json
```
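
In a typical Singer pipeline you would also capture the state emitted by the target so a later run can resume where the previous one left off. A minimal sketch of that pattern (the file name is illustrative):

```
# Capture the state emitted by the target so a later run can resume incrementally
tap-exchangeratesapi | target-bigquery -c target_config.json > state.json
# On later runs, pass state.json back to the tap using the tap's state option
# (commonly --state or -s; check the tap's documentation).
```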

## Authentication

It is recommended to use `target-bigquery` with a service account.

- Download the client_secrets.json file for your service account, and place it
  on the machine where `target-bigquery` will be executed.
- Set a `GOOGLE_APPLICATION_CREDENTIALS` environment variable on the machine,
  where the value is the fully qualified path to client_secrets.json (one way to
  set this up with the gcloud CLI is sketched below).
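
One way to create such a service account with the gcloud CLI is sketched below; the account name, roles, and key path are illustrative assumptions, and the roles your project needs may differ:

```
# Illustrative sketch: create a service account, grant BigQuery roles, and download a key
gcloud iam service-accounts create singer-target-bigquery \
    --display-name "Singer target-bigquery"

gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member "serviceAccount:singer-target-bigquery@your-gcp-project-id.iam.gserviceaccount.com" \
    --role "roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding your-gcp-project-id \
    --member "serviceAccount:singer-target-bigquery@your-gcp-project-id.iam.gserviceaccount.com" \
    --role "roles/bigquery.jobUser"

gcloud iam service-accounts keys create ./client_secrets.json \
    --iam-account singer-target-bigquery@your-gcp-project-id.iam.gserviceaccount.com

export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/client_secrets.json"
```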

In a testing environment, you can also authenticate manually before running
the tap. In this case you do not need `GOOGLE_APPLICATION_CREDENTIALS` defined:

```
gcloud auth application-default login
```

You may also have to set the project:

```
gcloud config set project <project-id>
```

Though not tested, it should also be possible to use the OAuth flow to
authenticate to GCP:
- `target-bigquery` will attempt to open a new window or tab in your default
  browser. If this fails, copy the URL from the console and manually open it
  in your browser.
- If you are not already logged into your Google account, you will be prompted
  to log in.
- If you are logged into multiple Google accounts, you will be asked to select
  one account to use for the authorization.
- Click the **Accept** button to allow `target-bigquery` to access your Google BigQuery
  tables.
- You can close the tab after the authorization flow is complete.

## Schema change considerations

Typically, a schema change is detected by target-bigquery's schema validation or
when BigQuery rejects the input. Here are some ideas and options for handling
schema changes.

### Mapping column names

Modifying existing columns in a data warehouse is costly.
One way to handle a source data type change is to create a new column
(e.g. original column name: `price`, new column name: `price_`)
and let the downstream (e.g. dbt) reconcile the old and new column types.

In the config, you can use `column_map` to map the source field name to the target column name:

```
{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    ...
    "column_map": {
      "<stream_name>": {
        "<source_field_name>": "<target_col_name>",
        ...
      },
      ...
    }
}
```

Note: The schema is validated against the pre-mapped names; the column names are then swapped, where applicable, just before the records are written to BigQuery. So the input stream (tap) does not have to modify its schema messages.
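
For instance, to route the `price` field of a hypothetical `orders` stream to a new `price_` column (stream and field names here are illustrative):

```
{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    "column_map": {
      "orders": {
        "price": "price_"
      }
    }
}
```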

### Ignore unknown columns

BigQuery will reject the load if the data contains an undefined column.
This is disruptive to daily operations that do not depend on the new columns.

To ignore unknown columns, add this to the config:

```
{
    "project_id": "your-gcp-project-id",
    "dataset_id": "your-bigquery-dataset",
    ...
    "exclude_unknown_columns": true,
    ...
}
```
(The default is `false`.)

Warning logs are written out when new columns are detected.

### Add new columns

Use `--schema` or `-s` followed by the updated catalog file to
automatically detect new columns and add them to the BigQuery table:

```
target-bigquery -c files/target_config.json -s files/catalog.json --dryrun -t <table1,table2,...>
```

- Use the `--dryrun` switch to do a dry run: no change is made to the BigQuery table, and you can preview the result in the log.
- Use `--table` (`-t` in the example above) followed by comma-separated (no spaces) table names to run only for the listed tables.

## Original repo
https://github.com/anelendata/target-bigquery

# About this project

This project is developed by
ANELEN and friends. Please check out ANELEN's
[open innovation philosophy and other projects](https://anelen.co/open-source.html).

![ANELEN](https://avatars.githubusercontent.com/u/13533307?s=400&u=a0d24a7330d55ce6db695c5572faf8f490c63898&v=4)
---

Copyright &copy; 2020~ Anelen Co., LLC



            
