scrapy-bigquery

- Name: scrapy-bigquery
- Version: 1.0.15
- Home page: https://github.com/8W9aG/scrapy-bigquery
- Summary: Scrapy pipeline to store items into BigQuery
- Upload time: 2023-05-04 00:48:11
- Author: Will Sackfield
- License: MIT
- Keywords: scrapy, pipeline, bigquery
# scrapy-bigquery

<a href="https://pypi.org/project/scrapy-bigquery/">
    <img alt="PyPi" src="https://img.shields.io/pypi/v/scrapy-bigquery">
</a>

A Scrapy pipeline to store items in [Google BigQuery](https://cloud.google.com/bigquery/).

## Dependencies :globe_with_meridians:

- [Python 3.7](https://www.python.org/downloads/release/python-370/)
- [Scrapy 2.4.0](https://scrapy.org/)
- [Google Cloud Bigquery 2.23.2](https://pypi.org/project/google-cloud-bigquery/)
- [Bigquery Schema Generator 1.4](https://github.com/bxparks/bigquery-schema-generator)

## Installation :inbox_tray:

This is a Python package hosted on PyPI; to install it, run:

`pip install scrapy-bigquery`

## Settings

### BIGQUERY_DATASET (Required)

The name of the BigQuery dataset to post to.

### BIGQUERY_TABLE (Required)

The name of the BigQuery table in the dataset to post to.

### BIGQUERY_SERVICE_ACCOUNT (Required)

The base64-encoded JSON of the [Google Service Account](https://cloud.google.com/iam/docs/service-accounts) used to authenticate with Google BigQuery. You can generate it from a service-account file like so:

`cat service-account.json | jq . -c | base64`
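If you prefer to stay in Python (for example, when generating settings programmatically), the same encoding can be produced with the standard library alone; this is an equivalent sketch, not part of the package itself:

```python
import base64
import json


def encode_service_account(path: str) -> str:
    """Compact and base64-encode a service-account JSON file,
    mirroring `cat service-account.json | jq . -c | base64`."""
    with open(path, "r", encoding="utf-8") as f:
        creds = json.load(f)
    # separators=(",", ":") gives the same compact form as `jq . -c`
    compact = json.dumps(creds, separators=(",", ":"))
    return base64.b64encode(compact.encode("utf-8")).decode("ascii")
```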

### BIGQUERY_ADD_SCRAPED_TIME (Optional)

Whether to add the time the item was scraped to the item when posting it to BigQuery. This will add the current datetime to the `scraped_time` column in the BigQuery table.

### BIGQUERY_ADD_SCRAPER_NAME (Optional)

Whether to add the name of the scraper to the item when posting it to BigQuery. This will add the scraper's name to the `scraper` column in the BigQuery table.

### BIGQUERY_ADD_SCRAPER_SESSION (Optional)

Whether to add the session ID of the scraper to the item when posting it to BigQuery. This will add the scraper's session ID to the `scraper_session_id` column in the BigQuery table.

### BIGQUERY_ITEM_BATCH (Optional)

The number of items to batch together when inserting into BigQuery. The higher this number, the faster the pipeline will process items.
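Conceptually, batching just groups consecutive items into fixed-size chunks before each insert. A minimal illustrative sketch (not the package's actual internals):

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` items, in arrival order."""
    buffer: List[dict] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= batch_size:
            yield buffer
            buffer = []
    if buffer:
        # Flush the final, possibly short, batch.
        yield buffer
```

Larger batches mean fewer round-trips to the BigQuery API, which is why raising this setting speeds up the pipeline.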

### BIGQUERY_FIELDS_TO_SAVE (Optional)

A list of item fields to save to BigQuery. If this is not set, all fields of an item will be saved.
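The filtering behaviour can be pictured as a simple dictionary comprehension; this helper is illustrative only, not the package's actual code:

```python
def select_fields(item: dict, fields_to_save=None) -> dict:
    """Keep only the configured fields; with no setting, keep everything."""
    if not fields_to_save:
        return dict(item)
    return {k: v for k, v in item.items() if k in fields_to_save}
```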

## Usage example :eyes:

To use this plugin, add the following settings to your Scrapy project and substitute your own values:

```python
BIGQUERY_DATASET = "my-dataset"
BIGQUERY_TABLE = "my-table"
BIGQUERY_SERVICE_ACCOUNT = "eyJ0eX=="
ITEM_PIPELINES = {
    "bigquerypipeline.pipelines.BigQueryPipeline": 301
}
BIGQUERY_FIELDS_TO_SAVE = ["name", "age"]  # Optional: save only these item fields to BigQuery.
```

The pipeline will attempt to create the dataset/table if none exist, inferring the schema from the dictionaries it processes. Be aware that this inference can be flaky (especially if the dictionaries contain nulls), so it is recommended that you create the table before running.
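For example, a table with a known schema can be created ahead of time with the `bq` command-line tool (the dataset, table, and column names here are placeholders for your own):

```shell
bq mk --table my-dataset.my-table name:STRING,age:INTEGER,scraped_time:TIMESTAMP
```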

If you want to send a specific item to a different table, add the keys `BIGQUERY_DATASET` and `BIGQUERY_TABLE` to the item you pass back to the pipeline. This overrides where the item is posted, allowing a single scraper to handle more than one item type. These keys/values will not be part of the final row in the table.
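The routing described above can be sketched as follows; the helper name and signature are hypothetical, shown only to make the override/strip behaviour concrete:

```python
ROUTING_KEYS = ("BIGQUERY_DATASET", "BIGQUERY_TABLE")


def route_item(item: dict, default_dataset: str, default_table: str):
    """Resolve an item's destination, then strip the routing keys
    so they never appear as columns in the final row."""
    dataset = item.get("BIGQUERY_DATASET", default_dataset)
    table = item.get("BIGQUERY_TABLE", default_table)
    row = {k: v for k, v in item.items() if k not in ROUTING_KEYS}
    return dataset, table, row
```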

## License :memo:

The project is available under the [MIT License](LICENSE).
