pipelinewise-tap-postgres


Namepipelinewise-tap-postgres JSON
Version 2.1.0 PyPI version JSON
download
home_pagehttps://github.com/transferwise/pipelinewise-tap-postgres
SummarySinger.io tap for extracting data from PostgresSQL - PipelineWise compatible
upload_time2023-05-17 10:31:36
maintainer
docs_urlNone
authorWise
requires_python>=3.7,<3.10
license
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # pipelinewise-tap-postgres

[![PyPI version](https://badge.fury.io/py/pipelinewise-tap-postgres.svg)](https://badge.fury.io/py/pipelinewise-tap-postgres)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pipelinewise-tap-postgres.svg)](https://pypi.org/project/pipelinewise-tap-postgres/)
[![License: MIT](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://opensource.org/licenses/GPL-3.0)

[Singer](https://www.singer.io/) tap that extracts data from a [PostgreSQL](https://www.postgresql.com/) database and produces JSON-formatted data following the [Singer spec](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md).

This is a [PipelineWise](https://transferwise.github.io/pipelinewise) compatible tap connector.

## How to use it

The recommended method of running this tap is to use it from [PipelineWise](https://transferwise.github.io/pipelinewise). When running it from PipelineWise you don't need to configure this tap with JSON files and most of things are automated. Please check the related documentation at [Tap Postgres](https://transferwise.github.io/pipelinewise/connectors/taps/postgres.html)

If you want to run this [Singer Tap](https://singer.io) independently please read further.

### Install and Run

First, make sure Python 3 is installed on your system or follow these
installation instructions for [Mac](http://docs.python-guide.org/en/latest/starting/install3/osx/) or
[Ubuntu](https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04).


It's recommended to use a virtualenv:

```bash
  python3 -m venv venv
  pip install pipelinewise-tap-postgres
```

or

```bash
  make venv
```

### Create a config.json

```
{
  "host": "localhost",
  "port": 5432,
  "user": "postgres",
  "password": "secret",
  "dbname": "db"
}
```

These are the same basic configuration properties used by the PostgreSQL command-line client (`psql`).

Full list of options in `config.json`:

| Property                   | Type    | Required? | Default | Description                                                                                                                                                                                |
|----------------------------|---------|----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host                       | String  | Yes      | -       | PostgreSQL host                                                                                                                                                                            |
| port                       | Integer | Yes      | -       | PostgreSQL port                                                                                                                                                                            |
| user                       | String  | Yes      | -       | PostgreSQL user                                                                                                                                                                            |
| password                   | String  | Yes      | -       | PostgreSQL password                                                                                                                                                                        |
| dbname                     | String  | Yes      | -       | PostgreSQL database name                                                                                                                                                                   |
| filter_schemas             | String  | No       | None    | Comma separated schema names to scan only the required schemas to improve the performance of data extraction.                                                                              |
| ssl                        | String  | No       | None    | If set to `"true"` then use SSL via postgres sslmode `require` option. If the server does not accept SSL connections or the client certificate is not recognized the connection will fail. |
| logical_poll_total_seconds | Integer | No       | 10800   | Stop running the tap when no data received from wal after certain number of seconds.                                                                                                       |
| break_at_end_lsn           | Boolean | No       | true    | Stop running the tap if the newly received lsn is after the max lsn that was detected when the tap started.                                                                                |
| max_run_seconds            | Integer | No       | 43200   | Stop running the tap after certain number of seconds.                                                                                                                                      |
| debug_lsn                  | String  | No       | None    | If set to `"true"` then add `_sdc_lsn` property to the singer messages to debug postgres LSN position in the WAL stream.                                                                   |
| tap_id                     | String  | No       | None    | ID of the pipeline/tap                                                                                                                                                                     |
| itersize                   | Integer | No       | 20000   | Size of PG cursor iterator when doing INCREMENTAL or FULL_TABLE                                                                                                                            |
| default_replication_method | String  | No       | None    | Default replication method to use when no one is provided in the catalog (Values: `LOG_BASED`, `INCREMENTAL` or `FULL_TABLE`)                                                              |
| use_secondary              | Boolean | No       | False   | Use a database replica for `INCREMENTAL` and `FULL_TABLE` replication                                                                                                                      |
| secondary_host             | String  | No       | -       | PostgreSQL Replica host (required if `use_secondary` is `True`)                                                                                                                            |
| secondary_port             | Integer | No       | -       | PostgreSQL Replica port (required if `use_secondary` is `True`)                                                                                                                            |
| limit                      | Integer | No       | None    | Adds a limit to INCREMENTAL queries to limit the number of records returns per run                                                                                                         |


### Run the tap in Discovery Mode

```
tap-postgres --config config.json --discover                # Should dump a Catalog to stdout
tap-postgres --config config.json --discover > catalog.json # Capture the Catalog
```

### Add Metadata to the Catalog

Each entry under the Catalog's "stream" key will need the following metadata:

```
{
  "streams": [
    {
      "stream_name": "my_topic"
      "metadata": [{
        "breadcrumb": [],
        "metadata": {
          "selected": true,
          "replication-method": "LOG_BASED",
        }
      }]
    }
  ]
}
```

The replication method can be one of `FULL_TABLE`, `INCREMENTAL` or `LOG_BASED`.

**Note**: Log based replication requires a few adjustments in the source postgres database, please read further
for more information.

### Run the tap in Sync Mode

```
tap-postgres --config config.json --catalog catalog.json
```

The tap will write bookmarks to stdout which can be captured and passed as an optional `--state state.json` parameter
to the tap for the next sync.

### Log Based replication requirements

* PostgreSQL databases running **PostgreSQL versions 9.4.x or greater**. To avoid a critical PostgreSQL bug,
  use at least one of the following minor versions:
   - PostgreSQL 12.0
   - PostgreSQL 11.2
   - PostgreSQL 10.7
   - PostgreSQL 9.6.12
   - PostgreSQL 9.5.16
   - PostgreSQL 9.4.21

* **A connection to the master instance**. Log-based replication will only work by connecting to the master instance.

* **wal2json plugin**: To use Log Based for your PostgreSQL integration, you must install the wal2json plugin version >= 2.3.
  The wal2json plugin outputs JSON objects for logical decoding, which the tap then uses to perform Log-based Replication.
  Steps for installing the plugin vary depending on your operating system. Instructions for each operating system type
  are in the wal2json’s GitHub repository:

  * [Unix-based operating systems](https://github.com/eulerto/wal2json#unix-based-operating-systems)
  * [Windows](https://github.com/eulerto/wal2json#windows)


* **postgres config file**: Locate the database configuration file (usually `postgresql.conf`) and define
  the parameters as follows:

    ```
    wal_level=logical
    max_replication_slots=5
    max_wal_senders=5
    ```

    Restart your PostgreSQL service to ensure the changes take effect.

    **Note**: For `max_replication_slots` and `max_wal_senders`, we’re defaulting to a value of 5.
    This should be sufficient unless you have a large number of read replicas connected to the master instance.


* **Existing replication slot**: Log based replication requires a dedicated logical replication slot.
  In PostgreSQL, a logical replication slot represents a stream of database changes that can then be replayed to a
  client in the order they were made on the original server. Each slot streams a sequence of changes from a single
  database.

  Login to the master instance as a superuser and using the `wal2json` plugin, create a logical replication slot:
  ```
    SELECT *
    FROM pg_create_logical_replication_slot('pipelinewise_<database_name>', 'wal2json');
  ```

  **Note**: Replication slots are specific to a given database in a cluster. If you want to connect multiple
  databases - whether in one integration or several - you’ll need to create a replication slot for each database.

### To run tests:

1. Install python test dependencies in a virtual env:
```
 make venv
```

2. You need to have a postgres database to run the tests and export its credentials.

You can make use of the local docker-compose to spin up a test database by running `make start_db`

Test objects will be created in the `postgres` database.

3. To run the unit tests:
```
  make unit_test
```

4. To run the integration tests:
```
  make integration_test
```

### To run pylint:

Install python dependencies and run python linter
```
  make venv
  make pylint
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/transferwise/pipelinewise-tap-postgres",
    "name": "pipelinewise-tap-postgres",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7,<3.10",
    "maintainer_email": "",
    "keywords": "",
    "author": "Wise",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/c2/bf/b8924b58bfefca08a5b1639aea1a79355dc8bb9b3e4d0707730edc143b3f/pipelinewise-tap-postgres-2.1.0.tar.gz",
    "platform": null,
    "description": "# pipelinewise-tap-postgres\n\n[![PyPI version](https://badge.fury.io/py/pipelinewise-tap-postgres.svg)](https://badge.fury.io/py/pipelinewise-tap-postgres)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pipelinewise-tap-postgres.svg)](https://pypi.org/project/pipelinewise-tap-postgres/)\n[![License: MIT](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://opensource.org/licenses/GPL-3.0)\n\n[Singer](https://www.singer.io/) tap that extracts data from a [PostgreSQL](https://www.postgresql.com/) database and produces JSON-formatted data following the [Singer spec](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md).\n\nThis is a [PipelineWise](https://transferwise.github.io/pipelinewise) compatible tap connector.\n\n## How to use it\n\nThe recommended method of running this tap is to use it from [PipelineWise](https://transferwise.github.io/pipelinewise). When running it from PipelineWise you don't need to configure this tap with JSON files and most of things are automated. Please check the related documentation at [Tap Postgres](https://transferwise.github.io/pipelinewise/connectors/taps/postgres.html)\n\nIf you want to run this [Singer Tap](https://singer.io) independently please read further.\n\n### Install and Run\n\nFirst, make sure Python 3 is installed on your system or follow these\ninstallation instructions for [Mac](http://docs.python-guide.org/en/latest/starting/install3/osx/) or\n[Ubuntu](https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04).\n\n\nIt's recommended to use a virtualenv:\n\n```bash\n  python3 -m venv venv\n  pip install pipelinewise-tap-postgres\n```\n\nor\n\n```bash\n  make venv\n```\n\n### Create a config.json\n\n```\n{\n  \"host\": \"localhost\",\n  \"port\": 5432,\n  \"user\": \"postgres\",\n  \"password\": \"secret\",\n  \"dbname\": \"db\"\n}\n```\n\nThese are the same basic configuration properties used by the PostgreSQL command-line client (`psql`).\n\nFull list of options in `config.json`:\n\n| Property                   | Type    | Required? | Default | Description                                                                                                                                                                                |\n|----------------------------|---------|----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| host                       | String  | Yes      | -       | PostgreSQL host                                                                                                                                                                            |\n| port                       | Integer | Yes      | -       | PostgreSQL port                                                                                                                                                                            |\n| user                       | String  | Yes      | -       | PostgreSQL user                                                                                                                                                                            |\n| password                   | String  | Yes      | -       | PostgreSQL password                                                                                                                                                                        |\n| dbname                     | String  | Yes      | -       | PostgreSQL database name                                                                                                                                                                   |\n| filter_schemas             | String  | No       | None    | Comma separated schema names to scan only the required schemas to improve the performance of data extraction.                                                                              |\n| ssl                        | String  | No       | None    | If set to `\"true\"` then use SSL via postgres sslmode `require` option. If the server does not accept SSL connections or the client certificate is not recognized the connection will fail. |\n| logical_poll_total_seconds | Integer | No       | 10800   | Stop running the tap when no data received from wal after certain number of seconds.                                                                                                       |\n| break_at_end_lsn           | Boolean | No       | true    | Stop running the tap if the newly received lsn is after the max lsn that was detected when the tap started.                                                                                |\n| max_run_seconds            | Integer | No       | 43200   | Stop running the tap after certain number of seconds.                                                                                                                                      |\n| debug_lsn                  | String  | No       | None    | If set to `\"true\"` then add `_sdc_lsn` property to the singer messages to debug postgres LSN position in the WAL stream.                                                                   |\n| tap_id                     | String  | No       | None    | ID of the pipeline/tap                                                                                                                                                                     |\n| itersize                   | Integer | No       | 20000   | Size of PG cursor iterator when doing INCREMENTAL or FULL_TABLE                                                                                                                            |\n| default_replication_method | String  | No       | None    | Default replication method to use when no one is provided in the catalog (Values: `LOG_BASED`, `INCREMENTAL` or `FULL_TABLE`)                                                              |\n| use_secondary              | Boolean | No       | False   | Use a database replica for `INCREMENTAL` and `FULL_TABLE` replication                                                                                                                      |\n| secondary_host             | String  | No       | -       | PostgreSQL Replica host (required if `use_secondary` is `True`)                                                                                                                            |\n| secondary_port             | Integer | No       | -       | PostgreSQL Replica port (required if `use_secondary` is `True`)                                                                                                                            |\n| limit                      | Integer | No       | None    | Adds a limit to INCREMENTAL queries to limit the number of records returns per run                                                                                                         |\n\n\n### Run the tap in Discovery Mode\n\n```\ntap-postgres --config config.json --discover                # Should dump a Catalog to stdout\ntap-postgres --config config.json --discover > catalog.json # Capture the Catalog\n```\n\n### Add Metadata to the Catalog\n\nEach entry under the Catalog's \"stream\" key will need the following metadata:\n\n```\n{\n  \"streams\": [\n    {\n      \"stream_name\": \"my_topic\"\n      \"metadata\": [{\n        \"breadcrumb\": [],\n        \"metadata\": {\n          \"selected\": true,\n          \"replication-method\": \"LOG_BASED\",\n        }\n      }]\n    }\n  ]\n}\n```\n\nThe replication method can be one of `FULL_TABLE`, `INCREMENTAL` or `LOG_BASED`.\n\n**Note**: Log based replication requires a few adjustments in the source postgres database, please read further\nfor more information.\n\n### Run the tap in Sync Mode\n\n```\ntap-postgres --config config.json --catalog catalog.json\n```\n\nThe tap will write bookmarks to stdout which can be captured and passed as an optional `--state state.json` parameter\nto the tap for the next sync.\n\n### Log Based replication requirements\n\n* PostgreSQL databases running **PostgreSQL versions 9.4.x or greater**. To avoid a critical PostgreSQL bug,\n  use at least one of the following minor versions:\n   - PostgreSQL 12.0\n   - PostgreSQL 11.2\n   - PostgreSQL 10.7\n   - PostgreSQL 9.6.12\n   - PostgreSQL 9.5.16\n   - PostgreSQL 9.4.21\n\n* **A connection to the master instance**. Log-based replication will only work by connecting to the master instance.\n\n* **wal2json plugin**: To use Log Based for your PostgreSQL integration, you must install the wal2json plugin version >= 2.3.\n  The wal2json plugin outputs JSON objects for logical decoding, which the tap then uses to perform Log-based Replication.\n  Steps for installing the plugin vary depending on your operating system. Instructions for each operating system type\n  are in the wal2json\u2019s GitHub repository:\n\n  * [Unix-based operating systems](https://github.com/eulerto/wal2json#unix-based-operating-systems)\n  * [Windows](https://github.com/eulerto/wal2json#windows)\n\n\n* **postgres config file**: Locate the database configuration file (usually `postgresql.conf`) and define\n  the parameters as follows:\n\n    ```\n    wal_level=logical\n    max_replication_slots=5\n    max_wal_senders=5\n    ```\n\n    Restart your PostgreSQL service to ensure the changes take effect.\n\n    **Note**: For `max_replication_slots` and `max_wal_senders`, we\u2019re defaulting to a value of 5.\n    This should be sufficient unless you have a large number of read replicas connected to the master instance.\n\n\n* **Existing replication slot**: Log based replication requires a dedicated logical replication slot.\n  In PostgreSQL, a logical replication slot represents a stream of database changes that can then be replayed to a\n  client in the order they were made on the original server. Each slot streams a sequence of changes from a single\n  database.\n\n  Login to the master instance as a superuser and using the `wal2json` plugin, create a logical replication slot:\n  ```\n    SELECT *\n    FROM pg_create_logical_replication_slot('pipelinewise_<database_name>', 'wal2json');\n  ```\n\n  **Note**: Replication slots are specific to a given database in a cluster. If you want to connect multiple\n  databases - whether in one integration or several - you\u2019ll need to create a replication slot for each database.\n\n### To run tests:\n\n1. Install python test dependencies in a virtual env:\n```\n make venv\n```\n\n2. You need to have a postgres database to run the tests and export its credentials.\n\nYou can make use of the local docker-compose to spin up a test database by running `make start_db`\n\nTest objects will be created in the `postgres` database.\n\n3. To run the unit tests:\n```\n  make unit_test\n```\n\n4. To run the integration tests:\n```\n  make integration_test\n```\n\n### To run pylint:\n\nInstall python dependencies and run python linter\n```\n  make venv\n  make pylint\n```\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Singer.io tap for extracting data from PostgresSQL - PipelineWise compatible",
    "version": "2.1.0",
    "project_urls": {
        "Homepage": "https://github.com/transferwise/pipelinewise-tap-postgres"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1998fa3bbfae78f95aacb9cc955c384113361534ef022b6d98b90876926f6592",
                "md5": "2aeb5fc3c01728b6bd5f4a58d51ae189",
                "sha256": "8e06345ad6056a640f79b6cf0663705e50d76a15f640142a6b20b24565d133e7"
            },
            "downloads": -1,
            "filename": "pipelinewise_tap_postgres-2.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2aeb5fc3c01728b6bd5f4a58d51ae189",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7,<3.10",
            "size": 41871,
            "upload_time": "2023-05-17T10:31:34",
            "upload_time_iso_8601": "2023-05-17T10:31:34.396890Z",
            "url": "https://files.pythonhosted.org/packages/19/98/fa3bbfae78f95aacb9cc955c384113361534ef022b6d98b90876926f6592/pipelinewise_tap_postgres-2.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c2bfb8924b58bfefca08a5b1639aea1a79355dc8bb9b3e4d0707730edc143b3f",
                "md5": "89866c6949674be0b1dbd388dc617b3e",
                "sha256": "aa0b2cd7bb7c897497a28251a39657f67452a2ea90bfa21aaf1a73f565679c50"
            },
            "downloads": -1,
            "filename": "pipelinewise-tap-postgres-2.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "89866c6949674be0b1dbd388dc617b3e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7,<3.10",
            "size": 36785,
            "upload_time": "2023-05-17T10:31:36",
            "upload_time_iso_8601": "2023-05-17T10:31:36.627777Z",
            "url": "https://files.pythonhosted.org/packages/c2/bf/b8924b58bfefca08a5b1639aea1a79355dc8bb9b3e4d0707730edc143b3f/pipelinewise-tap-postgres-2.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-17 10:31:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "transferwise",
    "github_project": "pipelinewise-tap-postgres",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pipelinewise-tap-postgres"
}
        
Elapsed time: 1.61918s