# pipelinewise-tap-postgres
[![PyPI version](https://badge.fury.io/py/pipelinewise-tap-postgres.svg)](https://badge.fury.io/py/pipelinewise-tap-postgres)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pipelinewise-tap-postgres.svg)](https://pypi.org/project/pipelinewise-tap-postgres/)
[![License: MIT](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://opensource.org/licenses/GPL-3.0)
[Singer](https://www.singer.io/) tap that extracts data from a [PostgreSQL](https://www.postgresql.com/) database and produces JSON-formatted data following the [Singer spec](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md).
This is a [PipelineWise](https://transferwise.github.io/pipelinewise) compatible tap connector.
## How to use it
The recommended method of running this tap is to use it from [PipelineWise](https://transferwise.github.io/pipelinewise). When running it from PipelineWise you don't need to configure this tap with JSON files and most of things are automated. Please check the related documentation at [Tap Postgres](https://transferwise.github.io/pipelinewise/connectors/taps/postgres.html)
If you want to run this [Singer Tap](https://singer.io) independently please read further.
### Install and Run
First, make sure Python 3 is installed on your system or follow these
installation instructions for [Mac](http://docs.python-guide.org/en/latest/starting/install3/osx/) or
[Ubuntu](https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04).
It's recommended to use a virtualenv:
```bash
python3 -m venv venv
pip install pipelinewise-tap-postgres
```
or
```bash
make venv
```
### Create a config.json
```
{
"host": "localhost",
"port": 5432,
"user": "postgres",
"password": "secret",
"dbname": "db"
}
```
These are the same basic configuration properties used by the PostgreSQL command-line client (`psql`).
Full list of options in `config.json`:
| Property | Type | Required? | Default | Description |
|----------------------------|---------|----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host | String | Yes | - | PostgreSQL host |
| port | Integer | Yes | - | PostgreSQL port |
| user | String | Yes | - | PostgreSQL user |
| password | String | Yes | - | PostgreSQL password |
| dbname | String | Yes | - | PostgreSQL database name |
| filter_schemas | String | No | None | Comma separated schema names to scan only the required schemas to improve the performance of data extraction. |
| ssl | String | No | None | If set to `"true"` then use SSL via postgres sslmode `require` option. If the server does not accept SSL connections or the client certificate is not recognized the connection will fail. |
| logical_poll_total_seconds | Integer | No | 10800 | Stop running the tap when no data received from wal after certain number of seconds. |
| break_at_end_lsn | Boolean | No | true | Stop running the tap if the newly received lsn is after the max lsn that was detected when the tap started. |
| max_run_seconds | Integer | No | 43200 | Stop running the tap after certain number of seconds. |
| debug_lsn | String | No | None | If set to `"true"` then add `_sdc_lsn` property to the singer messages to debug postgres LSN position in the WAL stream. |
| tap_id | String | No | None | ID of the pipeline/tap |
| itersize | Integer | No | 20000 | Size of PG cursor iterator when doing INCREMENTAL or FULL_TABLE |
| default_replication_method | String | No | None | Default replication method to use when no one is provided in the catalog (Values: `LOG_BASED`, `INCREMENTAL` or `FULL_TABLE`) |
| use_secondary | Boolean | No | False | Use a database replica for `INCREMENTAL` and `FULL_TABLE` replication |
| secondary_host | String | No | - | PostgreSQL Replica host (required if `use_secondary` is `True`) |
| secondary_port | Integer | No | - | PostgreSQL Replica port (required if `use_secondary` is `True`) |
| limit | Integer | No | None | Adds a limit to INCREMENTAL queries to limit the number of records returns per run |
### Run the tap in Discovery Mode
```
tap-postgres --config config.json --discover # Should dump a Catalog to stdout
tap-postgres --config config.json --discover > catalog.json # Capture the Catalog
```
### Add Metadata to the Catalog
Each entry under the Catalog's "stream" key will need the following metadata:
```
{
"streams": [
{
"stream_name": "my_topic"
"metadata": [{
"breadcrumb": [],
"metadata": {
"selected": true,
"replication-method": "LOG_BASED",
}
}]
}
]
}
```
The replication method can be one of `FULL_TABLE`, `INCREMENTAL` or `LOG_BASED`.
**Note**: Log based replication requires a few adjustments in the source postgres database, please read further
for more information.
### Run the tap in Sync Mode
```
tap-postgres --config config.json --catalog catalog.json
```
The tap will write bookmarks to stdout which can be captured and passed as an optional `--state state.json` parameter
to the tap for the next sync.
### Log Based replication requirements
* PostgreSQL databases running **PostgreSQL versions 9.4.x or greater**. To avoid a critical PostgreSQL bug,
use at least one of the following minor versions:
- PostgreSQL 12.0
- PostgreSQL 11.2
- PostgreSQL 10.7
- PostgreSQL 9.6.12
- PostgreSQL 9.5.16
- PostgreSQL 9.4.21
* **A connection to the master instance**. Log-based replication will only work by connecting to the master instance.
* **wal2json plugin**: To use Log Based for your PostgreSQL integration, you must install the wal2json plugin version >= 2.3.
The wal2json plugin outputs JSON objects for logical decoding, which the tap then uses to perform Log-based Replication.
Steps for installing the plugin vary depending on your operating system. Instructions for each operating system type
are in the wal2json’s GitHub repository:
* [Unix-based operating systems](https://github.com/eulerto/wal2json#unix-based-operating-systems)
* [Windows](https://github.com/eulerto/wal2json#windows)
* **postgres config file**: Locate the database configuration file (usually `postgresql.conf`) and define
the parameters as follows:
```
wal_level=logical
max_replication_slots=5
max_wal_senders=5
```
Restart your PostgreSQL service to ensure the changes take effect.
**Note**: For `max_replication_slots` and `max_wal_senders`, we’re defaulting to a value of 5.
This should be sufficient unless you have a large number of read replicas connected to the master instance.
* **Existing replication slot**: Log based replication requires a dedicated logical replication slot.
In PostgreSQL, a logical replication slot represents a stream of database changes that can then be replayed to a
client in the order they were made on the original server. Each slot streams a sequence of changes from a single
database.
Login to the master instance as a superuser and using the `wal2json` plugin, create a logical replication slot:
```
SELECT *
FROM pg_create_logical_replication_slot('pipelinewise_<database_name>', 'wal2json');
```
**Note**: Replication slots are specific to a given database in a cluster. If you want to connect multiple
databases - whether in one integration or several - you’ll need to create a replication slot for each database.
### To run tests:
1. Install python test dependencies in a virtual env:
```
make venv
```
2. You need to have a postgres database to run the tests and export its credentials.
You can make use of the local docker-compose to spin up a test database by running `make start_db`
Test objects will be created in the `postgres` database.
3. To run the unit tests:
```
make unit_test
```
4. To run the integration tests:
```
make integration_test
```
### To run pylint:
Install python dependencies and run python linter
```
make venv
make pylint
```
Raw data
{
"_id": null,
"home_page": "https://github.com/OpenClassrooms/pipelinewise-tap-postgres",
"name": "oc-pipelinewise-tap-postgres",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "",
"author": "Wise",
"author_email": "",
"download_url": "",
"platform": null,
"description": "# pipelinewise-tap-postgres\n\n[![PyPI version](https://badge.fury.io/py/pipelinewise-tap-postgres.svg)](https://badge.fury.io/py/pipelinewise-tap-postgres)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pipelinewise-tap-postgres.svg)](https://pypi.org/project/pipelinewise-tap-postgres/)\n[![License: MIT](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://opensource.org/licenses/GPL-3.0)\n\n[Singer](https://www.singer.io/) tap that extracts data from a [PostgreSQL](https://www.postgresql.com/) database and produces JSON-formatted data following the [Singer spec](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md).\n\nThis is a [PipelineWise](https://transferwise.github.io/pipelinewise) compatible tap connector.\n\n## How to use it\n\nThe recommended method of running this tap is to use it from [PipelineWise](https://transferwise.github.io/pipelinewise). When running it from PipelineWise you don't need to configure this tap with JSON files and most of things are automated. Please check the related documentation at [Tap Postgres](https://transferwise.github.io/pipelinewise/connectors/taps/postgres.html)\n\nIf you want to run this [Singer Tap](https://singer.io) independently please read further.\n\n### Install and Run\n\nFirst, make sure Python 3 is installed on your system or follow these\ninstallation instructions for [Mac](http://docs.python-guide.org/en/latest/starting/install3/osx/) or\n[Ubuntu](https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04).\n\n\nIt's recommended to use a virtualenv:\n\n```bash\n python3 -m venv venv\n pip install pipelinewise-tap-postgres\n```\n\nor\n\n```bash\n make venv\n```\n\n### Create a config.json\n\n```\n{\n \"host\": \"localhost\",\n \"port\": 5432,\n \"user\": \"postgres\",\n \"password\": \"secret\",\n \"dbname\": \"db\"\n}\n```\n\nThese are the same basic configuration properties used by the PostgreSQL command-line client (`psql`).\n\nFull list of options in `config.json`:\n\n| Property | Type | Required? | Default | Description |\n|----------------------------|---------|----------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| host | String | Yes | - | PostgreSQL host |\n| port | Integer | Yes | - | PostgreSQL port |\n| user | String | Yes | - | PostgreSQL user |\n| password | String | Yes | - | PostgreSQL password |\n| dbname | String | Yes | - | PostgreSQL database name |\n| filter_schemas | String | No | None | Comma separated schema names to scan only the required schemas to improve the performance of data extraction. |\n| ssl | String | No | None | If set to `\"true\"` then use SSL via postgres sslmode `require` option. If the server does not accept SSL connections or the client certificate is not recognized the connection will fail. |\n| logical_poll_total_seconds | Integer | No | 10800 | Stop running the tap when no data received from wal after certain number of seconds. |\n| break_at_end_lsn | Boolean | No | true | Stop running the tap if the newly received lsn is after the max lsn that was detected when the tap started. |\n| max_run_seconds | Integer | No | 43200 | Stop running the tap after certain number of seconds. |\n| debug_lsn | String | No | None | If set to `\"true\"` then add `_sdc_lsn` property to the singer messages to debug postgres LSN position in the WAL stream. |\n| tap_id | String | No | None | ID of the pipeline/tap |\n| itersize | Integer | No | 20000 | Size of PG cursor iterator when doing INCREMENTAL or FULL_TABLE |\n| default_replication_method | String | No | None | Default replication method to use when no one is provided in the catalog (Values: `LOG_BASED`, `INCREMENTAL` or `FULL_TABLE`) |\n| use_secondary | Boolean | No | False | Use a database replica for `INCREMENTAL` and `FULL_TABLE` replication |\n| secondary_host | String | No | - | PostgreSQL Replica host (required if `use_secondary` is `True`) |\n| secondary_port | Integer | No | - | PostgreSQL Replica port (required if `use_secondary` is `True`) |\n| limit | Integer | No | None | Adds a limit to INCREMENTAL queries to limit the number of records returns per run |\n\n\n### Run the tap in Discovery Mode\n\n```\ntap-postgres --config config.json --discover # Should dump a Catalog to stdout\ntap-postgres --config config.json --discover > catalog.json # Capture the Catalog\n```\n\n### Add Metadata to the Catalog\n\nEach entry under the Catalog's \"stream\" key will need the following metadata:\n\n```\n{\n \"streams\": [\n {\n \"stream_name\": \"my_topic\"\n \"metadata\": [{\n \"breadcrumb\": [],\n \"metadata\": {\n \"selected\": true,\n \"replication-method\": \"LOG_BASED\",\n }\n }]\n }\n ]\n}\n```\n\nThe replication method can be one of `FULL_TABLE`, `INCREMENTAL` or `LOG_BASED`.\n\n**Note**: Log based replication requires a few adjustments in the source postgres database, please read further\nfor more information.\n\n### Run the tap in Sync Mode\n\n```\ntap-postgres --config config.json --catalog catalog.json\n```\n\nThe tap will write bookmarks to stdout which can be captured and passed as an optional `--state state.json` parameter\nto the tap for the next sync.\n\n### Log Based replication requirements\n\n* PostgreSQL databases running **PostgreSQL versions 9.4.x or greater**. To avoid a critical PostgreSQL bug,\n use at least one of the following minor versions:\n - PostgreSQL 12.0\n - PostgreSQL 11.2\n - PostgreSQL 10.7\n - PostgreSQL 9.6.12\n - PostgreSQL 9.5.16\n - PostgreSQL 9.4.21\n\n* **A connection to the master instance**. Log-based replication will only work by connecting to the master instance.\n\n* **wal2json plugin**: To use Log Based for your PostgreSQL integration, you must install the wal2json plugin version >= 2.3.\n The wal2json plugin outputs JSON objects for logical decoding, which the tap then uses to perform Log-based Replication.\n Steps for installing the plugin vary depending on your operating system. Instructions for each operating system type\n are in the wal2json\u2019s GitHub repository:\n\n * [Unix-based operating systems](https://github.com/eulerto/wal2json#unix-based-operating-systems)\n * [Windows](https://github.com/eulerto/wal2json#windows)\n\n\n* **postgres config file**: Locate the database configuration file (usually `postgresql.conf`) and define\n the parameters as follows:\n\n ```\n wal_level=logical\n max_replication_slots=5\n max_wal_senders=5\n ```\n\n Restart your PostgreSQL service to ensure the changes take effect.\n\n **Note**: For `max_replication_slots` and `max_wal_senders`, we\u2019re defaulting to a value of 5.\n This should be sufficient unless you have a large number of read replicas connected to the master instance.\n\n\n* **Existing replication slot**: Log based replication requires a dedicated logical replication slot.\n In PostgreSQL, a logical replication slot represents a stream of database changes that can then be replayed to a\n client in the order they were made on the original server. Each slot streams a sequence of changes from a single\n database.\n\n Login to the master instance as a superuser and using the `wal2json` plugin, create a logical replication slot:\n ```\n SELECT *\n FROM pg_create_logical_replication_slot('pipelinewise_<database_name>', 'wal2json');\n ```\n\n **Note**: Replication slots are specific to a given database in a cluster. If you want to connect multiple\n databases - whether in one integration or several - you\u2019ll need to create a replication slot for each database.\n\n### To run tests:\n\n1. Install python test dependencies in a virtual env:\n```\n make venv\n```\n\n2. You need to have a postgres database to run the tests and export its credentials.\n\nYou can make use of the local docker-compose to spin up a test database by running `make start_db`\n\nTest objects will be created in the `postgres` database.\n\n3. To run the unit tests:\n```\n make unit_test\n```\n\n4. To run the integration tests:\n```\n make integration_test\n```\n\n### To run pylint:\n\nInstall python dependencies and run python linter\n```\n make venv\n make pylint\n```\n",
"bugtrack_url": null,
"license": "",
"summary": "Singer.io tap for extracting data from PostgresSQL - PipelineWise compatible",
"version": "2.1.1",
"project_urls": {
"Homepage": "https://github.com/OpenClassrooms/pipelinewise-tap-postgres"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "4df77faeedc6582ac9288ac5853126bc2de2e957bf736bd85e7d5462d35beda4",
"md5": "1b9c6ff12c34f7d64242907c5ed4472e",
"sha256": "aba7a7f409244a92cb7f4c195d2ed0750f4d0ac7e4ed27a2a6a8f20f91d3f6eb"
},
"downloads": -1,
"filename": "oc_pipelinewise_tap_postgres-2.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1b9c6ff12c34f7d64242907c5ed4472e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 41915,
"upload_time": "2023-06-20T14:35:40",
"upload_time_iso_8601": "2023-06-20T14:35:40.140931Z",
"url": "https://files.pythonhosted.org/packages/4d/f7/7faeedc6582ac9288ac5853126bc2de2e957bf736bd85e7d5462d35beda4/oc_pipelinewise_tap_postgres-2.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-20 14:35:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "OpenClassrooms",
"github_project": "pipelinewise-tap-postgres",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "oc-pipelinewise-tap-postgres"
}