# GlassFlow ETL Python SDK
<p align="left">
<a target="_blank" href="https://pypi.python.org/pypi/glassflow">
<img src="https://img.shields.io/pypi/v/glassflow.svg?labelColor=&color=e69e3a">
</a>
<a target="_blank" href="https://github.com/glassflow/glassflow-python-sdk/blob/main/LICENSE">
<img src="https://img.shields.io/pypi/l/glassflow.svg?labelColor=&color=e69e3a">
</a>
<a target="_blank" href="https://pypi.python.org/pypi/glassflow">
<img src="https://img.shields.io/pypi/pyversions/glassflow.svg?labelColor=&color=e69e3a">
</a>
<br />
<a target="_blank" href="https://github.com/glassflow/glassflow-python-sdk/actions">
<img src="https://github.com/glassflow/glassflow-python-sdk/workflows/Test/badge.svg?labelColor=&color=e69e3a">
</a>
<!-- Pytest Coverage Comment:Begin -->
<img src="https://img.shields.io/badge/coverage-94%25-brightgreen">
<!-- Pytest Coverage Comment:End -->
</p>
A Python SDK for creating and managing data pipelines between Kafka and ClickHouse.
## Features
- Create and manage data pipelines between Kafka and ClickHouse
- Deduplication of events during a time window based on a key
- Temporal joins between topics based on a common key with a given time window
- Schema validation and configuration management
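
To make the deduplication feature more concrete, here is a minimal conceptual sketch of key-based deduplication over a time window. It is not GlassFlow's implementation: the `deduplicate` function, the `timestamp` field, and the assumption that events arrive in timestamp order are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def deduplicate(events, id_field, time_window):
    """Yield events whose key was not already kept within `time_window`.

    Each event is a dict holding a `timestamp` (datetime) and a key under
    `id_field`. Events are assumed to arrive in timestamp order.
    """
    last_kept = {}  # key -> timestamp of the most recent event kept for that key
    for event in events:
        key = event[id_field]
        ts = event["timestamp"]
        previous = last_kept.get(key)
        if previous is not None and ts - previous < time_window:
            continue  # same key seen inside the window: drop the duplicate
        last_kept[key] = ts
        yield event


start = datetime(2025, 1, 1, 12, 0)
events = [
    {"id": "a", "timestamp": start},
    {"id": "a", "timestamp": start + timedelta(minutes=30)},  # dropped
    {"id": "b", "timestamp": start + timedelta(minutes=45)},
    {"id": "a", "timestamp": start + timedelta(hours=2)},  # outside the window, kept
]
kept = list(deduplicate(events, id_field="id", time_window=timedelta(hours=1)))
```

In the pipeline configuration, the corresponding settings are the `id_field` and `time_window` entries of a topic's `deduplication` block.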
## Installation
```bash
pip install glassflow
```
## Quick Start
```python
from glassflow.etl import Pipeline

pipeline_config = {
    "pipeline_id": "test-pipeline",
    "source": {
        "type": "kafka",
        "provider": "aiven",
        "connection_params": {
            "brokers": ["localhost:9092"],
            "protocol": "SASL_SSL",
            "mechanism": "SCRAM-SHA-256",
            "username": "user",
            "password": "pass",
        },
        "topics": [
            {
                "consumer_group_initial_offset": "earliest",
                "id": "test-topic",
                "name": "test-topic",
                "schema": {
                    "type": "json",
                    "fields": [
                        {"name": "id", "type": "string"},
                        {"name": "email", "type": "string"},
                    ],
                },
                # Drop events that repeat the same "id" within one hour
                "deduplication": {
                    "id_field": "id",
                    "id_field_type": "string",
                    "time_window": "1h",
                    "enabled": True,
                },
            }
        ],
    },
    "sink": {
        "type": "clickhouse",
        "host": "localhost",
        "port": 8443,
        "database": "test",
        "username": "default",
        "password": "pass",
        # Map each event field to a ClickHouse column
        "table_mapping": [
            {
                "source_id": "test_table",
                "field_name": "id",
                "column_name": "user_id",
                "column_type": "UUID",
            },
            {
                "source_id": "test_table",
                "field_name": "email",
                "column_name": "email",
                "column_type": "String",
            },
        ],
    },
}

# Instantiate a Pipeline from the configuration dictionary
pipeline = Pipeline(pipeline_config)

# Create the pipeline
pipeline.create()
```
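
The Quick Start hard-codes the Kafka credentials for brevity. One way to keep them out of source code is to read them from the environment, as in the sketch below. The helper `kafka_connection_params` and the variable names `KAFKA_USERNAME` and `KAFKA_PASSWORD` are hypothetical and not part of the SDK; only the dictionary keys come from the example above.

```python
import os


def kafka_connection_params(brokers, env=os.environ):
    """Build a `connection_params` dict with credentials taken from `env`.

    KAFKA_USERNAME and KAFKA_PASSWORD are hypothetical names chosen for
    this sketch; the SDK does not define them.
    """
    required = ("KAFKA_USERNAME", "KAFKA_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "brokers": list(brokers),
        "protocol": "SASL_SSL",
        "mechanism": "SCRAM-SHA-256",
        "username": env["KAFKA_USERNAME"],
        "password": env["KAFKA_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment.
params = kafka_connection_params(
    ["localhost:9092"],
    env={"KAFKA_USERNAME": "user", "KAFKA_PASSWORD": "pass"},
)
```

The returned dictionary can be placed under `source.connection_params` in the pipeline configuration.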
## Pipeline Configuration
For detailed information about the pipeline configuration, see [GlassFlow docs](https://docs.glassflow.dev/pipeline/pipeline-configuration).
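
If the configuration is kept in a JSON file rather than a Python dictionary, it can be loaded with the standard library before being passed to `Pipeline`. The sketch below is an illustration under assumptions: `load_pipeline_config` is a hypothetical helper, and the key check only covers the three top-level keys that appear in the Quick Start example, which the SDK may or may not require in every case.

```python
import json
from pathlib import Path

# Top-level keys used in the Quick Start example (assumed here to be required).
REQUIRED_KEYS = ("pipeline_id", "source", "sink")


def load_pipeline_config(path):
    """Read a pipeline configuration from a JSON file and check its top-level keys."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"{path}: missing top-level keys: {', '.join(missing)}")
    return config
```

The result can then be used as in the Quick Start, for example `Pipeline(load_pipeline_config("pipeline.json"))`, where `pipeline.json` is a placeholder file name. Note that a JSON file spells booleans as `true` and `false`, not Python's `True` and `False`.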
## Tracking
The SDK includes anonymous usage tracking to help improve the product. Tracking is enabled by default but can be disabled in two ways:
1. Using an environment variable:
```bash
export GF_TRACKING_ENABLED=false
```
2. Programmatically using the `disable_tracking` method:
```python
pipeline = Pipeline(pipeline_config)
pipeline.disable_tracking()
```
Tracking collects anonymous information about:
- SDK version
- Platform (operating system)
- Python version
- Pipeline ID
- Whether joins or deduplication are enabled
- Kafka security protocol, authentication mechanism, and whether authentication is disabled
- Errors during pipeline creation and deletion
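
The environment-variable option described above can also be applied from within Python, which is convenient in notebooks or test suites. This is a sketch under one assumption: that the SDK reads `GF_TRACKING_ENABLED` when it is imported or when a pipeline is created, so the variable is set before either happens.

```python
import os

# Assumption: the SDK checks GF_TRACKING_ENABLED at import time or when a
# pipeline is created, so set it before importing glassflow.
os.environ["GF_TRACKING_ENABLED"] = "false"

# Import the SDK only after this point, for example:
# from glassflow.etl import Pipeline
```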
## Development
### Setup
1. Clone the repository
2. Create a virtual environment
3. Install dependencies:
```bash
uv venv
source .venv/bin/activate
# Quote the extras so shells such as zsh do not expand the brackets
uv pip install -e ".[dev]"
```
### Testing
```bash
pytest
```