cetl


| Field        | Value                                                                      |
|:-------------|:---------------------------------------------------------------------------|
| Name         | cetl                                                                       |
| Version      | 0.2.9                                                                      |
| Summary      | A basic data pipeline tool for data engineers handling CRM or loyalty data |
| Upload time  | 2023-06-20 03:42:10                                                        |
| Author       | Clement                                                                    |
| Keywords     | python, data pipeline, pipeline                                            |
| Requirements | No requirements were recorded.                                             |
            
### About CETL

CETL is a Python library that provides a comprehensive set of tools for building and managing data pipelines. It is designed to assist data engineers in handling Extract, Transform, and Load (ETL) tasks more effectively by simplifying the process and reducing the amount of manual labor involved.<br>

CETL is particularly useful for Python developers who work with data on a regular basis. It uses popular data containers such as pandas dataframes, JSON objects, and PySpark dataframes to provide a familiar interface for developers. This allows users to easily integrate CETL into their existing data pipelines and workflows.<br>

The library is intended to make the ETL process more straightforward by automating many of the technical details involved in data processing and movement. CETL includes a wide range of functions and tools for handling complex data formats, such as CSV, Excel, and JSON files, as well as for working with a variety of data sources, including databases, APIs, and cloud storage services.<br>

One of the key benefits of CETL is its ability to handle large datasets, making it suitable for use in high-performance data processing environments. CETL also includes features for data profiling, data validation, data transformation, and data mapping, allowing users to build sophisticated data pipelines that can handle a wide range of data processing tasks.<br>

Overall, CETL is a powerful data pipeline tool that can help data engineers to improve their productivity and streamline the ETL process. By providing a comprehensive set of functions and tools for working with data, CETL makes it easier to develop and maintain complex ETL pipelines, reducing the amount of time and effort required to manage data processing tasks.<br><br>

<br>

### User Guide

#### Example 1
`generateDataFrame` is a Python class that represents a transformation step in a data pipeline. It generates a dummy dataframe without reading data from a file, and its main purpose is to help developers test their data processing pipelines.<br>

With `generateDataFrame`, developers can quickly create test data that mimics the structure of their actual data. This is particularly useful when working with large datasets or when real data is not readily available, because the pipeline can be tested without relying on real data sources.<br>

It also helps when developers need to check how a pipeline handles different types of data and transformations, such as missing data, outliers, and formatting issues.<br>

Overall, `generateDataFrame` provides a quick way to exercise a pipeline and identify potential issues before deploying to production.<br>
```python
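from cetl import make_pipeline
from cetl.pandas_modules import generateDataFrame

# Build a one-step pipeline that only generates the dummy dataframe
pipe = make_pipeline(generateDataFrame())
# The empty string is a placeholder input; the step produces its own data
df = pipe.transform("")
print(df)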
```
|    |   customer_id | first_name   | last_name   | title   |
|---:|--------------:|:-------------|:------------|:--------|
|  0 |           111 | peter        | Hong        | Mr.     |
|  1 |           222 | YuCheung     | Wong        | Mr.     |
|  2 |           333 | Cindy        | Wong        | Mrs.    |

<br>
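
The idea of a generator step feeding a pipeline under test can be sketched in plain Python. This is a conceptual illustration only: `GenerateRecords`, `UpperCaseLastName`, and `SimplePipeline` are hypothetical names invented for this sketch, rows are plain dictionaries rather than pandas dataframes, and none of it is cetl's actual implementation.

```python
# Conceptual sketch only: these classes are hypothetical and are NOT cetl's code.

class GenerateRecords:
    """A source-like step: ignores its input and returns fixed dummy rows."""

    def transform(self, _input):
        return [
            {"customer_id": 111, "first_name": "peter", "last_name": "Hong", "title": "Mr."},
            {"customer_id": 222, "first_name": "YuCheung", "last_name": "Wong", "title": "Mr."},
            {"customer_id": 333, "first_name": "Cindy", "last_name": "Wong", "title": "Mrs."},
        ]


class UpperCaseLastName:
    """The step being tested: returns new rows with last_name upper-cased."""

    def transform(self, rows):
        return [{**row, "last_name": row["last_name"].upper()} for row in rows]


class SimplePipeline:
    """Runs each step in order, feeding one step's output to the next."""

    def __init__(self, *steps):
        self.steps = steps

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


pipe = SimplePipeline(GenerateRecords(), UpperCaseLastName())
rows = pipe.transform("")  # placeholder input; the first step ignores it
for row in rows:
    print(row)
```

Because the first step supplies its own rows, the downstream step can be checked without any file or database being available.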

#### Example 2
This example runs two `generateDataFrame` steps in parallel and stacks their outputs with `unionAll`. The repeated index values (0, 1, 2, 0, 1, 2) in the output show that each branch keeps its original row index.
```python
from cetl import build_pipeline
from cetl.pandas_modules import generateDataFrame, unionAll
from cetl.functional_modules import dummyStart, parallelTransformer

pipe = build_pipeline(
    dummyStart(),
    parallelTransformer([generateDataFrame(), generateDataFrame()]),
    unionAll(),
)
df = pipe.transform("")
print(df)
```
|    |   customer_id | first_name   | last_name   | title   |
|---:|--------------:|:-------------|:------------|:--------|
|  0 |           111 | peter        | Hong        | Mr.     |
|  1 |           222 | YuCheung     | Wong        | Mr.     |
|  2 |           333 | Cindy        | Wong        | Mrs.    |
|  0 |           111 | peter        | Hong        | Mr.     |
|  1 |           222 | YuCheung     | Wong        | Mr.     |
|  2 |           333 | Cindy        | Wong        | Mrs.    |


Alternatively, you can build the same pipeline by passing a JSON-style configuration dictionary to the `DataPipeline` object:
```python
from cetl import DataPipeline
cfg = {"pipeline": [
    {"type": "dummyStart", "module_type": "functional"},
    {"type": "parallelTransformer", "transformers": [
        {"type": "generateDataFrame"},
        {"type": "generateDataFrame"},
    ]},
    {"type": "unionAll"},
]}

pipe = DataPipeline(cfg)
df = pipe.transform("")
print(df)
```

|    |   customer_id | first_name   | last_name   | title   |
|---:|--------------:|:-------------|:------------|:--------|
|  0 |           111 | peter        | Hong        | Mr.     |
|  1 |           222 | YuCheung     | Wong        | Mr.     |
|  2 |           333 | Cindy        | Wong        | Mrs.    |
|  0 |           111 | peter        | Hong        | Mr.     |
|  1 |           222 | YuCheung     | Wong        | Mr.     |
|  2 |           333 | Cindy        | Wong        | Mrs.    |

<br>
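
One way to picture how a configuration like the one above could be turned into a running pipeline is a registry that maps each `"type"` string to a callable. The sketch below is a guess at the general pattern, not cetl's implementation: `REGISTRY`, `build_step`, `run_pipeline`, and the helper functions are hypothetical, and rows are plain dictionaries instead of pandas dataframes.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch only: every name below is hypothetical, not cetl's code.

def generate_rows(_input):
    """Return fixed dummy rows, ignoring the input."""
    return [{"customer_id": 111}, {"customer_id": 222}, {"customer_id": 333}]

def dummy_start(data):
    """Pass the input through unchanged."""
    return data

def union_all(list_of_row_lists):
    """Concatenate several row lists into one, preserving order."""
    return [row for rows in list_of_row_lists for row in rows]

# Registry mapping a config "type" string to a plain function.
REGISTRY = {
    "dummyStart": dummy_start,
    "generateDataFrame": generate_rows,
    "unionAll": union_all,
}

def build_step(step_cfg):
    """Turn one config entry into a callable step."""
    if step_cfg["type"] == "parallelTransformer":
        branches = [build_step(c) for c in step_cfg["transformers"]]

        def run_parallel(data):
            # Run every branch on the same input; map() keeps branch order.
            with ThreadPoolExecutor() as pool:
                return list(pool.map(lambda branch: branch(data), branches))

        return run_parallel
    return REGISTRY[step_cfg["type"]]

def run_pipeline(cfg, data):
    """Build the steps from the config and apply them in sequence."""
    for step in [build_step(c) for c in cfg["pipeline"]]:
        data = step(data)
    return data

cfg = {"pipeline": [
    {"type": "dummyStart"},
    {"type": "parallelTransformer", "transformers": [
        {"type": "generateDataFrame"},
        {"type": "generateDataFrame"},
    ]},
    {"type": "unionAll"},
]}

rows = run_pipeline(cfg, "")
print([row["customer_id"] for row in rows])  # [111, 222, 333, 111, 222, 333]
```

Keeping the step lookup in a dictionary means a new step type only needs one registry entry, which is one reason configuration-driven pipelines are often built this way.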

### Render the graph
Note: make sure the Graphviz executable is installed.<br>
Both a PNG file and an SVG file will be exported.
```python
pipe = pipe.build_digraph()
pipe.save_png("./sample.png")
```
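
Before rendering, it can help to confirm that Graphviz is actually reachable. The check below is a small sketch that assumes the rendering step relies on the `dot` executable being on `PATH`, as Graphviz-based tools commonly do; `graphviz_available` is a hypothetical helper name and is not part of cetl.

```python
import shutil


def graphviz_available():
    """Return True if the Graphviz `dot` executable is found on PATH."""
    return shutil.which("dot") is not None


if graphviz_available():
    print("Graphviz found at:", shutil.which("dot"))
else:
    print("Graphviz `dot` not found on PATH; install Graphviz before rendering.")
```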


#### sample.png
<img src="sample.png">

This version fixes the following issue: `UnboundLocalError: local variable 'pre_transformer_key' referenced before assignment`.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "cetl",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,data pipeline,pipeline",
    "author": "Clement",
    "author_email": "<cheukub@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/e4/db/e536547f76541f77818cfc8e4ac51d6a4bc7668273db52e0f7219d3f1463/cetl-0.2.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "A basic data pipeline tools for data engineer to handle the CRM or loyalty data",
    "version": "0.2.9",
    "project_urls": null,
    "split_keywords": [
        "python",
        "data pipeline",
        "pipeline"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2d1836a1770fee2a7b5c52e3a149293b2f06f081f93becbda816559d27b05add",
                "md5": "3d0f61498dec3fe5656448b1f024b76f",
                "sha256": "d00f0dc711c9e660ed244d2a449d2b1fc309221fd9b33a05c8b73ff618163613"
            },
            "downloads": -1,
            "filename": "cetl-0.2.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3d0f61498dec3fe5656448b1f024b76f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 58090,
            "upload_time": "2023-06-20T03:42:07",
            "upload_time_iso_8601": "2023-06-20T03:42:07.969790Z",
            "url": "https://files.pythonhosted.org/packages/2d/18/36a1770fee2a7b5c52e3a149293b2f06f081f93becbda816559d27b05add/cetl-0.2.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e4dbe536547f76541f77818cfc8e4ac51d6a4bc7668273db52e0f7219d3f1463",
                "md5": "9887d8f5dc525a4ee0fd3d08f68b29e1",
                "sha256": "5c28dddb70ce1ac9d528d801ef6b0626ff7deeba73f16139cc71c52dad74c0ca"
            },
            "downloads": -1,
            "filename": "cetl-0.2.9.tar.gz",
            "has_sig": false,
            "md5_digest": "9887d8f5dc525a4ee0fd3d08f68b29e1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 39570,
            "upload_time": "2023-06-20T03:42:10",
            "upload_time_iso_8601": "2023-06-20T03:42:10.183715Z",
            "url": "https://files.pythonhosted.org/packages/e4/db/e536547f76541f77818cfc8e4ac51d6a4bc7668273db52e0f7219d3f1463/cetl-0.2.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-20 03:42:10",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "cetl"
}
        