# hyperleaup

- **Name:** hyperleaup
- **Version:** 0.1.2
- **Summary:** Create and publish Tableau Hyper files from Apache Spark DataFrames and Spark SQL.
- **Homepage:** https://github.com/goodwillpunning/hyperleaup
- **Requires Python:** >=3.6
- **Keywords:** spark, tableau, extract, hyper
- **Uploaded:** 2023-09-06 19:54:16
Pronounced "hyper-loop". Create and publish Tableau Hyper files from Apache Spark DataFrames or Spark SQL.

## Why are data extracts _so slow_?
Tableau Data Extracts can take hours to create and publish to a Tableau Server.
Sometimes this means waiting around most of the day for the data extract to complete.
What a waste of time! In addition, the Tableau Backgrounder (the Tableau Server job scheduler)
becomes a single point of failure as more refresh jobs are scheduled and long running jobs exhaust the server’s resources.

![Data Extract Current Workflow](images/data-extracts-current.png)

## How hyperleaup helps
Rather than pulling data from the source over an ODBC connection, `hyperleaup` can write data directly to a Hyper file
and publish final Hyper files to a Tableau Server. Best of all, you can take advantage of all the benefits of 
Apache Spark + Tableau Hyper API:
- perform efficient CDC upserts
- distributed read/write/transformations from multiple sources
- execute SQL directly

`hyperleaup` allows you to create repeatable data extracts that can be scheduled to run at a regular frequency,
or even incorporated as the final step in an ETL pipeline, e.g. refreshing a data extract with the latest CDC.

## Getting Started
A list of usage examples is available in the `demo` folder of this repo as a [Databricks Notebook Archive (DBC)](demo/Hyperleaup-Demo.dbc) or an [IPython Notebook](demo/Hyperleaup-Demo.ipynb).

## Example usage
The following code snippet creates a Tableau Hyper file from a Spark SQL statement and publishes it as a datasource to a Tableau Server.

```python
from hyperleaup import HyperFile

# Step 1: Create a Hyper File from Spark SQL
query = """
select *
  from transaction_history
 where action_date > '2015-01-01'
"""

hf = HyperFile(name="transaction_history", sql=query, is_dbfs_enabled=True)

# Step 2: Publish Hyper File to a Tableau Server
hf.publish(tableau_server_url,
           username,
           password,
           site_name,
           project_name,
           datasource_name)

# Step 3: Append new data
# ('last_publish_date' below is a placeholder for your own watermark value)
new_data = """
select *
  from transaction_history
 where action_date > last_publish_date
"""
hf.append(sql=new_data)
```
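The Step 3 watermark pattern above can also be sketched as a small helper that renders the incremental query for a concrete date. `incremental_query` is a hypothetical illustration, not a hyperleaup API; hyperleaup simply takes whatever SQL string you pass it:

```python
from datetime import date

def incremental_query(table: str, watermark: date) -> str:
    """Render an append query for rows newer than the last publish date.

    Hypothetical helper for illustration only -- not part of hyperleaup.
    """
    return (
        f"select * from {table} "
        f"where action_date > '{watermark.isoformat()}'"
    )

print(incremental_query("transaction_history", date(2015, 1, 1)))
# select * from transaction_history where action_date > '2015-01-01'
```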

## Creation Mode
There are several options for how the Hyper file is created, set via the `creation_mode` argument when initializing a `HyperFile` instance. The default is `PARQUET`.

| Mode | Description | Data Size |
| --- | --- | --- |
| PARQUET | Saves data to a single Parquet file then copies to Hyper file. | MEDIUM |
| COPY | Saves data to CSV format then copies to Hyper file. | MEDIUM |
| INSERT | Reads data into memory; more forgiving for null values. | SMALL |
| LARGEFILE | Saves data to multiple Parquet files then copies to Hyper file. | LARGE |
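
The size guidance in the table can be read as a simple decision rule. The sketch below is illustrative only: the row-count thresholds are invented for this example, and hyperleaup itself does not auto-select a mode (the choice of `creation_mode` is left to the caller; `COPY` is a CSV-based alternative in the same medium range as `PARQUET`).

```python
def pick_creation_mode(row_count: int) -> str:
    """Illustrative heuristic for choosing a creation_mode by data size.

    The thresholds are made up for demonstration; hyperleaup does not
    auto-select a mode.
    """
    if row_count < 100_000:
        return "INSERT"      # small data: in-memory insert, forgiving of nulls
    if row_count < 50_000_000:
        return "PARQUET"     # medium data: single Parquet file (the default)
    return "LARGEFILE"       # large data: multiple Parquet files

print(pick_creation_mode(5_000))      # INSERT
print(pick_creation_mode(1_000_000))  # PARQUET
```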


Example of setting creation mode:  
`hf = HyperFile(name="transaction_history", sql=query, is_dbfs_enabled=True, creation_mode="PARQUET")`

## Hyper File Options
There is an optional `HyperFileConfig` that can be used to change default behaviors.
  - `timestamp_with_timezone`:
    - If `True`, use the timestamptz datatype in the Hyper file. Recommended when using timestamp values with the PARQUET creation mode. (default: `False`)
  - `allow_nulls`:
    - If `True`, skip the default behavior of replacing null numeric and string values with non-null placeholders. (default: `False`)
  - `convert_decimal_precision`:
    - If `True`, automatically reduce decimals with precision over 18 down to 18. This risks data truncation. (default: `False`)
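
To see what `convert_decimal_precision` implies, here is a minimal pure-Python sketch of truncating a value to 18 digits of total precision. This is illustrative only: hyperleaup applies the conversion to Spark `DecimalType` columns, and `truncate_precision` is a hypothetical helper, not part of the library.

```python
from decimal import Decimal, ROUND_DOWN

MAX_PRECISION = 18  # the cap applied by the convert_decimal_precision option

def truncate_precision(value: Decimal, max_precision: int = MAX_PRECISION) -> Decimal:
    """Drop least-significant fractional digits until total precision fits."""
    sign, digits, exponent = value.as_tuple()
    excess = len(digits) - max_precision
    if excess <= 0:
        return value               # already within the precision limit
    scale = -exponent              # digits after the decimal point
    drop = min(excess, scale)      # only fractional digits can be dropped
    return value.quantize(Decimal(1).scaleb(drop - scale), rounding=ROUND_DOWN)

# 21 digits of precision truncated down to 18 -- the "risk of data truncation"
print(truncate_precision(Decimal("12345678901234.5678901")))  # 12345678901234.5678
```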


### Example using configs
```python
from hyperleaup import HyperFile, HyperFileConfig

hf_config = HyperFileConfig(
              timestamp_with_timezone=True, 
              allow_nulls=False,
              convert_decimal_precision=False)

# pass the config to the HyperFile instance
hf = HyperFile(name="transaction_history", sql=query, is_dbfs_enabled=True, config=hf_config)
```

## Legal Information
This software is provided **as-is** and is not officially supported by Databricks through customer technical support channels.
Support, questions, and feature requests can be submitted through the Issues page of this repo.
Please understand that issues with the use of this code will not be answered or investigated by Databricks Support.  

## Core Contribution team
* Lead Developer: [Will Girten](https://www.linkedin.com/in/willgirten/), Lead SSA @Databricks
* Puru Shrestha, Sr. BI Developer

## Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, 
and are not formally supported by Databricks with Service Level Agreements (SLAs).  
They are provided AS-IS and we do not make any guarantees of any kind.  
Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo.  
They will be reviewed as time permits, but there are no formal SLAs for support.


## Building the Project
To build the project:
```shell
python3 -m build
```

## Running Pytests
To run tests on the project:
```shell
cd tests
python test_hyper_file.py
python test_creator.py
```

            
