databricks-bridge


- Name: databricks-bridge
- Version: 0.0.4
- Summary: Databricks read and write with sql connection
- Upload time: 2024-03-25 14:26:21
- Author: Y-Tree (Saeed Falowo)
- Keywords: python, databricks, pyspark, sql, dataframe
            
Read data from and write data to Databricks tables over a SQL connection. Writes go through SQL insert statements, either written directly or generated from pandas or Spark dataframes.

## Requirements
Python 3.7 or above is required.

## Prerequisites
- Java
- Python
- PySpark
- pandas
- NumPy

Although installing this package also installs PySpark, pandas, and NumPy, the Spark environment isn't set up automatically.
The machine must be able to create a Spark session and create Spark and pandas dataframes.

To confirm that PySpark is running as expected, run the following Python script:
```
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("Databricks Bridge Test").enableHiveSupport().getOrCreate()
dict_data = [
    {"name": "Tom", "age": 20, "dob": "2000-10-31"},
    {"name": "Dick", "age": 21, "dob": "1999-10-30"},
    {"name": "Harry", "age": 22, "dob": "1998-10-29"}
]
spark_df = spark.createDataFrame(dict_data)
spark_df.show()

pd_df = pd.DataFrame(dict_data)
print(pd_df)
```

It should print:
```
+---+----------+-----+
|age|       dob| name|
+---+----------+-----+
| 20|2000-10-31|  Tom|
| 21|1999-10-30| Dick|
| 22|1998-10-29|Harry|
+---+----------+-----+
```

```
    name  age         dob
0    Tom   20  2000-10-31
1   Dick   21  1999-10-30
2  Harry   22  1998-10-29
```
If this runs without errors and both dataframes are printed to the console, PySpark and pandas are set up properly.

If not, install OpenJDK, since PySpark needs a Java runtime to start a Spark session.
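
Since Java is listed as a prerequisite, a quick way to see whether a Java runtime is reachable is to look for the `java` executable on the `PATH`. The sketch below is only an illustration and is not part of databricks-bridge; the helper name `java_version_output` is hypothetical. It also does not cover setups where Java is found only through the `JAVA_HOME` environment variable.

```python
import shutil
import subprocess


def java_version_output():
    """Return the text printed by `java -version`, or None if java is not on PATH.

    Hypothetical helper for illustration; not part of databricks-bridge.
    """
    java_path = shutil.which("java")
    if java_path is None:
        return None
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    # `java -version` usually writes to stderr, so fall back to stdout only if stderr is empty.
    return (result.stderr or result.stdout).strip()


output = java_version_output()
print(output if output else "No java executable found on PATH; consider installing OpenJDK.")
```

If the function returns `None`, installing OpenJDK and re-running the PySpark script above is the next step.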

## Usage
- Initialization
  - ```
    from databricks_bridge import Bridge
    bridge = Bridge(hostname="<host_id>.cloud.databricks.com", token="<token>")
    ```
- Run queries that return no data
  - ```
    bridge.execute_query("create database if not exists bridge_test_db;")
    bridge.execute_query("""
        create table if not exists bridge_test_db.students (
            name string,
            age int,
            dob date,
            last_active timestamp,
            reg_date date
        );""")
    ```
- Write to tables with a SQL insert statement
  - ```
    bridge.execute_query("""
        insert into bridge_test_db.students (age, name, dob, last_active, reg_date)
        values
            (18, 'Rachel', '1999-11-01', '2023-11-01 20:36:31.365375', '2023-11-01'),
            (19, 'Harriet', '1999-11-02', '2023-11-01 20:36:31.365375', '2022-11-01');
    """)
    ```
- Write pandas or Spark dataframes to Databricks tables
  - ```
    from datetime import datetime

    import pandas as pd

    new_data = [
        {"name": "Tom", "age": 20, "dob": "1999-10-31", "last_active": datetime.now(), "reg_date": datetime.today().date()},
        {"name": "Dick", "age": 21, "dob": "1999-10-30", "last_active": datetime.now(), "reg_date": datetime.today().date()},
        {"name": "Harry", "age": 22, "dob": "1999-10-29", "last_active": datetime.now(), "reg_date": datetime.today().date()}
    ]
    new_pd_df = pd.DataFrame(new_data)
    bridge.write_df_to_table(df=new_pd_df, target_table="bridge_test_db.students")

    new_spark_df = bridge.spark.createDataFrame(new_data)
    bridge.write_df_to_table(df=new_spark_df, target_table="bridge_test_db.students")
    ```
- Run queries that return a dataframe
  - ```
    pd_df, spark_schema = bridge.execute_query("select * from bridge_test_db.students")
    ```
- Convert the returned pandas dataframe to a Spark dataframe with an exact schema match
  - ```
    spark_df = bridge.to_spark_df(pd_df, spark_schema)
    ```
- Convert the returned pandas dataframe to a Spark dataframe without an exact schema match
  - ```
    spark_df = bridge.to_spark_df(pd_df)
    ```
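
The package description mentions converting dataframes into insert statements. The sketch below illustrates that general idea and is not the package's actual implementation. It works on plain row dictionaries (for a pandas dataframe, `df.to_dict("records")` produces this shape). The function names `sql_literal` and `build_insert` are hypothetical, and the backslash escaping assumes Spark SQL's default string-literal rules.

```python
from datetime import date, datetime


def sql_literal(value):
    """Render one Python value as a SQL literal (illustrative, not exhaustive)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, datetime):  # check datetime before date, since it is a subclass
        return "'" + value.isoformat(sep=" ") + "'"
    if isinstance(value, date):
        return "'" + value.isoformat() + "'"
    # Assumption: backslash escaping, as in Spark SQL's default string literals.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_insert(table, rows):
    """Build a single multi-row insert statement from a list of row dicts."""
    if not rows:
        raise ValueError("rows must not be empty")
    columns = list(rows[0])  # column order is taken from the first row
    value_rows = [
        "(" + ", ".join(sql_literal(row[col]) for col in columns) + ")"
        for row in rows
    ]
    return f"insert into {table} ({', '.join(columns)}) values {', '.join(value_rows)};"


statement = build_insert(
    "bridge_test_db.students",
    [{"name": "Tom", "age": 20, "dob": date(1999, 10, 31)}],
)
print(statement)
# insert into bridge_test_db.students (name, age, dob) values ('Tom', 20, '1999-10-31');
```

A statement built this way could then be passed to `bridge.execute_query`, as in the insert example above. For untrusted input, hand-built SQL strings remain risky even with escaping.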

            
