# pandashift
![PyPI](https://img.shields.io/pypi/v/pandashift)
![PyPI - Downloads](https://img.shields.io/pypi/dm/pandashift)
## Overview
Pandashift integrates pandas with Amazon Redshift for smooth ETL processes and data manipulation. Designed specifically for Redshift in Python, it simplifies workflows by providing easy bulk inserts and automatic type handling, including conversion of pandas dtypes to Redshift column types. Data can be pulled from Redshift into a pandas DataFrame with `read_query` for quick analysis and processing in Python. Whether you are creating DataFrames from Redshift or performing ETL tasks, combining pandas with Pandashift gives you a flexible, efficient base for scalable data workflows.
## Installing
``` shell
pip install pandashift
```
## Usage
There are two ways to work with this package:

1. Setting up environment variables
2. Passing credentials on every call

Below are examples of both approaches.
### With environment variables set up
``` python
from pandashift import read_query, execute_query, load_df, create_table_from_df
# Read data
df = read_query('SELECT * FROM public.test')

# Execute a statement (e.g. table create/drop)
execute_query('DROP TABLE public.test')

# Create the table from the df
create_table_from_df(df, 'public.test')

# Load the dataframe
load_df(df, table_name='public.test')
```
### No environment variables
``` python
from pandashift import read_query, execute_query, load_df, create_table_from_df
creds = {
    "host": "YOUR HOST",
    "port": "YOUR PORT",
    "dbname": "YOUR DATABASE",
    "user": "YOUR USER",
    "password": "YOUR PASSWORD"
}
# Read data
df = read_query('SELECT * FROM public.test', credentials=creds)

# Execute a statement (e.g. table create/drop)
execute_query('DROP TABLE public.test', credentials=creds)

# Create the table from the df
create_table_from_df(df, 'public.test', credentials=creds)

# Load the dataframe
load_df(df, table_name='public.test', credentials=creds)
```
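Rather than hardcoding credentials, you can assemble the same dictionary from environment variables. A minimal sketch (the helper and the `REDSHIFT_*` variable names below are illustrative, not part of pandashift):

``` python
import os

def creds_from_env(prefix="REDSHIFT_"):
    # Hypothetical helper: build the credentials dict that pandashift
    # functions accept from environment variables such as REDSHIFT_HOST.
    keys = ("host", "port", "dbname", "user", "password")
    return {k: os.environ.get(prefix + k.upper(), "") for k in keys}

os.environ["REDSHIFT_HOST"] = "example.host"  # for demonstration only
creds = creds_from_env()
```

The resulting `creds` dict can then be passed as the `credentials` argument shown above.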
## Functions
### load_df
| Parameter | Usage |
|-----------------------|--------------------------------------------------------------------------------|
| init_df | The dataframe for loading|
| table_name | The table that you want to load the df to|
| credentials | Credentials to use for connection if any|
| auto_create_table | Create the table if it doesn't exist, by default **False**|
| auto_create_threshold | Datatype-inference threshold used when automatically creating the table (see `threshold` in `create_table_from_df`), by default **0.8**|
| auto_create_show_ddl | If **True**, prints the DDL generated during automatic table creation, by default **False**|
| verify_column_names | Checks that the dataframe column order matches the table column order, by default **True**|
| empty_str_as_null | Interprets empty strings ('') as NULL, by default **True**|
| maximum_insert_length | Maximum length (in characters) of a single insert statement; lower this if you get errors about exceeding the maximum statement length, by default **16000000**|
| perform_analyze | If **True**, runs ANALYZE on the table after loading, by default **False**|
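The `maximum_insert_length` parameter exists because a single SQL statement cannot grow without bound, so rows must be batched into multiple INSERT statements. A minimal sketch of that kind of chunking (not pandashift's actual implementation):

``` python
def chunk_inserts(rows, table_name, max_len=16_000_000):
    """Group rows into INSERT statements no longer than max_len characters."""
    prefix = f"INSERT INTO {table_name} VALUES "
    statements, batch, length = [], [], len(prefix)
    for row in rows:
        values = "(" + ", ".join(repr(v) for v in row) + ")"
        # Flush the current batch if adding this row would exceed the limit
        if batch and length + len(values) + 2 > max_len:
            statements.append(prefix + ", ".join(batch))
            batch, length = [], len(prefix)
        batch.append(values)
        length += len(values) + 2  # +2 for the ", " separator
    if batch:
        statements.append(prefix + ", ".join(batch))
    return statements

# A tiny max_len forces the three rows to split across two statements
stmts = chunk_inserts([(1, "a"), (2, "b"), (3, "c")], "public.test", max_len=60)
```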
### create_table_from_df
| Parameter | Usage |
|-----------------------|--------------------------------------------------------------------------------|
| df | The dataframe for loading|
| table_name | The table that you want to load the df to|
| credentials | Credentials to use for connection if any|
| sortkeys | Specify desired sortkeys, by default no sortkeys are used|
| distkey | Specify the desired distkey, by default no distkey is used|
| threshold | Threshold for datatype conversion (e.g. if more than 80% of a column's values parse as floats, a float column is created). Must be a value between 0 and 1, by default **0.8**|
| show_ddl | Prints the DDL of the created table, by default **False** |
| no_execute | Just creates DDL, but doesn't run it in database (useful for debugging), by default **False** |
Currently only the following data types are supported:
* SMALLINT
* INTEGER
* BIGINT
* DECIMAL
* REAL
* DOUBLE
* TIMESTAMP
* DATE
* CHAR
* VARCHAR
* BOOLEAN
* SUPER
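The `threshold` behavior can be illustrated with a minimal sketch of ratio-based type inference (illustrative only; pandashift's real logic and type names may differ):

``` python
def infer_column_type(values, threshold=0.8):
    # Hypothetical sketch: pick a numeric column type only if the share
    # of float-parseable non-null values meets the threshold.
    def is_float(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "VARCHAR(256)"
    ratio = sum(is_float(v) for v in non_null) / len(non_null)
    return "DOUBLE PRECISION" if ratio >= threshold else "VARCHAR(256)"
```

For example, a column where 4 of 5 values parse as floats meets the default 0.8 threshold and becomes numeric, while a mostly textual column falls back to VARCHAR.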
## Testing
To run the tests locally:
1. Create an environment with required modules
``` bash
pip install ".[dev]"
```
2. Run the following
``` bash
pytest
```
## Links
* Homepage: https://github.com/tumax7/pandashift
* Issues: https://github.com/tumax7/pandashift/issues

Requires Python >=3.10.