<p align="center">
<img src='https://raw.githubusercontent.com/vertica/VerticaPy/master/assets/img/logo.png' width="180px">
</p>
:star: 2023-12-01: VerticaPy secures 200 stars.
:loudspeaker: 2020-06-27: Vertica-ML-Python has been renamed to VerticaPy.
:warning: The following README is for VerticaPy 1.1.x and onwards, and so some of the elements may not be present in the previous versions.
:scroll: Some basic syntax can be found in [the cheat sheet](assets/cheat_sheet/).
π° Check out the latest newsletter [here](https://www.vertica.com/python/documentation/1.1.x/html/whats_new.html).
# VerticaPy
[![PyPI version](https://badge.fury.io/py/verticapy.svg)](https://badge.fury.io/py/verticapy)
[![Conda Version](https://img.shields.io/conda/vn/conda-forge/verticapy?color=yellowgreen)](https://anaconda.org/conda-forge/verticapy)
[![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python Version](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)](https://www.python.org/downloads/)
[![codecov](https://codecov.io/gh/vertica/VerticaPy/branch/master/graph/badge.svg?token=a6GiFYI9at)](https://codecov.io/gh/vertica/VerticaPy)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
<p align="center">
<img src='https://raw.githubusercontent.com/vertica/VerticaPy/master/assets/img/benefits.png' width="92%">
</p>
VerticaPy is a Python library with scikit-like functionality used to conduct data science projects on data stored in Vertica, taking advantage of Verticaβs speed and built-in analytics and machine learning features. VerticaPy offers robust support for the entire data science life cycle, uses a 'pipeline' mechanism to sequentialize data transformation operations, and offers beautiful graphical options.
<br><br>
# Table of Contents
- [Introduction](#introduction)
- [Installation](#installation)
- [Connecting to the Database](#connecting-to-the-database)
- [Documentation](#documentation)
- [Use-cases](#use-cases)
- [Highlighted Features](#highllighted-features)
- [Themes - Dark | Light](#themes)
- [SQL Magic](#sql-magic)
- [SQL Plots](#sql-plots)
- [Diverse Database Connections](#multiple-database-connection-using-dblink)
- [Python and SQL Combo](#python-and-sql-combo)
- [Charts](#charts)
- [Complete ML pipeline](#complete-machine-learning-pipeline)
- [Quickstart](#quickstart)
- [Help and Support](#help-an-support)
- [Contributing](#contributing)
- [Communication](#communication)
<br>
# Introduction
Vertica was the first real analytic columnar database and is still the fastest in the market. However, SQL alone isn't flexible enough to meet the needs of data scientists.
<br><br>
Python has quickly become the most popular tool in this domain, owing much of its flexibility to its high-level of abstraction and impressively large and ever-growing set of libraries. Its accessibility has led to the development of popular and perfomant APIs, like pandas and scikit-learn, and a dedicated community of data scientists. Unfortunately, Python only works in-memory as a single-node process. This problem has led to the rise of distributed programming languages, but they too, are limited as in-memory processes and, as such, will never be able to process all of your data in this era, and moving data for processing is prohobitively expensive. On top of all of this, data scientists must also find convenient ways to deploy their data and models. The whole process is time consuming.
<br><br>
**VerticaPy aims to solve all of these problems**. The idea is simple: instead of moving data around for processing, VerticaPy brings the logic to the data.
<br><br>
3+ years in the making, we're proud to bring you VerticaPy.
<br><br>
Main Advantages:
<ul>
<li> Easy Data Exploration.</li>
<li> Fast Data Preparation.</li>
<li> In-Database Machine Learning.</li>
<li> Easy Model Evaluation.</li>
<li> Easy Model Deployment.</li>
<li> Flexibility of using either Python or SQL.</li>
</ul>
<p align="center">
<img src='https://raw.githubusercontent.com/vertica/VerticaPy/master/assets/img/architecture.png' width="92%">
</p>
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Installation
To install <b>VerticaPy</b> with pip:
```shell
# Latest release version
root@ubuntu:~$ pip3 install verticapy[all]
# Latest commit on master branch
root@ubuntu:~$ pip3 install git+https://github.com/vertica/verticapy.git@master
```
To install <b>VerticaPy</b> from source, run the following command from the root directory:
```shell
root@ubuntu:~$ python3 setup.py install
```
A detailed installation guide is available at: <br>
https://www.vertica.com/python/documentation/installation.html
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Connecting to the Database
VerticaPy is compatible with several clients. For details, see the <a href='https://www.vertica.com/python/documentation/connection.html'>connection page</a>.<br>
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Documentation
The easiest and most accurate way to find documentation for a particular function is to use the help function:
```python
import verticapy as vp
help(vp.vDataFrame)
```
Official documentation is available at: <br>
https://www.vertica.com/python/documentation/
To generate documentation, please look at: <br>
https://github.com/mail4umar/VerticaPy/blob/master/docs/Documentation%20Generation.md
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Use-cases
Examples and case-studies: <br>
https://www.vertica.com/python/examples/
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Highlighted Features
### Themes
VerticaPy, offers users the flexibility to customize their coding experience with two visually appealing themes: **Dark** and **Light**.
Dark mode, ideal for night-time coding sessions, features a sleek and stylish dark color scheme, providing a comfortable and eye-friendly environment.
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/8ee0b717-a994-4535-826a-7ca4db3772b5" width="70%">
</p>
On the other hand, Light mode serves as the default theme, offering a clean and bright interface for users who prefer a traditional coding ambiance.
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/24757bfd-4d0f-4e92-9aca-45476d704a33" width="70%">
</p>
Theme can be easily switched by:
```python
import verticapy as vp
vp.set_option("theme", "dark") # can be switched 'light'.
```
VerticaPy's theme-switching option ensures that users can tailor their experience to their preferences, making data exploration and analysis a more personalized and enjoyable journey.
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### SQL Magic
You can use VerticaPy to execute SQL queries directly from a Jupyter notebook. For details, see <a href='https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.jupyter.extensions.sql_magic.sql_magic.html#verticapy.jupyter.extensions.sql_magic.sql_magic'>SQL Magic</a>:
#### Example
Load the SQL extension.
```python
%load_ext verticapy.sql
```
Execute your SQL queries.
```sql
%%sql
SELECT version();
# Output
# Vertica Analytic Database v24.4-0
```
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### SQL Plots
You can create interactive, professional plots directly from SQL.
To create plots, simply provide the type of plot along with the SQL command.
#### Example
```python
%load_ext verticapy.jupyter.extensions.chart_magic
%chart -k pie -c "SELECT pclass, AVG(age) AS av_avg FROM titanic GROUP BY 1;"
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/7616ca04-87d4-4fd7-8cb9-015f48fe3c19" width="50%">
</p>
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### Multiple Database Connection using DBLINK
In a single platform, multiple databases (e.g. PostgreSQL, Vertica, MySQL, In-memory) can be accessed using SQL and python.
#### Example
```sql
%%sql
/* Fetch TAIL_NUMBER and CITY after Joining the flight_vertica table with airports table in MySQL database. */
SELECT flight_vertica.TAIL_NUMBER, airports.CITY AS Departing_City
FROM flight_vertica
INNER JOIN &&& airports &&&
ON flight_vertica.ORIGIN_AIRPORT = airports.IATA_CODE;
```
In the example above, the 'flight_vertica' table is stored in Vertica, whereas the 'airports' table is stored in MySQL. We can associate special symbols "&&&" to the different databases to fetch the data. The best part is that all the aggregation is pushed to the databases (i.e. it is not done in memory)!
For more details on how to setup DBLINK, please visit the [github repo](https://github.com/vertica/dblink). To learn about using DBLINK in VerticaPy, check out the [documentation page](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/full_stack/dblink_integration/).
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### Python and SQL Combo
VerticaPy has a unique place in the market because it allows users to use Python and SQL in the same environment.
#### Example
```python
import verticapy as vp
selected_titanic = vp.vDataFrame(
"(SELECT pclass, embarked, AVG(survived) FROM public.titanic GROUP BY 1, 2) x"
)
selected_titanic.groupby(columns=["pclass"], expr=["AVG(AVG)"])
```
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### Charts
Verticapy comes integrated with three popular plotting libraries: matplotlib, highcharts, and plotly.
A gallery of VerticaPy-generated charts is available at:<br>
https://www.vertica.com/python/documentation/chart.html
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/ac62df51-5f26-4b67-839b-fbd962fbaaea" width="70%">
</p>
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### Complete Machine Learning Pipeline
- **Data Ingestion**
VerticaPy allows users to ingest data from a diverse range of sources, such as AVRO, Parquet, CSV, JSON etc. With a simple command "[read_file](https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_file.html)", VerticaPy automatically infers the source type and the data type.
```python
import verticapy as vp
read_file(
"/home/laliga/2012.json",
table_name="laliga",
)
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/cddc5bbc-9f96-469e-92ee-b4a6e0bc7cfb" width="100%">
</p>
Note: Not all columns are displayed in the screenshot above because of width restriction here.
As shown above, it has created a nested structure for the complex data. The actual file structure is below:
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/6ad242fb-2994-45de-8796-d6af61dae00d" width="30%">
</p>
We can even see the SQL underneath every VerticaPy command by turning on the genSQL option:
```python
import verticapy as vp
read_file("/home/laliga/2012.json", table_name="laliga", genSQL=True)
```
```sql
CREATE LOCAL TEMPORARY TABLE "laliga"
("away_score" INT,
"away_team" ROW("away_team_gender" VARCHAR,
"away_team_group" VARCHAR,
"away_team_id" INT, ...
ROW("id" INT,
"name" VARCHAR)),
"competition" ROW("competition_id" INT,
"competition_name" VARCHAR,
"country_name" VARCHAR),
"competition_stage" ROW("id" INT,
"name" VARCHAR),
"home_score" INT,
"home_team" ROW("country" ROW("id" INT,
"name" VARCHAR),
"home_team_gender" VARCHAR,
"home_team_group" VARCHAR,
"home_team_id" INT, ...),
"kick_off" TIME,
"last_updated" DATE,
"match_DATE" DATE,
"match_id" INT, ...
ROW("data_version" DATE,
"shot_fidelity_version" INT,
"xy_fidelity_version" INT),
"season" ROW("season_id" INT,
"season_name" VARCHAR))
ON COMMIT PRESERVE ROWS
COPY "v_temp_schema"."laliga"
FROM '/home/laliga/2012.json'
PARSER FJsonParser()
```
VerticaPy provides functions for importing other specific file types, such as [read_json](#https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_json.html#verticapy.read_json) and [read_csv](#https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_csv.html). Since these functions focus on a particular file type, they offer more options for tackling the data. For example, [read_json](#https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_json.html#verticapy.read_json) has a "flatten_arrays" parameter that allows you to flatten nested JSON arrays.
- **Data Exploration**
There are many options for descriptive and visual exploration.
```python
from verticapy.datasets import load_iris
iris_data = load_iris()
iris_data.scatter(
["SepalWidthCm", "SepalLengthCm", "PetalLengthCm"],
by="Species",
max_nb_points=30
)
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/b70bfbf4-22fa-40f9-9958-7fd19dbfc61b" width="40%">
</p>
The <b>Correlation Matrix</b> is also very fast and convenient to compute. Users can choose from a wide variety of correaltions, including cramer, spearman, pearson etc.
```python
from verticapy.datasets import load_titanic
titanic = load_titanic()
titanic.corr(method="spearman")
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/fd34aac7-890a-484e-a3bc-9173bffa79d2" width="75%">
</p>
By turning on the SQL print option, users can see and copy SQL queries:
```python
from verticapy import set_option
set_option("sql_on", True)
```
```sql
β SELECT
β β /*+LABEL('vDataframe._aggregate_matrix')*/ CORR_MATRIX("pclass", "survived", "age", "sibsp", "parch", "fare", "body") OVER () β
β FROM
(
β SELECT
β β RANK() OVER (ORDER BY "pclass") AS "pclass",
β β RANK() OVER (ORDER BY "survived") AS "survived",
β β RANK() OVER (ORDER BY "age") AS "age",
β β RANK() OVER (ORDER BY "sibsp") AS "sibsp",
β β RANK() OVER (ORDER BY "parch") AS "parch",
β β RANK() OVER (ORDER BY "fare") AS "fare",
β β RANK() OVER (ORDER BY "body") AS "body" β
β FROM
"public"."titanic") spearman_table
```
VerticaPy allows users to calculate a focused correlation using the "focus" parameter:
```python
titanic.corr(method="spearman", focus="survived")
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/c46493b5-61e2-4eca-ae0e-e2a09fc8d304" width="20%">
</p>
- **Data Preparation**
Whether you are [joining multiple tables](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/data_prep/joins/), [encoding](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/data_prep/encoding/), or [filling missing values](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/data_prep/missing_values/), VerticaPy has everything and more in one package.
```python
import random
import verticapy as vp
data = vp.vDataFrame({"Heights": [random.randint(10, 60) for _ in range(40)] + [100]})
data.outliers_plot(columns="Heights")
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/c71b106b-29d0-4e19-8267-04c5107aa365" width="50%">
</p>
- **Machine Learning**
ML is the strongest suite of VerticaPy as it capitalizes on the speed of in-database training and prediction by using SQL in the background to interact with the database. ML for VerticaPy covers a vast array of tools, including [time series forecasting](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/ml/time_series/), [clustering](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/ml/clustering/), and [classification](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/ml/classification/).
```python
# titanic_vd is already loaded
# Logistic Regression model is already loaded
stepwise_result = stepwise(
model,
input_relation=titanic_vd,
X=[
"age",
"fare",
"parch",
"pclass",
],
y="survived",
direction="backward",
height=600,
width=800,
)
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/1550a25c-138c-4673-9940-44bf060a284b" width="50%">
</p>
[:arrow_up: Back to TOC](#table-of-contents)
<br>
### Loading Predefined Datasets
VerticaPy provides some predefined datasets that can be easily loaded. These datasets include the iris dataset, titanic dataset, amazon, and more.
There are two ways to access the provided datasets:
(1) Use the standard python method:
```python
from verticapy.datasets import load_iris
iris_data = load_iris()
```
(2) Use the standard name of the dataset from the public schema:
```python
iris_data = vp.vDataFrame(input_relation = "public.iris")
```
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Quickstart
The following example follows the <a href='https://www.vertica.com/python/quick-start/'>VerticaPy quickstart guide</a>.
Install the library using with <b>pip</b>.
```shell
root@ubuntu:~$ pip3 install verticapy[all]
```
Create a new Vertica connection:
```python
import verticapy as vp
vp.new_connection({
"host": "10.211.55.14",
"port": "5433",
"database": "testdb",
"password": "XxX",
"user": "dbadmin"},
name="Vertica_New_Connection")
```
Use the newly created connection:
```python
vp.connect("Vertica_New_Connection")
```
Create a VerticaPy schema for native VerticaPy models (that is, models available in VerticaPy, but not Vertica itself):
```python
vp.create_verticapy_schema()
```
Create a vDataFrame of your relation:
```python
from verticapy import vDataFrame
vdf = vDataFrame("my_relation")
```
Load a sample dataset:
```python
from verticapy.datasets import load_titanic
vdf = load_titanic()
```
Examine your data:
```python
vdf.describe()
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/362dbd53-3692-48e4-a1e1-60f5f565dc50" width="100%">
</p>
Print the SQL query with <b>set_option</b>:
```python
set_option("sql_on", True)
vdf.describe()
# Output
## Compute the descriptive statistics of all the numerical columns ##
SELECT
SUMMARIZE_NUMCOL("pclass", "survived", "age", "sibsp", "parch", "fare", "body") OVER ()
FROM public.titanic
```
With VerticaPy, it is now possible to solve a ML problem with few lines of code.
```python
from verticapy.machine_learning.model_selection.model_validation import cross_validate
from verticapy.machine_learning.vertica import RandomForestClassifier
# Data Preparation
vdf["sex"].label_encode()["boat"].fillna(method="0ifnull")["name"].str_extract(
" ([A-Za-z]+)\."
).eval("family_size", expr="parch + sibsp + 1").drop(
columns=["cabin", "body", "ticket", "home.dest"]
)[
"fare"
].fill_outliers().fillna()
# Model Evaluation
cross_validate(
RandomForestClassifier("rf_titanic", max_leaf_nodes=100, n_estimators=30),
vdf,
["age", "family_size", "sex", "pclass", "fare", "boat"],
"survived",
cutoff=0.35,
)
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/49d3a606-8518-4676-b7ae-fa5c3c962432" width="100%">
</p>
```python
# Features importance
model.fit(vdf, ["age", "family_size", "sex", "pclass", "fare", "boat"], "survived")
model.features_importance()
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/a3d8b236-53a7-4d69-a969-48c2ba9bc114" width="80%">
</p>
```python
# ROC Curve
model = RandomForestClassifier(
name = "public.RF_titanic",
n_estimators = 20,
max_features = "auto",
max_leaf_nodes = 32,
sample = 0.7,
max_depth = 3,
min_samples_leaf = 5,
min_info_gain = 0.0,
nbins = 32
)
model.fit(
"public.titanic", # input relation
["age", "fare", "sex"], # predictors
"survived" # response
)
# Roc Curve
model.roc_curve()
```
<p align="center">
<img src="https://github.com/vertica/VerticaPy/assets/46414488/87f74bc7-a6cd-4336-8d32-b144f7fb6888" width="80%">
</p>
Enjoy!
[:arrow_up: Back to TOC](#table-of-contents)
<br>
## Help and Support
### Contributing
For a short guide on contribution standards, see the <a href='https://www.vertica.com/python/documentation/1.1.x/html/contribution_guidelines.html'>Contribution Guidelines</a>.
### Communication
- LinkedIn: https://www.linkedin.com/company/verticapy/
- Announcements and Discussion: https://github.com/vertica/VerticaPy/discussions
[:arrow_up: Back to TOC](#table-of-contents)
Raw data
{
"_id": null,
"home_page": "https://github.com/vertica/VerticaPy",
"name": "verticapy",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "vertica python ml data science machine learning statistics database",
"author": "Badr Ouali",
"author_email": "badr.ouali@vertica.com",
"download_url": "https://files.pythonhosted.org/packages/2d/f5/6326207c82b91477e9135745108c505e3fdf2ec92dddd9bd084d34f6aa03/verticapy-1.1.0.tar.gz",
"platform": null,
"description": "<p align=\"center\">\n<img src='https://raw.githubusercontent.com/vertica/VerticaPy/master/assets/img/logo.png' width=\"180px\">\n</p>\n\n:star: 2023-12-01: VerticaPy secures 200 stars.\n\n:loudspeaker: 2020-06-27: Vertica-ML-Python has been renamed to VerticaPy.\n\n:warning: The following README is for VerticaPy 1.1.x and onwards, and so some of the elements may not be present in the previous versions.\n\n:scroll: Some basic syntax can be found in [the cheat sheet](assets/cheat_sheet/).\n\n\ud83d\udcf0 Check out the latest newsletter [here](https://www.vertica.com/python/documentation/1.1.x/html/whats_new.html).\n\n# VerticaPy\n\n[![PyPI version](https://badge.fury.io/py/verticapy.svg)](https://badge.fury.io/py/verticapy)\n[![Conda Version](https://img.shields.io/conda/vn/conda-forge/verticapy?color=yellowgreen)](https://anaconda.org/conda-forge/verticapy)\n[![License](https://img.shields.io/badge/License-Apache%202.0-orange.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Python Version](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)](https://www.python.org/downloads/)\n[![codecov](https://codecov.io/gh/vertica/VerticaPy/branch/master/graph/badge.svg?token=a6GiFYI9at)](https://codecov.io/gh/vertica/VerticaPy)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)\n\n<p align=\"center\">\n<img src='https://raw.githubusercontent.com/vertica/VerticaPy/master/assets/img/benefits.png' width=\"92%\">\n</p>\n\nVerticaPy is a Python library with scikit-like functionality used to conduct data science projects on data stored in Vertica, taking advantage of Vertica\u2019s speed and built-in analytics and machine learning features. VerticaPy offers robust support for the entire data science life cycle, uses a 'pipeline' mechanism to sequentialize data transformation operations, and offers beautiful graphical options.\n<br><br>\n\n# Table of Contents\n- [Introduction](#introduction)\n- [Installation](#installation)\n- [Connecting to the Database](#connecting-to-the-database)\n- [Documentation](#documentation)\n- [Use-cases](#use-cases)\n- [Highlighted Features](#highllighted-features)\n - [Themes - Dark | Light](#themes)\n - [SQL Magic](#sql-magic)\n - [SQL Plots](#sql-plots)\n - [Diverse Database Connections](#multiple-database-connection-using-dblink)\n - [Python and SQL Combo](#python-and-sql-combo)\n - [Charts](#charts)\n - [Complete ML pipeline](#complete-machine-learning-pipeline)\n- [Quickstart](#quickstart)\n- [Help and Support](#help-an-support)\n - [Contributing](#contributing)\n - [Communication](#communication)\n\n<br>\n\n# Introduction\n\nVertica was the first real analytic columnar database and is still the fastest in the market. However, SQL alone isn't flexible enough to meet the needs of data scientists.\n<br><br>\nPython has quickly become the most popular tool in this domain, owing much of its flexibility to its high-level of abstraction and impressively large and ever-growing set of libraries. Its accessibility has led to the development of popular and perfomant APIs, like pandas and scikit-learn, and a dedicated community of data scientists. Unfortunately, Python only works in-memory as a single-node process. This problem has led to the rise of distributed programming languages, but they too, are limited as in-memory processes and, as such, will never be able to process all of your data in this era, and moving data for processing is prohobitively expensive. On top of all of this, data scientists must also find convenient ways to deploy their data and models. The whole process is time consuming.\n<br><br>\n**VerticaPy aims to solve all of these problems**. The idea is simple: instead of moving data around for processing, VerticaPy brings the logic to the data.\n<br><br>\n3+ years in the making, we're proud to bring you VerticaPy.\n<br><br>\nMain Advantages:\n<ul>\n <li> Easy Data Exploration.</li>\n <li> Fast Data Preparation.</li>\n <li> In-Database Machine Learning.</li>\n <li> Easy Model Evaluation.</li>\n <li> Easy Model Deployment.</li>\n <li> Flexibility of using either Python or SQL.</li>\n</ul>\n\n<p align=\"center\">\n<img src='https://raw.githubusercontent.com/vertica/VerticaPy/master/assets/img/architecture.png' width=\"92%\">\n</p>\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Installation\n\nTo install <b>VerticaPy</b> with pip:\n```shell\n# Latest release version\nroot@ubuntu:~$ pip3 install verticapy[all]\n\n# Latest commit on master branch\nroot@ubuntu:~$ pip3 install git+https://github.com/vertica/verticapy.git@master\n```\nTo install <b>VerticaPy</b> from source, run the following command from the root directory:\n```shell\nroot@ubuntu:~$ python3 setup.py install\n```\n\nA detailed installation guide is available at: <br>\n\nhttps://www.vertica.com/python/documentation/installation.html\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Connecting to the Database\n\nVerticaPy is compatible with several clients. For details, see the <a href='https://www.vertica.com/python/documentation/connection.html'>connection page</a>.<br>\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Documentation\n\nThe easiest and most accurate way to find documentation for a particular function is to use the help function:\n\n```python\nimport verticapy as vp\n\nhelp(vp.vDataFrame)\n```\n\nOfficial documentation is available at: <br>\n\nhttps://www.vertica.com/python/documentation/\n\nTo generate documentation, please look at: <br>\nhttps://github.com/mail4umar/VerticaPy/blob/master/docs/Documentation%20Generation.md\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Use-cases\n\nExamples and case-studies: <br>\n\nhttps://www.vertica.com/python/examples/\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Highlighted Features\n\n### Themes\n\nVerticaPy, offers users the flexibility to customize their coding experience with two visually appealing themes: **Dark** and **Light**. \n\nDark mode, ideal for night-time coding sessions, features a sleek and stylish dark color scheme, providing a comfortable and eye-friendly environment. \n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/8ee0b717-a994-4535-826a-7ca4db3772b5\" width=\"70%\">\n</p>\n\nOn the other hand, Light mode serves as the default theme, offering a clean and bright interface for users who prefer a traditional coding ambiance. \n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/24757bfd-4d0f-4e92-9aca-45476d704a33\" width=\"70%\">\n</p>\n\nTheme can be easily switched by:\n\n```python\nimport verticapy as vp\n\nvp.set_option(\"theme\", \"dark\") # can be switched 'light'.\n```\n\nVerticaPy's theme-switching option ensures that users can tailor their experience to their preferences, making data exploration and analysis a more personalized and enjoyable journey.\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### SQL Magic\nYou can use VerticaPy to execute SQL queries directly from a Jupyter notebook. For details, see <a href='https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.jupyter.extensions.sql_magic.sql_magic.html#verticapy.jupyter.extensions.sql_magic.sql_magic'>SQL Magic</a>:\n\n#### Example\n\nLoad the SQL extension.\n```python\n%load_ext verticapy.sql\n```\nExecute your SQL queries.\n```sql\n%%sql\nSELECT version();\n\n# Output\n# Vertica Analytic Database v24.4-0\n```\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### SQL Plots\n\nYou can create interactive, professional plots directly from SQL.\n\nTo create plots, simply provide the type of plot along with the SQL command.\n\n#### Example\n```python\n%load_ext verticapy.jupyter.extensions.chart_magic\n%chart -k pie -c \"SELECT pclass, AVG(age) AS av_avg FROM titanic GROUP BY 1;\"\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/7616ca04-87d4-4fd7-8cb9-015f48fe3c19\" width=\"50%\">\n</p>\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### Multiple Database Connection using DBLINK\n\nIn a single platform, multiple databases (e.g. PostgreSQL, Vertica, MySQL, In-memory) can be accessed using SQL and python.\n\n#### Example\n```sql\n%%sql\n/* Fetch TAIL_NUMBER and CITY after Joining the flight_vertica table with airports table in MySQL database. */\nSELECT flight_vertica.TAIL_NUMBER, airports.CITY AS Departing_City\nFROM flight_vertica\nINNER JOIN &&& airports &&&\nON flight_vertica.ORIGIN_AIRPORT = airports.IATA_CODE;\n```\nIn the example above, the 'flight_vertica' table is stored in Vertica, whereas the 'airports' table is stored in MySQL. We can associate special symbols \"&&&\" to the different databases to fetch the data. The best part is that all the aggregation is pushed to the databases (i.e. it is not done in memory)!\n\nFor more details on how to setup DBLINK, please visit the [github repo](https://github.com/vertica/dblink). To learn about using DBLINK in VerticaPy, check out the [documentation page](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/full_stack/dblink_integration/).\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### Python and SQL Combo\n\nVerticaPy has a unique place in the market because it allows users to use Python and SQL in the same environment. \n\n#### Example\n```python\nimport verticapy as vp\n\nselected_titanic = vp.vDataFrame(\n \"(SELECT pclass, embarked, AVG(survived) FROM public.titanic GROUP BY 1, 2) x\"\n)\nselected_titanic.groupby(columns=[\"pclass\"], expr=[\"AVG(AVG)\"])\n```\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### Charts\n\nVerticapy comes integrated with three popular plotting libraries: matplotlib, highcharts, and plotly.\n\nA gallery of VerticaPy-generated charts is available at:<br>\n\nhttps://www.vertica.com/python/documentation/chart.html\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/ac62df51-5f26-4b67-839b-fbd962fbaaea\" width=\"70%\">\n</p>\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### Complete Machine Learning Pipeline\n\n- **Data Ingestion**\n\n VerticaPy allows users to ingest data from a diverse range of sources, such as AVRO, Parquet, CSV, JSON etc. With a simple command \"[read_file](https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_file.html)\", VerticaPy automatically infers the source type and the data type.\n\n ```python\n import verticapy as vp\n\n read_file(\n \"/home/laliga/2012.json\",\n table_name=\"laliga\",\n )\n ```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/cddc5bbc-9f96-469e-92ee-b4a6e0bc7cfb\" width=\"100%\">\n</p>\nNote: Not all columns are displayed in the screenshot above because of width restriction here.\n\nAs shown above, it has created a nested structure for the complex data. The actual file structure is below:\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/6ad242fb-2994-45de-8796-d6af61dae00d\" width=\"30%\">\n</p>\n\nWe can even see the SQL underneath every VerticaPy command by turning on the genSQL option:\n\n```python\n import verticapy as vp\n\n read_file(\"/home/laliga/2012.json\", table_name=\"laliga\", genSQL=True)\n```\n```sql\n CREATE LOCAL TEMPORARY TABLE \"laliga\"\n (\"away_score\" INT, \n \"away_team\" ROW(\"away_team_gender\" VARCHAR, \n \"away_team_group\" VARCHAR, \n \"away_team_id\" INT, ... \n ROW(\"id\" INT, \n \"name\" VARCHAR)), \n \"competition\" ROW(\"competition_id\" INT, \n \"competition_name\" VARCHAR, \n \"country_name\" VARCHAR), \n \"competition_stage\" ROW(\"id\" INT, \n \"name\" VARCHAR), \n \"home_score\" INT, \n \"home_team\" ROW(\"country\" ROW(\"id\" INT, \n \"name\" VARCHAR), \n \"home_team_gender\" VARCHAR, \n \"home_team_group\" VARCHAR, \n \"home_team_id\" INT, ...), \n \"kick_off\" TIME, \n \"last_updated\" DATE, \n \"match_DATE\" DATE, \n \"match_id\" INT, ... \n ROW(\"data_version\" DATE, \n \"shot_fidelity_version\" INT, \n \"xy_fidelity_version\" INT), \n \"season\" ROW(\"season_id\" INT, \n \"season_name\" VARCHAR)) \n ON COMMIT PRESERVE ROWS\n COPY \"v_temp_schema\".\"laliga\" \n FROM '/home/laliga/2012.json' \n PARSER FJsonParser()\n```\n\nVerticaPy provides functions for importing other specific file types, such as [read_json](#https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_json.html#verticapy.read_json) and [read_csv](#https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_csv.html). Since these functions focus on a particular file type, they offer more options for tackling the data. For example, [read_json](#https://www.vertica.com/python/documentation/1.1.x/html/api/verticapy.read_json.html#verticapy.read_json) has a \"flatten_arrays\" parameter that allows you to flatten nested JSON arrays.\n\n- **Data Exploration**\n\n There are many options for descriptive and visual exploration. \n\n```python\nfrom verticapy.datasets import load_iris\n\niris_data = load_iris()\niris_data.scatter(\n [\"SepalWidthCm\", \"SepalLengthCm\", \"PetalLengthCm\"], \n by=\"Species\", \n max_nb_points=30\n)\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/b70bfbf4-22fa-40f9-9958-7fd19dbfc61b\" width=\"40%\">\n</p>\n\nThe <b>Correlation Matrix</b> is also very fast and convenient to compute. Users can choose from a wide variety of correaltions, including cramer, spearman, pearson etc.\n\n```python\nfrom verticapy.datasets import load_titanic\n\ntitanic = load_titanic()\ntitanic.corr(method=\"spearman\")\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/fd34aac7-890a-484e-a3bc-9173bffa79d2\" width=\"75%\">\n</p>\n\nBy turning on the SQL print option, users can see and copy SQL queries:\n\n```python\nfrom verticapy import set_option\n\nset_option(\"sql_on\", True)\n```\n\n```sql\n\u2003 SELECT\n\u2003 \u2003 /*+LABEL('vDataframe._aggregate_matrix')*/ CORR_MATRIX(\"pclass\", \"survived\", \"age\", \"sibsp\", \"parch\", \"fare\", \"body\") OVER () \u2003\n\u2003 FROM\n(\n\u2003 SELECT\n\u2003 \u2003 RANK() OVER (ORDER BY \"pclass\") AS \"pclass\",\n\u2003 \u2003 RANK() OVER (ORDER BY \"survived\") AS \"survived\",\n\u2003 \u2003 RANK() OVER (ORDER BY \"age\") AS \"age\",\n\u2003 \u2003 RANK() OVER (ORDER BY \"sibsp\") AS \"sibsp\",\n\u2003 \u2003 RANK() OVER (ORDER BY \"parch\") AS \"parch\",\n\u2003 \u2003 RANK() OVER (ORDER BY \"fare\") AS \"fare\",\n\u2003 \u2003 RANK() OVER (ORDER BY \"body\") AS \"body\" \u2003\n\u2003 FROM\n\"public\".\"titanic\") spearman_table\n```\n\nVerticaPy allows users to calculate a focused correlation using the \"focus\" parameter:\n\n```python\ntitanic.corr(method=\"spearman\", focus=\"survived\")\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/c46493b5-61e2-4eca-ae0e-e2a09fc8d304\" width=\"20%\">\n</p>\n\n- **Data Preparation**\n\n Whether you are [joining multiple tables](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/data_prep/joins/), [encoding](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/data_prep/encoding/), or [filling missing values](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/data_prep/missing_values/), VerticaPy has everything and more in one package.\n\n```python\nimport random\nimport verticapy as vp\n\ndata = vp.vDataFrame({\"Heights\": [random.randint(10, 60) for _ in range(40)] + [100]})\ndata.outliers_plot(columns=\"Heights\")\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/c71b106b-29d0-4e19-8267-04c5107aa365\" width=\"50%\">\n</p>\n\n\n- **Machine Learning**\n\n ML is the strongest suite of VerticaPy as it capitalizes on the speed of in-database training and prediction by using SQL in the background to interact with the database. ML for VerticaPy covers a vast array of tools, including [time series forecasting](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/ml/time_series/), [clustering](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/ml/clustering/), and [classification](https://www.vertica.com/python/documentation/1.1.x/html/notebooks/ml/classification/). \n\n```python\n# titanic_vd is already loaded\n# Logistic Regression model is already loaded\nstepwise_result = stepwise(\n model,\n input_relation=titanic_vd,\n X=[\n \"age\",\n \"fare\",\n \"parch\",\n \"pclass\",\n ],\n y=\"survived\",\n direction=\"backward\",\n height=600,\n width=800,\n)\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/1550a25c-138c-4673-9940-44bf060a284b\" width=\"50%\">\n</p>\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n### Loading Predefined Datasets\n\nVerticaPy provides some predefined datasets that can be easily loaded. These datasets include the iris dataset, titanic dataset, amazon, and more.\n\nThere are two ways to access the provided datasets:\n\n(1) Use the standard python method:\n\n```python\nfrom verticapy.datasets import load_iris\n\niris_data = load_iris()\n```\n\n(2) Use the standard name of the dataset from the public schema:\n\n```python\niris_data = vp.vDataFrame(input_relation = \"public.iris\")\n```\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Quickstart\n\nThe following example follows the <a href='https://www.vertica.com/python/quick-start/'>VerticaPy quickstart guide</a>.\n\nInstall the library using with <b>pip</b>.\n```shell\nroot@ubuntu:~$ pip3 install verticapy[all]\n```\nCreate a new Vertica connection:\n```python\nimport verticapy as vp\n\nvp.new_connection({\n \"host\": \"10.211.55.14\", \n \"port\": \"5433\", \n \"database\": \"testdb\", \n \"password\": \"XxX\", \n \"user\": \"dbadmin\"},\n name=\"Vertica_New_Connection\")\n```\nUse the newly created connection:\n```python\nvp.connect(\"Vertica_New_Connection\")\n```\nCreate a VerticaPy schema for native VerticaPy models (that is, models available in VerticaPy, but not Vertica itself):\n```python\nvp.create_verticapy_schema()\n```\nCreate a vDataFrame of your relation:\n```python\nfrom verticapy import vDataFrame\n\nvdf = vDataFrame(\"my_relation\")\n```\nLoad a sample dataset:\n```python\nfrom verticapy.datasets import load_titanic\n\nvdf = load_titanic()\n```\nExamine your data:\n```python\nvdf.describe()\n```\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/362dbd53-3692-48e4-a1e1-60f5f565dc50\" width=\"100%\">\n</p>\n\nPrint the SQL query with <b>set_option</b>:\n```python\nset_option(\"sql_on\", True)\nvdf.describe()\n\n# Output\n## Compute the descriptive statistics of all the numerical columns ##\n\nSELECT \n SUMMARIZE_NUMCOL(\"pclass\", \"survived\", \"age\", \"sibsp\", \"parch\", \"fare\", \"body\") OVER ()\nFROM public.titanic\n```\nWith VerticaPy, it is now possible to solve a ML problem with few lines of code.\n```python\nfrom verticapy.machine_learning.model_selection.model_validation import cross_validate\nfrom verticapy.machine_learning.vertica import RandomForestClassifier\n\n# Data Preparation\nvdf[\"sex\"].label_encode()[\"boat\"].fillna(method=\"0ifnull\")[\"name\"].str_extract(\n \" ([A-Za-z]+)\\.\"\n).eval(\"family_size\", expr=\"parch + sibsp + 1\").drop(\n columns=[\"cabin\", \"body\", \"ticket\", \"home.dest\"]\n)[\n \"fare\"\n].fill_outliers().fillna()\n\n# Model Evaluation\ncross_validate(\n RandomForestClassifier(\"rf_titanic\", max_leaf_nodes=100, n_estimators=30),\n vdf,\n [\"age\", \"family_size\", \"sex\", \"pclass\", \"fare\", \"boat\"],\n \"survived\",\n cutoff=0.35,\n)\n```\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/49d3a606-8518-4676-b7ae-fa5c3c962432\" width=\"100%\">\n</p>\n\n```python\n# Features importance\nmodel.fit(vdf, [\"age\", \"family_size\", \"sex\", \"pclass\", \"fare\", \"boat\"], \"survived\")\nmodel.features_importance()\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/a3d8b236-53a7-4d69-a969-48c2ba9bc114\" width=\"80%\">\n</p>\n\n```python\n# ROC Curve\nmodel = RandomForestClassifier(\n name = \"public.RF_titanic\",\n n_estimators = 20,\n max_features = \"auto\",\n max_leaf_nodes = 32, \n sample = 0.7,\n max_depth = 3,\n min_samples_leaf = 5,\n min_info_gain = 0.0,\n nbins = 32\n)\nmodel.fit(\n \"public.titanic\", # input relation\n [\"age\", \"fare\", \"sex\"], # predictors\n \"survived\" # response\n)\n\n# Roc Curve\nmodel.roc_curve()\n```\n\n<p align=\"center\">\n<img src=\"https://github.com/vertica/VerticaPy/assets/46414488/87f74bc7-a6cd-4336-8d32-b144f7fb6888\" width=\"80%\">\n</p>\n\nEnjoy!\n\n[:arrow_up: Back to TOC](#table-of-contents)\n<br>\n\n## Help and Support\n\n### Contributing\n\nFor a short guide on contribution standards, see the <a href='https://www.vertica.com/python/documentation/1.1.x/html/contribution_guidelines.html'>Contribution Guidelines</a>.\n\n### Communication\n\n- LinkedIn: https://www.linkedin.com/company/verticapy/\n\n- Announcements and Discussion: https://github.com/vertica/VerticaPy/discussions\n\n[:arrow_up: Back to TOC](#table-of-contents)\n",
"bugtrack_url": null,
"license": null,
"summary": "VerticaPy simplifies data exploration, data cleaning, and machine learning in Vertica.",
"version": "1.1.0",
"project_urls": {
"Homepage": "https://github.com/vertica/VerticaPy"
},
"split_keywords": [
"vertica",
"python",
"ml",
"data",
"science",
"machine",
"learning",
"statistics",
"database"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6c461de2770386f52618d8c715648bfa4559940c64e2368844b62582ecb9b5ff",
"md5": "ecb3f6ec073ff536596d5785547dd232",
"sha256": "e5e07264ec5d6dc9409bd6033692446572ace29ecd545b1d8483bd845eefa33c"
},
"downloads": -1,
"filename": "verticapy-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ecb3f6ec073ff536596d5785547dd232",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 2990748,
"upload_time": "2024-10-31T21:06:38",
"upload_time_iso_8601": "2024-10-31T21:06:38.006876Z",
"url": "https://files.pythonhosted.org/packages/6c/46/1de2770386f52618d8c715648bfa4559940c64e2368844b62582ecb9b5ff/verticapy-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2df56326207c82b91477e9135745108c505e3fdf2ec92dddd9bd084d34f6aa03",
"md5": "5ded5e4dd2c07ed7d71b45fa1afe0a32",
"sha256": "6ae6a08b03787d16ba8eaf0f7cc97efec8505b3c92b9ed429ea7ad42c86a8622"
},
"downloads": -1,
"filename": "verticapy-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "5ded5e4dd2c07ed7d71b45fa1afe0a32",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 2508536,
"upload_time": "2024-10-31T21:06:40",
"upload_time_iso_8601": "2024-10-31T21:06:40.398241Z",
"url": "https://files.pythonhosted.org/packages/2d/f5/6326207c82b91477e9135745108c505e3fdf2ec92dddd9bd084d34f6aa03/verticapy-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-31 21:06:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "vertica",
"github_project": "VerticaPy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "verticapy"
}