SQLDataModel

Name: SQLDataModel
Version: 0.4.0
Home page: https://github.com/AnteT/SQLDataModel
Summary: SQLDataModel is a lightweight dataframe library designed for efficient data extraction, transformation, and loading (ETL) across various sources and destinations, providing an efficient alternative to common setups like pandas, numpy, and sqlalchemy while also providing additional features without the overhead of external dependencies.
Upload time: 2024-04-23 22:08:55
Author: Ante Tonkovic-Capin
Requires Python: >=3.9
Keywords: SQL, ETL, dataframe, terminal tables, pretty print tables, sql2sql, data analysis, data science, data model, extract, transform, load, web scraping tables, data mining, html, html table parsing, apache arrow, pyarrow, pyarrow conversion, pyarrow to table, pyarrow to sql, pyarrow to csv, parquet file parsing, csv, csv parsing, markdown, markdown table parsing, latex, latex table parsing, delimited, delimited data parsing, file conversion, format conversion, terminal styling, table styling, from sqlite, to sqlite, from postgresql, to postgresql, sql to sql, excel, xlsx file, excel to sql, DataFrame package, polars2pandas, pandas2polars

            <img src="https://github.com/AnteT/SQLDataModel/raw/master/figs/sdm_banner_cyan.PNG?raw=true" alt="SQLDataModel Logo" style="width:100vw; border-radius: .6vw" />

---

### SQLDataModel: fast & lightweight source agnostic data model

[![SQLDataModel Home](https://img.shields.io/badge/SQLDataModel-Home-blue?style=flat&logo=github)](https://github.com/AnteT/SQLDataModel)
[![PyPI License](https://img.shields.io/pypi/l/sqldatamodel)](https://pypi.org/project/SQLDataModel/)
[![PyPI Version](https://img.shields.io/pypi/v/sqldatamodel)](https://pypi.org/project/SQLDataModel/)
[![Docs Status](https://readthedocs.org/projects/sqldatamodel/badge/?version=latest)](https://sqldatamodel.readthedocs.io/en/latest/?badge=latest)
[![Downloads](https://static.pepy.tech/badge/sqldatamodel)](https://pepy.tech/project/sqldatamodel)

SQLDataModel is a fast & lightweight data model with no additional dependencies for quickly fetching and storing your tabular data to and from the most commonly used [databases & data sources](#data-formats) in a couple lines of code. It's as easy as ETL:

```python
from SQLDataModel import SQLDataModel

# Extract your data:
sdm = SQLDataModel.from_html('https://antet.github.io/planets')

# Transform it:
sdm['Flyable?'] = sdm['Gravity'].apply(lambda x: str(x < 1.0))

# Filter it
sdm = sdm[sdm['Flyable?']=='True']

# Load it wherever you need! Let's make a Markdown file
sdm.to_markdown('Planets.MD')
```

Now we have a planetary bucket list in `Planets.MD`:

```markdown
| Planet  | Gravity | Moons | Flyable? |
|:--------|--------:|------:|:---------|
| Mercury |    0.38 |     0 | True     |
| Venus   |    0.91 |     0 | True     |
| Mars    |    0.38 |     2 | True     |
| Saturn  |    0.92 |   146 | True     |
| Uranus  |    0.89 |    27 | True     |
```

Made for those times when you just want to use raw SQL on your dataframe, or need to move data around but the full Pandas, Numpy, and SQLAlchemy installation is just overkill. SQLDataModel includes all the most commonly used features, plus additional ones like using plain SQL on your `DataFrame` and pretty printing your table, at _1/1000_ the size: 0.03MB vs 30MB.

---

### Installation
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install SQLDataModel.

```bash
$ pip install SQLDataModel
```
Then import the main class `SQLDataModel` into your local project, see [usage](#usage) below or go straight to the [project docs](https://sqldatamodel.readthedocs.io).
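
To confirm everything is wired up, here's a minimal sketch that builds a tiny model from scratch, using only the constructor and `insert_row()` method shown in the examples below:

```python
from SQLDataModel import SQLDataModel

# Create an empty model with headers, then fill it row by row
sdm = SQLDataModel(headers=['Name', 'Age'])
sdm.insert_row(['Alice', 20])
sdm.insert_row(['Bob', 30])

# A pretty printed table means the install worked
print(sdm)
```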

---

### Quick Example
A `SQLDataModel` can be created from any number of [sources](#data-formats). As a quick demo, let's create one using a Wikipedia page:

```python
>>> from SQLDataModel import SQLDataModel
>>> 
>>> url = 'https://en.wikipedia.org/wiki/1998_FIFA_World_Cup'
>>> 
>>> sdm = SQLDataModel.from_html(url, table_identifier=94)   
>>> 
>>> sdm[:4, ['R', 'Team', 'W', 'Pts.']]
┌──────┬─────────────┬──────┬──────┐
│ R    │ Team        │ W    │ Pts. │
├──────┼─────────────┼──────┼──────┤
│ 1    │ France      │ 6    │ 19   │
│ 2    │ Brazil      │ 4    │ 13   │
│ 3    │ Croatia     │ 5    │ 15   │
│ 4    │ Netherlands │ 3    │ 12   │
└──────┴─────────────┴──────┴──────┘
[4 rows x 4 columns]
```

SQLDataModel provides a quick and easy way to import, view, transform and export your data across multiple [formats](#data-formats) and sources, with the full power of executing raw SQL against your model in the process.
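
For example, to filter that same table with raw SQL instead of slicing, something like this should work (a sketch assuming the model's internal table is named `sdm`, as in the pandas example below, and casting since values scraped from HTML may arrive as text):

```python
# Teams with five or more wins, fetched as a new model
sdm_top = sdm.execute_fetch("SELECT Team, W FROM sdm WHERE CAST(W AS INTEGER) >= 5")
print(sdm_top)
```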

---

### Usage

`SQLDataModel` is a versatile data model leveraging the mighty power of in-memory `sqlite3` to perform fast and light-weight transformations, allowing you to easily move and manipulate data from source to destination regardless of where, or in what format, the data is. If you need to extract, transform, or simply transfer data across [various formats](#data-formats), `SQLDataModel` can make your life easier. Here are a few examples of how:

1. [From Website to Markdown](#from-website-to-markdown)
2. [SQL on your Pandas DataFrame](#sql-on-your-pandas-dataframe)
3. [From SQL to HTML Table](#from-sql-to-html-table)
4. [Pretty PyTorch Training Results to JSON](#pretty-pytorch-training-results-to-json)
5. [From PostgreSQL to SQLite](#from-postgresql-to-sqlite)

---

#### From Website to Markdown
Say we find some cool data online, perhaps some planetary data, and we want to go and get it for our own purposes:

```python
from SQLDataModel import SQLDataModel

# Target url with some planetary data we can extract
url = 'https://antet.github.io/sdm-planets'

# Create a model from the first table element found on the web page
sdm = SQLDataModel.from_html(url, table_identifier=1)

# Add some color to it
sdm.set_display_color('#A6D7E8')

# Let's see!
print(sdm)
```
`SQLDataModel`'s default output is pretty printed and formatted to fit within the current terminal's width and height; since we added some color, here's the result:

<img src="https://github.com/AnteT/SQLDataModel/raw/master/figs/sdm_planets.PNG?raw=true" alt="sdm_planets" style="width:100vw; border-radius: .6vw" />

Now that we have our data as a `SQLDataModel`, we can do any number of things with it using the provided methods, or by running our own SQL and returning the query results as a new model! Let's find out, all extreme temperatures and pressure aside, whether it would be easier to fly given each planet's gravity relative to Earth:

```python
# Extract: Slice by rows and columns
sdm = sdm[:,['Planet','Gravity']] # or sdm[:,:2]

# Transform: Create a new column based on existing values
sdm['Flyable?'] = sdm['Gravity'].apply(lambda x: str(x < 1.0))

# Filter: Keep only the 'Flyable' planets
sdm = sdm[sdm['Flyable?'] == 'True']

# Load: Let's turn our data into markdown!
sdm.to_markdown('Planet-Flying.MD')
```
Here's the raw markdown of the new file we created, `Planet-Flying.MD`:

```markdown
| Planet  | Gravity | Flyable? |
|:--------|--------:|:---------|
| Mercury |    0.38 | True     |
| Venus   |    0.91 | True     |
| Mars    |    0.38 | True     |
| Saturn  |    0.92 | True     |
| Uranus  |    0.89 | True     |
```

Notice that the output from the `to_markdown()` method also aligned our columns based on the data type, and padded the values so that even the raw markdown is pretty printed! While we used the `from_html()` and `to_markdown()` methods for this demo, we could just as easily have created the same table in any number of formats. Once we have our data as a `SQLDataModel`, it would've been just as easy to have the data in LaTeX:

```python
# Let's convert it to a LaTeX table instead
sdm_latex = sdm.to_latex()

# Here's the output
print(sdm_latex)
```
As with our markdown output, the columns are correctly aligned and pretty printed:

```latex
\begin{tabular}{|l|r|l|}
\hline
    Planet  & Gravity & Flyable? \\
\hline
    Mercury &    0.38 & True     \\
    Venus   &    0.91 & True     \\
    Mars    &    0.38 & True     \\
    Saturn  &    0.92 & True     \\
    Uranus  &    0.89 & True     \\
\hline
\end{tabular}
```
In fact, using the `to_html()` method is how the table from the beginning of this demo was created! Click [here](https://antet.github.io/sdm-planets) for an example of how the styling and formatting applied to `SQLDataModel` gets exported along with it when using `to_html()`. 
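
Assuming `to_html()` follows the same convention as `to_latex()` above, where omitting the filename returns the formatted string instead of writing a file, embedding the table elsewhere is a short step:

```python
# Write a styled, self-contained html file
sdm.to_html('sdm-planets.html')

# Or, assuming the to_latex() convention holds, grab the raw html string to embed in an existing page
html_str = sdm.to_html()
```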

---

#### SQL on your Pandas DataFrame
I can't tell you how many times I've found myself searching for information on how to do this or that operation in `pandas` and wished I could just quickly do it in SQL instead. Enter `SQLDataModel`: 

```python
import pandas as pd
from SQLDataModel import SQLDataModel

# Titanic dataset
df = pd.read_csv('titanic.csv')

# Transformations you don't want to do in pandas if you already know SQL
sql_query = """
select 
    Pclass, Sex, count(*) as 'Survived' 
from sdm where 
    Survived = 1 
group by 
    Pclass, Sex 
order by 
    count(*) desc
"""

# Extract: Create SQLDataModel from the df
sdm = SQLDataModel.from_pandas(df)

# Transform: Do them in SQLDataModel
sdm = sdm.execute_fetch(sql_query)

# Load: Then hand it back to pandas!
df = sdm.to_pandas()
```
Here we're using `SQLDataModel` to avoid the complex pandas operations required for aggregation if we already know SQL. Here's the output of the `sdm` we used to perform the operations alongside the resulting `df`:

```text
SQLDataModel:                         pandas:
┌───┬────────┬────────┬──────────┐    
│   │ Pclass │ Sex    │ Survived │       Pclass     Sex  Survived
├───┼────────┼────────┼──────────┤    0       1  female        91
│ 0 │      1 │ female │       91 │    1       3  female        72
│ 1 │      3 │ female │       72 │    2       2  female        70
│ 2 │      2 │ female │       70 │    3       3    male        47
│ 3 │      3 │ male   │       47 │    4       1    male        45
│ 4 │      1 │ male   │       45 │    5       2    male        17
│ 5 │      2 │ male   │       17 │    
└───┴────────┴────────┴──────────┘    
[6 rows x 3 columns]               
```
In this example our source and destination formats were both `pd.DataFrame` objects; however, `pandas` is not required to use, nor is it a dependency of, `SQLDataModel`. It is only required if you're using the `from_pandas()` or `to_pandas()` methods.
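
In fact, since the source here was just a csv file, the whole round trip could skip `pandas` entirely; a sketch using the csv support listed under [formats](#data-formats), with the same `sql_query` from above:

```python
from SQLDataModel import SQLDataModel

# Extract: read the csv directly, no pandas required
sdm = SQLDataModel.from_csv('titanic.csv')

# Transform: run the same aggregation in SQL
sdm = sdm.execute_fetch(sql_query)

# Load: write the results straight back to csv
sdm.to_csv('titanic-survivors.csv')
```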

---

#### From SQL to HTML Table

Say we have a table located on a remote PostgreSQL server that we want to put on our website. Normally we might pip install `SQLAlchemy`, `psycopg2`, and `pandas`, along with whatever dependencies they come with, like `numpy`. This time all we need is `psycopg2` for the PostgreSQL driver:

```python
import psycopg2
import datetime
from SQLDataModel import SQLDataModel

# Setup the connection
psql_conn = psycopg2.connect(...)

# Grab a table with missions to Saturn
sdm = SQLDataModel.from_sql('saturn_missions', psql_conn) # or SQL statement

# Filter to only 'Future' missions
sdm = sdm[sdm['Status'] == 'Future']

# Create new column with today's date so it ages better!
sdm['Updated'] = datetime.date.today()

# Create a new html file with our table
sdm.to_html('Future-Saturn.html', index=False)
```
Here's a snippet from the `Future-Saturn.html` file generated by the `to_html()` method:
```html
<!-- Metadata Removed -->
<table>
    <tr>
        <th>Mission</th>
        <th>Status</th>
        <th>Launch</th>
        <th>Destination</th>
        <th>Updated</th>
    </tr>
    <tr>
        <td>New Frontiers 4</td>
        <td>Future</td>
        <td>2028</td>
        <td>Surface of Titan</td>
        <td>2024-03-21</td>
    </tr>
    <tr>
        <td>Enceladus Orbilander</td>
        <td>Future</td>
        <td>2038</td>
        <td>Surface of Enceladus</td>
        <td>2024-03-21</td>
    </tr>
</table>
<!-- SQLDataModel css styles removed -->
```
The Surface of Titan in 2028, how cool is that!

---

#### Pretty PyTorch Training Results to JSON
Let's say we're training a `pytorch` model and we want to keep the results from training. If you've ever trained one or followed a tutorial, I'm sure you'll be familiar with this pattern:

```python
# Training loop
num_epochs = 10

for epoch in range(num_epochs):
    # Train the model and get the metrics
    train_loss, train_acc = train(model, criterion, optimizer, train_loader, device)
    val_loss, val_acc = validate(model, criterion, val_loader, device)

    # Print training results to terminal
    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, '
          f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')
```
This ubiquitous setup, where results from each epoch are output to the terminal, has become the `Hello, World!` of ML training loops. While this is great for monitoring the progress of training, it's miserable for actually keeping and analyzing the data afterwards. Here's how we can use `SQLDataModel` to help:

```python
from SQLDataModel import SQLDataModel

# Initialize an empty model with headers for the data we want to store
sdm = SQLDataModel(headers=['Epoch', 'Train Loss', 'Train Acc', 'Val Loss', 'Val Acc'])

# ... Training code ...
```
We create an empty `SQLDataModel` with `headers` corresponding to the metrics we want to store. Now, in our training loop, we can fill it with our results:

```python
# Initialize an empty model for training
sdm = SQLDataModel(headers=['Epoch', 'Train Loss', 'Train Acc', 'Val Loss', 'Val Acc'])

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    # Train the model and get the metrics
    train_loss, train_acc = train(model, criterion, optimizer, train_loader, device)
    val_loss, val_acc = validate(model, criterion, val_loader, device)

    # Let's store the results first
    sdm.insert_row([epoch+1, train_loss, train_acc, val_loss, val_acc])

    # Print training results to terminal
    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, '
          f'Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')
```
We use the `insert_row()` method to store the results from training, inserting a new row for each epoch. Now that we're storing the data instead of just printing it to the terminal, we can track our results and easily export them into whichever format we need. For this example, let's store them as a `.json` file. All we need is to add one more line at the end of our training loop:

```python
# ... Training loop ...
sdm.to_json('Training-Metrics.json')
```
This will export all the training data stored in our model to a new JSON file, `Training-Metrics.json`, shown here trimmed for brevity:

```json
[
    {
      "Epoch": 1,
      "Train Loss": 0.6504,
      "Train Acc": 0.6576,
      "Val Loss": 0.6126,
      "Val Acc": 0.6608
    },
    {
      "Epoch": 2,
      "Train Loss": 0.5733,
      "Train Acc": 0.6989,
      "Val Loss": 0.5484,
      "Val Acc": 0.7268
    },
    {
      "Epoch": 3,
      "Train Loss": 0.5222,
      "Train Acc": 0.7239,
      "Val Loss": 0.5294,
      "Val Acc": 0.7332
    }
  ]
```
In this example we kept the original print statements intact during training, but we could've removed them altogether and done something like this instead:

```python
# Initialize an empty model for training
sdm = SQLDataModel(headers=['Epoch', 'Train Loss', 'Train Acc', 'Val Loss', 'Val Acc'])

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    # Train the model and get the metrics
    train_loss, train_acc = train(model, criterion, optimizer, train_loader, device)
    val_loss, val_acc = validate(model, criterion, val_loader, device)

    # Let's store the results first
    sdm.insert_row([epoch+1, train_loss, train_acc, val_loss, val_acc])

    # Pretty print last row of training results to terminal
    print(sdm[-1])

# Print all results when training finishes
print(sdm)

# Export them as a json file
sdm.to_json('Training-Metrics.json')
```
Here we've removed the previous, rather verbose print statement and replaced it with a simple `print(sdm[-1])`, which prints the last row of data in the model. This results in the following terminal output for each epoch during training:

```shell
┌───────┬────────────┬───────────┬──────────┬─────────┐
│ Epoch │ Train Loss │ Train Acc │ Val Loss │ Val Acc │
├───────┼────────────┼───────────┼──────────┼─────────┤
│    10 │     0.2586 │    0.8063 │   0.5391 │  0.7728 │
└───────┴────────────┴───────────┴──────────┴─────────┘
```
Which, besides having the added benefit of storing our data, is far less code than the prior print statement, and is _in my humble opinion_ much nicer to look at while we wait for training to complete. We've also added one more line, a final `print(sdm)` to output all the training results before saving them to `Training-Metrics.json`. That way, for those of us who still like to see them all in one place, we can have our saved json cake and pretty print it too:

```shell
┌───────┬────────────┬───────────┬──────────┬─────────┐
│ Epoch │ Train Loss │ Train Acc │ Val Loss │ Val Acc │
├───────┼────────────┼───────────┼──────────┼─────────┤
│     1 │     0.6504 │    0.6576 │   0.6126 │  0.6608 │
│     2 │     0.5733 │    0.6989 │   0.5484 │  0.7268 │
│     3 │     0.5222 │    0.7239 │   0.5294 │  0.7332 │
│     4 │     0.4749 │    0.7288 │   0.5461 │  0.7238 │
│     5 │     0.4417 │    0.7301 │   0.5862 │  0.7148 │
│     6 │     0.4065 │    0.7544 │   0.4882 │  0.7674 │
│     7 │     0.3635 │    0.7679 │   0.4793 │  0.7690 │
│     8 │     0.3229 │    0.7884 │   0.4833 │  0.7732 │
│     9 │     0.2860 │    0.7921 │   0.5041 │  0.7746 │
│    10 │     0.2586 │    0.8063 │   0.5391 │  0.7728 │
└───────┴────────────┴───────────┴──────────┴─────────┘
[10 rows x 5 columns]
```
Whelp, clearly it's overfitting...

---

#### From PostgreSQL to SQLite
This example will showcase one of the original motivations of this package, moving data from one SQL database to another. It'll be a short one, just as `SQLDataModel` was designed to be! Let's say we need to move data from a PostgreSQL server into a local SQLite connection:

```python
import sqlite3
import psycopg2
from SQLDataModel import SQLDataModel

# Setup our postgresql connection
pg_conn = psycopg2.connect(...)

# Setup our local sqlite3 connection
sqlite_conn = sqlite3.connect(...)

# Create our model from a query on our postgresql connection
sdm = SQLDataModel.from_sql("SELECT * FROM pg_table", pg_conn)

# Import it into our sqlite3 connection, appending to existing data
sdm.to_sql("local_table", sqlite_conn, if_exists='append')
```
That's it! What once required installing packages like `pandas`, `numpy`, and `SQLAlchemy`, plus all of their dependencies, just to be able to use a nice DataFrame API when interacting with SQL data, is now just a single package, `SQLDataModel`. While I **love** all three of those packages, 99% of what I consistently use them for can be done with far less "package baggage".
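
Going the other direction is just as short; a sketch reusing the same connections, assuming `if_exists` also accepts `'replace'`:

```python
# Read from the local sqlite3 table this time
sdm = SQLDataModel.from_sql("local_table", sqlite_conn)

# And write it back out to postgresql, replacing any existing table
sdm.to_sql("remote_table", pg_conn, if_exists='replace')
```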

---

#### The Workflow

For all the `SQLDataModel` examples, the same basic workflow and pattern is present:

```python
from SQLDataModel import SQLDataModel

# Extract: Create the model from a source
sdm = SQLDataModel.from_data(...)

# Transform: Manipulate the data if needed
sdm['New Column'] = sdm['Column A'].apply(func)

# Load: Move it to a destination format
sdm.to_text('table.txt') # to_csv, to_json, to_latex, to_markdown, to_html, ..., etc.
```
We can take a step back and illustrate the broader **source, transform, destination** flow with `SQLDataModel` like this:

<img src="https://github.com/AnteT/SQLDataModel/raw/master/figs/sdm_graph.PNG?raw=true" alt="sdm_graph" style="width:100vw; border-radius: .6vw" />

Regardless of where the data originated or where it ends up, `SQLDataModel`'s best use-case is to be the light-weight intermediary that's agnostic to the original source, or the final destination, of the data.
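
Squeezed down to its essence, that intermediary role can even be a single chained expression; a sketch reusing the planets url from the opening example:

```python
from SQLDataModel import SQLDataModel

# From a remote html table straight to a local csv, nothing kept in between
SQLDataModel.from_html('https://antet.github.io/planets').to_csv('planets.csv')
```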

---

### Data Formats
`SQLDataModel` seamlessly interacts with a wide range of data formats, providing a versatile platform for data extraction, conversion, and writing. Supported formats include:

- **Arrow**: Convert to and from Apache Arrow format, `pyarrow` required.
- **CSV**: Extract from and write to comma separated value, `.csv`, files.
- **Excel**: Extract from and write to Excel `.xlsx` files, `openpyxl` required.
- **HTML**: Extract from the web, and read from and write to `.html` files, including formatted string literals.
- **JSON**: Extract from and write to `.json` files, JSON-like objects, or JSON formatted string literals.
- **LaTeX**: Extract from and write to `.tex` files, LaTeX formatted string literals.
- **Markdown**: Extract from and write to `.MD` files, Markdown formatted string literals.
- **Numpy**: Convert to and from `numpy.ndarray` objects, `numpy` required.
- **Pandas**: Convert to and from `pandas.DataFrame` objects, `pandas` required.
- **Parquet**: Extract from and write to `.parquet` files, `pyarrow` required.
- **Pickle**: Extract from and write to `.pkl` files; the package uses an `.sdm` extension when pickling to retain `SQLDataModel` metadata.
- **Polars**: Convert to and from `polars.DataFrame` objects, `polars` required.
- **SQL**: Extract from and write to the following popular SQL databases:
    - **SQLite**: Using the built-in `sqlite3` module.
    - **PostgreSQL**: Using the `psycopg2` package.
    - **SQL Server**: Using the `pyodbc` package.
    - **Oracle**: Using the `cx_Oracle` package.
    - **Teradata**: Using the `teradatasql` package.
- **Text**: Write to and from `.txt` files including other `SQLDataModel` string representations.
- **TSV** or delimited: Write to and from files delimited by:
    - **`\t`**: Tab separated values or `.tsv` files.
    - **`\s`**: Single space or whitespace separated values.
    - **`;`**: Semicolon separated values.
    - **`|`**: Pipe separated values.
    - **`:`**: Colon separated values.
    - **`,`**: Comma separated values or `.csv` files.
- **Python** objects:
    - **dictionaries**: Convert to and from collections of python `dict` objects.
    - **lists**: Convert to and from collections of python `list` objects.
    - **tuples**: Convert to and from collections of python `tuple` objects.
    - **namedtuples**: Convert to and from collections of `namedtuple` objects.

Note that `SQLDataModel` does not install any additional dependencies by default. This is done to keep the package as light-weight and small as possible. This means that to use package-dependent methods like `to_parquet()`, or the inverse `from_parquet()`, the `pyarrow` package is required. The same goes for other package-dependent methods, like those converting to and from `pandas` and `numpy` objects.
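
The built-in python object conversions, by contrast, need nothing extra at all; a sketch assuming a `from_dict()` constructor following the naming pattern of the other methods (check the [docs](https://sqldatamodel.readthedocs.io) for exact signatures):

```python
from SQLDataModel import SQLDataModel

# Hypothetical usage based on the 'Python objects' entries above
data = {'Planet': ['Mercury', 'Venus', 'Mars'], 'Gravity': [0.38, 0.91, 0.38]}
sdm = SQLDataModel.from_dict(data)
print(sdm)
```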

---

### Documentation
SQLDataModel's documentation can be found at https://sqldatamodel.readthedocs.io, containing detailed descriptions of the key modules in the package. These are listed below as links to their respective sections in the docs:

  * [`ANSIColor`](https://sqldatamodel.readthedocs.io/en/latest/SQLDataModel.html#SQLDataModel.ANSIColor.ANSIColor) for terminal styling.
  * [`HTMLParser`](https://sqldatamodel.readthedocs.io/en/latest/SQLDataModel.html#SQLDataModel.HTMLParser.HTMLParser) for parsing tabular data from the web.
  * [`JSONEncoder`](https://sqldatamodel.readthedocs.io/en/latest/SQLDataModel.html#SQLDataModel.JSONEncoder.DataTypesEncoder) for type casting and encoding JSON data.
  * [`SQLDataModel`](https://sqldatamodel.readthedocs.io/en/latest/SQLDataModel.html#SQLDataModel.SQLDataModel.SQLDataModel) for wrapping it all up.


However, to skip over the less relevant modules and jump straight to the meat of the package, the `SQLDataModel` module, click [here](https://sqldatamodel.readthedocs.io/en/latest/modules.html).

---

### Motivation

While there are packages out there that can accomplish some of the same tasks as `SQLDataModel`, they're either missing key features, end up being overkill for common tasks like grabbing and converting tables from source A to destination B, or they don't quite handle the full process and require additional dependencies to make it all work. When you find yourself doing the same thing over and over again, eventually you sit down and write a package to do it for you.

---

### Contributing

Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.

Please make sure to update tests as appropriate.

---

### License

[MIT](https://choosealicense.com/licenses/mit/)


Thank you!  
Ante Tonkovic-Capin

            
