bigquery-frame

- Name: bigquery-frame
- Version: 0.5.0
- Home page: https://github.com/FurcyPin/bigquery-frame
- Summary: A DataFrame API for Google BigQuery
- Upload time: 2024-07-04 22:00:00
- Author: FurcyPin
- Requires Python: <3.13,>=3.9
- License: Apache-2.0
- Keywords: bigquery, dataframe
# Bigquery-frame

[![PyPI version](https://badge.fury.io/py/bigquery-frame.svg)](https://badge.fury.io/py/bigquery-frame)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/bigquery-frame.svg)](https://pypi.org/project/bigquery-frame/)
[![GitHub Build](https://img.shields.io/github/actions/workflow/status/FurcyPin/bigquery-frame/build_and_validate.yml?branch=main)](https://github.com/FurcyPin/bigquery-frame/actions)
[![SonarCloud Coverage](https://sonarcloud.io/api/project_badges/measure?project=FurcyPin_bigquery-frame&metric=coverage)](https://sonarcloud.io/component_measures?id=FurcyPin_bigquery-frame&metric=coverage&view=list)
[![SonarCloud Bugs](https://sonarcloud.io/api/project_badges/measure?project=FurcyPin_bigquery-frame&metric=bugs)](https://sonarcloud.io/component_measures?metric=reliability_rating&view=list&id=FurcyPin_bigquery-frame)
[![SonarCloud Vulnerabilities](https://sonarcloud.io/api/project_badges/measure?project=FurcyPin_bigquery-frame&metric=vulnerabilities)](https://sonarcloud.io/component_measures?metric=security_rating&view=list&id=FurcyPin_bigquery-frame)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/bigquery-frame)](https://pypi.org/project/bigquery-frame/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

## What is it?

This project started as a POC that aimed to showcase the wonders that could be done if BigQuery provided a DataFrame API in 
Python similar to the one already available with PySpark or Snowpark.
With time, I started to add more and more [cool features :sunglasses:](#cool-features).

I tried to reproduce the most commonly used methods of the Spark DataFrame object. I aimed at making something 
as close to PySpark as possible, and tried to keep exactly the same naming and docstrings as PySpark's DataFrames.
 

For instance, this is a working example of PySpark code:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.sql("""
    SELECT 1 as id, "Bulbasaur" as name, ARRAY("Grass", "Poison") as types, NULL as other_col
    UNION ALL
    SELECT 2 as id, "Ivysaur" as name, ARRAY("Grass", "Poison") as types, NULL as other_col
""")

df.select("id", "name", "types").createOrReplaceTempView("pokedex")

df2 = spark.sql("""SELECT * FROM pokedex""")\
    .withColumn("nb_types", f.expr("SIZE(types)"))\
    .withColumn("name", f.expr("LOWER(name)"))

df2.show()
# +---+---------+---------------+--------+
# | id|     name|          types|nb_types|
# +---+---------+---------------+--------+
# |  1|bulbasaur|[Grass, Poison]|       2|
# |  2|  ivysaur|[Grass, Poison]|       2|
# +---+---------+---------------+--------+
```

And this is an equivalent working example using bigquery_frame that runs on Google BigQuery!
```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame.auth import get_bq_client
from bigquery_frame import functions as f

bigquery = BigQueryBuilder(get_bq_client())

df = bigquery.sql("""
    SELECT 1 as id, "Bulbasaur" as name, ["Grass", "Poison"] as types, NULL as other_col
    UNION ALL
    SELECT 2 as id, "Ivysaur" as name, ["Grass", "Poison"] as types, NULL as other_col
""")

df.select("id", "name", "types").createOrReplaceTempView("pokedex")

df2 = bigquery.sql("""SELECT * FROM pokedex""")\
    .withColumn("nb_types", f.expr("ARRAY_LENGTH(types)"))\
    .withColumn("name", f.expr("LOWER(name)"), replace=True)

df2.show()
# +----+-----------+---------------------+----------+
# | id |      name |               types | nb_types |
# +----+-----------+---------------------+----------+
# |  1 | bulbasaur | ['Grass', 'Poison'] |        2 |
# |  2 |   ivysaur | ['Grass', 'Poison'] |        2 |
# +----+-----------+---------------------+----------+
```

## What's so cool about DataFrames?

I believe that DataFrames are a super cool way to organise SQL code, as they allow us to do 
several things that are much harder, or even impossible, in pure SQL:

- on-the-fly introspection
- chaining operations
- generic transformations
- higher level abstraction

But that deserves [a blog article](https://towardsdatascience.com/sql-jinja-is-not-enough-why-we-need-dataframes-4d71a191936d).
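
For instance, here is a minimal sketch of a *generic transformation*: a function that introspects
whichever DataFrame it receives, on the fly, and rewrites some of its columns. It is written against
bigquery-frame's PySpark-like API and assumes a PySpark-style `DataFrame.columns` attribute;
the helper name is mine:

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame import functions as f

bq = BigQueryBuilder()

def lowercase_name_columns(df):
    """Introspect the DataFrame on the fly and lowercase every column
    whose name contains 'name', whatever the DataFrame's schema is."""
    for col in df.columns:
        if "name" in col:
            df = df.withColumn(col, f.expr(f"LOWER({col})"), replace=True)
    return df

df = bq.sql("""SELECT 1 as id, "Bulbasaur" as name""")
lowercase_name_columns(df).show()
```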


## Installation

[bigquery-frame is available on PyPI](https://pypi.org/project/bigquery-frame/).

### 1. Install bigquery-frame

```bash
pip install bigquery-frame
```

### 2. Configure access to your BigQuery project

There are three possible methods detailed in [AUTH.md](/AUTH.md):

#### The quickest way
Run `gcloud auth application-default login` and set the environment variable `GCP_PROJECT` to your project name.
If you do that, bigquery-frame will use your own credentials.

#### The safest way
Create a service account, configure its rights and set the environment variable `GCP_CREDENTIALS_PATH` to point to
your service account's credential file.

#### The custom way
Create a `google.cloud.bigquery.Client` object yourself and pass it to the `BigQueryBuilder`.
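
A minimal sketch of this custom way (the project id below is a placeholder):

```python
from google.cloud import bigquery

from bigquery_frame import BigQueryBuilder

# Build the client yourself, e.g. with an explicit project id,
# then hand it over to the BigQueryBuilder.
client = bigquery.Client(project="my-gcp-project")
bq = BigQueryBuilder(client)
```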


### 3. Open your favorite Python console / notebook and enjoy

```python
from bigquery_frame import BigQueryBuilder
bq = BigQueryBuilder()
df = bq.table('bigquery-public-data.utility_us.country_code_iso')
df.printSchema()
df.show()
```

## How does it work?

Very simply, by generating SQL queries that are sent to BigQuery.
You can get the query by calling the method `DataFrame.compile()`.

For instance, if we reuse the example from the beginning:
```python
print(df2.compile())
```

This will print the following SQL query:
```SQL
WITH pokedex AS (
  WITH _default_alias_1 AS (
    
        SELECT 1 as id, "Bulbasaur" as name, ["Grass", "Poison"] as types, NULL as other_col
        UNION ALL
        SELECT 2 as id, "Ivysaur" as name, ["Grass", "Poison"] as types, NULL as other_col
    
  )
  SELECT 
    id,
    name,
    types
  FROM _default_alias_1
)
, _default_alias_3 AS (
  SELECT * FROM pokedex
)
, _default_alias_4 AS (
  SELECT 
    *,
    ARRAY_LENGTH(types) AS nb_types
  FROM _default_alias_3
)
SELECT 
  * REPLACE (
    LOWER(name) AS name
  )
FROM _default_alias_4
```


## Cool Features

This feature list is arbitrarily sorted by decreasing order of coolness.

### Data Diff _(NEW!)_
_Just like git diff, but for data!_

Performs a diff between two DataFrames.

It can be called from Python, or directly from the command line once you have run `pip install bigquery-frame`.

**Example with the command line:**

_Please make sure you followed the [Installation section](#installation) first._
In this example, I compared two snapshots of the public table `bigquery-public-data.utility_us.country_code_iso`
taken 6 days apart, and noticed that a bug had been introduced, as you can see from the diff: the new values
for the columns `continent_code` and `continent_name` look like they have been swapped.

```bash
$ bq-diff --tables test_us.country_code_iso_snapshot_20220921 test_us.country_code_iso_snapshot_20220927 --join-cols country_name

Analyzing differences...
We will try to find the differences by joining the DataFrames together using the provided column: country_name
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.66s/it]
Schema: ok (10)

diff NOT ok

Summary:

Row count ok: 278 rows

28 (10.07%) rows are identical
250 (89.93%) rows have changed

100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.67s/it]
Found the following differences:
+----------------+---------------+---------------+--------------------+----------------+
|    column_name | total_nb_diff |          left |              right | nb_differences |
+----------------+---------------+---------------+--------------------+----------------+
| continent_code |           250 |            NA |          Caribbean |             26 |
| continent_code |           250 |            AF |     Eastern Africa |             20 |
| continent_code |           250 |            AS |       Western Asia |             18 |
| continent_code |           250 |            AF |     Western Africa |             17 |
| continent_code |           250 |            EU |    Southern Europe |             16 |
| continent_code |           250 |            EU |    Northern Europe |             15 |
| continent_code |           250 |            SA |      South America |             14 |
| continent_code |           250 |            AS | South-Eastern Asia |             11 |
| continent_code |           250 |            EU |     Eastern Europe |             10 |
| continent_code |           250 |            OC |          Polynesia |             10 |
| continent_name |           250 |        Africa |                 AF |             58 |
| continent_name |           250 |          Asia |                 AS |             54 |
| continent_name |           250 |        Europe |                 EU |             53 |
| continent_name |           250 | North America |                 NA |             39 |
| continent_name |           250 |       Oceania |                 OC |             26 |
| continent_name |           250 | South America |                 SA |             15 |
| continent_name |           250 |    Antarctica |                 AN |              5 |
+----------------+---------------+---------------+--------------------+----------------+
```

An equivalent analysis made with Python is available 
at [examples/data_diff/country_code_iso.py](examples/data_diff/country_code_iso.py).

**Example with Python:**

_Please make sure you followed the [Installation section](#installation) first._
```python
from bigquery_frame.data_diff import DataframeComparator
from bigquery_frame import BigQueryBuilder
from bigquery_frame.auth import get_bq_client

bq = BigQueryBuilder(get_bq_client())
df1 = bq.sql("""
    SELECT * FROM UNNEST ([
        STRUCT(1 as id, [STRUCT(1 as a, 2 as b, 3 as c)] as my_array),
        STRUCT(2 as id, [STRUCT(1 as a, 2 as b, 3 as c)] as my_array),
        STRUCT(3 as id, [STRUCT(1 as a, 2 as b, 3 as c)] as my_array)
    ])
""")
df2 = bq.sql("""
    SELECT * FROM UNNEST ([
        STRUCT(1 as id, [STRUCT(1 as a, 2 as b, 3 as c, 4 as d)] as my_array),
        STRUCT(2 as id, [STRUCT(2 as a, 2 as b, 3 as c, 4 as d)] as my_array),
        STRUCT(4 as id, [STRUCT(1 as a, 2 as b, 3 as c, 4 as d)] as my_array)
   ])
""")

df1.show()
# +----+----------------------------+
# | id |                   my_array |
# +----+----------------------------+
# |  1 | [{'a': 1, 'b': 2, 'c': 3}] |
# |  2 | [{'a': 1, 'b': 2, 'c': 3}] |
# |  3 | [{'a': 1, 'b': 2, 'c': 3}] |
# +----+----------------------------+
df2.show()
# +----+------------------------------------+
# | id |                           my_array |
# +----+------------------------------------+
# |  1 | [{'a': 1, 'b': 2, 'c': 3, 'd': 4}] |
# |  2 | [{'a': 2, 'b': 2, 'c': 3, 'd': 4}] |
# |  4 | [{'a': 1, 'b': 2, 'c': 3, 'd': 4}] |
# +----+------------------------------------+
diff_result = DataframeComparator().compare_df(df1, df2)
diff_result.display()
```

This will produce the following output:

```
Schema has changed:
@@ -2,3 +2,4 @@
 my_array!.a INTEGER
 my_array!.b INTEGER
 my_array!.c INTEGER
+my_array!.d INTEGER
WARNING: columns that do not match both sides will be ignored
diff NOT ok
Summary:
Row count ok: 3 rows
1 (25.0%) rows are identical
1 (25.0%) rows have changed
1 (25.0%) rows are only in 'left'
1 (25.0%) rows are only in 'right'
100%|██████████| 1/1 [00:04<00:00,  4.26s/it]
Found the following differences:
+-------------+---------------+-----------------------+-----------------------+----------------+
| column_name | total_nb_diff |                  left |                 right | nb_differences |
+-------------+---------------+-----------------------+-----------------------+----------------+
|    my_array |             1 | [{"a":1,"b":2,"c":3}] | [{"a":2,"b":2,"c":3}] |              1 |
+-------------+---------------+-----------------------+-----------------------+----------------+
1 rows were only found in 'left' :
Analyzing 4 columns ...
+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+
| column_number | column_name | column_type | count | count_distinct | count_null | min | max |               approx_top_100 |
+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+
|             0 |          id |     INTEGER |     1 |              1 |          0 |   3 |   3 | [{'value': '3', 'count': 1}] |
|             1 | my_array!.a |     INTEGER |     1 |              1 |          0 |   1 |   1 | [{'value': '1', 'count': 1}] |
|             2 | my_array!.b |     INTEGER |     1 |              1 |          0 |   2 |   2 | [{'value': '2', 'count': 1}] |
|             3 | my_array!.c |     INTEGER |     1 |              1 |          0 |   3 |   3 | [{'value': '3', 'count': 1}] |
+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+
1 rows were only found in 'right':
Analyzing 4 columns ...
+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+
| column_number | column_name | column_type | count | count_distinct | count_null | min | max |               approx_top_100 |
+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+
|             0 |          id |     INTEGER |     1 |              1 |          0 |   4 |   4 | [{'value': '4', 'count': 1}] |
|             1 | my_array!.a |     INTEGER |     1 |              1 |          0 |   1 |   1 | [{'value': '1', 'count': 1}] |
|             2 | my_array!.b |     INTEGER |     1 |              1 |          0 |   2 |   2 | [{'value': '2', 'count': 1}] |
|             3 | my_array!.c |     INTEGER |     1 |              1 |          0 |   3 |   3 | [{'value': '3', 'count': 1}] |
+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+
```

#### Features

- Full support of nested records
- (improvable) support for repeated records
- Optimized to work even on huge tables with more than 1000 columns

### Analyze

Performs an analysis on a DataFrame and returns aggregated stats for each column,
such as count, count distinct, count null, min, max, and the top 100 most frequent values.

- Optimized to work on wide tables
- Custom aggregation functions can be added
- Aggregations can be grouped by one or several columns

**Example:**
```python
from bigquery_frame.transformations_impl.analyze import __get_test_df
from bigquery_frame.transformations import analyze

df = __get_test_df()

df.show()
# +----+------------+---------------------+--------------------------------------------+
# | id |       name |               types |                                  evolution |
# +----+------------+---------------------+--------------------------------------------+
# |  1 |  Bulbasaur | ['Grass', 'Poison'] | {'can_evolve': True, 'evolves_from': None} |
# |  2 |    Ivysaur | ['Grass', 'Poison'] |    {'can_evolve': True, 'evolves_from': 1} |
# |  3 |   Venusaur | ['Grass', 'Poison'] |   {'can_evolve': False, 'evolves_from': 2} |
# |  4 | Charmander |            ['Fire'] | {'can_evolve': True, 'evolves_from': None} |
# |  5 | Charmeleon |            ['Fire'] |    {'can_evolve': True, 'evolves_from': 4} |
# |  6 |  Charizard |  ['Fire', 'Flying'] |   {'can_evolve': False, 'evolves_from': 5} |
# |  7 |   Squirtle |           ['Water'] | {'can_evolve': True, 'evolves_from': None} |
# |  8 |  Wartortle |           ['Water'] |    {'can_evolve': True, 'evolves_from': 7} |
# |  9 |  Blastoise |           ['Water'] |   {'can_evolve': False, 'evolves_from': 8} |
# +----+------------+---------------------+--------------------------------------------+

analyze(df).show()
# +---------------+------------------------+-------------+-------+----------------+------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# | column_number |            column_name | column_type | count | count_distinct | count_null |       min |       max |                                                                                                                                                                                                                                                                                                                     approx_top_100 |
# +---------------+------------------------+-------------+-------+----------------+------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |             0 |                     id |     INTEGER |     9 |              9 |          0 |         1 |         9 |                                                                       [{'value': '1', 'count': 1}, {'value': '2', 'count': 1}, {'value': '3', 'count': 1}, {'value': '4', 'count': 1}, {'value': '5', 'count': 1}, {'value': '6', 'count': 1}, {'value': '7', 'count': 1}, {'value': '8', 'count': 1}, {'value': '9', 'count': 1}] |
# |             1 |                   name |      STRING |     9 |              9 |          0 | Blastoise | Wartortle | [{'value': 'Bulbasaur', 'count': 1}, {'value': 'Ivysaur', 'count': 1}, {'value': 'Venusaur', 'count': 1}, {'value': 'Charmander', 'count': 1}, {'value': 'Charmeleon', 'count': 1}, {'value': 'Charizard', 'count': 1}, {'value': 'Squirtle', 'count': 1}, {'value': 'Wartortle', 'count': 1}, {'value': 'Blastoise', 'count': 1}] |
# |             2 |                 types! |      STRING |    13 |              5 |          0 |      Fire |     Water |                                                                                                                                                                  [{'value': 'Grass', 'count': 3}, {'value': 'Poison', 'count': 3}, {'value': 'Fire', 'count': 3}, {'value': 'Water', 'count': 3}, {'value': 'Flying', 'count': 1}] |
# |             3 |   evolution.can_evolve |     BOOLEAN |     9 |              2 |          0 |     false |      true |                                                                                                                                                                                                                                                                    [{'value': 'true', 'count': 6}, {'value': 'false', 'count': 3}] |
# |             4 | evolution.evolves_from |     INTEGER |     9 |              6 |          3 |         1 |         8 |                                                                                                                            [{'value': 'NULL', 'count': 3}, {'value': '1', 'count': 1}, {'value': '2', 'count': 1}, {'value': '4', 'count': 1}, {'value': '5', 'count': 1}, {'value': '7', 'count': 1}, {'value': '8', 'count': 1}] |
# +---------------+------------------------+-------------+-------+----------------+------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```


**Example with custom aggregation and column groups:**
```python
from bigquery_frame.transformations_impl import analyze_aggs
from bigquery_frame.transformations import analyze
aggs = [
     analyze_aggs.column_number,
     analyze_aggs.column_name,
     analyze_aggs.count,
     analyze_aggs.count_distinct,
     analyze_aggs.count_null,
]
analyze(df, group_by="evolution.can_evolve", _aggs=aggs).orderBy('group.can_evolve', 'column_number').show()
# +-----------------------+---------------+------------------------+-------+----------------+------------+
# |                 group | column_number |            column_name | count | count_distinct | count_null |
# +-----------------------+---------------+------------------------+-------+----------------+------------+
# | {'can_evolve': False} |             0 |                     id |     3 |              3 |          0 |
# | {'can_evolve': False} |             1 |                   name |     3 |              3 |          0 |
# | {'can_evolve': False} |             2 |                 types! |     5 |              5 |          0 |
# | {'can_evolve': False} |             4 | evolution.evolves_from |     3 |              3 |          0 |
# |  {'can_evolve': True} |             0 |                     id |     6 |              6 |          0 |
# |  {'can_evolve': True} |             1 |                   name |     6 |              6 |          0 |
# |  {'can_evolve': True} |             2 |                 types! |     8 |              4 |          0 |
# |  {'can_evolve': True} |             4 | evolution.evolves_from |     6 |              3 |          3 |
# +-----------------------+---------------+------------------------+-------+----------------+------------+
```


## Billing

All queries are run on BigQuery, so BigQuery's usual billing will apply to your queries. 
The following operations trigger a query execution: 

- `df.show()`
- `df.persist()`
- `df.createOrReplaceTempTable()`
- `df.write`

The operation `df.schema` also triggers a query execution, but it uses a `LIMIT 0`, which makes BigQuery return 
0 rows and charge **0 cents** for it.
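
For instance, this sketch fetches a schema without being billed (the exact representation
of the returned schema objects may differ):

```python
from bigquery_frame import BigQueryBuilder

bq = BigQueryBuilder()
df = bq.sql("""SELECT 1 as id, "Bulbasaur" as name""")

# This runs a `LIMIT 0` query: BigQuery returns 0 rows and bills nothing.
print(df.schema)
```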

### Query caching

Since version 0.4.0, the queries generated by bigquery-frame are now 100% deterministic, so executing the same
DataFrame code twice should send the same SQL query to BigQuery, thus leveraging BigQuery's query caching.
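
A quick sketch to convince yourself of this determinism (two separate builders are used
so that both start from the same internal alias counter; this is an illustration, not an
official API guarantee):

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame import functions as f

def build(bq):
    # Build the exact same DataFrame twice, on two fresh builders.
    return bq.sql("SELECT 1 as a").withColumn("b", f.expr("a + 1"))

# Both compilations yield the exact same SQL string, so the second
# execution can be served from BigQuery's query cache.
assert build(BigQueryBuilder()).compile() == build(BigQueryBuilder()).compile()
```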

### Accessing billing stats
Since version 0.4.2, the BigQueryBuilder gathers statistics about billing.

Example:
```python
from bigquery_frame import BigQueryBuilder
bq = BigQueryBuilder()
df = bq.sql("SELECT 1 as a").persist()
df.show()
print(bq.stats.human_readable())
```
This will print:
```
+---+
| a |
+---+
| 1 |
+---+
Estimated bytes processed : 8.00 B
Total bytes processed : 8.00 B
Total bytes billed : 10.00 MiB
```
_The number of bytes billed is larger because BigQuery applies a minimum pricing to all queries run, 
to account for execution overhead (except for queries with `LIMIT 0`)._
More details in the [BigQuery Documentation](https://cloud.google.com/bigquery/pricing#on_demand_pricing).

### Cost of advanced transformations 
Some advanced transformations, like `analyze` and `DataframeComparator.compare_df`, persist intermediary results.

- `analyze` scans each column of the table exactly once, and stores very small intermediary results, which should
  cost no more than 10 MiB per 50 columns in the table analyzed. This is negligible, and the cost of `analyze`
  should be comparable to that of a regular `SELECT *`.

- `DataframeComparator.compare_df` persists multiple intermediary results, which should never exceed 2 times the size
  of the input tables. It also has a built-in security to prevent any combinatorial explosion, should the user provide
  a join key that has duplicates. Overall, the cost of a diff should be comparable to scanning both input tables 
  4 to 6 times.
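
As a back-of-the-envelope sketch of what this means in practice (the table sizes below are made up):

```python
# Two hypothetical 10 GiB input tables.
left_gib, right_gib = 10, 10

# A diff costs roughly 4 to 6 scans of both input tables.
low, high = 4 * (left_gib + right_gib), 6 * (left_gib + right_gib)
print(f"Expect roughly {low} to {high} GiB processed")  # 80 to 120 GiB
```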


## Debugging

If you have already used Spark, you probably remember times when you tried to troubleshoot
a complex transformation pipeline but had trouble finding the exact line where the bug was hidden,
because Spark's lazy evaluation often makes it crash *several steps **after*** the actual buggy line.

To make debugging easier, I added a `debug` flag that, when set to True, validates each transformation
step instead of lazily evaluating everything at the end.

Examples:
```python
from bigquery_frame import BigQueryBuilder
bq = BigQueryBuilder()
df1 = bq.sql("""SELECT 1 as a""")

# This line is wrong, but since we don't fetch any result here, the error will be caught later.
df2 = df1.select("b")  

# The execution fails at this line.
df2.show()  
```

```python
from bigquery_frame import BigQueryBuilder
bq = BigQueryBuilder()

# This time, we activate the "debug mode"
bq.debug = True

df1 = bq.sql("""SELECT 1 as a""")

# Since debug mode is enabled, the execution will fail at this line.
df2 = df1.select("b")  

df2.show()
```

Of course, this comes at a performance cost, *even though query validation does not 
incur extra query costs*, so it is advised to keep this
option set to `False` by default and use it only for troubleshooting.


## Known limitations

This project started as a POC, and while I keep improving it with time, 
I did take some shortcuts and did not try to optimize the query length.
In particular, bigquery-frame generates _**a lot**_ of CTEs, and any 
serious project trying to use it might reach the maximum query length very quickly.

Here is a list of other known limitations; please also see the 
[Further developments](#further-developments) section for a list of missing features.

- `DataFrame.withColumn`: 
  - unlike in Spark, replacing an existing column is not done 
    automatically; an extra argument `replace=True` must be passed.
- `DataFrame.createOrReplaceTempView`: 
  - I kept the same name as Spark for consistency, but it does not create an actual view on BigQuery, it just emulates 
    Spark's behaviour by using a CTE. Because of this, if you replace a temp view that already exists, the new view
    cannot derive from the old view (while in Spark it is possible). 
- `DataFrame.join`:
  - When doing a select after a join, table prefixes MUST always be used on column names.
    For this reason, users SHOULD always make sure the DataFrames they are joining are properly aliased
    (see the sketch below).
  - When chaining multiple joins, the name of the first DataFrame is not available in the select clause.
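
A minimal sketch of the aliasing workaround (the table contents are made up, and the
prefixed-select syntax is shown as I understand it):

```python
from bigquery_frame import BigQueryBuilder

bq = BigQueryBuilder()
pokemons = bq.sql("""SELECT 1 as id, "Bulbasaur" as name""").alias("pokemons")
types = bq.sql("""SELECT 1 as pokemon_id, "Grass" as type""").alias("types")

# After a join, selected columns MUST be prefixed with their DataFrame's alias.
joined = pokemons.join(types, on=pokemons["id"] == types["pokemon_id"])
joined.select("pokemons.name", "types.type").show()
```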

## Further developments

Previously unsupported functions that were added in version 0.5.0 (see the [Changelog](#changelog)):

- `DataFrame.groupBy`
- UDTFs such as `explode`, `explode_outer`, etc. 

Other improvements:

- One main improvement would be to make DataFrames more _schema-conscious_.
  Currently, the only way we get to know a DataFrame's schema is by making 
  a round-trip with BigQuery. Ideally, we could improve this so that simple 
  well-defined transformations would update the DataFrame schema directly 
  on the client side. Of course, we would always have to ask BigQuery when 
  evaluating raw SQL expressions (e.g. `BigQueryBuilder.sql`, `functions.expr`).
- Another improvement would be to optimize the size of the generated query 
  by generating one single SELECT statement combining multiple
  steps like JOIN, GROUP BY, HAVING, etc. rather than one CTE for each step.


Also, it would be cool to expand this to other SQL engines than BigQuery 
(contributors are welcome :wink: ).


## Similar projects


### Ibis project
[ibis-project](https://github.com/ibis-project/ibis) provides a Python API inspired by pandas and dplyr that 
generates queries against multiple analytics (SQL or DataFrame-based) backends, such as Pandas, Dask, PySpark
and [BigQuery](https://github.com/ibis-project/ibis-bigquery).

I played with ibis very shortly, so I might have misunderstood some parts, but from my understanding, 
I noted the following differences between bigquery-frame and ibis-bigquery:

Unlike bigquery-frame (at least for now), Ibis is always _schema-conscious_, which has several upsides:
- If you select a column that does not exist from a DataFrame, ibis will detect it immediately, 
  while bigquery-frame will only see it once the query is executed, unless debug mode is active.
- Ibis is capable of telling which DataFrame a column comes from after a join, 
  while bigquery-frame isn't at the moment.
- Ibis's query compilation is also more advanced, and the SQL queries generated by ibis are much shorter and
  cleaner than the ones generated by bigquery-frame for the moment.

But this also has some downsides:
- ibis-bigquery currently does not support mixing raw SQL expressions into your transformations. This might be added
  in the future, as ibis is already capable of mixing raw SQL expressions with some other backends.
  Since bigquery-frame is not schema-conscious, support for raw SQL expressions was very easy to add.
  This is especially useful if you want to use "that brand-new SQL feature that was just added in BigQuery":
  with bigquery-frame you can always revert to pure SQL, while with ibis you must wait until that feature 
  is integrated into ibis-bigquery.

Another key difference between ibis-bigquery and bigquery-frame is that ibis-bigquery is part of a larger
cross-platform framework, while bigquery-frame only supports BigQuery at the moment. This allows bigquery-frame
to support some of BigQuery's exotic features more easily. For instance, bigquery-frame offers good 
support for nested repeated fields (arrays of structs) thanks to the `flatten_schema` function 
and the `functions.transform`, `DataFrame.with_nested_columns` and `DataFrame.select_nested_columns` functions.
 
I plan to address most of bigquery-frame's downsides in the future, whenever I get the time. 


## Why did I make this ?

I hope that it will motivate the teams working on BigQuery (or Redshift, 
or Azure Synapse) to propose a real Python DataFrame API on top of their 
massively parallel SQL engines. But not something that generates ugly SQL strings: 
more something like Spark Catalyst, which directly generates logical plans 
out of the DataFrame API without passing through the "SQL string" step.

After starting this POC, I realized Snowflake had already understood this and 
developed Snowpark, a Java/Scala (and soon Python) API to run complex workflows
on Snowflake. [Snowpark's DataFrame API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html)
was clearly borrowed from [Spark's DataFrame (= DataSet[Row]) API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html)
(we recognize several key method names: cache, createOrReplaceTempView, 
where/filter, collect, toLocalIterator).

I believe such a project could open the gate to hundreds of very cool applications.
For instance, did you know that, in its early versions at least, Dataiku Shaker 
was just a GUI that chained transformations on Pandas DataFrames, and later 
Spark DataFrames? 

Another possible evolution would be to make a DataFrame API capable of speaking
multiple SQL dialects. By using it, projects that generate SQL for multiple 
platforms, like [Malloy](https://github.com/looker-open-source/malloy), could
all use the same DataFrame abstraction. Adding support for a new SQL platform
would immediately allow all the projects based on it to support this new platform.

I recently discovered several other projects that have similar goals:
- [ibis-project](https://github.com/ibis-project/ibis) that provides a unified pandas-like API on top of several
  backends such as Pandas, Dask, Postgres, DuckDB, PySpark, BigQuery (cf the [comparison above](#ibis-project))
- [substrait](https://substrait.io/) aims at providing a formal universal serialization specification for 
  expressing relational-algebra transformations. The idea is to abstract the query plan from the syntax 
  in which it was expressed by using a single common query plan representation for everyone. Maybe one day,
  if this project becomes mainstream enough, we will have code written with PySpark, Pandas or SQL generate 
  an abstract query plan **using the same specification**, which could then be run by Dask, BigQuery, or Snowflake.
  Personally, I'm afraid that the industry is moving too fast to allow such an initiative to catch up and 
  become prevalent one day, but of course I would LOVE to be proven wrong there.

Given that the ibis-project is much more advanced than mine when it comes to multilingualism, I think
I will pursue my efforts by deep-diving into more advanced features for BigQuery only.
Perhaps one day I or someone else will port them to the ibis-project.

Should anyone be interested in trying to POC an extension of this project to Redshift, Postgres, Azure Synapse, 
or any other SQL engine, I would be very glad to discuss it.


## Changelog

### 0.5.0

**Improvements:**

- Added support for `DataFrame.groupBy`. It also supports `DataFrame.groupBy(...).pivot(...)`.
- `transformations.pivot` now supports doing multiple aggregations at the same time.
  Its signature has changed: the arguments `agg_fun` and `agg_col` have been replaced with `aggs`,
  and the argument `implem_version` has been removed along with the first implementation version.
- Added a new method `DataFrame.show_string` that returns the string resulting from `DataFrame.show`
  without printing it.
- Added a new method `DataFrame.transform` that provides a concise syntax for chaining custom transformations
  (see the sketch after this list).
- `functions.array_agg` now supports passing multiple columns in the `order_by` argument.
- Added new `functions`:
  - `avg` and `mean`
  - `lower` and `upper`
  - `regexp_replace`
  - `explode`, `explode_outer`, `posexplode` and `posexplode_outer` 
- The method `functions.lit` can now generate most BigQuery types:
  - STRING, BYTES
  - DATE, TIME, DATETIME, TIMESTAMP
  - INT, FLOAT, BIGNUMERIC
  - ARRAY and STRUCT
- Added new method `Column.isin`.
- Improved `functions.sort_array` and `functions.transform`: they can now be correctly used together,
  but their signatures have changed. They now take functions as arguments instead of strings.
- **Complete overhaul of data-diff**: 
  - data-diff now supports complex data types. Declaring a repeated field (e.g. `"s!.id"`) in `join_cols` will now
    explode the corresponding array and perform the diff on it.
  - When columns are removed or renamed, they are still displayed in the per-column diff report.
  - data-diff now generates an HTML report: it is fully standalone and can be opened without any 
    internet connection.
  - A user-friendly error is now raised when one of the `join_cols` does not exist.
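
As an illustration of the new `DataFrame.transform` method mentioned above, here is a minimal sketch
(assuming PySpark-like semantics, where `transform` applies the given function to the DataFrame and
returns the result):

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame import functions as f

bq = BigQueryBuilder()
df = bq.sql("SELECT 2 as id")

def double(df):
    return df.withColumn("id", f.expr("id * 2"), replace=True)

def increment(df):
    return df.withColumn("id", f.expr("id + 1"), replace=True)

# Chaining reads left to right, instead of nesting: increment(double(df))
df.transform(double).transform(increment).show()
```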


**Breaking changes:**

- Dropped support for Python 3.7 and 3.8, but added support for Python 3.11 and 3.12
- `DataFrame.join` no longer supports string expressions. For instance, `df1.join(df2, on="col1 == col2")` 
  must be rewritten as `df1.join(df2, on=df1["col1"] == df2["col2"])`.
- The signature of the `transformations.pivot` method has changed: 
  the arguments `agg_fun` and `agg_col` have been replaced with `aggs`, and the argument `implem_version` 
  has been removed along with the first implementation version.
- The signature of the `functions.sort_array` method has changed.
  The second argument `sort_cols: Union[Column, List[Column]]` was replaced 
  with `sort_keys: Optional[Callable[[Column], Union[Column, List[Column]]]] = None`.
- The signature of the `functions.transform` method has changed.
  The second argument `transform_col: Column` was replaced with `func: Callable[[Column], Column]`.
- **data-diff**:
  - `bigquery_frame.data_diff.DataframeComparator` object has been removed. 
    Please use directly the method `bigquery_frame.data_diff.compare_dataframes`.
  - package `bigquery_frame.data_diff.diff_results` has been renamed to `diff_result`.

**Bugfixes:**
- `transformations.unpivot` now works with an empty `pivot_columns` list.


### 0.4.4

- Breaking change: The `flatten_schema` method has been moved from `transformations` to `data_type_utils`.


### 0.4.3

- Added `functions.array_agg` method.
- Added a first graph algorithm: `graph.connected_component` computes the connected components of a graph using the
  "small star - large star" algorithm, which is conjectured to perform in `O(n log(n))`.
- Added `functions.from_base_32` and `functions.from_base_64`.

### 0.4.2

- The BigQueryBuilder now aggregates stats about the total number of bytes processed and billed. 
  This is useful to check how much you spend after running a diff (^ ^)  
  It can be displayed with `print(bq.stats.human_readable())`
- Added a [`BigQueryBuilder.debug` attribute](#debugging). When set to true, each step of a DataFrame 
  transformation flow will be validated instead of compiling the query lazily when we need to fetch a result.
- Added a `bq-diff` command that can be called directly from bash. Run `bq-diff --help` for more information. 


### 0.4.1

- Added the `DataFrame.write` feature. It currently supports partitioning, table-level options, and several insertion modes:
  - `append`: Append the contents of the `DataFrame` to the existing table.
  - `overwrite`: Replace the destination table with the new data if it already exists.
  - `error` or `errorifexists`: Throw an exception if the destination table already exists.
  - `ignore`: Silently ignore this operation if the destination table already exists.
- `BigQueryBuilder` now tries to create its own BigQuery client if none is passed.

### 0.4.0


#### New exciting features!

Several new features that make working with nested structure easier were added.

- Added `DataFrame.select_nested_columns` and `DataFrame.with_nested_columns`, which 
  make transformation of nested values much easier
- Added `functions.transform`, useful to transform arrays
- Added `transformations.normalize_arrays`, which automatically sorts all arrays in a DataFrame, including
  arrays of arrays of arrays...
- Added `transformations.harmonize_dataframes`, which takes two DataFrames and transforms them into DataFrames
  with the same schema.
- Experimental: added data_diff capabilities, including a `DataframeComparator` which can perform a diff
  between two DataFrames. Extremely useful for non-regression testing.
- Generated queries are now deterministic: this means that if you re-run the
  same DataFrame code twice, the exact same query will be sent to BigQuery twice,
  thus leveraging query caching.


#### Other features

- Added automatic retry when BigQuery returns an InternalServerError. We now do 3 tries by default.
- Added `functions.to_base32` and `functions.to_base64`. 
  `from_base_32` and `from_base_64` will be added later, 
  once a [bug in python-tabulate](https://github.com/astanin/python-tabulate/issues/192) is fixed.
- Added `Column.__mod__` (e.g. `f.when(c % 2 == 0)`) 


#### Breaking changes

- Improved typing of `Column.when.otherwise`. Now `functions.when` returns a `WhenColumn`, a
  special type of `Column` with two extra methods: `when` and `otherwise`.
- Changed `functions.sort_array`'s signature to make it consistent with `transform`


### 0.3.4

#### Features

- Added `Column[...]` (`__getitem__`), which can be used to access struct or array elements.

#### Bugfixes

- Fixed various bugs in `transformations.analyze`:
  - It was crashing on `ARRAY<STRUCT<ARRAY<...>>>` columns
  - It was crashing on columns of type BYTES
  - Columns used in `group_by` were analyzed, which is useless because the group is constant


### 0.3.3

#### Breaking changes

- `bigquery_frame.transformations.analyze` now returns one extra column named `column_number`

#### Features

- Added various `Column` methods: `asc`, `desc`
- Added various functions: `replace`, `array`

#### Bugfixes

- `DataFrame.sort` now works on aliased columns
- `Column.alias` now works on SQL keywords
- `functions.eqNullSafe` now never returns NULL
- Fixed incorrect line numbering when the query was displayed on error


### 0.3.2

#### Breaking changes

- The `Column` constructor no longer accepts `alias` as an argument. Use `Column.alias()` instead.

#### Features

- Added various `Column` methods: `cast`, `~`, `isNull`, `isNotNull`, `eqNullSafe`
- Added various functions: `concat`, `length`, `sort_array`, `substring`

#### Bugfixes
- Fixed broken `Column.when().otherwise().alias()`

### 0.3.1

#### Breaking changes
- Dropped support for Python 3.6
- Bumped dependencies versions
- DataFrame.toPandas() now requires extra permissions by default
  (the BigQuery ReadSession User role), but downloads data faster.

#### Features
- Added `functions.cast()` method
- We now print the whole query in the error message when it fails
- Added `DataFrame.join()`. This is a first implementation which is a little clumsy.

#### Bugfixes
- Fixed DataFrame deps being lost when using `df.alias()`


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/FurcyPin/bigquery-frame",
    "name": "bigquery-frame",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.9",
    "maintainer_email": null,
    "keywords": "BigQuery, DataFrame",
    "author": "FurcyPin",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/bc/1e/6fd24dafba24a928428b25a864d369c6d52a8e8c32fded27da24b63c87b2/bigquery_frame-0.5.0.tar.gz",
    "platform": null,
    "description": "# Bigquery-frame\n\n[![PyPI version](https://badge.fury.io/py/bigquery-frame.svg)](https://badge.fury.io/py/bigquery-frame)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/bigquery-frame.svg)](https://pypi.org/project/bigquery-frame/)\n[![GitHub Build](https://img.shields.io/github/actions/workflow/status/FurcyPin/bigquery-frame/build_and_validate.yml?branch=main)](https://github.com/FurcyPin/bigquery-frame/actions)\n[![SonarCloud Coverage](https://sonarcloud.io/api/project_badges/measure?project=FurcyPin_bigquery-frame&metric=coverage)](https://sonarcloud.io/component_measures?id=FurcyPin_bigquery-frame&metric=coverage&view=list)\n[![SonarCloud Bugs](https://sonarcloud.io/api/project_badges/measure?project=FurcyPin_bigquery-frame&metric=bugs)](https://sonarcloud.io/component_measures?metric=reliability_rating&view=list&id=FurcyPin_bigquery-frame)\n[![SonarCloud Vulnerabilities](https://sonarcloud.io/api/project_badges/measure?project=FurcyPin_bigquery-frame&metric=vulnerabilities)](https://sonarcloud.io/component_measures?metric=security_rating&view=list&id=FurcyPin_bigquery-frame)\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/bigquery-frame)](https://pypi.org/project/bigquery-frame/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n## What is it ?\n\nThis project started as a POC that aimed to showcase the wonders that could be done if BigQuery provided a DataFrame API in \nPython similar to the one already available with PySpark or Snowpark.\nWith time, I started to add more and more [cool features :sunglasses:](#cool-features).\n\nI tried to reproduce the most commonly used methods of the Spark DataFrame object. I aimed at making something \nas close as possible as PySpark, and tried to keep exactly the same naming and docstrings as PySpark's DataFrames.\n \n\nFor instance, this is a working example of PySpark code:\n```python\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import functions as f\n\nspark = SparkSession.builder.master(\"local[1]\").getOrCreate()\n\ndf = spark.sql(\"\"\"\n    SELECT 1 as id, \"Bulbasaur\" as name, ARRAY(\"Grass\", \"Poison\") as types, NULL as other_col\n    UNION ALL\n    SELECT 2 as id, \"Ivysaur\" as name, ARRAY(\"Grass\", \"Poison\") as types, NULL as other_col\n\"\"\")\n\ndf.select(\"id\", \"name\", \"types\").createOrReplaceTempView(\"pokedex\")\n\ndf2 = spark.sql(\"\"\"SELECT * FROM pokedex\"\"\")\\\n    .withColumn(\"nb_types\", f.expr(\"SIZE(types)\"))\\\n    .withColumn(\"name\", f.expr(\"LOWER(name)\"))\n\ndf2.show()\n# +---+---------+---------------+--------+\n# | id|     name|          types|nb_types|\n# +---+---------+---------------+--------+\n# |  1|bulbasaur|[Grass, Poison]|       2|\n# |  2|  ivysaur|[Grass, Poison]|       2|\n# +---+---------+---------------+--------+\n```\n\nAnd this is an equivalent working example using bigquery_frame, that runs on Google Big Query! 
\n```python\nfrom bigquery_frame import BigQueryBuilder\nfrom bigquery_frame.auth import get_bq_client\nfrom bigquery_frame import functions as f\n\nbigquery = BigQueryBuilder(get_bq_client())\n\ndf = bigquery.sql(\"\"\"\n    SELECT 1 as id, \"Bulbasaur\" as name, [\"Grass\", \"Poison\"] as types, NULL as other_col\n    UNION ALL\n    SELECT 2 as id, \"Ivysaur\" as name, [\"Grass\", \"Poison\"] as types, NULL as other_col\n\"\"\")\n\ndf.select(\"id\", \"name\", \"types\").createOrReplaceTempView(\"pokedex\")\n\ndf2 = bigquery.sql(\"\"\"SELECT * FROM pokedex\"\"\")\\\n    .withColumn(\"nb_types\", f.expr(\"ARRAY_LENGTH(types)\"))\\\n    .withColumn(\"name\", f.expr(\"LOWER(name)\"), replace=True)\n\ndf2.show()\n# +----+-----------+---------------------+----------+\n# | id |      name |               types | nb_types |\n# +----+-----------+---------------------+----------+\n# |  1 | bulbasaur | ['Grass', 'Poison'] |        2 |\n# |  2 |   ivysaur | ['Grass', 'Poison'] |        2 |\n# +----+-----------+---------------------+----------+\n```\n\n## What's so cool about DataFrames ?\n\nI believe that DataFrames are super cool to organise SQL code as it allows us to \nseveral things that are much harder, or even impossible, in pure-SQL:\n\n- on-the-fly introspection\n- chaining operations\n- generic transformations\n- higher level abstraction\n\nBut that deserves [a blog article](https://towardsdatascience.com/sql-jinja-is-not-enough-why-we-need-dataframes-4d71a191936d).\n\n\n## Installation\n\n[bigquery-frame is available on PyPi](https://pypi.org/project/bigquery-frame/).\n\n### 1. Install bigquery-frame\n\n```bash\npip install bigquery-frame\n```\n\n### 2. Configure access to your BigQuery project.\n\nThere are three possible methods detailed in [AUTH.md](/AUTH.md):\n\n#### The quickest way\nEither run `gcloud auth application-default login` and set the environment variable `GCP_PROJECT` to your project name.\nIf you do that bigquery-frame will inherit use own credentials.\n\n#### The safest way\nCreate a service account, configure its rights and set the environment variable `GCP_CREDENTIALS_PATH` to point to\nyour service account's credential file.\n\n#### The custom way\nCreate a `google.cloud.bigquery.Client` object yourself and pass it to the `BigQueryBuilder`.\n\n\n### 3. 
Open your favorite python console / notebook and enjoy\n\n```python\nfrom bigquery_frame import BigQueryBuilder\nbq = BigQueryBuilder()\ndf = bq.table('bigquery-public-data.utility_us.country_code_iso')\ndf.printSchema()\ndf.show()\n```\n\n## How does it work ?\n\nVery simply, by generating SQL queries that are sent to BigQuery.\nYou can get the query by calling the method `DataFrame.compile()`.\n\nFor instance, if we reuse the example from the beginning:\n```\nprint(df2.compile())\n```\n\nThis will print the following SQL query:\n```SQL\nWITH pokedex AS (\n  WITH _default_alias_1 AS (\n    \n        SELECT 1 as id, \"Bulbasaur\" as name, [\"Grass\", \"Poison\"] as types, NULL as other_col\n        UNION ALL\n        SELECT 2 as id, \"Ivysaur\" as name, [\"Grass\", \"Poison\"] as types, NULL as other_col\n    \n  )\n  SELECT \n    id,\n    name,\n    types\n  FROM _default_alias_1\n)\n, _default_alias_3 AS (\n  SELECT * FROM pokedex\n)\n, _default_alias_4 AS (\n  SELECT \n    *,\n    ARRAY_LENGTH(types) AS nb_types\n  FROM _default_alias_3\n)\nSELECT \n  * REPLACE (\n    LOWER(name) AS name\n  )\nFROM _default_alias_4\n```\n\n\n## Cool Features\n\nThis feature list is arbitrarily sorted by decreasing order of coolness.\n\n### Data Diff _(NEW!)_\n_Just like git diff, but for data!_\n\nPerforms a diff between two dataframes.\n\nIt can be called from Python or directly via the command line if you made a `pip install bigquery-frame`.\n\n**Example with the command line:**\n\n_Please make sure you followed the [Installation section](#installation) first._\nIn this example, I compared two snapshots of the public table `bigquery-public-data.utility_us.country_code_iso`\nmade at 6 days interval and noticed that a bug had been introduced, as you can see from the diff: the new values\nfor the columns `continent_code` and `continent_name` look like they have been inverted.\n\n```bash\n$ bq-diff --tables test_us.country_code_iso_snapshot_20220921 test_us.country_code_iso_snapshot_20220927 --join-cols country_name\n\nAnalyzing differences...\nWe will try to find the differences by joining the DataFrames together using the provided column: country_name\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:04<00:00,  4.66s/it]\nSchema: ok (10)\n\ndiff NOT ok\n\nSummary:\n\nRow count ok: 278 rows\n\n28 (10.07%) rows are identical\n250 (89.93%) rows have changed\n\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:04<00:00,  4.67s/it]\nFound the following differences:\n+----------------+---------------+---------------+--------------------+----------------+\n|    column_name | total_nb_diff |          
left |              right | nb_differences |\n+----------------+---------------+---------------+--------------------+----------------+\n| continent_code |           250 |            NA |          Caribbean |             26 |\n| continent_code |           250 |            AF |     Eastern Africa |             20 |\n| continent_code |           250 |            AS |       Western Asia |             18 |\n| continent_code |           250 |            AF |     Western Africa |             17 |\n| continent_code |           250 |            EU |    Southern Europe |             16 |\n| continent_code |           250 |            EU |    Northern Europe |             15 |\n| continent_code |           250 |            SA |      South America |             14 |\n| continent_code |           250 |            AS | South-Eastern Asia |             11 |\n| continent_code |           250 |            EU |     Eastern Europe |             10 |\n| continent_code |           250 |            OC |          Polynesia |             10 |\n| continent_name |           250 |        Africa |                 AF |             58 |\n| continent_name |           250 |          Asia |                 AS |             54 |\n| continent_name |           250 |        Europe |                 EU |             53 |\n| continent_name |           250 | North America |                 NA |             39 |\n| continent_name |           250 |       Oceania |                 OC |             26 |\n| continent_name |           250 | South America |                 SA |             15 |\n| continent_name |           250 |    Antarctica |                 AN |              5 |\n+----------------+---------------+---------------+--------------------+----------------+\n```\n\nAn equivalent analysis made with Python is available \nat [examples/data_diff/country_code_iso.py](examples/data_diff/country_code_iso.py).\n\n**Example with Python:**\n\n_Please make sure you followed the [Installation section](#installation) first._\n```python\nfrom bigquery_frame.data_diff import DataframeComparator\nfrom bigquery_frame import BigQueryBuilder\nfrom bigquery_frame.auth import get_bq_client\n\nbq = BigQueryBuilder(get_bq_client())\ndf1 = bq.sql(\"\"\"\n    SELECT * FROM UNNEST ([\n        STRUCT(1 as id, [STRUCT(1 as a, 2 as b, 3 as c)] as my_array),\n        STRUCT(2 as id, [STRUCT(1 as a, 2 as b, 3 as c)] as my_array),\n        STRUCT(3 as id, [STRUCT(1 as a, 2 as b, 3 as c)] as my_array)\n    ])\n\"\"\")\ndf2 = bq.sql(\"\"\"\n    SELECT * FROM UNNEST ([\n        STRUCT(1 as id, [STRUCT(1 as a, 2 as b, 3 as c, 4 as d)] as my_array),\n        STRUCT(2 as id, [STRUCT(2 as a, 2 as b, 3 as c, 4 as d)] as my_array),\n        STRUCT(4 as id, [STRUCT(1 as a, 2 as b, 3 as c, 4 as d)] as my_array)\n   ])\n\"\"\")\n\ndf1.show()\n# +----+----------------------------+\n# | id |                   my_array |\n# +----+----------------------------+\n# |  1 | [{'a': 1, 'b': 2, 'c': 3}] |\n# |  2 | [{'a': 1, 'b': 2, 'c': 3}] |\n# |  3 | [{'a': 1, 'b': 2, 'c': 3}] |\n# +----+----------------------------+\ndf2.show()\n# +----+------------------------------------+\n# | id |                           my_array |\n# +----+------------------------------------+\n# |  1 | [{'a': 1, 'b': 2, 'c': 3, 'd': 4}] |\n# |  2 | [{'a': 2, 'b': 2, 'c': 3, 'd': 4}] |\n# |  4 | [{'a': 1, 'b': 2, 'c': 3, 'd': 4}] |\n# +----+------------------------------------+\ndiff_result = DataframeComparator().compare_df(df1, df2)\ndiff_result.display()\n```\n\nWill produce the following 
output:\n\n```\nSchema has changed:\n@@ -2,3 +2,4 @@\n my_array!.a INTEGER\n my_array!.b INTEGER\n my_array!.c INTEGER\n+my_array!.d INTEGER\nWARNING: columns that do not match both sides will be ignored\ndiff NOT ok\nSummary:\nRow count ok: 3 rows\n1 (25.0%) rows are identical\n1 (25.0%) rows have changed\n1 (25.0%) rows are only in 'left'\n1 (25.0%) rows are only in 'right\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:04<00:00,  4.26s/it]\nFound the following differences:\n+-------------+---------------+-----------------------+-----------------------+----------------+\n| column_name | total_nb_diff |                  left |                 right | nb_differences |\n+-------------+---------------+-----------------------+-----------------------+----------------+\n|    my_array |             1 | [{\"a\":1,\"b\":2,\"c\":3}] | [{\"a\":2,\"b\":2,\"c\":3}] |              1 |\n+-------------+---------------+-----------------------+-----------------------+----------------+\n1 rows were only found in 'left' :\nAnalyzing 4 columns ...\n+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+\n| column_number | column_name | column_type | count | count_distinct | count_null | min | max |               approx_top_100 |\n+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+\n|             0 |          id |     INTEGER |     1 |              1 |          0 |   3 |   3 | [{'value': '3', 'count': 1}] |\n|             1 | my_array!.a |     INTEGER |     1 |              1 |          0 |   1 |   1 | [{'value': '1', 'count': 1}] |\n|             2 | my_array!.b |     INTEGER |     1 |              1 |          0 |   2 |   2 | [{'value': '2', 'count': 1}] |\n|             3 | my_array!.c |     INTEGER |     1 |              1 |          0 |   3 |   3 | [{'value': '3', 'count': 1}] |\n+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+\n1 rows were only found in 'right':\nAnalyzing 4 columns ...\n+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+\n| column_number | column_name | column_type | count | count_distinct | count_null | min | max |               approx_top_100 |\n+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+\n|             0 |          id |     INTEGER |     1 |              1 |          0 |   4 |   4 | [{'value': '4', 'count': 1}] |\n|             1 | my_array!.a |     INTEGER |     1 |              1 |          0 |   1 |   1 | [{'value': '1', 'count': 1}] |\n|             2 | my_array!.b |     INTEGER |     1 |              1 |          0 |   2 |   2 | [{'value': '2', 'count': 1}] |\n|             3 | my_array!.c |     INTEGER |     1 |              1 |          0 |   3 |   3 | [{'value': '3', 'count': 1}] |\n+---------------+-------------+-------------+-------+----------------+------------+-----+-----+------------------------------+\n```\n\n#### Features\n\n- Full support of nested records\n- (improvable) support for repeated records\n- Optimized to work even on huge table with more than 1000 columns\n\n### Analyze\n\nPerform an analysis on a DataFrame, return aggregated stats for each column,\nsuch as count, count distinct, count null, min, max, top 100 most frequent values. 
### Analyze

Performs an analysis of a DataFrame and returns aggregated stats for each column, such as count, count distinct, count null, min, max, and the top 100 most frequent values.

- Optimized to work on wide tables
- Custom aggregation functions can be added
- Aggregations can be grouped by one or several columns

**Example:**
```python
from bigquery_frame.transformations_impl.analyze import __get_test_df
from bigquery_frame.transformations import analyze

df = __get_test_df()

df.show()
# +----+------------+---------------------+--------------------------------------------+
# | id |       name |               types |                                  evolution |
# +----+------------+---------------------+--------------------------------------------+
# |  1 |  Bulbasaur | ['Grass', 'Poison'] | {'can_evolve': True, 'evolves_from': None} |
# |  2 |    Ivysaur | ['Grass', 'Poison'] |    {'can_evolve': True, 'evolves_from': 1} |
# |  3 |   Venusaur | ['Grass', 'Poison'] |   {'can_evolve': False, 'evolves_from': 2} |
# |  4 | Charmander |            ['Fire'] | {'can_evolve': True, 'evolves_from': None} |
# |  5 | Charmeleon |            ['Fire'] |    {'can_evolve': True, 'evolves_from': 4} |
# |  6 |  Charizard |  ['Fire', 'Flying'] |   {'can_evolve': False, 'evolves_from': 5} |
# |  7 |   Squirtle |           ['Water'] | {'can_evolve': True, 'evolves_from': None} |
# |  8 |  Wartortle |           ['Water'] |    {'can_evolve': True, 'evolves_from': 7} |
# |  9 |  Blastoise |           ['Water'] |   {'can_evolve': False, 'evolves_from': 8} |
# +----+------------+---------------------+--------------------------------------------+

analyze(df).show()
# +---------------+------------------------+-------------+-------+----------------+------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# | column_number |            column_name | column_type | count | count_distinct | count_null |       min |       max |                                                                                                                                                                                                                                                                                                                     approx_top_100 |
# +---------------+------------------------+-------------+-------+----------------+------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
# |             0 |                     id |     INTEGER |     9 |              9 |          0 |         1 |         9 |                                                                       [{'value': '1', 'count': 1}, {'value': '2', 'count': 1}, {'value': '3', 'count': 1}, {'value': '4', 'count': 1}, {'value': '5', 'count': 1}, {'value': '6', 'count': 1}, {'value': '7', 'count': 1}, {'value': '8', 'count': 1}, {'value': '9', 'count': 1}] |
# |             1 |                   name |      STRING |     9 |              9 |          0 | Blastoise | Wartortle | [{'value': 'Bulbasaur', 'count': 1}, {'value': 'Ivysaur', 'count': 1}, {'value': 'Venusaur', 'count': 1}, {'value': 'Charmander', 'count': 1}, {'value': 'Charmeleon', 'count': 1}, {'value': 'Charizard', 'count': 1}, {'value': 'Squirtle', 'count': 1}, {'value': 'Wartortle', 'count': 1}, {'value': 'Blastoise', 'count': 1}] |
# |             2 |                 types! |      STRING |    13 |              5 |          0 |      Fire |     Water |                                                                                                                                                                  [{'value': 'Grass', 'count': 3}, {'value': 'Poison', 'count': 3}, {'value': 'Fire', 'count': 3}, {'value': 'Water', 'count': 3}, {'value': 'Flying', 'count': 1}] |
# |             3 |   evolution.can_evolve |     BOOLEAN |     9 |              2 |          0 |     false |      true |                                                                                                                                                                                                                                                                    [{'value': 'true', 'count': 6}, {'value': 'false', 'count': 3}] |
# |             4 | evolution.evolves_from |     INTEGER |     9 |              6 |          3 |         1 |         8 |                                                                                                                            [{'value': 'NULL', 'count': 3}, {'value': '1', 'count': 1}, {'value': '2', 'count': 1}, {'value': '4', 'count': 1}, {'value': '5', 'count': 1}, {'value': '7', 'count': 1}, {'value': '8', 'count': 1}] |
# +---------------+------------------------+-------------+-------+----------------+------------+-----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

**Example with custom aggregation and column groups:**
```python
from bigquery_frame.transformations_impl import analyze_aggs
from bigquery_frame.transformations import analyze

aggs = [
    analyze_aggs.column_number,
    analyze_aggs.column_name,
    analyze_aggs.count,
    analyze_aggs.count_distinct,
    analyze_aggs.count_null,
]
analyze(df, group_by="evolution.can_evolve", _aggs=aggs).orderBy('group.can_evolve', 'column_number').show()
# +-----------------------+---------------+------------------------+-------+----------------+------------+
# |                 group | column_number |            column_name | count | count_distinct | count_null |
# +-----------------------+---------------+------------------------+-------+----------------+------------+
# | {'can_evolve': False} |             0 |                     id |     3 |              3 |          0 |
# | {'can_evolve': False} |             1 |                   name |     3 |              3 |          0 |
# | {'can_evolve': False} |             2 |                 types! |     5 |              5 |          0 |
# | {'can_evolve': False} |             4 | evolution.evolves_from |     3 |              3 |          0 |
# |  {'can_evolve': True} |             0 |                     id |     6 |              6 |          0 |
# |  {'can_evolve': True} |             1 |                   name |     6 |              6 |          0 |
# |  {'can_evolve': True} |             2 |                 types! |     8 |              4 |          0 |
# |  {'can_evolve': True} |             4 | evolution.evolves_from |     6 |              3 |          3 |
# +-----------------------+---------------+------------------------+-------+----------------+------------+
```
## Billing

All queries are run on BigQuery, so BigQuery's usual billing applies to your queries.
The following operations trigger a query execution:

- `df.show()`
- `df.persist()`
- `df.createOrReplaceTempTable()`
- `df.write`

The operation `df.schema` also triggers a query execution, but it uses a `LIMIT 0`, which makes BigQuery return 0 rows and charge **0 cents** for it.
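In other words, you can inspect what a query would return without paying for an actual execution. A minimal sketch (it assumes default application credentials, since `BigQueryBuilder` can create its own client when none is passed):

```python
from bigquery_frame import BigQueryBuilder

bq = BigQueryBuilder()  # creates its own BigQuery client when none is passed
df = bq.sql("SELECT 1 as id, 'Bulbasaur' as name")

print(df.schema)  # runs the query with LIMIT 0: 0 rows returned, 0 cents billed
df.show()         # actually executes the query: normal billing applies
```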
### Query caching

Since version 0.4.0, the queries generated by bigquery-frame are 100% deterministic: executing the same DataFrame code twice should send the same SQL query to BigQuery, thus leveraging BigQuery's query caching.
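A small illustration of what this means in practice; the caching itself happens entirely on BigQuery's side:

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame import functions as f

bq = BigQueryBuilder()
df = bq.sql("SELECT 1 as a").withColumn("b", f.expr("a + 1"))

df.show()  # first execution: the generated SQL is sent to BigQuery and billed
df.show()  # the exact same SQL is generated, so BigQuery can serve it from its cache
```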
### Accessing billing stats

Since version 0.4.2, the BigQueryBuilder gathers statistics about billing.

Example:
```python
from bigquery_frame import BigQueryBuilder

bq = BigQueryBuilder()
df = bq.sql("SELECT 1 as a").persist()
df.show()
print(bq.stats.human_readable())
```
will print:
```
+---+
| a |
+---+
| 1 |
+---+
Estimated bytes processed : 8.00 B
Total bytes processed : 8.00 B
Total bytes billed : 10.00 MiB
```
_The number of bytes billed is larger because BigQuery applies a minimal pricing to all queries run, to account for execution overhead (except for queries with `LIMIT 0`)._
More details in the [BigQuery documentation](https://cloud.google.com/bigquery/pricing#on_demand_pricing).

### Cost of advanced transformations

Some advanced transformations, like `analyze` and `DataframeComparator.compare_df`, persist intermediary results.

- `analyze` scans each column of the table exactly once and stores very small intermediary results, which should cost no more than 10 MiB per 50 columns in the table analyzed (around 200 MiB for a 1000-column table, for instance). This is negligible, and the cost of `analyze` should be comparable to that of a regular `SELECT *`.

- `DataframeComparator.compare_df` persists multiple intermediary results, which should never exceed 2 times the size of the input tables. It also has a built-in safeguard to prevent any combinatorial explosion, should the user provide a join key that has duplicates. Overall, the cost of a diff should be comparable to scanning both input tables 4 to 6 times.

## Debugging

If you have already used Spark, you probably remember times when you tried to troubleshoot a complex transformation pipeline but had trouble finding at which exact line the bug was hidden, because Spark's lazy evaluation often makes it crash *several steps **after*** the actual buggy line.

To make debugging easier, I added a `debug` flag which, when set to True, validates each transformation step instead of lazily evaluating everything at the end.

Examples:
```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame.auth import get_bq_client

bq = BigQueryBuilder(get_bq_client())
df1 = bq.sql("""SELECT 1 as a""")

# This line is wrong, but since we don't fetch any result here, the error will only be caught later.
df2 = df1.select("b")

# The execution fails at this line.
df2.show()
```

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame.auth import get_bq_client

bq = BigQueryBuilder(get_bq_client())

# This time, we activate the "debug mode".
bq.debug = True

df1 = bq.sql("""SELECT 1 as a""")

# Since debug mode is enabled, the execution fails at this line.
df2 = df1.select("b")

df2.show()
```

Of course, this is costly in performance, *even if query validation does not incur extra query costs*, so it is advised to keep this option set to `False` by default and to use it only for troubleshooting.
## Known limitations

This project started as a POC, and while I keep improving it with time, I did take some shortcuts and did not try to optimize the query length. In particular, bigquery-frame generates _**a lot**_ of CTEs, and any serious project trying to use it might reach the maximum query length very quickly.

Here is a list of other known limitations; please also see the [Further developments](#further-developments) section for a list of missing features.

- `DataFrame.withColumn`:
  - Unlike in Spark, replacing an existing column is not done automatically; an extra argument `replace=True` must be passed (see the sketch after this list).
- `DataFrame.createOrReplaceTempView`:
  - I kept the same name as Spark for consistency, but it does not create an actual view on BigQuery: it just emulates Spark's behaviour by using a CTE. Because of this, if you replace a temp view that already exists, the new view cannot derive from the old view (while in Spark this is possible).
- `DataFrame.join`:
  - When doing a select after a join, table prefixes MUST always be used on column names. For this reason, users SHOULD always make sure the DataFrames they are joining are properly aliased.
  - When chaining multiple joins, the name of the first DataFrame is not available in the select clause.
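A short sketch illustrating the `withColumn` and `join` points above (the table and column names are purely illustrative):

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame import functions as f

bq = BigQueryBuilder()
df1 = bq.sql("SELECT 1 as id, 'Bulbasaur' as name")
df2 = bq.sql("SELECT 1 as id, 'Grass' as main_type")

# Replacing an existing column requires an explicit replace=True:
df1 = df1.withColumn("name", f.expr("UPPER(name)"), replace=True)

# Alias both sides before joining, and prefix the column names in the select:
p = df1.alias("p")
t = df2.alias("t")
p.join(t, on=p["id"] == t["id"]).select("p.name", "t.main_type").show()
```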
## Further developments

Functions not supported yet:

- `DataFrame.groupBy`
- UDTFs such as `explode`, `explode_outer`, etc.

Other improvements:

- One main improvement would be to make DataFrames more _schema-conscious_. Currently, the only way to know a DataFrame's schema is to make a round-trip with BigQuery. Ideally, we could improve this so that simple, well-defined transformations update the DataFrame schema directly on the client side. Of course, we would always have to ask BigQuery when evaluating raw SQL expressions (e.g. `BigQueryBuilder.sql`, `functions.expr`).
- Another improvement would be to optimize the size of the generated query by producing a single SELECT statement combining multiple steps (JOIN, GROUP BY, HAVING, etc.) rather than one CTE for each step.

Also, it would be cool to expand this to other SQL engines than BigQuery (contributors are welcome :wink:).

## Similar projects

### Ibis project

[ibis-project](https://github.com/ibis-project/ibis) provides a Python API, inspired by pandas and dplyr, that generates queries against multiple analytics backends (SQL or DataFrame-based), such as Pandas, Dask, PySpark and [BigQuery](https://github.com/ibis-project/ibis-bigquery).

I only played with ibis very briefly, so I might have misunderstood some parts, but from my understanding, I noted the following differences between bigquery-frame and ibis-bigquery.

Unlike bigquery-frame (at least for now), ibis is always _schema-conscious_, which has several upsides:
- If you select a column that does not exist in a DataFrame, ibis will detect it immediately, while bigquery-frame will only notice it once the query is executed, unless the debug mode is active.
- Ibis is capable of telling which DataFrame a column comes from after a join, while bigquery-frame isn't at the moment.
- Ibis's query compilation is also more advanced, and the SQL queries generated by ibis are much shorter and cleaner than the ones generated by bigquery-frame for the moment.

But this also has some downsides:
- ibis-bigquery currently does not support mixing raw SQL expressions into your transformations. This might be added in the future, as ibis is already capable of mixing raw SQL expressions with some other backends. Since bigquery-frame is not schema-conscious, support for raw SQL expressions was very easy to add. This is especially useful if you want to use "that brand-new SQL feature that was just added in BigQuery": with bigquery-frame you can always fall back to pure SQL, while with ibis you must wait until that feature is integrated into ibis-bigquery.

Another key difference between ibis-bigquery and bigquery-frame is that ibis-bigquery is part of a larger cross-platform framework, while bigquery-frame only supports BigQuery at the moment. This allows bigquery-frame to support some of BigQuery's more exotic features more easily. For instance, bigquery-frame offers good support for nested repeated fields (arrays of structs), thanks to the `flatten_schema` function and the `functions.transform`, `DataFrame.with_nested_columns` and `DataFrame.select_nested_columns` functions.

I plan to address most of bigquery-frame's downsides in the future, whenever I get the time.
## Why did I make this?

I hope that it will motivate the teams working on BigQuery (or Redshift, or Azure Synapse) to propose a real Python DataFrame API on top of their massively parallel SQL engines. But not something that generates ugly SQL strings: rather something like Spark Catalyst, which directly generates logical plans out of the DataFrame API, without passing through the "SQL string" step.

After starting this POC, I realized that Snowflake had already understood this and developed Snowpark, a Java/Scala (and soon Python) API to run complex workflows on Snowflake, and that [Snowpark's DataFrame API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/scala/com/snowflake/snowpark/DataFrame.html) was clearly borrowed from [Spark's DataFrame (= DataSet[Row]) API](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html) (we recognize several key method names: cache, createOrReplaceTempView, where/filter, collect, toLocalIterator).

I believe such a project could open the gate to hundreds of very cool applications. For instance, did you know that, in its early versions at least, Dataiku Shaker was just a GUI that chained transformations on Pandas DataFrames, and later Spark DataFrames?

Another possible evolution would be to make a DataFrame API capable of speaking multiple SQL dialects. By using it, projects that generate SQL for multiple platforms, like [Malloy](https://github.com/looker-open-source/malloy), could all use the same DataFrame abstraction. Adding support for a new SQL platform would immediately allow all the projects based on it to support this new platform.

I recently discovered several other projects that have similar goals:
- [ibis-project](https://github.com/ibis-project/ibis) provides a unified pandas-like API on top of several backends such as Pandas, Dask, Postgres, DuckDB, PySpark and BigQuery (cf. the [comparison above](#ibis-project)).
- [substrait](https://substrait.io/) aims at providing a formal, universal serialization specification for expressing relational-algebra transformations. The idea is to abstract the query plan from the syntax in which it was expressed, by using a single common query-plan representation for everyone. Maybe one day, if this project becomes mainstream enough, we will have code written with PySpark, Pandas or SQL generate an abstract query plan **using the same specification**, which could then be run by Dask, BigQuery, or Snowflake. Personally, I'm afraid that the industry is moving too fast for such an initiative to catch up and become prevalent one day, but of course I would LOVE to be proven wrong.

Given that the ibis-project is much more advanced than mine when it comes to multilingualism, I think I will pursue my efforts by deep-diving into more advanced features for BigQuery only. Perhaps one day I or someone else will port them to the ibis-project.

Should anyone be interested in trying to POC an extension of this project to Redshift, Postgres, Azure Synapse, or any other SQL engine, I would be very glad to discuss it.

## Changelog

### 0.5.0

**Improvements:**

- Added support for `DataFrame.groupBy`. It also supports `DataFrame.groupBy(...).pivot(...)`.
- `transformations.pivot` now supports doing multiple aggregations at the same time. Its signature has changed: the arguments `agg_fun` and `agg_col` have been replaced with `aggs`, and the argument `implem_version` has been removed (the first implementation has been removed).
- Added a new method `DataFrame.show_string` that returns the string resulting from `DataFrame.show` without printing it.
- Added a new method `DataFrame.transform` that provides a concise syntax for chaining custom transformations.
- `functions.array_agg` now supports passing multiple columns in the `order_by` argument.
- Added new `functions`:
  - `avg` and `mean`
  - `lower` and `upper`
  - `regexp_replace`
  - `explode`, `explode_outer`, `posexplode` and `posexplode_outer`
- The method `functions.lit` can now generate most BigQuery types:
  - STRING, BYTES
  - DATE, TIME, DATETIME, TIMESTAMP
  - INT, FLOAT, BIGNUMERIC
  - ARRAY and STRUCT
- Added a new method `Column.isin`.
- Improved `functions.sort_array` and `functions.transform`: they can now be correctly used together, but their signatures have changed, and they now take functions as arguments instead of strings (see the sketch after this list).
- **Complete overhaul of data-diff:**
  - data-diff now supports complex data types. Declaring a repeated field (e.g. `"s!.id"`) in `join_cols` will now explode the corresponding array and perform the diff on it.
  - When columns are removed or renamed, they are now still displayed in the per-column diff report.
  - data-diff now generates an HTML report: it is fully standalone and can be opened without any internet connection.
  - A user-friendly error is now raised when one of the `join_cols` does not exist.
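A minimal sketch of the new function-based signatures of `sort_array` and `transform`; the argument types come from the breaking-changes list below, but the exact usage shown here (including the `arr` column name) is an educated guess and has not been verified against the library:

```python
from bigquery_frame import BigQueryBuilder
from bigquery_frame import functions as f

bq = BigQueryBuilder()
df = bq.sql("SELECT [3, 1, 2] as arr")

# sort_array now takes an optional sort_keys function, and transform takes
# a Callable[[Column], Column] instead of a string expression:
df.select(
    f.transform(f.sort_array(df["arr"]), lambda c: c + 1).alias("arr")
).show()
# Expected (illustrative): [2, 3, 4]
```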
**Breaking changes:**

- Dropped support for Python 3.7 and 3.8; added support for Python 3.11 and 3.12.
- `DataFrame.join` no longer supports string expressions. For instance, `df1.join(df2, on="col1 == col2")` must be rewritten as `df1.join(df2, on=df1["col1"] == df2["col2"])`.
- The signature of the `transformations.pivot` method has changed: the arguments `agg_fun` and `agg_col` have been replaced with `aggs`, and the argument `implem_version` has been removed (the first implementation has been removed).
- The signature of the `functions.sort_array` method has changed: the second argument `sort_cols: Union[Column, List[Column]]` was replaced with `sort_keys: Optional[Callable[[Column], Union[Column, List[Column]]]] = None`.
- The signature of the `functions.transform` method has changed: the second argument `transform_col: Column` was replaced with `func: Callable[[Column], Column]`.
- **data-diff:**
  - The `bigquery_frame.data_diff.DataframeComparator` object has been removed. Please use the method `bigquery_frame.data_diff.compare_dataframes` directly.
  - The package `bigquery_frame.data_diff.diff_results` has been renamed to `diff_result`.

**Bugfixes:**

- `transformations.unpivot` now works with an empty `pivot_columns` list.

### 0.4.4

- Breaking change: the `flatten_schema` method has been moved from `transformations` to `data_type_utils`.

### 0.4.3

- Added the `functions.array_agg` method.
- Added a first graph algorithm: `graph.connected_component` computes the connected components of a graph using the "small star / large star" algorithm, which is conjectured to perform in `O(n log(n))`.
- Added `functions.from_base_32` and `functions.from_base_64`.

### 0.4.2

- The BigQueryBuilder now aggregates stats about the total number of bytes processed and billed. This is useful to check how much you spent after running a diff (^ ^). It can be displayed with `print(bq.stats.human_readable())`.
- Added a [`BigQueryBuilder.debug` attribute](#debugging). When set to True, each step of a DataFrame transformation flow is validated, instead of compiling the query lazily when a result needs to be fetched.
- Added a `bq-diff` command that can be called directly from bash. Run `bq-diff --help` for more information.

### 0.4.1

- Added the `DataFrame.write` feature. It currently supports partitioning, table-level options, and several insertion modes:
  - `append`: append the contents of the DataFrame to the existing table.
  - `overwrite`: replace the destination table with the new data if it already exists.
  - `error` or `errorifexists`: throw an exception if the destination table already exists.
  - `ignore`: silently ignore this operation if the destination table already exists.
- `BigQueryBuilder` now tries to create its own BigQuery client if none is passed.
### 0.4.0

#### New exciting features!

Several new features that make working with nested structures easier were added.

- Added `DataFrame.select_nested_columns` and `DataFrame.with_nested_columns`, which make transformations of nested values much easier.
- Added `functions.transform`, useful to transform arrays.
- Added `transformations.normalize_arrays`, which automatically sorts all arrays in a DataFrame, including arrays of arrays of arrays...
- Added `transformations.harmonize_dataframes`, which takes two DataFrames and transforms them into DataFrames with the same schema.
- Experimental: added data-diff capabilities, including a DataframeComparator which can perform a diff between two DataFrames. Extremely useful for non-regression testing.
- Generated queries are now deterministic: this means that if you re-run the same DataFrame code twice, the exact same query will be sent to BigQuery twice, thus leveraging query caching.

#### Other features

- Added automatic retries when BigQuery returns an InternalServerError. We now do 3 tries by default.
- Added `functions.to_base32` and `functions.to_base64`. `from_base_32` and `from_base_64` will be added later, once a [bug in python-tabulate](https://github.com/astanin/python-tabulate/issues/192) is fixed.
- Added `Column.__mod__` (e.g. `f.when(c % 2 == 0)`).

#### Breaking changes

- Improved typing of `Column.when.otherwise`: `functions.when` now returns a `WhenColumn`, a special type of `Column` with two extra methods, `when` and `otherwise`.
- Changed `functions.sort_array`'s signature to make it consistent with `transform`.

### 0.3.4

#### Features

- Added `Column[...]` (`__getitem__`), which can be used to access struct or array elements.

#### Bugfixes

- Fixed various bugs in `transformations.analyze`:
  - It was crashing on `ARRAY<STRUCT<ARRAY<...>>>`.
  - It was crashing on columns of type BYTES.
  - Columns used in `group_by` were analyzed, which is useless because the group is constant.

### 0.3.3

#### Breaking changes

- `bigquery_frame.transformations.analyze` now returns one extra column named `column_number`.

#### Features

- Added various `Column` methods: `asc`, `desc`.
- Added various `functions` methods: `replace`, `array`.

#### Bugfixes

- `DataFrame.sort` now works on aliased columns.
- `Column.alias` now works on SQL keywords.
- `functions.eqNullSafe` now never returns NULL.
- Fixed incorrect line numbering when the query was displayed on error.

### 0.3.2

#### Breaking changes

- The `Column` constructor no longer accepts `alias` as an argument. Use `Column.alias()` instead.

#### Features

- Added various `Column` methods: `cast`, `~`, `isNull`, `isNotNull`, `eqNullSafe`.
- Added various `functions` methods: `concat`, `length`, `sort_array`, `substring`.

#### Bugfixes

- Fixed broken `Column.when().otherwise().alias()`.

### 0.3.1

#### Breaking changes

- Dropped support for Python 3.6.
- Bumped dependency versions.
- `DataFrame.toPandas()` now requires extra permissions by default (the BigQuery ReadSession User role), but downloads data faster.

#### Features

- Added the `functions.cast()` method.
- We now print the whole query in the error message when it fails.
- Added `DataFrame.join()`. This is a first implementation, which is a little clumsy.

#### Bugfixes

- Fixed DataFrame deps being lost when using `df.alias()`.

",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A DataFrame API for Google BigQuery",
    "version": "0.5.0",
    "project_urls": {
        "Homepage": "https://github.com/FurcyPin/bigquery-frame",
        "Repository": "https://github.com/FurcyPin/bigquery-frame"
    },
    "split_keywords": [
        "bigquery",
        " dataframe"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "eb1c375b517230dd7f0b298fea25991ff7e7e08269c7b625134ff20d992787d7",
                "md5": "77db07aaeadd1886913d64082bd10922",
                "sha256": "da0f0895924604477a8051f1465c424eec900e9a930aaf5600074f29188dcf6f"
            },
            "downloads": -1,
            "filename": "bigquery_frame-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "77db07aaeadd1886913d64082bd10922",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.9",
            "size": 137785,
            "upload_time": "2024-07-04T21:59:57",
            "upload_time_iso_8601": "2024-07-04T21:59:57.582014Z",
            "url": "https://files.pythonhosted.org/packages/eb/1c/375b517230dd7f0b298fea25991ff7e7e08269c7b625134ff20d992787d7/bigquery_frame-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bc1e6fd24dafba24a928428b25a864d369c6d52a8e8c32fded27da24b63c87b2",
                "md5": "4255bcaf6fd0e4b6e5076e7841aa5380",
                "sha256": "0ce03f222987c92ea9a76ef2701b593ddde08bafaf8ba179a58ffee4faa383ad"
            },
            "downloads": -1,
            "filename": "bigquery_frame-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4255bcaf6fd0e4b6e5076e7841aa5380",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.9",
            "size": 123181,
            "upload_time": "2024-07-04T22:00:00",
            "upload_time_iso_8601": "2024-07-04T22:00:00.128406Z",
            "url": "https://files.pythonhosted.org/packages/bc/1e/6fd24dafba24a928428b25a864d369c6d52a8e8c32fded27da24b63c87b2/bigquery_frame-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-04 22:00:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "FurcyPin",
    "github_project": "bigquery-frame",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "bigquery-frame"
}
        