Name | lakebench |
Version | 0.8.1 |
home_page | None |
Summary | A multi-modal Python library for benchmarking Azure lakehouse engines and ELT scenarios, supporting both industry-standard and novel benchmarks. |
upload_time | 2025-07-25 03:37:29 |
maintainer | None |
docs_url | None |
author | Miles Cole |
requires_python | >=3.8 |
license | MIT License
Copyright (c) 2025 Miles Cole
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
keywords | None |
VCS | https://github.com/mwc360/LakeBench |
bugtrack_url | None |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# LakeBench
🌊 LakeBench is the first Python-based, multi-modal benchmarking framework designed to evaluate performance across multiple lakehouse compute engines and ELT scenarios. Supporting a variety of engines and both industry-standard and novel benchmarks, LakeBench enables comprehensive, apples-to-apples comparisons in a single, extensible Python library.
## 🚀 The Mission of LakeBench
LakeBench exists to bring clarity, trust, accessibility, and relevance to engine benchmarking by focusing on four core pillars:
1. **End-to-End ELT Workflows Matter**
Most benchmarks focus solely on analytic queries. But in practice, data engineers manage full data pipelines — loading data, transforming it (in batch, incrementally, or even streaming), maintaining tables, and then querying.
> LakeBench proposes that **the entire end-to-end data lifecycle managed by data engineers is relevant**, not just queries.
1. **Variety in Benchmarks Is Essential**
Real-world pipelines deal with different data shapes, sizes, and patterns. One-size-fits-all benchmarks miss this nuance.
> LakeBench covers a **variety of benchmarks** that represent **diverse workloads** — from bulk loads to incremental merges to maintenance jobs to ad-hoc queries — providing a richer picture of engine behavior under different conditions.
1. **Consistency Enables Trustworthy Comparisons**
Somehow, every engine claims to be the fastest at the same benchmark, _at the same time_. Without a standardized framework, with support for many engines, comparisons are hard to trust and even more difficult to reproduce.
> LakeBench ensures **consistent methodology across engines**, reducing the likelihood of implementation bias and enabling repeatable, trustworthy results. Engine subject matter experts are _encouraged_ to submit PRs to tune code as needed so that their preferred engine is best represented.
1. **Accessibility starts with `pip install`**
Most benchmarking toolkits are highly inaccessible to the beginner data engineer, requiring the user to build the package or install it via a JAR, with no Python bindings.
> LakeBench is intentionally built as a **Python-native library**, installable via `pip` from PyPI, so it's easy for any engineer to get started — no JVM or compilation required. It's so lightweight and approachable, you could even use it just for generating high-quality sample data.
## ✅ Why LakeBench?
- **Multi-Engine**: Benchmark Spark, DuckDB, Polars, and many more planned, side-by-side
- **Lifecycle Coverage**: Ingest, transform, maintain, and query — just like real workloads
- **Diverse Workloads**: Test performance across varied data shapes and operations
- **Consistent Execution**: One framework, many engines
- **Extensible by Design**: Add engines or additional benchmarks with minimal friction
- **Dataset Generation**: Out-of-the-box dataset generation for all benchmarks
- **Rich Logs**: Automatically logged engine version, compute size, duration, estimated execution cost, etc.
LakeBench empowers data teams to make informed engine decisions based on real workloads, not just marketing claims.
## 💪 Benchmarks
LakeBench currently supports four benchmarks with more to come:
- **ELTBench**: A benchmark with various modes (`light`, `full`) that simulates typical ELT workloads:
  - Raw data load (Parquet → Delta)
  - Fact table generation
  - Incremental merge processing
  - Table maintenance (e.g. OPTIMIZE/VACUUM)
  - Ad-hoc analytical queries
- **[TPC-DS](https://www.tpc.org/tpcds/)**: An industry-standard benchmark for complex analytical queries, featuring 24 source tables and 99 queries. Designed to simulate decision support systems and analytics workloads.
- **[TPC-H](https://www.tpc.org/tpch/)**: Focuses on ad-hoc decision support with 8 tables and 22 queries, evaluating performance on business-oriented analytical workloads.
- **[ClickBench](https://github.com/ClickHouse/ClickBench)**: A benchmark that simulates ad-hoc analytical and real-time queries on clickstream, traffic analysis, web analytics, machine-generated data, structured logs, and events data. The load phase (single flat table) is followed by 43 queries.
_Planned_
- **[TPC-DI](https://www.tpc.org/tpcdi/)**: An industry-standard benchmark for data integration workloads, evaluating end-to-end ETL/ELT performance across heterogeneous sources — including data ingestion, transformation, and loading processes.
## ⚙️ Engine Support Matrix
LakeBench supports multiple lakehouse compute engines. Each benchmark scenario declares which engines it supports via `<BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY`.
| Engine | ELTBench | TPC-DS | TPC-H | ClickBench |
|-----------------|:--------:|:------:|:-------:|:----------:|
| Spark (Fabric)  |    ✅    |   ✅   |   ✅    |     ✅     |
| DuckDB          |    ✅    |   ✅   |   ✅    |     ✅     |
| Polars          |    ✅    |   ⚠️   |   ⚠️    |     🔜     |
| Daft            |    ✅    |   ⚠️   |   ⚠️    |     🔜     |
> **Legend:**  
> ✅ = Supported  
> ⚠️ = Some queries fail due to syntax issues (e.g. Polars doesn't support SQL non-equi joins; Daft is missing many standard SQL constructs such as DATE_ADD, CROSS JOIN, subqueries, non-equi joins, and CASE with operand).  
> 🔜 = Coming Soon  
> (Blank) = Not currently supported
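To check this mapping programmatically, you can inspect the registry directly. A minimal sketch, assuming `BENCHMARK_IMPL_REGISTRY` is a plain class-level dict keyed by engine class (see the extensibility notes below):

```python
from lakebench.benchmarks import TPCDS

# Print which engine classes TPC-DS is registered for, and whether each uses an
# engine-specific implementation class or falls back to the generic engine methods.
for engine_cls, impl_cls in TPCDS.BENCHMARK_IMPL_REGISTRY.items():
    impl_name = impl_cls.__name__ if impl_cls is not None else "generic engine methods"
    print(f"{engine_cls.__name__}: {impl_name}")
```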
## 🔌 Extensibility by Design
LakeBench is designed to be _extensible_, both for additional engines and benchmarks.
- You can register **new engines** without modifying core benchmark logic.
- You can add **new benchmarks** that reuse existing engines and shared engine methods.
- LakeBench extension libraries can be created to extend core LakeBench capabilities with additional custom benchmarks and engines (e.g. `MyCustomSynapseSpark(Spark)`, `MyOrgsELT(BaseBenchmark)`).
New engines can be added by subclassing an existing engine class. Existing benchmarks can then register support for additional engines as shown below:
```python
from lakebench.benchmarks import TPCDS
TPCDS.register_engine(MyNewEngine, None)
```
_`register_engine` is a class method that updates `<BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY`. It takes two inputs: the engine class being registered and, if required, an engine-specific benchmark implementation class (passing `None` falls back to the methods of the generic engine class)._
This architecture encourages experimentation, benchmarking innovation, and easy adaptation.
_Example:_
```python
from lakebench.engines import BaseEngine
class MyCustomEngine(BaseEngine):
    ...
from lakebench.benchmarks.elt_bench import ELTBench
# registering the engine is only required if you aren't subclassing an existing registered engine
ELTBench.register_engine(MyCustomEngine, None)
benchmark = ELTBench(engine=MyCustomEngine(...))
benchmark.run()
```
---
# Using LakeBench
## 📦 Installation
Install from PyPI:
```bash
pip install lakebench[duckdb,polars,daft,tpcds_datagen,tpch_datagen,sparkmeasure]
```
_Note: in this initial beta version, all engines have only been tested inside Microsoft Fabric Python and Spark Notebooks._
## Example Usage
To run any LakeBench benchmark, first perform a one-time generation of the data required for the benchmark and scale of interest. LakeBench provides datagen classes to quickly generate the Parquet datasets required by the benchmarks.
### Data Generation
Data generation is provided via the DuckDB [TPC-DS](https://duckdb.org/docs/stable/core_extensions/tpcds) and [TPC-H](https://duckdb.org/docs/stable/core_extensions/tpch) extensions. The LakeBench wrapper around DuckDB adds support for writing out Parquet files with a target row-group file size, since the files DuckDB generates are otherwise atypically small (~10 MB) and best suited to ultra-small-scale scenarios. LakeBench targets 128 MB row groups by default, configurable via the `target_row_group_size_mb` parameter of both the TPC-H and TPC-DS DataGenerator classes.
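As a hedged sketch of overriding that default (the class name and `target_row_group_size_mb` parameter come from this README; the scale factor and output path are illustrative):

```python
from lakebench.datagen import TPCHDataGenerator

# Generate TPC-H data with larger 256 MB row groups instead of the 128 MB default.
datagen = TPCHDataGenerator(
    scale_factor=10,
    target_mount_folder_path='/lakehouse/default/Files/tpch_sf10',
    target_row_group_size_mb=256
)
datagen.run()
```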
_Generating scale factor 1 data takes about 1 minute on a 2-vCore VM._
#### TPC-H Data Generation
```python
from lakebench.datagen import TPCHDataGenerator
datagen = TPCHDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpch_sf1'
)
datagen.run()
```
#### TPC-DS Data Generation
```python
from lakebench.datagen import TPCDSDataGenerator
datagen = TPCDSDataGenerator(
    scale_factor=1,
    target_mount_folder_path='/lakehouse/default/Files/tpcds_sf1'
)
datagen.run()
```
_Notes:_
- TPC-H data can be generated up to SF100; however, I hit OOM issues when generating SF1000 on a 64-vCore machine.
- TPC-DS data up to SF1000 can be generated on a 32-vCore machine.
- TPC-H and TPC-DS datasets up to SF10 will complete in minutes on a 2-vCore machine.
- The ClickBench dataset (only one size is available) should download in about 1 minute as partitioned files and about 6 minutes as a single file.
#### Is BYO Data Supported?
If you want to use your own TPC-DS, TPC-H, or ClickBench Parquet datasets, that is fine and encouraged as long as they conform to the specification. The Databricks [spark-sql-perf](https://github.com/databricks/spark-sql-perf) repo, which is commonly used to produce TPC-DS and TPC-H datasets for benchmarking Spark, has two critical schema bugs (typos?) in its implementation. Rather than perpetuating these typos, LakeBench sticks to the schema defined in the specs. An [issue](https://github.com/databricks/spark-sql-perf/issues/219) was raised to track whether this gets fixed. These datasets need to be fixed before running LakeBench with any data generated from spark-sql-perf:
1. The `c_last_review_date_sk` column in the TPC-DS `customer` table was named `c_last_review_date` (the **_sk** suffix is missing) and is generated as a string, whereas the TPC-DS spec defines this column as an Identity type, which maps to an integer. The data value is still a surrogate key, but the schema doesn't exactly match the specification.
_Fix via:_
```python
from pyspark.sql import functions as sf

df = spark.read.parquet(f".../customer/")
df = df.withColumn('c_last_review_date_sk', sf.col('c_last_review_date').cast('int')).drop('c_last_review_date')
df.write.mode('overwrite').parquet(f".../customer/")
```
1. The `s_tax_percentage` column in the TPC-DS `store` table was named with a typo: `s_tax_precentage` (is "**pre**centage" the precursor of a "**per**centage"??).
_Fix via:_
```python
df = spark.read.parquet(f"..../store/")
df = df.withColumnRenamed('s_tax_precentage', 's_tax_percentage')
df.write.mode('overwrite').parquet(f"..../store/")
```
### Fabric Spark
```python
from lakebench.engines import FabricSpark
from lakebench.benchmarks import ELTBench
engine = FabricSpark(
    lakehouse_workspace_name="workspace",
    lakehouse_name="lakehouse",
    lakehouse_schema_name="schema",
    spark_measure_telemetry=True
)
benchmark = ELTBench(
    engine=engine,
    scenario_name="sf10",
    mode="light",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)
benchmark.run()
```
> _Note: The `spark_measure_telemetry` flag can be enabled to capture stage metrics in the results. The `sparkmeasure` install option must be used when `spark_measure_telemetry` is enabled (`%pip install lakebench[sparkmeasure]`). Additionally, the Spark-Measure JAR must be installed from Maven: https://mvnrepository.com/artifact/ch.cern.sparkmeasure/spark-measure_2.13/0.24_
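A hedged sketch of attaching that JAR via standard Spark configuration, applicable where you control Spark session creation (Fabric-managed sessions typically attach Maven packages through an environment or library setting instead):

```python
from pyspark.sql import SparkSession

# Pull spark-measure from Maven when the session is created
# (coordinates taken from the Maven link above).
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "ch.cern.sparkmeasure:spark-measure_2.13:0.24")
    .getOrCreate()
)
```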
### Polars
```python
from lakebench.engines import Polars
from lakebench.benchmarks import ELTBench
engine = Polars(
    delta_abfss_schema_path='abfss://...'
)
benchmark = ELTBench(
    engine=engine,
    scenario_name="sf10",
    mode="light",
    tpcds_parquet_abfss_path="abfss://...",
    save_results=True,
    result_abfss_path="abfss://..."
)
benchmark.run()
```
---
## Managing Queries Over Various Dialects
LakeBench supports multiple engines that each leverage different SQL dialects and capabilities. To handle this diversity while maintaining consistency, LakeBench employs a **hierarchical query resolution strategy** that balances automated transpilation with engine-specific customization.
### Query Resolution Strategy
LakeBench uses a three-tier fallback approach for each query:
1. **Engine-Specific Override** (if exists - rare)
- Custom queries tailored for specific engine limitations or optimizations
- Example: `src/lakebench/benchmarks/tpch/resources/queries/daft/q14.sql` -> Daft is generally sensitive to multiplying decimals and thus requires casting to `DOUBLE` or managing specific decimal types.
2. **Parent Engine Class Override** (if exists - rare)
- Shared customizations for engine families, e.g. Spark (_not yet leveraged by any engine and benchmark combination_).
- Example: `src/lakebench/benchmarks/tpch/resources/queries/spark/q14.sql`
3. **Canonical + Transpilation** (fallback - common)
- SparkSQL canonical queries are automatically transpiled via SQLGlot. Each engine registers its `SQLGLOT_DIALECT` constant, enabling automatic transpilation when custom queries aren't needed.
- Example: `src/lakebench/benchmarks/tpch/resources/queries/canonical/q14.sql`
In all cases, tables are automatically qualified with the catalog and schema if applicable to the engine class.
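A simplified sketch of that resolution order (illustrative only: the folder layout mirrors the example paths above, but LakeBench's internal helper names may differ):

```python
from pathlib import Path

import sqlglot

QUERIES_DIR = Path("src/lakebench/benchmarks/tpch/resources/queries")

def resolve_query(query_name, engine_dialect, engine_dirs):
    """Return the SQL for `query_name` using the three-tier fallback.

    `engine_dirs` lists override folders from most to least specific,
    e.g. ["daft"] or ["fabric_spark", "spark"].
    """
    # Tiers 1 and 2: engine-specific override, then parent engine class override.
    for folder in engine_dirs:
        override = QUERIES_DIR / folder / f"{query_name}.sql"
        if override.exists():
            return override.read_text()

    # Tier 3: canonical SparkSQL transpiled to the engine's dialect via SQLGlot.
    canonical = (QUERIES_DIR / "canonical" / f"{query_name}.sql").read_text()
    return sqlglot.transpile(canonical, read="spark", write=engine_dialect)[0]
```

Keeping the canonical queries in SparkSQL means only genuinely incompatible engines need hand-maintained variants; everything else is transpiled on demand.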
### Why This Approach?
**Real-World Engine Limitations**: Engines like Daft lack support for `DATE_ADD`, `CROSS JOIN`, subqueries, and non-equi joins. Polars doesn't support non-equi joins. Rather than restricting all queries to the lowest common denominator, LakeBench allows targeted workarounds.
**Automated Transpilation Where Possible**: For most queries, SQLGlot can successfully transpile SparkSQL to engine-specific dialects (DuckDB, Postgres, SQLServer, etc.), eliminating manual maintenance overhead and a proliferation of query variants.
**Expert Optimization**: Subject matter experts for specific engines can contribute PRs with optimized query variants that reasonably follow the specification of the benchmark author (e.g. TPC).
### Viewing Generated Queries
To inspect the final query that will be executed for any engine:
```python
from lakebench.benchmarks import TPCH

benchmark = TPCH(engine=MyEngine(...))
query_str = benchmark._return_query_definition('q14')
print(query_str) # Shows final transpiled/customized query
```
This approach ensures **consistency** (same business logic across engines), **accessibility** (as much as possible, engines work out-of-the-box), and **flexibility** (custom optimizations where needed).
# 📬 Feedback / Contributions
Got ideas? Found a bug? Want to contribute a benchmark or engine wrapper? PRs and issues are welcome!
# Acknowledgement of Other _LakeBench_ Projects
The **LakeBench** name is also used by two unrelated academic and research efforts:
- **[RLGen/LAKEBENCH](https://github.com/RLGen/LAKEBENCH)**: A benchmark designed for evaluating vision-language models on multimodal tasks.
- **LakeBench: Benchmarks for Data Discovery over Lakes** ([paper link](https://www.catalyzex.com/paper/lakebench-benchmarks-for-data-discovery-over)):
A benchmark suite focused on improving data discovery and exploration over large data lakes.
While these projects target very different problem domains — such as machine learning and data discovery — they coincidentally share the same name. This project, focused on ELT benchmarking across lakehouse engines, is not affiliated with or derived from either.