| Field           | Value                                                       |
| --------------- | ----------------------------------------------------------- |
| Name            | fugue                                                       |
| Version         | 0.9.1                                                       |
| Home page       | http://github.com/fugue-project/fugue                       |
| Summary         | An abstraction layer for distributed computation            |
| Upload time     | 2024-06-14 17:03:44                                         |
| Author          | The Fugue Development Team                                  |
| Requires Python | >=3.8                                                       |
| License         | Apache-2.0                                                  |
| Keywords        | distributed spark dask ray sql dsl domain specific language |

# Fugue

[![PyPI version](https://badge.fury.io/py/fugue.svg)](https://pypi.python.org/pypi/fugue/)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/fugue.svg)](https://pypi.python.org/pypi/fugue/)
[![PyPI license](https://img.shields.io/pypi/l/fugue.svg)](https://pypi.python.org/pypi/fugue/)
[![codecov](https://codecov.io/gh/fugue-project/fugue/branch/master/graph/badge.svg?token=ZO9YD5N3IA)](https://codecov.io/gh/fugue-project/fugue)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/4fa5f2f53e6f48aaa1218a89f4808b91)](https://www.codacy.com/gh/fugue-project/fugue/dashboard?utm_source=github.com&utm_medium=referral&utm_content=fugue-project/fugue&utm_campaign=Badge_Grade)
[![Downloads](https://static.pepy.tech/badge/fugue)](https://pepy.tech/project/fugue)

| Tutorials                                                                                           | API Documentation                                                                     | Chat with us on slack!                                                                                                   |
| --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| [![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://fugue-tutorials.readthedocs.io/) | [![Doc](https://readthedocs.org/projects/fugue/badge)](https://fugue.readthedocs.org) | [![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai) |


**Fugue is a unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites**.

Fugue is most commonly used for:

*   **Parallelizing or scaling existing Python and Pandas code** by bringing it to Spark, Dask, or Ray with minimal rewrites.
*   Using [FugueSQL](https://fugue-tutorials.readthedocs.io/tutorials/quick_look/ten_minutes_sql.html) to **define end-to-end workflows** on top of Pandas, Spark, and Dask DataFrames. FugueSQL is an enhanced SQL interface that can invoke Python code.

To see how Fugue compares to other frameworks such as dbt, Arrow, Ibis, and PySpark Pandas, see the [comparisons](https://fugue-tutorials.readthedocs.io/#how-does-fugue-compare-to).

## [Fugue API](https://fugue-tutorials.readthedocs.io/tutorials/quick_look/ten_minutes.html)

The Fugue API is a collection of functions that can run on Pandas, Spark, Dask, and Ray. The simplest way to use Fugue is the [`transform()` function](https://fugue-tutorials.readthedocs.io/tutorials/beginner/transform.html). This lets users parallelize the execution of a single function by bringing it to Spark, Dask, or Ray. In the example below, the `map_letter_to_food()` function takes in a mapping and applies it to a column. This is just Pandas and Python so far (without Fugue).

```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df
```
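
As a quick local check before involving any engine, the function can be called directly on the Pandas DataFrame. This is a minimal sketch that repeats the definitions above so it runs on its own. The `.copy()` call is an addition for illustration: the function assigns to a column in place, so passing a copy leaves `input_df` untouched.

```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df

# Pass a copy because the function overwrites the "value" column in place.
local_out = map_letter_to_food(input_df.copy(), map_dict)
print(local_out)
```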

Now, the `map_letter_to_food()` function is brought to the Spark execution engine by invoking Fugue's `transform()` function. The output `schema` and `params` are passed to the `transform()` call. The `schema` is needed because distributed frameworks require the output schema to be declared explicitly. A schema of `"*"` below means all input columns are in the output.

```python
from pyspark.sql import SparkSession
from fugue import transform

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(input_df)

out = transform(sdf,
               map_letter_to_food,
               schema="*",
               params=dict(mapping=map_dict),
               )
# out is a Spark DataFrame
out.show()
```
```rst
+---+------+
| id| value|
+---+------+
|  0| Apple|
|  1|Banana|
|  2|Carrot|
+---+------+
```

<details>
  <summary>PySpark equivalent of Fugue transform()</summary>

  ```python
from typing import Iterator, Union
from pyspark.sql.types import StructType
from pyspark.sql import DataFrame, SparkSession

spark_session = SparkSession.builder.getOrCreate()

def mapping_wrapper(dfs: Iterator[pd.DataFrame], mapping):
    for df in dfs:
        yield map_letter_to_food(df, mapping)

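def run_map_letter_to_food(input_df: Union[DataFrame, pd.DataFrame], mapping):
    # conversion: Pandas input becomes a Spark DataFrame, Spark input is used as-is
    if isinstance(input_df, pd.DataFrame):
        sdf = spark_session.createDataFrame(input_df.copy())
    else:
        # Spark DataFrames are immutable and have no .copy() method
        sdf = input_df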

    schema = StructType(list(sdf.schema.fields))
    return sdf.mapInPandas(lambda dfs: mapping_wrapper(dfs, mapping),
                            schema=schema)

result = run_map_letter_to_food(input_df, map_dict)
result.show()
  ```
</details>

This syntax is simpler, cleaner, and more maintainable than the PySpark equivalent. At the same time, no edits were made to the original Pandas-based function to bring it to Spark. It is still usable on Pandas DataFrames. Fugue `transform()` also supports Dask and Ray as execution engines alongside the default Pandas-based engine.

The Fugue API has a broader collection of functions that are also compatible with Spark, Dask, and Ray. For example, we can use `load()` and `save()` to create an end-to-end workflow compatible with Spark, Dask, and Ray. For the full list of functions, see the [Top Level API](https://fugue.readthedocs.io/en/latest/top_api.html).

```python
import fugue.api as fa

def run(engine=None):
    with fa.engine_context(engine):
        df = fa.load("/path/to/file.parquet")
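        # map_letter_to_food() needs its mapping argument, passed through params
        out = fa.transform(df, map_letter_to_food, schema="*", params=dict(mapping=map_dict))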
        fa.save(out, "/path/to/output_file.parquet")

run()                 # runs on Pandas
run(engine="spark")   # runs on Spark
run(engine="dask")    # runs on Dask
```

All functions underneath the context will run on the specified backend. This makes it easy to toggle between local and distributed execution.

## [FugueSQL](https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html)

FugueSQL is a SQL-based language capable of expressing end-to-end data workflows on top of Pandas, Spark, and Dask. The `map_letter_to_food()` function above is used in the SQL expression below. This is how to use a Python-defined function along with the standard SQL `SELECT` statement.

```python
from fugue.api import fugue_sql
import json

query = """
    SELECT id, value
      FROM input_df
    TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
    """
map_dict_str = json.dumps(map_dict)

# returns Pandas DataFrame
fugue_sql(query, mapping=map_dict_str)

# returns Spark DataFrame
fugue_sql(query, mapping=map_dict_str, engine="spark")
```
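
To see why the mapping is passed through `json.dumps()`, it can help to look at the text that the `{{mapping}}` placeholder expands to. The sketch below imitates the substitution with plain `str.replace()`. This is a stand-in for illustration, on the assumption that the placeholder is filled in by simple template substitution before the query is parsed; it is not a reproduction of Fugue's own templating code.

```python
import json

map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

query = """
    SELECT id, value
      FROM input_df
    TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
    """

# json.dumps() turns the dict into text that can be embedded in the query.
map_dict_str = json.dumps(map_dict)

# Stand-in for the template step: swap the placeholder for the JSON text.
rendered = query.replace("{{mapping}}", map_dict_str)
print(rendered)
```

Printing the rendered text before calling `fugue_sql()` is a cheap way to check that the parameter value looks as intended.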

## Installation

Fugue can be installed through pip or conda. For example:

```bash
pip install fugue
```

In order to use Fugue SQL, it is strongly recommended to install the `sql` extra:

```bash
pip install "fugue[sql]"
```

It also has the following installation extras:

*   **sql**: to support Fugue SQL. Without this extra, the non-SQL part still works. Before Fugue 0.9.0, this extra was included in Fugue's core dependencies, so you didn't need to install it explicitly. **But for 0.9.0+, it is required if you want to use Fugue SQL.**
*   **spark**: to support Spark as the [ExecutionEngine](https://fugue-tutorials.readthedocs.io/tutorials/advanced/execution_engine.html).
*   **dask**: to support Dask as the ExecutionEngine.
*   **ray**: to support Ray as the ExecutionEngine.
*   **duckdb**: to support DuckDB as the ExecutionEngine, read [details](https://fugue-tutorials.readthedocs.io/tutorials/integrations/backends/duckdb.html).
*   **polars**: to support Polars DataFrames and extensions using Polars.
*   **ibis**: to enable Ibis for Fugue workflows, read [details](https://fugue-tutorials.readthedocs.io/tutorials/integrations/backends/ibis.html).
*   **cpp_sql_parser**: to enable the C++ ANTLR parser for Fugue SQL. It can be 50+ times faster than the pure Python parser. For the main Python versions and platforms, there are already pre-built binaries; for the rest, a C++ compiler is needed to build it on the fly.

For example, a common use case is:

```bash
pip install "fugue[duckdb,spark]"
```

Note that if you have already installed Spark or DuckDB independently, Fugue can use them automatically without installing the extras.
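
To check which optional backends are already present in an environment, a small standard-library sketch like the one below can help. The mapping from extra names to import names (`OPTIONAL_BACKENDS`) and the helper `available_backends()` are assumptions made for illustration. The check only reports whether each package is importable, not whether Fugue will pick it up.

```python
from importlib.util import find_spec

# Assumed import names of the packages behind some of the extras.
OPTIONAL_BACKENDS = {
    "spark": "pyspark",
    "dask": "dask",
    "ray": "ray",
    "duckdb": "duckdb",
    "polars": "polars",
    "ibis": "ibis",
}

def available_backends() -> dict:
    """Map each extra name to True if its package can be found for import."""
    return {
        extra: find_spec(module) is not None
        for extra, module in OPTIONAL_BACKENDS.items()
    }

for extra, present in available_backends().items():
    print(f"{extra:8s} {'installed' if present else 'missing'}")
```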

## [Getting Started](https://fugue-tutorials.readthedocs.io/)

The best way to get started with Fugue is to work through the 10-minute tutorials:

*   [Fugue API in 10 minutes](https://fugue-tutorials.readthedocs.io/tutorials/quick_look/ten_minutes.html)
*   [FugueSQL in 10 minutes](https://fugue-tutorials.readthedocs.io/tutorials/quick_look/ten_minutes_sql.html)

For the top level API, see:

*   [Fugue Top Level API](https://fugue.readthedocs.io/en/latest/top_api.html)

The [tutorials](https://fugue-tutorials.readthedocs.io/) can also be run in an interactive notebook environment through binder or Docker:

### Using binder

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/fugue-project/tutorials/master)

**Note that it runs slowly on Binder** because the machines on Binder aren't powerful enough for a distributed framework such as Spark. Parallel executions can become sequential, so some of the performance comparison examples will not give you representative numbers.

### Using Docker

Alternatively, you should get decent performance by running this Docker image on your own machine:

```bash
docker run -p 8888:8888 fugueproject/tutorials:latest
```


## Jupyter Notebook Extension

There is an accompanying [notebook extension](https://pypi.org/project/fugue-jupyter/) for FugueSQL that lets users use the `%%fsql` cell magic. The extension also provides syntax highlighting for FugueSQL cells. It works with both the classic Jupyter Notebook and JupyterLab. More details can be found in the [installation instructions](https://github.com/fugue-project/fugue-jupyter#install).

![FugueSQL gif](https://miro.medium.com/max/700/1*6091-RcrOPyifJTLjo0anA.gif)


## Ecosystem

As an abstraction layer, Fugue works seamlessly with many other open-source projects.

Python backends:

*   [Pandas](https://github.com/pandas-dev/pandas)
*   [Polars](https://www.pola.rs) (DataFrames only)
*   [Spark](https://github.com/apache/spark)
*   [Dask](https://github.com/dask/dask)
*   [Ray](http://github.com/ray-project/ray)
*   [Ibis](https://github.com/ibis-project/ibis/)

FugueSQL backends:

*   Pandas - FugueSQL can run on Pandas
*   [DuckDB](https://github.com/duckdb/duckdb) - in-process SQL OLAP database management
*   [dask-sql](https://github.com/dask-contrib/dask-sql) - SQL interface for Dask
*   SparkSQL
*   [BigQuery](https://fugue-tutorials.readthedocs.io/tutorials/integrations/warehouses/bigquery.html)
*   Trino


Fugue is available as a backend or can integrate with the following projects:

*   [WhyLogs](https://whylogs.readthedocs.io/en/latest/examples/integrations/Fugue_Profiling.html?highlight=fugue) - data profiling
*   [PyCaret](https://fugue-tutorials.readthedocs.io/tutorials/integrations/ecosystem/pycaret.html) - low code machine learning
*   [Nixtla](https://fugue-tutorials.readthedocs.io/tutorials/integrations/ecosystem/nixtla.html) - time series modelling
*   [Prefect](https://fugue-tutorials.readthedocs.io/tutorials/integrations/ecosystem/prefect.html) - workflow orchestration
*   [Pandera](https://fugue-tutorials.readthedocs.io/tutorials/integrations/ecosystem/pandera.html) - data validation
*   [Datacompy (by Capital One)](https://fugue-tutorials.readthedocs.io/tutorials/integrations/ecosystem/datacompy.html) - comparing DataFrames

Registered third-party extensions (mainly for Fugue SQL) include:

*   [Pandas plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) - visualize data using matplotlib or plotly
*   [Seaborn](https://seaborn.pydata.org/api.html) - visualize data using seaborn
*   [WhyLogs](https://whylogs.readthedocs.io/en/latest/examples/integrations/Fugue_Profiling.html?highlight=fugue) - visualize data profiling
*   [Vizzu](https://github.com/vizzuhq/ipyvizzu) - visualize data using ipyvizzu

## Community and Contributing

Feel free to message us on [Slack](http://slack.fugue.ai). We also have [contributing instructions](CONTRIBUTING.md).

### Case Studies

*   [How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue](https://eng.lyft.com/how-lyftlearn-democratizes-distributed-compute-through-kubernetes-spark-and-fugue-c0875b97c3d9)
*   [Clobotics - Large Scale Image Processing with Spark through Fugue](https://medium.com/fugue-project/large-scale-image-processing-with-spark-through-fugue-e510b9813da8)
*   [Architecture for a data lake REST API using Delta Lake, Fugue & Spark (article by bitsofinfo)](https://bitsofinfo.wordpress.com/2023/08/14/data-lake-rest-api-delta-lake-fugue-spark)

### Mentioned Uses

*   [Productionizing Data Science at Interos, Inc. (LinkedIn post by Anthony Holten)](https://www.linkedin.com/posts/anthony-holten_pandas-spark-dask-activity-7022628193983459328-QvcF)
*   [Multiple Time Series Forecasting with Fugue & Nixtla at Bain & Company (LinkedIn post by Fahad Akbar)](https://www.linkedin.com/posts/fahadakbar_fugue-datascience-forecasting-activity-7041119034813124608-u08q?utm_source=share&utm_medium=member_desktop)

## Further Resources

View some of our latest conference presentations and content. For a more complete list, check the [Content](https://fugue-tutorials.readthedocs.io/tutorials/resources/content.html) page in the tutorials.

### Blogs

*   [Why Pandas-like Interfaces are Sub-optimal for Distributed Computing](https://towardsdatascience.com/why-pandas-like-interfaces-are-sub-optimal-for-distributed-computing-322dacbce43)
*   [Introducing FugueSQL — SQL for Pandas, Spark, and Dask DataFrames (Towards Data Science by Khuyen Tran)](https://towardsdatascience.com/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-63d461a16b27)

### Conferences

*   [Distributed Machine Learning at Lyft](https://www.youtube.com/watch?v=_IVyIOV0LgY)
*   [Comparing the Different Ways to Scale Python and Pandas Code](https://www.youtube.com/watch?v=b3ae0m_XTys)
*   [Large Scale Data Validation with Spark and Dask (PyCon US)](https://www.youtube.com/watch?v=2AdvBgjO_3Q)
*   [FugueSQL - The Enhanced SQL Interface for Pandas, Spark, and Dask DataFrames (PyData Global)](https://www.youtube.com/watch?v=OBpnGYjNBBI)
*   [Distributed Hybrid Parameter Tuning](https://www.youtube.com/watch?v=_GBjqskD8Qk)


            

Raw data

{
    "_id": null,
    "home_page": "http://github.com/fugue-project/fugue",
    "name": "fugue",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "distributed spark dask ray sql dsl domain specific language",
    "author": "The Fugue Development Team",
    "author_email": "hello@fugue.ai",
    "download_url": "https://files.pythonhosted.org/packages/91/a1/eca331442c758f8a6f23792dd10a51fb827fad1204805d6c70f02a35ee00/fugue-0.9.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "An abstraction layer for distributed computation",
    "version": "0.9.1",
    "project_urls": {
        "Homepage": "http://github.com/fugue-project/fugue"
    },
    "split_keywords": [
        "distributed",
        "spark",
        "dask",
        "ray",
        "sql",
        "dsl",
        "domain",
        "specific",
        "language"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ec3846a0ef179f7279207a3263afeb8da4dd73f44d00b6cc999c96a39112d284",
                "md5": "e1097dbef44de4de129d022a950b8f3a",
                "sha256": "5b91e55e6f243af6e2b901dc37914d954d8f0231627b68007850879f8848a3a3"
            },
            "downloads": -1,
            "filename": "fugue-0.9.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e1097dbef44de4de129d022a950b8f3a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 278186,
            "upload_time": "2024-06-14T17:03:41",
            "upload_time_iso_8601": "2024-06-14T17:03:41.959224Z",
            "url": "https://files.pythonhosted.org/packages/ec/38/46a0ef179f7279207a3263afeb8da4dd73f44d00b6cc999c96a39112d284/fugue-0.9.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "91a1eca331442c758f8a6f23792dd10a51fb827fad1204805d6c70f02a35ee00",
                "md5": "6c40bd3aaaa88ba1cab00982e9d4ae37",
                "sha256": "fb0f9a4780147ac8438be96efc50593e2d771d1cbf528ac56d3bcecd39915b50"
            },
            "downloads": -1,
            "filename": "fugue-0.9.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6c40bd3aaaa88ba1cab00982e9d4ae37",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 224340,
            "upload_time": "2024-06-14T17:03:44",
            "upload_time_iso_8601": "2024-06-14T17:03:44.688906Z",
            "url": "https://files.pythonhosted.org/packages/91/a1/eca331442c758f8a6f23792dd10a51fb827fad1204805d6c70f02a35ee00/fugue-0.9.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-14 17:03:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "fugue-project",
    "github_project": "fugue",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "fugue"
}
        