prophecy-spark-ai

Name: prophecy-spark-ai
Version: 0.1.11
Summary: High-performance AI/ML library for Spark to build and deploy your LLM applications in production.
Home page: https://github.com/prophecy-io/spark-ai
Upload time: 2023-10-17 16:37:02
Keywords: python, prophecy
Requirements: No requirements were recorded.
# Spark AI

Toolbox for building Generative AI applications on top of Apache Spark.

Many developers and companies are trying to leverage LLMs to enhance their existing applications or build completely new
ones. Thanks to LLMs, most of them no longer have to train new ML models. The major remaining challenge, however, is data and
infrastructure: data ingestion, transformation, vectorization, lookup, and model serving.

Over the last few months, the industry has seen a surge of new tools and frameworks to help with these challenges.
However, none of them are easy to use, straightforward to deploy to production, and able to handle data at scale.

This project aims to provide a toolbox of Spark extensions, data sources, and utilities to make building robust
data infrastructure on Spark for Generative AI applications easy.

[![PyPI version](https://badge.fury.io/py/prophecy-spark-ai.svg)](https://badge.fury.io/py/prophecy-spark-ai) [![Maven Central](https://maven-badges.herokuapp.com/maven-central/io.prophecy/spark-ai_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.prophecy/spark-ai_2.12)

## Example Applications

Complete examples that anyone can use as a starting point for building their own Generative AI applications.

- [Chatbot Template](https://github.com/prophecy-samples/gen-ai-chatbot-template)
- [Medical Advisor Template](https://github.com/prophecy-samples/gen-ai-med-avisor-template)

Read our thoughts on [prompt engineering, LLMs, and low-code](https://www.prophecy.io/blog/prophecy-generative-ai-platform-applications-on-enterprise-data-built-in-hours).

## Quickstart

### Installation

Currently, the project is aimed mainly at PySpark users. However, because it also features high-performance connectors,
both the PySpark and Scala dependencies have to be present on the Spark cluster, as sketched below.
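
A minimal setup might look like the following, assuming the PyPI package name (`prophecy-spark-ai`) and the Maven coordinate (`io.prophecy:spark-ai_2.12`) shown in the badges above; the exact version to pin is an assumption and should match your environment:

```python
# Install the Python package first, e.g.:
#   pip install prophecy-spark-ai
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName('spark-ai-quickstart')
    # Pull the Scala connectors onto the cluster alongside the Python package.
    # The coordinate comes from the Maven Central badge above; the version is an assumption.
    .config('spark.jars.packages', 'io.prophecy:spark-ai_2.12:0.1.11')
    .getOrCreate())
```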

### Ingestion

```python
from spark_ai.webapps.slack import SlackUtilities

# Batch version
slack = SlackUtilities(token='xoxb-...', spark=spark)
df_channels = slack.read_channels()
df_conversations = slack.read_conversations(df_channels)

# Live streaming version
df_messages = (spark.readStream
    .format('io.prophecy.spark_ai.webapps.slack.SlackSourceProvider')
    .option('token', 'xapp-...')
    .load())
```

### Pre-processing & Vectorization

```python
from pyspark.sql.functions import expr

from spark_ai.llms.openai import OpenAiLLM
from spark_ai.dbs.pinecone import PineconeDB

# Register the OpenAI and Pinecone helpers as Spark SQL UDFs
OpenAiLLM(api_key='sk-...').register_udfs(spark=spark)
PineconeDB('8045...', 'us-east-1-aws').register_udfs(spark=spark)

(df_conversations
    # Embed the text from every conversation into a vector
    .withColumn('embeddings', expr('openai_embed_texts(text)'))
    # Do some more pre-processing
    ...
    # Upsert the embeddings into Pinecone
    .withColumn('status', expr('pinecone_upsert(\'index-name\', embeddings)'))
    # Save the status of the upsertion to a standard table
    .write.saveAsTable('pinecone_status'))
```

### Inference 

```python
df_messages = spark.readStream \
    .format("io_prophecy.spark_ai.SlackStreamingSourceProvider") \
    .option("token", token) \
    .load()

# Handle a live stream of messages from Slack here
```
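
As a minimal sketch of what the handler could look like, assuming the `openai_embed_texts` UDF registered in the vectorization step and a `text` column on the incoming messages (both taken from the examples above); the sink table name and checkpoint path are made up for illustration:

```python
from pyspark.sql.functions import expr

def embed_batch(df_batch, batch_id):
    # Embed each micro-batch of Slack messages and append it to a table.
    (df_batch
        .withColumn('embeddings', expr('openai_embed_texts(text)'))
        .write.mode('append')
        .saveAsTable('slack_message_embeddings'))

(df_messages.writeStream
    .foreachBatch(embed_batch)
    .option('checkpointLocation', '/tmp/slack_stream_checkpoint')
    .start())
```

Using `foreachBatch` lets the same UDFs from the batch pipeline be applied to streaming data without any changes.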

## Roadmap

Data sources supported:

- 🚧 Slack
- πŸ—ΊοΈ PDFs
- πŸ—ΊοΈ Asana
- πŸ—ΊοΈ Notion
- πŸ—ΊοΈ Google Drive
- πŸ—ΊοΈ Web scraping

Vector databases supported:

- 🚧 Pinecone
- 🚧 Spark-ML (table store & cosine similarity)
- πŸ—ΊοΈ ElasticSearch

LLMs supported:

- 🚧 OpenAI
- 🚧 Spark-ML
- πŸ—ΊοΈ Databrick's Dolly
- πŸ—ΊοΈ HuggingFace's Models

Application interfaces supported:

- 🚧 Slack
- πŸ—ΊοΈ Microsoft Teams

And many more are coming soon (feel free to request as issues)! πŸš€

βœ…: General availability; 🚧: Beta availability; πŸ—ΊοΈ: Roadmap



            
