# mlverse-mall

**Version:** 0.1.0 · **Requires Python:** >=3.9 · **Uploaded:** 2024-10-24

**Summary:** Run multiple 'Large Language Model' predictions against a table. The predictions run row-wise over a specified column.

**Keywords:** large language models, llm, natural language processing, nlp, polars

**Homepage:** <https://mlverse.github.io/mall/> · **Issues:** <https://github.com/mlverse/mall/issues>

<img src="https://mlverse.github.io/mall/site/images/favicon/apple-touch-icon-180x180.png" style="float:right" />

<!-- badges: start -->

[![Python
tests](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml/badge.svg)](https://github.com/mlverse/mall/actions/workflows/python-tests.yaml)
[![Code
coverage](https://codecov.io/gh/mlverse/mall/branch/main/graph/badge.svg)](https://app.codecov.io/gh/mlverse/mall?branch=main)
[![Lifecycle:
experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- badges: end -->

Run multiple LLM predictions against a data frame. The predictions are
processed row-wise over a specified column. It works by combining a
pre-determined one-shot prompt with the current row’s content. `mall`
has been implemented for both R and Python. The prompt that is used
depends on the type of analysis needed.

Currently, the included prompts perform the following:

- [Sentiment analysis](#sentiment)
- [Text summarizing](#summarize)
- [Classify text](#classify)
- [Extract one, or several](#extract), specific pieces of information
  from the text
- [Translate text](#translate)
- [Verify that something is true](#verify) about the text (binary)
- [Custom prompt](#custom-prompt)

This package is inspired by the SQL AI functions now offered by vendors
such as
[Databricks](https://docs.databricks.com/en/large-language-models/ai-functions.html)
and Snowflake. `mall` uses [Ollama](https://ollama.com/) to interact
with LLMs installed locally.

For **Python**, `mall` is a library extension to
[Polars](https://pola.rs/). To interact with Ollama, it uses the
official [Python library](https://github.com/ollama/ollama-python).

``` python
reviews.llm.sentiment("review")
```
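
For background, Polars lets libraries register custom namespaces on data
frames, which is what makes calls like `df.llm.sentiment()` possible. Below is
a minimal sketch of that mechanism; the `llm_demo` namespace and everything
inside it are invented for illustration, not `mall`’s actual code:

``` python
import polars as pl

# Illustrative only -- a stand-in for how an llm-style namespace could be
# wired up. mall registers its own "llm" namespace with real LLM calls.
@pl.api.register_dataframe_namespace("llm_demo")
class LLMDemo:
    def __init__(self, df: pl.DataFrame) -> None:
        self._df = df

    def shout(self, col: str) -> pl.DataFrame:
        # Stand-in for an LLM call: derive a new column row-wise from `col`.
        return self._df.with_columns(
            pl.col(col).str.to_uppercase().alias("shout")
        )

df = pl.DataFrame({"review": ["great tv", "slow laptop"]})
df.llm_demo.shout("review")
```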

## Motivation

We want to find new ways to help data scientists use LLMs in their daily
work. Unlike the familiar interfaces, such as chatting and code
completion, this interface runs your text data directly against the LLM.

The LLM’s flexibility allows it to adapt to the subject of your data and
provide surprisingly accurate predictions. This saves the data scientist
the need to write and tune an NLP model.

In recent times, the capabilities of LLMs that can run locally on your
computer have increased dramatically. This means that these sorts of
analyses can run on your machine with good accuracy. Additionally, it
makes it possible to take advantage of LLMs at your institution, since
the data will not leave the corporate network.

## Get started

- Install `mall` from GitHub

``` bash
pip install "mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python"
```

- [Download Ollama from the official
  website](https://ollama.com/download)

- Install and start Ollama on your computer

- Install the official Ollama library

  ``` bash
  pip install ollama
  ```

- Download an LLM model. For example, this package has been developed
  and tested using Llama 3.2. To get that model, you can run the code
  below (a quick sanity check for the whole setup follows this list):

  ``` python
  import ollama
  ollama.pull('llama3.2')
  ```
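
Once Ollama is running and the model has been pulled, a quick sanity check
with the official library can confirm the setup before using `mall`. This is
a minimal sketch; the prompt is arbitrary:

``` python
import ollama

# Confirm the Ollama server is reachable and see which models are installed.
print(ollama.list())

# Send a tiny prompt to confirm the model responds end to end.
resp = ollama.generate(model="llama3.2", prompt="Reply with one word: ok")
print(resp["response"])
```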

## LLM functions

We will start by loading a very small data set included in `mall`. It
contains 3 product reviews that we will use as the source of our examples.

``` python
import mall 
data = mall.MallData
reviews = data.reviews

reviews 
```

| review |
|----|
| "This has been the best TV I've ever used. Great screen, and sound." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" |

### Sentiment

Automatically returns “positive”, “negative”, or “neutral” based on the
text.

``` python
reviews.llm.sentiment("review")
```

| review | sentiment |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | "positive" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "negative" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "neutral" |

### Summarize

There may be a need to reduce the number of words in a given text,
typically to make its intent easier to understand. The function has an
argument, `max_words`, to control the maximum number of words in the
output:

``` python
reviews.llm.summarize("review", 5)
```

| review | summary |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | "great tv with good features" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop purchase was a mistake" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "feeling uncertain about new purchase" |

### Classify

Use the LLM to categorize the text into one of the options you provide:

``` python
reviews.llm.classify("review", ["computer", "appliance"])
```

| review | classify |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" |

### Extract

This is one of the most interesting use cases. Using natural language,
we can tell the LLM to return a specific part of the text. In the
following example, we request that the LLM return the product being
referred to. We do this by simply saying “product”. The LLM understands
what we *mean* by that word, and looks for it in the text.

``` python
reviews.llm.extract("review", "product")
```

| review | extract |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | "tv" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "washing machine" |

### Verify

This function allows you to check whether a statement is true, based on
the provided text. By default, it will return a 1 for “yes”, and a 0 for
“no”. This can be customized.

``` python
reviews.llm.verify("review", "is the customer happy with the purchase")
```

| review | verify |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | 1 |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | 0 |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | 0 |

### Translate

As the title implies, this function will translate the text into a
specified language. What is really nice is that you don’t need to
specify the language of the source text; only the target language needs
to be defined. The translation accuracy will depend on the LLM.

``` python
reviews.llm.translate("review", "spanish")
```

| review | translation |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | "Esta ha sido la mejor televisión que he utilizado hasta ahora. Gran pantalla y sonido." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "Me arrepiento de comprar este portátil. Es demasiado lento y la tecla es demasiado ruidosa." |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No estoy seguro de cómo sentirme con mi nueva lavadora. Un color maravilloso, pero muy difícil de en… |

### Custom prompt

It is possible to pass your own prompt to the LLM, and have `mall` run
it against each text entry:

``` python
my_prompt = (
    # Adjacent string literals are concatenated; the trailing spaces keep
    # the sentences from running together in the final prompt.
    "Answer a question. "
    "Return only the answer, no explanation. "
    "Acceptable answers are 'yes', 'no'. "
    "Answer this about the following text, is this a happy customer?:"
)

reviews.llm.custom("review", prompt = my_prompt)
```

| review | custom |
|----|----|
| "This has been the best TV I've ever used. Great screen, and sound." | "Yes" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "No" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No" |

## Model selection and settings

You can set the model and its options to use when calling the LLM. In
this case, “options” refers to model-specific settings that can be
passed along, such as seed or temperature.

The model and options to be used will be defined at the Polars data
frame object level. If not passed, the default model will be
**llama3.2**.

``` python
reviews.llm.use("ollama", "llama3.2", options = dict(seed = 100))
```
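
Since `options` passes model-specific settings through, temperature can be
pinned alongside the seed for more repeatable output (the values here are
illustrative):

``` python
reviews.llm.use("ollama", "llama3.2", options = dict(seed = 100, temperature = 0))
```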

#### Results caching

By default, `mall` caches the requests and corresponding results from a
given LLM run. Each response is saved as an individual JSON file. By
default, the folder name is `_mall_cache`. The folder name can be
customized, if needed. Caching can also be turned off by setting the
argument to an empty string (`""`).

``` python
reviews.llm.use(_cache = "my_cache")
```

To turn off:

``` python
reviews.llm.use(_cache = "")
```
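
Because each response is saved as an individual JSON file, the cache is easy
to inspect with the standard library (a sketch, assuming the default
`_mall_cache` folder):

``` python
from pathlib import Path

# Count the cached responses; each JSON file holds one saved LLM response.
cache = Path("_mall_cache")
print(sum(1 for _ in cache.rglob("*.json")))
```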

## Key considerations

The main consideration is **cost**: either time cost or money cost.

If using this method with a locally available LLM, the cost will be long
running times. Unless you are using a very specialized LLM, a given LLM
is a general model. It was fitted using a vast amount of data, so
determining a response for each row takes longer than using a manually
created NLP model. The default model used in Ollama is [Llama
3.2](https://ollama.com/library/llama3.2), which has 3 billion
parameters.

If using an external LLM service, the consideration will be the billing
cost of using that service. Keep in mind that you will be sending a lot
of data to be evaluated.

Another consideration is the novelty of this approach. Early tests are
providing encouraging results, but you, as a user, will still need to
keep in mind that the predictions will not be infallible, so always
check the output. At this time, I think the best use for this method is
quick, exploratory analysis.

            
