llm-cluster

Name	llm-cluster
Version	0.2
Summary	LLM plugin for clustering embeddings
upload_time	2023-09-04 16:37:26
author	Simon Willison
license	Apache-2.0

# llm-cluster

[![PyPI](https://img.shields.io/pypi/v/llm-cluster.svg)](https://pypi.org/project/llm-cluster/)
[![Changelog](https://img.shields.io/github/v/release/simonw/llm-cluster?include_prereleases&label=changelog)](https://github.com/simonw/llm-cluster/releases)
[![Tests](https://github.com/simonw/llm-cluster/workflows/Test/badge.svg)](https://github.com/simonw/llm-cluster/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/llm-cluster/blob/main/LICENSE)

[LLM](https://llm.datasette.io/) plugin for clustering embeddings.

## Installation

Install this plugin in the same environment as LLM.
```bash
llm install llm-cluster
```

## Usage

The plugin adds a new command, `llm cluster`. This command takes the name of an [embedding collection](https://llm.datasette.io/en/stable/embeddings/cli.html#storing-embeddings-in-sqlite) and the number of clusters to return.

First, use [paginate-json](https://github.com/simonw/paginate-json) and [jq](https://stedolan.github.io/jq/) to populate a collection. In this case we are embedding the title of every issue in the [llm repository](https://github.com/simonw/llm) (the `jq` filter keeps only the `id` and `title` fields) and storing the result in an `issues.db` database:
```bash
paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db --store
```
The `--store` flag causes the content to be stored in the database along with the embedding vectors.
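
If `jq` is not available, the same filtering step can be done in Python. This is a minimal sketch, assuming the input is a JSON array of issue objects as returned by the GitHub API; the function name `to_embed_records` and the sample data are hypothetical and not part of LLM or this plugin.

```python
import json


def to_embed_records(issues):
    """Keep only the id and title of each issue, mirroring the jq filter."""
    return [{"id": issue["id"], "title": issue["title"]} for issue in issues]


# Hypothetical sample shaped like GitHub issue objects; extra keys are dropped.
sample = [
    {"id": 1650662628, "title": "Initial design", "body": "(issue text)"},
    {"id": 1650682379, "title": "Log prompts and responses to SQLite", "body": "(issue text)"},
]
print(json.dumps(to_embed_records(sample), indent=2))
```

The resulting array has the same shape as the `jq` output, so it could be saved to a file or piped to `llm embed-multi` in the same way.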

Now we can cluster those embeddings into 10 groups:
```bash
llm cluster llm-issues 10 \
  -d issues.db
```
If you omit the `-d` option, the default embeddings database is used.
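
This README does not say which clustering algorithm the plugin uses. Purely to illustrate what grouping vectors into a fixed number of clusters means, here is a plain k-means sketch in pure Python. It is an assumption made for illustration, not the plugin's implementation, and the names `kmeans` and `points` are hypothetical.

```python
import math


def kmeans(vectors, k, iterations=10):
    """Assign each vector to one of k clusters using plain k-means."""
    # Deterministic start: use the first k vectors as the initial centroids.
    centroids = [tuple(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Toy two-dimensional "embeddings" forming two obvious groups.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
print(kmeans(points, 2))
```

With this toy data the first two points end up in one cluster and the last two in the other. Real embedding vectors have hundreds or thousands of dimensions, but the idea is the same.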

The output should look something like this (truncated):
```json
[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]
```
The content displayed is truncated to 100 characters. Pass `--truncate 0` to disable truncation, or `--truncate X` to truncate to X characters.
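
Because the output is plain JSON, it is straightforward to post-process. The following is a minimal sketch, assuming the output shape shown above (a list of clusters, each with an `id` and a list of `items`); the function name `cluster_sizes` is hypothetical.

```python
import json


def cluster_sizes(clusters):
    """Map each cluster id to the number of items it contains."""
    return {cluster["id"]: len(cluster["items"]) for cluster in clusters}


# A shortened copy of the example output above.
output = """
[
  {"id": "2", "items": [
    {"id": "1650662628", "content": "Initial design"},
    {"id": "1650682379", "content": "Log prompts and responses to SQLite"}
  ]},
  {"id": "7", "items": [
    {"id": "1650765575", "content": "--code mode for outputting code"},
    {"id": "1659086298", "content": "Accept PROMPT from --stdin"},
    {"id": "1714651657", "content": "Accept input from standard in"}
  ]}
]
"""
print(cluster_sizes(json.loads(output)))
```

In practice you might redirect the command's output to a file and load that file instead of an inline string.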

## Generating summaries for each cluster

The `--summary` flag makes the plugin generate a summary for each cluster by passing the content of its items (truncated according to the `--truncate` option) through a prompt to a Large Language Model.

This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.

Because this can send a large amount of text through an LLM, it can be expensive, depending on which model you are using.

This feature only works for embeddings that have had their associated content stored in the database using the `--store` flag.

You can use it like this:

```bash
llm cluster llm-issues 10 \
  -d issues.db \
  --summary
```
This uses the default prompt and the default model.

To use a different model, e.g. GPT-4, pass the `--model` option:
```bash
llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4
```
The default prompt used is:

> Short, concise title for this cluster of related documents.

To use a custom prompt, pass `--prompt`:

```bash
llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4 \
  --prompt 'Summarize this in a short line in the style of a bored, angry panda'
```
A `"summary"` key will be added to each cluster, containing the generated summary.
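
To drive these options from a script, one approach is to build the command line in Python and read the `"summary"` key from the parsed output. This is a sketch under stated assumptions: it uses only the options described in this README, it assumes the command writes nothing but the JSON array to standard output, and the helper names `build_cluster_command` and `cluster_summaries` are hypothetical.

```python
import json
import subprocess


def build_cluster_command(collection, n, database=None, summary=False,
                          model=None, prompt=None):
    """Build the argument list for `llm cluster` from the documented options."""
    args = ["llm", "cluster", collection, str(n)]
    if database:
        args += ["-d", database]
    if summary:
        args.append("--summary")
    if model:
        args += ["--model", model]
    if prompt:
        args += ["--prompt", prompt]
    return args


def cluster_summaries(collection, n, **options):
    """Run `llm cluster --summary` and return {cluster id: summary}.

    Assumes stdout contains only the JSON array of clusters.
    """
    command = build_cluster_command(collection, n, summary=True, **options)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return {c["id"]: c["summary"] for c in json.loads(result.stdout)}


# Only builds the command; nothing is executed and no model is called here.
print(build_cluster_command("llm-issues", 10, database="issues.db", summary=True))
```

Calling `cluster_summaries("llm-issues", 10, database="issues.db")` would actually run the command and send content to the model, so the cost caveat above applies.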

## Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:
```bash
cd llm-cluster
python3 -m venv venv
source venv/bin/activate
```
Now install the dependencies and test dependencies:
```bash
pip install -e '.[test]'
```
To run the tests:
```bash
pytest
```

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "llm-cluster",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Simon Willison",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/48/71/f7f96688d4935ee20c89058365eea875273a361df00350a8d5f5f59bd721/llm-cluster-0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "LLM plugin for clustering embeddings",
    "version": "0.2",
    "project_urls": {
        "CI": "https://github.com/simonw/llm-cluster/actions",
        "Changelog": "https://github.com/simonw/llm-cluster/releases",
        "Homepage": "https://github.com/simonw/llm-cluster",
        "Issues": "https://github.com/simonw/llm-cluster/issues"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "56ff3d156c6ed478fdd095b398fce8f50e95b23deecafbfd4372666f4b98fa08",
                "md5": "afd4d4a232514b6f3ae2b0ee0f162078",
                "sha256": "ffabb64eb97a264a414b9312d82d8a649e05c3d73b93e943fafc46975ada42cb"
            },
            "downloads": -1,
            "filename": "llm_cluster-0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "afd4d4a232514b6f3ae2b0ee0f162078",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 9007,
            "upload_time": "2023-09-04T16:37:24",
            "upload_time_iso_8601": "2023-09-04T16:37:24.831990Z",
            "url": "https://files.pythonhosted.org/packages/56/ff/3d156c6ed478fdd095b398fce8f50e95b23deecafbfd4372666f4b98fa08/llm_cluster-0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4871f7f96688d4935ee20c89058365eea875273a361df00350a8d5f5f59bd721",
                "md5": "6b0f513944d4ed0153c81023910a4f66",
                "sha256": "9e79db0bd3f7feb3f73afdac1caf5947da5fb2f43bdfced36549cc349d26bc28"
            },
            "downloads": -1,
            "filename": "llm-cluster-0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "6b0f513944d4ed0153c81023910a4f66",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 9169,
            "upload_time": "2023-09-04T16:37:26",
            "upload_time_iso_8601": "2023-09-04T16:37:26.245382Z",
            "url": "https://files.pythonhosted.org/packages/48/71/f7f96688d4935ee20c89058365eea875273a361df00350a8d5f5f59bd721/llm-cluster-0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-04 16:37:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "simonw",
    "github_project": "llm-cluster",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "llm-cluster"
}
