| Name | llm-cluster |
| Version | 0.2 |
| Summary | LLM plugin for clustering embeddings |
| Author | Simon Willison |
| License | Apache-2.0 |
| Upload time | 2023-09-04 16:37:26 |
| Requirements | No requirements were recorded. |
# llm-cluster
[![PyPI](https://img.shields.io/pypi/v/llm-cluster.svg)](https://pypi.org/project/llm-cluster/)
[![Changelog](https://img.shields.io/github/v/release/simonw/llm-cluster?include_prereleases&label=changelog)](https://github.com/simonw/llm-cluster/releases)
[![Tests](https://github.com/simonw/llm-cluster/workflows/Test/badge.svg)](https://github.com/simonw/llm-cluster/actions?query=workflow%3ATest)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/simonw/llm-cluster/blob/main/LICENSE)
[LLM](https://llm.datasette.io/) plugin for clustering embeddings.
## Installation
Install this plugin in the same environment as LLM.
```bash
llm install llm-cluster
```
## Usage
The plugin adds a new command, `llm cluster`. This command takes the name of an [embedding collection](https://llm.datasette.io/en/stable/embeddings/cli.html#storing-embeddings-in-sqlite) and the number of clusters to return.
First, use [paginate-json](https://github.com/simonw/paginate-json) and [jq](https://stedolan.github.io/jq/) to populate a collection. In this case we are embedding the title of every issue in the [llm repository](https://github.com/simonw/llm), and storing the results in an `issues.db` database:
```bash
paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
| jq '[.[] | {id: .id, title: .title}]' \
| llm embed-multi llm-issues - \
--database issues.db --store
```
The `--store` flag causes the content to be stored in the database along with the embedding vectors.
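The jq step above reshapes the GitHub API response into a list of `{id, title}` objects, which `llm embed-multi` reads from standard input. A rough Python equivalent of that reshaping, using a hypothetical two-issue sample of the API payload:

```python
import json

# A tiny stand-in for the GitHub issues API response;
# the real payload has many more fields per issue.
issues = [
    {"id": 1650662628, "title": "Initial design", "state": "closed"},
    {"id": 1650682379, "title": "Log prompts and responses to SQLite", "state": "closed"},
]

# Equivalent of: jq '[.[] | {id: .id, title: .title}]'
reshaped = [{"id": issue["id"], "title": issue["title"]} for issue in issues]

print(json.dumps(reshaped, indent=2))
```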
Now we can cluster those embeddings into 10 groups:
```bash
llm cluster llm-issues 10 \
-d issues.db
```
If you omit the `-d` option the default embeddings database will be used.
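Under the hood, clustering embeddings amounts to grouping nearby vectors. A toy, dependency-free k-means sketch of that idea — illustrative only; the plugin's real implementation may differ (for example by using scikit-learn):

```python
import math

def kmeans(vectors, k, iterations=20):
    """Naive k-means: returns a list of k clusters, each a list of vector indexes."""
    # Deterministic init: spread the starting centroids across the input.
    step = max(1, len(vectors) // k)
    centroids = [list(vectors[i * step]) for i in range(k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        for i, vec in enumerate(vectors):
            assignments[i] = min(range(k), key=lambda c: math.dist(vec, centroids[c]))
        # Move each centroid to the mean of its assigned vectors.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignments[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [[i for i in range(len(vectors)) if assignments[i] == c] for c in range(k)]

# Four 2-D "embeddings" forming two obvious groups.
vectors = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
clusters = kmeans(vectors, k=2)  # → [[0, 1], [2, 3]]
```

Real embedding vectors have hundreds or thousands of dimensions, but the grouping step works the same way.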
The output should look something like this (truncated):
```json
[
{
"id": "2",
"items": [
{
"id": "1650662628",
"content": "Initial design"
},
{
"id": "1650682379",
"content": "Log prompts and responses to SQLite"
}
]
},
{
"id": "4",
"items": [
{
"id": "1650760699",
"content": "llm web command - launches a web server"
},
{
"id": "1759659476",
"content": "`llm models` command"
},
{
"id": "1784156919",
"content": "`llm.get_model(alias)` helper"
}
]
},
{
"id": "7",
"items": [
{
"id": "1650765575",
"content": "--code mode for outputting code"
},
{
"id": "1659086298",
"content": "Accept PROMPT from --stdin"
},
{
"id": "1714651657",
"content": "Accept input from standard in"
}
]
}
]
```
The content displayed is truncated to 100 characters. Pass `--truncate 0` to disable truncation, or `--truncate X` to truncate to X characters.
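The truncation behaviour can be pictured as a simple helper, where a limit of `0` means no limit (an illustrative sketch, not the plugin's actual code):

```python
def truncate(content: str, limit: int = 100) -> str:
    """Cut content to `limit` characters; a limit of 0 disables truncation."""
    if limit == 0:
        return content
    return content[:limit]

long_text = "x" * 250
print(len(truncate(long_text)))      # default limit of 100 characters
print(len(truncate(long_text, 0)))   # 0 disables truncation
```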
## Generating summaries for each cluster
The `--summary` flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the `--truncate` option) through a prompt to a Large Language Model.
This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.
Since this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.
This feature only works for embeddings that have had their associated content stored in the database using the `--store` flag.
You can use it like this:
```bash
llm cluster llm-issues 10 \
-d issues.db \
--summary
```
This uses the default prompt and the default model.
To use a different model, e.g. GPT-4, pass the `--model` option:
```bash
llm cluster llm-issues 10 \
-d issues.db \
--summary \
--model gpt-4
```
The default prompt used is:
> Short, concise title for this cluster of related documents.
To use a custom prompt, pass `--prompt`:
```bash
llm cluster llm-issues 10 \
-d issues.db \
--summary \
--model gpt-4 \
--prompt 'Summarize this in a short line in the style of a bored, angry panda'
```
A `"summary"` key will be added to each cluster, containing the generated summary.
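Schematically, the summary step combines each cluster's (truncated) content into a single prompt for the model and stores the reply on the cluster. A sketch with a stand-in `summarize` function in place of a real LLM call:

```python
clusters = [
    {"id": "2", "items": [
        {"id": "1650662628", "content": "Initial design"},
        {"id": "1650682379", "content": "Log prompts and responses to SQLite"},
    ]},
]

PROMPT = "Short, concise title for this cluster of related documents."

def summarize(prompt: str, text: str) -> str:
    # Stand-in for a real model call; the plugin would send the
    # prompt plus the cluster content to the configured LLM here.
    return f"[summary of {len(text)} characters of content]"

for cluster in clusters:
    combined = "\n".join(item["content"] for item in cluster["items"])
    cluster["summary"] = summarize(PROMPT, combined)
```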
## Development
To set up this plugin locally, first check out the code. Then create a new virtual environment:
```bash
cd llm-cluster
python3 -m venv venv
source venv/bin/activate
```
Now install the dependencies and test dependencies:
```bash
pip install -e '.[test]'
```
To run the tests:
```bash
pytest
```