# The Distiller
`The Distiller` is a command-line interface and a Python library for inferencing Large Language Models to generate textual or multi-modal datasets.
Possible use cases may include:
- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over large amounts of input text. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging APIs of especially big LLMs to produce diverse texts for a specific task and then fine-tune a smaller model with them.
> This tool is distributed freely and doesn't imply any restrictions on the downstream use cases.
> However, you should make sure to follow the **Terms of use** of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.
# Agora
The Distiller is brought to you by Agora, an all-new open-source, multi-modality AI research organization devoted to advancing humanity!
[Join our Discord and help work on this project or 30+ others!](https://discord.gg/qUtxnK2NMf)
## Installation
```
pip install distiller
```
Most of the generation features rely on third-party APIs. Install their respective packages:
```
pip install openai cohere petals
```
## Usage examples
### Inference LLMs at scale
```bash
export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."
distiller texts \
--prompt "If {country} was a planet in the Star Wars universe it would be called" \
--backend "openai|text-davinci-003" \
--backend "cohere|medium" \
--temperature 0.9 \
--option country Germany \
--option country France \
--max-length 50 \
--num-samples 1 \
--single-file
```
The command above should produce a dataset file with 4 texts. Each possible combination of options is used for each of the backend LLMs. Check out the [CLI reference](#cli-reference) for more details. A dataset file looks like this:
```json
[
{
"sample_id": 0,
"backend": "openai|text-davinci-003",
"max_length": 50,
"temperature": 0.9,
"country": "Germany",
"prompt": "If Germany was a planet in the Star Wars universe it would be called",
"output": " Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be"
},
{
"sample_id": 0,
"backend": "openai|text-davinci-003",
"max_length": 50,
"temperature": 0.9,
"country": "France",
"prompt": "If France was a planet in the Star Wars universe it would be called",
"output": " The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. It would have a strong military and"
},
{
"sample_id": 0,
"backend": "cohere|medium",
"max_length": 50,
"temperature": 0.9,
"country": "Germany",
"prompt": "If Germany was a planet in the Star Wars universe it would be called",
"output": " the Hoth of the universe.\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930"
},
...
]
```
Alternatively, you can use our [`TextsGenerator`](https://github.com/radi-cho/distiller/blob/main/distiller/texts.py#L27) to produce texts in Python.
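The count of 4 texts follows from pairing every backend with every option value. The sketch below reproduces that expansion in plain Python; `expand_prompts` is a hypothetical helper written for illustration and is not part of the `distiller` package, whose internal logic may differ.

```python
from itertools import product

prompt_template = "If {country} was a planet in the Star Wars universe it would be called"
backends = ["openai|text-davinci-003", "cohere|medium"]
options = [("country", "Germany"), ("country", "France")]


def expand_prompts(template, backends, options):
    """Yield one (backend, prompt) pair per backend and option combination.

    Hypothetical helper for illustration only, not the library's own code.
    """
    # Group repeated option names, e.g. {"country": ["Germany", "France"]}.
    grouped = {}
    for name, value in options:
        grouped.setdefault(name, []).append(value)
    names = list(grouped)
    for backend in backends:
        # Cartesian product over the values of every option name.
        for values in product(*(grouped[name] for name in names)):
            yield backend, template.format(**dict(zip(names, values)))


jobs = list(expand_prompts(prompt_template, backends, options))
for backend, prompt in jobs:
    print(backend, "->", prompt)
print(len(jobs))  # 2 backends x 2 countries = 4
```

Adding a second option name (for example a hypothetical `{genre}` placeholder with three values) would multiply the count again, since each name contributes one axis to the product.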
### Generate conversations with the ChatGPT API
```bash
export OPENAI_API_KEY="..."
distiller conversations \
--length 4 \
--length 5 \
--agent1 "You're a shop assistant in a pet store. Answer to customer questions politely." \
--agent2 "You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store." \
--temperature 0.1 \
--temperature 0.2 \
--option n 2 \
--option n 3 \
--path dataset
```
The command above should produce a dataset directory with 16 conversations saved as JSON files. Pass `--single-file` to save all of them in a single file instead. Why 16? Because `num_samples` dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options). A dataset item looks like this:
```json
{
"length": 5,
"temperature": 0.1,
"n": "2",
"agent1": "You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \"Goodbye\".",
"agent2": "You're a customer in a pet store. You should behave like a human. You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \"Goodbye\".",
"utterances": [
[
"agent1",
"Hello! How can I assist you today?"
],
[
"agent2",
"Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"
],
[
"agent1",
"Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?"
],
[
"agent2",
"I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?"
],
...
]
}
```
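The arithmetic behind the 16 conversations can be checked with a short sketch. The command does not pass `--num-samples`, so the value of 2 used here is an assumption, chosen because it matches the stated total and the `num_samples=2` setting in the Python example.

```python
from itertools import product

lengths = [4, 5]            # --length 4 --length 5
temperatures = [0.1, 0.2]   # --temperature 0.1 --temperature 0.2
n_values = ["2", "3"]       # --option n 2 --option n 3
num_samples = 2             # assumption: not given on the command line above

# Every combination of length, temperature and option value is one configuration.
configurations = list(product(lengths, temperatures, n_values))
total = len(configurations) * num_samples

print(len(configurations))  # 8 configurations
print(total)                # 16 conversations
```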
You can also use our [`ConversationsGenerator`](https://github.com/radi-cho/distiller/blob/main/distiller/conversations.py#L43) to produce conversations with the `gpt-3.5-turbo` API programmatically.
```python
from distiller import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter()  # single_file=True

# Each list below is one axis of the parameter grid.
generator_config = ConversationsGeneratorConfig(
    agent1="You're a shop assistant in a pet store. Answer to customer questions politely.",
    agent2="You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.",
    num_samples=2,
    interruption="length",
    lengths=[4, 5],
    temperatures=[0.1, 0.2],
    options=[("n", "2"), ("n", "3")],
)

conversations_generator = ConversationsGenerator(generator_config)

# Save each conversation as soon as it is generated.
for conversation in conversations_generator:
    dataset_writer.save_intermediate_result(conversation)
```
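Saved items follow the JSON layout shown earlier, where `utterances` is a list of `[speaker, text]` pairs. As a minimal sketch of consuming such an item, the snippet below parses a shortened copy of the sample and renders it as a transcript; `to_transcript` is a hypothetical helper, not something the package provides.

```python
import json

# Shortened copy of the sample dataset item shown above.
item_json = """
{
  "length": 5,
  "temperature": 0.1,
  "n": "2",
  "utterances": [
    ["agent1", "Hello! How can I assist you today?"],
    ["agent2", "Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"]
  ]
}
"""


def to_transcript(item):
    """Render the [speaker, text] pairs as 'speaker: text' lines."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in item["utterances"])


item = json.loads(item_json)
transcript = to_transcript(item)
print(transcript)
```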
## Contributing
> Still under active development.
Contributions will be highly appreciated. Currently these features are under development:
- [x] `distiller conversations` - Make two ChatGPT agents talk with one another and record the conversation history.
- [x] `distiller texts` - Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.
- [ ] `distiller transformations` - Apply a list of transformations to a list of texts. For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.
- [ ] Support more backend LLMs.
To set up a local development environment:
```bash
git clone https://github.com/radi-cho/distiller/
cd distiller
pip install -e .
```
## CLI Reference
```
distiller [OPTIONS] COMMAND [ARGS]...

  Command line interface that generates datasets with LLMs.

Options:
  --help  Show this message and exit.

Commands:
  conversations  Produce conversations between two gpt-3.5-turbo agents...
  texts          Inference multiple LLMs at scale.
```
```
distiller texts [OPTIONS]

  Inference multiple LLMs at scale.

Options:
  -p, --prompt TEXT            Input prompt.  [required]
  -b, --backend TEXT           LLM APIs to use as backends. Use
                               "backend|model_name" notation. For example:
                               "openai|text-davinci-003".
  -l, --max-length INTEGER     Maximum number of tokens to generate for each
                               prompt.
  -t, --temperature FLOAT      Possible temperature values for the backend
                               language model.
  -n, --num-samples INTEGER    Number of conversations for each configuration.
  -o, --option <TEXT TEXT>...  Values for additional options denoted in your
                               prompts by {OPTION_NAME}.
  -f, --path PATH              Where to save the dataset. Either a file or a
                               directory (folder).
  -s, --single-file            Either save the whole dataset to a single file
                               or create multiple files.
  --help                       Show this message and exit.
```
- You can specify multiple variants for the following options: `--length`, `--temperature`, `--num-samples`, `--option`. A dataset item will be generated for each possible combination of the supplied values.
- Each `--option` provided must be formatted as follows: `--option option_name "Some option value"`.
- Currently supported backends: GPT-3 model variants by [OpenAI](https://openai.com/blog/openai-api), the language models by [Cohere](https://pypi.org/project/cohere/), BLOOM through the [Petals API](https://petals.ml/).
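The `"backend|model_name"` notation is a provider name and a model name joined by a pipe character. A minimal sketch of splitting and validating such a string follows; `parse_backend` is a hypothetical function for illustration, and the package's own parsing may behave differently.

```python
from typing import Tuple


def parse_backend(spec: str) -> Tuple[str, str]:
    """Split a "backend|model_name" string into (backend, model_name).

    Hypothetical helper, not part of the distiller package.
    """
    backend, separator, model_name = spec.partition("|")
    if not separator or not backend or not model_name:
        raise ValueError(f'expected "backend|model_name", got {spec!r}')
    return backend, model_name


print(parse_backend("openai|text-davinci-003"))
print(parse_backend("cohere|medium"))
```

Using `str.partition` splits on the first pipe only, so a model name that itself contains a pipe would be kept intact in the second element.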
```
distiller conversations [OPTIONS]

  Produce conversations between two gpt-3.5-turbo agents with given roles.

Options:
  -k, --openai-api-key TEXT       OpenAI API key.
  -a, --agent1 TEXT               Agent role description.  [required]
  -b, --agent2 TEXT               Agent role description.  [required]
  -i, --interruption [length|end_phrase]
                                  Interruption mode.
  -e, --end-phrase TEXT           Interrupt after this phrase is outputted by
                                  one of the agents.
  -d, --end-agent [agent1|agent2|both]
                                  In which agent's messages to look for the
                                  end phrase.
  -l, --length INTEGER            Maximum number of utterances for each agent.
                                  A conversation sample will be generated for
                                  each length.
  -t, --temperature FLOAT         Possible temperature values for the backend
                                  language model.
  -n, --num-samples INTEGER       Number of conversations for each
                                  configuration.
  -o, --option <TEXT TEXT>...     Values for additional options denoted in
                                  your prompts by {OPTION_NAME}.
  -f, --path PATH                 Where to save the dataset. Either a file or
                                  a directory (folder).
  -s, --single-file               Either save the whole dataset to a single
                                  file or create multiple files.
  --help                          Show this message and exit.
```
- The length parameter specifies how many utterances each agent should make. A length of 4 typically produces 8 utterances in total.
- You can specify either `length` (default) or `end_phrase` as an interruption strategy. When using `end_phrase` a conversation will be interrupted once the `--end-phrase` has appeared in the messages of the `--end-agent` (could be both). In this case, the lengths provided will be treated as maximum conversation lengths.
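The two interruption strategies can be illustrated with a small simulation. This is a sketch under stated assumptions, not the library's implementation: the agents are scripted stand-ins instead of `gpt-3.5-turbo` calls, `agent1` is assumed to speak first (as in the sample item), and `run_conversation` and `scripted` are hypothetical names.

```python
from typing import Callable, List, Tuple

Utterance = Tuple[str, str]
Agent = Callable[[List[Utterance]], str]  # hypothetical stand-in for an LLM call


def scripted(lines: List[str]) -> Agent:
    """Return a fake agent that replies with the next scripted line."""
    remaining = iter(lines)
    return lambda history: next(remaining)


def run_conversation(agent1: Agent, agent2: Agent, length: int,
                     interruption: str = "length",
                     end_phrase: str = "Goodbye",
                     end_agent: str = "both") -> List[Utterance]:
    """Alternate agents for at most `length` utterances each."""
    utterances: List[Utterance] = []
    speakers = [("agent1", agent1), ("agent2", agent2)]
    for turn in range(2 * length):
        name, speak = speakers[turn % 2]
        text = speak(utterances)
        utterances.append((name, text))
        # In end_phrase mode, stop early once a watched agent says the phrase.
        watched = end_agent in (name, "both")
        if interruption == "end_phrase" and watched and end_phrase in text:
            break
    return utterances


by_length = run_conversation(
    scripted(["Hello!", "We have cats.", "Sure.", "Goodbye"]),
    scripted(["Hi, any pets?", "Tell me more.", "Thanks!", "Goodbye"]),
    length=4,
)
by_phrase = run_conversation(
    scripted(["Hello!", "We have cats.", "Goodbye", "unused"]),
    scripted(["Hi, any pets?", "Thanks!", "unused", "unused"]),
    length=4,
    interruption="end_phrase",
    end_agent="agent1",
)
print(len(by_length))  # 8: four utterances per agent
print(len(by_phrase))  # 5: stops when agent1 says "Goodbye"
```

In `end_phrase` mode the `length` value still bounds the loop, which mirrors the note above that the provided lengths act as maximum conversation lengths.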
# Roadmap
* Add a Hugging Face polymorphic function to use any model on the Hugging Face Hub
* Add a CLI argument for selecting the Hugging Face model to use
* Integrate tree-of-thoughts prompting
* Integrate multi-modality models from Hugging Face
* Integrate multi-modality conversations: store images in a folder, then agent2 looks up the image address
* Enable multi-modal dataset generation using generate and critique steps