# datasetGPT
`datasetGPT` is a command-line interface and a Python library for inferencing Large Language Models to generate textual datasets.
Possible use cases include:
- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over large amounts of input text. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging the APIs of very large LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model on them.
> This tool is distributed freely and doesn't impose any restrictions on downstream use cases.
> However, you should make sure to follow the **Terms of use** of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.
## Installation
```bash
pip install datasetGPT
```
Most of the generation features rely on third-party APIs. Install their respective packages:
```bash
pip install openai cohere petals
```
## Usage examples
### Inference LLMs at scale
```bash
export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."

datasetGPT texts \
    --prompt "If {country} was a planet in the Star Wars universe it would be called" \
    --backend "openai|text-davinci-003" \
    --backend "cohere|medium" \
    --temperature 0.9 \
    --option country Germany \
    --option country France \
    --max-length 50 \
    --num-samples 1 \
    --single-file
```
The command above should produce a dataset file with 4 texts: each possible combination of option values (here the two `country` values) is used with each of the backend LLMs. Check out the [CLI reference](#cli-reference) for more details. A dataset file looks like this:
```json
[
    {
        "sample_id": 0,
        "backend": "openai|text-davinci-003",
        "max_length": 50,
        "temperature": 0.9,
        "country": "Germany",
        "prompt": "If Germany was a planet in the Star Wars universe it would be called",
        "output": " Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be"
    },
    {
        "sample_id": 0,
        "backend": "openai|text-davinci-003",
        "max_length": 50,
        "temperature": 0.9,
        "country": "France",
        "prompt": "If France was a planet in the Star Wars universe it would be called",
        "output": " The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. It would have a strong military and"
    },
    {
        "sample_id": 0,
        "backend": "cohere|medium",
        "max_length": 50,
        "temperature": 0.9,
        "country": "Germany",
        "prompt": "If Germany was a planet in the Star Wars universe it would be called",
        "output": " the Hoth of the universe.\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930"
    },
    ...
]
```
Alternatively, you can use our [`TextsGenerator`](https://github.com/radi-cho/datasetGPT/blob/main/datasetGPT/texts.py#L27) to produce texts in Python.
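For orientation, here is a minimal sketch of that route, mirroring the CLI call above. The `TextsGeneratorConfig` name and its field names (`backends`, `max_lengths`, `temperatures`, `options`) are assumptions inferred from the CLI flags, so check `texts.py` for the exact interface before relying on it:
```python
# Minimal sketch only: TextsGeneratorConfig and its field names are inferred
# from the CLI flags above; verify against datasetGPT/texts.py before use.
from datasetGPT import TextsGenerator, TextsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter(single_file=True)

generator_config = TextsGeneratorConfig(
    prompt="If {country} was a planet in the Star Wars universe it would be called",
    backends=["openai|text-davinci-003", "cohere|medium"],  # "backend|model" notation
    max_lengths=[50],
    temperatures=[0.9],
    options=[("country", "Germany"), ("country", "France")],
    num_samples=1,
)

texts_generator = TextsGenerator(generator_config)

# Each generated item is written out as soon as it is produced.
for text_item in texts_generator:
    dataset_writer.save_intermediate_result(text_item)
```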
### Generate conversations with the ChatGPT API
```bash
export OPENAI_API_KEY="..."
datasetGPT conversations \
    --length 4 \
    --length 5 \
    --agent1 "You're a shop assistant in a pet store. Answer to customer questions politely." \
    --agent2 "You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store." \
    --temperature 0.1 \
    --temperature 0.2 \
    --option n 2 \
    --option n 3 \
    --path dataset
```
The command above should produce a dataset directory with 16 conversations saved as JSON files. You can also save all of them to a single file with `--single-file`. But why 16? Because `num_samples` dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options): here 2 lengths × 2 temperatures × 2 values for `n` yield 8 configurations, with two dialogues generated for each. A dataset item looks like this:
```json
{
    "length": 5,
    "temperature": 0.1,
    "n": "2",
    "agent1": "You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \"Goodbye\".",
    "agent2": "You're a customer in a pet store. You should behave like a human. You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \"Goodbye\".",
    "utterances": [
        [
            "agent1",
            "Hello! How can I assist you today?"
        ],
        [
            "agent2",
            "Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"
        ],
        [
            "agent1",
            "Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?"
        ],
        [
            "agent2",
            "I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?"
        ],
        ...
    ]
}
```
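To see where the 16 comes from, here is a quick standalone check of the parameter grid (plain Python, independent of `datasetGPT`):
```python
from itertools import product

lengths = [4, 5]            # --length 4 --length 5
temperatures = [0.1, 0.2]   # --temperature 0.1 --temperature 0.2
n_values = ["2", "3"]       # --option n 2 --option n 3

# One configuration per combination of the varied parameters.
configurations = list(product(lengths, temperatures, n_values))
print(len(configurations))                 # 8

num_samples = 2  # dialogues generated per configuration in this run
print(len(configurations) * num_samples)   # 16 conversations in total
```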
You can also use our [`ConversationsGenerator`](https://github.com/radi-cho/datasetGPT/blob/main/datasetGPT/conversations.py#L43) to produce conversations with the `gpt-3.5-turbo` API programmatically.
```python
from datasetGPT import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter()  # pass single_file=True to save everything to one file

generator_config = ConversationsGeneratorConfig(
    agent1="You're a shop assistant in a pet store. Answer to customer questions politely.",
    agent2="You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.",
    num_samples=2,
    interruption="length",
    lengths=[4, 5],
    temperatures=[0.1, 0.2],
    options=[("n", "2"), ("n", "3")],
    model="gpt-3.5-turbo",  # or model="gpt-4"
    # Per-agent models can be set. If one of the two parameters is set, the
    # other must also be provided; otherwise both agents use the "model" value.
    model_agent_one="gpt-3.5-turbo",
    model_agent_two="gpt-4",
)

conversations_generator = ConversationsGenerator(generator_config)

for conversation in conversations_generator:
    dataset_writer.save_intermediate_result(conversation)
```
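The `end_phrase` interruption mode documented in the CLI reference below should be reachable from Python as well. This is a hedged sketch: the `end_phrase` and `end_agent` keyword names are assumptions inferred from the `--end-phrase` and `--end-agent` flags, so verify them against `conversations.py`:
```python
# Sketch only: the end_phrase/end_agent keyword names are inferred from the
# corresponding CLI flags and may differ in your installed version.
from datasetGPT import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter()

end_phrase_config = ConversationsGeneratorConfig(
    agent1="You're a shop assistant in a pet store. Say \"Goodbye\" when the conversation is over.",
    agent2="You're a customer in a pet store. Say \"Goodbye\" when the conversation is over.",
    interruption="end_phrase",
    end_phrase="Goodbye",  # stop once this phrase appears...
    end_agent="both",      # ...in either agent's messages
    lengths=[10],          # treated as a maximum length in this mode
    temperatures=[0.1],
)

for conversation in ConversationsGenerator(end_phrase_config):
    dataset_writer.save_intermediate_result(conversation)
```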
## Contributing
> Still under active development.
Contributions are highly appreciated. The following features are currently under development:
- [x] `datasetGPT conversations` - Make two ChatGPT agents talk with one another and record the conversation history.
- [x] `datasetGPT texts` - Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.
- [ ] `datasetGPT transformations` - Apply a list of transformations to a list of texts. For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.
- [ ] Support more backend LLMs.
To set up a local development environment:
```bash
git clone https://github.com/radi-cho/datasetGPT/
cd datasetGPT
pip install -e .
```
## CLI Reference
```
datasetGPT [OPTIONS] COMMAND [ARGS]...

  Command line interface that generates datasets with LLMs.

Options:
  --help  Show this message and exit.

Commands:
  conversations  Produce conversations between two gpt-3.5-turbo agents...
  texts          Inference multiple LLMs at scale.
```
```
datasetGPT texts [OPTIONS]

  Inference multiple LLMs at scale.

Options:
  -p, --prompt TEXT            Input prompt.  [required]
  -b, --backend TEXT           LLM APIs to use as backends. Use
                               "backend|model_name" notation. For example:
                               "openai|text-davinci-003".
  -l, --max-length INTEGER     Maximum number of tokens to generate for each
                               prompt.
  -t, --temperature FLOAT      Possible temperature values for the backend
                               language model.
  -n, --num-samples INTEGER    Number of conversations for each configuration.
  -o, --option <TEXT TEXT>...  Values for additional options denoted in your
                               prompts by {OPTION_NAME}.
  -f, --path PATH              Where to save the dataset. Either a file or a
                               directory (folder).
  -s, --single-file            Either save the whole dataset to a single file
                               or create multiple files.
  --help                       Show this message and exit.
```
- You can specify multiple variants for the following options: `--backend`, `--length`/`--max-length`, `--temperature`, `--num-samples`, `--option`. A dataset item will be generated for each possible combination of the supplied values.
- Each `--option` provided must be formatted as follows: `--option option_name "Some option value"`.
- Currently supported backends: GPT-3 and GPT-4 model variants by [OpenAI](https://openai.com/blog/openai-api), the language models by [Cohere](https://pypi.org/project/cohere/), BLOOM through the [Petals API](https://petals.ml/).
```
datasetGPT conversations [OPTIONS]

  Produce conversations between two gpt-3.5-turbo agents with given roles.

Options:
  -k, --openai-api-key TEXT       OpenAI API key.
  -a, --agent1 TEXT               Agent role description.  [required]
  -b, --agent2 TEXT               Agent role description.  [required]
  -i, --interruption [length|end_phrase]
                                  Interruption mode.
  -e, --end-phrase TEXT           Interrupt after this phrase is outputted by
                                  one of the agents.
  -d, --end-agent [agent1|agent2|both]
                                  In which agent's messages to look for the
                                  end phrase.
  -l, --length INTEGER            Maximum number of utterances for each agent.
                                  A conversation sample will be generated for
                                  each length.
  -t, --temperature FLOAT         Possible temperature values for the backend
                                  language model.
  -m, --model [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use. GPT-4 is only
                                  supported if the provided API key has access
                                  to GPT-4. Defaults to GPT-3.5-Turbo.
  -m1, --model-agent1 [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use for agent1. GPT-4
                                  is only supported if the given API key has
                                  access to GPT-4. Defaults to GPT-3.5-Turbo.
                                  If set, --model-agent2 must also be
                                  provided, otherwise the --model value will
                                  be used.
  -m2, --model-agent2 [gpt-3.5-turbo|gpt-4]
                                  OpenAI Chat model to use for agent2. GPT-4
                                  is only supported if the given API key has
                                  access to GPT-4. Defaults to GPT-3.5-Turbo.
                                  If set, --model-agent1 must also be
                                  provided, otherwise the --model value will
                                  be used.
  -n, --num-samples INTEGER       Number of conversations for each
                                  configuration.
  -o, --option <TEXT TEXT>...     Values for additional options denoted in
                                  your prompts by {OPTION_NAME}.
  -f, --path PATH                 Where to save the dataset. Either a file or
                                  a directory (folder).
  -s, --single-file               Either save the whole dataset to a single
                                  file or create multiple files.
  --help                          Show this message and exit.
```
- The `--length` parameter specifies how many utterances each agent should make; a length of 4 typically produces 8 utterances in total.
- You can specify either `length` (default) or `end_phrase` as an interruption strategy. When using `end_phrase`, a conversation is interrupted once the `--end-phrase` has appeared in the messages of the `--end-agent` (which can be both agents). In this case, the provided lengths are treated as maximum conversation lengths, as in the example below.
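For example, an `end_phrase` run assembled from the flags documented above (the prompt texts are illustrative):
```bash
datasetGPT conversations \
    --agent1 "You're a shop assistant in a pet store. Say \"Goodbye\" when the conversation is over." \
    --agent2 "You're a customer in a pet store. Say \"Goodbye\" when the conversation is over." \
    --interruption end_phrase \
    --end-phrase "Goodbye" \
    --end-agent both \
    --length 10 \
    --path dataset
```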