# GeneratorPromptKit
### A Python Library and Framework for Automated Generator Prompting and Dataset Generation
This is a Python library (https://pypi.org/project/generator-prompt-kit/) and framework for automated generator prompting and dataset generation using large language models (LLMs). Inspired by the work of Chen et al. in their paper "GenQA: Generating Millions of Instructions from a Handful of Prompts", this library provides a structured approach to creating diverse and high-quality datasets using a combination of generator prompts, topic extraction, and subtopic exploration.
## Overview
The key idea behind GeneratorPromptKit is to leverage the power of LLMs to generate diverse and relevant questions and answers based on a given input domain. By iteratively extracting topics and subtopics from the domain and using carefully crafted generator prompts, the library enables the creation of large-scale datasets with minimal human intervention.
### Demo Generation Example
| topic | subtopic | question | answer |
|-----------------------|---------------------|---------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| Internet of Things | IoT Data Analytics | How can leveraging insights from connected devices revolutionize decision-making processes in various industries?| By harnessing the power of IoT data analytics, organizations can gain real-time insights...|
| Data Structures and Algorithms | Sorting Algorithms | How can the efficiency of sorting algorithms be further improved beyond traditional comparisons and swaps? | One innovative approach to enhancing the efficiency of sorting algorithms beyond... |
| Operating Systems | Memory Management | How does efficient memory management contribute to the overall performance of an operating system?| Efficient memory management plays a crucial role in optimizing the performance of an... |
| Computer Networks | Network Security | How can we ensure the confidentiality and integrity of data transmitted over a network, especially in the presence of potential threats? | To safeguard data during transmission, network security mechanisms like encryption,... |
| Operating Systems | Memory Management | How does the efficient utilization of resources contribute to the overall performance of a system? | Efficient memory management plays a crucial role in optimizing system performance by... |
### Flowchart
```mermaid
graph TD
A[Input Domain] -->|Define Domain| B[Extract Topics]
B -->|List Topics| C[Iterate over Topics]
C -->|Select Topic| D[Extract Subtopics]
D -->|List Subtopics| E[Iterate over Subtopics]
E -->|Select Subtopic| F[Generate Questions and Answers]
F -->|Generate QA Pair| G[Store QA in Dataset]
G --> H{More Subtopics?}
H -- Yes --> E
H -- No --> I{More Topics?}
I -- Yes --> C
I -- No --> J[Output Dataset]
    subgraph TE[Topic Extraction]
    B
    end
    subgraph SE[Subtopic Extraction]
    D
    end
    subgraph QA[Question and Answer Generation]
    F
    end
    subgraph DS[Dataset Storage]
    G
    end
    style TE fill:#f9d,stroke:#333,stroke-width:2px
    style SE fill:#dbf,stroke:#333,stroke-width:2px
    style QA fill:#ffd,stroke:#333,stroke-width:2px
    style DS fill:#bdf,stroke:#333,stroke-width:2px
```
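The control flow above can be sketched in plain Python. This is an illustrative outline of the loop structure only, not the library's actual implementation; `extract_topics`, `extract_subtopics`, and `generate_qa` are stand-ins for the LLM calls the library makes.

```python
def extract_topics(domain, n):
    # Stand-in for an LLM call that lists topics for the domain
    return [f"{domain} topic {i}" for i in range(n)]

def extract_subtopics(topic, n):
    # Stand-in for an LLM call that lists subtopics of a topic
    return [f"{topic} / subtopic {j}" for j in range(n)]

def generate_qa(subtopic):
    # Stand-in for an LLM call that writes one question-answer pair
    return {"question": f"Q about {subtopic}", "answer": f"A about {subtopic}"}

def build_dataset(domain, num_topics, num_subtopics):
    dataset = []
    for topic in extract_topics(domain, num_topics):              # Iterate over Topics
        for subtopic in extract_subtopics(topic, num_subtopics):  # Iterate over Subtopics
            dataset.append(generate_qa(subtopic))                 # Store QA in Dataset
    return dataset

dataset = build_dataset("Computer Science", num_topics=2, num_subtopics=3)
print(len(dataset))  # 2 topics x 3 subtopics = 6 QA pairs
```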
## Some Important Questions
### 1. Why Generator Prompts?
> When finetuning a Llama-3 8B base model on datasets generated with Generator Prompts, it meets or exceeds both WizardLM and UltraChat on knowledge-intensive leaderboard tasks as well as conversational evaluations.
- [Source](https://arxiv.org/abs/2406.10323)
### 2. What are Generator Prompts?
> A generator prompt asks the model to enumerate a long list of execution paths, and then randomizes which paths get chosen.
- [Source](https://x.com/tomgoldsteincs/status/1803865169543532867)
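In code terms, the idea can be emulated as follows. This is a minimal sketch of my own (not code from this library): the prompt asks the model to enumerate many candidate paths, and the path to expand is randomized locally so that repeated calls sample different regions of the domain.

```python
import random

def make_generator_prompt(domain, num_paths=50, rng=random):
    # Randomize which enumerated path the model is told to expand,
    # so repeated calls cover different execution paths.
    chosen = rng.randint(1, num_paths)
    return (
        f"List {num_paths} distinct subtopics of {domain}, numbered 1 to {num_paths}. "
        f"Then write one question and answer about subtopic number {chosen} only."
    )

prompt = make_generator_prompt("Computer Science", num_paths=50)
print(prompt)
```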
## Features
- **Automated Topic and Subtopic Extraction**: GeneratorPromptKit automatically extracts topics and subtopics from the input domain using LLMs, enabling efficient exploration of the domain space.
- **Generator Prompts**: The library provides a set of carefully designed generator prompts that encourage diversity, creativity, and relevance in the generated questions and answers.
- **Customizable Prompts**: Users can easily modify and extend the existing prompts or add their own prompts to suit their specific requirements.
- **Randomness and Diversity**: GeneratorPromptKit incorporates randomness boosters and indirect question generators to enhance the diversity of the generated dataset.
- **Integration with OpenAI API**: The library seamlessly integrates with the OpenAI API, allowing users to leverage their language models for dataset generation.
## Installation
To install GeneratorPromptKit, simply use pip:
```bash
pip install generator-prompt-kit
```
## Usage
Here's a basic example of how to use GeneratorPromptKit to generate a dataset:
```python
from GeneratorPromptKit import GeneratorPromptKit
import pandas as pd
# Initialize the GeneratorPromptKit
gpk = GeneratorPromptKit(api_key='your_openai_api_key')
# Set the input domain
input_domain = "Computer Science"
# Generate the dataset
dataset = gpk.generate_dataset(input_domain, num_topics=10, num_subtopics=5, num_datapoints=100)
# Save the dataset to a file
dataset.save('computer_science_dataset.csv')
# Print the generated dataset
df = pd.read_csv('computer_science_dataset.csv')
print(df)
```
#### GeneratorPromptKit()
The `GeneratorPromptKit` constructor initializes a new instance, setting up the configuration needed to interact with an LLM via the specified API. It prepares the system for subsequent calls that generate topics, subtopics, and Q&A pairs by configuring the API key, operational parameters such as `temperature` and the rate-limiting pause, and the language model to use.
1. **api_key (str)**
- **Description**: The API key used to authenticate requests to the language model provider, such as OpenAI. This key is necessary for billing and access control when using the API.
- **Example**: "your_api_key_here"
2. **temperature (float, optional)**
- **Description**: Controls the randomness of the output from the language model. A higher temperature results in more varied and sometimes more creative responses. A lower temperature produces more predictable and conservative outputs. This parameter is optional, with a default value of 0, indicating the most deterministic behavior.
- **Default**: 0
- **Example**: 0.7 (for more creativity in responses)
3. **openai_rpm_seconds_pause (int, optional)**
   - **Description**: Specifies the number of seconds to pause between successive requests to the OpenAI API. This manages the rate of requests per minute (RPM) so calls conform to API rate limits without overloading the service. This parameter is optional; the default of 5 seconds suits typical rate limits.
- **Default**: 5
- **Example**: 2 (for a faster rate of API calls, suitable when higher RPM limits are allowed)
4. **llm_model (str, optional)**
- **Description**: The identifier for the specific language model to be used for generating prompts, topics, subtopics, questions, and answers. This parameter allows the user to specify different models that might be optimized for particular tasks or that offer different balances of speed, cost, and accuracy. The default model is "gpt-3.5-turbo", known for its efficiency and robustness.
- **Default**: "gpt-3.5-turbo"
- **Example**: "gpt-4" (if the user wishes to utilize a more advanced model, assuming it's available in the API)
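As a quick sanity check on the pacing parameter: a pause of `openai_rpm_seconds_pause` seconds between calls bounds throughput at roughly `60 / pause` requests per minute. The helper below is my own illustration of that arithmetic, not code from the library.

```python
def effective_rpm(seconds_pause: float) -> float:
    # One request every `seconds_pause` seconds => 60 / pause requests per minute
    return 60 / seconds_pause

print(effective_rpm(5))  # default 5s pause caps throughput at 12 requests/minute
print(effective_rpm(2))  # a 2s pause allows up to 30 requests/minute
```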
#### generate_dataset
The `generate_dataset` function automatically generates a structured dataset of questions (and, optionally, answers) organized by topics and subtopics drawn from the specified input domain. This makes it well suited to producing rich, diverse educational or research-oriented datasets for machine learning and data analysis tasks.
1. **input_domain (str)**
- **Description**: The broad area or field from which topics and subsequently questions will be generated. It sets the context for the entire dataset generation process.
- **Example**: "Computer Science", "Biology", "History"
2. **num_topics (int)**
- **Description**: The number of distinct topics to extract from the input domain. This number dictates how many major categories will be considered when generating the dataset.
- **Example**: 5 (would generate a dataset across 5 different topics in the specified domain)
3. **num_subtopics (int)**
- **Description**: The number of subtopics to be extracted for each topic. This parameter helps in drilling down into more specific areas within each main topic.
- **Example**: 3 (each topic will be further explored into 3 subtopics)
4. **num_datapoints (int)**
- **Description**: The total number of data points (question-and-answer pairs or just questions, depending on other parameters) intended to be generated across all topics.
- **Example**: 100 (aims to create a total of 100 data points)
5. **use_subtopic_index (bool, optional)**
- **Description**: A flag to decide whether to use a specific index for subtopics during the question generation. If set to True, the function will use the specific `subtopic_index` provided to focus question generation on a particular subtopic.
- **Example**: True or False
6. **subtopic_index (int, optional)**
- **Description**: Specifies the index of the subtopic to focus on if `use_subtopic_index` is True. This parameter is only required and used if `use_subtopic_index` is True.
- **Example**: 1 (focus on the second subtopic, as indexing typically starts at 0)
7. **generate_answers (bool, optional)**
- **Description**: Determines whether the dataset generation process should include answers for the generated questions. If set to False, only questions will be generated.
- **Example**: True (generate both questions and their corresponding answers)
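Assuming the datapoints are spread evenly over the topic/subtopic grid (an assumption on my part; the library may distribute them differently), the parameters relate as follows:

```python
def datapoints_per_cell(num_datapoints, num_topics, num_subtopics):
    # Under an even split, each (topic, subtopic) cell gets an equal share
    cells = num_topics * num_subtopics
    return num_datapoints // cells

# With the values from the usage example above:
print(datapoints_per_cell(100, num_topics=10, num_subtopics=5))  # 100 // 50 = 2 per cell
```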
## Performance
For detailed benchmarks and experimental results, please refer to the original paper "GenQA: Generating Millions of Instructions from a Handful of Prompts" by Chen et al. GeneratorPromptKit was inspired by their work and aims to provide a practical implementation of the concepts and techniques discussed in the paper.
## Cite our Work
```
@misc{generator-prompt-kit,
  title = {GeneratorPromptKit: A Python Library for Automated Generator Prompting and Dataset Generation},
  author = {Priyanshu, Aman and Vijay, Supriti},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/AmanPriyanshu/GeneratorPromptKit}
}
```
## References
- Chen et al. "GenQA: Generating Millions of Instructions from a Handful of Prompts". arXiv preprint arXiv:2406.10323, 2024.
## Contributing
Contributions to GeneratorPromptKit are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the [GitHub repository](https://github.com/AmanPriyanshu/GeneratorPromptKit).
## License
GeneratorPromptKit is released under the [MIT License](/LICENSE).