# GSPL: Gen-Selective Pseudo Labeling
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)
## Introduction
![Plot](https://github.com/louisbrulenaudet/gspl/blob/main/thumbnail.png?raw=true)
The Gen-Selective Pseudo Labeler (GSPL) generates multiple queries about the same document and then selects among them using transformer models. A large language model (LLM) acts as the judge, so labeling runs in two stages: completion (query generation) and selection. The project is particularly useful when multi-query generation and selection are needed to improve the quality and relevance of labeled data.
## Features
- **Multi-query Generation**: Generate multiple queries for a single document using powerful LLMs.
- **Selective Labeling**: Apply selection techniques to choose the best query from the generated set.
- **Parallel Processing**: Utilize multi-threading to efficiently process large datasets.
- **Integration with Hugging Face Transformers**: Seamlessly connect with transformer models hosted on Hugging Face.
- **Customizable Templates**: Apply custom chat templates to format queries and responses.
## Requirements
To use GSPL, you need the following dependencies installed:
- Python 3.7+
- Datasets
- tqdm
- langdetect
- tiktoken
- python-dotenv
- concurrent.futures (part of the Python standard library, no installation needed)
You can install the required dependencies using the following command:
```bash
pip install datasets tqdm langdetect python-dotenv tiktoken
```
Or install the package itself from PyPI:
```bash
pip install gspl-api
```
## Installation
To install GSPL, clone the repository and navigate to the project directory:
```bash
git clone https://github.com/louisbrulenaudet/gspl.git
cd gspl
```
## Usage
### Initialization
To initialize the GSPL class, provide the API key, dataset, and optional parameters for dataset split, streaming, and rate limits.
```python
from gspl import GSPL
api_key = "your_api_key"
dataset = "your_dataset"
gspl_instance = GSPL(api_key=api_key, dataset=dataset)
```
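Hard-coding the key is fine for a quick test. As a minimal sketch of an alternative, the key could be read from the environment instead. The variable name `HF_API_KEY` and the helper `read_api_key` are assumptions made for this illustration and are not part of GSPL.

```python
import os


def read_api_key(var_name: str = "HF_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    `HF_API_KEY` is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name!r} is not set")
    return key
```

The returned string can then be passed as `api_key` when constructing `GSPL`. Since python-dotenv is listed in the requirements, calling `load_dotenv()` beforehand is one way to populate the environment from a `.env` file.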
### Methods
#### `__init__(self, api_key: str, dataset: Dataset, split: Optional[str] = "train", streaming: Optional[bool] = False, rpm: Optional[int] = 30) -> None`
Initializes the GSPL class with the specified parameters.
- **Parameters**:
  - `api_key` (str): API key for the completion client.
  - `dataset` (Dataset): The dataset to be labeled.
  - `split` (str, optional): The dataset split to be used (default is "train").
  - `streaming` (bool, optional): Whether to stream the dataset (default is False).
  - `rpm` (int, optional): Requests per minute limit for rate limiting (default is 30).
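To make the `rpm` parameter concrete, here is a minimal sketch of a requests-per-minute limiter based on a sliding 60-second window. It illustrates the general idea only and is not GSPL's own implementation; the class name `RequestsPerMinuteLimiter` and its injectable `clock` and `sleep` arguments are assumptions introduced for this example.

```python
import time
from collections import deque
from typing import Callable, Deque


class RequestsPerMinuteLimiter:
    """Sliding-window limiter: at most `rpm` calls in any 60-second window."""

    def __init__(
        self,
        rpm: int = 30,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.rpm = rpm
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()

    def wait(self) -> None:
        """Block until another request may be sent, then record it."""
        now = self._clock()
        # Drop timestamps that have left the 60-second window.
        while self._calls and now - self._calls[0] >= 60.0:
            self._calls.popleft()
        if len(self._calls) >= self.rpm:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(60.0 - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
```

Calling `limiter.wait()` before each API request keeps the request rate at or below `rpm`. The clock and sleep functions are injectable only so the behavior can be checked without real waiting.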
#### Example Usage
```python
from gspl import GSPL
completion_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are an AI assistant specialized in creating targeted questions for documents to build a domain-specific dataset for embedding model training. Your task is to:
1. Analyze the given document or text.
2. Generate a highly specific question that directly relates to the main content or key information in the document.
3. Ensure the question is tailored to retrieve the document's content when used as a search query.
4. Format the question as a JSON object with a single "query" key.
5. Provide only the JSON object as output, without any additional text, introductions, or conclusions.
Remember:
- The question should be precise and relevant to the document's core information.
- Avoid generic questions; focus on unique aspects of the given text.
- Ensure the JSON is valid and can be parsed in Python.
- Do not include any explanations or additional text outside the JSON object.<|eot_id|><|start_header_id|>user<|end_header_id|>
{document}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
selection_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are an AI assistant specialized in selecting the most appropriate question from a batch of queries to match specific content. Your task is to:
1. Analyze the given content or document.
2. Review the provided batch of queries.
3. Select the question that best relates to the main content or key information in the document.
4. Ensure the selected question is highly specific and tailored to retrieve the document's content when used as a search query.
5. Format the selected question as a JSON object with a single "best_query" key.
6. Provide only the JSON object as output, without any additional text, introductions, or conclusions.
Remember:
- The selected question should be the most precise and relevant to the document's core information.
- Prioritize questions that focus on unique aspects of the given text.
- Ensure the JSON is valid and can be parsed in Python.
- Do not include any explanations or additional text outside the JSON object.<|eot_id|><|start_header_id|>user<|end_header_id|>
{queries}
Source text : {document}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
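# Generation settings forwarded to the inference API.
payload = {
    "parameters": {
        "temperature": 0.9,
        "return_full_text": False,
        "max_new_tokens": 250,
        "do_sample": True,
        "top_k": 50,
        "top_p": 0.95
    },
    "options": {
        "use_cache": False,
        "wait_for_model": True
    }
}

# `dataset` is assumed to be defined beforehand as a mapping that stores the
# dataset identifier under "datasetId".
gspl = GSPL(
    api_key="api_key",
    dataset=dataset["datasetId"],
    split="train"
)

# Build the generation prompts: fill {document} from the "output" column.
gspl.apply_chat_template(
    template=completion_template,
    output="inputs",
    document="output",
)

# Generate candidate queries and store them in the "queries" column.
gspl.label(
    payload=payload,
    output="queries"
)

# Build the selection prompts from the candidates and the source document.
gspl.apply_chat_template(
    template=selection_template,
    output="inputs",
    queries="queries",
    document="output"
)

# Ask the judge model to pick the best query.
gspl.select(
    payload=payload,
)

# Placeholder path for this example; adjust as needed.
output_file = "labeled_dataset.parquet"

gspl.to_parquet(
    filepath=output_file
)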
```
## Methods Explained
### `apply_chat_template(self, template: str, output: Optional[str] = "inputs", **kwargs_columns: str) -> None`
Applies a chat template to a dataset in parallel using multiple processes.
- **Parameters**:
  - `template` (str): The template prompt to use for the chat template.
  - `output` (str, optional): The name of the column where the processed template will be saved (default is "inputs").
  - `**kwargs_columns` (str): Mapping of template placeholder names to dataset column names.
- **Returns**:
  - `None`: `self.dataset` is updated in place with the chat template applied.
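The placeholder-to-column mapping described above can be sketched for a single row as follows. This is an illustration under stated assumptions, not GSPL's actual code: the helper `fill_template` and the sample row are hypothetical, and the sketch relies on Python's `str.format`.

```python
from typing import Dict


def fill_template(template: str, row: Dict[str, str], **kwargs_columns: str) -> str:
    """Fill each `{placeholder}` in `template` with the value of a row column.

    `kwargs_columns` maps placeholder names to column names, e.g.
    `document="output"` fills `{document}` with `row["output"]`.
    """
    values = {
        placeholder: row[column]
        for placeholder, column in kwargs_columns.items()
    }
    return template.format(**values)


# Hypothetical row with the document text stored in the "output" column.
row = {"output": "Article 1: All taxpayers must file annually."}
prompt = fill_template("Source text : {document}", row, document="output")
print(prompt)  # Source text : Article 1: All taxpayers must file annually.
```

With `str.format`, any literal braces in a template would need to be doubled (`{{` and `}}`), which is worth keeping in mind when a prompt embeds a JSON example.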
### `completion(self, payload: dict, api_url: Optional[str] = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct", response_key: Optional[str] = "query", validate_payload: bool = False) -> str`
Generates a completion response using the completion client.
- **Parameters**:
  - `payload` (dict): The payload to be sent to the completion API.
  - `api_url` (str, optional): The API URL for the completion client (default is "[https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct](https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct)").
  - `response_key` (str, optional): The key to extract the response from the completion client (default is "query").
  - `validate_payload` (bool, optional): Whether to validate the payload before sending (default is False).
- **Returns**:
  - `str`: The completion response from the API.
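Because the prompts ask the model for a JSON object, the role of `response_key` can be illustrated with a small parsing sketch. The helper `extract_response` below is hypothetical and only shows one plausible way to pull the keyed value out of generated text; GSPL's own parsing may differ.

```python
import json
import re
from typing import Optional


def extract_response(generated_text: str, response_key: str = "query") -> Optional[str]:
    """Return the string stored under `response_key` in the model's JSON output.

    Returns None when no parseable JSON object with that key is found.
    """
    # Models sometimes wrap the JSON in extra text, so take the span from the
    # first opening brace to the last closing brace.
    match = re.search(r"\{.*\}", generated_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    value = parsed.get(response_key)
    return value if isinstance(value, str) else None
```

Returning `None` instead of raising lets the caller decide whether to retry the request or skip the row.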
### `label(self, payload: dict, api_url: Optional[str] = "https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct", inputs_column: Optional[str] = "inputs", response_key: Optional[str] = "query", output: Optional[str] = "queries", num_return_sequences: Optional[int] = 3, token_threshold: Optional[int] = 600, validate_payload: bool = False) -> Dataset`
Labels the dataset using the completion client.
- **Parameters**:
  - `payload` (dict): The payload to be sent to the completion API.
  - `api_url` (str, optional): The API URL for the completion client (default is "[https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct](https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct)").
  - `inputs_column` (str, optional): The name of the column containing the input text (default is "inputs").
  - `response_key` (str, optional): The key to extract the response from the completion client (default is "query").
  - `output` (str, optional): The name of the column to store the output (default is "queries").
  - `num_return_sequences` (int, optional): The number of response sequences to generate for each input (default is 3).
  - `token_threshold` (int, optional): The token threshold for the completion response (default is 600).
  - `validate_payload` (bool, optional): Whether to validate the payload before sending (default is False).
- **Returns**:
  - `Dataset`: The labeled dataset.
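The combination of multi-query generation and multi-threading can be sketched with the standard library's `concurrent.futures`. This is a simplified illustration, not the library's implementation: `generate_queries`, `fake_complete`, and `max_workers` are names assumed for the example, and the completion call is replaced by a deterministic stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_queries(
    prompts: List[str],
    complete: Callable[[str], str],
    num_return_sequences: int = 3,
    max_workers: int = 4,
) -> List[List[str]]:
    """Call `complete` `num_return_sequences` times per prompt, in parallel.

    Returns one list of candidate queries per prompt, in input order.
    """
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    # Repeat each prompt so every candidate is an independent request.
    jobs = [prompt for prompt in prompts for _ in range(num_return_sequences)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `jobs`.
        results = list(pool.map(complete, jobs))
    # Regroup the flat results into one chunk per prompt.
    return [
        results[i : i + num_return_sequences]
        for i in range(0, len(results), num_return_sequences)
    ]


def fake_complete(prompt: str) -> str:
    """Stand-in for a real completion call; returns a deterministic string."""
    return f"query about {prompt}"


candidates = generate_queries(["doc A", "doc B"], fake_complete, num_return_sequences=2)
```

Threads suit this workload because each request spends most of its time waiting on the network. With a sampling temperature such as the 0.9 used in the example payload, repeated calls on the same prompt would be expected to return different candidates.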
### `select(self, payload: dict, api_url: Optional[str] = "https://api-inference.huggingface.co/models/microsoft/Phi-3-medium-4k-instruct", inputs_column: Optional[str] = "inputs", response_key: Optional[str] = "query", output: Optional[str] = "query", validate_payload: bool = False) -> None`
Selects outputs from the labeled dataset.
- **Parameters**:
  - `payload` (dict): The payload to be sent to the completion API.
  - `api_url` (str, optional): The API URL for the completion client (default is "[https://api-inference.huggingface.co/models/microsoft/Phi-3-medium-4k-instruct](https://api-inference.huggingface.co/models/microsoft/Phi-3-medium-4k-instruct)").
  - `inputs_column` (str, optional): The name of the column containing the input text (default is "inputs").
  - `response_key` (str, optional): The key to extract the response from the completion client (default is "query").
  - `output` (str, optional): The name of the column to store the output (default is "query").
  - `validate_payload` (bool, optional): Whether to validate the payload before sending (default is False).
- **Returns**:
  - `None`
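A judge model may paraphrase a candidate or return text that was never in the batch. As a hedged sketch of how a caller might guard against that, the hypothetical helper `pick_best_query` below accepts the judge's answer only if it matches one of the generated candidates and otherwise falls back to the first one. Both the helper and the fallback policy are assumptions made for this example, not documented GSPL behavior.

```python
from typing import List, Optional


def pick_best_query(candidates: List[str], selected: Optional[str]) -> str:
    """Return the judge's choice if it matches a candidate, else the first candidate."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if selected is not None:
        # Compare case-insensitively and ignore surrounding whitespace.
        normalized = selected.strip().casefold()
        for candidate in candidates:
            if candidate.strip().casefold() == normalized:
                return candidate
    # Fallback policy chosen for this sketch: keep the first generated query.
    return candidates[0]
```

Returning the original candidate string, rather than the judge's copy of it, keeps the stored query identical to what the generation step produced.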
### `to_parquet(self, filepath: str) -> None`
Saves the labeled dataset to a Parquet file.
- **Parameters**:
  - `filepath` (str): The file path to save the Parquet file.
- **Returns**:
  - `None`
## Citing this project
If you use this code in your research, please use the following BibTeX entry.
```BibTeX
@misc{louisbrulenaudet2024,
author = {Louis Brulé Naudet},
title = {GSPL: Gen-Selective Pseudo Labeler},
howpublished = {\url{https://github.com/louisbrulenaudet/gspl}},
year = {2024}
}
```
## Feedback
If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).