embedding-generator

Name	embedding-generator
Version	0.1.0
home_page	https://github.com/rrhd/embeddingGenerator
Summary	A module for generating embeddings for batches of texts using a SentenceTransformer model.
upload_time	2024-09-23 01:40:59
maintainer	None
docs_url	None
author	Ron Heichman
requires_python	<4.0,>=3.8
license	MIT
keywords	embedding nlp sentencetransformer
requirements	click ijson langchain_text_splitters numpy orjson scikit_learn sentence_transformers torch tqdm

# `EmbeddingGenerator` Documentation

## Overview

The `EmbeddingGenerator` class is designed to efficiently generate embeddings for a list of input texts using a model such as `SentenceTransformer`. It manages the process of splitting texts into manageable chunks, embedding them while considering token limits, and saving the results incrementally to avoid memory issues. The class is optimized to handle large datasets by processing texts in batches and managing resources effectively.

---

## Usage Example

Here's how you might use the `EmbeddingGenerator`:

```python
from sentence_transformers import SentenceTransformer

# Import path assumed from the package name; adjust it to the installed layout.
from embedding_generator import EmbeddingGenerator

# Initialize your model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Replace with your model

# Define model settings
model_settings = {
    'convert_to_tensor': True,
    'device': 'cuda',  # or 'cpu'
    'show_progress_bar': False
}

# Create an instance of EmbeddingGenerator
embedding_generator = EmbeddingGenerator(model, model_settings, save_path='data')

# Prepare your texts as an iterator
texts = iter([
    "This is the first text.",
    "Here's the second text.",
    # ... add more texts
])

# Generate embeddings
embeddings = embedding_generator(texts)

# Output embeddings
print(embeddings)
```

---

## What Happens During Execution

1. **Initialization**:
   - The `EmbeddingGenerator` is initialized with a model, model settings, and a save path.
   - It sets up internal structures for managing texts, embeddings, and progress tracking.

2. **Text Loading and Memory Management**:
   - Texts are loaded from the provided iterator using the `fill_texts` method.
   - The class loads texts incrementally and monitors memory usage so that it stays below `max_memory_usage`.

3. **Text Chunking**:
   - Texts are split into chunks using `RecursiveCharacterTextSplitter`, with the chunk size based on the model's `max_seq_length`.
   - This keeps every chunk small enough for the model to process in one pass.

4. **Token Counting**:
   - The `TokenCounter` estimates the number of tokens in each chunk.
   - These counts are used to size batches so that they fit within the current token limit.

5. **Batch Selection**:
   - The `find_best_chunks` method selects the chunks for the next batch, making the batch as large as possible without exceeding the limit.
   - Chunks are sorted and selected based on their token counts.

6. **Embedding Generation**:
   - The `embed` method processes the selected chunks using the model.
   - Each generated embedding is associated with the chunk it came from.

7. **Error Handling and Token Limit Adjustment**:
   - If a `RuntimeError` occurs (e.g., an out-of-memory error), the `fail` method lowers the token limit to prevent the same error on later batches.
   - After a successful batch, the `succeed` method records the success with the token limit estimator, which can then allow larger batches.

8. **Saving Progress**:
   - Embeddings and metadata are saved incrementally using the `save_data` method.
   - Each text's data is written to its own file, so no large JSON file ever has to be loaded in full.

9. **Resource Cleanup**:
   - Completed texts are removed from memory using the `remove_completed_texts` method.
   - This keeps memory usage bounded throughout the process.

10. **Final Output Generation**:
    - Upon completion, `load_average_embeddings_with_fallback` is called to compile the average embedding for each text.
    - The output is a dictionary mapping each text to its average embedding, or to `None` if no embedding is available.
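
The batch selection and token limit adjustment in steps 5 and 7 can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: `select_batch`, `AdaptiveTokenLimit`, and `embed_all` are hypothetical names, and both the largest-first greedy choice and the halve-on-failure, grow-by-10%-on-success policy are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence


def select_batch(token_counts: Sequence[int], limit: int) -> List[int]:
    """Greedily pick chunk positions, largest first, keeping the token total <= limit."""
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i], reverse=True)
    chosen: List[int] = []
    total = 0
    for i in order:
        if total + token_counts[i] <= limit:
            chosen.append(i)
            total += token_counts[i]
    return chosen


class AdaptiveTokenLimit:
    """Hypothetical estimator: shrink the limit after a failure, grow it after a success."""

    def __init__(self, initial_limit: int = 1024) -> None:
        self.limit = initial_limit

    def fail(self, attempted: int) -> None:
        # Drop to half of the batch size that just failed, never below one token.
        self.limit = max(1, min(self.limit, attempted) // 2)

    def succeed(self) -> None:
        # Probe upward by roughly 10% so the limit can recover over time.
        self.limit += max(1, self.limit // 10)


def embed_all(
    chunks: Sequence[str],
    count_tokens: Callable[[str], int],
    embed_batch: Callable[[List[str]], List[List[float]]],
    estimator: AdaptiveTokenLimit,
) -> List[List[float]]:
    """Embed every chunk, retrying with smaller batches when embed_batch raises RuntimeError."""
    counts = [count_tokens(c) for c in chunks]
    pending = list(range(len(chunks)))
    results: Dict[int, List[float]] = {}
    while pending:
        picked = select_batch([counts[i] for i in pending], estimator.limit)
        if not picked:
            # No chunk fits the limit: try the smallest one on its own.
            picked = [min(range(len(pending)), key=lambda j: counts[pending[j]])]
        batch = [pending[j] for j in picked]
        try:
            vectors = embed_batch([chunks[i] for i in batch])
        except RuntimeError:
            if len(batch) == 1:
                raise  # a single chunk cannot be made any smaller here
            estimator.fail(sum(counts[i] for i in batch))
            continue
        estimator.succeed()
        results.update(zip(batch, vectors))
        pending = [i for i in pending if i not in results]
    return [results[i] for i in range(len(chunks))]
```

Passing the embedding call in as `embed_batch` keeps the retry loop independent of any particular model, so it can be exercised with a fake function that raises `RuntimeError` above a chosen token total.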

---

## Output

The output of the `EmbeddingGenerator` is a dictionary where each key is an input text, and the value is one of the following:

- **List of Floats**: The average embedding for the text, represented as a list of floats.
- **`None`**: Indicates that the embedding for the text could not be generated or is missing.

### Example Output

```python
{
    "This is the first text.": [0.234, -0.987, 0.123, ...],  # Embedding vector
    "Here's the second text.": [0.456, -0.654, 0.789, ...]   # Embedding vector
}
```
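
To make the "average embedding" concrete, here is a minimal sketch of how per-chunk vectors could be reduced to one vector per text, with `None` for texts that have no chunk vectors. The helper names `average_embedding` and `collect_outputs` are hypothetical, and the unweighted mean is an assumption; the package may weight or combine chunks differently.

```python
from typing import Dict, List, Optional, Sequence


def average_embedding(chunk_vectors: Sequence[Sequence[float]]) -> Optional[List[float]]:
    """Return the element-wise mean of the chunk vectors, or None if there are none."""
    if not chunk_vectors:
        return None
    dim = len(chunk_vectors[0])
    if any(len(v) != dim for v in chunk_vectors):
        raise ValueError("all chunk vectors must have the same dimension")
    n = len(chunk_vectors)
    return [sum(v[k] for v in chunk_vectors) / n for k in range(dim)]


def collect_outputs(
    texts: Sequence[str],
    vectors_by_text: Dict[str, List[List[float]]],
) -> Dict[str, Optional[List[float]]]:
    """Map every input text to its average embedding, or to None when it is missing."""
    return {t: average_embedding(vectors_by_text.get(t, [])) for t in texts}
```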

---

## Notes for Users

- **File Structure**:
  - The `EmbeddingGenerator` saves data in a structured directory:
    ```
    data/
    ├── embeddings_index.json
    └── embeddings_data/
        ├── <text_id1>.json
        └── <text_id2>.json
    ```
  - Each text's data is saved in a separate JSON file, preventing the need to load large files into memory.

- **Memory Efficiency**:
  - Designed to handle large datasets by managing memory usage and saving progress incrementally.
  - Texts are removed from memory once processed to conserve resources.

- **Resumable Processing**:
  - If the process is interrupted, it can be resumed, and the class will continue from where it left off, avoiding recomputation.

- **GPU Utilization**:
  - Attempts to maximize GPU utilization by processing large batches without exceeding memory limits.
  - Adjusts batch sizes dynamically based on successful and failed attempts.

- **Error Handling**:
  - Handles out-of-memory errors gracefully by adjusting token limits and retrying with smaller batches.

- **Missing Data**:
  - If any embeddings are missing, the output dictionary will contain `None` for those texts.
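
The per-text file layout and resumable processing described in these notes can be illustrated with a short standard-library sketch. Everything beyond the directory and file names shown above is an assumption: deriving `<text_id>` from a SHA-256 hash, the fields stored in each file, and the shape of `embeddings_index.json` are invented for the example, and `save_text_result` and `pending_texts` are hypothetical helpers.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Sequence


def text_id(text: str) -> str:
    """Assumed ID scheme: hex SHA-256 of the UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def save_text_result(save_path: Path, text: str, vectors: List[List[float]]) -> None:
    """Write one text's chunk embeddings to its own file and record it in the index."""
    data_dir = save_path / "embeddings_data"
    data_dir.mkdir(parents=True, exist_ok=True)
    tid = text_id(text)
    payload = {"text": text, "embeddings": vectors}
    (data_dir / f"{tid}.json").write_text(json.dumps(payload), encoding="utf-8")

    # The index stays small because it holds metadata only, never the vectors.
    index_file = save_path / "embeddings_index.json"
    index = json.loads(index_file.read_text(encoding="utf-8")) if index_file.exists() else {}
    index[tid] = {"chunks": len(vectors)}
    index_file.write_text(json.dumps(index), encoding="utf-8")


def pending_texts(save_path: Path, texts: Sequence[str]) -> List[str]:
    """Return the texts that are not yet in the index, so a rerun skips finished work."""
    index_file = save_path / "embeddings_index.json"
    done = set(json.loads(index_file.read_text(encoding="utf-8"))) if index_file.exists() else set()
    return [t for t in texts if text_id(t) not in done]
```

Writing the data file before updating the index means an interruption between the two steps leaves a text marked as unfinished, so it is recomputed rather than silently lost.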

---

## Advanced Usage and Customization

### Adjusting Chunk Size

- By default, the chunk size is set based on the model's `max_seq_length`.
- You can customize the chunk size if needed:
  ```python
  from langchain_text_splitters import RecursiveCharacterTextSplitter

  embedding_generator.text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=1024,  # Desired chunk size, measured by length_function
      chunk_overlap=20,
      length_function=embedding_generator.token_counter,
  )
  ```
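
To show how `chunk_size` and `chunk_overlap` interact, the following sketch chunks a text with a sliding window over whitespace-separated words. It is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` splits recursively on a list of separators and measures length with the supplied `length_function`, whereas the hypothetical `chunk_by_tokens` below treats each word as one token.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

A larger overlap repeats more context between neighbouring chunks at the cost of more chunks to embed.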

### Handling Large Batches

- Increase the initial token limit to allow larger batches:
  ```python
  embedding_generator.limit_estimator = TokenLimitEstimator(initial_limit=2048)
  ```

- Adjust the model settings to change the batch size:
  ```python
  model_settings = {
      'convert_to_tensor': True,
      'device': 'cuda',
      'show_progress_bar': False,
      'batch_size': 64  # Adjust as per your GPU capacity
  }
  ```

### Monitoring Progress

- The class uses `tqdm` to display a progress bar during processing.
- You can access or customize it via `embedding_generator.progress_bar`.

## Running Embedding Generation from an Input File

The `run_embedding.py` script accepts an input file containing texts in various formats.

### Supported Input File Formats

- **Plain Text (`.txt`)**: Each line is treated as a separate text.
- **JSON (`.json`)**: The file can contain a list or dictionary of texts.
- **CSV (`.csv`)**: Each row's first column is treated as a text.
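
As a rough illustration of the three formats, the sketch below yields texts from a file based on its extension. It is not the script's own loader: `iter_texts` is a hypothetical helper, and it assumes that blank lines in `.txt` files are skipped, that the values (not the keys) of a JSON dictionary are the texts, and that CSV files have no header row. It also loads JSON files whole, which a streaming parser would avoid for very large inputs.

```python
import csv
import json
from pathlib import Path
from typing import Iterator


def iter_texts(path: Path) -> Iterator[str]:
    """Yield one text at a time from a .txt, .json, or .csv file."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line:  # assumption: blank lines are not texts
                    yield line
    elif suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: for a dictionary, the values are the texts.
        items = data.values() if isinstance(data, dict) else data
        for item in items:
            yield str(item)
    elif suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh):
                if row:
                    yield row[0]  # first column only
    else:
        raise ValueError(f"unsupported input format: {suffix}")
```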

### Example Usage

```bash
python scripts/run_embedding.py --input-file "path/to/texts.json" --model-path "path/to/model" --save-path "data" --device "cuda"
```

### Command-Line Options

- **`--input-file` or `-i`**: Path to the input file.
- **`--model-path` or `-m`**: Path to the SentenceTransformer model.
- **`--save-path` or `-o`**: Directory where embeddings will be saved. Defaults to `data`.
- **`--device` or `-d`**: Device to use (`'cpu'` or `'cuda'`). Defaults to `'cpu'`.
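
The options above can be mirrored with the standard library's `argparse`, which is handy for checking flag names and defaults without running the real script. This is a stand-in sketch: the actual script may be built differently (the package lists `click` among its requirements), and marking `--input-file` and `--model-path` as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="run_embedding.py",
        description="Generate embeddings for the texts in an input file.",
    )
    # Assumption: both paths are mandatory; the documentation gives them no default.
    parser.add_argument("-i", "--input-file", required=True, help="Path to the input file.")
    parser.add_argument("-m", "--model-path", required=True, help="Path to the model.")
    parser.add_argument("-o", "--save-path", default="data", help="Output directory.")
    parser.add_argument("-d", "--device", choices=("cpu", "cuda"), default="cpu")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input_file, args.model_path, args.save_path, args.device)
```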

### Notes

- Ensure the input file is properly formatted according to its extension.
- The embeddings are saved incrementally in the specified save path.
- The script handles large datasets efficiently, but ensure sufficient disk space is available.

---

## Conclusion

The `EmbeddingGenerator` generates embeddings for large datasets by splitting texts into chunks, batching those chunks under an adjustable token limit, and saving results incrementally. Because it recovers from out-of-memory errors by retrying with smaller batches and can resume after an interruption, long embedding jobs can run to completion without recomputing finished work.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/rrhd/embeddingGenerator",
    "name": "embedding-generator",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.8",
    "maintainer_email": null,
    "keywords": "embedding, NLP, SentenceTransformer",
    "author": "Ron Heichman",
    "author_email": "ronheichman@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/35/ea/dfd79a187e80ff10093aba5f3639f1eb5fd07722ccd2a0b23c208ee91914/embedding_generator-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A module for generating embeddings for batches of texts using a SentenceTransformer model.",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/rrhd/embeddingGenerator",
        "Repository": "https://github.com/rrhd/embeddingGenerator"
    },
    "split_keywords": [
        "embedding",
        " nlp",
        " sentencetransformer"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "704a05f5656285accee233b051f3bee1b341f557ac5734cec36f60d5645af0f9",
                "md5": "a0cbdd8cbc1515001c43ce794cdcd117",
                "sha256": "40995c1b12a65a2a3c208994ee4087541a92980e2a7459ce23af6a7a1e398c7a"
            },
            "downloads": -1,
            "filename": "embedding_generator-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a0cbdd8cbc1515001c43ce794cdcd117",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8",
            "size": 13170,
            "upload_time": "2024-09-23T01:40:57",
            "upload_time_iso_8601": "2024-09-23T01:40:57.709301Z",
            "url": "https://files.pythonhosted.org/packages/70/4a/05f5656285accee233b051f3bee1b341f557ac5734cec36f60d5645af0f9/embedding_generator-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "35eadfd79a187e80ff10093aba5f3639f1eb5fd07722ccd2a0b23c208ee91914",
                "md5": "ff29fc62b72189ce3210b2130c51da7e",
                "sha256": "f784c7c18aa018178a03253e3234ba588ea0e617c5aed519ff8e617cc2d3e8b8"
            },
            "downloads": -1,
            "filename": "embedding_generator-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ff29fc62b72189ce3210b2130c51da7e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.8",
            "size": 13925,
            "upload_time": "2024-09-23T01:40:59",
            "upload_time_iso_8601": "2024-09-23T01:40:59.547725Z",
            "url": "https://files.pythonhosted.org/packages/35/ea/dfd79a187e80ff10093aba5f3639f1eb5fd07722ccd2a0b23c208ee91914/embedding_generator-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-23 01:40:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rrhd",
    "github_project": "embeddingGenerator",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "click",
            "specs": [
                [
                    "==",
                    "8.1.7"
                ]
            ]
        },
        {
            "name": "ijson",
            "specs": [
                [
                    "==",
                    "3.3.0"
                ]
            ]
        },
        {
            "name": "langchain_text_splitters",
            "specs": [
                [
                    "==",
                    "0.2.2"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.0.1"
                ]
            ]
        },
        {
            "name": "orjson",
            "specs": [
                [
                    "==",
                    "3.10.7"
                ]
            ]
        },
        {
            "name": "scikit_learn",
            "specs": [
                [
                    "==",
                    "1.5.1"
                ]
            ]
        },
        {
            "name": "sentence_transformers",
            "specs": [
                [
                    "==",
                    "3.0.1"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "2.4.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.66.4"
                ]
            ]
        }
    ],
    "lcname": "embedding-generator"
}
