semantic-split

Name: semantic-split
Version: 0.1.0
Summary: A better way to split (chunk/group) your text before inserting them into an LLM/Vector DB.
Author: Agam More
Requires Python: >=3.10,<4.0
License: MIT
Upload time: 2023-06-12 03:31:22
# Semantic-Split

A Python library to chunk/group your text based on semantic similarity - ideal for pre-processing data for Language Models or Vector Databases. Leverages [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) and [spaCy](https://github.com/explosion/spaCy).

## Why?

1. **Better Context:** Providing more relevant context to your prompts enhances the LLM's performance ([arXiv:2005.14165](https://arxiv.org/abs/2005.14165) [cs.CL]). Semantic-Split groups related sentences together, ensuring your prompts have relevant context.

2. **Improved Results:** Short, precise prompts often yield the best results from LLMs ([arXiv:2004.04906](https://arxiv.org/abs/2004.04906) [cs.CL]). By grouping semantically similar sentences, Semantic-Split helps you craft such efficient prompts.

3. **Cost Savings:** LLMs like GPT-4 charge per token and have token limits (e.g., 8K tokens). With Semantic-Split, you can make your prompts shorter and more meaningful, leading to potential cost savings.

**Real world example**:

Imagine you're building an application where users ask questions about articles:

- A. We want to add only the parts of the article that are relevant to our query (for better results).
- B. We want to be able to query the article quickly (pre-processing).

1. We pre-process the article so each query is fast (point B): we split it into semantic chunks using `semantic-split` and store them in a [Vector DB](https://unzip.dev/0x014-vector-databases/?ref=github-semantic-split) as embeddings.
2. Each time the user asks something, we calculate the embedding for their question and find the top 3 most similar chunks in our Vector DB.
3. We add those 3 chunks to our prompt to get better results for our user's questions.

As you can see, part `1`, which involves semantic sentence splitting (grouping), is crucial. If we don't split or group the sentences semantically, we risk losing essential information, which diminishes the Vector DB's ability to surface the most suitable chunks. Consequently, we end up with poorer context for our prompts, hurting the quality of our responses.
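
To make the flow concrete, here is a minimal sketch of steps 1–3 using an in-memory list in place of a real Vector DB. It assumes the `sentence-transformers` package and the `all-MiniLM-L6-v2` model; the chunk strings, the `util.semantic_search` call, and the prompt format are illustrative choices, not part of semantic-split's API.

```python
# Illustrative sketch only: an in-memory stand-in for a Vector DB, not part of semantic-split.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Step 1 (pre-processing): embed each semantic chunk once and keep the vectors around.
chunks = [
    "Dogs and cats are popular pets.",           # placeholder chunks -- in practice these
    "AI now powers robots and space missions.",  # come from SimilarSentenceSplitter
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

# Step 2 (query time): embed the question and take the top 3 most similar chunks.
question = "Which pets are easy to keep?"
question_embedding = model.encode(question, convert_to_tensor=True)
hits = util.semantic_search(question_embedding, chunk_embeddings, top_k=3)[0]
top_chunks = [chunks[hit["corpus_id"]] for hit in hits]

# Step 3: add the retrieved chunks to the prompt.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
```

In practice the chunk embeddings would live in a Vector DB, and only the question would be embedded at query time.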

### Install

1. To use most of the functionality, you will need to install some prerequisites:
2. spaCy `en_core_web_sm` model: `python -m spacy download en_core_web_sm`
3. `poetry install`
4. See the examples below

### Examples

#### Sentence Split by Semantic Similarity

```python
from semantic_split import SimilarSentenceSplitter, SentenceTransformersSimilarity, SpacySentenceSplitter

text = """
  I dogs are amazing.
  Cats must be the easiest pets around.
  Robots are advanced now with AI.
  Flying in space can only be done by Artificial intelligence."""

model = SentenceTransformersSimilarity()
sentence_splitter = SpacySentenceSplitter()
splitter = SimilarSentenceSplitter(model, sentence_splitter)
res = splitter.split(text)
```

**Result**:

> `[["I dogs are amazing.", "Cats must be the easiest pets around."], `  
> `["Robots are advanced now with AI.", "Flying in space can only be done by Artificial intelligence."]]`

### Tests

`poetry run pytest`

            
