# Semantic Synth
So you've decided to build out a RAG (Retrieval Augmented Generation) tool. Great! Now you want to build a vector search index for a critical production task.
Before you push into production, though, you'd like to test how well the search index itself is working.
We have also released a 520k-sample dataset for you to run tests against. Access it at: [Huggingface Hub](https://huggingface.co/datasets/wordlabs/semantic_search_quality)
## Current methods for testing
Most of the platforms we reviewed for testing semantic search correctness follow the same idea: collect passages from the dataset, have an LLM generate questions about them, have the LLM answer those questions, and then compute metrics such as faithfulness and correctness.
This is expensive because of the multiple LLM calls, and it does not scale well to very large datasets or continuous monitoring.
## How does this package work?
Given any text, we generate keywords using the YAKE library. This effectively makes the test self-supervised, with no need for expensive LLM calls for synthetic data generation.
### Philosophy
Semantic search mainly works by matching the latent meaning of a query against the vectors in a search index. It therefore stands to reason that it should be highly effective at finding the passages that contain a document's key phrases.
However, there are multiple steps where information can be lost. The first is the chunking strategy. The second is your vector embedding model. The third is the vector index you are using (these indices are often based on approximation algorithms and can be quite lossy).
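To illustrate the first source of loss, here is a minimal sketch of word-based chunking with overlap. It is not part of Semantic Synth; the helper names `chunk_words` and `phrase_survives` are hypothetical and exist only for this illustration. It shows how a key phrase can be split across chunk boundaries when no overlap is used.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def phrase_survives(phrase, chunks):
    """Return True if the phrase appears intact in at least one chunk."""
    return any(phrase in chunk for chunk in chunks)


sample = (
    "approximate nearest neighbour search trades recall "
    "for speed in large vector indices"
)

# Hypothetical settings, chosen small so the effect is visible.
no_overlap = chunk_words(sample, chunk_size=4, overlap=0)
with_overlap = chunk_words(sample, chunk_size=4, overlap=2)

print(phrase_survives("search trades", no_overlap))
print(phrase_survives("search trades", with_overlap))
```

In this toy example the phrase "search trades" straddles a chunk boundary when there is no overlap, so no single chunk contains it. With an overlap of two words it lands intact in one chunk.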
For academic purposes, premade datasets are good enough, but how do you run these estimates on your own data? That is where Semantic Synth comes in.
If a vector search index works well, it will be able to find phrases in your passages effectively. If not, there may be more to look at, such as your chunking strategy or your embedding model itself.
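The idea can be sketched as a small test loop: take key phrases whose source passage is known, search for each one, and check whether the source passage comes back in the top results. The sketch below is an assumption-laden illustration, not the package's implementation. `toy_search` is a bag-of-words cosine ranker standing in for your real vector index, and the key phrases are written by hand where generated keywords would normally go.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_search(query, passages, k):
    """Hypothetical stand-in for a vector index: indices of the k best passages."""
    q = bag_of_words(query)
    ranked = sorted(
        range(len(passages)),
        key=lambda i: cosine(q, bag_of_words(passages[i])),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(test_cases, passages, search, k=1):
    """Fraction of (phrase, source_index) pairs whose source is in the top k."""
    if not test_cases:
        return 0.0
    hits = sum(1 for phrase, source in test_cases if source in search(phrase, passages, k))
    return hits / len(test_cases)


passages = [
    "Chunking splits long documents into smaller passages before embedding.",
    "Approximate nearest neighbour indices trade recall for query speed.",
    "Embedding models map text to dense vectors.",
]

# Hand-written (phrase, index of source passage) pairs for illustration.
test_cases = [
    ("long documents", 0),
    ("nearest neighbour indices", 1),
    ("dense vectors", 2),
]

score = hit_rate_at_k(test_cases, passages, toy_search, k=1)
print(f"hit rate @1: {score:.2f}")
```

To test a real index, you would replace `toy_search` with a function that queries it and returns passage identifiers. A low hit rate then points back at the chunking strategy, the embedding model, or the index.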
### Who is this package for?
This package is specifically for testing the retrieval capacity of your vector search index using statistical metrics such as precision, recall, and F1 score.
Currently we support the generation of synthetic search terms for your content so that you can perform search and calculate accuracy metrics. We are working on adding a full-fledged testing suite.
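For reference, here is a minimal sketch of how these metrics can be computed for a single query, given the set of retrieved passage identifiers and the set of relevant ones. The function name `precision_recall_f1` and the document identifiers are hypothetical; the package itself does not currently ship this helper.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for one query.

    retrieved: identifiers returned by the search index
    relevant:  identifiers that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative identifiers only.
p, r, f1 = precision_recall_f1(
    retrieved=["doc1", "doc2", "doc3", "doc4"],
    relevant=["doc1", "doc3", "doc5"],
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall about 0.67).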
### Why did we build this?
This package was built to perform research on how different chunking strategies affect vector search accuracy.
## Usage
> WARNING: Please note that this is an alpha release and is only suitable for testing, not for production
### Installation
```shell
pip install semantic-synth
```
### Code
```python
from semantic_synth.datagen import KeywordDatasetGenerator

text = """
<Insert text here>
"""

gen = KeywordDatasetGenerator()

# Get the response for a single text
print(gen.generate(content=text))

# Get the results for several texts as a dataframe
content = [
    '<Insert text1 here>',
    '<Insert text2 here>',
    '<Insert text3 here>',
]

print(gen.generate_as_df(content=content))
```
Raw data
{
"_id": null,
"home_page": "https://github.com/wordlabs-io/semantic-synth",
"name": "semantic-synth",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "python,rag",
"author": "Tanishk Kithannae",
"author_email": "tanishk.kithannae@wordlabs.io",
"download_url": "",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Synthetic dataset generator for testing semantic search quality",
"version": "0.0.3",
"project_urls": {
"Homepage": "https://github.com/wordlabs-io/semantic-synth"
},
"split_keywords": [
"python",
"rag"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "707948707deddbb36935b6b0a043085d9f31b890d179921f422e8f5705aed2a1",
"md5": "8c10b4a3696947f315d705c56a21ce53",
"sha256": "622c263f913f8f8067fcaf8ecb75494f265754e440297fe5b7e8b8766fb3f6d9"
},
"downloads": -1,
"filename": "semantic_synth-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8c10b4a3696947f315d705c56a21ce53",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 4887,
"upload_time": "2024-02-26T07:14:41",
"upload_time_iso_8601": "2024-02-26T07:14:41.158788Z",
"url": "https://files.pythonhosted.org/packages/70/79/48707deddbb36935b6b0a043085d9f31b890d179921f422e8f5705aed2a1/semantic_synth-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-26 07:14:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "wordlabs-io",
"github_project": "semantic-synth",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "semantic-synth"
}