hybrid-search


Name: hybrid-search
Version: 0.0.15
Summary: Hybrid search with OpenSearch and Langchain
Upload time: 2024-07-28 22:23:25
Author: Alex Karmazin
Keywords: python, llm, science, review, hybrid search, semantic search
            
# hybrid search instructions
Created by following this article: https://opensearch.org/blog/hybrid-search/

## Installation:

Set up the environment with conda or micromamba:
```
micromamba create -f environment.yaml
micromamba activate hybrid_search
```

For OpenSearch itself, there are several installation options.

### From docker-compose

This repository comes with a two-node OpenSearch test cluster together with a dashboard.

Optional: change OPENSEARCH_JAVA_OPTS=-Xms2512m -Xmx2512m according to your available RAM; it is usually recommended to keep the two values equal in size.
Start docker-compose:
```bash
docker compose up
```
Open http://localhost:5601/ to explore the dashboard; "admin" is used as both the user name and the password by default.
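
The heap-size note above can be sketched as a small helper that builds an OPENSEARCH_JAVA_OPTS value with equal minimum and maximum heap. The function names are hypothetical, and the default of half the available RAM is an assumption taken from common OpenSearch guidance, not something this repository states.

```python
def java_opts(heap_mb: int) -> str:
    """Build an OPENSEARCH_JAVA_OPTS value with equal min (-Xms) and max (-Xmx) heap."""
    if heap_mb <= 0:
        raise ValueError("heap_mb must be positive")
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


def suggested_heap_mb(total_ram_mb: int, fraction: float = 0.5) -> int:
    """Suggest a heap size as a fraction of total RAM (the 0.5 default is an assumption)."""
    return int(total_ram_mb * fraction)


# Example: a machine with 8 GiB of RAM
print("OPENSEARCH_JAVA_OPTS=" + java_opts(suggested_heap_mb(8192)))
```

The printed value can then be placed in the environment section of the docker-compose file in place of the default.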

### Manual installation

- Go to https://opensearch.org/downloads.html and download OpenSearch, choosing the installation variant you prefer. OpenSearch Dashboards is a convenient tool but not mandatory.
- Install the latest Java.
- For Windows, unpack the archive. In opensearch_folder/config/opensearch.yml, make sure that plugins.security.ssl.http.enabled is set to true: it works correctly only with SSL enabled, although some functionality is still available over plain HTTP. Launch opensearch-windows-install.bat; despite the name, it is not an installer but the main launcher.
- For Linux, use Docker or follow the instructions in the documentation.

## Usage:
- Launch OpenSearch, either with docker-compose or with Java.
- Activate the environment:
```bash
micromamba activate hybrid_search # activate the environment
pip install -e . # [optional] install the current package locally
```
- Run index.py to index the test dataset. It creates an index and a pipeline for hybrid search.
- Run search.py to perform a test search:
```bash
python index.py # index the test dataset
python search.py # search using the default query
```
You can also tune the index.py parameters. For example:
```
python index.py main --url https://agingkills.eu:9200 --user admin --password admin --index_name index-bge-test_rsids_10k --embedding BAAI/bge-base-en-v1.5
```

If you want to use another embedding, for example specter2, try:
```bash
python index.py specter2
```
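
For orientation, here is a minimal sketch of what "an index and pipeline for hybrid search" can look like at the request level, following the OpenSearch hybrid-search article linked at the top. It only builds the JSON bodies and sends nothing. The weights, the field names `text` and `vector_field`, and both function names are assumptions for illustration; the actual index.py may configure things differently.

```python
import json


def search_pipeline_body(keyword_weight: float = 0.3, vector_weight: float = 0.7) -> dict:
    """Body for PUT /_search/pipeline/<name>: normalize sub-query scores, then combine them."""
    return {
        "description": "Post-processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # One weight per sub-query, in the same order as in the hybrid query
                        "parameters": {"weights": [keyword_weight, vector_weight]},
                    },
                }
            }
        ],
    }


def hybrid_query_body(text: str, vector: list, k: int = 10) -> dict:
    """Body for a _search request that combines a keyword and a k-NN sub-query."""
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # "text" is a hypothetical name for the field holding the document text
                    {"match": {"text": {"query": text}}},
                    # "vector_field" is a hypothetical name for the embedding field
                    {"knn": {"vector_field": {"vector": vector, "k": k}}},
                ]
            }
        },
    }


print(json.dumps(search_pipeline_body(), indent=2))
```

The hybrid query takes effect only when the search request is sent with the pipeline applied, for example through the `search_pipeline` request parameter.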

## Tests

### RSID test

Text fragments were deliberately inserted into the tacutu papers data ( /data/tacutopapers_test_rsids_10k ).
In particular, rs123456789 and rs123456788, as well as similar but misspelled rsids, were added to the documents:
* 10.txt contains both rsids two times
* 11.txt contains both rsids one time
* 12.txt and 13.txt each contain only one rsid
* 20.txt contains both wrong rsids two times
* 21.txt contains both wrong rsids one time
* 22.txt and 23.txt each contain only one wrong rsid

You can test them by:
```
python search.py test_rsids
```
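
To make the intent of this test concrete, the sketch below counts exact, whole-token occurrences of an rsid in a piece of text, which is roughly what the keyword side of a hybrid search should reward. The sample sentence and the misspelled identifier in it are made up for illustration and are not taken from the test data.

```python
import re


def count_exact(term: str, text: str) -> int:
    """Count whole-token, case-insensitive occurrences of term in text."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return len(pattern.findall(text))


# Hypothetical document text: two exact hits, one neighbouring rsid, one truncated rsid
doc = (
    "Variant rs123456789 was linked to rs123456788; "
    "rs123456789 recurs, while rs12345678 is a near miss."
)

print(count_exact("rs123456789", doc))  # exact rsid appears twice
print(count_exact("rs123456788", doc))  # the other rsid appears once
```

A document that contains only a misspelled rsid scores zero here, so it should rank below the documents with exact matches when searching for the correct identifier.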

### Comics superheroes test

There is also a similar test for "Comics superheroes" that checks the embeddings:
* Only document 114 contains text about superheroes, and that text does not contain the words 'comics' or 'superheroes'

You can test them by:
```
python search.py test_heroes
```
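
The reason such a document can still surface is that hybrid search merges a keyword score with an embedding score. The sketch below is a simplified illustration of min-max normalization followed by a weighted arithmetic mean, the combination described in the OpenSearch hybrid-search article. The scores, weights, and document names are invented, and this is not the package's own code.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; an empty input yields an empty result."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(lexical: dict, semantic: dict, weights=(0.3, 0.7)) -> dict:
    """Weighted arithmetic mean of normalized scores; a missing score counts as 0."""
    lex, sem = min_max(lexical), min_max(semantic)
    w_lex, w_sem = weights
    return {
        doc: (w_lex * lex.get(doc, 0.0) + w_sem * sem.get(doc, 0.0)) / (w_lex + w_sem)
        for doc in set(lex) | set(sem)
    }


# Hypothetical scores: doc_hero has no keyword hit but the best embedding match
lexical = {"doc_a": 4.0, "doc_b": 2.0}
semantic = {"doc_hero": 0.9, "doc_a": 0.5, "doc_b": 0.4}

ranked = sorted(combine(lexical, semantic).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # doc_hero ranks first despite having no keyword match
```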

Testing is not automated yet; you have to run the tests through the CLI.


## Troubleshooting

If something is not working with OpenSearch, read the log messages carefully. For example, if you are low on disk space, OpenSearch can block writes (the disk watermark issue), and the failure then surfaces with a different final error message.
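
As a quick local check for the watermark issue, the sketch below compares disk usage against the default disk watermarks (85% low, 90% high, 95% flood stage). These thresholds are configurable, so treating the defaults as the active values is an assumption, and the usage figure from `shutil.disk_usage` is only an approximation of what OpenSearch itself measures. The function names are hypothetical.

```python
import shutil

# Default disk watermarks as fractions of used disk space (assumed not overridden)
LOW, HIGH, FLOOD = 0.85, 0.90, 0.95


def watermark_state(used_fraction: float) -> str:
    """Classify disk usage against the default watermarks."""
    if used_fraction >= FLOOD:
        return "flood_stage"  # writes to affected indices get blocked
    if used_fraction >= HIGH:
        return "high"         # shards start moving away from the node
    if used_fraction >= LOW:
        return "low"          # no new shards are allocated to the node
    return "ok"


def disk_used_fraction(path: str = ".") -> float:
    """Fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


used = disk_used_fraction()
print(f"disk used: {used:.0%}, watermark state: {watermark_state(used)}")
```

Run it against the path where the OpenSearch data directory (or the Docker volume) lives; any state other than "ok" is a likely cause of failed indexing.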

            
