docuverse

Name	docuverse
Version	0.0.8
home_page	https://github.com/primeqa/docuverse
Summary	State-of-the-art Retrieval/Search engine models, including ElasticSearch, ChromaDB, Milvus, and PrimeQA
upload_time	2024-11-22 18:29:28
maintainer	None
docs_url	None
author	PrimeQA/DocUVerse Team
requires_python	<3.12.0,>=3.10.0
license	Apache
keywords	question answering (qa) machine reading comprehension (mrc) information retrieval (ir)
requirements	No requirements were recorded.

            <!---
Copyright 2022 IBM Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<h3 align="center">
    <img width="350" alt="primeqa" src="docs/_static/img/PrimeQA.png">
    <p>Repository for (almost) *all* your document search needs.</p>
    <p>Part of the Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development.</p>
</h3>

DocUVerse is a public open-source repository that lets researchers and developers quickly
experiment with various search engines (such as ElasticSearch, ChromaDB, Milvus, PrimeQA, and FAISS)
in both direct search and reranking scenarios. With DocUVerse, a researcher
can replicate the experiments described in a recent NLP conference paper,
download pre-trained models from an online repository, and run them on
custom data. DocUVerse is built
on top of the [Transformers](https://github.com/huggingface/transformers), PrimeQA, and Elasticsearch toolkits and uses [datasets](https://huggingface.co/datasets/viewer/) and
[models](https://huggingface.co/PrimeQA) that are directly
downloadable.

## Design

The snippets below show how to run a query search against an existing engine, and how to ingest a corpus
and then run an evaluation search on it.
```python
from docuverse import SearchEngine, SearchQueries

# Test an existing engine
engine = SearchEngine(config="experiments/sap/elastic_v2/setup.yaml")
queries = SearchQueries(data="benchmark_v2.csv")

results = engine.search(queries)
scores = engine.compute_score(queries, results)
print(f"Results:\n{scores.to_string()}")
```
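The README does not say which metrics `compute_score` reports. As a rough illustration of what a retrieval score typically measures, here is a minimal, library-independent sketch of "success at k": the fraction of queries for which at least one relevant document appears among the top k results. The function name `success_at_k` and the dictionary layout are assumptions made for this example; they are not part of the DocUVerse API.

```python
from typing import Dict, List, Set


def success_at_k(ranked: Dict[str, List[str]],
                 relevant: Dict[str, Set[str]],
                 k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, doc_ids in ranked.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k documents is relevant.
        if any(doc_id in gold for doc_id in doc_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Hypothetical ranked results and gold labels for two queries.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}}

for k in (1, 2, 3):
    print(f"success@{k} = {success_at_k(ranked, relevant, k)}")
```

With this toy data the score rises from 0.0 at k=1 to 1.0 at k=3, which is why such metrics are usually reported at several cutoffs.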

Ingesting a new corpus (create an index for a specific engine) should be just as easy:
```python
from docuverse import SearchEngine, SearchCorpus, SearchQueries

corpus = SearchCorpus(filepaths="experiments/claspnq/passages.jsonl")
engine = SearchEngine(config="experiments/sap/elastic_v2/setup.yaml")
engine.ingest(corpus, max_doc_length=512, stride=100, title_handling="all", 
              index="my_new_index")

queries = SearchQueries(data="ClaspNQ.jsonl")
results = engine.search(queries)  # run the queries against the newly built index
scores = engine.compute_score(queries, results)
print(f"Results:\n{scores.to_string()}")
```
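The `max_doc_length` and `stride` arguments suggest that long documents are split into overlapping windows before indexing. The README does not define them precisely, so the sketch below rests on two assumptions: lengths are counted in whitespace-separated tokens (real engines usually count model tokens), and `stride` is the step between the starts of consecutive windows. The helper `split_into_windows` is hypothetical and only illustrates the general technique.

```python
from typing import List


def split_into_windows(text: str, max_doc_length: int, stride: int) -> List[str]:
    """Split text into windows of at most max_doc_length tokens, advancing by stride."""
    if max_doc_length <= 0 or stride <= 0:
        raise ValueError("max_doc_length and stride must be positive")
    tokens = text.split()
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(" ".join(tokens[start:start + max_doc_length]))
        # Stop once the current window reaches the end of the document.
        if start + max_doc_length >= len(tokens):
            break
        start += stride
    return windows


# Toy document of 10 tokens, split into windows of 4 tokens with a step of 3.
document = " ".join(f"t{i}" for i in range(10))
for window in split_into_windows(document, max_doc_length=4, stride=3):
    print(window)
```

When `stride` is smaller than `max_doc_length`, consecutive windows share tokens, so a passage that straddles a boundary still appears whole in at least one window.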

## ✔️ Getting Started

### Installation
[Installation doc](https://primeqa.github.io/primeqa/installation.html)       

```shell
# cd to project root

# If you want to run on GPU make sure to install torch appropriately

# E.g. for torch 1.11 + CUDA 11.3:
pip install 'torch~=1.11.0' --extra-index-url https://download.pytorch.org/whl/cu113

# Install as editable (-e) or non-editable using pip, with extras (e.g. tests) as desired
# Example installation commands:

# Minimal install (non-editable)
pip install .

# GPU support
pip install .[gpu]

# Full install (editable)
pip install -e .[all]
```

Please note that dependencies (specified in [setup.py](./setup.py)) are pinned to provide a stable experience.
When installing from source these can be modified, however this is not officially supported.
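The package metadata declares `requires_python` as `<3.12.0,>=3.10.0`, so it can save time to check the interpreter before installing. The following is a small sketch of such a check; the function name `python_is_supported` is made up for this example.

```python
import sys
from typing import Tuple


def python_is_supported(version: Tuple[int, int, int] = tuple(sys.version_info[:3])) -> bool:
    """Return True if version satisfies the declared range >=3.10.0,<3.12.0."""
    return (3, 10, 0) <= tuple(version[:3]) < (3, 12, 0)


print("Running Python", ".".join(map(str, sys.version_info[:3])))
print("Within the declared range:", python_is_supported())
```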

**Note:** in many environments, conda-forge based faiss libraries perform substantially better than the default ones installed with pip. To install faiss libraries from conda-forge, use the following steps:

- Create and activate a conda environment
- Install the faiss libraries with the following command:

```conda install -c conda-forge faiss=1.7.0 faiss-gpu=1.7.0```

- In `setup.py`, remove the faiss-related lines:

```python
"faiss-cpu~=1.7.2": ["install", "gpu"],
"faiss-gpu~=1.7.2": ["gpu"],
```

- Continue with the `pip install` commands as described above.
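After following these steps, it may help to confirm that a faiss module is actually importable in the active environment. The sketch below uses only the standard library to look for the module and, if it is present, prints its version; the helper name `faiss_available` is an assumption for this example, and the GPU check is guarded because not every build is guaranteed to expose `get_num_gpus`.

```python
import importlib.util


def faiss_available() -> bool:
    """Return True if a module named 'faiss' can be found in this environment."""
    return importlib.util.find_spec("faiss") is not None


if faiss_available():
    import faiss

    print("faiss version:", getattr(faiss, "__version__", "unknown"))
    # Guarded lookup: only call get_num_gpus if this build provides it.
    if hasattr(faiss, "get_num_gpus"):
        print("GPUs visible to faiss:", faiss.get_num_gpus())
else:
    print("faiss is not installed in this environment")
```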

## :speech_balloon: Blog Posts
There are several blog posts by members of the open-source community describing how they have used PrimeQA. Here are some of them:
1. [PrimeQA and GPT 3](https://www.marktechpost.com/2023/03/03/with-just-20-lines-of-python-code-you-can-do-retrieval-augmented-gpt-based-qa-using-this-open-source-repository-called-primeqa/)
2. [Enterprise search with PrimeQA](https://heidloff.net/article/introduction-neural-information-retrieval/)
3. [A search engine for Trivia geeks](https://www.deleeuw.me.uk/posts/Using-PrimeQA-For-NLP-Question-Answering/)


## 🧪 Unit Tests
[Testing doc](https://primeqa.github.io/primeqa/testing.html)       

To run the unit tests you first need to [install PrimeQA](#Installation).
Make sure to install with the `[tests]` or `[all]` extras from pip.

From there you can run the tests via pytest, for example:
```shell
pytest --cov PrimeQA --cov-config .coveragerc tests/
```

For more information, see:
- Our [tox.ini](./tox.ini)
- The [pytest](https://docs.pytest.org) and [tox](https://tox.wiki/en/latest/) documentation    

## 🔭 Learn more

| Section | Description |
|-|-|
| 📒 [Documentation](https://primeqa.github.io/primeqa) | Full API documentation and tutorials |
| 📓 [Tutorials: Jupyter Notebooks](https://github.com/primeqa/primeqa/tree/main/notebooks) | Notebooks to get started on QA tasks |
| 🤗 [Model sharing and uploading](https://huggingface.co/docs/transformers/model_sharing) | Upload and share your fine-tuned models with the community |
| ✅ [Pull Request](https://primeqa.github.io/primeqa/pull_request_template.html) | PrimeQA Pull Request |
| 📄 [Generate Documentation](https://primeqa.github.io/primeqa/README.html) | How Documentation works |        

## ❤️ PrimeQA collaborators include

Raw data

{
    "_id": null,
    "home_page": "https://github.com/primeqa/docuverse",
    "name": "docuverse",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.12.0,>=3.10.0",
    "maintainer_email": null,
    "keywords": "Question Answering (QA), Machine Reading Comprehension (MRC), Information Retrieval (IR)",
    "author": "PrimeQA/DocUVerse Team",
    "author_email": "primeqa@us.ibm.com",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "State-of-the-art Retrieval/Search engine models, including ElasticSearch, ChromaDB, Milvus, and PrimeQA",
    "version": "0.0.8",
    "project_urls": {
        "Homepage": "https://github.com/primeqa/docuverse"
    },
    "split_keywords": [
        "question answering (qa)",
        " machine reading comprehension (mrc)",
        " information retrieval (ir)"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fbbbfc4cd4c61b9d7305639c6858efe1c81b0630903bb7e027deaae7e3c712f2",
                "md5": "d5c412a6f80ebb798ea4cdd058d1a77d",
                "sha256": "7ac11af2b4152b245e3cd2b2979421fdb56e3be35a00dbc524dbc95667a8c2fe"
            },
            "downloads": -1,
            "filename": "docuverse-0.0.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d5c412a6f80ebb798ea4cdd058d1a77d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12.0,>=3.10.0",
            "size": 104982,
            "upload_time": "2024-11-22T18:29:28",
            "upload_time_iso_8601": "2024-11-22T18:29:28.681406Z",
            "url": "https://files.pythonhosted.org/packages/fb/bb/fc4cd4c61b9d7305639c6858efe1c81b0630903bb7e027deaae7e3c712f2/docuverse-0.0.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-22 18:29:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "primeqa",
    "github_project": "docuverse",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "docuverse"
}
