indexpaper 0.0.23

Summary: indexpaper - library to index papers with vector databases
Author: antonkulaga (Anton Kulaga)
Uploaded: 2024-03-20 16:15:17
Keywords: python, utils, files, papers, download, index, vector databases

# indexpaper

A project devoted to indexing papers in vector databases.

It was originally part of getpaper but no longer has any dependencies on it.

We provide features to index papers, as well as Semantic Scholar paper datasets, with OpenAI, HuggingFace, or Llama embeddings, and to save them to either a ChromaDB or a Qdrant vector store.

For OpenAI embeddings to work you have to create a .env file and specify your OpenAI key there; see .env.template as an example.
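A minimal .env sketch (OPENAI_API_KEY is the conventional variable name; check .env.template for the exact keys the library expects):
```
OPENAI_API_KEY="sk-..."
```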

# Getting started

Install the library with:
```bash
pip install indexpaper
```

On Linux systems you sometimes need to check that build-essential is installed:
```bash
sudo apt install build-essential
```
It is also recommended to use micromamba, conda, anaconda, or another environment manager to avoid bloating the system Python with too many dependencies.
Assuming you installed [micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html), you will have to create an environment and install the library locally:
```bash
micromamba create -f environment.yaml
micromamba activate indexpaper
pip install -e .
```
The last command is optional. With conda/anaconda the commands look the same but use a different executable name.

## Running local Qdrant

We provide a docker-compose configuration to run a local Qdrant (you can also use Qdrant Cloud instead).
To run local Qdrant, install Docker Compose (sometimes needs sudo) and run:
```bash
cd services
docker compose up
```
Then you should be able to open http://localhost:6333/dashboard for the Qdrant dashboard and http://0.0.0.0:5601 for the OpenSearch dashboard.
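As a quick sanity check that the local instance is reachable, you can list its collections from Python (this assumes the qdrant-client package is installed):
```python
from qdrant_client import QdrantClient

# connect to the local Qdrant started by docker compose
client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # an empty collection list on a fresh instance
```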

# Additional requirements

index.py has local dependencies on other modules; for this reason, if you are running it inside the indexpaper project folder, consider installing the package locally:
```bash
pip install -e .
```

# Indexing a dataset

To index a dataset you can use either the index.py dataset subcommand or look at the papers.ipynb example notebook to see how to do it in code.
For example, suppose we want to index the "longevity-genie/tacutu_papers" HuggingFace dataset using the "BAAI/bge-large-en-v1.5" HuggingFace embedding model, with "cuda" as the device and 10 papers per slice,
and write it to the local Qdrant instance at http://localhost:6333 (see services for the docker-compose file):
```bash
python indexpaper/index.py dataset --collection bge_large_v1.5_512_tacutu_papers_paragraphs_10 --dataset "longevity-genie/tacutu_papers" --url http://localhost:6333 --model BAAI/bge-large-en-v1.5 --slice 10 --chunk_size 500 --device cuda
```
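Conceptually, this amounts to loading the dataset, embedding text in slices, and upserting vectors into Qdrant. A rough, hypothetical sketch of that flow with datasets, sentence-transformers, and qdrant-client (the "text" field name is an assumption; index.py and papers.ipynb contain the actual implementation):
```python
from datasets import load_dataset
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

dataset = load_dataset("longevity-genie/tacutu_papers", split="train")
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
client = QdrantClient(url="http://localhost:6333")

collection = "bge_large_v1.5_512_tacutu_papers_paragraphs_10"
client.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=model.get_sentence_embedding_dimension(),
                                distance=Distance.COSINE),
)

texts = [row["text"] for row in dataset]  # the "text" field name is an assumption
for start in range(0, len(texts), 10):    # 10 papers per slice, mirroring --slice 10
    batch = texts[start:start + 10]
    vectors = model.encode(batch).tolist()
    client.upsert(
        collection_name=collection,
        points=[PointStruct(id=start + i, vector=vec, payload={"text": txt})
                for i, (vec, txt) in enumerate(zip(vectors, batch))],
    )
```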

Another example: if we want to use a Qdrant Cloud key instead of the local instance (fill in QDRANT_KEY in .env or set it as an environment variable), we can index Robi Tacutu's papers on CPU with a cluster URL (substitute your own) and the michiyasunaga/BioLinkBERT-large embeddings model:
```bash
python indexpaper/index.py dataset --url https://62d4a96e-2b91-4ab8-a4dd-a91e626d874a.europe-west3-0.gcp.cloud.qdrant.io:6333 --collection biolinkbert_large_512_tacutu_papers --embeddings huggingface --dataset "longevity-genie/tacutu_papers" --key QDRANT_KEY --model michiyasunaga/BioLinkBERT-large --slice 500 --chunk_size 512 --device cpu
```
If you do not specify embeddings, slice, and chunk size, then BGE-large-en with a chunk size of 512 and a slice of 100 is used by default, as in this example indexing the "longevity-genie/moskalev_papers" dataset:
```bash
python indexpaper/index.py dataset --collection bge_large_v1.5_512_moskalev_papers_paragraphs_10 --dataset "longevity-genie/moskalev_papers" --url https://62d4a96e-2b91-4ab8-a4dd-a91e626d874a.europe-west3-0.gcp.cloud.qdrant.io:6333 --key QDRANT_KEY
```
If you want to recreate the collection from scratch you can also add --rewrite true.


# Fast indexing

We also provide experimental support for fast indexing, which takes similar parameters.

For example, indexing Robi Tacutu's papers with QDRANT_KEY, a cluster URL (substitute your own), and the BAAI/bge-base-en-v1.5 embeddings model:
```bash
python indexpaper/index.py fast_index --url https://62d4a96e-2b91-4ab8-a4dd-a91e626d874a.europe-west3-0.gcp.cloud.qdrant.io:6333 --collection bge_base_en_v1.5_tacutu_papers_5 --dataset "longevity-genie/tacutu_papers" --key QDRANT_KEY --paragraphs 5 --model BAAI/bge-base-en-v1.5 --slice 100 --batch_size 50 --parallel 10
```
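The --batch_size and --parallel flags correspond naturally to the bulk-upload knobs exposed by qdrant-client; a hypothetical sketch with stand-in vectors (not the actual fast_index implementation):
```python
import numpy as np
from qdrant_client import QdrantClient

# stand-in for precomputed BAAI/bge-base-en-v1.5 embeddings (768-dimensional)
vectors = np.random.rand(100, 768).astype("float32")
payloads = [{"text": f"paragraph {i}"} for i in range(100)]

client = QdrantClient(url="https://<your-cluster-url>:6333", api_key="<QDRANT_KEY>")
client.upload_collection(
    collection_name="bge_base_en_v1.5_tacutu_papers_5",  # collection must already exist
    vectors=vectors,
    payload=payloads,
    batch_size=50,  # mirrors --batch_size 50
    parallel=10,    # mirrors --parallel 10
)
```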

# OpenSearch hybrid indexing

For example, indexing with BGE:
```bash
python indexpaper/index.py hybrid_index --collection tacutu_papers_bge_base_en_v1.5 --model "BAAI/bge-base-en-v1.5" --dataset "longevity-genie/tacutu_papers"
```
If you want to index with a GPU on a different host (for example, pic), use:
```bash
python indexpaper/index.py hybrid_index --collection tacutu_papers_bge_base_en_v1.5 --model "BAAI/bge-base-en-v1.5" --url "https://pic:9200" --dataset "longevity-genie/tacutu_papers" --device cuda
```

You can also make a test search:
```bash
python indexpaper/search.py hybrid --index "tacutu_papers_bge_base_en_v1.5" --model "BAAI/bge-base-en-v1.5" --query "mitochondrial GC content and longevity" --k 3 --verbose true
```

The same for the SPECTER2 model:
```bash
python indexpaper/index.py hybrid_index --collection specter2_tacutu_papers --model "allenai/specter2_base" --dataset "longevity-genie/tacutu_papers"
```

You can also make a test search:
```bash
python indexpaper/search.py hybrid --index "specter2_tacutu_papers" --model "allenai/specter2_base" --query "mitochondrial GC content and longevity" --k 3 --verbose true
```
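For reference, here is a simplified dense-only query against the same index with opensearch-py (the "vector" field name is an assumption, and a real hybrid query also mixes in BM25 scoring; search.py is the authoritative implementation):
```python
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=["http://localhost:9200"])
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# embed the query and run an approximate k-NN search
query_vector = model.encode("mitochondrial GC content and longevity").tolist()
response = client.search(
    index="tacutu_papers_bge_base_en_v1.5",
    body={"size": 3, "query": {"knn": {"vector": {"vector": query_vector, "k": 3}}}},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```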

# Indexing papers

For example, if you have your papers inside the data/output/test/papers folder and you want to build an index at data/output/test/index, you can do it with:
```bash
python indexpaper/index.py index_papers --papers data/output/test/papers --folder data/output/test/index --collection mypapers --chunk_size 6000
```

It is possible to use both Chroma and Qdrant. To use Qdrant, we provide a docker-compose file to set it up:
```bash
cd services
docker compose -f docker-compose.yaml up
```
Then you can run the indexing of the papers with Qdrant:
```bash
python indexpaper/index.py index_papers --papers data/output/test/papers --url http://localhost:6333 --collection mypapers --chunk_size 6000 --database Qdrant
```
You can also check whether things were added to the collection in the Qdrant web UI at http://localhost:6333/dashboard.
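You can also verify the upload programmatically (a short qdrant-client sketch):
```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.count(collection_name="mypapers"))  # number of stored chunks
```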

### Checking Semantic Scholar datasets

We also provide some convenience methods for the Semantic Scholar datasets API.
For example, if you want to get the s2orc dataset you can run:
```bash
python indexpaper/check_scholar.py --key <your_semantic_scholar_key> https://api.semanticscholar.org/datasets/v1/release/latest s2orc --output s2orc.json
```

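For reference, the same information can be fetched directly with requests (a sketch of the documented Semantic Scholar datasets API, which check_scholar.py presumably wraps):
```python
import json

import requests

headers = {"x-api-key": "<your_semantic_scholar_key>"}
base = "https://api.semanticscholar.org/datasets/v1"

# latest release id plus the list of datasets it contains
release = requests.get(f"{base}/release/latest", headers=headers).json()

# dataset metadata, including temporary download links for the s2orc files
s2orc = requests.get(f"{base}/release/{release['release_id']}/dataset/s2orc",
                     headers=headers).json()

with open("s2orc.json", "w") as f:
    json.dump(s2orc, f, indent=2)
```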

### Indexing with Llama-2 embeddings
You can also use Llama-2 embeddings if you install llama-cpp-python and pass a path to the model, for example for the https://huggingface.co/TheBloke/Llama-2-13B-GGML model:
```bash
python indexpaper/index.py index_papers --papers data/output/test/papers --url http://localhost:6333 --collection papers_llama2_2000 --chunk_size 2000 --database Qdrant --embeddings llama --model /home/antonkulaga/sources/indexpaper/data/models/llama-2-13b-chat.ggmlv3.q2_K.bin
```
Instead of explicitly passing the model path, you can also add LLAMA_MODEL to the .env file:
```
LLAMA_MODEL="/home/antonkulaga/sources/indexpaper/data/models/llama-2-13b-chat.ggmlv3.q2_K.bin"
```
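For reference, this is roughly how llama-cpp-python produces embeddings from such a model (a hypothetical sketch, not indexpaper's internal code; note that newer llama-cpp-python versions expect GGUF rather than GGML files):
```python
from llama_cpp import Llama

# embedding=True switches the model into embedding mode
llm = Llama(model_path="data/models/llama-2-13b-chat.ggmlv3.q2_K.bin", embedding=True)
vector = llm.embed("An example paragraph from a paper.")
print(len(vector))  # embedding dimensionality
```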
Note: if you want to use Qdrant Cloud you do not need docker-compose, but you need to provide a key and check the Qdrant Cloud settings for the URL to use.
```bash
python indexpaper/index.py index_papers --papers data/output/test/papers --url https://5bea7502-97d4-4876-98af-0cdf8af4bd18.us-east-1-0.aws.cloud.qdrant.io:6333 --key put_your_key_here --collection mypapers --chunk_size 6000 --database Qdrant
```
Note: there are currently temporary issues with Llama embeddings.

            
