# MAP2GPT
## Description
MAP2GPT is a semantic search tool for textual content drawn from sources such as PDF files, Wikipedia pages, YouTube links, and plain-text files. It combines a transformer embedding model with a GPT chat model to return relevant, context-rich answers.
The project uses the GPT-3.5-turbo model to generate responses and a transformer model (`Sahajtomar/french_semantic` in the examples below) to create embeddings of the text. Users can build an index of embeddings from a source, explore the index interactively, and deploy the search on Telegram. For each query, the top k most relevant chunks are retrieved and passed as context to GPT-3.5-turbo, which generates the response.
The project is implemented in Python and relies on open-source libraries such as Click, OpenAI, Wikipedia, PyTorch, Tiktoken, and Rich. The code is organized into modular functions and classes, and the main script provides a command-line interface to the project's features.
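The retrieval step described above can be sketched in a few lines. This is only an illustration, not the package's implementation: it ranks pre-computed chunk embeddings by cosine similarity to a query embedding, keeps the top k, and joins them into a context string for the chat model. The function names (`cosine`, `top_k_chunks`, `build_prompt`) and the toy 3-dimensional vectors are assumptions made for the example; in the project the embeddings come from the transformer model and the vectors are stored in Qdrant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_chunks(query_vec, index, k):
    """Return the k chunk texts whose embeddings are closest to query_vec."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Join the retrieved chunks into a context block followed by the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy index of (chunk text, embedding) pairs; real embeddings are much larger.
index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("Qdrant is a vector database.", [0.0, 0.2, 0.9]),
    ("The Seine flows through Paris.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

chunks = top_k_chunks(query_vec, index, k=2)
prompt = build_prompt("What is the capital of France?", chunks)
print(prompt)
```

The resulting prompt string is what would be sent to the chat model, so the answer is grounded in the retrieved chunks rather than in the model's memory alone.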
## Table of Contents
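1. [Installation](#installation)
2. [Supported Transformer Models](#supported-transformer-models)
3. [CLI usage](#cli-usage)
   - [Build Index from PDF files](#build-index-from-pdf-files)
   - [Build Index from Wikipedia pages](#build-index-from-wikipedia-pages)
   - [Build Index from YouTube links](#build-index-from-youtube-links)
   - [Build Index from texts](#build-index-from-texts)
4. [Explore Index](#explore-index)
5. [Module usage](#module-usage)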
## Installation
To install the package and its dependencies in a virtual environment, run the following commands:
```bash
python -m venv env
source env/bin/activate
pip install --upgrade pip
pip install map2gpt
```
## Supported Transformer Models
This project supports a variety of transformer models, including models from the Hugging Face Model Hub and sentence-transformers. Below are some examples:
- Hugging Face model: 'Sahajtomar/french_semantic'
- Sentence-Transformers models: 'paraphrase-MiniLM-L6-v2', 'all-mpnet-base-v2', etc.
Pass the name of the model you choose through the `--transformer_model_name` option, and make sure the model is compatible with the project requirements.
# CLI usage
## Set env vars
```bash
export OPENAI_API_KEY=sk- TRANSFORMERS_CACHE=/path/to/cache QDRANT_PERSISTENT_FOLDER=/path/to_persistent
```
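Before running any command, it can help to confirm that these variables are visible to Python. The snippet below is a small sketch, not part of map2gpt: the helper name `missing_env_vars` is hypothetical, and treating all three variables as required is an assumption based on the `export` line above.

```python
import os

# Variable names taken from the export line above; treating all of them
# as mandatory is an assumption made for this sketch.
REQUIRED = ("OPENAI_API_KEY", "TRANSFORMERS_CACHE", "QDRANT_PERSISTENT_FOLDER")


def missing_env_vars(environ=os.environ, required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```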
## Build Index from PDF files
To build an index from one or more PDF files, run the following command, repeating `--path2pdf_files` for each file:
```bash
python -m map2gpt.main --transformer_model_name 'Sahajtomar/french_semantic' build-index-from-pdf-files \
    --path2pdf_files /path/to/file-000.pdf \
    --path2pdf_files /path/to/file-001.pdf \
    --name qdrant_collection_name \
    --chunk_size 256 \
    --batch_size 128
```
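The `--chunk_size` and `--batch_size` options suggest that the text is cut into fixed-size chunks that are then embedded in batches. The sketch below illustrates that idea only; it is not the package's code. It assumes `chunk_size` counts tokens and `batch_size` counts chunks per embedding call, and it uses whitespace splitting as a stand-in for a real tokenizer such as Tiktoken. The helper names `split_into_chunks` and `batched` are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Cut a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def batched(items, batch_size):
    """Yield consecutive groups of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# 1000 whitespace "tokens" as a stand-in for the text extracted from a document.
tokens = ("word " * 1000).split()

chunks = split_into_chunks(tokens, chunk_size=256)
batches = list(batched(chunks, batch_size=2))

print([len(chunk) for chunk in chunks])  # sizes of the chunks
print(len(batches))                      # number of embedding batches
```

Under these assumptions, a smaller chunk size gives more precise retrieval units at the cost of more vectors, while the batch size only affects how many chunks are embedded at once.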
## Build Index from Wikipedia pages
To build an index from one or more Wikipedia pages, run the following command, repeating `--urls` for each page:
```bash
python -m map2gpt.main --transformer_model_name 'Sahajtomar/french_semantic' build-index-from-wikipedia-pages \
    --urls https://...wikipedia \
    --urls https://...wikipedia \
    --name qdrant_collection_name \
    --chunk_size 256 \
    --batch_size 128
```
## Build Index from YouTube links
To build an index from one or more YouTube links, run the following command:
```bash
python -m map2gpt.main --transformer_model_name 'Sahajtomar/french_semantic' build-index-from-youtube-links \
    --urls https://...youtube \
    --urls https://...youtube \
    --name qdrant_collection_name \
    --chunk_size 256 \
    --batch_size 128
```
## Build Index from texts
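To build an index from a directory of text files, run the following command. If the subcommand shown does not accept `--path2directory`, run `python -m map2gpt.main --help` to list the available subcommands: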
```bash
python -m map2gpt.main --transformer_model_name 'Sahajtomar/french_semantic' build-index-from-wikipedia-pages \
    --path2directory /path/to/corpus_text_files \
    --name qdrant_collection_name \
    --chunk_size 256 \
    --batch_size 128
```
# Explore Index
Once an index is built, you can query it from the command line or deploy it as a Telegram bot.
## Query the index
```bash
python -m map2gpt.main --transformer_model_name 'Sahajtomar/french_semantic' query-index \
    --query "...." \
    --name qdrant_collection_name \
    --top_k 7 \
    --source_k 3 \
    --description "service description"
```
## Deploy on Telegram
```bash
python -m map2gpt.main --transformer_model_name 'Sahajtomar/french_semantic' deploy-on-telegram \
    --telegram_token XXXXXXXXX...XXXXXXXXXXX \
    --name qdrant_collection_name \
    --top_k 7 \
    --source_k 3 \
    --description "service description"
```
# Module usage
```python
from qdrant_client import QdrantClient
# GPTRunner and deploy_on_telegram are provided by map2gpt; import them from
# the package before running this example (the import path is not shown here).

# create the Qdrant client (in-memory here)
# for persistence, use QdrantClient(path=path2persistent_dir)
qdrant = QdrantClient(':memory:')

# initialize the runner
runner = GPTRunner(
    device='cuda:0',  # or 'cpu'
    qdrant=qdrant,
    tokenizer='gpt-3.5-turbo',
    openai_api_key='sk-XXXXXXXXXXXXXXXXXXXXX',
    transformers_cache='/path/to/transformers_cache',
    transformer_model_name='Sahajtomar/french_semantic'  # use all-mpnet-base-v2 for English
)

# build the knowledge base from PDF files
knowledge_base = runner.build_index_from_pdf_files(
    path2pdf_files=[
        '/path/to/file-000.pdf',
        '/path/to/file-001.pdf',
    ],
    chunk_size=256,
    batch_size=128,
    name='collection_name',
)

# create the Qdrant index from the knowledge base
runner.create_qdrant_index(knowledge_base=knowledge_base)

# deploy the search on Telegram
deploy_on_telegram(
    telegram_token='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
    runner=runner,
    name='collection_name',
    description="service name description",
    top_k=10,
    source_k=3
)
```
Raw data
{
"_id": null,
"home_page": "",
"name": "map2gpt",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "GPT3, gpt-index, llama-index, pdf2gpt, doc2gpt, wikipedia2gpt, semantic-search",
"author": "",
"author_email": "Ibrahima BA <ibrahima.elmokhtar@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/cf/0f/eb247df48b634b01ba1100940edb9c4fd3f3093e7d6ed0f623664eb4485a/map2gpt-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A python package to index and search documents using GPT3",
"version": "0.2.0",
"split_keywords": [
"gpt3",
" gpt-index",
" llama-index",
" pdf2gpt",
" doc2gpt",
" wikipedia2gpt",
" semantic-search"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "90f4c566c6630666bb73a359b6697cd14159ccaac0fc564787fb79c3e4f3892e",
"md5": "29749673a2da8821c1c7538d43d12930",
"sha256": "093efd9d874e0570a008e3e5a850ad53d2d3e68e643ea0c14b65474a52294432"
},
"downloads": -1,
"filename": "map2gpt-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "29749673a2da8821c1c7538d43d12930",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 14879,
"upload_time": "2023-04-03T09:18:33",
"upload_time_iso_8601": "2023-04-03T09:18:33.046105Z",
"url": "https://files.pythonhosted.org/packages/90/f4/c566c6630666bb73a359b6697cd14159ccaac0fc564787fb79c3e4f3892e/map2gpt-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cf0feb247df48b634b01ba1100940edb9c4fd3f3093e7d6ed0f623664eb4485a",
"md5": "d605435db4d3105f50bdefc5936b9560",
"sha256": "b23bc1bfe2279d71550f5df3762be62606e421456ca683c67109adb28247158f"
},
"downloads": -1,
"filename": "map2gpt-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "d605435db4d3105f50bdefc5936b9560",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 14015,
"upload_time": "2023-04-03T09:18:34",
"upload_time_iso_8601": "2023-04-03T09:18:34.708730Z",
"url": "https://files.pythonhosted.org/packages/cf/0f/eb247df48b634b01ba1100940edb9c4fd3f3093e7d6ed0f623664eb4485a/map2gpt-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-03 09:18:34",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "map2gpt"
}