ovos-solver-bm25-plugin

Name	ovos-solver-bm25-plugin
Version	0.0.0
home_page	https://github.com/TigreGotico/ovos-solver-BM25-plugin
Summary	A question solver plugin for OVOS
upload_time	2024-10-25 22:25:27
maintainer	None
docs_url	None
author	jarbasai
requires_python	None
license	MIT
keywords	ovos openvoiceos plugin utterance fallback query
requirements	ovos-plugin-manager bm25s quebra-frases rapidfuzz json_database

# BM25CorpusSolver Plugin

BM25CorpusSolver is an OVOS (OpenVoiceOS) plugin designed to retrieve answers from a corpus of documents using the [BM25](https://en.wikipedia.org/wiki/Okapi_BM25)
algorithm. This solver is ideal for question-answering systems that require efficient and accurate retrieval of
information from a predefined set of documents.

- [Features](#features)
- [Retrieval Chatbots](#retrieval-chatbots)
  - [Example Solvers](#example-solvers)
    - [SquadQASolver](#squadqasolver)
    - [FreebaseQASolver](#freebaseqasolver)
  - [Implementing a Retrieval Chatbot](#implementing-a-retrieval-chatbot)
    - [BM25CorpusSolver](#bm25corpussolver)
    - [BM25QACorpusSolver](#bm25qacorpussolver)
  - [Limitations of Retrieval Chatbots](#limitations-of-retrieval-chatbots)
  - [ReRanking](#reranking)
    - [BM25MultipleChoiceSolver](#bm25multiplechoicesolver)
    - [BM25EvidenceSolverPlugin](#bm25evidencesolverplugin)
- [Embeddings Store](#embeddings-store)
- [Integrating with Persona Framework](#integrating-with-persona-framework)


## Features

- **BM25 Algorithm**: Utilizes the BM25 ranking function for information retrieval, providing relevance-based document scoring.
- **Configurable**: Allows customization of language, minimum confidence score, and the number of answers to retrieve.
- **Logging**: Integrates with OVOS logging system for debugging and monitoring.
- **BM25QACorpusSolver**: Extends `BM25CorpusSolver` to handle question-answer pairs, optimizing the retrieval process for QA datasets.
- **BM25MultipleChoiceSolver**: Reranks multiple-choice options based on relevance to the query.
- **BM25EvidenceSolverPlugin**: Extracts the best sentence from a text that answers a question using the BM25 algorithm.
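
For readers unfamiliar with BM25, the sketch below shows the Okapi BM25 scoring function in plain Python. It is only an illustration of the ranking idea: the plugin itself lists `bm25s` as a dependency, and the tokenizer, the `k1`/`b` defaults, and the function names here are assumptions made for this example, not the plugin's code.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document in `corpus` for `query`."""
    docs = [tokenize(d) for d in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # smoothed IDF; the "+ 1" inside the log keeps it non-negative
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term frequency saturates with k1 and is normalised by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
scores = bm25_scores("does the fish purr like a cat", corpus)
best = scores.index(max(scores))
print(corpus[best])
```

Rare terms such as "cat" and "purr" contribute far more than a term like "a" that appears in every document, which is why the first document ranks highest here.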

## Retrieval Chatbots

Retrieval chatbots use BM25CorpusSolver to provide answers to user queries by searching through a preloaded corpus of documents or QA pairs. 

These chatbots excel in environments where the information is structured and the queries are straightforward.

### Example Solvers

#### SquadQASolver

The SquadQASolver is a subclass of BM25QACorpusSolver that automatically loads and indexes the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) upon initialization.

This solver is suitable for usage with the ovos-persona framework.

```python
from ovos_bm25_solver import SquadQASolver

s = SquadQASolver()
query = "is there life on mars"
print("Query:", query)
print("Answer:", s.spoken_answer(query))
# 2024-07-19 22:31:12.625 - OVOS - __main__:load_corpus:60 - DEBUG - indexed 86769 documents
# 2024-07-19 22:31:12.625 - OVOS - __main__:load_squad_corpus:119 - INFO - Loaded and indexed 86769 question-answer pairs from SQuAD dataset
# Query: is there life on mars
# 2024-07-19 22:31:12.628 - OVOS - __main__:retrieve_from_corpus:69 - DEBUG - Rank 1 (score: 6.334013938903809): How is it postulated that Mars life might have evolved?
# 2024-07-19 22:31:12.628 - OVOS - __main__:retrieve_from_corpus:93 - DEBUG - closest question in corpus: How is it postulated that Mars life might have evolved?
# Answer: similar to Antarctic
```

#### FreebaseQASolver

The FreebaseQASolver is a subclass of BM25QACorpusSolver that automatically loads and indexes the [FreebaseQA dataset](https://github.com/kelvin-jiang/FreebaseQA) upon initialization.

This solver is suitable for usage with the ovos-persona framework.

```python
from ovos_bm25_solver import FreebaseQASolver

s = FreebaseQASolver()
query = "What is the capital of France"
print("Query:", query)
print("Answer:", s.spoken_answer(query))
# 2024-07-19 22:31:09.468 - OVOS - __main__:load_corpus:60 - DEBUG - indexed 20357 documents
# Query: What is the capital of France
# 2024-07-19 22:31:09.468 - OVOS - __main__:retrieve_from_corpus:69 - DEBUG - Rank 1 (score: 5.996074199676514): what is the capital of france
# 2024-07-19 22:31:09.469 - OVOS - __main__:retrieve_from_corpus:93 - DEBUG - closest question in corpus: what is the capital of france
# Answer: paris
```

### Implementing a Retrieval Chatbot

To use the BM25CorpusSolver, you need to create an instance of the solver, load your corpus, and then query it.

#### BM25CorpusSolver

This class is meant to be used to create your own solvers with a dedicated corpus.

```python
from ovos_bm25_solver import BM25CorpusSolver

config = {
    "lang": "en-us",
    "min_conf": 0.4,
    "n_answer": 2
}
solver = BM25CorpusSolver(config)

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
solver.load_corpus(corpus)

query = "does the fish purr like a cat?"
answer = solver.get_spoken_answer(query)
print(answer)

# Expected Output:
# 2024-07-19 20:03:29.979 - OVOS - ovos_plugin_manager.utils.config:get_plugin_config:40 - DEBUG - Loaded configuration: {'module': 'ovos-translate-plugin-server', 'lang': 'en-us'}
# 2024-07-19 20:03:30.024 - OVOS - __main__:load_corpus:28 - DEBUG - indexed 4 documents
# 2024-07-19 20:03:30.025 - OVOS - __main__:retrieve_from_corpus:37 - DEBUG - Rank 1 (score: 1.0584375858306885): a cat is a feline and likes to purr
# 2024-07-19 20:03:30.025 - OVOS - __main__:retrieve_from_corpus:37 - DEBUG - Rank 2 (score: 0.481589138507843): a fish is a creature that lives in water and swims
# a cat is a feline and likes to purr. a fish is a creature that lives in water and swims
```
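
The output above is consistent with `min_conf` acting as a score threshold and `n_answer` as a cap on how many documents are joined into the spoken answer. The sketch below illustrates that reading; `pick_answers` is a hypothetical helper written for this example, and comparing `min_conf` against the raw BM25 score is an assumption, not confirmed plugin behaviour.

```python
def pick_answers(ranked, min_conf=0.4, n_answer=2):
    """Join up to `n_answer` documents scoring at least `min_conf`, best first.

    `ranked` is a list of (score, document) tuples.
    """
    kept = [doc for score, doc in sorted(ranked, reverse=True) if score >= min_conf]
    return ". ".join(kept[:n_answer])


# scores copied from the debug log in the example above
ranked = [
    (1.0584375858306885, "a cat is a feline and likes to purr"),
    (0.481589138507843, "a fish is a creature that lives in water and swims"),
]
print(pick_answers(ranked, min_conf=0.4, n_answer=2))
print(pick_answers(ranked, min_conf=0.5, n_answer=2))  # the fish document is dropped
```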

#### BM25QACorpusSolver

This class is meant to be used to create your own solvers with a dedicated corpus of question-answer pairs.

BM25QACorpusSolver is an extension of BM25CorpusSolver, designed to work with question-answer pairs. It is particularly
useful when working with datasets like SQuAD, FreebaseQA, or similar QA datasets.

```python
import requests
from ovos_bm25_solver import BM25QACorpusSolver

# Load SQuAD dataset
corpus = {}
data = requests.get("https://github.com/chrischute/squad/raw/master/data/train-v2.0.json").json()
for s in data["data"]:
    for p in s["paragraphs"]:
        for qa in p["qas"]:
            if "question" in qa and qa["answers"]:
                corpus[qa["question"]] = qa["answers"][0]["text"]

# Load FreebaseQA dataset
data = requests.get("https://github.com/kelvin-jiang/FreebaseQA/raw/master/FreebaseQA-train.json").json()
for qa in data["Questions"]:
    q = qa["ProcessedQuestion"]
    a = qa["Parses"][0]["Answers"][0]["AnswersName"][0]
    corpus[q] = a

# Initialize BM25QACorpusSolver with config
config = {
    "lang": "en-us",
    "min_conf": 0.4,
    "n_answer": 1
}
solver = BM25QACorpusSolver(config)
solver.load_corpus(corpus)

query = "is there life on mars?"
answer = solver.get_spoken_answer(query)
print("Query:", query)
print("Answer:", answer)

# Expected Output:
# 86769 qa pairs imports from squad dataset
# 20357 qa pairs imports from freebaseQA dataset
# 2024-07-19 21:49:31.360 - OVOS - ovos_plugin_manager.language:create:233 - INFO - Loaded the Language Translation plugin ovos-translate-plugin-server
# 2024-07-19 21:49:31.360 - OVOS - ovos_plugin_manager.utils.config:get_plugin_config:40 - DEBUG - Loaded configuration: {'module': 'ovos-translate-plugin-server', 'lang': 'en-us'}
# 2024-07-19 21:49:32.759 - OVOS - __main__:load_corpus:61 - DEBUG - indexed 107126 documents
# Query: is there life on mars?
# 2024-07-19 21:49:32.760 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 1 (score: 6.037893295288086): How is it postulated that Mars life might have evolved?
# 2024-07-19 21:49:32.760 - OVOS - __main__:retrieve_from_corpus:94 - DEBUG - closest question in corpus: How is it postulated that Mars life might have evolved?
# Answer: similar to Antarctic
```

In this example, BM25QACorpusSolver is used to load a large corpus of question-answer pairs from the SQuAD and
FreebaseQA datasets. The solver retrieves the best matching answer for the given query.
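
The debug log shows the key idea: the query is matched against the stored questions, and the answer attached to the closest question is returned. The sketch below reproduces that flow with a simple word-overlap (Jaccard) score standing in for BM25; `overlap` and `answer_from_qa` are hypothetical names used only for this illustration.

```python
import re


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def answer_from_qa(query, qa_pairs):
    """Return (closest_question, answer) for `query`; `qa_pairs` maps question -> answer."""
    closest = max(qa_pairs, key=lambda question: overlap(query, question))
    return closest, qa_pairs[closest]


qa_pairs = {
    "what is the capital of france": "paris",
    "How is it postulated that Mars life might have evolved?": "similar to Antarctic",
}
question, answer = answer_from_qa("is there life on mars?", qa_pairs)
print("closest question in corpus:", question)
print("Answer:", answer)
```

Note that the answer is only as good as the closest stored question: here the query asks whether life exists, while the matched question asks how it might have evolved.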

### Limitations of Retrieval Chatbots

Retrieval chatbots, while powerful, have certain limitations. These include:

1. **Dependence on Corpus Quality and Size**: The accuracy of a retrieval chatbot heavily relies on the quality and comprehensiveness of the underlying corpus. A limited or biased corpus can lead to inaccurate or irrelevant responses.
2. **Static Knowledge Base**: Unlike generative models, retrieval chatbots can't generate new information or answers. They can only return content that already exists in the corpus.
3. **Contextual Understanding**: BM25 ranks documents by the terms they share with the query, so it can struggle with paraphrases, synonyms, and nuanced or complex queries that require deep contextual understanding.
4. **Scalability**: As the size of the corpus increases, the computational resources required for indexing and retrieving relevant documents also increase, potentially impacting performance.
5. **Dynamic Updates**: Keeping the corpus updated with the latest information can be challenging, especially in fast-evolving domains.

Despite these limitations, retrieval chatbots are effective for domains where the corpus is well-defined and relatively static, such as FAQs, documentation, and knowledge bases.
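
Because BM25 is a lexical ranking function, a query can only score against documents it shares terms with. The short sketch below makes that concrete by listing the shared terms for two phrasings of a similar question; `shared_terms` is a hypothetical helper, not part of the plugin.

```python
import re


def shared_terms(query, document):
    """Return the set of words that appear in both the query and the document."""
    q = set(re.findall(r"\w+", query.lower()))
    d = set(re.findall(r"\w+", document.lower()))
    return q & d


doc = "a dog is the human's best friend and loves to play"

# same vocabulary as the document: there is something to score
print(shared_terms("which animal loves to play", doc))

# a paraphrase with different words: no shared terms, so a lexical score would be zero
print(shared_terms("what pet enjoys games", doc))
```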

### ReRanking

ReRanking is a technique used to refine a list of potential answers by evaluating their relevance to a given query. 
This process is crucial in scenarios where multiple options or responses need to be assessed to determine the most appropriate one.

In retrieval chatbots, ReRanking helps in selecting the best answer from a set of retrieved documents or options, enhancing the accuracy of the response provided to the user.

`MultipleChoiceSolver` plugins are integrated into the OVOS Common Query framework, where they are used to select the most relevant answer from a set of multiple skill responses.

#### BM25MultipleChoiceSolver

BM25MultipleChoiceSolver is designed to select the best answer to a question from a list of options.

In the context of retrieval chatbots, BM25MultipleChoiceSolver is useful for scenarios where a user query results in a list of predefined answers or options. 
The solver ranks these options based on their relevance to the query and selects the most suitable one.


```python
from ovos_bm25_solver import BM25MultipleChoiceSolver

solver = BM25MultipleChoiceSolver()
a = solver.rerank("what is the speed of light", [
    "very fast", "10m/s", "the speed of light is C"
])
print(a)
# 2024-07-22 15:03:10.295 - OVOS - __main__:load_corpus:61 - DEBUG - indexed 3 documents
# 2024-07-22 15:03:10.297 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 1 (score: 0.7198746800422668): the speed of light is C
# 2024-07-22 15:03:10.297 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 2 (score: 0.0): 10m/s
# 2024-07-22 15:03:10.297 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 3 (score: 0.0): very fast
# [(0.7198747, 'the speed of light is C'), (0.0, '10m/s'), (0.0, 'very fast')]

# NOTE: select_answer is part of the MultipleChoiceSolver base class and uses rerank internally
a = solver.select_answer("what is the speed of light", [
    "very fast", "10m/s", "the speed of light is C"
])
print(a) # the speed of light is C
```
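
The relationship between `rerank` and `select_answer` noted in the comment can be sketched as follows. This is an illustration only: the functions below are standalone stand-ins, and the word-count reranker is an assumption used in place of BM25 so the example runs without the plugin installed.

```python
def word_overlap_rerank(query, options):
    """Stand-in reranker: score each option by the number of words shared with the query.

    Returns (score, option) tuples sorted best first, mirroring the output shape shown above.
    """
    q = set(query.lower().split())
    scored = [(float(len(q & set(o.lower().split()))), o) for o in options]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def select_answer(query, options, rerank=word_overlap_rerank):
    """Return the single highest-scoring option according to `rerank`."""
    ranked = rerank(query, options)
    return max(ranked, key=lambda pair: pair[0])[1]


options = ["very fast", "10m/s", "the speed of light is C"]
print(word_overlap_rerank("what is the speed of light", options))
print(select_answer("what is the speed of light", options))
```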

#### BM25EvidenceSolverPlugin

BM25EvidenceSolverPlugin is designed to extract the most relevant sentence from a text passage that answers a given question. This plugin uses the BM25 algorithm to evaluate and rank sentences based on their relevance to the query.

In text extraction and machine comprehension tasks, BM25EvidenceSolverPlugin enables the identification of specific sentences within a larger body of text that directly address a user's query. 

For example, in a scenario where a user queries about the number of rovers exploring Mars, BM25EvidenceSolverPlugin scans the provided text passage, ranks sentences based on their relevance, and extracts the most informative sentence.

```python
from ovos_bm25_solver import BM25EvidenceSolverPlugin

config = {
    "lang": "en-us",
    "min_conf": 0.4,
    "n_answer": 1
}
solver = BM25EvidenceSolverPlugin(config)

text = """Mars is the fourth planet from the Sun. It is a dusty, cold, desert world with a very thin atmosphere. 
Mars is also a dynamic planet with seasons, polar ice caps, canyons, extinct volcanoes, and evidence that it was even more active in the past.
Mars is one of the most explored bodies in our solar system, and it's the only planet where we've sent rovers to roam the alien landscape. 
NASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.
"""
query = "how many rovers are currently exploring Mars"
answer = solver.get_best_passage(evidence=text, question=query)
print("Query:", query)
print("Answer:", answer)
# 2024-07-22 15:05:14.209 - OVOS - __main__:load_corpus:61 - DEBUG - indexed 5 documents
# 2024-07-22 15:05:14.209 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 1 (score: 1.39238703250885): NASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.
# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 2 (score: 0.38667747378349304): Mars is one of the most explored bodies in our solar system, and it's the only planet where we've sent rovers to roam the alien landscape.
# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 3 (score: 0.15732118487358093): Mars is the fourth planet from the Sun.
# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 4 (score: 0.10177625715732574): Mars is also a dynamic planet with seasons, polar ice caps, canyons, extinct volcanoes, and evidence that it was even more active in the past.
# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 5 (score: 0.0): It is a dusty, cold, desert world with a very thin atmosphere.
# Query: how many rovers are currently exploring Mars
# Answer: NASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.

```

In this example, `BM25EvidenceSolverPlugin` effectively identifies and retrieves the most relevant sentence from the provided text that answers the query about the number of rovers exploring Mars. 
This capability is essential for applications requiring information extraction from extensive textual content, such as automated research assistants or content summarizers.
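
The two steps involved, splitting the passage into sentences and ranking each sentence against the question, can be sketched in plain Python. The regex splitter and the shared-word score below are simplifications chosen for this example (the plugin lists `quebra-frases` and `bm25s` among its requirements), and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def best_sentence(evidence, question):
    """Return the sentence sharing the most words with the question."""
    q = set(re.findall(r"\w+", question.lower()))

    def score(sentence):
        return len(q & set(re.findall(r"\w+", sentence.lower())))

    return max(split_sentences(evidence), key=score)


text = """Mars is the fourth planet from the Sun. It is a dusty, cold, desert world with a very thin atmosphere.
Mars is one of the most explored bodies in our solar system, and it's the only planet where we've sent rovers to roam the alien landscape.
NASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.
"""
print(best_sentence(text, "how many rovers are currently exploring Mars"))
```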

## Embeddings Store

A fake embeddings store, backed only by text search, is provided.

> NOTE: this does not scale to large datasets

```python
from ovos_bm25_solver.embed import JsonEmbeddingsDB, BM25TextEmbeddingsStore
db = JsonEmbeddingsDB("bm25_index")
# Initialize the BM25 text embeddings store
index = BM25TextEmbeddingsStore(db=db)

# Add documents to the database
text = "hello world"
text2 = "goodbye cruel world"
index.add_document(text)
index.add_document(text2)

# Querying with fuzzy match
results = db.query("the world", top_k=2)
print("Fuzzy Match Results:", results)

# Querying with BM25
results = index.query("the world", top_k=2)
print("BM25 Query Results:", results)

# Comparing strings using fuzzy match
distance = index.distance(text, text2)
print("Distance between strings:", distance)
```
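
To make the idea of a text-search-backed store concrete, here is a minimal sketch that keeps raw strings and ranks them by string similarity. Everything in it is an assumption for illustration: `TextSearchStore` is a hypothetical class, `difflib` from the standard library stands in for the fuzzy matcher, and the return shape of `query` is not taken from the plugin.

```python
from difflib import SequenceMatcher


class TextSearchStore:
    """Hypothetical store: keeps raw strings and ranks them by string similarity."""

    def __init__(self):
        self.documents = []

    def add_document(self, text):
        self.documents.append(text)

    def distance(self, a, b):
        """0.0 for identical strings, approaching 1.0 as they share less."""
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    def query(self, text, top_k=5):
        """Return up to `top_k` (document, distance) pairs, closest first."""
        # linear scan over every stored document on each query
        scored = [(doc, self.distance(text, doc)) for doc in self.documents]
        return sorted(scored, key=lambda pair: pair[1])[:top_k]


store = TextSearchStore()
store.add_document("hello world")
store.add_document("goodbye cruel world")
print(store.query("the world", top_k=2))
print(store.distance("hello world", "goodbye cruel world"))
```

The linear scan in `query` is the reason a design like this does not scale: every query compares against every stored document.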

## Integrating with Persona Framework

To use the `SquadQASolver` and `FreebaseQASolver` in the persona framework, you can define a persona configuration file and specify the solvers to be used.

Here's an example of how to define a persona that uses the `SquadQASolver` and `FreebaseQASolver`:

1. Create a persona configuration file, e.g., `qa_persona.json`:

```json
{
  "name": "QAPersona",
  "solvers": [
    "ovos-solver-bm25-squad-plugin",
    "ovos-solver-bm25-freebase-plugin",
    "ovos-solver-failure-plugin"
  ]
}
```

2. Run [ovos-persona-server](https://github.com/OpenVoiceOS/ovos-persona-server) with the defined persona:

```bash
$ ovos-persona-server --persona qa_persona.json
```

In this example, the persona named "QAPersona" will first use the `SquadQASolver` to answer questions. If it cannot find an answer, it will fall back to the `FreebaseQASolver`. Finally, it will use the `ovos-solver-failure-plugin` to ensure it always responds with something, even if the previous solvers fail.
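
The fallback order described above amounts to trying each solver in turn and returning the first non-empty answer. The sketch below shows that pattern with stub functions; it is not the ovos-persona implementation, and `ask_persona` and the three stubs are hypothetical names invented for this example.

```python
def ask_persona(query, solvers):
    """Try each solver in order and return the first non-empty answer."""
    for solver in solvers:
        answer = solver(query)
        if answer:
            return answer
    return None


def squad_stub(query):
    """Pretend the SQuAD-backed solver found no match."""
    return None


def freebase_stub(query):
    """Pretend the FreebaseQA-backed solver only knows one fact."""
    return "paris" if "capital of france" in query.lower() else None


def failure_stub(query):
    """Last resort: always answers, so the persona never stays silent."""
    return "Sorry, I don't know the answer to that."


solvers = [squad_stub, freebase_stub, failure_stub]
print(ask_persona("What is the capital of France", solvers))
print(ask_persona("who won the match yesterday", solvers))
```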


See `setup.py` for a reference on how to package your own corpus-backed solvers:

```python
PLUGIN_ENTRY_POINTS = [
    'ovos-solver-bm25-squad-plugin=ovos_bm25_solver:SquadQASolver',
    'ovos-solver-bm25-freebase-plugin=ovos_bm25_solver:FreebaseQASolver'
]
```
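
Each entry point string has the form `plugin-name=module:ClassName`. The small sketch below splits such a string into its parts to show what is being declared; `parse_entry_point` is a hypothetical helper for illustration, and the actual `setup()` call and entry point group name should be taken from the repository's `setup.py`.

```python
def parse_entry_point(spec):
    """Split 'name=module:attr' into a (name, module, attr) tuple."""
    name, _, target = spec.partition("=")
    module, _, attr = target.partition(":")
    return name.strip(), module.strip(), attr.strip()


PLUGIN_ENTRY_POINTS = [
    'ovos-solver-bm25-squad-plugin=ovos_bm25_solver:SquadQASolver',
    'ovos-solver-bm25-freebase-plugin=ovos_bm25_solver:FreebaseQASolver'
]

for spec in PLUGIN_ENTRY_POINTS:
    name, module, attr = parse_entry_point(spec)
    print(f"plugin {name!r} loads class {attr} from module {module}")
```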

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/TigreGotico/ovos-solver-BM25-plugin",
    "name": "ovos-solver-bm25-plugin",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "OVOS openvoiceos plugin utterance fallback query",
    "author": "jarbasai",
    "author_email": "jarbasai@mailfence.com",
    "download_url": "https://files.pythonhosted.org/packages/56/34/c8115622b654950a3e4cfe41aa900e2cc3522e549b0fc16108e14240118e/ovos-solver-bm25-plugin-0.0.0.tar.gz",
    "platform": null,
    "description": "# BM25CorpusSolver Plugin\n\nBM25CorpusSolver is an OVOS (OpenVoiceOS) plugin designed to retrieve answers from a corpus of documents using the [BM25](https://en.wikipedia.org/wiki/Okapi_BM25)\nalgorithm. This solver is ideal for question-answering systems that require efficient and accurate retrieval of\ninformation from a predefined set of documents.\n\n- [Features](#features)\n- [Retrieval Chatbots](#retrieval-chatbots)\n   - [Example Solvers](#example-solvers)\n     - [SquadQASolver](#squadqasolver)\n     - [FreebaseQASolver](#freebaseqasolver)\n  - [Implementing a Retrieval Chatbot](#implementing-a-retrieval-chatbot)\n     - [BM25CorpusSolver](#bm25corpussolver)\n     - [BM25QACorpusSolver](#bm25qacorpussolver)\n  - [Limitations of Retrieval Chatbots](#limitations-of-retrieval-chatbots)\n- [ReRanking](#reranking)\n   - [BM25MultipleChoiceSolver](#bm25multiplechoicesolver)\n   - [BM25EvidenceSolverPlugin](#bm25evidencesolverplugin)\n- [Embeddings Store](#embeddings-store)\n- [Integrating with Persona Framework](#integrating-with-persona-framework)\n\n\n## Features\n\n- **BM25 Algorithm**: Utilizes the BM25 ranking function for information retrieval, providing relevance-based document scoring.\n- **Configurable**: Allows customization of language, minimum confidence score, and the number of answers to retrieve.\n- **Logging**: Integrates with OVOS logging system for debugging and monitoring.\n- **BM25QACorpusSolver**: Extends `BM25CorpusSolver` to handle question-answer pairs, optimizing the retrieval process for QA datasets.\n- **BM25MultipleChoiceSolver**: Reranks multiple-choice options based on relevance to the query.\n- **BM25EvidenceSolverPlugin**: Extracts the best sentence from a text that answers a question using the BM25 algorithm.\n\n## Retrieval Chatbots\n\nRetrieval chatbots use BM25CorpusSolver to provide answers to user queries by searching through a preloaded corpus of documents or QA pairs. 
\n\nThese chatbots excel in environments where the information is structured and the queries are straightforward.\n\n### Example solvers\n\n#### SquadQASolver\n\nThe SquadQASolver is a subclass of BM25QACorpusSolver that automatically loads and indexes the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) upon initialization.\n\nThis solver is suitable for usage with the ovos-persona framework.\n\n```python\nfrom ovos_bm25_solver import SquadQASolver\n\ns = SquadQASolver()\nquery = \"is there life on mars\"\nprint(\"Query:\", query)\nprint(\"Answer:\", s.spoken_answer(query))\n# 2024-07-19 22:31:12.625 - OVOS - __main__:load_corpus:60 - DEBUG - indexed 86769 documents\n# 2024-07-19 22:31:12.625 - OVOS - __main__:load_squad_corpus:119 - INFO - Loaded and indexed 86769 question-answer pairs from SQuAD dataset\n# Query: is there life on mars\n# 2024-07-19 22:31:12.628 - OVOS - __main__:retrieve_from_corpus:69 - DEBUG - Rank 1 (score: 6.334013938903809): How is it postulated that Mars life might have evolved?\n# 2024-07-19 22:31:12.628 - OVOS - __main__:retrieve_from_corpus:93 - DEBUG - closest question in corpus: How is it postulated that Mars life might have evolved?\n# Answer: similar to Antarctic\n```\n\n#### FreebaseQASolver\n\nThe FreebaseQASolver is a subclass of BM25QACorpusSolver that automatically loads and indexes the [FreebaseQA dataset](https://github.com/kelvin-jiang/FreebaseQA) upon initialization.\n\nThis solver is suitable for usage with the ovos-persona framework.\n\n```python\nfrom ovos_bm25_solver import FreebaseQASolver\n\ns = FreebaseQASolver()\nquery = \"What is the capital of France\"\nprint(\"Query:\", query)\nprint(\"Answer:\", s.spoken_answer(query))\n# 2024-07-19 22:31:09.468 - OVOS - __main__:load_corpus:60 - DEBUG - indexed 20357 documents\n# Query: What is the capital of France\n# 2024-07-19 22:31:09.468 - OVOS - __main__:retrieve_from_corpus:69 - DEBUG - Rank 1 (score: 5.996074199676514): what is the capital of france\n# 
2024-07-19 22:31:09.469 - OVOS - __main__:retrieve_from_corpus:93 - DEBUG - closest question in corpus: what is the capital of france\n# Answer: paris\n```\n\n### Implementing a Retrieval Chatbot\n\nTo use the BM25CorpusSolver, you need to create an instance of the solver, load your corpus, and then query it.\n\n#### BM25CorpusSolver\n\nThis class is meant to be used to create your own solvers with a dedicated corpus.\n\n```python\nfrom ovos_bm25_solver import BM25CorpusSolver\n\nconfig = {\n    \"lang\": \"en-us\",\n    \"min_conf\": 0.4,\n    \"n_answer\": 2\n}\nsolver = BM25CorpusSolver(config)\n\ncorpus = [\n    \"a cat is a feline and likes to purr\",\n    \"a dog is the human's best friend and loves to play\",\n    \"a bird is a beautiful animal that can fly\",\n    \"a fish is a creature that lives in water and swims\",\n]\nsolver.load_corpus(corpus)\n\nquery = \"does the fish purr like a cat?\"\nanswer = solver.get_spoken_answer(query)\nprint(answer)\n\n# Expected Output:\n# 2024-07-19 20:03:29.979 - OVOS - ovos_plugin_manager.utils.config:get_plugin_config:40 - DEBUG - Loaded configuration: {'module': 'ovos-translate-plugin-server', 'lang': 'en-us'}\n# 2024-07-19 20:03:30.024 - OVOS - __main__:load_corpus:28 - DEBUG - indexed 4 documents\n# 2024-07-19 20:03:30.025 - OVOS - __main__:retrieve_from_corpus:37 - DEBUG - Rank 1 (score: 1.0584375858306885): a cat is a feline and likes to purr\n# 2024-07-19 20:03:30.025 - OVOS - __main__:retrieve_from_corpus:37 - DEBUG - Rank 2 (score: 0.481589138507843): a fish is a creature that lives in water and swims\n# a cat is a feline and likes to purr. a fish is a creature that lives in water and swims\n```\n\n#### BM25QACorpusSolver\n\nThis class is meant to be used to create your own solvers with a dedicated corpus\n\nBM25QACorpusSolver is an extension of BM25CorpusSolver, designed to work with question-answer pairs. 
It is particularly\nuseful when working with datasets like SQuAD, FreebaseQA, or similar QA datasets.\n\n```python\nimport requests\nfrom ovos_bm25_solver import BM25QACorpusSolver\n\n# Load SQuAD dataset\ncorpus = {}\ndata = requests.get(\"https://github.com/chrischute/squad/raw/master/data/train-v2.0.json\").json()\nfor s in data[\"data\"]:\n    for p in s[\"paragraphs\"]:\n        for qa in p[\"qas\"]:\n            if \"question\" in qa and qa[\"answers\"]:\n                corpus[qa[\"question\"]] = qa[\"answers\"][0][\"text\"]\n\n# Load FreebaseQA dataset\ndata = requests.get(\"https://github.com/kelvin-jiang/FreebaseQA/raw/master/FreebaseQA-train.json\").json()\nfor qa in data[\"Questions\"]:\n    q = qa[\"ProcessedQuestion\"]\n    a = qa[\"Parses\"][0][\"Answers\"][0][\"AnswersName\"][0]\n    corpus[q] = a\n\n# Initialize BM25QACorpusSolver with config\nconfig = {\n    \"lang\": \"en-us\",\n    \"min_conf\": 0.4,\n    \"n_answer\": 1\n}\nsolver = BM25QACorpusSolver(config)\nsolver.load_corpus(corpus)\n\nquery = \"is there life on mars?\"\nanswer = solver.get_spoken_answer(query)\nprint(\"Query:\", query)\nprint(\"Answer:\", answer)\n\n# Expected Output:\n# 86769 qa pairs imports from squad dataset\n# 20357 qa pairs imports from freebaseQA dataset\n# 2024-07-19 21:49:31.360 - OVOS - ovos_plugin_manager.language:create:233 - INFO - Loaded the Language Translation plugin ovos-translate-plugin-server\n# 2024-07-19 21:49:31.360 - OVOS - ovos_plugin_manager.utils.config:get_plugin_config:40 - DEBUG - Loaded configuration: {'module': 'ovos-translate-plugin-server', 'lang': 'en-us'}\n# 2024-07-19 21:49:32.759 - OVOS - __main__:load_corpus:61 - DEBUG - indexed 107126 documents\n# Query: is there life on mars\n# 2024-07-19 21:49:32.760 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 1 (score: 6.037893295288086): How is it postulated that Mars life might have evolved?\n# 2024-07-19 21:49:32.760 - OVOS - __main__:retrieve_from_corpus:94 - DEBUG - closest 
question in corpus: How is it postulated that Mars life might have evolved?\n# Answer: similar to Antarctic\n```\n\nIn this example, BM25QACorpusSolver is used to load a large corpus of question-answer pairs from the SQuAD and\nFreebaseQA datasets. The solver retrieves the best matching answer for the given query.\n\n### Limitations of Retrieval Chatbots\n\nRetrieval chatbots, while powerful, have certain limitations. These include:\n\n1. **Dependence on Corpus Quality and Size**: The accuracy of a retrieval chatbot heavily relies on the quality and comprehensiveness of the underlying corpus. A limited or biased corpus can lead to inaccurate or irrelevant responses.\n2. **Static Knowledge Base**: Unlike generative models, retrieval chatbots can't generate new information or answers. They can only retrieve and rephrase content from the pre-existing corpus.\n3. **Contextual Understanding**: While advanced algorithms like BM25 can rank documents based on relevance, they may still struggle with understanding nuanced or complex queries, especially those requiring deep contextual understanding.\n4. **Scalability**: As the size of the corpus increases, the computational resources required for indexing and retrieving relevant documents also increase, potentially impacting performance.\n5. **Dynamic Updates**: Keeping the corpus updated with the latest information can be challenging, especially in fast-evolving domains.\n\nDespite these limitations, retrieval chatbots are effective for domains where the corpus is well-defined and relatively static, such as FAQs, documentation, and knowledge bases.\n\n### ReRanking\n\nReRanking is a technique used to refine a list of potential answers by evaluating their relevance to a given query. 
\nThis process is crucial in scenarios where multiple options or responses need to be assessed to determine the most appropriate one.\n\nIn retrieval chatbots, ReRanking helps in selecting the best answer from a set of retrieved documents or options, enhancing the accuracy of the response provided to the user.\n\n`MultipleChoiceSolver` are integrated into the OVOS Common Query framework, where they are used to select the most relevant answer from a set of multiple skill responses.\n\n#### BM25MultipleChoiceSolver\n\nBM25MultipleChoiceSolver is designed to select the best answer to a question from a list of options.\n\nIn the context of retrieval chatbots, BM25MultipleChoiceSolver is useful for scenarios where a user query results in a list of predefined answers or options. \nThe solver ranks these options based on their relevance to the query and selects the most suitable one.\n\n\n```python\nfrom ovos_bm25_solver import BM25MultipleChoiceSolver\n\nsolver = BM25MultipleChoiceSolver()\na = solver.rerank(\"what is the speed of light\", [\n    \"very fast\", \"10m/s\", \"the speed of light is C\"\n])\nprint(a)\n# 2024-07-22 15:03:10.295 - OVOS - __main__:load_corpus:61 - DEBUG - indexed 3 documents\n# 2024-07-22 15:03:10.297 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 1 (score: 0.7198746800422668): the speed of light is C\n# 2024-07-22 15:03:10.297 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 2 (score: 0.0): 10m/s\n# 2024-07-22 15:03:10.297 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 3 (score: 0.0): very fast\n# [(0.7198747, 'the speed of light is C'), (0.0, '10m/s'), (0.0, 'very fast')]\n\n# NOTE: select_answer is part of the MultipleChoiceSolver base class and uses rerank internally\na = solver.select_answer(\"what is the speed of light\", [\n    \"very fast\", \"10m/s\", \"the speed of light is C\"\n])\nprint(a) # the speed of light is C\n```\n\n#### BM25EvidenceSolverPlugin\n\nBM25EvidenceSolverPlugin is designed to extract 
the most relevant sentence from a text passage that answers a given question. This plugin uses the BM25 algorithm to evaluate and rank sentences based on their relevance to the query.\n\nIn text extraction and machine comprehension tasks, BM25EvidenceSolverPlugin enables the identification of specific sentences within a larger body of text that directly address a user's query. \n\nFor example, in a scenario where a user queries about the number of rovers exploring Mars, BM25EvidenceSolverPlugin scans the provided text passage, ranks sentences based on their relevance, and extracts the most informative sentence.\n\n```python\nfrom ovos_bm25_solver import BM25EvidenceSolverPlugin\n\nconfig = {\n    \"lang\": \"en-us\",\n    \"min_conf\": 0.4,\n    \"n_answer\": 1\n}\nsolver = BM25EvidenceSolverPlugin(config)\n\ntext = \"\"\"Mars is the fourth planet from the Sun. It is a dusty, cold, desert world with a very thin atmosphere. \nMars is also a dynamic planet with seasons, polar ice caps, canyons, extinct volcanoes, and evidence that it was even more active in the past.\nMars is one of the most explored bodies in our solar system, and it's the only planet where we've sent rovers to roam the alien landscape. 
\nNASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.\n\"\"\"\nquery = \"how many rovers are currently exploring Mars\"\nanswer = solver.get_best_passage(evidence=text, question=query)\nprint(\"Query:\", query)\nprint(\"Answer:\", answer)\n# 2024-07-22 15:05:14.209 - OVOS - __main__:load_corpus:61 - DEBUG - indexed 5 documents\n# 2024-07-22 15:05:14.209 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 1 (score: 1.39238703250885): NASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.\n# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 2 (score: 0.38667747378349304): Mars is one of the most explored bodies in our solar system, and it's the only planet where we've sent rovers to roam the alien landscape.\n# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 3 (score: 0.15732118487358093): Mars is the fourth planet from the Sun.\n# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 4 (score: 0.10177625715732574): Mars is also a dynamic planet with seasons, polar ice caps, canyons, extinct volcanoes, and evidence that it was even more active in the past.\n# 2024-07-22 15:05:14.210 - OVOS - __main__:retrieve_from_corpus:70 - DEBUG - Rank 5 (score: 0.0): It is a dusty, cold, desert world with a very thin atmosphere.\n# Query: how many rovers are currently exploring Mars\n# Answer: NASA currently has two rovers (Curiosity and Perseverance), one lander (InSight), and one helicopter (Ingenuity) exploring the surface of Mars.\n\n```\n\nIn this example, `BM25EvidenceSolverPlugin` effectively identifies and retrieves the most relevant sentence from the provided text that answers the query about the number of rovers exploring Mars. 
\nThis capability is useful for applications that need to extract information from long texts, such as automated research assistants or content summarizers.\n\n## Embeddings Store\n\nA lightweight stand-in for an embeddings store is provided. It relies on text search only (fuzzy matching and BM25) rather than on vector embeddings.\n\n> NOTE: this does not scale to large datasets\n\n```python\nfrom ovos_bm25_solver.embed import JsonEmbeddingsDB, BM25TextEmbeddingsStore\ndb = JsonEmbeddingsDB(\"bm25_index\")\n# Initialize the BM25 text embeddings store\nindex = BM25TextEmbeddingsStore(db=db)\n\n# Add documents to the database\ntext = \"hello world\"\ntext2 = \"goodbye cruel world\"\nindex.add_document(text)\nindex.add_document(text2)\n\n# Querying with fuzzy match\nresults = db.query(\"the world\", top_k=2)\nprint(\"Fuzzy Match Results:\", results)\n\n# Querying with BM25\nresults = index.query(\"the world\", top_k=2)\nprint(\"BM25 Query Results:\", results)\n\n# Comparing strings using fuzzy match\ndistance = index.distance(text, text2)\nprint(\"Distance between strings:\", distance)\n```\n\n## Integrating with Persona Framework\n\nTo use the `SquadQASolver` and `FreebaseQASolver` in the persona framework, define a persona configuration file that lists the solvers to use, in order of priority.\n\n1. Create a persona configuration file, e.g., `qa_persona.json`:\n\n```json\n{\n  \"name\": \"QAPersona\",\n  \"solvers\": [\n    \"ovos-solver-squadqa-plugin\",\n    \"ovos-solver-freebaseqa-plugin\",\n    \"ovos-solver-failure-plugin\"\n  ]\n}\n```\n\n2. Run [ovos-persona-server](https://github.com/OpenVoiceOS/ovos-persona-server) with the defined persona:\n\n```bash\n$ ovos-persona-server --persona qa_persona.json\n```\n\nIn this example, the persona named \"QAPersona\" will first use the `SquadQASolver` to answer questions. If it cannot find an answer, it will fall back to the `FreebaseQASolver`. 
Finally, it will use the `ovos-solver-failure-plugin` to ensure it always responds, even if the previous solvers fail.\n\n\nSee `setup.py` for a reference on how to package your own corpus-backed solvers:\n\n```python\nPLUGIN_ENTRY_POINTS = [\n    'ovos-solver-bm25-squad-plugin=ovos_bm25_solver:SquadQASolver',\n    'ovos-solver-bm25-freebase-plugin=ovos_bm25_solver:FreebaseQASolver'\n]\n```\n\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A question solver plugin for OVOS",
    "version": "0.0.0",
    "project_urls": {
        "Homepage": "https://github.com/TigreGotico/ovos-solver-BM25-plugin"
    },
    "split_keywords": [
        "ovos",
        "openvoiceos",
        "plugin",
        "utterance",
        "fallback",
        "query"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7c563124d4e1fc74ed494c614fab683c101bcf54fec9d149577c3569d8be297f",
                "md5": "b121c0f126339301deec000ec8be8144",
                "sha256": "d51d89727802966c9e3d22a5df8249b469899d5e2095bb579d7cbd3f0921f35a"
            },
            "downloads": -1,
            "filename": "ovos_solver_bm25_plugin-0.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b121c0f126339301deec000ec8be8144",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 15518,
            "upload_time": "2024-10-25T22:25:26",
            "upload_time_iso_8601": "2024-10-25T22:25:26.442987Z",
            "url": "https://files.pythonhosted.org/packages/7c/56/3124d4e1fc74ed494c614fab683c101bcf54fec9d149577c3569d8be297f/ovos_solver_bm25_plugin-0.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5634c8115622b654950a3e4cfe41aa900e2cc3522e549b0fc16108e14240118e",
                "md5": "f7298e130f9e81911d00031d7edb81a4",
                "sha256": "4b806e4595fa38ab1c7630eb37d4f8342bf05ff33aee249ac7cdd1e0157fa9f0"
            },
            "downloads": -1,
            "filename": "ovos-solver-bm25-plugin-0.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f7298e130f9e81911d00031d7edb81a4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 17822,
            "upload_time": "2024-10-25T22:25:27",
            "upload_time_iso_8601": "2024-10-25T22:25:27.874963Z",
            "url": "https://files.pythonhosted.org/packages/56/34/c8115622b654950a3e4cfe41aa900e2cc3522e549b0fc16108e14240118e/ovos-solver-bm25-plugin-0.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-25 22:25:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "TigreGotico",
    "github_project": "ovos-solver-BM25-plugin",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "ovos-plugin-manager",
            "specs": []
        },
        {
            "name": "bm25s",
            "specs": []
        },
        {
            "name": "quebra-frases",
            "specs": []
        },
        {
            "name": "rapidfuzz",
            "specs": []
        },
        {
            "name": "json_database",
            "specs": []
        }
    ],
    "lcname": "ovos-solver-bm25-plugin"
}
