# cherche

- Name: cherche
- Version: 2.2.1 (PyPI)
- Summary: Neural Search
- Home page: https://github.com/raphaelsty/cherche
- Author: Raphael Sourty
- Upload time: 2024-06-01 17:04:25
- Requires Python: >=3.6
- License: MIT
- Keywords: neural search, information retrieval, question answering, semantic search
<div align="center">
  <h1>Cherche</h1>
  <p>Neural search</p>
</div>

<p align="center"><img width=300 src="docs/img/logo.png"/></p>

<div align="center">
  <!-- Documentation -->
  <a href="https://raphaelsty.github.io/cherche/"><img src="https://img.shields.io/website?label=docs&style=flat-square&url=https%3A%2F%2Fraphaelsty.github.io/cherche/%2F" alt="documentation"></a>
  <!-- Demo -->
  <a href="https://raphaelsty.github.io/knowledge/?query=cherche%20neural%20search"><img src="https://img.shields.io/badge/demo-running-blueviolet?style=flat-square" alt="Demo"></a>
  <!-- License -->
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square" alt="license"></a>
</div>


Cherche enables the development of a neural search pipeline that employs retrievers and pre-trained language models both as retrievers and rankers. The primary advantage of Cherche lies in its capacity to construct end-to-end pipelines. Additionally, Cherche is well-suited for offline semantic search due to its compatibility with batch computation.

To see Cherche in action, try the [live demo of an NLP search engine powered by Cherche](https://raphaelsty.github.io/knowledge/?query=cherche%20neural%20search).

![Alt text](docs/img/explain.png)

## Installation 🤖

To install Cherche for use with a simple retriever on CPU, such as TfIdf, Flash, Lunr, or Fuzz, use the following command:

```sh
pip install cherche
```

To install Cherche for use with any semantic retriever or ranker on CPU, use the following command:

```sh
pip install "cherche[cpu]"
```

Finally, if you plan to use any semantic retriever or ranker on GPU, use the following command:

```sh
pip install "cherche[gpu]"
```

Each option installs only the dependencies required for the corresponding retrievers and rankers.

### Documentation

Documentation is available [here](https://raphaelsty.github.io/cherche/). It provides details
about retrievers, rankers, pipelines and examples.

## QuickStart 📑

### Documents

Cherche helps you find the right document within a list of objects (Python dicts). Here is an example corpus:

```python
from cherche import data

documents = data.load_towns()

documents[:3]
[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris is the capital and most populous city of France.'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
  }]
```

### Retriever ranker

Here is an example of a neural search pipeline composed of a TF-IDF that quickly retrieves documents, followed by a ranking model. The ranking model sorts the documents produced by the retriever based on the semantic similarity between the query and the documents. We can call the pipeline using a list of queries and get relevant documents for each query.

```python
from cherche import data, retrieve, rank
from sentence_transformers import SentenceTransformer

# List of dicts
documents = data.load_towns()

# Retrieve on fields title and article
retriever = retrieve.BM25(
    key="id",
    on=["title", "article"],
    documents=documents,
    k=30,
)

# Rank on fields title and article
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=3,
)

# Pipeline creation
search = retriever + ranker

search.add(documents=documents)

# Search documents for 3 queries.
search(["Bordeaux", "Paris", "Toulouse"])
[[{'id': 57, 'similarity': 0.69513524},
  {'id': 63, 'similarity': 0.6214994},
  {'id': 65, 'similarity': 0.61809087}],
 [{'id': 16, 'similarity': 0.59158516},
  {'id': 0, 'similarity': 0.58217555},
  {'id': 1, 'similarity': 0.57944715}],
 [{'id': 26, 'similarity': 0.6925601},
  {'id': 37, 'similarity': 0.63977146},
  {'id': 28, 'similarity': 0.62772334}]]
```

We can add the documents to the pipeline to map the retrieved ids back to their full contents:

```python
search += documents
search(["Bordeaux", "Paris", "Toulouse"])
[[{'id': 57,
   'title': 'Bordeaux',
   'url': 'https://en.wikipedia.org/wiki/Bordeaux',
   'similarity': 0.69513524},
  {'id': 63,
   'title': 'Bordeaux',
   'similarity': 0.6214994},
  {'id': 65,
   'title': 'Bordeaux',
   'url': 'https://en.wikipedia.org/wiki/Bordeaux',
   'similarity': 0.61809087}],
 [{'id': 16,
   'title': 'Paris',
   'url': 'https://en.wikipedia.org/wiki/Paris',
   'article': 'Paris received 12.',
   'similarity': 0.59158516},
  {'id': 0,
   'title': 'Paris',
   'url': 'https://en.wikipedia.org/wiki/Paris',
   'similarity': 0.58217555},
  {'id': 1,
   'title': 'Paris',
   'url': 'https://en.wikipedia.org/wiki/Paris',
   'similarity': 0.57944715}],
 [{'id': 26,
   'title': 'Toulouse',
   'url': 'https://en.wikipedia.org/wiki/Toulouse',
   'similarity': 0.6925601},
  {'id': 37,
   'title': 'Toulouse',
   'url': 'https://en.wikipedia.org/wiki/Toulouse',
   'similarity': 0.63977146},
  {'id': 28,
   'title': 'Toulouse',
   'url': 'https://en.wikipedia.org/wiki/Toulouse',
   'similarity': 0.62772334}]]
```
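Conceptually, `search += documents` builds an index from the `key` field to the stored documents and merges each `{'id', 'similarity'}` hit with its document. A minimal sketch of that idea (illustrative only, not Cherche's internal implementation):

```python
# Toy documents; the "id" field plays the role of Cherche's `key`.
documents = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Bordeaux", "article": "Bordeaux is a port city."},
]

# Index from key to full document.
index = {doc["id"]: doc for doc in documents}

def map_results(results, index):
    """Merge each {'id', 'similarity'} hit with its stored document."""
    return [{**index[hit["id"]], "similarity": hit["similarity"]} for hit in results]

hits = [{"id": 1, "similarity": 0.9}, {"id": 0, "similarity": 0.4}]
mapped = map_results(hits, index)
```

Each mapped result keeps the similarity score and gains the document's fields, which matches the shape of the pipeline output above.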

## Retrieve

Cherche provides [retrievers](https://raphaelsty.github.io/cherche/retrieve/retrieve/) that filter input documents based on a query.

- retrieve.TfIdf
- retrieve.BM25
- retrieve.Lunr
- retrieve.Flash
- retrieve.Encoder
- retrieve.DPR
- retrieve.Fuzz
- retrieve.Embedding
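Lexical retrievers such as `retrieve.TfIdf` score documents by query-term overlap weighted by inverse document frequency. A bare-bones, self-contained sketch of that scoring idea (not Cherche's actual implementation, which is optimized and vectorized):

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with a minimal TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency of each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append(sum(tf[t] * idf.get(t, 0.0) for t in query.lower().split()))
    return scores

docs = [
    "Paris is the capital of France",
    "Bordeaux is a port city",
    "Toulouse is in the south",
]
scores = tfidf_scores("Bordeaux city", docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top document
```

A real retriever returns the `k` best-scoring documents by `key`; this sketch only finds the single best index.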

## Rank

Cherche provides [rankers](https://raphaelsty.github.io/cherche/rank/rank/) that re-rank and filter the documents returned by retrievers.

Cherche rankers are compatible with [SentenceTransformers](https://www.sbert.net/docs/pretrained_models.html) models, which are available on the [Hugging Face hub](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads).

- rank.Encoder
- rank.DPR
- rank.CrossEncoder
- rank.Embedding
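A ranker like `rank.Encoder` sorts the retriever's candidates by the similarity between the query embedding and each document embedding. A pure-Python sketch of that idea, with toy 2-d vectors standing in for real SentenceTransformer embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rerank(query_vec, candidates, k=3):
    """Sort retriever candidates by embedding similarity to the query."""
    scored = [
        {**doc, "similarity": cosine(query_vec, doc["embedding"])}
        for doc in candidates
    ]
    return sorted(scored, key=lambda d: d["similarity"], reverse=True)[:k]

# Toy candidates with pre-computed "embeddings".
candidates = [
    {"id": 0, "embedding": [1.0, 0.0]},
    {"id": 1, "embedding": [0.7, 0.7]},
    {"id": 2, "embedding": [0.0, 1.0]},
]
top = rerank([0.9, 0.1], candidates, k=2)
```

In the real pipeline the embeddings come from the `encoder` passed to `rank.Encoder`, and `k` bounds how many documents the ranker keeps.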

## Question answering

Cherche provides modules dedicated to question answering. These modules are compatible with Hugging Face's pre-trained models and fully integrated into neural search pipelines.

## Contributors 🤝
Cherche was created for/by Renault and is now available to all.
We welcome all contributions.

<p align="center"><img src="docs/img/renault.jpg"/></p>

## Acknowledgements 👏

The Lunr retriever is a wrapper around [Lunr.py](https://github.com/yeraydiazdiaz/lunr.py). The Flash retriever is a wrapper around [FlashText](https://github.com/vi3k6i5/flashtext). The DPR, Encoder and CrossEncoder rankers are wrappers dedicated to the use of the pre-trained models of [SentenceTransformers](https://www.sbert.net/docs/pretrained_models.html) in a neural search pipeline.

## Citations

If you use Cherche to produce results for your scientific publication, please refer to our SIGIR paper:

```bibtex
@inproceedings{Sourty2022sigir,
    author = {Raphael Sourty and Jose G. Moreno and Lynda Tamine and Francois-Paul Servant},
    title = {CHERCHE: A new tool to rapidly implement pipelines in information retrieval},
    booktitle = {Proceedings of SIGIR 2022},
    year = {2022}
}
```

## Dev Team 💾

The Cherche dev team is made up of [Raphaël Sourty](https://github.com/raphaelsty), [François-Paul Servant](https://github.com/fpservant), [Nicolas Bizzozzero](https://github.com/NicolasBizzozzero), [Jose G Moreno](https://scholar.google.com/citations?user=4BZFUw8AAAAJ&hl=fr). 🥳

            
