tinysearch

| Field | Value |
| --- | --- |
| Name | tinysearch |
| Version | 0.5.0 |
| Summary | Tiny one-phase search engine |
| Upload time | 2023-04-26 07:50:24 |
| Requires Python | >=3.9 |
| License | Apache-2.0 |
| Keywords | search engine |
| Requirements | No requirements were recorded. |

# TinySearch

TinySearch is a tiny one-phase search engine. It is extremely easy to
use and works well with simple lists where the query may not match the
document text exactly.

This is a minimal search engine. You don't need to run a separate,
heavyweight search engine instance when your use case is a few hundred
or a few thousand small documents.

## Example

Input documents:

```
"Goldilocks and the Three Bears"
"Fuzzy Wuzzy"
"The Bear Went Over The Mountain"
"We're Going on a Bear Hunt"
"Brown Bear, Brown Bear, What Do You See?"
```

Search query:

```
bear
```

Results (ordered by best match):

```
"Brown Bear, Brown Bear, What Do You See?"
"Goldilocks and the Three Bears"
"The Bear Went Over The Mountain"
"We're Going on a Bear Hunt"
```

## How to use

```python
from tinysearch.search import Search

docs = [
    "Goldilocks and the Three Bears",
    "Fuzzy Wuzzy",
    "The Bear Went Over The Mountain",
    "We're Going on a Bear Hunt",
    "Brown Bear, Brown Bear, What Do You See?",
]
query = "bear"

s = Search(docs, query)

# How many results?
print(s.results.count)

# What is the top result?
print(s.results.matches[0].doc)

# Print all matches. Best results are at the top.
for m in s.results.matches:
    print(m.doc)
```

## Pass your own analyzer

When `tinysearch.analyzer.SimpleEnglishAnalyzer` does not satisfy your
needs, you can write your own analyzer and pass it to the `Search`
object.

An analyzer inherits from `tinysearch.analyzer.base.Analyzer`. It only
needs to implement the `analyze` method, which accepts a string
representing the document and returns a list of strings representing
tokens (terms). All of your tokenization and normalization logic can
live in that method. See the docstring of the `Analyzer` base class.
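
As a rough illustration of the `analyze` contract described above (a string in, a list of token strings out), here is a minimal sketch. The class name `LowercaseWordAnalyzer` and its stop-word list are assumptions made for this example. In real use the class would inherit from `tinysearch.analyzer.base.Analyzer`; the base class is omitted here so the sketch runs without the package installed.

```python
import re


class LowercaseWordAnalyzer:
    """Hypothetical analyzer: lowercases text and splits it into word tokens.

    In real use this class would inherit from
    `tinysearch.analyzer.base.Analyzer`; the base class is left out so the
    sketch is self-contained.
    """

    # Assumed stop-word list, chosen only for this illustration.
    STOP_WORDS = frozenset({"a", "an", "and", "on", "the"})

    def analyze(self, doc: str) -> list[str]:
        # Find runs of letters, digits, and apostrophes in the lowercased text.
        words = re.findall(r"[a-z0-9']+", doc.lower())
        # Drop stop words so they do not influence scoring.
        return [w for w in words if w not in self.STOP_WORDS]


analyzer = LowercaseWordAnalyzer()
tokens = analyzer.analyze("We're Going on a Bear Hunt")
print(tokens)
```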

You can then pass your analyzer to `Search`:

```python
my_analyzer = MyOwnAnalyzer()

s = Search(docs, query, analyzer=my_analyzer)
print(s.results.count)
```

## Under the hood

When you pass documents to the `Search` object, each document is
tokenized and transformed for easier search. The same process is
applied to the query.

Each document is then scored against the query using the TF-IDF
algorithm, and the matches are returned sorted by score, with the best
match at the top.
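
The scoring step can be sketched as follows. This is not TinySearch's actual implementation: the `tokenize` and `rank` functions, the crude plural stripping, and the smoothed IDF formula are all assumptions chosen to show how a TF-IDF ranking might work in principle.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase, split into words, and strip a trailing plural 's' (crude, assumed)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def rank(docs: list[str], query: str) -> list[tuple[float, str]]:
    """Return (score, doc) pairs for matching documents, best first."""
    doc_tokens = [tokenize(d) for d in docs]
    query_terms = set(tokenize(query))

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in doc_tokens for term in set(tokens))

    scored = []
    for doc, tokens in zip(docs, doc_tokens):
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue
            # Term frequency, normalized by document length.
            tf = counts[term] / len(tokens)
            # Smoothed IDF so a term found in every document still scores > 0.
            idf = math.log((1 + len(docs)) / (1 + df[term])) + 1
            score += tf * idf
        if score > 0:
            scored.append((score, doc))

    # Highest score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "Goldilocks and the Three Bears",
    "Fuzzy Wuzzy",
    "The Bear Went Over The Mountain",
    "We're Going on a Bear Hunt",
    "Brown Bear, Brown Bear, What Do You See?",
]
results = rank(docs, "bear")
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

With this simplified normalization, the four matching titles come out in the same order as in the Example section, because the title that mentions "bear" twice has the highest term frequency and shorter titles outrank longer ones.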

## Performance

Performance matters because a search engine typically responds to
interactive user queries, so it should return results within a few
seconds at most. Anything longer appears as a significant delay.

The numbers below depend on the machine they were measured on, so they
are only indicative.

```mermaid
gantt
title Search time for different dataset sizes [s]
dateFormat X
axisFormat %s

section 100
0.0, terms=1 : 0, 0.0s
0.0, terms=2 : 0, 0.0s
0.0, terms=3 : 0, 0.0s

section 1000
0.3, terms=1 : 0, 0.3s
0.2, terms=2 : 0, 0.2s
0.3, terms=3 : 0, 0.3s

section 10000
2.7, terms=1 : 0, 2.7s
2.7, terms=2 : 0, 2.7s
2.7, terms=3 : 0, 2.7s

section 52478
15.1, terms=1 : 0, 15.6s
15.4, terms=2 : 0, 15.1s
15.6, terms=3 : 0, 15.2s
```

Datasets of around 1000 entries, the intended use case for TinySearch,
are searched in about 0.3 seconds, which is a reasonable response time.
Still, there is probably room for improvement.
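
Since timings vary between machines, it can help to measure on your own hardware. The following is a minimal timing sketch; `best_time` and `naive_search` are hypothetical helpers written for this example, and `naive_search` is only a stand-in so the code runs without TinySearch installed.

```python
import time
from typing import Callable


def best_time(search: Callable[[list[str], str], object],
              docs: list[str], query: str, repeats: int = 3) -> float:
    """Run `search(docs, query)` several times and return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        search(docs, query)
        timings.append(time.perf_counter() - start)
    return min(timings)


def naive_search(docs: list[str], query: str) -> list[str]:
    """Stand-in search used only so the sketch runs without TinySearch."""
    needle = query.lower()
    return [d for d in docs if needle in d.lower()]


# Synthetic dataset; replace with your own documents for meaningful numbers.
docs = [f"document number {i} about bears" for i in range(1000)]
elapsed = best_time(naive_search, docs, "bear")
print(f"{len(docs)} docs: {elapsed:.4f}s")
```

To time TinySearch itself, you could pass `lambda d, q: Search(d, q)` as the `search` argument, using the `Search` class shown in the How to use section.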

## Can we make it faster?

Most of the time is spent in the analyzer, so improving performance
means reducing the analyzer's processing time. The default
`SimpleEnglishAnalyzer` has already been highly optimized.

The next step to consider is to split the search into two phases:
indexing and searching. Since the analyzer needs to process every
document, indexing can happen early in the program's execution, and
searching can happen when the user requests it. This has the additional
benefit of indexing once and searching multiple times.

```python
from tinysearch.index import Index
from tinysearch.search import Search

i = Index(docs)

# ...later...
s = Search(i, query)
print(s.results.matches[0])
```
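
To make the two-phase idea concrete, here is a self-contained sketch of an inverted index. The `TinyIndex` class is hypothetical and is not the `tinysearch.index.Index` used in the snippet above; its tokenizer and raw-count ranking are simplified assumptions that stand in for a real analyzer and TF-IDF scoring.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Hypothetical inverted index: maps each term to the documents containing it."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = list(docs)
        # term -> {document id -> number of occurrences}
        self.postings: dict[str, dict[int, int]] = defaultdict(dict)
        # Indexing phase: analyze every document exactly once.
        for doc_id, doc in enumerate(self.docs):
            for term in self._analyze(doc):
                self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    @staticmethod
    def _analyze(text: str) -> list[str]:
        # Simplified stand-in for a real analyzer: lowercase word tokens.
        return re.findall(r"[a-z0-9]+", text.lower())

    def search(self, query: str) -> list[str]:
        """Search phase: only the query is analyzed; documents are looked up."""
        hits: dict[int, int] = defaultdict(int)
        for term in self._analyze(query):
            for doc_id, count in self.postings.get(term, {}).items():
                hits[doc_id] += count
        # Rank by raw term count (a placeholder for TF-IDF scoring).
        ranked = sorted(hits, key=lambda doc_id: hits[doc_id], reverse=True)
        return [self.docs[doc_id] for doc_id in ranked]


index = TinyIndex([
    "Fuzzy Wuzzy",
    "The Bear Went Over The Mountain",
    "Brown Bear, Brown Bear, What Do You See?",
])
print(index.search("bear"))
print(index.search("fuzzy"))
```

The expensive work (analyzing every document) happens once in the constructor, so each later call to `search` only analyzes the query and performs dictionary lookups.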

## License

See LICENSE.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "tinysearch",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "search engine",
    "author": "",
    "author_email": "Domagoj Marsic <dmars@protonmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/68/3e/ad5e5f33c6353e0ba375bb6a0303bd0179d8baf9625dfcdf35966f4c61ac/tinysearch-0.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Tiny one-phase search engine",
    "version": "0.5.0",
    "split_keywords": [
        "search",
        "engine"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a447a0b67df8ec176427ba566b17d450c6d7a6f90668396b6370c2648dc21768",
                "md5": "e351b667775cc3d571dd0de09aaa2b50",
                "sha256": "8a7f0047b5c5b2f57ce62ddd6f937700f41449184d90f0a6ddee871de0532138"
            },
            "downloads": -1,
            "filename": "tinysearch-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e351b667775cc3d571dd0de09aaa2b50",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 11178,
            "upload_time": "2023-04-26T07:50:22",
            "upload_time_iso_8601": "2023-04-26T07:50:22.806369Z",
            "url": "https://files.pythonhosted.org/packages/a4/47/a0b67df8ec176427ba566b17d450c6d7a6f90668396b6370c2648dc21768/tinysearch-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "683ead5e5f33c6353e0ba375bb6a0303bd0179d8baf9625dfcdf35966f4c61ac",
                "md5": "fc4e3aba5351e00732f08f2b0a0931b3",
                "sha256": "abe783cda3e33dc8417efc1344ee4dced31fd379bdd4d6dc96e7d93c969e2111"
            },
            "downloads": -1,
            "filename": "tinysearch-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "fc4e3aba5351e00732f08f2b0a0931b3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 9049,
            "upload_time": "2023-04-26T07:50:24",
            "upload_time_iso_8601": "2023-04-26T07:50:24.124792Z",
            "url": "https://files.pythonhosted.org/packages/68/3e/ad5e5f33c6353e0ba375bb6a0303bd0179d8baf9625dfcdf35966f4c61ac/tinysearch-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-26 07:50:24",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "tinysearch"
}
        