minsearch

- Name: minsearch
- Version: 0.0.4
- Summary: Minimalistic text search engine that uses sklearn and pandas
- Author: Alexey Grigorev <alexey@datatalks.club>
- Homepage: https://github.com/alexeygrigorev/minsearch
- Upload time: 2025-07-11 10:22:48
- Requires Python: >=3.9
- License: WTFPL
- Keywords: cosine-similarity, search, text-search, tf-idf
# minsearch

A minimalistic search engine offering both text-based and vector-based search. The library ships three implementations:

1. `Index`: A basic search index using scikit-learn's TF-IDF vectorizer for text fields
2. `AppendableIndex`: An appendable search index using an inverted index implementation that allows for incremental document addition
3. `VectorSearch`: A vector search index using cosine similarity for pre-computed vectors

## Features

- Text field indexing with TF-IDF and cosine similarity
- Vector search with cosine similarity for pre-computed embeddings
- Keyword field filtering with exact matching
- Field boosting for fine-tuning search relevance (text-based search)
- Stop word removal and custom tokenization
- Support for incremental document addition (AppendableIndex)
- Customizable tokenizer patterns and stop words
- Efficient search with filtering and boosting

## Installation

```bash
pip install minsearch
```

## Environment setup

For development purposes, use uv:

```bash
# Install uv if you haven't already
pip install uv

# Install the project together with its dev dependencies
uv sync --extra dev
```

## Usage

### Basic Search with Index

```python
from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)
```
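The results are the matching documents themselves; a minimal sketch of inspecting them, assuming `search` returns a ranked list of the original dicts:

```python
# Each result is one of the original document dicts, ranked by relevance.
for doc in results:
    print(doc["question"], "-", doc["section"])
```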

### Incremental Search with AppendableIndex

```python
from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# You can also create an index with custom stop words
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    stop_words={"the", "a", "an"}  # Custom stop words
)
```
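Appended documents are searchable immediately, with no separate `fit` step; a minimal sketch against the index created above (the same `search` signature as `Index` is assumed):

```python
# Append a document and search right away; no re-fitting is needed.
index.append({"title": "Python Basics", "description": "An introduction to Python", "course": "CS101"})

results = index.search("python introduction", filter_dict={"course": "CS101"})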

### Vector Search with VectorSearch

```python
from minsearch import VectorSearch
import numpy as np

# Create sample vectors and payload documents
vectors = np.random.rand(100, 768)  # 100 documents, 768-dimensional vectors
payload = [
    {"id": 1, "title": "Python Tutorial", "category": "programming", "level": "beginner"},
    {"id": 2, "title": "Data Science Guide", "category": "data", "level": "intermediate"},
    {"id": 3, "title": "Machine Learning Basics", "category": "ai", "level": "advanced"},
    # ... more documents
]

# Create and fit the vector search index
index = VectorSearch(keyword_fields=["category", "level"])
index.fit(vectors, payload)

# Search with a query vector
query_vector = np.random.rand(768)  # 768-dimensional query vector
filter_dict = {"category": "programming", "level": "beginner"}

results = index.search(query_vector, filter_dict=filter_dict, num_results=5)
```
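As with the text indexes, the returned results are assumed to be the payload dicts of the best-matching vectors, so the metadata travels with each hit:

```python
# Each hit is the payload dict attached to one of the best-matching vectors.
for doc in results:
    print(doc["id"], doc["title"])
```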

### Advanced Features

#### Custom Tokenizer Pattern

```python
from minsearch import AppendableIndex

# Create index with custom tokenizer pattern
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer_pattern=r'[\s\W\d]+'  # Custom pattern to split on whitespace, non-word chars, and digits
)
```
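To preview what a pattern will do to your text, you can run it through `re.split` directly; this is plain standard-library usage, independent of minsearch's internals:

```python
import re

# Runs of whitespace, non-word characters, and digits all become split points,
# leaving only the word tokens.
tokens = [t for t in re.split(r'[\s\W\d]+', "Python 3.12 rocks!") if t]
print(tokens)  # ['Python', 'rocks']
```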

#### Field Boosting (Text-based Search)

```python
# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,      # Title matches are twice as important
    "description": 1.0  # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)
```
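Conceptually, the final score is a boost-weighted sum of the per-field similarities. A small worked sketch of that idea (illustrative numbers, not minsearch's internal code):

```python
# Hypothetical per-field cosine similarities for a single document
similarities = {"title": 0.5, "description": 0.25}
boost_dict = {"title": 2.0, "description": 1.0}

# Weighted sum: 2.0 * 0.5 + 1.0 * 0.25 = 1.25
score = sum(boost_dict[field] * sim for field, sim in similarities.items())
print(score)  # 1.25
```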

#### Keyword Filtering

```python
# Filter results by exact keyword matches
filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)
```
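Filters compose with the other search options; a minimal sketch combining filtering, boosting, and a result cap (assuming the text indexes accept the same `num_results` argument shown for `VectorSearch` above):

```python
results = index.search(
    "python",
    filter_dict={"course": "CS101", "level": "beginner"},
    boost_dict={"title": 2.0, "description": 1.0},
    num_results=5,  # cap the number of returned documents
)
```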

## Examples

### Interactive Notebook

The repository includes an interactive Jupyter notebook (`minsearch_example.ipynb`) that demonstrates the library's features using real-world data. The notebook shows:

- Loading and preparing documents from a JSON source
- Creating and configuring the search index
- Performing searches with filters and boosts
- Working with real course-related Q&A data

To run the notebook:

```bash
uv run jupyter notebook
```

Then open `minsearch_example.ipynb` in your browser.

## Development

### Running Tests

```bash
uv run pytest
```

### Building and Publishing

1. Install development dependencies:
```bash
uv sync --extra dev
```

2. Build the package:
```bash
uv run hatch build
```

3. Publish to test PyPI:
```bash
uv run hatch publish --repo test
```

4. Publish to PyPI:
```bash
uv run hatch publish
```

5. Clean up:
```bash
rm -r dist/
```

Note: For Hatch publishing, you'll need to configure your PyPI credentials in `~/.pypirc` or use environment variables.

## PyPI Credentials Setup

Create a `.pypirc` file in your home directory with your PyPI credentials:

```ini
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
username = __token__
password = pypi-your-api-token-here

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-your-test-api-token-here
```

**Important Notes:**
- Use `__token__` as the username for API tokens
- Get your tokens from [PyPI](https://pypi.org/manage/account/token/) and [Test PyPI](https://test.pypi.org/manage/account/token/)
- Set file permissions: `chmod 600 ~/.pypirc`

**Alternative: Environment Variables**
```bash
export HATCH_INDEX_USER=__token__
export HATCH_INDEX_AUTH=your-pypi-token
```

## Project Structure

- `minsearch/`: Main package directory
  - `minsearch.py`: Core Index implementation using scikit-learn
  - `append.py`: AppendableIndex implementation with inverted index
  - `vector.py`: VectorSearch implementation using cosine similarity
- `tests/`: Test suite
- `minsearch_example.ipynb`: Example notebook
- `setup.py`: Package configuration
- `Pipfile`: Development dependencies

Note: The `minsearch.py` file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.

            
