# minsearch
A minimalistic search engine with both text-based and vector-based search capabilities. The library offers three implementations:
1. `Index`: A basic search index using scikit-learn's TF-IDF vectorizer for text fields
2. `AppendableIndex`: An appendable search index built on an inverted index, supporting incremental document addition
3. `VectorSearch`: A vector search index using cosine similarity over pre-computed vectors
## Features
- Text field indexing with TF-IDF and cosine similarity
- Vector search with cosine similarity for pre-computed embeddings
- Keyword field filtering with exact matching
- Field boosting for fine-tuning search relevance (text-based search)
- Stop word removal and custom tokenization
- Support for incremental document addition (AppendableIndex)
- Customizable tokenizer patterns and stop words
- Efficient search with filtering and boosting
## Installation
```bash
pip install minsearch
```
## Environment setup
For development purposes, use uv:
```bash
# Install uv if you haven't already
pip install uv
uv sync --extra dev
```
## Usage
### Basic Search with Index
```python
from minsearch import Index

# Create documents
docs = [
    {
        "question": "How do I join the course after it has started?",
        "text": "You can join the course at any time. We have recordings available.",
        "section": "General Information",
        "course": "data-engineering-zoomcamp"
    },
    {
        "question": "What are the prerequisites for the course?",
        "text": "You need to have basic knowledge of programming.",
        "section": "Course Requirements",
        "course": "data-engineering-zoomcamp"
    }
]

# Create and fit the index
index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

# Search with filters and boosts
query = "Can I join the course if it has already started?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3, "text": 1, "section": 1}

results = index.search(query, filter_dict=filter_dict, boost_dict=boost_dict)
```
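The results are the original documents themselves, ranked by relevance. A quick way to inspect them, assuming the index above:

```python
# Each result is one of the original document dicts, best match first
for doc in results:
    print(doc["question"])
```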
### Incremental Search with AppendableIndex
```python
from minsearch import AppendableIndex

# Create the index
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"]
)

# Add documents one by one
doc1 = {"title": "Python Programming", "description": "Learn Python programming", "course": "CS101"}
index.append(doc1)

doc2 = {"title": "Data Science", "description": "Python for data science", "course": "CS102"}
index.append(doc2)

# Search the index
results = index.search("python")

# Alternatively, create an index with custom stop words
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    stop_words={"the", "a", "an"}  # Custom stop words
)
```
### Vector Search with VectorSearch
```python
from minsearch import VectorSearch
import numpy as np

# Create sample vectors and payload documents
vectors = np.random.rand(100, 768)  # 100 documents, 768-dimensional vectors
payload = [
    {"id": 1, "title": "Python Tutorial", "category": "programming", "level": "beginner"},
    {"id": 2, "title": "Data Science Guide", "category": "data", "level": "intermediate"},
    {"id": 3, "title": "Machine Learning Basics", "category": "ai", "level": "advanced"},
    # ... more documents
]

# Create and fit the vector search index
index = VectorSearch(keyword_fields=["category", "level"])
index.fit(vectors, payload)

# Search with a query vector
query_vector = np.random.rand(768)  # 768-dimensional query vector
filter_dict = {"category": "programming", "level": "beginner"}

results = index.search(query_vector, filter_dict=filter_dict, num_results=5)
```
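In practice, the vectors come from an embedding model rather than `np.random.rand`. A minimal sketch using sentence-transformers (a separate library, not a minsearch dependency; the model name and its 384-dimensional output are assumptions about that library, not about minsearch):

```python
from sentence_transformers import SentenceTransformer

from minsearch import VectorSearch

# Illustrative payload; any text field can be embedded
payload = [
    {"title": "Python Tutorial", "category": "programming"},
    {"title": "Data Science Guide", "category": "data"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
vectors = model.encode([doc["title"] for doc in payload])

index = VectorSearch(keyword_fields=["category"])
index.fit(vectors, payload)

# Embed the query with the same model, then search as before
query_vector = model.encode("beginner python course")
results = index.search(query_vector, num_results=2)
```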
### Advanced Features
#### Custom Tokenizer Pattern
```python
from minsearch import AppendableIndex

# Create an index with a custom tokenizer pattern
index = AppendableIndex(
    text_fields=["title", "description"],
    keyword_fields=["course"],
    tokenizer_pattern=r'[\s\W\d]+'  # Split on whitespace, non-word characters, and digits
)
```
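With digits acting as separators, a token like `Python3` is split into just `python`, so a plain-text query should still match. A quick illustrative check, assuming the index above:

```python
index.append({"title": "Python3 Basics", "description": "Intro course", "course": "CS100"})

# "Python3" tokenizes to "python" under the pattern above
results = index.search("python basics")
```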
#### Field Boosting (Text-based Search)
```python
# Boost certain fields to increase their importance in search
boost_dict = {
    "title": 2.0,       # Title matches are twice as important
    "description": 1.0  # Normal importance for description
}
results = index.search("python", boost_dict=boost_dict)
```
#### Keyword Filtering
```python
# Filter results by exact keyword matches
# (each field must be declared in keyword_fields when the index is created)
filter_dict = {
    "course": "CS101",
    "level": "beginner"
}
results = index.search("python", filter_dict=filter_dict)
```
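Filters, boosts, and the result count can be combined in a single call. A sketch assuming `num_results` (shown in the VectorSearch example above) behaves the same for text search:

```python
# Combine filtering, boosting, and a result cap in one query
results = index.search(
    "python",
    filter_dict={"course": "CS101"},
    boost_dict={"title": 2.0},
    num_results=5,
)
```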
## Examples
### Interactive Notebook
The repository includes an interactive Jupyter notebook (`minsearch_example.ipynb`) that demonstrates the library's features using real-world data. The notebook shows:
- Loading and preparing documents from a JSON source
- Creating and configuring the search index
- Performing searches with filters and boosts
- Working with real course-related Q&A data
To run the notebook:
```bash
uv run jupyter notebook
```
Then open `minsearch_example.ipynb` in your browser.
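In outline, the notebook's flow looks like this (the file name and document layout here are illustrative, not the notebook's exact data):

```python
import json

from minsearch import Index

# Illustrative file name; the notebook loads its own course Q&A data
with open("documents.json") as f:
    docs = json.load(f)

index = Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

results = index.search(
    "How do I join the course?",
    filter_dict={"course": "data-engineering-zoomcamp"}
)
```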
## Development
### Running Tests
```bash
uv run pytest
```
### Building and Publishing
1. Install development dependencies:
```bash
uv sync --extra dev
```
2. Build the package:
```bash
uv run hatch build
```
3. Publish to test PyPI:
```bash
uv run hatch publish --repo test
```
4. Publish to PyPI:
```bash
uv run hatch publish
```
5. Clean up:
```bash
rm -r dist/
```
Note: For Hatch publishing, you'll need to configure your PyPI credentials in `~/.pypirc` or use environment variables.
## PyPI Credentials Setup
Create a `.pypirc` file in your home directory with your PyPI credentials:
```ini
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
username = __token__
password = pypi-your-api-token-here

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-your-test-api-token-here
```
**Important Notes:**
- Use `__token__` as the username for API tokens
- Get your tokens from [PyPI](https://pypi.org/manage/account/token/) and [Test PyPI](https://test.pypi.org/manage/account/token/)
- Set file permissions: `chmod 600 ~/.pypirc`
**Alternative: Environment Variables**
```bash
export HATCH_INDEX_USER=__token__
export HATCH_INDEX_AUTH=your-pypi-token
```
## Project Structure
- `minsearch/`: Main package directory
  - `minsearch.py`: Core Index implementation using scikit-learn
  - `append.py`: AppendableIndex implementation with inverted index
  - `vector.py`: VectorSearch implementation using cosine similarity
- `tests/`: Test suite
- `minsearch_example.ipynb`: Example notebook
- `setup.py`: Package configuration
- `Pipfile`: Development dependencies
Note: The `minsearch.py` file in the root directory is maintained for backward compatibility with the LLM Zoomcamp course.