bm25-rs

Name	bm25-rs
Version	0.1.1
home_page	https://github.com/dorianbrown/rank_bm25
Summary	High-performance BM25 implementation in Rust with Python bindings
upload_time	2025-09-19 05:43:05
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	MIT
keywords	bm25 search information-retrieval rust performance
requirements	numpy

# BM25-RS: High-Performance BM25 for Python

[![PyPI version](https://badge.fury.io/py/bm25-rs.svg)](https://badge.fury.io/py/bm25-rs)
[![Python versions](https://img.shields.io/pypi/pyversions/bm25-rs.svg)](https://pypi.org/project/bm25-rs/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A fast BM25 implementation in Rust with Python bindings. The library provides text search with multiple BM25 variants and is optimized for both speed and memory efficiency.

## 🚀 Features

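- **🔥 High Performance**: 4,000+ queries per second with sub-millisecond latency on a 10,000-document corpus (see the benchmarks below)
- **🧵 Thread-Safe**: Concurrent queries scale close to linearly with thread count (4 threads reach about 4x single-thread throughput in the benchmarks below)
- **💾 Memory Efficient**: Optimized data structures that use about 30% less memory than pure Python implementations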
- **🎯 Multiple Variants**: BM25Okapi, BM25Plus, and BM25L implementations
- **🐍 Python Integration**: Seamless integration with Python via PyO3
- **⚡ Batch Operations**: Efficient batch scoring for multiple documents
- **🔧 Custom Tokenization**: Support for custom tokenizers via Python callbacks

## 📦 Installation

Install from PyPI:

```bash
pip install bm25-rs
```

Or install with the optional `dev` and `benchmark` extras:

```bash
pip install "bm25-rs[dev,benchmark]"
```

## 🏃‍♂️ Quick Start

```python
from bm25_rs import BM25Okapi

# Sample corpus
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never gonna give you up never gonna let you down",
    "the answer to life the universe and everything is 42",
    "to be or not to be that is the question",
    "may the force be with you",
]

# Initialize BM25
bm25 = BM25Okapi(corpus)

# Search query
query = "the quick brown"
query_tokens = query.lower().split()

# Get relevance scores for all documents
scores = bm25.get_scores(query_tokens)
print(f"Scores: {scores}")

# Get top-k most relevant documents
top_docs = bm25.get_top_n(query_tokens, corpus, n=3)
print(f"Top documents: {top_docs}")
```

## 🎯 Advanced Usage

### Custom Tokenization

```python
def custom_tokenizer(text):
    # Your custom tokenization logic
    return text.lower().split()

bm25 = BM25Okapi(corpus, tokenizer=custom_tokenizer)
```
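
A fuller tokenizer might also strip punctuation and drop stop words. The sketch below is illustrative only: `regex_tokenizer` and `STOP_WORDS` are hypothetical names, not part of bm25-rs, and it assumes the `tokenizer` callback receives one document string and returns a list of tokens, as in the example above. Apply the same function to queries so that query and document tokens match.

```python
import re
from typing import List

# Hypothetical stop-word list for illustration; tune it for your corpus.
STOP_WORDS = frozenset({"a", "an", "and", "of", "or", "the", "to"})

_WORD_RE = re.compile(r"[a-z0-9]+")


def regex_tokenizer(text: str) -> List[str]:
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [tok for tok in _WORD_RE.findall(text.lower()) if tok not in STOP_WORDS]


print(regex_tokenizer("The answer to Life, the Universe, and Everything is 42!"))
# ['answer', 'life', 'universe', 'everything', 'is', '42']
```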

### Batch Operations

```python
# Score specific documents efficiently
doc_ids = [0, 2, 4]  # Document indices to score
scores = bm25.get_batch_scores(query_tokens, doc_ids)
```

### Multiple BM25 Variants

```python
from bm25_rs import BM25Okapi, BM25Plus, BM25L

# Standard BM25Okapi
bm25_okapi = BM25Okapi(corpus, k1=1.5, b=0.75, epsilon=0.25)

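# BM25Plus (adds a lower bound, delta, to the term-frequency component
# so that long documents are not over-penalized)
bm25_plus = BM25Plus(corpus, k1=1.5, b=0.75, delta=1.0)

# BM25L (shifts the length-normalized term frequency by delta,
# which also reduces the penalty on very long documents)
bm25_l = BM25L(corpus, k1=1.5, b=0.75, delta=0.5)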
```
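
The three variants differ mainly in how a term's frequency contributes to the score. The functions below are a pure-Python sketch of the per-term weights from the published BM25, BM25+ and BM25L formulas, for a term that occurs in the document (`tf > 0`). The function names are hypothetical, and this is not the library's Rust code, which may differ in detail. Each weight is multiplied by the term's IDF to give its contribution to the document score.

```python
def okapi_tf(tf: float, dl: float, avgdl: float, k1: float = 1.5, b: float = 0.75) -> float:
    """Classic BM25 term-frequency component."""
    norm = 1.0 - b + b * dl / avgdl  # document-length normalization
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def plus_tf(tf: float, dl: float, avgdl: float, k1: float = 1.5, b: float = 0.75,
            delta: float = 1.0) -> float:
    """BM25+: add a constant lower bound delta to the Okapi component."""
    return okapi_tf(tf, dl, avgdl, k1, b) + delta


def l_tf(tf: float, dl: float, avgdl: float, k1: float = 1.5, b: float = 0.75,
         delta: float = 0.5) -> float:
    """BM25L: shift the length-normalized tf by delta before saturation."""
    ctd = tf / (1.0 - b + b * dl / avgdl)
    return (k1 + 1.0) * (ctd + delta) / (k1 + ctd + delta)


# One occurrence of a term in a document ten times longer than average.
for fn in (okapi_tf, plus_tf, l_tf):
    print(fn.__name__, round(fn(1, dl=1000, avgdl=100), 3))
```

For a single match in a very long document, the Okapi weight falls well below 1, while the BM25+ weight can never drop below `delta` and the BM25L weight sits in between.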

### Performance Optimization

```python
# For large corpora, use chunked processing
scores = bm25.get_scores_chunked(query_tokens, chunk_size=1000)

# Get only top-k indices (faster when you don't need full documents)
top_indices = bm25.get_top_n_indices(query_tokens, n=10)
```
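
Conceptually, top-k selection over a list of scores needs only a partial sort rather than a full one. The sketch below shows the idea with the standard library's `heapq.nlargest`. `top_n_indices` is a hypothetical helper, not the library's implementation; it returns `(index, score)` pairs to mirror the return type listed in the API reference.

```python
import heapq
from typing import List, Sequence, Tuple


def top_n_indices(scores: Sequence[float], n: int = 5) -> List[Tuple[int, float]]:
    """Return the n highest-scoring (index, score) pairs, best first."""
    return heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])


scores = [0.0, 1.2, 0.4, 3.1, 0.9]
print(top_n_indices(scores, n=3))  # [(3, 3.1), (1, 1.2), (4, 0.9)]
```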

## 📊 Performance Benchmarks

Measured performance on a corpus of 10,000 documents:

| Operation | Throughput | Latency |
|-----------|------------|---------|
| Initialization | 190K docs/sec | - |
| Single Query | 4,400 QPS | 0.23 ms |
| Batch Queries | 73K ops/sec | 0.01 ms |
| Concurrent (4 threads) | 17,600 QPS | 0.06 ms |

Memory usage: ~30% less than pure Python implementations.
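
To check throughput on your own hardware, a small harness along these lines can help. It is a sketch, not the project's `benchmarks/benchmark.py`: `measure_qps` is a hypothetical helper that times any scoring callable, so in practice you would pass something like `bm25.get_scores`. A stand-in scorer is used here so the example runs without bm25-rs installed. Thread-level speedups appear only if the callable releases Python's global interpreter lock while scoring, so verify that on your setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def measure_qps(score_fn: Callable[[List[str]], object],
                queries: Sequence[List[str]],
                threads: int = 1) -> float:
    """Run every query once through score_fn and return queries per second."""
    start = time.perf_counter()
    if threads <= 1:
        for q in queries:
            score_fn(q)
    else:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # list() forces all results so the timing covers every query.
            list(pool.map(score_fn, queries))
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed if elapsed > 0 else float("inf")


def fake_scores(query_tokens: List[str]) -> List[float]:
    """Stand-in scorer (an assumption for this sketch, not a real BM25)."""
    return [float(len(tok)) for tok in query_tokens]


queries = [["the", "quick", "brown"]] * 1000
print(f"1 thread:  {measure_qps(fake_scores, queries):,.0f} QPS")
print(f"4 threads: {measure_qps(fake_scores, queries, threads=4):,.0f} QPS")
```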

## 🔧 API Reference

### BM25Okapi

```python
from typing import Callable, List, Optional, Tuple


class BM25Okapi:
    # Signature stubs only; method bodies are omitted.
    def __init__(
        self,
        corpus: List[str],
        tokenizer: Optional[Callable] = None,
        k1: float = 1.5,
        b: float = 0.75,
        epsilon: float = 0.25,
    ) -> None: ...

    def get_scores(self, query: List[str]) -> List[float]: ...
    def get_batch_scores(self, query: List[str], doc_ids: List[int]) -> List[float]: ...
    def get_top_n(self, query: List[str], documents: List[str], n: int = 5) -> List[Tuple[str, float]]: ...
    def get_top_n_indices(self, query: List[str], n: int = 5) -> List[Tuple[int, float]]: ...
    def get_scores_chunked(self, query: List[str], chunk_size: int = 1000) -> List[float]: ...
```

### Parameters

- **k1** (float): Controls term frequency saturation (default: 1.5)
- **b** (float): Controls length normalization (default: 0.75)  
- **epsilon** (float): IDF normalization parameter for BM25Okapi (default: 0.25)
- **delta** (float): Term frequency normalization for BM25Plus and BM25L (default: 1.0 for BM25Plus, 0.5 for BM25L)
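
To make the roles of `k1`, `b` and `epsilon` concrete, here is a minimal pure-Python reference sketch of Okapi BM25 scoring. It is not the library's Rust implementation. In particular, the handling of `epsilon` follows the rank-bm25 convention (negative IDF values are replaced by `epsilon` times the average IDF), which is an assumption about bm25-rs rather than something documented here. `okapi_scores` is a hypothetical name, and the corpus is passed in already tokenized.

```python
import math
from collections import Counter
from typing import Dict, List


def okapi_scores(corpus: List[List[str]], query: List[str],
                 k1: float = 1.5, b: float = 0.75, epsilon: float = 0.25) -> List[float]:
    """Score every tokenized document in corpus against the query tokens."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    # IDF per term. Assumption (rank-bm25 convention): negative values are
    # replaced by epsilon * average IDF.
    idf: Dict[str, float] = {
        t: math.log(n_docs - df + 0.5) - math.log(df + 0.5)
        for t, df in doc_freq.items()
    }
    floor = epsilon * sum(idf.values()) / len(idf)
    idf = {t: (v if v >= 0 else floor) for t, v in idf.items()}

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        norm = 1.0 - b + b * len(doc) / avgdl  # b scales the length penalty
        score = 0.0
        for term in query:
            f = tf.get(term, 0)
            # k1 controls how quickly repeated occurrences saturate.
            score += idf.get(term, 0.0) * f * (k1 + 1.0) / (f + k1 * norm)
        scores.append(score)
    return scores


docs = [d.split() for d in [
    "the quick brown fox jumps over the lazy dog",
    "never gonna give you up never gonna let you down",
    "the answer to life the universe and everything is 42",
    "to be or not to be that is the question",
    "may the force be with you",
]]
print(okapi_scores(docs, ["quick", "brown"]))
```

Only the first document contains "quick" or "brown", so it is the only one with a non-zero score for this query.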

## 🛠️ Development

### Building from Source

```bash
# Clone the repository
git clone https://github.com/dorianbrown/rank_bm25.git
cd rank_bm25

# Install development dependencies
pip install -e ".[dev]"

# Build the Rust extension
maturin develop --release
```

### Running Tests

```bash
pytest tests/
```

### Benchmarking

```bash
python benchmarks/benchmark.py
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built with [PyO3](https://pyo3.rs/) for Python-Rust interoperability
- Uses [Rayon](https://github.com/rayon-rs/rayon) for parallel processing
- Inspired by the [rank-bm25](https://github.com/dorianbrown/rank_bm25) Python library

## 📈 Changelog

See [CHANGELOG.md](CHANGELOG.md) for a detailed history of changes.

---

**Made with ❤️ and ⚡ by the BM25-RS team**

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/dorianbrown/rank_bm25",
    "name": "bm25-rs",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "BM25-RS Contributors <maintainers@example.com>",
    "keywords": "bm25, search, information-retrieval, rust, performance",
    "author": null,
    "author_email": "BM25-RS Contributors <maintainers@example.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "High-performance BM25 implementation in Rust with Python bindings",
    "version": "0.1.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/dorianbrown/rank_bm25/issues",
        "Changelog": "https://github.com/dorianbrown/rank_bm25/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/dorianbrown/rank_bm25#readme",
        "Homepage": "https://github.com/dorianbrown/rank_bm25",
        "Repository": "https://github.com/dorianbrown/rank_bm25.git"
    },
    "split_keywords": [
        "bm25",
        " search",
        " information-retrieval",
        " rust",
        " performance"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9154ef123166b39a4d669def3643ed06ea12cfacc054375ccb9e4b2c4e2b3f6a",
                "md5": "660c3bc4631cb21ee588afb07fbb6551",
                "sha256": "f9a1a2b6798db78cb0b71385ae6c5beebf8d500d90e7c2617d539ed611f47580"
            },
            "downloads": -1,
            "filename": "bm25_rs-0.1.1-cp313-cp313-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "660c3bc4631cb21ee588afb07fbb6551",
            "packagetype": "bdist_wheel",
            "python_version": "cp313",
            "requires_python": ">=3.8",
            "size": 424347,
            "upload_time": "2025-09-19T05:43:05",
            "upload_time_iso_8601": "2025-09-19T05:43:05.474416Z",
            "url": "https://files.pythonhosted.org/packages/91/54/ef123166b39a4d669def3643ed06ea12cfacc054375ccb9e4b2c4e2b3f6a/bm25_rs-0.1.1-cp313-cp313-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-19 05:43:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dorianbrown",
    "github_project": "rank_bm25",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": []
        }
    ],
    "lcname": "bm25-rs"
}
