simstring-rust


- Name: simstring-rust
- Version: 0.3.1rc3
- Home page: https://github.com/PyDataBlog/simstring_rs
- Summary: A fast, native Rust implementation of the SimString algorithm with Python bindings.
- Upload time: 2025-08-11 19:40:56
- Requires Python: >=3.10
- License: MIT
- Keywords: string-matching, nlp, simstring, fuzzy
- Requirements: No requirements were recorded.

# simstring_rust

[![Build Status](https://github.com/PyDataBlog/simstring_rs/actions/workflows/CI.yml/badge.svg)](https://github.com/PyDataBlog/simstring_rs/actions)
[![Crates.io](https://img.shields.io/crates/v/simstring_rust.svg)](https://crates.io/crates/simstring_rust)
[![PyPI version](https://badge.fury.io/py/simstring-rust.svg)](https://badge.fury.io/py/simstring-rust)
[![Python versions](https://img.shields.io/pypi/pyversions/simstring-rust.svg)](https://pypi.org/project/simstring-rust)
[![Documentation](https://docs.rs/simstring_rust/badge.svg)](https://docs.rs/simstring_rust)
[![Rust](https://img.shields.io/badge/rust-1.63.0%2B-blue.svg?maxAge=3600)](https://github.com/PyDataBlog/simstring_rs)
[![Codecov](https://img.shields.io/codecov/c/github/PyDataBlog/simstring_rs?token=XJM8O8TD4U)](https://codecov.io/gh/PyDataBlog/simstring_rs)

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that need to retrieve similar strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow custom, user-defined feature generation methods.
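
To make "character n-gram feature generation" concrete, here is a minimal standalone sketch using only the standard library. The function name `char_ngrams` and the padding scheme (`n - 1` marker characters on each side) are assumptions for illustration and are not taken from this crate's source.

```rust
/// Splits `text` into overlapping character n-grams.
///
/// Both ends are padded with `n - 1` copies of `marker`, so the first and
/// last characters of the text also appear in boundary n-grams.
/// The function iterates over `char`s, not bytes, so multi-byte Unicode
/// characters are kept whole.
fn char_ngrams(text: &str, n: usize, marker: char) -> Vec<String> {
    assert!(n >= 1, "n-gram size must be at least 1");
    // Padding iterator: n - 1 copies of the marker (empty when n == 1).
    let pad = std::iter::repeat(marker).take(n - 1);
    let chars: Vec<char> = pad.clone().chain(text.chars()).chain(pad).collect();
    // Slide a window of length n over the padded character sequence.
    chars.windows(n).map(|w| w.iter().collect()).collect()
}
```

For example, `char_ngrams("hell", 2, '$')` yields `$h`, `he`, `el`, `ll`, `l$`. A word-based variant would apply the same sliding window to whitespace-separated tokens instead of characters.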

## Features

- ✅ Fast algorithm for string matching
- ✅ 100% exact retrieval
- ✅ Support for Unicode
- [ ] Support for building databases directly from text files
- [ ] Mecab-based tokenizer support

## Supported String Similarity Measures

- ✅ Dice coefficient
- ✅ Jaccard coefficient
- ✅ Cosine coefficient
- ✅ Overlap coefficient
- ✅ Exact match

## Installation

Add `simstring_rust` to your `Cargo.toml`:

```toml
[dependencies]
simstring_rust = "0.3.0" # change version accordingly
```

For the latest features, you can depend on the `main` branch by specifying the Git repository:

```toml
[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
```

Note: The `main` branch may include experimental features and breaking changes. Use with caution!

To revert to a stable version, make sure your `Cargo.toml` specifies a version number instead of the Git repository.

## Usage

Here is a basic example of how to use `simstring_rust` in your Rust project:

```rust
use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Set up the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
```
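
The `alpha` argument in the example is a similarity threshold. As a rough sketch of how a SimString-style search can use it for the cosine measure, the helpers below compute which candidate sizes are worth considering and how many features a candidate must share with the query. The formulas follow the bounds described for the original SimString algorithm; the function names are hypothetical, and whether this crate computes the bounds in exactly this way is an assumption.

```rust
/// Range of feature-set sizes a candidate can have and still reach
/// cosine similarity >= `alpha` against a query with `query_size` features.
/// Returns (minimum size, maximum size).
fn cosine_size_bounds(query_size: usize, alpha: f64) -> (usize, usize) {
    assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
    let q = query_size as f64;
    let min = (alpha * alpha * q).ceil() as usize;
    let max = (q / (alpha * alpha)).floor() as usize;
    (min, max)
}

/// Minimum number of shared features needed for cosine similarity >= `alpha`
/// between a query with `query_size` features and a candidate with
/// `candidate_size` features.
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}
```

Assuming one `$` pad on each side, the query `hell` has five bigram features (`$h`, `he`, `el`, `ll`, `l$`). With `alpha = 0.5`, `cosine_size_bounds(5, 0.5)` gives a size range of 2 to 20, and a candidate with five or six features must share at least 3 of them with the query. Under that assumption `hello` (4 shared) and `help` (3 shared) qualify, while `halo` (1 shared) and `world` (0 shared) do not.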

<!-- ## Releasing -->
<!---->
<!-- This project uses [`cargo-release`](https://github.com/crate-ci/cargo-release) and [`git-cliff`](https://github.com/orhun/git-cliff) to automate the release process. -->
<!---->
<!-- ### Prerequisites -->
<!---->
<!-- Before creating a release, ensure you have installed the necessary tools: -->
<!---->
<!-- ```bash -->
<!-- cargo install cargo-release -->
<!-- cargo install git-cliff -->
<!-- ``` -->
<!---->
<!-- ### Creating a Release -->
<!---->
<!-- 1.  Ensure your local `main` branch is up-to-date: -->
<!--     ```bash -->
<!--     git checkout main -->
<!--     git pull origin main -->
<!--     ``` -->
<!-- 2.  Run `cargo release` with the desired release level (`patch`, `minor`, or `major`). The command runs in dry-run mode by default, so you can review the changes. -->
<!--     ```bash -->
<!--     cargo release <LEVEL> -->
<!--     ``` -->
<!-- 3.  Once you have verified the plan, execute the release: -->
<!--     ```bash -->
<!--     cargo release <LEVEL> --execute -->
<!--     ``` -->
<!---->
<!-- This will automatically: -->
<!-- -   Generate and update the `CHANGELOG.md`. -->
<!-- -   Bump the version in `Cargo.toml`. -->
<!-- -   Commit the changes and create a new Git tag. -->
<!-- -   Push the commit and tag to GitHub, which triggers the CI/CD pipeline to publish the crate to `crates.io`. -->

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

## License

This project is licensed under the MIT License.

## Acknowledgements

Inspired by the [SimString](https://www.chokkan.org/software/simstring/) project.


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/PyDataBlog/simstring_rs",
    "name": "simstring-rust",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "string-matching, nlp, simstring, fuzzy",
    "author": null,
    "author_email": "Bernard Brenyah <bbrenyah@gmail.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A fast, native Rust implementation of the SimString algorithm with Python bindings.",
    "version": "0.3.1rc3",
    "project_urls": {
        "Homepage": "https://github.com/PyDataBlog/simstring_rs",
        "Repository": "https://github.com/PyDataBlog/simstring_rs"
    },
    "split_keywords": [
        "string-matching",
        " nlp",
        " simstring",
        " fuzzy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "12896064f079119e62e76aae8939f050762eee2e247ff031c4a393132212fc28",
                "md5": "91e4362a2e928b694d0cbf4925ebd235",
                "sha256": "dd7c5185b011820be73f72c00b607876e4cd37b00a2957f69eb7f50d5658979f"
            },
            "downloads": -1,
            "filename": "simstring_rust-0.3.1rc3-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl",
            "has_sig": false,
            "md5_digest": "91e4362a2e928b694d0cbf4925ebd235",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.10",
            "size": 781730,
            "upload_time": "2025-08-11T19:40:56",
            "upload_time_iso_8601": "2025-08-11T19:40:56.916673Z",
            "url": "https://files.pythonhosted.org/packages/12/89/6064f079119e62e76aae8939f050762eee2e247ff031c4a393132212fc28/simstring_rust-0.3.1rc3-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b1811dd22fb4b9feb7febbe42aa2a9e01dccebf00246f7ba1ba8d50ebc336c7d",
                "md5": "4f3a8f78e26f64c8f09905207d58f2f8",
                "sha256": "e3253b5e78a05105325d7f258c04ca3cac835136d82b2debe875082d3bc288ac"
            },
            "downloads": -1,
            "filename": "simstring_rust-0.3.1rc3-cp37-abi3-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "4f3a8f78e26f64c8f09905207d58f2f8",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.10",
            "size": 3243280,
            "upload_time": "2025-08-11T19:40:58",
            "upload_time_iso_8601": "2025-08-11T19:40:58.469833Z",
            "url": "https://files.pythonhosted.org/packages/b1/81/1dd22fb4b9feb7febbe42aa2a9e01dccebf00246f7ba1ba8d50ebc336c7d/simstring_rust-0.3.1rc3-cp37-abi3-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d3bca301605e0de73c9f5ac34cee90f4250f78b0ac0978766ef71d650fedcdab",
                "md5": "dcc6e590a689019594ccb1ed65542a25",
                "sha256": "85179ecd0e35c989f27dbb861d5a1b61b821b93ebbe4d3b3d48519e6b527cdcb"
            },
            "downloads": -1,
            "filename": "simstring_rust-0.3.1rc3-cp37-abi3-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "dcc6e590a689019594ccb1ed65542a25",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.10",
            "size": 251855,
            "upload_time": "2025-08-11T19:40:59",
            "upload_time_iso_8601": "2025-08-11T19:40:59.732510Z",
            "url": "https://files.pythonhosted.org/packages/d3/bc/a301605e0de73c9f5ac34cee90f4250f78b0ac0978766ef71d650fedcdab/simstring_rust-0.3.1rc3-cp37-abi3-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-11 19:40:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "PyDataBlog",
    "github_project": "simstring_rs",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "simstring-rust"
}
        