simstring-rust


- Name: simstring-rust
- Version: 0.3.1rc2
- Home page: https://github.com/PyDataBlog/simstring_rs
- Summary: A fast, native Rust implementation of the SimString algorithm with Python bindings.
- Upload time: 2025-07-26 20:37:26
- Author: Bernard Brenyah <bbrenyah@gmail.com>
- Requires Python: >=3.10
- License: MIT
- Keywords: string-matching, nlp, simstring, fuzzy
# simstring_rust

[![Build Status](https://github.com/PyDataBlog/simstring_rs/actions/workflows/CI.yml/badge.svg)](https://github.com/PyDataBlog/simstring_rs/actions)
[![Crates.io](https://img.shields.io/crates/v/simstring_rust.svg)](https://crates.io/crates/simstring_rust)
[![Documentation](https://docs.rs/simstring_rust/badge.svg)](https://docs.rs/simstring_rust)
[![Rust](https://img.shields.io/badge/rust-1.63.0%2B-blue.svg?maxAge=3600)](https://github.com/PyDataBlog/simstring_rs)

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that need to retrieve similar strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow user-defined feature generation methods.
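
To make character n-gram feature generation concrete, here is a minimal standalone sketch. It does not use the crate's API: the function name `char_ngrams` is hypothetical, and the padding rule (`n - 1` copies of the padding string on each side) is an assumption based on how SimString-style extractors are commonly described. The crate's own extractor may differ, for example in how it treats repeated n-grams.

```rust
/// Split `text` into overlapping character n-grams.
/// Assumption: both ends are padded with `n - 1` copies of `pad`,
/// so that leading and trailing characters also appear in `n` features.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let padding = pad.repeat(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is not split mid-character.
    let padded: Vec<char> = format!("{}{}{}", padding, text, padding)
        .chars()
        .collect();
    padded
        .windows(n)
        .map(|window| window.iter().collect::<String>())
        .collect()
}

fn main() {
    // Prints ["$h", "he", "el", "ll", "l$"]
    println!("{:?}", char_ngrams("hell", 2, "$"));
}
```

Two strings that share many of these features are likely to be similar, which is what the similarity measures below quantify.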

## Features

- ✅ Fast algorithm for string matching
- ✅ 100% exact retrieval
- ✅ Support for Unicode
- [ ] Support for building databases directly from text files
- [ ] Mecab-based tokenizer support

## Supported String Similarity Measures

- ✅ Dice coefficient
- ✅ Jaccard coefficient
- ✅ Cosine coefficient
- ✅ Overlap coefficient
- ✅ Exact match
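
For reference, the four coefficient measures can be written directly over feature sets using their standard set-based definitions. The sketch below is illustrative only: the function names are hypothetical, it assumes non-empty feature sets, and it is not taken from the crate's source.

```rust
use std::collections::HashSet;

/// Number of features present in both sets, as a float for the ratios below.
fn shared(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    x.intersection(y).count() as f64
}

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    2.0 * shared(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    shared(x, y) / x.union(y).count() as f64
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    shared(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    shared(x, y) / x.len().min(y.len()) as f64
}

fn main() {
    // Padded character bigrams of "hell" and "hello"; they share 4 features.
    let x = HashSet::from(["$h", "he", "el", "ll", "l$"]);
    let y = HashSet::from(["$h", "he", "el", "ll", "lo", "o$"]);
    println!("dice    = {:.4}", dice(&x, &y)); // 0.7273
    println!("jaccard = {:.4}", jaccard(&x, &y)); // 0.5714
    println!("cosine  = {:.4}", cosine(&x, &y)); // 0.7303
    println!("overlap = {:.4}", overlap(&x, &y)); // 0.8000
}
```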

## Installation

Add `simstring_rust` to your `Cargo.toml`:

```toml
[dependencies]
simstring_rust = "0.3.0" # change version accordingly
```

For the latest features, you can depend on the `main` branch by specifying the Git repository:

```toml
[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
```

Note: The `main` branch may include experimental features and breaking changes. Use with caution!

To return to a stable release, specify a version number in your `Cargo.toml` instead of the Git repository.

## Usage

Here is a basic example of how to use `simstring_rust` in your Rust project:

```rust
use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Set up the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
```
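
The `alpha` value in the example above is the minimum similarity a match must reach. In the published description of SimString, the threshold also bounds how many features a candidate can have and how many it must share with the query, which is what lets CPMerge skip most of the index. The sketch below shows those bounds for the cosine measure. The function names are hypothetical, and whether this crate computes the bounds in exactly this way is an assumption.

```rust
/// Smallest and largest candidate feature-set sizes that can still reach
/// cosine similarity `alpha` against a query with `query_size` features:
/// ceil(alpha^2 * |X|) <= |Y| <= floor(|X| / alpha^2).
fn cosine_size_bounds(query_size: usize, alpha: f64) -> (usize, usize) {
    assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
    let x = query_size as f64;
    let min = (alpha * alpha * x).ceil() as usize;
    let max = (x / (alpha * alpha)).floor() as usize;
    (min, max)
}

/// Minimum number of shared features needed for a candidate of the given
/// size to reach cosine similarity `alpha`: ceil(alpha * sqrt(|X| * |Y|)).
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}

fn main() {
    // Five features, e.g. the padded bigrams of "hell", with alpha = 0.5.
    let (min, max) = cosine_size_bounds(5, 0.5);
    println!("candidate sizes: {}..={}", min, max); // candidate sizes: 2..=20
    // A candidate with six features must share at least this many with the query.
    println!("min overlap: {}", cosine_min_overlap(5, 6, 0.5)); // min overlap: 3
}
```

Raising `alpha` narrows the size range and raises the required overlap, so stricter thresholds generally mean fewer candidates to examine.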

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

## License

This project is licensed under the MIT License.

## Acknowledgements

Inspired by the [SimString](https://www.chokkan.org/software/simstring/) project.


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/PyDataBlog/simstring_rs",
    "name": "simstring-rust",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "string-matching, nlp, simstring, fuzzy",
    "author": "Bernard Brenyah <bbrenyah@gmail.com>",
    "author_email": "Bernard Brenyah <bbrenyah@gmail.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A fast, native Rust implementation of the SimString algorithm with Python bindings.",
    "version": "0.3.1rc2",
    "project_urls": {
        "Homepage": "https://github.com/PyDataBlog/simstring_rs",
        "Repository": "https://github.com/PyDataBlog/simstring_rs"
    },
    "split_keywords": [
        "string-matching",
        " nlp",
        " simstring",
        " fuzzy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "108c1f9b02beb932ed53f098adadc239dbc7506ae7034f4175ea541c2e4b20eb",
                "md5": "8add548bdea2fe63fea92950e9d7c87e",
                "sha256": "49c766eb87d771897007354a768de2344bd4a164633d705c2372cd155a024909"
            },
            "downloads": -1,
            "filename": "simstring_rust-0.3.1rc2-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl",
            "has_sig": false,
            "md5_digest": "8add548bdea2fe63fea92950e9d7c87e",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.10",
            "size": 773759,
            "upload_time": "2025-07-26T20:37:26",
            "upload_time_iso_8601": "2025-07-26T20:37:26.037892Z",
            "url": "https://files.pythonhosted.org/packages/10/8c/1f9b02beb932ed53f098adadc239dbc7506ae7034f4175ea541c2e4b20eb/simstring_rust-0.3.1rc2-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d9cb00e82c4d4791b3830cad336428923c3b443275291c1495b96abc72a748b9",
                "md5": "c726ecc0d385a649855b1297b8a1e6fb",
                "sha256": "b2791f71bd37c5c14fd0ca787abc2d43aaa0f66f34d73d620a3b1e65d026f4a9"
            },
            "downloads": -1,
            "filename": "simstring_rust-0.3.1rc2-cp37-abi3-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "c726ecc0d385a649855b1297b8a1e6fb",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.10",
            "size": 3233047,
            "upload_time": "2025-07-26T20:37:27",
            "upload_time_iso_8601": "2025-07-26T20:37:27.510014Z",
            "url": "https://files.pythonhosted.org/packages/d9/cb/00e82c4d4791b3830cad336428923c3b443275291c1495b96abc72a748b9/simstring_rust-0.3.1rc2-cp37-abi3-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "eb84c0df2c4040cb7dbe96293c3736d0c291e62b55a92a8eb09412c6d42fe3d8",
                "md5": "83f9a6b8a7aecf870c6da0f553c13606",
                "sha256": "62353e63831ca42ada5920406dbd024108bed1793d107a9f14875321aa0d1ce9"
            },
            "downloads": -1,
            "filename": "simstring_rust-0.3.1rc2-cp37-abi3-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "83f9a6b8a7aecf870c6da0f553c13606",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.10",
            "size": 246872,
            "upload_time": "2025-07-26T20:37:29",
            "upload_time_iso_8601": "2025-07-26T20:37:29.044575Z",
            "url": "https://files.pythonhosted.org/packages/eb/84/c0df2c4040cb7dbe96293c3736d0c291e62b55a92a8eb09412c6d42fe3d8/simstring_rust-0.3.1rc2-cp37-abi3-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-26 20:37:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "PyDataBlog",
    "github_project": "simstring_rs",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "simstring-rust"
}
        