# simstring_rust
[CI](https://github.com/PyDataBlog/simstring_rs/actions)
[crates.io](https://crates.io/crates/simstring_rust)
[docs.rs](https://docs.rs/simstring_rust)
[GitHub](https://github.com/PyDataBlog/simstring_rs)
A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that retrieve strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
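To make the idea of n-gram features concrete, here is a minimal standalone sketch of character and word n-gram extraction. It does not use this crate: `char_ngrams` and `word_ngrams` are hypothetical helpers, and the padding convention (n - 1 copies of the pad string on each side) is an assumption, so the crate's own extractors may produce different features.

```rust
/// Hypothetical helper (not part of this crate): pad `text` with `n - 1`
/// copies of `pad` on each side and return every window of `n` characters.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let padding = pad.repeat(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is split correctly.
    let padded: Vec<char> = format!("{}{}{}", padding, text, padding).chars().collect();
    padded.windows(n).map(|w| w.iter().collect()).collect()
}

/// Hypothetical helper (not part of this crate): every run of `n` consecutive
/// whitespace-separated words, joined by a single space.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    // Character bigrams of "hell" with "$" as the padding marker.
    println!("{:?}", char_ngrams("hell", 2, "$"));
    // Word bigrams of a short phrase.
    println!("{:?}", word_ngrams("approximate string matching", 2));
}
```

Under these assumptions, "hell" yields the five bigrams `$h`, `he`, `el`, `ll`, `l$`. The sketch does not distinguish repeated n-grams within one string.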
## Features
- ✅ Fast algorithm for string matching
- ✅ 100% exact retrieval
- ✅ Support for Unicode
- [ ] Support for building databases directly from text files
- [ ] Mecab-based tokenizer support
## Supported String Similarity Measures
- ✅ Dice coefficient
- ✅ Jaccard coefficient
- ✅ Cosine coefficient
- ✅ Overlap coefficient
- ✅ Exact match
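As a rough illustration of what these measures compute, the sketch below applies their standard set-based definitions to two feature sets. It is standalone and does not call this crate: the function names and the `Features` alias are hypothetical, and the crate's own implementations may differ in detail.

```rust
use std::collections::HashSet;

/// Hypothetical alias for a set of n-gram features (not a type from this crate).
type Features = HashSet<&'static str>;

/// Number of features the two sets share.
fn common(x: &Features, y: &Features) -> f64 {
    x.intersection(y).count() as f64
}

// All functions below assume both sets are non-empty.

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice(x: &Features, y: &Features) -> f64 {
    2.0 * common(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard(x: &Features, y: &Features) -> f64 {
    let c = common(x, y);
    c / (x.len() as f64 + y.len() as f64 - c)
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine(x: &Features, y: &Features) -> f64 {
    common(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap(x: &Features, y: &Features) -> f64 {
    common(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: 1 if the feature sets are identical, 0 otherwise.
fn exact_match(x: &Features, y: &Features) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}

fn main() {
    // Padded character bigrams of "hell" and "hello" (assumed padding marker "$").
    let x: Features = HashSet::from(["$h", "he", "el", "ll", "l$"]);
    let y: Features = HashSet::from(["$h", "he", "el", "ll", "lo", "o$"]);
    println!("dice    = {:.4}", dice(&x, &y));
    println!("jaccard = {:.4}", jaccard(&x, &y));
    println!("cosine  = {:.4}", cosine(&x, &y));
    println!("overlap = {:.4}", overlap(&x, &y));
    println!("exact   = {:.4}", exact_match(&x, &y));
}
```

The two sets share four features, so for example the Dice coefficient is 8/11 and the overlap coefficient is 4/5.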
## Installation
Add `simstring_rust` to your `Cargo.toml`:
```toml
[dependencies]
simstring_rust = "0.3.0" # change version accordingly
```
For the latest features, you can depend on the `main` branch by specifying the Git repository:
```toml
[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
```
Note: Using the `main` branch may include experimental features and potential breakages. Use with caution!
To return to a stable release, make sure your `Cargo.toml` specifies a version number instead of the Git repository.
## Usage
Here is a basic example of how to use `simstring_rust` in your Rust project:
```rust
use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;
use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
```
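The `alpha` threshold is what allows a CPMerge-style search to skip most of the database. As a rough sketch of the idea for the cosine measure, the bounds below follow from the definition |X ∩ Y| / sqrt(|X| * |Y|) ≥ alpha. This is not the crate's internal code: both function names are hypothetical, `alpha` is assumed to lie in (0, 1], and the feature counts assume padded character bigrams.

```rust
/// Hypothetical helper: smallest and largest feature-set size a candidate can
/// have and still reach cosine similarity >= `alpha` against a query with
/// `query_size` features.
fn cosine_size_bounds(query_size: usize, alpha: f64) -> (usize, usize) {
    let q = query_size as f64;
    let min = (alpha * alpha * q).ceil() as usize;
    let max = (q / (alpha * alpha)).floor() as usize;
    (min, max)
}

/// Hypothetical helper: minimum number of features a candidate of size
/// `candidate_size` must share with the query to reach cosine >= `alpha`.
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}

fn main() {
    // Assumption: "hell" has 5 padded bigrams ($h, he, el, ll, l$).
    let query_size = 5;
    let alpha = 0.5;

    let (min, max) = cosine_size_bounds(query_size, alpha);
    println!("candidates need between {} and {} features", min, max);

    // Assumption: "hello" has 6 padded bigrams.
    let needed = cosine_min_overlap(query_size, 6, alpha);
    println!("a 6-feature candidate must share at least {} features", needed);
}
```

With these inputs, only strings with 2 to 20 features need to be considered, and a 6-feature candidate must share at least 3 features with the query. A production implementation would also need to guard against floating-point rounding at the boundaries.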
<!-- ## Releasing -->
<!---->
<!-- This project uses [`cargo-release`](https://github.com/crate-ci/cargo-release) and [`git-cliff`](https://github.com/orhun/git-cliff) to automate the release process. -->
<!---->
<!-- ### Prerequisites -->
<!---->
<!-- Before creating a release, ensure you have installed the necessary tools: -->
<!---->
<!-- ```bash -->
<!-- cargo install cargo-release -->
<!-- cargo install git-cliff -->
<!-- ``` -->
<!---->
<!-- ### Creating a Release -->
<!---->
<!-- 1. Ensure your local `main` branch is up-to-date: -->
<!-- ```bash -->
<!-- git checkout main -->
<!-- git pull origin main -->
<!-- ``` -->
<!-- 2. Run `cargo release` with the desired release level (`patch`, `minor`, or `major`). The command runs in dry-run mode by default, so you can review the changes. -->
<!-- ```bash -->
<!-- cargo release <LEVEL> -->
<!-- ``` -->
<!-- 3. Once you have verified the plan, execute the release: -->
<!-- ```bash -->
<!-- cargo release <LEVEL> --execute -->
<!-- ``` -->
<!---->
<!-- This will automatically: -->
<!-- - Generate and update the `CHANGELOG.md`. -->
<!-- - Bump the version in `Cargo.toml`. -->
<!-- - Commit the changes and create a new Git tag. -->
<!-- - Push the commit and tag to GitHub, which triggers the CI/CD pipeline to publish the crate to `crates.io`. -->
## Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
## License
This project is licensed under the MIT License.
## Acknowledgements
Inspired by the [SimString](https://www.chokkan.org/software/simstring/) project.