<div align="center">
<a href="https://pypi.org/project/flashtext2">![PyPi Version](https://badge.fury.io/py/flashtext2.svg)</a>
<a href="https://pypi.org/project/flashtext2">![Supported Python versions](https://img.shields.io/pypi/pyversions/flashtext2.svg?color=%2334D058)</a>
<a href="https://pepy.tech/project/flashtext2">![Downloads](https://static.pepy.tech/badge/flashtext2)</a>
<a href="https://pepy.tech/project/flashtext2">![Downloads](https://static.pepy.tech/badge/flashtext2/month)</a>
</div>
```sh
pip install flashtext2
```
# flashtext2
`flashtext2` is an optimized version of the `flashtext` library for fast keyword extraction and replacement.
It is orders of magnitude faster than regular expressions.
## Key Enhancements in flashtext2
- **Rewritten for Better Performance**: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.
- **Unicode Standard Annex #29**: Instead of relying on an ad hoc regex pattern such as the `[A-Za-z0-9_]+` that **flashtext**
[uses](https://github.com/vi3k6i5/flashtext/blob/b316c7e9e54b6b4d078462b302a83db85f884a94/flashtext/keyword.py#L13),
**flashtext2** follows [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/) to split strings into tokens.
This ensures compatibility with all languages, not just Latin-based ones.
- **Unicode Case Folding**: Instead of converting strings to lowercase for case-insensitive matches, it uses
[Unicode case folding](https://www.w3.org/TR/charmod-norm/#definitionCaseFolding), ensuring accurate normalization
of characters according to the Unicode standard.
- **Fully Type-Hinted API**: The entire API is fully type-hinted, providing better code clarity and improved development experience.
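The two Unicode points above can be seen with the Python standard library alone. The snippet below is only an illustrative sketch: it does not call `flashtext2`, and it uses Python's built-in `str.casefold()` as a stand-in for Unicode case folding, so the library's own behavior may differ in edge cases.

```python
import re

# The ASCII-only token pattern quoted in the list above.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")

# Non-ASCII letters are not matched, so words get split or dropped entirely.
german_tokens = ASCII_WORD.findall("Maße")
greek_tokens = ASCII_WORD.findall("στο διάολο")
print(german_tokens)  # ['Ma', 'e']
print(greek_tokens)   # []

# Lowercasing keeps 'ß' as it is, so 'MASSE' and 'Maße' do not compare equal.
lower_equal = "MASSE".lower() == "Maße".lower()
# Case folding maps 'ß' to 'ss', so the two spellings compare equal.
folded_equal = "MASSE".casefold() == "Maße".casefold()
# The ligature U+FB02 folds to the two letters 'fl'.
ligature_equal = "\ufb02our".casefold() == "flour".casefold()

print(lower_equal, folded_equal, ligature_equal)  # False True True
```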
## Usage
<details>
<summary>Click to unfold usage</summary>
### Keyword Extraction
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Python')
kp.add_keyword('flashtext')
kp.add_keyword('program')
text = "I love programming in Python and using the flashtext library."
keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']
keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
```
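The spans can be used to slice or annotate the original text. The following is a small sketch, not part of the `flashtext2` API: `highlight` is a hypothetical helper, and it assumes that each tuple is `(keyword, start, end)` with half-open character offsets, as the output above suggests. The span list is copied from that output rather than computed, so the snippet runs without the library installed.

```python
from typing import List, Tuple


def highlight(text: str, spans: List[Tuple[str, int, int]]) -> str:
    """Wrap every matched span in ** markers (hypothetical helper)."""
    # Work from right to left so that earlier offsets stay valid
    # while the string grows.
    for _, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + "**" + text[start:end] + "**" + text[end:]
    return text


text = "I love programming in Python and using the flashtext library."
spans = [("Python", 22, 28), ("flashtext", 43, 52)]  # copied from the output above

highlighted = highlight(text, spans)
print(highlighted)
# I love programming in **Python** and using the **flashtext** library.
```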
### Keyword Replacement
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')
text = "I love programming in Java and using the regex library."
new_text = kp.replace_keywords(text)
print(new_text)
# Output: "I love programming in Python and using the flashtext library."
```
### Case Sensitivity
```python
from flashtext2 import KeywordProcessor
text = 'abc aBc ABC'
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc']
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']
```
### Other Examples
Overlapping keywords (returns the longest sequence)
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')
text = "machine learning is a subset of artificial intelligence"
print(kp.extract_keywords(text))
# Output: ['machine learning']
```
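One common way to get longest-match behavior is a trie keyed by tokens. The sketch below is not the library's implementation: `build_trie` and `extract_longest` are hypothetical names, and it splits on whitespace instead of using Unicode word segmentation. It only illustrates why `'machine learning'` wins over `'machine'`.

```python
END = object()  # sentinel key that marks the end of a complete keyword


def build_trie(keywords):
    """Build a nested-dict trie where each level is keyed by one token."""
    trie = {}
    for keyword in keywords:
        node = trie
        for token in keyword.split():
            node = node.setdefault(token, {})
        node[END] = keyword
    return trie


def extract_longest(trie, text):
    """Return keywords found in text, preferring the longest match."""
    tokens = text.split()
    found = []
    i = 0
    while i < len(tokens):
        node, match, match_end = trie, None, i
        j = i
        # Walk the trie as far as the tokens allow, remembering the
        # last position at which a complete keyword ended.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j
        if match is None:
            i += 1
        else:
            found.append(match)
            i = match_end  # continue after the matched keyword
    return found


trie = build_trie(["machine", "machine learning"])
print(extract_longest(trie, "machine learning is a subset of artificial intelligence"))
# ['machine learning']
print(extract_longest(trie, "machine vision and machine learning"))
# ['machine', 'machine learning']
```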
Case folding
```python
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])
text = "ﬂour, MASSE, ὰι στο διάολο"
print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']
```
</details>
### Performance
<details>
<summary>
Click to unfold performance
</summary>
Extracting keywords is usually 2.5-3x faster than with the original `flashtext`, and replacing them is about 10x faster.
There is still room to optimize the code and improve performance.
You can find the benchmarks [here](https://github.com/shner-elmo/FlashText2.0/tree/master/benchmarks).
![Image](benchmarks/extract-keywords.png)
![Image](benchmarks/replace-keywords.png)
In the benchmarks, words average 6 characters and each sentence has 10k words, so each input is about 60k characters long.
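An input of that shape can be approximated as follows. This is a rough sketch and not the benchmark script from the repository: `make_sentence` and `best_time` are hypothetical helpers, the word lengths are drawn uniformly from 3 to 9 so that they average 6, and `str.split` is only a placeholder for the function being measured. Joining the words with spaces adds roughly 10k characters on top of the 60k letters.

```python
import random
import string
import time
from typing import Callable


def make_sentence(n_words: int = 10_000, seed: int = 0) -> str:
    """Generate n_words random lowercase words averaging about 6 characters."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    words = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 9)))
        for _ in range(n_words)
    ]
    return " ".join(words)


def best_time(func: Callable[[str], object], text: str, repeats: int = 5) -> float:
    """Return the fastest of several timed calls to func(text), in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        timings.append(time.perf_counter() - start)
    return min(timings)


sentence = make_sentence()
words = sentence.split()
avg_len = sum(map(len, words)) / len(words)
elapsed = best_time(str.split, sentence)  # placeholder workload

print(f"{len(words)} words, average length {avg_len:.2f}")
print(f"best of 5 runs: {elapsed:.6f} s")
```

To compare libraries, pass a callable such as a bound `extract_keywords` method in place of `str.split`.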
</details>
### TODO
<details>
<summary>
Click to unfold TODO
</summary>
* Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding
* Remove all clones in the source code
</details>
Credit to [Vikash Singh](https://github.com/vi3k6i5/), the author of the original `flashtext` package.
Raw data
{
"_id": null,
"home_page": "https://github.com/shner-elmo/flashtext2-rs",
"name": "flashtext2",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "nlp, string, regex, text-processing, extracting-keywords, keyword-extraction, flashtext, flashtext2, rust",
"author": "Shneor E.",
"author_email": "\"Shneor E.\" <770elmo@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/b5/2e/a29841b65523bfb25dfe10e13a89e483edd6c9fc17e48d3c5f0b12e9d33d/flashtext2-1.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A package for extracting keywords from large text very quickly (much faster than regex and the original flashtext package)",
"version": "1.1.0",
"project_urls": {
"Bug Reports": "https://github.com/shner-elmo/flashtext2/issues",
"Homepage": "https://github.com/shner-elmo/flashtext2-rs",
"documentation": "https://github.com/shner-elmo/flashtext2/blob/master/README.md",
"homepage": "https://github.com/shner-elmo/flashtext2",
"repository": "https://github.com/shner-elmo/flashtext2",
"source": "https://github.com/shner-elmo/flashtext2"
},
"split_keywords": [
"nlp",
" string",
" regex",
" text-processing",
" extracting-keywords",
" keyword-extraction",
" flashtext",
" flashtext2",
" rust"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4567b16e333177cf9927f11e19c06359f0a03e225efe1c4cc47531898c0195d9",
"md5": "2cf546c40211e6555023f2acf763d276",
"sha256": "e5e766210be8457938f44b85d43c8b9ef7fc6cdb5a0503c1cd5bf29326511f8d"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-macosx_10_12_x86_64.whl",
"has_sig": false,
"md5_digest": "2cf546c40211e6555023f2acf763d276",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 264546,
"upload_time": "2024-07-04T14:40:21",
"upload_time_iso_8601": "2024-07-04T14:40:21.250552Z",
"url": "https://files.pythonhosted.org/packages/45/67/b16e333177cf9927f11e19c06359f0a03e225efe1c4cc47531898c0195d9/flashtext2-1.1.0-cp38-abi3-macosx_10_12_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8d51454fc720e25d82c09f5efca82c0da45879fcba1bf373c3649c47f9bbde59",
"md5": "20624873d15269f3767cf5ec1efef067",
"sha256": "d08bcd69a2f7ab8e18b156e5bfe8a9e7539037ceb6e4b8b9dced77c59a703883"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "20624873d15269f3767cf5ec1efef067",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 256273,
"upload_time": "2024-07-04T14:40:22",
"upload_time_iso_8601": "2024-07-04T14:40:22.969317Z",
"url": "https://files.pythonhosted.org/packages/8d/51/454fc720e25d82c09f5efca82c0da45879fcba1bf373c3649c47f9bbde59/flashtext2-1.1.0-cp38-abi3-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ec0a35caec42823627d9c50b3099b46310ef09d11d2466ad18a55ffb1f251cea",
"md5": "d599ef78023a81f2a229f3f8df20820b",
"sha256": "cffcfe5b1ea4d5c4326cd4de2fab9a54dc2c2c7b7f6bea27521ee7138f5b0ad2"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
"has_sig": false,
"md5_digest": "d599ef78023a81f2a229f3f8df20820b",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 303696,
"upload_time": "2024-07-04T14:40:24",
"upload_time_iso_8601": "2024-07-04T14:40:24.597081Z",
"url": "https://files.pythonhosted.org/packages/ec/0a/35caec42823627d9c50b3099b46310ef09d11d2466ad18a55ffb1f251cea/flashtext2-1.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "05f1802189fb11d42e161d599013dab47724400b3df82a512ab62ca0aa2b3480",
"md5": "24dc39a09002dba09d3c75e079959ab5",
"sha256": "92cd3e73f1ce5d00eefdb564540d70cd2a40d71a2f79662823f125908aa34a5c"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl",
"has_sig": false,
"md5_digest": "24dc39a09002dba09d3c75e079959ab5",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 303511,
"upload_time": "2024-07-04T14:40:26",
"upload_time_iso_8601": "2024-07-04T14:40:26.719787Z",
"url": "https://files.pythonhosted.org/packages/05/f1/802189fb11d42e161d599013dab47724400b3df82a512ab62ca0aa2b3480/flashtext2-1.1.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8f3c2f17a29d5e89c4f3f1c8eb98ce3ac79f4d7704aa35c0ea8d8474bc5d2258",
"md5": "312897656f609cafcab194037f0a28aa",
"sha256": "27113d3fd260b69dccd50611fd38fda792dc16bab2e938e60959e4a655ffd602"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl",
"has_sig": false,
"md5_digest": "312897656f609cafcab194037f0a28aa",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 336743,
"upload_time": "2024-07-04T14:40:28",
"upload_time_iso_8601": "2024-07-04T14:40:28.010714Z",
"url": "https://files.pythonhosted.org/packages/8f/3c/2f17a29d5e89c4f3f1c8eb98ce3ac79f4d7704aa35c0ea8d8474bc5d2258/flashtext2-1.1.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8f5c1ffd609ac91b7b357d226ae285789ecbd8ee3a3c19d997df0b44733b4c43",
"md5": "f5098d7782b04210f2134e6c87bdf4e6",
"sha256": "db9f11b0c0debca21b7691e89912f188bad560f856aa9caa4b127ff4c898512a"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl",
"has_sig": false,
"md5_digest": "f5098d7782b04210f2134e6c87bdf4e6",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 342181,
"upload_time": "2024-07-04T14:40:29",
"upload_time_iso_8601": "2024-07-04T14:40:29.236351Z",
"url": "https://files.pythonhosted.org/packages/8f/5c/1ffd609ac91b7b357d226ae285789ecbd8ee3a3c19d997df0b44733b4c43/flashtext2-1.1.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a03a0f591aede29ec711360206e394071ea6e8902b0eda8c62969b9e119d1574",
"md5": "7f9f57f22ee1bbbc07c92fccbab243f0",
"sha256": "06cd7787ac8b497f5725e76e168b4cbe8ddb96ef488315aea1fb23476071199a"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "7f9f57f22ee1bbbc07c92fccbab243f0",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 296322,
"upload_time": "2024-07-04T14:40:31",
"upload_time_iso_8601": "2024-07-04T14:40:31.065461Z",
"url": "https://files.pythonhosted.org/packages/a0/3a/0f591aede29ec711360206e394071ea6e8902b0eda8c62969b9e119d1574/flashtext2-1.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b5abbc02ffd8d84cf646e9e5ed98fe5542f847f7e48bf11f878dbff667560841",
"md5": "c68e6099e890a3e2d4e1c88f4124c65e",
"sha256": "2811d7205bed2e1562b0b96c7540b12c92eb1a2236d0aade2f4f6304b2db67b8"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl",
"has_sig": false,
"md5_digest": "c68e6099e890a3e2d4e1c88f4124c65e",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 307309,
"upload_time": "2024-07-04T14:40:32",
"upload_time_iso_8601": "2024-07-04T14:40:32.445270Z",
"url": "https://files.pythonhosted.org/packages/b5/ab/bc02ffd8d84cf646e9e5ed98fe5542f847f7e48bf11f878dbff667560841/flashtext2-1.1.0-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ad37bf09fcf995c5b51e11be1e7e3c71022deea6337e97248121f265fbbeac84",
"md5": "1823e791db9245782b248b82f6cccf0a",
"sha256": "f3c062a52f14840de76daa6180eb035d7f80151b7111bd419bfb34e2ced3cf24"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-win32.whl",
"has_sig": false,
"md5_digest": "1823e791db9245782b248b82f6cccf0a",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 153365,
"upload_time": "2024-07-04T14:40:33",
"upload_time_iso_8601": "2024-07-04T14:40:33.697111Z",
"url": "https://files.pythonhosted.org/packages/ad/37/bf09fcf995c5b51e11be1e7e3c71022deea6337e97248121f265fbbeac84/flashtext2-1.1.0-cp38-abi3-win32.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "74f0039c9ee320f3581b0405efae9160167e7d19a8f0a22d865fe1c40c20ba11",
"md5": "5cdaf03c65c6c5564593682f4f96381a",
"sha256": "2049a62895a23db9b385e295f6bce16e6fb5e60444a1943d9978029683fc184c"
},
"downloads": -1,
"filename": "flashtext2-1.1.0-cp38-abi3-win_amd64.whl",
"has_sig": false,
"md5_digest": "5cdaf03c65c6c5564593682f4f96381a",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 163303,
"upload_time": "2024-07-04T14:40:35",
"upload_time_iso_8601": "2024-07-04T14:40:35.359192Z",
"url": "https://files.pythonhosted.org/packages/74/f0/039c9ee320f3581b0405efae9160167e7d19a8f0a22d865fe1c40c20ba11/flashtext2-1.1.0-cp38-abi3-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b52ea29841b65523bfb25dfe10e13a89e483edd6c9fc17e48d3c5f0b12e9d33d",
"md5": "5f2aa365475ec615a29a495b3fe51b1f",
"sha256": "2eb9d8c5400f59321e0c64e52bbbd310a2cab32b4269f2fc3c824f6e0c3320a3"
},
"downloads": -1,
"filename": "flashtext2-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "5f2aa365475ec615a29a495b3fe51b1f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 109451,
"upload_time": "2024-07-04T14:40:37",
"upload_time_iso_8601": "2024-07-04T14:40:37.161669Z",
"url": "https://files.pythonhosted.org/packages/b5/2e/a29841b65523bfb25dfe10e13a89e483edd6c9fc17e48d3c5f0b12e9d33d/flashtext2-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-04 14:40:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "shner-elmo",
"github_project": "flashtext2-rs",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "flashtext2"
}