| Name | token-distance |
| Version | 0.2.3 |
| home_page | https://gitlab.com/patrick.daniel.gress/token-distance |
| Summary | Python library designed to perform fuzzy token matching within text documents. Utilizing advanced algorithms, this tool allows developers and data scientists to search and compare tokens based on flexible criteria, beyond exact matches. The library supports tokenization through whitespace, regular expressions, or custom functions, and provides weighted comparisons for nuanced analysis. |
| upload_time | 2024-07-29 21:13:54 |
| maintainer | None |
| docs_url | None |
| author | voidpointercast |
| requires_python | <4.0,>=3.10 |
| license | BSD |
| keywords | fuzzy match, search |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
token-distance is a versatile library designed for fuzzy token searching within texts. It can be used as a standalone
command line tool or integrated into other software as a library. This tool is particularly useful for applications in
data mining, natural language processing, and information retrieval where matching based on exact tokens is
insufficient.
The process begins by tokenizing the input texts, typically using whitespace, though regular expressions and custom
functions can also be employed. Following tokenization, each token from the search query is assigned a weight that
reflects its importance, which could depend on factors like token length or predefined criteria.
For each search token, token-distance identifies the most similar token in the target text. The core of the
library's functionality lies in how it calculates similarity: it pairs each search token with the best matching
token in the target text and computes a weighted average of these pairings, using the token weights, to produce a
final similarity score.
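To make this pipeline concrete, here is a minimal, self-contained sketch of the same idea (not the library's actual
implementation): it tokenizes on whitespace, weights each search token by its length, pairs it with its closest
counterpart in the target text using difflib.SequenceMatcher as a stand-in similarity measure, and returns the
weighted average of the best-match scores.
````python
# Illustrative sketch of the pipeline described above (not the library's
# actual implementation). Tokenization, weighting, and the similarity
# measure are stand-ins chosen for clarity.
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Stand-in similarity between two tokens, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_score(search_text: str, target_text: str) -> float:
    search_tokens = search_text.split()   # whitespace tokenization
    target_tokens = target_text.split()
    if not search_tokens or not target_tokens:
        return 0.0
    # Weight each search token, here simply by its length.
    weights = [len(token) for token in search_tokens]
    # Pair each search token with its best-matching target token.
    best_matches = [
        max(token_similarity(token, candidate) for candidate in target_tokens)
        for token in search_tokens
    ]
    # Weighted average of the best-match scores yields the final similarity.
    return sum(w * s for w, s in zip(weights, best_matches)) / sum(weights)


print(fuzzy_score("fuzzy token search", "searching for tokens with some fuzz"))
````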
The operations of token-distance are summarized in the chart below, which illustrates the step-by-step process from
tokenization to the calculation of similarity scores.
![Schema of the token-distance matching process](source/_static/schema.png)
## Installation
Installation of token-distance is straightforward using pip, the Python package installer. This method ensures that the
library and its dependencies are correctly configured. Ensure you have Python and pip installed on your system before
proceeding.
````shell
pip install token-distance
````
## Usage
token-distance is flexible, functioning both as a command-line tool and as a library for integration into your software.
### Console
To compare two text files for token similarity, use the following command:
````shell
token_distance_compare <path_to_token_file> <path_to_search_target_file>
````
For more complex tokenization, such as splitting text on whitespace, commas, or periods, you can use regular expressions:
````shell
token_distance_compare <path_to_token_file> <path_to_search_target_file> \
--tokenize-by "[\s,\.]" --regex 1
````
This command will tokenize the input texts at spaces, commas, and periods, enhancing the flexibility of the search.
### As Library
token-distance can also be configured programmatically to suit specific needs, such as integrating custom similarity
algorithms. Here's how you can set up a token distance calculation function using a configuration object:
````python
from collections.abc import Callable
from token_distance import from_config, Config
calculate_distance: Callable[[str, str], float] = from_config(Config(mean='geometric'))
````
This configuration uses a geometric mean to compute the similarity score between tokens, which is useful for certain
types of textual analysis.
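The resulting callable can then be applied directly to two strings. The argument order below, search tokens first
and search target second, is an assumption that mirrors the console example above:
````python
from token_distance import from_config, Config

calculate_distance = from_config(Config(mean='geometric'))

# Assumed argument order: search tokens first, then the search target.
score: float = calculate_distance('fuzzy token search', 'searching tokens with fuzz')
print(score)  # a float similarity score; higher means more similar
````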
token-distance can also report the actual token matches themselves, if those are of interest:
````python
from collections.abc import Callable, Collection
from token_distance import match_from_config, MatchConfig, RecordingToken
get_best_matches: Callable[[str, str], Collection[RecordingToken]] = match_from_config(MatchConfig())
````
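The fields of RecordingToken are not documented here, so the sketch below simply iterates over the returned
collection and prints each match; it assumes the same argument order as in the previous example:
````python
from token_distance import match_from_config, MatchConfig

get_best_matches = match_from_config(MatchConfig())

# Assumed argument order: search tokens first, then the search target.
matches = get_best_matches('fuzzy token search', 'searching tokens with fuzz')
for match in matches:
    print(match)  # each element is a RecordingToken; its fields are not documented here
````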
Raw data
{
"_id": null,
"home_page": "https://gitlab.com/patrick.daniel.gress/token-distance",
"name": "token-distance",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": "fuzzy match, search",
"author": "voidpointercast",
"author_email": "voidpointercast@justmail.de",
"download_url": null,
"platform": null,
"description": "token-distance is a versatile library designed for fuzzy token searching within texts. It can be used as a standalone\ncommand line tool or integrated into other software as a library. This tool is particularly useful for applications in\ndata mining, natural language processing, and information retrieval where matching based on exact tokens is\ninsufficient.\n\nThe process begins by tokenizing the input texts, typically using whitespace, though regular expressions and custom\nfunctions can also be employed. Following tokenization, each token from the search query is assigned a weight that\nreflects its importance, which could depend on factors like token length or predefined criteria.\n\nFor each search token, token-distance identifies the most similar token in the target text based on these weights. The\ncore of the library's functionality lies in how it calculates similarity: it pairs each search token with the best\nmatching token in the target text and computes a weighted average of these pairings to produce a final similarity score.\n\nThe operations of token-distance are summarized in the chart below, which illustrates the step-by-step process from\ntokenization to the calculation of similarity scores. \n\n\n![](source/_static/schema.png)\n\n\n## Installation\n\nInstallation of token-distance is straightforward using pip, the Python package installer. This method ensures that the\nlibrary and its dependencies are correctly configured. Ensure you have Python and pip installed on your system before\nproceeding.\n\n````shell\npip install token-distance\n````\n\n\n## Usage\n\ntoken-distance is flexible, functioning both as a command-line tool and as a library for integration into your software.\n\n### Console\n\nTo compare two text files for token similarity, use the following command:\n\n\n````shell\n token_distance_compare <path_to_token_file> <path_to_search_target_file>\n````\n\nFor more complex tokenization, such as splitting text by commas or exclamation marks, you can use regular expressions:\n\n````shell\n token_distance_compare <path_to_token_file> <path_to_search_target_file> \\\n --tokenize-by \"[\\s,\\.]\" --regex 1\n````\n\nThis command will tokenize the input texts at spaces, commas, and periods, enhancing the flexibility of the search.\n\n### As Library\n\ntoken-distance can also be configured programmatically to suit specific needs, such as integrating custom similarity\nalgorithms. Here's how you can set up a token distance calculation function using a configuration object:\n\n````python\nfrom collections.abc import Callable\nfrom token_distance import from_config, Config\n\ncalculate_distance: Callable[[str, str], float] = from_config(Config(mean='geometric'))\n````\n\nThis configuration uses a geometric mean to compute the similarity score between tokens, which is useful for certain\ntypes of textual analysis.\n\ntoken-distance can also obtain information about the actual matching of the tokens, if those are of interest:\n\n````python\nfrom collections.abc import Callable, Collection\nfrom token_distance import match_from_config, MatchConfig, RecordingToken\n\nget_best_matches: Callable[[str, str], Collection[RecordingToken]] = match_from_config(MatchConfig())\n````\n\n\n\n",
"bugtrack_url": null,
"license": "BSD",
"summary": "Python library designed to perform fuzzy token matching within text documents. Utilizing advanced algorithms, this tool allows developers and data scientists to search and compare tokens based on flexible criteria, beyond exact matches. The library supports tokenization through whitespace, regular expressions, or custom functions, and provides weighted comparisons for nuanced analysis.",
"version": "0.2.3",
"project_urls": {
"Documentation": "https://token-distance.readthedocs.io/en/latest/",
"Homepage": "https://gitlab.com/patrick.daniel.gress/token-distance",
"Repository": "https://gitlab.com/patrick.daniel.gress/token-distance"
},
"split_keywords": [
"fuzzy match",
" search"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9362c4cbab4893615b229d2062bf94beb472f6a48130846d53fd1a3ae7bdbe60",
"md5": "bf463519da6cb804c2629c132b910f1a",
"sha256": "4d335ebaa96d013b85f697aa5143b1d349ef0e36e8e83e5f4d6bd3a6aefb889e"
},
"downloads": -1,
"filename": "token_distance-0.2.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bf463519da6cb804c2629c132b910f1a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 17738,
"upload_time": "2024-07-29T21:13:54",
"upload_time_iso_8601": "2024-07-29T21:13:54.000824Z",
"url": "https://files.pythonhosted.org/packages/93/62/c4cbab4893615b229d2062bf94beb472f6a48130846d53fd1a3ae7bdbe60/token_distance-0.2.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-29 21:13:54",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "patrick.daniel.gress",
"gitlab_project": "token-distance",
"lcname": "token-distance"
}