* Name: stringpairfinder
* Version: 1.0.0
* Home page: https://github.com/AntoinePinto/string-pair-finder
* Summary: Package designed to match strings by similarity
* Upload time: 2024-02-04 20:08:35
* Author: Antoine PINTO
* Requires Python: >=3.7
* License: MIT
* Keywords: string, string matching, algorithm, similarity

# StringPairFinder

StringPairFinder is a Python package designed to simplify the process of finding similarities between strings.

<p align="center">
  <img src="https://github.com/AntoinePinto/string-pair-finder/blob/master/img/problematic.png?raw=true" alt="drawing" width="400"/>
</p>

## Evaluation

Evaluation on a dataset of 1000 observations (see `notebooks/1_evaluation.ipynb`):
* FuzzyWuzzy algorithm: 85.2% success rate
* StringPairFinder algorithm: 94.0% success rate
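
As a rough illustration of how such a figure can be computed, the sketch below assumes that "success rate" means the share of source strings mapped to their expected target. The notebook itself is not reproduced here; `success_rate` and the sample dictionaries are hypothetical names and made-up data.

```python
def success_rate(predicted: dict, expected: dict) -> float:
    """Share of expected pairs that the matcher reproduced exactly.

    Assumption: a prediction counts as a success only when it equals
    the expected target string.
    """
    if not expected:
        return 0.0
    hits = sum(
        1 for source, target in expected.items()
        if predicted.get(source) == target
    )
    return hits / len(expected)


# Made-up example: two of the three predictions are correct.
expected = {"Naples": "Napoli", "Munich": "Munchen", "Warsaw": "Warszawa"}
predicted = {"Naples": "Napoli", "Munich": "Munchen", "Warsaw": "Napoli"}

print(f"{success_rate(predicted, expected):.1%}")  # 66.7%
```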

## Installation

```bash
pip install stringpairfinder
```

## Usage

### Computing String Similarity

```python
import stringpairfinder as spf

spf.get_similarity("Munich", "Munchen")
```

```python
>> 0.23809523809523808
```

### Finding the Nearest String

```python
spf.get_nearest_string(
    string="Naples",
    string_list=["Munchen", "Napoli", "Warszawa"]
    )
```

```python
>> 'Napoli'
```

### Mapping Strings to Their Nearest Counterparts

```python
spf.match_strings(
    source_strings=["Naples", "Munich", "Warsaw"],
    target_strings=["Munchen", "Napoli", "Warszawa"]
    )
```

```python
>> {'Naples': 'Napoli',
    'Munich': 'Munchen',
    'Warsaw': 'Warszawa'}
```


## Examples of use

*   **Encoding variables in datasets before a merge**: Datasets from different sources often need to be merged even though the variables that identify records are not coded in the same way. Using StringPairFinder to link these variables before the merge makes the join easier.

*   **Detection of duplicates in databases**: StringPairFinder can be used to detect duplicates in databases by matching and recoding strings that are mistakenly encoded differently.

*   **Matching names and email addresses**: StringPairFinder can be used to link names and email addresses in databases. This can be useful for contact management or mass emailing.

*   **Searching for product similarity in online catalogs**: StringPairFinder can be used to link similar products in online catalogs. This can be used for tasks such as product recommendation or similar product search.
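
To make the first use case concrete, here is a minimal sketch of recoding keys before a merge. It hard-codes a mapping of the same shape as the `match_strings` result shown in the Usage section, so it runs without the package installed. The two datasets, their field names and their values are invented for illustration.

```python
# Mapping of the shape returned by match_strings in the Usage section
# (hard-coded here so the snippet runs without the package).
mapping = {"Naples": "Napoli", "Munich": "Munchen", "Warsaw": "Warszawa"}

# Two invented datasets that spell the city key differently.
orders = [
    {"city": "Naples", "orders": 12},
    {"city": "Munich", "orders": 30},
    {"city": "Warsaw", "orders": 7},
]
stores = [
    {"city": "Napoli", "manager": "Rossi"},
    {"city": "Munchen", "manager": "Huber"},
    {"city": "Warszawa", "manager": "Nowak"},
]

# Index the second dataset by its own spelling of the key.
stores_by_city = {row["city"]: row for row in stores}

# Recode the key of the first dataset, then join on the recoded key.
merged = []
for row in orders:
    local_name = mapping[row["city"]]
    merged.append({
        **row,
        "city_local": local_name,
        "manager": stores_by_city[local_name]["manager"],
    })

for row in merged:
    print(row)
```

In practice the mapping would come from the matcher rather than being typed by hand, and it is worth reviewing low-scoring pairs before trusting the join.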

## What is the algorithm?

The similarity search between two strings consists of a matrix comparison of each character in those strings. Let's assume we want to compare the strings "Munich" and "Bayern Munich".

1. The first step is to construct a table $T$ containing the first string in the column and the second in the row. The value of a cell is 1 if the character in the row is the same as the one in the column, and 0 otherwise.

<p align="center">
  <img src="https://github.com/AntoinePinto/string-pair-finder/blob/master/img/step1.png?raw=true" alt="drawing" width="300"/>
</p>

2. The second step rewards runs of consecutive matching characters. For each row $i$ and column $j$, if the characters match and the diagonal neighbour $T[i-1, j-1] > 0$, then $T[i, j]$ is set to twice the value of $T[i-1, j-1]$.

<p align="center">
  <img src="https://github.com/AntoinePinto/string-pair-finder/blob/master/img/step2.png?raw=true" alt="drawing" width="300"/>
</p>

3. The third step is simply to calculate the similarity score, equal to the sum of all the cells in $T$ divided by the size of the table.

$$ Score = \frac{\sum_{i=1}^{n_{row}}\sum_{j=1}^{n_{col}} T_{i,j}}{n_{row} * n_{col}}  = \frac{1+1+2+4+8+16+32}{78} \approx 0.82 $$

In this example, the cells sum to 64 and the table has 6 × 13 = 78 cells, so we obtain a similarity score of about 0.82.
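
The three steps can be sketched in plain Python as follows. This is an illustration written from the description above, not the package's source code, and its output may differ from what `spf.get_similarity` returns. The names `build_weighted_table` and `similarity_sketch` are hypothetical. Two assumptions are made: rows follow the first string and columns the second, and the doubling in step 2 applies only to cells whose characters match, which is what the sum in the worked example implies.

```python
from typing import List


def build_weighted_table(first: str, second: str) -> List[List[int]]:
    """Steps 1 and 2: match table with doubled values along diagonal runs."""
    table: List[List[int]] = []
    for i, a in enumerate(first):
        row = []
        for j, b in enumerate(second):
            if a != b:
                row.append(0)  # step 1: characters differ
            elif i > 0 and j > 0 and table[i - 1][j - 1] > 0:
                row.append(2 * table[i - 1][j - 1])  # step 2: run continues
            else:
                row.append(1)  # step 1: match that starts a new run
        table.append(row)
    return table


def similarity_sketch(first: str, second: str) -> float:
    """Step 3: sum of all cells divided by the number of cells."""
    if not first or not second:
        return 0.0  # assumption: empty input scores zero
    table = build_weighted_table(first, second)
    return sum(map(sum, table)) / (len(first) * len(second))


table = build_weighted_table("Munich", "Bayern Munich")
print(sum(map(sum, table)))                                    # 64
print(round(similarity_sketch("Munich", "Bayern Munich"), 2))  # 0.82
```

Because each run doubles at every step, one long common substring dominates the score, while scattered single-character matches contribute only 1 each.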

To pair the strings of two lists, StringPairFinder calculates the similarity score of every (list 1, list 2) combination and associates each string in list 1 with the string in list 2 that has the highest similarity score.
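
The pairing logic itself is independent of the scoring function. The sketch below shows it with a stand-in scorer from the standard library, `difflib.SequenceMatcher`, which is not the StringPairFinder algorithm. The function `match_strings_sketch` and its `score` parameter are hypothetical names; any function taking two strings and returning a number could be passed instead.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def difflib_score(a: str, b: str) -> float:
    """Stand-in similarity (NOT the StringPairFinder score)."""
    return SequenceMatcher(None, a, b).ratio()


def match_strings_sketch(
    source_strings: List[str],
    target_strings: List[str],
    score: Callable[[str, str], float] = difflib_score,
) -> Dict[str, str]:
    """Map each source string to its highest-scoring target string.

    Every (source, target) combination is scored. On ties, max() keeps
    the first target in list order.
    """
    return {
        source: max(target_strings, key=lambda target: score(source, target))
        for source in source_strings
    }


result = match_strings_sketch(
    ["Naples", "Munich", "Warsaw"],
    ["Munchen", "Napoli", "Warszawa"],
)
print(result)
```

Scoring every combination costs one comparison per (source, target) pair, so the running time grows with the product of the two list lengths.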

            

        