# Merges two DataFrames using fuzzy matching on specified columns
## Tested against Windows / Python 3.11 / Anaconda
## pip install a-pandas-ex-fuzzymerge
This function performs a fuzzy match between two DataFrames, `df1` and `df2`,
based on the columns specified in `right_on` and `left_on`. Fuzzy matching finds
similar, not just identical, values in these columns, which makes it useful for
matching data with small variations such as typos or abbreviations.

Parameters:

- `df1` (DataFrame): The first DataFrame to be merged.
- `df2` (DataFrame): The second DataFrame to be merged.
- `right_on` (str): The column name in `df2` to be used for matching.
- `left_on` (str): The column name in `df1` to be used for matching.
- `usedtype` (numpy.dtype, optional): The data type to use for the distance matrix. Defaults to `np.uint8`.
- `scorer` (function, optional): The scoring function to use for fuzzy matching. Defaults to `fuzz.WRatio`.
- `concat_value` (bool, optional): Whether to add a `concat_value` column, containing the similarity scores, to the result DataFrame. Defaults to `True`.
- `**kwargs`: Additional keyword arguments to pass to the `pandas.merge` function.

Returns:

- DataFrame: A merged DataFrame with the rows that matched based on the specified fuzzy criteria.

Example:

```python
from a_pandas_ex_fuzzymerge import pd_add_fuzzymerge
import pandas as pd
import numpy as np
from rapidfuzz import fuzz

# Make d_fuzzy_merge available on pandas DataFrames.
pd_add_fuzzymerge()

df1 = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
df2 = df1.copy()

# Triple both frames and append a random number to every name, so the two
# "Name" columns are similar but no longer identical.
df2 = pd.concat([df2 for x in range(3)], ignore_index=True)
df2.Name = df2.Name + np.random.uniform(1, 2000, len(df2)).astype("U")
df1 = pd.concat([df1 for x in range(3)], ignore_index=True)
df1.Name = df1.Name + np.random.uniform(1, 2000, len(df1)).astype("U")

df3 = df1.d_fuzzy_merge(
    df2,
    right_on="Name",
    left_on="Name",
    usedtype=np.uint8,
    scorer=fuzz.partial_ratio,
    concat_value=True,
)
print(df3)
```
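
The matching idea itself can be illustrated without the package. The following is a minimal, standard-library-only sketch and not the package's implementation: it swaps in `difflib.SequenceMatcher` for a rapidfuzz scorer, and the helper names `similarity` and `fuzzy_best_matches` are hypothetical. It scores every left value against every right value on a 0 to 100 scale (a range that fits in `np.uint8`) and keeps the best-scoring partner for each left value.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stdlib stand-in for a rapidfuzz scorer)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def fuzzy_best_matches(left, right, scorer=similarity):
    """For each string in `left`, find the best-scoring string in `right`.

    Returns a list of (left_value, right_value, score) tuples.
    """
    matches = []
    for l_val in left:
        # One row of the pairwise score matrix: l_val against every right value.
        row = [scorer(l_val, r_val) for r_val in right]
        # Index of the highest score in this row.
        best = max(range(len(row)), key=row.__getitem__)
        matches.append((l_val, right[best], row[best]))
    return matches


# Names with small variations (missing punctuation, a typo).
left_names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]
right_names = [
    "Heikkinen, Miss Laina",
    "Braund, Mr Owen Haris",
    "Allen, Mr. William Henry",
]

for l_val, r_val, score in fuzzy_best_matches(left_names, right_names):
    print(f"{l_val!r} -> {r_val!r} (score {score})")
```

Because every left value is compared with every right value, the work and the size of the score matrix grow with `len(left) * len(right)`, which is one reason a compact integer dtype for the scores is attractive.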
## Raw data
{
"_id": null,
"home_page": "https://github.com/hansalemaos/a_pandas_ex_fuzzymerge",
"name": "a-pandas-ex-fuzzymerge",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "merge,dataframe,fuzzy,rapidfuzz",
"author": "Johannes Fischer",
"author_email": "aulasparticularesdealemaosp@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/6a/a4/6a4f9217e0a30abfb127478c19539b66cff1d4aea6d7170323035ae59be0/a_pandas_ex_fuzzymerge-0.10.tar.gz",
"platform": null,
"description": "\r\n# Merges two DataFrames using fuzzy matching on specified columns\r\n\r\n## Tested against Windows / Python 3.11 / Anaconda\r\n\r\n## pip install a-pandas-ex-fuzzymerge\r\n\r\n```python\r\n\r\n\r\nThis function performs a fuzzy matching between two DataFrames `df1` and `df2`\r\nbased on the columns specified in `right_on` and `left_on`. Fuzzy matching allows\r\nyou to find similar values between these columns, making it useful for matching\r\ndata with small variations, such as typos or abbreviations.\r\n\r\nParameters:\r\ndf1 (DataFrame): The first DataFrame to be merged.\r\ndf2 (DataFrame): The second DataFrame to be merged.\r\nright_on (str): The column name in `df2` to be used for matching.\r\nleft_on (str): The column name in `df1` to be used for matching.\r\nusedtype (numpy.dtype, optional): The data type to use for the distance matrix.\r\n\tDefaults to `np.uint8`.\r\nscorer (function, optional): The scoring function to use for fuzzy matching.\r\n\tDefaults to `fuzz.WRatio`.\r\nconcat_value (bool, optional): Whether to add a 'concat_value' column in the result DataFrame,\r\n\tcontaining the similarity scores. 
Defaults to `True`.\r\n**kwargs: Additional keyword arguments to pass to the `pandas.merge` function.\r\n\r\nReturns:\r\nDataFrame: A merged DataFrame with rows that matched based on the specified fuzzy criteria.\r\n\r\nExample:\r\n\tfrom a_pandas_ex_fuzzymerge import pd_add_fuzzymerge\r\n\timport pandas as pd\r\n\timport numpy as np\r\n\tfrom rapidfuzz import fuzz\r\n\tpd_add_fuzzymerge()\r\n\tdf1 = pd.read_csv(\r\n\t\t\"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv\"\r\n\t)\r\n\tdf2 = df1.copy()\r\n\tdf2 = pd.concat([df2 for x in range(3)], ignore_index=True)\r\n\tdf2.Name = (df2.Name + np.random.uniform(1, 2000, len(df2)).astype(\"U\"))\r\n\tdf1 = pd.concat([df1 for x in range(3)], ignore_index=True)\r\n\tdf1.Name = (df1.Name + np.random.uniform(1, 2000, len(df1)).astype(\"U\"))\r\n\r\n\tdf3 = df1.d_fuzzy_merge(df2, right_on='Name', left_on='Name', usedtype=np.uint8, scorer=fuzz.partial_ratio,\r\n\t\t\t\t\t\t\tconcat_value=True)\r\n\tprint(df3)\r\n\r\n```\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Merges two DataFrames using fuzzy matching on specified columns",
"version": "0.10",
"project_urls": {
"Homepage": "https://github.com/hansalemaos/a_pandas_ex_fuzzymerge"
},
"split_keywords": [
"merge",
"dataframe",
"fuzzy",
"rapidfuzz"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7e4a42e8db0a2db08ab7751bf2ae8eed8f3dee494267572b959d22f5f1ad1e96",
"md5": "144bc03787efce8f448807d70b2d6d1f",
"sha256": "5701d08ce76cc3a0668f9e1c3c622a2def97671e8dd5cd9df38dce3ecbe10601"
},
"downloads": -1,
"filename": "a_pandas_ex_fuzzymerge-0.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "144bc03787efce8f448807d70b2d6d1f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 23499,
"upload_time": "2023-10-05T10:56:10",
"upload_time_iso_8601": "2023-10-05T10:56:10.379841Z",
"url": "https://files.pythonhosted.org/packages/7e/4a/42e8db0a2db08ab7751bf2ae8eed8f3dee494267572b959d22f5f1ad1e96/a_pandas_ex_fuzzymerge-0.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6aa46a4f9217e0a30abfb127478c19539b66cff1d4aea6d7170323035ae59be0",
"md5": "3cad1120edd1697734a8e912561f7b1c",
"sha256": "757b1d8511570adc1be41c3732f9b93e895e318de93a2af6c12c9d148d791a16"
},
"downloads": -1,
"filename": "a_pandas_ex_fuzzymerge-0.10.tar.gz",
"has_sig": false,
"md5_digest": "3cad1120edd1697734a8e912561f7b1c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 22279,
"upload_time": "2023-10-05T10:56:12",
"upload_time_iso_8601": "2023-10-05T10:56:12.763318Z",
"url": "https://files.pythonhosted.org/packages/6a/a4/6a4f9217e0a30abfb127478c19539b66cff1d4aea6d7170323035ae59be0/a_pandas_ex_fuzzymerge-0.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-05 10:56:12",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hansalemaos",
"github_project": "a_pandas_ex_fuzzymerge",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numexpr",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "rapidfuzz",
"specs": []
}
],
"lcname": "a-pandas-ex-fuzzymerge"
}