fuzzymerge-parallel


- Name: fuzzymerge-parallel
- Version: 1.0.0
- Home page: https://github.com/ULHPC/fuzzymerge_parallel
- Summary: FuzzyMergeParallel is a Python package that enables efficient fuzzy merging of two dataframes based on string columns. With FuzzyMergeParallel, users can easily merge datasets, benefitting from enhanced performance through parallel computing with multiprocessing and Dask.
- Upload time: 2023-12-06 11:06:46
- Author: Oscar J. Castro-Lopez
- Requires Python: >=3.6
- License: MIT license
- Keywords: fuzzymerge_parallel

# fuzzymerge_parallel

[![Python package](https://github.com/ULHPC/fuzzymerge_parallel/actions/workflows/python-package.yml/badge.svg)](https://github.com/ULHPC/fuzzymerge_parallel/actions/workflows/python-package.yml)

Merge two pandas DataFrames on string columns by fuzzy matching. A similarity function based on edit distance (Levenshtein distance) scores candidate pairs, and the work is parallelized with multiprocessing on a single node or with Dask across multiple nodes.
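
To make the edit-distance idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance and a normalized similarity score derived from it. The function names and the normalization by the longer string are illustrative assumptions; the package itself defaults to `Levenshtein.ratio()`, which may normalize differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
print(similarity("Luxembourg", "Luxemburg"))
```

Two strings are treated as a match when their score reaches a chosen threshold, so small spelling differences still join rows that an exact merge would miss.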

**Efficient Matching and Merging**

Matching and merging data can be demanding in terms of time and memory usage. That's why our package is specifically crafted to address these challenges effectively.

**Optimized Execution**

To boost execution performance, we've fine-tuned our package to work seamlessly in both single-node and multi-node environments. Whether you're processing data on a local machine or distributing tasks across a cluster, our package is optimized to get the job done efficiently.

**Smart Memory Management**

Memory efficiency is a priority. Our algorithm estimates memory requirements and divides the workload into manageable batches. This ensures that your data operations fit comfortably within your available memory. Plus, you have the flexibility to customize these settings to match your specific needs.

With this package, you can confidently tackle data matching and merging tasks while optimizing both time and memory resources.
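
As a rough illustration of the batching idea, the sketch below estimates how many batches are needed so that each block of similarity scores fits in a memory budget. The function `plan_batches`, its parameters, and the 8-bytes-per-score and 50% safety-factor figures are assumptions made for this example, not the package's actual algorithm. The available-memory figure is passed in by the caller; `psutil.virtual_memory().available` is one way to obtain it.

```python
import math


def plan_batches(n_left, n_right, available_bytes,
                 bytes_per_score=8, safety_factor=0.5):
    """Return (num_batches, rows_per_batch) for an n_left x n_right score matrix.

    Each batch holds the scores of `rows_per_batch` left rows against all
    right rows, and is sized to stay within a fraction of available memory.
    """
    if n_left == 0 or n_right == 0:
        return 1, n_left
    budget = available_bytes * safety_factor
    bytes_per_left_row = n_right * bytes_per_score
    rows_per_batch = max(1, min(n_left, int(budget // bytes_per_left_row)))
    num_batches = math.ceil(n_left / rows_per_batch)
    return num_batches, rows_per_batch


# 100,000 x 50,000 comparisons stored as 8-byte floats need about 40 GB at once.
# With 8 GiB available and half of it budgeted, the work is split into batches.
num_batches, rows_per_batch = plan_batches(100_000, 50_000, 8 * 1024**3)
print(num_batches, rows_per_batch)
```

The package's `num_batches` attribute overrides the automatic choice when you know your memory limits better than an estimate can.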

## Description

fuzzymerge_parallel offers two modes for faster execution:

### 1. Multiprocessing mode

This mode runs on a single machine and can use multiple CPU cores. It does not require Dask and is ideal for local processing tasks, relying on NumPy and Python's multiprocessing library to speed things up.
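
The general pattern behind this mode can be sketched with the standard library alone: split the left values into chunks, score each chunk against all right values in a worker process, and combine the results. This is an illustration under stated assumptions, not the package's implementation. The helpers `split`, `score_chunk`, and `parallel_matches` are hypothetical, and `difflib.SequenceMatcher` stands in for the Levenshtein ratio.

```python
from difflib import SequenceMatcher
from multiprocessing import Pool


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def score_chunk(task):
    """Worker: return every (left, right, score) pair that reaches the threshold."""
    left_chunk, right_values, threshold = task
    matches = []
    for left in left_chunk:
        for right in right_values:
            score = SequenceMatcher(None, left, right).ratio()
            if score >= threshold:
                matches.append((left, right, score))
    return matches


def parallel_matches(left_values, right_values, threshold=0.9, processes=2):
    """Score chunks of left values in separate processes and merge the results."""
    tasks = [(chunk, right_values, threshold)
             for chunk in split(left_values, processes)]
    with Pool(processes=processes) as pool:
        results = pool.map(score_chunk, tasks)
    return [match for chunk in results for match in chunk]


if __name__ == "__main__":
    left = ["jon smith", "acme corp", "globex"]
    right = ["john smith", "acme corp ltd", "initech"]
    for left_value, right_value, score in parallel_matches(left, right, 0.8):
        print(left_value, "<->", right_value, round(score, 3))
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the script.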

### 2. Dask mode

In this mode, fuzzymerge_parallel uses a Dask client, which can be configured for single-node or multi-node setups, although multiple nodes are recommended. A multi-node Dask client distributes computations across a cluster of machines, which makes this mode suitable for heavy-duty processing. Dask also automates work that would otherwise be manual, such as scheduling parallel tasks, scaling out, handling worker failures, and balancing resource use.

<span style="color:red">**Important remarks:**</span> On a single node, prefer the multiprocessing mode, which is generally faster than Dask there. Dask adds overhead (data copying, fault-tolerance mechanisms, and resource management) that brings little benefit on a single node. Reserve Dask mode for multi-node clusters.

 
### Features

- Performs fuzzy merging of dataframes based on string columns
- Uses distance functions (e.g., Levenshtein) for intelligent matching
- Applies parallel computing techniques for enhanced performance
- Integrates easily into your existing data processing pipelines

## Installation

### Install from PyPI
To download and install the fuzzymerge_parallel Python package from PyPI, run:
```bash
pip install fuzzymerge-parallel
```

### Install from GitHub
To install FuzzyMergeParallel via pip from its GitHub repository, follow these steps:

1. **Download the Package:** Begin by downloading the package from GitHub. You can use git to clone the repository to your local machine:
    ```bash
    git clone https://github.com/ULHPC/fuzzymerge_parallel.git
    ```
    
2. **Navigate to the Package Directory:** Open a terminal or command prompt and change your current directory to the downloaded package folder:
    ```bash
    cd fuzzymerge_parallel
    ```

3. **Install the Package:** Finally, use pip to install the package in "editable" mode (with the -e flag) to allow for development and updates. There are two options:

    **Option 1:** If you plan to use the package on a single node and don't need the dask and distributed dependencies, simply run:
    ```bash
    pip install -e .
    ```

    **Option 2:** If you intend to use the package in both single and multi-node environments with dask and distributed support, use the following command:
    ```bash
    pip install -e ".[dask]"
    ```


This command will install the package along with its dependencies. You can now import and use FuzzyMergeParallel in your Python projects.    
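
Because the Dask dependencies are optional, it can be handy to check what is installed before choosing an execution mode. The snippet below is a small sketch using only the standard library; the helper names and the mode labels are illustrative and are not part of the package.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


def available_modes() -> list:
    """List the execution modes the current environment can support."""
    modes = ["sequential", "multiprocessing"]
    if is_installed("dask") and is_installed("distributed"):
        modes.append("dask")
    return modes


print(available_modes())
```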

## Dependencies

To use this package, you will need to have the following dependencies installed:

- [Click](https://pypi.org/project/Click/) >= 7.0
- [dask[distributed]](https://pypi.org/project/dask/) >= 2023.5.0 (Optional: Only needed for multi-node)
- [Levenshtein](https://pypi.org/project/python-Levenshtein/) >= 0.21.0 (Optional: Only needed for multi-node)
- [nltk](https://pypi.org/project/nltk/) >= 3.8.1
- [numpy](https://pypi.org/project/numpy/) >= 1.23.5
- [pandas](https://pypi.org/project/pandas/) >= 1.5.3
- [tqdm](https://pypi.org/project/tqdm/) >= 4.65.0
- [psutil](https://pypi.org/project/psutil/) == 5.9.5
- [pytest](https://pypi.org/project/pytest/) >= 7.4.1

## Configuration

The `FuzzyMergeParallel` class is highly configurable. Its constructor takes the parameters below, and the remaining attributes can be set before running the merge:



| Parameter        | Description                                                      |
|------------------|------------------------------------------------------------------|
| left             | The left input data to be merged.                                |
| right            | The right input data to be merged.                               |
| left_on          | Column(s) in the left DataFrame to use as merge keys.            |
| right_on         | Column(s) in the right DataFrame to use as merge keys.           |

Example: create a `FuzzyMergeParallel` instance:

```python
fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
```


| Attribute        | Description                                                                                         |
|------------------|-----------------------------------------------------------------------------------------------------|
| uselower         | Whether to convert strings to lowercase before comparison. Default is True.                         |
| threshold        | The threshold value for fuzzy matching similarity. Default is 0.9.                                  |
| how              | The type of merge to be performed. Default is 'outer'.                                              |
| on               | Column(s) to merge on if not specified in left_on or right_on. Default is False.                    |
| left_index       | Whether to use the left DataFrame's index as merge key(s). Default is False.                        |
| right_index      | Whether to use the right DataFrame's index as merge key(s). Default is False.                       |
| parallel         | Whether to perform the merge operation in parallel. Default is True.                                |
| n_threads        | The number of threads to use for parallel execution. Default is 'all' (one per available core).     |
| hide_progress    | Whether to hide the progress bar during the merge operation. Default is False.                      |
| num_batches      | The number of batches into which the ratio computation is split. Default is automatic.              |
| ratio_function   | The distance ratio function. Defaults to `Levenshtein.ratio()`.                                     |
| dask_client      | A dask client object.                                                                               |

Example: set an attribute by passing its name and value to `set_parameter()`:

```python
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('threshold', 0.75)
```
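
To see how `threshold` and `uselower` shape the result, the following standalone sketch scores a few string pairs and keeps those at or above the threshold. It is an illustration only: `matched_pairs` is a hypothetical helper, and `difflib.SequenceMatcher` is used as a stand-in for the package's default `Levenshtein.ratio()`, so the exact scores will differ from the package's.

```python
from difflib import SequenceMatcher


def matched_pairs(left_values, right_values, threshold, uselower=True):
    """Return the (left, right) pairs whose similarity reaches the threshold."""
    pairs = []
    for left in left_values:
        for right in right_values:
            a, b = (left.lower(), right.lower()) if uselower else (left, right)
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                pairs.append((left, right))
    return pairs


left = ["Jon Smith", "Acme Corp"]
right = ["john smith", "acme corp ltd"]

strict = matched_pairs(left, right, threshold=0.9)   # only the near-identical pair
loose = matched_pairs(left, right, threshold=0.75)   # also accepts the longer name
case_sensitive = matched_pairs(left, right, threshold=0.9, uselower=False)

print(strict)
print(loose)
print(case_sensitive)
```

Lowering the threshold admits more distant matches, which raises recall at the cost of more false positives, so it is worth inspecting a sample of matched pairs before settling on a value.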

## Usage


### Single node 

#### Sequential execution

```python
fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('parallel', False)
# Run the merge sequentially
result = fuzzy_merger.merge()
```

#### Multiprocessing execution

```python
fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('n_threads', 64)
# Run the merge with multiprocessing
result = fuzzy_merger.merge()
```

### Multi-node (dask)

#### Local client

```python
fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')

# Create a Dask client and pass it to the merger
from dask.distributed import Client
client = Client(...)  # Connect to distributed cluster and override default
fuzzy_merger.set_parameter('parallel', True)
fuzzy_merger.set_parameter('dask_client', client)
# Run the merge with Dask
result = fuzzy_merger.merge()
```

How to create a dask client?

There are several ways to create a Dask client. The Dask project documents them in detail:

- [General dask documentation](https://docs.dask.org/en/stable/)
- [dask client documentation](https://distributed.dask.org/en/stable/client.html)
- [Dask jobqueue documentation (distributed)](https://jobqueue.dask.org/en/latest/index.html)

A couple of examples:

```python
# Launch dask on a local cluster (single node)
from dask.distributed import Client, LocalCluster
# Create a local Dask cluster
cluster = LocalCluster()
# Create a Dask client to connect to the cluster
client = Client(cluster)
```

```python
# Launch dask on a SLURM cluster
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    account="myaccount",
    cores=128,
    memory="500 GB"
)

cluster.scale(jobs=10)  # ask for 10 jobs

client = Client(cluster)
```

## Contributing

Contributions are welcome! If you encounter any issues, have suggestions, or want to contribute improvements, please submit a pull request or open an issue on the GitHub repository.


## Authors

- Oscar J. Castro Lopez (oscar.castro@uni.lu)
  - Parallel Computing & Optimisation Group (PCOG) - **University of Luxembourg**


This package is based on the levenpandas package (https://github.com/fangzhou-xie/levenpandas).

## License

This project is licensed under the MIT License.

=======
History
=======

1.0.0 (2023-12-06)
------------------

0.1.0 (2023-06-21)
------------------

* First release on PyPI.

            
