# PyDuplicate
[![PyPI Downloads](https://img.shields.io/pypi/dm/PyDuplicate.svg?label=PyPI%20downloads)](
https://pypi.org/project/PyDuplicate/)
[![Stack Overflow](https://img.shields.io/badge/stackoverflow-Ask%20questions-blue.svg)](
https://stackoverflow.com/questions/tagged/PyDuplicate)
[![Article](https://img.shields.io/badge/Article-Duplicate-Finder--blue)](
https://publishup.uni-potsdam.de/opus4-ubp/frontdoor/deliver/index/docId/48913/file/koumarelas_diss.pdf)
PyDuplicate is a Python package for detecting duplicates in a dataset.
It provides functions that search for duplicate elements, so duplicate entries can be identified and handled efficiently.
- **Source code:** https://github.com/JeanBertinR/PyDuplicate
- **Bug reports:** https://github.com/JeanBertinR/PyDuplicate/issues
- **Report a security vulnerability:** https://tidelift.com/docs/security
## Requirements
- Python 3.x
- Levenshtein
## Installation
Install the `PyDuplicate` package with `pip` by running the following command in your terminal:
```shell
pip install PyDuplicate
```
Make sure Python and pip are installed on your system before running this command.
After installation, import the package in your Python code:
```python
import PyDuplicate
```
That's all it takes to install the package and import it into your project.
## Usage
### Importing the SimilarityScorer class
`SimilarityScorer` calculates a similarity score between two sets of character strings.
Once the package is installed with pip, import the `SimilarityScorer` class in your Python script or interactive session:
```python
from PyDuplicate import SimilarityScorer
```
### Instantiating the SimilarityScorer
Create an instance of the `SimilarityScorer` class:
```python
scorer = SimilarityScorer()
```
### Calculating the Similarity Score
Use the `similarity_score` method of the `SimilarityScorer` instance to calculate the similarity score between two sets of strings:
```python
score = scorer.similarity_score(str_tuple_1, str_tuple_2)
```
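The call above leaves `str_tuple_1` and `str_tuple_2` undefined, so here is a minimal, self-contained sketch of what such inputs and a per-field score could look like. It does not use PyDuplicate itself: `sketch_similarity_score` is a hypothetical stand-in built on `difflib.SequenceMatcher` from the standard library, and the sample tuples are invented for illustration. The package lists Levenshtein as a requirement, so its actual scoring and return range may differ.

```python
from difflib import SequenceMatcher


def sketch_similarity_score(str_tuple_1, str_tuple_2):
    """Hypothetical stand-in, NOT PyDuplicate's implementation.

    Averages the case-insensitive similarity ratio of each pair of fields.
    """
    if len(str_tuple_1) != len(str_tuple_2):
        raise ValueError("both tuples must have the same number of fields")
    if not str_tuple_1:
        return 0.0
    ratios = [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in zip(str_tuple_1, str_tuple_2)
    ]
    return sum(ratios) / len(ratios)


# Invented sample records: (first name, last name, city)
str_tuple_1 = ("John", "Smith", "London")
str_tuple_2 = ("John", "Smith", "London")
str_tuple_3 = ("Jon", "Smith", "Londn")

identical = sketch_similarity_score(str_tuple_1, str_tuple_2)
near = sketch_similarity_score(str_tuple_1, str_tuple_3)
print(f"identical records: {identical:.3f}")
print(f"near-duplicate records: {near:.3f}")
```

In this sketch, identical tuples score 1.0 and tuples with small typos score slightly lower, which is the kind of signal a duplicate check can threshold on.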
### Importance and Applications
`SimilarityScorer` has applications across several domains, including:
- **Text matching:** comparing and matching textual data, such as finding duplicate entries in a database or identifying similar documents.
- **Data cleansing:** detecting and handling similar or duplicate records during preprocessing, which improves data quality.
- **Natural language processing (NLP):** using the similarity score as a feature in tasks such as text classification, information retrieval, and recommendation systems.
- **Fuzzy string matching:** tolerating slight variations and inconsistencies in the input strings.
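As an illustration of the text-matching and data-cleansing use cases, the sketch below flags likely duplicate records by scoring every pair and keeping those above a threshold. Everything in it is an assumption made for the example: `sketch_score` is a standard-library stand-in for a similarity scorer (not PyDuplicate's own method), `find_likely_duplicates` and the `0.85` threshold are hypothetical, and the records are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sketch_score(rec_a, rec_b):
    """Stand-in scorer (assumption): mean case-insensitive ratio per field."""
    ratios = [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in zip(rec_a, rec_b)
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


def find_likely_duplicates(records, threshold=0.85):
    """Return (index_a, index_b, score) for each pair scoring >= threshold."""
    pairs = []
    for (i, rec_a), (j, rec_b) in combinations(enumerate(records), 2):
        score = sketch_score(rec_a, rec_b)
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


# Invented sample data: (first name, last name, city)
records = [
    ("John", "Smith", "London"),
    ("Jon", "Smith", "Londn"),
    ("Maria", "Garcia", "Madrid"),
]

duplicates = find_likely_duplicates(records)
for i, j, score in duplicates:
    print(f"records {i} and {j} look like duplicates (score {score})")
```

Comparing every pair takes quadratic time in the number of records, so this approach only suits small datasets; the threshold would need tuning against real data.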
## Contributing
Contributions are welcome! If you have any suggestions or find any issues, please open an issue or submit a pull request.
## License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
Raw data
{
"_id": null,
"home_page": "https://github.com/JeanBertinR/PyDuplicate",
"name": "PyDuplicate",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "Duplicate detection,Efficient duplicate search",
"author": "Jean BERTIN",
"author_email": "<jeanbertin.ensam@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/5c/b0/2eef79a501dc9e5e92d7d872d4151d99c66d1ae1643384eeb2034aa7fbff/PyDuplicate-0.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL-3.0",
"summary": "Python functions for Efficient duplicate detection",
"version": "0.0.2",
"project_urls": {
"Homepage": "https://github.com/JeanBertinR/PyDuplicate"
},
"split_keywords": [
"duplicate detection",
"efficient duplicate search"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8e984d971fca312eb9c2beef3727e8f2ff43dd8cf7d87fd356f6bb5c9065ef0e",
"md5": "62c8b3eab7f47befd17712866498e4af",
"sha256": "e454ad135b3613281d6dcbe7b42428df4cd104b448acff7a06d65ea846d1da95"
},
"downloads": -1,
"filename": "PyDuplicate-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "62c8b3eab7f47befd17712866498e4af",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 16325,
"upload_time": "2023-06-26T07:45:12",
"upload_time_iso_8601": "2023-06-26T07:45:12.735062Z",
"url": "https://files.pythonhosted.org/packages/8e/98/4d971fca312eb9c2beef3727e8f2ff43dd8cf7d87fd356f6bb5c9065ef0e/PyDuplicate-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5cb02eef79a501dc9e5e92d7d872d4151d99c66d1ae1643384eeb2034aa7fbff",
"md5": "9ef38cc39dda6408491718f75fd61a8d",
"sha256": "6951d23be96f44aafc7ece109077b812c786ef0578954b35987523d39295fe2a"
},
"downloads": -1,
"filename": "PyDuplicate-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "9ef38cc39dda6408491718f75fd61a8d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 16085,
"upload_time": "2023-06-26T07:45:15",
"upload_time_iso_8601": "2023-06-26T07:45:15.369630Z",
"url": "https://files.pythonhosted.org/packages/5c/b0/2eef79a501dc9e5e92d7d872d4151d99c66d1ae1643384eeb2034aa7fbff/PyDuplicate-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-26 07:45:15",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "JeanBertinR",
"github_project": "PyDuplicate",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pyduplicate"
}