# Duplicate Text Finder
`duptextfinder` is a Python library for detecting duplicated zones in text. It is primarily meant to detect
copy/paste across medical documents. It should be faster than Python's built-in
`difflib` algorithm and more robust to whitespace, newlines and other irrelevant
characters.
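
To make the comparison with `difflib` concrete, the sketch below uses only the standard library to list the common blocks between two strings. It is an illustrative baseline and not part of `duptextfinder`; the helper name `difflib_common_blocks` and the sample sentences are made up for this example. It shows how a single newline inside a copied passage splits the match that `difflib` reports.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def difflib_common_blocks(
    source: str, target: str, min_length: int = 15
) -> List[Tuple[int, int, int]]:
    """Return (source_start, target_start, length) for each common block
    of at least min_length characters (hypothetical helper, stdlib only)."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_length
    ]


source = "Patient admitted for chest pain. No fever."
# same passage pasted verbatim into another note
same = "History: Patient admitted for chest pain. Stable."
# same passage, but a space became a newline when it was pasted
reflowed = "History: Patient admitted\nfor chest pain. Stable."

print(difflib_common_blocks(source, same))      # one block
print(difflib_common_blocks(source, reflowed))  # split into two shorter blocks
```

With a larger `min_length`, both fragments of the reflowed copy would fall under the threshold and the duplication would go unreported.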
## Installation
`duptextfinder` can be installed through pip:
```
pip install duptextfinder
```
## Usage
```python3
from pathlib import Path
from duptextfinder import CharFingerprintBuilder, DuplicateFinder
# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]
# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)
# call findDuplicates() on each file
for i, text in enumerate(texts):
    # give each document an identifier (avoid shadowing the built-in `id`)
    doc_id = f"D{i}"
    duplicates = duplicateFinder.findDuplicates(doc_id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        # slice the duplicated zone out of the current (target) text
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)
```
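
A common follow-up is to measure how much of a document is duplicated. The sketch below is one possible way to do that, assuming the spans are plain `(start, end)` pairs with an exclusive end, as in the slicing used in the usage example. The helpers `merge_spans` and `duplicated_fraction` are hypothetical and not part of `duptextfinder`.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def merge_spans(spans: Iterable[Span]) -> List[Span]:
    """Merge overlapping or touching (start, end) spans, end exclusive."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # overlaps the previous span: extend it if needed
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duplicated_fraction(text_length: int, spans: Iterable[Span]) -> float:
    """Fraction of characters covered by at least one span."""
    if text_length == 0:
        return 0.0
    covered = sum(end - start for start, end in merge_spans(spans))
    return covered / text_length


# spans as they might be collected from several duplicates of one document
example_spans = [(5, 20), (0, 10), (30, 40)]
print(merge_spans(example_spans))
print(duplicated_fraction(100, example_spans))
```

With the objects from the usage example, the spans could be collected as `[(d.targetSpan.start, d.targetSpan.end) for d in duplicates]` and the length as `len(text)`.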
`WordFingerprintBuilder` can be used instead of `CharFingerprintBuilder`. For
more details, refer to the docstrings of `DuplicateFinder`,
`CharFingerprintBuilder` and `WordFingerprintBuilder`.
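
For readers unfamiliar with fingerprinting, the sketch below illustrates the general idea with character n-grams. It is a simplified illustration under stated assumptions, not the algorithm `duptextfinder` implements: the whitespace normalization, the dictionary index, and the names `normalize` and `find_shared_ngrams` are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Collapse runs of whitespace to one space and lowercase (assumed step)."""
    return re.sub(r"\s+", " ", text).lower()


def find_shared_ngrams(source: str, target: str, n: int = 15) -> List[Tuple[int, int]]:
    """Return (source_pos, target_pos) for every n-gram of target that also
    occurs in source, using the first occurrence in source."""
    index: Dict[str, int] = {}
    for i in range(len(source) - n + 1):
        index.setdefault(source[i : i + n], i)
    hits: List[Tuple[int, int]] = []
    for j in range(len(target) - n + 1):
        i = index.get(target[j : j + n])
        if i is not None:
            hits.append((i, j))
    return hits


source = normalize("Patient admitted for chest pain. No fever.")
target = normalize("History: Patient admitted\nfor chest pain. Stable.")

hits = find_shared_ngrams(source, target)
# consecutive hits on the same diagonal belong to one duplicated zone
print(hits[0], hits[-1], len(hits))
```

Because the newline is normalized away before indexing, the copied passage yields one unbroken run of hits, which is the kind of robustness to whitespace the description above refers to.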
## How to run tests
1. Install the package in editable mode with the test and extra dependencies by running `pip install -e ".[tests, ncls, intervaltree]"` in the repo directory
2. Run `pytest tests/`
## About ncls and intervaltree
This tool can be used without any additional dependencies, but performance can
be improved when using interval trees. To benefit from this you will need to
install either the [ncls](https://github.com/biocore-ntnu/ncls) package or the
[intervaltree](https://github.com/chaimleib/intervaltree) package.
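
To check which of these optional packages is present in the current environment, a small standard-library sketch such as the following can be used. The helper `available_interval_backends` is hypothetical; how `duptextfinder` itself detects or selects a backend is not covered here.

```python
from importlib.util import find_spec
from typing import List, Sequence


def available_interval_backends(
    candidates: Sequence[str] = ("ncls", "intervaltree"),
) -> List[str]:
    """Return the candidate top-level packages that can be imported."""
    return [name for name in candidates if find_spec(name) is not None]


found = available_interval_backends()
if found:
    print("Interval tree packages found:", ", ".join(found))
else:
    print("No interval tree package found; install ncls or intervaltree.")
```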
## References
- Evaluating the Impact of Text Duplications on a Corpus of More than 600,000 Clinical Narratives in a French Hospital. https://www.hal.inserm.fr/hal-02265124/
Raw data
{
"_id": null,
"home_page": "https://github.com/equipe22/duplicatedZoneInClinicalText",
"name": "duptextfinder",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "TEXT,DUPLICATION,DUPLICATE,CLINICAL",
"author": "",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/67/66/8796a3b0156aa70584e768bd66002a219eec868c215b9c83a90e28d26c5b/duptextfinder-0.3.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Detect duplicated zones in (clinical) text",
"version": "0.3.0",
"project_urls": {
"Homepage": "https://github.com/equipe22/duplicatedZoneInClinicalText"
},
"split_keywords": [
"text",
"duplication",
"duplicate",
"clinical"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c650c45b26f67e3301efeb9cfd11b7509f18c794439cbfffe250392380f11e1d",
"md5": "5fc4d746e4823e999cc8052541f37e4e",
"sha256": "b23072f5839a71240c43e288902024ee09a4491a64eb102619e4028f0253e37d"
},
"downloads": -1,
"filename": "duptextfinder-0.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5fc4d746e4823e999cc8052541f37e4e",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 17224,
"upload_time": "2024-01-03T15:54:00",
"upload_time_iso_8601": "2024-01-03T15:54:00.808008Z",
"url": "https://files.pythonhosted.org/packages/c6/50/c45b26f67e3301efeb9cfd11b7509f18c794439cbfffe250392380f11e1d/duptextfinder-0.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "67668796a3b0156aa70584e768bd66002a219eec868c215b9c83a90e28d26c5b",
"md5": "d4e1d32863bedf4450d62748fed6fa7f",
"sha256": "a8ff3f3128bdc56157b2d09778e9ca2d73b093fea5419947418c07f83eaba08e"
},
"downloads": -1,
"filename": "duptextfinder-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "d4e1d32863bedf4450d62748fed6fa7f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 19422,
"upload_time": "2024-01-03T15:54:02",
"upload_time_iso_8601": "2024-01-03T15:54:02.541198Z",
"url": "https://files.pythonhosted.org/packages/67/66/8796a3b0156aa70584e768bd66002a219eec868c215b9c83a90e28d26c5b/duptextfinder-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-03 15:54:02",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "equipe22",
"github_project": "duplicatedZoneInClinicalText",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "duptextfinder"
}