deltatextsplitter

Name	deltatextsplitter
Version	1.1.0
home_page	https://gitlab.com/datainnovatielab/public/deltatextsplitter/
Summary	This package can efficiently measure the text structure recognition capabilities of pdftextsplitter
upload_time	2023-12-19 13:53:43
docs_url	None
author	Unit Data en Innovatie, Ministerie van Infrastructuur en Waterstaat, Netherlands
requires_python	>=3.10
license	MIT
keywords	nlp pdf text recognition structure recognition chatgpt
requirements	No requirements were recorded.

# DeltaTextsplitter package

This package provides an objective evaluation of the performance of the
[pdftextsplitter](https://pypi.org/project/pdftextsplitter/) package in terms of
key performance indicators (KPIs).
<br />
<br />
The package includes a set of test documents, each with a reference in the form of an Excel file.
These reference files are human-produced and contain the structure that the pdftextsplitter
package should have recognised in the corresponding test document. Comparing a reference file
with the actual output of pdftextsplitter makes it possible to evaluate the performance
of the package.
<br />
<br />
The performance is evaluated in terms of the following two KPIs: <br />
structure KPI = 1 - (fp + tn) / (fp + tt) <br />
where tt = true total, the total number of structure elements in the reference file;
fp = false positive, the number of structure elements that are present in the output but not in the reference file;
and tn = true negative, defined as tn = tt - tp (the reference elements that the output missed), where tp = true positive,
the number of matching structure elements between the reference file and the actual output.
Two structure elements are said to match if the fuzzy match ratio of their titles is >= 80.0
(determined with the package [thefuzz](https://pypi.org/project/thefuzz/)) and their main structure types are equal.
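
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the package's implementation: the `Element` tuple, the `is_match` and `structure_kpi` helpers and the sample titles are hypothetical, the formula is read as the quotient (fp + tn) / (fp + tt), each output element is matched to at most one reference element, and the standard-library `difflib` ratio stands in for the `thefuzz` ratio so that the snippet runs without extra dependencies.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical structure element: a title plus its main structure type."""
    title: str
    kind: str


def is_match(ref: Element, out: Element, threshold: float = 80.0) -> bool:
    """Match rule from the text: title ratio >= 80.0 and equal structure types.

    difflib's ratio (scaled to 0-100) is a stand-in for thefuzz here.
    """
    matcher = SequenceMatcher(None, ref.title.lower(), out.title.lower())
    return 100.0 * matcher.ratio() >= threshold and ref.kind == out.kind


def structure_kpi(reference: list[Element], output: list[Element]) -> float:
    """Return 1 - (fp + tn) / (fp + tt) using the symbols defined in the text."""
    unmatched = list(output)
    tp = 0
    for ref in reference:
        hit = next((out for out in unmatched if is_match(ref, out)), None)
        if hit is not None:
            unmatched.remove(hit)  # each output element is matched at most once
            tp += 1
    tt = len(reference)  # true total
    tn = tt - tp         # reference elements the output missed
    fp = len(unmatched)  # output elements without a reference counterpart
    return 1.0 - (fp + tn) / (fp + tt)


# Made-up sample data: three elements are recognised, one is missed,
# and one spurious element appears in the output.
reference = [
    Element("1 Introduction", "chapter"),
    Element("1.1 Scope", "section"),
    Element("2 Methods", "chapter"),
    Element("2.1 Data", "section"),
]
output = [
    Element("1 Introduction", "chapter"),
    Element("1.1 Scope", "section"),
    Element("2 Methods", "chapter"),
    Element("Page 3 of 12", "section"),
]
kpi = structure_kpi(reference, output)
print(f"structure KPI = {kpi:.2f}")
```

With this sample, tt = 4, tp = 3, tn = 1 and fp = 1, so the structure KPI evaluates to 1 - (1 + 1) / (1 + 4) = 0.6. A perfect result (no missed and no spurious elements) gives 1.
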
<br />
<br />
cascade KPI = 1 - uc/tp <br />
where uc = unequal cascades, the number of matched elements (a subset of tp above) whose cascade level
in the reference file differs from the cascade level in the package output.
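
As a small worked example, the cascade KPI can be computed from the cascade levels of the matched elements. The `cascade_kpi` helper, the pair representation and the level values below are hypothetical and chosen for illustration only; raising an error when there are no matched elements is likewise an assumption, since the text does not say how that case is handled.

```python
def cascade_kpi(matched_levels: list[tuple[int, int]]) -> float:
    """Return 1 - uc/tp for (reference_level, output_level) pairs of matched elements."""
    tp = len(matched_levels)
    if tp == 0:
        # Assumption: the KPI is treated as undefined without matched elements.
        raise ValueError("cascade KPI is undefined without matched elements")
    uc = sum(1 for ref_level, out_level in matched_levels if ref_level != out_level)
    return 1.0 - uc / tp


# Made-up cascade levels for five matched elements: (reference, output).
pairs = [(0, 0), (1, 1), (2, 1), (1, 1), (2, 2)]
cascade = cascade_kpi(pairs)
print(f"cascade KPI = {cascade:.2f}")
```

Here one of the five matched elements has a different cascade level, so uc = 1, tp = 5 and the cascade KPI is 1 - 1/5 = 0.8.
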
<br />
<br />
With these two KPIs, improvements to the pdftextsplitter package can be quantified
by calculating the KPIs for each released version of pdftextsplitter.

### Getting started

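The KPI calculation can be performed by entering the following commands
(the import assumes that the class carries the same name as the package, as the constructor call suggests): <br />
from deltatextsplitter import deltatextsplitter <br />
mydelta = deltatextsplitter() <br />
mydelta.FullRun()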
<br />
<br />
The KPIs will then be printed, but can also be retrieved from: <br />
mydelta.structure_kpi <br />
mydelta.cascade_kpi <br />
The name and KPIs of each test document can be retrieved from: <br />
mydelta.documentarray[index].splitter.documentname <br />
mydelta.documentarray[index].structure_kpi <br />
mydelta.documentarray[index].cascade_kpi <br />
There are 12 test documents in total.
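
The per-document attributes listed above could be collected into a small overview as follows. This is a sketch rather than part of the package: `per_document_kpis` is a hypothetical helper, and the stand-in object with made-up names and values only mimics the attribute names from the text so that the snippet runs without executing the package itself.

```python
from types import SimpleNamespace


def per_document_kpis(delta) -> list[tuple[str, float, float]]:
    """Collect (document name, structure KPI, cascade KPI) for every test document."""
    return [
        (doc.splitter.documentname, doc.structure_kpi, doc.cascade_kpi)
        for doc in delta.documentarray
    ]


# Stand-in with made-up values; in practice, pass the object on which
# FullRun() has already been called.
fake_delta = SimpleNamespace(
    documentarray=[
        SimpleNamespace(splitter=SimpleNamespace(documentname="doc_a"),
                        structure_kpi=0.90, cascade_kpi=0.95),
        SimpleNamespace(splitter=SimpleNamespace(documentname="doc_b"),
                        structure_kpi=0.75, cascade_kpi=0.80),
    ]
)

rows = per_document_kpis(fake_delta)
# List the weakest documents (lowest structure KPI) first.
for name, structure, cascade in sorted(rows, key=lambda row: row[1]):
    print(f"{name:<10} structure={structure:.2f} cascade={cascade:.2f}")
```

Sorting by the structure KPI puts the documents on which pdftextsplitter performs worst at the top of the overview.
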
<br />
<br />
The FullRun command is very CPU-intensive, as it needs to process all the test documents
with the pdftextsplitter package (in dummy mode). Once this has been done, subsequent
calculations can be sped up by entering <br />
mydelta.FullRun(False,False) <br />
which skips the pdftextsplitter execution and only redoes the KPI calculation.

Raw data

{
    "_id": null,
    "home_page": "https://gitlab.com/datainnovatielab/public/deltatextsplitter/",
    "name": "deltatextsplitter",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "",
    "keywords": "NLP,PDF,Text recognition,Structure recognition,ChatGPT",
    "author": "Unit Data en Innovatie, Ministerie van Infrastructuur en Waterstaat, Netherlands",
    "author_email": "dataloket@minienw.nl",
    "download_url": "https://files.pythonhosted.org/packages/25/12/85bb662a52095f8376e17e3bc40d7f24c723311f72ebfe39391130585694/deltatextsplitter-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "This packages can efficiently measure the text structure recognition capabilities ofn pdftextsplitter",
    "version": "1.1.0",
    "project_urls": {
        "Download": "https://gitlab.com/datainnovatielab/public/deltatextsplitter/",
        "Homepage": "https://gitlab.com/datainnovatielab/public/deltatextsplitter/"
    },
    "split_keywords": [
        "nlp",
        "pdf",
        "text recognition",
        "structure recognition",
        "chatgpt"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a6077d06616303a317d7d65624b61a6f1c8371b5ea020de1ad54f71bf48614a0",
                "md5": "984498abe14f8b7b5f4dd1e946145d9c",
                "sha256": "68bc441ebe22851984438b9b5a871c3f939a48aa9d1a3e0ef43c855d40154f59"
            },
            "downloads": -1,
            "filename": "deltatextsplitter-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "984498abe14f8b7b5f4dd1e946145d9c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 28919532,
            "upload_time": "2023-12-19T13:53:35",
            "upload_time_iso_8601": "2023-12-19T13:53:35.632587Z",
            "url": "https://files.pythonhosted.org/packages/a6/07/7d06616303a317d7d65624b61a6f1c8371b5ea020de1ad54f71bf48614a0/deltatextsplitter-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "251285bb662a52095f8376e17e3bc40d7f24c723311f72ebfe39391130585694",
                "md5": "a28fda43ffca5ff41ceaa1bbe5857908",
                "sha256": "472387d79205c5317075ed2b176300224486d2daf4dd82fb9e8b4d66aa6c3d3c"
            },
            "downloads": -1,
            "filename": "deltatextsplitter-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a28fda43ffca5ff41ceaa1bbe5857908",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 28901692,
            "upload_time": "2023-12-19T13:53:43",
            "upload_time_iso_8601": "2023-12-19T13:53:43.001491Z",
            "url": "https://files.pythonhosted.org/packages/25/12/85bb662a52095f8376e17e3bc40d7f24c723311f72ebfe39391130585694/deltatextsplitter-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-19 13:53:43",
    "github": false,
    "gitlab": true,
    "bitbucket": false,
    "codeberg": false,
    "gitlab_user": "datainnovatielab",
    "gitlab_project": "public",
    "lcname": "deltatextsplitter"
}
