ls-converter

Name	ls-converter JSON
Version	0.0.1 JSON
	download
home_page	https://github.com/Living-with-machines/label-studio-converter
Summary	Creates ready-to-use Label Studio pre-populated JSON files from popular OCR formats.
upload_time	2022-12-14 08:51:34
maintainer
docs_url	None
author	Kalle Westerling
requires_python	>=3.7,<4.0
license	MIT
keywords	computer-vision deep-learning image-annotation annotation annotations dataset image-classification labeling datasets semantic-segmentation annotation-tool text-annotation boundingbox image-labeling labeling-tool image-labelling-tool data-labeling label-studio ocr
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # ls-converter

LabelStudioConverter (or `ls_converter` for short) is a simple library to convert OCR outputs into pre-annotated data for import into [LabelStudio](https://github.com/heartexlabs/label-studio).

Currently, we can convert directly from PyTesseract, ABBYY FineReader, and Transkribus. All that is needed is an `image` (which can be a path, a public URL, or an [Image object](https://github.com/python-pillow/Pillow)) and some `input_data` (which can be a path to a JSON).

It even comes with a quick utility tool to provide the PyTesseract data if you don't have it available:

```py
from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import url_to_tesseract_data

URL = "http://<URL-TO-PUBLICLY-AVAILABLE-IMAGE>"

converter = LabelStudioConverter(input_format=Input.TESSERACT)

converted_data = converter.convert(
    image=URL,
    input_data=url_to_tesseract_data(URL),
)
```

## Installing

Installation of LabelStudioConverter is easily done using PIP:

```sh
$ pip install ls_converter
...
```

## OCR a public image URL with PyTesseract into Label Studio

In this example, we have a publicly available historical newspaper directory [from the University of Leicester‘s Special Collections](https://specialcollections.le.ac.uk/digital/collection/p16445coll4/id/52629/rec/1). We take the direct `URL` to the image and using the built-in `url_to_tesseract_data` function, can pass the image and the data straight into the `.convert` method. The next step is to save the resulting dictionary as a JSON file. There is a helpful `save_json` function built into the package as well. In effectively three lines of code, we have OCR parsed and created a file that we can input into Label Studio.

```py
from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import url_to_tesseract_data, save_json

URL = "http://specialcollections.le.ac.uk/iiif/2/p16445coll4:8897/full/730,/0/default.jpg?page=27"

converter = LabelStudioConverter(input_format=Input.TESSERACT)
converted_data = converter.convert(image=URL, input_data=url_to_tesseract_data(URL))

save_json(converted_data, "import-me-into-label-studio.json")
```

If you select “Optical Character Recognition” in the Labeling Setup of the project where you import the resulting JSON file, you should end up with something like this:

![Label Studio interface after importing PyTesseract’s resulting JSON](img/label-studio.png)

## Using a local image and ABBYY FineReader

If you instead have an image and the resulting JSON file from running it through ABBYY FineReader, you only have to adjust the import of the data thus:

```py
from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import load_json, save_json

LOCAL_IMAGE = "abbyy-output/0212_BCL8001.jpg"
LOCAL_JSON = "abbyy-output/0212_BCL8001.json"
REMOTE_IMAGE = "https://lwmincomingtradedirs.blob.core.windows.net/jpg/0212_BCL8001.jpg"

converter = LabelStudioConverter(input_format=Input.ABBYY)
converted_data = converter.convert(
    image=LOCAL_IMAGE,
    input_data=load_json(LOCAL_JSON),
    url=REMOTE_IMAGE,
)

save_json(converted_data, "import-me-into-label-studio.json")
```

_Note that, in this example, we use the convenient utility function `load_json` to load the JSON file with the ABBYY results. In this example (and any example where you have a locally stored version and a remote version), if you don’t care too much about speed, you can just pass the URL to the `image` parameter. If you are processing 100+ images, the script will run much faster if you have locally stored images._

The result should from the above example, when imported into Label Studio looks differently (since ABBYY’s result will differ from Tesseract’s) but otherwise, the result should be the same look:

![Label Studio interface after importing ABBYY FineReader’s resulting JSON](img/abbyy-result.png)

### Risk for error

Because you run the data conversion on a local file, you must specify a `url` as a parameter in the `.convert` method (as we did above). Alternatively, after you export the data, you can adjust the `"ocr"` value in the resulting JSON file _before importing it into Label Studio_. Otherwise you will see the following error message:

![Label Studio error message after importing faulty JSON](img/local-file-error.png)

## Using Transkribus results instead

Out of the box, LabelStudioConverter comes with support for the XML files created by Transkribus, as well. In this example, similar to the ABBYY example above, we provide a local image and a remote version of the same image. and use the `load_xml_as_json` utility function to read in the Transkribus XML as JSON data.

```py
from ls_converter import LabelStudioConverter, Input
from ls_converter.utils import load_xml_as_json, save_json

LOCAL_IMAGE = "transkribus-output/0219_BCL8001.jpg"
LOCAL_XML = "transkribus-output/0219_BCL8001.xml"
REMOTE_IMAGE = "https://lwmincomingtradedirs.blob.core.windows.net/jpg/0219_BCL8001.jpg"

converter = LabelStudioConverter(input_format=Input.TRANSKRIBUS)
converted_data = converter.convert(
    image=LOCAL_IMAGE,
    input_data=load_xml_as_json(LOCAL_XML),
    url=REMOTE_IMAGE,
)

save_json(converted_data, "import-me-into-label-studio.json")
```

The result, since we provide the `url` to the remote image in the example above, is similar (albeit different, due to Transkribus’s OCR and layout parsing algorithm) when viewed in Label Studio:

![Label Studio interface after importing Transkribus’s resulting JSON](img/transkribus-result.png)

## Change Log

### 0.0.1 (Dec 14, 2022)

- First alpha version

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Living-with-machines/label-studio-converter",
    "name": "ls-converter",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7,<4.0",
    "maintainer_email": "",
    "keywords": "computer-vision,deep-learning,image-annotation,annotation,annotations,dataset,image-classification,labeling,datasets,semantic-segmentation,annotation-tool,text-annotation,boundingbox,image-labeling,labeling-tool,image-labelling-tool,data-labeling,label-studio,ocr",
    "author": "Kalle Westerling",
    "author_email": "kalle.westerling@bl.uk",
    "download_url": "https://files.pythonhosted.org/packages/7f/9e/8029dae1994b11354a99a7723c15a8c622a9b9f9d61cfae34be1d9b73ee2/ls-converter-0.0.1.tar.gz",
    "platform": null,
    "description": "# ls-converter\n\nLabelStudioConverter (or `ls_converter` for short) is a simple library to convert OCR outputs into pre-annotated data for import into [LabelStudio](https://github.com/heartexlabs/label-studio).\n\nCurrently, we can convert directly from PyTesseract, ABBYY FineReader, and Transkribus. All that is needed is an `image` (which can be a path, a public URL, or an [Image object](https://github.com/python-pillow/Pillow)) and some `input_data` (which can be a path to a JSON).\n\nIt even comes with a quick utility tool to provide the PyTesseract data if you don't have it available:\n\n```py\nfrom ls_converter import LabelStudioConverter, Input\nfrom ls_converter.utils import url_to_tesseract_data\n\nURL = \"http://<URL-TO-PUBLICLY-AVAILABLE-IMAGE>\"\n\nconverter = LabelStudioConverter(input_format=Input.TESSERACT)\n\nconverted_data = converter.convert(\n    image=URL,\n    input_data=url_to_tesseract_data(URL),\n)\n```\n\n## Installing\n\nInstallation of LabelStudioConverter is easily done using PIP:\n\n```sh\n$ pip install ls_converter\n...\n```\n\n## OCR a public image URL with PyTesseract into Label Studio\n\nIn this example, we have a publicly available historical newspaper directory [from the University of Leicester\u2018s Special Collections](https://specialcollections.le.ac.uk/digital/collection/p16445coll4/id/52629/rec/1). We take the direct `URL` to the image and using the built-in `url_to_tesseract_data` function, can pass the image and the data straight into the `.convert` method. The next step is to save the resulting dictionary as a JSON file. There is a helpful `save_json` function built into the package as well. In effectively three lines of code, we have OCR parsed and created a file that we can input into Label Studio.\n\n```py\nfrom ls_converter import LabelStudioConverter, Input\nfrom ls_converter.utils import url_to_tesseract_data, save_json\n\nURL = \"http://specialcollections.le.ac.uk/iiif/2/p16445coll4:8897/full/730,/0/default.jpg?page=27\"\n\nconverter = LabelStudioConverter(input_format=Input.TESSERACT)\nconverted_data = converter.convert(image=URL, input_data=url_to_tesseract_data(URL))\n\nsave_json(converted_data, \"import-me-into-label-studio.json\")\n```\n\nIf you select \u201cOptical Character Recognition\u201d in the Labeling Setup of the project where you import the resulting JSON file, you should end up with something like this:\n\n![Label Studio interface after importing PyTesseract\u2019s resulting JSON](img/label-studio.png)\n\n## Using a local image and ABBYY FineReader\n\nIf you instead have an image and the resulting JSON file from running it through ABBYY FineReader, you only have to adjust the import of the data thus:\n\n```py\nfrom ls_converter import LabelStudioConverter, Input\nfrom ls_converter.utils import load_json, save_json\n\nLOCAL_IMAGE = \"abbyy-output/0212_BCL8001.jpg\"\nLOCAL_JSON = \"abbyy-output/0212_BCL8001.json\"\nREMOTE_IMAGE = \"https://lwmincomingtradedirs.blob.core.windows.net/jpg/0212_BCL8001.jpg\"\n\nconverter = LabelStudioConverter(input_format=Input.ABBYY)\nconverted_data = converter.convert(\n    image=LOCAL_IMAGE,\n    input_data=load_json(LOCAL_JSON),\n    url=REMOTE_IMAGE,\n)\n\nsave_json(converted_data, \"import-me-into-label-studio.json\")\n```\n\n_Note that, in this example, we use the convenient utility function `load_json` to load the JSON file with the ABBYY results. In this example (and any example where you have a locally stored version and a remote version), if you don\u2019t care too much about speed, you can just pass the URL to the `image` parameter. If you are processing 100+ images, the script will run much faster if you have locally stored images._\n\nThe result should from the above example, when imported into Label Studio looks differently (since ABBYY\u2019s result will differ from Tesseract\u2019s) but otherwise, the result should be the same look:\n\n![Label Studio interface after importing ABBYY FineReader\u2019s resulting JSON](img/abbyy-result.png)\n\n### Risk for error\n\nBecause you run the data conversion on a local file, you must specify a `url` as a parameter in the `.convert` method (as we did above). Alternatively, after you export the data, you can adjust the `\"ocr\"` value in the resulting JSON file _before importing it into Label Studio_. Otherwise you will see the following error message:\n\n![Label Studio error message after importing faulty JSON](img/local-file-error.png)\n\n## Using Transkribus results instead\n\nOut of the box, LabelStudioConverter comes with support for the XML files created by Transkribus, as well. In this example, similar to the ABBYY example above, we provide a local image and a remote version of the same image. and use the `load_xml_as_json` utility function to read in the Transkribus XML as JSON data.\n\n```py\nfrom ls_converter import LabelStudioConverter, Input\nfrom ls_converter.utils import load_xml_as_json, save_json\n\nLOCAL_IMAGE = \"transkribus-output/0219_BCL8001.jpg\"\nLOCAL_XML = \"transkribus-output/0219_BCL8001.xml\"\nREMOTE_IMAGE = \"https://lwmincomingtradedirs.blob.core.windows.net/jpg/0219_BCL8001.jpg\"\n\nconverter = LabelStudioConverter(input_format=Input.TRANSKRIBUS)\nconverted_data = converter.convert(\n    image=LOCAL_IMAGE,\n    input_data=load_xml_as_json(LOCAL_XML),\n    url=REMOTE_IMAGE,\n)\n\nsave_json(converted_data, \"import-me-into-label-studio.json\")\n```\n\nThe result, since we provide the `url` to the remote image in the example above, is similar (albeit different, due to Transkribus\u2019s OCR and layout parsing algorithm) when viewed in Label Studio:\n\n![Label Studio interface after importing Transkribus\u2019s resulting JSON](img/transkribus-result.png)\n\n## Change Log\n\n### 0.0.1 (Dec 14, 2022)\n\n- First alpha version\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Creates ready-to-use Label Studio pre-populated JSON files from popular OCR formats.",
    "version": "0.0.1",
    "split_keywords": [
        "computer-vision",
        "deep-learning",
        "image-annotation",
        "annotation",
        "annotations",
        "dataset",
        "image-classification",
        "labeling",
        "datasets",
        "semantic-segmentation",
        "annotation-tool",
        "text-annotation",
        "boundingbox",
        "image-labeling",
        "labeling-tool",
        "image-labelling-tool",
        "data-labeling",
        "label-studio",
        "ocr"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "f966ce1cdc9af2aa68c82a5f4a5ae712",
                "sha256": "26f3cae1f2c0bfd51f367bfb8623653c576d0dac832a383d6753b4cf8124be29"
            },
            "downloads": -1,
            "filename": "ls_converter-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f966ce1cdc9af2aa68c82a5f4a5ae712",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7,<4.0",
            "size": 9489,
            "upload_time": "2022-12-14T08:51:36",
            "upload_time_iso_8601": "2022-12-14T08:51:36.350034Z",
            "url": "https://files.pythonhosted.org/packages/1d/a0/cb6270faa7046357741173e558b8552eebd6bb40143c9226dcbff3787679/ls_converter-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "a561067d8802a7574ef60c55016090d8",
                "sha256": "85a155fd867704166116821dd8bb975953e63e967c4690527a5a18e9a8a1e4be"
            },
            "downloads": -1,
            "filename": "ls-converter-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "a561067d8802a7574ef60c55016090d8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7,<4.0",
            "size": 10763,
            "upload_time": "2022-12-14T08:51:34",
            "upload_time_iso_8601": "2022-12-14T08:51:34.553758Z",
            "url": "https://files.pythonhosted.org/packages/7f/9e/8029dae1994b11354a99a7723c15a8c622a9b9f9d61cfae34be1d9b73ee2/ls-converter-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-14 08:51:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "Living-with-machines",
    "github_project": "label-studio-converter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "tox": true,
    "lcname": "ls-converter"
}

Kalle Westerling