================================================================
hespi
================================================================
.. image:: https://raw.githubusercontent.com/rbturnbull/hespi/main/docs/images/hespi-banner.svg

.. start-badges

|pypi badge| |testing badge| |coverage badge| |docs badge| |black badge|

.. |pypi badge| image:: https://img.shields.io/pypi/v/hespi
    :target: https://pypi.org/project/hespi/

.. |testing badge| image:: https://github.com/rbturnbull/hespi/actions/workflows/testing.yml/badge.svg
    :target: https://github.com/rbturnbull/hespi/actions

.. |docs badge| image:: https://github.com/rbturnbull/hespi/actions/workflows/docs.yml/badge.svg
    :target: https://rbturnbull.github.io/hespi

.. |black badge| image:: https://img.shields.io/badge/code%20style-black-000000.svg
    :target: https://github.com/psf/black

.. |coverage badge| image:: https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/rbturnbull/f31036b00473b6d0af3a160ea681903b/raw/coverage-badge.json
    :target: https://rbturnbull.github.io/hespi/coverage/

.. end-badges
HErbarium Specimen sheet PIpeline

.. start-quickstart

Hespi takes images of specimen sheets from herbaria and first detects the various components of the sheet.

.. image:: https://raw.githubusercontent.com/rbturnbull/hespi/main/docs/images/HespiDiagram.jpg
    :alt: Hespi pipeline
    :align: center
Hespi first detects the components of a specimen sheet using the Sheet-Component Model.
Any full database label detected is then cropped and passed to the Label-Field Model,
which detects the different textual fields written on the label.
A Label Classifier also determines the type of text written on the label.
If the text is printed or typewritten, each field is passed to an Optical Character Recognition (OCR) engine;
if it is handwritten, each field is passed to a Handwritten Text Recognition (HTR) engine.
The recognized text is then corrected using a multimodal Large Language Model (LLM).
Finally, the field results are post-processed and written to an HTML report, a CSV file and text files.

The stages of the pipeline are explained in the `documentation for the pipeline <https://rbturnbull.github.io/hespi/pipeline.html>`_.
Installation
==================================

Install hespi using pip:

.. code-block:: bash

    pip install hespi
The first time it runs, hespi will download the required model weights from the internet.

It is recommended that you also install `Tesseract <https://tesseract-ocr.github.io/tessdoc/Home.html>`_ so that it can be used in the text recognition part of the pipeline.

To install the development version, see the `documentation for contributing <https://rbturnbull.github.io/hespi/contributing.html>`_.
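Tesseract can usually be installed with a system package manager. The package names below are the common ones; check the Tesseract installation documentation for your platform:

.. code-block:: bash

    # Debian/Ubuntu
    sudo apt-get install tesseract-ocr

    # macOS with Homebrew
    brew install tesseract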
Usage
==================================

To run the pipeline, use the executable ``hespi`` and give it any number of images:

.. code-block:: bash

    hespi image1.jpg image2.jpg
By default the output will go to a directory called ``hespi-output``.
You can set the output directory with the ``--output-dir`` argument:

.. code-block:: bash

    hespi images/*.tif --output-dir ./hespi-output
The detected components and text fields will be cropped and stored in the output directory.
There will also be a CSV file named ``hespi-results.csv`` in the output directory with the text recognition results for any institutional labels found.

By default ``hespi`` will use OpenAI's ``gpt-4o`` large language model (LLM) in the pipeline to produce the final results.
If you wish to use a different model from OpenAI or Anthropic, specify it on the command line with ``--llm MODEL_NAME``.
You will need to provide an API key for the LLM: ``OPENAI_API_KEY`` for an OpenAI LLM or ``ANTHROPIC_API_KEY`` for Anthropic.
You can also pass the API key to hespi with the ``--llm-api-key API_KEY`` argument.
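For example, to run the pipeline with an Anthropic model (the model name and key value below are illustrative; substitute whichever model your API key has access to):

.. code-block:: bash

    # Provide the API key via the environment (or use --llm-api-key)
    export ANTHROPIC_API_KEY="sk-ant-..."

    # Select the model with --llm
    hespi image1.jpg --llm claude-3-5-sonnet-20240620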
More information on the command line arguments can be found in the `Command Line Reference <https://rbturnbull.github.io/hespi/cli.html>`_ in the documentation.

There is another command line utility called ``hespi-tools`` which provides additional functionality.
See the `documentation <https://rbturnbull.github.io/hespi/cli.html#hespi-tools>`_ for more information.
Training with custom data
==================================

To train the model with custom data, see the documentation.

.. end-quickstart
Credits
==================================

.. start-credits

Robert Turnbull, Emily Fitzgerald, Karen Thompson and Jo Birch from the University of Melbourne.

This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative.

The authors thank collaborators Niels Klazenga, Heroen Verbruggen, Nunzio Knerr, Noel Faux, Simon Mutch, Babak Shaban, Andrew Drinnan, Michael Bayly and Hannah Turnbull.

Plant reference data obtained from the `Australian National Species List (auNSL) <https://biodiversity.org.au/nsl>`_, as of March 2024, using the:

- Australian Plant Name Index (APNI)
- Australian Bryophyte Name Index (AusMoss)
- Australian Fungi Name Index (AFNI)
- Australian Lichen Name Index (ALNI)
- Australian Algae Name Index (AANI)

and the `World Flora Online Taxonomic Backbone v.2023.12 <https://www.worldfloraonline.org/downloadData>`_, accessed 13 June 2024.

This pipeline depends on `YOLOv8 <https://github.com/ultralytics/ultralytics>`_,
`torchapp <https://github.com/rbturnbull/torchapp>`_
and Microsoft's `TrOCR <https://www.microsoft.com/en-us/research/publication/trocr-transformer-based-optical-character-recognition-with-pre-trained-models/>`_.
Logo derived from artwork by `ka reemov <https://thenounproject.com/icon/plant-1386076/>`_.

.. end-credits

See the documentation for references in BibTeX format, or use the command:

.. code-block:: bash

    hespi-tools bibtex