kamilib


Name: kamilib
Version: 0.1.13
Home page: https://github.com/KaMI-tools-project/KaMi-lib
Summary: HTR / OCR models evaluation agnostic Python package, originally based on the Kraken transcription system.
Upload time: 2023-03-23 10:37:56
Authors: Lucas Terriel, Alix Chagué
Requires Python: >=3.8
License: MIT
Keywords: HTR, OCR, evaluation framework, metrics, handwritten text recognition, optical character recognition
            |Python Version| |Version| |License|

KaMI-lib (Kraken Model Inspector)
=================================

|Logo|

An engine-agnostic Python package for evaluating HTR / OCR models, originally
based on the `Kraken <http://kraken.re/>`__ transcription system.

🔌 Installation
===============

User installation
-----------------

Use pip to install the package:

.. code-block:: bash

    $ pip install kamilib

Developer installation
----------------------

1. Clone the kami-lib repository

.. code-block:: bash

    $ git clone https://gitlab.inria.fr/dh-projects/kami/kami-lib.git

2. Create a virtual environment

.. code-block:: bash

    $ virtualenv -p python3.7 kami_venv

then

.. code-block:: bash

    $ source kami_venv/bin/activate

3. Install dependencies with the requirements file

.. code-block:: bash

    $ pip install -r requirements.txt

4. Run the tests

.. code-block:: bash

    $ python -m unittest tests/*.py -v

🏃 Tutorial
===========

An "end-to-end pipeline" example that uses Kamilib (written in French)
is available at: |Open In Colab|

Tools built with KaMI-lib
-------------------------

A turn-key graphical interface:
`KaMI-app <https://kami-app.herokuapp.com/>`__

🔑 Quickstart
==============

KaMI-lib can be used for different use cases with the class `Kami()`.

First, import the KaMI-lib package:

.. code-block:: python

    from kami.Kami import Kami

The following sections describe two use cases:

-  How to compare outputs from any automatic transcription system,
-  How to use KaMI-lib with a transcription prediction produced with a
   Kraken model.

Summary
-------

1. Compare a reference and a prediction, independently from the Kraken engine
2. Evaluate the prediction of a model generated with the Kraken engine
3. Use text preprocessing to get different scores
4. Metrics options
5. Others

1. Compare a reference and a prediction, independently from the Kraken engine
-----------------------------------------------------------------------------

KaMI-lib allows you to compare two strings, or two text files referenced
by their paths.

.. code-block:: python

    # Define your string to compare.
    reference_string = "Les 13 ans de Maxime ? étaient, Déjà terriblement, savants ! - La Curée, 1871. En avant, pour la lecture."

    prediction_string = "Les 14a de Maxime ! étaient, djàteriblement, savants - La Curée, 1871. En avant? pour la leTTture."

    # Or specify the path to your text files.
    # reference_path = "reference.txt"
    # prediction_path = "prediction.txt"

    # Create a Kami() object and simply insert your data (string or raw text files)
    k = Kami([reference_string, prediction_string]) 

You can retrieve the results as a dictionary with the `.board` attribute:

.. code-block:: python

    print(k.scores.board)

which returns a dictionary containing your metrics (see also the Focus on
metrics section below):

.. code-block:: python

    {'levensthein_distance_char': 14, 'levensthein_distance_words': 8, 'hamming_distance': 'Ø', 'wer': 0.4, 'cer': 0.13333333333333333, 'wacc': 0.6, 'wer_hunt': 0.325, 'mer': 0.1320754716981132, 'cil': 0.17745383867832842, 'cip': 0.8225461613216716, 'hits': 92, 'substitutions': 5, 'deletions': 8, 'insertions': 1}

You can also access a specific metric, as follows:

.. code-block:: python

    print(k.scores.wer)
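
Since `.board` is a plain Python dictionary, individual values can also be
read by key. The keys used below are the ones shown in the sample output
above:

.. code-block:: python

    # Read individual metrics directly from the board dictionary.
    print(k.scores.board["cer"])            # character error rate
    print(k.scores.board["wacc"])           # word accuracy
    print(k.scores.board["substitutions"])  # number of substituted characters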

2. Evaluate the prediction of a model generated with the Kraken engine
---------------------------------------------------------------------------

The `Kami()` object uses a ground truth (**XML ALTO or XML PAGE format
only, no text format**), a transcription model and an image to evaluate
the prediction made by the Kraken engine.

Here is a simple example demonstrating how to use this method with a
ground truth in ALTO XML:

.. code-block:: python

    # Define ground truth path (XML ALTO here)
    alto_gt = "./datatest/lectaurep_set/image_gt_page1/FRAN_0187_16402_L-0_alto.xml"
    # Define transcription model path
    model="./datatest/lectaurep_set/models/mixte_mrs_15.mlmodel"
    # Define image
    image="./datatest/lectaurep_set/image_gt_page1/FRAN_0187_16402_L-0.png"

    # Create a Kami() object and simply insert your data
    k = Kami(alto_gt,
             model=model,
             image=image)  

To retrieve the results as a dictionary (`.board` attribute), as in use case 1:

.. code-block:: python

    print(k.scores.board)

which returns a dictionary containing your metrics (for more details on
metrics, see the Focus on metrics section below):

.. code-block:: python

    {'levensthein_distance_char': 408, 'levensthein_distance_words': 255, 'hamming_distance': 'Ø', 'wer': 0.3128834355828221, 'cer': 0.09150033639829558, 'wacc': 0.6871165644171779, 'wer_hunt': 0.29938650306748466, 'mer': 0.08970976253298153, 'cil': 0.1395071670835435, 'cip': 0.8604928329164565, 'hits': 4140, 'substitutions': 238, 'deletions': 81, 'insertions': 89}

Depending on the size of the ground truth file, the prediction process
may take more or less time.

Kraken parameters can be modified. You can specify the number of CPU
workers for inference (default 7) with the ``workers`` parameter, and you
can set the principal text direction with the ``text_direction``
parameter ("horizontal-lr", "horizontal-rl", "vertical-lr",
"vertical-rl"; KaMI-lib uses "horizontal-lr" by default).

.. code-block:: python

    k = Kami(alto_gt,
             model=model,
             image=image,
             workers=7,
             text_direction="horizontal-lr")  

3. Use text preprocessing to get different scores
-------------------------------------------------

KaMI-lib provides the possibility to apply textual transformations to
the ground truth and the prediction before evaluating them. By doing so,
scores can change according to the performance of the model used. This
functionality allows a better understanding of the errors made by the
transcription model. For example, if removing all diacritics improves the
scores, it probably means that the model is not good enough at transcribing
them. By default, no preprocessing is applied.

To preprocess the ground truth and the prediction, you can use the `apply_transforms` parameter of the `Kami()` class.

The `apply_transforms` parameter receives a character code
corresponding to the transformations to be performed:

+------------------+----------------------------------------------------------------------------+
| Character code   | Applied transformation                                                     |
+==================+============================================================================+
| D                | remove digits                                                              |
+------------------+----------------------------------------------------------------------------+
| U                | uppercase                                                                  |
+------------------+----------------------------------------------------------------------------+
| L                | lowercase                                                                  |
+------------------+----------------------------------------------------------------------------+
| P                | remove punctuation                                                         |
+------------------+----------------------------------------------------------------------------+
| X                | remove diacritics                                                          |
+------------------+----------------------------------------------------------------------------+

You can combine these options as follows:

.. code-block:: python

    k = Kami(
        [ground_truth, prediction],
        apply_transforms="XP" # Combine here : remove diacritics + remove punctuation  
        )  

This results in a dictionary of more detailed scores (use the built-in
``pprint`` module to display it in a human-readable way), as follows:

.. code-block:: python

    import pprint

    # Get all scores
    pprint.pprint(k.scores.board)

.. code-block:: python

    {'Length_prediction': 2507,
          'Length_prediction_transformed': 2405,
          'Length_reference': 2536,
          'Length_reference_transformed': 2426,
          'Total_char_removed_from_prediction': 102,
          'Total_char_removed_from_reference': 110,
          'Total_diacritics_removed_from_prediction': 84,
          'Total_diacritics_removed_from_reference': 98,
          'all_transforms': {'cer': 5.81,
                             'cil': 8.38,
                             'cip': 91.61,
                             'deletions': 48,
                             'hamming_distance': 'Ø',
                             'hits': 2312,
                             'insertions': 27,
                             'levensthein_distance_char': 141,
                             'levensthein_distance_words': 73,
                             'mer': 5.74,
                             'substitutions': 66,
                             'wacc': 82.28,
                             'wer': 17.71},
          'default': {'cer': 6.62,
                      'cil': 9.55,
                      'cip': 90.44,
                      'deletions': 59,
                      'hamming_distance': 'Ø',
                      'hits': 2398,
                      'insertions': 30,
                      'levensthein_distance_char': 168,
                      'levensthein_distance_words': 90,
                      'mer': 6.54,
                      'substitutions': 79,
                      'wacc': 79.54,
                      'wer': 20.45},
          'remove_diacritics': {'cer': 6.08,
                                'cil': 8.78,
                                'cip': 91.21,
                                'deletions': 49,
                                'hamming_distance': 'Ø',
                                'hits': 2379,
                                'insertions': 31,
                                'levensthein_distance_char': 152,
                                'levensthein_distance_words': 77,
                                'mer': 6.0,
                                'substitutions': 72,
                                'wacc': 82.05,
                                'wer': 17.94},
          'remove_punctuation': {'cer': 6.37,
                                 'cil': 9.25,
                                 'cip': 90.74,
                                 'deletions': 57,
                                 'hamming_distance': 'Ø',
                                 'hits': 2330,
                                 'insertions': 25,
                                 'levensthein_distance_char': 157,
                                 'levensthein_distance_words': 86,
                                 'mer': 6.31,
                                 'substitutions': 75,
                                 'wacc': 79.71,
                                 'wer': 20.28}}

-  The **'default'** key indicates the scores without any
   transformations;
-  The **'all\_transforms'** key indicates the scores with all
   transformations applied (here remove diacritics + remove
   punctuation).

Each applied transformation also gets its own key. In this example:

-  The **'remove\_punctuation'** key indicates the scores with
   punctuation removed only;
-  The **'remove\_diacritics'** key indicates the scores with
   diacritics removed only.
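
Since the board is a nested dictionary in this case, a single score can be
picked out by chaining the keys. A minimal sketch, using the key names from
the output above:

.. code-block:: python

    # Compare the CER before and after all transformations, using the
    # nested keys shown in the output above.
    default_cer = k.scores.board["default"]["cer"]
    transformed_cer = k.scores.board["all_transforms"]["cer"]
    print(f"CER without preprocessing: {default_cer}")
    print(f"CER with all transforms applied: {transformed_cer}")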

4. Metrics options
------------------

KaMI-lib provides the possibility to weight the operations counted
between the ground truth and the prediction (insertions, substitutions
and deletions) differently. By default, these operations have a weight of
1. You can change these weights with the following parameters of the `Kami()`
class:

-  `insertion_cost`
-  `substitution_cost`
-  `deletion_cost`

**Keep in mind that these weights are the basis for the Levenshtein distance
computations and for performance metrics like WER and CER, so changing them
can greatly influence the final scores.**

Example:

.. code-block:: python

    k = Kami(
        [ground_truth, prediction],
        insertion_cost=1,
        substitution_cost=0.5,
        deletion_cost=1
        )  

The `Kami()` class also provides score display settings:

-  `truncate` (bool): option to truncate results. Defaults to
   `False`.
-  `percent` (bool): `True` to display results as percentages, else
   `False`. Defaults to `False`.
-  `round_digits` (str): number of digits after the floating point,
   given in string form. Defaults to `'.01'`.

Example:

.. code-block:: python

    k = Kami([ground_truth, prediction],
                 apply_transforms="DUP", 
                 verbosity=False,  
                 truncate=True,  
                 percent=True,  
                 round_digits='0.01')  

5. Others
---------

For debugging, you can set the `verbosity` parameter (defaults to
`False`) in the `Kami()` class; when enabled, it displays execution logs.
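
A minimal example, reusing the strings from the quickstart:

.. code-block:: python

    # Enable execution logs for debugging; `verbosity` defaults to False.
    k = Kami([reference_string, prediction_string], verbosity=True)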

🎯 Focus on metrics
===================

Operations between strings
--------------------------

-  **Hits**: number of identical characters between the reference and
   the prediction.

-  **Substitutions**: number of substitutions (a character replaced by
   another) necessary to make the prediction match the reference.

-  **Deletions**: number of deletions (a character is removed) necessary
   to make the prediction match the reference.

-  **Insertions**: number of insertions (a character is added) necessary
   to make the prediction match the reference.

*For each of these operations, except hits, a cost of 1 is assigned by
default.*

Distances
---------

-  **Levenshtein Distance (Char.)**: Levenshtein distance (sum of the
   operations between the two strings) computed at character level.

-  **Levenshtein Distance (Words)**: Levenshtein distance (sum of the
   operations between the two strings) computed at word level.

-  **Hamming Distance**: A score if the strings' lengths match but their
   content is different; `Ø` if the strings' lengths don't
   match.

Transcription performance (HTR/OCR)
-----------------------------------

The performance metrics are calculated with the Levenshtein distances
mentioned above.

-  **WER**: Word Error Rate, proportion of words bearing at least one recognition error.
   It generally lies in `[0, 1.0]`; the closer it is to `0`, the better the recognition.
   However, a bad recognition can lead to a `WER > 1.0`.

-  **CER**: Character Error Rate, proportion of characters erroneously transcribed.
   Generally more precise than the WER. It generally lies in `[0, 1.0]`; the closer it is to
   `0`, the better the recognition. However, a bad recognition can lead to a `CER > 1.0`.

-  **Wacc** : Word Accuracy, proportion of words bearing no recognition error.

-  **WER Hunt**: reproduces the Word Error Rate experiment by Hunt (1990).
   Same principle as the WER computation, with a weight of `0.5` on insertions and deletions.
   This metric shows the importance of customizing the weights of the string operations, as they depend heavily on the system
   and type of data used in an HTR/OCR project. In KaMI-lib, it is possible to modify the weights assigned to operations.
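
As a sanity check, the scores from the quickstart example can be recomputed
from the operations reported in its board, assuming the standard formulas
``CER = (S + D + I) / N`` (with ``N`` the number of reference characters) and
``WER = word-level Levenshtein distance / number of reference words``:

.. code-block:: python

    # Operations reported in the quickstart board: hits=92, substitutions=5,
    # deletions=8, insertions=1; word-level Levenshtein distance=8;
    # the quickstart reference string contains 20 words.
    hits, substitutions, deletions, insertions = 92, 5, 8, 1

    reference_chars = hits + substitutions + deletions  # 105 characters in the reference
    cer = (substitutions + deletions + insertions) / reference_chars
    print(round(cer, 4))  # 0.1333 -> matches 'cer' in the quickstart board

    reference_words = 20  # words in the quickstart reference string
    word_distance = 8     # 'levensthein_distance_words' in the quickstart board
    wer = word_distance / reference_words
    print(wer)            # 0.4 -> matches 'wer'
    print(1 - wer)        # 0.6 -> matches 'wacc' (word accuracy)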

Experimental Metrics (metrics borrowed from Speech Recognition - ASR)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  **Match Error Rate**

-  **Character Information Lost**

-  **Character Information Preserve**

❓ Do you have questions, bug reports, feature requests or feedback?
======================================================================

Please use the issue templates:


- 🐞 Bug report: `here <https://github.com/KaMI-tools-project/KaMi-lib/issues/new?assignees=&labels=&template=bug_report.md&title=>`__


- 🎆 Feature request: `here <https://github.com/KaMI-tools-project/KaMi-lib/issues/new?assignees=&labels=&template=feature_request.md&title=>`__

*If none of the aforementioned cases apply, feel free to open an issue.*

✒️ How to cite
==============

.. code-block:: latex

    @misc{Kami-lib,
        author = "Lucas Terriel (Inria - ALMAnaCH) and Alix Chagué (Inria - ALMAnaCH)",
        title = {Kami-lib - Kraken model inspector},
        howpublished = {\url{https://github.com/KaMI-tools-project/KaMi-lib}},
        year = {2021-2022}
    }

🐙  License and contact
=======================

Distributed under the `MIT <./LICENSE>`__ license. The dependencies used in
the project are also distributed under compatible licenses.

Contact the authors by email: Alix Chagué (alix.chague@inria.fr) and Lucas
Terriel (lucas.terriel@inria.fr)

Special thanks: Hugo Scheithauer (hugo.scheithauer@inria.fr)

*KaMI-lib* is developed and maintained by the authors (2021-2022; the
first version, named Kraken-Benchmark, dates from 2020) with the
contribution of `ALMAnaCH <http://almanach.inria.fr/index-en.html>`__ at
`Inria <https://www.inria.fr/en>`__ Paris.

|forthebadge made-with-python|

.. |Logo| image:: https://raw.githubusercontent.com/KaMI-tools-project/KaMi-lib/master/docs/static/kamilib_logo.png
    :width: 100px
.. |Python Version| image:: https://img.shields.io/badge/Python-%3E%3D%203.7-%2313aab7
   :target: https://img.shields.io/badge/Python-%3E%3D%203.7-%2313aab7
.. |Version| image:: https://badge.fury.io/py/kamilib.svg
   :target: https://badge.fury.io/py/kamilib
.. |License| image:: https://img.shields.io/github/license/Naereen/StrapDown.js.svg
   :target: https://opensource.org/licenses/MIT
.. |Open In Colab| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/drive/1nk0hNtL9QTO5jczK0RPEv9zF3nP3DpOc?usp=sharing
.. |forthebadge made-with-python| image:: http://ForTheBadge.com/images/badges/made-with-python.svg
   :target: https://www.python.org/


            
