inscriptis

Name	inscriptis
Version	2.5.3
home_page	https://github.com/weblyzard/inscriptis
Summary	inscriptis - HTML to text converter.
upload_time	2025-01-16 13:11:20
maintainer	None
docs_url	None
author	Albert Weichselbraun
requires_python	<4.0,>=3.9
license	Apache-2.0
keywords	html converter text

==================================================================================
inscriptis -- HTML to text conversion library, command line client and Web service
==================================================================================

.. image:: https://img.shields.io/pypi/pyversions/inscriptis   
   :target: https://badge.fury.io/py/inscriptis
   :alt: Supported python versions

.. image:: https://api.codeclimate.com/v1/badges/f8ed73f8a764f2bc4eba/maintainability
   :target: https://codeclimate.com/github/weblyzard/inscriptis/maintainability
   :alt: Maintainability

.. image:: https://codecov.io/gh/weblyzard/inscriptis/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/weblyzard/inscriptis/
   :alt: Coverage

.. image:: https://github.com/weblyzard/inscriptis/actions/workflows/python-package.yml/badge.svg
   :target: https://github.com/weblyzard/inscriptis/actions/workflows/python-package.yml
   :alt: Build status

.. image:: https://readthedocs.org/projects/inscriptis/badge/?version=latest
   :target: https://inscriptis.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation status

.. image:: https://badge.fury.io/py/inscriptis.svg
   :target: https://badge.fury.io/py/inscriptis
   :alt: PyPI version

.. image:: https://pepy.tech/badge/inscriptis
   :target: https://pepy.tech/project/inscriptis
   :alt: PyPI downloads

.. image:: https://joss.theoj.org/papers/10.21105/joss.03557/status.svg
   :target: https://doi.org/10.21105/joss.03557


A Python-based HTML to text conversion library, command line client and Web
service with support for **nested tables**, a **subset of CSS** and optional
support for providing an **annotated output**.

Inscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.

Please take a look at the
`Rendering <https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md>`_
document for a demonstration of inscriptis' conversion quality.

A Java port of inscriptis 1.x has been published by
`x28 <https://github.com/x28/inscriptis-java>`_.

This document provides a short introduction to Inscriptis. 

- The full documentation is built automatically and published on `Read the Docs <https://inscriptis.readthedocs.org/en/latest/>`_. 
- For a more general overview of *text extraction from HTML*, see this `blog post on different HTML to text conversion approaches, and criteria for selecting them <https://www.semanticlab.net/linux/big%20data/knowledge%20extraction/Extracting-text-from-HTML-with-Python/>`_.

.. contents:: Table of contents

Statement of need - why inscriptis?
===================================

1. Inscriptis provides a **layout-aware** conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. 

   Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.

   Beautiful Soup's ``get_text()`` function, for example, converts the following HTML enumeration to the string ``firstsecond``.

   .. code-block:: HTML
   
      <ul>
        <li>first</li>
        <li>second</li>
      </ul>


   Inscriptis, in contrast, not only returns the correct output
   
   .. code-block::
   
      * first
      * second

   but also supports much more complex constructs such as nested tables, and interprets a subset of HTML attributes (e.g., ``align``, ``valign``) and CSS properties (e.g., ``display``, ``white-space``, ``margin-top``, ``vertical-align``) that determine the text alignment. Whenever the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis), an accurate HTML to text conversion is essential.

2. Inscriptis supports `annotation rules <#annotation-rules>`_, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to

   - provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.
   - assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). ``Inscriptis`` supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool `doccano <https://github.com/doccano/doccano>`_.
   - enable the use of ``Inscriptis`` for tasks such as content extraction (i.e., extracting task-specific relevant content from a Web page) which rely on information on the HTML document's structure.
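
The conversion problem described in item 1 can be reproduced with the standard library alone. The sketch below is a deliberately naive extractor built on ``html.parser``; ``NaiveTextExtractor`` and ``naive_get_text`` are illustrative names and are not part of Inscriptis or Beautiful Soup. Because it ignores the list structure, the two items run together:

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (no layout awareness)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        # Keep only non-empty text; whitespace between tags is dropped.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def naive_get_text(html: str) -> str:
    """Concatenate all text nodes of the given HTML snippet."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html = "<ul><li>first</li><li>second</li></ul>"
print(naive_get_text(html))  # prints: firstsecond
```

A layout-aware converter has to know that each ``li`` element starts a new line with a bullet, which is the kind of rendering knowledge Inscriptis encodes.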


Installation
============

At the command line::

    $ pip install inscriptis

Or, if the ``pip`` command is not on your path, invoke it as a module::

    $ python -m pip install inscriptis


Python library
==============

Embedding inscriptis into your code is easy, as outlined below:

.. code-block:: python
   
   import urllib.request
   from inscriptis import get_text
   
   url = "https://www.fhgr.ch"
   html = urllib.request.urlopen(url).read().decode('utf-8')
   
   text = get_text(html)
   print(text)


Standalone command line client
==============================
The command line client converts HTML files or text retrieved from Web pages to
the corresponding text representation.


Command line parameters
-----------------------

The inscript command line client supports the following parameters::

    usage: inscript [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
                       [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                       [input]

    Convert the given HTML document to text.

    positional arguments:
      input                 Html input either from a file or a URL (default:stdin).

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT, --output OUTPUT
                            Output file (default:stdout).
      -e ENCODING, --encoding ENCODING
                            Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).
      -i, --display-image-captions
                            Display image captions (default:false).
      -d, --deduplicate-image-captions
                            Deduplicate image captions (default:false).
      -l, --display-link-targets
                            Display link targets (default:false).
      -a, --display-anchor-urls
                            Display anchor URLs (default:false).
      -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                            Path to an optional JSON file containing rules for annotating the retrieved text.
      -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                            Optional component for postprocessing the result (html, surface, xml).
      --indentation INDENTATION
                            How to handle indentation (extended or strict; default: extended).
      --table-cell-separator TABLE_CELL_SEPARATOR
                            Separator to use between table cells (default: three spaces).
      -v, --version         display version information

   

HTML to text conversion
-----------------------
Convert the given page to text and output the result to the screen::

  $ inscript https://www.fhgr.ch
   
Convert the file to text and save the output to fhgr.txt::

  $ inscript fhgr.html -o fhgr.txt

Convert the file using strict indentation (i.e., minimize indentation and extra spaces) and save the output to fhgr-layout-optimized.txt::

  $ inscript --indentation strict fhgr.html -o fhgr-layout-optimized.txt
   
Convert HTML provided via stdin and save the output to output.txt::

  $ echo "<body><p>Make it so!</p></body>" | inscript -o output.txt 


HTML to annotated text conversion
---------------------------------
Convert and annotate HTML from a Web page using the provided annotation rules.

Download the example `annotation-profile.json <https://github.com/weblyzard/inscriptis/blob/master/examples/annotation/annotation-profile.json>`_ and save it to your working directory::

  $ inscript https://www.fhgr.ch -r annotation-profile.json

The annotation rules are specified in ``annotation-profile.json``:

.. code-block:: json

   {
    "h1": ["heading", "h1"],
    "h2": ["heading", "h2"],
    "b": ["emphasis"],
    "div#class=toc": ["table-of-contents"],
    "#class=FactBox": ["fact-box"],
    "#cite": ["citation"]
   }

The dictionary maps an HTML tag and/or attribute to the annotations
inscriptis should provide for them. In the example above, for instance, the tag
``h1`` yields the annotations ``heading`` and ``h1``, a ``div`` tag with a
``class`` that contains the value ``toc`` results in the annotation
``table-of-contents``, and all tags with a ``cite`` attribute are annotated with
``citation``.
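
The rule keys above follow three shapes: ``tag``, ``tag#attribute=value`` and ``#attribute`` (optionally with ``=value``). The following sketch splits such keys into their parts. ``RuleKey`` and ``parse_rule_key`` are hypothetical helpers written for illustration; they are not how Inscriptis parses rules internally:

```python
from typing import NamedTuple, Optional


class RuleKey(NamedTuple):
    """Parts of an annotation rule key; None means 'not constrained'."""
    tag: Optional[str]
    attribute: Optional[str]
    value: Optional[str]


def parse_rule_key(key: str) -> RuleKey:
    """Split keys such as 'h1', 'div#class=toc' or '#cite' into parts."""
    tag, _, attr_part = key.partition("#")
    if not attr_part:
        # Plain tag rule, e.g. "h1".
        return RuleKey(tag or None, None, None)
    attribute, _, value = attr_part.partition("=")
    return RuleKey(tag or None, attribute, value or None)


for key in ("h1", "div#class=toc", "#class=FactBox", "#cite"):
    print(key, "->", parse_rule_key(key))
```

Reading the keys this way makes it easier to see which rules are restricted to a tag, which to an attribute value, and which apply to any tag carrying a given attribute.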

Given these annotation rules the HTML file

.. code-block:: HTML

   <h1>Chur</h1>
   <b>Chur</b> is the capital and largest town of the Swiss canton of the
   Grisons and lies in the Grisonian Rhine Valley.

yields the following JSONL output

.. code-block:: json

   {"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
             of the Grisons and lies in the Grisonian Rhine Valley.",
    "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The provided list of labels contains all annotated text elements with their
start index, end index and the assigned label.
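
Since each label is a ``[start, end, label]`` triple with an exclusive end index, the annotated surface forms can be recovered by slicing the text. A minimal sketch using only the standard library, with the JSONL line from the example above written on a single line:

```python
import json

# One JSONL record as shown above (a single line in the real output).
line = (
    '{"text": "Chur\\n\\nChur is the capital and largest town of the '
    'Swiss canton of the Grisons and lies in the Grisonian Rhine Valley.", '
    '"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}'
)

record = json.loads(line)
text = record["text"]

# Slice the text with each label's start and end offsets.
spans = [(label, text[start:end]) for start, end, label in record["label"]]
for label, surface in spans:
    print(f"{label}: {surface!r}")
```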


Annotation postprocessors
-------------------------
Annotation postprocessors convert annotations into formats
that are suitable for your particular application. Postprocessors can be
specified with the ``-p`` or ``--postprocessor`` command line argument::

  $ inscript https://www.fhgr.ch \
          -r ./examples/annotation/annotation-profile.json \
          -p surface


Output:

.. code-block:: json

   {"text": "  Chur\n\n  Chur is the capital and largest town of the Swiss
             canton of the Grisons and lies in the Grisonian Rhine Valley.",
    "label": [[0, 6, "heading"], [8, 14, "emphasis"]],
    "tag": "<heading>Chur</heading>\n\n<emphasis>Chur</emphasis> is the
           capital and largest town of the Swiss canton of the Grisons and
           lies in the Grisonian Rhine Valley."}



Currently, inscriptis supports the following postprocessors:

- surface: returns a list of mappings between the annotation's surface form and its label::

    [
       ['heading', 'Chur'],
       ['emphasis', 'Chur']
    ]

- xml: returns an additional annotated text version::

    <?xml version="1.0" encoding="UTF-8" ?>
    <heading>Chur</heading>

    <emphasis>Chur</emphasis> is the capital and largest town of the Swiss
    canton of the Grisons and lies in the Grisonian Rhine Valley.

- html: creates an HTML file which contains the converted text and highlights all annotations as outlined below:

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/paper/images/annotations.png
   :align: left
   :alt: Annotations extracted from the Wikipedia entry for Chur with the ``--postprocessor html`` postprocessor.

   Snippet of the rendered HTML file created with the following command line options and annotation rules:

   .. code-block:: bash

      inscript --annotation-rules ./wikipedia.json \
                  --postprocessor html \
                  https://en.wikipedia.org/wiki/Chur.html

   Annotation rules encoded in the ``wikipedia.json`` file:

   .. code-block:: json

      {
        "h1": ["heading"],
        "h2": ["heading"],
        "h3": ["subheading"],
        "h4": ["subheading"],
        "h5": ["subheading"],
        "i": ["emphasis"],
        "b": ["bold"],
        "table": ["table"],
        "th": ["tableheading"],
        "a": ["link"]
      } 
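
The idea behind tagged output such as that of the ``xml`` postprocessor can be sketched with plain string operations. The function ``tag_text`` below is a hypothetical illustration, not the implementation used by Inscriptis; it assumes that the annotations do not overlap and performs no XML escaping:

```python
def tag_text(text, labels):
    """Wrap each annotated span of `text` in <label>...</label> tags.

    `labels` is a list of (start, end, label) triples with non-overlapping
    spans, as found in the "label" field of the annotated output.
    """
    tagged = text
    # Work from right to left so that earlier offsets remain valid
    # after tags have been inserted further to the right.
    for start, end, label in sorted(labels, reverse=True):
        tagged = (
            tagged[:start]
            + f"<{label}>" + tagged[start:end] + f"</{label}>"
            + tagged[end:]
        )
    return tagged


text = "Chur\n\nChur is the capital."
labels = [(0, 4, "heading"), (6, 10, "emphasis")]
print(tag_text(text, labels))
```

Overlapping annotations (for instance ``heading`` and ``h1`` on the same span) would need a nesting strategy that this sketch leaves out.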


Web Service
===========

A FastAPI-based Web Service that uses Inscriptis for translating HTML pages to plain text.

Run the Web Service on your host system
---------------------------------------
Install the optional feature ``web-service`` for inscriptis::
  
  $ pip install inscriptis[web-service]

Start the Inscriptis Web service with the following command::

  $ uvicorn inscriptis.service.web:app --port 5000 --host 127.0.0.1


Run the Web Service with Docker
-------------------------------

The docker definition can be found `here <https://github.com/weblyzard/inscriptis/pkgs/container/inscriptis>`_::
  
  $ docker pull ghcr.io/weblyzard/inscriptis:latest
  $ docker run --name inscriptis ghcr.io/weblyzard/inscriptis:latest

Run as Kubernetes Deployment
----------------------------

The Helm chart for deployment on a Kubernetes cluster is located in the `inscriptis-helm repository <https://github.com/weblyzard/inscriptis-helm>`_.

Use the Web Service
-------------------

The Web service receives the HTML file in the request body and returns the
corresponding text. The file's encoding needs to be specified
in the ``Content-Type`` header (``UTF-8`` in the example below)::

  $ curl -X POST  -H "Content-Type: text/html; encoding=UTF8"  \
          --data-binary @test.html  http://localhost:5000/get_text

The service also supports a version call::

  $ curl http://localhost:5000/version
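
The same ``get_text`` call can be issued from Python with the standard library. The sketch below mirrors the ``curl`` example and assumes a service instance listening on ``localhost:5000`` and a UTF-8 encoded response; ``build_get_text_request`` and ``html_to_text`` are illustrative helper names, not part of Inscriptis:

```python
import urllib.request

# Assumed address of a locally running Inscriptis Web service.
SERVICE_URL = "http://localhost:5000/get_text"


def build_get_text_request(html: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Create the POST request; the header mirrors the curl example."""
    return urllib.request.Request(
        url,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; encoding=UTF8"},
        method="POST",
    )


def html_to_text(html: str, url: str = SERVICE_URL) -> str:
    """Send the HTML to the service and return the converted text."""
    with urllib.request.urlopen(build_get_text_request(html, url)) as response:
        # Assumption: the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8")
```

With the service running, ``html_to_text("<p>Make it so!</p>")`` sends the snippet and returns the text produced by the service.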


Example annotation profiles
===========================

The following section provides a number of example annotation profiles illustrating the use of Inscriptis' annotation support.
Each example lists the annotation rules together with an image that shows a snippet of the converted Web page with the annotated text highlighted. The images have been
created using the HTML postprocessor as outlined in Section `annotation postprocessors <#annotation-postprocessors>`_.

Wikipedia tables and table metadata
-----------------------------------


The following annotation rules extract tables from Wikipedia pages, and annotate table headings that are typically used to indicate column or row headings.

.. code-block:: json

   {
      "table": ["table"],
      "th": ["tableheading"],
      "caption": ["caption"]
   }

The figure below outlines an example table from Wikipedia that has been annotated using these rules.

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-table-annotation.png
   :alt: Table and table metadata annotations extracted from the Wikipedia entry for Chur.


References to entities, missing entities and citations from Wikipedia
---------------------------------------------------------------------

This profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn't perfect, since it also annotates ``[ edit ]`` links.

.. code-block:: json

   {
      "a#title": ["entity"],
      "a#class=new": ["missing"],
      "#class=reference": ["citation"]
   }

The figure shows entities and citations that have been identified on a Wikipedia page using these rules.

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-entry-annotation.png
   :alt: Metadata on entries, missing entries and citations extracted from the Wikipedia entry for Chur.





Posts and post metadata from the XDA developer forum
----------------------------------------------------

The annotation rules below extract posts together with metadata on the post's time, user and the user's job title from the XDA developer forum.

.. code-block:: json

   {
       "article#class=message-body": ["article"],
       "li#class=u-concealed": ["time"],
       "#itemprop=name": ["user-name"],
       "#itemprop=jobTitle": ["user-title"]
   }

The figure illustrates the annotated metadata on posts from the XDA developer forum.

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/xda-posts-annotation.png
   :alt: Posts and post metadata extracted from the XDA developer forum.



Code and metadata from Stackoverflow pages
------------------------------------------
The rules below extract code and metadata on users and comments from Stackoverflow pages.

.. code-block:: json

   {
      "code": ["code"],
      "#itemprop=dateCreated": ["creation-date"],
      "#class=user-details": ["user"],
      "#class=reputation-score": ["reputation"],
      "#class=comment-date": ["comment-date"],
      "#class=comment-copy": ["comment-comment"]
   }

Applying these rules to a Stackoverflow page on text extraction from HTML yields the following snippet:

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/stackoverflow-code-annotation.png
   :alt: Code and metadata from Stackoverflow pages.


Advanced topics
===============

Annotated text
--------------
Inscriptis can provide annotations alongside the extracted text, which allows
downstream components to draw upon semantics that would otherwise only be
available in the original HTML file.

The extracted text and annotations can be exported in different formats,
including the popular JSONL format which is used by
`doccano <https://github.com/doccano/doccano>`_.

Example output:

.. code-block:: json

   {"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
             of the Grisons and lies in the Grisonian Rhine Valley.",
    "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The output above is produced if inscriptis is run with the following
annotation rules:

.. code-block:: json

   {
    "h1": ["heading", "h1"],
    "b": ["emphasis"]
   }

The code below demonstrates how inscriptis' annotation capabilities can
be used within a program:

.. code-block:: python

  import urllib.request
  from inscriptis import get_annotated_text
  from inscriptis.model.config import ParserConfig

  url = "https://www.fhgr.ch"
  html = urllib.request.urlopen(url).read().decode('utf-8')

  rules = {'h1': ['heading', 'h1'],
           'h2': ['heading', 'h2'],
           'b': ['emphasis'],
           'table': ['table']
          }

  output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
  print("Text:", output['text'])
  print("Annotations:", output['label'])

Fine tuning
-----------

The following options are available for fine tuning inscriptis' HTML rendering:

1. **More rigorous indentation:** the ``extended`` indentation strategy also
   indents tags such as ``<div>`` and ``<span>`` that do not provide
   indentation in their standard definition. This strategy is the default in
   ``inscript`` and many other tools such as Lynx. If you do not want extended
   indentation, select strict indentation instead, either with
   ``--indentation strict`` on the command line or by passing a
   ``ParserConfig`` built from ``CSS_PROFILES['strict']`` as shown in the
   `Strict indentation handling`_ example.

2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions
   that are maintained in ``inscriptis.css_profiles.CSS_PROFILES`` for rendering
   HTML tags. You can override these definitions (and therefore change the
   rendering) as outlined below:

.. code-block:: python

      from lxml.html import fromstring
      from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
      from inscriptis.html_engine import Inscriptis
      from inscriptis.html_properties import Display
      from inscriptis.model.config import ParserConfig

      # create a custom CSS based on the strict profile and change the
      # rendering of `div` and `span` elements
      css = CSS_PROFILES['strict'].copy()
      css['div'] = HtmlElement(display=Display.block, padding_inline=2)
      css['span'] = HtmlElement(prefix=' ', suffix=' ')

      # `html` holds the HTML document to convert
      html = '<div>first <span>second</span></div>'
      html_tree = fromstring(html)
      # create a parser using the custom css
      config = ParserConfig(css=css)
      parser = Inscriptis(html_tree, config)
      text = parser.get_text()


Custom HTML tag handling
------------------------

If the fine-tuning options discussed above are not sufficient, you may even override Inscriptis' handling of start and end tags as outlined below:

.. code-block:: python

    from inscriptis import ParserConfig
    from inscriptis.html_engine import Inscriptis
    from inscriptis.model.tag import CustomHtmlTagHandlerMapping

    my_mapping = CustomHtmlTagHandlerMapping(
        start_tag_mapping={'a': my_handle_start_a},
        end_tag_mapping={'a': my_handle_end_a}
    )
    inscriptis = Inscriptis(html_tree, 
                            ParserConfig(custom_html_tag_handler_mapping=my_mapping))
    text = inscriptis.get_text()
		

In the example the standard HTML handlers for the ``a`` tag are overwritten with custom versions (i.e., ``my_handle_start_a`` and ``my_handle_end_a``).
You may define custom handlers for any tag, regardless of whether it already exists in the standard mapping.

Please refer to `custom-html-handling.py <https://github.com/weblyzard/inscriptis/blob/master/examples/custom-html-handling.py>`_ for a working example. 
The standard HTML tag handlers can be found in the `inscriptis.model.tag <https://github.com/weblyzard/inscriptis/blob/master/src/inscriptis/model/tag>`_ package.

Optimizing memory consumption
-----------------------------

Inscriptis uses the Python lxml library, which prefers to reuse memory rather than release it to the operating system. This behavior might lead to increased memory consumption if you use inscriptis within a Web service that parses very complex HTML pages.

The following code mitigates this problem on Linux systems that use glibc by asking the C library to release free memory back to the operating system:

.. code-block:: python

   import ctypes
   def trim_memory() -> int:
      libc = ctypes.CDLL("libc.so.6")
      return libc.malloc_trim(0)
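
One way to apply this in a long-running process is to trim after every batch of conversions. The sketch below is an assumption about how such a loop might look rather than a recommendation from the Inscriptis authors: ``convert_all`` and ``trim_interval`` are hypothetical names, ``convert`` stands for any conversion callable such as ``inscriptis.get_text``, and the guarded ``trim_memory`` variant simply returns ``0`` on platforms without glibc's ``malloc_trim``:

```python
import ctypes
from typing import Callable, Iterable, List


def trim_memory() -> int:
    """Ask glibc to release free heap memory; return 0 if unavailable."""
    try:
        libc = ctypes.CDLL("libc.so.6")
        return int(libc.malloc_trim(0))
    except (OSError, AttributeError):
        # No glibc (e.g., macOS, Windows, musl): nothing to trim.
        return 0


def convert_all(documents: Iterable[str],
                convert: Callable[[str], str],
                trim_interval: int = 100) -> List[str]:
    """Convert each document and trim memory every `trim_interval` items."""
    results = []
    for count, document in enumerate(documents, start=1):
        results.append(convert(document))
        if count % trim_interval == 0:
            trim_memory()
    return results
```

A suitable interval depends on the size of the pages being processed, so the default of 100 is only a placeholder.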


Examples
========

Strict indentation handling
---------------------------

The following example demonstrates modifying ``ParserConfig`` for strict indentation handling.

.. code-block:: python

   from inscriptis import get_text
   from inscriptis.css_profiles import CSS_PROFILES
   from inscriptis.model.config import ParserConfig

   config = ParserConfig(css=CSS_PROFILES['strict'].copy())
   text = get_text('fi<span>r</span>st', config)
   print(text)

Ignore elements during parsing 
------------------------------

Overwriting the default CSS profile also allows changing the rendering of selected elements. 
The snippet below, for example, removes forms from the parsed text by setting the definition of the ``form`` tag to ``Display.none``.

.. code-block:: python

      from inscriptis import get_text
      from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
      from inscriptis.html_properties import Display
      from inscriptis.model.config import ParserConfig

      # create a custom CSS based on the strict profile and hide
      # `form` elements
      css = CSS_PROFILES['strict'].copy()
      css['form'] = HtmlElement(display=Display.none)

      # HTML snippet with a form that is removed by the custom css
      html = """First line. 
                <form>
                  User data
                  <label for="name">Name:</label><br>
                  <input type="text" id="name" name="name"><br>
                  <label for="pass">Password:</label><br>
                  <input type="hidden" id="pass" name="pass">
                </form>"""
      config = ParserConfig(css=css)
      text = get_text(html, config)
      print(text)


Citation
========

There is a `Journal of Open Source Software <https://joss.theoj.org>`_ `paper <https://joss.theoj.org/papers/10.21105/joss.03557>`_ you can cite for Inscriptis:

.. code-block:: bibtex

      @article{Weichselbraun2021,
        doi = {10.21105/joss.03557},
        url = {https://doi.org/10.21105/joss.03557},
        year = {2021},
        publisher = {The Open Journal},
        volume = {6},
        number = {66},
        pages = {3557},
        author = {Albert Weichselbraun},
        title = {Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web},
        journal = {Journal of Open Source Software}
      }


Changelog
=========

A full list of changes can be found in the
`release notes <https://github.com/weblyzard/inscriptis/releases>`_.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/weblyzard/inscriptis",
    "name": "inscriptis",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "HTML, converter, text",
    "author": "Albert Weichselbraun",
    "author_email": "albert.weichselbraun@fhgr.ch",
    "download_url": "https://files.pythonhosted.org/packages/5a/aa/15cefc93fe3ee06f4c7c0a054054892cbd2f7fdd0513cff5bdd473c0adfc/inscriptis-2.5.3.tar.gz",
    "platform": null,
    "description": "==================================================================================\ninscriptis -- HTML to text conversion library, command line client and Web service\n==================================================================================\n\n.. image:: https://img.shields.io/pypi/pyversions/inscriptis   \n   :target: https://badge.fury.io/py/inscriptis\n   :alt: Supported python versions\n\n.. image:: https://api.codeclimate.com/v1/badges/f8ed73f8a764f2bc4eba/maintainability\n   :target: https://codeclimate.com/github/weblyzard/inscriptis/maintainability\n   :alt: Maintainability\n\n.. image:: https://codecov.io/gh/weblyzard/inscriptis/branch/master/graph/badge.svg\n   :target: https://codecov.io/gh/weblyzard/inscriptis/\n   :alt: Coverage\n\n.. image:: https://github.com/weblyzard/inscriptis/actions/workflows/python-package.yml/badge.svg\n   :target: https://github.com/weblyzard/inscriptis/actions/workflows/python-package.yml\n   :alt: Build status\n\n.. image:: https://readthedocs.org/projects/inscriptis/badge/?version=latest\n   :target: https://inscriptis.readthedocs.io/en/latest/?badge=latest\n   :alt: Documentation status\n\n.. image:: https://badge.fury.io/py/inscriptis.svg\n   :target: https://badge.fury.io/py/inscriptis\n   :alt: PyPI version\n\n.. image:: https://pepy.tech/badge/inscriptis\n   :target: https://pepy.tech/project/inscriptis\n   :alt: PyPI downloads\n\n.. image:: https://joss.theoj.org/papers/10.21105/joss.03557/status.svg\n   :target: https://doi.org/10.21105/joss.03557\n\n\nA python based HTML to text conversion library, command line client and Web\nservice with support for **nested tables**, a **subset of CSS** and optional\nsupport for providing an **annotated output**. 
\n\nInscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.\n\nPlease take a look at the\n`Rendering <https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md>`_\ndocument for a demonstration of inscriptis' conversion quality.\n\nA Java port of inscriptis 1.x has been published by\n`x28 <https://github.com/x28/inscriptis-java>`_.\n\nThis document provides a short introduction to Inscriptis. \n\n- The full documentation is built automatically and published on `Read the Docs <https://inscriptis.readthedocs.org/en/latest/>`_. \n- If you are interested in a more general overview on the topic of *text extraction from HTML*, this `blog post on different HTML to text conversion approaches, and criteria for selecting them <https://www.semanticlab.net/linux/big%20data/knowledge%20extraction/Extracting-text-from-HTML-with-Python/>`_ might be interesting to you.\n\n.. contents:: Table of contents\n\nStatement of need - why inscriptis?\n===================================\n\n1. Inscriptis provides a **layout-aware** conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. \n\n   Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.\n\n   Beautiful Soup's ``get_text()`` function, for example, converts the following HTML enumeration to the string ``firstsecond``.\n\n   .. code-block:: HTML\n   \n      <ul>\n        <li>first</li>\n        <li>second</li>\n      <ul>\n\n\n   Inscriptis, in contrast, not only returns the correct output\n   \n   .. 
code-block::\n   \n      * first\n      * second\n\n   but also supports much more complex constructs such as nested tables and also interprets a subset of HTML (e.g., ``align``, ``valign``) and CSS (e.g., ``display``, ``white-space``, ``margin-top``, ``vertical-align``, etc.) attributes that determine the text alignment. Any time the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis) an accurate HTML to text conversion is essential.\n\n2. Inscriptis supports `annotation rules <#annotation-rules>`_, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to\n\n   - provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.\n   - assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). ``Inscriptis`` supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool `doccano <https://github.com/doccano/doccano>`_.\n   - enabling the use of ``Inscriptis``  for tasks such as content extraction (i.e., extract task-specific relevant content from a Web page) which rely on information on the HTML document's structure.\n\n\nInstallation\n============\n\nAt the command line::\n\n    $ pip install inscriptis\n\nOr, if you don't have pip installed::\n\n    $ easy_install inscriptis\n\n\nPython library\n==============\n\nEmbedding inscriptis into your code is easy, as outlined below:\n\n.. 
code-block:: python\n   \n   import urllib.request\n   from inscriptis import get_text\n   \n   url = \"https://www.fhgr.ch\"\n   html = urllib.request.urlopen(url).read().decode('utf-8')\n   \n   text = get_text(html)\n   print(text)\n\n\nStandalone command line client\n==============================\nThe command line client converts HTML files or text retrieved from Web pages to\nthe corresponding text representation.\n\n\nCommand line parameters\n-----------------------\n\nThe inscript command line client supports the following parameters::\n\n    usage: inscript [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]\n                       [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]\n                       [input]\n\n    Convert the given HTML document to text.\n\n    positional arguments:\n      input                 Html input either from a file or a URL (default:stdin).\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -o OUTPUT, --output OUTPUT\n                            Output file (default:stdout).\n      -e ENCODING, --encoding ENCODING\n                            Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).\n      -i, --display-image-captions\n                            Display image captions (default:false).\n      -d, --deduplicate-image-captions\n                            Deduplicate image captions (default:false).\n      -l, --display-link-targets\n                            Display link targets (default:false).\n      -a, --display-anchor-urls\n                            Display anchor URLs (default:false).\n      -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES\n                            Path to an optional JSON file containing rules for annotating the retrieved text.\n      -p POSTPROCESSOR, --postprocessor POSTPROCESSOR\n                            Optional component for 
postprocessing the result (html, surface, xml).\n      --indentation INDENTATION\n                            How to handle indentation (extended or strict; default: extended).\n      --table-cell-separator TABLE_CELL_SEPARATOR\n                            Separator to use between table cells (default: three spaces).\n      -v, --version         display version information\n\n   \n\nHTML to text conversion\n-----------------------\nconvert the given page to text and output the result to the screen::\n\n  $ inscript https://www.fhgr.ch\n   \nconvert the file to text and save the output to fhgr.txt::\n\n  $ inscript fhgr.html -o fhgr.txt\n\nconvert the file using strict indentation (i.e., minimize indentation and extra spaces) and save the output to fhgr-layout-optimized.txt::\n\n  $ inscript --indentation strict fhgr.html -o fhgr-layout-optimized.txt\n   \nconvert HTML provided via stdin and save the output to output.txt::\n\n  $ echo \"<body><p>Make it so!</p></body>\" | inscript -o output.txt \n\n\nHTML to annotated text conversion\n---------------------------------\nconvert and annotate HTML from a Web page using the provided annotation rules. \n\nDownload the example `annotation-profile.json <https://github.com/weblyzard/inscriptis/blob/master/examples/annotation/annotation-profile.json>`_ and save it to your working directory::\n\n  $ inscript https://www.fhgr.ch -r annotation-profile.json\n\nThe annotation rules are specified in `annotation-profile.json`:\n\n.. code-block:: json\n\n   {\n    \"h1\": [\"heading\", \"h1\"],\n    \"h2\": [\"heading\", \"h2\"],\n    \"b\": [\"emphasis\"],\n    \"div#class=toc\": [\"table-of-contents\"],\n    \"#class=FactBox\": [\"fact-box\"],\n    \"#cite\": [\"citation\"]\n   }\n\nThe dictionary maps an HTML tag and/or attribute to the annotations\ninscriptis should provide for them. 
In the example above, for instance, the tag\n``h1`` yields the annotations ``heading`` and ``h1``, a ``div`` tag with a\n``class`` that contains the value ``toc`` results in the annotation\n``table-of-contents``, and all tags with a ``cite`` attribute are annotated with\n``citation``.\n\nGiven these annotation rules the HTML file\n\n.. code-block:: HTML\n\n   <h1>Chur</h1>\n   <b>Chur</b> is the capital and largest town of the Swiss canton of the\n   Grisons and lies in the Grisonian Rhine Valley.\n\nyields the following JSONL output\n\n.. code-block:: json\n\n   {\"text\": \"Chur\\n\\nChur is the capital and largest town of the Swiss canton\n             of the Grisons and lies in the Grisonian Rhine Valley.\",\n    \"label\": [[0, 4, \"heading\"], [0, 4, \"h1\"], [6, 10, \"emphasis\"]]}\n\nThe provided list of labels contains all annotated text elements with their\nstart index, end index and the assigned label.\n\n\nAnnotation postprocessors\n-------------------------\nAnnotation postprocessors enable the post processing of annotations to formats\nthat are suitable for your particular application. Post processors can be\nspecified with the ``-p`` or ``--postprocessor`` command line argument::\n\n  $ inscript https://www.fhgr.ch \\\n          -r ./annotation/examples/annotation-profile.json \\\n          -p surface\n\n\nOutput:\n\n.. 
code-block:: json\n\n   {\"text\": \"  Chur\\n\\n  Chur is the capital and largest town of the Swiss\n             canton of the Grisons and lies in the Grisonian Rhine Valley.\",\n    \"label\": [[0, 6, \"heading\"], [8, 14, \"emphasis\"]],\n    \"tag\": \"<heading>Chur</heading>\\n\\n<emphasis>Chur</emphasis> is the\n           capital and largest town of the Swiss canton of the Grisons and\n           lies in the Grisonian Rhine Valley.\"}\n\n\n\nCurrently, inscriptis supports the following postprocessors:\n\n- surface: returns a list of mappings between the annotation's surface form and its label::\n\n    [\n       ['heading', 'Chur'], \n       ['emphasis', 'Chur']\n    ]\n\n- xml: returns an additional annotated text version::\n\n    <?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n    <heading>Chur</heading>\n\n    <emphasis>Chur</emphasis> is the capital and largest town of the Swiss\n    canton of the Grisons and lies in the Grisonian Rhine Valley.\n\n- html: creates an HTML file which contains the converted text and highlights all annotations as outlined below:\n\n.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/paper/images/annotations.png\n   :align: left\n   :alt: Annotations extracted from the Wikipedia entry for Chur with the ``--postprocessor html`` postprocessor.\n\n   Snippet of the rendered HTML file created with the following command line options and annotation rules:\n\n   .. code-block:: bash\n\n      inscript --annotation-rules ./wikipedia.json \\\n                  --postprocessor html \\\n                  https://en.wikipedia.org/wiki/Chur.html\n\n   Annotation rules encoded in the ``wikipedia.json`` file:\n\n   .. 
code-block:: json\n\n      {\n        \"h1\": [\"heading\"],\n        \"h2\": [\"heading\"],\n        \"h3\": [\"subheading\"],\n        \"h4\": [\"subheading\"],\n        \"h5\": [\"subheading\"],\n        \"i\": [\"emphasis\"],\n        \"b\": [\"bold\"],\n        \"table\": [\"table\"],\n        \"th\": [\"tableheading\"],\n        \"a\": [\"link\"]\n      } \n\n\nWeb Service\n===========\n\nA FastAPI-based Web Service that uses Inscriptis for translating HTML pages to plain text.\n\nRun the Web Service on your host system\n---------------------------------------\nInstall the optional feature ``web-service`` for inscriptis::\n  \n  $ pip install inscriptis[web-service]\n\nStart the Inscriptis Web service with the following command::\n\n  $ uvicorn inscriptis.service.web:app --port 5000 --host 127.0.0.1\n\n\nRun the Web Service with Docker\n-------------------------------\n\nThe Docker image can be found `here <https://github.com/weblyzard/inscriptis/pkgs/container/inscriptis>`_::\n  \n  $ docker pull ghcr.io/weblyzard/inscriptis:latest\n  $ docker run --name inscriptis ghcr.io/weblyzard/inscriptis:latest\n\nRun as Kubernetes Deployment\n--------------------------------------\n\nThe Helm chart for deployment on a Kubernetes cluster is located in the `inscriptis-helm repository <https://github.com/weblyzard/inscriptis-helm>`_.\n\nUse the Web Service\n-------------------\n\nThe Web service receives the HTML file in the request body and returns the\ncorresponding text. 
The file's encoding needs to be specified\nin the ``Content-Type`` header (``UTF-8`` in the example below)::\n\n  $ curl -X POST  -H \"Content-Type: text/html; encoding=UTF8\"  \\\n          --data-binary @test.html  http://localhost:5000/get_text\n\nThe service also supports a version call::\n\n  $ curl http://localhost:5000/version\n\n\nExample annotation profiles\n===========================\n\nThe following section provides a number of example annotation profiles illustrating the use of Inscriptis' annotation support.\nThe examples present the used annotation rules and an image that highlights a snippet with the annotated text on the converted web page, which has been \ncreated using the HTML postprocessor as outlined in Section `annotation postprocessors <#annotation-postprocessors>`_.\n\nWikipedia tables and table metadata\n-----------------------------------\n\n\nThe following annotation rules extract tables from Wikipedia pages, and annotate table headings that are typically used to indicate column or row headings.\n\n.. code-block:: json\n\n   {\n      \"table\": [\"table\"],\n      \"th\": [\"tableheading\"],\n      \"caption\": [\"caption\"]\n   }\n\nThe figure below outlines an example table from Wikipedia that has been annotated using these rules.\n\n.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-table-annotation.png\n   :alt: Table and table metadata annotations extracted from the Wikipedia entry for Chur.\n\n\nReferences to entities, missing entities and citations from Wikipedia\n---------------------------------------------------------------------\n\nThis profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn't perfect, since it also annotates ``[ edit ]`` links.\n\n.. 
code-block:: json\n\n   {\n      \"a#title\": [\"entity\"],\n      \"a#class=new\": [\"missing\"],\n      \"#class=reference\": [\"citation\"]\n   }\n\nThe figure shows entities and citations that have been identified on a Wikipedia page using these rules.\n\n.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-entry-annotation.png\n   :alt: Metadata on entries, missing entries and citations extracted from the Wikipedia entry for Chur.\n\n\nPosts and post metadata from the XDA developer forum\n----------------------------------------------------\n\nThe annotation rules below extract posts with metadata on the post's time, user and the user's job title from the XDA developer forum.\n\n.. code-block:: json\n\n   {\n       \"article#class=message-body\": [\"article\"],\n       \"li#class=u-concealed\": [\"time\"],\n       \"#itemprop=name\": [\"user-name\"],\n       \"#itemprop=jobTitle\": [\"user-title\"]\n   }\n\nThe figure illustrates the annotated metadata on posts from the XDA developer forum.\n\n.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/xda-posts-annotation.png\n   :alt: Posts and post metadata extracted from the XDA developer forum.\n\n\n\nCode and metadata from Stackoverflow pages\n------------------------------------------\nThe rules below extract code and metadata on users and comments from Stackoverflow pages.\n\n.. code-block:: json\n\n   {\n      \"code\": [\"code\"],\n      \"#itemprop=dateCreated\": [\"creation-date\"],\n      \"#class=user-details\": [\"user\"],\n      \"#class=reputation-score\": [\"reputation\"],\n      \"#class=comment-date\": [\"comment-date\"],\n      \"#class=comment-copy\": [\"comment-comment\"]\n   }\n\nApplying these rules to a Stackoverflow page on text extraction from HTML yields the following snippet:\n\n.. 
figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/stackoverflow-code-annotation.png\n   :alt: Code and metadata from Stackoverflow pages.\n\n\nAdvanced topics\n===============\n\nAnnotated text\n--------------\nInscriptis can provide annotations alongside the extracted text, which allows\ndownstream components to draw upon semantics that are only available in\nthe original HTML file.\n\nThe extracted text and annotations can be exported in different formats,\nincluding the popular JSONL format which is used by\n`doccano <https://github.com/doccano/doccano>`_.\n\nExample output:\n\n.. code-block:: json\n\n   {\"text\": \"Chur\\n\\nChur is the capital and largest town of the Swiss canton\n             of the Grisons and lies in the Grisonian Rhine Valley.\",\n    \"label\": [[0, 4, \"heading\"], [0, 4, \"h1\"], [6, 10, \"emphasis\"]]}\n\nThe output above is produced if inscriptis is run with the following\nannotation rules:\n\n.. code-block:: json\n\n   {\n    \"h1\": [\"heading\", \"h1\"],\n    \"b\": [\"emphasis\"]\n   }\n\nThe code below demonstrates how inscriptis' annotation capabilities can\nbe used within a program:\n\n.. code-block:: python\n\n  import urllib.request\n  from inscriptis import get_annotated_text\n  from inscriptis.model.config import ParserConfig\n\n  url = \"https://www.fhgr.ch\"\n  html = urllib.request.urlopen(url).read().decode('utf-8')\n\n  # map HTML tags to the annotation labels they should receive\n  rules = {'h1': ['heading', 'h1'],\n           'h2': ['heading', 'h2'],\n           'b': ['emphasis'],\n           'table': ['table']\n          }\n\n  output = get_annotated_text(html, ParserConfig(annotation_rules=rules))\n  print(\"Text:\", output['text'])\n  print(\"Annotations:\", output['label'])\n\nFine tuning\n-----------\n\nThe following options are available for fine tuning inscriptis' HTML rendering:\n\n1. 
**More rigorous indentation:** call ``inscriptis.get_text()`` with the\n   parameter ``indentation='extended'`` to also use indentation for tags such as\n   ``<div>`` and ``<span>`` that do not provide indentation in their standard\n   definition. This strategy is the default in ``inscript`` and many other\n   tools such as Lynx. If you do not want extended indentation, you can use the\n   parameter ``indentation='strict'`` instead.\n\n2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions\n   that are maintained in ``inscriptis.css_profiles.CSS_PROFILES`` for rendering\n   HTML tags. You can override these definitions (and therefore change the\n   rendering) as outlined below:\n\n.. code-block:: python\n\n      from lxml.html import fromstring\n      from inscriptis.css_profiles import CSS_PROFILES, HtmlElement\n      from inscriptis.html_engine import Inscriptis\n      from inscriptis.html_properties import Display\n      from inscriptis.model.config import ParserConfig\n      \n      # create a custom CSS based on the strict CSS profile and change the\n      # rendering of `div` and `span` elements\n      css = CSS_PROFILES['strict'].copy()\n      css['div'] = HtmlElement(display=Display.block, padding=2)\n      css['span'] = HtmlElement(prefix=' ', suffix=' ')\n      \n      # illustrative input; replace with the HTML you want to convert\n      html = '<div>first</div><span>second</span>'\n      html_tree = fromstring(html)\n      # create a parser using a custom css\n      config = ParserConfig(css=css)\n      parser = Inscriptis(html_tree, config)\n      text = parser.get_text()\n\n\nCustom HTML tag handling\n------------------------\n\nIf the fine-tuning options discussed above are not sufficient, you may even override Inscriptis' handling of start and end tags as outlined below:\n\n.. 
code-block:: python\n\n    from inscriptis import ParserConfig\n    from inscriptis.html_engine import Inscriptis\n    from inscriptis.model.tag import CustomHtmlTagHandlerMapping\n\n    my_mapping = CustomHtmlTagHandlerMapping(\n        start_tag_mapping={'a': my_handle_start_a},\n        end_tag_mapping={'a': my_handle_end_a}\n    )\n    inscriptis = Inscriptis(html_tree, \n                            ParserConfig(custom_html_tag_handler_mapping=my_mapping))\n    text = inscriptis.get_text()\n\t\t\n\nIn the example the standard HTML handlers for the ``a`` tag are overwritten with custom versions (i.e., ``my_handle_start_a`` and ``my_handle_end_a``).\nYou may define custom handlers for any tag, regardless of whether it already exists in the standard mapping.\n\nPlease refer to `custom-html-handling.py <https://github.com/weblyzard/inscriptis/blob/master/examples/custom-html-handling.py>`_ for a working example. \nThe standard HTML tag handlers can be found in the `inscriptis.model.tag <https://github.com/weblyzard/inscriptis/blob/master/src/inscriptis/model/tag>`_ package.\n\nOptimizing memory consumption\n-----------------------------\n\nInscriptis uses the Python lxml library which prefers to reuse memory rather than release it to the operating system. This behavior might lead to an increased memory consumption, if you use inscriptis within a Web service that parses very complex HTML pages.\n\nThe following code mitigates this problem on Unix systems by manually forcing lxml to release the allocated memory:\n\n.. code-block:: python\n\n   import ctypes\n   def trim_memory() -> int:\n      libc = ctypes.CDLL(\"libc.so.6\")\n      return libc.malloc_trim(0)\n\n\nExamples\n========\n\nStrict indentation handling\n---------------------------\n\nThe following example demonstrates modifying ``ParserConfig`` for strict indentation handling.\n\n.. 
code-block:: python\n\n   from inscriptis import get_text\n   from inscriptis.css_profiles import CSS_PROFILES\n   from inscriptis.model.config import ParserConfig\n\n   config = ParserConfig(css=CSS_PROFILES['strict'].copy())\n   text = get_text('fi<span>r</span>st', config)\n   print(text)\n\nIgnore elements during parsing\n------------------------------\n\nOverwriting the default CSS profile also allows changing the rendering of selected elements.\nThe snippet below, for example, removes forms from the parsed text by setting the definition of the ``form`` tag to ``Display.none``.\n\n.. code-block:: python\n\n      from inscriptis import get_text\n      from inscriptis.css_profiles import CSS_PROFILES, HtmlElement\n      from inscriptis.html_properties import Display\n      from inscriptis.model.config import ParserConfig\n\n      # create a custom CSS based on the strict CSS profile and hide\n      # `form` elements\n      css = CSS_PROFILES['strict'].copy()\n      css['form'] = HtmlElement(display=Display.none)\n\n      # sample input that contains a form\n      html = \"\"\"First line. \n                <form>\n                  User data\n                  <label for=\"name\">Name:</label><br>\n                  <input type=\"text\" id=\"name\" name=\"name\"><br>\n                  <label for=\"pass\">Password:</label><br>\n                  <input type=\"hidden\" id=\"pass\" name=\"pass\">\n                </form>\"\"\"\n      # create a parser configuration using the custom css\n      config = ParserConfig(css=css)\n      text = get_text(html, config)\n      print(text)\n\n\nCitation\n========\n\nThere is a `Journal of Open Source Software <https://joss.theoj.org>`_ `paper <https://joss.theoj.org/papers/10.21105/joss.03557>`_ you can cite for Inscriptis:\n\n.. 
code-block:: bibtex\n\n      @article{Weichselbraun2021,\n        doi = {10.21105/joss.03557},\n        url = {https://doi.org/10.21105/joss.03557},\n        year = {2021},\n        publisher = {The Open Journal},\n        volume = {6},\n        number = {66},\n        pages = {3557},\n        author = {Albert Weichselbraun},\n        title = {Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web},\n        journal = {Journal of Open Source Software}\n      }\n\n\nChangelog\n=========\n\nA full list of changes can be found in the\n`release notes <https://github.com/weblyzard/inscriptis/releases>`_.\n\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "inscriptis - HTML to text converter.",
    "version": "2.5.3",
    "project_urls": {
        "Documentation": "https://inscriptis.readthedocs.io/en",
        "Homepage": "https://github.com/weblyzard/inscriptis",
        "Repository": "https://github.com/weblyzard/inscriptis"
    },
    "split_keywords": [
        "html",
        " converter",
        " text"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3f5d642b7314560ef3f529d4d2dcc65e63f36a745d1b330a23e6d4bcf2d66974",
                "md5": "1af7d89a39b06e43fba0fb89eb6d9f1a",
                "sha256": "25962cf5a60b1a8f33e7bfbbea08a29af82299702339b9b90c538653a5c7aa38"
            },
            "downloads": -1,
            "filename": "inscriptis-2.5.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1af7d89a39b06e43fba0fb89eb6d9f1a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 45357,
            "upload_time": "2025-01-16T13:11:17",
            "upload_time_iso_8601": "2025-01-16T13:11:17.595346Z",
            "url": "https://files.pythonhosted.org/packages/3f/5d/642b7314560ef3f529d4d2dcc65e63f36a745d1b330a23e6d4bcf2d66974/inscriptis-2.5.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5aaa15cefc93fe3ee06f4c7c0a054054892cbd2f7fdd0513cff5bdd473c0adfc",
                "md5": "125123dd1b1ee5f939c57cb975392f7d",
                "sha256": "256043caa13e4995c71fafdeadec4ac42b57f3914cb41023ecbee8bc27ca1cc0"
            },
            "downloads": -1,
            "filename": "inscriptis-2.5.3.tar.gz",
            "has_sig": false,
            "md5_digest": "125123dd1b1ee5f939c57cb975392f7d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 41439,
            "upload_time": "2025-01-16T13:11:20",
            "upload_time_iso_8601": "2025-01-16T13:11:20.136341Z",
            "url": "https://files.pythonhosted.org/packages/5a/aa/15cefc93fe3ee06f4c7c0a054054892cbd2f7fdd0513cff5bdd473c0adfc/inscriptis-2.5.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-16 13:11:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "weblyzard",
    "github_project": "inscriptis",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "tox": true,
    "lcname": "inscriptis"
}

Albert Weichselbraun