:Name: FoLiA-tools
:Version: 2.5.8
:Home page: https://proycon.github.io/folia
:Summary: FoLiA-tools contains various Python-based command line tools for working with FoLiA XML (Format for Linguistic Annotation)
:Upload time: 2024-10-17 14:54:17
:Author: Maarten van Gompel
:License: GPL-3.0-only
:Keywords: nlp, computational linguistics, search, folia, annotation


.. image:: https://github.com/proycon/foliatools/actions/workflows/foliatools.yml/badge.svg?branch=master
   :target: https://github.com/proycon/foliatools/actions/

.. image:: http://applejack.science.ru.nl/lamabadge.php/foliatools
   :target: http://applejack.science.ru.nl/languagemachines/

.. image:: https://www.repostatus.org/badges/latest/active.svg
   :alt: Project Status: Active – The project has reached a stable, usable state and is being actively developed.
   :target: https://www.repostatus.org/#active

.. image:: https://img.shields.io/pypi/v/folia-tools
   :alt: Latest release in the Python Package Index
   :target: https://pypi.org/project/folia-tools/

FoLiA Tools
=================

A number of command-line tools are readily available for working with FoLiA, to various ends. The following tools are currently available:

- ``foliavalidator`` -- Tests if documents are valid FoLiA XML. **Always use this to test your documents if you produce your own FoLiA documents!** See the extra documentation in the dedicated section below.
- ``foliaquery`` -- Advanced query tool that searches FoLiA documents for a specified pattern, or modifies a document according to the query. Supports FQL (FoLiA Query Language) and CQL (Corpus Query Language).
- ``foliaeval`` -- Evaluation tool that computes various evaluation metrics for selected annotation types, either against
  a gold standard reference or as a measure of inter-annotator agreement.
- ``folia2txt`` -- Convert FoLiA XML to plain text (pure text, without any annotations). Use this to extract plain text
  from any FoLiA document.
- ``folia2annotatedtxt`` -- Like ``folia2txt``, but also outputs simple
  token annotations inline, by appending them directly to the word using a specific delimiter.
- ``folia2columns`` -- This conversion tool reads a FoLiA XML document
  and produces a simple columned output format (including CSV) in which each token appears on one line. Note that only simple token annotations are supported and a lot of FoLiA data cannot be intuitively expressed in a simple columned format!
- ``folia2html`` -- Converts a FoLiA document to a semi-interactive HTML document, with limited support for certain token annotations.
- ``folia2dcoi`` -- Convert FoLiA XML to D-Coi XML (only for annotations supported by D-Coi)
- ``foliatree`` -- Outputs the hierarchy of a FoLiA document.
- ``foliacat`` -- Concatenate multiple FoLiA documents.
- ``foliacount`` -- This script reads a FoLiA XML document and counts certain structure elements.
- ``foliacorrect`` -- A tool to deal with corrections in FoLiA. It can automatically accept suggestions, or strip all corrections so that parsers that don't know how to handle corrections can process the document.
- ``foliaerase`` -- Erases one or more specified annotation types from the FoLiA document.
- ``folialangid`` -- Does language detection on FoLiA documents and assigns language identifiers to different substructures.
- ``foliaid`` -- Assigns IDs to elements in FoLiA documents. Use this to automatically generate identifiers on certain (or all) elements.
- ``foliafreqlist`` -- Output a frequency list on tokenised FoLiA documents.
- ``foliamerge`` -- Merges annotations from two or more FoLiA documents.
- ``foliatextcontent`` -- A tool for adding or stripping text redundancy (i.e. text associated with multiple structural levels), supports computing and adding offset information. Use this if you want to have text available on a different level (e.g. the global text level).
- ``foliaupgrade`` -- Upgrades a document to the latest FoLiA version.
- ``alpino2folia`` -- Convert Alpino-DS XML to FoLiA XML
- ``dcoi2folia`` -- Convert D-Coi XML to FoLiA XML
- ``conllu2folia`` -- Convert files in the `CoNLL-U format <http://universaldependencies.org/format.html>`_ to FoLiA XML.
- ``rst2folia`` -- Convert reStructuredText, a lightweight non-intrusive text markup language, to FoLiA, using `docutils <http://docutils.sourceforge.net/>`_.
- ``tei2folia`` -- Convert a subset of TEI to FoLiA. See the extra documentation in the section below.
- ``folia2salt`` -- Convert FoLiA XML to `Salt <https://corpus-tools.org/salt/>`_, which in turn enables further conversions (annis, paula, TCF, TigerXML, and others) through `Pepper <https://corpus-tools.org/pepper/>`_. See the extra documentation in the dedicated section below.
- ``folia2stam`` -- Convert FoLiA XML to `STAM <https://github.com/annotation/stam>`_, a standoff annotation model. Retains FoLiA vocabulary and enables further conversion to e.g. W3C Web Annotations.


All of these tools are written in Python, and thus require a Python 3 installation to run. More tools are added as time progresses.

Installation
---------------

The FoLiA tools are published on the Python Package Index and can be installed with ``pip``. From the command line, type::

  $ pip install folia-tools

You may need to use ``pip3`` to ensure you install for Python 3. Prefix the command with ``sudo`` to install the tools globally
on your system, but we strongly recommend using virtualenv to create a self-contained Python environment instead.

The FoLiA tools are also included in our `LaMachine distribution <https://proycon.github.io/lamachine>`_.
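
As an illustration of the self-contained environment recommended above, here is a minimal sketch that uses Python's
standard ``venv`` module (a stand-in for virtualenv, not the same tool) and then runs ``pip`` inside the new environment.
The function names and the ``folia-env`` directory are hypothetical, and the install step needs network access:

```python
import subprocess
import sys
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the path of the Python interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_folia_tools(env_dir: Path) -> None:
    """Create a virtual environment in env_dir and install folia-tools into it."""
    # with_pip=True bootstraps pip inside the new environment.
    venv.create(env_dir, with_pip=True)
    subprocess.run(
        [str(env_python(env_dir)), "-m", "pip", "install", "folia-tools"],
        check=True,  # raise CalledProcessError if pip fails
    )
```

Calling ``install_folia_tools(Path("folia-env"))`` should leave the command-line tools in the environment's ``bin``
(or ``Scripts``) directory; activate the environment or call them by full path.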


Installation Troubleshooting
-------------------------------

If ``pip`` is not yet available, install it as follows:

On Debian/Ubuntu-based systems::

  $ sudo apt-get install python3-pip

On RedHat-based systems::

  $ yum install python3-pip

On Arch Linux systems::

  $ pacman -Syu python-pip

Usage
-------

To get help on any of the FoLiA tools, pass the ``-h`` option on the command line to the tool you intend to use. This prints a summary of the available options and usage examples. Most tools can run on a single FoLiA document as well as on a whole directory of documents, optionally recursing into subdirectories. The tools generally take one or more file names or directory names as parameters.
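
If you drive the tools from a script, the ``-h`` convention described above can be wrapped as in the following sketch.
The helper name ``tool_help`` is hypothetical; the only thing taken from this documentation is that each tool accepts ``-h``:

```python
import shutil
import subprocess
from typing import Optional


def tool_help(tool: str) -> Optional[str]:
    """Return the ``-h`` output of a command-line tool, or None if it is not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "-h"], capture_output=True, text=True)
    # Help text normally goes to stdout; fall back to stderr just in case.
    return result.stdout or result.stderr


if __name__ == "__main__":
    for name in ("foliavalidator", "folia2txt"):
        text = tool_help(name)
        print(name, "->", "not installed" if text is None else "help available")
```

Checking ``shutil.which`` first turns a missing installation into a ``None`` result rather than an exception.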

More about FoLiA?
--------------------

Please consult the FoLiA website at https://proycon.github.io/folia for more!

Specific Tools
-------------------

This section contains some extra important information for a few of the included tools.


Validating FoLiA documents using foliavalidator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FoLiA validator is an essential tool for anybody working with FoLiA. It is very important that FoLiA documents are
properly validated before they are published. This ensures that tools know what to expect when they get a FoLiA document
as input for processing, and are not confronted with the nasty surprises that are far too common in the field. The degree of
formal validation offered by FoLiA is something that sets it apart from many alternative annotation formats. The key
tool to perform validation is ``foliavalidator`` (or its alternative C++ implementation ``folialint``, part of `FoLiA-utils <https://github.com/LanguageMachines/foliautils/>`_).

Validation can proceed on two levels:

1. **shallow validation** - Validates the full FoLiA document, checks if all elements are valid FoLiA elements,
   properly used, and if the document structure is valid. Checks if all the proper annotation declarations are present
   and if there are no inconsistencies in the text if text is specified on multiple levels (text redundancy). Note that
   shallow validation already does way more than validation against the RelaxNG Schema does.
2. **deep validation** - Does all of the above, but in addition it also checks the actual tagsets used. It checks if all
   declarations refer to valid set definitions, if all used classes (aka tags/labels) are valid according to the declared set definitions, and if the combination of certain classes is valid according to the set definition.

Note that validation against merely the RelaxNG schema could be called naive validation and is **NOT** considered sufficient FoLiA validation for most intents and purposes.

Shallow validation is invoked as: ``$ foliavalidator document.folia.xml``.
Deep validation is invoked as: ``$ foliavalidator --deep document.folia.xml``.
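
The two invocations above can also be scripted. The sketch below builds exactly those command lines and runs them with
``subprocess``. The helper names are hypothetical, and treating exit status 0 as "valid" is an assumption made here for
illustration, not something this documentation states:

```python
import subprocess
from typing import List


def validator_command(path: str, deep: bool = False) -> List[str]:
    """Build the foliavalidator command line for shallow or deep validation."""
    cmd = ["foliavalidator"]
    if deep:
        cmd.append("--deep")
    cmd.append(path)
    return cmd


def is_valid(path: str, deep: bool = False) -> bool:
    """Run foliavalidator on one document.

    Assumption: an exit status of 0 means the document validated.
    Raises FileNotFoundError if foliavalidator is not installed.
    """
    result = subprocess.run(validator_command(path, deep), capture_output=True, text=True)
    return result.returncode == 0
```

For example, ``validator_command("document.folia.xml", deep=True)`` yields the deep-validation command shown above as a list of arguments.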

In addition to validating, the foliavalidator tool is capable of automatically fixing certain validation problems when
explicitly asked to do so, such as automatically declaring missing annotations.

Another feature of the validator is that it can act as a converter, converting FoLiA documents to `explicit form <https://folia.readthedocs.io/en/latest/form.html>`_ (using the ``--explicit`` parameter). Explicit form is a more verbose form of XML serialisation that is easier for certain tools to parse, as it makes explicit certain details that are left implicit in normal form.


TEI to FoLiA conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^

The TEI P5 guidelines (`Text Encoding Initiative <https://tei-c.org/>`_) specify a widely used encoding method for
machine-readable texts. TEI is primarily a format for capturing text structure and markup in great detail, but there are
some facilities for linguistic annotation too. The sheer flexibility and complexity of TEI lead to many different TEI
dialects, and consequently implementing support for all of TEI in a tool is an almost impossible task. FoLiA is
more constrained than TEI with regard to structural and markup annotation, but places more focus on linguistic
annotation.

The ``tei2folia`` tool performs conversion from a (sizable) subset of TEI to FoLiA, but provides no guarantee that all
TEI P5 documents can be processed. Some notable things that are supported:

* Conversion of text structure including divisions, paragraphs, headers & titles, lists, figures, tables (limited), front matter, back
  matter
* Verse text (limited, no metrical analysis etc), line groups (``<lg>``)
* Gaps
* Text markup (highlighting, ``<hi>``), emphasis, foreign, term, mentioned, names and places

  * Limited corrections

* Conversion of `lightweight linguistic annotation <https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.linguistic.html>`_.
* Linguistic segments: sentences (``<s>``) & words (``<w>``), but **not** ``<cl>`` or ``<phr>``.

  * Basic tokenisation (spacing) information (TEI's ``@join`` attribute)

* Limited metadata

Specifically not supported (yet), non-exhaustive list:

* Graphs and trees
* Milestones
* Span groups, interpretation groups, link groups (``<spanGrp>``, ``<interpGrp>``, ``<linkGrp>``)
* Speech
* Contextual information
* Feature structures (``<fs>``, ``<f>``)
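
Because some TEI elements are not converted, it can help to scan a document for them before running ``tei2folia``.
The following sketch does that with Python's standard library. It is not part of FoLiA-tools; the function names are
hypothetical, and the element set only contains the tags explicitly named in the lists above, so it is not exhaustive:

```python
import xml.etree.ElementTree as ET
from typing import Set

# Element names the lists above name as not supported by tei2folia.
UNSUPPORTED = {"spanGrp", "interpGrp", "linkGrp", "fs", "f", "cl", "phr"}


def local_name(tag: str) -> str:
    """Strip an XML namespace in Clark notation: '{ns}name' -> 'name'."""
    return tag.rsplit("}", 1)[-1]


def unsupported_elements(tei_xml: str) -> Set[str]:
    """Return the unsupported element names that occur in a TEI document."""
    root = ET.fromstring(tei_xml)
    return {
        local_name(el.tag)
        for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) in UNSUPPORTED
    }


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><s><w>Hello</w> <phr><w>big</w> <w>world</w></phr></s></p>
<fs><f name="pos"/></fs>
</body></text></TEI>"""

print(sorted(unsupported_elements(SAMPLE)))  # ['f', 'fs', 'phr']
```

An empty result does not guarantee a successful conversion, since the list of unsupported features is non-exhaustive.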

FoLiA to STAM
^^^^^^^^^^^^^^^^^^^^^^^^^^

`STAM <https://annotation.github.io/stam>`__ is a stand-off model for text
annotation. It does not prescribe any vocabulary at all but allows one to
reuse existing vocabularies. The ``folia2stam`` tool converts FoLiA documents to
STAM, preserving the vocabulary that FoLiA predefines regarding annotation types, common attributes, etc.

**Supported:**

* Conversion of text structure including divisions, paragraphs, headers & titles, lists, figures, tables (limited), front matter, back
  matter.
* Conversion of inline and span annotation

**Not supported yet:**

* Only tokenised documents (i.e. with word elements) are implemented currently
* Conversion of text markup annotation
* Certain higher-order annotation is not converted yet
* No explicit tree structure is built yet for hierarchical annotations like syntax annotation
* Do note that there is no conversion back from STAM to FoLiA XML currently (that would be complicated for multiple reasons, so might never be realized).

**Vocabulary conversion:**

Both FoLiA and STAM have the notion of a *set* or *annotation dataset*. In
FoLiA the scope of such a set is to define the vocabulary used for a particular
annotation type (e.g. a tagset). FoLiA itself already defines what annotation
types exist. In STAM an annotation dataset is a broader notion and all
vocabulary, even the notion of a word or sentence, comes from a set, as nothing
is predefined at all aside from the STAM model's primitives.

We map most of the vocabulary of FoLiA itself to a STAM dataset with ID
``https://w3id.org/folia/v2/``. All of FoLiA's annotation types, element types, and
common attributes are defined in this set.

Each FoLiA set definition maps to a STAM dataset with the same set ID (URI). The
STAM dataset defines a ``class`` key that corresponds to FoLiA's *class*
attribute. Any FoLiA subsets (for features) also translate to key identifiers.

The declarations inside a FoLiA document will be explicitly expressed in STAM as well;
each STAM dataset will have an annotation that points to it (with a
DataSetSelector). This annotation has data with key ``declaration`` (set
``https://w3id.org/folia/v2/``) that marks it as a declaration for a specific type.
The value is something like ``pos-annotation`` and corresponds one-to-one to the declaration
element used in FoLiA XML. Additionally, this annotation also has data with key
``annotationtype`` (same set as above), where the value corresponds to the
annotation type (lowercased, e.g. ``pos``).
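
To make the shape of such a declaration annotation concrete, the sketch below builds it as a plain Python dictionary.
This is an illustration only: the dictionary layout and the function name are hypothetical and are **not** STAM's
actual serialisation or API, and deriving the ``declaration`` value as ``<type>-annotation`` is an assumption
generalised from the ``pos-annotation`` example above:

```python
FOLIA_SET = "https://w3id.org/folia/v2/"


def declaration_annotation(annotationtype: str, dataset_id: str) -> dict:
    """Illustrate the data carried by the annotation that declares a dataset.

    The layout is hypothetical; only the keys 'declaration' and 'annotationtype',
    the FoLiA set ID and the DataSetSelector come from the description above.
    """
    kind = annotationtype.lower()
    return {
        # The annotation points at the whole dataset rather than at text.
        "target": {"selector": "DataSetSelector", "dataset": dataset_id},
        "data": [
            # Assumption: the declaration element is named '<type>-annotation'.
            {"set": FOLIA_SET, "key": "declaration", "value": f"{kind}-annotation"},
            {"set": FOLIA_SET, "key": "annotationtype", "value": kind},
        ],
    }


# 'https://example.org/sets/pos' is a made-up set ID for demonstration.
example = declaration_annotation("POS", "https://example.org/sets/pos")
print(example["data"][0]["value"], example["data"][1]["value"])  # pos-annotation pos
```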

The FoLiA to STAM conversion is RDF-ready. That is, all identifiers are valid
IRIs and all FoLiA vocabulary (``https://w3id.org/folia/v2/``) is backed by `a formal ontology <https://github.com/proycon/folia/blob/master/schemas/folia.ttl>`_ using RDF and SKOS.

FoLiA set definitions, if defined, are already in SKOS (or in the legacy
format).

Being RDF-ready means that the STAM model produced by ``folia2stam`` can in turn
easily be exported to W3C Web Annotations. Tooling for that conversion will
be provided in `Stam Tools <https://github.com/annotation/stam-tools>`_.



FoLiA to Salt
^^^^^^^^^^^^^^^^^^^^^^^^^^

`Salt <https://corpus-tools.org/salt/>`_ is a graph-based annotation model that is designed to act as an intermediate
format in the conversion between various annotation formats. It is used by the conversion tool `Pepper <https://corpus-tools.org/pepper/>`_. Our FoLiA to Salt converter, however, is a standalone tool that is part of these FoLiA tools, rather than integrated into Pepper. You can use ``folia2salt`` to convert FoLiA XML to Salt XML and subsequently use Pepper to do conversions to other formats such as TCF, PAULA, TigerXML, GraF, Annis, etc. (there is no guarantee, though, that everything can be preserved accurately in each conversion).

The current state of this conversion is summarised below. It is, however, not
likely that this particular tool will be developed any further:

* Conversion of FoLiA tokens to salt SToken nodes

  * The converter only supports tokenised FoLiA documents

* Text extraction (from tokens) to STextualDS node and conversion to STextualRelation edges

  * preserves untokenised text only to a certain degree (using FoLiA's token spacing information only)
  * **not yet supported**: multiple text classes

* Conversion of FoLiA Inline Annotation (pos, lemma etc) to salt SAnnotation labels
* Conversion of FoLiA Structure Annotation (sentences, paragraph, etc) to salt SSpan nodes and SSpanRelation edges

  * converted structures will directly relate to the underlying token nodes rather than to a structural hierarchy like in FoLiA

* Conversion of simple FoLiA Span Annotation (entities etc) to salt SSpan nodes and SSpanRelation edges

  * Conversion of nested Span Annotation (syntax etc) to SSpan nodes and SDominanceRelation edges
  * **not yet supported**: Span Annotation including span roles (dependencies etc) to SSpan nodes and SDominanceRelation edges

* Grouping of annotation types/sets in salt SLayer nodes
* Conversion of FoLiA higher order elements:

  * Features
  * Comments
  * Descriptions
  * **not yet supported**:

    * Relations
    * Metrics
    * Span Relations
    * String annotation
    * Alternative annotation
    * Corrections

* Conversion of FoLiA phonetic content (as an extra STextualDS node and STextualRelation edges)
* Convert FoLiA native metadata
* **not yet supported**:

  * Conversion of FoLiA subtoken annotation (morphology/phonology)
  * Conversion of FoLiA references to audio/video sources and timing information
Our Salt conversion tries to preserve as much of the FoLiA as possible. We make extensive use of Salt's capacity for
specifying namespaces to hold and group the annotation type and set of an annotation. SLabel elements with the same
namespace should often be considered together.




            
