:Name: nafigator
:Version: 0.1.64
:Summary: Python package to convert spaCy and Stanza documents to NLP Annotation Format (NAF)
:Home page: https://github.com/denederlandschebank/nafigator
:Author: De Nederlandsche Bank
:Requires: Python >=3.6
:License: MIT
:Keywords: nafigator
:Requirements: click, pdfminer.six, lxml, python-docx, pandas, stanza, spacy, deepdiff, camelot-py, opencv-python, pdftopng, iribaker, Unidecode, PyMuPDF, pypdf, parameterized
:Uploaded: 2023-08-31 08:06:20
=========
nafigator
=========


.. image:: https://img.shields.io/pypi/v/nafigator.svg
        :target: https://pypi.python.org/pypi/nafigator

.. image:: https://img.shields.io/badge/License-MIT-yellow.svg
        :target: https://opensource.org/licenses/MIT
        :alt: License: MIT

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
        :target: https://github.com/psf/black
        :alt: Code style: black

**DISCLAIMER - BETA PHASE**

*This package is currently in a beta phase.*

to nafigate [ **naf**-i-geyt ]
------------------------------

    *v.intr*, **nafigated**, **nafigating**

    1. To process one or more text documents through an NLP pipeline and output the results in the NLP Annotation Format.


Features
--------

The Nafigator package allows you to store (intermediate) results and processing steps from custom-made spaCy and stanza pipelines in one format.

* Convert text files to naf-files that satisfy the NLP Annotation Format (NAF)

  - Supported input media types: application/pdf (.pdf), text/plain (.txt), text/html (.html), MS Word (.docx)

  - Supported output formats: naf-xml (.naf.xml), naf-rdf in turtle-syntax (.ttl) and xml-syntax (.rdf) (experimental)

  - Supported NLP processors: spaCy, stanza

  - Supported NAF layers: raw, text, terms, entities, deps, multiwords

* Read naf-files and access data as Python lists and dicts

When reading naf-files, Nafigator stores the data in memory as lxml ElementTrees. The lxml package provides a Pythonic binding for C libraries, so it should be very fast.
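
As an illustration of this in-memory representation, a minimal NAF-like document can be parsed with an ElementTree API. The sketch below uses the stdlib ``xml.etree.ElementTree``, which mirrors the lxml interface nafigator uses; the tiny XML snippet is a hypothetical example, not a complete NAF file:

```python
import xml.etree.ElementTree as ET

# A minimal NAF-like document: a raw layer holding the document text.
naf_xml = """<NAF xml:lang="en" version="v3.1">
  <raw>The Nafigator package allows you to store NLP output.</raw>
</NAF>"""

root = ET.fromstring(naf_xml)
raw_text = root.findtext("raw")
print(raw_text)  # The Nafigator package allows you to store NLP output.
```

With lxml, ``lxml.etree.fromstring`` offers the same ``findtext`` call, so the pattern carries over unchanged.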

The NLP Annotation Format (NAF)
-------------------------------

Key features:

* Multilayered extensible annotations;

* Reproducible NLP pipelines;

* NLP processor agnostic;

* Compatible with RDF

References:

* `NAF: the NLP Annotation Format <http://newsreader-project.eu/files/2013/01/techreport.pdf>`_

* `NAF documentation on Github <https://github.com/newsreader/NAF>`_


Current changes to NAF:

* a 'formats' layer is added with text format data (font and size) to allow text classification like header detection

* a 'model' attribute is added to LinguisticProcessors to record the model that was used

* all attributes of the 'public' element are Dublin Core elements and are mapped to the dc namespace

* attributes in a dependency relation are renamed to 'from_term' and 'to_term' ('from' is a Python reserved word)

The code of the SpaCy converter to NAF is partially based on `SpaCy-to-NAF <https://github.com/cltl/SpaCy-to-NAF>`_.

Installation
------------

To install the package

::

    pip install nafigator

To install the package from Github

::

    pip install -e git+https://github.com/denederlandschebank/nafigator.git#egg=nafigator


How to run
----------

Command line interface
~~~~~~~~~~~~~~~~~~~~~~

To parse a .pdf, .docx, .txt or .html file from the command line interface, run in the root of the project::

    python -m nafigator.cli


Function calls
~~~~~~~~~~~~~~

To convert a .pdf, .docx, .txt or .html file in Python code you can use::

    from nafigator.parse2naf import generate_naf

    doc = generate_naf(input = "../data/example.pdf",
                       engine = "stanza",
                       language = "en",
                       naf_version = "v3.1",
                       dtd_validation = False,
                       params = {'fileDesc': {'author': 'anonymous'}},
                       nlp = None)

- input: path of the document to convert to a naf document
- engine: pipeline processor, either 'spacy' or 'stanza'
- language: language code, for example 'en' or 'nl'
- naf_version: 'v3' or 'v3.1'
- dtd_validation: True or False (default: False)
- params: dictionary with additional parameters (default: {})
- nlp: custom pipeline object from spacy or stanza (default: None)

The returned object, doc, is a NafDocument from which the layers can be accessed.

Get the document and processors metadata via::

    doc.header

Output of doc.header of processed data/example.pdf::

  {
    'fileDesc': {
      'author': 'anonymous',
      'creationtime': '2021-04-25T11:28:58UTC', 
      'filename': 'data/example.pdf', 
      'filetype': 'application/pdf', 
      'pages': '2'}, 
    'public': {
      '{http://purl.org/dc/elements/1.1/}uri': 'data/example.pdf',
      '{http://purl.org/dc/elements/1.1/}format': 'application/pdf'}, 
  ...

Get the raw layer output via::

  doc.raw

Output of doc.raw of processed data/example.pdf::

  The Nafigator package allows you to store NLP output from custom made spaCy and stanza  pipelines with (intermediate) results and all processing steps in one format.  Multiwords like in 'we have set that out below' are recognized (depending on your NLP  processor).

Get the text layer output via::

  doc.text

Output of doc.text of processed data/example.pdf::

  [
    {'text': 'The', 'page': '1', 'sent': '1', 'id': 'w1', 'length': '3', 'offset': '0'}, 
    {'text': 'Nafigator', 'page': '1', 'sent': '1', 'id': 'w2', 'length': '9', 'offset': '4'}, 
    {'text': 'package', 'page': '1', 'sent': '1', 'id': 'w3', 'length': '7', 'offset': '14'}, 
    {'text': 'allows', 'page': '1', 'sent': '1', 'id': 'w4', 'length': '6', 'offset': '22'}, 
  ...
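
The ``offset`` and ``length`` attributes index directly into the raw layer text. A quick check, using the word entries shown above (with a sample prefix of the raw text) as a sketch:

```python
raw = "The Nafigator package allows you to store NLP output from custom made spaCy and stanza"

# Word entries as returned in doc.text (subset of the fields shown above).
words = [
    {'text': 'The', 'offset': '0', 'length': '3'},
    {'text': 'Nafigator', 'offset': '4', 'length': '9'},
    {'text': 'package', 'offset': '14', 'length': '7'},
    {'text': 'allows', 'offset': '22', 'length': '6'},
]

for w in words:
    start = int(w['offset'])
    end = start + int(w['length'])
    # Each word's offset/length slice of the raw text equals its 'text' value.
    assert raw[start:end] == w['text']
```

Note that the attributes are stored as strings in the naf document, so they need an ``int()`` conversion before slicing.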

Get the terms layer output via::

  doc.terms

Output of doc.terms of processed data/example.pdf::

  [
    {'id': 't1', 'lemma': 'the', 'pos': 'DET', 'type': 'open', 'morphofeat': 'Definite=Def|PronType=Art', 'targets': [{'id': 'w1'}]}, 
    {'id': 't2', 'lemma': 'Nafigator', 'pos': 'PROPN', 'type': 'open', 'morphofeat': 'Number=Sing', 'targets': [{'id': 'w2'}]}, 
    {'id': 't3', 'lemma': 'package', 'pos': 'NOUN', 'type': 'open', 'morphofeat': 'Number=Sing', 'targets': [{'id': 'w3'}]}, 
    {'id': 't4', 'lemma': 'allow', 'pos': 'VERB', 'type': 'open', 'morphofeat': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin',    
  ...
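
The ``morphofeat`` string follows the Universal Dependencies ``Feature=Value|Feature=Value`` convention and can be split into a dict. A small helper sketch (``parse_morphofeat`` is illustrative, not part of the nafigator API):

```python
def parse_morphofeat(morphofeat: str) -> dict:
    """Split a UD-style 'Feat=Val|Feat=Val' string into a dict."""
    if not morphofeat:
        return {}
    return dict(pair.split("=", 1) for pair in morphofeat.split("|"))

feats = parse_morphofeat("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(feats["Tense"])  # Pres
```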

Get the entities layer output via::

  doc.entities

Output of doc.entities of processed data/example.pdf::

  [
    {'id': 'e1', 'type': 'PRODUCT', 'text': 'Nafigator', 'targets': [{'id': 't2'}]},
    {'id': 'e2', 'type': 'CARDINAL', 'text': 'one', 'targets': [{'id': 't28'}]}
  ]

Get the deps layer output via::

    doc.deps

Output of doc.deps of processed data/example.pdf::

  [
    {'from_term': 't3', 'to_term': 't1', 'from_orth': 'package', 'to_orth': 'The', 'rfunc': 'det'}, 
    {'from_term': 't4', 'to_term': 't3', 'from_orth': 'allows', 'to_orth': 'package', 'rfunc': 'nsubj'}, 
    {'from_term': 't3', 'to_term': 't2', 'from_orth': 'package', 'to_orth': 'Nafigator', 'rfunc': 'compound'}, 
    {'from_term': 't4', 'to_term': 't5', 'from_orth': 'allows', 'to_orth': 'you', 'rfunc': 'obj'},
  ...
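
Because each dependency dict names its head (``from_term``) and dependent (``to_term``), a head-to-dependents index is easy to build. A sketch over a subset of the entries shown above:

```python
from collections import defaultdict

# Dependency entries as returned in doc.deps (orth fields omitted for brevity).
deps = [
    {'from_term': 't3', 'to_term': 't1', 'rfunc': 'det'},
    {'from_term': 't4', 'to_term': 't3', 'rfunc': 'nsubj'},
    {'from_term': 't3', 'to_term': 't2', 'rfunc': 'compound'},
    {'from_term': 't4', 'to_term': 't5', 'rfunc': 'obj'},
]

# Map each head term to its (dependent, relation) pairs.
children = defaultdict(list)
for d in deps:
    children[d['from_term']].append((d['to_term'], d['rfunc']))

print(children['t4'])  # [('t3', 'nsubj'), ('t5', 'obj')]
```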

Get the multiwords layer output via::

  doc.multiwords

Output of doc.multiwords::

  [
    {'id': 'mw1', 'lemma': 'set_out', 'pos': 'VERB', 'type': 'phrasal', 'components': [
      {'id': 'mw1.c1', 'targets': [{'id': 't37'}]}, 
      {'id': 'mw1.c2', 'targets': [{'id': 't39'}]}]}
  ]

Get the formats layer output via::

  doc.formats

Output of doc.formats::

  [ 
    {'length': '268', 'offset': '0', 'textboxes': [
      {'textlines': [
        {'texts': [
          {'font': 'CIDFont+F1', 'size': '12.000', 'length': '87', 'offset': '0', 'text': 'The Nafigator package allows you to store NLP output from custom made spaCy and stanza '
          }]
        }, 
        {'texts': [
          {'font': 'CIDFont+F1', 'size': '12.000', 'length': '77', 'offset': '88', 'text': 'pipelines with (intermediate) results and all processing steps in one format.'
  ...

Get all sentences in the document via::

  doc.sentences

Output of doc.sentences::

  [
    {'text': 'The Nafigator package allows you to store NLP output from custom made Spacy and stanza pipelines with ( intermediate ) results and all processing steps in one format .', 
    'para': ['1'], 
    'page': ['1'], 
    'span': [{'id': 'w1'}, {'id': 'w2'}, {'id': 'w3'}, {'id': 'w4'}, {'id': 'w5'}, {'id': 'w6'}, {'id': 'w7'}, {'id': 'w8'}, {'id': 'w9'}, {'id': 'w10'}, {'id': 'w11'}, {'id': 'w12'}, {'id': 'w13'}, {'id': 'w14'}, {'id': 'w15'}, {'id': 'w16'}, {'id': 'w17'}, {'id': 'w18'}, {'id': 'w19'}, {'id': 'w20'}, {'id': 'w21'}, {'id': 'w22'}, {'id': 'w23'}, {'id': 'w24'}, {'id': 'w25'}, {'id': 'w26'}, {'id': 'w27'}, {'id': 'w28'}, {'id': 'w29'}], 
    'terms': [{'id': 't1'}, {'id': 't2'}, {'id': 't3'}, {'id': 't4'}, {'id': 't5'}, {'id': 't6'}, {'id': 't7'}, {'id': 't8'}, {'id': 't9'}, {'id': 't10'}, {'id': 't11'}, {'id': 't12'}, {'id': 't13'}, {'id': 't14'}, {'id': 't15'}, {'id': 't16'}, {'id': 't17'}, {'id': 't18'}, {'id': 't19'}, {'id': 't20'}, {'id': 't21'}, {'id': 't22'}, {'id': 't23'}, {'id': 't24'}, {'id': 't25'}, {'id': 't26'}, {'id': 't27'}, {'id': 't28'}, {'id': 't29'}]}, 
  ...

Note that you get the word ids (the span) as well as the term ids of the sentence.
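
Since the span and terms lists run in parallel, zipping them gives a word-to-term mapping per sentence. A sketch (in this example the ids happen to align one-to-one; multiwords can change that):

```python
# A shortened sentence dict in the shape returned by doc.sentences.
sentence = {
    'span': [{'id': 'w1'}, {'id': 'w2'}, {'id': 'w3'}],
    'terms': [{'id': 't1'}, {'id': 't2'}, {'id': 't3'}],
}

# Pair each word id with the term id at the same position.
word_to_term = {w['id']: t['id'] for w, t in zip(sentence['span'], sentence['terms'])}
print(word_to_term)  # {'w1': 't1', 'w2': 't2', 'w3': 't3'}
```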


Adding new annotation layers
----------------------------

To add a new annotation layer with elements, start with registering the processor of the new annotations::

  lp = ProcessorElement(name="processorname", model="modelname", version="1.0", timestamp=None, beginTimestamp=None, endTimestamp=None, hostname=None)

  doc.add_processor_element("recommendations", lp)

Then get the layer and add subelements::

  layer = doc.layer("recommendations")

  data_recommendation = {'id': "recommendation1", 'subjectivity': 0.5, 'polarity': 0.25, 'span': ['t37', 't39']}

  element = doc.subelement(element=layer, tag="recommendation", data=data_recommendation)

  doc.add_span_element(element=element, data=data_recommendation)

Retrieve the recommendations with::

  doc.recommendations


Convert NAF to the NLP Interchange Format (NIF)
-----------------------------------------------

The `NLP Interchange Format (NIF) <https://github.com/NLP2RDF/ontologies>`_ is an RDF/OWL-based format that aims to achieve interoperability between NLP tools.

Here's an example::

  doc = nafigator.NafDocument().open("..//data//example.naf.xml")

  nif = nafigator.naf2nif(uri="https://mangosaurus.eu/rdf-data/nif-data/doc_1",
                          collection_uri="https://mangosaurus.eu/rdf-data/nif-data/collection",
                          doc=doc)

This results in an object that contains the rdflib Graph and can be serialized with::

  nif.graph.serialize(format="turtle")

This produces the graph in turtle syntax.

The prefixes and namespaces:

::

  @prefix dcterms: <http://purl.org/dc/terms/> .
  @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
  @prefix olia: <http://purl.org/olia/olia.owl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

The nif:ContextCollection

::

  <https://mangosaurus.eu/rdf-data/nif-data/collection> a nif:ContextCollection ;
      nif:hasContext <https://mangosaurus.eu/rdf-data/nif-data/doc_1> ;
      dcterms:conformsTo <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/2.1> .

The nif:Context (a document)

::

  <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_265> a nif:Context,
          nif:String ;
      nif:beginIndex "0"^^xsd:nonNegativeInteger ;
      nif:endIndex "265"^^xsd:nonNegativeInteger ;
      nif:hasSentences ( 
        <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_165> 
        <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_167_265> 
      ) ;
      nif:isString "The Nafigator package allows you to store NLP output from custom made Spacy and stanza  pipelines with (intermediate) results and all processing steps in one format.  Multiwords like in “we have set that out below” are recognized (depending on your NLP  processor)."^^xsd:string ;
      nif:lastSentence <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_167_265> ;
      nif:firstSentence <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_165> ;
      nif:referenceContext <https://mangosaurus.eu/rdf-data/nif-data/doc_1> .

The nif:Sentence

::

  <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_165> a nif:OffsetBasedString,
          nif:Paragraph,
          nif:Sentence ;
    nif:anchorOf "The Nafigator package allows you to store NLP output from custom made Spacy and stanza pipelines with ( intermediate ) results and all processing steps in one format ."^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "165"^^xsd:nonNegativeInteger ;
    nif:firstWord <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_3> ;
    nif:hasWords ( 
      <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_3> 
      <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_4_13> 
      ...
      <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_164_165> 
    ) ;
    nif:lastWord <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_164_165> ;
    nif:nextSentence <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_167_265> ;
    nif:referenceContext <https://mangosaurus.eu/rdf-data/nif-data/doc_1> .

The nif:Word

::

  <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_0_3> a nif:OffsetBasedString,
          nif:Word ;
      nif:anchorOf "The"^^xsd:string ;
      nif:beginIndex "0"^^xsd:nonNegativeInteger ;
      nif:endIndex "3"^^xsd:nonNegativeInteger ;
      nif:lemma "the"^^xsd:string ;
      nif:nextWord <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_4_13> ;
      nif:oliaLink olia:Article,
          olia:Definite,
          olia:Determiner ;
      nif:referenceContext <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_0_265> ;
      nif:sentence <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_0_165> .

Part-of-speech tags and morphological features are combined here: the part-of-speech tag is *olia:Determiner*, and the morphological features are *olia:Article* (PronType=Art in Universal Dependencies terms) and *olia:Definite* (Definite=Def in Universal Dependencies terms).
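
The mapping described above can be pictured as a small lookup from Universal Dependencies tags and features to OLiA classes. The table below is an illustrative sketch covering only the example word "The", not the actual conversion table used by nafigator:

```python
# Illustrative UD (tag/feature, value) -> OLiA class pairs for the word "The".
UD_TO_OLIA = {
    ("upos", "DET"): "olia:Determiner",
    ("PronType", "Art"): "olia:Article",
    ("Definite", "Def"): "olia:Definite",
}

term = {"pos": "DET", "morphofeat": "Definite=Def|PronType=Art"}

# Collect the OLiA link for the part-of-speech tag plus one per feature.
links = [UD_TO_OLIA[("upos", term["pos"])]]
for pair in term["morphofeat"].split("|"):
    feat, val = pair.split("=")
    links.append(UD_TO_OLIA[(feat, val)])

print(sorted(links))  # ['olia:Article', 'olia:Definite', 'olia:Determiner']
```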

Changes to NIF
~~~~~~~~~~~~~~

Instead of the original RDF predicates *nif:word* and *nif:sentence* (used to link words to sentences and vice versa), the predicates *nif:hasWords* and *nif:hasSentences* are used, which point to an RDF collection (a linked list) of words and sentences respectively. The RDF collection maintains the order of the elements and allows easy traversal. These predicates are not part of the original NIF ontology.


=======
History
=======

0.1.0 (2021-03-13)
------------------

* First release on PyPI.

0.1.1 to 0.1.41 (2022-3-1)
--------------------------

* A lot of small changes

0.1.42 (2022-4-6)
-----------------

* Added first version of termbase processor

0.1.43 (2022-4-29)
------------------

* Fix for get_context_rows

0.1.45 (2022-4-29)
------------------

* Added sent ids to doc.sentences and doc.paragraphs

0.1.47 (2022-8-22)
------------------

* Table extraction improvements 
* Fix to align enumeration of sentences and paragraphs

0.1.48 (2022-8-30)
------------------

* Added first version of nif conversion

0.1.49 (2022-9-2)
-----------------

* Improved version of nif conversion
* Optimized TermbaseProcessor

0.1.50 (2022-9-5)
-----------------

* Morphological features in nif
* Bugfix TermbaseProcessor
* NIF example added to README.rst

0.1.52 (2022-10-19)
-------------------

* Formats layer now contains a deep copy of pdfminer output in xml

0.1.53 (2022-11-11)
-------------------

* Added coordinates to formats layer as an option
* Added highlighter feature for words
* Separated TableFormatter and Highlighter into 2 different modules
* Bugfix in formats layer

0.1.54 (2022-11-17)
-------------------

* Added PyMuPDF to requirements

0.1.55 (2022-11-21)
-------------------

* Added iribaker and Unidecode to requirements

0.1.57 (2022-11-30)
-------------------

* Added possibility to use a stream instead of opening a file
* Added naf2nif function to convert naf to an rdflib.Graph in NIF format
* Added parameter "include pdf xml" to include the original xml output of pdfminer in the naf document

0.1.58 (2022-12-08)
-------------------
* Version bump for a new build, to check whether this solves the installation issue with 0.1.57

0.1.59 (2022-12-08)
-------------------
* Added PyMuPDF==1.21.0 to requirements

0.1.60 (2022-12-12)
-------------------
* Add outline unittests
* Bugfix Lemma error
* Part 1 bugfix referencing error

0.1.61 (2023-01-09)
-------------------
* Add option for streams input
* Remove unused imports




            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/denederlandschebank/nafigator",
    "name": "nafigator",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "nafigator",
    "author": "De Nederlandsche Bank",
    "author_email": "w.j.willemse@dnb.nl",
    "download_url": "",
    "platform": null,
    "description": "=========\nnafigator\n=========\n\n\n.. image:: https://img.shields.io/pypi/v/nafigator.svg\n        :target: https://pypi.python.org/pypi/nafigator\n\n.. image:: https://img.shields.io/badge/License-MIT-yellow.svg\n        :target: https://opensource.org/licenses/MIT\n        :alt: License: MIT\n\n.. image:: https://img.shields.io/badge/code%20style-black-000000.svg\n        :target: https://github.com/psf/black\n        :alt: Code style: black\n\n**DISCLAIMER - BETA PHASE**\n\n*This package is currently in a beta phase.*\n\nto nafigate [ **naf**-i-geyt ]\n------------------------------\n\n    *v.intr*, **nafigated**, **nafigating**\n\n    1. To process one of more text documents through a NLP pipeline and output results in the NLP Annotation Format.\n\n\nFeatures\n--------\n\nThe Nafigator package allows you to store (intermediate) results and processing steps from custom made spaCy and stanza pipelines in one format.\n\n* Convert text files to naf-files that satisfy the NLP Annotation Format (NAF)\n\n  - Supported input media types: application/pdf (.pdf), text/plain (.txt), text/html (.html), MS Word (.docx)\n\n  - Supported output formats: naf-xml (.naf.xml), naf-rdf in turtle-syntax (.ttl) and xml-syntax (.rdf) (experimental)\n\n  - Supported NLP processors: spaCy, stanza\n\n  - Supported NAF layers: raw, text, terms, entities, deps, multiwords\n\n* Read naf-files and access data as Python lists and dicts\n\nWhen reading naf-files Nafigator stores data in memory as lxml ElementTrees. 
The lxml package provides a Pythonic binding for C libaries so it should be very fast.\n\nThe NLP Annotation Format (NAF)\n-------------------------------\n\nKey features:\n\n* Multilayered extensible annotations;\n\n* Reproducible NLP pipelines;\n\n* NLP processor agnostic;\n\n* Compatible with RDF\n\nReferences:\n\n* `NAF: the NLP Annotation Format <http://newsreader-project.eu/files/2013/01/techreport.pdf>`_\n\n* `NAF documentation on Github <https://github.com/newsreader/NAF>`_\n\n\nCurrent changes to NAF:\n\n* a 'formats' layer is added with text format data (font and size) to allow text classification like header detection\n\n* a 'model' attribute is added to LinguisticProcessors to record the model that was used\n\n* all attributes of public are Dublin Core elements and mapped to the dc namespace\n\n* attributes in a dependency relation are renamed 'from_term' and 'to_term' ('from' is a Python reserved word)\n\nThe code of the SpaCy converter to NAF is partially based on `SpaCy-to-NAF <https://github.com/cltl/SpaCy-to-NAF>`_\n\nInstallation\n------------\n\nTo install the package\n\n::\n\n    pip install nafigator\n\nTo install the package from Github\n\n::\n\n    pip install -e git+https://github.com/denederlandschebank/nafigator.git#egg=nafigator\n\n\nHow to run\n----------\n\nCommand line interface\n~~~~~~~~~~~~~~~~~~~~~~\n\nTo parse a pdf, .docx, .txt or .html-file from the command line interface run in the root of the project::\n\n    python -m nafigator.cli\n\n\nFunction calls\n~~~~~~~~~~~~~~\n\nTo convert a .pdf, .docx, .txt or .html-file in Python code you can use: ::\n\n    from nafigator.parse2naf import generate_naf\n\n    doc = generate_naf(input = \"../data/example.pdf\",\n                       engine = \"stanza\",\n                       language = \"en\",\n                       naf_version = \"v3.1\",\n                       dtd_validation = False,\n                       params = {'fileDesc': {'author': 'anonymous'}},\n                      
 nlp = None)\n\n- input: document to convert to naf document\n- engine: pipeline processor, i.e. 'spacy' or 'stanza'\n- language: for example 'en' or 'nl'\n- naf_version: 'v3' or 'v3.1'\n- dtd_validation: True or False (default = False)\n- params: dictionary with parameters (default = {}) \n- nlp: custom made pipeline object from spacy or stanza (default = None)\n\nThe returning object, doc, is a NafDocument from which layers can be accessed.\n\nGet the document and processors metadata via::\n\n    doc.header\n\nOutput of doc.header of processed data/example.pdf::\n\n  {\n    'fileDesc': {\n      'author': 'anonymous',\n      'creationtime': '2021-04-25T11:28:58UTC', \n      'filename': 'data/example.pdf', \n      'filetype': 'application/pdf', \n      'pages': '2'}, \n    'public': {\n      '{http://purl.org/dc/elements/1.1/}uri': 'data/example.pdf',\n      '{http://purl.org/dc/elements/1.1/}format': 'application/pdf'}, \n  ...\n\nGet the raw layer output via::\n\n  doc.raw\n\nOutput of doc.raw of processed data/example.pdf::\n\n  The Nafigator package allows you to store NLP output from custom made spaCy and stanza  pipelines with (intermediate) results and all processing steps in one format.  
Multiwords like in 'we have set that out below' are recognized (depending on your NLP  processor).\n\nGet the text layer output via::\n\n  doc.text\n\nOutput of doc.text of processed data/example.pdf::\n\n  [\n    {'text': 'The', 'page': '1', 'sent': '1', 'id': 'w1', 'length': '3', 'offset': '0'}, \n    {'text': 'Nafigator', 'page': '1', 'sent': '1', 'id': 'w2', 'length': '9', 'offset': '4'}, \n    {'text': 'package', 'page': '1', 'sent': '1', 'id': 'w3', 'length': '7', 'offset': '14'}, \n    {'text': 'allows', 'page': '1', 'sent': '1', 'id': 'w4', 'length': '6', 'offset': '22'}, \n  ...\n\nGet the terms layer output via::\n\n  doc.terms\n\nOutput of doc.terms of processed data/example.pdf::\n\n  [\n    {'id': 't1', 'lemma': 'the', 'pos': 'DET', 'type': 'open', 'morphofeat': 'Definite=Def|PronType=Art', 'targets': [{'id': 'w1'}]}, \n    {'id': 't2', 'lemma': 'Nafigator', 'pos': 'PROPN', 'type': 'open', 'morphofeat': 'Number=Sing', 'targets': [{'id': 'w2'}]}, \n    {'id': 't3', 'lemma': 'package', 'pos': 'NOUN', 'type': 'open', 'morphofeat': 'Number=Sing', 'targets': [{'id': 'w3'}]}, \n    {'id': 't4', 'lemma': 'allow', 'pos': 'VERB', 'type': 'open', 'morphofeat': 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin',    \n  ...\n\nGet the entities layer output via::\n\n  doc.entities\n\nOutput of doc.entities of processed data/example.pdf::\n\n  [\n    {'id': 'e1', 'type': 'PRODUCT', 'text': 'Nafigator', 'targets': [{'id': 't2'}]},\n    {'id': 'e2', 'type': 'CARDINAL', 'text': 'one', 'targets': [{'id': 't28'}]}]\n  ]\n\nGet the entities layer output via::\n\n    doc.deps\n\nOutput of doc.deps of processed data/example.pdf::\n\n  [\n    {'from_term': 't3', 'to_term': 't1', 'from_orth': 'package', 'to_orth': 'The', 'rfunc': 'det'}, \n    {'from_term': 't4', 'to_term': 't3', 'from_orth': 'allows', 'to_orth': 'package', 'rfunc': 'nsubj'}, \n    {'from_term': 't3', 'to_term': 't2', 'from_orth': 'package', 'to_orth': 'Nafigator', 'rfunc': 'compound'}, \n    
{'from_term': 't4', 'to_term': 't5', 'from_orth': 'allows', 'to_orth': 'you', 'rfunc': 'obj'},\n  ...\n\nGet the multiwords layer output via::\n\n  doc.multiwords\n\nOutput of doc.multiwords::\n\n  [\n    {'id': 'mw1', 'lemma': 'set_out', 'pos': 'VERB', 'type': 'phrasal', 'components': [\n      {'id': 'mw1.c1', 'targets': [{'id': 't37'}]}, \n      {'id': 'mw1.c2', 'targets': [{'id': 't39'}]}]}\n  ]\n\nGet the formats layer output via::\n\n  doc.formats\n\nOutput of doc.formats::\n\n  [ \n    {'length': '268', 'offset': '0', 'textboxes': [\n      {'textlines': [\n        {'texts': [\n          {'font': 'CIDFont+F1', 'size': '12.000', 'length': '87', 'offset': '0', 'text': 'The Nafigator package allows you to store NLP output from custom made spaCy and stanza '\n          }]\n        }, \n        {'texts': [\n          {'font': 'CIDFont+F1', 'size': '12.000', 'length': '77', 'offset': '88', 'text': 'pipelines with (intermediate) results and all processing steps in one format.'\n  ...\n\nGet all sentences in the document via::\n\n  doc.sentences\n\nOutput of doc.sentences::\n\n  [\n    {'text': 'The Nafigator package allows you to store NLP output from custom made Spacy and stanza pipelines with ( intermediate ) results and all processing steps in one format .', \n    'para': ['1'], \n    'page': ['1'], \n    'span': [{'id': 'w1'}, {'id': 'w2'}, {'id': 'w3'}, {'id': 'w4'}, {'id': 'w5'}, {'id': 'w6'}, {'id': 'w7'}, {'id': 'w8'}, {'id': 'w9'}, {'id': 'w10'}, {'id': 'w11'}, {'id': 'w12'}, {'id': 'w13'}, {'id': 'w14'}, {'id': 'w15'}, {'id': 'w16'}, {'id': 'w17'}, {'id': 'w18'}, {'id': 'w19'}, {'id': 'w20'}, {'id': 'w21'}, {'id': 'w22'}, {'id': 'w23'}, {'id': 'w24'}, {'id': 'w25'}, {'id': 'w26'}, {'id': 'w27'}, {'id': 'w28'}, {'id': 'w29'}], \n    'terms': [{'id': 't1'}, {'id': 't2'}, {'id': 't3'}, {'id': 't4'}, {'id': 't5'}, {'id': 't6'}, {'id': 't7'}, {'id': 't8'}, {'id': 't9'}, {'id': 't10'}, {'id': 't11'}, {'id': 't12'}, {'id': 't13'}, {'id': 't14'}, {'id': 't15'}, 
{'id': 't16'}, {'id': 't17'}, {'id': 't18'}, {'id': 't19'}, {'id': 't20'}, {'id': 't21'}, {'id': 't22'}, {'id': 't23'}, {'id': 't24'}, {'id': 't25'}, {'id': 't26'}, {'id': 't27'}, {'id': 't28'}, {'id': 't29'}]}, \n  ...\n\nNote that you get the word ids (the span) as well as the terms ids in the sentence.\n\n\nAdding new annotation layers\n----------------------------\n\nTo add a new annotation layer with elements, start with registering the processor of the new annotations::\n\n  lp = ProcessorElement(name=\"processorname\", model=\"modelname\", version=\"1.0\", timestamp=None, beginTimestamp=None,   endTimestamp=None, hostname=None)\n\n  doc.add_processor_element(\"recommendations\", lp)\n\nThen get the layer and add subelements::\n\n  layer = doc.layer(\"recommendations\")\n\n  data_recommendation = {'id': \"recommendation1\", 'subjectivity': 0.5, 'polarity': 0.25, 'span': ['t37', 't39']}\n\n  element = doc.subelement(element=layer, tag=\"recommendation\", data=data_recommendation)\n\n  doc.add_span_element(element=element, data=data_recommendation)\n\nRetrieve the recommendations with::\n\n  doc.recommendations\n\n\nConvert NAF to the NLP Interchange Format (NIF)\n-----------------------------------------------\n\nThe `NLP Interchange Format (NIF) <https://github.com/NLP2RDF/ontologies>` is an RDF/OWL-based format that aims to achieve interoperability between NLP tools.\n\nHere's an example::\n\n  doc = nafigator.NafDocument().open(\"..//data//example.naf.xml\")\n\n  nif = nafigator.naf2nif(uri=\"https://mangosaurus.eu/rdf-data/nif-data/doc_1\",\n                          collection_uri=\"https://mangosaurus.eu/rdf-data/nif-data/collection\",\n                          doc=doc)\n\nThis results in an object that contains the rdflib Graph and can be serialized with::\n\n  nif.graph.serialize(format=\"turtle\"))\n\nThis results in the graph in turtle format. 
\n\nThe prefixes and namespaces:\n\n::\n\n  @prefix dcterms: <http://purl.org/dc/terms/> .\n  @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .\n  @prefix olia: <http://purl.org/olia/olia.owl#> .\n  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\nThe nif:ContextCollection\n\n::\n\n  <https://mangosaurus.eu/rdf-data/nif-data/collection> a nif:ContextCollection ;\n      nif:hasContext <https://mangosaurus.eu/rdf-data/nif-data/doc_1> ;\n      dcterms:conformsTo <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/2.1> .\n\nThe nif:Context (a document)\n\n::\n\n  <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_265> a nif:Context,\n          nif:String ;\n      nif:beginIndex \"0\"^^xsd:nonNegativeInteger ;\n      nif:endIndex \"265\"^^xsd:nonNegativeInteger ;\n      nif:hasSentences ( \n        <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_165> \n        <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_167_265> \n      ) ;\n      nif:isString \"The Nafigator package allows you to store NLP output from custom made Spacy and stanza  pipelines with (intermediate) results and all processing steps in one format.  
Multiwords like in “we have set that out below” are recognized (depending on your NLP processor)."^^xsd:string ;
      nif:lastSentence <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_167_265> ;
      nif:firstSentence <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_165> ;
      nif:referenceContext <https://mangosaurus.eu/rdf-data/nif-data/doc_1> .

The nif:Sentence

::

  <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_165> a nif:OffsetBasedString,
          nif:Paragraph,
          nif:Sentence ;
    nif:anchorOf "The Nafigator package allows you to store NLP output from custom made Spacy and stanza pipelines with ( intermediate ) results and all processing steps in one format ."^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "165"^^xsd:nonNegativeInteger ;
    nif:firstWord <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_3> ;
    nif:hasWords (
      <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_3>
      <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_4_13>
      ...
      <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_164_165>
    ) ;
    nif:lastWord <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_164_165> ;
    nif:nextSentence <https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_167_265> ;
    nif:referenceContext <https://mangosaurus.eu/rdf-data/nif-data/doc_1> .

The nif:Word

::

  <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_0_3> a nif:OffsetBasedString,
          nif:Word ;
      nif:anchorOf "The"^^xsd:string ;
      nif:beginIndex "0"^^xsd:nonNegativeInteger ;
      nif:endIndex "3"^^xsd:nonNegativeInteger ;
      nif:lemma "the"^^xsd:string ;
      nif:nextWord <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_4_13> ;
      nif:oliaLink olia:Article,
          olia:Definite,
          olia:Determiner ;
      nif:referenceContext <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_0_265> ;
      nif:sentence <https://mangosaurus.eu/rdf-data/nif-data/3968fc96-5750-3fdb-be58-46f182762119#offset_0_165> .

Part-of-speech tags and morphological features are combined here: the part-of-speech tag is *olia:Determiner*, and the morphological features are *olia:Article* (*PronType=Art* in Universal Dependencies terms) and *olia:Definite* (*Definite=Def* in Universal Dependencies terms).

Changes to NIF
~~~~~~~~~~~~~~

Instead of the original RDF predicates *nif:word* and *nif:sentence* (used to link words to sentences and vice versa), I used the predicates *nif:hasWords* and *nif:hasSentences*, which point to an RDF collection (a linked list) of words and sentences respectively. An RDF collection maintains the order of its elements and allows easy traversal. Note that these predicates are not part of the original NIF ontology.


=======
History
=======

0.1.0 (2021-03-13)
------------------

* First release on PyPI.

0.1.1 to 0.1.41 (2022-3-1)
--------------------------

* Many small changes

0.1.42 (2022-4-6)
-----------------

* Added first version of termbase processor

0.1.43 (2022-4-29)
------------------

* Fix for get_context_rows

0.1.45 (2022-4-29)
------------------

* Added sentence ids to doc.sentences and doc.paragraphs

0.1.47 (2022-8-22)
------------------

* Table extraction improvements
* Fix to align enumeration of sentences and paragraphs

0.1.48 (2022-8-30)
------------------

* Added first version of NIF conversion

0.1.49 (2022-9-2)
-----------------

* Improved version of NIF conversion
* Optimized TermbaseProcessor

0.1.50 (2022-9-5)
-----------------

* Morphological features in NIF
* Bugfix in TermbaseProcessor
* NIF example added to README.rst

0.1.52 (2022-10-19)
-------------------

* Formats layer now contains a deep copy of the pdfminer output in xml

0.1.53 (2022-11-11)
-------------------

* Added coordinates to formats layer as an option
* Added highlighter feature for words
* Separated TableFormatter and Highlighter into two different modules
* Bugfix in formats layer

0.1.54 (2022-11-17)
-------------------

* Added PyMuPDF to requirements

0.1.55 (2022-11-21)
-------------------

* Added iribaker and Unidecode to requirements

0.1.57 (2022-11-30)
-------------------

* Added possibility to use a stream instead of opening a file
* Added naf2nif function to convert naf to an rdflib.Graph in NIF format
* Added parameter "include pdf xml" to include the original xml output of pdfminer in the naf document

0.1.58 (2022-12-08)
-------------------
* Version bump for a new build, to check if this solves the installation issue with version 0.1.57

0.1.59 (2022-12-08)
-------------------
* Added PyMuPDF==1.21.0 to requirements

0.1.60 (2022-12-12)
-------------------
* Add outline unittests
* Bugfix for Lemma error
* Part 1 of bugfix for referencing error

0.1.61 (2023-01-09)
-------------------
* Add option for streams input
* Remove unused imports
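To illustrate why the *nif:hasWords* collection makes ordered traversal easy, here is a minimal sketch using ``rdflib`` directly (not the nafigator API). The tiny two-word document and its URIs are invented for the example; ``Graph.items`` walks the *rdf:first*/*rdf:rest* linked list behind the collection, so the words come back in document order.

.. code:: python

  from rdflib import Graph, Namespace, URIRef

  NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

  # Hypothetical two-word document in the same shape as the examples above
  ttl = """
  @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix : <https://mangosaurus.eu/rdf-data/nif-data/doc_1#> .

  :offset_0_8 a nif:Sentence ;
      nif:hasWords ( :offset_0_3 :offset_4_8 ) .
  :offset_0_3 nif:anchorOf "The"^^xsd:string .
  :offset_4_8 nif:anchorOf "text"^^xsd:string .
  """

  g = Graph()
  g.parse(data=ttl, format="turtle")

  sentence = URIRef("https://mangosaurus.eu/rdf-data/nif-data/doc_1#offset_0_8")
  head = g.value(sentence, NIF.hasWords)  # head node of the RDF collection
  tokens = [str(g.value(w, NIF.anchorOf)) for w in g.items(head)]
  print(tokens)  # ['The', 'text']

With the plain *nif:word* predicate, the same query would return the words as an unordered set, and order would have to be reconstructed from *nif:beginIndex*; the collection avoids that step.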