pyconll


:Name: pyconll
:Version: 3.2.0
:Home page: https://github.com/pyconll/pyconll
:Summary: Read and manipulate CoNLL files
:Upload time: 2023-06-21 03:30:35
:Author: Matias Grioni
:Requires Python: ~=3.8
:License: MIT
:Keywords: nlp, conllu, conll, universal dependencies
:Requirements: No requirements were recorded.

|Build Status| |Coverage Status| |Documentation Status| |Version|
|gitter|

pyconll
-------

*Easily work with* **CoNLL** *files using the familiar syntax of*
**python**\ *.*

Links
'''''

-  `Homepage <https://pyconll.github.io>`__
-  `Documentation <https://pyconll.readthedocs.io/>`__

Installation
~~~~~~~~~~~~

As with most Python packages, simply use ``pip`` to install from PyPI.

::

   pip install pyconll

``pyconll`` is also available as a conda package on the ``pyconll``
channel. Only packages 2.2.0 and newer are available on conda at the
moment.

::

   conda install -c pyconll pyconll

pyconll supports Python 3.8 and greater. In general, pyconll will focus
development efforts on officially supported Python versions.
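As a rough illustration of that requirement, the package metadata declares ``~=3.8``, which under PEP 440 means "3.8 or newer, but still Python 3". The helper below is a hypothetical sketch for checking an interpreter against that range. It is not part of pyconll.

```python
import sys


def supports_pyconll(version_info=sys.version_info):
    """Return True if the interpreter satisfies the '~=3.8' specifier.

    Hypothetical helper for illustration only: '~=3.8' is equivalent to
    '>=3.8, ==3.*', so Python 3.8+ passes and a future Python 4 would not.
    """
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor >= 8


if not supports_pyconll():
    print('This interpreter is outside the range declared by pyconll.')
```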

Use
~~~

This tool is intended to be a **minimal**, **low level**, **expressive**
and **pragmatic** library in a widely used programming language. pyconll
creates a thin API on top of raw CoNLL annotations that is simple and
intuitive.

It offers the following features:

-  Regular CI testing and validation against all UD v2.x versions.
-  A strong domain model that includes CoNLL sources, Sentences,
   Tokens, Trees, etc.
-  A typed API for better development experience and better semantics.
-  A focus on usability and simplicity in design (no dependencies).
-  Performance optimizations for a smooth development workflow no matter
   the dataset size (performs about 25%-35% faster than other comparable
   packages).

See the following code example to understand the basics of the API.

.. code:: python

   # This snippet finds sentences where a token marked with part of speech 'AUX' is
   # governed by a NOUN. For example, in French this is a less common construction
   # and we may want to validate these examples because we have previously found some
   # problematic examples of this construction.
   import pyconll

   train = pyconll.load_from_file('./ud/train.conllu')

   review_sentences = []

   # Conll objects are iterable over their sentences, and sentences are iterable
   # over their tokens. Sentences also de/serialize comment information.
   for sentence in train:
       for token in sentence:
           # Tokens have attributes such as upos, head, id, deprel, etc, and sentences
           # can be indexed by a token's id. We must check that the token is not
           # the root of the sentence, because its head, '0', cannot be looked up.
           if token.upos == 'AUX' and (token.head != '0' and sentence[token.head].upos == 'NOUN'):
               review_sentences.append(sentence)
               # One match is enough, so avoid listing the same sentence twice.
               break

   print('Review the following sentences:')
   for sent in review_sentences:
       print(sent.id)

A full definition of the API can be found in the
`documentation <https://pyconll.readthedocs.io/>`__. For a focused
introduction, see the `quick
start <https://pyconll.readthedocs.io/en/stable/starting.html>`__ guide.
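To make the data behind the example more concrete, here is a plain-Python sketch of the same AUX-under-NOUN check applied to raw CoNLL-U text, using only the standard library. It is not how pyconll is implemented. The sample sentence and the helper name ``aux_governed_by_noun`` are invented for illustration.

```python
# Column positions in a CoNLL-U token line (10 tab-separated fields).
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)

# Hypothetical two-token sample; real corpora come from Universal Dependencies.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "# text = solution est",
    "\t".join(["1", "solution", "solution", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "est", "etre", "AUX", "_", "_", "1", "cop", "_", "_"]),
    "",
])


def aux_governed_by_noun(block):
    """Return True if any AUX token in one sentence block has a NOUN head."""
    rows = [line.split("\t") for line in block.splitlines()
            if line and not line.startswith("#")]
    by_id = {row[ID]: row for row in rows}
    for row in rows:
        # A head of '0' marks the root, which has no governing token to look up.
        if row[UPOS] == "AUX" and row[HEAD] != "0":
            head = by_id.get(row[HEAD])
            if head is not None and head[UPOS] == "NOUN":
                return True
    return False


print(aux_governed_by_noun(SAMPLE))
```

With pyconll, the column handling and the id lookup are provided by the Token and Sentence objects, so only the condition itself needs to be written.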

Uses and Limitations
~~~~~~~~~~~~~~~~~~~~

This package edits CoNLL-U annotations, not the annotated text itself.
Word forms on Tokens are not editable, and the Tokens of a Sentence
cannot be reassigned or reordered. ``pyconll`` focuses on editing
existing CoNLL-U annotation rather than creating it or changing the
underlying text that is annotated. If there is interest in this
functionality area, please create a GitHub issue for more visibility.

This package is also validated only against the CoNLL-U format. The
CoNLL and CoNLL-X formats are very similar but are not supported. I
originally intended to support these formats as well, but they are not
as well defined as CoNLL-U, so they are not included. Please create an
issue for visibility if this feature interests you.

Lastly, linguistic data can often be very large, and this package
attempts to keep that in mind. pyconll provides methods for creating
in-memory conll objects, along with an iterate-only version in case a
corpus is too large to store in memory (the in-memory structure is
several times larger than the actual corpus file). The iterate-only
version can parse upwards of 100,000 words per second on a machine with
16 GB of RAM, so for most datasets used on a local dev machine, this
package will perform well. The 2.2.0 release also improved parse time
and memory footprint by about 25%.
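The difference between the in-memory and iterate-only approaches can be sketched in plain Python. The generator below is a simplified illustration of streaming a CoNLL-U source one sentence at a time, so that only the current sentence is held in memory. The function names are made up for this sketch and this is not pyconll's parser. pyconll's own iterate-only loaders are described in its documentation.

```python
import io


def iter_sentence_blocks(lines):
    """Yield one sentence at a time as a list of lines.

    Only the current sentence is kept in memory, never the whole corpus.
    """
    block = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            # A blank line terminates a sentence in CoNLL-U.
            yield block
            block = []
    if block:
        # Tolerate a final sentence without a trailing blank line.
        yield block


def count_words(lines):
    """Count token lines, skipping comments, multiword ranges and empty nodes."""
    total = 0
    for block in iter_sentence_blocks(lines):
        for line in block:
            if line.startswith("#"):
                continue
            token_id = line.split("\t", 1)[0]
            # Plain integer ids are words; '1-2' ranges and '8.1' nodes are not.
            if token_id.isdigit():
                total += 1
    return total


# Any iterable of lines works, including an open file object.
corpus = io.StringIO(
    "# sent_id = 1\n1\tHello\n2\tworld\n\n"
    "# sent_id = 2\n1-2\tdon't\n1\tdo\n2\tn't\n"
)
print(count_words(corpus))
```

Because the generator accepts any iterable of lines, the same function can be pointed at a file opened with ``open(path, encoding='utf-8')`` without loading it fully.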

Contributing
~~~~~~~~~~~~

Contributions to this project are welcome and encouraged! If you are
unsure how to contribute, here is a
`guide <https://help.github.com/en/articles/creating-a-pull-request-from-a-fork>`__
from GitHub explaining the basic workflow. After cloning this repo,
please run ``pip install -r requirements.txt`` to set up locally. Some
of the installed tools, such as yapf, pylint, and mypy, do not have to
be run locally, but CI builds will fail unless they pass. Release
dependencies such as twine and sphinx are also installed.

For packaging new versions, use setuptools version 24.2.0 or greater so
that the resulting package recognizes the ``python_requires`` metadata.
Final packaging and release are now done with GitHub Actions, so this is
less of a concern.

README and CHANGELOG
^^^^^^^^^^^^^^^^^^^^

When changing either of these files, please change the Markdown version
and run ``make gendocs`` so that the other versions stay in sync.

Release Checklist
^^^^^^^^^^^^^^^^^

The general release process is enumerated below. This section is for
internal use, and most people do not have to worry about it. First, note
that the dev branch is always a direct extension of master with the
latest changes since the last release. That is, it is essentially a
staging release branch.

-  Change the version in ``pyconll/_version.py`` appropriately.
-  Merge dev into master **locally**. GitHub does not offer a
   fast-forward merge and explicitly uses ``--no-ff``, so to keep the
   history linear, merge locally with a fast-forward. This assumes that
   the dev branch looks good on CI tests, which do not run automatically
   in this situation.
-  Push the master branch. This should start some CI tests specifically
   for master. After validating these results, create a tag
   corresponding to the next version number and push the tag.
-  Create a new release from this tag from the `Releases
   page <https://github.com/pyconll/pyconll/releases>`__. On creating
   this release, two workflows will start. One releases to PyPI, and the
   other releases to conda.
-  Validate that these workflows pass and that the package is properly
   released on both platforms.

.. |Build Status| image:: https://github.com/pyconll/pyconll/workflows/CI/badge.svg?branch=master
   :target: https://github.com/pyconll/pyconll
.. |Coverage Status| image:: https://coveralls.io/repos/github/pyconll/pyconll/badge.svg?branch=master
   :target: https://coveralls.io/github/pyconll/pyconll?branch=master
.. |Documentation Status| image:: https://readthedocs.org/projects/pyconll/badge/?version=stable
   :target: https://pyconll.readthedocs.io/en/stable
.. |Version| image:: https://img.shields.io/github/v/release/pyconll/pyconll
   :target: https://github.com/pyconll/pyconll/releases
.. |gitter| image:: https://badges.gitter.im/pyconll/pyconll.svg
   :target: https://gitter.im/pyconll/pyconll?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge



            
