Name | corpy |
Version | 0.6.1 |
Summary | Tools for processing language data. |
upload_time | 2023-04-05 13:44:59 |
requires_python | >=3.10 |
license | GPL-3.0-or-later |
keywords | corpus, linguistics, nlp |

=====
CorPy
=====

.. image:: https://readthedocs.org/projects/corpy/badge/?version=stable
   :target: https://corpy.readthedocs.io/en/stable/?badge=stable
   :alt: Documentation status

.. image:: https://badge.fury.io/py/corpy.svg
   :target: https://badge.fury.io/py/corpy
   :alt: PyPI package

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/python/black
   :alt: Code style

Installation
============

.. code:: bash

   $ python3 -m pip install corpy

Only recent versions of Python 3 (3.10+) are supported by design.
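
If you are unsure whether your interpreter meets that requirement, a quick
check along the following lines may help. This is only a sketch using the
standard library; ``MIN_PYTHON`` and ``python_is_supported`` are names made up
for this example and are not part of CorPy.

.. code:: python

   import sys

   # Minimum interpreter version, taken from the requirement stated above.
   MIN_PYTHON = (3, 10)


   def python_is_supported(version_info=sys.version_info) -> bool:
       """Return True if the (major, minor) version is at least MIN_PYTHON."""
       return tuple(version_info[:2]) >= MIN_PYTHON


   # Report on the interpreter that is running this snippet.
   print("supported" if python_is_supported() else "too old for CorPy")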

Help and feedback
=================

If you get stuck, it's always a good idea to start by searching the
documentation, the short URL to which is https://corpy.rtfd.io/.

The project is developed on GitHub_. You can ask for help via `GitHub
discussions`_ and report bugs and give other kinds of feedback via `GitHub
issues`_. Support is provided gladly, time and other engagements permitting, but
cannot be guaranteed.

.. _GitHub: https://github.com/dlukes/corpy
.. _GitHub discussions: https://github.com/dlukes/corpy/discussions
.. _GitHub issues: https://github.com/dlukes/corpy/issues

What is CorPy?
==============

A fancy plural for *corpus* ;) Also, a collection of handy but not especially
mutually integrated tools for dealing with linguistic data. It abstracts away
functionality which is often needed in practice for teaching and/or day-to-day
work at the `Czech National Corpus <https://korpus.cz>`__, without aspiring to
be a fully featured or consistent NLP framework.

Here's an idea of what you can do with CorPy:

- add linguistic annotation to raw textual data using either `UDPipe
  <https://corpy.rtfd.io/en/stable/guides/udpipe.html>`__ or `MorphoDiTa
  <https://corpy.rtfd.io/en/stable/guides/morphodita.html>`__
- `easily generate word clouds
  <https://corpy.rtfd.io/en/stable/guides/vis.html>`__
- run code in `a sanitized global environment
  <https://corpy.rtfd.io/en/stable/guides/no_globals.html>`__ (useful for
  debugging in interactive sessions, e.g. with Jupyter notebooks in `JupyterLab
  <https://jupyterlab.rtfd.io>`__)
- `generate phonetic transcripts of Czech texts
  <https://corpy.rtfd.io/en/stable/guides/phonetics_cs.html>`__
- `wrangle corpora in the vertical format
  <https://corpy.rtfd.io/en/stable/guides/vertical.html>`__ devised originally
  for `CWB <http://cwb.sourceforge.net/>`__, used also by `(No)SketchEngine
  <https://nlp.fi.muni.cz/trac/noske/>`__
- plus some `command line utilities
  <https://corpy.rtfd.io/en/stable/guides/cli.html>`__
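
To give a rough feel for the vertical format mentioned in the list above, here
is a minimal standard-library sketch of reading it. It is not CorPy's own API
(see the linked guide for that). It assumes one token per line with
tab-separated attributes (hypothetically word, lemma and tag here) and sentence
boundaries marked by ``<s>`` tags; ``SAMPLE`` and ``read_sentences`` are names
invented for the illustration.

.. code:: python

   import re

   # A tiny made-up corpus: structure tags on their own lines, one token per
   # line, attributes separated by tabs (assumed here: word, lemma, tag).
   SAMPLE = (
       '<doc id="example">\n'
       "<s>\n"
       "The\tthe\tDT\n"
       "cats\tcat\tNNS\n"
       "sleep\tsleep\tVBP\n"
       "</s>\n"
       "</doc>\n"
   )

   S_OPEN = re.compile(r"<s(\s[^>]*)?>")          # <s> with optional attributes
   ANY_TAG = re.compile(r"</?\w+(\s[^>]*)?/?>")   # any other structure tag


   def read_sentences(vertical: str) -> list[list[tuple[str, ...]]]:
       """Group token lines into sentences delimited by <s>...</s>."""
       sentences = []
       current = None
       for line in vertical.splitlines():
           if S_OPEN.fullmatch(line):
               current = []
           elif line == "</s>":
               if current is not None:
                   sentences.append(current)
               current = None
           elif not line or ANY_TAG.fullmatch(line):
               continue  # skip blank lines and other structures such as <doc>
           elif current is not None:
               current.append(tuple(line.split("\t")))
       return sentences


   for sentence in read_sentences(SAMPLE):
       print([word for word, lemma, tag in sentence])

Tokens outside any ``<s>`` structure are ignored in this sketch; a real corpus
may need different handling.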

.. note::

   **Should I pick UDPipe or MorphoDiTa?**

   Both are developed at `ÚFAL MFF UK`_. UDPipe_ has more features at the cost
   of being somewhat more complex: it does both `morphological tagging
   (including lemmatization) and syntactic parsing
   <https://corpy.rtfd.io/en/stable/guides/udpipe.html>`__, and it handles a
   number of different input and output formats. You can also download
   `pre-trained models <http://ufal.mff.cuni.cz/udpipe/models>`__ for many
   different languages.

   By contrast, MorphoDiTa_ only has `pre-trained models for Czech and English
   <http://ufal.mff.cuni.cz/morphodita/users-manual>`__, and only performs
   `morphological tagging (including lemmatization)
   <https://corpy.rtfd.io/en/stable/guides/morphodita.html>`__. However, its
   output is more straightforward -- it just splits your text into tokens and
   annotates them, whereas UDPipe can (depending on the model) introduce
   additional tokens necessary for a more explicit analysis, add multi-word
   tokens etc. This is because UDPipe is tailored to the type of linguistic
   analysis conducted within the UniversalDependencies_ project, using the
   CoNLL-U_ data format.

   MorphoDiTa can also help you if you just want to tokenize text and don't have
   a language model available.

.. _`ÚFAL MFF UK`: https://ufal.mff.cuni.cz/
.. _UDPipe: https://ufal.mff.cuni.cz/udpipe
.. _MorphoDiTa: https://ufal.mff.cuni.cz/morphodita
.. _UniversalDependencies: https://universaldependencies.org
.. _CoNLL-U: https://universaldependencies.org/format.html
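
The multi-word tokens mentioned in the note above can be illustrated with a
small standard-library sketch. It does not use UDPipe or CorPy: it only reads
a hand-written CoNLL-U fragment, relying on the format's convention that a
multi-word token has a range ID such as ``1-2`` and is followed by the words it
spans. The sample sentence and the names ``row``, ``SAMPLE`` and
``surface_and_words`` are assumptions made for the example.

.. code:: python

   def row(token_id: str, form: str, lemma: str = "_") -> str:
       """Build one 10-column CoNLL-U line, leaving unused columns as '_'."""
       return "\t".join([token_id, form, lemma] + ["_"] * 7)


   # Hand-written fragment: "vámonos" and "al" are multi-word tokens.
   SAMPLE = "\n".join([
       "# text = vámonos al mar",
       row("1-2", "vámonos"),
       row("1", "vamos", "ir"),
       row("2", "nos", "nosotros"),
       row("3-4", "al"),
       row("3", "a", "a"),
       row("4", "el", "el"),
       row("5", "mar", "mar"),
       "",
   ])


   def surface_and_words(conllu: str) -> tuple[list[str], list[str]]:
       """Return (surface tokens, syntactic words) found in a CoNLL-U string."""
       surface, words = [], []
       covered_until = 0  # last word ID covered by a multi-word token
       for line in conllu.splitlines():
           if not line:
               covered_until = 0  # blank line: IDs restart in the next sentence
               continue
           if line.startswith("#"):
               continue  # sentence-level comment
           cols = line.split("\t")
           token_id, form = cols[0], cols[1]
           if "-" in token_id:
               surface.append(form)
               covered_until = int(token_id.split("-")[1])
           elif "." in token_id:
               continue  # empty node, ignored in this sketch
           else:
               words.append(form)
               if int(token_id) > covered_until:
                   surface.append(form)
       return surface, words


   surface, words = surface_and_words(SAMPLE)
   print(surface)
   print(words)

Three surface tokens correspond to five annotated words here, which is the
kind of difference from MorphoDiTa's plain token-by-token output that the note
describes.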

.. development-marker

Development
===========

Dependencies and building the docs
----------------------------------

``corpy`` needs to be installed in the ReadTheDocs virtualenv for ``autodoc`` to
work. The optional dependencies in the ``doc`` group are also needed. This is
all configured in ``.readthedocs.yml``.

.. license-marker

License
=======

Copyright © 2016--present `ÚČNK <http://korpus.cz>`__/David Lukeš

Distributed under the `GNU General Public License v3
<http://www.gnu.org/licenses/gpl-3.0.en.html>`__.

Raw data
{
"_id": null,
"home_page": "",
"name": "corpy",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "",
"keywords": "corpus,linguistics,NLP",
"author": "",
"author_email": "David Lukes <dafydd.lukes@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/ce/e7/83ab9a9cafce95fae887a0ec349ac92f14d99a36a35af032d49d9cfb82df/corpy-0.6.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL-3.0-or-later",
"summary": "Tools for processing language data.",
"version": "0.6.1",
"split_keywords": [
"corpus",
"linguistics",
"nlp"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "94601c0fd32200d014484b98463de3e6b7f87d65fae257cde935022508a4eb64",
"md5": "d50de2d6f2dfcb971ef25651f85f6558",
"sha256": "3b7ee9366ac0920664b18a43704354af12281fb8f45de2263027b8c9535b041f"
},
"downloads": -1,
"filename": "corpy-0.6.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d50de2d6f2dfcb971ef25651f85f6558",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 38727,
"upload_time": "2023-04-05T13:44:56",
"upload_time_iso_8601": "2023-04-05T13:44:56.440239Z",
"url": "https://files.pythonhosted.org/packages/94/60/1c0fd32200d014484b98463de3e6b7f87d65fae257cde935022508a4eb64/corpy-0.6.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cee783ab9a9cafce95fae887a0ec349ac92f14d99a36a35af032d49d9cfb82df",
"md5": "8d0f8c99a0abdcada5dd58409594ebe4",
"sha256": "a90efa7e85eb43ac947dc84c98b12832905233ec978dee9e79efa07c46ab4f6f"
},
"downloads": -1,
"filename": "corpy-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "8d0f8c99a0abdcada5dd58409594ebe4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 383675,
"upload_time": "2023-04-05T13:44:59",
"upload_time_iso_8601": "2023-04-05T13:44:59.426201Z",
"url": "https://files.pythonhosted.org/packages/ce/e7/83ab9a9cafce95fae887a0ec349ac92f14d99a36a35af032d49d9cfb82df/corpy-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-05 13:44:59",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "corpy"
}