Name | spacy-conll |
Version | 4.0.1 |
home_page | None |
Summary | A custom pipeline component for spaCy that can convert any parsed Doc and its sentences into CoNLL-U format. Also provides a command line entry point. |
upload_time | 2024-07-02 08:51:06 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.8 |
license | BSD 2-Clause License Copyright (c) 2018-2021, Bram Vanroy, Raquel G. Alhama All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
keywords | conll, conllu, nlp, parsing, spacy, spacy-extension, spacy_stanza, spacy_udpipe, stanza, tagging, udpipe |
# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe
This module allows you to parse text into [CoNLL-U format](https://universaldependencies.org/format.html). You can use it as a command-line tool, or embed it in your
own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It
also provides an easy-to-use function to quickly initialize a parser, as well as a ConllParser class with built-in
functionality to parse files or text.
Note that the module simply takes a parser's output and puts it in a formatted string adhering to the CoNLL-U
format. The output tags depend on the spaCy model used. If you want Universal Dependencies tags as output, I advise you
to use this library in combination with [spacy-stanza](https://github.com/explosion/spacy-stanza), a spaCy
interface that uses `stanza` and its models behind the scenes. Those models use the Universal Dependencies formalism and
yield state-of-the-art performance. `stanza` is a new and improved version of `stanfordnlp`. As an alternative to the
Stanford models, you can use the spaCy wrapper for `UDPipe`, [spacy-udpipe](https://github.com/TakeLab/spacy-udpipe),
which is slightly less accurate than `stanza` but much faster.
## Installation
By default, this package automatically installs only [spaCy](https://spacy.io/usage/models#section-quickstart) as
a dependency. Because [spaCy's models](https://spacy.io/usage/models) are not necessarily trained on Universal
Dependencies conventions, their output labels are not UD either. By using `spacy-stanza` or `spacy-udpipe`, we get
the easy-to-use interface of spaCy as a wrapper around `stanza` and `UDPipe` respectively, including their models, which
*are* trained on UD data.
**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because
it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install
them manually or use one of the available options as described below.
If you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects
that `pandas` is installed. See the Usage section for more.
To install the library, simply use pip.
```shell
# only includes spacy by default
pip install spacy_conll
```
A number of options are available to make installation of additional dependencies easier:
```shell
# include spacy-stanza and spacy-udpipe
pip install spacy_conll[parsers]
# include pandas
pip install spacy_conll[pd]
# include pandas, spacy-stanza and spacy-udpipe
pip install spacy_conll[all]
# include pandas, spacy-stanza and spacy-udpipe and additional libraries for testing and formatting
pip install spacy_conll[dev]
```
## Usage
When the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for `Token`, sentence `Span` and `Doc`.
Note that arbitrary `Span`s are not included and do not receive these properties.
On all three of these levels, two custom properties are exposed by default, `._.conll` and its string
representation `._.conll_str`. However, if you have `pandas` installed, then `._.conll_pd` will
be added automatically, too!
- `._.conll`: raw CoNLL format
- in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
- in sentence Span: a list of its tokens' `._.conll` dictionaries (list of dictionaries).
- in a Doc: a list of its sentences' `._.conll` lists (list of list of dictionaries).
- `._.conll_str`: string representation of the CoNLL format
- in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
- in sentence Span: the expected CoNLL format where each row represents a token. When
`ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the
[CoNLL format](https://universaldependencies.org/format.html#sentence-boundaries-and-comments).
- in Doc: all its sentences' `._.conll_str` combined and separated by new lines.
- `._.conll_pd`: `pandas` representation of the CoNLL format
- in Token: a Series representation of this token's CoNLL properties.
- in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers.
 - in Doc: a concatenation of its sentences' DataFrames, leading to a new DataFrame whose index is reset.
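To make the nesting concrete, here is a small plain-Python sketch (hand-built values, not real parser output) of how the three levels relate and how a token's `._.conll_str` line is formed:

```python
# Hypothetical token values mirroring the shape of Token._.conll (not parser output)
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
token_conll = {
    "ID": 1, "FORM": "I", "LEMMA": "I", "UPOS": "PRON", "XPOS": "PRP",
    "FEATS": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
    "HEAD": 2, "DEPREL": "nsubj", "DEPS": "_", "MISC": "_",
}

# Token._.conll_str: tab-separated field values, ending with a newline
token_conll_str = "\t".join(str(token_conll[f]) for f in FIELDS) + "\n"

# Span._.conll: a list of token dicts; Doc._.conll: a list of those sentence lists
sentence_conll = [token_conll]
doc_conll = [sentence_conll]

print(token_conll_str, end="")
```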
You can use `spacy_conll` in your own Python code as a custom pipeline component, or you can use the built-in
command-line script which offers typically needed functionality. See the following section for more.
### In Python
This library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated
as follows. It is important that you import `spacy_conll` before adding the pipe!
```python
import spacy
import spacy_conll  # registers the "conll_formatter" component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)
```
Because this library supports different spaCy wrappers (`spacy`, `stanza`, and `udpipe`), a convenience function is
available as well. With `utils.init_parser` you can easily instantiate a parser with a single line. You can
find the function's signature below. Have a look at the [source code](spacy_conll/utils.py) to read more about all the
possible arguments or try out the [examples](examples/).
**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence
segmentation, effectively *only* splitting on new lines. With `spacy`, `is_tokenized` disables sentence splitting completely.
```python
def init_parser(
    model_or_lang: str,
    parser: str,
    *,
    is_tokenized: bool = False,
    disable_sbd: bool = False,
    exclude_spacy_components: Optional[List[str]] = None,
    parser_opts: Optional[Dict] = None,
    **kwargs,
)
```
For instance, if you want to load a Dutch `stanza` model in silent mode with the CoNLL formatter already attached, you
can simply use the following snippet. `parser_opts` is passed to the `stanza` pipeline initialisation automatically.
Any other keyword arguments (`kwargs`), on the other hand, are passed to the `ConllFormatter` initialisation.
```python
from spacy_conll import init_parser
nlp = init_parser("nl", "stanza", parser_opts={"verbose": False})
```
The `ConllFormatter` allows you to customize the extension names, and you can also specify conversion maps for the
output properties.
To illustrate, here is an advanced example, showing the more complex options:
- `ext_names`: changes the attribute names to a custom key by using a dictionary.
- `conversion_maps`: a two-level dictionary that looks like `{field_name: {tag_name: replacement}}`. In
  other words, you can specify in which field a certain value should be replaced by another. This is especially useful
  when you are not satisfied with the tagset of a model and wish to change some tags to an alternative.
- `field_names`: allows you to change the default CoNLL-U field names to your own custom names. Similar to the
  conversion map above, use the default field names as keys and your own names as values.
  Possible keys are: "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC".
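The conversion-map lookup described above can be sketched in a few lines of plain Python. This mimics the assumed behavior; the actual logic lives inside `ConllFormatter`:

```python
# Sketch of applying {field_name: {tag_name: replacement}} to a parsed value
conversion_maps = {"deprel": {"nsubj": "subj"}}

def convert(field: str, value: str, maps=conversion_maps) -> str:
    """Return the replacement for this field's value, or the value unchanged."""
    return maps.get(field, {}).get(value, value)

print(convert("deprel", "nsubj"))  # replaced
print(convert("deprel", "dobj"))   # untouched: no mapping for this tag
```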
The example below
- shows how to manually add the component;
- changes the custom attribute `conll_pd` to `pandas` (`conll_pd` is only available if `pandas` is installed);
- converts any `nsubj` deprel tag to `subj`.
```python
import spacy
import spacy_conll  # registers the "conll_formatter" component

nlp = spacy.load("en_core_web_sm")
config = {"ext_names": {"conll_pd": "pandas"},
          "conversion_maps": {"deprel": {"nsubj": "subj"}}}
nlp.add_pipe("conll_formatter", config=config, last=True)
doc = nlp("I like cookies.")
print(doc._.pandas)
```
This is the same as:
```python
from spacy_conll import init_parser
nlp = init_parser("en_core_web_sm",
                  "spacy",
                  ext_names={"conll_pd": "pandas"},
                  conversion_maps={"deprel": {"nsubj": "subj"}})
doc = nlp("I like cookies.")
print(doc._.pandas)
```
The snippets above will output a pandas DataFrame by using `._.pandas` rather than the standard
`._.conll_pd`, and all occurrences of `nsubj` in the deprel field are replaced by `subj`.
```
ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
0 1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 subj _ _
1 2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _
2 3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No
3 4 . . PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No
```
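The Doc-level concatenation behind `._.conll_pd` (sentence DataFrames joined under a fresh index, as described in the Usage section) can be sketched with plain `pandas`, using hand-built rows rather than real parser output:

```python
import pandas as pd

fields = ["ID", "FORM", "UPOS"]
# One DataFrame per sentence, as a sentence Span's ._.conll_pd would provide
sent1 = pd.DataFrame([[1, "I", "PRON"], [2, "like", "VERB"]], columns=fields)
sent2 = pd.DataFrame([[1, "What", "PRON"]], columns=fields)

# Doc._.conll_pd: the sentence DataFrames concatenated, with the index reset
doc_pd = pd.concat([sent1, sent2], ignore_index=True)
print(doc_pd)
```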
Another initialization example, which would replace the column names "UPOS" with "upostag" and "XPOS" with "xpostag":
```python
import spacy
import spacy_conll  # registers the "conll_formatter" component

nlp = spacy.load("en_core_web_sm")
config = {"field_names": {"UPOS": "upostag", "XPOS": "xpostag"}}
nlp.add_pipe("conll_formatter", config=config, last=True)
```
#### Reading CoNLL into a spaCy object
It is possible to read a CoNLL string or text file and parse it as a spaCy object. This can be useful if you have raw
CoNLL data that you wish to process in different ways. The process is straightforward.
```python
from spacy_conll import init_parser
from spacy_conll.parser import ConllParser
nlp = ConllParser(init_parser("en_core_web_sm", "spacy"))
doc = nlp.parse_conll_file_as_spacy("path/to/your/conll-sample.txt")
'''
or straight from raw text:
conllstr = """
# text = From the AP comes this story :
1 From from ADP IN _ 3 case 3:case _
2 the the DET DT Definite=Def|PronType=Art 3 det 3:det _
3 AP AP PROPN NNP Number=Sing 4 obl 4:obl:from _
4 comes come VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _
5 this this DET DT Number=Sing|PronType=Dem 6 det 6:det _
6 story story NOUN NN Number=Sing 4 nsubj 4:nsubj _
"""
doc = nlp.parse_conll_text_as_spacy(conllstr)
'''
# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc
for sent in doc.sents:
for token in sent:
print(token.text, token.dep_, token.pos_)
```
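The sentence and token boundaries that `parse_conll_text_as_spacy` relies on follow directly from the CoNLL-U layout: sentences are separated by a blank line, comment lines start with `#`, and each token row holds ten tab-separated fields. A plain-Python sketch of that layout (no spaCy involved):

```python
conllstr = (
    "# text = I like cookies .\n"
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tlike\tlike\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "\n"
    "# text = What ?\n"
    "1\tWhat\twhat\tPRON\tWP\t_\t0\troot\t_\t_\n"
)

# Sentences are separated by a blank line
sentences = conllstr.strip().split("\n\n")
# Token rows: every non-comment line, split on tabs into the ten CoNLL-U fields
rows = [ln.split("\t") for ln in sentences[0].splitlines() if not ln.startswith("#")]
print(len(sentences), len(rows), rows[0][1])
```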
### Command line
Upon installation, a command-line script is added under the alias `parse-as-conll`. You can use it to parse a
string or file into CoNLL-U format given a number of options.
```shell
parse-as-conll -h
usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE]
[-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v]
[--ignore_pipe_errors] [--no_split_on_newline]
model_or_lang {spacy,stanza,udpipe}
Parse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output
can be written to stdout or a file, or both.
positional arguments:
model_or_lang Model or language to use. SpaCy models must be pre-installed, stanza
and udpipe models will be downloaded automatically
{spacy,stanza,udpipe}
Which parser to use. Parsers other than 'spacy' need to be installed
separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the
'spacy-udpipe' library is required.
optional arguments:
-h, --help show this help message and exit
-f INPUT_FILE, --input_file INPUT_FILE
Path to file with sentences to parse. Has precedence over 'input_str'.
(default: None)
-a INPUT_ENCODING, --input_encoding INPUT_ENCODING
Encoding of the input file. Default value is system default. (default:
cp1252)
-b INPUT_STR, --input_str INPUT_STR
Input string to parse. (default: None)
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Path to output file. If not specified, the output will be printed on
standard output. (default: None)
-c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING
Encoding of the output file. Default value is system default. (default:
cp1252)
-s, --disable_sbd Whether to disable spaCy automatic sentence boundary detection. In
practice, disabling means that every line will be parsed as one
sentence, regardless of its actual content. When 'is_tokenized' is
enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized').
Only works when using 'spacy' as 'parser'. (default: False)
  -t, --is_tokenized    Whether your text has already been tokenized (space-separated). Setting
this option has as an important consequence that no sentence splitting
at all will be done except splitting on new lines. So if your input is
a file, and you want to use pretokenised text, make sure that each line
contains exactly one sentence. (default: False)
-d, --include_headers
Whether to include headers before the output of every sentence. These
headers include the sentence text and the sentence ID as per the CoNLL
format. (default: False)
-e, --no_force_counting
Whether to disable force counting the 'sent_id', starting from 1 and
increasing for each sentence. Instead, 'sent_id' will depend on how
spaCy returns the sentences. Must have 'include_headers' enabled.
(default: False)
-j N_PROCESS, --n_process N_PROCESS
Number of processes to use in nlp.pipe(). -1 will use as many cores as
available. Might not work for a 'parser' other than 'spacy' depending
on your environment. (default: 1)
-v, --verbose Whether to always print the output to stdout, regardless of
'output_file'. (default: False)
--ignore_pipe_errors Whether to ignore a priori errors concerning 'n_process' By default we
try to determine whether processing works on your system and stop
execution if we think it doesn't. If you know what you are doing, you
can ignore such pre-emptive errors, though, and run the code as-is,
which will then throw the default Python errors when applicable.
(default: False)
--no_split_on_newline
By default, the input file or string is split on newlines for faster
processing of the split up parts. If you want to disable that behavior,
you can use this flag. (default: False)
```
For example, parsing a single line, multi-sentence string:
```shell
parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers
# sent_id = 1
# text = I like cookies.
1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _
2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _
3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No
4 . . PUNCT . PunctType=Peri 2 punct _ _
# sent_id = 2
# text = What about you?
1 What what PRON WP _ 2 dep _ _
2 about about ADP IN _ 0 ROOT _ _
3 you you PRON PRP Case=Acc|Person=2|PronType=Prs 2 pobj _ SpaceAfter=No
4 ? ? PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No
```
For example, parsing a large input file and writing output to a given output file, using four processes:
```shell
parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
```
## Credits
The first version of this library was inspired by initial work by [rgalhama](https://github.com/rgalhama/spaCy2CoNLLU)
and has evolved a lot since then.
Raw data
{
"_id": null,
"home_page": null,
"name": "spacy-conll",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Bram Vanroy <bramvanroy@hotmail.com>",
"keywords": "conll, conllu, nlp, parsing, spacy, spacy-extension, spacy_stanza, spacy_udpipe, stanza, tagging, udpipe",
"author": null,
"author_email": "Bram Vanroy <bramvanroy@hotmail.com>",
"download_url": "https://files.pythonhosted.org/packages/32/3f/a9c6ddaca411719bc3067c8ec1783ea48ab303542c2359c371c9af3c49ef/spacy_conll-4.0.1.tar.gz",
"platform": null,
"description": "# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe\r\n\r\nThis module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your\r\n own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It \r\n also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in \r\n functionality to parse files or text.\r\n\r\nNote that the module simply takes a parser's output and puts it in a formatted string adhering to the linked ConLL-U \r\n format. The output tags depend on the spaCy model used. If you want Universal Depencies tags as output, I advise you \r\n to use this library in combination with [spacy-stanza](https://github.com/explosion/spacy-stanza), which is a spaCy \r\n interface using `stanza` and its models behind the scenes. Those models use the Universal Dependencies formalism and \r\n yield state-of-the-art performance. `stanza` is a new and improved version of `stanfordnlp`. As an alternative to the \r\n Stanford models, you can use the spaCy wrapper for `UDPipe`, [spacy-udpipe](https://github.com/TakeLab/spacy-udpipe), \r\n which is slightly less accurate than `stanza` but much faster.\r\n\r\n\r\n## Installation\r\n\r\nBy default, this package automatically installs only [spaCy](https://spacy.io/usage/models#section-quickstart) as \r\n dependency. Because [spaCy's models](https://spacy.io/usage/models) are not necessarily trained on Universal \r\n Dependencies conventions, their output labels are not UD either. 
By using `spacy-stanza` or `spacy-udpipe`, we get \r\n the easy-to-use interface of spaCy as a wrapper around `stanza` and `UDPipe` respectively, including their models that\r\n *are* trained on UD data.\r\n\r\n**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because \r\n it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install\r\nthem manually or use one of the available options as described below.\r\n\r\nIf you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects \r\n that `pandas` is installed. See the Usage section for more.\r\n\r\nTo install the library, simply use pip.\r\n\r\n```shell\r\n# only includes spacy by default\r\npip install spacy_conll\r\n```\r\n\r\nA number of options are available to make installation of additional dependencies easier:\r\n\r\n```shell\r\n# include spacy-stanza and spacy-udpipe\r\npip install spacy_conll[parsers]\r\n# include pandas\r\npip install spacy_conll[pd]\r\n# include pandas, spacy-stanza and spacy-udpipe\r\npip install spacy_conll[all]\r\n# include pandas, spacy-stanza and spacy-udpipe and additional libaries for testing and formatting\r\npip install spacy_conll[dev]\r\n```\r\n\r\n\r\n## Usage\r\n\r\nWhen the ConllFormatter is added to a spaCy pipeline, it adds CoNLL properties for `Token`, sentence `Span` and `Doc`.\r\n Note that arbitrary Span's are not included and do not receive these properties.\r\n\r\nOn all three of these levels, two custom properties are exposed by default, `._.conll` and its string \r\n representation `._.conll_str`. 
However, if you have `pandas` installed, then `._.conll_pd` will\r\n be added automatically, too!\r\n\r\n- `._.conll`: raw CoNLL format \r\n - in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.\r\n - in sentence Span: a list of its tokens' `._.conll` dictionaries (list of dictionaries).\r\n - in a Doc: a list of its sentences' `._.conll` lists (list of list of dictionaries).\r\n\r\n- `._.conll_str`: string representation of the CoNLL format \r\n - in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.\r\n - in sentence Span: the expected CoNLL format where each row represents a token. When \r\n `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the\r\n [CoNLL format](https://universaldependencies.org/format.html#sentence-boundaries-and-comments).\r\n - in Doc: all its sentences' `._.conll_str` combined and separated by new lines.\r\n\r\n- `._.conll_pd`: `pandas` representation of the CoNLL format \r\n - in Token: a Series representation of this token's CoNLL properties.\r\n - in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column headers.\r\n - in Doc: a concatenation of its sentences' DataFrame's, leading to a new a DataFrame whose index is reset.\r\n\r\nYou can use `spacy_conll` in your own Python code as a custom pipeline component, or you can use the built-in\r\n command-line script which offers typically needed functionality. See the following section for more.\r\n\r\n\r\n### In Python\r\n\r\nThis library offers the ConllFormatter class which serves as a custom spaCy pipeline component. It can be instantiated\r\n as follows. 
It is important that you import `spacy_conll` before adding the pipe!\r\n\r\n```python\r\nimport spacy\r\nnlp = spacy.load(\"en_core_web_sm\")\r\nnlp.add_pipe(\"conll_formatter\", last=True)\r\n```\r\n\r\nBecause this library supports different spaCy wrappers (`spacy`, `stanza`, and `udpipe`), a convenience function is\r\n available as well. With `utils.init_parser` you can easily instantiate a parser with a single line. You can\r\n find the function's signature below. Have a look at the [source code](spacy_conll/utils.py) to read more about all the\r\n possible arguments or try out the [examples](examples/).\r\n\r\n**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence\r\n segmentation, effectively *only* splitting on new lines. With `spacy`, `is_tokenized` disables sentence splitting completely.\r\n\r\n```python\r\ndef init_parser(\r\n model_or_lang: str,\r\n parser: str,\r\n *,\r\n is_tokenized: bool = False,\r\n disable_sbd: bool = False,\r\n exclude_spacy_components: Optional[List[str]] = None,\r\n parser_opts: Optional[Dict] = None,\r\n **kwargs,\r\n)\r\n```\r\n\r\nFor instance, if you want to load a Dutch `stanza` model in silent mode with the CoNLL formatter already attached, you\r\n can simply use the following snippet. `parser_opts` is passed to the `stanza` pipeline initialisation automatically. 
\r\n Any other keyword arguments (`kwargs`), on the other hand, are passed to the `ConllFormatter` initialisation.\r\n\r\n```python\r\nfrom spacy_conll import init_parser\r\n\r\nnlp = init_parser(\"nl\", \"stanza\", parser_opts={\"verbose\": False})\r\n```\r\n\r\nThe `ConllFormatter` allows you to customize the extension names, and you can also specify conversion maps for the\r\noutput properties.\r\n\r\nTo illustrate, here is an advanced example, showing the more complex options:\r\n\r\n- `ext_names`: changes the attribute names to a custom key by using a dictionary.\r\n- `conversion_maps`: a two-level dictionary that looks like `{field_name: {tag_name: replacement}}`. In \r\n other words, you can specify in which field a certain value should be replaced by another. This is especially useful\r\n when you are not satisfied with the tagset of a model and wish to change some tags to an alternative0. \r\n- `field_names`: allows you to change the default CoNLL-U field names to your own custom names. Similar to the \r\n conversion map above, you should use any of the default field names as keys and add your own key as value. 
\r\n Possible keys are : \"ID\", \"FORM\", \"LEMMA\", \"UPOS\", \"XPOS\", \"FEATS\", \"HEAD\", \"DEPREL\", \"DEPS\", \"MISC\".\r\n\r\nThe example below\r\n\r\n- shows how to manually add the component;\r\n- changes the custom attribute `conll_pd` to pandas (`conll_pd` only availabe if `pandas` is installed);\r\n- converts any `nsubj` deprel tag to `subj`.\r\n\r\n```python\r\nimport spacy\r\n\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\nconfig = {\"ext_names\": {\"conll_pd\": \"pandas\"},\r\n \"conversion_maps\": {\"deprel\": {\"nsubj\": \"subj\"}}}\r\nnlp.add_pipe(\"conll_formatter\", config=config, last=True)\r\ndoc = nlp(\"I like cookies.\")\r\nprint(doc._.pandas)\r\n```\r\n\r\nThis is the same as:\r\n\r\n```python\r\nfrom spacy_conll import init_parser\r\n\r\nnlp = init_parser(\"en_core_web_sm\",\r\n \"spacy\",\r\n ext_names={\"conll_pd\": \"pandas\"},\r\n conversion_maps={\"deprel\": {\"nsubj\": \"subj\"}})\r\ndoc = nlp(\"I like cookies.\")\r\nprint(doc._.pandas)\r\n```\r\n\r\n\r\nThe snippets above will output a pandas DataFrame by using `._.pandas` rather than the standard\r\n`._.conll_pd`, and all occurrences of `nsubj` in the deprel field are replaced by `subj`.\r\n\r\n```\r\n ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC\r\n0 1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 subj _ _\r\n1 2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _\r\n2 3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No\r\n3 4 . . PUNCT . 
PunctType=Peri 2 punct _ SpaceAfter=No\r\n```\r\n\r\nAnother initialization example that would replace the column names \"UPOS\" with \"upostag\" amd \"XPOS\" with \"xpostag\":\r\n\r\n```python\r\nimport spacy\r\n\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\nconfig = {\"field_names\": {\"UPOS\": \"upostag\", \"XPOS\": \"xpostag\"}}\r\nnlp.add_pipe(\"conll_formatter\", config=config, last=True)\r\n```\r\n\r\n#### Reading CoNLL into a spaCy object\r\n\r\nIt is possible to read a CoNLL string or text file and parse it as a spaCy object. This can be useful if you have raw\r\nCoNLL data that you wish to process in different ways. The process is straightforward.\r\n\r\n```python\r\nfrom spacy_conll import init_parser\r\nfrom spacy_conll.parser import ConllParser\r\n\r\n\r\nnlp = ConllParser(init_parser(\"en_core_web_sm\", \"spacy\"))\r\n\r\ndoc = nlp.parse_conll_file_as_spacy(\"path/to/your/conll-sample.txt\")\r\n'''\r\nor straight from raw text:\r\nconllstr = \"\"\"\r\n# text = From the AP comes this story :\r\n1\tFrom\tfrom\tADP\tIN\t_\t3\tcase\t3:case\t_\r\n2\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t3\tdet\t3:det\t_\r\n3\tAP\tAP\tPROPN\tNNP\tNumber=Sing\t4\tobl\t4:obl:from\t_\r\n4\tcomes\tcome\tVERB\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t0:root\t_\r\n5\tthis\tthis\tDET\tDT\tNumber=Sing|PronType=Dem\t6\tdet\t6:det\t_\r\n6\tstory\tstory\tNOUN\tNN\tNumber=Sing\t4\tnsubj\t4:nsubj\t_\r\n\"\"\"\r\ndoc = nlp.parse_conll_text_as_spacy(conllstr)\r\n'''\r\n\r\n# Multiple CoNLL entries (separated by two newlines) will be included as different sentences in the resulting Doc\r\nfor sent in doc.sents:\r\n for token in sent:\r\n print(token.text, token.dep_, token.pos_)\r\n```\r\n\r\n### Command line\r\n\r\nUpon installation, a command-line script is added under tha alias `parse-as-conll`. 
You can use it to parse a\r\nstring or file into CoNLL format given a number of options.\r\n\r\n```shell\r\nparse-as-conll -h\r\nusage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE]\r\n [-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v]\r\n [--ignore_pipe_errors] [--no_split_on_newline]\r\n model_or_lang {spacy,stanza,udpipe}\r\n\r\nParse an input string or input file to CoNLL-U format using a spaCy-wrapped parser. The output\r\ncan be written to stdout or a file, or both.\r\n\r\npositional arguments:\r\n model_or_lang Model or language to use. SpaCy models must be pre-installed, stanza\r\n and udpipe models will be downloaded automatically\r\n {spacy,stanza,udpipe}\r\n Which parser to use. Parsers other than 'spacy' need to be installed\r\n separately. For 'stanza' you need 'spacy-stanza', and for 'udpipe' the\r\n 'spacy-udpipe' library is required.\r\n\r\noptional arguments:\r\n -h, --help show this help message and exit\r\n -f INPUT_FILE, --input_file INPUT_FILE\r\n Path to file with sentences to parse. Has precedence over 'input_str'.\r\n (default: None)\r\n -a INPUT_ENCODING, --input_encoding INPUT_ENCODING\r\n Encoding of the input file. Default value is system default. (default:\r\n cp1252)\r\n -b INPUT_STR, --input_str INPUT_STR\r\n Input string to parse. (default: None)\r\n -o OUTPUT_FILE, --output_file OUTPUT_FILE\r\n Path to output file. If not specified, the output will be printed on\r\n standard output. (default: None)\r\n -c OUTPUT_ENCODING, --output_encoding OUTPUT_ENCODING\r\n Encoding of the output file. Default value is system default. (default:\r\n cp1252)\r\n -s, --disable_sbd Whether to disable spaCy automatic sentence boundary detection. In\r\n practice, disabling means that every line will be parsed as one\r\n sentence, regardless of its actual content. 
When 'is_tokenized' is\r\n enabled, 'disable_sbd' is enabled automatically (see 'is_tokenized').\r\n Only works when using 'spacy' as 'parser'. (default: False)\r\n -t, --is_tokenized Whether your text has already been tokenized (space-seperated). Setting\r\n this option has as an important consequence that no sentence splitting\r\n at all will be done except splitting on new lines. So if your input is\r\n a file, and you want to use pretokenised text, make sure that each line\r\n contains exactly one sentence. (default: False)\r\n -d, --include_headers\r\n Whether to include headers before the output of every sentence. These\r\n headers include the sentence text and the sentence ID as per the CoNLL\r\n format. (default: False)\r\n -e, --no_force_counting\r\n Whether to disable force counting the 'sent_id', starting from 1 and\r\n increasing for each sentence. Instead, 'sent_id' will depend on how\r\n spaCy returns the sentences. Must have 'include_headers' enabled.\r\n (default: False)\r\n -j N_PROCESS, --n_process N_PROCESS\r\n Number of processes to use in nlp.pipe(). -1 will use as many cores as\r\n available. Might not work for a 'parser' other than 'spacy' depending\r\n on your environment. (default: 1)\r\n -v, --verbose Whether to always print the output to stdout, regardless of\r\n 'output_file'. (default: False)\r\n --ignore_pipe_errors Whether to ignore a priori errors concerning 'n_process' By default we\r\n try to determine whether processing works on your system and stop\r\n execution if we think it doesn't. If you know what you are doing, you\r\n can ignore such pre-emptive errors, though, and run the code as-is,\r\n which will then throw the default Python errors when applicable.\r\n (default: False)\r\n --no_split_on_newline\r\n By default, the input file or string is split on newlines for faster\r\n processing of the split up parts. If you want to disable that behavior,\r\n you can use this flag. 
(default: False)\r\n```\r\n\r\n\r\nFor example, parsing a single line, multi-sentence string:\r\n\r\n```shell\r\nparse-as-conll en_core_web_sm spacy --input_str \"I like cookies. What about you?\" --include_headers\r\n\r\n# sent_id = 1\r\n# text = I like cookies.\r\n1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj _ _\r\n2 like like VERB VBP Tense=Pres|VerbForm=Fin 0 ROOT _ _\r\n3 cookies cookie NOUN NNS Number=Plur 2 dobj _ SpaceAfter=No\r\n4 . . PUNCT . PunctType=Peri 2 punct _ _\r\n\r\n# sent_id = 2\r\n# text = What about you?\r\n1 What what PRON WP _ 2 dep _ _\r\n2 about about ADP IN _ 0 ROOT _ _\r\n3 you you PRON PRP Case=Acc|Person=2|PronType=Prs 2 pobj _ SpaceAfter=No\r\n4 ? ? PUNCT . PunctType=Peri 2 punct _ SpaceAfter=No\r\n```\r\n\r\nFor example, parsing a large input file and writing output to a given output file, using four processes:\r\n\r\n```shell\r\nparse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4\r\n```\r\n\r\n\r\n## Credits\r\n\r\nThe first version of this library was inspired by initial work by [rgalhama](https://github.com/rgalhama/spaCy2CoNLLU)\r\n and has evolved a lot since then.\r\n",