lcpcli


- Name: lcpcli
- Version: 0.1.8
- Summary: Helper for converting CONLLU files and uploading the corpus to LiRI Corpus Platform (LCP)
- Upload time: 2024-12-09 13:27:11
- Requires Python: >=3.10
- License: MIT
- Keywords: conll, tei, vert, corpora, corpus, linguistics

# LCP CLI module

> Command-line tool for converting CONLLU files and uploading the corpus to LCP

## Installation

Make sure you have Python 3.11+ with `pip` installed in your local environment, then run:

```bash
pip install lcpcli
```

## Usage

**Example:**

Corpus conversion:

```bash
lcpcli -i ~/conll_ext/ -o ~/upload/ -m upload
```

Data upload:

```bash
lcpcli -c ~/upload/ -k $API_KEY -s $API_SECRET -p my_project
```

**Help:**

```bash
lcpcli --help
```

`lcpcli` takes a corpus of CoNLL-U (Plus) files and imports it into a project created in an LCP instance, such as _catchphrase_.

Besides the standard token-level CoNLL-U fields (`form`, `lemma`, `upos`, `xpos`, `feats`, `head`, `deprel`, `deps`), one can also provide document- and sentence-level annotations using comment lines in the files (see [the CoNLL-U Format section](#conll-u-format)).

As a more advanced feature, `lcpcli` supports annotations aligned at the character level, such as named entities. See [the Named Entities section](#annotations-of-sequences-of-tokens-eg-named-entities) for more information.

### Example corpus

`lcpcli` ships with an example one-video "corpus": the video is an excerpt from the CC-BY 3.0 "Big Buck Bunny" video ((c) copyright 2008, Blender Foundation / www.bigbuckbunny.org) and the "transcription" is a sample of the Universal Declaration of Human Rights.

To populate a folder with the example data, use this command:

```bash
lcpcli --example /destination/folder/
```

This will create a subfolder named *free_video_corpus* in */destination/folder* which, itself, contains two subfolders: *input* and *output*. The *input* subfolder contains four files:
 - *doc.conllu* is a CoNLL-U Plus file that contains the textual data, with time alignments in seconds at the token level (`start` and `end` in the MISC column), the segment level (`# start = ` and `# end = ` comments) and the document level (`# newdoc start =` and `# newdoc end =`)
 - *namedentity.tsv* is a tab-separated value lookup file that contains information about the named entities, where each row associates an ID reported in the `namedentity` token cells of *doc.conllu* with two attributes, `type` and `form`
 - *shot.tsv* is a tab-separated value file that defines time-aligned annotations about the shots in the video in the `view` column, where the `start` and `end` columns are timestamps, in seconds, relative to the document referenced in the `doc_id` column
 - *meta.json* is a JSON file that defines the structure of the corpus, used both for pre-processing the data before upload and when adding the data to the LCP database. Read on for information on the definitions in this file


### CoNLL-U Format

The CoNLL-U format is documented at: https://universaldependencies.org/format.html

The LCP CLI converter will treat all the comments that start with `# newdoc KEY = VALUE` as document-level attributes.
This means that if a CoNLL-U file contains the line `# newdoc author = Jane Doe`, then in LCP all the sentences from this file will be associated with a document whose `meta` attribute will contain `author: 'Jane Doe'`.

All other comment lines following the format `# key = value` will add an entry to the `meta` attribute of the _segment_ corresponding to the sentence below that line (i.e. not at the document level).

The key-value pairs in the `MISC` column of a token line will go in the `meta` attribute of the corresponding token, with the exception of these key-value combinations:
 - `SpaceAfter=Yes` vs. `SpaceAfter=No` (case sensitive) controls whether the token will be represented with a trailing space character in the database
 - `start=n.m|end=o.p` (case sensitive) will align tokens, segments (sentences) and documents along a temporal axis, where `n.m` and `o.p` should be floating-point values in seconds

See below for how to report all the attributes in the template `.json` file.
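The comment and `MISC` conventions described in this section can be illustrated with a minimal Python sketch. This is not `lcpcli`'s actual parser: `parse_comment` and `parse_misc` are hypothetical helper names, and the sketch only mirrors the rules as stated above.

```python
def parse_comment(line):
    """Classify a comment line as a document- or segment-level attribute.

    Returns (level, key, value), or None when the line is not of the
    form `# key = value`.
    """
    if not line.startswith("#") or "=" not in line:
        return None
    body = line.lstrip("#").strip()
    level = "segment"
    if body.startswith("newdoc "):
        # `# newdoc KEY = VALUE` comments describe the document
        level = "document"
        body = body[len("newdoc "):]
    key, _, value = body.partition("=")
    return level, key.strip(), value.strip()


def parse_misc(misc):
    """Split a MISC cell into the reserved keys and the remaining token meta."""
    reserved, meta = {}, {}
    if misc == "_":
        return reserved, meta
    for pair in misc.split("|"):
        key, _, value = pair.partition("=")
        # Keys are case sensitive: only these exact spellings are reserved
        target = reserved if key in ("SpaceAfter", "start", "end") else meta
        target[key] = value
    return reserved, meta
```

For instance, `parse_comment("# newdoc author = Jane Doe")` yields a document-level `author` entry, whereas `parse_comment("# topic = 1")` yields a segment-level one.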

#### CoNLL-U Plus

CoNLL-U Plus is an extension to the CoNLL-U format documented at: https://universaldependencies.org/ext-format.html

If your files start with a comment line of the form `# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC`, `lcpcli` will treat them as CoNLL-U Plus files and process the columns according to the names you set in that line.
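As a rough illustration of how such a header line maps to column names, here is a short Python sketch. The function name `read_columns` is hypothetical and the logic is an assumption based on the description above, not `lcpcli`'s own code.

```python
def read_columns(first_line):
    """Return the column names declared by a `# global.columns` line, or None."""
    prefix = "# global.columns ="
    if not first_line.startswith(prefix):
        return None  # not a CoNLL-U Plus header line
    return first_line[len(prefix):].split()


columns = read_columns(
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC namedentity"
)
# The position of an extra column (here `namedentity`) can then be looked up by name
namedentity_index = columns.index("namedentity")
```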


#### Annotations of sequences of tokens (e.g. Named Entities)

You can use `lcpcli` to define annotations on sequences of tokens below the segment level, for example named entities. To do so, you will need to prepare your corpus as CoNLL-U Plus files which must define a dedicated column, e.g. `namedentity`:

```conllu
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC namedentity
```

All the tokens belonging to the same named entity should report the same index in that column; a token that doesn't belong to a named entity should report `_` (as per CoNLL-U conventions). For example:

```conllu
1	Adopted	adopt	VERB	V	Tense=Past|VerbForm=Part	0	root	_	start=2.20|end=2.30	_
2	and	and	CCONJ	CC	_	3	cc	_	start=2.30|end=2.35	_
3	proclaimed	proclaim	VERB	V	Tense=Past|VerbForm=Part	1	conj	_	start=2.38|end=2.45	_
4	by	by	ADP	E	_	7	case	_	start=2.45|end=2.50	_
5	General	general	ADJ	A	Degree=Pos	6	amod	_	start=2.55|end=2.75	1
6	Assembly	assembly	NOUN	S	Number=Sing	7	nmod	_	start=2.85|end=3.15	1
7	resolution	resolution	NOUN	S	Number=Sing	3	obl	_	start=3.20|end=3.22	_
8	217	217	NUM	N	NumType=Card	7	nummod	_	SpaceAfter=No|start=3.24|end=3.40	_
9	A	A	X	X	_	8	dep	_	start=3.42|end=3.44	_
10	(	(	PUNCT	FB	_	11	punct	_	SpaceAfter=No|start=3.44|end=3.45	_
11	III	third	ADJ	NO	Degree=Pos|NumType=Ord	8	amod	_	SpaceAfter=No|start=3.47|end=3.53	_
12	)	)	PUNCT	FB	_	11	punct	_	start=3.53|end=3.54	_
13	of	of	ADP	E	_	14	case	_	start=3.55|end=3.62	_
14	10	10	NUM	N	NumType=Card	7	nmod	_	start=3.75|end=4.07	2
15	December	December	PROPN	SP	_	14	flat	_	start=4.09|end=4.13	2
16	1948	1948	NUM	N	NumType=Card	14	flat	_	SpaceAfter=No|start=4.15|end=4.23	2
17	.	.	PUNCT	FS	_	1	punct	_	start=4.24|end=4.25	_
```

In this example, tokens 5-6 belong to the same named entity ("General Assembly") and "10 December 1948" forms another named entity.

The directory containing your corpus files should also include one TSV file named after that column: the filename should match the column name, all in lower-case, plus an extension (e.g. `.tsv`) -- in the example corpus, the column as reported in the first comment line (`global.columns`) is named `namedentity` and, correspondingly, the TSV file is named _namedentity.tsv_. Its first line should report headers, starting with `namedentity_id` and then any attributes associated with a named entity. The value in the first cell of all the non-header lines should correspond to the ones listed in the CoNLL-U file(s) for lookup purposes. For example:

```tsv
namedentity_id	type	form
1	ORG	General Assembly
2	DATE	10 December 1948
```

When parsed along with the CoNLL-U Plus lines above, this would associate the corresponding occurrence of the sequence "General Assembly" with a named entity of type `ORG` and the corresponding occurrence of "10 December 1948" with a named entity of type `DATE`.
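The grouping and lookup logic can be sketched in Python as follows. This is an illustration under stated assumptions, not `lcpcli`'s implementation: `entity_spans` and `load_lookup` are hypothetical names, and the tokens are reduced to `(form, index)` pairs taken from the `FORM` and `namedentity` columns.

```python
import csv
import io
from itertools import groupby


def entity_spans(tokens):
    """Yield (entity_id, text) for each run of consecutive tokens sharing an index."""
    for entity_id, run in groupby(tokens, key=lambda token: token[1]):
        if entity_id != "_":  # `_` marks tokens outside any named entity
            yield entity_id, " ".join(form for form, _ in run)


def load_lookup(tsv_text):
    """Map each ID in the first TSV column to a dict of its other attributes."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    return {row[0]: dict(zip(header[1:], row[1:])) for row in reader}


tokens = [
    ("by", "_"), ("General", "1"), ("Assembly", "1"), ("resolution", "_"),
    ("10", "2"), ("December", "2"), ("1948", "2"), (".", "_"),
]
lookup = load_lookup(
    "namedentity_id\ttype\tform\n"
    "1\tORG\tGeneral Assembly\n"
    "2\tDATE\t10 December 1948\n"
)
spans = [(text, lookup[entity_id]["type"]) for entity_id, text in entity_spans(tokens)]
```

With the sample rows above, `spans` pairs "General Assembly" with `ORG` and "10 December 1948" with `DATE`.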

Finally, you need to report a corresponding entity type in the template `.json` under the `layer` key, for example:

```json
"NamedEntity": {
    "abstract": false,
    "layerType": "span",
    "contains": "Token",
    "attributes": {
        "form": {
            "isGlobal": false,
            "type": "text",
            "nullable": false
        },
        "type": {
            "isGlobal": false,
            "type": "categorical",
            "nullable": true
        }
    }
},
```

Make sure to set the `abstract`, `layerType` and `contains` attributes as illustrated above. See the section [Convert and Upload](#convert-and-upload) for a full example of a template `.json` file.

One can then query named entities by specifying that they are contained in segments, and that they should contain specific tokens. For example, the following DQD query would match all the named entities of type `ORG` in the corpus' segments that contain an adjective token:

```dqd
Segment s

NamedEntity@s ne
    type = "ORG"

Token@ne t
    upos = "ADJ"

res => plain
    context
        s
    entities
        ne
```


#### Annotations of sequences of segments (e.g. Topics)


You can use `lcpcli` to define annotations on sequences of segments below the document level, for example topics. The approach is almost identical to the one for annotations of sequences of tokens; the following only describes the differences:

 - one does _not_ define a new column in `global.columns`
 - one does _not_ report the lookup indices in the token lines
 - one reports the indices as segment-level comments, named to match the TSV file; for example, a segment-level comment `# topic = 1` will look up the file _topic.tsv_ for a row whose first cell has the value `1`

Just like with token-level annotations, all consecutive segments sharing the same value in the annotation comment will be grouped together as one occurrence of that annotation.

One can then query segments that belong to specific topics. For example, the following DQD query would match all the segments that belong to a topic named "bunny" (assuming `topic.tsv` has a corresponding column `name`):

```dqd
Topic top
    name = "bunny"

Segment@top s

res => plain
    context
        s
    entities
        s
```

Note that while the documents represent the top annotation level containing segments, one cannot prepare a `.tsv` file for them as just described here; all the document annotations must be directly reported in the CoNLL-U files using `# newdoc key = value` comments, as described in the section **CoNLL-U Format**.


#### Time-aligned annotations

Your corpus can also include annotations that do not strictly group entities together. The example video corpus includes an annotation named _shot_ that is **time-aligned** but does not necessarily align with tokens or segments on the time axis (e.g. a shot can start in the middle of a sentence and end some time after its end).

Much like with the annotation types described above, you should also include a corresponding TSV file. The first column should list unique IDs; one column should be named `doc_id` and report the ID of the corresponding document (make sure to include corresponding `# newdoc id = <ID>` comments in your CoNLL-U files); two columns named `start` and `end` should list the time points for temporal anchoring, measured in seconds from the start of the document's media file; any extra columns hold additional attributes. For example, `shot.tsv` starts with:

```tsv
shot_id	doc_id	start	end	view
1	Bunny	0.00	8.00	wide angle
2	Bunny	8.05	12.50	low angle
3	Bunny	12.75	16.00	face-cam
```

Your template `.json` file should report _Shot_ under `layer`, for example:

```json
"Shot": {
    "abstract": false,
    "layerType": "unit",
    "anchoring": {
        "location": false,
        "stream": false,
        "time": true
    },
    "attributes": {
        "view": {
            "type": "categorical"
        }
    }
},
```

Assuming the sentences are also time-aligned (as in the example corpus) you can then query segments that overlap with specific shots, for example:

```dqd
Segment s

Shot sh
    OR # either...
        AND # ... the shot starts in the middle of the segment
            start >= s.start + 0.0s
            start <= s.end + 0.0s
        AND # ... or the shot ends in the middle of the segment
            end >= s.start + 0.0s
            end <= s.end + 0.0s

res => plain
    context
        s
    entities
        sh
```
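The overlap condition expressed by this query can be mirrored in a few lines of Python. This is only a sketch of the same logic: the function name and the segment boundaries are hypothetical, and the shot rows are those of the `shot.tsv` excerpt above.

```python
def shot_overlaps_segment(shot, segment):
    """True when the shot starts or ends within the segment's time span."""
    starts_inside = segment["start"] <= shot["start"] <= segment["end"]
    ends_inside = segment["start"] <= shot["end"] <= segment["end"]
    return starts_inside or ends_inside


shots = [
    {"shot_id": 1, "start": 0.00, "end": 8.00, "view": "wide angle"},
    {"shot_id": 2, "start": 8.05, "end": 12.50, "view": "low angle"},
    {"shot_id": 3, "start": 12.75, "end": 16.00, "view": "face-cam"},
]
segment = {"start": 7.5, "end": 9.0}  # hypothetical segment boundaries, in seconds
matches = [s["shot_id"] for s in shots if shot_overlaps_segment(s, segment)]
```

Note that, as written, the condition does not match a shot that starts before the segment and ends after it; a third branch would be needed to cover that case.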

#### Global attributes

In some cases, it makes sense for multiple entity types to share references: in those cases, they can define _global attributes_. An example of a global attribute is a speaker or an agent that can have a name, an age, etc. and be associated with both a segment (a sentence) and, say, a gesture. The corpus template would include definitions along these lines:

```json
"globalAttributes": {
    "agent": {
        "type": "dict",
        "keys": {
            "name": {
                "type": "text"
            },
            "age": {
                "type": "number"
            }
        }
    }
},
"layer": {
    "Segment": {
        "abstract": false,
        "layerType": "span",
        "contains": "Token",
        "attributes": {
            "agent": {
                "ref": "agent"
            }
        }
    },
    "Gesture": {
        "abstract": false,
        "layerType": "unit",
        "anchoring": {
            "time": true
        },
        "attributes": {
            "agent": {
                "ref": "agent"
            }
        }
    }
}
```

You should include a file named `global_attribute_agent.tsv` (mind the singular on `attribute`) with three columns: `agent_id`, `name` and `age`, and reference the values of `agent_id` appropriately as a sentence-level comment in your CoNLL-U files as well as in a file named `gesture.tsv`. For example:

*global_attribute_agent.tsv*:
```tsv
agent_id	agent
10	{"name": "Jane Doe", "age": 37}
```

CoNLL-U file:
```conllu
# newdoc id = video1

# sent_id = 1
# agent_id = 10
The the _ _ _
```

*gesture.tsv*:
```tsv
gesture_id	agent_id	doc_id	start	end
1	10	video1	1.25	2.6
```

#### Media files

If your corpus includes media files, your `.json` template should report them under a `mediaSlots` key in `meta`, e.g.:

```json
"meta": {
    "name": "Free Single-Video Corpus",
    "author": "LiRI",
    "date": "2024-06-13",
    "version": 1,
    "corpusDescription": "Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities",
    "mediaSlots": {
        "video": {
            "mediaType": "video",
            "isOptional": false
        }
    }
},
```

Your CoNLL-U file(s) should accordingly report each document's media file's name in a comment, like so:

```conllu
# newdoc video = bunny.mp4
```

The `.json` template should also define a main key named `tracks` to control which annotations will be represented along the time axis. For example, the following will tell the interface to display separate timeline tracks for the shot, named entity and segment annotations, with the latter being subdivided into as many tracks as there are distinct values for the attribute `speaker` of the segments:

```json
"tracks": {
    "layers": {
        "Shot": {},
        "NamedEntity": {},
        "Segment": {
            "split": [
                "speaker"
            ]
        }
    }
}
```

Finally, your **output** corpus folder should include a subfolder named `media` in which all the referenced media files have been placed.
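Because the template cross-references layer names in several places (`contains`, `tracks`), a quick sanity check before uploading can catch typos. The Python sketch below is an assumption about what is worth checking, inferred from the examples in this document; it is not `lcpcli`'s own validation, and `check_template` is a hypothetical name.

```python
def check_template(template):
    """Return a list of human-readable problems found in a template dict."""
    problems = []
    layers = template.get("layer", {})
    for name, layer in layers.items():
        contained = layer.get("contains")
        if contained is not None and contained not in layers:
            problems.append(f"layer {name!r} contains undefined layer {contained!r}")
    for name in template.get("tracks", {}).get("layers", {}):
        if name not in layers:
            problems.append(f"tracks refers to undefined layer {name!r}")
    return problems


template = {
    "layer": {
        "Token": {"layerType": "unit"},
        "Segment": {"layerType": "span", "contains": "Token"},
        "Shot": {"layerType": "unit"},
    },
    # NamedEntity is listed here but deliberately missing from "layer"
    "tracks": {"layers": {"Shot": {}, "Segment": {}, "NamedEntity": {}}},
}
problems = check_template(template)
```

In practice the dict would come from `json.load` on your template file; here `problems` holds a single entry flagging `NamedEntity`.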


#### Attribute types


The values of each attribute (on tokens, segments, documents or at any other level) have a **type**; the most common types are `text`, `number` and `categorical`. The attributes must be reported in the template `.json` file, along with their type (you can see an example in the section **Convert and Upload**).

 - `text` vs `categorical`: while both types correspond to alphanumeric values, `categorical` is meant for attributes that have a limited number of possible values (typically, fewer than 100 distinct values) of a limited length (as a rule of thumb, each value can have up to 50 characters). There are no such limits on values of attributes of type `text`. When a user starts typing a constraint on an attribute of type `categorical`, the DQD editor will offer autocompletion suggestions. The attributes of type `text` will have their values listed in a dedicated table (`lcpcli`'s conversion step produces corresponding `.tsv` files), so a query that expresses a constraint on an attribute will be slower if that attribute is of type `text` than if it is of type `categorical`

 - the type `labels` (with an `s` at the end) corresponds to a set of labels that users will be able to constrain in DQD using the `contain` keyword: for example, if an attribute named `genre` is of type `labels`, the user could write a constraint like `genre contain 'drama'` or `genre !contain 'comedy'`. The values of attributes of type `labels` should be one-line strings, with each value separated by a comma (`,`) character (as in, e.g., `# newdoc genre = drama, romance, coming of age, fiction`); as a consequence, no label can contain the character `,`.

 - the type `dict` corresponds to key-value pairs as represented in JSON

 - the type `date` requires values to be formatted in a way that can be parsed by PostgreSQL
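To make the `labels` convention concrete, here is a small Python sketch of how a comma-separated value breaks down into individual labels. It is illustrative only: `parse_labels` is a hypothetical name, and trimming the whitespace around each label is an assumption.

```python
def parse_labels(value):
    """Split a one-line, comma-separated string into a set of labels."""
    return {label.strip() for label in value.split(",") if label.strip()}


genre = parse_labels("drama, romance, coming of age, fiction")
has_drama = "drama" in genre        # analogous to `genre contain 'drama'`
lacks_comedy = "comedy" not in genre  # analogous to `genre !contain 'comedy'`
```

Since the comma is the separator, a label such as "coming of age" is fine, but a label containing a comma would be split in two.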


### Convert and Upload

1. Create a directory in which you have all your properly formatted CoNLL-U files

2. In the same directory, create a template `.json` file that describes your corpus structure (see above about the `attributes` key on `Document` and `Segment`), for example:

```json
{
    "meta": {
        "name": "Free Single-Video Corpus",
        "author": "LiRI",
        "date": "2024-06-13",
        "version": 1,
        "corpusDescription": "Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities",
        "mediaSlots": {
            "video": {
                "mediaType": "video",
                "isOptional": false
            }
        }
    },
    "firstClass": {
        "document": "Document",
        "segment": "Segment",
        "token": "Token"
    },
    "layer": {
        "Token": {
            "abstract": false,
            "layerType": "unit",
            "anchoring": {
                "location": false,
                "stream": true,
                "time": true
            },
            "attributes": {
                "form": {
                    "isGlobal": false,
                    "type": "text",
                    "nullable": true
                },
                "lemma": {
                    "isGlobal": false,
                    "type": "text",
                    "nullable": false
                },
                "upos": {
                    "isGlobal": true,
                    "type": "categorical",
                    "nullable": true
                },
                "xpos": {
                    "isGlobal": false,
                    "type": "categorical",
                    "nullable": true
                },
                "ufeat": {
                    "isGlobal": false,
                    "type": "dict",
                    "nullable": true
                }
            }
        },
        "DepRel": {
            "abstract": true,
            "layerType": "relation",
            "attributes": {
                "udep": {
                    "type": "categorical",
                    "isGlobal": true,
                    "nullable": false
                },
                "source": {
                    "name": "dependent",
                    "entity": "Token",
                    "nullable": false
                },
                "target": {
                    "name": "head",
                    "entity": "Token",
                    "nullable": true
                },
                "left_anchor": {
                    "type": "number",
                    "nullable": false
                },
                "right_anchor": {
                    "type": "number",
                    "nullable": false
                }
            }
        },
        "NamedEntity": {
            "abstract": false,
            "layerType": "span",
            "contains": "Token",
            "attributes": {
                "form": {
                    "isGlobal": false,
                    "type": "text",
                    "nullable": false
                },
                "type": {
                    "isGlobal": false,
                    "type": "categorical",
                    "nullable": true
                }
            }
        },
        "Shot": {
            "abstract": false,
            "layerType": "span",
            "anchoring": {
                "location": false,
                "stream": false,
                "time": true
            },
            "attributes": {
                "view": {
                    "isGlobal": false,
                    "type": "categorical",
                    "nullable": false
                }
            }
        },
        "Segment": {
            "abstract": false,
            "layerType": "span",
            "contains": "Token",
            "attributes": {
                "meta": {
                    "text": {
                        "type": "text"
                    },
                    "start": {
                        "type": "text"
                    },
                    "end": {
                        "type": "text"
                    }
                }
            }
        },
        "Document": {
            "abstract": false,
            "contains": "Segment",
            "layerType": "span",
            "attributes": {
                "meta": {
                    "audio": {
                        "type": "text",
                        "isOptional": true
                    },
                    "video": {
                        "type": "text",
                        "isOptional": true
                    },
                    "start": {
                        "type": "number"
                    },
                    "end": {
                        "type": "number"
                    },
                    "name": {
                        "type": "text"
                    }
                }
            }
        }
    },
    "tracks": {
        "layers": {
            "Shot": {},
            "Segment": {},
            "NamedEntity": {}
        }
    }
}
```

3. If your corpus defines a character-anchored entity type such as named entities, make sure you also include a properly named and formatted TSV file for it in the directory (see [the Named Entities section](#annotations-of-sequences-of-tokens-eg-named-entities))

4. Visit an LCP instance (e.g. _catchphrase_) and create a new project if you don't already have one where your corpus should go

5. Retrieve the API key and secret for your project by clicking on the button that says: "Create API Key"

    The secret will appear at the bottom of the page and remain visible only for 120s, after which it will disappear forever (you would then need to revoke the API key and create a new one)
    
    The key itself is listed above the button that says "Revoke API key" (make sure to **not** copy the line that starts with "Secret Key" along with the API key itself)

6. Once you have your API key and secret, you can start converting and uploading your corpus by running the following command:

```bash
lcpcli -i $CONLLU_FOLDER -o $OUTPUT_FOLDER -m upload -k $API_KEY -s $API_SECRET -p $PROJECT_NAME --live
```

- `$CONLLU_FOLDER` should point to the folder that contains your CoNLL-U files
- `$OUTPUT_FOLDER` should point to *another* folder that will be used to store the converted files to be uploaded
- `$API_KEY` is the key you copied from your project on LCP (still visible when you visit the page)
- `$API_SECRET` is the secret you copied from your project on LCP (only visible upon API Key creation)
- `$PROJECT_NAME` is the name of the project exactly as displayed on LCP -- it is case-sensitive, and space characters should be escaped
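If you script the upload, one way to sidestep manual escaping of spaces in the project name is to build the argument list in Python and let the standard library handle quoting. The sketch below only assembles the flags shown in the command above; `build_upload_command` and the sample values are hypothetical.

```python
import shlex


def build_upload_command(conllu_folder, output_folder, api_key, api_secret, project_name):
    """Assemble the convert-and-upload command as an argument list."""
    return [
        "lcpcli",
        "-i", conllu_folder,
        "-o", output_folder,
        "-m", "upload",
        "-k", api_key,
        "-s", api_secret,
        "-p", project_name,
        "--live",
    ]


command = build_upload_command("corpus/input", "corpus/output", "KEY", "SECRET", "My Project")
shell_line = shlex.join(command)  # quotes 'My Project' for display or copy-pasting
```

The list form can be passed directly to `subprocess.run(command)`, in which case no shell escaping is needed at all.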

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "lcpcli",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "CONLL, TEI, VERT, corpora, corpus, linguistics",
    "author": null,
    "author_email": "Danny McDonald <daniel.mcdonald@uzh.ch>, Igor Musta\u010d <igor.mustac@uzh.ch>, Jeremy Zehr <jeremy.zehr@uzh.ch>, Jonathan Schaber <jeremy.schaber@uzh.ch>",
    "download_url": "https://files.pythonhosted.org/packages/93/21/b857720cbb5685975fdea2fce1b76db9605c40cce8b8dffc7005a435cda1/lcpcli-0.1.8.tar.gz",
    "platform": null,
    "description": "# LCP CLI module\n\n> Command-line tool for converting CONLLU files and uploading the corpus to LCP\n\n## Installation\n\nMake sure you have python 3.11+ with `pip` installed in your local environment, then run\n\n```bash\npip install lcpcli\n```\n\n## Usage\n\n**Example:**\n\nCorpus conversion:\n\n```bash\nlcpcli -i ~/conll_ext/ -o ~/upload/ -m upload\n```\n\nData upload:\n\n```bash\nlcpcli -c ~/upload/ -k $API_KEY -s $API_SECRET -p my_project\n```\n\n**Help:**\n\n```bash\nlcpcli --help\n```\n\n`lcpcli` takes a corpus of CoNLL-U (PLUS) files and imports it to a project created in an LCP instance, such as _catchphrase_.\n\nBesides the standard token-level CoNLL-U fields (`form`, `lemma`, `upos`, `xpos`, `feats`, `head`, `deprel`, `deps`) one can also provide document- and sentence-level annotations using comment lines in the files (see [the CoNLL-U Format section](#conll-u-format))\n\nA more advanced functionality, `lcpcli` supports annotations aligned at the character level, such as named entities. See [the Named Entities section](#character-aligned-annotations-(e.g.-named-entities)) for more information\n\n### Example corpus\n\n`lcpcli` ships with an example one-video \"corpus\": the video is an excerpt from the CC-BY 3.0 \"Big Buck Bunny\" video ((c) copyright 2008, Blender Foundation / www.bigbuckbunny.org) and the \"transcription\" is a sample of the Declaration of the Human Rights\n\nTo populate a folder with the example data, use this command\n\n```bash\nlcpcli --example /destination/folder/\n```\n\nThis will create a subfolder named *free_video_corpus* in */destination/folder* which, itself, contains two subfolders: *input* and *output*. 
The *input* subfolder contains four files: \n - *doc.conllu* is a CoNLL-U Plus file that contains the textual data, with time alignments in seconds at the token- (`start` and `end` in the MISC column), segment- (`# start = ` and `# end = ` comments) and document-level (`#newdoc start =` and `#newdoc end =`)\n - *namedentity.tsv* is a tab-separated value lookup file that contains information about the named entities, where each row associates an ID reported in the `namedentity` token cells of *doc.conllu* with two attributes, `type` and `form`\n - *shot.tsv* is a tab-separated value file that defines time-aligned annotations about the shots in the video in the `view` column, where the `start` and `end` columns are timestamps, in seconds, relative to the document referenced in the `doc_id` column\n - *meta.json* is a JSON file that defines the structure of the corpus, used both for pre-processing the data before upload, and when adding the data to the LCP database. Read on for information on the definitions in this file\n\n\n### CoNLL-U Format\n\nThe CoNLL-U format is documented at: https://universaldependencies.org/format.html\n\nThe LCP CLI converter will treat all the comments that start with `# newdoc KEY = VALUE` as document-level attributes.\nThis means that if a CoNLL-U file contains the line `# newdoc author = Jane Doe`, then in LCP all the sentences from this file will be associated with a document whose `meta` attribute will contain `author: 'Jane Doe'`\n\nAll other comment lines following the format `# key = value` will add an entry to the `meta` attribute of the _segment_ corresponding to the sentence below that line (ie not at the document level)\n\nThe key-value pairs in the `MISC` column of a token line will go in the `meta` attribute of the corresponding token, with the exceptions of these key-value combinations:\n - `SpaceAfter=Yes` vs. 
`SpaceAfter=No` (case senstive) controls whether the token will be represented with a trailing space character in the database\n - `start=n.m|end=o.p` (case senstive) will align tokens, segments (sentences) and documents along a temporal axis, where `n.m` and `o.p` should be floating values in seconds\n\nSee below how to report all the attributes in the template `.json` file\n\n#### CoNLL-U Plus\n\nCoNLL-U Plus is an extension to the CoNLLU-U format documented at: https://universaldependencies.org/ext-format.html\n\nIf your files start with a comment line of the form `# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC`, `lcpcli` will treat them as CoNLL-U Plus files and process the columns according to the names you set in that line\n\n\n#### Annotations of sequences of tokens (e.g. Named Entities)\n\nYou can use `lcpcli` to define annotations on sequences of tokens below the segment level, for example named entities. To do so, you will need to prepare your corpus as CoNLL-U Plus files which must define a dedicated column, e.g. `namedentity`:\n\n```conllu\n# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC namedentity\n```\n\nAll the tokens belonging to the same named entity should report the same index in that column, or `_` (as per CoNLL-U conventions) if it doesn't belong to a named entity. 
For example:\n\n```conllu\n1\tAdopted\tadopt\tVERB\tV\tTense=Past|VerbForm=Part\t0\troot\t_\tstart=2.20|end=2.30\t_\n2\tand\tand\tCCONJ\tCC\t_\t3\tcc\t_\tstart=2.30|end=2.35\t_\n3\tproclaimed\tproclaim\tVERB\tV\tTense=Past|VerbForm=Part\t1\tconj\t_\tstart=2.38|end=2.45\t_\n4\tby\tby\tADP\tE\t_\t7\tcase\t_\tstart=2.45|end=2.50\t_\n5\tGeneral\tgeneral\tADJ\tA\tDegree=Pos\t6\tamod\t_\tstart=2.55|end=2.75\t1\n6\tAssembly\tassembly\tNOUN\tS\tNumber=Sing\t7\tnmod\t_\tstart=2.85|end=3.15\t1\n7\tresolution\tresolution\tNOUN\tS\tNumber=Sing\t3\tobl\t_\tstart=3.20|end=3.22\t_\n8\t217\t217\tNUM\tN\tNumType=Card\t7\tnummod\t_\tSpaceAfter=No|start=3.24|end=3.40\t_\n9\tA\tA\tX\tX\t_\t8\tdep\t_\tstart=3.42|end=3.44\t_\n10\t(\t(\tPUNCT\tFB\t_\t11\tpunct\t_\tSpaceAfter=No|start=3.44|end=3.45\t_\n11\tIII\tthird\tADJ\tNO\tDegree=Pos|NumType=Ord\t8\tamod\t_\tSpaceAfter=No|start=3.47|end=3.53\t_\n12\t)\t)\tPUNCT\tFB\t_\t11\tpunct\t_\tstart=3.53|end=3.54\t_\n13\tof\tof\tADP\tE\t_\t14\tcase\t_\tstart=3.55|end=3.62\t_\n14\t10\t10\tNUM\tN\tNumType=Card\t7\tnmod\t_\tstart=3.75|end=4.07\t2\n15\tDecember\tDecember\tPROPN\tSP\t_\t14\tflat\t_\tstart=4.09|end=4.13\t2\n16\t1948\t1948\tNUM\tN\tNumType=Card\t14\tflat\t_\tSpaceAfter=No|start=4.15|end=4.23\t2\n17\t.\t.\tPUNCT\tFS\t_\t1\tpunct\t_\tstart=4.24|end=4.25\t_\n```\n\nIn this example, tokens 5-6 belong to the same named entity (\"General Assembly\") and \"10 December 1948\" forms another named entity.\n\nThe directory containing your corpus files should also include one TSV file named after that column: the filename should match the column name, all in lower-case, plus an extension (e.g. `.tsv`) -- in the example corpus, the column as reported in the first comment line (`global.columns`) is named `namedentity` and, correspondingly, the TSV file is named _namedentity.tsv_. Its first line should report headers, starting with `namedentity_id` and then any attributes associated with a named entity. 
The value in the first cell of all the non-header lines should correspond to the ones listed in the CoNLL-U file(s) for lookup purposes. For example:\n\n```tsv\nnamedentity_id\ttype\tform\n1\tORG\tGeneral Assembly\n2\tDATE\t10 December 1948\n```\n\nWhen parsed along with the CoNLL-U Plus lines above, this would associate the corresponding occurrence of the sequence \"General Assembly\" with a named entity of type `ORG` and the corresponding occurrence of \"10 December 1948\" with a named entity of type `DATE`.\n\nFinally, you need to report a corresponding entity type in the template `.json` under the `layer` key, for example:\n\n```json\n\"NamedEntity\": {\n    \"abstract\": false,\n    \"layerType\": \"span\",\n    \"contains\": \"Token\",\n    \"attributes\": {\n        \"form\": {\n            \"isGlobal\": false,\n            \"type\": \"text\",\n            \"nullable\": false\n        },\n        \"type\": {\n            \"isGlobal\": false,\n            \"type\": \"categorical\",\n            \"nullable\": true\n        }\n    }\n},\n```\n\nMake sure to set the `abstract`, `layerType` and `contains` attributes as illustrated above. See the section [Convert and Upload](#convert-and-upload) for a full example of a template `.json` file.\n\nOne can then query named entities by specifying that they are contained in segments, and that they should contain specific tokens. For example, the following DQD query would match all the named entities the corpus' segments that contain an adjective token:\n\n```dqd\nSegment s\n\nNamedEntity@s ne\n    type = \"ORG\"\n\nToken@ne t\n    upos = \"ADJ\"\n\nres => plain\n    context\n        s\n    entities\n        ne\n```\n\n\n#### Annotations of sequences of segments (e.g. Topics)\n\n\nYou can use `lcpcli` to define annotations on sequences of segments below the document level, for example topics. 
The approach is almost identical to the one for annotations of sequences of tokens; the following only describes the differences:\n\n - one does _not_ define a new column in `global.columns`\n - one does _not_ report the lookup indices in the token lines\n - one reports the indices as segment-level comments, named to match the TSV file; for example, a segment-level comment `# topic = 1` will look up the file _topic.tsv_ for a row whose first cell has the value `1`\n\nJust like with token-level annotations, all consecutive segments sharing the same value in the annotation comment will be grouped together as one occurrence of that annotation.\n\nOne can then query segments that belong to specific topics. For example, the following DQD query would match all the segments that belong to a topic named \"bunny\" (assuming `topic.tsv` has a corresponding column `name`):\n\n```dqd\nTopic top\n    name = \"bunny\"\n\nSegment@top s\n\nres => plain\n    context\n        s\n    entities\n        s\n```\n\nNote that while documents represent the top annotation level containing segments, one cannot prepare a `.tsv` file for them as just described here; all the document annotations must be reported directly in the CoNLL-U files using `# newdoc key = value` comments, as described in the section **CoNLL-U Format**.\n\n\n#### Time-aligned annotations\n\nYour corpus can also include annotations that do not strictly group entities together. The example video corpus includes an annotation named _shot_ that is **time-aligned** but does not necessarily align with tokens or segments on the time axis (e.g. a shot can start in the middle of a sentence and end some time after its end).\n\nMuch like with the annotation types described above, you should also include a corresponding TSV file. 
The first column should list unique IDs; one column should be named `doc_id` and report the ID of the corresponding document (make sure to include corresponding `# newdoc id = <ID>` comments in your CoNLL-U files); two columns named `start` and `end` should list the time points for temporal anchoring, measured in seconds from the start of the document's media file; any extra columns hold additional attributes. For example, `shot.tsv` starts with:\n\n```tsv\nshot_id\tdoc_id\tstart\tend\tview\n1\tBunny\t0.00\t8.00\twide angle\n2\tBunny\t8.05\t12.50\tlow angle\n3\tBunny\t12.75\t16.00\tface-cam\n```\n\nYour template `.json` file should report _Shot_ under `layer`, for example:\n\n```json\n\"Shot\": {\n    \"abstract\": false,\n    \"layerType\": \"unit\",\n    \"anchoring\": {\n        \"location\": false,\n        \"stream\": false,\n        \"time\": true\n    },\n    \"attributes\": {\n        \"view\": {\n            \"type\": \"categorical\"\n        }\n    }\n},\n```\n\nAssuming the sentences are also time-aligned (as in the example corpus) you can then query segments that overlap with specific shots, for example:\n\n```dqd\nSegment s\n\nShot sh\n    OR # either...\n        AND # ... the shot starts in the middle of the segment\n            start >= s.start + 0.0s\n            start <= s.end + 0.0s\n        AND # ... or the shot ends in the middle of the segment\n            end >= s.start + 0.0s\n            end <= s.end + 0.0s\n\nres => plain\n    context\n        s\n    entities\n        sh\n```\n\n#### Global attributes\n\nIn some cases, it makes sense for multiple entity types to share references: in those cases, they can define _global attributes_. An example of a global attribute is a speaker or an agent that can have a name, an age, etc. and be associated with both a segment (a sentence) and, say, a gesture. 
The corpus template would include definitions along these lines:\n\n```json\n\"globalAttributes\": {\n    \"agent\": {\n        \"type\": \"dict\",\n        \"keys\": {\n            \"name\": {\n                \"type\": \"text\"\n            },\n            \"age\": {\n                \"type\": \"number\"\n            }\n        }\n    }\n},\n\"layer\": {\n    \"Segment\": {\n        \"abstract\": false,\n        \"layerType\": \"span\",\n        \"contains\": \"Token\",\n        \"attributes\": {\n            \"agent\": {\n                \"ref\": \"agent\"\n            }\n        }\n    },\n    \"Gesture\": {\n        \"abstract\": false,\n        \"layerType\": \"unit\",\n        \"anchoring\": {\n            \"time\": true\n        },\n        \"attributes\": {\n            \"agent\": {\n                \"ref\": \"agent\"\n            }\n        }\n    }\n}\n```\n\nYou should include a file named `global_attribute_agent.tsv` (mind the singular on `attribute`) with three columns: `agent_id`, `name` and `age`, and reference the values of `agent_id` appropriately as a sentence-level comment in your CoNLL-U files as well as in a file named `gesture.tsv`. 
For example:\n\n*global_attribute_agent.tsv*:\n```tsv\nagent_id\tagent\n10\t{\"name\": \"Jane Doe\", \"age\": 37}\n```\n\nCoNLL-U file:\n```conllu\n# newdoc id = video1\n\n# sent_id = 1\n# agent_id = 10\nThe the _ _ _\n```\n\n*gesture.tsv*:\n```tsv\ngesture_id\tagent_id\tdoc_id\tstart\tend\n1\t10\tvideo1\t1.25\t2.6\n```\n\n#### Media files\n\nIf your corpus includes media files, your `.json` template should report them under a `mediaSlots` key in `meta`, e.g.:\n\n```json\n\"meta\": {\n    \"name\": \"Free Single-Video Corpus\",\n    \"author\": \"LiRI\",\n    \"date\": \"2024-06-13\",\n    \"version\": 1,\n    \"corpusDescription\": \"Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities\",\n    \"mediaSlots\": {\n        \"video\": {\n            \"mediaType\": \"video\",\n            \"isOptional\": false\n        }\n    }\n},\n```\n\nYour CoNLL-U file(s) should accordingly report the name of each document's media file in a comment, like so:\n\n```conllu\n# newdoc video = bunny.mp4\n```\n\nThe `.json` template should also define a main key named `tracks` to control what annotations will be represented along the time axis. 
For example, the following will tell the interface to display separate timeline tracks for the shot, named entity and segment annotations, with the latter being subdivided into as many tracks as there are distinct values for the attribute `speaker` of the segments:\n\n```json\n\"tracks\": {\n    \"layers\": {\n        \"Shot\": {},\n        \"NamedEntity\": {},\n        \"Segment\": {\n            \"split\": [\n                \"speaker\"\n            ]\n        }\n    }\n}\n```\n\nFinally, your **output** corpus folder should include a subfolder named `media` in which all the referenced media files have been placed.\n\n\n#### Attribute types\n\n\nThe values of each attribute (on tokens, segments, documents or at any other level) have a **type**; the most common types are `text`, `number` and `categorical`. The attributes must be reported in the template `.json` file, along with their type (you can see an example in the section **Convert and Upload**).\n\n - `text` vs `categorical`: while both types correspond to alpha-numerical values, `categorical` is meant for attributes that have a limited number of possible values (typically, fewer than 100 distinct values) of a limited length (as a rule of thumb, each value can have up to 50 characters). There are no such limits on values of attributes of type `text`. When a user starts typing a constraint on an attribute of type `categorical`, the DQD editor will offer autocompletion suggestions. 
The attributes of type `text` will have their values listed in a dedicated table (`lcpcli`'s conversion step produces corresponding `.tsv` files) so a query that expresses a constraint on an attribute will be slower if that attribute is of type `text` than if it is of type `categorical`.\n\n - the type `labels` (with an `s` at the end) corresponds to a set of labels that users will be able to constrain in DQD using the `contain` keyword: for example, if an attribute named `genre` is of type `labels`, the user could write a constraint like `genre contain 'drama'` or `genre !contain 'comedy'`. The values of attributes of type `labels` should be one-line strings, with each value separated by a comma (`,`) character (as in, e.g., `# newdoc genre = drama, romance, coming of age, fiction`); as a consequence, no label can contain the character `,`.\n\n - the type `dict` corresponds to key-value pairs as represented in JSON\n\n - the type `date` requires values to be formatted in a way that can be parsed by PostgreSQL\n\n\n### Convert and Upload\n\n1. Create a directory in which you have all your properly formatted CoNLL-U files\n\n2. 
In the same directory, create a template `.json` file that describes your corpus structure (see above about the `attributes` key on `Document` and `Segment`), for example:\n\n```json\n{\n    \"meta\": {\n        \"name\": \"Free Single-Video Corpus\",\n        \"author\": \"LiRI\",\n        \"date\": \"2024-06-13\",\n        \"version\": 1,\n        \"corpusDescription\": \"Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities\",\n        \"mediaSlots\": {\n            \"video\": {\n                \"mediaType\": \"video\",\n                \"isOptional\": false\n            }\n        }\n    },\n    \"firstClass\": {\n        \"document\": \"Document\",\n        \"segment\": \"Segment\",\n        \"token\": \"Token\"\n    },\n    \"layer\": {\n        \"Token\": {\n            \"abstract\": false,\n            \"layerType\": \"unit\",\n            \"anchoring\": {\n                \"location\": false,\n                \"stream\": true,\n                \"time\": true\n            },\n            \"attributes\": {\n                \"form\": {\n                    \"isGlobal\": false,\n                    \"type\": \"text\",\n                    \"nullable\": true\n                },\n                \"lemma\": {\n                    \"isGlobal\": false,\n                    \"type\": \"text\",\n                    \"nullable\": false\n                },\n                \"upos\": {\n                    \"isGlobal\": true,\n                    \"type\": \"categorical\",\n                    \"nullable\": true\n                },\n                \"xpos\": {\n                    \"isGlobal\": false,\n                    \"type\": \"categorical\",\n                    \"nullable\": true\n                },\n                \"ufeat\": {\n                    \"isGlobal\": false,\n                    \"type\": \"dict\",\n                    \"nullable\": true\n             
   }\n            }\n        },\n        \"DepRel\": {\n            \"abstract\": true,\n            \"layerType\": \"relation\",\n            \"attributes\": {\n                \"udep\": {\n                    \"type\": \"categorical\",\n                    \"isGlobal\": true,\n                    \"nullable\": false\n                },\n                \"source\": {\n                    \"name\": \"dependent\",\n                    \"entity\": \"Token\",\n                    \"nullable\": false\n                },\n                \"target\": {\n                    \"name\": \"head\",\n                    \"entity\": \"Token\",\n                    \"nullable\": true\n                },\n                \"left_anchor\": {\n                    \"type\": \"number\",\n                    \"nullable\": false\n                },\n                \"right_anchor\": {\n                    \"type\": \"number\",\n                    \"nullable\": false\n                }\n            }\n        },\n        \"NamedEntity\": {\n            \"abstract\": false,\n            \"layerType\": \"span\",\n            \"contains\": \"Token\",\n            \"attributes\": {\n                \"form\": {\n                    \"isGlobal\": false,\n                    \"type\": \"text\",\n                    \"nullable\": false\n                },\n                \"type\": {\n                    \"isGlobal\": false,\n                    \"type\": \"categorical\",\n                    \"nullable\": true\n                }\n            }\n        },\n        \"Shot\": {\n            \"abstract\": false,\n            \"layerType\": \"span\",\n            \"anchoring\": {\n                \"location\": false,\n                \"stream\": false,\n                \"time\": true\n            },\n            \"attributes\": {\n                \"view\": {\n                    \"isGlobal\": false,\n                    \"type\": \"categorical\",\n                    \"nullable\": false\n           
     }\n            }\n        },\n        \"Segment\": {\n            \"abstract\": false,\n            \"layerType\": \"span\",\n            \"contains\": \"Token\",\n            \"attributes\": {\n                \"meta\": {\n                    \"text\": {\n                        \"type\": \"text\"\n                    },\n                    \"start\": {\n                        \"type\": \"text\"\n                    },\n                    \"end\": {\n                        \"type\": \"text\"\n                    }\n                }\n            }\n        },\n        \"Document\": {\n            \"abstract\": false,\n            \"contains\": \"Segment\",\n            \"layerType\": \"span\",\n            \"attributes\": {\n                \"meta\": {\n                    \"audio\": {\n                        \"type\": \"text\",\n                        \"isOptional\": true\n                    },\n                    \"video\": {\n                        \"type\": \"text\",\n                        \"isOptional\": true\n                    },\n                    \"start\": {\n                        \"type\": \"number\"\n                    },\n                    \"end\": {\n                        \"type\": \"number\"\n                    },\n                    \"name\": {\n                        \"type\": \"text\"\n                    }\n                }\n            }\n        }\n    },\n    \"tracks\": {\n        \"layers\": {\n            \"Shot\": {},\n            \"Segment\": {},\n            \"NamedEntity\": {}\n        }\n    }\n}\n```\n\n3. If your corpus defines a character-anchored entity type such as named entities, make sure you also include a properly named and formatted TSV file for it in the directory (see [the Named Entities section](#named-entities))\n\n4. Visit an LCP instance (e.g. _catchphrase_) and create a new project if you don't already have one where your corpus should go\n\n5. 
Retrieve the API key and secret for your project by clicking on the button that says: \"Create API Key\"\n\n    The secret will appear at the bottom of the page and remain visible only for 120s, after which it will disappear forever (you would then need to revoke the API key and create a new one)\n    \n    The key itself is listed above the button that says \"Revoke API key\" (make sure to **not** copy the line that starts with \"Secret Key\" along with the API key itself)\n\n6. Once you have your API key and secret, you can start converting and uploading your corpus by running the following command:\n\n```\nlcpcli -i $CONLLU_FOLDER -o $OUTPUT_FOLDER -m upload -k $API_KEY -s $API_SECRET -p $PROJECT_NAME --live\n```\n\n- `$CONLLU_FOLDER` should point to the folder that contains your CONLLU files\n- `$OUTPUT_FOLDER` should point to *another* folder that will be used to store the converted files to be uploaded\n- `$API_KEY` is the key you copied from your project on LCP (still visible when you visit the page)\n- `$API_SECRET` is the secret you copied from your project on LCP (only visible upon API Key creation)\n- `$PROJECT_NAME` is the name of the project exactly as displayed on LCP -- it is case-sensitive, and space characters should be escaped\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Helper for converting CONLLU files and uploading the corpus to LiRI Corpus Platform (LCP)",
    "version": "0.1.8",
    "project_urls": {
        "Homepage": "https://github.com/liri-uzh/lcpcli/issues",
        "Issues": "https://github.com/liri-uzh/lcpcli/issues"
    },
    "split_keywords": [
        "conll",
        " tei",
        " vert",
        " corpora",
        " corpus",
        " linguistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "89816e5b58d3f2bbb6e03ce51589a82271d736fefbe386157032e9e0c97e786c",
                "md5": "69e8b92fa493d0eb61c458bbeea3aee8",
                "sha256": "d4c2533cdfb442b802c34062204c3ade86e4aa52969e1d906aaba64272fbd2c5"
            },
            "downloads": -1,
            "filename": "lcpcli-0.1.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "69e8b92fa493d0eb61c458bbeea3aee8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 10612015,
            "upload_time": "2024-12-09T13:26:43",
            "upload_time_iso_8601": "2024-12-09T13:26:43.246778Z",
            "url": "https://files.pythonhosted.org/packages/89/81/6e5b58d3f2bbb6e03ce51589a82271d736fefbe386157032e9e0c97e786c/lcpcli-0.1.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9321b857720cbb5685975fdea2fce1b76db9605c40cce8b8dffc7005a435cda1",
                "md5": "cd672e37326151442a7a4bc10d1b9451",
                "sha256": "2fc719660353cddfb157317fdf32428870001840f898a07db7624c422830c63a"
            },
            "downloads": -1,
            "filename": "lcpcli-0.1.8.tar.gz",
            "has_sig": false,
            "md5_digest": "cd672e37326151442a7a4bc10d1b9451",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 10605566,
            "upload_time": "2024-12-09T13:27:11",
            "upload_time_iso_8601": "2024-12-09T13:27:11.570598Z",
            "url": "https://files.pythonhosted.org/packages/93/21/b857720cbb5685975fdea2fce1b76db9605c40cce8b8dffc7005a435cda1/lcpcli-0.1.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-09 13:27:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "liri-uzh",
    "github_project": "lcpcli",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "lcpcli"
}
        