kollo

Name	kollo
Version	1.0.1
Summary	Extract collocations from VERT data
upload_time	2023-04-28 09:31:41
author	Danny McDonald
license	MIT
keywords	corpus linguistics corpora collocation vert
requirements	No requirements were recorded.

# kollo: extract collocations from VERT formatted corpora

> Author: Danny McDonald, UZH

## Installation

```bash
pip install kollo
# or
git clone https://gitlab.uzh.ch/LiRI/projects/kollo
cd kollo
python setup.py install
```

## CLI Usage

You can start the tool from your shell with:

```bash
python -m kollo input/file.vrt
# or
kollo input/file.vrt
```

The tool accepts the following arguments:

```
usage: kollo [-h] [-l LEFT] [-r RIGHT] [-s SPAN] [-m {lr,sll,lmi,mi,mi3,ld,t,z}] [-sw STOPWORDS] [-t TARGET] [-n NUMBER] [-o OUTPUT] [-c] [-p] [-csv [CSV]] input [query]

Extract collocations from VERT formatted corpora

positional arguments:
  input                 Input file path
  query                 Optional regex to search for (i.e. to appear in all collocation results)

optional arguments:
  -h, --help            show this help message and exit
  -l LEFT, --left LEFT  Window to the left in tokens
  -r RIGHT, --right RIGHT
                        Window to the right in tokens
  -s SPAN, --span SPAN  XML span to use as window (e.g. s or p)
  -m {lr,sll,lmi,mi,mi3,ld,t,z}, --metric {lr,sll,lmi,mi,mi3,ld,t,z}
                        Collocation metric
  -sw STOPWORDS, --stopwords STOPWORDS
                        Path to file containing stopwords (one per line)
  -t TARGET, --target TARGET
                        Index of VERT column to be searched as node
  -n NUMBER, --number NUMBER
                        Number of top results to return (-1 will return all)
  -o OUTPUT, --output OUTPUT
                        Comma-sep index/indices of VERT column to be calculated as collocations
  -c, --case-sensitive  Do case sensitive search
  -p, --preserve        Preserve original sequential order of tokens in bigram
  -csv [CSV], --csv [CSV]
                        Output comma-separated values
```

### Python usage

```python
from kollo import kollo

kollo(
    "path/to/file.vrt",
    query="^Reg(ex|ular expression)$",  # optional
    left=5,
    right=5,
    span=None,
    number=20,
    metric='lr',
    target=0,
    output=[0],
    stopwords=None,
    case_sensitive=False,
    preserve=False,
    csv=False
)
```

### Supported metrics (and their short names)

* Likelihood ratio (`lr`)
* Simple log-likelihood (`sll`)
* Mutual information (`mi`)
* Local mutual information (`lmi`)
* MI3 (`mi3`)
* Log Dice (`ld`)
* T-score (`t`)
* Z-score (`z`)
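
For orientation, the sketch below computes several of these measures from observed and expected co-occurrence counts, using their common textbook definitions. It is not taken from kollo's source: the function name `association_scores` and its arguments are hypothetical, and kollo's own formulas may differ in details such as logarithm base or smoothing. The likelihood ratio (`lr`) is left out because it needs the full contingency table rather than these four counts.

```python
import math


def association_scores(o11, f1, f2, n):
    """Textbook association measures for one node/collocate pair.

    o11: observed co-occurrences of node and collocate (must be > 0)
    f1:  frequency of the node
    f2:  frequency of the collocate
    n:   sample size (e.g. total number of tokens)
    """
    e11 = f1 * f2 / n  # expected co-occurrences if the two were independent
    return {
        "mi": math.log2(o11 / e11),
        "mi3": math.log2(o11 ** 3 / e11),
        "lmi": o11 * math.log2(o11 / e11),
        "t": (o11 - e11) / math.sqrt(o11),
        "z": (o11 - e11) / math.sqrt(e11),
        "ld": 14 + math.log2(2 * o11 / (f1 + f2)),
        "sll": 2 * (o11 * math.log(o11 / e11) - (o11 - e11)),
    }


scores = association_scores(o11=8, f1=16, f2=16, n=64)
for name, value in scores.items():
    print(f"{name}\t{value:.4f}")
```

Measures such as `mi` favour rare, exclusive pairs, while `t` and `lmi` favour frequent ones, so it can be worth comparing rankings under more than one metric.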

### Spans

If you enter a span (e.g. `s`) instead of a left/right window, the collocation window expands from the matching node to the nearest `s` tags in both directions. This can produce very large windows and cause memory or performance problems, especially for spans broader than one sentence.

If you specify a left and/or right window as well as a span, the window is cut off at the span's XML tags if they are encountered. For example, `left=2, right=2, span="s"` gives a window of two tokens on each side that never crosses a sentence boundary. If you do not enter a span, left/right windows can cross sentence boundaries.

Note that spans cannot be given as regular expressions, and only one span can be provided (for now).
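
The interaction between `left`/`right` and `span` can be illustrated with a small sketch. This is not kollo's implementation: `tag_name` and `window_indices` are hypothetical helpers, and the code assumes that every line starting with `<` is a structural tag and every other line is a token.

```python
def tag_name(line):
    """Return the tag name of a structural line, or None for a token line."""
    if not line.startswith("<"):
        return None
    parts = line.strip("</>").split()
    return parts[0] if parts else None


def window_indices(lines, node, left=None, right=None, span=None):
    """Indices of the token lines that fall in the window around lines[node].

    left/right: maximum tokens on each side (None means no token limit)
    span:       tag name at which the window stops (None means never stop)
    """
    def walk(step, limit):
        found = []
        i = node + step
        while 0 <= i < len(lines) and (limit is None or len(found) < limit):
            name = tag_name(lines[i])
            if name is None:
                found.append(i)  # ordinary token: part of the window
            elif name == span:
                break            # reached the span boundary: stop here
            i += step            # other tags are skipped without counting
        return found

    return walk(-1, left)[::-1] + walk(1, right)


vrt = ["<s>", "a\tX", "b\tX", "</s>", "<s>", "c\tX", "d\tX", "e\tX", "</s>"]
print(window_indices(vrt, node=5, left=2, right=2))            # may cross sentences
print(window_indices(vrt, node=5, left=2, right=2, span="s"))  # stays inside the sentence
print(window_indices(vrt, node=6, span="s"))                   # whole sentence around the node
```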

### Target and output

`target` is the index of the VRT column that your query is matched against. The leftmost column, typically the original token, is number 0. So, if your VRT corpus is in `token<tab>POS<tab>lemma` format, set `target` to 2 to query the lemma column.

For `output`, you also provide column indices, but you can give more than one. On the CLI, `--output=1,2` formats results from a `token<tab>POS<tab>lemma` corpus as `NNS/friend`. In Python, pass a list of integers matching the column indices you want to use.
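
As a rough illustration of how the two options relate to a single VRT line, consider the sketch below. The helpers `matches` and `format_token` are hypothetical and are not part of kollo's API. The sketch also assumes the query is applied with `re.search`, which the documentation does not state.

```python
import re


def matches(line, query, target=0, case_sensitive=False):
    """Test the regex against the column selected by `target`."""
    columns = line.rstrip("\n").split("\t")
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(query, columns[target], flags) is not None


def format_token(line, output):
    """Join the columns selected by `output` with a slash."""
    columns = line.rstrip("\n").split("\t")
    return "/".join(columns[i] for i in output)


line = "friends\tNNS\tfriend"          # token<tab>POS<tab>lemma
print(matches(line, "^friend$", target=2))  # query the lemma column
print(format_token(line, [1, 2]))           # POS and lemma
```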

### Example

```python
from kollo import kollo
kollo("./sample.vrt",
    query="en$",
    target=0,
    output=[1,2],
    number=3,
    left=0,
    right=1,
    metric="lr",
    stopwords="stopwords.txt",
    case_sensitive=True,
    preserve=False,
    csv=False
)
```

Results in:

```
VAFIN/sein    ART/d           1202.0321
VAINF/werden  VMFIN/können    853.0279
VAFIN/haben   PPER/wir        758.4650
```

The exact equivalents on the command line would be:

```bash
kollo ./sample.vrt "en$" -t 0 -o 1,2 -n 3 -l 0 -r 1 -m lr -sw stopwords.txt -c
```

or

```bash
python -m kollo ./sample.vrt "en$" --target=0 --output=1,2 --number=3 --left=0 --right=1 --metric=lr --stopwords=stopwords.txt --case-sensitive
```
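
The correspondence between the Python keyword arguments and the long CLI flags is regular enough to sketch in code. The helper `to_cli_args` below is hypothetical and not part of kollo. It only uses flags listed in the help text above, and it assumes that `None` and `False` values are simply omitted.

```python
def to_cli_args(path, query=None, **options):
    """Build a kollo argument list from Python-style keyword arguments."""
    args = ["kollo", path]
    if query is not None:
        args.append(query)
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is None or value is False:
            continue                      # unset option: leave the flag out
        if value is True:
            args.append(flag)             # boolean switch, e.g. --case-sensitive
        elif isinstance(value, (list, tuple)):
            args.append(f"{flag}={','.join(map(str, value))}")
        else:
            args.append(f"{flag}={value}")
    return args


args = to_cli_args(
    "./sample.vrt",
    query="en$",
    target=0,
    output=[1, 2],
    number=3,
    left=0,
    right=1,
    metric="lr",
    stopwords="stopwords.txt",
    case_sensitive=True,
    preserve=False,
    csv=False,
)
print(" ".join(args))
```

The resulting list could be passed to `subprocess.run(args)`, which avoids shell quoting issues with characters such as `$` in the query.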

### CSV creation

If you want to generate a CSV file containing your results, use the `-csv` argument with a filepath:

```bash
kollo example.vrt "test" -csv output.csv
```

Without a filename, the CSV results will print to stdout (so you can pipe them elsewhere if need be):

```bash
kollo example.vrt "test" -csv | grep ...
```

From the Python interface, pass `csv="output.csv"` to `kollo()` to write results to that file. `csv=True` prints CSV-formatted results to stdout.
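
The three-way behaviour of the `csv` argument (off, stdout, or file) can be sketched as follows. This is an illustration only, not kollo's code: `write_results`, its `stream` parameter, and the row layout of two items plus a score are all assumptions.

```python
import csv
import sys


def write_results(rows, csv_arg=False, stream=None):
    """Dispatch on a csv argument with the semantics described above.

    False or None -> write no CSV
    True          -> write CSV to stdout (or to `stream`, if given)
    a string      -> write CSV to the file at that path
    """
    if csv_arg is False or csv_arg is None:
        return
    if csv_arg is True:
        csv.writer(stream or sys.stdout).writerows(rows)
        return
    with open(csv_arg, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


rows = [("VAFIN/sein", "ART/d", 1202.0321)]
write_results(rows, csv_arg=True)  # prints one CSV line to stdout
```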

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "kollo",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "corpus,linguistics,corpora,collocation,vert",
    "author": "Danny McDonald",
    "author_email": "daniel.mcdonald@uzh.ch",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Extract collocations from VERT data",
    "version": "1.0.1",
    "split_keywords": [
        "corpus",
        "linguistics",
        "corpora",
        "collocation",
        "vert"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "60e2be34665a9b90668da360b0ff6fbd16dd8e6fea90eb37b87a663eb1ca32b9",
                "md5": "7317c9a2427ffb653dd811a0c1cb4916",
                "sha256": "239184078fa430b3cc0f3f3fd16f3a7dde93f89b63adc2d028837c48ef366868"
            },
            "downloads": -1,
            "filename": "kollo-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7317c9a2427ffb653dd811a0c1cb4916",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 13010,
            "upload_time": "2023-04-28T09:31:41",
            "upload_time_iso_8601": "2023-04-28T09:31:41.959780Z",
            "url": "https://files.pythonhosted.org/packages/60/e2/be34665a9b90668da360b0ff6fbd16dd8e6fea90eb37b87a663eb1ca32b9/kollo-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-28 09:31:41",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "kollo"
}
