# pylexibank
[![Build Status](https://github.com/lexibank/pylexibank/workflows/tests/badge.svg)](https://github.com/lexibank/pylexibank/actions?query=workflow%3Atests)
[![PyPI](https://img.shields.io/pypi/v/pylexibank.svg)](https://pypi.org/project/pylexibank)
`pylexibank` is a python package providing functionality to curate and aggregate
[Lexibank](https://github.com/lexibank/lexibank) datasets.
## Compatibility
At the core of the curation functionality provided by `pylexibank` lies integration
with the metadata catalogs [Glottolog](https://glottolog.org),
[Concepticon](https://concepticon.clld.org) and [CLTS](https://clts.clld.org).
Not all releases of these catalogs are compatible with all versions of
`pylexibank`.
pylexibank | Glottolog | Concepticon | CLTS
--- | --- | --- | ---
2.x | \>=4.x | \>=2.x | **1.x**
3.x | \>=4.x | \>=2.x | **\>=2.x**
## Install
Since `pylexibank` has quite a few dependencies, installing it will pull in
many other python packages. To avoid side effects for your default
python installation, we recommend installing it in a
[virtual environment](https://virtualenv.pypa.io/en/stable/).
Now you may install `pylexibank` via pip or in development mode following the instructions
in [CONTRIBUTING.md](CONTRIBUTING.md).
Installing `pylexibank` will also install [`cldfbench`](https://github.com/cldf/cldfbench), which provides the `cldfbench` command line tool. `pylexibank` functionality is run from the command line via subcommands of this tool.
`cldfbench` is also used to [manage reference catalogs](https://github.com/cldf/cldfbench/#catalogs), in particular Glottolog,
Concepticon and CLTS. Thus, after installing `pylexibank` you should run
```shell
cldfbench catconfig
```
to make sure the catalog data is locally available and `pylexibank` knows about it.
## Usage
`pylexibank` can be used in two ways:
- The command line interface mainly provides access to the functionality of the `lexibank`
  curation workflow.
- The `pylexibank` package can also be used like any other python package in your own
python code to access lexibank data in a programmatic (and consistent) way.
### The `cmd_makecldf` method
The main goal of `pylexibank` is creating high-quality CLDF Wordlists. This
happens in the custom `cmd_makecldf` method of a Lexibank dataset. To make this task
easier, `pylexibank` provides
- **access to Glottolog and Concepticon data:**
- `args.glottolog.api` points to an instance of [`CachingGlottologAPI`](https://github.com/cldf/cldfbench/blob/f373855e3b9cde029578e77c26136f0df26a82fa/src/cldfbench/catalogs.py#L10-L40) (a subclass of `pyglottolog.Glottolog`)
- `args.concepticon.api` points to an instance of [`CachingConcepticonAPI`](https://github.com/cldf/cldfbench/blob/f373855e3b9cde029578e77c26136f0df26a82fa/src/cldfbench/catalogs.py#L48-L51) (a subclass of `pyconcepticon.Concepticon`)
- **fine-grained control over form manipulation** via a `Dataset.form_spec`, an instance
of [`pylexibank.FormSpec`](src/pylexibank/forms.py) which can be customized per
dataset. `FormSpec` is meant to capture the rules that have been used when compiling
the source data - for cases where the source data violates these rules, wholesale
replacement by listing a lexeme in `etc/lexemes.csv` is recommended.
- **support for additional information** on lexemes, cognates, concepts and languages via
subclassing the defaults in [`pylexibank.models`](src/pylexibank/models.py)
- **easy access to configuration data** in a dataset's `etc_dir`
- **support for segmentation** using the [`segments`](https://pypi.org/project/segments)
package with orthography profile(s):
- If an orthography profile is available as `etc/orthography.tsv`, a `segments.Tokenizer`
instance, initialized with this profile, will be available as `Dataset.tokenizer`
and automatically used by `LexibankWriter.add_form`.
- If a directory `etc/orthography/` exists, all `*.tsv` files in it will be considered
    orthography profiles, and a `dict` mapping filename stem to tokenizer will be available. Tokenizer
    selection can be controlled in two ways:
    - Pass a keyword `profile=FILENAME_STEM` in `Dataset.tokenizer()` calls.
    - Provide an orthography profile per language and let `Dataset.tokenizer`
      choose the tokenizer based on `item['Language_ID']`.
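The form-manipulation rules that a `FormSpec` captures (splitting multi-form values, dropping missing-data markers, applying replacements) can be illustrated with a minimal stdlib sketch. Note that `clean_forms` and its default rules are hypothetical stand-ins for illustration, not pylexibank's actual implementation:

```python
import re

def clean_forms(value, separators=";/", missing=("?", "-"), replacements=(("_", " "),)):
    """Split a raw source value into individual forms and normalize each one."""
    forms = []
    for chunk in re.split("[" + re.escape(separators) + "]", value):
        form = chunk.strip()
        if not form or form in missing:
            continue  # drop empty chunks and missing-data markers
        for old, new in replacements:
            form = form.replace(old, new)
        forms.append(form)
    return forms

print(clean_forms("kumi_s; ?"))  # ['kumi s']
```

Rules like these should reflect how the source data was compiled; anything that does not fit a rule is better handled by a wholesale replacement in `etc/lexemes.csv`.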
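Orthography-profile segmentation itself boils down to greedy longest-match lookup. The following self-contained sketch shows that idea only; the real `segments.Tokenizer` is considerably more capable, and the example profile and graphemes here are made up:

```python
def tokenize(form, profile):
    """Segment a form by greedy longest-match against an orthography profile,
    i.e. a mapping from graphemes in the source spelling to target segments."""
    graphemes = sorted(profile, key=len, reverse=True)  # try longest graphemes first
    result, i = [], 0
    while i < len(form):
        for g in graphemes:
            if form.startswith(g, i):
                result.append(profile[g])
                i += len(g)
                break
        else:
            result.append("\ufffd")  # unknown grapheme: flag it for profile curation
            i += 1
    return result

print(tokenize("tshaa", {"tsh": "tsʰ", "aa": "aː", "a": "a"}))  # ['tsʰ', 'aː']
```

Unmatched graphemes are flagged rather than silently dropped, which is also how orthography profiles are typically refined during curation.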
## Programmatic access to Lexibank datasets
While some level of support for reading and writing any [CLDF](https://cldf.clld.org) dataset is already provided by the [`pycldf` package](https://pypi.org/project/pycldf), `pylexibank` (building on `cldfbench`) adds another layer of abstraction which supports
- treating Lexibank datasets as Python packages (and managing them via `pip`),
- a multi-step curation workflow,
- aggregating collections of Lexibank datasets into a single SQLite database for efficient analysis.
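Such an aggregated database can be explored with Python's standard `sqlite3` module. The file name below is hypothetical, and listing `sqlite_master` first is a safe way to discover the actual schema before querying the data:

```python
import sqlite3

# 'lexibank.sqlite' is a hypothetical path to an aggregated Lexibank database.
con = sqlite3.connect("lexibank.sqlite")

# List the tables to discover the schema before writing real queries.
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)
con.close()
```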
### Installable and `pylexibank` enabled datasets
Turning a Lexibank dataset into a (`pip` installable) Python package is as simple as writing a [setup script](https://docs.python.org/3/distutils/setupscript.html) `setup.py`.
But to make the dataset available for curation via `pylexibank`, the dataset must provide
- a python module
- containing a class derived from `pylexibank.Dataset`, which specifies
  - `Dataset.dir`: A directory relative to which the [curation directories](dataset.md) are located.
- `Dataset.id`: An identifier of the dataset.
- which is advertised as `lexibank.dataset` [entry point](https://packaging.python.org/specifications/entry-points/) in `setup.py`. E.g.
```python
entry_points={
'lexibank.dataset': [
'sohartmannchin=lexibank_sohartmannchin:Dataset',
]
},
```
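Entry-point discovery is what makes installed datasets findable at runtime. The following sketch uses only the standard library; it mirrors the general mechanism `pylexibank` relies on, though its actual implementation may differ:

```python
import sys
from importlib.metadata import entry_points

# Collect all datasets advertised under the 'lexibank.dataset' entry point.
if sys.version_info >= (3, 10):
    eps = entry_points(group="lexibank.dataset")
else:  # the pre-3.10 importlib.metadata API returns a dict of groups
    eps = entry_points().get("lexibank.dataset", [])

for ep in eps:
    dataset_cls = ep.load()  # the class derived from pylexibank.Dataset
    print(ep.name, dataset_cls.id)
```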
Turning datasets into `pylexibank` enabled python packages has multiple advantages:
- Datasets can be installed from various sources, e.g. GitHub repositories.
- Requirements files can be used to "pin" particular versions of datasets for installation.
- Upon installation datasets can be discovered programmatically.
- [Virtual environments](https://virtualenv.pypa.io/en/latest/) can be used to manage projects which require different versions of the same dataset.
#### Conventions
1. Dataset identifiers should be lowercase and either:
- the database name, if this name is established and well-known (e.g. "abvd", "asjp" etc),
- \<author\>\<languagegroup\> (e.g. "grollemundbantu" etc)
2. Datasets that require preprocessing with external programs (e.g. antiword, libreoffice) should store intermediate artifacts in the `./raw/` directory, and the `cmd_install` code should install from these rather than requiring an external dependency.
3. Declaring a dataset's dependence on `pylexibank`:
- specify minimum versions in `setup.py`, i.e. require `pylexibank>=1.x.y`
- specify exact versions in dataset's `cldf-metadata.json` using `prov:createdBy` property (`pylexibank` will take care of this when the CLDF is created via `lexibank makecldf`).
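The lowercase-identifier convention from point 1 can be expressed as a simple check. The pattern below is just an illustration of the convention, not something `pylexibank` enforces:

```python
import re

def is_conventional_id(dataset_id):
    """True if the identifier is all-lowercase ASCII letters (optionally
    followed by digits), like 'abvd' or 'grollemundbantu'."""
    return re.fullmatch(r"[a-z][a-z0-9]*", dataset_id) is not None

print(is_conventional_id("grollemundbantu"))  # True
print(is_conventional_id("Grollemund-Bantu"))  # False
```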
#### Datasets on GitHub
GitHub provides a very good platform for collaborative curation of textual data
such as Lexibank datasets.
Dataset curators are encouraged to make use of features beyond just version control, such as
- releases
- README.md, LICENSE, CONTRIBUTORS.md
Note that for datasets curated with `pylexibank`, summary statistics will be written to `README.md` as part of the `makecldf` command.
In addition to the support for collaboratively editing and versioning data, GitHub supports tying into additional services via webhooks. In particular, two of these services are relevant for Lexibank datasets:
- Continuous integration, e.g. via Travis-CI
- Archiving with Zenodo. Notes:
  - When datasets are curated on GitHub and hooked up to Zenodo to trigger automatic deposits of releases, the release tag **must** start with a letter (otherwise the deposit will fail).
  - Additional tags can be added to provide context, e.g. when a release is triggered by a specific use case (for example the CLICS 2.0 release). This can be done using `git` as follows:
```bash
git checkout tags/vX.Y.Z
git tag -a "clics-2.0"
git push origin --tags
```
### Attribution
There are multiple levels of contributions to a Lexibank dataset:
- Typically, Lexibank datasets are derived from published data (be it supplemental material of a paper or public databases). Attribution to this source dataset is given by specifying its full citation in the dataset's metadata and by adding the source title to the release title of a lexibank dataset.
- Often the source dataset is also an aggregation of data from other sources. If possible, these sources (and the corresponding references) are kept in the Lexibank dataset's CLDF; otherwise we refer to the source dataset for a description of its sources.
- Deriving a Lexibank dataset from a source dataset using the `pylexibank` curation workflow involves adding code, mapping to reference catalogs and to some extent also linguistic judgements. These contributions are listed in a dataset's `CONTRIBUTORS.md` and translate to the list of authors of released versions of the lexibank dataset.