gabra-converter

Name: gabra-converter
Version: 1.1.0
Summary: A program for converting data from a Ġabra database dump to a more regular and accessible format.
Author email: Marc Tanti <marc.tanti@um.edu.mt>
Repository: https://github.com/mtanti/gabra-converter/
Download URL: https://files.pythonhosted.org/packages/4d/9e/43e6b9217ea29b1f3cbede3fa8e5a1388c473c420798a28a4ca28273aab1/gabra_converter-1.1.0.tar.gz
Upload time: 2024-06-29 21:53:23
Requires Python: >=3.9
License: MIT License
Keywords: ġabra, gabra, malti, maltese
Requirements: Sphinx==5.3.0, build==1.1.1, mypy==0.991, pydantic==1.10.3, pyinstaller==5.6.2, pylint==2.15.9, twine==5.0.0
# Ġabra Converter

This program converts [Ġabra](https://mlrs.research.um.edu.mt/resources/gabra/)'s [database dump files](https://mlrs.research.um.edu.mt/resources/gabra-api/p/download), which are intended for [MongoDB](https://www.mongodb.com/), into a more accessible format, cleaning and normalising the data in the process.

## How to use

To use this program, you will need the following command-line tools available on your computer:

- `tar`: [7-zip archiver](https://www.7-zip.org/download.html)
- `bsondump`: [MongoDB tool](https://www.mongodb.com/docs/database-tools/installation/installation/)

Make sure that you install the above applications, then test that they are available by running the following commands (a programmatic check is sketched after this list):

- `tar --version`
- `bsondump --version`
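
If you prefer to perform the same check from Python, a minimal sketch using only the standard library could look like the following. It simply verifies that the two tools are on the `PATH` and prints their version strings; it is an illustration, not part of the package itself.

```python
import shutil
import subprocess

# Verify that the external tools required by the converter are on the PATH.
for tool in ['tar', 'bsondump']:
    if shutil.which(tool) is None:
        raise SystemExit(f'{tool} was not found on the PATH; please install it first.')
    # Print the first line of the version output as a quick sanity check
    # (some tools print their version to stderr instead of stdout).
    result = subprocess.run([tool, '--version'], capture_output=True, text=True)
    print((result.stdout or result.stderr).splitlines()[0])
```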

Once you have these applications available in your command line, you can now download a [Ġabra database dump file](https://mlrs.research.um.edu.mt/resources/gabra-api/p/download).

Run the converter by calling `python bin/run_gabra_converter.py` or `gabra_converter.exe` in the command line as follows:

`python bin/run_gabra_converter.py --gabra_dump_path <path to dump file> --out_path <path to folder with exported files> --lexeme_cleaners <space separated list of lexeme cleaner names> --wordform_cleaners <space separated list of wordform cleaner names> --lexeme_exporter <exporter name> --wordform_exporter <exporter name, usually the same as the lexeme exporter>`

Here is a typical example:

`python bin/run_gabra_converter.py --gabra_dump_path path/to/gabra --out_path path/to/out --lexeme_cleaners --wordform_cleaners --lexeme_exporter csv --wordform_exporter csv`

or with the `gabra_converter.exe`:

`gabra_converter --gabra_dump_path path/to/gabra --out_path path/to/out --lexeme_cleaners new_lines --wordform_cleaners --lexeme_exporter csv --wordform_exporter csv`

Run `python bin/run_gabra_converter.py --help` or `gabra_converter --help` for more information.

## What is exported

All the exported data is based on [the official Ġabra schema](https://mlrs.research.um.edu.mt/resources/gabra-api/p/schema).
Whilst MongoDB is a NoSQL database that allows fields to be left out of rows entirely (in MongoDB, rows are called documents and tables are called collections), the exported data is structured as flat tables.
All the fields in the schema are included in the export and are left empty in rows that do not use them.
On the other hand, any fields that are not mentioned in the schema but still appear in some rows, such as `norm_freq`, are left out.
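
As an illustration (the document below is invented for this example, not taken from Ġabra), a lexeme document and its flattened export row relate roughly as follows:

```python
# A hypothetical Ġabra lexeme document (field values invented for illustration).
document = {
    '_id': '63b1e0f314e849fa182bcfc3',
    'lemma': 'kiteb',
    'pending': False,
    'norm_freq': 42,  # appears in some rows but not in the schema, so it is dropped
}

# The corresponding flat export row: every schema field becomes a column,
# and schema fields that the document does not contain are left empty.
row = {
    'new_id': '1',
    '_id': '63b1e0f314e849fa182bcfc3',
    'lemma': 'kiteb',
    'pending': '0',        # booleans become 0/1 (see below)
    'derived_form': '',    # in the schema but absent from this document
}
```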

A number of files are generated to handle one-to-many relationships.
For example, since one lexeme can have many glosses (glosses are stored as a list in Ġabra), a separate file for glosses is created such that each row in the lexemes file can refer to multiple rows in the glosses file.
Non-list fields that are represented as nested objects are flattened such that the field `"root":{"radicals":"b-ħ-b-ħ","variant":2}` becomes two fields, `root-radicals` and `root-variant`, with a dash used to separate parent names from child names.
Any unnecessarily nested objects produced by MongoDB to specify data types (objects consisting of a single field whose name starts with a dollar sign) are not preserved.
So numbers stored under `"$numberInt"`, such as `"derived_form":{"$numberInt":1}`, are exported as a plain `derived_form` field without the nested object.
Boolean values are represented as 0 for false and 1 for true.
Finally, while MongoDB uses hexadecimal numbers for primary and foreign keys, such as `63b1e0f314e849fa182bcfc3`, the export also includes its own decimal primary and foreign keys for ease of use in relational databases.
These fields will have their field names prefixed with `new_`, such as `new_id` and `new_lexeme_id`.
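
A minimal sketch of this flattening behaviour (an illustration of the rules above, not the package's actual implementation):

```python
def flatten(value, prefix=''):
    """Flatten a nested document into dash-separated column names,
    unwrapping single-field '$' type wrappers and turning booleans into 0/1."""
    if isinstance(value, dict):
        # Unwrap type-specifier objects such as {"$numberInt": 1}.
        if len(value) == 1 and next(iter(value)).startswith('$'):
            return flatten(next(iter(value.values())), prefix)
        flat = {}
        for key, child in value.items():
            flat.update(flatten(child, f'{prefix}-{key}' if prefix else key))
        return flat
    if isinstance(value, bool):
        return {prefix: 1 if value else 0}
    return {prefix: value}

print(flatten({
    'root': {'radicals': 'b-ħ-b-ħ', 'variant': 2},
    'derived_form': {'$numberInt': 1},
    'pending': False,
}))
# {'root-radicals': 'b-ħ-b-ħ', 'root-variant': 2, 'derived_form': 1, 'pending': 0}
```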

The following exporters are supported:

### `csv`

At the moment, the program only supports CSV (Comma Separated Values) file exports.
The files generated are the following (an example of joining them back together follows the list):

- `lexemes.csv`: Contains all the non-list fields in the lexemes collection.
    Includes a decimal unique ID `new_id` field and the original hexadecimal unique ID `_id` field.
- `lexemes_alternatives.csv`: Contains the alternative words of each lexeme on separate rows using the `new_lexeme_id` field to link to the lexeme's `new_id` field.
    Includes a decimal unique ID `new_id`.
- `lexemes_sources.csv`: Contains the [sources](https://mlrs.research.um.edu.mt/resources/gabra/sources) of each lexeme on separate rows using the `new_lexeme_id` field to link to the lexeme's `new_id` field.
    Includes a decimal unique ID `new_id`.
- `lexemes_glosses.csv`: Contains the different glosses (definitions in English) of each lexeme on separate rows using the `new_lexeme_id` field to link to the lexeme's `new_id` field.
    Includes a decimal unique ID `new_id`.
- `lexemes_examples.csv`: Contains the different examples of each lexeme's gloss on separate rows using the `new_gloss_id` field to link to the gloss's `new_id` field.
    Includes a decimal unique ID `new_id`.
- `wordforms.csv`: Contains all the non-list fields in the wordforms collection.
    Includes a decimal unique ID `new_id` field, a decimal lexeme ID reference called `new_lexeme_id`, and the original hexadecimal unique ID `_id` field.
- `wordforms_alternatives.csv`: Contains the alternative words of each wordform on separate rows using the `new_wordform_id` field to link to the wordform's `new_id` field.
    Includes a decimal unique ID `new_id`.
- `wordforms_sources.csv`: Contains the [sources](https://mlrs.research.um.edu.mt/resources/gabra/sources) of each wordform on separate rows using the `new_wordform_id` field to link to the wordform's `new_id` field.
    Includes a decimal unique ID `new_id`.
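
For example, glosses can be joined back to their lexemes through the `new_lexeme_id` foreign key. Here is a minimal sketch using Python's `csv` module; the `lemma` and `gloss` column names are assumptions about the exported headers.

```python
import csv

# Index lexemes by their decimal primary key.
with open('lexemes.csv', encoding='utf-8', newline='') as f:
    lexemes = {row['new_id']: row for row in csv.DictReader(f)}

# Attach each gloss to its lexeme via the new_lexeme_id foreign key.
with open('lexemes_glosses.csv', encoding='utf-8', newline='') as f:
    for gloss in csv.DictReader(f):
        lexeme = lexemes[gloss['new_lexeme_id']]
        print(lexeme['lemma'], '->', gloss['gloss'])
```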

## Available cleaners

There are a number of options available for skipping or cleaning certain rows from the Ġabra database.
Some are required whilst others are optional, depending on the exporter used (a sketch of what these checks do follows the tables below).

### Lexeme related cleaners

- `new_lines`: Remove newline characters from the glosses and examples of lexemes.
- `lemma_capitals`: Skip any lexemes whose lemma contains uppercase letters.
- `lemma_nonmaltese`: Skip any lexemes whose lemma contains non-Maltese letters.
- `lemma_spaces`: Skip any lexemes whose lemma contains spaces.
- `pending`: Skip any lexemes whose pending field is not set to false.

Required cleaners:

||`csv`|
|---|---|
|`new_lines`||
|`lemma_capitals`||
|`lemma_nonmaltese`||
|`lemma_spaces`||
|`pending`||

### Wordform related cleaners

- `missing_lexeme`: Skip any wordforms whose lexeme ID does not refer to an existing lexeme.
- `surfaceform_capitals`: Skip any wordforms whose surfaceform contains uppercase letters.
- `surfaceform_nonmaltese`: Skip any wordforms whose surfaceform contains non-Maltese letters.
- `surfaceform_spaces`: Skip any wordforms whose surfaceform contains spaces.
- `pending`: Skip any wordforms whose pending field is not set to false.

Required cleaners:

||`csv`|
|---|---|
|`missing_lexeme`||
|`surfaceform_capitals`||
|`surfaceform_nonmaltese`||
|`surfaceform_spaces`||
|`pending`||
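
As an illustration of the checks described above (a sketch of the behaviour, not the package's actual implementation; the set of Maltese letters used here is an assumption), the lexeme cleaners amount to something like the following, and the wordform cleaners apply the same tests to surface forms:

```python
# Assumed set of Maltese letters used by the *_nonmaltese checks.
MALTESE_LETTERS = set('abċdefġghħijklmnopqrstuvwxżz')

def keep_lexeme(lemma: str, pending: bool) -> bool:
    """Return True only if the lexeme survives the lemma_capitals,
    lemma_nonmaltese, lemma_spaces and pending cleaners."""
    if any(ch.isupper() for ch in lemma):                                # lemma_capitals
        return False
    if any(ch not in MALTESE_LETTERS for ch in lemma.replace(' ', '')):  # lemma_nonmaltese
        return False
    if ' ' in lemma:                                                     # lemma_spaces
        return False
    if pending is not False:                                             # pending
        return False
    return True

print(keep_lexeme('kiteb', pending=False))  # True
print(keep_lexeme('Malta', pending=False))  # False: the lemma contains an uppercase letter
```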

            
