# wiktionary-de-parser

- Version: 0.12.13
- Summary: Extracts data from German Wiktionary dump files.
- Author: Gregor Weichbrodt
- License: MIT
- Requires Python: >=3.11, <4.0
- Home page: https://github.com/gambolputty/wiktionary-de-parser
- Uploaded: 2025-01-02 15:29:41
- Keywords: wiktionary, xml, parser, data-extraction, german, nlp

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

## Features

- Extracts _IPA transcriptions_, _hyphenation_, _rhymes_, _language_, _lemma_, basic _part of speech_ information, _genus_ (grammatical gender) and _flexion tables_ (inflection tables) of a word.
- Yields results per entry, not per page: a page can contain multiple entries, because a word can have several meanings.
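
To illustrate what "per entry, not per page" means, here is a minimal sketch that splits one page's wikitext into entries at level-3 headings. It is not the library's implementation: the `split_entries` helper, the heading pattern, and the shortened sample text are assumptions made for illustration only.

```python
import re

# Level-3 headings such as "=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===".
# This pattern is an assumption for illustration, not the library's own code.
ENTRY_HEADING = re.compile(r"^===[^=].*?===[ \t]*$", re.MULTILINE)


def split_entries(wikitext: str) -> list[str]:
    """Split one page's wikitext into one chunk per level-3 heading."""
    starts = [match.start() for match in ENTRY_HEADING.finditer(wikitext)]
    if not starts:
        return []
    # Each entry runs from its heading to the next heading (or end of text).
    ends = starts[1:] + [len(wikitext)]
    return [wikitext[start:end].strip() for start, end in zip(starts, ends)]


# Shortened, hand-written sample text (not copied from a real dump).
page_text = """== Abend ({{Sprache|Deutsch}}) ==
=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===
{{Worttrennung}}
:Abend
=== {{Wortart|Substantiv|Deutsch}}, {{Wortart|Nachname|Deutsch}} ===
{{Worttrennung}}
:Abend
"""

entries = split_entries(page_text)
print(len(entries))
```

One page yields two chunks here, which is why the parser reports several results for a single page name.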

## Installation

`pip install wiktionary-de-parser`

Or with [Poetry](https://python-poetry.org/):

`poetry add wiktionary-de-parser`
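
Since the package requires Python 3.11 or newer (and below 4.0), it can help to check the interpreter before installing. The `is_supported` helper below is a hypothetical convenience, not part of the package.

```python
import sys


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) satisfies the constraint >=3.11,<4.0."""
    return (3, 11) <= version < (4, 0)


current = sys.version_info[:2]
print(current, "supported" if is_supported(current) else "not supported")
```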

## Usage

### Loading the XML dump file
```python
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
```
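
For readers curious about what reading such a dump involves, the following is a rough sketch of streaming page titles from a bz2-compressed, MediaWiki-style XML file using only the standard library. It is not how `WiktionaryDump` is implemented; `iter_titles`, `local_name`, and the tiny in-memory sample are assumptions for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(compressed: BinaryIO) -> Iterator[str]:
    """Yield page titles from a bz2-compressed MediaWiki-style XML stream."""
    with bz2.open(compressed, "rb") as xml_stream:
        # iterparse reads incrementally, so the file is never fully in memory.
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if local_name(elem.tag) == "page":
                for child in elem:
                    if local_name(child.tag) == "title":
                        yield child.text or ""
                elem.clear()  # drop the page's children once it is handled


# Tiny in-memory stand-in for a real dump file.
sample = (
    b"<mediawiki>"
    b"<page><title>Abend</title></page>"
    b"<page><title>Morgen</title></page>"
    b"</mediawiki>"
)
titles = list(iter_titles(io.BytesIO(bz2.compress(sample))))
print(titles)
```

Streaming matters because a full dump is far too large to parse comfortably in one piece.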

### Parsing the dump file
```python
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser

# ... (see above)

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break
```
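
The loop above looks for a single page. A small generator can generalise the same pattern to several page names while still stopping early. This is a sketch under assumptions: `select_pages` is a hypothetical helper, and the demo uses made-up stand-in objects that only carry the `name` and `redirect_to` attributes used above.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def select_pages(pages: Iterable, wanted: set[str]) -> Iterator:
    """Yield non-redirect pages whose name is in `wanted`, then stop."""
    remaining = set(wanted)
    for page in pages:
        if page.redirect_to:
            continue  # skip redirects, as in the loop above
        if page.name in remaining:
            remaining.discard(page.name)
            yield page
            if not remaining:
                return  # every requested page was found; stop reading


# Made-up stand-ins for page objects (not the library's page class).
pages = [
    SimpleNamespace(name="Abends", redirect_to="Abend"),
    SimpleNamespace(name="Abend", redirect_to=None),
    SimpleNamespace(name="Morgen", redirect_to=None),
    SimpleNamespace(name="Nacht", redirect_to=None),
]
found = [page.name for page in select_pages(pages, {"Abend", "Morgen"})]
print(found)
```

With the real library, the `pages` argument would presumably be `dump.pages()`, and each yielded page could be passed to `parser.entries_from_page(page)` as shown above.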

## Output
All page entries for "Abend":

```python
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)

```
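
As the output shows, `flexion` is either a dictionary or `None`, so code that consumes the results should handle both cases. The sketch below collects the distinct word forms of an entry. `EntryStandIn` is a hypothetical stand-in carrying a subset of the fields printed above, not the library's `ParsedWiktionaryPageEntry` class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryStandIn:
    """Hypothetical stand-in with a subset of the fields printed above."""

    name: str
    flexion: Optional[dict[str, str]]
    pos: dict[str, list[str]]


def inflected_forms(entry: EntryStandIn) -> set[str]:
    """Collect distinct word forms from the flexion table, ignoring 'Genus'."""
    if entry.flexion is None:
        return set()
    return {form for key, form in entry.flexion.items() if key != "Genus"}


noun = EntryStandIn(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Dativ Plural": "Abenden",
    },
    pos={"Substantiv": []},
)
surname = EntryStandIn(name="Abend", flexion=None, pos={"Substantiv": ["Nachname"]})

print(sorted(inflected_forms(noun)))
print(inflected_forms(surname))
```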

## Development
This project uses [Poetry](https://python-poetry.org/).

1. Install [Poetry](https://python-poetry.org/).
2. Clone this repository.
3. Run `poetry install` inside the project folder to install the dependencies.
4. Use `notebook.ipynb` to try out the parser.
5. Run `poetry run pytest` to run the tests.

## License

[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt

            
