wiktionary-de-parser


- Name: wiktionary-de-parser
- Version: 0.12.0
- Home page: https://github.com/gambolputty/wiktionary-de-parser
- Summary: Extracts data from German Wiktionary dump files.
- Upload time: 2024-07-29 19:16:32
- Author: Gregor Weichbrodt
- Requires Python: <4.0,>=3.11
- License: MIT
- Keywords: wiktionary, xml, parser, data-extraction, german, nlp

# wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

## Features

- Extracts _IPA transcriptions_, _hyphenation_, _language_, basic _part of speech_ information, _genus_ (grammatical gender) and _flexion_ (inflection) tables of a word.
- Yields results per entry, not per page: a page can contain multiple entries, because a word can have different meanings.

## Installation

`pip install wiktionary-de-parser`

Or with [Poetry](https://python-poetry.org/):

`poetry add wiktionary-de-parser`

## Usage

### Loading the XML dump file
```python
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
```
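
For readers curious what iterating over a compressed dump involves, here is a minimal standard-library sketch. It is not the library's implementation. It streams pages out of a bz2-compressed MediaWiki XML export using `bz2` and `xml.etree.ElementTree.iterparse`. The function name `iter_pages`, the dictionary keys, and the tiny in-memory sample are assumptions made for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Invented two-page sample in the MediaWiki export layout (not real dump data).
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Abend</title>
    <revision><text>== Abend ({{Sprache|Deutsch}}) ==</text></revision>
  </page>
  <page>
    <title>Beispielweiterleitung</title>
    <redirect title="Abend" />
    <revision><text>#WEITERLEITUNG [[Abend]]</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page> element, parsing the stream incrementally."""
    title = redirect = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "redirect":
            redirect = elem.get("title")
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield {"name": title, "redirect_to": redirect, "wikitext": text}
            title = redirect = text = None
            elem.clear()  # release the finished page's subtree


# Compress the sample in memory so the sketch runs without a real dump file.
compressed = bz2.compress(SAMPLE_XML.encode("utf-8"))
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    pages = list(iter_pages(stream))

for page in pages:
    print(page["name"], "->", page["redirect_to"])
```

With a real dump, the `io.BytesIO` object would be replaced by the path to the downloaded `.xml.bz2` file.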

### Parsing the dump file
```python
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser

# ... (see above)

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break
```
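
The loop above stops after one word. A small helper can generalize it to several words and still stop as soon as all of them have been found. This is only a sketch: `parse_words` is a hypothetical name, and it relies solely on the calls already shown (`dump.pages()`, `page.redirect_to`, `page.name`, `parser.entries_from_page()` and `parser.parse_entry()`).

```python
from typing import Any, Iterable, Iterator


def parse_words(dump: Any, parser: Any, wanted: Iterable[str]) -> Iterator[Any]:
    """Yield parsed entries for every page whose name is in `wanted`.

    `dump` and `parser` are expected to behave like the WiktionaryDump and
    WiktionaryParser objects used above; nothing else is assumed about them.
    """
    remaining = set(wanted)
    for page in dump.pages():
        if page.redirect_to:
            continue  # skip redirects
        if page.name not in remaining:
            continue
        remaining.discard(page.name)
        for entry in parser.entries_from_page(page):
            yield parser.parse_entry(entry)
        if not remaining:
            return  # all requested words found: stop reading the dump
```

For example, `for result in parse_words(dump, parser, {"Abend", "Morgen"}): pprint(result)` would print the entries of both pages, assuming both exist in the dump.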

## Output
The `pprint` call in the parsing loop prints one result per entry on the "Abend" page:

```python
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)

```
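
Downstream code often needs only a few of these fields. The sketch below reduces one result to a flat dictionary. It assumes the fields are readable as attributes, as the printed representation suggests, and it uses `SimpleNamespace` stand-ins with values copied from the entries above so that it runs without a dump file. `summarize` is a hypothetical helper, not part of the library.

```python
from types import SimpleNamespace


def summarize(entry):
    """Reduce a parsed entry to a flat dict; missing data becomes None."""
    flexion = entry.flexion or {}  # flexion is None for some entries
    return {
        "name": entry.name,
        "pos": list(entry.pos),  # keys of the pos dict, e.g. ["Substantiv"]
        "genus": flexion.get("Genus"),
        "plural": flexion.get("Nominativ Plural"),
        "ipa": entry.ipa[0] if entry.ipa else None,
    }


# Stand-in objects mirroring two of the entries printed above.
noun = SimpleNamespace(
    name="Abend",
    flexion={"Genus": "m", "Nominativ Singular": "Abend", "Nominativ Plural": "Abende"},
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    pos={"Substantiv": []},
)
surname = SimpleNamespace(
    name="Abend",
    flexion=None,
    ipa=["ˈaːbn̩t"],
    pos={"Substantiv": ["Nachname"]},
)

print(summarize(noun))
print(summarize(surname))
```

Guarding against `flexion=None` matters here because only the first of the three "Abend" entries carries an inflection table.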

## Development
This project uses [Poetry](https://python-poetry.org/) for dependency management.

1. Install Poetry.
2. Clone this repository.
3. Run `poetry install` inside the project folder to install the dependencies.
4. Open `notebook.ipynb` to try out the parser.
5. Run `poetry run pytest` to run the tests.

## License

[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt

            
