wiktionary-de-parser


| Field | Value |
| --- | --- |
| Name | wiktionary-de-parser |
| Version | 0.11.5 |
| Home page | https://github.com/gambolputty/wiktionary-de-parser |
| Summary | Extracts data from German Wiktionary dump files. |
| Upload time | 2024-02-10 17:39:08 |
| Author | Gregor Weichbrodt |
| Requires Python | >=3.11,<4.0 |
| License | MIT |
| Keywords | wiktionary, xml, parser, data-extraction, german, nlp |
| Requirements | No requirements were recorded. |

# wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

## Features

- Extracts _IPA transcriptions_, _hyphenation_, _language_, basic _part of speech_ information, _grammatical gender_ (`Genus`) and _inflection tables_ (`flexion`) of a word.
- Yields results per entry, not per page: a single page can contain multiple entries, because a word can have several meanings.
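
To illustrate why yielding per entry matters, here is a minimal sketch of one page holding several entries. It is a simplified stand-in, not the library's implementation: the sample wikitext and the `split_entries` helper are hypothetical, and the sketch assumes that each entry begins at a level-3 `=== {{Wortart|...}} ===` heading.

```python
import re

# Hypothetical, heavily shortened wikitext for one page with three entries.
SAMPLE_PAGE = """\
== Abend ({{Sprache|Deutsch}}) ==
=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===
Text of the common-noun entry.
=== {{Wortart|Substantiv|Deutsch}}, {{Wortart|Nachname|Deutsch}} ===
Text of the surname entry.
=== {{Wortart|Substantiv|Deutsch}}, {{Wortart|Toponym|Deutsch}} ===
Text of the place-name entry.
"""

# A level-3 heading: exactly three "=" on each side of the title.
HEADING = re.compile(r"^===(?!=)\s*(.+?)\s*===\s*$", re.MULTILINE)


def split_entries(wikitext: str) -> list:
    """Return (heading, body) pairs, one per level-3 heading."""
    matches = list(HEADING.finditer(wikitext))
    entries = []
    for i, match in enumerate(matches):
        # An entry's body runs until the next level-3 heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        body = wikitext[match.end():end].strip()
        entries.append((match.group(1), body))
    return entries


entries = split_entries(SAMPLE_PAGE)
for heading, body in entries:
    print(heading, "->", body)
```

A per-page parser would have to merge these three blocks into one record; a per-entry parser can report each part of speech separately.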

## Installation

`pip install wiktionary-de-parser`

Or with [Poetry](https://python-poetry.org/):

`poetry add wiktionary-de-parser`
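
To confirm that the installation worked, a short check like the following may help. It is a sketch that uses only the standard library; the helper names `installed_version` and `python_is_supported` are made up for this example, and the version bounds are taken from the package's `requires_python` metadata (`>=3.11,<4.0`).

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def python_is_supported(version_info=sys.version_info) -> bool:
    """Check the interpreter against the declared range >=3.11,<4.0."""
    return (3, 11) <= tuple(version_info[:2]) < (4, 0)


print("Python supported:", python_is_supported())
print("wiktionary-de-parser:", installed_version("wiktionary-de-parser"))
```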

## Usage

### Loading the XML dump file
```python
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively, you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
```
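
For readers curious what reading such a file involves, the sketch below streams pages out of a bz2-compressed XML document using only the standard library. It is not the code behind `WiktionaryDump`: the sample XML and the helpers `local_name` and `iter_pages` are hypothetical, and the sample is far simpler than a real dump. Python's `bz2` module can read multi-stream files, which is the format of the default dump.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny hypothetical stand-in for a MediaWiki XML export. Real dumps also
# declare an XML namespace, which `local_name` strips from tag names.
SAMPLE_XML = """<mediawiki>
  <page><title>Abend</title><revision><text>== Abend ==</text></revision></page>
  <page><title>abends</title><revision><text>== abends ==</text></revision></page>
</mediawiki>"""


def local_name(tag: str) -> str:
    """Drop a leading "{namespace}" from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in the stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Index the page's descendants by their local tag name.
        fields = {local_name(child.tag): child for child in elem.iter()}
        title = fields["title"].text
        text = fields["text"].text or ""
        yield title, text
        elem.clear()  # drop the page's contents to limit memory use


# Compress the sample in memory so the example needs no files on disk.
compressed = io.BytesIO(bz2.compress(SAMPLE_XML.encode("utf-8")))
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))
print(pages)
```

Parsing incrementally with `iterparse` avoids loading the whole document at once, which matters for dumps that are large when decompressed.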

### Parsing the dump file
```python
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser

# ... (see above)

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break
```
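
The lookup in the loop above can be factored into a small helper. The sketch below relies only on the two attributes used in the example, `page.name` and `page.redirect_to`; the `find_page` function and the `SimpleNamespace` stand-ins are hypothetical, so it runs without a dump file.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_page(pages: Iterable, name: str) -> Optional[object]:
    """Return the first non-redirect page called `name`, or None."""
    for page in pages:
        if page.redirect_to:
            continue  # skip redirects, as in the loop above
        if page.name == name:
            return page
    return None


# Hypothetical stand-ins for the objects yielded by `dump.pages()`.
fake_pages = [
    SimpleNamespace(name="Abend", redirect_to="Somewhere else"),
    SimpleNamespace(name="Abend", redirect_to=None),
    SimpleNamespace(name="Morgen", redirect_to=None),
]

page = find_page(fake_pages, "Abend")
print(page.name if page else "not found")
```

With the real library, one would presumably pass `dump.pages()` as the first argument and hand the returned page to `parser.entries_from_page(page)`. Because the helper returns at the first match, iteration stops early, just like the `break` in the original loop.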

## Output
All page entries for "Abend":

```python
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
```
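
As an example of working with this output, the sketch below turns the `flexion` dictionary of the first entry into a case-by-number table. It assumes the keys follow the `"<case> <number>"` pattern seen above, which is what noun entries show here; other parts of speech may use different keys, and `flexion` can be `None`. The `flexion_rows` helper is hypothetical.

```python
# The `flexion` dict copied from the first "Abend" entry above.
flexion = {
    "Genus": "m",
    "Nominativ Singular": "Abend",
    "Nominativ Plural": "Abende",
    "Genitiv Singular": "Abends",
    "Genitiv Plural": "Abende",
    "Dativ Singular": "Abend",
    "Dativ Plural": "Abenden",
    "Akkusativ Singular": "Abend",
    "Akkusativ Plural": "Abende",
}

CASES = ("Nominativ", "Genitiv", "Dativ", "Akkusativ")
NUMBERS = ("Singular", "Plural")


def flexion_rows(flexion: dict) -> list:
    """Return one (case, singular, plural) row per grammatical case."""
    rows = []
    for case in CASES:
        # Missing forms are shown as "-" instead of raising a KeyError.
        forms = tuple(flexion.get(f"{case} {number}", "-") for number in NUMBERS)
        rows.append((case, *forms))
    return rows


for case, singular, plural in flexion_rows(flexion):
    print(f"{case:<10} {singular:<8} {plural}")
```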

## Development
This project uses [Poetry](https://python-poetry.org/).

1. Install [Poetry](https://python-poetry.org/).
2. Clone this repository.
3. Run `poetry install` inside the project folder to install the dependencies.
4. Use the included `notebook.ipynb` to test the parser.
5. Run `poetry run pytest` to run the tests.

## License

[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt

            
