# wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
## Features
- Extracts a word's _IPA transcriptions_, _hyphenation_, _language_, basic _part of speech_ information, _genus_ (grammatical gender) and _flexion_ (inflection) tables.
- Yields results per entry, not per page (a page can contain multiple entries, because a word can have several meanings).
## Installation
`pip install wiktionary-de-parser`
Or with [Poetry](https://python-poetry.org/):
`poetry add wiktionary-de-parser`
## Usage
### Loading the XML dump file
```python
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
```
### Parsing the dump file
```python
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser
# ... (see above)
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue
    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        # Stop iterating once the "Abend" page has been handled
        break
```
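
The loop above stops at the first matching page. To look up several words in a single pass over the dump, the same calls can be wrapped in a small function. This is a minimal sketch that relies only on the attributes and methods shown above (`redirect_to`, `name`, `entries_from_page`, `parse_entry`). The function `collect_entries` and the stand-in objects are hypothetical and not part of the library; the stand-ins only exist so the sketch runs without a dump file.

```python
from types import SimpleNamespace


def collect_entries(pages, parser, wanted):
    """Return {page name: [parsed entries]} for every page whose name is in `wanted`."""
    found = {}
    for page in pages:
        # Skip redirects and pages that were not requested.
        if page.redirect_to or page.name not in wanted:
            continue
        found[page.name] = [
            parser.parse_entry(entry) for entry in parser.entries_from_page(page)
        ]
        # Stop early once every requested word has been seen.
        if len(found) == len(wanted):
            break
    return found


# Hypothetical stand-ins that mimic the page and parser interfaces used above.
fake_pages = [
    SimpleNamespace(name="Abends", redirect_to="Abend", entries=[]),
    SimpleNamespace(name="Abend", redirect_to=None, entries=["e1", "e2"]),
    SimpleNamespace(name="Haus", redirect_to=None, entries=["e3"]),
]
fake_parser = SimpleNamespace(
    entries_from_page=lambda page: page.entries,
    parse_entry=lambda entry: entry.upper(),
)

result = collect_entries(fake_pages, fake_parser, {"Abend", "Haus"})
print(result)
```

With a real dump, the call would presumably look like `collect_entries(dump.pages(), parser, {"Abend", "Haus"})`.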
## Output
All page entries for "Abend":
```python
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
```
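
The `flexion` and `pos` fields in the output above are plain dictionaries (or `None`), so they can be post-processed with ordinary Python. The following sketch assumes only the shapes visible in the "Abend" sample; other entries may use different keys. The helper names `noun_summary` and `pos_tags` are hypothetical and not part of the library.

```python
def noun_summary(flexion):
    """Return (genus, nominative plural) from a flexion table, or None if there is no table."""
    if not flexion:
        return None
    # .get() yields None for keys that a given table does not contain.
    return flexion.get("Genus"), flexion.get("Nominativ Plural")


def pos_tags(pos):
    """Flatten a pos mapping such as {"Substantiv": ["Nachname"]} into a set of labels."""
    tags = set()
    for main, subs in pos.items():
        tags.add(main)
        tags.update(subs)
    return tags


# Values copied from the "Abend" sample output (shortened).
abend_flexion = {
    "Genus": "m",
    "Nominativ Singular": "Abend",
    "Nominativ Plural": "Abende",
}

print(noun_summary(abend_flexion))
print(noun_summary(None))
print(sorted(pos_tags({"Substantiv": ["Nachname"]})))
```

A check such as `"Nachname" in pos_tags(entry.pos)` could then be used to skip surname entries, assuming `entry` is a parsed result like the ones shown above.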
## Development
This project uses [Poetry](https://python-poetry.org/).
1. Install [Poetry](https://python-poetry.org/).
2. Clone this repository.
3. Run `poetry install` inside the project folder to install the dependencies.
4. Use `notebook.ipynb` to try out the parser.
5. Run `poetry run pytest` to run tests.
## License
[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt