# wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
## Features
- Extracts _IPA transcriptions_, _hyphenation_, _language_, basic _part of speech_ information, _genus_ (grammatical gender) and _flexion tables_ (inflection tables) of a word.
- Yields results per entry, not per page (a page can contain multiple entries, because a word can have several meanings).
## Installation
`pip install wiktionary-de-parser`
Or with [Poetry](https://python-poetry.org/):
`poetry add wiktionary-de-parser`
## Usage
### Loading the XML dump file
```python
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
```
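A script that runs both on machines with a cached dump and on fresh ones can pick between the two constructor forms at runtime. The following is a minimal sketch that relies only on the keyword arguments shown above; `dump_kwargs` is a hypothetical helper and not part of the library.

```python
from pathlib import Path


def dump_kwargs(dump_dir: str, file_name: str) -> dict[str, str]:
    """Build keyword arguments for WiktionaryDump (hypothetical helper).

    If `file_name` already exists inside `dump_dir`, point at the local
    file; otherwise pass the directory so the dump can be downloaded there.
    """
    candidate = Path(dump_dir) / file_name
    if candidate.is_file():
        return {"dump_file_path": str(candidate)}
    return {"dump_dir_path": dump_dir}


# Possible usage (assumes the package is installed):
# dump = WiktionaryDump(**dump_kwargs(
#     "directory-of-dump-file",
#     "dewiktionary-latest-pages-articles-multistream.xml.bz2",
# ))
```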
### Parsing the dump file
```python
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser
# ... (see above)
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue
    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        # Stop reading the dump once the page has been handled
        break
```
```
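The same loop can be generalized to several words at once. The sketch below uses only the attributes and methods that appear in the example above (`page.name`, `page.redirect_to`, `parser.entries_from_page`, `parser.parse_entry`). `iter_parsed_entries` is a hypothetical helper, and it assumes that each page name occurs at most once in the dump.

```python
from typing import Any, Iterable, Iterator


def iter_parsed_entries(
    pages: Iterable[Any], parser: Any, wanted: set[str]
) -> Iterator[Any]:
    """Yield one parsed result per entry for every page named in `wanted`."""
    remaining = set(wanted)  # copy, so the caller's set is left untouched
    for page in pages:
        if not remaining:
            break  # every requested page was seen; stop reading the dump
        if page.redirect_to:
            continue  # skip redirects, as in the loop above
        if page.name not in remaining:
            continue
        remaining.discard(page.name)
        for entry in parser.entries_from_page(page):
            yield parser.parse_entry(entry)


# Possible usage with the objects created above:
# for result in iter_parsed_entries(dump.pages(), parser, {"Abend", "Morgen"}):
#     pprint(result)
```

Writing the helper as a generator keeps memory use flat, since results are produced one entry at a time instead of being collected into a list.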
## Output
All page entries for "Abend":
```python
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
```
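The results can then be filtered by their fields. The following sketch assumes that the printed fields are available as attributes, which the output above suggests but does not confirm. `EntryStandIn`, `plural_forms` and `is_plain_noun` are hypothetical names used for illustration only; the stand-in class mirrors a subset of the fields shown above so that the example runs without the dump.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryStandIn:
    """Stand-in with a subset of the fields shown above (not the library's class)."""

    name: str
    flexion: Optional[dict[str, str]]
    pos: dict[str, list[str]]


def plural_forms(entry) -> list[str]:
    """Return the distinct plural forms from an entry's flexion table, sorted."""
    if not entry.flexion:  # flexion is None for entries without a table
        return []
    return sorted(
        {form for case, form in entry.flexion.items() if case.endswith(" Plural")}
    )


def is_plain_noun(entry) -> bool:
    """True for a "Substantiv" entry without sub-labels such as "Nachname"."""
    return entry.pos.get("Substantiv") == []


noun = EntryStandIn(
    name="Abend",
    flexion={
        "Genus": "m",
        "Nominativ Plural": "Abende",
        "Genitiv Plural": "Abende",
        "Dativ Plural": "Abenden",
        "Akkusativ Plural": "Abende",
    },
    pos={"Substantiv": []},
)
surname = EntryStandIn(name="Abend", flexion=None, pos={"Substantiv": ["Nachname"]})

print(plural_forms(noun))      # ['Abende', 'Abenden']
print(plural_forms(surname))   # []
print(is_plain_noun(surname))  # False
```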
## Development
This project uses [Poetry](https://python-poetry.org/).
1. Install [Poetry](https://python-poetry.org/).
2. Clone this repository.
3. Run `poetry install` inside the project folder to install the dependencies.
4. Use `notebook.ipynb` to try out the parser.
5. Run `poetry run pytest` to run tests.
## License
[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/master/LICENSE.md) © Gregor Weichbrodt