Name | wikitextparser |
Version | 0.56.3 |
home_page | None |
Summary | A simple parsing tool for MediaWiki's wikitext markup. |
upload_time | 2024-10-18 06:10:56 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.8 |
license | None |
keywords | mediawiki, wikitext, parser |
requirements | No requirements were recorded. |
.. image:: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml/badge.svg
:target: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml
.. image:: https://codecov.io/github/5j9/wikitextparser/coverage.svg?branch=master
:target: https://codecov.io/github/5j9/wikitextparser
.. image:: https://readthedocs.org/projects/wikitextparser/badge/?version=latest
:target: http://wikitextparser.readthedocs.io/en/latest/?badge=latest
==============
WikiTextParser
==============
.. Quick Start Guide
A simple-to-use WikiText parsing library for `MediaWiki <https://www.mediawiki.org/wiki/MediaWiki>`_.
The purpose is to allow users to easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitext.
.. contents:: Table of Contents
Installation
============
- Python 3.8+ is required
- ``pip install wikitextparser``
Usage
=====
.. code:: python
>>> import wikitextparser as wtp
WikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.
You may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).
Templates
---------
.. code:: python
>>> parsed = wtp.parse("{{text|value1{{text|value2}}}}")
>>> parsed.templates
[Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]
>>> parsed.templates[0].arguments
[Argument("|value1{{text|value2}}")]
>>> parsed.templates[0].arguments[0].value = 'value3'
>>> print(parsed)
{{text|value3}}
The ``pformat`` method returns a pretty-printed string representation of a template:
.. code:: python
>>> parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
>>> t1, t2 = parsed.templates
>>> print(t2.pformat())
{{t2
| e = e
| f = f
}}
>>> print(t1.pformat())
{{t1
| b = b
| c = c
| d = {{t2
| e = e
| f = f
}}
}}
The ``Template.rm_dup_args_safe`` and ``Template.rm_first_of_dup_args`` methods can be used to clean up `pages using duplicate arguments in template calls <https://en.wikipedia.org/wiki/Category:Pages_using_duplicate_arguments_in_template_calls>`_:
.. code:: python
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_dup_args_safe()
>>> t
Template('{{t|a=b|a=a}}')
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_first_of_dup_args()
>>> t
Template('{{t|a=a}}')
Template parameters:
.. code:: python
>>> param = wtp.parse('{{{a|b}}}').parameters[0]
>>> param.name
'a'
>>> param.default
'b'
>>> param.default = 'c'
>>> param
Parameter('{{{a|c}}}')
>>> param.append_default('d')
>>> param
Parameter('{{{a|{{{d|c}}}}}}')
WikiLinks
---------
.. code:: python
>>> wl = wtp.parse('... [[title#fragment|text]] ...').wikilinks[0]
>>> wl.title = 'new_title'
>>> wl.fragment = 'new_fragment'
>>> wl.text = 'X'
>>> wl
WikiLink('[[new_title#new_fragment|X]]')
>>> del wl.text
>>> wl
WikiLink('[[new_title#new_fragment]]')
All WikiLink properties support get, set, and delete operations.
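For example, the ``title`` can be read and the ``fragment`` deleted in place (a minimal sketch; the outputs shown here are what one would expect and may differ slightly in repr details):

.. code:: python

    >>> wl = wtp.parse('[[A#B|C]]').wikilinks[0]
    >>> wl.title
    'A'
    >>> del wl.fragment
    >>> wl
    WikiLink('[[A|C]]')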
Sections
--------
.. code:: python
>>> parsed = wtp.parse("""
... == h2 ==
... t2
... === h3 ===
... t3
... === h3 ===
... t3
... == h22 ==
... t22
... {{text|value3}}
... [[Z|X]]
... """)
>>> parsed.sections
[Section('\n'),
Section('== h2 ==\nt2\n=== h3 ===\nt3\n=== h3 ===\nt3\n'),
Section('=== h3 ===\nt3\n'),
Section('=== h3 ===\nt3\n'),
Section('== h22 ==\nt22\n{{text|value3}}\n[[Z|X]]\n')]
>>> parsed.sections[1].title = 'newtitle'
>>> print(parsed)
==newtitle==
t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
>>> del parsed.sections[1].title
>>> print(parsed)
t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
Tables
------
Extracting cell values of a table:
.. code:: python
>>> p = wtp.parse("""{|
... | Orange || Apple || more
... |-
... | Bread || Pie || more
... |-
... | Butter || Ice cream || and more
... |}""")
>>> p.tables[0].data()
[['Orange', 'Apple', 'more'],
['Bread', 'Pie', 'more'],
['Butter', 'Ice cream', 'and more']]
By default, values are arranged according to ``colspan`` and ``rowspan`` attributes:
.. code:: python
>>> t = wtp.Table("""{| class="wikitable sortable"
... |-
... ! a !! b !! c
... |-
... !colspan = "2" | d || e
... |-
... |}""")
>>> t.data()
[['a', 'b', 'c'], ['d', 'd', 'e']]
>>> t.data(span=False)
[['a', 'b', 'c'], ['d', 'e']]
Calling the ``cells`` method of a ``Table`` returns table cells as ``Cell`` objects. Cell objects provide methods for getting or setting each cell's attributes or values individually:
.. code:: python
>>> cell = t.cells(row=1, column=1)
>>> cell.attrs
{'colspan': '2'}
>>> cell.set('colspan', '3')
>>> print(t)
{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "3" | d || e
|-
|}
HTML attributes of Table, Cell, and Tag objects are accessible via
``get_attr``, ``set_attr``, ``has_attr``, and ``del_attr`` methods.
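For instance, on the ``ref`` tag from a parsed string (a minimal sketch; outputs are illustrative):

.. code:: python

    >>> tag = wtp.parse('<ref name="n">citation</ref>').get_tags()[0]
    >>> tag.has_attr('name')
    True
    >>> tag.get_attr('name')
    'n'
    >>> tag.set_attr('group', 'g')
    >>> tag.del_attr('name')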
Lists
-----
The ``get_lists`` method provides access to lists within the wikitext.
.. code:: python
>>> parsed = wtp.parse(
... 'text\n'
... '* list item a\n'
... '* list item b\n'
... '** sub-list of b\n'
... '* list item c\n'
... '** sub-list of b\n'
... 'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']
The ``sublists`` method can be used to get all sub-lists of the current list or just sub-lists of specific items:
.. code:: python
>>> wikilist.sublists()
[WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
>>> wikilist.sublists(1)[0].items
[' sub-list of b']
It also has an optional ``pattern`` argument that works like the one of ``get_lists``, except that the current list pattern is automatically added to it as a prefix:
.. code:: python
>>> wikilist = wtp.WikiList('#a\n#b\n##ba\n#*bb\n#:bc\n#c', r'\#')
>>> wikilist.sublists()
[WikiList('##ba\n'), WikiList('#*bb\n'), WikiList('#:bc\n')]
>>> wikilist.sublists(pattern=r'\*')
[WikiList('#*bb\n')]
Convert one type of list to another using the ``convert`` method. Specifying the starting pattern of the desired lists makes it easier to find them and improves performance:
.. code:: python
>>> wl = wtp.WikiList(
... ':*A1\n:*#B1\n:*#B2\n:*:continuing A1\n:*A2',
... pattern=r':\*'
... )
>>> print(wl)
:*A1
:*#B1
:*#B2
:*:continuing A1
:*A2
>>> wl.convert('#')
>>> print(wl)
#A1
##B1
##B2
#:continuing A1
#A2
Tags
----
Accessing HTML tags:
.. code:: python
>>> p = wtp.parse('text<ref name="c">citation</ref>\n<references/>')
>>> ref, references = p.get_tags()
>>> ref.name = 'X'
>>> ref
Tag('<X name="c">citation</X>')
>>> references
Tag('<references/>')
WikiTextParser is able to handle common usages of HTML and extension tags. However, it is not a fully-fledged HTML parser and may fail on edge cases or malformed HTML input. Please open an issue on GitHub if you encounter bugs.
Miscellaneous
-------------
The ``parent`` and ``ancestors`` methods can be used to access a node's parent or ancestors, respectively:
.. code:: python
>>> template_d = wtp.parse("{{a|{{b|{{c|{{d}}}}}}}}").templates[3]
>>> template_d.ancestors()
[Template('{{c|{{d}}}}'),
Template('{{b|{{c|{{d}}}}}}'),
Template('{{a|{{b|{{c|{{d}}}}}}}}')]
>>> template_d.parent()
Template('{{c|{{d}}}}')
>>> _.parent()
Template('{{b|{{c|{{d}}}}}}')
>>> _.parent()
Template('{{a|{{b|{{c|{{d}}}}}}}}')
>>> _.parent() # Returns None
Use the optional ``type_`` argument if looking for ancestors of a specific type:
.. code:: python
>>> parsed = wtp.parse('{{a|{{#if:{{b{{c<!---->}}}}}}}}')
>>> comment = parsed.comments[0]
>>> comment.ancestors(type_='ParserFunction')
[ParserFunction('{{#if:{{b{{c<!---->}}}}}}')]
To delete/remove any object from its parent, use ``del object[:]`` or ``del object.string``.
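For example, removing a template from the text that contains it (a minimal sketch; the leftover whitespace is as one would expect):

.. code:: python

    >>> parsed = wtp.parse('a {{t}} b')
    >>> del parsed.templates[0][:]
    >>> print(parsed)
    a  b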
The ``remove_markup`` function or ``plain_text`` method can be used to remove wiki markup:
.. code:: python
>>> from wikitextparser import remove_markup, parse
>>> s = "'''a'''<!--comment--> [[b|c]] [[d]]"
>>> remove_markup(s)
'a c d'
>>> parse(s).plain_text()
'a c d'
Compared with mwparserfromhell
==============================
`mwparserfromhell <https://github.com/earwig/mwparserfromhell>`_ is a mature and widely used library with nearly the same purpose as ``wikitextparser``. The main reason that led me to create ``wikitextparser`` was that ``mwparserfromhell`` could not parse wikitext in certain situations that I needed it for. See mwparserfromhell's issues `40 <https://github.com/earwig/mwparserfromhell/issues/40>`_, `42 <https://github.com/earwig/mwparserfromhell/issues/42>`_, `88 <https://github.com/earwig/mwparserfromhell/issues/88>`_, and other related issues. In many of those situations ``wikitextparser`` may be able to give you more acceptable results.
Also note that ``wikitextparser`` still uses a 0.x.y version number, `meaning <https://semver.org/>`_ that the API is not stable and may change in future versions.
The tokenizer in ``mwparserfromhell`` is written in C. Tokenization in ``wikitextparser`` is mostly done using the ``regex`` library, which is also implemented in C.
I have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage. In my limited experience, ``wikitextparser`` performs decently in realistic cases, should be able to compete, and may even have slight performance benefits in some situations.
If you have had a chance to compare these libraries in terms of performance or capabilities, please share your experience by opening an issue on GitHub.
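If you would like to run a quick comparison yourself, a minimal sketch using ``timeit`` might look like the following (``page.wiki`` is a hypothetical file holding the wikitext you care about; both libraries must be installed):

.. code:: python

    >>> import timeit
    >>> import mwparserfromhell
    >>> import wikitextparser
    >>> text = open('page.wiki', encoding='utf8').read()  # hypothetical sample file
    >>> # time extracting templates with each library
    >>> wtp_seconds = timeit.timeit(lambda: wikitextparser.parse(text).templates, number=100)
    >>> mwp_seconds = timeit.timeit(lambda: mwparserfromhell.parse(text).filter_templates(), number=100)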
Some of the unique features of ``wikitextparser`` are: providing access to individual cells of each table, pretty-printing templates, a ``WikiList`` class with rudimentary methods to work with `lists <https://www.mediawiki.org/wiki/Help:Lists>`_, and a few other functions.
Known issues and limitations
============================
* The contents of templates/parameters are not known to offline parsers. For example, an offline parser cannot know whether the markup ``[[{{z|a}}]]`` should be treated as a wikilink or not; that depends on the inner workings of the ``{{z}}`` template. In these situations ``wikitextparser`` makes a best guess: ``[[{{z|a}}]]`` is treated as a wikilink (why else would anyone call a template inside wikilink markup? And even if it is not a wikilink, usually no harm is done).
* Localized namespace names are unknown, so, for example, ``[[File:...]]`` links are treated as normal wikilinks. ``mwparserfromhell`` has a similar issue, see `#87 <https://github.com/earwig/mwparserfromhell/issues/87>`_ and `#136 <https://github.com/earwig/mwparserfromhell/issues/136>`_. As a workaround, `Pywikibot <https://www.mediawiki.org/wiki/Manual:Pywikibot>`_ can be used to determine the namespace.
* `Linktrails <https://www.mediawiki.org/wiki/Help:Links>`_ are language-dependent and are not supported (`also not supported by mwparserfromhell <https://github.com/earwig/mwparserfromhell/issues/82>`_). However, given the trail pattern and knowing that ``wikilink.span[1]`` is the ending position of a wikilink, it is possible to compute a WikiLink's linktrail; see the sketch after this list.
* Templates adjacent to external links are never considered part of the link. In reality, this depends on the contents of the template. Example: ``parse('http://example.com{{dead link}}').external_links[0].url == 'http://example.com'``
* The list of valid `extension tags <https://www.mediawiki.org/wiki/Parser_extension_tags>`_ depends on the extensions installed on the wiki. The ``tags`` method currently only supports the ones used on English Wikipedia. A configuration option may be added in the future to address this issue.
* ``wikitextparser`` currently does not provide an `ast.walk <https://docs.python.org/3/library/ast.html#ast.walk>`_-like method yielding all descendant nodes.
* `Parser functions <https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions>`_ and `magic words <https://www.mediawiki.org/wiki/Help:Magic_words>`_ are not evaluated.
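Regarding the linktrail limitation above: since ``wikilink.span[1]`` gives the position right after a wikilink, a linktrail can be computed manually. A minimal sketch, assuming an English-style trail pattern of lowercase letters (``[a-z]+``); the pattern itself is language-dependent and the output shown is illustrative:

.. code:: python

    >>> import re
    >>> parsed = wtp.parse('[[apple]]s and oranges')
    >>> wl = parsed.wikilinks[0]
    >>> m = re.match('[a-z]+', parsed.string[wl.span[1]:])
    >>> m[0] if m else ''
    's'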
Credits
=======
* `python <https://www.python.org/>`_
* `regex <https://github.com/mrabarnett/mrab-regex>`_
* `wcwidth <https://github.com/jquast/wcwidth>`_
Raw data
{
"_id": null,
"home_page": null,
"name": "wikitextparser",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "MediaWiki, wikitext, parser",
"author": null,
"author_email": "5j9 <5j9@users.noreply.github.com>",
"download_url": "https://files.pythonhosted.org/packages/68/8f/38ae3bb4d5b87a30f961c535365e807167ba7dc31b3bdc16c708fcd30153/wikitextparser-0.56.3.tar.gz",
"platform": null,
"description": ".. image:: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml/badge.svg\n :target: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml\n.. image:: https://codecov.io/github/5j9/wikitextparser/coverage.svg?branch=master\n :target: https://codecov.io/github/5j9/wikitextparser\n.. image:: https://readthedocs.org/projects/wikitextparser/badge/?version=latest\n :target: http://wikitextparser.readthedocs.io/en/latest/?badge=latest\n\n==============\nWikiTextParser\n==============\n.. Quick Start Guid\n\nA simple to use WikiText parsing library for `MediaWiki <https://www.mediawiki.org/wiki/MediaWiki>`_.\n\nThe purpose is to allow users easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitexts.\n\n.. contents:: Table of Contents\n\nInstallation\n============\n\n- Python 3.8+ is required\n- ``pip install wikitextparser``\n\nUsage\n=====\n\n.. code:: python\n\n >>> import wikitextparser as wtp\n\nWikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.\n\nYou may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).\n\nTemplates\n---------\n\n.. code:: python\n\n >>> parsed = wtp.parse(\"{{text|value1{{text|value2}}}}\")\n >>> parsed.templates\n [Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]\n >>> parsed.templates[0].arguments\n [Argument(\"|value1{{text|value2}}\")]\n >>> parsed.templates[0].arguments[0].value = 'value3'\n >>> print(parsed)\n {{text|value3}}\n\nThe ``pformat`` method returns a pretty-print formatted string for templates:\n\n.. code:: python\n\n >>> parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')\n >>> t1, t2 = parsed.templates\n >>> print(t2.pformat())\n {{t2\n | e = e\n | f = f\n }}\n >>> print(t1.pformat())\n {{t1\n | b = b\n | c = c\n | d = {{t2\n | e = e\n | f = f\n }}\n }}\n\n``Template.rm_dup_args_safe`` and ``Template.rm_first_of_dup_args`` methods can be used to clean-up `pages using duplicate arguments in template calls <https://en.wikipedia.org/wiki/Category:Pages_using_duplicate_arguments_in_template_calls>`_:\n\n.. code:: python\n\n >>> t = wtp.Template('{{t|a=a|a=b|a=a}}')\n >>> t.rm_dup_args_safe()\n >>> t\n Template('{{t|a=b|a=a}}')\n >>> t = wtp.Template('{{t|a=a|a=b|a=a}}')\n >>> t.rm_first_of_dup_args()\n >>> t\n Template('{{t|a=a}}')\n\nTemplate parameters:\n\n.. code:: python\n\n >>> param = wtp.parse('{{{a|b}}}').parameters[0]\n >>> param.name\n 'a'\n >>> param.default\n 'b'\n >>> param.default = 'c'\n >>> param\n Parameter('{{{a|c}}}')\n >>> param.append_default('d')\n >>> param\n Parameter('{{{a|{{{d|c}}}}}}')\n\n\nWikiLinks\n---------\n\n.. code:: python\n\n >>> wl = wtp.parse('... [[title#fragmet|text]] ...').wikilinks[0]\n >>> wl.title = 'new_title'\n >>> wl.fragment = 'new_fragmet'\n >>> wl.text = 'X'\n >>> wl\n WikiLink('[[new_title#new_fragmet|X]]')\n >>> del wl.text\n >>> wl\n WikiLink('[[new_title#new_fragmet]]')\n\nAll WikiLink properties support get, set, and delete operations.\n\nSections\n--------\n\n.. code:: python\n\n >>> parsed = wtp.parse(\"\"\"\n ... == h2 ==\n ... t2\n ... === h3 ===\n ... t3\n ... === h3 ===\n ... t3\n ... == h22 ==\n ... t22\n ... {{text|value3}}\n ... [[Z|X]]\n ... 
\"\"\")\n >>> parsed.sections\n [Section('\\n'),\n Section('== h2 ==\\nt2\\n=== h3 ===\\nt3\\n=== h3 ===\\nt3\\n'),\n Section('=== h3 ===\\nt3\\n'),\n Section('=== h3 ===\\nt3\\n'),\n Section('== h22 ==\\nt22\\n{{text|value3}}\\n[[Z|X]]\\n')]\n >>> parsed.sections[1].title = 'newtitle'\n >>> print(parsed)\n\n ==newtitle==\n t2\n === h3 ===\n t3\n === h3 ===\n t3\n == h22 ==\n t22\n {{text|value3}}\n [[Z|X]]\n >>> del parsed.sections[1].title\n >>>> print(parsed)\n\n t2\n === h3 ===\n t3\n === h3 ===\n t3\n == h22 ==\n t22\n {{text|value3}}\n [[Z|X]]\n\nTables\n------\n\nExtracting cell values of a table:\n\n.. code:: python\n\n >>> p = wtp.parse(\"\"\"{|\n ... | Orange || Apple || more\n ... |-\n ... | Bread || Pie || more\n ... |-\n ... | Butter || Ice cream || and more\n ... |}\"\"\")\n >>> p.tables[0].data()\n [['Orange', 'Apple', 'more'],\n ['Bread', 'Pie', 'more'],\n ['Butter', 'Ice cream', 'and more']]\n\nBy default, values are arranged according to ``colspan`` and ``rowspan`` attributes:\n\n.. code:: python\n\n >>> t = wtp.Table(\"\"\"{| class=\"wikitable sortable\"\n ... |-\n ... ! a !! b !! c\n ... |-\n ... !colspan = \"2\" | d || e\n ... |-\n ... |}\"\"\")\n >>> t.data()\n [['a', 'b', 'c'], ['d', 'd', 'e']]\n >>> t.data(span=False)\n [['a', 'b', 'c'], ['d', 'e']]\n\nCalling the ``cells`` method of a ``Table`` returns table cells as ``Cell`` objects. Cell objects provide methods for getting or setting each cell's attributes or values individually:\n\n.. code:: python\n\n >>> cell = t.cells(row=1, column=1)\n >>> cell.attrs\n {'colspan': '2'}\n >>> cell.set('colspan', '3')\n >>> print(t)\n {| class=\"wikitable sortable\"\n |-\n ! a !! b !! c\n |-\n !colspan = \"3\" | d || e\n |-\n |}\n\nHTML attributes of Table, Cell, and Tag objects are accessible via\n``get_attr``, ``set_attr``, ``has_attr``, and ``del_attr`` methods.\n\nLists\n-----\n\nThe ``get_lists`` method provides access to lists within the wikitext.\n\n.. code:: python\n\n >>> parsed = wtp.parse(\n ... 'text\\n'\n ... '* list item a\\n'\n ... '* list item b\\n'\n ... '** sub-list of b\\n'\n ... '* list item c\\n'\n ... '** sub-list of b\\n'\n ... 'text'\n ... )\n >>> wikilist = parsed.get_lists()[0]\n >>> wikilist.items\n [' list item a', ' list item b', ' list item c']\n\nThe ``sublists`` method can be used to get all sub-lists of the current list or just sub-lists of specific items:\n\n.. code:: python\n\n >>> wikilist.sublists()\n [WikiList('** sub-list of b\\n'), WikiList('** sub-list of b\\n')]\n >>> wikilist.sublists(1)[0].items\n [' sub-list of b']\n\nIt also has an optional ``pattern`` argument that works similar to ``lists``, except that the current list pattern will be automatically added to it as a prefix:\n\n.. code:: python\n\n >>> wikilist = wtp.WikiList('#a\\n#b\\n##ba\\n#*bb\\n#:bc\\n#c', '\\#')\n >>> wikilist.sublists()\n [WikiList('##ba\\n'), WikiList('#*bb\\n'), WikiList('#:bc\\n')]\n >>> wikilist.sublists(pattern='\\*')\n [WikiList('#*bb\\n')]\n\n\nConvert one type of list to another using the convert method. Specifying the starting pattern of the desired lists can facilitate finding them and improves the performance:\n\n.. code:: python\n\n >>> wl = wtp.WikiList(\n ... ':*A1\\n:*#B1\\n:*#B2\\n:*:continuing A1\\n:*A2',\n ... pattern=':\\*'\n ... )\n >>> print(wl)\n :*A1\n :*#B1\n :*#B2\n :*:continuing A1\n :*A2\n >>> wl.convert('#')\n >>> print(wl)\n #A1\n ##B1\n ##B2\n #:continuing A1\n #A2\n\nTags\n----\n\nAccessing HTML tags:\n\n.. 
code:: python\n\n >>> p = wtp.parse('text<ref name=\"c\">citation</ref>\\n<references/>')\n >>> ref, references = p.get_tags()\n >>> ref.name = 'X'\n >>> ref\n Tag('<X name=\"c\">citation</X>')\n >>> references\n Tag('<references/>')\n\nWikiTextParser is able to handle common usages of HTML and extension tags. However it is not a fully-fledged HTML parser and may fail on edge cases or malformed HTML input. Please open an issue on github if you encounter bugs.\n\nMiscellaneous\n-------------\n``parent`` and ``ancestors`` methods can be used to access a node's parent or ancestors respectively:\n\n.. code:: python\n\n >>> template_d = parse(\"{{a|{{b|{{c|{{d}}}}}}}}\").templates[3]\n >>> template_d.ancestors()\n [Template('{{c|{{d}}}}'),\n Template('{{b|{{c|{{d}}}}}}'),\n Template('{{a|{{b|{{c|{{d}}}}}}}}')]\n >>> template_d.parent()\n Template('{{c|{{d}}}}')\n >>> _.parent()\n Template('{{b|{{c|{{d}}}}}}')\n >>> _.parent()\n Template('{{a|{{b|{{c|{{d}}}}}}}}')\n >>> _.parent() # Returns None\n\nUse the optional ``type_`` argument if looking for ancestors of a specific type:\n\n.. code:: python\n\n >>> parsed = parse('{{a|{{#if:{{b{{c<!---->}}}}}}}}')\n >>> comment = parsed.comments[0]\n >>> comment.ancestors(type_='ParserFunction')\n [ParserFunction('{{#if:{{b{{c<!---->}}}}}}')]\n\n\nTo delete/remove any object from its parents use ``del object[:]`` or ``del object.string``.\n\nThe ``remove_markup`` function or ``plain_text`` method can be used to remove wiki markup:\n\n.. code:: python\n\n >>> from wikitextparser import remove_markup, parse\n >>> s = \"'''a'''<!--comment--> [[b|c]] [[d]]\"\n >>> remove_markup(s)\n 'a c d'\n >>> parse(s).plain_text()\n 'a c d'\n\nCompared with mwparserfromhell\n==============================\n\n`mwparserfromhell <https://github.com/earwig/mwparserfromhell>`_ is a mature and widely used library with nearly the same purposes as ``wikitextparser``. The main reason leading me to create ``wikitextparser`` was that ``mwparserfromhell`` could not parse wikitext in certain situations that I needed it for. See mwparserfromhell's issues `40 <https://github.com/earwig/mwparserfromhell/issues/40>`_, `42 <https://github.com/earwig/mwparserfromhell/issues/42>`_, `88 <https://github.com/earwig/mwparserfromhell/issues/88>`_, and other related issues. In many of those situation ``wikitextparser`` may be able to give you more acceptable results.\n\nAlso note that ``wikitextparser`` is still using 0.x.y version `meaning <https://semver.org/>`_ that the API is not stable and may change in the future versions.\n\nThe tokenizer in ``mwparserfromhell`` is written in C. Tokenization in ``wikitextparser`` is mostly done using the ``regex`` library which is also in C.\nI have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage. 
In my limited experience, ``wikitextparser`` has a decent performance in realistic cases and should be able to compete and may even have little performance benefits in some situations.\n\nIf you have had a chance to compare these libraries in terms of performance or capabilities please share your experience by opening an issue on github.\n\nSome of the unique features of ``wikitextparser`` are: Providing access to individual cells of each table, pretty-printing templates, a WikiList class with rudimentary methods to work with `lists <https://www.mediawiki.org/wiki/Help:Lists>`_, and a few other functions.\n\nKnown issues and limitations\n============================\n\n* The contents of templates/parameters are not known to offline parsers. For example an offline parser cannot know if the markup ``[[{{z|a}}]]`` should be treated as wikilink or not, it depends on the inner-workings of the ``{{z}}`` template. In these situations ``wikitextparser`` tries to use a best guess. ``[[{{z|a}}]]`` is treated as a wikilink (why else would anyone call a template inside wikilink markup, and even if it is not a wikilink, usually no harm is done).\n* Localized namespace names are unknown, so for example ``[[File:...]]`` links are treated as normal wikilinks. ``mwparserfromhell`` has similar issue, see `#87 <https://github.com/earwig/mwparserfromhell/issues/87>`_ and `#136 <https://github.com/earwig/mwparserfromhell/issues/136>`_. As a workaround, `Pywikibot <https://www.mediawiki.org/wiki/Manual:Pywikibot>`_ can be used for determining the namespace.\n* `Linktrails <https://www.mediawiki.org/wiki/Help:Links>`_ are language dependant and are not supported. `Also not supported by mwparserfromhell <https://github.com/earwig/mwparserfromhell/issues/82>`_. However given the trail pattern and knowing that ``wikilink.span[1]`` is the ending position of a wikilink, it is possible to compute a WikiLink's linktrail.\n* Templates adjacent to external links are never considered part of the link. In reality, this depends on the contents of the template. Example: ``parse('http://example.com{{dead link}}').external_links[0].url == 'http://example.com'``\n* List of valid `extension tags <https://www.mediawiki.org/wiki/Parser_extension_tags>`_ depends on the extensions intalled on the wiki. The ``tags`` method currently only supports the ones on English Wikipedia. A configuration option might be added in the future to address this issue.\n* ``wikitextparser`` currently does not provide an `ast.walk <https://docs.python.org/3/library/ast.html#ast.walk>`_-like method yielding all descendant nodes.\n* `Parser functions <https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions>`_ and `magic words <https://www.mediawiki.org/wiki/Help:Magic_words>`_ are not evaluated.\n\n\nCredits\n=======\n* `python <https://www.python.org/>`_\n* `regex <https://github.com/mrabarnett/mrab-regex>`_\n* `wcwidth <https://github.com/jquast/wcwidth>`_\n\n",
"bugtrack_url": null,
"license": null,
"summary": "A simple parsing tool for MediaWiki's wikitext markup.",
"version": "0.56.3",
"project_urls": {
"Homepage": "https://github.com/5j9/wikitextparser"
},
"split_keywords": [
"mediawiki",
" wikitext",
" parser"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "815f3109173deefaaf4a4d8f4086b20367a42fd2f77d6a096db04e835aa5dfe2",
"md5": "7001209f0e773fc127d842b4c8b7eb48",
"sha256": "49bcbe421f0c126fba254a8f2e41262e679a2a88f2010dda90198a287616b5e4"
},
"downloads": -1,
"filename": "wikitextparser-0.56.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7001209f0e773fc127d842b4c8b7eb48",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 66284,
"upload_time": "2024-10-18T06:10:53",
"upload_time_iso_8601": "2024-10-18T06:10:53.149190Z",
"url": "https://files.pythonhosted.org/packages/81/5f/3109173deefaaf4a4d8f4086b20367a42fd2f77d6a096db04e835aa5dfe2/wikitextparser-0.56.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "688f38ae3bb4d5b87a30f961c535365e807167ba7dc31b3bdc16c708fcd30153",
"md5": "e474a086ea50c6d0feb477b8e670696f",
"sha256": "2fce8141975d15ba7bd04a7605792a28d7cf216ebce10287d086f32af051ed26"
},
"downloads": -1,
"filename": "wikitextparser-0.56.3.tar.gz",
"has_sig": false,
"md5_digest": "e474a086ea50c6d0feb477b8e670696f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 73175,
"upload_time": "2024-10-18T06:10:56",
"upload_time_iso_8601": "2024-10-18T06:10:56.653405Z",
"url": "https://files.pythonhosted.org/packages/68/8f/38ae3bb4d5b87a30f961c535365e807167ba7dc31b3bdc16c708fcd30153/wikitextparser-0.56.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-18 06:10:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "5j9",
"github_project": "wikitextparser",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "wikitextparser"
}