.. image:: https://badge.fury.io/py/DHTMLParser3.svg
    :target: https://pypi.python.org/pypi/dhtmlparser3

.. image:: https://img.shields.io/pypi/dm/dhtmlparser3.svg
    :target: https://pypi.python.org/pypi/dhtmlparser3

.. image:: https://readthedocs.org/projects/dhtmlparser3/badge/?version=latest
    :target: http://dhtmlparser3.readthedocs.org/

.. image:: https://img.shields.io/github/issues/Bystroushaak/dhtmlparser3.svg
    :target: https://github.com/Bystroushaak/dhtmlparser3/issues

.. image:: https://img.shields.io/pypi/l/dhtmlparser3.svg
    :target: https://github.com/Bystroushaak/dhtmlparser3/blob/master/LICENSE.txt

.. image:: https://img.shields.io/github/sponsors/Bystroushaak
    :target: https://github.com/sponsors/Bystroushaak

What is it?
===========
DHTMLParser3 is a lightweight HTML/XML parser created for one purpose: quickly and easily picking selected tags from the DOM.

It can be very useful when you need to write your own "guerrilla" API for some webpage, or a scraper.

It is written in pure Python with no dependencies, and it can handle pretty badly broken HTML.
Documentation
=============
Full module documentation can be found here: http://DHTMLParser3.rtfd.org
Changelog
=========
3.0.17
------
- Fixed problem with empty strings in Tokenizer.
3.0.16
------
- Changed behavior of the `.remove_item()` method to compare using identity.
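
The difference between comparing by identity and comparing by equality can be illustrated with a small standalone sketch. The ``Node`` class and the ``remove_by_identity()`` helper below are hypothetical stand-ins written only for this illustration; they are not the DHTMLParser3 ``Tag`` class or the actual ``.remove_item()`` implementation.

```python
class Node:
    """Hypothetical stand-in for a tag; NOT the DHTMLParser3 ``Tag`` class."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Two nodes are "equal" when they have the same name.
        return isinstance(other, Node) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


def remove_by_identity(items, target):
    """Remove only the exact object ``target``; equal but distinct objects stay."""
    for index, item in enumerate(items):
        if item is target:
            del items[index]
            return True
    return False


first, second = Node("p"), Node("p")
children = [first, second]

# list.remove() compares with ==, so asking to remove `second`
# actually removes `first`, the earliest equal element.
by_equality = list(children)
by_equality.remove(second)

# Identity comparison removes exactly the object that was passed in.
by_identity = list(children)
remove_by_identity(by_identity, second)
```

With equality-based removal the wrong sibling can disappear when two elements look the same; identity-based removal avoids that ambiguity.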
3.0.15
------
- Added new `parse_file()` method to simplify working with files.
3.0.14
------
- Fixed problem with tokenizer & nonpair tags without spaces.
3.0.13
------
- Fixed problem with re-ordering of the parameters when setting them.
3.0.12
------
- Added conditional `escape` parameter to `.content_str()` method.
3.0.11
------
- Fixed parent problem with `.__deepcopy__()`.
3.0.10
------
- Implemented proper `.__copy__()` and `.__deepcopy__()` methods.
3.0.9
-----
- Fixed the way quotes are escaped in the tag parameters.
3.0.8
-----
- Fixed behavior of the `.__hash__()` method for nested tags.
3.0.7
-----
- Don't escape the content of `<script>` and `<style>` tags.
3.0.6
-----
- Fixed behavior of `.match()` method.
- Added new method `.match_paths()`.
- Added tests.
3.0.5
-----
- Bugfix; SpecialDict.copy() didn't return any value.
3.0.4
-----
- Bugfix; Don't search empty tags.
3.0.3
-----
- Bugfix; Always return container element for small doms with only strings inside.
3.0.2
-----
- Added `.__hash__()` method for Tag.
- `.replace_with()` method now accepts `str` as well as Tag.
- Fixed problems with `.parent` setting for non-pair tags in the parser.
- Added a bunch of tests for the newly added features.
3.0.1
-----
- Added `.__contains__()` method for Tag, so you can now test parameters using the `in` operator.
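
As a rough illustration of how a ``__contains__()`` method enables the ``in`` operator for parameter checks, here is a minimal standalone sketch. The ``ParamHolder`` class and its ``parameters`` attribute are assumptions made up for this example; the real ``Tag`` class may store and check its parameters differently.

```python
class ParamHolder:
    """Hypothetical element with parameters; NOT the DHTMLParser3 ``Tag`` class."""

    def __init__(self, name, parameters=None):
        self.name = name
        self.parameters = dict(parameters or {})

    def __contains__(self, key):
        # Python calls this method when evaluating `key in instance`.
        return key in self.parameters


link = ParamHolder("a", {"href": "https://example.com"})

has_href = "href" in link    # parameter is present
has_class = "class" in link  # parameter is missing
```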
3.0.0
-----
- Rewritten to use a different parser; added support for HTML entities.
- Structure of the classes completely changed; Tag & Comment are now used instead of HTMLElement.
- Much cleaner code and more comprehensible method names.
- By default, the tree is now double-linked without any additional cost.
- Implemented very useful magic methods, so indexing operators are supported for access to both parameters and content.
- Documentation completely reworked.
- Set of coverage tests is now much larger.
2.2.3
-----
- 2020-04-12 Fix by #25 (thx https://github.com/fm4d).
2.2.2
-----
- Attempt to fix strange recursive inheritance problem.
2.2.0
-----
- Rewritten for compatibility with Python 3.
2.1.0 - 2.1.8
-------------
- State parser fixed - it can now recover from invalid HTML like ``<invalid tag=something">``.
- Rewritten to use ``StateEnum`` in parser for better readability.
- Garbage collector is now disabled during _raw_split().
- Fixed #16 - recovery after tags which don't end with ``>`` (``</code`` for example).
- Closed #17 - ``<`` is now ignored when it is used as a "less than" sign.
- Restored support of multiline attributes.
- ``.parseString()`` now doesn't try to parse HTML element parameters.
- Implemented ``first()`` getter.
- License changed to MIT.
- Fixed #18: bug which in some cases caused invalid output.
- Added HTMLElement.__repr__().
- Added test_coverage.sh.
- Added extended test_equality() coverage.
- Formatting improvements.
- Improved constructor handling, which is now much more readable.
- Updated formatting of the setup.py.
- Added more tests.
- Fixed #22; bug in the SpecialDict.
- Fixed some nasty unicode problems.
- Fixed python 2 / 3 problem in docs/__init__.py.
- Renamed getVersion() to get_version().
2.0.10
------
- Added more tests of removeTags().
- run_tests.sh now gets arguments.
- Check for string in removeTags() changed to basestring from str.
2.0.6 - 2.0.9
-------------
- Fixed behaviour of toString() and tagToString().
- SpecialDict is now derived from OrderedDict.
- Changed and added tests of .params attribute (OrderedDict is now used).
- Fixed bug in _repair_tags().
- Removed _repair_tags() - it wasn't really necessary.
- Fixed nasty bug which *could* cause invalid XML output.
2.0.1 - 2.0.5
-------------
- Fixed bugs in ``.match()``.
- Fixed broken links in documentation.
- Fixed bugs in ``.isAlmostEqual()``.
- ``.find()``; Fixed bug which prevented ``tag_name`` from being ``None``.
- Added op ``.__eq__()`` to the `SpecialDict`.
- Added new method ``.containsParamSubset()`` to ``HTMLElement``.
2.0.0
-----
- Rewritten, refactored, and split into multiple files.
- Added unittest coverage of almost 100% of the code.
- Added better selector methods (``.wfind()``, ``.match()``).
- Added Sphinx documentation.
- Fixed a lot of bugs.
Raw data
{
"_id": null,
"home_page": "https://github.com/Bystroushaak/DHTMLParser3",
"name": "DHTMLParser3",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Bystroushaak",
"author_email": "bystrousak@kitakitsune.org",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python HTML/XML parser for easy web scraping.",
"version": "3.0.17",
"project_urls": {
"Homepage": "https://github.com/Bystroushaak/DHTMLParser3"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "4eeab1a7e71be4674a9b65e10f9fdf29055c3729d1188c84fb5a8fd0dd6814cf",
"md5": "700be9ff519daee36c901b79df4e46a3",
"sha256": "3c8c9aea865be16b055a5f4282d1a064ba5cd676c1545491c2df061067f10333"
},
"downloads": -1,
"filename": "DHTMLParser3-3.0.17-py3-none-any.whl",
"has_sig": false,
"md5_digest": "700be9ff519daee36c901b79df4e46a3",
"packagetype": "bdist_wheel",
"python_version": "3.8",
"requires_python": null,
"size": 15422,
"upload_time": "2022-03-21T04:13:18",
"upload_time_iso_8601": "2022-03-21T04:13:18.108120Z",
"url": "https://files.pythonhosted.org/packages/4e/ea/b1a7e71be4674a9b65e10f9fdf29055c3729d1188c84fb5a8fd0dd6814cf/DHTMLParser3-3.0.17-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-03-21 04:13:18",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Bystroushaak",
"github_project": "DHTMLParser3",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "dhtmlparser3"
}