DHTMLParser3


Name: DHTMLParser3
Version: 3.0.17
Home page: https://github.com/Bystroushaak/DHTMLParser3
Summary: Python HTML/XML parser for easy web scraping.
Upload time: 2022-03-21 04:13:18
Maintainer: None
Docs URL: None
Author: Bystroushaak
Requires Python: None
License: MIT
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
            
.. image:: https://badge.fury.io/py/DHTMLParser3.svg
    :target: https://pypi.python.org/pypi/dhtmlparser3

.. image:: https://img.shields.io/pypi/dm/dhtmlparser3.svg
    :target: https://pypi.python.org/pypi/dhtmlparser3

.. image:: https://readthedocs.org/projects/dhtmlparser3/badge/?version=latest
    :target: http://dhtmlparser3.readthedocs.org/

.. image:: https://img.shields.io/github/issues/Bystroushaak/dhtmlparser3.svg
    :target: https://github.com/Bystroushaak/dhtmlparser3/issues

.. image:: https://img.shields.io/pypi/l/dhtmlparser3.svg
    :target: https://github.com/Bystroushaak/dhtmlparser3/blob/master/LICENSE.txt
    
.. image:: https://img.shields.io/github/sponsors/Bystroushaak
    :target: https://github.com/sponsors/Bystroushaak

What is it?
===========
DHTMLParser3 is a lightweight HTML/XML parser created for one purpose: quickly and easily picking selected tags out of the DOM.

It can be very useful when you need to write your own "guerilla" API for some webpage, or a scraper.

It is written in pure Python with no dependencies, and it can handle pretty badly broken HTML.
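
To make the idea of picking selected tags out of broken markup concrete, here is a minimal sketch that uses only the standard library's ``html.parser`` module. It is not DHTMLParser3's API; the ``TagCollector`` class and the ``collect_tags()`` helper are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the attributes of every start tag with a given name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = []

    def handle_starttag(self, tag, attrs):
        # `attrs` is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == self.wanted:
            self.found.append(dict(attrs))


def collect_tags(html, wanted):
    """Return the attribute dicts of all `wanted` tags found in `html`."""
    collector = TagCollector(wanted)
    collector.feed(html)
    collector.close()
    return collector.found


# The unclosed <p>, <b> and second <a> do not stop the links from being found.
links = collect_tags('<p><b>Hi <a href="/one">one</a><a href="/two">two', "a")
hrefs = [link["href"] for link in links]
```

A tolerant parser of this kind keeps going when closing tags are missing, which is the property that matters when scraping pages you do not control.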

Documentation
=============

Full module documentation can be found here: http://DHTMLParser3.rtfd.org


Changelog
=========

3.0.17
------
    - Fixed problem with empty strings in Tokenizer.

3.0.16
------
    - Changed behavior of the `.remove_item()` method to compare using identity.
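
The difference between removing by equality and removing by identity can be sketched in plain Python. The ``Item`` class and ``remove_by_identity()`` helper below are hypothetical and only illustrate the comparison rule; they are not the library's implementation of ``.remove_item()``.

```python
class Item:
    """Hypothetical stand-in for a tree node that compares by value."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Item) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


def remove_by_identity(items, target):
    """Remove exactly `target` (the same object) from `items`, if present."""
    for index, candidate in enumerate(items):
        if candidate is target:
            del items[index]
            return True
    return False


first, second = Item("br"), Item("br")

by_equality = [first, second]
by_equality.remove(second)  # list.remove() uses ==, so it drops `first`
equality_kept_second = by_equality[0] is second

by_identity = [first, second]
remove_by_identity(by_identity, second)  # drops `second` itself
identity_kept_first = by_identity[0] is first
```

With two equal-looking nodes in one list, only the identity check is guaranteed to remove the node that was actually passed in.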

3.0.15
------
    - Added new `parse_file()` method to simplify working with files.

3.0.14
------
    - Fixed problem with tokenizer & nonpair tags without spaces.

3.0.13
------
    - Fixed problem with re-ordering of the parameters when setting them.

3.0.12
------
    - Added conditional `escape` parameter to `.content_str()` method.

3.0.11
------
    - Fixed parent problem with `.__deepcopy__()`.
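
Copying a node in a double-linked tree is easy to get wrong, because a naive deep copy follows the ``parent`` link and clones the whole document. The sketch below shows one way to avoid that; the ``Node`` class is hypothetical and is not the library's ``Tag`` class or its actual ``.__deepcopy__()``.

```python
import copy


class Node:
    """Hypothetical double-linked tree node (not the library's Tag class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.parent = None
        self.children = []
        for child in children or []:
            self.add(child)

    def add(self, child):
        child.parent = self
        self.children.append(child)

    def __deepcopy__(self, memo):
        # Copy the subtree only; do not follow `parent` upwards, otherwise
        # copying one node would clone the whole document.
        clone = Node(self.name)
        memo[id(self)] = clone
        for child in self.children:
            clone.add(copy.deepcopy(child, memo))  # re-links parent to the clone
        return clone


root = Node("html", [Node("body", [Node("p")])])
body = root.children[0]
body_copy = copy.deepcopy(body)
```

In this sketch the copy is detached (its ``parent`` is ``None``), while every copied child points at its copied parent rather than at the original tree.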

3.0.10
------
    - Implemented proper `.__copy__()` and `.__deepcopy__()` methods.

3.0.9
-----
    - Fixed the way quotes are escaped in the tag parameters.

3.0.8
-----
    - Fixed behavior of the `.__hash__()` method for nested tags.

3.0.7
-----
    - Don't escape the contents of `<script>` and `<style>`.
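
The reason for this rule is that ``<script>`` and ``<style>`` bodies are raw text, so escaping ``<`` or ``&`` inside them would change the code. A minimal sketch of the rule, using the standard library's ``html.escape()``; the ``RAW_TEXT_TAGS`` constant and ``render_text()`` function are hypothetical names, not part of the library:

```python
import html

# Hypothetical constant: tags whose text content is emitted verbatim.
RAW_TEXT_TAGS = {"script", "style"}


def render_text(tag_name, text):
    """Escape `text` for output unless it lives in a raw-text tag."""
    if tag_name.lower() in RAW_TEXT_TAGS:
        return text
    # quote=False leaves quotes alone; only &, < and > are replaced.
    return html.escape(text, quote=False)


paragraph = render_text("p", "a < b && c")
script = render_text("script", "if (a < b) {}")
```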

3.0.6
-----
    - Fixed behavior of `.match()` method.
    - Added new method `.match_paths()`.
    - Added tests.

3.0.5
-----
    - Bugfix; SpecialDict.copy() didn't return any value.
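
This class of bug is easy to reproduce: a ``copy()`` override that builds the duplicate but forgets the ``return`` statement silently evaluates to ``None``. The sketch below shows the corrected shape on a hypothetical ``OrderedParams`` class; it is not the library's ``SpecialDict``.

```python
from collections import OrderedDict


class OrderedParams(OrderedDict):
    """Hypothetical dict subclass; only the copy() contract matters here."""

    def copy(self):
        # Build a new instance of the same class and hand it back.
        # Omitting the `return` would make every caller receive None.
        duplicate = type(self)()
        duplicate.update(self)
        return duplicate


params = OrderedParams([("href", "/index.html"), ("class", "nav")])
params_copy = params.copy()
params_copy["class"] = "footer"  # must not leak into the original
```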

3.0.4
-----
    - Bugfix; Don't search empty tags.

3.0.3
-----
    - Bugfix; Always return container element for small doms with only strings inside.

3.0.2
-----
    - Added `.__hash__()` method for Tag.
    - `.replace_with()` method now accepts `str` as well as Tag.
    - Fixed problems with `.parent` setting for non-pair tags in the parser.
    - Added a bunch of tests for the newly added features.

3.0.1
-----
    - Added `.__contains__()` method for Tag, so you can now test parameters using the `in` operator.
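
How ``.__contains__()`` enables the ``in`` operator can be sketched on a stripped-down class. ``ParamTag`` below is hypothetical and far simpler than the real Tag; it only assumes that parameters are stored in a dict.

```python
class ParamTag:
    """Hypothetical minimal tag that stores parameters in a dict."""

    def __init__(self, name, parameters=None):
        self.name = name
        self.parameters = dict(parameters or {})

    def __contains__(self, key):
        # `"href" in tag` asks whether the tag carries that parameter.
        return key in self.parameters


link = ParamTag("a", {"href": "/docs"})
has_href = "href" in link
has_title = "title" in link
```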

3.0.0
-----
    - Rewritten to use a different parser; added support for HTML entities.
    - Class structure completely changed; Tag & Comment are now used instead of HTMLElement.
    - Much cleaner code and more comprehensible method names.
    - By default, the tree is now double-linked without any additional cost.
    - Implemented very useful magic methods, so indexing operators are supported for access to both parameters and content.
    - Documentation completely reworked.
    - Set of coverage tests is now much larger.
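
As an illustration of what indexing access to both parameters and content can look like, here is a sketch that dispatches on the key type. The ``IndexableTag`` class is hypothetical, and the rule that string keys reach parameters while integer keys reach content is an assumption made for this example, not a statement about the library's exact behavior.

```python
class IndexableTag:
    """Hypothetical tag: string keys reach parameters, integers reach content."""

    def __init__(self, name, parameters=None, content=None):
        self.name = name
        self.parameters = dict(parameters or {})
        self.content = list(content or [])

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.parameters[key]  # tag["href"]
        return self.content[key]         # tag[0], tag[-1], tag[1:]

    def __setitem__(self, key, value):
        if isinstance(key, str):
            self.parameters[key] = value
        else:
            self.content[key] = value


bold = IndexableTag("b", content=["docs"])
link = IndexableTag("a", {"href": "/docs"}, ["Read the ", bold])
link["class"] = "external"
```

Dispatching on the key type keeps one pair of brackets for both lookups, at the cost of making string-indexed content impossible.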

2.2.3
-----
    - 2020-04-12 Fix by #25 (thx https://github.com/fm4d).

2.2.2
-----
    - Attempt to fix strange recursive inheritance problem.

2.2.0
-----
    - Rewritten for compatibility with Python 3.

2.1.0 - 2.1.8
-------------
    - State parser fixed - it can now recover from invalid HTML like ``<invalid tag=something">``.
    - Rewritten to use ``StateEnum`` in parser for better readability.
    - Garbage collector is now disabled during _raw_split().
    - Fixed #16 - recovery after tags which don't end with ``>`` (``</code`` for example).
    - Closed #17 - ``<`` is now ignored when it is used as a `is smaller than` sign.
    - Restored support of multiline attributes.
    - ``.parseString()`` no longer tries to parse HTML element parameters.
    - Implemented ``first()`` getter.
    - License changed to MIT.
    - Fixed #18: bug which in some cases caused invalid output.
    - Added HTMLElement.__repr__().
    - Added test_coverage.sh.
    - Added extended test_equality() coverage.
    - Formatting improvements.
    - Improved constructor handling, which is now much more readable.
    - Updated formatting of the setup.py.
    - Added more tests.
    - Fixed #22; bug in the SpecialDict.
    - Fixed some nasty unicode problems.
    - Fixed python 2 / 3 problem in docs/__init__.py.
    - getVersion() -> get_version().
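
The note about disabling the garbage collector during ``_raw_split()`` describes a common optimization for allocation-heavy loops: pausing the cyclic collector avoids repeated collection passes while many small objects are created. A minimal sketch of the pattern follows; ``gc_paused()`` and ``split_chunks()`` are hypothetical names, and the splitting logic is a stand-in rather than the library's tokenizer.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state even if the body raises.
        if was_enabled:
            gc.enable()


def split_chunks(text):
    """Hypothetical stand-in for an allocation-heavy tokenizing pass."""
    with gc_paused():
        return [chunk for chunk in text.replace("<", "\n<").split("\n") if chunk]


chunks = split_chunks("<p>one</p><p>two</p>")
```

Restoring the previous state in ``finally`` matters: a parser that left the collector disabled after an exception would affect the whole host program.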

2.0.10
------
    - Added more tests of removeTags().
    - run_tests.sh now gets arguments.
    - Check for string in removeTags() changed to basestring from str.

2.0.6 - 2.0.9
-------------
    - Fixed behaviour of toString() and tagToString().
    - SpecialDict is now derived from OrderedDict.
    - Changed and added tests of .params attribute (OrderedDict is now used).
    - Fixed bug in _repair_tags().
    - Removed _repair_tags() - it wasn't really necessary.
    - Fixed nasty bug which *could* cause invalid XML output.

2.0.1 - 2.0.5
-------------
    - Fixed bugs in ``.match()``.
    - Fixed broken links in documentation.
    - Fixed bugs in ``.isAlmostEqual()``.
    - ``.find()``; Fixed bug which prevented tag_name from being None.
    - Added the ``.__eq__()`` operator to ``SpecialDict``.
    - Added new method ``.containsParamSubset()`` to ``HTMLElement``.

2.0.0
-----
    - Rewritten, refactored, split into multiple files.
    - Added unittest coverage of almost 100% of the code.
    - Added better selector methods (``.wfind()``, ``.match()``).
    - Added Sphinx documentation.
    - Fixed a lot of bugs.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Bystroushaak/DHTMLParser3",
    "name": "DHTMLParser3",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Bystroushaak",
    "author_email": "bystrousak@kitakitsune.org",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python HTML/XML parser for easy web scraping.",
    "version": "3.0.17",
    "project_urls": {
        "Homepage": "https://github.com/Bystroushaak/DHTMLParser3"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4eeab1a7e71be4674a9b65e10f9fdf29055c3729d1188c84fb5a8fd0dd6814cf",
                "md5": "700be9ff519daee36c901b79df4e46a3",
                "sha256": "3c8c9aea865be16b055a5f4282d1a064ba5cd676c1545491c2df061067f10333"
            },
            "downloads": -1,
            "filename": "DHTMLParser3-3.0.17-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "700be9ff519daee36c901b79df4e46a3",
            "packagetype": "bdist_wheel",
            "python_version": "3.8",
            "requires_python": null,
            "size": 15422,
            "upload_time": "2022-03-21T04:13:18",
            "upload_time_iso_8601": "2022-03-21T04:13:18.108120Z",
            "url": "https://files.pythonhosted.org/packages/4e/ea/b1a7e71be4674a9b65e10f9fdf29055c3729d1188c84fb5a8fd0dd6814cf/DHTMLParser3-3.0.17-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-03-21 04:13:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Bystroushaak",
    "github_project": "DHTMLParser3",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "dhtmlparser3"
}
        