readability-lxml


:Name: readability-lxml
:Version: 0.8.1
:Home page: http://github.com/buriy/python-readability
:Summary: fast html to text parser (article readability tool) with python 3 support
:Upload time: 2020-07-04 00:45:51
:Author: Yuri Baburov
:License: Apache License 2.0

.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability


python-readability
==================

Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability
project <http://lab.arc90.com/experiments/readability/>`__.

Installation
------------

Install it with ``pip``:

.. code-block:: bash

    $ pip install readability-lxml

Usage
-----

.. code-block:: python

    >>> import requests
    >>> from readability import Document

    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n    domain in examples without prior coordination or asking for permission.</p>
    \n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""

Change Log
----------

-  0.8.1 Fixed processing of non-ASCII HTML via regexps.
-  0.8 Replaced XHTML output with HTML5 output in the ``summary()`` call.
-  0.7.1 Added support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
-  0.7 Improved handling of HTML5 tags. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
-  0.6 Finally, a release that supports Python versions 2.6, 2.7, and 3.3 - 3.6.
-  0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4.
-  0.4 Added video loading and allowed more images per paragraph.
-  0.3 Added ``Document.encoding``, ``positive_keywords`` and ``negative_keywords``.

Licensing
---------

This code is released under `the Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to
---------

-  Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
-  Ruby port by starrhorne and iterationlabs
-  `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
-  `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
-  "BR to P" fix from readability.js which improves quality for smaller texts
-  Contributions from GitHub users.



            
