.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

python-readability
==================
Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability
project <http://lab.arc90.com/experiments/readability/>`__.
Installation
------------
Install the package with ``pip``:

.. code-block:: bash

    $ pip install readability-lxml
Usage
-----
.. code-block:: python

    >>> import requests
    >>> from readability import Document

    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n domain in examples without prior coordination or asking for permission.</p>
    \n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""
Change Log
----------
- 0.8.1 Fixed regexp-based processing of non-ASCII HTML documents.
- 0.8 Replaced XHTML output with HTML5 output in the ``summary()`` call.
- 0.7.1 Added support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
- 0.7 Improved handling of HTML5 tags. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added video loading and allowed more images per paragraph
- 0.3 Added ``Document.encoding``, ``positive_keywords`` and ``negative_keywords``
Licensing
---------
This code is released under the `Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.
Thanks to
---------
- Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
- Ruby port by starrhorne and iterationlabs
- `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
- `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
- "BR to P" fix from readability.js which improves quality for smaller texts
- Contributions from GitHub users.
Raw data
{
"_id": null,
"home_page": "http://github.com/buriy/python-readability",
"name": "readability-lxml",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Yuri Baburov",
"author_email": "burchik@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/b9/62/6de3a9a8524c1a1ee0f2aee0dfbad13a36ebbca0db402abcf4e790496512/readability-lxml-0.8.1.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "fast html to text parser (article readability tool) with python 3 support",
"version": "0.8.1",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "6a0dc326b843d99346d2afc44d2b4faa",
"sha256": "e0d366a21b1bd6cca17de71a4e6ea16fcfaa8b0a5b4004e39e2c7eff884e6305"
},
"downloads": -1,
"filename": "readability_lxml-0.8.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6a0dc326b843d99346d2afc44d2b4faa",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 20691,
"upload_time": "2020-07-04T00:45:49",
"upload_time_iso_8601": "2020-07-04T00:45:49.348058Z",
"url": "https://files.pythonhosted.org/packages/39/a6/cfe22aaa19ac69b97d127043a76a5bbcb0ef24f3a0b22793c46608190caa/readability_lxml-0.8.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "dd153878f06608bd487f36a29d21cc5a",
"sha256": "e51fea56b5909aaf886d307d48e79e096293255afa567b7d08bca94d25b1a4e1"
},
"downloads": -1,
"filename": "readability-lxml-0.8.1.tar.gz",
"has_sig": false,
"md5_digest": "dd153878f06608bd487f36a29d21cc5a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 15878,
"upload_time": "2020-07-04T00:45:51",
"upload_time_iso_8601": "2020-07-04T00:45:51.112784Z",
"url": "https://files.pythonhosted.org/packages/b9/62/6de3a9a8524c1a1ee0f2aee0dfbad13a36ebbca0db402abcf4e790496512/readability-lxml-0.8.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2020-07-04 00:45:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "buriy",
"github_project": "python-readability",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": null,
"specs": []
}
],
"tox": true,
"lcname": "readability-lxml"
}