:Name: jusText
:Version: 3.0.1
:Home page: https://github.com/miso-belica/jusText
:Summary: Heuristic based boilerplate removal tool
:Upload time: 2024-05-09 15:49:56
:Maintainer: Michal Belica
:Author: Jan Pomikálek
:License: The BSD 2-Clause License

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

jusText
=======
.. image:: https://api.travis-ci.org/miso-belica/jusText.png?branch=master
  :target: https://travis-ci.org/miso-belica/jusText

jusText is a tool for removing boilerplate content, such as navigation
links, headers, and footers, from HTML pages. It is
`designed <doc/algorithm.rst>`_ to preserve
mainly text containing full sentences, which makes it well suited for
creating linguistic resources such as Web corpora. You can
`try it online <http://nlp.fi.muni.cz/projects/justext/>`_.

This is a fork of the original (currently unmaintained) jusText_ code hosted
on Google Code.


Ports of the algorithm to other programming languages:

- `C++ <https://github.com/endredy/jusText>`_
- `Go <https://github.com/JalfResi/justext>`_
- `Java <https://github.com/wizenoze/justext-java>`_


Some libraries using jusText:

- `chirp <https://github.com/9b/chirp>`_
- `lazynlp <https://github.com/chiphuyen/lazynlp>`_
- `off-topic-memento-toolkit <https://github.com/oduwsdl/off-topic-memento-toolkit>`_
- `pears <https://github.com/PeARSearch/PeARS-orchard>`_
- `readability calculator <https://github.com/joaopalotti/readability_calculator>`_
- `sky <https://github.com/kootenpv/sky>`_


Some currently (Jan 2020) maintained alternatives:

- `dragnet <https://github.com/dragnet-org/dragnet>`_
- `html2text <https://github.com/Alir3z4/html2text>`_
- `inscriptis <https://github.com/weblyzard/inscriptis>`_
- `newspaper <https://github.com/codelucas/newspaper>`_
- `python-readability <https://github.com/buriy/python-readability>`_
- `trafilatura <https://github.com/adbar/trafilatura>`_


Installation
------------
Make sure you have Python_ 2.7+/3.5+ and `pip <https://pip.pypa.io/en/stable/>`_
(`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`_,
`Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`_) installed.
Then run:

.. code-block:: bash

  $ [sudo] pip install justext


Dependencies
------------
::

  lxml (version depends on your Python version)


Usage
-----
.. code-block:: bash

  $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
  $ python -m justext -s English -o plain_text.txt english_page.html
  $ python -m justext --help # for more info


Python API
----------
.. code-block:: python

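  import requests
  import justext

  # Download the page and pass its raw bytes to jusText.
  response = requests.get("http://planet.python.org/")
  paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
  for paragraph in paragraphs:
      # Print only the paragraphs that were not classified as boilerplate.
      if not paragraph.is_boilerplate:
          print(paragraph.text)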
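
The same call also works without network access. The sketch below assumes
that ``justext`` is installed; the inline page and the variable names are
made up for illustration. It passes a unicode string directly, lists a few
of the bundled stoplists, and prints each paragraph's classification
together with its ``xpath`` attribute. How each paragraph of such a short
page gets classified is not guaranteed.

.. code-block:: python

  import justext

  # Hypothetical inline document, so the example needs no network access.
  page = """
  <html><body>
    <div><a href="/">Home</a> | <a href="/about">About</a></div>
    <p>This is a longer paragraph with several complete sentences in it. It is
    the kind of text that the classifier is meant to keep, because it contains
    many common English words and no links at all.</p>
  </body></html>
  """

  # Names of the bundled stoplists, sorted for stable output.
  print(sorted(justext.get_stoplists())[:5])

  paragraphs = justext.justext(page, justext.get_stoplist("English"))
  for paragraph in paragraphs:
      label = "boilerplate" if paragraph.is_boilerplate else "content"
      print(label, paragraph.xpath, paragraph.text[:40])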


Testing
-------
Run the tests with:

.. code-block:: bash

  $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9


Acknowledgements
----------------
.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc
.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/
.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software has been developed at the `Natural Language Processing Centre`_ of
`Masaryk University in Brno`_ with financial support from PRESEMT_ and
`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.


.. :changelog:

Changelog for jusText
=====================

3.0.1 (2024-05-09)
------------------
- *BUG FIX:* Fix issue with new version of lxml `#48 <https://github.com/miso-belica/jusText/pull/48>`_.

3.0.0 (2021-10-21)
------------------
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.4 and below.
- *BUG FIX:* Don't join words separated only by ``<br>`` tag.
- *BUG FIX:* List available stop-lists alphabetically.

2.2.0 (2016-03-06)
------------------
- *INCOMPATIBLE CHANGE:* Stop words are case insensitive.
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.2.
- *BUG FIX:* Preserve new lines from original text in paragraphs.
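
What case-insensitive stop words mean in practice can be sketched with a
small, self-contained function. This is only an illustration: the function
name ``stopword_density`` and the whitespace tokenisation are assumptions
made here, not the library's actual implementation.

.. code-block:: python

  def stopword_density(text, stoplist):
      """Return the fraction of words in text that are stop words, ignoring case."""
      words = text.split()
      if not words:
          return 0.0
      # Lower-case both sides so that "The" and "the" count as the same word.
      lowered = {word.lower() for word in stoplist}
      hits = sum(1 for word in words if word.lower() in lowered)
      return hits / len(words)

  # "The", "AND" and "the" all match, so 3 of the 5 words are stop words.
  print(stopword_density("The cat AND the dog", {"the", "and"}))  # 0.6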

2.1.1 (2014-05-27)
------------------
- *BUG FIX:* Function ``decode_html`` now respects parameter ``errors`` when falling back to ``default_encoding`` `#9 <https://github.com/miso-belica/jusText/issues/9>`_.

2.1.0 (2014-01-25)
------------------
- *FEATURE:* Added XPath selector to the paragraphs. XPath selector is also available in detailed output as ``xpath`` attribute of ``<p>`` tag `#5 <https://github.com/miso-belica/jusText/pull/5>`_.

2.0.0 (2013-08-26)
------------------
- *FEATURE:* Added pluggable DOM preprocessor.
- *FEATURE:* Added support for Python 3.2+.
- *INCOMPATIBLE CHANGE:* Paragraphs are instances of
  ``justext.paragraph.Paragraph``.
- *INCOMPATIBLE CHANGE:* Script 'justext' removed in favour of
  command ``python -m justext``.
- *FEATURE:* It's possible to enter a URI as the input document in the CLI.
- *FEATURE:* It is possible to pass unicode string directly.

1.2.0 (2011-08-08)
------------------
- *FEATURE:* Character counts used instead of word counts where possible in
  order to make the algorithm work well in the language independent
  mode (without a stoplist) for languages where counting words is
  not easy (Japanese, Chinese, Thai, etc).
- *BUG FIX:* More robust parsing of meta tags containing the information about
  used charset.
- *BUG FIX:* Corrected decoding of HTML entities ``&#128;`` to ``&#159;``.
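
Numeric references in the range 128 to 159 are conventionally decoded by
HTML parsers as Windows-1252 characters rather than as control codes.
Assuming that this is the behaviour the last fix targets, the expected
result can be illustrated with the standard library's ``html.unescape``,
which follows the same rule. This is not jusText's own decoding code.

.. code-block:: python

  import html

  # Reference 128 maps to the euro sign (U+20AC), not to a control character.
  print(html.unescape("&#128;") == "\u20ac")  # True

  # Reference 159 maps to the capital Y with diaeresis (U+0178).
  print(html.unescape("&#159;") == "\u0178")  # True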

1.1.0 (2011-03-09)
------------------
- First public release.

            
