# ripit (forked from BoilerPy3, a beautiful library)
![PyPI - Version](https://img.shields.io/pypi/v/ripit)
[![Updates](https://pyup.io/repos/github/pyupio/pyup/shield.svg)](https://pyup.io/repos/github/pyupio/pyup/)
![](https://github.com/sourcepirate/ripit/actions/workflows/test.yml/badge.svg)
![](https://github.com/sourcepirate/ripit/actions/workflows/publish.yml/badge.svg)
The original BoilerPy3 was no longer maintained, so I forked it to add some features and changes.
The license is unchanged.
## About
BoilerPy3 is a native Python [port](https://github.com/natural/java2python) of Christian Kohlschütter's [Boilerpipe](https://github.com/kohlschutter/boilerpipe) library, released under the Apache 2.0 Licence.
This package is based on [sammyer's](https://github.com/sammyer) [BoilerPy](https://github.com/sammyer/BoilerPy), specifically [mercuree's](https://github.com/mercuree) [Python3-compatible fork](https://github.com/mercuree/BoilerPy). This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and to make use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.
**Note**: This package is based on Boilerpipe 1.2 (at or before [this commit](https://github.com/kohlschutter/boilerpipe/tree/b0816590340f4317f500c64565b23beb4fb9a827)), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3; however, because it performed worse in my tests, I ultimately decided to leave it at the 1.2 equivalent.
## Installation
To install the latest version from PyPI, execute:
```shell
pip install ripit
```
## Usage
The top-level interfaces are the Extractors. Use the `get_content()` methods to extract the filtered text.
```python
from ripit import extractors
extractor = extractors.ArticleExtractor()
# From a URL
content = extractor.get_content_from_url('http://www.example.com/')
# From a file
content = extractor.get_content_from_file('tests/test.html')
# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')
```
Alternatively, use `get_doc()` to return a Boilerpipe document from which you can get more detailed information.
```python
from ripit import extractors
extractor = extractors.ArticleExtractor()
doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title
```
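As a rough sketch of how the returned document might be consumed, the helper below packs the `title` and `content` attributes shown above into a dictionary. The function name `summarize_doc` and the `preview_chars` parameter are hypothetical and not part of ripit; the helper only assumes the object exposes `title` and `content`, so a stand-in object is used to keep the example runnable offline.

```python
from types import SimpleNamespace


def summarize_doc(doc, preview_chars=80):
    """Return the title, a short preview, and a word count for a document.

    `doc` is assumed to expose `.title` and `.content`, as in the example above.
    """
    content = doc.content or ""
    return {
        "title": doc.title,
        "preview": content[:preview_chars],
        "word_count": len(content.split()),
    }


# Stand-in for the object returned by extractor.get_doc_from_url(...)
fake_doc = SimpleNamespace(
    title="Example",
    content="Example Domain is for use in examples.",
)
summary = summarize_doc(fake_doc)
print(summary)
```

In real use, the object returned by `get_doc_from_url()` would be passed in place of `fake_doc`.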
## Extractors
### DefaultExtractor
A fairly generic full-text extractor. It is usually less accurate than ArticleExtractor, but simpler, since it uses no heuristics.
### ArticleExtractor
A full-text extractor tuned towards news articles, where it achieves higher accuracy than DefaultExtractor. It works very well for most types of article-like HTML.
### ArticleSentencesExtractor
A full-text extractor which is tuned towards extracting sentences from news articles.
### LargestContentExtractor
A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than DefaultExtractor, but usually worse than ArticleExtractor.
### CanolaExtractor
A full-text extractor trained on [krdwrd](http://krdwrd.org) [Canola](https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf). Works well with SimpleEstimator, too.
### KeepEverythingExtractor
A dummy extractor which marks everything as content, so it should return the input text. Use it to check whether a problem lies within a particular extractor or somewhere else.
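One way to run such a check is to pass the same HTML through several extractors and compare how much text each one keeps. The sketch below assumes only that each extractor has a `get_content(html)` method, as shown in Usage. `compare_extractors` is a hypothetical helper, and the stub classes stand in for real extractors so the snippet runs without ripit installed.

```python
def compare_extractors(html, candidates):
    """Map each extractor's class name to the number of characters it kept."""
    return {type(ex).__name__: len(ex.get_content(html)) for ex in candidates}


# Stub extractors for illustration only; in practice, pass instances such as
# extractors.KeepEverythingExtractor() and extractors.ArticleExtractor().
class KeepAllStub:
    def get_content(self, html):
        return html


class DropAllStub:
    def get_content(self, html):
        return ""


report = compare_extractors("Hello world", [KeepAllStub(), DropAllStub()])
print(report)
```

If the keep-everything output already lacks the expected text, the problem probably lies in fetching or parsing rather than in a particular extractor's filtering.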
### NumWordsRulesExtractor
A fairly generic full-text extractor based solely on the number of words per block (the current, previous, and next block).
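To illustrate the general idea of labelling a block from the word counts of itself and its neighbours, here is a minimal sketch. The function name `classify_blocks` and both thresholds are invented for illustration; they are not the rules that ripit or Boilerpipe actually apply.

```python
def classify_blocks(blocks, long_words=10, short_words=4):
    """Label each text block as content (True) or boilerplate (False).

    A block counts as content if it is long on its own, or if it is at
    least moderately long and sits next to a long block. The thresholds
    are illustrative assumptions, not values taken from the library.
    """
    counts = [len(block.split()) for block in blocks]
    labels = []
    for i, current in enumerate(counts):
        previous = counts[i - 1] if i > 0 else 0
        following = counts[i + 1] if i + 1 < len(counts) else 0
        has_long_neighbour = previous >= long_words or following >= long_words
        labels.append(
            current >= long_words
            or (current >= short_words and has_long_neighbour)
        )
    return labels


blocks = [
    "Home About Contact",
    "This paragraph has more than ten words in it, so it counts as content.",
    "A short follow-up sentence here.",
    "Copyright 2024",
]
labels = classify_blocks(blocks)
print(labels)
```

Here the navigation and copyright lines are dropped, while the short follow-up sentence is kept because it sits next to a long paragraph.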
Raw data
{
"_id": null,
"home_page": "https://github.com/sourcepirate/ripit",
"name": "ripit",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "boilerpipe, boilerpy, html text extraction, text extraction, full text extraction",
"author": "sourcepirate",
"author_email": "plasmashadowx@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/59/38/4ebda5eda1d8c26bf44ebc9335ae6c639c170349357c0d92b976e6887b0d/ripit-1.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages",
"version": "1.0.2",
"project_urls": {
"Homepage": "https://github.com/sourcepirate/ripit"
},
"split_keywords": [
"boilerpipe",
" boilerpy",
" html text extraction",
" text extraction",
" full text extraction"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "311eeb7b67d3cdb025d6587f8cf40246a0754e1cbc009773878ca74978547de0",
"md5": "41c2c021a0bcb29209cf2a634c015f7f",
"sha256": "ae0d8f91884d0ba1c676cf171ea75dd7494a77e58b1dfe7f1978327355fa144c"
},
"downloads": -1,
"filename": "ripit-1.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "41c2c021a0bcb29209cf2a634c015f7f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 21752,
"upload_time": "2024-09-12T13:08:38",
"upload_time_iso_8601": "2024-09-12T13:08:38.887762Z",
"url": "https://files.pythonhosted.org/packages/31/1e/eb7b67d3cdb025d6587f8cf40246a0754e1cbc009773878ca74978547de0/ripit-1.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "59384ebda5eda1d8c26bf44ebc9335ae6c639c170349357c0d92b976e6887b0d",
"md5": "f08a22232163ea0951cf06d23dd3cff8",
"sha256": "6b12e1acb7913fc57f1a545de18297fd24230ee01e2ea94caf4477a171bf9e98"
},
"downloads": -1,
"filename": "ripit-1.0.2.tar.gz",
"has_sig": false,
"md5_digest": "f08a22232163ea0951cf06d23dd3cff8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 20821,
"upload_time": "2024-09-12T13:08:40",
"upload_time_iso_8601": "2024-09-12T13:08:40.943605Z",
"url": "https://files.pythonhosted.org/packages/59/38/4ebda5eda1d8c26bf44ebc9335ae6c639c170349357c0d92b976e6887b0d/ripit-1.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-12 13:08:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "sourcepirate",
"github_project": "ripit",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "ripit"
}