ripit


| Field | Value |
| --- | --- |
| Name | ripit |
| Version | 1.0.2 |
| home_page | https://github.com/sourcepirate/ripit |
| Summary | Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages |
| upload_time | 2024-09-12 13:08:40 |
| maintainer | None |
| docs_url | None |
| author | sourcepirate |
| requires_python | >=3.6 |
| license | Apache 2.0 |
| keywords | boilerpipe, boilerpy, html text extraction, text extraction, full text extraction |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# ripit (Forked from BoilerPy3 a beautiful library.)

![PyPI - Version](https://img.shields.io/pypi/v/ripit)
[![Updates](https://pyup.io/repos/github/pyupio/pyup/shield.svg)](https://pyup.io/repos/github/pyupio/pyup/)
![](https://github.com/sourcepirate/ripit/actions/workflows/test.yml/badge.svg)
![](https://github.com/sourcepirate/ripit/actions/workflows/publish.yml/badge.svg)

The original BoilerPy3 was no longer maintained, so I forked it to add some features and changes.
The license is unchanged.

## About

BoilerPy3 is a native Python [port](https://github.com/natural/java2python) of Christian Kohlschütter's [Boilerpipe](https://github.com/kohlschutter/boilerpipe) library, released under the Apache 2.0 Licence.

This package is based on [sammyer's](https://github.com/sammyer) [BoilerPy](https://github.com/sammyer/BoilerPy), specifically [mercuree's](https://github.com/mercuree) [Python3-compatible fork](https://github.com/mercuree/BoilerPy). This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and to make use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.

**Note**: This package is based on Boilerpipe 1.2 (at or before [this commit](https://github.com/kohlschutter/boilerpipe/tree/b0816590340f4317f500c64565b23beb4fb9a827)), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3; however, because it performed worse in my tests, I ultimately decided to leave it at 1.2-equivalent.


## Installation

To install the latest version from PyPI, execute:

```shell
pip install ripit
```


## Usage

The top-level interfaces are the Extractors. Use the `get_content()` methods to extract the filtered text.

```python
from ripit import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_content_from_url('http://www.example.com/')

# From a file
content = extractor.get_content_from_file('tests/test.html')

# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')
```
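The three entry points above differ only in where the HTML comes from, so a small dispatcher can choose one per input. The following is a minimal sketch and not part of ripit: `extract_text` is a hypothetical helper, and the rule it applies (HTTP(S) prefix, then existing file path, otherwise raw HTML) is an assumption that may need adjusting for your inputs.

```python
import os


def extract_text(extractor, source):
    """Return filtered text for a URL, a file path, or a raw HTML string.

    `extractor` is any object exposing the three get_content* methods shown
    above (for example, extractors.ArticleExtractor()).
    """
    # Assumption: anything starting with an HTTP(S) scheme is a URL.
    if source.startswith(("http://", "https://")):
        return extractor.get_content_from_url(source)
    # Assumption: a string naming an existing file is a path to an HTML file.
    if os.path.isfile(source):
        return extractor.get_content_from_file(source)
    # Otherwise treat the string as raw HTML.
    return extractor.get_content(source)
```

With ripit installed, `extract_text(extractors.ArticleExtractor(), 'tests/test.html')` would read the file if it exists, while an HTML string falls through to `get_content()`.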

Alternatively, use `get_doc()` to return a Boilerpipe document from which you can get more detailed information.

```python
from ripit import extractors

extractor = extractors.ArticleExtractor()

doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title
```
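The two attributes used above can be gathered into a plain dictionary for logging or storage. This is a rough sketch: `summarize_doc` is a hypothetical helper, it relies only on `doc.title` and `doc.content` as shown above, and the stand-in object exists solely so the example runs without ripit or network access.

```python
from types import SimpleNamespace


def summarize_doc(doc, preview_chars=80):
    """Collect a few fields from a document into a plain dict."""
    content = doc.content or ""
    words = content.split()
    return {
        "title": doc.title,
        "word_count": len(words),
        # Collapse whitespace so the preview fits on one line.
        "preview": " ".join(words)[:preview_chars],
    }


# Stand-in with the same two attributes; a real document would come from
# extractor.get_doc_from_url(...).
fake_doc = SimpleNamespace(
    title="Example",
    content="Example Domain\nThis domain is for use in examples.",
)
print(summarize_doc(fake_doc))
```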


## Extractors


### DefaultExtractor

A quite generic full-text extractor. Usually worse than ArticleExtractor, but simpler (no heuristics).


### ArticleExtractor

A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of article-like HTML.

### ArticleSentencesExtractor

A full-text extractor which is tuned towards extracting sentences from news articles.


### LargestContentExtractor

A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor, but usually worse than ArticleExtractor.


### CanolaExtractor

A full-text extractor trained on [krdwrd](http://krdwrd.org) [Canola](https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf). Works well with SimpleEstimator, too.


### KeepEverythingExtractor

Dummy extractor which marks everything as content. Should return the input text. Use this to double-check that your problem is within a particular Extractor or somewhere else.
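The double-check described above can be sketched as a comparison between one extractor and a keep-everything baseline. `retention_report` is a hypothetical helper, not part of ripit, and it assumes only that both objects expose `get_content()` as shown in the Usage section.

```python
def retention_report(html, extractor, baseline):
    """Compare an extractor's output with a keep-everything baseline.

    `baseline` is expected to behave like KeepEverythingExtractor, i.e. to
    return all the text found in `html`.
    """
    kept = extractor.get_content(html)
    everything = baseline.get_content(html)
    total_words = len(everything.split())
    kept_words = len(kept.split())
    return {
        "baseline_words": total_words,
        "kept_words": kept_words,
        # Share of the baseline's words that survived filtering; 0.0 when
        # the baseline itself is empty.
        "retained": kept_words / total_words if total_words else 0.0,
    }
```

With ripit installed this might be called as `retention_report(html, extractors.ArticleExtractor(), extractors.KeepEverythingExtractor())`. An empty baseline suggests the problem lies in the input or its parsing, while a full baseline with little retained points at the chosen extractor's filtering.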


### NumWordsRulesExtractor

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
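To illustrate what "based upon the number of words per block" can mean, here is a toy classifier that looks only at the word counts of the current, previous and next block. It is an illustration of the idea under stated assumptions: the function name, the rule and both thresholds are invented for this sketch and are not the rules used by NumWordsRulesExtractor.

```python
def classify_blocks(word_counts, min_words=15, min_neighbour_words=10):
    """Label each block as content (True) or boilerplate (False).

    Toy rule with made-up thresholds: a block is content if it is long
    itself, or if it is non-empty and both neighbours are reasonably long.
    """
    labels = []
    for i, current in enumerate(word_counts):
        # Missing neighbours at the edges count as zero words.
        previous = word_counts[i - 1] if i > 0 else 0
        following = word_counts[i + 1] if i + 1 < len(word_counts) else 0
        is_content = current >= min_words or (
            current > 0
            and previous >= min_neighbour_words
            and following >= min_neighbour_words
        )
        labels.append(is_content)
    return labels


# Word counts for: nav bar, headline, paragraph, a short line between two
# paragraphs, paragraph, footer.
print(classify_blocks([3, 6, 40, 5, 32, 4]))
```

The short block between two long paragraphs is kept because of its neighbours, which is the kind of context a per-block word count alone would miss.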



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/sourcepirate/ripit",
    "name": "ripit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "boilerpipe, boilerpy, html text extraction, text extraction, full text extraction",
    "author": "sourcepirate",
    "author_email": "plasmashadowx@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/59/38/4ebda5eda1d8c26bf44ebc9335ae6c639c170349357c0d92b976e6887b0d/ripit-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages",
    "version": "1.0.2",
    "project_urls": {
        "Homepage": "https://github.com/sourcepirate/ripit"
    },
    "split_keywords": [
        "boilerpipe",
        " boilerpy",
        " html text extraction",
        " text extraction",
        " full text extraction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "311eeb7b67d3cdb025d6587f8cf40246a0754e1cbc009773878ca74978547de0",
                "md5": "41c2c021a0bcb29209cf2a634c015f7f",
                "sha256": "ae0d8f91884d0ba1c676cf171ea75dd7494a77e58b1dfe7f1978327355fa144c"
            },
            "downloads": -1,
            "filename": "ripit-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "41c2c021a0bcb29209cf2a634c015f7f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 21752,
            "upload_time": "2024-09-12T13:08:38",
            "upload_time_iso_8601": "2024-09-12T13:08:38.887762Z",
            "url": "https://files.pythonhosted.org/packages/31/1e/eb7b67d3cdb025d6587f8cf40246a0754e1cbc009773878ca74978547de0/ripit-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "59384ebda5eda1d8c26bf44ebc9335ae6c639c170349357c0d92b976e6887b0d",
                "md5": "f08a22232163ea0951cf06d23dd3cff8",
                "sha256": "6b12e1acb7913fc57f1a545de18297fd24230ee01e2ea94caf4477a171bf9e98"
            },
            "downloads": -1,
            "filename": "ripit-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "f08a22232163ea0951cf06d23dd3cff8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 20821,
            "upload_time": "2024-09-12T13:08:40",
            "upload_time_iso_8601": "2024-09-12T13:08:40.943605Z",
            "url": "https://files.pythonhosted.org/packages/59/38/4ebda5eda1d8c26bf44ebc9335ae6c639c170349357c0d92b976e6887b0d/ripit-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-12 13:08:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sourcepirate",
    "github_project": "ripit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "ripit"
}
        