rollet


- Name: rollet
- Version: 0.1.5
- Summary: Collect data from various sources
- Upload time: 2022-04-15 10:51:04
- Maintainer: Loïc Rakotoson
- Author: Opscidia (Tech)
- Requires Python: >=3.7
- Keywords: fetch, pull, extract, scrap
- Requirements: none recorded

# Rollet
`Rollet` collects, standardizes and completes data from various sources.

[![PyPI](https://img.shields.io/pypi/v/Rollet?logo=PyPI&style=for-the-badge&labelColor=%233775A9&logoColor=white)](https://pypi.org/project/rollet/)
![PyPI - Status](https://img.shields.io/pypi/status/rollet?style=for-the-badge)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/rollet?logo=python&logoColor=yellow&style=for-the-badge)](https://pypi.org/project/rollet/)



# Installation
## PyPI
The safest way to install `rollet` is through pip:
```bash
python -m pip install rollet
```

# How to use?
## Command script

```properties
rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]
```
```console
positional arguments:
  {extract-txt,extract-csv,extract-json} Choose file type option extraction
  path                                   file path

optional arguments:
  -h, --help                    show this help message and exit
  -o [OUTFILE], --outfile       output file path
  -l [LINK], --link             link field if csv or json
  -f [FIELDS], --fields         fields to keep separated by comma
  --start [START]               number of rows to skip
  --size [SIZE]                 max number of rows to keep
  -t [TIMESLEEP], --timesleep   sleep time in seconds between two pulling
  --timeout [TIMEOUT]           Max GET request timeout in second
  --blacklist [BLACKLIST]       0 (do not use), 1 (use), path (one column domain blacklist file)
```
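
As an illustration of how the options above combine, the following sketch assembles a `rollet` command line from Python. The helper `build_rollet_command`, the file names `links.csv` and `results.out`, the column name `url`, and the field names `title` and `abstract` are assumptions made for the example; only the sub-commands and flags come from the help text above.

```python
import shlex


def build_rollet_command(mode, path, outfile=None, link=None, fields=None,
                         start=None, size=None, timesleep=None, timeout=None):
    """Assemble an argv list for the rollet CLI (hypothetical helper)."""
    if mode not in ("extract-txt", "extract-csv", "extract-json"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["rollet", mode, path]
    if outfile is not None:
        cmd += ["-o", outfile]
    if link is not None:
        cmd += ["-l", link]
    if fields:
        # The help text asks for fields separated by commas.
        cmd += ["-f", ",".join(fields)]
    # Numeric options are passed through as strings.
    for flag, value in (("--start", start), ("--size", size),
                        ("-t", timesleep), ("--timeout", timeout)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical input: links.csv with a link column named "url".
cmd = build_rollet_command("extract-csv", "links.csv", outfile="results.out",
                           link="url", fields=["title", "abstract"],
                           size=100, timesleep=1)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once `rollet` is installed and the input file exists.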

## Python
### Basic usage
```python
from rollet import get_content
from rollet.extractor import BaseExtractor

url = 'https://example.url.com/content-id'

content_dict = get_content(url)

content_object = BaseExtractor(url)
content_object.title            # Title
content_object.abstract         # Abstract
content_object.lang             # Language
content_object.content_type     # Type (pdf, json, html, ...)
content_object.to_dict()        # Same as get_content
```
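
When several URLs have to be processed, a small loop around `get_content` may be enough. The sketch below is one possible way to do it, not part of Rollet: the helper `collect` and the stand-in `fake_get_content` are hypothetical, and the sketch assumes that a failing source raises an exception, which the source does not state.

```python
import time


def collect(urls, fetch, timesleep=0.0):
    """Call fetch(url) for each URL and return (results, errors)."""
    results, errors = {}, {}
    for index, url in enumerate(urls):
        if index and timesleep:
            time.sleep(timesleep)  # pause between two requests
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going if one source fails
            errors[url] = repr(exc)
    return results, errors


# Stand-in for rollet.get_content so the sketch runs without network access.
def fake_get_content(url):
    if "broken" in url:
        raise ValueError("unreachable")
    return {"title": f"Title of {url}"}


results, errors = collect(
    ["https://example.url.com/a", "https://example.url.com/broken"],
    fake_get_content,
)
print(results)
print(errors)
```

In real use, `fetch` would be `rollet.get_content`, for example `collect(urls, get_content, timesleep=1)`.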

### Custom extractors
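```python
from rollet.extractor import BaseExtractor


class CustomExtractor(BaseExtractor):

    # Override a property to change how that field is extracted.
    @property
    def title(self):
        return self._page.find('title')
```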
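
The override pattern itself can be tried without Rollet. The sketch below uses a hypothetical `StandInExtractor` whose page is a plain dict; it is not Rollet's `BaseExtractor`, and whether Rollet's `to_dict()` reads the properties in the same way is an assumption.

```python
class StandInExtractor:
    """Hypothetical stand-in, NOT Rollet's BaseExtractor."""

    def __init__(self, page):
        self._page = page  # here: a plain dict instead of a parsed page

    @property
    def title(self):
        return self._page.get("title")

    def to_dict(self):
        # Reads the property, so overrides in subclasses are picked up.
        return {"title": self.title}


class HeadlineExtractor(StandInExtractor):

    @property
    def title(self):
        # Fall back to another field when the default one is missing.
        return self._page.get("title") or self._page.get("h1")


page = {"h1": "A headline"}
print(StandInExtractor(page).to_dict())   # {'title': None}
print(HeadlineExtractor(page).to_dict())  # {'title': 'A headline'}
```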

### PDF extractors
PDF extraction requires a running [Grobid service](https://grobid.readthedocs.io/en/latest/Grobid-service/).  
Assuming the Grobid API runs on `http://localhost:8070`:
```python
from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor

grobid_service('localhost', '8070')

url = 'https://example.url.com/pdf-content-id'

content_dict = get_content(url)

pdf_content_object = PDFExtractor(url)
```
Reading a PDF with `BaseExtractor` instantiates a `PDFExtractor` object.
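
Before extracting PDFs, it can help to check that Grobid is reachable. The sketch below queries Grobid's `/api/isalive` endpoint using only the standard library; the helpers `grobid_alive_url` and `grobid_is_alive` are hypothetical and not part of Rollet.

```python
import http.client
import urllib.request


def grobid_alive_url(host, port):
    """Build the URL of Grobid's health-check endpoint."""
    return f"http://{host}:{port}/api/isalive"


def grobid_is_alive(host="localhost", port="8070", timeout=2.0):
    """Return True if the Grobid service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(grobid_alive_url(host, port),
                                    timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error or malformed reply.
        return False


print(grobid_is_alive())
```

If the check returns `False`, start Grobid or adjust the host and port before calling `grobid_service`.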


And More!


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "rollet",
    "maintainer": "Lo\u00efc Rakotoson",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "loic.rakotoson@opscidia.com",
    "keywords": "fetch,pull,extract,scrap",
    "author": "Opscidia (Tech)",
    "author_email": "tech@opscidia.com",
    "download_url": "https://files.pythonhosted.org/packages/a5/30/5bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd/rollet-0.1.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Collect data from various sources",
    "version": "0.1.5",
    "project_urls": null,
    "split_keywords": [
        "fetch",
        "pull",
        "extract",
        "scrap"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ccdc80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464",
                "md5": "1e3ef0e8d0aeae28fd66a3022c698531",
                "sha256": "a983c5b4f359ac8bdbdee7953b13a5b966e3ef22c5d9c9b65bc0b1f19b826920"
            },
            "downloads": -1,
            "filename": "rollet-0.1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1e3ef0e8d0aeae28fd66a3022c698531",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 66398,
            "upload_time": "2022-04-15T10:51:02",
            "upload_time_iso_8601": "2022-04-15T10:51:02.477536Z",
            "url": "https://files.pythonhosted.org/packages/cc/dc/80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464/rollet-0.1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a5305bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd",
                "md5": "51dd90a081dd2aee86013255583a52ad",
                "sha256": "322270401955942af3d36c62c54f6e41f5902269c6e64886345c46c3d40ff8e4"
            },
            "downloads": -1,
            "filename": "rollet-0.1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "51dd90a081dd2aee86013255583a52ad",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 66686,
            "upload_time": "2022-04-15T10:51:04",
            "upload_time_iso_8601": "2022-04-15T10:51:04.208892Z",
            "url": "https://files.pythonhosted.org/packages/a5/30/5bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd/rollet-0.1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-04-15 10:51:04",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "rollet"
}
        