Name | rollet |
Version | 0.1.5 |
home_page | |
Summary | Collect data from various sources |
upload_time | 2022-04-15 10:51:04 |
maintainer | Loïc Rakotoson |
docs_url | None |
author | Opscidia (Tech) |
requires_python | >=3.7 |
license | |
keywords | fetch, pull, extract, scrap |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Rollet
`Rollet` collects, standardizes and completes data from various sources.
[![PyPI](https://img.shields.io/pypi/v/Rollet?logo=PyPI&style=for-the-badge&labelColor=%233775A9&logoColor=white)](https://pypi.org/project/rollet/)
![PyPI - Status](https://img.shields.io/pypi/status/rollet?style=for-the-badge)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/rollet?logo=python&logoColor=yellow&style=for-the-badge)](https://pypi.org/project/rollet/)
# Installation
## PyPI
The safest way to install `rollet` is through pip:
```bash
python -m pip install rollet
```
# How to use?
## Command script
```properties
rollet {extract-txt,extract-csv,extract-json} path
       [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
       [--size [SIZE]] [-t [TIMESLEEP]] [--timeout [TIMEOUT]]
       [--blacklist [BLACKLIST]]
```
```console
positional arguments:
  {extract-txt,extract-csv,extract-json}  Choose file type option extraction
  path                                    file path

optional arguments:
  -h, --help                   show this help message and exit
  -o [OUTFILE], --outfile      output file path
  -l [LINK], --link            link field if csv or json
  -f [FIELDS], --fields        fields to keep separated by comma
  --start [START]              number of rows to skip
  --size [SIZE]                max number of rows to keep
  -t [TIMESLEEP], --timesleep  sleep time in seconds between two pulling
  --timeout [TIMEOUT]          Max GET request timeout in second
  --blacklist [BLACKLIST]      0 (do not use), 1 (use), path (one column domain blacklist file)
```
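To make the option list concrete, here is a minimal sketch that assembles a `rollet` command line from Python. The flags are the ones listed in the help text; the helper name `build_rollet_command`, the input file `links.csv`, the column name `url` and the output file `results.out` are illustrative assumptions, not part of the package.

```python
import shlex


def build_rollet_command(mode, path, outfile=None, link=None,
                         fields=None, start=None, size=None):
    """Assemble a rollet command line (hypothetical helper, not part of rollet)."""
    if mode not in ("extract-txt", "extract-csv", "extract-json"):
        raise ValueError(f"unknown mode: {mode}")
    args = ["rollet", mode, path]
    if outfile is not None:
        args += ["-o", outfile]
    if link is not None:
        args += ["-l", link]              # link field, for csv or json input
    if fields:
        args += ["-f", ",".join(fields)]  # fields are separated by commas
    if start is not None:
        args += ["--start", str(start)]   # number of rows to skip
    if size is not None:
        args += ["--size", str(size)]     # max number of rows to keep
    return args


# File and column names below are placeholders.
cmd = build_rollet_command("extract-csv", "links.csv", outfile="results.out",
                           link="url", fields=["title", "abstract"], size=100)
print(" ".join(shlex.quote(a) for a in cmd))
```

Once `rollet` is installed, the resulting list could be handed to `subprocess.run(cmd, check=True)`.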
## Python
### Basic usage
```python
from rollet import get_content
from rollet.extractor import BaseExtractor
url = 'https://example.url.com/content-id'
content_dict = get_content(url)
content_object = BaseExtractor(url)
content_object.title # Title
content_object.abstract # Abstract
content_object.lang # Language
content_object.content_type # Type (pdf, json, html, ...)
content_object.to_dict() # Same as get_content
```
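When several URLs need processing, the same call can be wrapped in a small loop. The sketch below is only an illustration: `collect` is a hypothetical helper, and the fetch function is passed in as a parameter so that `get_content` can be supplied when `rollet` is installed. The `timesleep` pause mirrors the CLI option of the same name, and failing URLs are recorded instead of aborting the run, which is a design choice of this sketch rather than documented `rollet` behaviour.

```python
import time


def collect(urls, fetch, timesleep=0.0):
    """Fetch each URL in turn, pausing `timesleep` seconds between pulls.

    `fetch` is any callable that takes a URL and returns a dict,
    for example rollet.get_content (assumed wiring, not a rollet API).
    """
    results, errors = [], {}
    for i, url in enumerate(urls):
        if i and timesleep:
            time.sleep(timesleep)      # pause only between two pulls
        try:
            results.append(fetch(url))
        except Exception as exc:       # keep going and remember what failed
            errors[url] = str(exc)
    return results, errors


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("unreachable")
    return {"url": url, "title": "stub"}


results, errors = collect(
    ["https://example.url.com/a", "https://example.url.com/bad"],
    fake_fetch,
)
```

With the package installed, the intended call would look like `collect(urls, get_content, timesleep=1)`.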
### Custom extractors
```python
from rollet.extractor import BaseExtractor

class CustomExtractor(BaseExtractor):
    # Override the title property with custom parsing logic
    @property
    def title(self):
        return self._page.find('title')
```
### PDF extractors
PDF extraction requires a [Grobid service](https://grobid.readthedocs.io/en/latest/Grobid-service/).
Assuming the Grobid API runs on `http://localhost:8070`:
```python
from rollet import grobid_service, get_content
from rollet.extractor import PDFExtractor
grobid_service('localhost', '8070')
url = 'https://example.url.com/pdf-content-id'
content_dict = get_content(url)
pdf_content_object = PDFExtractor(url)
```
Reading a PDF with `BaseExtractor` will instantiate a `PDFExtractor` object.
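This kind of automatic dispatch can be implemented in several ways in Python. The sketch below shows one of them, overriding `__new__` in the base class. It is not taken from the `rollet` source: the class names `Base` and `PDF` and the `.pdf` suffix check are stand-ins chosen for illustration, and the real package may detect the content type differently.

```python
class Base:
    """Stand-in for a generic extractor (not rollet's actual class)."""

    def __new__(cls, url):
        # Only the base class re-routes; subclasses are built normally.
        if cls is Base and url.lower().endswith(".pdf"):
            return super().__new__(PDF)
        return super().__new__(cls)

    def __init__(self, url):
        self.url = url


class PDF(Base):
    """Stand-in for a PDF-specific extractor."""


html_obj = Base("https://example.url.com/content-id")
pdf_obj = Base("https://example.url.com/paper.pdf")
print(type(html_obj).__name__, type(pdf_obj).__name__)
```

Because the object returned by `__new__` is still an instance of `Base`, Python runs `__init__` on it as usual, so the caller gets a fully initialised `PDF` object without naming that class.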
And More!
Raw data
{
"_id": null,
"home_page": "",
"name": "rollet",
"maintainer": "Lo\u00efc Rakotoson",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "loic.rakotoson@opscidia.com",
"keywords": "fetch,pull,extract,scrap",
"author": "Opscidia (Tech)",
"author_email": "tech@opscidia.com",
"download_url": "https://files.pythonhosted.org/packages/a5/30/5bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd/rollet-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Collect data from various sources",
"version": "0.1.5",
"project_urls": null,
"split_keywords": [
"fetch",
"pull",
"extract",
"scrap"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ccdc80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464",
"md5": "1e3ef0e8d0aeae28fd66a3022c698531",
"sha256": "a983c5b4f359ac8bdbdee7953b13a5b966e3ef22c5d9c9b65bc0b1f19b826920"
},
"downloads": -1,
"filename": "rollet-0.1.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1e3ef0e8d0aeae28fd66a3022c698531",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 66398,
"upload_time": "2022-04-15T10:51:02",
"upload_time_iso_8601": "2022-04-15T10:51:02.477536Z",
"url": "https://files.pythonhosted.org/packages/cc/dc/80cf0e514393a13042dbbcd7892243e85572675539f58f893ed8132b5464/rollet-0.1.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a5305bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd",
"md5": "51dd90a081dd2aee86013255583a52ad",
"sha256": "322270401955942af3d36c62c54f6e41f5902269c6e64886345c46c3d40ff8e4"
},
"downloads": -1,
"filename": "rollet-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "51dd90a081dd2aee86013255583a52ad",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 66686,
"upload_time": "2022-04-15T10:51:04",
"upload_time_iso_8601": "2022-04-15T10:51:04.208892Z",
"url": "https://files.pythonhosted.org/packages/a5/30/5bcdcb4e1397b213508c592fda693603ffa24e187dca88d5d230e7f4c5bd/rollet-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-04-15 10:51:04",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "rollet"
}