| Field | Value |
| --- | --- |
| Name | scrape-schema |
| Version | 0.6.3 |
| home_page | None |
| Summary | A library for converting any text (xml, html, plain text, stdout, etc) to python datatypes |
| upload_time | 2023-10-08 16:10:00 |
| maintainer | None |
| docs_url | None |
| author | vypivshiy |
| requires_python | >=3.8 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch)
[![Documentation Status](https://readthedocs.org/projects/scrape-schema/badge/?version=latest)](https://scrape-schema.readthedocs.io/en/latest/?badge=latest)
![CI](https://github.com/vypivshiy/scrape_schema/actions/workflows/ci.yml/badge.svg)
![License](https://img.shields.io/github/license/vypivshiy/scrape-schema)
![Version](https://img.shields.io/pypi/v/scrape-schema)
![Python-versions](https://img.shields.io/pypi/pyversions/scrape_schema)
[![codecov](https://codecov.io/gh/vypivshiy/scrape-schema/branch/master/graph/badge.svg?token=jqSNuE7g5l)](https://codecov.io/gh/vypivshiy/scrape-schema)
# Scrape-schema
This library is designed for writing structured, readable, and
reusable parsers for HTML and raw text. It is inspired by dataclasses and ORM libraries.
> 🚨 Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.
## Motivation
Simplify parser maintenance for sources that are awkward to work with or expose no API at all,
reduce boilerplate code, and cleanly separate extraction logic from crawling.

Structure and serialize data, or use schemas as an intermediate layer
for third-party serialization libraries: pydantic, json, dataclasses, attrs, etc.
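As a hedged sketch of that intermediate-layer idea (the `PageSchema`, `Page` dataclass, and markup below are illustrative, not part of the library), the plain dict returned by a schema's `.dict()` can be fed straight into a dataclass:

```python
from dataclasses import dataclass

from scrape_schema import BaseSchema, Parsel


class PageSchema(BaseSchema):
    title: str = Parsel().css("h1::text").get()


@dataclass
class Page:
    title: str


# .dict() yields plain python types, so it unpacks into any consumer:
# a dataclass here, or a pydantic/attrs model the same way
page = Page(**PageSchema("<html><body><h1>Hi</h1></body></html>").dict())
print(page)  # Page(title='Hi')
```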
_____
## Features
- Built on top of [Parsel](https://github.com/scrapy/parsel).
- re, css, xpath, jmespath, and [chompjs](https://github.com/Nykakin/chompjs) features.
- [Fluent interface](https://en.wikipedia.org/wiki/Fluent_interface#Python) similar to the parsel.Selector API, for ease of use.
- Less boilerplate code.
- Does not depend on any HTTP client implementation; use whichever you like!
- Python 3.8+ support.
- Reusability and code consistency.
- Dataclass-like structure.
- Partial support for automatic type-casting from annotations (str, int, float, bool, list, dict, Optional); see the sketch after this list.
- Detailed logging to make writing parsers easier.
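A minimal sketch of the annotation-driven casting, assuming `auto_type` is enabled by default (the `auto_type=False` escape hatch in the extended example below suggests it is); the markup and field names here are hypothetical:

```python
from scrape_schema import BaseSchema, Parsel


class Product(BaseSchema):
    # the extracted "100" string should be cast to int to match the annotation
    price: int = Parsel().css(".price::text").get()


print(Product('<span class="price">100</span>').dict())  # {'price': 100}
```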
____
## Install
```shell
pip install scrape-schema
```
## Example
The field interface is similar to the original [parsel](https://parsel.readthedocs.io/en/latest/) library:
```python
# Example from parsel documentation
from parsel import Selector

text = """
    <html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
            <script type="application/json">{"a": ["b", "c"]}</script>
        </body>
    </html>"""


def schema(txt: str):
    selector = Selector(text=txt)
    return {
        "h1": selector.css('h1::text').get(),
        "words": selector.xpath('//h1/text()').re(r'\w+'),
        "urls": selector.css('ul > li').xpath('.//@href').getall(),
        "sample_jmespath_1": selector.css('script::text').jmespath("a").get(),
        "sample_jmespath_2": selector.css('script::text').jmespath("a").getall()
    }
print(schema(text))
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}
```
```python
from scrape_schema import BaseSchema, Parsel


class Schema(BaseSchema):
    h1: str = Parsel().css('h1::text').get()
    words: list[str] = Parsel().xpath('//h1/text()').re(r'\w+')
    urls: list[str] = Parsel().css('ul > li').xpath('.//@href').getall()
    sample_jmespath_1: str = Parsel().css('script::text').jmespath("a").get()
    sample_jmespath_2: list[str] = Parsel().css('script::text').jmespath("a").getall()


text = """
    <html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
            <script type="application/json">{"a": ["b", "c"]}</script>
        </body>
    </html>"""
print(Schema(text).dict())
# {'h1': 'Hello, Parsel!',
# 'words': ['Hello', 'Parsel'],
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'sample_jmespath_1': 'b',
# 'sample_jmespath_2': ['b', 'c']}
```
The scrape_schema output matches the previous example, so why use this library?

- Easy to modify
- Easy to add additional methods
- Easy to port to another project without untangling the logic and call stack
- IDE typing support

For example, if you need to extend a parser, it is quick and simple to do with `scrape_schema`:
```python
from uuid import uuid4
from datetime import datetime
from scrape_schema import BaseSchema, Parsel, sc_param, Callback
from scrape_schema.validator import markup_pre_validator


class Schema(BaseSchema):
    # invoke simple callables to fill field values
    # add a uuid4 id
    id: str = Callback(lambda: str(uuid4()))
    # add the parse date
    date: str = Callback(lambda: str(datetime.today()))

    h1: str = Parsel().css('h1::text').get()
    # convert to upper case
    h1_upper: str = Parsel().css('h1::text').get().upper()
    # convert to lower case
    h1_lower: str = Parsel().css('h1::text').get().lower()

    words: list[str] = Parsel().xpath('//h1/text()').re(r'\w+')
    # alternative solution: split the words
    words_2: list[str] = Parsel().xpath('//h1/text()').get().split()
    # join the results with " AND "
    words_join: str = Parsel().xpath('//h1/text()').re(r'\w+').join(" AND ")
    urls: list[str] = Parsel().css('ul > li').xpath('.//@href').getall()
    # replace the http protocol with https
    urls_https: list[str] = Parsel().css('ul > li').xpath('.//@href').getall().replace("http://", "https://")
    # you can rename output keys via aliases
    sample_jmespath_1: str = Parsel(alias="jsn1").css('script::text').jmespath("a").get()
    sample_jmespath_2: list[str] = Parsel(alias="class").css('script::text').jmespath("a").getall()

    # or count the extracted json values
    jsn_len: int = Parsel(auto_type=False).css('script::text').jmespath("a").getall().count()

    # pre-validate the markup before parsing:
    # if the first h1 element's text != 'Hello, Parsel!', a `SchemaPreValidationError` exception is raised
    @markup_pre_validator()
    def validate_markup(self) -> bool:
        return self.__selector__.css('h1::text').get() == 'Hello, Parsel!'

    # or create fields with a custom rule
    @sc_param
    def custom(self) -> str:
        return "hello world!"

    # you can add extra methods!
    def parse_urls(self):
        for url in self.urls:
            print(f"parse {url}")


text = """
    <html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
            <script type="application/json">{"a": ["b", "c"]}</script>
        </body>
    </html>"""
schema = Schema(text)
# invoke custom method
schema.parse_urls()
# parse http://example.com
# parse http://scrapy.org
print(schema.dict())
# !!!field from @sc_param decorator
# vvvvvvvvvvvvvvvvvvvvvv
# {'custom': 'hello world!',
# !!!simple functions callbacks output
# vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# 'id': '6b66de7b-5b5f-445a-b8a7-3b17332c1ff5',
# 'date': '2023-09-29 18:47:03.638941',
# 'h1': 'Hello, Parsel!', 'h1_upper': 'HELLO, PARSEL!', 'h1_lower': 'hello, parsel!',
# 'words': ['Hello', 'Parsel'], 'words_2': ['Hello,', 'Parsel!'],
# 'words_join': 'Hello AND Parsel',
# 'urls': ['http://example.com', 'http://scrapy.org'],
# 'urls_https': ['https://example.com', 'https://scrapy.org'],
# !!!changed key alias 'sample_jmespath_2' TO 'class'!!!
# vvvvvvvvvvvvvvvvv
# 'jsn1': 'b', 'class': ['b', 'c'], 'jsn_len': 2}
```
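Because a schema only consumes the markup string, it pairs with any HTTP client. A minimal sketch with `requests` (the URL is a placeholder), reusing the `Schema` class above:

```python
import requests  # or httpx, aiohttp, urllib - scrape_schema does not care

# fetch the markup with whichever client you prefer, then hand the
# response text to the schema; the library never touches the network
response = requests.get("https://example.com")
schema = Schema(response.text)
print(schema.dict())
```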
See the [examples](examples) and the [documentation](https://scrape-schema.readthedocs.io/en/latest/)
for more information.
____
This project is licensed under the terms of the MIT license.
## Raw data
```json
{
    "_id": null,
    "home_page": null,
    "name": "scrape-schema",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "vypivshiy",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/db/fd/57eb544d074b5d2f57b3d29415ed8c3a8840e93b898513a0736be53d5f59/scrape_schema-0.6.3.tar.gz",
    "platform": null,
"description": "[![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch)\n[![Documentation Status](https://readthedocs.org/projects/scrape-schema/badge/?version=latest)](https://scrape-schema.readthedocs.io/en/latest/?badge=latest)\n![CI](https://github.com/vypivshiy/scrape_schema/actions/workflows/ci.yml/badge.svg)\n![License](https://img.shields.io/github/license/vypivshiy/scrape-schema)\n![Version](https://img.shields.io/pypi/v/scrape-schema)\n![Python-versions](https://img.shields.io/pypi/pyversions/scrape_schema)\n[![codecov](https://codecov.io/gh/vypivshiy/scrape-schema/branch/master/graph/badge.svg?token=jqSNuE7g5l)](https://codecov.io/gh/vypivshiy/scrape-schema)\n\n# Scrape-schema\nThis library is designed to write structured, readable and\nreusable parsers for html, raw text and is inspired by dataclasses and ORM libraries\n\n> \ud83d\udea8 Scrape-schema is currently in Pre-Alpha. Please expect breaking changes.\n\n\n## Motivation\nSimplifying parsers support, where it is difficult to use\nor the complete absence of the API interfaces, decrease boilerplate code and\neasy separate extraction logic from the crawling\n\nStructuring, data serialization and use as an intermediate layer\nfor third-party serialization libraries: pydantic, json, dataclasses, attrs, etc\n\n_____\n## Features\n- Built top on [Parsel](https://github.com/scrapy/parsel).\n- re, css, xpath, jmespath, [chompjs](https://github.com/Nykakin/chompjs) features.\n- [Fluent interface](https://en.wikipedia.org/wiki/Fluent_interface#Python) simular parsel.Selector API for easy to use.\n- Decrease boilerplate code.\n- Does not depend on the http client implementation, use any!\n- Python 3.8+ support.\n- Reusability, code consistency.\n- Dataclass-like structure.\n- Partial support auto type-casting from annotations (str, int, float, bool, list, dict, Optional)\n- Detailed logging process to make it easier to write a parser\n____\n\n## Install\n\n```shell\npip install scrape-schema\n```\n## Example\n\nThe fields interface is similar to the original [parsel](https://parsel.readthedocs.io/en/latest/) library\n\n```python\n# Example from parsel documentation\nfrom parsel import Selector\ntext = \"\"\"\n <html>\n <body>\n <h1>Hello, Parsel!</h1>\n <ul>\n <li><a href=\"http://example.com\">Link 1</a></li>\n <li><a href=\"http://scrapy.org\">Link 2</a></li>\n </ul>\n <script type=\"application/json\">{\"a\": [\"b\", \"c\"]}</script>\n </body>\n </html>\"\"\"\n\ndef schema(txt: str):\n selector = Selector(text=txt)\n return {\n \"h1\": selector.css('h1::text').get(),\n \"words\": selector.xpath('//h1/text()').re(r'\\w+'),\n \"urls\": selector.css('ul > li').xpath('.//@href').getall(),\n \"sample_jmespath_1\": selector.css('script::text').jmespath(\"a\").get(),\n \"sample_jmespath_2\": selector.css('script::text').jmespath(\"a\").getall()\n }\n\nprint(schema(text))\n# {'h1': 'Hello, Parsel!',\n# 'words': ['Hello', 'Parsel'],\n# 'urls': ['http://example.com', 'http://scrapy.org'],\n# 'sample_jmespath_1': 'b',\n# 'sample_jmespath_2': ['b', 'c']}\n```\n\n```python\nfrom scrape_schema import BaseSchema, Parsel\n\n\nclass Schema(BaseSchema):\n h1: str = Parsel().css('h1::text').get()\n words: list[str] = Parsel().xpath('//h1/text()').re(r'\\w+')\n urls: list[str] = Parsel().css('ul > li').xpath('.//@href').getall()\n sample_jmespath_1: str = Parsel().css('script::text').jmespath(\"a\").get()\n sample_jmespath_2: list[str] = 
Parsel().css('script::text').jmespath(\"a\").getall()\n\n\ntext = \"\"\"\n <html>\n <body>\n <h1>Hello, Parsel!</h1>\n <ul>\n <li><a href=\"http://example.com\">Link 1</a></li>\n <li><a href=\"http://scrapy.org\">Link 2</a></li>\n </ul>\n <script type=\"application/json\">{\"a\": [\"b\", \"c\"]}</script>\n </body>\n </html>\"\"\"\n\nprint(Schema(text).dict())\n# {'h1': 'Hello, Parsel!',\n# 'words': ['Hello', 'Parsel'],\n# 'urls': ['http://example.com', 'http://scrapy.org'],\n# 'sample_jmespath_1': 'b',\n# 'sample_jmespath_2': ['b', 'c']}\n```\n\nThe scrape_schema example output looks like the previous one, why do you need this library?\n\n- Easy to modify\n- Easy add additional methods\n- Easy to port to another project without having to deal with the logic and call stack\n- IDE typing support\n\nFor example, if you need to modify a parser, with `scrape_schema` it is easy and simple to do!\n\n```python\nfrom uuid import uuid4\nfrom datetime import datetime\n\nfrom scrape_schema import BaseSchema, Parsel, sc_param, Callback\nfrom scrape_schema.validator import markup_pre_validator\n\n\nclass Schema(BaseSchema):\n # invoke simple functions to fields output\n # add uuid4 id\n id: str = Callback(lambda: str(uuid4()))\n # add parse date\n date: str = Callback(lambda: str(datetime.today()))\n\n h1: str = Parsel().css('h1::text').get()\n # convert to upper case\n h1_upper: str = Parsel().css('h1::text').get().upper()\n # convert to lower case\n h1_lower: str = Parsel().css('h1::text').get().lower()\n\n words: list[str] = Parsel().xpath('//h1/text()').re(r'\\w+')\n # alt solution split words\n words_2: list[str] = Parsel().xpath('//h1/text()').get().split()\n # join result by ' - ' string\n words_join: str = Parsel().xpath('//h1/text()').re(r'\\w+').join(\" AND \")\n urls: list[str] = Parsel().css('ul > li').xpath('.//@href').getall()\n # replace http protocol to https\n urls_https: list[str] = Parsel().css('ul > li').xpath('.//@href').getall().replace(\"http://\", \"https://\")\n # you can modify output keys\n sample_jmespath_1: str = Parsel(alias=\"jsn1\").css('script::text').jmespath(\"a\").get()\n sample_jmespath_2: list[str] = Parsel(alias=\"class\").css('script::text').jmespath(\"a\").getall()\n\n # or calc json count values\n jsn_len: int = Parsel(auto_type=False).css('script::text').jmespath(\"a\").getall().count()\n\n # pre validation markup input before parse text.\n # if text from first h1 element != 'Hello, Parsel!' 
- throw `SchemaPreValidationError` exception\n @markup_pre_validator()\n def validate_markup(self) -> bool:\n return self.__selector__.css('h1::text').get() == 'Hello, Parsel!'\n\n # or create fields with custom rule\n @sc_param\n def custom(self) -> str:\n return \"hello world!\"\n\n # you can add extra methods!\n def parse_urls(self):\n for url in self.urls:\n print(f\"parse {url}\")\n\n\ntext = \"\"\"\n <html>\n <body>\n <h1>Hello, Parsel!</h1>\n <ul>\n <li><a href=\"http://example.com\">Link 1</a></li>\n <li><a href=\"http://scrapy.org\">Link 2</a></li>\n </ul>\n <script type=\"application/json\">{\"a\": [\"b\", \"c\"]}</script>\n </body>\n </html>\"\"\"\n\nschema = Schema(text)\n# invoke custom method\nschema.parse_urls()\n# parse http://example.com\n# parse http://scrapy.org\n\nprint(schema.dict())\n\n# !!!field from @sc_param decorator\n# vvvvvvvvvvvvvvvvvvvvvv\n# {'custom': 'hello world!',\n# !!!simple functions callbacks output\n# vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv\n# 'id': '6b66de7b-5b5f-445a-b8a7-3b17332c1ff5',\n# 'date': '2023-09-29 18:47:03.638941',\n# 'h1': 'Hello, Parsel!', 'h1_upper': 'HELLO, PARSEL!', 'h1_lower': 'hello, parsel!',\n# 'words': ['Hello', 'Parsel'], 'words_2': ['Hello,', 'Parsel!'],\n# 'words_join': 'Hello AND Parsel',\n# 'urls': ['http://example.com', 'http://scrapy.org'],\n# 'urls_https': ['https://example.com', 'https://scrapy.org'],\n# !!!changed key alias 'sample_jmespath_2' TO 'class'!!!\n# vvvvvvvvvvvvvvvvv\n# 'jsn1': 'b', 'class': ['b', 'c'], 'jsn_len': 2}\n\n```\n\nSee more [examples](examples) and [documentation](https://scrape-schema.readthedocs.io/en/latest/)\nfor get more information/examples\n____\nThis project is licensed under the terms of the MIT license.\n",
"bugtrack_url": null,
"license": null,
"summary": "A library for converting any text (xml, html, plain text, stdout, etc) to python datatypes",
"version": "0.6.3",
"project_urls": {
"Documentation": "https://github.com/vypivshiy/scrape-schema#readme",
"Examples": "https://github.com/vypivshiy/scrape-schema/examples",
"Issues": "https://github.com/vypivshiy/scrape-schema/issues",
"Source": "https://github.com/vypivshiy/scrape-schema"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "11a760ad88755e39cc2d21333de2652057529a0898f6acb3c0bc4b4c7c96c7b7",
"md5": "efe6199030dd44311ca02e42dd152ed1",
"sha256": "9fcd606644551199254b7421116f8de54d94938161f9292542b74e042b2fb221"
},
"downloads": -1,
"filename": "scrape_schema-0.6.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "efe6199030dd44311ca02e42dd152ed1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 30212,
"upload_time": "2023-10-08T16:10:03",
"upload_time_iso_8601": "2023-10-08T16:10:03.664166Z",
"url": "https://files.pythonhosted.org/packages/11/a7/60ad88755e39cc2d21333de2652057529a0898f6acb3c0bc4b4c7c96c7b7/scrape_schema-0.6.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "dbfd57eb544d074b5d2f57b3d29415ed8c3a8840e93b898513a0736be53d5f59",
"md5": "77f928fcfc8a3fe7065d6d205db2d1ea",
"sha256": "6490bce9cb84948b679b02284aa3fdebb0bb9147f3fc44b77d4a1386fec772cf"
},
"downloads": -1,
"filename": "scrape_schema-0.6.3.tar.gz",
"has_sig": false,
"md5_digest": "77f928fcfc8a3fe7065d6d205db2d1ea",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 23975,
"upload_time": "2023-10-08T16:10:00",
"upload_time_iso_8601": "2023-10-08T16:10:00.367649Z",
"url": "https://files.pythonhosted.org/packages/db/fd/57eb544d074b5d2f57b3d29415ed8c3a8840e93b898513a0736be53d5f59/scrape_schema-0.6.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-08 16:10:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "vypivshiy",
"github_project": "scrape-schema#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "scrape-schema"
}