| Field | Value |
|-----------------|-----------------------------------------------------------------------|
| Name | ssc-codegen |
| Version | 0.4.3 |
| home_page | None |
| Summary | generate web scrapers structures by dsl-like language based on python |
| upload_time | 2024-12-24 15:41:03 |
| maintainer | None |
| docs_url | None |
| author | vypivshiy |
| requires_python | >=3.10 |
| license | None |
| requirements | No requirements were recorded. |
# Selector Schema Codegen
[RU](README_RU.md) [EN](README.md)
ssc_codegen is a generator for HTML parsers in various programming languages.
# Why?
- For convenient development of web scrapers, unofficial API interfaces, and CI/CD integration
- Support for API interfaces in various programming languages and parsing libraries
- Easy to configure and read
- Automatic documentation: usage notes and the signature of the parsed structure are generated
- Portability: generated parsers are not tied to a specific project and can be reused
- Simple syntax
# Features
- Declarative style: describe WHAT you want to do, not HOW to program it
- Standardization: the generated code has minimal dependencies
- Ability to rebuild in other programming languages
- CSS, XPath, regex, minimal string formatting operations
- Field validation, CSS/XPath/regex expressions
- Documentation transfer into the generated code
- Conversion of CSS to XPath queries
## Install
### pipx
```shell
pipx install ssc_codegen
```
### pip
```shell
pip install ssc_codegen
```
## Usage
See [examples](examples)
## Supported Libraries and Programming Languages
| Language | Library | XPath Support | CSS Support | Formatter |
|----------|--------------------------------------------------------------|---------------|-------------|-----------------------|
| Python | bs4 | NO | YES | ruff |
| - | parsel | YES | YES | - |
| - | selectolax (modest) | NO | YES | - |
| - | scrapy (based on parsel, but class init argument - Response) | YES | YES | - |
| Dart | universal_html | NO | YES | dart format, dart fix |
| Go | goquery | NO | YES | go fmt, go fix |
### Recommendations
- To find effective CSS selectors quickly, use **any** Chromium-based browser
with the [SelectorGadget](https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)
extension.
- Prefer CSS selectors: they are **guaranteed** to convert to XPath.
- For maximum support across most programming languages, use simple queries for the following reasons:
  - Some libraries do not support the full CSS specification (**even CSS 2.0** is not fully supported).
    For example, the selector `#product_description+ p` works in `python.parsel` and `javascript pure`,
    but not in the `dart.universal_html` and `selectolax` libraries.
  - There is an XPath to CSS converter, but it is not guaranteed to work. For example, CSS has no equivalent to XPath's `contains`.
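To make the CSS to XPath direction concrete, here is a minimal sketch of a converter for a tiny CSS subset (tag, `#id`, `.class`, and the descendant combinator). It is an illustration only, not the converter shipped with ssc_codegen; the function name `simple_css_to_xpath` is hypothetical. Anything outside the subset, such as the `+` sibling combinator mentioned above, is rejected.

```python
import re

# One compound selector: optional tag, optional #id, zero or more .class parts.
_COMPOUND = re.compile(r"(?P<tag>[\w-]+)?(?:#(?P<id>[\w-]+))?(?P<classes>(?:\.[\w-]+)*)")


def simple_css_to_xpath(selector: str) -> str:
    """Convert a very small CSS subset to XPath (illustrative, hypothetical helper)."""
    steps = []
    for part in selector.split():  # whitespace = descendant combinator
        match = _COMPOUND.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported selector part: {part!r}")
        predicates = []
        if match["id"]:
            predicates.append(f"@id='{match['id']}'")
        for cls in filter(None, match["classes"].split(".")):
            # Match a whole class token, not a substring of another class name.
            predicates.append(
                f"contains(concat(' ', normalize-space(@class), ' '), ' {cls} ')"
            )
        steps.append((match["tag"] or "*") + "".join(f"[{p}]" for p in predicates))
    if not steps:
        raise ValueError("empty selector")
    return "//" + "//".join(steps)


print(simple_css_to_xpath("div#main a"))
```

Note that the class predicate relies on XPath's `contains`, which has no CSS counterpart; this is one reason the reverse conversion cannot be guaranteed.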
### How to Read Schema Code
Before reading, make sure you are familiar with:
- CSS selectors
- XPath selectors
- Regular expressions
### Shortcuts
Variable notations in the code:
- D() — mark a `Document`/`Element` object
- N() — mark operations with nested structures
- R() — shortcut for `D().raw()`. Useful if you only need operations with regular expressions and strings, not with selectors
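The markers describe a chain of operations rather than executing it. The toy recorder below sketches that idea: each call is stored as a step that a generator could later translate into target-language code. The `Chain` class is a hypothetical stand-in written for this illustration, and the `D` and `R` functions here only mimic the shortcuts; none of this is the library's implementation.

```python
class Chain:
    """Hypothetical stand-in for a marker object: records steps, runs nothing."""

    def __init__(self, start: str) -> None:
        self.steps: list[tuple[str, tuple]] = [(start, ())]

    def __getattr__(self, name: str):
        # Only called for attributes that do not exist, i.e. operator names.
        if name.startswith("_"):
            raise AttributeError(name)

        def record(*args):
            self.steps.append((name, args))
            return self  # returning self is what makes chaining possible

        return record


def D() -> Chain:
    """Toy version of the document marker."""
    return Chain("document")


def R() -> Chain:
    """Toy version of the shortcut described above: D().raw()."""
    return D().raw()


field = D().css("a").attr("href").ltrim("//")
for name, args in field.steps:
    print(name, args)
```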
### Built-in Schemas
#### ItemSchema
Parses the structure according to the rule `{<key1> = <value1>, <key2> = <value2>, ...}`, returns a hash table.
#### DictSchema
Parses the structure according to the rule `{<key1> = <value1>, <key2> = <value2>, ...}`, returns a hash table.
#### ListSchema
Parses the structure according to the rule `[{<key1> = <value1>, <key2> = <value2>, ...}, {<key1> = <value1>, <key2> = <value2>, ...}]`, returns a list of hash tables.
#### FlattenListSchema
Parses the structure according to the rule `[<item1>, <item2>, ...]`, returns a list of objects.
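The result shapes described above can be pictured as plain Python values. The field names and values below are made up for illustration; only the container shapes follow the descriptions. Judging by the magic methods `__KEY__` and `__VALUE__`, `DictSchema` presumably takes its keys from the document, while `ItemSchema` uses the field names declared in the schema, but that reading is an assumption.

```python
# Hypothetical results; only the container shapes matter here.
item_result = {"title": "Dune", "url": "/b/1"}   # ItemSchema: one hash table
dict_result = {"color": "red", "size": "XL"}     # DictSchema: one hash table
list_result = [                                  # ListSchema: list of hash tables
    {"title": "Dune", "url": "/b/1"},
    {"title": "Emma", "url": "/b/2"},
]
flat_result = ["/b/1", "/b/2"]                   # FlattenListSchema: list of objects


def shape(value) -> str:
    """Name the container shape of a parsed result."""
    if isinstance(value, dict):
        return "hash table"
    if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        return "list of hash tables"
    if isinstance(value, list):
        return "list of objects"
    raise TypeError(f"unexpected result type: {type(value).__name__}")


for result in (item_result, dict_result, list_result, flat_result):
    print(shape(result))
```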
### Types
Currently, there are 5 types:
| TYPE | DESCRIPTION |
|---------------|-----------------------------------------------------------|
| DOCUMENT | 1 element/object of the document. Always the first argument in the field |
| LIST_DOCUMENT | Collection of elements |
| STRING | Tag string/attribute/tag text |
| LIST_STRING | Collection of strings/attributes/text |
| NESTED | Result of a nested schema (see `N()` and `sub_parser`) |
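One way to read the types is as a small state machine: each operator accepts some types and returns another. The sketch below encodes a few transitions taken from the operator table in this README and checks a chain of operator names against them. The `TRANSITIONS` table and `result_type` function are hypothetical helpers for illustration, and the assumption that `text` and `attr` turn LIST_DOCUMENT into LIST_STRING is inferred from the table, not confirmed by the library.

```python
# (operator, input type) -> output type; a partial, illustrative table.
TRANSITIONS = {
    ("css", "DOCUMENT"): "DOCUMENT",
    ("css_all", "DOCUMENT"): "LIST_DOCUMENT",
    ("text", "DOCUMENT"): "STRING",
    ("text", "LIST_DOCUMENT"): "LIST_STRING",
    ("attr", "DOCUMENT"): "STRING",
    ("attr", "LIST_DOCUMENT"): "LIST_STRING",
    ("index", "LIST_DOCUMENT"): "DOCUMENT",
    ("index", "LIST_STRING"): "STRING",
    ("join", "LIST_STRING"): "STRING",
}


def result_type(ops: list[str], start: str = "DOCUMENT") -> str:
    """Follow a chain of operator names and return the final type."""
    current = start
    for op in ops:
        try:
            current = TRANSITIONS[(op, current)]
        except KeyError:
            raise TypeError(f"{op} does not accept {current}") from None
    return current


print(result_type(["css_all", "text", "join"]))
```

A chain such as `["css", "join"]` fails in this sketch because `join` only accepts LIST_STRING.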
### Magic Methods
- `__SPLIT_DOC__` - splits the document into elements for easier parsing
- `__PRE_VALIDATE__` - pre-validation of the document using `assert`. Throws an error if validation fails
- `__KEY__`, `__VALUE__` - magic methods for initializing `DictSchema` structure
- `__ITEM__` - magic method for initializing `FlattenListSchema` structure
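The roles of these magic methods can be sketched in plain Python. In the example below, regular expressions stand in for selectors and every function name is hypothetical; it shows the order of work (validate, split, parse each part), not the code that ssc_codegen generates.

```python
import re

HTML = """
<title>Books</title>
<ul>
  <li class="book"><a href="/b/1">Dune</a></li>
  <li class="book"><a href="/b/2">Emma</a></li>
</ul>
"""


def pre_validate(doc: str) -> None:
    # Role of __PRE_VALIDATE__: fail fast if the page is not the expected one.
    assert re.search(r"<title>Books</title>", doc), "unexpected document"


def split_doc(doc: str) -> list[str]:
    # Role of __SPLIT_DOC__: cut the document into elements parsed one by one.
    return re.findall(r'<li class="book">.*?</li>', doc, flags=re.S)


def parse_part(part: str) -> dict[str, str]:
    # Ordinary fields, evaluated against a single element.
    match = re.search(r'<a href="([^"]+)">([^<]+)</a>', part)
    assert match is not None, "element has no link"
    return {"url": match[1], "name": match[2]}


def parse_list(doc: str) -> list[dict[str, str]]:
    pre_validate(doc)
    return [parse_part(part) for part in split_doc(doc)]


def parse_dict(doc: str) -> dict[str, str]:
    # Role of __KEY__ / __VALUE__: one key and one value per element.
    return {row["name"]: row["url"] for row in parse_list(doc)}


def parse_flat(doc: str) -> list[str]:
    # Role of __ITEM__: one item per element.
    return [row["name"] for row in parse_list(doc)]


print(parse_list(HTML))
```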
### Operators
| Method | Accepts | Returns | Example | | Description |
|-------------------|---------------|--------------------|----------------------------------------------------------|:--|--------------------------------------------------------------------------------------------|
| default(None/str) | None/str | DOCUMENT | `D().default(None)` | | Default value if an error occurs. Must be the first operator in the chain |
| sub_parser | Schema | - | `N().sub_parser(Books)` | | Passes the document/element to another parser object. Returns the obtained result |
| css | CSS query | DOCUMENT | `D().css('a')` | | Returns the first found element of the selector result |
| xpath | XPATH query | DOCUMENT | `D().xpath('//a')` | | Returns the first found element of the selector result |
| css_all | CSS query | LIST_DOCUMENT | `D().css_all('a')` | | Returns all elements of the selector result |
| xpath_all | XPATH query | LIST_DOCUMENT | `D().xpath_all('//a')` | | Returns all elements of the selector result |
| raw | | STRING/LIST_STRING | `D().raw()` | | Returns the raw HTML of the document/element. Works with DOCUMENT, LIST_DOCUMENT |
| text | | STRING/LIST_STRING | `D().css('title').text()` | | Returns the text from the HTML document/element. Works with DOCUMENT, LIST_DOCUMENT |
| attr | ATTR-NAME | STRING/LIST_STRING | `D().css('a').attr('href')` | | Returns the attribute from the HTML tag. Works with DOCUMENT, LIST_DOCUMENT |
| trim | str | STRING/LIST_STRING | `R().trim('<body>')` | | Trims the string from the LEFT and RIGHT. Works with STRING, LIST_STRING |
| ltrim | str | STRING/LIST_STRING | `D().css('a').attr('href').ltrim('//')` | | Trims the string from the LEFT. Works with STRING, LIST_STRING |
| rtrim | str | STRING/LIST_STRING | `D().css('title').text().rtrim(' ')` | | Trims the string from the RIGHT. Works with STRING, LIST_STRING |
| replace/repl | old, new | STRING/LIST_STRING | `D().css('a').attr('href').repl('//', 'https://')` | | Replaces the string. Works with STRING, LIST_STRING |
| format/fmt | template | STRING/LIST_STRING | `D().css('title').text().fmt("title: {{}}")` | | Formats the string according to the template. Must have the `{{}}` marker. Works with STRING, LIST_STRING |
| re | pattern | STRING/LIST_STRING | `D().css('title').text().re('(\w+)')` | | Finds the first matching result of the regex pattern. Works with STRING, LIST_STRING |
| re_all | pattern | LIST_STRING | `D().css('title').text().re_all('(\w+)')` | | Finds all matching results of the regex pattern. Works with STRING |
| re_sub | pattern, repl | STRING/LIST_STRING | `D().css('title').text().re_sub('(\w+)', 'wow')` | | Replaces the string according to the regex pattern. Works with STRING, LIST_STRING |
| index | int | STRING/DOCUMENT | `D().css_all('a').index(0)` | | Takes the element by index. Works with LIST_DOCUMENT, LIST_STRING |
| first | | - | `D().css_all('a').first` | | Alias for index(0) |
| last | | - | `D().css_all('a').last` | | Alias for index(-1), or an equivalent where negative indexes are not available |
| join | sep | STRING | `D().css_all('a').text().join(', ')` | | Collects the collection into a string. Works with LIST_STRING |
| assert_in | str | NONE | `D().css_all('a').attr('href').assert_in('example.com')` | | Checks if the string is in the collection. The checked argument must be LIST_STRING |
| assert_re | pattern | NONE | `D().css('a').attr('href').assert_re('example.com')` | | Checks if the regex pattern is found. The checked argument must be STRING |
| assert_css | CSS query | NONE | `D().assert_css('title')` | | Checks the element by CSS. The checked argument must be DOCUMENT |
| assert_xpath | XPATH query | NONE | `D().assert_xpath('//title')` | | Checks the element by XPath. The checked argument must be DOCUMENT |
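As a rough guide to reading a chain, the sketch below shows what `D().css_all('a').attr('href').repl('//', 'https://').join(', ')` and `D().default(None).css_all('a').attr('href').first` might amount to in plain Python, using only the standard library. The helper names are hypothetical and the generated code for any real backend will look different.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags: css_all('a').attr('href')."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_hrefs(html: str) -> list[str]:
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs  # LIST_STRING


def hrefs_joined(html: str) -> str:
    hrefs = [h.replace("//", "https://") for h in collect_hrefs(html)]  # repl
    return ", ".join(hrefs)  # join: LIST_STRING -> STRING


def first_href(html: str, default=None):
    # default(...): fall back to a fixed value if a later step fails.
    try:
        return collect_hrefs(html)[0]  # first, i.e. index(0)
    except IndexError:
        return default


PAGE = '<a href="//example.com/a">A</a><a href="//example.com/b">B</a>'
print(hrefs_joined(PAGE))
```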
Raw data
{
"_id": null,
"home_page": null,
"name": "ssc-codegen",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": "vypivshiy",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/15/9e/894564691632fe82c072711e383b39989ed6fe79f6c185e5bb14f6c5b402/ssc_codegen-0.4.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "generate web scrapers structures by dsl-like language based on python",
"version": "0.4.3",
"project_urls": {
"Documentation": "https://github.com/vypivshiy/selector_schema_codegen#readme",
"Examples": "https://github.com/vypivshiy/selector_schema_codegen/examples",
"Issues": "https://github.com/vypivshiy/selector_schema_codegen/issues",
"Source": "https://github.com/vypivshiy/selector_schema_codegen"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "6dcf88c940c353b0e2392eb8f903ec0e63a6bbf26c3eda820dfdc69e4847db37",
"md5": "d85e4cce724159a4433f657f1d585343",
"sha256": "93a591007614423c764133092d503d721f1da00876199377a987a9d89b163003"
},
"downloads": -1,
"filename": "ssc_codegen-0.4.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d85e4cce724159a4433f657f1d585343",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 52304,
"upload_time": "2024-12-24T15:41:04",
"upload_time_iso_8601": "2024-12-24T15:41:04.966508Z",
"url": "https://files.pythonhosted.org/packages/6d/cf/88c940c353b0e2392eb8f903ec0e63a6bbf26c3eda820dfdc69e4847db37/ssc_codegen-0.4.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "159e894564691632fe82c072711e383b39989ed6fe79f6c185e5bb14f6c5b402",
"md5": "f282e6e581c0a2b43b88a74ae209f332",
"sha256": "a97c6fff6ce5ede2ce9c899aa655ce20471d5106630fe19ff4470e1e8dc71d2e"
},
"downloads": -1,
"filename": "ssc_codegen-0.4.3.tar.gz",
"has_sig": false,
"md5_digest": "f282e6e581c0a2b43b88a74ae209f332",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 56705,
"upload_time": "2024-12-24T15:41:03",
"upload_time_iso_8601": "2024-12-24T15:41:03.473821Z",
"url": "https://files.pythonhosted.org/packages/15/9e/894564691632fe82c072711e383b39989ed6fe79f6c185e5bb14f6c5b402/ssc_codegen-0.4.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-24 15:41:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "vypivshiy",
"github_project": "selector_schema_codegen#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "ssc-codegen"
}