ssc-codegen

| Field           | Value                                                                 |
|-----------------|-----------------------------------------------------------------------|
| Name            | ssc-codegen                                                           |
| Version         | 0.4.3                                                                 |
| Summary         | generate web scrapers structures by dsl-like language based on python |
| Upload time     | 2024-12-24 15:41:03                                                   |
| Author          | vypivshiy                                                             |
| Requires Python | >=3.10                                                                |
| License         | None                                                                  |
| Requirements    | No requirements were recorded.                                        |

# Selector Schema Codegen

[RU](README_RU.md) [EN](README.md)

ssc_codegen is a generator for HTML parsers in various programming languages.

# Why?

- Convenient development of web scrapers, unofficial API interfaces, and CI/CD integrations
- Support for API interfaces in various programming languages and parsing libraries
- Easy to configure and read
- Automatic documentation: the generated code describes how to use it and includes the signature of the parsed structure
- Portability: generated parsers are not tied to a specific project and can be reused
- Simple syntax

# Features

- Declarative style: describe WHAT you want to do, not HOW to program it
- Standardization: the generated code has minimal dependencies
- Schemas can be regenerated for other programming languages
- CSS, XPath, regex, and minimal string formatting operations
- Field validation with CSS, XPath, and regex expressions
- Documentation is carried over into the generated code
- Conversion of CSS queries to XPath

## Install

### pipx

```shell
pipx install ssc_codegen
```

### pip

```shell
pip install ssc_codegen
```

## Usage

See [examples](examples)

## Supported Libraries and Programming Languages

| Language | Library                                                      | XPath Support | CSS Support | Formatter             |
|----------|--------------------------------------------------------------|---------------|-------------|-----------------------|
| Python   | bs4                                                          | NO            | YES         | ruff                  |
| -        | parsel                                                       | YES           | YES         | -                     |
| -        | selectolax (modest)                                          | NO            | YES         | -                     |
| -        | scrapy (based on parsel, but class init argument - Response) | YES           | YES         | -                     |
| Dart     | universal_html                                               | NO            | YES         | dart format, dart fix |
| Go       | goquery                                                      | NO            | YES         | go fmt, go fix        |

### Recommendations

- To find effective CSS selectors quickly, use **any** Chromium-based browser with the [SelectorGadget](https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) extension.
- Prefer CSS selectors: they are **guaranteed** to convert to XPath.
- For maximum support across programming languages, keep queries simple:
    - Some libraries do not support the full CSS specification (**not even CSS 2.0** in full).
      For example, the selector `#product_description+ p` works in `python.parsel` and `javascript pure`,
      but not in the `dart.universal_html` and `selectolax` libraries.
- There is an XPath to CSS converter, but it is not guaranteed to work for every query. For example, CSS has no equivalent to `contains` from XPath.
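
To illustrate why simple CSS selectors translate to XPath reliably, here is a minimal sketch of such a translation in plain Python. It is not the converter used by ssc_codegen; the function names `compound_to_xpath` and `css_to_xpath` are hypothetical, and only tag names, `#id`, `.class`, and the descendant and child combinators are handled. Anything else (for example the `+` combinator mentioned above) is rejected.

```python
import re

# One compound selector: optional tag (or *), then any number of #id / .class parts.
_COMPOUND = re.compile(r"^(?P<tag>[A-Za-z][\w-]*|\*)?(?P<rest>(?:[#.][\w-]+)*)$")
_PART = re.compile(r"([#.])([\w-]+)")


def compound_to_xpath(compound: str) -> str:
    """Translate e.g. 'div.item' into an XPath node test with predicates."""
    match = _COMPOUND.match(compound)
    if match is None or not compound:
        raise ValueError(f"unsupported selector part: {compound!r}")
    xpath = match.group("tag") or "*"
    for kind, name in _PART.findall(match.group("rest")):
        if kind == "#":
            xpath += f"[@id='{name}']"
        else:
            # Match one class token inside a space-separated class attribute.
            xpath += (
                "[contains(concat(' ', normalize-space(@class), ' '), "
                f"' {name} ')]"
            )
    return xpath


def css_to_xpath(selector: str) -> str:
    """Translate a simple CSS selector (descendant and child combinators only)."""
    xpath, axis = "", "//"
    for token in selector.replace(">", " > ").split():
        if token == ">":
            axis = "/"  # child combinator: the next step is a direct child
            continue
        xpath += axis + compound_to_xpath(token)
        axis = "//"  # default back to the descendant axis
    return xpath


print(css_to_xpath("#main p"))
print(css_to_xpath("ul > li.book"))
```

A selector such as `h1 + p` raises `ValueError` in this sketch, which mirrors the advice above: the simpler the query, the more targets can support it.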

### How to Read Schema Code

Before reading, make sure you are familiar with:

- CSS selectors
- XPath selectors
- Regular expressions

### Shortcuts

Variable notations in the code:

- `D()`: marks a `Document`/`Element` object
- `N()`: marks operations with nested structures
- `R()`: shortcut for `D().raw()`. Useful if you only need operations with regular expressions and strings, not with selectors

### Built-in Schemas

#### ItemSchema

Parses the structure according to the rule `{<key1> = <value1>, <key2> = <value2>, ...}`, returns a hash table.

#### DictSchema

Parses the structure according to the rule `{<key1> = <value1>, <key2> = <value2>, ...}`, returns a hash table. The structure is initialized through the `__KEY__` and `__VALUE__` magic methods.

#### ListSchema

Parses the structure according to the rule `[{<key1> = <value1>, <key2> = <value2>, ...}, {<key1> = <value1>, <key2> = <value2>, ...}]`, returns a list of hash tables.

#### FlattenListSchema

Parses the structure according to the rule `[<item1>, <item2>, ...]`, returns a list of objects.
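
The four result shapes described above can be pictured as plain Python values. This is only an illustration of the container shapes; the keys and values are hypothetical sample data, not output produced by ssc_codegen.

```python
# ItemSchema: one hash table.
item_result = {"title": "Example Domain", "url": "https://example.com"}

# DictSchema: also one hash table.
dict_result = {"price": "10.50", "stock": "In stock"}

# ListSchema: a list of hash tables with the same keys.
list_result = [
    {"name": "Book A", "price": "10.50"},
    {"name": "Book B", "price": "7.25"},
]

# FlattenListSchema: a flat list of items.
flat_list_result = ["python", "scraping", "css"]
```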


### Types

Currently, there are 5 types:

| TYPE          | DESCRIPTION                                                                     |
|---------------|---------------------------------------------------------------------------------|
| DOCUMENT      | A single element/object of the document. Always the first argument in the field |
| LIST_DOCUMENT | Collection of elements                                                          |
| STRING        | Tag string/attribute/tag text                                                   |
| LIST_STRING   | Collection of strings/attributes/text                                           |
| NESTED        | Result of a nested schema (see `N()` and `sub_parser`)                          |


### Magic Methods

- `__SPLIT_DOC__` - splits the document into elements for easier parsing
- `__PRE_VALIDATE__` - pre-validation of the document using `assert`. Throws an error if validation fails
- `__KEY__`, `__VALUE__` - magic methods for initializing `DictSchema` structure
- `__ITEM__` - magic method for initializing `FlattenListSchema` structure
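
The roles of `__PRE_VALIDATE__` and `__SPLIT_DOC__` can be sketched as a flow in plain Python. This is a conceptual illustration only, not the code ssc_codegen generates: regular expressions stand in for selectors, and the sample HTML and the function names `pre_validate`, `split_doc`, `parse_part`, and `parse` are hypothetical.

```python
import re

HTML = """
<title>Books</title>
<ul>
  <li class="book"><a href="/a">Book A</a></li>
  <li class="book"><a href="/b">Book B</a></li>
</ul>
"""


def pre_validate(doc: str) -> None:
    # Role of __PRE_VALIDATE__: fail fast if this is not the expected page.
    assert re.search(r"<title>Books</title>", doc), "unexpected document"


def split_doc(doc: str) -> list[str]:
    # Role of __SPLIT_DOC__: cut the document into one fragment per record.
    return re.findall(r'<li class="book">.*?</li>', doc, flags=re.S)


def parse_part(part: str) -> dict[str, str]:
    # Ordinary fields are then evaluated against each fragment.
    url, name = re.search(r'<a href="([^"]+)">([^<]+)</a>', part).groups()
    return {"name": name, "url": url}


def parse(doc: str) -> list[dict[str, str]]:
    pre_validate(doc)
    return [parse_part(part) for part in split_doc(doc)]


print(parse(HTML))
```

Validation runs once on the whole document, before any splitting, so a wrong page is rejected before field parsing starts.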

### Operators

| Method            | Accepts       | Returns            | Example                                                  | Description                                                                                               |
|-------------------|---------------|--------------------|----------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| default(None/str) | None/str      | DOCUMENT           | `D().default(None)`                                      | Default value returned if an error occurs. Must be the first call in the chain                            |
| sub_parser        | Schema        | -                  | `N().sub_parser(Books)`                                  | Passes the document/element to another parser object. Returns the obtained result                         |
| css               | CSS query     | DOCUMENT           | `D().css('a')`                                           | Returns the first found element of the selector result                                                    |
| xpath             | XPATH query   | DOCUMENT           | `D().xpath('//a')`                                       | Returns the first found element of the selector result                                                    |
| css_all           | CSS query     | LIST_DOCUMENT      | `D().css_all('a')`                                       | Returns all elements of the selector result                                                               |
| xpath_all         | XPATH query   | LIST_DOCUMENT      | `D().xpath_all('//a')`                                   | Returns all elements of the selector result                                                               |
| raw               |               | STRING/LIST_STRING | `D().raw()`                                              | Returns the raw HTML of the document/element. Works with DOCUMENT, LIST_DOCUMENT                          |
| text              |               | STRING/LIST_STRING | `D().css('title').text()`                                | Returns the text from the HTML document/element. Works with DOCUMENT, LIST_DOCUMENT                       |
| attr              | ATTR-NAME     | STRING/LIST_STRING | `D().css('a').attr('href')`                              | Returns the attribute from the HTML tag. Works with DOCUMENT, LIST_DOCUMENT                               |
| trim              | str           | STRING/LIST_STRING | `R().trim('<body>')`                                     | Trims the string from the LEFT and RIGHT. Works with STRING, LIST_STRING                                  |
| ltrim             | str           | STRING/LIST_STRING | `D().css('a').attr('href').ltrim('//')`                  | Trims the string from the LEFT. Works with STRING, LIST_STRING                                            |
| rtrim             | str           | STRING/LIST_STRING | `D().css('title').text().rtrim(' ')`                     | Trims the string from the RIGHT. Works with STRING, LIST_STRING                                           |
| replace/repl      | old, new      | STRING/LIST_STRING | `D().css('a').attr('href').repl('//', 'https://')`       | Replaces the substring `old` with `new`. Works with STRING, LIST_STRING                                   |
| format/fmt        | template      | STRING/LIST_STRING | `D().css('title').text().fmt("title: {{}}")`             | Formats the string according to the template. Must have the `{{}}` marker. Works with STRING, LIST_STRING |
| re                | pattern       | STRING/LIST_STRING | `D().css('title').text().re('(\w+)')`                    | Finds the first matching result of the regex pattern. Works with STRING, LIST_STRING                      |
| re_all            | pattern       | LIST_STRING        | `D().css('title').text().re_all('(\w+)')`                | Finds all matching results of the regex pattern. Works with STRING                                        |
| re_sub            | pattern, repl | STRING/LIST_STRING | `D().css('title').text().re_sub('(\w+)', 'wow')`         | Replaces the string according to the regex pattern. Works with STRING, LIST_STRING                        |
| index             | int           | STRING/DOCUMENT    | `D().css_all('a').index(0)`                              | Takes the element by index. Works with LIST_DOCUMENT, LIST_STRING                                         |
| first             |               | -                  | `D().css_all('a').first`                                 | Alias for `index(0)`                                                                                      |
| last              |               | -                  | `D().css_all('a').last`                                  | Alias for `index(-1)`, or an equivalent implementation of a negative index                                |
| join              | sep           | STRING             | `D().css_all('a').text().join(', ')`                     | Joins the collection into a string. Works with LIST_STRING                                                |
| assert_in         | str           | NONE               | `D().css_all('a').attr('href').assert_in('example.com')` | Checks if the string is in the collection. The checked argument must be LIST_STRING                       |
| assert_re         | pattern       | NONE               | `D().css('a').attr('href').assert_re('example.com')`     | Checks if the regex pattern is found. The checked argument must be STRING                                 |
| assert_css        | CSS query     | NONE               | `D().assert_css('title')`                                | Checks the element by CSS. The checked argument must be DOCUMENT                                          |
| assert_xpath      | XPATH query   | NONE               | `D().assert_xpath('//title')`                            | Checks the element by XPath. The checked argument must be DOCUMENT                                        |
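
Each operator accepts certain types and returns another, so a chain can be checked before any code is generated. The sketch below models that idea in plain Python for a few rows of the table. It is an illustration under assumptions, not the internals of ssc_codegen: the `Chain` class, its `op` method, and the `TRANSITIONS` mapping are hypothetical names.

```python
from dataclasses import dataclass, field

# (operator, input type) -> output type, taken from a few rows of the table.
TRANSITIONS = {
    ("css", "DOCUMENT"): "DOCUMENT",
    ("css_all", "DOCUMENT"): "LIST_DOCUMENT",
    ("text", "DOCUMENT"): "STRING",
    ("text", "LIST_DOCUMENT"): "LIST_STRING",
    ("attr", "DOCUMENT"): "STRING",
    ("attr", "LIST_DOCUMENT"): "LIST_STRING",
    ("index", "LIST_DOCUMENT"): "DOCUMENT",
    ("index", "LIST_STRING"): "STRING",
    ("join", "LIST_STRING"): "STRING",
}


@dataclass
class Chain:
    """Records operator calls and tracks the type flowing through the chain."""

    type: str = "DOCUMENT"
    ops: list = field(default_factory=list)

    def op(self, name: str, *args) -> "Chain":
        key = (name, self.type)
        if key not in TRANSITIONS:
            raise TypeError(f"{name}() does not accept {self.type}")
        # Return a new chain so earlier steps can be reused safely.
        return Chain(TRANSITIONS[key], self.ops + [(name, args)])


# Counterpart of D().css_all('a').attr('href').join(', ')
links = Chain().op("css_all", "a").op("attr", "href").op("join", ", ")
print(links.type, [name for name, _ in links.ops])
```

In this model, calling `join` directly after `css` raises `TypeError`, because `join` works only with LIST_STRING. This is the kind of mistake the "Works with" notes in the table are meant to prevent.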

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ssc-codegen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "vypivshiy",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/15/9e/894564691632fe82c072711e383b39989ed6fe79f6c185e5bb14f6c5b402/ssc_codegen-0.4.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "generate web scrapers structures by dsl-like language based on python",
    "version": "0.4.3",
    "project_urls": {
        "Documentation": "https://github.com/vypivshiy/selector_schema_codegen#readme",
        "Examples": "https://github.com/vypivshiy/selector_schema_codegen/examples",
        "Issues": "https://github.com/vypivshiy/selector_schema_codegen/issues",
        "Source": "https://github.com/vypivshiy/selector_schema_codegen"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6dcf88c940c353b0e2392eb8f903ec0e63a6bbf26c3eda820dfdc69e4847db37",
                "md5": "d85e4cce724159a4433f657f1d585343",
                "sha256": "93a591007614423c764133092d503d721f1da00876199377a987a9d89b163003"
            },
            "downloads": -1,
            "filename": "ssc_codegen-0.4.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d85e4cce724159a4433f657f1d585343",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 52304,
            "upload_time": "2024-12-24T15:41:04",
            "upload_time_iso_8601": "2024-12-24T15:41:04.966508Z",
            "url": "https://files.pythonhosted.org/packages/6d/cf/88c940c353b0e2392eb8f903ec0e63a6bbf26c3eda820dfdc69e4847db37/ssc_codegen-0.4.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "159e894564691632fe82c072711e383b39989ed6fe79f6c185e5bb14f6c5b402",
                "md5": "f282e6e581c0a2b43b88a74ae209f332",
                "sha256": "a97c6fff6ce5ede2ce9c899aa655ce20471d5106630fe19ff4470e1e8dc71d2e"
            },
            "downloads": -1,
            "filename": "ssc_codegen-0.4.3.tar.gz",
            "has_sig": false,
            "md5_digest": "f282e6e581c0a2b43b88a74ae209f332",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 56705,
            "upload_time": "2024-12-24T15:41:03",
            "upload_time_iso_8601": "2024-12-24T15:41:03.473821Z",
            "url": "https://files.pythonhosted.org/packages/15/9e/894564691632fe82c072711e383b39989ed6fe79f6c185e5bb14f6c5b402/ssc_codegen-0.4.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-24 15:41:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "vypivshiy",
    "github_project": "selector_schema_codegen#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "ssc-codegen"
}
        