ssc_codegen


| Field           | Value                                                      |
|-----------------|------------------------------------------------------------|
| Name            | ssc_codegen                                                |
| Version         | 0.7.1                                                      |
| Summary         | Python-dsl code converter to html parser for web scraping  |
| Upload time     | 2025-02-22 12:11:14                                        |
| Requires Python | >=3.10                                                     |
| Requirements    | No requirements were recorded.                             |

# Selector Schema codegen

## Introduction

ssc-gen is a Python-based DSL for writing HTML parsers in a dataclass style and converting them to a target language.

The project addresses the following problems:

- boilerplate code
- creating types (type annotations) and documentation
- simplifying code maintenance
- portability to other languages

## Supported converters

Currently supported converters:


| Language      | Library (html parser backend)                                | XPath Support | CSS Support | Generated types                          | Code formatter |
|---------------|--------------------------------------------------------------|---------------|-------------|------------------------------------------|----------------|
| Python (3.8+) | bs4                                                          | N             | Y           | TypedDict*, list, dict                   | ruff           |
| ...           | parsel                                                       | Y             | Y           | ...                                      | -              |
| ...           | selectolax (modest)                                          | N             | Y           | ...                                      | -              |
| ...           | scrapy (possibly use parsel - pass Response.selector object) | Y             | Y           | ...                                      | -              |
| Dart (3)      | universal_html                                               | N             | Y           | record, List, Map                        | dart format    |
| js (ES6)      | pure (firefox/chrome)                                        | Y             | Y           | Array, Map**                             | -              |
| go (1.10+)    | goquery                                                      | N             | Y           | struct (json tags included), array, map  | gofmt          |

- *this annotation type was deliberately chosen as a compromise.
Python has many ways of serialization: `dataclass, namedtuple, attrs, pydantic`
  - TypedDict behaves like a built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the required structure.
- **js has no built-in serialization methods
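
The adapter idea mentioned above can be sketched as follows. This is a minimal illustration that uses only the standard library; `PageResult`, `Page`, and `to_page` are hypothetical names chosen for the example and are not part of ssc-gen or of its generated code.

```python
from dataclasses import dataclass
from typing import List, TypedDict


class PageResult(TypedDict):
    """Hypothetical shape of a parser result (field names are illustrative)."""

    title: str
    links: List[str]


@dataclass(frozen=True)
class Page:
    """Structure the application wants to work with."""

    title: str
    links: List[str]


def to_page(raw: PageResult) -> Page:
    # A TypedDict is a plain dict at runtime, so it unpacks directly
    # into the dataclass constructor.
    return Page(**raw)


raw: PageResult = {"title": "Example", "links": ["/a", "/b"]}
page = to_page(raw)
print(page)
```

The same pattern should carry over to `namedtuple`, `attrs`, or `pydantic` models, since each of them accepts keyword arguments built from a dict.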

### Limitations

For maximum portability of the configuration to the target language:

- Use CSS selectors: they are guaranteed to be convertible to XPath
- Unlike JavaScript in browsers, most HTML parsing libraries implement only the [CSS3 selectors standard](https://www.w3.org/TR/selectors-3/)
  - basic selectors: (`tag`, `.class`, `#id`)
  - combinators: (`div p`, `ul > li`, `h2 +p`\[1])
  - attribute selectors: (`a[href]`, `input[type='text']`)\[2]
  - pseudo-classes: (`:nth-child(n)`, `:first-child`, `:last-child`)\[3]
  - **more complex, dynamic selectors are often not supported**: (`:has()`, `:nth-of-type()`, `:where()`, `:is()`)

1. Several libraries do not support the `+` combinator (e.g. [selectolax(modest)](https://github.com/rushter/selectolax), [dart.universal_html](https://pub.dev/packages/universal_html))
2. Web scraping libraries often do not support attribute operators such as `*=`, `~=`, `|=`, `^=` and `$=`
3. Several libraries do not support pseudo-classes (e.g. the standard [dart.html](https://dart.dev/libraries/dart-html) library lacks this feature).
This project will not implement converters for libraries with this drawback.
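
As a rough aid, the limitations above could be turned into a small pre-flight check for selectors. The sketch below is a heuristic built on regular expressions, not a CSS parser, and `portability_warnings` is a hypothetical helper that does not exist in ssc-gen.

```python
import re

# Pseudo-classes listed above as often unsupported.
COMPLEX_PSEUDO = re.compile(r":(?:has|nth-of-type|where|is)\(")
# Attribute operators listed above: *=, ~=, |=, ^=, $=
ATTR_OPERATOR = re.compile(r"\[[^\]]*(?:\*=|~=|\|=|\^=|\$=)")
# Used to drop "(...)" and "[...]" groups before looking for a "+" combinator,
# so that arguments such as "2n+1" are not reported.
GROUPS = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def portability_warnings(selector: str) -> list[str]:
    """Return the names of the potentially non-portable features in a selector."""
    warnings = []
    if COMPLEX_PSEUDO.search(selector):
        warnings.append("complex pseudo-class")
    if ATTR_OPERATOR.search(selector):
        warnings.append("attribute operator")
    if "+" in GROUPS.sub("", selector):
        warnings.append("adjacent sibling combinator")
    return warnings


for css in ("ul > li", "h2 +p", "a[href^='http']", "div:has(a)"):
    print(css, "->", portability_warnings(css))
```

An empty list only means that none of the patterns matched; it does not guarantee that a given backend supports the selector.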

## Getting started

ssc_gen requires Python 3.10 or higher.

### Install

pip:

```shell
pip install ssc_codegen
```

uv:

```shell
uv pip install ssc_codegen
```

as a CLI converter tool:

| package manager | command                       |
|-----------------|-------------------------------|
| pipx            | `pipx install ssc_codegen`    |
| uv              | `uv tool install ssc_codegen` |

## Example

### Create a file `schema.py` with:

```python
from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
```
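
To make the intent of this schema concrete, here is a hand-written sketch of roughly what it describes: the text of the first `title` element and the `href` attribute of every `a` element. It uses only the standard-library `html.parser` and is neither the code ssc-gen generates nor one of its backends; `HelloWorldSketch` and `parse` are illustrative names.

```python
from html.parser import HTMLParser


class HelloWorldSketch(HTMLParser):
    """Collects the first <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.a_hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            # Only the first <title> is recorded.
            self._in_title = True
            self.title = ""
        elif tag == "a":
            # A missing href yields None.
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html: str) -> dict:
    parser = HelloWorldSketch()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "a_hrefs": parser.a_hrefs}


result = parse(
    "<html><head><title>Demo</title></head>"
    '<body><a href="/a">A</a><a href="/b">B</a></body></html>'
)
print(result)
```

This is the kind of boilerplate that the two-line schema is meant to replace.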

### try it in the CLI

>[!note]
> this tool was developed for testing purposes, not for web scraping

### from file

>[!warning]
> DO NOT PASS CONFIGS FROM UNKNOWN SOURCES:
> 
> PYTHON CODE FROM CONFIGS IS COMPILED AT RUNTIME WITHOUT SECURITY CHECKS!

Download any HTML file and pass it as an argument:

```shell
ssc-gen parse-from-file index.html -t schema.py:HelloWorld  
```

Option descriptions:

- `-t --target` - the config schema file and the class the parser starts from

![out1](docs/assets/parse_from_file.gif)
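
The warning above follows from how a `file.py:ClassName` target has to be resolved: the file must be imported, and importing runs every top-level statement in it. The sketch below shows one generic way to do this with `importlib`. It is an assumption made for illustration, not ssc-gen's actual implementation, and `load_target` is a hypothetical helper.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_target(target: str):
    """Split 'path/to/file.py:ClassName' and import the class from that file."""
    file_name, _, class_name = target.rpartition(":")
    if not file_name or not class_name:
        raise ValueError(f"expected 'file.py:ClassName', got {target!r}")
    path = Path(file_name)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    # This executes ALL top-level code in the file, which is why
    # configs from unknown sources must never be loaded.
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Demo with a throwaway file standing in for a real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "schema_demo.py"
    demo.write_text("class HelloWorld:\n    marker = 42\n", encoding="utf-8")
    cls = load_target(f"{demo}:HelloWorld")

print(cls.__name__, cls.marker)
```

Any `print`, file access, or network call placed at module level in the config would run during `exec_module`, before a single selector is evaluated.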

### from url

```shell
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld  
```

![out1](docs/assets/parse_from_url.gif)

### from Chromium browser (CDP protocol)

```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
```

>[!note]
> if the script cannot find the Chrome executable, provide it manually:

```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
```


### Convert to code

Convert the schema to code for use in projects:

>[!note]
> this example uses js because the output can be quickly tested in the developer console


```shell
ssc-gen js schema.py -o .
```

The generated code looks like this (formatted by an IDE):

```javascript
// autogenerated by ssc-gen DO NOT_EDIT
/**
 *
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }
 */
class HelloWorld {
    constructor(doc) {
        if (typeof doc === 'string') {
            this._doc = new DOMParser().parseFromString(doc, 'text/html');
        } else if (doc instanceof Document || doc instanceof Element) {
            this._doc = doc;
        } else {
            throw new Error("Invalid input: Expected a Document, Element, or string");
        }
    }

    _parseTitle(value) {
        let value1 = value.querySelector('title');
        return typeof value1.textContent === "undefined" ? value1.documentElement.textContent : value1.textContent;
    }

    _parseAHrefs(value) {
        let value1 = Array.from(value.querySelectorAll('a'));
        return value1.map(e => e.getAttribute('href'));
    }

    parse() {
        return {title: this._parseTitle(this._doc), a_hrefs: this._parseAHrefs(this._doc)};
    }
}
```

### copy the code output and paste it into the developer console:

Print output:

```javascript
alert(JSON.stringify((new HelloWorld(document).parse())))
```

![example](assets/example.png)


You can use any html source:

- read from html file
- get from http request
- get from browser (playwright, selenium, chrome-cdp)
- paste code to developer console (js)
- or call curl in shell and parse stdin
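
A few of these sources can be handled by one small loader. The sketch below is an assumption-laden illustration that relies only on the standard library; `load_html` is a hypothetical helper, and how the resulting string is passed to a generated Python parser depends on the chosen backend, which this README does not specify.

```python
import sys
import tempfile
import urllib.request
from pathlib import Path


def load_html(source: str) -> str:
    """Read HTML from stdin ('-'), an http(s) URL, or a local file path."""
    if source == "-":
        # e.g. output piped from curl
        return sys.stdin.read()
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8")


# Demo with a temporary local file; the URL and stdin branches work the same way.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_text("<title>Demo</title>", encoding="utf-8")
    html = load_html(str(page))

print(html)
```

With a script built around this helper, calling it with `-` as the source would read whatever a shell pipeline writes to its standard input.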


## See also

- [Brief](docs/brief.md) on CSS selectors and regular expressions
- [Tutorial](docs/tutorial.md) on how to use ssc-gen
- [Reference](docs/reference.md) for the high-level API
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ssc_codegen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/dd/8d/a84f6f7337cea0d9c2d422ff1d231537c18bfc87160749778fdba6979d75/ssc_codegen-0.7.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Python-dsl code converter to html parser for web scraping ",
    "version": "0.7.1",
    "project_urls": {
        "Documentation": "https://github.com/vypivshiy/selector_schema_codegen#readme",
        "Examples": "https://github.com/vypivshiy/selector_schema_codegen/examples",
        "Issues": "https://github.com/vypivshiy/selector_schema_codegen/issues",
        "Source": "https://github.com/vypivshiy/selector_schema_codegen"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "14bce6146bcb0728e3cc5a006a6541a501215682039cec6a54d0a42604f51096",
                "md5": "0352b6852411df4bfa471a34f192f345",
                "sha256": "3d5f9694a5d670e1d288c604c6919d4fbf9b98c63b11218162d290c5098dad4d"
            },
            "downloads": -1,
            "filename": "ssc_codegen-0.7.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0352b6852411df4bfa471a34f192f345",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 76815,
            "upload_time": "2025-02-22T12:11:12",
            "upload_time_iso_8601": "2025-02-22T12:11:12.799345Z",
            "url": "https://files.pythonhosted.org/packages/14/bc/e6146bcb0728e3cc5a006a6541a501215682039cec6a54d0a42604f51096/ssc_codegen-0.7.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dd8da84f6f7337cea0d9c2d422ff1d231537c18bfc87160749778fdba6979d75",
                "md5": "883bbfdb5d3b59c0bd231354ae22cff4",
                "sha256": "e8d4939072fa563c13d67e0c2e38ab875e07e03751dd5a64d70fe9eeace41542"
            },
            "downloads": -1,
            "filename": "ssc_codegen-0.7.1.tar.gz",
            "has_sig": false,
            "md5_digest": "883bbfdb5d3b59c0bd231354ae22cff4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 56313,
            "upload_time": "2025-02-22T12:11:14",
            "upload_time_iso_8601": "2025-02-22T12:11:14.971356Z",
            "url": "https://files.pythonhosted.org/packages/dd/8d/a84f6f7337cea0d9c2d422ff1d231537c18bfc87160749778fdba6979d75/ssc_codegen-0.7.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-22 12:11:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "vypivshiy",
    "github_project": "selector_schema_codegen#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "ssc_codegen"
}
        