ssc_codegen


- Name: ssc_codegen
- Version: 0.10.1
- Summary: Python-dsl code converter to html parser for web scraping
- Upload time: 2025-09-16 05:19:35
- Requires Python: >=3.10

# Selector Schema codegen

## Introduction

ssc-gen is a Python-based DSL for describing parsers for HTML documents. A schema written in this DSL is translated into a standalone parsing module.

### For a better experience using this library, you should know:

- CSS selectors (at least the CSS3 standard) and XPath
- regular expressions (PCRE)

### Problems this project solves:

- designed for parsers of SSR (server-side rendered) HTML pages, **NOT FOR REST API OR GRAPHQL ENDPOINTS**
- reduces boilerplate code
- generates modules that are independent of this project and can be reused.
- generates docstring documentation and the signature of the parser output.
- for a better IDE experience, generates typedefs and type annotations (if the target programming language supports them).
- supports annotation and parsing of JSON-like strings from a document
- provides an AST codegen API for developing new converters

## Supported converters

Currently supported converters:

| Language                                 | HTML parser lib + dependencies         | XPath | CSS3 | CSS4 | Generated annotations, types, structs | formatter dependency |
| ---------------------------------------- | -------------------------------------- | ----- | ---- | ---- | ------------------------------------- | -------------------- |
| Python (3.8+)                            | bs4, lxml                              | N     | Y    | Y    | TypedDict`1`, list, dict              | ruff                 |
| ...                                      | parsel                                 | Y     | Y    | N    | ...                                   | ...                  |
| ...                                      | selectolax (lexbor)                    | N     | Y    | N    | ...                                   | ...                  |
| ...                                      | lxml                                   | Y     | Y    | N    | ...                                   | ...                  |
| js (ES6)`2`                              | pure (firefox/chrome extension/nodejs) | Y     | Y    | Y    | Array, Map`3`                         | prettier             |
| go (1.10+) **(UNSTABLE)**                | goquery, gjson (`4`)                   | N     | Y    | N    | struct(+json anchors), array, map     | gofmt                |
| lua (5.2+), luajit(2+) **(UNSTABLE)**`5` | lua-htmlparser, lrexlib(opt), dkjson   | N     | Y    | N    | EmmyLua                               | LuaFormatter         |

- **CSS3** means support for the following selectors:
  - basic: (`tag`, `.class`, `#id`, `tag1,tag2`)
  - combined: (`div p`, `ul > li`, `h2 +p`, `title ~head`)
  - attribute: (`a[href]`, `input[type='text']`, `a[href*='...']`, ...)
  - CSS3 pseudo classes: (`:nth-child(n)`, `:first-child`, `:last-child`)
- **CSS4** means support for the following selectors:
  - `:nth-of-type()`, `:where()`, `:is()`, `:not()` etc
- `1` This annotation type was deliberately chosen as a compromise:
  Python has many serialization options: `namedtuple, dataclass, attrs, pydantic, msgspec, etc`
  - TypedDict behaves like a built-in dict, but with IDE and linter hint support, and you can easily implement an adapter for the structure you need.
- `2` ES2018 (ES9) or later is required if you need the PCRE `re.S | re.DOTALL` flag (the JavaScript `s` regex flag)
- `3` The js converter does not generate built-in serialization methods; it uses the standard Array and Map types. Refer to the signature documentation!
- `4` golang has not been tested much; there may be issues
- **formatter dependency** - an optional dependency used to prettify the generated code and fix its code style

- `5` lua
  - Experimental research PoC; performance and stability are not guaranteed
  - Priority is given to generating pure Lua without C library dependencies, using [mva/htmlparser](https://luarocks.org/modules/mva/htmlparser) and [dhkolf/dkjson](https://luarocks.org/modules/dhkolf/dkjson)
  - Translates unsupported CSS3 selectors into equivalent function calls:
    - for example, `div +p` is equivalent to `CssExt.combine_plus(root:select("div"), "p")`
  - Translates PCRE regex to [string pattern matching](https://www.lua.org/manual/5.4/manual.html#6.4.1) (with restrictions); for more information see [lua_re_compat.py](ssc_codegen/converters/templates/lua_re_compat.py)
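
Footnote `1` says a TypedDict result is easy to adapt to another structure. A minimal sketch of such an adapter follows, assuming a generated TypedDict shaped like the output of the `HelloWorld` example (a `title` string and a list of `a_hrefs`). The names `HelloWorldDict` and `HelloWorldItem` are hypothetical and are not names produced by ssc-gen:

```python
from dataclasses import dataclass
from typing import List, TypedDict


class HelloWorldDict(TypedDict):
    # Hypothetical stand-in for a generated TypedDict; the real generated name may differ.
    title: str
    a_hrefs: List[str]


@dataclass(frozen=True)
class HelloWorldItem:
    """Project-side structure that the parsed dict is adapted to."""

    title: str
    a_hrefs: List[str]

    @classmethod
    def from_parsed(cls, data: HelloWorldDict) -> "HelloWorldItem":
        # Copy the list so the dataclass does not share state with the parser output.
        return cls(title=data["title"], a_hrefs=list(data["a_hrefs"]))


# Example input in the shape a parser would return.
parsed: HelloWorldDict = {
    "title": "Example Domain",
    "a_hrefs": ["https://www.iana.org/domains/example"],
}
item = HelloWorldItem.from_parsed(parsed)
```

The same pattern should carry over to `attrs`, `pydantic`, or `msgspec` models: only `from_parsed` changes, the generated module stays untouched.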

### Limitations

For maximum portability of the configuration to the target language:

- If possible, use CSS selectors: they are guaranteed to be converted to XPath
- Unlike JavaScript, most HTML parser libs implement the [CSS3 selectors standard](https://www.w3.org/TR/selectors-3/), and they may not implement it fully!
  Check the HTML parser lib's documentation about CSS selectors before writing code. Examples:
  1. Several libs do not support the `+` combinator (e.g. [selectolax(modest)](https://github.com/rushter/selectolax), [dart.universal_html](https://pub.dev/packages/universal_html)).
     For research purposes, lua_htmlparser includes a converter for unsupported CSS3 query syntax.
  2. HTML parser libs may not support the attribute selectors `*=`, `~=`, `|=`, `^=`, `$=`
  3. Several libs do not support pseudo-classes (e.g. the standard [dart.html](https://dart.dev/libraries/dart-html) lib lacks this feature).
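
When a parser lib lacks the `+` combinator, the same selection can be emulated with plain tree traversal. The sketch below only illustrates the idea and is not how ssc-gen or any converter implements it. It uses the standard-library `xml.etree.ElementTree` on a well-formed snippet as a stand-in for an HTML tree, and `adjacent_siblings` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET
from typing import List


def adjacent_siblings(root: ET.Element, left_tag: str, right_tag: str) -> List[ET.Element]:
    """Emulate the CSS selector `left_tag + right_tag` by walking the tree."""
    matches = []
    for parent in root.iter():
        children = list(parent)  # element children only, in document order
        # Compare every child with its immediate next sibling.
        for prev, nxt in zip(children, children[1:]):
            if prev.tag == left_tag and nxt.tag == right_tag:
                matches.append(nxt)
    return matches


doc = ET.fromstring(
    "<body><div>a</div><p>first</p><p>second</p><div>b</div><span/><p>third</p></body>"
)
# Only the <p> directly after a <div> matches `div + p`.
result = [el.text for el in adjacent_siblings(doc, "div", "p")]
```

Here `result` contains only `"first"`: the second `<p>` follows another `<p>`, and the third follows a `<span>`.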

## Getting started

ssc-gen requires Python 3.10 or higher.

### Install

pip:

```shell
pip install ssc_codegen
```

uv:

```shell
uv pip install ssc_codegen
```

## Example

### Create a file `schema.py` with:

```python
from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
```

### Try it in the CLI

> [!note]
> These tools are developed for testing purposes, not for web-scraping tasks

### Eval from file

Download any HTML file and pass it as an argument:

```shell
ssc-gen parse-from-file index.html -t schema.py:HelloWorld
```

Short option descriptions:

- `-t --target` - the schema config file and the class from which to start parsing

![out1](docs/assets/parse_from_file.gif)

### Send a GET request to a URL and parse the response

```shell
ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld
```

![out1](docs/assets/parse_from_url.gif)

### Send a request via the Chromium browser (CDP protocol)

```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld
```

> [!note]
> If the script cannot find the Chrome executable, provide it manually:

```shell
ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium
```

### Convert to code

Convert to code for use in projects:

> [!note]
> This example uses js because it can be tested quickly in the browser developer console

```shell
ssc-gen js schema.py -o .
```

Code output looks like this:

```javascript
// autogenerated by ssc-gen DO NOT_EDIT
/***
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }*/
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }

  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }

  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }

  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}
```

### Copy the code output and paste it into the developer console:

Print output:

```javascript
alert(JSON.stringify(new HelloWorld(document).parse()));
```

![example](docs/assets/example.png)

You can use any html source:

- parse from html files
- parse from http responses
- parse from browsers: playwright, selenium, chrome-cdp, etc.
- call curl in shell and parse STDIN
- use in STDIN pipelines with third-party tools like [projectdiscovery/httpx](https://github.com/projectdiscovery/httpx)
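
The STDIN options can be sketched as a small script that reads HTML from a text stream and writes the parsed result as JSON. This is an assumption-laden illustration: `HelloWorldStandIn` is a hand-written stand-in built on the standard-library `html.parser`, not code generated by ssc-gen, and `parse_html` and `run` are hypothetical helper names. In a real project, a generated parser class would replace the stand-in:

```python
import io
import json
from html.parser import HTMLParser
from typing import TextIO


class HelloWorldStandIn(HTMLParser):
    """Hand-written stand-in for a generated parser: collects <title> text and <a href> values."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""
        self.a_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            # Unlike the generated JS above, links without href are skipped here.
            if href is not None:
                self.a_hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html(html: str) -> dict:
    parser = HelloWorldStandIn()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "a_hrefs": parser.a_hrefs}


def run(source: TextIO, sink: TextIO) -> None:
    """Read a whole HTML document from `source` and write one JSON line to `sink`."""
    sink.write(json.dumps(parse_html(source.read())) + "\n")


# In-memory demonstration; in a pipeline the streams would be sys.stdin and sys.stdout.
sample = '<html><head><title>Hi</title></head><body><a href="/x">x</a><a>no href</a></body></html>'
out = io.StringIO()
run(io.StringIO(sample), out)
```

Calling `run(sys.stdin, sys.stdout)` from a script lets it sit at the end of a pipeline such as `curl -s https://example.com | python parse_stdin.py`, where `parse_stdin.py` is a hypothetical file name.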

## See also

- [Brief](docs/briefing.md) about css selectors and regular expressions.
- [Quickstart](docs/quickstart.md) about css selectors and regular expressions.
- [Tutorial](docs/tutorial.md) basic usage of ssc-gen
- [AST reference](docs/ast_reference.md) about generation code from AST

            

## Project links

- [Documentation](https://github.com/vypivshiy/selector_schema_codegen#readme)
- [Examples](https://github.com/vypivshiy/selector_schema_codegen/examples)
- [Issues](https://github.com/vypivshiy/selector_schema_codegen/issues)
- [Source](https://github.com/vypivshiy/selector_schema_codegen)